# SparkTDA

The scalable topological data analysis package for Apache Spark. This project aims to implement the following features:

- Scalable Mapper Implemented as Reeb Diagrams, i.e., Reeb Cosheaves
- Scalable Mapper Implementation
- Scalable Multiscale Mapper Implementation
- Scalable Tower Computation for Multiscale Mapper
- Scalable Persistent Homology Computation on Top of Apache Spark

If you would like to know how to use and/or learn more the implementation details of the above mentioned features, please follow the links.

# Status

**WIP** and **EXPERIMENTAL**. This package is still a proof-of-concept of scalable topological data analysis support for
Apache Spark, hence you cannot expect that this package is ready for production use.

# Examples

### Mapper

# Requirements

This library requires Spark 2.0+

# Building and Running Unit Tests

To compile this project, run `sbt package`

from the project home directory. This will also run the Scala unit tests.
To run the unit tests, run `sbt test`

from the project home directory. This project uses the
sbt-spark-package plugin, which provides the 'spPublish' and
'spPublishLocal' task. We recommend users to use this library with Apache Spark including the dependencies by
supplying a comma-delimited list of Maven coordinates with `--packages`

and download the package from the locally
repository or official Spark Packages repository.

### The package can be published locally with:

`$ sbt spPublishLocal`

### Spark Packages with (requires authentication and authorization):

The package can be published to`$ sbt spPublish`

# Using with Spark Shell

This package can be added to Spark using the `--packages`

command line option. For example, to include it when starting
the spark shell:

`$ spark-shell --packages ognis1205:spark-tda:0.0.1-SNAPSHOT-spark2.2-s_2.11`

# Future Works

### Mapper

- Write Wiki
- Implement Python APIs
- Publish to Spark Packages
- Benchmark
- Consider using GraphFrames instead of plain GraphX
- Implement some useful filter functions, e.g., Gaussian Density, Graph Laplacian, etc as transformers

# Related Softwares & Projects

# References

### Mapper

- G. Singh, F. Memoli, G. Carlsson (2007). Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition, Point Based Graphics 2007, Prague, September 2007.
- J. Curry (2013). Sheaves, Cosheaves and Applications, arXiv 2013
- T. K. Dey, F. Memoli, Y. Wang (2015), Mutiscale Mapper: A Framework for Topological Summarization of Data and Maps, arXiv 2015
- E. Munch, B. Wang (2015). Convergence between Categorical Representations of Reeb Space and Mapper, arXiv 2015
- E. Munch, B. Wang (2015). Reeb Space Approximation with Guarantees, The 25th Fall Workshop on Computational Geometry 2015.
- H. E. Kim (2015). Evaluating Ayasdi's Topological Data Analysis for Big Data, Master Thesis, Goethe University Frankfurt 2015.

### KNN/ANN/SNN

- L. Ting, et al (2004). An investigation of practical approximate nearest neighbor algorithms, Advances in neural information processing systems. 2004.
- L. Ting, C. Rosenberg, H. Rowley (2007). Clustering billions of images with large scale nearest neighbor search. Applications of Computer Vision, 2007. WACV'07. IEEE Workshop on. IEEE, 2007.
- D. Ravichandran, P. Pantel, E. Hovy (2005). Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering, ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics pp 622-629
- M. Steinbach, L. Ertoez, V. Kumar (2004). The Challenges of Clustering High Dimensional Data, New Directions in Statistical Physics, pp 273-309
- L. Ertoez, M. Steinbach, Vipin Kumar (2003). Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data, Proceedings of the Third SIAM International Conference on Data Mining, 2003.
- M. E. Houle, H. P. Kriegel, P. Kroeger, E. S. A. Zimek (2010). Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?, Proceedings of the 22nd International Conference on Scientific and Statistical Database Management, 2010.