Counting 2,019 Big Data & Machine Learning Frameworks, Toolsets, and Examples...
Suggestion? Feedback? Tweet @stkim1

Author
Last Commit
Nov. 17, 2017
Created
Dec. 31, 2016

BICO (BIrch using COresets)

This is a python implementation of the original BICO code.

BICO is a fast streaming algorithm for the k-means problem. BICO reduces the number of input points to a specified amount such that the k-means solution on the compressed input set is still a high quality solution to the original data set. BICO combines a tree-like data structure of SIGMOND Test of Time Award winning algorithm BIRCH with reduction techniques from clustering theory.


Applications

Massive Data Sets

Running k-means on a very large data set is even nowadays infeasible. BICO can reduce the amount of input points drastically such that running k-means on the resulting reduced data set is just a matter of seconds. BICO ensures that the solution is of similar quality as the solution on the original input set.

Compression

BICO reduces a set of input points to a specified number of points while preserving the k-means solution quality as good as possible. Thus, BICO is a perfect tool to shrink the size of the data in order to reduce space or bandwidth.

Getting Started

Have a look at this example for an introduction on how to use this library and some visualizations presenting the results of the computed data reduction.

Slides showing a more technical overview of BICO can be found here. If you are interested in the theoretical point of view of BICO, please feel free to check Section 5.4 of this thesis which contains a very detailed description of the algorithm including all proofs of theoretical guarantees.