Counting 2,653 Big Data & Machine Learning Frameworks, Toolsets, and Examples...
Suggestion? Feedback? Tweet @stkim1

elasticsearch

30474

Open Source, Distributed, RESTful Search Engine

pachyderm

2727

Containerized Data Analytics

gobblin

1274

Universal data ingestion framework for Hadoop.

siddhi

408

Siddhi CEP is a lightweight, easy-to-use Open Source Complex Event Processing Engine (CEP) under Apache Software License v2.0

incubator-beam

1837

Apache Beam is a unified model for defining both batch and streaming data-parallel processing pipelines

NNPACK

1070

Acceleration package for neural networks on multi-core CPUs

vitess

5892

Vitess is a database clustering system for horizontal scaling of MySQL.

presto

7413

Distributed SQL query engine for big data

hue

2812

Let’s Big Data. Hue is an open source Web interface for analyzing data with Hadoop and Spark.

pinot

1816

A realtime distributed OLAP datastore

zookeeper

4300

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services

ignite

1652

The Apache Ignite In-Memory Data Fabric is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash technologies.

caffe

23859

Caffe: a fast open framework for deep learning.

storm

4995

Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation

hive

1836

The Apache Hive (TM) data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL

flink

3594

Apache Flink is an open source stream processing framework with powerful stream- and batch-processing capabilities

incubator-airflow

7799

Airflow is a platform to programmatically author, schedule and monitor workflows

hadoop

6790

Apache Hadoop is a framework for running applications on large clusters built of commodity hardware

spark-py-notebooks

724

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

druid

6400

Column-oriented distributed data store ideal for powering interactive applications

PyTables

714

A Python package to manage extremely large amounts of data

spark

17057

Spark is a fast and general cluster computing system for Big Data

hbase

1909

Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable

spark-testing-base

720

Base classes to use when writing tests with Spark

CNTK

14300

Microsoft Cognitive Toolkit (CNTK)

mesos

3658

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks

kafka

8090

Kafka™ is used for building real-time data pipelines and streaming apps

ML-From-Scratch

8152

Bare bones Python implementations of some of the foundational Machine Learning models and algorithms.

Theano

8160

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation.

cassandra

4356

Apache Cassandra is a highly scalable partitioned row store. Rows are organized into tables with a required primary key