Counting 2,784 Big Data & Machine Learning Frameworks, Toolsets, and Examples...

LightGBM

5583

A fast, distributed, high-performance gradient boosting (GBDT, GBRT, GBM, or MART) framework based on decision tree algorithms

PyTorch

15835

A Python package that provides Tensor computation (like NumPy) with strong GPU acceleration and Deep Neural Networks built on a tape-based autograd system

kylo

421

A data lake management software platform and framework for enabling scalable enterprise-class data lakes on Apache Hadoop and Spark

vespa

2502

An engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time.

ParlAI

3292

A framework for training and evaluating AI models on a variety of openly available dialog datasets.

Paddle

6967

PArallel Distributed Deep LEarning

ClickHouse

4087

ClickHouse is a free analytic DBMS for big data.

beringei

2684

Beringei is a high-performance, in-memory storage engine for time series data.

drill

965

Apache Drill is a distributed MPP query layer that supports SQL and alternative query languages against NoSQL and Hadoop data storage systems

mapd-core

1450

The MapD Core database

pai

234

A platform for cluster management and resource scheduling for AI that incorporates a mature design with a proven track record in Microsoft's large-scale production environment

ranger

168

Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform

pilosa

1355

An open source, distributed bitmap index that dramatically accelerates queries across multiple, massive data sets.

hue

2870

Let’s Big Data. Hue is an open source Web interface for analyzing data with Hadoop and Spark.

elasticsearch

31240

Open Source, Distributed, RESTful Search Engine

pachyderm

2803

Containerized Data Analytics

norikra

348

Schemaless Stream Processing (Complex Event Processing) Server with SQL

onyx

1740

Distributed, masterless, high-performance, fault-tolerant data processing

infinispan

657

Infinispan is an open source data grid platform and highly scalable NoSQL cloud data store.

incubator-airflow

8190

Airflow is a platform to programmatically author, schedule and monitor workflows

lucene-solr

1655

Apache Solr is a search engine server that uses Apache Lucene

bfs

2301

The Baidu File System.

hadoop

7184

Apache Hadoop is a framework for running applications on large clusters built of commodity hardware

oryx

1424

Oryx 2: Lambda architecture on Apache Spark and Apache Kafka for real-time, large-scale machine learning

mesos

3711

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks

luigi

9313

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, etc. It also comes with Hadoop support built in.

ignite

1715

The Apache Ignite In-Memory Data Fabric is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real time, orders of magnitude faster than possible with traditional disk-based or flash technologies.

vitess

6028

Vitess is a database clustering system for horizontal scaling of MySQL.

zookeeper

4455

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services

pinot

1847

A real-time distributed OLAP datastore