Counting 3,541 Big Data & Machine Learning Frameworks, Toolsets, and Examples...
Suggestion? Feedback? Tweet @stkim1

cassandra

4965

Apache Cassandra is a highly scalable partitioned row store. Rows are organized into tables with a required primary key.

nupic

5819

Numenta Platform for Intelligent Computing is an implementation of Hierarchical Temporal Memory (HTM), a theory of intelligence based strictly on the neuroscience of the neocortex.

kafka-monitor

1126

Kafka Monitor is a framework to implement and execute long-running Kafka system tests in a real cluster.

vespa

2726

An engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time.

infinispan

708

Infinispan is an open source data grid platform and highly scalable NoSQL cloud data store.

geode

1467

Apache Geode is a data management platform that provides real-time, consistent access to data-intensive applications throughout widely distributed cloud architectures.

universe

6990

Universe: a software platform for measuring and training an AI's general intelligence across the world's supply of games, websites and other applications.

crate

2319

A distributed SQL database that makes it simple to store and analyze massive amounts of machine data in real time.

incubator-edgent

137

Apache Edgent is an open source stream processing programming model and lightweight micro-kernel style runtime for edge devices that enables you to analyze data and events at the device.

pilosa

1588

An open source, distributed bitmap index that dramatically accelerates queries across multiple, massive data sets.

root

746

A modular scientific software framework. It provides all the functionality needed for big data processing, statistical analysis, visualisation and storage. It is mainly written in C++ but integrates with other languages such as Python and R.

uima-ducc

5

DUCC is a cluster management system providing tooling, management, and scheduling facilities to automate the scale-out of applications written to the UIMA framework.

bookkeeper

587

A scalable, fault tolerant and low latency storage service optimized for append-only workloads.

ignite

2306

The Apache Ignite In-Memory Data Fabric is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real time, orders of magnitude faster than is possible with traditional disk-based or flash technologies.

pig

570

Pig is a dataflow programming environment for processing very large files.

gridgain

2

GridGain’s In-Memory Data Fabric is designed to deliver uncompromised performance for the widest set of in-memory computing use cases.

calcite

1037

Apache Calcite is a dynamic data management framework.

Lasagne

3580

Lightweight library to build and train neural networks in Theano

pinot

2137

A real-time distributed OLAP datastore

curator

1422

Curator is a set of Java libraries that make using Apache ZooKeeper much easier.

genie

1074

Federated Big Data Orchestration Service

parquet-mr

765

Parquet-MR contains the Java implementation of the Parquet format.

drill

1113

Apache Drill is a distributed MPP query layer that supports SQL and alternative query languages against NoSQL and Hadoop data storage systems.

gearpump

684

Lightweight real-time big data streaming engine over Akka

gobblin

1489

Universal data ingestion framework for Hadoop.

oozie

455

Oozie is an extensible, scalable and reliable system to define, manage, schedule, and execute complex Hadoop workloads via web services.

alluxio

3887

Alluxio (formerly Tachyon): a virtual distributed storage system at memory speed

ranger

214

Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.

beringei

2816

Beringei is a high performance, in-memory storage engine for time series data.

snappydata

841

SnappyData: OLTP + OLAP Database built on Apache Spark