Project Page: http://goo.gl/EdrCUo
Last Commit: Mar. 16, 2017
Created: Feb. 22, 2017

RDD-DF-DS-SSQL

TL;DR: Examples of, and differences between, the various Spark APIs

The complete, runnable code, with output, is available here: http://goo.gl/EdrCUo

(https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6374798392727515/2002375612871426/4076179716382534/latest.html)

Details:

I realized that most people who join our company and are new to the Spark ecosystem are overwhelmed by the different sets of APIs it offers. Most of their questions, whether they needed a human to answer or sat waiting on StackOverflow, were about porting a call from one API to another, the differences between the APIs, choosing the most optimized approach, and how to use them.

I made this sample project to explain most of it.

In the fictional town of Irvin there are all kinds of people: couples, singles, folks in long-distance relationships, gay couples, open relationships, and poly-marriages. At its biggest employer, Notox, there’s widespread nepotism and gender imbalance!

Here’s an audit using:

  1. RDD APIs
  2. DataFrame APIs
  3. Dataset APIs
  4. Spark SQL
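
To give a feel for how the four APIs differ, here is a minimal sketch that answers one audit question (how many Notox employees there are of each gender) four ways. It is not taken from the project's notebook: the `Employee` case class, its field names, and the sample rows are all assumptions made up for illustration, and it assumes Spark 2.x running in local mode.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; the field names are assumptions, not the project's schema.
case class Employee(name: String, gender: String, employer: String)

object ApiComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-df-ds-ssql-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows, not data from the project.
    val people = Seq(
      Employee("Ann", "F", "Notox"),
      Employee("Bob", "M", "Notox"),
      Employee("Cid", "M", "Notox"),
      Employee("Dee", "F", "Other")
    )

    // 1. RDD API: plain Scala functions over objects; you spell out each step.
    val rddCounts = spark.sparkContext.parallelize(people)
      .filter(_.employer == "Notox")
      .map(e => (e.gender, 1L))
      .reduceByKey(_ + _)
      .collect()
      .toMap

    // 2. DataFrame API: untyped rows, columns referenced by name.
    val df = people.toDF()
    val dfCounts = df.filter($"employer" === "Notox")
      .groupBy("gender")
      .count()

    // 3. Dataset API: typed objects, so field access is checked at compile time.
    val ds = people.toDS()
    val dsCounts = ds.filter(_.employer == "Notox")
      .groupByKey(_.gender)
      .count()

    // 4. Spark SQL: the same query as a SQL string over a temporary view.
    df.createOrReplaceTempView("people")
    val sqlCounts = spark.sql(
      "SELECT gender, COUNT(*) AS cnt FROM people WHERE employer = 'Notox' GROUP BY gender")

    println(rddCounts)
    dfCounts.show()
    dsCounts.show()
    sqlCounts.show()

    spark.stop()
  }
}
```

All four compute the same counts. The DataFrame, Dataset, and SQL versions are planned by Spark's Catalyst optimizer, while the RDD version runs exactly the steps as written.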
