TL;DR: Examples of and differences between various Spark APIs
The complete, runnable code, with output, is available here: http://goo.gl/EdrCUo
I realized that most people who join our company and are new to the Spark ecosystem are overwhelmed by the different sets of APIs it offers! Most of their questions, whether they needed a human to answer them or sat waiting on StackOverflow, were about porting one API call to another, the differences between the APIs, choosing the most optimized approach, and how to use each one.
I made this sample project to explain most of it.
In the fictional town of Irvin, there are all kinds of people: couples, singles, folks in long-distance relationships, gay couples, people in open relationships, and poly-marriages. At its biggest employer, Notox, there’s widespread nepotism and gender imbalance!
Here’s an audit using:
- RDD APIs
- DataFrame APIs
- Dataset APIs
- Spark SQL