
Last Commit: Dec. 10, 2017
Created: Oct. 10, 2017


Metorikku is a library that simplifies writing and executing ELTs on top of Apache Spark. A user writes a simple JSON configuration file that includes SQL queries and runs Metorikku on a Spark cluster. The platform also includes MetorikkuTester, a way to write tests for metrics.

Getting started

To run Metorikku you must first define two files.

MQL file

An MQL (Metorikku Query Language) file defines the steps and queries of the ELT, as well as what to output and where.

For example, a simple JSON configuration looks like this:

{
    "steps": [
        {
            "sql": "SELECT * from input_1 where id > 100",
            "dataFrameName": "df1"
        },
        {
            "sql": "SELECT * from df1 where id < 1000",
            "dataFrameName": "df2"   
        }
    ],
    "output": [
        {
            "dataFrameName": "df2",
            "outputType": "Parquet",
            "outputOptions":
            {
                "saveMode": "Overwrite",
                "path": "df2.parquet"
            }
        }
    ]
}
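Such a configuration is easy to sanity-check before submitting a job. The following standalone Python sketch is not part of Metorikku; its validation rules (every step needs sql and dataFrameName, every output must reference a data frame produced by a step) are assumptions inferred from the example above:

```python
import json

# Minimal structural check for an MQL-style configuration.
# These rules are inferred from the example config, not from Metorikku itself.
def check_mql(text):
    conf = json.loads(text)
    defined = set()
    for step in conf.get("steps", []):
        assert "sql" in step, "each step needs a 'sql' query"
        assert "dataFrameName" in step, "each step needs a 'dataFrameName'"
        defined.add(step["dataFrameName"])
    for out in conf.get("output", []):
        name = out["dataFrameName"]
        assert name in defined, f"output references unknown data frame: {name}"
    return defined

mql = """
{
    "steps": [
        {"sql": "SELECT * from input_1 where id > 100", "dataFrameName": "df1"},
        {"sql": "SELECT * from df1 where id < 1000", "dataFrameName": "df2"}
    ],
    "output": [
        {"dataFrameName": "df2", "outputType": "Parquet",
         "outputOptions": {"saveMode": "Overwrite", "path": "df2.parquet"}}
    ]
}
"""
print(check_mql(mql))  # the set {'df1', 'df2'} (print order may vary)
```

Note how the second step queries df1, the data frame registered by the first step; steps can freely build on each other this way.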

Take a look at the examples file for further configuration examples.

Run configuration file

Metorikku uses a YAML file to describe the run configuration. This file includes the input sources, output destinations, and the location of the metric configuration files.

For example, a simple config.yaml file looks like this:

metrics:
  - /full/path/to/your/MQL/file.json
inputs:
  input_1: parquet/input_1.parquet
  input_2: parquet/input_2.parquet
output:
  file:
    dir: /path/to/parquet/output

You can check out the sample YAML configuration file for a full example with all possible values.
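The run configuration maps naturally onto a plain dictionary once parsed. The sketch below is plain Python, not Metorikku code; the required-keys rule is an assumption based on the fields shown above, not an official schema:

```python
# The run configuration above, expressed as a plain dict
# (equivalent to what a YAML parser would produce).
run_config = {
    "metrics": ["/full/path/to/your/MQL/file.json"],
    "inputs": {
        "input_1": "parquet/input_1.parquet",
        "input_2": "parquet/input_2.parquet",
    },
    "output": {"file": {"dir": "/path/to/parquet/output"}},
}

# Required top-level sections -- inferred from the example, not an official schema.
def check_run_config(conf):
    missing = [k for k in ("metrics", "inputs", "output") if k not in conf]
    assert not missing, f"missing sections: {missing}"
    assert isinstance(conf["metrics"], list) and conf["metrics"], \
        "metrics must list at least one MQL file"
    assert isinstance(conf["inputs"], dict), "inputs maps table names to paths"
    return True

print(check_run_config(run_config))  # True
```

The keys of inputs (input_1, input_2) are the table names the MQL queries refer to in their FROM clauses.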

Supported input/output:

Currently Metorikku supports the following inputs: CSV, JSON, and Parquet.

And the following outputs: CSV, JSON, Parquet, Redshift, Cassandra, and Segment.
Redshift: the s3_access_key and s3_secret options can be passed via spark-submit.

Running Metorikku

There are currently three ways to run Metorikku.

Run on a Spark cluster

Running on a cluster requires Apache Spark v2.2+.

  • Download the latest released JAR
  • Run the following command: spark-submit --class com.yotpo.metorikku.Metorikku metorikku.jar -c config.yaml
Run locally

Metorikku is also released as a standalone JAR that bundles Spark.

  • Download the latest released standalone JAR
  • Run the following command: java -Dspark.master=local[*] -cp metorikku-standalone.jar com.yotpo.metorikku.Metorikku -c config.yaml
Run as a library

It's also possible to use Metorikku inside your own software. The Metorikku library requires Scala 2.11.

  • Add the following dependency to your build.sbt: "com.yotpo" % "metorikku" % "0.0.1"
  • Start Metorikku by creating an instance of com.yotpo.metorikku.config and running com.yotpo.metorikku.Metorikku.execute(config)

Metorikku Tester

To test and fully automate the deployment of MQL (Metorikku Query Language) files, Metorikku provides a way to run tests against them.

A test consists of two files:

Test settings

This file defines what to test and where to get the mocked data from. For example, a simple test_settings.json file:

{
  "metric": "/path/to/metric",
  "mocks": [
    {
      "name": "table_1",
      "path": "mocks/table_1.jsonl"
    }
  ],
  "tests": {
    "df2": [
      {
        "id": 200,
        "name": "test"
      },
      {
        "id": 300,
        "name": "test2"
      }
    ]
  }
}

And the corresponding mocks/table_1.jsonl:

{ "id": 200, "name": "test" }
{ "id": 300, "name": "test2" }
{ "id": 1, "name": "test3" }
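To see why this test would pass, trace the mock rows through the two steps of the example MQL file from earlier (id > 100, then id < 1000). The plain-Python sketch below is an illustration of that data flow only, not how MetorikkuTester is implemented (and the mock here is named table_1 while the earlier MQL reads input_1; only the filtering logic matters):

```python
import json

# Mock rows, one JSON object per line (as in mocks/table_1.jsonl).
jsonl = """\
{ "id": 200, "name": "test" }
{ "id": 300, "name": "test2" }
{ "id": 1, "name": "test3" }
"""
rows = [json.loads(line) for line in jsonl.splitlines()]

# The two steps from the example MQL file, expressed as plain filters.
df1 = [r for r in rows if r["id"] > 100]   # SELECT * from input_1 where id > 100
df2 = [r for r in df1 if r["id"] < 1000]   # SELECT * from df1 where id < 1000

# Expected rows for df2, taken from the test_settings.json above.
expected = [{"id": 200, "name": "test"}, {"id": 300, "name": "test2"}]
assert df2 == expected
print("df2 matches the expected test output")
```

The row with id 1 is dropped by the first filter, so only the two expected rows survive, matching the "tests" block of test_settings.json.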
Running Metorikku Tester

You can run Metorikku Tester using any of the methods above (just like a regular Metorikku run); only the main class changes, from com.yotpo.metorikku.Metorikku to com.yotpo.metorikku.MetorikkuTester.

License

See the LICENSE file for license rights and limitations (MIT).

Latest Releases

  • v0.0.5 (Dec. 5, 2017)
  • v0.0.4 (Dec. 5, 2017)
  • v0.0.3 (Dec. 4, 2017)
  • v0.0.2 (Dec. 4, 2017)
  • v0.0.1 (Nov. 19, 2017)