Data Linter


This code accompanies the NIPS 2017 ML Systems Workshop paper/poster, "The Data Linter: Lightweight, Automated Sanity Checking for ML Data Sets."

The Data Linter identifies potential issues (lints) in your ML training data.
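
To make the idea of a lint concrete, here is a minimal sketch of one check of this general kind: flagging a feature that is stored as strings even though nearly all of its values parse as numbers. The function name `lint_number_as_string` and the 0.9 threshold are hypothetical, chosen only for illustration; this is not the Data Linter's own implementation.

```python
from typing import Iterable, Optional


def _is_number(text: str) -> bool:
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def lint_number_as_string(name: str, values: Iterable[str],
                          threshold: float = 0.9) -> Optional[str]:
    """Return a warning if most string values in a feature look numeric.

    Hypothetical helper: `threshold` is an illustrative cutoff, not a value
    taken from the Data Linter.
    """
    values = list(values)
    if not values:
        return None
    numeric_fraction = sum(_is_number(v) for v in values) / len(values)
    if numeric_fraction >= threshold:
        return (f"Feature '{name}' is stored as strings, but "
                f"{numeric_fraction:.0%} of its values parse as numbers.")
    return None


if __name__ == "__main__":
    # A numeric-looking string column yields a warning; a categorical one does not.
    print(lint_number_as_string("age", ["39", "50", "38", "53"]))
    print(lint_number_as_string("workclass", ["Private", "State-gov"]))
```

A warning like this is a hint rather than an error: the usual follow-up is to check whether the feature should be converted to a numeric type before training.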

Using the Data Linter


You'll need the following installed to use the Data Linter:

  1. Python
  2. Apache Beam
  3. TensorFlow
  4. Facets

Data Linter Demo

The easiest way to learn how to use the Data Linter is to follow the demo instructions in demo/.

Running the Data Linter

Running the Data Linter requires the following steps:

  1. Encoding your data in TFRecord format.
  2. Generating summary statistics for those data, using Facets.
  3. Running the Data Linter.
  4. Using the Lint Explorer to view the lint results.

Creating Data in the TFRecord Format

To convert CSV files to the TFRecord format, see the example code in demo/.
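
The demo code remains the reference for this step. As a rough sketch of what such a conversion can look like, the example below assumes a CSV file with a header row and uses TensorFlow's `tf.train.Example` and `tf.io.TFRecordWriter` (exposed as `tf.python_io.TFRecordWriter` in older TensorFlow 1.x releases). The helper names `parse_cell` and `csv_to_tfrecord` are hypothetical and are not part of the Data Linter.

```python
import csv
from typing import Union

Value = Union[int, float, bytes]


def parse_cell(text: str) -> Value:
    """Convert a CSV cell to int, then float, falling back to UTF-8 bytes."""
    text = text.strip()
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text.encode("utf-8")


def csv_to_tfrecord(csv_path: str, tfrecord_path: str) -> int:
    """Write each CSV row as one tf.train.Example; return the row count.

    Assumes a header row and rows with exactly one cell per column.
    """
    # Imported here so parse_cell can be used without TensorFlow installed.
    import tensorflow as tf

    def to_feature(value: Value):
        # Pick the tf.train.Feature list type that matches the parsed value.
        if isinstance(value, int):
            return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
        if isinstance(value, float):
            return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

    count = 0
    with open(csv_path, newline="") as f, \
            tf.io.TFRecordWriter(tfrecord_path) as writer:
        for row in csv.DictReader(f):
            features = {name: to_feature(parse_cell(cell))
                        for name, cell in row.items()}
            example = tf.train.Example(
                features=tf.train.Features(feature=features))
            writer.write(example.SerializeToString())
            count += 1
    return count
```

Because types are inferred per cell, a column that mixes numbers with placeholder strings such as `?` will end up with mixed feature types. For real data, fixing each column's type up front is usually the safer design.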

Summarizing Your Data Using Facets

To generate summary statistics for your data, see the example code in demo/.
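
For readers unfamiliar with the term, the sketch below illustrates the kind of per-feature summary statistics meant here, in plain Python. It does not use Facets and does not produce the statistics file the Data Linter expects; the function names and the chosen statistics are assumptions made for illustration only.

```python
import math
from collections import Counter
from typing import Any, Dict, List, Optional


def summarize_numeric(values: List[Optional[float]]) -> Dict[str, Any]:
    """Count, missing count, min, max, mean, and population std. deviation."""
    present = [v for v in values if v is not None]
    stats: Dict[str, Any] = {
        "count": len(values),
        "missing": len(values) - len(present),
    }
    if present:
        mean = sum(present) / len(present)
        variance = sum((v - mean) ** 2 for v in present) / len(present)
        stats.update(min=min(present), max=max(present),
                     mean=mean, std_dev=math.sqrt(variance))
    return stats


def summarize_strings(values: List[Optional[str]],
                      top_k: int = 3) -> Dict[str, Any]:
    """Count, missing count, number of unique values, and most common values."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "unique": len(counts),
        "top_values": counts.most_common(top_k),
    }


if __name__ == "__main__":
    print(summarize_numeric([39.0, 50.0, None, 53.0]))
    print(summarize_strings(["Private", "State-gov", "Private", None]))
```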

Executing the Data Linter

Once you have both the data and the summary statistics, you can run the Data Linter as follows:

python --dataset_path PATH_TO_TFRECORDS \

For example, if you follow the instructions in the demo folder, you'll invoke the Data Linter like this:

python --dataset_path /tmp/adult.tfrecords \
  --stats_path /tmp/adult_summary.bin \
  --results_path /tmp/datalinter/results/lint_results.bin

Viewing Results with the Lint Explorer

After the Data Linter is done examining your data, you can view the results using this command:

python --results_path PATH_TO_RESULTS

For example:

python --results_path \


The code makes use of Google's protobuf format. The protos are defined in protos/.

To make it easier to run the code, we include protobuf definitions from TensorFlow and Facets in this distribution.


This is not an official Google project. This project will not be supported or maintained, and we will not accept any pull requests.


The Data Linter was created by Nick Hynes during an internship at Google with Michael Terry.