This Python package provides command-line utilities that make it easier to run machine learning experiments with scikit-learn. One of the project's primary goals is to let you run scikit-learn experiments without writing any code beyond what you used to generate or extract the features.
The main utility we provide is called run_experiment. It can be used to easily run a series of learners on datasets specified in a configuration file like this one:
    [General]
    experiment_name = Titanic_Evaluate_Tuned
    # valid tasks: cross_validate, evaluate, predict, train
    task = evaluate

    [Input]
    # these directories could also be absolute paths
    # (and must be if you're not running things in local mode)
    train_directory = train
    test_directory = dev
    # Can specify multiple sets of feature files that are merged together automatically
    # (even across formats)
    featuresets = [["family.ndj", "misc.csv", "socioeconomic.arff", "vitals.csv"]]
    # List of scikit-learn learners to use
    learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
    # Column in CSV containing labels to predict
    label_col = Survived
    # Column in CSV containing instance IDs (if any)
    id_col = PassengerId

    [Tuning]
    # Should we tune parameters of all learners by searching provided parameter grids?
    grid_search = true
    # Function to maximize when performing grid search
    objectives = ['accuracy']

    [Output]
    # Also compute the area under the ROC curve as an additional metric
    metrics = ['roc_auc']
    # The following can/should be absolute paths
    log = output
    results = output
    predictions = output
    models = output
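To get a feel for how such a file is structured, the sketch below reads a trimmed copy of the configuration with Python's standard configparser module and lists the feature set and learner combinations it describes. This is only an illustration of the file layout, not how run_experiment itself parses configurations; the helper name summarize_config is hypothetical.

```python
import ast
import configparser

# A trimmed copy of the configuration shown above (comments removed).
CONFIG_TEXT = """
[General]
experiment_name = Titanic_Evaluate_Tuned
task = evaluate

[Input]
train_directory = train
test_directory = dev
featuresets = [["family.ndj", "misc.csv", "socioeconomic.arff", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]

[Tuning]
grid_search = true
objectives = ['accuracy']
"""


def summarize_config(text):
    """Hypothetical helper: pull the main experiment settings out of a config string."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "task": parser.get("General", "task"),
        # List-valued options are written as Python/JSON-style literals.
        "learners": ast.literal_eval(parser.get("Input", "learners")),
        "featuresets": ast.literal_eval(parser.get("Input", "featuresets")),
        "grid_search": parser.getboolean("Tuning", "grid_search"),
        "objectives": ast.literal_eval(parser.get("Tuning", "objectives")),
    }


summary = summarize_config(CONFIG_TEXT)

# Every learner is paired with every feature set listed in the file.
combinations = [
    (tuple(featureset), learner)
    for featureset in summary["featuresets"]
    for learner in summary["learners"]
]

for featureset, learner in combinations:
    print(learner, "on", "+".join(featureset))
```

With one feature set and four learners, the loop prints four combinations, one per learner.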
We also provide utilities for:
- converting between machine learning toolkit formats (e.g., ARFF, CSV, MegaM)
- filtering feature files
- joining feature files
- other common tasks
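To make "joining feature files" concrete, here is a minimal conceptual sketch in plain Python that merges two small CSV feature files on a shared ID column. It is not SKLL's implementation and does not use the SKLL utilities; the function name join_feature_rows and the Age and SibSp columns are assumptions made for illustration.

```python
import csv
import io


def join_feature_rows(csv_texts, id_col="id"):
    """Hypothetical helper: merge rows from several CSV texts that share an ID column."""
    merged = {}
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            # Rows with the same ID are combined into a single feature dict.
            merged.setdefault(row[id_col], {}).update(row)
    return merged


# Two tiny in-memory "feature files" keyed by PassengerId (illustrative data only).
vitals = "PassengerId,Age\n1,22\n2,38\n"
family = "PassengerId,SibSp\n1,1\n2,0\n"

joined = join_feature_rows([vitals, family], id_col="PassengerId")
print(joined["1"])
```

Each instance ends up with the union of the features from both files, which is the same idea as listing several files in one feature set.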
If you just want to avoid writing a lot of boilerplate learning code, you can also use our simple Python API, which also supports pandas DataFrames. The main way you'll want to use the API is through the Learner and Reader classes. For more details on our API, see the documentation.
While our API can be broadly useful, the command-line utilities are intended as the primary way of using SKLL. The API is just a nice side effect of our developing the utilities.
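As a rough sketch of what API usage can look like, the function below reads two feature files with a Reader, trains a single Learner, and evaluates it. It assumes the skll package is installed and that these calls behave the same way in your installed version; the function name train_and_evaluate, the default learner, and the choice to skip grid search are assumptions made for illustration, so check the documentation before relying on it.

```python
def train_and_evaluate(train_path, test_path, learner_name="LogisticRegression"):
    """Hypothetical helper: train one learner on train_path and evaluate it on test_path."""
    # Imported lazily so this snippet can be loaded even without SKLL installed.
    from skll import Learner
    from skll.data import Reader

    # Column names taken from the example configuration file.
    reader_kwargs = {"label_col": "Survived", "id_col": "PassengerId"}

    # Reader.for_path picks a reader based on the file extension.
    train_fs = Reader.for_path(train_path, **reader_kwargs).read()
    test_fs = Reader.for_path(test_path, **reader_kwargs).read()

    learner = Learner(learner_name)
    # Grid search is turned off here to keep the sketch small.
    learner.train(train_fs, grid_search=False)
    return learner.evaluate(test_fs)
```

With the directory layout from the example configuration, a call might look like train_and_evaluate("train/vitals.csv", "dev/vitals.csv").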
A Note on Pronunciation
SciKit-Learn Laboratory (SKLL) is pronounced "skull": that's where the learning happens.
Requirements
- Python 2.7+
- BeautifulSoup 4
- Grid Map (only required if you plan to run things in parallel on a DRMAA-compatible cluster)
- configparser (only required for Python 2.7)
- logutils (only required for Python 2.7)
- mock (only required for Python 2.7)
The following packages are not required but can be installed for additional features:
- seaborn
Talks
- Simpler Machine Learning with SKLL 1.0, Dan Blanchard, PyData NYC 2014
- Simpler Machine Learning with SKLL, Dan Blanchard, PyData NYC 2013
Changelog
See GitHub releases.