PyTorch Unsupervised Sentiment Discovery
This codebase is part of our effort to reproduce, analyze, and scale the Generating Reviews and Discovering Sentiment paper from OpenAI.
Early efforts have yielded a training time of 5 days on 8 volta-class gpus down from the training time of 1 month reported in the paper.
The learned language model can be transferred to other natural language processing (NLP) tasks where it is used to featurize text samples. The featurizations provide a strong initialization point for discriminative language tasks, and allow for competitive task performance given only a few labeled samples. To illustrate this the model is transferred to the Binary Stanford Sentiment Treebank task.
The model's performance as a whole will increase as it processes more data. However, as is discovered early on in the training process, transferring the model to sentiment classification via Logistic Regression with an L1 penalty leads to the emergence of one neuron (feature) that correlates strongly with sentiment.
This sentiment neuron can be used to accurately and robustly discriminate between postive and negative sentiment given the single feature. A decision boundary of 0 is able to split the distribution of the neuron features into (relatively) distinct positive and negative sentiment clusters.
Samples from the tails of the feature distribution correlate strongly with positive/negative sentiment.
- SentimentDiscovery Package
Install the sentiment_discovery package with
pip setup.py install in order to run the modules/scripts within this repo.
At this time we only support python3.
- pytorch (>= 0.3.0)
We've included our own trained mlstm/lstm models as well as OpenAI's mlstm model:
We've provided the Binary Stanford Sentiment Treebank and IMDB Movie Review datasets as part of this repository. In order to train on the amazon dataset please download the "aggressively deduplicated data" version from Julian McAuley's original site. Access requests to the dataset should be approved instantly. While using the dataset make sure to load it with the
In addition to providing easily reusable code of the core functionalities of this work in our
sentiment_discovery package, we also provide scripts to perform the three main high-level functionalities in the paper:
- unsupervised reconstruction/language modeling of a corpus of text
- transfer of learned language model to perform sentiment analysis on a specified corpus
- heatmap visualization of sentiment in a body of text, and conversely generating text according to a fixed sentiment
Script results will be saved/logged to the
<experiment_dir>/<experiment_name>/* directory hierarchy.
Train a recurrent language model (mlstm/lstm) and save it to the
<model_dir>/ directory in the above hierarchy. The metric histories will be available in the appropriate
<history>/ directory in the same hierarchy. In order to apply
weight_norm only to lstm parameters as specified by the paper we use the
-lstm_only flag. Additionally, we use fused LSTM kernels in the mlstm by specifying the
fuse_lstm flag, in order to achieve modest speedups.
python3 text_reconstruction.py -experiment_dir ./experiments -experiment_name mlstm -model_dir model -cuda \ -embed_size 64 -rnn_size 4096 -layers 1 -rnn_type mlstm -dropout 0 -weight_norm -lstm_only -fuse_lstm \ -train data/amazon/reviews.json -loose_json -text_key reviewText -lazy -data_set_type unsupervised \ -batch_size 32 -seq_length 256 -lr 0.000125 --optimizer_type=Adam -lr_scheduler LinearLR -epochs 1 \ -num_shards 1002 -split 1000,1,1 -eval_batch_size 1 -eval_seq_length -1 -persist_state 1 -save_epochs 1 -save_iters 5000
The training process takes in a json list (or csv) file as a dataset where each entry has key (or column)
text_key containing samples of text to model.
The unsupervised dataset will proportionally split
num_shards amongst train/val/test according to
split. Validation/test sets can also be supplied manually, they will also have
The dataset entries are transposed to allow for the concatenation of sequences in order to persist hidden state across a shard in the training process.
Hidden states are reset either at the start of every sequence, every shard, or never based on the value of
persist_state. See data flags.
We use an evaluation sequence length of -1. A negative
seq_length calculates sequence length s.t. there are
seq_length samples in the dataset.
Lastly, We know waiting for more than 1 million updates over 1 epoch for large datasets is a long time. We've set the training script to save the most current model (and data history) every
save_iters steps within an epoch, and also included a
-max_iters flag to end training early.
The resulting loss curve over 1 (x4 because of data parallelism) million iterations should produce a graph similar to this.
Note: if it's your first time modeling a particular corpus make sure to turn on the
For a list of default values for all flags look at
no_loss- do not compute or track the loss curve of the model
lr- Starting learning rate.
lr_scheduler- one of [ExponentialLR,LinearLR]
lr_factor- factor by which to decay exponential learning rate
optimizer_type- Class name of optimizer to use as listed in torch.optim class (ie. SGD, RMSProp, Adam)
clip- Clip gradients to this maximum value. Default: 1.
epochs- number of epochs to train for
max_iters- total number of training iterations to run. Takes presedence over number of epochs
start_epoch- epoch to start training at (used to resume training for exponential decay scheduler)
start_iter- what iteration to start training at (used mainly for resuming linear learning rate scheduler)
save_epochs- number of epochs to save model progress. Defaults to every epoch
save_iters- save model every so often per epoch
save_optim- if enabled save optimizer state when saving model progress
load_optim- if enabled load optimizer state from saved model if available
This script parses sample text and binary sentiment labels from csv (or json) data files according to
label_key. The text is then featurized and fit to a logistic regression model.
The index of the feature with the largest L1 penalty is then used as the index of the sentiment neuron. The relevant sentiment feature is extracted from the samples' features and then also fit to a logistic regression model to compare performance.
Below is example output for transfering our trained language model to the Binary Stanford Sentiment Treebank task.
python3 sentiment_transfer.py -experiment_dir ./experiments -experiment_name mlstm -model_dir model -cuda \ -train binary_sst/train.csv -valid binary_sst/val.csv -test binary_sst/test.csv \ -text_key sentence -label_key label -load_model e0.pt
92.11/90.02/91.27 train/val/test accuracy w/ all_neurons 00.12 regularization coef w/ all_neurons 00107 features used w/ all neurons all_neurons neuron(s) 1285, 1503, 3986, 1228, 2301 are top sentiment neurons 86.36/85.44/86.71 train/val/test accuracy w/ n_neurons 00.50 regularization coef w/ n_neurons 00001 features used w/ all neurons n_neurons
These results are saved to
Note: The other neurons are firing, and we can use these neurons in our prediction metrics by setting
-num_neurons to the desired number of heavily penalized neurons. Even if this option is not enabled the logit visualizations/data are always saved for at least the top 5 neurons.
For a list of default values for all involved flags look at
num_neurons- number of neurons to consider for sentiment classification
no_test_eval- do not evaluate the test data for accuracy (useful when your test set has no labels)
write_results- write results (output probabilities as well as logits of top neurons) of model on test (or train if none is specified) data to specified filepath. (Can only write results to a csv file)
Transfer to your own data
Analyze sentiment on your own data by using one of the pretrained models to train on the Stanford Treebank (or other sentiment benchmarks) and evaluating on your own
If you don't have labels for your data set make sure to use the
Train a Different Classifier
If you have your own unique classification dataset complete with training labels, you can also use a pretrained model to extract relevant features for classification.
Supply your own
-train (and evaluation) dataset while making sure to specify the right text and label keys.
This script takes a piece of text specified by
text and then visualizes it via a karpathy-style heatmap based on the activations of the specified
neuron. In order to find the neuron to specify, find the index of the neuron with the largest l1 penalty based on the neuron weight visualizations from the previous section.
Without loss of generality we're going to go through this section first referring to OpenAI's neuron weights. For this set of weights the sentiment neuron can be found at neuron 2388.
Armed with this information we can use the script to analyze the following exerpt from the beginning of a review about The League of Extraordinary Gentlemen.
python3 visualize.py -experiment_dir ./experiments -experiment_name mlstm_paper -model_dir model -cuda \ -text "25 August 2003 League of Extraordinary Gentlemen: Sean Connery is one of the all time greats \ I have been a fan of his since the 1950's." -neuron 2388 -load_model e0.pt
However, not only does this script analyze text, but the user can also specify the begining of a text snippet with
text and generate additional following text up to a total
seq_length. Simply set the
python3 visualize.py -experiment_dir ./experiments -experiment_name mlstm_paper -model_dir model -cuda \ -text "25 August 2003 League of Extraordinary Gentlemen: Sean Connery is" \ -neuron 2388 -load_model e0.pt -generate
In addition to generating text, the user can also set the sentiment of the generated text to positive/negative by setting
-overwrite to +/- 1. Continuing our previous example:
When training a sentiment neuron with this method, sometimes we learn a positive sentiment neuron, and sometimes a negative sentiment neuron. This doesn’t matter, except for when running the visualization. The
negate flag improves the visualization of negative sentiment models, if you happen to learn one.
python3 visualize.py -experiment_dir ./experiments -experiment_name mlstm -model_dir model -cuda \ -text "25 August 2003 League of Extraordinary Gentlemen: Sean Connery is" \ -neuron 1285 -load_model e0.pt -generate -negate -overwrite +/-1
The resulting heatmaps are all saved to a png file corresponding of the first 100 characters of (generated or analyzed) text at
For a list of default values for all flags look at
text- whole or initial text for visualization
neuron- index of sentiment neuron
generate- generates text following initial text up to a total length of
temperature- temperature from sampling from language model while generating text
overwrite- For generated portion of text overwrite the neuron's value while generating
neuroncorresponds to a negative sentiment neuron rather than positive sentiment
layer- layer of recurrent net to extract neurons from
python3 text_reconstruction.py -experiment_dir ./experiments -experiment_name mlstm -model_dir model \ -embed_size 64 -rnn_size 4096 -layers 1 -rnn_type mlstm -dropout 0 -weight_norm -lstm_only -fuse_lstm \ -train data/amazon/reviews.json -loose_json -text_key reviewText -lazy -data_set_type unsupervised \ -batch_size 32 -seq_length 256 -lr 0.000125 --optimizer_type=Adam -lr_scheduler LinearLR -epochs 1 \ -num_shards 1002 -split 1000,1,1 -eval_batch_size 1 -eval_seq_length -1 -persist_state 1 -save_epochs 1 -save_iters 5000 -cuda -num_gpus 2 -distributed
In order to utilize Data Parallelism during training time ensure that
-cuda is in use and that
-num_gpus >1. As mentioned previously, vanilla DataParallelism produces no speedup for recurrent architectures. In order to circumvent this problem turn on the
-distributed flag to utilize PyTorch's DistributedDataParallel instead and experience speedup gains.
Also make sure to scale the learning rate as appropriate. Note how we went from
2.5e-4 learning rates when we doubled the number of gpus.
Note: validation and test datasets are currently not functioning properly for unsupervised reconstruction.
sentiment_discovery.modules: implementation of multiplicative LSTM, stacked LSTM.
sentiment_discovery.model: implementation of module for processing text, and model wrapper for running experiments
sentiment_discovery.reparametrization: implementation of weight reparameterization abstract class, and weight norm reparameterization.
sentiment_discovery.data_utils: implementations of datasets, data loaders, data samplers, data preprocessing utils.
sentiment_discovery.neuron_transfer: contains utilities for extracting neurons and featurizing text with the values of a model's neurons.
sentiment_discovery.learning_rates: contains implementations of learning rate schedules (ie. linear and exponential decay)
experiment_dir- Root directory for saving results, models, and figures
experiment_name- Name of experiment used for logging to
should_test- whether to train or evaluate a model
embed_size- embedding size for data
rnn_type- one of <
fuse_lstm- use fused lstm cuda kernels in mLSTM. Currently still buggy Do not use.
rnn_size- hidden state dimension of rnn
layers- Number of stacked recurrent layers
dropout- dropout probability after hidden state (0 = None, 1 = all dropout)
weight_norm- boolean flag, which, when enabled, will apply weight norm only to the recurrent (m)LSTM parameters
-weight_normis applied to the model, apply it to the lstm parmeters only
model_dir- subdirectory where models are saved in
load_model- a specific checkpoint file to load from
batch_size- minibatch size
data_size- dimension of each data point
seq_length- length of time sequence for reconstruction/generation (sentiment inference has no maximum sequence length)
data_set_type- what type of dataset to model. one of [unsupervised,supervised]
loose_json- Use loose json (one json-formatted string per newline), instead of tight json (data file is one json string)
persist_state- degree to which to persist state across samples for unsupervised learnign. 0 = reset state after every sample, 1 = reset state after finishing a shard, -1 = never reset state
transpose- transpose dataset, given batch size. Is always on for unsupervised learning, (should probably only be used for unsupervised).
no_wrap- In order to better concatenate strings together for unsupervised learning the batch data sampler does not drop the last batch, it instead "wraps" the dataset around to fill in the last batch of an epoch. On subsequent calls to the batch sampler the data set will have been shifted to account for this "wrapping" and allow for proper hidden state persistence. This is turned on by default in non-unsupervised scripts.
cache- cache recently used samples, and preemptively cache future data samples. Helps avoid thrashing in large datasets. Note: currently buggy and provides no performance benefit
Data Processing Flags
lazy- lazily load dataset (necessary for large datasets such as amazon)
preprocess- load dataset into memory and preprocess it according to section 4 of the paper. (should only be called once on a dataset)
shuffle- shuffle unsupervised dataset.
text_key- column name/dictionary key for extacting text samples from csv/json files.
label_key- column name/dictionary key for extacting sample labels from csv/json files.
delim- column delimiter for csv tokens (eg. ',', '\t', '|')
drop_unlabeled- drops unlabeled samples from csv/json file
binarize_sent- if sentiment labels are not 0 or 1, then binarize them so that the lower half of sentiment ratings get set to 0, and the upper half of the rating scale gets set to 1.
num_shards- number of total shards for unsupervised training dataset. If a
splitis specified, appropriately portions the number of shards amongst the splits.
val_shards- number of total shards for unsupervised validation set. If specified, overrides number of shards if any were split from
test_shards- number of total shards for unsupervised test set. If specified, overrides number of shards if any were split from
Dataset Path Flags
train- path to training set
split- comma-separated list of proportions for splitting the training set into training, validation, and test sets
valid- path to validation set
test- path to test set
cuda- utilize gpus to process strings. Should be on as cpu is too slow to process this.
num_gpus- number of gpus available
rank- distributed process index.
distributed- train with data parallelism distributed across multiple processes
world_size- number of distributed processes to run. Do not set. The scripts will automatically set it equal to
num_gpus(one gpu per process).
verbose- 2 = enable stdout for all processes (including all the data parallel workers). 1 = enable stdout for master worker. 0 = disable all stdout
seed- random seed for pytorch random operations
- Why Unsupervised Language Modeling?
- Reproducing Results
- Data Parallel Scalability
- Open Questions
Thanks to @guillette for providing a lightweight pytorch port of the original weights.
This project uses the amazon review dataset collected by J. McAuley
Want to help out? Open up an issue with questions/suggestions or pull requests ranging from minor fixes to new functionality.
May your learning be Deep and Unsupervised.