YellowFin
YellowFin is an auto-tuning optimizer based on momentum SGD that requires no manual specification of learning rate or momentum. It measures the objective landscape on the fly and tunes momentum as well as learning rate using a local quadratic approximation.
The implementation here can be a drop-in replacement for any optimizer in PyTorch. It supports the step and zero_grad functions like any PyTorch optimizer after from yellowfin import YFOptimizer. We also provide an interface to manually set the learning rate schedule at every iteration for finer control.
For more technical details, please refer to our paper YellowFin and the Art of Momentum Tuning.
For more usage details, please refer to the inline documentation of tuner_utils/yellowfin.py. Example usage can be found here for ResNext on CIFAR10 and Tied LSTM on PTB.
YellowFin is under active development. Many members of the community have kindly submitted issues and pull requests. We are incorporating fixes and smoothing things out. As a result the repository code is in flux. Please make sure you use the latest version and submit any issues you might have!
Updates
[2017.07.03] Fixed a gradient clipping bug. Please pull our latest master branch to make gradient clipping great again in YellowFin.
[2017.07.28] Switched to logarithmic smoothing to accelerate adaptation to curvature range trends.
[2017.08.01] Added optional feature to enforce non-increasing value of lr * gradient norm for stability in some rare cases.
[2017.08.05] Added feature to correct estimation bias from sparse gradient.
[2017.08.16] Replaced the numpy root solver with a closed-form solution using Vieta's substitution for the cubic equation. This resolves the stability issue of the numpy root solver.
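For readers unfamiliar with Vieta's substitution, here is a minimal sketch of the general technique on a depressed cubic y^3 + p*y + q = 0 with p > 0. The function name and the coefficients are illustrative assumptions; the specific cubic that YellowFin solves is documented in tuner_utils/yellowfin.py and is not reproduced here.

```python
import math


def depressed_cubic_real_root(p, q):
    """Return the real root of y**3 + p*y + q = 0, assuming p > 0.

    Hypothetical helper for illustration only; not the YellowFin code.
    """
    if p <= 0:
        raise ValueError("this sketch only handles p > 0 (a single real root)")
    # Vieta's substitution y = w - p / (3 * w) turns the cubic into a
    # quadratic in w**3:  (w**3)**2 + q * w**3 - p**3 / 27 = 0.
    w3 = (-q + math.sqrt(q * q + 4.0 * p ** 3 / 27.0)) / 2.0
    # With p > 0 the square root exceeds |q|, so w3 is strictly positive
    # and the real cube root can be taken directly.
    w = w3 ** (1.0 / 3.0)
    return w - p / (3.0 * w)


# Example: y**3 + 3*y - 4 = 0 has the real root y = 1.
root = depressed_cubic_real_root(3.0, -4.0)
residual = root ** 3 + 3.0 * root - 4.0
```

Because the root comes from a closed-form expression, there is no iterative solver that can fail to converge, which is the kind of stability benefit the update above refers to.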
Setup instructions for experiments
Please clone the master branch and follow the instructions to run YellowFin on ResNext for CIFAR10 and tied LSTM on Penn Treebank for language modeling. The models are adapted from ResNext repo and PyTorch example tied LSTM repo respectively. Thanks to the researchers for developing the models. For more experiments on more convolutional and recurrent neural networks, please refer to our Tensorflow implementation of YellowFin.
Note: YellowFin is tested with PyTorch v0.1.12 for compatibility, under Python 2.7.
Run CIFAR10 ResNext experiments
The experiments on 110 layer ResNet with CIFAR10 and 164 layer ResNet with CIFAR100 can be launched using
cd pytorch-cifar
python main.py --lr=1.0 --mu=0.0 --logdir=path_to_logs --opt_method=YF
Run Penn Treebank tied LSTM experiments
The experiments on multiple-layer LSTM on Penn Treebank can be launched using
cd word_language_model
python main.py --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied --opt_method=YF --logdir=path_to_logs --cuda
For more experiments, please refer to our YellowFin Tensorflow Repo.
Detailed guidelines

Basic use: YFOptimizer(parameter_list, lr=1.0, mu=0.0) sets the initial learning rate and momentum to 1.0 and 0.0 respectively. This is the uniform setting (i.e. without tuning) for all our PyTorch and Tensorflow experiments. Typically, after a few thousand minibatches, the influence of these initial values diminishes.

If the loss explodes after a very small number of iterations, you may want to lower the initial lr to prevent the explosion at the beginning.

Some users have also reported that using a regularizer helps avoid explosions.


Interface for manual finer control: If you want finer control over the learning rate (say, a manually set constant learning rate), or you want to use the typical lr-dropping technique after a certain number of epochs, please use set_lr_factor() in the YFOptimizer class. E.g. if you want to use a manually set constant learning rate, you can run set_lr_factor(desired_lr / self._lr) before self.step() at each iteration. More details can be found here.
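The two recipes above boil down to choosing a multiplicative factor. The sketch below computes such factors in plain Python, assuming the factor multiplies the auto-tuned learning rate (as the constant-rate recipe implies). The helper names, drop_every and drop_rate are hypothetical and not part of YFOptimizer.

```python
def constant_lr_factor(desired_lr, tuned_lr):
    """Factor that rescales the auto-tuned rate to exactly desired_lr."""
    return desired_lr / tuned_lr


def step_decay_factor(epoch, drop_every=30, drop_rate=0.1):
    """Factor that drops the rate by drop_rate every drop_every epochs."""
    return drop_rate ** (epoch // drop_every)


# Constant rate: if the tuner currently proposes 0.5 and we want 0.01,
# the factor is 0.02, since 0.5 * 0.02 == 0.01.
const_factor = constant_lr_factor(desired_lr=0.01, tuned_lr=0.5)

# Step decay: epochs 0-29 keep the tuned rate, 30-59 scale it by 0.1, etc.
decay_factors = [step_decay_factor(e) for e in (0, 29, 30, 65)]

# In a training loop, the chosen factor would be passed to the optimizer
# before each step, e.g. (not executed here):
#   optimizer.set_lr_factor(step_decay_factor(epoch))
#   optimizer.step()
```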
Gradient clipping: By default, YellowFin does not clip gradients. There are three cases regarding gradient clipping. We recommend leaving gradient clipping off, which is the default setting, and turning it on only when necessary.

If you want to manually set a threshold to clip the gradient, you can use the clip_thresh=thresh_on_the_gradient_norm argument when initializing the YFOptimizer.
If you want to totally turn off gradient clipping, please use clip_thresh=None, auto_clip_fac=None when initializing the YFOptimizer.
If you want to keep the auto clipping feature, you can also play with auto_clip_fac=positive_value, where a lower value means stricter clipping; the values 1.1 and 2 work well on a few examples we tried out.
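As a reminder of what a threshold on the gradient norm means, here is a minimal, generic sketch of norm-based clipping in plain Python. It only illustrates the idea behind a clip_thresh-style threshold. The function name is hypothetical, and neither YellowFin's internal clipping code nor the way auto_clip_fac derives its threshold is shown here.

```python
import math


def clip_by_global_norm(grads, clip_thresh):
    """Rescale grads so their L2 norm is at most clip_thresh.

    Returns (clipped_grads, norm_before_clipping). A clip_thresh of None
    means clipping is turned off.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if clip_thresh is None or norm <= clip_thresh:
        return list(grads), norm
    scale = clip_thresh / norm
    return [g * scale for g in grads], norm


# A gradient of norm 5 clipped to norm 1 keeps its direction.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], clip_thresh=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

# With the threshold disabled, the gradient is returned unchanged.
unclipped, _ = clip_by_global_norm([3.0, 4.0], clip_thresh=None)
```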


Normalization: When using log-probability style losses, please make sure the loss is properly normalized. In some RNN/LSTM cases, the cross_entropy needs to be averaged over the number of samples in a minibatch. With some PyTorch loss functions, it also needs to be averaged over the number of classes and the sequence length of each sample. E.g. in nn.MultiLabelSoftMarginLoss, size_average=True needs to be set.
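The sketch below illustrates, in plain Python, why averaging matters: a summed negative log-likelihood grows with the minibatch size, while the averaged one does not. The helper names and the toy probabilities are assumptions for illustration and do not come from the YellowFin code.

```python
import math


def nll(prob_of_true_class):
    """Negative log-likelihood of a single sample."""
    return -math.log(prob_of_true_class)


def summed_loss(probs):
    return sum(nll(p) for p in probs)


def averaged_loss(probs):
    return summed_loss(probs) / len(probs)


# Same per-sample quality, different minibatch sizes.
small_batch = [0.5] * 4
large_batch = [0.5] * 64

# The summed loss (and hence its gradient) is 16x larger for the big batch,
sum_ratio = summed_loss(large_batch) / summed_loss(small_batch)
# while the averaged loss is the same for both.
avg_gap = abs(averaged_loss(large_batch) - averaged_loss(small_batch))
```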
Sparsity: Gradient norm, curvature estimations, etc., when calculated with sparse gradients, are biased toward larger values than their counterparts from the dense gradient on the full dataset. The bias can be illustrated with the following example: compare the norms of the vectors (1.0, 0.0) and (0.0, 1.0) with the norm of their average (0.5, 0.5). The norm of the latter is sqrt(sparsity) (the sparsity is 0.5 here) times the norm of the former. The sparsity debias feature is useful when the model is very sparse, e.g. LSTM with word embedding. For non-sparse models, e.g. CNN, turning this feature off can give a slight speedup.
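The two-vector example can be checked numerically. This sketch only reproduces the arithmetic of the example; it is not the debiasing code used inside YellowFin.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


# Two sparse per-sample gradients, each touching a different coordinate.
g1 = (1.0, 0.0)
g2 = (0.0, 1.0)
sparsity = 0.5  # fraction of non-zero entries in each gradient

# Dense counterpart: the average of the two gradients.
avg = tuple((a + b) / 2.0 for a, b in zip(g1, g2))

mean_of_norms = (l2_norm(g1) + l2_norm(g2)) / 2.0  # norm seen per sample
norm_of_mean = l2_norm(avg)                        # norm of the dense gradient
expected = math.sqrt(sparsity) * mean_of_norms
```

Here the per-sample norm overestimates the dense norm by a factor of 1 / sqrt(0.5), roughly 1.41, which is the bias the debias feature corrects for.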

Non-increasing move: In some rare cases, we have observed that an increasing value of lr * ||grad||, i.e. the move, may result in instability. We implemented an engineering trick to enforce a non-increasing value of lr * ||grad||. The default setting turns the feature off; you can turn it on with force_non_inc_step_after_iter=the starting iter you want to enforce the non-increasing value if it is really necessary. We recommend force_non_inc_step_after_iter to be at least a few hundred because some models may need to gradually raise the magnitude of the gradient in the beginning (e.g. a model that is not properly initialized may have near-zero gradients and need iterations to reach a reasonable gradient level).
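One way to picture the constraint is the sketch below, which caps the learning rate so that lr * ||grad|| never exceeds its previous value once a chosen iteration is reached. This is an assumed mechanism for illustration only; the function name, the toy schedule and the capping rule are hypothetical, and the actual YellowFin implementation may differ.

```python
def cap_lr_for_non_increasing_move(lr, grad_norm, prev_move):
    """Return (lr, move) with move = lr * grad_norm capped at prev_move."""
    move = lr * grad_norm
    if prev_move is not None and grad_norm > 0 and move > prev_move:
        lr = prev_move / grad_norm  # shrink lr just enough to hit the cap
        move = prev_move
    return lr, move


force_after = 2  # stand-in for force_non_inc_step_after_iter
grad_norms = [0.1, 0.5, 0.4, 0.8, 0.2]  # toy gradient norms per iteration
proposed_lr = 1.0                       # toy rate proposed by the tuner

moves = []
prev_move = None
for it, grad_norm in enumerate(grad_norms):
    if it >= force_after:
        lr, move = cap_lr_for_non_increasing_move(proposed_lr, grad_norm, prev_move)
    else:
        # Early iterations are unconstrained so the move can still grow.
        lr, move = proposed_lr, proposed_lr * grad_norm
    moves.append(move)
    prev_move = move
```

In this toy run the move is free to grow from the first to the second iteration, and from iteration 2 onward it never increases again, mirroring the recommendation to start enforcing only after the gradient has reached a reasonable level.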
Citation
If you use YellowFin in your paper, please cite the paper:
@article{zhang2017yellowfin,
title={YellowFin and the Art of Momentum Tuning},
author={Zhang, Jian and Mitliagkas, Ioannis and R{\'e}, Christopher},
journal={arXiv preprint arXiv:1706.03471},
year={2017}
}
Implementation for other platforms
For Tensorflow users, we provide an implementation in the YellowFin Tensorflow Repo.
For Theano users, Github user botev has already implemented a Theano version here: YellowFin Theano Repo.
We thank the contributors for YellowFin in different deep learning frameworks.