Counting 2,285 Big Data & Machine Learning Frameworks, Toolsets, and Examples...
Suggestion? Feedback? Tweet @stkim1


Noisy Natural Gradient as Variational Inference

PyTorch implementation of Noisy Natural Gradient as Variational Inference.


  • Python 3
  • Pytorch
  • visdom


  • This paper is about how to optimize bayesian neural network which has matrix variate gaussian distribution.
  • This implementation contains Noisy Adam optimizer which is for Fully Factorized Gaussian(FFG) distribution, and Noisy KFAC optimizer which is for Matrix Variate Gaussian(MVG) distribution.
  • These optimizers only work with bayesian network which has specific structure that I will mention below.
  • Currently only linear layer is available.

Experimental comments

  • I addded a lr scheduler to noisy KFAC because loss is exploded during training. I guess this happens because of slight approximation.
  • For MNIST training noisy KFAC is 15-20x slower than noisy Adam, as mentioned in paper.
  • I guess the noisy KFAC needs more epochs to train simple neural network structure like 2 linear layers.


Currently only MNIST dataset are currently supported, and only fully connected layer is implemented.


  • model : Fully Factorized Gaussian(FFG) or Matrix Variate Gaussian(MVG)
  • n : total train dataset size. need this value for optimizer.
  • eps : parameter for optimizer. Default to 1e-8.
  • initial_size : initial input tensor size. Default to 784, size of MNIST data.
  • label_size : label size. Default to 10, size of MNIST label.

More details in


$ python --model=FFG --batch_size=100 --lr=1e-3 --dataset=MNIST
$ python --model=MVG --batch_size=100 --lr=1e-2 --dataset=MNIST --n=60000


  • To visualize intermediate results and loss plots, run python -m visdom.server and go to the URL http://localhost:8097


$ python --epoch=20

Training Graphs


  • network is consist of 2 linear layers.
  • FFG optimized by noisy Adam : epoch 20, lr 1e-3

  • MVG optimized by noisy KFAC : epoch 100, lr 1e-2, decay 0.1 for every 30 epochs
  • Need to tune learning rate.

Implementation detail

  • Optimizing parameter procedure is consists of 2 steps, Calculating gradient and Applying to bayeisan parameters.
  • Before forward, network samples parameters with means & variances.
  • Usually calling step function updates parameters, but not this case. After calling step function, you have to update bayesian parameters. Look at the


  • More benchmark cases
  • Supports bayesian convolution
  • Implement Block Tridiagonal Covariance, which is dependent between layers.

Code reference

Visualization code(, references to pytorch-CycleGAN-and-pix2pix( by Jun-Yan Zhu


Tony Kim