Generative Models Tutorial
Generative models are interesting topic in ML. Generative models are a subset of unsupervised learning that generate new sample/data by using given some training data. There are different types of ways of modelling same distribution of training data: AutoRegressive models, AutoEncoders and GANs. In this tutorial, we are focusing theory of generative models, demonstration of generative models, important papers, courses related generative models. It will continue to be updated over time.
Keywords: Bayesian Classifier Sampling, Variational Auto Encoder (VAE), Generative Adversial Networks (GANs), Popular GANs Architectures, AutoRegressive Models, Important Generative Model Papers, Courses, etc..
NOTE: This tutorial is only for education purpose. It is not academic study/paper. All related references are listed at the end of the file.
Table Of Content
 What is Generative Models?
 Preliminary (Recall)
 Sampling from Bayesian Classifier
 Unsupervised Deep Learning vs Supervised Deep Learning (Recall)
 Gaussian Mixture Model (GMM)
 AutoEncoders
 Variational Inference (VI)
 Variational AutoEncoder (VAE)
 Generative Adversial Networks (GANs)
 AutoRegressive Models
 Generative Model in Reinforcement Learning
 Important Papers
 Courses
 References
 Notes
What is Generative Models?

Generative models are a subset of unsupervised learning that generate new sample/data by using given some training data (from same distribution). In our daily life, there are huge data generated from electronic devices, computers, cameras, iot, etc. (e.g. millions of images, sentences, or sounds, etc.)

In generative models, a large amount of data in some domain firstly is collected and then model is trained with this large amount of data to generate data like it.

There are lots of applications for generative models:
 Image Denoising,
 Image Enhancement,
 Image Inpainting,
 Superresolution (upsampling): SRGAN,
 Generate 3D objects: 3DGAN, Video
 Creating Art:
 Pose Guided Person Image Generation
 Creating clothing images and styles from an image: PixelDTGAN
 Face Synthesis: TPGAN
 ImagetoImage Translation: Pix2Pix
 Labels to Street Scene
 Aerial to Map
 Sketch to Realistic Image
 Highresolution Image Synthesis
 Text to image: StackGAN, Paper
 Text to Image Synthesis
 Learn Joint Distribution: CoGAN
 Transfering style (or patterns) from one domain (handbag) to another (shoe): DiscoGAN
 Texture Synthesis: MGAN
 Image Editing: IcGAN
 Face Aging: AgecGAN
 Neural Photo Editor
 Medical (Anomaly Detection): AnoGAN
 Music Generation: MidiNet
 Video Generation
 Image Blending: GPGAN
 Object Detection: PerceptualGAN
 Natural Language:
 Artical Spinning

Generative models are also promising in the long term future because it has a potential power to learn the natural features of a dataset automatically.

Generative models are mostly used to generate images (vision area). In the recent studies, it will also used to generate sentences (natural language processing area).

It can be said that Generative models begins with sampling. Generative models are told in this tutorial according to the development steps of generative models: Sampling, Gaussian Mixture Models, Variational AutoEncoder, Generative Adversial Networks.
Preliminary (Recall)
 Bayesian Rule: p(zx)= p(xz) p(z) /p(x)
 Prior Distribution: "often simply called the prior, of an uncertain quantity is the probability distribution that would express one's beliefs about this quantity before some evidence is taken into account." (e.g. p(z))
 Posterior Distribution: is a probability distribution that represents your updated beliefs about the parameter after having seen the data. (e.g. p(zx))
 Posterior probability = prior probability + new evidence (called likelihood)
 Probability Density Function (PDF): the set of possible values taken by the random variable
 Gaussian (Normal) Distribution: A symmetrical data distribution, where most of the results lie near the mean.
 Bayesian Analysis:
 Prior distribution: p(z)
 Gather data
 "Update your prior distribution with the data using Bayes' theorem to obtain a posterior distribution."
 "Analyze the posterior distribution and summarize it (mean, median, etc.)"
 It is expected that you have knowledge of neural network concept (gradient descent, cost function, activation functions, regression, classification)
 Typically used for regression or classification
 Basically: fit(X,Y) and predict(X)
Sampling from Bayesian Classifier
 We use sampling data to generate new samples (using distribution of the training data).
 If we know the probability distribution of the training data , we can sample from it.
 Two ways of sampling:
 First Method: Sample from a given digit/class
 Pick a class, e.g. y=2
 If it is known p(xy=2) is Gaussian
 Sample from this Gaussian using Scipy (mvn.rvs(mean,cov))
 Second Method: Sample from p(y)p(xy) => p(x,y)
 If there is graphical model (e.g. y> x), and they have different distribution,
 If p(y) and p(xy) are known and y has its own distribution (e.g. categorical or discrete distribution)
 Sample
 First Method: Sample from a given digit/class
Unsupervised Deep Learning vs Supervised Deep Learning (Recall)
 Unsupervised Deep Learning: trying to learn structure of inputs
 Example1: If the structure of poetry/text can be learned, it is possible to generate text/poetry that resembles the given text/poetry.
 Example2: If the structure of art can be learned, it is possible to make new art/drawings that resembles the given art/drawings.
 Example3: If the structure of music can be learned, it is possible to create new music that resembles the given music.
 Supervised Deep Learning: trying to map inputs to targets
Gaussian Mixture Model (GMM)
 Single gaussian model learns blurry images if there are more than one gaussian distribution (e.g. different types of writing digits in handwriting).
 To get best result, GMM have to used to model more than one gaussian distribution.
 GMM is latent variable model. With GMM, multimodal distribution can be modelled at the same time.
 Multiple gaussians in different proportions are fitted into the GMM.
 2 clusters: p(x)=p(z=1) p(xz=1) + p(z=2) p(xz=2). In figure, there are 2 different proportions gaussian distributions.
ExpectationMaximization (EM)
 GMM is trained using ExpectationMaximization (EM)
 EM is iterative algorithm that let the likelihood improves at each step.
 The aim of EM is to reach maximum likelihood.
AutoEncoders
 A neural network that predicts (reconstructs) its own input.
 It is a feed forward network.
 W: weight, b:bias, x:input, f() and g():activation functions, z: latent variable, x_hat= output (reconstructed input)
 Instead of fit(X,Y) like neural networks, autoencoders fit(X,X).
 It has 1 input, 1 hidden, 1 output layers (hidden layer size < input layer size; input layer size = output layer size)
 It forces neural network to learn compact/efficient representation (e.g. dimension reduction/ compression)
Variational Inference (VI)
 Variational inference (VI) is the significant component of Variational AutoEncoders.
 VI ~ Bayesian extension of EM.
 In GMM/KMeans Clustering, you have choose the number of clusters.
 VIGMM (Variational inferenceGaussian Mixture Model) automatically finds the number of cluster.
Variational AutoEncoder (VAE)
 VAE is a neural network that learns to generate its input.
 It can map data to latent space, then generate samples using latent space.
 VAE is combination of autoencoders and variational inference.
 It doesn't work like tradional autoencoders.
 Its output is the parameters of a distribution: mean and variance, which represent a GaussianPDF of Z (instead only one value)
 VAE consists of two units: Encoder, Decoder.
 The output of encoder represents Gaussian distributions.(e.g. In 2D Gaussian, encoder gives 2 mean and 2 variance/stddev)
 The output of decoder represents Bernoulli distributions.
 From a probability distribution, new samples can be generated.
 For example: let's say input x is a 28 by 28pixel photo
 The encoder ‘encodes’ the data which is 784dimensional into a latent (hidden) representation space z.
 The encoder outputs parameters to q(z∣x), which is a Gaussian probability density.
 The decoder gets as input the latent representation of a digit z and outputs 784 Bernoulli parameters, one for each of the 784 pixels in the image.
 The decoder ‘decodes’ the realvalued numbers in z into 784 realvalued numbers between 0 and 1.
Latent Space
 Encoder: x> q(z) {q(z): latent space, coded version of x}
 Decoder: q(z)~z > x_hat
 Encoder takes the input of image "8" and gives output q(zx).
 Sample from q(zx) to get z
 Get p(x_hatx), sample from it (this is called posterior predictive sample)
Cost Function of VAE
 Evidence Lower Bound (ELBO) is our objective function that has to be maximized.
 ELBO consists of two terms: Expected LogLikelihood of the data and KL divergence between q(zx) and p(z).
 Expected LogLikelihood is negative crossentropy between original data and recontructed data.
 "Expected LogLikelihood encourages the decoder to learn to reconstruct the data. If the decoder’s output does not reconstruct the data well, it will incur a large cost in this loss function".
 If the input and output have Bernoulli distribution, Expected LogLikelihood can be calculated like this:
 "KL divergence measures how much information is lost (in units of nats) when using q to represent p. It is one measure of how close q is to p".
 KL divergence provides to compare 2 probability distributions.
 If the two probability distributions are exactly same (q=p), KL divergence equals to 0.
 If the two probability distributions are not same (q!=p), KL divergence > 0 .
 Cost function consists of two part: How the model's output is close to target and regularization.
 COST = (TARGETOUTPUT) PENALTYREGULARIZATION PENALTY == RECONSTRUCTION PENALTY  REGULARIZATION PENALTY
Generative Adversial Networks (GANs)
 GANs are interesting because it generates samples exceptionally good.
 GANs are used in different applications (details are summarized following sections).
 GANs are different form other generative models (Bayesian Classifier, Variational Autoencoders, Restricted Boltzmann Machines). GANs are not dealing with explicit probabilities, instead, its aim is to reach Nash Equilibrium of a game.
 RBM generate samples with Monte Carlo Sampling (thousands of iterations are needed to generate, and how many iterations are needed is not known). GANs generate samples with in single pass.
 There are 2 different networks in GANs: generator and discriminator, compete against each other.
 Generator Network tries to fool the discriminator.
 Disciminator Network tries not to be fooled.
 Paper: Ian J. Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative Adversarial Networks
GANs Cost Function
Discriminator Cost Function
 Generator and Discriminator try to optimize the opposite cost functions.
 Discriminator classifies images as a real or fake images with binary classification.
 t: target; y: output probability of the discriminator.
 Real image: t=1; fake image: t=0; y= p(image is real  image) between (0,1)
 Binary cost function evaluates discriminator cost function.
 x: real images only, x_hat: fake images only
 G(z): fake image from generator, x: real image
 Total cost: summing to individual negative loglikelihoods (batches of data in training)
 Estimate of the expected value over all possible data
 Discrimator Cost Function:
Generator Cost Function
 Relation between Generator and Discriminator Cost Functions:
 In game theory, this situation is called "zerosum game". Sum of all players in the game is 0.
 Theta_G: parameters of generator; Theta_D: parameters of discriminator
 Pseudocode of GANs:
 In the first step: Generator generates some of the real samples and fake samples.
 In the second step: Gradient descent of the discriminator is run one iteration.
 In the third step: Gradient descent of the generator is run one iteration.
 Repeat this until seeing the good samples.
 Some of the researches run third step twice to get better results.
DCGAN
 Paper: Radford, A., Metz, L., and Chintala, S., Unsupervised representation learning with deep convolutional generative adversarial networks
 DCGAN architecture produces high quality and high resolution images in a single pass.
 DCGANs contain batch normalization (batch norm: z=(xmean)/std, batch norm is used between layers).
 DCGANs also contain only allconvolutional layers instead of contaning convolution, pooling, linear layers together.
 It optimizes using ADAM optimizer (adaptive gradient desdent algorithm)
 Discriminator uses LeakyReLU (Rectified Linear Unit), generator uses normal ReLU.
 Typical Convolution: input size is bigger or equal than output size (Stride>1).
 Deconvolution: input size is smaller than output size (Stride<1).
FractionallyStrided Convolution
 Stride size is smaller than 1.
 If stride=2 is used while convolution operation, output image size is 1/2 original image size.
 If stride=1/2 is used while convolution operation, output image size is 2x original image size.
 With fractionallystrided convolution (deconvolution), output image size is bigger than input image size.
CycleGAN
 Unpaired ImagetoImage Translation using CycleConsistent Adversarial Networks
 Their algorithm translate an image from one to another:
 Transfer from Monet paintings to landscape photos from Flickr, and vice versa.
 Transfer from zebras to horses, and vice versa.
 Transfer from summer to winter photos, and vice versa.
Pix2Pix
 ImagetoImage Translation with Conditional Adversarial Networks
 Pix2Pix is an imagetoimage translation algorithm: aerials to map, labels to street scene, labels to facade, day to night, edges to photo.
PixelDTGAN
 PixelLevel Domain Transfer
 PixelDTGAN generates clothing images from an image.
 "The model transfers an input domain to a target domain in semantic level, and generates the target image in pixel level."
 "They verify their model through a challenging task of generating a piece of clothing from an input image of a dressed person"
PoseGuided
 Pose Guided Person Image Generation
 "This paper proposes the novel Pose Guided Person Generation Network (PG2 that allows to synthesize person images in arbitrary poses, based on an image of that person and a novel pose"
SRGAN
 PhotoRealistic Single Image SuperResolution Using a Generative Adversarial Network
 SRGAN: "a generative adversarial network (GAN) for image superresolution (SR)"
 They generate superresolution images from the lower resolution images.
StackGAN
 StackGAN: Text to Photorealistic Image Synthesis with Stacked Generative Adversarial Networks
 Stacked Generative Adversarial Networks (StackGAN): "to generate photorealistic images conditioned on text descriptions".
 Input: sentence, Output: multiple images fitting the description.
TPGAN
 Beyond Face Rotation: Global and Local Perception GAN for Photorealistic and Identity Preserving Frontal View Synthesis
 "Synthesis faces in different poses: With a single input image, they create faces in different viewing angles."
Anime Generation
 Towards the Automatic Anime Characters Creation with Generative Adversarial Networks
 "They explore the training of GAN models specialized on an anime facial image dataset."
 They build website for their implementation (https://make.girls.moe)
3DGAN
 Learning a Probabilistic Latent Space of Object Shapes via 3D GenerativeAdversarial Modeling
 This paper proposed creating 3D objects with GAN.
 "They demonstrated that their models are able to generate novel objects and to reconstruct 3D objects from images"
AgecGAN
 FACE AGING WITH CONDITIONAL GENERATIVE ADVERSARIAL NETWORKS
 They proposed the GANbased method for automatic face aging.
AnoGAN
 Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery
 "A deep convolutional generative adversarial network to learn a manifold of normal anatomical variability".
DiscoGAN
 Learning to Discover CrossDomain Relations with Generative Adversarial Networks
 Proposed method transfers style from one domain to another (e.g handbag > shoes)
 "DiscoGAN learns cross domain relationship without labels or pairing".
DTN
 Unsupervised CrossDomain Image Generation
 Proposed method is to create emoji from pictures.
 "They can synthesize an SVHN image that resembles a given MNIST image, or synthesize a face that matches an emoji."
IcGAN
 Invertible Conditional GANs for image editing
 "They evaluate encoders to inverse the mapping of a cGAN, i.e., mapping a real image into a latent space and a conditional representation".
 Proposed method is to reconstruct or edit images with specific attribute.
MGAN
 Precomputed RealTime Texture Synthesis with Markovian Generative Adversarial Networks
 "Markovian Generative Adversarial Networks (MGANs), a method for training generative neural networks for efficient texture synthesis."
 "They apply this idea to texture synthesis, style transfer, and video stylization."
MidiNet
 MIDINET: A CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORK FOR SYMBOLICDOMAIN MUSIC GENERATION
 "They propose a novel conditional mechanism to exploit available prior knowledge, so that the model can generate melodies either from scratch, by following a chord sequence, or by conditioning on the melody of previous bars"
 "MidiNet can be expanded to generate music with multiple MIDI channels"
PerceptualGAN
 Perceptual Generative Adversarial Networks for Small Object Detection
 "Proposed method improves small object detection through narrowing representation difference of small objects from the large ones"
AutoRegressive Models
 "The basic difference between Generative Adversarial Networks (GANs) and Autoregressive models is that GANs learn implicit data distribution whereas the latter learns an explicit distribution governed by a prior imposed by model structure" [Sharma].
 Some of the advantages of AutoRegressive Models over GANs:
 Provides a way to calculate likelihood: They have advantage of returning explicit probability densities. Hence it can be applied in the application areas related compression and probabilistic planning and exploration.
 The training is more stable than GANs:"Training a GAN requires finding the Nash equilibrium". Training of PixelRNN, PixelCNN are more stable than GANs.
 It works for both discrete and continuous data:"It’s hard to learn to generate discrete data for GAN, like text" [Sharma].
PixelRNN
Paper: Oord et al., Pixel Recurrent Neural Networks (proposed from Google DeepMind)
 It is used for image completion applications.
 "It uses probabilistic density models (like Gaussian or Normal distribution) to quantify the pixels of an image as a product of conditional distributions."
 "This approach turns the modeling problem into a sequence problem wherein the next pixel value is determined by all the previously generated pixel values".
 There are four different methods to implement PixelRNN:
 Row LSTM
 Diagonal BiLSTM
 Fully Convolutional Network
 Multi Scale Network.
 Cost function: "Negative log likelihood (NLL) is used as the loss and evaluation metric as the network predicts(classifies) the values of pixel from values 0–255."
PixelCNN
 Papers: Oord et al., Pixel Recurrent Neural Networks; Oord et al., Conditional Image Generation with PixelCNN Decoders (proposed from Google DeepMind).
 "The main drawback of PixelRNN is that training is very slow as each state needs to be computed sequentially. This can be overcome by using convolutional layers and increasing the receptive field."
 "PixelCNN lowers the training time considerably as compared to PixelRNN."
 "The major drawback of PixelCNN is that it’s performance is worse than PixelRNN. Another drawback is the presence of a Blind Spot in the receptive field"
PixelCNN++
 Paper: Salimans et al.,PIXELCNN++: IMPROVING THE PIXEL CNN WITH DISCRETIZED LOGISTIC MIXTURE LIKELIHOOD AND OTHER MODIFICATIONS
 PixelCNN++ improves the performance of PixelCNN (proposed from OpenAI)
 Modifications:
 Discretized logistic mixture likelihood
 Conditioning on whole pixels
 Downsampling
 Shortcut connections
 Dropout
 "PixelCNN++ outperforms both PixelRNN and PixelCNN by a margin. When trained on CIFAR10 the best test loglikelihood is 2.92 bits/pixel as compared to 3.0 of PixelRNN and 3.03 of gated PixelCNN."
 Details are in the paper PIXELCNN++: IMPROVING THE PIXEL CNN WITH DISCRETIZED LOGISTIC MIXTURE LIKELIHOOD AND OTHER MODIFICATIONS.
Generative Model in Reinforcement Learning
Generative Adversarial Imitation Learning
Paper: Jonathan Ho, Stefano Ermon, Generative Adversarial Imitation Learning
 "The standard reinforcement learning setting usually requires one to design a reward function that describes the desired behavior of the agent. However, in practice this can sometimes involve expensive trialanderror process to get the details right. In contrast, in imitation learning the agent learns from example demonstrations (for example provided by teleoperation in robotics), eliminating the need to design a reward function. This approach can be used to learn policies from expert demonstrations (without rewards) on hard OpenAI Gym environments, such as Ant and Humanoid." [Blog OpenAI].
 "Popular imitation approaches involve a twostage pipeline: first learning a reward function, then running RL on that reward. Such a pipeline can be slow, and because it’s indirect, it is hard to guarantee that the resulting policy works well. This work shows how one can directly extract policies from data via a connection to GANs" [Blog OpenAI].
Important Papers
 Ian J. Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative Adversarial Networks
 Radford, A., Metz, L., and Chintala, S., Unsupervised representation learning with deep convolutional generative adversarial networks
 Jonathan Ho, Stefano Ermon, Generative Adversarial Imitation Learning
 Ledig et al., PhotoRealistic Single Image SuperResolution Using a Generative Adversarial Network,
 Jiajun Wu et al., Learning a Probabilistic Latent Space of Object Shapes via 3D GenerativeAdversarial Modeling
 Jin et al., Towards the Automatic Anime Characters Creation with Generative Adversarial Networks
 Zhu et al., Unpaired ImagetoImage Translation using CycleConsistent Adversarial Networks
 Taigman et al., UNSUPERVISED CROSSDOMAIN IMAGE GENERATION
 MA et al., Pose Guided Person Image Generation
 Yoo et al., PixelLevel Domain Transfer
 Huang aet al., Beyond Face Rotation: Global and Local Perception GAN for Photorealistic and Identity Preserving Frontal View Synthesis
 Isola et al., ImagetoImage Translation with Conditional Adversarial Networks
 Wang et al., HighResolution Image Synthesis and Semantic Manipulation with Conditional GANs
 Zhang et al., StackGAN: Text to Photorealistic Image Synthesis with Stacked Generative Adversarial Networks
 Reed et al., Generative Adversarial Text to Image Synthesis
 Liu et al., Coupled Generative Adversarial Networks
 Kim et al., Learning to Discover CrossDomain Relations with Generative Adversarial Networks
 Li et al., Precomputed RealTime Texture Synthesis with Markovian Generative Adversarial Networks
 Perarnau et al., Invertible Conditional GANs for image editing
 Antipov et al., FACE AGING WITH CONDITIONAL GENERATIVE ADVERSARIAL NETWORKS
 Schlegl et al., Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery
 Yang et al., MIDINET: A CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORK FOR SYMBOLICDOMAIN MUSIC GENERATION
 Wu et al., GPGAN: Towards Realistic HighResolution Image Blending
 Li et al., Perceptual Generative Adversarial Networks for Small Object Detection
 Oord et al., Pixel Recurrent Neural Networks
 Oord et al., Conditional Image Generation with PixelCNN Decoders
 Salimans et al.,PIXELCNN++: IMPROVING THE PIXEL CNN WITH DISCRETIZED LOGISTIC MIXTURE LIKELIHOOD AND OTHER MODIFICATIONS
 Springenberg et al., Striving for Simplicity: The All Convolutional Net
Courses
References
 Blog OpenAI
 Sharma, PixelRNN/PixelCNN
 Udemy GANVAE: Deep Learning GANs and Variational Autoencoders
 GAN Applications
 https://jaan.io/whatisvariationalautoencodervaetutorial/