# Stanford CS236: Deep Generative Models I 2023 I Lecture 5 - VAEs

## Latent Variable Models

- Latent variable models introduce unobserved random variables (Z) to capture unobserved factors of variation in the data.
- These models aim to model the joint probability distribution between observed variables (X) and latent variables (Z).
- Latent variable models offer advantages such as increased flexibility and the ability to extract latent features for representation learning.
- By conditioning on latent features, modeling the distribution of data points becomes easier as there is less variation to capture.
- Deep neural networks are used to model the conditional distribution of X given Z, with parameters depending on the latent variables through neural networks.
- The challenge lies in learning the parameters of the neural network since the latent variables are not observed during training.
- Z represents the unobserved factors of variation that shape the distribution of X.
- In this model, X is not autoregressively generated, and P(x|Z) is a simple Gaussian distribution.
- The parameters of the Gaussian distribution for P(x|Z) are determined through a potentially complex nonlinear relationship with respect to Z.
- The individual conditional distributions P(x|Z) are typically modeled with simple distributions like Gaussians, but other distributions can be used as well.
- The functions used to model P(x|Z) are the same for every Z, but different prior distributions can be used for Z.
- The motivation for modeling the data through P(x|Z) and a prior over Z, rather than modeling P(X) directly, is to make learning easier by clustering the data using the Z variables and to potentially gain insights into the latent factors of variation in the data.
- The number of latent variables is a hyperparameter: it is chosen before training rather than learned from the data.
- Sampling from the model is straightforward by first sampling Z from a Gaussian and then sampling X from the corresponding Gaussian defined by the mean and covariance predicted by the neural networks.
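
The two-step sampling procedure in the last bullet can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the layer sizes, the `tanh` hidden layer, the diagonal covariance, the standard normal prior, and the randomly initialised weights (standing in for trained parameters) are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

Z_DIM, HIDDEN, X_DIM = 2, 16, 5  # illustrative sizes, not from the lecture

# Randomly initialised weights stand in for trained decoder parameters.
W1 = rng.normal(scale=0.5, size=(Z_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W_mu = rng.normal(scale=0.5, size=(HIDDEN, X_DIM))
W_logvar = rng.normal(scale=0.1, size=(HIDDEN, X_DIM))


def decoder(z):
    """Map latent z to the mean and standard deviation of p(x | z)."""
    h = np.tanh(z @ W1 + b1)
    mu = h @ W_mu
    sigma = np.exp(0.5 * (h @ W_logvar))  # exp keeps sigma positive
    return mu, sigma


def sample(n):
    """Ancestral sampling: z ~ N(0, I), then x ~ N(mu(z), diag(sigma(z)^2))."""
    z = rng.standard_normal((n, Z_DIM))
    mu, sigma = decoder(z)
    x = mu + sigma * rng.standard_normal((n, X_DIM))
    return z, x


z, x = sample(4)
print(z.shape, x.shape)  # (4, 2) (4, 5)
```

Each conditional p(x | z) is a simple Gaussian, but because the mean and standard deviation depend on z through a nonlinear network, the marginal over x can be far from Gaussian.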

## Mixture of Gaussians

- The mixture of Gaussians is a simple example of a latent variable model where Z is a categorical random variable that determines the mixture component, and P(x|Z) is a Gaussian distribution with different means and covariances for each mixture component.
- Mixture models can be useful for clustering data and can provide a better fit to data compared to a single Gaussian distribution.
- Unsupervised learning aims to discover meaningful structures in data, but it's not always clear what constitutes a good structure or clustering.
- Mixture models, such as a mixture of Gaussians, can be used for unsupervised learning by identifying the mixture components and clustering data points accordingly.
- However, mixture models may not perform well on tasks like image classification unless the number of mixture components is extremely large.
- An example of using a generative model on MNIST data shows that it can achieve reasonable clustering, but the clustering is not perfect and there are limitations to what can be discovered.
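
A small sketch of how a mixture of Gaussians defines a density and a clustering. The one-dimensional setting, the two components, and the specific weights, means, and standard deviations are hypothetical values chosen for illustration; in practice they would be learned from data.

```python
import numpy as np

# Hypothetical 1-D mixture with K = 2 components (values chosen for illustration).
weights = np.array([0.3, 0.7])   # p(z = k)
means = np.array([-2.0, 3.0])    # lookup table: mean of component k
stds = np.array([0.5, 1.0])      # lookup table: std of component k


def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def joint(x):
    """p(x, z = k) for every component k; shape (len(x), K)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return weights * gaussian_pdf(x[:, None], means, stds)


def marginal(x):
    """p(x) = sum_k p(z = k) p(x | z = k)."""
    return joint(x).sum(axis=1)


def responsibilities(x):
    """Posterior p(z = k | x), used to assign points to clusters."""
    j = joint(x)
    return j / j.sum(axis=1, keepdims=True)


xs = np.array([-2.1, 0.0, 3.2])
print(marginal(xs))
print(responsibilities(xs).argmax(axis=1))  # hard cluster assignments
```

Clustering here means assigning each point to the component with the highest posterior probability; whether those clusters correspond to anything meaningful, such as digit classes, is not guaranteed.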

## Variational Autoencoders (VAEs)

- Variational autoencoders (VAEs) are a powerful way of combining simple models to create more expressive generative models.
- VAEs use a continuous latent variable Z, which can take an infinite number of values, instead of a finite number of mixture components.
- The means and standard deviations of the Gaussian components in a VAE are determined by neural networks, providing more flexibility than a mixture of Gaussians with a lookup table.
- The sampling process in a VAE is similar to that of a mixture of Gaussians, involving sampling from the latent variable Z and then using neural networks to determine the parameters of the Gaussian distribution from which to sample.
- In a VAE, unlike a mixture of Gaussians (MoG), Z is continuous, allowing for smooth transitions between clusters.
- The mean of the latent representation Z can be interpreted as an average representation of all the data points.
- The prior distribution for Z does not have to be uniform; it can be any simple distribution that allows for efficient sampling.
- In a Gaussian mixture model (GMM), there is no neural network; instead, a lookup table is used to map the latent variable Z to the parameters of the Gaussian distribution.
- The marginal distribution over X in this infinite mixture of Gaussians is obtained by integrating over all possible values of Z; in a finite mixture, the integral reduces to a sum over the components.
- The dimensionality of Z is typically much lower than the dimensionality of X, allowing for dimensionality reduction.
- It is possible to incorporate more information into the prior distribution by using a more complex model, such as an autoregressive model.
- A mixture of Gaussians has K components, one for each value the categorical variable Z can take.
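
The contrast between the two models can be written side by side. The notation below is assumed for illustration rather than taken from the lecture: $\theta$ denotes the neural network parameters, $\mu_\theta$ and $\Sigma_\theta$ the networks that output the Gaussian parameters, and the prior is taken to be a standard normal.

$$
\begin{aligned}
\text{Mixture of Gaussians:}\quad & p(x) = \sum_{k=1}^{K} p(z = k)\, \mathcal{N}\!\left(x;\, \mu_k, \Sigma_k\right) \\
\text{VAE:}\quad & p(z) = \mathcal{N}(z;\, 0, I), \qquad p_\theta(x \mid z) = \mathcal{N}\!\left(x;\, \mu_\theta(z), \Sigma_\theta(z)\right) \\
& p_\theta(x) = \int p(z)\, p_\theta(x \mid z)\, dz
\end{aligned}
$$

In the mixture, $\mu_k$ and $\Sigma_k$ are entries in a lookup table indexed by $k$; in the VAE they are functions of a continuous $z$, so the sum over components becomes an integral.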

## Challenges in Learning Latent Variable Models

- The challenge in learning mixture models is that the latent variables are missing, requiring marginalization over all possible completions of the data.
- Evaluating the marginal probability over X requires integrating over all possible values of the latent variables Z, which can be intractable.
- If the Z variables can only take a finite number of values, the sum can be computed by brute force, but for continuous variables, an integral is required.
- Gradient computations are also expensive, making direct optimization challenging.
- One approach is to use Monte Carlo sampling to approximate the sum, by randomly sampling a small number of values for Z and using the sample average as an approximation.
- Sampling Z uniformly works because the sum over Z can be rewritten as the number of possible values of Z times an expectation under a uniform distribution over Z, and that expectation can be approximated by a sample average.
- However, this approach is not ideal because it does not take into account the actual distribution of Z.
- Uniformly sampling latent variables for completion is not effective because most completions have very low probability under the model, so the resulting estimate has high variance.
- A smarter way of selecting latent variables is needed to improve the quality of the estimate.
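
The uniform-sampling estimator described above can be compared against the brute-force sum on a model small enough to enumerate. Everything specific here is a hypothetical toy setup: ten binary latent variables, a linear "decoder" with random weights `w`, a unit-variance Gaussian likelihood, and a factorised prior with p(z_i = 1) = 0.2.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

D = 10                                  # number of binary latent variables
w = rng.normal(size=D)                  # hypothetical decoder weights
prior = np.full(D, 0.2)                 # p(z_i = 1), a non-uniform prior


def joint(x, z):
    """p(x, z) = p(z) * N(x; w.z, 1) for each row z of a binary array."""
    p_z = np.prod(np.where(z == 1, prior, 1.0 - prior), axis=1)
    mean = z @ w
    p_x_given_z = np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2.0 * np.pi)
    return p_z * p_x_given_z


def exact_marginal(x):
    """Brute-force sum over all 2^D completions; only feasible for small D."""
    all_z = np.array(list(itertools.product([0, 1], repeat=D)))
    return joint(x, all_z).sum()


def uniform_mc_marginal(x, k):
    """Estimate sum_z p(x, z) as |Z| times the average over k uniform samples."""
    z = rng.integers(0, 2, size=(k, D))
    return (2 ** D) * joint(x, z).mean()


x = 0.5
print("exact:", exact_marginal(x))
estimates = [uniform_mc_marginal(x, k=20) for _ in range(5)]
print("uniform MC, 20 samples each:", np.round(estimates, 4))
```

The estimator is unbiased, but with only a handful of samples the repeated estimates tend to scatter widely around the exact value, because most uniformly drawn completions contribute almost nothing to the sum.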

## Latent Variables and Disentangled Representations

- Latent variables (Z) are not necessarily meaningful features like hair or eye color, but they capture important factors of variation in the data.
- Z can be a vector of multiple values, allowing for the representation of multiple salient factors of variation.
- There are challenges in learning disentangled representations where latent variables have clear semantic meanings.
- If labels are available, semi-supervised learning can be used to steer latent variables towards desired semantic meanings.
- Importance sampling is used to sample latent variables more efficiently by focusing on important completions.
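
The idea behind importance sampling can be stated as an identity. The notation is assumed for illustration: $q$ is any proposal distribution that assigns nonzero probability wherever $p_\theta(x, z)$ is nonzero, and $k$ is the number of samples.

$$
p_\theta(x) = \sum_{z} p_\theta(x, z) = \sum_{z} q(z)\, \frac{p_\theta(x, z)}{q(z)} = \mathbb{E}_{z \sim q}\!\left[\frac{p_\theta(x, z)}{q(z)}\right] \approx \frac{1}{k} \sum_{j=1}^{k} \frac{p_\theta(x, z^{(j)})}{q(z^{(j)})}, \qquad z^{(j)} \sim q
$$

Uniform sampling is the special case where $q$ is uniform; a proposal that concentrates on completions likely to have produced $x$ gives an estimate with lower variance.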

## Importance Sampling and the Evidence Lower Bound (ELBO)

- The choice of the proposal distribution Q for importance sampling is crucial and can significantly affect the model's performance.
- The goal is to estimate the log marginal probability of a data point, which is intractable to compute directly.
- Importance sampling estimates the marginal probability P(X) by sampling Z from a proposal distribution Q(Z|X) and averaging the ratio P(X, Z) / Q(Z|X), that is, the joint probability under the model divided by the probability of the sample under the proposal.
- This estimator of P(X) is unbiased, but taking its logarithm does not give an unbiased estimate of log P(X); by Jensen's inequality, the expected log ratio is a lower bound on the log marginal probability.
- The evidence lower bound (ELBO) is a lower bound on the log marginal probability that can be optimized instead of the true log marginal probability.
- The choice of Q(Z|X) controls how tight the ELBO is, and a good choice of Q(Z|X) can make the ELBO a very good approximation to the true log marginal probability.
- It is easier to derive a lower bound on the log marginal probability than an upper bound.
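- The tightness of the bound can be quantified: the gap between the log marginal probability and the ELBO equals the KL divergence between Q(Z|X) and the model's posterior P(Z|X).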
- The bound becomes tight when Q is chosen to be the conditional distribution of Z given X under the model.
- This optimal Q distribution is not easy to evaluate, which is why other methods are needed.
- The optimal way of inferring the latent variables is to use the true posterior P(Z|X) under the model.
- Inverting the neural network to find the likely inputs that would produce a given X is generally hard.
- The machinery for training a VAE involves optimizing both P and Q.
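
The effect of Q on the bound can be checked numerically on a model where the marginal is known in closed form. The toy model below is an assumption made for this sketch, not the lecture's example: z ~ N(0, 1) and x | z ~ N(z, 1), which gives the marginal x ~ N(0, 2) and the posterior z | x ~ N(x/2, 1/2). The function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def log_normal(v, mean, var):
    """Log density of N(mean, var) evaluated at v."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mean) ** 2 / var)


def log_joint(x, z):
    """log p(x, z) with z ~ N(0, 1) and x | z ~ N(z, 1)."""
    return log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)


def log_weights(x, q_mean, q_var, k):
    """log of p(x, z) / q(z | x) for k samples z ~ q(z | x) = N(q_mean, q_var)."""
    z = q_mean + np.sqrt(q_var) * rng.standard_normal(k)
    return log_joint(x, z) - log_normal(z, q_mean, q_var)


def elbo(x, q_mean, q_var, k=10_000):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z | x)]."""
    return log_weights(x, q_mean, q_var, k).mean()


def is_log_marginal(x, q_mean, q_var, k=10_000):
    """Log of the importance-sampling estimate of p(x), computed stably."""
    lw = log_weights(x, q_mean, q_var, k)
    return np.log(np.mean(np.exp(lw - lw.max()))) + lw.max()


x = 1.5
true_log_px = log_normal(x, 0.0, 2.0)  # marginal is N(0, 2) for this model

print("log p(x):                 ", true_log_px)
print("ELBO, poor q = N(0, 1):   ", elbo(x, 0.0, 1.0))
print("ELBO, q = posterior:      ", elbo(x, x / 2.0, 0.5))
print("IS estimate, q = N(0, 1): ", is_log_marginal(x, 0.0, 1.0))
```

With Q set to the true posterior, every sampled ratio P(X, Z) / Q(Z|X) equals P(X), so the ELBO matches log P(X). With a poorer Q, the ELBO falls below log P(X) by the KL divergence between Q and the posterior. In a VAE the posterior has no closed form, which is why Q is learned jointly with P.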