Stanford CS236: Deep Generative Models | 2023 | Lecture 5 - VAEs

06 May 2024

Latent Variable Models

• Latent variable models introduce unobserved random variables (Z) to capture unobserved factors of variation in the data.
• These models aim to model the joint probability distribution between observed variables (X) and latent variables (Z).
• Latent variable models offer advantages such as increased flexibility and the ability to extract latent features for representation learning.
• By conditioning on latent features, modeling the distribution of data points becomes easier as there is less variation to capture.
• Deep neural networks are used to model the conditional distribution of X given Z, with parameters depending on the latent variables through neural networks.
• The challenge lies in learning the parameters of the neural network since the latent variables are not observed during training.
• The function of Z is to represent latent variables that affect the distribution of X.
• In this model, X is not autoregressively generated, and P(x|Z) is a simple Gaussian distribution.
• The parameters of the Gaussian distribution for P(x|Z) are determined through a potentially complex nonlinear relationship with respect to Z.
• The individual conditional distributions P(x|Z) are typically modeled with simple distributions like Gaussians, but other distributions can be used as well.
• The functions used to model P(x|Z) are the same for every Z, but different prior distributions can be used for Z.
• The motivation for modeling P(x|Z) and P(X) is to make learning easier by clustering the data using the Z variables and to potentially gain insights into the latent factors of variation in the data.
• The dimensionality of the latent variables is a hyperparameter chosen by the modeler (e.g., via validation), not something learned during training.
• Sampling from the model is straightforward by first sampling Z from a Gaussian and then sampling X from the corresponding Gaussian defined by the mean and covariance predicted by the neural networks.
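The two-step sampling procedure above can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the one-hidden-layer "decoder" network and its random weights are stand-ins for whatever neural network maps Z to the Gaussian parameters of P(x|Z).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-hidden-layer network mapping z to the mean and
# log-std of a Gaussian over x (weights are random stand-ins).
z_dim, h_dim, x_dim = 2, 16, 4
W1 = rng.normal(size=(h_dim, z_dim)); b1 = np.zeros(h_dim)
W_mu = rng.normal(size=(x_dim, h_dim)); b_mu = np.zeros(x_dim)
W_ls = rng.normal(size=(x_dim, h_dim)); b_ls = np.zeros(x_dim)

def decoder(z):
    h = np.tanh(W1 @ z + b1)
    mu = W_mu @ h + b_mu        # mean of p(x|z)
    log_std = W_ls @ h + b_ls   # log-std of p(x|z), diagonal covariance
    return mu, np.exp(log_std)

# Ancestral sampling: z ~ N(0, I), then x ~ N(mu(z), diag(sigma(z)^2)).
z = rng.normal(size=z_dim)
mu, std = decoder(z)
x = mu + std * rng.normal(size=x_dim)
print(x.shape)  # (4,)
```

Note that the networks are evaluated once per sample; the complexity of the marginal P(X) comes entirely from pushing the simple Gaussian prior through the nonlinear decoder.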

Mixture of Gaussians

• The mixture of Gaussians is a simple example of a latent variable model where Z is a categorical random variable that determines the mixture component, and P(x|Z) is a Gaussian distribution with different means and covariances for each mixture component.
• Mixture models can be useful for clustering data and can provide a better fit to data compared to a single Gaussian distribution.
• Unsupervised learning aims to discover meaningful structures in data, but it's not always clear what constitutes a good structure or clustering.
• Mixture models, such as a mixture of Gaussians, can be used for unsupervised learning by identifying the mixture components and clustering data points accordingly.
• However, mixture models may not perform well on tasks like image classification unless the number of mixture components is extremely large.
• An example of using a generative model on MNIST data shows that it can achieve reasonable clustering, but the clustering is not perfect and there are limitations to what can be discovered.
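A mixture of Gaussians with a handful of components is small enough to write out directly. The 1-D, three-component parameters below are illustrative, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D mixture of K=3 Gaussians (illustrative parameters).
weights = np.array([0.5, 0.3, 0.2])   # prior p(z = k)
means   = np.array([-2.0, 0.0, 3.0])
stds    = np.array([0.5, 1.0, 0.8])

def sample(n):
    # Ancestral sampling: pick a component z, then draw x from it.
    z = rng.choice(3, size=n, p=weights)
    return means[z] + stds[z] * rng.normal(size=n)

def density(x):
    # Marginal p(x) = sum_k p(z=k) * N(x; mu_k, sigma_k^2).
    comps = np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2) \
            / (stds * np.sqrt(2 * np.pi))
    return comps @ weights

xs = sample(10_000)
print(density(np.array([0.0, 3.0])))
```

Clustering a point amounts to computing the posterior p(z=k|x), proportional to each term in the `density` sum.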

Variational Autoencoders (VAEs)

• Variational autoencoders (VAEs) are a powerful way of combining simple models to create more expressive generative models.
• VAEs use a continuous latent variable Z, which can take an infinite number of values, instead of a finite number of mixture components.
• The means and standard deviations of the Gaussian components in a VAE are determined by neural networks, providing more flexibility than a mixture of Gaussians with a lookup table.
• The sampling process in a VAE is similar to that of a mixture of Gaussians, involving sampling from the latent variable Z and then using neural networks to determine the parameters of the Gaussian distribution from which to sample.
• Unlike in a mixture of Gaussians, Z in a VAE is continuous, allowing for smooth transitions between clusters.
• The mean of the latent representation Z can be interpreted as an average representation of all the data points.
• The prior distribution for Z does not have to be uniform; it can be any simple distribution that allows for efficient sampling.
• In a Gaussian mixture model (GMM), there is no neural network; instead, a lookup table is used to map the latent variable Z to the parameters of the Gaussian distribution.
• The marginal distribution over X in a mixture of Gaussians is obtained by integrating over all possible values of Z.
• The dimensionality of Z is typically much lower than the dimensionality of X, allowing for dimensionality reduction.
• It is possible to incorporate more information into the prior distribution by using a more complex model, such as an autoregressive model.
• In a mixture of Gaussians, the number of components K is a hyperparameter; it need not coincide with the number of underlying classes in the data.
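The contrast between the two marginals can be written out explicitly (notation assumed, following the lecture's setup):

```
% Mixture of K Gaussians: z is categorical, parameters come from a lookup table
p(x) = \sum_{k=1}^{K} p(z=k)\, \mathcal{N}\!\left(x;\ \mu_k,\ \Sigma_k\right)

% VAE: z is continuous, parameters come from neural networks \mu_\theta, \Sigma_\theta
p(x) = \int \mathcal{N}(z;\ 0,\ I)\, \mathcal{N}\!\left(x;\ \mu_\theta(z),\ \Sigma_\theta(z)\right) dz
```

The sum over K components becomes an integral over a continuum of components, which is what makes the VAE both more expressive and harder to evaluate.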

Challenges in Learning Latent Variable Models

• The challenge in learning mixture models is that the latent variables are missing, requiring marginalization over all possible completions of the data.
• Evaluating the marginal probability over X requires integrating over all possible values of the latent variables Z, which can be intractable.
• If the Z variables can only take a finite number of values, the sum can be computed by brute force, but for continuous variables, an integral is required.
• Gradient computations are also expensive, making direct optimization challenging.
• One approach is to use Monte Carlo sampling to approximate the sum, by randomly sampling a small number of values for Z and using the sample average as an approximation.
• Sampling Z uniformly works because the sum can be rewritten as an expectation with respect to a uniform distribution, which a sample average then approximates.
• However, this approach is not ideal because it does not take into account the actual distribution of Z.
• Uniformly sampling latent variables is ineffective because most sampled completions have very low probability under the model, so the resulting estimator has high variance.
• A smarter weighting of selecting latent variables is needed to improve the model's performance.
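The brute-force sum versus naive Monte Carlo contrast can be seen in a toy discrete model. The model below (uniform prior over K states, unit-variance Gaussian likelihoods) is an assumption for illustration, not the lecture's example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: z in {0, ..., K-1} with uniform prior, p(x|z) = N(x; mu_z, 1).
K = 1000
mus = rng.normal(scale=5.0, size=K)

def log_p_x_given_z(x, z):
    return -0.5 * (x - mus[z]) ** 2 - 0.5 * np.log(2 * np.pi)

def exact_log_px(x):
    # Brute-force sum over all K values of z (feasible because z is finite).
    return np.log(np.mean(np.exp(log_p_x_given_z(x, np.arange(K)))))

def naive_mc_log_px(x, n):
    # Uniformly sample a few z's; most draws miss the high-likelihood
    # completions, so this estimate has high variance across runs.
    z = rng.integers(0, K, size=n)
    return np.log(np.mean(np.exp(log_p_x_given_z(x, z))))

x = 0.0
print(exact_log_px(x), naive_mc_log_px(x, 10))
```

Rerunning `naive_mc_log_px` shows the estimate jumping around the exact value, which is exactly the variance problem that motivates importance sampling.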

Latent Variables and Disentangled Representations

• Latent variables (Z) are not necessarily meaningful features like hair or eye color, but they capture important factors of variation in the data.
• Z can be a vector of multiple values, allowing for the representation of multiple salient factors of variation.
• There are challenges in learning disentangled representations where latent variables have clear semantic meanings.
• If labels are available, semi-supervised learning can be used to steer latent variables towards desired semantic meanings.
• Importance sampling is used to sample latent variables more efficiently by focusing on important completions.

Importance Sampling and the Evidence Lower Bound (ELBO)

• The choice of the proposal distribution Q for importance sampling is crucial and can significantly affect the model's performance.
• The goal is to estimate the log marginal probability of a data point, which is intractable to compute directly.
• Importance sampling estimates the marginal probability by sampling Z from a proposal distribution Q(Z|X) and averaging the ratio of the joint probability P(X, Z) to Q(Z|X).
• This estimator of P(X) is unbiased, but taking its log introduces bias; Jensen's inequality shows that the expected log of the estimate is a lower bound on the log marginal probability.
• The evidence lower bound (ELBO) is a lower bound on the log marginal probability that can be optimized instead of the true log marginal probability.
• The choice of Q(Z|X) controls how tight the ELBO is, and a good choice of Q(Z|X) can make the ELBO a very good approximation to the true log marginal probability.
• It is easier to derive a lower bound on the log marginal probability than an upper bound.
• The tightness of the bound on the quantity of interest can be quantified.
• The bound becomes tight when Q is chosen to be the conditional distribution of Z given X under the model.
• This optimal Q distribution is not easy to evaluate, which is why other methods are needed.
• The optimal way of inferring the latent variables is to use the true posterior P(Z|X).
• Inverting the neural network to find the likely inputs that would produce a given X is generally hard.
• The machinery for training a VAE involves optimizing both P and Q.
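The ELBO and its tightness can be checked numerically in a small discrete model, where the true posterior is computable by brute force. The five-state model below is an assumed toy example:

```python
import numpy as np

# Toy discrete model: uniform prior p(z) and Gaussian likelihoods p(x|z)
# over K states, so the exact log marginal is a finite sum.
K = 5
p_z = np.full(K, 1.0 / K)
mus = np.linspace(-2.0, 2.0, K)

def p_x_given_z(x):
    return np.exp(-0.5 * (x - mus) ** 2) / np.sqrt(2 * np.pi)

x = 0.7
joint = p_z * p_x_given_z(x)     # p(x, z) for each value of z
log_px = np.log(joint.sum())     # exact log marginal log p(x)

def elbo(q):
    # ELBO = E_q[log p(x, z) - log q(z)]; a lower bound on log p(x).
    return np.sum(q * (np.log(joint) - np.log(q)))

q_uniform = np.full(K, 1.0 / K)
posterior = joint / joint.sum()  # true p(z|x): makes the bound tight

print(elbo(q_uniform) <= log_px)            # True
print(np.isclose(elbo(posterior), log_px))  # True
```

The second check is the key fact from the section above: when Q equals the true posterior P(Z|X), the ELBO equals the log marginal exactly, and any other Q gives a strictly looser bound.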