Stanford CS236: Deep Generative Models I 2023 I Lecture 16 - Score Based Diffusion Models

06 May 2024

Score-based models

Score-based models use a neural network to estimate the score of a data distribution, that is, the gradient of the log-density with respect to the input.
Denoising score matching is an efficient way to estimate the score of a noise-perturbed data distribution by training a model to denoise the data.
The continuous-time formulation of score-based models can be seen as the limit of an infinite number of noise levels.
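As a rough illustration of denoising score matching, the sketch below estimates the loss at a single noise level for one-dimensional data. It is a toy under stated assumptions, not the lecture's implementation: the data are drawn from a standard normal, the "network" is a hypothetical one-parameter model s(x) = slope * x, and the names dsm_loss and linear_score are made up for this example. For this choice the noise-perturbed density is N(0, 1 + sigma^2), so the best slope is known in closed form and can be compared against a poor guess.

```python
import random

def dsm_loss(score_fn, data, sigma, rng):
    """Monte Carlo estimate of the denoising score matching loss at one noise level."""
    total = 0.0
    for x in data:
        x_noisy = x + sigma * rng.gauss(0.0, 1.0)
        # Score of the Gaussian perturbation kernel N(x_noisy; x, sigma^2),
        # taken with respect to x_noisy. It points from x_noisy back toward x.
        target = -(x_noisy - x) / sigma**2
        total += (score_fn(x_noisy) - target) ** 2
    return total / len(data)

def linear_score(slope):
    """Hypothetical one-parameter score model s(x) = slope * x."""
    return lambda x: slope * x

sigma = 0.5
data_rng = random.Random(0)
data = [data_rng.gauss(0.0, 1.0) for _ in range(20000)]  # toy data: N(0, 1)

# The perturbed density is N(0, 1 + sigma^2), whose score is -x / (1 + sigma^2).
true_slope = -1.0 / (1.0 + sigma**2)

# Reuse the same noise seed so the two losses are directly comparable.
loss_true = dsm_loss(linear_score(true_slope), data, sigma, random.Random(1))
loss_zero = dsm_loss(linear_score(0.0), data, sigma, random.Random(1))
print(loss_true, loss_zero)
```

The loss never reaches zero even for the exact score, because the regression target depends on the clean sample; only the location of the minimum matters.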

Diffusion models

Diffusion models can be interpreted as a type of variational autoencoder, where the fixed noising process acts as the encoder and the learned denoising process, parameterized through the score function, acts as the decoder.
Diffusion models can be converted into ordinary differential equations (ODEs), allowing for exact likelihood computation and efficient sampling methods.
Controllable generation in diffusion models can be achieved by incorporating side information into the model.
Diffusion models are a type of generative model that can generate realistic-looking images.
They work by gradually adding noise to an image until it becomes completely random, and then gradually removing the noise to generate a new image.
The training objective of a diffusion model is to maximize the evidence lower bound (ELBO), a lower bound on the log-likelihood of the data, as in a variational autoencoder.
The encoder in a diffusion model is fixed and simply adds noise to the image, while the decoder is a neural network that learns to remove the noise.
The loss function for a diffusion model is equivalent, up to a weighting of the noise levels, to the denoising score matching loss, which means that the model is learning to estimate the scores of the noise-perturbed data distributions.
The sampling procedure for a diffusion model is similar to the Langevin dynamics used in score-based models, but with different scalings of the noise.
Traditional diffusion models use a discrete number of steps to add noise, but a continuous-time diffusion process can be described using a stochastic differential equation.
The reverse process of going from noise to data can also be described using a stochastic differential equation, and the solution to this equation can be used to generate data.
The score function is a key component of the stochastic differential equation, and it can be estimated using score matching.
In practice, continuous-time diffusion models can be implemented by discretizing the stochastic differential equation and using numerical solvers to solve it.
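The discretization mentioned above can be sketched with an Euler-Maruyama solver on a one-dimensional toy problem. Everything specific here is an assumption chosen so the score is known exactly and no network is needed: a forward SDE dx = -0.5 * BETA * x dt + sqrt(BETA) dw with constant BETA, Gaussian data N(DATA_MEAN, DATA_STD^2), and the helper names marginal_mean_var, score, and reverse_sde_sample. In a real model the score function would be a trained network.

```python
import math
import random

BETA = 10.0                      # constant noise rate (assumed for this toy)
T_END = 1.0                      # final time of the forward process
DATA_MEAN, DATA_STD = 3.0, 0.5   # toy Gaussian data distribution

def marginal_mean_var(t):
    """Mean and variance of the noised density p_t for Gaussian data."""
    decay = math.exp(-BETA * t)
    mean = DATA_MEAN * math.sqrt(decay)
    var = DATA_STD**2 * decay + (1.0 - decay)
    return mean, var

def score(x, t):
    """Exact score of p_t; stands in for a learned score network."""
    mean, var = marginal_mean_var(t)
    return -(x - mean) / var

def reverse_sde_sample(rng, num_steps=400):
    """Euler-Maruyama discretization of the reverse-time SDE, from t = T_END to 0."""
    dt = T_END / num_steps
    x = rng.gauss(0.0, 1.0)  # start from the standard normal prior
    for k in range(num_steps):
        t = T_END - k * dt
        # Reverse drift: f(x, t) - g(t)^2 * score(x, t)
        drift = -0.5 * BETA * x - BETA * score(x, t)
        x = x - drift * dt + math.sqrt(BETA * dt) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [reverse_sde_sample(rng) for _ in range(1000)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((x - sample_mean) ** 2 for x in samples) / len(samples))
print(sample_mean, sample_std)  # should land close to DATA_MEAN and DATA_STD
```

Fewer steps make sampling cheaper but increase the discretization error.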
Score-based models attempt to correct the numerical errors of this discretization by running Langevin dynamics at a fixed time step.
DDPM is a predictor-type discretization of the underlying stochastic differential equation, while the Langevin steps of score-based models are corrector-type.
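A corrector step of this kind can be sketched as plain Langevin dynamics at one fixed noise level. The example assumes a toy target density N(TARGET_MEAN, TARGET_STD^2) whose score is known exactly, and starts from deliberately wrong samples to mimic the error left by a predictor step; the function name langevin_correct, the step size, and the number of steps are illustrative choices, not values from the lecture.

```python
import math
import random

TARGET_MEAN, TARGET_STD = 3.0, 0.5  # toy density at the current noise level

def score(x):
    """Exact score of the toy target; stands in for a learned score network."""
    return -(x - TARGET_MEAN) / TARGET_STD**2

def langevin_correct(x, rng, step_size=0.01, num_steps=200):
    """Run Langevin dynamics at a fixed noise level (time is not advanced)."""
    for _ in range(num_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + step_size * score(x) + math.sqrt(2.0 * step_size) * noise
    return x

rng = random.Random(0)
# Samples with the wrong mean and spread, as if produced by an inexact predictor.
biased = [rng.gauss(2.0, 1.0) for _ in range(2000)]
corrected = [langevin_correct(x, rng) for x in biased]

corrected_mean = sum(corrected) / len(corrected)
corrected_std = math.sqrt(
    sum((x - corrected_mean) ** 2 for x in corrected) / len(corrected)
)
print(corrected_mean, corrected_std)  # should move close to the target values
```

The extra score evaluations are the additional computation that the corrector costs.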
The denoising diffusion implicit model (DDIM) converts the stochastic differential equation into an ordinary differential equation with the same marginals at every time step.
DDIM has two advantages: sampling can be more efficient, and the model can be treated as a flow model with exact likelihood evaluation.
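The ODE conversion can be illustrated on a toy problem by integrating the probability flow ODE, dx/dt = f(x, t) - 0.5 * g(t)^2 * score(x, t), backward in time with plain Euler steps. This is a sketch of the general idea, not the specific update rule of any published sampler. The forward SDE with constant BETA, the Gaussian data, and the helper names are all assumptions made so the score is available in closed form.

```python
import math

BETA = 10.0                      # constant noise rate (assumed for this toy)
T_END = 1.0
DATA_MEAN, DATA_STD = 3.0, 0.5   # toy Gaussian data distribution

def marginal_mean_var(t):
    """Mean and variance of the noised density p_t for Gaussian data."""
    decay = math.exp(-BETA * t)
    return DATA_MEAN * math.sqrt(decay), DATA_STD**2 * decay + (1.0 - decay)

def score(x, t):
    """Exact score of p_t; stands in for a learned score network."""
    mean, var = marginal_mean_var(t)
    return -(x - mean) / var

def probability_flow_sample(x_T, num_steps=1000):
    """Deterministically map a prior sample x_T to a data sample with Euler steps."""
    dt = T_END / num_steps
    x = x_T
    for k in range(num_steps):
        t = T_END - k * dt
        # Same drift as the reverse SDE but with half the score term and no noise.
        velocity = -0.5 * BETA * x - 0.5 * BETA * score(x, t)
        x = x - velocity * dt
    return x

x_a = probability_flow_sample(0.0)
x_b = probability_flow_sample(1.0)
print(x_a, x_b)
```

Because no noise is injected, the same starting point always yields the same output, which is what makes the mapping invertible and usable as a flow. In this Gaussian toy the map is linear: a prior sample at zero lands near the data mean, and one prior standard deviation maps to roughly one data standard deviation.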

Noise Conditional Score Network (NCSN)

The Noise Conditional Score Network (NCSN) estimates the scores of noise-perturbed data distributions at several noise levels, and samples are generated by iteratively reducing the amount of noise in the sample.
The inverse process, which generates the training samples for the denoising score matching loss, adds noise to the data at every step until pure noise is reached.
The process of going from data to noise can be seen as a Markov process where noise is added incrementally, and the joint distribution over the random variables is defined as the product of conditional densities.
The encoder in NCSN is a simple procedure that maps the original data point to a vector of latent variables by adding noise to it.
Because every step adds Gaussian noise, the marginal distribution of any noisy variable given the original data point is also Gaussian and can be computed in closed form.
The noising process can therefore generate a sample at a specific time step directly, without simulating the whole chain, which keeps training computationally efficient.
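The closed-form shortcut can be checked numerically. The sketch below assumes the DDPM-style variance-preserving kernel x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise and a linear beta schedule from 1e-4 to 0.02 over 1000 steps; neither is specified in these notes, and all function names are hypothetical. Under that kernel the marginal given x_0 is Gaussian with mean sqrt(alpha_bar_t) * x_0 and variance 1 - alpha_bar_t, where alpha_bar_t is the running product of (1 - beta).

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise rates (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + step * i for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar[t - 1] is the product of (1 - beta) over the first t steps."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def simulate_chain(x0, betas, t, rng):
    """Add noise one step at a time for t steps."""
    x = x0
    for beta in betas[:t]:
        x = math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
    return x

def sample_closed_form(x0, alpha_bar_t, rng):
    """Jump straight to step t using the Gaussian marginal."""
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * rng.gauss(0.0, 1.0)

def mean_and_var(xs):
    mean = sum(xs) / len(xs)
    return mean, sum((x - mean) ** 2 for x in xs) / len(xs)

betas = linear_betas(1000)
alpha_bar = cumulative_alphas(betas)
t, x0 = 200, 2.0
rng = random.Random(0)
chain = [simulate_chain(x0, betas, t, rng) for _ in range(2000)]
direct = [sample_closed_form(x0, alpha_bar[t - 1], rng) for _ in range(2000)]
chain_mean, chain_var = mean_and_var(chain)
direct_mean, direct_var = mean_and_var(direct)
print(chain_mean, chain_var, direct_mean, direct_var)
```

The two sets of statistics should agree up to sampling error, while the direct version needs one Gaussian draw per sample instead of t.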
The diffusion process in NCSN is analogous to heat diffusion, where probability mass is spread out over the entire space.
To invert the NCSN process during inference, several conditions need to be met, including the ability to smooth out the structure of the data distribution to facilitate sampling.
The goal is to learn a probabilistic model that can generate data by inverting a process that destroys structure and adds noise to the data.
The process of adding noise is defined by a transition kernel that spreads out the probability mass in a controllable way, such as Gaussian noise.
The key idea is to learn an approximation of the reverse kernel that removes noise from a sample, which can be done variationally through a neural network.
The generative distribution is defined by sampling from a simple prior and then sampling from the conditional distributions of the remaining variables one at a time, going from right to left.
The parameters of the noising process are chosen such that the final noisy samples have a low signal-to-noise ratio, essentially reaching a steady state of pure noise that the simple prior can match.
Alternatively, Langevin dynamics can be used to generate samples by correcting the mistakes made in the vanilla procedure, which requires more computation.

Training diffusion models

Fixing the encoder to be a simple noise-adding function simplifies the training process.
The lambda (λ) parameters weight the terms of the loss, controlling the importance of different noise levels.
The beta (β) parameters control how quickly noise is added.
The architecture is similar to a noise-conditional score model, with a single decoder amortized across different noise levels.
Training is efficient because the objective breaks down into a sum of per-noise-level terms, each of which can be evaluated on its own without simulating the whole chain.
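One way to picture a single evaluation of such an objective is the sketch below. It assumes the common noise-prediction parameterization, a linear beta schedule, uniform sampling of the time step, and unit weights unless a weights list (playing the role of the lambda parameters) is passed in; none of these specifics come from the notes. To avoid a neural network, the data are taken to be N(0, 1), for which the best possible noise predictor is known in closed form and can be compared with a predictor that always outputs zero. All names are hypothetical.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise rates (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + step * i for i in range(num_steps)]

def cumulative_alphas(betas):
    """Running product of (1 - beta), one entry per time step."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def training_loss(eps_model, data, alpha_bar, rng, weights=None):
    """Monte Carlo estimate of the weighted noise-prediction loss."""
    total = 0.0
    for x0 in data:
        t = rng.randrange(len(alpha_bar))  # pick one noise level at random
        eps = rng.gauss(0.0, 1.0)
        # Noise the data in closed form; no chain simulation is needed.
        x_t = math.sqrt(alpha_bar[t]) * x0 + math.sqrt(1.0 - alpha_bar[t]) * eps
        weight = 1.0 if weights is None else weights[t]
        total += weight * (eps_model(x_t, t) - eps) ** 2
    return total / len(data)

alpha_bar = cumulative_alphas(linear_betas(1000))

def optimal_eps(x_t, t):
    """Best noise predictor when the data are N(0, 1): E[eps | x_t]."""
    return math.sqrt(1.0 - alpha_bar[t]) * x_t

def zero_eps(x_t, t):
    """Baseline that ignores its input."""
    return 0.0

data_rng = random.Random(0)
data = [data_rng.gauss(0.0, 1.0) for _ in range(20000)]
loss_opt = training_loss(optimal_eps, data, alpha_bar, random.Random(1))
loss_zero = training_loss(zero_eps, data, alpha_bar, random.Random(1))
print(loss_opt, loss_zero)
```

In an actual model the single predictor is a network shared across all time steps, and each minibatch touches only a random subset of noise levels.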