Stanford CS236: Deep Generative Models I 2023 I Lecture 12 - Energy Based Models

06 May 2024

Energy-Based Models (EBMs)

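EBMs define a probability distribution over data through an energy function: the density is proportional to the exponential of the negative energy, and the normalizing constant (the partition function) is generally intractable.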
Sampling from EBMs is challenging, especially in high dimensions, making training computationally expensive.
This motivates alternative training methods for EBMs that do not require sampling from the model during training.
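
In symbols, the model described above can be written as follows. The notation is an assumption made here for illustration (E_θ for the energy, Z(θ) for the partition function); the lecture may use the opposite sign convention, writing the density as proportional to exp(f_θ(x)).

```latex
p_\theta(x) = \frac{\exp\bigl(-E_\theta(x)\bigr)}{Z(\theta)},
\qquad
Z(\theta) = \int \exp\bigl(-E_\theta(x)\bigr)\,dx,
\qquad
\nabla_\theta \log Z(\theta) = -\,\mathbb{E}_{x \sim p_\theta}\bigl[\nabla_\theta E_\theta(x)\bigr]
```

The last identity shows why maximum-likelihood training needs samples from the model: the gradient of log Z(θ) is an expectation under p_θ.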

The score function, the gradient of the log-density with respect to the input, provides an alternative view of a distribution: it describes the density through its gradients rather than its likelihood values, and it does not depend on the partition function.
The Fisher divergence between two probability densities can be used as a loss function for training EBMs.
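The Fisher divergence is the expected squared norm, under the data distribution, of the difference between the gradients of the log data density and the log model density.
Although the score of the data distribution is unknown, integration by parts rewrites the divergence (up to a constant) in terms of the model score and its derivatives only, which gives a loss that can be evaluated on data samples and optimized as a function of the model parameters.
The loss encourages the data points to be local maxima of the model's log-density (small gradient and negative curvature at each data point), which pushes the model toward a good fit to the data.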
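
The score matching loss described above can be sketched in a few lines for a toy case. This is a minimal illustration, not the lecture's code: it assumes a 1-D Gaussian-shaped model, log p_θ(x) = -(x - mu)^2 / (2 * sigma2) - log Z(θ), for which the score and its derivative are available in closed form. The function names and the toy data are hypothetical.

```python
import numpy as np


def score_matching_loss(x, mu, sigma2):
    """Empirical score matching loss: mean of 0.5 * score^2 + d(score)/dx.

    Model (assumed for illustration):
        log p_theta(x) = -(x - mu)^2 / (2 * sigma2) - log Z(theta)
    log Z(theta) does not depend on x, so it never appears in the loss.
    """
    score = -(x - mu) / sigma2      # d/dx log p_theta(x)
    score_grad = -1.0 / sigma2      # d^2/dx^2 log p_theta(x), constant in x
    return float(np.mean(0.5 * score**2 + score_grad))


def fit_gaussian_by_score_matching(x):
    """Closed-form minimizer of the loss above for this toy model.

    Setting the derivatives of the loss to zero gives the sample mean
    and the (biased) sample variance.
    """
    mu = float(np.mean(x))
    sigma2 = float(np.mean((x - mu) ** 2))
    return mu, sigma2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=0.5, size=10_000)  # toy "data"
    mu_hat, sigma2_hat = fit_gaussian_by_score_matching(x)
    print("mu:", mu_hat, "sigma2:", sigma2_hat)
    print("loss at minimizer:", score_matching_loss(x, mu_hat, sigma2_hat))
    print("loss at shifted mu:", score_matching_loss(x, mu_hat + 0.3, sigma2_hat))
```

The first term of the loss pulls the model score toward zero at the data points, and the second term rewards negative curvature there, which matches the "local maxima" reading of the loss. For a neural-network energy, both terms would come from automatic differentiation rather than closed forms.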

An alternative training method, noise contrastive estimation (NCE), contrasts data with samples from a noise distribution rather than with samples from the model itself.
When the discriminator is parameterized in terms of an energy-based model, the optimal discriminator is attained only when the energy-based model matches the data distribution.
Contrastive learning with EBMs involves distinguishing between real data and fake samples generated from a fixed noise distribution.
The noise distribution should be close to the data distribution for effective learning.
Sampling from the model is not needed during training; after training, the learned energy function is used directly as an energy-based model.
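
The contrastive procedure above can be sketched on a toy 1-D problem. This is a minimal illustration under stated assumptions, not the lecture's implementation: the model is log p_θ(x) = -(x - mu)^2 / 2 - c, where c is treated as a learned log-partition parameter, the noise distribution is a fixed Gaussian whose density can be evaluated, and the numbers of data and noise samples are equal. All function names and hyperparameters are hypothetical.

```python
import numpy as np


def log_noise_density(x, scale):
    """Log-density of the fixed noise distribution N(0, scale^2)."""
    return -0.5 * (x / scale) ** 2 - np.log(scale) - 0.5 * np.log(2.0 * np.pi)


def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))


def fit_nce(x_data, x_noise, noise_scale, steps=3000, lr=0.2):
    """Fit mu and c in log p_theta(x) = -(x - mu)^2 / 2 - c by NCE.

    The discriminator is D(x) = sigmoid(log p_theta(x) - log q(x)),
    i.e. p_theta / (p_theta + q), and is trained with binary
    cross-entropy by plain gradient descent.
    """
    mu, c = 0.0, 0.0
    for _ in range(steps):
        # Discriminator logits on data and on noise samples.
        h_data = -0.5 * (x_data - mu) ** 2 - c - log_noise_density(x_data, noise_scale)
        h_noise = -0.5 * (x_noise - mu) ** 2 - c - log_noise_density(x_noise, noise_scale)
        d_data, d_noise = sigmoid(h_data), sigmoid(h_noise)
        # Gradients of the loss
        #   -mean(log D(x_data)) - mean(log(1 - D(x_noise)))
        # using d(logit)/d(mu) = (x - mu) and d(logit)/d(c) = -1.
        grad_mu = np.mean(-(1.0 - d_data) * (x_data - mu)) + np.mean(d_noise * (x_noise - mu))
        grad_c = np.mean(1.0 - d_data) - np.mean(d_noise)
        mu -= lr * grad_mu
        c -= lr * grad_c
    return float(mu), float(c)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_data = rng.normal(1.0, 1.0, size=5000)    # toy data from N(1, 1)
    x_noise = rng.normal(0.0, 2.0, size=5000)   # samples from the noise N(0, 2^2)
    mu_hat, c_hat = fit_nce(x_data, x_noise, noise_scale=2.0)
    print("mu:", mu_hat)
    print("learned c:", c_hat, "true log Z:", 0.5 * np.log(2.0 * np.pi))
```

No sample from the model is drawn at any point: training only needs data, noise samples, and the noise log-density. In this toy setting the learned c should approach the true log-partition function 0.5 * log(2π), since the data really is a unit-variance Gaussian.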

NCE is similar to a GAN in that it uses a binary cross-entropy loss and is a likelihood-free method.
Unlike a GAN, NCE does not involve a minimax optimization and is more stable to train.
NCE requires the ability to evaluate the density of the noise distribution on the contrastive samples, while a GAN only requires the ability to sample from the generator.
In NCE, the discriminator is trained to distinguish between real data and noise samples, and the energy function derived from the discriminator defines an energy-based model.
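
The comparison above can be made concrete by writing the two objectives side by side. The symbols are assumptions chosen here for illustration: p_n is the fixed noise density, p_θ is the energy-based model (including its learned normalization), and G_θ and D_φ are the GAN generator and discriminator.

```latex
% NCE: a single maximization over the EBM parameters \theta
D_\theta(x) = \frac{p_\theta(x)}{p_\theta(x) + p_n(x)}, \qquad
\max_\theta \; \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D_\theta(x)\bigr]
            + \mathbb{E}_{x \sim p_n}\bigl[\log\bigl(1 - D_\theta(x)\bigr)\bigr]

% GAN: a minimax game between generator G_\theta and discriminator D_\phi
\min_\theta \max_\phi \; \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D_\phi(x)\bigr]
            + \mathbb{E}_{z \sim p(z)}\bigl[\log\bigl(1 - D_\phi(G_\theta(z))\bigr)\bigr]
```

The NCE discriminator needs p_n(x) in closed form, which is why the noise density must be evaluable; the GAN objective only ever uses samples G_θ(z).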

Flow contrastive estimation (FCE) is a variant of NCE where the noise distribution is defined by a normalizing flow model.
The flow model is trained adversarially to confuse the discriminator, making the classification problem harder and the noise distribution closer to the data distribution.
FCE provides both an energy-based model and a flow model, with the choice of which to use depending on the specific task.
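
One way to write the resulting objective is sketched below. The notation is an assumption made here for illustration: p_θ is the energy-based model and q_α is the flow model, which can be both sampled from and evaluated exactly.

```latex
\min_\alpha \max_\theta \; V(\theta, \alpha)
  = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \frac{p_\theta(x)}{p_\theta(x) + q_\alpha(x)}\right]
  + \mathbb{E}_{x \sim q_\alpha}\!\left[\log \frac{q_\alpha(x)}{p_\theta(x) + q_\alpha(x)}\right]
```

With α held fixed this is the NCE objective with q_α as the noise distribution; the outer minimization over α is what moves the flow toward the data and makes the classification problem harder.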