Stanford CS236: Deep Generative Models I 2023 I Lecture 12 - Energy Based Models

Energy-Based Models (EBMs)

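  • EBMs represent the probability distribution of the data through an energy function: the density is proportional to the exponential of the negative energy, so it is known only up to a normalizing constant (the partition function).
  • Sampling from EBMs is challenging, especially in high dimensions, and likelihood-based training needs samples from the model at every step, which makes training computationally expensive.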
  • Alternative training methods for EBMs are needed that do not require sampling during training.
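
To make the role of the normalizing constant concrete, here is a minimal sketch in one dimension. It assumes the convention p(x) = exp(-E(x)) / Z and uses a hypothetical double-well energy chosen only for illustration; neither is taken from the lecture.

```python
import numpy as np

def energy(x):
    # Hypothetical double-well energy, chosen only for illustration.
    return (x ** 2 - 1.0) ** 2

# The unnormalized density is exp(-E(x)); turning it into a density requires
# the partition function Z = integral of exp(-E(x)) dx.
grid = np.linspace(-4.0, 4.0, 8001)
dx = grid[1] - grid[0]
unnormalized = np.exp(-energy(grid))
Z = np.sum(unnormalized) * dx      # Riemann sum, feasible only because x is 1-D
density = unnormalized / Z

# Ratios of densities never need Z: it cancels between numerator and denominator.
ratio = np.exp(-(energy(1.0) - energy(0.0)))
```

The same grid approach with k points per axis would need k^d energy evaluations in d dimensions, which is why Z is treated as intractable for realistic data.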

Score Matching

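  • The score function, the gradient of the log-density with respect to the input, provides an alternative view of a distribution: it describes the density through its gradient rather than through the likelihood itself, and it does not depend on the normalizing constant.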
  • The Fisher divergence between two probability densities can be used as a loss function for training EBMs.
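  • The Fisher divergence is the expected squared norm, under the data distribution, of the difference between the gradient of the log data density and the gradient of the log model density.
  • The data score is unknown, but integration by parts removes it, leaving a loss that depends only on the model's score and its derivatives and that can be evaluated on data samples and optimized as a function of the model parameters.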
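  • The loss function encourages the data points to be local maxima of the model's log-density: one term pushes the gradient at each data point toward zero and the other pushes the curvature there to be negative, so the model places its probability mass on the data.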
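
As an illustration of the points above, the following sketch evaluates the score matching loss for a one-dimensional Gaussian model. In one dimension the loss is the data average of 0.5 * s(x)^2 + ds/dx, where s is the model score. The Gaussian model, the synthetic data, and all names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=5000)   # synthetic stand-in for real data

def score_matching_loss(mu, sigma2, x):
    # Model: log p(x) = -(x - mu)^2 / (2 * sigma2) + const, so the constant
    # (the log partition function) never enters the loss.
    score = -(x - mu) / sigma2        # d/dx log p(x)
    dscore = -1.0 / sigma2            # d/dx of the score (Hessian trace in 1-D)
    return np.mean(0.5 * score ** 2 + dscore)

# For this model the minimizer is available in closed form:
# the sample mean and the (biased) sample variance.
mu_hat = data.mean()
sigma2_hat = data.var()
best = score_matching_loss(mu_hat, sigma2_hat, data)
```

Note that the loss uses only data samples and derivatives of the model's log-density; it never samples from the model and never computes the partition function.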

Contrastive Learning

  • An alternative training method for EBMs involves contrasting data to samples from a noise distribution rather than directly to samples from the model.
  • By parameterizing the discriminator in terms of an energy-based model, the optimal discriminator will force the energy-based model to match the data distribution.
  • Contrastive learning with EBMs involves distinguishing between real data and fake samples generated from a fixed noise distribution.
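  • The noise distribution should be close to the data distribution for effective learning: if the two are easy to tell apart, the classification task is trivial and the model learns little about the data density.
  • The noise distribution and the discriminator are needed only during training; afterwards the trained model can be used directly as an energy-based model.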
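
The claim that an optimal discriminator forces the model to match the data can be checked numerically. The sketch below assumes equal numbers of real and noise samples and uses two known Gaussians as stand-ins for the data and noise densities; in practice the data density is of course unknown.

```python
import numpy as np

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

x = np.linspace(-3.0, 5.0, 9)
p_data = normal_pdf(x, 1.0, 1.0)     # stand-in data density, known here only for checking
p_noise = normal_pdf(x, 0.0, 2.0)    # fixed noise density

# Bayes-optimal classifier between data and noise with equal class priors.
d_star = p_data / (p_data + p_noise)

# If the discriminator is parameterized as p_theta / (p_theta + p_noise),
# then matching d_star pins down p_theta through the density-ratio identity:
p_theta = p_noise * d_star / (1.0 - d_star)
```

Here `p_theta` recovers `p_data` at every grid point, which is the sense in which the optimal discriminator "forces" the energy-based model to match the data distribution.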

Noise Contrastive Estimation (NCE)

  • NCE is similar to a GAN in that it uses a binary cross-entropy loss and does not require evaluating the model's normalized likelihood.
  • Unlike a GAN, NCE does not involve a minimax optimization and is more stable to train.
  • NCE requires the ability to evaluate the density of the noise distribution on the contrastive samples, while a GAN only requires the ability to sample from the generator.
  • In NCE, the discriminator is trained to distinguish between real samples and noise samples, and the energy function derived from the discriminator defines an energy-based model.
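
A minimal NCE training loop might look like the sketch below. It assumes a one-dimensional model with log-density -(x - mu)^2 / 2 - log_z, where the log partition function `log_z` is treated as an extra learnable parameter, a Gaussian noise distribution, and plain full-batch gradient descent. All names and hyperparameters are illustrative choices, not the lecture's.

```python
import numpy as np

def log_normal_pdf(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logit(x, mu, log_z):
    # log p_theta(x) - log p_noise(x); the discriminator is sigmoid(logit).
    return -0.5 * (x - mu) ** 2 - log_z - log_normal_pdf(x, 0.0, 2.0)

rng = np.random.default_rng(0)
x_data = rng.normal(1.0, 1.0, size=5000)     # synthetic "real" data
x_noise = rng.normal(0.0, 2.0, size=5000)    # samples from the fixed noise distribution

mu, log_z, lr = 0.0, 0.0, 0.2
for _ in range(3000):
    d_data = sigmoid(logit(x_data, mu, log_z))
    d_noise = sigmoid(logit(x_noise, mu, log_z))
    # Gradients of the binary cross-entropy loss
    # -mean(log D(x_data)) - mean(log(1 - D(x_noise))).
    grad_mu = np.mean(-(1.0 - d_data) * (x_data - mu)) + np.mean(d_noise * (x_noise - mu))
    grad_log_z = np.mean(1.0 - d_data) - np.mean(d_noise)
    mu -= lr * grad_mu
    log_z -= lr * grad_log_z

true_log_z = 0.5 * np.log(2.0 * np.pi)   # log partition function of exp(-(x - mu)^2 / 2)
```

There is a single minimization and no generator to train, which is the contrast with GANs drawn above. The loop does need `log_normal_pdf` for the noise distribution, reflecting the requirement that the noise density can be evaluated.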

Flow Contrastive Estimation (FCE)

  • FCE is a variant of NCE where the noise distribution is defined by a normalizing flow model.
  • The flow model is trained adversarially to confuse the discriminator, making the classification problem harder and the noise distribution closer to the data distribution.
  • FCE provides both an energy-based model and a flow model, with the choice of which to use depending on the specific task.
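
The adversarial objective can be sketched with a toy affine flow x = shift + exp(log_scale) * z, with z drawn from a standard normal. The value function below is the negative of the discriminator's binary cross-entropy: the energy-based model is trained to increase it and the flow to decrease it. The affine flow, the stand-in model, and all names are assumptions for illustration; no training is performed here.

```python
import numpy as np

def log_normal_pdf(x, mean, log_std):
    z = (x - mean) * np.exp(-log_std)
    return -0.5 * z ** 2 - log_std - 0.5 * np.log(2.0 * np.pi)

def flow_sample(rng, n, shift, log_scale):
    # Push standard normal samples through the affine flow.
    return shift + np.exp(log_scale) * rng.standard_normal(n)

def flow_log_density(x, shift, log_scale):
    # Change of variables: log q(x) = log N(z; 0, 1) - log_scale.
    return log_normal_pdf(x, shift, log_scale)

def log_sigmoid(t):
    return -np.logaddexp(0.0, -t)

def fce_value(x_data, x_flow, log_model, shift, log_scale):
    # Discriminator logit is log p_theta(x) - log q(x).
    logit_data = log_model(x_data) - flow_log_density(x_data, shift, log_scale)
    logit_flow = log_model(x_flow) - flow_log_density(x_flow, shift, log_scale)
    # mean(log D(x_data)) + mean(log(1 - D(x_flow)))
    return np.mean(log_sigmoid(logit_data)) + np.mean(log_sigmoid(-logit_flow))

rng = np.random.default_rng(0)
x_data = rng.normal(1.0, 1.0, size=4000)

def log_model(x):
    # Stand-in for a well-trained EBM: equal to the data log-density.
    return log_normal_pdf(x, 1.0, 0.0)

v_far = fce_value(x_data, flow_sample(rng, 4000, -3.0, 0.0), log_model, -3.0, 0.0)
v_matched = fce_value(x_data, flow_sample(rng, 4000, 1.0, 0.0), log_model, 1.0, 0.0)
```

With the flow far from the data the classifier's job is easy and the value is close to zero; when the flow matches the data the discriminator outputs 0.5 everywhere and the value drops to -log 4, which is the direction the flow is pushed during adversarial training.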
