Stanford CS236: Deep Generative Models I 2023 I Lecture 4 - Maximum Likelihood Learning

06 May 2024

Autoregressive Models

RNNs are a type of autoregressive model that uses a hidden vector to summarize the context and make predictions.
Attention mechanisms allow models to take into account the full context when making predictions while being selective about which parts of the sequence are relevant.
Transformers are more efficient to train than RNNs because their computation can be parallelized across sequence positions, making them suitable for large-scale language modeling.
Autoregressive models can be used to generate images pixel by pixel, but they are slow because the recursion has to be unrolled one pixel at a time.
Convolutional architectures are better suited for images, but they need to be masked to enforce autoregressive structure.
Attention mechanisms can also be used for images, but they are more computationally intensive to train.
Autoregressive models are easy to sample from and make it easy to evaluate probabilities, which makes them useful for anomaly detection. They can also be extended to continuous variables.
Autoregressive models can be trained by treating them as a sequence of classifiers.
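
As a rough illustration of the points above on sampling and probability evaluation, here is a minimal Python sketch of a tiny autoregressive model over binary variables, where each conditional is a logistic regression on the earlier variables. The model form, the function names (`conditionals`, `log_prob`, `sample`), and the random parameters are assumptions made for illustration, not code from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conditionals(x, W, b):
    """Return p(x_i = 1 | x_<i) for every i in a single pass."""
    # Strictly lower-triangular mask: row i only sees x[0], ..., x[i-1].
    mask = np.tril(np.ones_like(W), k=-1)
    return sigmoid((W * mask) @ x + b)

def log_prob(x, W, b):
    """Chain rule: log p(x) = sum_i log p(x_i | x_<i)."""
    p = conditionals(x, W, b)
    return float(np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)))

def sample(W, b, rng):
    """Ancestral sampling: draw x_0, then x_1 given x_0, and so on."""
    n = b.shape[0]
    x = np.zeros(n)
    for i in range(n):
        # Entry i depends only on x[:i]; later entries are masked out.
        p_i = conditionals(x, W, b)[i]
        x[i] = float(rng.random() < p_i)
    return x

rng = np.random.default_rng(0)
n = 4                                  # number of binary variables (illustrative)
W = rng.normal(size=(n, n))            # hypothetical, untrained parameters
b = rng.normal(size=n)

x = sample(W, b, rng)
print("sample:", x)
print("log-probability:", log_prob(x, W, b))
```

Evaluating the probability takes one pass because every input is already known, while sampling needs one step per variable. Each conditional is also a binary classifier for the next variable, which is the sense in which the model can be trained as a sequence of classifiers.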

Generative models aim to learn a joint probability distribution over random variables that approximates the unknown data distribution.
Autoregressive models are trained using the likelihood, with the Kullback-Leibler (KL) divergence defining how similar the model distribution is to the data distribution.
KL divergence measures the difference between two probability distributions in terms of compression efficiency.
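Minimizing KL divergence is equivalent to building a generative model that can compress data efficiently.
Computing KL divergence directly is challenging because the data distribution is unknown, but the term that depends only on the data distribution is constant with respect to the model parameters and can be dropped for optimization.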
Other distance metrics besides KL divergence can be used to compare distributions, leading to different types of generative models.
KL divergence is asymmetric, so the choice of using P or Q as the reference distribution affects the behavior of the model.
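
The properties of KL divergence listed above can be checked numerically on small discrete distributions. The following Python sketch uses two made-up distributions (`p_data` and `q_model` are hypothetical values chosen for illustration) to show the asymmetry and the decomposition of KL divergence into cross-entropy minus entropy.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in bits, for discrete distributions with full support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log2(p / q)))

def entropy(p):
    """Average code length (bits) of the best code for p."""
    p = np.asarray(p, float)
    return float(-np.sum(p * np.log2(p)))

def cross_entropy(p, q):
    """Average code length (bits) when data from p is coded with a code built for q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(-np.sum(p * np.log2(q)))

p_data = np.array([0.7, 0.2, 0.1])    # hypothetical data distribution
q_model = np.array([0.4, 0.4, 0.2])   # hypothetical model distribution

# Asymmetry: swapping the reference distribution changes the value.
print("KL(p_data || q_model):", kl(p_data, q_model))
print("KL(q_model || p_data):", kl(q_model, p_data))

# KL = cross-entropy - entropy. The entropy term does not involve the model,
# so minimizing KL over the model is the same as minimizing cross-entropy.
print("cross-entropy - entropy:", cross_entropy(p_data, q_model) - entropy(p_data))
```

The difference between cross-entropy and entropy is the extra number of bits paid per symbol for using the wrong code, which is the compression reading of KL divergence.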

The objective of autoregressive models is to maximize the probability of observing a given dataset.
Evaluating the likelihood of a single data point is straightforward using the chain rule.
Assuming the data points are independent and identically distributed, the probability of a dataset is the product of the probabilities of the individual data points.
Maximum likelihood estimation involves finding the parameters that maximize the probability of observing the dataset.
Minimizing cross-entropy is equivalent to maximizing log-likelihood.
Training involves initializing parameters randomly, computing gradients of the log-likelihood using backpropagation, and performing gradient ascent (equivalently, gradient descent on the negative log-likelihood loss).
Stochastic or mini-batch gradient descent can be used to make training scalable.
Regularization techniques are used to prevent overfitting.
Cross-validation can be used to evaluate the performance of a model on unseen data and identify overfitting.
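
The training recipe above can be sketched end to end in a few lines of Python. This is a minimal sketch under stated assumptions: the model is a small logistic autoregressive model with hand-derived gradients instead of backpropagation, the data is synthetic, L2 weight decay stands in for regularization, and a single held-out split stands in for full cross-validation. None of these specifics come from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def avg_log_likelihood(X, W, b):
    """Average log p(x) per data point under the masked logistic model."""
    mask = np.tril(np.ones_like(W), k=-1)        # x_i may depend only on x_<i
    P = sigmoid(X @ (W * mask).T + b)            # P[n, i] = p(x_i = 1 | x_<i)
    return float(np.mean(np.sum(X * np.log(P) + (1 - X) * np.log(1 - P), axis=1)))

def train(X_train, steps=500, batch_size=32, lr=0.5, weight_decay=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    d = X_train.shape[1]
    mask = np.tril(np.ones((d, d)), k=-1)
    W = 0.01 * rng.normal(size=(d, d))           # random initialization
    b = np.zeros(d)
    for _ in range(steps):
        # Mini-batch: estimate the gradient from a random subset of the data.
        batch = X_train[rng.choice(len(X_train), size=batch_size, replace=False)]
        P = sigmoid(batch @ (W * mask).T + b)
        err = batch - P                           # d(log-likelihood) / d(logit)
        grad_W = (err.T @ batch) / batch_size * mask - weight_decay * W  # L2 penalty
        grad_b = err.mean(axis=0)
        W += lr * grad_W                          # ascent step on the log-likelihood
        b += lr * grad_b
    return W, b

# Hypothetical synthetic data: x1 copies x0 with 10% flips, x2 is an independent coin.
rng = np.random.default_rng(1)
x0 = rng.random(1000) < 0.5
x1 = np.where(rng.random(1000) < 0.1, ~x0, x0)
x2 = rng.random(1000) < 0.8
X = np.stack([x0, x1, x2], axis=1).astype(float)
X_train, X_val = X[:800], X[800:]                # hold out 20% for validation

W, b = train(X_train)
print("train avg log-likelihood:   ", avg_log_likelihood(X_train, W, b))
print("held-out avg log-likelihood:", avg_log_likelihood(X_val, W, b))
```

A held-out log-likelihood that is much lower than the training log-likelihood would be a sign of overfitting. Because the update adds the gradient of the log-likelihood, it is the same as a descent step on the cross-entropy loss.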