# Stanford CS236: Deep Generative Models I 2023 I Lecture 4 - Maximum Likelihood Learning

## Autoregressive Models

- RNNs are a type of autoregressive model that uses a hidden vector to summarize the context and make predictions.
- Attention mechanisms allow models to take into account the full context when making predictions while being selective about which parts of the sequence are relevant.
- Transformers are more efficient to train than RNNs because their computation can be parallelized across sequence positions, making them suitable for large-scale language modeling.
- Autoregressive models can be used to generate images pixel by pixel, but sampling is slow because the pixels must be generated one at a time, each conditioned on the previous ones.
- Convolutional architectures are better suited for images, but they need to be masked to enforce autoregressive structure.
- Attention mechanisms can also be used for images, but they are more computationally intensive to train.
- Autoregressive models are easy to sample from and make it easy to evaluate probabilities, which makes them useful for tasks such as anomaly detection. They can also be extended to continuous variables.
- Autoregressive models can be trained by treating them as a sequence of classifiers.
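
The "sequence of classifiers" view can be made concrete with a small sketch. It assumes three binary variables and uses a logistic classifier for each conditional, so that p(x_i = 1 | x_<i) depends on the earlier variables through a weighted sum. The names `conditional_p1`, `log_prob`, `sample` and the parameter values are hypothetical, chosen only for illustration, and are not taken from the lecture.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def conditional_p1(x_prefix, weights_i, bias_i):
    """p(x_i = 1 | x_<i), modeled as a logistic classifier on earlier variables."""
    z = bias_i + sum(w * x for w, x in zip(weights_i, x_prefix))
    return sigmoid(z)


def log_prob(x, weights, biases):
    """Chain rule: log p(x) = sum_i log p(x_i | x_<i)."""
    total = 0.0
    for i, x_i in enumerate(x):
        p1 = conditional_p1(x[:i], weights[i], biases[i])
        total += math.log(p1 if x_i == 1 else 1.0 - p1)
    return total


def sample(weights, biases, rng):
    """Ancestral sampling: draw x_1, then x_2 given x_1, and so on."""
    x = []
    for i in range(len(biases)):
        p1 = conditional_p1(x, weights[i], biases[i])
        x.append(1 if rng.random() < p1 else 0)
    return x


# Hypothetical parameters (assumption): weights[i] holds one weight per earlier variable.
weights = [[], [1.5], [-2.0, 0.5]]
biases = [0.0, -0.5, 0.3]

x = sample(weights, biases, random.Random(0))
print(x, log_prob(x, weights, biases))
```

Evaluating `log_prob` needs only one pass over the variables, while `sample` has to generate them one after another, which is the sequential cost noted above for pixel-by-pixel image generation.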

## Generative Models

- Generative models aim to learn a joint probability distribution over random variables that approximates the unknown data distribution.
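- Autoregressive models are trained by maximum likelihood, which corresponds to using the Kullback-Leibler (KL) divergence to measure how close the model distribution is to the data distribution.
- KL divergence measures the difference between two probability distributions in terms of compression efficiency: it is the expected number of extra bits needed to encode samples from one distribution using a code optimized for the other.
- Minimizing KL divergence is equivalent to building a generative model that can compress data efficiently.
- Computing KL divergence directly is not possible because the data distribution is unknown, but the term that depends only on the data distribution is constant with respect to the model parameters. Minimizing KL divergence therefore reduces to maximizing the expected log-likelihood, which can be estimated from samples.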
- Other distance metrics besides KL divergence can be used to compare distributions, leading to different types of generative models.
- KL divergence is asymmetric, so the choice of using P or Q as the reference distribution affects the behavior of the model.
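
A small numeric sketch may help with the points about KL divergence. It assumes two made-up discrete distributions over three outcomes (`p_data` and `q_model` are illustrative values, not from the lecture) and checks two properties: the divergence changes when the arguments are swapped, and D(p || q) equals the cross-entropy of q under p minus the entropy of p. The second property is why the entropy term, which does not depend on the model, can be dropped during optimization.

```python
import math


def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x)), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(p, q):
    """-sum_x p(x) * log q(x): expected code length under q for samples from p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """-sum_x p(x) * log p(x): depends only on p, not on the model q."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


# Illustrative distributions (assumption, not from the lecture).
p_data = [0.5, 0.4, 0.1]
q_model = [0.3, 0.3, 0.4]

forward = kl_divergence(p_data, q_model)  # data distribution as reference
reverse = kl_divergence(q_model, p_data)  # model distribution as reference

print("D(p || q) =", forward)
print("D(q || p) =", reverse)
print("cross-entropy - entropy =", cross_entropy(p_data, q_model) - entropy(p_data))
```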

## Training Autoregressive Models

- The objective of autoregressive models is to maximize the probability of observing a given dataset.
- Evaluating the likelihood of a single data point is straightforward using the chain rule.
- Assuming the data points are independent and identically distributed, the probability of a dataset is the product of the probabilities of individual data points.
- Maximum likelihood estimation involves finding the parameters that maximize the probability of observing the dataset.
- Minimizing cross-entropy is equivalent to maximizing log-likelihood.
- Training involves initializing parameters randomly, computing gradients of the log-likelihood using backpropagation, and performing gradient ascent (equivalently, gradient descent on the negative log-likelihood).
- Stochastic or mini-batch gradient methods can be used to make training scalable.
- Regularization techniques are used to prevent overfitting.
- Cross-validation can be used to evaluate the performance of a model on unseen data and identify overfitting.
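
The training loop described above can be sketched on the simplest possible model: a single Bernoulli variable whose probability is `sigmoid(logit)`. This is a minimal illustration under stated assumptions, not the lecture's implementation. The gradient is written out by hand instead of being computed by backpropagation, and the dataset, function names, and hyperparameters (`steps`, `batch_size`, `lr`) are all hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def avg_log_likelihood(logit, data):
    """Average log-likelihood of binary data under Bernoulli(sigmoid(logit))."""
    p1 = sigmoid(logit)
    return sum(math.log(p1 if x == 1 else 1.0 - p1) for x in data) / len(data)


def train(data, steps=2000, batch_size=16, lr=0.1, seed=0):
    rng = random.Random(seed)
    logit = rng.gauss(0.0, 1.0)  # random initialization
    for _ in range(steps):
        batch = rng.sample(data, batch_size)  # mini-batch drawn from the dataset
        # For a Bernoulli, d/d(logit) log p(x) = x - sigmoid(logit).
        grad = sum(x - sigmoid(logit) for x in batch) / batch_size
        logit += lr * grad  # gradient ascent on the log-likelihood
    return logit


# Toy dataset (assumption): 70 ones and 30 zeros.
data = [1] * 70 + [0] * 30
logit = train(data)
print("learned p(x = 1):", sigmoid(logit))
print("average log-likelihood:", avg_log_likelihood(logit, data))
```

For this model the maximum likelihood solution is the empirical frequency of ones, so the learned probability should end up close to 0.7, with some residual noise from the mini-batch updates.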