# Stanford CS236: Deep Generative Models I 2023 I Lecture 18 - Diffusion Models for Discrete Data

## Discrete Generative Models

- Discrete data is harder to model than continuous data because there is no natural way to interpolate between samples.
- Applications of discrete generative models include natural language processing, biology, and natural sciences.
- Existing continuous-space models cannot be easily adapted to discrete data because they rely on gradients and other tools from calculus that are undefined on discrete spaces.
- Transformers can be used to map discrete input sequences to continuous values, but this is not a requirement for discrete generative models.
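The embedding step mentioned above can be sketched as a simple row lookup; all sizes and values below are hypothetical placeholders, not the lecture's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a vocabulary of 10 tokens, each mapped to a
# 4-dimensional continuous vector via a learned embedding table.
vocab_size, d_model = 10, 4
embedding_table = rng.normal(size=(vocab_size, d_model))

# A discrete token sequence becomes a sequence of continuous vectors
# by row lookup -- this is the first layer of a Transformer.
tokens = np.array([3, 1, 4, 1, 5])
continuous_inputs = embedding_table[tokens]

print(continuous_inputs.shape)  # (5, 4)
```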

## Autoregressive Modeling

- Autoregressive modeling is widely used in natural language processing for tasks like language generation and translation.
- In autoregressive modeling, the probability of a sequence of tokens is decomposed into a product of conditional probabilities.
- Autoregressive modeling has advantages such as scalability, the ability to represent any probability distribution over sequences, and a reasonable inductive bias for natural language.
- However, autoregressive modeling also has drawbacks such as sampling drift, lack of inductive bias for non-language tasks, and computational inefficiency.
- The key challenge in discrete generative modeling is ensuring that the probabilities of all possible sequences sum to one; the autoregressive factorization guarantees this by construction, whereas normalizing a general model over all sequences is computationally intractable.
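The chain-rule decomposition above can be sketched as follows; the uniform next-token distribution is a runnable stand-in for a real learned model such as a Transformer:

```python
import numpy as np

# Toy autoregressive factorization: p(x_1,...,x_T) = prod_t p(x_t | x_<t).
vocab_size = 5

def next_token_probs(prefix):
    # A real model would condition on the prefix; a uniform
    # distribution keeps this sketch self-contained.
    return np.full(vocab_size, 1.0 / vocab_size)

def sequence_log_prob(tokens):
    # Sum the log of each conditional probability along the sequence.
    logp = 0.0
    for t, tok in enumerate(tokens):
        probs = next_token_probs(tokens[:t])
        logp += np.log(probs[tok])
    return logp

# Because each conditional is normalized, the product over all
# length-T sequences sums to one -- no intractable partition function.
lp = sequence_log_prob([0, 2, 4])
print(lp)  # log((1/5)^3)
```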

## Score Matching for Discrete Distributions

- The lecture explores generalizing score matching techniques to discrete cases to address the challenge of using autoregressive modeling for discrete problems.
- The video discusses a method for extending score matching to discrete spaces using a finite difference approximation.
- The concrete score at a sequence x is defined as the collection of ratios p(y)/p(x) − 1, where p(y) and p(x) are the probabilities of sequences y and x.
- To make the model computationally feasible, only the ratios between sequences that differ at a single position are modeled, rather than all ratios p(y)/p(x).
- A sequence-to-sequence model, which can be implemented as a non-autoregressive Transformer, is used to output these ratios.
- The score entropy loss function is introduced as a generalization of score matching for discrete scores.
- Two alternative loss functions are mentioned: implicit score entropy and denoising score entropy.
- The denoising score entropy loss function is derived by assuming that the probability of a sequence is a convolution between a base distribution and a kernel.
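A heavily simplified, unweighted sketch of the score entropy objective for a single state may help; the normalizing term K(a) is what makes the loss nonnegative and zero exactly at the true ratios (function names and the example ratio values are hypothetical):

```python
import numpy as np

def K(a):
    # Normalizing term K(a) = a*(log(a) - 1), chosen so the loss
    # below is nonnegative and vanishes when the model matches
    # the true ratios.
    return a * (np.log(a) - 1.0)

def score_entropy(model_ratios, true_ratios):
    # Score entropy at one state x, summed over neighboring states
    # y != x: model_ratios[i] plays the role of s_theta(x)_y and
    # true_ratios[i] the role of p(y)/p(x).
    return np.sum(
        model_ratios - true_ratios * np.log(model_ratios) + K(true_ratios)
    )

true = np.array([0.5, 2.0, 1.0])

# The loss is minimized (approximately zero) at the true ratios ...
print(score_entropy(true, true))                            # ~ 0
# ... and strictly positive anywhere else.
print(score_entropy(np.array([1.0, 1.0, 1.0]), true) > 0)   # True
```

The per-neighbor term s − a·log(s) is convex in s with its minimum at s = a, which is why matching the true ratio is the unique optimum.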

## Score Entropy Discrete Diffusion (SEDD)

- The video introduces a new generative modeling technique called Score Entropy Discrete Diffusion (SEDD).
- SEDD outperforms autoregressive Transformers in terms of generation quality and speed.
- SEDD allows for controllable generation, including prompting from an arbitrary location.
- The model can generate coherent text sequences and can be used to infill text between given prompt tokens.
- SEDD is particularly effective for long-sequence generation and can achieve high-quality results with fewer sampling steps than autoregressive Transformers.
- The model size of SEDD is relatively small compared to autoregressive Transformers, making it more efficient for large-scale language generation tasks.

## Score-Based Diffusion Models for Discrete Data

- Score-based diffusion models can be extended to discrete spaces by modeling the ratios of the data distribution, known as concrete scores.
- A new score matching loss called score entropy is introduced, which can be optimized via its denoising and implicit variants.
- Sampling from the score-based model can be done using a forward and reverse diffusion process, which synergizes with the score entropy loss.
- The generation quality of score-based diffusion models can surpass autoregressive modeling because they can generate the whole sequence in parallel.
- A likelihood bound based on score entropy is proposed, which aligns with the score entropy loss and allows for comparison with autoregressive models.
- Score-based diffusion models challenge the dominance of autoregressive modeling on large-scale sequence generation tasks.
- The proposed method outperforms previous continuous diffusion models in terms of likelihood and generation quality.
- A discretization scheme called tau-leaping (τ-leaping) is used to efficiently generate sequences from the score-based diffusion model.

## Comparison with Autoregressive Models

- Both score-based diffusion models and autoregressive models can learn any probability distribution over a discrete space, but score-based diffusion models may have a better inductive bias and be more amenable to optimization.

## Non-Causal Transformer for Discrete Diffusion

- The video discusses training the diffusion model with a sequence-to-sequence neural network (a Transformer) that uses a non-causal attention mask.
- The architecture allows every position to attend to every other position, similar to BERT.
- The model is evaluated using generative perplexity and PET distance metric, showing improvements over previous methods.
- The challenge lies in efficiently computing the matrix exponential of a large Q matrix, which can be computationally expensive.
- Experimenting with more complex Q matrices did not yield better results due to fundamental architectural differences.
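The difference between the causal mask of an autoregressive Transformer and the non-causal (BERT-style) mask described above can be sketched in a few lines; the helper name is hypothetical:

```python
import numpy as np

def attention_mask(seq_len, causal):
    # mask[i, j] = True means position i may attend to position j.
    if causal:
        # Autoregressive: each position only sees itself and the past.
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Non-causal: every position attends to every position, as in BERT.
    return np.ones((seq_len, seq_len), dtype=bool)

causal_mask = attention_mask(4, causal=True)
full_mask = attention_mask(4, causal=False)

# Under the causal mask, position 0 cannot see position 3;
# under the non-causal mask, it can.
print(causal_mask[0, 3], full_mask[0, 3])  # False True
```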