# Stanford CS236: Deep Generative Models | 2023 | Lecture 18 - Diffusion Models for Discrete Data


## Discrete Generative Models

• Discrete data is harder to model than continuous data because there is no natural way to interpolate between samples.
• Applications of discrete generative models include natural language processing, biology, and natural sciences.
• Existing continuous-space models cannot be easily adapted to discrete data because they rely on calculus (e.g., gradients), which is not defined on discrete spaces.
• Transformers can be used to map discrete input sequences to continuous values, but this is not a requirement for discrete generative models.

## Autoregressive Modeling

• Autoregressive modeling is widely used in natural language processing for tasks like language generation and translation.
• In autoregressive modeling, the probability of a sequence of tokens is decomposed into a product of conditional probabilities.
• Autoregressive modeling has advantages such as scalability, the ability to represent any probability distribution over sequences, and a reasonable inductive bias for natural language.
• However, autoregressive modeling also has drawbacks such as sampling drift, lack of inductive bias for non-language tasks, and computational inefficiency.
• The key challenge in discrete generative modeling more broadly is normalization: ensuring that the probabilities of all possible sequences sum to one is computationally intractable in general, which autoregressive factorization sidesteps by normalizing each conditional distribution separately.

## Score Matching for Discrete Distributions

• The lecture explores generalizing score matching techniques to discrete cases to address the challenge of using autoregressive modeling for discrete problems.
• The video discusses a method for extending score matching to discrete spaces using a finite difference approximation.
• The concrete score at a sequence x is defined as the collection of ratios p(y)/p(x) − 1, where p(y) is the probability of sequence y and p(x) is the probability of sequence x.
• To keep this computationally feasible, only the ratios between sequences that differ at a single position are modeled, rather than all p(y)/p(x).
• A sequence-to-sequence model is used to model these ratios, which can be implemented as a non-autoregressive Transformer.
• The score entropy loss function is introduced as a generalization of score matching for discrete scores.
• Two alternative loss functions are mentioned: implicit score entropy and denoising score entropy.
• The denoising score entropy loss function is derived by assuming that the probability of a sequence is a convolution between a base distribution and a kernel.

## Score Entropy Discrete Diffusion (SEDD)

• The video introduces a new generative modeling technique called Score Entropy Discrete Diffusion (SEDD).
• SEDD outperforms autoregressive Transformers in terms of generation quality and speed.
• SEDD allows for controllable generation, including prompting from an arbitrary location.
• The model can generate coherent text sequences and can be used to infill text between given prompt tokens.
• SEDD is particularly effective for long sequence generation and can achieve high-quality results with fewer sampling steps compared to autoregressive Transformers.
• The model size of SEDD is relatively small compared to autoregressive Transformers, making it more efficient for large-scale language generation tasks.

## Score-Based Diffusion Models for Discrete Data

• Score-based diffusion models can be extended to discrete spaces by modeling the ratios of the data distribution, known as concrete scores.
• A new score matching loss called score entropy is introduced, which can be optimized using its denoising or implicit variants.
• Sampling from the score-based model can be done using a forward and reverse diffusion process, which synergizes with the score entropy loss.
• The generation quality of score-based diffusion models can surpass autoregressive modeling because they can generate the whole sequence in parallel.
• A likelihood bound based on score entropy is proposed, which aligns with the score entropy loss and allows for comparison with autoregressive models.
• Score-based diffusion models challenge the dominance of autoregressive modeling on large-scale sequence generation tasks.
• The proposed method outperforms previous continuous diffusion models in terms of likelihood and generation quality.
• A discretization scheme called τ-leaping is used to efficiently generate sequences from the score-based diffusion model.

## Comparison with Autoregressive Models

• Both score-based diffusion models and autoregressive models can learn any probability distribution over a discrete space, but score-based diffusion models may have a better inductive bias and be more amenable to optimization.

## Non-Causal Transformer for Discrete Diffusion

• The video discusses a method for training a diffusion model using a sequence-to-sequence neural network (a Transformer) with a non-causal attention mask.
• The architecture allows every position to attend to every other position, similar to BERT.
• The model is evaluated using generative perplexity and PET distance metric, showing improvements over previous methods.
• The challenge lies in efficiently computing the matrix exponential of a large Q matrix, which can be computationally expensive.
• Experimenting with more complex Q matrices did not yield better results due to fundamental architectural differences.