Stanford CS25: V4 I Hyung Won Chung of OpenAI

13 Jun 2024

Introduction (0s)

Hyung Won Chung is a research scientist at OpenAI, specializing in large language models.
He has worked on various aspects of LLMs, including pre-training, instruction fine-tuning, reinforcement learning from human feedback, reasoning, and more.
Notable works include the scaling FLAN paper (FLAN-T5 and FLAN-PaLM), and T5X, the training framework used to train the PaLM language model.
Before OpenAI, he was at Google Brain and holds a PhD from MIT.
The goal of the lecture is to develop a unified perspective on the history of Transformers to project potential future developments in AI.
By looking at the early history of Transformers, we can learn lessons and gain insights into the future of AI.
The lecture will examine the architectures of Transformers to gain a deeper understanding of their development.

The dominant driving force behind rapid changes in AI research is the exponential decrease in compute costs.
Predicting the future of AI is simpler than predicting the future in general because a single dominant driving force accounts for most of the change.
The traditional approach of building models of human thinking into AI systems is flawed, because the imposed structure limits scalability.
The key to AI research is to leverage cheaper compute, scale up with weaker modeling assumptions, and provide more data to models.
The optimal level of structure or inductive bias for AI models depends on available compute resources.
Removing unnecessary structure is not incentivized in the current AI research paradigm, hindering progress.
Understanding the driving force of cheaper compute and scaling up is crucial for advancing AI research.
Analyzing the history of Transformer models and researcher decisions can provide insights into optimal structures over time.

A Transformer is a sequence model that uses attention to model the interaction between sequence elements.
There are three types of Transformers: encoder-decoder, encoder-only, and decoder-only.
The encoder-decoder Transformer is the original Transformer and has a more structured architecture compared to the other two types.
The encoder-only Transformer produces a single vector representation of the input sequence, regardless of its length.
The decoder-only Transformer is used in language models like GPT-3 and can generate sequences.
The cross-attention mechanism in the encoder-decoder Transformer allows the decoder to attend to the encoder's output sequence representations.
The key design features of the decoder-only architecture are that a single self-attention mechanism serves as both self-attention and cross-attention, and that the parameters are shared between input and target sequences.
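
The attention variants described above can be illustrated with a minimal sketch. This is not code from the lecture: it is plain scaled dot-product attention in pure Python, and it assumes a single head with no learned query, key, or value projections. The function names `softmax` and `attention` and the toy vectors are illustrative choices.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of vectors.

    With causal=True, position i may only attend to positions <= i.
    This assumes queries and keys index the same sequence (self-attention).
    """
    d = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Positions this query is allowed to look at.
        visible = range(i + 1) if causal else range(len(keys))
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in visible
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        outputs.append([
            sum(w * values[j][k] for w, j in zip(weights, visible))
            for k in range(len(values[0]))
        ])
    return outputs

# Toy sequences (hypothetical data, two-dimensional "embeddings").
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [[0.5, 0.5], [1.0, 0.0]]

bidirectional = attention(inputs, inputs, inputs)            # encoder-style self-attention
causal = attention(inputs, inputs, inputs, causal=True)      # decoder-style self-attention
cross = attention(targets, inputs, inputs)                   # decoder attending to encoder output
```

In an encoder-decoder model the `cross` call would be a separate mechanism with its own parameters. In a decoder-only model, input and target are concatenated into one sequence, so the single causal self-attention call covers both roles.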

The encoder-decoder and decoder-only architectures differ in four ways: how cross-attention is handled, whether parameters are shared, the target-to-input attention pattern, and the direction of input attention.
The encoder-decoder architecture has a separate cross-attention mechanism, separate encoder and decoder parameters, a target that attends only to the final encoder layer, and bidirectional input attention.
The decoder-only architecture uses one self-attention mechanism for both roles, shares parameters between input and target, attends within each layer to that same layer's representations, and uses unidirectional input attention.
The additional structures in the encoder-decoder architecture, such as separate input and target parameters, a separate cross-attention mechanism, and target attention restricted to the final encoder layer, are useful when the input and target sequences are sufficiently different or long.
However, these assumptions may not hold for larger language models, longer target sequences, or multi-turn chat applications.
Bidirectional input attention may not be necessary at scale and presents engineering challenges for modern multi-turn chat applications, because the input must be re-encoded at every turn.
Unidirectional attention is more efficient because previously computed representations can be cached and reused.
The shift from bidirectional to unidirectional attention is driven by the exponential decrease in compute costs and the associated scaling efforts.
Analyzing historical artifacts and current events can provide insights into the assumptions and limitations of AI research, enabling the development of more general and scalable solutions.
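
The caching argument for multi-turn chat can be checked with a small sketch. It is an illustration under simplifying assumptions, not the lecture's code: a single attention layer in pure Python with no learned weights, and hypothetical helper names `attend` and `prefix_unchanged`. It compares the outputs for the first turn before and after a second turn is appended.

```python
import math

def attend(x, causal):
    """Single-head self-attention over x (a list of vectors), no learned weights."""
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        # Causal: only positions <= i are visible. Bidirectional: all positions.
        visible = range(i + 1) if causal else range(len(x))
        scores = [
            sum(a * b for a, b in zip(q, x[j])) / math.sqrt(d)
            for j in visible
        ]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([
            sum(e / z * x[j][k] for e, j in zip(exps, visible))
            for k in range(d)
        ])
    return out

def prefix_unchanged(turn1, turn2, causal):
    """True if the turn-1 outputs are the same before and after appending turn 2."""
    before = attend(turn1, causal)
    after = attend(turn1 + turn2, causal)[: len(turn1)]
    return all(
        math.isclose(a, b, abs_tol=1e-12)
        for row_a, row_b in zip(before, after)
        for a, b in zip(row_a, row_b)
    )

# Toy conversation (hypothetical two-dimensional token vectors).
turn1 = [[1.0, 0.0], [0.0, 1.0]]
turn2 = [[2.0, 2.0]]

causal_reusable = prefix_unchanged(turn1, turn2, causal=True)
bidirectional_reusable = prefix_unchanged(turn1, turn2, causal=False)
```

With causal attention the earlier positions never see the new tokens, so their representations can be cached across turns. With bidirectional attention every earlier position attends to the appended turn, so the whole history has to be recomputed.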