Stanford CS25: V4 I Hyung Won Chung of OpenAI
Introduction
- Hyung Won Chung is a research scientist at OpenAI, specializing in large language models.
- He has worked on various aspects of LLMs, including pre-training, instruction fine-tuning, reinforcement learning with human feedback, reasoning, and more.
- Notable works include the scaling FLAN paper (FLAN-T5 and FLAN-PaLM), and T5X, the training framework used to train the PaLM language model.
- Before OpenAI, he was at Google Brain and holds a PhD from MIT.
- The goal of the lecture is to develop a unified perspective on the history of Transformers to project potential future developments in AI.
- By looking at the early history of Transformers, we can learn lessons and gain insights into the future of AI.
- The lecture will examine the architectures of Transformers to gain a deeper understanding of their development.
Identifying and understanding the dominant driving force behind AI
- The dominant driving force behind rapid changes in AI research is the exponential decrease in compute costs.
- Predicting the future of AI is simpler than predicting the future in general due to its narrower focus.
- The traditional approach of modeling human thinking in AI is flawed and limits scalability.
- The key to AI research is to leverage cheaper compute, scale up with weaker modeling assumptions, and provide more data to models.
- The optimal level of structure or inductive bias for AI models depends on available compute resources.
- Removing unnecessary structure is not incentivized in the current AI research paradigm, hindering progress.
- Understanding the driving force of cheaper compute and scaling up is crucial for advancing AI research.
- Analyzing the history of Transformer models and researcher decisions can provide insights into optimal structures over time.
Overview of Transformer architectures: encoder-decoder, encoder-only and decoder-only
- A Transformer is a sequence model that uses attention to model the interaction between sequence elements.
- There are three types of Transformers: encoder-decoder, encoder-only, and decoder-only.
- The encoder-decoder Transformer is the original Transformer and has a more structured architecture compared to the other two types.
- The encoder-only Transformer (e.g., BERT) encodes the input sequence into a single vector representation, regardless of the input's length, so it suits tasks such as classification but cannot generate sequences.
- The decoder-only Transformer is used in language models like GPT-3 and can generate sequences.
- In the encoder-decoder Transformer, cross-attention allows the decoder to attend to the encoder's output representations of the input sequence.
- The key design features of the decoder-only architecture are that a single self-attention mechanism serves both roles (attending within the target and attending from target to input) and that the same parameters are shared between input and target sequences.
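To make the attention bullets above concrete, here is a minimal sketch of single-head self-attention in plain Python. It is an illustration under simplifying assumptions, not the lecture's code: the query, key, and value projections are replaced by the identity, and the names `softmax`, `self_attention`, and `tokens` are hypothetical. The `causal` flag switches between encoder-style (bidirectional) and decoder-style (unidirectional) attention.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, causal=False):
    """Single-head self-attention with identity projections (a simplification).

    x: list of token vectors (each a list of floats of equal length).
    causal=True restricts position i to positions 0..i (decoder-style).
    Returns (outputs, attention_weights).
    """
    d = len(x[0])
    outputs, weights = [], []
    for i, q in enumerate(x):
        visible = range(i + 1) if causal else range(len(x))
        # Scaled dot-product score between the query and each visible key.
        scores = [sum(a * b for a, b in zip(q, x[j])) / math.sqrt(d) for j in visible]
        w = softmax(scores)
        # Pad masked (future) positions with zero weight so each row has len(x) entries.
        w = w + [0.0] * (len(x) - len(w))
        weights.append(w)
        # Output is the attention-weighted sum of the value vectors (here, x itself).
        outputs.append([sum(w[j] * x[j][k] for j in range(len(x))) for k in range(d)])
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
_, bidirectional = self_attention(tokens, causal=False)
_, unidirectional = self_attention(tokens, causal=True)
```

With `causal=False` every position places some weight on every other position. With `causal=True` the weight matrix is lower-triangular, so the first token attends only to itself.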
Differences between encoder-decoder and decoder-only, and rationale for encoder-decoder’s additional structures from the perspective of scaling
- The encoder-decoder and decoder-only architectures differ in their attention mechanisms, parameter sharing, target-to-input attention patterns, and the directionality of input attention.
- The encoder-decoder architecture uses a separate cross-attention mechanism alongside self-attention, keeps separate parameters for the encoder and the decoder, has every decoder layer attend to the encoder's final-layer output, and attends bidirectionally over the input. The decoder-only architecture uses one self-attention mechanism for both roles, shares parameters between input and target, has each layer attend to the input representations at that same layer, and attends unidirectionally.
- The additional structures in the encoder-decoder architecture, such as separate input and target parameters, cross-attention, and target-to-input attention, are useful when the input and target sequences are sufficiently different or long.
- However, these assumptions may not hold for larger language models, longer target sequences, or multi-turn chat applications.
- Bidirectional input attention may not be necessary at scale and presents engineering challenges for modern multi-turn chat applications, because earlier turns must be re-encoded whenever new text arrives. Unidirectional attention is more efficient because the encodings of previous turns can be cached and reused.
- The shift from bidirectional to unidirectional attention is driven by the exponential decrease in compute costs and the associated scaling efforts.
- Analyzing historical artifacts and current events can provide insights into the assumptions and limitations of AI research, enabling the development of more general and scalable solutions.
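The caching point above can be checked with a small sketch. It is a one-layer toy with identity projections, and the helper names `attend` and `close` and the `turn_1`/`turn_2` data are hypothetical. Under unidirectional attention, the encodings of earlier tokens do not change when a new turn is appended, so they could be cached. Under bidirectional attention they change and would need to be recomputed.

```python
import math

def attend(x, causal):
    """One layer of single-head self-attention with identity projections."""
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        # Causal attention sees only positions 0..i; bidirectional sees everything.
        visible = x[: i + 1] if causal else x
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in visible]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[c] for e, v in zip(exps, visible)) for c in range(d)])
    return out

def close(a, b, tol=1e-9):
    """True if two lists of vectors agree elementwise within tol."""
    return all(abs(p - q) < tol for r, s in zip(a, b) for p, q in zip(r, s))

turn_1 = [[1.0, 0.0], [0.0, 1.0]]   # tokens seen after the first turn
turn_2 = turn_1 + [[1.0, 1.0]]      # the same tokens plus one new token

# Compare the encodings of the first two tokens before and after the new token arrives.
causal_reusable = close(attend(turn_2, causal=True)[:2], attend(turn_1, causal=True))
bidirectional_reusable = close(attend(turn_2, causal=False)[:2], attend(turn_1, causal=False))
```

Here `causal_reusable` is `True` and `bidirectional_reusable` is `False`. The same argument extends layer by layer to deeper models, which is why decoder-only chat systems can keep a cache of past keys and values.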