Stanford CS25: V4 I Hyung Won Chung of OpenAI

Introduction (00:00:00)

  • Hyung Won Chung is a research scientist at OpenAI, specializing in large language models.
  • He has worked on various aspects of LLMs, including pre-training, instruction fine-tuning, reinforcement learning from human feedback (RLHF), and reasoning.
  • Notable works include the scaling FLAN paper (FLAN-T5 and FLAN-PaLM), and T5X, the training framework used to train the PaLM language model.
  • Before OpenAI, he was at Google Brain and holds a PhD from MIT.
  • The goal of the lecture is to develop a unified perspective on the history of Transformers to project potential future developments in AI.
  • By looking at the early history of Transformers, we can learn lessons and gain insights into the future of AI.
  • The lecture will examine the architectures of Transformers to gain a deeper understanding of their development.

Identifying and understanding the dominant driving force behind AI (00:02:05)

  • The dominant driving force behind rapid changes in AI research is the exponential decrease in compute costs.
  • Predicting the future of AI is simpler than predicting the future in general because the field is narrower and has one dominant driving force.
  • The traditional approach of building models around how we think humans think is flawed: the structure it imposes limits scalability.
  • The key to AI research is to leverage cheaper compute, scale up with weaker modeling assumptions, and provide more data to models.
  • The optimal level of structure or inductive bias for AI models depends on available compute resources.
  • Removing unnecessary structure is not incentivized in the current AI research paradigm, hindering progress.
  • Understanding the driving force of cheaper compute and scaling up is crucial for advancing AI research.
  • Analyzing the history of Transformer models and researcher decisions can provide insights into optimal structures over time.

Overview of Transformer architectures: encoder-decoder, encoder-only and decoder-only (00:15:18)

  • A Transformer is a sequence model that uses attention to model the interaction between sequence elements.
  • There are three types of Transformers: encoder-decoder, encoder-only, and decoder-only.
  • The encoder-decoder Transformer is the original Transformer and has a more structured architecture compared to the other two types.
  • The encoder-only Transformer produces a single vector representation of the input sequence, regardless of its length.
  • The decoder-only Transformer is used in language models like GPT-3 and can generate sequences.
  • The cross-attention mechanism in the encoder-decoder Transformer allows the decoder to attend to the encoder's final-layer sequence representations.
  • The key design features of the decoder-only architecture are that a single self-attention mechanism serves both the self-attention and cross-attention roles, and that one set of parameters is shared between input and target sequences.
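The point that one self-attention mechanism can play both roles in a decoder-only model can be sketched in a few lines. This is a minimal illustration, not the lecture's code: it assumes a single head, identity query/key/value projections, and made-up two-dimensional embeddings (`input_tokens`, `target_tokens` are hypothetical names), and it uses only the Python standard library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs, causal):
    """Toy single-head self-attention with identity projections (an assumption).

    xs: list of token vectors (lists of floats).
    causal: if True, position i may attend only to positions <= i.
    Returns (outputs, weights); weights[i][j] is how much i attends to j.
    """
    n, d = len(xs), len(xs[0])
    outputs, weights = [], []
    for i in range(n):
        visible = range(i + 1) if causal else range(n)
        scores = [
            sum(a * b for a, b in zip(xs[i], xs[j])) / math.sqrt(d)
            for j in visible
        ]
        row = [0.0] * n  # masked positions keep weight 0
        for j, p in zip(visible, softmax(scores)):
            row[j] = p
        weights.append(row)
        outputs.append([sum(row[j] * xs[j][k] for j in range(n)) for k in range(d)])
    return outputs, weights

# Hypothetical toy embeddings: 2 input tokens followed by 2 target tokens.
input_tokens = [[1.0, 0.0], [0.0, 1.0]]
target_tokens = [[1.0, 1.0], [0.5, -0.5]]
sequence = input_tokens + target_tokens  # decoder-only: one concatenated sequence

_, causal_weights = self_attention(sequence, causal=True)
_, bidir_weights = self_attention(sequence, causal=False)

last = causal_weights[-1]           # attention row of the final target token
attends_to_input = sum(last[:2])    # mass on input tokens: the cross-attention role
attends_to_target = sum(last[2:])   # mass on target tokens: the self-attention role
```

In this sketch the last target token spreads its attention over both the input positions and the earlier target positions, so no separate cross-attention module is needed. The `causal=False` call corresponds to the bidirectional attention an encoder would apply to its input.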

Differences between encoder-decoder and decoder-only, and rationale for encoder-decoder’s additional structures from the perspective of scaling (00:23:29)

  • The encoder-decoder and decoder-only architectures differ in their attention mechanisms, parameter sharing, target-to-input attention patterns, and the direction of input attention.
  • The encoder-decoder architecture has a separate cross-attention mechanism alongside self-attention, separate encoder and decoder parameters, and decoder layers that all attend to the encoder's final-layer output. The decoder-only architecture uses one self-attention mechanism for both roles, shares parameters between input and target, and attends to input representations within the same layer.
  • The additional structures in the encoder-decoder architecture, such as separate input and target parameters, cross-attention, and target-to-input attention, are useful when the input and target sequences are sufficiently different or long.
  • However, these assumptions may not hold for larger language models, longer target sequences, or multi-turn chat applications.
  • Bidirectional input attention may not be necessary at scale, and it presents engineering challenges for modern multi-turn chat applications because the conversation so far must be re-encoded at every turn. Unidirectional attention is more efficient because the encodings of previous turns can be cached.
  • The shift from bidirectional to unidirectional attention is driven by the exponential decrease in compute costs and the associated scaling efforts.
  • Analyzing historical artifacts and current events can provide insights into the assumptions and limitations of AI research, enabling the development of more general and scalable solutions.
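The caching argument for unidirectional attention in multi-turn chat can be checked on a toy example. The sketch below is an assumption-laden illustration rather than anything shown in the lecture: a single attention layer with identity projections, invented token vectors (`turn_1`, `turn_2`), and standard-library Python only. It compares the representations of the first turn computed alone with the same positions computed after a second turn is appended.

```python
import math

def attend(xs, causal):
    """Toy single-head attention (identity projections); one output vector per position."""
    n, d = len(xs), len(xs[0])
    outs = []
    for i in range(n):
        # Causal: only positions <= i are visible. Bidirectional: all positions.
        js = range(i + 1) if causal else range(n)
        scores = [sum(a * b for a, b in zip(xs[i], xs[j])) / math.sqrt(d) for j in js]
        m = max(scores)
        ws = [math.exp(s - m) for s in scores]
        z = sum(ws)
        outs.append([sum(w / z * xs[j][k] for w, j in zip(ws, js)) for k in range(d)])
    return outs

def close(a, b, tol=1e-12):
    """True if two equally shaped lists of vectors match element-wise within tol."""
    return all(abs(x - y) <= tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Hypothetical token vectors for two chat turns.
turn_1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
turn_2 = [[0.5, -0.5], [2.0, 0.0]]
conversation = turn_1 + turn_2
n1 = len(turn_1)

# Are the turn-1 representations unchanged once turn 2 is appended?
causal_reusable = close(attend(turn_1, True), attend(conversation, True)[:n1])
bidir_reusable = close(attend(turn_1, False), attend(conversation, False)[:n1])

print("causal outputs reusable:", causal_reusable)
print("bidirectional outputs reusable:", bidir_reusable)
```

Under the causal mask the earlier positions never see later tokens, so their outputs are identical and could be cached across turns. Under bidirectional attention the earlier positions attend to the new tokens, so their outputs change and would have to be recomputed.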
