# Stanford CS234 Reinforcement Learning I Offline RL 1 I 2024 I Lecture 8

## Introduction and Background

- The discussion begins with a clarification of general logistics and a brief review of previous topics, including the limitations of the DAgger algorithm, which requires continuous human feedback, and behavior cloning, which simplifies reinforcement learning to supervised learning by learning state-to-action mappings from expert demonstrations. (6s)
- The focus shifts to the potential of reinforcement learning (RL) to achieve capabilities similar to those of large language models, such as ChatGPT, which can generate code using techniques like Q-learning. This capability was not possible a few years ago and highlights the advancements in AI. (3m19s)
- The session plans to explore reinforcement learning from human feedback (RLHF), a method used in training models like ChatGPT, and its role in achieving advanced AI functionalities. (4m25s)
- An upcoming guest lecture will feature one of the authors of the Direct Preference Optimization work, which has shown promising results, potentially surpassing RLHF in performance on various benchmarks. This work was recognized at the Neural Information Processing Systems conference. (4m37s)
- The guest lecturer, a graduate student from Stanford, will discuss new developments in the field, including a forthcoming paper on extending these methods. The session emphasizes the ongoing innovation in combining RL with large language models to enhance AI performance. (5m5s)

## Imitation Learning and Human Feedback

- The focus of the lecture is on imitation learning, which is a method of using human feedback to train reinforcement learning agents. (5m37s)
- Imitation learning involves taking demonstrations from humans, which can be explicit, like showing a robot how to perform a task, or implicit, like analyzing decisions from electronic medical records to match or exceed human performance. (6m10s)
- The data used in imitation learning can be natural data traces from normal activities or explicit demonstration trajectories, consisting of sequences of states and actions without rewards. (6m47s)
- One motivation for imitation learning is the difficulty in defining a reward function that captures the complexity of human objectives. (7m10s)
- Behavior cloning is a method within imitation learning that maps the problem to supervised learning to directly match the expert's policy. (7m20s)
- DAgger (Dataset Aggregation) is introduced as a method to address the challenge of distribution mismatch in behavior cloning, where an expert provides guidance to correct mistakes and improve the learning process. (7m38s)
- The lecture discusses the possibility of recovering the reward function from demonstrations to understand the objectives behind human decisions and to potentially learn or improve policies. (8m29s)
- The lecture explores the idea of what is sufficient to mimic human behavior effectively in reinforcement learning. (9m5s)

## Reward Functions and Ambiguity

- The discussion highlights the relationship between policies, trajectories, states, and actions in reinforcement learning, emphasizing that two policies inducing the same distribution over states and actions will yield the same reward, assuming rewards are functions of states and actions. (9m19s)
- It is explained that reward functions can be seen as linear combinations of features, which can be predefined or derived from sensor-level data, depending on the problem context, such as online marketing or robotics. (10m25s)
- The concept of ensuring that the distribution over features is close to achieve similar rewards is discussed, with the assumption that if the weight vector's norm is bounded, closeness in features equates to closeness in rewards. (11m31s)
- A challenge in reinforcement learning is the non-uniqueness of reward functions compatible with observed data, even if the data is optimal, leading to the issue of multiple compatible rewards. (11m49s)

## Maximum Entropy Inverse Reinforcement Learning (Max Entropy IRL)

- The focus shifts to addressing this ambiguity through methods like maximum entropy inverse reinforcement learning (Max Entropy IRL) and Generative Adversarial Imitation Learning (GAIL), with Max Entropy IRL being introduced as a foundational approach developed in 2008. (12m28s)
- The principle of Maximum Entropy is discussed in the context of probability distributions, where entropy is defined as the negative sum over all states of the probability of each state times the log of that probability. This principle suggests choosing the probability distribution with the largest entropy that is consistent with given constraints and prior data. (12m57s)
- The concept is applied to scenarios where one has access to expert data in the form of trajectories but not the expert's policy. The goal is to create a probability distribution over trajectories that are consistent with observed data and have the highest entropy. This is useful in imitation tasks, such as programming a robot to imitate an expert's actions. (15m5s)
- The principle of Maximum Entropy is used to resolve ambiguities in choosing among multiple compatible reward functions by selecting those with maximum entropy. This approach is a choice to ensure consistency with observed data. (15m39s)
- The method was developed by Brian Zird and colleagues, who were interested in understanding taxi driver behavior. They aimed to model the decision-making process of taxi drivers, considering various constraints like distance, traffic, and tolls. (16m13s)
- The discussion involves using trajectories of taxi drivers in Pittsburgh to infer the reward function they use and develop a policy that performs as well as skilled taxi drivers. (16m38s)
- The approach involves dealing with the challenge of not being able to learn a unique reward function, leading to the use of maximum entropy methods. (16m57s)
- In the linear reward case, the focus is on finding a distribution over trajectories that matches the observed distribution from experts while maintaining high entropy. (17m4s)
- The learning process involves creating a probability distribution over trajectories with maximum entropy, subject to it being a true probability distribution and matching the features observed from expert trajectories. (17m32s)
- Matching features is equivalent to matching rewards when using a linear function, and the goal is to have a distribution of trajectories that aligns with the expert data. (17m55s)
- The concept of maximum entropy is applied to ensure compatibility with expert data while maximizing entropy in the distribution over trajectories. (18m50s)

## Learning from Demonstrations and Matching Distributions

- There is a clarification that a distribution over trajectories can be thought of as either directly over state-action pairs or implicitly through a policy generating those trajectories. (19m2s)
- The ultimate aim is to learn a policy that induces trajectories with rewards matching those of the expert, even though the rewards are not directly available. (19m38s)
- The discussion explores the concept of matching the distribution of trajectories to the reward of experts, assuming the experts' reward is optimal. The goal is to solve this problem by learning an optimal policy and updating the reward function iteratively. (20m17s)
- The process involves understanding the relationship between reward functions, optimal policies, and distributions over states and actions, and how these can be used to update the reward function. (21m16s)

## Mathematical Formulation and Optimization

- The original paper assumes that the dynamics reward model is known, and the discussion introduces the concept of Max entropy and the exponential family to explain the structural form of the distribution over trajectories. (21m33s)
- The constrained optimization problem is rewritten using Lagrange multipliers to gain intuition over the functional form, and the derivative is taken with respect to trajectories to optimize the problem. (23m10s)
- The discussion involves finding the derivative with respect to a probability of a particular trajectory, leading to the simplification of terms that are not functions of the probability. (24m53s)
- The goal is to set the derivative to zero to find the maximum, followed by algebraic manipulation and exponentiation. This process illustrates that the probability distribution over trajectories, which maximizes entropy subject to constraints, is proportional to the exponential of the reward for that trajectory. (25m21s)
- Observing this principle means placing more weight on trajectories with higher rewards, subject to maintaining a probability distribution. This results in a functional form over trajectories that is proportional to an exponential, known as an exponential family. (26m13s)
- This principle can be leveraged to learn a reward function, assuming a known reward for a particular trajectory. The derivative is taken with respect to the probability, not the reward function. (26m59s)
- Maximizing entropy over the probability distribution is equivalent to maximizing the likelihood of observed data under a max entropy distribution. The probability of a trajectory given a reward model is expressed as a normalized exponential, with a normalizing constant ensuring a well-formed probability distribution. (27m20s)
- The probability can also be expressed in terms of states within a trajectory, where the reward can be associated with individual states or the entire trajectory. This approach is useful even when the reward function is unknown, as the structural form of the probability under the max entropy principle is known, allowing the focus to shift to maximizing the unknown reward model. (28m23s)

## Maximizing Likelihood and Reparameterization

- The discussion involves maximizing the likelihood of observed data by changing the parameterization of a function, specifically focusing on entropy-constrained problems. This approach is linked to insights from Janes in 1957, which relate to maximizing entropy under constraints by converting it to an exponential family form. (29m35s)
- The process involves reparameterizing the objective function to eliminate nuisance parameters, focusing on learning a specific parameter of interest. This concept is related to direct preference optimization, which will be explored further in future discussions. (31m12s)
- The goal is to learn the reward function by leveraging the known structural form of the probability distribution over trajectories. This involves maximizing the probability of observing the data under the given reward function. (31m35s)
- The method includes rewriting the probability distribution as a sum of logs, using the maximum entropy form of trajectories, and simplifying the expression by canceling out the log and exponential terms. (32m42s)
- The expression is further simplified by identifying terms independent of certain variables, allowing for the extraction of constants related to the size of the dataset. This leads to a formulation that can be optimized using gradient descent. (33m20s)

## Gradient Descent and Optimization

- The optimization process involves taking the derivative of the parameterized function with respect to the reward function, which is a common approach in gradient descent methods. (34m4s)
- The discussion involves taking the derivative of a term related to reinforcement learning, specifically focusing on the probability of a particular trajectory. This expression is identified as the normalized exponential divided by a normalizing constant, which is equivalent to the probability of a trajectory given certain conditions. (35m22s)
- The gradient step is derived by computing the derivative with respect to the reward function for each trajectory in the expert data, minus the number of different trajectories times the sum over all trajectories of the probability of that trajectory given certain conditions. (36m25s)
- The goal is to optimize by computing the derivative with respect to the reward function, transitioning from trajectories to states. The probability of a trajectory is broken down into components, including the policy and the probability of transitioning to the next state given the current state and action. (37m12s)
- The probability of a trajectory is proportional to an exponential function of the reward, which can be expressed in terms of states. This allows the derivative to be expressed in terms of states instead of trajectories, focusing on the states in expert demonstrations. (38m17s)

## Computing State-Action Distributions and Dynamic Programming

- The process involves matching the distribution over states observed in the dataset. This is achieved by considering observed states and actions and determining the states and actions induced under a different policy, given a known dynamics model, especially in a tabular setting. (39m2s)
- In scenarios where the dynamics are known, the state-action distribution can be computed directly using dynamic programming. This involves calculating the state-action frequencies to match those observed from expert policies with those induced under a reward function. If the distributions match, the process is complete; otherwise, adjustments to the reward function and policy are necessary until they align. (39m32s)
- The distribution of states at the next time step depends on the distribution of states at the previous time step, the probability under the policy, the probability of the action given the state, and the probability of the state given the state and action. This allows for the computation of the average density for a particular state over all time steps. (40m51s)
- When computing the derivative of the objective function with respect to the reward, it involves summing over all states and the probability of the states given by time and the reward function. If the reward function is linear, the derivative simplifies to a sum over the features in the data set. (41m41s)
- The process involves providing expert demonstrations as input, initializing parameters, computing an optimal policy using methods like value iteration, calculating state visitation frequencies, computing the gradient on the reward model, and updating the reward model iteratively. The derivative is calculated as a sum over all trajectories in the data set, considering the features for each trajectory and the probability of the state given the current parameterization. (43m19s)
- The discussion involves a linear reward system and the steps in an algorithm that relies on knowing the Dynamics model, including computing the optimal policy and state visitation frequencies. (44m28s)
- The algorithm uses dynamic programming to compute the distribution over the next time step of states by summing over the distribution for the previous time step, the probability of action given a state, and the Dynamics model. (45m59s)
- To compute the optimal policy with value iteration, access to both the reward model and the Dynamics model is necessary. The Dynamics programming algorithm also requires access to the Dynamics model to determine the distribution of states in the next time step. (48m26s)

## Challenges and Extensions of Max Entropy IRL

- Once the state visitation frequencies are computed, the Dynamics model is not needed for the gradient. However, assuming access to the Dynamics model is a strong assumption, as it is often unknown in real-world scenarios like human decision-making. (48m55s)
- The approach discussed has been influential, initially using linear rewards and assuming a known Dynamics model. Follow-up work by Chelsea Finn demonstrated the use of general reward and cost functions, such as deep neural networks, and removed the need to know the Dynamics model. (49m30s)
- The discussion addresses the challenge of dealing with multiple reward models compatible with human behavior, suggesting the use of maximum entropy to effectively tackle this problem. This approach has been applied in various contexts, such as modeling taxi drivers. (50m11s)

## Reinforcement Learning from Human Feedback (RLHF)

- Imitation learning is introduced as a method to learn optimal behavior from existing demonstrations, which can reduce the data needed to develop a good policy. The theory of imitation learning in reinforcement learning (RL) and its sample complexity are also highlighted. (50m49s)
- Behavior cloning is emphasized as a frequently used technique in imitation learning, where RL is reduced to supervised learning using demonstrations. Understanding the principle of maximum entropy and its relation to trajectory distribution and maximum likelihood optimization is important. (51m33s)
- The reward model learned through maximum entropy is compatible with human demonstrations but does not necessarily represent the actual reward model used by people. It maximizes entropy in the distribution. (52m0s)
- The discussion transitions to exploring human feedback in reinforcement learning, focusing on how humans can assist RL agents in decision-making under uncertainty. This includes training agents for specific tasks or aligning large language models with human values and intents. (52m25s)
- Researchers have been exploring the concept of using human feedback to enhance autonomous agent learning for a long time. An early example is "Sophie's Kitchen" by Andrea Tamas and Cynthia Breazeal from MIT, which aimed to teach an agent to perform kitchen tasks with human assistance. This approach contrasts with traditional methods where agents learn independently through trial and error. (53m31s)
- The key insight from "Sophie's Kitchen" is that learning can be more efficient if a human provides feedback, similar to how humans learn with guidance from teachers or peers. The robot receives two types of input: feedback from humans and signals from the environment, which helps in training the agent more effectively. (54m0s)
- Another approach, the Tamer framework by Brad Knox and Peter Stone from UT Austin, also focuses on using human feedback to train agents more quickly. This method was applied to the game Tetris, where human feedback was used to develop an explicit reward model, leading to improved performance compared to traditional algorithms. (55m12s)
- The Tamer framework demonstrated that agents could learn to perform tasks better and faster with human feedback, although there are limitations, such as the impracticality of having humans provide feedback for extended periods. This highlights a challenge similar to the DAgger approach, where continuous human involvement is not feasible for long-term training. (56m2s)

## Preference-Based Methods and Human Input

- Model-based approaches in reinforcement learning involve explicitly training a reward model from human feedback, which can vary in the level of human involvement, ranging from passive demonstrations to active coaching. (56m44s)
- A potential middle ground for human feedback is using preference-based methods, where humans compare different behaviors to indicate which they prefer, rather than providing constant teaching or detailed reward functions. (57m35s)
- Early work in this area was conducted by Yan Yu and Thorsten Joachims at Cornell, focusing on recommendation ranking systems. They explored how users could compare outputs from different retrieval functions to determine which is better, rather than specifying a scalar reward function. (58m1s)
- This preference-based feedback approach is easier for humans and has been applied in various fields, including recommendation systems and robotics. (59m17s)
- In robotics and autonomous driving, researchers like Dorsa Sadigh have explored how to gather human input on acceptable driving behaviors, as it is challenging to define an objective function for all driving scenarios. (59m30s)
- Dorsa and her colleagues have conducted research on using preference pairs to understand human preferences efficiently, which respects human effort. This method is considered easier than writing a reward function and less labor-intensive than other methods like DAgger. (1h0m4s)
- Pairwise comparisons are seen as a sweet spot for understanding preferences, as they are simpler for people to perform compared to articulating a perfect response or writing down complex functions. (1h0m35s)

## Modeling Preferences and the Bradley Terry Model

- The Bradley Terry model is used to mathematically model how people compare preferences, assuming that individuals have an internal reward function that they may not be able to articulate clearly. This model accounts for the noisy nature of human preference expression. (1h1m2s)
- In the context of K-armed bandits, which involve multiple actions with different rewards but no states, humans make noisy pairwise comparisons. The probability of preferring one item over another is modeled using an exponential function of their rewards. (1h2m2s)
- If two items have identical rewards according to a person's internal model, the probability of preferring one over the other is 50%. If one item is significantly preferred, the probability can be much higher, such as 0.9 or 0.95. (1h2m45s)
- The discussion involves a noisy model related to reinforcement learning from human feedback, which makes assumptions about how people form preference pairs. This model helps understand the relationship between internal latent rewards and external preferences. (1h3m42s)
- The model is transitive, meaning probabilities between items can be deduced from other probabilities, allowing for chaining. This model was introduced around 70 years ago and is popular in recommendation systems. (1h4m6s)

## K-armed Bandits and Preference Optimization

- In scenarios with multiple actions, the goal is to learn the reward for each action and determine the best one. The concept of a Condorcet winner is introduced, where an item is preferred over all others. (1h4m37s)
- A Copeland winner is defined as an item with the highest number of pairwise victories, not necessarily preferred over all others. (1h5m58s)
- A Borda winner maximizes the expected score based on preferences, with scores assigned based on preference strength. (1h6m24s)
- Algorithms for K-armed or dueling bandits focus on finding an item that, on average, beats others, often using pairwise matrices to evaluate actions. (1h6m43s)
- The process involves extracting underlying reward functions from noisy pairwise comparisons to determine the best action or arm in reinforcement learning, aiming to optimize for the reward function. (1h7m17s)

## Learning Reward Models from Preferences

- The method assumes a classification task where preferences between items are expressed as values: 1 if the first item is preferred, 0 if the second is preferred, and 0.5 if indifferent. This resembles a logistic regression task, and the likelihood is maximized using cross-entropy. (1h7m41s)
- Reward models are parameterized using deep neural networks or other functions, and the goal is to find parameters that maximize likelihood, fitting a reward model based on observed preference pairs. (1h8m13s)
- In reinforcement learning, the approach involves considering trajectories with a series of rewards, preferring trajectories with higher rewards, similar to a bandit problem. (1h8m44s)
- The process includes asking people to compare trajectories, learning a reward model from these comparisons, and optimizing policies based on the learned reward model. (1h9m19s)

## Deep Bayesian Reinforcement Learning from Human Preferences

- An example from 2017, called Deep Bayesian Reinforcement Learning (DBRL) from Human Preferences, involved training a model to perform a backflip using about 900 bits of human feedback. (1h9m43s)
- The training involved showing people clips and asking them to choose which clip better represented a backflip, without explicitly defining the reward function for a backflip. (1h10m4s)
- A model was trained to perform a backflip using only about 900 examples, which is notably efficient compared to the large data requirements of deep Q-learning for tasks like learning Atari games, which often require millions of examples. (1h10m42s)

## RLHF and Direct Policy Optimization (DPO)

- An assignment involves using RHF (Reinforcement Learning from Human Feedback) and DPO (Direct Policy Optimization) to learn from human preferences, with a provided dataset for training agents. This is the first time this assignment is being conducted. (1h11m11s)
- A 2017 paper on this topic received attention, but significant work in this area has only emerged more recently. (1h11m42s)
- In the context of ChatGPT, RHF is used by obtaining demonstration data for supervised learning, known as behavior cloning, and comparison data to train a reward model. This process can involve ranking multiple options. (1h12m6s)
- The approach is an example of meta reinforcement learning, aiming to develop a general reward function for various tasks that large language models might perform, such as generating a story from a new prompt. (1h12m40s)
- The reward models are designed to handle multitask settings, adapting to whatever tasks users might request from ChatGPT, such as answering questions. (1h13m19s)
- Further discussion on this topic is planned for the following week, either before or after a guest lecture, to explore the framework in more detail. (1h13m39s)