# Stanford CS234 Reinforcement Learning | Offline RL 3 | 2024 | Lecture 10

## Discussion on DPO and RLHF

- There is a discussion about DPO and RLHF, where it is clarified that RLHF learns an explicit representation of the reward function, while DPO does not. (2m30s)
- DPO assumes a particular parametric representation for its model and inverts it for direct policy learning, avoiding the need to explicitly learn a reward function as RLHF does. (3m0s)
- It is noted that DPO is not constrained to be only as good as the best examples in the preference data: the reward signal inferred from preferences (implicit in DPO, explicit in RLHF) can yield a policy that surpasses the demonstrations. (3m27s)
- The process of using a reward model to improve policy performance is compared to the development of ChatGPT, where a reward model was used to enhance response generation beyond initial supervised learning examples. (4m33s)
- Both DPO and RLHF use a reference policy, and there is an emphasis on the importance of understanding how far one can extrapolate or interpolate from data without generalizing to areas with poor performance. This concept is relevant in various learning methods, including imitation learning, DPO, and PPO. (5m4s)
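
For reference, the "inversion" mentioned above is usually written as the following objective in the DPO literature. The notation is not from the lecture summary and is an assumption here: $x$ is a prompt, $y_w$ and $y_l$ are the preferred and dispreferred responses, $\pi_{\mathrm{ref}}$ is the reference policy, $\beta$ is a temperature, and $\sigma$ is the logistic function.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

No reward network appears in the loss; the log-ratio against the reference policy plays the role of an implicit reward.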

## Transition to Learning from Past Data and Data-Efficient Learning

- The class is transitioning to a focus on learning from past data, whether generated by humans or robots, and will soon cover fast or data-efficient learning. (6m9s)
- The lecture focuses on data-efficient learning, particularly separating policy evaluation and policy learning questions, which can help improve policy optimization. (6m22s)
- A question is posed about whether it is possible to surpass imitation learning, which involves mimicking expert data, especially in fields like healthcare where human performance may be limited. (6m54s)
- An example is given from a lab that deals with education and healthcare data, where decisions are generated by humans or automated systems, such as medical interventions and their outcomes. (7m3s)
- The discussion highlights the potential of reinforcement learning to improve decision-making processes, particularly in healthcare, where better sequences of decisions could lead to improved patient outcomes. (8m0s)

## Case Study: Personalized Learning in Games (Refraction and BrainPOP)

- A backstory is provided about a collaboration with Zoran Popović from the University of Washington, involving a game called Refraction, which teaches kids about fractions through a spaceship and laser beam activity. (8m15s)
- The goal was to personalize and adapt the game to students by using information about their performance to select the next activity, creating a decision policy based on various state features. (8m58s)
- The project involved analyzing data from about 11,000 learners who participated in activities assigned randomly, aiming to understand how to adaptively choose subsequent activities. (9m41s)
- A human designer created a specific sequence for a game on BrainPOP, an educational platform for kids, but it was uncertain if this sequence was optimal. The goal was to use reinforcement learning to develop an adaptive policy to increase student persistence in the game. (9m50s)
- Evidence suggested that playing the game led to learning, but many kids stopped playing after a short time. The aim was to enhance student persistence beyond expert performance using reinforcement learning, which resulted in a policy that increased persistence by about 30%. (10m22s)
- The research indicated that there might be enough data to discover new decision policies that are significantly better than current ones. This inspires further exploration of using natural variation and past experiments to improve decision-making policies. (10m59s)

## Impact of New Policies and Heterogeneous Treatment Effects

- A recent analysis showed that the lowest performers were most positively impacted by the new policy, addressing concerns about increasing the inequity gap. This highlights the importance of understanding heterogeneous treatment effects and estimating impacts on different population subgroups. (12m2s)
- The policy changes involved varying elements such as the difficulty of fractions and graphical challenges in the game. The exact final policy used was not remembered, but it involved manipulating these aspects to improve engagement. (12m52s)
- The game "Battleship Number Line," which involves fractions, highlights the importance of variability for persistence and engagement. Adjusting elements like the size of battleships can significantly impact these factors. (13m14s)
- Imitation learning is valuable, especially when trying to emulate top performers like surgeons, but there are situations where surpassing human performance is possible. This is particularly true when high-level principles do not dictate specific actions, and data-driven approaches can be beneficial. (14m2s)
- In healthcare, collaboration with Finale Doshi-Velez at Harvard has focused on optimizing policies for conditions like hypotension using data from the MIMIC dataset, which contains extensive electronic medical records. The study explored behavior policies and developed improved policies using a method called POPCORN, demonstrating that some policies can significantly outperform baselines. (14m33s)

## Introduction to Offline/Batch/Counterfactual Reinforcement Learning

- Offline or batch reinforcement learning (RL), also known as counterfactual RL, involves estimating or learning policies that are not present in the actual data collection strategy. This approach assumes a dataset of trajectories and focuses on sequences of states, actions, and rewards. (15m29s)
- A key challenge in counterfactual RL is estimating what might have happened if different actions were taken, which is a fundamental problem in causal inference. This difficulty is particularly pronounced when attempting to exceed existing performance levels. (16m21s)
- Data in various fields such as education, healthcare, climate change, and robotics is often censored, and generalization is necessary to avoid enumerating all possible policies. This is particularly important in scenarios where conducting experiments is expensive. (16m43s)
- There is a need to understand the performance of new decision policies that were not used to gather the data, which relates to off-policy reinforcement learning. Despite existing tools like Q-learning, challenges arise due to the "deadly triad" of bootstrapping, function approximation, and off-policy learning, which can lead to failures. (17m12s)
- Batch-Constrained deep Q-learning (BCQ), developed by Scott Fujimoto, is highlighted as a method that can outperform others like deep Q-learning and behavior cloning when dealing with offline data. This suggests the importance of using algorithms specifically designed for fixed datasets to achieve better results. (18m19s)
- The discussion emphasizes the need for new methods when data is constrained and additional data cannot be obtained, motivating the exploration of batch policy evaluation and policy optimization. (19m27s)

## Batch Policy Evaluation and Model-Based Learning

- Batch policy evaluation involves using a data set to estimate the effectiveness of a particular policy, either for a specific state or on average across multiple starting states. This approach aims to be sample efficient, as highlighted by Phil Thomas, a professor at UMass Amherst. (19m40s)
- The discussion involves working with Adobe, utilizing a dataset of 10 to 20 million trajectories to learn policies that are better than the behavior policy used to gather the data. The importance of data efficiency and good algorithms is emphasized. (20m17s)
- The behavior policy is defined as the policy used to collect the dataset. Clarification is provided to ensure understanding of this term. (20m45s)
- The initial approach involved using models to learn from historical data, led by graduate student Travis Mandel. The focus was on representing the state space and learning a dynamics model from the data. (21m17s)
- An explicit dynamics model and reward model were built from the existing dataset. The reward was known because it was defined in terms of persistence, so only the dynamics model had to be learned from the data. (21m42s)
- Various state representations were considered, and once a suitable representation was chosen, the system could be treated as a simulator. This allowed for the use of methods like dynamic programming, Q-learning, or Monte Carlo methods to learn or evaluate policies. (22m14s)
- The x-axis of a graph represents different state representations of the environment, with the understanding that human learning is not fully captured by small state spaces. The graph shows that increasing state space complexity improves data fit and prediction accuracy. (23m15s)
- The discussion involves predicting the next state of a student using a complex state space, which is expected to improve the dynamics model due to the complexity of human learning. This is evaluated using cross-validation on a held-out set, not training error. (23m58s)
- The dataset used is fixed, and the focus is on modeling clickstream data as state spaces for model selection. Once a simulator is developed, the goal is to learn and evaluate a good policy. (24m25s)
- Despite having a better simulator that fits the data well, the policy derived from optimizing this simulator performs worse in the real world. This discrepancy is highlighted by an unbiased reward estimator. (24m52s)
- The process involves creating a dynamics model from data, adding a reward function, and extracting an optimal policy (pi star) under the simulator. However, the true value of the computed policy is often worse than expected. (25m54s)
- Previous work in educational data mining suggested building a model, simulating, and deploying the best-looking policy, but this approach is flawed due to model misspecification. (26m35s)
- The model's misspecification leads to a difference between the estimated value (V hat of Pi hat star) and the true value of the policy, causing the model to overestimate its effectiveness. (27m1s)
- When a model is misspecified, even with infinite data, it will not converge to the true model of student learning, leading to incorrect estimates. This suggests that the current state model of learning is not highly effective. (27m44s)
- It is argued that the accuracy of the dynamics model should not be used as a proxy for selecting policies. Instead, independent estimates are needed to directly evaluate the performance of a policy. (28m31s)
- Policy evaluation can be done by executing the policy in the real environment or by estimating its performance using a simulated model. The goal is to determine the best policy to deploy without having to test it in the real environment, which would be akin to online reinforcement learning. (29m2s)
- Model misspecification can be likened to overfitting, where increasing the number of states improves the fit but does not achieve a perfect fit, leading to significant bias in learning. (29m41s)
- Model-based learning can still be beneficial. It is suggested to build a different model for each policy being evaluated, rather than optimizing only for accuracy over the behavior policy's data distribution. (30m18s)
- A paper by Finale Doshi-Velez, Yao Liu, and Omer Gottesman demonstrated that changing the loss function to prioritize accuracy over state and action pairs expected under a different policy can significantly improve model performance, as shown in a medical domain study. (30m51s)
- By reusing data, it is possible to fit dynamics models that better predict future dynamics. (31m21s)
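
The model-based pipeline described in this section (fit a dynamics model from logged transitions, then treat it as a simulator) can be sketched for a small tabular problem. This is a minimal illustration, not the code used in the study: the function names, the counting-based model, and the toy data are all assumptions.

```python
import numpy as np

def fit_tabular_model(transitions, n_states, n_actions):
    """Estimate P(s'|s,a) and mean reward R(s,a) by counting (s, a, r, s') tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    visits = counts.sum(axis=2)
    # Unvisited (s, a) pairs get a uniform next-state guess and zero reward.
    P = np.where(visits[..., None] > 0,
                 counts / np.maximum(visits[..., None], 1),
                 1.0 / n_states)
    R = reward_sum / np.maximum(visits, 1)
    return P, R

def evaluate_in_model(P, R, policy, gamma=0.9):
    """Solve V = R_pi + gamma * P_pi V exactly for a stochastic policy[s, a]."""
    n_states = P.shape[0]
    R_pi = (policy * R).sum(axis=1)
    P_pi = np.einsum("sa,sat->st", policy, P)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Toy logged data (hypothetical): state 0 always moves to state 1 with reward 1,
# and state 1 loops on itself with reward 0.
logged = [(0, 0, 1.0, 1), (0, 0, 1.0, 1), (1, 0, 0.0, 1)]
P_hat, R_hat = fit_tabular_model(logged, n_states=2, n_actions=1)
V_hat = evaluate_in_model(P_hat, R_hat, policy=np.ones((2, 1)))
```

The value returned here is the policy's value under the learned model. As the section stresses, if the state representation is misspecified this number can be far from the policy's true value, no matter how well the model fits held-out data.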

## Model-Free Methods and Fitted Q Evaluation

- Model-free methods, such as fitted Q evaluation, are introduced as alternatives to model-based methods for policy evaluation, potentially offering fewer limitations. (31m33s)
- Fitted Q evaluation is similar to deep Q learning but focuses on a single policy without using a max operation. It involves minimizing the difference between a parameterized function and observed data. (31m51s)
- The process for fitted Q evaluation includes initializing the Q function, computing targets using the policy of interest, building a training set, and fitting the Q function. (33m17s)
- Fitted Q evaluation is closely related to fitted Q iteration, a common algorithm for off-policy learning, and is similar to deep Q learning. (34m21s)
- There is interest in understanding the theoretical grounding of fitted Q evaluation, specifically in terms of generalization error and the difference between computed and true policy values. (34m30s)
- The number of samples needed (n) is crucial for determining the amount of data required for this evaluation process. (35m21s)
- The concept of a "concentrability coefficient" is introduced, which measures the difference between the distribution of state-action pairs in a dataset and those under a desired policy. This is related to the divergence in state-action distributions and overlap. (36m3s)
- The accuracy of evaluating a policy's performance depends on factors such as the discount factor, the amount of data available, and the similarity between state-action distributions in the training and test sets. (36m41s)
- A challenge with the discussed approach is its reliance on the Markov assumption and the assumption that Q-functions are well-specified, meaning they can be accurately fitted if infinite data were available. However, in practice, infinite data is not available, leading to potential errors. (37m4s)
- Importance sampling is introduced as a method to address these challenges. It is a statistical technique adapted for reinforcement learning to evaluate policies using offline data from different distributions without relying on model correctness or the Markov assumption. (38m23s)
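
The fitted Q evaluation steps listed above (initialize Q, compute targets with the policy of interest, build a training set, fit Q) can be sketched in the tabular case, where the regression step reduces to averaging targets per state-action pair. The function name, the data layout, and the toy data are assumptions for illustration; with function approximation the averaging step would be replaced by a regression fit.

```python
import numpy as np

def fitted_q_evaluation(transitions, policy, n_states, n_actions,
                        gamma=0.9, n_iters=200):
    """Tabular FQE: regress Q(s,a) onto r + gamma * E_{a'~pi}[Q(s',a')] (no max)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        target_sum = np.zeros_like(Q)
        counts = np.zeros_like(Q)
        for s, a, r, s_next, done in transitions:
            # The target uses the evaluation policy's action distribution, not a max.
            bootstrap = 0.0 if done else policy[s_next] @ Q[s_next]
            target_sum[s, a] += r + gamma * bootstrap
            counts[s, a] += 1
        # In the tabular case the least-squares fit is the mean target per (s, a).
        Q = np.where(counts > 0, target_sum / np.maximum(counts, 1), Q)
    return Q

# Toy data (hypothetical): state 0 -> state 1 with reward 1, then state 1
# terminates with reward 2. There is a single action, so the policy is trivial.
data = [(0, 0, 1.0, 1, False), (1, 0, 2.0, 1, True)]
Q_hat = fitted_q_evaluation(data, policy=np.ones((2, 1)),
                            n_states=2, n_actions=1, gamma=0.5)
```

State-action pairs that never appear in the data keep their initial value of zero, which is one concrete way the lack of overlap between the dataset and the evaluation policy shows up as error.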

## Importance Sampling for Policy Evaluation

- Importance sampling helps address the data distribution mismatch by allowing the estimation of expected rewards over a distribution of states, which is crucial for offline policy evaluation. (39m2s)
- The discussion addresses the challenge of evaluating expected rewards under a policy when there is no data available from the probability distribution of reaching certain states under that policy. (39m22s)
- A method is introduced to estimate the expectation by using a different sampling distribution, denoted q. This involves multiplying and dividing by the same factor to rewrite the expression in a way that allows approximation using samples from q. (40m23s)
- The approach provides an unbiased estimate by reweighting samples from q to reflect the likelihood of reaching states under the original policy. This method can be extended to multiple time steps without requiring a Markov assumption. (42m4s)
- The unbiased estimator requires that the sampling distribution q have non-zero probability for all states that the original policy could reach with non-zero probability. This ensures that the estimator is valid for states that are possible under the target policy. (42m41s)
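
The multiply-and-divide step above amounts to the identity E_p[f(x)] = sum_x q(x) * (p(x) / q(x)) * f(x), which can be approximated by averaging the weighted values over samples drawn from q. A minimal sketch for a discrete variable follows; the function name and all numbers are illustrative assumptions.

```python
import numpy as np

def importance_sampling_estimate(samples, f, p, q):
    """Estimate E_p[f] from samples drawn under q, for discrete outcomes 0..K-1."""
    samples = np.asarray(samples)
    weights = p[samples] / q[samples]  # q > 0 on every drawn sample by construction
    return float(np.mean(weights * f[samples]))

# Illustrative numbers (not from the lecture).
p = np.array([0.2, 0.8])   # target distribution
q = np.array([0.5, 0.5])   # sampling distribution, covers everything p can reach
f = np.array([1.0, 3.0])   # reward of each outcome

# A sample that matches q's proportions exactly, so the estimate is easy to check.
samples_from_q = [0, 1, 0, 1]
is_estimate = importance_sampling_estimate(samples_from_q, f, p, q)
true_value = float(p @ f)
```

With this balanced sample the estimate equals the true expectation under p. With random samples it would be correct only on average, and the spread grows as p and q move apart.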

## Hidden Confounding in Data Analysis

- The assumption of "no hidden confounding" is mentioned as an important consideration when working with empirical data. (43m41s)
- In a healthcare setting, understanding counterfactuals, such as what would have happened if a different action was taken, is crucial. This requires knowing all features used in decision-making to avoid hidden confounding, which can lead to biased estimations. (44m34s)
- Hidden confounding occurs when there are unobserved features influencing decisions, leading to incorrect conclusions about the effectiveness of actions. For example, patients with similar observable features might receive different treatments due to unrecorded factors, affecting outcomes like survival. (45m10s)
- The issue of hidden confounding is significant in practice, especially when actions are optional or made by humans. It is important to consider whether additional confounding factors exist beyond the recorded features in a dataset. (46m0s)
- An experiment was conducted to assess whether providing students access to GPT-4 affected class participation and exam scores. The challenge was determining if students who used GPT-4 were inherently different, which could confound the results. (46m16s)
- In controlled environments like simulators, hidden confounding is less of a concern, but it remains a critical issue in real-world applications. (46m54s)
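
The healthcare example can be made concrete with a small calculation. In the sketch below, a hidden severity variable drives both the treatment decision and the outcome. Every number is hypothetical and chosen only to show the direction of the bias.

```python
import numpy as np

# Hidden feature u: 0 = mild, 1 = severe (not recorded in the dataset).
p_u = np.array([0.5, 0.5])
# Clinicians treat severe patients far more often (hypothetical behavior policy).
p_treat_given_u = np.array([0.1, 0.9])
# Survival probability indexed as [u, action], action 0 = no treatment, 1 = treatment.
survival = np.array([[0.90, 0.95],
                     [0.30, 0.50]])

# True average effect of treatment: it helps in both groups.
true_effect = float(p_u @ (survival[:, 1] - survival[:, 0]))

# Naive comparison of treated vs. untreated patients, ignoring u.
p_treated = p_u * p_treat_given_u
p_untreated = p_u * (1 - p_treat_given_u)
naive_treated = float(p_treated @ survival[:, 1] / p_treated.sum())
naive_untreated = float(p_untreated @ survival[:, 0] / p_untreated.sum())
naive_effect = naive_treated - naive_untreated
```

Here the treatment improves survival for every patient type (true effect of +0.125), yet the naive comparison suggests it hurts (about -0.295), because treated patients were sicker to begin with.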

## Example Scenario with Three Actions and Two Policies

- The discussion includes a scenario with three actions, each with different probabilities and rewards, sampled from a behavior policy. This illustrates the concept of policy evaluation in reinforcement learning. (47m11s)
- The discussion involves evaluating two policies, Pi 1 and Pi 2, and determining if data from Pi 1 can be used to get an unbiased estimator of Pi 2. The impact of positive or negative rewards on this estimation is also considered. (48m1s)
- Pi 1 is a stochastic policy that pulls one specific action with a probability of 0.8, while Pi 2 divides its probability between action two and action one. (47m54s)
- The expected reward for action A1 is calculated to be 2, for action A2 it is 1.1, and for action A3 it is 0.5. Policies that place more weight on action one are generally better. (54m13s)
- The expected value of Pi 1 is a weighted average of the expected rewards of actions A2 and A3 (1.1 and 0.5), so it lies between 0.5 and 1.1. (54m41s)
- For Pi 2, the expected reward is calculated to be approximately 1.5, indicating that Pi 2 has higher true rewards. (55m23s)
- It is concluded that data from Pi 1 cannot be used to get an unbiased estimate of Pi 2 because Pi 1 never pulls the necessary actions for Pi 2. (56m26s)
- It is possible to obtain a lower bound on the performance of a policy using another policy that does not have complete overlap, provided that the rewards are non-negative. If the rewards are always greater than or equal to zero, a behavior policy that lacks complete coverage of the target policy can still be useful. (57m20s)
- The concept of having zero probability mass on certain actions is explained as being similar to not sampling those actions, which results in a lower estimated value if all rewards are positive. This can be advantageous if the target evaluation policy is better than the behavior policy, even without full coverage. (57m50s)
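
The lower-bound argument can be checked numerically. The expected rewards below (2, 1.1, 0.5) come from the summary; the exact action probabilities of the two policies are assumptions chosen to be consistent with its description (Pi 1 never pulls A1, Pi 2 never pulls A3).

```python
import numpy as np

expected_reward = np.array([2.0, 1.1, 0.5])   # A1, A2, A3 (from the summary)
pi1 = np.array([0.0, 0.2, 0.8])               # behavior policy (assumed probabilities)
pi2 = np.array([0.5, 0.5, 0.0])               # target policy (assumed probabilities)

true_v1 = float(pi1 @ expected_reward)
true_v2 = float(pi2 @ expected_reward)

# Expected value of the importance sampling estimator of V(pi2) using data from pi1.
# Actions that pi1 never takes contribute nothing, as if they had zero reward.
support = pi1 > 0
weights = np.zeros_like(pi1)
weights[support] = pi2[support] / pi1[support]
expected_is_estimate = float(np.sum(pi1 * weights * expected_reward))
```

Under these assumptions the estimator's expectation is 0.55 while the true value of Pi 2 is 1.55, so the estimate is biased but is a valid lower bound because all rewards are non-negative.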

## Importance Sampling in Reinforcement Learning

- The discussion introduces the idea that this approach can be applied to reinforcement learning (RL) policy evaluation. The method involves using importance sampling, also known as inverse propensity weighting, to evaluate policies based on historical data or data gathered through specific designs. (58m48s)
- In reinforcement learning, the value of a policy is considered as an expectation over all possible trajectories generated by that policy from an initial state, multiplied by the reward of those trajectories. This involves reweighting the probability of obtaining a particular trajectory under the behavior policy versus the target policy. (59m40s)
- Samples from a behavior policy can be used to approximate expectations by reweighting them to match a target policy, allowing for the computation of trajectory probabilities under different policies. This process does not require knowledge of the dynamics model, provided there is coverage of the trajectories. (1h1m3s)
- The probability of a trajectory given a policy and action is calculated as the product of transition probabilities and the probability of taking an action in a given state. This can be applied to both behavior and target policies, and under certain conditions, the dynamics model cancels out, simplifying the process. (1h1m26s)
- The method of reweighting trajectories to evaluate policies was introduced in reinforcement learning by Doina Precup, Richard Sutton, and Satinder Singh in 2000. It is unbiased and corrects for distribution mismatch without requiring the Markov assumption. (1h3m15s)
- Per-decision importance sampling is an extension that reduces variance by strategically placing weights, similar to policy gradient methods. This is particularly beneficial for long sequences. (1h3m51s)
- High variance is a common issue in Monte Carlo methods, and concentration inequalities like Hoeffding's inequality can be used to obtain confidence intervals, though they can be problematic for long horizons in importance sampling. (1h4m24s)
- Various extensions exist to address these challenges, and if a Markov structure is present, it can be leveraged to improve the process. (1h4m58s)
- The discussion highlights the use of state distributions instead of trajectories, which can be beneficial in reducing variance, along with statistical methods like doubly robust estimation. This approach also attempts to integrate methods that assume a Markov property with those that do not. (1h5m7s)
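
Because the transition probabilities appear in both the numerator and denominator of the trajectory probability ratio, only the policy probabilities are needed to compute the weights. The sketch below shows ordinary (trajectory-wise) and per-decision importance sampling; the function names, data layout, and toy problem are assumptions for illustration.

```python
import numpy as np

def trajectory_is(trajectories, pi_e, pi_b, gamma=1.0):
    """Ordinary IS: weight each full return by the product of all action ratios."""
    estimates = []
    for traj in trajectories:  # traj is a list of (state, action, reward)
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e[s, a] / pi_b[s, a]
            ret += gamma**t * r
        estimates.append(weight * ret)
    return float(np.mean(estimates))

def per_decision_is(trajectories, pi_e, pi_b, gamma=1.0):
    """Per-decision IS: the reward at step t is weighted only by ratios up to t."""
    estimates = []
    for traj in trajectories:
        weight, total = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e[s, a] / pi_b[s, a]
            total += gamma**t * weight * r
        estimates.append(total)
    return float(np.mean(estimates))

# Toy problem (hypothetical): one state, two actions, horizon 2.
pi_b = np.array([[0.5, 0.5]])     # uniform behavior policy
pi_e = np.array([[0.75, 0.25]])   # evaluation policy
rewards = [1.0, 0.0]              # reward of each action
# Enumerate every behavior trajectory once; under uniform pi_b they are equally
# likely, so the sample mean equals the estimator's expectation.
all_trajs = [[(0, a, rewards[a]), (0, b, rewards[b])] for a in (0, 1) for b in (0, 1)]
v_is = trajectory_is(all_trajs, pi_e, pi_b)
v_pdis = per_decision_is(all_trajs, pi_e, pi_b)
```

Both estimators recover the evaluation policy's true value of 1.5 here. On randomly sampled data they agree only in expectation, and the per-decision form typically has lower variance on long horizons.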

## Offline Policy Learning and Coverage

- Offline policy learning is explored, emphasizing the importance of building simulators from historical data. While these simulators can be useful, they may introduce bias, which can affect policy selection. Model-free methods and importance sampling are also discussed as techniques to obtain unbiased estimates, though they may have high variance. (1h5m32s)
- The concept of coverage is crucial when evaluating policies, particularly in scenarios like intensive care units where interventions such as antibiotics, mechanical ventilation, and vasopressors are used. The data must support the policy being evaluated, meaning there should be a non-zero probability of each action in the dataset. If an action, like using a vasopressor, is not present in the behavior data, it cannot be evaluated for future use. (1h6m13s)
- Understanding sufficient coverage in real datasets can be challenging, as it involves determining the necessary data support for evaluating actions. This complexity arises from the need to assess whether rare actions in the dataset provide adequate coverage for policy evaluation. (1h7m2s)
- Up to around 2020, most methods for off-policy evaluation, whether model-based or model-free, assumed overlap between the policy of interest and the behavior policy. This assumption required that all possible policies in a domain had coverage with the behavior policy. However, real-world datasets often do not involve complete random exploration, making this assumption difficult to satisfy. (1h7m44s)
- When using certain methods in reinforcement learning, there is a risk of exploring parts of the domain with insufficient data coverage, which can lead to suboptimal policy decisions. A proposed solution is to adopt a pessimistic approach, especially in offline reinforcement learning, where additional data cannot be obtained. This involves being cautious in areas with high uncertainty regarding rewards. (1h8m31s)
- The concept of "doing the best with what you got" is introduced, which focuses on leveraging datasets with partial coverage to achieve the best possible outcomes within the data's support. This approach is similar to a KL constraint or policy clipping but is applied entirely in the offline context. (1h8m53s)

## Chain MDP Example and Pessimistic Approach

- The chain Markov Decision Process (MDP) is used as an example to illustrate the challenges in learning optimal policies. In this scenario, there is an initial state S0, and under a certain policy, there are probabilities of transitioning to states S1, S2, etc., or to state S10. Most states have deterministic rewards, except for one state with an expected reward of 0.8 and another with 0.5, which can mislead policy learning due to stochasticity. (1h9m43s)
- It is observed that some conservative reinforcement learning algorithms exhibit unexpected behavior as the amount of behavior data increases. Initially, these algorithms may learn the optimal policy, but with intermediate data amounts, they can be misled by stochastic rewards, leading to suboptimal policy decisions. Only with a significant increase in data do these algorithms eventually identify the best policy. (1h11m0s)
- There is a concern about the lack of monotonic improvement in performance when using certain methods with increasing amounts of behavior data, as some methods exhibit performance challenges across various examples. (1h12m2s)
- A key idea introduced is to adopt a pessimistic approach when encountering state-action pairs that have not been frequently observed. This involves using a filter function with a threshold on the density of observed data, assigning a value of one if the density exceeds the threshold and zero otherwise. (1h12m29s)
- This approach is combined with Bellman backups, where if a transition leads to a state with insufficient data, the reward is assumed to be zero. This discourages actions that transition to poorly understood states, providing a lower bound on potential rewards. (1h13m5s)
- The method assumes rewards are positive, allowing for a pessimistic estimate of potential rewards. It can be applied to both policy evaluation and policy gradient or Q-learning methods, and is termed "Marginalized Behavior Supported Policy Optimization." (1h13m41s)
- Unlike previous methods that required data coverage assumptions for all possible policies, this approach focuses on policies with sufficient coverage, guaranteeing the best policy within that class. It also offers finite sample guarantees under certain assumptions, similar to fitted Q evaluation. (1h14m21s)
- The method includes function approximation and is not limited to tabular data. In a case study with the Hopper environment, it was observed that using DDPG performed worse than the behavior policy, while behavior cloning performed slightly better or about the same. (1h15m11s)
- Scott Fujimoto's work and the approach plotted in green are highlighted as performing substantially better in certain cases, emphasizing the importance of using methods that consider uncertainty. (1h15m42s)
- Three related papers were released in the same year, including a model-free approach and a model-based approach by Chelsea Finn and colleagues, which penalized model uncertainty during planning, showing promising results on D4RL benchmark tasks. (1h16m12s)
- Conservative Q-learning emerged around the same time and remains popular, focusing on being conservative in approach. (1h16m46s)
- Pessimistic approaches generally outperform alternatives, with various methods showing different levels of success in different settings. Emphasizing uncertainty and penalizing functions to stay within supported domains is beneficial. (1h17m2s)
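
The filter-plus-Bellman-backup idea described in this section can be sketched in the tabular case. This is a simplified illustration of the description above, not the published algorithm: the function name, the use of empirical frequency as the density estimate, and the toy data are assumptions.

```python
import numpy as np

def pessimistic_q_iteration(transitions, n_states, n_actions, threshold,
                            gamma=0.9, n_iters=200):
    """Q-iteration where insufficiently observed (s, a) pairs are valued at zero."""
    counts = np.zeros((n_states, n_actions))
    for s, a, r, s_next, done in transitions:
        counts[s, a] += 1
    # Filter: 1 where the empirical density of (s, a) meets the threshold, else 0.
    zeta = (counts / len(transitions) >= threshold).astype(float)

    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        target_sum = np.zeros_like(Q)
        for s, a, r, s_next, done in transitions:
            # Unsupported next actions contribute 0 (a lower bound when rewards >= 0).
            bootstrap = 0.0 if done else np.max(zeta[s_next] * Q[s_next])
            target_sum[s, a] += r + gamma * bootstrap
        Q = np.where(counts > 0, target_sum / np.maximum(counts, 1), 0.0)
    # Act greedily, but only among supported actions.
    policy = np.argmax(zeta * Q, axis=1)
    return Q, zeta, policy

# Toy data (hypothetical): action 0 is seen nine times with reward 0.5, while
# action 1 is seen once with a lucky reward of 1.0.
data = [(0, 0, 0.5, 0, True)] * 9 + [(0, 1, 1.0, 0, True)]
Q_pess, zeta, greedy = pessimistic_q_iteration(
    data, n_states=1, n_actions=2, threshold=0.2)
```

A plain greedy policy on the empirical Q-values would pick action 1 based on a single sample. The filtered version treats that action as unsupported and stays with the well-covered action 0.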

## Offline Policy Learning with Constraints and Future Directions

- Offline policy learning can be extended to include constraints, such as ensuring performance improvement over baselines, demonstrated using a diabetes insulin management simulator approved by the FDA. This approach allows for learning new policies with confidence in their superiority over existing ones. (1h18m26s)
- Importance sampling is defined and applied for policy evaluation, with an understanding of some limitations of previous works. (1h19m22s)
- Offline reinforcement learning (RL) has the potential to outperform imitation learning, and the concept of pessimism under uncertainty is introduced. (1h19m28s)
- Offline RL or offline policy evaluation is particularly useful in high-risk settings. (1h19m40s)
- Future discussions will focus on strategies for data gathering to efficiently learn policies. (1h19m46s)