# Stanford CS234 Reinforcement Learning | Introduction to Reinforcement Learning | 2024 | Lecture 1

## Introduction to Reinforcement Learning and its Applications

- The course CS234 focuses on reinforcement learning, which involves an automated agent learning through experience to make good decisions. This concept is central to achieving general artificial intelligence. (10s)
- Reinforcement learning is crucial for making decisions based on perception and information, which is a fundamental aspect of intelligence. This area has seen significant progress in perceptual machine learning, such as identifying objects like faces and cars. (20s)
- The study of reinforcement learning addresses the challenge of making decisions under uncertainty and limited data, a question that has been explored since the 1950s, notably by Richard Bellman. Bellman's equation is a key concept in this field. (1m50s)
- There are two main motivations for studying reinforcement learning: understanding intelligence and solving practical problems. It has been used to achieve unprecedented performance in various domains. (2m5s)
- A notable success of reinforcement learning is in the board game Go, which is more complex than chess. About eight to nine years ago, DeepMind developed a system using reinforcement learning and Monte Carlo tree search to create an AI that plays Go better than any human. (2m47s)
- This result arrived earlier than many had expected. Reinforcement learning has also had a significant impact in other fields, such as fusion science. (3m50s)
- Reinforcement learning has been applied to fusion science to address energy challenges by controlling elements within a vessel using coil controllers, as demonstrated in a Nature paper from two years ago. (4m32s)
- A notable application of reinforcement learning was in COVID-19 testing in Greece, where it was used to optimize testing resources and control the epidemic, as detailed in a paper by Stanford graduate Hamsa Bastani and her colleagues. (5m22s)
- ChatGPT is a prominent example of reinforcement learning in natural language processing, showcasing the development of more capable systems through techniques like behavior cloning or imitation learning. (6m8s)
- The process of training systems like ChatGPT involves using prompts to generate responses, illustrating the practical application of reinforcement learning in creating advanced language models. (7m5s)

## Imitation Learning and Behavior Cloning

- The initial approach discussed involves treating the problem as a supervised learning task, where input is used to produce output, referred to as imitation learning or behavior cloning. This method has been applied in natural language processing but had limitations in effectiveness. (7m19s)
- A subsequent approach involves explicitly considering utility or rewards by building a model of rewards, which is related to model-based reinforcement learning. This involves collecting preference data from people to learn a preference model, which is a new focus in the class. (7m42s)

## Reinforcement Learning from Human Feedback (RLHF)

- The concept of reinforcement learning from human feedback (RLHF) is introduced, which was previously applied in simulated robotics tasks. ChatGPT demonstrated significant performance improvements using this method. (8m51s)
- There has been a growing interest in reinforcement learning, with a notable increase in research papers and community size, partly due to successes in tasks like Atari video games and AlphaGo. (9m33s)
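
The idea of learning a preference model from human choices can be sketched in a few lines. This is a minimal illustration, not the method used for ChatGPT: it assumes a Bradley-Terry model (a common choice, but not one named in the lecture summary), a single scalar reward per response, and a small hypothetical dataset of pairwise preferences.

```python
import math

# Hypothetical preference data: each (winner, loser) pair records that a
# person preferred response `winner` over response `loser`.
preferences = (
    [("A", "B")] * 3 + [("B", "A")] * 1
    + [("A", "C")] * 2 + [("C", "A")] * 1
    + [("B", "C")] * 2 + [("C", "B")] * 1
)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_reward_model(pairs, steps=2000, lr=0.1):
    """Fit one scalar reward per response by gradient ascent on the
    Bradley-Terry log-likelihood, where
    P(winner preferred over loser) = sigmoid(r[winner] - r[loser])."""
    rewards = {item: 0.0 for pair in pairs for item in pair}
    for _ in range(steps):
        grads = {item: 0.0 for item in rewards}
        for winner, loser in pairs:
            p = sigmoid(rewards[winner] - rewards[loser])
            # Gradient of log(p) with respect to each reward.
            grads[winner] += 1.0 - p
            grads[loser] -= 1.0 - p
        for item in rewards:
            rewards[item] += lr * grads[item]
    return rewards


rewards = fit_reward_model(preferences)
ranking = sorted(rewards, key=rewards.get, reverse=True)
print(ranking)
```

In an RLHF pipeline, a learned reward of this kind would then serve as the objective that a reinforcement learning algorithm optimizes; here the rewards are only used to rank three fixed responses.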

## Skepticism and Importance of Reinforcement Learning

- Despite the growing interest and successes in reinforcement learning, there are skeptics, as highlighted by a notable talk by Yann LeCun in 2016 at a major machine learning conference. (10m43s)
- Yann LeCun, a prominent figure in neural network research and a Turing Award winner, delivered a notable talk at NeurIPS where he used the metaphor of a cake to describe the roles of different types of machine learning. He suggested that unsupervised learning is the main body of the cake, supervised learning is the icing, and reinforcement learning is just the cherry on top, indicating its relatively minor role in the field. (10m55s)
- The talk highlighted the importance of understanding where different machine learning techniques are most effective and where the most progress can be made in advancing AI. (12m9s)

## Course Overview and Motivation

- A poll was conducted to understand why participants were interested in taking the class, with various reasons ranging from curiosity to professional requirements. Participants were encouraged to share their motivations. (12m34s)
- The discussion emphasized the importance of understanding what reinforcement learning is about, the types of questions it addresses, and its potential applications, while also encouraging creativity in finding new applications for reinforcement learning. (16m28s)

## Key Concepts in Reinforcement Learning

- Reinforcement learning involves key concepts such as optimization, delayed consequences, exploration, and generalization. (16m56s)
- Optimization in reinforcement learning focuses on making the best decisions by having a clear notion of utility, allowing for direct comparison of different solutions, such as finding the minimum distance route between two cities. (17m4s)
- Delayed consequences refer to the impact of current decisions on future outcomes, which can be challenging due to the complexity of planning and the difficulty of temporal credit assignment, where it is hard to determine which actions led to specific outcomes. (17m45s)
- Planning in reinforcement learning involves reasoning about long-term ramifications, similar to playing chess, where the rules are known but determining the optimal move is complex. (18m25s)
- Temporal credit assignment is difficult because it involves figuring out which past actions caused future outcomes, a challenge faced in both learning and decision-making processes. (18m43s)
- Exploration in reinforcement learning is about learning through direct experience, akin to learning to ride a bike by trying and failing, emphasizing that knowledge is limited to what is directly experienced. (19m34s)
- The concept of exploration highlights the idea that one cannot know the outcomes of untried actions, as illustrated by the inability to understand the counterfactual scenario of attending a different institution. (19m58s)
- Causal inference presents a significant challenge in reinforcement learning, as it involves learning from the actions taken by an agent without knowing the outcomes of alternative actions. This is exemplified by a company giving promotions to customers without knowing the results if the promotions were not given. (20m22s)
- Exploration is a key aspect of reinforcement learning that differentiates it from previous approaches, as it involves learning decision policies that map experiences to decisions. This is necessary for solving large and complex problems. (21m4s)

## Comparison with Other AI/ML Approaches

- The complexity of decision-making in reinforcement learning is illustrated by the example of a video game, where the vast number of possible input images makes it impractical to pre-program responses. This necessitates the use of deep neural networks to handle the combinatorial explosion of scenarios. (21m33s)
- Reinforcement learning is compared to other AI and machine learning approaches, such as AI planning and supervised learning. AI planning involves optimization and handling delayed consequences, while supervised learning involves learning from labeled data. Reinforcement learning encompasses elements of both, including learning from experience and generalization. (22m35s)
- Unlike supervised learning, which uses correct labels, and unsupervised learning, which lacks labels, reinforcement learning focuses on learning from experience and generalization without explicit labels. This approach is increasingly being mapped to imitation learning. (23m29s)
- Imitation learning, also known as behavior cloning, involves reducing reinforcement learning to supervised learning by using expert trajectories, which are demonstrations of good policies. This approach allows the system to mimic expert behavior, such as driving a car, by learning from these demonstrations. (23m49s)
- The concept of reduction is emphasized as a powerful tool in computer science, where problems are often reduced to others to leverage existing solutions. In reinforcement learning, this involves reducing it to other problems, particularly in theoretical aspects. (24m31s)
- Imitation learning is not a separate technique but an application of supervised learning within the reinforcement learning context. It involves using demonstrations to bypass the need for exploration and delayed consequences, focusing instead on replicating observed good behavior. (25m10s)
- An example of imitation learning is using human driving data to teach a car how to drive. By recording a good driver's actions, the system can learn to steer the wheel correctly at each point, using these demonstrations as a guide for developing a good policy. (26m2s)
- In imitation learning, the goal is to optimize for good performance by learning from good trajectories, aiming to develop a policy that performs well rather than imitating poor performance. (26m59s)
- In decision-making contexts, there is often a real-valued scalar utility that measures the quality of decisions, unlike in classification tasks where outcomes are typically binary, such as determining if an image is a cat or not. (27m38s)
- Imitation learning involves learning from demonstrations, such as driving behavior, and can sometimes be more straightforward than reinforcement learning. However, reinforcement learning has the potential to match or exceed the performance of imitation learning, depending on the algorithm used. (28m46s)
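
The reduction from decision-making to supervised learning described above can be sketched as follows. This is a minimal illustration under stated assumptions: the state and action names are invented for the example, and the "learner" is a per-state majority vote rather than the neural network a real driving system would use.

```python
from collections import Counter, defaultdict

# Hypothetical expert demonstrations: (state, action) pairs recorded from a
# good driver. State and action names are invented for illustration.
expert_trajectory = [
    ("drifting_left", "steer_right"),
    ("centered", "go_straight"),
    ("drifting_right", "steer_left"),
    ("centered", "go_straight"),
    ("drifting_left", "steer_right"),
    ("drifting_right", "steer_left"),
    ("centered", "steer_left"),  # an occasional noisy demonstration
    ("centered", "go_straight"),
]


def behavior_cloning(demonstrations):
    """Fit a policy by supervised learning: for each state, predict the
    action the expert chose most often in that state."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


policy = behavior_cloning(expert_trajectory)
for state, action in sorted(policy.items()):
    print(f"{state} -> {action}")
```

The learned policy is only defined for states that appear in the demonstrations, and no reward signal or exploration is involved: the demonstrations play the role of labels.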

## Advantages of Reinforcement Learning over Imitation Learning

- Imitation learning (IL) involves training on data assumed to be good, such as internet responses, while reinforcement learning from human feedback (RLHF) involves asking people to choose preferred responses and using reinforcement learning to improve policy based on this feedback. (29m10s)
- Reinforcement learning can discover strategies not previously known to humans, as demonstrated by AlphaGo's novel strategies in the game of Go. Over-reliance on imitation learning might limit a model's ability to explore beyond human knowledge. (29m48s)
- Reinforcement learning is particularly advantageous in scenarios where there are no examples of desired behavior, such as when aiming to surpass human performance in fields like healthcare or education. It focuses on optimizing performance and can be useful in areas lacking existing benchmarks. (30m13s)
- Reinforcement learning is powerful for decision-making problems where there is no prior data, requiring learning from scratch and direct optimization. (30m52s)
- It is also effective for large search or optimization problems with delayed outcomes, as demonstrated by DeepMind's work, such as AlphaTensor, which developed a faster matrix multiplication algorithm using reinforcement learning. (31m15s)
- The approach involves having AI invent new algorithms, framing large search problems as reinforcement learning problems to make them more tractable. (32m0s)
- AlphaTensor achieves faster matrix multiplication without errors, and the process includes verifying the correctness of the results. (33m5s)

## Course Content and Logistics

- The course will cover topics such as Markov decision processes, planning, model-free policy evaluation, model-free control, and policy search, including methods like proximal policy optimization. (33m32s)
- The course will include a deep dive into offline reinforcement learning, which involves learning from a fixed amount of data to develop a good decision policy. It will also cover reinforcement learning from human feedback and direct preference optimization, which is a new addition to the course. (34m10s)
- The learning goals of the class include defining key features of reinforcement learning, specifying applications as reinforcement learning problems, implementing and coding common RL algorithms, and understanding theoretical and empirical approaches for evaluating RL algorithms. (34m40s)
- The course structure includes live lectures, three homework assignments, a midterm, a multiple-choice quiz, a final project, and optional problem sessions. These sessions are designed to explore the conceptual and theoretical aspects of the class. (35m23s)
- Education is highlighted as a significant application area for reinforcement learning, with a focus on addressing poverty and inequality. A study is mentioned that shows engaging in more activities leads to better learning outcomes compared to passive activities like watching videos or reading. (35m56s)
- Active engagement in learning, such as solving problems and participating in problem sessions, is recommended over passive activities like rewatching lectures. Engaged practice, particularly forced recall, is emphasized as an effective learning strategy. (36m43s)
- Problem sessions for the course will be announced by the end of the following day, and materials and videos will be released for those who cannot attend in person. (37m45s)
- The course includes theoretical aspects, with more theory than typical machine learning and AI classes, but not as much as an advanced seminar. Most problem sets will include one theory question, and no prior background in proofs is required. (38m15s)
- Topics such as Monte Carlo tree search, reinforcement learning from human feedback, and multi-agent systems will be covered. The course aims to help students get up to speed with the latest ideas in reinforcement learning. (38m53s)
- Five teaching assistants will support the course, and information will be available on the website and Ed. Office hours will be announced by the end of the following day. (39m14s)
- The course will cover model-based approaches, starting with models and Markov decision processes, and later discussing offline approaches and different representations in reinforcement learning. (39m44s)
- A refresher exercise will be conducted to gauge students' prior exposure to reinforcement learning, ensuring that the course content is accessible to all participants. (40m35s)

## Formulating an Educational Problem as a Reinforcement Learning Problem

- The discussion involves formulating a problem as a reinforcement learning problem or a Markov decision process, specifically in the context of education. (41m4s)
- An example is given where an AI tutor provides practice problems in addition and subtraction, rewarding the AI with a +1 if the student answers correctly and a -1 if incorrectly. (41m44s)
- Participants are encouraged to think about the state space, action space, and reward model, and to consider what a policy to optimize the expected discounted sum of rewards would look like in this scenario. (41m54s)
- A suggestion for the state space is a vector pair representing the student's proficiency in addition and subtraction, with values indicating closeness to mastery. (50m34s)
- It is noted that commercial systems use similar models, such as hidden Markov models, to assess mastery. (51m15s)
- Another proposed state space includes the student's knowledge and the questions that have already been asked. (51m30s)
- The discussion explores the concept of capturing a student's knowledge through a history of questions and answers, highlighting the challenge of representing this history as it grows unboundedly. Techniques like using an LSTM or summarizing the state are suggested to manage this complexity. (51m40s)
- The actions available to a teaching agent are described as posing addition or subtraction questions, with a reward model that gives a plus one if the student answers correctly. (52m51s)
- A dynamics model is introduced, which describes how the state of a student changes after a question is given. This model aims to reflect the student's increased knowledge or mastery of a topic after answering questions. (53m16s)
- A potential issue with the reward system is identified, where an agent might focus on giving easy questions to maximize its reward, as seen in a referenced paper where the agent was rewarded based on the time taken to solve problems. This could lead to a strategy of only presenting simple questions to students. (54m15s)
- The concept of "reward hacking" is introduced, where the specified reward does not lead to the desired behavior. An example is given where a system designed to help students learn might inadvertently encourage them to focus only on addition to maximize rewards, rather than learning both addition and subtraction. This issue will be explored further in the course. (55m8s)
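
The tutoring formulation and the reward hacking issue can be made concrete with a small sketch. Everything numeric here is an assumption made for illustration: the initial proficiencies, the fixed gain per practice question, the five-step horizon, and the use of an undiscounted sum. Only the state (a proficiency pair), the two actions, and the +1/-1 reward come from the discussion above.

```python
# Hypothetical student model: the state is a pair of proficiencies in [0, 1],
# read as the probability of answering a question on that topic correctly.
INITIAL_STATE = {"addition": 0.9, "subtraction": 0.4}
PRACTICE_GAIN = 0.1  # assumed proficiency gain per practice question


def expected_reward(proficiency):
    """Expected reward for one question: +1 if correct, -1 if incorrect."""
    return proficiency * 1.0 + (1.0 - proficiency) * -1.0


def rollout(policy, horizon=5):
    """Return (expected total reward, final state) for a fixed policy.

    `policy` maps the current state to the topic of the next question.
    """
    state = dict(INITIAL_STATE)
    total = 0.0
    for _ in range(horizon):
        topic = policy(state)
        total += expected_reward(state[topic])
        # Assumed dynamics: practicing a topic raises proficiency in it.
        state[topic] = min(1.0, state[topic] + PRACTICE_GAIN)
    return total, state


def always_addition(state):
    return "addition"


def always_subtraction(state):
    return "subtraction"


easy_return, easy_final = rollout(always_addition)
hard_return, hard_final = rollout(always_subtraction)
print(f"always addition:    expected return {easy_return:.2f}")
print(f"always subtraction: expected return {hard_return:.2f}")
print(f"subtraction proficiency: {easy_final['subtraction']:.2f} "
      f"vs {hard_final['subtraction']:.2f}")
```

Under these assumptions the policy that only asks the easy topic collects more reward, while the student's subtraction proficiency never changes. The specified reward and the intended outcome (learning both skills) have come apart.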

## Sequential Decision-Making and Markov Decision Processes (MDPs)

- Sequential decision-making under uncertainty involves an agent taking actions and receiving observations and reward signals. The goal is to maximize the total expected future reward, balancing long-term and short-term rewards. (55m55s)
- Examples of sequential decision-making include web advertising, where companies like Amazon optimize for metrics such as click time, view time, or revenue, and robotics, where a robot might receive feedback from a camera image and be rewarded for tasks like clearing dishes from a counter. (56m57s)
- In robotics, a poorly specified reward could lead to undesirable behavior, such as a robot pushing dishes off a counter instead of properly cleaning them. A better reward would ensure the dishes are placed in a dishwasher and cleaned. (57m38s)
- Another example is blood pressure control, where an agent might recommend actions like exercise or medication, with feedback being the resulting blood pressure levels. (58m15s)
- In reinforcement learning, agents make sequences of decisions under uncertainty, typically over a discrete series of time steps rather than in continuous time. The process involves the agent taking an action, the world updating based on that action, and emitting an observation and reward, which the agent uses to make subsequent decisions, forming a feedback cycle. (58m44s)
- The history \(H_t\) consists of all previous actions, observations, and rewards up to the current time point. This history can be used to make decisions, but it is often more practical to use a sufficient statistic to summarize the history. (59m26s)
- The Markov assumption is commonly used to simplify decision-making by assuming that the current state \(S_t\) is a sufficient statistic of the history. This means the future is independent of the past given the present, allowing the agent to make decisions based on the current state without needing the entire history. (1h0m28s)
- A state \(S_t\) is considered Markov if the probability of transitioning to the next state, given the current state and action, is the same as if conditioned on the entire history: \(p(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid h_t, a_t)\). This allows for a more compact representation of the state space, such as considering only recent information like blood pressure over the last two hours instead of all past data. (1h0m44s)
- There is a distinction between state and observation. For example, in the context of Atari video games, the state might be represented by the last few frames of the game, which the agent uses to make decisions. (1h1m41s)
- Using four frames instead of one in reinforcement learning can provide information about an object's velocity and acceleration, which is important for understanding temporal differences in the state. (1h1m58s)
- The approach of using historical data as part of the state is popular because it is simple and can often be satisfied, impacting computational complexity, data requirements, and performance. There are trade-offs between using small, easy-to-learn states and capturing the complexity of the world. (1h2m26s)
- In sequential decision-making processes, key questions are whether the state is Markov and whether the world is partially observable. A related difficulty is reward attribution: it may be hard to determine which decisions led to a particular reward. (1h3m20s)
- An example is given where qualifying for the Boston Marathon results in a reward, but it is unclear which specific actions, such as eating well or training, contributed to achieving that state. This issue is independent of the Markov assumption. (1h4m9s)
- The concept of partial observability is mentioned, but it will not be the focus of the class. It relates to situations where there is a latent state that cannot be directly accessed. (1h5m4s)
- Observations in tasks like addition or subtraction can be noisy due to human error, even though the underlying knowledge is present. This concept is relevant in robotics, where a robot using a laser rangefinder may not uniquely identify its state due to similar environments, leading to partially observable cases. (1h5m24s)
- The distinction between deterministic and stochastic dynamics is important. Deterministic scenarios, like placing a piece on a Go board, have predictable outcomes, whereas stochastic scenarios, like flipping a coin, have uncertain outcomes. (1h6m29s)
- Actions can influence only the immediate reward, or both the immediate reward and the next state. For example, in online advertising, showing an ad to a user impacts immediate reward but not future states, resembling a bandit problem. (1h6m49s)
- A Mars Rover is used as an example of a Markov Decision Process (MDP), where the state is the Rover's location on Mars, actions include trying to move left or right, and rewards are based on visiting interesting field sites. The MDP involves understanding the dynamics and reward model, which describe how the state evolves and the rewards received. (1h7m32s)
- In reinforcement learning, a dynamics model specifies the distribution of possible next states given a current state and action. For example, a Mars Rover might have a 50% probability of moving right and a 50% probability of moving left or staying in the same location due to its inaccuracy. (1h8m49s)
- The reward model predicts the immediate reward based on the current state and action. It can be a function of the current state, the action taken, or both the current and next states. Different conventions exist in reinforcement learning literature, with the most common being a function of the state and action. (1h9m19s)
- A stochastic Mars Rover model might involve starting in state S1 and attempting to move right, with some probability of reaching state S2 or staying in the same state. This model represents the agent's understanding of the world, which may differ from reality due to learning from experiences or having an inaccurate model. (1h10m0s)
- In a model-based Markov system, the agent has a specific representation of the dynamics model and assumptions about how rewards work. A policy in this context is a mapping from states to actions, which can be deterministic (a single action per state) or stochastic (randomized actions). (1h11m6s)
- An example of a policy for the Mars Rover could be always attempting to move right, regardless of its current state. This requires specifying the action or distribution of actions for every state. (1h11m51s)
- The discussion covers different types of policies in reinforcement learning, including the exploration of various policies over time to find an optimal one. This involves changing policies and evaluating their effectiveness. (1h12m28s)
- Two central questions in reinforcement learning are evaluation and control. Evaluation involves assessing the quality of a fixed policy, while control focuses on finding the best policy through trial and error. (1h13m1s)
- The complexity of problems in reinforcement learning will be addressed, including planning and control, and the use of tabular and function approximation methods to solve large problems. (1h13m41s)
- Planning assumes given models of dynamics and rewards, aiming to find a good policy, whereas learning involves making decisions to gather information for identifying an optimal policy. (1h14m24s)
- The initial focus will be on simple settings with finite states and actions, using models of the world to evaluate and compute optimal policies, which is akin to AI planning. (1h14m44s)
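
The Mars Rover example, with its dynamics model, reward model, and policy, can be sketched as a small tabular MDP. This is an illustrative sketch under assumptions that are not fixed by the summary above: seven locations in a line, field sites at both ends with made-up reward values, a reward that depends only on the current state, and a move that succeeds with probability 0.5 and otherwise leaves the rover in place.

```python
import random

N_STATES = 7          # assumed number of locations, S1..S7
ACTIONS = ("left", "right")
SUCCESS_PROB = 0.5    # assumed chance that an attempted move succeeds
# Assumed reward model R(s): interesting field sites at both ends.
REWARD = {1: 1.0, 7: 10.0}


def dynamics(state, action):
    """Dynamics model P(s' | s, a) as a dict {next_state: probability}."""
    step = -1 if action == "left" else 1
    target = min(N_STATES, max(1, state + step))
    dist = {state: 1.0 - SUCCESS_PROB}
    dist[target] = dist.get(target, 0.0) + SUCCESS_PROB
    return dist


def reward(state):
    """Immediate reward as a function of the current state only."""
    return REWARD.get(state, 0.0)


# Deterministic policy: exactly one action for every state.
always_right = {s: "right" for s in range(1, N_STATES + 1)}


def sample_episode(policy, start, horizon, rng):
    """Run the agent-world loop and return a list of (state, action, reward)."""
    history, state = [], start
    for _ in range(horizon):
        action = policy[state]
        history.append((state, action, reward(state)))
        dist = dynamics(state, action)
        next_states = list(dist)
        weights = [dist[s] for s in next_states]
        state = rng.choices(next_states, weights=weights)[0]
    return history


episode = sample_episode(always_right, start=4, horizon=10, rng=random.Random(0))
for state, action, r in episode:
    print(f"S{state} {action} reward={r}")
```

Because the next-state distribution depends only on the current state and action, the state here is Markov by construction. Evaluation would ask how good `always_right` is; control would search over all such state-to-action mappings.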

## Markov Processes and Reward Processes

- The concept of Markov processes will be introduced, starting with Markov chains and building up to Markov Decision Processes (MDPs). Evaluation can be viewed as a Markov reward process. (1h15m6s)
- A Markov chain is a memoryless random process with a finite set of states, represented by a transition matrix that indicates the probability of moving to the next state from a given state. (1h15m20s)
- In a Markov reward process, a Markov chain is combined with a reward function that evaluates the quality of each state. A discount factor is also introduced, which will be explained further. There are no actions involved, and the reward function can be expressed as a vector. (1h16m25s)
- The concept of a horizon is introduced, which refers to the number of time steps in each episode. This can be either finite or infinite, determining how many decisions can be made. The return, denoted as \(G_t\), is the discounted sum of rewards from the current time step \(t\) to the end of the horizon. (1h17m8s)
- The value function is defined as the expected return, which may differ from the actual return due to the stochastic nature of the trajectories. This stochasticity results in varying rewards. (1h17m30s)
- The discount factor is used to weigh earlier rewards more heavily than later ones. This is mathematically convenient, especially for infinite time steps, and reflects human and organizational behavior, which often values immediate rewards more than future ones. (1h17m51s)
- For finite episode lengths, a discount factor \(\gamma\) of one can be used, meaning no discount is applied. However, for infinite horizons, it is important to have a discount factor less than one to keep returns bounded, so that policies can be compared by finite values. (1h18m20s)
- Future discussions will focus on computing the value of Markov reward processes and connecting them to decision processes. (1h18m49s)
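
The definitions of return and value can be illustrated with a short sketch. The return is \(G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots\), and the value of a state is the expected return when starting from it. The three-state Markov reward process below, its transition matrix `P`, reward vector `R`, and the Monte Carlo averaging are all assumptions made for illustration; computing values exactly is left to later lectures.

```python
import random

# Hypothetical 3-state Markov reward process, invented for illustration.
# P[i][j] is the probability of moving from state i to state j.
P = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]
R = [0.0, 0.0, 1.0]  # reward vector: one reward per state
GAMMA = 0.9


def discounted_return(rewards, gamma):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def sample_rewards(start, horizon, rng):
    """Sample one trajectory and return the rewards collected along it."""
    state, rewards = start, []
    for _ in range(horizon):
        rewards.append(R[state])
        state = rng.choices(range(len(P)), weights=P[state])[0]
    return rewards


def estimate_value(start, horizon, n_episodes, rng):
    """Monte Carlo estimate of the value: the average of sampled returns."""
    total = 0.0
    for _ in range(n_episodes):
        total += discounted_return(sample_rewards(start, horizon, rng), GAMMA)
    return total / n_episodes


rng = random.Random(0)
one_return = discounted_return(sample_rewards(0, 50, rng), GAMMA)
value_estimate = estimate_value(0, 50, 2000, rng)
print(f"one sampled return: {one_return:.3f}")
print(f"estimated value of state 0: {value_estimate:.3f}")
```

A single sampled return generally differs from the estimated value because trajectories are stochastic. With rewards of at most 1 and \(\gamma = 0.9\), every return is bounded by \(1 / (1 - \gamma) = 10\), which is why a discount factor below one keeps infinite-horizon returns finite.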