# Stanford CS234 Reinforcement Learning I Tabular MDP Planning I 2024 I Lecture 2

## Introduction and Course Overview

- The lecture begins with a refresh on conceptual understanding from previous lectures, using a platform called Ed for participation and logging responses. Participation points are calculated based on a percentage of completed activities, but these are optional. (5s)
- A question is posed regarding Markov Decision Processes (MDPs), specifically about the impact of a large discount factor gamma, which influences the weight of short-term versus long-term rewards. It is clarified that a large gamma means long-term rewards are more influential. (56s)
- The lecture discusses the concept of monotonic improvement in decision-making systems, questioning whether algorithms can be constructed to guarantee better decision policies with additional computation or iteration. (1m29s)
- A poll question reveals a misunderstanding about the effect of the discount factor gamma, with the correct answer being that a large gamma places more weight on long-term rewards. (3m8s)
- Clarification is provided on the distinction between reinforcement learning and other forms of AI and machine learning, particularly regarding the notion of optimization, which was previously confusing. (3m51s)
- The discussion emphasizes the importance of metrics and optimization in decision-making, highlighting that it involves more than just minimizing loss functions and can include various scalar values or multiple objectives. (4m24s)
- The initial weeks of the course may overlap with other classes, such as Stanford's CS238, but will focus more on theoretical aspects and algorithm properties, with the rest of the course content differing significantly. (5m1s)
- The course begins with simpler systems, like the seven-state Mars rover, to build foundational knowledge in decision processes, which are crucial for understanding advanced topics like AlphaGo and reinforcement learning from human feedback. (5m51s)
- The session focuses on making good decisions using a Markov Decision Process (MDP), which involves understanding the quality of decision policies and identifying optimal ones, given a model of the world's dynamics and reward system. (6m44s)
- Previous discussions covered Markov processes and Markov reward processes, which are useful for evaluating decision policies, and these concepts will be applied to Markov Decision Processes (MDPs). (7m16s)

## Markov Reward Processes (MRPs)

- The concept of discounting rewards is discussed, where rewards are reduced by a factor of gamma, which is less than one, leading to rewards eventually having zero value as the horizon approaches infinity. This is relevant in the context of a Markov reward process, where decisions are not yet involved, and the focus is on the expected discounted sum of rewards starting from a given state. (7m49s)
- The value function is introduced as a way to compute the expected reward from a state, considering the Markov property, which states that the future is independent of the past given the present. This allows for the calculation of expected future rewards by considering immediate rewards and the probabilities of transitioning to subsequent states. (8m2s)
- The Bellman equation is mentioned as a fundamental concept for computing the value of states in a Markov reward process. It involves summing the immediate reward and the discounted value of possible next states, weighted by their transition probabilities. (9m42s)
- In a tabular world, where each state can be represented separately, the value function can be expressed as a matrix equation. This involves calculating the value of each state as the immediate reward plus the discounted transition probabilities to all next states. (9m53s)
- The process of solving for the value function involves inverting a matrix equation, where V is the value function, gamma is the discount factor, P is the transition probability matrix, and R is the reward function. This approach is applicable when the world is small enough to allow for direct computation. (10m30s)
- The analytical solution to find the value function of a Markov Reward Process (MRP) requires a matrix inverse, which can be expensive to compute, especially for large state spaces (11m38s).
- In practice, the matrices are usually invertible, but the state spaces are often too large to compute the inverse directly (12m31s).

## Solving MRPs: Iterative Approach

- A second approach to solve the problem is to use dynamic programming and design an iterative algorithm, which avoids the matrix inverse and is more efficient for large state spaces (12m46s).
- The iterative algorithm initializes the value of each state to zero and then iteratively updates the value function using the formula VK(s) = R(s) + γ * ∑[P(s'|s) * V(K-1)(s')] (13m4s).
- The algorithm continues to update the value function until it converges, and the time complexity is O(s) for each iteration (13m48s).

## Markov Decision Processes (MDPs)

- A Markov Decision Process (MDP) is similar to an MRP but includes actions, and the dynamics transition model depends on the action taken (14m3s).
- In an MDP, the reward is a function of the state and the action taken, and the process is defined by a tuple (S, A, Dynamics model, Reward model, γ) (14m43s).
- The dynamics model in an MDP can be stochastic, and the reward function needs to be specified for each state-action pair (15m4s).
- The text discusses the components of a Markov Decision Process (MDP), which include the state space, action space, reward function, dynamics model, and discount factor Gamma. Once these are defined, the MDP is fully specified. (15m10s)

## Policies and MDPs

- Policies are introduced as a way to make decisions based on the current state. A policy can be deterministic or stochastic, with a focus on stochastic policies in the discussion. (15m29s)
- An MDP combined with a policy forms a Markov Reward Process (MRP). This transformation allows for the application of techniques used for MRPs to evaluate the value of a policy within an MDP. (16m1s)
- Policy evaluation in an MDP involves calculating the probability of taking an action in a given state and the expected discounted sum of rewards. This process is similar to an MRP but considers the probability of each action. This is referred to as a Bellman backup for a particular policy. (17m21s)
- If a policy is deterministic, the evaluation can be simplified by averaging over the rewards for specific actions. (18m12s)
- The text suggests that those unfamiliar with these concepts should practice value iteration or policy evaluation to better understand the process, although specific examples are not provided in the discussion. (18m46s)

## Optimal Policies in MDPs

- The number of policies in a given problem is a to the S, where a is the number of actions and S is the number of states, because for every single state, any of the actions can be chosen (19m41s).
- The optimal policy is not always unique and can depend on the problem, but it is not unique whenever more than one action has the same identical value (19m54s).
- Invalid actions can be dealt with in different ways, but in this particular example, it is assumed that trying to take an invalid action will result in staying in the same place (20m42s).
- In MDP control, the goal is to compute the optimal policy by taking the argmax over the policy space, which is a to the S space, and there exists a unique optimal value function (21m8s).
- The optimal policy in a tabular MDP in an infinite Horizon is unique and deterministic (21m22s).
- The optimal policy is stationary, meaning that it depends only on the state and not on the time step, because in an infinite Horizon problem, there are always an infinite number of additional time steps (21m38s).
- Policy search is an option for finding the optimal policy, and the optimality condition is defined per state, meaning that the optimal policy will be different for each state (22m11s).
- The optimal policy is defined per state, and the goal is to maximize the expected discounted sum of rewards from every state individually (22m36s).

## Monotonic Improvement and Policy Iteration

- The optimal value function in reinforcement learning is unique, but the optimal policy is not necessarily unique because there can be multiple actions with the same value. This will be proven later in the class. (22m48s)
- The goal is to develop methods and algorithms with monotonic improvement capabilities, such as policy search, to compute the optimal policy efficiently without evaluating all possible deterministic policies. (23m21s)
- Policy iteration involves alternating between evaluating a candidate decision policy and improving it until no further improvements can be made. The process starts with a random initialization of actions for each state. (23m56s)

## Policy Improvement and the Q-function

- The L1 norm is used to measure changes in the policy, and the process continues while the policy is still changing. The first step is to evaluate the policy, followed by an attempt to improve it. (24m19s)
- The Q function is defined to facilitate the policy improvement step. It represents the reward of an immediate state-action pair plus the discounted sum of future rewards if the policy is followed thereafter. (24m40s)
- The Q function simplifies the policy improvement step by allowing the computation of a new policy based on the expected discounted sum of rewards for each state-action pair. The best action is selected per state according to the Q function. (25m14s)
- The Q function, also known as the state-action value function, is related to the value function and is used to determine the best actions in policy iteration. (25m52s)
- The Q function is introduced as a key component in reinforcement learning, where the policy (\pi_{i+1}(s)) is determined by taking the action that maximizes the Q value for each state. This process is repeated iteratively. (26m10s)
- Questions arise regarding the formal properties of this approach, specifically whether it guarantees improvement. To address this, the policy improvement step is examined in detail. (26m30s)
- The Q function is computed using the old policy (\pi_i), and it is shown that the maximum Q value over actions for a given state is at least as good as the Q value for any specific action under the old policy. (26m51s)
- The equation for (Q
*{\pi*i}(s, a)) is explained, highlighting that it represents the value of taking a specific action as dictated by the old policy. (28m26s) - The new policy (\pi_{i+1}) is defined as the action that maximizes the Q function, ensuring that the expected sum of rewards is at least as good as if the old policy had been followed entirely. (29m2s)
- The policy improvement step involves extracting the new policy using the argmax of the Q function, and it is shown that this new policy is at least as valuable as the previous one. (30m7s)

## Monotonic Policy Improvement Proof

- The discussion involves the concept of policy improvement in reinforcement learning, specifically focusing on the creation of a new policy that is better or at least as good as the previous one. This is achieved by computing an argmax policy, which is not a hybrid but a completely new policy followed for all remaining steps. (30m32s)
- It is explained that simply taking one new action and then following the old policy is not sufficient to guarantee improvement. Additional work is needed to demonstrate that consistently following the new policy results in a monotonic improvement over the old policy. (31m15s)
- The process of policy improvement is described as a step that leads to a strictly monotonic improvement in the policy, meaning that each step results in a better policy for every state unless the policy has already converged. (31m33s)
- The value of the old policy is shown to be less than or equal to the value of the new policy, indicating a monotonic improvement. This is demonstrated by writing out the definition of the max over the Q function and showing that the new policy, constructed through the argmax of the Q function, is equal to or better than the old policy. (32m9s)
- The explanation includes the expansion of expressions to show that the value of following a particular policy is always less than or equal to taking the max over the Q function for that policy. This is because the max is either the same as the action taken by the policy or there is a better action available. (33m41s)
- The importance of this process is highlighted as it allows for consideration of taking new actions not just on the first step but on all future steps, ensuring continuous improvement in the policy. (34m20s)
- The discussion involves substituting a new action into a policy to demonstrate that the value of a state under a new policy is greater than or equal to the value under the old policy, leading to monotonic improvement. This is achieved by recursively taking a new action and then following the old policy, which eventually equals the value of the new policy. (34m43s)
- It is shown that policy evaluation, which involves computing the Q function and taking a maximum, results in monotonic improvement unless the policy remains unchanged. This implies that if a policy does not change, it cannot change again, and there is a maximum number of iterations for policy iteration. (36m14s)
- A question is raised about whether it is possible to remain at the same value level, given that the inequality is greater than or equal to. It is clarified that monotonic improvement occurs unless the policy is already optimal, and if there is any state where improvement is possible, it will occur. (38m13s)
- The concept of a single unique optimal policy is introduced to simplify understanding. It is suggested to consider scenarios without ties to better grasp the idea that a policy can change only once if it initially remains unchanged. (40m55s)
- When a policy does not change after a policy improvement step, it indicates that the new improved policy is the same as the old one, especially in deterministic environments with a single optimal policy. Once a policy reaches this state, it will not change again. (41m30s)
- The Q function remains the same if the policy does not change, meaning that if the old policy is the same as the new policy, the Q values for both policies are equal. This results in the policy remaining unchanged. (43m12s)
- In cases where there are ties, multiple actions can achieve the same Q function value. If these ties are broken deterministically, the policy will remain the same. Otherwise, it may oscillate between all optimal policies. (43m32s)
- In deterministic environments with a unique optimal action for each state, there is a finite number of policies. The policy improvement step either improves the policy's value or halts, meaning each policy is evaluated at most once. This ensures that policy iteration will eventually halt. (43m57s)
- Policy iteration will take at most (a^s) iterations, where (a) is the number of actions and (s) is the number of states. This is because each policy is evaluated at most once, although not all policies may be evaluated. (44m41s)
- The process guarantees monotonic improvement, meaning that each step either improves the policy or maintains it. However, the optimal policy reached may still be random, depending on the environment. (44m53s)
- In scenarios where all actions yield the same reward, acting randomly or following a policy results in the same value, and the ability to outperform random actions depends on the domain. However, generally, it is possible to achieve better outcomes through computation. (45m27s)

## Policy Iteration Convergence and Properties

- It is shown that policy evaluation, which involves computing the Q function and taking a maximum, results in monotonic improvement unless the policy remains unchanged. This implies that if a policy does not change, it cannot change again, and there is a maximum number of iterations for policy iteration. (36m14s)
- A question is raised about whether it is possible to remain at the same value level, given that the inequality is greater than or equal to. It is clarified that monotonic improvement occurs unless the policy is already optimal, and if there is any state where improvement is possible, it will occur. (38m13s)
- The concept of a single unique optimal policy is introduced to simplify understanding. It is suggested to consider scenarios without ties to better grasp the idea that a policy can change only once if it initially remains unchanged. (40m55s)
- When a policy does not change after a policy improvement step, it indicates that the new improved policy is the same as the old one, especially in deterministic environments with a single optimal policy. Once a policy reaches this state, it will not change again. (41m30s)
- The Q function remains the same if the policy does not change, meaning that if the old policy is the same as the new policy, the Q values for both policies are equal. This results in the policy remaining unchanged. (43m12s)
- In cases where there are ties, multiple actions can achieve the same Q function value. If these ties are broken deterministically, the policy will remain the same. Otherwise, it may oscillate between all optimal policies. (43m32s)
- In deterministic environments with a unique optimal action for each state, there is a finite number of policies. The policy improvement step either improves the policy's value or halts, meaning each policy is evaluated at most once. This ensures that policy iteration will eventually halt. (43m57s)
- Policy iteration will take at most (a^s) iterations, where (a) is the number of actions and (s) is the number of states. This is because each policy is evaluated at most once, although not all policies may be evaluated. (44m41s)
- The process guarantees monotonic improvement, meaning that each step either improves the policy or maintains it. However, the optimal policy reached may still be random, depending on the environment. (44m53s)
- In scenarios where all actions yield the same reward, acting randomly or following a policy results in the same value, and the ability to outperform random actions depends on the domain. However, generally, it is possible to achieve better outcomes through computation. (45m27s)

## Value Iteration

- An algorithm is discussed that improves policies with increased computation, which is beneficial when dealing with large state spaces. It allows for stopping after a set number of iterations while still ensuring policy improvement. (45m51s)
- The decision policy space is defined by the number of actions (A) and states (S), where for each state, one action is chosen, resulting in a multiplicative combination of possibilities. (46m25s)
- Policy iteration involves having an explicit policy at each time point, which dictates actions indefinitely. It calculates the Q value to determine the rewards from taking a specific action and following the policy forever, focusing on the infinite horizon case. (47m3s)
- Value iteration differs by maintaining the optimal value for starting in a state as if only a finite number of decisions can be made. It builds on the optimal actions for increasing steps, ensuring an optimal value for the wrong horizon. (47m42s)
- In value iteration, the process involves extending the decision-making horizon incrementally, using dynamic programming to build upon previous solutions for longer episodes. (48m38s)
- The Bellman equation, a foundational concept developed by Richard Bellman, is used to satisfy a particular policy and can be transformed into an algorithm through the Bellman backup operator. This operator takes a value function, initially considered as a vector, and performs a backup to determine the best action that maximizes immediate and expected future rewards. This process results in a new vector of values for all states, known as the Bellman operator. (48m55s)
- Value iteration is a recursive process that continues until convergence. It involves initializing a value function and repeatedly applying the Bellman backup for each state. The process continues until the value function stops changing, which is determined by the L Infinity norm, measuring the maximum difference between entries of two vectors. (50m2s)
- The value iteration process is compared to policy iteration, where the iteration continues until the policy stops changing. In value iteration, the process continues until the value function stabilizes, meaning the difference between the old and new values for each state becomes very small. (51m16s)
- The initial round of value iteration assumes no decisions are made, resulting in zero expected rewards. As the process progresses, it considers the best action to take if one decision is allowed, maximizing the reward times the discount factor. This iterative process refines the value function over successive rounds. (51m51s)
- The discussion involves maximizing over actions to determine the best decision when given the opportunity to make one decision, which is related to the concept of Q-values in reinforcement learning. (52m43s)
- In policy iteration, the process involves randomly initializing a policy, evaluating its value, and iteratively updating it. This includes setting initial values to zero and detecting successive policies. (53m31s)
- Policy iteration can halt when the policy or the value function stops changing, highlighting the iterative nature of the process. (54m27s)
- Policy search is popular in practice due to its monotonic improvement, unlike value iteration, which does not necessarily have this requirement. (55m0s)
- Value iteration properties are discussed, emphasizing its utility and the concept of Bellman operations, which involve committing to a particular policy and computing a fixed point through repeated application. (55m41s)
- Policy improvement can be achieved by taking the arg max instead of the max, which is similar to operations performed on the Q function. (56m29s)
- Bellman backups are typically associated with the optimal policy, although they can also be applied to a specific policy. (56m51s)
- The process of computing a value function involves performing value iteration multiple times and then extracting a policy by taking the argmax instead of the max. This approach does not compute a policy during the iterations but rather at the end. (57m1s)

## Convergence of Value Iteration

- Policy iteration is guaranteed to converge because there is a finite number of policies, and it either reaches the optimal policy or continues to improve. However, it is not immediately clear that value iteration should converge. (57m30s)
- A contraction operator is defined as an operator that, when applied to two different value functions, results in a smaller distance between them compared to before the application. The Bellman backup is an example of such an operator. (57m45s)
- The Bellman operator is proven to be a contraction operator, meaning it reduces the maximum difference between value functions over iterations. This property ensures convergence to a fixed point under certain conditions, such as a discount factor less than one or reaching a terminal state with probability one. (59m1s)
- The use of the infinity norm helps demonstrate that the Bellman operator reduces the maximum difference in values between any two states, ensuring that the distance between value functions shrinks over time. (59m59s)
- The Bellman backup operator involves two different maximizations: one over action ( a ) for the first value function and another over action ( a' ) for the second value function. This setup allows for different actions in the Bellman backups, but if the same action is used for both, the result is less than or equal to the original setup. (1h0m28s)
- By using the same action for both value functions, the maximization of the second function is not done separately, leading to a simplification where the two functions become identical. This allows the expression to be rewritten as a single maximization over action ( a ), factoring out the discount factor and summing over the probability of transitioning to the next state. (1h1m32s)
- The difference between the two value functions at any state is always less than or equal to the maximum difference across all states. This is expressed as a bound, where the difference is less than or equal to the maximum over action ( a ) of the discounted sum of probabilities of transitioning to the next state. (1h2m2s)
- The transition model's probability sums to one over all next states, allowing the expression to be simplified further, removing dependence on the action ( a ). This results in the maximum difference between the Bellman-backed value functions being no larger than the maximum difference between the initial value functions multiplied by the discount factor ( \gamma ). (1h3m3s)
- If the discount factor ( \gamma ) is less than one, the process is a contraction, meaning the maximum difference between value functions decreases over iterations. This implies that applying value iteration repeatedly will shrink the distance between successive value functions, eventually converging to a unique value function. (1h3m49s)
- The contraction property is proven, showing that even if all inequalities were equalities, the process would still be a contraction as long as ( \gamma ) is less than one. This ensures convergence to a unique value function through repeated Bellman backups. (1h4m56s)
- The discussion includes thoughts on proving the convergence of the value iteration to a unique solution in discrete state-action spaces, questioning whether initialization matters, and if the value of the policy extracted from value iteration at each round is guaranteed to monotonically improve. (1h5m7s)
- A question is raised about why taking the maximum out of the norm max action is greater than having separate inside, which is explained using the concept of a Q function and the conditions under which the maximum values are equal or different. (1h5m44s)
- The topic of whether the value of the policy extracted from value iteration is implicitly better with each iteration is discussed, noting that while this is proven for policy iteration, it has not been proven for value iteration. (1h7m32s)

## Finite Horizons and Policy Evaluation

- The concept of finite horizons is introduced, contrasting with the previously assumed infinite horizon. It is explained that for finite horizons, value iteration computes the optimal value for each horizon, and a policy can be derived from the value function for each decision point. (1h8m11s)
- A series of policies can be computed for each step in a process, and it is possible to simulate the value of a particular policy, which is useful in large state spaces. (1h9m2s)
- In policy evaluation, one can simulate the dynamics and rewards by generating numerous episodes and averaging the results to assess the policy's effectiveness. This approach is beneficial when it is difficult to explicitly define the dynamics model. (1h9m19s)
- Concentration inequalities, such as Hoeffding's inequality and Bernstein's inequality, can be used to determine the amount of data needed for an estimate to be close to the true value, and typically, not much data is required. (1h10m11s)
- Sampling is advantageous in large state spaces, such as those involving high-dimensional data, because accuracy improves with the number of samples, and it does not require assumptions about the Markov structure. (1h10m24s)
- Policy evaluation can be done to compute the value of a policy or the Q-value by starting in a state, rolling out the policy, and evaluating it, which is a popular technique in various scenarios. (1h11m15s)
- The concept of whether the optimal policy is stationary, meaning independent of time steps in finite horizon tasks, is posed as a thought question and will be explored further in homework. (1h12m0s)

## Key Concepts and Future Directions

- In the context of Markov decision processes and reinforcement learning, a model refers to a mathematical representation of dynamics and rewards, while a policy is a function mapping states to actions, which can be deterministic or stochastic. (1h12m14s)
- The value function is defined as the expected discounted sum of rewards from starting in a state and following a policy. (1h12m32s)
- Understanding key concepts such as a Markov process, Markov reward process, MDP (Markov Decision Process), Bellman operator, contraction model, Q-value, and policy is essential. (1h12m38s)
- It is important to be able to implement both value iteration and policy iteration, which will be practiced in the homework. (1h12m48s)
- There is a need to understand the strengths and weaknesses of these methods, particularly in terms of computational complexity, and to prove contraction properties. (1h12m54s)
- It is crucial to discern which methods leverage the Markov assumption and which do not require it. (1h13m3s)
- Future discussions will include topics on function approximation and learning in situations where models are unknown. (1h13m12s)