# Stanford CS234 Reinforcement Learning I Policy Search 2 I 2024 I Lecture 6

## Policy Gradient Methods and Local Optima

- The discussion focuses on policy gradient methods, emphasizing that these methods are not guaranteed to converge to a global optimum, only to a local optimum in policy space. (1m8s)
- The process involves parameterizing policies by Theta and moving in the direction of the gradient of the log policy, weighted by the return or Q function, to find parts of the policy space that yield higher estimated rewards. (1m36s)
- The session continues from a previous discussion on policy search, which involves directly searching in the policy parameterized by Theta, applicable to various policy classes such as Gaussian, softmax, or deep neural networks. (2m52s)

## Advanced Policy Gradient Methods and Variance Reduction

- The lecture covers advanced policy gradient methods, including the likelihood ratio or score function policy gradients, the introduction of a baseline to reduce variance in gradient estimates, and alternative targets. (3m26s)
- The concept of PPO (proximal policy optimization), used in applications like ChatGPT, is highlighted as a significant technique. (3m43s)
- The lecture discusses the derivative with respect to policy parameters, noting that while it provides an unbiased estimate of the gradient, it can be noisy, similar to Monte Carlo estimates. (4m8s)
- Strategies to reduce variance in estimators include leveraging temporal structure, where rewards at a certain time step do not depend on future actions, and the use of baselines to improve convergence speed. (4m27s)
- The goal is to achieve a low variance in the gradient to improve steps in the policy space. (4m52s)
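The score function (likelihood ratio) gradient mentioned above can be illustrated with a minimal sketch. This is my own illustration, not the lecture's code: it assumes a one-state softmax "bandit" policy with known per-action rewards, so the gradient can be computed exactly from the identity dJ/dTheta_k = sum_a pi(a) * dlog pi(a)/dTheta_k * R(a) and checked against finite differences.

```python
import math

def softmax(logits):
    # Numerically stable softmax over action logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def score_function_gradient(logits, rewards):
    """Exact policy gradient via the likelihood-ratio identity,
    using dlog pi(a)/dtheta_k = 1[a == k] - pi(k) for a softmax policy."""
    probs = softmax(logits)
    grad = []
    for k in range(len(logits)):
        g = sum(p * ((1.0 if a == k else 0.0) - probs[k]) * r
                for a, (p, r) in enumerate(zip(probs, rewards)))
        grad.append(g)
    return grad

def finite_difference_gradient(logits, rewards, eps=1e-6):
    """Numerical gradient of J(theta) = sum_a pi(a) R(a), for checking."""
    def J(th):
        return sum(p * r for p, r in zip(softmax(th), rewards))
    grad = []
    for k in range(len(logits)):
        up = logits[:]; up[k] += eps
        dn = logits[:]; dn[k] -= eps
        grad.append((J(up) - J(dn)) / (2 * eps))
    return grad
```

In practice the expectation over actions is estimated from sampled rollouts rather than summed exactly, which is where the Monte Carlo noise discussed above comes from.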

## Baseline Function and Unbiased Gradient Estimation

- A baseline function, denoted b(s_t), is introduced; it is a function only of the state, not of the parameters or actions. (5m24s)
- It is proven that for any choice of the baseline, as long as it is only a function of the state, the gradient estimator remains unbiased. (5m52s)
- The near-optimal choice for the baseline is the expected return, i.e., the value of the state, which allows the return of the current action to be compared against what other actions would yield on average. (6m8s)
- The process of adding a baseline does not introduce bias in the gradient estimate, as shown by demonstrating that the expectation of the subtracted term is zero. (6m24s)
- The expectation is decomposed into the states seen up to a certain time step and the states seen from that time step onwards, allowing for the separation of terms that are not functions of future states and actions. (7m20s)
- The baseline term is only a function of the current state and action, allowing it to be rewritten as an expectation over the action at time t. (8m31s)
- The expectation over actions is determined by the probability of taking each action according to the current policy. This involves using the likelihood ratio and derivatives with respect to the policy parameters, Theta. (9m21s)
- The derivative of the log of the policy is simplified by canceling terms, leading to an expression involving an expectation over states and actions. The derivative is then interchanged with the sum over actions; because the probabilities over all actions sum to one, the derivative of that constant sum is zero. (10m52s)
- The proof involves two main insights: breaking the expectation over trajectories into parts before and after a state of interest, and showing that the expectation can be rewritten in terms of actions. This allows the introduction of a baseline that only depends on the state, which does not introduce bias because its expectation is zero. (11m50s)
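The zero-expectation step in the proof above can be checked numerically. A small sketch (my own illustration, assuming a softmax policy) computes E_{a~pi}[ dlog pi(a|s)/dtheta_k * b(s) ] exactly and confirms it vanishes for any state-only baseline:

```python
import math

def softmax(logits):
    # Numerically stable softmax over action logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def expected_score_times_baseline(logits, baseline, k):
    """E_{a~pi}[ dlog pi(a)/dtheta_k * b(s) ], computed exactly.
    Since sum_a pi(a) * dlog pi(a)/dtheta_k = sum_a dpi(a)/dtheta_k = 0,
    the result is zero for any baseline that depends only on the state."""
    probs = softmax(logits)
    return sum(p * ((1.0 if a == k else 0.0) - probs[k]) * baseline
               for a, p in enumerate(probs))
```

Any constant (with respect to actions) multiplies a sum of derivatives of probabilities, and that sum is the derivative of 1, hence zero.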

## Vanilla Policy Gradient and Monte Carlo Estimation

- The vanilla policy gradient method incorporates both temporal structure and a baseline. The process involves rolling out the current policy, computing returns for each time step, and estimating the advantage by subtracting the baseline from the return. The baseline is refitted each time, and while it is always unbiased, some choices are better than others. (12m46s)
- The policy is updated using the policy gradient estimate, which is a sum of terms involving the derivative of the log of the policy parameters and the advantage. This process is repeated iteratively. (13m38s)
- The discussion addresses the absence of a discount factor in the return, noting that it is not included because the scenario assumes a fully episodic case, though it could be added if desired. (14m1s)
- The current approach resembles Monte Carlo estimation, using return estimates G_t to assess policy performance from a given time step to the episode's end, but this method can result in noisy estimates. (14m33s)
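The return and advantage computations in the vanilla method can be sketched as follows. This is a minimal illustration of the per-step bookkeeping (my own code, assuming an episodic setting with optional discounting), not the lecture's implementation:

```python
def returns_from_rewards(rewards, gamma=1.0):
    """G_t = sum_{t' >= t} gamma^(t'-t) * r_{t'} for each step t,
    computed in one backward pass (undiscounted by default,
    matching the fully episodic case discussed above)."""
    G = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def advantage_estimates(rewards, baselines, gamma=1.0):
    """Advantage estimate at each step: return-to-go minus the
    state baseline b(s_t) (here passed in as a precomputed list)."""
    return [g - b for g, b in
            zip(returns_from_rewards(rewards, gamma), baselines)]
```

The full algorithm would roll out the policy, refit the baseline to the observed returns, and use these advantages to weight the log-policy gradients.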

## Actor-Critic Methods and Variance Reduction

- Alternatives to using the return include substituting Q functions and value functions, which could potentially reduce variance in estimates. (15m14s)
- Actor-critic methods are introduced as a way to reduce variance by using state-action values, employing techniques like bootstrapping or function approximation, similar to deep Q learning. (16m21s)
- In actor-critic methods, the actor represents the policy, often parameterized by Theta, while the critic represents the state-action value function, such as a V or Q function. These methods are used in policy gradient algorithms to improve learning efficiency. (17m0s)
- The term "Critic" in reinforcement learning refers to estimating the performance of a policy, where the actor makes decisions and the critic evaluates their quality. This forms the basis of actor-critic methods, such as A3C, which are prevalent in reinforcement learning. (18m0s)

## Implementation of Actor-Critic Methods

- In actor-critic algorithms, a deep neural network can represent the policy, and another network, which may share parameters, represents the value function. This setup allows for rewriting policy gradient formulas by substituting the return with a parameterized Q function. (18m27s)
- The parameters for the policy and value function are denoted as Theta and W, respectively. The baseline can be an estimate of the value function V, allowing the formulation of a state-action advantage function by calculating the difference between Q and V. (19m4s)
- The use of the advantage function was recognized as a significant development in reinforcement learning, particularly after the emergence of deep Q learning, and was proposed to maximize advantages within a policy gradient approach. (19m29s)
- There are different methods to estimate the value of a state-action pair, ranging from Monte Carlo returns to using a Q function, with potential for blending these approaches. These are known as n-step estimators. (19m53s)

## Temporal Difference Learning and N-Step Estimators

- Temporal difference (TD) learning, often seen as TD(0), involves using the immediate reward plus a discounted value of the next state. This can be extended to n-step TD methods, where the trade-off involves taking multiple steps before bootstrapping the estimate of the return. (20m23s)
- Various types of estimators can be used, such as R̂(1), which bootstraps after one step, or R̂(∞), which corresponds to the Monte Carlo return with no bootstrapping, summing all rewards. (21m1s)
- N-step estimators are a type of estimator used in reinforcement learning, where n represents the number of time steps until bootstrapping, and they can be used to estimate the Q function or Advantage estimators by subtracting the V of the current state (21m47s).
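The n-step target described above can be sketched as follows (a hedged illustration in my own notation; `values` is assumed to be a list of state-value estimates V(s_t) along the trajectory):

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """R̂(n) at step t: r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1}
    + gamma^n * V(s_{t+n}), bootstrapping with the value estimate after
    n steps. n = 1 is the TD(0) target; n >= remaining episode length
    recovers the Monte Carlo return (no bootstrapping)."""
    T = len(rewards)
    end = min(t + n, T)
    ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:  # bootstrap only if the episode has not ended
        ret += gamma ** (end - t) * values[end]
    return ret
```

Subtracting V(s_t) from any of these targets yields the corresponding n-step advantage estimator.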

## Bias-Variance Tradeoff in Estimators

- Introducing blended estimators can lead to a tradeoff between bias and variance, and understanding this tradeoff is important (22m2s).
- The bias and variance tradeoff can be seen in the comparison between Monte Carlo methods and temporal difference methods, where Monte Carlo methods are unbiased but have high variance, and temporal difference methods have low variance but high bias (25m56s).
- When comparing the two estimators, subtracting the V function from each isolates the target term, making it easier to see which one has high variance or high bias (26m36s).
- The one-step estimator has low variance and high bias, similar to a TD(0) update, while the full-return estimator has high variance and low bias, similar to a Monte Carlo method (26m55s).
- The intuition behind the bias and variance tradeoff is that using actual values is more accurate, while bootstrapping early is more of an estimate and less accurate (27m54s).
- The text discusses the concept of value function estimates in reinforcement learning, highlighting that these estimates are approximations based on finite data and backups, which can lead to bias. (27m57s)
- It explains that using a single reward point from the true policy and bootstrapping results in high bias but low variance, as the rewards are derived from a stochastic process. (28m21s)
- In contrast, a Monte Carlo estimate, which samples multiple rewards from a stochastic process, tends to have high variance but low or zero bias because it reflects the actual return from the executed policy. (28m56s)
- The discussion emphasizes the trade-off between low variance and low bias, suggesting that a balance can be achieved by using a few steps followed by bootstrapping to minimize mean squared error. (30m6s)
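The "minimize mean squared error" criterion in the last bullet is the standard bias-variance decomposition (a standard identity, not derived in the lecture): for an estimator R̂_t of the true return R_t,

```latex
\operatorname{MSE}(\hat{R}_t)
= \mathbb{E}\big[(\hat{R}_t - R_t)^2\big]
= \underbrace{\big(\mathbb{E}[\hat{R}_t] - R_t\big)^2}_{\text{bias}^2}
+ \underbrace{\operatorname{Var}(\hat{R}_t)}_{\text{variance}}
```

An intermediate n trades a little bias (from bootstrapping) for a large reduction in variance (from summing fewer stochastic rewards), which can lower the overall MSE.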

## Advanced Policy Gradient Methods and Sample Efficiency

- The text mentions the use of different methods, such as n-step methods, in actor-critic policy gradient methods, and indicates a transition to discussing more advanced policy gradient methods aimed at improving scalability and performance. (31m9s)
- Policy gradients have limitations, including poor sample efficiency, requiring many rollouts with the same policy to get a good estimate of the gradient (32m5s).
- The distance in the parameter space does not necessarily equal the distance in the policy space, meaning small changes in the policy parameterization can lead to large differences in decision-making (32m32s).
- This lack of smoothness in the policy space can cause issues when taking gradient steps, as small changes in the policy parameterization may not lead to smooth changes in the policy decisions (33m22s).
- The sample efficiency issue arises from the need to roll out the policy multiple times to get a good estimate of the gradient, making it difficult to take multiple gradient steps (33m52s).
- The problem of on-policy expectation arises when trying to take multiple gradient steps, as the data used to estimate the gradient is generated under the current policy, not the updated policy (34m14s).
- This issue is similar to those seen in other methods, such as SARSA, where the policy is learned by executing it and gathering data (34m20s).
- The challenge is that the gradient is estimated from trajectories generated under the current policy, and then a step is taken, but the new policy may not be well-represented by the data (34m51s).

## On-Policy vs. Off-Policy Methods

- The discussion addresses the challenge of taking multiple gradient steps in reinforcement learning without acquiring new data, as current data is only available for the initial policy, Theta, and not for a new policy, Theta Prime. This is a limitation in on-policy methods, which estimate the gradient based on the policy just executed. (35m6s)
- Off-policy methods are introduced as a potential solution, allowing the use of old data to estimate gradients for different policies. However, these methods can be unstable, and the goal is to find ways to use old data to take multiple gradient steps efficiently, reducing the need for new data collection. (36m9s)

## Step Size and Policy Parameterization Challenges

- The importance of step size in stochastic gradient ascent is highlighted, particularly in policy gradient methods. An inappropriate step size can lead to a collapse in performance, as large steps might lead to regions with poor policies and value functions, making it difficult to estimate a good gradient direction. (36m51s)
- Determining the appropriate step size in policy search is challenging: too large a step size can lead to overshooting, while too small a step size slows progress. The goal is to find a balance that allows for quick convergence to a good local optimum. (38m46s)
- Optimizers like Adam can assist in the process, but they do not completely resolve the issue of limited information about states and actions visited under the current policy, which can hinder gradient estimation. (39m23s)
- The parameterization of policies, such as using a logistic function, can lead to significant changes in action probabilities with relatively small changes in parameters. This can make a policy nearly deterministic, which limits learning about other actions. (39m51s)
- The challenge lies in selecting a step size that efficiently updates the policy without making it too deterministic, while still allowing for rapid progress. The aim is to develop an update rule that increases the policy's value without excessive changes. (41m24s)
- Ideally, policy updates should be monotonic, meaning each update should improve the policy, similar to policy improvement algorithms in tabular cases. However, with continuous parameterizations, achieving this is more complex. (42m4s)
- The goal is to achieve monotonic improvement in policy search, meaning the policy should improve at each step without overstepping or crashing, which is particularly important in real-world domains such as healthcare applications (42m27s).
- Current methods may converge to a local optimum but do not guarantee monotonic improvement, and it would be beneficial to have an update step that uses all available data efficiently and respects the distance in the policy or decision space (42m55s).

## Performance Difference Lemma and Policy Comparison

- To achieve this, it's necessary to understand how the performance of two policies relates, given data from one policy, and ideally, this would allow us to determine which policy to move to next (43m48s).
- The performance difference lemma relates the performance of one policy to another policy given data from one of the policies, allowing us to compare the value of two policies (44m7s).
- The performance difference lemma states that the value of one policy (π') is equal to the expectation over trajectories sampled using π' of the advantage under policy π, which can be broken down into small differences in value between the two policies (44m39s).
- This lemma enables us to compare the value of two policies by summing up the differences in value at each time step, given the data from one policy (45m59s).
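In symbols, the performance difference lemma described above can be written as follows (using the lecture's π and π′, with A^π denoting the advantage under π):

```latex
J(\pi') - J(\pi)
\;=\; \mathbb{E}_{\tau \sim \pi'}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, A^{\pi}(s_t, a_t) \right]
```

The expectation is over trajectories sampled from π′, which is exactly the difficulty addressed next: we typically only have data from π.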

## State-Action Distributions and Discounted Future State Distribution

- The discussion involves transforming the analysis from trajectories to state-action distributions, grouping advantages for the same state and weighting by how often those states are visited under the new policy, denoted π′. (46m21s)
- The transformation involves moving from thinking about trajectories and time steps to considering a distribution over states and actions, allowing for a weighted distribution of states based on their likelihood of being visited under a policy. (46m42s)
- The concept of discount factors is introduced to ensure the distribution remains normalized, especially in the infinite horizon case where states can be visited frequently. (47m42s)
- The goal is to define the performance of a new policy π′ in terms of advantages from an existing policy π, but this requires trajectory samples from π′, which may not be available. (48m32s)
- The ultimate objective is to estimate the performance of the new policy π′ using only data from the existing policy π, to determine the best possible improvement over the previous policy. (49m11s)
- The policy is viewed from a different angle, considering the discounted future state distribution, denoted as dπ(s), which is a distribution over states (50m6s).
- The relative performance of a policy is rewritten using the discounted future state distribution, resulting in a policy performance identity that is relative to the performance of the current policy (50m20s).
- The policy performance identity is rewritten in the state-action representation, moving away from the trajectory notation, and is expressed as an expectation over the state-action distribution (51m5s).
- The expression is further rewritten using the discounted future state distribution dπ, with the expectation taken over the state-action distribution (51m54s).
- The expression is then rewritten to use the new policy, denoted as π', and the expectation is taken over the state-action distribution under the new policy (52m18s).
- The expectation is then broken down into a sum over actions, where each action is weighted by the probability of taking that action under the new policy, denoted as π'(a|s'), and the advantage function (52m43s).
- The expression is then rewritten to use data from the old policy, denoted as π, by multiplying and dividing by the same term, resulting in a reweighted expectation (53m45s).
- The reweighted expectation is then expressed as an expectation over the state-action distribution under the old policy, where the advantages are reweighted by the probability of taking each action under the new policy versus the old policy (54m18s).
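The sequence of rewrites above can be summarized in one identity (notation follows the lecture's dπ; the 1/(1−γ) factor depends on how the discounted state distribution is normalized):

```latex
J(\pi') - J(\pi)
= \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi'},\, a \sim \pi'}\big[ A^{\pi}(s,a) \big]
= \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi'},\, a \sim \pi}\!\left[ \frac{\pi'(a \mid s)}{\pi(a \mid s)}\, A^{\pi}(s,a) \right]
```

The action expectation has been moved onto π via the ratio, but the state expectation is still under d^{π′}, which motivates the approximation discussed next.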

## Importance Sampling and Policy Evaluation

- The discussion involves evaluating a new policy's effectiveness without having direct samples from it, using a method called importance sampling. This allows reweighting existing data to assess the new policy's actions. (54m56s)
- A significant issue arises because there is no data from the states visited under the new policy π′, only from the current policy π. The approach taken is to ignore this discrepancy and assume the state distributions are the same, which can introduce errors. (55m30s)
- The potential error from assuming the states are the same depends on the overlap between the state spaces visited by the two policies. If the policies visit similar areas, the error may be minimal, but if they visit different areas, the error could be significant. (56m14s)
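For a single state, the importance-sampling reweighting can be checked exactly. A small sketch (my own illustration; `pi_old`, `pi_new`, and the advantages are hypothetical discrete distributions and values):

```python
def reweighted_expectation(pi_old, pi_new, advantages):
    """E_{a~pi_old}[ (pi_new(a)/pi_old(a)) * A(a) ], computed exactly
    for one state. Importance sampling makes this equal the expectation
    of A under pi_new while only sampling actions from pi_old."""
    return sum(p_old * (p_new / p_old) * adv
               for p_old, p_new, adv in zip(pi_old, pi_new, advantages))

def direct_expectation(pi_new, advantages):
    """E_{a~pi_new}[A(a)], for comparison."""
    return sum(p * adv for p, adv in zip(pi_new, advantages))
```

The identity is exact over actions; the remaining error in the policy-improvement bound comes entirely from the mismatched state distributions.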

## KL Divergence and Performance Difference Bound

- The approximation's accuracy is related to the Kullback-Leibler (KL) divergence between the policies. The KL divergence measures how one probability distribution diverges from a second, expected probability distribution. (57m34s)
- The KL divergence is particularly considered with respect to the states sampled according to the current policy's discounted state distribution, dπ, which represents the actual discounted distribution of states visited under the current policy. (57m45s)
- The performance difference bound depends on constants that are unknown, making it difficult to determine its value, but it is still a useful theorem (58m18s).
- The bound is checkable by evaluating the KL divergence between the current policy and a new policy π′ using actual trajectories (58m30s).
- The constant C in the bound is not necessary to know for now, but it is explained in the paper (58m39s).
- The bound is a good approximation if the policies are close, and it is tight if the policies are identical (58m55s).
- KL divergence is a measure that compares two probability distributions, in this case the actions taken by the two policies π and π′ in a particular state (59m19s).
- KL divergence is zero if the two probability distributions are the same, strictly positive otherwise, and it is not symmetric (59m54s).
- The bound is useful because it allows us to think about the difference between two policies in terms of the actual decisions made in particular states, rather than just the distance in parameter space (1h0m29s).
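The KL divergence over a state's action distribution can be sketched directly (a minimal illustration for discrete action distributions, not the lecture's code):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_a p(a) * log(p(a) / q(a)) for discrete
    distributions. Zero iff p == q, strictly positive otherwise,
    and not symmetric in its arguments. Terms with p(a) = 0
    contribute zero by convention."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)
```

Extending to the bound's setting means averaging this quantity over states drawn from the current policy's discounted state distribution.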

## Monotonic Policy Improvement and Trust Region

- The bound can be used to estimate the performance improvement of a new policy using old data, and it can be computed for lots of different new policies, not just the one obtained by a single gradient step (1h1m21s).
- Evaluating the value of changing a policy to π' with respect to the current performance can be done using an expression that considers the difference in policy decisions in states that would currently be reached, and the tightness of this expression depends on how different the policies are (1h1m48s).
- This concept relates to literature on monotonic policy improvement in policy gradient methods and policy search methods, as well as the notion of a trust region, which considers how far a policy can be changed while still trusting its performance (1h2m17s).

## Proximal Policy Optimization (PPO)

- The Proximal Policy Optimization (PPO) algorithm is inspired by these concepts and aims to take multiple gradient steps while avoiding overstepping and focusing on policy parameterization in terms of actual decisions (1h2m43s).
- There are two variants of PPO: one solves an unconstrained optimization problem using an approximation, and the other uses a constraint on the KL Divergence to ensure that the policy update does not deviate too far from the current policy (1h3m5s).
- The KL Divergence is defined for a single state, but can be extended to multiple states by taking an expectation over all states that would be visited (1h4m20s).
- The PPO algorithm computes the advantages using any advantage estimation algorithm, computes the policy update, and can take multiple gradient steps while checking the KL Divergence constraint (1h5m15s).
- The algorithm does not guarantee monotonic improvement but tries to get close to it by approximately satisfying the KL Divergence constraint (1h5m3s).

## PPO Implementation Details

- The policy update process involves adjusting the penalty coefficient based on the size of the resulting Kullback-Leibler (KL) divergence, allowing for a trade-off between enforcing the KL constraint and other factors. Violations of the KL constraint are generally rare in practice. (1h5m37s)
- Multiple gradient steps are beneficial as they allow for more comprehensive updates compared to a single gradient step. (1h6m10s)
- The concept of natural gradients is mentioned as an alternative approach to taking gradient steps, but it is not discussed in detail. (1h6m23s)
- A clipped objective is introduced to manage the ratio used to weight the advantage function, which compares the likelihood of actions under old and new policies. This clipping prevents the ratio from becoming too high or too low, which can occur when policies differ significantly. (1h6m35s)
- The clipping mechanism ensures that the advantage function does not shrink towards zero or inflate excessively, which can happen if there is a large difference in actions between policies. This is related to the idea of keeping policies close, as enforced by the KL divergence constraint. (1h7m40s)
- The clipping process involves setting limits on the ratio, defined by a hyperparameter epsilon, to prevent extreme values. This constrains the policy class to make similar decisions within a specified range, ensuring stability in policy updates. (1h8m25s)
- The policy update is performed by taking an argmax over the clipped objective function, which helps maintain consistent decision-making across states. (1h9m2s)
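The clipped objective for a single state-action pair can be sketched as follows (a minimal illustration of the PPO-clip term; `ratio` is the probability ratio π′(a|s)/π(a|s) and `eps` the clipping hyperparameter):

```python
def l_clip(ratio, advantage, eps=0.2):
    """PPO clipped objective for one (s, a):
    min( r * A, clip(r, 1 - eps, 1 + eps) * A ).
    For A > 0 the gain is capped at (1 + eps) * A; for A < 0 the
    objective flattens below 1 - eps, and the min keeps the
    unclipped (more pessimistic) value when r grows large."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

In a full implementation this term is averaged over a batch of sampled state-action pairs and maximized with several gradient steps.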

## Clipped Objective and Loss Function

- The discussion involves understanding the function L clip, which is related to the advantage function in reinforcement learning. The x-axis represents the variable r, and the y-axis represents L clip. The focus is on how clipping affects the loss or objective function, particularly when the advantage function is positive or negative. (1h9m12s)
- The process involves taking the minimum between the reweighted advantage function and a clipped version of r times the advantage function. This approximates how much the policy will improve when changing the policy parameter Theta, with the aim of maximizing the objective function. (1h10m31s)
- It is noted that the qualitative behavior of L clip does not depend on the value of ε, a point on which participants agreed. When the advantage is positive, the L clip value increases linearly with r until it is clipped. When the advantage is negative, the goal is to reduce the probability of taking that action, ideally pushing r toward zero, although this is not always achievable. (1h14m30s)
- For a negative advantage, the objective is capped at 1 − ε, removing any further incentive to push r all the way to zero; this relates to making the objective pessimistic when the new policy moves far from the current parameter Theta. (1h16m15s)
- The lecture includes a demonstration of plots related to an algorithm that involves clipping, with a note that further topics like generalized advantage estimation will be covered in the next session. (1h16m42s)

## PPO Performance and Conclusion

- Trust Region Policy Optimization (TRPO) and other algorithms are compared, highlighting that Proximal Policy Optimization (PPO) with clipping generally performs better, achieving better optima faster and with less data. (1h17m14s)
- PPO is noted for its simplicity and strong performance, contributing to its popularity. References to the original paper and a 2017 blog post are suggested for further reading. (1h18m5s)
- A follow-up paper from MIT is mentioned, emphasizing the importance of understanding implementation details, hyperparameters, and architecture choices, which can significantly impact practical performance. (1h18m34s)
- The lecture concludes with a note that the information provided is sufficient for making progress on homework 2, with further discussion to continue in the next session. (1h19m10s)