# Stanford CS234 Reinforcement Learning I Policy Search 1 I 2024 I Lecture 5

## Convergence of Q-learning and TD Learning

- The discussion begins with a review of the convergence properties of Q-learning in the tabular case, where it is stated that with an appropriate learning rate schedule, Q-learning can converge to the true value of a policy when there is no function approximation involved. (5s)
- It is explained that in a batch setting, Temporal Difference (TD) learning updates are equivalent to using a certainty equivalent model, which involves estimating the dynamics and reward models from existing data and then applying dynamic programming. (1m16s)
- The limitations of Deep Q-Learning (DQL) are discussed, highlighting that even with infinite iterations, convergence to the optimal Q function is not guaranteed due to issues like realizability and the choice of function approximator. If the wrong approximator is used, such as a linear model instead of a polynomial, convergence may not occur. (1m53s)
- Despite the theoretical challenges, it is noted that empirically, DQL often performs well. (3m40s)

## Shifting Focus to Policy Search Methods

- The focus of the class has been on value function-based methods, which involve explicit representations of expected rewards. The discussion is set to shift towards policy search, which will still consider the existence of a policy. (3m49s)
- Policy search methods in reinforcement learning involve mapping states to actions or assigning probabilities to actions in each state, without necessarily having an explicit representation of the value function. These methods have become popular and important in the field. (4m17s)

## Policy Search and Online Optimization

- Reinforcement learning algorithms involve optimization, delayed consequences, exploration, and generalization. There is a consideration of whether reinforcement learning can be reduced to another problem, such as online optimization, where the goal is to search for a good policy. Policy gradient methods are related to this idea and have been influential in recent years. (4m40s)
- Policy gradient methods have been used in various areas, including sequence-level training with recurrent neural networks and end-to-end training of deep visual motor policies. This approach was influential in the robotics community about a decade ago, with notable work by Professor Chelsea Finn and others at Berkeley. (5m37s)
- An example of policy gradient methods is learning directly from images to determine robot actions, which was one of the first attempts to do so. The motivation is to learn tasks in a way that generalizes well. (6m43s)
- Proximal Policy Optimization (PPO), a method related to policy gradient methods, is used in training models like ChatGPT. These algorithms are influential because they scale well to complex inputs, such as images, high-dimensional robotic tasks, and natural language. They can be used with or without state-action values. (7m50s)

## Function Approximation and Parameterized Policies

- Function approximation can be used to approximate state-action values or value functions with a set of parameters, allowing for direct learning of these functions and generating policies from them. (8m32s)
- The discussion contrasts the epsilon-greedy method, which involves either taking the best action suggested by the Q value or acting randomly, with a new approach that directly parameterizes the policy using Theta, aiming to learn parameterized policies. This approach can be likened to deep convolutional neural networks that output either an action or a probability, with the goal of finding a policy that maximizes the value function V pi for high rewards. (8m55s)
- The focus is on learning directly from experience without assuming access to or explicitly building a model, highlighting different perspectives in reinforcement learning, such as value-based and policy-based methods. Actor-critic methods, which combine value-based and policy-based approaches, are mentioned as popular techniques, with AlphaGo cited as an example of an actor-critic method. (9m50s)

## Advantages of Stochastic Policies

- The emphasis shifts to policy-based methods, particularly stochastic policies, which are important for learning about various actions. The example of the game rock-paper-scissors is used to illustrate the advantages of stochastic policies over deterministic ones, as deterministic strategies can be easily countered by opponents. (10m45s)
- In repeated interactions, deterministic policies can be exploited by opponents, making stochastic policies optimal in certain scenarios. A uniform random policy can be a Nash equilibrium in non-Markovian systems where the adversary's actions are not purely random. (12m42s)
- Stochastic policies are beneficial in environments with aliasing or partial observability, where an agent cannot distinguish between certain states due to limited sensor information. (13m53s)
- In such cases, a value-based reinforcement learning approach may lead to suboptimal actions because the agent cannot differentiate between states that appear identical. This can result in the agent getting stuck and failing to achieve high rewards. (14m50s)
- A stochastic policy allows the agent to randomize its actions in indistinguishable states, such as moving east or west with certain probabilities, which can lead to better outcomes compared to a deterministic policy. (16m15s)
- The system discussed is partially observable and not a Markov system, which can be addressed by treating it as a partially observable Markov decision process or using a stochastic policy. (16m42s)
- Stochastic policies and policy gradient methods can handle certain problems that are difficult for other methods. (17m9s)

## Policy Value and Optimization in Episodic Environments

- In episodic environments, the policy value at the start state can be used to determine the expected reward until the end of the episode, similar to Monte Carlo methods. (17m32s)
- The focus is on episodic cases, but the methods can be extended to infinite horizon cases. (17m51s)
- The problem of finding the parameters that maximize the policy value is an optimization problem, but it is challenging because the function can only be estimated through data. (18m31s)
- Various methods can be used to solve this optimization problem, including those that exploit the structure of sequential decision processes and those that ignore gradients, such as hill climbing, genetic algorithms, and cross-entropy methods. (19m2s)
- Stephen Collings from the mechanical engineering department is mentioned as doing interesting work related to these topics. (19m31s)

## Stochastic Policies and Exoskeleton Optimization

- A stochastic policy can output a distribution over actions, allowing for random selection of actions based on computed probabilities. (19m50s)
- Exoskeletons can be beneficial for individuals with motor impairments, but they need to be personalized to each user's physical configuration to be effective. (20m18s)
- Policy search methods are used to quickly personalize exoskeleton parameters, a process termed "human-in-the-loop exoskeleton optimization," which aims to reduce the effort needed for walking. (20m50s)
- The optimization process involves testing different control parameters and adjusting policies based on effectiveness, using a method called CES (Covariance Matrix Adaptation Evolution Strategy), which does not rely on gradient methods. (21m14s)
- This approach can significantly improve metabolic efficiency, reportedly by 20% to 30%, within a few hours of optimization. (21m35s)
- The method was published in a scientific journal about seven to eight years ago and serves as an example of online optimization that does not require consideration of temporal policy structure. (21m47s)

## Online Optimization and Differentiable Methods

- Initializing policy parameterization is necessary, similar to initializing a value function or neural network parameters in other contexts. (22m1s)
- Online optimization methods like CES can be effective even without leveraging specific structures of a Markov decision process and can work with any policy parameterization, including non-differentiable ones. (22m29s)
- CES is advantageous because it can be easily parallelized, allowing multiple policies to be tested simultaneously, which is useful in scenarios with many customers or robotic arms. (23m2s)
- A limitation of CES is its lower data efficiency due to ignoring temporal structure, which might make gradient-based methods more effective in certain situations. (23m29s)

## Differentiable Methods and Gradient Descent

- The focus is on differentiable methods in reinforcement learning, specifically using stochastic gradient descent on policy parameterization, such as deep neural networks, to update parameters. (23m41s)
- The methods discussed leverage the sequential structure of decision-making to optimize sequences of decisions, with an emphasis on episodic Markov decision processes. (24m11s)
- The goal is to reach a local maximum in the policy parameter space, acknowledging that global optimality is not guaranteed, unlike in tabular cases. (24m43s)
- The approach involves taking the gradient of the policy with respect to the parameters and using a step size parameter to make small updates, focusing on smooth functions. (25m47s)
- Automatic differentiation methods are mentioned as a modern approach, but earlier research, such as a 2004 paper by Peter Stone's group, explored these concepts in robotics, particularly in teaching quadruped robots to walk quickly for competitions like RoboCup. (26m10s)
- Researchers have used policy search methods to teach robots to walk faster by parameterizing the movement of the robot's foot and adjusting these parameters through repeated trials. This process involved using finite difference methods rather than automatic differentiation, resulting in a significantly faster walking speed after about four hours of training. (27m17s)
- The approach highlights the effectiveness of data-driven methods in situations where models of the environment are poor or incomplete, such as in robotics where contact forces and unknown hardware parameters are involved. (28m20s)

## Policy-Based RL and Differentiable Policies

- Policy-based reinforcement learning (RL) offers benefits such as better convergence properties and effectiveness in high-dimensional or continuous action spaces. It allows for learning stochastic policies, although it may be inefficient and have high variance, often reaching only local optima. (28m48s)
- The discussion transitions to differentiable policies, where the policy gradient can be computed analytically, avoiding finite differences. This involves assuming the ability to compute the gradient of policy parameters, applicable to various classes like deep neural networks, softmax, and Gaussian policies. (29m23s)
- The concept of a policy class is explained as the functional form used to determine the probability of an action given a state, with examples including softmax and neural networks. (30m18s)
- A robot's action can be determined by a stochastic policy, where the action is centered around a mean value with some variability, leading to different speeds and directions. (30m52s)
- The policy is assumed to be differentiable, and the goal is to find the policy that maximizes the value function by taking derivatives with respect to the policy parameters. (31m30s)

## Policy Value and Trajectories

- The policy value is defined as the expected sum of rewards, and in the episodic case, discounting is not necessary because the episode length is finite. (31m53s)
- The value of a policy can be expressed as the state-action value averaged over the probability of taking each action under the policy. (32m40s)
- Another way to express the value is by considering trajectories, which are sequences of states and actions sampled from the policy, and calculating the reward for each trajectory. (33m21s)
- Although summing over all possible trajectories is generally intractable due to large state spaces and long trajectories, it is mathematically well-defined and can be approximated using finite samples, similar to Monte Carlo methods. (34m11s)
- The discussion presents two valid ways to define the value of starting in a state and following a policy, focusing on the probability over trajectories when executing a policy from an initial state. (35m13s)

## Likelihood Ratio Policies and Gradient Computation

- The focus is on likelihood ratio policies, aiming to maximize the probability of obtaining high-reward trajectories by considering policies that induce such trajectories through the state and action space. (35m53s)
- The process involves taking the gradient of the value function with respect to policy parameters to update these parameters and increase the policy's value. (36m41s)
- A mathematical trick is introduced, involving multiplying and dividing by the probability of a trajectory, to facilitate the computation of derivatives with respect to the log of the trajectory probability. (37m33s)
- The purpose of this approach is to handle the complexity of propagating derivatives through expectations, ultimately aiming to compute gradients from samples, which are easier to obtain. (39m9s)
- The policy is executed in the environment by sampling trajectories, which involves calculating the expectation over trajectories of the reward, weighted by the gradient of the log probability of those trajectories. This expectation can be approximated by sampling a finite number of trajectories, such as 100, to estimate the outer expectation. (39m34s)
- The likelihood ratio is introduced as a method to compute the gradient with respect to the value function for policy parameters, allowing for approximation with samples. This approach aims to make the computation feasible. (40m9s)
- The expectation is approximated using an empirical estimate by sampling a limited number of trajectories, rather than considering all possible trajectories, which is impractical, especially with vision input. (40m52s)
- The process involves decomposing the trajectory into states and actions, where a trajectory is defined by following a policy until the end of an episode. The probability of a trajectory is approximated using a Monte Carlo estimate, assuming equal probability for each trajectory, but more likely trajectories will appear more frequently in the sample set. (41m33s)
- The probability of a trajectory is expressed by considering the initial state distribution and the policy's probability of selecting actions given the current state, along with the transition probabilities to subsequent states. (42m40s)

## Trajectory Probability and Decomposition

- The discussion begins with an explanation of a trajectory in reinforcement learning, starting with a distribution over the initial state and the probability of taking certain actions under a given policy. The assumption is made that rewards are deterministic, although a reward term could be added. The focus is on determining the probability of reaching a specific state given the history of previous states and actions. (43m43s)
- The process involves decomposing terms using the property that the logarithm of a product is the sum of the logarithms. This decomposition is applied to the joint probability, and it is noted that the initial state distribution is independent of the policy, allowing certain terms to drop out when taking derivatives. This simplifies the process by removing the need for a dynamics model. (44m12s)
- The remaining term after simplification is expressed as a function of the sum of the derivatives of the log of the policy at each action point. This approach eliminates the dependency on a dynamics model, which is advantageous. (46m0s)
- A question is addressed regarding why the entire history is considered rather than just the past state and action. It is explained that the dynamics are written in a general form without assuming the Markov property, which is not necessary because the dynamics model is independent of the policy. However, the Markov assumption is made for the policy, which depends only on the current state. The policy could also depend on a history of states if desired. (47m2s)
- The term "score function" is introduced, referring to the derivative with respect to the log of the policy itself. This notation is commonly used in the context of reinforcement learning. (48m16s)

## Score Function and Policy Classes

- The score function is generally easy to compute for differentiable functions, and this is demonstrated using different policy classes, such as the softmax policy. (48m26s)
- In a softmax policy, a linear combination of features is used, where the probability of an action is proportional to the exponentiated weight of the dot product between features and parameters, allowing for a stochastic policy. A temperature parameter can also be included. (48m56s)
- The derivative of the softmax policy with respect to its parameters can be computed easily, illustrating the feasibility of taking derivatives in policy parameterization. (49m30s)
- The score function for the softmax policy is the difference between the feature and the expected value of the features over the policy. (51m28s)
- For continuous action spaces, such as those used in robotics, policies can be parameterized with a mean and variance, and the score function can be computed in closed form. This is often done using deep neural networks and automatic differentiation tools. (52m10s)

## Policy Methods and Likelihood Ratio Score Function

- Policy methods involve direct parameterization of the policy, and the value function can be expressed as a weighted sum over trajectories generated by the policy, multiplied by their rewards. (53m0s)
- The discussion involves reexpressing certain functions to eliminate the need for a dynamics model, focusing on score functions and likelihood ratio score function policy gradients. A question is posed about whether the reward function needs to be differentiable, if it can only be used with Markov decision processes, and if it is mostly useful for infinite horizon tasks. The correct answer is "none of the above." (53m17s)
- It is explained that the policy function needs to be differentiable, but the reward function does not, as it is not a function of the policy. This characteristic makes policy gradients appealing for complex reward functions. The dynamics model is not necessary, and the process does not need to be Markovian. The assumption is that the process is finite horizon to allow for episodic tasks. (54m42s)
- The intuition behind the method is described as involving a function times the derivative of the log of a probability function. This approach aims to increase the log probability of samples with high rewards, encouraging policies that visit high-reward areas in the state and action space. The reward function does not need to be differentiable and can be discontinuous or unknown, as long as samples can be obtained from it. (55m50s)
- The explanation includes a reference to slides, possibly credited to John Schulman, illustrating the combination of the probability of inputs and the reward function over trajectories, which helps in adjusting the policy parameters. (56m36s)

## Policy Gradient Theorem and Monte Carlo Methods

- The policy gradient theorem can be applied using episodic reward, average reward per time step, or average value, and involves calculating the derivative with respect to value functions, which resembles the derivative with the score function and expected value over trajectories. (57m11s)
- The current method of estimating gradients using Monte Carlo methods is unbiased but can be noisy and high variance, prompting the need for more practical approaches that leverage temporal structure and include baselines. (58m31s)
- Instead of summing rewards over an entire trajectory, a more efficient gradient estimation can be achieved by considering the gradient estimator for a single reward term at each time step, focusing on the partial trajectory up to that point. (59m22s)
- The approach involves summing over all time steps, where each time step's reward is multiplied by the score functions up to that point, allowing for a rearrangement that improves the estimation process. (1h0m20s)

## Temporal Structure and REINFORCE Algorithm

- The discussion involves score functions at each time step, where the score function at time step zero influences rewards at subsequent time points, affecting the entire trajectory of rewards. (1h1m1s)
- The temporal structure of score functions is emphasized, indicating that decisions made at future time steps cannot impact rewards at earlier time steps, allowing for a more efficient calculation of influence on rewards. (1h2m3s)
- By leveraging this temporal structure, the equation is rewritten to multiply each score function only by the rewards it influences, reducing the variance of the estimator without introducing bias. (1h2m50s)
- The concept of summing rewards from the current time step to the end, known as the return, is introduced, leading to the development of the REINFORCE algorithm. (1h3m23s)
- The REINFORCE algorithm, influential in fields like NLP and robotics, updates policy parameters using a learning rate, score function, and return from each time step to the end of the episode, ensuring convergence to a local optimum of the value function. (1h3m47s)
- This approach is known as the Monte Carlo policy gradient or REINFORCE, originating around 1992, and has inspired numerous policy gradient algorithms. (1h4m43s)

## Variance Reduction with Baselines

- Concerns are raised about the high variance in the Monte Carlo methods used for estimating returns, which can lead to inefficiencies in policy optimization. (1h5m2s)
- To address this, the introduction of a Baseline is proposed as a method to reduce variance in gradient estimates, aiming to converge more quickly to local optima. (1h5m20s)
- The Baseline is a value subtracted from the return estimate that depends only on the state, not on the policy, and it helps maintain the unbiased nature of the gradient estimator. (1h6m9s)
- The Baseline allows for a comparison of returns against expected values, helping to adjust policy parameters towards actions that yield better-than-expected returns without introducing bias. (1h7m0s)
- Future discussions will include a formal proof of the Baseline's properties and an introduction to more efficient policy optimization methods, including a topic referred to as "po," which is part of the homework assignment. (1h7m56s)