Stanford CS234 Reinforcement Learning I Exploration 1 I 2024 I Lecture 11

31 Oct 2024 (over 1 year ago)

Introduction to Reinforcement Learning and Data Efficiency

Space repetition is helpful in learning, and periodically bringing up ideas from earlier in the course can aid in understanding (36s).
Important sampling does not leverage the Markov assumption and can work with non-Markov systems, making it a general method that can be used to estimate the value of a policy through rollouts (3m21s).
Important sampling can be used to estimate the value of a policy by using the advantage function of one policy and sampling from another policy, but this method is not exact due to the difference in states visited by the two policies (3m56s).
The approximation error of using samples from one policy to estimate the value of another policy is bounded by the average difference between the two policies over the states visited by one policy (4m28s).
Prior work has shown that it is possible to bound the error in approximating the value of a policy by using samples from another policy, which is a useful insight in reinforcement learning (4m46s).
The course will now focus on learning from data that can be gathered, rather than relying on prior data or human feedback, and will discuss how to strategically gather data to learn good policies (5m15s).
The next few lectures will discuss how to gather data for online reinforcement learning, including the influence of the data gathering process on the learning of good policies (5m39s).
Computational efficiency is an important consideration in reinforcement learning, particularly in simulated environments, where computational time is equivalent to data (6m2s).
In many domains, computation is separate from actual data, such as using mobile phones for health interventions, consumer marketing, educational technology, and climate or environmental applications, where real-world data is crucial and often limited (7m5s).
In these cases, computers can be used to compute policies, and real-world data, referred to as samples or sample efficiency, is essential and needs to be utilized carefully to make good decisions (7m55s).

Challenges of Limited Data in Reinforcement Learning

The data in these domains is often limited, unlike in cases like Atari or AlphaGo, where simulators can generate infinite data (8m17s).
When working with limited data, it's essential to consider the performance of different algorithms, including whether they converge, converge to the optimal policy, and how quickly they do so (9m1s).
The convergence of an algorithm is not always guaranteed, and even if it does converge, it may not be to the optimal policy, and the speed of convergence is also crucial (9m10s).
Different reinforcement learning algorithms can be evaluated using various measures, and their performance can vary significantly, with some algorithms being smooth and consistent, while others may make periodic mistakes (9m47s).
The performance of an algorithm can be illustrated using a graph of time versus reward, showing how different algorithms can have distinct performance profiles (9m52s).
There are different types of behavior in algorithms, such as one that is always pretty good but never awesome, and one that guarantees a certain level of performance, which are important to consider when evaluating the quality of algorithms (10m25s).
The performance guarantees of an algorithm can be thought of in terms of an AI clinician that helps manage blood pressure with a certain level of accuracy, and the difference between an algorithm that is 80% accurate on average versus one that completely manages blood pressure for 80% of the population (10m32s).

Introduction to Bandits and the Multi-Armed Bandit Problem

The course will introduce different types of settings and ways to evaluate the quality of algorithms, starting with Bandits, which are simple reinforcement learning problems that have been studied since the 1920s and have been used in various application areas (11m11s).
A Bandit is a simple RL problem with a finite set of arms, or actions, and a probability distribution over rewards for each arm, where the goal is to maximize cumulative reward over time steps (11m52s).
The Bandit setting can be applied to various areas, such as clinical trials, ads, and treatment or control, where the goal is to randomize the next person or action to take (12m57s).
A running example for the course will be a silly one, where the goal is to treat patients with broken toes, and there are three options: surgery, buddy taping the broken toe to another toe, or doing nothing, with a binary outcome measure of whether the toe is healed or not after 6 weeks (13m15s).
The goal is to figure out the best strategy for healing broken toes over time, without doing a clinical trial, but rather by trying different options and observing the outcomes (13m47s).
A problem can be modeled as a multi-arm bandit with three arms, where each arm is a newly introduced variable with an unknown parameter Theta, representing the probability of a successful outcome (13m56s).
The multi-arm bandit framework is suitable for this problem because only one decision is made per patient, and the outcome for each patient is independent of the others (15m27s).
The problem does not have sequential dependence, and each decision is made independently, like a new person arriving at a time point (15m46s).
The parameter Theta can be between zero and one, indicating that the outcomes are not deterministic, and sometimes the patient will heal, while other times they won't (15m53s).
The probability distribution of the reward is assumed to be stationary, meaning it remains the same at every time step, and does not change over time (16m15s).

Greedy Algorithms and the Need for Exploration

A greedy algorithm can be used to solve the problem, which selects the action with the highest expected reward, denoted by Q, and repeats this process (17m11s).
The expected reward Q can be estimated by counting and averaging the outcomes for each action (17m0s).
The greedy algorithm starts by sampling all actions once, and then selects the action with the highest observed reward (17m47s).
In the given example, the true set of parameters shows that surgery is the most effective action, followed by body taping, and then doing nothing (17m29s).
When applying the greedy algorithm, the arm with the highest observed reward is selected next, which in this case would be arm two, with an observed reward of one (18m24s).
A greedy algorithm that always chooses the action with the highest estimated value can get stuck in a suboptimal solution, never finding the optimal action, and will not converge to the optimal action (18m31s).
This is because the algorithm's estimate of the true value of the actions is low, and it will never sample the suboptimal action again, even if it has a higher true value (19m0s).
To avoid this, some form of exploration is needed to ensure the algorithm does not make an infinite number of bad decisions (19m30s).

Regret in Bandit Problems

The concept of regret is used to quantify the difference between the decisions made by the algorithm and the optimal decisions, with regret being the opportunity loss or the difference between the expected reward of the optimal action and the expected reward of the action taken (19m44s).
Regret is often used to score the algorithm based on the gap between the optimal value and the value achieved by the algorithm (19m59s).
The optimal value is the maximum expected reward, and the regret is the difference between the expected reward of the optimal action and the expected reward of the action taken (20m10s).
Regret is typically considered in expectation, due to stochasticity in the rewards, and is used to compare the expected reward of the optimal arm to the expected reward of the suboptimal arm (20m36s).
The goal is to minimize total regret, which is equivalent to maximizing cumulative reward, and this is often the focus in bandit problems (21m7s).
The regret can be computed by comparing the expected reward of the optimal action to the expected reward of the action taken over all time steps (21m1s).
The number of times an action is selected, NTA, is used to calculate the regret, with the gap for a particular arm being the difference between the expected reward of the optimal action and the expected reward of the action taken (21m37s).
The concept of the "Gap" is introduced, which refers to the difference in expected rewards between the optimal action and the action taken, and it plays a crucial role in learning which action is better, with larger gaps requiring fewer samples to figure out and smaller gaps requiring more data (22m29s).
The Gap (G) is a function of the gaps and counts, representing the number of times each action is taken and the difference between the optimal action and the reward received (22m55s).
The expected regret is the sum of the times each action is taken multiplied by the Gap, indicating that actions with large gaps should be avoided, while actions close to the optimal action are more acceptable (23m11s).
Many algorithms aim to bound the regret quantity, which is challenging due to the unknown optimal action and its value, but algorithms can be designed to prove how the regret grows (23m35s).
The regret can be visualized as a series of actions taken over time, with the true optimal action and observed rewards influencing the regret at each step (24m5s).
The size of the regret is determined by the Gap between the optimal arm and the arm taken, and making bad decisions forever can result in linear regret (24m48s).
Ideally, algorithms should strive for constant regret or zero regret, which means making a finite number of bad decisions before learning the optimal arm and taking it forever (25m31s).
In the worst-case scenario, linear regret can occur if mistakes are made consistently (25m55s).

Epsilon-Greedy Algorithms and Regret Analysis

Epsilon-greedy algorithms are used to balance exploration and exploitation, where the greedy action is chosen with probability 1 - Epsilon, and a random action is chosen with probability Epsilon (26m20s).
The regret of Epsilon-greedy algorithms depends on the value of Epsilon, and it is possible to have linear regret if Epsilon is not chosen carefully (26m29s).
In a setting where there is always at least one action with a non-zero gap, the regret of Epsilon-greedy can be analyzed to determine if it is sublinear or linear (28m14s).
The value of Epsilon can affect the regret of the algorithm, and it is possible to have linear regret for certain values of Epsilon (28m36s).
The algorithm's performance can be evaluated by considering the probability of choosing the greedy action versus a random action (27m3s).
In a specific example, with Epsilon = 0.1, the algorithm chooses the greedy action with 90% probability and a random action with 10% probability (27m15s).
The regret of the algorithm can be computed and analyzed to determine if it is sublinear or linear (28m1s).
Epsilon-greedy algorithms have a lower bound on the number of times each arm is sampled, which is at least Epsilon divided by the number of actions times T, where T is the total number of decisions made, resulting in at least linear regret if Epsilon is greater than zero (31m47s).
If Epsilon is set statically, it can be pretty bad, and decaying Epsilon over time can make a significant difference in the algorithm's performance (32m36s).
The growth of regret over time steps can be visualized in plots, showing that greedy algorithms can have linear regret, while Epsilon-greedy algorithms can have slightly better but still linear regret, and decaying Epsilon can achieve sublinear regret (32m55s).

Problem-Dependent and Problem-Independent Regret Bounds

A good algorithm should be able to guarantee sublinear regret without knowing the problem-dependent properties in advance, such as the gaps between rewards (33m34s).
Problem-independent bounds describe how regret grows as a function of time steps for any possible problem, while problem-dependent bounds describe regret as a function of the gap between rewards (33m52s).
Problem-dependent bounds are elegant because they don't require the algorithm to know the gaps, but rather adapt to the problem's difficulty, resulting in better regret if the problem is easier to learn (34m25s).
The gap between rewards is usually less than one, and the consideration of rewards is typically between zero and one, depending on the domain (35m4s).
In certain domains, the differences or gaps between optimal and suboptimal actions can be really tiny, requiring a lot of data and smart, data-efficient algorithms to optimize, whereas in other cases, the gaps can be large, making it easier to estimate them (35m21s).
A lower bound by Robbins from the 1950s provides a minimum regret that any algorithm will suffer as a function of the problem, indicating that the regret will be at least log T, where T is the number of time steps or decisions made (36m4s).
The lower bound also depends on the gap between the optimal and suboptimal arms, with the gap in the numerator, and is considered a problem-dependent or instance-dependent bound because it holds based on the unknown gaps (36m30s).

Optimism Under Uncertainty and Upper Confidence Bound (UCB) Algorithms

The concept of optimism under uncertainty is a principle that shows why it's provably optimal to be optimistic about things, as it allows for sublinear regret (37m9s).
Optimism in this context means choosing actions or arms that might have a high value, which can result in either getting high reward or learning something and improving estimates (37m29s).
When choosing actions that might be good, two possible outcomes are getting high reward, which is the goal, or getting low reward, which allows for learning and improving estimates (37m51s).
Both outcomes are valuable from the point of view of a reinforcement learning algorithm, as either the goal is achieved or something is learned to avoid making bad decisions in the future (38m48s).
To leverage information from low rewards, an algorithm is needed to formalize uncertainty, which involves quantifying confidence intervals or uncertainty bounds to make decisions (39m4s).
The approach involves estimating an upper confidence bound for each action value, ensuring it holds with high probability, and focusing on frequentist high-probability bounds (39m30s).
The upper confidence bound (UCB) is dependent on the number of times an action has been taken, and the agent will choose the action with the highest UCB (39m48s).
UCB algorithms are a suite of algorithms that follow this notion, and there are variants, including optimism in the face of uncertainty (OFU) (40m11s).
The performance of this approach and the quantification of uncertainty will be evaluated, using Hoeffding's inequality, which is a useful inequality for understanding how different the observed average can be from the true mean (40m36s).
Hoeffding's inequality states that the difference between the empirical estimate and the true estimate decreases exponentially as the number of samples increases, ensuring convergence to the true mean (41m14s).
The inequality implies that the probability of a large difference between the empirical and true means decreases exponentially with the number of samples, allowing for the estimation of an upper bound on the true expected reward (41m52s).
The goal is to determine how big the upper bound (u) needs to be set to get an upper bound on the true expected reward, using the equation derived from Hoeffding's inequality (42m26s).
To construct an upper confidence bound that holds with a certain probability, we use the formula u = sqrt(1/n * log(2/Delta)), where u is the range, n is the number of samples, and Delta is the confidence level (43m9s).
This formula gives us a range, and if we want the probability that our empirical estimate differs from the true mean by no more than u, it is sufficient to set u equal to this value (43m47s).
We can then say that the empirical estimate is less than or equal to the expected value of x, which is less than or equal to the empirical estimate plus u, with a probability greater than or equal to 1 - Delta (44m6s).
This creates an upper confidence bound, which we can use to estimate the upper limit of the true mean (44m9s).
The UCB1 algorithm uses this upper confidence bound, adding a bonus term to the empirical average, which depends on the number of samples (44m39s).
The UCB1 algorithm is used at every time step, computing the empirical average and adding the bonus term (44m44s).
The bonus term is calculated as sqrt(2 * log(t) / n), where t is the time step and n is the number of samples (45m0s).
The UCB1 algorithm is a variant of the Upper Confidence Bound algorithm, which was first introduced in a paper in 2002 (45m31s).
The algorithm uses optimism under uncertainty, which is a concept that was around before the 2000s, but was first formally proven in the 2002 paper (45m56s).

UCB1 Algorithm and its Implementation

The UCB1 algorithm can be used in different settings, such as sampling each arm once and then computing the upper confidence bounds for each arm (46m15s).
The algorithm then selects the arm with the highest upper confidence bound, which is the argmax of the upper confidence bounds (47m10s).
Upper confidence bounds (UCB) are used to balance exploration and exploitation in reinforcement learning, with UCB A2 being 1 + √(2 * log(1/Δ)) and UCB A3 being 0 + √(2 * log(1/Δ1)) (47m32s).
The upper confidence bounds are reduced as more information is gathered, allowing for more informed decision-making (48m0s).
The value of Δ is chosen to determine the confidence bounds, with the goal of having all confidence bounds hold for all time steps and arms (48m13s).
Union bounding is used to ensure that all confidence bounds hold simultaneously, taking into account the number of arms and total decisions (Big T) (48m30s).
The confidence intervals will change over time, causing the algorithm to alternate between arms based on the upper confidence bounds (49m1s).
With a fixed number of time steps (Big T), Δ can be set to roughly Δ / (T * |A|) using a union bound, where |A| is the number of arms (49m43s).
This approach is related to false discovery and other concepts in machine learning, and is used to ensure that the confidence bounds are valid at every time step (50m1s).
The union bound is used to sum over the probability of each event, resulting in a bound of roughly the number of arms times Big T (50m29s).
Dividing Δ by (T * |A|) is generally sufficient, but can result in a larger log term, which can be mitigated using approaches such as the law of iterated logarithms (50m50s).

Proof Sketch for Sublinear Regret in UCB Algorithms

A proof sketch is provided to show how a specific type of idea can be used to achieve sublinear regret in reinforcement learning (51m18s).
The proof is based on a result from a book on banded algorithms by Tor Lattimore and Csaba Szepesvári, which provides a rigorous version of the proof (51m53s).
The result states that if an arm is suboptimal, the number of times it is pulled in Upper Confidence Bounds scales as a constant C' times log of 1/Delta, where Delta is the gap between the expected reward of the arm and the true reward (53m11s).
The constant C' does not depend on the number of arms, gaps, or other domain-specific factors, and can be a fixed value such as 37 (53m33s).
The result is interesting because it shows that if the gap is large, the suboptimal arm will be pulled fewer times, and if the gap is small, it may be sampled more (53m51s).
Combining this result with another equation can lead to a sublinear regret bound, where the total expected regret is proportional to the number of actions times a constant (54m31s).
The proof sketch focuses on showing that the number of times a suboptimal arm is pulled is bounded by a constant times log of 1/Delta (54m18s).
The book by Lattimore and Szepesvári provides a more rigorous version of the proof, which is recommended for further reading (52m15s).

Formal Proof of Sublinear Regret in UCB Algorithms

The formal proof of the total number of times a particular arm is pulled in a multi-armed bandit problem scales with one over the size of the gap squared, and this concept will be explored in more detail (55m14s).
The proof relies heavily on the Hoeffding inequality and upper confidence bounds, which provide a way to estimate the expected value of an arm and bound the probability of the true expected value being outside of a certain range (55m36s).
The Hoeffding inequality states that the difference between the true expected value of an arm and its empirical average is greater than a certain quantity (the upper confidence bound) with probability no more than 1 - Delta/T, where Delta is a small probability and T is the number of times the arm is pulled (56m26s).
The goal is to bound the number of times a suboptimal arm is pulled, which is the only case where regret occurs, and this can be done by analyzing the times when a non-optimal arm is pulled and its upper confidence bound is higher than the optimal arm's upper confidence bound (57m5s).
If the confidence interval holds, then the upper confidence bound of a suboptimal arm is higher than the true expected value of the optimal arm, and this can be used to bound the number of times the suboptimal arm is pulled (57m57s).
Under the UCB algorithm, a suboptimal arm is only pulled if its upper confidence bound is higher than the optimal arm's upper confidence bound, and this can be used to derive a bound on the number of times the suboptimal arm is pulled (58m21s).
By substituting the upper confidence bounds of the suboptimal and optimal arms, it can be shown that the number of times the suboptimal arm is pulled is bounded by a function of the gap between the two arms and the number of times the suboptimal arm is pulled (59m56s).
The upper confidence bound on the optimal action also holds, so its upper confidence bound has to be higher than its true value (1h0m12s).
Q of a plus two of the confidence intervals is going to be greater than Q of a star, which means 2 < TK C log 1 Delta over NT of a is greater than Q of a star minus Q of a, which is equal to Delta a (1h2m50s).
Rearranging the equation, we get that 4 * C log 1 / Delta / NT of a is greater than equal to Delta a 2, which means NT of a is less than or equal to 4 C log 1 / Delta (1h3m48s).
Intuitively, this means that if confidence bounds hold and are used to make decisions, then the only time a wrong decision is made is when the confidence bounds are large enough to overwhelm the gap, and the number of times this occurs is finite (1h4m6s).
The size of the confidence intervals decreases over time, eventually becoming smaller than the gap, resulting in fewer suboptimal actions being taken (1h4m36s).
The algorithm achieves logarithmic asymptotic regret as a function of log T, because the log T term is inside the number of times suboptimal actions are taken (1h5m11s).

Comparison of UCB with Other Approaches and Extensions

Upper Confidence Bound (UCB) algorithms have a nice logarithmic shape if the constants are set correctly, but the constants can greatly affect the exploration time, and being more aggressive can result in better performance (1h5m41s).
An alternative to UCB is to always select the arm with the highest lower bound, which can yield linear regret, and it's helpful to think about why this approach wouldn't work in the upper confidence bound case (1h6m11s).
To demonstrate why selecting the arm with the highest lower bound can lead to linear regret, one can construct a multi-arm bandit case where this approach would result in linear regret, similar to the example shown for the greedy algorithm (1h7m30s).
When the condition for UCB is not met, the expectation can be split into a high-probability event and a low-probability event, allowing for the regret to be bounded from those time points (1h8m41s).
To achieve linear regret when selecting the arm with the highest lower bound, the upper bound and mean of the suboptimal arm (A2) should be placed such that the algorithm is pessimistic and never selects the optimal arm (A1), for example, by having a very high upper bound for A2 (1h11m18s).
If the upper bound of A2 is higher than A1, the algorithm would learn to select A2 under UCB, but being pessimistic would prevent this, and the algorithm would never pull A2 (1h12m22s).
The pessimistic view is that learning may not occur due to not updating other bounds (1h12m38s).
The empirical average plus its upper confidence bound being bigger than the optimal arm's empirical average plus its upper bound is based on the equation that the empirical average is always less than or equal to the true value plus the upper confidence bound (1h12m54s).
Substituting the empirical average with the true value plus the upper confidence bound results in Q of A plus 2 times the bound (1h13m15s).
This algorithm has provably sublinear regret, which is a desirable property, and is also easy to implement, especially when counts are available (1h13m41s).
The optimism under uncertainty principle can be extended to more complicated settings, such as function approximation and reinforcement learning (1h13m51s).
The principle is used to make decisions when outcomes are uncertain, with the goal of reducing regret over time (1h14m1s).

Introduction to Bayesian Bandits

The next topic to be covered is Bayesian Bandits, which involves thinking of the problem from a different perspective, using a prior over the distribution of rewards for each arm (1h14m19s).
Bayesian Bandits can also involve algorithms that use prior information to quickly gather data and make good decisions, which can be similar to optimism in certain ways (1h14m33s).