I am reviewing the book Reinforcement Learning: An Introduction by Sutton and Barto, and this post covers its first two chapters. Since this is the very first note in the series, let me explain why I am writing it. First of all, RL is worth reviewing even in this era where LLM/AIGC is the new hype. Secondly, I am preparing for my job search and grad study, and reviewing the basics helps.

These notes are not aimed at complete beginners. They only cover some key points in the book, along with a few extensions of the theory.


Chapter 1: The Reinforcement Learning Problem

The first chapter is an introduction to the RL setting. Basically, the RL problem is about deciding what to do in a closed-loop system, where the learning system's outputs influence its later inputs. The agent interacts with the environment, and the environment gives feedback to the agent.

One key aspect of the RL problem is the trade-off between exploration and exploitation. To obtain high reward, the agent needs to explore the environment to find the best actions, but it also needs to exploit what it already knows to accumulate reward.

Sutton and Barto also mention that RL explicitly considers the whole problem of a goal-directed learning agent interacting with an uncertain environment. This is different from supervised learning, where the agent only needs to learn a mapping from input to output.

Key components of RL systems are:

  1. Policy - the agent’s behavior function, mapping perceived states of the environment to actions to be taken.
  2. Reward signal - a scalar feedback signal indicating how well the agent is doing at a given time step. It is immediate feedback, sent to the agent depending on the current action and current state.
  3. Value function - unlike the reward, it specifies what is good in the long run.
  4. Model - something that mimics the behavior of the environment; given a state and an action, it predicts, for example, the next state and the next reward.

There isn't much else to note in the first chapter, apart from some interesting remarks on the relationship between Optimal Control and RL.


Chapter 2: Multi-armed Bandits

The second chapter is about the simplest form of the RL problem, the multi-armed bandit problem. The agent repeatedly chooses among multiple actions, each with an unknown reward distribution, and aims to maximize the total reward it receives by finding the best action.

An $n$-armed Bandit Problem

Consider the case: an agent is faced repeatedly with a choice among $n$ different options, or actions. After each choice, the agent receives a numerical reward chosen from a stationary probability distribution that depends on the action selected. The agent’s objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.

This is the simplest form of RL, the $n$-armed bandit problem. The name comes from slot machines ("one-armed bandits") in casinos, except that here we have $n$ levers to pull.

In this problem, each action has an expected (mean) reward given that it is selected. This is called the value of that action. We do not know these values in advance.
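
To make the setting concrete, here is a minimal sketch of such a bandit in Python. I assume Gaussian reward distributions, as in the book's 10-armed testbed; the class name and parameters are my own choices, not the book's code.

```python
import numpy as np

class GaussianBandit:
    """A stationary n-armed bandit with Gaussian rewards (illustrative sketch)."""

    def __init__(self, n=10, seed=0):
        self.rng = np.random.default_rng(seed)
        # true action values q(a), unknown to the agent
        self.q_true = self.rng.normal(0.0, 1.0, size=n)

    def pull(self, a):
        # reward drawn from a stationary distribution centred on q(a)
        return self.rng.normal(self.q_true[a], 1.0)
```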

We can maintain an estimate of the value of each action and choose the action with the highest estimated value. This is called the greedy action selection method. The agent can also choose to try other actions to improve its estimates of their values; this is called exploration. Exploration may sacrifice reward in the short term, but it yields better estimates of the action values, and thus more reward, in the long term.

Action-value Methods

Denote the true value of action $a$ by $q(a)$, and the agent's estimate of it at time step $t$ by $Q_t(a)$. The agent can estimate the value by averaging the rewards it has actually received: $$ Q_t(a) = \frac{R_1 + R_2 + \cdots + R_{N_t(a)}}{N_t(a)} $$ where $R_i$ is the reward received after the $i$-th selection of action $a$, and $N_t(a)$ is the number of times action $a$ has been selected up to time step $t$. This is called the sample-average method.

When $N_t(a) = 0$, the agent can set $Q_t(a)$ to 0 or some other default initial value. As $N_t(a) \to \infty$, $Q_t(a) \to q(a)$ by the law of large numbers.

Now we consider how to select actions based on these estimates. The simplest method is the greedy method, where the agent selects the action with the highest estimated value, $A_t=\argmax_a Q_t(a)$. Alternatively, the agent can select a random action with a small probability $\epsilon$ and act greedily otherwise; this is the $\epsilon$-greedy method. It is usually the better choice, since every action is selected infinitely often in the limit, so every $Q_t(a)$ converges to $q(a)$. Another benefit of exploration is that, in non-stationary problems, it helps to track changes in the environment.
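
As a quick sketch (not the book's code), here is an $\epsilon$-greedy agent with sample-average estimates, reusing the GaussianBandit above; the function name and default parameters are my own.

```python
import numpy as np

def run_epsilon_greedy(bandit, n_actions=10, steps=1000, epsilon=0.1, seed=1):
    rng = np.random.default_rng(seed)
    Q = np.zeros(n_actions)   # estimated values Q_t(a)
    N = np.zeros(n_actions)   # selection counts N_t(a)
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))   # explore: pick a random action
        else:
            a = int(np.argmax(Q))              # exploit: pick the greedy action
        r = bandit.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]              # sample-average estimate of q(a)
    return Q

# e.g. Q = run_epsilon_greedy(GaussianBandit())
```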

Incremental Implementation

The sample-average method computes an average and naively requires storing all the rewards received so far. The average can instead be computed incrementally to save memory. The update rule can be derived as follows:

$$ \begin{aligned} Q_{k+1} & = \frac{1}{k} \sum_{i=1}^{k} R_i \\ & = \frac{1}{k} (R_k + \sum_{i=1}^{k-1} R_i) \\ & = \frac{1}{k} (R_k + (k-1) \frac{1}{k-1} \sum_{i=1}^{k-1} R_i)\quad\text{;key step} \\ & = \frac{1}{k} (R_k + k Q_k - Q_k) \\ & = Q_k + \frac{1}{k} (R_k - Q_k) \end{aligned} $$

The update rule can be written in a general form as: $$ \text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize} \times (\text{Target} - \text{OldEstimate}) $$

We call the term $(\text{Target} - \text{OldEstimate})$ the error in the estimate.

The step size is $\frac{1}{k}$ in the example above. In general, we denote the step size by $\alpha$, or $\alpha_t(a)$ for a specific time step and action. The step size can be constant, decreasing over time, or even adaptive.
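
As a tiny sketch of this general form (the helper name is mine), one Python function covers both the sample-average case and the constant step-size case used below:

```python
def update(old_estimate, target, step_size):
    # NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)
    return old_estimate + step_size * (target - old_estimate)

# sample average:      Q[a] = update(Q[a], r, 1.0 / N[a])
# constant step size:  Q[a] = update(Q[a], r, 0.1)
```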

Non-Stationary Problems

One of the most popular methods for tackling non-stationary problems is to use a constant step size. The update rule can be written as: $$ Q_{k+1} = Q_k + \alpha (R_k - Q_k) $$ where $\alpha \in (0, 1]$ is a constant step size. Unrolling this recursion:

$$ \begin{aligned} Q_{k+1} & = Q_k + \alpha (R_k - Q_k) \\ & = \alpha R_k + (1-\alpha) Q_k \\ & = \alpha R_k + (1-\alpha) (\alpha R_{k-1} + (1-\alpha) Q_{k-1}) \\ & = \alpha R_k + (1-\alpha) \alpha R_{k-1} + (1-\alpha)^2 Q_{k-1} \\ & = \alpha R_k + (1-\alpha) \alpha R_{k-1} + (1-\alpha)^2 \alpha R_{k-2} + \cdots \\ & + (1-\alpha)^{k-1} \alpha R_1 + (1-\alpha)^k Q_1 \\ & = (1-\alpha)^k Q_1 + \sum_{i=1}^{k} \alpha (1-\alpha)^{k-i} R_i \end{aligned} $$

This is a weighted average since $(1-\alpha)^k + \sum_{i=1}^{k} \alpha (1-\alpha)^{k-i} = 1$. The weight given to reward $R_i$ is $\alpha(1-\alpha)^{k-i}$, which decays exponentially as the reward recedes into the past. This method is sometimes called the exponential recency-weighted average.

To converge to the true value of action $a$ with probability 1, the step size $\alpha_t(a)$ must satisfy the Robbins-Monro conditions: $$ \sum_{t=1}^{\infty} \alpha_t(a) = \infty \quad\text{and}\quad \sum_{t=1}^{\infty} \alpha_t(a)^2 < \infty $$ The first condition guarantees that the steps are large enough to eventually overcome any initial conditions or random fluctuations; the second guarantees that the steps eventually become small enough to assure convergence.

We can see that a constant step size $\alpha$ fails the second condition, so the estimates never fully converge. But this is actually beneficial for non-stationary problems: the estimates keep responding to the most recent rewards, which lets them track changes in the environment.
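
Here is a rough sketch of that tracking behaviour on a single drifting action; the random-walk drift and the specific numbers are my own choices, not from the book.

```python
import numpy as np

rng = np.random.default_rng(0)
q_true, alpha = 0.0, 0.1
Q_sample, Q_const, n = 0.0, 0.0, 0
for _ in range(10000):
    q_true += rng.normal(0.0, 0.01)      # the true value drifts (non-stationary)
    r = rng.normal(q_true, 1.0)
    n += 1
    Q_sample += (r - Q_sample) / n       # sample average: weights all rewards equally
    Q_const += alpha * (r - Q_const)     # constant step size: favours recent rewards
print(q_true, Q_sample, Q_const)         # Q_const typically tracks q_true more closely
```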

Optimistic Initial Values

All the action-value estimates $Q_t(a)$ are initialized to the same value, $Q_1(a)$, and we say they are biased by their initial value.

We can take a closer look at the estimates. For the sample-average method, the estimate is: $$ Q_k = \frac{1}{k}\sum_{i=1}^{k} R_i $$ so the initial value disappears once the action has been selected at least once. But for the constant step-size method, the estimate is: $$ Q_{k+1} = (1-\alpha)^kQ_1 + \sum_{i=1}^{k} \alpha(1-\alpha)^{k-i}R_i $$ so the initial value has a permanent (though exponentially shrinking) effect on the estimate. We may therefore inject prior information by adjusting the initial values in the constant step-size setting.

We also have a technique called optimistic initial values, which sets the initial value of every action higher than the true values. This encourages exploration. Say the rewards are typically around 1 and we set every initial value to 10: once an action is taken, its estimate will almost certainly fall below 10, so the (greedy) agent will move on and try other actions.

This technique pushes the agent to try all actions in the early stage of learning.
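
Below is a small sketch of optimistic initialisation with purely greedy selection, reusing the GaussianBandit above; the optimistic value 10.0 follows the text, everything else is my own choice.

```python
import numpy as np

def run_optimistic_greedy(bandit, n_actions=10, steps=1000, q_init=10.0, alpha=0.1):
    Q = np.full(n_actions, q_init)   # every action starts out looking great
    for _ in range(steps):
        a = int(np.argmax(Q))        # purely greedy, yet it still explores early on
        r = bandit.pull(a)
        Q[a] += alpha * (r - Q[a])   # each pull drags Q[a] down toward reality
    return Q
```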

Upper Confidence Bound (UCB) Action Selection

While the $\epsilon$-greedy method is simple and effective, it explores indiscriminately and ignores the uncertainty in the action-value estimates. The UCB action selection method takes this uncertainty into account.

Uncertainty represents how much potential an action still has left to explore. Less-explored actions have higher uncertainty, so there is a chance that they yield higher reward than we currently estimate.

The UCB selection rule is: $$ A_t = \argmax_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right] $$ where $c > 0$ is a hyperparameter that controls the degree of exploration. The term $\sqrt{\frac{\ln t}{N_t(a)}}$ is the uncertainty term: it shrinks every time action $a$ is selected (as $N_t(a)$ grows), and it grows slowly (through $\ln t$) while $a$ is not selected.
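
A small sketch of the selection rule in Python (the function name is mine); untried actions are treated as maximally uncertain and selected first, following the book's convention.

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    # any never-tried action is considered maximizing
    untried = np.where(N == 0)[0]
    if len(untried) > 0:
        return int(untried[0])
    # otherwise maximise Q_t(a) + c * sqrt(ln t / N_t(a)), assuming t >= 1
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
```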

Gradient Bandit Algorithms

The gradient bandit algorithm gives us a first glance at the policy-gradient (policy-based) family of methods.

So far in Chapter 2, Sutton and Barto talk about action values and how to estimate them in the bandit setting. Now we consider an alternative approach, where the agent learns a numerical preference for each action and selects actions based on these preferences: actions with higher preference are chosen with higher probability.

The preference of action $a$ at time step $t$ is denoted as $H_t(a)$. The agent selects action $a$ at time step $t$ with a soft-max distribution: $$ Pr({A_t=a})=\frac{e^{H_t(a)}}{\sum_{b=1}^{n} e^{H_t(b)}} = \pi_t(a) $$ where $n$ is the number of actions. The agent updates the preference of action $a$ at time step $t$ as follows:

For the selected action $A_t$: $$ \begin{aligned} H_{t+1}(A_t) & = H_t(A_t) + \alpha (R_t - \bar{R}_t)(1-\pi_t(A_t)) \end{aligned} $$

For every other action $a \neq A_t$: $$ \begin{aligned} H_{t+1}(a) & = H_t(a) - \alpha (R_t - \bar{R}_t)\pi_t(a) \end{aligned} $$

where $\alpha$ is the step size, $R_t$ is the reward received at time step $t$, and $\bar{R}_t$ is the average of all rewards up to time step $t$. It serves as a baseline that tells how much better the current reward is than average.

In plain language: for the selected action, if the reward is better than the baseline, its preference is increased; if this action was unlikely to be selected, i.e. $\pi_t(A_t)$ is small, the increase is larger to compensate. For each unselected action, if the reward is better than the baseline, its preference is decreased; if that action was likely to be selected, i.e. $\pi_t(a)$ is large, the decrease is larger, so we push its probability down more aggressively.
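
Here is a sketch of the whole algorithm in Python, reusing the GaussianBandit above; the incremental running-average baseline and the parameter values are my own choices.

```python
import numpy as np

def run_gradient_bandit(bandit, n_actions=10, steps=1000, alpha=0.1, seed=2):
    rng = np.random.default_rng(seed)
    H = np.zeros(n_actions)          # preferences H_t(a)
    r_bar, t = 0.0, 0
    for _ in range(steps):
        pi = np.exp(H - H.max())
        pi /= pi.sum()               # soft-max policy pi_t(a)
        a = int(rng.choice(n_actions, p=pi))
        r = bandit.pull(a)
        t += 1
        r_bar += (r - r_bar) / t     # baseline: average reward so far
        one_hot = np.zeros(n_actions)
        one_hot[a] = 1.0
        # one line covers both rules: +alpha*(r - r_bar)*(1 - pi[a]) for the
        # selected action, -alpha*(r - r_bar)*pi[b] for every other action b
        H += alpha * (r - r_bar) * (one_hot - pi)
    return H
```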

This is a stochastic approximation to gradient ascent: $$ H_{t+1}(a) = H_t(a) + \alpha \frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)} $$

where $\mathbb{E}[R_t]$ is the expected reward at time step $t$.

We expand the partial derivative: $$ \begin{aligned} \frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)} & = \frac{\partial \sum_{b} \pi_t(b) q(b)}{\partial H_t(a)} \\ & = \sum_{b} q(b) \frac{\partial \pi_t(b)}{\partial H_t(a)} \quad\text{;}q(b)\text{ does not depend on }H_t(a)\\ & = \sum_{b} (q(b)-X_t) \frac{\partial \pi_t(b)}{\partial H_t(a)} \end{aligned} $$ Here we can legally subtract an arbitrary baseline $X_t$ since $\sum_b\frac{\partial \pi_t(b)}{\partial H_t(a)} = 0$: the probabilities of all actions always sum to 1, so the derivative of that sum is 0.

We go further by multiplying by $1=\frac{\pi_t(b)}{\pi_t(b)}$: $$ \begin{aligned} \frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)} & = \sum_{b} (q(b)-X_t) \frac{\partial \pi_t(b)}{\partial H_t(a)} \\ & = \sum_{b} \pi_t(b) (q(b)-X_t) \frac{\partial \pi_t(b)}{\partial H_t(a)}/\pi_t(b) \\ & = \mathbb{E}[(q(A_t) - X_t) \frac{\partial \pi_t(A_t)}{\partial H_t(a)}/\pi_t(A_t)] \quad\text{;by definition of expectation}\\ & = \mathbb{E}[(R_t - \bar{R}_t) \frac{\partial \pi_t(A_t)}{\partial H_t(a)}/\pi_t(A_t)] \quad\text{;choosing }X_t=\bar{R}_t \\ &\quad\text{ and substituting }q(A_t)\to R_t \end{aligned} $$

The substitution of $X_t=\bar{R}_t$ and $q(A_t)\to R_t$ is legal because $\mathbb{E}[R_t \mid A_t]=q(A_t)$ and the other factors are non-random given $A_t$.

Now we make a Claim, to be proved later, that $\frac{\partial \pi_t(b)}{\partial H_t(a)} = \pi_t(b)(\mathbb{I}_{a=b}-\pi_t(a))$; this is the standard derivative of the soft-max.

With this claim in hand, we substitute it into the equation: $$ \begin{aligned} & = \mathbb{E}[(R_t - \bar{R}_t)\pi_t(A_t)(\mathbb{I}_{a=A_t}-\pi_t(a))/\pi_t(A_t)] \\ & = \mathbb{E}[(R_t - \bar{R}_t)(\mathbb{I}_{a=A_t}-\pi_t(a))] \\ \end{aligned} $$

See, the above expression is the expectation of something, and a single sample of that something is $(R_t - \bar{R}_t)(\mathbb{I}_{a=A_t}-\pi_t(a))$: exactly the quantity used in the update rules above!

The only thing left is the proof of the Claim. It is somewhat involved, but it can be done with the standard quotient rule; it is on pages 45-46 of the book, so I will skip it here. The ultimate takeaway is that the gradient bandit algorithm is a stochastic approximation to gradient ascent.
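
As a quick numerical sanity check (not a substitute for the book's proof), the Claim can be compared against a finite-difference approximation of the soft-max derivative:

```python
import numpy as np

def softmax(H):
    e = np.exp(H - H.max())
    return e / e.sum()

H = np.array([0.2, -1.0, 0.5, 0.0])   # arbitrary preferences for the check
a, eps = 2, 1e-6
pi = softmax(H)
# Claim: d pi(b) / d H(a) = pi(b) * (1[a == b] - pi(a))
analytic = pi * ((np.arange(len(H)) == a).astype(float) - pi[a])
H_plus = H.copy()
H_plus[a] += eps
numeric = (softmax(H_plus) - pi) / eps
print(np.max(np.abs(analytic - numeric)))   # tiny difference: the Claim checks out
```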

So far, Chapter 2 has discussed the bandit problem, where we search for a single best action (in either stationary or non-stationary settings). But if the task changes with some context, say each time we face a different multi-armed slot machine marked by a different color, we need to associate actions with the context. This is the associative search (contextual bandit) problem.