In the last note, we covered the first two chapters of the book and discussed the tabular case of RL (bandit problems). In this note, we will discuss finite Markov Decision Processes (MDPs) and the Bellman equations.

Agent-Environment Interface, Goals and Rewards

Since this series assumes readers already have some idea that “RL learns from interactions with the environment”, we will only briefly introduce the agent-environment interface here. It can be illustrated in a diagram as below:

graph LR; A[Agent] ==action==> E[Environment]; E ==state, reward==> A;

The notation used here should be carefully specified. The agent interacts with the environment at each step of a discrete time step sequence, $t=0,1,2,3,\dots$. At each time step $t$, the agent receives a representation of the environment’s state, $S_t \in \mathcal{S}$, and on that basis selects an action, $A_t \in \mathcal{A}(S_t)$. One time step later, in part as a consequence of its action, the agent receives a numerical reward, $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$, and the environment transitions to a new state, $S_{t+1}$. The agent then receives the new state, selects a new action, and so on.

Notice that taking action $A_t$ at state $S_t$ induces reward $R_{t+1}$: the reward is determined jointly with the next state that is observed. In addition, the discrete time step sequence is the most common scheme in RL research, but continuous-time formulations are also used in some cases.
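To make the interface concrete, here is a minimal sketch of the interaction loop in Python. The Gym-style `env.reset()` / `env.step(action)` interface and the `agent` object with `act` / `observe` methods are assumptions for illustration only, not something specified by the book:

```python
# A minimal sketch of the agent-environment loop, assuming a Gym-style
# environment (reset/step) and a hypothetical agent with act/observe methods.
def run_episode(env, agent, max_steps=1000):
    state = env.reset()                                   # S_0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)                         # A_t ~ pi(.|S_t)
        next_state, reward, done = env.step(action)       # S_{t+1}, R_{t+1}
        agent.observe(state, action, reward, next_state)  # learning hook
        total_reward += reward
        state = next_state
        if done:                                          # reached a terminal state
            break
    return total_reward
```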

The agent-environment interface defines the boundary of the agent, which does not have to be a physical boundary. For example, the battery level of a robot can be considered part of the environment. Reward computation is usually considered part of the environment as well, so the agent cannot directly alter its own reward function.

In the RL setting, we have an important reward hypothesis:

That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).

This hypothesis is in some sense foundational for RL, as it provides a way to formalize the goal of RL problems. The reward hypothesis is not a complete definition of that goal, but it is a useful starting point.

Returns and Episodes

Our goal is to maximize the cumulative reward; more precisely, we maximize some function of the reward sequence $R_{t+1}, R_{t+2}, R_{t+3}, \dots$. This function is called the return $G_t$. The simplest case of the return is: $$ G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T $$ where $T$ is the final time step of the episode.

The special $T$, associated with a special state called the terminal state, leads us to the distinction between episodic tasks and continuing tasks. In episodic tasks, the agent-environment interaction breaks naturally into subsequences, called episodes, which terminate in special states called terminal states. In continuing tasks, the interaction does not break naturally into episodes but goes on continually without limit. For example, the game of chess is episodic, while a street-cleaning robot that runs throughout its entire lifetime is a continuing task.

The return in continuing tasks cannot be defined as the plain sum of rewards, as that sum may be infinite. We need to introduce a discount factor $\gamma \in [0, 1]$ (with $\gamma < 1$ for continuing tasks) to keep the return finite. The return in continuing tasks is defined as: $$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$ where $\gamma$ is the discount factor, which determines the present value of future rewards. The smaller $\gamma$ is, the more weight is placed on immediate rewards (the agent is more myopic).
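As a quick illustration, here is one way to compute a (truncated) discounted return from a list of rewards in Python; the backward recursion $G_t = R_{t+1} + \gamma G_{t+1}$ already hints at the Bellman equation that comes later. The function name and the finite reward list are assumptions for illustration:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward list
    rewards = [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    for r in reversed(rewards):     # work backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```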

Unified Notation for Episodic and Continuing Tasks

Since episodic tasks strictly speaking need an extra index for the episode, we would need notation like $S_{t,i}$ to indicate the state at time step $t$ of episode $i$. But it is often unnecessary to distinguish between episodes, so we simply use $S_t$ to represent the state at time step $t$ of any episode $i$.

We have defined the return above for finite-horizon (episodic) and infinite-horizon (continuing) tasks. The two can be unified by viewing episode termination as entering an absorbing state that transitions only to itself and generates only rewards of zero.

graph LR; S0(S0) --R1--> S1(S1) --R2--> S2(S2) --R3--> S3(S3) --R4--> absorb[absorbing state]; absorb --R5=0, R6=0 ...--> absorb;

So the return can be defined as: $$ G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}\quad \text{; including the cases of} \ \gamma = 1 \ \text{and} \ T = \infty \ \text{(but not both)} $$
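A tiny numerical check of this unification (the helper and numbers below are made up for illustration): padding an episodic reward sequence with zero rewards from the absorbing state does not change the return, so the finite and infinite-horizon formulas agree.

```python
gamma = 0.9
episode = [1.0, 1.0, 1.0]           # R_{t+1}, R_{t+2}, R_{t+3}, then termination
padded = episode + [0.0] * 100      # absorbing state keeps emitting reward 0

def ret(rs, g):                     # G_t = sum_k g^k * R_{t+k+1}
    return sum(g**k * r for k, r in enumerate(rs))

assert abs(ret(episode, gamma) - ret(padded, gamma)) < 1e-12
```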

The Markov Property

Agents can construct states. While the state signal may contain only immediate sensations, agents are also allowed to construct states, for example by building up a history of all past sensations.

The Markov property means that the state representation contains all information needed to make optimal decisions; adding more of the history does not help. Ideally (or in simple settings), the state provided by the environment already contains all the information needed to make optimal decisions. However, we need to distinguish between state and observation in more complex cases, such as partial observability.

The Markov property enables several simplifications on the math side.

Suppose there are a finite number of states and reward values. In the most general case, the dynamics can be defined as a distribution in which the environment responds to the agent’s action given everything that happened before: $$ Pr\{R_{t+1}=r, S_{t+1}=s^\prime| S_0, A_0, R_1,\dots, S_{t-1}, A_{t-1}, R_t, S_t, A_t\} $$

If the state signal has the Markov Property, the environment’s response can be simplified as: $$ p(s^\prime, r|s, a) = Pr\{R_{t+1}=r, S_{t+1}=s^\prime| S_t=s, A_t=a\} $$

The equivalence of the two formulations can be summarized as in the book (page 64): “That is, the best policy for choosing actions as a function of a Markov state is just as good as the best policy for choosing actions as a function of complete histories.”

Markov Decision Processes

If the state and action spaces are finite, we call the problem a finite Markov Decision Process (finite MDP). As we have developed, the probability of the next state and reward can be defined as: $$ p(s^\prime, r|s, a) = Pr\{R_{t+1}=r, S_{t+1}=s^\prime| S_t=s, A_t=a\} $$

Given the dynamics of the environment, we can define the expected reward function for a state-action pair: $$ r(s,a) = \mathbb{E}\{R_{t+1}|S_t=s, A_t=a\} = \sum_{r\in \mathcal{R}} r \sum_{s^\prime \in \mathcal{S}} p(s^\prime, r|s, a) $$ Notice that people often get confused by the two notations for reward functions, $r(s,a)$ and $r(s,a,s^\prime)$. We can now see that $r(s,a)$ is the expectation of $r(s,a,s^\prime)$ over the next state, where $$ r(s,a,s^\prime) = \mathbb{E}\{R_{t+1}|S_t=s,A_t=a,S_{t+1}=s^\prime\}=\frac{\sum_{r\in\mathcal{R}}rp(s^\prime,r|s,a)}{p(s^\prime|s,a)} $$

The state transition probability can be written in terms of the dynamics: $$ p(s^\prime|s,a)=Pr\{S_{t+1}=s^\prime|S_t=s,A_t=a\}=\sum_{r\in\mathcal{R}}p(s^\prime,r|s,a) $$ That is, the state transition probability is the marginal distribution of $s^\prime$ under $p(s^\prime, r|s, a)$.
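To make these definitions concrete, here is a small sketch that stores the dynamics $p(s^\prime, r|s, a)$ of a made-up two-state MDP as a dictionary and derives $r(s,a)$, $p(s^\prime|s,a)$, and $r(s,a,s^\prime)$ from it; the states, action, and numbers are purely illustrative:

```python
# Dynamics of a toy finite MDP: p[(s, a)] maps (s', r) -> probability.
p = {
    ("s0", "a"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s1", "a"): {("s1", 0.0): 1.0},
}

def expected_reward(s, a):                     # r(s, a)
    return sum(prob * r for (s2, r), prob in p[(s, a)].items())

def transition_prob(s, a, s2):                 # p(s' | s, a)
    return sum(prob for (nxt, _), prob in p[(s, a)].items() if nxt == s2)

def expected_reward_given_next(s, a, s2):      # r(s, a, s')
    num = sum(prob * r for (nxt, r), prob in p[(s, a)].items() if nxt == s2)
    return num / transition_prob(s, a, s2)

print(expected_reward("s0", "a"))                   # 0.5
print(transition_prob("s0", "a", "s1"))             # 0.5
print(expected_reward_given_next("s0", "a", "s1"))  # 1.0
```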

Value Functions

Value functions are important not only for MDPs but for other stochastic processes as well. In the RL setting, a value function is a function of states that estimates how good it is for an agent to be in a given state (or how good it is to perform a given action in a given state).

The goodness is defined as the expected future return (accumulated future rewards). Recall the policy $\pi(a|s)$, which is the distribution over actions given state $s$; the value function of a state $s$ under the policy $\pi$ is: $$ v_\pi(s) = \mathbb{E}_\pi[G_t|S_t=s]=\mathbb{E}_\pi [\sum^\infty_{k=0}\gamma^k R_{t+k+1}|S_t=s] $$

The value of a terminal state, if any, is always zero.

Similarly, for the goodness of actions, we define: $$ q_\pi(s,a)=\mathbb{E}_\pi[G_t|S_t=s,A_t=a]=\mathbb{E}_\pi[\sum^\infty_{k=0}\gamma^k R_{t+k+1}|S_t=s, A_t=a] $$
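Since the value function is just an expected return, one way to estimate it (before introducing any Bellman machinery) is to sample trajectories and average their returns, which is the Monte Carlo idea the book develops later. Below is a rough sketch, assuming the dictionary form of the dynamics $p(s^\prime,r|s,a)$ from the previous section and a tabular policy `pi[s][a]`; both representations and all names are assumptions for illustration:

```python
import random

def sample_return(p, pi, s, gamma=0.9, max_steps=200):
    """Sample one discounted return G_t starting from state s under policy pi."""
    g, discount = 0.0, 1.0
    for _ in range(max_steps):
        if s not in pi:                               # terminal state: stop
            break
        actions, a_probs = zip(*pi[s].items())
        a = random.choices(actions, a_probs)[0]       # A_t ~ pi(.|s)
        outcomes, o_probs = zip(*p[(s, a)].items())
        s, r = random.choices(outcomes, o_probs)[0]   # (S_{t+1}, R_{t+1}) ~ p(.,.|s,a)
        g += discount * r
        discount *= gamma
    return g

def mc_value(p, pi, s, gamma=0.9, episodes=10_000):
    """Monte Carlo estimate of v_pi(s): the average of sampled returns."""
    return sum(sample_return(p, pi, s, gamma) for _ in range(episodes)) / episodes
```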

The fundamental (and nice) property of the value function is that it is recursively defined. $$ \begin{aligned} v_\pi(s)&=\mathbb{E}_\pi[G_t|S_t=s]\\ &=\mathbb{E}_\pi[\sum^\infty_{k=0}\gamma^k R_{t+k+1}|S_t=s]\\ &=\mathbb{E}_\pi[R_{t+1} + \gamma\sum^\infty_{k=0}\gamma^k R_{t+k+2}|S_t=s]\\ &=\sum_a \pi(a|s)\sum_{s^\prime}\sum_{r}p(s^\prime,r|s,a)[r+\gamma\mathbb{E}_\pi[\sum^\infty_{k=0}\gamma^k R_{t+k+2}|S_{t+1}=s^\prime]]\quad(*)\\ &=\sum_a \pi(a|s)\sum_{s^\prime,r}p(s^\prime,r|s,a)[r+\gamma v_\pi(s^\prime)]\quad;\text{merged sum} \end{aligned} $$

The step marked $(*)$ shows how to expand the expectation $\mathbb{E}_\pi[\cdot]$ under $\pi$. The core idea is that the value of a state equals the expected immediate reward plus the discounted expected value of the next state.
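The recursion above is exactly the backup used in iterative policy evaluation (treated in detail in the next chapter of the book). Here is a minimal sketch, assuming the same dictionary representations of the dynamics and the policy as in the earlier sketches, and treating states without entries as terminal states with value zero; all names are illustrative:

```python
def policy_evaluation(states, actions, p, pi, gamma=0.9, tol=1e-8):
    """Iterate the Bellman expectation backup:
    v(s) <- sum_a pi(a|s) * sum_{s',r} p(s',r|s,a) * (r + gamma * v(s'))."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                pi.get(s, {}).get(a, 0.0)
                * sum(prob * (r + gamma * v.get(s2, 0.0))
                      for (s2, r), prob in p.get((s, a), {}).items())
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v
```

For example, with the toy dynamics sketched earlier and a policy that always takes action "a", the fixed point is $v_\pi(s_1) = 0$ and $v_\pi(s_0) = 0.5/(1 - 0.45) \approx 0.91$.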

Optimal Value Functions

Value functions provide a way to compare different policies, so we can say one policy is better than another. A policy is defined to be better than or equal to another policy if its value function is greater than or equal to the other policy’s value function for all states.

$$ \pi \geq \pi^\prime \quad \text{if} \quad v_\pi(s) \geq v_{\pi^\prime}(s) \quad \text{for all} \ s \in \mathcal{S} $$

We denote the optimal policy as $\pi_*$, which is the best policy among all policies. The optimal value function is denoted as $v_*$, which is the value function of the optimal policy. The optimal value function is defined as: $$ v_*(s) = \max_\pi v_\pi(s) \quad \text{for all} \ s \in \mathcal{S} $$

We also have optimal state-action value function: $$ q_* (s,a) = \max_\pi q_\pi(s,a) \quad \text{for all} \ s \in \mathcal{S} \ \text{and} \ a \in \mathcal{A}(s) $$

The optimal value function is connected to the optimal state-action value function by: $$ q_*(s,a) = \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1})|S_t=s, A_t=a] $$

In the previous section, we saw that the value function has the nice property of a recursive definition, or self-consistency. The optimal value function is a value function as well, and it obeys the Bellman Optimality Equation: $$ \begin{aligned} v_*(s) &= \max_{a\in\mathcal{A}(s)} \sum_{s^\prime, r} p(s^\prime, r|s,a)[r + \gamma v_*(s^\prime)] \end{aligned} $$

Notice that the equation can also be derived from the optimal policy’s side, very similarly to how we derived the recursion for the (regular) value function; the only difference is how the expectation $\mathbb{E}_{\pi_*}[\cdot]$ is expanded, since the optimal policy takes the maximizing action instead of averaging over actions.

Similarly, the optimal state-action value function obeys the Bellman Optimality Equation: $$ \begin{aligned} q_*(s, a) &=\sum_{s^\prime, r}p(s^\prime, r|s, a)[r + \gamma v_*(s^\prime)]\\ &=\sum_{s^\prime, r}p(s^\prime, r|s, a)[r + \gamma \max_{a^\prime} q_*(s^\prime, a^\prime)]\quad;\text{optimal in the next step} \end{aligned} $$

If the dynamics of the environment are known, the Bellman Optimality Equation can be solved by dynamic programming! Anyone who has taken an algorithms class might notice that the Bellman Optimality Equation is very similar to a DP recurrence. If the optimal value function is known, the optimal policy can be derived by acting greedily with a one-step look-ahead. If the optimal state-action value function is known, the optimal policy is simply to choose the action with the highest value.
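As a sketch of the dynamic-programming idea (value iteration proper is the subject of the next chapter), one can repeatedly apply the Bellman optimality backup and then extract a greedy policy with a one-step look-ahead. The dictionary representation of the dynamics is the same illustrative one used above:

```python
def value_iteration(states, actions, p, gamma=0.9, tol=1e-8):
    """Iterate v(s) <- max_a sum_{s',r} p(s',r|s,a) * (r + gamma * v(s'))."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(
                sum(prob * (r + gamma * v.get(s2, 0.0))
                    for (s2, r), prob in p.get((s, a), {}).items())
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

def greedy_policy(states, actions, p, v, gamma=0.9):
    """One-step look-ahead: in each state pick the action maximizing the backup."""
    return {
        s: max(actions,
               key=lambda a: sum(prob * (r + gamma * v.get(s2, 0.0))
                                 for (s2, r), prob in p.get((s, a), {}).items()))
        for s in states
    }
```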

Optimality and Approximation

In real-world problems, the dynamics of the environment are often unknown, so the Bellman Optimality Equation cannot be solved directly. Other difficulties include extremely large state spaces and partial observability. In these cases, the agent can only approximate optimality to some extent.