RL 0.0: Reinforcement learning primer

A modest attempt to unify the fundamental RL concepts in one place for building intuition.

Md Ferdous Alam (https://ferdous-alam.github.io), PhD student, MAE, The Ohio State University (https://mae.osu.edu)
09-25-2021

Introduction

Here I will try to explain how RL stems from the sequential decision-making framework and how it relates closely to optimal control theory. I will follow two primary references: Reinforcement Learning and Optimal Control (Bertsekas 2019) and Reinforcement Learning: An Introduction (Sutton and Barto 2018).

Online sequential decision making

The goal is to take sequential decisions “online” to achieve a certain goal; oftentimes this means maximizing a performance objective, which can be thought of as a function \(J(\cdot)\). The input to this objective function is not important right now. Let’s call this decision maker the “agent”. The catch is that the agent has to figure out which decision to take based on the feedback it observes from the environment it is interested in. To observe this feedback the agent has to interact with the environment through some sort of actions. So, optimization is at the core of this decision-making procedure, while we use data collected in an online fashion to identify which actions to take. This is why the “learning” happens.

Bandit: A mandatory prior to RL

Consider an online sequential decision-making problem where an agent has \(k\) actions to choose from, and every time it executes an action it receives feedback from the environment. A fundamental question then arises for the agent: how should it choose an action? The way it chooses an action describes its way of behaving in this particular environment, which is known as the “policy”, denoted \(\pi\). Note that \(\pi\) describes how to take an action, but it does not say how to take the best action, i.e. the one that maximizes the performance objective \(J(\cdot)\). To identify the optimal action we need to find the optimal policy \(\pi^*\). So, the following makes sense \[\begin{equation} \pi^* = \text{argmax}_\pi J^\pi(\cdot) \end{equation}\] where \(J^\pi(\cdot)\) is the value of the performance objective obtained using policy \(\pi\). Note that the agent does not know the underlying distribution of the feedback from each action it takes; if it did, it could easily pick the best action. This setting is known as the bandit problem, sometimes called the “multi-armed bandit (MAB)” or “k-armed bandit” problem. The feedback obtained from the environment is usually known as a reward or a cost.
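To make this concrete, here is a minimal sketch of a \(k\)-armed bandit agent with an \(\varepsilon\)-greedy policy: with a small probability it explores a random arm, otherwise it exploits its current estimates, which it updates from the rewards it observes. Everything here (the Gaussian rewards, the value of \(\varepsilon\), the number of steps) is a made-up illustration rather than something prescribed by the bandit formalism.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 5                                        # number of arms (hypothetical)
true_means = rng.normal(0.0, 1.0, size=k)    # unknown to the agent
epsilon = 0.1                                # exploration probability
n_steps = 1000

Q = np.zeros(k)   # sample-average estimate of each arm's reward
N = np.zeros(k)   # pull counts

for t in range(n_steps):
    # epsilon-greedy policy: explore with probability epsilon, otherwise exploit
    if rng.random() < epsilon:
        a = rng.integers(k)
    else:
        a = int(np.argmax(Q))
    # the environment's feedback: a noisy reward from the chosen arm
    r = rng.normal(true_means[a], 1.0)
    # incremental sample-average update of the estimate
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

print("estimated values:", np.round(Q, 2))
print("true means:      ", np.round(true_means, 2))
```

Even this simple sketch already shows the exploration-exploitation trade-off that any policy \(\pi\) has to resolve.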

Markov Decision Process: The RL formalism

Until now it is clear that we are interested in sequential decision making. To formalize such a process we will adopt the ‘Markov Decision Process (MDP)’. An MDP \(\mathcal{M}\) is usually expressed as a tuple of the following five elements \[\mathcal{M} = \langle \mathcal{X}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma\rangle\] where,

  1. \(\mathcal{X}\) is the state space,
  2. \(\mathcal{A}\) is the action space,
  3. \(\mathcal{R}\) is the reward function,
  4. \(\mathcal{P}\) is the state transition probability, \(p(\mathbf{x}_{t+1} | \mathbf{x}_t, a_t)\),
  5. \(\gamma\) is the discount factor.

Let’s focus on how this formalism helps in sequential decision making. Assume that the agent is in state \(\mathbf{x}_t\) at timestep \(t\). Based on some policy \(\pi\) it decides to take action \(a_t\) and moves to state \(\mathbf{x}_{t+1}\). To reason about that move, the agent needs the probability of reaching that state given the current state \(\mathbf{x}_t\) and action \(a_t\); this is where the conditional transition probability enters the process. Once the agent reaches state \(\mathbf{x}_{t+1}\) it gets feedback from the environment. This feedback is called a reward, \(R_t\), which is usually a scalar value. We assume that the reward \(R_t\) is the output of the reward function \(\mathcal{R}\), which takes \(\mathbf{x}_t\) and \(a_t\) as inputs, meaning \(\mathcal{R}(\mathbf{x}_t, a_t): \mathcal{X} \times \mathcal{A} \rightarrow R_t\). What if the reward depends only on the current state and not on the action? Then the reward function would be \(\mathcal{R}(\mathbf{x}_t): \mathcal{X} \rightarrow R_t\). Similarly, if the reward depends not only on the current state and action but also on the state the agent ends up in, then the reward function becomes \(\mathcal{R}(\mathbf{x}_t, a_t, \mathbf{x}_{t+1}): \mathcal{X} \times \mathcal{A} \times \mathcal{X} \rightarrow R_t\). Finally, the agent uses a discount factor \(\gamma\) to put less weight on future rewards and more weight on immediate ones. This makes sense because the agent should not depend too strongly on information that arrives many timesteps into the future. All of this information is combined very conveniently in an MDP, and it should now be easier to see why the MDP is attractive for sequential decision making.
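As a concrete, purely illustrative sketch, a small finite MDP can be stored directly as the five-element tuple. The class name, the field names, and the toy two-state example below are my own assumptions, not part of the formalism.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    """Finite MDP as the tuple <X, A, R, P, gamma> (illustrative container)."""
    states: list          # X: state space
    actions: list         # A: action space
    reward: dict          # R: (x, a) -> scalar reward
    transition: dict      # P: (x, a) -> {x': probability}
    gamma: float          # discount factor

# a tiny two-state example: action "stay" keeps the state, "move" switches it
toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "move"],
    reward={("s0", "stay"): 0.0, ("s0", "move"): 1.0,
            ("s1", "stay"): 0.5, ("s1", "move"): 0.0},
    transition={("s0", "stay"): {"s0": 1.0}, ("s0", "move"): {"s1": 1.0},
                ("s1", "stay"): {"s1": 1.0}, ("s1", "move"): {"s0": 1.0}},
    gamma=0.9,
)
```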

This formalism is great, but what is the goal of the agent in an MDP? In simple terms, the ‘goal’ of the agent is to maximize the accumulation of rewards. Let’s call this accumulation of rewards the return. The return obtained at timestep \(t\) can be expressed as \(G_t = R_t + R_{t+1} + R_{t+2} + \dots\). This means that if the agent moves from state \(\mathbf{x}_t\) to state \(\mathbf{x}_{t+1}\) by taking action \(a_t\) it receives reward \(R_t\), and at the next timestep, following the same policy, it receives reward \(R_{t+1}\), and so on. In the infinite-horizon case this sum can blow up. This is why we use a discount factor \(\gamma\) such that \[ G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k}.\]

The discount factor serves two purposes: a) it puts more weight on immediate rewards and b) it keeps the return finite. MDPs defined without a discount factor are called ‘undiscounted MDPs’.
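A quick numerical sanity check of the discounting, with a made-up reward sequence; the helper below just evaluates the sum above by accumulating backwards, \(G_k = R_k + \gamma G_{k+1}\).

```python
def discounted_return(rewards, gamma):
    """G_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ..."""
    G = 0.0
    # accumulate from the last reward backwards: G_k = R_k + gamma * G_{k+1}
    for r in reversed(rewards):
        G = r + gamma * G
    return G

rewards = [1.0, 1.0, 1.0, 1.0, 1.0]            # hypothetical reward sequence
print(discounted_return(rewards, gamma=0.9))   # < 5.0 because of discounting
print(discounted_return(rewards, gamma=1.0))   # undiscounted sum = 5.0
```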

But how do we maximize the return? The answer is pretty simple: by choosing the sequence of actions that yields the highest return. These actions are called ‘optimal actions’. Remember that actions are chosen according to a policy \(\pi:\mathcal{X} \rightarrow \mathcal{A}\), so to choose the optimal actions we need an optimal policy \(\pi^*\). We have now identified the fundamental question in this learning scheme: “how do we obtain the optimal policy for sequential decision making?”

To identify whether a state is good or bad we need to assign some sort of value to that state. This assignment is known as the value function. The agent would like to visit states that have higher values than the rest. To derive the value of a state we use the return in an intuitive way. Let’s take a look. \[V^\pi(\mathbf{x}) = \mathbb{E}^\pi[G_t|\mathbf{x}] = \mathbb{E}^\pi \left[R_{t} + \gamma R_{t+1} + \dots |\mathbf{x}\right] = \mathbb{E}^\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k} |\mathbf{x}\right]\] Here we take the ‘expectation’ of the return to account for the stochasticity of the rewards. Notice that we only consider the mean of the return, not its variance; this sometimes causes variance issues in algorithms built on this formalism. So, what does \(V^\pi(\mathbf{x})\) mean? It means that the value of a state, while following a policy \(\pi\), is the expected return from that state. Let’s develop a simple algorithm that uses state values to figure out the optimal policy \(\pi^*\).

Algorithm 1
1 Find the value of every state, \(V^\pi(\mathbf{x})\) for all \(\mathbf{x}\in\mathcal{X}\)
2 From each state \(\mathbf{x}\) find the best successor state \(\mathbf{x}_b = \text{argmax}_{\mathbf{x}'} V^\pi(\mathbf{x}')\), where \(\mathbf{x}'\) ranges over the states reachable from \(\mathbf{x}\)
3 Find the optimal policy by choosing, in each state, the action that leads to \(\mathbf{x}_b\), meaning \(\pi^*(\mathbf{x}) = \{a: \mathbf{x} \rightarrow \mathbf{x}_b\}\)
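Here is a minimal sketch of Algorithm 1 for a finite MDP with deterministic transitions, assuming \(V^\pi\) has already been computed somehow. The state names, the transition table, and the values are all made up for illustration.

```python
def greedy_policy_from_values(states, actions, transition, V):
    """Algorithm 1 for deterministic transitions:
    in each state, choose the action whose successor state has the highest value."""
    policy = {}
    for x in states:
        # rank actions by the value of the state they lead to
        policy[x] = max(actions, key=lambda a: V[transition[(x, a)]])
    return policy

# hypothetical two-state, two-action example with deterministic transitions
states = ["s0", "s1"]
actions = ["stay", "move"]
transition = {("s0", "stay"): "s0", ("s0", "move"): "s1",
              ("s1", "stay"): "s1", ("s1", "move"): "s0"}
V = {"s0": 4.0, "s1": 6.0}   # assumed to be already computed

print(greedy_policy_from_values(states, actions, transition, V))
# -> {'s0': 'move', 's1': 'stay'}
```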

Wouldn’t it be better if, rather than finding the value of a state, we could directly find the value of an action taken from a state? That way we could evaluate whether an action is good or bad based on its assigned value. Yes, we can, and these are known as action-value functions. They are also known as Q-values, as they can informally be thought of as the quality of an action taken from a state. For convenience, an action taken from a state is jointly referred to as a state-action pair, \((\mathbf{x}, a)\). So, how do we define Q-values? Looking closely at the definition of the value function, we can define the Q-values analogously by conditioning the return on the state-action pair.

\[Q^\pi(\mathbf{x}, a) = \mathbb{E}^\pi[G_t| \mathbf{x}, a] = \mathbb{E}^\pi \left[R_{t} + \gamma R_{t+1} + \dots |\mathbf{x}, a\right] = \mathbb{E}^\pi\left[ \sum_{k=0}^{\infty}\gamma^k R_{t+k}|\mathbf{x}, a\right]\] So, Q-values are assigned to state-action pairs, while values are assigned to states only. Can we derive a relationship between the two? To do that we break down their formal definitions using the following three properties of the expectation operator.

Expectation of a random variable

  1. \(p1\): Remember that if \(X\) is a discrete random variable with a finite number of outcomes \(x_1, x_2, \dots, x_k\) occurring with probabilities \(p_1, p_2, \dots, p_k\), then \[\begin{equation} \mathbb{E}[X] = p_1x_1 + p_2x_2 + \dots + p_kx_k\end{equation}\]
  2. \(p2\): The expectation is a linear operator, meaning \[\mathbb{E}[X_1] + \mathbb{E}[X_2] = \mathbb{E}[X_1 + X_2]\]
  3. \(p3\): For conditional expectations, the partition theorem (law of total expectation) gives \[\mathbb{E}[X] = \sum_y p(Y=y) \mathbb{E}[X|Y=y]\] (a small numerical example follows this list)
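For a quick sanity check of \(p3\), suppose \(Y\) takes the values 0 and 1 with probability 1/2 each and, with made-up numbers, \(\mathbb{E}[X|Y=0] = 2\) and \(\mathbb{E}[X|Y=1] = 4\). Then \[\mathbb{E}[X] = p(Y=0)\,\mathbb{E}[X|Y=0] + p(Y=1)\,\mathbb{E}[X|Y=1] = \tfrac{1}{2}\cdot 2 + \tfrac{1}{2}\cdot 4 = 3.\]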

So, from the value function definition we get,

\[\begin{aligned} V^\pi(\mathbf{x}) &= \mathbb{E}^\pi [R_t + \gamma G_{t+1}|\mathbf{x}] \\ &= \sum_{a\in\mathcal{A}} \pi(a|\mathbf{x})\, \mathbb{E}^\pi[R_t + \gamma G_{t+1}|\mathbf{x}, a] \ \ \ \ \text{ using } p3\\ &= \sum_{a\in\mathcal{A}} \pi(a|\mathbf{x}) \sum_{\mathbf{x}'\in\mathcal{X}} p(\mathbf{x}, a, \mathbf{x}')\, \mathbb{E}^\pi[R_t + \gamma G_{t+1}|\mathbf{x}, a, \mathbf{x}'] \ \ \ \ \text{ using } p3\\ &= \sum_{a\in\mathcal{A}} \pi(a|\mathbf{x}) \sum_{\mathbf{x}'\in\mathcal{X}} p(\mathbf{x}, a, \mathbf{x}') \left[\underbrace{\mathbb{E}[R_t|\mathbf{x}, a]}_{r(\mathbf{x}, a)} + \gamma\, \mathbb{E}^\pi[G_{t+1}|\mathbf{x}, a, \mathbf{x}'] \right] \ \ \ \ \text{ using } p2\\ &= \sum_{a\in\mathcal{A}} \pi(a|\mathbf{x}) \sum_{\mathbf{x}'\in\mathcal{X}} p(\mathbf{x}, a, \mathbf{x}') \left[ r(\mathbf{x}, a) + \gamma \underbrace{\mathbb{E}^\pi[G_{t+1}|\mathbf{x}']}_{V^\pi(\mathbf{x}')} \right] \ \ \ \ \text{ using the Markov property}\\ &= \sum_{a\in\mathcal{A}} \pi(a|\mathbf{x}) \sum_{\mathbf{x}'\in\mathcal{X}} p(\mathbf{x}, a, \mathbf{x}') \left[ r(\mathbf{x}, a) + \gamma V^\pi(\mathbf{x}') \right] \end{aligned}\]

This gives us a recursive formula! Similarly we can formulate the Q-values.

\[\begin{aligned} Q^\pi(\mathbf{x}, a) &= \mathbb{E}^\pi\left[ R_t + \gamma G_{t+1} | \mathbf{x}, a\right]\\ &= \sum_{\mathbf{x}'}p(\mathbf{x}, a, \mathbf{x}') \mathbb{E}^\pi\left[ R_t + \gamma G_{t+1} | \mathbf{x}, a, \mathbf{x}'\right]\\ &= \sum_{\mathbf{x}'}p(\mathbf{x}, a, \mathbf{x}') \left[ \mathbb{E}^\pi[R_t|\mathbf{x}, a, \mathbf{x}'] + \gamma \mathbb{E}^\pi \left[G_{t+1} | \mathbf{x}, a, \mathbf{x}'\right] \right]\\ &= \sum_{\mathbf{x}'}p(\mathbf{x}, a, \mathbf{x}') \left[ r(\mathbf{x}, a) + \gamma \mathbb{E}^\pi \left[G_{t+1} | \mathbf{x}'\right] \right]\\ &= \sum_{\mathbf{x}'}p(\mathbf{x}, a, \mathbf{x}') \left[ r(\mathbf{x}, a) + \gamma V^\pi(\mathbf{x}')\right] \end{aligned}\]

We can also develop a relationship between the Q-values and the value function.

\[\begin{aligned} V^\pi(\mathbf{x}) &= \mathbb{E}^\pi[G_t | \mathbf{x}]\\ &= \sum_a \pi(a|\mathbf{x}) \mathbb{E}^\pi[G_t|\mathbf{x}, a]\\ &= \sum_a \pi(a|\mathbf{x}) Q^\pi(\mathbf{x}, a) \end{aligned}\]

Finally we develop the recursive formula for Q-values.

\[\begin{aligned} Q^\pi(\mathbf{x}, a) &= \sum_{\mathbf{x}'}p(\mathbf{x}, a, \mathbf{x}') \left[ r(\mathbf{x}, a) + \gamma V^\pi(\mathbf{x}')\right]\\ &= \sum_{\mathbf{x}'}p(\mathbf{x}, a, \mathbf{x}') \left[ r(\mathbf{x}, a) + \gamma \sum_{a'} \pi(a'|\mathbf{x}') Q^\pi(\mathbf{x}', a') \right] \end{aligned}\]
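These relationships translate almost line by line into code. Below is a minimal, self-contained sketch for a finite MDP stored as plain dictionaries: one function performs the one-step backup that computes \(Q^\pi\) from the model and \(V^\pi\), and the other averages \(Q^\pi\) under the policy to recover \(V^\pi\). The two-state example and the uniform policy are made up for illustration.

```python
def q_from_v(states, actions, P, r, gamma, V):
    """Q^pi(x, a) = sum_{x'} p(x, a, x') [ r(x, a) + gamma * V^pi(x') ]"""
    return {(x, a): sum(p * (r[(x, a)] + gamma * V[x_next])
                        for x_next, p in P[(x, a)].items())
            for x in states for a in actions}

def v_from_q(states, actions, pi, Q):
    """V^pi(x) = sum_a pi(a|x) Q^pi(x, a)"""
    return {x: sum(pi[(x, a)] * Q[(x, a)] for a in actions) for x in states}

# tiny deterministic example (all numbers are made up)
states, actions, gamma = ["s0", "s1"], ["stay", "move"], 0.9
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "move"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "move"): {"s0": 1.0}}
r = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
     ("s1", "stay"): 0.5, ("s1", "move"): 0.0}
pi = {("s0", "stay"): 0.5, ("s0", "move"): 0.5,
      ("s1", "stay"): 0.5, ("s1", "move"): 0.5}   # uniform policy
V = {"s0": 3.0, "s1": 4.0}                        # some current estimate

Q = q_from_v(states, actions, P, r, gamma, V)
print(Q)
print(v_from_q(states, actions, pi, Q))
```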

Is there a way to determine the values of all the states in the state space? Yes; the algorithms that do this are known as dynamic programming algorithms.

Dynamic programming

Algorithm 2: Policy iteration
1 Initialize a policy \(\pi\) arbitrarily
2 Policy evaluation: compute \(V^\pi(\mathbf{x})\) for all \(\mathbf{x}\in\mathcal{X}\) by sweeping the recursive formula \(V^\pi(\mathbf{x}) = \sum_{a} \pi(a|\mathbf{x}) \sum_{\mathbf{x}'} p(\mathbf{x}, a, \mathbf{x}') \left[ r(\mathbf{x}, a) + \gamma V^\pi(\mathbf{x}') \right]\) until the values converge
3 Policy improvement: set \(\pi'(\mathbf{x}) = \text{argmax}_{a} \sum_{\mathbf{x}'} p(\mathbf{x}, a, \mathbf{x}') \left[ r(\mathbf{x}, a) + \gamma V^\pi(\mathbf{x}') \right]\) for all \(\mathbf{x}\in\mathcal{X}\)
4 If \(\pi' = \pi\), stop and return \(\pi^* = \pi\)
5 Otherwise set \(\pi \leftarrow \pi'\) and go to step 2
Algorithm 3: Value iteration
1 Initialize \(V(\mathbf{x})\) arbitrarily (e.g. to zero) for all \(\mathbf{x}\in\mathcal{X}\)
2 For every state apply the backup \(V(\mathbf{x}) \leftarrow \max_{a} \sum_{\mathbf{x}'} p(\mathbf{x}, a, \mathbf{x}') \left[ r(\mathbf{x}, a) + \gamma V(\mathbf{x}') \right]\)
3 Repeat step 2 until the values stop changing (within a small tolerance)
4 Extract the greedy policy \(\pi^*(\mathbf{x}) = \text{argmax}_{a} \sum_{\mathbf{x}'} p(\mathbf{x}, a, \mathbf{x}') \left[ r(\mathbf{x}, a) + \gamma V(\mathbf{x}') \right]\)
5 Return \(\pi^*\) and the converged values \(V\)
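Below is a minimal sketch of value iteration on the same kind of dictionary-based toy MDP used earlier. The MDP, the tolerance, and all of the names are illustrative assumptions rather than a prescribed implementation.

```python
def value_iteration(states, actions, P, r, gamma, tol=1e-8):
    """Value iteration: repeatedly apply the backup
    V(x) <- max_a sum_{x'} p(x, a, x') [ r(x, a) + gamma * V(x') ]
    until the values stop changing, then read off the greedy policy."""
    V = {x: 0.0 for x in states}
    while True:
        delta = 0.0
        for x in states:
            backups = [sum(p * (r[(x, a)] + gamma * V[x_next])
                           for x_next, p in P[(x, a)].items())
                       for a in actions]
            new_v = max(backups)
            delta = max(delta, abs(new_v - V[x]))
            V[x] = new_v
        if delta < tol:
            break
    # greedy policy with respect to the converged values
    policy = {x: max(actions,
                     key=lambda a: sum(p * (r[(x, a)] + gamma * V[x_next])
                                       for x_next, p in P[(x, a)].items()))
              for x in states}
    return V, policy

# the same made-up two-state MDP as before
states, actions, gamma = ["s0", "s1"], ["stay", "move"], 0.9
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "move"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "move"): {"s0": 1.0}}
r = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
     ("s1", "stay"): 0.5, ("s1", "move"): 0.0}

V_star, pi_star = value_iteration(states, actions, P, r, gamma)
print(V_star)    # converged state values
print(pi_star)   # e.g. {'s0': 'move', 's1': 'stay'}
```

Policy iteration can be sketched in the same way, by alternating a policy-evaluation sweep based on the \(V^\pi\) recursion with the greedy improvement step.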

Note that all the future rewards in the return equation are unknown. If we knew all the rewards we were going to receive from the current timestep onward, we could compute the exact value of the return. Unfortunately we do not have the luxury of knowing these reward values a priori. This is where Reinforcement Learning (RL) comes into play.

Reinforcement learning and optimal control

Building algorithms for RL

Types of RL algorithms

Sample complexity of RL algorithms

A case study

Conclusion

References

Bertsekas, Dimitri. 2019. Reinforcement Learning and Optimal Control. Athena Scientific.
Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.