RL 0.0: Reinforcement learning primer

A modest attempt to unify the fundamental RL concepts in one place for building intuition.

Published: Sept. 25, 2021


Introduction

Here I will try to explain how RL stems from the sequential decision-making framework and its close relation to optimal control theory. I will follow two primary references: Reinforcement Learning and Optimal Control (Bertsekas, 2019) and Reinforcement Learning: An Introduction (Sutton & Barto, 2018).

Online sequential decision making

The goal is to make sequential decisions “online” to achieve a certain objective; often this means maximizing a performance measure that can be thought of as a function $J(\cdot)$. The input to this objective function is not important right now. Let’s call this decision maker the “agent”. The catch is that the agent has to figure out which decision to take based on the feedback it observes from the environment of interest, and to observe feedback the agent has to interact with the environment through some sort of actions. So optimization sits at the core of this decision-making procedure, while data collected in an online fashion is used to identify which actions to take. This is where the “learning” happens.

Bandit: A mandatory prelude to RL

Consider an online sequential decision-making problem where an agent has $k$ actions to choose from, and every time it executes an action it receives feedback from the environment. A fundamental question then arises for the agent: how should it choose an action? The way it chooses an action describes its way of behaving in this particular environment, which is known as the “policy”, denoted $\pi$. Note that $\pi$ describes how to take an action, but it does not say how to take the best action, the one that maximizes the performance objective $J(\cdot)$. To identify the optimal action we need to find the optimal policy $\pi^*$. So the following makes sense:

$$\pi^* = \arg\max_\pi J^\pi(\cdot)$$

where $J^\pi(\cdot)$ is the value of the performance objective obtained using policy $\pi$. Note that the agent does not know the underlying distribution of the feedback from each action it takes; if it were known, the agent could simply pick the best action. This setting is known as the bandit problem, or the “multi-armed bandit (MAB)” problem, and sometimes the “$k$-armed bandit” problem. The feedback obtained from the environment is usually called a reward (or a cost).
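
To make the bandit setting concrete, here is a minimal sketch (not from the references) of a $k$-armed bandit agent that estimates each action's value from sampled rewards and follows an epsilon-greedy policy; the Gaussian reward distributions and the `epsilon` value are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 5                                  # number of arms
true_means = rng.normal(0.0, 1.0, k)   # hidden reward distributions (unknown to the agent)

q_est = np.zeros(k)    # estimated value of each action
counts = np.zeros(k)   # how many times each action has been taken
epsilon = 0.1          # exploration probability

for t in range(1000):
    # epsilon-greedy policy: explore with probability epsilon, otherwise act greedily
    if rng.random() < epsilon:
        a = rng.integers(k)
    else:
        a = int(np.argmax(q_est))

    reward = rng.normal(true_means[a], 1.0)   # feedback from the environment

    # incremental sample-average update of the action-value estimate
    counts[a] += 1
    q_est[a] += (reward - q_est[a]) / counts[a]

print("estimated values:", np.round(q_est, 2))
print("true means:      ", np.round(true_means, 2))
```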

Markov Decision Process: The RL formalism

By now it is clear that we are interested in sequential decision making. To formalize such a process we will adopt the Markov Decision Process (MDP). An MDP $\mathcal{M}$ is usually expressed as a tuple of the following five elements:

$$\mathcal{M} = \langle \mathcal{X}, \mathcal{A}, R, P, \gamma \rangle$$

where,

  1. $\mathcal{X}$ is the set of states (the state space),
  2. $\mathcal{A}$ is the set of actions (the action space),
  3. $R$ is the reward function,
  4. $P$ is the state-transition probability function, and
  5. $\gamma$ is the discount factor.

Let’s focus on how this formalism helps in sequential decision making. Assume that the agent is in state $x_t$ at timestep $t$. Based on some policy $\pi$ it decides to move to state $x_{t+1}$ by taking action $a_t$. To move to that state, the agent needs to know the probability of reaching it given the current state $x_t$ and action $a_t$; this is where the transition probability $P$ enters the process. Once the agent reaches state $x_{t+1}$ it gets feedback from the environment. This feedback is a reward value, $R_t$, which is usually a scalar. We assume that the reward value $R_t$ is the output of the reward function $R$, which takes $x_t$ and $a_t$ as input, i.e. $R(x_t, a_t): \mathcal{X} \times \mathcal{A} \rightarrow \mathbb{R}$. What if the reward function depends only on the current state and not the action? Then the reward function would be $R(x_t): \mathcal{X} \rightarrow \mathbb{R}$. Similarly, if the reward depends not only on the current state and action but also on the state the agent ends up in, then the reward function would be $R(x_t, a_t, x_{t+1}): \mathcal{X} \times \mathcal{A} \times \mathcal{X} \rightarrow \mathbb{R}$. Finally, the agent uses a discount factor $\gamma$ to put less weight on rewards far in the future and more weight on immediate rewards. This makes sense because the agent does not want to depend too strongly on information that arrives many timesteps into the future. All of this information can be combined very conveniently in an MDP, so it should now be easier to see why the MDP is attractive for sequential decision making.
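
As a concrete toy illustration, here is one way the elements of a small finite MDP could be stored as arrays; the two-state, two-action numbers below are made up purely for illustration.

```python
import numpy as np

n_states, n_actions = 2, 2

# P[x, a, x'] = probability of landing in x' after taking action a in state x
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1
])

# R[x, a] = expected reward for taking action a in state x, i.e. R: X x A -> reals
R = np.array([
    [0.0, 1.0],
    [2.0, 0.0],
])

gamma = 0.9   # discount factor

# sanity check: each row of P must be a probability distribution over next states
assert np.allclose(P.sum(axis=-1), 1.0)
```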

This formalism is great, but what is the goal of the agent in an MDP? In simplified terms, the ‘goal’ of the agent is to maximize the accumulation of rewards. Let’s call this accumulation of rewards the return. The return obtained at timestep $t$ can be expressed as $G_t^\pi = R_t + R_{t+1} + R_{t+2} + \dots$. This means that if the agent moves from state $x_t$ to state $x_{t+1}$ by taking action $a_t$ it receives reward $R_t$, and at the next timestep, if it follows the same policy, it will receive reward $R_{t+1}$, and so on. In the infinite-horizon case this return can blow up. This is why we use a discount factor $\gamma$, so that

$$G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k}.$$

The discount factor serves two purposes: a) it puts more weight on immediate rewards, and b) it keeps the return finite. MDPs defined without a discount factor are called ‘undiscounted MDPs’.
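
A quick sketch of how the discounted return is computed from a finite reward sequence; the reward values are arbitrary and only meant to show the effect of $\gamma$.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k} for a finite reward sequence."""
    g = 0.0
    # accumulate from the last reward backwards: G_t = R_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=1.0))   # 4.0 (undiscounted)
print(discounted_return(rewards, gamma=0.9))   # 1 + 0.9 + 0.81 + 0.729 = 3.439
```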

But how do we maximize the return? The answer is pretty simple: by choosing the sequence of actions that yields the highest return. These actions are called ‘optimal actions’. Remember that actions are chosen according to a policy $\pi: \mathcal{X} \rightarrow \mathcal{A}$, so to choose the optimal actions we need an optimal policy $\pi^*$. Now we have successfully identified the fundamental goal of this learning scheme: how to obtain the optimal policy for sequential decision making?

To identify whether a state is good or bad we need to assign some sort of value to that state. Usually this is known as the value function. The agent would like to explore states which have higher values compared to the rest. To derive the value of a state we use the reward function in an intuitive way. Let’s take a look.

$$V^\pi(x) = \mathbb{E}_\pi[G_t \mid x] = \mathbb{E}_\pi[R_t + \gamma R_{t+1} + \dots \mid x] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \,\Big|\, x\right]$$

Here we take the ‘expectation’ of the return to account for the stochasticity of the rewards. Notice that we only consider the mean value of the return, not its variance; this sometimes causes variance issues in algorithms developed from this formalism. So, what does $V^\pi(x)$ mean? It means that the value of a state, while following a policy $\pi$, is the expected value of the return. Let’s develop a simple algorithm that can help us figure out the optimal policy $\pi^*$ using the values of the states.

Algorithm 1
1. Find the value of all states, $V^\pi(x)$ for all $x \in \mathcal{X}$.
2. From each state, find the best next state $x_b = \arg\max_{x'} V^\pi(x')$.
3. Recover the optimal policy by choosing the action that leads to $x_b$, i.e. $\pi^*(x) = \{a : x \rightarrow x_b\}$.
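
Here is a minimal sketch of steps 2 and 3 for a finite MDP, assuming tabular arrays `P`, `R`, and value estimates `V` like the toy ones defined earlier. Acting greedily with respect to the expected one-step value is one way to read the informal ‘choose the action that leads to the best next state’ step.

```python
import numpy as np

def greedy_policy_from_values(P, R, V, gamma):
    """For each state, pick the action with the highest expected one-step value."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)
    for x in range(n_states):
        # expected value of each action: r(x, a) + gamma * sum_x' p(x, a, x') V(x')
        action_values = R[x] + gamma * P[x] @ V
        policy[x] = int(np.argmax(action_values))
    return policy
```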

Wouldn’t it be better if, rather than finding the value of a state, we could directly find the value of an action taken from a state? That way we could evaluate whether an action is good or bad based on its assigned value. Yes, we can, and this is known as the action-value function. Action values are also known as Q-values, since they can informally be thought of as the quality of an action taken from a state. For convenience, an action taken from a state is jointly referred to as the state-action pair, $(x, a)$. So, how do we define Q-values? Looking closely at the definition of the value function, we can similarly define the Q-values by conditioning the return on the state-action pair.

$$Q^\pi(x, a) = \mathbb{E}_\pi[G_t \mid x, a] = \mathbb{E}_\pi[R_t + \gamma R_{t+1} + \dots \mid x, a] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \,\Big|\, x, a\right]$$

So, Q-values are assigned to state-action pairs, while values are assigned to states only. Can we derive a relationship between them? To do that we need to break down their formal definitions using the properties of the expectation operator that appears in the above equations. We will use the following three properties.

Expectation of a random variable

  1. p1: If $X$ is a discrete random variable with a finite number of outcomes $x_1, x_2, \dots, x_k$ occurring with probabilities $p_1, p_2, \dots, p_k$, then $\mathbb{E}[X] = p_1 x_1 + p_2 x_2 + \dots + p_k x_k$.
  2. p2: Expectation is a linear operator, meaning $\mathbb{E}[X_1] + \mathbb{E}[X_2] = \mathbb{E}[X_1 + X_2]$.
  3. p3: For conditional expectations, by the partition theorem, $\mathbb{E}[X] = \sum_y p(Y = y)\, \mathbb{E}[X \mid Y = y]$.

So, from the value function definition we get,

$$
\begin{aligned}
V^\pi(x) &= \mathbb{E}_\pi[R_t + \gamma G_{t+1} \mid x] \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid x)\, \mathbb{E}[R_t + \gamma G_{t+1} \mid x, a] \qquad \text{using p3} \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid x) \sum_{x' \in \mathcal{X}} p(x, a, x')\, \mathbb{E}[R_t + \gamma G_{t+1} \mid x, a, x'] \qquad \text{using p3} \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid x) \sum_{x' \in \mathcal{X}} p(x, a, x') \Big[\underbrace{\mathbb{E}[R_t]}_{r(x,a)} + \mathbb{E}[\gamma G_{t+1} \mid x, a, x']\Big] \qquad \text{using p2} \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid x) \sum_{x' \in \mathcal{X}} p(x, a, x') \Big[r(x, a) + \gamma \underbrace{\mathbb{E}[G_{t+1} \mid x']}_{V^\pi(x')}\Big] \qquad \text{using p2} \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid x) \sum_{x' \in \mathcal{X}} p(x, a, x') \big[r(x, a) + \gamma V^\pi(x')\big]
\end{aligned}
$$
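
The recursion above is the Bellman expectation equation for $V^\pi$, and for a finite MDP it translates directly into an iterative policy-evaluation routine. The sketch below assumes a tabular setup like the toy `P`, `R`, `gamma` arrays from earlier, and a stochastic policy stored as an `(n_states, n_actions)` array of probabilities; these names are illustrative assumptions, not something fixed by the references.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma, tol=1e-8):
    """Iteratively apply V(x) <- sum_a pi(a|x) sum_x' p(x,a,x') [r(x,a) + gamma V(x')]."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for x in range(n_states):
            for a in range(n_actions):
                V_new[x] += policy[x, a] * (R[x, a] + gamma * P[x, a] @ V)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```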

This gives us a recursive formula! Similarly we can formulate the Q-values.

$$
\begin{aligned}
Q^\pi(x, a) &= \mathbb{E}_\pi[R_t + \gamma G_{t+1} \mid x, a] \\
&= \sum_{x'} p(x, a, x')\, \mathbb{E}_\pi[R_t + \gamma G_{t+1} \mid x, a, x'] \\
&= \sum_{x'} p(x, a, x') \big[\mathbb{E}_\pi[R_t \mid x, a, x'] + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid x, a, x']\big] \\
&= \sum_{x'} p(x, a, x') \big[r(x, a) + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid x']\big] \\
&= \sum_{x'} p(x, a, x') \big[r(x, a) + \gamma V^\pi(x')\big]
\end{aligned}
$$

We can also develop a relationship between the Q-values and the value function.

$$V^\pi(x) = \mathbb{E}_\pi[G_t \mid x] = \sum_a \pi(a \mid x)\, \mathbb{E}_\pi[G_t \mid x, a] = \sum_a \pi(a \mid x)\, Q^\pi(x, a)$$

Finally we develop the recursive formula for Q-values.

$$Q^\pi(x, a) = \sum_{x'} p(x, a, x') \big[r(x, a) + \gamma V^\pi(x')\big] = \sum_{x'} p(x, a, x') \Big[r(x, a) + \gamma \sum_{a'} \pi(a' \mid x')\, Q^\pi(x', a')\Big]$$
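
The first equality translates directly into code. A minimal sketch for a tabular MDP, reusing the hypothetical `P`, `R`, `gamma` arrays and a value function `V` from the earlier sketches:

```python
import numpy as np

def q_from_v(P, R, V, gamma):
    """Q(x, a) = r(x, a) + gamma * sum_x' p(x, a, x') V(x')."""
    return R + gamma * np.einsum("xas,s->xa", P, V)
```

The second equality can then be sanity-checked against the evaluated policy via `np.allclose(V, (policy * q_from_v(P, R, V, gamma)).sum(axis=1))`.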

Is there a way to determine the values of all the states in the state space? Yes, and the algorithms that do this are known as dynamic programming algorithms.

Dynamic programming

Algorithm 2: Policy iteration
1. Initialize an arbitrary policy $\pi$ and a value function $V(x)$ for all $x \in \mathcal{X}$.
2. Policy evaluation: compute $V^\pi$ by repeatedly applying $V(x) \leftarrow \sum_a \pi(a \mid x) \sum_{x'} p(x, a, x')\,[r(x, a) + \gamma V(x')]$ until convergence.
3. Policy improvement: set $\pi'(x) = \arg\max_a \sum_{x'} p(x, a, x')\,[r(x, a) + \gamma V^\pi(x')]$ for every state $x$.
4. If $\pi' = \pi$, stop and return $\pi^* = \pi$ along with $V^\pi$.
5. Otherwise set $\pi \leftarrow \pi'$ and go back to step 2.

Algorithm 3: Value iteration
1. Initialize $V(x)$ arbitrarily (e.g. to zero) for all $x \in \mathcal{X}$.
2. For every state $x$, update $V(x) \leftarrow \max_a \sum_{x'} p(x, a, x')\,[r(x, a) + \gamma V(x')]$.
3. Repeat step 2 until the largest change in $V$ falls below a small threshold.
4. Extract the greedy policy $\pi^*(x) = \arg\max_a \sum_{x'} p(x, a, x')\,[r(x, a) + \gamma V(x')]$.
5. Return $V$ and $\pi^*$.
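
And here is a minimal sketch of value iteration for a tabular MDP (again assuming toy `P`, `R`, `gamma` arrays as before; this is an illustration, not a reference implementation):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup, then extract a greedy policy."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q[x, a] = r(x, a) + gamma * sum_x' p(x, a, x') V(x')
        Q = R + gamma * np.einsum("xas,s->xa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)   # greedy policy with respect to the converged values
    return V_new, policy
```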

Note that all the future rewards in the return equation are unknown. If we knew all the rewards we were going to receive from the current timestep onward, then we could compute the exact value of the return. Unfortunately, we do not have the luxury of knowing these reward values a priori. This is where Reinforcement Learning (RL) comes into play.

Reinforcement learning and optimal control

Building algorithms for RL

Types of RL algorithms

Sample complexity of RL algorithms

A case study

Conclusion


References

Bertsekas, Dimitri. 2019. Reinforcement Learning and Optimal Control. Athena Scientific.
Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.