A trivial attempt to unify the fundamental RL concepts in one place for building intuitions.
Here I will try to explain how RL stems from the sequential decision-making framework and its close relation with optimal control theory. I will follow two primary references: Reinforcement Learning and Optimal Control (Bertsekas 2019) and Reinforcement Learning: An Introduction (Sutton and Barto 2018).
The goal is to take sequential decisions “online” to achieve a certain objective; often this means maximizing a performance measure which can be thought of as a function $J(\cdot)$. The input to this objective function is not important right now. Let’s call this decision maker the “agent”. The catch is that the agent has to figure out which decision to take based on the feedback it observes from the environment of interest. To observe feedback, the agent has to interact with the environment through some sort of action. So, optimization sits at the core of this decision-making procedure, while data collected in an online fashion is used to identify which actions to take. This is where the “learning” happens.
Consider an online sequential decision-making problem where an agent has $k$ choices of action, and every time it executes an action it receives feedback from the environment. A fundamental question then arises for the agent: how should it choose an action? The way it chooses an action describes its way of behaving in this particular environment, which is known as the “policy”, denoted $\pi$. Note that $\pi$ describes how to take an action, but it does not say how to take the best action, the one that maximizes the performance objective $J(\cdot)$. To identify the optimal action we need to find the optimal policy $\pi^*$. So, the following makes sense:
$$\pi^* = \arg\max_{\pi} J^{\pi}(\cdot)$$
By now it is clear that we are interested in sequential decision making. To formalize such a process we will adopt the ‘Markov Decision Process (MDP)’. An MDP $\mathcal{M}$ is usually expressed as a tuple of the following five elements,

$$\mathcal{M} = \langle \mathcal{X}, \mathcal{A}, R, P, \gamma \rangle$$

where $\mathcal{X}$ is the state space, $\mathcal{A}$ is the action space, $R$ is the reward function, $P$ is the transition probability function, and $\gamma$ is the discount factor.
Let’s focus on how this formalism helps in sequential decision making. Assume that the agent is in state $x_t$ at timestep $t$. Based on some policy $\pi$ it decides to take action $a_t$ and move to state $x_{t+1}$. To reason about that move, the agent needs the probability of landing in the next state given the current state $x_t$ and action $a_t$; this is how the conditional transition probability $P$ enters the process. Once the agent reaches state $x_{t+1}$ it gets feedback from the environment. This feedback is called a reward, $R_t$, which is usually a scalar value. We assume that the reward $R_t$ is the output of the reward function $R$ with $x_t$ and $a_t$ as input, meaning $R: \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ with $R_t = R(x_t, a_t)$. What if the reward function only depends on the current state and not the action? Then the reward function would be $R: \mathcal{X} \to \mathbb{R}$ with $R_t = R(x_t)$. Similarly, if the reward depends not only on the current state and current action but also on the state the agent ends up in, then we would describe the reward function as $R: \mathcal{X} \times \mathcal{A} \times \mathcal{X} \to \mathbb{R}$ with $R_t = R(x_t, a_t, x_{t+1})$. Finally, the agent uses a discount factor $\gamma$ to put less weight on future rewards and more weight on immediate rewards. This makes sense because the agent does not want to depend too strongly on information that arrives many timesteps into the future. All of this information can be combined very conveniently in an MDP. Now it should be easier to see why the MDP is attractive for sequential decision making.
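To make the formalism concrete, here is a minimal sketch of how such an MDP could be stored in code. The two states, two actions, transition probabilities, and rewards below are entirely made up for illustration; only the five elements of the tuple come from the definition above.

```python
# A tiny hypothetical MDP with two states and two actions, stored as plain dicts.
# p[(x, a)] maps each next state x' to its transition probability p(x, a, x'),
# and r[(x, a)] is the expected immediate reward r(x, a).
states = ["s0", "s1"]
actions = ["stay", "move"]
gamma = 0.9  # discount factor

p = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s1": 0.8, "s0": 0.2},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}
r = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0,
    ("s1", "move"): 0.0,
}
```

Keeping $p$ and $r$ as dictionaries keyed by state-action pairs makes the Bellman backups that appear later straightforward to write.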
This formalism is great, but what is the goal of the agent in an MDP?
In simplified terms the ‘goal’ of the agent is to maximize the
accumulation of rewards. Let’s define the accumulation of rewards as
the return. The return obtained at timestep $t$ can be expressed as

$$G_t^{\pi} = R_t + R_{t+1} + R_{t+2} + \dots$$

This means that if the agent moves from state $x_t$ to state $x_{t+1}$ by taking action $a_t$, it receives reward $R_t$; at the next timestep, if it follows the same policy, it will receive reward $R_{t+1}$, and so on. In the infinite-horizon case this sum can blow up. This is why we use a discount factor $\gamma \in [0, 1)$ such that

$$G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k}.$$
This discount factor serves two purposes: (a) it places more weight on immediate rewards than on distant ones, and (b) it keeps the return finite. MDPs defined without a discount factor are called ‘undiscounted MDPs’.
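As a quick sanity check of the formula, a few lines of Python can compute the discounted return of a finite reward sequence; the reward list and $\gamma$ below are arbitrary examples.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k} for a finite list of rewards."""
    return sum((gamma ** k) * reward for k, reward in enumerate(rewards))

# Example: three rewards of 1.0 with gamma = 0.9 -> 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```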
But how do we maximize the return? The answer is pretty simple: by choosing the sequence of actions that provides the highest return. These actions are called ‘optimal actions’. Remember that actions are chosen according to a policy $\pi: \mathcal{X} \to \mathcal{A}$. So to choose the optimal actions we need an optimal policy $\pi^*$. Now we have successfully identified the fundamental goal of this learning scheme: “how to obtain the optimal policy for sequential decision making?”
To identify whether a state is good or bad we need to assign some
sort of value to that state. Usually this is known as the value
function. The agent would like to explore states which have
higher values compared to the rest. To derive the value of a state we
use the expected return in an intuitive way. Let’s take a look.

$$V^{\pi}(x) = \mathbb{E}_{\pi}[G_t \mid x] = \mathbb{E}_{\pi}[R_t + \gamma R_{t+1} + \dots \mid x] = \mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \,\Big|\, x\Big]$$
| Algorithm 1 |
|---|
| 1. Find the value of every state, $V^{\pi}(x)$ for $x \in \mathcal{X}$ |
| 2. From each state, find the best next state $x_b = \arg\max_{x'} V^{\pi}(x')$ |
| 3. Recover the optimal policy by choosing the action that leads to $x_b$, i.e. $\pi^*(x) = \{a : x \to x_b\}$ |
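Steps 2 and 3 of Algorithm 1 can be sketched in a few lines of Python, assuming the dictionary-based MDP representation from the earlier sketch (with `p[(x, a)]` mapping each next state to its probability) and an already computed value table `V`. For stochastic transitions the action is chosen by the expected value of the next state, which is one natural reading of “the action that leads to $x_b$”.

```python
def greedy_policy_from_values(states, actions, p, V):
    """For each state, pick the action whose expected next-state value is highest
    (steps 2 and 3 of Algorithm 1, generalized to stochastic transitions)."""
    policy = {}
    for x in states:
        best_action, best_value = None, float("-inf")
        for a in actions:
            expected_next_value = sum(prob * V[x_next]
                                      for x_next, prob in p[(x, a)].items())
            if expected_next_value > best_value:
                best_action, best_value = a, expected_next_value
        policy[x] = best_action
    return policy
```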
Wouldn’t it be better if, rather than finding the value of a state, we could directly find the value of an action taken from a state? That way we would be able to evaluate whether an action is good or bad based on the value assigned to it. Yes, we can, and this is known as the action-value function. Action values are also known as Q-values, as they can be informally thought of as the quality of an action taken from a state. For convenience, an action taken from a state is jointly referred to as a state-action pair, $(x, a)$. So, how do we define Q-values? Looking closely at the definition of the value function, we can similarly define the Q-values by conditioning the return on the state-action pair.
$$Q^{\pi}(x, a) = \mathbb{E}_{\pi}[G_t \mid x, a] = \mathbb{E}_{\pi}[R_t + \gamma R_{t+1} + \dots \mid x, a] = \mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \,\Big|\, x, a\Big]$$
So, from the value function definition we get,
$$
\begin{aligned}
V^{\pi}(x) &= \mathbb{E}_{\pi}[R_t + \gamma G_{t+1} \mid x] \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid x)\, \mathbb{E}[R_t + \gamma G_{t+1} \mid x, a] && \text{using p3} \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid x) \sum_{x' \in \mathcal{X}} p(x, a, x')\, \mathbb{E}[R_t + \gamma G_{t+1} \mid x, a, x'] && \text{using p3} \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid x) \sum_{x' \in \mathcal{X}} p(x, a, x') \Big[\underbrace{\mathbb{E}[R_t]}_{r(x,a)} + \mathbb{E}[\gamma G_{t+1} \mid x, a, x']\Big] && \text{using p2} \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid x) \sum_{x' \in \mathcal{X}} p(x, a, x') \Big[r(x,a) + \gamma \underbrace{\mathbb{E}[G_{t+1} \mid x']}_{V^{\pi}(x')}\Big] && \text{using p2} \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid x) \sum_{x' \in \mathcal{X}} p(x, a, x') \big[r(x,a) + \gamma V^{\pi}(x')\big]
\end{aligned}
$$
This gives us a recursive formula! Similarly we can formulate the Q-values.
$$
\begin{aligned}
Q^{\pi}(x, a) &= \mathbb{E}_{\pi}[R_t + \gamma G_{t+1} \mid x, a] \\
&= \sum_{x'} p(x, a, x')\, \mathbb{E}_{\pi}[R_t + \gamma G_{t+1} \mid x, a, x'] \\
&= \sum_{x'} p(x, a, x') \big[\mathbb{E}_{\pi}[R_t \mid x, a, x'] + \gamma\, \mathbb{E}_{\pi}[G_{t+1} \mid x, a, x']\big] \\
&= \sum_{x'} p(x, a, x') \big[r(x, a) + \gamma\, \mathbb{E}_{\pi}[G_{t+1} \mid x']\big] \\
&= \sum_{x'} p(x, a, x') \big[r(x, a) + \gamma V^{\pi}(x')\big]
\end{aligned}
$$
We can also develop the relationship between Q-values and value functions.
$$V^{\pi}(x) = \mathbb{E}_{\pi}[G_t \mid x] = \sum_{a} \pi(a \mid x)\, \mathbb{E}_{\pi}[G_t \mid x, a] = \sum_{a} \pi(a \mid x)\, Q^{\pi}(x, a)$$
Finally we develop the recursive formula for Q-values.
$$Q^{\pi}(x, a) = \sum_{x'} p(x, a, x') \big[r(x, a) + \gamma V^{\pi}(x')\big] = \sum_{x'} p(x, a, x') \Big[r(x, a) + \gamma \sum_{a'} \pi(a' \mid x')\, Q^{\pi}(x', a')\Big]$$
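Both relations translate almost line for line into code. The sketch below again assumes the dictionary-based MDP representation used earlier (`p[(x, a)]` mapping next states to probabilities, `r[(x, a)]` the expected reward), plus a tabular policy `pi[x]` mapping each state to a dictionary of action probabilities; all names are illustrative.

```python
def q_from_v(x, a, p, r, V, gamma):
    """Q^pi(x, a) = sum over x' of p(x, a, x') * [ r(x, a) + gamma * V^pi(x') ]."""
    return sum(prob * (r[(x, a)] + gamma * V[x_next])
               for x_next, prob in p[(x, a)].items())

def v_from_q(x, pi, Q):
    """V^pi(x) = sum over a of pi(a | x) * Q^pi(x, a)."""
    return sum(prob_a * Q[(x, a)] for a, prob_a in pi[x].items())
```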
Is there any way to determine the value of every state in the state space? Yes; such algorithms are known as dynamic programming algorithms, the two classic examples being policy iteration and value iteration.
| Algorithm 2: Policy iteration |
|---|
| 1. Initialize an arbitrary policy $\pi$ and set $V(x) = 0$ for all $x \in \mathcal{X}$ |
| 2. Policy evaluation: repeatedly apply the backup $V(x) \leftarrow \sum_{a} \pi(a \mid x) \sum_{x'} p(x, a, x')\,[r(x, a) + \gamma V(x')]$ until $V$ stops changing |
| 3. Policy improvement: set $\pi(x) \leftarrow \arg\max_{a} \sum_{x'} p(x, a, x')\,[r(x, a) + \gamma V(x')]$ for every state |
| 4. If the policy changed in step 3, go back to step 2 |
| 5. Otherwise stop; the current $\pi$ is the optimal policy $\pi^*$ |
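Here is a compact sketch of tabular policy iteration under the same dictionary-based MDP representation (`states`, `actions`, `p`, `r`, `gamma`); the stopping tolerance and the deterministic policy representation are implementation choices, not part of the algorithm's definition.

```python
def policy_iteration(states, actions, p, r, gamma, tol=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy stops changing."""
    policy = {x: actions[0] for x in states}   # arbitrary initial deterministic policy
    V = {x: 0.0 for x in states}
    while True:
        # Policy evaluation: sweep the Bellman expectation backup for the current policy.
        while True:
            delta = 0.0
            for x in states:
                a = policy[x]
                new_v = sum(prob * (r[(x, a)] + gamma * V[x_next])
                            for x_next, prob in p[(x, a)].items())
                delta = max(delta, abs(new_v - V[x]))
                V[x] = new_v
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to the evaluated V.
        stable = True
        for x in states:
            best_a = max(actions,
                         key=lambda a: sum(prob * (r[(x, a)] + gamma * V[x_next])
                                           for x_next, prob in p[(x, a)].items()))
            if best_a != policy[x]:
                policy[x], stable = best_a, False
        if stable:
            return policy, V
```

Calling it on the toy MDP from the first sketch, e.g. `policy_iteration(states, actions, p, r, gamma)`, returns a greedy policy together with its value table.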
| Algorithm 3: Value iteration |
|---|
| 1. Initialize $V(x) = 0$ for all $x \in \mathcal{X}$ |
| 2. For every state, apply the backup $V(x) \leftarrow \max_{a} \sum_{x'} p(x, a, x')\,[r(x, a) + \gamma V(x')]$ |
| 3. Repeat step 2 until the largest change in $V$ falls below a small tolerance |
| 4. Extract the greedy policy $\pi^*(x) = \arg\max_{a} \sum_{x'} p(x, a, x')\,[r(x, a) + \gamma V(x')]$ |
| 5. Return $\pi^*$ and $V$ |
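And a corresponding sketch of value iteration, under the same assumptions about the MDP representation:

```python
def value_iteration(states, actions, p, r, gamma, tol=1e-8):
    """Repeatedly apply the optimal Bellman backup, then read off the greedy policy."""
    V = {x: 0.0 for x in states}
    while True:
        delta = 0.0
        for x in states:
            new_v = max(sum(prob * (r[(x, a)] + gamma * V[x_next])
                            for x_next, prob in p[(x, a)].items())
                        for a in actions)
            delta = max(delta, abs(new_v - V[x]))
            V[x] = new_v
        if delta < tol:
            break
    policy = {x: max(actions,
                     key=lambda a: sum(prob * (r[(x, a)] + gamma * V[x_next])
                                       for x_next, prob in p[(x, a)].items()))
              for x in states}
    return policy, V
```

The only structural difference from the policy iteration sketch is that the max over actions is folded directly into the backup, so no separate evaluation loop is needed.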
Note that all the future rewards in the return equation are unknown. If we knew every reward we were going to receive from the current timestep onward, we could compute the exact value of the return. Unfortunately, we do not have the luxury of knowing these reward values a priori. This is where Reinforcement Learning (RL) comes into play.