Chapter 1: Basic Concepts

[Chapter map: Chapter 1 (Basic Concepts) provides the fundamental tools used throughout the book, from the Bellman equation (Chapter 2) and the Bellman optimality equation (Chapter 3) to temporal-difference methods (Chapter 7) and value function approximation (Chapter 8), spanning tabular representations to function representations.]
This chapter introduces the basic concepts of reinforcement learning. These concepts
are important because they will be widely used in this book. We first introduce these
concepts using examples and then formalize them in the framework of Markov decision
processes.
only occupy a single cell. The white cells are accessible for entry, and the orange cells
are forbidden. There is a target cell that the robot would like to reach. We will use such
grid world examples throughout this book since they are intuitive for illustrating new
concepts and algorithms.
Figure 1.2: The grid world example used throughout the book. The grid contains a start cell, forbidden cells, and a target cell.
The ultimate goal of the agent is to find a “good” policy that enables it to reach
the target cell when starting from any initial cell. How can the “goodness” of a policy
be defined? The idea is that the agent should reach the target without entering any
forbidden cells, taking unnecessary detours, or colliding with the boundary of the grid.
It would be trivial to plan a path to reach the target cell if the agent knew the map of
the grid world. The task becomes nontrivial if the agent does not know any information
about the environment in advance. Then, the agent must interact with the environment
to find a good policy by trial and error. To do that, the concepts presented in the rest of
the chapter are necessary.
Figure 1.3: Illustrations of the state and action concepts. (a) There are nine states {s1, . . . , s9}. (b) Each state has five possible actions {a1, a2, a3, a4, a5}.
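To make these sets concrete, the following Python sketch simply enumerates the nine states and five actions. The mapping of actions to movement directions (a1 = up, a2 = right, a3 = down, a4 = left, a5 = stay) is inferred from the transitions described later in this chapter and should be read as an illustrative assumption rather than a formal definition.

```python
# A minimal sketch of the grid world's state and action sets.
# States s1..s9 are laid out row by row in a 3x3 grid; actions are
# assumed to mean: a1 = up, a2 = right, a3 = down, a4 = left, a5 = stay.

STATES = [f"s{i}" for i in range(1, 10)]          # s1, ..., s9
ACTIONS = ["a1", "a2", "a3", "a4", "a5"]          # up, right, down, left, stay

# Row/column coordinates of each state in the 3x3 grid (s1 is top-left).
COORDS = {f"s{3 * r + c + 1}": (r, c) for r in range(3) for c in range(3)}

print(STATES)
print(COORDS["s1"], COORDS["s5"], COORDS["s9"])   # (0, 0) (1, 1) (2, 2)
```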
The state transition process can be intuitively represented as a table: each row corresponds to a state, each column corresponds to an action, and each cell indicates the next state to transition to after the agent takes that action at that state. For example, for state s1 and action a2, the state transition can be described by the conditional probabilities

p(s1 | s1, a2) = 0,
p(s2 | s1, a2) = 1,
p(s3 | s1, a2) = 0,
p(s4 | s1, a2) = 0,
p(s5 | s1, a2) = 0,
which indicates that, when taking a2 at s1 , the probability of the agent moving to s2
is one, and the probabilities of the agent moving to other states are zero. As a result,
taking action a2 at s1 will certainly cause the agent to transition to s2 . The preliminaries
of conditional probability are given in Appendix A. Readers are strongly advised to be
familiar with probability theory since it is necessary for studying reinforcement learning.
Although it is intuitive, the tabular representation is only able to describe determinis-
tic state transitions. In general, state transitions can be stochastic and must be described
by conditional probability distributions. For instance, when random wind gusts are ap-
plied across the grid, if taking action a2 at s1 , the agent may be blown to s5 instead of
s2 . We have p(s5 |s1 , a2 ) > 0 in this case. Nevertheless, we merely consider deterministic
state transitions in the grid world examples for simplicity in this book.
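As a sketch of the two representations, the snippet below stores the deterministic transition for (s1, a2) as a plain table entry and a stochastic, wind-affected variant as a conditional distribution p(s′|s, a). The 0.8/0.2 split is a made-up number for illustration; only the fact that p(s5|s1, a2) > 0 in the windy case comes from the text.

```python
import random

# Deterministic transitions: a plain lookup table, one next state per (s, a).
deterministic_next = {
    ("s1", "a2"): "s2",   # moving right from s1 always lands in s2
}

# Stochastic transitions: a conditional distribution p(s' | s, a).
# The 0.8 / 0.2 split models the hypothetical wind gust; the numbers are
# illustrative only, not taken from the text.
stochastic_next = {
    ("s1", "a2"): {"s2": 0.8, "s5": 0.2},
}

def sample_next_state(s, a):
    """Draw s' ~ p(. | s, a) from the stochastic model."""
    dist = stochastic_next[(s, a)]
    states, probs = zip(*dist.items())
    return random.choices(states, weights=probs, k=1)[0]

print(deterministic_next[("s1", "a2")])   # s2, with certainty
print(sample_next_state("s1", "a2"))      # s2 most of the time, occasionally s5
```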
1.4 Policy
A policy tells the agent which actions to take at every state. Intuitively, policies can
be depicted as arrows (see Figure 1.4(a)). Following a policy, the agent can generate a
trajectory starting from an initial state (see Figure 1.4(b)).
Figure 1.4: A policy represented by arrows and some trajectories obtained by starting from different
initial states.
Mathematically, a policy is described by conditional probabilities. For example, for state s1, the policy shown in Figure 1.4 gives

π(a1 |s1 ) = 0,
π(a2 |s1 ) = 1,
π(a3 |s1 ) = 0,
π(a4 |s1 ) = 0,
π(a5 |s1 ) = 0,
which indicates that the probability of taking action a2 at state s1 is one, and the prob-
abilities of taking other actions are zero.
The above policy is deterministic. Policies may be stochastic in general. For example,
the policy shown in Figure 1.5 is stochastic: at state s1 , the agent may take actions to
go either rightward or downward. The probabilities of taking these two actions are the
same, both equal to 0.5:
π(a1 |s1 ) = 0,
π(a2 |s1 ) = 0.5,
π(a3 |s1 ) = 0.5,
π(a4 |s1 ) = 0,
π(a5 |s1 ) = 0.
Figure 1.5: A stochastic policy. At state s1, the agent may move rightward or downward with equal probabilities of 0.5.
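A policy can be stored in the same way as the state transitions: one distribution over actions per state. The sketch below encodes the stochastic policy of Figure 1.5 at s1 and samples an action from it; it is a minimal illustration, not the book's notation.

```python
import random

# pi[s][a] = probability of taking action a in state s.
# At s1 the policy of Figure 1.5 moves rightward (a2) or downward (a3)
# with equal probability 0.5.
pi = {
    "s1": {"a1": 0.0, "a2": 0.5, "a3": 0.5, "a4": 0.0, "a5": 0.0},
}

def sample_action(s):
    """Draw a ~ pi(. | s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action("s1"))   # a2 or a3, each with probability 0.5
```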
1.5 Reward
Reward is one of the most distinctive concepts in reinforcement learning. After executing an action at a state, the agent obtains a reward, a real number, as feedback. In the grid world example, the rewards are designed as follows: rboundary = −1 for attempting to move out of the boundary, rforbidden = −1 for entering a forbidden cell, rtarget = +1 for reaching the target cell, and rother = 0 otherwise.
Special attention should be given to the target state s9 . The reward process does not
have to terminate after the agent reaches s9 . If the agent takes action a5 at s9 , the next
state is again s9 , and the reward is rtarget = +1. If the agent takes action a2 , the next
state is also s9 , but the reward is rboundary = −1.
A reward can be interpreted as a human-machine interface, with which we can guide
the agent to behave as we expect. For example, with the rewards designed above, we can
expect that the agent tends to avoid exiting the boundary or stepping into the forbidden
cells. Designing appropriate rewards is an important step in reinforcement learning. This
step is, however, nontrivial for complex tasks since it may require the user to understand
the given problem well. Nevertheless, designing rewards may still be much easier than solving the problem with other approaches that demand deep domain expertise.
The process of getting a reward after executing an action can be intuitively represented
as a table, as shown in Table 1.3. Each row of the table corresponds to a state, and each
column corresponds to an action. The value in each cell of the table indicates the reward
that can be obtained by taking an action at a state.
One question that beginners may ask is as follows: if given the table of rewards, can
we find good policies by simply selecting the actions with the greatest rewards? The
answer is no. That is because these rewards are immediate rewards that can be obtained
after taking an action. To determine a good policy, we must consider the total reward
obtained in the long run (see Section 1.6 for more information). An action with the
greatest immediate reward may not lead to the greatest total reward.
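The following toy computation illustrates why greedily picking the action with the greatest immediate reward can be short-sighted. The two reward sequences are hypothetical and not taken from the grid world: the first option looks better immediately but yields a smaller total reward.

```python
# Two hypothetical courses of action from the same state.
# Option A: immediate reward +1, but the path afterwards keeps hitting
#           forbidden cells (-1 each step).
# Option B: immediate reward 0, but the path then reaches the target (+1).
rewards_A = [1, -1, -1, -1]
rewards_B = [0, 0, 0, 1]

print("immediate:", rewards_A[0], "vs", rewards_B[0])      # A looks better now
print("total:    ", sum(rewards_A), "vs", sum(rewards_B))  # B is better overall
```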
Although intuitive, the tabular representation is only able to describe deterministic
reward processes. A more general approach is to use conditional probabilities p(r|s, a) to
describe reward processes. For example, for state s1 and action a1, we have

p(r = −1 | s1, a1) = 1,    p(r ≠ −1 | s1, a1) = 0.
This indicates that, when taking a1 at s1 , the agent obtains r = −1 with certainty. In
this example, the reward process is deterministic. In general, it can be stochastic. For
example, if a student studies hard, he or she would receive a positive reward (e.g., higher
grades on exams), but the specific value of the reward may be uncertain.
1.6 Trajectories, returns, and episodes

Figure 1.6: Trajectories obtained by following two policies: (a) policy 1 and its trajectory; (b) policy 2 and its trajectory. The trajectories are indicated by red dashed lines.
A trajectory is a state-action-reward chain. For example, following the first policy in Figure 1.6(a) from s1, the agent generates the trajectory

s1 −(a2, r=0)→ s2 −(a3, r=0)→ s5 −(a3, r=0)→ s8 −(a2, r=+1)→ s9.
The return of this trajectory is defined as the sum of all the rewards collected along the
trajectory:
return = 0 + 0 + 0 + 1 = 1. (1.1)
Following the second policy in Figure 1.6(b) from s1, the agent generates the trajectory

s1 −(a3, r=0)→ s4 −(a3, r=−1)→ s7 −(a2, r=0)→ s8 −(a2, r=+1)→ s9.

The corresponding return is
return = 0 − 1 + 0 + 1 = 0. (1.2)
The returns in (1.1) and (1.2) indicate that the left policy is better than the right one
since its return is greater. This mathematical conclusion is consistent with the intuition
that the right policy is worse since it passes through a forbidden cell.
A return consists of an immediate reward and future rewards. Here, the immediate
reward is the reward obtained after taking an action at the initial state; the future
rewards refer to the rewards obtained after leaving the initial state. It is possible that the
immediate reward is negative while the future reward is positive. Thus, which actions to
take should be determined by the return (i.e., the total reward) rather than the immediate
reward to avoid short-sighted decisions.
The return in (1.1) is defined for a finite-length trajectory. Return can also be defined
for infinitely long trajectories. For example, the trajectory in Figure 1.6 stops after
reaching s9 . Since the policy is well defined for s9 , the process does not have to stop after
the agent reaches s9 . We can design a policy so that the agent stays unchanged after
reaching s9 . Then, the policy would generate the following infinitely long trajectory:
s1 −(a2, r=0)→ s2 −(a3, r=0)→ s5 −(a3, r=0)→ s8 −(a2, r=+1)→ s9 −(a5, r=+1)→ s9 −(a5, r=+1)→ s9 · · ·
return = 0 + 0 + 0 + 1 + 1 + 1 + · · · = ∞,
which unfortunately diverges. Therefore, we must introduce the discounted return con-
cept for infinitely long trajectories. In particular, the discounted return is the sum of the discounted rewards collected along the trajectory:

discounted return = 0 + γ·0 + γ²·0 + γ³·1 + γ⁴·1 + γ⁵·1 + · · · ,        (1.3)

where γ ∈ (0, 1) is called the discount rate. Since γ ∈ (0, 1), the value of (1.3) can be calculated as

discounted return = γ³(1 + γ + γ² + · · · ) = γ³ · 1/(1 − γ).
The introduction of the discount rate is useful for the following reasons. First, it
removes the stop criterion and allows for infinitely long trajectories. Second, the dis-
count rate can be used to adjust the emphasis placed on near- or far-future rewards. In
particular, if γ is close to 0, then the agent places more emphasis on rewards obtained in
the near future. The resulting policy would be short-sighted. If γ is close to 1, then the
agent places more emphasis on rewards obtained in the far future. The resulting policy is far-sighted and is willing to risk obtaining negative rewards in the near future in exchange for greater rewards later. These points
will be demonstrated in Section 3.5.
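The discounted return in (1.3) can be computed directly from its definition, and doing so for different values of γ makes the near-sighted versus far-sighted trade-off visible. The sketch below truncates the infinitely long reward sequence of the trajectory above after a finite horizon; the horizon length is an arbitrary choice for the numerical check.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Rewards of the trajectory s1 -> s2 -> s5 -> s8 -> s9 -> s9 -> ... ,
# truncated to a finite number of steps for the numerical check.
horizon = 200
rewards = [0, 0, 0] + [1] * (horizon - 3)

for gamma in (0.5, 0.9):
    numeric = discounted_return(rewards, gamma)
    closed_form = gamma ** 3 / (1 - gamma)   # gamma^3 * (1 + gamma + ...) for the infinite tail
    print(gamma, round(numeric, 4), round(closed_form, 4))
# A smaller gamma discounts future rewards heavily (short-sighted);
# a gamma close to 1 weighs them almost as much as immediate rewards.
```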
One important notion that was not explicitly mentioned in the above discussion is the
episode. When interacting with the environment by following a policy, the agent may stop
at some terminal states. The resulting trajectory is called an episode (or a trial). If the environment or the policy is stochastic, we may obtain different episodes when starting from the
same state. However, if everything is deterministic, we always obtain the same episode
when starting from the same state.
An episode is usually assumed to be a finite trajectory. Tasks with episodes are called
episodic tasks. However, some tasks may have no terminal states, meaning that the pro-
cess of interacting with the environment will never end. Such tasks are called continuing
tasks. In fact, we can treat episodic and continuing tasks in a unified mathematical
manner by converting episodic tasks to continuing ones. To do that, we need to properly define
the process after the agent reaches the terminal state. Specifically, after reaching the
terminal state in an episodic task, the agent can continue taking actions in the following
two ways.
First, if we treat the terminal state as a special state, we can specifically design its
action space or state transition so that the agent stays in this state forever. Such
states are called absorbing states, meaning that the agent never leaves a state once
reached. For example, for the target state s9 , we can specify A(s9 ) = {a5 } or set
A(s9 ) = {a1 , . . . , a5 } with p(s9 |s9 , ai ) = 1 for all i = 1, . . . , 5.
Second, if we treat the terminal state as a normal state, we can simply set its action space to be the same as that of the other states, and the agent may leave the state and come
back again. Since a positive reward of r = 1 can be obtained every time s9 is reached,
the agent will eventually learn to stay at s9 forever to collect more rewards. Notably,
when an episode is infinitely long and the reward received for staying at s9 is positive,
a discount rate must be used to calculate the discounted return to avoid divergence.
In this book, we consider the second scenario where the target state is treated as a normal
state whose action space is A(s9 ) = {a1 , . . . , a5 }.
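The two ways of handling the terminal state differ only in how the model treats s9. The sketch below writes both variants of s9's transition entries; the mapping of a1, . . . , a4 to movement directions (and hence which neighboring cells s6 and s8 are reached) is an assumption consistent with the transitions described in the text.

```python
# Option 1: treat s9 as an absorbing state -- every action keeps the agent there.
absorbing_s9 = {("s9", a): {"s9": 1.0} for a in ["a1", "a2", "a3", "a4", "a5"]}

# Option 2: treat s9 as a normal state -- the agent may leave and come back.
# (a2 and a3 would attempt to cross the boundary and bounce back to s9,
#  giving r_boundary = -1; the mapping of actions to directions is assumed.)
normal_s9 = {
    ("s9", "a1"): {"s6": 1.0},   # up
    ("s9", "a2"): {"s9": 1.0},   # right: blocked by the boundary
    ("s9", "a3"): {"s9": 1.0},   # down: blocked by the boundary
    ("s9", "a4"): {"s8": 1.0},   # left
    ("s9", "a5"): {"s9": 1.0},   # stay
}

print(absorbing_s9[("s9", "a1")])   # {'s9': 1.0}
print(normal_s9[("s9", "a1")])      # {'s6': 1.0}
```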
1.7 Markov decision processes
This section formalizes the concepts introduced above in the framework of Markov decision processes (MDPs). The key elements of an MDP include the following.

Sets: the state space S, the action space A(s) for each state s, and the reward set R(s, a) for each state-action pair (s, a).

Model: the state transition probabilities p(s′|s, a) and the reward probabilities p(r|s, a) for all state-action pairs (s, a).

Markov property:

p(s_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, . . . , s_0, a_0) = p(s_{t+1} | s_t, a_t),
p(r_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, . . . , s_0, a_0) = p(r_{t+1} | s_t, a_t),        (1.4)

where t represents the current time step and t + 1 represents the next time step.
Equation (1.4) indicates that the next state or reward depends merely on the current
state and action and is independent of the previous ones. The Markov property is
important for deriving the fundamental Bellman equation of MDPs, as shown in the
next chapter.
Here, p(s′|s, a) and p(r|s, a) for all (s, a) are called the model or dynamics. The
model can be either stationary or nonstationary (or in other words, time-invariant or
time-variant). A stationary model does not change over time; a nonstationary model
may vary over time. For instance, in the grid world example, if a forbidden area may pop
up or disappear sometimes, the model is nonstationary. In this book, we only consider
stationary models.
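Putting the pieces together, a stationary model is just the pair p(s′|s, a) and p(r|s, a), fixed over time, and one interaction step samples from both. The sketch below is a minimal model for a single state-action pair, (s8, a2), rather than a full grid-world implementation.

```python
import random

# A stationary model: p(s'|s, a) and p(r|s, a) fixed for all time steps.
transition = {("s8", "a2"): {"s9": 1.0}}
reward = {("s8", "a2"): {+1: 1.0}}        # reaching the target yields r_target = +1

def step(s, a):
    """Sample (s', r) from the model for the pair (s, a)."""
    next_states, p_s = zip(*transition[(s, a)].items())
    rewards, p_r = zip(*reward[(s, a)].items())
    s_next = random.choices(next_states, weights=p_s, k=1)[0]
    r = random.choices(rewards, weights=p_r, k=1)[0]
    return s_next, r

print(step("s8", "a2"))   # ('s9', 1)
```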
One may have heard of Markov processes (MPs). What is the difference
between an MDP and an MP? The answer is that, once the policy in an MDP is fixed,
the MDP degenerates into an MP. For example, the grid world example in Figure 1.7 can
be abstracted as a Markov process. In the literature on stochastic processes, a Markov
process is also called a Markov chain if it is a discrete-time process and the number of
states is finite or countable [1]. In this book, the terms “Markov process” and “Markov
chain” are used interchangeably when the context is clear. Moreover, this book mainly
considers finite MDPs where the numbers of states and actions are finite. This is the
simplest case that should be fully understood.
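The degeneration of an MDP into a Markov process once the policy is fixed can be written out directly: the state-to-state transition probability is the model averaged over the policy, P(s′|s) = Σ_a π(a|s) p(s′|s, a). The sketch below performs this computation at s1 under the stochastic policy of Figure 1.5, assuming the deterministic grid transitions described earlier.

```python
from collections import defaultdict

# Policy pi(a|s) at s1 (the stochastic policy of Figure 1.5).
pi_s1 = {"a2": 0.5, "a3": 0.5}

# Model p(s'|s1, a) for the relevant actions (deterministic transitions).
p_s1 = {"a2": {"s2": 1.0}, "a3": {"s4": 1.0}}

# Induced Markov-process transitions: P(s'|s1) = sum_a pi(a|s1) * p(s'|s1, a).
P_s1 = defaultdict(float)
for a, pa in pi_s1.items():
    for s_next, p in p_s1[a].items():
        P_s1[s_next] += pa * p

print(dict(P_s1))   # {'s2': 0.5, 's4': 0.5}
```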
Figure 1.7: Abstraction of the grid world example as a Markov process. Here, the circles represent states
and the links with arrows represent state transitions.
1.8 Summary
This chapter introduced the basic concepts that will be widely used in the remainder of
the book. We used intuitive grid world examples to demonstrate these concepts and then
formalized them in the framework of MDPs. For more information about MDPs, readers
can see [1, 2].
1.9 Q&A
Q: Can we set all the rewards as negative or positive?
A: In this chapter, we mentioned that a positive reward would encourage the agent
to take an action and that a negative reward would discourage the agent from taking
the action. In fact, it is the relative reward values instead of the absolute values that
determine encouragement or discouragement.
More specifically, we set rboundary = −1, rforbidden = −1, rtarget = +1, and rother = 0 in
this chapter. We can also add a common value to all these values without changing
the resulting optimal policy. For example, we can add −2 to all the rewards to obtain
rboundary = −3, rforbidden = −3, rtarget = −1, and rother = −2. Although the rewards
are all negative, the resulting optimal policy is unchanged. That is because optimal
policies are invariant to affine transformations of the rewards. Details will be given in
Section 3.5.
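A quick numerical check of this invariance: adding a constant c to every reward adds the same amount (c/(1 − γ) in the infinite-horizon case) to the discounted return of every trajectory, so the ordering of policies is preserved. The two reward streams below are hypothetical and truncated to a common finite horizon.

```python
def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

gamma = 0.9
# Two hypothetical (truncated) reward streams produced by two policies.
policy_1 = [0, 0, 0, 1] + [1] * 196
policy_2 = [0, -1, 0, 1] + [1] * 196

for shift in (0, -2):                     # original rewards vs. rewards shifted by -2
    g1 = discounted_return([r + shift for r in policy_1], gamma)
    g2 = discounted_return([r + shift for r in policy_2], gamma)
    print(shift, round(g1, 3), round(g2, 3), "policy 1 better:", g1 > g2)
# Both returns drop by the same amount (about 2 / (1 - gamma) for long horizons),
# so policy 1 remains better after the shift.
```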
Q: Is the reward a function of the next state?
A: We mentioned that the reward r depends only on s and a but not on the next state s′. However, this may be counterintuitive since it is the next state that determines the reward in many cases. For example, the reward is positive when the next state is the target state. As a result, a question that naturally follows is whether a reward should depend on the next state. A mathematical rephrasing of this question is whether we should use p(r|s, a, s′), where s′ is the next state, rather than p(r|s, a). The answer is that r depends on s, a, and s′. However, since s′ also depends on s and a, we can equivalently write r as a function of s and a: p(r|s, a) = Σ_{s′} p(r|s, a, s′) p(s′|s, a). In this way, the Bellman equation can be easily established as shown in Chapter 2.
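As a small numerical illustration of the identity p(r|s, a) = Σ_{s′} p(r|s, a, s′) p(s′|s, a), the sketch below reuses the hypothetical windy transition at (s1, a2): the reward depends on where the agent actually lands, and marginalizing over the next state recovers p(r|s, a). The 0.8/0.2 split and the reward assignment (treating s5 as if it were forbidden) are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical windy transition at (s1, a2): p(s'|s1, a2).
p_next = {"s2": 0.8, "s5": 0.2}

# Hypothetical reward model conditioned on the landing cell, p(r|s1, a2, s'):
# landing in s2 yields r = 0; landing in s5 is treated as forbidden here
# purely for illustration and yields r = -1.
p_r_given_next = {"s2": {0: 1.0}, "s5": {-1: 1.0}}

# Marginalize over s' to obtain p(r|s1, a2).
p_r = defaultdict(float)
for s_next, p_s in p_next.items():
    for r, p in p_r_given_next[s_next].items():
        p_r[r] += p * p_s

print(dict(p_r))   # {0: 0.8, -1: 0.2}
```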