
Chapter 1

Basic Concepts

Figure 1.1: Where we are in this book. Chapter 1 (Basic Concepts) and the fundamental tools in
Chapter 2 (Bellman Equation) and Chapter 3 (Bellman Optimality Equation) are followed by the
algorithms and methods: Chapter 4 (Value Iteration and Policy Iteration, with a model), Chapter 5
(Monte Carlo Methods, without a model), Chapter 6 (Stochastic Approximation), Chapter 7
(Temporal-Difference Methods), Chapter 8 (Value Function Approximation, moving from tabular
representations to function representations), Chapter 9 (Policy Gradient Methods, policy-based), and
Chapter 10 (Actor-Critic Methods, combining value-based and policy-based methods).

This chapter introduces the basic concepts of reinforcement learning. These concepts
are important because they will be widely used in this book. We first introduce these
concepts using examples and then formalize them in the framework of Markov decision
processes.

1.1 A grid world example


Consider an example as shown in Figure 1.2, where a robot moves in a grid world. The
robot, called the agent, can move across adjacent cells in the grid. At each time step, it can
only occupy a single cell. The white cells are accessible for entry, and the orange cells
are forbidden. There is a target cell that the robot would like to reach. We will use such
grid world examples throughout this book since they are intuitive for illustrating new
concepts and algorithms.

Figure 1.2: The grid world example used throughout the book. The grid contains a start cell,
forbidden cells, and a target cell.

The ultimate goal of the agent is to find a “good” policy that enables it to reach
the target cell when starting from any initial cell. How can the “goodness” of a policy
be defined? The idea is that the agent should reach the target without entering any
forbidden cells, taking unnecessary detours, or colliding with the boundary of the grid.
It would be trivial to plan a path to reach the target cell if the agent knew the map of
the grid world. The task becomes nontrivial if the agent does not know any information
about the environment in advance. Then, the agent must interact with the environment
to find a good policy by trial and error. To do that, the concepts presented in the rest of
the chapter are necessary.

1.2 State and action


The first concept to be introduced is the state, which describes the agent’s status with
respect to the environment. In the grid world example, the state corresponds to the
agent’s location. Since there are nine cells, there are nine states as well. They are
indexed as s1 , s2 , . . . , s9 , as shown in Figure 1.3(a). The set of all the states is called the
state space, denoted as S = {s1 , . . . , s9 }.
For each state, the agent can take five possible actions: moving upward, moving
rightward, moving downward, moving leftward, and remaining unchanged. These five
actions are denoted as a1 , a2 , . . . , a5 , respectively (see Figure 1.3(b)). The set of all actions
is called the action space, denoted as A = {a1 , . . . , a5 }. Different states can have different
action spaces. For instance, considering that taking a1 or a4 at state s1 would lead to a
collision with the boundary, we can set the action space for state s1 as A(s1 ) = {a2 , a3 , a5 }.
In this book, we consider the most general case: A(si ) = A = {a1 , . . . , a5 } for all i.
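To make these definitions concrete, here is a minimal Python sketch of the state space and action
spaces of the grid world. The names (STATES, ACTIONS, action_space) are illustrative choices, not
notation from the book.

```python
# A minimal sketch of the grid world's state space and action spaces.
STATES = [f"s{i}" for i in range(1, 10)]   # S = {s1, ..., s9}
ACTIONS = ["a1", "a2", "a3", "a4", "a5"]   # up, right, down, left, stay

def action_space(state: str) -> list:
    """A(s): the most general case considered in this book, the same five actions everywhere."""
    return ACTIONS

print(len(STATES))          # 9
print(action_space("s1"))   # ['a1', 'a2', 'a3', 'a4', 'a5']
```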


Figure 1.3: Illustrations of the state and action concepts. (a) There are nine states {s1, . . . , s9},
arranged in a 3 × 3 grid. (b) Each state has five possible actions {a1, a2, a3, a4, a5}: a1 (up),
a2 (right), a3 (down), a4 (left), and a5 (stay).

1.3 State transition


When taking an action, the agent may move from one state to another. Such a process is
called state transition. For example, if the agent is at state s1 and selects action a2 (that
is, moving rightward), then the agent moves to state s2 . Such a process can be expressed
as
s1 --a2--> s2.

We next examine two important examples.


⋄ What is the next state when the agent attempts to go beyond the boundary, for
example, taking action a1 at state s1? The answer is that the agent will be bounced
back because it is impossible for the agent to exit the state space. Hence, we have
s1 --a1--> s1.
⋄ What is the next state when the agent attempts to enter a forbidden cell, for example,
taking action a2 at state s5? Two different scenarios may be encountered. In the first
scenario, although s6 is forbidden, it is still accessible. In this case, the next state
is s6; hence, the state transition process is s5 --a2--> s6. In the second scenario, s6 is
not accessible because, for example, it is surrounded by walls. In this case, the agent
is bounced back to s5 if it attempts to move rightward; hence, the state transition
process is s5 --a2--> s5.
Which scenario should we consider? The answer depends on the physical environment.
In this book, we consider the first scenario, in which the forbidden cells are accessible
although stepping into them may be punished. This scenario is more general and
interesting. Moreover, since we are considering a simulation task, we can define the state
transition process however we prefer. In real-world applications, the state transition
process is determined by real-world dynamics.
The state transition process is defined for each state and its associated actions. This
process can be described by a table as shown in Table 1.1. In this table, each row
corresponds to a state, and each column corresponds to an action. Each cell indicates
the next state to transition to after the agent takes an action at the corresponding state.

        a1 (upward)   a2 (rightward)   a3 (downward)   a4 (leftward)   a5 (unchanged)
s1      s1            s2               s4              s1              s1
s2      s2            s3               s5              s1              s2
s3      s3            s3               s6              s2              s3
s4      s1            s5               s7              s4              s4
s5      s2            s6               s8              s4              s5
s6      s3            s6               s9              s5              s6
s7      s4            s8               s7              s7              s7
s8      s5            s9               s8              s7              s8
s9      s6            s9               s9              s8              s9
Table 1.1: A tabular representation of the state transition process. Each cell indicates the next state to
transition to after the agent takes an action at a state.
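For readers who like to experiment, Table 1.1 can be encoded as a simple lookup structure. The
sketch below is one possible encoding (the names NEXT_STATE and step are mine, not the book's);
each row lists the next states under actions a1 through a5.

```python
# One possible encoding of Table 1.1: a deterministic transition table.
# Each row lists the next states under actions a1 (up), a2 (right), a3 (down), a4 (left), a5 (stay).
ACTIONS = ["a1", "a2", "a3", "a4", "a5"]

NEXT_STATE = {
    "s1": ["s1", "s2", "s4", "s1", "s1"],
    "s2": ["s2", "s3", "s5", "s1", "s2"],
    "s3": ["s3", "s3", "s6", "s2", "s3"],
    "s4": ["s1", "s5", "s7", "s4", "s4"],
    "s5": ["s2", "s6", "s8", "s4", "s5"],
    "s6": ["s3", "s6", "s9", "s5", "s6"],
    "s7": ["s4", "s8", "s7", "s7", "s7"],
    "s8": ["s5", "s9", "s8", "s7", "s8"],
    "s9": ["s6", "s9", "s9", "s8", "s9"],
}

def step(state: str, action: str) -> str:
    """Deterministic state transition: look up the next state in Table 1.1."""
    return NEXT_STATE[state][ACTIONS.index(action)]

assert step("s1", "a2") == "s2"   # s1 --a2--> s2
assert step("s1", "a1") == "s1"   # bounced back at the boundary
```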

Mathematically, the state transition process can be described by conditional probabilities.
For example, for s1 and a2, the conditional probability distribution is

p(s1 | s1, a2) = 0,
p(s2 | s1, a2) = 1,
p(s3 | s1, a2) = 0,
p(s4 | s1, a2) = 0,
p(s5 | s1, a2) = 0,

which indicates that, when taking a2 at s1 , the probability of the agent moving to s2
is one, and the probabilities of the agent moving to other states are zero. As a result,
taking action a2 at s1 will certainly cause the agent to transition to s2 . The preliminaries
of conditional probability are given in Appendix A. Readers are strongly advised to be
familiar with probability theory since it is necessary for studying reinforcement learning.
Although intuitive, the tabular representation can only describe deterministic state
transitions. In general, state transitions can be stochastic and must be described by
conditional probability distributions. For instance, if random wind gusts blow across
the grid, the agent may be blown to s5 instead of s2 when taking action a2 at s1; in this
case, p(s5|s1, a2) > 0. Nevertheless, for simplicity, we only consider deterministic state
transitions in the grid world examples in this book.
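A stochastic transition model stores the full distribution p(s′|s, a) instead of a single next state.
The sketch below illustrates this for the wind example; the specific probabilities (0.8 and 0.2) are
invented purely for illustration.

```python
import random

# p(s' | s, a) stored explicitly as a dictionary of distributions.
# The 0.8/0.2 split for the wind at (s1, a2) is an invented illustration, not from the book.
TRANSITION_PROB = {
    ("s1", "a2"): {"s2": 0.8, "s5": 0.2},   # wind may blow the agent to s5
    ("s1", "a1"): {"s1": 1.0},              # deterministic: bounced back at the boundary
}

def sample_next_state(state: str, action: str) -> str:
    """Sample s' from p(s' | s, a)."""
    dist = TRANSITION_PROB[(state, action)]
    next_states, probs = zip(*dist.items())
    return random.choices(next_states, weights=probs, k=1)[0]

print(sample_next_state("s1", "a2"))   # usually 's2', occasionally 's5'
```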

1.4 Policy
A policy tells the agent which actions to take at every state. Intuitively, policies can
be depicted as arrows (see Figure 1.4(a)). Following a policy, the agent can generate a
trajectory starting from an initial state (see Figure 1.4(b)).


Figure 1.4: A policy represented by arrows and some trajectories obtained by starting from different
initial states. (a) A deterministic policy. (b) Trajectories obtained from the policy.

Mathematically, policies can be described by conditional probabilities. Denote the
policy in Figure 1.4 as π(a|s), which is a conditional probability distribution function
defined for every state. For example, the policy for s1 is

π(a1 | s1) = 0,
π(a2 | s1) = 1,
π(a3 | s1) = 0,
π(a4 | s1) = 0,
π(a5 | s1) = 0,

which indicates that the probability of taking action a2 at state s1 is one, and the prob-
abilities of taking other actions are zero.
The above policy is deterministic. Policies may be stochastic in general. For example,
the policy shown in Figure 1.5 is stochastic: at state s1 , the agent may take actions to
go either rightward or downward. The probabilities of taking these two actions are the
same (both are 0.5). In this case, the policy for s1 is

π(a1 | s1) = 0,
π(a2 | s1) = 0.5,
π(a3 | s1) = 0.5,
π(a4 | s1) = 0,
π(a5 | s1) = 0.

Figure 1.5: A stochastic policy. At state s1, the agent may move rightward or downward with equal
probabilities of 0.5.

Policies represented by conditional probabilities can be stored as tables. For example,
Table 1.2 represents the stochastic policy depicted in Figure 1.5. The entry in the ith
row and jth column is the probability of taking the jth action at the ith state. Such
a representation is called a tabular representation. We will introduce another way to
represent policies as parameterized functions in Chapter 8.

        a1 (upward)   a2 (rightward)   a3 (downward)   a4 (leftward)   a5 (unchanged)
s1      0             0.5              0.5             0               0
s2      0             0                1               0               0
s3      0             0                0               1               0
s4      0             1                0               0               0
s5      0             0                1               0               0
s6      0             0                1               0               0
s7      0             1                0               0               0
s8      0             1                0               0               0
s9      0             0                0               0               1
Table 1.2: A tabular representation of a policy. Each entry indicates the probability of taking an action
at a state.
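In code, a tabular policy such as Table 1.2 can be stored as a matrix of probabilities, and executing
the policy amounts to sampling an action from π(·|s) at each state. The sketch below is a minimal
illustration with names of my own choosing.

```python
import random

# pi(a | s) for the stochastic policy in Table 1.2 (columns a1..a5).
ACTIONS = ["a1", "a2", "a3", "a4", "a5"]
POLICY = {
    "s1": [0.0, 0.5, 0.5, 0.0, 0.0],
    "s2": [0.0, 0.0, 1.0, 0.0, 0.0],
    "s3": [0.0, 0.0, 0.0, 1.0, 0.0],
    "s4": [0.0, 1.0, 0.0, 0.0, 0.0],
    "s5": [0.0, 0.0, 1.0, 0.0, 0.0],
    "s6": [0.0, 0.0, 1.0, 0.0, 0.0],
    "s7": [0.0, 1.0, 0.0, 0.0, 0.0],
    "s8": [0.0, 1.0, 0.0, 0.0, 0.0],
    "s9": [0.0, 0.0, 0.0, 0.0, 1.0],
}

def sample_action(state: str) -> str:
    """Sample an action a ~ pi(. | s) from the tabular policy."""
    return random.choices(ACTIONS, weights=POLICY[state], k=1)[0]

print(sample_action("s1"))   # 'a2' or 'a3', each with probability 0.5
```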

1.5 Reward
The reward is one of the most distinctive concepts in reinforcement learning. After
executing an action at a state, the agent obtains a reward, denoted as r, as feedback
from the environment. The reward is a function of the state s and action a.
Hence, it is also denoted as r(s, a). Its value can be a positive or negative real number
or zero. Different rewards have different impacts on the policy that the agent would
eventually learn. Generally speaking, with a positive reward, we encourage the agent to
take the corresponding action. With a negative reward, we discourage the agent from
taking that action.
In the grid world example, the rewards are designed as follows:

⋄ If the agent attempts to exit the boundary, let rboundary = −1.
⋄ If the agent attempts to enter a forbidden cell, let rforbidden = −1.
⋄ If the agent reaches the target state, let rtarget = +1.
⋄ Otherwise, the agent obtains a reward of rother = 0.

Special attention should be given to the target state s9 . The reward process does not
have to terminate after the agent reaches s9 . If the agent takes action a5 at s9 , the next
state is again s9 , and the reward is rtarget = +1. If the agent takes action a2 , the next
state is also s9 , but the reward is rboundary = −1.
A reward can be interpreted as a human-machine interface, with which we can guide
the agent to behave as we expect. For example, with the rewards designed above, we can
expect that the agent tends to avoid exiting the boundary or stepping into the forbidden
cells. Designing appropriate rewards is an important step in reinforcement learning. This
step is, however, nontrivial for complex tasks since it may require the user to understand
the given problem well. Nevertheless, it may still be much easier than solving the problem
with other approaches that require a professional background or a deep understanding of
the given problem.
The process of getting a reward after executing an action can be intuitively represented
as a table, as shown in Table 1.3. Each row of the table corresponds to a state, and each
column corresponds to an action. The value in each cell of the table indicates the reward
that can be obtained by taking an action at a state.
One question that beginners may ask is as follows: if given the table of rewards, can
we find good policies by simply selecting the actions with the greatest rewards? The
answer is no. That is because these rewards are immediate rewards that can be obtained
after taking an action. To determine a good policy, we must consider the total reward
obtained in the long run (see Section 1.6 for more information). An action with the
greatest immediate reward may not lead to the greatest total reward.
Although intuitive, the tabular representation is only able to describe deterministic
reward processes. A more general approach is to use conditional probabilities p(r|s, a) to
describe reward processes. For example, for state s1 , we have

p(r = −1 | s1, a1) = 1,        p(r ≠ −1 | s1, a1) = 0.


        a1 (upward)   a2 (rightward)   a3 (downward)   a4 (leftward)   a5 (unchanged)
s1      rboundary     0                0               rboundary       0
s2      rboundary     0                0               0               0
s3      rboundary     rboundary        rforbidden      0               0
s4      0             0                rforbidden      rboundary       0
s5      0             rforbidden       0               0               0
s6      0             rboundary        rtarget         0               rforbidden
s7      0             0                rboundary       rboundary       rforbidden
s8      0             rtarget          rboundary       rforbidden      0
s9      rforbidden    rboundary        rboundary       0               rtarget
Table 1.3: A tabular representation of the process of obtaining rewards. Here, the process is deterministic.
Each cell indicates how much reward can be obtained after the agent takes an action at a given state.

This indicates that, when taking a1 at s1 , the agent obtains r = −1 with certainty. In
this example, the reward process is deterministic. In general, it can be stochastic. For
example, if a student studies hard, he or she would receive a positive reward (e.g., higher
grades on exams), but the specific value of the reward may be uncertain.
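In code, the deterministic reward design of this chapter can be written as a function r(s, a). The
sketch below encodes only a few representative entries of Table 1.3; the names and the default value
are illustrative choices.

```python
# Deterministic reward r(s, a), following the reward design of this chapter.
# Only a few representative entries of Table 1.3 are listed here.
R_BOUNDARY, R_FORBIDDEN, R_TARGET, R_OTHER = -1, -1, +1, 0

REWARD = {
    ("s1", "a1"): R_BOUNDARY,    # attempts to exit the boundary
    ("s5", "a2"): R_FORBIDDEN,   # attempts to enter the forbidden cell s6
    ("s8", "a2"): R_TARGET,      # reaches the target state s9
    ("s9", "a5"): R_TARGET,      # stays at the target state
}

def reward(state: str, action: str) -> int:
    """r(s, a): the immediate reward for taking action a at state s."""
    return REWARD.get((state, action), R_OTHER)

assert reward("s1", "a1") == -1
assert reward("s8", "a2") == +1
```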

1.6 Trajectories, returns, and episodes

Figure 1.6: Trajectories obtained by following two policies. The trajectories are indicated by red
dashed lines. (a) Policy 1 and its trajectory. (b) Policy 2 and its trajectory.

A trajectory is a state-action-reward chain. For example, given the policy shown in
Figure 1.6(a), the agent can move along the following trajectory starting from s1:

s1 --(a2, r=0)--> s2 --(a3, r=0)--> s5 --(a3, r=0)--> s8 --(a2, r=1)--> s9.

The return of this trajectory is defined as the sum of all the rewards collected along the
trajectory:

return = 0 + 0 + 0 + 1 = 1. (1.1)


Returns are also called total rewards or cumulative rewards.


Returns can be used to evaluate policies. For example, we can evaluate the two
policies in Figure 1.6 by comparing their returns. In particular, starting from s1 , the
return obtained by the left policy is 1 as calculated above. For the right policy, starting
from s1 , the following trajectory is generated:

s1 --(a3, r=0)--> s4 --(a3, r=−1)--> s7 --(a2, r=0)--> s8 --(a2, r=+1)--> s9.

The corresponding return is

return = 0 − 1 + 0 + 1 = 0. (1.2)

The returns in (1.1) and (1.2) indicate that the left policy is better than the right one
since its return is greater. This mathematical conclusion is consistent with the intuition
that the right policy is worse since it passes through a forbidden cell.
A return consists of an immediate reward and future rewards. Here, the immediate
reward is the reward obtained after taking an action at the initial state; the future
rewards refer to the rewards obtained after leaving the initial state. It is possible that the
immediate reward is negative while the future reward is positive. Thus, which actions to
take should be determined by the return (i.e., the total reward) rather than the immediate
reward to avoid short-sighted decisions.
The return in (1.1) is defined for a finite-length trajectory. Return can also be defined
for infinitely long trajectories. For example, the trajectory in Figure 1.6 stops after
reaching s9 . Since the policy is well defined for s9 , the process does not have to stop after
the agent reaches s9 . We can design a policy so that the agent stays unchanged after
reaching s9 . Then, the policy would generate the following infinitely long trajectory:

s1 --(a2, r=0)--> s2 --(a3, r=0)--> s5 --(a3, r=0)--> s8 --(a2, r=1)--> s9 --(a5, r=1)--> s9 --(a5, r=1)--> s9 --> · · ·

The direct sum of the rewards along this trajectory is

return = 0 + 0 + 0 + 1 + 1 + 1 + · · · = ∞,

which unfortunately diverges. Therefore, we must introduce the concept of the discounted
return for infinitely long trajectories. In particular, the discounted return is the sum of the
discounted rewards:

discounted return = 0 + γ·0 + γ^2·0 + γ^3·1 + γ^4·1 + γ^5·1 + · · · ,        (1.3)

where γ ∈ (0, 1) is called the discount rate. When γ ∈ (0, 1), the value of (1.3) can be
calculated as

discounted return = γ^3 (1 + γ + γ^2 + · · ·) = γ^3 · 1/(1 − γ).
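To make the calculation concrete, the short Python sketch below evaluates the return in (1.1) and
numerically approximates the discounted return above with γ = 0.9 (a value chosen only for
illustration); the result matches the closed form γ^3/(1 − γ) = 7.29.

```python
# Return of the finite trajectory s1 -> s2 -> s5 -> s8 -> s9 (rewards 0, 0, 0, +1).
rewards_finite = [0, 0, 0, 1]
print(sum(rewards_finite))                  # 1, as in (1.1)

# Discounted return of the infinite trajectory: rewards 0, 0, 0, 1, 1, 1, ...
gamma = 0.9                                 # illustrative choice of discount rate
rewards_infinite = [0, 0, 0] + [1] * 1000   # numerically truncate the infinite tail
discounted = sum(gamma**t * r for t, r in enumerate(rewards_infinite))
print(discounted)                           # ~7.29
print(gamma**3 / (1 - gamma))               # closed form gamma^3 / (1 - gamma) = 7.29
```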

The introduction of the discount rate is useful for the following reasons. First, it
removes the stopping criterion and allows for infinitely long trajectories. Second, the
discount rate can be used to adjust the emphasis placed on near- or far-future rewards.
In particular, if γ is close to 0, the agent places more emphasis on rewards obtained in
the near future, and the resulting policy is short-sighted. If γ is close to 1, the agent
places more emphasis on far-future rewards; the resulting policy is far-sighted and dares
to take the risk of obtaining negative rewards in the near future. These points will be
demonstrated in Section 3.5.
One important notion that was not explicitly mentioned in the above discussion is the
episode. When interacting with the environment by following a policy, the agent may stop
at some terminal states. The resulting trajectory is called an episode (or a trial). If the
environment or policy is stochastic, we obtain different episodes when starting from the
same state. However, if everything is deterministic, we always obtain the same episode
when starting from the same state.
An episode is usually assumed to be a finite trajectory. Tasks with episodes are called
episodic tasks. However, some tasks may have no terminal states, meaning that the
process of interacting with the environment never ends. Such tasks are called continuing
tasks. In fact, we can treat episodic and continuing tasks in a unified mathematical
manner by converting episodic tasks to continuing ones. To do that, we must properly
define the process after the agent reaches the terminal state. Specifically, after reaching
the terminal state in an episodic task, the agent can continue taking actions in the
following two ways.

⋄ First, if we treat the terminal state as a special state, we can specifically design its
action space or state transitions so that the agent stays in this state forever. Such
states are called absorbing states, meaning that the agent never leaves a state once it
is reached. For example, for the target state s9, we can specify A(s9) = {a5} or set
A(s9) = {a1, . . . , a5} with p(s9|s9, ai) = 1 for all i = 1, . . . , 5.
⋄ Second, if we treat the terminal state as a normal state, we can simply set its action
space to be the same as that of the other states, and the agent may leave the state and
come back again. Since a positive reward of r = 1 can be obtained every time s9 is
reached, the agent will eventually learn to stay at s9 forever to collect more rewards.
Notably, when an episode is infinitely long and the reward received for staying at s9 is
positive, a discount rate must be used to calculate the discounted return to avoid
divergence.

In this book, we consider the second scenario where the target state is treated as a normal
state whose action space is A(s9 ) = {a1 , . . . , a5 }.
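Both treatments can be written directly into a transition model. The snippet below sketches the two
options for s9, reusing the dictionary-of-distributions encoding from the earlier sketches (an
assumption of these examples, not the book's notation).

```python
# Option 1: make s9 an absorbing state, i.e., p(s9 | s9, ai) = 1 for every action ai.
absorbing_s9 = {("s9", a): {"s9": 1.0} for a in ["a1", "a2", "a3", "a4", "a5"]}

# Option 2 (used in this book): treat s9 as a normal state with the usual transitions;
# staying put (a5) keeps the agent at s9 and yields rtarget = +1 at every step.
normal_s9 = {
    ("s9", "a1"): {"s6": 1.0},   # the agent may leave the target and come back later
    ("s9", "a2"): {"s9": 1.0},   # bounced back by the boundary
    ("s9", "a3"): {"s9": 1.0},   # bounced back by the boundary
    ("s9", "a4"): {"s8": 1.0},
    ("s9", "a5"): {"s9": 1.0},   # remain at the target
}
```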


1.7 Markov decision processes


The previous sections of this chapter illustrated some fundamental concepts in reinforcement
learning through examples. This section presents these concepts in a more formal
way under the framework of Markov decision processes (MDPs).
An MDP is a general framework for describing stochastic dynamical systems. The
key ingredients of an MDP are listed below.

⋄ Sets:

  - State space: the set of all states, denoted as S.
  - Action space: a set of actions, denoted as A(s), associated with each state s ∈ S.
  - Reward set: a set of rewards, denoted as R(s, a), associated with each state-action
    pair (s, a).

⋄ Model:

  - State transition probability: At state s, when taking action a, the probability of
    transitioning to state s′ is p(s′|s, a). It holds that Σ_{s′∈S} p(s′|s, a) = 1 for any (s, a).
  - Reward probability: At state s, when taking action a, the probability of obtaining
    reward r is p(r|s, a). It holds that Σ_{r∈R(s,a)} p(r|s, a) = 1 for any (s, a).

⋄ Policy: At state s, the probability of choosing action a is π(a|s). It holds that
  Σ_{a∈A(s)} π(a|s) = 1 for any s ∈ S.

⋄ Markov property: The Markov property refers to the memoryless property of a
  stochastic process. Mathematically, it means that

  p(s_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, . . . , s_0, a_0) = p(s_{t+1} | s_t, a_t),
  p(r_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, . . . , s_0, a_0) = p(r_{t+1} | s_t, a_t),        (1.4)

where t represents the current time step and t + 1 represents the next time step.
Equation (1.4) indicates that the next state or reward depends merely on the current
state and action and is independent of the previous ones. The Markov property is
important for deriving the fundamental Bellman equation of MDPs, as shown in the
next chapter.

Here, p(s′|s, a) and p(r|s, a) for all (s, a) are called the model or dynamics. The
model can be either stationary or nonstationary (in other words, time-invariant or
time-variant). A stationary model does not change over time; a nonstationary model
may vary over time. For instance, in the grid world example, if a forbidden area can
appear or disappear over time, the model is nonstationary. In this book, we only consider
stationary models.
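Collecting the ingredients above, a finite MDP can be stored as a small container holding the sets,
the model, and the discount rate, with the policy kept separately. The dataclass below is one
possible layout; all names are illustrative, not the book's notation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State, Action = str, str

@dataclass
class MDP:
    """A finite MDP: sets, model (dynamics), and discount rate."""
    states: List[State]                                         # S
    actions: Dict[State, List[Action]]                          # A(s)
    transition: Dict[Tuple[State, Action], Dict[State, float]]  # p(s' | s, a)
    reward: Dict[Tuple[State, Action], Dict[float, float]]      # p(r | s, a)
    gamma: float                                                # discount rate in (0, 1)

# A policy pi(a | s) is kept separate from the model; once it is fixed,
# the MDP degenerates into a Markov process.
Policy = Dict[State, Dict[Action, float]]
```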


One may have heard of Markov processes (MPs). What is the difference
between an MDP and an MP? The answer is that, once the policy in an MDP is fixed,
the MDP degenerates into an MP. For example, the grid world example in Figure 1.7 can
be abstracted as a Markov process. In the literature on stochastic processes, a Markov
process is also called a Markov chain if it is a discrete-time process and the number of
states is finite or countable [1]. In this book, the terms “Markov process” and “Markov
chain” are used interchangeably when the context is clear. Moreover, this book mainly
considers finite MDPs where the numbers of states and actions are finite. This is the
simplest case that should be fully understood.
Figure 1.7: Abstraction of the grid world example as a Markov process. Here, the circles represent
states and the links with arrows represent state transitions, labeled with their probabilities.

Finally, reinforcement learning can be described as an agent-environment interaction
process. The agent is a decision-maker that can sense its state, maintain policies, and
execute actions. Everything outside of the agent is regarded as the environment. In the
grid world examples, the agent and environment correspond to the robot and the grid
world, respectively. After the agent decides to take an action, the actuator executes the
decision; the state of the agent then changes, and a reward is obtained. By using
interpreters, the agent can interpret the new state and the reward. Thus, a closed loop
is formed.

1.8 Summary
This chapter introduced the basic concepts that will be widely used in the remainder of
the book. We used intuitive grid world examples to demonstrate these concepts and then
formalized them in the framework of MDPs. For more information about MDPs, readers
can see [1, 2].

1.9 Q&A
⋄ Q: Can we set all the rewards to be negative or all positive?
A: In this chapter, we mentioned that a positive reward would encourage the agent
to take an action and that a negative reward would discourage the agent from taking
the action. In fact, it is the relative reward values rather than the absolute values that
determine encouragement or discouragement.
More specifically, we set rboundary = −1, rforbidden = −1, rtarget = +1, and rother = 0 in
this chapter. We can also add a common value to all these values without changing
the resulting optimal policy. For example, we can add −2 to all the rewards to obtain
rboundary = −3, rforbidden = −3, rtarget = −1, and rother = −2. Although the rewards
are all negative, the resulting optimal policy is unchanged. That is because optimal
policies are invariant to affine transformations of the rewards. Details will be given in
Section 3.5.
⋄ Q: Is the reward a function of the next state?
A: We mentioned that the reward r depends only on s and a but not on the next state s′.
However, this may be counterintuitive since it is the next state that determines the
reward in many cases. For example, the reward is positive when the next state is the
target state. As a result, a question that naturally follows is whether a reward should
depend on the next state. A mathematical rephrasing of this question is whether we
should use p(r|s, a, s′), where s′ is the next state, rather than p(r|s, a). The answer is
that r depends on s, a, and s′. However, since s′ also depends on s and a, we can
equivalently write r as a function of s and a: p(r|s, a) = Σ_{s′} p(r|s, a, s′) p(s′|s, a). In
this way, the Bellman equation can be easily established, as shown in Chapter 2.
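As a quick numerical sanity check of this marginalization (with invented numbers): if taking a at s
leads to the target state with probability 0.9 (yielding r = +1) and to another state with
probability 0.1 (yielding r = 0), then p(r = +1 | s, a) = 0.9.

```python
# p(r | s, a) = sum over s' of p(r | s, a, s') * p(s' | s, a); all numbers are invented.
p_next = {"s_target": 0.9, "s_other": 0.1}             # p(s' | s, a)
p_r_given_next = {"s_target": {+1: 1.0, 0: 0.0},       # p(r | s, a, s')
                  "s_other":  {+1: 0.0, 0: 1.0}}

p_r = {r: sum(p_r_given_next[s_next][r] * p for s_next, p in p_next.items())
       for r in (+1, 0)}
print(p_r)   # {1: 0.9, 0: 0.1}
```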

