Chapter 1: Basic Concepts

[Chapter map: Chapter 1 (Basic Concepts) provides the fundamental tools used throughout the book, from the Bellman equation (Chapter 2) and the Bellman optimality equation (Chapter 3) to temporal-difference methods (Chapter 7) and value function approximation (Chapter 8), spanning tabular representations to function representations.]
This chapter introduces the basic concepts of reinforcement learning. These concepts
are important because they will be widely used in this book. We first introduce these
concepts using examples and then formalize them in the framework of Markov decision
processes.
only occupy a single cell. The white cells are accessible for entry, and the orange cells
are forbidden. There is a target cell that the robot would like to reach. We will use such
grid world examples throughout this book since they are intuitive for illustrating new
concepts and algorithms.
Figure 1.2: The grid world example used throughout the book. The grid contains a start cell, forbidden cells, and a target cell.
The ultimate goal of the agent is to find a “good” policy that enables it to reach
the target cell when starting from any initial cell. How can the “goodness” of a policy
be defined? The idea is that the agent should reach the target without entering any
forbidden cells, taking unnecessary detours, or colliding with the boundary of the grid.
It would be trivial to plan a path to reach the target cell if the agent knew the map of
the grid world. The task becomes nontrivial if the agent does not know any information
about the environment in advance. Then, the agent must interact with the environment
to find a good policy by trial and error. To do that, the concepts presented in the rest of
the chapter are necessary.
Figure 1.3: Illustrations of the state and action concepts. (a) There are nine states {s1, . . . , s9}. (b) Each state has five possible actions {a1, a2, a3, a4, a5}.
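To make these sets concrete, the following Python sketch simply enumerates the nine states and five actions. The mapping of actions to movement directions (a1 = up, a2 = right, a3 = down, a4 = left, a5 = stay) is inferred from the transitions described later in this chapter and should be read as an illustrative assumption rather than a formal definition.

```python
# A minimal sketch of the grid world's state and action sets.
# States s1..s9 are laid out row by row in a 3x3 grid; actions are
# assumed to mean: a1 = up, a2 = right, a3 = down, a4 = left, a5 = stay.

STATES = [f"s{i}" for i in range(1, 10)]          # s1, ..., s9
ACTIONS = ["a1", "a2", "a3", "a4", "a5"]          # up, right, down, left, stay

# Row/column coordinates of each state in the 3x3 grid (s1 is top-left).
COORDS = {f"s{3 * r + c + 1}": (r, c) for r in range(3) for c in range(3)}

print(STATES)
print(COORDS["s1"], COORDS["s5"], COORDS["s9"])   # (0, 0) (1, 1) (2, 2)
```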
The state transition process can be intuitively represented as a table: each row corresponds to a state, each column corresponds to an action, and each cell indicates the next state to transition to after the agent takes that action at that state. For example, for state s1 and action a2, the state transition can be described by the conditional probabilities

p(s1 | s1, a2) = 0,
p(s2 | s1, a2) = 1,
p(s3 | s1, a2) = 0,
p(s4 | s1, a2) = 0,
p(s5 | s1, a2) = 0,
which indicates that, when taking a2 at s1 , the probability of the agent moving to s2
is one, and the probabilities of the agent moving to other states are zero. As a result,
taking action a2 at s1 will certainly cause the agent to transition to s2 . The preliminaries
of conditional probability are given in Appendix A. Readers are strongly advised to be
familiar with probability theory since it is necessary for studying reinforcement learning.
Although it is intuitive, the tabular representation is only able to describe determinis-
tic state transitions. In general, state transitions can be stochastic and must be described
by conditional probability distributions. For instance, when random wind gusts are ap-
plied across the grid, if taking action a2 at s1 , the agent may be blown to s5 instead of
s2 . We have p(s5 |s1 , a2 ) > 0 in this case. Nevertheless, we merely consider deterministic
state transitions in the grid world examples for simplicity in this book.
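As a sketch of the two representations, the snippet below stores the deterministic transition for (s1, a2) as a plain table entry and a stochastic, wind-affected variant as a conditional distribution p(s′|s, a). The 0.8/0.2 split is a made-up number for illustration; only the fact that p(s5|s1, a2) > 0 in the windy case comes from the text.

```python
import random

# Deterministic transitions: a plain lookup table, one next state per (s, a).
deterministic_next = {
    ("s1", "a2"): "s2",   # moving right from s1 always lands in s2
}

# Stochastic transitions: a conditional distribution p(s' | s, a).
# The 0.8 / 0.2 split models the hypothetical wind gust; the numbers are
# illustrative only, not taken from the text.
stochastic_next = {
    ("s1", "a2"): {"s2": 0.8, "s5": 0.2},
}

def sample_next_state(s, a):
    """Draw s' ~ p(. | s, a) from the stochastic model."""
    dist = stochastic_next[(s, a)]
    states, probs = zip(*dist.items())
    return random.choices(states, weights=probs, k=1)[0]

print(deterministic_next[("s1", "a2")])   # s2, with certainty
print(sample_next_state("s1", "a2"))      # s2 most of the time, occasionally s5
```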
1.4 Policy
A policy tells the agent which actions to take at every state. Intuitively, policies can
be depicted as arrows (see Figure 1.4(a)). Following a policy, the agent can generate a
trajectory starting from an initial state (see Figure 1.4(b)).
Figure 1.4: A policy represented by arrows and some trajectories obtained by starting from different
initial states.
Mathematically, a policy is described by conditional probabilities. For example, for state s1, the policy shown in Figure 1.4 gives

π(a1 |s1 ) = 0,
π(a2 |s1 ) = 1,
π(a3 |s1 ) = 0,
π(a4 |s1 ) = 0,
π(a5 |s1 ) = 0,
which indicates that the probability of taking action a2 at state s1 is one, and the prob-
abilities of taking other actions are zero.
The above policy is deterministic. Policies may be stochastic in general. For example,
the policy shown in Figure 1.5 is stochastic: at state s1 , the agent may take actions to
go either rightward or downward. The probabilities of taking these two actions are the
same, both equal to 0.5:
π(a1 |s1 ) = 0,
π(a2 |s1 ) = 0.5,
π(a3 |s1 ) = 0.5,
π(a4 |s1 ) = 0,
π(a5 |s1 ) = 0.
Figure 1.5: A stochastic policy. At state s1, the agent may move rightward or downward with equal probabilities of 0.5.
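A policy can be stored in the same way as the state transitions: one distribution over actions per state. The sketch below encodes the stochastic policy of Figure 1.5 at s1 and samples an action from it; it is a minimal illustration, not the book's notation.

```python
import random

# pi[s][a] = probability of taking action a in state s.
# At s1 the policy of Figure 1.5 moves rightward (a2) or downward (a3)
# with equal probability 0.5.
pi = {
    "s1": {"a1": 0.0, "a2": 0.5, "a3": 0.5, "a4": 0.0, "a5": 0.0},
}

def sample_action(s):
    """Draw a ~ pi(. | s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action("s1"))   # a2 or a3, each with probability 0.5
```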
1.5 Reward
Reward is one of the most distinctive concepts in reinforcement learning. After executing an action at a state, the agent obtains a reward, a real number, as feedback. In the grid world example, the rewards are designed as follows: rboundary = −1 for attempting to move out of the boundary, rforbidden = −1 for entering a forbidden cell, rtarget = +1 for reaching the target cell, and rother = 0 otherwise.
Special attention should be given to the target state s9 . The reward process does not
have to terminate after the agent reaches s9 . If the agent takes action a5 at s9 , the next
state is again s9 , and the reward is rtarget = +1. If the agent takes action a2 , the next
state is also s9 , but the reward is rboundary = −1.
A reward can be interpreted as a human-machine interface, with which we can guide
the agent to behave as we expect. For example, with the rewards designed above, we can
expect that the agent tends to avoid exiting the boundary or stepping into the forbidden
cells. Designing appropriate rewards is an important step in reinforcement learning. This
step is, however, nontrivial for complex tasks since it may require the user to understand
the given problem well. Nevertheless, designing rewards may still be much easier than solving the problem with other approaches that demand deep domain expertise.
The process of getting a reward after executing an action can be intuitively represented
as a table, as shown in Table 1.3. Each row of the table corresponds to a state, and each
column corresponds to an action. The value in each cell of the table indicates the reward
that can be obtained by taking an action at a state.
One question that beginners may ask is as follows: if given the table of rewards, can
we find good policies by simply selecting the actions with the greatest rewards? The
answer is no. That is because these rewards are immediate rewards that can be obtained
after taking an action. To determine a good policy, we must consider the total reward
obtained in the long run (see Section 1.6 for more information). An action with the
greatest immediate reward may not lead to the greatest total reward.
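The following toy computation illustrates why greedily picking the action with the greatest immediate reward can be short-sighted. The two reward sequences are hypothetical and not taken from the grid world: the first option looks better immediately but yields a smaller total reward.

```python
# Two hypothetical courses of action from the same state.
# Option A: immediate reward +1, but the path afterwards keeps hitting
#           forbidden cells (-1 each step).
# Option B: immediate reward 0, but the path then reaches the target (+1).
rewards_A = [1, -1, -1, -1]
rewards_B = [0, 0, 0, 1]

print("immediate:", rewards_A[0], "vs", rewards_B[0])      # A looks better now
print("total:    ", sum(rewards_A), "vs", sum(rewards_B))  # B is better overall
```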
Although intuitive, the tabular representation is only able to describe deterministic
reward processes. A more general approach is to use conditional probabilities p(r|s, a) to
describe reward processes. For example, for state s1 and action a1, we have

p(r = −1 | s1, a1) = 1,    p(r ≠ −1 | s1, a1) = 0.
This indicates that, when taking a1 at s1 , the agent obtains r = −1 with certainty. In
this example, the reward process is deterministic. In general, it can be stochastic. For
example, if a student studies hard, he or she would receive a positive reward (e.g., higher
grades on exams), but the specific value of the reward may be uncertain.
1.6 Trajectories, returns, and episodes

Figure 1.6: Trajectories obtained by following two policies: (a) policy 1 and its trajectory; (b) policy 2 and its trajectory. The trajectories are indicated by red dashed lines.
A trajectory is a state-action-reward chain. For example, following the first policy in Figure 1.6(a) from s1, the agent generates the trajectory

s1 −(a2, r=0)→ s2 −(a3, r=0)→ s5 −(a3, r=0)→ s8 −(a2, r=+1)→ s9.
The return of this trajectory is defined as the sum of all the rewards collected along the
trajectory:
return = 0 + 0 + 0 + 1 = 1. (1.1)
Following the second policy in Figure 1.6(b) from s1, the agent generates the trajectory

s1 −(a3, r=0)→ s4 −(a3, r=−1)→ s7 −(a2, r=0)→ s8 −(a2, r=+1)→ s9.

The corresponding return is
return = 0 − 1 + 0 + 1 = 0. (1.2)
The returns in (1.1) and (1.2) indicate that the left policy is better than the right one
since its return is greater. This mathematical conclusion is consistent with the intuition
that the right policy is worse since it passes through a forbidden cell.
A return consists of an immediate reward and future rewards. Here, the immediate
reward is the reward obtained after taking an action at the initial state; the future
rewards refer to the rewards obtained after leaving the initial state. It is possible that the
immediate reward is negative while the future reward is positive. Thus, which actions to
take should be determined by the return (i.e., the total reward) rather than the immediate
reward to avoid short-sighted decisions.
The return in (1.1) is defined for a finite-length trajectory. Return can also be defined
for infinitely long trajectories. For example, the trajectory in Figure 1.6 stops after
reaching s9 . Since the policy is well defined for s9 , the process does not have to stop after
the agent reaches s9 . We can design a policy so that the agent stays unchanged after
reaching s9 . Then, the policy would generate the following infinitely long trajectory:
s1 −(a2, r=0)→ s2 −(a3, r=0)→ s5 −(a3, r=0)→ s8 −(a2, r=+1)→ s9 −(a5, r=+1)→ s9 −(a5, r=+1)→ s9 · · ·
return = 0 + 0 + 0 + 1 + 1 + 1 + · · · = ∞,
which unfortunately diverges. Therefore, we must introduce the discounted return con-
cept for infinitely long trajectories. In particular, the discounted return is the sum of the discounted rewards collected along the trajectory:

discounted return = 0 + γ·0 + γ²·0 + γ³·1 + γ⁴·1 + γ⁵·1 + · · · ,        (1.3)

where γ ∈ (0, 1) is called the discount rate. Since γ ∈ (0, 1), the value of (1.3) can be calculated as

discounted return = γ³(1 + γ + γ² + · · · ) = γ³ · 1/(1 − γ).
The introduction of the discount rate is useful for the following reasons. First, it
removes the stop criterion and allows for infinitely long trajectories. Second, the dis-
count rate can be used to adjust the emphasis placed on near- or far-future rewards. In
particular, if γ is close to 0, then the agent places more emphasis on rewards obtained in
the near future. The resulting policy would be short-sighted. If γ is close to 1, then the
agent places more emphasis on rewards obtained in the far future. The resulting policy is far-sighted and is willing to risk obtaining negative rewards in the near future in exchange for greater rewards later. These points
will be demonstrated in Section 3.5.
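The discounted return in (1.3) can be computed directly from its definition, and doing so for different values of γ makes the near-sighted versus far-sighted trade-off visible. The sketch below truncates the infinitely long reward sequence of the trajectory above after a finite horizon; the horizon length is an arbitrary choice for the numerical check.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Rewards of the trajectory s1 -> s2 -> s5 -> s8 -> s9 -> s9 -> ... ,
# truncated to a finite number of steps for the numerical check.
horizon = 200
rewards = [0, 0, 0] + [1] * (horizon - 3)

for gamma in (0.5, 0.9):
    numeric = discounted_return(rewards, gamma)
    closed_form = gamma ** 3 / (1 - gamma)   # gamma^3 * (1 + gamma + ...) for the infinite tail
    print(gamma, round(numeric, 4), round(closed_form, 4))
# A smaller gamma discounts future rewards heavily (short-sighted);
# a gamma close to 1 weighs them almost as much as immediate rewards.
```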
One important notion that was not explicitly mentioned in the above discussion is the
episode. When interacting with the environment by following a policy, the agent may stop
at some terminal states. The resulting trajectory is called an episode (or a trial). If the environment or the policy is stochastic, we may obtain different episodes when starting from the
same state. However, if everything is deterministic, we always obtain the same episode
when starting from the same state.
An episode is usually assumed to be a finite trajectory. Tasks with episodes are called
episodic tasks. However, some tasks may have no terminal states, meaning that the pro-
cess of interacting with the environment will never end. Such tasks are called continuing
tasks. In fact, we can treat episodic and continuing tasks in a unified mathematical
manner by converting episodic tasks to continuing ones. To do that, we need to properly define
the process after the agent reaches the terminal state. Specifically, after reaching the
terminal state in an episodic task, the agent can continue taking actions in the following
two ways.
First, if we treat the terminal state as a special state, we can specifically design its
action space or state transition so that the agent stays in this state forever. Such
states are called absorbing states, meaning that the agent never leaves a state once
reached. For example, for the target state s9 , we can specify A(s9 ) = {a5 } or set
A(s9 ) = {a1 , . . . , a5 } with p(s9 |s9 , ai ) = 1 for all i = 1, . . . , 5.
Second, if we treat the terminal state as a normal state, we can simply set its action space to be the same as that of the other states, and the agent may leave the state and come
back again. Since a positive reward of r = 1 can be obtained every time s9 is reached,
the agent will eventually learn to stay at s9 forever to collect more rewards. Notably,
when an episode is infinitely long and the reward received for staying at s9 is positive,
a discount rate must be used to calculate the discounted return to avoid divergence.
In this book, we consider the second scenario where the target state is treated as a normal
state whose action space is A(s9 ) = {a1 , . . . , a5 }.
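The two ways of handling the terminal state differ only in how the model treats s9. The sketch below writes both variants of s9's transition entries; the mapping of a1, . . . , a4 to movement directions (and hence which neighboring cells s6 and s8 are reached) is an assumption consistent with the transitions described in the text.

```python
# Option 1: treat s9 as an absorbing state -- every action keeps the agent there.
absorbing_s9 = {("s9", a): {"s9": 1.0} for a in ["a1", "a2", "a3", "a4", "a5"]}

# Option 2: treat s9 as a normal state -- the agent may leave and come back.
# (a2 and a3 would attempt to cross the boundary and bounce back to s9,
#  giving r_boundary = -1; the mapping of actions to directions is assumed.)
normal_s9 = {
    ("s9", "a1"): {"s6": 1.0},   # up
    ("s9", "a2"): {"s9": 1.0},   # right: blocked by the boundary
    ("s9", "a3"): {"s9": 1.0},   # down: blocked by the boundary
    ("s9", "a4"): {"s8": 1.0},   # left
    ("s9", "a5"): {"s9": 1.0},   # stay
}

print(absorbing_s9[("s9", "a1")])   # {'s9': 1.0}
print(normal_s9[("s9", "a1")])      # {'s6': 1.0}
```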
1.7 Markov decision processes
This section formalizes the concepts introduced above in the framework of Markov decision processes (MDPs). The key elements of an MDP include the following.

Sets: the state space S, the action space A(s) for each state s, and the reward set R(s, a) for each state-action pair (s, a).

Model: the state transition probabilities p(s′|s, a) and the reward probabilities p(r|s, a) for all state-action pairs (s, a).

Markov property:

p(s_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, . . . , s_0, a_0) = p(s_{t+1} | s_t, a_t),
p(r_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, . . . , s_0, a_0) = p(r_{t+1} | s_t, a_t),        (1.4)

where t represents the current time step and t + 1 represents the next time step.
Equation (1.4) indicates that the next state or reward depends merely on the current
state and action and is independent of the previous ones. The Markov property is
important for deriving the fundamental Bellman equation of MDPs, as shown in the
next chapter.
Here, p(s′|s, a) and p(r|s, a) for all (s, a) are called the model or dynamics. The
model can be either stationary or nonstationary (or in other words, time-invariant or
time-variant). A stationary model does not change over time; a nonstationary model
may vary over time. For instance, in the grid world example, if a forbidden area may pop
up or disappear sometimes, the model is nonstationary. In this book, we only consider
stationary models.
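Putting the pieces together, a stationary model is just the pair p(s′|s, a) and p(r|s, a), fixed over time, and one interaction step samples from both. The sketch below is a minimal model for a single state-action pair, (s8, a2), rather than a full grid-world implementation.

```python
import random

# A stationary model: p(s'|s, a) and p(r|s, a) fixed for all time steps.
transition = {("s8", "a2"): {"s9": 1.0}}
reward = {("s8", "a2"): {+1: 1.0}}        # reaching the target yields r_target = +1

def step(s, a):
    """Sample (s', r) from the model for the pair (s, a)."""
    next_states, p_s = zip(*transition[(s, a)].items())
    rewards, p_r = zip(*reward[(s, a)].items())
    s_next = random.choices(next_states, weights=p_s, k=1)[0]
    r = random.choices(rewards, weights=p_r, k=1)[0]
    return s_next, r

print(step("s8", "a2"))   # ('s9', 1)
```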
One may have heard of Markov processes (MPs). What is the difference
between an MDP and an MP? The answer is that, once the policy in an MDP is fixed,
the MDP degenerates into an MP. For example, the grid world example in Figure 1.7 can
be abstracted as a Markov process. In the literature on stochastic processes, a Markov
process is also called a Markov chain if it is a discrete-time process and the number of
states is finite or countable [1]. In this book, the terms “Markov process” and “Markov
chain” are used interchangeably when the context is clear. Moreover, this book mainly
considers finite MDPs where the numbers of states and actions are finite. This is the
simplest case that should be fully understood.
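The degeneration of an MDP into a Markov process once the policy is fixed can be written out directly: the state-to-state transition probability is the model averaged over the policy, P(s′|s) = Σ_a π(a|s) p(s′|s, a). The sketch below performs this computation at s1 under the stochastic policy of Figure 1.5, assuming the deterministic grid transitions described earlier.

```python
from collections import defaultdict

# Policy pi(a|s) at s1 (the stochastic policy of Figure 1.5).
pi_s1 = {"a2": 0.5, "a3": 0.5}

# Model p(s'|s1, a) for the relevant actions (deterministic transitions).
p_s1 = {"a2": {"s2": 1.0}, "a3": {"s4": 1.0}}

# Induced Markov-process transitions: P(s'|s1) = sum_a pi(a|s1) * p(s'|s1, a).
P_s1 = defaultdict(float)
for a, pa in pi_s1.items():
    for s_next, p in p_s1[a].items():
        P_s1[s_next] += pa * p

print(dict(P_s1))   # {'s2': 0.5, 's4': 0.5}
```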
Figure 1.7: Abstraction of the grid world example as a Markov process. Here, the circles represent states
and the links with arrows represent state transitions.
1.8 Summary
This chapter introduced the basic concepts that will be widely used in the remainder of
the book. We used intuitive grid world examples to demonstrate these concepts and then
formalized them in the framework of MDPs. For more information about MDPs, readers
can see [1, 2].
1.9 Q&A
Q: Can we set all the rewards as negative or positive?
A: In this chapter, we mentioned that a positive reward would encourage the agent
to take an action and that a negative reward would discourage the agent from taking
the action. In fact, it is the relative reward values instead of the absolute values that
determine encouragement or discouragement.
More specifically, we set rboundary = −1, rforbidden = −1, rtarget = +1, and rother = 0 in
this chapter. We can also add a common value to all these values without changing
the resulting optimal policy. For example, we can add −2 to all the rewards to obtain
rboundary = −3, rforbidden = −3, rtarget = −1, and rother = −2. Although the rewards
are all negative, the resulting optimal policy is unchanged. That is because optimal
policies are invariant to affine transformations of the rewards. Details will be given in
Section 3.5.
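A quick numerical check of this invariance: adding a constant c to every reward adds the same amount (c/(1 − γ) in the infinite-horizon case) to the discounted return of every trajectory, so the ordering of policies is preserved. The two reward streams below are hypothetical and truncated to a common finite horizon.

```python
def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

gamma = 0.9
# Two hypothetical (truncated) reward streams produced by two policies.
policy_1 = [0, 0, 0, 1] + [1] * 196
policy_2 = [0, -1, 0, 1] + [1] * 196

for shift in (0, -2):                     # original rewards vs. rewards shifted by -2
    g1 = discounted_return([r + shift for r in policy_1], gamma)
    g2 = discounted_return([r + shift for r in policy_2], gamma)
    print(shift, round(g1, 3), round(g2, 3), "policy 1 better:", g1 > g2)
# Both returns drop by the same amount (about 2 / (1 - gamma) for long horizons),
# so policy 1 remains better after the shift.
```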
Q: Is the reward a function of the next state?
A: We mentioned that the reward r depends only on s and a but not on the next state s′. However, this may be counterintuitive since it is the next state that determines the reward in many cases. For example, the reward is positive when the next state is the target state. As a result, a question that naturally follows is whether a reward should depend on the next state. A mathematical rephrasing of this question is whether we should use p(r|s, a, s′), where s′ is the next state, rather than p(r|s, a). The answer is that r depends on s, a, and s′. However, since s′ also depends on s and a, we can equivalently write r as a function of s and a: p(r|s, a) = Σ_{s′} p(r|s, a, s′) p(s′|s, a). In this way, the Bellman equation can be easily established as shown in Chapter 2.
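As a small numerical illustration of the identity p(r|s, a) = Σ_{s′} p(r|s, a, s′) p(s′|s, a), the sketch below reuses the hypothetical windy transition at (s1, a2): the reward depends on where the agent actually lands, and marginalizing over the next state recovers p(r|s, a). The 0.8/0.2 split and the reward assignment (treating s5 as if it were forbidden) are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical windy transition at (s1, a2): p(s'|s1, a2).
p_next = {"s2": 0.8, "s5": 0.2}

# Hypothetical reward model conditioned on the landing cell, p(r|s1, a2, s'):
# landing in s2 yields r = 0; landing in s5 is treated as forbidden here
# purely for illustration and yields r = -1.
p_r_given_next = {"s2": {0: 1.0}, "s5": {-1: 1.0}}

# Marginalize over s' to obtain p(r|s1, a2).
p_r = defaultdict(float)
for s_next, p_s in p_next.items():
    for r, p in p_r_given_next[s_next].items():
        p_r[r] += p * p_s

print(dict(p_r))   # {0: 0.8, -1: 0.2}
```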