Probability and Statistics for Data Science
Carlos Fernandez-Granda
These notes were developed for the course Probability and Statistics for Data Science at the
Center for Data Science in NYU. The goal is to provide an overview of fundamental concepts
in probability and statistics from first principles. I would like to thank Levent Sagun and Vlad
Kobzar, who were teaching assistants for the course, as well as Brett Bernstein and David
Rosenberg for their useful suggestions. I am also very grateful to all my students for their
feedback.
While writing these notes, I was supported by the National Science Foundation under NSF
award DMS-1616340.
Contents

2 Random Variables
  2.1 Definition
  2.2 Discrete random variables
  2.3 Continuous random variables
  2.4 Conditioning on an event
  2.5 Functions of random variables
  2.6 Generating random variables
  2.7 Proofs

4 Expectation
  4.1 Expectation operator
  4.2 Mean and variance
  4.3 Covariance
  4.4 Conditional expectation
  4.5 Proofs

5 Random Processes
  5.1 Definition
  5.2 Mean and autocovariance functions
  5.3 Independent identically-distributed sequences
  5.4 Gaussian process
  5.5 Poisson process
  5.6 Random walk
Chapter 1

Basic Probability Theory
In this chapter we introduce the mathematical framework of probability theory, which makes it
possible to reason about uncertainty in a principled way using set theory. Appendix A contains
a review of basic set-theory concepts.
1.1 Probability spaces

Definition 1.1.1 (Probability space). A probability space is a triple (Ω, F, P) consisting of:

• A sample space Ω, which contains all possible outcomes of the experiment.

• A set of events F, which must be a σ-algebra (see Definition 1.1.2 below).

• A probability measure P that assigns probabilities to the events in F (see Definition 1.1.4 below).
Sample spaces may be discrete or continuous. Examples of discrete sample spaces include the
possible outcomes of a coin toss, the score of a basketball game, the number of people that show
up at a party, etc. Continuous sample spaces are usually intervals of R or Rn used to model
time, position, temperature, etc.
The term σ-algebra is used in measure theory to denote a collection of sets that satisfy certain
conditions listed below. Don’t be too intimidated by it. It is just a sophisticated way of stating
that if we assign a probability to certain events (for example it will rain tomorrow or it will snow tomorrow) we also need to assign a probability to their complements (i.e. it will not rain tomorrow or it will not snow tomorrow) and to their union (it will rain or snow tomorrow).
Definition 1.1.2 (σ-algebra). A σ-algebra F is a collection of sets in Ω such that:
1. If a set S ∈ F, then S^c ∈ F.

2. If the sets S₁, S₂ ∈ F, then S₁ ∪ S₂ ∈ F. This also holds for infinite sequences: if S₁, S₂, . . . ∈ F, then ∪_{i=1}^∞ S_i ∈ F.
3. Ω ∈ F.
If our sample space is discrete, a possible choice for the σ-algebra is the power set of the sample
space, which consists of all possible sets of elements in the sample space. If we are tossing a coin
and the sample space is
Ω := {heads, tails} , (1.1)
then the power set is a valid σ-algebra
F := {heads or tails, heads, tails, ∅} , (1.2)
where ∅ denotes the empty set. However, in many cases σ-algebras do not contain every possible
set of outcomes.
Example 1.1.3 (Cholesterol). A doctor is interested in modeling the cholesterol levels of her
patients probabilistically. Every time a patient visits her, she tests their cholesterol level. Here
the experiment is the cholesterol test, the outcome is the measured cholesterol level, and the
sample space Ω is the positive real line. The doctor is mainly interested in whether her patients have low, borderline-high, or high cholesterol. The event L (low cholesterol) contains all
outcomes below 200 mg/dL, the event B (borderline-high cholesterol) contains all outcomes
between 200 and 240 mg/dL, and the event H (high cholesterol) contains all outcomes above
240 mg/dL. The σ-algebra F of possible events therefore equals
F := {L ∪ B ∪ H, L ∪ B, L ∪ H, B ∪ H, L, B, H, ∅} . (1.3)
The events are a partition of the sample space, which simplifies deriving the corresponding
σ-algebra. △
The role of the probability measure P is to quantify how likely we are to encounter each of the
events in the σ-algebra. Intuitively, the probability of an event A can be interpreted as the
fraction of times that the outcome of the experiment is in A, as the number of repetitions tends
to infinity. It follows that probabilities should always be nonnegative. Also, if two events A and
B are disjoint (their intersection is empty), then
P (A ∪ B) = (outcomes in A or B) / total                           (1.4)
          = (outcomes in A + outcomes in B) / total                (1.5)
          = outcomes in A / total + outcomes in B / total          (1.6)
          = P (A) + P (B) .                                        (1.7)
Probabilities of unions of disjoint events should equal the sum of the individual probabilities.
Additionally, the probability of the whole sample space Ω should equal one, as it contains all
outcomes
P (Ω) = outcomes in Ω / total    (1.8)
      = total / total            (1.9)
      = 1.                       (1.10)
Definition 1.1.4 (Probability measure). A probability measure is a function defined over the sets in a σ-algebra F such that:

1. P (S) ≥ 0 for any event S ∈ F.

2. If the sets S₁, S₂, . . . ∈ F are disjoint, then P (∪_i S_i) = Σ_i P (S_i).

3. P (Ω) = 1.
The two first axioms capture the intuitive idea that the probability of an event is a measure
such as mass (or length or volume): just like the mass of any object is nonnegative and the
total mass of several distinct objects is the sum of their masses, the probability of any event is nonnegative and the probability of the union of several disjoint events is the sum of their probabilities. However, in contrast to mass, the amount of probability in an experiment cannot be unbounded. If it is highly likely that it will rain tomorrow, then it cannot also be very likely that it will not rain.
its complement S c must be small. This is captured by the third axiom, which normalizes the
probability measure (and implies that P (S c ) = 1 − P (S)).
It is important to stress that the probability measure does not assign probabilities to individual
outcomes, but rather to events in the σ-algebra. The reason for this is that when the number
of possible outcomes is uncountably infinite, then one cannot assign nonzero probability to all
the outcomes and still satisfy the condition P (Ω) = 1. This is not an exotic situation; it occurs
for instance in the cholesterol example where any positive real number is a possible outcome.
In the case of discrete or countable sample spaces, the σ-algebra may equal the power set of the
sample space, which means that we do assign probabilities to events that only contain a single
outcome (e.g. the coin-toss example).
Example 1.1.5 (Cholesterol (continued)). A valid probability measure for Example 1.1.3 is

P (L) = 0.12,    (1.11)
P (B) = 0.6,     (1.12)
P (H) = 0.28,    (1.13)

which determines the probability of every other event in F by additivity. Using the properties of probability measures, we can determine for instance that P (B ∪ H) = 0.6 + 0.28 = 0.88. △

The axioms of Definition 1.1.4 have several immediate consequences; for any events A, B ∈ F,

P (∅) = 0,                                   (1.14)
A ⊆ B implies P (A) ≤ P (B),                 (1.15)
P (A ∪ B) = P (A) + P (B) − P (A ∩ B).       (1.16)
1.2 Conditional probability

The conditional probability of an event S′ given another event S quantifies how likely S′ is once we know that S has occurred. It is defined as

P (S′|S) := P (S′ ∩ S) / P (S) ,

where we assume that P (S) ≠ 0 (later on we will have to deal with the case when S has zero probability, which often occurs in continuous probability spaces). The definition is rather intuitive: S is now the new sample space, so if the outcome is in S′ then it must belong to S′ ∩ S. However, just using the probability of the intersection would underestimate how likely it is for S′ to occur because the sample space has been reduced to S. Therefore we normalize by the probability of S. As a sanity check, we have P (S|S) = 1 and if S and S′ are disjoint then P (S′|S) = 0.

The conditional probability P (·|S) is a valid probability measure in the probability space (S, F_S, P (·|S)), where F_S is a σ-algebra that contains the intersection of S and the sets in F. To simplify notation, when we condition on an intersection of sets we write the conditional probability as

P (·|S₁, S₂, . . . , Sₙ) := P (·|S₁ ∩ S₂ ∩ · · · ∩ Sₙ) .
Example 1.2.1 (Flights and rain). JFK airport hires you to estimate how the punctuality of
flight arrivals is affected by the weather. You begin by defining a probability space for which
the sample space is
Ω = {late and rain, late and no rain, on time and rain, on time and no rain} (1.21)
and the σ-algebra is the power set of Ω. From data of past flights you determine that a reasonable
estimate for the probability measure of the probability space is
P (late, no rain) = 2/20 ,    P (on time, no rain) = 14/20 ,    (1.22)
P (late, rain) = 3/20 ,       P (on time, rain) = 1/20 .        (1.23)
The airport is interested in the probability of a flight being late if it rains, so you define a new
probability space conditioning on the event rain. The sample space is the set of all outcomes
such that rain occurred, the σ-algebra is the power set of {on time, late} and the probability
measure is P (·|rain). In particular,
P (late|rain) = P (late, rain) / P (rain) = (3/20) / (3/20 + 1/20) = 3/4    (1.24)

and similarly P (late|no rain) = 1/8. △
Conditional probabilities can be used to compute the intersection of several events in a structured way. By definition, we can express the probability of the intersection of two events A, B ∈ F as follows,

P (A ∩ B) = P (A) P (B|A)    (1.25)
          = P (B) P (A|B) .
In this formula P (A) is known as the prior probability of A, as it captures the information we
have about A before anything else is revealed. Analogously, P (A|B) is known as the posterior
probability. These are fundamental quantities in Bayesian models, discussed in Chapter 10.
Generalizing (1.25) to a sequence of events gives the chain rule, which allows us to express the probability of the intersection of multiple events in terms of conditional probabilities. We omit the proof, which is a straightforward application of induction.

Theorem 1.2.2 (Chain rule). Let (Ω, F, P) be a probability space and S₁, S₂, . . . a collection of events in F. Then

P (∩_i S_i) = P (S₁) P (S₂|S₁) P (S₃|S₁ ∩ S₂) · · · = ∏_i P (S_i | S₁ ∩ · · · ∩ S_{i−1}) .
Sometimes, estimating the probability of a certain event directly may be more challenging than
estimating its probability conditioned on simpler events. A collection of disjoint sets A1 , A2 , . . .
such that Ω = ∪i Ai is called a partition of Ω. The law of total probability allows us to pool
conditional probabilities together, weighting them by the probability of the individual events in
the partition, to compute the probability of the event of interest.
Theorem 1.2.3 (Law of total probability). Let (Ω, F, P) be a probability space and let the
collection of disjoint sets A1 , A2 , . . . ∈ F be any partition of Ω. For any set S ∈ F
P (S) = Σ_i P (S ∩ A_i)           (1.29)
      = Σ_i P (A_i) P (S|A_i) .   (1.30)
Proof. This is an immediate consequence of the chain rule and Axiom 2 in Definition 1.1.4, since
S = ∪i S ∩ Ai and the sets S ∩ Ai are disjoint.
Example 1.2.4 (Aunt visit). Your aunt is arriving at JFK tomorrow and you would like to
know how likely it is for her flight to be on time. From Example 1.2.1, you recall that

P (late|rain) = 3/4 ,    P (late|no rain) = 1/8 .

After checking out a weather website, you determine that P (rain) = 0.2.

Now, how can we integrate all of this information? The events rain and no rain are disjoint and cover the whole sample space, so they form a partition. We can consequently apply the law of total probability to determine

P (late) = P (late|rain) P (rain) + P (late|no rain) P (no rain)
         = 0.75 · 0.2 + 0.125 · 0.8 = 0.25,

so the flight is on time with probability 0.75. △
It is crucial to realize that in general P (A|B) 6= P (B|A): most players in the NBA probably
own a basketball (P (owns ball|NBA) is large) but most people that own basketballs are not in
the NBA (P (NBA|owns ball) is small). The reason is that the prior probabilities are very different: P (NBA) is much smaller than P (owns ball). However, it is possible to invert conditional
probabilities, i.e. find P (A|B) from P (B|A), as long as we take into account the priors. This
straightforward consequence of the definition of conditional probability is known as Bayes’ rule.
Theorem 1.2.5 (Bayes’ rule). For any events A and B in a probability space (Ω, F, P)
P (A|B) = P (A) P (B|A) / P (B) ,    (1.34)

as long as P (B) > 0.
Example 1.2.6 (Aunt visit (continued)). You explain the probabilistic model described in
Example 1.2.4 to your cousin Marvin who lives in California. A day later, you tell him that your
aunt arrived late but you don’t mention whether it rained or not. After he hangs up, Marvin
wants to figure out the probability that it rained. Recall that the probability of rain was 0.2,
but since your aunt arrived late he should update the estimate. Applying Bayes' rule and the law of total probability,

P (rain|late) = P (rain) P (late|rain) / P (late)
              = (0.2 · 0.75) / 0.25 = 0.6. △
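To make the computation concrete, here is a short Python sketch (ours, not part of the original notes; it uses only plain arithmetic and illustrative variable names) that reproduces the law of total probability and Bayes' rule calculations from Examples 1.2.4 and 1.2.6:

    # Estimates from Examples 1.2.1 and 1.2.4.
    p_rain = 0.2
    p_late_given_rain = 0.75       # 3/4
    p_late_given_no_rain = 0.125   # 1/8

    # Law of total probability: P(late).
    p_late = (p_late_given_rain * p_rain
              + p_late_given_no_rain * (1 - p_rain))

    # Bayes' rule: P(rain | late).
    p_rain_given_late = p_rain * p_late_given_rain / p_late

    print(p_late)             # 0.25
    print(p_rain_given_late)  # 0.6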
1.3 Independence
As discussed in the previous section, conditional probabilities quantify the extent to which the
knowledge of the occurrence of a certain event affects the probability of another event. In some
cases, it makes no difference: the events are independent. More formally, events A and B are independent if and only if

P (A|B) = P (A) .
This definition is not valid if P (B) = 0. The following definition covers this case and is otherwise
equivalent.
Definition 1.3.1 (Independence). Let (Ω, F, P) be a probability space. Two events A, B ∈ F are independent if and only if

P (A ∩ B) = P (A) P (B) .
Example 1.3.2 (Congress). We consider a data set compiling the votes of members of the
U.S. House of Representatives on two issues in 1984.¹ The issues are cost sharing for a water
project (issue 1) and adoption of the budget resolution (issue 2). We model the behavior of
the congressmen probabilistically, defining a sample space where each outcome is a sequence of
votes. For instance, a possible outcome is issue 1 = yes, issue 2 = no. We choose the σ-algebra
to be the power set of the sample space. To estimate the probability measure associated to
different events, we just compute the fraction of their occurrence in the data.
P (issue 1 = yes) ≈ (members voting yes on issue 1) / (total votes on issue 1)    (1.40)
                  = 0.597,                                                        (1.41)
P (issue 2 = yes) ≈ (members voting yes on issue 2) / (total votes on issue 2)    (1.42)
                  = 0.417,                                                        (1.43)
P (issue 1 = yes ∩ issue 2 = yes) ≈ (members voting yes on issues 1 and 2) / (total members voting on issues 1 and 2)    (1.44)
                                  = 0.069.                                        (1.45)
¹The data is available here.
Based on these data, we can evaluate whether voting behavior on the two issues was dependent.
In other words, if we know how a member voted on issue 1, does this provide information about
how they voted on issue 2? The answer is yes, since

P (issue 1 = yes) P (issue 2 = yes) = 0.597 · 0.417 ≈ 0.249

is very different from P (issue 1 = yes ∩ issue 2 = yes). If a member voted yes on issue 1, they
were less likely to vote yes on issue 2. △
Similarly, we can define conditional independence between two events given a third event.
A and B are conditionally independent given C if and only if

P (A|C) = P (A|B, C) ,

where P (A|B, C) := P (A|B ∩ C). Intuitively, this means that the probability of A is not affected
by whether B occurs or not, as long as C occurs.
Definition 1.3.3 (Conditional independence). Let (Ω, F, P) be a probability space. Two events
A, B ∈ F are conditionally independent given a third event C ∈ F if and only if

P (A ∩ B|C) = P (A|C) P (B|C) .
Example 1.3.4 (Congress (continued)). The main factor that determines how members of
congress vote is political affiliation. We therefore incorporate it into the probabilistic model in
Example 1.3.2. Each outcome now consists of the votes for issues 1 and 2, and also the affiliation
of the member, e.g. issue 1 = yes, issue 2 = no, affiliation = republican, or issue 1 = no, issue
2 = no, affiliation = democrat. The σ-algebra is the power set of the sample space. We again
estimate the values of the probability measure associated to different events using the data:
P (issue 1 = yes | republican) ≈ (republicans voting yes on issue 1) / (total republican votes on issue 1)    (1.49)
                               = 0.134,                                                                       (1.50)
P (issue 2 = yes | republican) ≈ (republicans voting yes on issue 2) / (total republican votes on issue 2)    (1.51)
                               = 0.988,                                                                       (1.52)
P (issue 1 = yes ∩ issue 2 = yes | republican) ≈ (republicans voting yes on issues 1 and 2) / (republicans voting on both issues)
                                               = 0.134.                                                       (1.53)
Based on these data, we can evaluate whether voting behavior on the two issues was dependent
conditioned on the member being a republican. In other words, if we know how a member voted
on issue 1 and that they are a republican, does this provide information about how they voted
on issue 2? The answer is no, since

P (issue 1 = yes | republican) P (issue 2 = yes | republican) = 0.134 · 0.988 ≈ 0.132

is very close to P (issue 1 = yes ∩ issue 2 = yes | republican). The votes are approximately independent given the knowledge that the member is a republican. △
As suggested by Examples 1.3.2 and 1.3.4, independence does not imply conditional indepen-
dence or vice versa. This is further illustrated by the following examples. From now on, to
simplify notation, we write the probability of the intersection of several events in the following
form
P (A, B, C) := P (A ∩ B ∩ C) . (1.55)
Example 1.3.5 (Conditional independence does not imply independence). Your cousin Marvin from Example 1.2.6 always complains about taxis in New York. From his many visits to JFK he has calculated that

P (taxi|rain) = 0.1,    P (taxi|no rain) = 0.6,

where taxi denotes the event of finding a free taxi after picking up your luggage. Given the events rain and no rain, it is reasonable to model the events plane arrived late and taxi as conditionally independent,

P (taxi|late, rain) = P (taxi|rain) ,    P (taxi|late, no rain) = P (taxi|no rain) .
The logic behind this is that the availability of taxis after picking up your luggage depends
on whether it’s raining or not, but not on whether the plane is late or not (we assume that
availability is constant throughout the day). Does this assumption imply that the events are
independent?
If they were independent, then knowing that your aunt was late would give no information to
Marvin about taxi availability. However,
P (taxi) = P (taxi, rain) + P (taxi, no rain)                      (by the law of total probability)    (1.59)
         = P (taxi|rain) P (rain) + P (taxi|no rain) P (no rain)   (1.60)
         = 0.1 · 0.2 + 0.6 · 0.8 = 0.5,                            (1.61)

P (taxi|late) = ( P (taxi, late, rain) + P (taxi, late, no rain) ) / P (late)    (by the law of total probability)
              = ( P (taxi|rain) P (late|rain) P (rain) + P (taxi|no rain) P (late|no rain) P (no rain) ) / P (late)
              = ( 0.1 · 0.75 · 0.2 + 0.6 · 0.125 · 0.8 ) / 0.25 = 0.3.           (1.62)
P (taxi) 6= P (taxi|late) so the events are not independent. This makes sense, since if the airplane
is late, it is more probable that it is raining, which makes taxis more difficult to find.
△
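As a numerical sanity check, the following Python sketch (ours; plain arithmetic, no libraries) reproduces the computation, using the conditional independence assumption to factor the joint probabilities:

    # Parameters from Examples 1.2.1, 1.2.4 and 1.3.5.
    p_rain = 0.2
    p_taxi_given_rain, p_taxi_given_no_rain = 0.1, 0.6
    p_late_given_rain, p_late_given_no_rain = 0.75, 0.125

    # Law of total probability.
    p_taxi = p_taxi_given_rain * p_rain + p_taxi_given_no_rain * (1 - p_rain)
    p_late = p_late_given_rain * p_rain + p_late_given_no_rain * (1 - p_rain)

    # P(taxi, late, weather) factors by conditional independence given the weather.
    p_taxi_given_late = (p_taxi_given_rain * p_late_given_rain * p_rain
                         + p_taxi_given_no_rain * p_late_given_no_rain
                         * (1 - p_rain)) / p_late

    print(p_taxi, p_taxi_given_late)  # 0.5 vs 0.3: not independent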
Example 1.3.6 (Independence does not imply conditional independence). After looking at your
probabilistic model from Example 1.2.1 your contact at JFK points out that delays are often
caused by mechanical problems in the airplanes. You look at the data and determine that
so the events mechanical problem and rain in NYC are independent, which makes intuitive
sense. After some more analysis of the data, you estimate
The next time you are waiting for Marvin at JFK, you start wondering about the probability of
his plane having had some mechanical problem. Without any further information, this proba-
bility is 0.1. It is a sunny day in New York, but this is of no help because according to the data
(and common sense) the events problem and rain are independent.
Suddenly they announce that Marvin’s plane is late. Now, what is the probability that his
plane had a mechanical problem? At first thought you might apply Bayes’ rule to compute
P (problem|late) = 0.28 as in Example 1.2.6. However, you are not using the fact that it is
sunny. This means that the rain was not responsible for the delay, so intuitively a mechanical
problem should be more likely. Indeed,
Chapter 2

Random Variables
Random variables are a fundamental tool in probabilistic modeling. They allow us to model
numerical quantities that are uncertain: the temperature in New York tomorrow, the time of
arrival of a flight, the position of a satellite... Reasoning about such quantities probabilistically
allows us to structure the information we have about them in a principled way.
2.1 Definition
Formally, we define a random variable as a function mapping each outcome in a probability
space to a real number.
Definition 2.1.1 (Random variable). Given a probability space (Ω, F, P), a random variable
X is a function from the sample space Ω to the real numbers R. Once the outcome ω ∈ Ω of
the experiment is revealed, the corresponding X (ω) is known as a realization of the random
variable.
Remark 2.1.2 (Rigorous definition). If we want to be completely rigorous, Definition 2.1.1 is
missing some details. Consider two sample spaces Ω1 and Ω2 , and a σ-algebra F2 of sets in Ω2 .
Then, for X to be a random variable, there must exist a σ-algebra F1 in Ω1 such that for any
set S in F2 the inverse image of S, defined by
belongs to F1 . Usually, we take Ω2 to be the reals R and F2 to be the Borel σ-algebra, which is
defined as the smallest σ-algebra defined on the reals that contains all open intervals (amazingly,
it is possible to construct sets of real numbers that do not belong to this σ-algebra). In any case,
for the purpose of these notes, Definition 2.1.1 is sufficient (more information about the formal
foundations of probability can be found in any book on measure theory and advanced probability
theory).
Remark 2.1.3 (Notation). We often denote events of the form

{X (ω) ∈ S : ω ∈ Ω}    (2.2)

by

{X ∈ S}    (2.3)
to alleviate notation, since the underlying probability space is often of no significance once we
have specified the random variables of interest.
A random variable quantifies our uncertainty about the quantity it represents, not the value
that it happens to finally take once the outcome is revealed. You should never think of a random
variable as having a fixed numerical value. If the outcome is known, then that determines a
realization of the random variable. In order to stress the difference between random variables
and their realizations, we denote the former with uppercase letters (X, Y , . . . ) and the latter
with lowercase letters (x, y, . . . ).
If we have access to the probability space (Ω, F, P) in which the random variable is defined, then it is straightforward to compute the probability of a random variable X belonging to a certain set S: it is the probability of the event that comprises all outcomes in Ω which X maps to S,

P (X ∈ S) = P ({ω | X (ω) ∈ S}) .
However, we almost never model the probability space directly, since this requires estimating the
probability of every possible event in the corresponding σ-algebra. As we explain in Sections 2.2
and 2.3, there are more practical methods to specify random variables, which automatically
imply that a valid underlying probability space exists. The existence of this probability space
ensures that the whole framework is mathematically sound, but you don’t really have to worry
about it.
2.2 Discrete random variables

Discrete random variables take values on a finite or countably infinite subset of R, such as the integers. They may be specified through their probability mass function.

Definition 2.2.1 (Probability mass function). Let (Ω, F, P) be a probability space and X : Ω → Z a random variable. The probability mass function (pmf) of X is defined as

pX (x) := P (X = x) ,    (2.5)
Figure 2.1: Probability mass function of the random variable X in Example 2.2.2.
which satisfies

pX (x) ≥ 0 for any x,      (2.6)
Σ_{x ∈ D} pX (x) = 1,      (2.7)

where D is the set of values that X can take.
The converse is also true, if a function defined on a countable subset D of the reals is nonnegative
and adds up to one, then it may be interpreted as the pmf of a random variable. In fact, in
practice we usually define discrete random variables by just specifying their pmf.
To compute the probability that a random variable X is in a certain set S we take the sum of
the pmf over all the values contained in S:
P (X ∈ S) = Σ_{x ∈ S} pX (x) .    (2.8)
Example 2.2.2 (Discrete random variable). Figure 2.1 shows the probability mass function of
a discrete random variable X (check that it adds up to one). To compute the probability of X belonging to different sets we apply (2.8), summing the pmf over the values contained in each set. △
Bernoulli
Bernoulli random variables are used to model experiments that have two possible outcomes.
By convention we usually represent one outcome by 0 and the other outcome by 1. A canonical
example is flipping a biased coin, such that the probability of obtaining heads is p. If we encode
heads as 1 and tails as 0, then the result of the coin flip corresponds to a Bernoulli random
variable with parameter p.
Definition 2.2.3 (Bernoulli). The pmf of a Bernoulli random variable with parameter p ∈ [0, 1]
is given by
pX (0) = 1 − p, (2.11)
pX (1) = p. (2.12)
A special kind of Bernoulli random variable is the indicator random variable of an event. This
random variable is particularly useful in proofs.
Definition 2.2.4 (Indicator). Let (Ω, F, P) be a probability space. The indicator random vari-
able of an event S ∈ F is defined as
1_S (ω) = { 1 if ω ∈ S, 0 otherwise }.    (2.13)
By definition the distribution of an indicator random variable is Bernoulli with parameter P (S).
Geometric
Imagine that we take a biased coin and flip it until we obtain heads. If the probability of
obtaining heads is p and the flips are independent then the probability of having to flip k times
is
P (k flips) = P (1st flip = tails, . . . , (k−1)th flip = tails, kth flip = heads)    (2.14)
            = P (1st flip = tails) · · · P ((k−1)th flip = tails) P (kth flip = heads)    (2.15)
            = (1 − p)^{k−1} p.    (2.16)
This reasoning can be applied to any situation in which a random experiment with a fixed prob-
ability p is repeated until a particular outcome occurs, as long as the independence assumption
is met. In such cases the number of repetitions is modeled as a geometric random variable.
Definition 2.2.5 (Geometric). The pmf of a geometric random variable with parameter p is given by

pX (k) = (1 − p)^{k−1} p,    k = 1, 2, . . .
Figure 2.2 shows the probability mass function of geometric random variables with different
parameters. The larger p is, the more the distribution concentrates around smaller values of k.
Figure 2.2: Probability mass function of three geometric random variables with different parameters.
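To make the definition concrete, here is a small Python simulation (a sketch of ours, assuming NumPy is installed; the parameter p = 0.3 is arbitrary) that generates geometric samples by repeated coin flips and compares the empirical frequencies with the pmf:

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.3  # probability of heads on each flip

    def flips_until_heads():
        k = 1
        while rng.random() >= p:
            k += 1
        return k

    samples = np.array([flips_until_heads() for _ in range(100_000)])
    for k in range(1, 6):
        empirical = np.mean(samples == k)
        exact = (1 - p) ** (k - 1) * p
        print(k, round(empirical, 4), round(exact, 4))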
Binomial
Binomial random variables are extremely useful in probabilistic modeling. They are used to
model the number of positive outcomes of n trials modeled as independent Bernoulli random
variables with the same parameter. The following example illustrates this with coin flips.
Example 2.2.6 (Coin flips). If we flip a biased coin n times, what is the probability that we
obtain exactly k heads if the flips are independent and the probability of heads is p?
Let us first consider a simpler problem: what is the probability of first obtaining k heads and then n − k tails? By independence, the answer is

P (k heads, then n − k tails) = p^k (1 − p)^{n−k} .    (2.19)
Note that the same reasoning implies that this is also the probability of obtaining exactly k heads in any fixed order. The probability of obtaining exactly k heads is the probability of the union of all of these events. Because these events are disjoint (we cannot obtain exactly k heads in two different orders simultaneously) we can add their individual probabilities to compute the probability of our event of interest. We just need to know the number of possible orderings. By basic combinatorics, this is given by the binomial coefficient (n choose k), defined as

(n choose k) := n! / ( k! (n − k)! ) .    (2.20)
We conclude that
P (k heads out of n flips) = (n choose k) p^k (1 − p)^{n−k} .    (2.21)
The random variable representing the number of heads in the example is called a binomial
random variable.
Figure 2.3: Probability mass function of three binomial random variables with different values of p and
n = 20.
Figure 2.4: Probability mass function of three Poisson random variables with parameters λ = 10, λ = 20 and λ = 30.
Definition 2.2.7 (Binomial). The pmf of a binomial random variable with parameters n and p
is given by
pX (k) = (n choose k) p^k (1 − p)^{n−k} ,    k = 0, 1, 2, . . . , n.    (2.22)
Figure 2.3 shows the probability mass function of binomial random variables with different values
of p.
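As a sanity check (a sketch of ours, assuming SciPy is installed; parameter values arbitrary), the binomial pmf of Definition 2.2.7 can be evaluated both directly from the formula and with scipy.stats:

    from math import comb
    from scipy.stats import binom

    n, p = 20, 0.5
    for k in (0, 5, 10):
        direct = comb(n, k) * p**k * (1 - p) ** (n - k)
        print(k, direct, binom.pmf(k, n, p))  # the two values agree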
Poisson
We motivate the definition of the Poisson random variable using an example.
Example 2.2.8 (Call center). A call center wants to model the number of calls they receive
over a day in order to decide how many people to hire. They make the following assumptions:

1. Each call occurs independently from every other call.

2. A given call has the same probability of occurring at any given time of the day.
Lemma 2.2.9.

lim_{n→∞} ( n! / ( (n − k)! (n − λ)^k ) ) (1 − λ/n)^n = e^{−λ} .    (2.28)
Random variables with the pmf that we have derived in the example are called Poisson random
variables. They are used to model situations where something happens from time to time at a
constant rate: packets arriving at an Internet router, earthquakes, traffic accidents, etc. The
number of such events that occur over a fixed interval follows a Poisson distribution, as long as
the assumptions we listed in the example hold.
Definition 2.2.10 (Poisson). The pmf of a Poisson random variable with parameter λ is given
by
pX (k) = λ^k e^{−λ} / k! ,    k = 0, 1, 2, . . .    (2.29)
Figure 2.5: Convergence of the binomial pmf with p = λ/n to a Poisson pmf of parameter λ as n grows. The panels show binomial pmfs with n = 40, 80 and 400 (p = 20/n) together with the Poisson pmf with λ = 20.
Figure 2.4 shows the probability mass function of Poisson random variables with different values
of λ. In Example 2.2.8 we prove that as n → ∞ the pmf of a binomial random variable with
parameters n and λ/n tends to the pmf of a Poisson with parameter λ (as we will see later in
the course, this is an example of convergence in distribution). Figure 2.5 shows an example of
this phenomenon numerically; the convergence is quite fast.
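The convergence can also be checked numerically; the following sketch (ours, assuming SciPy is installed) measures the largest pointwise gap between the two pmfs for the parameters used in Figure 2.5:

    from scipy.stats import binom, poisson

    lam = 20
    for n in (40, 80, 400):
        gap = max(abs(binom.pmf(k, n, lam / n) - poisson.pmf(k, lam))
                  for k in range(41))
        print(n, gap)  # the gap shrinks as n grows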
You might feel a bit skeptical about Example 2.2.8: the probability of receiving a call surely
changes over the day and it must be different on weekends! That is true, but the model is
actually very useful if we restrict our attention to shorter periods of time. In Figure 2.6 we show
the result of modeling the number of calls received by a call center in Israel2 over an interval of
four hours (8 pm to midnight) using a Poisson random variable. We plot the histogram of the
number of calls received during that interval for two months (September and October of 1999)
together with a Poisson pmf fitted to the data (we will learn how to fit distributions to data
later on in the course). Despite the fact that our assumptions do not hold exactly, the model produces a reasonably good fit.

²The data is available here.
Figure 2.6: In blue, we see the histogram of the number of calls received during an interval of four
hours over two months at a call center in Israel. A Poisson pmf approximating the distribution of the
data is plotted in orange.
2.3 Continuous random variables

Continuous random variables take values over continuous subsets of R. They can be specified through their cumulative distribution function.

Definition 2.3.1 (Cumulative distribution function). Let (Ω, F, P) be a probability space and X : Ω → R a random variable. The cumulative distribution function (cdf) of X is defined as
FX (x) := P (X ≤ x) . (2.30)
Note that the cumulative distribution function can be defined for both continuous and discrete
random variables.
The following lemma describes some basic properties of the cdf. You can find the proof in
Section 2.7.2.
Lemma 2.3.2 (Properties of the cdf). For any random variable X with cdf FX,

lim_{x→−∞} FX (x) = 0,    (2.31)
lim_{x→∞} FX (x) = 1,     (2.32)
FX (b) ≥ FX (a) if b > a, i.e. FX is nondecreasing.    (2.33)
To see why the cdf completely determines a random variable recall that we are only considering
sets that can be expressed as unions of intervals. The probability of a random variable X
belonging to an interval (a, b] is given by
P (a < X ≤ b) = P (X ≤ b) − P (X ≤ a) (2.34)
= FX (b) − FX (a) . (2.35)
Remark 2.3.3. Since individual points have zero probability, for any continuous random variable X

P (a < X ≤ b) = P (a ≤ X ≤ b) = P (a ≤ X < b) = P (a < X < b) .    (2.36)

Now, to find the probability of X belonging to any particular set, we only need to decompose it into disjoint intervals and apply (2.35), as illustrated by the following example.
Example 2.3.4 (Continuous random variable). Consider a continuous random variable X with
a cdf given by
FX (x) := 0                       for x < 0,
          0.5 x                   for 0 ≤ x ≤ 1,
          0.5                     for 1 ≤ x ≤ 2,      (2.37)
          0.5 (1 + (x − 2)²)      for 2 ≤ x ≤ 3,
          1                       for x > 3.
Figure 2.7: Cumulative distribution function of the random variable in Examples 2.3.4 and 2.3.7.
Figure 2.7 shows the cdf. You can check that it satisfies the properties in Lemma 2.3.2. To determine the probability that X is between 0.5 and 2.5, we apply (2.35),

P (0.5 < X ≤ 2.5) = FX (2.5) − FX (0.5) = 0.625 − 0.25 = 0.375.
Definition 2.3.5 (Probability density function). Let X : Ω → R be a random variable with cdf
FX . If FX is differentiable then the probability density function or pdf of X is defined as
fX (x) := dFX (x) / dx .    (2.39)
Our sets of interest belong to the Borel σ-algebra, and hence can be decomposed into unions of intervals, so we can obtain the probability of X belonging to any such set S by integrating its pdf over S,

P (X ∈ S) = ∫_S fX (x) dx.    (2.42)
Figure 2.8: Probability density function of the random variable in Examples 2.3.4 and 2.3.7.
Finally, just as in the case of discrete random variables, we often say that a random variable is
distributed according to a certain pdf or cdf, or that we know its distribution. The reason is
that the pmf, pdf or cdf suffice to characterize the underlying probability space.
Example 2.3.7 (Continuous random variable (continued)). To compute the pdf of the random
variable in Example 2.3.4 we differentiate its cdf, to obtain
fX (x) = 0         for x < 0,
         0.5       for 0 ≤ x ≤ 1,
         0         for 1 ≤ x ≤ 2,      (2.45)
         x − 2     for 2 ≤ x ≤ 3,
         0         for x > 3.
Figure 2.8 shows the pdf. You can check that it integrates to one. To determine the probability
that X is between 0.5 and 2.5, we can just integrate over that interval to obtain the same answer
as in Example 2.3.4,
P (0.5 < X ≤ 2.5) = ∫_{0.5}^{2.5} fX (x) dx    (2.46)
                  = ∫_{0.5}^{1} 0.5 dx + ∫_{2}^{2.5} (x − 2) dx = 0.375.    (2.47)
Figure 2.9: Probability density function (left) and cumulative distribution function (right) of a uniform
random variable X.
Figure 2.8 illustrates that the probability of an event is equal to the area under the pdf once we
restrict it to the corresponding subset of the real line.
△
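The same probability can be obtained by numerical integration; below is a Python sketch of ours (assuming SciPy is installed) for the pdf of Example 2.3.7:

    from scipy.integrate import quad

    def f_X(x):
        if 0 <= x <= 1:
            return 0.5
        if 2 <= x <= 3:
            return x - 2
        return 0.0

    # 'points' flags the kinks of the piecewise pdf to the integrator.
    prob, _ = quad(f_X, 0.5, 2.5, points=[1, 2])
    print(prob)  # approximately 0.375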
Uniform
A uniform random variable models an experiment in which every outcome within a continuous
interval is equally likely. As a result the pdf is constant over the interval. Figure 2.9 shows the
pdf and cdf of a uniform random variable.
Definition 2.3.8 (Uniform). The pdf of a uniform random variable with domain [a, b], where
b > a are real numbers, is given by
fX (x) = 1 / (b − a),   if a ≤ x ≤ b,    (2.48)
         0,             otherwise.
Exponential
Exponential random variables are often used to model the time that passes until a certain event
occurs. Examples include decaying radioactive particles, telephone calls, earthquakes and many
others.
Definition 2.3.9 (Exponential). The pdf of an exponential random variable with parameter λ
is given by
fX (x) = λ e^{−λx},   if x ≥ 0,    (2.49)
         0,           otherwise.
Figure 2.10: Probability density functions of exponential random variables with parameters λ = 0.5, 1.0 and 1.5.
Figure 2.10 shows the pdf of three exponential random variables with different parameters. To illustrate the potential of exponential distributions for modeling real data, in Figure 2.11 we plot the histogram of inter-arrival times of calls at the same call center in Israel we mentioned earlier. In more detail, these inter-arrival times are the times between consecutive
calls occurring between 8 pm and midnight over two days in September 1999. An exponential
model fits the data quite well.
An important property of an exponential random variable is that it is memoryless. We elaborate
on this property, which is shared by the geometric distribution, in Section 2.4.
Gaussian or Normal
The Gaussian or normal random variable is arguably the most popular random variable in all
of probability and statistics. It is often used to model variables with unknown distributions in
the natural sciences. This is motivated by the fact that sums of independent random variables
often converge to Gaussian distributions. This phenomenon is captured by the Central Limit
Theorem, which we discuss in Chapter 6.
Definition 2.3.10 (Gaussian). The pdf of a Gaussian or normal random variable with mean µ
and standard deviation σ is given by
fX (x) = ( 1 / (√(2π) σ) ) e^{−(x−µ)² / (2σ²)} .    (2.50)

A Gaussian distribution with mean µ and standard deviation σ is usually denoted by N (µ, σ²).
We provide formal definitions of the mean and the standard deviation of a random variable in
Chapter 4. For now, you can just think of them as quantities that parametrize the Gaussian
pdf.
It is not immediately obvious that the pdf of the Gaussian integrates to one. We establish this
in the following lemma.
Lemma 2.3.11 (Proof in Section 2.7.3). The pdf of a Gaussian random variable integrates to
one.
Figure 2.11: Histogram of inter-arrival times of calls at a call center in Israel (red) compared to its
approximation by an exponential pdf.
Figure 2.12: Pdfs of Gaussian random variables with different means and standard deviations (µ = 2, σ = 1; µ = 0, σ = 2; µ = 0, σ = 4).
Figure 2.13: Histogram of heights in a population of 25,000 people (blue) and its approximation using
a Gaussian distribution (orange).
Figure 2.12 shows the pdfs of three Gaussian random variables with different values of µ and σ. Figure 2.13 shows the histogram of the heights in a population of 25,000 people and how it is very well approximated by a Gaussian random variable.³
An annoying feature of the Gaussian random variable is that its cdf does not have a closed-form expression, in contrast to the uniform and exponential random variables. This complicates the
task of determining the probability that a Gaussian random variable is in a certain interval. To
tackle this problem we use the fact that if X is a Gaussian random variable with mean µ and
standard deviation σ, then
U := (X − µ) / σ    (2.51)
is a standard Gaussian random variable, which means that its mean is zero and its standard
deviation equals one. See Lemma 2.5.1 for the proof. This allows us to express the probability
of X being in an interval [a, b] in terms of the cdf of a standard Gaussian, which we denote by
Φ,
P (X ∈ [a, b]) = P ( (X − µ)/σ ∈ [ (a − µ)/σ , (b − µ)/σ ] )    (2.52)
               = Φ ( (b − µ)/σ ) − Φ ( (a − µ)/σ ) .            (2.53)
As long as we can evaluate Φ, this formula allows us to deal with arbitrary Gaussian random
variables. To evaluate Φ people used to resort to lists of tabulated values, compiled by computing
the corresponding integrals numerically. Nowadays you can just use Matlab, WolframAlpha,
SciPy, etc.
³The data is available here.
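For instance, in Python (a sketch of ours; scipy.stats.norm implements Φ as norm.cdf, and the parameter values below are arbitrary) the probability of a Gaussian random variable falling in an interval can be computed as in (2.52)-(2.53):

    from scipy.stats import norm

    mu, sigma, a, b = 2.0, 1.5, 1.0, 4.0

    # Standardize and evaluate the standard Gaussian cdf.
    prob = norm.cdf((b - mu) / sigma) - norm.cdf((a - mu) / sigma)
    print(prob)

    # Equivalently, scipy accepts the mean and standard deviation directly.
    print(norm.cdf(b, loc=mu, scale=sigma) - norm.cdf(a, loc=mu, scale=sigma))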
Figure 2.14: Pdfs of beta random variables with different values of the a and b parameters (a = 1, b = 1; a = 1, b = 2; a = 3, b = 3; a = 6, b = 2; a = 3, b = 15).
Beta
Beta distributions allow us to parametrize unimodal continuous distributions supported on the
unit interval. This is useful in Bayesian statistics, as we discuss in Chapter 10.
Definition 2.3.12 (Beta distribution). The pdf of a beta distribution with parameters a and b
is defined as
fβ (θ; a, b) := θ^{a−1} (1 − θ)^{b−1} / β (a, b),   if 0 ≤ θ ≤ 1,    (2.54)
                0,                                  otherwise,

where

β (a, b) := ∫₀¹ u^{a−1} (1 − u)^{b−1} du.    (2.55)
β (a, b) is a special function called the beta function or Euler integral of the first kind, which
must be computed numerically. The uniform distribution is an example of a beta distribution
(where a = 1 and b = 1). Figure 2.14 shows the pdf of several different beta distributions.
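Both the beta pdf and the normalizing beta function are available in SciPy; here is a small sketch of ours (parameter values arbitrary, assuming SciPy is installed):

    from scipy.stats import beta
    from scipy.special import beta as beta_fn

    a, b, theta = 3.0, 15.0, 0.2
    direct = theta ** (a - 1) * (1 - theta) ** (b - 1) / beta_fn(a, b)
    print(direct, beta.pdf(theta, a, b))  # the two values agree
    print(beta.pdf(theta, 1, 1))          # 1.0: a = b = 1 is the uniform pdf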
2.4 Conditioning on an event

Conditioning a random variable on an event S with nonzero probability yields conditional pmfs and cdfs, obtained by restricting the probability measure to S as in Section 1.2. The following example applies this idea to the exponential distribution.

Example 2.4.2 (Exponential random variables are memoryless). Let us assume that the inter-
arrival times of your emails follow an exponential distribution (over intervals of several hours
this is probably a good approximation, let us know if you check). You receive an email. The
time until you receive your next email is exponentially distributed with a certain parameter λ.
No email arrives in the next t0 minutes. Surprisingly, the time from then until you receive your
next email is again exponentially distributed with the same parameter, no matter the value of
t0 . Just like geometric random variables, exponential random variables are memoryless.
Let us prove this rigorously. We compute the conditional cdf of an exponential random variable T with parameter λ conditioned on the event {T > t₀}, for an arbitrary t₀ > 0, by applying (2.60):

F_{T | T > t₀} (t) = ∫_{t₀}^{t} fT (u) du / ∫_{t₀}^{∞} fT (u) du    (2.66)
                   = ( e^{−λt₀} − e^{−λt} ) / e^{−λt₀}             (2.67)
                   = 1 − e^{−λ(t−t₀)} .                            (2.68)

Differentiating with respect to t yields an exponential pdf f_{T | T > t₀} (t) = λ e^{−λ(t−t₀)} starting at t₀. △
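A simulation makes the memoryless property tangible; this is a sketch of ours (assuming NumPy is installed; λ, t₀ and t are chosen arbitrarily):

    import numpy as np

    rng = np.random.default_rng(0)
    lam, t0, t = 2.0, 1.5, 0.5
    T = rng.exponential(scale=1 / lam, size=1_000_000)

    lhs = np.mean(T[T > t0] - t0 <= t)  # P(T - t0 <= t | T > t0)
    rhs = np.mean(T <= t)               # P(T <= t)
    print(lhs, rhs, 1 - np.exp(-lam * t))  # all three approximately agree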
2.5 Functions of random variables

Applying a deterministic function g to a random variable X yields a new random variable Y := g (X). If X is discrete, the pmf of Y is obtained directly by summing the pmf of X over the values that g maps to each y. If X is continuous, the procedure is more subtle. We first compute the cdf of Y by applying the definition,

FY (y) = P (Y ≤ y)            (2.72)
       = P (g (X) ≤ y)        (2.73)
       = ∫_{ {x | g(x) ≤ y} } fX (x) dx,    (2.74)
where the last equality obviously only holds if X has a pdf. We can then obtain the pdf of Y
from its cdf if it is differentiable. This idea can be used to prove a useful result about Gaussian
random variables.
Lemma 2.5.1 (Gaussian random variable). If X is a Gaussian random variable with mean µ
and standard deviation σ, then
U := (X − µ) / σ    (2.75)
is a standard Gaussian random variable.
2.6 Generating random variables

Generating samples from a given distribution is typically decomposed into two steps:

1. Generating samples uniformly distributed in [0, 1].

2. Transforming the uniform samples so that they have the desired distribution.
Here we focus on the second step, assuming that we have access to a random-number generator
that produces independent samples following a uniform distribution in [0, 1]. The construction
of good uniform random generators is an important problem, which is beyond the scope of these
notes.
Figure 2.15: Illustration of the method to generate samples from an arbitrary discrete distribution described in Section 2.6.1. The cdf of a discrete random variable is shown in blue. The samples u₄ and u₂ from a uniform distribution are mapped to x₁ and x₃ respectively, whereas u₁, u₃ and u₅ are mapped to x₂.
Very conveniently, the unit interval can be partitioned into intervals of length pX (x_i). We can consequently generate X by sampling from U and setting

X = x₁    if 0 ≤ U ≤ pX (x₁),
    x₂    if pX (x₁) ≤ U ≤ pX (x₁) + pX (x₂),
    . . .
    x_i   if Σ_{j=1}^{i−1} pX (x_j) ≤ U ≤ Σ_{j=1}^{i} pX (x_j),    (2.80)
    . . .
FX (x) = P (X ≤ x)                (2.81)
       = Σ_{x_i ≤ x} pX (x_i),    (2.82)
so our algorithm boils down to obtaining a sample u from U and then outputting the xi such
that FX (xi−1 ) ≤ u ≤ FX (xi ). This is illustrated in Figure 2.15.
For continuous distributions, the same idea leads to inverse-transform sampling.

Algorithm 2.6.1 (Inverse-transform sampling). Let U be uniformly distributed in [0, 1].

1. Obtain a sample u of U.

2. Set x := F_X^{−1} (u).
Figure 2.16: Samples from an exponential distribution with parameter λ = 1 obtained by inverse-
transform sampling as described in Example 2.6.4. The samples u1 , . . . , u5 are generated from a uniform
distribution.
The careful reader will point out that FX may not be invertible at every point. To avoid this problem we define the generalized inverse of the cdf as

F_X^{−1} (u) := min { x | FX (x) = u } .    (2.83)
The function is well defined because all cdfs are non-decreasing, so FX is equal to a constant c
in any interval [x1 , x2 ] where it is not invertible.
We now prove that Algorithm 2.6.1 works.

Theorem 2.6.2 (Inverse-transform sampling). Let U be a random variable uniformly distributed in [0, 1] and FX the cdf of a random variable. Then Y := F_X^{−1} (U) has cdf FX.

Proof.

FY (y) = P (Y ≤ y)                 (2.84)
       = P ( F_X^{−1} (U) ≤ y )    (2.85)
       = P ( U ≤ FX (y) )          (2.86)
       = ∫₀^{FX (y)} du            (2.87)
       = FX (y),                   (2.88)
where in step (2.86) we have to take into account that we are using the generalized inverse of
the cdf. This is resolved by the following lemma proved in Section 2.7.4.
Lemma 2.6.3. The events { F_X^{−1} (U) ≤ y } and { U ≤ FX (y) } are equivalent.
F_X^{−1} (U) is an exponential random variable with parameter λ by Theorem 2.6.2. Figure 2.16 shows how the samples of U are transformed into samples of X. △
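For the exponential, the generalized inverse has the closed form F_X^{−1}(u) = −ln(1 − u)/λ (solve u = 1 − e^{−λx} for x), so inverse-transform sampling is a one-liner; a sketch of ours assuming NumPy is installed:

    import numpy as np

    rng = np.random.default_rng(0)
    lam = 1.0
    u = rng.random(1_000_000)
    samples = -np.log(1 - u) / lam  # F_X^{-1}(u) for the exponential

    print(samples.mean())  # approximately 1/lam
    print(np.mean(samples <= 1.0), 1 - np.exp(-lam))  # empirical vs exact cdf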
2.7 Proofs
2.7.1 Proof of Lemma 2.2.9
For any fixed constants c₁ and c₂,

lim_{n→∞} (n − c₁) / (n − c₂) = 1,    (2.90)

so that

lim_{n→∞} n! / ( (n − k)! (n − λ)^k ) = lim_{n→∞} ( n / (n − λ) ) ( (n − 1) / (n − λ) ) · · · ( (n − k + 1) / (n − λ) ) = 1.    (2.91)
The proof of (2.32) follows from this result. Let Y = −X, then
2.7.3 Proof of Lemma 2.3.11

Lemma 2.7.1.

∫_{−∞}^{∞} e^{−t²} dt = √π .    (2.101)
= −π e^{−r²} ]₀^∞ = π.    (2.106)
To complete the proof we use the change of variables t = (x − µ) / (√2 σ).
2.7.4 Proof of Lemma 2.6.3

{ F_X^{−1} (U) ≤ y } implies { U ≤ FX (y) }: We prove the contrapositive. Assume that U > FX (y). Then for all x such that FX (x) = U we have x > y, because the cdf is nondecreasing. In particular min { x | FX (x) = U } > y, i.e. F_X^{−1} (U) > y.

{ U ≤ FX (y) } implies { F_X^{−1} (U) ≤ y }: We again prove the contrapositive. Assume that min { x | FX (x) = U } > y. Then U > FX (y), because the cdf is nondecreasing. The inequality is strict because U = FX (y) would imply that y belongs to { x | FX (x) = U }, which cannot be the case since we are assuming that y is smaller than the minimum of that set.
Chapter 3

Multivariate Random Variables
Probabilistic models usually include multiple uncertain numerical quantities. In this chapter we
describe how to specify random variables to represent such quantities and their interactions. In
some occasions, it will make sense to group these random variables as random vectors, which we write using uppercase letters with an arrow on top: X⃗. Realizations of these random vectors are denoted with lowercase letters: x⃗.
3.1 Discrete random variables

3.1.1 Joint probability mass function

Definition 3.1.1 (Joint probability mass function). Let X, Y : Ω → Z be discrete random variables defined on the same probability space (Ω, F, P). The joint pmf of X and Y is defined as

pX,Y (x, y) := P (X = x, Y = y) .

As in the case of the pmf of a single random variable, the joint pmf is a valid probability measure
if we consider a probability space where the sample space is RX × RY¹ (or RX₁ × RX₂ × · · · × RXₙ
in the case of a random vector) and the σ-algebra is just the power set of the sample space. This
implies that the joint pmf completely characterizes the random variables or the random vector; we don't need to worry about the underlying probability space.
By the definition of probability measure, the joint pmf must be nonnegative and its sum over
all its possible arguments must equal one,
pX,Y (x, y) ≥ 0 for any x ∈ RX , y ∈ RY ,     (3.4)
Σ_{x ∈ RX} Σ_{y ∈ RY} pX,Y (x, y) = 1.        (3.5)
By the Law of Total Probability, the joint pmf allows us to obtain the probability of X and Y
belonging to any set S ⊆ RX × RY ,
P ((X, Y ) ∈ S) = P ( ∪_{(x,y) ∈ S} {X = x, Y = y} )    (union of disjoint events)    (3.6)
                = Σ_{(x,y) ∈ S} P (X = x, Y = y)        (3.7)
                = Σ_{(x,y) ∈ S} pX,Y (x, y) .           (3.8)
These properties also hold for random vectors (and groups of more than two random variables). For any random vector X⃗ with range R_X⃗,

p_X⃗ (x⃗) ≥ 0 for any x⃗ ∈ R_X⃗ ,    Σ_{x⃗ ∈ R_X⃗} p_X⃗ (x⃗) = 1,    P ( X⃗ ∈ S ) = Σ_{x⃗ ∈ S} p_X⃗ (x⃗) .
3.1.2 Marginalization
Assume we have access to the joint pmf of several random variables in a certain probability
space, but we are only interested in the behavior of one of them. To compute the value of its
pmf for a particular value, we fix that value and sum over the remaining random variables.
Indeed, by the Law of Total Probability
pX (x) = P (X = x)                                                      (3.12)
       = P ( ∪_{y ∈ RY} {X = x, Y = y} )    (union of disjoint events)  (3.13)
       = Σ_{y ∈ RY} P (X = x, Y = y)                                    (3.14)
       = Σ_{y ∈ RY} pX,Y (x, y) .                                       (3.15)
¹This is the Cartesian product of the two sets, defined in Section A.2, which contains all possible pairs (x, y) where x ∈ RX and y ∈ RY.
When the joint pmf involves more than two random variables the argument is exactly the same.
This is called marginalizing over the other random variables. In this context, the pmf of a
single random variable is called its marginal pmf. Table 3.1 shows an example of a joint pmf
and the corresponding marginal pmfs.
If we are interested in computing the joint pmf of several entries in a random vector, instead
of just one, the marginalization process is essentially the same. The pmf is again obtained
by summing over the rest of the entries. Let I ⊆ {1, 2, . . . , n} be a subset of m < n entries of an n-dimensional random vector X⃗ and X⃗_I the corresponding random subvector. To compute the joint pmf of X⃗_I we sum over all the entries that are not in I, which we denote by {j₁, j₂, . . . , j_{n−m}} := {1, 2, . . . , n} \ I:

p_{X⃗_I} (x⃗_I) = Σ_{x⃗_{j₁} ∈ R_{j₁}} Σ_{x⃗_{j₂} ∈ R_{j₂}} · · · Σ_{x⃗_{j_{n−m}} ∈ R_{j_{n−m}}} p_X⃗ (x⃗) .    (3.16)
3.1.3 Conditional distributions

Definition 3.1.2 (Conditional probability mass function). The conditional probability mass function of Y given X, where X and Y are discrete random variables defined on the same probability space, is given by

pY|X (y|x) := P (Y = y | X = x) = pX,Y (x, y) / pX (x)    if pX (x) > 0.    (3.18)

The conditional pmf pX|Y (·|y) characterizes our uncertainty about X conditioned on the event {Y = y}. This object is a valid pmf of X, so that if RX is the range of X

Σ_{x ∈ RX} pX|Y (x|y) = 1.    (3.19)

Similarly, the conditional pmf of a random subvector X⃗_I given another subvector X⃗_J is

p_{X⃗_I | X⃗_J} (x⃗_I | x⃗_J) := p_X⃗ (x⃗) / p_{X⃗_J} (x⃗_J) ,    (3.20)
              R = 0    R = 1    pL       pL|R (·|0)    pL|R (·|1)
L = 0         14/20    1/20     15/20    7/8           1/4
L = 1         2/20     3/20     5/20     1/8           3/4

pR            16/20    4/20
pR|L (·|0)    14/15    1/15
pR|L (·|1)    2/5      3/5

Table 3.1: Joint, marginal and conditional pmfs of the random variables L and R defined in Example 3.1.5.
The conditional pmfs pY|X (·|x) and p_{X⃗_I | X⃗_J} (·|x⃗_J) are valid pmfs in the probability space where X = x or X⃗_J = x⃗_J respectively. For instance, they must be nonnegative and add up to one.
From the definition of conditional pmfs we derive a chain rule for discrete random variables and
vectors.
Lemma 3.1.4 (Chain rule for discrete random variables and vectors).

pX,Y (x, y) = pX (x) pY|X (y|x) ,
p_X⃗ (x⃗) = p_{X⃗₁} (x⃗₁) p_{X⃗₂|X⃗₁} (x⃗₂|x⃗₁) · · · p_{X⃗ₙ|X⃗₁,...,X⃗ₙ₋₁} (x⃗ₙ|x⃗₁, . . . , x⃗ₙ₋₁) ,

where the order of indices in the random vector is arbitrary (any order works).
The following example illustrates the definitions of marginal and conditional pmfs.
Example 3.1.5 (Flights and rains (continued)). Within the probability space described in
Example 1.2.1 we define a random variable

L = 1 if plane is late, 0 otherwise,    (3.24)

and a random variable

R = 1 if it rains, 0 otherwise,    (3.25)

which represents whether it rains or not. Equivalently, these random variables are just the indicators R = 1_rain and L = 1_late. Table 3.1 shows the joint, marginal and conditional pmfs of L and R.
△
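The marginalization and conditioning operations on Table 3.1 reduce to row and column sums; here is a sketch of ours assuming NumPy is installed:

    import numpy as np

    # Joint pmf of L (rows) and R (columns) from Table 3.1.
    p_LR = np.array([[14, 1],
                     [2, 3]]) / 20

    p_L = p_LR.sum(axis=1)           # marginalize over R
    p_R = p_LR.sum(axis=0)           # marginalize over L
    p_R_given_L1 = p_LR[1] / p_L[1]  # condition on L = 1

    print(p_L)           # [0.75 0.25]
    print(p_R)           # [0.8  0.2 ]
    print(p_R_given_L1)  # [0.4  0.6 ]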
3.2 Continuous random variables

Definition 3.2.1 (Joint cumulative distribution function). Let (Ω, F, P) be a probability space and X, Y : Ω → R random variables. The joint cdf of X and Y is defined as

FX,Y (x, y) := P (X ≤ x, Y ≤ y) .    (3.26)

In words, FX,Y (x, y) is the probability of X and Y being smaller than x and y respectively. Let X⃗ : Ω → Rⁿ be a random vector of dimension n on a probability space (Ω, F, P). The joint cdf of X⃗ is defined as

F_X⃗ (x⃗) := P ( X⃗₁ ≤ x⃗₁, X⃗₂ ≤ x⃗₂, . . . , X⃗ₙ ≤ x⃗ₙ ) .    (3.27)
Proof. The proof follows along the same lines as that of Lemma 2.3.2.
The joint cdf completely specifies the behavior of the corresponding random variables. Indeed,
we can decompose any Borel set into a union of disjoint n-dimensional intervals and compute
their probability by evaluating the joint cdf. Let us illustrate this for the bivariate case: for any a ≤ b and c ≤ d,

P (a < X ≤ b, c < Y ≤ d) = FX,Y (b, d) − FX,Y (a, d) − FX,Y (b, c) + FX,Y (a, c) .
This means that, as in the univariate case, to define a random vector or a group of random
variables all we need to do is define their joint cdf. We don’t have to worry about the underlying
probability space.
If the joint cdf is differentiable, we can differentiate it to obtain the joint probability density
function of X and Y . As in the case of univariate random variables, this is often a more
convenient way of specifying the joint distribution.
Definition 3.2.3 (Joint probability density function). If the joint cdf of two random variables X, Y is differentiable, then their joint pdf is defined as

fX,Y (x, y) := ∂² FX,Y (x, y) / ( ∂x ∂y ) .    (3.35)

Similarly, if the joint cdf of a random vector X⃗ is differentiable, then its joint pdf is defined as

f_X⃗ (x⃗) := ∂ⁿ F_X⃗ (x⃗) / ( ∂x⃗₁ ∂x⃗₂ · · · ∂x⃗ₙ ) .    (3.36)
The joint pdf should be understood as an n-dimensional density, not as a probability (for instance, it can be larger than one). In the two-dimensional case, for small Δx and Δy,

P (x ≤ X ≤ x + Δx, y ≤ Y ≤ y + Δy) ≈ fX,Y (x, y) Δx Δy.    (3.37)

Due to the monotonicity of joint cdfs in every variable, joint pdfs are always nonnegative.
The joint pdf of X and Y allows us to compute the probability of any Borel set S ⊆ R² by integrating over S,

P ((X, Y ) ∈ S) = ∫_{(x,y) ∈ S} fX,Y (x, y) dx dy.    (3.38)

Similarly, the joint pdf of an n-dimensional random vector X⃗ allows us to compute the probability that X⃗ belongs to a Borel set S ⊆ Rⁿ,

P ( X⃗ ∈ S ) = ∫_{x⃗ ∈ S} f_X⃗ (x⃗) dx⃗.    (3.39)
In particular, if we integrate a joint pdf over the whole space Rn , then it must integrate to one
by the Law of Total Probability.
Figure 3.1: The triangular lake of Example 3.2.4, divided into the regions A, B, C, D, E and F used to compute the joint cdf.
Example 3.2.4 (Triangle lake). A biologist is tracking an otter that lives in a lake. She decides to model the location of the otter probabilistically. The lake happens to be triangular as shown in Figure 3.1, so that we can represent it by the set

Lake := { x⃗ | x⃗₁ ≥ 0, x⃗₂ ≥ 0, x⃗₁ + x⃗₂ ≤ 1 } .    (3.40)

The biologist has no idea where the otter is, so she models the position as a random vector X⃗ which is uniformly distributed over the lake. In other words, the joint pdf of X⃗ is constant,

f_X⃗ (x⃗) = c    if x⃗ ∈ Lake,    (3.41)
           0    otherwise.
To find the normalizing constant c we use the fact that to be a valid joint pdf f_X⃗ should integrate to 1:

∫_{x₁=−∞}^{∞} ∫_{x₂=−∞}^{∞} f_X⃗ (x⃗) dx₁ dx₂ = c ∫_{x₂=0}^{1} ∫_{x₁=0}^{1−x₂} dx₁ dx₂    (3.42)
                                             = c ∫_{x₂=0}^{1} (1 − x₂) dx₂              (3.43)
                                             = c / 2 = 1,                               (3.44)

so c = 2.
We now compute the cdf of X⃗. F_X⃗ (x⃗) represents the probability that the otter is southwest of the point x⃗. Computing the joint cdf requires dividing the range into the sets A, B, C, D, E and F shown in Figure 3.1 and integrating the joint pdf. If x⃗ ∈ A then F_X⃗ (x⃗) = 0 because P ( X⃗ ≤ x⃗ ) = 0. If x⃗ ∈ B,

F_X⃗ (x⃗) = ∫_{u=0}^{x⃗₂} ∫_{v=0}^{x⃗₁} 2 dv du = 2 x⃗₁ x⃗₂ .    (3.45)
If x⃗ ∈ C,

F_X⃗ (x⃗) = ∫_{u=0}^{1−x⃗₁} ∫_{v=0}^{x⃗₁} 2 dv du + ∫_{u=1−x⃗₁}^{x⃗₂} ∫_{v=0}^{1−u} 2 dv du = 2 x⃗₁ + 2 x⃗₂ − x⃗₂² − x⃗₁² − 1.    (3.46)

If x⃗ ∈ D,

F_X⃗ (x⃗) = P ( X⃗₁ ≤ x⃗₁, X⃗₂ ≤ x⃗₂ ) = P ( X⃗₁ ≤ 1, X⃗₂ ≤ x⃗₂ ) = F_X⃗ (1, x⃗₂) = 2 x⃗₂ − x⃗₂² ,    (3.47)

where the last step follows from (3.46). Exchanging x⃗₁ and x⃗₂, we obtain F_X⃗ (x⃗) = 2 x⃗₁ − x⃗₁² for x⃗ ∈ E by the same reasoning. Finally, for x⃗ ∈ F, F_X⃗ (x⃗) = 1 because P ( X⃗₁ ≤ x⃗₁, X⃗₂ ≤ x⃗₂ ) = 1. Putting everything together,

F_X⃗ (x⃗) = 0                                 if x⃗₁ < 0 or x⃗₂ < 0,
           2 x⃗₁ x⃗₂                           if x⃗₁ ≥ 0, x⃗₂ ≥ 0, x⃗₁ + x⃗₂ ≤ 1,
           2 x⃗₁ + 2 x⃗₂ − x⃗₂² − x⃗₁² − 1      if x⃗₁ ≤ 1, x⃗₂ ≤ 1, x⃗₁ + x⃗₂ ≥ 1,      (3.48)
           2 x⃗₂ − x⃗₂²                        if x⃗₁ ≥ 1, 0 ≤ x⃗₂ ≤ 1,
           2 x⃗₁ − x⃗₁²                        if 0 ≤ x⃗₁ ≤ 1, x⃗₂ ≥ 1,
           1                                  if x⃗₁ ≥ 1, x⃗₂ ≥ 1.
3.2.2 Marginalization
We now discuss how to characterize the marginal distributions of individual random variables
from a joint cdf or a joint pdf. Consider the joint cdf FX,Y (x, y). When x → ∞ the limit
of FX,Y (x, y) is by definition the probability of Y being smaller than y, which is precisely the
marginal cdf of Y . More formally,
If the random variables have a joint pdf, we can also compute the marginal cdf by integrating over x,

FY (y) = P (Y ≤ y)    (3.53)
       = ∫_{u=−∞}^{y} ∫_{x=−∞}^{∞} fX,Y (x, u) dx du.    (3.54)

Differentiating the latter equation with respect to y, we obtain the marginal pdf of Y

fY (y) = ∫_{x=−∞}^{∞} fX,Y (x, y) dx.    (3.55)
Example 3.2.5 (Triangle lake (continued)). The biologist is interested in the probability that
the otter is south of x₁. This information is encoded in the cdf of the random vector; we just need to take the limit when x₂ → ∞ to marginalize over x₂:
FX₁ (x₁) = 0              if x₁ < 0,
           2 x₁ − x₁²     if 0 ≤ x₁ ≤ 1,      (3.57)
           1              if x₁ ≥ 1.

To obtain the marginal pdf of X₁, which represents the latitude of the otter's position, we differentiate the marginal cdf:

fX₁ (x₁) = dFX₁ (x₁) / dx₁ = 2 (1 − x₁)    if 0 ≤ x₁ ≤ 1,      (3.58)
                             0              otherwise.
Alternatively, we could have integrated the joint uniform pdf over x2 (we encourage you to check
that the result is the same).
△
Definition 3.2.6 (Joint conditional cdf and pdf given an event). Let X, Y be random variables with joint pdf fX,Y and let S ⊆ R² be any Borel set with nonzero probability. The conditional cdf and pdf of X and Y given the event (X, Y ) ∈ S are defined as

FX,Y | (X,Y) ∈ S (x, y) := P (X ≤ x, Y ≤ y | (X, Y ) ∈ S) ,
fX,Y | (X,Y) ∈ S (x, y) := ∂² FX,Y | (X,Y) ∈ S (x, y) / ( ∂x ∂y ) .

This definition only holds for events with nonzero probability. However, events of the form {X = x} have probability equal to zero because the random variable is continuous.
Definition 3.2.7 (Conditional pdf and cdf). If FX,Y is differentiable, then the conditional pdf of Y given X is defined as

fY|X (y|x) := fX,Y (x, y) / fX (x)    if fX (x) > 0,    (3.63)

and the conditional cdf of Y given X as

FY|X (y|x) := ∫_{u=−∞}^{y} fY|X (u|x) du    if fX (x) > 0.    (3.64)

We now justify this definition, beyond the analogy with (3.18). Assume that fX (x) > 0. Let us write the definition of the conditional pdf in terms of limits. We have

fX (x) = lim_{Δx→0} P (x ≤ X ≤ x + Δx) / Δx ,    (3.65)
fX,Y (x, y) = lim_{Δx→0} (1/Δx) ∂P (x ≤ X ≤ x + Δx, Y ≤ y) / ∂y .    (3.66)

This implies

fX,Y (x, y) / fX (x) = lim_{Δx→0} ( 1 / P (x ≤ X ≤ x + Δx) ) ∂P (x ≤ X ≤ x + Δx, Y ≤ y) / ∂y .    (3.67)
We can therefore interpret the conditional cdf as the limit of the cdf of Y at y conditioned on
X belonging to an interval around x when the width of the interval tends to zero.
Remark 3.2.8. Interchanging limits and integrals as in (3.69) is not necessarily justified in
general. In this case it is, as long as the integral converges and the quantities involved are
bounded.
An immediate consequence of Definition 3.2.7 is the chain rule for continuous random variables:
f_{X,Y}(x, y) = f_X(x) f_{Y|X}(y|x).
Applying the same ideas as in the bivariate case, we define the conditional distribution of a subvector given the rest of the random vector.
Definition 3.2.10 (Conditional pdf). The conditional pdf of a random subvector ~X_I, I ⊆ {1, 2, . . . , n}, given the subvector ~X_{{1,...,n}\I} is
f_{~X_I | ~X_{{1,...,n}\I}}(~x_I | ~x_{{1,...,n}\I}) := f_~X(~x) / f_{~X_{{1,...,n}\I}}(~x_{{1,...,n}\I}). (3.73)
It is often useful to represent the joint pdf of a random vector by factoring it into conditional
pdfs using the chain rule for random vectors.
Lemma 3.2.11 (Chain rule for random vectors). The joint pdf of a random vector ~X can be decomposed into
f_~X(~x) = f_{~X1}(~x1) f_{~X2|~X1}(~x2|~x1) · · · f_{~Xn|~X1,...,~Xn−1}(~xn|~x1, . . . , ~xn−1) (3.74)
= ∏_{i=1}^{n} f_{~Xi | ~X_{{1,...,i−1}}}(~xi | ~x_{{1,...,i−1}}). (3.75)
Note that the order is arbitrary: you can reorder the components of the vector in any way you like.
Proof. The result follows from applying the definition of conditional pdf recursively.
Example 3.2.12 (Triangle lake (continued)). The biologist spots the otter from the shore of
the lake. She is standing on the west side of the lake at a latitude of x1 = 0.75 looking east and
the otter is right in front of her. The otter is consequently also at a latitude of x1 = 0.75, but
she cannot tell at what distance. The distribution of the location of the otter given its latitude
X1 is characterized by the conditional pdf of the longitude X2 given X1 ,
f_{X2|X1}(x2|x1) = f_{X1,X2}(x1, x2) / f_{X1}(x1) (3.76)
= 1 / (1 − x1), for 0 ≤ x2 ≤ 1 − x1. (3.77)
The biologist is interested in the probability that the otter is closer than x2 to her. This
probability is given by the conditional cdf
F_{X2|X1}(x2|x1) = ∫_{u=−∞}^{x2} f_{X2|X1}(u|x1) du (3.78)
= x2 / (1 − x1). (3.79)
The probability that the otter is less than x2 away is 4x2 for 0 ≤ x2 ≤ 1/4.
4
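A quick numerical sanity check of (3.79) (an illustrative sketch, not part of the original text): the snippet below draws uniform points in the triangular lake with NumPy and estimates the conditional cdf of X2 given X1 ≈ 0.75 by restricting to a thin strip of latitudes; the estimates should be close to 4 x2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample uniform points in the triangular lake {x1, x2 >= 0, x1 + x2 <= 1}
# by rejection from the unit square.
n = 10**6
pts = rng.uniform(size=(n, 2))
lake = pts[pts[:, 0] + pts[:, 1] <= 1]

# Condition on the latitude X1 being (approximately) 0.75.
strip = lake[np.abs(lake[:, 0] - 0.75) < 0.005]

# Empirical conditional cdf of X2 given X1 = 0.75 vs the formula 4 * x2.
for x2 in (0.05, 0.10, 0.20):
    emp = np.mean(strip[:, 1] <= x2)
    print(f"x2 = {x2:.2f}: empirical {emp:.3f}, formula {4 * x2:.3f}")
```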
Figure 3.2: Joint pdf of a bivariate Gaussian random variable (X, Y ) together with the marginal pdfs
of X and Y .
Definition 3.2.13 (Gaussian random vector). A Gaussian random vector ~X is a random vector with joint pdf
f_~X(~x) = (1 / √((2π)^n |Σ|)) exp( −(1/2) (~x − ~µ)^T Σ^{−1} (~x − ~µ) ), (3.80)
where the mean vector ~µ ∈ R^n and the covariance matrix Σ, which is symmetric and positive definite, parametrize the distribution. A Gaussian distribution with mean ~µ and covariance matrix Σ is usually denoted by N(~µ, Σ).
Theorem 3.2.14 (Linear transformations of Gaussian random vectors are Gaussian). Let ~X be a Gaussian random vector of dimension n with mean ~µ and covariance matrix Σ. For any matrix A ∈ R^{m×n} and ~b ∈ R^m, ~Y = A~X + ~b is a Gaussian random vector with mean A~µ + ~b and covariance matrix AΣA^T.
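As an empirical illustration (not part of the original text), the following sketch checks Theorem 3.2.14 with NumPy: it draws samples from N(~µ, Σ), applies an affine map, and compares the empirical mean and covariance of the transformed samples with A~µ + ~b and AΣA^T. The specific numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, -1.0],
              [3.0, 1.0]])          # maps R^2 to R^3
b = np.array([0.5, 0.0, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=10**6)  # samples from N(mu, Sigma)
y = x @ A.T + b                                     # y = A x + b for each sample

print(np.round(y.mean(axis=0) - (A @ mu + b), 3))   # entries ~0: mean is A mu + b
print(np.round(np.cov(y.T) - A @ Sigma @ A.T, 1))   # entries ~0: cov is A Sigma A^T
```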
A corollary of this result is that the joint pdf of a subvector of a Gaussian random vector is also Gaussian.
Corollary 3.2.15 (Marginals of Gaussian random vectors are Gaussian). The joint pdf of any subvector of a Gaussian random vector is Gaussian. Without loss of generality, assume that the subvector ~X consists of the first m entries of the Gaussian random vector
~Z := [~X; ~Y], with mean ~µ_~Z := [~µ_~X; ~µ_~Y]. (3.81)
Then ~X can be written as the linear transformation
~X = [I 0_{m×(n−m)}] ~Z,
where I ∈ R^{m×m} is an identity matrix and 0_{c×d} represents a matrix of zeros of dimensions c × d. The result then follows from Theorem 3.2.14.
Figure 3.2 shows the joint pdf of a bivariate Gaussian random variable along with its marginal
pdfs.
3.3 Joint distributions of discrete and continuous random variables
Definition 3.3.1 (Conditional cdf and pdf of a continuous random variable given a discrete random variable). Let C and D be a continuous and a discrete random variable defined on the same probability space. Then, the conditional cdf and pdf of C given D are of the form
F_{C|D}(c|d) := P(C ≤ c | D = d),
f_{C|D}(c|d) := dF_{C|D}(c|d)/dc.
We obtain the marginal cdf and pdf of C from the conditional cdfs and pdfs by computing a weighted sum.
Figure 3.3: Conditional and marginal distributions of the weight of the bears W in Example 3.3.3.
Lemma 3.3.2. Let FC|D and fC|D be the conditional cdf and pdf of a continuous random variable
C given a discrete random variable D. Then,
F_C(c) = Σ_{d∈R_D} p_D(d) F_{C|D}(c|d), (3.86)
f_C(c) = Σ_{d∈R_D} p_D(d) f_{C|D}(c|d). (3.87)
Proof. The events {D = d} are a partition of the whole probability space (one of them must happen and they are all disjoint), so
F_C(c) = P(C ≤ c) (3.88)
= Σ_{d∈R_D} P(D = d) P(C ≤ c | D = d) by the Law of Total Probability (3.89)
= Σ_{d∈R_D} p_D(d) F_{C|D}(c|d). (3.90)
Differentiating with respect to c yields (3.87).
Combining a discrete marginal pmf with a continuous conditional distribution allows us to define
mixture models where the data is drawn from a continuous distribution whose parameters are
chosen from a discrete set. If a Gaussian is used as the continuous distribution, this yields a
Gaussian mixture model. Fitting Gaussian mixture models is a popular technique for clustering
data.
Example 3.3.3 (Grizzlies in Yellowstone). A scientist is gathering data on the bears in Yellowstone. It turns out that the weight of the males is well modeled by a Gaussian random variable with mean 240 kg and standard deviation 40 kg, whereas the weight of the females is well modeled by a Gaussian with mean 140 kg and standard deviation 20 kg. There are about the same number of females and males.
The distribution of the weights of all the grizzlies can consequently be modeled by a Gaussian mixture that includes a continuous random variable W to represent the weight and a discrete random variable S to represent the sex of the bears. S is Bernoulli with parameter 1/2, W given S = 0 (male) is N(240, 1600) and W given S = 1 (female) is N(140, 400). By (3.87) the pdf of W is consequently of the form
f_W(w) = Σ_{s=0}^{1} p_S(s) f_{W|S}(w|s) (3.91)
= (1 / (2√(2π))) ( e^{−(w−240)²/3200} / 40 + e^{−(w−140)²/800} / 20 ). (3.92)
Defining the conditional pmf of a discrete random variable D given a continuous random variable
C is challenging because the probability of the event {C = c} is zero. We follow the same
approach as in Definition 3.2.7 and define the conditional pmf as a limit.
Definition 3.3.4 (Conditional pmf of a discrete random variable given a continuous random
variable). Let C and D be a continuous and a discrete random variable defined on the same
probability space. Then, the conditional pmf of D given C is defined as
p_{D|C}(d|c) := lim_{Δ→0} P(D = d, c ≤ C ≤ c + Δ) / P(c ≤ C ≤ c + Δ). (3.93)
Analogously to Lemma 3.3.2, we obtain the marginal pmf of D from the conditional pmfs by
computing a weighted sum.
Lemma 3.3.5. Let pD|C be the conditional pmf of a discrete random variable D given a con-
tinuous random variable C. Then,
p_D(d) = ∫_{c=−∞}^{∞} f_C(c) p_{D|C}(d|c) dc. (3.94)
Proof. We will not give a formal proof but rather an intuitive argument that can be made rigorous. If we take a grid of values for c, say . . . , c−1, c0, c1, . . ., of width Δ, then
p_D(d) = Σ_{i=−∞}^{∞} P(D = d, ci ≤ C ≤ ci + Δ) (3.95)
by the Law of Total Probability. Taking the limit as Δ → 0, the sum becomes an integral and we have
p_D(d) = ∫_{c=−∞}^{∞} lim_{Δ→0} P(D = d, c ≤ C ≤ c + Δ)/Δ dc (3.96)
= ∫_{c=−∞}^{∞} lim_{Δ→0} ( P(c ≤ C ≤ c + Δ)/Δ · P(D = d, c ≤ C ≤ c + Δ)/P(c ≤ C ≤ c + Δ) ) dc (3.97)
= ∫_{c=−∞}^{∞} f_C(c) p_{D|C}(d|c) dc. (3.98)
Example 3.3.6 (Bayesian coin flip). Your uncle bets you ten dollars that a coin flip will turn out heads. You suspect that the coin is biased, but you are not sure to what extent. To model this uncertainty you represent the bias as a continuous random variable B with pdf
f_B(b) := 2b for b ∈ [0, 1], and 0 otherwise. (3.99)
You can now compute the probability that the coin lands on heads, denoted by X, using Lemma 3.3.5. Conditioned on the bias B, the result of the coin flip is Bernoulli with parameter B, so
p_X(1) = ∫_{b=−∞}^{∞} f_B(b) p_{X|B}(1|b) db (3.100)
= ∫_{b=0}^{1} 2b² db (3.101)
= 2/3. (3.102)
According to your model the probability that the coin lands heads is 2/3. 4
The following lemma provides an analogue to the chain rule for jointly distributed continuous
and discrete random variables.
Lemma 3.3.7 (Chain rule for jointly distributed continuous and discrete random variables). Let C be a continuous random variable with conditional pdf f_{C|D} and D a discrete random variable with conditional pmf p_{D|C}. Then,
p_D(d) f_{C|D}(c|d) = f_C(c) p_{D|C}(d|c).
Indeed,
p_D(d) f_{C|D}(c|d) = P(D = d) lim_{Δ→0} P(c ≤ C ≤ c + Δ | D = d)/Δ (3.104)
= lim_{Δ→0} P(D = d, c ≤ C ≤ c + Δ)/Δ (3.105)
= lim_{Δ→0} ( P(c ≤ C ≤ c + Δ)/Δ · P(D = d, c ≤ C ≤ c + Δ)/P(c ≤ C ≤ c + Δ) ) (3.106)
= f_C(c) p_{D|C}(d|c). (3.107)
Example 3.3.8 (Grizzlies in Yellowstone (continued)). The scientist observes a bear with her binoculars. From its size she estimates that its weight is 180 kg. What is the probability that the bear is male?
Figure 3.4: Conditional and marginal distributions of the bias of the coin flip in Example 3.3.9.
By Lemma 3.3.7,
p_{S|W}(0|180) = p_S(0) f_{W|S}(180|0) / f_W(180)
= p_S(0) f_{W|S}(180|0) / ( p_S(0) f_{W|S}(180|0) + p_S(1) f_{W|S}(180|1) )
= 0.545. (3.110)
According to the probabilistic model, the probability that it is a male is 0.545.
4
Example 3.3.9 (Bayesian coin flip (continued)). The coin lands on tails. You decide to recom-
pute the distribution of the bias conditioned on this information. By Lemma 3.3.7
f_{B|X}(b|0) = f_B(b) p_{X|B}(0|b) / p_X(0) (3.111)
= 2b(1 − b) / (1/3) (3.112)
= 6b(1 − b). (3.113)
Conditioned on the outcome, the pdf of the bias is now centered instead of concentrated near
one as before, as shown in Figure 3.4.
4
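A short numerical sketch of Examples 3.3.6 and 3.3.9 (illustrative, not from the original text): it integrates the prior 2b against the Bernoulli likelihood to recover p_X(1) = 2/3 and p_X(0) = 1/3, and checks that the posterior 6b(1 − b) is a valid pdf.

```python
import numpy as np

b = np.linspace(0.0, 1.0, 10**6)
prior = 2.0 * b                        # f_B(b) = 2b on [0, 1]

p_heads = np.trapz(prior * b, b)       # p_X(1) = integral of 2b * b
p_tails = np.trapz(prior * (1 - b), b)
print(p_heads, p_tails)                # ~0.6667, ~0.3333

posterior = prior * (1 - b) / p_tails  # f_{B|X}(b|0)
print(np.trapz(posterior, b))          # ~1.0: a valid pdf
print(np.max(np.abs(posterior - 6 * b * (1 - b))))  # ~0: matches (3.113)
```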
3.4 Independence
In this section we define independence and conditional independence for random variables and
vectors.
3.4.1 Definition
When knowledge about a random variable X does not affect our uncertainty about another random variable Y, we say that X and Y are independent. Formally, this is reflected by the marginal and conditional cdf, and the conditional pmf or pdf, which must be equal, i.e.,
F_Y(y) = F_{Y|X}(y|x)
and
p_Y(y) = p_{Y|X}(y|x) or f_Y(y) = f_{Y|X}(y|x),
depending on whether the variable is discrete or continuous, for any x and any y for which the conditional distributions are well defined. Equivalently, the joint cdf and the joint pmf or pdf factor into the marginals.
Definition 3.4.1 (Independent random variables). Two random variables X and Y are independent if and only if
F_{X,Y}(x, y) = F_X(x) F_Y(y) for all x, y.
If the variables are discrete, the following condition is equivalent:
p_{X,Y}(x, y) = p_X(x) p_Y(y) for all x, y.
If the variables are continuous and have joint and marginal pdfs, the following condition is equivalent:
f_{X,Y}(x, y) = f_X(x) f_Y(y) for all x, y.
We now extend the definition to account for several random variables (or equivalently several entries in a random vector) that do not provide information about each other. The components of a random vector ~X are independent if and only if the joint cdf factors into the product of the marginal cdfs,
which is equivalent to
p_~X(~x) = ∏_{i=1}^{n} p_{Xi}(~xi) (3.120)
if the components are discrete (an analogous factorization of the joint pdf holds in the continuous case).
The following example shows that pairwise independence does not imply independence.
Example 3.4.3 (Pairwise independence does not imply joint independence). Let X1 and X2 be the outcomes of independent unbiased coin flips. Let X3 be the indicator of the event {X1 and X2 have the same outcome},
X3 = 1 if X1 = X2, and 0 if X1 ≠ X2. (3.122)
The pmf of X3 is
p_{X3}(1) = p_{X1,X2}(1, 1) + p_{X1,X2}(0, 0) = 1/2, (3.123)
p_{X3}(0) = p_{X1,X2}(0, 1) + p_{X1,X2}(1, 0) = 1/2. (3.124)
X1 and X2 are independent by assumption. X1 and X3 are independent because
p_{X1,X3}(0, 0) = p_{X1,X2}(0, 1) = 1/4 = p_{X1}(0) p_{X3}(0), (3.125)
p_{X1,X3}(1, 0) = p_{X1,X2}(1, 0) = 1/4 = p_{X1}(1) p_{X3}(0), (3.126)
p_{X1,X3}(0, 1) = p_{X1,X2}(0, 0) = 1/4 = p_{X1}(0) p_{X3}(1), (3.127)
p_{X1,X3}(1, 1) = p_{X1,X2}(1, 1) = 1/4 = p_{X1}(1) p_{X3}(1). (3.128)
X2 and X3 are independent too (the reasoning is the same).
However, are X1, X2 and X3 all independent?
p_{X1,X2,X3}(1, 1, 1) = P(X1 = 1, X2 = 1) = 1/4 ≠ p_{X1}(1) p_{X2}(1) p_{X3}(1) = 1/8. (3.129)
They are not, which makes sense since X3 is a function of X1 and X2 . 4
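This example is easy to verify exhaustively (an illustrative sketch, not part of the original text): the code below enumerates the four equally likely outcomes of (X1, X2), derives X3, and checks pairwise factorization as well as the joint probability in (3.129).

```python
from itertools import product
from fractions import Fraction

# Enumerate the four equally likely outcomes of (X1, X2) and derive X3.
outcomes = [(x1, x2, int(x1 == x2)) for x1, x2 in product((0, 1), repeat=2)]
p = Fraction(1, 4)  # probability of each (X1, X2) outcome

def prob(event):
    """Probability that an event (a predicate on (x1, x2, x3)) occurs."""
    return sum((p for o in outcomes if event(o)), Fraction(0))

# Pairwise independence of X1 and X3 holds:
for a, b in product((0, 1), repeat=2):
    joint = prob(lambda o: o[0] == a and o[2] == b)
    assert joint == prob(lambda o: o[0] == a) * prob(lambda o: o[2] == b)

# But joint independence fails, as in (3.129):
lhs = prob(lambda o: o == (1, 1, 1))
rhs = prob(lambda o: o[0] == 1) * prob(lambda o: o[1] == 1) * prob(lambda o: o[2] == 1)
print(lhs, rhs)  # 1/4 vs 1/8
```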
Conditional independence indicates that two random variables do not depend on each other,
as long as an additional random variable is known.
Definition 3.4.4 (Conditionally independent random variables). Two random variables X and Y are conditionally independent given another random variable Z if and only if
F_{X,Y|Z}(x, y|z) = F_{X|Z}(x|z) F_{Y|Z}(y|z)
for any x, y and any z for which the conditional cdfs are well defined. If the variables are discrete, the following condition is equivalent:
p_{X,Y|Z}(x, y|z) = p_{X|Z}(x|z) p_{Y|Z}(y|z)
for any x, y and any z for which the conditional pmfs are well defined. If the variables are continuous and have joint and marginal pdfs, the following condition is equivalent:
f_{X,Y|Z}(x, y|z) = f_{X|Z}(x|z) f_{Y|Z}(y|z)
for any x, y and any z for which the conditional pdfs are well defined.
which is equivalent to
p_{~X_I | ~X_J}(~x_I | ~x_J) = ∏_{i∈I} p_{Xi | ~X_J}(~xi | ~x_J). (3.134)
As established in Examples 1.3.5 and 1.3.6, independence does not imply conditional indepen-
dence or vice versa.
Figure 3.5: Directed acyclic graph with nodes X1, X2, X3, X4 and X5 (edges: X4 → X2, X2 → X3, X3 → X5 and X4 → X5).
For large probabilistic models it is crucial to find factorizations that reduce the number of parameters as much as possible. Directed graphical models, which represent the random variables as nodes of a directed acyclic graph (DAG), encode such factorizations. The joint distribution factors into:
• The marginal pmf or pdf of the variables corresponding to all nodes with no incoming edges.
• The conditional pmf or pdf of the remaining random variables given their parents. A is a parent of B if there is a directed edge from (the node assigned to) A to (the node assigned to) B.
To be concrete, consider the DAG in Figure 3.5. For simplicity we denote each node using the
corresponding random variable and assume that they are all discrete. Nodes X1 and X4 have
no parents, so the factorization of the joint pmf includes their marginal pmfs. Node X2 only
descends from X4 so we include pX2 | X4 . Node X3 descends from X2 so we include pX3 | X2 .
Finally, node X5 descends from X3 and X4 so we include pX5 | X3 ,X4 . The factorization is of the
form
p_{X1,X2,X3,X4,X5} = p_{X1} p_{X4} p_{X2|X4} p_{X3|X2} p_{X5|X3,X4}. (3.138)
This factorization reveals some dependence assumptions. By the chain rule another valid fac-
torization of the joint pmf is
p_{X1,X2,X3,X4,X5} = p_{X1} p_{X4|X1} p_{X2|X1,X4} p_{X3|X1,X2,X4} p_{X5|X1,X2,X3,X4}. (3.139)
Figure 3.6: Directed graphical models corresponding to the variables in Examples 1.3.5 and 1.3.6.
Comparing both expressions, we see that X1 is independent of all the other variables, since p_{X4|X1} = p_{X4}, p_{X2|X1,X4} = p_{X2|X4} and so on. In addition, X3 is conditionally independent of X4 given X2, since p_{X3|X2,X4} = p_{X3|X2}. These dependence assumptions can be read directly
from the graph, using the following property.
Theorem 3.4.6 (Local Markov property). The factorization of the joint pmf or pdf represented
by a DAG satisfies the local Markov property: each variable is conditionally independent of its
non-descendants given all its parent variables. In particular, if it has no parents, it is independent
of its non-descendants. To be clear, B is a non-descendant of A if there is no directed path from
A to B.
Proof. Fix a variable Xi, and let X_N denote its non-descendants (excluding its parents), X_P its parents, and X_D its descendants. The factorization encoded by the DAG can be written as
p_{X1,...,Xn} = p_{X_N} p_{X_P|X_N} p_{Xi|X_P} p_{X_D|Xi}. (3.140)
By the chain rule, another valid factorization is
p_{X1,...,Xn} = p_{X_N} p_{X_P|X_N} p_{Xi|X_P,X_N} p_{X_D|Xi,X_P,X_N}. (3.141)
Comparing both expressions we conclude that p_{Xi|X_P,X_N} = p_{Xi|X_P}, so Xi is conditionally independent of X_N given X_P.
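To see how such a factorization is used in practice (an illustrative sketch with made-up conditional probabilities, not from the original text), the code below performs ancestral sampling for the DAG of Figure 3.5: parentless variables are sampled from their marginals, and every other variable is sampled from its conditional pmf given its parents, following (3.138).

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli(p):
    return int(rng.random() < p)

def sample_dag():
    """Ancestral sampling for the factorization (3.138).

    All variables are binary; the probabilities below are hypothetical,
    chosen only to illustrate the mechanics.
    """
    x1 = bernoulli(0.3)                    # no parents: marginal pmf
    x4 = bernoulli(0.6)                    # no parents: marginal pmf
    x2 = bernoulli(0.8 if x4 else 0.1)     # p_{X2|X4}
    x3 = bernoulli(0.5 if x2 else 0.2)     # p_{X3|X2}
    x5 = bernoulli([0.1, 0.4, 0.6, 0.9][2 * x3 + x4])  # p_{X5|X3,X4}
    return x1, x2, x3, x4, x5

samples = [sample_dag() for _ in range(10**5)]
print(np.mean([s[0] for s in samples]))  # ~0.3, the marginal of X1
```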
We illustrate these ideas by showing the DAGs for Examples 1.3.5 and 1.3.6.
Example 3.4.7 (Graphical model for Example 1.3.5). We model the different events in Ex-
ample 1.3.5 using indicator random variables. T represents whether a taxi is available (T = 1)
or not (T = 0), L whether the plane is late (L = 1) or not (L = 0), and R whether it rains
(R = 1) or not (R = 0). In the example, T and L are conditionally independent given R. We
can represent the corresponding factorization using the graph on the left of Figure 3.6.
4
Example 3.4.8 (Graphical model for Example 1.3.6). We model the different events in Exam-
ple 1.3.6 using indicator random variables. M represents whether a mechanical problem occurs
Figure 3.7: Map of the country in Example 3.4.9, divided into four states: S1 (5 electors), S2 (3 electors), S3 (7 electors) and S4 (3 electors).
(M = 1) or not (M = 0) and L and R are the same as in Example 3.4.7. In the example, M
and R are independent, but L depends on both of them. We can represent the corresponding
factorization using the graph on the right of Figure 3.6.
4
The following example introduces an important class of graphical models called Markov chains, which we will discuss at length in Chapter 7.
Example 3.4.9 (Election). In the country shown in Figure 3.7 the presidential election follows
the same system as in the United States. Citizens cast ballots for electors in the Electoral
College. Each state is entitled to a number of electors (in the US this is usually the same as the
members of Congress). In every state, the electors are pledged to the candidate that wins the
state. Our goal is to model the election probabilistically. We assume that there are only two
candidates A and B. Each state is represented by a random variable Si , 1 ≤ i ≤ 4,
Si = 1 if candidate A wins state i, and −1 if candidate B wins state i. (3.142)
An important decision to make is what independence assumptions to build into the model.
Figure 3.8 shows three different options. If we model each state as independent, then we only
need to estimate a single parameter for each state. However, the model may not be accurate,
as the outcome in states with similar demographics is bound to be related. Another option is
to estimate the full joint pmf. The problem is that it may be quite challenging to compute
the parameters. We can estimate the marginal pmfs of the individual states using poll data,
but conditional probabilities are more difficult to estimate. In addition, for larger models it is
not tractable to consider fully dependent models (for instance in the case of the US election,
as mentioned previously). A reasonable compromise could be to model the states that are
not adjacent as conditionally independent given the states between them. For example, we
assume that the outcome of states 1 and 3 are only related through state 2. The corresponding
graphical model, depicted on the right of Figure 3.8, is called a Markov chain. It corresponds to the factorization
p_{S1,S2,S3,S4} = p_{S1} p_{S2|S1} p_{S3|S2} p_{S4|S3}. (3.143)
Figure 3.8: Graphical models capturing different assumptions about the distribution of the random
variables considered in Example 3.4.9.
Under this model we only need to worry about estimating pairwise conditional probabilities, as
opposed to the full joint pmf. We discuss Markov chains at length in Chapter 7.
4
Example 3.4.10 (Desert). Dani and Felix are traveling through the desert in Arizona. They
become concerned that their car might break down and decide to build a probabilistic model
to evaluate the risk. They model the time until the car breaks down as an exponential random
variable T with a parameter that depends on the state of the motor M and the state of the road
R. These three quantities are represented by random variables in the same probability space.
Unfortunately they have no idea what the state of the motor is so they assume that it is uniform
between 0 (no problem with the motor) and 1 (the motor is almost dead). Similarly, they have
no information about the road, so they also assume that its state is a uniform random variable
between 0 (no problem with the road) and 1 (the road is terrible). In addition, they assume that
the states of the road and the car are independent and that the parameter of the exponential
random variable that represents the time in hours until there is a breakdown is equal to M + R.
The corresponding graphical model is shown in Figure 3.9.
To find the joint distribution of the random variables, we apply the chain rule to obtain
f_{M,R,T}(m, r, t) = f_M(m) f_R(r) f_{T|M,R}(t|m, r) = (m + r) e^{−(m+r)t}, for t ≥ 0, 0 ≤ m ≤ 1, 0 ≤ r ≤ 1.
Note that we start with M and R because we know their marginal distribution, whereas we only know the conditional distribution of T given M and R.
After 15 minutes, the car breaks down. The road seems OK, about a 0.2 in the scale they
defined for the value of R, so they naturally wonder about the state of the motor. Given their
Figure 3.9: The left image is a graphical model representing the random variables in Example 3.4.10.
The right plot shows the conditional pdf of M given T = 0.25 and R = 0.2.
probabilistic model, their uncertainty about the motor given all of this information is captured
by the conditional distribution of M given T and R.
To compute the conditional pdf, we first need to compute the joint marginal distribution of T
and R by marginalizing over M . In order to simplify the computations, we use the following
simple lemma.
Lemma. For any t > 0,
∫_{m=0}^{1} e^{−tm} dm = (1 − e^{−t}) / t, (3.147)
∫_{m=0}^{1} m e^{−tm} dm = (1 − (1 + t) e^{−t}) / t². (3.148)
Proof. Equation (3.147) is obtained using the antiderivative of the exponential function (itself), whereas integrating by parts yields (3.148). 4
We have
f_{R,T}(r, t) = ∫_{m=0}^{1} f_{M,R,T}(m, r, t) dm (3.149)
= e^{−tr} ( ∫_{m=0}^{1} m e^{−tm} dm + r ∫_{m=0}^{1} e^{−tm} dm ) (3.150)
= e^{−tr} ( (1 − (1 + t) e^{−t}) / t² + r (1 − e^{−t}) / t ) by (3.147) and (3.148) (3.151)
= (e^{−tr} / t²) ( 1 + tr − e^{−t} (1 + t + tr) ), (3.152)
for t ≥ 0, 0 ≤ r ≤ 1.
The conditional pdf of M given T and R is consequently
f_{M|R,T}(m|r, t) = f_{M,R,T}(m, r, t) / f_{R,T}(r, t) (3.153)
= (m + r) e^{−(m+r)t} / ( (e^{−tr} / t²) (1 + tr − e^{−t} (1 + t + tr)) ) (3.154)
= (m + r) t² e^{−tm} / ( 1 + tr − e^{−t} (1 + t + tr) ) (3.155)
for 0 ≤ m ≤ 1, and equal to zero otherwise. The pdf is plotted in Figure 3.9. According to the model, it seems quite likely that the state of the motor was not good. 4
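As a numerical check (an illustrative sketch, not from the original text), the code below evaluates (3.155) at t = 0.25 and r = 0.2, verifies that it integrates to one over [0, 1], and locates its mode, which falls at m = 1 (the worst possible motor state), consistent with the plot in Figure 3.9.

```python
import numpy as np

def cond_pdf_motor(m, r=0.2, t=0.25):
    """Conditional pdf f_{M|R,T}(m|r,t) from (3.155), for 0 <= m <= 1."""
    norm = 1 + t * r - np.exp(-t) * (1 + t + t * r)
    return (m + r) * t**2 * np.exp(-t * m) / norm

m = np.linspace(0.0, 1.0, 10**5)
pdf = cond_pdf_motor(m)
print(np.trapz(pdf, m))   # ~1.0: a valid pdf
print(m[np.argmax(pdf)])  # 1.0: a bad motor state is most plausible
```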
3.5 Functions of several random variables
The pmf of a discrete random variable Y := g(X1, . . . , Xn), defined as a function g of several discrete random variables X1, . . . , Xn with joint pmf p_{X1,...,Xn}, is given by
p_Y(y) = Σ_{{(x1,...,xn) : g(x1,...,xn) = y}} p_{X1,...,Xn}(x1, . . . , xn). (3.158)
This follows directly from (3.11). In words, the probability that g(X1, . . . , Xn) = y is the sum of the joint pmf over all possible values such that y = g(x1, . . . , xn).
Example 3.5.1 (Election). In Example 3.4.9 we discussed several possible models for a presi-
dential election for a country with four states. Imagine that you are trying to predict the result
of the election using poll data from individual states. The goal is to predict the outcome of the
election, represented by the random variable
O := 1 if Σ_{i=1}^{4} ni Si > 0, and 0 otherwise, (3.159)
where ni denotes the number of electors in state i (notice that the sum can never be zero).
From analyzing the poll data you conclude that the probability that candidate A wins each of the
states is 0.15. If you assume that all the states are independent, this is enough to characterize
the joint pmf. Table 3.2 lists the probability of all possible outcomes for this model. By (3.158)
we only need to add up the outcomes for which O = 1. Under the full-independence assumption,
the probability that candidate A wins is 6%.
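The 6% figure is easy to reproduce by enumerating the 2⁴ outcomes (an illustrative sketch, not from the original text; Table 3.2 itself is not reproduced here).

```python
from itertools import product

electors = [5, 3, 7, 3]
p_win = 0.15  # probability that candidate A wins each state, independently

prob_A_wins = 0.0
for s in product((1, -1), repeat=4):       # s_i = 1 if A wins state i
    prob = 1.0
    for si in s:
        prob *= p_win if si == 1 else 1 - p_win
    if sum(n * si for n, si in zip(electors, s)) > 0:
        prob_A_wins += prob

print(prob_A_wins)  # ~0.0608, i.e. about 6%
```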
You are not satisfied by the result because you suspect that the outcomes in different states
are highly dependent. From past elections, you determine that the conditional probability of a
candidate winning a state if they win an adjacent state is indeed very high. You incorporate your estimates of the conditional probabilities into a Markov-chain model described by (3.143).
This means that if candidate B wins a state, they are very likely to win the adjacent one. If
candidate A wins a state, their chance to win an adjacent state is significantly higher than if
they don’t (but still lower than candidate B). Under this model the marginal probability that
candidate A wins each state is still 0.15. Table 3.2 lists the probability of all possible outcomes.
The probability that candidate A wins is now 11%, almost double the probability obtained under the fully independent model. This illustrates the danger of not accounting for dependencies between states, which may have been one of the reasons why many forecasts severely underestimated Donald Trump's chances in the 2016 election.
4
Section 2.5 explains how to derive the distribution of functions of univariate random variables by first computing their cdf and then differentiating it to obtain their pdf. This extends directly to functions of several random variables. Let X, Y be random variables defined on the same probability space, and let U = g(X, Y) and V = h(X, Y) for two arbitrary functions g, h : R² → R. Then,
F_{U,V}(u, v) = P(U ≤ u, V ≤ v) = ∫_{{(x,y) : g(x,y) ≤ u, h(x,y) ≤ v}} f_{X,Y}(x, y) dx dy,
where the last equality only holds if the joint pdf of X and Y exists. The joint pdf can then be obtained by differentiation.
Theorem 3.5.2 (Pdf of the sum of two independent random variables). The pdf of Z = X + Y, where X and Y are independent random variables, is equal to the convolution of their respective pdfs f_X and f_Y:
f_Z(z) = ∫_{u=−∞}^{∞} f_X(z − u) f_Y(u) du. (3.166)
Proof.
F_Z(z) = P(X + Y ≤ z) (3.167)
= ∫_{y=−∞}^{∞} ∫_{x=−∞}^{z−y} f_X(x) f_Y(y) dx dy (3.168)
= ∫_{y=−∞}^{∞} F_X(z − y) f_Y(y) dy. (3.169)
Note that the joint pdf of X and Y is the product of the marginal pdfs because the random
variables are independent. We now differentiate the cdf to obtain the pdf. Note that this requires
an interchange of a limit operator with a differentiation operator and another interchange of
an integral operator with a differentiation operator, which are justified because the functions
involved are bounded and integrable.
f_Z(z) = d/dz lim_{u→∞} ∫_{y=−u}^{u} F_X(z − y) f_Y(y) dy (3.170)
= lim_{u→∞} d/dz ∫_{y=−u}^{u} F_X(z − y) f_Y(y) dy (3.171)
= lim_{u→∞} ∫_{y=−u}^{u} d/dz F_X(z − y) f_Y(y) dy (3.172)
= lim_{u→∞} ∫_{y=−u}^{u} f_X(z − y) f_Y(y) dy. (3.173)
Example 3.5.3 (Coffee beans). A company that makes coffee buys beans from two small local producers in Colombia and Vietnam. The amount of beans they can buy from each producer varies depending on the weather. The company models these quantities C and V as independent random variables (assuming that the weather in Colombia is independent of the weather in Vietnam) with uniform distributions in [0, 1] and [0, 2] (the unit is tons), respectively.
Figure 3.10: Pdfs of C and V (left) and of their sum B = C + V (right) in Example 3.5.3.
We now compute the pdf of the total amount of coffee beans B := C + V applying Theorem 3.5.2:
f_B(b) = ∫_{u=−∞}^{∞} f_C(b − u) f_V(u) du (3.174)
= (1/2) ∫_{u=0}^{2} f_C(b − u) du (3.175)
= (1/2) ∫_{u=0}^{b} du = b/2 if 0 ≤ b ≤ 1,
(1/2) ∫_{u=b−1}^{b} du = 1/2 if 1 ≤ b ≤ 2,
(1/2) ∫_{u=b−1}^{2} du = (3 − b)/2 if 2 ≤ b ≤ 3. (3.176)
The pdf of B is shown in Figure 3.10.
4
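A quick simulation (an illustrative sketch, not from the original text) confirms the trapezoidal shape of (3.176) by summing independent uniform samples and comparing a histogram against the piecewise formula.

```python
import numpy as np

rng = np.random.default_rng(0)

c = rng.uniform(0.0, 1.0, size=10**6)  # beans from Colombia, uniform on [0, 1]
v = rng.uniform(0.0, 2.0, size=10**6)  # beans from Vietnam, uniform on [0, 2]
b = c + v

def pdf_b(x):
    """Piecewise pdf of B = C + V from (3.176)."""
    return np.where(x <= 1, x / 2, np.where(x <= 2, 0.5, (3 - x) / 2))

hist, edges = np.histogram(b, bins=60, range=(0, 3), density=True)
centers = (edges[:-1] + edges[1:]) / 2
print(np.max(np.abs(hist - pdf_b(centers))))  # small: histogram matches (3.176)
```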
3.6 Generating multivariate random variables
The chain rule suggests a sequential procedure for simulating a random vector (X1, . . . , Xn): sample the first entry from its marginal distribution and each subsequent entry from its distribution conditioned on the previously obtained samples.
1. Obtain a sample x1 of X1.
2. For i = 2, 3, . . . , n, obtain a sample xi of Xi given the event {X1 = x1, . . . , Xi−1 = xi−1} by sampling from F_{Xi|X1,...,Xi−1}(·|x1, . . . , xi−1).
The chain rule implies that the output (x1, . . . , xn) of this procedure is a sample from the joint distribution of the random variables. The following example considers the problem of sampling from a mixture of exponential random variables.
Example 3.6.2 (Mixture of exponentials). Let B be a Bernoulli random variable with parameter p and X an exponential random variable with parameter 1 if B = 0 and 2 if B = 1. Assume that we have access to two independent samples u1 and u2 from a uniform distribution in [0, 1]. To obtain samples from B and X:
1. If u1 ≤ p we set b := 1 and λ := 2; otherwise we set b := 0 and λ := 1.
2. Then, we set
x := (1/λ) log( 1 / (1 − u2) ), (3.177)
which is a sample from an exponential distribution with parameter λ, as shown in Example 2.6.4.
Suppose now that we want to generate samples from a random variable Y with pmf p_Y, using samples from another random variable X with pmf p_X, both with range {1, 2, . . . , n}. The idea is to accept each sample x of X with some probability a_x and to discard it otherwise. By Bayes' rule, the pmf of the accepted samples is
p_{X | Accepted}(x | Accepted) = p_X(x) P(Accepted | X = x) / Σ_{i=1}^{n} p_X(i) P(Accepted | X = i) (3.178)
= p_X(x) a_x / Σ_{i=1}^{n} p_X(i) a_i. (3.179)
We would like to fix the accept probabilities so that for all x ∈ {1, 2, . . . , n}
p_{X | Accepted}(x | Accepted) = p_Y(x),
which is achieved by setting a_x := p_Y(x) / (c p_X(x)) for a constant c as in (3.183) below. Finally, we can use a uniform random variable U between 0 and 1 to accept or reject, accepting each sample x if U ≤ a_x. You might be wondering why we can't just generate Y directly from U. That would indeed work and is much simpler; here we are just presenting the discrete case as a pedagogical introduction to the continuous case.
Algorithm 3.7.1 (Rejection sampling). Let X and Y be random variables with pmfs p_X and p_Y such that p_X(x) > 0 for all x such that p_Y(x) is nonzero, let c be a constant satisfying
c ≥ max_{x∈{1,...,n}} p_Y(x) / p_X(x), (3.183)
and let U be a random variable that is uniformly distributed in [0, 1] and independent of X.
1. Obtain a sample y of X.
2. Obtain a sample u of U.
3. Declare y to be a sample of Y if
u ≤ p_Y(y) / (c p_X(y)). (3.184)
The same idea works for continuous random variables. Suppose that we want to sample from a random variable Y with pdf f_Y using samples from a random variable X with pdf f_X, and that
f_Y(y) ≤ c f_X(y) (3.185)
for all y, where c is a fixed positive constant. In words, the pdf of Y must be bounded by a scaled version of the pdf of X.
Algorithm 3.7.2 (Rejection sampling). Let X be a random variable with pdf fX and U a ran-
dom variable that is uniformly distributed in [0, 1] and independent of X. We assume that (3.185)
holds.
1. Obtain a sample y of X.
2. Obtain a sample u of U .
3. Declare y to be a sample of Y if
u ≤ f_Y(y) / (c f_X(y)). (3.186)
The following theorem establishes that the samples obtained by rejection sampling have the
desired distribution.
Theorem 3.7.3 (Rejection sampling works). If assumption (3.185) holds, then the samples
produced by rejection sampling are distributed according to fY .
Proof. Let Z denote the random variable produced by rejection sampling. The cdf of Z is equal to
F_Z(y) = P( X ≤ y | U ≤ f_Y(X) / (c f_X(X)) ) (3.187)
= P( X ≤ y, U ≤ f_Y(X) / (c f_X(X)) ) / P( U ≤ f_Y(X) / (c f_X(X)) ). (3.188)
To compute the numerator we integrate the joint pdf of U and X over the region of interest:
P( X ≤ y, U ≤ f_Y(X) / (c f_X(X)) ) = ∫_{x=−∞}^{y} ∫_{u=0}^{f_Y(x)/(c f_X(x))} f_X(x) du dx (3.189)
= ∫_{x=−∞}^{y} f_X(x) f_Y(x) / (c f_X(x)) dx (3.190)
= (1/c) ∫_{x=−∞}^{y} f_Y(x) dx (3.191)
= F_Y(y) / c. (3.192)
The denominator is obtained in a similar way:
P( U ≤ f_Y(X) / (c f_X(X)) ) = ∫_{x=−∞}^{∞} ∫_{u=0}^{f_Y(x)/(c f_X(x))} f_X(x) du dx (3.193)
= ∫_{x=−∞}^{∞} f_X(x) f_Y(x) / (c f_X(x)) dx (3.194)
= (1/c) ∫_{x=−∞}^{∞} f_Y(x) dx (3.195)
= 1/c. (3.196)
We conclude that F_Z(y) = F_Y(y) for all y, so the samples are indeed distributed according to f_Y.
We now illustrate the method by applying it to produce a Gaussian random variable from an
exponential and a uniform random variable.
Example 3.7.4 (Generating a Gaussian random variable). In Example 2.6.4 we learned how to generate an exponential random variable using samples from a uniform distribution. In this example we will use samples from an exponential distribution to generate a standard Gaussian random variable applying rejection sampling.
The following lemma shows that we can generate a standard Gaussian random variable Y by:
1. Generating a random variable H with pdf
f_H(x) = (2/√(2π)) exp(−x²/2) for x ≥ 0, and 0 otherwise, (3.198)
which is the pdf of the absolute value of a standard Gaussian.
2. Generating a random variable S which is equal to 1 or −1 with probability 1/2, for example by applying the method described in Section 2.6.1.
3. Setting Y := SH.
Lemma 3.7.5. Let H be a continuous random variable with pdf given by (3.198) and S a discrete random variable which equals 1 with probability 1/2 and −1 with probability 1/2. The random variable Y := SH is a standard Gaussian.
The reason why we reduce the problem to generating H is that its pdf is only nonzero on the positive axis, which allows us to bound it with the pdf of an exponential random variable X with parameter 1. If we set c := √(2e/π) then f_H(x) ≤ c f_X(x) for all x, as illustrated in Figure 3.11. Indeed,
f_H(x) / f_X(x) = (2/√(2π)) exp(−x²/2) / exp(−x) (3.203)
= √(2e/π) exp( −(x − 1)²/2 ) (3.204)
≤ √(2e/π). (3.205)
Figure 3.11: Bound on the pdf of the target distribution in Example 3.7.4.
Following Algorithm 3.7.2, to generate a sample from H we:
1. Obtain a sample x from an exponential random variable X with parameter one.
2. Obtain a sample u from U, which is uniformly distributed in [0, 1].
3. Accept x as a sample of H if
u ≤ exp( −(x − 1)²/2 ), (3.206)
which by (3.204) equals f_H(x) / (c f_X(x)).
This procedure is illustrated in Figure 3.12. The rejection mechanism ensures that the accepted
samples have the right distribution. 4
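Putting the pieces together in code (an illustrative sketch, not from the original text): generate exponential proposals by inverse transform, accept according to (3.206), and attach a random sign.

```python
import numpy as np

rng = np.random.default_rng(0)

def standard_gaussian_samples(n):
    """Generate n standard Gaussian samples via rejection sampling (Example 3.7.4)."""
    out = []
    while len(out) < n:
        x = np.log(1.0 / (1.0 - rng.random()))    # exponential(1) by inverse transform
        u = rng.random()
        if u <= np.exp(-(x - 1.0) ** 2 / 2.0):    # accept x as a sample of H
            s = 1.0 if rng.random() < 0.5 else -1.0  # random sign S
            out.append(s * x)
    return np.array(out)

y = standard_gaussian_samples(10**5)
print(y.mean(), y.var())  # ~0 and ~1, as expected for a standard Gaussian
```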
(Figure panels: histogram of 50,000 iid samples from X, with f_X shown in black; histogram of the accepted samples, with f_H shown in black.)
Figure 3.12: Illustration of how to generate 50,000 samples from the random variable H defined in
Example 3.7.4 via rejection sampling.
Chapter 4
Expectation
In this chapter we introduce some quantities that describe the behavior of random variables very succinctly. The mean is the value around which the distribution of a random variable is centered. The variance quantifies the extent to which a random variable fluctuates around the mean. The covariance of two random variables indicates whether they tend to deviate from their means in a similar way. In multiple dimensions, the covariance matrix of a random vector encodes its variance in every possible direction. These quantities do not completely characterize the distribution of a random variable or vector, but they provide a useful summary of their behavior with just a few numbers.
4.1 Expectation operator
Definition 4.1.1 (Expectation for discrete random variables). Let X be a discrete random variable with range R. The expected value of a function g(X), g : R → R, of X is
E(g(X)) := Σ_{x∈R} g(x) p_X(x). (4.1)
Similarly, if X, Y are both discrete random variables with ranges R_X and R_Y, then the expected value of a function g(X, Y), g : R² → R, of X and Y is
E(g(X, Y)) := Σ_{x∈R_X} Σ_{y∈R_Y} g(x, y) p_{X,Y}(x, y). (4.2)
Definition 4.1.2 (Expectation for continuous random variables). Let X be a continuous random variable. The expected value of a function g(X), g : R → R, of X is
E(g(X)) := ∫_{x=−∞}^{∞} g(x) f_X(x) dx. (4.4)
Similarly, if X, Y are both continuous random variables then the expected value of a function g(X, Y), g : R² → R, of X and Y is
E(g(X, Y)) := ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} g(x, y) f_{X,Y}(x, y) dx dy. (4.5)
In the case of quantities that depend on both continuous and discrete random variables, the
product of the marginal and conditional distributions plays the role of the joint pdf or pmf.
Definition 4.1.3 (Expectation with respect to continuous and discrete random variables). If C is a continuous random variable and D a discrete random variable with range R_D defined on the same probability space, the expected value of a function g(C, D) of C and D is
E(g(C, D)) := ∫_{c=−∞}^{∞} Σ_{d∈R_D} g(c, d) f_C(c) p_{D|C}(d|c) dc (4.7)
= Σ_{d∈R_D} ∫_{c=−∞}^{∞} g(c, d) p_D(d) f_{C|D}(c|d) dc. (4.8)
The expected value of a certain quantity may be infinite or not even exist if the correspond-
ing sum or integral tends towards infinity or has an undefined value. This is illustrated by
Examples 4.1.4 and 4.2.2 below.
Example 4.1.4 (St Petersburg paradox). A casino offers you the following game. You will flip
an unbiased coin until it lands on heads and the casino will pay you 2k dollars where k is the
number of flips. How much are you willing to pay in order to play?
Let us compute the expected gain. If the flips are independent, the total number of flips X is a geometric random variable, so p_X(k) = 1/2^k. The gain is 2^X, which means that
E(Gain) = Σ_{k=1}^{∞} 2^k · (1/2^k) = Σ_{k=1}^{∞} 1 = ∞. (4.9)
The expected gain is infinite, but since you only get to play once, the amount of money that
you are willing to pay is probably bounded. This is known as the St Petersburg paradox.
4
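A simulation makes the paradox tangible (an illustrative sketch, not from the original text): the running average of simulated payoffs keeps drifting upward instead of settling, reflecting the infinite expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

def play_once():
    """Flip until heads; the payoff is 2^(number of flips)."""
    k = 1
    while rng.random() < 0.5:  # tails with probability 1/2, keep flipping
        k += 1
    return 2.0**k

gains = np.array([play_once() for _ in range(10**6)])
for n in (10**2, 10**4, 10**6):
    print(n, gains[:n].mean())  # the running average keeps growing with n
```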
Theorem 4.1.5 (Linearity of expectation). For any random variables X and Y, any functions g1, g2 : R² → R, and any constants a, b ∈ R,
E( a g1(X, Y) + b g2(X, Y) ) = a E(g1(X, Y)) + b E(g2(X, Y)).
Proof. The theorem follows immediately from the linearity of sums and integrals.
Linearity of expectation makes it very easy to compute the expectation of linear functions of
random variables. In contrast, computing the joint pdf or pmf is usually much more complicated.
Example 4.1.6 (Coffee beans (continued from Example 3.5.3)). Let us compute the expected
total amount of beans that can be bought. C is uniform in [0, 1], so E (C) = 1/2. V is uniform
in [0, 2], so E (V ) = 1. By linearity of expectation
E (C + V ) = E (C) + E (V ) (4.12)
= 1.5 tons. (4.13)
Note that this holds even if the two quantities are not independent.
4
If two random variables are independent, then the expectation of the product factors into a product of expectations.
Theorem (Expectation of functions of independent random variables). If X and Y are independent random variables defined on the same probability space, then for any functions g, h : R → R,
E( g(X) h(Y) ) = E(g(X)) E(h(Y)). (4.14)
Proof. We prove the result for continuous random variables, but the proof for discrete random variables is essentially the same.
E( g(X) h(Y) ) = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} g(x) h(y) f_{X,Y}(x, y) dx dy (4.15)
= ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} g(x) h(y) f_X(x) f_Y(y) dx dy by independence (4.16)
= E(g(X)) E(h(Y)). (4.17)
Figure 4.1: Pdf of a Cauchy random variable.
4.2 Mean and variance
4.2.1 Mean
Definition 4.2.1 (Mean). The mean or first moment of X is the expected value of X: E (X).
Table 4.1 lists the means of some important random variables. The derivations can be found in
Section 4.5.1. As illustrated by Figure 4.3, the mean is the center of mass of the pmf or the pdf
of the corresponding random variable.
If the distribution of a random variable is very heavy tailed, which means that the probability of
the random variable taking large values decays slowly, its mean may be infinite. This is the case
of the random variable representing the gain in Example 4.1.4. The following example shows
that the mean may not exist if the value of the corresponding sum or integral is not well defined.
Example 4.2.2 (Cauchy random variable). The pdf of the Cauchy random variable, which is shown in Figure 4.1, is given by
f_X(x) = 1 / ( π (1 + x²) ). (4.18)
The mean of X does not exist: the positive and negative parts of the defining integral both diverge, since ∫_{x=0}^{∞} x / (π (1 + x²)) dx = ∞, so ∫ x f_X(x) dx has no well-defined value.
The mean of a random vector is defined as the vector formed by the means of its components.
Definition 4.2.3 (Mean of a random vector). The mean of a random vector ~X is
E(~X) := [ E(~X1), E(~X2), . . . , E(~Xn) ]^T. (4.21)
As in the univariate case, the mean can be interpreted as the value around which the distribution
of the random vector is centered.
It follows immediately from the linearity of the expectation operator in one dimension that the
mean operator is linear.
Theorem 4.2.4 (Mean of linear transformation of a random vector). For any random vector ~X of dimension n, any matrix A ∈ R^{m×n} and ~b ∈ R^m,
E( A~X + ~b ) = A E(~X) + ~b. (4.22)
Proof.
E( A~X + ~b ) = [ E( Σ_{i=1}^{n} A_{1i} ~Xi + b1 ), E( Σ_{i=1}^{n} A_{2i} ~Xi + b2 ), . . . , E( Σ_{i=1}^{n} A_{mi} ~Xi + bm ) ]^T (4.23)
= [ Σ_{i=1}^{n} A_{1i} E(~Xi) + b1, Σ_{i=1}^{n} A_{2i} E(~Xi) + b2, . . . , Σ_{i=1}^{n} A_{mi} E(~Xi) + bm ]^T by linearity of expectation (4.24)
= A E(~X) + ~b. (4.25)
4.2.2 Median
The mean is often interpreted as representing a typical value taken by the random variable.
However, the probability of a random variable being equal to its mean may be zero! For instance,
a Bernoulli random variable cannot equal 0.5. In addition, the mean can be severely distorted
by a small subset of extreme values, as illustrated by Example 4.2.6 below. The median is an
alternative characterization of a typical value taken by the random variable, which is designed
to be more robust to such situations. It is defined as the midpoint of the pmf or pdf of the
random variable. If the random variable is continuous, the probability that it is either larger or
smaller than the median is equal to 1/2.
Figure 4.2: Uniform pdf in [−4.5, 4.5] ∪ [99.5, 100.5]. The mean is 10 and the median is 0.5.
Definition 4.2.5 (Median). The median of a discrete random variable X is a number m such that
P(X ≤ m) ≥ 1/2 and P(X ≥ m) ≥ 1/2. (4.26)
The median of a continuous random variable X is a number m such that F_X(m) = 1/2.
The following example illustrates the robustness of the median to the presence of a small subset of extreme values with nonzero probability.
Example 4.2.6 (Mean vs median). Consider a uniform random variable X with support [−4.5, 4.5] ∪ [99.5, 100.5], so that f_X(x) = 1/10 on the support. The mean of X equals
E(X) = ∫_{x=−4.5}^{4.5} x f_X(x) dx + ∫_{x=99.5}^{100.5} x f_X(x) dx (4.28)
= (1/10) (100.5² − 99.5²)/2 (4.29)
= 10. (4.30)
In contrast, the median equals 0.5, since F_X(0.5) = P(−4.5 ≤ X ≤ 0.5) = 5/10 = 1/2. A tiny fraction of extreme values shifts the mean far away from the bulk of the distribution, while the median stays put, as shown in Figure 4.2.
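This is easy to reproduce numerically (an illustrative sketch, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform on [-4.5, 4.5] with probability 9/10 and on [99.5, 100.5] with probability 1/10.
n = 10**6
outlier = rng.random(n) < 0.1
x = np.where(outlier, rng.uniform(99.5, 100.5, n), rng.uniform(-4.5, 4.5, n))

print(np.mean(x))    # ~10: pulled up by the extreme interval
print(np.median(x))  # ~0.5: robust to the extreme values
```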
Random variable | Parameters | Mean | Variance
Bernoulli | p | p | p(1 − p)
Geometric | p | 1/p | (1 − p)/p²
Binomial | n, p | np | np(1 − p)
Poisson | λ | λ | λ
Uniform | a, b | (a + b)/2 | (b − a)²/12
Exponential | λ | 1/λ | 1/λ²
Gaussian | µ, σ | µ | σ²
Table 4.1: Means and variances of common random variables, derived in Section 4.5.1.
The mean square value or second moment of X is E(X²). The definition generalizes to higher moments, defined as E(X^p) for integers p larger than two. The mean square of the difference between the random variable and its mean is called the variance of the random variable. It quantifies the variation of the random variable around its mean and is also referred to as the second centered moment of the distribution. The square root of this quantity is the standard deviation of the random variable.
Definition 4.2.8 (Variance and standard deviation). The variance of X is the mean square deviation from the mean:
Var(X) := E( (X − E(X))² ) (4.33)
= E(X²) − E²(X). (4.34)
The standard deviation σ_X of X is
σ_X := √Var(X). (4.35)
We have compiled the variances of some important random variables in Table 4.1. The deriva-
tions can be found in Section 4.5.1. In Figure 4.3 we plot the pmfs and pdfs of these random
variables and display the range of values that fall within one standard deviation of the mean.
The variance operator is not linear, but it is straightforward to determine the variance of a linear
function of a random variable.
Figure 4.3: Pmfs of discrete random variables (top row) and pdfs of continuous random variables
(bottom row). The mean of the random variable is marked in red. Values that are within one standard
deviation of the mean are marked in pink.
Lemma (Variance of a linear function). For any constants a and b,
Var( a X + b ) = a² Var(X). (4.36)
Proof.
Var( a X + b ) = E( (a X + b − E(a X + b))² ) (4.37)
= E( (a X + b − a E(X) − b)² ) (4.38)
= a² E( (X − E(X))² ) (4.39)
= a² Var(X). (4.40)
This result makes sense: If we change the center of the random variable by adding a constant,
then the variance is not affected because the variance only measures the deviation from the
mean. If we multiply a random variable by a constant, the standard deviation is scaled by the
same factor.
Theorem 4.2.10 (Markov's inequality). Let X be a nonnegative random variable. For any positive constant a > 0,
P(X ≥ a) ≤ E(X) / a. (4.41)
Proof. Consider the indicator variable 1_{X≥a}. We have
X − a 1_{X≥a} ≥ 0. (4.42)
In particular its expectation is nonnegative (as it is the sum or integral of a nonnegative quantity over the positive real line). By linearity of expectation and the fact that 1_{X≥a} is a Bernoulli random variable with expectation P(X ≥ a), we have
0 ≤ E( X − a 1_{X≥a} ) = E(X) − a P(X ≥ a), (4.43)
which is equivalent to (4.41).
Example 4.2.11 (Age of students). You hear that the mean age of NYU students is 20 years,
but you know quite a few students that are older than 30. You decide to apply Markov’s
inequality to bound the fraction of students above 30 by modeling age as a nonnegative random
variable A.
P(A ≥ 30) ≤ E(A) / 30 = 2/3. (4.44)
At most two thirds of the students are over 30.
4
As illustrated by Example 4.2.11, Markov's inequality can be rather loose. The reason is that it barely uses any information about the distribution of the random variable.
Chebyshev’s inequality controls the deviation of the random variable from its mean. Intuitively,
if the variance (and hence the standard deviation) is small, then the probability that the random
variable is far from its mean must be low.
Theorem 4.2.12 (Chebyshev’s inequality). For any positive constant a > 0 and any random
variable X with bounded variance,
P( |X − E(X)| ≥ a ) ≤ Var(X) / a². (4.45)
Proof. Applying Markov’s inequality to the random variable Y = (X − E (X))2 yields the result.
An interesting corollary to Chebyshev’s inequality shows that if the variance of a random variable
is zero, then the random variable is a constant or, to be precise, the probability that it deviates
from its mean is zero.
Corollary 4.2.13. If Var(X) = 0 then P(X ≠ E(X)) = 0.
Example 4.2.14 (Age of students (continued)). You are not very satisfied with your bound on the number of students above 30. You find out that the standard deviation of student age is actually just 3 years. Applying Chebyshev's inequality, this implies that
P(A ≥ 30) ≤ P( |A − E(A)| ≥ 10 ) ≤ Var(A) / 10² = 9/100.
At most 9% of the students are over 30, a much sharper bound.
4
4.3 Covariance
4.3.1 Covariance of two random variables
The covariance of two random variables describes their joint behavior. It is the expected value
of the product between the difference of the random variables and their respective means. In-
tuitively, it measures to what extent the random variables fluctuate together.
Definition 4.3.1 (Covariance). The covariance of X and Y is
Cov(X, Y) := E( (X − E(X)) (Y − E(Y)) ) = E(XY) − E(X) E(Y).
Figure 4.4 shows samples from bivariate Gaussian distributions with different covariances. If
the covariance is zero, then the joint pdf has a spherical form. If the covariance is positive and
large, then the joint pdf becomes skewed so that the two variables tend to have similar values.
If the covariance is large and negative, then the two variables will tend to have similar values
with opposite sign.
The variance of the sum of two random variables can be expressed in terms of their individ-
ual variances and their covariance. As a result, their fluctuations reinforce each other if the
covariance is positive and cancel each other if it is negative.
Theorem 4.3.2 (Variance of the sum of two random variables).
Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).
Figure 4.4: Samples from 2D Gaussian vectors (X, Y ), where X and Y are standard Gaussian random
variables with zero mean and unit variance, for different values of the covariance between X and Y .
Proof.
Var(X + Y) = E( (X + Y − E(X + Y))² ) (4.52)
= E( (X − E(X))² ) + E( (Y − E(Y))² ) + 2 E( (X − E(X)) (Y − E(Y)) )
= Var(X) + Var(Y) + 2 Cov(X, Y). (4.53)
Two random variables whose covariance is zero are said to be uncorrelated. An immediate consequence of Theorem 4.3.2 is that if two random variables are uncorrelated, then the variance of their sum equals the sum of their variances.
The following lemma and example show that independence implies uncorrelation, but uncorrelation does not always imply independence.
Lemma 4.3.4 (Independence implies uncorrelation). If two random variables are independent,
then they are uncorrelated.
Example 4.3.5 (Uncorrelation does not imply independence). Let X and Y be two independent Bernoulli random variables with parameter 1/2. Consider the random variables
U = X + Y, (4.56)
V = X − Y. (4.57)
Note that
p_U(0) = P(X = 0, Y = 0) = 1/4, (4.58)
p_V(0) = P(X = 1, Y = 1) + P(X = 0, Y = 0) = 1/2, (4.59)
p_{U,V}(0, 0) = P(X = 0, Y = 0) = 1/4 ≠ p_U(0) p_V(0) = 1/8, (4.60)
so U and V are not independent. However, they are uncorrelated, as
Cov(U, V) = E(UV) − E(U) E(V)
= E( (X + Y)(X − Y) ) − E(X + Y) E(X − Y)
= E(X²) − E(Y²) − E²(X) + E²(Y) = 0.
The final equality holds because X and Y have the same distribution.
4
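The example can be verified numerically (an illustrative sketch, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.integers(0, 2, size=10**6)  # Bernoulli(1/2)
y = rng.integers(0, 2, size=10**6)
u, v = x + y, x - y

print(np.cov(u, v)[0, 1])                    # ~0: U and V are uncorrelated
p_joint = np.mean((u == 0) & (v == 0))       # ~1/4
p_prod = np.mean(u == 0) * np.mean(v == 0)   # ~1/8
print(p_joint, p_prod)                       # unequal: U and V are dependent
```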
Definition 4.3.6 (Pearson correlation coefficient). The Pearson correlation coefficient of two random variables X and Y is
ρ_{X,Y} := Cov(X, Y) / (σ_X σ_Y). (4.64)
The correlation coefficient between X and Y is equal to the covariance between X/σX and
Y /σY . Figure 4.5 compares samples of bivariate Gaussian random variables that have the same
correlation coefficient, but different covariance and vice versa.
Although it might not be immediately obvious, the magnitude of the correlation coefficient is
bounded by one because the covariance of two random variables cannot exceed the product
of their standard deviations. A useful interpretation of the correlation coefficient is that it
quantifies to what extent X and Y are linearly related. In fact, if it is equal to 1 or -1 then one
of the variables is a linear function of the other! All of this follows from the Cauchy-Schwarz
inequality. The proof is in Section 4.5.3.
Theorem 4.3.7 (Cauchy-Schwarz inequality). For any random variables X and Y defined on the same probability space,
|E(XY)| ≤ √( E(X²) E(Y²) ). (4.65)
Figure 4.5: Samples from 2D Gaussian vectors (X, Y), where X is a standard Gaussian random variable with zero mean and unit variance, for different values of the standard deviation σ_Y of Y (which also has zero mean) and of the covariance between X and Y.
Assume E(X²) ≠ 0. Then
E(XY) = √( E(X²) E(Y²) ) ⟺ Y = √( E(Y²)/E(X²) ) X, (4.66)
E(XY) = −√( E(X²) E(Y²) ) ⟺ Y = −√( E(Y²)/E(X²) ) X. (4.67)
Corollary 4.3.8. For any random variables X and Y,
Cov(X, Y) ≤ σ_X σ_Y. (4.68)
Equivalently, the Pearson correlation coefficient satisfies
|ρ_{X,Y}| ≤ 1, (4.69)
with equality if and only if there is a linear relationship between X and Y:
|ρ_{X,Y}| = 1 ⟺ Y = c X + d, (4.70)
where
c := σ_Y/σ_X if ρ_{X,Y} = 1, c := −σ_Y/σ_X if ρ_{X,Y} = −1, and d := E(Y) − c E(X). (4.71)
Proof. Let
U := X − E(X), (4.72)
V := Y − E(Y). (4.73)
From the definition of the variance and the correlation coefficient,
E(U²) = Var(X), (4.74)
E(V²) = Var(Y), (4.75)
ρ_{X,Y} = E(UV) / √( E(U²) E(V²) ). (4.76)
The result now follows from applying Theorem 4.3.7 to U and V.
Definition 4.3.9 (Covariance matrix). The covariance matrix of a random vector ~X is the matrix Σ_~X whose (i, j) entry equals Cov(~Xi, ~Xj), i.e.,
Σ_~X := E( ~X ~X^T ) − E(~X) E(~X)^T,
with the variances of the components on the diagonal. Note that if all the entries of a vector are uncorrelated, then its covariance matrix is diagonal.
From Theorem 4.2.4 we obtain a simple expression for the covariance matrix of the linear transformation of a random vector.
Theorem 4.3.10 (Covariance matrix after a linear transformation). Let ~X be a random vector of dimension n with covariance matrix Σ_~X. For any matrix A ∈ R^{m×n} and ~b ∈ R^m,
Σ_{A~X+~b} = A Σ_~X A^T. (4.79)
Proof.
Σ_{A~X+~b} = E( (A~X + ~b)(A~X + ~b)^T ) − E(A~X + ~b) E(A~X + ~b)^T (4.80)
= A E(~X ~X^T) A^T + ~b E(~X)^T A^T + A E(~X) ~b^T + ~b ~b^T − A E(~X) E(~X)^T A^T − A E(~X) ~b^T − ~b E(~X)^T A^T − ~b ~b^T (4.81)
= A ( E(~X ~X^T) − E(~X) E(~X)^T ) A^T (4.82)
= A Σ_~X A^T. (4.83)
An immediate corollary of this result is that we can easily read off the variance of the random vector in any direction from the covariance matrix. Mathematically, the variance of the random vector in the direction of a unit vector ~v is equal to the variance of its projection onto ~v:
Var( ~v^T ~X ) = ~v^T Σ_~X ~v. (4.84)
Since the covariance matrix is symmetric, it has an eigendecomposition with orthonormal eigenvectors ~u1, . . . , ~un and nonnegative eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λn:
Σ_~X = U Λ U^T (4.85)
= [ ~u1 ~u2 · · · ~un ] diag(λ1, λ2, . . . , λn) [ ~u1 ~u2 · · · ~un ]^T. (4.86)
Theorem 4.3.12. Let ~X be a random vector of dimension n with covariance matrix Σ_~X. The eigendecomposition of Σ_~X given by (4.86) satisfies
λ1 = max_{||~v||₂=1} Var( ~v^T ~X ), (4.87)
~u1 = arg max_{||~v||₂=1} Var( ~v^T ~X ), (4.88)
λk = max_{||~v||₂=1, ~v ⊥ ~u1,...,~uk−1} Var( ~v^T ~X ), (4.89)
~uk = arg max_{||~v||₂=1, ~v ⊥ ~u1,...,~uk−1} Var( ~v^T ~X ). (4.90)
In words, ~u1 is the direction of maximum variance. The eigenvector ~u2 corresponding to the second largest eigenvalue λ2 is the direction of maximum variation that is orthogonal to ~u1. In general, the eigenvector ~uk corresponding to the kth largest eigenvalue λk reveals the direction of maximum variation that is orthogonal to ~u1, ~u2, . . . , ~uk−1. Finally, ~un is the direction of minimum variance. Figure 4.6 illustrates this with an example, where n = 2. As we discuss in Chapter 8, principal component analysis, a popular method for unsupervised learning and dimensionality reduction, applies the same principle to determine the directions of variation of a data set.
To conclude the section, we describe an algorithm to transform samples from an uncorrelated random vector so that they have a prescribed covariance matrix. The process of transforming uncorrelated samples in this way is called coloring, because uncorrelated samples are usually described as white noise. As we will see below, coloring allows us to simulate Gaussian random vectors.
Figure 4.6: Samples from bivariate Gaussian random vectors with different covariance matrices (√λ1 = 1.22, √λ2 = 0.71; √λ1 = 1, √λ2 = 1; √λ1 = 1.38, √λ2 = 0.32) are shown in gray. The eigenvectors of the covariance matrices are plotted in red. Each is scaled by the square root of the corresponding eigenvalue λ1 or λ2.
Algorithm (Coloring). Let Σ be a symmetric positive semidefinite matrix with eigendecomposition Σ = U Λ U^T.
1. Obtain a realization ~x of an uncorrelated random vector ~X with unit variances, so that Σ_~X = I.
2. Set ~y := U √Λ ~x, where √Λ is a diagonal matrix containing the square roots of the eigenvalues of Σ:
√Λ := diag( √λ1, √λ2, . . . , √λn ). (4.91)
By Theorem 4.3.10 the covariance matrix of ~Y := U √Λ ~X indeed equals Σ:
Σ_~Y = U √Λ Σ_~X (U √Λ)^T (4.92)
= U √Λ I √Λ^T U^T (4.93)
= U Λ U^T = Σ. (4.94)
Figure 4.7 illustrates the two steps of coloring in 2D: First the samples are stretched according to
the eigenvalues of Σ and then they are rotated to align them with the corresponding eigenvectors.
Lemma 4.3.14 (Uncorrelation implies mutual independence for Gaussian random vectors). If all the components of a Gaussian random vector ~X are uncorrelated, this implies that they are mutually independent.
Proof. The parameter Σ of the joint pdf of a Gaussian random vector is its covariance matrix (one
can verify this by applying the definition of covariance and integrating). If all the components
Figure 4.7: When we color two-dimensional uncorrelated samples (left), first the diagonal matrix √Λ stretches them differently along different directions according to the eigenvalues of the desired covariance matrix (center) and then U rotates them so that they are aligned with the corresponding eigenvectors (right).
are uncorrelated, then Σ is diagonal, and the joint pdf (3.80) factors into a product of univariate Gaussian pdfs, one per component. Since the joint pdf factors into a product of the marginals, the components are all mutually independent.
The following algorithm generates samples from a Gaussian random vector with an arbitrary
mean and covariance matrix by coloring (and centering) a vector of independent samples from
a standard Gaussian distribution.
Algorithm (Generating Gaussian random vectors). To generate samples from a Gaussian random vector ~Y with mean ~µ and covariance matrix Σ = U Λ U^T:
1. Generate a vector ~x containing n independent samples from a standard Gaussian distribution, so that the corresponding random vector ~X has zero mean and covariance matrix I.
2. Set ~y := U √Λ ~x + ~µ.
The mean of ~Y := U √Λ ~X + ~µ is ~µ, since the mean of ~X is zero. The same argument used in equation (4.94) shows that the covariance matrix of ~Y is Σ. Since coloring and centering are linear operations, by Theorem 3.2.14 ~Y is Gaussian with the desired mean and covariance matrix. For example, in Figure 4.7 the generated samples are Gaussian. For non-Gaussian random vectors, coloring will modify the covariance matrix, but not necessarily preserve the distribution.
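A compact NumPy implementation of the coloring-and-centering procedure (an illustrative sketch, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_vector_samples(mu, Sigma, n):
    """Generate n samples from N(mu, Sigma) by coloring standard Gaussian noise."""
    lam, U = np.linalg.eigh(Sigma)              # eigendecomposition Sigma = U diag(lam) U^T
    coloring = U @ np.diag(np.sqrt(lam))        # U * sqrt(Lambda)
    white = rng.standard_normal((n, len(mu)))   # uncorrelated, unit-variance samples
    return white @ coloring.T + mu

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
y = gaussian_vector_samples(mu, Sigma, 10**6)
print(y.mean(axis=0))  # ~mu
print(np.cov(y.T))     # ~Sigma
```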
4.4 Conditional expectation
The conditional expectation of g(X, Y) given X = x is
E( g(X, Y) | X = x ) := Σ_{y} g(x, y) p_{Y|X}(y|x)
if Y is discrete, and
E( g(X, Y) | X = x ) := ∫_{y=−∞}^{∞} g(x, y) f_{Y|X}(y|x) dy
if Y is continuous.
Note that E(g(X, Y)|X = x) can actually be interpreted as a function of x, since it maps every value of x to a real number. This allows us to define the conditional expectation of g(X, Y) given X as
E( g(X, Y) | X ) := h(X), where h(x) := E( g(X, Y) | X = x ).
Beware the confusing notation: the conditional expectation is actually a random variable!
One of the main uses of conditional expectation is applying iterated expectation for computing
expected values. The idea is that the expected value of a certain quantity can be expressed as
the expectation of the conditional expectation of the quantity.
Theorem 4.4.2 (Iterated expectation). For any random variables X and Y and any function g : R² → R,
E( g(X, Y) ) = E( E( g(X, Y) | X ) ).
Proof. We prove the result for continuous random variables; the proof for discrete random variables, and for quantities that depend on both continuous and discrete random variables, is almost identical. To make the explanation clearer, we define
h(x) := E( g(X, Y) | X = x ) (4.107)
= ∫_{y=−∞}^{∞} g(x, y) f_{Y|X}(y|x) dy. (4.108)
Now,
E( E( g(X, Y) | X ) ) = E( h(X) ) = ∫_{x=−∞}^{∞} h(x) f_X(x) dx
= ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} g(x, y) f_X(x) f_{Y|X}(y|x) dy dx = E( g(X, Y) ),
where the last step follows because f_X(x) f_{Y|X}(y|x) = f_{X,Y}(x, y).
Iterated expectation allows us to obtain the expectation of quantities that depend on several random variables very easily if we have access to the marginal and conditional distributions. We illustrate this with several examples taken from the previous chapters.
Example 4.4.3 (Desert (continued from Example 3.4.10)). Let us compute the mean time at which the car breaks down, i.e. the mean of T. Conditioned on M and R, T is exponential with parameter M + R, so E(T | M = m, R = r) = 1/(m + r). By iterated expectation,
E(T) = E( E(T | M, R) ) = ∫_{r=0}^{1} ∫_{m=0}^{1} 1/(m + r) dm dr = 2 log 2 ≈ 1.39 hours.
4
Example 4.4.4 (Grizzlies in Yellowstone (continued from Example 3.3.3)). Let us compute the mean weight of a bear in Yellowstone. By iterated expectation,
E(W) = E( E(W | S) ) (4.118)
= ( E(W | S = 0) + E(W | S = 1) ) / 2 (4.119)
= (240 + 140) / 2 = 190 kg. (4.120)
Example 4.4.5 (Bayesian coin flip (continued from Example 3.3.6)). Let us compute the mean of the coin-flip outcome X. By iterated expectation,
E(X) = E( E(X | B) ) = E(B) = ∫_{b=0}^{1} b · 2b db = 2/3,
which is consistent with (3.102), since E(X) = p_X(1).
4.5 Proofs
4.5.1 Derivation of means and variances in Table 4.1
Bernoulli
The mean of a Bernoulli random variable X with parameter p is E(X) = 1 · p + 0 · (1 − p) = p. Since X² = X, the mean square value is also p, so Var(X) = p − p² = p(1 − p).
Geometric
To compute the mean of a geometric random variable, we need to deal with a geometric series. By Lemma 4.5.3 in Section 4.5.2 below we have
E(X) = Σ_{k=1}^{∞} k p_X(k) (4.128)
= Σ_{k=1}^{∞} k p (1 − p)^{k−1} (4.129)
= (p / (1 − p)) Σ_{k=1}^{∞} k (1 − p)^k = 1/p. (4.130)
To compute the mean square value we apply Lemma 4.5.4 in the same section:
E(X²) = Σ_{k=1}^{∞} k² p_X(k) (4.131)
= Σ_{k=1}^{∞} k² p (1 − p)^{k−1} (4.132)
= (p / (1 − p)) Σ_{k=1}^{∞} k² (1 − p)^k (4.133)
= (2 − p) / p². (4.134)
Binomial
As shown in Example 2.2.6, we can express a binomial random variable X with parameters n and p as the sum of n independent Bernoulli random variables B1, B2, . . . , Bn with parameter p:
X = Σ_{i=1}^{n} Bi. (4.135)
By linearity of expectation, E(X) = Σ_{i=1}^{n} E(Bi) = np, and since the Bi are independent, the variance of the sum equals the sum of the variances, so Var(X) = np(1 − p).
Poisson
From calculus we have
Σ_{k=0}^{∞} λ^k / k! = e^λ, (4.139)
which is the Taylor series expansion of the exponential function. This implies
E(X) = Σ_{k=1}^{∞} k p_X(k) (4.140)
= Σ_{k=1}^{∞} λ^k e^{−λ} / (k − 1)! (4.141)
= e^{−λ} Σ_{m=0}^{∞} λ^{m+1} / m! = λ, (4.142)
and
E(X²) = Σ_{k=1}^{∞} k² p_X(k) (4.143)
= Σ_{k=1}^{∞} k λ^k e^{−λ} / (k − 1)! (4.144)
= e^{−λ} Σ_{k=1}^{∞} ( (k − 1) λ^k / (k − 1)! + λ^k / (k − 1)! ) (4.145)
= e^{−λ} ( Σ_{m=0}^{∞} λ^{m+2} / m! + Σ_{m=0}^{∞} λ^{m+1} / m! ) = λ² + λ. (4.146)
Uniform
We apply the definition of expected value for continuous random variables to obtain
E(X) = ∫_{−∞}^{∞} x f_X(x) dx = ∫_{a}^{b} x / (b − a) dx (4.147)
= (b² − a²) / (2(b − a)) = (a + b)/2. (4.148)
Similarly,
E(X²) = ∫_{a}^{b} x² / (b − a) dx (4.149)
= (b³ − a³) / (3(b − a)) (4.150)
= (a² + ab + b²) / 3. (4.151)
Exponential
Applying integration by parts,
E(X) = ∫_{−∞}^{∞} x f_X(x) dx (4.152)
= ∫_{0}^{∞} x λ e^{−λx} dx (4.153)
= [−x e^{−λx}]_{0}^{∞} + ∫_{0}^{∞} e^{−λx} dx = 1/λ. (4.154)
Similarly,
E(X²) = ∫_{0}^{∞} x² λ e^{−λx} dx (4.155)
= [−x² e^{−λx}]_{0}^{∞} + 2 ∫_{0}^{∞} x e^{−λx} dx = 2/λ². (4.156)
Gaussian
We apply the change of variables t = (x − µ)/σ:
E(X) = ∫_{−∞}^{∞} x f_X(x) dx (4.157)
= ∫_{−∞}^{∞} (x / (√(2π) σ)) e^{−(x−µ)²/(2σ²)} dx (4.158)
= (σ/√(2π)) ∫_{−∞}^{∞} t e^{−t²/2} dt + (µ/√(2π)) ∫_{−∞}^{∞} e^{−t²/2} dt (4.159)
= µ, (4.160)
where the last step follows from the fact that the integral of a bounded odd function over a symmetric interval is zero.
Applying the change of variables t = (x − µ)/σ and integrating by parts, we obtain
E(X²) = ∫_{−∞}^{∞} x² f_X(x) dx (4.161)
= ∫_{−∞}^{∞} (x² / (√(2π) σ)) e^{−(x−µ)²/(2σ²)} dx (4.162)
= (σ²/√(2π)) ∫_{−∞}^{∞} t² e^{−t²/2} dt + (2µσ/√(2π)) ∫_{−∞}^{∞} t e^{−t²/2} dt + (µ²/√(2π)) ∫_{−∞}^{∞} e^{−t²/2} dt (4.163)
= (σ²/√(2π)) ( [−t e^{−t²/2}]_{−∞}^{∞} + ∫_{−∞}^{∞} e^{−t²/2} dt ) + µ² (4.164)
= σ² + µ². (4.165)
4.5.2 Proofs of Lemmas 4.5.3 and 4.5.4
We first derive a closed-form expression for finite geometric sums, by multiplying the sum by
the factor (1 − α) / (1 − α), which obviously equals one:
α^{n₁} + α^{n₁+1} + · · · + α^{n₂−1} + α^{n₂} = ((1 − α) / (1 − α)) (α^{n₁} + α^{n₁+1} + · · · + α^{n₂−1} + α^{n₂}) (4.168)
= (α^{n₁} − α^{n₁+1} + α^{n₁+1} − · · · − α^{n₂} + α^{n₂} − α^{n₂+1}) / (1 − α)
= (α^{n₁} − α^{n₂+1}) / (1 − α). (4.169)
Taking n₁ := 0 and letting n₂ → ∞ with |α| < 1 yields Σ_{k=0}^{∞} α^k = 1 / (1 − α). Since the
series on the left converges, we can differentiate on both sides to obtain
Σ_{k=0}^{∞} k α^{k−1} = 1 / (1 − α)². (4.172)
Multiplying by α gives Σ_{k=1}^{∞} k α^k = α / (1 − α)². Since the series on the left converges, we
can differentiate on both sides again to obtain
Σ_{k=1}^{∞} k² α^{k−1} = (1 + α) / (1 − α)³. (4.175)
The expectation of a nonnegative quantity is nonnegative, because the integral or sum of a non-
negative quantity is nonnegative. Consequently, the left-hand sides of (4.176) and (4.178) are
nonnegative, so (B.117) and (B.118) are both nonnegative, which implies (4.65).
Let us prove (B.21) by proving both implications.
(⇒). Assume E (XY ) = −√(E (X²) E (Y²)). Then (B.117) equals zero, so
E ( ( √(E (Y²)) X + √(E (X²)) Y )² ) = 0, (4.180)
which by Corollary 4.2.13 means that √(E (Y²)) X = −√(E (X²)) Y with probability one.
(⇐). Assume Y = −√(E (Y²) / E (X²)) X. Then one can easily check that (B.117) equals zero,
which implies E (XY ) = −√(E (X²) E (Y²)).
The proof of (B.22) is almost identical (using (4.176) instead of (B.117)).
Chapter 5
Random Processes
Random processes, also known as stochastic processes, are used to model uncertain quantities
that evolve in time: the trajectory of a particle, the price of oil, the temperature in New York, the
national debt of the United States, etc. In these notes we introduce a mathematical framework
that makes it possible to reason probabilistically about such quantities.
5.1 Definition
We denote random processes using a tilde over an upper-case letter, X̃. This is not standard
notation, but we want to emphasize the difference with random variables and random vectors.
Formally, a random process X̃ is a function that maps elements in a sample space Ω to real-
valued functions.
Definition 5.1.1 (Random process). Given a probability space (Ω, F, P), a random process X̃
is a function that maps each element ω in the sample space Ω to a function X̃ (ω, ·) : T → R,
where T is a discrete or continuous set.
There are two possible interpretations for X̃ (ω, t):
• If we fix ω, then X̃ (ω, ·) is a deterministic function of t, known as a realization of the
random process.
• If we fix t, then X̃ (·, t) is a random variable, which we usually denote by X̃ (t).
If the indexing variable t is defined on a discrete set, usually the integers or the natural numbers,
then X̃ is a discrete-time random process. In such cases we often use a different letter from t,
such as i, as an indexing variable. If t is defined on a continuous set, usually an interval of the
real line, then X̃ is a continuous-time random process.
Figure 5.1: Realizations of the continuous-time (left) and discrete-time (right) random process defined
in Example 5.1.2.
Note that there are continuous-state discrete-time random processes and discrete-state continuous-
time random processes. Any combination is possible.
The underlying probability space (Ω, F, P) mentioned in the definition completely determines
the stochastic behavior of the random process. In principle we can specify random processes
by defining (1) a probability space (Ω, F, P) and (2) a mapping that assigns a function to each
element of Ω, as illustrated in the following example. This way of specifying random processes
is only tractable for very simple cases.
Example 5.1.2 (Puddle). Bob asks Mary to model a puddle probabilistically. When the puddle
is formed, it contains an amount of water that is distributed uniformly between 0 and 1 gallon.
As time passes, the water evaporates: after a time interval t ≥ 1 the water that is left equals
the initial quantity divided by t.
Mary models the water in the puddle as a continuous-state continuous-time random process
C̃. The underlying sample space is (0, 1), the σ-algebra is the corresponding Borel σ-algebra
(all possible countable unions of intervals in (0, 1)) and the probability measure is the uniform
probability measure on (0, 1). For a particular element in the sample space ω ∈ (0, 1),
C̃ (ω, t) := ω / t, t ∈ [1, ∞), (5.1)
where the unit of t is days in this example. Figure 5.1 shows different realizations of the random
process. Each realization is a deterministic function on [1, ∞).
Bob points out that he only cares what the state of the puddle is each day, as opposed to at any
time t. Mary decides to simplify the model by using a continuous-state discrete-time random
process D̃. The underlying probability space is exactly the same as before, but the time index
is now discrete. For a particular element in the sample space ω ∈ (0, 1),
D̃ (ω, i) := ω / i, i = 1, 2, . . . (5.2)
Figure 5.1 shows different realizations of this discrete-time random process. Note that each
realization is just a deterministic discrete sequence.
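The puddle model is simple enough to simulate directly from its definition. The following Python sketch (the variable names are ours) draws ω uniformly from (0, 1) and evaluates the resulting deterministic functions C̃ (ω, t) = ω/t and D̃ (ω, i) = ω/i:

import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(1, 10, 100)        # continuous time grid, in days
i = np.arange(1, 11)               # discrete time index

omega = rng.uniform(0, 1, size=3)  # three draws from the sample space (0, 1)
C = omega[:, None] / t[None, :]    # realizations of C(omega, t) = omega / t
D = omega[:, None] / i[None, :]    # realizations of D(omega, i) = omega / i
print(D.round(3))                  # each row decays deterministically to zero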
Recall that the value of the random process at a specific time is a random variable. We can
therefore characterize the behavior of the process at that time by computing the distribution
of the corresponding random variable. Similarly, we can consider the joint distribution of the
process sampled at n fixed times. This is given by the nth-order distribution of the random
process.
If the nth-order distributions of a random process are shift-invariant for every n, then the process
is said to be strictly or strongly stationary.
The random processes in Example 5.1.2 are clearly not strictly stationary because their first-
order pdf and pmf are not the same at every point. An important example of strictly stationary
processes are independent identically-distributed sequences, presented in Section 5.3.
As in the case of random variables and random vectors, defining the underlying probability
space in order to specify a random process is usually not very practical, except for very simple
cases like the one in Example 5.1.2. The reason is that it is challenging to come up with a
probability space that gives rise to a given n-th order distribution of interest. Fortunately, we
can also specify a random process by directly specifying its n-th order distribution for all values
of n = 1, 2, . . . This completely characterizes the random process. Most of the random processes
described in this chapter, e.g. independent identically-distributed sequences, Markov chains,
Poisson processes and Gaussian processes, are specified in this way.
Finally, random processes can also be specified by expressing them as functions of other random
processes. A function Ỹ := g(X̃) of a random process X̃ is also a random process, as it maps
any element ω in the sample space Ω to a function Ỹ (ω, ·) := g(X̃ (ω, ·)). In Section 5.6 we
define random walks in this way.
5.2 Mean and autocovariance functions
The mean of a random process is the deterministic function that equals the mean of the process
at each point, µ_X̃ (t) := E (X̃ (t)). Note that the mean is a deterministic function of t. The
autocovariance of a random process is another deterministic function that is equal to the
covariance of X̃ (t₁) and X̃ (t₂) for any two points t₁ and t₂. If we set t₁ := t₂, then the
autocovariance equals the variance at t₁.
Definition 5.2.2 (Autocovariance). The autocovariance of a random process is the function
R_X̃ (t₁, t₂) := Cov (X̃ (t₁), X̃ (t₂)) . (5.8)
In particular,
R_X̃ (t, t) := Var (X̃ (t)) . (5.9)
Intuitively, the autocovariance quantifies the correlation between the process at two different
time points. If this correlation only depends on the separation between the two points, then the
process is said to be wide-sense stationary.
Definition 5.2.3 (Wide-sense/weakly stationary process). A process is stationary in a wide or
weak sense if its mean is constant, µ_X̃ (t) = µ, and its autocovariance is shift-invariant,
R_X̃ (t₁, t₂) = R_X̃ (t₁ + τ, t₂ + τ),
for any t₁ and t₂ and any shift τ. For weakly stationary processes, the autocovariance is usually
expressed as a function of the difference between the two time points, R_X̃ (s) := R_X̃ (t, t + s).
Figure 5.2: Realizations (bottom three rows) of Gaussian processes with zero mean and the autocovari-
ance functions shown on the top row.
Note that any strictly stationary process is necessarily weakly stationary because its first and
second-order distributions are shift invariant.
Figure 5.2 shows several stationary random processes with different autocovariance functions. If
the autocovariance function is zero everywhere except at the origin, then the values of the random
processes at different points are uncorrelated. This results in erratic fluctuations. When the
autocovariance at neighboring times is high, the trajectory of the random process becomes
smoother. The autocovariance can also induce more structured behavior, as in the right column
of the figure. In that example X̃ (i) is negatively correlated with its two neighbors X̃ (i − 1) and
X̃ (i + 1), but positively correlated with X̃ (i − 2) and X̃ (i + 2). This results in rapid periodic
fluctuations.
5.3 Independent identically-distributed sequences
An independent identically-distributed (iid) sequence X̃ is a discrete-time random process where
X̃ (i) has the same distribution for every i and the values of the process at different times are
mutually independent. If X̃ (i) is a discrete random variable with pmf p_X̃, then for any n indices
i₁, i₂, . . . , iₙ and any n,
p_{X̃(i₁),X̃(i₂),...,X̃(iₙ)} (x_{i₁}, x_{i₂}, . . . , x_{iₙ}) = Π_{k=1}^{n} p_X̃ (x_{i_k}) .
Note that this distribution does not vary if we shift every index by the same amount, so the
process is strictly stationary.
Similarly, if X̃ (i) is a continuous random variable, then we denote the pdf associated to the
distribution by f_X̃. For any n indices i₁, i₂, . . . , iₙ and any n we have
f_{X̃(i₁),X̃(i₂),...,X̃(iₙ)} (x_{i₁}, x_{i₂}, . . . , x_{iₙ}) = Π_{k=1}^{n} f_X̃ (x_{i_k}) . (5.14)
Figure 5.3 shows several realizations from iid sequences which follow a uniform and a geometric
distribution.
The mean of an iid random sequence is constant and equal to the mean of its associated distri-
bution, which we denote by µ,
µ_X̃ (i) := E (X̃ (i)) = µ. (5.15)–(5.16)
Let us denote the variance of the distribution associated to the iid sequence by σ². The auto-
covariance function is given by
R_X̃ (i, j) := E (X̃ (i) X̃ (j)) − E (X̃ (i)) E (X̃ (j)) (5.17)
= σ² if i = j, and 0 otherwise. (5.18)
This is not surprising: X̃ (i) and X̃ (j) are independent for all i ≠ j, so they are also uncorrelated.
Figure 5.3: Realizations of an iid uniform sequence in (0, 1) (first row) and an iid geometric sequence
with parameter p = 0.4 (second row).
5.4 Gaussian process
A Gaussian process is a random process X̃ such that, for any n points t₁, t₂, . . . , tₙ, the random
vector (X̃ (t₁), . . . , X̃ (tₙ)) is Gaussian. A Gaussian process is therefore fully characterized by
its mean function and its autocovariance function. Figure 5.2 shows realizations of several
discrete Gaussian processes with different autocovariance functions. Sampling from a Gaussian
random process boils down to sampling a Gaussian random vector with the appropriate mean
and covariance matrix.
Algorithm 5.4.1 (Generating a Gaussian random process). To sample from a Gaussian random
process with mean function µ_X̃ and autocovariance function R_X̃ at n points t₁, . . . , tₙ we:
1. Build the mean vector with entries µ_X̃ (tᵢ) and the covariance matrix with entries R_X̃ (tᵢ, tⱼ).
2. Sample a Gaussian random vector with that mean vector and covariance matrix.
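A minimal Python sketch of Algorithm 5.4.1 follows. The squared-exponential autocovariance in the example call is an assumption chosen only to produce smooth realizations like those in Figure 5.2; it is not prescribed by the text, and the helper name is ours.

import numpy as np

def sample_gaussian_process(mean_fn, autocov_fn, t, n_samples=3, rng=None):
    """Sample a Gaussian process at the points t by drawing a Gaussian
    random vector with the corresponding mean vector and covariance matrix."""
    rng = rng or np.random.default_rng(0)
    mu = np.array([mean_fn(ti) for ti in t])
    Sigma = np.array([[autocov_fn(ti, tj) for tj in t] for ti in t])
    Sigma += 1e-9 * np.eye(len(t))  # tiny jitter for numerical stability
    return rng.multivariate_normal(mu, Sigma, size=n_samples)

t = np.linspace(0, 14, 50)
samples = sample_gaussian_process(lambda t: 0.0,
                                  lambda t1, t2: np.exp(-((t1 - t2) ** 2) / 2),
                                  t)
print(samples.shape)  # (3, 50): three realizations evaluated at 50 points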
5.5 Poisson process
We now assume that the conditions considered in Example 2.2.8 hold in the semi-infinite interval
[0, ∞) and define a random process Ñ that counts the events. To be clear, Ñ (t) is the number
of events that happen between 0 and t.
By the same reasoning as in Example 2.2.8, the distribution of the random variable Ñ (t₂) −
Ñ (t₁), which represents the number of events that occur between t₁ and t₂, is a Poisson random
variable with parameter λ (t₂ − t₁). This holds for any t₁ and t₂. In addition, the random
variables Ñ (t₂) − Ñ (t₁) and Ñ (t₄) − Ñ (t₃) are independent as long as the intervals [t₁, t₂] and
[t₃, t₄] do not overlap, by Condition 1. A Poisson process is a discrete-state continuous-time
random process that satisfies these two properties.
Poisson processes are often used to model events such as earthquakes, telephone calls, decay of
radioactive particles, neural spikes, etc. Figure 2.6 shows an example of a real scenario where
the number of calls received at a call center is well approximated as a Poisson process (as long as
we only consider a few hours). Note that here we are using the word event to mean something
that happens, such as the arrival of an email, instead of a set within a sample space, which is
the meaning that it usually has elsewhere in these notes.
Definition 5.5.1 (Poisson process). A Poisson process with parameter λ is a discrete-state
continuous-time random process Ñ such that:
1. Ñ (0) = 0.
2. For any t₁ < t₂, Ñ (t₂) − Ñ (t₁) is a Poisson random variable with parameter λ (t₂ − t₁).
3. For any t₁ < t₂ < t₃ < t₄, the random variables Ñ (t₂) − Ñ (t₁) and Ñ (t₄) − Ñ (t₃) are
independent.
We now check that the random process is well defined, by proving that we can derive the joint
pmf of Ñ at any n points t₁ < t₂ < . . . < tₙ for any n. To alleviate notation, let p(λ̃, x) be the
value of the pmf of a Poisson random variable with parameter λ̃ at x, i.e.
p(λ̃, x) := λ̃^x e^{−λ̃} / x! . (5.22)
The joint pmf follows from expressing the event that Ñ (tᵢ) = xᵢ for 1 ≤ i ≤ n in terms of the
random variables Ñ (t₁) and Ñ (tᵢ) − Ñ (tᵢ₋₁), 2 ≤ i ≤ n, which are independent Poisson random
variables with parameters λt₁ and λ (tᵢ − tᵢ₋₁) respectively.
Figure 5.4 shows several sequences of events corresponding to the realizations of a Poisson
process Ñ for different values of the parameter λ (Ñ (t) equals the number of events up to time
t). Interestingly, the interarrival time of the events, i.e. the time between contiguous events,
always has the same distribution: it is an exponential random variable.
Lemma 5.5.2 (Interarrival times of a Poisson process are exponential). Let T denote the time
between two contiguous events in a Poisson process with parameter λ. T is an exponential
random variable with parameter λ.
The proof is in Section 5.7.1 of the appendix. Figure 2.11 shows that the interarrival times of
telephone calls at a call center are indeed well modeled as exponential.
Lemma 5.5.2 suggests that to simulate a Poisson process all we need to do is sample from an
exponential distribution.
Algorithm 5.5.3 (Generating a Poisson random process). To sample from a Poisson random
process with parameter λ we:
1. Generate independent samples d₁, d₂, d₃, . . . from an exponential distribution with parameter λ.
2. Set the time of the i-th event to d₁ + d₂ + · · · + dᵢ.
Figure 5.4 was generated in this way. To confirm that the algorithm allows us to sample from a
Poisson process, we would have to prove that the resulting process satisfies the conditions in
Definition 5.5.1. This is indeed the case, but we omit the proof.
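A minimal Python sketch of Algorithm 5.5.3, accumulating iid exponential interarrival times (the helper name is ours):

import numpy as np

def sample_poisson_process(lam, t_max, rng=None):
    """Generate the event times of a Poisson process with parameter lam on
    [0, t_max] by accumulating iid exponential interarrival times."""
    rng = rng or np.random.default_rng(0)
    times = []
    t = rng.exponential(1 / lam)  # numpy parameterizes by the mean 1/lambda
    while t < t_max:
        times.append(t)
        t += rng.exponential(1 / lam)
    return np.array(times)

events = sample_poisson_process(lam=2.0, t_max=10.0)
print(len(events))  # N(10) is Poisson with parameter 20, so about 20 on average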
The following lemma, which derives the mean and autocovariance functions of a Poisson process,
is proved in Section 5.7.2.
Lemma 5.5.4 (Mean and autocovariance of a Poisson process). The mean and autocovariance
of a Poisson process equal
E (X̃ (t)) = λ t, (5.27)
R_X̃ (t₁, t₂) = λ min {t₁, t₂} . (5.28)
The mean of the Poisson process is not constant and its autocovariance is not shift-invariant, so
the process is neither strictly nor wide-sense stationary.
Example 5.5.5 (Earthquakes). The number of earthquakes with intensity at least 3 on the
Richter scale occurring in the San Francisco peninsula is modeled using a Poisson process with
parameter 0.3 earthquakes/year. What is the probability that there are no earthquakes in the
next ten years and then at least one earthquake over the following twenty years?
We define a Poisson process X̃ with parameter 0.3 to model the problem. The number of
earthquakes in the next 10 years, i.e. X̃ (10), is a Poisson random variable with parameter
0.3 · 10 = 3. The number of earthquakes in the following 20 years, X̃ (30) − X̃ (10), is Poisson
with parameter 0.3 · 20 = 6. The two random variables are independent because the intervals
do not overlap.
P (X̃ (10) = 0, X̃ (30) ≥ 1) = P (X̃ (10) = 0, X̃ (30) − X̃ (10) ≥ 1) (5.29)
= P (X̃ (10) = 0) P (X̃ (30) − X̃ (10) ≥ 1) (5.30)
= P (X̃ (10) = 0) (1 − P (X̃ (30) − X̃ (10) = 0)) (5.31)
= e^{−3} (1 − e^{−6}) ≈ 4.97 · 10^{−2}. (5.32)
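The final value in (5.32) can be checked with a few lines of Python:

import math

lam = 0.3  # earthquakes per year
p_none_10_years = math.exp(-lam * 10)              # P(X(10) = 0), parameter 3
p_at_least_one_next_20 = 1 - math.exp(-lam * 20)   # 1 - P(X(30) - X(10) = 0)
print(p_none_10_years * p_at_least_one_next_20)    # approx 4.97e-2, as in (5.32)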
5.6 Random walk
A random walk models a quantity that evolves by taking steps that are equally likely to be
positive or negative. Let S̃ be an iid sequence such that S̃ (i) equals +1 or −1 with probability
1/2 each. The random walk X̃ is defined as
X̃ (i) := Σ_{j=1}^{i} S̃ (j), i = 1, 2, . . .
We have specified X̃ as a function of an iid sequence, so it is well defined. Figure 5.5 shows
several realizations of the random walk.
X̃ is symmetric (there is the same probability of taking a positive step and a negative step) and
begins at the origin. It is easy to define variations where the walk is non-symmetric and begins
at another point. Generalizations to higher-dimensional spaces, for instance to model random
processes on a 2D surface, are also possible.

Figure 5.5: Realizations of the random walk.
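The random walk is straightforward to simulate as a cumulative sum of iid steps. The following Python sketch also checks empirically that the mean and variance at a fixed time agree with Lemma 5.6.2 below:

import numpy as np

rng = np.random.default_rng(0)

def random_walk(n_steps, rng):
    """X(i) is the sum of i iid steps equal to +1 or -1 with probability 1/2."""
    steps = rng.choice([-1, 1], size=n_steps)
    return np.cumsum(steps)

walks = np.array([random_walk(14, rng) for _ in range(10000)])
# Empirical mean and variance at time i = 5 should be close to 0 and min{5, 5} = 5
print(walks[:, 4].mean(), walks[:, 4].var())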
We derive the first-order pmf of the random walk in the following lemma, proved in Section 5.7.3
of the appendix.
Lemma 5.6.1 (First-order pmf of a random walk). The first-order pmf of the random walk X̃ is
p_{X̃(i)} (x) = \binom{i}{(i+x)/2} (1/2)^i if i + x is even and −i ≤ x ≤ i,
and p_{X̃(i)} (x) = 0 otherwise. (5.35)
The first-order distribution of the random walk is clearly time-dependent, so the random process
is not strictly stationary. By the following lemma, the mean of the random walk is constant (it
equals zero). The autocovariance, however, is not shift invariant, so the process is not weakly
stationary either.
Lemma 5.6.2 (Mean and autocovariance of a random walk). The mean and autocovariance of
the random walk X̃ are
µ_X̃ (i) = 0, (5.36)
R_X̃ (i, j) = min {i, j} . (5.37)
Proof.
µ_X̃ (i) := E (X̃ (i)) (5.38)
= E ( Σ_{j=1}^{i} S̃ (j) ) (5.39)
= Σ_{j=1}^{i} E (S̃ (j)) by linearity of expectation (5.40)
= 0. (5.41)
R_X̃ (i, j) := E (X̃ (i) X̃ (j)) − E (X̃ (i)) E (X̃ (j)) (5.42)
= E ( Σ_{k=1}^{i} Σ_{l=1}^{j} S̃ (k) S̃ (l) ) (5.43)
= Σ_{k=1}^{min{i,j}} E ( S̃ (k)² ) + Σ_{k=1}^{i} Σ_{l=1, l≠k}^{j} E ( S̃ (k) S̃ (l) ) (5.44)
= Σ_{k=1}^{min{i,j}} 1 + Σ_{k=1}^{i} Σ_{l=1, l≠k}^{j} E (S̃ (k)) E (S̃ (l)) by independence (5.45)
= min {i, j} . (5.46)
5.7 Proofs
5.7.1 Proof of Lemma 5.5.2
We begin by deriving the cdf of T,
F_T (t) := P (T ≤ t) (5.50)
= 1 − P (T > t) (5.51)
= 1 − P (no events in an interval of length t) (5.52)
= 1 − e^{−λ t}, (5.53)
because the number of points in an interval of length t follows a Poisson distribution with
parameter λ t. Differentiating, we conclude that
f_T (t) = λ e^{−λ t}, (5.54)
the pdf of an exponential random variable with parameter λ.
5.7.2 Proof of Lemma 5.5.4
The mean follows directly from the fact that X̃ (t) is a Poisson random variable with parameter
λ t, so E (X̃ (t)) = λ t. For the autocovariance, assume t₁ ≤ t₂. By assumption X̃ (t₁) and
X̃ (t₂) − X̃ (t₁) are independent, so that
E (X̃ (t₁) X̃ (t₂)) = E ( X̃ (t₁) (X̃ (t₂) − X̃ (t₁)) + X̃ (t₁)² ) (5.57)
= E (X̃ (t₁)) E (X̃ (t₂) − X̃ (t₁)) + E ( X̃ (t₁)² ) (5.58)
= λ t₁ · λ (t₂ − t₁) + λ t₁ + λ² t₁² (5.59)
= λ² t₁ t₂ + λ t₁. (5.60)
Subtracting E (X̃ (t₁)) E (X̃ (t₂)) = λ² t₁ t₂ yields R_X̃ (t₁, t₂) = λ t₁ = λ min {t₁, t₂}.
5.7.3 Proof of Lemma 5.6.1
Let S₊ and S₋ denote the number of positive and negative steps taken by the walk up to time i,
so that S₊ + S₋ = i. If X̃ (i) = x, then
x = S₊ − S₋ (5.61)
= 2 S₊ − i. (5.62)
This requires S₊ = (i + x)/2, which is only possible if i + x is even and −i ≤ x ≤ i. Since S₊ is
a binomial random variable with parameters i and 1/2, the pmf in (5.35) follows.
Chapter 6
Convergence of Random Processes
In this chapter we study the convergence of discrete random processes. This allows us to charac-
terize two phenomena that are fundamental in statistical estimation and probabilistic modeling:
the law of large numbers and the central limit theorem.
6.1 Types of convergence
Recall that a deterministic sequence of real numbers x₁, x₂, x₃, . . . converges to a limit x,
lim_{i→∞} xᵢ = x, (6.1)
if xᵢ is arbitrarily close to x as the index i grows. More formally, the sequence converges to x if
for any ε > 0 there is an index i₀ such that for all indices i greater than i₀ we have |xᵢ − x| < ε.
Recall that any realization of a discrete-time random process X̃ (ω, i) where we fix the outcome
ω is a deterministic sequence. Establishing convergence of such realizations to a fixed number
can therefore be achieved by computing the corresponding limit. However, if we consider the
random process itself instead of a realization and we want to determine whether it eventually
converges to a random variable X, then deterministic convergence no longer makes sense. In
this section we describe several alternative definitions of convergence, which allow us to extend
this concept to random quantities.
Definition 6.1.1 (Convergence with probability one). A discrete random process X̃ converges
with probability one to a random variable X belonging to the same probability space (Ω, F, P) if
P ( { ω | ω ∈ Ω, lim_{i→∞} X̃ (ω, i) = X (ω) } ) = 1. (6.3)
Recall that in general the sample space Ω is very difficult to define and manipulate explicitly,
except for very simple cases.
Example 6.1.2 (Puddle (continued from Example 5.1.2)). Let us consider the discrete random
process D̃ defined in Example 5.1.2. If we fix ω ∈ (0, 1),
lim_{i→∞} D̃ (ω, i) = lim_{i→∞} ω / i (6.4)
= 0. (6.5)
The realizations tend to zero for all possible values of ω in the sample space. This implies that
D̃ converges to zero with probability one.
Two other important notions are convergence in mean square, which requires
lim_{i→∞} E ( (X − X̃ (i))² ) = 0, and convergence in probability, which requires
lim_{i→∞} P ( |X − X̃ (i)| > ε ) = 0 for any ε > 0. Note that, as in the case of convergence in
mean square, the limit in the definition of convergence in probability is deterministic, as it is a
limit of probabilities, which are just real numbers.
As a direct consequence of Markov's inequality, convergence in mean square implies convergence
in probability.
Theorem 6.1.5. Convergence in mean square implies convergence in probability.
Proof. If the sequence converges in mean square,
lim_{i→∞} P ( |X − X̃ (i)| > ε ) = lim_{i→∞} P ( (X − X̃ (i))² > ε² ) (6.8)
≤ lim_{i→∞} E ( (X − X̃ (i))² ) / ε² by Markov's inequality (6.9)
= 0. (6.10)
It turns out that convergence with probability one also implies convergence in probability. Con-
vergence with probability one does not imply convergence in mean square or vice versa. The
difference between these three types of convergence is not very important for the purposes of
this course.
A sequence X̃ converges in distribution to a random variable X if the cdf of X̃ (i) converges
pointwise to the cdf of X (at every point where the latter is continuous). Note that convergence
in distribution is a much weaker notion than convergence with probability one, in mean square
or in probability. If a discrete random process X̃ converges to a random variable X in
distribution, this only means that as i becomes large the distribution of X̃ (i) tends to the
distribution of X, not that the values of the two random variables are close. However,
convergence in probability (and hence convergence with probability one or in mean square) does
imply convergence in distribution.
Example 6.1.7 (Binomial converges to Poisson). Let us define a discrete random process X̃ (i)
such that the distribution of X̃ (i) is binomial with parameters i and p := λ/i. X̃ (i) and X̃ (j)
are independent for i ≠ j, which completely characterizes the n-order distributions of the process
for all n > 1. Consider a Poisson random variable X with parameter λ that is independent of
X̃ (i) for all i. Do you expect the values of X and X̃ (i) to be close as i → ∞?
No! In fact even X̃ (i) and X̃ (i + 1) will not be close in general. However, X̃ converges in
distribution to X, as established in Example 2.2.8:
lim_{i→∞} p_{X̃(i)} (x) = lim_{i→∞} \binom{i}{x} p^x (1 − p)^{i−x} (6.12)
= λ^x e^{−λ} / x! (6.13)
= p_X (x). (6.14)
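This convergence can be checked numerically. The following Python sketch compares the binomial pmf with parameters i and λ/i to the Poisson pmf at a fixed point (the values λ = 4 and x = 3 are arbitrary choices of ours):

from math import comb, exp, factorial

lam, x = 4.0, 3
for i in [10, 100, 10000]:
    p = lam / i
    binom_pmf = comb(i, x) * p**x * (1 - p)**(i - x)
    print(i, binom_pmf)                            # approaches the Poisson pmf
print("Poisson:", lam**x * exp(-lam) / factorial(x))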
6.2 Law of large numbers
Definition 6.2.1 (Moving average). The moving or running average Ã of a discrete random
process X̃, defined for i = 1, 2, . . . (i.e. 1 is the starting point), is equal to
Ã (i) := (1/i) Σ_{j=1}^{i} X̃ (j) . (6.15)
Consider an iid sequence. A very natural interpretation for the moving average is that it is a
real-time estimate of the mean. In fact, in statistical terms the moving average is the sample
mean of the process up to time i (the sample mean is defined in Chapter 8). The law of large
numbers establishes that the average does indeed converge to the mean of the iid sequence.
Theorem 6.2.2 (Weak law of large numbers). Let X̃ be an iid discrete random process with
mean µ_X̃ := µ such that the variance of X̃ (i), σ², is bounded. Then the moving average Ã of X̃
converges in mean square to µ.
Proof. First, we establish that the mean of Ã (i) is constant and equal to µ,
E (Ã (i)) = E ( (1/i) Σ_{j=1}^{i} X̃ (j) ) (6.16)
= (1/i) Σ_{j=1}^{i} E (X̃ (j)) (6.17)
= µ. (6.18)
Due to the independence assumption, the variance of the sum scales linearly in i. Recall that
for independent random variables the variance of the sum equals the sum of the variances,
Var (Ã (i)) = Var ( (1/i) Σ_{j=1}^{i} X̃ (j) ) (6.19)
= (1/i²) Σ_{j=1}^{i} Var (X̃ (j)) (6.20)
= σ² / i. (6.21)
We conclude that
lim_{i→∞} E ( (Ã (i) − µ)² ) = lim_{i→∞} E ( (Ã (i) − E (Ã (i)))² ) by (6.18) (6.22)
= lim_{i→∞} Var (Ã (i)) (6.23)
= lim_{i→∞} σ² / i by (6.21) (6.24)
= 0. (6.25)
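The law of large numbers is easy to visualize numerically. A minimal Python sketch of the moving average of an iid geometric sequence (the parameter p = 0.4 matches the geometric sequence used in Figure 6.2):

import numpy as np

rng = np.random.default_rng(0)
n = 100000
x = rng.geometric(p=0.4, size=n)               # iid sequence with mean 1/p = 2.5
running_avg = np.cumsum(x) / np.arange(1, n + 1)
print(running_avg[9], running_avg[999], running_avg[-1])  # tends to 2.5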
By Theorem 6.1.5 the average also converges to the mean of the iid sequence in probability. In
fact, one can also prove convergence with probability one under the same assumptions. This
result is known as the strong law of large numbers, but the proof is beyond the scope of these
notes. We refer the interested reader to more advanced texts in probability theory.
Figure 6.2 shows averages of realizations of several iid sequences. When the iid sequence is
Gaussian or geometric, we observe convergence to the mean of the distribution; however, when
the sequence is Cauchy, the moving average diverges. The reason is that, as shown in Exam-
ple 4.2.2, the Cauchy distribution does not have a well-defined mean! Intuitively, extreme values
have non-negligible probability under the Cauchy distribution, so from time to time the iid
sequence takes values with very large magnitudes, and this prevents the moving average from
converging.
Figure 6.2: Realization of the moving average of an iid standard Gaussian sequence (top), an iid
geometric sequence with parameter p = 0.4 (center) and an iid Cauchy sequence (bottom).
6.3 Central limit theorem
In the previous section we established that the moving average of an iid sequence converges to
the mean of its distribution (as long as the variance is finite). In this section, we characterize
the distribution of the average Ã (i) as i increases. It turns out that Ã converges to a Gaussian
random variable in distribution, which is very useful in statistics as we will see later on.
This result, known as the central limit theorem, justifies the use of Gaussian distributions
to model data that are the result of many different independent factors. For example, the
distribution of height or weight of people in a certain population often has a Gaussian shape,
as illustrated by Figure 2.13, because the height and weight of a person depend on many
different factors that are roughly independent. In many signal-processing applications noise is
well modeled as having a Gaussian distribution for the same reason.
Theorem 6.3.1 (Central limit theorem). Let X̃ be an iid discrete random process with mean
µ_X̃ := µ such that the variance of X̃ (i), σ², is bounded. The random process √n (Ã (n) − µ),
which corresponds to the centered and scaled moving average of X̃, converges in distribution to
a Gaussian random variable with mean 0 and variance σ².
Proof. The proof of this remarkable result is beyond the scope of these notes. It can be found
in any advanced text on probability theory. However, we would still like to provide some
intuition as to why the theorem holds. Theorem 3.5.2 establishes that the pdf of the sum of two
independent random variables is equal to the convolution of their individual pdfs. The same
holds for discrete random variables: the pmf of the sum is equal to the convolution of the pmfs,
as long as the random variables are independent.
If each of the entries of the iid sequence has pdf f, then the pdf of the sum of the first i elements
can be obtained by convolving f with itself i times,
f_{Σ_{j=1}^{i} X̃(j)} (x) = (f ∗ f ∗ · · · ∗ f) (x) . (6.26)
If the sequence has a discrete state and each of the entries has pmf p, the pmf of the sum of the
first i elements can be obtained by convolving p with itself i times,
p_{Σ_{j=1}^{i} X̃(j)} (x) = (p ∗ p ∗ · · · ∗ p) (x) . (6.27)
Normalizing by i just results in scaling the result of the convolution, so the pmf or pdf of the
moving average Ã is the result of repeated convolutions of a fixed function. These convolutions
have a smoothing effect, which eventually transforms the pmf/pdf into a Gaussian! We show
this numerically in Figure 6.3 for two very different distributions: a uniform distribution and a
very irregular one. Both converge to Gaussian-like shapes after just 3 or 4 convolutions. The
central limit theorem makes this precise, establishing that the shape of the pmf or pdf becomes
Gaussian asymptotically.
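The smoothing effect of repeated convolutions can be reproduced in a few lines of Python (the irregular starting pmf below is an arbitrary choice of ours):

import numpy as np

# Convolve an irregular pmf with itself repeatedly; the result quickly takes
# a Gaussian-like shape, as in Figure 6.3.
pmf = np.array([0.3, 0.05, 0.4, 0.05, 0.2])
current = pmf.copy()
for i in range(4):
    current = np.convolve(current, pmf)  # pmf of the sum of i + 2 iid copies
    print(f"sum of {i + 2} terms: support size {current.size}, "
          f"peak {current.max():.3f}")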
In statistics the central limit theorem is often invoked to justify treating averages as if they had
a Gaussian distribution. The idea is that for large enough n, √n (Ã (n) − µ) is approximately
Gaussian with mean 0 and variance σ², which implies that Ã (n) is approximately Gaussian with
mean µ and variance σ²/n. It's important to remember that we have not established this
rigorously. The rate of convergence will depend on the particular distribution of the entries of
the iid sequence. In practice convergence is usually very fast. Figure 6.4 shows the empirical
distribution of the moving average of an exponential and a geometric iid sequence. In both cases
the approximation obtained by the central limit theorem is very accurate even for an average of
100 samples.
Figure 6.3: Result of convolving two different distributions with themselves several times. The shapes
quickly become Gaussian-like.
Figure 6.4: Empirical distribution of the moving average of an iid standard Gaussian sequence (top),
an iid geometric sequence with parameter p = 0.4 (center) and an iid Cauchy sequence (bottom). The
empirical distribution is computed from 104 samples in all cases. For the two first rows the estimate
provided by the central limit theorem is plotted in red.
The figure also shows that for a Cauchy iid sequence, the distribution of the moving average
does not become Gaussian, which does not contradict the central limit theorem, as the
distribution does not have a well-defined mean. To close this section we derive a useful
approximation to the binomial distribution using the central limit theorem.
Example 6.3.2 (Gaussian approximation to the binomial distribution). Let X have a binomial
distribution with parameters n and p, such that n is large. Computing the probability that X is
in a certain interval requires summing its pmf over all the values in that interval. Alternatively,
we can obtain a quick approximation using the fact that for large n the distribution of a binomial
random variable is approximately Gaussian. Indeed, we can write X as the sum of n independent
Bernoulli random variables with parameter p,
X = Σ_{i=1}^{n} B_i . (6.28)
The mean of B_i is p and its variance is p (1 − p). By the central limit theorem, (1/n) X is approx-
imately Gaussian with mean p and variance p (1 − p) /n. Equivalently, by Lemma 2.5.1, X is
approximately Gaussian with mean np and variance np (1 − p).
Assume that a basketball player makes each shot she takes with probability p = 0.4. If we
assume that each shot is independent, what is the probability that she makes more than 420
shots out of 1000? We can model the shots made as a binomial X with parameters 1000 and
0.4. The exact answer is
P (X ≥ 420) = Σ_{x=420}^{1000} p_X (x) (6.29)
= Σ_{x=420}^{1000} \binom{1000}{x} 0.4^x 0.6^{1000−x} (6.30)
≈ 10.4 · 10^{−2}. (6.31)
If we apply the Gaussian approximation, by Lemma 2.5.1, X being larger than 420 is the same
as a standard Gaussian U being larger than (420 − µ) /σ, where µ and σ are the mean and
standard deviation of X, equal to np = 400 and √(np (1 − p)) = 15.5 respectively.
P (X ≥ 420) ≈ P ( √(np (1 − p)) U + np ≥ 420 ) (6.32)
= P (U ≥ 1.29) (6.33)
= 1 − Φ (1.29) (6.34)
≈ 9.85 · 10^{−2}. (6.35)
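Both numbers can be reproduced in Python, using logarithms to avoid overflow in the binomial coefficients (the helper binom_logpmf is ours):

from math import lgamma, log, exp, erf, sqrt

n, p, k = 1000, 0.4, 420

def binom_logpmf(x, n, p):
    """Log of the binomial pmf, computed via log-gamma for stability."""
    return (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
            + x * log(p) + (n - x) * log(1 - p))

exact = sum(exp(binom_logpmf(x, n, p)) for x in range(k, n + 1))

mu, sigma = n * p, sqrt(n * p * (1 - p))
Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))  # standard Gaussian cdf
approx = 1 - Phi((k - mu) / sigma)

print(exact, approx)  # approx 0.104 and 0.0985, matching (6.31) and (6.35)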
6.4 Monte Carlo simulation
As an example, imagine that you set up a probabilistic model to determine the probability of
winning a game of solitaire. If the cards are well shuffled, this probability equals
P (Win) = (Number of permutations that lead to a win) / (Total number of permutations) . (6.36)
The problem is that characterizing which permutations lead to a win is very difficult without
actually playing out the game to see the outcome. Doing this for every possible permutation is
computationally intractable, since there are 52! ≈ 8 · 10^67 of them. However, there is a simple way
to approximate the probability of interest: simulating a large number of games and recording
what fraction result in wins. The game of solitaire was precisely what inspired Stanislaw Ulam
to propose simulation-based methods, known as the Monte Carlo method (a code name, inspired
by the Monte Carlo Casino in Monaco), in the context of nuclear-weapon research in the 1940s:
The first thoughts and attempts I made to practice (the Monte Carlo Method) were suggested
by a question which occurred to me in 1946 as I was convalescing from an illness and playing
solitaires. The question was what are the chances that a Canfield solitaire laid out with 52
cards will come out successfully? After spending a lot of time trying to estimate them by pure
combinatorial calculations, I wondered whether a more practical method than ”abstract thinking”
might not be to lay it out say one hundred times and simply observe and count the number of
successful plays.
This was already possible to envisage with the beginning of the new era of fast computers, and
I immediately thought of problems of neutron diffusion and other questions of mathematical
physics, and more generally how to change processes described by certain differential equations
into an equivalent form interpretable as a succession of random operations. Later, I described
the idea to John von Neumann, and we began to plan actual calculations.1
Monte Carlo methods use simulation to estimate quantities that are challenging to compute
exactly. In this section, we consider the problem of approximating the probability of an event
E, as in the game of solitaire example.
Algorithm 6.4.1 (Monte Carlo approximation). To approximate the probability of an event E,
we:
1. Generate n independent samples from the indicator function 1_E associated to the event:
I₁, I₂, . . . , Iₙ.
2. Compute the average of the n samples,
Ã (n) := (1/n) Σ_{i=1}^{n} Iᵢ, (6.37)
which is the Monte Carlo approximation of P (E).
The probability of interest can be interpreted as the expectation of the indicator function 1E
associated to the event,
E (1E ) = P (E) . (6.38)
By the law of large numbers, the estimate A e converges to the true probability as n → ∞. The
following example illustrates the power of this simple technique.
1
http://en.wikipedia.org/wiki/Monte_Carlo_method#History
Table 6.1: The table on the left shows all possible outcomes in a league of three teams (m = 3), the
resulting ranks for each team and the corresponding probability. The table on the right shows the pmf
of the ranks of each of the teams.
Example 6.4.2 (Basketball league). In an intramural basketball league m teams play each other
once every season. The teams are ordered according to their past results: team 1 being the best
and team m the worst. We model the probability that team i beats team j, for 1 ≤ i < j ≤ m
as
P (team j beats team i) := 1 / (j − i + 1) . (6.39)
The best team beats the second with probability 1/2 and the third with probability 2/3, the
second beats the third with probability 1/2, the fourth with probability 2/3 and the fifth with
probability 3/4, and so on. We assume that the outcomes of the different games are independent.
At the end of the season, after every team has played with every other team, the teams are
ranked according to their number of wins. If several teams have the same number of wins, then
they share the same rank. For example, if two teams have the most wins, they both have rank
1, and the next team has rank 3. The goal is to compute the distribution of the final rank of
each team in the league, which we model as the random variables R1 , R2 , . . . , Rm . We have all
the information to compute the joint pmf of these random variables by applying the law of total
probability. As shown in Table 6.1 for m = 3, all we need to do is enumerate all the possible
outcomes of the games and sum the probabilities of the outcomes that result in a particular
rank.
Unfortunately, the number of possible outcomes grows dramatically with m. The number of
games equals m (m − 1) /2, so there are 2^{m(m−1)/2} possible outcomes. When there are just
10 teams, this is larger than 10^13. Computing the exact distribution of the final ranks for
leagues that are not very small is therefore very computationally demanding. Fortunately, Al-
gorithm 6.4.1 offers a more tractable alternative: We can sample a large number of seasons n by
simulating each game as a Bernoulli random variable with a parameter given by equation (6.39)
and approximate the pmf using the fraction of times that each team ends up in each position.
Simulating a whole season only requires sampling m (m − 1) /2 games, which can be done very
fast.
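A Python sketch of this Monte Carlo procedure, under the tie-breaking rule described above (the helper name simulate_season is ours):

import numpy as np

rng = np.random.default_rng(0)

def simulate_season(m, rng):
    """Simulate one season and return the final rank of each team."""
    wins = np.zeros(m, dtype=int)
    for i in range(m):
        for j in range(i + 1, m):
            # team j beats team i with probability 1 / (j - i + 1); the index
            # difference j - i is the same in 0-based and 1-based labeling
            if rng.random() < 1 / (j - i + 1):
                wins[j] += 1
            else:
                wins[i] += 1
    # A team's rank is 1 plus the number of teams with strictly more wins,
    # so tied teams share the better rank, as in the text
    return np.array([1 + np.sum(wins > w) for w in wins])

ranks = np.array([simulate_season(3, rng) for _ in range(2000)])
for team in range(3):
    counts = np.bincount(ranks[:, team], minlength=4)[1:]
    print(f"team {team + 1}: estimated rank pmf {counts / 2000}")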
Table 6.2 illustrates the Monte Carlo approach for m = 3. The approximation is quite coarse if
we only use n = 10 simulated seasons, but becomes very accurate when n = 2, 000. Figure 6.5
Table 6.2: The table on the left shows 10 simulated outcomes of a league of three teams (m = 3) and
the resulting ranks. The tables on the right show the estimated pmf obtained by Monte Carlo simulation
from the simulated outcomes on the left (top) and from 2,000 simulated outcomes (bottom). The exact
values are included in brackets for comparison.
Figure 6.5: The graph on the left shows the time needed to obtain the exact pmf of the final ranks
in Example 6.4.2 and to approximate them by Monte Carlo approximation using 2,000 simulated league
outcomes. The table on the right shows the average error per entry of the Monte Carlo approximation.
Figure 6.6: Approximate pmf of the final ranks in Example 6.4.2 for m = 5, m = 20 and m = 100
teams, using 2,000 simulated league outcomes.
shows the running time needed to compute the exact pmf and to approximate it with the Monte
Carlo approach for different numbers of teams. When the number of teams is very small the
exact computation is very fast, but the running time increases exponentially with m as expected,
so that for 7 teams the computation already takes 5 and a half minutes. In contrast, the Monte
Carlo approximation is dramatically faster. For m = 20 it just takes half a second. Figure 6.6
shows the approximate pmf of the final ranks for 5, 20 and 100 teams. Higher ranks have higher
probabilities because when two teams are tied they are awarded the higher rank.
Chapter 7
Markov Chains
The Markov property is satisfied by any random process for which the future is conditionally
independent from the past given the present.
Definition 7.0.1 (Markov property). A random process satisfies the Markov property if X̃ (t_{i+1})
is conditionally independent of X̃ (t₁), . . . , X̃ (t_{i−1}) given X̃ (tᵢ) for any t₁ < t₂ < . . . < tᵢ < t_{i+1}.
If the state space of the random process is discrete, then for any x₁, x₂, . . . , x_{i+1},
p_{X̃(t_{i+1}) | X̃(t₁), X̃(t₂), ..., X̃(tᵢ)} (x_{i+1} | x₁, x₂, . . . , xᵢ) = p_{X̃(t_{i+1}) | X̃(tᵢ)} (x_{i+1} | xᵢ) . (7.1)
If the state space of the random process is continuous (and the distribution has a joint pdf ),
f_{X̃(t_{i+1}) | X̃(t₁), X̃(t₂), ..., X̃(tᵢ)} (x_{i+1} | x₁, x₂, . . . , xᵢ) = f_{X̃(t_{i+1}) | X̃(tᵢ)} (x_{i+1} | xᵢ) . (7.2)
Figure 7.1 shows the directed graphical model that corresponds to the dependence assumptions
implied by the Markov property. Any iid sequence satisfies the Markov property, since all
conditional pmfs or pdfs are just equal to the marginals (in this case there would be no edges
in the directed acyclic graph of Figure 7.1). The random walk also satisfies the property, since
once we fix where the walk is at a certain time i, the path that it took before i has no influence
on its next steps.
Proof. Let X̃ denote the random walk defined in Section 5.6. Conditioned on X̃ (j) = x_j for
j ≤ i, X̃ (i + 1) equals xᵢ + S̃ (i + 1). This does not depend on x₁, . . . , x_{i−1}, which implies (7.1).
X1 X2 X3 X4 X5 ...
Figure 7.1: Directed graphical model describing the dependence assumptions implied by the Markov
property.
7.1 Time-homogeneous Markov chains
By the Markov property, the joint pmf of a discrete-state Markov chain factors into the product
of the conditional pmfs of each state given the previous one: for any n ≥ 0,
p_{X̃(0), X̃(1), ..., X̃(n)} (x₀, x₁, . . . , xₙ) := Π_{i=0}^{n} p_{X̃(i) | X̃(0), ..., X̃(i−1)} (xᵢ | x₀, . . . , x_{i−1}) (7.3)
= Π_{i=0}^{n} p_{X̃(i) | X̃(i−1)} (xᵢ | x_{i−1}) , (7.4)
where the i = 0 factor is just the marginal pmf of X̃ (0).
If these transition probabilities are the same at every time step (i.e. they are constant and do
not depend on i), then the Markov chain is said to be time homogeneous. In this case, we can
store the probability of each possible transition in an s × s matrix T_X̃, where s is the number of
states,
(T_X̃)_{jk} := p_{X̃(i+1) | X̃(i)} (x_j | x_k) . (7.5)
In this chapter we focus on time-homogeneous finite-state Markov chains. The transition prob-
abilities of these chains can be visualized using a state diagram, which shows each state and the
probability of every possible transition. See Figure 7.2 below for an example. The state diagram
should not be confused with the directed acyclic graph (DAG) that represents the dependence
structure of the model, illustrated in Figure 7.1. In the state diagram, each node corresponds to
a state and the edges to transition probabilities between states, whereas the DAG just indicates
the dependence structure of the random process in time and is usually the same for all Markov
chains.
To simplify notation we define an s-dimensional vector p⃗_{X̃(i)}, called the state vector, which
contains the marginal pmf of the Markov chain at each time i,
p⃗_{X̃(i)} := [ p_{X̃(i)} (x₁), p_{X̃(i)} (x₂), . . . , p_{X̃(i)} (x_s) ]ᵀ . (7.6)
Each entry in the state vector contains the probability that the Markov chain is in that particular
state at time i. It is not the value of the Markov chain, which is a random variable.
The initial state vector p⃗_{X̃(0)} and the transition matrix T_X̃ suffice to completely specify a time-
homogeneous finite-state Markov chain. Indeed, we can compute the joint distribution of the
chain at any n time points i₁, i₂, . . . , iₙ for any n ≥ 1 from p⃗_{X̃(0)} and T_X̃ by applying (7.4) and
marginalizing over any times that we are not interested in. We illustrate this in the following
example.
Figure 7.2: State diagram of the Markov chain described in Example 7.1.1 (top). Each arrow shows
the probability of a transition between the two states. Below we show three realizations of the Markov
chain.
Example 7.1.1 (Car rental). A car-rental company hires you to model the location of their
cars. The company operates in Los Angeles, San Francisco and San Jose. Customers regularly
take a car in a city and drop it off in another. It would be very useful for the company to be
able to compute how likely it is for a car to end up in a given city. You decide to model the
location of the car as a Markov chain, where each time step corresponds to a new customer
taking the car. The company allocates new cars evenly between the three cities. The transition
probabilities, obtained from past data, are collected in the matrix below.
To be clear, the probability that a customer moves the car from San Francisco to LA is 0.2, the
probability that the car stays in San Francisco is 0.6, and so on.
The initial state vector and the transition matrix of the Markov chain are
p⃗_{X̃(0)} := [1/3, 1/3, 1/3]ᵀ , T_X̃ := [0.6 0.1 0.3 ; 0.2 0.8 0.3 ; 0.2 0.1 0.4] , (7.7)
where the rows of T_X̃ are separated by semicolons.
State 1 is assigned to San Francisco, state 2 to Los Angeles and state 3 to San Jose. Figure 7.2
shows a state diagram and some realizations of the Markov chain.
The company wants to find out the probability that the car starts in San Francisco, but is in
San Jose right after the second customer. This is given by
p_{X̃(0), X̃(2)} (1, 3) = Σ_{i=1}^{3} p_{X̃(0), X̃(1), X̃(2)} (1, i, 3) (7.8)
= Σ_{i=1}^{3} p_{X̃(0)} (1) p_{X̃(1) | X̃(0)} (i | 1) p_{X̃(2) | X̃(1)} (3 | i) (7.9)
= ( p⃗_{X̃(0)} )₁ Σ_{i=1}^{3} (T_X̃)_{i1} (T_X̃)_{3i} (7.10)
= (0.6 · 0.2 + 0.2 · 0.1 + 0.2 · 0.4) / 3 ≈ 7.33 · 10^{−2}. (7.11)
The probability is 7.33%.
The following lemma provides a simple expression for the state vector at time i, p⃗_{X̃(i)}, in terms
of T_X̃ and the previous state vector.
Lemma 7.1.2 (State vector and transition matrix). For a Markov chain X̃ with transition
matrix T_X̃,
p⃗_{X̃(i)} = T_X̃ p⃗_{X̃(i−1)} . (7.12)
Iterating this relation yields
p⃗_{X̃(i)} = T_X̃^i p⃗_{X̃(0)} . (7.13)
Equation (7.13) is obtained by applying (7.12) i times and taking into account the Markov
property.
Example 7.1.3 (Car rental (continued)). The company wants to estimate the distribution of
locations right after the 5th customer has used a car. Applying Lemma 7.1.2 we obtain
p⃗_{X̃(5)} = T_X̃^5 p⃗_{X̃(0)} (7.16)
= [0.281, 0.534, 0.185]ᵀ . (7.17)
The model estimates that after 5 customers more than half of the cars are in Los Angeles.
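The computation in (7.16)–(7.17) amounts to five matrix-vector products; a minimal Python check:

import numpy as np

T = np.array([[0.6, 0.1, 0.3],
              [0.2, 0.8, 0.3],
              [0.2, 0.1, 0.4]])   # columns sum to one (SF, LA, SJ)
p = np.array([1/3, 1/3, 1/3])    # new cars are allocated evenly

for _ in range(5):
    p = T @ p                    # Lemma 7.1.2: one customer per step
print(p.round(3))                # approx [0.281, 0.534, 0.185], as in (7.17)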
7.2 Recurrence
The states of a Markov chain can be classified depending on whether the Markov chain is
guaranteed to always return to them or whether it may eventually stop visiting those states.
Definition 7.2.1 (Recurrent and transient states). Let X̃ be a time-homogeneous finite-state
Markov chain. We consider a particular state x. If
P ( X̃ (j) = x for some j > i | X̃ (i) = x ) = 1, (7.18)
then the state is recurrent. In words, given that the Markov chain is at x, the probability that
it returns to x is one. In contrast, if
P ( X̃ (j) ≠ x for all j > i | X̃ (i) = x ) > 0, (7.19)
the state is transient. Given that the Markov chain is at x, there is nonzero probability that it
will never return.
Figure 7.3: State diagram of the Markov chain described in Example 7.2.2 (top). Below we show
three realizations of the Markov chain.
The following example illustrates the difference between recurrent and transient states.
Example 7.2.2 (Employment dynamics). A researcher is interested in modeling the employ-
ment dynamics of young people using a Markov chain.
She determines that at age 18 a person is either a student with probability 0.9 or an intern with
probability 0.1. After that, she estimates the transition probabilities collected in the matrix
below. The Markov assumption is obviously not completely precise: someone who has been a
student for longer is probably less likely to remain a student. But such Markov models are easier
to fit (we only need to estimate the transition probabilities) and often yield useful insights.
The initial state vector and the transition matrix of the Markov chain are
p⃗_{X̃(0)} := [0.9, 0.1, 0, 0]ᵀ ,
T_X̃ := [0.8 0.5 0 0 ; 0.1 0.5 0 0 ; 0.1 0 0.9 0.4 ; 0 0 0.1 0.6] . (7.20)
Figure 7.3 shows the state diagram and some realizations of the Markov chain.
States 1 (student) and 2 (intern) are transient states. Note that the probability that the Markov
chain returns to those states after visiting state 3 (employed) is zero, so
P ( X̃ (j) ≠ 1 for all j > i | X̃ (i) = 1 ) ≥ P ( X̃ (i + 1) = 3 | X̃ (i) = 1 ) (7.21)
= 0.1 > 0, (7.22)
P ( X̃ (j) ≠ 2 for all j > i | X̃ (i) = 2 ) ≥ P ( X̃ (i + 2) = 3 | X̃ (i) = 2 ) (7.23)
= 0.5 · 0.1 > 0. (7.24)
In contrast, states 3 and 4 (unemployed) are recurrent. We prove this for state 3 (the argument
for state 4 is exactly the same). Starting from state 3, avoiding state 3 forever means remaining
at state 4 forever, so
P ( X̃ (j) ≠ 3 for all j > i | X̃ (i) = 3 ) (7.25)
= P ( X̃ (j) = 4 for all j > i | X̃ (i) = 3 ) (7.26)
= lim_{k→∞} P ( X̃ (i + 1) = 4 | X̃ (i) = 3 ) Π_{j=1}^{k} P ( X̃ (i + j + 1) = 4 | X̃ (i + j) = 4 ) (7.27)
= lim_{k→∞} 0.1 · 0.6^k = 0.
In this example, it is not possible to reach the states student and intern from the states employed
or unemployed. Markov chains for which there is a possible transition between any two states
(even if it is not direct) are called irreducible.
One can easily check that the Markov chain in Example 7.1.1 is irreducible, whereas the one
in Example 7.2.2 is not. An important result is that all states in an irreducible Markov chain
are recurrent.
Theorem 7.2.4 (Irreducible Markov chains). All states in an irreducible Markov chain are
recurrent.
Proof. In any finite-state Markov chain there must be at least one state that is recurrent: if
all the states were transient, there would be a nonzero probability that the chain eventually
leaves every state forever, which is not possible. Without loss of generality let us assume that
state x is recurrent. We now
provide a sketch of a proof that another arbitrary state y must also be recurrent. To alleviate
notation let
p_{x,x} := P ( X̃ (j) = x for some j > i | X̃ (i) = x ) , (7.31)
p_{x,y} := P ( X̃ (j) = y for some j > i | X̃ (i) = x ) , (7.32)
p_{y,x} := P ( X̃ (j) = x for some j > i | X̃ (i) = y ) . (7.33)
The chain is irreducible so there is a nonzero probability pm > 0 of reaching y from x in at most
m steps for some m > 0. The probability that the chain goes from x to y and never goes back
to x is consequently at least pm (1 − py,x ). However, x is recurrent, so this probability must be
zero! Since pm > 0 this implies py,x = 1.
Consider the following event:
1. X̃ goes from y to x.
2. X̃ does not return to y in m steps after reaching x.
3. X̃ eventually reaches x again at a time m′ > m.
The probability of this event is equal to p_{y,x} (1 − p_m) p_{x,x} = 1 − p_m (recall that x is recurrent,
so p_{x,x} = 1, and we showed above that p_{y,x} = 1). Now imagine that steps 2 and 3 repeat k
times, i.e. that X̃ fails to go from x to y in m steps k times. The probability of this event is
p_{y,x} (1 − p_m)^k p_{x,x}^k = (1 − p_m)^k. Taking k → ∞ this is equal to zero for any m, so the
probability that X̃ never returns to y must be zero (this can be made rigorous, but the details
are beyond the scope of these notes).
Figure 7.4: State diagram of a Markov chain where the states have period two.
7.3 Periodicity
Another important consideration is whether the Markov chain always visits a given state at
regular intervals. If this is the case, then the state has a period greater than one.
Figure 7.4 shows a Markov chain where the states have a period equal to two. Aperiodic Markov
chains do not contain states with periods greater than one.
Definition 7.3.2 (Aperiodic Markov chain). A time-homogeneous finite-state Markov chain X̃
is aperiodic if all states have period equal to one.
The Markov chains in Examples 7.1.1 and 7.2.2 are both aperiodic.
7.4 Convergence
In this section we study under what conditions a finite-state time-homogeneous Markov chain
X̃ converges in distribution. If a Markov chain converges in distribution, then its state vector
p⃗_{X̃(i)}, which contains the first-order pmf of X̃, converges to a fixed vector p⃗_∞,
lim_{i→∞} p⃗_{X̃(i)} = p⃗_∞. (7.34)
In that case the probability of the Markov chain being in each state eventually tends to a fixed
value (which does not imply that the Markov chain will stay at a given state!).
By Lemma 7.1.2 we can express (7.34) in terms of the initial state vector and the transition
matrix of the Markov chain:
p⃗_∞ = lim_{i→∞} T_X̃^i p⃗_{X̃(0)}. (7.35)
Computing this limit analytically for a particular T_X̃ and p⃗_{X̃(0)} may seem challenging at first
sight. However, it is often possible to leverage the eigendecomposition of the transition matrix
(if it exists) to find p⃗_∞. This is illustrated in the following example.
Figure 7.5: State diagram of the Markov chain described in Example 7.4.1 (top). Below we show
three realizations of the Markov chain.
Example 7.4.1 (Mobile phones). A company that makes mobile phones wants to model the
sales of a new model they have just released. At the moment 90% of the phones are in stock,
10% have been sold locally and none have been exported. Based on past data, the company
determines that each day a phone is sold with probability 0.2 and exported with probability 0.1.
The initial state vector and the transition matrix of the Markov chain are
~a := [0.9, 0.1, 0]ᵀ , T_X̃ = [0.7 0 0 ; 0.2 1 0 ; 0.1 0 1] , (7.36)
where state 1 corresponds to the phone being in stock, state 2 to being sold locally and state 3
to being exported.
Figure 7.6: Evolution of the state vector of the Markov chain in Example 7.4.1 for different values of
the initial state vector p⃗_{X̃(0)} (~a, ~b and ~c, from left to right).
Determining the limit of the state vector is equivalent to computing
lim_{i→∞} T_X̃^i ~a.
The transition matrix T_X̃ has eigenvalues λ₁ = 1, λ₂ = 1 and λ₃ = 0.7, with corresponding
eigenvectors
~q₁ := [0, 0, 1]ᵀ , ~q₂ := [0, 1, 0]ᵀ , ~q₃ := [0.802, −0.535, −0.267]ᵀ ,
which we gather in a matrix Q := [~q₁ ~q₂ ~q₃]. It will be useful to express the initial state vector
~a in terms of the different eigenvectors. This is achieved by computing
Q^{−1} p⃗_{X̃(0)} = [0.3, 0.7, 1.122]ᵀ , (7.42)
so that
~a = 0.3 ~q₁ + 0.7 ~q₂ + 1.122 ~q₃. (7.43)
We conclude that
lim_{i→∞} T_X̃^i ~a = lim_{i→∞} T_X̃^i (0.3 ~q₁ + 0.7 ~q₂ + 1.122 ~q₃) (7.44)
= lim_{i→∞} ( 0.3 T_X̃^i ~q₁ + 0.7 T_X̃^i ~q₂ + 1.122 T_X̃^i ~q₃ ) (7.45)
= lim_{i→∞} ( 0.3 λ₁^i ~q₁ + 0.7 λ₂^i ~q₂ + 1.122 λ₃^i ~q₃ ) (7.46)
= lim_{i→∞} ( 0.3 ~q₁ + 0.7 ~q₂ + 1.122 · 0.7^i ~q₃ ) (7.47)
= 0.3 ~q₁ + 0.7 ~q₂ (7.48)
= [0, 0.7, 0.3]ᵀ . (7.49)
This means that eventually the probability that each phone has been sold locally is 0.7 and the
probability that it has been exported is 0.3. The left graph in Figure 7.6 shows the evolution of
the state vector. As predicted, it eventually converges to the vector in equation (7.49).
In general, because of the special structure of the two eigenvectors with eigenvalue equal to one
in this example, we have
lim_{i→∞} T_X̃^i p⃗_{X̃(0)} = [ 0, (Q^{−1} p⃗_{X̃(0)})₂, (Q^{−1} p⃗_{X̃(0)})₁ ]ᵀ . (7.50)
This is illustrated in Figure 7.6, where you can see the evolution of the state vector if it is
initialized to these other two distributions:
~b := [0.6, 0, 0.4]ᵀ , Q^{−1} ~b = [0.6, 0.4, 0.75]ᵀ , (7.51)
~c := [0.4, 0.5, 0.1]ᵀ , Q^{−1} ~c = [0.23, 0.77, 0.50]ᵀ . (7.52)
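The eigendecomposition argument can be verified numerically. The following Python sketch computes the coefficients of ~a in the eigenvector basis and keeps only the components with eigenvalue one (NumPy may order and normalize the eigenvectors differently, but the limit is the same):

import numpy as np

T = np.array([[0.7, 0.0, 0.0],
              [0.2, 1.0, 0.0],
              [0.1, 0.0, 1.0]])
a = np.array([0.9, 0.1, 0.0])

eigvals, Q = np.linalg.eig(T)
coeffs = np.linalg.solve(Q, a)  # coordinates of a in the eigenvector basis
# Components with |eigenvalue| < 1 vanish as i grows
limit = sum(c * q for c, q, l in zip(coeffs, Q.T, eigvals) if np.isclose(l, 1))
print(limit.real.round(3))      # approx [0, 0.7, 0.3], matching (7.49)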
The transition matrix of the Markov chain in Example 7.4.1 has two eigenvectors with eigenvalue
equal to one. If we set the initial state vector to equal either of these eigenvectors (note that we
must make sure to normalize them so that the state vector contains a valid pmf), then
T_X̃ p⃗_{X̃(0)} = p⃗_{X̃(0)}, (7.53)
so that
p⃗_{X̃(i)} = T_X̃^i p⃗_{X̃(0)} (7.54)
= p⃗_{X̃(0)} (7.55)
for all i: the first-order pmf of the chain never changes. Such a distribution is called a stationary
distribution of the chain. Formally, a distribution with state vector p⃗ is stationary with respect
to X̃ if
T_X̃ p⃗ = p⃗. (7.57)
Establishing whether a distribution is stationary by checking whether (7.57) holds may be chal-
lenging computationally if the state space is very large. We now derive an alternative condition
that implies stationarity. Let us first define reversibility of Markov chains.
Definition 7.4.3 (Reversibility). Let X̃ be a finite-state time-homogeneous Markov chain with
s states and transition matrix T_X̃. Assume that X̃ (i) is distributed according to the state vector
p⃗ ∈ R^s. If
P ( X̃ (i) = x_j, X̃ (i + 1) = x_k ) = P ( X̃ (i) = x_k, X̃ (i + 1) = x_j ) for all 1 ≤ j, k ≤ s, (7.58)
then the chain is said to be reversible with respect to p⃗. In terms of the transition matrix, this
is equivalent to the detailed-balance condition
(T_X̃)_{kj} p⃗_j = (T_X̃)_{jk} p⃗_k for all 1 ≤ j, k ≤ s. (7.59)
As proved in the following theorem, reversibility implies stationarity, but the converse does not
hold. A Markov chain is not necessarily reversible with respect to a stationary distribution (and
often will not be). The detailed-balance condition therefore only provides a sufficient condition
for stationarity.
Theorem 7.4.4 (Reversibility implies stationarity). If a time-homogeneous Markov chain X e is
e
reversible with respect to a distribution pX , then pX is a stationary distribution of X.
Proof. Let p⃗ be the state vector containing p_X. By assumption T_X̃ and p⃗ satisfy (7.59), so for
1 ≤ j ≤ s,
(T_X̃ p⃗)_j = Σ_{k=1}^{s} (T_X̃)_{jk} p⃗_k (7.60)
= Σ_{k=1}^{s} (T_X̃)_{kj} p⃗_j (7.61)
= p⃗_j Σ_{k=1}^{s} (T_X̃)_{kj} (7.62)
= p⃗_j. (7.63)
Figure 7.7: Evolution of the state vector of the Markov chain in Example 7.4.7 for the initial state
vectors p⃗_{X̃(0)} = [1/3, 1/3, 1/3]ᵀ, p⃗_{X̃(0)} = [1, 0, 0]ᵀ and p⃗_{X̃(0)} = [0.1, 0.2, 0.7]ᵀ.
The last step follows from the fact that the columns of a valid transition matrix must add to
one (the chain always has to go somewhere).
In Example 7.4.1 the Markov chain has two stationary distributions. It turns out that this is
not possible for irreducible Markov chains.
Theorem 7.4.5 (Uniqueness of the stationary distribution). An irreducible Markov chain has
exactly one stationary distribution.
Proof. This follows from the Perron-Frobenius theorem, which states that the transition ma-
trix of an irreducible Markov chain has a single eigenvector with eigenvalue equal to one and
nonnegative entries.
If the chain is in addition aperiodic, it converges in distribution to this stationary distribution.
Theorem 7.4.6 (Convergence of irreducible, aperiodic Markov chains). A time-homogeneous
finite-state Markov chain that is irreducible and aperiodic converges in distribution to its unique
stationary distribution.
Example 7.4.7 (Car rental (continued)). The Markov chain in the car-rental example is irre-
ducible and aperiodic. We will now check that it indeed converges in distribution. Its transition
matrix has the following eigenvectors:
~q₁ := [0.273, 0.545, 0.182]ᵀ , ~q₂ := [−0.577, 0.789, −0.211]ᵀ , ~q₃ := [−0.577, −0.211, 0.789]ᵀ . (7.64)
For any initial state vector, the component that is collinear with ~q₁ will be preserved by the
transitions of the Markov chain, but the other two components will become negligible after
a while. The chain consequently converges in distribution to a random variable with pmf ~q₁
(note that ~q₁ has been normalized to be a valid pmf), as predicted by Theorem 7.4.6. This is
illustrated in Figure 7.7. No matter how the company allocates the new cars, eventually 27.3%
will end up in San Francisco, 54.5% in LA and 18.2% in San Jose.
7.5 Markov-chain Monte Carlo
Algorithm 7.5.1 (Metropolis-Hastings algorithm). We store the pmf p_X of the target distri-
bution in a vector p⃗ ∈ R^s, such that
p⃗_j := p_X (x_j), 1 ≤ j ≤ s. (7.65)
Let T denote the transition matrix of an irreducible Markov chain with the same state space
{x₁, . . . , x_s} as p⃗.
Initialize X̃ (0) randomly or to a fixed state, then repeat the following steps for i = 1, 2, 3, . . ..
1. Generate a candidate C following the transition probabilities of the auxiliary Markov chain:
P ( C = x_k | X̃ (i − 1) = x_j ) := T_{kj}. (7.66)
2. Set
(
e (i − 1) , C ,
e (i) := C
X
with probability pacc X
(7.67)
Xe (i − 1) otherwise,
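In code, each iteration amounts to a candidate draw followed by a biased coin flip. The following is a minimal sketch for a finite state space; the target pmf and proposal matrix in the usage example are illustrative placeholders, not taken from the notes.

```python
import numpy as np

def metropolis_hastings(p, T, n_iters, rng=None):
    """Run the Metropolis-Hastings chain targeting the pmf p, using the
    column-stochastic proposal transition matrix T."""
    rng = np.random.default_rng() if rng is None else rng
    s = len(p)
    x = rng.integers(s)                  # random initialization
    chain = np.empty(n_iters, dtype=int)
    for i in range(n_iters):
        c = rng.choice(s, p=T[:, x])     # candidate drawn from column x of T
        p_acc = min(T[x, c] * p[c] / (T[c, x] * p[x]), 1.0)
        if rng.uniform() < p_acc:        # accept with probability p_acc
            x = c
        chain[i] = x
    return chain

# Example usage with a 3-state target pmf and uniform proposals:
p = np.array([0.2, 0.3, 0.5])
T = np.full((3, 3), 1.0 / 3)             # column-stochastic uniform proposals
chain = metropolis_hastings(p, T, 50_000)
print(np.bincount(chain[1000:], minlength=3) / len(chain[1000:]))  # close to p
```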
It turns out that this algorithm yields a Markov chain that is reversible with respect to the
distribution of interest, which ensures that the distribution is stationary.
Theorem 7.5.2. The pmf in $\vec{p}$ corresponds to a stationary distribution of the Markov chain $\tilde{X}$
obtained by the Metropolis-Hastings algorithm.
Proof. We show that the detailed-balance condition $(T_{\tilde{X}})_{kj} \, \vec{p}_j = (T_{\tilde{X}})_{jk} \, \vec{p}_k$
holds for all $1 \le j, k \le s$. This establishes the result by Theorem 7.4.4. The detailed-balance
condition holds trivially if $j = k$. If $j \ne k$ we have
$$(T_{\tilde{X}})_{kj} := P(\tilde{X}(i) = k \mid \tilde{X}(i-1) = j) \qquad (7.70)$$
$$= P(\tilde{X}(i) = C, C = k \mid \tilde{X}(i-1) = j) \qquad (7.71)$$
$$= P(\tilde{X}(i) = C \mid C = k, \tilde{X}(i-1) = j) \, P(C = k \mid \tilde{X}(i-1) = j) \qquad (7.72)$$
$$= p_{\mathrm{acc}}(j, k) \, T_{kj}, \qquad (7.73)$$

and by exactly the same argument $(T_{\tilde{X}})_{jk} = p_{\mathrm{acc}}(k, j) \, T_{jk}$. We conclude that

$$(T_{\tilde{X}})_{kj} \, \vec{p}_j = p_{\mathrm{acc}}(j, k) \, T_{kj} \, \vec{p}_j \qquad (7.74)$$
$$= T_{kj} \, \vec{p}_j \min\left\{ \frac{T_{jk} \, \vec{p}_k}{T_{kj} \, \vec{p}_j}, 1 \right\} \qquad (7.75)$$
$$= \min\{T_{jk} \, \vec{p}_k, \, T_{kj} \, \vec{p}_j\} \qquad (7.76)$$
$$= T_{jk} \, \vec{p}_k \min\left\{ 1, \frac{T_{kj} \, \vec{p}_j}{T_{jk} \, \vec{p}_k} \right\} \qquad (7.77)$$
$$= p_{\mathrm{acc}}(k, j) \, T_{jk} \, \vec{p}_k \qquad (7.78)$$
$$= (T_{\tilde{X}})_{jk} \, \vec{p}_k. \qquad (7.79)$$
The following example is taken from Hastings’s seminal paper Monte Carlo Sampling Methods
Using Markov Chains and Their Applications.
Example 7.5.3 (Generating a Poisson random variable). Our aim is to generate a Poisson
random variable $X$. Note that we don't need to know the normalizing constant in the Poisson
pmf, which equals $e^{\lambda}$, as long as we know that the pmf is proportional to

$$p_X(x) \propto \frac{\lambda^x}{x!}. \qquad (7.80)$$
The auxiliary Markov chain must be able to reach any possible value of $X$, i.e. all nonnegative
integers. We will use a modified random walk that takes steps upwards and downwards with
probability 1/2, but never goes below 0. Its transition matrix equals

$$T_{kj} := \begin{cases} \frac{1}{2} & \text{if } j = 0 \text{ and } k = 0, \\ \frac{1}{2} & \text{if } k = j + 1, \\ \frac{1}{2} & \text{if } j > 0 \text{ and } k = j - 1, \\ 0 & \text{otherwise}. \end{cases} \qquad (7.81)$$
To compute the acceptance probability, we only consider transitions that are possible under the
random walk. If $j = 0$ and $k = 0$, the candidate equals the current state, so the acceptance probability equals one and the chain stays at zero.
If $k = j + 1$,

$$p_{\mathrm{acc}}(j, j+1) = \min\left\{ \frac{\lambda^{j+1}/(j+1)!}{\lambda^{j}/j!}, 1 \right\} \qquad (7.85)$$
$$= \min\left\{ \frac{\lambda}{j+1}, 1 \right\}. \qquad (7.86)$$
If $k = j - 1$,

$$p_{\mathrm{acc}}(j, j-1) = \min\left\{ \frac{\lambda^{j-1}/(j-1)!}{\lambda^{j}/j!}, 1 \right\} \qquad (7.87)$$
$$= \min\left\{ \frac{j}{\lambda}, 1 \right\}. \qquad (7.88)$$
We now spell out the steps of the Metropolis-Hastings method. To simulate the auxiliary random
walk we use a sequence of Bernoulli random variables that indicate whether the random walk
is trying to go up or down (or stay at zero). We initialize the chain at x0 = 0. Then, for
i = 1, 2, . . ., we
• Generate a sample b from a Bernoulli distribution with parameter 1/2 and a sample u
uniformly distributed in [0, 1].
• If $b = 0$:
  – If $x_{i-1} = 0$, $x_i := 0$.
  – If $x_{i-1} > 0$:
    ∗ If $u < \frac{x_{i-1}}{\lambda}$, $x_i := x_{i-1} - 1$.
    ∗ Otherwise $x_i := x_{i-1}$.
• If $b = 1$:
  – If $u < \frac{\lambda}{x_{i-1}+1}$, $x_i := x_{i-1} + 1$.
  – Otherwise $x_i := x_{i-1}$.
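These steps translate directly into code. A minimal sketch (the function name and usage example are ours):

```python
import numpy as np

def poisson_mh_samples(lam, n_iters, rng=None):
    """Metropolis-Hastings chain targeting a Poisson(lam) pmf,
    using the reflected random walk of Example 7.5.3."""
    rng = np.random.default_rng() if rng is None else rng
    x = 0                            # initialize the chain at zero
    chain = np.empty(n_iters, dtype=int)
    for i in range(n_iters):
        b = rng.integers(2)          # Bernoulli(1/2): try down (0) or up (1)
        u = rng.uniform()            # uniform sample for accept/reject
        if b == 0:
            if x > 0 and u < x / lam:      # p_acc(x, x-1) = min(x/lam, 1)
                x -= 1
            # if x == 0 the walk stays at zero
        else:
            if u < lam / (x + 1):          # p_acc(x, x+1) = min(lam/(x+1), 1)
                x += 1
        chain[i] = x
    return chain

# Empirical pmf after discarding a burn-in period, for lambda = 6:
chain = poisson_mh_samples(lam=6, n_iters=10_000)
values, counts = np.unique(chain[100:], return_counts=True)
print(dict(zip(values.tolist(), (counts / counts.sum()).round(3))))
```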
Figure 7.8: Convergence in distribution of the Markov chain constructed in Example 7.5.3 for λ := 6.
To prevent clutter we only plot the empirical distribution of 6 states (0 to 5) over 10⁰ to 10³ iterations, computed by running the Markov
chain 10⁴ times.
The Markov chain that we have built is irreducible: there is nonzero probability of going from
any nonnegative integer to any other nonnegative integer (although it could take a while!). We
have not really proved that the chain converges to the desired distribution, since we have
not discussed convergence of Markov chains with infinite state spaces, but Figure 7.8 shows that
the method indeed allows us to sample from a Poisson distribution with λ := 6.
4
For the example in Figure 7.8, approximate convergence in distribution occurs after around 100
iterations. This is called the mixing time of the Markov chain. To account for it, MCMC
methods usually discard the samples from the chain over an initial period known as burn-in
time.
The careful reader might be wondering about the point of using MCMC methods if we already
have access to the desired distribution. It seems much simpler to just apply the method described
in Section 2.6.1 instead. However, the Metropolis-Hastings method can be applied to discrete
distributions with infinite supports and also to continuous distributions (justifying this is beyond
the scope of these notes). Crucially, in contrast with inverse-transform and rejection sampling,
Metropolis-Hastings does not require having access to the pmf pX or pdf fX of the target
distribution, but rather to the ratio $p_X(x)/p_X(y)$ or $f_X(x)/f_X(y)$ for every $x \ne y$. This is
very useful when computing conditional distributions within probabilistic models.
Imagine that we have access to the marginal distribution of a continuous random variable $A$ and
the conditional distribution of another continuous random variable $B$ given $A$. Computing the
conditional pdf

$$f_{A|B}(a|b) = \frac{f_A(a) \, f_{B|A}(b|a)}{\int_u f_A(u) \, f_{B|A}(b|u) \, du}$$

is not necessarily feasible due to the integral in the denominator. However, if we apply Metropolis-Hastings to sample from $f_{A|B}$ we don't need to compute the normalizing factor, since for any
$a_1 \ne a_2$

$$\frac{f_{A|B}(a_1|b)}{f_{A|B}(a_2|b)} = \frac{f_A(a_1) \, f_{B|A}(b|a_1)}{f_A(a_2) \, f_{B|A}(b|a_2)}.$$
Chapter 8

Descriptive statistics
In this chapter we describe several techniques for visualizing data, as well as for computing
quantities that summarize it effectively. Such quantities are known as descriptive statistics. As
we will see in the following chapters, these statistics can often be interpreted within a proba-
bilistic framework, but they are also useful when probabilistic assumptions are not warranted.
Because of this, we present them from a deterministic point of view.
8.1 Histogram
We begin by considering data sets containing one-dimensional data. One of the most natural
ways of visualizing 1D data is to plot their histogram. The histogram is obtained by binning the
range of the data and counting the number of instances that fall within each bin. The width of
the bins is a parameter that can be adjusted to yield higher or lower resolution. If we interpret
the data as corresponding to samples from a random variable, then the histogram would be a
piecewise constant approximation to their pmf or pdf.
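As a minimal sketch of the binning-and-counting procedure (with synthetic data standing in for the actual measurements):

```python
import numpy as np

# Bin the range of the data and count the instances falling in each bin.
data = np.random.default_rng(0).normal(loc=20.0, scale=2.0, size=150)
counts, edges = np.histogram(data, bins=15)   # the bin width is a tunable parameter
# Normalizing by n times the bin width yields a piecewise constant
# approximation to the underlying pdf.
density = counts / (len(data) * np.diff(edges))
```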
Figure 8.1 shows two histograms computed from temperature data gathered at a weather station
in Oxford over 150 years.1 Each data point represents the maximum temperature recorded in
January or August of a particular year. Figure 8.2 shows a histogram of the GDP per capita of
all countries in the world in 2014 according to the United Nations.2
Definition 8.2.1 (Sample mean). Let $\{x_1, x_2, \ldots, x_n\}$ be a set of real-valued data. The sample
mean is defined as

$$\operatorname{av}(x_1, x_2, \ldots, x_n) := \frac{1}{n} \sum_{i=1}^{n} x_i. \qquad (8.1)$$

¹ The data is available at http://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/oxforddata.txt.
² The data is available at http://unstats.un.org/unsd/snaama/selbasicFast.asp.
Figure 8.1: Histograms of temperature data taken in a weather station in Oxford over 150 years (horizontal axis in degrees Celsius). Each
data point equals the maximum temperature recorded in a certain month (January or August) in a particular year.
Figure 8.2: Histogram of the GDP per capita of all countries in the world in 2014 (horizontal axis in thousands of dollars).
Figure 8.3: Effect of centering a two-dimensional data set. The axes are depicted using dashed lines.
Let $\{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$ be a set of d-dimensional real-valued data vectors. The sample mean is

$$\operatorname{av}(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n) := \frac{1}{n} \sum_{i=1}^{n} \vec{x}_i. \qquad (8.2)$$
The sample mean of the data in Figure 8.1 is 6.73 ◦ C in January and 21.3 ◦ C in August. The
sample mean of the GDPs per capita in Figure 8.2 is $16,500.
Geometrically, the average, also known as the sample mean, is the center of mass of the data. A
common preprocessing step in data analysis is to center a set of data by subtracting its sample
mean. Figure 8.3 shows an example.
Algorithm 8.2.2 (Centering). Let $\vec{x}_1, \ldots, \vec{x}_n$ be a set of d-dimensional data. To center the
data set we:

1. Compute the sample mean $\operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n)$.
2. Set $\vec{y}_i := \vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n)$, for $i = 1, \ldots, n$.

The resulting data set $\vec{y}_1, \ldots, \vec{y}_n$ has sample mean equal to zero; it is centered at the origin.
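In code, centering is a one-line operation (synthetic data below):

```python
import numpy as np

X = np.random.default_rng(1).normal(size=(100, 2))  # rows are data points
Y = X - X.mean(axis=0)                              # subtract the sample mean
assert np.allclose(Y.mean(axis=0), 0.0)             # centered at the origin
```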
The sample variance is the average of the squared deviations from the sample mean. Geomet-
rically, it quantifies the average variation of the data set around its center. It is a deterministic
counterpart to the variance of a random variable.
Definition 8.2.3 (Sample variance and standard deviation). Let $\{x_1, x_2, \ldots, x_n\}$ be a set of
real-valued data. The sample variance is defined as

$$\operatorname{var}(x_1, x_2, \ldots, x_n) := \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \operatorname{av}(x_1, x_2, \ldots, x_n))^2. \qquad (8.4)$$

The sample standard deviation is the square root of the sample variance,

$$\operatorname{std}(x_1, x_2, \ldots, x_n) := \sqrt{\operatorname{var}(x_1, x_2, \ldots, x_n)}. \qquad (8.5)$$
You might be wondering why the normalizing constant is 1/ (n − 1) instead of 1/n. The reason
is that this ensures that the expectation of the sample variance equals the true variance when
the data are iid (see Lemma 9.2.5). In practice there is not much difference between the two
normalizations.
The sample standard deviation of the temperature data in Figure 8.1 is 1.99 ◦ C in January and
1.73 ◦ C in August. The sample standard deviation of the GDP data in Figure 8.2 is $25,300.
Definition 8.3.1 (Quantiles and percentiles). Let $x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}$ denote the ordered
elements of a set of data $\{x_1, x_2, \ldots, x_n\}$. The $q$ quantile of the data for $0 < q < 1$ is $x_{([q(n+1)])}$,
where $[q(n+1)]$ is the result of rounding $q(n+1)$ to the closest integer. The $q$ quantile is also
known as the $100q$ percentile.
The 0.25 and 0.75 quantiles are known as the first and third quartiles, whereas the 0.5 quantile
is known as the sample median. A quarter of the data are smaller than the 0.25 quantile, half
are smaller (or larger) than the median, and three quarters are smaller than the 0.75 quantile. If
$n$ is even, the sample median is usually set to

$$\frac{x_{(n/2)} + x_{(n/2+1)}}{2}. \qquad (8.6)$$

The difference between the third and the first quartiles is known as the interquartile range
(IQR).
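A minimal sketch of Definition 8.3.1 (the function name is ours; for even n the median would be averaged as in (8.6)):

```python
import numpy as np

def quantile(data, q):
    """q quantile per Definition 8.3.1: the [q(n+1)]-th order statistic,
    with [.] denoting rounding to the closest integer."""
    x = np.sort(np.asarray(data))
    n = len(x)
    idx = int(round(q * (n + 1)))   # 1-indexed order statistic
    idx = min(max(idx, 1), n)       # guard against the ends of the range
    return x[idx - 1]

data = [3.1, 7.4, 2.2, 9.8, 5.5, 4.1, 6.0]
q1, med, q3 = (quantile(data, q) for q in (0.25, 0.5, 0.75))
iqr = q3 - q1                        # interquartile range
```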
It turns out that for the temperature data set in Figure 8.1 the sample median is 6.80 ◦ C in
January and 21.2 ◦ C in August, which is essentially the same as the sample mean. The IQR is
Figure 8.4: Box plots of the Oxford temperature data set used in Figure 8.1 (vertical axis in degrees Celsius). Each box plot corresponds
to the maximum temperature in a particular month (January, April, August and November) over the
last 150 years.
2.9 °C in January and 2.1 °C in August, which indicates a spread around the median very similar
to the spread around the sample mean. In this particular example, there does not seem to be an
advantage in using order statistics.
For the GDP data set, the median is $6,350. This means that half of the countries have a GDP
of less than $6,350. In contrast, 71% of the countries have a GDP per capita lower than the
sample mean! The IQR of these data is $18,200. To provide a more complete description of the
data set, we can list a five-number summary of order statistics: the minimum x(1) , the first
quartile, the sample median, the third quartile and the maximum x(n) . For the GDP data set
these are $130, $1,960, $6,350, $20,100, and $188,000 respectively.
We can visualize the main order statistics of a data set by using a box plot, which shows the
median value of the data enclosed in a box. The bottom and top of the box are the first and
third quartiles. This way of visualizing a data set was proposed by the mathematician John
Tukey. Tukey’s box plot also includes whiskers. The lower whisker is a line extending from the
bottom of the box to the smallest value within 1.5 IQR of the first quartile. The upper whisker
extends from the top of the box to the largest value within 1.5 IQR of the third quartile. Values
beyond the whiskers are considered outliers and are plotted separately.
Figure 8.4 applies box plots to visualize the temperature data set used in Figure 8.1. Each box
plot corresponds to the maximum temperature in a particular month (January, April, August
and November) over the last 150 years. The box plots allow us to quickly compare the spread
of temperatures in the different months. Figure 8.5 shows a box plot of the GDP data from
Figure 8.2. From the box plot it is immediately apparent that most countries have very small
GDPs per capita, that the spread between countries increases for larger GDPs per capita and
that a small number of countries have very large GDPs per capita.
Figure 8.5: Box plot of the GDP per capita of all countries in the world in 2014 (vertical axis in thousands of dollars). Not all of the outliers
are shown.
In order to take into account that each individual feature may vary on a different scale, a common
preprocessing step is to normalize each feature, dividing it by its sample standard deviation.
Figure 8.6: Scatterplot of the temperature in January and in August (left, ρ = 0.269) and of the maximum and
minimum monthly temperature (right, ρ = 0.962) in Oxford over the last 150 years.
If we normalize before computing the covariance, we obtain the sample correlation coefficient
of the two features. One of the advantages of the correlation coefficient is that we don’t need
to worry about the units in which the features are measured. In contrast, measuring a feature
representing distance in inches or miles can severely distort the covariance, if we don’t scale the
other feature accordingly.
Definition 8.4.2 (Sample correlation coefficient). Let $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ be a data
set where each example consists of two features. The sample correlation coefficient is defined as

$$\rho((x_1, y_1), \ldots, (x_n, y_n)) := \frac{\operatorname{cov}((x_1, y_1), \ldots, (x_n, y_n))}{\operatorname{std}(x_1, \ldots, x_n) \, \operatorname{std}(y_1, \ldots, y_n)}.$$

By the Cauchy-Schwarz inequality (Theorem B.2.4), which states that for any vectors $\vec{a}$ and $\vec{b}$

$$-1 \le \frac{\vec{a}^T \vec{b}}{\|\vec{a}\|_2 \|\vec{b}\|_2} \le 1, \qquad (8.9)$$

the sample correlation coefficient always lies between −1 and 1.
In order to characterize the variation of a multidimensional data set around its center, we
consider its variation in different directions. The average variation of the data in a certain
direction is quantified by the sample variance of the projections of the data onto that direction.
Let $\vec{v}$ be a unit-norm vector aligned with a direction of interest; the sample variance of the data
set in the direction of $\vec{v}$ is given by

$$\operatorname{var}(\vec{v}^T \vec{x}_1, \ldots, \vec{v}^T \vec{x}_n) = \frac{1}{n-1} \sum_{i=1}^{n} \left( \vec{v}^T \vec{x}_i - \operatorname{av}(\vec{v}^T \vec{x}_1, \ldots, \vec{v}^T \vec{x}_n) \right)^2 \qquad (8.12)$$
$$= \frac{1}{n-1} \sum_{i=1}^{n} \left( \vec{v}^T (\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n)) \right)^2 \qquad (8.13)$$
$$= \vec{v}^T \left( \frac{1}{n-1} \sum_{i=1}^{n} (\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n)) (\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n))^T \right) \vec{v}$$
$$= \vec{v}^T \, \Sigma(\vec{x}_1, \ldots, \vec{x}_n) \, \vec{v}. \qquad (8.14)$$
Using the sample covariance matrix we can express the variation in every direction! This is
a deterministic analog of the fact that the covariance matrix of a random vector encodes its
variance in every direction.
Figure 8.7: PCA of a set consisting of n = 100 two-dimensional data points with different configurations; the first principal direction $\vec{u}_1$ is shown in each case.
Theorem 8.5.2. Let the sample covariance matrix of a set of vectors, $\Sigma(\vec{x}_1, \ldots, \vec{x}_n)$, have an eigendecomposition given by (8.15), where the eigenvalues are ordered $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$. Then,

$$\lambda_1 = \max_{\|\vec{v}\|_2 = 1} \operatorname{var}(\vec{v}^T \vec{x}_1, \ldots, \vec{v}^T \vec{x}_n), \qquad (8.16)$$
$$\vec{u}_1 = \arg\max_{\|\vec{v}\|_2 = 1} \operatorname{var}(\vec{v}^T \vec{x}_1, \ldots, \vec{v}^T \vec{x}_n), \qquad (8.17)$$
$$\lambda_k = \max_{\|\vec{v}\|_2 = 1, \ \vec{v} \perp \vec{u}_1, \ldots, \vec{u}_{k-1}} \operatorname{var}(\vec{v}^T \vec{x}_1, \ldots, \vec{v}^T \vec{x}_n), \qquad (8.18)$$
$$\vec{u}_k = \arg\max_{\|\vec{v}\|_2 = 1, \ \vec{v} \perp \vec{u}_1, \ldots, \vec{u}_{k-1}} \operatorname{var}(\vec{v}^T \vec{x}_1, \ldots, \vec{v}^T \vec{x}_n). \qquad (8.19)$$
This means that ~u1 is the direction of maximum variation. The eigenvector ~u2 corresponding
to the second largest eigenvalue λ2 is the direction of maximum variation that is orthogonal
to ~u1 . In general, the eigenvector ~uk corresponding to the kth largest eigenvalue λk reveals
the direction of maximum variation that is orthogonal to $\vec{u}_1, \vec{u}_2, \ldots, \vec{u}_{k-1}$. Finally, $\vec{u}_n$ is the
direction of minimum variation.
In data analysis, the eigenvectors of the sample covariance matrix are usually called principal
directions. Computing these eigenvectors to quantify the variation of a data set in different
directions is called principal component analysis (PCA). Figure 8.7 shows the principal
directions for several 2D examples.
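A minimal numpy sketch of this computation (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # skewed cloud
Xc = X - X.mean(axis=0)                  # center first (see Figure 8.8)
Sigma = Xc.T @ Xc / (len(X) - 1)         # sample covariance matrix
eigvals, U = np.linalg.eigh(Sigma)       # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # sort into lambda_1 >= lambda_2 >= ...
eigvals, U = eigvals[order], U[:, order]
u1 = U[:, 0]    # first principal direction: direction of maximum variation
```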
Figure 8.8 illustrates the importance of centering before applying PCA. Theorem 8.5.2 still holds
if the data are not centered. However, the norm of the projection onto a certain direction no
longer reflects the variation of the data. In fact, if the data are concentrated around a point
that is far from the origin, the first principal direction tends to be aligned with that point. This
makes sense, as projecting onto that direction captures more energy. As a result, the principal
directions do not reflect the directions of maximum variation within the cloud of data. Centering
the data set before applying PCA solves the issue.
The following example explains how to apply principal component analysis to dimensionality re-
duction. The motivation is that in many cases directions of higher variation are more informative
about the structure of the data set.
Figure 8.8: PCA applied to n = 100 2D data points. On the left the data are not centered (λ₁/n = 25.78, λ₂/n = 0.790); as a result
the dominant principal direction $\vec{u}_1$ lies in the direction of the mean of the data and PCA does not reflect
the actual structure. Once we center (λ₁/n = 1.590, λ₂/n = 0.019), $\vec{u}_1$ becomes aligned with the direction of maximal variation.
Figure 8.9: Projection of 7-dimensional vectors describing different wheat seeds onto the first two (left)
and the last two (right) principal directions of the data set. Each color represents a variety of wheat.
Figure 8.10: Effect of whitening a set of data. The original data $\vec{x}_1, \ldots, \vec{x}_n$ are dominated by a linear skew (left).
Applying $U^T$ aligns the axes with the eigenvectors of the sample covariance matrix (center). Finally,
$\sqrt{\Lambda}^{-1}$ reweights the data along those axes so that they have the same average variation, revealing the
nonlinear structure that was obscured by the linear skew (right).
Example 8.5.3 (Dimensionality reduction via PCA). We consider a data set where each data
point corresponds to a seed which has seven features: area, perimeter, compactness, length of
kernel, width of kernel, asymmetry coefficient and length of kernel groove. The seeds belong to
three different varieties of wheat: Kama, Rosa and Canadian.3 Our aim is to visualize the data
by projecting the data down to two dimensions in a way that preserves as much variation as
possible. This can be achieved by projecting each point onto the first two principal directions
of the data set.
Figure 8.9 shows the projection of the data onto the first two and the last two principal directions.
In the latter case, there is almost no discernible variation. The structure of the data is much
better preserved by the first two directions, which allow us to clearly visualize the difference between
the three types of seeds. Note, however, that projecting onto the first principal directions only
ensures that we preserve as much variation as possible; it does not necessarily preserve the features
that are most useful for tasks such as classification. 4
8.5.3 Whitening
Whitening is a useful procedure for preprocessing data that contains nonlinear patterns. The
goal is to eliminate the linear skew in the data by rotating and contracting the data along
different directions in order to reveal its underlying nonlinear structure. This can be achieved
by applying a linear transformation that essentially inverts the sample covariance matrix, so that
the result is uncorrelated. The process is known as whitening, because random vectors with
uncorrelated entries are often referred to as white noise. It is closely related to Algorithm 8.5.4
for coloring random vectors.
Algorithm 8.5.4 (Whitening). Let $\vec{x}_1, \ldots, \vec{x}_n$ be a set of d-dimensional data, which we assume
to be centered and to have a full-rank covariance matrix. To whiten the data set we:

1. Compute the eigendecomposition of the sample covariance matrix, $\Sigma(\vec{x}_1, \ldots, \vec{x}_n) = U \Lambda U^T$.
2. Set $\vec{y}_i := \sqrt{\Lambda}^{-1} U^T \vec{x}_i$, for $i = 1, \ldots, n$, where

$$\sqrt{\Lambda} := \begin{pmatrix} \sqrt{\lambda_1} & 0 & \cdots & 0 \\ 0 & \sqrt{\lambda_2} & \cdots & 0 \\ & & \ddots & \\ 0 & 0 & \cdots & \sqrt{\lambda_n} \end{pmatrix}, \qquad (8.20)$$

so that $\Sigma(\vec{x}_1, \ldots, \vec{x}_n) = U \sqrt{\Lambda} \sqrt{\Lambda} U^T$.
The whitened data set $\vec{y}_1, \ldots, \vec{y}_n$ has a sample covariance matrix equal to the identity,

$$\Sigma(\vec{y}_1, \ldots, \vec{y}_n) := \frac{1}{n-1} \sum_{i=1}^{n} \vec{y}_i \vec{y}_i^T \qquad (8.21)$$
$$= \frac{1}{n-1} \sum_{i=1}^{n} \left( \sqrt{\Lambda}^{-1} U^T \vec{x}_i \right) \left( \sqrt{\Lambda}^{-1} U^T \vec{x}_i \right)^T \qquad (8.22)$$
$$= \sqrt{\Lambda}^{-1} U^T \left( \frac{1}{n-1} \sum_{i=1}^{n} \vec{x}_i \vec{x}_i^T \right) U \sqrt{\Lambda}^{-1} \qquad (8.23)$$
$$= \sqrt{\Lambda}^{-1} U^T \, \Sigma(\vec{x}_1, \ldots, \vec{x}_n) \, U \sqrt{\Lambda}^{-1} \qquad (8.24)$$
$$= \sqrt{\Lambda}^{-1} U^T U \sqrt{\Lambda} \sqrt{\Lambda} U^T U \sqrt{\Lambda}^{-1} \qquad (8.25)$$
$$= I. \qquad (8.26)$$
Intuitively, whitening first rotates the data and then shrinks or expands it so that the average
variation is the same in every direction. As a result, nonlinear patterns become more apparent,
as illustrated by Figure 8.10.
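A minimal numpy sketch of the whitening procedure (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.5, 0.3]])
X = X - X.mean(axis=0)                 # the algorithm assumes centered data
Sigma = X.T @ X / (len(X) - 1)         # sample covariance matrix
lam, U = np.linalg.eigh(Sigma)         # eigendecomposition Sigma = U Lam U^T
Y = X @ U / np.sqrt(lam)               # rows are sqrt(Lam)^{-1} U^T x_i
# The whitened data has identity sample covariance:
assert np.allclose(Y.T @ Y / (len(Y) - 1), np.eye(2), atol=1e-8)
```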
Chapter 9
Frequentist Statistics
The goal of statistical analysis is to extract information from data by computing statistics,
which are deterministic functions of the data. In Chapter 8 we describe several statistics from
a deterministic and geometric point of view, without making any assumptions about the data-
generation process. This makes it very challenging to evaluate the accuracy of the acquired
information.
In this chapter we model the data-acquisition process probabilistically. This allows us to analyze statistical techniques and derive theoretical guarantees on their performance. The data
are interpreted as realizations of random variables, vectors or processes (depending on the
dimensionality). The information that we want to extract can then be expressed in terms of the
joint distribution of these quantities. We consider this distribution to be unknown but fixed,
taking a frequentist perspective. The alternative framework of Bayesian statistics is described
in Chapter 10.
Example 9.1.1 (Sampling from a population). Assume that we are studying a population of
m individuals. We are interested in a certain quantity associated with each person, e.g. their
cholesterol level, their salary or who they are voting for in an election. There are k possible
values for the quantity {z1 , z2 , . . . , zk }, where k can be equal to m or much smaller. We denote
by mj the number of people for whom the quantity is equal to zj , 1 ≤ j ≤ k. In the case of
an election with two candidates, k would equal two and m1 and m2 would represent the people
voting for each of the candidates.
Figure 9.1: Directed graphical model corresponding to an independent sequence $X_1, X_2, \ldots, X_n$. If the sequence is also
identically distributed, then $X_1, X_2, \ldots, X_n$ all have the same distribution.
Let us assume that we select n individuals independently at random with replacement, which
means that one individual could be chosen more than once, and record the value of the quantity
of interest. Under these assumptions the measurements can be modeled as a random sequence
of independent variables $\tilde{X}$. Since the probability of choosing any individual is the same every
time we make a selection, the first-order pmf of the sequence is

$$p_{\tilde{X}(i)}(z_j) = P(\text{the ith measurement equals } z_j) \qquad (9.1)$$
$$= \frac{\text{number of people such that the quantity equals } z_j}{\text{total number of people}} \qquad (9.2)$$
$$= \frac{m_j}{m}, \quad 1 \le j \le k, \qquad (9.3)$$

for $1 \le i \le n$ by the law of total probability. We conclude that the data can be modeled as a
realization of an iid sequence. 4
An estimator is a statistic, i.e. a deterministic function of the data, used to approximate a quantity of interest,

$$y := h(x_1, x_2, \ldots, x_n). \qquad (9.4)$$
For example, as we will see, if we want to estimate the expectation of the underlying distribution,
a reasonable estimator is the average of the data. Since we are taking a frequentist viewpoint,
the quantity of interest is modeled as deterministic (in contrast to the Bayesian viewpoint which
would model it as a random variable). For a fixed data set, the estimator is a deterministic
function of the data. However, if we model the data as realizations of a sequence of random
variables, then the estimator is also a realization of the random variable
Y := h (X1 , X2 , . . . , Xn ) . (9.5)
This allows us to evaluate the estimator probabilistically (usually under some assumptions on the
underlying distribution). For instance, we can measure the error incurred by the estimator by
computing the mean square of the difference between the estimator and the true quantity of
interest.
Definition 9.2.1 (Mean square error). The mean square error (MSE) of an estimator $Y$ that
approximates a deterministic quantity $\gamma \in \mathbb{R}$ is

$$\operatorname{MSE}(Y) := E\left( (Y - \gamma)^2 \right). \qquad (9.6)$$
The MSE can be decomposed into a bias term and a variance term,

$$\operatorname{MSE}(Y) = (E(Y) - \gamma)^2 + \operatorname{Var}(Y). \qquad (9.7)$$

The bias term is the difference between the quantity of interest and the expected value of the estimator. The variance
term corresponds to the variation of the estimator around its expected value.
If the bias is zero, then the estimator equals the quantity of interest on average,

$$E(Y) = \gamma, \qquad (9.8)$$

and the estimator is said to be unbiased.
An estimator may be unbiased but still incur a large mean square error due to its variance.
The following lemmas establish that the sample mean and variance are unbiased estimators of
the true mean and variance of an iid sequence of random variables.
Lemma 9.2.4 (The sample mean is unbiased). The sample mean is an unbiased estimator of
the mean of an iid sequence of random variables.
Proof. We consider the sample mean of an iid sequence $\tilde{X}$ with mean µ,

$$\tilde{Y}(n) := \frac{1}{n} \sum_{i=1}^{n} \tilde{X}(i). \qquad (9.9)$$

By linearity of expectation,

$$E(\tilde{Y}(n)) = \frac{1}{n} \sum_{i=1}^{n} E(\tilde{X}(i)) \qquad (9.10)$$
$$= \mu. \qquad (9.11)$$
Lemma 9.2.5 (The sample variance is unbiased). The sample variance is an unbiased estimator
of the variance of an iid sequence of random variables.
9.3 Consistency
If we are estimating a scalar quantity, the estimate should improve as we gather more data.
Ideally the estimate should converge to the true value in the limit when the number of data
n → ∞. Estimators that achieve this are said to be consistent.
Definition 9.3.1 (Consistency). An estimator $\tilde{Y}(n) := h(\tilde{X}(1), \tilde{X}(2), \ldots, \tilde{X}(n))$ that approximates $\gamma \in \mathbb{R}$ is consistent if it converges to γ as $n \to \infty$ in mean square, with probability
one, or in probability.
Theorem 9.3.2 (The sample mean is consistent). The sample mean is a consistent estimator
of the mean of an iid sequence of random variables as long as the variance of the sequence is
bounded.
Proof. We consider the sample mean of an iid sequence $\tilde{X}$ with mean µ,

$$\tilde{Y}(n) := \frac{1}{n} \sum_{i=1}^{n} \tilde{X}(i). \qquad (9.12)$$

The estimator is equal to the moving average of the data. As a result it converges to µ in mean
square (and with probability one) by the law of large numbers (Theorem 6.2.2), as long as the
variance σ² of each of the entries in the iid sequence is bounded.
Example 9.3.3 (Estimating the average height). In this example we illustrate the consistency
of the sample mean. Imagine that we want to estimate the mean height in a population. To be
concrete we consider a population of m := 25000 people. Figure 9.2 shows a histogram of their
heights.¹ As explained in Example 9.1.1, if we sample n individuals from this population with
replacement, then their heights form an iid sequence $\tilde{X}$. The mean of this sequence is

$$E(\tilde{X}(i)) := \sum_{j=1}^{m} P(\text{person } j \text{ is chosen}) \cdot (\text{height of person } j) \qquad (9.13)$$
$$= \frac{1}{m} \sum_{j=1}^{m} h_j \qquad (9.14)$$
$$= \operatorname{av}(h_1, \ldots, h_m) \qquad (9.15)$$

for $1 \le i \le n$, where $h_1, \ldots, h_m$ are the heights of the people. In addition, the variance
is bounded because the heights are finite. By Theorem 9.3.2 the sample mean of the n data
should converge to the mean of the iid sequence and hence to the average height over the whole
population. Figure 9.3 illustrates this numerically.
4
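The experiment can be reproduced with synthetic data (the actual height data set is linked in the footnote):

```python
import numpy as np

# Sample heights with replacement and watch the running sample mean
# converge to the population average.
rng = np.random.default_rng(0)
population = rng.normal(loc=68.0, scale=3.0, size=25_000)  # synthetic heights (inches)
true_mean = population.mean()
samples = rng.choice(population, size=1000, replace=True)  # iid draws
running_mean = np.cumsum(samples) / np.arange(1, len(samples) + 1)
print(true_mean, running_mean[-1])   # the two should be close for large n
```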
If the mean of the underlying distribution is not well defined, or its variance is unbounded,
then the sample mean is not necessarily a consistent estimator. This is related to the fact that
1
The data are available here: wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights.
Figure 9.2: Histogram of the heights of the population of 25,000 people (horizontal axis in inches).
Figure 9.3: Different realizations of the sample mean (empirical mean) compared to the true mean when individuals from the population in Figure 9.2
are sampled with replacement, for n between 10⁰ and 10³ (heights in inches).
Figure 9.4: Realization of the moving average of an iid Cauchy sequence (top) compared to the moving
median (bottom), plotted against the median of the iid sequence for 50, 500 and 5000 samples.
the sample mean can be severely affected by the presence of extreme values, as we discussed
in Section 8.2. The sample median, in contrast, tends to be more robust in such situations, as
discussed in Section 8.3. The following theorem establishes that the sample median is consistent
under the iid assumption, even if the mean is not well defined or the variance is unbounded.
The proof is in Section 9.7.2.
Theorem 9.3.4 (Sample median as an estimator of the median). The sample median is a
consistent estimator of the median of an iid sequence of random variables.
Figure 9.4 compares the moving average and the moving median of an iid sequence of Cauchy
random variables for three different realizations. The moving average is unstable and does not
converge no matter how many data are available, which is not surprising because the mean is
not well defined. In contrast, the moving median does eventually converge to the true median
as predicted by Theorem 9.3.4.
The sample variance and covariance are consistent estimators of the variance and covariance
respectively, under certain assumptions on the higher moments of the underlying distributions.
This provides an intuitive interpretation for principal component analysis (see Section 8.5.2) un-
der the assumption that the data are realizations of an iid sequence of random vectors: the prin-
cipal components approximate the eigenvectors of the true covariance matrix (see Section 4.3.3),
and hence the directions of maximum variance of the multidimensional distribution. Figure 9.5
Figure 9.5: Principal directions of n samples (n = 5, 20, 100) from a bivariate Gaussian distribution (red) compared to
the eigenvectors of the covariance matrix of the distribution (black).
illustrates this with a numerical example, where the principal components indeed converge to
the eigenvectors as the number of data increases.
A $1 - \alpha$ confidence interval $I$ for a quantity of interest γ is a random interval, computed from the data, that contains γ with a prescribed probability,

$$P(\gamma \in I) \ge 1 - \alpha. \qquad (9.16)$$
Confidence intervals are usually of the form [Y − c, Y + c] where Y is an estimator of the quantity
of interest and c is a constant that depends on the number of data. The following theorem derives
a confidence interval for the mean of an iid sequence. The confidence interval is centered at the
sample mean.
Theorem 9.4.2 (Confidence interval for the mean of an iid sequence). Let $\tilde{X}$ be an iid sequence
with mean µ and variance $\sigma^2 \le b^2$ for some $b > 0$. For any $0 < \alpha < 1$,

$$I_n := \left[ Y_n - \frac{b}{\sqrt{\alpha n}}, \; Y_n + \frac{b}{\sqrt{\alpha n}} \right], \qquad Y_n := \operatorname{av}(\tilde{X}(1), \tilde{X}(2), \ldots, \tilde{X}(n)), \qquad (9.17)$$

is a $1 - \alpha$ confidence interval for µ.
Proof. Recall that the variance of $Y_n$ equals $\operatorname{Var}(Y_n) = \sigma^2/n$ (see equation (6.21) in the proof
of Theorem 6.2.2). We have

$$P\left( \mu \in \left[ Y_n - \frac{b}{\sqrt{\alpha n}}, \; Y_n + \frac{b}{\sqrt{\alpha n}} \right] \right) = 1 - P\left( |Y_n - \mu| > \frac{b}{\sqrt{\alpha n}} \right) \qquad (9.18)$$
$$\ge 1 - \frac{\alpha n \operatorname{Var}(Y_n)}{b^2} \quad \text{by Chebyshev's inequality} \qquad (9.19)$$
$$= 1 - \frac{\alpha \sigma^2}{b^2} \qquad (9.20)$$
$$\ge 1 - \alpha. \qquad (9.21)$$
The width of the interval provided in the theorem decreases with n for fixed α, which makes
sense as incorporating more data reduces the variance of the estimator and hence our uncertainty
about it.
Example 9.4.3 (Bears in Yosemite). A scientist is trying to estimate the average weight of the
black bears in Yosemite National Park. She manages to capture 300 bears. We assume that the
bears are sampled uniformly at random with replacement (a bear can be weighed more than
once). Under these assumptions, in Example 9.1.1 we show that the data can be modeled as iid
samples and in Example 9.3.3 we show the sample mean is a consistent estimator of the mean
of the whole population.
The average weight of the 300 captured bears is Y := 200 lbs. To derive a confidence interval
from this information we need a bound on the variance. The maximum weight ever recorded for a
black bear is 880 lbs. Let µ and σ² be the (unknown) mean and variance of the weights of
the whole population. If X is the weight of a bear chosen uniformly at random from the whole
population then X has mean µ and variance σ², so

$$\sigma^2 = E(X^2) - E^2(X) \qquad (9.22)$$
$$\le E(X^2) \qquad (9.23)$$
$$\le 880^2 \quad \text{because } X \le 880. \qquad (9.24)$$
As a result, 880 is an upper bound for the standard deviation. Applying Theorem 9.4.2,

$$\left[ Y - \frac{b}{\sqrt{\alpha n}}, \; Y + \frac{b}{\sqrt{\alpha n}} \right] = [-27.2, 427.2] \qquad (9.25)$$
is a 95% confidence interval for the average weight of the whole population. The interval is not
very precise because n is not very large. 4
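The arithmetic of this example can be checked directly:

```python
import math

# Chebyshev-based 95% confidence interval of Example 9.4.3.
n, alpha, b, y = 300, 0.05, 880.0, 200.0
half_width = b / math.sqrt(alpha * n)        # b / sqrt(alpha * n) ≈ 227.2
interval = (y - half_width, y + half_width)  # ≈ (-27.2, 427.2)
```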
As illustrated by this example, confidence intervals derived from Chebyshev’s inequality tend to
be very conservative. An alternative is to leverage the central limit theorem (CLT). The CLT
characterizes the distribution of the sample mean asymptotically, so confidence intervals derived
from it are not guaranteed to be precise. However, the CLT often provides a very accurate
approximation to the distribution of the sample mean for finite n, as we show through some
numerical examples in Chapter 6. In order to obtain confidence intervals for the mean of an iid
sequence from the CLT as stated in Theorem 6.3.1 we would need to know the true variance of
the sequence, which is unrealistic in practice. However, the following result states that we can
substitute the true variance with the sample variance. The proof is beyond the scope of these
notes.
Theorem 9.4.4 (Central limit theorem with sample standard deviation). Let $\tilde{X}$ be an iid
discrete random process with mean $\mu_{\tilde{X}} := \mu$ such that its variance and fourth moment $E(\tilde{X}(i)^4)$
are bounded. The sequence

$$\frac{\sqrt{n} \left( \operatorname{av}(\tilde{X}(1), \ldots, \tilde{X}(n)) - \mu \right)}{\operatorname{std}(\tilde{X}(1), \ldots, \tilde{X}(n))} \qquad (9.26)$$

converges in distribution to a standard Gaussian random variable.
Recall that the cdf of a standard Gaussian does not have a closed-form expression. To simplify
notation we express the confidence interval in terms of the Q function.
Definition 9.4.5 (Q function). $Q(x)$ is the probability that a standard Gaussian random variable is greater than $x$ for positive $x$,

$$Q(x) := \int_{u=x}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{u^2}{2} \right) du, \quad x > 0. \qquad (9.27)$$

By symmetry, if $U$ is a standard Gaussian random variable and $y < 0$,

$$P(U < y) = Q(-y). \qquad (9.28)$$
Corollary 9.4.6 (Approximate confidence interval for the mean). Let $\tilde{X}$ be an iid sequence that
satisfies the conditions of Theorem 9.4.4. For any $0 < \alpha < 1$,

$$I_n := \left[ Y_n - \frac{S_n}{\sqrt{n}} Q^{-1}\left(\frac{\alpha}{2}\right), \; Y_n + \frac{S_n}{\sqrt{n}} Q^{-1}\left(\frac{\alpha}{2}\right) \right], \qquad (9.29)$$
$$Y_n := \operatorname{av}(\tilde{X}(1), \tilde{X}(2), \ldots, \tilde{X}(n)), \qquad (9.30)$$
$$S_n := \operatorname{std}(\tilde{X}(1), \tilde{X}(2), \ldots, \tilde{X}(n)), \qquad (9.31)$$

is an approximate $1 - \alpha$ confidence interval for µ, in the sense that $P(\mu \in I_n) \approx 1 - \alpha$ for large n.
Proof. By Theorem 9.4.4, for large $n$ the statistic $\sqrt{n}(Y_n - \mu)/S_n$ is approximately distributed
as a standard Gaussian random variable. As a result

$$P(\mu \in I_n) = 1 - P\left( Y_n > \mu + \frac{S_n}{\sqrt{n}} Q^{-1}\left(\frac{\alpha}{2}\right) \right) - P\left( Y_n < \mu - \frac{S_n}{\sqrt{n}} Q^{-1}\left(\frac{\alpha}{2}\right) \right) \qquad (9.33)$$
$$= 1 - P\left( \frac{\sqrt{n}(Y_n - \mu)}{S_n} > Q^{-1}\left(\frac{\alpha}{2}\right) \right) - P\left( \frac{\sqrt{n}(Y_n - \mu)}{S_n} < -Q^{-1}\left(\frac{\alpha}{2}\right) \right) \qquad (9.34)$$
$$\approx 1 - 2 Q\left( Q^{-1}\left(\frac{\alpha}{2}\right) \right) \quad \text{by Theorem 9.4.4} \qquad (9.35)$$
$$= 1 - \alpha. \qquad (9.36)$$
It is important to stress that the result only provides an accurate confidence interval if n is large
enough for the sample variance to converge to the true variance and for the CLT to take effect.
Example 9.4.7 (Bears in Yosemite (continued)). The sample standard deviation of the bears
captured by the scientist equals 100 lbs. We apply Corollary 9.4.6 to derive an approximate
confidence interval that is tighter than the one obtained from Chebyshev's inequality. Given
that $Q(1.95) \approx 0.025$,

$$\left[ Y - \frac{S_n}{\sqrt{n}} Q^{-1}\left(\frac{\alpha}{2}\right), \; Y + \frac{S_n}{\sqrt{n}} Q^{-1}\left(\frac{\alpha}{2}\right) \right] \approx [188.8, 211.3] \qquad (9.37)$$

is an approximate 95% confidence interval for the mean weight of the population of bears.
4
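The arithmetic can again be checked directly; here scipy's inverse survival function plays the role of $Q^{-1}$:

```python
import math
from scipy.stats import norm

# CLT-based approximate 95% confidence interval of Example 9.4.7.
n, alpha, y, s = 300, 0.05, 200.0, 100.0
q = norm.isf(alpha / 2)                      # Q^{-1}(alpha/2) ≈ 1.96
half_width = s / math.sqrt(n) * q            # ≈ 11.3
interval = (y - half_width, y + half_width)  # close to the interval in the text
```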
Interpreting confidence intervals is somewhat tricky. After computing the confidence interval in
Example 9.4.7 one is tempted to state:
The probability that the average weight is between 188.8 and 211.3 lbs is 0.95.
However we are modeling the average weight as a deterministic quantity, so there are no random
quantities in this statement! The correct interpretation is that if we repeat the process of
sampling the population and compute the confidence interval many times, then the true value
will lie in the interval 95% of the time. This is illustrated in the following example and Figure 9.6.
Example 9.4.8 (Estimating the average height (continued)). Figure 9.6 shows several 95%
confidence intervals for the average of the height population in Example 9.3.3. To compute
each interval we select n individuals and then apply Corollary 9.4.6. The width of the intervals
decreases as n grows, but because they are all 95% confidence intervals they all contain the true
average with probability 0.95. Indeed this is the case for 113 out of 120 (94%) of the intervals
that are plotted.
4
Figure 9.6: 95% confidence intervals for the average height of the population in Example 9.3.3, computed from n = 50, n = 200 and n = 1000 samples.

Given a data set $\{x_1, \ldots, x_n\}$, the empirical cdf is defined as

$$\hat{F}_n(x) := \frac{1}{n} \sum_{i=1}^{n} 1(x_i \le x),$$

where $x \in \mathbb{R}$.
The empirical cdf is an unbiased and consistent estimator of the true cdf. This is established
rigorously in Theorem 9.5.2 below and illustrated empirically in Figure 9.7. The cdf of the height
data from 25,000 people is compared to three realizations of the empirical cdf computed from
different numbers of iid samples. As the number of available samples grows, the approximation
becomes very accurate.
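A minimal sketch of the empirical cdf (synthetic data; the function name is ours):

```python
import numpy as np

def empirical_cdf(data, x):
    """Fraction of samples less than or equal to x."""
    return np.mean(np.asarray(data) <= x)

samples = np.random.default_rng(0).normal(loc=68.0, scale=3.0, size=100)
print(empirical_cdf(samples, 68.0))   # close to the true cdf value 0.5
```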
Theorem 9.5.2. Let $\tilde{X}$ be an iid sequence with marginal cdf $F_X$. For any fixed $x \in \mathbb{R}$, $\hat{F}_n(x)$
is an unbiased and consistent estimator of $F_X(x)$. In fact, $\hat{F}_n(x)$ converges in mean square to
$F_X(x)$.
Figure 9.7: Cdf of the height data in Figure 2.13 along with three realizations of the empirical cdf
computed with n iid samples for n = 10, 100, 1000.
Proof. The mean of $\hat{F}_n(x)$ equals $F_X(x)$ by linearity of expectation, so the estimator is unbiased. Its second moment equals

$$E\left( \hat{F}_n(x)^2 \right) = \frac{1}{n^2} \sum_{i=1}^{n} P(\tilde{X}(i) \le x) + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} P(\tilde{X}(i) \le x, \tilde{X}(j) \le x) \qquad (9.43)$$
$$= \frac{F_X(x)}{n} + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} F_{\tilde{X}(i)}(x) \, F_{\tilde{X}(j)}(x) \quad \text{by independence} \qquad (9.44)$$
$$= \frac{F_X(x)}{n} + \frac{n-1}{n} F_X^2(x). \qquad (9.45)$$

The variance is consequently equal to

$$\operatorname{Var}\left( \hat{F}_n(x) \right) = E\left( \hat{F}_n(x)^2 \right) - E^2\left( \hat{F}_n(x) \right) \qquad (9.46)$$
$$= \frac{F_X(x)(1 - F_X(x))}{n}. \qquad (9.47)$$

We conclude that

$$\lim_{n \to \infty} E\left( \left( F_X(x) - \hat{F}_n(x) \right)^2 \right) = \lim_{n \to \infty} \operatorname{Var}\left( \hat{F}_n(x) \right) = 0. \qquad (9.48)$$
Figure 9.8: Kernel density estimation for the Gaussian mixture described in Example 9.6.5 for different
numbers of iid samples and different values of the kernel bandwidth h (h = 0.1, left; h = 0.5, right).
The effect of the kernel is to weight each sample according to its distance to the point $x$ at which
we are estimating the pdf. Choosing a rectangular kernel yields an empirical density estimate
that is piecewise constant and roughly looks like a histogram (the corresponding weights are
constant or equal to zero). A popular alternative is the Gaussian kernel $k(x) = \exp(-x^2)/\sqrt{\pi}$,
which produces a smooth density estimate. The kernel should decay so that $k((x - x_i)/h)$ is
large when the sample $x_i$ is close to $x$ and small when it is far. This decay is governed by the
bandwidth h, which is chosen beforehand based on our expectations about the smoothness of
the pdf and on the amount of available data. If the bandwidth is very small, individual samples
have a large influence on the density estimate. This allows us to reproduce irregular shapes more
easily, but also yields spurious fluctuations that are not present in the true curve, especially if we
don't have a lot of samples. Increasing the bandwidth smooths out such fluctuations and yields
more stable estimates when the number of data is small. However, it may also over-smooth the
estimate. As a rule of thumb, we should decrease the bandwidth of the kernel as the number of
data increases.
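A minimal sketch of kernel density estimation with the Gaussian kernel above (the function name and test data are ours):

```python
import numpy as np

def kde(samples, x, h):
    """Gaussian-kernel density estimate evaluated at the points in x."""
    samples = np.asarray(samples)[None, :]     # shape (1, n)
    x = np.asarray(x, dtype=float)[:, None]    # shape (m, 1)
    # Average of kernels k((x - x_i) / h) centered at each sample,
    # divided by h so that the estimate integrates to one.
    weights = np.exp(-((x - samples) / h) ** 2) / np.sqrt(np.pi)
    return weights.mean(axis=1) / h

# A small bandwidth follows the samples closely; a large one smooths them out.
data = np.random.default_rng(0).normal(size=200)
grid = np.linspace(-3, 3, 61)
rough, smooth = kde(data, grid, h=0.1), kde(data, grid, h=0.5)
```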
Figures 9.8 and 9.9 illustrate the effect of varying the bandwidth h at different sampling rates.
In Figure 9.8 Gaussian kernel density estimation is applied to estimate the Gaussian mixture
described in Example 9.6.5. Figure 9.9 shows an example where the same technique is used on
real data: the aim is to estimate the density of the weight of a sea-snail population.2 The whole
population consists of 4,177 individuals. The kernel density estimate is computed from 200 iid
samples for different values of the kernel bandwidth.
Figure 9.9: Kernel density estimate for the weight (in grams) of a population of abalone, a species of sea snail. In
the plot above the density is estimated from 200 iid samples using a Gaussian kernel with three different
bandwidths. Black crosses representing the individual samples are shown underneath. In the plot below
we see the result of repeating the procedure three times using a fixed bandwidth equal to 0.25.
Figure 9.10: Exponential distribution fitted to data consisting of inter-arrival times of calls at a call
center in Israel (left). Gaussian distribution fitted to height data (right).
The graph on the left of Figure 9.10 shows the result of fitting an exponential to the call-center
data in Figure 2.11. Similarly, to fit a Gaussian using the method of moments we set the mean
equal to the sample mean and the variance equal to the sample variance, as illustrated by the
graph on the right of Figure 9.10 using the data from Figure 2.13.
If the random variables are continuous with pdf $f_{\vec{\theta}}$, where $\vec{\theta} \in \mathbb{R}^m$, the likelihood function is

$$L_{x_1, \ldots, x_n}(\vec{\theta}) := f_{\vec{\theta}}(x_1, \ldots, x_n). \qquad (9.56)$$

The log-likelihood function is equal to the logarithm of the likelihood function, $\log L_{x_1, \ldots, x_n}(\vec{\theta})$.
When the data are modeled as iid samples, the likelihood factors into a product of the marginal
pmf or pdf, so the log likelihood can be decomposed into a sum.
In the case of discrete distributions, for a fixed $\vec{\theta}$ the likelihood is the probability that $X_1, \ldots, X_n$
equal the observed data. If we don't know $\vec{\theta}$, it makes sense to choose a value for $\vec{\theta}$ such that this
probability is as high as possible, i.e. to maximize the likelihood. For continuous distributions
we apply the same principle to the joint pdf of the data.
The maximum of the likelihood function and that of the log-likelihood function are at the same
location because the logarithm is monotone.
Under certain conditions, one can show that the maximum-likelihood estimator is consistent: it
converges in probability to the true parameter as the number of data increases. One can also
show that its distribution converges to that of a Gaussian random variable (or vector), just like
the distribution of the sample mean. These results are beyond the scope of the course. Bear in
mind, however, that they only hold if the data are indeed generated by the type of distribution
that we are considering.
We now show how to derive the maximum-likelihood estimators for a Bernoulli and a Gaussian distribution.
The resulting estimators for the parameters are the same as the method-of-moments estimators
(except for a slight difference in the estimate of the Gaussian variance parameter).
Example 9.6.3 (ML estimator of the parameter of a Bernoulli distribution). We model a set
of data $x_1, \ldots, x_n$ as iid samples from a Bernoulli distribution with parameter θ (in this case
there is only one parameter). The likelihood function is equal to

$$L_{x_1, \ldots, x_n}(\theta) = \theta^{n_1} (1 - \theta)^{n_0},$$

where $n_1$ is the number of samples equal to one and $n_0$ the number of samples equal to zero.
The ML estimator of the parameter θ is

$$\theta_{\mathrm{ML}} = \frac{n_1}{n_0 + n_1}.$$
In the Gaussian case we maximize the log-likelihood over the mean µ and standard deviation σ. The function we are trying to maximize is strictly concave in {µ, σ}. To prove this, we would
have to show that the Hessian of the function is negative definite. We omit the calculations that
show that this is the case. Setting the partial derivatives to zero we obtain
$$\mu_{\mathrm{ML}} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad (9.75)$$
$$\sigma_{\mathrm{ML}}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_{\mathrm{ML}})^2. \qquad (9.76)$$
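A minimal sketch of these estimators (synthetic data):

```python
import numpy as np

# ML estimates for a Gaussian, per (9.75)-(9.76).
x = np.random.default_rng(0).normal(loc=3.0, scale=4.0, size=50)
mu_ml = x.mean()                      # sample mean
sigma2_ml = np.mean((x - mu_ml)**2)   # note the 1/n normalization, not 1/(n-1)
```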
Figure 9.11: The left column shows histograms of 50 iid samples from a Gaussian distribution, together
with the pdf of the original distribution, as well as the maximum-likelihood estimate. The right column
shows the log-likelihood function corresponding to the data and the location of its maximum and of the
point corresponding to the true parameters.
Figure 9.12: The left image shows a histogram of 40 iid samples from the Gaussian mixture defined in
Example 9.6.5, together with the pdf of the original distribution. The right image shows the log-likelihood
function corresponding to the data, which has a local maximum apart from the global maximum. The
density estimates corresponding to the two maxima are shown on the left.
The estimator for the mean is just the sample mean. The estimator for the variance is a rescaled
sample variance.
4
Figure 9.11 displays the log-likelihood function corresponding to 50 iid samples from a Gaussian
distribution with µ := 3 and σ := 4. It also shows the approximation to the true pdf obtained by
maximum likelihood. In Examples 9.6.3 and 9.6.4 the log-likelihood function is strictly concave.
This means that the function has a unique maximum that can be located by setting the gradient
to zero. When this yields nonlinear equations that cannot be solved directly, we can leverage
optimization methods such as gradient ascent that will converge to the maximum. However,
the log-likelihood function is not always concave. As illustrated by the following example, in
such cases it can have multiple local maxima, which may make it intractable to compute the
maximum-likelihood estimator.
where $G_1$ is a Gaussian random variable with mean −µ and variance σ², whereas $G_2$ is also
Gaussian with mean µ and variance σ². We have parameterized the mixture with just two
parameters so that we can visualize the log-likelihood in two dimensions. Let $x_1, x_2, \ldots$ be data
modeled as iid samples from the mixture.
Figure 9.12 shows the log-likelihood function for 40 iid samples of the distribution when µ := 4
and σ := 1. The function has a local maximum away from the global maximum. This means
that if we use a local ascent method to find the ML estimator, we might not find the global
maximum, but remain stuck at the local maximum instead. The estimate corresponding to the
local maximum (shown on the left) has the same variance as the global maximum but µ is close
to −4 instead of 4. Although the estimate doesn’t fit the data very well, it is locally optimal,
small shifts of µ and σ yield worse fits (in terms of the likelihood).
4
To finish this section, we describe a machine-learning algorithm for supervised learning based
on parametric fitting using ML estimation.
Given training examples $\{\vec{a}_1, \ldots, \vec{a}_n\}$ and $\{\vec{b}_1, \ldots, \vec{b}_n\}$ from two classes, a Gaussian distribution is fit to each class by maximum likelihood:

$$\{\vec{\mu}_a, \Sigma_a\} := \arg\max_{\vec{\mu}, \Sigma} L_{\vec{a}_1, \ldots, \vec{a}_n}(\vec{\mu}, \Sigma), \qquad (9.81)$$
$$\{\vec{\mu}_b, \Sigma_b\} := \arg\max_{\vec{\mu}, \Sigma} L_{\vec{b}_1, \ldots, \vec{b}_n}(\vec{\mu}, \Sigma). \qquad (9.82)$$

Then for each new example $\vec{x}$, the value of the density function at the example is evaluated for both classes. If

$$f_{\vec{\mu}_a, \Sigma_a}(\vec{x}) > f_{\vec{\mu}_b, \Sigma_b}(\vec{x}),$$

then $\vec{x}$ is declared to belong to the first class; otherwise it is declared to belong to the second
class. Figure 9.13 shows the results of applying the method to data simulated using two Gaussian
distributions.
4
Figure 9.13: Quadratic-discriminant analysis applied to data from two different classes (left). The data
corresponding to the two different classes are colored orange and blue. Three new examples are colored
in black. Two bivariate Gaussians are fit to the data. Their contour lines are shown in the respective
color of each class on the right. These distributions are used to classify the new examples, which are
colored according to their estimated class.
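A minimal sketch of this procedure (helper names and synthetic data are ours; scipy provides the Gaussian density):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian(X):
    """ML estimates of the mean and covariance of a Gaussian (rows of X)."""
    mu = X.mean(axis=0)
    Sigma = (X - mu).T @ (X - mu) / len(X)   # ML covariance (1/n normalization)
    return mu, Sigma

def classify(x, params_a, params_b):
    """Assign x to the class whose fitted density is larger at x."""
    fa = multivariate_normal.pdf(x, *params_a)
    fb = multivariate_normal.pdf(x, *params_b)
    return "a" if fa > fb else "b"

rng = np.random.default_rng(0)
A = rng.multivariate_normal([0, 0], np.eye(2), size=100)  # class "a" data
B = rng.multivariate_normal([3, 1], np.eye(2), size=100)  # class "b" data
print(classify([2.5, 1.0], fit_gaussian(A), fit_gaussian(B)))
```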
9.7 Proofs
9.7.1 Proof of Lemma 9.2.5
We consider the sample variance of an iid sequence $\tilde{X}$ with mean µ and variance σ²,

$$\tilde{Y}(n) := \frac{1}{n-1} \sum_{i=1}^{n} \left( \tilde{X}(i) - \frac{1}{n} \sum_{j=1}^{n} \tilde{X}(j) \right)^2 \qquad (9.84)–(9.85)$$
$$= \frac{1}{n-1} \sum_{i=1}^{n} \left( \tilde{X}(i)^2 + \frac{1}{n^2} \sum_{j=1}^{n} \sum_{k=1}^{n} \tilde{X}(j)\tilde{X}(k) - \frac{2}{n} \sum_{j=1}^{n} \tilde{X}(i)\tilde{X}(j) \right). \qquad (9.86)$$
To simplify notation, we denote the mean square $E(\tilde{X}(i)^2) = \mu^2 + \sigma^2$ by ξ. We have

$$E(\tilde{Y}(n)) = \frac{1}{n-1} \sum_{i=1}^{n} \left( E(\tilde{X}(i)^2) + \frac{1}{n^2} \sum_{j=1}^{n} E(\tilde{X}(j)^2) + \frac{1}{n^2} \sum_{j=1}^{n} \sum_{k \ne j} E(\tilde{X}(j)\tilde{X}(k)) - \frac{2}{n} E(\tilde{X}(i)^2) - \frac{2}{n} \sum_{j \ne i} E(\tilde{X}(i)\tilde{X}(j)) \right) \qquad (9.87)–(9.88)$$
$$= \frac{1}{n-1} \sum_{i=1}^{n} \left( \xi + \frac{n\xi}{n^2} + \frac{n(n-1)\mu^2}{n^2} - \frac{2\xi}{n} - \frac{2(n-1)\mu^2}{n} \right) \qquad (9.89)$$
$$= \frac{1}{n-1} \sum_{i=1}^{n} \frac{n-1}{n} \left( \xi - \mu^2 \right) \qquad (9.90)$$
$$= \sigma^2. \qquad (9.91)$$
9.7.2 Proof of Theorem 9.3.4

If we order the set $\{\tilde{X}(1), \ldots, \tilde{X}(n)\}$, then $\tilde{Y}(n)$ equals the $(n+1)/2$-th element if $n$ is odd
and the average of the $n/2$-th and the $(n/2+1)$-th elements if $n$ is even. The event $\tilde{Y}(n) \ge \gamma + \epsilon$
therefore implies that at least $(n+1)/2$ of the elements are larger than $\gamma + \epsilon$.
For each individual $\tilde{X}(i)$, the probability that $\tilde{X}(i) > \gamma + \epsilon$ is

$$p := 1 - F_{\tilde{X}(i)}(\gamma + \epsilon) = \frac{1}{2} - \epsilon',$$

where we assume that $\epsilon' > 0$. If this is not the case, then the cdf of the iid sequence is flat at γ and
the median is not well defined. The number of random variables in the set $\{\tilde{X}(1), \ldots, \tilde{X}(n)\}$
which are larger than $\gamma + \epsilon$ is distributed as a binomial random variable $B_n$ with parameters $n$
and $p$.
Chapter 10

Bayesian Statistics
In the frequentist paradigm we model the data as realizations from a distribution that is fixed. In
particular, if the model is parametric, the parameters are deterministic quantities. In contrast,
in Bayesian parametric modeling the parameters are modeled as random variables. The goal is
to have the flexibility to quantify our uncertainty about the underlying distribution beforehand,
for example in order to integrate available prior information about the data.
Our goal when learning a Bayesian model is to compute the posterior distribution of the
parameters Θ given $\vec{X}$. Evaluating this posterior distribution at the realization $\vec{x}$ allows us to
update our uncertainty about Θ using the data.
The following example fits a Bayesian model to iid samples from a Bernoulli random variable.
Example 10.1.1 (Bernoulli distribution). Let $\vec{x}$ be a vector of data that we wish to model as
iid samples from a Bernoulli distribution. Since we are taking a Bayesian approach we choose
a prior distribution for the parameter of the Bernoulli. We will consider two different Bayesian
estimators $\Theta_1$ and $\Theta_2$:

1. $\Theta_1$ is an estimator that does not favor any value of the parameter. We choose a uniform prior pdf,

$$f_{\Theta_1}(\theta) = \begin{cases} 1 & \text{for } 0 \le \theta \le 1, \\ 0 & \text{otherwise}. \end{cases} \qquad (10.1)$$
Figure 10.1: The prior distributions of $\Theta_1$ (blue) and $\Theta_2$ (dark red) in Example 10.1.1 are shown in the
top-left graph. The rest of the graphs show the corresponding posterior distributions for different data
sets ($n_0 = 1, n_1 = 3$; $n_0 = 3, n_1 = 1$; $n_0 = 91, n_1 = 9$), together with the posterior means and the ML estimator.
2. Θ2 is an estimator that assumes that the parameter is closer to 1 than to 0. We could use
it for instance to capture the suspicion that a coin is biased towards heads. We choose a
skewed pdf that increases linearly from zero to one,
$$f_{\Theta_2}(\theta) = \begin{cases} 2\theta & \text{for } 0 \le \theta \le 1, \\ 0 & \text{otherwise}. \end{cases} \qquad (10.2)$$
By the iid assumption, the likelihood, which is just the conditional pmf of the data given the
parameter of the Bernoulli, equals

$$p_{\vec{X}|\Theta}(\vec{x}|\theta) = \theta^{n_1} (1 - \theta)^{n_0}, \qquad (10.3)$$

where $n_1$ is the number of ones in the data and $n_0$ the number of zeros (see Example 9.6.3).
The posterior pdfs of the two estimators are consequently equal to

$$f_{\Theta_1|\vec{X}}(\theta|\vec{x}) = \frac{\theta^{n_1} (1 - \theta)^{n_0}}{\beta(n_1 + 1, n_0 + 1)}, \qquad f_{\Theta_2|\vec{X}}(\theta|\vec{x}) = \frac{\theta^{n_1 + 1} (1 - \theta)^{n_0}}{\beta(n_1 + 2, n_0 + 1)},$$

where

$$\beta(a, b) := \int_u u^{a-1} (1 - u)^{b-1} \, du \qquad (10.12)$$

is a special function called the beta function or Euler integral of the first kind, which is tabulated.
Figure 10.1 shows the plot of the posterior distribution for different values of n1 and n0 . It
also shows the maximum-likelihood estimator of the parameter, which is just n1 / (n0 + n1 ) (see
Example 9.6.3). For a small number of flips, the posterior pdf of Θ2 is skewed to the right with
respect to that of Θ1 , reflecting the prior belief that the parameter is closer to 1. However for
a large number of flips both posterior densities are very close.
4
Definition 10.2.1 (Conjugate priors). A conjugate family of distributions for a certain likeli-
hood satisfies the following property: if the prior belongs to the family, then the posterior also
belongs to the family.
Theorem 10.2.2 (The beta distribution is conjugate to the binomial likelihood). If the prior
distribution of Θ is a beta distribution with parameters a and b, and the likelihood of the data
X given Θ is binomial with parameters n and x, then the posterior distribution of Θ given X is
a beta distribution with parameters x + a and n − x + b.
Proof.

$$f_{\Theta|X}(\theta|x) = \frac{f_{\Theta}(\theta) \, p_{X|\Theta}(x|\theta)}{p_X(x)} \qquad (10.13)$$
$$= \frac{f_{\Theta}(\theta) \, p_{X|\Theta}(x|\theta)}{\int_u f_{\Theta}(u) \, p_{X|\Theta}(x|u) \, du} \qquad (10.14)$$
$$= \frac{\theta^{a-1} (1-\theta)^{b-1} \binom{n}{x} \theta^x (1-\theta)^{n-x}}{\int_u u^{a-1} (1-u)^{b-1} \binom{n}{x} u^x (1-u)^{n-x} \, du} \qquad (10.15)$$
$$= \frac{\theta^{x+a-1} (1-\theta)^{n-x+b-1}}{\int_u u^{x+a-1} (1-u)^{n-x+b-1} \, du} \qquad (10.16)$$
$$= f_{\beta}(\theta; \, x+a, \, n-x+b). \qquad (10.17)$$
Note that the posteriors obtained in Example 10.1.1 follow immediately from the theorem.
Example 10.2.3 (Poll in New Mexico). In a poll in New Mexico for the 2016 US election with 429
participants, 227 people intended to vote for Clinton and 202 for Trump (the data are from a real
poll¹, but for simplicity we are ignoring the other candidates and the people who were undecided).
Our aim is to use a Bayesian framework to predict the outcome of the election in New Mexico
using these data.
We model the fraction of people that vote for Trump as a random variable Θ. We assume that
the n people in the poll are chosen uniformly at random with replacement from the population,
so given Θ = θ the number of Trump voters is a binomial with parameters n and θ. We don’t
have any additional information about the possible value of Θ, so we assume it is uniform or
equivalently a beta distribution with parameters a := 1 and b := 1.
By Theorem 10.2.2 the posterior distribution of Θ given the data that we observe is a beta
distribution with parameters a := 203 and b := 228, depicted in Figure 10.2. The corresponding
probability that Θ ≥ 0.5 is 11.4%, which is our estimate for the probability that Trump wins in
New Mexico.
4
1
The poll results are taken from
https://www.abqjournal.com/883092/clinton-still-ahead-in-new-mexico.html
Figure 10.2: Posterior distribution of the fraction of Trump voters in New Mexico conditioned on the
poll data in Example 10.2.3. The probability that the fraction is smaller than 0.5 equals 88.6%; the probability that it is at least 0.5 equals 11.4%.
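The posterior probability in the example can be reproduced numerically using scipy's beta distribution:

```python
from scipy.stats import beta

# Posterior Beta(203, 228) from a uniform prior (a = b = 1)
# and poll counts 202 Trump / 227 Clinton (Example 10.2.3).
a, b = 1 + 202, 1 + 227
p_trump_wins = beta.sf(0.5, a, b)   # P(Theta >= 0.5) under the posterior
print(round(p_trump_wins, 3))        # approximately 0.114
```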
Theorem 10.3.1 (The posterior mean minimizes the MSE). The posterior mean is the minimum mean-square-error (MMSE) estimate of the parameter $\vec{\Theta}$ given the data $\vec{X}$. To be more
precise, let us define

$$\theta_{\mathrm{MMSE}}(\vec{x}) := E(\vec{\Theta} \,|\, \vec{X} = \vec{x}). \qquad (10.18)$$

Then, for any other estimator $\theta_{\mathrm{other}}$,

$$E\left( \left( \theta_{\mathrm{MMSE}}(\vec{X}) - \vec{\Theta} \right)^2 \right) \le E\left( \left( \theta_{\mathrm{other}}(\vec{X}) - \vec{\Theta} \right)^2 \right).$$

Proof. We begin by computing the MSE of the arbitrary estimator conditioned on $\vec{X} = \vec{x}$.
By iterated expectation,
2 2
~ ~ ~ ~
E θother (X) − Θ = E E θother (X) − Θ X (10.24)
2 2
~ − θMMSE (X)
= E θother (X) ~ + E E θMMSE (X) ~ −Θ ~ X ~
2 2
~
= E θother (X) − θMMSE (X)~ +E ~ ~
θMMSE (X) − Θ (10.25)
2
≥ E θMMSE (X) ~ −Θ~ , (10.26)
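The guarantee can also be checked empirically. A minimal Monte Carlo sketch in Python with NumPy (an assumption of ours, not the book's code) in the beta-binomial setting: with a uniform prior, the posterior mean (x + 1)/(n + 2) should achieve lower MSE than the ML estimator x/n when averaged over the prior:

# Monte Carlo check of Theorem 10.3.1 in a beta-binomial model.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 100_000
theta = rng.uniform(size=trials)      # Theta drawn from the uniform prior
x = rng.binomial(n, theta)            # data given Theta

print(np.mean(((x + 1) / (n + 2) - theta) ** 2))   # posterior mean (MMSE)
print(np.mean((x / n - theta) ** 2))               # ML estimator, larger MSE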
Example 10.3.2 (Bernoulli distribution (continued)). In order to obtain point estimates for the parameter in Example 10.1.1 we compute the posterior means:
E(Θ1 | ~X = ~x) = ∫₀¹ θ f_{Θ1|~X}(θ | ~x) dθ    (10.27)
= ∫₀¹ θ^(n1+1) (1 − θ)^(n0) dθ / β(n1 + 1, n0 + 1)    (10.28)
= β(n1 + 2, n0 + 1) / β(n1 + 1, n0 + 1),    (10.29)
E(Θ2 | ~X = ~x) = ∫₀¹ θ f_{Θ2|~X}(θ | ~x) dθ    (10.30)
= β(n1 + 3, n0 + 1) / β(n1 + 2, n0 + 1).    (10.31)
Figure 10.1 shows the posterior means for different values of n0 and n1 .
△
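The ratios in (10.29) and (10.31) are easy to evaluate with SciPy's beta function; a small sketch (the counts below are hypothetical):

# Posterior means as ratios of beta functions, as in (10.29) and (10.31).
from scipy.special import beta as beta_fun

n1, n0 = 3, 1   # hypothetical numbers of ones and zeros observed
print(beta_fun(n1 + 2, n0 + 1) / beta_fun(n1 + 1, n0 + 1))  # uniform prior
print(beta_fun(n1 + 3, n0 + 1) / beta_fun(n1 + 2, n0 + 1))  # skewed prior

The first ratio simplifies to (n1 + 1)/(n1 + n0 + 2), the familiar posterior mean under a uniform prior.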
In Figure 10.1 the ML estimator of Θ coincides with the mode (the location of the maximum) of the posterior distribution when the prior is uniform. This is not a coincidence: under a uniform prior the MAP and ML estimates are the same.
Proof. We prove the result when the model for the data and the parameters is continuous; if either or both of them are discrete the proof is identical (in that case the ML estimator is the mode of the pmf of the posterior). If the prior distribution of the parameters is uniform, then f_~Θ(~θ) is constant for any ~θ, which implies
arg max_~θ f_{~Θ|~X}(~θ | ~x) = arg max_~θ f_~Θ(~θ) f_{~X|~Θ}(~x | ~θ) / ∫_u f_~Θ(u) f_{~X|~Θ}(~x | u) du    (10.34)
= arg max_~θ f_{~X|~Θ}(~x | ~θ)    (the remaining terms do not depend on ~θ)
= arg max_~θ L_~x(~θ).    (10.35)
Note that uniform priors are only well defined in situations where the parameter is restricted to
a bounded set.
We now describe a situation in which the MAP estimator is optimal. If the parameter Θ can
only take a discrete set of values, then the MAP estimator minimizes the probability of making
the wrong choice.
Theorem 10.3.5 (MAP estimator minimizes the probability of error). Let ~Θ be a discrete random vector and let ~X be a random vector modeling the data. We define
θ_MAP(~x) := arg max_~θ p_{~Θ|~X}(~θ | ~X = ~x).    (10.36)
Proof. For an arbitrary estimator θ_other, the probability of estimating ~Θ correctly equals
P(~Θ = θ_other(~X)) = E(p_{~Θ|~X}(θ_other(~X) | ~X)) ≤ E(p_{~Θ|~X}(θ_MAP(~X) | ~X)) = P(~Θ = θ_MAP(~X)),
where the inequality follows from the definition of the MAP estimator as the mode of the posterior, so the MAP estimator minimizes the probability of error.
Example 10.3.6 (Sending bits). We consider a very simple model for a communication channel
in which we aim to send a signal Θ consisting of a single bit. Our prior knowledge indicates that
the signal is equal to one with probability 1/4.
p_Θ(1) = 1/4,    p_Θ(0) = 3/4.    (10.42)
Due to the presence of noise in the channel, we send the signal n times. At the receiver we observe
~X_i = Θ + ~Z_i,    1 ≤ i ≤ n,    (10.43)
where Z ~ contains n iid standard Gaussian random variables. Modeling perturbations as Gaus-
sian is a popular choice in communications. It is justified by the central limit theorem, under
the assumption that the noise is a combination of many small effects that are approximately
independent.
We will now compute and compare the ML and MAP estimators of Θ given the observations.
The likelihood is equal to
L_~x(θ) = ∏_{i=1}^n f_{~X_i|Θ}(~x_i | θ)    (10.44)
= ∏_{i=1}^n (1/√(2π)) e^(−(~x_i − θ)²/2).    (10.45)
It is easier to deal with the log-likelihood function,
log L_~x(θ) = −Σ_{i=1}^n (~x_i − θ)²/2 − (n/2) log 2π.    (10.46)
Since Θ only takes two values, we can compare directly. We will choose θ_ML(~x) = 1 if
log L_~x(1) = −Σ_{i=1}^n (~x_i² − 2~x_i + 1)/2 − (n/2) log 2π    (10.47)
≥ −Σ_{i=1}^n ~x_i²/2 − (n/2) log 2π    (10.48)
= log L_~x(0).    (10.49)
Equivalently,
θ_ML(~x) = 1 if (1/n) Σ_{i=1}^n ~x_i > 1/2, and θ_ML(~x) = 0 otherwise.    (10.50)
The rule makes a lot of sense: if the sample mean of the data is closer to 1 than to 0 then our estimate is equal to 1. By the law of total probability, the probability of error of this estimator is equal to
P(Θ ≠ θ_ML(~X)) = P(Θ ≠ θ_ML(~X) | Θ = 0) P(Θ = 0) + P(Θ ≠ θ_ML(~X) | Θ = 1) P(Θ = 1)
= P((1/n) Σ_{i=1}^n ~X_i > 1/2 | Θ = 0) P(Θ = 0) + P((1/n) Σ_{i=1}^n ~X_i < 1/2 | Θ = 1) P(Θ = 1)
= Q(√n / 2),    (10.51)
where the last equality follows from the fact that if we condition on Θ = θ the empirical mean is Gaussian with variance σ²/n and mean θ (see the proof of Theorem 6.2.2).
To compute the MAP estimate we must find the maximum of the posterior pdf of Θ given the
observed data. Equivalently, we find the maximum of its logarithm (this is equivalent because
the logarithm is a monotone function),
Qn
i=1 fX xi |θ) pΘ (θ)
~ i |Θ (~
log pΘ|X~ (θ|~x) = log (10.52)
fX~ (~x)
Xn
= log fX~ i |Θ (~xi |θ) pΘ (θ) − log fX~ (~x) (10.53)
i=1
n
X ~x 2 − 2~xi θ + θ2
i n
=− − log 2π + log pΘ (θ) − log fX~ (~x) . (10.54)
2 2
i=1
We compare the value of this function for the two possible values of Θ: 0 and 1. We choose
θMAP (~x) = 1 if
log p_{Θ|~X}(1 | ~x) + log f_~X(~x) = −Σ_{i=1}^n (~x_i² − 2~x_i + 1)/2 − (n/2) log 2π − log 4    (10.55)
≥ −Σ_{i=1}^n ~x_i²/2 − (n/2) log 2π − log 4 + log 3    (10.56)
= log p_{Θ|~X}(0 | ~x) + log f_~X(~x).    (10.57)
Equivalently,
θ_MAP(~x) = 1 if (1/n) Σ_{i=1}^n ~x_i > 1/2 + (log 3)/n, and θ_MAP(~x) = 0 otherwise.    (10.58)
The MAP estimate shifts the threshold with respect to the ML estimate to take into account
that Θ is more prone to equal zero. However, the correction term tends to zero as we gather
more evidence, so if a lot of data is available the two estimators will be very similar.
Figure 10.3: Probability of error of the ML and MAP estimators in Example 10.3.6 for different values
of n.
We compare the probability of error of the ML and MAP estimators in Figure 10.3. MAP
estimation results in better performance, but the difference becomes small as n increases.
△
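The comparison in Figure 10.3 can be reproduced by simulation. A sketch under the assumptions of the example, written in Python with NumPy (our choice of tooling):

# Empirical error probabilities of the ML and MAP rules in Example 10.3.6.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200_000
theta = rng.binomial(1, 0.25, size=trials)             # P(Theta = 1) = 1/4
x = theta[:, None] + rng.standard_normal((trials, n))  # received signal
mean = x.mean(axis=1)

ml = (mean > 0.5).astype(int)                          # rule (10.50)
map_ = (mean > 0.5 + np.log(3) / n).astype(int)        # rule (10.58)
print(np.mean(ml != theta), np.mean(map_ != theta))    # MAP error is lower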
Chapter 11
Hypothesis testing
In a medical study we observe that 10% of the women and 12.5% of the men suffer from heart
disease. If there are 20 people in the study, we would probably be hesitant to declare that
women are less prone to suffer from heart disease than men; it is very possible that the results
occurred by chance. However, if there are 20,000 people in the study, then it seems more likely
that we are observing a real phenomenon. Hypothesis testing makes this intuition precise; it is
a framework that allows us to decide whether patterns that we observe in our data are likely to
be the result of random fluctuations or not.
A test is specified by a test statistic computed from the data and a rejection region: we reject the null hypothesis when the statistic falls in the region. A common choice of rejection region is
R := {t | t ≥ η},    (11.1)
where t is the test statistic computed from the data and η is a predefined threshold. In this case, we would reject the null hypothesis only if t is larger than η.
As shown in Table 11.1, there are two possible errors that we can make. A Type I error is
a false positive: our conjecture is false, but we reject the null hypothesis. A Type II error is
                 Reject H0?
                 No                Yes
H0 is true       correct           Type I error
H1 is true       Type II error     correct

Table 11.1: Type I and Type II errors in hypothesis testing.
a false negative: our conjecture holds, but we do not reject the null hypothesis. In hypothesis
testing, our priority is to control Type I errors. When you read in a study that a result is
statistically significant at a level of 0.05, this means that the probability of committing a
Type I error is bounded by 5%.
Definition 11.1.1 (Significance level and size). The size of a test is the probability of making
a Type I error. The significance level of a test is an upper bound on the size.
Rejecting the null hypothesis does not give a quantitative sense of the extent to which the data
are incompatible with the null hypothesis. The p value is a function of the data that plays this
role.
Definition 11.1.2 (p value). The p value is the smallest significance level at which we would
reject the null hypothesis for the data we observe.
For a fixed significance level, it is desirable to select a test that minimizes the probability of
making a Type II error. Equivalently, we would like to maximize the probability of rejecting
the null hypothesis when it does not hold. This probability is known as the power of the test.
Definition 11.1.3 (Power). The power of a test is the probability of rejecting the null hypothesis
if it does not hold.
Note that in order to characterize the power of a test we need to know the distribution of
the data under the alternative hypothesis, which is often unrealistic (recall that the alternative
hypothesis is just the complement of the null hypothesis and consequently encompasses many
different possibilities).
The standard procedure to apply hypothesis testing in the applied sciences is the following:
1. Choose a conjecture.
2. Define the null hypothesis, i.e. the default assumption that the conjecture does not hold.
3. Choose a test.
4. Fix a significance level and the corresponding rejection region.
5. Gather the data and compute the test statistic.
6. Compute the p value and reject the null hypothesis if it is below a predefined limit (typically 1% or 5%).
Example 11.1.4 (Clutch). We want to test the conjecture that a certain player in the NBA
is clutch, i.e. that he scores more points at the end of close games than during the rest of the
game. The null hypothesis is that there is no difference in his performance. The test statistic t that we choose is the number of games in which he scores more points per minute in the last quarter than in the rest of the game,
t(~x) = Σ_{i=1}^n 1_{~x_i > 0},    (11.2)
where ~xi is the difference between the points per minute he scores in the 4th quarter and in the
rest of the quarters of game i for 1 ≤ i ≤ n.
The rejection region of the test is of the form
R := {t (~x) | t (~x) ≥ η} , (11.3)
for a fixed threshold η. Under the null hypothesis the probability of scoring more points per
minute in the 4th quarter is 1/2 (for simplicity we ignore the possibility that he scores the same
number of points), so we can model the test statistic under the null hypothesis as a binomial
random variable with parameters n and 1/2. If η is an integer between 0 and n, then the
probability that the test statistic is in the rejection region if the null hypothesis holds is
P(T_0 ≥ η) = (1/2^n) Σ_{k=η}^n (n choose k).    (11.4)
So the size of the test is (1/2^n) Σ_{k=η}^n (n choose k). Table 11.2 shows this value for all possible values of η. If we want a significance level of 1% or 5% then we need to set the threshold at η = 16 or η = 15 respectively.
We gather the data from 20 games ~x and compute the value of the test statistic t (~x) (note that
we use a lowercase letter because it is a specific realization), which turns out to be 14 (he scores
more points per minute in the fourth quarter in 14 of the games). This is not enough to reject
the null hypothesis for our predefined level of 1% or 5%. Therefore the result is not statistically
significant.
In any case, we compute the p value, which is the smallest level at which the result would have been significant. From the table it is equal to 0.058. Note that under a frequentist framework we cannot interpret this as the probability that the null hypothesis holds (i.e. that the player is not better in the fourth quarter), because the hypothesis is not random: it either holds or it doesn't. Our result is almost significant, and although we do not have enough evidence to support our conjecture, it does seem plausible that the player performs better in the fourth quarter.
△
η 1 2 3 4 5 6 7 8 9 10
P (T0 ≥ η) 1.000 1.000 1.000 0.999 0.994 0.979 0.942 0.868 0.748 0.588
η 11 12 13 14 15 16 17 18 19 20
P (T0 ≥ η) 0.412 0.252 0.132 0.058 0.021 0.006 0.001 0.000 0.000 0.000
Table 11.2: Probability of committing a Type I error depending on the value of the threshold in
Example 11.1.4. The values are rounded to three decimal points.
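The entries of Table 11.2 follow from the binomial survival function; a short sketch assuming SciPy is available:

# Size of the test in Example 11.1.4 for several thresholds, and the
# p value of the observed statistic t = 14.
from scipy.stats import binom

n = 20
for eta in (14, 15, 16, 17):
    print(eta, binom.sf(eta - 1, n, 0.5))  # P(T0 >= eta) under the null

print(binom.sf(13, n, 0.5))                # p value, approximately 0.058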
In the rest of this chapter we take a frequentist perspective, as is usually done in most studies in the applied sciences. The parameter is consequently deterministic and so are the hypotheses: the null hypothesis is true or not; there is no such thing as the probability that the null hypothesis holds.
To simplify the exposition, we assume that the probability distribution depends only on one parameter that we denote by θ. P_θ is the probability measure of our probability space if θ is the value of the parameter, and ~X is a random vector distributed according to P_θ. The actual data that we observe, which we denote by ~x, are assumed to be a realization of this random vector.
Assume that the null hypothesis is θ = θ0. In that case, the size of a test with test statistic T and rejection region R is equal to
α = P_θ0(T(~X) ∈ R).    (11.5)
If the realization of the test statistic is T(x1, . . . , xn) then the significance level at which we would reject H0 would be
p = P_θ0(T(~X) ≥ T(~x)),    (11.7)
which is the p value if we observe ~x. The p value can consequently be interpreted as the
probability of observing a result that is more extreme than what we observe in the data if the
null hypothesis holds.
A hypothesis of the form θ = θ0 is known as a simple hypothesis. If a hypothesis is of the form
θ ∈ S for a certain set S then the hypothesis is composite. For a composite null hypothesis
θ ∈ H0 we redefine the size and the p value in the following way,
α = sup_{θ∈H0} P_θ(T(~X) ≥ η),    (11.8)
p = sup_{θ∈H0} P_θ(T(~X) ≥ T(~x)).    (11.9)
In order to characterize the power of the test for a certain significance level, we compute the power function.
Definition 11.2.1 (Power function). Let P_θ be the probability measure parametrized by θ and let R be the rejection region for a test based on the test statistic T(~x). The power function of the test is defined as
β(θ) := P_θ(T(~X) ∈ R).    (11.10)
Example 11.2.2 (Coin flip). We are interested in checking whether a coin is biased towards
heads. The null hypothesis is that for each coin flip the probability of obtaining heads is θ ≤ 1/2.
Consequently, the alternative hypothesis is θ > 1/2. Let us consider a test statistic equal to the
number of heads observed in a sequence of n iid flips,
T(~x) = Σ_{i=1}^n 1_{~x_i = 1},    (11.11)
where ~xi is one if the ith coin flip is heads and zero otherwise. A natural rejection region is
T(~x) ≥ η,    (11.12)
for a fixed threshold η. We consider two options for setting the threshold:
1. η = n, i.e. we only reject the null hypothesis if all the coin flips are heads,
2. η = 3n/5, i.e. we reject the null hypothesis if at least three fifths of the coin flips are heads.
What test should we use if the number of coin flips is 5, 50 or 100? Do the tests have a 5%
significance level? What is the power of the tests for these values of n?
To answer these questions, we compute the power function of the test for both options. If η = n,
β1(θ) = P_θ(T(~X) ∈ R)    (11.13)
= θ^n.    (11.14)
If η = 3n/5,
β2(θ) = Σ_{k=3n/5}^n (n choose k) θ^k (1 − θ)^(n−k).    (11.15)
Figure 11.1 shows the two power functions. If η = n, then the test has a significance level of
5% for the three values of n. However the power is very low, especially for large n. This makes
sense: even if the coin is pretty biased the probability of n heads is extremely low. If η = 3n/5,
then for n = 5 the test has a significance level way above 5%, since even if the coin is not biased
the probability of observing 3 heads out of 5 flips is quite high. However for large n the test has
much higher power than the first option. If the bias of the coin is above 0.7 we reject the null
hypothesis with high probability.
△
Figure 11.1: Power functions for the tests described in Example 11.2.2.
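The power functions (11.14) and (11.15) are easy to evaluate numerically; a sketch using SciPy's binomial survival function (our tooling assumption), here at a hypothetical bias of 0.7:

# Power of the two tests in Example 11.2.2 when the true bias is 0.7.
import numpy as np
from scipy.stats import binom

for n in (5, 50, 100):
    power_all_heads = 0.7 ** n                          # eta = n
    eta = int(np.ceil(3 * n / 5))
    power_three_fifths = binom.sf(eta - 1, n, 0.7)      # eta = 3n/5
    print(n, power_all_heads, power_three_fifths)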
A systematic method for building tests under parametric assumptions is to threshold the ratio
between the likelihood of the data under the null hypothesis and the likelihood of the data under
the alternative hypothesis. If this ratio is high, the data are compatible with the null hypothesis,
so it should not be rejected.
Definition 11.2.3 (Likelihood-ratio test). Let L_~x(θ) denote the likelihood function corresponding to a data vector ~x, and let H0 and H1 be the sets corresponding to the null and alternative hypotheses respectively. The likelihood ratio is
Λ(~x) := sup_{θ∈H0} L_~x(θ) / sup_{θ∈H1} L_~x(θ).
A likelihood-ratio test has a rejection region of the form {Λ(~x) ≤ η}, for a constant threshold η.
Example 11.2.4 (Gaussian with known variance). Imagine that you have some data that are well modeled as iid Gaussian with a known variance σ. The mean is unknown and we are interested in establishing that it is not equal to a certain value µ0. What is the corresponding likelihood-ratio test, and how should the threshold be set so that we have a significance level α?
First, recall from Example 9.6.4 that the sample mean achieves the maximum of the likelihood function of a Gaussian.
A motivating argument to employ the likelihood-ratio test is that if the null and alternative
hypotheses are simple, then it is optimal in terms of power.
Lemma 11.2.5 (Neyman-Pearson Lemma). If both the null hypothesis and the alternative hy-
pothesis are simple, i.e. the parameter θ can only have two values θ0 and θ1 , then the likelihood-
ratio test has the highest power among all tests with a fixed size.
Proof. Recall that the power is the probability of rejecting the null hypothesis if it does not hold. If we denote the rejection region of the likelihood-ratio test by R_LR then its power is
P_θ1(~X ∈ R_LR).    (11.27)
Assume that we have another test with rejection region R. Its power is equal to
P_θ1(~X ∈ R).    (11.28)
To prove that the power of the likelihood-ratio test is larger we only need to establish that
P_θ1(~X ∈ R^c ∩ R_LR) ≥ P_θ1(~X ∈ R_LR^c ∩ R).    (11.29)
Let us assume that the data are continuous random variables (the argument for discrete random variables is practically the same) and that the pdfs when the null and alternative hypotheses hold are f_θ0 and f_θ1 respectively. By the definition of the rejection region of the likelihood-ratio test, if ~x ∈ R_LR
f_θ1(~x) ≥ f_θ0(~x)/η,    (11.30)
and if ~x ∈ R_LR^c
f_θ1(~x) ≤ f_θ0(~x)/η.    (11.31)
Consequently, since both tests have the same size,
P_θ0(~X ∈ R^c ∩ R_LR) = P_θ0(~X ∈ R_LR) − P_θ0(~X ∈ R ∩ R_LR)    (11.33)
= P_θ0(~X ∈ R) − P_θ0(~X ∈ R ∩ R_LR)    (11.34)
= P_θ0(~X ∈ R ∩ R_LR^c).    (11.35)
The tests described so far require parametric assumptions; nonparametric tests do not assume that the data are sampled from any distribution with a predefined form. In this section we describe the permutation test, a nonparametric test that can be used to compare two data sets ~x_A and ~x_B in order to evaluate conjectures of the form ~x_A is sampled from a distribution that has a higher mean than ~x_B, or ~x_B is sampled from a distribution that has a higher variance than ~x_A. The null hypothesis is that the two data sets are actually sampled from the same distribution.
The test statistic in a permutation test is the difference between the values of a test statistic of interest t evaluated on the two data sets,
t_diff(~x) := t(~x_A) − t(~x_B),
where ~x are all the data merged together. Our goal is to test whether t(~x_A) is larger than t(~x_B) at a certain significance level. The corresponding rejection region is of the form R := {t | t ≥ η}. The problem is how to fix the threshold so that the test has the desired significance level.
Imagine that we randomly permute the labels A and B in the merged data set ~x. As a result,
some of the data that were labeled as A will be labeled as B and vice versa. If we recompute
tdiff (~x) we will obviously obtain a different value. However, the distribution of the random
variable tdiff (X) ~ under the hypothesis that the data are sampled from the same distribution
has not changed. Indeed, the null hypothesis implies that the distribution of any function of
X~ 1, X~ 2, . . . , X
~ n that only depends on the class assigned to each variable is invariant to permu-
tations. More formally, the random sequence is exchangeable with respect to such functions.
Consider the value of tdiff for all the possible permutations of the labels: tdiff,1 , tdiff,2 , . . . tdiff,n! . If
the null hypothesis holds, then it would be surprising to find that tdiff (~x) is larger than most of
the t_diff,i. In fact, under the null hypothesis, the random variable t_diff(~X) is uniformly distributed in the set {t_diff,1, t_diff,2, . . . , t_diff,n!}, so that
P(t_diff(~X) ≥ η) = (1/n!) Σ_{i=1}^{n!} 1_{t_diff,i ≥ η}.    (11.44)
This is exactly the size of the test. We can therefore compute the p value of the observed statistic t_diff(~x) as
p = P(t_diff(~X) ≥ t_diff(~x))    (11.45)
= (1/n!) Σ_{i=1}^{n!} 1_{t_diff,i ≥ t_diff(~x)}.    (11.46)
In words, the p value is the fraction of permutations that yield a more extreme test statistic than the one we observe. Unfortunately, it is often challenging to compute (11.46) exactly. Even for moderately sized data sets the number of possible permutations is usually too large (for example, 40! > 8 · 10^47) for this to be computationally tractable. In such cases the p value can be approximated by sampling a large number of permutations and replacing (11.46) by the corresponding Monte Carlo average.
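A minimal implementation of this Monte Carlo approximation in Python with NumPy (the function name is ours, not the book's):

# Approximate permutation-test p value, as in (11.46): the fraction of
# random label permutations whose difference of sample means is at least
# as extreme as the observed one.
import numpy as np

def permutation_p_value(x_a, x_b, m=10_000, seed=0):
    rng = np.random.default_rng(seed)
    t_obs = x_a.mean() - x_b.mean()
    merged = np.concatenate([x_a, x_b])
    count = 0
    for _ in range(m):
        perm = rng.permutation(merged)           # random relabeling
        t_perm = perm[:x_a.size].mean() - perm[x_a.size:].mean()
        count += t_perm >= t_obs
    return count / m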
Before looking at an example, let us review the steps to be followed when applying a permutation
test.
Figure 11.2: Histograms of the cholesterol and blood-pressure for men and women in Example 11.3.1.
1. Choose a test statistic t and compute the difference t_diff(~x) := t(~x_A) − t(~x_B) for the two data sets.
2. Merge the two data sets.
3. Choose a number of permutations m.
4. Permute the labels m times and compute the corresponding values of t_diff: t_diff,1, t_diff,2, . . . , t_diff,m.
5. Compute the p value as the fraction of the t_diff,i that are at least as large as t_diff(~x), and reject the null hypothesis if it is below a predefined limit (typically 1% or 5%).
Example 11.3.1 (Cholesterol and blood pressure). A scientist wants to determine whether men have higher cholesterol and blood pressure than women. She gathers data from 86 men and 182 women. Figure 11.2 shows the histograms of the cholesterol and blood pressure for men and women. From the histograms it seems that men have higher levels of cholesterol and blood pressure. The sample mean for cholesterol is 261.3 mg/dl amongst men and 242.0 mg/dl amongst women. The sample mean for blood pressure is 133.2 mmHg amongst men and 130.6 mmHg amongst women.
In order to quantify whether these differences are significant we approximate the permutation distribution of the difference between the sample means using 10^6 permutations. To make sure that the results are stable, we repeat the procedure three times. The results are shown in Figure 11.3. For cholesterol, the p value is around 0.1%, so we have very strong evidence against the null hypothesis. In contrast, the p value for blood pressure is 13%, so the results are not conclusive: we cannot reject the possibility that the difference is merely due to random fluctuations.
Figure 11.3: Approximate distribution under the null hypothesis of the difference between the sample
means of cholesterol and blood pressure in men and women. The observed value for the test statistic is
marked by a dashed line.
Bonferroni's method controls the probability of making at least one Type I error when we carry out k tests: if each individual test is performed at significance level α/k, the probability that at least one of them produces a false positive is bounded by α.
Proof. The result follows directly from the union bound, which controls the probability of a union of events with the sum of their individual probabilities.
Theorem 11.4.3 (Union bound). Let (Ω, F, P) be a probability space and S1 , S2 , . . . a collection
of events in F. Then
X
P (∪i Si ) ≤ P (Si ) . (11.52)
i
Proof. Define S̃1 := S1 and S̃i := Si ∖ (∪_{j=1}^{i−1} Sj) for i > 1. It is straightforward to show by induction that ∪_{j=1}^n Sj = ∪_{j=1}^n S̃j for any n, so ∪_i Si = ∪_i S̃i. The sets S̃1, S̃2, . . . are disjoint by construction, so
P(∪_i Si) = P(∪_i S̃i) = Σ_i P(S̃i)    by Axiom 2 in Definition 1.1.4    (11.54)
≤ Σ_i P(Si)    because S̃i ⊆ Si.    (11.55)
Example 11.4.4 (Clutch (continued)). If we apply the test in Example 11.1.4 to 10 players, the probability that one of them seems to be clutch just due to chance increases substantially. To control for this, by Bonferroni's method we must divide the significance level of the individual tests by 10. As a result, to maintain a significance level of 0.05 we would require that each player score more points per minute during the last quarter in 17 of the 20 games instead of 15 (see Table 11.2) in order to reject the null hypothesis.
△
Chapter 12
Linear Regression
In statistics, regression is the problem of characterizing the relation between a quantity of interest y, called the response or the dependent variable, and several observed variables x1, x2, . . . , xp, known as covariates, features or independent variables. For example, the response could be the price of a house, and the covariates could correspond to its area, the number of rooms, the year it was built, etc. A regression model would describe how house prices are affected by all of these factors.
More formally, the main assumption in regression models is that the response is generated by applying a function h to the features and then perturbing the result with some unknown noise z, which is often additive,
y = h(~x) + z.    (12.1)
The aim is to learn h from n examples of responses and their corresponding features,
(y^(1), ~x^(1)), (y^(2), ~x^(2)), . . . , (y^(n), ~x^(n)).    (12.2)
In this chapter we focus on the case where h is a linear function.
Equivalently,
~y = X ~β* + ~z,    (12.5)
where X is an n × p matrix containing the features, ~y contains the response and ~z ∈ R^n represents the noise.
Example 12.1.1 (Linear model for GDP). We consider the problem of building a linear model
to predict the gross domestic product (GDP) of a state in the US from its population and
unemployment rate. We have available the following data:
In this example, the GDP is the response, and the population and the unemployment rate are
the features. Our goal is to fit a linear model to the data so that we can predict the GDP of
Tennessee, using a linear model. We begin by centering and normalizing the data. The averages
of the response and of the features are
av(~y) = 179 236,    av(X) = [3 802 073    4.1].    (12.6)
We subtract the average and divide by the standard deviation of each column, so that both the response and the features are centered and have unit standard deviation.
To obtain the estimate for the GDP of Tennessee we fit the model
~y ≈ X ~β,    (12.9)
rescale according to the standard deviations (12.7) and recenter using the averages (12.6). The final estimate is
~y_Ten = av(~y) + std(~y) ⟨~x_norm^Ten, ~β⟩,    (12.10)
where ~x_norm^Ten is centered using av(X) and normalized using std(X). △
The least-squares estimate ~β_LS is the vector of weights that minimizes the ℓ2 norm of the fitting error,
~β_LS := arg min_~β ||~y − X ~β||_2.    (12.12)
The least-squares cost function is convenient from a computational view, since it is convex and
can be minimized efficiently (in fact, as we will see in a moment it has a closed-form solution).
In addition, it has intuitive geometric and probabilistic interpretations. Figure 12.1 shows the
linear model learnt using least squares in a simple example where there is just one feature (p = 1)
and 40 examples (n = 40).
Figure 12.1: Linear model learnt via least-squares fitting for a simple example where there is just one
feature (p = 1) and 40 examples (n = 40).
Example 12.2.1 (Linear model for GDP (continued)). The least-squares estimate for the regression coefficients in the linear GDP model is equal to
~β_LS = [1.019   −0.111]^T.    (12.13)
The GDP seems to be proportional to the population and inversely proportional to the unem-
ployment rate. We now compare the fit provided by the linear model to the original data, as
well as its prediction of the GDP of Tennessee:
                   GDP        Estimate
North Dakota      52 089      46 241
Alabama          204 861     239 165
Mississippi      107 680     119 005
Arkansas         120 689     145 712
Kansas           153 258     136 756
Georgia          525 360     513 343
Iowa             178 766     158 097
West Virginia     73 374      59 969
Kentucky         197 043     194 829
Tennessee        328 770     345 352
Figure 12.2: Illustration of Corollary 12.2.3. The least-squares solution is a projection of the data onto
the subspace spanned by the columns of X , denoted by X1 and X2 .
A corollary to this result provides a geometric interpretation for the least-squares estimate of
~y : it is obtained by projecting the response onto the column space of the matrix formed by the
predictors.
Corollary 12.2.3. For n ≥ p, if X is full rank then X ~β_LS is the projection of ~y onto the column space of X.
We provide a formal proof in Section 12.5.2 of the appendix, but the result is very intuitive.
Any vector of the form X β~ is in the span of the columns of X . By definition, the least-squares
estimate is the closest vector to ~y that can be represented in this way, so it is the projection of
~y onto the column space of X . This is illustrated in Figure 12.2.
Under a probabilistic perspective, we can model the response as ~Y = X ~β + ~Z, where ~Z contains iid Gaussian noise, so that ~Y is a Gaussian random vector with mean X ~β and covariance matrix σ² I. The joint pdf of ~Y is equal to
f_~Y(~a) = ∏_{i=1}^n (1/(√(2π) σ)) exp(−(~a_i − (X ~β)_i)²/(2σ²))    (12.16)
= (1/((2π)^(n/2) σ^n)) exp(−||~a − X ~β||²_2/(2σ²)).    (12.17)
The likelihood is the probability density function of ~Y evaluated at the observed data ~y and interpreted as a function of the weight vector ~β,
L_~y(~β) = (1/((2π)^(n/2) σ^n)) exp(−||~y − X ~β||²_2/(2σ²)).    (12.18)
To find the ML estimate, we maximize the log likelihood. We conclude that it is given by the solution to the least-squares problem, since
~β_ML = arg max_~β L_~y(~β)    (12.19)
= arg max_~β log L_~y(~β)    (12.20)
= arg min_~β ||~y − X ~β||²_2    (12.21)
= ~β_LS.    (12.22)
12.3 Overfitting
Imagine that a friend tells you:
I found a cool way to predict the temperature in New York: It’s just a linear combination of the
temperature in every other state. I fit the model on data from the last month and a half and it’s
perfect!
Your friend is not lying, but the problem is that she is using a number of data points to fit the linear model that is roughly the same as the number of parameters. If n ≤ p we can find a ~β such that ~y = X ~β exactly, even if ~y and X have nothing to do with each other! This is called overfitting and is usually caused by using a model that is too flexible for the amount of data available.
To evaluate whether a model suffers from overfitting we separate the data into a training set
and a test set. The training set is used to fit the model and the test set is used to evaluate the
error. A model that overfits the training set will have very low error when evaluated on the
training examples, but will not generalize well to the test examples.
Figure 12.3 shows the result of evaluating the training error and the test error of a linear model with p = 50 parameters fitted from n training examples. The training and test data are generated by fixing a vector of weights ~β* and then computing
~y_train := X_train ~β* + ~z_train,    (12.23)
~y_test := X_test ~β*,    (12.24)
where the entries of X_train, X_test, ~z_train and ~β* are sampled independently at random from a Gaussian distribution with zero mean and unit variance. The training and test errors are defined as
error_train = ||X_train ~β_LS − ~y_train||_2 / ||~y_train||_2,    (12.25)
error_test = ||X_test ~β_LS − ~y_test||_2 / ||~y_test||_2.    (12.26)
Note that even the true β~ ∗ does not achieve zero training error because of the presence of the
noise, but the test error is actually zero if we manage to estimate β~ ∗ exactly.
The training error of the linear model grows with n. This makes sense as the model has to fit
more data using the same number of parameters. When n is close to p := 50, the fitted model
is much better than the true model at replicating the training data (the error of the true model
is shown in green). This is a sign of overfitting: the model is adapting to the noise and not
learning the true linear structure. Indeed, in that regime the test error is extremely high. At
larger n, the training error rises to the level achieved by the true linear model and the test error
decreases, indicating that we are learning the underlying model.
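The experiment is simple to replicate. A sketch under the stated assumptions (Gaussian features, weights and training noise, noiseless test responses), using NumPy:

# Training and test errors of least squares for p = 50 and growing n,
# as in Figure 12.3.
import numpy as np

rng = np.random.default_rng(0)
p = 50
beta_true = rng.standard_normal(p)

for n in (55, 100, 500):
    X_train = rng.standard_normal((n, p))
    y_train = X_train @ beta_true + rng.standard_normal(n)
    X_test = rng.standard_normal((1000, p))
    y_test = X_test @ beta_true                 # no noise on the test set
    beta_ls = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
    err_train = np.linalg.norm(X_train @ beta_ls - y_train) / np.linalg.norm(y_train)
    err_test = np.linalg.norm(X_test @ beta_ls - y_test) / np.linalg.norm(y_test)
    print(n, err_train, err_test)   # test error is large when n is close to p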
12.4 Global warming
To conclude the chapter, we fit a linear regression model to monthly temperature data spanning 150 years in order to determine whether there is a warming trend¹. The model consists of a constant, a yearly periodic component and a linear trend,
~y_t ≈ ~β_0 + ~β_1 cos(2πt/12) + ~β_2 sin(2πt/12) + ~β_3 t,    (12.27)
where 1 ≤ t ≤ n denotes the time in months (n equals 12 times 150). The corresponding matrix
¹ The data are available at http://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/oxforddata.txt
Figure 12.3: Relative ℓ2-norm error in estimating the response achieved using least-squares regression for different values of n (the number of training data). The training error is plotted in blue, whereas the test error is plotted in red. The green line indicates the training error of the true model used to generate the data.
of predictors is
        [ 1   cos(2πt_1/12)   sin(2πt_1/12)   t_1 ]
X :=    [ 1   cos(2πt_2/12)   sin(2πt_2/12)   t_2 ]    (12.28)
        [ ···      ···             ···        ··· ]
        [ 1   cos(2πt_n/12)   sin(2πt_n/12)   t_n ].
The intercept ~β_0 represents the mean temperature, ~β_1 and ~β_2 account for periodic yearly fluctuations, and ~β_3 is the overall trend. If ~β_3 is positive then the model indicates that temperatures are increasing; if it is negative then it indicates that temperatures are decreasing.
The results of fitting the linear model are shown in Figures 12.4 and 12.5. The fitted model
indicates that both the maximum and minimum temperatures have an increasing trend of about
0.8 degrees Celsius (around 1.4 degrees Fahrenheit).
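The fit itself is an ordinary least-squares problem with the matrix (12.28). A sketch on synthetic monthly temperatures (the actual data used in the text are at the URL in the footnote; the coefficients generating the synthetic data below are ours):

# Fitting the model (12.27) by least squares with NumPy.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(1, 12 * 150 + 1).astype(float)   # months over 150 years
temp = 10 + 5 * np.cos(2 * np.pi * t / 12) + 5e-4 * t \
       + rng.standard_normal(t.size)           # synthetic temperatures

X = np.column_stack([np.ones_like(t),
                     np.cos(2 * np.pi * t / 12),
                     np.sin(2 * np.pi * t / 12),
                     t])
beta = np.linalg.lstsq(X, temp, rcond=None)[0]
print(beta)   # beta[3] recovers the per-month trend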
12.5 Proofs
12.5.1 Proof of Proposition 12.2.2
Let X = U Σ V^T be the singular-value decomposition (SVD) of X. Under the conditions of the theorem, (X^T X)^(−1) X^T ~y = V Σ^(−1) U^T ~y. We begin by separating ~y into two components
~y = U U^T ~y + (I − U U^T) ~y,    (12.29)
Figure 12.4: Temperature data together with the linear model described by (12.27) for both maximum
and minimum temperatures.
Figure 12.5: Temperature trend obtained by fitting the model described by (12.27) for both maximum
and minimum temperatures.
where U U^T ~y is the projection of ~y onto the column space of X. Note that (I − U U^T) ~y is orthogonal to the column space of X and consequently to both U U^T ~y and X ~β for any ~β. By Pythagoras's theorem,
||~y − X ~β||²_2 = ||(I − U U^T) ~y||²_2 + ||U U^T ~y − X ~β||²_2.    (12.30)
The minimum value of this cost function that can be achieved by optimizing over ~β is ||(I − U U^T) ~y||²_2. This minimum is attained by solving the system of equations
U U^T ~y = X ~β = U Σ V^T ~β.    (12.31)
Since U^T U = I because n ≥ p, multiplying both sides of the equality by U^T yields the equivalent system
U^T ~y = Σ V^T ~β.    (12.32)
Since X is full rank, Σ and V are square and invertible (and by definition of the SVD V^(−1) = V^T), so
~β_LS = V Σ^(−1) U^T ~y    (12.33)
is the unique solution to the system and consequently also of the least-squares problem.
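The SVD formula (12.33) can be checked against a generic least-squares solver; a sketch with NumPy on random data (our example, assuming n ≥ p and full rank):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((U.T @ y) / s)                 # V Sigma^{-1} U^T y
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_svd, beta_ls))             # True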
Appendix A
Set theory
A set is a collection of objects. A set can be defined by specifying a property s(x) that its elements satisfy,
S := {x | s(x)}.    (A.1)
For example, A := {x | 1 < x < 3} is the set of all elements greater than 1 and smaller than 3.
Let us define some important sets and set operations using this notation.
• The complement S^c of a set S contains all elements that do not belong to S,
S^c := {x | x ∉ S}.    (A.2)
• The union of two sets A and B contains the objects that belong to A or B.
A ∪ B := {x | x ∈ A or x ∈ B} . (A.3)
• The intersection of two sets A and B contains the objects that belong to A and B.
A ∩ B := {x | x ∈ A and x ∈ B} . (A.5)
• The difference of two sets A and B contains the elements in A that are not in B,
A/B := {x | x ∈ A and x ∉ B}.    (A.7)
• The power set 2^S of a set S is the set of all possible subsets of S, including ∅ and S,
2^S := {S' | S' ⊆ S}.    (A.8)
• The cartesian product of two sets S1 and S2 is the set of all ordered pairs of elements
in the sets
S1 × S2 := {(x1 , x2 ) | x1 ∈ S1 , x2 ∈ S2 } . (A.9)
Two sets are equal if they have the same elements, i.e. A = B if and only if A ⊆ B and B ⊆ A.
It is easy to verify for instance that (Ac )c = A, S ∪ Ω = Ω, S ∩ Ω = S or the following identities
which are known as De Morgan’s laws.
Theorem A.2.2 (De Morgan’s laws). For any two sets A and B
(A ∪ B)c = Ac ∩ B c , (A.10)
c c c
(A ∩ B) = A ∪ B . (A.11)
Proof. Let us prove the first identity; the proof of the second is almost identical.
First we prove that (A ∪ B)c ⊆ Ac ∩B c . A standard way to prove the inclusion of a set in another
set is to show that if an element belongs to the first set then it must also belong to the second.
Any element x in (A ∪ B)c (if the set is empty then the inclusion holds trivially, since ∅ ⊆ S for
any set S) is in Ac ; otherwise it would belong to A and consequently to A ∪ B. Similarly, x also
belongs to B c . We conclude that x belongs to Ac ∩ B c , which proves the inclusion.
To complete the proof we establish A^c ∩ B^c ⊆ (A ∪ B)^c. If x ∈ A^c ∩ B^c, then x ∉ A and x ∉ B, so x ∉ A ∪ B and consequently x ∈ (A ∪ B)^c.
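De Morgan's laws are easy to check on finite sets; a small sketch using Python's built-in set type, with complements taken relative to a universe omega (the example sets are ours):

omega = set(range(10))
A = {1, 2, 3}
B = {3, 4, 5}

# (A u B)^c = A^c n B^c and (A n B)^c = A^c u B^c
assert omega - (A | B) == (omega - A) & (omega - B)
assert omega - (A & B) == (omega - A) | (omega - B)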
Appendix B
Linear Algebra
From the point of view of algebra, vectors are much more general objects. They are elements of
sets called vector spaces that satisfy the following definition.
Definition B.1.1 (Vector space). A vector space consists of a set V and two operations + and · satisfying the following conditions.
1. The sum of two vectors in V belongs to V.
2. The product of a scalar and a vector in V belongs to V.
3. There exists a zero vector ~0 such that ~x + ~0 = ~x for all ~x ∈ V.
4. For every ~x ∈ V there exists an additive inverse −~x such that ~x + (−~x) = ~0.
5. Vector addition is commutative and associative, i.e. for all ~x, ~y, ~z ∈ V, ~x + ~y = ~y + ~x and (~x + ~y) + ~z = ~x + (~y + ~z).
6. The scalar product is associative, i.e. for all α, β ∈ R and ~x ∈ V, α (β ~x) = (α β) ~x.
7. Scalar and vector sums are both distributive, i.e. for all α, β ∈ R and ~x, ~y ∈ V, (α + β) ~x = α ~x + β ~x and α (~x + ~y) = α ~x + α ~y.
From now on, for ease of notation we will ignore the symbol for the scalar product ·, writing
α · ~x as α ~x.
Remark B.1.2 (More general definition). We can define vector spaces over an arbitrary field,
instead of R, such as the complex numbers C. We refer to any linear algebra text for more
details.
We can easily check that R^n is a valid vector space together with the usual vector addition and vector-scalar product. In this case the zero vector is the all-zero vector [0 0 · · · 0]^T. When thinking about vector spaces it is a good idea to have R² or R³ in mind to gain intuition, but it is also important to bear in mind that we can define vector spaces over many other objects, such as infinite sequences, polynomials, functions and even random variables.
The definition of vector space guarantees that any linear combination of vectors in a vector
space V, obtained by adding the vectors after multiplying by scalar coefficients, belongs to V.
Given a set of vectors, a natural question to ask is whether they can be expressed as linear
combinations of each other, i.e. if they are linearly dependent or independent.
Definition B.1.3 (Linear dependence/independence). A set of m vectors ~x1 , ~x2 , . . . , ~xm is lin-
early dependent if there exist m scalar coefficients α1, α2, . . . , αm which are not all equal to zero and such that
Σ_{i=1}^m α_i ~x_i = ~0.    (B.5)
Equivalently, at least one vector in a linearly dependent set can be expressed as the linear com-
bination of the rest, whereas this is not the case for linearly independent sets.
Let us check the equivalence. Equation (B.5) holds with α_j ≠ 0 for some j if and only if
~x_j = −(1/α_j) Σ_{i∈{1,...,m}/{j}} α_i ~x_i.    (B.6)
We define the span of a set of vectors {~x1, . . . , ~xm} as the set of all possible linear combinations of the vectors:
span(~x1, . . . , ~xm) := {~y | ~y = Σ_{i=1}^m α_i ~x_i for some α1, α2, . . . , αm ∈ R}.    (B.7)
Lemma B.1.4. The span of any set of vectors ~x1 , . . . , ~xm belonging to a vector space V is a
subspace of V.
Proof. The span is a subset of V due to Conditions 1 and 2 in Definition B.1.1. We now show that it is a vector space. Conditions 5, 6 and 7 in Definition B.1.1 hold because V is a vector space. We check Conditions 1, 2, 3 and 4 by proving that for two arbitrary elements of the span
~y1 = Σ_{i=1}^m α_i ~x_i,    ~y2 = Σ_{i=1}^m β_i ~x_i,    α1, . . . , αm, β1, . . . , βm ∈ R,    (B.8)
and arbitrary scalars γ1, γ2 ∈ R,
γ1 ~y1 + γ2 ~y2 = Σ_{i=1}^m (γ1 α_i + γ2 β_i) ~x_i,    (B.9)
so γ1 ~y1 + γ2 ~y2 is in span(~x1, . . . , ~xm). Now to prove Condition 1 we set γ1 = γ2 = 1, for Condition 2 γ2 = 0, for Condition 3 γ1 = γ2 = 0 and for Condition 4 γ1 = −1, γ2 = 0.
When working with a vector space, it is useful to consider the set of vectors with the smallest
cardinality that spans the space. This is called a basis of the vector space.
Definition B.1.5 (Basis). A basis of a vector space V is a set of independent vectors {~x1 , . . . , ~xm }
such that
V = span (~x1 , . . . , ~xm ) . (B.10)
An important property of all bases in a vector space is that they have the same cardinality.
Theorem B.1.6. If a vector space V has a basis with finite cardinality then every basis of V
contains the same number of vectors.
This theorem, which is proved in Section B.8.1, allows us to define the dimension of a vector
space.
Definition B.1.7 (Dimension). The dimension dim (V) of a vector space V is the cardinality
of any of its bases, or equivalently the smallest number of linearly independent vectors that span
V.
This definition coincides with the usual geometric notion of dimension in R2 and R3 : a line
has dimension 1, whereas a plane has dimension 2 (as long as they contain the origin). Note
that there exist infinite-dimensional vector spaces, such as the continuous real-valued functions
defined on [0, 1].
The vector space that we use to model a certain problem is usually called the ambient space
and its dimension the ambient dimension. In the case of Rn the ambient dimension is n.
Lemma B.1.8 (Dimension of Rn ). The dimension of Rn is n.
Proof. Consider the set of vectors ~e1, . . . , ~en, where ~ei has a one in the ith entry and zeros everywhere else. One can easily check that this set is a basis. It is in fact the standard basis of R^n.
Definition B.2.1 (Inner product). An inner product on a vector space V is an operation ⟨·, ·⟩ that maps pairs of vectors to R and satisfies the following conditions.
• It is symmetric: ⟨~x, ~y⟩ = ⟨~y, ~x⟩ for all ~x, ~y ∈ V.
• It is linear in the first argument: ⟨α ~x + β ~y, ~z⟩ = α ⟨~x, ~z⟩ + β ⟨~y, ~z⟩ for all α, β ∈ R and ~x, ~y, ~z ∈ V.
• It is positive semidefinite: ⟨~x, ~x⟩ is nonnegative for all ~x ∈ V and if ⟨~x, ~x⟩ = 0 then ~x = 0.
A vector space endowed with an inner product is called an inner-product space. An important instance of an inner product is the dot product between two vectors ~x, ~y ∈ R^n,
~x · ~y := Σ_i ~x[i] ~y[i],    (B.15)
where ~x [i] is the ith entry of ~x. In this section we use ~xi to denote a vector, but in some other
parts of the notes it may also denote an entry of a vector ~x; this will be clear from the context.
It is easy to check that the dot product is a valid inner product. Rn endowed with the dot
product is usually called a Euclidean space of dimension n.
The norm of a vector is a generalization of the concept of length.
Definition B.2.2 (Norm). Let V be a vector space; a norm is a function ||·|| from V to R that satisfies the following conditions.
• It is homogeneous: ||α ~x|| = |α| ||~x|| for all α ∈ R and ~x ∈ V.
• It satisfies the triangle inequality: ||~x + ~y|| ≤ ||~x|| + ||~y||.
• ||~x|| = 0 implies ~x = ~0.
A vector space equipped with a norm is called a normed space. Distances in a normed space can be measured using the norm of the difference between vectors.
Definition B.2.3 (Distance). The distance between two vectors ~x and ~y in a normed space with norm ||·|| is
d(~x, ~y) := ||~x − ~y||.
Inner-product spaces are normed spaces because we can define a valid norm using the inner product. The norm induced by an inner product is obtained by taking the square root of the inner product of the vector with itself,
||~x||_⟨·,·⟩ := √⟨~x, ~x⟩.    (B.19)
The norm induced by an inner product is clearly homogeneous by linearity and symmetry of the inner product, and ||~x||_⟨·,·⟩ = 0 implies ~x = 0 because the inner product is positive semidefinite. We only need to establish that the triangle inequality holds to ensure that the induced norm is a valid norm. This follows from a classic inequality in linear algebra, which is proved in Section B.8.2.
Theorem B.2.4 (Cauchy-Schwarz inequality). For any two vectors ~x and ~y in an inner-product space,
|⟨~x, ~y⟩| ≤ ||~x||_⟨·,·⟩ ||~y||_⟨·,·⟩.    (B.20)
Assume ||~x||_⟨·,·⟩ ≠ 0; then
⟨~x, ~y⟩ = −||~x||_⟨·,·⟩ ||~y||_⟨·,·⟩ ⟺ ~y = −(||~y||_⟨·,·⟩ / ||~x||_⟨·,·⟩) ~x,    (B.21)
⟨~x, ~y⟩ = ||~x||_⟨·,·⟩ ||~y||_⟨·,·⟩ ⟺ ~y = (||~y||_⟨·,·⟩ / ||~x||_⟨·,·⟩) ~x.    (B.22)
Corollary B.2.5. The norm induced by an inner product satisfies the triangle inequality.
Proof. By linearity of the inner product and the Cauchy-Schwarz inequality,
||~x + ~y||²_⟨·,·⟩ = ||~x||²_⟨·,·⟩ + 2 ⟨~x, ~y⟩ + ||~y||²_⟨·,·⟩
≤ ||~x||²_⟨·,·⟩ + 2 ||~x||_⟨·,·⟩ ||~y||_⟨·,·⟩ + ||~y||²_⟨·,·⟩
= (||~x||_⟨·,·⟩ + ||~y||_⟨·,·⟩)².
B.3 Orthogonality
An important concept in linear algebra is orthogonality. Two vectors ~x and ~y are orthogonal if and only if
⟨~x, ~y⟩ = 0.    (B.26)
A vector ~x is orthogonal to a subspace S if it is orthogonal to every vector in the subspace, i.e. if for all ~y ∈ S
⟨~x, ~y⟩ = 0.    (B.28)
Distances between orthogonal vectors measured in terms of the norm induced by the inner product are easy to compute: if ⟨~x, ~y⟩ = 0 then ||~x + ~y||²_⟨·,·⟩ = ||~x||²_⟨·,·⟩ + ||~y||²_⟨·,·⟩, which is known as the Pythagorean theorem.
If we want to show that a vector is orthogonal to a certain subspace, it is enough to show that
it is orthogonal to every vector in a basis of the subspace.
Lemma B.3.3. Let ~x be a vector and S a subspace of dimension n. If for any basis ~b1, ~b2, . . . , ~bn of S,
⟨~x, ~bi⟩ = 0,    1 ≤ i ≤ n,    (B.33)
then ~x is orthogonal to S.
Proof. Any vector ~v ∈ S can be represented as ~v = Σ_{i=1}^n α_i ~b_i for α1, . . . , αn ∈ R, so from (B.33)
⟨~x, ~v⟩ = ⟨~x, Σ_{i=1}^n α_i ~b_i⟩ = Σ_{i=1}^n α_i ⟨~x, ~b_i⟩ = 0.    (B.34)
Definition B.3.4 (Orthonormal basis). A basis of mutually orthogonal vectors with norm equal
to one is called an orthonormal basis.
It is very easy to find the coefficients of a vector in an orthonormal basis: we just need to
compute the dot products with the basis vectors.
Immediately, if ~x = Σ_{j=1}^m α_j ~u_j is the representation of ~x in an orthonormal basis ~u1, . . . , ~um, then
⟨~u_i, ~x⟩ = ⟨~u_i, Σ_{j=1}^m α_j ~u_j⟩ = Σ_{j=1}^m α_j ⟨~u_i, ~u_j⟩ = α_i.    (B.37)
For any subspace of R^n we can obtain an orthonormal basis by applying the Gram-Schmidt method to a set of linearly independent vectors spanning the subspace.
Algorithm B.3.6 (Gram-Schmidt). Consider a set of linearly independent vectors ~x1, . . . , ~xm in R^n. To obtain an orthonormal basis of the span of these vectors we:
1. For i = 1, . . . , m, compute
~v_i := ~x_i − Σ_{j=1}^{i−1} ⟨~u_j, ~x_i⟩ ~u_j.    (B.38)
2. Set ~u_i := ~v_i / ||~v_i||_2.
It is not difficult to show that the resulting set of vectors ~u1 , . . . , ~um is an orthonormal basis for
the span of ~x1 , . . . , ~xm . This implies in particular that we can always assume that a subspace
has an orthonormal basis.
Proof. To see that the Gram-Schmidt method produces an orthonormal basis for the span of the input vectors we can check that span(~x1, . . . , ~xi) = span(~u1, . . . , ~ui) and that ~u1, . . . , ~ui is a set of orthonormal vectors.
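A direct NumPy implementation of Algorithm B.3.6 (a sketch; the columns of the input are assumed to be linearly independent):

import numpy as np

def gram_schmidt(X):
    # Orthonormalize the columns of X.
    U = np.zeros_like(X, dtype=float)
    for i in range(X.shape[1]):
        # subtract the projections onto the previously computed vectors
        v = X[:, i] - U[:, :i] @ (U[:, :i].T @ X[:, i])
        U[:, i] = v / np.linalg.norm(v)
    return U

U = gram_schmidt(np.random.default_rng(0).standard_normal((5, 3)))
print(np.round(U.T @ U, 8))   # identity matrix: the columns are orthonormal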
B.4 Projections
The projection of a vector ~x onto a subspace S is the vector in S that is closest to ~x. In order
to define this rigorously, we start by introducing the concept of direct sum. If two subspaces are
disjoint, i.e. their only common point is the origin, then a vector that can be written as a sum
of a vector from each subspace is said to belong to their direct sum.
Definition B.4.1 (Direct sum). Let V be a vector space. For any subspaces S1, S2 ⊆ V such that
S1 ∩ S2 = {0},    (B.39)
the direct sum is defined as
S1 ⊕ S2 := {~x | ~x = ~s1 + ~s2, ~s1 ∈ S1, ~s2 ∈ S2}.    (B.40)
Lemma B.4.2. Any vector in a direct sum S1 ⊕ S2 has a unique decomposition into a vector from S1 and a vector from S2.
Proof. If ~x ∈ S1 ⊕ S2 then by definition there exist ~s1 ∈ S1, ~s2 ∈ S2 such that ~x = ~s1 + ~s2. Assume ~x = ~v1 + ~v2, ~v1 ∈ S1, ~v2 ∈ S2; then ~s1 − ~v1 = ~v2 − ~s2. This implies that ~s1 − ~v1 and ~v2 − ~s2 are in S1 and also in S2. However, S1 ∩ S2 = {0}, so we conclude ~s1 = ~v1 and ~s2 = ~v2.
We can now define the projection of a vector ~x onto a subspace S by separating the vector into a component that belongs to S and another that belongs to its orthogonal complement.
Definition B.4.3 (Orthogonal projection). Let V be a vector space. The orthogonal projection of a vector ~x ∈ V onto a subspace S ⊆ V is a vector denoted by P_S ~x such that P_S ~x ∈ S and ~x − P_S ~x ∈ S⊥.
Theorem B.4.4 (Properties of orthogonal projections). Let V be a vector space. Every vector
~x ∈ V has a unique orthogonal projection PS ~x onto any subspace S ⊆ V of finite dimension.
In particular ~x can be expressed as
~x = PS ~x + PS ⊥ ~x. (B.42)
Proof. Let us denote the dimension of S by m. Since m is finite, there exists an orthonormal basis of S: ~b'_1, . . . , ~b'_m. Consider the vector
~p := Σ_{i=1}^m ⟨~x, ~b'_i⟩ ~b'_i.    (B.45)
Computing the norm of the projection of a vector onto a subspace is easy if we have access to an orthonormal basis (as long as the norm is induced by the inner product).
Lemma B.4.5 (Norm of the projection). The norm of the projection of an arbitrary vector ~x ∈ V onto a subspace S ⊆ V of dimension d with orthonormal basis ~b1, . . . , ~bd can be written as
||P_S ~x||_⟨·,·⟩ = √(Σ_{i=1}^d ⟨~b_i, ~x⟩²).    (B.49)
Proof. By (B.44),
||P_S ~x||²_⟨·,·⟩ = ⟨P_S ~x, P_S ~x⟩    (B.50)
= ⟨Σ_{i=1}^d ⟨~b_i, ~x⟩ ~b_i, Σ_{j=1}^d ⟨~b_j, ~x⟩ ~b_j⟩    (B.51)
= Σ_{i=1}^d Σ_{j=1}^d ⟨~b_i, ~x⟩ ⟨~b_j, ~x⟩ ⟨~b_i, ~b_j⟩    (B.52)
= Σ_{i=1}^d ⟨~b_i, ~x⟩².    (B.53)
¹ For any vector ~v that belongs to both S and S⊥, ⟨~v, ~v⟩ = ||~v||² = 0, which implies ~v = 0.
In particular, the projection of ~x onto the span of a single vector ~v is
P_span(~v) ~x = (⟨~v, ~x⟩ / ||~v||²_⟨·,·⟩) ~v.    (B.54)
Finally, we prove that the projection of a vector ~x onto a subspace S is indeed the vector in S that is closest to ~x in the distance induced by the inner-product norm.
Theorem B.4.7 (The orthogonal projection is closest). The orthogonal projection P_S ~x of a vector ~x onto a subspace S belonging to the same inner-product space is the closest vector to ~x that belongs to S in terms of the norm induced by the inner product. More formally, P_S ~x is the solution to the optimization problem
minimize_~s ||~x − ~s||_⟨·,·⟩ subject to ~s ∈ S.    (B.55)
Proof. For any ~s ∈ S,
||~x − ~s||²_⟨·,·⟩ = ||P_S⊥ ~x + (P_S ~x − ~s)||²_⟨·,·⟩
= ||P_S⊥ ~x||²_⟨·,·⟩ + ||P_S ~x − ~s||²_⟨·,·⟩
≥ ||~x − P_S ~x||²_⟨·,·⟩,    (B.58)
where the expansion follows from the Pythagorean theorem, because P_S⊥ ~x := ~x − P_S ~x belongs to S⊥ and P_S ~x − ~s to S.
B.5 Matrices
A matrix is a rectangular array of numbers. We denote the vector space of m × n matrices by
Rm×n . We denote the ith row of a matrix A by Ai: , the jth column by A:j and the (i, j) entry
by Aij . The transpose of a matrix is obtained by switching its rows and columns.
Definition B.5.2 (Matrix-vector product). The product of a matrix A ∈ R^(m×n) and a vector ~x ∈ R^n is a vector A~x ∈ R^m, such that
(A~x)_i = Σ_{j=1}^n A_ij ~x[j],    (B.61)
i.e. the ith entry of A~x is the dot product between the ith row of A and ~x.
Equivalently,
A~x = Σ_{j=1}^n A_:j ~x[j],    (B.63)
i.e. A~x is a linear combination of the columns of A weighted by the entries in ~x.
One can easily check that the transpose of the product of two matrices A and B is equal to the
transposes multiplied in the inverse order,
(AB)T = B T AT . (B.64)
The product of two matrices A ∈ R^(m×n) and B ∈ R^(n×p) is a matrix AB ∈ R^(m×p) with entries
(AB)_ij = Σ_{k=1}^n A_ik B_kj = ⟨A_i:, B_:j⟩,    (B.68)
i.e. the (i, j) entry of AB is the dot product between the ith row of A and the jth column of B. Equivalently, the jth column of AB is the result of multiplying A and the jth column of B, and the ith row of AB is the result of multiplying the ith row of A and B.
Square matrices may have an inverse. If they do, the inverse is a matrix that reverses the effect of the matrix on any vector.
Definition B.5.5 (Matrix inverse). The inverse of a square matrix A ∈ R^(n×n) is a matrix A^(−1) ∈ R^(n×n) such that
A A^(−1) = A^(−1) A = I.
Definition B.5.7 (Orthogonal matrix). An orthogonal matrix is a square matrix such that its inverse is equal to its transpose,
U^T U = U U^T = I.    (B.72)
By definition, the columns U_:1, U_:2, . . . , U_:n of any orthogonal matrix have unit norm and are orthogonal to each other, so they form an orthonormal basis (it's somewhat confusing that orthogonal matrices are not called orthonormal matrices instead). We can interpret applying U^T to a vector ~x as computing the coefficients of its representation in the basis formed by the columns of U. Applying U to U^T ~x recovers ~x by scaling each basis vector with the corresponding coefficient:
~x = U U^T ~x = Σ_{i=1}^n ⟨U_:i, ~x⟩ U_:i.    (B.73)
Applying an orthogonal matrix to a vector does not affect its norm; it just rotates the vector.
Lemma B.5.8 (Orthogonal matrices preserve the norm). For any orthogonal matrix U ∈ R^(n×n) and any vector ~x ∈ R^n,
||U ~x||_2 = ||~x||_2.
B.6 Eigendecomposition
An eigenvector ~v of a matrix A satisfies
A ~v = λ ~v
for a scalar λ, which is the corresponding eigenvalue. Even if A is real, its eigenvectors and eigenvalues can be complex.
If a square matrix A ∈ R^(n×n) has n linearly independent eigenvectors ~v1, . . . , ~vn with eigenvalues λ1, . . . , λn, it has an eigendecomposition A = Q Λ Q^(−1), where the columns of Q are the eigenvectors and Λ is a diagonal matrix containing the eigenvalues.
Proof.
A Q = [A~v1   A~v2   · · ·   A~vn]    (B.81)
= [λ1 ~v1   λ2 ~v2   · · ·   λn ~vn]    (B.82)
= Q Λ.    (B.83)
If the columns of a square matrix are all linearly independent, then the matrix has an inverse, so multiplying the expression by Q^(−1) on both sides completes the proof.
Not all matrices have an eigendecomposition. Assume the matrix
[ 0  1
  0  0 ]
has a nonzero eigenvalue λ corresponding to an eigenvector with entries ~v[1] and ~v[2]; then
[ 0  1 ] [ ~v[1] ]   [ ~v[2] ]   [ λ ~v[1] ]
[ 0  0 ] [ ~v[2] ] = [   0   ] = [ λ ~v[2] ],    (B.85)
which implies that ~v[2] = 0 and hence ~v[1] = 0, since we have assumed that λ ≠ 0. This implies that the matrix does not have nonzero eigenvalues associated to nonzero eigenvectors.
An interesting use of the eigendecomposition is computing successive matrix products very fast. Assume that we want to compute
A A · · · A ~x = A^k ~x,
i.e. we want to apply A to ~x k times. A^k cannot be computed by taking the power of its entries (try out a simple example to convince yourself). However, if A has an eigendecomposition,
A^k = Q Λ^k Q^(−1),
using the fact that for diagonal matrices applying the matrix repeatedly is equivalent to taking the power of the diagonal entries. This allows us to compute the k matrix products using just 3 matrix products and taking the power of n numbers.
From high-school or undergraduate algebra you probably remember how to compute eigenvectors
using determinants. In practice, this is usually not a viable option due to stability issues. A
popular technique to compute eigenvectors is based on the following insight. Let A ∈ Rn×n be
a matrix with eigendecomposition QΛQ−1 and let ~x be an arbitrary vector in Rn . Since the
columns of Q are linearly independent, they form a basis for Rn , so we can represent ~x as
~x = Σ_{i=1}^n α_i Q_:i,    α_i ∈ R, 1 ≤ i ≤ n.    (B.90)
If we assume that the eigenvalues are ordered according to their magnitudes and that the magnitude of the first one is larger than the rest, |λ1| > |λ2| ≥ . . ., and that α1 ≠ 0 (which happens with high probability if we draw a random ~x), then as k grows larger the term α1 λ1^k Q_:1 dominates. The term will blow up or tend to zero unless we normalize every time before applying A. Adding the normalization step to this procedure results in the power method or power iteration, an algorithm of great importance in numerical linear algebra.
Figure B.1: Illustration of the first three iterations of the power method for a matrix with eigenvectors
~v1 and ~v2 , whose corresponding eigenvalues are λ1 = 1.05 and λ2 = 0.1661.
Initialization: Set ~x1 := ~x / ||~x||_2, where the entries of ~x are drawn at random.
For i = 2, . . . , k, compute
~x_i := A ~x_{i−1} / ||A ~x_{i−1}||_2.    (B.93)
Figure B.1 illustrates the power method on a simple example, where the matrix is equal to
A = [ 0.930   0.388
      0.237   0.286 ].    (B.94)
The convergence to the eigenvector corresponding to the eigenvalue with the largest magnitude
is very fast.
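The iteration is a few lines of NumPy; a sketch using the matrix in (B.94):

import numpy as np

A = np.array([[0.930, 0.388],
              [0.237, 0.286]])

rng = np.random.default_rng(0)
x = rng.standard_normal(2)
x /= np.linalg.norm(x)            # random unit-norm initialization
for _ in range(20):
    x = A @ x
    x /= np.linalg.norm(x)        # normalize at every step

print(x)            # approximates the leading eigenvector (up to sign)
print(x @ A @ x)    # Rayleigh quotient approximates lambda_1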
Every real symmetric matrix A ∈ R^(n×n) has an eigendecomposition of the form A = U Λ U^T, where U is orthogonal and the eigenvalues are real.
Proof. The proof that every real symmetric matrix has n eigenvectors is beyond the scope of these notes. Under the assumption that this is the case, we begin by proving that the eigenvalues are real. Consider an arbitrary eigenvalue λ_i and the corresponding normalized eigenvector ~v_i; we have
~v_i* A ~v_i = λ_i ~v_i* ~v_i = λ_i,    (B.96)
~v_i* A ~v_i = (A ~v_i)* ~v_i = (λ_i ~v_i)* ~v_i = λ̄_i ~v_i* ~v_i = λ̄_i.    (B.97)
This implies that λ_i is real because λ_i = λ̄_i, so we can restrict the eigenvectors to be real (since the eigenvalue is real, both the real and imaginary parts of the eigenvector are eigenvectors themselves and at least one of them must be nonzero). If several linearly independent eigenvectors have the same eigenvalue, an orthonormal basis of their span will also consist of eigenvectors of the matrix. All that is left to prove is that eigenvectors corresponding to different eigenvalues are orthogonal. Assume ~u_i and ~u_j are eigenvectors corresponding to different eigenvalues λ_i ≠ λ_j; then
~u_i^T ~u_j = (1/λ_i) (A ~u_i)^T ~u_j    (B.98)
= (1/λ_i) ~u_i^T A^T ~u_j    (B.99)
= (1/λ_i) ~u_i^T A ~u_j    (B.100)
= (λ_j/λ_i) ~u_i^T ~u_j.    (B.101)
Since λ_j/λ_i ≠ 1, this is only possible if ~u_i^T ~u_j = 0.
The eigenvalues of a symmetric matrix determine the value of the quadratic form:
q(~x) := ~x^T A ~x = Σ_{i=1}^n λ_i (~x^T ~u_i)².    (B.102)
If we order the eigenvalues λ1 ≥ λ2 ≥ . . . ≥ λn, then the first eigenvalue is the maximum value attained by the quadratic form if its input has unit ℓ2 norm, the second eigenvalue is the maximum value attained by the quadratic form if we restrict its argument to be normalized and orthogonal to the first eigenvector, and so on.
Theorem B.7.2. For any symmetric matrix A ∈ R^(n×n) with normalized eigenvectors ~u1, ~u2, . . . , ~un with corresponding eigenvalues λ1 ≥ λ2 ≥ . . . ≥ λn,
λ1 = max_{||~x||_2=1} ~x^T A ~x,    (B.103)
~u1 = arg max_{||~x||_2=1} ~x^T A ~x,    (B.104)
λk = max_{||~x||_2=1, ~x ⊥ ~u1,...,~u_{k−1}} ~x^T A ~x,    (B.105)
~uk = arg max_{||~x||_2=1, ~x ⊥ ~u1,...,~u_{k−1}} ~x^T A ~x.    (B.106)
Proof. The eigenvectors are an orthonormal basis (they are mutually orthogonal and we assume that they have been normalized), so we can represent any unit-norm vector ~h_k that is orthogonal to ~u1, . . . , ~u_{k−1} as
~h_k = Σ_{i=k}^n α_i ~u_i,    (B.107)
where
||~h_k||²_2 = Σ_{i=k}^n α_i² = 1,    (B.108)
so that
~h_k^T A ~h_k = Σ_{i=k}^n λ_i α_i² ≤ λ_k Σ_{i=k}^n α_i² = λ_k.
This establishes (B.103) and (B.105). To prove (B.104) and (B.106) we just need to show that ~u_k achieves the maximum,
~u_k^T A ~u_k = Σ_{i=1}^n λ_i (~u_i^T ~u_k)²    (B.113)
= λ_k.    (B.114)
B.8 Proofs
B.8.1 Proof of Theorem B.1.6
We prove the claim by contradiction. Assume that we have two bases {~x1 , . . . , ~xm } and {~y1 , . . . , ~yn }
such that m < n (or the second set has infinite cardinality). The proof follows from applying
the following lemma m times (setting r = 0, 1, . . . , m − 1) to show that {~y1 , . . . , ~ym } spans V
and hence {~y1 , . . . , ~yn } must be linearly dependent.
Lemma B.8.1. Under the assumptions of the theorem, if {~y1 , ~y2 , . . . , ~yr , ~xr+1 , . . . , ~xm } spans V
then {~y1 , . . . , ~yr+1 , ~xr+2 , . . . , ~xm } also spans V (possibly after rearranging the indices r+1, . . . , m)
for r = 0, 1, . . . , m − 1.
Proof. Since {~y1, . . . , ~yr, ~xr+1, . . . , ~xm} spans V, we can write
~y_{r+1} = Σ_{i=1}^r β_i ~y_i + Σ_{i=r+1}^m γ_i ~x_i,    (B.115)
where at least one of the γ_j is nonzero, as {~y1, . . . , ~yn} is linearly independent by assumption. Without loss of generality (here is where we might need to rearrange the indices) we assume that γ_{r+1} ≠ 0, so that
~x_{r+1} = (1/γ_{r+1}) (~y_{r+1} − Σ_{i=1}^r β_i ~y_i − Σ_{i=r+2}^m γ_i ~x_i).    (B.116)
This implies that any vector in the span of {~y1 , ~y2 , . . . , ~yr , ~xr+1 , . . . , ~xm }, i.e. in V, can be
represented as a linear combination of vectors in {~y1 , . . . , ~yr+1 , ~xr+2 , . . . , ~xm }, which completes
the proof.