
[Figure 9.2: Linear regression example. (a) Example functions (straight lines) that can be described using the linear model in (9.4); (b) training set; (c) maximum likelihood estimate.]

Linear regression refers to models that are "linear in the parameters", i.e., models that describe a function by a linear combination of input features. Here, a "feature" is a representation ϕ(x) of the inputs x.
In the following, we will discuss in more detail how to find good parameters θ and how to evaluate whether a parameter set "works well". For the time being, we assume that the noise variance σ² is known.

9.2 Parameter Estimation

Consider the linear regression setting (9.4) and assume we are given a training set D := {(x_1, y_1), …, (x_N, y_N)} consisting of N inputs x_n ∈ R^D and corresponding observations/targets y_n ∈ R, n = 1, …, N. The corresponding graphical model is given in Figure 9.3. Note that y_i and y_j are conditionally independent given their respective inputs x_i, x_j, so that the likelihood factorizes according to

p(Y | X, θ) = p(y_1, …, y_N | x_1, …, x_N, θ)   (9.5a)
            = ∏_{n=1}^N p(y_n | x_n, θ) = ∏_{n=1}^N N(y_n | x_n⊤θ, σ²),   (9.5b)

where we defined X := {x_1, …, x_N} and Y := {y_1, …, y_N} as the sets of training inputs and corresponding targets, respectively. The likelihood and the factors p(y_n | x_n, θ) are Gaussian due to the noise distribution; see (9.3).

[Figure 9.3: Probabilistic graphical model for linear regression. Observed random variables are shaded; deterministic/known values are without circles.]

In the following, we will discuss how to find optimal parameters θ* ∈ R^D for the linear regression model (9.4). Once the parameters θ* are found, we can predict function values by using this parameter estimate in (9.4) so that at an arbitrary test input x_* the distribution of the corresponding target y_* is

p(y_* | x_*, θ*) = N(y_* | x_*⊤θ*, σ²).   (9.6)

In the following, we will have a look at parameter estimation by maximizing the likelihood, a topic that we already covered to some degree in Section 8.3.

9.2.1 Maximum Likelihood Estimation

A widely used approach to finding the desired parameters θ_ML is maximum likelihood estimation, where we find parameters θ_ML that maximize the likelihood (9.5b). Intuitively, maximizing the likelihood means maximizing the predictive distribution of the training data given the model parameters. We obtain the maximum likelihood parameters as

θ_ML ∈ arg max_θ p(Y | X, θ).   (9.7)

Remark. The likelihood p(y | x, θ) is not a probability distribution in θ: It is simply a function of the parameters θ but does not integrate to 1 (i.e., it is unnormalized), and may not even be integrable with respect to θ. However, the likelihood in (9.7) is a normalized probability distribution in y. ♢

To find the desired parameters θ_ML that maximize the likelihood, we typically perform gradient ascent (or gradient descent on the negative likelihood). In the case of linear regression we consider here, however, a closed-form solution exists, which makes iterative gradient descent unnecessary. In practice, instead of maximizing the likelihood directly, we apply the log-transformation to the likelihood function and minimize the negative log-likelihood. Since the logarithm is a (strictly) monotonically increasing function, the optimum of a function f is identical to the optimum of log f.

Remark (Log-Transformation). Since the likelihood (9.5b) is a product of N Gaussian distributions, the log-transformation is useful since (a) it does not suffer from numerical underflow, and (b) the differentiation rules will turn out simpler. More specifically, numerical underflow will be a problem when we multiply N probabilities, where N is the number of data points, since we cannot represent very small numbers, such as 10⁻²⁵⁶. Furthermore, the log-transform will turn the product into a sum of log-probabilities such that the corresponding gradient is a sum of individual gradients, instead of a repeated application of the product rule (5.46) to compute the gradient of a product of N terms. ♢
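The underflow issue mentioned in the remark is easy to demonstrate numerically. The following sketch is not from the book; it assumes only NumPy and uses hypothetical per-data-point likelihood values of 10⁻³.

```python
import numpy as np

# Multiplying N small likelihood values underflows; summing their logs does not.
p = np.full(1000, 1e-3)      # 1000 hypothetical per-data-point likelihood values
print(np.prod(p))            # 0.0 -- the product underflows to zero
print(np.sum(np.log(p)))     # approx. -6907.8 -- the log-likelihood is representable
```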
To find the optimal parameters θ_ML of our linear regression problem, we minimize the negative log-likelihood

−log p(Y | X, θ) = −log ∏_{n=1}^N p(y_n | x_n, θ) = −∑_{n=1}^N log p(y_n | x_n, θ),   (9.8)

where we exploited that the likelihood (9.5b) factorizes over the number of data points due to our independence assumption on the training set.
In the linear regression model (9.4), the likelihood is Gaussian (due to the Gaussian additive noise term), such that we arrive at

log p(y_n | x_n, θ) = −(1/(2σ²)) (y_n − x_n⊤θ)² + const,   (9.9)

where the constant includes all terms independent of θ. Using (9.9) in the negative log-likelihood (9.8), we obtain (ignoring the constant terms)

L(θ) := (1/(2σ²)) ∑_{n=1}^N (y_n − x_n⊤θ)²   (9.10a)
      = (1/(2σ²)) (y − Xθ)⊤(y − Xθ) = (1/(2σ²)) ‖y − Xθ‖²,   (9.10b)

where we define the design matrix X := [x_1, …, x_N]⊤ ∈ R^{N×D} as the collection of training inputs and y := [y_1, …, y_N]⊤ ∈ R^N as a vector that collects all training targets. Note that the nth row in the design matrix X corresponds to the training input x_n. In (9.10b), we used the fact that the sum of squared errors between the observations y_n and the corresponding model predictions x_n⊤θ equals the squared distance between y and Xθ. (The squared error is often used as a measure of distance; recall from Section 3.1 that ‖x‖² = x⊤x if we choose the dot product as the inner product.) The negative log-likelihood function is also called the error function.
With (9.10b), we now have a concrete form of the negative log-likelihood function we need to optimize. We immediately see that (9.10b) is quadratic in θ. This means that we can find a unique global solution θ_ML for minimizing the negative log-likelihood L. We can find the global optimum by computing the gradient of L, setting it to 0, and solving for θ.
Using the results from Chapter 5, we compute the gradient of L with respect to the parameters as

dL/dθ = (d/dθ) [ (1/(2σ²)) (y − Xθ)⊤(y − Xθ) ]   (9.11a)
      = (1/(2σ²)) (d/dθ) [ y⊤y − 2y⊤Xθ + θ⊤X⊤Xθ ]   (9.11b)
      = (1/σ²) (−y⊤X + θ⊤X⊤X) ∈ R^{1×D}.   (9.11c)

The maximum likelihood estimator θ_ML solves dL/dθ = 0⊤ (necessary optimality condition), and we obtain

dL/dθ = 0⊤ ⟺ θ_ML⊤ X⊤X = y⊤X   (9.12a)
        ⟺ θ_ML⊤ = y⊤X (X⊤X)⁻¹   (9.12b)
        ⟺ θ_ML = (X⊤X)⁻¹ X⊤y.   (9.12c)

We could right-multiply the first equation by (X⊤X)⁻¹ because X⊤X is positive definite if rk(X) = D, where rk(X) denotes the rank of X. (Ignoring the possibility of duplicate data points, rk(X) = D if N ⩾ D, i.e., if we do not have more parameters than data points.)

Remark. Setting the gradient to 0⊤ is a necessary and sufficient condition, and we obtain a global minimum since the Hessian ∇²_θ L(θ) = X⊤X ∈ R^{D×D} is positive definite. ♢

Remark. The maximum likelihood solution in (9.12c) requires us to solve a system of linear equations of the form Aθ = b with A = X⊤X and b = X⊤y. ♢
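The remark above translates directly into code. The following is a minimal NumPy sketch (not part of the book); the data are synthetic placeholders, and solving the normal equations Aθ = b is preferred over forming the matrix inverse explicitly.

```python
import numpy as np

# Synthetic placeholder data: N = 100 inputs in R^D with D = 3.
rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))                      # design matrix, rows are x_n^T
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)    # noisy targets

# Maximum likelihood estimate (9.12c): solve (X^T X) theta = X^T y.
A = X.T @ X
b = X.T @ y
theta_ml = np.linalg.solve(A, b)
print(theta_ml)                                  # close to theta_true
```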
Example 9.2 (Fitting Lines)
Let us have a look at Figure 9.2, where we aim to fit a straight line f(x) = θx, where θ is an unknown slope, to a dataset using maximum likelihood estimation. Examples of functions in this model class (straight lines) are shown in Figure 9.2(a). For the dataset shown in Figure 9.2(b), we find the maximum likelihood estimate of the slope parameter θ using (9.12c) and obtain the maximum likelihood linear function in Figure 9.2(c).

Maximum Likelihood Estimation with Features
So far, we considered the linear regression setting described in (9.4), which allowed us to fit straight lines to data using maximum likelihood estimation. However, straight lines are not sufficiently expressive when it comes to fitting more interesting data. Fortunately, linear regression offers us a way to fit nonlinear functions within the linear regression framework: Since "linear regression" only refers to "linear in the parameters", we can perform an arbitrary nonlinear transformation ϕ(x) of the inputs x and then linearly combine the components of this transformation. The corresponding linear regression model is

p(y | x, θ) = N(y | ϕ⊤(x)θ, σ²)
⟺ y = ϕ⊤(x)θ + ϵ = ∑_{k=0}^{K−1} θ_k ϕ_k(x) + ϵ,   (9.13)

where ϕ : R^D → R^K is a (nonlinear) transformation of the inputs x and ϕ_k : R^D → R is the kth component of the feature vector ϕ. Note that the model parameters θ still appear only linearly.

Example 9.3 (Polynomial Regression)
We are concerned with a regression problem y = ϕ⊤(x)θ + ϵ, where x ∈ R and θ ∈ R^K. A transformation that is often used in this context is

ϕ(x) = [ϕ_0(x), ϕ_1(x), …, ϕ_{K−1}(x)]⊤ = [1, x, x², x³, …, x^{K−1}]⊤ ∈ R^K.   (9.14)

This means that we "lift" the original one-dimensional input space into a K-dimensional feature space consisting of all monomials x^k for k = 0, …, K−1. With these features, we can model polynomials of degree ⩽ K−1 within the framework of linear regression: A polynomial of degree K−1 is

f(x) = ∑_{k=0}^{K−1} θ_k x^k = ϕ⊤(x)θ,   (9.15)

where ϕ is defined in (9.14) and θ = [θ_0, …, θ_{K−1}]⊤ ∈ R^K contains the (linear) parameters θ_k.

Let us now have a look at maximum likelihood estimation of the parameters θ in the linear regression model (9.13). We consider training inputs x_n ∈ R^D and targets y_n ∈ R, n = 1, …, N, and define the feature matrix (design matrix) as

Φ := [ϕ⊤(x_1); …; ϕ⊤(x_N)] =
  [ ϕ_0(x_1)  ⋯  ϕ_{K−1}(x_1) ]
  [ ϕ_0(x_2)  ⋯  ϕ_{K−1}(x_2) ]
  [    ⋮                ⋮      ]
  [ ϕ_0(x_N)  ⋯  ϕ_{K−1}(x_N) ]  ∈ R^{N×K},   (9.16)

where Φ_ij = ϕ_j(x_i) and ϕ_j : R^D → R.

Example 9.4 (Feature Matrix for Second-order Polynomials)
For a second-order polynomial and N training points x_n ∈ R, n = 1, …, N, the feature matrix is

Φ = [ 1  x_1  x_1² ]
    [ 1  x_2  x_2² ]
    [ ⋮   ⋮    ⋮   ]
    [ 1  x_N  x_N² ].   (9.17)

With the feature matrix Φ defined in (9.16), the negative log-likelihood for the linear regression model (9.13) can be written as

−log p(Y | X, θ) = (1/(2σ²)) (y − Φθ)⊤(y − Φθ) + const.   (9.18)

Comparing (9.18) with the negative log-likelihood in (9.10b) for the "feature-free" model, we immediately see that we just need to replace X with Φ. Since both X and Φ are independent of the parameters θ that we wish to optimize, we arrive immediately at the maximum likelihood estimate

θ_ML = (Φ⊤Φ)⁻¹ Φ⊤y   (9.19)

for the linear regression problem with nonlinear features defined in (9.13).

Remark. When we were working without features, we required X⊤X to be invertible, which is the case when rk(X) = D, i.e., the columns of X are linearly independent. In (9.19), we therefore require Φ⊤Φ ∈ R^{K×K} to be invertible. This is the case if and only if rk(Φ) = K. ♢
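As an illustration, the polynomial features and the closed-form estimate (9.19) can be coded in a few lines. This is a minimal sketch (not from the book); the toy dataset mirrors the one used in Example 9.5 below, and all function and variable names are placeholders.

```python
import numpy as np

def poly_features(x, K):
    """Feature matrix Phi in R^{N x K} with Phi[n, k] = x_n**k, cf. (9.14) and (9.16)."""
    x = np.asarray(x, dtype=float)
    return np.stack([x**k for k in range(K)], axis=1)

def ml_fit(Phi, y):
    """Maximum likelihood estimate (9.19): solve (Phi^T Phi) theta = Phi^T y."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Toy data resembling Example 9.5: x_n ~ U[-5, 5], y_n = -sin(x_n/5) + cos(x_n) + noise.
rng = np.random.default_rng(0)
x_train = rng.uniform(-5, 5, size=10)
y_train = -np.sin(x_train / 5) + np.cos(x_train) + 0.2 * rng.normal(size=10)

Phi = poly_features(x_train, K=5)          # polynomial of degree K - 1 = 4
theta_ml = ml_fit(Phi, y_train)

# Predictions at test locations x_*: phi(x_*)^T theta_ML.
x_test = np.linspace(-5, 5, 200)
y_pred = poly_features(x_test, K=5) @ theta_ml
```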
Example 9.5 (Maximum Likelihood Polynomial Fit)

[Figure 9.4: Polynomial regression. (a) Dataset consisting of (x_n, y_n) pairs, n = 1, …, 10; (b) maximum likelihood polynomial of degree 4.]

Consider the dataset in Figure 9.4(a). The dataset consists of N = 10 pairs (x_n, y_n), where x_n ∼ U[−5, 5] and y_n = −sin(x_n/5) + cos(x_n) + ϵ, where ϵ ∼ N(0, 0.2²).
We fit a polynomial of degree 4 using maximum likelihood estimation, i.e., the parameters θ_ML are given in (9.19). The maximum likelihood estimate yields function values ϕ⊤(x_*)θ_ML at any test location x_*. The result is shown in Figure 9.4(b).

Estimating the Noise Variance
Thus far, we assumed that the noise variance σ² is known. However, we can also use the principle of maximum likelihood estimation to obtain the maximum likelihood estimator σ²_ML for the noise variance. To do this, we follow the standard procedure: We write down the log-likelihood, compute its derivative with respect to σ² > 0, set it to 0, and solve. The log-likelihood is given by

log p(Y | X, θ, σ²) = ∑_{n=1}^N log N(y_n | ϕ⊤(x_n)θ, σ²)   (9.20a)
 = ∑_{n=1}^N ( −(1/2) log(2π) − (1/2) log σ² − (1/(2σ²)) (y_n − ϕ⊤(x_n)θ)² )   (9.20b)
 = −(N/2) log σ² − (1/(2σ²)) ∑_{n=1}^N (y_n − ϕ⊤(x_n)θ)² + const,   (9.20c)

where we define s := ∑_{n=1}^N (y_n − ϕ⊤(x_n)θ)². The partial derivative of the log-likelihood with respect to σ² is then

∂ log p(Y | X, θ, σ²)/∂σ² = −N/(2σ²) + s/(2σ⁴) = 0   (9.21a)
⟺ N/(2σ²) = s/(2σ⁴)   (9.21b)

so that we identify

σ²_ML = s/N = (1/N) ∑_{n=1}^N (y_n − ϕ⊤(x_n)θ)².   (9.22)

Therefore, the maximum likelihood estimate of the noise variance is the empirical mean of the squared distances between the noise-free function values ϕ⊤(x_n)θ and the corresponding noisy observations y_n at input locations x_n.
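A short NumPy sketch of (9.22), not from the book and using placeholder data, computes the estimator as the mean squared residual:

```python
import numpy as np

def noise_variance_ml(Phi, y, theta):
    """Maximum likelihood noise variance (9.22): mean of squared residuals."""
    residuals = y - Phi @ theta
    return np.mean(residuals**2)

# Self-contained toy check with a straight-line model (placeholder values).
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.1, 1.2, 1.9, 3.1])
theta = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(noise_variance_ml(Phi, y, theta))
```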

9.2.2 Overfitting in Linear Regression

We just discussed how to use maximum likelihood estimation to fit linear models (e.g., polynomials) to data. We can evaluate the quality of the model by computing the error/loss incurred. One way of doing this is to compute the negative log-likelihood (9.10b), which we minimized to determine the maximum likelihood estimator. Alternatively, given that the noise parameter σ² is not a free model parameter, we can ignore the scaling by 1/σ², so that we end up with a squared-error-loss function ‖y − Φθ‖². Instead of using this squared loss, we often use the root mean square error (RMSE)

√( (1/N) ‖y − Φθ‖² ) = √( (1/N) ∑_{n=1}^N (y_n − ϕ⊤(x_n)θ)² ),   (9.23)

which (a) allows us to compare errors of datasets with different sizes and (b) has the same scale and the same units as the observed function values y_n. For example, if we fit a model that maps post-codes (x is given in latitude, longitude) to house prices (y-values are EUR), then the RMSE is also measured in EUR, whereas the squared error is given in EUR². If we choose to include the factor σ² from the original negative log-likelihood (9.10b), then we end up with a unitless objective (note that the noise variance σ² > 0), i.e., in the preceding example, our objective would no longer be in EUR or EUR².
For model selection (see Section 8.6), we can use the RMSE (or the negative log-likelihood) to determine the best degree of the polynomial by finding the polynomial degree M that minimizes the objective. Given that the polynomial degree is a natural number, we can perform a brute-force search and enumerate all (reasonable) values of M. For a training set of size N, it is sufficient to test 0 ⩽ M ⩽ N − 1. For M < N, the maximum likelihood estimator is unique. For M ⩾ N, we have more parameters than data points and would need to solve an underdetermined system of linear equations (Φ⊤Φ in (9.19) would also no longer be invertible), so that there are infinitely many possible maximum likelihood estimators.
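The brute-force search described above is straightforward to sketch in code. The following is an illustrative reconstruction (not the authors' code); it regenerates data with the same recipe as Figure 9.4 and reports training and test RMSE for each degree M.

```python
import numpy as np

def poly_features(x, K):
    return np.stack([np.asarray(x, dtype=float)**k for k in range(K)], axis=1)

def rmse(y, y_pred):
    """Root mean square error (9.23)."""
    return np.sqrt(np.mean((y - y_pred)**2))

def f_true(x):                               # same generative function as Figure 9.4
    return -np.sin(x / 5) + np.cos(x)

rng = np.random.default_rng(0)
x_train = rng.uniform(-5, 5, size=10)
y_train = f_true(x_train) + 0.2 * rng.normal(size=10)
x_test = np.linspace(-5, 5, 200)             # linear grid of 200 test inputs in [-5, 5]
y_test = f_true(x_test) + 0.2 * rng.normal(size=200)

for M in range(10):                          # brute-force search over degrees 0, ..., N-1
    Phi_train = poly_features(x_train, M + 1)
    theta = np.linalg.lstsq(Phi_train, y_train, rcond=None)[0]
    err_train = rmse(y_train, Phi_train @ theta)
    err_test = rmse(y_test, poly_features(x_test, M + 1) @ theta)
    print(f"M = {M}: training RMSE {err_train:.3f}, test RMSE {err_test:.3f}")
```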
[Figure 9.5: Maximum likelihood fits for different polynomial degrees M: (a) M = 0, (b) M = 1, (c) M = 3, (d) M = 4, (e) M = 6, (f) M = 9.]

Figure 9.5 shows a number of polynomial fits determined by maximum likelihood for the dataset from Figure 9.4(a) with N = 10 observations. We notice that polynomials of low degree (e.g., constants (M = 0) or linear (M = 1)) fit the data poorly and, hence, are poor representations of the true underlying function. For degrees M = 3, …, 6, the fits look plausible and smoothly interpolate the data. When we go to higher-degree polynomials, we notice that they fit the data better and better. In the extreme case of M = N − 1 = 9, the function will pass through every single data point. (The case of M = N − 1 is extreme in the sense that otherwise the null space of the corresponding system of linear equations would be non-trivial, and we would have infinitely many optimal solutions to the linear regression problem.) However, these high-degree polynomials oscillate wildly and are a poor representation of the underlying function that generated the data, such that we suffer from overfitting.
Remember that the goal is to achieve good generalization by making accurate predictions for new (unseen) data. We obtain some quantitative insight into the dependence of the generalization performance on the polynomial degree M by considering a separate test set comprising 200 data points generated using exactly the same procedure used to generate the training set. As test inputs, we chose a linear grid of 200 points in the interval [−5, 5]. For each choice of M, we evaluate the RMSE (9.23) for both the training data and the test data.
Looking now at the test error, which is a qualitative measure of the generalization properties of the corresponding polynomial, we notice that initially the test error decreases; see Figure 9.6 (orange). For fourth-order polynomials, the test error is relatively low and stays relatively constant up to degree 5. However, from degree 6 onward the test error increases significantly, and high-order polynomials have very bad generalization properties. In this particular example, this is also evident from the corresponding maximum likelihood fits in Figure 9.5.

[Figure 9.6: Training and test error (RMSE) as a function of the degree of the polynomial.]

Note that the training error (blue curve in Figure 9.6) never increases when the degree of the polynomial increases. In our example, the best generalization (the point of the smallest test error) is obtained for a polynomial of degree M = 4.

9.2.3 Maximum A Posteriori Estimation

We just saw that maximum likelihood estimation is prone to overfitting. We often observe that the magnitude of the parameter values becomes relatively large if we run into overfitting (Bishop, 2006).
To mitigate the effect of huge parameter values, we can place a prior distribution p(θ) on the parameters. The prior distribution explicitly encodes what parameter values are plausible (before having seen any data). For example, a Gaussian prior p(θ) = N(0, 1) on a single parameter θ encodes that parameter values are expected to lie in the interval [−2, 2] (two standard deviations around the mean value). Once a dataset X, Y is available, instead of maximizing the likelihood we seek parameters that maximize the posterior distribution p(θ | X, Y). This procedure is called maximum a posteriori (MAP) estimation.
The posterior over the parameters θ, given the training data X, Y, is obtained by applying Bayes' theorem (Section 6.3) as

p(θ | X, Y) = p(Y | X, θ) p(θ) / p(Y | X).   (9.24)

Since the posterior explicitly depends on the parameter prior p(θ), the prior will have an effect on the parameter vector we find as the maximizer of the posterior. We will see this more explicitly in the following. The parameter vector θ_MAP that maximizes the posterior (9.24) is the MAP estimate.
To find the MAP estimate, we follow steps that are similar in flavor to maximum likelihood estimation. We start with the log-transform and compute the log-posterior as

log p(θ | X, Y) = log p(Y | X, θ) + log p(θ) + const,   (9.25)

where the constant comprises the terms that are independent of θ. We see that the log-posterior in (9.25) is the sum of the log-likelihood p(Y | X, θ) and the log-prior log p(θ), so that the MAP estimate will be a "compromise" between the prior (our suggestion for plausible parameter values before observing data) and the data-dependent likelihood.
To find the MAP estimate θ_MAP, we minimize the negative log-posterior distribution with respect to θ, i.e., we solve

θ_MAP ∈ arg min_θ { −log p(Y | X, θ) − log p(θ) }.   (9.26)

The gradient of the negative log-posterior with respect to θ is

−d log p(θ | X, Y)/dθ = −d log p(Y | X, θ)/dθ − d log p(θ)/dθ,   (9.27)

where we identify the first term on the right-hand side as the gradient of the negative log-likelihood from (9.11c).
With a (conjugate) Gaussian prior p(θ) = N(0, b²I) on the parameters θ, we obtain the negative log-posterior for the linear regression setting (9.13) as

−log p(θ | X, Y) = (1/(2σ²)) (y − Φθ)⊤(y − Φθ) + (1/(2b²)) θ⊤θ + const.   (9.28)

Here, the first term corresponds to the contribution from the log-likelihood, and the second term originates from the log-prior. The gradient of the negative log-posterior with respect to the parameters θ is then

−d log p(θ | X, Y)/dθ = (1/σ²) (θ⊤Φ⊤Φ − y⊤Φ) + (1/b²) θ⊤.   (9.29)

We will find the MAP estimate θ_MAP by setting this gradient to 0⊤ and solving for θ_MAP. We obtain

(1/σ²) (θ⊤Φ⊤Φ − y⊤Φ) + (1/b²) θ⊤ = 0⊤   (9.30a)
⟺ θ⊤ ( (1/σ²) Φ⊤Φ + (1/b²) I ) − (1/σ²) y⊤Φ = 0⊤   (9.30b)
⟺ θ⊤ ( Φ⊤Φ + (σ²/b²) I ) = y⊤Φ   (9.30c)
⟺ θ⊤ = y⊤Φ ( Φ⊤Φ + (σ²/b²) I )⁻¹   (9.30d)

so that the MAP estimate is (by transposing both sides of the last equality)

θ_MAP = ( Φ⊤Φ + (σ²/b²) I )⁻¹ Φ⊤y.   (9.31)

Comparing the MAP estimate in (9.31) with the maximum likelihood estimate in (9.19), we see that the only difference between both solutions is the additional term (σ²/b²) I in the inverse matrix. This term ensures that Φ⊤Φ + (σ²/b²) I is symmetric and strictly positive definite (i.e., its inverse exists and the MAP estimate is the unique solution of a system of linear equations). Moreover, it reflects the impact of the regularizer.
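As a quick illustration, (9.31) can be implemented in a few lines. This is a hedged sketch rather than the book's code; the data, the degree-8 polynomial, and the prior variance b² = 1 are placeholder choices.

```python
import numpy as np

def map_fit(Phi, y, sigma2, b2):
    """MAP estimate (9.31): theta_MAP = (Phi^T Phi + sigma^2/b^2 I)^{-1} Phi^T y."""
    K = Phi.shape[1]
    A = Phi.T @ Phi + (sigma2 / b2) * np.eye(K)
    return np.linalg.solve(A, Phi.T @ y)

# Placeholder usage: degree-8 polynomial, prior p(theta) = N(0, I), i.e. b2 = 1.
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=10)
y = -np.sin(x / 5) + np.cos(x) + 0.2 * rng.normal(size=10)
Phi = np.stack([x**k for k in range(9)], axis=1)
theta_map = map_fit(Phi, y, sigma2=0.2**2, b2=1.0)
```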
Example 9.6 (MAP Estimation for Polynomial Regression)
In the polynomial regression example from Section 9.2.1, we place a Gaussian prior p(θ) = N(0, I) on the parameters θ and determine the MAP estimates according to (9.31). In Figure 9.7, we show both the maximum likelihood and the MAP estimates for polynomials of degree 6 (left) and degree 8 (right). The prior (regularizer) does not play a significant role for the low-degree polynomial, but keeps the function relatively smooth for higher-degree polynomials. Although the MAP estimate can push the boundaries of overfitting, it is not a general solution to this problem, so we need a more principled approach to tackle overfitting.

[Figure 9.7: Polynomial regression: maximum likelihood and MAP estimates. (a) Polynomials of degree 6; (b) polynomials of degree 8.]

9.2.4 MAP Estimation as Regularization

Instead of placing a prior distribution on the parameters θ, it is also possible to mitigate the effect of overfitting by penalizing the amplitude of the parameters by means of regularization. In regularized least squares, we consider the loss function

‖y − Φθ‖² + λ ‖θ‖²₂,   (9.32)

which we minimize with respect to θ (see Section 8.2.3). Here, the first term is a data-fit term (also called misfit term), which is proportional to the negative log-likelihood; see (9.10b). The second term is called the regularizer, and the regularization parameter λ ⩾ 0 controls the "strictness" of the regularization.

Remark. Instead of the Euclidean norm ‖·‖₂, we can choose any p-norm ‖·‖_p in (9.32). In practice, smaller values for p lead to sparser solutions. Here, "sparse" means that many parameter values θ_d = 0, which is also useful for variable selection. For p = 1, the regularizer is called LASSO (least absolute shrinkage and selection operator) and was proposed by Tibshirani (1996). ♢

The regularizer λ ‖θ‖²₂ in (9.32) can be interpreted as a negative log-Gaussian prior, which we use in MAP estimation; see (9.26). More specifically, with a Gaussian prior p(θ) = N(0, b²I), we obtain the negative log-Gaussian prior

−log p(θ) = (1/(2b²)) ‖θ‖²₂ + const   (9.33)

so that for λ = 1/(2b²) the regularization term and the negative log-Gaussian prior are identical.
Given that the regularized least-squares loss function in (9.32) consists of terms that are closely related to the negative log-likelihood plus a negative log-prior, it is not surprising that, when we minimize this loss, we obtain a solution that closely resembles the MAP estimate in (9.31). More specifically, minimizing the regularized least-squares loss function yields

θ_RLS = (Φ⊤Φ + λI)⁻¹ Φ⊤y,   (9.34)

which is identical to the MAP estimate in (9.31) for λ = σ²/b², where σ² is the noise variance and b² the variance of the (isotropic) Gaussian prior p(θ) = N(0, b²I).
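A short numeric check (a sketch with placeholder data, not from the book) confirms that (9.34) with λ = σ²/b² and the MAP estimate (9.31) coincide:

```python
import numpy as np

def rls_fit(Phi, y, lam):
    """Regularized least squares (9.34): theta_RLS = (Phi^T Phi + lambda I)^{-1} Phi^T y."""
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

rng = np.random.default_rng(1)
Phi = rng.normal(size=(10, 4))               # placeholder feature matrix
y = rng.normal(size=10)                      # placeholder targets
sigma2, b2 = 0.04, 1.0

theta_rls = rls_fit(Phi, y, lam=sigma2 / b2)
theta_map = np.linalg.solve(Phi.T @ Phi + (sigma2 / b2) * np.eye(4), Phi.T @ y)
print(np.allclose(theta_rls, theta_map))     # True
```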
So far, we have covered parameter estimation using maximum likelihood and MAP estimation, where we found point estimates θ* that optimize an objective function (likelihood or posterior). A point estimate is a single specific parameter value, unlike a distribution over plausible parameter settings. We saw that both maximum likelihood and MAP estimation can lead to overfitting. In the next section, we will discuss Bayesian linear regression, where we use Bayesian inference (Section 8.4) to find a posterior distribution over the unknown parameters, which we subsequently use to make predictions. More specifically, for predictions we will average over all plausible sets of parameters instead of focusing on a point estimate.

9.3 Bayesian Linear Regression

Previously, we looked at linear regression models where we estimated the model parameters θ, e.g., by means of maximum likelihood or MAP estimation. We discovered that MLE can lead to severe overfitting, in particular in the small-data regime. MAP addresses this issue by placing a prior on the parameters that plays the role of a regularizer.
Bayesian linear regression pushes the idea of the parameter prior a step further and does not even attempt to compute a point estimate of the parameters. Instead, the full posterior distribution over the parameters is taken into account when making predictions. This means we do not fit any parameters, but we compute a mean over all plausible parameter settings (according to the posterior).

9.3.1 Model

In Bayesian linear regression, we consider the model

prior:       p(θ) = N(m_0, S_0),
likelihood:  p(y | x, θ) = N(y | ϕ⊤(x)θ, σ²),   (9.35)

where we now explicitly place a Gaussian prior p(θ) = N(m_0, S_0) on θ, which turns the parameter vector into a random variable. This allows us to write down the corresponding graphical model in Figure 9.8, where we made the parameters of the Gaussian prior on θ explicit. The full probabilistic model, i.e., the joint distribution of the observed and unobserved random variables, y and θ, respectively, is

p(y, θ | x) = p(y | x, θ) p(θ).   (9.36)

[Figure 9.8: Graphical model for Bayesian linear regression.]

9.3.2 Prior Predictions

In practice, we are usually not so much interested in the parameter values θ themselves. Instead, our focus often lies in the predictions we make with those parameter values. In a Bayesian setting, we take the parameter distribution and average over all plausible parameter settings when we make predictions. More specifically, to make predictions at an input x_*, we integrate out θ and obtain

p(y_* | x_*) = ∫ p(y_* | x_*, θ) p(θ) dθ = E_θ[p(y_* | x_*, θ)],   (9.37)

which we can interpret as the average prediction of y_* | x_*, θ for all plausible parameters θ according to the prior distribution p(θ). Note that predictions using the prior distribution only require us to specify the input x_*, but no training data.
In our model (9.35), we chose a conjugate (Gaussian) prior on θ so that the predictive distribution is Gaussian as well (and can be computed in closed form): With the prior distribution p(θ) = N(m_0, S_0), we obtain the predictive distribution as

p(y_* | x_*) = N(ϕ⊤(x_*) m_0, ϕ⊤(x_*) S_0 ϕ(x_*) + σ²),   (9.38)

where we exploited that (i) the prediction is Gaussian due to conjugacy (see Section 6.6) and the marginalization property of Gaussians (see Section 6.5), (ii) the Gaussian noise is independent so that

V[y_*] = V_θ[ϕ⊤(x_*)θ] + V_ϵ[ϵ],   (9.39)

and (iii) y_* is a linear transformation of θ, so that we can apply the rules for computing the mean and covariance of the prediction analytically by using (6.50) and (6.51), respectively. In (9.38), the term ϕ⊤(x_*) S_0 ϕ(x_*) in the predictive variance explicitly accounts for the uncertainty associated with the parameters θ, whereas σ² is the uncertainty contribution due to the measurement noise.
If we are interested in predicting noise-free function values f(x_*) = ϕ⊤(x_*)θ instead of the noise-corrupted targets y_*, we obtain

p(f(x_*)) = N(ϕ⊤(x_*) m_0, ϕ⊤(x_*) S_0 ϕ(x_*)),   (9.40)

which only differs from (9.38) in the omission of the noise variance σ² in the predictive variance.

Remark (Distribution over Functions). Since we can represent the distribution p(θ) using a set of samples θ_i, and every sample θ_i gives rise to a function f_i(·) = θ_i⊤ϕ(·), it follows that the parameter distribution p(θ) induces a distribution p(f(·)) over functions. Here we use the notation (·) to explicitly denote a functional relationship. ♢

Example 9.7 (Prior over Functions)

[Figure 9.9: Prior over functions. (a) Distribution over functions represented by the mean function (black line) and the marginal uncertainties (shaded), representing the 67% and 95% confidence bounds, respectively; (b) samples from the prior over functions, which are induced by samples from the parameter prior.]

Let us consider a Bayesian linear regression problem with polynomials of degree 5. We choose a parameter prior p(θ) = N(0, (1/4) I). Figure 9.9 visualizes the prior distribution over functions induced by this parameter prior (shaded area; dark gray: 67% confidence bound, light gray: 95% confidence bound), including some function samples from this prior.
A function sample is obtained by first sampling a parameter vector θ_i ∼ p(θ) and then computing f_i(·) = θ_i⊤ϕ(·). We used 200 input locations x_* ∈ [−5, 5] to which we apply the feature function ϕ(·). The uncertainty (represented by the shaded area) in Figure 9.9 is solely due to parameter uncertainty because we considered the noise-free predictive distribution (9.40).
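The prior-predictive moments and the function samples of Example 9.7 can be sketched as follows (an illustrative reconstruction, not the authors' code; the polynomial feature map and all settings mirror the example).

```python
import numpy as np

rng = np.random.default_rng(0)
K = 6                                     # polynomial of degree 5
m0 = np.zeros(K)                          # prior mean
S0 = 0.25 * np.eye(K)                     # prior covariance, p(theta) = N(0, 1/4 I)

x_star = np.linspace(-5, 5, 200)
Phi_star = np.stack([x_star**k for k in range(K)], axis=1)   # features at 200 inputs

# Prior predictive moments for noise-free function values, cf. (9.40).
mean_f = Phi_star @ m0
var_f = np.einsum("nk,kl,nl->n", Phi_star, S0, Phi_star)     # phi(x)^T S0 phi(x)

# Function samples: draw theta_i ~ p(theta) and evaluate f_i(x) = phi(x)^T theta_i.
theta_samples = rng.multivariate_normal(m0, S0, size=5)
f_samples = Phi_star @ theta_samples.T                       # shape (200, 5)
```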
So far, we looked at computing predictions using the parameter prior p(θ). However, when we have a parameter posterior (given some training data X, Y), the same principles for prediction and inference hold as in (9.37) – we just need to replace the prior p(θ) with the posterior p(θ | X, Y). In the following, we will derive the posterior distribution in detail before using it to make predictions.

9.3.3 Posterior Distribution

Given a training set of inputs x_n ∈ R^D and corresponding observations y_n ∈ R, n = 1, …, N, we compute the posterior over the parameters using Bayes' theorem as

p(θ | X, Y) = p(Y | X, θ) p(θ) / p(Y | X),   (9.41)

where X is the set of training inputs and Y the collection of corresponding training targets. Furthermore, p(Y | X, θ) is the likelihood, p(θ) the parameter prior, and

p(Y | X) = ∫ p(Y | X, θ) p(θ) dθ = E_θ[p(Y | X, θ)]   (9.42)

the marginal likelihood/evidence, which is independent of the parameters θ and ensures that the posterior is normalized, i.e., it integrates to 1. We can think of the marginal likelihood as the likelihood averaged over all possible parameter settings (with respect to the prior distribution p(θ)).

Theorem 9.1 (Parameter Posterior). In our model (9.35), the parameter posterior (9.41) can be computed in closed form as

p(θ | X, Y) = N(θ | m_N, S_N),   (9.43a)
S_N = (S_0⁻¹ + σ⁻² Φ⊤Φ)⁻¹,   (9.43b)
m_N = S_N (S_0⁻¹ m_0 + σ⁻² Φ⊤y),   (9.43c)

where the subscript N indicates the size of the training set.

Proof. Bayes' theorem tells us that the posterior p(θ | X, Y) is proportional to the product of the likelihood p(Y | X, θ) and the prior p(θ):

Posterior:  p(θ | X, Y) = p(Y | X, θ) p(θ) / p(Y | X)   (9.44a)
Likelihood: p(Y | X, θ) = N(y | Φθ, σ²I)   (9.44b)
Prior:      p(θ) = N(θ | m_0, S_0).   (9.44c)

Instead of looking at the product of the prior and the likelihood, we can transform the problem into log-space and solve for the mean and covariance of the posterior by completing the squares.
The sum of the log-prior and the log-likelihood is

log N(y | Φθ, σ²I) + log N(θ | m_0, S_0)   (9.45a)
 = −(1/2) ( σ⁻²(y − Φθ)⊤(y − Φθ) + (θ − m_0)⊤ S_0⁻¹ (θ − m_0) ) + const,   (9.45b)

where the constant contains terms independent of θ. We will ignore the constant in the following. We now factorize (9.45b), which yields

−(1/2) ( σ⁻² y⊤y − 2σ⁻² y⊤Φθ + θ⊤σ⁻²Φ⊤Φθ + θ⊤S_0⁻¹θ − 2m_0⊤S_0⁻¹θ + m_0⊤S_0⁻¹m_0 )   (9.46a)
 = −(1/2) ( θ⊤(σ⁻²Φ⊤Φ + S_0⁻¹)θ − 2(σ⁻²Φ⊤y + S_0⁻¹m_0)⊤θ ) + const,   (9.46b)

where the constant contains the terms in (9.46a) that are independent of θ; the remaining terms are either linear or quadratic in θ. Inspecting (9.46b), we find that this expression is quadratic in θ. The fact that the unnormalized log-posterior distribution is a (negative) quadratic form implies that the posterior is Gaussian, i.e.,

p(θ | X, Y) = exp(log p(θ | X, Y)) ∝ exp(log p(Y | X, θ) + log p(θ))   (9.47a)
 ∝ exp( −(1/2) ( θ⊤(σ⁻²Φ⊤Φ + S_0⁻¹)θ − 2(σ⁻²Φ⊤y + S_0⁻¹m_0)⊤θ ) ),   (9.47b)

where we used (9.46b) in the last expression.
The remaining task is to bring this (unnormalized) Gaussian into a form that is proportional to N(θ | m_N, S_N), i.e., we need to identify the mean m_N and the covariance matrix S_N. To do this, we use the concept of completing the squares. The desired log-posterior is

log N(θ | m_N, S_N) = −(1/2) (θ − m_N)⊤ S_N⁻¹ (θ − m_N) + const   (9.48a)
 = −(1/2) ( θ⊤S_N⁻¹θ − 2m_N⊤S_N⁻¹θ + m_N⊤S_N⁻¹m_N ).   (9.48b)

Here, we factorized the quadratic form (θ − m_N)⊤ S_N⁻¹ (θ − m_N) into a term that is quadratic in θ alone, a term that is linear in θ, and a constant term. This allows us now to find S_N and m_N by matching the quadratic and linear terms in (9.46b) and (9.48b), which yields

S_N⁻¹ = Φ⊤σ⁻²IΦ + S_0⁻¹   (9.49a)
⟺ S_N = (σ⁻²Φ⊤Φ + S_0⁻¹)⁻¹   (9.49b)

and

m_N⊤S_N⁻¹ = (σ⁻²Φ⊤y + S_0⁻¹m_0)⊤   (9.50a)
⟺ m_N = S_N (σ⁻²Φ⊤y + S_0⁻¹m_0).   (9.50b)

Since p(θ | X, Y) = N(θ | m_N, S_N), it also holds that θ_MAP = m_N.
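The closed-form posterior of Theorem 9.1 is a few lines of NumPy. The sketch below is not from the book: it inverts S_0 directly for clarity (a Cholesky-based solve would be preferable in practice) and assumes placeholder data with a degree-5 polynomial feature map.

```python
import numpy as np

def parameter_posterior(Phi, y, sigma2, m0, S0):
    """Posterior N(theta | mN, SN) from (9.43b)-(9.43c)."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma2)
    mN = SN @ (S0_inv @ m0 + Phi.T @ y / sigma2)
    return mN, SN

# Placeholder usage: degree-5 polynomial features, prior N(0, 1/4 I), sigma^2 = 0.04.
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=10)
y = -np.sin(x / 5) + np.cos(x) + 0.2 * rng.normal(size=10)
Phi = np.stack([x**k for k in range(6)], axis=1)
mN, SN = parameter_posterior(Phi, y, sigma2=0.2**2, m0=np.zeros(6), S0=0.25 * np.eye(6))
```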
Remark (General Approach to Completing the Squares). If we are given an equation

x⊤Ax − 2a⊤x + const₁,   (9.51)

where A is symmetric and positive definite, which we wish to bring into the form

(x − µ)⊤Σ(x − µ) + const₂,   (9.52)

we can do this by setting

Σ := A,   (9.53)
µ := Σ⁻¹ a   (9.54)

and const₂ = const₁ − µ⊤Σµ. ♢

We can see that the terms inside the exponential in (9.47b) are of the form (9.51) with

A := σ⁻²Φ⊤Φ + S_0⁻¹,   (9.55)
a := σ⁻²Φ⊤y + S_0⁻¹m_0.   (9.56)

Since A and a can be difficult to identify in equations like (9.46a), it is often helpful to bring these equations into the form (9.51), which decouples the quadratic term, linear terms, and constants and simplifies finding the desired solution.

9.3.4 Posterior Predictions

In (9.37), we computed the predictive distribution of y_* at a test input x_* using the parameter prior p(θ). In principle, predicting with the parameter posterior p(θ | X, Y) is not fundamentally different, given that in our conjugate model the prior and posterior are both Gaussian (with different parameters). Therefore, by following the same reasoning as in Section 9.3.2, we obtain the (posterior) predictive distribution

p(y_* | X, Y, x_*) = ∫ p(y_* | x_*, θ) p(θ | X, Y) dθ   (9.57a)
 = ∫ N(y_* | ϕ⊤(x_*)θ, σ²) N(θ | m_N, S_N) dθ   (9.57b)
 = N(y_* | ϕ⊤(x_*) m_N, ϕ⊤(x_*) S_N ϕ(x_*) + σ²).   (9.57c)

The term ϕ⊤(x_*) S_N ϕ(x_*) reflects the posterior uncertainty associated with the parameters θ. Note that S_N depends on the training inputs through Φ; see (9.43b). The predictive mean ϕ⊤(x_*) m_N coincides with the predictions made with the MAP estimate θ_MAP.
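The moments in (9.57c) can be evaluated directly. The following self-contained sketch (not from the book; placeholder data, zero prior mean m_0 = 0) computes the posterior and then the posterior predictive mean and variance at a grid of test inputs.

```python
import numpy as np

def poly_features(x, K):
    return np.stack([np.asarray(x, dtype=float)**k for k in range(K)], axis=1)

def posterior_predictive(x_star, mN, SN, sigma2, K):
    """Mean and variance of the posterior predictive (9.57c) at test inputs x_star."""
    phi = poly_features(x_star, K)
    mean = phi @ mN                                            # phi(x*)^T mN
    var = np.einsum("nk,kl,nl->n", phi, SN, phi) + sigma2      # phi^T SN phi + sigma^2
    return mean, var

# Toy run: degree-5 polynomial, prior N(0, 1/4 I), noise variance sigma^2 = 0.04.
rng = np.random.default_rng(0)
K, sigma2 = 6, 0.2**2
x = rng.uniform(-5, 5, size=10)
y = -np.sin(x / 5) + np.cos(x) + np.sqrt(sigma2) * rng.normal(size=10)
Phi = poly_features(x, K)
SN = np.linalg.inv(np.eye(K) / 0.25 + Phi.T @ Phi / sigma2)    # (9.43b) with S0 = 1/4 I
mN = SN @ (Phi.T @ y / sigma2)                                 # (9.43c) with m0 = 0
mean, var = posterior_predictive(np.linspace(-5, 5, 200), mN, SN, sigma2, K)
```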
Remark (Marginal Likelihood and Posterior Predictive Distribution). The integral in (9.57a) can be equivalently written as the expectation E_{θ | X,Y}[p(y_* | x_*, θ)], where the expectation is taken with respect to the parameter posterior p(θ | X, Y).
Writing the posterior predictive distribution in this way highlights a close resemblance to the marginal likelihood (9.42). The key differences between the marginal likelihood and the posterior predictive distribution are that (i) the marginal likelihood can be thought of as predicting the training targets y and not the test targets y_*, and (ii) the marginal likelihood averages with respect to the parameter prior and not the parameter posterior. ♢

Remark (Mean and Variance of Noise-Free Function Values). In many cases, we are not interested in the predictive distribution p(y_* | X, Y, x_*) of a (noisy) observation y_*. Instead, we would like to obtain the distribution of the (noise-free) function values f(x_*) = ϕ⊤(x_*)θ. We determine the corresponding moments by exploiting the properties of means and variances, which yields

E[f(x_*) | X, Y] = E_θ[ϕ⊤(x_*)θ | X, Y] = ϕ⊤(x_*) E_θ[θ | X, Y] = ϕ⊤(x_*) m_N = m_N⊤ ϕ(x_*),   (9.58)
V_θ[f(x_*) | X, Y] = V_θ[ϕ⊤(x_*)θ | X, Y] = ϕ⊤(x_*) V_θ[θ | X, Y] ϕ(x_*) = ϕ⊤(x_*) S_N ϕ(x_*).   (9.59)

We see that the predictive mean is the same as the predictive mean for noisy observations, since the noise has mean 0, and the predictive variance only differs by σ², which is the variance of the measurement noise: When we predict noisy function values, we need to include σ² as a source of uncertainty, but this term is not needed for noise-free predictions. Here, the only remaining uncertainty stems from the parameter posterior. ♢

Remark (Distribution over Functions). The fact that we integrate out the parameters θ induces a distribution over functions: If we sample θ_i ∼ p(θ | X, Y) from the parameter posterior, we obtain a single function realization θ_i⊤ϕ(·). The mean function, i.e., the set of all expected function values E_θ[f(·) | θ, X, Y], of this distribution over functions is m_N⊤ϕ(·). The (marginal) variance, i.e., the variance of the function f(·), is given by ϕ⊤(·) S_N ϕ(·). ♢

Example 9.8 (Posterior over Functions)
Let us revisit the Bayesian linear regression problem with polynomials of degree 5. We choose a parameter prior p(θ) = N(0, (1/4) I). Figure 9.9 visualizes the prior over functions induced by the parameter prior and sample functions from this prior.
Figure 9.10 shows the posterior over functions that we obtain via Bayesian linear regression. The training dataset is shown in panel (a); panel (b) shows the posterior distribution over functions, including the functions we would obtain via maximum likelihood and MAP estimation. The function we obtain using the MAP estimate also corresponds to the posterior mean function in the Bayesian linear regression setting. Panel (c) shows some plausible realizations (samples) of functions under that posterior over functions.

[Figure 9.10: Bayesian linear regression and posterior over functions. (a) Training data; (b) posterior over functions represented by the marginal uncertainties (shaded), showing the 67% and 95% predictive confidence bounds, the maximum likelihood estimate (MLE), and the MAP estimate (MAP), the latter of which is identical to the posterior mean function; (c) samples from the posterior over functions, induced by samples from the parameter posterior.]

Figure 9.11 shows some posterior distributions over functions induced by the parameter posterior. For different polynomial degrees M, the left panels show the maximum likelihood function θ_ML⊤ϕ(·), the MAP function θ_MAP⊤ϕ(·) (which is identical to the posterior mean function), and the 67% and 95% predictive confidence bounds obtained by Bayesian linear regression, represented by the shaded areas.
The right panels show samples from the posterior over functions: Here, we sampled parameters θ_i from the parameter posterior and computed the function ϕ⊤(x_*)θ_i, which is a single realization of a function under the posterior distribution over functions. For low-order polynomials, the parameter posterior does not allow the parameters to vary much: The sampled functions are nearly identical. When we make the model more flexible by adding more parameters (i.e., we end up with a higher-order polynomial), these parameters are not sufficiently constrained by the posterior, and the sampled functions can be easily visually separated. We also see in the corresponding panels on the left how the uncertainty increases, especially at the boundaries.

[Figure 9.11: Bayesian linear regression for polynomials of degree M = 3, M = 5, and M = 7. Left panels: shaded areas indicate the 67% (dark gray) and 95% (light gray) predictive confidence bounds; the mean of the Bayesian linear regression model coincides with the MAP estimate, and the predictive uncertainty is the sum of the noise term and the posterior parameter uncertainty, which depends on the location of the test input. Right panels: sampled functions from the posterior distribution.]

Although for a seventh-order polynomial the MAP estimate yields a reasonable fit, the Bayesian linear regression model additionally tells us that the posterior uncertainty is huge. This information can be critical when we use these predictions in a decision-making system, where bad decisions can have significant consequences (e.g., in reinforcement learning or robotics).
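The posterior function samples shown in Figure 9.11 can be generated along the following lines (an illustrative sketch with placeholder data, not the authors' plotting code).

```python
import numpy as np

rng = np.random.default_rng(0)
K, sigma2 = 6, 0.2**2                          # degree-5 polynomial, noise variance
x = rng.uniform(-5, 5, size=10)
y = -np.sin(x / 5) + np.cos(x) + np.sqrt(sigma2) * rng.normal(size=10)
Phi = np.stack([x**k for k in range(K)], axis=1)

SN = np.linalg.inv(np.eye(K) / 0.25 + Phi.T @ Phi / sigma2)   # posterior covariance (9.43b)
mN = SN @ (Phi.T @ y / sigma2)                                # posterior mean (9.43c), m0 = 0

x_star = np.linspace(-5, 5, 200)
Phi_star = np.stack([x_star**k for k in range(K)], axis=1)

theta_samples = rng.multivariate_normal(mN, SN, size=10)      # theta_i ~ p(theta | X, Y)
f_samples = Phi_star @ theta_samples.T                        # sampled functions f_i(x_*)
f_mean = Phi_star @ mN                                        # posterior mean function
f_std = np.sqrt(np.einsum("nk,kl,nl->n", Phi_star, SN, Phi_star))   # noise-free std. dev.
```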
9.3.5 Computing the Marginal Likelihood

In Section 8.6.2, we highlighted the importance of the marginal likelihood for Bayesian model selection. In the following, we compute the marginal likelihood for Bayesian linear regression with a conjugate Gaussian prior on the parameters, i.e., exactly the setting we have been discussing in this chapter.
Just to recap, we consider the following generative process:

θ ∼ N(m_0, S_0),   (9.60a)
y_n | x_n, θ ∼ N(x_n⊤θ, σ²),   (9.60b)

n = 1, …, N. The marginal likelihood is given by

p(Y | X) = ∫ p(Y | X, θ) p(θ) dθ   (9.61a)
 = ∫ N(y | Xθ, σ²I) N(θ | m_0, S_0) dθ,   (9.61b)

where we integrate out the model parameters θ. We compute the marginal likelihood in two steps: First, we show that the marginal likelihood is Gaussian (as a distribution in y); second, we compute the mean and covariance of this Gaussian.

1. The marginal likelihood is Gaussian: From Section 6.5.2, we know that (i) the product of two Gaussian densities is an (unnormalized) Gaussian density, and (ii) a linear transformation of a Gaussian random variable is Gaussian distributed. In (9.61b), we require a linear transformation to bring N(y | Xθ, σ²I) into the form N(θ | µ, Σ) for some µ, Σ. Once this is done, the integral can be solved in closed form. The result is the normalizing constant of the product of the two Gaussians. The normalizing constant itself has Gaussian shape; see (6.76).

2. Mean and covariance: We compute the mean and covariance matrix of the marginal likelihood by exploiting the standard results for means and covariances of affine transformations of random variables; see Section 6.4.4. The mean of the marginal likelihood is computed as

E[Y | X] = E_{θ,ϵ}[Xθ + ϵ] = X E_θ[θ] = X m_0.   (9.62)

Note that ϵ ∼ N(0, σ²I) is a vector of i.i.d. random variables. The covariance matrix is given as

Cov[Y | X] = Cov_{θ,ϵ}[Xθ + ϵ] = Cov_θ[Xθ] + σ²I   (9.63a)
 = X Cov_θ[θ] X⊤ + σ²I = X S_0 X⊤ + σ²I.   (9.63b)

Hence, the marginal likelihood is

p(Y | X) = (2π)^{−N/2} det(X S_0 X⊤ + σ²I)^{−1/2} exp( −(1/2) (y − X m_0)⊤ (X S_0 X⊤ + σ²I)⁻¹ (y − X m_0) )   (9.64a)
 = N(y | X m_0, X S_0 X⊤ + σ²I).   (9.64b)

Given the close connection with the posterior predictive distribution (see the remark on the marginal likelihood and the posterior predictive distribution earlier in this section), the functional form of the marginal likelihood should not be too surprising.
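For model selection it is usually the log marginal likelihood that is evaluated in practice. A minimal sketch of (9.64b), not from the book and using random placeholder data:

```python
import numpy as np

def log_marginal_likelihood(X, y, m0, S0, sigma2):
    """Log of (9.64b): log N(y | X m0, X S0 X^T + sigma^2 I)."""
    N = X.shape[0]
    cov = X @ S0 @ X.T + sigma2 * np.eye(N)
    diff = y - X @ m0
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + quad)

# Placeholder usage with random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -0.5, 2.0]) + 0.1 * rng.normal(size=20)
print(log_marginal_likelihood(X, y, m0=np.zeros(3), S0=np.eye(3), sigma2=0.01))
```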
9.4 Maximum Likelihood as Orthogonal Projection

Having crunched through much algebra to derive maximum likelihood and MAP estimates, we will now provide a geometric interpretation of maximum likelihood estimation. Let us consider a simple linear regression setting

y = xθ + ϵ,  ϵ ∼ N(0, σ²),   (9.65)

in which we consider linear functions f : R → R that go through the origin (we omit features here for clarity). The parameter θ determines the slope of the line. Figure 9.12(a) shows a one-dimensional dataset.

[Figure 9.12: Geometric interpretation of least squares. (a) Regression dataset consisting of noisy observations y_n (blue) of function values f(x_n) at input locations x_n; (b) the orange dots are the projections of the noisy observations (blue dots) onto the line θ_ML x. The maximum likelihood solution to a linear regression problem finds a subspace (line) onto which the overall projection error (orange lines) of the observations is minimized.]

With a training dataset {(x_1, y_1), …, (x_N, y_N)}, we recall the results from Section 9.2.1 and obtain the maximum likelihood estimator for the slope parameter as

θ_ML = (X⊤X)⁻¹X⊤y = X⊤y / (X⊤X) ∈ R,   (9.66)

where X = [x_1, …, x_N]⊤ ∈ R^N and y = [y_1, …, y_N]⊤ ∈ R^N.
This means that for the training inputs X we obtain the optimal (maximum likelihood) reconstruction of the training targets as

Xθ_ML = X X⊤y / (X⊤X) = ( XX⊤ / (X⊤X) ) y,   (9.67)

i.e., we obtain the approximation with the minimum least-squares error between y and Xθ.
As we are looking for a solution of y = Xθ, we can think of linear regression as a problem of solving systems of linear equations. Therefore, we can relate to concepts from linear algebra and analytic geometry that we discussed in Chapters 2 and 3. In particular, looking carefully at (9.67), we see that the maximum likelihood estimator θ_ML in our example from (9.65) effectively does an orthogonal projection of y onto the one-dimensional subspace spanned by X. Recalling the results on orthogonal projections from Section 3.8, we identify XX⊤/(X⊤X) as the projection matrix, θ_ML as the coordinates of the projection onto the one-dimensional subspace of R^N spanned by X, and Xθ_ML as the orthogonal projection of y onto this subspace.
Therefore, the maximum likelihood solution also provides a geometrically optimal solution by finding the vectors in the subspace spanned by X that are "closest" to the corresponding observations y, where "closest" means the smallest (squared) distance of the function values y_n to x_nθ. This is achieved by orthogonal projections. Figure 9.12(b) shows the projection of the noisy observations onto the subspace that minimizes the squared distance between the original dataset and its projection (note that the x-coordinate is fixed), which corresponds to the maximum likelihood solution.
In the general linear regression case where

y = ϕ⊤(x)θ + ϵ,  ϵ ∼ N(0, σ²),   (9.68)

with vector-valued features ϕ(x) ∈ R^K, we again can interpret the maximum likelihood result

y ≈ Φθ_ML,   (9.69)
θ_ML = (Φ⊤Φ)⁻¹Φ⊤y   (9.70)

as a projection onto a K-dimensional subspace of R^N, which is spanned by the columns of the feature matrix Φ; see Section 3.8.2.
If the feature functions ϕ_k that we use to construct the feature matrix Φ are orthonormal (see Section 3.7), we obtain a special case where the columns of Φ form an orthonormal basis (see Section 3.5), such that Φ⊤Φ = I. This will then lead to the projection

Φ(Φ⊤Φ)⁻¹Φ⊤y = ΦΦ⊤y = ( ∑_{k=1}^K ϕ_k ϕ_k⊤ ) y,   (9.71)

so that the maximum likelihood projection is simply the sum of projections of y onto the individual basis vectors ϕ_k, i.e., the columns of Φ. Furthermore, the coupling between different features has disappeared due to the orthogonality of the basis. Many popular basis functions in signal processing, such as wavelets and Fourier bases, are orthogonal basis functions. When the basis is not orthogonal, one can convert a set of linearly independent basis functions to an orthogonal basis by using the Gram-Schmidt process; see Section 3.8.3 and Strang (2003).
Draft (2024-01-15) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com. ©2024 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).
316 Linear Regression

Therefore, the building blocks of this deep neural network are the generalized linear models defined in (9.72). Neural networks (Bishop, 1995; Goodfellow et al., 2016) are significantly more expressive and flexible than linear regression models. However, maximum likelihood parameter estimation is a non-convex optimization problem, and marginalization of the parameters in a fully Bayesian setting is analytically intractable.
We briefly hinted at the fact that a distribution over parameters induces a distribution over regression functions. Gaussian processes (Rasmussen and Williams, 2006) are regression models where the concept of a distribution over functions is central. Instead of placing a distribution over parameters, a Gaussian process places a distribution directly on the space of functions without the "detour" via the parameters. To do so, the Gaussian process exploits the kernel trick (Schölkopf and Smola, 2002), which allows us to compute inner products between two function values f(xi), f(xj) only by looking at the corresponding inputs xi, xj. A Gaussian process is closely related to both Bayesian linear regression and support vector regression but can also be interpreted as a Bayesian neural network with a single hidden layer where the number of units tends to infinity (Neal, 1996; Williams, 1997). Excellent introductions to Gaussian processes can be found in MacKay (1998) and Rasmussen and Williams (2006).
We focused on Gaussian parameter priors in the discussions in this chapter because they allow for closed-form inference in linear regression models. However, even in a regression setting with Gaussian likelihoods, we may choose a non-Gaussian prior. Consider a setting where the inputs are x ∈ R^D and our training set is small and of size N ≪ D. This means that the regression problem is underdetermined. In this case, we can choose a parameter prior that enforces sparsity, i.e., a prior that tries to set as many parameters to 0 as possible (variable selection). This prior provides a stronger regularizer than the Gaussian prior, which often leads to an increased prediction accuracy and interpretability of the model. The Laplace prior is one example that is frequently used for this purpose. A linear regression model with the Laplace prior on the parameters is equivalent to linear regression with L1 regularization (LASSO) (Tibshirani, 1996). The Laplace distribution is sharply peaked at zero (its first derivative is discontinuous), and it concentrates its probability mass closer to zero than the Gaussian distribution, which encourages parameters to be 0. Therefore, the nonzero parameters are relevant for the regression problem, which is the reason why we also speak of "variable selection".

10 Dimensionality Reduction with Principal Component Analysis

Working directly with high-dimensional data, such as images, comes with some difficulties: It is hard to analyze, interpretation is difficult, visualization is nearly impossible, and (from a practical point of view) storage of the data vectors can be expensive. (A 640 × 480 pixel color image is a data point in a million-dimensional space, where every pixel corresponds to three dimensions, one for each color channel: red, green, blue.) However, high-dimensional data often has properties that we can exploit. For example, high-dimensional data is often overcomplete, i.e., many dimensions are redundant and can be explained by a combination of other dimensions. Furthermore, dimensions in high-dimensional data are often correlated so that the data possesses an intrinsic lower-dimensional structure. Dimensionality reduction exploits structure and correlation and allows us to work with a more compact representation of the data, ideally without losing information. We can think of dimensionality reduction as a compression technique, similar to jpeg or mp3, which are compression algorithms for images and music.
In this chapter, we will discuss principal component analysis (PCA), an algorithm for linear dimensionality reduction. PCA, proposed by Pearson (1901) and Hotelling (1933), has been around for more than 100 years and is still one of the most commonly used techniques for data compression and data visualization. It is also used for the identification of simple patterns, latent factors, and structures of high-dimensional data.

Figure 10.1 Illustration: dimensionality reduction. (a) The original dataset does not vary much along the x2 direction; (b) the data from (a) can be represented using the x1-coordinate alone with nearly no loss.

In the signal processing community, PCA is also known as the Karhunen-Loève transform. In this chapter, we derive PCA from first principles, drawing on our understanding of basis and basis change (Sections 2.6.1 and 2.7.2), projections (Section 3.8), eigenvalues (Section 4.2), Gaussian distributions (Section 6.5), and constrained optimization (Section 7.2).
Dimensionality reduction generally exploits a property of high-dimensional data (e.g., images) that it often lies on a low-dimensional subspace. Figure 10.1 gives an illustrative example in two dimensions. Although the data in Figure 10.1(a) does not quite lie on a line, the data does not vary much in the x2-direction, so that we can express it as if it were on a line – with nearly no loss; see Figure 10.1(b). To describe the data in Figure 10.1(b), only the x1-coordinate is required, and the data lies in a one-dimensional subspace of R².

10.1 Problem Setting

In PCA, we are interested in finding projections x̃n of data points xn that are as similar to the original data points as possible, but which have a significantly lower intrinsic dimensionality. Figure 10.1 gives an illustration of what this could look like.
More concretely, we consider an i.i.d. dataset X = {x1, . . . , xN}, xn ∈ R^D, with mean 0 that possesses the data covariance matrix (6.42)

S = (1/N) Σ_{n=1}^N xn xn⊤ .   (10.1)

Furthermore, we assume there exists a low-dimensional compressed representation (code)

zn = B⊤ xn ∈ R^M   (10.2)

of xn, where we define the projection matrix

B := [b1, . . . , bM] ∈ R^{D×M} .   (10.3)

We assume that the columns of B are orthonormal (Definition 3.7) so that bi⊤bj = 0 if and only if i ≠ j and bi⊤bi = 1. We seek an M-dimensional subspace U ⊆ R^D, dim(U) = M < D, onto which we project the data. We denote the projected data by x̃n ∈ U, and their coordinates (with respect to the basis vectors b1, . . . , bM of U) by zn. Our aim is to find projections x̃n ∈ R^D (or equivalently the codes zn and the basis vectors b1, . . . , bM) so that they are as similar to the original data xn as possible and minimize the loss due to compression. The columns b1, . . . , bM of B form a basis of the M-dimensional subspace in which the projected data x̃ = BB⊤x ∈ R^D live.

Example 10.1 (Coordinate Representation/Code)
Consider R² with the canonical basis e1 = [1, 0]⊤, e2 = [0, 1]⊤. From Chapter 2, we know that x ∈ R² can be represented as a linear combination of these basis vectors, e.g.,

[5, 3]⊤ = 5e1 + 3e2 .   (10.4)

However, when we consider vectors of the form

x̃ = [0, z]⊤ ∈ R² ,  z ∈ R ,   (10.5)

they can always be written as 0e1 + ze2. To represent these vectors it is sufficient to remember/store the coordinate/code z of x̃ with respect to the e2 vector.
More precisely, the set of x̃ vectors (with the standard vector addition and scalar multiplication) forms a vector subspace U (see Section 2.4) with dim(U) = 1 because U = span[e2]. (The dimension of a vector space corresponds to the number of its basis vectors; see Section 2.6.1.)

In Section 10.2, we will find low-dimensional representations that retain as much information as possible and minimize the compression loss. An alternative derivation of PCA is given in Section 10.3, where we will be looking at minimizing the squared reconstruction error ∥xn − x̃n∥² between the original data xn and its projection x̃n.

Figure 10.2 Graphical illustration of PCA. In PCA, we find a compressed version z of the original data x. The compressed data can be reconstructed into x̃, which lives in the original data space, but has an intrinsic lower-dimensional representation than x: original x ∈ R^D → compressed z ∈ R^M → reconstructed x̃ ∈ R^D.

Figure 10.2 illustrates the setting we consider in PCA, where z represents the lower-dimensional representation of the compressed data x̃ and plays the role of a bottleneck, which controls how much information can flow between x and x̃. In PCA, we consider a linear relationship between the original data x and its low-dimensional code z so that z = B⊤x and x̃ = Bz for a suitable matrix B. Based on the motivation of thinking of PCA as a data compression technique, we can interpret the arrows in Figure 10.2 as a pair of operations representing encoders and decoders. The linear mapping represented by B can be thought of as a decoder, which maps the low-dimensional code z ∈ R^M back into the original data space R^D. Similarly, B⊤ can be thought of as an encoder, which encodes the original data x as a low-dimensional (compressed) code z.
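As a concrete illustration of this encoder/decoder view, the following NumPy sketch (a toy example with made-up data, not the book's code) builds a matrix B with orthonormal columns via a QR decomposition, encodes with B⊤ and decodes with B.

    import numpy as np

    rng = np.random.default_rng(1)
    D, M, N = 5, 2, 100

    # Any matrix with orthonormal columns serves as B; QR of a random matrix gives one.
    B, _ = np.linalg.qr(rng.standard_normal((D, M)))   # B in R^{D x M}, with B.T @ B = I

    X = rng.standard_normal((D, N))   # data points as columns, following this chapter
    Z = B.T @ X                       # encoder: codes z_n = B^T x_n in R^M
    X_tilde = B @ Z                   # decoder: reconstructions x_tilde_n = B z_n in R^D

    # B B^T x_n is the orthogonal projection of x_n onto the subspace spanned by B's columns.
    assert np.allclose(X_tilde, B @ B.T @ X)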

Throughout this chapter, we will use the MNIST digits dataset as a recurring example, which contains 60,000 examples of handwritten digits 0 through 9. Each digit is a grayscale image of size 28 × 28, i.e., it contains 784 pixels so that we can interpret every image in this dataset as a vector x ∈ R^784. Examples of these digits are shown in Figure 10.3.

Figure 10.3 Examples of handwritten digits from the MNIST dataset. http://yann.lecun.com/exdb/mnist/.

10.2 Maximum Variance Perspective

Figure 10.1 gave an example of how a two-dimensional dataset can be represented using a single coordinate. In Figure 10.1(b), we chose to ignore the x2-coordinate of the data because it did not add too much information so that the compressed data is similar to the original data in Figure 10.1(a). We could have chosen to ignore the x1-coordinate, but then the compressed data would have been very dissimilar from the original data, and much information in the data would have been lost.
If we interpret information content in the data as how "space filling" the dataset is, then we can describe the information contained in the data by looking at the spread of the data. From Section 6.4.1, we know that the variance is an indicator of the spread of the data, and we can derive PCA as a dimensionality reduction algorithm that maximizes the variance in the low-dimensional representation of the data to retain as much information as possible. Figure 10.4 illustrates this.

Figure 10.4 PCA finds a lower-dimensional subspace (line) that maintains as much variance (spread of the data) as possible when the data (blue) is projected onto this subspace (orange).

Considering the setting discussed in Section 10.1, our aim is to find a matrix B (see (10.3)) that retains as much information as possible when compressing data by projecting it onto the subspace spanned by the columns b1, . . . , bM of B. Retaining most information after data compression is equivalent to capturing the largest amount of variance in the low-dimensional code (Hotelling, 1933).
Remark (Centered Data). For the data covariance matrix in (10.1), we assumed centered data. We can make this assumption without loss of generality: Let us assume that µ is the mean of the data. Using the properties of the variance, which we discussed in Section 6.4.4, we obtain

Vz[z] = Vx[B⊤(x − µ)] = Vx[B⊤x − B⊤µ] = Vx[B⊤x] ,   (10.6)

i.e., the variance of the low-dimensional code does not depend on the mean of the data. Therefore, we assume without loss of generality that the data has mean 0 for the remainder of this section. With this assumption the mean of the low-dimensional code is also 0 since Ez[z] = Ex[B⊤x] = B⊤Ex[x] = 0. ♢

10.2.1 Direction with Maximal Variance

We maximize the variance of the low-dimensional code using a sequential approach. We start by seeking a single vector b1 ∈ R^D that maximizes the variance of the projected data, i.e., we aim to maximize the variance of the first coordinate z1 of z ∈ R^M so that

V1 := V[z1] = (1/N) Σ_{n=1}^N z1n²   (10.7)

is maximized, where we exploited the i.i.d. assumption of the data and defined z1n as the first coordinate of the low-dimensional representation zn ∈ R^M of xn ∈ R^D. The vector b1 will be the first column of the matrix B and therefore the first of M orthonormal basis vectors that span the lower-dimensional subspace. Note that the first component of zn is given by

z1n = b1⊤ xn ,   (10.8)

i.e., it is the coordinate of the orthogonal projection of xn onto the one-dimensional subspace spanned by b1 (Section 3.8). We substitute (10.8) into (10.7), which yields

V1 = (1/N) Σ_{n=1}^N (b1⊤ xn)² = (1/N) Σ_{n=1}^N b1⊤ xn xn⊤ b1   (10.9a)
   = b1⊤ ( (1/N) Σ_{n=1}^N xn xn⊤ ) b1 = b1⊤ S b1 ,   (10.9b)

where S is the data covariance matrix defined in (10.1). In (10.9a), we have used the fact that the dot product of two vectors is symmetric with respect to its arguments, that is, b1⊤xn = xn⊤b1.
Notice that arbitrarily increasing the magnitude of the vector b1 increases V1, that is, a vector b1 that is two times longer can result in V1 that is potentially four times larger. Therefore, we restrict all solutions to ∥b1∥² = 1 (equivalently, ∥b1∥ = 1), which results in a constrained optimization problem in which we seek the direction along which the data varies most.
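The identity (10.9b) is easy to check numerically. A minimal NumPy sketch, assuming a centered toy dataset and an arbitrary unit vector b1:

    import numpy as np

    rng = np.random.default_rng(2)
    D, N = 3, 1000
    X = rng.standard_normal((D, N)) * np.array([[3.0], [1.0], [0.3]])  # toy data, columns are x_n
    X = X - X.mean(axis=1, keepdims=True)        # center the data, as assumed in (10.1)

    S = (X @ X.T) / N                            # data covariance matrix (10.1)
    b1 = np.array([1.0, 1.0, 0.0])
    b1 /= np.linalg.norm(b1)                     # some unit vector

    z1 = b1 @ X                                  # coordinates z_{1n} = b1^T x_n, cf. (10.8)
    V1 = np.mean(z1 ** 2)                        # variance of the projected data, (10.7)
    assert np.isclose(V1, b1 @ S @ b1)           # matches b1^T S b1, (10.9b)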

With the restriction of the solution space to unit vectors, the vector b1 that points in the direction of maximum variance can be found by the constrained optimization problem

max_{b1} b1⊤ S b1   subject to   ∥b1∥² = 1 .   (10.10)

Following Section 7.2, we obtain the Lagrangian

L(b1, λ1) = b1⊤ S b1 + λ1 (1 − b1⊤ b1)   (10.11)

to solve this constrained optimization problem. The partial derivatives of L with respect to b1 and λ1 are

∂L/∂b1 = 2 b1⊤ S − 2 λ1 b1⊤ ,   ∂L/∂λ1 = 1 − b1⊤ b1 ,   (10.12)

respectively. Setting these partial derivatives to 0 gives us the relations

S b1 = λ1 b1 ,   (10.13)
b1⊤ b1 = 1 .   (10.14)

By comparing this with the definition of an eigenvalue decomposition (Section 4.4), we see that b1 is an eigenvector of the data covariance matrix S, and the Lagrange multiplier λ1 plays the role of the corresponding eigenvalue. This eigenvector property (10.13) allows us to rewrite our variance objective (10.10) as

V1 = b1⊤ S b1 = λ1 b1⊤ b1 = λ1 ,   (10.15)

i.e., the variance of the data projected onto a one-dimensional subspace equals the eigenvalue that is associated with the basis vector b1 that spans this subspace. Therefore, to maximize the variance of the low-dimensional code, we choose the basis vector associated with the largest eigenvalue of the data covariance matrix. This eigenvector is called the first principal component. (The quantity λ1 is also called the loading of the unit vector b1 and represents the standard deviation of the data accounted for by the principal subspace span[b1].) We can determine the effect/contribution of the principal component b1 in the original data space by mapping the coordinate z1n back into data space, which gives us the projected data point

x̃n = b1 z1n = b1 b1⊤ xn ∈ R^D   (10.16)

in the original data space.
Remark. Although x̃n is a D-dimensional vector, it only requires a single coordinate z1n to represent it with respect to the basis vector b1 ∈ R^D. ♢

10.2.2 M-dimensional Subspace with Maximal Variance

Assume we have found the first m − 1 principal components as the m − 1 eigenvectors of S that are associated with the largest m − 1 eigenvalues. Since S is symmetric, the spectral theorem (Theorem 4.15) states that we can use these eigenvectors to construct an orthonormal eigenbasis of an (m − 1)-dimensional subspace of R^D. Generally, the mth principal component can be found by subtracting the effect of the first m − 1 principal components b1, . . . , bm−1 from the data, thereby trying to find principal components that compress the remaining information. We then arrive at the new data matrix

X̂ := X − Σ_{i=1}^{m−1} bi bi⊤ X = X − B_{m−1} X ,   (10.17)

where X = [x1, . . . , xN] ∈ R^{D×N} contains the data points as column vectors and B_{m−1} := Σ_{i=1}^{m−1} bi bi⊤ is a projection matrix that projects onto the subspace spanned by b1, . . . , bm−1. The matrix X̂ := [x̂1, . . . , x̂N] ∈ R^{D×N} in (10.17) contains the information in the data that has not yet been compressed.
Remark (Notation). Throughout this chapter, we do not follow the convention of collecting data x1, . . . , xN as the rows of the data matrix, but we define them to be the columns of X. This means that our data matrix X is a D × N matrix instead of the conventional N × D matrix. The reason for our choice is that the algebra operations work out smoothly without the need to either transpose the matrix or to redefine vectors as row vectors that are left-multiplied onto matrices. ♢
To find the mth principal component, we maximize the variance

Vm = V[zm] = (1/N) Σ_{n=1}^N zmn² = (1/N) Σ_{n=1}^N (bm⊤ x̂n)² = bm⊤ Ŝ bm ,   (10.18)

subject to ∥bm∥² = 1, where we followed the same steps as in (10.9b) and defined Ŝ as the data covariance matrix of the transformed dataset X̂ := {x̂1, . . . , x̂N}. As previously, when we looked at the first principal component alone, we solve a constrained optimization problem and discover that the optimal solution bm is the eigenvector of Ŝ that is associated with the largest eigenvalue of Ŝ.
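The sequential construction described above can be sketched directly: take the eigenvector of S with the largest eigenvalue as b1, subtract its contribution from the data as in (10.17), and repeat on the deflated data. The following NumPy sketch uses made-up toy data; np.linalg.eigh returns eigenvalues in ascending order.

    import numpy as np

    rng = np.random.default_rng(3)
    D, N, M = 4, 500, 2
    X = rng.standard_normal((D, N)) * np.array([[4.0], [2.0], [1.0], [0.5]])
    X = X - X.mean(axis=1, keepdims=True)

    X_hat = X.copy()
    B = []
    for m in range(M):
        S_hat = (X_hat @ X_hat.T) / N              # covariance of the not-yet-compressed data
        eigvals, eigvecs = np.linalg.eigh(S_hat)   # ascending eigenvalues
        b_m = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue, cf. (10.18)
        B.append(b_m)
        X_hat = X_hat - np.outer(b_m, b_m) @ X_hat # deflation step (10.17)

    B = np.column_stack(B)                         # first M principal components as columns

    # The same subspace comes from the top-M eigenvectors of S directly.
    S = (X @ X.T) / N
    top = np.linalg.eigh(S)[1][:, ::-1][:, :M]
    assert np.allclose(np.abs(B.T @ top), np.eye(M), atol=1e-6)

The final check confirms numerically that the vectors found sequentially coincide (up to sign) with the top-M eigenvectors of S.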

It turns out that bm is also an eigenvector of S. More generally, the sets of eigenvectors of S and Ŝ are identical. Since both S and Ŝ are symmetric, we can find an ONB of eigenvectors (spectral theorem 4.15), i.e., there exist D distinct eigenvectors for both S and Ŝ. Next, we show that every eigenvector of S is an eigenvector of Ŝ. Assume we have already found eigenvectors b1, . . . , bm−1 of Ŝ. Consider an eigenvector bi of S, i.e., S bi = λi bi. In general,

Ŝ bi = (1/N) X̂ X̂⊤ bi = (1/N) (X − B_{m−1}X)(X − B_{m−1}X)⊤ bi   (10.19a)
     = (S − S B_{m−1} − B_{m−1} S + B_{m−1} S B_{m−1}) bi .   (10.19b)

We distinguish between two cases. If i ⩾ m, i.e., bi is an eigenvector that is not among the first m − 1 principal components, then bi is orthogonal to the first m − 1 principal components and B_{m−1} bi = 0. If i < m, i.e., bi is among the first m − 1 principal components, then bi is a basis vector of the principal subspace onto which B_{m−1} projects. Since b1, . . . , bm−1 are an ONB of this principal subspace, we obtain B_{m−1} bi = bi. The two cases can be summarized as follows:

B_{m−1} bi = bi if i < m ,   B_{m−1} bi = 0 if i ⩾ m .   (10.20)

In the case i ⩾ m, by using (10.20) in (10.19b), we obtain Ŝ bi = (S − B_{m−1}S) bi = S bi = λi bi, i.e., bi is also an eigenvector of Ŝ with eigenvalue λi. Specifically,

Ŝ bm = S bm = λm bm .   (10.21)

Equation (10.21) reveals that bm is not only an eigenvector of S but also of Ŝ. Specifically, λm is the largest eigenvalue of Ŝ and λm is the mth largest eigenvalue of S, and both have the associated eigenvector bm.
In the case i < m, by using (10.20) in (10.19b), we obtain

Ŝ bi = (S − S B_{m−1} − B_{m−1} S + B_{m−1} S B_{m−1}) bi = 0 = 0 bi .   (10.22)

This means that b1, . . . , bm−1 are also eigenvectors of Ŝ, but they are associated with eigenvalue 0 so that b1, . . . , bm−1 span the null space of Ŝ. Overall, every eigenvector of S is also an eigenvector of Ŝ. However, if the eigenvectors of S are part of the (m − 1)-dimensional principal subspace, then the associated eigenvalue of Ŝ is 0. This derivation shows that there is an intimate connection between the M-dimensional subspace with maximal variance and the eigenvalue decomposition; we will revisit this connection in Section 10.4.
With the relation (10.21) and bm⊤ bm = 1, the variance of the data projected onto the mth principal component is

Vm = bm⊤ S bm = λm bm⊤ bm = λm .   (10.23)

This means that the variance of the data, when projected onto an M-dimensional subspace, equals the sum of the eigenvalues that are associated with the corresponding eigenvectors of the data covariance matrix.

Example 10.2 (Eigenvalues of MNIST "8")
Taking all digits "8" in the MNIST training data, we compute the eigenvalues of the data covariance matrix. Figure 10.5(a) shows the 200 largest eigenvalues of the data covariance matrix. We see that only a few of them have a value that differs significantly from 0. Therefore, most of the variance, when projecting data onto the subspace spanned by the corresponding eigenvectors, is captured by only a few principal components, as shown in Figure 10.5(b).

Figure 10.5 Properties of the training data of MNIST "8". (a) Eigenvalues of the data covariance matrix of all digits "8" in the MNIST training set, sorted in descending order; (b) variance captured by the principal components associated with the largest eigenvalues.

Overall, to find an M-dimensional subspace of R^D that retains as much information as possible, PCA tells us to choose the columns of the matrix B in (10.3) as the M eigenvectors of the data covariance matrix S that are associated with the M largest eigenvalues. The maximum amount of variance PCA can capture with the first M principal components is

V_M = Σ_{m=1}^M λm ,   (10.24)

where the λm are the M largest eigenvalues of the data covariance matrix S. Consequently, the variance lost by data compression via PCA is

J_M := Σ_{j=M+1}^D λj = V_D − V_M .   (10.25)

Instead of these absolute quantities, we can define the relative variance captured as V_M/V_D, and the relative variance lost by compression as 1 − V_M/V_D.
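Given the eigenvalues of S, the quantities in (10.24) and (10.25) are one line each. A short sketch with made-up data:

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.standard_normal((6, 2000)) * np.array([[5, 3, 2, 1, 0.5, 0.1]]).T
    X = X - X.mean(axis=1, keepdims=True)

    S = (X @ X.T) / X.shape[1]
    lam = np.sort(np.linalg.eigvalsh(S))[::-1]   # eigenvalues, largest first

    M = 2
    V_M = lam[:M].sum()                          # variance captured, (10.24)
    J_M = lam[M:].sum()                          # variance lost, (10.25)
    print(f"relative variance captured: {V_M / lam.sum():.3f}")
    print(f"relative variance lost:     {1 - V_M / lam.sum():.3f}")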

10.3 Projection Perspective

In the following, we will derive PCA as an algorithm that directly minimizes the average reconstruction error. This perspective allows us to interpret PCA as implementing an optimal linear auto-encoder. We will draw heavily from Chapters 2 and 3.
In the previous section, we derived PCA by maximizing the variance in the projected space to retain as much information as possible. In the following, we will look at the difference vectors between the original data xn and their reconstruction x̃n and minimize this distance so that xn and x̃n are as close as possible. Figure 10.6 illustrates this setting.

Figure 10.6 Illustration of the projection approach: Find a subspace (line) that minimizes the length of the difference vector between projected (orange) and original (blue) data.

10.3.1 Setting and Objective

Assume an (ordered) orthonormal basis (ONB) B = (b1, . . . , bD) of R^D, i.e., bi⊤bj = 1 if and only if i = j and 0 otherwise.
From Section 2.5 we know that for a basis (b1, . . . , bD) of R^D any x ∈ R^D can be written as a linear combination of the basis vectors of R^D, i.e.,

x = Σ_{d=1}^D ζd bd = Σ_{m=1}^M ζm bm + Σ_{j=M+1}^D ζj bj   (10.26)

for suitable coordinates ζd ∈ R.
We are interested in finding vectors x̃ ∈ R^D, which live in a lower-dimensional subspace U ⊆ R^D, dim(U) = M, so that

x̃ = Σ_{m=1}^M zm bm ∈ U ⊆ R^D   (10.27)

is as similar to x as possible. Note that at this point we need to assume that the coordinates zm of x̃ and ζm of x are not identical. (Vectors x̃ ∈ U could, for example, be vectors on a plane in R³: the dimensionality of the plane is 2, but the vectors still have three coordinates with respect to the standard basis of R³.)
In the following, we use exactly this kind of representation of x̃ to find optimal coordinates z and basis vectors b1, . . . , bM such that x̃ is as similar to the original data point x as possible, i.e., we aim to minimize the (Euclidean) distance ∥x − x̃∥. Figure 10.7 illustrates this setting.

Figure 10.7 Simplified projection setting. (a) A vector x ∈ R² (red cross) shall be projected onto a one-dimensional subspace U ⊆ R² spanned by b. (b) The difference vectors x − x̃i between x and some candidates x̃i.

Without loss of generality, we assume that the dataset X = {x1, . . . , xN}, xn ∈ R^D, is centered at 0, i.e., E[X] = 0. Without the zero-mean assumption, we would arrive at exactly the same solution, but the notation would be substantially more cluttered.
We are interested in finding the best linear projection of X onto a lower-dimensional subspace U of R^D with dim(U) = M and orthonormal basis vectors b1, . . . , bM. We will call this subspace U the principal subspace. The projections of the data points are denoted by

x̃n := Σ_{m=1}^M zmn bm = B zn ∈ R^D ,   (10.28)

where zn := [z1n, . . . , zMn]⊤ ∈ R^M is the coordinate vector of x̃n with respect to the basis (b1, . . . , bM). More specifically, we are interested in having the x̃n as similar to xn as possible.
The similarity measure we use in the following is the squared distance (Euclidean norm) ∥x − x̃∥² between x and x̃. We therefore define our objective as minimizing the average squared Euclidean distance (reconstruction error) (Pearson, 1901)

J_M := (1/N) Σ_{n=1}^N ∥xn − x̃n∥² ,   (10.29)

where we make it explicit that the dimension of the subspace onto which we project the data is M. In order to find this optimal linear projection, we need to find the orthonormal basis of the principal subspace and the coordinates zn ∈ R^M of the projections with respect to this basis.
To find the coordinates zn and the ONB of the principal subspace, we follow a two-step approach. First, we optimize the coordinates zn for a given ONB (b1, . . . , bM); second, we find the optimal ONB.
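The objective (10.29) can be evaluated for any candidate orthonormal basis before we derive its minimizer. A minimal sketch, assuming centered toy data and a random (not yet optimal) ONB:

    import numpy as np

    def avg_squared_reconstruction_error(X, B):
        # X: (D, N) centered data as columns; B: (D, M) with orthonormal columns.
        X_tilde = B @ (B.T @ X)                             # projections x_tilde_n = B z_n
        return np.mean(np.sum((X - X_tilde) ** 2, axis=0))  # J_M in (10.29)

    rng = np.random.default_rng(5)
    X = rng.standard_normal((5, 300))
    X -= X.mean(axis=1, keepdims=True)
    B, _ = np.linalg.qr(rng.standard_normal((5, 2)))        # some orthonormal basis
    print(avg_squared_reconstruction_error(X, B))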

10.3.2 Finding Optimal Coordinates

Let us start by finding the optimal coordinates z1n, . . . , zMn of the projections x̃n for n = 1, . . . , N. Consider Figure 10.7(b), where the principal subspace is spanned by a single vector b. Geometrically speaking, finding the optimal coordinates z corresponds to finding the representation of the linear projection x̃ with respect to b that minimizes the distance between x̃ and x. From Figure 10.7(b), it is clear that this will be the orthogonal projection, and in the following we will show exactly this.
We assume an ONB (b1, . . . , bM) of U ⊆ R^D. To find the optimal coordinates zm with respect to this basis, we require the partial derivatives

∂J_M/∂zin = (∂J_M/∂x̃n)(∂x̃n/∂zin) ,   (10.30a)
∂J_M/∂x̃n = −(2/N)(xn − x̃n)⊤ ∈ R^{1×D} ,   (10.30b)
∂x̃n/∂zin = ∂/∂zin ( Σ_{m=1}^M zmn bm ) = bi   (10.30c)

for i = 1, . . . , M, such that we obtain

∂J_M/∂zin = −(2/N)(xn − x̃n)⊤ bi = −(2/N)( xn − Σ_{m=1}^M zmn bm )⊤ bi   (10.31a)
           = −(2/N)(xn⊤ bi − zin bi⊤ bi) = −(2/N)(xn⊤ bi − zin) ,   (10.31b)

since bi⊤ bi = 1 (ONB). Setting this partial derivative to 0 yields immediately the optimal coordinates

zin = xn⊤ bi = bi⊤ xn   (10.32)

for i = 1, . . . , M and n = 1, . . . , N. This means that the optimal coordinates zin of the projection x̃n are the coordinates of the orthogonal projection (see Section 3.8) of the original data point xn onto the one-dimensional subspace that is spanned by bi. Consequently:
The optimal linear projection x̃n of xn is an orthogonal projection.
The coordinates of x̃n with respect to the basis (b1, . . . , bM) are the coordinates of the orthogonal projection of xn onto the principal subspace.
An orthogonal projection is the best linear mapping given the objective (10.29).
The coordinates ζm of x in (10.26) and the coordinates zm of x̃ in (10.27) must be identical for m = 1, . . . , M since U⊥ = span[bM+1, . . . , bD] is the orthogonal complement (see Section 3.6) of U = span[b1, . . . , bM].

Figure 10.8 Optimal projection of a vector x ∈ R² onto a one-dimensional subspace (continuation from Figure 10.7). (a) Distances ∥x − x̃∥ for some x̃ = z1 b ∈ U = span[b]. (b) The vector x̃ that minimizes the distance in panel (a) is its orthogonal projection onto U. The coordinate of the projection x̃ with respect to the basis vector b that spans U is the factor we need to scale b in order to "reach" x̃.

Remark (Orthogonal Projections with Orthonormal Basis Vectors). Let us briefly recap orthogonal projections from Section 3.8. If (b1, . . . , bD) is an orthonormal basis of R^D, then

x̃ = bj (bj⊤ bj)^{-1} bj⊤ x = bj bj⊤ x ∈ R^D   (10.33)

is the orthogonal projection of x onto the subspace spanned by the jth basis vector, and zj = bj⊤ x is the coordinate of this projection with respect to the basis vector bj that spans that subspace since zj bj = x̃. Figure 10.8(b) illustrates this setting.
More generally, if we aim to project onto an M-dimensional subspace of R^D, we obtain the orthogonal projection of x onto the M-dimensional subspace with orthonormal basis vectors b1, . . . , bM as

x̃ = B (B⊤B)^{-1} B⊤ x = B B⊤ x ,   (10.34)

where we defined B := [b1, . . . , bM] ∈ R^{D×M} and used B⊤B = I. The coordinates of this projection with respect to the ordered basis (b1, . . . , bM) are z := B⊤x, as discussed in Section 3.8.
We can think of the coordinates as a representation of the projected vector in a new coordinate system defined by (b1, . . . , bM). Note that although x̃ ∈ R^D, we only need M coordinates z1, . . . , zM to represent this vector; the other D − M coordinates with respect to the basis vectors (bM+1, . . . , bD) are always 0. ♢
So far we have shown that for a given ONB we can find the optimal coordinates of x̃ by an orthogonal projection onto the principal subspace. In the following, we will determine what the best basis is.

10.3.3 Finding the Basis of the Principal Subspace

To determine the basis vectors b1, . . . , bM of the principal subspace, we rephrase the loss function (10.29) using the results we have so far. This will make it easier to find the basis vectors. To reformulate the loss function, we exploit our results from before and obtain

x̃n = Σ_{m=1}^M zmn bm = Σ_{m=1}^M (xn⊤ bm) bm .   (10.35)

We now exploit the symmetry of the dot product, which yields

x̃n = ( Σ_{m=1}^M bm bm⊤ ) xn .   (10.36)
Since we can generally write the original data point xn as a linear combination of all basis vectors, it holds that

xn = Σ_{d=1}^D zdn bd = Σ_{d=1}^D (xn⊤ bd) bd = ( Σ_{d=1}^D bd bd⊤ ) xn   (10.37a)
   = ( Σ_{m=1}^M bm bm⊤ ) xn + ( Σ_{j=M+1}^D bj bj⊤ ) xn ,   (10.37b)

where we split the sum with D terms into a sum over M and a sum over D − M terms. With this result, we find that the displacement vector xn − x̃n, i.e., the difference vector between the original data point and its projection, is

xn − x̃n = ( Σ_{j=M+1}^D bj bj⊤ ) xn   (10.38a)
         = Σ_{j=M+1}^D (xn⊤ bj) bj .   (10.38b)

This means the difference is exactly the projection of the data point onto the orthogonal complement of the principal subspace: We identify the matrix Σ_{j=M+1}^D bj bj⊤ in (10.38a) as the projection matrix that performs this projection. Hence the displacement vector xn − x̃n lies in the subspace that is orthogonal to the principal subspace, as illustrated in Figure 10.9.

Figure 10.9 Orthogonal projection and displacement vectors. When projecting data points xn (blue) onto subspace U1, we obtain x̃n (orange). The displacement vector x̃n − xn lies completely in the orthogonal complement U2 of U1.

Remark (Low-Rank Approximation). In (10.38a), we saw that the projection matrix, which projects x onto x̃, is given by

Σ_{m=1}^M bm bm⊤ = B B⊤ .   (10.39)

By construction as a sum of rank-one matrices bm bm⊤, we see that BB⊤ is symmetric and has rank M. Therefore, the average squared reconstruction error can also be written as

(1/N) Σ_{n=1}^N ∥xn − x̃n∥² = (1/N) Σ_{n=1}^N ∥xn − BB⊤xn∥²   (10.40a)
                           = (1/N) Σ_{n=1}^N ∥(I − BB⊤) xn∥² .   (10.40b)

Finding orthonormal basis vectors b1, . . . , bM, which minimize the difference between the original data xn and their projections x̃n, is equivalent to finding the best rank-M approximation BB⊤ of the identity matrix I (see Section 4.6). ♢
Now we have all the tools to reformulate the loss function (10.29):

J_M = (1/N) Σ_{n=1}^N ∥xn − x̃n∥² = (1/N) Σ_{n=1}^N ∥ Σ_{j=M+1}^D (bj⊤ xn) bj ∥² .   (10.41)

We now explicitly compute the squared norm and exploit the fact that the bj form an ONB, which yields

J_M = (1/N) Σ_{n=1}^N Σ_{j=M+1}^D (bj⊤ xn)² = (1/N) Σ_{n=1}^N Σ_{j=M+1}^D bj⊤ xn bj⊤ xn   (10.42a)
    = (1/N) Σ_{n=1}^N Σ_{j=M+1}^D bj⊤ xn xn⊤ bj ,   (10.42b)

where we exploited the symmetry of the dot product in the last step to write bj⊤ xn = xn⊤ bj. We now swap the sums and obtain

J_M = Σ_{j=M+1}^D bj⊤ ( (1/N) Σ_{n=1}^N xn xn⊤ ) bj = Σ_{j=M+1}^D bj⊤ S bj   (10.43a)
    = Σ_{j=M+1}^D tr(bj⊤ S bj) = Σ_{j=M+1}^D tr(S bj bj⊤) = tr( ( Σ_{j=M+1}^D bj bj⊤ ) S ) ,   (10.43b)

where we exploited the property that the trace operator tr(·) (see (4.18)) is linear and invariant to cyclic permutations of its arguments. Since we assumed that our dataset is centered, i.e., E[X] = 0, we identify S as the data covariance matrix. Since the projection matrix in (10.43b) is constructed as a sum of rank-one matrices bj bj⊤, it itself is of rank D − M.
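Equation (10.43a) can be verified numerically: the loss depends on the discarded directions only through S. The sketch below uses toy data and chooses the ONB as the eigenvectors of S (which anticipates the optimal choice derived next); it compares a direct evaluation of (10.29) with Σ_{j>M} bj⊤ S bj.

    import numpy as np

    rng = np.random.default_rng(7)
    D, N, M = 5, 1000, 2
    X = rng.standard_normal((D, N)) * np.arange(1, D + 1)[:, None]
    X -= X.mean(axis=1, keepdims=True)
    S = (X @ X.T) / N

    lam, V = np.linalg.eigh(S)               # ascending; columns of V are an ONB of eigenvectors
    lam, V = lam[::-1], V[:, ::-1]           # reorder so the largest eigenvalue comes first
    B, B_perp = V[:, :M], V[:, M:]           # principal subspace and its orthogonal complement

    J_direct = np.mean(np.sum((X - B @ B.T @ X) ** 2, axis=0))   # (10.29)
    J_trace = np.trace(B_perp.T @ S @ B_perp)                    # sum_j b_j^T S b_j, (10.43a)
    assert np.allclose(J_direct, J_trace)
    assert np.allclose(J_trace, lam[M:].sum())                   # equals the discarded eigenvalues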

Equation (10.43a) implies that we can formulate the average squared reconstruction error equivalently as the covariance matrix of the data, projected onto the orthogonal complement of the principal subspace. Minimizing the average squared reconstruction error is therefore equivalent to minimizing the variance of the data when projected onto the subspace we ignore, i.e., the orthogonal complement of the principal subspace. Equivalently, we maximize the variance of the projection that we retain in the principal subspace, which links the projection loss immediately to the maximum-variance formulation of PCA discussed in Section 10.2. But this then also means that we will obtain the same solution that we obtained for the maximum-variance perspective. Therefore, we omit a derivation that is identical to the one presented in Section 10.2 and summarize the results from earlier in the light of the projection perspective.
The average squared reconstruction error, when projecting onto the M-dimensional principal subspace, is

J_M = Σ_{j=M+1}^D λj ,   (10.44)

where λj are the eigenvalues of the data covariance matrix. Therefore, to minimize (10.44) we need to select the smallest D − M eigenvalues, which then implies that their corresponding eigenvectors are the basis of the orthogonal complement of the principal subspace. Consequently, this means that the basis of the principal subspace comprises the eigenvectors b1, . . . , bM that are associated with the largest M eigenvalues of the data covariance matrix.

Example 10.3 (MNIST Digits Embedding)
Figure 10.10 visualizes the training data of the MNIST digits "0" and "1" embedded in the vector subspace spanned by the first two principal components. We observe a relatively clear separation between "0"s (blue dots) and "1"s (orange dots), and we see the variation within each individual cluster. Four embeddings of the digits "0" and "1" in the principal subspace are highlighted in red with their corresponding original digit. The figure reveals that the variation within the set of "0" is significantly greater than the variation within the set of "1".

Figure 10.10 Embedding of MNIST digits 0 (blue) and 1 (orange) in a two-dimensional principal subspace using PCA. Four embeddings of the digits "0" and "1" in the principal subspace are highlighted in red with their corresponding original digit.

10.4 Eigenvector Computation and Low-Rank Approximations

In the previous sections, we obtained the basis of the principal subspace as the eigenvectors that are associated with the largest eigenvalues of the data covariance matrix

S = (1/N) Σ_{n=1}^N xn xn⊤ = (1/N) X X⊤ ,   (10.45)
X = [x1, . . . , xN] ∈ R^{D×N} .   (10.46)

Note that X is a D × N matrix, i.e., it is the transpose of the "typical" data matrix (Bishop, 2006; Murphy, 2012). To get the eigenvalues (and the corresponding eigenvectors) of S, we can follow two approaches:
We perform an eigendecomposition (see Section 4.2) and compute the eigenvalues and eigenvectors of S directly.
We use a singular value decomposition (see Section 4.5). Since S is symmetric and factorizes into XX⊤ (ignoring the factor 1/N), the eigenvalues of S are the squared singular values of X.
More specifically, the SVD of X is given by

X = U Σ V⊤ ,   (10.47)

where U ∈ R^{D×D} and V⊤ ∈ R^{N×N} are orthogonal matrices and Σ ∈ R^{D×N} is a matrix whose only nonzero entries are the singular values σii ⩾ 0. It then follows that

S = (1/N) X X⊤ = (1/N) U Σ V⊤ V Σ⊤ U⊤ = (1/N) U Σ Σ⊤ U⊤ ,   (10.48)

where we used V⊤V = I. With the results from Section 4.5, we get that the columns of U are the eigenvectors of XX⊤ (and therefore S). Furthermore, the eigenvalues λd of S are related to the singular values of X via

λd = σd²/N .   (10.49)

This relationship between the eigenvalues of S and the singular values of X provides the connection between the maximum variance view (Section 10.2) and the singular value decomposition.
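The relationship (10.49) between eigenvalues and singular values is easy to verify with np.linalg.eigvalsh and np.linalg.svd. A sketch with made-up centered data:

    import numpy as np

    rng = np.random.default_rng(8)
    D, N = 4, 50
    X = rng.standard_normal((D, N))
    X -= X.mean(axis=1, keepdims=True)

    S = (X @ X.T) / N
    lam = np.sort(np.linalg.eigvalsh(S))[::-1]      # eigenvalues of S, largest first

    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    assert np.allclose(lam, sigma ** 2 / N)         # lambda_d = sigma_d^2 / N, (10.49)
    # The columns of U are the eigenvectors of S (up to sign).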

10.4.1 PCA Using Low-Rank Matrix Approximations

To maximize the variance of the projected data (or minimize the average squared reconstruction error), PCA chooses the columns of U in (10.48) to be the eigenvectors that are associated with the M largest eigenvalues of the data covariance matrix S so that we identify U as the projection matrix B in (10.3), which projects the original data onto a lower-dimensional subspace of dimension M. The Eckart-Young theorem (Theorem 4.25 in Section 4.6) offers a direct way to estimate the low-dimensional representation. Consider the best rank-M approximation

X̃_M := argmin_{rk(A) ⩽ M} ∥X − A∥₂ ∈ R^{D×N}   (10.50)

of X, where ∥·∥₂ is the spectral norm defined in (4.93). The Eckart-Young theorem states that X̃_M is given by truncating the SVD at the top-M singular value. In other words, we obtain

X̃_M = U_M Σ_M V_M⊤ ∈ R^{D×N}   (10.51)

with orthogonal matrices U_M := [u1, . . . , uM] ∈ R^{D×M} and V_M := [v1, . . . , vM] ∈ R^{N×M} and a diagonal matrix Σ_M ∈ R^{M×M} whose diagonal entries are the M largest singular values of X.

10.4.2 Practical Aspects

Finding eigenvalues and eigenvectors is also important in other fundamental machine learning methods that require matrix decompositions. In theory, as we discussed in Section 4.2, we can solve for the eigenvalues as roots of the characteristic polynomial. However, for matrices larger than 4 × 4 this is not possible because we would need to find the roots of a polynomial of degree 5 or higher, and the Abel-Ruffini theorem (Ruffini, 1799; Abel, 1826) states that there exists no algebraic solution to this problem for polynomials of degree 5 or more. Therefore, in practice, we solve for eigenvalues or singular values using iterative methods, which are implemented in all modern packages for linear algebra (e.g., np.linalg.eigh or np.linalg.svd).
In many applications (such as PCA presented in this chapter), we only require a few eigenvectors. It would be wasteful to compute the full decomposition and then discard all eigenvectors with eigenvalues that are beyond the first few. It turns out that if we are interested in only the first few eigenvectors (with the largest eigenvalues), then iterative processes, which directly optimize these eigenvectors, are computationally more efficient than a full eigendecomposition (or SVD). In the extreme case of only needing the first eigenvector, a simple method called the power iteration is very efficient. Power iteration chooses a random vector x0 that is not in the null space of S and follows the iteration

x_{k+1} = S xk / ∥S xk∥ ,   k = 0, 1, . . . .   (10.52)

This means the vector xk is multiplied by S in every iteration and then normalized, i.e., we always have ∥xk∥ = 1. (If S is invertible, it is sufficient to ensure that x0 ≠ 0.) This sequence of vectors converges to the eigenvector associated with the largest eigenvalue of S. The original Google PageRank algorithm (Page et al., 1999) uses such an algorithm for ranking web pages based on their hyperlinks.
10.5 PCA in High Dimensions

In order to do PCA, we need to compute the data covariance matrix. In D dimensions, the data covariance matrix is a D × D matrix. Computing the eigenvalues and eigenvectors of this matrix is computationally expensive as it scales cubically in D. Therefore, PCA, as we discussed earlier, will be infeasible in very high dimensions. For example, if our xn are images with 10,000 pixels (e.g., 100 × 100 pixel images), we would need to compute the eigendecomposition of a 10,000 × 10,000 covariance matrix. In the following, we provide a solution to this problem for the case that we have substantially fewer data points than dimensions, i.e., N ≪ D.
Assume we have a centered dataset x1, . . . , xN, xn ∈ R^D. Then the data covariance matrix is given as

S = (1/N) X X⊤ ∈ R^{D×D} ,   (10.53)

where X = [x1, . . . , xN] is a D × N matrix whose columns are the data points.
We now assume that N ≪ D, i.e., the number of data points is smaller than the dimensionality of the data. If there are no duplicate data points, the rank of the covariance matrix S is N, so it has D − N + 1 many eigenvalues that are 0. Intuitively, this means that there are some redundancies. In the following, we will exploit this and turn the D × D covariance matrix into an N × N covariance matrix whose eigenvalues are all positive.
In PCA, we ended up with the eigenvector equation

S bm = λm bm ,   m = 1, . . . , M ,   (10.54)

where bm is a basis vector of the principal subspace. Let us rewrite this equation a bit: With S defined in (10.53), we obtain

S bm = (1/N) X X⊤ bm = λm bm .   (10.55)

We now multiply X⊤ ∈ R^{N×D} from the left-hand side, which yields

(1/N) X⊤ X X⊤ bm = λm X⊤ bm  ⇐⇒  (1/N) X⊤ X cm = λm cm ,   (10.56)

where we defined cm := X⊤ bm, and we get a new eigenvector/eigenvalue equation: λm remains an eigenvalue, which confirms our results from Section 4.5.3 that the nonzero eigenvalues of XX⊤ equal the nonzero eigenvalues of X⊤X. We obtain the eigenvector of the N × N matrix (1/N) X⊤ X associated with λm as cm := X⊤ bm. Assuming we have no duplicate data points, this matrix has rank N and is invertible. This also implies that (1/N) X⊤ X has the same (nonzero) eigenvalues as the data covariance matrix S. But this is now an N × N matrix, so that we can compute the eigenvalues and eigenvectors much more efficiently than for the original D × D data covariance matrix.
Now that we have the eigenvectors of (1/N) X⊤ X, we are going to recover the original eigenvectors, which we still need for PCA. Currently, we know the eigenvectors of (1/N) X⊤ X. If we left-multiply our eigenvalue/eigenvector equation with X, we get

(1/N) X X⊤ X cm = λm X cm   (10.57)

and we recover the data covariance matrix again. This now also means that we recover X cm as an eigenvector of S.
Remark. If we want to apply the PCA algorithm that we discussed in Section 10.6, we need to normalize the eigenvectors X cm of S so that they have norm 1. ♢

10.6 Key Steps of PCA in Practice

In the following, we will go through the individual steps of PCA using a running example, which is summarized in Figure 10.11. We are given a two-dimensional dataset (Figure 10.11(a)), and we want to use PCA to project it onto a one-dimensional subspace.

Figure 10.11 Steps of PCA. (a) Original dataset; (b) centering; (c) dividing by the standard deviation so that the data is unit free and has variance 1 along each axis; (d) computing the eigenvalues and eigenvectors (arrows) of the data covariance matrix (ellipse); (e) projecting the data onto the principal subspace; (f) undoing the standardization and moving the projected data back into the original data space.

1. Mean subtraction We start by centering the data by computing the mean µ of the dataset and subtracting it from every single data point. This ensures that the dataset has mean 0 (Figure 10.11(b)). Mean subtraction is not strictly necessary but reduces the risk of numerical problems.
2. Standardization Divide the data points by the standard deviation σd of the dataset for every dimension d = 1, . . . , D. Now the data is unit free, and it has variance 1 along each axis, which is indicated by the two arrows in Figure 10.11(c). This step completes the standardization of the data.
3. Eigendecomposition of the covariance matrix Compute the data covariance matrix and its eigenvalues and corresponding eigenvectors. Since the covariance matrix is symmetric, the spectral theorem (Theorem 4.15) states that we can find an ONB of eigenvectors. In Figure 10.11(d), the eigenvectors are scaled by the magnitude of the corresponding eigenvalue. The longer vector spans the principal subspace, which we denote by U. The data covariance matrix is represented by the ellipse.
4. Projection We can project any data point x∗ ∈ R^D onto the principal subspace: To get this right, we need to standardize x∗ using the mean µd and standard deviation σd of the training data in the dth dimension, respectively, so that

x∗^(d) ← (x∗^(d) − µd) / σd ,   d = 1, . . . , D ,   (10.58)

where x∗^(d) is the dth component of x∗. We obtain the projection as

x̃∗ = B B⊤ x∗   (10.59)

with coordinates

z∗ = B⊤ x∗   (10.60)

with respect to the basis of the principal subspace. Here, B is the matrix that contains the eigenvectors that are associated with the largest eigenvalues of the data covariance matrix as columns. PCA returns the coordinates (10.60), not the projections x̃∗.
Having standardized our dataset, (10.59) only yields the projections in the context of the standardized dataset. To obtain our projection in the original data space (i.e., before standardization), we need to undo the standardization (10.58) and multiply by the standard deviation before adding the mean so that we obtain

x̃∗^(d) ← x̃∗^(d) σd + µd ,   d = 1, . . . , D .   (10.61)

Figure 10.11(f) illustrates the projection in the original data space.
Example 10.4 (MNIST Digits: Reconstruction)
In the following, we will apply PCA to the MNIST digits dataset, which contains 60,000 examples of handwritten digits 0 through 9. Each digit is an image of size 28 × 28, i.e., it contains 784 pixels so that we can interpret every image in this dataset as a vector x ∈ R^784. Examples of these digits are shown in Figure 10.3.
For illustration purposes, we apply PCA to a subset of the MNIST digits, and we focus on the digit "8". We used 5,389 training images of the digit "8" and determined the principal subspace as detailed in this chapter. We then used the learned projection matrix to reconstruct a set of test images, which is illustrated in Figure 10.12. The first row of Figure 10.12 shows a set of four original digits from the test set. The following rows show reconstructions of exactly these digits when using a principal subspace of dimensions 1, 10, 100, and 500, respectively. We see that even with a single-dimensional principal subspace we get a halfway decent reconstruction of the original digits, which, however, is blurry and generic. With an increasing number of principal components (PCs), the reconstructions become sharper and more details are accounted for. With 500 principal components, we effectively obtain a near-perfect reconstruction. If we were to choose 784 PCs, we would recover the exact digit without any compression loss.

Figure 10.12 Effect of increasing the number of principal components on reconstruction: original test digits and their reconstructions using 1, 10, 100, and 500 principal components.

Figure 10.13 shows the average squared reconstruction error, which is

(1/N) Σ_{n=1}^N ∥xn − x̃n∥² = Σ_{i=M+1}^D λi ,   (10.62)

as a function of the number M of principal components. We can see that the importance of the principal components drops off rapidly, and only marginal gains can be achieved by adding more PCs. This matches exactly our observation in Figure 10.5, where we discovered that most of the variance of the projected data is captured by only a few principal components. With about 550 PCs, we can essentially fully reconstruct the training data that contains the digit "8" (some pixels around the boundaries show no variation across the dataset as they are always black).

Figure 10.13 Average squared reconstruction error as a function of the number of principal components. The average squared reconstruction error is the sum of the eigenvalues in the orthogonal complement of the principal subspace.

10.7 Latent Variable Perspective

In the previous sections, we derived PCA without any notion of a probabilistic model using the maximum-variance and the projection perspectives. On the one hand, this approach may be appealing as it allows us to sidestep all the mathematical difficulties that come with probability theory, but on the other hand, a probabilistic model would offer us more flexibility and useful insights. More specifically, a probabilistic model would
Come with a likelihood function, and we can explicitly deal with noisy observations (which we did not even discuss earlier)
Allow us to do Bayesian model comparison via the marginal likelihood as discussed in Section 8.6
View PCA as a generative model, which allows us to simulate new data

Allow us to make straightforward connections to related algorithms
Deal with data dimensions that are missing at random by applying Bayes' theorem
Give us a notion of the novelty of a new data point
Give us a principled way to extend the model, e.g., to a mixture of PCA models
Have the PCA we derived in earlier sections as a special case
Allow for a fully Bayesian treatment by marginalizing out the model parameters
By introducing a continuous-valued latent variable z ∈ R^M it is possible to phrase PCA as a probabilistic latent-variable model. Tipping and Bishop (1999) proposed this latent-variable model as probabilistic PCA (PPCA). PPCA addresses most of the aforementioned issues, and the PCA solution that we obtained by maximizing the variance in the projected space or by minimizing the reconstruction error is obtained as the special case of maximum likelihood estimation in a noise-free setting.

10.7.1 Generative Process and Probabilistic Model

In PPCA, we explicitly write down the probabilistic model for linear dimensionality reduction. For this we assume a continuous latent variable z ∈ R^M with a standard-normal prior p(z) = N(0, I) and a linear relationship between the latent variables and the observed data x, where

x = B z + µ + ϵ ∈ R^D ,   (10.63)

where ϵ ∼ N(0, σ²I) is Gaussian observation noise and B ∈ R^{D×M} and µ ∈ R^D describe the linear/affine mapping from latent to observed variables. Therefore, PPCA links latent and observed variables via

p(x | z, B, µ, σ²) = N(x | Bz + µ, σ²I) .   (10.64)

Overall, PPCA induces the following generative process:

zn ∼ N(z | 0, I) ,   (10.65)
xn | zn ∼ N(x | B zn + µ, σ²I) .   (10.66)

To generate a data point that is typical given the model parameters, we follow an ancestral sampling scheme: We first sample a latent variable zn from p(z). Then we use zn in (10.64) to sample a data point conditioned on the sampled zn, i.e., xn ∼ p(x | zn, B, µ, σ²).
This generative process allows us to write down the probabilistic model (i.e., the joint distribution of all random variables; see Section 8.4) as

p(x, z | B, µ, σ²) = p(x | z, B, µ, σ²) p(z) ,   (10.67)

which immediately gives rise to the graphical model in Figure 10.14 using the results from Section 8.5.

Figure 10.14 Graphical model for probabilistic PCA. The observations xn explicitly depend on corresponding latent variables zn ∼ N(0, I). The model parameters B, µ and the likelihood parameter σ are shared across the dataset.

Remark. Note the direction of the arrow that connects the latent variables z and the observed data x: The arrow points from z to x, which means that the PPCA model assumes a lower-dimensional latent cause z for high-dimensional observations x. In the end, we are obviously interested in finding something out about z given some observations. To get there we will apply Bayesian inference to "invert" the arrow implicitly and go from observations to latent variables. ♢

Example 10.5 (Generating New Data Using Latent Variables)
Figure 10.15 shows the latent coordinates of the MNIST digits "8" found by PCA when using a two-dimensional principal subspace (blue dots). We can query any vector z∗ in this latent space and generate an image x̃∗ = Bz∗ that resembles the digit "8". We show eight such generated images with their corresponding latent space representation. Depending on where we query the latent space, the generated images look different (shape, rotation, size, etc.). If we query away from the training data, we see more and more artifacts, e.g., the top-left and top-right digits. Note that the intrinsic dimensionality of these generated images is only two.

Figure 10.15 Generating new MNIST digits. The latent variables z can be used to generate new data x̃ = Bz. The closer we stay to the training data, the more realistic the generated data.
Draft (2024-01-15) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com. ©2024 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).
342 Dimensionality Reduction with Principal Component Analysis 10.8 Further Reading 343

The likelihood does


10.7.2 Likelihood and Joint Distribution rules of Gaussian conditioning from Section 6.5.1. The posterior distribu-
not depend on the Using the results from Chapter 6, we obtain the likelihood of this proba- tion of the latent variable given an observation x is then
latent variables z.
bilistic model by integrating out the latent variable z (see Section 8.4.3)

p(z | x) = N z | m, C , (10.73)
so that
Z m = B ⊤ (BB ⊤ + σ 2 I)−1 (x − µ) , (10.74)
p(x | B, µ, σ 2 ) = p(x | z, B, µ, σ 2 )p(z)dz (10.68a) C = I − B ⊤ (BB ⊤ + σ 2 I)−1 B . (10.75)
Note that the posterior covariance does not depend on the observed data
Z
= N x | Bz + µ, σ 2 I N z | 0, I dz .
 
(10.68b)
x. For a new observation x∗ in data space, we use (10.73) to determine
the posterior distribution of the corresponding latent variable z ∗ . The co-
From Section 6.5, we know that the solution to this integral is a Gaussian
variance matrix C allows us to assess how confident the embedding is. A
distribution with mean
covariance matrix C with a small determinant (which measures volumes)
Ex [x] = Ez [Bz + µ] + Eϵ [ϵ] = µ (10.69) tells us that the latent embedding z ∗ is fairly certain. If we obtain a pos-
terior distribution p(z ∗ | x∗ ) with much variance, we may be faced with
and with covariance matrix an outlier. However, we can explore this posterior distribution to under-
stand what other data points x are plausible under this posterior. To do
V[x] = Vz [Bz + µ] + Vϵ [ϵ] = Vz [Bz] + σ 2 I (10.70a)
this, we exploit the generative process underlying PPCA, which allows us
= B Vz [z]B ⊤ + σ 2 I = BB ⊤ + σ 2 I . (10.70b) to explore the posterior distribution on the latent variables by generating
new data that is plausible under this posterior:
The likelihood in (10.68b) can be used for maximum likelihood or MAP
estimation of the model parameters. 1. Sample a latent variable z ∗ ∼ p(z | x∗ ) from the posterior distribution
Remark. We cannot use the conditional distribution in (10.64) for maxi- over the latent variables (10.73).
mum likelihood estimation as it still depends on the latent variables. The 2. Sample a reconstructed vector x̃∗ ∼ p(x | z ∗ , B, µ, σ 2 ) from (10.64).
likelihood function we require for maximum likelihood (or MAP) estima- If we repeat this process many times, we can explore the posterior dis-
tion should only be a function of the data x and the model parameters, tribution (10.73) on the latent variables z ∗ and its implications on the
but must not depend on the latent variables. ♢ observed data. The sampling process effectively hypothesizes data, which
From Section 6.5, we know that a Gaussian random variable z and is plausible under the posterior distribution.
a linear/affine transformation x = Bz of it are jointly Gaussian
 dis-
tributed. We already know the marginals p(z) = N z | 0, I and p(x) =
N x | µ, BB ⊤ + σ 2 I . The missing cross-covariance is given as
 10.8 Further Reading
We derived PCA from two perspectives: (a) maximizing the variance in the
Cov[x, z] = Covz [Bz + µ] = B Covz [z, z] = B . (10.71) projected space; (b) minimizing the average reconstruction error. How-
Therefore, the probabilistic model of PPCA, i.e., the joint distribution of ever, PCA can also be interpreted from different perspectives. Let us recap
latent and observed random variables is explicitly given by what we have done: We took high-dimensional data x ∈ RD and used
a matrix B ⊤ to find a lower-dimensional representation z ∈ RM . The
BB ⊤ + σ 2 I B
     
x µ columns of B are the eigenvectors of the data covariance matrix S that are
p(x, z | B, µ, σ 2 ) = N , ⊤ , (10.72)
z 0 B I associated with the largest eigenvalues. Once we have a low-dimensional
representation z , we can get a high-dimensional version of it (in the orig-
with a mean vector of length D + M and a covariance matrix of size
inal data space) as x ≈ x̃ = Bz = BB ⊤ x ∈ RD , where BB ⊤ is a
(D + M ) × (D + M ). projection matrix.
We can also think of PCA as a linear auto-encoder as illustrated in Fig- auto-encoder
ure 10.16. An auto-encoder encodes the data xn ∈ RD to a code z n ∈ RM code
10.7.3 Posterior Distribution and decodes it to a x̃n similar to xn . The mapping from the data to the
The joint Gaussian distribution p(x, z | B, µ, σ 2 ) in (10.72) allows us to code is called the encoder, and the mapping from the code back to the orig- encoder
determine the posterior distribution p(z | x) immediately by applying the inal data space is called the decoder. If we consider linear mappings where decoder

Draft (2024-01-15) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com. ©2024 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).
344 Dimensionality Reduction with Principal Component Analysis 10.8 Further Reading 345

Figure 10.16 PCA Original jecting D-dimensional data onto an M -dimensional subspace, are
can be viewed as a
D
linear auto-encoder. R RD 1 X
N
It encodes the
Code µML = xn , (10.77)
high-dimensional M N n=1
⊤ R
data x into a B B B ML = T (Λ − σ 2 I) 2 R ,
1
(10.78)
lower-dimensional x z x̃
representation D
2 1 X
(code) z ∈ RM and σML = λj , (10.79)
decodes z using a D−M j=M +1
decoder. The
decoded vector x̃ is where T ∈ RD×M contains M eigenvectors of the data covariance matrix, The matrix Λ − σ 2 I
the orthogonal Encoder Decoder Λ = diag(λ1 , . . . , λM ) ∈ RM ×M is a diagonal matrix with the eigenvalues in (10.78) is
projection of the guaranteed to be
original data x onto
associated with the principal axes on its diagonal, and R ∈ RM ×M is
positive semidefinite
the M -dimensional an arbitrary orthogonal matrix. The maximum likelihood solution B ML is as the smallest
principal subspace. the code is given by z n = B ⊤ xn ∈ RM and we are interested in minimiz- unique up to an arbitrary orthogonal transformation, e.g., we can right- eigenvalue of the
ing the average squared error between the data xn and its reconstruction multiply B ML with any rotation matrix R so that (10.78) essentially is a data covariance
matrix is bounded
x̃n = Bz n , n = 1, . . . , N , we obtain singular value decomposition (see Section 4.5). An outline of the proof is
from below by the
given by Tipping and Bishop (1999). noise variance σ 2 .
The maximum likelihood estimate for µ given in (10.77) is the sample
N N
1 X 1 X 2 mean of the data. The maximum likelihood estimator for the observation
∥xn − x̃n ∥2 = xn − BB ⊤ xn . (10.76)
N n=1 N n=1 noise variance σ 2 given in (10.79) is the average variance in the orthog-
onal complement of the principal subspace, i.e., the average leftover vari-
ance that we cannot capture with the first M principal components is
This means we end up with the same objective function as in (10.29) that treated as observation noise.
we discussed in Section 10.3 so that we obtain the PCA solution when we In the noise-free limit where σ → 0, PPCA and PCA provide identical
minimize the squared auto-encoding loss. If we replace the linear map- solutions: Since the data covariance matrix S is symmetric, it can be di-
ping of PCA with a nonlinear mapping, we get a nonlinear auto-encoder. agonalized (see Section 4.4), i.e., there exists a matrix T of eigenvectors
A prominent example of this is a deep auto-encoder where the linear func- of S so that
tions are replaced with deep neural networks. In this context, the encoder
S = T ΛT −1 . (10.80)
recognition network is also known as a recognition network or inference network, whereas the
inference network decoder is also called a generator. In the PPCA model, the data covariance matrix is the covariance matrix of
generator
Another interpretation of PCA is related to information theory. We can the Gaussian likelihood p(x | B, µ, σ 2 ), which is BB ⊤ +σ 2 I , see (10.70b).
think of the code as a smaller or compressed version of the original data For σ → 0, we obtain BB ⊤ so that this data covariance must equal the
point. When we reconstruct our original data using the code, we do not PCA data covariance (and its factorization given in (10.80)) so that
get the exact data point back, but a slightly distorted or noisy version 1
Cov[X ] = T ΛT −1 = BB ⊤ ⇐⇒ B = T Λ 2 R , (10.81)
The code is a of it. This means that our compression is “lossy”. Intuitively, we want
compressed version to maximize the correlation between the original data and the lower- i.e., we obtain the maximum likelihood estimate in (10.78) for σ = 0.
of the original data.
dimensional code. More formally, this is related to the mutual information. From (10.78) and (10.80), it becomes clear that (P)PCA performs a de-
We would then get the same solution to PCA we discussed in Section 10.3 composition of the data covariance matrix.
by maximizing the mutual information, a core concept in information the- In a streaming setting, where data arrives sequentially, it is recom-
ory (MacKay, 2003). mended to use the iterative expectation maximization (EM) algorithm for
In our discussion on PPCA, we assumed that the parameters of the maximum likelihood estimation (Roweis, 1998).
model, i.e., B, µ, and the likelihood parameter σ 2 , are known. Tipping To determine the dimensionality of the latent variables (the length of
and Bishop (1999) describe how to derive maximum likelihood estimates the code, the dimensionality of the lower-dimensional subspace onto which
for these parameters in the PPCA setting (note that we use a different we project the data), Gavish and Donoho (2014) suggest the heuristic
notation in this chapter). The maximum likelihood parameters, when pro- that, if we can estimate the noise variance σ 2 of the data, we should

Draft (2024-01-15) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com. ©2024 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).
346 Dimensionality Reduction with Principal Component Analysis 10.8 Further Reading 347

4σ√ D to require non-Gaussian priors p(z). We refer to the books by Hyvarinen
discard all singular values smaller than .
Alternatively, we can use
3
(nested) cross-validation (Section 8.6.1) or Bayesian model selection cri- et al. (2001) and Murphy (2012) for more details on ICA.
teria (discussed in Section 8.6.2) to determine a good estimate of the PCA, factor analysis, and ICA are three examples for dimensionality re-
intrinsic dimensionality of the data (Minka, 2001b). duction with linear models. Cunningham and Ghahramani (2015) provide
a broader survey of linear dimensionality reduction.
Similar to our discussion on linear regression in Chapter 9, we can place The (P)PCA model we discussed here allows for several important ex-
a prior distribution on the parameters of the model and integrate them tensions. In Section 10.5, we explained how to do PCA when the in-
out. By doing so, we (a) avoid point estimates of the parameters and the put dimensionality D is significantly greater than the number N of data
issues that come with these point estimates (see Section 8.6) and (b) al- points. By exploiting the insight that PCA can be performed by computing
low for an automatic selection of the appropriate dimensionality M of the (many) inner products, this idea can be pushed to the extreme by consid-
Bayesian PCA latent space. In this Bayesian PCA, which was proposed by Bishop (1999), ering infinite-dimensional features. The kernel trick is the basis of kernel kernel trick
a prior p(µ, B, σ 2 ) is placed on the model parameters. The generative PCA and allows us to implicitly compute inner products between infinite- kernel PCA
process allows us to integrate the model parameters out instead of condi- dimensional features (Schölkopf et al., 1998; Schölkopf and Smola, 2002).
tioning on them, which addresses overfitting issues. Since this integration There are nonlinear dimensionality reduction techniques that are de-
is analytically intractable, Bishop (1999) proposes to use approximate in- rived from PCA (Burges (2010) provides a good overview). The auto-
ference methods, such as MCMC or variational inference. We refer to the encoder perspective of PCA that we discussed previously in this section
work by Gilks et al. (1996) and Blei et al. (2017) for more details on these can be used to render PCA as a special case of a deep auto-encoder. In the deep auto-encoder
approximate inference techniques. deep auto-encoder, both the encoder and the decoder are represented by
In PPCA, we considered the linear model p(xn | z n ) = N xn | Bz n + multilayer feedforward neural networks, which themselves are nonlinear
  mappings. If we set the activation functions in these neural networks to be
µ, σ 2 I with prior p(z n ) = N 0, I , where all observation dimensions
the identity, the model becomes equivalent to PCA. A different approach to
are affected by the same amount of noise. If we allow each observation
nonlinear dimensionality reduction is the Gaussian process latent-variable Gaussian process
factor analysis dimension d to have a different variance σd2 , we obtain factor analysis latent-variable
model (GP-LVM) proposed by Lawrence (2005). The GP-LVM starts off with
(FA) (Spearman, 1904; Bartholomew et al., 2011). This means that FA model
the latent-variable perspective that we used to derive PPCA and replaces
gives the likelihood some more flexibility than PPCA, but still forces the GP-LVM
the linear relationship between the latent variables z and the observations
An overly flexible data to be explained by the model parameters B, µ.However, FA no
likelihood would be longer allows for a closed-form maximum likelihood solution so that we
x with a Gaussian process (GP). Instead of estimating the parameters of
able to explain more the mapping (as we do in PPCA), the GP-LVM marginalizes out the model
need to use an iterative scheme, such as the expectation maximization
than just the noise. parameters and makes point estimates of the latent variables z . Similar
algorithm, to estimate the model parameters. While in PPCA all station-
to Bayesian PCA, the Bayesian GP-LVM proposed by Titsias and Lawrence Bayesian GP-LVM
ary points are global optima, this no longer holds for FA. Compared to
(2010) maintains a distribution on the latent variables z and uses approx-
PPCA, FA does not change if we scale the data, but it does return different
imate inference to integrate them out as well.
solutions if we rotate the data.
independent An algorithm that is also closely related to PCA is independent com-
component analysis ponent analysis (ICA (Hyvarinen et al., 2001)). Starting again with the
ICA latent-variable perspective p(xn | z n ) = N xn | Bz n + µ, σ 2 I we now
change the prior on z n to non-Gaussian distributions. ICA can be used
blind-source for blind-source separation. Imagine you are in a busy train station with
separation many people talking. Your ears play the role of microphones, and they
linearly mix different speech signals in the train station. The goal of blind-
source separation is to identify the constituent parts of the mixed signals.
As discussed previously in the context of maximum likelihood estimation
for PPCA, the original PCA solution is invariant to any rotation. Therefore,
PCA can identify the best lower-dimensional subspace in which the sig-
nals live, but not the signals themselves (Murphy, 2012). ICA addresses
this issue by modifying the prior distribution p(z) on the latent sources

Draft (2024-01-15) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com. ©2024 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).
11.1 Gaussian Mixture Model 349

In practice, the Gaussian (or similarly all other distributions we encoun-


tered so far) have limited modeling capabilities. For example, a Gaussian
11 approximation of the density that generated the data in Figure 11.1 would
be a poor approximation. In the following, we will look at a more ex-
pressive family of distributions, which we can use for density estimation:
Density Estimation with Gaussian Mixture mixture models. mixture model

Models Mixture models can be used to describe a distribution p(x) by a convex


combination of K simple (base) distributions
K
X
p(x) = πk pk (x) (11.1)
In earlier chapters, we covered already two fundamental problems in k=1
machine learning: regression (Chapter 9) and dimensionality reduction K
X
(Chapter 10). In this chapter, we will have a look at a third pillar of ma- 0 ⩽ πk ⩽ 1 , πk = 1 , (11.2)
chine learning: density estimation. On our journey, we introduce impor- k=1
tant concepts, such as the expectation maximization (EM) algorithm and where the components pk are members of a family of basic distributions,
a latent variable perspective of density estimation with mixture models. e.g., Gaussians, Bernoullis, or Gammas, and the πk are mixture weights. mixture weight
When we apply machine learning to data we often aim to represent Mixture models are more expressive than the corresponding base distri-
data in some way. A straightforward way is to take the data points them- butions because they allow for multimodal data representations, i.e., they
selves as the representation of the data; see Figure 11.1 for an example. can describe datasets with multiple “clusters”, such as the example in Fig-
However, this approach may be unhelpful if the dataset is huge or if we ure 11.1.
are interested in representing characteristics of the data. In density esti- We will focus on Gaussian mixture models (GMMs), where the basic
mation, we represent the data compactly using a density from a paramet- distributions are Gaussians. For a given dataset, we aim to maximize the
ric family, e.g., a Gaussian or Beta distribution. For example, we may be likelihood of the model parameters to train the GMM. For this purpose,
looking for the mean and variance of a dataset in order to represent the we will use results from Chapter 5, Chapter 6, and Section 7.2. However,
data compactly using a Gaussian distribution. The mean and variance can unlike other applications we discussed earlier (linear regression or PCA),
be found using tools we discussed in Section 8.3: maximum likelihood or we will not find a closed-form maximum likelihood solution. Instead, we
maximum a posteriori estimation. We can then use the mean and variance will arrive at a set of dependent simultaneous equations, which we can
of this Gaussian to represent the distribution underlying the data, i.e., we only solve iteratively.
think of the dataset to be a typical realization from this distribution if we
were to sample from it.
11.1 Gaussian Mixture Model
Figure 11.1 A Gaussian mixture model is a density model where
Two-dimensional  we combine a finite Gaussian mixture

dataset that cannot 4 number of K Gaussian distributions N x | µk , Σk so that model

be meaningfully K
represented by a 2
X 
p(x | θ) = πk N x | µk , Σk (11.3)
Gaussian.
k=1
0
x2

K
X
0 ⩽ πk ⩽ 1 , πk = 1 , (11.4)
−2 k=1

where we defined θ := {µk , Σk , πk : k = 1, . . . , K} as the collection of


−4
all parameters of the model. This convex combination of Gaussian distri-
bution gives us significantly more flexibility for modeling complex densi-
−5 0 5 ties than a simple Gaussian distribution (which we recover from (11.3) for
x1
K = 1). An illustration is given in Figure 11.2, displaying the weighted
348
©2024 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).
This material is published by Cambridge University Press as Mathematics for Machine Learning by
Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view
and download for personal use only. Not for re-distribution, re-sale, or use in derivative works.
©by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2024. https://mml-book.com.
350 Density Estimation with Gaussian Mixture Models 11.2 Parameter Learning via Maximum Likelihood 351

Figure 11.2 0.30


Component 1
Gaussian mixture Component 2 We consider a one-dimensional dataset X = {−3, −2.5, −1, 0, 2, 4, 5}
model. The 0.25 Component 3
consisting of seven data points and wish to find a GMM with K = 3
Gaussian mixture GMM density
distribution (black) 0.20 components that models the density of the data. We initialize the mixture
is composed of a components as
p(x)

0.15
convex combination 
of Gaussian p1 (x) = N x | − 4, 1 (11.6)
0.10
distributions and is 
p2 (x) = N x | 0, 0.2 (11.7)
more expressive 0.05 
than any individual p3 (x) = N x | 8, 3 (11.8)
component. Dashed 0.00
lines represent the −4 −2 0 2 4 6 8 and assign them equal weights π1 = π2 = π3 = 13 . The corresponding
weighted Gaussian x model (and the data points) are shown in Figure 11.3.
components.

components and the mixture density, which is given as In the following, we detail how to obtain a maximum likelihood esti-
p(x | θ) = 0.5N x | − 2, 12 + 0.2N x | 1, 2 + 0.3N x | 4, 1 . (11.5) mate θ ML of the model parameters θ . We start by writing down the like-
  
lihood, i.e., the predictive distribution of the training data given the pa-
rameters. We exploit our i.i.d. assumption, which leads to the factorized
likelihood
11.2 Parameter Learning via Maximum Likelihood N
Y K
X 
p(X | θ) = p(xn | θ) , p(xn | θ) = πk N xn | µk , Σk , (11.9)
Assume we are given a dataset X = {x1 , . . . , xN }, where xn , n = n=1 k=1
1, . . . , N , are drawn i.i.d. from an unknown distribution p(x). Our ob-
jective is to find a good approximation/representation of this unknown where every individual likelihood term p(xn | θ) is a Gaussian mixture
distribution p(x) by means of a GMM with K mixture components. The density. Then we obtain the log-likelihood as
parameters of the GMM are the K means µk , the covariances Σk , and N
X N
X K
X 
mixture weights πk . We summarize all these free parameters in θ := log p(X | θ) = log p(xn | θ) = log πk N xn | µk , Σk . (11.10)
{πk , µk , Σk : k = 1, . . . , K}. n=1 n=1
|
k=1
{z }
=:L

Example 11.1 (Initial Setting) We aim to find parameters θ ∗ML that maximize the log-likelihood L defined
in (11.10). Our “normal” procedure would be to compute the gradient
dL/dθ of the log-likelihood with respect to the model parameters θ , set
Figure 11.3 Initial 0.30
it to 0, and solve for θ . However, unlike our previous examples for max-
π1 N (x|µ1 , σ12 )
setting: GMM π2 N (x|µ2 , σ22 ) imum likelihood estimation (e.g., when we discussed linear regression in
0.25
(black) with π3 N (x|µ3 , σ32 ) Section 9.2), we cannot obtain a closed-form solution. However, we can
GMM density
mixture three 0.20 exploit an iterative scheme to find good model parameters θ ML , which will
mixture components
turn out to be the EM algorithm for GMMs. The key idea is to update one
p(x)

(dashed) and seven 0.15


data points (discs). model parameter at a time while keeping the others fixed.
0.10
Remark. If we were to consider a single Gaussian as the desired density,
0.05
the sum over k in (11.10) vanishes, and the log can be applied directly to
0.00 the Gaussian component, such that we get
−5 0 5 10 15
log N x | µ, Σ = − D2 log(2π) − 12 log det(Σ) − 12 (x − µ)⊤ Σ−1 (x − µ).

x

(11.11)
Throughout this chapter, we will have a simple running example that
helps us illustrate and visualize important concepts. This simple form allows us to find closed-form maximum likelihood esti-
mates of µ and Σ, as discussed in Chapter 8. In (11.10), we cannot move

Draft (2024-01-15) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com. ©2024 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).
352 Density Estimation with Gaussian Mixture Models 11.2 Parameter Learning via Maximum Likelihood 353
P
the log into the sum over k so that we cannot obtain a simple closed-form k rnk = 1 with rnk ⩾ 0. This probability vector distributes probabil-
maximum likelihood solution. ♢ ity mass among the K mixture components, and we can think of r n as a
Any local optimum of a function exhibits the property that its gradi- “soft assignment” of xn to the K mixture components. Therefore, the re- The responsibility
sponsibility rnk from (11.17) represents the probability that xn has been rnk is the
ent with respect to the parameters must vanish (necessary condition); see probability that the
Chapter 7. In our case, we obtain the following necessary conditions when generated by the k th mixture component.
kth mixture
we optimize the log-likelihood in (11.10) with respect to the GMM param- component
eters µk , Σk , πk : Example 11.2 (Responsibilities) generated the nth
data point.
N For our example from Figure 11.3, we compute the responsibilities rnk
∂L X ∂ log p(xn | θ)
= 0⊤ ⇐⇒ = 0⊤ , (11.12) 
1.0 0.0 0.0

∂µk n=1
∂µk
 1.0 0.0 0.0 
N
∂ log p(xn | θ)
 
∂L X 0.057 0.943 0.0 
= 0 ⇐⇒ = 0, (11.13) 0.001 0.999 0.0  ∈ RN ×K .
 
∂Σk ∂Σk (11.19)
n=1  
 0.0 0.066 0.934
N
∂L ∂ log p(xn | θ)
 
X  0.0 0.0 1.0 
= 0 ⇐⇒ = 0. (11.14)
∂πk n=1
∂πk 0.0 0.0 1.0
For all three necessary conditions, by applying the chain rule (see Sec- Here the nth row tells us the responsibilities of all mixture components
tion 5.2.2), we require partial derivatives of the form for xn . The sum of all K responsibilities for a data point (sum of every
row) is 1. The k th column gives us an overview of the responsibility of
∂ log p(xn | θ) 1 ∂p(xn | θ)
= , (11.15) the k th mixture component. We can see that the third mixture component
∂θ p(xn | θ) ∂θ
(third column) is not responsible for any of the first four data points, but
where θ = {µk , Σk , πk , k = 1, . . . , K} are the model parameters and takes much responsibility of the remaining data points. The sum of all
1 1 entries of a column gives us the values Nk , i.e., the total responsibility of
= PK . (11.16) the k th mixture component. In our example, we get N1 = 2.058, N2 =
p(xn | θ) π N xn | µj , Σj
j=1 j 2.008, N3 = 2.934.
In the following, we will compute the partial derivatives (11.12) through
(11.14). But before we do this, we introduce a quantity that will play a In the following, we determine the updates of the model parameters
central role in the remainder of this chapter: responsibilities. µk , Σk , πk for given responsibilities. We will see that the update equa-
tions all depend on the responsibilities, which makes a closed-form solu-
tion to the maximum likelihood estimation problem impossible. However,
11.2.1 Responsibilities for given responsibilities we will be updating one model parameter at a
We define the quantity time, while keeping the others fixed. After this, we will recompute the
 responsibilities. Iterating these two steps will eventually converge to a lo-
πk N xn | µk , Σk
rnk := PK  (11.17) cal optimum and is a specific instantiation of the EM algorithm. We will
j=1 πj N xn | µj , Σj discuss this in some more detail in Section 11.3.
responsibility as the responsibility of the k th mixture component for the nth data point.
The responsibility rnk of the k th mixture component for data point xn is
11.2.2 Updating the Means
proportional to the likelihood
 Theorem 11.1 (Update of the GMM Means). The update of the mean pa-
p(xn | πk , µk , Σk ) = πk N xn | µk , Σk (11.18) rameters µk , k = 1, . . . , K , of the GMM is given by
r n follows a of the mixture component given the data point. Therefore, mixture com- PN
n=1 rnk xn
Boltzmann/Gibbs ponents have a high responsibility for a data point when the data point µnew
k = P N
, (11.20)
distribution.
could be a plausible sample from that mixture component. Note that n=1 rnk

r n := [rn1 , . . . , rnK ]⊤ ∈ RK is a (normalized) probability vector, i.e., where the responsibilities rnk are defined in (11.17).

Draft (2024-01-15) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com. ©2024 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).
354 Density Estimation with Gaussian Mixture Models 11.2 Parameter Learning via Maximum Likelihood 355

Remark. The update of the means µk of the individual mixture compo- Therefore, the mean µk is pulled toward a data point xn with strength Figure 11.4 Update
nents in (11.20) depends on all means, covariance matrices Σk , and mix- given by rnk . The means are pulled stronger toward data points for which of the mean
parameter of
ture weights πk via rnk given in (11.17). Therefore, we cannot obtain a the corresponding mixture component has a high responsibility, i.e., a high
mixture component
closed-form solution for all µk at once. ♢ likelihood. Figure 11.4 illustrates this. We can also interpret the mean up- in a GMM. The
date in (11.20) as the expected value of all data points under the distri- mean µ is being
Proof From (11.15), we see that the gradient of the log-likelihood with pulled toward
bution given by
respect to the mean parameters µk , k = 1, . . . , K , requires us to compute individual data
the partial derivative r k := [r1k , . . . , rN k ]⊤ /Nk , (11.25) points with the
weights given by the
K which is a normalized probability vector, i.e.,
 
∂p(xn | θ) X ∂N xn | µj , Σj ∂N xn | µk , Σk corresponding
= πj = πk (11.21a) responsibilities.
∂µk j=1
∂µk ∂µk µk ← Erk [X ] . (11.26)
x2 x3
= πk (xn − µk )⊤ Σ−1

k N xn | µk , Σk , (11.21b) r2
Example 11.3 (Mean Updates) r1 r3
where we exploited that only the k th mixture component depends on µk . x1
µ
We use our result from (11.21b) in (11.15) and put everything together
Figure 11.5 Effect
so that the desired partial derivative of L with respect to µk is given as 0.30 π1 N (x|µ1 , σ12 )
0.30 π1 N (x|µ1 , σ12 ) of updating the
π2 N (x|µ2 , σ22 ) π2 N (x|µ2 , σ22 )
N N 0.25 mean values in a
∂L X∂ log p(xn | θ) X 1 ∂p(xn | θ) π3 N (x|µ3 , σ32 ) 0.25 π3 N (x|µ3 , σ32 )

GMM. (a) GMM


= = (11.22a) 0.20
GMM density GMM density

∂µk n=1
∂µk n=1
p(xn | θ) ∂µk 0.20
before updating the

p(x)

p(x)
0.15 0.15 mean values;
N 
X πk N xn | µk , Σk 0.10 0.10 (b) GMM after
= (xn − µk )⊤ Σ−1
k PK  (11.22b) 0.05 0.05 updating the mean
n=1 j=1 πj N xn | µj , Σj values µk while
0.00 0.00
| {z } −5 0 5 10 15 −5 0 5 10 15 retaining the
=rnk x x variances and
N
X (a) GMM density and individual components (b) GMM density and individual components mixture weights.
= rnk (xn − µk )⊤ Σ−1
k . (11.22c) prior to updating the mean values. after updating the mean values.
n=1
In our example from Figure 11.3, the mean values are updated as fol-
Here we used the identity from (11.16) and the result of the partial deriva- lows:
tive in (11.21b) to get to (11.22b). The values rnk are the responsibilities
we defined in (11.17). µ1 : −4 → −2.7 (11.27)
∂L(µnew )
We now solve (11.22c) for µnew
k so that ∂µk = 0⊤ and obtain µ2 : 0 → −0.4 (11.28)
k

N N PN N
µ3 : 8 → 3.7 (11.29)
X X rnk xn 1 X
rnk xn = rnk µnew
k ⇐⇒ µnew
k
n=1
= P = rnk xn , Here we see that the means of the first and third mixture component
n=1 n=1
N
rnk Nk n=1 move toward the regime of the data, whereas the mean of the second
n=1
(11.23) component does not change so dramatically. Figure 11.5 illustrates this
change, where Figure 11.5(a) shows the GMM density prior to updating
where we defined
the means and Figure 11.5(b) shows the GMM density after updating the
N
X mean values µk .
Nk := rnk (11.24)
n=1
The update of the mean parameters in (11.20) look fairly straight-
as the total responsibility of the k th mixture component for the entire
forward. However, note that the responsibilities rnk are a function of
dataset. This concludes the proof of Theorem 11.1.
πj , µj , Σj for all j = 1, . . . , K , such that the updates in (11.20) depend
Intuitively, (11.20) can be interpreted as an importance-weighted Monte on all parameters of the GMM, and a closed-form solution, which we ob-
Carlo estimate of the mean, where the importance weights of data point tained for linear regression in Section 9.2 or PCA in Chapter 10, cannot
xn are the responsibilities rnk of the k th cluster for xn , k = 1, . . . , K . be obtained.

Draft (2024-01-15) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com. ©2024 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).
356 Density Estimation with Gaussian Mixture Models 11.2 Parameter Learning via Maximum Likelihood 357

11.2.3 Updating the Covariances with respect to Σk is given by


Theorem 11.2 (Updates of the GMM Covariances). The update of the co- ∂L
N
X ∂ log p(xn | θ) X 1
N
∂p(xn | θ)
variance parameters Σk , k = 1, . . . , K of the GMM is given by = = (11.36a)
∂Σk n=1
∂Σk n=1
p(xn | θ) ∂Σk
N
1 X N 
Σnew = rnk (xn − µk )(xn − µk )⊤ , (11.30)
X πk N xn | µk , Σk
k
Nk n=1 = PK 
n=1 j=1 πj N xn | µj , Σj
| {z }
where rnk and Nk are defined in (11.17) and (11.24), respectively. =rnk

· − 2 (Σk − Σ−1
 1 −1 ⊤ −1

k (xn − µk )(xn − µk ) Σk ) (11.36b)
Proof To prove Theorem 11.2, our approach is to compute the partial
N
derivatives of the log-likelihood L with respect to the covariances Σk , set 1X
them to 0, and solve for Σk . We start with our general approach
=− rnk (Σ−1 −1 ⊤ −1
k − Σk (xn − µk )(xn − µk ) Σk ) (11.36c)
2 n=1
N N N N
!
∂L X ∂ log p(xn | θ) 1 ∂p(xn | θ) 1 X 1 X
= − Σ−1 rnk + Σ−1 rnk (xn − µk )(xn − µk )⊤ Σ−1
X
= = . (11.31) k .
∂Σk n=1
∂Σ k n=1
p(xn | θ) ∂Σk 2 k n=1 2 k n=1
| {z }
=Nk
We already know 1/p(xn | θ) from (11.16). To obtain the remaining par-
tial derivative ∂p(xn | θ)/∂Σk , we write down the definition of the Gaus- (11.36d)
sian distribution p(xn | θ) (see (11.9)) and drop all terms but the k th. We We see that the responsibilities rnk also appear in this partial derivative.
then obtain Setting this partial derivative to 0, we obtain the necessary optimality
∂p(xn | θ) condition
(11.32a) N
!
∂Σk X


D 1
 Nk Σ−1
k = Σ −1
k rnk (xn − µk )(xn − µk )⊤
Σ−1
k (11.37a)
πk (2π)− 2 det(Σk )− 2 exp − 21 (xn − µk )⊤ Σ−1

= k (xn − µk ) n=1
∂Σk N
!
(11.32b)
X
 ⇐⇒ Nk I = rnk (xn − µk )(xn − µk )⊤ Σ−1
k . (11.37b)

D ∂ −
1
1 ⊤ −1
 n=1
= πk (2π) 2 det(Σk ) exp − 2 (xn − µk ) Σk (xn − µk )
2
∂Σk By solving for Σk , we obtain

1 ∂
+ det(Σk )− 2 exp − 12 (xn − µk )⊤ Σ−1
 N
k (xn − µk ) . (11.32c) 1 X
∂Σk Σnew
k = rnk (xn − µk )(xn − µk )⊤ , (11.38)
Nk n=1
We now use the identities
where r k is the probability vector defined in (11.25). This gives us a sim-
∂ 1 1
(5.101) 1
det(Σk )− 2 = − det(Σk )− 2 Σ−1
k , (11.33) ple update rule for Σk for k = 1, . . . , K and proves Theorem 11.2.
∂Σk 2
∂ (5.103) Similar to the update of µk in (11.20), we can interpret the update of
(xn − µk )⊤ Σ−1 −1 ⊤ −1
k (xn − µk ) = −Σk (xn − µk )(xn − µk ) Σk the covariance in (11.30) as an importance-weighted expected value of
∂Σk
(11.34) the square of the centered data X̃k := {x1 − µk , . . . , xN − µk }.

and obtain (after some rearranging) the desired partial derivative required
Example 11.4 (Variance Updates)
in (11.31) as
In our example from Figure 11.3, the variances are updated as follows:
∂p(xn | θ)
σ12 : 1 → 0.14

= πk N xn | µk , Σk (11.39)
∂Σk
σ22 : 0.2 → 0.44 (11.40)
· − 2 (Σk − Σ−1
 1 −1 ⊤ −1

k (xn − µk )(xn − µk ) Σk ) . (11.35)
σ32 : 3 → 1.53 (11.41)
Putting everything together, the partial derivative of the log-likelihood

Draft (2024-01-15) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com. ©2024 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).
358 Density Estimation with Gaussian Mixture Models 11.2 Parameter Learning via Maximum Likelihood 359
N K K
!
X X  X
Here we see that the variances of the first and third component shrink = log πk N xn | µk , Σk + λ πk − 1 , (11.43b)
significantly, whereas the variance of the second component increases n=1 k=1 k=1

slightly. where L is the log-likelihood from (11.10) and the second term encodes
Figure 11.6 illustrates this setting. Figure 11.6(a) is identical (but for the equality constraint that all the mixture weights need to sum up to
zoomed in) to Figure 11.5(b) and shows the GMM density and its indi- 1. We obtain the partial derivative with respect to πk as
vidual components prior to updating the variances. Figure 11.6(b) shows N 
the GMM density after updating the variances. ∂L X N xn | µk , Σk
= PK  +λ (11.44a)
∂πk n=1 j=1 πj N xn | µj , Σj
Figure 11.6 Effect
of updating the π1 N (x|µ1 , σ12 ) 0.35 π1 N (x|µ1 , σ12 ) N 
0.30
π2 N (x|µ2 , σ22 ) π2 N (x|µ2 , σ22 ) 1 X πk N xn | µk , Σk Nk
variances in a GMM. 0.25 π3 N (x|µ3 , σ32 )
0.30 π3 N (x|µ3 , σ32 ) =  +λ = + λ, (11.44b)
πk n=1 K j=1 πj N xn | µj , Σj
πk
P
(a) GMM before GMM density 0.25 GMM density
0.20
updating the 0.20 | {z }
p(x)

p(x)
variances; (b) GMM 0.15 =Nk
0.15
after updating the 0.10
0.10 and the partial derivative with respect to the Lagrange multiplier λ as
variances while 0.05 0.05
retaining the means K
0.00 0.00 ∂L X
and mixture −4 −2 0 2 4 6 8 −4 −2 0 2 4 6 8 = πk − 1 . (11.45)
weights. x x ∂λ k=1
(a) GMM density and individual components (b) GMM density and individual components
prior to updating the variances. after updating the variances. Setting both partial derivatives to 0 (necessary condition for optimum)
yields the system of equations
Nk
πk = − , (11.46)
Similar to the update of the mean parameters, we can interpret (11.30) λ
as a Monte Carlo estimate of the weighted covariance of data points xn K
X
associated with the k th mixture component, where the weights are the 1= πk . (11.47)
responsibilities rnk . As with the updates of the mean parameters, this up- k=1

date depends on all πj , µj , Σj , j = 1, . . . , K , through the responsibilities Using (11.46) in (11.47) and solving for πk , we obtain
rnk , which prohibits a closed-form solution. K K
X X Nk N
πk = 1 ⇐⇒ − = 1 ⇐⇒ − = 1 ⇐⇒ λ = −N .
k=1 k=1
λ λ
11.2.4 Updating the Mixture Weights (11.48)
Theorem 11.3 (Update of the GMM Mixture Weights). The mixture weights This allows us to substitute −N for λ in (11.46) to obtain
of the GMM are updated as Nk
, πknew = (11.49)
Nk N
πknew = , k = 1, . . . , K , (11.42) which gives us the update for the weight parameters πk and proves Theo-
N
rem 11.3.
where N is the number of data points and Nk is defined in (11.24).
We can identify the mixture weight in (11.42) as the ratio of the to-
Proof To find the partial derivative of the log-likelihood with respect tal responsibility
P of the k th cluster and the number of data points. Since
weight parameters πk , k = 1, . . . , K , we account for the con-
to the P N = k Nk , the number of data points can also be interpreted as the
straint k πk = 1 by using Lagrange multipliers (see Section 7.2). The total responsibility of all mixture components together, such that πk is the
Lagrangian is relative importance of the k th mixture component for the dataset.
PN
K
! Remark. Since Nk = i=1 rnk , the update equation (11.42) for the mix-
ture weights πk also depends on all πj , µj , Σj , j = 1, . . . , K via the re-
X
L=L+λ πk − 1 (11.43a)
k=1 sponsibilities rnk . ♢

Draft (2024-01-15) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com. ©2024 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).
360 Density Estimation with Gaussian Mixture Models 11.3 EM Algorithm 361

rithm) was proposed by Dempster et al. (1977) and is a general iterative


Example 11.5 (Weight Parameter Updates) scheme for learning parameters (maximum likelihood or MAP) in mixture
models and, more generally, latent-variable models.
Figure 11.7 Effect In our example of the Gaussian mixture model, we choose initial values
of updating the 0.35 π1 N (x|µ1 , σ12 )
π2 N (x|µ2 , σ22 )
0.30 π1 N (x|µ1 , σ12 )
π2 N (x|µ2 , σ22 )
for µk , Σk , πk and alternate until convergence between
mixture weights in a 0.30 π3 N (x|µ3 , σ32 ) 0.25 π3 N (x|µ3 , σ32 )

GMM. (a) GMM 0.25 GMM density


0.20
GMM density E-step: Evaluate the responsibilities rnk (posterior probability of data
before updating the 0.20 point n belonging to mixture component k ).
p(x)

p(x)
0.15
mixture weights; 0.15
M-step: Use the updated responsibilities to reestimate the parameters
0.10
(b) GMM after 0.10

updating the 0.05 0.05


µk , Σk , πk .
mixture weights 0.00 0.00
Every step in the EM algorithm increases the log-likelihood function (Neal
−4 −2 0 2 4 6 8 −4 −2 0 2 4 6 8
while retaining the
means and
x x
and Hinton, 1999). For convergence, we can check the log-likelihood or
(a) GMM density and individual components (b) GMM density and individual components the parameters directly. A concrete instantiation of the EM algorithm for
variances. Note the
prior to updating the mixture weights. after updating the mixture weights.
different scales of estimating the parameters of a GMM is as follows:
the vertical axes. In our running example from Figure 11.3, the mixture weights are up-
1. Initialize µk , Σk , πk .
dated as follows:
2. E-step: Evaluate responsibilities rnk for every data point xn using cur-
1
π1 : 3
→ 0.29 (11.50) rent parameters πk , µk , Σk :
1
π2 : → 0.29 (11.51)

3 πk N xn | µk , Σk
1 rnk = P . (11.53)
π3 : 3
→ 0.42 (11.52) j πj N xn | µj , Σj

Here we see that the third component gets more weight/importance, 3. M-step: Reestimate parameters πk , µk , Σk using the current responsi-
while the other components become slightly less important. Figure 11.7 bilities rnk (from E-step): Having updated the
illustrates the effect of updating the mixture weights. Figure 11.7(a) is N
means µk
identical to Figure 11.6(b) and shows the GMM density and its individual 1 X in (11.54), they are
µk = rnk xn , (11.54) subsequently used
components prior to updating the mixture weights. Figure 11.7(b) shows Nk n=1
in (11.55) to update
the GMM density after updating the mixture weights. N the corresponding
1 X
Overall, having updated the means, the variances, and the weights Σk = rnk (xn − µk )(xn − µk )⊤ , (11.55) covariances.
once, we obtain the GMM shown in Figure 11.7(b). Compared with the Nk n=1
initialization shown in Figure 11.3, we can see that the parameter updates Nk
πk = . (11.56)
caused the GMM density to shift some of its mass toward the data points. N
After updating the means, variances, and weights once, the GMM fit
in Figure 11.7(b) is already remarkably better than its initialization from Example 11.6 (GMM Fit)
Figure 11.3. This is also evidenced by the log-likelihood values, which
increased from −28.3 (initialization) to −14.4 after a full update cycle. Figure 11.8 EM
π1 N (x|µ1 , σ12 ) 28 algorithm applied to
0.30
π2 N (x|µ2 , σ22 )
the GMM from

Negative log-likelihood
26
0.25 π3 N (x|µ3 , σ32 )
GMM density 24 Figure 11.2. (a)
0.20
22 Final GMM fit;

p(x)
11.3 EM Algorithm 0.15
20 (b) negative
0.10
18
log-likelihood as a
Unfortunately, the updates in (11.20), (11.30), and (11.42) do not consti- function of the EM
0.05 16
tute a closed-form solution for the updates of the parameters µk , Σk , πk iteration.
0.00 14
of the mixture model because the responsibilities rnk depend on those pa- −5 0 5 10 15 0 1 2 3 4 5
x
Iteration
rameters in a complex way. However, the results suggest a simple iterative
(a) Final GMM fit. After five iterations, the EM (b) Negative log-likelihood as a function of the
scheme for finding a solution to the parameters estimation problem via algorithm converges and returns this GMM. EM iterations.
EM algorithm maximum likelihood. The expectation maximization algorithm (EM algo-

Draft (2024-01-15) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com. ©2024 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).
362 Density Estimation with Gaussian Mixture Models 11.4 Latent-Variable Perspective 363
Figure 11.9 10 6 6 Figure 11.10 GMM
Illustration of the 104 fit and
4 4

Negative log-likelihood
EM algorithm for 5 responsibilities
fitting a Gaussian 2 2 when EM converges.
mixture model with (a) GMM fit when
x2

x2

x2
0 0 0
3
three components to 6 × 10 EM converges;
−2 −2
a two-dimensional −5 (b) each data point
dataset. (a) Dataset; −4 −4 is colored according
4 × 103
(b) negative −10 −6 −6 to the
−10 −5 0 5 10 0 20 40 60 −5 0 5 −5 0 5
log-likelihood x1 EM iteration x1 x1 responsibilities of
(lower is better) as the mixture
(a) Dataset. (b) Negative log-likelihood. (a) GMM fit after 62 iterations. (b) Dataset colored according to the respon-
a function of the EM components.
sibilities of the mixture components.
iterations. The red
10 10
dots indicate the
iterations for which
5 5
the mixture
components of the the corresponding final GMM fit. Figure 11.10(b) visualizes the final re-
x2

x2
0 0
corresponding GMM sponsibilities of the mixture components for the data points. The dataset is
fits are shown in (c) colored according to the responsibilities of the mixture components when
−5 −5
through (f). The
yellow discs indicate
EM converges. While a single mixture component is clearly responsible
the means of the −10
−10 −5 0 5 10
−10
−10 −5 0 5 10 for the data on the left, the overlap of the two data clusters on the right
x1 x1
Gaussian mixture could have been generated by two mixture components. It becomes clear
components. (c) EM initialization. (d) EM after one iteration. that there are data points that cannot be uniquely assigned to a single
Figure 11.10(a)
component (either blue or yellow), such that the responsibilities of these
shows the final 10 10
GMM fit. two clusters for those points are around 0.5.
5 5
x2

x2

0 0
11.4 Latent-Variable Perspective
−5 −5
We can look at the GMM from the perspective of a discrete latent-variable
−10 −10
model, i.e., where the latent variable z can attain only a finite set of val-
−10 −5 0 5 10 −10 −5 0 5 10
x1 x1 ues. This is in contrast to PCA, where the latent variables were continuous-
(e) EM after 10 iterations. (f) EM after 62 iterations. valued numbers in RM .
The advantages of the probabilistic perspective are that (i) it will jus-
tify some ad hoc decisions we made in the previous sections, (ii) it allows
for a concrete interpretation of the responsibilities as posterior probabil-
ities, and (iii) the iterative algorithm for updating the model parameters
When we run EM on our example from Figure 11.3, we obtain the final
can be derived in a principled manner as the EM algorithm for maximum
result shown in Figure 11.8(a) after five iterations, and Figure 11.8(b)
likelihood parameter estimation in latent-variable models.
shows how the negative log-likelihood evolves as a function of the EM
iterations. The final GMM is given as
 
p(x) = 0.29N x | − 2.75, 0.06 + 0.28N x | − 0.50, 0.25 11.4.1 Generative Process and Probabilistic Model
 (11.57)
+ 0.43N x | 3.64, 1.63 . To derive the probabilistic model for GMMs, it is useful to think about the
generative process, i.e., the process that allows us to generate data, using
a probabilistic model.
We applied the EM algorithm to the two-dimensional dataset shown We assume a mixture model with K components and that a data point
in Figure 11.1 with K = 3 mixture components. Figure 11.9 illustrates x can be generated by exactly one mixture component. We introduce a
some steps of the EM algorithm and shows the negative log-likelihood as binary indicator variable zk ∈ {0, 1} with two states (see Section 6.2) that
a function of the EM iteration (Figure 11.9(b)). Figure 11.10(a) shows indicates whether the k th mixture component generated that data point

Draft (2024-01-15) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com. ©2024 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).
364 Density Estimation with Gaussian Mixture Models 11.4 Latent-Variable Perspective 365

so that for k = 1, . . . , K , so that


    

p(x | zk = 1) = N x | µk , Σk . (11.58) p(x, z1 = 1) π1 N x | µ1 , Σ1
p(x, z) =  .
.. .
..
= , (11.62)
   
We define z := [z1 , . . . , zK ]⊤ ∈ RK as a probability vector consisting of 
K −1 many 0s and exactly one 1. For example, for K = 3, a valid z would
p(x, zK = 1) πK N x | µK , ΣK
be z = [z1 , z2 , z3 ]⊤ = [0, 1, 0]⊤ , which would select the second mixture which fully specifies the probabilistic model.
component since z2 = 1.
Remark. Sometimes this kind of probability distribution is called “multi- 11.4.2 Likelihood
noulli”, a generalization of the Bernoulli distribution to more than two
values (Murphy, 2012). ♢ To obtain the likelihood p(x | θ) in a latent-variable model, we need to
PK marginalize out the latent variables (see Section 8.4.3). In our case, this
one-hot encoding The properties of z imply that k=1 zk = 1. Therefore, z is a one-hot can be done by summing out all latent variables from the joint p(x, z)
1-of-K encoding (also: 1-of-K representation). in (11.62) so that
representation Thus far, we assumed that the indicator variables zk are known. How- X
ever, in practice, this is not the case, and we place a prior distribution p(x | θ) = p(x | θ, z)p(z | θ) , θ := {µk , Σk , πk : k = 1, . . . , K} .
z
K
X (11.63)

p(z) = π = [π1 , . . . , πK ] , πk = 1 , (11.59)
We now explicitly condition on the parameters θ of the probabilistic model,
k=1
which we previously omitted. In (11.63), P we sum over all K possible one-
on the latent variable z . Then the k th entry hot encodings of z , which is denoted by z . Since there is only a single
nonzero single entry in each z there are only K possible configurations/
πk = p(zk = 1) (11.60)
settings of z . For example, if K = 3, then z can have the configurations
of this probability vector describes the probability that the k th mixture      
1 0 0
Figure 11.11 component generated data point x. 0 , 1 , 0 . (11.64)
Graphical model for
a GMM with a single Remark (Sampling from a GMM). The construction of this latent-variable 0 0 1
data point. model (see the corresponding graphical model in Figure 11.11) lends it- Summing over all possible configurations of z in (11.63) is equivalent to
π self to a very simple sampling procedure (generative process) to generate looking at the nonzero entry of the z -vector and writing
data: X
p(x | θ) = p(x | θ, z)p(z | θ) (11.65a)
z 1. Sample z (i) ∼ p(z). z
2. Sample x(i) ∼ p(x | z (i) = 1). K
X
µk
= p(x | θ, zk = 1)p(zk = 1 | θ) (11.65b)
In the first step, we select a mixture component i (via the one-hot encod- k=1
Σk x ing z ) at random according to p(z) = π ; in the second step we draw a so that the desired marginal distribution is given as
k = 1, . . . , K
sample from the corresponding mixture component. When we discard the
K
samples of the latent variable so that we are left with the x(i) , we have (11.65b)
X
p(x | θ) = p(x | θ, zk = 1)p(zk = 1|θ) (11.66a)
valid samples from the GMM. This kind of sampling, where samples of
k=1
random variables depend on samples from the variable’s parents in the K
graphical model, is called ancestral sampling. ♢
X
ancestral sampling

= πk N x | µk , Σk , (11.66b)
Generally, a probabilistic model is defined by the joint distribution of k=1

the data and the latent variables (see Section 8.4). With the prior p(z) which we identify as the GMM model from (11.3). Given a dataset X , we
defined in (11.59) and (11.60) and the conditional p(x | z) from (11.58), immediately obtain the likelihood
we obtain all K components of this joint distribution via N N X
K
(11.66b)
Y Y 
p(x, zk = 1) = p(x | zk = 1)p(zk = 1) = πk N x | µk , Σk

(11.61)
p(X | θ) = p(xn | θ) = πk N xn | µk , Σk , (11.67)
n=1 n=1 k=1

Draft (2024-01-15) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com. ©2024 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).
366 Density Estimation with Gaussian Mixture Models 11.4 Latent-Variable Perspective 367

Figure 11.12 π We share the same prior distribution π across all latent variables z n .
Graphical model for
The corresponding graphical model is shown in Figure 11.12, where we
a GMM with N data
points. use the plate notation.
zn The conditional distribution p(x1 , . . . , xN | z 1 , . . . , z N ) factorizes over
the data points and is given as
N
µk Y
p(x1 , . . . , xN | z 1 , . . . , z N ) = p(xn | z n ) . (11.71)
Σk xn n=1
k = 1, . . . , K
n = 1, . . . , N To obtain the posterior distribution p(znk = 1 | xn ), we follow the same
reasoning as in Section 11.4.3 and apply Bayes’ theorem to obtain
which is exactly the GMM likelihood from (11.9). Therefore, the latent- p(xn | znk = 1)p(znk = 1)
variable model with latent indicators zk is an equivalent way of thinking p(znk = 1 | xn ) = PK (11.72a)
j=1 p(xn | znj = 1)p(znj = 1)
about a Gaussian mixture model. 
πk N xn | µk , Σk
= PK  = rnk . (11.72b)
j=1 πj N xn | µj , Σj
11.4.3 Posterior Distribution
Let us have a brief look at the posterior distribution on the latent variable This means that p(zk = 1 | xn ) is the (posterior) probability that the k th
z . According to Bayes’ theorem, the posterior of the k th component having mixture component generated data point xn and corresponds to the re-
generated data point x sponsibility rnk we introduced in (11.17). Now the responsibilities also
have not only an intuitive but also a mathematically justified interpreta-
p(zk = 1)p(x | zk = 1) tion as posterior probabilities.
p(zk = 1 | x) = , (11.68)
p(x)
where the marginal p(x) is given in (11.66b). This yields the posterior
distribution for the k th indicator variable zk 11.4.5 EM Algorithm Revisited

p(zk = 1)p(x | zk = 1) πk N x | µk , Σk The EM algorithm that we introduced as an iterative scheme for maximum
p(zk = 1 | x) = PK = PK ,
likelihood estimation can be derived in a principled way from the latent-
j=1 p(zj = 1)p(x | zj = 1) j=1 πj N x | µj , Σj
(11.69) variable perspective. Given a current setting θ (t) of model parameters, the
E-step calculates the expected log-likelihood
which we identify as the responsibility of the k th mixture component for
data point x. Note that we omitted the explicit conditioning on the GMM Q(θ | θ (t) ) = Ez | x,θ(t) [log p(x, z | θ)] (11.73a)
parameters πk , µk , Σk where k = 1, . . . , K . Z
= log p(x, z | θ)p(z | x, θ (t) )dz , (11.73b)

11.4.4 Extension to a Full Dataset where the expectation of log p(x, z | θ) is taken with respect to the poste-
Thus far, we have only discussed the case where the dataset consists only rior p(z | x, θ (t) ) of the latent variables. The M-step selects an updated set
of a single data point x. However, the concepts of the prior and posterior of model parameters θ (t+1) by maximizing (11.73b).
can be directly extended to the case of N data points X := {x1 , . . . , xN }. Although an EM iteration does increase the log-likelihood, there are
In the probabilistic interpretation of the GMM, every data point xn pos- no guarantees that EM converges to the maximum likelihood solution.
sesses its own latent variable It is possible that the EM algorithm converges to a local maximum of
the log-likelihood. Different initializations of the parameters θ could be
z n = [zn1 , . . . , znK ]⊤ ∈ RK . (11.70)
used in multiple EM runs to reduce the risk of ending up in a bad local
Previously (when we only considered a single data point x), we omitted optimum. We do not go into further details here, but refer to the excellent
the index n, but now this becomes important. expositions by Rogers and Girolami (2016) and Bishop (2006).


11.5 Further Reading

The GMM can be considered a generative model in the sense that it is straightforward to generate new data using ancestral sampling (Bishop, 2006). For given GMM parameters π_k, µ_k, Σ_k, k = 1, ..., K, we sample an index k from the probability vector [π_1, ..., π_K]^⊤ and then sample a data point x ∼ N(µ_k, Σ_k). If we repeat this N times, we obtain a dataset that has been generated by a GMM. Figure 11.1 was generated using this procedure.

Throughout this chapter, we assumed that the number of components K is known. In practice, this is often not the case. However, we could use nested cross-validation, as discussed in Section 8.6.1, to find good models.

Gaussian mixture models are closely related to the K-means clustering algorithm. K-means also uses the EM algorithm to assign data points to clusters. If we treat the means in the GMM as cluster centers and ignore the covariances (or set them to I), we arrive at K-means. As also nicely described by MacKay (2003), K-means makes a "hard" assignment of data points to cluster centers µ_k, whereas a GMM makes a "soft" assignment via the responsibilities.

We only touched upon the latent-variable perspective of GMMs and the EM algorithm. Note that EM can be used for parameter learning in general latent-variable models, e.g., nonlinear state-space models (Ghahramani and Roweis, 1999; Roweis and Ghahramani, 1999), and for reinforcement learning as discussed by Barber (2012). Therefore, the latent-variable perspective of a GMM is useful for deriving the corresponding EM algorithm in a principled way (Bishop, 2006; Barber, 2012; Murphy, 2012).

We only discussed maximum likelihood estimation (via the EM algorithm) for finding GMM parameters. The standard criticisms of maximum likelihood also apply here:

- As in linear regression, maximum likelihood can suffer from severe overfitting. In the GMM case, this happens when the mean of a mixture component is identical to a data point and the covariance tends to 0. Then the likelihood approaches infinity. Bishop (2006) and Barber (2012) discuss this issue in detail.
- We only obtain a point estimate of the parameters π_k, µ_k, Σ_k for k = 1, ..., K, which does not give any indication of uncertainty in the parameter values. A Bayesian approach would place a prior on the parameters, which can be used to obtain a posterior distribution over the parameters. This posterior allows us to compute the model evidence (marginal likelihood), which can be used for model comparison and therefore gives us a principled way to determine the number of mixture components. Unfortunately, closed-form inference is not possible in this setting because there is no conjugate prior for this model. However, approximations, such as variational inference, can be used to obtain an approximate posterior (Bishop, 2006).

In this chapter, we discussed mixture models for density estimation. There is a plethora of density estimation techniques available. In practice, we often use histograms and kernel density estimation.

[Figure 11.13: Histogram (orange bars) and kernel density estimate (blue line) of a dataset. The kernel density estimator produces a smooth estimate of the underlying density p(x), whereas the histogram is an unsmoothed count of how many data points (black) fall into a single bin.]

Histograms provide a nonparametric way to represent continuous densities and were proposed by Pearson (1895). A histogram is constructed by "binning" the data space and counting how many data points fall into each bin. Then a bar is drawn at the center of each bin, and the height of the bar is proportional to the number of data points within that bin. The bin size is a critical hyperparameter, and a bad choice can lead to overfitting and underfitting. Cross-validation, as discussed in Section 8.2.4, can be used to determine a good bin size.

Kernel density estimation, independently proposed by Rosenblatt (1956) and Parzen (1962), is a nonparametric method for density estimation. Given N i.i.d. samples, the kernel density estimator represents the underlying distribution as

p(x) = \frac{1}{N h} \sum_{n=1}^N k\!\left(\frac{x - x_n}{h}\right) ,    (11.74)

where k is a kernel function, i.e., a nonnegative function that integrates to 1, and h > 0 is a smoothing/bandwidth parameter, which plays a similar role as the bin size in histograms. Note that we place a kernel on every single data point x_n in the dataset. Commonly used kernel functions are the uniform distribution and the Gaussian distribution. Kernel density estimates are closely related to histograms, but by choosing a suitable kernel, we can guarantee smoothness of the density estimate. Figure 11.13 illustrates the difference between a histogram and a kernel density estimator (with a Gaussian-shaped kernel) for a given dataset of 250 data points.
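The estimator (11.74) with a Gaussian kernel can be implemented in a few lines. The following Python sketch (numpy only) is illustrative; the data, the grid, and the bandwidth h = 0.5 are arbitrary choices, and in practice h would be selected, e.g., by cross-validation.

```python
import numpy as np

def gaussian_kde(x_query, x_data, h):
    """Kernel density estimate (11.74) with the Gaussian kernel k(u) = N(u | 0, 1)."""
    u = (x_query[:, None] - x_data[None, :]) / h        # pairwise (x - x_n) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)        # Gaussian kernel values
    return k.sum(axis=1) / (len(x_data) * h)            # (1 / (N h)) * sum_n k((x - x_n)/h)

# 250 samples from a two-component mixture, loosely in the spirit of Figure 11.13
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(5.0, 1.2, 100)])
grid = np.linspace(-4.0, 9.0, 200)
density = gaussian_kde(grid, data, h=0.5)
print(np.trapz(density, grid))   # approximately 1: the estimate integrates to one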


12 Classification with Support Vector Machines

In many situations, we want our machine learning algorithm to predict one of a number of (discrete) outcomes. For example, an email client sorts mail into personal mail and junk mail, which has two outcomes. Another example is a telescope that identifies whether an object in the night sky is a galaxy, star, or planet. There are usually a small number of outcomes, and more importantly there is usually no additional structure on these outcomes. (An example of structure would be ordered outcomes, as in the case of small, medium, and large t-shirts.) In this chapter, we consider predictors that output binary values, i.e., there are only two possible outcomes. This machine learning task is called binary classification. This is in contrast to Chapter 9, where we considered a prediction problem with continuous-valued outputs.

For binary classification, the set of possible values that the label/output can attain is binary, and for this chapter we denote them by {+1, −1}. In other words, we consider predictors of the form

f : R^D → {+1, −1} .    (12.1)

Recall from Chapter 8 that we represent each example (data point) x_n as a feature vector of D real numbers. (An input example x_n may also be referred to as an input, data point, feature, or instance.) The labels are often referred to as the positive and negative classes, respectively. One should be careful not to infer intuitive attributes of positiveness of the +1 class. For example, in a cancer detection task, a patient with cancer is often labeled +1. In principle, any two distinct values can be used, e.g., {True, False}, {0, 1}, or {red, blue}. (For probabilistic models, it is mathematically convenient to use {0, 1} as a binary representation; see the remark after Example 6.12.) The problem of binary classification is well studied, and we defer a survey of other approaches to Section 12.6.

We present an approach known as the support vector machine (SVM), which solves the binary classification task. As in regression, we have a supervised learning task, where we have a set of examples x_n ∈ R^D along with their corresponding (binary) labels y_n ∈ {+1, −1}. Given a training dataset consisting of example–label pairs {(x_1, y_1), ..., (x_N, y_N)}, we would like to estimate parameters of the model that will give the smallest classification error. Similar to Chapter 9, we consider a linear model, and hide away the nonlinearity in a transformation ϕ of the examples (9.13). We will revisit ϕ in Section 12.4.

The SVM provides state-of-the-art results in many applications, with sound theoretical guarantees (Steinwart and Christmann, 2008). There are two main reasons why we chose to illustrate binary classification using SVMs. First, the SVM allows for a geometric way to think about supervised machine learning. While in Chapter 9 we considered the machine learning problem in terms of probabilistic models and attacked it using maximum likelihood estimation and Bayesian inference, here we will consider an alternative approach where we reason geometrically about the machine learning task. It relies heavily on concepts such as inner products and projections, which we discussed in Chapter 3. The second reason why we find SVMs instructive is that, in contrast to Chapter 9, the optimization problem for the SVM does not admit an analytic solution, so we need to resort to a variety of optimization tools introduced in Chapter 7.

[Figure 12.1: Example 2D data, illustrating the intuition of data where we can find a linear classifier that separates orange crosses from blue discs. Axes: x^{(1)} and x^{(2)}.]

The SVM view of machine learning is subtly different from the maximum likelihood view of Chapter 9. The maximum likelihood view proposes a model based on a probabilistic view of the data distribution, from which an optimization problem is derived. In contrast, the SVM view starts by designing a particular function that is to be optimized during training, based on geometric intuitions. We have seen something similar already in Chapter 10, where we derived PCA from geometric principles. In the SVM case, we start by designing a loss function that is to be minimized on training data, following the principles of empirical risk minimization (Section 8.2).

Let us derive the optimization problem corresponding to training an SVM on example–label pairs. Intuitively, we imagine binary classification data that can be separated by a hyperplane, as illustrated in Figure 12.1. Here, every example x_n (a vector of dimension 2) is a two-dimensional location (x_n^{(1)}, x_n^{(2)}), and the corresponding binary label y_n is one of two different symbols (orange cross or blue disc). "Hyperplane" is a word that is commonly used in machine learning, and we encountered hyperplanes already in Section 2.8. A hyperplane is an affine subspace of dimension D − 1 (if the corresponding vector space is of dimension D). The examples consist of two classes (there are two possible labels) that have features (the components of the vector representing the example) arranged in such a way as to allow us to separate/classify them by drawing a straight line.

In the following, we formalize the idea of finding a linear separator of the two classes. We introduce the idea of the margin and then extend linear separators to allow for examples to fall on the "wrong" side, incurring a classification error. We present two equivalent ways of formalizing the SVM: the geometric view (Section 12.2.4) and the loss function view (Section 12.2.5). We derive the dual version of the SVM using Lagrange multipliers (Section 7.2). The dual SVM allows us to observe a third way of formalizing the SVM: in terms of the convex hulls of the examples of each class (Section 12.3.2). We conclude by briefly describing kernels and how to numerically solve the nonlinear kernel-SVM optimization problem.

12.1 Separating Hyperplanes

Given two examples represented as vectors x_i and x_j, one way to compute the similarity between them is using an inner product ⟨x_i, x_j⟩. Recall from Section 3.2 that inner products are closely related to the angle between two vectors. The value of the inner product between two vectors depends on the length (norm) of each vector. Furthermore, inner products allow us to rigorously define geometric concepts such as orthogonality and projections.

The main idea behind many classification algorithms is to represent data in R^D and then partition this space, ideally in a way that examples with the same label (and no other examples) are in the same partition. In the case of binary classification, the space would be divided into two parts corresponding to the positive and negative classes, respectively. We consider a particularly convenient partition, which is to (linearly) split the space into two halves using a hyperplane. Let example x ∈ R^D be an element of the data space. Consider a function

f : R^D → R    (12.2a)
x ↦ f(x) := ⟨w, x⟩ + b ,    (12.2b)

parametrized by w ∈ R^D and b ∈ R. Recall from Section 2.8 that hyperplanes are affine subspaces. Therefore, we define the hyperplane that separates the two classes in our binary classification problem as

{ x ∈ R^D : f(x) = 0 } .    (12.3)

An illustration of the hyperplane is shown in Figure 12.2, where the vector w is a vector normal to the hyperplane and b the intercept. We can derive that w is a normal vector to the hyperplane in (12.3) by choosing any two examples x_a and x_b on the hyperplane and showing that the vector between them is orthogonal to w. In the form of an equation,

f(x_a) − f(x_b) = ⟨w, x_a⟩ + b − (⟨w, x_b⟩ + b)    (12.4a)
= ⟨w, x_a − x_b⟩ ,    (12.4b)

where the second line is obtained by the linearity of the inner product (Section 3.2). Since we have chosen x_a and x_b to be on the hyperplane, this implies that f(x_a) = 0 and f(x_b) = 0, and hence ⟨w, x_a − x_b⟩ = 0. Recall that two vectors are orthogonal when their inner product is zero. Therefore, we obtain that w is orthogonal to any vector on the hyperplane.

[Figure 12.2: Equation of a separating hyperplane (12.3). (a) The standard way of representing the equation in 3D, with w normal to the hyperplane, the intercept b, and positive and negative sides. (b) For ease of drawing, we look at the hyperplane edge on.]

Remark. Recall from Chapter 2 that we can think of vectors in different ways. In this chapter, we think of the parameter vector w as an arrow indicating a direction, i.e., we consider w to be a geometric vector. In contrast, we think of the example vector x as a data point (as indicated by its coordinates), i.e., we consider x to be the coordinates of a vector with respect to the standard basis. ♢

When presented with a test example, we classify the example as positive or negative depending on the side of the hyperplane on which it occurs. Note that (12.3) not only defines a hyperplane; it additionally defines a direction. In other words, it defines the positive and negative side of the hyperplane. Therefore, to classify a test example x_test, we calculate the value of the function f(x_test) and classify the example as +1 if f(x_test) ⩾ 0 and −1 otherwise. Thinking geometrically, the positive examples lie "above" the hyperplane and the negative examples "below" the hyperplane.

When training the classifier, we want to ensure that the examples with positive labels are on the positive side of the hyperplane, i.e.,

⟨w, x_n⟩ + b ⩾ 0 when y_n = +1 ,    (12.5)

and the examples with negative labels are on the negative side, i.e.,

⟨w, x_n⟩ + b < 0 when y_n = −1 .    (12.6)

Refer to Figure 12.2 for a geometric intuition of positive and negative examples. These two conditions are often presented in a single equation

y_n(⟨w, x_n⟩ + b) ⩾ 0 .    (12.7)

Equation (12.7) is equivalent to (12.5) and (12.6) when we multiply both sides of (12.5) and (12.6) with y_n = 1 and y_n = −1, respectively.
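A small numerical sketch of this classification rule, with a hypothetical hyperplane and a few labelled examples (numpy assumed), shows how (12.2b) is used for prediction and how (12.7) checks that training examples lie on the correct side.

```python
import numpy as np

def predict(X, w, b):
    """Classify each example by the side of the hyperplane: +1 if <w, x> + b >= 0, else -1."""
    return np.where(X @ w + b >= 0, 1, -1)

# Hypothetical hyperplane parameters and labelled 2D examples
w, b = np.array([1.0, -1.0]), -0.5
X = np.array([[2.0, 0.5], [0.0, 1.5], [3.0, 1.0], [-1.0, 0.0]])
y = np.array([+1, -1, +1, -1])

print(predict(X, w, b))          # predicted labels for each row of X
print(y * (X @ w + b) >= 0)      # condition (12.7), True for correctly placed examples
```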


[Figure 12.3: Possible separating hyperplanes. There are many linear classifiers (green lines) that separate orange crosses from blue discs. Axes: x^{(1)} and x^{(2)}.]

12.2 Primal Support Vector Machine

Based on the concept of distances from points to a hyperplane, we are now in a position to discuss the support vector machine. For a dataset {(x_1, y_1), ..., (x_N, y_N)} that is linearly separable, we have infinitely many candidate hyperplanes (refer to Figure 12.3), and therefore classifiers, that solve our classification problem without any (training) errors. To find a unique solution, one idea is to choose the separating hyperplane that maximizes the margin between the positive and negative examples. In other words, we want the positive and negative examples to be separated by a large margin (Section 12.2.1). A classifier with a large margin turns out to generalize well (Steinwart and Christmann, 2008). In the following, we compute the distance between an example and a hyperplane to derive the margin. Recall that the closest point on the hyperplane to a given point (example x_n) is obtained by the orthogonal projection (Section 3.8).

12.2.1 Concept of the Margin

The concept of the margin is intuitively simple: It is the distance of the separating hyperplane to the closest examples in the dataset, assuming that the dataset is linearly separable. (There could be two or more closest examples to a hyperplane.) However, when trying to formalize this distance, there is a technical wrinkle that may be confusing. The technical wrinkle is that we need to define a scale at which to measure the distance. A potential scale is to consider the scale of the data, i.e., the raw values of x_n. There are problems with this, as we could change the units of measurement of x_n and change the values in x_n and, hence, change the distance to the hyperplane. As we will see shortly, we define the scale based on the equation of the hyperplane (12.3) itself.

Consider a hyperplane ⟨w, x⟩ + b and an example x_a, as illustrated in Figure 12.4. Without loss of generality, we can consider the example x_a to be on the positive side of the hyperplane, i.e., ⟨w, x_a⟩ + b > 0. We would like to compute the distance r > 0 of x_a from the hyperplane. We do so by considering the orthogonal projection (Section 3.8) of x_a onto the hyperplane, which we denote by x′_a. Since w is orthogonal to the hyperplane, we know that the distance r is just a scaling of this vector w.

[Figure 12.4: Vector addition to express the distance to the hyperplane: x_a = x′_a + r w/∥w∥.]

If the length of w is known, then we can use this scaling factor r to work out the absolute distance between x_a and x′_a. For convenience, we choose to use a vector of unit length (its norm is 1) and obtain this by dividing w by its norm, w/∥w∥. Using vector addition (Section 2.4), we obtain

x_a = x′_a + r \frac{w}{∥w∥} .    (12.8)

Another way of thinking about r is that it is the coordinate of x_a in the subspace spanned by w/∥w∥. We have now expressed the distance of x_a from the hyperplane as r, and if we choose x_a to be the point closest to the hyperplane, this distance r is the margin.

Recall that we would like the positive examples to be further than r from the hyperplane, and the negative examples to be further than distance r (in the negative direction) from the hyperplane. Analogously to the combination of (12.5) and (12.6) into (12.7), we formulate this objective as

y_n(⟨w, x_n⟩ + b) ⩾ r .    (12.9)

In other words, we combine the requirements that examples are at least r away from the hyperplane (in the positive and negative direction) into one single inequality.

Since we are interested only in the direction, we add an assumption to our model that the parameter vector w is of unit length, i.e., ∥w∥ = 1, where we use the Euclidean norm ∥w∥ = \sqrt{w^⊤ w} (Section 3.1). We will see other choices of inner products (Section 3.2) in Section 12.4. This assumption also allows a more intuitive interpretation of the distance r in (12.8), since it is the scaling factor of a vector of length 1.

Remark. A reader familiar with other presentations of the margin would notice that our definition of ∥w∥ = 1 is different from the standard presentation of the SVM, for example the one provided by Schölkopf and Smola (2002). In Section 12.2.3, we will show the equivalence of both approaches. ♢
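As a quick numerical check of the decomposition (12.8), the following sketch (numpy assumed; the hyperplane and example are hypothetical values) computes the orthogonal projection x′_a of x_a onto the hyperplane (Section 3.8), reads off r as the coordinate of x_a − x′_a along w/∥w∥, and verifies that the vector addition holds.

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -5.0     # a hyperplane <w, x> + b = 0 (hypothetical values)
x_a = np.array([4.0, 3.0])            # an example on the positive side: <w, x_a> + b > 0

# Orthogonal projection of x_a onto the hyperplane (Section 3.8)
x_a_proj = x_a - ((w @ x_a + b) / (w @ w)) * w
# r is the coordinate of x_a - x_a' in the direction w / ||w||
r = (w / np.linalg.norm(w)) @ (x_a - x_a_proj)

print(w @ x_a_proj + b)   # ~0: the projection lies on the hyperplane
print(r)                  # 3.8: the distance of x_a from the hyperplane
print(np.allclose(x_a, x_a_proj + r * w / np.linalg.norm(w)))   # decomposition (12.8) holds
```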


Collecting the three requirements into a single constrained optimization problem, we obtain the objective

\max_{w, b, r} \; r    (12.10)
subject to  y_n(⟨w, x_n⟩ + b) ⩾ r ,  ∥w∥ = 1 ,  r > 0 ,

where the objective is the margin, the first constraint is the data-fitting condition, and the remaining constraints are the normalization. This says that we want to maximize the margin r while ensuring that the data lies on the correct side of the hyperplane.

Remark. The concept of the margin turns out to be highly pervasive in machine learning. It was used by Vladimir Vapnik and Alexey Chervonenkis to show that when the margin is large, the "complexity" of the function class is low, and hence learning is possible (Vapnik, 2000). It turns out that the concept is useful for various different approaches for theoretically analyzing generalization error (Steinwart and Christmann, 2008; Shalev-Shwartz and Ben-David, 2014). ♢

12.2.2 Traditional Derivation of the Margin

In the previous section, we derived (12.10) by making the observation that we are only interested in the direction of w and not its length, leading to the assumption that ∥w∥ = 1. In this section, we derive the margin maximization problem by making a different assumption. Instead of choosing that the parameter vector is normalized, we choose a scale for the data. We choose this scale such that the value of the predictor ⟨w, x⟩ + b is 1 at the closest example. Recall that we currently consider linearly separable data. Let us also denote the example in the dataset that is closest to the hyperplane by x_a.

[Figure 12.5: Derivation of the margin: r = 1/∥w∥. The example x_a lies on the margin hyperplane ⟨w, x⟩ + b = 1 and its projection x′_a on ⟨w, x⟩ + b = 0.]

Figure 12.5 is identical to Figure 12.4, except that now we have rescaled the axes such that the example x_a lies exactly on the margin, i.e., ⟨w, x_a⟩ + b = 1. Since x′_a is the orthogonal projection of x_a onto the hyperplane, it must by definition lie on the hyperplane, i.e.,

⟨w, x′_a⟩ + b = 0 .    (12.11)

By substituting (12.8) into (12.11), we obtain

\left\langle w, \; x_a − r \frac{w}{∥w∥} \right\rangle + b = 0 .    (12.12)

Exploiting the bilinearity of the inner product (see Section 3.2), we get

⟨w, x_a⟩ + b − r \frac{⟨w, w⟩}{∥w∥} = 0 .    (12.13)

Observe that the first term is 1 by our assumption of scale, i.e., ⟨w, x_a⟩ + b = 1. From (3.16) in Section 3.1, we know that ⟨w, w⟩ = ∥w∥². Hence, the second term reduces to r∥w∥. Using these simplifications, we obtain

r = \frac{1}{∥w∥} .    (12.14)

This means we have derived the distance r in terms of the normal vector w of the hyperplane. At first glance, this equation is counterintuitive, as we seem to have derived the distance from the hyperplane in terms of the length of the vector w, but we do not yet know this vector. One way to think about it is to consider the distance r to be a temporary variable that we only use for this derivation. Therefore, for the rest of this section we will denote the distance to the hyperplane by 1/∥w∥. (We can also think of the distance as the projection error incurred when projecting x_a onto the hyperplane.) In Section 12.2.3, we will see that the choice that the margin equals 1 is equivalent to our previous assumption of ∥w∥ = 1 in Section 12.2.1.

Similar to the argument to obtain (12.9), we want the positive and negative examples to be at least 1 away from the hyperplane, which yields the condition

y_n(⟨w, x_n⟩ + b) ⩾ 1 .    (12.15)

Combining the margin maximization with the fact that examples need to be on the correct side of the hyperplane (based on their labels) gives us

\max_{w, b} \; \frac{1}{∥w∥}    (12.16)
subject to  y_n(⟨w, x_n⟩ + b) ⩾ 1  for all  n = 1, ..., N .    (12.17)

Instead of maximizing the reciprocal of the norm as in (12.16), we often minimize the squared norm. We also often include a constant ½ that does not affect the optimal w, b but yields a tidier form when we compute the gradient. Then, our objective becomes

\min_{w, b} \; \frac{1}{2} ∥w∥^2    (12.18)
subject to  y_n(⟨w, x_n⟩ + b) ⩾ 1  for all  n = 1, ..., N .    (12.19)

Equation (12.18) is known as the hard margin SVM. (The squared norm results in a convex quadratic programming problem for the SVM; see Section 12.5.) The reason for the expression "hard" is that the formulation does not allow for any violations of the margin condition. We will see in Section 12.2.4 that this "hard" condition can be relaxed to accommodate violations if the data is not linearly separable.
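As an illustration, the hard margin SVM (12.18)–(12.19) can be solved numerically for a tiny linearly separable dataset with a generic constrained solver; the sketch below uses scipy's SLSQP method as an assumed stand-in for a dedicated SVM solver, and the dataset and starting point are hypothetical. It then checks that the closest example indeed sits at distance 1/∥w∥ from the hyperplane, as in (12.14).

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable dataset (hypothetical): +1 examples vs. -1 examples
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

def objective(theta):                 # (12.18): 0.5 * ||w||^2, with theta = [w_1, w_2, b]
    w = theta[:2]
    return 0.5 * w @ w

constraints = [{"type": "ineq",       # (12.19): y_n (<w, x_n> + b) - 1 >= 0
                "fun": lambda theta, x=x, t=t: t * (theta[:2] @ x + theta[2]) - 1.0}
               for x, t in zip(X, y)]

res = minimize(objective, x0=np.array([1.0, 1.0, -3.0]),
               constraints=constraints, method="SLSQP")
w, b = res.x[:2], res.x[2]

margin = 1.0 / np.linalg.norm(w)                     # (12.14)
distances = np.abs(X @ w + b) / np.linalg.norm(w)    # distances of all examples
print(w, b, margin, distances.min())                 # closest example at distance 1/||w||
```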

12.2.3 Why We Can Set the Margin to 1

In Section 12.2.1, we argued that we would like to maximize some value r, which represents the distance of the closest example to the hyperplane. In Section 12.2.2, we scaled the data such that the closest example is of distance 1 to the hyperplane. In this section, we relate the two derivations and show that they are equivalent.

Theorem 12.1. Maximizing the margin r, where we consider normalized weights as in (12.10),

\max_{w, b, r} \; r    (12.20)
subject to  y_n(⟨w, x_n⟩ + b) ⩾ r ,  ∥w∥ = 1 ,  r > 0 ,

is equivalent to scaling the data such that the margin is unity:

\min_{w, b} \; \frac{1}{2} ∥w∥^2    (12.21)
subject to  y_n(⟨w, x_n⟩ + b) ⩾ 1 .

Proof. Consider (12.20). Since the square is a strictly monotonic transformation for non-negative arguments, the maximum stays the same if we consider r² in the objective. Since ∥w∥ = 1, we can reparametrize the equation with a new weight vector w′ that is not normalized by explicitly using w′/∥w′∥. We obtain

\max_{w′, b, r} \; r^2    (12.22)
subject to  y_n\!\left( \left\langle \frac{w′}{∥w′∥}, x_n \right\rangle + b \right) ⩾ r ,  r > 0 .

Equation (12.22) explicitly states that the distance r is positive. (Note that r > 0 because we assumed linear separability, and hence there is no issue in dividing by r.) Therefore, we can divide the first constraint by r, which yields

\max_{w′, b, r} \; r^2    (12.23)
subject to  y_n\!\left( \left\langle \frac{w′}{∥w′∥ r}, x_n \right\rangle + \frac{b}{r} \right) ⩾ 1 ,  r > 0 ,

where we rename the parameters to w′′ := w′/(∥w′∥ r) and b′′ := b/r. Since w′′ = w′/(∥w′∥ r), rearranging for r gives

∥w′′∥ = \left\| \frac{w′}{∥w′∥ r} \right\| = \frac{1}{r} \cdot \left\| \frac{w′}{∥w′∥} \right\| = \frac{1}{r} .    (12.24)

By substituting this result into (12.23), we obtain

\max_{w′′, b′′} \; \frac{1}{∥w′′∥^2}    (12.25)
subject to  y_n(⟨w′′, x_n⟩ + b′′) ⩾ 1 .

The final step is to observe that maximizing 1/∥w′′∥² yields the same solution as minimizing ½∥w′′∥², which concludes the proof of Theorem 12.1.

12.2.4 Soft Margin SVM: Geometric View

[Figure 12.6: (a) Linearly separable data, with a large margin. (b) Non-linearly separable data. Axes: x^{(1)} and x^{(2)}.]

In the case where the data is not linearly separable, we may wish to allow some examples to fall within the margin region, or even to be on the wrong side of the hyperplane, as illustrated in Figure 12.6. The model that allows for some classification errors is called the soft margin SVM. In this section, we derive the resulting optimization problem using geometric arguments. In Section 12.2.5, we will derive an equivalent optimization problem using the idea of a loss function. Using Lagrange multipliers (Section 7.2), we will derive the dual optimization problem of the SVM in Section 12.3. This dual optimization problem allows us to observe a third interpretation of the SVM: as a hyperplane that bisects the line between convex hulls corresponding to the positive and negative data examples (Section 12.3.2).

The key geometric idea is to introduce a slack variable ξ_n corresponding to each example–label pair (x_n, y_n) that allows a particular example to be within the margin or even on the wrong side of the hyperplane (refer to Figure 12.7).


[Figure 12.7: The soft margin SVM allows examples to be within the margin or on the wrong side of the hyperplane. The slack variable ξ measures the distance of a positive example x₊ to the positive margin hyperplane ⟨w, x⟩ + b = 1 when x₊ is on the wrong side.]

We subtract the value of ξ_n from the margin, constraining ξ_n to be non-negative. To encourage correct classification of the samples, we add ξ_n to the objective

\min_{w, b, ξ} \; \frac{1}{2} ∥w∥^2 + C \sum_{n=1}^N ξ_n    (12.26a)
subject to  y_n(⟨w, x_n⟩ + b) ⩾ 1 − ξ_n    (12.26b)
ξ_n ⩾ 0    (12.26c)

for n = 1, ..., N. In contrast to the optimization problem (12.18) for the hard margin SVM, this one is called the soft margin SVM. The parameter C > 0 trades off the size of the margin and the total amount of slack that we have. This parameter is called the regularization parameter since, as we will see in the following section, the margin term in the objective function (12.26a) is a regularization term. The margin term ∥w∥² is called the regularizer, and in many books on numerical optimization, the regularization parameter is multiplied with this term (Section 8.2.3). This is in contrast to our formulation in this section. Here a large value of C implies low regularization, as we give the slack variables larger weight, hence giving more priority to examples that do not lie on the correct side of the margin. (There are alternative parametrizations of this regularization, which is why (12.26a) is also often referred to as the C-SVM.)

Remark. In the formulation of the soft margin SVM (12.26a), w is regularized, but b is not regularized. We can see this by observing that the regularization term does not contain b. The unregularized term b complicates theoretical analysis (Steinwart and Christmann, 2008, chapter 1) and decreases computational efficiency (Fan et al., 2008). ♢

12.2.5 Soft Margin SVM: Loss Function View

Let us consider a different approach for deriving the SVM, following the principle of empirical risk minimization (Section 8.2). For the SVM, we choose hyperplanes as the hypothesis class, that is,

f(x) = ⟨w, x⟩ + b .    (12.27)

We will see in this section that the margin corresponds to the regularization term. The remaining question is, what is the loss function? In contrast to Chapter 9, where we considered regression problems (the output of the predictor is a real number), in this chapter we consider binary classification problems (the output of the predictor is one of two labels {+1, −1}). Therefore, the error/loss function for each single example–label pair needs to be appropriate for binary classification. For example, the squared loss that is used for regression (9.10b) is not suitable for binary classification.

Remark. The ideal loss function between binary labels is to count the number of mismatches between the prediction and the label. This means that for a predictor f applied to an example x_n, we compare the output f(x_n) with the label y_n. We define the loss to be zero if they match, and one if they do not match. This is denoted by 1(f(x_n) ≠ y_n) and is called the zero-one loss. Unfortunately, the zero-one loss results in a combinatorial optimization problem for finding the best parameters w, b. Combinatorial optimization problems (in contrast to the continuous optimization problems discussed in Chapter 7) are in general more challenging to solve. ♢

What is the loss function corresponding to the SVM? Consider the error between the output of a predictor f(x_n) and the label y_n. The loss describes the error that is made on the training data. An equivalent way to derive (12.26a) is to use the hinge loss

ℓ(t) = max{0, 1 − t}  where  t = y f(x) = y(⟨w, x⟩ + b) .    (12.28)

If f(x) is on the correct side (based on the corresponding label y) of the hyperplane, and further than distance 1, this means that t ⩾ 1 and the hinge loss returns a value of zero. If f(x) is on the correct side but too close to the hyperplane (0 < t < 1), the example x is within the margin, and the hinge loss returns a positive value. When the example is on the wrong side of the hyperplane (t < 0), the hinge loss returns an even larger value, which increases linearly. In other words, we pay a penalty once we are closer than the margin to the hyperplane, even if the prediction is correct, and the penalty increases linearly. An alternative way to express the hinge loss is by considering it as two linear pieces

ℓ(t) = \begin{cases} 0 & \text{if } t ⩾ 1 \\ 1 − t & \text{if } t < 1 \end{cases} ,    (12.29)

as illustrated in Figure 12.8. The loss corresponding to the hard margin SVM (12.18) is defined as

ℓ(t) = \begin{cases} 0 & \text{if } t ⩾ 1 \\ ∞ & \text{if } t < 1 \end{cases} .    (12.30)

This loss can be interpreted as never allowing any examples inside the margin.


[Figure 12.8: The hinge loss is a convex upper bound of the zero-one loss. Plot of max{0, 1 − t} against t, showing the zero-one loss and the hinge loss.]

For a given training set {(x_1, y_1), ..., (x_N, y_N)}, we seek to minimize the total loss, while regularizing the objective with ℓ₂-regularization (see Section 8.2.3). Using the hinge loss (12.28) gives us the unconstrained optimization problem

\min_{w, b} \; \underbrace{\frac{1}{2} ∥w∥^2}_{\text{regularizer}} + \underbrace{C \sum_{n=1}^N \max\{0, 1 − y_n(⟨w, x_n⟩ + b)\}}_{\text{error term}} .    (12.31)

The first term in (12.31) is called the regularization term or the regularizer (see Section 8.2.3), and the second term is called the loss term or the error term. Recall from Section 12.2.4 that the term ½∥w∥² arises directly from the margin. In other words, margin maximization can be interpreted as regularization.

In principle, the unconstrained optimization problem in (12.31) can be directly solved with (sub-)gradient descent methods as described in Section 7.1. To see that (12.31) and (12.26a) are equivalent, observe that the hinge loss (12.28) essentially consists of two linear parts, as expressed in (12.29). Consider the hinge loss for a single example–label pair (12.28). We can equivalently replace minimization of the hinge loss over t with a minimization of a slack variable ξ with two constraints. In equation form,

\min_t \; \max\{0, 1 − t\}    (12.32)

is equivalent to

\min_{ξ, t} \; ξ    (12.33)
subject to  ξ ⩾ 0 ,  ξ ⩾ 1 − t .

By substituting this expression into (12.31) and rearranging one of the constraints, we obtain exactly the soft margin SVM (12.26a).

Remark. Let us contrast our choice of the loss function in this section to the loss function for linear regression in Chapter 9. Recall from Section 9.2.1 that for finding maximum likelihood estimators, we usually minimize the negative log-likelihood. Furthermore, since the likelihood term for linear regression with Gaussian noise is Gaussian, the negative log-likelihood for each example is a squared error function. The squared error function is the loss function that is minimized when looking for the maximum likelihood solution. ♢

12.3 Dual Support Vector Machine

The description of the SVM in the previous sections, in terms of the variables w and b, is known as the primal SVM. Recall that we consider inputs x ∈ R^D with D features. Since w is of the same dimension as x, this means that the number of parameters (the dimension of w) of the optimization problem grows linearly with the number of features.

In the following, we consider an equivalent optimization problem (the so-called dual view), which is independent of the number of features. Instead, the number of parameters increases with the number of examples in the training set. We saw a similar idea appear in Chapter 10, where we expressed the learning problem in a way that does not scale with the number of features. This is useful for problems where we have more features than examples in the training dataset. The dual SVM also has the additional advantage that it easily allows kernels to be applied, as we shall see at the end of this chapter. The word "dual" appears often in the mathematical literature, and in this particular case it refers to convex duality. The following subsections are essentially an application of convex duality, which we discussed in Section 7.2.

12.3.1 Convex Duality via Lagrange Multipliers

Recall the primal soft margin SVM (12.26a). We call the variables w, b, and ξ corresponding to the primal SVM the primal variables. We use α_n ⩾ 0 as the Lagrange multiplier corresponding to the constraint (12.26b) that the examples are classified correctly and γ_n ⩾ 0 as the Lagrange multiplier corresponding to the non-negativity constraint of the slack variable; see (12.26c). (In Chapter 7, we used λ for Lagrange multipliers. In this section, we follow the notation commonly chosen in the SVM literature and use α and γ.) The Lagrangian is then given by

L(w, b, ξ, α, γ) = \frac{1}{2} ∥w∥^2 + C \sum_{n=1}^N ξ_n − \underbrace{\sum_{n=1}^N α_n \big( y_n(⟨w, x_n⟩ + b) − 1 + ξ_n \big)}_{\text{constraint (12.26b)}} − \underbrace{\sum_{n=1}^N γ_n ξ_n}_{\text{constraint (12.26c)}} .    (12.34)


By differentiating the Lagrangian (12.34) with respect to the three primal variables w, b, and ξ respectively, we obtain

\frac{∂L}{∂w} = w^⊤ − \sum_{n=1}^N α_n y_n x_n^⊤ ,    (12.35)
\frac{∂L}{∂b} = − \sum_{n=1}^N α_n y_n ,    (12.36)
\frac{∂L}{∂ξ_n} = C − α_n − γ_n .    (12.37)

We now find the maximum of the Lagrangian by setting each of these partial derivatives to zero. By setting (12.35) to zero, we find

w = \sum_{n=1}^N α_n y_n x_n ,    (12.38)

which is a particular instance of the representer theorem (Kimeldorf and Wahba, 1970). Equation (12.38) states that the optimal weight vector in the primal is a linear combination of the examples x_n. Recall from Section 2.6.1 that this means that the solution of the optimization problem lies in the span of the training data. Additionally, the constraint obtained by setting (12.36) to zero implies that the optimal weight vector is an affine combination of the examples. (The representer theorem is actually a collection of theorems saying that the solution of minimizing empirical risk lies in the subspace (Section 2.4.3) defined by the examples.) The representer theorem turns out to hold for very general settings of regularized empirical risk minimization (Hofmann et al., 2008; Argyriou and Dinuzzo, 2014). The theorem has more general versions (Schölkopf et al., 2001), and necessary and sufficient conditions on its existence can be found in Yu et al. (2013).

Remark. The representer theorem (12.38) also provides an explanation of the name "support vector machine". The examples x_n for which the corresponding parameters α_n = 0 do not contribute to the solution w at all. The other examples, where α_n > 0, are called support vectors since they "support" the hyperplane. ♢

By substituting the expression for w into the Lagrangian (12.34), we obtain the dual

D(ξ, α, γ) = \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N y_i y_j α_i α_j ⟨x_i, x_j⟩ − \sum_{i=1}^N y_i α_i \Big\langle \sum_{j=1}^N y_j α_j x_j, \; x_i \Big\rangle + C \sum_{i=1}^N ξ_i − b \sum_{i=1}^N y_i α_i + \sum_{i=1}^N α_i − \sum_{i=1}^N α_i ξ_i − \sum_{i=1}^N γ_i ξ_i .    (12.39)

Note that there are no longer any terms involving the primal variable w. By setting (12.36) to zero, we obtain \sum_{n=1}^N y_n α_n = 0. Therefore, the term involving b also vanishes. Recall that inner products are symmetric and bilinear (see Section 3.2). Therefore, the first two terms in (12.39) are over the same objects and can be simplified, and we obtain the Lagrangian

D(ξ, α, γ) = − \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N y_i y_j α_i α_j ⟨x_i, x_j⟩ + \sum_{i=1}^N α_i + \sum_{i=1}^N (C − α_i − γ_i) ξ_i .    (12.40)

The last term in this equation is a collection of all terms that contain slack variables ξ_i. By setting (12.37) to zero, we see that the last term in (12.40) is also zero. Furthermore, by using the same equation and recalling that the Lagrange multipliers γ_i are non-negative, we conclude that α_i ⩽ C. We now obtain the dual optimization problem of the SVM, which is expressed exclusively in terms of the Lagrange multipliers α_i. Recall from Lagrangian duality (Definition 7.1) that we maximize the dual problem. This is equivalent to minimizing the negative dual problem, such that we end up with the dual SVM

\min_{α} \; \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N y_i y_j α_i α_j ⟨x_i, x_j⟩ − \sum_{i=1}^N α_i    (12.41)
subject to  \sum_{i=1}^N y_i α_i = 0 ,  0 ⩽ α_i ⩽ C  for all  i = 1, ..., N .

The equality constraint in (12.41) is obtained from setting (12.36) to zero. The inequality constraint α_i ⩾ 0 is the condition imposed on Lagrange multipliers of inequality constraints (Section 7.2). The inequality constraint α_i ⩽ C is discussed in the previous paragraph.

The set of inequality constraints in the SVM are called "box constraints" because they limit the vector α = [α_1, ..., α_N]^⊤ ∈ R^N of Lagrange multipliers to be inside the box defined by 0 and C on each axis. These axis-aligned boxes are particularly efficient to implement in numerical solvers (Dostál, 2009, chapter 5).

Once we obtain the dual parameters α, we can recover the primal parameters w by using the representer theorem (12.38). Let us call the optimal primal parameter w*. However, there remains the question of how to obtain the parameter b*. Consider an example x_n that lies exactly on the margin's boundary, i.e., ⟨w*, x_n⟩ + b = y_n. Recall that y_n is either +1 or −1. Therefore, the only unknown is b, which can be computed by

b* = y_n − ⟨w*, x_n⟩ .    (12.42)

(It turns out that examples that lie exactly on the margin are examples whose dual parameters lie strictly inside the box constraints, 0 < α_i < C. This is derived using the Karush-Kuhn-Tucker conditions, for example in Schölkopf and Smola (2002).)

Remark. In principle, there may be no examples that lie exactly on the margin. In this case, we should compute |y_n − ⟨w*, x_n⟩| for all support vectors and take the median value of this absolute value difference to be the value of b*. A derivation of this can be found in http://fouryears.eu/2012/06/07/the-svm-bias-term-conspiracy/. ♢
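The following Python sketch puts these pieces together: it solves the dual (12.41) for a tiny hypothetical dataset with a generic solver (scipy's SLSQP is assumed here in place of a dedicated SVM solver), recovers w via the representer theorem (12.38), and computes b via (12.42); for numerical robustness, b is averaged over all support vectors strictly inside the box, rather than read off a single example.

```python
import numpy as np
from scipy.optimize import minimize

# Small 2D dataset (hypothetical), labels in {+1, -1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([+1.0, +1.0, +1.0, -1.0, -1.0, -1.0])
N, C = len(y), 10.0

K = X @ X.T                                    # matrix of inner products <x_i, x_j>
Q = (y[:, None] * y[None, :]) * K              # Q_ij = y_i y_j <x_i, x_j>

def neg_dual(alpha):                           # objective of (12.41)
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(neg_dual, x0=np.zeros(N), method="SLSQP",
               bounds=[(0.0, C)] * N,                                  # 0 <= alpha_i <= C
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])   # sum_i y_i alpha_i = 0
alpha = res.x

w = (alpha * y) @ X                              # representer theorem (12.38)
on_margin = (alpha > 1e-6) & (alpha < C - 1e-6)  # support vectors strictly inside the box
b = np.mean(y[on_margin] - X[on_margin] @ w)     # (12.42), averaged over margin support vectors
print(w, b, alpha.round(3))
```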

12.3.2 Dual SVM: Convex Hull View

[Figure 12.9: Convex hulls. (a) Convex hull of points, some of which lie within the boundary. (b) Convex hulls around positive (blue) and negative (orange) examples, with points c and d; the distance between the two convex sets is the length of the difference vector c − d.]

Another approach to obtain the dual SVM is to consider an alternative geometric argument. Consider the set of examples x_n with the same label. We would like to build a convex set that contains all the examples such that it is the smallest possible set. This is called the convex hull and is illustrated in Figure 12.9.

Let us first build some intuition about a convex combination of points. Consider two points x_1 and x_2 and corresponding non-negative weights α_1, α_2 ⩾ 0 such that α_1 + α_2 = 1. The equation α_1 x_1 + α_2 x_2 describes each point on a line between x_1 and x_2. Consider what happens when we add a third point x_3 along with a weight α_3 ⩾ 0 such that \sum_{n=1}^3 α_n = 1. The convex combination of these three points x_1, x_2, x_3 spans a two-dimensional area. The convex hull of this area is the triangle formed by the edges corresponding to each pair of points. As we add more points, and the number of points becomes greater than the number of dimensions, some of the points will be inside the convex hull, as we can see in Figure 12.9(a).

In general, building a convex hull can be done by introducing non-negative weights α_n ⩾ 0 corresponding to each example x_n. Then the convex hull can be described as the set

conv(X) = \left\{ \sum_{n=1}^N α_n x_n \right\}  with  \sum_{n=1}^N α_n = 1  and  α_n ⩾ 0    (12.43)

for all n = 1, ..., N. If the two clouds of points corresponding to the positive and negative classes are separated, then the convex hulls do not overlap. Given the training data (x_1, y_1), ..., (x_N, y_N), we form two convex hulls, corresponding to the positive and negative classes respectively. We pick a point c, which is in the convex hull of the set of positive examples and is closest to the negative class distribution. Similarly, we pick a point d in the convex hull of the set of negative examples that is closest to the positive class distribution; see Figure 12.9(b). We define a difference vector between d and c as

w := c − d .    (12.44)

Picking the points c and d as in the preceding cases, and requiring them to be closest to each other, is equivalent to minimizing the length/norm of w, so that we end up with the corresponding optimization problem

\arg\min_{w} \; ∥w∥ = \arg\min_{w} \; \frac{1}{2} ∥w∥^2 .    (12.45)

Since c must be in the positive convex hull, it can be expressed as a convex combination of the positive examples, i.e., for non-negative coefficients α_n^+,

c = \sum_{n: y_n = +1} α_n^+ x_n .    (12.46)

In (12.46), we use the notation n : y_n = +1 to indicate the set of indices n for which y_n = +1. Similarly, for the examples with negative labels, we obtain

d = \sum_{n: y_n = −1} α_n^- x_n .    (12.47)

By substituting (12.44), (12.46), and (12.47) into (12.45), we obtain the objective

\min_{α} \; \frac{1}{2} \left\| \sum_{n: y_n = +1} α_n^+ x_n − \sum_{n: y_n = −1} α_n^- x_n \right\|^2 .    (12.48)

Let α be the set of all coefficients, i.e., the concatenation of α^+ and α^-. Recall that we require that the coefficients of each convex hull sum to one,

\sum_{n: y_n = +1} α_n^+ = 1  and  \sum_{n: y_n = −1} α_n^- = 1 .    (12.49)

This implies the constraint

\sum_{n=1}^N y_n α_n = 0 .    (12.50)

This result can be seen by multiplying out the individual classes

\sum_{n=1}^N y_n α_n = \sum_{n: y_n = +1} (+1) α_n^+ + \sum_{n: y_n = −1} (−1) α_n^-    (12.51a)
= \sum_{n: y_n = +1} α_n^+ − \sum_{n: y_n = −1} α_n^- = 1 − 1 = 0 .    (12.51b)

The objective function (12.48) and the constraint (12.50), along with the assumption that α ⩾ 0, give us a constrained (convex) optimization problem. This optimization problem can be shown to be the same as that of the dual hard margin SVM (Bennett and Bredensteiner, 2000a).

Remark. To obtain the soft margin dual, we consider the reduced hull. The reduced hull is similar to the convex hull but has an upper bound on the size of the coefficients α. The maximum possible value of the elements of α restricts the size that the convex hull can take. In other words, the bound on α shrinks the convex hull to a smaller volume (Bennett and Bredensteiner, 2000b). ♢

12.4 Kernels

Consider the formulation of the dual SVM (12.41). Notice that the inner product in the objective occurs only between examples x_i and x_j. There are no inner products between the examples and the parameters. Therefore, if we consider a set of features ϕ(x_i) to represent x_i, the only change in the dual SVM will be to replace the inner product. This modularity, where the choice of the classification method (the SVM) and the choice of the feature representation ϕ(x) can be considered separately, provides flexibility for us to explore the two problems independently. In this section, we discuss the representation ϕ(x) and briefly introduce the idea of kernels, but do not go into the technical details.

Since ϕ(x) could be a non-linear function, we can use the SVM (which assumes a linear classifier) to construct classifiers that are nonlinear in the examples x_n. This provides a second avenue, in addition to the soft margin, for users to deal with a dataset that is not linearly separable. It turns out that there are many algorithms and statistical methods that have this property that we observed in the dual SVM: the only inner products are those that occur between examples. Instead of explicitly defining a non-linear feature map ϕ(·) and computing the resulting inner product between examples x_i and x_j, we define a similarity function k(x_i, x_j) between x_i and x_j. For a certain class of similarity functions, called kernels, the similarity function implicitly defines a non-linear feature map ϕ(·). Kernels are by definition functions k : X × X → R for which there exist a Hilbert space H and a feature map ϕ : X → H such that

k(x_i, x_j) = ⟨ϕ(x_i), ϕ(x_j)⟩_H .    (12.52)

(The inputs X of the kernel function can be very general and are not necessarily restricted to R^D.)

[Figure 12.10: SVM with different kernels. While the decision boundary is nonlinear, the underlying problem being solved is for a linear separating hyperplane (albeit with a nonlinear kernel). Panels: (a) SVM with linear kernel; (b) SVM with RBF kernel; (c) SVM with polynomial (degree 2) kernel; (d) SVM with polynomial (degree 3) kernel. Axes: first feature vs. second feature.]

There is a unique reproducing kernel Hilbert space associated with every kernel k (Aronszajn, 1950; Berlinet and Thomas-Agnan, 2004). In this unique association, ϕ(x) = k(·, x) is called the canonical feature map. The generalization from an inner product to a kernel function (12.52) is known as the kernel trick (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004), as it hides away the explicit non-linear feature map. The matrix K ∈ R^{N×N}, resulting from the inner products or the application of k(·, ·) to a dataset, is called the Gram matrix, and is often just referred to as the kernel matrix. Kernels must be symmetric and positive semidefinite functions so that every kernel matrix K is symmetric and positive semidefinite (Section 3.2.3):

∀z ∈ R^N : z^⊤ K z ⩾ 0 .    (12.53)

Some popular examples of kernels for multivariate real-valued data x_i ∈ R^D are the polynomial kernel, the Gaussian radial basis function kernel, and the rational quadratic kernel (Schölkopf and Smola, 2002; Rasmussen and Williams, 2006). Figure 12.10 illustrates the effect of different kernels on separating hyperplanes on an example dataset. Note that we are still solving for hyperplanes, that is, the hypothesis class of functions is still linear. The non-linear surfaces are due to the kernel function.
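The following short sketch (numpy assumed) builds the Gram matrix for the Gaussian radial basis function kernel on a hypothetical dataset and numerically checks the symmetry and positive semidefiniteness required by (12.53); the kernel parameter gamma and the data are illustrative choices.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel k(a, b) = exp(-gamma * ||a - b||^2), evaluated pairwise."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                  # 20 hypothetical examples in R^3

K = rbf_kernel(X, X)                          # Gram / kernel matrix, K_ij = k(x_i, x_j)
print(np.allclose(K, K.T))                    # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # eigenvalues >= 0, so z^T K z >= 0 as in (12.53)
```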


Remark. Unfortunately for the fledgling machine learner, there are multiple meanings of the word "kernel". In this chapter, the word "kernel" comes from the idea of the reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950; Saitoh, 1988). We have discussed the idea of the kernel in linear algebra (Section 2.7.3), where the kernel is another word for the null space. The third common use of the word "kernel" in machine learning is the smoothing kernel in kernel density estimation (Section 11.5). ♢

Since the explicit representation ϕ(x) is mathematically equivalent to the kernel representation k(x_i, x_j), a practitioner will often design the kernel function such that it can be computed more efficiently than the inner product between explicit feature maps. For example, consider the polynomial kernel (Schölkopf and Smola, 2002), where the number of terms in the explicit expansion grows very quickly (even for polynomials of low degree) when the input dimension is large. The kernel function only requires one multiplication per input dimension, which can provide significant computational savings. Another example is the Gaussian radial basis function kernel (Schölkopf and Smola, 2002; Rasmussen and Williams, 2006), where the corresponding feature space is infinite dimensional. In this case, we cannot explicitly represent the feature space but can still compute similarities between a pair of examples using the kernel. (The choice of kernel, as well as the parameters of the kernel, is often made using nested cross-validation, Section 8.6.1.)

Another useful aspect of the kernel trick is that there is no need for the original data to be already represented as multivariate real-valued data. Note that the inner product is defined on the output of the function ϕ(·), but does not restrict the input to real numbers. Hence, the function ϕ(·) and the kernel function k(·, ·) can be defined on any object, e.g., sets, sequences, strings, graphs, and distributions (Ben-Hur et al., 2008; Gärtner, 2008; Shi et al., 2009; Sriperumbudur et al., 2010; Vishwanathan et al., 2010).

12.5 Numerical Solution

We conclude our discussion of SVMs by looking at how to express the problems derived in this chapter in terms of the concepts presented in Chapter 7. We consider two different approaches for finding the optimal solution for the SVM. First we consider the loss function view of the SVM (Section 8.2.2) and express this as an unconstrained optimization problem. Then we express the constrained versions of the primal and dual SVMs as quadratic programs in standard form (Section 7.3.2).

Consider the loss function view of the SVM (12.31). This is a convex unconstrained optimization problem, but the hinge loss (12.28) is not differentiable. Therefore, we apply a subgradient approach for solving it. However, the hinge loss is differentiable almost everywhere, except for one single point at the hinge t = 1. At this point, the gradient is a set of possible values that lie between 0 and −1. Therefore, the subgradient g of the hinge loss is given by

g(t) = \begin{cases} −1 & t < 1 \\ [−1, 0] & t = 1 \\ 0 & t > 1 \end{cases} .    (12.54)

Using this subgradient, we can apply the optimization methods presented in Section 7.1.

Both the primal and the dual SVM result in a convex quadratic programming problem (constrained optimization). Note that the primal SVM in (12.26a) has optimization variables that have the size of the dimension D of the input examples. The dual SVM in (12.41) has optimization variables that have the size of the number N of examples.

To express the primal SVM in the standard form (7.45) for quadratic programming, let us assume that we use the dot product (3.5) as the inner product. (Recall from Section 3.2 that we use the phrase dot product to mean the inner product on a Euclidean vector space.) We rearrange the equation for the primal SVM (12.26a), such that the optimization variables are all on the right and the inequality of the constraint matches the standard form. This yields the optimization

\min_{w, b, ξ} \; \frac{1}{2} ∥w∥^2 + C \sum_{n=1}^N ξ_n    (12.55)
subject to  −y_n x_n^⊤ w − y_n b − ξ_n ⩽ −1 ,  −ξ_n ⩽ 0 ,

for n = 1, ..., N. By concatenating the variables w, b, ξ into a single vector, and carefully collecting the terms, we obtain the following matrix form of the soft margin SVM:

\min_{w, b, ξ} \; \frac{1}{2}
\begin{bmatrix} w \\ b \\ ξ \end{bmatrix}^⊤
\begin{bmatrix} I_D & 0_{D, N+1} \\ 0_{N+1, D} & 0_{N+1, N+1} \end{bmatrix}
\begin{bmatrix} w \\ b \\ ξ \end{bmatrix}
+
\begin{bmatrix} 0_{D+1, 1} \\ C 1_{N, 1} \end{bmatrix}^⊤
\begin{bmatrix} w \\ b \\ ξ \end{bmatrix}    (12.56)
subject to
\begin{bmatrix} −Y X & −y & −I_N \\ 0_{N, D+1} & & −I_N \end{bmatrix}
\begin{bmatrix} w \\ b \\ ξ \end{bmatrix}
⩽
\begin{bmatrix} −1_{N, 1} \\ 0_{N, 1} \end{bmatrix} .

In the preceding optimization problem, the minimization is over the parameters [w^⊤, b, ξ^⊤]^⊤ ∈ R^{D+1+N}, and we use the notation: I_m to represent the identity matrix of size m × m, 0_{m,n} to represent the matrix of zeros of size m × n, and 1_{m,n} to represent the matrix of ones of size m × n. In addition, y is the vector of labels [y_1, ..., y_N]^⊤, Y = diag(y) is an N × N matrix where the elements of the diagonal are from y, and X ∈ R^{N×D} is the matrix obtained by concatenating all the examples.

is an N by N matrix where the elements of the diagonal are from y, and X ∈ R^{N×D} is the matrix obtained by concatenating all the examples.
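As a sanity check on the bookkeeping in (12.56), the following sketch (ours, assuming NumPy; the function name primal_svm_qp_matrices and the toy data are hypothetical) assembles the quadratic term, linear term, and inequality constraints for a small random dataset and verifies their shapes. A generic quadratic-programming solver could then be run on these matrices; we do not call one here.

```python
import numpy as np

def primal_svm_qp_matrices(X, y, C):
    # Build the soft margin SVM QP (12.56) over z = [w; b; xi]:
    # minimize 0.5 z^T P z + q^T z  subject to  G z <= h.
    N, D = X.shape
    Y = np.diag(y)                      # N x N matrix with the labels on the diagonal

    P = np.zeros((D + 1 + N, D + 1 + N))
    P[:D, :D] = np.eye(D)               # only w enters the quadratic term 0.5 ||w||^2

    q = np.concatenate([np.zeros(D + 1), C * np.ones(N)])   # [0_{D+1}; C 1_N]

    # First block row: -Y X w - y b - xi <= -1 (the margin constraints).
    # Second block row: -xi <= 0 (the slack variables are non-negative).
    G = np.block([
        [-Y @ X, -y.reshape(N, 1), -np.eye(N)],
        [np.zeros((N, D + 1)),     -np.eye(N)],
    ])
    h = np.concatenate([-np.ones(N), np.zeros(N)])
    return P, q, G, h

rng = np.random.default_rng(1)
N, D, C = 6, 3, 1.0
X = rng.standard_normal((N, D))
y = rng.choice([-1.0, 1.0], size=N)
P, q, G, h = primal_svm_qp_matrices(X, y, C)
print(P.shape, q.shape, G.shape, h.shape)   # (10, 10) (10,) (12, 10) (12,)
```

With N = 6 examples in D = 3 dimensions, the optimization variable [w⊤, b, ξ⊤]⊤ has D + 1 + N = 10 entries and there are 2N = 12 inequality constraints, matching (12.56).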
We can similarly perform a collection of terms for the dual version of the SVM (12.41). To express the dual SVM in standard form, we first have to express the kernel matrix K such that each entry is K_ij = k(x_i, x_j). If we have an explicit feature representation x_i, then we define K_ij = ⟨x_i, x_j⟩. For convenience of notation we introduce a matrix with zeros everywhere except on the diagonal, where we store the labels, that is, Y = diag(y). The dual SVM can be written as

$$
\begin{aligned}
\min_{\alpha}\quad & \frac{1}{2}\alpha^\top Y K Y \alpha - 1_{N,1}^\top \alpha \\
\text{subject to}\quad &
\begin{bmatrix} y^\top \\ -y^\top \\ -I_N \\ I_N \end{bmatrix}\alpha
\leq
\begin{bmatrix} 0_{N+2,1} \\ C\,1_{N,1} \end{bmatrix}.
\end{aligned}
\tag{12.57}
$$

Remark. In Sections 7.3.1 and 7.3.2, we introduced the standard forms of the constraints to be inequality constraints. We will express the dual SVM's equality constraint as two inequality constraints, i.e.,

Ax = b is replaced by Ax ⩽ b and Ax ⩾ b .   (12.58)

Particular software implementations of convex optimization methods may provide the ability to express equality constraints. ♢
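The following sketch (again ours, assuming NumPy; dual_svm_qp_matrices is a hypothetical helper, and we use a linear kernel purely for concreteness) assembles the dual QP matrices of (12.57), writing the equality constraint y⊤α = 0 as the two inequalities from (12.58):

```python
import numpy as np

def dual_svm_qp_matrices(X, y, C):
    # Build the dual soft margin SVM QP (12.57):
    # minimize 0.5 a^T (Y K Y) a - 1^T a  subject to  G a <= h,
    # where y^T a = 0 is encoded as y^T a <= 0 and -y^T a <= 0, as in (12.58).
    N = X.shape[0]
    K = X @ X.T                          # linear kernel matrix, K_ij = <x_i, x_j>
    Y = np.diag(y)

    P = Y @ K @ Y
    q = -np.ones(N)

    G = np.vstack([y.reshape(1, N), -y.reshape(1, N), -np.eye(N), np.eye(N)])
    h = np.concatenate([np.zeros(N + 2), C * np.ones(N)])   # encodes 0 <= alpha_n <= C
    return P, q, G, h

rng = np.random.default_rng(2)
N, D, C = 6, 3, 1.0
X = rng.standard_normal((N, D))
y = rng.choice([-1.0, 1.0], size=N)
P, q, G, h = dual_svm_qp_matrices(X, y, C)
print(P.shape, G.shape, h.shape)   # (6, 6) (14, 6) (14,)
```

The equality constraint contributes the first two rows of G, and the box constraints 0 ⩽ α_n ⩽ C contribute the remaining 2N rows, so G has 2N + 2 rows in total, matching (12.57).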
Since there are many different possible views of the SVM, there are many approaches for solving the resulting optimization problem. The approach presented here, expressing the SVM problem in standard convex optimization form, is not often used in practice. The two main implementations of SVM solvers are Chang and Lin (2011) (which is open source) and Joachims (1999). Since SVMs have a clear and well-defined optimization problem, many approaches based on numerical optimization techniques (Nocedal and Wright, 2006) can be applied (Shawe-Taylor and Sun, 2011).

12.6 Further Reading

The SVM is one of many approaches for studying binary classification. Other approaches include the perceptron, logistic regression, Fisher discriminant, nearest neighbor, naive Bayes, and random forest (Bishop, 2006; Murphy, 2012). A short tutorial on SVMs and kernels on discrete sequences can be found in Ben-Hur et al. (2008). The development of SVMs is closely linked to empirical risk minimization, discussed in Section 8.2. Hence, the SVM has strong theoretical properties (Vapnik, 2000; Steinwart and Christmann, 2008). The book about kernel methods (Schölkopf and Smola, 2002) includes many details of support vector machines and how to optimize them. A broader book about kernel methods (Shawe-Taylor and Cristianini, 2004) also includes many linear algebra approaches for different machine learning problems.

An alternative derivation of the dual SVM can be obtained using the idea of the Legendre–Fenchel transform (Section 7.3.3). The derivation considers each term of the unconstrained formulation of the SVM (12.31) separately and calculates their convex conjugates (Rifkin and Lippert, 2007). Readers interested in the functional analysis view (also the regularization methods view) of SVMs are referred to the work by Wahba (1990). Theoretical exposition of kernels (Aronszajn, 1950; Schwartz, 1964; Saitoh, 1988; Manton and Amblard, 2015) requires a basic grounding in linear operators (Akhiezer and Glazman, 1993). The idea of kernels has been generalized to Banach spaces (Zhang et al., 2009) and Kreĭn spaces (Ong et al., 2004; Loosli et al., 2016).

Observe that the hinge loss has three equivalent representations, as shown in (12.28) and (12.29), as well as the constrained optimization problem in (12.33). The formulation (12.28) is often used when comparing the SVM loss function with other loss functions (Steinwart, 2007). The two-piece formulation (12.29) is convenient for computing subgradients, as each piece is linear. The third formulation (12.33), as seen in Section 12.5, enables the use of convex quadratic programming (Section 7.3.2) tools.

Since binary classification is a well-studied task in machine learning, other words are also sometimes used, such as discrimination, separation, and decision. Furthermore, there are three quantities that can be the output of a binary classifier. First is the output of the linear function itself (often called the score), which can take any real value. This output can be used for ranking the examples, and binary classification can be thought of as picking a threshold on the ranked examples (Shawe-Taylor and Cristianini, 2004). The second quantity that is often considered the output of a binary classifier is the output determined after it is passed through a non-linear function to constrain its value to a bounded range, for example in the interval [0, 1]. A common non-linear function is the sigmoid function (Bishop, 2006). When the non-linearity results in well-calibrated probabilities (Gneiting and Raftery, 2007; Reid and Williamson, 2011), this is called class probability estimation. The third output of a binary classifier is the final binary decision {+1, −1}, which is the one most commonly assumed to be the output of the classifier.

The SVM is a binary classifier that does not naturally lend itself to a probabilistic interpretation. There are several approaches for converting the raw output of the linear function (the score) into a calibrated class probability estimate (P(Y = 1 | X = x)) that involve an additional calibration step (Platt, 2000; Zadrozny and Elkan, 2001; Lin et al., 2007). From the training perspective, there are many related probabilistic approaches. We mentioned at the end of Section 12.2.5 that there is a
relationship between loss function and the likelihood (also compare Sections 8.2 and 8.3). The maximum likelihood approach corresponding to a well-calibrated transformation during training is called logistic regression, which comes from a class of methods called generalized linear models. Details of logistic regression from this point of view can be found in Agresti (2002, chapter 5) and McCullagh and Nelder (1989, chapter 4). Naturally, one could take a more Bayesian view of the classifier output by estimating a posterior distribution using Bayesian logistic regression. The Bayesian view also includes the specification of the prior, which includes design choices such as conjugacy (Section 6.6.1) with the likelihood. Additionally, one could consider latent functions as priors, which results in Gaussian process classification (Rasmussen and Williams, 2006, chapter 3).
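To illustrate the calibration step mentioned above, here is a small sketch (ours, in the spirit of Platt-style scaling rather than any particular published implementation; it assumes NumPy, and the scores and labels are synthetic) that fits a sigmoid σ(a·s + c) to raw classifier scores s by gradient descent on the negative log-likelihood, producing approximate class probabilities P(Y = 1 | X = x):

```python
import numpy as np

def fit_sigmoid_calibration(scores, labels01, lr=0.1, steps=2000):
    # Fit p(y = 1 | score s) ≈ sigmoid(a * s + c) by gradient descent on the
    # negative log-likelihood (logistic regression with the score as the single
    # input feature); a simple stand-in for Platt-style score calibration.
    a, c = 1.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * scores + c)))
        grad_a = np.mean((p - labels01) * scores)   # d(mean NLL)/da
        grad_c = np.mean(p - labels01)              # d(mean NLL)/dc
        a -= lr * grad_a
        c -= lr * grad_c
    return a, c

# Synthetic example: positives tend to receive larger scores than negatives.
rng = np.random.default_rng(3)
labels01 = rng.integers(0, 2, size=200)                       # 0/1 labels
scores = 2.0 * (2 * labels01 - 1) + rng.standard_normal(200)  # noisy raw scores

a, c = fit_sigmoid_calibration(scores, labels01)
probs = 1.0 / (1.0 + np.exp(-(a * scores + c)))
print(a, c, probs[:3])   # calibrated probabilities lie in (0, 1)
```

In practice, the calibration parameters are typically fitted on held-out data rather than on the training scores themselves, to avoid overly confident probability estimates (Platt, 2000).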