The term "linear regression" refers to models that are "linear in the parameters", i.e., models that describe a function by a linear combination of input features. Here, a "feature" is a representation ϕ(x) of the inputs x.
In the following, we will discuss in more detail how to find good parameters θ and how to evaluate whether a parameter set "works well". For the time being, we assume that the noise variance σ² is known.

Figure 9.2 (a) Example functions (straight lines) that can be described using the linear model in (9.4), i.e., functions that fall into this category; (b) training set; (c) maximum likelihood estimate.

9.2 Parameter Estimation

Consider the linear regression setting (9.4) and assume we are given a training set D := {(x1, y1), . . . , (xN, yN)} consisting of N inputs xn ∈ RD and corresponding observations/targets yn ∈ R, n = 1, . . . , N. The corresponding graphical model is given in Figure 9.3. Note that yi and yj are conditionally independent given their respective inputs xi, xj so that the likelihood factorizes according to

p(Y | X, θ) = p(y1, . . . , yN | x1, . . . , xN, θ)                              (9.5a)
            = ∏_{n=1}^N p(yn | xn, θ) = ∏_{n=1}^N N(yn | xn⊤θ, σ²),             (9.5b)

where we defined X := {x1, . . . , xN} and Y := {y1, . . . , yN} as the sets of training inputs and corresponding targets, respectively. The likelihood and the factors p(yn | xn, θ) are Gaussian due to the noise distribution; see (9.3).

Figure 9.3 Probabilistic graphical model for linear regression (nodes θ, σ, xn, yn for n = 1, . . . , N). Observed random variables are shaded; deterministic/known values are without circles.

In the following, we will discuss how to find optimal parameters θ* ∈ RD for the linear regression model (9.4). Once the parameters θ* are found, we can predict function values by using this parameter estimate in (9.4) so that at an arbitrary test input x∗ the distribution of the corresponding target y∗ is

p(y∗ | x∗, θ*) = N(y∗ | x∗⊤θ*, σ²).                                            (9.6)

9.2.1 Maximum Likelihood Estimation

A widely used approach to finding the desired parameters θML is maximum likelihood estimation, where we find parameters θML that maximize the likelihood (9.5b). Intuitively, maximizing the likelihood means maximizing the predictive distribution of the training data given the model parameters. We obtain the maximum likelihood parameters as

θML ∈ arg max_θ p(Y | X, θ).                                                   (9.7)

Remark. The likelihood p(y | x, θ) is not a probability distribution in θ: It is simply a function of the parameters θ but does not integrate to 1 (i.e., it is unnormalized), and may not even be integrable with respect to θ. However, the likelihood in (9.7) is a normalized probability distribution in y. ♢

To find the desired parameters θML that maximize the likelihood, we typically perform gradient ascent (or gradient descent on the negative likelihood). In the case of linear regression we consider here, however, a closed-form solution exists, which makes iterative gradient descent unnecessary. In practice, instead of maximizing the likelihood directly, we apply the log-transformation to the likelihood function and minimize the negative log-likelihood. Since the logarithm is a (strictly) monotonically increasing function, the optimum of a function f is identical to the optimum of log f.

Remark (Log-Transformation). Since the likelihood (9.5b) is a product of N Gaussian distributions, the log-transformation is useful since (a) it does not suffer from numerical underflow, and (b) the differentiation rules will turn out simpler. More specifically, numerical underflow will be a problem when we multiply N probabilities, where N is the number of data points, since we cannot represent very small numbers, such as 10⁻²⁵⁶. Furthermore, the log-transform will turn the product into a sum of log-probabilities such that the corresponding gradient is a sum of individual gradients, instead of a repeated application of the product rule (5.46) to compute the gradient of a product of N terms. ♢

To find the optimal parameters θML of our linear regression problem, we minimize the negative log-likelihood

− log p(Y | X, θ) = − log ∏_{n=1}^N p(yn | xn, θ) = − ∑_{n=1}^N log p(yn | xn, θ),   (9.8)

where we exploited that the likelihood (9.5b) factorizes over the number of data points due to our independence assumption on the training set. In the linear regression model (9.4), the likelihood is Gaussian (due to the Gaussian additive noise term), such that we arrive at a negative log-likelihood that is quadratic in θ; see (9.10b).
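To make the factorized likelihood (9.5b) and the log-transformation in (9.8) concrete, the following NumPy sketch evaluates the summed negative log-likelihood of a linear model under Gaussian noise. The function name, synthetic dataset, and random seed are illustrative choices, not prescribed by the text.

```python
import numpy as np

def neg_log_likelihood(theta, X, y, sigma2):
    """Negative log-likelihood (9.8) of y_n = x_n^T theta + eps, eps ~ N(0, sigma2),
    summed over an i.i.d. training set. X: (N, D), y: (N,), theta: (D,)."""
    residuals = y - X @ theta                      # y_n - x_n^T theta
    # Per-point Gaussian log-densities; summing logs avoids the numerical
    # underflow that multiplying N small probabilities would cause.
    log_probs = -0.5 * np.log(2 * np.pi * sigma2) - residuals**2 / (2 * sigma2)
    return -np.sum(log_probs)

# Tiny synthetic example (illustrative values only)
rng = np.random.default_rng(0)
N, D, sigma2 = 20, 2, 0.25
X = rng.normal(size=(N, D))
theta_true = np.array([1.0, -0.5])
y = X @ theta_true + rng.normal(scale=np.sqrt(sigma2), size=N)
print(neg_log_likelihood(theta_true, X, y, sigma2))
```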
The polynomial regression function can then be written as

f(x) = ∑_{k=0}^{K−1} θk x^k = ϕ⊤(x)θ,                                          (9.15)

where ϕ is defined in (9.14) and θ = [θ0, . . . , θK−1]⊤ ∈ RK contains the (linear) parameters θk.

Let us now have a look at maximum likelihood estimation of the parameters θ in the linear regression model (9.13). We consider training inputs xn ∈ RD and targets yn ∈ R, n = 1, . . . , N, and define the feature matrix (design matrix) as

Φ := [ϕ⊤(x1); ϕ⊤(x2); . . . ; ϕ⊤(xN)] ∈ RN×K,                                  (9.16)

whose nth row is ϕ⊤(xn) = [ϕ0(xn), . . . , ϕK−1(xn)], i.e., Φij = ϕj(xi) with ϕj : RD → R.

Example 9.4 (Feature Matrix for Second-order Polynomials)
For a second-order polynomial and N training points xn ∈ R, n = 1, . . . , N, the feature matrix is

Φ = [1 x1 x1²; 1 x2 x2²; . . . ; 1 xN xN²],                                    (9.17)

i.e., the nth row is [1, xn, xn²].

With the feature matrix Φ defined in (9.16), the negative log-likelihood for the linear regression model (9.13) can be written as

− log p(Y | X, θ) = (1/(2σ²)) (y − Φθ)⊤(y − Φθ) + const.                       (9.18)

Comparing (9.18) with the negative log-likelihood in (9.10b) for the "feature-free" model, we immediately see we just need to replace X with Φ. Since both X and Φ are independent of the parameters θ that we wish to optimize, we arrive immediately at the maximum likelihood estimate

θML = (Φ⊤Φ)⁻¹Φ⊤y                                                               (9.19)

for the linear regression problem with nonlinear features defined in (9.13).

Remark. When we were working without features, we required X⊤X to be invertible, which is the case when rk(X) = D, i.e., the columns of X are linearly independent. In (9.19), we therefore require Φ⊤Φ ∈ RK×K to be invertible. This is the case if and only if rk(Φ) = K. ♢
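The closed-form estimate (9.19) is straightforward to compute. Below is a minimal NumPy sketch that builds the polynomial feature matrix (9.16) for scalar inputs and solves for θML with a least-squares routine, which is numerically preferable to forming (Φ⊤Φ)⁻¹ explicitly. The data-generating setup mirrors the polynomial fit example that follows (Example 9.5); the seed and helper names are illustrative.

```python
import numpy as np

def polynomial_features(x, degree):
    """Feature matrix Phi in (9.16) for scalar inputs and polynomial features
    phi_k(x) = x^k, k = 0, ..., degree, i.e., Phi[i, k] = x_i**k."""
    return np.vander(x, N=degree + 1, increasing=True)   # shape (N, K), K = degree + 1

def max_likelihood_fit(x, y, degree):
    """Maximum likelihood estimate (9.19), computed via a least-squares solver."""
    Phi = polynomial_features(x, degree)
    theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta_ml

# Data generated as in the polynomial fit example below (N = 10, noise std 0.2)
rng = np.random.default_rng(42)
x = rng.uniform(-5, 5, size=10)
y = -np.sin(x / 5) + np.cos(x) + rng.normal(scale=0.2, size=x.shape)

theta_ml = max_likelihood_fit(x, y, degree=4)
x_test = np.linspace(-5, 5, 200)
y_pred = polynomial_features(x_test, 4) @ theta_ml        # phi(x_*)^T theta_ML
```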
Example 9.5 (Maximum Likelihood Polynomial Fit)

Figure 9.4 Polynomial regression: (a) dataset consisting of (xn, yn) pairs, n = 1, . . . , 10; (b) maximum likelihood polynomial of degree 4.

Consider the dataset in Figure 9.4(a). The dataset consists of N = 10 pairs (xn, yn), where xn ∼ U[−5, 5] and yn = −sin(xn/5) + cos(xn) + ε, where ε ∼ N(0, 0.2²).
We fit a polynomial of degree 4 using maximum likelihood estimation, i.e., the parameters θML are given in (9.19). The maximum likelihood estimate yields function values ϕ⊤(x∗)θML at any test location x∗. The result is shown in Figure 9.4(b).

Estimating the Noise Variance

Thus far, we assumed that the noise variance σ² is known. However, we can also use the principle of maximum likelihood estimation to obtain the maximum likelihood estimator σ²ML for the noise variance. To do this, we follow the standard procedure: We write down the log-likelihood, compute its derivative with respect to σ² > 0, set it to 0, and solve. The log-likelihood is given by

log p(Y | X, θ, σ²) = ∑_{n=1}^N log N(yn | ϕ⊤(xn)θ, σ²)                                       (9.20a)
                    = ∑_{n=1}^N ( −(1/2) log(2π) − (1/2) log σ² − (1/(2σ²)) (yn − ϕ⊤(xn)θ)² )  (9.20b)
                    = −(N/2) log σ² − (1/(2σ²)) ∑_{n=1}^N (yn − ϕ⊤(xn)θ)² + const,             (9.20c)

where we abbreviate the sum in (9.20c) as s := ∑_{n=1}^N (yn − ϕ⊤(xn)θ)².
The partial derivative of the log-likelihood with respect to σ² is then

∂ log p(Y | X, θ, σ²)/∂σ² = −N/(2σ²) + s/(2σ⁴) = 0                             (9.21a)
⟺ N/(2σ²) = s/(2σ⁴),                                                           (9.21b)

so that we identify

σ²ML = s/N = (1/N) ∑_{n=1}^N (yn − ϕ⊤(xn)θ)².                                  (9.22)

Therefore, the maximum likelihood estimate of the noise variance is the mean of the squared distances between the function values ϕ⊤(xn)θ and the corresponding noisy observations yn at input locations xn.
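A small sketch of the estimator (9.22), assuming a feature matrix Φ and a fitted parameter vector are already available; the linear feature map and data below are purely illustrative.

```python
import numpy as np

def noise_variance_mle(Phi, y, theta):
    """Maximum likelihood estimate (9.22) of the noise variance: the mean squared
    residual between observations y_n and model values phi(x_n)^T theta."""
    residuals = y - Phi @ theta
    return np.mean(residuals**2)

# Example usage with a linear feature map phi(x) = [1, x] (illustrative data)
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=50)
Phi = np.column_stack([np.ones_like(x), x])
y = 2.0 - 0.5 * x + rng.normal(scale=0.3, size=x.shape)
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(noise_variance_mle(Phi, y, theta_ml))   # close to 0.3**2 = 0.09
```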
9.2.2 Overfitting in Linear Regression

We just discussed how to use maximum likelihood estimation to fit linear models (e.g., polynomials) to data. We can evaluate the quality of the model by computing the error/loss incurred. One way of doing this is to compute the negative log-likelihood (9.10b), which we minimized to determine the maximum likelihood estimator. Alternatively, given that the noise parameter σ² is not a free model parameter, we can ignore the scaling by 1/σ², so that we end up with a squared-error-loss function ‖y − Φθ‖². Instead of using this squared loss, we often use the root mean square error (RMSE)

√((1/N) ‖y − Φθ‖²) = √((1/N) ∑_{n=1}^N (yn − ϕ⊤(xn)θ)²),                       (9.23)

which (a) allows us to compare errors of datasets with different sizes (the RMSE is normalized) and (b) has the same scale and the same units as the observed function values yn. For example, if we fit a model that maps post-codes (x is given in latitude, longitude) to house prices (y-values are EUR), then the RMSE is also measured in EUR, whereas the squared error is given in EUR². If we choose to include the factor σ² from the original negative log-likelihood (9.10b), then we end up with a unitless objective (note that the noise variance σ² > 0), i.e., in the preceding example, our objective would no longer be in EUR or EUR².
For model selection (see Section 8.6), we can use the RMSE (or the negative log-likelihood) to determine the best degree of the polynomial by finding the polynomial degree M that minimizes the objective. Given that the polynomial degree is a natural number, we can perform a brute-force search and enumerate all (reasonable) values of M. For a training set of size N, it is sufficient to test 0 ⩽ M ⩽ N − 1. For M < N, the maximum likelihood estimator is unique. For M ⩾ N, we have more parameters than data points, and would need to solve an underdetermined system of linear equations (Φ⊤Φ in (9.19) would also no longer be invertible) so that there are infinitely many possible maximum likelihood estimators.

Figure 9.5 Maximum likelihood fits for different polynomial degrees M: (a) M = 0, (b) M = 1, (c) M = 3, (d) M = 4, (e) M = 6, (f) M = 9; each panel shows the training data and the MLE fit.

Figure 9.5 shows a number of polynomial fits determined by maximum likelihood for the dataset from Figure 9.4(a) with N = 10 observations. We notice that polynomials of low degree (e.g., constants (M = 0) or linear (M = 1)) fit the data poorly and, hence, are poor representations of the true underlying function. For degrees M = 3, . . . , 6, the fits look plausible and smoothly interpolate the data. When we go to higher-degree polynomials, we notice that they fit the data better and better. In the extreme case of M = N − 1 = 9, the function will pass through every single data point. (The case M = N − 1 is extreme in the sense that otherwise the null space of the corresponding system of linear equations would be non-trivial, and we would have infinitely many optimal solutions to the linear regression problem.) However, these high-degree polynomials oscillate wildly and are a poor representation of the underlying function that generated the data, such that we suffer from overfitting.
Remember that the goal is to achieve good generalization by making accurate predictions for new (unseen) data. We obtain some quantitative insight into the dependence of the generalization performance on the polynomial of degree M by considering a separate test set comprising 200 data points generated using exactly the same procedure used to generate the training set. As test inputs, we chose a linear grid of 200 points in the interval [−5, 5]. For each choice of M, we evaluate the RMSE (9.23) for both the training data and the test data.
Looking now at the test error, which is a qualitative measure of the generalization properties of the corresponding polynomial, we notice that initially the test error decreases; see Figure 9.6 (orange). For fourth-order polynomials, the test error is relatively low and stays relatively constant up to degree 5. However, from degree 6 onward the test error increases significantly, and high-order polynomials have very bad generalization properties.
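The brute-force model selection described above can be sketched in a few lines of NumPy: fit polynomials of degree 0 ⩽ M ⩽ N − 1 by maximum likelihood and compare training and test RMSE (9.23). The data-generating function matches Example 9.5; the seeds and exact numbers are illustrative.

```python
import numpy as np

def rmse(y, y_pred):
    """Root mean square error (9.23)."""
    return np.sqrt(np.mean((y - y_pred)**2))

def fit_poly(x, y, degree):
    Phi = np.vander(x, degree + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta

def f_true(x):
    return -np.sin(x / 5) + np.cos(x)

rng = np.random.default_rng(42)
x_train = rng.uniform(-5, 5, size=10)
y_train = f_true(x_train) + rng.normal(scale=0.2, size=x_train.shape)
x_test = np.linspace(-5, 5, 200)                      # linear grid of test inputs
y_test = f_true(x_test) + rng.normal(scale=0.2, size=x_test.shape)

for M in range(10):                                   # brute-force search over 0 <= M <= N - 1
    theta = fit_poly(x_train, y_train, M)
    train_rmse = rmse(y_train, np.vander(x_train, M + 1, increasing=True) @ theta)
    test_rmse = rmse(y_test, np.vander(x_test, M + 1, increasing=True) @ theta)
    print(f"M = {M}: train RMSE {train_rmse:.3f}, test RMSE {test_rmse:.3f}")
```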
In this particular example, this is also evident from the corresponding maximum likelihood fits in Figure 9.5. Note that the training error (blue curve in Figure 9.6) never increases when the degree of the polynomial increases. In our example, the best generalization (the point of the smallest test error) is obtained for a polynomial of degree M = 4.

Figure 9.6 Training and test error (RMSE) as a function of the degree of the polynomial.

9.2.3 Maximum A Posteriori Estimation

We just saw that maximum likelihood estimation is prone to overfitting. We often observe that the magnitude of the parameter values becomes relatively large if we run into overfitting (Bishop, 2006).
To mitigate the effect of huge parameter values, we can place a prior distribution p(θ) on the parameters. The prior distribution explicitly encodes what parameter values are plausible (before having seen any data). For example, a Gaussian prior p(θ) = N(0, 1) on a single parameter θ encodes that parameter values are expected to lie in the interval [−2, 2] (two standard deviations around the mean value). Once a dataset X, Y is available, instead of maximizing the likelihood we seek parameters that maximize the posterior distribution p(θ | X, Y). This procedure is called maximum a posteriori (MAP) estimation.
The posterior over the parameters θ, given the training data X, Y, is obtained by applying Bayes' theorem (Section 6.3) as

p(θ | X, Y) = p(Y | X, θ) p(θ) / p(Y | X).                                     (9.24)

Since the posterior explicitly depends on the parameter prior p(θ), the prior will have an effect on the parameter vector we find as the maximizer of the posterior. We will see this more explicitly in the following. The parameter vector θMAP that maximizes the posterior (9.24) is the MAP estimate.
To find the MAP estimate, we follow steps that are similar in flavor to maximum likelihood estimation. We start with the log-transform and compute the log-posterior as

log p(θ | X, Y) = log p(Y | X, θ) + log p(θ) + const,                          (9.25)

where the constant comprises the terms that are independent of θ. We see that the log-posterior in (9.25) is the sum of the log-likelihood log p(Y | X, θ) and the log-prior log p(θ) so that the MAP estimate will be a "compromise" between the prior (our suggestion for plausible parameter values before observing data) and the data-dependent likelihood.
To find the MAP estimate θMAP, we minimize the negative log-posterior with respect to θ, i.e., we solve

θMAP ∈ arg min_θ {− log p(Y | X, θ) − log p(θ)}.                               (9.26)

The gradient of the negative log-posterior with respect to θ is

−d log p(θ | X, Y)/dθ = −d log p(Y | X, θ)/dθ − d log p(θ)/dθ,                 (9.27)

where we identify the first term on the right-hand side as the gradient of the negative log-likelihood from (9.11c).
With a (conjugate) Gaussian prior p(θ) = N(0, b²I) on the parameters θ, we obtain the negative log-posterior for the linear regression setting (9.13) as

− log p(θ | X, Y) = (1/(2σ²)) (y − Φθ)⊤(y − Φθ) + (1/(2b²)) θ⊤θ + const.       (9.28)

Here, the first term corresponds to the contribution from the log-likelihood, and the second term originates from the log-prior. The gradient of the negative log-posterior with respect to the parameters θ is then

−d log p(θ | X, Y)/dθ = (1/σ²)(θ⊤Φ⊤Φ − y⊤Φ) + (1/b²)θ⊤.                        (9.29)

We will find the MAP estimate θMAP by setting this gradient to 0⊤ and solving for θMAP. We obtain

(1/σ²)(θ⊤Φ⊤Φ − y⊤Φ) + (1/b²)θ⊤ = 0⊤                                            (9.30a)
⟺ θ⊤((1/σ²)Φ⊤Φ + (1/b²)I) − (1/σ²)y⊤Φ = 0⊤                                     (9.30b)
⟺ θ⊤(Φ⊤Φ + (σ²/b²)I) = y⊤Φ                                                     (9.30c)
⟺ θ⊤ = y⊤Φ(Φ⊤Φ + (σ²/b²)I)⁻¹,                                                  (9.30d)

so that the MAP estimate is (by transposing both sides of the last equality)

θMAP = (Φ⊤Φ + (σ²/b²)I)⁻¹ Φ⊤y.                                                 (9.31)

Comparing the MAP estimate in (9.31) with the maximum likelihood estimate in (9.19), we see that the only difference between both solutions is the additional term (σ²/b²)I in the inverse matrix. This term ensures that Φ⊤Φ + (σ²/b²)I is symmetric and strictly positive definite (Φ⊤Φ is only positive semidefinite), i.e., its inverse exists and the MAP estimate is the unique solution of a system of linear equations. Moreover, it reflects the impact of the regularizer.
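A minimal sketch of the MAP estimate (9.31), solved as a linear system rather than by explicit inversion; the polynomial-feature setup and the prior variance b² = 1 are illustrative assumptions.

```python
import numpy as np

def map_estimate(Phi, y, sigma2, b2):
    """MAP estimate (9.31) with Gaussian prior p(theta) = N(0, b2 * I):
    theta_MAP = (Phi^T Phi + (sigma2 / b2) I)^{-1} Phi^T y."""
    K = Phi.shape[1]
    A = Phi.T @ Phi + (sigma2 / b2) * np.eye(K)
    return np.linalg.solve(A, Phi.T @ y)

# Illustrative comparison with the maximum likelihood estimate (9.19)
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=10)
y = -np.sin(x / 5) + np.cos(x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vander(x, N=9, increasing=True)           # degree-8 polynomial features
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)
theta_map = map_estimate(Phi, y, sigma2=0.2**2, b2=1.0)
print(np.linalg.norm(theta_ml), np.linalg.norm(theta_map))  # MAP estimate is typically smaller in norm
```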
Example 9.6 (MAP Estimation for Polynomial Regression)
In the polynomial regression example from Section 9.2.1, we place a Gaussian prior p(θ) = N(0, I) on the parameters θ and determine the MAP estimates according to (9.31). In Figure 9.7, we show both the maximum likelihood and the MAP estimates for polynomials of degree 6 (left) and degree 8 (right). The prior (regularizer) does not play a significant role for the low-degree polynomial, but keeps the function relatively smooth for higher-degree polynomials. Although the MAP estimate can push the boundaries of overfitting, it is not a general solution to this problem, so we need a more principled approach to tackle overfitting.

Figure 9.7 Polynomial regression: maximum likelihood and MAP estimates (training data, MLE, and MAP fits for degrees 6 and 8).

Remark. For p ⩽ 1, such p-norm regularizers are also useful for variable selection. For p = 1, the regularizer is called LASSO (least absolute shrinkage and selection operator) and was proposed by Tibshirani (1996). ♢

The regularizer λ‖θ‖₂² in (9.32) can be interpreted as a negative log-Gaussian prior, which we use in MAP estimation; see (9.26). More specifically, with a Gaussian prior p(θ) = N(0, b²I), we obtain the negative log-Gaussian prior

− log p(θ) = (1/(2b²)) ‖θ‖₂² + const,                                          (9.33)

so that for λ = 1/(2b²) the regularization term and the negative log-Gaussian prior are identical.
Given that the regularized least-squares loss function in (9.32) consists of terms that are closely related to the negative log-likelihood plus a negative log-prior, it is not surprising that, when we minimize this loss, we obtain a solution that closely resembles the MAP estimate in (9.31). More specifically, minimizing the regularized least-squares loss function yields

θRLS = (Φ⊤Φ + λI)⁻¹ Φ⊤y,                                                       (9.34)

which is identical to the MAP estimate in (9.31) for λ = σ²/b², where σ² is the noise variance.
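The claimed equivalence between (9.34) and (9.31) for λ = σ²/b² can be checked numerically in a few lines; the random feature matrix and targets serve only as an illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.normal(size=(20, 5))            # arbitrary feature matrix
y = rng.normal(size=20)
sigma2, b2 = 0.25, 2.0
lam = sigma2 / b2                          # lambda = sigma^2 / b^2

theta_rls = np.linalg.solve(Phi.T @ Phi + lam * np.eye(5), Phi.T @ y)             # (9.34)
theta_map = np.linalg.solve(Phi.T @ Phi + (sigma2 / b2) * np.eye(5), Phi.T @ y)   # (9.31)
print(np.allclose(theta_rls, theta_map))   # True: the two estimates coincide
```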
9.3 Bayesian Linear Regression

9.3.1 Model

In Bayesian linear regression, we consider the model

prior       p(θ) = N(m0, S0),
likelihood  p(y | x, θ) = N(y | ϕ⊤(x)θ, σ²),                                   (9.35)

where we now explicitly place a Gaussian prior p(θ) = N(m0, S0) on θ, which turns the parameter vector into a random variable. This allows us to write down the corresponding graphical model in Figure 9.8, where we made the parameters of the Gaussian prior on θ explicit. The full probabilistic model, i.e., the joint distribution of observed and unobserved random variables, y and θ, respectively, is

p(y, θ | x) = p(y | x, θ) p(θ).                                                (9.36)

Figure 9.8 Graphical model for Bayesian linear regression (nodes m0, S0, θ, σ, x, y).

9.3.2 Prior Predictions

In practice, we are usually not so much interested in the parameter values θ themselves. Instead, our focus often lies in the predictions we make with those parameter values. In a Bayesian setting, we take the parameter distribution and average over all plausible parameter settings when we make predictions. More specifically, to make predictions at an input x∗, we integrate out θ and obtain

p(y∗ | x∗) = ∫ p(y∗ | x∗, θ) p(θ) dθ = E_θ[p(y∗ | x∗, θ)],                     (9.37)

which we can interpret as the average prediction of y∗ | x∗, θ for all plausible parameters θ according to the prior distribution p(θ). Note that predictions using the prior distribution only require us to specify the input x∗, but no training data.
In our model (9.35), we chose a conjugate (Gaussian) prior on θ so that the predictive distribution is Gaussian as well (and can be computed in closed form): With the prior distribution p(θ) = N(m0, S0), we obtain the predictive distribution as

p(y∗ | x∗) = N(ϕ⊤(x∗)m0, ϕ⊤(x∗)S0ϕ(x∗) + σ²),                                  (9.38)

where we exploited that (i) the prediction is Gaussian due to conjugacy (see Section 6.6) and the marginalization property of Gaussians (see Section 6.5), (ii) the Gaussian noise is independent so that

V[y∗] = V_θ[ϕ⊤(x∗)θ] + V_ε[ε],                                                 (9.39)

and (iii) y∗ is a linear transformation of θ so that we can apply the rules for computing the mean and covariance of the prediction analytically by using (6.50) and (6.51), respectively. In (9.38), the term ϕ⊤(x∗)S0ϕ(x∗) in the predictive variance explicitly accounts for the uncertainty associated with the parameters θ, whereas σ² is the uncertainty contribution due to the measurement noise.
If we are interested in predicting noise-free function values f(x∗) = ϕ⊤(x∗)θ instead of the noise-corrupted targets y∗, we obtain

p(f(x∗)) = N(ϕ⊤(x∗)m0, ϕ⊤(x∗)S0ϕ(x∗)),                                         (9.40)

which only differs from (9.38) in the omission of the noise variance σ² in the predictive variance.

Remark (Distribution over Functions). Since we can represent the distribution p(θ) using a set of samples θi and every sample θi gives rise to a function fi(·) = θi⊤ϕ(·), it follows that the parameter distribution p(θ) induces a distribution p(f(·)) over functions. Here we use the notation (·) to explicitly denote a functional relationship. ♢

Example 9.7 (Prior over Functions)

Figure 9.9 Prior over functions. (a) Distribution over functions represented by the mean function (black line) and the marginal uncertainties (shaded), representing the 67% and 95% confidence bounds, respectively; (b) samples from the prior over functions, which are induced by samples from the parameter prior.

Let us consider a Bayesian linear regression problem with polynomials of degree 5. We choose a parameter prior p(θ) = N(0, ¼I). Figure 9.9 visualizes the prior distribution over functions induced by this parameter prior (dark gray shading: 67% confidence bound; light gray shading: 95% confidence bound), including some function samples from this prior.
A function sample is obtained by first sampling a parameter vector θi ∼ p(θ) and then computing fi(·) = θi⊤ϕ(·). We used 200 input locations x∗ ∈ [−5, 5] to which we apply the feature function ϕ(·). The uncertainty (represented by the shaded area) in Figure 9.9 is solely due to the parameter uncertainty because we considered the noise-free predictive distribution (9.40).
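The construction in Example 9.7 translates directly into code: sample parameter vectors from the prior and map them through the polynomial features. The sketch below assumes degree-5 polynomial features and the prior N(0, ¼I); the helper name and seed are illustrative.

```python
import numpy as np

def poly_features(x, degree):
    return np.vander(x, N=degree + 1, increasing=True)   # phi(x) = [1, x, ..., x^degree]

degree = 5
K = degree + 1
m0, S0 = np.zeros(K), 0.25 * np.eye(K)        # parameter prior N(0, 1/4 I) as in Example 9.7

x_star = np.linspace(-5, 5, 200)              # 200 input locations
Phi_star = poly_features(x_star, degree)

# Prior over (noise-free) function values, cf. (9.40):
mean_f = Phi_star @ m0
var_f = np.sum(Phi_star @ S0 * Phi_star, axis=1)   # diag(Phi S0 Phi^T): marginal variances

# Function samples: theta_i ~ p(theta), f_i(.) = Phi theta_i
rng = np.random.default_rng(0)
thetas = rng.multivariate_normal(m0, S0, size=10)
f_samples = Phi_star @ thetas.T               # (200, 10): one sampled function per column
print(f_samples.shape, var_f[:3])
```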
So far, we looked at computing predictions using the parameter prior p(θ). However, when we have a parameter posterior (given some training data X, Y), the same principles for prediction and inference hold as in (9.37) – we just need to replace the prior p(θ) with the posterior p(θ | X, Y).
In the following, we will derive the posterior distribution in detail before using it to make predictions.

9.3.3 Posterior Distribution

Given a training set of inputs xn ∈ RD and corresponding observations yn ∈ R, n = 1, . . . , N, we compute the posterior over the parameters by applying Bayes' theorem, as in (9.24). Up to a constant that contains the terms independent of θ (and which we will ignore in the following), factorizing the resulting expression (9.45b) yields

−(1/2)( σ⁻²y⊤y − 2σ⁻²y⊤Φθ + θ⊤σ⁻²Φ⊤Φθ + θ⊤S0⁻¹θ − 2m0⊤S0⁻¹θ + m0⊤S0⁻¹m0 ).     (9.46a)
Remark (General Approach to Completing the Squares). If we are given an equation

x⊤Ax − 2a⊤x + const₁,                                                          (9.51)

where A is symmetric and positive definite, which we wish to bring into the form

(x − µ)⊤Σ(x − µ) + const₂,                                                     (9.52)

we can do this by setting

Σ := A,                                                                        (9.53)
µ := Σ⁻¹a,                                                                     (9.54)

and const₂ = const₁ − µ⊤Σµ. ♢

We can see that the terms inside the exponential in (9.47b) are of the form (9.51) with

A := σ⁻²Φ⊤Φ + S0⁻¹,                                                            (9.55)
a := σ⁻²Φ⊤y + S0⁻¹m0.                                                          (9.56)

Since A, a can be difficult to identify in equations like (9.46a), it is often helpful to bring these equations into the form (9.51) that decouples quadratic terms, linear terms, and constants, which simplifies finding the desired solution.

9.3.4 Posterior Predictions

In (9.37), we computed the predictive distribution of y∗ at a test input x∗ using the parameter prior p(θ). In principle, predicting with the parameter posterior p(θ | X, Y) is not fundamentally different given that in our conjugate model the prior and posterior are both Gaussian (with different parameters). Therefore, by following the same reasoning as in Section 9.3.2, we obtain the (posterior) predictive distribution

p(y∗ | X, Y, x∗) = ∫ p(y∗ | x∗, θ) p(θ | X, Y) dθ                              (9.57a)
                 = ∫ N(y∗ | ϕ⊤(x∗)θ, σ²) N(θ | mN, SN) dθ.                     (9.57b)

Remark (Marginal Likelihood and Posterior Predictive Distribution). By replacing the integral in (9.57a), the predictive distribution can be equivalently written as the expectation E_{θ | X,Y}[p(y∗ | x∗, θ)], where the expectation is taken with respect to the parameter posterior p(θ | X, Y).
Writing the posterior predictive distribution in this way highlights a close resemblance to the marginal likelihood (9.42). The key differences between the marginal likelihood and the posterior predictive distribution are that (i) the marginal likelihood can be thought of as predicting the training targets y and not the test targets y∗, and (ii) the marginal likelihood averages with respect to the parameter prior and not the parameter posterior. ♢

Remark (Mean and Variance of Noise-Free Function Values). In many cases, we are not interested in the predictive distribution p(y∗ | X, Y, x∗) of a (noisy) observation y∗. Instead, we would like to obtain the distribution of the (noise-free) function values f(x∗) = ϕ⊤(x∗)θ. We determine the corresponding moments by exploiting the properties of means and variances, which yields

E[f(x∗) | X, Y] = E_θ[ϕ⊤(x∗)θ | X, Y] = ϕ⊤(x∗) E_θ[θ | X, Y] = ϕ⊤(x∗)mN = mN⊤ϕ(x∗),     (9.58)
V_θ[f(x∗) | X, Y] = V_θ[ϕ⊤(x∗)θ | X, Y] = ϕ⊤(x∗) V_θ[θ | X, Y] ϕ(x∗) = ϕ⊤(x∗)SNϕ(x∗).   (9.59)

We see that the predictive mean is the same as the predictive mean for noisy observations as the noise has mean 0, and the predictive variance only differs by σ², which is the variance of the measurement noise: When we predict noisy function values, we need to include σ² as a source of uncertainty, but this term is not needed for noise-free predictions. Here, the only remaining uncertainty stems from the parameter posterior. ♢

Remark (Distribution over Functions). The fact that we integrate out the parameters θ induces a distribution over functions: If we sample θi ∼ p(θ | X, Y) from the parameter posterior, we obtain a single function realization θi⊤ϕ(·). The mean function, i.e., the set of all expected function values E_θ[f(·) | θ, X, Y], of this distribution over functions is mN⊤ϕ(·). The (marginal) variance, i.e., the variance of the function f(·), is given by ϕ⊤(·)SNϕ(·). ♢
Figure 9.10 shows the posterior over functions that we obtain via Bayesian linear regression. The training dataset is shown in panel (a); panel (b) shows the posterior distribution over functions, including the functions we would obtain via maximum likelihood and MAP estimation. The function we obtain using the MAP estimate also corresponds to the posterior mean function in the Bayesian linear regression setting. Panel (c) shows some plausible realizations (samples) of functions under that posterior over functions.

Figure 9.10 Bayesian linear regression and posterior over functions. (a) Training data. (b) Posterior over functions represented by the marginal uncertainties (shaded) showing the 67% and 95% predictive confidence bounds, the maximum likelihood estimate (MLE), and the MAP estimate (MAP), the latter of which is identical to the posterior mean function. (c) Samples from the posterior over functions, which are induced by the samples from the parameter posterior.

Figure 9.11 shows some posterior distributions over functions induced by the parameter posterior. For different polynomial degrees M, the left panels show the maximum likelihood function θML⊤ϕ(·), the MAP function θMAP⊤ϕ(·) (which is identical to the posterior mean function), and the 67% and 95% predictive confidence bounds obtained by Bayesian linear regression.

Figure 9.11 Bayesian linear regression. Left panels: Shaded areas indicate the 67% (dark gray) and 95% (light gray) predictive confidence bounds; the mean of the Bayesian linear regression model coincides with the MAP estimate, and the predictive uncertainty is the sum of the noise term and the posterior parameter uncertainty, which depends on the location of the test input. Right panels: sampled functions from the posterior distribution. (a) Posterior distribution for polynomials of degree M = 3 (left) and samples from the posterior over functions (right). (b) Posterior distribution for polynomials of degree M = 5 (left) and samples from the posterior over functions (right).
9.3.5 Computing the Marginal Likelihood

In Section 8.6.2, we highlighted the importance of the marginal likelihood for Bayesian model selection. In the following, we compute the marginal likelihood for Bayesian linear regression with a conjugate Gaussian prior on the parameters, i.e., exactly the setting we have been discussing in this chapter.
Just to recap, we consider the following generative process:

θ ∼ N(m0, S0),                                                                 (9.60a)
yn | xn, θ ∼ N(xn⊤θ, σ²),                                                      (9.60b)

n = 1, . . . , N. The marginal likelihood is given by

p(Y | X) = ∫ p(Y | X, θ) p(θ) dθ                                               (9.61a)
         = ∫ N(y | Xθ, σ²I) N(θ | m0, S0) dθ,                                  (9.61b)

where we integrate out the model parameters θ. The marginal likelihood can be interpreted as the expected likelihood under the prior, i.e., E_θ[p(Y | X, θ)]. We compute the marginal likelihood in two steps: First, we show that the marginal likelihood is Gaussian (as a distribution in y); second, we compute the mean and covariance of this Gaussian.

1. The marginal likelihood is Gaussian: From Section 6.5.2, we know that (i) the product of two Gaussian random variables is an (unnormalized) Gaussian distribution, and (ii) a linear transformation of a Gaussian random variable is Gaussian distributed. In (9.61b), we require a linear transformation to bring N(y | Xθ, σ²I) into the form N(θ | µ, Σ) for some µ, Σ. Once this is done, the integral can be solved in closed form. The result is the normalizing constant of the product of the two Gaussians. The normalizing constant itself has Gaussian shape; see (6.76).
2. Mean and covariance. We compute the mean and covariance matrix of the marginal likelihood by exploiting the standard results for means and covariances of affine transformations of random variables; see Section 6.4.4. The mean of the marginal likelihood is computed as

E[Y | X] = E_{θ,ε}[Xθ + ε] = X E_θ[θ] = Xm0.                                   (9.62)

Note that ε ∼ N(0, σ²I) is a vector of i.i.d. random variables. The covariance matrix is given as

Cov[Y | X] = Cov_{θ,ε}[Xθ + ε] = Cov_θ[Xθ] + σ²I                               (9.63a)
           = X Cov_θ[θ] X⊤ + σ²I = XS0X⊤ + σ²I.                                (9.63b)

Hence, the marginal likelihood is

p(Y | X) = (2π)^{−N/2} det(XS0X⊤ + σ²I)^{−1/2} exp(−(1/2)(y − Xm0)⊤(XS0X⊤ + σ²I)⁻¹(y − Xm0))   (9.64a)
         = N(y | Xm0, XS0X⊤ + σ²I).                                            (9.64b)

Given the close connection with the posterior predictive distribution (see the Remark on Marginal Likelihood and Posterior Predictive Distribution earlier in this section), the functional form of the marginal likelihood should not be too surprising.
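A direct NumPy transcription of (9.64a)-(9.64b) as a log-density; the data below are illustrative.

```python
import numpy as np

def log_marginal_likelihood(X, y, m0, S0, sigma2):
    """Log of the marginal likelihood (9.64b): N(y | X m0, X S0 X^T + sigma2 I)."""
    N = X.shape[0]
    K = X @ S0 @ X.T + sigma2 * np.eye(N)      # covariance (9.63b)
    diff = y - X @ m0                           # mean (9.62)
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(K, diff))

# Illustrative usage
rng = np.random.default_rng(0)
X = rng.normal(size=(15, 3))
theta = rng.normal(size=3)
y = X @ theta + rng.normal(scale=0.5, size=15)
print(log_marginal_likelihood(X, y, m0=np.zeros(3), S0=np.eye(3), sigma2=0.25))
```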
Figure 9.12 Geometric interpretation of least squares. (a) Regression dataset consisting of noisy observations yn (blue) of function values f(xn) at input locations xn. (b) The orange dots are the projections of the noisy observations (blue dots) onto the line θML·x. The maximum likelihood solution to a linear regression problem finds a subspace (line) onto which the overall projection error (orange lines) of the observations is minimized.

9.4 Maximum Likelihood as Orthogonal Projection

Having crunched through much algebra to derive maximum likelihood and MAP estimates, we will now provide a geometric interpretation of maximum likelihood estimation. Let us consider a simple linear regression setting

y = xθ + ε,   ε ∼ N(0, σ²),                                                    (9.65)

in which we consider linear functions f : R → R that go through the origin (we omit features here for clarity). The parameter θ determines the slope of the line. Figure 9.12(a) shows a one-dimensional dataset.
With a training data set {(x1, y1), . . . , (xN, yN)}, we recall the results from Section 9.2.1 and obtain the maximum likelihood estimator for the slope parameter as

θML = (X⊤X)⁻¹X⊤y = X⊤y / (X⊤X) ∈ R,                                            (9.66)

where X = [x1, . . . , xN]⊤ ∈ RN and y = [y1, . . . , yN]⊤ ∈ RN.
This means for the training inputs X we obtain the optimal (maximum likelihood) reconstruction of the training targets as

XθML = X X⊤y/(X⊤X) = (XX⊤/(X⊤X)) y,                                            (9.67)

i.e., we obtain the approximation with the minimum least-squares error between y and Xθ.
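The projection interpretation of (9.66)-(9.67) can be verified numerically: the residual y − XθML is orthogonal to the subspace spanned by X. The data and seed below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-4, 4, size=30)                  # scalar inputs x_1, ..., x_N
y = 1.3 * X + rng.normal(scale=0.5, size=30)     # noisy observations of a line through the origin

theta_ml = (X @ y) / (X @ X)                     # (9.66): X^T y / (X^T X)
y_proj = X * theta_ml                            # (9.67): orthogonal projection of y onto span[X]

# The residual is orthogonal to the subspace spanned by X:
print(np.isclose(X @ (y - y_proj), 0.0))         # True (up to numerical precision)
```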
As we are looking for a solution of y = Xθ, we can think of linear regression as a problem for solving systems of linear equations. Therefore, we can relate to concepts from linear algebra and analytic geometry that we discussed in Chapters 2 and 3. In particular, looking carefully at (9.67), we see that the maximum likelihood estimator θML in our example from (9.65) effectively does an orthogonal projection of y onto the one-dimensional subspace spanned by X. Recalling the results on orthogonal projections from Section 3.8, we identify XX⊤/(X⊤X) as the projection matrix, θML as the coordinates of the projection onto the one-dimensional subspace of RN spanned by X, and XθML as the orthogonal projection of y onto this subspace.
Therefore, the maximum likelihood solution also provides a geometrically optimal solution by finding the vectors in the subspace spanned by X that are "closest" to the corresponding observations y, where "closest" means the smallest (squared) distance of the function values yn to xnθ. This is achieved by orthogonal projections. Figure 9.12(b) shows the projection of the noisy observations onto the subspace that minimizes the squared distance between the original dataset and its projection (note that the x-coordinate is fixed), which corresponds to the maximum likelihood solution.
In the general linear regression case where

y = ϕ⊤(x)θ + ε,   ε ∼ N(0, σ²),                                                (9.68)

with vector-valued features ϕ(x) ∈ RK, we again can interpret the maximum likelihood result

y ≈ ΦθML,                                                                      (9.69)
θML = (Φ⊤Φ)⁻¹Φ⊤y                                                               (9.70)

as a projection onto a K-dimensional subspace of RN, which is spanned by the columns of the feature matrix Φ; see Section 3.8.2.
If the feature functions ϕk that we use to construct the feature matrix Φ are orthonormal (see Section 3.7), we obtain a special case where the columns of Φ form an orthonormal basis (see Section 3.5), such that Φ⊤Φ = I. This will then lead to the projection

Φ(Φ⊤Φ)⁻¹Φ⊤y = ΦΦ⊤y = (∑_{k=1}^K ϕk ϕk⊤) y,                                      (9.71)

so that the maximum likelihood projection is simply the sum of projections of y onto the individual basis vectors ϕk, i.e., the columns of Φ. Furthermore, the coupling between different features has disappeared due to the orthogonality of the basis. Many popular basis functions in signal processing, such as wavelets and Fourier bases, are orthogonal basis functions. When the basis is not orthogonal, one can convert a set of linearly independent basis functions to an orthogonal basis by using the Gram-Schmidt process; see Section 3.8.3 and (Strang, 2003).

9.5 Further Reading

In this chapter, we discussed linear regression for Gaussian likelihoods and conjugate Gaussian priors on the parameters of the model. This allowed for closed-form Bayesian inference. However, in some applications we may want to choose a different likelihood function. For example, in a binary classification setting, we observe only two possible (categorical) outcomes, and a Gaussian likelihood is inappropriate in this setting. Instead, we can choose a Bernoulli likelihood that returns the probability of the predicted label being 1 (or 0). We refer to the books by Barber (2012), Bishop (2006), and Murphy (2012) for an in-depth introduction to classification problems. A different example where non-Gaussian likelihoods are important is count data. Counts are non-negative integers, and in this case a Binomial or Poisson likelihood would be a better choice than a Gaussian.
All these examples fall into the category of generalized linear models, a flexible generalization of linear regression that allows for response variables that have error distributions other than a Gaussian distribution. The GLM generalizes linear regression by allowing the linear model to be related to the observed values via a smooth and invertible function σ(·) that may be nonlinear so that y = σ(f(x)), where f(x) = θ⊤ϕ(x) is the linear regression model from (9.13). We can therefore think of a generalized linear model in terms of function composition y = σ ∘ f, where f is a linear regression model and σ the activation function. Note that although we are talking about "generalized linear models", the outputs y are no longer linear in the parameters θ. In logistic regression, we choose the logistic sigmoid σ(f) = 1/(1 + exp(−f)) ∈ [0, 1], which can be interpreted as the probability of observing y = 1 for a Bernoulli random variable y ∈ {0, 1}. The function σ(·) is called the transfer function or activation function, and its inverse is called the canonical link function; for ordinary linear regression, the activation function would simply be the identity.
From this perspective, it is also clear that generalized linear models are the building blocks of (deep) feedforward neural networks: If we consider a generalized linear model y = σ(Ax + b), where A is a weight matrix and b a bias vector, we identify this generalized linear model as a single-layer neural network with activation function σ(·). We can now recursively compose these functions via

x_{k+1} = f_k(x_k),   f_k(x_k) = σ_k(A_k x_k + b_k)                            (9.72)

for k = 0, . . . , K − 1, where x0 are the input features and xK = y are the observed outputs, such that f_{K−1} ∘ · · · ∘ f_0 is a K-layer deep neural network. Therefore, the building blocks of this deep neural network are generalized linear models. A great post on the relation between GLMs and deep networks is available at https://tinyurl.com/glm-dnn.
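A minimal sketch of this composition: a generalized linear model as a single-layer network, and a stack of such layers as in (9.72). The layer sizes, weights, and activation choices are arbitrary illustrative assumptions.

```python
import numpy as np

def sigmoid(f):
    """Logistic sigmoid sigma(f) = 1 / (1 + exp(-f))."""
    return 1.0 / (1.0 + np.exp(-f))

def glm_layer(x, A, b, activation):
    """A generalized linear model y = sigma(A x + b), i.e., a single-layer network."""
    return activation(A @ x + b)

def deep_network(x0, layers):
    """layers: list of (A_k, b_k, sigma_k); returns x_K = f_{K-1} o ... o f_0 (x_0), cf. (9.72)."""
    x = x0
    for A, b, act in layers:
        x = glm_layer(x, A, b, act)
    return x

# Illustrative two-layer example with random weights
rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4), np.tanh),    # hidden layer
    (rng.normal(size=(1, 4)), np.zeros(1), sigmoid),    # output layer: Bernoulli probability
]
print(deep_network(np.array([0.2, -1.0, 0.5]), layers))
```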
In the LASSO, some parameters are set exactly to 0, and only the nonzero parameters are relevant for the regression problem, which is the reason why we also speak of "variable selection".

Figure 10.1 (a) Dataset with x1 and x2 coordinates; the data does not vary much in the x2-direction. (b) The data from (a) can be represented using the x1-coordinate alone with nearly no loss.
10 Dimensionality Reduction with Principal Component Analysis
In the signal processing community, PCA is also known as the Karhunen-Loève transform. In this chapter, we derive PCA from first principles, drawing on our understanding of basis and basis change (Sections 2.6.1 and 2.7.2), projections (Section 3.8), eigenvalues (Section 4.2), Gaussian distributions (Section 6.5), and constrained optimization (Section 7.2).
Dimensionality reduction generally exploits a property of high-dimensional data (e.g., images) that it often lies on a low-dimensional subspace. Figure 10.1 gives an illustrative example in two dimensions. Although the data in Figure 10.1(a) does not quite lie on a line, the data does not vary much in the x2-direction, so that we can express it as if it were on a line – with nearly no loss; see Figure 10.1(b). To describe the data in Figure 10.1(b), only the x1-coordinate is required, and the data lies in a one-dimensional subspace of R².

10.1 Problem Setting

In PCA, we are interested in finding projections x̃n of data points xn that are as similar to the original data points as possible, but which have a significantly lower intrinsic dimensionality. Figure 10.1 gives an illustration of what this could look like.
More concretely, we consider an i.i.d. dataset X = {x1, . . . , xN}, xn ∈ RD, with mean 0 that possesses the data covariance matrix (6.42)

S = (1/N) ∑_{n=1}^N xn xn⊤.                                                    (10.1)

Furthermore, we assume there exists a low-dimensional compressed representation (code)

zn = B⊤xn ∈ RM                                                                 (10.2)

of xn, where we define the projection matrix

B := [b1, . . . , bM] ∈ RD×M.                                                  (10.3)

We assume that the columns of B are orthonormal (Definition 3.7) so that bi⊤bj = 0 if and only if i ≠ j and bi⊤bi = 1. We seek an M-dimensional subspace U ⊆ RD, dim(U) = M < D, onto which we project the data. We denote the projected data by x̃n ∈ U, and their coordinates (with respect to the basis vectors b1, . . . , bM of U) by zn. Our aim is to find projections x̃n ∈ RD (or equivalently the codes zn and the basis vectors b1, . . . , bM) so that they are as similar to the original data xn as possible and minimize the loss due to compression. The columns b1, . . . , bM of B form a basis of the M-dimensional subspace in which the projected data x̃ = BB⊤x ∈ RD live.

Example 10.1 (Coordinate Representation/Code)
Consider R² with the canonical basis e1 = [1, 0]⊤, e2 = [0, 1]⊤. From Chapter 2, we know that x ∈ R² can be represented as a linear combination of these basis vectors, e.g.,

[5, 3]⊤ = 5e1 + 3e2.                                                           (10.4)

However, when we consider vectors of the form

x̃ = [0, z]⊤ ∈ R²,   z ∈ R,                                                     (10.5)

they can always be written as 0e1 + ze2. To represent these vectors, it is sufficient to remember/store the coordinate/code z of x̃ with respect to the e2 vector.
More precisely, the set of x̃ vectors (with the standard vector addition and scalar multiplication) forms a vector subspace U (see Section 2.4) with dim(U) = 1 because U = span[e2]. (The dimension of a vector space corresponds to the number of its basis vectors; see Section 2.6.1.)

Figure 10.2 Graphical illustration of PCA. In PCA, we find a compressed version z of the original data x. The compressed data can be reconstructed into x̃, which lives in the original data space, but has an intrinsic lower-dimensional representation than x.

In Section 10.2, we will find low-dimensional representations that retain as much information as possible and minimize the compression loss. An alternative derivation of PCA is given in Section 10.3, where we will be looking at minimizing the squared reconstruction error ‖xn − x̃n‖² between the original data xn and its projection x̃n.
Figure 10.2 illustrates the setting we consider in PCA, where z represents the lower-dimensional representation of the compressed data x̃ and plays the role of a bottleneck, which controls how much information can flow between x and x̃. In PCA, we consider a linear relationship between the original data x and its low-dimensional code z so that z = B⊤x and x̃ = Bz for a suitable matrix B. Based on the motivation of thinking of PCA as a data compression technique, we can interpret the arrows in Figure 10.2 as a pair of operations representing encoders and decoders. The linear mapping represented by B can be thought of as a decoder, which maps the low-dimensional code z ∈ RM back into the original data space RD. Similarly, B⊤ can be thought of as an encoder, which encodes the original data x as a low-dimensional (compressed) code z.
Throughout this chapter, we will use the MNIST digits dataset as a recurring example.
10.2 Maximum Variance Perspective
We solve the constrained optimization problem

max_{b1} b1⊤Sb1   subject to ‖b1‖² = 1.                                        (10.10)

Following Section 7.2, we obtain the Lagrangian

L(b1, λ) = b1⊤Sb1 + λ1(1 − b1⊤b1)                                              (10.11)

to solve this constrained optimization problem. The partial derivatives of L with respect to b1 and λ1 are

∂L/∂b1 = 2b1⊤S − 2λ1b1⊤,   ∂L/∂λ1 = 1 − b1⊤b1,                                  (10.12)

respectively. Setting these partial derivatives to 0 gives us the relations

Sb1 = λ1b1,                                                                    (10.13)
b1⊤b1 = 1.                                                                     (10.14)

By comparing this with the definition of an eigenvalue decomposition (Section 4.4), we see that b1 is an eigenvector of the data covariance matrix S, and the Lagrange multiplier λ1 plays the role of the corresponding eigenvalue. This eigenvector property (10.13) allows us to rewrite our variance objective (10.10) as

V1 = b1⊤Sb1 = λ1b1⊤b1 = λ1,                                                     (10.15)

i.e., the variance of the data projected onto a one-dimensional subspace equals the eigenvalue that is associated with the basis vector b1 that spans this subspace. (The quantity √λ1 is also called the loading of the unit vector b1 and represents the standard deviation of the data accounted for by the principal subspace span[b1].) Therefore, to maximize the variance of the low-dimensional code, we choose the basis vector associated with the largest eigenvalue of the data covariance matrix. This eigenvector is called the first principal component. We can determine the effect/contribution of the principal component b1 in the original data space by mapping the coordinate z1n back into data space, which gives us the projected data point

x̃n = b1 z1n = b1 b1⊤ xn ∈ RD                                                    (10.16)

in the original data space.

Remark. Although x̃n is a D-dimensional vector, it only requires a single coordinate z1n to represent it with respect to the basis vector b1 ∈ RD. ♢
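A short NumPy sketch of the first principal component: form the data covariance matrix (10.1), take the eigenvector with the largest eigenvalue (10.13), and project as in (10.16). The two-dimensional synthetic dataset and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data, D = 2, N = 500, with most variance along one direction;
# columns are data points (the convention used in this chapter).
X = rng.multivariate_normal(np.zeros(2), [[5.0, 2.0], [2.0, 1.0]], size=500).T
X = X - X.mean(axis=1, keepdims=True)        # center the data (PCA assumes mean 0)

S = X @ X.T / X.shape[1]                     # data covariance matrix (10.1)
eigvals, eigvecs = np.linalg.eigh(S)         # eigh: S is symmetric; eigenvalues ascending
b1 = eigvecs[:, -1]                          # eigenvector with the largest eigenvalue (10.13)
lambda1 = eigvals[-1]                        # variance captured by the first PC, cf. (10.15)

z1 = b1 @ X                                  # coordinates z_{1n} = b1^T x_n
X_tilde = np.outer(b1, z1)                   # projected data points (10.16): b1 b1^T x_n
print(lambda1, np.var(z1))                   # both equal V1 = lambda_1
```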
10.2.2 M-dimensional Subspace with Maximal Variance

Assume we have found the first m − 1 principal components as the m − 1 eigenvectors of S that are associated with the largest m − 1 eigenvalues. Since S is symmetric, the spectral theorem (Theorem 4.15) states that we can use these eigenvectors to construct an orthonormal eigenbasis of an (m − 1)-dimensional subspace of RD. Generally, the mth principal component can be found by subtracting the effect of the first m − 1 principal components b1, . . . , bm−1 from the data, thereby trying to find principal components that compress the remaining information. We then arrive at the new data matrix

X̂ := X − ∑_{i=1}^{m−1} bi bi⊤ X = X − Bm−1X,                                    (10.17)

where X = [x1, . . . , xN] ∈ RD×N contains the data points as column vectors and Bm−1 := ∑_{i=1}^{m−1} bi bi⊤ is a projection matrix that projects onto the subspace spanned by b1, . . . , bm−1. The matrix X̂ := [x̂1, . . . , x̂N] ∈ RD×N in (10.17) contains the information in the data that has not yet been compressed.

Remark (Notation). Throughout this chapter, we do not follow the convention of collecting data x1, . . . , xN as the rows of the data matrix, but we define them to be the columns of X. This means that our data matrix X is a D × N matrix instead of the conventional N × D matrix. The reason for our choice is that the algebra operations work out smoothly without the need to either transpose the matrix or to redefine vectors as row vectors that are left-multiplied onto matrices. ♢

To find the mth principal component, we maximize the variance

Vm = V[zm] = (1/N) ∑_{n=1}^N z²mn = (1/N) ∑_{n=1}^N (bm⊤x̂n)² = bm⊤Ŝbm,          (10.18)

subject to ‖bm‖² = 1, where we followed the same steps as in (10.9b) and defined Ŝ as the data covariance matrix of the transformed dataset X̂ := {x̂1, . . . , x̂N}. As previously, when we looked at the first principal component alone, we solve a constrained optimization problem and discover that the optimal solution bm is the eigenvector of Ŝ that is associated with the largest eigenvalue of Ŝ.
It turns out that bm is also an eigenvector of S. More generally, the sets of eigenvectors of S and Ŝ are identical. Since both S and Ŝ are symmetric, we can find an ONB of eigenvectors (spectral theorem 4.15), i.e., there exist D distinct eigenvectors for both S and Ŝ. Next, we show that every eigenvector of S is an eigenvector of Ŝ. Assume we have already found eigenvectors b1, . . . , bm−1 of Ŝ. Consider an eigenvector bi of S, i.e., Sbi = λibi. In general,

Ŝbi = (1/N) X̂X̂⊤ bi = (1/N)(X − Bm−1X)(X − Bm−1X)⊤ bi                            (10.19a)
    = (S − SBm−1 − Bm−1S + Bm−1SBm−1) bi.                                      (10.19b)

We distinguish between two cases. If i ⩾ m, i.e., bi is an eigenvector that is not among the first m − 1 principal components, then bi is orthogonal to the first m − 1 principal components and Bm−1bi = 0. If i < m, i.e., bi is among the first m − 1 principal components, then bi is a basis vector of the principal subspace onto which Bm−1 projects.
Since b1, . . . , bm−1 are an ONB of this principal subspace, we obtain Bm−1bi = bi. The two cases can be summarized as follows:

Bm−1bi = bi if i < m,   Bm−1bi = 0 if i ⩾ m.                                   (10.20)

In the case i ⩾ m, by using (10.20) in (10.19b), we obtain Ŝbi = (S − Bm−1S)bi = Sbi = λibi, i.e., bi is also an eigenvector of Ŝ with eigenvalue λi. Specifically,

Ŝbm = Sbm = λmbm.                                                              (10.21)

Equation (10.21) reveals that bm is not only an eigenvector of S but also of Ŝ. Specifically, λm is the largest eigenvalue of Ŝ and the mth largest eigenvalue of S, and both have the associated eigenvector bm.
In the case i < m, by using (10.20) in (10.19b), we obtain

Ŝbi = (S − SBm−1 − Bm−1S + Bm−1SBm−1)bi = 0 = 0bi.                             (10.22)

This means that b1, . . . , bm−1 are also eigenvectors of Ŝ, but they are associated with eigenvalue 0, so that b1, . . . , bm−1 span the null space of Ŝ. Overall, every eigenvector of S is also an eigenvector of Ŝ. However, if an eigenvector of S is part of the (m − 1)-dimensional principal subspace, then the associated eigenvalue of Ŝ is 0. This derivation shows that there is an intimate connection between the M-dimensional subspace with maximal variance and the eigenvalue decomposition; we will revisit this connection in Section 10.4.
With the relation (10.21) and bm⊤bm = 1, the variance of the data projected onto the mth principal component is

Vm = bm⊤Sbm = λmbm⊤bm = λm,                                                     (10.23)

where we used (10.21) in the first step. This means that the variance of the data, when projected onto an M-dimensional subspace, equals the sum of the eigenvalues that are associated with the corresponding eigenvectors of the data covariance matrix.

Example 10.2 (Eigenvalues of MNIST "8")

Figure 10.5 Properties of the training data of MNIST "8". (a) Eigenvalues of the data covariance matrix of all digits "8" in the MNIST training set, sorted in descending order; (b) variance captured by the principal components associated with the largest eigenvalues.

Taking all digits "8" in the MNIST training data, we compute the eigenvalues of the data covariance matrix. Figure 10.5(a) shows the 200 largest eigenvalues of the data covariance matrix. We see that only a few of them have a value that differs significantly from 0. Therefore, most of the variance, when projecting data onto the subspace spanned by the corresponding eigenvectors, is captured by only a few principal components, as shown in Figure 10.5(b).

Overall, to find an M-dimensional subspace of RD that retains as much information as possible, PCA tells us to choose the columns of the matrix B in (10.3) as the M eigenvectors of the data covariance matrix S that are associated with the M largest eigenvalues. The maximum amount of variance PCA can capture with the first M principal components is

VM = ∑_{m=1}^M λm,                                                             (10.24)

where the λm are the M largest eigenvalues of the data covariance matrix S. Consequently, the variance lost by data compression via PCA is

JM := ∑_{j=M+1}^D λj = VD − VM.                                                (10.25)

Instead of these absolute quantities, we can define the relative variance captured as VM/VD, and the relative variance lost by compression as 1 − VM/VD.
components
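The quantities in (10.24) and (10.25) follow directly from the eigenvalue spectrum of S. The NumPy sketch below is our own illustration (not from the book; the toy dataset and names are made up) of how to report the captured and lost variance for a chosen M.

```python
import numpy as np

def variance_captured(X, M):
    """Variance captured/lost by an M-dimensional PCA subspace, cf. (10.24) and (10.25).

    X: array of shape (N, D), one data point per row.
    """
    Xc = X - X.mean(axis=0)                  # center the data
    S = Xc.T @ Xc / X.shape[0]               # data covariance matrix S
    eigvals = np.linalg.eigvalsh(S)[::-1]    # eigenvalues, sorted in descending order
    V_M = eigvals[:M].sum()                  # variance captured, (10.24)
    J_M = eigvals[M:].sum()                  # variance lost, (10.25)
    return V_M, J_M

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # toy dataset for illustration
V_M, J_M = variance_captured(X, M=2)
print(V_M / (V_M + J_M))                     # relative variance captured, V_M / V_D
```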
10.3 Projection Perspective

In the following, we will derive PCA as an algorithm that directly minimizes the average reconstruction error. This perspective allows us to interpret PCA as implementing an optimal linear auto-encoder. We will draw heavily from Chapters 2 and 3.

In the previous section, we derived PCA by maximizing the variance in the projected space to retain as much information as possible.
In the following, we will look at the difference vectors between the original data x_n and their reconstruction x̃_n and minimize this distance so that x_n and x̃_n are as close as possible. Figure 10.6 illustrates this setting.

Figure 10.6 Illustration of the projection approach: find a subspace (line) that minimizes the length of the difference vector between projected (orange) and original (blue) data.

10.3.1 Setting and Objective

Assume an (ordered) orthonormal basis (ONB) B = (b_1, . . . , b_D) of R^D, i.e., b_i^⊤ b_j = 1 if and only if i = j and 0 otherwise. From Section 2.5 we know that for a basis (b_1, . . . , b_D) of R^D any x ∈ R^D can be written as a linear combination of the basis vectors of R^D, i.e.,

x = Σ_{d=1}^D ζ_d b_d = Σ_{m=1}^M ζ_m b_m + Σ_{j=M+1}^D ζ_j b_j   (10.26)

for suitable coordinates ζ_d ∈ R.
We are interested in finding vectors x̃ ∈ R^D, which live in a lower-dimensional subspace U ⊆ R^D with dim(U) = M, so that

x̃ = Σ_{m=1}^M z_m b_m ∈ U ⊆ R^D   (10.27)

is as similar to x as possible. Note that at this point we need to assume that the coordinates z_m of x̃ and ζ_m of x are not identical. (Vectors x̃ ∈ U could, for example, be vectors on a plane in R^3: the dimensionality of the plane is 2, but the vectors still have three coordinates with respect to the standard basis of R^3.)
In the following, we use exactly this kind of representation of x̃ to find optimal coordinates z and basis vectors b_1, . . . , b_M such that x̃ is as similar to the original data point x as possible, i.e., we aim to minimize the (Euclidean) distance ∥x − x̃∥. Figure 10.7 illustrates this setting.

Figure 10.7 Simplified projection setting: (a) a vector x ∈ R^2 (red cross) shall be projected onto a one-dimensional subspace U ⊆ R^2 spanned by b; (b) the difference vectors x − x̃_i for 50 different candidates x̃_i (red lines).

Without loss of generality, we assume that the dataset X = {x_1, . . . , x_N}, x_n ∈ R^D, is centered at 0, i.e., E[X] = 0. Without the zero-mean assumption, we would arrive at exactly the same solution, but the notation would be substantially more cluttered.
We are interested in finding the best linear projection of X onto a lower-dimensional subspace U of R^D with dim(U) = M and orthonormal basis vectors b_1, . . . , b_M. We will call this subspace U the principal subspace. The projections of the data points are denoted by

x̃_n := Σ_{m=1}^M z_{mn} b_m = B z_n ∈ R^D ,   (10.28)

where z_n := [z_{1n}, . . . , z_{Mn}]^⊤ ∈ R^M is the coordinate vector of x̃_n with respect to the basis (b_1, . . . , b_M). More specifically, we are interested in having the x̃_n as similar to x_n as possible.
The similarity measure we use in the following is the squared distance (Euclidean norm) ∥x − x̃∥² between x and x̃. We therefore define our objective as minimizing the average squared Euclidean distance (reconstruction error) (Pearson, 1901)

J_M := (1/N) Σ_{n=1}^N ∥x_n − x̃_n∥² ,   (10.29)

where we make it explicit that the dimension of the subspace onto which we project the data is M. In order to find this optimal linear projection, we need to find the orthonormal basis of the principal subspace and the coordinates z_n ∈ R^M of the projections with respect to this basis.
To find the coordinates z_n and the ONB of the principal subspace, we follow a two-step approach. First, we optimize the coordinates z_n for a given ONB (b_1, . . . , b_M); second, we find the optimal ONB.

10.3.2 Finding Optimal Coordinates

Let us start by finding the optimal coordinates z_{1n}, . . . , z_{Mn} of the projections x̃_n for n = 1, . . . , N. Consider Figure 10.7(b), where the principal subspace is spanned by a single vector b. Geometrically speaking, finding the optimal coordinates z corresponds to finding the representation of the linear projection x̃ with respect to b that minimizes the distance between x̃ and x. From Figure 10.7(b), it is clear that this will be the orthogonal projection, and in the following we will show exactly this.
We assume an ONB (b_1, . . . , b_M) of U ⊆ R^D. To find the optimal coordinates z_m with respect to this basis, we require the partial derivatives

∂J_M/∂z_{in} = (∂J_M/∂x̃_n)(∂x̃_n/∂z_{in}) ,   (10.30a)
∂J_M/∂x̃_n = −(2/N)(x_n − x̃_n)^⊤ ∈ R^{1×D} ,   (10.30b)
Figure 10.8 Optimal projection of a vector x ∈ R^2 onto a one-dimensional subspace (continuation from Figure 10.7): (a) distances ∥x − x̃∥ for some x̃ = z_1 b ∈ U = span[b]; (b) the vector x̃ that minimizes the distance in (a) is the orthogonal projection of x onto U, and its coordinate with respect to the basis vector b that spans U is the factor we need to scale b by in order to “reach” x̃.

∂x̃_n/∂z_{in} = ∂/∂z_{in} ( Σ_{m=1}^M z_{mn} b_m ) = b_i   (10.30c)

for i = 1, . . . , M, such that we obtain

∂J_M/∂z_{in} = −(2/N)(x_n − x̃_n)^⊤ b_i = −(2/N)( x_n − Σ_{m=1}^M z_{mn} b_m )^⊤ b_i   (10.31a)
= −(2/N)(x_n^⊤ b_i − z_{in} b_i^⊤ b_i) = −(2/N)(x_n^⊤ b_i − z_{in})   (10.31b)

since b_i^⊤ b_i = 1. Setting this partial derivative to 0 yields immediately the optimal coordinates

z_{in} = x_n^⊤ b_i = b_i^⊤ x_n   (10.32)

for i = 1, . . . , M and n = 1, . . . , N. This means that the optimal coordinates z_{in} of the projection x̃_n are the coordinates of the orthogonal projection (see Section 3.8) of the original data point x_n onto the one-dimensional subspace that is spanned by b_i. Consequently:

- The optimal linear projection x̃_n of x_n is an orthogonal projection.
- The coordinates of x̃_n with respect to the basis (b_1, . . . , b_M) are the coordinates of the orthogonal projection of x_n onto the principal subspace.
- An orthogonal projection is the best linear mapping given the objective (10.29).

The coordinates ζ_m of x in (10.26) and the coordinates z_m of x̃ in (10.27) must be identical for m = 1, . . . , M since U^⊥ = span[b_{M+1}, . . . , b_D] is the orthogonal complement (see Section 3.6) of U = span[b_1, . . . , b_M].

Remark (Orthogonal Projections with Orthonormal Basis Vectors). Let us briefly recap orthogonal projections. If (b_1, . . . , b_D) is an orthonormal basis of R^D, then

x̃ = b_j (b_j^⊤ b_j)^{−1} b_j^⊤ x = b_j b_j^⊤ x ∈ R^D   (10.33)

is the orthogonal projection of x onto the subspace spanned by the jth basis vector, and z_j = b_j^⊤ x is the coordinate of this projection with respect to the basis vector b_j that spans that subspace, since z_j b_j = x̃. Figure 10.8(b) illustrates this setting.
More generally, if we aim to project onto an M-dimensional subspace of R^D, we obtain the orthogonal projection of x onto the M-dimensional subspace with orthonormal basis vectors b_1, . . . , b_M as

x̃ = B (B^⊤ B)^{−1} B^⊤ x = B B^⊤ x ,   (10.34)

where we defined B := [b_1, . . . , b_M] ∈ R^{D×M} and used B^⊤ B = I. The coordinates of this projection with respect to the ordered basis (b_1, . . . , b_M) are z := B^⊤ x, as discussed in Section 3.8.
We can think of the coordinates as a representation of the projected vector in a new coordinate system defined by (b_1, . . . , b_M). Note that although x̃ ∈ R^D, we only need M coordinates z_1, . . . , z_M to represent this vector; the other D − M coordinates with respect to the basis vectors (b_{M+1}, . . . , b_D) are always 0. ♢
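As a quick numerical illustration of (10.33) and (10.34) (our own sketch, not from the book), the following snippet projects a vector onto the subspace spanned by orthonormal basis vectors stored as the columns of B.

```python
import numpy as np

# Orthonormal basis of a two-dimensional subspace of R^3 (columns of B).
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

x = np.array([2.0, 1.0, 3.0])

z = B.T @ x            # coordinates of the projection, z = B^T x
x_tilde = B @ z        # orthogonal projection, x_tilde = B B^T x, cf. (10.34)

print(z)               # [2. 1.]
print(x_tilde)         # [2. 1. 0.]  (the component orthogonal to U is removed)
```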
So far we have shown that for a given ONB we can find the optimal coordinates of x̃ by an orthogonal projection onto the principal subspace. In the following, we will determine what the best basis is.

10.3.3 Finding the Basis of the Principal Subspace

To determine the basis vectors b_1, . . . , b_M of the principal subspace, we rephrase the loss function (10.29) using the results we have so far. This will make it easier to find the basis vectors. To reformulate the loss function, we exploit our results from before and obtain

x̃_n = Σ_{m=1}^M z_{mn} b_m = Σ_{m=1}^M (x_n^⊤ b_m) b_m .   (10.35)

We now exploit the symmetry of the dot product, which yields

x̃_n = ( Σ_{m=1}^M b_m b_m^⊤ ) x_n .   (10.36)
Since we can generally write the original data point x_n as a linear combination of all basis vectors, it holds that

x_n = Σ_{d=1}^D z_{dn} b_d = Σ_{d=1}^D (x_n^⊤ b_d) b_d = ( Σ_{d=1}^D b_d b_d^⊤ ) x_n   (10.37a)
= ( Σ_{m=1}^M b_m b_m^⊤ ) x_n + ( Σ_{j=M+1}^D b_j b_j^⊤ ) x_n ,   (10.37b)

where we split the sum with D terms into a sum over M and a sum over D − M terms. With this result, we find that the displacement vector x_n − x̃_n, i.e., the difference vector between the original data point and its projection, is

x_n − x̃_n = ( Σ_{j=M+1}^D b_j b_j^⊤ ) x_n   (10.38a)
= Σ_{j=M+1}^D (x_n^⊤ b_j) b_j .   (10.38b)

This means the difference is exactly the projection of the data point onto the orthogonal complement of the principal subspace: We identify the matrix Σ_{j=M+1}^D b_j b_j^⊤ in (10.38a) as the projection matrix that performs this projection. Hence the displacement vector x_n − x̃_n lies in the subspace that is orthogonal to the principal subspace, as illustrated in Figure 10.9.

Figure 10.9 Orthogonal projection and displacement vectors: when projecting data points x_n (blue) onto a subspace U, we obtain x̃_n (orange); the displacement vector x̃_n − x_n lies completely in the orthogonal complement U^⊥ of U.

Remark (Low-Rank Approximation). In (10.38a), we saw that the projection matrix, which projects x onto x̃, is given by

Σ_{m=1}^M b_m b_m^⊤ = B B^⊤ .   (10.39)

By construction as a sum of rank-one matrices b_m b_m^⊤, we see that B B^⊤ is symmetric and has rank M. Therefore, the average squared reconstruction error can also be written as

(1/N) Σ_{n=1}^N ∥x_n − x̃_n∥² = (1/N) Σ_{n=1}^N ∥x_n − B B^⊤ x_n∥²   (10.40a)
= (1/N) Σ_{n=1}^N ∥(I − B B^⊤) x_n∥² .   (10.40b)

Finding orthonormal basis vectors b_1, . . . , b_M, which minimize the difference between the original data x_n and their projections x̃_n, is equivalent to finding the best rank-M approximation B B^⊤ of the identity matrix I (see Section 4.6). ♢

Now we have all the tools to reformulate the loss function (10.29):

J_M = (1/N) Σ_{n=1}^N ∥x_n − x̃_n∥² = (1/N) Σ_{n=1}^N ∥ Σ_{j=M+1}^D (b_j^⊤ x_n) b_j ∥² .   (10.41)

We now explicitly compute the squared norm and exploit the fact that the b_j form an ONB, which yields

J_M = (1/N) Σ_{n=1}^N Σ_{j=M+1}^D (b_j^⊤ x_n)² = (1/N) Σ_{n=1}^N Σ_{j=M+1}^D b_j^⊤ x_n b_j^⊤ x_n   (10.42a)
= (1/N) Σ_{n=1}^N Σ_{j=M+1}^D b_j^⊤ x_n x_n^⊤ b_j ,   (10.42b)

where we exploited the symmetry of the dot product in the last step to write b_j^⊤ x_n = x_n^⊤ b_j. We now swap the sums and obtain

J_M = Σ_{j=M+1}^D b_j^⊤ ( (1/N) Σ_{n=1}^N x_n x_n^⊤ ) b_j = Σ_{j=M+1}^D b_j^⊤ S b_j   (10.43a)
= Σ_{j=M+1}^D tr(b_j^⊤ S b_j) = Σ_{j=M+1}^D tr(S b_j b_j^⊤) = tr( ( Σ_{j=M+1}^D b_j b_j^⊤ ) S ) ,   (10.43b)

where we exploited the property that the trace operator tr(·) (see (4.18)) is linear and invariant to cyclic permutations of its arguments. Since we assumed that our dataset is centered, i.e., E[X] = 0, we identify S as the data covariance matrix. Since the projection matrix in (10.43b) is constructed as a sum of rank-one matrices b_j b_j^⊤, it itself is of rank D − M.
Equation (10.43a) implies that we can formulate the average squared reconstruction error equivalently as the covariance matrix of the data, projected onto the orthogonal complement of the principal subspace.
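Relation (10.43a) is easy to verify numerically. The sketch below (our own, with made-up toy data) compares the average squared reconstruction error of the rank-M PCA projection with the sum of the discarded eigenvalues of S.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, M = 500, 6, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # toy data
X = X - X.mean(axis=0)                                   # centered, E[X] = 0

S = X.T @ X / N                                          # data covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)                     # ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

B = eigvecs[:, :M]                                       # principal directions as columns
X_tilde = X @ B @ B.T                                    # reconstructions B B^T x_n
J_M = np.mean(np.sum((X - X_tilde) ** 2, axis=1))        # average squared error, (10.29)

print(J_M, eigvals[M:].sum())                            # agree up to rounding, cf. (10.43)
```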
10.4 Eigenvector Computation and Low-Rank Approximations

10.4.1 PCA Using Low-Rank Matrix Approximations

To maximize the variance of the projected data (or minimize the average squared reconstruction error), PCA chooses the columns of U in (10.48) to be the eigenvectors that are associated with the M largest eigenvalues of the data covariance matrix S, so that we identify U as the projection matrix B in (10.3), which projects the original data onto a lower-dimensional subspace of dimension M. The Eckart-Young theorem (Theorem 4.25 in Section 4.6) offers a direct way to estimate the low-dimensional representation: the best rank-M approximation of the data matrix is obtained by truncating its singular value decomposition (Section 4.5) to the M largest singular values.

To find the eigenvector associated with the largest eigenvalue of S iteratively, we can use the power iteration, which chooses a starting vector x_0 that is not in the null space of S and follows the iteration

x_{k+1} = S x_k / ∥S x_k∥ ,   k = 0, 1, . . . .   (10.52)

This means the vector x_k is multiplied by S in every iteration and then normalized, i.e., we always have ∥x_k∥ = 1. This sequence of vectors converges to the eigenvector associated with the largest eigenvalue of S. If S is invertible, it is sufficient to ensure that x_0 ≠ 0. The original Google PageRank algorithm (Page et al., 1999) uses such an algorithm for ranking web pages based on their hyperlinks.
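A minimal power-iteration sketch (ours, not the book's code), implementing (10.52) for a small symmetric matrix:

```python
import numpy as np

def power_iteration(S, num_iters=100, seed=0):
    """Dominant eigenpair of a symmetric matrix S via the power iteration (10.52)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=S.shape[0])      # x_0; for invertible S, any x_0 != 0 suffices
    x = x / np.linalg.norm(x)
    for _ in range(num_iters):
        x = S @ x
        x = x / np.linalg.norm(x)        # keep ||x_k|| = 1
    eigval = x @ S @ x                   # Rayleigh quotient estimate of the eigenvalue
    return eigval, x

S = np.array([[3.0, 1.0],
              [1.0, 2.0]])
lam, b = power_iteration(S)
print(lam, b)                            # largest eigenvalue and its eigenvector
```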
10.5 PCA in High Dimensions

In settings where the data dimensionality D far exceeds the number N of data points, we can perform PCA via a smaller N × N matrix. Writing the centered data points as the columns of X ∈ R^{D×N}, so that S = (1/N) X X^⊤, and multiplying the eigenvalue/eigenvector equation S b_m = λ_m b_m from the left with X^⊤, we get a new eigenvector/eigenvalue equation: λ_m remains the eigenvalue, which confirms our results from Section 4.5.3 that the nonzero eigenvalues of X X^⊤ equal the nonzero eigenvalues of X^⊤ X. We obtain the eigenvector of the matrix (1/N) X^⊤ X ∈ R^{N×N} associated with λ_m as c_m := X^⊤ b_m. Assuming we have no duplicate data points, this matrix has rank N and is invertible. This also implies that (1/N) X^⊤ X has the same (nonzero) eigenvalues as the data covariance matrix S. But this is now an N × N matrix, so that we can compute the eigenvalues and eigenvectors much more efficiently than for the original D × D data covariance matrix.

Now that we have the eigenvectors of (1/N) X^⊤ X, we are going to recover the original eigenvectors, which we still need for PCA. Currently, we know the eigenvectors of (1/N) X^⊤ X. If we left-multiply our eigenvalue/eigenvector equation with X, we get

(1/N) X X^⊤ X c_m = λ_m X c_m   (10.57)

and we recover the data covariance matrix again. This now also means that we recover X c_m as an eigenvector of S.

Remark. If we want to apply the PCA algorithm that we discussed in Section 10.6, we need to normalize the eigenvectors X c_m of S so that they have norm 1. ♢
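The following sketch is our own illustration of this trick (not the book's code); it assumes the centered data points form the columns of X ∈ R^{D×N}, computes the eigenvectors of the small N × N matrix, and recovers normalized eigenvectors of S.

```python
import numpy as np

rng = np.random.default_rng(2)
D, N, M = 1000, 50, 3
X = rng.normal(size=(D, N))              # columns are data points (toy example)
X = X - X.mean(axis=1, keepdims=True)    # center the data

K = X.T @ X / N                          # N x N matrix with the same nonzero eigenvalues as S
eigvals, C = np.linalg.eigh(K)           # columns of C are eigenvectors c_m
order = np.argsort(eigvals)[::-1][:M]
eigvals, C = eigvals[order], C[:, order]

B = X @ C                                # X c_m are eigenvectors of S
B = B / np.linalg.norm(B, axis=0)        # normalize to unit length (see the Remark above)

S = X @ X.T / N                          # the (large) data covariance matrix, for checking only
print(np.allclose(S @ B, B * eigvals))   # S (X c_m) = lambda_m (X c_m)
```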
10.6 Key Steps of PCA in Practice

In the following, we will go through the individual steps of PCA using a running example, which is summarized in Figure 10.11. We are given a two-dimensional dataset (Figure 10.11(a)), and we want to use PCA to project it onto a one-dimensional subspace.

Figure 10.11 Steps of PCA: (a) original dataset; (b) centering; (c) dividing by the standard deviation; (d) eigendecomposition; (e) projection; (f) mapping back to the original data space.

1. Mean subtraction We start by centering the data by computing the mean µ of the dataset and subtracting it from every single data point. This ensures that the dataset has mean 0 (Figure 10.11(b)). Mean subtraction is not strictly necessary but reduces the risk of numerical problems.
2. Standardization Divide the data points by the standard deviation σ_d of the dataset for every dimension d = 1, . . . , D. Now the data is unit free, and it has variance 1 along each axis, which is indicated by the two arrows in Figure 10.11(c). This step completes the standardization of the data.
3. Eigendecomposition of the covariance matrix Compute the data covariance matrix and its eigenvalues and corresponding eigenvectors. Since the covariance matrix is symmetric, the spectral theorem (Theorem 4.15) states that we can find an ONB of eigenvectors. In Figure 10.11(d), the eigenvectors are scaled by the magnitude of the corresponding eigenvalue. The longer vector spans the principal subspace, which we denote by U. The data covariance matrix is represented by the ellipse.
4. Projection We can project any data point x_* ∈ R^D onto the principal subspace. To get this right, we need to standardize x_* using the mean µ_d and standard deviation σ_d of the training data in the dth dimension, respectively, so that

x_*^{(d)} ← (x_*^{(d)} − µ_d) / σ_d ,   d = 1, . . . , D ,   (10.58)

where x_*^{(d)} is the dth component of x_*. We obtain the projection as

x̃_* = B B^⊤ x_*   (10.59)

with coordinates

z_* = B^⊤ x_*   (10.60)

with respect to the basis of the principal subspace. Here, B is the matrix that contains the eigenvectors that are associated with the largest eigenvalues of the data covariance matrix as columns. PCA returns the coordinates (10.60), not the projections x̃_*. These four steps are sketched in code after this list.
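The sketch below is our own compact NumPy illustration of steps 1–4 (not the book's reference implementation); it also undoes the standardization to map the projected points back into the original data space, as in Figure 10.11(f).

```python
import numpy as np

def pca_pipeline(X, X_star, M):
    """Steps 1-4 of PCA in practice for training data X (N x D) and query points X_star."""
    mu = X.mean(axis=0)                                  # step 1: dataset mean
    sigma = X.std(axis=0)                                # step 2: per-dimension standard deviation
    Z = (X - mu) / sigma                                 # standardized training data

    S = Z.T @ Z / Z.shape[0]                             # step 3: covariance and eigendecomposition
    eigvals, eigvecs = np.linalg.eigh(S)
    B = eigvecs[:, np.argsort(eigvals)[::-1][:M]]        # top-M eigenvectors as columns

    X_std = (X_star - mu) / sigma                        # step 4: standardize query points, (10.58)
    codes = X_std @ B                                    # coordinates z_* = B^T x_*, (10.60)
    proj = codes @ B.T                                   # projections B B^T x_*, (10.59)
    recon = proj * sigma + mu                            # undo standardization (Figure 10.11(f))
    return codes, recon

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])   # toy 2D dataset
codes, recon = pca_pipeline(X, X[:5], M=1)
print(codes.shape, recon.shape)                          # (5, 1) (5, 2)
```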
For illustration purposes, we apply PCA to a subset of the MNIST digits, and we focus on the digit “8”. We used 5,389 training images of the digit “8” and determined the principal subspace as detailed in this chapter. We then used the learned projection matrix to reconstruct a set of test images, which is illustrated in Figure 10.12. The first row of Figure 10.12 shows a set of four original digits from the test set. The following rows show reconstructions of exactly these digits when using a principal subspace of dimensions 1, 10, 100, and 500, respectively. We see that even with a single-dimensional principal subspace we get a halfway decent reconstruction of the original digits, which, however, is blurry and generic. With an increasing number of principal components (PCs), the reconstructions become sharper and more details are accounted for. With 500 principal components, the reconstructions are close to the original images.

10.7 Latent Variable Perspective

In the previous sections, we derived PCA without any notion of a probabilistic model using the maximum-variance and the projection perspectives. On the one hand, this approach may be appealing as it allows us to sidestep all the mathematical difficulties that come with probability theory, but on the other hand, a probabilistic model would offer us more flexibility and useful insights. More specifically, a probabilistic model would

- come with a likelihood function, and we can explicitly deal with noisy observations (which we did not even discuss earlier);
- allow us to do Bayesian model comparison via the marginal likelihood as discussed in Section 8.6;
- view PCA as a generative model, which allows us to simulate new data.
10.7.1 Generative Process and Probabilistic Model

In PPCA, we explicitly write down the probabilistic model for linear dimensionality reduction. For this we assume a continuous latent variable z ∈ R^M with a standard-normal prior p(z) = N(0, I) and a linear relationship between the latent variables and the observed data x, where

x = B z + µ + ϵ ∈ R^D ,   (10.63)

where ϵ ∼ N(0, σ² I) is Gaussian observation noise and B ∈ R^{D×M} and µ ∈ R^D describe the linear/affine mapping from latent to observed variables. Therefore, PPCA links latent and observed variables via

p(x | z, B, µ, σ²) = N(x | B z + µ, σ² I) .   (10.64)

Overall, PPCA induces the following generative process:

z_n ∼ N(z | 0, I) ,   (10.65)
x_n | z_n ∼ N(x | B z_n + µ, σ² I) .   (10.66)

To generate a data point that is typical given the model parameters, we follow an ancestral sampling scheme: We first sample a latent variable z_n from p(z). Then we use z_n in (10.64) to sample a data point conditioned on the sampled z_n, i.e., x_n ∼ p(x | z_n, B, µ, σ²).
This generative process allows us to write down the probabilistic model (i.e., the joint distribution of all random variables; see Section 8.4) as

p(x, z | B, µ, σ²) = p(x | z, B, µ, σ²) p(z) ,   (10.67)

which immediately gives rise to the graphical model in Figure 10.14 using the results from Section 8.5.

Example 10.5 (Generating New Data Using Latent Variables)

Figure 10.15 Generating new MNIST digits: the latent variables z can be used to generate new data x̃ = B z; the closer we stay to the training data, the more realistic the generated data.

Figure 10.15 shows the latent coordinates of the MNIST digits “8” found by PCA when using a two-dimensional principal subspace (blue dots). We can query any vector z_* in this latent space and generate an image x̃_* = B z_* that resembles the digit “8”. We show eight of such generated images with their corresponding latent space representation. Depending on where we query the latent space, the generated images look different (shape, rotation, size, etc.). If we query away from the training data, we see more and more artifacts, e.g., the top-left and top-right digits. Note that the intrinsic dimensionality of these generated images is only two.
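A minimal ancestral-sampling sketch for the generative process (10.65)–(10.66) follows; it is our own illustration, and the specific B, µ, and σ below are made-up values rather than parameters from the book.

```python
import numpy as np

rng = np.random.default_rng(4)
D, M, N = 5, 2, 1000

B = rng.normal(size=(D, M))                      # hypothetical model parameters
mu = rng.normal(size=D)
sigma = 0.1

Z = rng.normal(size=(N, M))                      # z_n ~ N(0, I), (10.65)
eps = sigma * rng.normal(size=(N, D))            # Gaussian observation noise
X = Z @ B.T + mu + eps                           # x_n | z_n ~ N(B z_n + mu, sigma^2 I), (10.66)

# The empirical covariance approaches B B^T + sigma^2 I for large N.
print(np.round(np.cov(X, rowvar=False) - (B @ B.T + sigma**2 * np.eye(D)), 1))
```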
10.8 Further Reading

PCA can also be viewed as a linear auto-encoder, as illustrated in Figure 10.16.

Figure 10.16 PCA can be viewed as a linear auto-encoder: it encodes the high-dimensional data x into a lower-dimensional representation (code) z ∈ R^M and decodes z using a decoder; the decoded vector x̃ is the orthogonal projection of the original data x onto the M-dimensional principal subspace.

If the code is given by z_n = B^⊤ x_n ∈ R^M and we are interested in minimizing the average squared error between the data x_n and its reconstruction x̃_n = B z_n, n = 1, . . . , N, we obtain

(1/N) Σ_{n=1}^N ∥x_n − x̃_n∥² = (1/N) Σ_{n=1}^N ∥x_n − B B^⊤ x_n∥² .   (10.76)

This means we end up with the same objective function as in (10.29) that we discussed in Section 10.3, so that we obtain the PCA solution when we minimize the squared auto-encoding loss. If we replace the linear mapping of PCA with a nonlinear mapping, we get a nonlinear auto-encoder. A prominent example of this is a deep auto-encoder, where the linear functions are replaced with deep neural networks. In this context, the encoder is also known as a recognition network or inference network, whereas the decoder is also called a generator.

Another interpretation of PCA is related to information theory. We can think of the code as a smaller or compressed version of the original data point. When we reconstruct our original data using the code, we do not get the exact data point back, but a slightly distorted or noisy version of it. This means that our compression is “lossy”. Intuitively, we want to maximize the correlation between the original data and the lower-dimensional code. More formally, this is related to the mutual information. We would then get the same solution to PCA we discussed in Section 10.3 by maximizing the mutual information, a core concept in information theory (MacKay, 2003).

In our discussion on PPCA, we assumed that the parameters of the model, i.e., B, µ, and the likelihood parameter σ², are known. Tipping and Bishop (1999) describe how to derive maximum likelihood estimates for these parameters in the PPCA setting (note that we use a different notation in this chapter). The maximum likelihood parameters, when projecting D-dimensional data onto an M-dimensional subspace, are

µ_ML = (1/N) Σ_{n=1}^N x_n ,   (10.77)
B_ML = T (Λ − σ² I)^{1/2} R ,   (10.78)
σ²_ML = (1/(D − M)) Σ_{j=M+1}^D λ_j ,   (10.79)

where T ∈ R^{D×M} contains M eigenvectors of the data covariance matrix, Λ = diag(λ_1, . . . , λ_M) ∈ R^{M×M} is a diagonal matrix with the eigenvalues associated with the principal axes on its diagonal, and R ∈ R^{M×M} is an arbitrary orthogonal matrix. The matrix Λ − σ² I in (10.78) is guaranteed to be positive semidefinite as the smallest eigenvalue of the data covariance matrix is bounded from below by the noise variance σ². The maximum likelihood solution B_ML is unique up to an arbitrary orthogonal transformation, e.g., we can right-multiply B_ML with any rotation matrix R so that (10.78) essentially is a singular value decomposition (see Section 4.5). An outline of the proof is given by Tipping and Bishop (1999).

The maximum likelihood estimate for µ given in (10.77) is the sample mean of the data. The maximum likelihood estimator for the observation noise variance σ² given in (10.79) is the average variance in the orthogonal complement of the principal subspace, i.e., the average leftover variance that we cannot capture with the first M principal components is treated as observation noise.

In the noise-free limit where σ → 0, PPCA and PCA provide identical solutions: Since the data covariance matrix S is symmetric, it can be diagonalized (see Section 4.4), i.e., there exists a matrix T of eigenvectors of S so that

S = T Λ T^{−1} .   (10.80)

In the PPCA model, the data covariance matrix is the covariance matrix of the Gaussian likelihood p(x | B, µ, σ²), which is B B^⊤ + σ² I; see (10.70b). For σ → 0, we obtain B B^⊤, so that this data covariance must equal the PCA data covariance (and its factorization given in (10.80)), so that

Cov[X] = T Λ T^{−1} = B B^⊤ ⟺ B = T Λ^{1/2} R ,   (10.81)

i.e., we obtain the maximum likelihood estimate in (10.78) for σ = 0. From (10.78) and (10.80), it becomes clear that (P)PCA performs a decomposition of the data covariance matrix.

In a streaming setting, where data arrives sequentially, it is recommended to use the iterative expectation maximization (EM) algorithm for maximum likelihood estimation (Roweis, 1998).
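Equations (10.77)–(10.79) can be computed directly from the eigendecomposition of the data covariance matrix. The sketch below is ours (not from Tipping and Bishop's paper or the book) and fixes the arbitrary orthogonal matrix R to the identity.

```python
import numpy as np

def ppca_ml(X, M):
    """Maximum likelihood PPCA parameters (10.77)-(10.79), with R = I."""
    N, D = X.shape
    mu = X.mean(axis=0)                                   # (10.77): sample mean
    S = (X - mu).T @ (X - mu) / N
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    sigma2 = eigvals[M:].mean()                           # (10.79): average discarded variance
    T = eigvecs[:, :M]                                    # top-M eigenvectors
    Lam = np.diag(eigvals[:M])
    B = T @ np.sqrt(Lam - sigma2 * np.eye(M))             # (10.78) with R = I
    return mu, B, sigma2

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # toy dataset
mu, B, sigma2 = ppca_ml(X, M=2)
print(B.shape, sigma2)
```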
To determine the dimensionality of the latent variables (the length of the code, the dimensionality of the lower-dimensional subspace onto which we project the data), Gavish and Donoho (2014) suggest the heuristic that, if we can estimate the noise variance σ² of the data, we should discard all singular values smaller than 4σ√D/√3. Alternatively, we can use (nested) cross-validation (Section 8.6.1) or Bayesian model selection criteria (discussed in Section 8.6.2) to determine a good estimate of the intrinsic dimensionality of the data (Minka, 2001b).

Similar to our discussion on linear regression in Chapter 9, we can place a prior distribution on the parameters of the model and integrate them out. By doing so, we (a) avoid point estimates of the parameters and the issues that come with these point estimates (see Section 8.6) and (b) allow for an automatic selection of the appropriate dimensionality M of the latent space. In this Bayesian PCA, which was proposed by Bishop (1999), a prior p(µ, B, σ²) is placed on the model parameters. The generative process allows us to integrate the model parameters out instead of conditioning on them, which addresses overfitting issues. Since this integration is analytically intractable, Bishop (1999) proposes to use approximate inference methods, such as MCMC or variational inference. We refer to the work by Gilks et al. (1996) and Blei et al. (2017) for more details on these approximate inference techniques.

In PPCA, we considered the linear model p(x_n | z_n) = N(x_n | B z_n + µ, σ² I) with prior p(z_n) = N(0, I), where all observation dimensions are affected by the same amount of noise. If we allow each observation dimension d to have a different variance σ_d², we obtain factor analysis (FA) (Spearman, 1904; Bartholomew et al., 2011). This means that FA gives the likelihood some more flexibility than PPCA, but still forces the data to be explained by the model parameters B, µ. (An overly flexible likelihood would be able to explain more than just the noise.) However, FA no longer allows for a closed-form maximum likelihood solution, so that we need to use an iterative scheme, such as the expectation maximization algorithm, to estimate the model parameters. While in PPCA all stationary points are global optima, this no longer holds for FA. Compared to PPCA, FA does not change if we scale the data, but it does return different solutions if we rotate the data.

An algorithm that is also closely related to PCA is independent component analysis (ICA) (Hyvarinen et al., 2001). Starting again with the latent-variable perspective p(x_n | z_n) = N(x_n | B z_n + µ, σ² I), we now change the prior on z_n to non-Gaussian distributions. ICA can be used for blind-source separation. Imagine you are in a busy train station with many people talking. Your ears play the role of microphones, and they linearly mix different speech signals in the train station. The goal of blind-source separation is to identify the constituent parts of the mixed signals. As discussed previously in the context of maximum likelihood estimation for PPCA, the original PCA solution is invariant to any rotation. Therefore, PCA can identify the best lower-dimensional subspace in which the signals live, but not the signals themselves (Murphy, 2012). ICA addresses this issue by modifying the prior distribution p(z) on the latent sources to require non-Gaussian priors. We refer to the books by Hyvarinen et al. (2001) and Murphy (2012) for more details on ICA.

PCA, factor analysis, and ICA are three examples of dimensionality reduction with linear models. Cunningham and Ghahramani (2015) provide a broader survey of linear dimensionality reduction.

The (P)PCA model we discussed here allows for several important extensions. In Section 10.5, we explained how to do PCA when the input dimensionality D is significantly greater than the number N of data points. By exploiting the insight that PCA can be performed by computing (many) inner products, this idea can be pushed to the extreme by considering infinite-dimensional features. The kernel trick is the basis of kernel PCA and allows us to implicitly compute inner products between infinite-dimensional features (Schölkopf et al., 1998; Schölkopf and Smola, 2002).

There are nonlinear dimensionality reduction techniques that are derived from PCA (Burges (2010) provides a good overview). The auto-encoder perspective of PCA that we discussed previously in this section can be used to render PCA as a special case of a deep auto-encoder. In the deep auto-encoder, both the encoder and the decoder are represented by multilayer feedforward neural networks, which themselves are nonlinear mappings. If we set the activation functions in these neural networks to be the identity, the model becomes equivalent to PCA. A different approach to nonlinear dimensionality reduction is the Gaussian process latent-variable model (GP-LVM) proposed by Lawrence (2005). The GP-LVM starts off with the latent-variable perspective that we used to derive PPCA and replaces the linear relationship between the latent variables z and the observations x with a Gaussian process (GP). Instead of estimating the parameters of the mapping (as we do in PPCA), the GP-LVM marginalizes out the model parameters and makes point estimates of the latent variables z. Similar to Bayesian PCA, the Bayesian GP-LVM proposed by Titsias and Lawrence (2010) maintains a distribution on the latent variables z and uses approximate inference to integrate them out as well.
11.1 Gaussian Mixture Model

A Gaussian mixture model (GMM) is a density of the form

p(x | θ) = Σ_{k=1}^K π_k N(x | µ_k, Σ_k) ,   (11.3)
0 ⩽ π_k ⩽ 1 ,   Σ_{k=1}^K π_k = 1 .   (11.4)

Figure 11.2 shows the weighted components and the mixture density, which is given as

p(x | θ) = 0.5 N(x | −2, 1/2) + 0.2 N(x | 1, 2) + 0.3 N(x | 4, 1) .   (11.5)

Figure 11.2 Gaussian mixture model: the mixture density (black) is a convex combination of Gaussian distributions and is more expressive than any individual component; dashed lines represent the weighted Gaussian components.

11.2 Parameter Learning via Maximum Likelihood

Assume we are given a dataset X = {x_1, . . . , x_N}, where x_n, n = 1, . . . , N, are drawn i.i.d. from an unknown distribution p(x). Our objective is to find a good approximation/representation of this unknown distribution p(x) by means of a GMM with K mixture components. The parameters of the GMM are the K means µ_k, the covariances Σ_k, and mixture weights π_k. We summarize all these free parameters in θ := {π_k, µ_k, Σ_k : k = 1, . . . , K}.
Throughout this chapter, we will have a simple running example that helps us illustrate and visualize important concepts.

Example 11.1 (Initial Setting)
We initialize the three mixture components as

p_1(x) = N(x | −4, 1) ,   (11.6)
p_2(x) = N(x | 0, 0.2) ,   (11.7)
p_3(x) = N(x | 8, 3) ,   (11.8)

and assign them equal weights π_1 = π_2 = π_3 = 1/3. The corresponding model (and the data points) are shown in Figure 11.3.

Figure 11.3 Initial setting: GMM (black) with three mixture components.

In the following, we detail how to obtain a maximum likelihood estimate θ_ML of the model parameters θ. We start by writing down the likelihood, i.e., the predictive distribution of the training data given the parameters. We exploit our i.i.d. assumption, which leads to the factorized likelihood

p(X | θ) = Π_{n=1}^N p(x_n | θ) ,   p(x_n | θ) = Σ_{k=1}^K π_k N(x_n | µ_k, Σ_k) ,   (11.9)

where every individual likelihood term p(x_n | θ) is a Gaussian mixture density. Then we obtain the log-likelihood as

log p(X | θ) = Σ_{n=1}^N log p(x_n | θ) = Σ_{n=1}^N log Σ_{k=1}^K π_k N(x_n | µ_k, Σ_k) =: L .   (11.10)

We aim to find parameters θ*_ML that maximize the log-likelihood L defined in (11.10). Our “normal” procedure would be to compute the gradient dL/dθ of the log-likelihood with respect to the model parameters θ, set it to 0, and solve for θ. However, unlike our previous examples for maximum likelihood estimation (e.g., when we discussed linear regression in Section 9.2), we cannot obtain a closed-form solution. However, we can exploit an iterative scheme to find good model parameters θ_ML, which will turn out to be the EM algorithm for GMMs. The key idea is to update one model parameter at a time while keeping the others fixed.

Remark. If we were to model the density with a single Gaussian, the sum over k in (11.10) would vanish and the log could be applied directly to the Gaussian component, so that

log N(x | µ, Σ) = −(D/2) log(2π) − (1/2) log det(Σ) − (1/2)(x − µ)^⊤ Σ^{−1}(x − µ) .   (11.11)

This simple form allows us to find closed-form maximum likelihood estimates of µ and Σ, as discussed in Chapter 8. In (11.10), we cannot move the log into the sum over k, so that we cannot obtain a simple closed-form maximum likelihood solution. ♢
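The log-likelihood (11.10) of a one-dimensional GMM is straightforward to evaluate. The sketch below is our own illustration: it uses the initial components from Example 11.1, but the data points are made up purely for demonstration.

```python
import numpy as np

def log_likelihood(X, pis, mus, sigmas):
    """Log-likelihood (11.10) of a one-dimensional GMM at data points X."""
    X = np.asarray(X)[:, None]                       # shape (N, 1)
    # Gaussian densities N(x_n | mu_k, sigma_k^2) for all n, k:
    dens = np.exp(-0.5 * (X - mus) ** 2 / sigmas**2) / np.sqrt(2 * np.pi * sigmas**2)
    return np.sum(np.log(dens @ pis))                # sum_n log sum_k pi_k N(...)

# Initial components from Example 11.1: means -4, 0, 8; variances 1, 0.2, 3; equal weights.
pis = np.array([1/3, 1/3, 1/3])
mus = np.array([-4.0, 0.0, 8.0])
sigmas = np.sqrt(np.array([1.0, 0.2, 3.0]))          # (11.6)-(11.8) specify variances

X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0]) # illustrative data points (not from the book)
print(log_likelihood(X, pis, mus, sigmas))
```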
Any local optimum of a function exhibits the property that its gradient with respect to the parameters must vanish (necessary condition); see Chapter 7. In our case, we obtain the following necessary conditions when we optimize the log-likelihood in (11.10) with respect to the GMM parameters µ_k, Σ_k, π_k:

∂L/∂µ_k = 0^⊤ ⟺ Σ_{n=1}^N ∂ log p(x_n | θ)/∂µ_k = 0^⊤ ,   (11.12)
∂L/∂Σ_k = 0 ⟺ Σ_{n=1}^N ∂ log p(x_n | θ)/∂Σ_k = 0 ,   (11.13)
∂L/∂π_k = 0 ⟺ Σ_{n=1}^N ∂ log p(x_n | θ)/∂π_k = 0 .   (11.14)

For all three necessary conditions, by applying the chain rule (see Section 5.2.2), we require partial derivatives of the form

∂ log p(x_n | θ)/∂θ = (1/p(x_n | θ)) ∂p(x_n | θ)/∂θ ,   (11.15)

where θ = {µ_k, Σ_k, π_k, k = 1, . . . , K} are the model parameters and

1/p(x_n | θ) = 1 / ( Σ_{j=1}^K π_j N(x_n | µ_j, Σ_j) ) .   (11.16)

In the following, we will compute the partial derivatives (11.12) through (11.14). But before we do this, we introduce a quantity that will play a central role in the remainder of this chapter: responsibilities.

11.2.1 Responsibilities

We define the quantity

r_{nk} := π_k N(x_n | µ_k, Σ_k) / ( Σ_{j=1}^K π_j N(x_n | µ_j, Σ_j) )   (11.17)

as the responsibility of the kth mixture component for the nth data point. The responsibility r_{nk} of the kth mixture component for data point x_n is proportional to the likelihood

p(x_n | π_k, µ_k, Σ_k) = π_k N(x_n | µ_k, Σ_k)   (11.18)

of the mixture component given the data point. Therefore, mixture components have a high responsibility for a data point when the data point could be a plausible sample from that mixture component. Note that r_n := [r_{n1}, . . . , r_{nK}]^⊤ ∈ R^K is a (normalized) probability vector, i.e., Σ_k r_{nk} = 1 with r_{nk} ⩾ 0. (The vector r_n follows a Boltzmann/Gibbs distribution.) This probability vector distributes probability mass among the K mixture components, and we can think of r_n as a “soft assignment” of x_n to the K mixture components. Therefore, the responsibility r_{nk} from (11.17) represents the probability that x_n has been generated by the kth mixture component.

Example 11.2 (Responsibilities)
For our example from Figure 11.3, we compute the responsibilities r_{nk}:

    1.0    0.0    0.0
    1.0    0.0    0.0
    0.057  0.943  0.0
    0.001  0.999  0.0
    0.0    0.066  0.934
    0.0    0.0    1.0
    0.0    0.0    1.0      ∈ R^{N×K} .   (11.19)

Here the nth row tells us the responsibilities of all mixture components for x_n. The sum of all K responsibilities for a data point (sum of every row) is 1. The kth column gives us an overview of the responsibility of the kth mixture component. We can see that the third mixture component (third column) is not responsible for any of the first four data points, but takes much responsibility for the remaining data points. The sum of all entries of a column gives us the value N_k, i.e., the total responsibility of the kth mixture component. In our example, we get N_1 = 2.058, N_2 = 2.008, N_3 = 2.934.

In the following, we determine the updates of the model parameters µ_k, Σ_k, π_k for given responsibilities. We will see that the update equations all depend on the responsibilities, which makes a closed-form solution to the maximum likelihood estimation problem impossible. However, for given responsibilities we will be updating one model parameter at a time, while keeping the others fixed. After this, we will recompute the responsibilities. Iterating these two steps will eventually converge to a local optimum and is a specific instantiation of the EM algorithm. We will discuss this in some more detail in Section 11.3.
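The responsibilities (11.17) can be computed in one vectorized step. The sketch below is our own illustration for a one-dimensional GMM; the data points are made up, so the numbers will differ from (11.19), but the structure (one row of responsibilities per data point, each row summing to 1) is the same.

```python
import numpy as np

def responsibilities(X, pis, mus, sigmas):
    """Responsibility matrix r_nk from (11.17) for a one-dimensional GMM."""
    X = np.asarray(X)[:, None]
    dens = np.exp(-0.5 * (X - mus) ** 2 / sigmas**2) / np.sqrt(2 * np.pi * sigmas**2)
    weighted = dens * pis                     # pi_k * N(x_n | mu_k, sigma_k^2)
    return weighted / weighted.sum(axis=1, keepdims=True)

pis = np.array([1/3, 1/3, 1/3])
mus = np.array([-4.0, 0.0, 8.0])
sigmas = np.sqrt(np.array([1.0, 0.2, 3.0]))
X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])   # illustrative data points

R = responsibilities(X, pis, mus, sigmas)
print(np.round(R, 3))          # each row sums to 1
print(R.sum(axis=0))           # column sums give the total responsibilities N_k
```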
11.2.2 Updating the Means

Theorem 11.1 (Update of the GMM Means). The update of the mean parameters µ_k, k = 1, . . . , K, of the GMM is given by

µ_k^new = ( Σ_{n=1}^N r_{nk} x_n ) / ( Σ_{n=1}^N r_{nk} ) ,   (11.20)

where the responsibilities r_{nk} are defined in (11.17).

Remark. The update of the means µ_k of the individual mixture components in (11.20) depends on all means, covariance matrices Σ_k, and mixture weights π_k via r_{nk} given in (11.17). Therefore, we cannot obtain a closed-form solution for all µ_k at once. ♢

Proof From (11.15), we see that the gradient of the log-likelihood with respect to the mean parameters µ_k, k = 1, . . . , K, requires us to compute the partial derivative

∂p(x_n | θ)/∂µ_k = Σ_{j=1}^K π_j ∂N(x_n | µ_j, Σ_j)/∂µ_k = π_k ∂N(x_n | µ_k, Σ_k)/∂µ_k   (11.21a)
= π_k (x_n − µ_k)^⊤ Σ_k^{−1} N(x_n | µ_k, Σ_k) ,   (11.21b)

where we exploited that only the kth mixture component depends on µ_k.
We use our result from (11.21b) in (11.15) and put everything together so that the desired partial derivative of L with respect to µ_k is given as

∂L/∂µ_k = Σ_{n=1}^N ∂ log p(x_n | θ)/∂µ_k = Σ_{n=1}^N (1/p(x_n | θ)) ∂p(x_n | θ)/∂µ_k   (11.22a)
= Σ_{n=1}^N (x_n − µ_k)^⊤ Σ_k^{−1} [ π_k N(x_n | µ_k, Σ_k) / Σ_{j=1}^K π_j N(x_n | µ_j, Σ_j) ]   (11.22b)
= Σ_{n=1}^N r_{nk} (x_n − µ_k)^⊤ Σ_k^{−1} .   (11.22c)

Here we used the identity from (11.16) and the result of the partial derivative in (11.21b) to get to (11.22b); the bracketed term in (11.22b) is exactly the responsibility r_{nk} we defined in (11.17).
We now solve (11.22c) for µ_k^new so that ∂L(µ_k^new)/∂µ_k = 0^⊤ and obtain

Σ_{n=1}^N r_{nk} x_n = Σ_{n=1}^N r_{nk} µ_k^new ⟺ µ_k^new = ( Σ_{n=1}^N r_{nk} x_n ) / N_k = (1/N_k) Σ_{n=1}^N r_{nk} x_n ,   (11.23)

where we defined

N_k := Σ_{n=1}^N r_{nk}   (11.24)

as the total responsibility of the kth mixture component for the entire dataset. This concludes the proof of Theorem 11.1.

Intuitively, (11.20) can be interpreted as an importance-weighted Monte Carlo estimate of the mean, where the importance weights of data point x_n are the responsibilities r_{nk} of the kth cluster for x_n, k = 1, . . . , K. Therefore, the mean µ_k is pulled toward a data point x_n with strength given by r_{nk}. The means are pulled stronger toward data points for which the corresponding mixture component has a high responsibility, i.e., a high likelihood. Figure 11.4 illustrates this. We can also interpret the mean update in (11.20) as the expected value of all data points under the distribution given by

r_k := [r_{1k}, . . . , r_{Nk}]^⊤ / N_k ,   (11.25)

which is a normalized probability vector, i.e.,

µ_k ← E_{r_k}[X] .   (11.26)

Figure 11.4 Update of the mean parameter of a mixture component in a GMM: the mean µ is pulled toward individual data points with the weights given by the corresponding responsibilities.

Example 11.3 (Mean Updates)
In our example from Figure 11.3, the mean values are updated as follows:

µ_1 : −4 → −2.7   (11.27)
µ_2 : 0 → −0.4   (11.28)
µ_3 : 8 → 3.7   (11.29)

Here we see that the means of the first and third mixture component move toward the regime of the data, whereas the mean of the second component does not change so dramatically. Figure 11.5 illustrates this change, where Figure 11.5(a) shows the GMM density prior to updating the means and Figure 11.5(b) shows the GMM density after updating the mean values µ_k.

Figure 11.5 Effect of updating the mean values in a GMM: (a) GMM density and individual components prior to updating the mean values; (b) GMM density and individual components after updating the mean values µ_k while retaining the variances and mixture weights.

The update of the mean parameters in (11.20) looks fairly straightforward. However, note that the responsibilities r_{nk} are a function of π_j, µ_j, Σ_j for all j = 1, . . . , K, such that the updates in (11.20) depend on all parameters of the GMM, and a closed-form solution, which we obtained for linear regression in Section 9.2 or PCA in Chapter 10, cannot be obtained.
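Given a responsibility matrix, the mean update (11.20) is just a responsibility-weighted average of the data. Below is a short self-contained one-dimensional sketch of ours; the responsibilities are made up purely for illustration.

```python
import numpy as np

R = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])               # made-up responsibilities r_nk
X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])

N_k = R.sum(axis=0)                           # total responsibilities, (11.24)
mu_new = (R * X[:, None]).sum(axis=0) / N_k   # weighted means, (11.20)
print(mu_new)
```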
11.2.3 Updating the Covariances

Theorem 11.2 (Update of the GMM Covariances). The update of the covariance parameters Σ_k, k = 1, . . . , K, of the GMM is given by

Σ_k^new = (1/N_k) Σ_{n=1}^N r_{nk} (x_n − µ_k)(x_n − µ_k)^⊤ ,   (11.30)

where r_{nk} and N_k are defined in (11.17) and (11.24), respectively.

Proof To prove Theorem 11.2, our approach is to compute the partial derivatives of the log-likelihood L with respect to the covariances Σ_k, set them to 0, and solve for Σ_k. We start with our general approach

∂L/∂Σ_k = Σ_{n=1}^N ∂ log p(x_n | θ)/∂Σ_k = Σ_{n=1}^N (1/p(x_n | θ)) ∂p(x_n | θ)/∂Σ_k .   (11.31)

We already know 1/p(x_n | θ) from (11.16). To obtain the remaining partial derivative ∂p(x_n | θ)/∂Σ_k, we write down the definition of the Gaussian distribution p(x_n | θ) (see (11.9)) and drop all terms but the kth. We then obtain

∂p(x_n | θ)/∂Σ_k = ∂/∂Σ_k ( π_k (2π)^{−D/2} det(Σ_k)^{−1/2} exp(−½ (x_n − µ_k)^⊤ Σ_k^{−1} (x_n − µ_k)) )   (11.32b)
= π_k (2π)^{−D/2} [ (∂/∂Σ_k det(Σ_k)^{−1/2}) exp(−½ (x_n − µ_k)^⊤ Σ_k^{−1} (x_n − µ_k)) + det(Σ_k)^{−1/2} ∂/∂Σ_k exp(−½ (x_n − µ_k)^⊤ Σ_k^{−1} (x_n − µ_k)) ] .   (11.32c)

We now use the identities

∂/∂Σ_k det(Σ_k)^{−1/2} = −½ det(Σ_k)^{−1/2} Σ_k^{−1} ,   (11.33)
∂/∂Σ_k (x_n − µ_k)^⊤ Σ_k^{−1} (x_n − µ_k) = −Σ_k^{−1} (x_n − µ_k)(x_n − µ_k)^⊤ Σ_k^{−1}   (11.34)

(see (5.101) and (5.103)) and obtain (after some rearranging) the desired partial derivative required in (11.31) as

∂p(x_n | θ)/∂Σ_k = π_k N(x_n | µ_k, Σ_k) · ( −½ (Σ_k^{−1} − Σ_k^{−1} (x_n − µ_k)(x_n − µ_k)^⊤ Σ_k^{−1}) ) .   (11.35)

Putting everything together, the partial derivative of the log-likelihood with respect to Σ_k is

∂L/∂Σ_k = Σ_{n=1}^N ∂ log p(x_n | θ)/∂Σ_k = Σ_{n=1}^N (1/p(x_n | θ)) ∂p(x_n | θ)/∂Σ_k   (11.36a)
= Σ_{n=1}^N [ π_k N(x_n | µ_k, Σ_k) / Σ_{j=1}^K π_j N(x_n | µ_j, Σ_j) ] · ( −½ (Σ_k^{−1} − Σ_k^{−1} (x_n − µ_k)(x_n − µ_k)^⊤ Σ_k^{−1}) )   (11.36b)
= −½ Σ_{n=1}^N r_{nk} (Σ_k^{−1} − Σ_k^{−1} (x_n − µ_k)(x_n − µ_k)^⊤ Σ_k^{−1})   (11.36c)
= −½ Σ_k^{−1} Σ_{n=1}^N r_{nk} + ½ Σ_k^{−1} ( Σ_{n=1}^N r_{nk} (x_n − µ_k)(x_n − µ_k)^⊤ ) Σ_k^{−1} ,   (11.36d)

where Σ_{n=1}^N r_{nk} = N_k. We see that the responsibilities r_{nk} also appear in this partial derivative. Setting this partial derivative to 0, we obtain the necessary optimality condition

N_k Σ_k^{−1} = Σ_k^{−1} ( Σ_{n=1}^N r_{nk} (x_n − µ_k)(x_n − µ_k)^⊤ ) Σ_k^{−1}   (11.37a)
⟺ N_k I = ( Σ_{n=1}^N r_{nk} (x_n − µ_k)(x_n − µ_k)^⊤ ) Σ_k^{−1} .   (11.37b)

By solving for Σ_k, we obtain

Σ_k^new = (1/N_k) Σ_{n=1}^N r_{nk} (x_n − µ_k)(x_n − µ_k)^⊤ ,   (11.38)

where r_k is the probability vector defined in (11.25). This gives us a simple update rule for Σ_k for k = 1, . . . , K and proves Theorem 11.2.

Similar to the update of µ_k in (11.20), we can interpret the update of the covariance in (11.30) as an importance-weighted expected value of the square of the centered data X̃_k := {x_1 − µ_k, . . . , x_N − µ_k}.

Example 11.4 (Variance Updates)
In our example from Figure 11.3, the variances are updated as follows:

σ_1² : 1 → 0.14   (11.39)
σ_2² : 0.2 → 0.44   (11.40)
σ_3² : 3 → 1.53   (11.41)
Here we see that the variances of the first and third component shrink significantly, whereas the variance of the second component increases slightly.
Figure 11.6 illustrates this setting. Figure 11.6(a) is identical (but zoomed in) to Figure 11.5(b) and shows the GMM density and its individual components prior to updating the variances. Figure 11.6(b) shows the GMM density after updating the variances.

Figure 11.6 Effect of updating the variances in a GMM: (a) GMM density and individual components prior to updating the variances; (b) GMM density and individual components after updating the variances while retaining the means and mixture weights.

Similar to the update of the mean parameters, we can interpret (11.30) as a Monte Carlo estimate of the weighted covariance of data points x_n associated with the kth mixture component, where the weights are the responsibilities r_{nk}. As with the updates of the mean parameters, this update depends on all π_j, µ_j, Σ_j, j = 1, . . . , K, through the responsibilities r_{nk}, which prohibits a closed-form solution.

11.2.4 Updating the Mixture Weights

Theorem 11.3 (Update of the GMM Mixture Weights). The mixture weights of the GMM are updated as

π_k^new = N_k / N ,   k = 1, . . . , K ,   (11.42)

where N is the number of data points and N_k is defined in (11.24).

Proof To find the partial derivative of the log-likelihood with respect to the weight parameters π_k, k = 1, . . . , K, we account for the constraint Σ_k π_k = 1 by using Lagrange multipliers (see Section 7.2). The Lagrangian is

𝔏 = L + λ ( Σ_{k=1}^K π_k − 1 )   (11.43a)
= Σ_{n=1}^N log Σ_{k=1}^K π_k N(x_n | µ_k, Σ_k) + λ ( Σ_{k=1}^K π_k − 1 ) ,   (11.43b)

where L is the log-likelihood from (11.10) and the second term encodes the equality constraint that all the mixture weights need to sum up to 1. We obtain the partial derivative with respect to π_k as

∂𝔏/∂π_k = Σ_{n=1}^N N(x_n | µ_k, Σ_k) / ( Σ_{j=1}^K π_j N(x_n | µ_j, Σ_j) ) + λ   (11.44a)
= (1/π_k) Σ_{n=1}^N π_k N(x_n | µ_k, Σ_k) / ( Σ_{j=1}^K π_j N(x_n | µ_j, Σ_j) ) + λ = N_k/π_k + λ ,   (11.44b)

and the partial derivative with respect to the Lagrange multiplier λ as

∂𝔏/∂λ = Σ_{k=1}^K π_k − 1 .   (11.45)

Setting both partial derivatives to 0 (necessary condition for an optimum) yields the system of equations

π_k = −N_k/λ ,   (11.46)
1 = Σ_{k=1}^K π_k .   (11.47)

Using (11.46) in (11.47) and solving for π_k, we obtain

Σ_{k=1}^K π_k = 1 ⟺ −Σ_{k=1}^K N_k/λ = 1 ⟺ −N/λ = 1 ⟺ λ = −N .   (11.48)

This allows us to substitute −N for λ in (11.46) to obtain

π_k^new = N_k/N ,   (11.49)

which gives us the update for the weight parameters π_k and proves Theorem 11.3.

We can identify the mixture weight in (11.42) as the ratio of the total responsibility of the kth cluster and the number of data points. Since N = Σ_k N_k, the number of data points can also be interpreted as the total responsibility of all mixture components together, such that π_k is the relative importance of the kth mixture component for the dataset.

Remark. Since N_k = Σ_{n=1}^N r_{nk}, the update equation (11.42) for the mixture weights π_k also depends on all π_j, µ_j, Σ_j, j = 1, . . . , K via the responsibilities r_{nk}. ♢
Here we see that the third component gets more weight/importance, while the other components become slightly less important. Figure 11.7 illustrates the effect of updating the mixture weights. Figure 11.7(a) is identical to Figure 11.6(b) and shows the GMM density and its individual components prior to updating the mixture weights. Figure 11.7(b) shows the GMM density after updating the mixture weights (while retaining the means and variances).

Overall, having updated the means, the variances, and the weights once, we obtain the GMM shown in Figure 11.7(b). Compared with the initialization shown in Figure 11.3, we can see that the parameter updates caused the GMM density to shift some of its mass toward the data points. After this single update cycle, the GMM fit in Figure 11.7(b) is already remarkably better than its initialization from Figure 11.3. This is also evidenced by the log-likelihood values, which increased from −28.3 (initialization) to −14.4 after a full update cycle.

11.3 EM Algorithm

Unfortunately, the updates in (11.20), (11.30), and (11.42) do not constitute a closed-form solution for the updates of the parameters µ_k, Σ_k, π_k of the mixture model because the responsibilities r_{nk} depend on those parameters in a complex way. However, the results suggest a simple iterative scheme for finding a solution to the parameter estimation problem via maximum likelihood. The expectation maximization algorithm (EM algorithm) is such an iterative scheme; it alternates between two steps:

E-step: Evaluate the responsibilities r_{nk} using the current parameters.
M-step: Use the updated responsibilities to reestimate the parameters θ.

More precisely, for a GMM we iterate the following steps:

1. Initialize µ_k, Σ_k, π_k.
2. E-step: Evaluate the responsibilities r_{nk} for every data point x_n using the current parameters π_k, µ_k, Σ_k:

r_{nk} = π_k N(x_n | µ_k, Σ_k) / ( Σ_{j=1}^K π_j N(x_n | µ_j, Σ_j) ) .   (11.53)

3. M-step: Reestimate the parameters π_k, µ_k, Σ_k using the current responsibilities r_{nk} (from the E-step):

µ_k = (1/N_k) Σ_{n=1}^N r_{nk} x_n ,   (11.54)
Σ_k = (1/N_k) Σ_{n=1}^N r_{nk} (x_n − µ_k)(x_n − µ_k)^⊤ ,   (11.55)
π_k = N_k / N .   (11.56)

Having updated the means µ_k in (11.54), they are subsequently used in (11.55) to update the corresponding covariances. A code sketch of this loop follows the list.
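Putting (11.53)–(11.56) together gives a complete EM loop. The sketch below is our own illustration for a one-dimensional GMM (not the book's reference code); the dataset and the small variance floor are assumptions made purely to keep the example self-contained and numerically stable.

```python
import numpy as np

def em_gmm_1d(X, K, num_iters=50, seed=0):
    """EM for a one-dimensional GMM: E-step (11.53), M-step (11.54)-(11.56)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    mus = rng.choice(X, size=K, replace=False)          # initialize means at data points
    vars_ = np.full(K, X.var())                         # initialize variances
    pis = np.full(K, 1.0 / K)                           # initialize mixture weights

    for _ in range(num_iters):
        # E-step: responsibilities r_nk, (11.53)
        dens = np.exp(-0.5 * (X[:, None] - mus) ** 2 / vars_) / np.sqrt(2 * np.pi * vars_)
        R = dens * pis
        R = R / R.sum(axis=1, keepdims=True)

        # M-step: (11.54)-(11.56)
        N_k = R.sum(axis=0)
        mus = (R * X[:, None]).sum(axis=0) / N_k
        vars_ = (R * (X[:, None] - mus) ** 2).sum(axis=0) / N_k
        vars_ = np.maximum(vars_, 1e-6)                 # small floor to avoid degenerate components
        pis = N_k / N

    return pis, mus, vars_

X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])   # illustrative dataset
print(em_gmm_1d(X, K=3))
```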
Example 11.6 (GMM Fit)
When we run EM on our example from Figure 11.3, we obtain the final result shown in Figure 11.8(a) after five iterations, and Figure 11.8(b) shows how the negative log-likelihood evolves as a function of the EM iterations. The final GMM is given as

p(x) = 0.29 N(x | −2.75, 0.06) + 0.28 N(x | −0.50, 0.25) + 0.43 N(x | 3.64, 1.63) .   (11.57)

Figure 11.8 EM algorithm applied to the running example: (a) final GMM fit after the EM algorithm converges in five iterations; (b) negative log-likelihood as a function of the EM iteration.

We applied the EM algorithm to the two-dimensional dataset shown in Figure 11.1 with K = 3 mixture components. Figure 11.9 illustrates some steps of the EM algorithm and shows the negative log-likelihood as a function of the EM iteration (Figure 11.9(b)).

Figure 11.9 Illustration of the EM algorithm for fitting a Gaussian mixture model with three components to a two-dimensional dataset: (a) dataset; (b) negative log-likelihood (lower is better) as a function of the EM iterations, where the red dots indicate the iterations for which the corresponding GMM fits are shown in panels (c) EM initialization, (d) EM after one iteration, (e) EM after 10 iterations, and (f) EM after 62 iterations; the yellow discs indicate the means of the Gaussian mixture components.

Figure 11.10(a) shows the corresponding final GMM fit. Figure 11.10(b) visualizes the final responsibilities of the mixture components for the data points. The dataset is colored according to the responsibilities of the mixture components when EM converges. While a single mixture component is clearly responsible for the data on the left, the overlap of the two data clusters on the right could have been generated by two mixture components. It becomes clear that there are data points that cannot be uniquely assigned to a single component (either blue or yellow), such that the responsibilities of these two clusters for those points are around 0.5.

Figure 11.10 GMM fit and responsibilities when EM converges: (a) GMM fit after 62 iterations; (b) each data point is colored according to the responsibilities of the mixture components.

11.4 Latent-Variable Perspective

We can look at the GMM from the perspective of a discrete latent-variable model, i.e., where the latent variable z can attain only a finite set of values. This is in contrast to PCA, where the latent variables were continuous-valued numbers in R^M. The advantages of the probabilistic perspective are that (i) it will justify some ad hoc decisions we made in the previous sections, (ii) it allows for a concrete interpretation of the responsibilities as posterior probabilities, and (iii) the iterative algorithm for updating the model parameters can be derived in a principled manner as the EM algorithm for maximum likelihood parameter estimation in latent-variable models.

11.4.1 Generative Process and Probabilistic Model

To derive the probabilistic model for GMMs, it is useful to think about the generative process, i.e., the process that allows us to generate data, using a probabilistic model.
We assume a mixture model with K components and that a data point x can be generated by exactly one mixture component. We introduce a binary indicator variable z_k ∈ {0, 1} with two states (see Section 6.2) that indicates whether the kth mixture component generated that data point.
the data and the latent variables (see Section 8.4). With the prior p(z) defined in (11.59) and (11.60) and the conditional p(x | z) from (11.58), we obtain all K components of this joint distribution via

p(x, z_k = 1) = p(x \mid z_k = 1)\, p(z_k = 1) = \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k) \,. \qquad (11.61)

Summing over all K components yields the marginal

p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k) \,, \qquad (11.66b)

which we identify as the GMM model from (11.3). Given a dataset X, we immediately obtain the likelihood

p(X \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \,, \qquad (11.67)

which is exactly the GMM likelihood from (11.9). Therefore, the latent-variable model with latent indicators z_k is an equivalent way of thinking about a Gaussian mixture model.
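For numerical work, the likelihood (11.67) is best evaluated on the log scale with a log-sum-exp over the components. Below is a minimal sketch (not from the book) for one-dimensional data; the fitted parameters from (11.57) are used purely as example values.

import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def gmm_log_likelihood(x, weights, means, stds):
    # log p(X | theta) = sum_n log sum_k pi_k N(x_n | mu_k, sigma_k^2), computed stably.
    log_terms = np.log(weights)[None, :] + norm.logpdf(x[:, None], means[None, :], stds[None, :])
    return logsumexp(log_terms, axis=1).sum()

x = np.array([-2.8, -2.7, -0.6, 3.0, 3.6, 4.1])
print(gmm_log_likelihood(x, np.array([0.29, 0.28, 0.43]),
                         np.array([-2.75, -0.50, 3.64]),
                         np.sqrt(np.array([0.06, 0.25, 1.63]))))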
11.4.3 Posterior Distribution

Let us have a brief look at the posterior distribution on the latent variable z. According to Bayes' theorem, the posterior of the k-th component having generated data point x is

p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{p(x)} \,, \qquad (11.68)

where the marginal p(x) is given in (11.66b). This yields the posterior distribution for the k-th indicator variable z_k:

p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)} = \frac{\pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j)} \,, \qquad (11.69)

which we identify as the responsibility of the k-th mixture component for data point x. Note that we omitted the explicit conditioning on the GMM parameters π_k, µ_k, Σ_k, where k = 1, . . . , K.

11.4.4 Extension to a Full Dataset

Thus far, we have only discussed the case where the dataset consists of a single data point x. However, the concepts of the prior and posterior can be directly extended to the case of N data points X := {x_1, . . . , x_N}. In the probabilistic interpretation of the GMM, every data point x_n possesses its own latent variable

z_n = [z_{n1}, \ldots, z_{nK}]^\top \in \mathbb{R}^K \,. \qquad (11.70)

Previously (when we only considered a single data point x), we omitted the index n, but now this becomes important. We share the same prior distribution π across all latent variables z_n. The corresponding graphical model is shown in Figure 11.12, where we use the plate notation.

Figure 11.12 Graphical model for a GMM with N data points.

The conditional distribution p(x_1, . . . , x_N | z_1, . . . , z_N) factorizes over the data points and is given as

p(x_1, \ldots, x_N \mid z_1, \ldots, z_N) = \prod_{n=1}^{N} p(x_n \mid z_n) \,. \qquad (11.71)

To obtain the posterior distribution p(z_{nk} = 1 | x_n), we follow the same reasoning as in Section 11.4.3 and apply Bayes' theorem to obtain

p(z_{nk} = 1 \mid x_n) = \frac{p(x_n \mid z_{nk} = 1)\, p(z_{nk} = 1)}{\sum_{j=1}^{K} p(x_n \mid z_{nj} = 1)\, p(z_{nj} = 1)} \qquad (11.72a)
= \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} = r_{nk} \,. \qquad (11.72b)

This means that p(z_{nk} = 1 | x_n) is the (posterior) probability that the k-th mixture component generated data point x_n and corresponds to the responsibility r_nk we introduced in (11.17). Now the responsibilities have not only an intuitive but also a mathematically justified interpretation as posterior probabilities.

11.4.5 EM Algorithm Revisited

The EM algorithm that we introduced as an iterative scheme for maximum likelihood estimation can be derived in a principled way from the latent-variable perspective. Given a current setting θ^(t) of model parameters, the E-step calculates the expected log-likelihood

Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{z \mid x, \theta^{(t)}}\big[\log p(x, z \mid \theta)\big] \qquad (11.73a)
= \int \log p(x, z \mid \theta)\, p(z \mid x, \theta^{(t)})\, dz \,, \qquad (11.73b)

where the expectation of log p(x, z | θ) is taken with respect to the posterior p(z | x, θ^(t)) of the latent variables. The M-step selects an updated set of model parameters θ^(t+1) by maximizing (11.73b).

Although an EM iteration does increase the log-likelihood, there are no guarantees that EM converges to the maximum likelihood solution. It is possible that the EM algorithm converges to a local maximum of the log-likelihood. Different initializations of the parameters θ could be used in multiple EM runs to reduce the risk of ending up in a bad local optimum. We do not go into further details here, but refer to the excellent expositions by Rogers and Girolami (2016) and Bishop (2006).
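In practice, the multiple-initialization strategy mentioned above is easy to automate. Below is a minimal sketch using scikit-learn's GaussianMixture, an external library that is not used in the book; its n_init option runs EM from several random initializations and keeps the run with the highest log-likelihood.

import numpy as np
from sklearn.mixture import GaussianMixture

x = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0]).reshape(-1, 1)

# n_init=10 fits the GMM with ten random EM initializations and keeps the best,
# mitigating convergence to a poor local optimum of the log-likelihood.
gmm = GaussianMixture(n_components=3, n_init=10, random_state=0).fit(x)
print(gmm.weights_, gmm.means_.ravel())
print(gmm.score(x) * len(x))   # total log-likelihood of the training data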
11.5 Further Reading

To generate data from a trained GMM, we sample an index k from the probability vector [π_1, . . . , π_K]^⊤ and then sample a data point x ∼ N(µ_k, Σ_k). If we repeat this N times, we obtain a dataset that has been generated by a GMM. Figure 11.1 was generated using this procedure.

Throughout this chapter, we assumed that the number of components K is known. In practice, this is often not the case. However, we could use nested cross-validation, as discussed in Section 8.6.1, to find good models.

Gaussian mixture models are closely related to the K-means clustering algorithm. K-means also uses the EM algorithm to assign data points to clusters. If we treat the means in the GMM as cluster centers and ignore the covariances (or set them to I), we arrive at K-means. As also nicely described by MacKay (2003), K-means makes a "hard" assignment of data points to cluster centers µ_k, whereas a GMM makes a "soft" assignment via the responsibilities.

We only touched upon the latent-variable perspective of GMMs and the EM algorithm. Note that EM can be used for parameter learning in general latent-variable models, e.g., nonlinear state-space models (Ghahramani and Roweis, 1999; Roweis and Ghahramani, 1999), and for reinforcement learning as discussed by Barber (2012). Therefore, the latent-variable perspective of a GMM is useful to derive the corresponding EM algorithm in a principled way (Bishop, 2006; Barber, 2012; Murphy, 2012).

We only discussed maximum likelihood estimation (via the EM algorithm) for finding GMM parameters. The standard criticisms of maximum likelihood also apply here:

• As in linear regression, maximum likelihood can suffer from severe overfitting. In the GMM case, this happens when the mean of a mixture component is identical to a data point and the covariance tends to 0. Then, the likelihood approaches infinity. Bishop (2006) and Barber (2012) discuss this issue in detail.

• We only obtain a point estimate of the parameters π_k, µ_k, Σ_k for k = 1, . . . , K, which does not give any indication of uncertainty in the parameter values. A Bayesian approach would place a prior on the parameters, which can be used to obtain a posterior distribution over the parameters. This posterior allows us to compute the model evidence (marginal likelihood), which can be used for model comparison and therefore gives us a principled way to determine the number of mixture components. Unfortunately, closed-form inference is not possible in this setting because there is no conjugate prior for this model. However, approximations, such as variational inference, can be used to obtain an approximate posterior (Bishop, 2006).

In this chapter, we discussed mixture models for density estimation. There is a plethora of density estimation techniques available. In practice, we often use histograms and kernel density estimation.

Histograms provide a nonparametric way to represent continuous densities and were proposed by Pearson (1895). A histogram is constructed by "binning" the data space and counting how many data points fall into each bin. Then a bar is drawn at the center of each bin, and the height of the bar is proportional to the number of data points within that bin. The bin size is a critical hyperparameter, and a bad choice can lead to overfitting and underfitting. Cross-validation, as discussed in Section 8.2.4, can be used to determine a good bin size.

Kernel density estimation, independently proposed by Rosenblatt (1956) and Parzen (1962), is a nonparametric way for density estimation. Given N i.i.d. samples, the kernel density estimator represents the underlying distribution as

p(x) = \frac{1}{Nh} \sum_{n=1}^{N} k\!\left(\frac{x - x_n}{h}\right) \,, \qquad (11.74)

where k is a kernel function, i.e., a nonnegative function that integrates to 1, and h > 0 is a smoothing/bandwidth parameter, which plays a similar role as the bin size in histograms. Note that we place a kernel on every single data point x_n in the dataset. Commonly used kernel functions are the uniform distribution and the Gaussian distribution. Kernel density estimates are closely related to histograms, but by choosing a suitable kernel, we can guarantee smoothness of the density estimate. Figure 11.13 illustrates the difference between a histogram and a kernel density estimator (with a Gaussian-shaped kernel) for a given dataset of 250 data points.

Figure 11.13 Histogram and kernel density estimate. The kernel density estimator produces a smooth estimate of the underlying density, whereas the histogram is an unsmoothed count measure of how many data points (black) fall into a single bin.
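As a concrete illustration of (11.74), here is a minimal NumPy sketch of a kernel density estimator with a Gaussian kernel; it is not from the book, and the data and bandwidth value are arbitrary.

import numpy as np

def kde_gaussian(x_query, data, h):
    # p(x) = 1/(N h) * sum_n k((x - x_n) / h) with a standard Gaussian kernel k, cf. (11.74).
    u = (x_query[:, None] - data[None, :]) / h          # shape (Q, N)
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)        # Gaussian kernel evaluations
    return k.sum(axis=1) / (len(data) * h)

data = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])
grid = np.linspace(-4, 8, 5)
print(kde_gaussian(grid, data, h=1.0))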
12
Classification with Support Vector Machines

In many situations, we want our machine learning algorithm to predict one of a number of (discrete) outcomes. For example, an email client sorts mail into personal mail and junk mail, which has two outcomes. Another example is a telescope that identifies whether an object in the night sky is a galaxy, star, or planet. There are usually a small number of outcomes, and more importantly there is usually no additional structure on these outcomes. (An example of structure would be ordered outcomes, as in the case of small, medium, and large t-shirts.) In this chapter, we consider predictors that output binary values, i.e., there are only two possible outcomes. This machine learning task is called binary classification. This is in contrast to Chapter 9, where we considered a prediction problem with continuous-valued outputs.

For binary classification, the set of possible values that the label/output can attain is binary, and for this chapter we denote them by {+1, −1}. In other words, we consider predictors of the form

f : R^D → {+1, −1} . (12.1)

Recall from Chapter 8 that we represent each example (data point) x_n as a feature vector of D real numbers; an input example x_n may also be referred to as an input, data point, feature, or instance. The labels are often referred to as the positive and negative classes, respectively. One should be careful not to infer intuitive attributes of positiveness of the +1 class. For example, in a cancer detection task, a patient with cancer is often labeled +1. In principle, any two distinct values can be used, e.g., {True, False}, {0, 1}, or {red, blue}. (For probabilistic models, it is mathematically convenient to use {0, 1} as a binary representation; see the remark after Example 6.12.) The problem of binary classification is well studied, and we defer a survey of other approaches to Section 12.6.

We present an approach known as the support vector machine (SVM), which solves the binary classification task. As in regression, we have a supervised learning task, where we have a set of examples x_n ∈ R^D along with their corresponding (binary) labels y_n ∈ {+1, −1}. Given a training dataset consisting of example–label pairs {(x_1, y_1), . . . , (x_N, y_N)}, we would like to estimate parameters of the model that will give the smallest classification error. Similar to Chapter 9, we consider a linear model, and hide away the nonlinearity in a transformation ϕ of the examples (9.13). We will revisit ϕ in Section 12.4.

The SVM provides state-of-the-art results in many applications, with sound theoretical guarantees (Steinwart and Christmann, 2008). There are two main reasons why we chose to illustrate binary classification using SVMs. First, the SVM allows for a geometric way to think about supervised machine learning. While in Chapter 9 we considered the machine learning problem in terms of probabilistic models and attacked it using maximum likelihood estimation and Bayesian inference, here we will consider an alternative approach where we reason geometrically about the machine learning task. It relies heavily on concepts, such as inner products and projections, which we discussed in Chapter 3. The second reason why we find SVMs instructive is that, in contrast to Chapter 9, the optimization problem for the SVM does not admit an analytic solution, so that we need to resort to a variety of optimization tools introduced in Chapter 7.

The SVM view of machine learning is subtly different from the maximum likelihood view of Chapter 9. The maximum likelihood view proposes a model based on a probabilistic view of the data distribution, from which an optimization problem is derived. In contrast, the SVM view starts by designing a particular function that is to be optimized during training, based on geometric intuitions. We have seen something similar already in Chapter 10, where we derived PCA from geometric principles. In the SVM case, we start by designing a loss function that is to be minimized on training data, following the principles of empirical risk minimization (Section 8.2).

Let us derive the optimization problem corresponding to training an SVM on example–label pairs. Intuitively, we imagine binary classification data that can be separated by a hyperplane, as illustrated in Figure 12.1. Here, every example x_n (a vector of dimension 2) is a two-dimensional location (x_n^(1) and x_n^(2)), and the corresponding binary label y_n is one of two different symbols (orange cross or blue disc). "Hyperplane" is a word that is commonly used in machine learning, and we encountered hyperplanes already in Section 2.8: a hyperplane is an affine subspace of dimension D − 1 (if the corresponding vector space is of dimension D). The examples consist of two classes (there are two possible labels) that have features (the components of the vector representing the example) arranged in such a way as to allow us to separate/classify them by drawing a straight line.

Figure 12.1 Example 2D data, illustrating the intuition of data where we can find a linear classifier that separates orange crosses from blue discs.
In the following, we formalize the idea of finding a linear separator of the two classes. We introduce the idea of the margin and then extend linear separators to allow for examples to fall on the "wrong" side, incurring a classification error. We present two equivalent ways of formalizing the SVM: the geometric view (Section 12.2.4) and the loss function view (Section 12.2.5). We derive the dual version of the SVM using Lagrange multipliers (Section 7.2). The dual SVM allows us to observe a third way of formalizing the SVM: in terms of the convex hulls of the examples of each class (Section 12.3.2). We conclude by briefly describing kernels and how to numerically solve the nonlinear kernel-SVM optimization problem.

12.1 Separating Hyperplanes

Given two examples represented as vectors x_i and x_j, one way to compute the similarity between them is using an inner product ⟨x_i, x_j⟩. Recall from Section 3.2 that inner products are closely related to the angle between two vectors. The value of the inner product between two vectors depends on the length (norm) of each vector. Furthermore, inner products allow us to rigorously define geometric concepts such as orthogonality and projections.

The main idea behind many classification algorithms is to represent data in R^D and then partition this space, ideally in a way that examples with the same label (and no other examples) are in the same partition. In the case of binary classification, the space would be divided into two parts corresponding to the positive and negative classes, respectively. We consider a particularly convenient partition, which is to (linearly) split the space into two halves using a hyperplane. Let example x ∈ R^D be an element of the data space. Consider a function

f : R^D → R (12.2a)
x ↦ f(x) := ⟨w, x⟩ + b , (12.2b)

parametrized by w ∈ R^D and b ∈ R. Recall from Section 2.8 that hyperplanes are affine subspaces. Therefore, we define the hyperplane that separates the two classes in our binary classification problem as

{ x ∈ R^D : f(x) = 0 } . (12.3)

An illustration of the hyperplane is shown in Figure 12.2, where the vector w is a vector normal to the hyperplane and b the intercept. We can derive that w is a normal vector to the hyperplane in (12.3) by choosing any two examples x_a and x_b on the hyperplane and showing that the vector between them is orthogonal to w. In the form of an equation,

f(x_a) − f(x_b) = ⟨w, x_a⟩ + b − (⟨w, x_b⟩ + b) (12.4a)
= ⟨w, x_a − x_b⟩ , (12.4b)

where the second line is obtained by the linearity of the inner product (Section 3.2). Since we have chosen x_a and x_b to be on the hyperplane, this implies that f(x_a) = 0 and f(x_b) = 0 and hence ⟨w, x_a − x_b⟩ = 0. Recall that two vectors are orthogonal when their inner product is zero. Therefore, we obtain that w is orthogonal to any vector on the hyperplane.

Figure 12.2 Equation of a separating hyperplane (12.3). (a) The standard way of representing the equation in 3D. (b) For ease of drawing, we look at the hyperplane edge on.

Remark. Recall from Chapter 2 that we can think of vectors in different ways. In this chapter, we think of the parameter vector w as an arrow indicating a direction, i.e., we consider w to be a geometric vector. In contrast, we think of the example vector x as a data point (as indicated by its coordinates), i.e., we consider x to be the coordinates of a vector with respect to the standard basis. ♢

When presented with a test example, we classify the example as positive or negative depending on the side of the hyperplane on which it occurs. Note that (12.3) not only defines a hyperplane; it additionally defines a direction. In other words, it defines the positive and negative side of the hyperplane. Therefore, to classify a test example x_test, we calculate the value of the function f(x_test) and classify the example as +1 if f(x_test) ⩾ 0 and −1 otherwise. Thinking geometrically, the positive examples lie "above" the hyperplane and the negative examples "below" the hyperplane.

When training the classifier, we want to ensure that the examples with positive labels are on the positive side of the hyperplane, i.e.,

⟨w, x_n⟩ + b ⩾ 0 when y_n = +1 , (12.5)

and the examples with negative labels are on the negative side, i.e.,

⟨w, x_n⟩ + b < 0 when y_n = −1 . (12.6)

Refer to Figure 12.2 for a geometric intuition of positive and negative examples. These two conditions are often presented in a single equation

y_n (⟨w, x_n⟩ + b) ⩾ 0 . (12.7)

Equation (12.7) is equivalent to (12.5) and (12.6) when we multiply both sides of (12.5) and (12.6) with y_n = 1 and y_n = −1, respectively.
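A minimal sketch of the classification rule just described (classify x_test as +1 if f(x_test) ⩾ 0 and −1 otherwise); the parameter values below are illustrative only and not fitted to any data.

import numpy as np

def predict(X, w, b):
    # f(x) = <w, x> + b; classify as +1 if f(x) >= 0 and -1 otherwise, cf. (12.2)-(12.3).
    return np.where(X @ w + b >= 0, 1, -1)

w = np.array([1.0, -2.0])   # illustrative parameters
b = 0.5
X_test = np.array([[1.0, 0.0], [0.0, 1.0], [-2.0, -1.0]])
print(predict(X_test, w, b))

# Condition (12.7): y_n * f(x_n) >= 0 exactly for correctly classified examples.
y = np.array([1, -1, -1])
print(y * (X_test @ w + b) >= 0)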
12.2 Primal Support Vector Machine

Based on the concept of distances from points to a hyperplane, we are now in a position to discuss the support vector machine. For a dataset {(x_1, y_1), . . . , (x_N, y_N)} that is linearly separable, we have infinitely many candidate hyperplanes (refer to Figure 12.3), and therefore classifiers, that solve our classification problem without any (training) errors. To find a unique solution, one idea is to choose the separating hyperplane that maximizes the margin between the positive and negative examples. In other words, we want the positive and negative examples to be separated by a large margin (Section 12.2.1). A classifier with a large margin turns out to generalize well (Steinwart and Christmann, 2008). In the following, we compute the distance between an example and a hyperplane to derive the margin. Recall that the closest point on the hyperplane to a given point (example x_n) is obtained by the orthogonal projection (Section 3.8).

Figure 12.3 Possible separating hyperplanes. There are many linear classifiers (green lines) that separate orange crosses from blue discs.

12.2.1 Concept of the Margin

The concept of the margin is intuitively simple: It is the distance of the separating hyperplane to the closest examples in the dataset, assuming that the dataset is linearly separable. (There could be two or more closest examples to a hyperplane.) However, when trying to formalize this distance, there is a technical wrinkle that may be confusing: we need to define a scale at which to measure the distance. A potential scale is the scale of the data, i.e., the raw values of x_n. There are problems with this, as we could change the units of measurement of x_n and thereby change the values in x_n, and, hence, change the distance to the hyperplane. As we will see shortly, we instead define the scale based on the equation of the hyperplane (12.3) itself.

Consider a hyperplane ⟨w, x⟩ + b, and an example x_a as illustrated in Figure 12.4. Without loss of generality, we can consider the example x_a to be on the positive side of the hyperplane, i.e., ⟨w, x_a⟩ + b > 0. We would like to compute the distance r > 0 of x_a from the hyperplane. We do so by considering the orthogonal projection (Section 3.8) of x_a onto the hyperplane, which we denote by x'_a. Since w is orthogonal to the hyperplane, we know that the distance r is just a scaling of this vector w. If the length of w is known, then we can use this scaling factor r to work out the absolute distance between x_a and x'_a. For convenience, we choose to use a vector of unit length (its norm is 1) and obtain this by dividing w by its norm, w/∥w∥. Using vector addition (Section 2.4), we obtain

x_a = x'_a + r \frac{w}{\lVert w \rVert} \,. \qquad (12.8)

Figure 12.4 Vector addition to express the distance to the hyperplane: x_a = x'_a + r w/∥w∥.

Another way of thinking about r is that it is the coordinate of x_a in the subspace spanned by w/∥w∥. We have now expressed the distance of x_a from the hyperplane as r, and if we choose x_a to be the point closest to the hyperplane, this distance r is the margin.

Recall that we would like the positive examples to be further than r from the hyperplane, and the negative examples to be further than distance r (in the negative direction) from the hyperplane. Analogously to the combination of (12.5) and (12.6) into (12.7), we formulate this objective as

y_n (⟨w, x_n⟩ + b) ⩾ r . (12.9)

In other words, we combine the requirements that examples are at least r away from the hyperplane (in the positive and negative direction) into one single inequality.

Since we are interested only in the direction, we add an assumption to our model that the parameter vector w is of unit length, i.e., ∥w∥ = 1, where we use the Euclidean norm ∥w∥ = √(w^⊤ w) (Section 3.1). This assumption also allows a more intuitive interpretation of the distance r in (12.8), since r is then the scaling factor of a vector of length 1. (We will see other choices of inner products (Section 3.2) in Section 12.4.)

Remark. A reader familiar with other presentations of the margin would notice that our definition of ∥w∥ = 1 is different from the standard presentation of the SVM, for example the one provided by Schölkopf and Smola (2002). In Section 12.2.3, we will show the equivalence of both approaches. ♢

Collecting the three requirements into a single constrained optimization
problem, we obtain the objective

\max_{w, b, r} \; r \quad \text{subject to} \quad y_n(\langle w, x_n \rangle + b) \geq r\,, \quad \lVert w \rVert = 1\,, \quad r > 0\,, \qquad (12.10)

which says that we want to maximize the margin r, while ensuring that the data lies on the correct side of the hyperplane (the data-fitting constraint) and that the parameter vector is normalized (the normalization constraint).

Remark. The concept of the margin turns out to be highly pervasive in machine learning. It was used by Vladimir Vapnik and Alexey Chervonenkis to show that when the margin is large, the "complexity" of the function class is low, and hence learning is possible (Vapnik, 2000). It turns out that the concept is useful for various different approaches for theoretically analyzing generalization error (Steinwart and Christmann, 2008; Shalev-Shwartz and Ben-David, 2014). ♢

12.2.2 Traditional Derivation of the Margin

In the previous section, we derived (12.10) by making the observation that we are only interested in the direction of w and not its length, leading to the assumption that ∥w∥ = 1. In this section, we derive the margin maximization problem by making a different assumption. Instead of choosing that the parameter vector is normalized, we choose a scale for the data. We choose this scale such that the value of the predictor ⟨w, x⟩ + b is 1 at the closest example. Recall that we currently consider linearly separable data. Let us also denote the example in the dataset that is closest to the hyperplane by x_a.

Figure 12.5 Derivation of the margin: r = 1/∥w∥.

Figure 12.5 is identical to Figure 12.4, except that now we have rescaled the axes such that the example x_a lies exactly on the margin, i.e., ⟨w, x_a⟩ + b = 1. Since x'_a is the orthogonal projection of x_a onto the hyperplane, it must by definition lie on the hyperplane, i.e.,

⟨w, x'_a⟩ + b = 0 . (12.11)

By substituting (12.8) into (12.11), we obtain

\Big\langle w,\; x_a - r \frac{w}{\lVert w \rVert} \Big\rangle + b = 0 \,. \qquad (12.12)

Exploiting the bilinearity of the inner product (see Section 3.2), we get

\langle w, x_a \rangle + b - r \frac{\langle w, w \rangle}{\lVert w \rVert} = 0 \,. \qquad (12.13)

Observe that the first term is 1 by our assumption of scale, i.e., ⟨w, x_a⟩ + b = 1. From (3.16) in Section 3.1, we know that ⟨w, w⟩ = ∥w∥². Hence, the second term reduces to r∥w∥. Using these simplifications, we obtain

r = \frac{1}{\lVert w \rVert} \,. \qquad (12.14)

This means we have derived the distance r in terms of the normal vector w of the hyperplane. At first glance, this equation is counterintuitive, as we seem to have derived the distance from the hyperplane in terms of the length of the vector w, but we do not yet know this vector. One way to think about it is to consider the distance r to be a temporary variable that we only use for this derivation. Therefore, for the rest of this section we will denote the distance to the hyperplane by 1/∥w∥. (We can also think of the distance as the projection error incurred when projecting x_a onto the hyperplane.) In Section 12.2.3, we will see that the choice that the margin equals 1 is equivalent to our previous assumption of ∥w∥ = 1 in Section 12.2.1.

Similar to the argument to obtain (12.9), we want the positive and negative examples to be at least 1 away from the hyperplane, which yields the condition

y_n (⟨w, x_n⟩ + b) ⩾ 1 . (12.15)

Combining the margin maximization with the fact that examples need to be on the correct side of the hyperplane (based on their labels) gives us

\max_{w, b} \; \frac{1}{\lVert w \rVert} \qquad (12.16)
\text{subject to} \quad y_n(\langle w, x_n \rangle + b) \geq 1 \quad \text{for all} \quad n = 1, \ldots, N \,. \qquad (12.17)

Instead of maximizing the reciprocal of the norm as in (12.16), we often minimize the squared norm. We also often include a constant 1/2 that does not affect the optimal w, b but yields a tidier form when we compute the gradient. Then, our objective becomes

\min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 \qquad (12.18)
\text{subject to} \quad y_n(\langle w, x_n \rangle + b) \geq 1 \quad \text{for all} \quad n = 1, \ldots, N \,. \qquad (12.19)

Equation (12.18) is known as the hard margin SVM. The reason for the expression "hard" is that the formulation does not allow for any violations of the margin condition. The squared norm results in a convex quadratic programming problem for the SVM (Section 12.5).
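For a linearly separable toy dataset, the hard margin problem (12.18)-(12.19) can be handed directly to a convex solver. Below is a minimal sketch using the CVXPY library; CVXPY is an assumption of this example and not a tool used in the book, and the toy data are arbitrary.

import numpy as np
import cvxpy as cp

# Toy linearly separable data: two points per class.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))     # (12.18)
constraints = [cp.multiply(y, X @ w + b) >= 1]       # (12.19)
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
print(1 / np.linalg.norm(w.value))   # margin r = 1 / ||w||, cf. (12.14)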
We will see in Section 12.2.4 that this "hard" condition can be relaxed to accommodate violations if the data is not linearly separable.

Figure 12.6 (a) Linearly separable data, with a large margin; (b) non-linearly separable data.

12.2.3 Why We Can Set the Margin to 1

In Section 12.2.1, we argued that we would like to maximize some value r, which represents the distance of the closest example to the hyperplane. In Section 12.2.2, we scaled the data such that the closest example is of distance 1 to the hyperplane. In this section, we relate the two derivations, and show that they are equivalent.

Theorem 12.1. Maximizing the margin r, where we consider normalized weights as in (12.10),

\max_{w, b, r} \; r \quad \text{subject to} \quad y_n(\langle w, x_n \rangle + b) \geq r\,, \quad \lVert w \rVert = 1\,, \quad r > 0\,, \qquad (12.20)

is equivalent to scaling the data such that the margin is unity:

\min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad y_n(\langle w, x_n \rangle + b) \geq 1 \,. \qquad (12.21)

Proof. Consider (12.20). Since the square is a strictly monotonic transformation for non-negative arguments, the maximum stays the same if we consider r² in the objective. Since ∥w∥ = 1, we can reparametrize the equation with a new weight vector w′ that is not normalized by explicitly using w′/∥w′∥. We obtain the reparametrized problem in (12.22) and (12.23), renaming the parameters to w″ and b″. Since w″ = w′/(∥w′∥ r), rearranging for r gives

\lVert w'' \rVert = \left\lVert \frac{w'}{\lVert w' \rVert\, r} \right\rVert = \frac{1}{r} \cdot \left\lVert \frac{w'}{\lVert w' \rVert} \right\rVert = \frac{1}{r} \,. \qquad (12.24)

By substituting this result into (12.23), we obtain

\max_{w'', b''} \; \frac{1}{\lVert w'' \rVert^2} \quad \text{subject to} \quad y_n(\langle w'', x_n \rangle + b'') \geq 1 \,. \qquad (12.25)

The final step is to observe that maximizing 1/∥w″∥² yields the same solution as minimizing ½∥w″∥², which concludes the proof of Theorem 12.1.

12.2.4 Soft Margin SVM: Geometric View

In the case where data is not linearly separable, we may wish to allow some examples to fall within the margin region, or even to lie on the wrong side of the hyperplane.
We introduce a slack variable ξ_n for each example–label pair (x_n, y_n), which allows the example to be within the margin or even on the wrong side of the hyperplane (see Figure 12.7). We subtract the value of ξ_n from the margin, constraining ξ_n to be non-negative. To encourage correct classification of the samples, we add ξ_n to the objective:

\min_{w, b, \xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{n=1}^{N} \xi_n \qquad (12.26a)
\text{subject to} \quad y_n(\langle w, x_n \rangle + b) \geq 1 - \xi_n \qquad (12.26b)
\xi_n \geq 0 \qquad (12.26c)

for n = 1, . . . , N. In contrast to the optimization problem (12.18) for the hard margin SVM, this one is called the soft margin SVM. The parameter C > 0 trades off the size of the margin and the total amount of slack that we have. This parameter is called the regularization parameter since, as we will see in the following section, the margin term in the objective function (12.26a) is a regularization term. The margin term ∥w∥² is called the regularizer, and in many books on numerical optimization the regularization parameter is multiplied with this term (Section 8.2.3), in contrast to our formulation in this section. Here a large value of C implies low regularization, as we give the slack variables larger weight, hence giving more priority to examples that do not lie on the correct side of the margin. (There are alternative parametrizations of this regularization, which is why (12.26a) is also often referred to as the C-SVM.)

Figure 12.7 The slack variable ξ measures the distance of a positive example x+ to the positive margin hyperplane ⟨w, x⟩ + b = 1 when x+ is on the wrong side.

Remark. In the formulation of the soft margin SVM (12.26a), w is regularized but b is not. We can see this by observing that the regularization term does not contain b. The unregularized term b complicates theoretical analysis (Steinwart and Christmann, 2008, chapter 1) and decreases computational efficiency (Fan et al., 2008). ♢

12.2.5 Soft Margin SVM: Loss Function View

Let us consider a different approach for deriving the SVM, following the principle of empirical risk minimization (Section 8.2). For the SVM, we need to specify the error, or loss, incurred on the training data. In contrast to Chapter 9, where we considered regression problems (the output of the predictor is a real number), in this chapter we consider binary classification problems (the output of the predictor is one of two labels {+1, −1}). Therefore, the error/loss function for each single example–label pair needs to be appropriate for binary classification. For example, the squared loss that is used for regression (9.10b) is not suitable for binary classification.

Remark. The ideal loss function between binary labels is to count the number of mismatches between the prediction and the label. This means that for a predictor f applied to an example x_n, we compare the output f(x_n) with the label y_n. We define the loss to be zero if they match, and one if they do not match. This is denoted by 1(f(x_n) ≠ y_n) and is called the zero-one loss. Unfortunately, the zero-one loss results in a combinatorial optimization problem for finding the best parameters w, b. Combinatorial optimization problems (in contrast to continuous optimization problems discussed in Chapter 7) are in general more challenging to solve. ♢

What is the loss function corresponding to the SVM? Consider the error between the output of a predictor f(x_n) and the label y_n. The loss describes the error that is made on the training data. An equivalent way to derive (12.26a) is to use the hinge loss

ℓ(t) = max{0, 1 − t}  where  t = y f(x) = y(⟨w, x⟩ + b) . (12.28)

If f(x) is on the correct side (based on the corresponding label y) of the hyperplane, and further than distance 1 from it, then t ⩾ 1 and the hinge loss returns a value of zero. If f(x) is on the correct side but too close to the hyperplane (0 < t < 1), the example x is within the margin, and the hinge loss returns a positive value. When the example is on the wrong side of the hyperplane (t < 0), the hinge loss returns an even larger value, which increases linearly. In other words, we pay a penalty once we are closer than the margin to the hyperplane, even if the prediction is correct, and the penalty increases linearly. An alternative way to express the hinge loss is by considering it as two linear pieces

\ell(t) = \begin{cases} 0 & \text{if } t \geq 1 \\ 1 - t & \text{if } t < 1 \end{cases} \,, \qquad (12.29)

as illustrated in Figure 12.8.
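A minimal sketch of the hinge loss (12.28)/(12.29), evaluated for a few values of t = y(⟨w, x⟩ + b); the numbers are illustrative only.

import numpy as np

def hinge_loss(t):
    # max{0, 1 - t}: zero when the example is on the correct side and outside the
    # margin (t >= 1); positive and increasing linearly otherwise.
    return np.maximum(0.0, 1.0 - t)

t = np.array([2.0, 1.0, 0.5, 0.0, -1.5])   # from confidently correct to badly wrong
print(hinge_loss(t))                        # [0.  0.  0.5 1.  2.5]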
The loss corresponding to the hard margin SVM (12.18) is defined as

\ell(t) = \begin{cases} 0 & \text{if } t \geq 1 \\ \infty & \text{if } t < 1 \end{cases} \,. \qquad (12.30)

This loss can be interpreted as never allowing any examples inside the margin.

Figure 12.8 The hinge loss max{0, 1 − t} is a convex upper bound of the zero-one loss.

For a given training set {(x_1, y_1), . . . , (x_N, y_N)}, we seek to minimize the total loss, while regularizing the objective with ℓ2-regularization (see Section 8.2.3). Using the hinge loss (12.28) gives us the unconstrained optimization problem

\min_{w, b} \; \underbrace{\frac{1}{2} \lVert w \rVert^2}_{\text{regularizer}} + \underbrace{C \sum_{n=1}^{N} \max\{0,\, 1 - y_n(\langle w, x_n \rangle + b)\}}_{\text{error term}} \,. \qquad (12.31)

The first term in (12.31) is called the regularization term or the regularizer (see Section 8.2.3), and the second term is called the loss term or the error term. Recall from Section 12.2.4 that the term ½∥w∥² arises directly from the margin. In other words, margin maximization can be interpreted as regularization.

In principle, the unconstrained optimization problem in (12.31) can be directly solved with (sub-)gradient descent methods as described in Section 7.1. To see that (12.31) and (12.26a) are equivalent, observe that the hinge loss (12.28) essentially consists of two linear parts, as expressed in (12.29). Consider the hinge loss for a single example–label pair (12.28). We can equivalently replace minimization of the hinge loss over t with a minimization of a slack variable ξ with two constraints. In equation form,

\min_{t} \; \max\{0, 1 - t\} \qquad (12.32)

is equivalent to

\min_{\xi, t} \; \xi \quad \text{subject to} \quad \xi \geq 0\,, \quad \xi \geq 1 - t \,. \qquad (12.33)

By substituting this expression into (12.31) and rearranging one of the constraints, we obtain exactly the soft margin SVM (12.26a).

Remark. Let us contrast our choice of the loss function in this section to the loss function for linear regression in Chapter 9. Recall from Section 9.2.1 that for finding maximum likelihood estimators, we usually minimize the negative log-likelihood. For linear regression with Gaussian noise, the negative log-likelihood for each example is a squared error function. The squared error function is the loss function that is minimized when looking for the maximum likelihood solution. ♢

12.3 Dual Support Vector Machine

The description of the SVM in the previous sections, in terms of the variables w and b, is known as the primal SVM. Recall that we consider inputs x ∈ R^D with D features. Since w is of the same dimension as x, this means that the number of parameters (the dimension of w) of the optimization problem grows linearly with the number of features.

In the following, we consider an equivalent optimization problem (the so-called dual view) which is independent of the number of features. Instead, the number of parameters increases with the number of examples in the training set. We saw a similar idea appear in Chapter 10, where we expressed the learning problem in a way that does not scale with the number of features. This is useful for problems where we have more features than examples in the training dataset. The dual SVM also has the additional advantage that it easily allows kernels to be applied, as we shall see at the end of this chapter. The word "dual" appears often in mathematical literature, and in this particular case it refers to convex duality. The following subsections are essentially an application of convex duality, which we discussed in Section 7.2.

12.3.1 Convex Duality via Lagrange Multipliers

Recall the primal soft margin SVM (12.26a). We call the variables w, b, and ξ corresponding to the primal SVM the primal variables. We use α_n ⩾ 0 as the Lagrange multiplier corresponding to the constraint (12.26b) that the examples are classified correctly, and γ_n ⩾ 0 as the Lagrange multiplier corresponding to the non-negativity constraint of the slack variable; see (12.26c). (In Chapter 7, we used λ for Lagrange multipliers. In this section, we follow the notation commonly chosen in the SVM literature and use α and γ.) The Lagrangian is then given by

L(w, b, \xi, \alpha, \gamma) = \frac{1}{2} \lVert w \rVert^2 + C \sum_{n=1}^{N} \xi_n - \sum_{n=1}^{N} \alpha_n \big( y_n(\langle w, x_n \rangle + b) - 1 + \xi_n \big) - \sum_{n=1}^{N} \gamma_n \xi_n \,, \qquad (12.34)

where the two subtracted sums stem from the constraints (12.26b) and (12.26c), respectively.
By differentiating the Lagrangian (12.34) with respect to the three primal variables w, b, and ξ respectively, we obtain

\frac{\partial L}{\partial w} = w^\top - \sum_{n=1}^{N} \alpha_n y_n x_n^\top \,, \qquad (12.35)
\frac{\partial L}{\partial b} = - \sum_{n=1}^{N} \alpha_n y_n \,, \qquad (12.36)
\frac{\partial L}{\partial \xi_n} = C - \alpha_n - \gamma_n \,. \qquad (12.37)

We now find the maximum of the Lagrangian by setting each of these partial derivatives to zero. By setting (12.35) to zero, we find

w = \sum_{n=1}^{N} \alpha_n y_n x_n \,, \qquad (12.38)

which is a particular instance of the representer theorem (Kimeldorf and Wahba, 1970). Equation (12.38) states that the optimal weight vector in the primal is a linear combination of the examples x_n. Recall from Section 2.6.1 that this means that the solution of the optimization problem lies in the span of the training data. Additionally, the constraint obtained by setting (12.36) to zero implies that the optimal weight vector is an affine combination of the examples. (The representer theorem is actually a collection of theorems saying that the solution of minimizing empirical risk lies in the subspace (Section 2.4.3) defined by the examples.) The representer theorem turns out to hold for very general settings of regularized empirical risk minimization (Hofmann et al., 2008; Argyriou and Dinuzzo, 2014). The theorem has more general versions (Schölkopf et al., 2001), and necessary and sufficient conditions on its existence can be found in Yu et al. (2013).

Remark. The representer theorem (12.38) also provides an explanation of the name "support vector machine." The examples x_n whose corresponding parameters α_n = 0 do not contribute to the solution w at all. The other examples, where α_n > 0, are called support vectors since they "support" the hyperplane. ♢

By substituting the expression for w into the Lagrangian (12.34), we obtain the dual

D(\xi, \alpha, \gamma) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle - \sum_{i=1}^{N} y_i \alpha_i \Big\langle \sum_{j=1}^{N} y_j \alpha_j x_j ,\, x_i \Big\rangle + C \sum_{i=1}^{N} \xi_i - b \sum_{i=1}^{N} y_i \alpha_i + \sum_{i=1}^{N} \alpha_i - \sum_{i=1}^{N} \alpha_i \xi_i - \sum_{i=1}^{N} \gamma_i \xi_i \,. \qquad (12.39)

Note that there are no longer any terms involving the primal variable w. By setting (12.36) to zero, we obtain Σ_{n=1}^N y_n α_n = 0. Therefore, the term involving b also vanishes. Recall that inner products are symmetric and bilinear (see Section 3.2). Therefore, the first two terms in (12.39) are over the same objects. These terms can be simplified, and we obtain the Lagrangian

D(\xi, \alpha, \gamma) = - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle + \sum_{i=1}^{N} \alpha_i + \sum_{i=1}^{N} (C - \alpha_i - \gamma_i)\, \xi_i \,. \qquad (12.40)

The last term in this equation is a collection of all terms that contain slack variables ξ_i. By setting (12.37) to zero, we see that the last term in (12.40) is also zero. Furthermore, by using the same equation and recalling that the Lagrange multipliers γ_i are non-negative, we conclude that α_i ⩽ C. We now obtain the dual optimization problem of the SVM, which is expressed exclusively in terms of the Lagrange multipliers α_i. Recall from Lagrangian duality (Definition 7.1) that we maximize the dual problem. This is equivalent to minimizing the negative dual problem, such that we end up with the dual SVM

\min_{\alpha} \; \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle - \sum_{i=1}^{N} \alpha_i
\text{subject to} \quad \sum_{i=1}^{N} y_i \alpha_i = 0\,, \quad 0 \leq \alpha_i \leq C \quad \text{for all} \quad i = 1, \ldots, N \,. \qquad (12.41)

The equality constraint in (12.41) is obtained from setting (12.36) to zero. The inequality constraint α_i ⩾ 0 is the condition imposed on Lagrange multipliers of inequality constraints (Section 7.2). The inequality constraint α_i ⩽ C is discussed in the previous paragraph.

The set of inequality constraints in the SVM are called "box constraints" because they limit the vector α = [α_1, . . . , α_N]^⊤ ∈ R^N of Lagrange multipliers to be inside the box defined by 0 and C on each axis. These axis-aligned boxes are particularly efficient to implement in numerical solvers (Dostál, 2009, chapter 5).

Once we obtain the dual parameters α, we can recover the primal parameters w by using the representer theorem (12.38). Let us call the optimal primal parameter w*. However, there remains the question of how to obtain the parameter b*. Consider an example x_n that lies exactly on the margin's boundary, i.e., ⟨w*, x_n⟩ + b = y_n. Recall that y_n is either +1 or −1. Therefore, the only unknown is b, which can be computed by

b* = y_n − ⟨w*, x_n⟩ . (12.42)

(It turns out that examples that lie exactly on the margin are examples whose dual parameters lie strictly inside the box constraints, 0 < α_i < C. This is derived using the Karush-Kuhn-Tucker conditions, for example in Schölkopf and Smola (2002).)
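A minimal sketch of solving the dual (12.41) for a toy dataset, again using the CVXPY library as an assumed external tool, then recovering w via the representer theorem (12.38) and b via (12.42) from an example whose dual parameter lies strictly inside the box constraints. The data and the value of C are arbitrary.

import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 10.0

alpha = cp.Variable(len(y))
w_expr = X.T @ cp.multiply(alpha, y)                       # sum_i alpha_i y_i x_i
# 0.5 * ||sum_i alpha_i y_i x_i||^2 equals the double sum in (12.41).
objective = cp.Minimize(0.5 * cp.sum_squares(w_expr) - cp.sum(alpha))
constraints = [y @ alpha == 0, alpha >= 0, alpha <= C]     # equality and box constraints
cp.Problem(objective, constraints).solve()

a = alpha.value
w = (a * y) @ X                                            # representer theorem (12.38)
sv = np.argmax((a > 1e-5) & (a < C - 1e-5))                # example strictly inside the box
b = y[sv] - X[sv] @ w                                      # (12.42)
print(w, b)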
Remark. In principle, there may be no examples that lie exactly on the margin. In this case, we should compute |y_n − ⟨w*, x_n⟩| for all support vectors and take the median value of this absolute difference to be the value of b*. A derivation of this can be found at http://fouryears.eu/2012/06/07/the-svm-bias-term-conspiracy/. ♢

12.3.2 Dual SVM: Convex Hull View

Another approach to obtain the dual SVM is to consider an alternative geometric argument. Consider the set of examples x_n with the same label. We would like to build a convex set that contains all the examples such that it is the smallest possible set. This is called the convex hull and is illustrated in Figure 12.9.

Figure 12.9 Convex hulls. (a) Convex hull of points, some of which lie within the boundary; (b) convex hulls around positive (blue) and negative (orange) examples. The distance between the two convex sets is the length of the difference vector c − d.

Let us first build some intuition about a convex combination of points. Consider two points x_1 and x_2 and corresponding non-negative weights α_1, α_2 ⩾ 0 such that α_1 + α_2 = 1. The equation α_1 x_1 + α_2 x_2 describes each point on the line between x_1 and x_2. Consider what happens when we add a third point x_3 along with a weight α_3 ⩾ 0 such that Σ_{n=1}^{3} α_n = 1. The convex combination of these three points x_1, x_2, x_3 spans a two-dimensional area. The convex hull of this area is the triangle formed by the edges corresponding to each pair of points. As we add more points, and the number of points becomes greater than the number of dimensions, some of the points will be inside the convex hull, as we can see in Figure 12.9(a).

In general, building a convex hull can be done by introducing non-negative weights α_n ⩾ 0 corresponding to each example x_n. Then the convex hull can be described as the set

\mathrm{conv}(X) = \left\{ \sum_{n=1}^{N} \alpha_n x_n \right\} \quad \text{with} \quad \sum_{n=1}^{N} \alpha_n = 1 \quad \text{and} \quad \alpha_n \geq 0 \qquad (12.43)

for all n = 1, . . . , N. If the two clouds of points corresponding to the positive and negative classes are separated, then the convex hulls do not overlap. Given the training data (x_1, y_1), . . . , (x_N, y_N), we form two convex hulls, corresponding to the positive and negative classes respectively. We pick a point c that is in the convex hull of the set of positive examples and is closest to the negative class distribution. Similarly, we pick a point d in the convex hull of the set of negative examples that is closest to the positive class distribution; see Figure 12.9(b). We define the difference vector between d and c as

w := c − d . (12.44)

Picking the points c and d as described, and requiring them to be closest to each other, is equivalent to minimizing the length/norm of w, so that we end up with the corresponding optimization problem

\arg\min_{w} \lVert w \rVert = \arg\min_{w} \frac{1}{2} \lVert w \rVert^2 \,. \qquad (12.45)

Since c must be in the positive convex hull, it can be expressed as a convex combination of the positive examples, i.e., for non-negative coefficients α_n^+,

c = \sum_{n:\, y_n = +1} \alpha_n^{+} x_n \,. \qquad (12.46)

In (12.46), we use the notation n : y_n = +1 to indicate the set of indices n for which y_n = +1. Similarly, for the examples with negative labels, we obtain

d = \sum_{n:\, y_n = -1} \alpha_n^{-} x_n \,. \qquad (12.47)

By substituting (12.44), (12.46), and (12.47) into (12.45), we obtain the objective

\min_{\alpha} \; \frac{1}{2} \Big\lVert \sum_{n:\, y_n = +1} \alpha_n^{+} x_n - \sum_{n:\, y_n = -1} \alpha_n^{-} x_n \Big\rVert^2 \,. \qquad (12.48)

Let α be the set of all coefficients, i.e., the concatenation of α^+ and α^−. Recall that we require the coefficients of each convex hull to sum to one:

\sum_{n:\, y_n = +1} \alpha_n^{+} = 1 \quad \text{and} \quad \sum_{n:\, y_n = -1} \alpha_n^{-} = 1 \,. \qquad (12.49)
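A minimal numerical sketch of this construction: given convex-combination coefficients for each class (the values below are arbitrary but satisfy (12.49)), form c, d, and the difference vector w = c − d from (12.44), (12.46), and (12.47), and evaluate the objective (12.48).

import numpy as np

X_pos = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0]])
X_neg = np.array([[-2.0, -2.0], [-3.0, -1.0]])

alpha_pos = np.array([0.5, 0.2, 0.3])   # arbitrary convex weights, each set sums to one (12.49)
alpha_neg = np.array([0.6, 0.4])

c = alpha_pos @ X_pos                   # point in the positive convex hull (12.46)
d = alpha_neg @ X_neg                   # point in the negative convex hull (12.47)
w = c - d                               # difference vector (12.44)
print(0.5 * np.sum(w ** 2))             # objective value in (12.48); the dual SVM minimizes this over the weights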
The sums in (12.49) imply the constraint

\sum_{n=1}^{N} y_n \alpha_n = 0 \,. \qquad (12.50)

This result can be seen by multiplying out the individual classes:

\sum_{n=1}^{N} y_n \alpha_n = \sum_{n:\, y_n = +1} (+1)\, \alpha_n^{+} + \sum_{n:\, y_n = -1} (-1)\, \alpha_n^{-} \qquad (12.51a)
= \sum_{n:\, y_n = +1} \alpha_n^{+} - \sum_{n:\, y_n = -1} \alpha_n^{-} = 1 - 1 = 0 \,. \qquad (12.51b)

The objective function (12.48) and the constraint (12.50), along with the assumption that α ⩾ 0, give us a constrained (convex) optimization problem. This optimization problem can be shown to be the same as that of the dual hard margin SVM (Bennett and Bredensteiner, 2000a).

Remark. To obtain the soft margin dual, we consider the reduced hull. The reduced hull is similar to the convex hull but has an upper bound on the size of the coefficients α. The maximum possible value of the elements of α restricts the size that the convex hull can take. In other words, the bound on α shrinks the convex hull to a smaller volume (Bennett and Bredensteiner, 2000b). ♢

12.4 Kernels

Consider the formulation of the dual SVM (12.41). Notice that the inner product in the objective occurs only between examples x_i and x_j. There are no inner products between the examples and the parameters. Therefore, if we consider a set of features ϕ(x_i) to represent x_i, the only change in the dual SVM will be to replace the inner product. This modularity, where the choice of the classification method (the SVM) and the choice of the feature representation ϕ(x) can be considered separately, provides flexibility for us to explore the two problems independently. In this section, we discuss the representation ϕ(x) and briefly introduce the idea of kernels, but do not go into the technical details.

Since ϕ(x) could be a non-linear function, we can use the SVM (which assumes a linear classifier) to construct classifiers that are nonlinear in the examples x_n. This provides a second avenue, in addition to the soft margin, for users to deal with a dataset that is not linearly separable. It turns out that many algorithms and statistical methods have the property we observed in the dual SVM: the only inner products are those that occur between examples. Instead of explicitly defining a non-linear feature map ϕ(·) and computing the resulting inner product between examples x_i and x_j, we define a similarity function k(x_i, x_j) between x_i and x_j. For a certain class of similarity functions, called kernels, the similarity function implicitly defines a non-linear feature map ϕ(·). Kernels are by definition functions k : X × X → R for which there exists a Hilbert space H and a feature map ϕ : X → H such that

k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}} \,. \qquad (12.52)

(The inputs X of the kernel function can be very general and are not necessarily restricted to R^D.)

There is a unique reproducing kernel Hilbert space associated with every kernel k (Aronszajn, 1950; Berlinet and Thomas-Agnan, 2004). In this unique association, ϕ(x) = k(·, x) is called the canonical feature map. The generalization from an inner product to a kernel function (12.52) is known as the kernel trick (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004), as it hides away the explicit non-linear feature map. The matrix K ∈ R^{N×N}, resulting from the inner products or the application of k(·, ·) to a dataset, is called the Gram matrix, and is often just referred to as the kernel matrix. Kernels must be symmetric and positive semidefinite functions, so that every kernel matrix K is symmetric and positive semidefinite (Section 3.2.3):

\forall z \in \mathbb{R}^N : \; z^\top K z \geq 0 \,. \qquad (12.53)

Some popular examples of kernels for multivariate real-valued data x_i ∈ R^D are the polynomial kernel, the Gaussian radial basis function kernel, and the rational quadratic kernel (Schölkopf and Smola, 2002; Rasmussen and Williams, 2006).
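A minimal sketch of two common kernels and the resulting Gram matrix, checking the properties stated in (12.53); the hyperparameter values (degree, offset, gamma) are arbitrary and not from the book.

import numpy as np

def polynomial_kernel(A, B, degree=3, c=1.0):
    # k(x_i, x_j) = (<x_i, x_j> + c)^degree
    return (A @ B.T + c) ** degree

def rbf_kernel(A, B, gamma=0.5):
    # k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
K = rbf_kernel(X, X)                              # Gram (kernel) matrix
print(np.allclose(K, K.T))                        # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))    # positive semidefinite up to numerical tolerance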
Some popular examples of kernels for multivariate real-valued data x_i ∈ R^D are the polynomial kernel, the Gaussian radial basis function kernel, and the rational quadratic kernel (Schölkopf and Smola, 2002; Rasmussen and Williams, 2006). Figure 12.10 illustrates the effect of different kernels on separating hyperplanes on an example dataset.

[Figure 12.10: decision boundaries of the SVM with different kernels on an example dataset; the axes show the first and second features, and panels (c) and (d) use a polynomial kernel of degree 2 and degree 3, respectively.]

Note that we are still solving for hyperplanes, that is, the hypothesis class of functions is still linear. The non-linear surfaces are due to the kernel function.

Remark. Unfortunately for the fledgling machine learner, there are multiple meanings of the word "kernel". In this chapter, the word "kernel" comes from the idea of the reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950; Saitoh, 1988). We have discussed the idea of the kernel in linear algebra (Section 2.7.3), where the kernel is another word for the null space. The third common use of the word "kernel" in machine learning is the smoothing kernel in kernel density estimation (Section 11.5). ♢

Since the explicit representation ϕ(x) is mathematically equivalent to the kernel representation k(x_i, x_j), a practitioner will often design the kernel function such that it can be computed more efficiently than the inner product between explicit feature maps. For example, consider the polynomial kernel (Schölkopf and Smola, 2002), where the number of terms in the explicit expansion grows very quickly (even for polynomials of low degree) when the input dimension is large. The kernel function only requires one multiplication per input dimension, which can provide significant computational savings. Another example is the Gaussian radial basis function kernel (Schölkopf and Smola, 2002; Rasmussen and Williams, 2006), where the corresponding feature space is infinite dimensional. In this case, we cannot explicitly represent the feature space but can still compute similarities between a pair of examples using the kernel. The choice of kernel, as well as the parameters of the kernel, is often chosen using nested cross-validation (Section 8.6.1).

Another useful aspect of the kernel trick is that there is no need for the original data to be already represented as multivariate real-valued data. Note that the inner product is defined on the output of the function ϕ(·), but does not restrict the input to real numbers. Hence, the function ϕ(·) and the kernel function k(·, ·) can be defined on any object, e.g., sets, sequences, strings, graphs, and distributions (Ben-Hur et al., 2008; Gärtner, 2008; Shi et al., 2009; Sriperumbudur et al., 2010; Vishwanathan et al., 2010).

12.5 Numerical Solution

We conclude our discussion of SVMs by looking at how to express the problems derived in this chapter in terms of the concepts presented in Chapter 7. We consider two different approaches for finding the optimal solution for the SVM. First, we consider the loss view of the SVM (Section 8.2.2) and express this as an unconstrained optimization problem. Then we express the constrained versions of the primal and dual SVMs as quadratic programs in standard form (Section 7.3.2).

Consider the loss function view of the SVM (12.31). This is a convex unconstrained optimization problem, but the hinge loss (12.28) is not differentiable. Therefore, we apply a subgradient approach for solving it. However, the hinge loss is differentiable almost everywhere, except for one single point at the hinge t = 1. At this point, the gradient is a set of possible values that lie between −1 and 0. Therefore, the subgradient g of the hinge loss is given by

g(t) = \begin{cases} -1 & t < 1 \\ [-1, 0] & t = 1 \\ 0 & t > 1 \end{cases} .    (12.54)

Using this subgradient, we can apply the optimization methods presented in Section 7.1.

Both the primal and the dual SVM result in a convex quadratic programming problem (constrained optimization). Note that the primal SVM in (12.26a) has optimization variables that have the size of the dimension D of the input examples. The dual SVM in (12.41) has optimization variables that have the size of the number N of examples.

To express the primal SVM in the standard form (7.45) for quadratic programming, let us assume that we use the dot product (3.5) as the inner product. (Recall from Section 3.2 that we use the phrase dot product to mean the inner product on Euclidean vector space.) We rearrange the equation for the primal SVM (12.26a), such that the optimization variables are all on the right and the inequality of the constraint matches the standard form. This yields the optimization

\min_{w, b, \xi} \quad \frac{1}{2}\|w\|^2 + C \sum_{n=1}^{N} \xi_n
\quad \text{subject to} \quad -y_n x_n^\top w - y_n b - \xi_n \leqslant -1 \,, \quad -\xi_n \leqslant 0 \,,    (12.55)

for n = 1, . . . , N. By concatenating the variables w, b, and ξ_n into a single vector, and carefully collecting the terms, we obtain the following matrix form of the soft margin SVM:

\min_{w, b, \xi} \quad \frac{1}{2}
\begin{bmatrix} w \\ b \\ \xi \end{bmatrix}^\top
\begin{bmatrix} I_D & 0_{D, N+1} \\ 0_{N+1, D} & 0_{N+1, N+1} \end{bmatrix}
\begin{bmatrix} w \\ b \\ \xi \end{bmatrix}
+
\begin{bmatrix} 0_{D+1, 1} \\ C 1_{N, 1} \end{bmatrix}^\top
\begin{bmatrix} w \\ b \\ \xi \end{bmatrix}

\text{subject to} \quad
\begin{bmatrix} -Y X & -y & -I_N \\ 0_{N, D} & 0_{N, 1} & -I_N \end{bmatrix}
\begin{bmatrix} w \\ b \\ \xi \end{bmatrix}
\leqslant
\begin{bmatrix} -1_{N, 1} \\ 0_{N, 1} \end{bmatrix} .    (12.56)

In the preceding optimization problem, the minimization is over the parameters [w^⊤, b, ξ^⊤]^⊤ ∈ R^{D+1+N}, and we use the following notation: I_m to represent the identity matrix of size m × m, 0_{m,n} to represent the matrix of zeros of size m × n, and 1_{m,n} to represent the matrix of ones of size m × n. In addition, y is the vector of labels [y_1, . . . , y_N]^⊤, Y = diag(y) is an N × N matrix whose diagonal elements are the entries of y, and X ∈ R^{N×D} is the matrix obtained by concatenating all the examples.
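To illustrate the subgradient approach of (12.54) on the unconstrained hinge-loss formulation, here is a minimal sketch (again Python with NumPy; the toy data, the step size, the iteration count, and the choice of 0 from the subdifferential at the hinge are our own assumptions, not details from the text):

```python
import numpy as np

def hinge_subgradient(t):
    """A subgradient of the hinge loss max(0, 1 - t), cf. (12.54);
    at the hinge t == 1 we pick 0 from the subdifferential [-1, 0]."""
    return np.where(t < 1.0, -1.0, 0.0)

def soft_margin_subgradient_descent(X, y, C=1.0, step=0.01, iters=2000):
    """Subgradient descent on 0.5*||w||^2 + C * sum_n hinge(y_n (w^T x_n + b))."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(iters):
        t = y * (X @ w + b)              # margins y_n f(x_n)
        g = hinge_subgradient(t)         # one entry per example
        # chain rule: d/dw hinge(t_n) = g(t_n) * y_n * x_n and d/db = g(t_n) * y_n
        grad_w = w + C * (X.T @ (g * y))
        grad_b = C * np.sum(g * y)
        w -= step * grad_w
        b -= step * grad_b
    return w, b

# Toy data with labels in {-1, +1} (assumed for illustration)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(+2, 1, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
w, b = soft_margin_subgradient_descent(X, y, C=1.0)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```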
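The bookkeeping that maps (12.55) to the block matrices of (12.56) can also be checked numerically. The following sketch (NumPy again; the dataset size and helper names are illustrative) assembles the quadratic term, the linear term, and the stacked constraints, and verifies that they reproduce the objective and constraints of (12.55) at a random point [w^⊤, b, ξ^⊤]^⊤; no particular QP solver is assumed.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, C = 6, 3, 1.5
X = rng.standard_normal((N, D))                 # one example per row
y = rng.choice([-1.0, 1.0], size=N)
Y = np.diag(y)

# Quadratic term, linear term, constraint matrix, and right-hand side of (12.56)
P = np.zeros((D + 1 + N, D + 1 + N))
P[:D, :D] = np.eye(D)                           # only w is penalized quadratically
q = np.concatenate([np.zeros(D + 1), C * np.ones(N)])
G = np.block([
    [-Y @ X, -y[:, None], -np.eye(N)],          # -y_n x_n^T w - y_n b - xi_n <= -1
    [np.zeros((N, D + 1)), -np.eye(N)],         # -xi_n <= 0
])
h = np.concatenate([-np.ones(N), np.zeros(N)])

# Compare against the soft margin SVM (12.55) at a random point theta = [w; b; xi]
w, b, xi = rng.standard_normal(D), rng.standard_normal(), np.abs(rng.standard_normal(N))
theta = np.concatenate([w, [b], xi])
assert np.isclose(0.5 * theta @ P @ theta + q @ theta, 0.5 * w @ w + C * xi.sum())
assert np.allclose(G @ theta, np.concatenate([-y * (X @ w) - y * b - xi, -xi]))
print("Matrix form (12.56) matches the soft margin SVM (12.55).")
```

In practice one would pass these matrices to a convex QP solver; the sketch only shows how the terms are collected.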
We can similarly perform a collection of terms for the dual version of the SVM (12.41). To express the dual SVM in standard form, we first have to express the kernel matrix K such that each entry is K_{ij} = k(x_i, x_j). If we have an explicit feature representation x_i, then we define K_{ij} = ⟨x_i, x_j⟩. For convenience of notation, we introduce a matrix with zeros everywhere except on the diagonal, where we store the labels, that is, Y = diag(y). The dual SVM can be written as

\min_{\alpha} \quad \frac{1}{2} \alpha^\top Y K Y \alpha - 1_{N,1}^\top \alpha

\text{subject to} \quad
\begin{bmatrix} y^\top \\ -y^\top \\ -I_N \\ I_N \end{bmatrix} \alpha
\leqslant
\begin{bmatrix} 0_{N+2, 1} \\ C 1_{N, 1} \end{bmatrix} .    (12.57)

Remark. In Sections 7.3.1 and 7.3.2, we introduced the standard forms of the constraints to be inequality constraints. We will express the dual SVM's equality constraint as two inequality constraints, i.e.,

Ax = b is replaced by Ax ⩽ b and Ax ⩾ b .    (12.58)

Particular software implementations of convex optimization methods may provide the ability to express equality constraints. ♢

Since there are many different possible views of the SVM, there are many approaches for solving the resulting optimization problem. The approach presented here, expressing the SVM problem in standard convex optimization form, is not often used in practice. The two main implementations of SVM solvers are Chang and Lin (2011) (which is open source) and Joachims (1999). Since SVMs have a clear and well-defined optimization problem, many approaches based on numerical optimization techniques (Nocedal and Wright, 2006) can be applied (Shawe-Taylor and Sun, 2011).

12.6 Further Reading

The SVM is one of many approaches for studying binary classification. Other approaches include the perceptron, logistic regression, the Fisher discriminant, nearest neighbor, naive Bayes, and random forests (Bishop, 2006; Murphy, 2012). A short tutorial on SVMs and kernels on discrete sequences can be found in Ben-Hur et al. (2008). The development of SVMs is closely linked to empirical risk minimization, discussed in Section 8.2. Hence, the SVM has strong theoretical properties (Vapnik, 2000; Steinwart and Christmann, 2008). The book about kernel methods (Schölkopf and Smola, 2002) includes many details of support vector machines and how to optimize them. A broader book about kernel methods (Shawe-Taylor and Cristianini, 2004) also includes many linear algebra approaches for different machine learning problems.

An alternative derivation of the dual SVM can be obtained using the idea of the Legendre–Fenchel transform (Section 7.3.3). The derivation considers each term of the unconstrained formulation of the SVM (12.31) separately and calculates their convex conjugates (Rifkin and Lippert, 2007). Readers interested in the functional analysis view (also the regularization methods view) of SVMs are referred to the work by Wahba (1990). Theoretical exposition of kernels (Aronszajn, 1950; Schwartz, 1964; Saitoh, 1988; Manton and Amblard, 2015) requires a basic grounding in linear operators (Akhiezer and Glazman, 1993). The idea of kernels has been generalized to Banach spaces (Zhang et al., 2009) and Kreĭn spaces (Ong et al., 2004; Loosli et al., 2016).

Observe that the hinge loss has three equivalent representations, as shown in (12.28) and (12.29), as well as the constrained optimization problem in (12.33). The formulation (12.28) is often used when comparing the SVM loss function with other loss functions (Steinwart, 2007). The two-piece formulation (12.29) is convenient for computing subgradients, as each piece is linear. The third formulation (12.33), as seen in Section 12.5, enables the use of convex quadratic programming (Section 7.3.2) tools.

Since binary classification is a well-studied task in machine learning, other words are also sometimes used, such as discrimination, separation, and decision. Furthermore, there are three quantities that can be the output of a binary classifier. First is the output of the linear function itself (often called the score), which can take any real value. This output can be used for ranking the examples, and binary classification can be thought of as picking a threshold on the ranked examples (Shawe-Taylor and Cristianini, 2004). The second quantity that is often considered the output of a binary classifier is the output obtained after the score is passed through a non-linear function that constrains its value to a bounded range, for example the interval [0, 1]. A common non-linear function is the sigmoid function (Bishop, 2006). When the non-linearity results in well-calibrated probabilities (Gneiting and Raftery, 2007; Reid and Williamson, 2011), this is called class probability estimation. The third output of a binary classifier is the final binary decision {+1, −1}, which is the one most commonly assumed to be the output of the classifier.

The SVM is a binary classifier that does not naturally lend itself to a probabilistic interpretation. There are several approaches for converting the raw output of the linear function (the score) into a calibrated class probability estimate P(Y = 1 | X = x) that involve an additional calibration step (Platt, 2000; Zadrozny and Elkan, 2001; Lin et al., 2007).
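As an illustration of such a calibration step, here is a minimal sketch in the spirit of Platt (2000): it fits a sigmoid σ(A s + B) to held-out scores s by minimizing the negative log-likelihood with gradient descent. The NumPy implementation, the synthetic scores, the step size, and the iteration count are our own assumptions, not details from the references.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_platt(scores, labels, step=0.1, iters=5000):
    """Fit p(y=1 | s) = sigmoid(A*s + B) by maximum likelihood.
    labels are in {0, 1}; scores are raw SVM outputs on held-out data."""
    A, B = 1.0, 0.0
    for _ in range(iters):
        p = sigmoid(A * scores + B)
        # gradient of the negative log-likelihood with respect to A and B
        grad_A = np.sum((p - labels) * scores)
        grad_B = np.sum(p - labels)
        A -= step * grad_A / len(scores)
        B -= step * grad_B / len(scores)
    return A, B

# Synthetic held-out scores: positives tend to have larger scores (assumption)
rng = np.random.default_rng(3)
labels = rng.integers(0, 2, size=200)
scores = 2.0 * (2 * labels - 1) + rng.normal(0.0, 1.5, size=200)

A, B = fit_platt(scores, labels)
print("calibrated probability for score 0.5:", sigmoid(A * 0.5 + B))
```

Platt's original procedure additionally uses smoothed targets and a more robust optimizer (Lin et al., 2007 discuss the numerical issues); the sketch only conveys the overall shape of the calibration step.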
From the training perspective, there are many related probabilistic approaches. We mentioned at the end of Section 12.2.5 that there is a relationship between the loss function and the likelihood (also compare Sections 8.2 and 8.3). The maximum likelihood approach corresponding to a well-calibrated transformation during training is called logistic regression, which comes from a class of methods called generalized linear models. Details of logistic regression from this point of view can be found in Agresti (2002, chapter 5) and McCullagh and Nelder (1989, chapter 4). Naturally, one could take a more Bayesian view of the classifier output by estimating a posterior distribution using Bayesian logistic regression. The Bayesian view also includes the specification of the prior, which includes design choices such as conjugacy (Section 6.6.1) with the likelihood. Additionally, one could consider latent functions as priors, which results in Gaussian process classification (Rasmussen and Williams, 2006, chapter 3).
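To make the Bayesian view slightly more concrete, the following sketch (NumPy, our assumption) fits Bayesian logistic regression with a Gaussian prior w ∼ N(0, α⁻¹ I) using the Laplace approximation: Newton's method finds the MAP weights, the inverse Hessian at the MAP serves as the approximate posterior covariance, and predictions use the standard probit-based approximation to the averaged sigmoid. Labels are taken in {0, 1} here, unlike the {−1, +1} convention of the SVM; the value of α, the toy data, and the iteration count are illustrative choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_logistic_regression(X, y, alpha=1.0, iters=25):
    """MAP weights and Laplace posterior covariance for Bayesian logistic
    regression with prior w ~ N(0, alpha^{-1} I) and labels y in {0, 1}."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(iters):                          # Newton steps on the log posterior
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) + alpha * w            # gradient of the negative log posterior
        H = X.T @ (X * (p * (1 - p))[:, None]) + alpha * np.eye(D)
        w = w - np.linalg.solve(H, grad)
    p = sigmoid(X @ w)
    H = X.T @ (X * (p * (1 - p))[:, None]) + alpha * np.eye(D)
    return w, np.linalg.inv(H)                      # posterior mean and covariance

def predict_proba(X_new, w_map, S):
    """Approximate predictive p(y=1 | x) via the probit approximation."""
    mu = X_new @ w_map                              # mean of the latent activation
    var = np.sum((X_new @ S) * X_new, axis=1)       # its variance under the posterior
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var / 8.0)
    return sigmoid(kappa * mu)

# Toy data (assumed): two Gaussian blobs with a bias feature appended
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1.5, 1, (30, 2)), rng.normal(1.5, 1, (30, 2))])
X = np.hstack([X, np.ones((60, 1))])
y = np.hstack([np.zeros(30), np.ones(30)])
w_map, S = laplace_logistic_regression(X, y)
print(predict_proba(X[:3], w_map, S))
```

Replacing the finite-dimensional prior on w by a prior over latent functions leads to Gaussian process classification, as mentioned above.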