Jong Chul Ye
Geometry of Deep Learning
A Signal Processing Perspective
Mathematics in Industry
Volume 37
Series Editors
Hans Georg Bock, Interdisciplinary Center for Scientific Computing IWR,
Heidelberg University, Heidelberg, Germany
Frank de Hoog, CSIRO, Canberra, Australia
Avner Friedman, Ohio State University, Columbus, OH, USA
Arvind Gupta, University of British Columbia, Vancouver, BC, Canada
André Nachbin, IMPA, Rio de Janeiro, RJ, Brazil
Tohru Ozawa, Waseda University, Tokyo, Japan
William R. Pulleyblank, United States Military Academy, West Point, NY, USA
Torgeir Rusten, Det Norske Veritas, Hoevik, Norway
Fadil Santosa, University of Minnesota, Minneapolis, MN, USA
Jin Keun Seo, Yonsei University, Seoul, Korea (Republic of)
Anna-Karin Tornberg, Royal Institute of Technology (KTH), Stockholm, Sweden
Mathematics in Industry focuses on the research and educational aspects of
mathematics used in industry and other business enterprises. Books for Mathematics
in Industry are in the following categories: research monographs, problem-oriented
multi-author collections, textbooks with a problem-oriented approach, conference
proceedings. Relevance to the actual practical use of mathematics in industry is the
distinguishing feature of the books in the Mathematics in Industry series.
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore
Pte Ltd. 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
To Andy, Ella, and Joo
Preface
It was a very different, unprecedented, and weird start of the semester, and I did
not know what to do. This semester, I was supposed to offer a new senior-level
undergraduate class on Advanced Intelligence to jointly teach students at the Depart-
ment of Bio/Brain Engineering and the Department of Mathematical Sciences. I
had initially planned a standard method for teaching machine learning, the contents
of which are practical, experience-based lectures with a lot of interaction with the
students through many mini-projects and term projects. Unfortunately, the global
pandemic of COVID-19 had completely changed the world, and such interactive classes were no longer an option most of the time.
So, I thought about the best way to give online lectures to my students. I
wanted my class to be different from other popular online machine learning courses
but still provide up-to-date information about modern deep learning. However,
not many options were available. Most existing textbooks are already outdated or
very implementation oriented without touching the basics. One option would be to
prepare presentation slides by adding all the up-to-date knowledge that I wanted to
teach. However, for undergraduate-level courses, the presentation files are usually
not enough for students to follow the class, and we need a textbook that students
can read independently to understand the class. For this reason, I decided to write
a reading material first and then create presentation files based on it, so that the
students can learn independently before and after the online lectures. This was the
start of my semester-long book project on Geometry of Deep Learning.
In fact, it has been my firm belief that a deep neural network is not a magic black
box, but rather a source of endless inspiration for new mathematical discoveries.
Also, I believed in the famous quote by Isaac Newton, "Standing on the shoulders of giants," and kept looking for a mathematical interpretation of deep learning. For me
as a medical imaging researcher, this topic was critical not only from a theoretical
point of view but also for clinical decision-making, because we do not want to create
false features that can be recognized as diseases.
In 2017, on a street in Lisbon, I had a Eureka moment in understanding the hidden
framelet structure in encoder-decoder neural networks. The resulting interpretation
of the deep convolutional framelets, published in the SIAM Journal of Imaging
Sciences, has had a significant impact on the applied math community and has
been one of the most downloaded papers since its publication. However, the role
of the rectified linear unit (ReLU) was not clear in this work, and one of the
reviewers in a medical imaging journal consistently asked me to explain the role
of the ReLU in deep neural networks. At first, this looked like a question that went
beyond the scope of the medical application paper, but I am grateful to the reviewer,
as during the agony of preparing the answers to the question, I realized that the
ReLU determines the input space partitioning, which is automatically adapted to
the input space manifold. In fact, this finding led to a 2019 ICML paper, in which
we revealed the combinatorial representation of framelets, which clearly shows the
crucial connection with the classic compressed sensing (CS) approaches.
Looking back, I was pretty brave to start this book project, as these are just two
pieces of my geometric understanding of deep learning. However, as I was preparing
the reading material for each subject of deep learning, I found that there are indeed
many exciting geometric insights that have not been fully discussed.
For example, when I wrote the chapter on backpropagation, I recognized the
importance of the denominator layout convention in the matrix calculus, which
led to the beautiful geometry of the backpropagation. Before writing this book,
the normalization and attention mechanisms looked very heuristic to me, with
no systematic understanding available, which was even more confusing due to
their similarities. For example, AdaIN, Transformer, and BERT were like dark
recipes that researchers have developed with their own secret sauces. However,
an in-depth study for the preparation of the reading material has revealed a very
nice mathematical structure behind their intuition, which shows a close connection
between them and their relationship to optimal transport theory.
Writing a chapter on the geometry of deep neural networks was another joy
that broadened my insight. During my lecture, one of my students pointed out that
some partitions can lead to a low-rank mapping. In retrospect, this was already in
the equation, but it was not until my students challenged me that I recognized the
beautiful geometry of the partition, which fits perfectly with fascinating empirical
observations of the deep neural network.
The last chapter, on generative models and unsupervised learning, is something
of which I am very proud. In contrast to the conventional explanation of the gener-
ative adversarial network (GAN), variational auto-encoder (VAE), and normalizing
flows with probabilistic tools, my main focus was to derive them with geometric
tools. In fact, this effort was quite rewarding, and this chapter clearly unified various
forms of generative model as statistical distance minimization and optimal transport
problems.
In fact, the focus of this book is to give students a geometric insight that can
help them understand deep learning in a unified framework, and I believe that this is
one of the first deep learning books written from such a perspective. As this book is
based on the materials that I have prepared for my senior-level undergraduate class, I
believe that this book can be used for one-semester-long senior-level undergraduate
and graduate-level classes. In addition, my class was a code-shared course for
both bioengineering and math students, so that much of the content of the work
is interdisciplinary, which tries to appeal to students in both disciplines.
I am very grateful to my TAs and students of the 2020 spring class of BiS400C
and MAS480. I would especially like to thank my great team of TAs: Sangjoon
Park, Yujin Oh, Chanyong Jung, Byeongsu Sim, Hyungjin Chung, and Gyutaek
Oh. Sangjoon, in particular, has done a tremendous job as Head TA and provided
organized feedback on the typographical errors and mistakes of this book. I would
also like to thank my wonderful team at the Bio Imaging, Signal Processing
and Learning laboratory (BISPL) at KAIST, who have produced ground-breaking
research works that have inspired me.
Many thanks to my awesome son and future scientist, Andy Sangwoo, and my
sweet daughter and future writer, Ella Jiwoo, for their love and support. You are my
endless source of energy and inspiration, and I am so proud of you. Last, but not the
least, I would like to thank my beloved wife, Seungjoo (Joo), for her endless love
and constant support ever since we met. I owe you everything and you made me a
good man. With my warmest thanks,
Part I
Basic Tools for Machine Learning
“I heard reiteration of the following claim: Complex theories do not work; simple
algorithms do. I would like to demonstrate that in the area of science a good old
principle is valid: Nothing is more practical than a good theory.”
–Vladimir N Vapnik
Chapter 1
Mathematical Preliminaries
In this chapter, we briefly review the basic mathematical concepts that are required
to understand the materials of this book.
A metric space (X, d) is a set X together with a metric d on the set. Here, a metric
is a function that defines a concept of distance between any two members of the set,
which is formally defined as follows.
Definition 1.1 (Metric) A metric on a set X is a function called the distance d :
X×X → R+ , where R+ is the set of non-negative real numbers. For all x, y, z ∈ X,
this function is required to satisfy the following conditions:
1. d(x, y) ≥ 0 (non-negativity).
2. d(x, y) = 0 if and only if x = y.
3. d(x, y) = d(y, x) (symmetry).
4. d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).
A metric on a space induces topological properties like open and closed sets, which lead to the study of more abstract topological spaces. Specifically, about any point x in a metric space X, we define the open ball of radius r > 0 about x as the set
$$B_r(x) = \{y \in X \mid d(x, y) < r\}.$$
Using this, we have the formal definition of openness and closedness of a set.
Definition 1.2 (Open Set, Closed Set) A subset U ⊆ X is called open if for every
x ∈ U there exists an r > 0 such that Br (x) is contained in U . The complement of
an open set is called closed.
A sequence $\{x_n\}$ in a metric space is called a Cauchy sequence if for every $\epsilon > 0$ there exists $N$ such that
$$d(x_n, x_m) < \epsilon, \quad \forall m, n \ge N,$$
and a metric space is said to be complete if every Cauchy sequence converges in it. In addition, a function $f$ between metric spaces is called Lipschitz continuous if there exists a constant $K \ge 0$ such that $d(f(x), f(y)) \le K\, d(x, y)$ for all $x, y$. Here, the constant $K$ is often called the Lipschitz constant, and a function $f$ with the Lipschitz constant $K$ is called a $K$-Lipschitz function.
A vector space V is a set that is closed under finite vector addition and scalar
multiplication. In machine learning applications, the scalars are usually real or complex numbers, in which case V is called a vector space over the real or complex numbers.
For example, the Euclidean n-space Rn is called a real vector space, and Cn
is called a complex vector space. In the n-dimensional Euclidean space Rn , every
element is represented by a list of n real numbers, addition is component-wise, and
scalar multiplication is multiplication on each term separately. More specifically, we
define a column n-real-valued vector x to be an array of n real numbers, denoted by
$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}^{*} \in \mathbb{R}^n,$$
where the superscript denotes the adjoint. Note that for a real vector, the adjoint
is just a transpose. Then, the sum of the two vectors x and y, denoted by x + y, is
defined by
$$x + y = \begin{bmatrix} x_1 + y_1 & x_2 + y_2 & \cdots & x_n + y_n \end{bmatrix}^{*}.$$
In addition, we formally define the inner product and the norm in a vector space
as follows.
Definition 1.5 (Inner Product) Let $\mathcal{V}$ be a vector space over $\mathbb{R}$. A function $\langle\cdot,\cdot\rangle_{\mathcal{V}} : \mathcal{V}\times\mathcal{V}\to\mathbb{R}$ is an inner product on $\mathcal{V}$ if:
1. Linear: $\langle \alpha_1 f_1 + \alpha_2 f_2, g\rangle_{\mathcal{V}} = \alpha_1\langle f_1, g\rangle_{\mathcal{V}} + \alpha_2\langle f_2, g\rangle_{\mathcal{V}}$ for all $\alpha_1, \alpha_2 \in \mathbb{R}$ and $f_1, f_2, g \in \mathcal{V}$.
2. Symmetric: $\langle f, g\rangle_{\mathcal{V}} = \langle g, f\rangle_{\mathcal{V}}$.
3. $\langle f, f\rangle_{\mathcal{V}} \ge 0$ and $\langle f, f\rangle_{\mathcal{V}} = 0$ if and only if $f = 0$.
If the underlying vector space $\mathcal{V}$ is obvious, we usually represent the inner product without the subscript $\mathcal{V}$, i.e. $\langle f, g\rangle$. For example, the inner product of two vectors $f, g \in \mathbb{R}^n$ is defined as
$$\langle f, g\rangle = \sum_{i=1}^n f_i g_i = f^\top g.$$
Two vectors $x, y$ are said to be orthogonal if
$$\langle x, y\rangle = 0,$$
and the orthogonal complement of a subset $S \subseteq \mathcal{V}$ is defined by
$$S^\perp = \{x \in \mathcal{V} : \langle v, x\rangle = 0, \ \forall v \in S\}.$$
From the inner product, we can obtain the so-called induced norm:
$$\|x\| = \sqrt{\langle x, x\rangle}.$$
Similarly, the definition of the metric in Sect. 1.1 informs us that a norm in a vector
space V induces a metric, i.e.
$$d(x, y) = \|x - y\|, \quad x, y \in \mathcal{V}. \qquad (1.3)$$
The norm and inner product in a vector space have special relations. For example,
for any two vectors x, y ∈ V, the following Cauchy–Schwarz inequality always
holds:
$$|\langle x, y\rangle| \le \|x\|\,\|y\|. \qquad (1.4)$$
An inner product space is defined as a vector space that is equipped with an inner
product. A normed space is a vector space on which a norm is defined. An inner
product space is always a normed space since we can define a norm as $\|f\| = \sqrt{\langle f, f\rangle}$, which is often called the induced norm. Among the various forms of the
normed space, one of the most useful normed spaces is the Banach space.
Definition 1.7 The Banach space is a complete normed space.
Here, the “completeness” is especially important from the optimization perspective,
since most optimization algorithms are implemented in an iterative manner so that
the final solution of the iterative method should belong to the underlying space H.
Recall that the convergence property is a property of a metric space. Therefore, the
Banach space can be regarded as a vector space equipped with desirable properties
of a metric space. Similarly, we can define the Hilbert space.
Definition 1.8 The Hilbert space is a complete inner product space.
We can easily see that the Hilbert space is also a Banach space thanks to the
induced norm. The inclusion relationship between vector spaces, normed spaces,
inner product spaces, Banach spaces and Hilbert spaces is illustrated in Fig. 1.1.
As shown in Fig. 1.1, the Hilbert space has many nice mathematical structures
such as inner product, norm, completeness, etc., so it is widely used in the machine
learning literature. The following are well-known examples of Hilbert spaces:
• $\ell^2(\mathbb{Z})$: a function space composed of square-summable discrete-time signals, i.e.
$$\ell^2(\mathbb{Z}) = \left\{ x = \{x_l\}_{l=-\infty}^{\infty} \ \Big|\ \sum_{l=-\infty}^{\infty} |x_l|^2 < \infty \right\},$$
Fig. 1.1 RKHS, Hilbert space, Banach space, and vector space
with the inner product given by
$$\langle x, y\rangle_{\mathcal{H}} = \sum_{l=-\infty}^{\infty} x_l y_l, \quad \forall x, y \in \mathcal{H}. \qquad (1.5)$$
Among the various forms of the Hilbert space, the reproducing kernel Hilbert space
(RKHS) is of particular interest in the classical machine learning literature, which
will be explained later in this book. Here, the readers are reminded that the RKHS is
only a subset of the Hilbert space as shown in Fig. 1.1, i.e. the Hilbert space is more
general than the RKHS.
A set of vectors $\{x_i\}_{i=1}^k$ in a vector space is said to be linearly independent if
$$\alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_k x_k = 0$$
implies that
$$\alpha_i = 0, \quad i = 1, \cdots, k.$$
The set of all vectors reachable by taking linear combinations of vectors in a set $S$ is called the span of $S$. For example, if $S = \{x_i\}_{i=1}^k$, then we have
$$\text{span}(S) = \left\{\sum_{i=1}^k \alpha_i x_i \ \Big|\ \alpha_i \in \mathbb{R}\right\}.$$
A set $B = \{b_i\}_{i=1}^m$ of elements (vectors) in a vector space $\mathcal{V}$ is called a basis if every element of $\mathcal{V}$ may be written in a unique way as a linear combination of elements of $B$; that is, for all $f \in \mathcal{V}$, there exist unique coefficients $\{c_i\}$ such that
$$f = \sum_{i=1}^m c_i b_i. \qquad (1.7)$$
For function spaces, the number of basis vectors can be infinite. For example, for the space $\mathcal{V}_T$ composed of periodic functions with period $T$, the following complex sinusoids constitute its basis:
$$B = \{\varphi_n(t)\}_{n=-\infty}^{\infty}, \qquad \varphi_n(t) = e^{i\frac{2\pi n t}{T}}, \qquad (1.9)$$
Unlike the basis, which leads to the unique expansion, the frame is composed of redundant basis vectors, which allows multiple representations. For example, consider the following frame in $\mathbb{R}^2$:
$$\{v_1, v_2, v_3\} = \left\{\begin{bmatrix}1\\0\end{bmatrix}, \begin{bmatrix}0\\1\end{bmatrix}, \begin{bmatrix}1\\1\end{bmatrix}\right\}. \qquad (1.12)$$
Then, we can easily see that the frame allows multiple representations of, for example, $x = [2, 3]^\top$, as shown in the following:
$$x = 2v_1 + 3v_2 = v_2 + 2v_3. \qquad (1.13)$$
Frames can also be extended to deal with function spaces, in which case the number of frame elements is infinite. Formally, a set of functions
$$\Phi = [\phi_k]_{k\in\Gamma} = \begin{bmatrix}\cdots & \phi_{k-1} & \phi_k & \cdots\end{bmatrix}$$
is called a frame of $\mathcal{H}$ if it satisfies
$$\alpha\|f\|^2 \le \sum_{k\in\Gamma} |\langle f, \phi_k\rangle|^2 \le \beta\|f\|^2, \quad \forall f \in \mathcal{H}, \qquad (1.14)$$
where α, β > 0 are called the frame bounds. If α = β, then the frame is said to be
tight. In fact, the basis is a special case of tight frames.
We now start with a formal definition of a probability space and related terms from
the measure theory [2].
Definition 1.9 (Probability Space) A probability space is a triple $(\Omega, \mathcal{F}, \mu)$ consisting of the sample space $\Omega$, an event space $\mathcal{F}$ composed of subsets of $\Omega$ (which is often called a $\sigma$-algebra), and the probability measure (or distribution) $\mu : \mathcal{F} \to [0, 1]$, a function such that:
• $\mu$ must satisfy the countable additivity property that for all countable collections $\{E_i\}$ of pairwise disjoint sets:
$$\mu\left(\cup_i E_i\right) = \sum_i \mu(E_i);$$
• the measure of the entire sample space is equal to one: $\mu(\Omega) = 1$.
The trace of a square matrix $A \in \mathbb{R}^{n\times n}$, denoted by $\text{Tr}(A)$, is defined to be the sum of the elements on the main diagonal (from the upper left to the lower right) of $A$:
$$\text{Tr}(A) = \sum_{i=1}^n a_{ii}.$$
Definition 1.11 (Range Space) The range space of a matrix A ∈ Rm×n , denoted
by R(A), is defined by R(A) := {Ax | ∀x ∈ Rn }.
Definition 1.12 (Null Space) The null space of a matrix A ∈ Rm×n , denoted by
N(A), is defined by N(A) := {x ∈ Rn | Ax = 0}.
A subset of a vector space is called a subspace if it is closed under both addition and scalar multiplication. We can easily see that the range and null spaces are subspaces. Moreover, we can show the following fundamental property:
$$\mathcal{R}(A)^\perp = \mathcal{N}(A^\top),$$
where $\perp$ denotes the orthogonal complement. In addition, if a subspace $S$ is spanned by the columns of a matrix $B$ with full column rank, the orthogonal projection of a vector $y$ onto $S$ is given by
$$\hat{y} = P_S y, \qquad P_S = B(B^\top B)^{-1}B^\top.$$
A scalar $\lambda$ is called an eigenvalue of a square matrix $A$ with the associated eigenvector $v \neq 0$ if
$$Av = \lambda v. \qquad (1.18)$$
Furthermore, any matrix $A \in \mathbb{R}^{m\times n}$ of rank $r$ admits a singular value decomposition (SVD)
$$A = \sum_{k=1}^r \sigma_k u_k v_k^\top, \qquad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0,$$
where $u_k$ and $v_k$ are called left singular vectors and right singular vectors, respectively.
Using the SVD, we can define matrix norms. Among the various forms of matrix norms for a matrix $X \in \mathbb{R}^{n\times n}$, the spectral norm $\|X\|_2$ and the nuclear norm $\|X\|_*$ are quite often used, which are defined by
$$\|X\|_2 = \sigma_{\max}(X) = \sqrt{\lambda_{\max}(X^\top X)}, \qquad \|X\|_* = \sum_i \sigma_i(X),$$
where $\sigma_{\max}(\cdot)$ and $\lambda_{\max}(\cdot)$ denote the largest singular value and eigenvalue, respectively.
The following matrix inversion lemma [3] is quite useful.
Lemma 1.1 (Matrix Inversion Lemma)
$$(I + UCV)^{-1} = I - U\left(C^{-1} + VU\right)^{-1}V, \qquad (1.22)$$
$$(A + UCV)^{-1} = A^{-1} - A^{-1}U\left(C^{-1} + VA^{-1}U\right)^{-1}VA^{-1}. \qquad (1.23)$$
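As a quick sanity check, (1.23) can be verified numerically. Below is a minimal NumPy sketch; the matrix sizes, random seed, and conditioning trick are arbitrary illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # well-conditioned square matrix
U = rng.standard_normal((n, k))
C = np.eye(k) + 0.1 * rng.standard_normal((k, k))
V = rng.standard_normal((k, n))

# Left-hand side of (1.23): direct inversion of the perturbed matrix
lhs = np.linalg.inv(A + U @ C @ V)

# Right-hand side of (1.23): matrix inversion lemma
Ainv = np.linalg.inv(A)
rhs = Ainv - Ainv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ Ainv @ U) @ V @ Ainv

print(np.allclose(lhs, rhs))   # True (up to floating-point error)
```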
For matrices $A = [a_{ij}] \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{p\times q}$, the Kronecker product is defined as the block matrix $A \otimes B := [a_{ij}B] \in \mathbb{R}^{mp\times nq}$. The Kronecker product has many important properties, which can be exploited to simplify many matrix-related operations. Some of the basic properties are provided in the following lemma. The proofs of the lemmas are straightforward and can easily be found in a standard linear algebra textbook [4].
Lemma 1.2
$$A\otimes(B + C) = A\otimes B + A\otimes C. \qquad (1.25)$$
$$(B + C)\otimes A = B\otimes A + C\otimes A. \qquad (1.26)$$
$$A\otimes B \neq B\otimes A \ \text{in general}. \qquad (1.27)$$
$$(A\otimes B)\otimes C = A\otimes(B\otimes C). \qquad (1.28)$$
$$(A\otimes B)^\top = A^\top\otimes B^\top. \qquad (1.29)$$
$$(A\otimes B)^{-1} = A^{-1}\otimes B^{-1}. \qquad (1.30)$$
Lemma 1.3 If $A$, $B$, $C$ and $D$ are matrices of such sizes that one can form the matrix products $AC$ and $BD$, then
$$(A\otimes B)(C\otimes D) = (AC)\otimes(BD). \qquad (1.31)$$
One of the important usages of the Kronecker product comes from the vectorization
operation of a matrix. For this we first define the following two operations.
Definition 1.15 If $A = \begin{bmatrix} a_1 & \cdots & a_n\end{bmatrix} \in \mathbb{R}^{m\times n}$, then
$$\text{VEC}(A) = \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix} \in \mathbb{R}^{mn}, \qquad (1.32)$$
and the inverse operation is defined by
$$\text{UNVEC}\big(\text{VEC}(A)\big) = \text{UNVEC}\left(\begin{bmatrix} a_1 \\ \vdots \\ a_n\end{bmatrix}\right) = A. \qquad (1.33)$$
From these definitions, we can obtain the following two lemmas, which will be extensively used here.
Lemma 1.4 ([4]) For the matrices A, B, C with appropriate sizes, we have
$$\text{VEC}(ABC) = \left(C^\top \otimes A\right)\text{VEC}(B).$$
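Both the mixed-product property (1.31) and the vectorization identity of Lemma 1.4 are easy to check numerically. A minimal NumPy sketch follows; note that VEC here stacks columns, which corresponds to order='F' in NumPy, and all matrix sizes are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
C = rng.standard_normal((2, 5))

def vec(M):
    # Column-stacking vectorization, matching the VEC operation in Definition 1.15
    return M.reshape(-1, order='F')

# Lemma 1.4: VEC(ABC) = (C^T kron A) VEC(B)
print(np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B)))        # True

# Lemma 1.3 (mixed product): (A kron B)(C kron D) = (AC) kron (BD)
Cm = rng.standard_normal((4, 2))    # sized so that A @ Cm is defined
D = rng.standard_normal((2, 3))     # sized so that B @ D is defined
print(np.allclose(np.kron(A, B) @ np.kron(Cm, D), np.kron(A @ Cm, B @ D)))  # True
```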
In matrix calculus, there are two competing layout conventions for derivatives involving vectors: the numerator layout and the denominator layout. For a scalar $y$ and a vector $x \in \mathbb{R}^n$, the numerator layout notation gives
$$\frac{\partial y}{\partial x} = \left[\frac{\partial y}{\partial x_1} \ \cdots \ \frac{\partial y}{\partial x_n}\right], \qquad \frac{\partial x}{\partial y} = \begin{bmatrix}\frac{\partial x_1}{\partial y}\\ \vdots\\ \frac{\partial x_n}{\partial y}\end{bmatrix},$$
implying that the number of rows follows that of the numerator. On the other hand, the denominator layout notation provides
$$\frac{\partial y}{\partial x} = \begin{bmatrix}\frac{\partial y}{\partial x_1}\\ \vdots\\ \frac{\partial y}{\partial x_n}\end{bmatrix}, \qquad \frac{\partial x}{\partial y} = \left[\frac{\partial x_1}{\partial y}\ \cdots\ \frac{\partial x_n}{\partial y}\right],$$
where the number of resulting rows follows that of the denominator. Either layout
convention is okay, but we should be consistent in using the convention.
Here, we will follow the denominator layout convention. The main motivation
for using the denominator layout is from the derivative with respect to the matrix.
More specifically, for a given scalar c and a matrix W ∈ Rm×n , according to the
denominator layout, we have
$$\frac{\partial c}{\partial W} = \begin{bmatrix} \frac{\partial c}{\partial w_{11}} & \cdots & \frac{\partial c}{\partial w_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial c}{\partial w_{m1}} & \cdots & \frac{\partial c}{\partial w_{mn}}\end{bmatrix} \in \mathbb{R}^{m\times n}. \qquad (1.36)$$
Similarly, for vectors $a, x \in \mathbb{R}^n$, the denominator layout gives
$$\frac{\partial a^\top x}{\partial x} = \frac{\partial x^\top a}{\partial x} = a. \qquad (1.37)$$
Accordingly, for a given scalar $c$ and a matrix $W\in\mathbb{R}^{m\times n}$, we can show that
$$\frac{\partial c}{\partial W} := \text{UNVEC}\left(\frac{\partial c}{\partial \text{VEC}(W)}\right) \in \mathbb{R}^{m\times n}, \qquad (1.38)$$
in order to be consistent with (1.36). Under the denominator layout notation, for
given vectors x ∈ Rm and y ∈ Rn , the derivative of a vector with respect to a vector
is given by
$$\frac{\partial y}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_m} & \cdots & \frac{\partial y_n}{\partial x_m}\end{bmatrix} \in \mathbb{R}^{m\times n}. \qquad (1.39)$$
In particular, it follows that under the denominator layout
$$\frac{\partial Ax}{\partial x} = A^\top. \qquad (1.41)$$
Finally, the following result is useful.
Lemma 1.6 Let $A\in\mathbb{R}^{m\times n}$ and $x\in\mathbb{R}^n$. Then, we have
$$\frac{\partial Ax}{\partial \text{VEC}(A)} = x\otimes I_m. \qquad (1.42)$$
Proof Using Lemma 1.4, we have $Ax = \text{VEC}(I_m A x) = (x^\top\otimes I_m)\text{VEC}(A)$, so that
$$\frac{\partial Ax}{\partial \text{VEC}(A)} = \frac{\partial (x^\top\otimes I_m)\text{VEC}(A)}{\partial \text{VEC}(A)} = \left(x^\top\otimes I_m\right)^\top = x\otimes I_m,$$
where we use (1.37) and (1.29) for the second and the third equalities, respectively.
Q.E.D.
Lemma 1.7 ([5]) Let $x$, $a$ and $B$ denote vectors and a matrix with appropriate sizes, respectively. Then, we have
$$\frac{\partial x^\top a}{\partial x} = \frac{\partial a^\top x}{\partial x} = a, \qquad (1.44)$$
$$\frac{\partial x^\top Bx}{\partial x} = \left(B + B^\top\right)x. \qquad (1.45)$$
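Under the denominator layout, (1.44) and (1.45) can be checked against a finite-difference approximation of the gradient. The following is a minimal NumPy sketch; the step size and dimensions are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
a = rng.standard_normal(n)
B = rng.standard_normal((n, n))
x = rng.standard_normal(n)
eps = 1e-6

def num_grad(f, x):
    # Central-difference gradient; a column vector in the denominator layout
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# (1.44): gradient of a^T x is a
print(np.allclose(num_grad(lambda z: a @ z, x), a))
# (1.45): gradient of x^T B x is (B + B^T) x
print(np.allclose(num_grad(lambda z: z @ B @ z, x), (B + B.T) @ x, atol=1e-4))
```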
Under the denominator layout, the gradient operator is accordingly defined as
$$\nabla := \frac{\partial}{\partial x} \in \mathbb{R}^n.$$
Next, we review some elements of convex optimization. For an operator $T : \mathcal{D}\to\mathcal{D}$, the set of fixed points of $T$ is given by
$$\text{Fix}\,T = \{x\in\mathcal{D} \mid Tx = x\}.$$
Let $\mathcal{X}$ and $\mathcal{Y}$ be real normed vector spaces. As a special case of an operator, we define the set of bounded linear operators
$$\mathcal{B}(\mathcal{X}, \mathcal{Y}) := \{T : \mathcal{X}\to\mathcal{Y} \mid T \text{ is linear and bounded}\},$$
and we write $\mathcal{B}(\mathcal{X}) = \mathcal{B}(\mathcal{X}, \mathcal{X})$. Let $f : \mathcal{X}\to[-\infty, \infty]$ be a function. The domain of $f$ is
$$\text{dom}\,f := \{x\in\mathcal{X} \mid f(x) < +\infty\},$$
and the graph of $f$ is
$$\text{gra}\,f := \{(x, f(x)) \mid x \in \text{dom}\,f\}.$$
The support function of a set $C$ is defined by
$$S_C(x) = \sup\{\langle x, y\rangle \mid y \in C\},$$
and an operator is called affine if it has the form
$$x \mapsto Tx + b, \quad x\in\mathcal{X},\ y\in\mathcal{Y},\ T\in\mathcal{B}(\mathcal{X},\mathcal{Y}).$$
A function is lower semicontinuous if and only if all of its lower level sets $\{x\in\mathcal{X} : f(x)\le\alpha\}$ are closed. Alternatively, $f$ is lower semicontinuous if and only if the epigraph of $f$, $\text{epi}\,f := \{(x,\xi)\in\mathcal{X}\times\mathbb{R} : f(x)\le\xi\}$, is closed. A function is proper if $-\infty\notin f(\mathcal{X})$ and $\text{dom}\,f \neq \emptyset$ (Fig. 1.2).
An operator $A : \mathcal{H}\to\mathcal{H}$ is positive semidefinite if and only if
$$\langle x, Ax\rangle \ge 0, \quad \forall x\in\mathcal{H}.$$
Fig. 1.2 Epigraphs for (a) a lower semicontinuous function, and (b) a function which is not lower
semicontinuous
Similarly, $A$ is positive definite if and only if
$$\langle x, Ax\rangle > 0 \quad \text{for all nonzero } x\in\mathcal{H}.$$
A function $f$ is called convex if
$$f(\theta x_1 + (1-\theta)x_2) \le \theta f(x_1) + (1-\theta)f(x_2)$$
for all $x_1, x_2\in\text{dom}\,f$ and $0\le\theta\le 1$. A convex set is a set that contains every line segment between any two points in the set (see Fig. 1.3). Specifically, a set $C$ is convex if $x_1, x_2\in C$ implies $\theta x_1 + (1-\theta)x_2\in C$ for all $0\le\theta\le 1$. The relation between a convex function and a convex set can also be stated using its epigraph. Specifically, a function $f(x)$ is convex if and only if its epigraph $\text{epi}\,f$ is a convex set.
Convexity is preserved under various operations. For example, if {fi }i∈I is
a family of convex functions, then, supi∈I fi is convex. In addition, a set of
convex functions is closed under addition and multiplication by strictly positive real
numbers. Moreover, the limit point of a convergent sequence of convex functions is
also convex. Important examples of convex functions are summarized in Table 1.1.
1.6.3 Subdifferentials
The directional derivative of $f$ at $x$ in the direction $y$ is defined by
$$f'(x; y) = \lim_{\alpha\downarrow 0}\frac{f(x+\alpha y) - f(x)}{\alpha} \qquad (1.48)$$
if the limit exists. If the limit exists for all $y\in\mathcal{H}$, then one says that $f$ is Gâteaux differentiable at $x$. Suppose $f'(x;\cdot)$ is linear and continuous on $\mathcal{H}$. Then, there exists a unique gradient vector $\nabla f(x)\in\mathcal{H}$ such that
$$f'(x; y) = \langle y, \nabla f(x)\rangle, \quad \forall y\in\mathcal{H}.$$
Furthermore, $f$ is said to be Fréchet differentiable at $x$ if
$$\lim_{0\neq y\to 0}\frac{f(x+y) - f(x) - \langle y, \nabla f(x)\rangle}{\|y\|} = 0. \qquad (1.49)$$
A convex conjugate or convex dual is a very important concept for both classical and modern convex optimization techniques. Formally, the conjugate function $f^* : \mathcal{H}\to[-\infty,\infty]$ of $f$ is defined by
$$f^*(u) = \sup_{x\in\mathcal{H}}\ \{\langle u, x\rangle - f(x)\}. \qquad (1.52)$$
Fig. 1.5 (a) Geometry of convex conjugate. (b) Examples of finding convex conjugate for f(x) = bx + c
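To build intuition for (1.52), the supremum can be approximated on a grid. The following NumPy sketch computes the conjugate of $f(x) = \frac{1}{2}x^2$, whose conjugate is again $\frac{1}{2}u^2$; the grid range and the test points are arbitrary assumptions for illustration.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 20001)      # grid over which the supremum is approximated
f = 0.5 * x**2                           # f(x) = x^2 / 2

def conjugate(u):
    # f*(u) = sup_x { u*x - f(x) }, approximated by a maximum over the grid
    return np.max(u * x - f)

for u in [-2.0, 0.0, 1.5, 3.0]:
    print(u, conjugate(u), 0.5 * u**2)   # numerical vs. analytic conjugate
```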
Table 1.3 Examples of convex conjugate pairs often used in imaging problems. Here, $D\subset\mathcal{H}$ and we use the interpretation $0\log 0 = 0$.

| $f(x)$ | $\text{dom}\,f$ | $f^*(u)$ | $\text{dom}\,f^*$ |
|---|---|---|---|
| $f(ax)$ | $D$ | $f^*(u/a)$ | $D$ |
| $f(x+b)$ | $D$ | $f^*(u) - \langle b, u\rangle$ | $D$ |
| $af(x),\ a>0$ | $D$ | $af^*(u/a)$ | $D$ |
| $bx + c$ | $D$ | $-c$ if $u=b$; $+\infty$ otherwise | $\{b\}$ |
| $1/x$ | $\mathbb{R}_{++}$ | $-2\sqrt{-u}$ | $-\mathbb{R}_{+}$ |
| $-\log x$ | $\mathbb{R}_{++}$ | $-(1+\log(-u))$ | $-\mathbb{R}_{++}$ |
| $x\log x$ | $\mathbb{R}_{+}$ | $e^{u-1}$ | $\mathbb{R}$ |
| $\sqrt{1+x^2}$ | $\mathbb{R}$ | $-\sqrt{1-u^2}$ | $[-1,1]$ |
| $e^x$ | $\mathbb{R}$ | $u\log u - u$ | $\mathbb{R}_{+}$ |
| $\log(1+e^x)$ | $\mathbb{R}$ | $u\log u + (1-u)\log(1-u)$ | $[0,1]$ |
| $-\log(1-e^x)$ | $\mathbb{R}_{--}$ | $u\log u - (1+u)\log(1+u)$ | $\mathbb{R}_{+}$ |
| $\lvert x\rvert^p/p,\ p>1$ | $\mathbb{R}$ | $\lvert u\rvert^q/q,\ \tfrac{1}{p}+\tfrac{1}{q}=1$ | $\mathbb{R}$ |
| $\lVert x\rVert_2$ | $\mathbb{R}^n$ | $0$ if $\lVert u\rVert_2\le 1$; $+\infty$ otherwise | $\{u\in\mathbb{R}^n : \lVert u\rVert_2\le 1\}$ |
| $\langle a, x\rangle + b$ | $\mathbb{R}^n$ | $-b$ if $u=a$; $+\infty$ otherwise | $\{a\}\subset\mathbb{R}^n$ |
| $\tfrac{1}{2}x^\top Qx,\ Q\in\mathbb{S}^n_{++}$ | $\mathbb{R}^n$ | $\tfrac{1}{2}u^\top Q^{-1}u$ | $\mathbb{R}^n$ |
| $\iota_C(x)$ | $C$ | $S_C(u)$ | $\mathcal{H}$ |
| $\log\sum_{i=1}^n e^{x_i}$ | $\mathbb{R}^n$ | $\sum_{i=1}^n u_i\log u_i$ if $\sum_i u_i = 1$; $+\infty$ otherwise | $\mathbb{R}^n_{+}$ |
| $-\log\det X^{-1}$ | $\mathbb{S}^n_{++}$ | $\log\det(-U)^{-1} - n$ | $-\mathbb{S}^n_{++}$ |
Table 1.3 summarizes these findings for a variety of functions that are often used in
applications.
It is clear that f ∗ is convex since f ∗ is a point-wise supremum of a convex
function of y. In general, if f : H → [−∞, ∞], then the following hold:
1. For $\alpha\in\mathbb{R}_{++}$, we have $(\alpha f)^*(u) = \alpha f^*(u/\alpha)$.
2. Fenchel–Young inequality:
$$f(x) + f^*(y) \ge \langle y, x\rangle, \quad \forall x, y\in\mathcal{H}. \qquad (1.54)$$
3. If $f$ is proper, convex, and lower semicontinuous, then
$$f^{**} = f, \qquad (1.56)$$
$$y\in\partial f(x) \iff f(x) + f^*(y) = \langle x, y\rangle \iff x\in\partial f^*(y). \qquad (1.57)$$
⇐⇒ x ∈ ∂f (y). (1.57)
Perhaps one of the most important uses of the convex conjugate is to obtain the dual formulation. More specifically, for a given primal problem (P), one can derive an associated dual problem (D) whose optimal value bounds that of (P) from below; the gap between the primal and dual optimal values is called the duality gap.

For example, consider the following equality-constrained problem:
$$(P): \quad \min_x\ \frac{1}{2}x^\top x \quad \text{subject to}\quad b = Ax,$$
which, using the indicator function $\iota_C$ with $C = \{0\}$, can be written as
$$\min_{x,y}\ \iota_C(y) + \frac{1}{2}x^\top x \quad \text{subject to}\quad y = b - Ax.$$
Therefore, for any $u$ we have
$$\min_x\ \iota_C(Ax-b) + \frac{1}{2}x^\top x \ \ge\ \min_{x,y}\ \iota_C(y) + \frac{1}{2}x^\top x + u^\top(b - Ax - y)$$
$$=\ \min_y\ \big\{\iota_C(y) - u^\top y\big\} + \min_x\ \Big\{\frac{1}{2}x^\top x - u^\top Ax\Big\} + u^\top b$$
$$=\ \min_{y\in\{0\}}\ \big(-u^\top y\big) + \min_x\ \Big\{\frac{1}{2}x^\top x - u^\top Ax\Big\} + u^\top b\ =\ -\frac{1}{2}u^\top AA^\top u + u^\top b,$$
where the last equality comes from $x = A^\top u$ at the minimizer. Hence, the dual problem becomes
$$(D): \quad \max_{u\in\mathbb{R}^m}\ -\frac{1}{2}u^\top AA^\top u + u^\top b,$$
which is equivalent to $\min_{u\in\mathbb{R}^m}\ \frac{1}{2}u^\top AA^\top u - u^\top b$.
More generally, consider the optimization problem
$$\min_x\ f_0(x)$$
$$\text{subject to}\quad f_i(x)\le 0,\ i=1,\cdots,n, \qquad (1.61)$$
$$\qquad\qquad\ \ h_i(x) = 0,\ i=1,\cdots,p. \qquad (1.62)$$
Then, the corresponding dual problem is given by
$$\max_{\alpha,\nu}\ g(\alpha,\nu) \quad \text{subject to}\quad \alpha\ge 0, \qquad (1.64)$$
where
$$g(\alpha,\nu) := \inf_x\left\{f_0(x) + \sum_{i=1}^n \alpha_i f_i(x) + \sum_{j=1}^p \nu_j h_j(x)\right\}. \qquad (1.65)$$
One of the important findings in convex optimization theory [6] is that if the
primal problem is convex, then we have the following strong duality:
g(α ∗ , ν ∗ ) = f0 (x ∗ ), (1.66)
where x ∗ and α ∗ , ν ∗ are the optimal solutions for the primal and dual problems,
respectively. Often, the dual formulation is easier to solve than the primal problem.
Additionally, there is also an interesting geometric interpretation, which will be explained later.
1.7 Exercises
Show that the solution of the regularized least squares problem
$$\hat{x} = \arg\min_{x\in\mathbb{R}^n}\ \|y - Ax\|^2 + \lambda\|x\|^2$$
is given by
$$\hat{x} = \left(A^\top A + \lambda I\right)^{-1}A^\top y = A^\top\left(AA^\top + \lambda I\right)^{-1}y,$$
Show that a twice-differentiable function $f$ is convex if and only if
$$\nabla^2 f(x) \succeq 0, \quad \forall x\in\text{dom}\,f.$$
13. Let f (x) = |x| with x ∈ [−1, 1]. Find its subdifferential ∂f (x).
14. Prove Fermat’s rule in Theorem 1.3.
15. Show that the following properties hold for the subdifferentials:
a. If f is differentiable, then ∂f (x) = {∇f (x)}.
b. Let f be proper. Then, ∂f (x) is closed and convex for any x ∈ domf .
c. Let λ ∈ R++ . Then, ∂(λf ) = λ∂f .
d. Let f, g be convex, and lower semicontinuous functions, and L is a linear
operator. Then
16. Prove the Fenchel–Young inequality:
$$f(x) + f^*(y) \ge \langle y, x\rangle, \quad \forall x, y\in\mathcal{H}.$$
17. Show that for a proper, convex, and lower semicontinuous function $f$,
$$(\partial f)^{-1} = \partial f^*.$$
where
$$g(Ax) = \|Ax\|_1, \qquad f(x) = \|y - x\|_2^2,$$
is given by
$$-\min_{u\in\mathbb{R}^m}\ u^\top AA^\top u + y^\top A^\top u \quad \text{subject to}\quad \|u\|_2\le 1.$$
Chapter 2
Linear and Kernel Classifiers
2.1 Introduction
Classification is one of the most basic tasks in machine learning. In computer vision,
an image classifier is designed to classify input images in corresponding categories.
Although this task appears trivial to humans, there are considerable challenges with
regard to automated classification by computer algorithms.
For example, let us think about recognizing “dog” images. One of the first
technical issues here is that a dog image is usually taken in the form of a digital
format such as JPEG, PNG, etc. Aside from the compression scheme used in
the digital format, the image is basically just a collection of numbers on a two-
dimensional grid, which takes integer values from 0 to 255. Therefore, a computer
algorithm should read the numbers to decide whether such a collection of numbers
corresponds to a high-level concept of “dog”. However, if the viewpoint is changed,
the composition of the numbers in the array is totally changed, which poses
additional challenges to the computer program. To make matters worse, in a natural
setting a dog is rarely found on a white background; rather, the dog plays on the
lawn or takes a nap in the living room, hides underneath furniture or chews with her
eyes closed, which makes the distribution of the numbers very different depending
on the situation. Additional technical challenges in computer-based recognition of
a dog come from all kinds of sources such as different illumination conditions,
different poses, occlusion, intra-class variation, etc., as shown in Fig. 2.1. Therefore,
designing a classifier that is robust to such variations was one of the important topics
in computer vision literature for several decades.
In fact, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [7]
was initiated to evaluate various computer algorithms for image classification at
large scale. ImageNet is a large visual database designed for use in visual object
recognition software research [8]. Over 14 million images have been hand-annotated
in the project to indicate which objects are depicted, and at least one million of
the images also have bounding boxes. In particular, ImageNet contains more than
Fig. 2.1 Technical challenges in recognizing a dog from digital images. Figures courtesy of Ella
Jiwoo Ye
20,000 categories made up of several hundred images. Since 2010, the ImageNet
project has organized an annual software competition, the ImageNet Large Scale
Visual Recognition Challenge (ILSVRC), in which software programs compete for
the correct classification and recognition of objects and scenes. The main motivation
is to allow researchers to compare progress in classification across a wider variety
of objects. Since the introduction of AlexNet in 2012 [9], which was the first deep learning approach to win the ImageNet Challenge, the state-of-the-art image classification methods are all deep learning approaches, and now their performance even surpasses that of human observers.
Before we discuss in detail recent deep learning approaches, we revisit the
classical classifier, in particular the support vector machine (SVM) [10], to discuss
its mathematical principles. Although the SVM is already an old classical technique,
its review is important since the mathematical understanding of the SVM allows
readers to understand how the modern deep learning approaches are closely related
to the classical ones.
Specifically, consider binary classification problems where data sets from two
different classes are distributed as shown in Fig. 2.2a,b,c. Note that in Fig. 2.2a, the
two sets are perfectly separable with linear hyperplanes. For the case of Fig. 2.2b,
there exists no linear hyperplane that perfectly separates two data sets, but one could
find a linear boundary where only a small set of data are incorrectly classified.
However, the situation in Fig. 2.2c is much different, since there exists no linear
boundary that can separate the majority of elements of the two classes. Rather, one
could find a nonlinear class boundary that can separate the two sets with small errors.
The theory of the SVM deals with all situations in Fig. 2.2a,b,c using a hard-margin
linear classifier, soft-margin linear classifier, and kernel SVM method, respectively.
In the following, we discuss each topic in detail.
Fig. 2.2 Examples of binary classification problems: (a) linear separable case, (b) approximately
linear separable case, and (c) linear non-separable case
For the linear separable case in Fig. 2.2a, there can be an infinite number of choices
of linear hyperplanes. Among them, one of the most widely used choices of the
classification boundary is to maximize the margin between the two classes. This is
often called the maximum margin linear classifier [10].
To derive this, we introduce some notation. Let $\{x_i, y_i\}_{i=1}^N$ denote the set of the data $x_i\in\mathcal{X}\subset\mathbb{R}^d$ with the binary label $y_i\in\{1, -1\}$. We now define a hyperplane in $\mathbb{R}^d$:
$$\langle w, x\rangle + b = w^\top x + b = 0, \qquad (2.1)$$
where $w\in\mathbb{R}^d$ is the normal vector of the hyperplane and $b$ is the bias term.

Fig. 2.3 Geometric structure of hard-margin linear support vector machine classifier

Then, the regions for class 1 and class $-1$ are given by
$$S_1 = \{x\in\mathbb{R}^d \mid \langle w, x\rangle + b \ge 1\}, \qquad (2.2)$$
$$S_{-1} = \{x\in\mathbb{R}^d \mid \langle w, x\rangle + b \le -1\}, \qquad (2.3)$$
respectively.
Then, the margin between the two sets is defined as the minimum distance between
the two linear boundaries of S1 and S−1 . To calculate this, we need the following
lemma:
Lemma 2.1 The distance between two parallel hyperplanes $\Pi_1 : \langle w, x\rangle + c_1 = 0$ and $\Pi_2 : \langle w, x\rangle + c_2 = 0$ is given by
$$m := \frac{|c_1 - c_2|}{\|w\|}. \qquad (2.4)$$
Proof Let $m$ be the distance between the two parallel hyperplanes $\Pi_1$ and $\Pi_2$. Then there exist two points $x_1\in\Pi_1$ and $x_2\in\Pi_2$ such that $\|x_1 - x_2\| = m$. Then, using the Pythagoras theorem, the vector $v := x_1 - x_2$ should be along the normal direction of the hyperplanes. Accordingly,
$$m = \|x_1 - x_2\| = \left|\left\langle \frac{w}{\|w\|}, x_1\right\rangle - \left\langle \frac{w}{\|w\|}, x_2\right\rangle\right| = \frac{|\langle w, x_1\rangle - \langle w, x_2\rangle|}{\|w\|} = \frac{|c_1 - c_2|}{\|w\|}.$$
Q.E.D.
Since $\langle w, x\rangle + b - 1 = 0$ and $\langle w, x\rangle + b + 1 = 0$ correspond to the linear boundaries of $S_1$ and $S_{-1}$, Lemma 2.1 informs us that the margin between the two classes is given by
$$\text{margin} := \frac{2}{\|w\|}. \qquad (2.5)$$
Therefore, for the given training data set $\{x_i, y_i\}_{i=1}^n$ with $x_i\in\mathcal{X}\subset\mathbb{R}^d$ and the binary label $y_i\in\{1,-1\}$, the maximum margin linear binary classifier design problem can be formulated as follows:
$$(P)\quad \min_w\ \frac{1}{2}\|w\|^2 \qquad (2.6)$$
$$\text{subject to}\quad 1 - y_i(\langle w, x_i\rangle + b) \le 0, \ \forall i. \qquad (2.7)$$
Recall that a general constrained optimization problem can be written as
$$\min_x\ f_0(x)$$
$$\text{subject to}\quad f_i(x)\le 0,\ i=1,\cdots,n, \qquad (2.8)$$
$$\qquad\qquad\ \ h_i(x) = 0,\ i=1,\cdots,p, \qquad (2.9)$$
and the corresponding dual problem is given by
$$\max_{\alpha,\nu}\ g(\alpha,\nu) \quad \text{subject to}\quad \alpha\ge 0, \qquad (2.11)$$
where
$$g(\alpha,\nu) := \inf_x\left\{f_0(x) + \sum_{i=1}^n \alpha_i f_i(x) + \sum_{j=1}^p \nu_j h_j(x)\right\}. \qquad (2.12)$$
One of the important findings in convex optimization theory [6] is that if the
primal problem is convex, then we have the following strong duality:
g(α ∗ , ν ∗ ) = f0 (x ∗ ), (2.13)
where x ∗ and α ∗ , ν ∗ are the optimal solutions for the primal and dual problems,
respectively. Often, the dual formulation is easier to solve than the primal problem.
Additionally, there is also an interesting geometric interpretation.
Applying this duality to the primal problem (P) in (2.6), the dual problem is given by
$$\max_\alpha\ g(\alpha) \quad\text{subject to}\quad \alpha\ge 0,$$
where $\alpha = [\alpha_1, \cdots, \alpha_n]^\top$ is the dual variable with respect to the primal variables $w$ and $b$, and
$$g(\alpha) = \min_{w,b}\ \left\{\frac{\|w\|^2}{2} + \sum_{i=1}^n \alpha_i\big(1 - y_i(\langle w, x_i\rangle + b)\big)\right\}. \qquad (2.14)$$
At the minimizers of (2.14), the derivatives with respect to $w$ and $b$ should be zero, which leads to the following first-order necessary conditions (FONC):
$$w = \sum_{i=1}^n \alpha_i y_i x_i, \qquad \sum_{i=1}^n \alpha_i y_i = 0. \qquad (2.15)$$
The FONCs in Eq. (2.15) have very important geometric interpretations. For
example, the first equation in (2.15) clearly shows how the normal vector for the
hyperplanes can be constructed using the dual variables. The second equation leads
to the balancing conditions. These will be explained in more detail later.
By plugging these FONCs into (2.14), the dual problem (D) becomes
$$\max_\alpha\ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle \qquad (2.16)$$
$$\text{subject to}\quad \sum_{i=1}^n \alpha_i y_i = 0, \quad \alpha_i\ge 0,\ \forall i.$$
Let $w^*$, $b^*$ and $\alpha^*$ denote the solutions for the primal and dual problems. Then, the resulting binary classifier is given by
$$y \leftarrow \operatorname{sign}\big(\langle w^*, x\rangle + b^*\big) \qquad (2.17)$$
for the primal formulation, and
$$y \leftarrow \operatorname{sign}\left(\sum_{i=1}^n \alpha_i^* y_i\langle x_i, x\rangle + b^*\right) \qquad (2.18)$$
for the case of the dual formulation, where $\operatorname{sign}(x)$ denotes the sign of $x$.
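As an illustration of (2.17) and (2.18), the sketch below fits a (nearly) hard-margin linear SVM on a tiny separable toy set with scikit-learn, which solves the dual problem internally. The availability of scikit-learn, the toy data, and the use of a very large C to approximate the hard-margin case are assumptions for illustration, not from the text.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (assumed for illustration)
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],     # class +1
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)            # very large C ~ hard margin

# dual_coef_ stores alpha_i^* y_i for the support vectors, so
# w^* = sum_i alpha_i^* y_i x_i as in (2.22)
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(w_from_dual, clf.coef_)             # the two expressions agree
print(clf.support_vectors_)               # only boundary points appear, cf. (2.23)-(2.24)
print(np.sign(X @ clf.coef_.ravel() + clf.intercept_))   # classifier (2.17)
```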
At the optimal solution, we have
$$g(\alpha^*, \nu^*) = f_0(x^*) + \sum_{i=1}^n \alpha_i^* f_i(x^*) + \sum_{j=1}^p \nu_j^* h_j(x^*) = f_0(x^*) + \sum_{i=1}^n \alpha_i^* f_i(x^*), \qquad (2.19)$$
where the last equality comes from the constraint $h_j(x^*) = 0$ in the primal problem. In order to make (2.19) equal to $f_0(x^*)$, which corresponds to the strong duality (2.13), the following condition should be satisfied:
$$\alpha_i^* f_i(x^*) = 0, \quad \forall i, \qquad (2.20)$$
which, for the maximum margin classifier, implies
$$\alpha_i^* > 0 \implies y_i(\langle w^*, x_i\rangle + b) = 1. \qquad (2.21)$$
This implies that in constructing the normal vector direction $w^*$ of the hyperplane using (2.15), only the training data at the class boundaries contribute:
$$w^* = \sum_{i=1}^n \alpha_i^* y_i x_i = \sum_{i\in I^+} \alpha_i^* x_i - \sum_{i\in I^-} \alpha_i^* x_i, \qquad (2.22)$$
where
$$I^+ = \{i\in[1,\cdots,n] \mid \langle w^*, x_i\rangle + b = 1\}, \qquad (2.23)$$
$$I^- = \{i\in[1,\cdots,n] \mid \langle w^*, x_i\rangle + b = -1\}. \qquad (2.24)$$
On the other hand, for the case of the training data x i inside the class boundaries,
yi ( w, x i
+ b) > 1. Therefore, the corresponding Lagrangian variable αi becomes
zero. This situation is illustrated in Fig. 2.3. Here, the set of the training data x i with
i ∈ I + or i ∈ I − is often called the support vector, which is why the corresponding
classifier is often called the support vector machine (SVM) [10].
In addition, the second equation in (2.15) leads to
$$\sum_{i\in I^+}\alpha_i^* = \sum_{i\in I^-}\alpha_i^*,$$
which states the balancing condition between dual variables. In other words, the
weighting parameters for the support vectors should be balanced for each class
boundary.
As shown in Fig. 2.2b, many practical classification problems often contain data
sets that cannot be perfectly separable by a hyperplane. When the two classes are
not linearly separable (e.g., due to noise), the condition for the optimal hyperplane
can be relaxed by including extra terms:
yi ( w, x i
+ b) ≥ 1 − ξi , ξi ≥ 0 ∀i, (2.25)
where ξi are often called the slack variables. The role of the slack variables is to
allow errors in the classification. Then, the optimization goal is to find the classifier
with the maximum margin with the minimum errors as shown in Fig. 2.4.
Fig. 2.4 Geometric structure of soft-margin linear support vector machine classifier
The corresponding primal problem for the soft-margin classifier is then given by
$$(P')\quad \min_{w,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i$$
$$\text{subject to}\quad 1 - y_i(\langle w, x_i\rangle + b)\le\xi_i, \quad \xi_i\ge 0,\ \forall i, \qquad (2.26)$$
where the optimization problem again has implicit dependency on the bias term b.
The following theorem shows that the corresponding dual problem has a form very
similar to the hard-margin classifier in (2.16) with the exception of the differences
in the constraint for the dual variables.
Theorem 2.1 The Lagrangian dual formulation of the primal problem in (2.26) is
given by
$$\max_\alpha\ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j\langle x_i, x_j\rangle \qquad (2.27)$$
$$\text{subject to}\quad \sum_{i=1}^n \alpha_i y_i = 0, \quad 0\le\alpha_i\le C,\ \forall i.$$
Proof For the given primal problem in (2.26), the corresponding Lagrangian dual
is given by
$$\max_{\alpha,\gamma}\ g(\alpha,\gamma) \quad\text{subject to}\quad \alpha\ge 0,\ \gamma\ge 0, \qquad (2.28)$$
where
$$g(\alpha,\gamma) = \min_{w,b,\xi}\ \left\{\frac{1}{2}\|w\|^2 + C\sum_{i=1}^n\xi_i + \sum_{i=1}^n \alpha_i\big(1 - y_i(\langle w, x_i\rangle + b) - \xi_i\big) - \sum_{i=1}^n \gamma_i\xi_i\right\}. \qquad (2.29)$$
The first-order necessary conditions (FONCs) with respect to w, b and ξ lead to the
following equations:
$$w = \sum_{i=1}^n \alpha_i y_i x_i \qquad (2.30)$$
and
$$\sum_{i=1}^n \alpha_i y_i = 0, \qquad \alpha_i + \gamma_i = C. \qquad (2.31)$$
By plugging these FONCs into (2.29), we have
$$g(\alpha,\gamma) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j\langle x_i, x_j\rangle,$$
and since $\gamma_i = C - \alpha_i \ge 0$, the dual constraint becomes $0\le\alpha_i\le C$. Q.E.D.
The soft-margin formulation can also be understood through the so-called hinge loss, of which a pictorial description is given in Fig. 2.5. Specifically, we define the slack variable:
$$\xi_i := 1 - y_i(\langle w, x_i\rangle + b).$$
To make the slack variable represent the classification error for the data set $(x_i, y_i)$ within the class boundary, $\xi_i$ should be zero when the data is already well classified, but positive when there exists a classification error. This leads to the following definition of the slack variable:
$$\xi_i = \max\{0,\ 1 - y_i(\langle w, x_i\rangle + b)\} = \ell_{\text{hinge}}\big(y_i, \langle w, x_i\rangle + b\big). \qquad (2.33)$$
Later, we will show that this representation is closely related to the so-called representer theorem [11].
For example, suppose that the class 1 data are located inside the ellipse
$$S_1 = \{x = [x_1, x_2]^\top \mid x_1^2 + 2x_2^2 + 2x_1x_2 \le 2\}, \qquad (2.35)$$
whereas the class 2 data are located outside of the ellipse. This implies that although the two classes of data cannot be separated by a single hyperplane, the nonlinear boundary in (2.35) can separate the two classes.
Interestingly, the existence of the nonlinear boundary implies that we can find the corresponding linear hyperplane in the higher-dimensional space. Specifically, suppose we have a nonlinear mapping $\varphi : x = [x_1, x_2]^\top \mapsto \varphi(x)$ to the feature space in $\mathbb{R}^3$ such that
$$\varphi(x) = [\varphi_1, \varphi_2, \varphi_3]^\top = \left[x_1^2,\ x_2^2,\ \sqrt{2}\,x_1x_2\right]^\top. \qquad (2.36)$$
Then, we can easily see that $S_1$ can be represented in the feature space by
$$S_1 = \{(\varphi_1, \varphi_2, \varphi_3) \mid \varphi_1 + 2\varphi_2 + \sqrt{2}\,\varphi_3 \le 2\}. \qquad (2.37)$$
Fig. 2.6 Lifting to a high-dimensional feature space for linear classifier design
Therefore, there exists a linear classifier in R3 using the feature space mapping ϕ(x)
as shown in Fig. 2.6.
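A useful observation is that for the lifting in (2.36), the inner product in the feature space can be computed directly from the input-space inner product, since $\langle\varphi(x), \varphi(y)\rangle = (x^\top y)^2$. A two-line NumPy check (the random test points are an arbitrary assumption):

```python
import numpy as np

def phi(x):
    # Feature map of (2.36): lifts R^2 to R^3
    return np.array([x[0]**2, x[1]**2, np.sqrt(2.0) * x[0] * x[1]])

rng = np.random.default_rng(3)
x, y = rng.standard_normal(2), rng.standard_normal(2)
print(np.isclose(phi(x) @ phi(y), (x @ y)**2))   # True
```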
In general, to allow the existence of a linear classifier, the feature space should
be in a higher-dimensional space than the ambient input space. In this sense, the
feature mapping ϕ(x) works as a lifting operation that lifts up the dimension of the
data to a higher-dimensional one. In the lifted feature space by the feature mapping
ϕ(x), the binary classifier design problem in (2.27) can be defined as
$$\max_\alpha\ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j\langle\varphi(x_i), \varphi(x_j)\rangle \qquad (2.38)$$
$$\text{subject to}\quad \sum_{i=1}^n \alpha_i y_i = 0, \quad 0\le\alpha_i\le C,\ \forall i.$$
By extending (2.18) from the linear classifier, the associated nonlinear classifier
with respect to the optimization problem (2.38) can be similarly defined by
$$y \leftarrow \operatorname{sign}\left(\sum_{i=1}^n \alpha_i^* y_i\langle\varphi(x_i), \varphi(x)\rangle + b\right), \qquad (2.39)$$
where αi∗ and b are the solutions for the dual problem.
Although (2.38) and (2.39) are nice generalizations of (2.27) and (2.18), there exist
several technical issues. One of the most critical issues is that for the existence of a
linear classifier, the lifting operation may require a very-high-dimensional or even
infinite-dimensional feature space. Therefore, an explicit calculation of the feature
vector ϕ(x) may be computationally intensive or not possible.
The so-called kernel trick may overcome this technical issue by bypassing the
explicit construction of the lifting operation [11]. Specifically, as shown in (2.38)
and (2.39), all we need for the calculation of the linear classifier is the inner product
between the two feature vectors. Specifically, if we define the kernel function $K : \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ as
$$K(x, x') := \langle\varphi(x), \varphi(x')\rangle, \qquad (2.40)$$
then the lifted dual problem (2.38) can be computed using kernel evaluations only:
$$\max_\alpha\ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j K(x_i, x_j) \qquad (2.41)$$
$$\text{subject to}\quad \sum_{i=1}^n \alpha_i y_i = 0, \quad 0\le\alpha_i\le C,\ \forall i.$$
Popular choices of the kernel function include:
• Polynomial kernel: $K(x, y) = (x^\top y)^p$.
• Inhomogeneous polynomial kernel: $K(x, y) = (x^\top y + 1)^p$.
• Sigmoid kernel: $K(x, y) = \tanh(\eta x^\top y + \nu)$.
However, care should be taken since not all kernels can be used for SVM. To
be a viable option, a kernel should originate from the feature space mapping ϕ(x).
In fact, there exists an associated feature mapping if the kernel function satisfies
the so-called Mercer’s condition [11]. The kernel that satisfies Mercer’s condition
is often called the positive definite kernel. The details of Mercer’s condition can be
found from standard SVM literature [11] and will be explained later in the context
of the representer theorem.
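Since positive definiteness of the kernel corresponds to positive semidefinite Gram matrices, a quick empirical sanity check is to build the Gram matrix on sample points and inspect its eigenvalues. The sketch below does this for the inhomogeneous polynomial kernel and a Gaussian (RBF) kernel; the RBF kernel, the parameter values, and the random sample are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((20, 2))                  # 20 random points in R^2

def gram(kernel, X):
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

poly = lambda x, y, p=2: (x @ y + 1.0) ** p                           # inhomogeneous polynomial
rbf  = lambda x, y, gamma=0.5: np.exp(-gamma * np.sum((x - y)**2))    # Gaussian (RBF)

for k in (poly, rbf):
    K = gram(k, X)
    # For a positive definite kernel, the Gram matrix is symmetric PSD
    print(np.min(np.linalg.eigvalsh(K)) >= -1e-8)
```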
Although the SVM and its kernel extension are beautiful convex optimization
frameworks devoid of local minimizers, there are fundamental challenges in using
these methods for image classification. In particular, the ambient space X should not
be significantly large in the SVM due to the computationally extensive optimization
procedure. Accordingly, one of the essential steps of using the SVM framework is
feature engineering, which pre-processes the input images to obtain significantly
smaller dimensional vector x ∈ X that can capture all essential information of the
input images. For example, a classical pipeline for the image classification task can
be summarized as follows (see Fig. 2.7):
• Process the data set to extract hand-crafted features based on some knowledge of
imaging physics, geometry, and other analytic tools,
• or extract features by feeding the data into a standard set of feature extractors
such as SIFT (the Scale-Invariant Feature Transform) [12], or SURF (the
Speeded-Up Robust Features) [13], etc.
• Choose the kernels based on your domain expertise.
• Put the training data composed of hand-crafted features and labels into a kernel
SVM to learn a classifier.
Here, the main technical innovations usually come from the feature extraction,
often based on the serendipitous discoveries of lucky graduate students. Moreover,
kernel selection also requires domain expertise that was previously the subject of
extensive research. We will see later that one of the main innovations in the modern
deep learning approach is that this hand-crafted feature engineering and kernel
design are no longer required as they are automatically learned from the training
data. This simplicity can be one of the main reasons for the success of deep learning,
which led to the deluge of new deep tech companies.
So far we have mainly discussed the binary classification problems. Note that
more general forms of the classifiers beyond the binary classifier are of importance
in practice: for example, ImageNet has more than 20,000 categories. The extension
of the linear classifier for such a setup is important, but will be discussed later.
2.6 Exercises
$$k(x, y) = (x^\top y + c)^2, \quad x, y\in\mathbb{R}^2,$$
a. Are the two classes linearly separable? Answer this question by visualizing their
distribution in R2 .
b. Now, we are interested in designing a hard-margin linear SVM. What are
the support vectors? Please answer this by inspection. You must give your
reasoning.
c. Using primal formulation, compute the closed form solution of the linear
SVM classifier by hand calculation. You must show each step of your
calculation. The inequality constraints may be simplified by exploiting the
support vectors and KKT conditions.
d. Using dual formulation, compute the closed form solution of the linear SVM
classifier by hand calculation. You must show each step of your calculation.
The inequality constraints may be simplified by exploiting the support vectors
and KKT conditions.
4. Suppose we are given the following positively labeled data points:
a. Are the two classes linearly separable? Answer this question by visualizing
their distribution in R2 .
b. Now, we are interested in designing a soft-margin linear SVM. Using
MATLAB, plot the decision boundaries for various choices of C.
c. What do you observe when C → ∞?
a. Are the two classes linearly separable? Answer this question by visualizing
their distribution in R2 .
b. Find a feature mapping ϕ : R2 → F ⊂ R3 so that the two classes are linearly separable in the feature space F. Show this by drawing the data distribution in F.
c. What is the corresponding kernel?
d. What are the support vectors in F ?
e. Using dual formulation, compute the closed form solution of a kernel SVM
classifier by hand calculation. You must show each step of your calculation.
The inequality constraints may be simplified by exploiting the support vectors
and KKT conditions.
Chapter 3
Linear, Logistic, and Kernel Regression
3.1 Introduction
Fig. 3.1 Example of various regression problems. The x-axes are for the independent variables,
and y-axes are for the dependent variables. (a) linear regression, (b) logistic regression, and (c)
nonlinear regression using a polynomial kernel
A linear regression uses a linear model as shown in Fig. 3.1a. More specifically,
the dependent variable can be calculated from a linear combination of the input
variables. It is also common to refer to a linear model as Ordinary Least Squares
(OLS) linear regression or just Least Squares (LS) regression. For example, a simple
linear regression model is given by
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1,\cdots, n, \qquad (3.1)$$
and the goal is to estimate the parameter set $\beta = \{\beta_0, \beta_1\}$ from the training data $\{x_i, y_i\}_{i=1}^n$.
In general, a linear regression problem can be represented by
$$y_i = \langle x_i, \beta\rangle + \epsilon_i, \quad i = 1,\cdots, n, \qquad (3.2)$$
or, in matrix form,
$$y = X^\top\beta + \epsilon, \qquad (3.3)$$
where
$$y := \begin{bmatrix}y_1\\ \vdots\\ y_n\end{bmatrix}, \qquad X := \begin{bmatrix}x_1 & \cdots & x_n\end{bmatrix}, \qquad \epsilon := \begin{bmatrix}\epsilon_1\\ \vdots\\ \epsilon_n\end{bmatrix}.$$
Then, the regression analysis using l2 loss or the mean squared error (MSE) loss
can be done by
$$\min_\beta\ \ell(\beta), \qquad (3.4)$$
where
$$\ell(\beta) := \frac{1}{2}\|y - X^\top\beta\|^2 = \frac{1}{2}(y - X^\top\beta)^\top(y - X^\top\beta) = \frac{1}{2}\left(y^\top y - y^\top X^\top\beta - \beta^\top Xy + \beta^\top XX^\top\beta\right).$$
The parameter that minimizes the MSE loss can be found by setting the gradient
of the loss with respect to β to zero. To calculate the gradient for the vector-valued
function, the following lemma is useful.
Lemma 3.1 ([5]) Let $x$, $a$ and $B$ denote vectors and a matrix with appropriate sizes, respectively. Then, we have
$$\frac{\partial x^\top a}{\partial x} = \frac{\partial a^\top x}{\partial x} = a, \qquad (3.5)$$
$$\frac{\partial x^\top Bx}{\partial x} = \left(B + B^\top\right)x. \qquad (3.6)$$
Using Lemma 3.1, we have
$$\left.\frac{\partial\ell(\beta)}{\partial\beta}\right|_{\beta=\hat\beta} = -Xy + XX^\top\hat\beta = 0,$$
where $\hat\beta$ is the minimizer. If $XX^\top$ is invertible, i.e., $X$ has full row rank, then we have
$$\hat\beta = \left(XX^\top\right)^{-1}Xy. \qquad (3.7)$$
The full rank condition is important for the existence of the matrix inverse, which
will be revisited again in the ridge regression.
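A minimal NumPy sketch of (3.7) on synthetic data; the data-generating model and noise level are assumptions for illustration, and note that X here stacks the data points as columns, as in (3.3).

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 200, 3
beta_true = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((d, n))                        # columns are the data points x_i
y = X.T @ beta_true + 0.1 * rng.standard_normal(n)     # y = X^T beta + eps, cf. (3.3)

beta_hat = np.linalg.solve(X @ X.T, X @ y)             # (X X^T)^{-1} X y, cf. (3.7)
print(beta_hat)                                        # close to beta_true
```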
This regression setup is closely related to the general linear model (GLM), which
has been successfully used for statistical analysis. For example, GLM analysis is
one of the main workhorses for the functional MRI data analysis [14]. The main
idea of functional MRI is that multiple temporal frames of MR images of a brain
are obtained during a given task (for example, motion tasks), and then the temporal
variation of the MR values at each voxel location is analyzed to check whether its
temporal variation is correlated with a given task. Here the temporal time series data
y from one voxel is described as a linear combination of the model (X ), which is
often termed as the “design matrix”, containing a set of regressors as in Fig. 3.2
representing the independent variable and the residuals (i.e., the errors), then the
results are stored, displayed, and possibly analyzed further in the form of voxelwise
maps as shown in the top right of Fig. 3.2 when β = [β1 , β2 ] .
Similar to the example in Fig. 3.1b, there are many important problems for which
the dependent variable has limited values. For example, in binary logistic regression
for analyzing smoking behavior, the dependent variable is a dummy variable: coded
0 (did not smoke) or 1 (did smoke). In another example, one is interested in fitting a
linear model to the probability of the event. In that case, the dependent variable only takes values between 0 and 1, and transforming the independent variables does not remedy all of the potential problems. Instead, the key idea of logistic regression is to transform the dependent variable.
Specifically, the odds are defined as
$$\text{odds} = \frac{q}{1-q},$$
where $q$ is a probability in the range 0–1. The odds have a range of $0$–$\infty$, with values greater than 1 associated with an event being more likely to occur than to not occur, and values less than 1 associated with an event that is less likely to occur. Then, the term logit is defined as the log of the odds:
$$\text{logit} := \log(\text{odds}) = \log\frac{q}{1-q}.$$
In logistic regression, the logit of the probability is modeled as a linear function of the independent variables, so that the probability itself is obtained by inverting the logit, i.e., by the sigmoid function
$$\text{Sig}(x) = \frac{1}{1+e^{-x}},$$
whose shape is shown in Fig. 3.3. It is remarkable that although the nonlinear
transform is originally applied to the dependent variable for linear regression, the
net result is the introduction of the nonlinearity after the linear term. In fact, this is
closely related to the modern deep neural networks that have nonlinearities after the
linear layers.
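Although the text focuses on the modeling viewpoint, logistic regression parameters are typically fitted by maximizing the Bernoulli log-likelihood, e.g. with gradient ascent. The following is a minimal NumPy sketch under that standard approach; the synthetic data, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 500, 2
w_true, b_true = np.array([2.0, -1.0]), 0.5
X = rng.standard_normal((n, d))
p = 1.0 / (1.0 + np.exp(-(X @ w_true + b_true)))       # sigmoid of the linear term
y = rng.binomial(1, p)                                  # binary labels in {0, 1}

w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(2000):
    q = 1.0 / (1.0 + np.exp(-(X @ w + b)))              # predicted probabilities
    # Gradient ascent on the average Bernoulli log-likelihood w.r.t. (w, b)
    w += lr * X.T @ (y - q) / n
    b += lr * np.mean(y - q)

print(w, b)   # approximately recovers (w_true, b_true)
```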
Recall that the basic assumption for the linear regression solution in (3.7) is that $XX^\top$ is invertible, i.e., $X$ has full row rank. However, when the data are high-dimensional, the covariates can be collinear, which in statistical terms refers to the event of two (or multiple) covariates being highly linearly related. Consequently, $XX^\top$ may be singular or close to singular, and we cannot reliably use the standard linear regression. To deal with this issue, the ridge regression is useful.
Specifically, the following regularized least squares problem is solved:
$$\hat\beta = \arg\min_\beta\ \ell_{\text{ridge}}(\beta), \qquad (3.12)$$
where
$$\ell_{\text{ridge}}(\beta) := \frac{1}{2}\|y - X^\top\beta\|^2 + \frac{\lambda}{2}\|\beta\|^2, \qquad (3.13)$$
and $\lambda > 0$ is the regularization parameter. This type of regularization is often called Tikhonov regularization. Using Lemma 3.1, we can easily show that
$$\left.\frac{\partial\ell_{\text{ridge}}(\beta)}{\partial\beta}\right|_{\beta=\hat\beta} = -Xy + XX^\top\hat\beta + \lambda\hat\beta = 0,$$
which leads to
$$\hat\beta = \left(XX^\top + \lambda I\right)^{-1}Xy. \qquad (3.14)$$
Using the matrix inversion lemma (Lemma 1.1), this can be equivalently represented by
$$\hat\beta = \frac{1}{\lambda}\left(XX^\top/\lambda + I\right)^{-1}Xy = \frac{1}{\lambda}\left(I - X\left(\lambda I + X^\top X\right)^{-1}X^\top\right)Xy = \frac{1}{\lambda}X\left(I - \left(\lambda I + X^\top X\right)^{-1}X^\top X\right)y$$
$$= \frac{1}{\lambda}X\left(\lambda I + X^\top X\right)^{-1}\left[\lambda I + X^\top X - X^\top X\right]y = X\left(X^\top X + \lambda I\right)^{-1}y. \qquad (3.16)$$
In particular, the expression in (3.16) is useful when X is a tall matrix, since the
size of the matrix inversion is much smaller than that of (3.14). Even if this is
not the case, the expression in (3.16) is extremely useful to derive the kernel ridge
regression, which is the main topic in the next section.
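The equivalence of (3.14) and (3.16) is easy to confirm numerically; the second form only requires inverting an n-by-n matrix, which is cheaper when the feature dimension is large. A NumPy sketch (the dimensions, data, and λ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
d, n, lam = 50, 20, 0.5                        # high-dimensional features, few samples
X = rng.standard_normal((d, n))                # columns are data points, as in (3.3)
y = rng.standard_normal(n)

# (3.14): invert the d x d matrix X X^T + lambda I
beta_1 = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)
# (3.16): invert the much smaller n x n matrix X^T X + lambda I
beta_2 = X @ np.linalg.solve(X.T @ X + lam * np.eye(n), y)

print(np.allclose(beta_1, beta_2))             # True
```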
Recall that a nonlinear kernel SVM was developed based on the observation that
the nonlinear decision boundary in the original input space can be often represented
as a linear boundary in the high-dimensional feature space. A similar idea can be
used for regression. Specifically, the goal is to implement the linear regression in
the high-dimensional feature space, but the net result is that the resulting regression
becomes nonlinear in the original space (see Fig. 3.5).
In order to use a kernel trick similar to that used in the kernel SVM, let us revisit
the linear regression problem in (3.2). Using the parameter estimation from the ridge
regression (3.16), the estimated function $\hat f(x)$ for a given independent variable $x\in\mathbb{R}^p$ is given by
$$\hat f(x) := x^\top\hat\beta = x^\top X\left(X^\top X + \lambda I\right)^{-1}y = \begin{bmatrix}\langle x, x_1\rangle & \cdots & \langle x, x_n\rangle\end{bmatrix}\left(\begin{bmatrix}\langle x_1, x_1\rangle & \cdots & \langle x_1, x_n\rangle\\ \vdots & \ddots & \vdots\\ \langle x_n, x_1\rangle & \cdots & \langle x_n, x_n\rangle\end{bmatrix} + \lambda I\right)^{-1}y, \qquad (3.17)$$
where we use
$$x^\top X = \begin{bmatrix}\langle x, x_1\rangle & \cdots & \langle x, x_n\rangle\end{bmatrix}$$
and
$$X^\top X = \begin{bmatrix}x_1^\top\\ \vdots\\ x_n^\top\end{bmatrix}\begin{bmatrix}x_1 & \cdots & x_n\end{bmatrix} = \begin{bmatrix}\langle x_1, x_1\rangle & \cdots & \langle x_1, x_n\rangle\\ \vdots & \ddots & \vdots\\ \langle x_n, x_1\rangle & \cdots & \langle x_n, x_n\rangle\end{bmatrix}.$$
Since everything is represented by the inner product of the input vectors, we can now lift the data $x$ to a feature space using $\varphi(x)$ and compute the inner product in the high-dimensional feature space. Then, using the kernel trick, the inner product in the feature space can be replaced by the kernel:
$$\langle x, x_i\rangle \to k(x, x_i) := \langle\varphi(x), \varphi(x_i)\rangle. \qquad (3.18)$$
Applying the kernel trick to (3.17), the kernel ridge regression estimate is given by
$$\hat f(x) = \begin{bmatrix}k(x, x_1) & \cdots & k(x, x_n)\end{bmatrix}\left(K + \lambda I\right)^{-1}y, \qquad (3.19)$$
where $K$ denotes the kernel Gram matrix
$$K := \begin{bmatrix}k(x_1, x_1) & \cdots & k(x_1, x_n)\\ \vdots & \ddots & \vdots\\ k(x_n, x_1) & \cdots & k(x_n, x_n)\end{bmatrix}. \qquad (3.20)$$
Equivalently, (3.19) can be derived from the following regression problem with a kernel:
$$y_i = \sum_{j=1}^n \alpha_j k(x_i, x_j) + \epsilon_i, \qquad (3.21)$$
which is a nonlinear extension of (3.2). Then, (3.19) is obtained using the following optimization problem:
$$\min_{\alpha\in\mathbb{R}^n}\ \sum_{i=1}^n\left(y_i - \sum_{j=1}^n \alpha_j k(x_i, x_j)\right)^2 + \lambda\,\alpha^\top K\alpha, \qquad (3.22)$$
where $K$ is the kernel Gram matrix in (3.20). This implies that the regularization term should be weighted by the kernel to take into account the deformation in the feature space. A more rigorous derivation of (3.22) is obtained from the so-called representer theorem, which is the topic of the next chapter.
Figure 3.6 shows the examples of linear regression and kernel regression using
the polynomial and radial basis function (RBF) kernels. We can clearly see that
nonlinear kernel regression follows the trend much better.
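A compact NumPy sketch of kernel ridge regression as in (3.19), using a Gaussian (RBF) kernel; the toy data, kernel bandwidth, and λ are assumptions for illustration, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(8)
n, lam, gamma = 40, 1e-2, 10.0
x_train = np.sort(rng.uniform(0, 1, n))
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(n)   # noisy nonlinear trend

def rbf(a, b):
    # Gaussian (RBF) kernel matrix between two sets of scalar inputs
    return np.exp(-gamma * (a[:, None] - b[None, :])**2)

K = rbf(x_train, x_train)                            # kernel Gram matrix, cf. (3.20)
alpha = np.linalg.solve(K + lam * np.eye(n), y_train)

x_test = np.linspace(0, 1, 5)
y_pred = rbf(x_test, x_train) @ alpha                # \hat f(x) as in (3.19)
print(np.c_[x_test, y_pred, np.sin(2 * np.pi * x_test)])   # prediction vs. ground truth
```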
In this section, we will discuss the important issue of the bias and variance trade-off
in regression analysis.
Let {x i , y i }ni=1 denote the training data set, where x i ∈ Rp ⊂ X is an
independent variable and y i ∈ Rp ⊂ Y is a dependent variable that has dependency
on x i . The reason we use the boldface characters x i and y i is that they can be
vectors. In regression analysis, the dependent variable is often represented as a
functional relationship with respect to the independent variable:
$$y_i = f_\Theta(x_i) + \epsilon_i, \qquad (3.23)$$
where $\epsilon_i$ denotes an additive error term that may stand in for unmodeled parts, and $f_\Theta(\cdot)$ is a regression function (which can possibly be a nonlinear function) with the input variable $x_i$ and parameterized by $\Theta$. With a slight abuse of notation, we often use $f := f_\Theta$ when the dependency on the parameter is obvious.
In (3.23), $\Theta$ is the regression parameter set that should be estimated from the training data set. Usually, this parameter set is estimated by minimizing a loss. For
example, one of the most popular loss functions is the $\ell_2$ or MSE loss, in which case the parameter estimation problem is given by
$$
\min_{\Theta}\; \sum_{i=1}^{n}\left\| y_i - f_{\Theta}(x_i)\right\|_2^2. \tag{3.24}
$$
Another popular tool that is often used in regression analysis is regularization. In regularized regression analysis, an additional term is added to impose a constraint on the parameter. More specifically, the following optimization problem is solved to estimate the parameter $\Theta$:
$$
\min_{\Theta}\; \sum_{i=1}^{n}\left\| y_i - f_{\Theta}(x_i)\right\|_2^2 + \lambda R(\Theta), \tag{3.25}
$$
where $R(\Theta)$ and $\lambda$ are often called the regularization function and the regularization parameter, respectively.
With the estimated parameter $\hat\Theta$, the estimated function $\hat f$ is defined as $\hat f := f_{\hat\Theta}$.
Suppose that the noise $\epsilon$ is zero-mean i.i.d. Gaussian with variance $\sigma^2$. Then, the MSE error of the regression problem is given by
$$
\begin{aligned}
E\left\|y - \hat f\right\|^2 &= E\left\|f + \epsilon - \hat f\right\|^2 \\
&= E\left\|f + \epsilon - \hat f + E[\hat f\,] - E[\hat f\,]\right\|^2 \\
&= E\left\|f - E[\hat f\,]\right\|^2 + E\left\|\hat f - E[\hat f\,]\right\|^2 + E\|\epsilon\|^2 \\
&= \left\|f - E[\hat f\,]\right\|^2 + E\left\|\hat f - E[\hat f\,]\right\|^2 + E\|\epsilon\|^2,
\end{aligned}\tag{3.27}
$$
where the third equality comes from the fact that the cross terms vanish, e.g.
$$
E\left[\epsilon^\top\big(f - E[\hat f\,]\big)\right] = 0,
$$
and the fourth equality comes from the fact that $f$ and $E[\hat f\,]$ are deterministic.
Equation (3.27) clearly shows that the MSE expression of the prediction error is composed of bias and variance components. This leads to the so-called bias–variance trade-off in regression problems, which is explained in detail in the following example.
3.6.1 Examples
Here, we will investigate the bias and variance trade-off for the linear regression problem, where the regression function is given by
$$
f(x) = \langle x, \beta\rangle = x^\top \beta. \tag{3.28}
$$
Denoting the expectation operation by $E[\cdot]$, the bias and variance of the OLS estimate in (3.7) can be computed as follows:
$$
\begin{aligned}
\mathrm{Bias}(\hat f\,) &:= x^\top\beta - E[x^\top\hat\beta] \\
&= x^\top\beta - x^\top E\left[(XX^\top)^{-1}Xy\right] \\
&= x^\top\beta - x^\top(XX^\top)^{-1}X E[y] \\
&= x^\top\beta - x^\top(XX^\top)^{-1}XX^\top\beta = 0,
\end{aligned}
$$
and
$$
\begin{aligned}
\mathrm{Var}(\hat f\,) &= E\left[\big(x^\top\hat\beta - E[x^\top\hat\beta]\big)^2\right] \\
&= x^\top(XX^\top)^{-1}X\, E\left[\epsilon\epsilon^\top\right] X^\top(XX^\top)^{-1}x \\
&= \sigma^2\, x^\top(XX^\top)^{-1}x.
\end{aligned}
$$
On the other hand, the bias and variance of the ridge regression estimate in (3.14) are given by
$$
\mathrm{Bias}(\hat f\,) = \lambda\, x^\top(XX^\top + \lambda I)^{-1}\beta,
$$
and
$$
\mathrm{Var}(\hat f\,) = E\left[x^\top(XX^\top + \lambda I)^{-1}X\,\epsilon\epsilon^\top X^\top(XX^\top + \lambda I)^{-1}x\right] = \sigma^2\, x^\top(XX^\top + \lambda I)^{-1}XX^\top(XX^\top + \lambda I)^{-1}x.
$$
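These expressions can be checked with a small Monte Carlo simulation; the sketch below (our own illustrative code with arbitrary dimensions, not from the book) estimates the bias and variance of the OLS and ridge predictions empirically and compares the OLS variance with the formula $\sigma^2 x^\top(XX^\top)^{-1}x$:

```python
import numpy as np

# Monte Carlo sketch of the bias-variance trade-off for OLS vs ridge regression
rng = np.random.default_rng(0)
p, n, sigma, lam = 3, 20, 0.5, 5.0
beta = rng.standard_normal(p)
X = rng.standard_normal((p, n))        # fixed design, columns are x_i
x = rng.standard_normal(p)             # test point

preds_ols, preds_ridge = [], []
for _ in range(20000):
    eps = sigma * rng.standard_normal(n)
    y = X.T @ beta + eps
    b_ols = np.linalg.solve(X @ X.T, X @ y)                       # OLS estimate (3.7)
    b_rdg = np.linalg.solve(X @ X.T + lam * np.eye(p), X @ y)     # ridge estimate (3.14)
    preds_ols.append(x @ b_ols)
    preds_ridge.append(x @ b_rdg)

preds_ols, preds_ridge = np.array(preds_ols), np.array(preds_ridge)
truth = x @ beta
print("OLS   bias ~", truth - preds_ols.mean(),   " var ~", preds_ols.var())
print("ridge bias ~", truth - preds_ridge.mean(), " var ~", preds_ridge.var())
print("theory: OLS var =", sigma**2 * x @ np.linalg.solve(X @ X.T, x))
```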
3.7 Exercises
Patient id 1 2 3 4 5 6 7
x 42 70 45 30 55 25 57
y (mmHg) 98 130 121 88 182 80 125
Trial id 1 2 3 4 5 6 7 8 9 10
Temperature 53 57 58 63 66 67 67 67 68 69
Damaged 5 1 1 1 0 0 0 0 0 1
Undamaged 7 6 5 6 8 8 7 6 5 6
x 11 22 32 41 55 67 78 89 100 50 71 91
y 2330 2750 2309 2500 2100 1120 1010 1640 1931 1705 1751 2002
c. Perform the kernel regression with h = 5, 10 and 15. What do you observe?
6. By directly solving (3.22), derive the kernel regression in (3.17).
7. Show that the variance of the kernel regression in (3.29) increases with decreas-
ing regularization parameter λ.
Chapter 4
Reproducing Kernel Hilbert Space,
Representer Theorem
4.1 Introduction
One of the key concepts in machine learning is the feature space, which is often
referred to as the latent space. A feature space is usually a higher or lower-
dimensional space than the original one where the input data lie (which is often
referred to as the ambient space). Recall that in the kernel SVM, by lifting the data to
a higher-dimensional feature space, one can find a linear classifier that can separate
two different classes of samples (see Fig. 4.1a). Similarly, in kernel regression,
rather than searching for nonlinear functions that can fit the data in the ambient
space, the main idea is to compute a linear regressor in a higher-dimensional feature
space as shown in Fig. 4.1b. On the other hand, in the principal component analysis
(PCA), the input signals are projected on a lower-dimensional feature space using
singular vector decomposition (see Fig. 4.1c).
In this section, we formally define a feature space that has good mathematical
properties. Here, the “good” mathematical properties refer to the well-defined
structure such as existence of the inner product, the completeness, reproducing
properties, etc. In fact, the feature space with these properties is often called the
reproducing kernel Hilbert space (RKHS) [11]. Although the RKHS is only a small
subset of the Hilbert space, its mathematical properties are highly versatile, which
makes the algorithm development simpler.
The RKHS theory has wide applications, including complex analysis, harmonic
analysis, and quantum mechanics. Reproducing kernel Hilbert spaces are particu-
larly important in the field of machine learning theory because of the celebrated
representer theorem [11, 15] which states that every function in an RKHS that
minimizes an empirical risk functional can be written as a linear combination of the
kernel function evaluated at training samples. Indeed, the representer theorem has
played a key role in classical machine learning problems, since it provides a means
to reduce infinite dimensional optimization problems to tractable finite-dimensional
ones.
Fig. 4.1 Example of feature space embedding in (a) kernel SVM, (b) kernel regression, and (c) principal component analysis
In this chapter, we review the RKHS theory and the representer theorem. Then,
we revisit the classifier and regression problems to show how kernel SVM and
regression can be derived from the representer theorem. Then, we discuss the
limitation of the kernel machines. Later we will show how these limitations of kernel
machines can be largely overcome by modern deep learning approaches.
4.2 Reproducing Kernel Hilbert Space (RKHS)

As the theory of the RKHS originates from core mathematics, its rigorous definition is very abstract and often difficult to understand for students working on machine learning applications. Therefore, this section tries to explain the concept from a more machine learning perspective so that students can understand why the RKHS theory has been the main workhorse of classical machine learning theory.

Fig. 4.2 RKHS, Hilbert space, Banach space, and vector space
Before diving into details, the readers are reminded that the RKHS is only a
subset of the Hilbert space as shown in Fig. 4.2, i.e. the Hilbert space is more
general than the RKHS. For the formal definition of the Hilbert space, please refer
to Chap. 1.
For example, a feature mapping we used to explain the kernel SVM was
$$
\varphi(x) = [\varphi_1, \varphi_2, \varphi_3] = \begin{bmatrix} x_1^2 & x_2^2 & \sqrt{2}\,x_1 x_2 \end{bmatrix}, \tag{4.2}
$$
where $X = \mathbb{R}^2$ (see Fig. 4.1a). We also showed that the corresponding kernel is given by
$$
k(x, x') = \sum_{l} \varphi_l(x)\varphi_l(x'),
$$
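As a concrete check (an illustrative snippet of our own, not from the text), one can verify numerically that the inner product of the lifted features equals the degree-2 polynomial kernel $(x^\top x')^2$ for this particular map:

```python
import numpy as np

def phi(x):
    # feature map of (4.2): [x1^2, x2^2, sqrt(2) x1 x2]
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(2), rng.standard_normal(2)

lhs = phi(x) @ phi(xp)        # inner product in the feature space
rhs = (x @ xp) ** 2           # degree-2 polynomial kernel evaluated in the ambient space
print(np.isclose(lhs, rhs))   # True: the kernel is evaluated without the explicit lifting
```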
In the kernel SVM and/or kernel regression, the optimization problem for the design of a classifier and/or regressor is formulated using kernels without ever using the feature map. Then, if we are given a function of two arguments, $k(x, x')$, how can we determine whether it is a valid kernel? To answer this question, we need to check whether there exists a valid feature map. For this, the concept of positive definiteness is important.
Definition 4.2 A symmetric function $k: X \times X \to \mathbb{R}$ is positive definite if for all $n \geq 1$, $(a_1, \cdots, a_n) \in \mathbb{R}^n$, and $(x_1, \cdots, x_n) \in X^n$,
$$
\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j k(x_i, x_j) \geq 0. \tag{4.3}
$$
Although this condition is both necessary and sufficient, the forward direction is more intuitive for understanding why the kernel function should be positive definite. More specifically, if we define the kernel as in (4.1), we have
$$
\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j k(x_i, x_j) = \sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j \left\langle \varphi(x_i), \varphi(x_j)\right\rangle_{\mathcal{H}} = \left\| \sum_{i=1}^{n} a_i \varphi(x_i)\right\|_{\mathcal{H}}^2 \geq 0.
$$
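On a finite sample, this condition simply says that the kernel Gram matrix must be positive semidefinite; a minimal sketch (assuming an RBF kernel and random points, purely for illustration) checks this via its eigenvalues:

```python
import numpy as np

def rbf(x, y, h=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * h ** 2))

rng = np.random.default_rng(0)
pts = rng.standard_normal((30, 4))                       # 30 random points in R^4
K = np.array([[rbf(a, b) for b in pts] for a in pts])    # Gram matrix K[i, j] = k(x_i, x_j)

eigvals = np.linalg.eigvalsh(K)                          # eigenvalues of the symmetric Gram matrix
print(eigvals.min() >= -1e-10)   # all (numerically) non-negative, consistent with (4.3)
```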
With the definition of kernels and feature mapping, we are now ready to define the reproducing kernel Hilbert space. Toward this goal, let us revisit the feature mapping we used to explain the kernel SVM:
$$
\varphi(x) = [\varphi_1, \varphi_2, \varphi_3] = \begin{bmatrix} x_1^2 & x_2^2 & \sqrt{2}\,x_1 x_2 \end{bmatrix}.
$$
Using this feature map, a function in the feature space can be written as
$$
f(x) = \sum_{l=1}^{3} f_l \varphi_l(x) = f_1 x_1^2 + f_2 x_2^2 + f_3 \sqrt{2}\, x_1 x_2,
$$
where the feature map φ(x) is often called the point evaluation function at x in the
RKHS literature.
Now, the key ingredient of the RKHS is that rather than considering all of the Hilbert space $\mathcal{H}$, we consider its subset $\mathcal{H}_\varphi$ (recall Fig. 4.2) that is generated by the evaluation function $\varphi$. More specifically, for all $f(\cdot) \in \mathcal{H}_\varphi$ there exists a set $\{x_i\}_{i=1}^n$, $x_i \in X$, such that
$$
f(\cdot) = \sum_{i=1}^{n} \alpha_i \varphi(x_i). \tag{4.5}
$$
Then, for any $x \in X$, the function value can be evaluated through the inner product:
$$
f(x) = \langle f(\cdot), \varphi(x)\rangle_{\mathcal{H}} = \sum_{i=1}^{n} \alpha_i \left\langle \varphi(x_i), \varphi(x)\right\rangle_{\mathcal{H}} = \sum_{i=1}^{n} \alpha_i k(x_i, x). \tag{4.6}
$$
As a special case, we can easily see that the coordinate of a kernel in the feature space, $k(x', \cdot)$ for a given $x' \in X$, lives in an RKHS $\mathcal{H}_\varphi$, since we have
$$
\langle k(x', \cdot), \varphi(x)\rangle_{\mathcal{H}} = \langle \varphi(x'), \varphi(x)\rangle_{\mathcal{H}} = k(x', x),
$$
where the last equality comes from the definition of a kernel. Therefore, we can see that
$$
\langle f, g\rangle_{\mathcal{H}} = \sum_{i=1}^{r}\sum_{j=1}^{s} \alpha_i \beta_j \left\langle k(x_i, \cdot), k(x_j, \cdot)\right\rangle_{\mathcal{H}} \tag{4.11}
$$
$$
= \sum_{i=1}^{r}\sum_{j=1}^{s} \alpha_i \beta_j k(x_i, x_j). \tag{4.12}
$$
Accordingly, the norm and the inner product in $\mathcal{H}$ are given by
$$
\|f\|_{\mathcal{H}} = \sqrt{\langle f, f\rangle_{\mathcal{H}}} = \sqrt{\sum_{i=1}^{r}\sum_{j=1}^{r} \alpha_i \alpha_j k(x_i, x_j)}, \tag{4.13}
$$
$$
\langle f, g\rangle_{\mathcal{H}} = \sum_{i=1}^{r}\sum_{j=1}^{s} \alpha_i \beta_j k(x_i, x_j), \tag{4.14}
$$
where $f(\cdot) = \sum_{i=1}^{r} \alpha_i k(x_i, \cdot)$ and $g(\cdot) = \sum_{i=1}^{s} \beta_i k(x_i, \cdot)$.
From the (classical) machine learning perspective, the most important reason to
use the RKHS is Eq. (4.5), which states that the feature map of the target function
can be represented as a linear span of {k(x, ·) : x ∈ X} or, equivalently, {φ(x) : x ∈
X}. This implies that as long as we have a sufficient number of training data, we can
estimate the target function by estimating its feature space coordinates.
In fact, one of the important breakthroughs of the modern neural network
approach is to relax the assumption that the feature map of the target function should
be represented as a linear span. This issue will be discussed in detail later.
4.3 Representer Theorem

Given the definition of kernels and the RKHS, the representer theorem is a simple
consequence. Recall that in machine learning problems, the loss is defined as the
error energy between the actual target and the estimated one. For example, in the
linear regression problem, the MSE loss for the given training data {x i , yi }ni=1 is
defined by
$$
\ell_2\left(\{x_i, y_i, f(x_i)\}_{i=1}^n\right) = \sum_{i=1}^{n}\left| y_i - f(x_i)\right|^2, \tag{4.15}
$$
where
$$
f(x_i) = \langle x_i, \beta\rangle,
$$
with β being the unknown parameter to estimate. In the soft-margin SVM, the loss
is given by the hinge loss:
$$
\ell_{\mathrm{hinge}}\left(\{x_i, y_i, f(x_i)\}_{i=1}^n\right) = \sum_{i=1}^{n} \max\{0,\; 1 - y_i f(x_i)\}, \tag{4.16}
$$
where
$$
f(x_i) = \langle w, x_i\rangle + b,
$$
with w and b denoting the parameters to estimate. For the general loss function, the
celebrated representer theorem is given as follows:
Theorem 4.1 [11, 15] Consider a positive definite real-valued kernel k : X × X →
R on a non-empty set X with the corresponding RKHS Hk . Let there be given
training data set {x i , yi }ni=1 with x i ∈ X and yi ∈ R and a strictly increasing real-
valued regularization function R : [0, ∞) → R. Then, for arbitrary loss function
$\ell\left(\{x_i, y_i, f(x_i)\}_{i=1}^n\right)$, any minimizer of the following optimization problem:
$$
f^* = \arg\min_{f \in \mathcal{H}_k}\; \ell\left(\{x_i, y_i, f(x_i)\}_{i=1}^n\right) + R\left(\|f\|_{\mathcal{H}}\right) \tag{4.17}
$$
admits a representation of the form
$$
f^*(\cdot) = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot) = \sum_{i=1}^{n} \alpha_i \varphi(x_i), \tag{4.18}
$$
or, equivalently,
$$
f^*(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x). \tag{4.19}
$$
The proof of the representer theorem can easily be found in the standard machine
learning textbook [11], so we do not revisit it here. Instead, we briefly touch upon the
main idea of the proof, since it also highlights the limitations of kernel machines.
Specifically, the feature space coordinate of the minimizer $f^*$, denoted by $f^*(\cdot)$, can be decomposed into a linear combination of the feature maps from the training data $\{\varphi(x_i)\}_{i=1}^n$ plus a component in their orthogonal complement. But when we perform the point evaluation with $\{\varphi(x_i)\}_{i=1}^n$ using the inner product during the training phase, the contribution from the orthogonal complement disappears, which leads to the final form in (4.18).
4.4 Application of Representer Theorem

In this section, we revisit the kernel SVM and regression to show how the representer theorem can simplify the derivation.
Recall that the ridge regression was given by the following optimization problem:
$$
\min_{\beta}\; \sum_{i=1}^{n} \left| y_i - \langle x_i, \beta\rangle\right|^2 + \lambda \|\beta\|^2.
$$
Using the kernel trick, the corresponding optimization problem in the RKHS can be written as
$$
\min_{f \in \mathcal{H}_k}\; \sum_{i=1}^{n} \left| y_i - f(x_i)\right|^2 + \lambda \|f\|_{\mathcal{H}}^2, \tag{4.20}
$$
where Hk is the RKHS with the positive definite kernel k. From Theorem 4.1, we
know that the minimizer should have the form
$$
f(\cdot) = \sum_{j=1}^{n} \alpha_j \varphi(x_j). \tag{4.21}
$$
Then, the data fidelity term can be computed as
$$
\begin{aligned}
\sum_{i=1}^{n}\left| y_i - f(x_i)\right|^2 &= \sum_{i=1}^{n}\left| y_i - \langle f(\cdot), \varphi(x_i)\rangle\right|^2 \\
&= \sum_{i=1}^{n}\Big| y_i - \sum_{j=1}^{n} \alpha_j \langle \varphi(x_j), \varphi(x_i)\rangle\Big|^2 \\
&= \sum_{i=1}^{n}\Big| y_i - \sum_{j=1}^{n} \alpha_j k(x_j, x_i)\Big|^2 \\
&= \|y - K\alpha\|^2,
\end{aligned}
$$
where $K$ is the kernel Gram matrix
$$
K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n)\end{bmatrix}, \tag{4.22}
$$
and
$$
y = \begin{bmatrix} y_1 & \cdots & y_n \end{bmatrix}^\top, \quad \alpha = \begin{bmatrix} \alpha_1 & \cdots & \alpha_n\end{bmatrix}^\top. \tag{4.23}
$$
Similarly, the regularization term is given by
$$
\|f\|_{\mathcal{H}}^2 = \langle f(\cdot), f(\cdot)\rangle_{\mathcal{H}} = \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j \langle \varphi(x_i), \varphi(x_j)\rangle_{\mathcal{H}} = \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j k(x_i, x_j) = \alpha^\top K \alpha.
$$
Therefore, (4.20) reduces to the finite-dimensional optimization problem
$$
\min_{\alpha}\; \|y - K\alpha\|^2 + \lambda\, \alpha^\top K \alpha.
$$
The problem is convex, so using the first-order necessary condition we have
$$
(K^2 + \lambda K)\hat\alpha = K y,
$$
which, assuming $K$ is invertible, leads to
$$
\hat\alpha = (K + \lambda I)^{-1} y.
$$
Accordingly, the resulting regression function is given by
$$
\hat f(x) = \sum_{i=1}^{n} \alpha_i \langle \varphi(x_i), \varphi(x)\rangle = \begin{bmatrix} k(x_1, x) & \cdots & k(x_n, x)\end{bmatrix}(K + \lambda I)^{-1} y.
$$
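The derivation translates directly into code; the following sketch (our own NumPy implementation with an assumed RBF kernel and toy data, not the book's reference code) computes $\hat\alpha = (K + \lambda I)^{-1}y$ and evaluates the resulting regressor:

```python
import numpy as np

def kernel(a, b, h=0.7):
    # RBF kernel: one common choice of a positive definite kernel
    return np.exp(-np.sum((a - b) ** 2) / (2 * h ** 2))

rng = np.random.default_rng(0)
x_train = np.linspace(-3, 3, 25)[:, None]
y_train = np.tanh(x_train[:, 0]) + 0.1 * rng.standard_normal(25)

lam = 0.05
K = np.array([[kernel(a, b) for b in x_train] for a in x_train])
alpha = np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)   # alpha = (K + lam I)^{-1} y

def f_hat(x):
    # f_hat(x) = [k(x_1, x), ..., k(x_n, x)] (K + lam I)^{-1} y
    kx = np.array([kernel(xi, x) for xi in x_train])
    return kx @ alpha

print(f_hat(np.array([0.5])), np.tanh(0.5))   # the estimate should be close to the clean trend
```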
Recall that the soft-margin SVM formulation (without bias) can be represented by
$$
\min_{w}\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} \ell_{\mathrm{hinge}}\left(y_i, \langle w, x_i\rangle\right), \tag{4.25}
$$
This problem can be solved using the representer theorem. Specifically, an extended formulation of (4.25) in the RKHS is given by
$$
\min_{f \in \mathcal{H}_k}\; \frac{1}{2}\|f\|_{\mathcal{H}}^2 + C\sum_{i=1}^{n} \ell_{\mathrm{hinge}}\left(y_i, f(x_i)\right), \tag{4.27}
$$
where, by Theorem 4.1, the minimizer has the form
$$
f(\cdot) = \sum_{j=1}^{n} \alpha_j k(x_j, \cdot). \tag{4.28}
$$
Then, the hinge loss becomes
$$
\ell_{\mathrm{hinge}}\left(y_i, f(x_i)\right) = \max\Big\{0,\; 1 - y_i \sum_{j=1}^{n} \alpha_j k(x_j, x_i)\Big\}, \tag{4.29}
$$
and
$$
\|f\|_{\mathcal{H}}^2 = \alpha^\top K \alpha,
$$
where $K$ is the kernel Gram matrix in (4.22). Now, (4.27) can be represented in a constrained form:
$$
\begin{aligned}
\min_{\alpha, \xi} \quad & \frac{1}{2}\alpha^\top K \alpha + C\sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & 1 - y_i \sum_{j=1}^{n} \alpha_j k(x_j, x_i) \leq \xi_i, \tag{4.30}\\
& \xi_i \geq 0, \quad \forall i.
\end{aligned}
$$
For the given primal problem in (4.30), the corresponding Lagrangian dual is given
by
$$
\begin{aligned}
\max_{\lambda, \gamma} \quad & g(\lambda, \gamma) \\
\text{subject to} \quad & \lambda \geq 0,\; \gamma \geq 0, \tag{4.31}
\end{aligned}
$$
where
$$
g(\lambda, \gamma) = \min_{\alpha, \xi}\; \Big\{\frac{1}{2}\alpha^\top K \alpha + C\sum_{i=1}^{n}\xi_i + \sum_{i=1}^{n}\lambda_i\Big(1 - y_i\sum_{j=1}^{n}\alpha_j k(x_j, x_i) - \xi_i\Big) - \sum_{i=1}^{n}\gamma_i \xi_i\Big\}, \tag{4.32}
$$
or, equivalently,
$$
g(\lambda, \gamma) = \min_{\alpha, \xi}\; \Big\{\frac{1}{2}\alpha^\top K \alpha + \sum_{i=1}^{n}\big(\lambda_i(1 - \xi_i) + (C - \gamma_i)\xi_i\big) - r^\top K \alpha\Big\}, \tag{4.33}
$$
where
$$
r = \begin{bmatrix} y_1\lambda_1 & \cdots & y_n\lambda_n\end{bmatrix}^\top.
$$
The first-order optimality conditions with respect to $\alpha$ and $\xi$ lead to the following equations:
$$
K\alpha = Kr \;\Longrightarrow\; \alpha = r, \tag{4.34}
$$
and
$$
\lambda_i + \gamma_i = C. \tag{4.35}
$$
Plugging these back into (4.33), we have
$$
g(\lambda, \gamma) = \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\lambda_i\lambda_j y_i y_j k(x_i, x_j).
$$
Since $\alpha = r$, the resulting classifier is given by
$$
f(x) = \sum_{j=1}^{n} y_j\lambda_j k(x_j, x). \tag{4.36}
$$
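The dual (4.31) with the constraint (4.35) is a box-constrained quadratic program in $\lambda$; a minimal sketch (our own projected gradient ascent on toy data, not the book's algorithm) solves it and classifies with (4.36):

```python
import numpy as np

def rbf(a, b, h=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * h ** 2))

# toy 2-class data (assumed for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1, (20, 2)), rng.normal(1.5, 1, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

C, n = 1.0, len(y)
K = np.array([[rbf(a, b) for b in X] for a in X])
Q = (y[:, None] * y[None, :]) * K          # Q[i, j] = y_i y_j k(x_i, x_j)

# projected gradient ascent on g(lam) = sum(lam) - 0.5 lam^T Q lam with 0 <= lam_i <= C
lam = np.zeros(n)
step = 1.0 / np.linalg.norm(Q, 2)
for _ in range(2000):
    lam += step * (1.0 - Q @ lam)          # gradient of the dual objective
    lam = np.clip(lam, 0.0, C)             # project onto the box constraints

def f(x):
    # classifier (4.36): f(x) = sum_j y_j lam_j k(x_j, x)
    return sum(yj * lj * rbf(xj, x) for xj, yj, lj in zip(X, y, lam))

preds = np.sign([f(x) for x in X])
print("training accuracy:", np.mean(preds == y))
```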
4.5 Pros and Cons of Kernel Machines

The kernel machine has many important advantages that deserve further discussion. This approach is based on the beautiful theory of the RKHS, which leads to closed-form solutions in designing classifiers and regressors thanks to the representer theorem. Therefore, the classical research issue is not about the machine learning algorithm itself, but rather about finding the feature space embedding that can effectively represent the data in the ambient space.
Having said this, there are several limitations associated with classical kernel machines. First, what enables a closed-form solution via the representer theorem is the assumption that the feature space forms an RKHS. This implies that the mapping from the feature space to the final function is assumed to be linear. This approach is somewhat unbalanced, given that only the mapping from the ambient space to the feature space is nonlinear, whereas the feature space representation is linear. Moreover, as discussed before, the RKHS is only a subset of the underlying Hilbert space; therefore, restricting the feature space to the RKHS severely reduces the available function class compared to the underlying Hilbert space (see Fig. 4.2). As such, it limits the flexibility of the learning algorithm and the resulting expressiveness.
Finally, the feature mapping and the associated kernel in the classical machine learning approach are primarily selected in a top-down manner based on human intuition or mathematical modeling, leaving no room for components that can be automatically learned from the data. In fact, the learning part of a kernel machine is restricted to the linear weighting parameters in the representer (i.e. the $\alpha_i$'s in (4.18)), whereas the feature map itself is deterministic once the kernel is selected in a top-down manner. This significantly limits the capability of learning. Later, we will investigate how this limitation of the kernel machine can be mitigated by modern deep learning approaches.
4.6 Exercises
$k(x, y) = (x^\top y)^p$.
$k(x, y) = (x^\top y + 1)^p$.
e. Sigmoid kernel:
$\tanh(\eta\, x^\top y + \nu)$.
2. Let k1 and k2 be two positive definite kernels on a set X, and α, β two positive
scalars. Show that αk1 + βk2 is positive definite.
3. Let k1 be a positive definite kernel on a set X. Then, for any polynomial p(·)
with non-negative coefficients, show that the following is also a positive definite
kernel on a set X:
a. Show that
b. Show that dX (x, y) is not a metric. Which property of the metric does it
violate?
8. Define the mean of the feature space
$$
\mu_\varphi = \frac{1}{n}\sum_{i=1}^{n}\varphi(x_i).
$$
a. Show that
$$
\|\mu_\varphi\|_{\mathcal{H}}^2 = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} k(x_i, x_j).
$$
b. Show that
$$
\sigma_\varphi^2 := \frac{1}{n}\sum_{i=1}^{n}\left\|\varphi(x_i) - \mu_\varphi\right\|_{\mathcal{H}}^2 = \frac{1}{n}\mathrm{Tr}(K) - \|\mu_\varphi\|_{\mathcal{H}}^2,
$$
where $\mathrm{Tr}(\cdot)$ denotes the matrix trace, and $K$ is the kernel Gram matrix
$$
K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n)\end{bmatrix}.
$$
9. The kernel SVM formulation in (4.27) is often called the 1-SVM. In this problem, we are interested in obtaining the 2-SVM, which is defined by
$$
\min_{f \in \mathcal{H}_k}\; \frac{1}{2}\|f\|_{\mathcal{H}}^2 + C\sum_{i=1}^{n} \ell_{\mathrm{2hinge}}\left(y_i, f(x_i)\right),
$$
where
$$
\ell_{\mathrm{2hinge}}(y, \hat y) = \left(\max\{0,\; 1 - y\hat y\}\right)^2.
$$
Write the primal and dual problems associated with the 2-SVM, and compare the result with the 1-SVM.
10. Consider the following kernel regression problem:
$$
\min_{f \in \mathcal{H}_k}\; \frac{1}{2}\|f\|_{\mathcal{H}}^2 + C\sum_{i=1}^{n} \ell_{\mathrm{logit}}\left(y_i, f(x_i)\right).
$$
Write the dual problems and find the solution as simply as possible.
Part II
Building Blocks of Deep Learning
“I get very excited when we discover a way of making neural networks better and
when that’s closely related to how the brain works.”
– Geoffrey Hinton
Chapter 5
Biological Neural Networks
5.1 Introduction
5.2 Neurons
A typical neuron consists of a cell body (soma), dendrites, and a single axon (see
Fig. 5.1). The axon and dendrites are filaments that extrude from the cell body.
Dendrites typically branch heavily and extend a few hundred micrometers from the
soma. The axon leaves the soma at the axon hillock and can extend up to 1 m in humans, or even more in other species. The end branches of an axon are called telodendria. At
the extreme tip of the axon’s branches are synaptic terminals, where the neuron can
transmit a signal to another cell via the synapse.
The endoplasmic reticulum (ER) in the soma performs many general func-
tions, including folding protein molecules and transporting synthesized proteins
in vesicles to the Golgi apparatus. Proteins synthesized in the ER are packaged
into vesicles, which then fuse with the Golgi apparatus. These cargo proteins are
modified in the Golgi apparatus and destined for secretion via exocytosis or for use
in the cell as shown in Fig. 5.2.
Fig. 5.2 ER and Golgi apparatus for protein synthesis and transport
Although electrical synapses allow for fast electric signal transmission [18, 19], chemical synapses, which transmit the action potential via neurotransmitters, are the most common and are of great interest for artificial neural networks.
As shown in Fig. 5.3, in a chemical synapse, electrical activity in the presynaptic
neuron is converted into the release of neurotransmitters that bind to receptors
located in the membrane of the postsynaptic cell. The neurotransmitters are usually
packaged in a synaptic vesicle, as shown in Fig. 5.3. Therefore, the amount of the
actual neurotransmitter at the postsynaptic terminal is an integer multiple of the
number of neurotransmitters in each vesicle, so this phenomenon is often referred to
as quantal release. The release is regulated by a voltage-dependent calcium channel.
The released neurotransmitter then binds to the receptors on the postsynaptic
dendrites, which can trigger an electrical response that can produce excitatory
postsynaptic potentials (EPSPs) or inhibitory postsynaptic potentials (IPSPs).
Fig. 5.3 Chemical synapse between presynaptic terminal and postsynaptic dendrite
The axon hillock (see Fig. 5.1) is a specialized part of the cell body that is
connected to the axon. Both IPSPs and EPSPs are summed in the axon hillock and
once a trigger threshold is exceeded, an action potential propagates through the rest
of the axon. This switching behavior of the axon hillock plays a very important role
in the information processing of neural networks, as will be discussed in detail later
in Chap. 6.
During LTP, additional receptors are fused to the membrane by exocytosis, which are then moved to the postsynaptic dendrite by lateral diffusion within the membrane. On the other hand, in the case of LTD, some of the redundant receptors are moved into the endocytosis region by lateral diffusion within the membrane, and are then absorbed by the cell via endocytosis.
Because of the dynamics of learning and synaptic plasticity, it becomes clear that
the trafficking of these receptors is an important mechanism to meet the demand
and supply of the receptors at various synaptic locations in the neuron. These trafficking mechanisms are being intensively researched by neurobiologists. For
example, assembled receptors leave the endoplasmic reticulum (ER) and reach the
neural surface via the Golgi network. Packets of nascent receptors are transported
along microtubule tracks from the cell body to synaptic sites through microtubule
networks. Figure 5.5 shows critical steps in receptor assembly, transport, intracellu-
lar trafficking, slow release and insertion at synapses.
One of the most mysterious features of the brain is the emergence of higher-
level information processing from the connections of neurons. To understand this
emergent property, one of the most extensively studied biological neural networks
is the visual system. Therefore, in this section we review the information processing
in the visual system.
The visual system is a part of the central nervous system that enables organisms to
process visual detail as eyesight. It detects and interprets information from visible
light to create a representation of the environment. The visual system performs a
number of complex tasks, from capturing light to identifying and categorizing visual
objects.
As shown in Fig. 5.6, the reflected light from objects shines on the retina. The
retina uses photoreceptors to convert this image into electrical impulses. The optic
nerve then carries these impulses through the optic canal. Upon reaching the optic
chiasm, the nerve fibers decussate (left becomes right). Most of the optic nerve
fibers terminate in the lateral geniculate nucleus (LGN). The LGN forwards the
impulses to V1 of the visual cortex. The LGN also sends some fibers to V2 and V3.
V1 performs edge detection to understand spatial organization. V1 also creates a
bottom-up saliency map to guide attention.
5.3 Biological Neural Network
One of the most important discoveries of Hubel and Wiesel [20] is the hierarchical
visual information flow in the primary visual cortex. Specifically, by examining the
primary visual cortex of cats, Hubel and Wiesel found two classes of functional
cells in the primary visual cortex: simple cells and complex cells. More specifically,
simple cells at V1 L4 respond best to edge-like stimuli with a certain orientation,
position and phase within their relatively small receptive fields (Fig. 5.7). They
realized that such a response of the simple cells could be obtained by pooling the
activity of a small set of input cells with the same receptive field that is observed in
LGN cells. They also observed that complex cells at V1 L2/L3, although selective
for oriented bars and edges too, tend to have larger receptive fields and have some
tolerance with regard to the exact position within their receptive fields. Hubel and
Wiesel found that position tolerance at the complex cell level could be obtained
by grouping simple cells at the level below with the same preferred orientation but
slightly different positions. As will be discussed later, the operation of pooling LGN
cells with the same receptive field is similar to the convolution operation, which
inspired Yann LeCun to invent the convolutional neural network for handwritten zip
code identification [21].
The extension of these ideas from the primary visual cortex to higher areas
of the visual cortex led to a class of object recognition models, the feedforward
hierarchical models [22]. Specifically, as shown in Fig. 5.8, as we go from V1
to TE, the size of the receptive field increases and the latency for the response
increases. This implies that there is a neuronal connection along this path, which
forms a neuronal hierarchy. A more surprising finding is that as we go along this
pathway, neurons become sensitive to increasingly complex inputs while being insensitive to transforms.
The study involved eight epilepsy patients who were temporarily implanted with
a single cell recording device to monitor the activity of brain cells in the medial
temporal lobe (MTL). The medial temporal lobe contains a system of anatomically
related structures that are essential for declarative memory (conscious memory for
facts and events). The system consists of the hippocampal region (Cornu Ammonis
(CA) fields, dentate gyrus, and subicular complex) and the adjacent perirhinal,
entorhinal, and parahippocampal cortices (see Fig. 5.9).
During the single cell recording, the authors in [23] noticed a strange pattern in the medial temporal lobe (MTL) of the brain in one of their participants. Every time the patient saw a picture of Jennifer Aniston, a specific neuron in the brain fired. When they showed the words “Jennifer Aniston,” it would fire again. They tried to summon Jennifer Aniston in other ways, and each time the neuron fired. The conclusion was inevitable: for this particular person, there was a single neuron that embodied the concept of Jennifer Aniston.
The experiment showed that individual neurons in the MTL respond to the
faces of certain people. The researchers say that these types of cell are involved
in sophisticated aspects of visual processing, such as identifying a person, rather
than just a simple shape. This observation leads to a fundamental question: can a
single neuron embody a single concept? Although this issue will be investigated
thoroughly throughout the book, the short answer is “no” because it is not the single
neuron in isolation, but a neuron from a densely connected neural network that can
extract the high-level concept.
5.4 Exercises
Chapter 6
Artificial Neural Networks and Backpropagation

6.1 Introduction
6.2 Artificial Neural Networks

6.2.1 Notation

First, the boldface lowercase letters with the subscript $n$ denote quantities associated with the $n$-th training data, for example,
$$
x_n,\; y_n,\; \{x_n, y_n\}_{n=1}^{N},\; o_n,\; g_n.
$$
Second, with a slight abuse of notation, the subscripts $i$ and $j$ for the lightface lowercase letters denote the $i$-th and $j$-th elements of a vector: for example, $o_i$ is the $i$-th element of the vector $o \in \mathbb{R}^d$:
$$
o_i = [o]_i, \quad\text{or}\quad o = \begin{bmatrix} o_1 & \cdots & o_d\end{bmatrix}^\top.
$$
Similarly, the double index $ij$ indicates the $(i,j)$ element of a matrix: for example, $w_{ij}$ is the $(i,j)$-th element of a matrix $W \in \mathbb{R}^{p\times q}$:
$$
w_{ij} = [W]_{i,j} \quad\text{or}\quad W = \begin{bmatrix} w_{11} & \cdots & w_{1q} \\ \vdots & \ddots & \vdots \\ w_{p1} & \cdots & w_{pq}\end{bmatrix}.
$$
This index notation is often used to refer to the $i$-th or $j$-th neuron in each layer of a neural network. To avoid potential confusion, the $i$-th element of the $n$-th training data vector $x_n$ is referred to as $(x_n)_i$. Next, to denote the $l$-th layer, the superscript notation $(l)$ is used, as in $o^{(l)}$ and $W^{(l)}$. Accordingly, by combining the training index $n$, $g_n^{(l)}$, for example, refers to the $l$-th layer $g$ vector for the $n$-th training data. Finally, the $t$-th update using an optimizer such as the stochastic gradient method is denoted by $[t]$: for example,
$$
\Theta[t],\; V[t].
$$
Consider a typical biological neuron in Fig. 6.1 and its mathematical diagram in Fig. 6.2. Let $o_j$, $j = 1, \cdots, d$, denote the presynaptic potential from the $j$-th dendritic synapse. For mathematical simplicity, we assume that the potentials occur synchronously and arrive simultaneously at the axon hillock. At the axon hillock, they are summed together, and an action potential is fired if the summed signal is greater than a specific threshold value. This process can be mathematically modeled as
$$
\mathrm{net}_i = \sigma\Big(\sum_{j=1}^{d} w_{ij} o_j + b_i\Big), \tag{6.1}
$$
where neti denotes the action potential arriving at the i-th synaptic terminal of the
telodendria, and bi is the bias term for the nonlinearity σ (·) at the axon hillock. Note
that $w_{ij}$ is the weight parameter determined by synaptic plasticity, and that positive values imply that $w_{ij} o_j$ are excitatory postsynaptic potentials (EPSPs), whereas negative weights correspond to inhibitory postsynaptic potentials (IPSPs).
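A direct transcription of (6.1) into code (an illustrative sketch of our own; the sigmoid nonlinearity and the numbers are arbitrary choices) is:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neuron(o, w, b, activation=sigmoid):
    """Single artificial neuron as in (6.1): weighted sum of inputs plus bias, then nonlinearity."""
    return activation(np.dot(w, o) + b)

o = np.array([0.2, -1.0, 0.5])      # presynaptic inputs o_j
w = np.array([1.5, -0.3, 2.0])      # synaptic weights (positive ~ EPSP, negative ~ IPSP)
b = -0.5                            # bias at the axon hillock
print(neuron(o, w, b))
```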
In artificial neural networks (ANNs), the nonlinearity σ (·) in (6.1) is modeled
in various ways as shown in Fig. 6.3. This nonlinearity is often called the activation function. Nonlinearity is perhaps the most important feature of neural networks, since learning and adaptation never happen without nonlinearity. The mathematical proof of this argument is somewhat complicated, so the discussion will be deferred to later.
Among the various forms of the activation functions, one of the most successful ones in modern deep learning is the rectified linear unit (ReLU) [24], which is defined as
$$
\mathrm{ReLU}(x) = \max\{0, x\}. \tag{6.2}
$$
The ReLU activation function is called active when the output is nonzero. It is believed that the non-vanishing gradient in the positive range contributed to the success of modern deep learning. Specifically, we have
$$
\frac{\partial\,\mathrm{ReLU}(x)}{\partial x} = \begin{cases} 1, & \text{if } x > 0, \\ 0, & \text{otherwise},\end{cases} \tag{6.3}
$$
which shows that the gradient is always 1 whenever the ReLU is active. Note that we set the gradient to 0 at $x = 0$ by convention, since the ReLU is not differentiable at $x = 0$.
In evaluating the activation function $\sigma(x)$, the gain function, which refers to the input/output ratio, is also useful:
$$
\gamma(x) := \frac{\sigma(x)}{x}, \quad x \neq 0. \tag{6.4}
$$
Biological neural networks are composed of multiple neurons that are connected
to each other. This connection can have complicated topology, such as recurrent
connection, asynchronous connection, inter-neurons, etc.
One of the simplest forms of neural network connection is the multi-layer feedforward neural network, as shown in Figs. 6.4 and 6.8. Specifically, let $o_j^{(l-1)}$ denote the $j$-th output of the $(l-1)$-th layer neuron, which is given as the $j$-th dendrite presynaptic potential input for the $l$-th layer neuron, and let $w_{ij}^{(l)}$ correspond to the synaptic weights at the $l$-th layer. Then, by extending the model in (6.1), we have
$$
o_i^{(l)} = \sigma\Big(\sum_{j=1}^{d^{(l-1)}} w_{ij}^{(l)} o_j^{(l-1)} + b_i^{(l)}\Big), \tag{6.6}
$$
for $i = 1, \cdots, d^{(l)}$, where $d^{(l-1)}$ denotes the number of dendrites of the $l$-th layer neuron. This can be represented in a matrix form:
$$
o^{(l)} = \sigma\left(W^{(l)} o^{(l-1)} + b^{(l)}\right), \tag{6.7}
$$
where $W^{(l)} \in \mathbb{R}^{d^{(l)}\times d^{(l-1)}}$ is the weight matrix whose $(i,j)$ elements are given by $w_{ij}^{(l)}$, $\sigma(\cdot)$ denotes the nonlinearity applied element-wise to the vector, and
$$
o^{(l)} = \begin{bmatrix} o_1^{(l)} & \cdots & o_{d^{(l)}}^{(l)}\end{bmatrix}^\top \in \mathbb{R}^{d^{(l)}}, \tag{6.8}
$$
$$
b^{(l)} = \begin{bmatrix} b_1^{(l)} & \cdots & b_{d^{(l)}}^{(l)}\end{bmatrix}^\top \in \mathbb{R}^{d^{(l)}}. \tag{6.9}
$$
Another way to simplify the multilayer representation is to use the hidden nodes from the linear layers in between. Specifically, an $L$-layer feedforward neural network can be represented recursively using the hidden node $g^{(l)}$ by
$$
o^{(l)} = \sigma\left(g^{(l)}\right), \qquad g^{(l)} = W^{(l)} o^{(l-1)} + b^{(l)}, \tag{6.10}
$$
for $l = 1, \cdots, L$.
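The recursion (6.10) maps directly to a loop over layers; a minimal sketch (our own code, assuming ReLU activations and random weights) is:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, weights, biases, sigma=relu):
    """L-layer feedforward network: g^(l) = W^(l) o^(l-1) + b^(l), o^(l) = sigma(g^(l))."""
    o = x
    for W, b in zip(weights, biases):
        g = W @ o + b
        o = sigma(g)
    return o

rng = np.random.default_rng(0)
dims = [4, 8, 8, 2]                                   # d^(0), d^(1), d^(2), d^(3)
weights = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(3)]
biases = [np.zeros(dims[l + 1]) for l in range(3)]
print(forward(rng.standard_normal(4), weights, biases))
```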
6.3 Artificial Neural Network Training

The goal of neural network training is to estimate the parameter set $\Theta$ by minimizing a cost function:
$$
\hat\Theta = \arg\min_{\Theta}\; c(\Theta), \tag{6.11}
$$
where
$$
c(\Theta) := \sum_{n=1}^{N} \ell\left(y_n, f_{\Theta}(x_n)\right). \tag{6.12}
$$
Here, $\ell(\cdot, \cdot)$ denotes a loss function, and $f_{\Theta}(x_n)$ is a regression function with the input $x_n$, which is parameterized by the parameter set $\Theta$.
For the case of an $L$-layer feedforward neural network, the regression function $f_{\Theta}(x_n)$ in (6.12) can be represented by
$$
f_{\Theta}(x_n) := \left(\sigma \circ g^{(L)} \circ \sigma \circ g^{(L-1)} \circ \cdots \circ g^{(1)}\right)(x_n), \tag{6.13}
$$
Fig. 6.5 Examples of cost functions for a 1-D optimization problem: (a) both local and global minimizers exist, (b) only a single global minimizer exists, (c) multiple global minimizers exist
where the parameter set $\Theta$ is composed of the synaptic weights and biases for each layer:
$$
\Theta = \begin{bmatrix} W^{(1)}, b^{(1)} \\ \vdots \\ W^{(L)}, b^{(L)} \end{bmatrix}. \tag{6.14}
$$
6.3.2 Optimizers
In view of the parameterized neural network in (6.13), the key question is how the
minimizers for the optimization problem (6.11) can be found. As already mentioned,
the main technical challenge of this minimization problem is that there are many
local minimizers, as shown in Fig. 6.5a. Another tricky issue is that sometimes
there are many global minimizers, as shown in Fig. 6.5c. Although all the global
minimizers can be equally good in the training phase, each global minimizer
may have different generalization performance in the test phase. This issue is
important and will be discussed later. Furthermore, different global minimizers can
be achieved depending on the specific choice of an optimizer, which is often called
the implicit bias or inductive bias of an optimization algorithm. This topic will also
be discussed later.
One of the most important observations in designing optimization algorithms is
that the following first-order necessary condition (FONC) holds at local minimizers.
Lemma 6.1 Let $c: \mathbb{R}^p \to \mathbb{R}$ be a differentiable function. If $\Theta^*$ is a local minimizer, then
$$
\left.\frac{\partial c}{\partial \Theta}\right|_{\Theta = \Theta^*} = 0. \tag{6.15}
$$
Indeed, various optimization algorithms exploit the FONC, and the main dif-
ference between them is the way they avoid the local minimum and provide fast
convergence. In the following, we start with the discussion of the classical gradient
descent method and its stochastic extension called the stochastic gradient descent
(SGD), after which various improvements will be discussed.
The gradient of the cost function in (6.12) is given by
$$
\frac{\partial c}{\partial \Theta} = \sum_{n=1}^{N} \frac{\partial}{\partial \Theta}\ell\left(y_n, f_{\Theta}(x_n)\right), \tag{6.16}
$$
which is equal to the sum of the gradients at each of the training data. Since the gradient is the steepest ascent direction of the cost function, the steepest descent algorithm updates the parameter in the opposite direction:
$$
\begin{aligned}
\Theta[t+1] &= \Theta[t] - \eta\left.\frac{\partial c}{\partial \Theta}(\Theta)\right|_{\Theta = \Theta[t]} \\
&= \Theta[t] - \eta\sum_{n=1}^{N}\left.\frac{\partial}{\partial \Theta}\ell\left(y_n, f_{\Theta}(x_n)\right)\right|_{\Theta = \Theta[t]},
\end{aligned}\tag{6.17}
$$
where $\eta > 0$ denotes the step size and $\Theta[t]$ is the $t$-th update of the parameter $\Theta$. Figure 6.6a illustrates why gradient descent is a good way to minimize the cost for a convex optimization problem. As the gradient of the cost points toward the uphill direction of the cost, the parameter update should be in its negative direction.
Fig. 6.6 Steepest gradient descent example: (a) convex cases, where steepest descent succeeds,
(b) non-convex case, where the steepest descent cannot go uphill, (c) steepest gradient leads to
different local minimizers depending on the initialization
After a small step, a new gradient is computed and a new search direction is found.
By iterating the procedure, we can achieve the global minimum.
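For a simple convex quadratic cost, the update (6.17) takes only a few lines; the following illustrative sketch (our own code with arbitrary data) shows the iteration reaching the least-squares minimizer:

```python
import numpy as np

# gradient descent (6.17) on a convex quadratic cost c(theta) = 0.5 ||A theta - b||^2
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

def grad(theta):
    return A.T @ (A @ theta - b)     # gradient of the cost

theta = np.zeros(5)
eta = 0.01                           # step size
for t in range(500):
    theta = theta - eta * grad(theta)

# the iterate should reach the least-squares minimizer
print(np.allclose(theta, np.linalg.lstsq(A, b, rcond=None)[0], atol=1e-3))
```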
One of the downsides of the gradient descent method is that when the gradient becomes zero at a local minimizer at $t^*$, the update equation in (6.17) makes the iteration stuck in the local minimizer, i.e.:
$$
\Theta[t+1] = \Theta[t], \quad t \geq t^*. \tag{6.18}
$$
For example, Fig. 6.6b,c show the potential limitations of gradient descent. For the case of Fig. 6.6b, along the path toward the global minimum there exist uphill directions, which cannot be overcome by the gradient method. On the other hand, Fig. 6.6c shows that, depending on the initialization, different local minimizers can be found by gradient descent due to the different intermediate paths. In fact, the situations in Fig. 6.6b,c are the more likely situations in neural network training, since the optimization problem is highly non-convex due to the cascaded connection of nonlinearities.
nonlinearities. In addition, despite using the same initialization, the optimizer can
converge to a completely different solution depending on the step size or certain
optimization algorithms. In fact, algorithmic bias is a major research topic in modern
deep learning, often referred to as inductive bias.
This can be another reason why neural network training is difficult and depends
heavily on who is training the model. For example, even if multiple students are
given the exact same training set, network architecture, GPU, etc., it is usually
observed that some students are successfully training the neural network and others
are not. The main reason for such a difference is usually due to their commitment
and self-confidence, which leads to different optimization algorithms with different
inductive biases. Successful students usually try different initializations, optimizers,
different learning rates, etc. until the model works, while unsuccessful students
usually stick to the parameters all the time without trying to carefully change them.
Instead, they often claim that the failure is not their fault, but because of the wrong
model they started with. If the training problem were convex, then regardless of the
inductive bias they have in training, all students could be successful. Unfortunately, this is not the case.
We say that the update equations in (6.17) are based on full gradients, since at
each iteration we need to compute the gradient with respect to the whole data set.
However, if $N$ is large, the computational cost for the gradient calculation is quite heavy.
Moreover, by using the full gradient, it is difficult to avoid the local minimizer, since
the gradient descent direction is always toward the lower cost value.
To address the problem, the SGD algorithm uses an easily computable estimate
of the gradient using a small subset of training data. Although it is a bit noisy, this
noisy gradient can even be helpful in avoiding local minimizers. For example, let
I [t] ⊂ {1, · · · , N } denote a random subset of the index set {1, · · · , N } at the t-th
update. Then, our estimate of the full gradient at the t-th iteration is given by
$$
\left.\frac{\partial c}{\partial \Theta}(\Theta)\right|_{\Theta=\Theta[t]} \;\approx\; \frac{N}{|I[t]|}\sum_{n \in I[t]}\left.\frac{\partial}{\partial \Theta}\ell\left(y_n, f_{\Theta}(x_n)\right)\right|_{\Theta=\Theta[t]}, \tag{6.19}
$$
where |I [t]| denotes the number of elements in I [t]. As the SGD utilizes a small
random subset of the original training data set (i.e. $|I[t]| \ll N$) in calculating the gradient, the computational complexity for each update is much smaller than that of the
original gradient descent method. Moreover, it is not exactly the same as the true
gradient direction so that the resulting noise can provide a means to escape from the
local minimizers.
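A sketch of the minibatch estimate (6.19) for a least-squares cost (our own illustrative code; the batch size and step size are arbitrary choices) is:

```python
import numpy as np

# SGD with the minibatch gradient estimate (6.19) for a least-squares cost over N samples
rng = np.random.default_rng(0)
N, p = 1000, 5
X = rng.standard_normal((N, p))
theta_true = rng.standard_normal(p)
y = X @ theta_true + 0.1 * rng.standard_normal(N)

theta, eta, batch = np.zeros(p), 1e-4, 32
for t in range(3000):
    idx = rng.choice(N, size=batch, replace=False)            # random index set I[t]
    # minibatch estimate (6.19) of the gradient of sum_n 0.5 (y_n - x_n^T theta)^2
    g = (N / batch) * X[idx].T @ (X[idx] @ theta - y[idx])
    theta = theta - eta * g

print(np.round(theta - theta_true, 2))   # the error should be close to zero
```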
Another way to overcome the local minimum problem is to take into account the previous updates as additional terms to avoid getting stuck in local minima. Specifically, a desirable update equation may be written as
$$
\Theta[t+1] = \Theta[t] - \eta\sum_{s=1}^{t}\beta^{t-s}\frac{\partial c}{\partial \Theta}(\Theta[s]) \tag{6.20}
$$
Fig. 6.7 Example trajectory of update in (a) stochastic gradient, (b) SGD with momentum method
for an appropriate forgetting factor $0 < \beta < 1$. This implies that the contribution from the past gradients is gradually reduced in calculating the current update direction. However, the main limitation of using (6.20) is that all the history of the past gradients should be saved, which requires huge GPU memory. Instead, the following recursive formulation, which provides an equivalent representation, is mostly used:
$$
\begin{aligned}
V[t] &= \beta\left(\Theta[t] - \Theta[t-1]\right) - \eta\frac{\partial c}{\partial \Theta}(\Theta[t]),\\
\Theta[t+1] &= \Theta[t] + V[t].
\end{aligned}\tag{6.21}
$$
This type of method is called the momentum method, and is particularly useful
when it is combined with the SGD. The example update trajectory of the SGD with
momentum is shown in Fig. 6.7b. Compared to the fluctuating path, the momentum
method provides a smoothed solution path thanks to the averaging effects from the
past gradient, which results in fast convergence.
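The recursion (6.21) requires only one additional state variable; a sketch (our own code on a quadratic cost, with arbitrary step size and forgetting factor) is:

```python
import numpy as np

# gradient descent with momentum, following the recursion (6.21)
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)

def grad(theta):
    return A.T @ (A @ theta - b)

theta, theta_prev = np.zeros(10), np.zeros(10)
eta, beta = 0.005, 0.9
for t in range(500):
    V = beta * (theta - theta_prev) - eta * grad(theta)   # V[t] in (6.21)
    theta_prev = theta
    theta = theta + V                                     # Theta[t+1] = Theta[t] + V[t]

print(np.linalg.norm(grad(theta)))    # the gradient norm should be close to zero
```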
In neural networks, several other variants of the optimizers are often used, among which ADAGrad [25], RMSprop [26], and Adam [27] are the most popular. The main idea of these variants is that instead of using a fixed step size $\eta$ for all elements of the gradient, an element-wise adaptive step size is used. For example, for the case of the steepest descent in (6.17), we use the following update equation:
$$
\Theta[t+1] = \Theta[t] - \Upsilon[t] \odot \frac{\partial c}{\partial \Theta}(\Theta[t]), \tag{6.22}
$$
where $\Upsilon[t]$ is a matrix of step sizes and $\odot$ is the element-wise multiplication. In fact, the main difference between these algorithms is how the matrix $\Upsilon[t]$ is updated at each iteration. For more details on the specific update rules, see the original papers [25–27].
6.4 The Backpropagation Algorithm

In the previous section, various optimization algorithms for neural network training were discussed based on the assumption that the gradient $\frac{\partial c}{\partial \Theta}(\Theta[t])$ is computed. However, given the complicated nonlinear nature of the feedforward neural network, the computation of the gradient is not trivial.
In machine learning, backpropagation (backprop, or BP) [28] is a standard way
of calculating the gradient in training feedforward neural networks, by providing
an explicit and computationally efficient way of computing the gradient. The term
backpropagation and its general use in neural networks were originally derived in
Rumelhart, Hinton and Williams [28]. Their main idea is that although the multi-
layer neural network is composed of complicated connections of neurons with a
large number of unknown weights, the recursive structure of the multilayer neural
network in (6.10) lends itself to computationally efficient optimization methods.
$$
\frac{\partial Ax}{\partial \mathrm{VEC}(A)} = x \otimes I_m. \tag{6.23}
$$
where $W^{(l)} \in \mathbb{R}^{d^{(l)}\times d^{(l-1)}}$. Using the denominator layout as explained in Chap. 1, we have
$$
\frac{\partial c}{\partial \Theta} = \begin{bmatrix} \dfrac{\partial c}{\partial W^{(1)}} \\ \vdots \\ \dfrac{\partial c}{\partial W^{(L)}} \end{bmatrix}, \tag{6.26}
$$
so that the weight at the $l$-th layer can be updated with the increment:
$$
\Delta\Theta = \begin{bmatrix} \Delta W^{(1)} \\ \vdots \\ \Delta W^{(L)} \end{bmatrix}, \quad\text{where}\quad \Delta W^{(l)} = -\eta\frac{\partial c}{\partial W^{(l)}}. \tag{6.27}
$$
Therefore, $\partial c/\partial W^{(l)}$ should be specified. More specifically, for a given training data set $\{x_n, y_n\}_{n=1}^N$, recall that the cost function $c(\Theta)$ in (6.12) is given by
$$
c(\Theta) = \sum_{n=1}^{N} \ell\left(y_n, f_{\Theta}(x_n)\right), \tag{6.28}
$$
where $f_{\Theta}(x_n)$ is defined in (6.13). Now define the $l$-th layer variables with respect to the $n$-th training data:
$$
o_n^{(l)} = \sigma\left(g_n^{(l)}\right), \qquad g_n^{(l)} = W^{(l)} o_n^{(l-1)}, \tag{6.29}
$$
$$
o_n^{(0)} := x_n, \tag{6.30}
$$
$$
o_n^{(L)} = f_{\Theta}(x_n).
$$
Using the chain rule with the denominator convention (see Eq. (1.40)), we have
$$
\frac{\partial c}{\partial \mathrm{VEC}(W^{(l)})} = \sum_{n=1}^{N}\frac{\partial g_n^{(l)}}{\partial \mathrm{VEC}(W^{(l)})}\,\frac{\partial \ell\left(y_n, o_n^{(L)}\right)}{\partial g_n^{(l)}}.
$$
In (6.34), we use
$$
\Lambda_n^{(l)} := \frac{\partial o_n^{(l)}}{\partial g_n^{(l)}} = \frac{\partial \sigma\big(g_n^{(l)}\big)}{\partial g_n^{(l)}} \in \mathbb{R}^{d^{(l)}\times d^{(l)}}, \tag{6.35}
$$
$$
\frac{\partial g_n^{(l+1)}}{\partial o_n^{(l)}} = \frac{\partial W^{(l+1)} o_n^{(l)}}{\partial o_n^{(l)}} = W^{(l+1)\top}, \tag{6.36}
$$
which is obtained using the denominator convention (see (1.41) in Chap. 1). Accordingly, we have
$$
\begin{aligned}
\frac{\partial c}{\partial \mathrm{VEC}(W^{(l)})} &= \sum_{n=1}^{N}\frac{\partial g_n^{(l)}}{\partial \mathrm{VEC}(W^{(l)})}\,\frac{\partial \ell\left(y_n, o_n^{(L)}\right)}{\partial g_n^{(l)}} \\
&= \sum_{n=1}^{N}\left(o_n^{(l-1)} \otimes I_{d^{(l)}}\right)\delta_n^{(l)} \\
&= \sum_{n=1}^{N}\mathrm{VEC}\left(\delta_n^{(l)}\, o_n^{(l-1)\top}\right),
\end{aligned}
$$
where we use (6.32) and (6.33) for the second equality, and Lemma 6.3 for the last
equality. Finally, we have the following derivative of the cost with respect to W (l) :
$$
\begin{aligned}
\frac{\partial c}{\partial W^{(l)}} &= \mathrm{UNVEC}\left(\frac{\partial c}{\partial \mathrm{VEC}(W^{(l)})}\right) \\
&= \mathrm{UNVEC}\left(\sum_{n=1}^{N}\mathrm{VEC}\left(\delta_n^{(l)}\, o_n^{(l-1)\top}\right)\right) \\
&= \sum_{n=1}^{N}\delta_n^{(l)}\, o_n^{(l-1)\top},
\end{aligned}
$$
where we use the linearity of the $\mathrm{UNVEC}(\cdot)$ operator for the last equality. Therefore, the weight update increment is given by
$$
\Delta W^{(l)} = -\eta\frac{\partial c}{\partial W^{(l)}} = -\eta\sum_{n=1}^{N}\delta_n^{(l)}\, o_n^{(l-1)\top}. \tag{6.37}
$$
n=1
This weight update scheme in (6.37) is the key in BP. Not only is the final form of
the weight update in (6.37) very concise, but it also has a very important geometric
meaning, which deserves further discussion. In particular, the update is totally
(l−1) (l−1)
determined by the outer product of the two terms δ (l) n and on , i.e. δ (l)
n on .
Why are these terms so important? This is the main discussion point in this section.
First, recall that $o_n^{(l-1)}$ is the $(l-1)$-th layer neural network output given by (6.29). Since this term is calculated in the forward path of the neural network, it is nothing but the forward-propagated input to the $l$-th layer neuron. Second, recall that
$$
\epsilon_n = \frac{\partial \ell\left(y_n, o_n^{(L)}\right)}{\partial o_n^{(L)}},
$$
which is indeed the estimation error of the neural network output. Since we have
$$
\delta_n^{(l)} = \Lambda_n^{(l)} W^{(l+1)\top}\Lambda_n^{(l+1)} W^{(l+2)\top}\cdots W^{(L)\top}\Lambda_n^{(L)}\epsilon_n, \tag{6.38}
$$
the term $\delta_n^{(l)}$ can be interpreted as the error back-propagated from the output layer down to the $l$-th layer, which can be computed with the recursive formulae
$$
\delta_n^{(l)} = \Lambda_n^{(l)} W^{(l+1)\top}\delta_n^{(l+1)}, \tag{6.40}
$$
$$
o_n^{(0)} = x_n, \qquad \delta_n^{(L)} = \Lambda_n^{(L)}\epsilon_n. \tag{6.41}
$$
The geometric interpretation and recursive formulae are illustrated in Fig. 6.8.
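The recursive formulae can be turned into working code; the sketch below (our own NumPy implementation for a small bias-free ReLU network with the MSE loss, not the book's reference implementation) back-propagates the error, forms the outer-product gradients of (6.37), and checks one entry against a numerical gradient:

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)
drelu = lambda x: (x > 0).astype(float)     # diagonal of the activation derivative

def forward(x, Ws):
    os, gs = [x], []
    for W in Ws:
        g = W @ os[-1]
        gs.append(g)
        os.append(relu(g))
    return os, gs

def backprop(x, y, Ws):
    """Gradients dC/dW^(l) for C = 0.5 ||y - o^(L)||^2 using the delta recursion."""
    os, gs = forward(x, Ws)
    eps = os[-1] - y                                   # error at the output
    delta = drelu(gs[-1]) * eps                        # delta at the last layer
    grads = [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        grads[l] = np.outer(delta, os[l])              # delta^(l) o^(l-1)^T, cf. (6.37)
        if l > 0:
            delta = drelu(gs[l - 1]) * (Ws[l].T @ delta)   # back-propagate the error
    return grads

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((6, 4)), rng.standard_normal((3, 6))]
x, y = rng.standard_normal(4), rng.standard_normal(3)
grads = backprop(x, y, Ws)

# numerical check of one weight entry
h, (i, j) = 1e-6, (1, 2)
Wp = [W.copy() for W in Ws]; Wp[0][i, j] += h
c = lambda W: 0.5 * np.sum((y - forward(x, W)[0][-1]) ** 2)
print(grads[0][i, j], (c(Wp) - c(Ws)) / h)    # the two numbers should agree
```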
Specifically, consider the MSE loss
$$
\ell\left(y, o^{(L)}\right) = \frac{1}{2}\left\|y - o^{(L)}\right\|^2, \tag{6.42}
$$
where the subscript $n$ for the training data index is neglected here for simplicity, and
$$
o^{(L)} := \sigma\left(W^{(L)} o^{(L-1)}\right). \tag{6.43}
$$
One of the important observations is that for the case of the ReLU, (6.43) can be represented by
$$
o^{(L)} = \Lambda^{(L)} W^{(L)} o^{(L-1)}, \tag{6.44}
$$
where $\Lambda^{(L)} \in \mathbb{R}^{d^{(L)}\times d^{(L)}}$ is a diagonal matrix with 0 and 1 values given by
$$
\Lambda^{(L)} = \begin{bmatrix} \gamma_1 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ 0 & \cdots & \gamma_j & \cdots & 0 \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & \gamma_{d^{(L)}}\end{bmatrix}, \tag{6.45}
$$
108 6 Artificial Neural Networks and Backpropagation
where
γj = γ [g (L) ]j , (6.46)
where [g (L) ]j denotes the j -th element of the vector g (L) and γ (·) is defined in
(6.4). Thanks to (6.5), we have
where (l) is defined as the derivative of the activation function in (6.35). Therefore,
using the recursive formula, we have
Using this, we now investigate whether the cost decreases with the perturbed weight
$$
W^{(l)} + \Delta W^{(l)}, \quad\text{where}\quad \Delta W^{(l)} = -\eta\, \delta^{(l)} o^{(l-1)\top}. \tag{6.49}
$$
When the step size $\eta$ is sufficiently small, the ReLU activation patterns from $W^{(l)} + \Delta W^{(l)}$ do not change from those given by $W^{(l)}$ (this issue will be discussed later), so that the new cost function value is given by
$$
\tilde\ell\left(y, o^{(L)}\right) := \frac{1}{2}\left\| y - \Lambda^{(L)} W^{(L)}\cdots\Lambda^{(l)}\left(W^{(l)} + \Delta W^{(l)}\right)o^{(l-1)}\right\|^2.
$$
Note that
$$
\delta^{(L)} = o^{(L)} - y = \Lambda^{(L)} W^{(L)}\cdots\Lambda^{(l)} W^{(l)} o^{(l-1)} - y.
$$
Accordingly, we have
$$
\begin{aligned}
\tilde\ell\left(y, o^{(L)}\right) &= \frac{1}{2}\left\| -\delta^{(L)} - \Lambda^{(L)} W^{(L)}\cdots\Lambda^{(l)}\Delta W^{(l)} o^{(l-1)}\right\|^2 \\
&= \frac{1}{2}\left\| -\delta^{(L)} + \eta\,\Lambda^{(L)} W^{(L)}\cdots\Lambda^{(l)}\delta^{(l)}\, o^{(l-1)\top} o^{(l-1)}\right\|^2 \\
&= \frac{1}{2}\left\|\left(I - \eta\left\|o^{(l-1)}\right\|^2 M^{(l)}\right)\delta^{(L)}\right\|^2,
\end{aligned}\tag{6.50}
$$
where
$$
M^{(l)} := \left(\Lambda^{(L)} W^{(L)}\cdots\Lambda^{(l)}\right)\left(\Lambda^{(L)} W^{(L)}\cdots\Lambda^{(l)}\right)^\top,
$$
and the last equality comes from (6.38). Now, we can easily see that for all $x \in \mathbb{R}^{d^{(L)}}$ we have
$$
x^\top M^{(l)} x = \left\|\left(\Lambda^{(L)} W^{(L)}\cdots\Lambda^{(l)}\right)^\top x\right\|^2 \geq 0,
$$
so that $M^{(l)}$ is positive semidefinite, i.e. its eigenvalues are non-negative. Furthermore, we have
$$
\left\|\left(I - \eta\left\|o^{(l-1)}\right\|^2 M^{(l)}\right)\delta^{(L)}\right\|^2 \leq \lambda_{\max}^2\left(I - \eta\left\|o^{(l-1)}\right\|^2 M^{(l)}\right)\left\|\delta^{(L)}\right\|^2. \tag{6.52}
$$
Since $M^{(l)}$ is positive semidefinite, for a sufficiently small step size $\eta$ all eigenvalues of $I - \eta\|o^{(l-1)}\|^2 M^{(l)}$ lie in $[0, 1]$, so that we can show
$$
\tilde\ell\left(y, o^{(L)}\right) \leq \frac{1}{2}\left\|\delta^{(L)}\right\|^2 = \ell\left(y, o^{(L)}\right),
$$
i.e. the perturbation (6.49) does not increase the cost.
In fact, the weight update direction in (6.49) can also be obtained from the following local variational formulation at the $l$-th layer:
$$
\min_{\Delta W}\;\left\| -\delta^{(l)} - \Delta W o^{(l-1)}\right\|^2. \tag{6.54}
$$
Note that we have a minus sign in front of $\delta^{(l)}$, inspired by its global counterpart in (6.50). By inspection, we can easily see that the optimal solution for (6.54) is given by
$$
\Delta W^* = -\frac{1}{\left\|o^{(l-1)}\right\|^2}\,\delta^{(l)} o^{(l-1)\top}, \tag{6.55}
$$
since plugging (6.55) into (6.54) makes the cost function zero. Therefore, the optimal search direction for the weight update should be given by
$$
\Delta W^{(l)} \propto -\delta^{(l)} o^{(l-1)\top},
$$
which is equivalent to (6.49). The take-away message here is that as long as we can
obtain the back-propagated error and the forward-propagated input, we can obtain a
local variational formulation, which can be solved by any means.
6.5 Exercises
1. Derive the general form of the activation function $\sigma(x)$ that satisfies the following differential equation:
$$
\frac{\sigma(x)}{x} = \frac{\partial \sigma(x)}{\partial x}.
$$
2. Show that (6.21) is equivalent to (6.20).
3. Recall that L-layer feedforward neural network can be represented recursively
by
for $l = 1, \cdots, L$. When the training data size is 1, the weight update is given by
$$
\Delta W^{(l)} = -\eta\, \delta^{(l)} o^{(l-1)\top}. \tag{6.58}
$$
a. Derive the update equation similar to (6.58) for the bias term, i.e. b(l) .
b. Suppose the weight matrix $W^{(l)}$, $l = 1, \cdots, L$, is a diagonal matrix. Draw the network connection architecture similarly to Fig. 6.8. Then, derive the
backprop algorithm for the diagonal term of the weight matrix, assuming that
the bias is zero. You must use the chain rule to derive this.
4. Let a two-layer ReLU neural network $f_{\Theta}$ have an input and output dimension for each layer in $\mathbb{R}^2$, i.e. $f_{\Theta}: x \in \mathbb{R}^2 \mapsto f_{\Theta}(x) \in \mathbb{R}^2$. Suppose that the parameter $\Theta$ of the network is composed of weights and biases:
$$
\Theta = \left\{ W^{(1)}, W^{(2)}, b^{(1)}, b^{(2)}\right\}, \tag{6.60}
$$
and the loss
$$
\ell(\Theta) = \frac{1}{2}\left\| y - f_{\Theta}(x)\right\|^2 \tag{6.62}
$$
and a training data
compute the weight and bias update for the first two iterations of the backpropa-
gation algorithm. It is suggested that the unit step size, i.e. γ = 1, be used.
5. We are now interested in extending (6.54) for the training data composed of N
samples.
a. Show that the following equality holds for the local variational formulation:
$$
\min_{\Delta W}\;\sum_{n=1}^{N}\left\| -\delta_n^{(l)} - \Delta W o_n^{(l-1)}\right\|^2 = \min_{\Delta W}\;\left\| -D^{(l)} - \Delta W O^{(l-1)}\right\|_F^2, \tag{6.64}
$$
where $D^{(l)} = \begin{bmatrix}\delta_1^{(l)} & \cdots & \delta_N^{(l)}\end{bmatrix}$ and $O^{(l-1)} = \begin{bmatrix} o_1^{(l-1)} & \cdots & o_N^{(l-1)}\end{bmatrix}$.
b. Show that there exists a step size $\gamma > 0$ such that the weight perturbation
$$
\Delta W^{(l)} = -\gamma\sum_{n=1}^{N}\delta_n^{(l)}\, o_n^{(l-1)\top}
$$
6. Suppose that our activation function is sigmoid. Derive the BP algorithm for
the L-layer neural network. What is the main difference of the BP algorithm
compared to the network with a ReLU? Is this an advantage or disadvantage?
Answer this question in terms of variational perspective.
7. Now we are interested in extending the model in (6.6) to a convolutional neural network model:
$$
o_i^{(l)} = \sigma\Big(\sum_{j=1}^{d^{(l-1)}} h_{i-j}^{(l)}\, o_j^{(l-1)} + b_i^{(l)}\Big). \tag{6.65}
$$
a. When (6.65) is written in the matrix form of (6.7), what is the corresponding weight matrix $W^{(l)}$? Please show the structure of $W^{(l)}$ explicitly in terms of the $h^{(l)}$ elements.
b. Derive the backpropagation algorithm for the filter update h(l) .
Chapter 7
Convolutional Neural Networks
7.1 Introduction
Fig. 7.1 LeNet: the first CNN proposed by Yann LeCun for zip code identification [21]
the inventor of the SVM. In the preface of his classical book entitled The Nature of
Statistical Learning Theory [10], Vapnik expressed his concern saying that “Among
artificial intelligence researchers the hardliners had considerable influence (it is
precisely they who declared that complex theories do not work, simple algorithms
do)”.
Ironically, the advent of the SVM and kernel machines has led to a long period of
decline in neural network research, often referred to as the “AI winter”. During the
AI winter, the neural network researchers were largely considered pseudo-scientists
and even had difficulty in securing research funding. Although there have been
several notable publications on neural networks during the AI winter, the revival of
convolutional neural network research, up to the level of general public acceptance,
has had to wait until the series of deep neural network breakthroughs at the ILSVRC
(ImageNet Large Scale Visual Recognition Competition).
In the following section, we give a brief overview of the history of modern CNN
research that has contributed to the revival of research on neural networks.
7.2 History of Modern CNNs

7.2.1 AlexNet
ImageNet is a large visual database designed for use in visual object recognition
software research [8]. ImageNet contains more than 20,000 categories, each consisting of several hundred images. Since 2010, the ImageNet project has run an annual software
contest, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [7],
where software programs compete to correctly classify and detect objects and
scenes. Around 2011, a good ILSVRC classification error rate, which was based
on classical machine learning approaches, was about 27%.
In the 2012 ImageNet Challenge, Krizhevsky et al. [9] proposed a CNN architecture, shown in Fig. 7.2, which is now known as AlexNet. The AlexNet
architecture is composed of five convolution layers and three fully connected layers.
In fact, the basic components of AlexNet were nearly the same as those of LeNet by
Fig. 7.2 The ImageNet challenges and the CNN winners that have completely changed the
landscape of artificial intelligence
Yann LeCun [21], except the new nonlinearity using the rectified linear unit (ReLU).
AlexNet got a Top-5 error rate (rate of not finding the true label of a given image
among its top 5 predictions) of 15.3%. The next best result in the challenge, which
was based on the classical kernel machines, trailed far behind (26.2%).
In fact, the celebrated victory of AlexNet declared the start of a “new era” in
data science, as witnessed by more than 75k citations according to Google Scholar
as of January 2021. With the introduction of AlexNet, the world was no longer
the same, and all the subsequent winners at the ImageNet challenges were deep
neural networks, and nowadays CNNs surpass human observers in ImageNet classification. In the following, we introduce several subsequent CNN architectures
which have made significant contributions in deep learning research.
7.2.2 GoogLeNet
GoogLeNet [30] was the winner at the 2014 ILSVRC (see Fig. 7.2). As the name
“GoogLeNet” indicates, it is from Google, but one may wonder why it is not written
as “GoogleNet”. This is because the researchers of “GoogLeNet” tried to pay tribute
to Yann LeCun’s LeNet [21] by containing the word “LeNet”.
The network architecture is quite different from AlexNet due to the so-called
inception module [30], shown in Fig. 7.3. Specifically, at each inception module, there exist different sizes/types of convolutions for the same input, and all the outputs are stacked. This idea was inspired by the famous 2010 science fiction
film Inception, in which Leonardo DiCaprio starred. In the film, the renowned
director Christopher Nolan wanted to explore “the idea of people sharing a dream
space. . . That gives you the ability to access somebody’s unconscious mind.” The
key concept which GoogLeNet borrowed from the film was the “dream within a
dream” strategy, which led to the “network within a network” strategy that improves
the overall performance.
7.2.3 VGGNet
VGGNet [31] was invented by the VGG (Visual Geometry Group) from University
of Oxford for the 2014 ILSVRC (see Fig. 7.2). Although VGGNet was not the
winner of the 2014 ILSVRC (GoogLeNet was the winner at that time, and the
VGGNet came second), VGGNet has made a prolonged impact in the machine
learning community due to its modular and simple architecture, yet resulting in
a significant performance improvement over AlexNet [9]. In fact, the pretrained
VGGNet model captures many important image features; therefore, it is still widely
used for various purposes such as perceptual loss [32], etc. Later we will use
VGGNet to visualize CNNs.
As shown in Fig. 7.2, VGGNet is composed of multiple layers of convolution,
max pooling, the ReLU, followed by fully connected layers and softmax. One of
the most important observations of VGGNet is that it achieves an improvement
over AlexNet by replacing large kernel-sized filters with multiple 3 × 3 kernel-
sized filters. As will be shown later, for a given receptive field size, cascaded
application of a smaller size kernel followed by the ReLU makes the neural network
more expressive than one with a larger kernel size. This is why VGGNet provided
significantly improved performance over AlexNet despite its simple structure.
7.2.4 ResNet
In the history of ILSVRC, the Residual Network (ResNet) [33] is considered another
masterpiece, as shown in its citation record of more than 68k as of January 2020.
Since the representation power of a deep neural network increases with the
network depth, there has been strong research interest in increasing the network
depth. For example, AlexNet [9] from 2012 LSVRC had only five convolutional
layers, while the VGG network [31] and GoogLeNet [30] from 2014 LSVRC
had 19 and 22 layers, respectively. However, people soon realized that a deeper
neural network is hard to train. This is because of the vanishing gradient problem,
where the gradient can be easily back-propagated to layers closer to the output,
but is difficult to back-propagate far from the output layer, since the repeated multiplication may make the gradient vanishingly small. As discussed in the previous chapter,
the ReLU nonlinearity partly mitigates the problem, since the forward and backward
propagation are symmetric, but still the deep neural network turns out to be difficult
to train due to an unfavorable optimization landscape [34]; this issue will be
reviewed later.
As shown in Fig. 7.2, there exist bypass (or skip) connections in the ResNet,
representing an identity mapping. The bypass connection was proposed to promote
the gradient back-propagation. Thanks to the skip connection, ResNet makes it
possible to train up to hundreds or even thousands of layers, achieving a significant
performance improvement. Recent research reveals that the bypass connection
also improves the forward propagation, making the representation more expressive
[35]. Furthermore, its optimization landscape can be significantly improved thanks
to bypass connections that eliminate many local minimizers [35, 36].
7.2.5 DenseNet
DenseNet (Dense Convolutional Network) [37] exploits the extreme form of skip
connection, as shown in Fig. 7.4. In DenseNet, each layer has skip connections from all preceding layers to obtain additional inputs.
Since each layer receives inputs from all preceding layers, the representation
power of the network increases significantly, which makes the network compact,
thereby reducing the number of channels. With dense connections, the authors
demonstrated that fewer parameters and higher accuracy are achieved compared
to ResNet [37].
7.2.6 U-Net
Unlike the aforementioned networks that are designed for the ImageNet classification
task, the U-Net architecture [38] in Fig. 7.5 was originally proposed for biomedical
image segmentation, and is widely used for inverse problems [39, 40].
One of the unique aspects of U-Net is its symmetric encoder–decoder architec-
ture. The encoder part consists of 3 × 3 convolution, batch normalization [41], and
the ReLU. In the decoder part, upsampling and 3 × 3 convolution are used. Also,
there are max pooling layers and skip connections through channel concatenation.
The multi-scale architecture of U-Net significantly increases the receptive field,
which may be the main reason for the success of U-Net for segmentation, inverse
problems, etc., where global information from all over the images is necessary to
update the local image information. This issue will be discussed later. Moreover,
the skip connection is important to retain the high-frequency content of the input
signal.
7.3 Basic Building Blocks of CNNs
7.3.1 Convolution
The basic building block of a CNN is the convolution of an input x with a filter h:
y = h ∗ x, (7.1)
where ∗ denotes the convolution operation. For example, the 3 × 3 convolution case
for 2-D images can be represented element by element as follows:
y[m, n] = \sum_{p,q=-1}^{1} h[p, q]\, x[m − p, n − q], (7.2)
where y[m, n], h[m, n] and x[m, n] denote the (m, n)-element of the matrices Y , H
and X, respectively. One example of computing this convolution is illustrated in
Fig. 7.6, where the filter is already flipped for visualization.
It is important to note that the convolution used in CNNs is richer than the
simple convolution in (7.1) and Fig. 7.6. For example, a three channel input signal
can generate a single channel output as shown in Fig. 7.7a, which is often referred
to as multi-input single-output (MISO) convolution. In another example shown in
Fig. 7.7b, a 5 × 5 filter kernel is used to generate 6 (resp. 10) output channels from 3 (resp. 6) input channels. This is often called the multi-input multi-output
(MIMO) convolution. Finally, in Fig. 7.7c, the 1 × 1 filter kernel is used to generate
32 output channels from 64 input channels.
All these seemingly different convolutional operations can be written in a general
MIMO convolution form:
y^i = \sum_{j=1}^{c_{in}} h^{i,j} ∗ x^j, \quad i = 1, · · · , c_{out}, (7.3)
where cin and cout denote the number of input and output channels, respectively,
x j , y i refer to the j -th input and the i-th output channel image, respectively, and hi,j
is the convolution kernel that contributes to the i-th channel output by convolving
with the j-th input channel image. For the case of 1 × 1 convolution, the filter kernel reduces to a scalar weight, i.e. h^{i,j} = w_{ij}, so that (7.3) becomes the weighted sum of input channel images as follows:
y^i = \sum_{j=1}^{c_{in}} w_{ij} x^j, \quad i = 1, · · · , c_{out}. (7.4)
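As a rough illustration of (7.3) and (7.4), the following PyTorch sketch (channel and image sizes are arbitrary assumptions) shows that nn.Conv2d already implements the MIMO summation over input channels, and that a 1 × 1 convolution is equivalent to a per-pixel weighted sum of the input channel images.

```python
import torch
import torch.nn as nn

c_in, c_out, H, W = 3, 6, 32, 32   # illustrative sizes only
x = torch.randn(1, c_in, H, W)

# MIMO convolution as in (7.3): each output channel i is the sum over input
# channels j of h^{i,j} * x^j.  nn.Conv2d implements exactly this summation.
mimo = nn.Conv2d(c_in, c_out, kernel_size=5, padding=2, bias=False)
y = mimo(x)                        # shape (1, c_out, H, W)

# 1x1 convolution as in (7.4): a per-pixel weighted sum of input channels.
conv1x1 = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
y1 = conv1x1(x)

# The same 1x1 convolution written explicitly as a channel-mixing matrix.
Wmat = conv1x1.weight.view(c_out, c_in)          # (c_out, c_in)
y1_manual = torch.einsum('oc,bchw->bohw', Wmat, x)
print(torch.allclose(y1, y1_manual, atol=1e-6))  # True
```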
7.3.2 Pooling and Unpooling
A pooling layer is used to progressively reduce the spatial size of the representation, which reduces the number of parameters and the amount of computation in the network. The pooling layer operates on each feature map independently. The most common approaches used in pooling are max pooling and average pooling, as shown in
Fig. 7.8b. In this case, the pooling layer will always reduce the size of each feature map by a factor of 2. For example, the max (average) pooling layer in Fig. 7.8b applied to an input image of 16 × 16 produces an output pooled feature map of 8 × 8.
Fig. 7.7 Various convolutions used in CNNs. (a) Multi-input single-output (MISO) convolution, (b) multi-input multi-output (MIMO) convolution, (c) 1 × 1 convolution
Fig. 7.8 (a) Pooling and unpooling operation, (b) max and average pooling operation
On the other hand, unpooling is an operation for image upsampling. For example,
in the narrow sense of unpooling with respect to max pooling, one can copy the max pooled signal back to its original location, as shown in Fig. 7.9a. Alternatively, one could perform a transpose operation that copies the pooled signal to the whole enlarged area, as shown in Fig. 7.9b, which is often called deconvolution. Regardless of the
definition, unpooling tries to enlarge the downsampled image.
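The following PyTorch sketch (a minimal illustration with an assumed 16 × 16 single-channel input) shows the 2 × 2 max pooling of Fig. 7.8 and the two unpooling flavors of Fig. 7.9: copying the pooled value back to its original location and copying it to the whole enlarged area.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 16, 16)                      # illustrative 16x16 feature map

# Max pooling with a 2x2 window halves the spatial size: 16x16 -> 8x8.
pool = nn.MaxPool2d(kernel_size=2, return_indices=True)
p, idx = pool(x)
print(p.shape)                                     # torch.Size([1, 1, 8, 8])

# Unpooling in the narrow sense (Fig. 7.9a): copy each pooled value back to
# the location where the maximum was found, zeros elsewhere.
unpool = nn.MaxUnpool2d(kernel_size=2)
u1 = unpool(p, idx)
print(u1.shape)                                    # torch.Size([1, 1, 16, 16])

# Copying the pooled value to the whole enlarged area (Fig. 7.9b) can be
# mimicked by nearest-neighbor upsampling of the pooled map.
u2 = nn.Upsample(scale_factor=2, mode='nearest')(p)
print(u2.shape)                                    # torch.Size([1, 1, 16, 16])
```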
It was believed that a pooling layer is necessary to impose the spatial invariance
in classification tasks [43]. The main ground for this claim is that small movements
in the position of the feature in the input image will result in a different feature
map after the convolution operation, so that spatially invariant object classification
may be difficult. Therefore, downsampling to a lower resolution version of an input
signal without the fine detail may be useful for the classification task by imposing
invariance to translation.
Fig. 7.9 Two ways of unpooling. (a) Copying to the original location (unpooling), (b) copying to
all neighborhood (deconvolution)
However, these classical views have been challenged even by the deep learning
godfather, Geoffrey Hinton. In an “Ask Me Anything” column on Reddit he said, “the
pooling operation used in convolutional neural networks is a big mistake and the
fact that it works so well is a disaster. If the pools do not overlap, pooling loses
valuable information about where things are. We need this information to detect
precise relationships between the parts of an object. . . ”.
Regardless of Geoffrey Hinton’s controversial comment, the undeniable advan-
tage of the pooling layer results from the increased size of the receptive field. For example, in Fig. 7.10a,b we compare the effective receptive field sizes, which determine the area of the input image affecting a specific point of the output image
of a single resolution network and U-Net, respectively. We can clearly see that
the receptive field size increases linearly without pooling, but can be expanded
exponentially with the help of a pooling layer. In many computer vision tasks, a
large receptive field size is useful to achieve better performance. So the pooling and
unpooling are very effective in these applications.
Before we move on to the next topic, a remaining question is whether there exists
a pooling operation which does not lose any information but increases the receptive
field size exponentially. If there is, then it would address Geoffrey Hinton’s concern.
Fortunately, the short answer is yes, since there exists an important advance in this
field from the geometric understanding of deep neural networks [40, 42]. We will
cover this issue later when we investigate the mathematical principle.
7.3.3 Skip Connection
Another important building block, which has been pioneered by ResNet [33] and
also by U-Net [38], is the skip connection. For example, as shown in Fig. 7.11, the
feature map output from the internal block is given by
y = F(x) + x,
where F(x) is the output of the standard layers in the CNN with respect to the input
x, and the additional term x at the output comes directly from the input.
Thanks to the skip branch, ResNet [33] can easily approximate the identity
mapping, which is difficult to do using the standard CNN blocks. Later we will
show that additional advantages of the skip connection come from removing local
minimizers, which makes the training much more stable [35, 36].
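A minimal PyTorch sketch of the residual mapping y = F(x) + x is given below; the particular layer composition inside F (two 3 × 3 convolutions with batch normalization) is an illustrative assumption rather than the exact ResNet block.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A minimal residual block sketch: y = F(x) + x (equal channel sizes assumed)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(                 # F(x): the standard CNN layers
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)         # skip connection adds the input back

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 32, 32))              # same shape as the input
```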
7.4 Training CNNs
When a CNN architecture is chosen, the filter kernel should be estimated. This is
usually done during a training phase by minimizing a loss function. Specifically,
given input data x and its label y ∈ R^m, an average loss is defined by
c(\Theta) := E\left[ \ell(y, f_\Theta(x)) \right], (7.5)
where E[·] denotes the mean, \ell(·,·) is a loss function, and f_\Theta(x) is a CNN with input x, which is parameterized by the filter kernel parameter set \Theta. In (7.5), the mean is usually taken empirically from the training data.
For the multi-class classification problem using CNNs, one of the most widely
used losses is the softmax loss [44]. This is a multi-class extension of the binary
logistic regression classifier we studied before. A softmax classifier produces nor-
malized class probabilities, and also has a probabilistic interpretation. Specifically,
we perform the softmax transform:
\hat{p}(\Theta) = \frac{e^{f_\Theta(x)}}{\mathbf{1}^\top e^{f_\Theta(x)}}, (7.6)
Then, the softmax loss is defined by
c(\Theta) = -E\left[ \sum_{i=1}^{m} y_i \log \hat{p}_i(\Theta) \right], (7.7)
where y_i and \hat{p}_i denote the i-th elements of y and \hat{p}, respectively. If the class label y ∈ R^m is normalized to have a probabilistic meaning, i.e. \mathbf{1}^\top y = 1, then (7.7) is indeed the cross entropy between the target class distribution and the estimated class
distribution.
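The following NumPy sketch evaluates the softmax transform (7.6) and the loss (7.7) for a single sample; the three-class logits are arbitrary illustrative values.

```python
import numpy as np

def softmax_cross_entropy(f, y):
    """Softmax loss of (7.6)-(7.7) for one sample.

    f : network output f_Theta(x), shape (m,)
    y : one-hot (or probability-normalized) label, shape (m,)
    """
    p = np.exp(f - f.max())          # shift for numerical stability
    p = p / p.sum()                  # softmax probabilities, Eq. (7.6)
    return -np.sum(y * np.log(p))    # cross entropy, Eq. (7.7)

f = np.array([2.0, 0.5, -1.0])       # illustrative logits for a 3-class problem
y = np.array([1.0, 0.0, 0.0])        # true class is the first one
print(softmax_cross_entropy(f, y))   # approx 0.241
```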
For the case of regression problems using CNNs, which are quite often used for
image processing tasks such as denoising, the loss function is usually defined by the \ell_p norm, i.e.
c(\Theta) = E\left[ \| y - f_\Theta(x) \|_p^p \right]. (7.8)
In training CNNs, available data sets should be first split into three categories:
training, validation, and test data sets, as shown in Fig. 7.12. The training data is also
split into mini-batches so that each mini-batch can be used for stochastic gradient
computation. The training data set is then used to estimate the CNN filter kernels,
and the validation set is used to monitor whether there exists any overfitting issue in
the training.
For example, Fig. 7.13a shows an example of overfitting that can be monitored during the training using the validation data. If this type of overfitting happens, several approaches should be taken to achieve the stable training behavior shown in Fig. 7.13b. Such strategies will be discussed in the following section.
Fig. 7.12 Available data split into training, validation, and test data sets
Fig. 7.13 Neural network training dynamics: (a) overfitting problems, (b) no overfitting
7.4.3 Regularization
When we observe the overfitting behaviors similar to Fig. 7.13a, the easiest solution
is to increase the training data set. However, in many real-world applications, the
training data are scarce. In this case, there are several ways to regularize the neural
network training.
7.4.3.1 Data Augmentation
Using data augmentation, we generate artificial training instances. These are new training instances created, for example, by applying geometric transformations such as mirroring, flipping, and rotation to the original image, which do not change the label information.
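A typical label-preserving augmentation pipeline can be written, for example, with torchvision; the particular transforms and parameters below are illustrative assumptions, not prescriptions from the book.

```python
import torchvision.transforms as T

# A label-preserving augmentation pipeline (illustrative choices only).
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),              # mirroring
    T.RandomRotation(degrees=10),               # small rotations
    T.RandomResizedCrop(224, scale=(0.8, 1.0)), # random crop and resize
    T.ToTensor(),
])

# augmented = augment(pil_image)  # applied to each training image on the fly
```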
7.4.3.3 Dropout
Another unique regularization used for deep learning is the dropout [45]. The idea of
dropout is relatively simple. During training, at each iteration, a neuron is temporarily “dropped” or disabled with probability p. This means that all the inputs to and outputs from the dropped neurons are disabled at the current iteration. The dropped-out
neurons are resampled with probability p at every training step, so a dropped-out
neuron at one step can be active at the next one. See Fig. 7.14. The reason that the
dropout prevents overfitting is that during the random dropping, the input signal for
each layer varies, resulting in additional data augmentation effects.
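The following PyTorch sketch illustrates the dropout behavior described above: in training mode units are randomly zeroed with probability p (and the survivors rescaled), while in evaluation mode dropout is disabled.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)       # each unit is zeroed with probability 0.5
x = torch.ones(1, 8)

drop.train()                   # training mode: units are dropped and the rest
print(drop(x))                 # are rescaled by 1/(1-p) to keep the expectation

drop.eval()                    # test mode: dropout is disabled
print(drop(x))                 # the input passes through unchanged
```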
7.5 Visualizing CNNs
As already mentioned, hierarchical features arise in the brain during visual information processing. A similar phenomenon can be observed in a convolutional neural network once it is properly trained. In particular, VGGNet provides very intuitive information that is well correlated with the visual information processing in the brain.
For example, Fig. 7.15 illustrates the input signal that maximizes the filter
response at specific channels and layers of VGGNet [31]. Remember that the filters are of size 3 × 3, so rather than visualizing the filters themselves, an input image that maximally activates the filter is displayed for specific channels and layers. In fact, this is similar to the Hubel and Wiesel experiments, where they analyzed the input image that maximizes the neuronal activation.
Fig. 7.15 Input images that maximize filter responses at specific channels and layers of VGGNet
Figure 7.15 shows that at the earlier layers the input signal maximizing filter
response is composed of directional edges similar to the Hubel and Wiesel
experiment. As we go deeper into the network, the filters build on each other and
learn to code more complex patterns. Interestingly, the input images that maximize
the filter response get more complicated as the depth of the layer increases. In one of the filter sets, we can see several objects in different orientations, since the particular position of an object in the picture is not important as long as it appears somewhere that activates the filter. Because of this, the filter tries to identify the object in multiple positions by encoding it at multiple places in the filter.
Finally, the blue box in Fig. 7.15 shows the input images that maximize the response of specific classes at the last softmax layer. In fact, this corresponds to the visualization of the input images that maximize the class scores. In a certain
category, an object is displayed several times in the images. The emergence of the
hierarchical feature from simple edges to the high-level concept is similar to visual
information processing in the brain.
Finally, Fig. 7.16 visualizes the feature maps at different layers of VGGNet for a cat picture. Since the output of a convolution layer is a 3D volume, we only visualize some of the channel images. As can be seen from Fig. 7.16, the feature maps evolve from edge-like features of the cat to lower-resolution information that describes the location of the cat. At the later layers, the feature map resembles a probability map of where the cat is located.
Fig. 7.16 Visualization of feature maps at several channels and layers of VGGNets when the input
image is a cat
The CNN is the most widely used neural network architecture in the age of modern
AI. Similar to the visual information processing in the brain, the CNN filters are
trained in such a way that hierarchical features can be captured effectively. This can
be one of the reasons for CNN’s success with many image classification problems,
low-level image processing problems, and so on.
In addition to commercial applications in unmanned vehicles, smartphones,
commercial electronics, etc., another important application is in the field of
medical imaging. CNN has been successfully used for disease diagnosis, image
segmentation and registration, image reconstruction, etc.
For example, Fig. 7.17 shows a segmentation network architecture for cancer
segmentation. Here, the label is the binary mask for cancer, and the backbone CNN
is based on the U-Net architecture, where there exists a softmax layer at the end
for pixel-wise classification. Then, the network is trained to classify the background
and the cancer regions. A very similar architecture can also be used for noise removal in low-dose CT images, as shown in Fig. 7.18. Instead of using the softmax layer, the network is trained with an \ell_1 or \ell_2 regression loss using the high-quality, low-
noise images as a reference. In fact, one of the amazing and also mysterious parts of
deep learning is that a similar architecture works for different problems simply by
changing the training data.
Because of this simplicity in designing and training CNNs, there are many
exciting new startups targeting novel medical applications of AI. As the importance
of global health care increases with the COVID-19 pandemic, medical imaging
and general health care are undoubtedly among the most important areas of AI.
Therefore, for the application of AI to health, opportunities are so numerous that
we need many young, bright researchers who can invest their time and effort in AI
research to improve human health care.
7.7 Exercises
1. Consider the VGGNet in Fig. 7.2. In its original implementation, the convolution
kernel was 3 × 3.
a. What is the total number of convolution filter sets in VGGNet?
b. Then, what is the total number of trainable parameters in VGGNet including
convolution filters and fully connected layers? (Hint: for the fully connected
\hat{p}(\Theta) = \frac{e^{f_\Theta(x)}}{\mathbf{1}^\top e^{f_\Theta(x)}}, (7.10)

c(\Theta) = -E\left[ \sum_{i=1}^{10} y_i \log \hat{p}_i(\Theta) \right], (7.11)
For u, v ∈ R^n, define their circular convolution by
(u ⊛ v)[n] = \sum_{i=0}^{n-1} u[n − i] v[i],
where the periodic boundary condition is assumed. Now, for any vectors x ∈ R^{n_1} and y ∈ R^{n_2} with n_1, n_2 ≤ n, define their circular convolution in R^n:
x ⊛ y = x^0 ⊛ y^0,
where x^0 = [x^\top, \mathbf{0}_{n-n_1}^\top]^\top and y^0 = [y^\top, \mathbf{0}_{n-n_2}^\top]^\top. Finally, for any v ∈ R^{n_1} with n_1 ≤ n, define the flip \bar{v}[n] = v^0[−n].
a. For an input signal x ∈ R^n and a filter ψ ∈ R^n, show that
u^\top F v = u^\top (f ⊛ \bar{v}) = f^\top (u ⊛ v) = \langle f, u ⊛ v \rangle, (7.15)
where F = H^r_n(f).
d. Let the multi-input single-output (MISO) circular convolution for the p-channel input Z = [z_1, · · · , z_p] ∈ R^{n×p} and the output y ∈ R^n be defined by
y = \sum_{j=1}^{p} z_j ⊛ \bar{ψ}^j, (7.16)
where
Ψ = \begin{bmatrix} ψ^1 \\ \vdots \\ ψ^p \end{bmatrix}
and
H^r_{n|p}(Z) := \begin{bmatrix} H^r_n(z_1) & H^r_n(z_2) & \cdots & H^r_n(z_p) \end{bmatrix}. (7.18)
Then, show that (7.16) can be represented in a matrix form by y = H^r_{n|p}(Z) Ψ.
e. Similarly, let the multi-input multi-output (MIMO) circular convolution be defined by
y_i = \sum_{j=1}^{p} z_j ⊛ \bar{ψ}^{i,j}, \quad i = 1, · · · , q, (7.19)
where p and q are the number of input and output channels, respectively; ψ^{i,j} ∈ R^r denotes an r-dimensional vector and \bar{ψ}^{i,j} ∈ R^n refers to its flip. Then, show that (7.19) can be represented in a matrix form by
Y = \sum_{j=1}^{p} H^r_n(z_j) Ψ_j = H^r_{n|p}(Z) Ψ,
where
Ψ = \begin{bmatrix} Ψ_1 \\ \vdots \\ Ψ_p \end{bmatrix}, \quad \text{with} \quad Ψ_j = \begin{bmatrix} ψ^{1,j} & \cdots & ψ^{q,j} \end{bmatrix}.
f. Now, suppose that a 1 × 1 convolution is additionally applied across the input channels, i.e.
y_i = \sum_{j=1}^{p} w_j z_j ⊛ \bar{ψ}^{i,j}, \quad i = 1, · · · , q, (7.20)
where w_j denotes the j-th element of the 1 × 1 convolution filter weights. Show that this can be represented in a matrix form by
Y = \sum_{j=1}^{p} w_j H^r_n(z_j) Ψ_j = H^r_{n|p}(Z) Ψ_w, (7.21)
where
Ψ_w = \begin{bmatrix} w_1 Ψ_1 \\ \vdots \\ w_p Ψ_p \end{bmatrix}. (7.22)
Chapter 8
Graph Neural Networks
8.1 Introduction
Many important real-world data sets are available in the form of graphs or networks:
social networks, world-wide web (WWW), protein-interaction networks, brain
networks, molecule networks, etc. See some examples in Fig. 8.1. In fact, the
complex interaction in real systems can be described by different forms of graphs,
so that graphs can be a ubiquitous tool for representing complex systems.
A graph consists of nodes and edges as shown in Fig. 8.2. Although it looks
simple, the main technical problem is that the number of nodes and edges in
many interesting real-world problems is very large, and cannot be traced by simple
inspection. Accordingly, people are interested in different forms of machine learning approaches to extract useful information from graphs.
With a machine learning tool, for example, a node classification can be carried
out in which different labels are assigned to each node in a complex graph. This
could be used to classify the function of proteins in the interaction network (see
Fig. 8.3a). Link analysis is another important problem in graph machine learning,
which is about finding missing links between nodes. As shown in Fig. 8.3b, link
analysis can be used for repurposing drugs for new types of pathogens or diseases.
Yet another important goal of graph analysis is community detection. For example,
one could identify a subnetwork that consists of disease proteins (see Fig. 8.3c).
Despite the wide range of possible applications, approaches to neural networks in
graphs are not as mature as other studies of neural networks for images, voices, etc.
This is because the processing and learning of graph data require new perspectives
on neural networks.
For example, as shown in Fig. 8.4, the basic assumption of convolutional neural
networks (CNNs) is that images have pixel values on regular grids, but graphs
have irregular node and edge structure so that the applications of basic modules
such as convolution, pooling, etc., are not easy. Another serious problem is that,
although CNN training data consists of images or their patches of the same size, the
training data of the graph neural network usually consists of graphs with different
numbers of nodes, network topology, and so on. For example, in graph neural network approaches for examining the toxicity of drug candidates, the chemicals in the training data set can have different numbers of atoms. This leads to the
fundamental question in the graph machine learning task: What do we learn from
the training data?
In fact, the main advantage of neural network approaches over other machine
learning approaches like compressed sensing [46] and low-rank matrix factorization
[47], etc. is that the neural network approaches are inductive, which means that the
trained neural network is not only applied to the data on which it was originally trained, but can also be applied to other data unseen during training.
Fig. 8.3 Several application goals of machine learning on graphs: (a) node classification, (b) link analysis, (c) community detection
Fig. 8.4 Difference between image domain CNN and graph neural network
However, given that each graph in training data is different in its structure (for
example, with different node and edge numbers and even topology), what kind of
inductive information can we get from the graph neural network training? Although
the universal approximation theorem [48] guarantees that neural networks can
approximate any nonlinear function, it is not even clear which nonlinear function
a graph neural network tries to approximate.
Hence the main aim of this chapter is to answer these puzzling questions. In fact,
we will focus on how machine learning researchers came up with brilliant ideas to
enable inductive learning independent of different graph structures in the training
phase.
8.2 Mathematical Preliminaries
Before we discuss graph neural networks, we review basic mathematical tools from
graph theory.
8.2.1 Definition
A graph can exist in different forms having the same number of vertices, edges,
and also the same edge connectivity. Such graphs are called isomorphic graphs.
Formally, two graphs G and H are said to be isomorphic if (1) their numbers
of components (vertices and edges) are equal, and (2) their edge connections are
identical. Some examples of isomorphic graphs are shown in Fig. 8.6.
Graph isomorphism is widely used in many areas where identifying similarities
between graphs is important. In these areas, the graph isomorphism problem is often
referred to as the graph matching problem. Some practical uses of graph isomor-
phism include identifying identical chemical compounds in different configurations,
checking equivalent circuits in electronic design, etc.
Unfortunately, testing graph isomorphism is not a trivial task. Even if the number
of nodes is the same, two isomorphic graphs, for example, can have different
adjacency matrices, since the order of the nodes in the isomorphic graph can be
arbitrary, but the structure of their adjacency matrices is critically determined by
the order of the nodes. In fact, the graph isomorphism problem is one of the few
standard problems whose complexity remains unsolved.
8.3 Related Works
Since each graph in the training data has a different configuration, the main
concern of machine learning of graphs is to assign latent vectors in the common
latent space to graphs, subgraphs, or nodes so that standard CNN, perceptron, etc.
can be applied to the latent space for inference or regression. This procedure is often
called graph embedding, as shown in Fig. 8.8. One of the most important research
topics in graph neural networks is to find an inductive rule for the graph embedding
that can be applied to graphs with a different number of nodes, topologies, etc.
Unfortunately, one of the difficulties associated with the graphs is that they are
unstructured. In fact, there is a lot of unstructured data that we encounter in everyday
life, and one of the most important classes of unstructured data is natural language.
Therefore, many graph machine learning techniques are borrowed from
natural language processing (NLP). So this section explains the key idea of natural
language processing.
Fig. 8.7 Node coloring example in a molecular system. (a) Initial coloring with feature vectors,
(b) its successive update using a machine learning approach
8.3.1 Word Embedding
Word embedding is one of the most popular representations for natural language
processing. Basically, it is a vector representation of a particular word that can
capture the context of a word in a document, its semantic and syntactic similarity,
its relationship to other words, and so on.
For example, consider the word “king”. From its semantic meaning, one could come to the following conclusion in the embedding space:
king − man + woman ≈ queen.
This concept is illustrated in Fig. 8.9. There are several ways to embed a word. The
main problem here is to represent each word in large text as a vector so that similar
words are close together in latent space.
Among the various ways of performing word embedding, the so-called word2vec
is one of the most frequently used methods [50, 51]. Word2vec is composed of
a two-layer neural network. The network is trained in two complementary ways:
continuous bag of words (CBOW) and skip-gram. The key idea of these approaches
is that there are significant causal relationships and redundancies between words in
natural languages, the information of which can be used to embed words in vector
space. In the following, we describe them in detail.
8.3.1.1 CBOW
CBOW begins with the assumption that a missing word can be found from its
surrounding words in the sentence. For example, consider a sentence: The big dog
is chasing the small rabbit. The idea of CBOW is that a target word in the sentence
(which is usually the center word), for example, “dog” as shown in Fig. 8.10, can be
estimated from the nearby words within the context window (for example, using
“big” and “is” for the case of context window size c = 1). In general, for a
given context window size c, the i-th word x i is assumed to be estimated using
the adjacent words within a window, i.e. {x j | j ∈ Ic (i)}, as shown in Fig. 8.10,
where I_c(i) denotes the index set of the words adjacent to the i-th word within the context window of size c.
Now, here comes the fun part. In CBOW, rather than directly estimating the word
x i , it employs an encoder-decoder structure as depicted in Fig. 8.11. Specifically, an
encoder, represented by the shared weight W , converts input x n into a corresponding
latent space vector, and then the decoder with the weight \widetilde{W} converts the latent vector into the estimate of the target word x̂_i.
Furthermore, one of the most important assumptions of CBOW is that the latent
vector of the missing word is represented as the average value of the latent vectors
of the adjacent words, i.e.
h_i = \frac{1}{2c-1} \sum_{k \in I_c(i)} W x_k. (8.4)
Specifically, using the 2c − 1 input vectors and the shared encoder weight, we
generate 2c − 1 latent vectors, after which their average value is generated. Then,
the center word is estimated by decoding from the averaged latent vector with the
weight \widetilde{W}:
x̂_i = \widetilde{W} h_i. (8.5)
Note that other than the softmax unit in the network output, which will be explained
later, there are no non-linearities in the hidden layer of CBOW.
To start off, one should first build the corpus vocabulary, where we map each word in the vocabulary to a unique numeric identifier x_i. For example, if the corpus size is
M, then x i is an M-dimensional vector with one-hot vector encoding as shown in
Fig. 8.12. Once the neural network in CBOW is trained, the word embedding can be
simply done using the encoder part of the network.
The very strict assumption that the center word is similar to the average of the surrounding words in the latent space works amazingly well, and CBOW
is one of the most popular classical word embedding techniques [50, 51].
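A minimal NumPy sketch of the CBOW forward pass in (8.4)–(8.5) is given below; the vocabulary size, latent dimension, random weights, and context indices are toy assumptions, and the context latent vectors are simply averaged.

```python
import numpy as np

M, d = 10, 4                         # corpus size and latent dimension (toy values)
rng = np.random.default_rng(0)
W  = rng.normal(size=(d, M))         # encoder weight: one-hot word -> latent vector
Wt = rng.normal(size=(M, d))         # decoder weight (the W-tilde of the text)

def one_hot(i):
    x = np.zeros(M); x[i] = 1.0
    return x

context = [2, 5]                     # indices of the surrounding words
h = np.mean([W @ one_hot(k) for k in context], axis=0)   # averaged latent vector, cf. (8.4)
scores = Wt @ h                                           # decoder output, cf. (8.5)
p = np.exp(scores) / np.exp(scores).sum()                 # softmax over the vocabulary
print(p.argmax())                    # index of the predicted center word
```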
8.3.1.2 Skip-Gram
Skip-gram can be seen as a complementary idea of CBOW. The main idea behind
the skip-gram model is this: once the neural network is trained, the latent vector
generated by the focus word can predict every word in the window with high
probability. For example, Fig. 8.13 shows the example of how we extract the focus
word and the target word within different window sizes. Here the green word is the
focus word from which the target words in the window are estimated.
Similar to CBOW, the neural network training is carried out in the form of latent
vectors. In particular, the focus word encoded with a one-hot vector is converted to
a latent vector using an encoder with the weight W , and then the latent vector is
decoded via a parallel decoder network with the shared weight \widetilde{W}, as shown in Fig. 8.14. So the basic assumption of skip-gram can be written as
x_j ≃ \widetilde{W} h_i, \quad ∀j ∈ I_c(i), (8.6)
where
h_i = W x_i. (8.7)
Again, there are no non-linearities in the hidden layer of skip-gram other than the
softmax unit in the network output.
The loss function for the neural network training in word2vec deserves further
discussion. Similar to the classification problem, the loss function is based on the
cross entropy between the target word and the generated word from the decoder.
where the latent vector hi is given by the average latent vector in (8.4). On the other
hand, the loss function for the skip-gram is given by
\ell_{skipgram}(W, \widetilde{W}) = -\log\left( \prod_{j \in I_c(i)} \frac{e^{\widetilde{w}_{t_j}^\top h_i}}{\sum_{k=1}^{M} e^{\widetilde{w}_{t_k}^\top h_i}} \right)
= -\sum_{j \in I_c(i)} \widetilde{w}_{t_j}^\top h_i + C \log \sum_{k=1}^{M} e^{\widetilde{w}_{t_k}^\top h_i}, (8.9)
where C denotes the number of words within the context window I_c(i).
8.4 Graph Embedding
8.4.1 Matrix Factorization Approaches
The main assumption of matrix factorization approaches for graph embedding is that
an adjacency matrix can be decomposed into low rank matrices. More specifically,
for a given adjacency matrix A ∈ RN ×N , its low rank matrix decomposition is to
find U , V ∈ RN ×d such that
A ≃ U V^\top, (8.10)
where d is the latent space dimension. Then, the i-th node embedding in the latent space R^d is given by
h_i = V^\top x_i ∈ R^d,
where x_i ∈ R^N denotes the one-hot vector for the i-th node.
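One common way to compute the low-rank factorization in (8.10) is a truncated SVD of the adjacency matrix, as in the NumPy sketch below; the small graph is a toy assumption.

```python
import numpy as np

# Toy adjacency matrix of a 5-node cycle graph (illustrative only).
A = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 1, 0]], dtype=float)

d = 2                                        # latent space dimension
U, s, Vt = np.linalg.svd(A)                  # A ~ U_d diag(s_d) V_d^T
V = Vt[:d].T * np.sqrt(s[:d])                # N x d factor, so A ~ U_factor @ V.T
U_factor = U[:, :d] * np.sqrt(s[:d])

# The i-th node embedding is the i-th row of V (equivalently V^T x_i for one-hot x_i).
print(V[0])                                  # embedding of node 0
print(np.linalg.norm(A - U_factor @ V.T))    # low-rank reconstruction error
```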
8.4.2 Random Walk Approaches
Random walk approaches for graph embedding are very closely related to word
embedding, in particular, word2vec [50, 51]. Here, we review two powerful random
walk approaches: DeepWalks [53] and node2vec [54].
8.4.2.1 DeepWalks
The main intuition of DeepWalks [53] is that random walks are comparable to
sentences in the word2vec approach so that word2vec can be used for embedding
each node of a graph. More specifically, as depicted in Fig. 8.16, the method
basically consists of three steps:
• Sampling: A graph is sampled with random walks. A few random walks of a specific length are performed from each node.
• Training skip-gram: The skip-gram network is trained by taking nodes from the random walks, encoded as one-hot vectors, as inputs and targets.
• Node embedding: From the encoder part of the trained skip-gram, each node in
a graph is embedded into a vector in the latent space.
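The sampling step can be sketched in a few lines of Python; the toy adjacency list below is an assumption, and the sampled walks would then play the role of sentences for skip-gram training.

```python
import random

# Toy graph as an adjacency list (illustrative only).
graph = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}

def random_walk(graph, start, length):
    """Sample one random walk; each walk plays the role of a 'sentence'."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

# A few walks per node; the resulting node sequences would then be fed to a
# skip-gram model exactly as word sequences are in word2vec.
walks = [random_walk(graph, v, length=6) for v in graph for _ in range(3)]
print(walks[0])
```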
8.4.2.2 Node2vec
Recently, there has been significant progress and growing interest in graph neural
networks (GNNs), which comprise graph operations performed by deep neural net-
works. For example, spectral graph convolution approaches [55], graph convolution
network (GCN) [56], graph isomorphism network (GIN) [57], and GraphSAGE [58], to name just a few.
Although these approaches have been derived from different assumptions and
approximations, common GNNs typically integrate the features on each layer
in order to embed each node features into a predefined feature vector of the
next layer. The integration process is implemented by selecting suitable functions
for aggregating features of the neighborhood nodes. Since a level in the GNN
aggregates its 1-hop neighbors, each node feature is embedded with features in
its k-hop neighbor of the graph after k aggregation layers. These features are then
extracted by applying a readout function to obtain a nodal embedding.
Specifically, let x_v^{(t)} denote the t-th iteration feature vector at the v-th node. Then, this graph operation is generally composed of the AGGREGATE and COMBINE functions:
a_v^{(t)} = AGGREGATE\left( \{\{ x_u^{(t-1)} : u ∈ N(v) \}\} \right),
x_v^{(t)} = COMBINE\left( x_v^{(t-1)}, a_v^{(t)} \right),
where one of the most common choices for the aggregation is the simple sum over the neighborhood:
a_v^{(t)} = \sum_{u ∈ N(v)} x_u^{(t-1)}. (8.11)
Although this sum operation is one of the most popular approaches in GNNs, we
can consider a more general form of the operation with desirable properties. This is
the main topic in the following section.
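A minimal Python sketch of one AGGREGATE/COMBINE layer with the sum aggregation of (8.11) is shown below; the toy graph, the feature dimension, and the particular COMBINE choice (a sum followed by tanh) are illustrative assumptions.

```python
import numpy as np

# Toy graph (adjacency list) and initial node features (illustrative values).
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
x = {v: np.random.randn(4) for v in neighbors}    # 4-dimensional node features

def message_passing_step(x, neighbors):
    """One layer of sum-AGGREGATE followed by a simple COMBINE, cf. (8.11)."""
    new_x = {}
    for v in neighbors:
        a_v = sum(x[u] for u in neighbors[v])      # AGGREGATE: sum over N(v)
        new_x[v] = np.tanh(x[v] + a_v)             # COMBINE: here a sum + nonlinearity
    return new_x

x1 = message_passing_step(x, neighbors)            # features now see 1-hop neighbors
x2 = message_passing_step(x1, neighbors)           # after 2 layers: 2-hop information
```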
Compared to the matrix factorization and random walks approaches, the success
of graph embedding using neural networks appears mysterious. This is because in
order to be a valid embedding, the semantically similar input should be closely
located in the latent space, but it is not clear whether the graph neural network
produces such behaviors.
For the case of matrix factorization, the embedding transform is obtained from
the assumption that the latent vector should live in the low-dimensional subspace.
For the case of random walks, the underlying intuition for the embedding is similar
to that of word2vec. Therefore, these approaches are guaranteed to retain semantic
information in the latent space. Then, how do we know that the neural-network-
based graph embedding also conveys the semantic information?
This understanding is particularly important because a GNN algorithm is usually
designed as an empirical algorithm and not based on the top-down principle in
order to achieve the desired embedding properties. Recently, a number of authors
[57, 59–62] have shown that the GNN is indeed a neural network implementation of the Weisfeiler–Lehman (WL) graph isomorphism test [63]. This implies that if the embedding vectors of a GNN are distinct from each other, then the corresponding graphs are not isomorphic. Therefore, GNNs may retain useful semantic information
during the embedding. In this section, we review this exciting discovery in more
detail.
The basic idea of the Weisfeiler–Lehman (WL) graph isomorphism test is to find a signature for each node in each graph based on
the neighborhood around the node. These signatures can then be used to find the
correspondence between nodes in the two graphs. Specifically, if the signatures of
two graphs are not equivalent, then the graphs are definitively not isomorphic.
We now describe the WL algorithm formally. For a given colored graph G, the WL algorithm computes a node coloring c_v^{(t)} : V(G) → Σ, where Σ denotes the set of colors, depending on the coloring from the previous iteration. To iterate the algorithm, we assign each node a tuple that contains the old compressed label (or color) of the node and a multiset of the compressed labels (colors) of the neighbors of the node:
m_v^{(t)} = \left( c_v^{(t)}, \{\{ c_u^{(t)} \,|\, u ∈ N(v) \}\} \right), (8.12)
where \{\{·\}\} denotes the multiset, which is a set (a collection of elements where order is not important) in which elements may appear more than once. Then, HASH(·) bijectively assigns the above pair to a unique compressed label that was not used in previous iterations:
c_v^{(t+1)} = HASH\left( m_v^{(t)} \right). (8.13)
If the number of colors does not change between two iterations, then the algorithm
ends. This procedure is illustrated in Fig. 8.19.
To test two graphs G and H for isomorphism, we run the above algorithm
in “parallel” on both graphs. If the two graphs have different numbers of nodes with a given color produced by the WL algorithm, it is concluded that the graphs are not isomorphic. In the algorithm described above, the “compressed labels” serve as
signatures. However, it is possible that two non-isomorphic graphs have the same
signatures, so this test alone cannot provide conclusive evidence that two graphs are
isomorphic. However, it has been shown that the WL test can be successful in the
graph isomorphism test with a high degree of probability. This is the main reason
the WL test is so important [63].
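A minimal Python sketch of the WL refinement (8.12)–(8.13), run “in parallel” on two toy graphs with a shared hash table, is given below; the graphs and the number of refinement rounds are illustrative assumptions.

```python
from collections import Counter

def wl_test(neighbors_g, neighbors_h, rounds=3):
    """Run the WL refinement of (8.12)-(8.13) on two graphs 'in parallel'.

    Returns False if the color histograms differ (the graphs are certainly not
    isomorphic); True means the test is inconclusive (possibly isomorphic).
    """
    graphs = [neighbors_g, neighbors_h]
    colors = [{v: 0 for v in g} for g in graphs]          # initial uniform coloring
    for _ in range(rounds):
        # Signature of a node: (its color, multiset of neighbor colors), cf. (8.12).
        sigs = [{v: (c[v], tuple(sorted(c[u] for u in g[v]))) for v in g}
                for g, c in zip(graphs, colors)]
        # Shared HASH table so that equal signatures get equal labels in both graphs.
        table = {s: i for i, s in
                 enumerate(sorted(set(sigs[0].values()) | set(sigs[1].values())))}
        colors = [{v: table[sg[v]] for v in g} for g, sg in zip(graphs, sigs)]
    return Counter(colors[0].values()) == Counter(colors[1].values())

g1 = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}         # triangle with a pendant node
g2 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}               # a path on four nodes
print(wl_test(g1, g2))                                     # False: not isomorphic
```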
Recall that a GNN computes a sequence \{x_v^{(t)}\}_{v∈V}, t ≥ 0, of vector embeddings of a graph G = (V, E). In the most general form, the embedding is recursively computed as
a_v^{(t)} = AGGREGATE\left( \{\{ x_u^{(t-1)} : u ∈ N(v) \}\} \right), (8.14)
where \{\{·\}\} is the multiset and the aggregation function is symmetric in its arguments, and the updated feature vector is given by
x_v^{(t)} = COMBINE\left( x_v^{(t-1)}, a_v^{(t)} \right). (8.15)
From (8.14) and (8.15) in comparison with (8.12) and (8.13), if we identify x_v^{(t)} as the coloring at the t-th iteration, i.e. c_v^{(t)}, then we can see that there are
remarkable similarities between GNN updates and the WL algorithm in terms of
their arguments, which are made up of multiset neighborhoods and the previous
node. In fact, these are not incidental findings; there is a fundamental equivalence
between them.
For example, in graph convolutional neural networks (GCNs) [56] and graph-
SAGE [58], the AGGREGATE function is given by an average operation, whereas it is just a simple sum in the graph isomorphism network (GIN) [57]. One could use the element-by-element max operation as the AGGREGATE function, or even a long short-term memory (LSTM) can be used [58]. Similarly, a simple sum followed by a multilayer perceptron (MLP) can be used as the COMBINE function, or a weighted sum or concatenation followed by an MLP could be used [58, 59]. In
general, the GNN operation can be represented by
(t) (t)
x (t+1)
v = σ W 1 x (t)
v + W 2 x (t)
u , (8.16)
u∈N(v)
implementation of the WL algorithm for the graph isomorphism test, and the way
GNNs produce node embedding is to map the graph to a signature that can be used
to test the graph matching.
So far we have discussed the graph neural network approach as a modern method of performing graph embedding. The most important finding is that the GNN is actually a neural network implementation of the WL test. Therefore, the GNN fulfills an important property of embedding: if two feature vectors in the latent space are different, the underlying graphs are different.
Graph embedding with GNNs is by no means complete. In order to get a really meaningful graph embedding, vector operations in the latent space should have the same semantic meaning as in the original graph, similar to the case of word embedding. However, it is still not clear whether current GNN-based graph embeddings have such versatile properties.
Hence, the field of graph neural networks is still a wide open area of research, and the next level of breakthroughs will require many good ideas from young and enthusiastic researchers.
8.7 Exercises
1. Show that every connected graph with n vertices has at least n − 1 edges.
2. For the case of CBOW, recall that the target vector x i is also a one-hot encoded
vector. Let tk denote the nonzero index of the vocabulary vector x k . Then, show
that the loss function of CBOW can be written as a softmax function:
\ell_{CBOW}(W, \widetilde{W}) = -\log\left( \frac{e^{\widetilde{w}_{t_i}^\top h_i}}{\sum_{k=1}^{M} e^{\widetilde{w}_{t_k}^\top h_i}} \right)
= -\widetilde{w}_{t_i}^\top h_i + \log\left( \sum_{k=1}^{M} e^{\widetilde{w}_{t_k}^\top h_i} \right), (8.17)
where \widetilde{w}_{t} denotes the column of \widetilde{W} corresponding to the index t.
5. The GIN was proposed as a special case of spatial GNNs suitable for graph clas-
sification tasks. The network implements the aggregate and combine functions
as the sum of the node features:
x (k)
v = MLP
(k)
(1 + (k) ) · x (k−1)
v + x (k−1)
u , (8.18)
u∈N(v)
where (k) = 0.1, and MLP is a multilayer perceptron with ReLU nonlinear-
ity.
a. Draw the corresponding graph, whose adjacency matrix is given by
A = \begin{bmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}.
b. Then, obtain the next layer feature matrices X^{(1)} and X^{(2)}, assuming that there is no bias in each MLP.
Chapter 9
Normalization and Attention
9.1 Introduction
In this chapter, we will discuss very exciting and rapidly evolving technical fields of
deep learning: normalization and attention.
Normalization originated from the batch normalization technique [41] that
accelerates the convergence of stochastic gradient methods by reducing the covariate
shift. The idea has been extended further to various forms of normalization, such
as layer norm [64], instance norm [65], group norm [66], etc. In addition to the
original use of normalization for better convergence of stochastic gradients, adaptive
instance normalization (AdaIN) [67] is another example where the normalization
technique can be used as a simple but powerful tool for style transfer and generative
models.
On the other hand, attention has been introduced into computer vision applications based on the intuition that we “attend to” a particular part when processing a large amount of information [68–72]. Attention has played a key role in the recent
breakthroughs in natural language processing (NLP), such as Transformer [73],
Google’s Bidirectional Encoder Representations from Transformers (BERT) [74],
OpenAI’s Generative Pre-trained Transformer (GPT)-2 [75] and GPT-3 [76], etc.
For beginners, the normalization and attention mechanisms look very heuristic
without any clue for systematic understanding, which is even more confusing due
to their similarities. In addition, understanding AdaIN, Transformer, BERT, and
GPT is like reading recipes the researchers developed with their own secret sauces.
However, an in-depth study reveals a very nice mathematical structure behind their
intuition.
In this chapter, we first review classical and current state-of-the-art normalization
and attention techniques, and then discuss their specific realization in various
deep learning architectures, such as style transfer [77–83], multi-domain image
transfer [84–87], generative adversarial network (GAN) [71, 88, 89], Transformer
[73], BERT [74], and GPT [75, 76]. Then, we conclude by providing a unified
mathematical view to understand both normalization and attention.
9.1.1 Notation
In deep neural networks, a feature map is defined as a filter output at each layer. For
example, feature maps from VGGNet are shown in Fig. 9.1, where the input image
is a cat. Since there exist multiple channels at each layer, the feature map is indeed
a 3D volume. Moreover, during the training, multiple 3D feature maps are obtained
from a mini-batch.
Fig. 9.1 Examples of feature maps on one channel of each layer of VGGNet
To make the notation simple for mathematical analysis, in this chapter a feature
map for each channel is vectorized. Moreover, we often ignore the layer-dependent
indices in the features. Specifically, the feature map on a layer is represented by
X = \begin{bmatrix} x_1 & \cdots & x_C \end{bmatrix} ∈ R^{HW×C}, (9.1)
where x_c ∈ R^{HW} denotes the vectorized feature map of the c-th channel; the i-th row of X, denoted by x^i ∈ R^{1×C}, represents the channel-dimensional feature at the i-th pixel location.
9.2 Normalization
Batch normalization was originally proposed to reduce the internal covariate shift
and improve the speed, performance, and stability of artificial neural networks.
During the training phase of the networks, the distribution of the input on the
current layer changes accordingly if the distribution of the feature on the previous
layers changes, so that the current layer has to be constantly adapted to new
distributions. This problem is particularly severe for deep networks because small
changes in shallower hidden layers are amplified as they propagate through the
network, causing a significant shift in deeper hidden layers. The method of batch
normalization is therefore proposed to reduce these undesirable shifts by recentering
and scaling.
\bar{μ}_c = \frac{1}{HW} E\left[ \mathbf{1}^\top x_c \right], (9.4)

\bar{σ}_c = \sqrt{ \frac{1}{HW} E\left[ \| x_c - \bar{μ}_c \mathbf{1} \|^2 \right] }, (9.5)
where the expectation E[·] is taken over the mini-batch. In matrix form, (9.3) can be
represented by
Y = XT + B, (9.6)
where
T = \operatorname{diag}\left( \frac{γ_1}{\bar{σ}_1}, \cdots, \frac{γ_C}{\bar{σ}_C} \right) ∈ R^{C×C}, (9.7)

B = \underbrace{\begin{bmatrix} \mathbf{1} & \cdots & \mathbf{1} \end{bmatrix}}_{C} \operatorname{diag}\left( β_1 - \frac{γ_1 \bar{μ}_1}{\bar{σ}_1}, \cdots, β_C - \frac{γ_C \bar{μ}_C}{\bar{σ}_C} \right).
Batch normalization is a powerful tool, but not without its limitations. The main
limitation of batch normalization is that it depends on the mini-batch when
calculating (9.4) and (9.5). Then, how can we mitigate the problem of batch
normalization?
To understand this question, let us look into the volume of the feature maps that
are stacked along the mini-batch in Fig. 9.4. The left column of Fig. 9.4 shows the
normalization operation in batch norm, whereby the shadow area is used to calculate
the mean and standard deviation for centering and rescaling. Here, B denotes the
size of the mini-batch.
Fig. 9.4 Various forms of feature normalization methods. B: batch size, C: number of channels,
and H, W : height and width of the feature maps
In fact, the picture of batch norm shows that there are several normalization
options. For example, the layer normalization [64] computes the mean and standard
deviation along the channel and image direction without considering the mini-batch.
More specifically, we have
y_c = \frac{γ}{σ}(x_c − μ\mathbf{1}) + β\mathbf{1}, (9.8)
for all c = 1, · · · , C. Here, γ and β are channel-independent trainable parameters,
while μ and σ are computed by
μ = \frac{1}{HWC} \sum_{c=1}^{C} \mathbf{1}^\top x_c, (9.9)

σ = \sqrt{ \frac{1}{HWC} \sum_{c=1}^{C} \| x_c − μ\mathbf{1} \|^2 }. (9.10)
In the layer normalization, each sample within the mini-batch has a different normal-
ization operation, allowing arbitrary mini-batch sizes to be used. The experimental
results show that layer normalization performs well for recurrent neural networks
[64].
On the other hand, the instance normalization normalizes the feature data for each
sample and channel as shown on the right-hand side of Fig. 9.4. More specifically,
we have
y_c = \frac{γ_c}{σ_c}(x_c − μ_c\mathbf{1}) + β_c\mathbf{1}, (9.11)
where
μ_c = \frac{1}{HW} \mathbf{1}^\top x_c, (9.12)

σ_c = \sqrt{ \frac{1}{HW} \| x_c − μ_c\mathbf{1} \|^2 }, (9.13)
whereas γc and βc are trainable parameters for the channel c. In matrix form, (9.11)
can be represented by
Y = XT + B, (9.14)
where T and B are similar to (9.7) but calculated for each sample.
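A minimal NumPy sketch of instance normalization, following (9.11)–(9.13) for a single sample with the feature map arranged as in (9.1), is shown below; the feature sizes are toy assumptions.

```python
import numpy as np

def instance_norm(X, gamma, beta, eps=1e-5):
    """Instance normalization of (9.11)-(9.13) for one sample.

    X     : feature map of shape (HW, C), one column per channel
    gamma : per-channel scale, shape (C,)
    beta  : per-channel shift, shape (C,)
    """
    mu = X.mean(axis=0, keepdims=True)                                  # mu_c, Eq. (9.12)
    sigma = np.sqrt(((X - mu) ** 2).mean(axis=0, keepdims=True) + eps)  # sigma_c, Eq. (9.13)
    return gamma * (X - mu) / sigma + beta                              # Eq. (9.11)

HW, C = 16, 3                                           # toy sizes
X = np.random.randn(HW, C)
Y = instance_norm(X, gamma=np.ones(C), beta=np.zeros(C))
print(Y.mean(axis=0).round(6), Y.std(axis=0).round(3))  # ~0 mean, ~1 std per channel
```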
With AdaIN [67], a new chapter of normalization methods was opened, going beyond the classical normalization methods that were designed to improve performance and reduce the dependency on the learning rate. The most important
finding of AdaIN is that the instance normalization transformation in (9.11) provides
an important hint for the style transfer.
Before we discuss the details of AdaIN, we first explain the concept of image
style transfer. Figure 9.5 shows an example of image style transfer using AdaIN
[67]. Here, the top row shows the content images associated with the content feature
X = [x 1 , · · · , x C ], while the left-most column corresponds to style images that are
associated with the style feature S = [s 1 , · · · , s C ]. The aim of the image style
transfer is then to convert the content images into a stylized image that is guided by
a certain style image. How does AdaIN manage the style transfer in this context?
The main idea is to use the instance normalization in (9.11), but instead of using
γc and βc that are calculated by its own feature, these values are calculated as the
standard deviation and the mean value of the style image, i.e.
β_c^s = \frac{1}{HW} \mathbf{1}^\top s_c, (9.15)

γ_c^s = \sqrt{ \frac{1}{HW} \| s_c − β_c^s\mathbf{1} \|^2 }, (9.16)
where s c is the c-th channel feature map from the style image. In matrix form,
AdaIN can be represented by
Y = X T_x T_s + B_{x,s}, (9.17)
The generation of the style feature map can be done with the same encoder, as
shown in Fig. 9.6, whereby both content and style images are given as inputs for the
VGG encoder for feature vector extraction, from which the AdaIN layer changes
the style using the AdaIN operation described above.
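The AdaIN operation can be sketched in a few NumPy lines: the content features are normalized with their own statistics and then re-scaled and shifted with the style statistics of (9.15)–(9.16). The feature sizes below are toy assumptions.

```python
import numpy as np

def adain(X, S, eps=1e-5):
    """AdaIN sketch: normalize content features X with their own statistics and
    re-scale/shift with the style statistics of S, cf. (9.11), (9.15), (9.16).

    X, S : content and style feature maps of shape (HW, C).
    """
    mu_x, sig_x = X.mean(0), X.std(0) + eps
    beta_s, gamma_s = S.mean(0), S.std(0)                # style mean (9.15) and std (9.16)
    return gamma_s * (X - mu_x) / sig_x + beta_s

X = np.random.randn(64, 8)                               # toy content features
S = 3.0 * np.random.randn(64, 8) + 1.0                   # toy style features
Y = adain(X, S)
print(Y.mean(0).round(2))                                # close to the style mean
print(Y.std(0).round(2))                                 # close to the style std
```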
The whitening and coloring transform (WCT) is another powerful method of image
style transfer [79], which is composed of a whitening transform followed by a
coloring transform. Mathematically, this can be written by
Y = X T_x T_s + B_{x,s}, (9.21)
where B_{x,s} is the same as (9.20), and the whitening transform T_x and the coloring transform T_s are computed from X and S, respectively:
T_x = U_x Σ_x^{-1/2} U_x^\top, \qquad T_s = U_s Σ_s^{1/2} U_s^\top, (9.22)
where the eigendecompositions are given by
X^\top X = U_x Σ_x U_x^\top, \qquad S^\top S = U_s Σ_s U_s^\top. (9.23)
Therefore, we can easily see that AdaIN is a special case of WCT, when the
covariance matrix is diagonal.
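A minimal NumPy sketch of the whitening and coloring transform (9.21)–(9.23) is shown below, assuming zero-centered feature maps (the mean handling through B_{x,s} is omitted) and toy feature sizes.

```python
import numpy as np

def wct(X, S, eps=1e-5):
    """Whitening-and-coloring transform sketch, cf. (9.21)-(9.23).

    X, S : zero-centered content and style feature maps of shape (HW, C).
    """
    # Whitening: remove the correlation structure of the content features.
    ex, Ux = np.linalg.eigh(X.T @ X + eps * np.eye(X.shape[1]))
    Tx = Ux @ np.diag(ex ** -0.5) @ Ux.T                 # T_x = U_x Sigma_x^{-1/2} U_x^T
    # Coloring: impose the correlation structure of the style features.
    es, Us = np.linalg.eigh(S.T @ S + eps * np.eye(S.shape[1]))
    Ts = Us @ np.diag(es ** 0.5) @ Us.T                  # T_s = U_s Sigma_s^{1/2} U_s^T
    return X @ Tx @ Ts

X = np.random.randn(100, 4); X -= X.mean(0)              # toy centered content features
S = np.random.randn(100, 4) @ np.diag([3, 1, 0.5, 2]); S -= S.mean(0)
Y = wct(X, S)
print(np.round(Y.T @ Y, 1))                              # approx equal to S.T @ S
```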
9.3 Attention
It is known that there are two types of neurotransmitter receptors: ionotropic and
metabotropic receptors [91]. Ionotropic receptors are transmembrane molecules
that can “open” or “close” a channel so that different types of ions can migrate
in and out of the cell, as shown in Fig. 9.7a. On the other hand, the activation of
the metabotropic receptors only indirectly influences the opening and closing of ion
channels. In particular, a receptor activates the G-protein as soon as a ligand binds to
the metabotropic receptor. Once activated, the G-protein itself goes on and activates
another molecule called a “secondary messenger”. The secondary messenger moves
until it binds to ion channels, located at different points on the membrane, and opens
them (see Fig. 9.7b). It is important to remember that metabotropic receptors do not
have ion channels and the binding of a ligand may or may not lead to the opening
of ion channels at different locations on the membrane.
Mathematically, this process can be modeled as follows. Let xn be the number
of neurotransmitters that bind to the n-th synapse. G-proteins generated at the n-th
synapse are proportional to the sensitivity of the metabotropic receptor, which is
denoted by kn . Then, the G-proteins generate the secondary messengers that bind to
the ion channel at the m-th synapse with the sensitivity of qm . Since the secondary
messengers are generated from metabotropic receptors at various synapses, the total
amount of ion influx from the m-th synapse is determined by the sum given by
y_m = q_m \sum_{n=1}^{N} k_n x_n, \quad m = 1, · · · , N, (9.24)
Fig. 9.7 Two different types of neurotransmitter receptors and their mechanisms
In matrix form, (9.24) can be represented by y = T x, where T := q k^\top. (9.25)
Note that the matrix T in (9.25) is a transform matrix from x to y. Indeed, the
transform matrix T is a rank-1 matrix. Accordingly, the output y is constrained to live in the one-dimensional subspace spanned by the column vector q, i.e. R(q), where R(·) denotes the
range space. This implies that the activation patterns in the neuron follow the ion
channel sensitivity patterns, q, while their magnitude is modulated by k.
This could explain another role for the metabotropic receptors. In particular,
metabotropic receptors act more for their prolonged activation than for a short-
term activation as in the case of ionotropic receptors, since the activation pattern
is determined by the ion channel distributions to which the secondary messengers
bind rather than by the specific location at which the original neurotransmitter is
released. Thus, the synergistic combination of the q and k determines the general
behavior of neuronal activation.
In (9.25), the vectors q and k are often referred to as query and key. It is
remarkable that even with the same key k, a totally different activation pattern can
be obtained by changing the query vector q. In fact, this is the core idea of the
attention mechanism. By decoupling the query and key, we can dynamically adapt
the neuronal activation patterns for our purpose. In the following, we review the
general form of the attention developed based on this concept.
In artificial neural networks, the model (9.24) is generalized for vector quantities.
Specifically, the row vector output at the m-th pixel y m ∈ RC is determined by the
vector version of query q m ∈ Rd , keys k n ∈ Rd , and values x n ∈ RC :
y^m = \sum_{n=1}^{N} a_{mn} x^n, (9.26)
where m = 1, · · · , N and
a_{mn} := \frac{ \exp\left( \text{score}(q^m, k^n) \right) }{ \sum_{n'=1}^{N} \exp\left( \text{score}(q^m, k^{n'}) \right) }. (9.27)
Here, score(·, ·) determines the similarity between the two vectors. In matrix form,
(9.26) can be represented by
Y = AX, (9.28)
where
X = \begin{bmatrix} x^1 \\ \vdots \\ x^N \end{bmatrix}, \qquad Y = \begin{bmatrix} y^1 \\ \vdots \\ y^N \end{bmatrix}, (9.29)
and
A = \begin{bmatrix} a_{11} & \cdots & a_{1N} \\ \vdots & \ddots & \vdots \\ a_{N1} & \cdots & a_{NN} \end{bmatrix}. (9.30)
For example, in dot-product attention, the score is given by the inner product between the query and the key, and the query and key vectors are usually generated using linear embeddings. More specifically,
q^n = x^n W_Q, \qquad k^n = x^n W_K, \qquad n = 1, · · · , N, (9.31)
where W Q , W K ∈ RC×d are shared across all indices. Matrix form representation
of the query and key are then given by
Q = XW Q , K = XW K , (9.32)
Similarly, the value at the n-th pixel is generated by a linear embedding,
v^n = x^n W_V ∈ R^{d_v}, (9.34)
where W V ∈ RC×dv is the linear embedding matrix for the values. Then, attention
is computed by
y^m = \sum_{n=1}^{N} a_{mn} v^n, (9.35)
where
a_{mn} := \frac{ \exp\left( \langle x^m W_Q, x^n W_K \rangle \right) }{ \sum_{n'=1}^{N} \exp\left( \langle x^m W_Q, x^{n'} W_K \rangle \right) }, (9.36)
so that, in matrix form, the attention output is given by
Y = A X W_V. (9.37)
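A minimal NumPy sketch of the dot-product attention in (9.32), (9.34), (9.36), and (9.37) is given below; the sizes and random embedding matrices are toy assumptions, and no scaling of the scores is applied, in line with (9.36).

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Dot-product self-attention sketch, cf. (9.32), (9.34), (9.36), (9.37).

    X : feature map of shape (N, C); each row is the feature at one pixel/token.
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # queries, keys, values
    scores = Q @ K.T                              # (N, N) matrix of <q^m, k^n>
    scores -= scores.max(axis=1, keepdims=True)   # numerical stabilization
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)             # row-wise softmax: the attention map
    return A @ V                                  # Y = A X W_V

N, C, d = 6, 16, 8                                # toy sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(N, C))
Y = self_attention(X, rng.normal(size=(C, d)), rng.normal(size=(C, d)), rng.normal(size=(C, d)))
print(Y.shape)                                    # (6, 8)
```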
In the squeeze-and-excitation type of channel attention, a 1 × C channel descriptor z is first obtained in the squeeze step by averaging the feature map over the spatial dimension:
z = \frac{1}{N} \mathbf{1}^\top X. (9.38)
At the excitation step, a 1 × C weight vector w is generated from z using a neural network F_\Theta which is parameterized by \Theta:
w = F_\Theta(z). (9.39)
9.4 Applications
9.4.1 StyleGAN
One of the most exciting developments in CVPR 2019 was the introduction of a
novel generative adversarial network (GAN) called StyleGAN from Nvidia [89]. As
shown in Fig. 9.9, StyleGAN can generate high-resolution images that were realistic
enough to shock the world.
Although generative models, specifically GANs, will be discussed later in
Chap. 13, we introduce StyleGAN here, as the main breakthrough of StyleGAN comes from AdaIN. The right-hand neural network of Fig. 9.10 generates the
latent codes used as the style image feature vector, while the left-hand network
generates the content feature vectors from random noise. The AdaIN layer then
combines the style features and the content features in order to generate more
realistic features at each resolution. In fact, this architecture is fundamentally different from the standard GAN architecture that we will review later, in which the fake image is generated by a content generator alone (for example, the one on the left). Through the synergistic combination with an additional style generator, StyleGAN successfully produces very realistic images.
Fig. 9.11 Architecture of self-attention GAN. Both key and query are generated by the input
features
9.4.2 Self-Attention GAN
In a self-attention GAN (SAGAN) [71], self-attention layers are added into the
GAN so that both the generator and the discriminator can better capture model
relationships between spatial regions (see Fig. 9.11). It should be remembered that
in convolutional neural networks, the size of the receptive field is limited by the size
of the filter. With this in mind, self-attention is a great way to learn the relationship
between a pixel and all other positions, even regions that are far apart so that global
dependencies can be easily grasped. Hence, a GAN endowed with self-attention is
expected to handle details better.
More specifically, let X ∈ RN ×C be the feature map with N pixels and C
channels, and x m ∈ RC denote the m-th row vector of X, which represents the
feature vector at the m-th pixel location. The query, key, and the value images are
then generated as follows:
q^m = x^m W_Q, \qquad k^m = x^m W_K, \qquad v^m = x^m W_V. (9.41)
Then, the self-attention output is computed by
Y = A V = A X W_V, (9.42)
where
V = \begin{bmatrix} v^1 \\ \vdots \\ v^N \end{bmatrix}, (9.43)
and the attention map is computed by
a_{mn} := \frac{ \exp\left( \langle q^m, k^n \rangle \right) }{ \sum_{n'=1}^{N} \exp\left( \langle q^m, k^{n'} \rangle \right) }. (9.44)
The final output of the self-attention layer is then obtained by another linear embedding,
O = Y W_O, (9.45)
In the graph attention network (GAT) [69], the main focus is on which neighboring nodes the network should attend to more in order to achieve a better embedding of the center node (Fig. 9.13). To incorporate the graph connectivity, the authors suggested
specific constraints on the query, key, and value vectors as follows:
q_v = x_v W, \qquad k_u = v_u = x_u W, \qquad u ∈ N(v). (9.46)
From this, the attentional coefficients between the nodes are calculated by
evu = score(q v , k u ),
where score(·) denotes the specific attention mechanism. To make the coefficient
easily comparable across different nodes, the coefficients are normalized by
α_{vu} = \frac{ \exp(e_{vu}) }{ \sum_{u' ∈ N(v)} \exp(e_{vu'}) }. (9.47)
Fig. 9.12 Attentional GAN architecture. Here, the query is generated by image regions, whereas the key is generated by sentence embedding
Then, the updated node feature is given by the attention-weighted combination of the neighboring node features:
x_v = σ\left( \sum_{u ∈ N(v)} α_{vu} x_u W \right). (9.48)
9.4.5 Transformer
Transformer is a deep machine learning model that was introduced in 2017 and
was originally used for natural language processing (NLP) [73]. In NLP, the
recurrent neural networks (RNN) such as Long Short-Term Memory (LSTM) [92]
had traditionally been used. In RNN, the data is processed in a sequential order
using the memory unit inside. Although Transformers are designed to process
ordered data sequences such as speech, unlike the RNN, Transformer processes the
entire sequence in parallel to reduce path lengths, making it easier to learn long-
distance dependencies in sequences. Since its inception, Transformer has become
the building block of most state-of-the-art architectures in NLP, resulting in the
development of famous state-of-the-art Bidirectional Encoder Representations from
Transformers (BERT) [74], Generative Pre-trained Transformer 3 (GPT-3) [76], etc.
As shown in Fig. 9.14, Transformer-based language translation consists of an
encoder and decoder architecture. The main idea of Transformer is the attention
mechanism discussed earlier. In particular, the essence of the query, key, and value
vectors in the attention mechanism is fully utilized so that the encoder can learn the
language embedding and the decoder performs the language translation.
In particular, sentences from the source language, for example English, are used on the
encoder side to learn how to embed each word in a sentence. In order to learn the long-range
dependency between the words within the sentence, a self-attention mechanism
is used in the encoder. Of course, self-attention alone is not enough to perform a
complicated language embedding task. Therefore, there are an additional residual
connection, a layer normalization, and a feedforward neural network, followed
by additional encoder blocks (see Fig. 9.15). Once trained, the Transformer
encoder generates the word embedding, which captures the structural role of each
word within the sentence.
In the decoder, these embedding vectors from the encoder are now used to
generate the key vectors, as shown in Figs. 9.14 and 9.16. This is combined with
the query vector that is generated from the target language, like French. This hybrid
combination then creates the attention map, which serves as the transformation
matrix of the words between the two languages by taking into account their
structural roles.
Fig. 9.16 Generation of key vectors for each decoder layer of Transformer
This position encoding vector is then added to the word embedding vector x n ∈ Rd
to obtain a position encoded word embedding vector:
x_n \leftarrow x_n + p_n,   (9.50)
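A small sketch of how the position-encoded embedding in (9.50) can be computed, assuming the standard sinusoidal positional encoding of [73] for (9.49); the embedding x and the dimension d = 10 are arbitrary illustrative choices.

```python
import numpy as np

def positional_encoding(n, d):
    """Sinusoidal position vector p_n of even dimension d (assumed form of (9.49))."""
    p = np.zeros(d)
    i = np.arange(d // 2)
    angles = n / (10000 ** (2 * i / d))
    p[0::2] = np.sin(angles)
    p[1::2] = np.cos(angles)
    return p

d = 10
x = np.random.default_rng(0).standard_normal(d)    # a word embedding x_n
for n in range(1, 4):
    x_pe = x + positional_encoding(n, d)           # x_n <- x_n + p_n, (9.50)
    print(n, x_pe.round(2))
```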
9.4.6 BERT
One of the latest milestones in NLP is the release of BERT (Bidirectional Encoder
Representations from Transformers) [74]. This release of BERT can even be seen as
the beginning of a new era in NLP. One of the unique features of BERT is that the
resulting structure is as regular as FPGA (Field Programmable Gate Array) chips, so
the BERT unit can be used for different purposes and languages by simply changing
the training scheme.
The main architecture of BERT is the cascaded connection of bidirectional
transformer encoder units, as shown in Fig. 9.17. Due to the use of the encoder-part
of the Transformer architecture, the number of input and output features remains
the same, while each feature vector dimension may be different. For example,
the input feature can be a one-hot coded word, the feature dimension of which
is determined by the size of the corpus vocabulary. The output may be the low
dimensional embedding that sums up the role of the word in context. The reason for
using the bidirectional Transformer encoder is based on the observation that people
can understand the sentence even if the order of the words within the sentence is
reversed. By considering the reverse order, the role of each word in context is better
summarized as an attention map, resulting in more efficient embedding of words.
Yet another beauty of BERT lies in the training. More specifically, as shown
in Fig. 9.18, BERT training consists of two steps: pre-training and fine-tuning. In
the pre-training step, the goal of the task is to guess the masked word within an
input sentence. Figure 9.19 shows a more detailed explanation of this masked word
estimation. Approximately 15% of the words in the input sentence from Wikipedia
are masked with a specific token (in this case, [MASK]), and the goal of the training
is to estimate the masked word from the embedded output in the same place. Since
the BERT output is just an embedded feature, we need an additional feedforward
neural network (FFNN) and a softmax layer to estimate the specific word. With this
additional network we can properly pre-train the BERT unit.
Once BERT pre-training is finished, the BERT unit is fine-tuned using supervised
learning tasks. For example, Fig. 9.20 shows a supervised learning task. Here, the
Fig. 9.20 Supervised learning task for BERT fine-tuning for the next sentence estimation
input for BERT consists of two sentences, separated by a special token [SEP]. The
goal of supervised learning is then to assess whether the second sentence is a correct
continuation of the first sentence. This decision is encoded in BERT Output 1, which
is then used as the input of a fully connected neural network, followed by a softmax
layer, to estimate whether the second sentence is indeed the next one. Since BERT
produces the same number of outputs as input tokens, the first token of the input
sequence should be a special placeholder token [CLS], whose output position provides
Output 1.
Another example of a supervised fine-tuning is the classification of whether the
sentence is spam or not, as shown in Fig. 9.21. In this case, only a single sentence is
used as the BERT input and Output 1 of BERT is used to classify whether the input
sentence is spam or not.
In fact, there are multiple ways of utilizing the BERT unit for supervised fine-tuning,
which is another important advantage of BERT [74].
Fig. 9.21 BERT fine-tuning using supervised learning to classify whether the input sentence is spam or not
Fig. 9.24 Difference between the self-attention in BERT and masked self-attention in GPT-3
decoder blocks, which consist of masked self-attention blocks with a width of 2048
tokens and a feedforward neural network (see Fig. 9.23). As shown in Fig. 9.24, the
masked self-attention calculates the attention matrix using the preceding words in a
sentence that can be used to estimate the next word.
To train the 175 billion weights, GPT-3 is trained with 499 billion tokens or
words. Sixty percent of the training data set comes from a filtered version of
Common Crawl consisting of 410 billion tokens. Other sources are 19 billion tokens
from WebText2, 12 billion tokens from Books1, 55 billion tokens from Books2, and
3 billion tokens from Wikipedia [76]. Nonetheless, the performance of GPT-3 can
be affected by the quality of the training data. For example, it was reported that
GPT-3 generates sexist, racist and other biased and negative language when it was
asked to discuss Jews, women, black people, and the Holocaust [95].
Inspired by the fact that the Transformer architecture has become the state of the art
for NLP, researchers have explored its application to computer vision. As
mentioned earlier, in computer vision, attention is usually applied in connection
with convolutional networks, so that certain components of convolutional networks
are replaced with attention while maintaining their overall structure. In [96], the
authors have shown that this dependence on CNNs is not necessary and a pure
transformer applied directly to sequences of image patches can work very well in
image classification tasks.
Their model, called Vision Transformer (ViT), is depicted in Fig. 9.25. To handle
2D images, the input image x is reshaped into a sequence of flattened 2D patches,
after which each patch is embedded into a D-dimensional vector using a trainable linear
projection. The Transformer uses a constant latent vector size D throughout all of its
layers. Position embeddings are added to the patch embeddings to retain positional
information. The resulting sequence of embedding vectors serves as input to the
encoder. In addition, a learnable [class] token is prepended to the sequence of
embedded patches, and its state at the output of the Transformer encoder
serves as the representation of the entire image. A classification head is attached during
both pre-training and fine-tuning so that the network is trained to produce an embedded
image representation that yields the best classification results.
The Transformer encoder in ViT consists of alternating layers of multi-headed
self-attention and MLP blocks. Layer norm and residual connections are applied
before and after every block, respectively. The MLP contains two layers with a
GELU non-linearity. Typically, ViT is trained on large data sets, and fine-tuned to
(smaller) downstream tasks. For this, we remove the pre-trained prediction head
and attach a zero-initialized D × K feedforward layer, where K is the number of
downstream classes.
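A minimal sketch of the ViT input pipeline described above: patch flattening, a trainable linear projection to dimension D, a prepended learnable [class] token, and additive position embeddings. All sizes and parameters are random illustrative placeholders rather than pre-trained ViT weights.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 32; C = 3; P = 8; D = 64                # image size, channels, patch size, latent dim
img = rng.standard_normal((H, W, C))

# split into (H/P)*(W/P) non-overlapping patches and flatten each to length P*P*C
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)        # shape (16, 192)

E = rng.standard_normal((P * P * C, D)) * 0.02  # trainable linear projection (random here)
tokens = patches @ E                            # patch embeddings, shape (16, D)

cls = rng.standard_normal((1, D)) * 0.02        # learnable [class] token
tokens = np.concatenate([cls, tokens], axis=0)  # prepend -> (17, D)

pos = rng.standard_normal(tokens.shape) * 0.02  # learnable position embeddings
tokens = tokens + pos                           # sequence fed to the Transformer encoder
print(tokens.shape)                             # (17, 64)
```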
Fig. 9.25 Model overview. We split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence
of vectors to a standard Transformer encoder. In order to perform classification, we use the standard approach of adding an extra learnable “classification token”
to the sequence
9.5 Mathematical Analysis of Normalization and Attention
The normalization techniques discussed in this chapter can be represented as

Y = XT + B,   (9.51)
where the channel-directional transform T and the bias B are learned from the
statistics of the feature maps. The only differences between instance normalization,
AdaIN, and WCT are their specific ways of estimating T and B. For example,
all elements of T are estimated from the input features in the case of instance
normalization, while they are estimated from the statistics on content and style
images in the case of AdaIN and WCT. The main difference between WCT, instance
normalization and AdaIN is that T is a densely populated matrix for the case of
WCT, while instance norm and AdaIN use a diagonal matrix.
On the other hand, the spatial attention can be represented by
Y = AX, (9.52)
where A is calculated from its own feature for the case of self-attention, or with the
help of other domain features for the case of cross-domain attention. Similarly, the
channel attention such as SENet can be computed as
Y = XT.   (9.53)

Combining these operations, normalization and attention can therefore be expressed in the unified form

Y = AXT + B.   (9.54)
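The unified view in (9.51)–(9.54) can be illustrated with a toy sketch in which AdaIN acts through a diagonal channel transform T and bias B estimated from channel statistics, spatial self-attention acts through a pixel-mixing matrix A, and their composition gives Y = AXT + B; the specific feature maps below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 4, 3                                   # pixels, channels
X = rng.standard_normal((N, C))               # content feature map
S = rng.standard_normal((N, C))               # style feature map

# AdaIN-type normalization: diagonal channel transform T and bias B, (9.51)
mu_x, sd_x = X.mean(0), X.std(0)
mu_s, sd_s = S.mean(0), S.std(0)
T = np.diag(sd_s / sd_x)                      # diagonal T (instance norm / AdaIN case)
B = np.tile(mu_s - mu_x * sd_s / sd_x, (N, 1))
Y_adain = X @ T + B

# Spatial self-attention: pixel-mixing matrix A, (9.52)
scores = X @ X.T
A = np.exp(scores) / np.exp(scores).sum(1, keepdims=True)
Y_attn = A @ X

# Combined form Y = A X T + B, (9.54)
Y = A @ X @ T + B
print(Y_adain.shape, Y_attn.shape, Y.shape)
```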
9.6 Exercises
1. Find the conditions when the WCT transform in (9.22) is reduced to AdaIN.
2. Let the feature map with the number of pixels H × W = 4 and the channels
C = 3 be given by
X = \begin{bmatrix} 1 & 2 & 3 \\ -1 & -3 & 0 \\ 5 & -2 & 1 \\ 0 & 0 & -5 \end{bmatrix}.   (9.55)
Fig. 9.27 (Top) 1024×1024 images generated by our method, trained using CelebA-HQ data set.
(Bottom) 512×512 images generated by our method, trained using AFHQ data set. (a) A source
image generated from arbitrary style and content code. (b) Samples with varying style codes and
fixed content code. (c) Samples generated with varying content codes and fixed style. (d) Samples
generated with both varying content and style codes
3. Additionally, suppose that the feature map for the style image is given by
S = \begin{bmatrix} 0 & 1 & 1 \\ -1 & -1 & 1 \\ 1 & 0 & 0 \\ -1 & 1 & 1 \end{bmatrix}.   (9.56)
a. For the given feature map in (9.55), perform the adaptive instance normaliza-
tion from X to the style of S.
b. For the given feature map in (9.55), perform the WCT style transfer from X
to the style of S.
4. Using the feature map in (9.55), we are interested in computing the self-attention
map. Let W Q and W k be the embedding matrices for the query and key,
respectively:
W^Q = \begin{bmatrix} 2 & 1 \\ 0 & \frac{1}{2} \\ 0 & 0 \end{bmatrix}, \quad
W^K = \begin{bmatrix} \frac{1}{3} & 0 \\ 1 & -1 \\ 10 & 5 \end{bmatrix}.   (9.57)
a. Using the dot product score function, compute the attention matrix A.
b. What is the attended feature map, i.e. Y = AX?
c. For the case of masked self-attention in GPT-3, compute the attention mask A
and attended feature map Y = AX.
5. For a given positional encoding in (9.49) for the Transformer with encoding
dimension d = 10, compute the positional encoding vector p n for n =
1, · · · , 10.
6. Explain the following sentence in detail: “BERT has encoder only structure,
while GPT-3 has decoder only architecture.”
7. For a given feature map X ∈ RN ×C , show that the feature map of styleGAN after
the application of AdaIN and noise is represented by
Y = XT + B. (9.58)
Y = AXT + B. (9.59)
Specify the structure of the matrices A, T and B, and their mathematical roles.
Part III
Advanced Topics in Deep Learning
– Michael Elad
Chapter 10
Geometry of Deep Neural Networks
10.1 Introduction
In this chapter, which is mathematically intensive, we will try to answer perhaps the
most important questions of machine learning: what does the deep neural network
learn? How does a deep neural network, especially a CNN, accomplish these goals?
The full answer to these basic questions is still a long way off. Here are some of
the insights we’ve obtained while traveling towards that destination. In particular,
we explain why the classic approaches to machine learning such as single-layer
perceptron or kernel machines are not enough to achieve the goal and why a modern
CNN turns out to be a promising tool.
Recall that at the early phase of the deep learning revolution, most of the
CNN architectures such as AlexNet, VGGNet, ResNet, etc., were mainly developed
for the classification tasks such as ImageNet challenges. Then, CNNs started to
be widely used for low-level computer vision problems such as image denoising
[90, 98], super-resolution [99, 100], segmentation [38], etc., which are considered as
regression tasks. In fact, classification and regression are the two most fundamental
tasks in machine learning, which can be unified under the umbrella of function
approximation. Recall that the representer theorem [15] says that a classifier design
or regression problem for a given training data set \{(x_i, y_i)\}_{i=1}^{n} can be addressed by
solving the following optimization problem:

\min_{f \in \mathcal{H}_k}\ \frac{1}{2}\|f\|_{\mathcal{H}}^2 + C \sum_{i=1}^{n} \ell\big(y_i, f(x_i)\big),   (10.1)
where \mathcal{H}_k denotes the reproducing kernel Hilbert space (RKHS) with the kernel
k(x, x'), \|\cdot\|_{\mathcal{H}} is the Hilbert space norm, and \ell(\cdot,\cdot) is the loss function. One of the
most important results of the representer theorem is that the minimizer f has the
following closed-form representation:

f(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x),   (10.2)
where {αi }ni=1 are learned parameters from the training data set. For example, if a
hinge function is used as a loss, the solution becomes a kernel SVM, whereas if an
l2 function is used as a loss, it becomes a kernel regression.
In general, the solution f(x) in (10.2) is a nonlinear function of the input x, since each
kernel k(x_i, \cdot) depends nonlinearly on its argument. This nonlinearity of
the kernel makes the expression in (10.2) more expressive, thereby generating a
wide variety of functions within the RKHS \mathcal{H}_k.
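As a concrete instance of (10.2), the following sketch fits a kernel ridge regression with a Gaussian kernel: the coefficients α_i are learned from the training data, and the prediction is the fixed expansion Σ_i α_i k(x_i, x). The bandwidth and the ridge parameter (which plays the role of the regularization in (10.1)) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-3, 3, 30)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(30)

def k(a, b, gamma=1.0):
    """Gaussian kernel matrix k(a_i, b_j)."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

lam = 1e-2                                    # ridge regularization (illustrative choice)
K = k(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)   # learned coefficients

x_test = np.linspace(-3, 3, 5)
f_test = k(x_test, x_train) @ alpha           # f(x) = sum_i alpha_i k(x_i, x), as in (10.2)
print(np.c_[x_test, f_test, np.sin(x_test)].round(3))
```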
That said, the expression in (10.2) still has fundamental limitations. First, the
RKHS \mathcal{H}_k is specified by choosing the kernel in a top-down manner, and to the best
of our knowledge, there is no way to automatically learn the kernel from the data. Second,
once the kernel machine is trained, the parameters \{\alpha_i\}_{i=1}^n are fixed, and it is not
possible to adjust them at the test phase. These drawbacks lead to fundamental
limitations in the expressivity of the kernel machine, i.e. its capability of
approximating arbitrary functions. Of course, one could increase the expressivity by
increasing the complexity of the learning machine, for example, by combining multiple
kernel machines. However, our goal is to achieve better expressivity for a given
complexity constraint, and in this sense the kernel machine has problems.
Given the limitations of the kernel machine, we can state the following desiderata—
the desired things that an ultimate learning machine should satisfy:
• Data-driven model: The function space that a learning machine can represent
should be learned from the data, rather than specified by a top-down mathemati-
cal model.
• Adaptive model: Even after the machine has learned, the learned model should
adapt to the given input data at the test phase.
• Expressive model: The expressivity of the model should increase faster than the
model complexity.
• Inductive model: The learned information from the training data should be used
at the test phase.
In the following, we review two classical approaches—single layer perceptron and
frame representation—and explain why these classical models failed to meet the
10.2 Case Studies 197
desiderata. Later we will show how the modern deep learning approaches have been
developed by overcoming the drawbacks of these classical approaches by exploiting
their inherent strengths.
f(x) = \sum_{i=1}^{d} v_i\, \varphi\big(w_i^\top x + b_i\big), \quad x \in \mathcal{X},   (10.3)
\min_{\Theta}\ \sum_{i=1}^{n} \ell\big(y_i, f(x_i)\big) + \lambda R(\Theta),   (10.4)
for all x ∈ X.
The theorem thus states that simple neural networks can represent a wide variety
of interesting functions when given appropriate parameters. In fact, the universal
approximation theorem was a blessing for classic machine learning: it promoted
research interest in the neural network as a powerful function approximator, but it
also turned out to be a curse for the development of machine learning by preventing
a deeper understanding of the role of deep neural networks.
More specifically, the theorem only guarantees the existence of d, the number
of neurons, but it does not specify how many neurons are required for a given
approximation error. Only recently have people realized that the depth matters, i.e.
there exists a function that a deep neural network can approximate but a shallow
neural network with the same number of parameters cannot [101–105]. In fact, these
modern theoretical studies have provided a theoretical foundation for the revival of
modern deep learning research.
When compared with the kernel machine (10.2), the pros and cons of the single-layer
perceptron in (10.3) can be easily understood. Specifically, \varphi(w_i^\top x + b_i) in
(10.3) works similarly to a kernel function k(x_i, x), and v_i in (10.3) is similar to the
weight parameter \alpha_i in (10.2). However, the nonlinear mapping in the perceptron,
i.e. \varphi(w_i^\top x + b_i), does not necessarily satisfy the positive semidefiniteness of a
kernel, thereby enlarging the class of approximable functions beyond the RKHS to a larger
function class in the Hilbert space. Therefore, there exists potential for improving the
expressivity. On the other hand, the weighting parameters v_i are still fixed once the
neural network is trained, which leads to limitations similar to those of the kernel
machines.
f = \sum_{i=1}^{m} a_i b_i.   (10.5)
Unlike the basis, which leads to the unique expansion, the frame is composed of
redundant basis vectors, which allows multiple representation. Frames can also be
extended to deal with function spaces, in which case the number of frame elements
is infinite. Formally, a set of functions
\Phi = [\phi_k]_{k \in \Gamma} = [\ \cdots\ \phi_{k-1}\ \phi_k\ \cdots\ ]

is called a frame of the Hilbert space H if it satisfies the following inequality:

\alpha \|f\|^2 \leq \sum_{k \in \Gamma} |\langle f, \phi_k \rangle|^2 \leq \beta \|f\|^2, \quad \forall f \in H,   (10.6)
where α, β > 0 are called the frame bounds. If α = β, then the frame is said to be
tight. In fact, a basis is a special case of tight frames.
By writing c_k := \langle f, \phi_k \rangle as the expansion coefficient with respect to the k-th
frame vector \phi_k and defining the frame coefficient vector

c = [c_k]_{k \in \Gamma} = \Phi^\top f,

the inequality (10.6) can be equivalently written as

\alpha \|f\|^2 \leq \|c\|^2 \leq \beta \|f\|^2, \quad \forall f \in H.   (10.7)
This implies that the energy of the expansion coefficients should be bounded by the
original signal energy, and for the case of the tight frame, the expansion coefficient
energy is the same as the original signal energy up to the scaling factor.
When the frame lower bound \alpha is nonzero, the original signal can be recovered from
the frame coefficient vector c = \Phi^\top f using the dual frame operator \tilde\Phi given by

\tilde\Phi = [\ \cdots\ \tilde\phi_{k-1}\ \tilde\phi_k\ \cdots\ ],   (10.8)

which satisfies the condition

\tilde\Phi \Phi^\top = I,   (10.9)

because we have

\hat f := \tilde\Phi c = \tilde\Phi \Phi^\top f = f,

or equivalently,

f = \sum_{k \in \Gamma} c_k \tilde\phi_k = \sum_{k \in \Gamma} \langle f, \phi_k \rangle\, \tilde\phi_k.   (10.10)
Note that (10.10) is a linear signal expansion, so it is not useful for machine
learning tasks. However, something more interesting occurs when it is combined
with a nonlinear regularization. For example, consider a regression problem to
estimate a noiseless signal from the noisy measurement y:
y = f + w, (10.11)
where \|\cdot\|_1 is the l_1 norm. Then the solution satisfies the following [106]:

\hat f = \sum_{k \in \Gamma} \rho_\lambda\big( \langle y, \phi_k \rangle \big)\, \tilde\phi_k,   (10.13)
In contrast to the kernel machine in (10.2), the basis pursuit using the frame
representation has several unique advantages. First, the function space that the basis
pursuit can generate is often larger than the RKHS from (10.2). In fact, this space is
often called the union of subspaces [108], which is a large subset of a Hilbert space.
Second, among the given frames, the choice of the active dual frame basis \tilde\phi_k is totally
data-dependent. Therefore, the basis pursuit representation is an adaptive model.
Moreover, the expansion coefficients \rho_\lambda(\langle y, \phi_k \rangle) of the basis pursuit are also totally
dependent on the input y, thereby generating a more diverse representation than the
kernel machine with fixed expansion coefficients.
Having said this, one of the most fundamental limitations of the basis pursuit
approach in (10.13) is that it is transductive, which does not allow inductive learning
from the training data. In general, the basis pursuit regression in (10.12) should be
solved for each data set, since the nonlinear thresholding function should be found
by an optimization method for each data set. Therefore, it is difficult to transfer the
learning from one data set to another.
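A minimal sketch of the adaptive shrinkage in (10.13), using an orthonormal DCT matrix as a simple tight frame (so that the dual frame equals the frame itself) and soft-thresholding as ρ_λ; the signal, noise level, and threshold are arbitrary illustrative choices.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; its rows play the role of the frame vectors phi_k."""
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    Phi = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    Phi[0] /= np.sqrt(n)
    Phi[1:] *= np.sqrt(2 / n)
    return Phi

def soft(c, lam):
    """Soft-thresholding, one common choice for rho_lambda."""
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)

rng = np.random.default_rng(0)
n = 64
t = np.arange(n)
f = np.cos(2 * np.pi * 3 * t / n)               # smooth ground-truth signal
y = f + 0.3 * rng.standard_normal(n)            # noisy measurement, as in (10.11)

Phi = dct_matrix(n)
c = Phi @ y                                     # frame coefficients <y, phi_k>
f_hat = Phi.T @ soft(c, lam=0.5)                # f_hat = sum_k rho_lambda(<y, phi_k>) phi_k, (10.13)

# the shrinkage typically reduces the reconstruction error compared to the raw measurement
print(np.linalg.norm(y - f).round(3), np.linalg.norm(f_hat - f).round(3))
```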
Before we dive into the convolutional neural network, here we briefly review the
theory of deep convolutional framelets [42], which is a linear frame expansion but
turns out to be an important stepping stone to understand the geometry of CNN. For
simplicity, we consider the 1-D version of the theory.
Under the periodic boundary condition, the convolution between two signals is given by

(x \circledast h)[i] = \sum_{k=0}^{n-1} x[i-k]\, h[k],   (10.14)
• For v \in R^{n_1} and w \in R^{n_2} with n_1, n_2 \leq n, the convolution is computed after
zero-padding:

v \circledast w = v^0 \circledast w^0, \quad \text{where} \quad
v^0 = \begin{bmatrix} v \\ 0_{n-n_1} \end{bmatrix}, \quad w^0 = \begin{bmatrix} w \\ 0_{n-n_2} \end{bmatrix}.

• For any v \in R^{n_1} with n_1 \leq n, define the flip of v as \bar v[n] = v^0[-n], where we
use the periodic boundary condition.
y[i] = (x \circledast \bar\psi)[i] = \sum_{k=0}^{n-1} x[i-k]\, \psi^0[-k].   (10.15)
Then, we can obtain the following key equality [109], whose proof is repeated here
for educational purposes:
Lemma 10.1 For a given f \in R^n, let H_r^n(f) \in R^{n \times r} denote the associated Hankel
matrix. Then, for any vectors u \in R^n and v \in R^r with r \leq n and Hankel matrix
F := H_r^n(f), we have

u^\top F v = u^\top (f \circledast \bar v) = f^\top (u \circledast v) = \langle f, u \circledast v \rangle,   (10.18)

Proof We have

f^\top (u \circledast v) = \sum_{i=0}^{n-1} f[i] \sum_{k=0}^{n-1} u[k]\, v^0[i-k]
= \sum_{k=0}^{n-1} u[k] \left( \sum_{i=0}^{n-1} f[i]\, v^0[i-k] \right)
= \sum_{k=0}^{n-1} u[k]\, (f \circledast \bar v)[k]
= u^\top (f \circledast \bar v). \qquad \square
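The identity (10.18) can be verified numerically. The sketch below assumes a wrap-around Hankel matrix with entries H[i, m] = f[(i + m) mod n], which is one convention consistent with the circular convolution (10.14) and the periodic flip; the exact Hankel definition used in [42, 109] is not repeated here, so this is only an illustrative check.

```python
import numpy as np

def circ_conv(x, h):
    """Circular convolution (x * h)[i] = sum_k x[(i-k) mod n] h[k], h zero-padded to length n."""
    n = len(x)
    h0 = np.concatenate([h, np.zeros(n - len(h))])
    return np.array([sum(x[(i - k) % n] * h0[k] for k in range(n)) for i in range(n)])

def hankel(f, r):
    """Wrap-around Hankel matrix with entries H[i, m] = f[(i + m) mod n] (assumed convention)."""
    n = len(f)
    return np.array([[f[(i + m) % n] for m in range(r)] for i in range(n)])

rng = np.random.default_rng(0)
n, r = 8, 3
f, u, v = rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(r)

v0 = np.concatenate([v, np.zeros(n - r)])        # zero-padded filter v^0
v_bar = v0[(-np.arange(n)) % n]                  # periodic flip of v

F = hankel(f, r)
print(np.isclose(u @ F @ v, u @ circ_conv(f, v_bar)),   # u^T F v = u^T (f * v_bar)
      np.isclose(u @ F @ v, f @ circ_conv(u, v)))       # u^T F v = f^T (u * v)
```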
Lemma 10.1 provides an important clue for the convolution framelet expansion.
Specifically, for a given signal f ∈ Rn , consider the following two sets of matrices,
\Phi, \tilde\Phi \in R^{n \times n} and \Psi, \tilde\Psi \in R^{r \times r}, such that they satisfy the following frame
condition [42]:

\tilde\Phi \Phi^\top = I_n, \quad \Psi \tilde\Psi^\top = I_r.   (10.19)

Then the Hankel matrix of f admits the decomposition

H_r^n(f) = \tilde\Phi \Phi^\top H_r^n(f)\, \Psi \tilde\Psi^\top = \tilde\Phi C \tilde\Psi^\top,   (10.20)

where the framelet coefficient matrix is given by

C := \Phi^\top H_r^n(f)\, \Psi,   (10.21)

whose (i, j)-th element is

c_{ij} = \phi_i^\top H_r^n(f)\, \psi_j = \langle f, \phi_i \circledast \psi_j \rangle,   (10.22)

where \phi_i and \psi_j denote the i-th and the j-th column vectors of \Phi and \Psi,
respectively, and the last equality of (10.22) comes from Lemma 10.1.
Now, we define an inverse Hankel operator H_r^{n(-)} : R^{n \times r} \to R^n such that for any
f \in R^n, the following equality is satisfied:

f = H_r^{n(-)}\big( H_r^n(f) \big).   (10.23)

Then, we have

H_r^{n(-)}\big( \tilde\Phi C \tilde\Psi^\top \big) = \frac{1}{r} \sum_{j=1}^{r} (\tilde\Phi c_j) \circledast \tilde\psi_j   (10.24)

= \frac{1}{r} \sum_{i,j} c_{ij}\, (\tilde\phi_i \circledast \tilde\psi_j).   (10.25)
Accordingly, we obtain the following convolution framelet expansion of the signal:

f = \frac{1}{r} \sum_{i,j} \langle f, \phi_i \circledast \psi_j \rangle\, \tilde\phi_i \circledast \tilde\psi_j.   (10.26)
This implies that \{\phi_i \circledast \psi_j\}_{i,j} constitutes a frame for R^n and \{\tilde\phi_i \circledast \tilde\psi_j\}_{i,j}
corresponds to its dual frame. Furthermore, for many interesting signals f in real
applications, the Hankel matrix Hnr (f ) has low-rank structures [110–112], which
makes the expansion coefficients cij nonzero only at small index sets. Therefore,
the convolution framelet expansion is a concise signal representation similar to the
wavelet frames [42, 109].
In the convolution framelet, the functions φ i , φ̃ i correspond to the global basis,
whereas ψ i , ψ̃ i are local basis functions. Therefore, by the convolution between
the global and local basis to generate a new frame basis, convolution framelets can
exploit both local and global structures of signals [42, 109], which is an important
advance in signal representation theory.
Here, the single-input multi-output (SIMO) filtering of f with the filters \{\psi_j\} is defined by

f \circledast \bar\Psi := \big[\, f \circledast \bar\psi_1\ \cdots\ f \circledast \bar\psi_r \,\big],   (10.28)

and the reconstruction in (10.25) can equivalently be written as

f = \frac{1}{r} \sum_{j=1}^{r} (\tilde\Phi c_j) \circledast \tilde\psi_j,   (10.29)
which shows the processing step of the framelet coefficient C at the decoder.
More specifically, we apply the global operation 2 to cj first, after which multi-
input single-output (MISO) convolution operation is performed to obtain the final
reconstruction.
In fact, the order of these signal processing operations is very similar to the
two-layer encoder–decoder architecture, as shown in Figs. 10.4 and 10.5. At the
encoder side, the SIMO convolution operation is performed first to generate multi-
channel feature maps, after which the global pooling operation is performed. At the
decoder side, the feature map is unpooled first, after which the MISO convolution
is performed. Therefore, we can easily see the important analogy: the convolution
framelet coefficients are similar to the feature maps in CNNs, and \Phi, \tilde\Phi work as
pooling and unpooling layers, respectively, whereas \Psi, \tilde\Psi correspond to the encoder
and decoder filters, respectively. This implies that the pooling operation defines the
global basis, whereas the convolution filters determine the local basis, and the CNN
tries to exploit both global and local structure of the signal.
Furthermore, by simply changing the global basis, we can obtain various network
architectures. For example, in Fig. 10.4, we use \Phi = \tilde\Phi = I_n, whereas we use the
Haar wavelet transform as global pooling for the case of Fig. 10.5.
Now, we are ready to explain the multilayer convolution framelets, which we call
deep convolutional framelets [42]. For simplicity, we consider encoder–decoder
networks without skip connections, as shown in Fig. 10.6, although the analysis
can be applied equally well when the skip connections are present. Furthermore, we
assume symmetric configuration so that both encoder and decoder have the same
number of layers, say κ; the input and output dimensions for the encoder layer El
and the decoder layer Dl are symmetric:
where [n] denotes the set {1, · · · , n}. At the l-th layer, ml and ql denote the
dimension of the signal, and the number of filter channels, respectively. The length
of filter is assumed to be r.
We now define the l-th layer input signal for the encoder layer from ql−1 -input
channels,
z^{l-1} := \begin{bmatrix} z_1^{l-1\top} & \cdots & z_{q_{l-1}}^{l-1\top} \end{bmatrix}^\top \in R^{d_{l-1}},   (10.31)

where \top denotes the transpose, and z_j^{l-1} \in R^{m_{l-1}} refers to the j-th channel input
with the dimension ml−1 . The l-th layer output signal zl is similarly defined. Note
that the filtered output is now stacked as a single column vector in (10.31), which
is different from the former treatment at the convolution framelet where the filter
output for each channel is stacked as an additional column. It turns out that the
notation in (10.31) makes the mathematical derivation for multilayer convolutional
neural networks much more tractable than the former notation, although the roles of
the global and local bases are more clearly seen in the former notation.
Then, for the linear encoder–decoder CNN without skip connections, as shown
in Fig. 10.6a, we have the following linear representation at the l-th encoder layer
[35]:
zl = E l zl−1 , (10.32)
where
E^l = \begin{bmatrix}
\Phi^l \circledast \psi_{1,1}^l & \cdots & \Phi^l \circledast \psi_{q_l,1}^l \\
\vdots & \ddots & \vdots \\
\Phi^l \circledast \psi_{1,q_{l-1}}^l & \cdots & \Phi^l \circledast \psi_{q_l,q_{l-1}}^l
\end{bmatrix},   (10.33)

where \Phi^l denotes the m_l \times m_l matrix that represents the pooling operation at the
l-th layer, \psi_{i,j}^l \in R^r represents the l-th layer encoder filter that generates the i-th
channel output from the contribution of the j-th channel input, and \Phi^l \circledast \psi_{i,j}^l
represents a single-input multi-output (SIMO) convolution [35]:

\Phi^l \circledast \psi_{i,j}^l = \big[\, \phi_1^l \circledast \psi_{i,j}^l\ \cdots\ \phi_n^l \circledast \psi_{i,j}^l \,\big].   (10.34)
Note that the inclusion of the bias can be readily done by including additional rows
into E l as the bias and augmenting the last element of zl−1 by 1.
Similarly, the l-th decoder layer can be represented by
where
D^l = \begin{bmatrix}
\tilde\Phi^l \circledast \tilde\psi_{1,1}^l & \cdots & \tilde\Phi^l \circledast \tilde\psi_{1,q_l}^l \\
\vdots & \ddots & \vdots \\
\tilde\Phi^l \circledast \tilde\psi_{q_{l-1},1}^l & \cdots & \tilde\Phi^l \circledast \tilde\psi_{q_{l-1},q_l}^l
\end{bmatrix},   (10.36)

where \tilde\Phi^l denotes the m_l \times m_l matrix that represents the unpooling operation at the
l-th layer, and \tilde\psi_{i,j}^l \in R^r represents the l-th layer decoder filter that generates the i-th
channel output from the contribution of the j-th channel input.
Then, the output v of the encoder-decoder CNN with respect to input z can be
represented by the following representation [35]:
v = \mathcal{T}_\Theta(z) = \sum_i \langle b_i, z \rangle\, \tilde b_i,   (10.37)

where \Theta refers to all encoder and decoder convolution filters, and b_i and \tilde b_i denote
the i-th columns of the following matrices, respectively:

B = E^1 E^2 \cdots E^\kappa, \quad \tilde B = D^1 D^2 \cdots D^\kappa.   (10.38)
Note that this representation is completely linear, since the representation does
not vary once the network parameters are trained. Furthermore, consider the
following multilayer frame conditions for the pooling and filter layers:
\tilde\Phi^l \Phi^{l\top} = \alpha I_{m_{l-1}}, \quad \Psi^l \tilde\Psi^{l\top} = \frac{1}{r\alpha} I_{r q_{l-1}}, \quad \forall l,   (10.39)

where I_n denotes the n \times n identity matrix and \alpha > 0 is a nonzero constant, and

\Psi^l = \begin{bmatrix}
\psi_{1,1}^l & \cdots & \psi_{q_l,1}^l \\
\vdots & \ddots & \vdots \\
\psi_{1,q_{l-1}}^l & \cdots & \psi_{q_l,q_{l-1}}^l
\end{bmatrix},   (10.40)

\tilde\Psi^l = \begin{bmatrix}
\tilde\psi_{1,1}^l & \cdots & \tilde\psi_{1,q_l}^l \\
\vdots & \ddots & \vdots \\
\tilde\psi_{q_{l-1},1}^l & \cdots & \tilde\psi_{q_{l-1},q_l}^l
\end{bmatrix}.   (10.41)
Under these frame conditions, we showed in [35] that (10.37) satisfies the perfect
reconstruction condition, i.e.

z = \mathcal{L}_\Theta(z) := \sum_i \langle b_i, z \rangle\, \tilde b_i,   (10.42)

where the filter parameters \Theta are estimated by solving the following optimization problem:

\min_\Theta\ \sum_{i=1}^{n} \ell\big(y_i, \mathcal{L}_\Theta(x_i)\big) + \lambda R(\Theta).   (10.43)
Once the parameter \Theta is learned, the encoder and decoder matrices E^l and D^l are
determined. Therefore, the representations are entirely data-driven and dependent
on the filter sets that are learned from the training data set, which is different from
the classical kernel machine or basis pursuit approaches, where underlying kernels
or frames are specified in a top-down manner.
That said, the deep convolutional framelet does not yet meet the desiderata of the
machine learning, since once it is trained, the frame representation does not vary,
hence the data-driven adaptation is not possible. In the next section, we will show
that the last missing element is the nonlinearity such as ReLU, which plays key roles
in machine learning.
In fact, the analysis of deep convolutional framelets with the ReLU nonlinearities
turns out to be a simple modification, but it provides very fundamental insights on
the geometry of the deep neural network.
Specifically, in [35] we showed that even with ReLU nonlinearities the expres-
sion (10.37) is still valid. The only change is that the basis matrices have additional
ReLU pattern blocks in between encoder, decoder, and skipped blocks. For example,
the expression in (10.38) is changed as follows:
where \Lambda^l(z) and \tilde\Lambda^l(z) are diagonal matrices with 0 and 1 elements indicating
the ReLU activation patterns.
Accordingly, the linear representation in (10.37) should be modified as a
nonlinear representation:
v = \mathcal{T}_\Theta(z) = \sum_i \langle b_i(z), z \rangle\, \tilde b_i(z),   (10.46)
where we now have an explicit dependency on z for bi (z) and b̃i (z) due to the input-
dependent ReLU activation patterns, which makes the representation nonlinear.
Again the filter parameter \Theta is estimated by solving the optimization problem in
(10.43) after replacing \mathcal{L}_\Theta(z) with \mathcal{T}_\Theta(z) in (10.46). Therefore, the representations
are entirely data-driven.
In (10.44) and (10.45), the encoder and decoder basis matrices have an explicit
dependence on the ReLU activation pattern on the input. Here we will show that
10.4.3 Expressivity
Given the partition-dependent framelet geometry of CNN, we can easily expect that
with a greater number of input space partitions, the nonlinear function approximation
by the piecewise linear frame representation becomes more accurate
(Fig. 10.8: expressivity increases exponentially with channels, depth, and skip connections). Therefore,
the number of piecewise linear regions is directly related to the expressivity or
representation power of the neural network. If each ReLU activation pattern is
independent of the others, then the number of distinct ReLU activation patterns
is 2^{\#\text{ of neurons}}, where the number of neurons is determined by the total number of
features. Therefore, the number of distinct linear representations increases
exponentially with the depth, width, and skip connection as shown in Fig. 10.8 [35].
This again confirms the expressive power of CNN thanks to the ReLU nonlinearities.
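The growth of the number of linear regions can be probed empirically: the sketch below samples a grid of inputs in R^2, records the joint ReLU on/off pattern of a small random MLP at each point, and counts how many distinct patterns (and hence distinct linear pieces among the sampled points) appear. The widths, depths, and sampling grid are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def activation_pattern(x, weights):
    """Return the concatenated ReLU on/off pattern of all hidden layers for input x."""
    pattern, h = [], x
    for W, b in weights:
        pre = W @ h + b
        pattern.append(pre > 0)
        h = np.maximum(pre, 0.0)
    return tuple(np.concatenate(pattern))

def random_net(widths):
    return [(rng.standard_normal((m, n)), rng.standard_normal(m))
            for n, m in zip(widths[:-1], widths[1:])]

grid = np.stack(np.meshgrid(np.linspace(-3, 3, 100),
                            np.linspace(-3, 3, 100)), -1).reshape(-1, 2)

for widths in [(2, 4), (2, 4, 4), (2, 8, 8)]:
    net = random_net(widths)
    patterns = {activation_pattern(x, net) for x in grid}
    print(widths, "distinct activation patterns on the grid:", len(patterns))
```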
To understand the claim, let us first revisit the ReLU operation for each neuron at
the encoder layer. Let E li denote the i-th column of encoder matrix E l and zil be the
i-th element of zl . Then, the output of an activated neuron can be represented as:
z_i^l = \frac{|\langle E_i^l, z^{l-1} \rangle|}{\| E_i^l \|} \times \| E_i^l \|,   (10.47)

where the first factor is the distance from the feature vector z^{l-1} to the hyperplane
whose normal vector is

n^l = E_i^l.   (10.48)
This implies that the output of the activated neuron is the scaled version of the
distance to the hyperplane which partitions the space of feature vector zl−1 into
active and non-active regions. Therefore, the role of the neural network can be
understood as representing the input data with a coordinate vector using the relative
distances with respect to multiple hyperplanes.
In fact, the aforementioned interpretation of the feature may not be novel, since
a similar interpretation can be used to explain the geometrical meaning of the
linear frame coefficients. Instead, one of the most important differences comes from
the multilayer representation. To understand this, consider the following two layer
neural network:
z_i^l = \sigma\big( E_i^{l\top} z^{l-1} \big),   (10.49)
where

z^{l-1} = \sigma\big( E^{(l-1)\top} z^{l-2} \big) = \Lambda(z^{l-1})\, E^{(l-1)\top} z^{l-2},   (10.50)

where \Lambda(z^{l-1}) again encodes the ReLU activation pattern. Using the property of
the inner product and the adjoint operator, we have

z_i^l = \sigma\big( E_i^{l\top} z^{l-1} \big)
= \sigma\big( \langle E_i^l,\ \Lambda(z^{l-1}) E^{(l-1)\top} z^{l-2} \rangle \big)
= \sigma\big( \langle \Lambda(z^{l-1}) E_i^l,\ E^{(l-1)\top} z^{l-2} \rangle \big).   (10.51)
This indicates that on the space of the unconstrained feature vector from the
previous layer (i.e. no ReLU is assumed), the hyperplane normal vector is now
changed to
n^l = \Lambda(z^{l-1})\, E_i^l.   (10.52)
Fig. 10.9 Two-layer neural network with two neurons for each layer. Blue arrows indicate the
normal direction of the hyperplane. The black lines are hyperplanes for the first layers, and the red
lines correspond to the second layer hyperplanes
This implies that the hyperplane in the current layer is adaptively changed with
respect to the input data, since the ReLU activation pattern in the previous layer,
i.e. \Lambda(z^{l-1}), can vary depending on the input. This is an important difference from
the linear multilayer frame representation, whose hyperplane structure is the same
regardless of different inputs.
For example, Fig. 10.9 shows a partition geometry of R2 by a two-layer neural
network with two neurons at each layer. The normal vector directions for the second
layer hyperplanes are determined by the ReLU activation patterns such that the
coordinate values at the inactive neuron become degenerate. More specifically, for
the (A) quadrant where two neurons at the first layers are active, we can obtain two
hyperplanes in any normal direction determined by the filter coefficients. However,
for the (B) quadrant where the second neuron is inactive, the situation is different.
Specifically, due to (10.52), the second coordinate of the normal vector, which
corresponds to the inactive neuron, becomes degenerate. This leads to the two
parallel hyperplanes that are distinct only by the bias term. A similar phenomenon
occurs for the quadrant (C) where the first neuron is inactive. For the (D) quadrant
where two neurons are inactive, the normal vector becomes zero and there exists no
partitioning. Therefore, we can conclude that the hyperplane geometry is adaptively
determined by the feature vectors in the previous layer.
In the following, we provide several toy examples in which the partition geometry
can be easily calculated.
Draw the corresponding input space partition, and compute the output mapping
with respect to an input vector (x, y) in each input partition. Please derive all
the steps explicitly.
(b) In problem (a), suppose that the bias terms are zero. Compute the input space
partition and the output mapping. What do you observe compared to the one
with bias?
(c) In problem (a), suppose that the second layer weight and bias are changed as
W^{(1)} = \begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix}, \quad b^{(1)} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.
Draw the corresponding input space partition, and compute the output mapping
with respect to an input vector (x, y) in each input partition. Compared to the
original problem in (a), what do you observe?
Solution 10.1
(a) Let x = [x, y] ∈ R2 . At the first layer, the output signal is given by
o^{(1)} = \sigma\big( W^{(0)} x + b^{(0)} \big) = \begin{bmatrix} \sigma(2x - y + 1) \\ \sigma(x + y - 1) \end{bmatrix},
where σ is the ReLU. Now, at the second layer, we need to consider all cases
where each ReLU is active or inactive.
(i) If 2x − y + 1 < 0 and x + y − 1 < 0, then o^{(1)} = [0, 0]^\top and
o^{(2)} = \sigma\big( W^{(1)} o^{(1)} + b^{(1)} \big) = \sigma\big([-9, -2]^\top\big) = [0, 0]^\top.
(ii) If 2x − y + 1 \geq 0 and x + y − 1 < 0, then o^{(1)} = [2x − y + 1, 0]^\top.
Hence, o^{(2)} = \sigma\big( W^{(1)} o^{(1)} + b^{(1)} \big) = \sigma\big([2x − y − 8, −2x + y − 3]^\top\big).
Therefore,

o^{(2)} = \begin{cases} [0, 0]^\top, & 2x − y − 8 < 0, \\ [2x − y − 8,\ 0]^\top, & \text{otherwise.} \end{cases}
The resulting input space partition is shown in Fig. 10.11, where the
corresponding linear mapping and its rank are illustrated. Note that around
the two full rank partitions, there exist rank-1 mapping partitions, which
join with the rank-0 mapping partition.
Fig. 10.11 Input space partitioning for the problem (a) case
where σ is the ReLU. At the second layer, we again consider all cases where
each ReLU is active or inactive.
(i) If 2x − y < 0 and x + y < 0, then o^{(1)} = [0, 0]^\top and
o^{(2)} = \sigma\big( W^{(1)} o^{(1)} \big) = [0, 0]^\top.
(ii) If 2x − y \geq 0 and x + y < 0, then o^{(1)} = [2x − y, 0]^\top. Hence,
o^{(2)} = \sigma\big( W^{(1)} o^{(1)} \big) = \sigma\big([2x − y, −2x + y]^\top\big) = [2x − y, 0]^\top.
Fig. 10.12 Input space partitioning for the problem (b) case
The resulting input space partition is shown in Fig. 10.12, where the corre-
sponding linear mapping and its rank are illustrated. Similar to problem (a),
around the two full rank partitions, there exist rank-1 mapping partitions, which
join with the rank-0 mapping partition. Since there is no bias term, all the
hyperplanes must contain the origin. Also, there are no hyperplanes with the same
normal vector, since parallel hyperplanes cannot be formed without bias terms.
As a result, the input space partition becomes simpler compared to (a).
(c) At the first layer, the output signal is given by
o^{(1)} = \sigma\big( W^{(0)} x + b^{(0)} \big) = \begin{bmatrix} \sigma(2x − y + 1) \\ \sigma(x + y − 1) \end{bmatrix},
where σ is the ReLU. Now, at the second layer, we need to consider all cases
where each ReLU is active or inactive.
(i) If 2x − y + 1 < 0 and x + y − 1 < 0, then o^{(1)} = [0, 0]^\top and
o^{(2)} = \sigma\big( W^{(1)} o^{(1)} + b^{(1)} \big) = \sigma\big([0, 1]^\top\big) = [0, 1]^\top.
(ii) If 2x − y + 1 \geq 0 and x + y − 1 < 0, then o^{(1)} = [2x − y + 1, 0]^\top. Hence,
o^{(2)} = \sigma\big( W^{(1)} o^{(1)} + b^{(1)} \big) = \sigma\big([2x − y + 1, 1]^\top\big) = [2x − y + 1, 1]^\top.
(iii) If 2x − y + 1 < 0 and x + y − 1 \geq 0, then o^{(1)} = [0, x + y − 1]^\top and
o^{(2)} = \sigma\big( W^{(1)} o^{(1)} + b^{(1)} \big) = \sigma\big([2x + 2y − 2, x + y]^\top\big) = [2x + 2y − 2,\ x + y]^\top.
\hat x = \psi(z) = \psi \circ \varphi(x).

In practice, both the encoder and the decoder are parameterized with the parameter \Theta,
so that the autoencoder is described by

\hat x = \mathcal{T}_\Theta(x) = \psi_\Theta \circ \varphi_\Theta(x),

where \Theta is estimated by solving

\min_\Theta\ \sum_{i=1}^{n} \ell\big(y_i, \mathcal{T}_\Theta(x_i)\big) + \lambda R(\Theta).   (10.54)
The geometric understanding of the autoencoder now gives a clear picture of what
happens in the deep neural network classifier. In this case, we only have an encoder
to map to the latent space, which leads to a simplified commutative diagram:
Since the encoder is also parameterized with and equipped with a ReLU, the input
manifold is also partitioned into piecewise linear regions, as shown in Fig. 10.14d.
Then, the linear layer followed by a softmax assigns the class probability to each
piecewise linear cell.
10.5 Open Problems
Fig. 10.15 Denoising as a piecewise linear projection on the reconstruction manifold [114]
Our discussion so far reveals that the deep neural network is indeed trained to
partition the input data manifold such that the linear mapping at each piecewise
linear region can effectively perform machine learning tasks, such as classification,
regression, etc. Therefore, we strongly believe that the clue to unveil the mystery
of deep neural networks comes from the understanding of the high-dimensional
manifold structure and its piecewise linear partition, and how the partitions can be
controlled.
In fact, many machine learning theoreticians have been focusing on this, thereby
generating many intriguing theoretical and empirical observations [115–118]. For
example, although we mentioned that the number of linear regions can potentially
increase exponentially with the network complexity, they observed that the actual
number of piecewise linear representations for specific tasks is much smaller. For
example, Fig. 10.16 shows that the number of linear regions indeed converges to
a smaller value compared to the initialization as the number of epochs increases
[115, 116].
Fig. 10.16 Here the authors [115, 116] show the linear regions that intersect a 2D plane through
input space for a network of depth 3 and width 64 trained on MNIST
Fig. 10.17 Linear regions and classification regions of models trained with different optimization
techniques [117]
Note that not only does the number of epochs determine the number of piecewise linear
regions; the number of linear regions also varies depending on the choice of the
optimization algorithm. For example, Fig. 10.17 shows that the number
of linear regions varies across optimization algorithms, which leads to
of linear regions varies depending on the optimization algorithms, which leads to
the different classification boundaries. Here, the gray curves in the bottom row are
transition boundaries separating different linear regions, and the color represents
the activation rate of the corresponding linear region. In the top row, different colors
represent different classification regions, separated by the decision boundaries. The
models were trained on the vectorized MNIST data set, and this figure shows a
two-dimensional slice of the input space.
In fact, this phenomenon can be understood as a data-driven adaptation to
eliminate the unnecessary partitions for machine learning tasks. Note that the
partition boundary can collapse, resulting in a smaller number of partitions, as
10.6 Exercises
1. Prove (10.24).
2. Prove the equality (10.25).
3. Fill in the missing step in (10.26).
4. Show (10.29).
5. Our goal is to derive the input–output relation in (10.32) at the encoder.
(a) Show that
l
(l ψ lj,k ) zl−1
k = l (zl−1
k ψ j,k ). (10.56)
˜ l ψ̃ lj,k )z̃lk =
( ˜ l z̃lk ψ̃ lj,k . (10.57)
Draw the corresponding input space partition, and compute the output
mapping with respect to an input vector (x, y) in each input partition. Please
derive all the steps explicitly.
(b) In problem (a), suppose that the bias terms are zero. Compute the input space
partition and the output mapping. What do you observe compared to the one
with bias?
(c) In problem (a), the last layer weight W (2) and bias b(2) are changed due to
the fine tuning. Please give an example of W (2) and bias b(2) that gives the
smallest number of partitions.
Chapter 11
Deep Learning Optimization
11.1 Introduction
[130–132] that have been used extensively to analyze the convergence properties of
local deep learning search methods.
In Chap. 6, we pointed out that the basic optimization problem in neural network
training can be formulated as
\min_{\theta \in R^n} \ell(\theta),   (11.1)

where θ refers to the network parameters and \ell : R^n \to R is the loss function. In the
case of supervised learning with the mean square error (MSE) loss, the loss function
is defined by
\ell(\theta) := \frac{1}{2} \| y - f_\theta(x) \|^2,   (11.2)
where x, y denotes the pair of the network input and the label, and f θ (·) is a neural
network parameterized by trainable parameters θ . For the case of an L-layer feed-
forward neural network, the regression function f (x) can be represented by
f θ (x) := σ ◦ g (L) ◦ σ ◦ g (L−1) · · · ◦ g (1) (x) , (11.3)
for l = 1, \cdots, L. Here, the number of the l-th layer hidden neurons, often referred
to as the width, is denoted by d^{(l)}, so that g^{(l)}, o^{(l)} \in R^{d^{(l)}} and W^{(l)} \in R^{d^{(l)} \times d^{(l-1)}}.
The popular local search approaches using the gradient descent use the following
update rule:
\theta[k+1] = \theta[k] - \eta_k \left. \frac{\partial \ell(\theta)}{\partial \theta} \right|_{\theta = \theta[k]},   (11.7)
where ηk denotes the k-th iteration step size. In a differential equation form, the
update rule can be represented by
\dot\theta[t] = -\frac{\partial \ell(\theta[t])}{\partial \theta},   (11.8)
\ell(\theta') \geq \ell(\theta) + \langle \nabla \ell(\theta),\ \theta' - \theta \rangle, \quad \forall \theta, \theta'.   (11.10)
Our starting point is the observation that the convex analysis mentioned above
is not the right approach to analyzing a deep neural network. The non-convexity
is essential for the analysis. This situation has motivated a variety of alternatives
to the convexity to prove the convergence. One of the oldest of these conditions
is the error bounds (EB) of Luo and Tseng [134], but other conditions have been
recently considered, which include essential strong convexity (ESC) [135], weak
strong convexity (WSC) [136], and the restricted secant inequality (RSI) [137].
See their specific forms of conditions in Table 11.1. On the other hand, there
is a much older condition called the Polyak–Łojasiewicz (PL) condition, which
was originally introduced by Polyak [138] and found to be a special case of the
inequality of Łojasiewicz [139]. Specifically, we will say that a function satisfies
the PL inequality if the following holds for some μ > 0:
\frac{1}{2} \| \nabla \ell(\theta) \|^2 \geq \mu \big( \ell(\theta) - \ell^* \big), \quad \forall \theta.   (11.11)
Table 11.1 Examples of conditions for gradient descent (GD) convergence. All of these definitions involve some constant μ > 0 (which may not be the same across conditions). θ_p denotes the projection of θ onto the solution set X^*, and \ell^* refers to the minimum cost.

Strong convexity (SC):
\ell(\theta') \geq \ell(\theta) + \langle \nabla \ell(\theta), \theta' - \theta \rangle + \frac{\mu}{2}\|\theta' - \theta\|^2, \quad \forall \theta, \theta'

Essential strong convexity (ESC):
\ell(\theta') \geq \ell(\theta) + \langle \nabla \ell(\theta), \theta' - \theta \rangle + \frac{\mu}{2}\|\theta' - \theta\|^2, \quad \forall \theta, \theta' \text{ s.t. } \theta_p = \theta'_p

Weak strong convexity (WSC):
\ell^* \geq \ell(\theta) + \langle \nabla \ell(\theta), \theta_p - \theta \rangle + \frac{\mu}{2}\|\theta_p - \theta\|^2, \quad \forall \theta

Restricted secant inequality (RSI):
\langle \nabla \ell(\theta), \theta - \theta_p \rangle \geq \mu \|\theta_p - \theta\|^2, \quad \forall \theta

Error bound (EB):
\|\nabla \ell(\theta)\| \geq \mu \|\theta_p - \theta\|, \quad \forall \theta

Polyak–Łojasiewicz (PL):
\frac{1}{2}\|\nabla \ell(\theta)\|^2 \geq \mu \big( \ell(\theta) - \ell^* \big), \quad \forall \theta
Note that this inequality implies that every stationary point is a global minimum.
But unlike SC, it does not imply that there is a unique solution. We will revisit this
issue later.
Similar to other conditions in Table 11.1, PL is a sufficient condition for
gradient descent to achieve a linear convergence rate [122]. In fact, PL is the
mildest condition among them. Specifically, the following relationship between the
conditions holds [122]:
(SC) \Rightarrow (ESC) \Rightarrow (WSC) \Rightarrow (RSI) \Rightarrow (EB) \equiv (PL),

if \ell has a Lipschitz continuous gradient, i.e. there exists L > 0 such that

\| \nabla \ell(\theta) - \nabla \ell(\theta') \| \leq L \| \theta - \theta' \|, \quad \forall \theta, \theta'.   (11.12)
\theta[k+1] = \theta[k] - \frac{1}{L} \nabla \ell(\theta[k])   (11.13)

has a global convergence rate

\ell(\theta[k]) - \ell^* \leq \left( 1 - \frac{\mu}{L} \right)^{k} \big( \ell(\theta[0]) - \ell^* \big).
Proof Using Lemma 11.1 (see next section), L-Lipschitz continuous gradient of the
loss function \ell implies that the function

g(\theta) = \frac{L}{2} \|\theta\|^2 - \ell(\theta)
is convex. Thus, the first-order equivalence of convexity in Proposition 1.1 leads to
the following:
\frac{L}{2}\|\theta'\|^2 - \ell(\theta') \geq \frac{L}{2}\|\theta\|^2 - \ell(\theta) + \langle \theta' - \theta,\ L\theta - \nabla \ell(\theta) \rangle

= -\frac{L}{2}\|\theta\|^2 - \ell(\theta) + L \langle \theta', \theta \rangle - \langle \theta' - \theta,\ \nabla \ell(\theta) \rangle.
Rearranging terms, we obtain

\ell(\theta') \leq \ell(\theta) + \langle \nabla \ell(\theta),\ \theta' - \theta \rangle + \frac{L}{2}\|\theta' - \theta\|^2, \quad \forall \theta, \theta'.
By setting θ = θ [k + 1] and θ = θ [k] and using the update rule (11.13), we have
\ell(\theta[k+1]) - \ell(\theta[k]) \leq -\frac{1}{2L}\, \| \nabla \ell(\theta[k]) \|^2.   (11.14)
Using the PL inequality (11.11), we get
\ell(\theta[k+1]) - \ell(\theta[k]) \leq -\frac{\mu}{L} \big( \ell(\theta[k]) - \ell^* \big),

and adding \ell(\theta[k]) - \ell^* to both sides and applying the bound recursively yields the claimed linear convergence rate. \qquad \square
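Theorem 11.1 can be illustrated numerically. The function ℓ(θ) = θ² + 3 sin²(θ) is non-convex but satisfies the PL inequality (it is the standard example discussed in [122]); running gradient descent with a small constant step size shows the gap ℓ(θ[k]) − ℓ* contracting geometrically. The step size below is a conservative illustrative choice rather than exactly 1/L.

```python
import numpy as np

def loss(t):      # non-convex but PL: the only stationary point is the global minimum at 0
    return t**2 + 3 * np.sin(t)**2

def grad(t):
    return 2 * t + 3 * np.sin(2 * t)

theta, eta = 2.5, 0.1          # initial point and a conservative constant step size
gaps = []
for k in range(30):
    gaps.append(loss(theta))    # loss gap, since the minimum value is 0
    theta -= eta * grad(theta)

ratios = np.array(gaps[1:]) / np.array(gaps[:-1])
print("final loss:", gaps[-1])
print("per-step contraction factors:", ratios[:5].round(3))   # bounded away from 1 => linear rate
```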
In Theorem 11.1, we use two conditions for the loss function: (1) \ell satisfies
the PL condition and (2) the gradient of \ell is Lipschitz continuous. Although these
conditions are much weaker than the convexity of the loss function, they still impose
the geometric constraint for the loss function, which deserves further discussion.
Lemma 11.1 If the gradient of \ell(\theta) satisfies the L-Lipschitz condition in (11.12),
then the transformed function g : R^n \to R defined by

g(\theta) := \frac{L}{2} \theta^\top \theta - \ell(\theta)   (11.15)

is convex.
Proof Using the Cauchy–Schwarz inequality, (11.12) implies
\langle \nabla \ell(\theta) - \nabla \ell(\theta'),\ \theta - \theta' \rangle \leq L \|\theta - \theta'\|^2, \quad \forall \theta, \theta'.
This in turn implies the monotonicity of the gradient of g:

\langle \theta - \theta',\ \nabla g(\theta) - \nabla g(\theta') \rangle \geq 0, \quad \forall \theta, \theta',   (11.16)

where

g(\theta) = \frac{L}{2}\|\theta\|^2 - \ell(\theta).
Thus, using the monotonicity of gradient equivalence in Proposition 1.1, we can
show that g(θ ) is convex.
Lemma 11.1 implies that although \ell is not convex, its transformed function by
(11.15) can be convex. Figure 11.1a shows an example of such a case. Another impor-
tant geometric consideration for the loss landscape comes from the PL condition.
More specifically, the PL condition in (11.11) implies that every stationary point is
a global minimizer, although the global minimizers may not be unique, as shown in
Fig. 11.1b, c. While the PL inequality does not imply convexity of \ell, it does imply
the weaker condition of invexity [122]. A function is invex if it is differentiable and
there exists a vector-valued function η such that for any θ and θ' in R^n, the following
inequality holds:

\ell(\theta') \geq \ell(\theta) + \langle \nabla \ell(\theta),\ \eta(\theta', \theta) \rangle.   (11.17)

A convex function is a special case of an invex function, since (11.17) holds when we
set \eta(\theta', \theta) = \theta' - \theta. It was shown that a smooth \ell is invex if and only if every
stationary point of \ell is a global minimum [140]. As the PL condition implies that
every stationary point is a global minimizer, a function satisfying PL is an invex
function. The inclusion relationship between convex, invex, and PL functions is
illustrated in Fig. 11.2.
The loss landscape, where every stationary point is a global minimizer, implies
that there are no spurious local minimizers. This is often called the benign
optimization landscape. Finding the conditions for a benign optimization landscape
Fig. 11.1 Loss landscape for the function \ell(x): (a) a case where the transform (11.15) is convex; (b, c) cases satisfying the PL condition
Fig. 11.3 Loss landscapes of (a) under-parameterized models and (b) over-parameterized models
the Lyapunov stability analysis is concerned with checking whether the solution
trajectory θ [t] converges to zero as t → ∞. To provide a general solution for this,
we first define the Lyapunov function V (z), which satisfies the following properties:
Definition 11.1 A function V : Rn → R is positive definite (PD) if
• V (z) ≥ 0 for all z.
• V (z) = 0 if and only if z = 0.
• All sublevel sets of V are bounded.
The Lyapunov function V has an analogy to the potential function of classical
dynamics, and −V̇ can be considered the associated generalized dissipation func-
tion. Furthermore, if we set z := θ [t] to analyze the nonlinear dynamic system in
(11.18), then V̇ : z ∈ Rn → R is computed by
\dot V(z) = \frac{\partial V}{\partial z} \dot z = \frac{\partial V}{\partial z} g(z).   (11.19)
The following Lyapunov global asymptotic stability theorem is one of the keys
to the stability analysis of dynamic systems:
Theorem 11.2 (Lyapunov Global Asymptotic Stability [146]) Suppose there is
a function V such that (1) V is positive definite, and (2) \dot V(z) < 0 for all z \neq 0 and
V̇ (0) = 0. Then, every trajectory θ [t] of θ̇ = g(θ ) converges to zero as t → ∞.
(i.e., the system is globally asymptotically stable).
θ̇ = −θ.
We can easily show that the system is globally asymptotically stable since the
solution is θ [t] = C exp(−t) for some constant C, and θ [t] → 0 as t → ∞.
Now, we want to prove this using Theorem 11.2 without ever solving the
differential equation. First, choose a Lyapunov function
V(z) = \frac{z^2}{2},
where z = θ [t]. We can easily show that V (z) is positive definite. Further-
more, we have \dot V(z) = z \dot z = -z^2, which is negative for all z \neq 0 and zero at z = 0.
Therefore, using Theorem 11.2 we can show that θ [t] converges to zero as
t → ∞.
\dot\theta[t] = -\frac{\partial \ell(\theta[t])}{\partial \theta}.
For the MSE loss, this leads to
\dot\theta[t] = \left( \frac{\partial f_{\theta[t]}(x)}{\partial \theta} \right)^{\!\top} \big( y - f_{\theta[t]}(x) \big).   (11.20)
Now let
V(z) = \frac{1}{2} z^\top z,
where z = e[t]. Then, we have
\dot V(z) = \frac{\partial V}{\partial z} \dot z = z^\top \dot z.   (11.21)
where

K_t = K_{\theta[t]} := \left. \frac{\partial f_\theta}{\partial \theta} \left( \frac{\partial f_\theta}{\partial \theta} \right)^{\!\top} \right|_{\theta = \theta[t]}   (11.22)
is often called the neural tangent kernel (NTK) [130–132]. By plugging this into
(11.21), we have \dot V(z) = -z^\top K_t\, z.
Accordingly, if the NTK is positive definite for all t, then V̇ (z) < 0. Therefore,
e[t] → 0 so that f (θ[t]) → y as t → ∞. This proves the convergence of gradient
descent approach.
In the previous discussion we showed that the Lyapunov analysis only requires
a positive-definiteness of the NTK along the solution trajectory. While this is a
great advantage over PL-type analysis, which requires knowledge of the global loss
landscape, the NTK is a function of time, so it is important to obtain the conditions
for the positive-definiteness of NTK along the solution trajectory.
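For a small network, the NTK in (11.22) can be computed directly by assembling the Jacobian of the network outputs with respect to all parameters and forming the Gram matrix J J^⊤. The sketch below does this with finite differences for a tiny two-layer ReLU network and prints the eigenvalues of the resulting NTK; the architecture, data, and initialization are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, N = 2, 16, 5
X = rng.standard_normal((N, d_in))                       # training inputs x_1..x_N

W1 = rng.standard_normal((d_hidden, d_in)) / np.sqrt(d_in)
w2 = rng.standard_normal(d_hidden) / np.sqrt(d_hidden)
theta = np.concatenate([W1.ravel(), w2])                 # all trainable parameters

def f(theta, X):
    """Scalar-output two-layer ReLU network f_theta(x) evaluated on each row of X."""
    W1 = theta[:d_hidden * d_in].reshape(d_hidden, d_in)
    w2 = theta[d_hidden * d_in:]
    return np.maximum(X @ W1.T, 0.0) @ w2

def jacobian(theta, X, eps=1e-5):
    """Finite-difference Jacobian d f_theta(x_n) / d theta, one row per sample."""
    J = np.zeros((len(X), len(theta)))
    for p in range(len(theta)):
        e = np.zeros_like(theta); e[p] = eps
        J[:, p] = (f(theta + e, X) - f(theta - e, X)) / (2 * eps)
    return J

J = jacobian(theta, X)
K = J @ J.T                                              # empirical NTK K_t(x_m, x_n), as in (11.22)
print("eigenvalues of the NTK Gram matrix:", np.linalg.eigvalsh(K).round(4))
# strictly positive eigenvalues indicate positive definiteness for these samples
```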
To understand this, here we are interested in deriving the explicit form of the
NTK to understand the convergence behavior of the gradient descent methods.
Using the backpropagation in Chap. 6, we can obtain the weight update as follows:
Similarly, we have
K_t = \sum_{l=1}^{L} \big( \| o^{(l)}[t] \|^2 + 1 \big)\, M^{(l)}[t],

where

M^{(l)}[t] = \Lambda^{(L)} W^{(L)}[t] \cdots W^{(l+1)}[t] \Lambda^{(l)} \Lambda^{(l)} W^{(l+1)\top}[t] \cdots W^{(L)\top}[t] \Lambda^{(L)}.   (11.24)
Therefore, the positive definiteness of the NTK comes from the properties of
M (l) [t]. In particular, if M (l) [t] is positive definite for any l, the resulting NTK
is positive definite. Moreover, the positive-definiteness of M (l) [t] can be readily
shown if the following sensitivity matrix has full row rank:
Although we derived the explicit form of the NTK using backpropagation, still the
component matrix in (11.24) is difficult to analyze due to the stochastic nature of
the weights and ReLU activation patterns.
To address this problem, the authors in [130] calculated the NTK at the infinite
width limit and showed that it satisfies the positive definiteness. Specifically, they
considered the following normalized form of the neural network update:
o^{(0)} = x,   (11.25)

g^{(l)} = \frac{1}{\sqrt{d^{(l)}}} W^{(l)} o^{(l-1)} + \beta b^{(l-1)},   (11.26)

o^{(l)} = \sigma(g^{(l)}),   (11.27)
for l = 1, \cdots, L, where d^{(l)} denotes the width of the l-th layer. Furthermore,
they considered what is sometimes called LeCun initialization, taking
W_{ij}^{(l)} \sim N\big(0, \frac{1}{d^{(l)}}\big) and b_j^{(l)} \sim N(0, 1). Then, the following asymptotic form of the NTK
can be obtained.
Theorem 11.3 (Jacot et al. [130]) For a network of depth L at initialization, with a
Lipschitz nonlinearity σ , and in the limit as the layers width d (1) · · · , d (L−1) → ∞,
the neural tangent kernel K (L) converges in probability to a deterministic limiting
kernel:
K^{(L)} \to \kappa_\infty^{(L)} \otimes I_{d_L}.   (11.28)
Here, the limiting kernel is defined recursively by

\kappa_\infty^{(1)}(x, x') = \frac{1}{d^{(0)}} x^\top x' + \beta^2,   (11.29)

\kappa_\infty^{(l+1)}(x, x') = \kappa_\infty^{(l)}(x, x')\, \dot\nu^{(l+1)}(x, x') + \nu^{(l+1)}(x, x'),   (11.30)

where

\nu^{(l+1)}(x, x') = E_g\big[ \sigma(g(x))\, \sigma(g(x')) \big] + \beta^2,   (11.31)

\dot\nu^{(l+1)}(x, x') = E_g\big[ \dot\sigma(g(x))\, \dot\sigma(g(x')) \big],   (11.32)
Now, we are interested in extending the example above to the general loss function
with multiple training data sets. For a given training data set \{x_n\}_{n=1}^{N}, the gradient
dynamics in (11.7) can be extended to

\dot\theta = -\sum_{n=1}^{N} \frac{\partial \ell(f_\theta(x_n))}{\partial \theta}
= -\sum_{n=1}^{N} \left( \frac{\partial f_\theta(x_n)}{\partial \theta} \right)^{\!\top} \frac{\partial \ell(x_n)}{\partial f_\theta(x_n)},

so that for each training sample x_m,

\dot f_\theta(x_m) = -\sum_{n=1}^{N} K_t(x_m, x_n)\, \frac{\partial \ell(x_n)}{\partial f_\theta(x_n)}.

Now, choose the Lyapunov function

V(z) = \sum_{m=1}^{N} \ell\big(f_\theta(x_m)\big) = \sum_{m=1}^{N} \ell\big(z_m + f_m^*\big),

where

z = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_N \end{bmatrix}
= \begin{bmatrix} f_\theta(x_1) - f^*(x_1) \\ f_\theta(x_2) - f^*(x_2) \\ \vdots \\ f_\theta(x_N) - f^*(x_N) \end{bmatrix},
Assume that \ell(f_\theta(x_n)) > 0 if f_\theta(x_n) \neq f_n^* and \ell(f_n^*) = 0 for all n, so that V(z) is positive definite. Then, we have
\dot V(z) = \sum_{m=1}^{N} \frac{\partial \ell(f_\theta(x_m))}{\partial z_m} \dot z_m
= \sum_{m=1}^{N} \left. \frac{\partial \ell(x_m)}{\partial f_\theta(x_m)}\, \dot f_\theta(x_m) \right|_{\theta = \theta[t]}

= -\sum_{m=1}^{N} \sum_{n=1}^{N} \left. \frac{\partial \ell(f_\theta(x_m))}{\partial f_\theta(x_m)}\, K_t(x_m, x_n)\, \frac{\partial \ell(f_\theta(x_n))}{\partial f_\theta(x_n)} \right|_{\theta = \theta[t]}

= -e[t]^\top K[t]\, e[t],
where
e[t] = \left. \begin{bmatrix} \frac{\partial \ell(f_\theta(x_1))}{\partial f_\theta(x_1)} \\ \vdots \\ \frac{\partial \ell(f_\theta(x_N))}{\partial f_\theta(x_N)} \end{bmatrix} \right|_{\theta = \theta[t]}, \qquad
K[t] = \begin{bmatrix} K_t(x_1, x_1) & \cdots & K_t(x_1, x_N) \\ \vdots & \ddots & \vdots \\ K_t(x_N, x_1) & \cdots & K_t(x_N, x_N) \end{bmatrix}.
Therefore, if the NTK K[t] is positive definite for all t, then Lyapunov stability
theory guarantees that the gradient dynamics converge to the global minima.
11.5 Exercises
1. Show that a smooth \ell(\theta) is invex if and only if every stationary point of \ell(\theta) is a
global minimum.
2. Show that a convex function is invex.
3. Let a > 0. Show that V (x, y) = x 2 + 2y 2 is a Lyapunov function for the system
ẋ = ay 2 − x , ẏ = −y − ax 2 .
\dot x = x(y - 1), \quad \dot y = -\frac{x^2}{1 + x^2}.
(a) Given the corresponding input space partition in Fig. 10.11, compute the neural tangent kernel for each partition. Are they positive definite?
(b) In problem (a), suppose that the second layer weight and bias are changed to

$$W^{(1)} = \begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix}, \qquad b^{(1)} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.$$

Given the corresponding input space partition, compute the neural tangent kernel for each partition. Are they positive definite?
Chapter 12
Generalization Capability of Deep
Learning
12.1 Introduction
One of the main reasons for the enormous success of deep neural networks is
their amazing ability to generalize, which seems mysterious from the perspective
of classic machine learning. In particular, the number of trainable parameters in
deep neural networks is often greater than the size of the training data set, a situation that is notorious for overfitting from the point of view of classical statistical learning theory. However, empirical results have shown that a deep neural network
generalizes well at the test phase, resulting in high performance for the unseen data.
This apparent contradiction has raised questions about the mathematical foun-
dations of machine learning and their relevance to practitioners. A number of
theoretical papers have been published to understand the intriguing generalization
phenomenon in deep learning models [147–153]. The simplest approach to studying
generalization in deep learning is to prove a generalization bound, which is typically
an upper limit for test error. A key component in these generalization bounds is the
notion of complexity measure: a quantity that monotonically relates to some aspect
of generalization. Unfortunately, it is difficult to find tight bounds for a deep neural
network that can explain the fascinating ability to generalize.
Recently, the authors in [154, 155] have delivered groundbreaking work that can
reconcile classical understanding and modern practice in a unified framework. The
so-called “double descent” curve extends the classical U-shaped bias-variance trade-
off curve by showing that increasing the model capacity beyond the interpolation
point leads to improved performance in the test phase. Particularly, the induced bias
by optimization algorithms such as the stochastic gradient descent (SGD) offers
simpler solutions that improve generalization in the over-parameterized regime.
This relationship between the algorithms and structure of machine learning models
describes the limits of classical analysis and has implications for the theory and
practice of machine learning.
This chapter also presents new results showing that a generalization bound
based on the robustness of the algorithm can be a promising tool to understand
the generalization ability of the ReLU network. In particular, we claim that it
can potentially offer a tight generalization bound that depends on the piecewise
linear nature of the deep neural network and the inductive bias of the optimization
algorithms.
12.2 Mathematical Preliminaries

For a sample z = (x, y) ∈ Z := X × Y and a hypothesis f, consider, for example, the loss

$$\ell(f, z) = \frac{1}{2}\|y - f(x)\|^2.$$

Given a training data set S, a learning algorithm A returns the hypothesis

$$f_S = A(S). \qquad (12.1)$$
For example, the estimated hypothesis from the popular empirical risk minimization (ERM) principle [10] is given by

$$f^*_{ERM} = \arg\min_{f\in\mathcal F}\hat R_N(f), \qquad (12.2)$$

where F is the hypothesis class and the empirical risk is defined by

$$\hat R_N(f) := \frac{1}{N}\sum_{n=1}^{N}\ell(f, z_n), \qquad (12.3)$$
which is assumed to uniformly converge to the population (or expected) risk defined by R(f) := E_z[ℓ(f, z)].
If uniform convergence holds, then the empirical risk minimizer (ERM) is consis-
tent, that is, the population risk of the ERM converges to the optimal population
risk, and the problem is said to be learnable using the ERM [10].
In fact, learning algorithms that satisfy such performance guarantees are referred to as probably approximately correct (PAC) learning algorithms [156]. Formally, PAC learnability is defined as follows.
Definition 12.1 (PAC Learnability [156]) A concept class C is PAC learnable if there exist some algorithm A and a polynomial function poly(·) such that the following holds. Pick any target concept c ∈ C. Pick any input distribution P over X. Pick any ε, δ ∈ [0, 1]. Define S := {x_n, c(x_n)}_{n=1}^N, where x_n ∼ P are i.i.d. samples. Given N ≥ poly(1/ε, 1/δ, dim(X), size(c)), where dim(X) and size(c) denote the computational costs of representing inputs x ∈ X and the target c, the generalization error is bounded as

$$P\{R(A_S) \le \epsilon\} \ge 1 - \delta,$$

where A_S denotes the learned hypothesis by the algorithm A using the training data S.
The PAC learnability is closely related to the generalization bounds. More
specifically, the ERM could only be considered a solution to a machine learning
problem or PAC-learnable if the difference between the training error and the
generalization error, called the generalization gap, is small enough. This implies
that the following probability should be sufficiently small:
$$P\left\{\sup_{f\in\mathcal F}\left|R(f) - \hat R_N(f)\right| > \epsilon\right\}. \qquad (12.6)$$
Note that this is the worst-case probability, so even in the worst-case scenario, we
try to minimize the difference between the empirical risk and the expected risk.
A standard trick to bound the probability in (12.6) is based on concentration
inequalities. For example, Hoeffding’s inequality is useful.
Theorem 12.1 (Hoeffding's Inequality [157]) If x_1, x_2, · · · , x_N are N i.i.d. samples of a random variable X distributed by P, and a ≤ x_n ≤ b for every n, then for a small positive nonzero value ε:

$$P\left\{\left|\mathbb E[X] - \frac{1}{N}\sum_{n=1}^{N}x_n\right| > \epsilon\right\} \le 2\exp\left(-\frac{2N\epsilon^2}{(b-a)^2}\right). \qquad (12.7)$$
Assuming that our loss is bounded between 0 and 1 using a 0/1 loss function or by squashing any other loss between 0 and 1, (12.6) can be bounded as follows using Hoeffding's inequality:

$$P\left\{\sup_{f\in\mathcal F}|R(f) - \hat R_N(f)| > \epsilon\right\} = P\left\{\bigcup_{f\in\mathcal F}|R(f) - \hat R_N(f)| > \epsilon\right\}$$
$$\overset{(a)}{\le} \sum_{f\in\mathcal F}P\left\{|R(f) - \hat R_N(f)| > \epsilon\right\} \qquad (12.8)$$
$$\le 2|\mathcal F|\exp\left(-2N\epsilon^2\right),$$
where |F| is the size of the hypothesis space and we use the union bound in (a) to
obtain the inequality. By denoting the right hand side of the above inequality by δ,
we can say that with probability at least 1 − δ, we have
$$R(f) \le \hat R_N(f) + \sqrt{\frac{\ln|\mathcal F| + \ln\frac{2}{\delta}}{2N}}. \qquad (12.9)$$
Indeed, (12.9) is one of the simplest forms of the generalization bound, but still
reveals the fundamental bias–variance trade-off in classical statistical learning
theory. For example, the ERM for a given function class F results in the minimum empirical loss min_{f∈F} R̂_N(f), which goes to zero as the hypothesis class F becomes bigger. On the other hand, the
second term in (12.9) grows with increasing |F|. This trade-off in the generalization
bound with respect to the hypothesis class size |F| is illustrated in Fig. 12.1.
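The following short sketch (illustrative, not from the text) evaluates the complexity term of (12.9) for a few hypothesis class sizes; the values of N, δ, and |F| are arbitrary assumptions.

```python
import numpy as np

def finite_class_penalty(size_F, N, delta=0.05):
    # complexity term of (12.9):  sqrt((ln|F| + ln(2/delta)) / (2N))
    return np.sqrt((np.log(size_F) + np.log(2 / delta)) / (2 * N))

N = 10_000
for size_F in (10, 10**3, 10**6, 10**12):
    print(f"|F| = 1e{int(np.log10(size_F)):2d}   penalty = {finite_class_penalty(size_F, N):.4f}")
```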
Although the expression in (12.9) looks very nice, it turns out that the bound is
very loose. This is due to the term |F| which originates from the union bound of all
elements in the hypothesis class F. In the following, we discuss some representative
classical approaches to obtain tighter generalization bounds. One of the key ideas of the work of Vapnik and Chervonenkis [10] is to replace the union bound over the whole hypothesis class in (12.8) with a union bound over simpler
empirical distributions. This idea is historically important, so we will review it here.
Fig. 12.1 Generalization bound behavior according to the hypothesis class size |F|
Let S' := {z'_n}_{n=1}^N denote a ghost sample drawn independently from the same distribution, with the corresponding empirical risk

$$\hat R'_N(f) = \frac{1}{N}\sum_{n=1}^{N}\ell(f, z'_n). \qquad (12.11)$$
Vapnik and Chervonenkis [10] used the symmetrization lemma to obtain a much tighter generalization bound:

$$P\left\{\sup_{f\in\mathcal F}|R(f) - \hat R_N(f)| > \epsilon\right\} \le 2P\left\{\sup_{f\in\mathcal F_{S,S'}}|\hat R_N(f) - \hat R'_N(f)| > \frac{\epsilon}{2}\right\}$$
$$= 2P\left\{\bigcup_{f\in\mathcal F_{S,S'}}|\hat R_N(f) - \hat R'_N(f)| > \frac{\epsilon}{2}\right\}$$
$$\le 2G_{\mathcal F}(2N)\cdot P\left\{|\hat R_N(f) - \hat R'_N(f)| > \frac{\epsilon}{2}\right\},$$

where the last inequality is obtained by Hoeffding's inequality and F_{S,S'} denotes the restriction of the hypothesis class to the empirical distribution for S, S'. Here, G_F(·) is called the growth function, which represents the number of the most possible sets of dichotomies using the hypothesis class F on any 2N points from S and S'.
The discovery of the growth function is one of the important contributions of
Vapnik and Chervonenkis [10]. This is closely related to the concept of shattering,
which is formally defined as follows.
Definition 12.2 (Shattering) We say F shatters S if |F| = 2|S| .
In fact, the growth function GF (N ) is often called the shattering number: the
number of the most possible sets of dichotomies using the hypothesis class F on
any N points. Below, we show several facts for the growth function:
• By definition, the shattering number satisfies G_F(N) ≤ 2^N.
• When F is finite, we always have G_F(N) ≤ |F|.
• If GF (N ) = 2N , then there is a set of N points such that the class of functions F
can generate any possible classification result on these points. Figure 12.2 shows
such a case where F is the class of linear classifiers.
Fig. 12.2 Most possible sets of dichotomies using linear classifier on any three points. The
resulting shattering number is GF (3) = 8
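The shattering of three points by linear classifiers can also be checked numerically. The brute-force sketch below (not from the text) searches randomly for a linear classifier realizing each of the 2³ dichotomies; the three points and the random-search procedure are illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
# three points in general position (an illustrative choice)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

def realizable(labels, trials=20000):
    # random search for a linear classifier sign(w.x + b) producing these labels;
    # for three points in general position this finds every dichotomy easily
    for _ in range(trials):
        w, b = rng.standard_normal(2), rng.standard_normal()
        if np.all((X @ w + b > 0) == labels):
            return True
    return False

dichotomies = [lab for lab in itertools.product([True, False], repeat=3)
               if realizable(np.array(lab))]
print("shattering number G_F(3) =", len(dichotomies))   # 8 = 2^3
```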
The VC dimension d_VC of a hypothesis class F is defined as the largest N such that G_F(N) = 2^N.
Example: Sinusoids
f is a single-parametric sine classifier, i.e, for a certain parameter θ , the
classifier fθ returns 1 if the input number x is larger than sin(θ x) and 0
otherwise. The VC dimension of f is infinite, since it can shatter any finite
subset of the set {2−m | m ∈ N}.
Finally, we can derive the generalization bound using the VC dimension. For this,
the following lemma by Sauer is the key element.
Lemma 12.2 (Sauer's Lemma [158]) Suppose that F has a finite VC dimension d_VC. Then

$$G_{\mathcal F}(n) \le \sum_{i=0}^{d_{VC}}\binom{n}{i}. \qquad (12.15)$$
Corollary 12.1 (VC Bound Using VC Dimension) Let N ≥ d_VC. Then, for any δ > 0, with probability at least 1 − δ, we have

$$R(f) \le \hat R_N(f) + \sqrt{\frac{8d_{VC}\ln\frac{2eN}{d_{VC}} + 8\ln\frac{2}{\delta}}{N}}. \qquad (12.17)$$
For a binary classifier f(x) ∈ {−1, +1} and labels y_n ∈ {−1, +1}, the training error can be written as

$$\mathrm{err}_N(f) = \frac{1}{N}\sum_{n=1}^{N}\mathbb 1\left[f(x_n) \ne y_n\right], \qquad (12.18)$$

which can be equivalently represented as

$$\mathrm{err}_N(f) = \frac{1}{N}\sum_{n=1}^{N}\frac{1 - y_nf(x_n)}{2} = \frac{1}{2} - \frac{1}{2}\underbrace{\frac{1}{N}\sum_{n=1}^{N}y_nf(x_n)}_{\text{correlation}}. \qquad (12.20)$$

Therefore, minimizing the training error amounts to maximizing the correlation

$$\sup_{f\in\mathcal F}\frac{1}{N}\sum_{n=1}^{N}y_nf(x_n). \qquad (12.21)$$
Note that the idea is closely related to the shattering in VC analysis. Specifically,
if the hypothesis class F shatters S = {x n , yn }N n=1 , then the correlation becomes
a maximum. However, in contrast to the VC analysis that considers the worst-
case scenario, Rademacher complexity analysis deals with average-case analysis.
Formally, we define the so-called Rademacher complexity [161].
$$\mathrm{Rad}_N(\mathcal F, S) = \mathbb E_\sigma\left[\sup_{f\in\mathcal F}\frac{1}{N}\sum_{n=1}^{N}\sigma_nf(x_n)\right], \qquad (12.22)$$

where the σ_n are i.i.d. Rademacher variables taking the values ±1 with equal probability, and

$$\mathrm{Rad}_N(\mathcal F) = \mathbb E\left[\sup_{f\in\mathcal F}\frac{1}{N}\sum_{n=1}^{N}\langle\sigma_n, f(x_n)\rangle\right], \qquad (12.24)$$

where {σ_n}_{n=1}^N refers to the independent random vectors. In the following, we provide some examples where the Rademacher complexity can be explicitly calculated.
$$\mathrm{Rad}(\mathcal F) = \mathbb E\left[\sup_{f\in\mathcal F}\frac{1}{N}\sum_{n=1}^{N}\sigma_nf(x_n)\right] = f(x_1)\cdot\mathbb E\left[\frac{1}{N}\sum_{n=1}^{N}\sigma_n\right] = 0,$$

where the second equality comes from the fact that f(x_n) = f(x_1) for all n when |F| = 1. The final equation comes from the definition of the random variable σ_n.
where the second equality comes from the fact that we can find a hypothesis
such that f (x n ) = σn for all n. The final equation comes from the definition
of the random variable σn .
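When no closed form is available, the empirical Rademacher complexity (12.22) can be estimated by Monte Carlo sampling of the sign variables σ_n. The sketch below (not from the text) does this for a small finite class of linear threshold functions; the data, the class, and the number of Monte Carlo draws are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
X = rng.standard_normal((N, 2))

# a small finite hypothesis class of linear threshold functions (illustrative)
W = rng.standard_normal((20, 2))
F = np.sign(X @ W.T)            # F[n, k] = f_k(x_n) in {-1, +1}

def empirical_rademacher(F, num_sigma=5000):
    N = F.shape[0]
    sigmas = rng.choice([-1.0, 1.0], size=(num_sigma, N))
    # for each sigma, take the best correlation over the hypothesis class
    sup_corr = (sigmas @ F / N).max(axis=1)
    return sup_corr.mean()

print(f"Rad_N(F, S) ~ {empirical_rademacher(F):.3f}")
print(f"1/sqrt(N)   = {1 / np.sqrt(N):.3f}")   # typical scale for a small class
```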
Although the Rademacher complexity was originally derived above for the
binary classifiers, it can also be used to evaluate the complexity of the regression.
The following example shows that a closed form Rademacher complexity can be
obtained for ridge regression.
$$\mathrm{Rad}_N(\mathcal F, S) = \mathbb E_\sigma\left[\sup_{w:\|w\|\le W}\frac{1}{N}\sum_{n=1}^{N}\sigma_nw^\top x_n\right] = \mathbb E_\sigma\left[\sup_{w:\|w\|\le W}\frac{1}{N}\left\langle w, \sum_{n=1}^{N}\sigma_nx_n\right\rangle\right]$$
$$\overset{(a)}{=} \frac{W}{N}\,\mathbb E_\sigma\left\|\sum_{n=1}^{N}\sigma_nx_n\right\| \overset{(b)}{\le} \frac{W}{N}\sqrt{\mathbb E_\sigma\left\|\sum_{n=1}^{N}\sigma_nx_n\right\|^2} = \frac{W}{N}\sqrt{\sum_{n=1}^{N}\|x_n\|^2} \le \frac{WX}{\sqrt N},$$

where (a) comes from the dual characterization of the ℓ2 norm, and (b) comes from Jensen's inequality.
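The chain of inequalities above can be checked numerically: for the ℓ2-ball class the supremum over w is attained in closed form, so the empirical Rademacher complexity can be estimated by Monte Carlo and compared against the Jensen upper bound. The sketch below (not from the text) uses illustrative values of N, the dimension, and W.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, Wmax = 100, 5, 2.0
X = rng.standard_normal((N, d))

# For the class {x -> w.x : ||w|| <= W}, the supremum over w has the closed form
# sup = (W/N) * || sum_n sigma_n x_n ||   (step (a) in the text).
sigmas = rng.choice([-1.0, 1.0], size=(20000, N))
rad_mc = (Wmax / N) * np.linalg.norm(sigmas @ X, axis=1).mean()

# Jensen upper bound (step (b)): (W/N) * sqrt(sum_n ||x_n||^2)
bound = (Wmax / N) * np.sqrt((np.linalg.norm(X, axis=1) ** 2).sum())
print(f"Monte Carlo Rad_N(F, S) = {rad_mc:.4f}")
print(f"Jensen upper bound      = {bound:.4f}")
```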
Using the Rademacher complexity, we can now derive a new type of generaliza-
tion bound. First, we need the following concentration inequality.
Lemma 12.3 (McDiarmid's Inequality [161]) Let x_1, · · · , x_N be independent random variables taking on values in a set X and let c_1, · · · , c_N be positive real constants. If ϕ : X^N → R satisfies

$$\sup_{x_1,\cdots,x_N,\,x_n'}\left|\varphi(x_1,\cdots,x_n,\cdots,x_N) - \varphi(x_1,\cdots,x_n',\cdots,x_N)\right| \le c_n,$$

for 1 ≤ n ≤ N, then

$$P\left\{\left|\varphi(x_1,\cdots,x_N) - \mathbb E\varphi(x_1,\cdots,x_N)\right| \ge \epsilon\right\} \le 2\exp\left(-\frac{2\epsilon^2}{\sum_{n=1}^{N}c_n^2}\right). \qquad (12.25)$$
In particular, if ϕ(x_1, · · · , x_N) = (1/N)∑_{n=1}^N x_n, the inequality (12.25) reduces to Hoeffding's inequality.
Using McDiarmid’s inequality and symmetrization using “ghost samples”, we
can obtain the following generalization bound.
Theorem 12.3 (Rademacher Bound) Let S := {x_n, y_n}_{n=1}^N denote the training set and f(x) ∈ [a, b]. For any δ > 0, with probability at least 1 − δ, we have

$$R(f) \le \hat R_N(f) + 2\mathrm{Rad}_N(\mathcal F) + (b - a)\sqrt{\frac{\ln 1/\delta}{2N}}, \qquad (12.26)$$

and

$$R(f) \le \hat R_N(f) + 2\mathrm{Rad}_N(\mathcal F, S) + 3(b - a)\sqrt{\frac{\ln 2/\delta}{2N}}. \qquad (12.27)$$
So far, we have discussed performance guarantees which hold whenever the training
and test data are drawn independently from an identical distribution. In fact,
learning algorithms that satisfy such performance guarantees are referred to as probably approximately correct (PAC) learning algorithms [156]. It was shown that the concept class C
is PAC learnable if and only if the VC dimension of C is finite [162].
In addition to PAC learning, there is another important area of modern learning
theory—Bayesian inference. Bayesian inferences apply whenever the training and
test data are generated according to the specified prior. However, there is no
guarantee of an experimental environment in which training and test data are
generated according to a different probability distribution than the previous one. In
fact, much of modern learning theory can be broken down into Bayesian inference
and PAC learning. Both areas investigate learning algorithms that use training data
as the input and generate a concept or model as the output, which can then be tested
on test data.
The difference between the two approaches can be seen as a trade-off between
generality and performance. We define an “experimental setting” as a probability
distribution over training and test data. A PAC performance guarantee applies to a
wide class of experimental settings. A Bayesian correctness theorem applies only
to experimental settings that match those previously used in the algorithm. In this
restricted class of settings, however, the Bayesian learning algorithm can be optimal
and generally outperforms the PAC learning algorithms.
The PAC–Bayesian theory combines Bayesian and frequentist approaches [163].
The PAC–Bayesian theory is based on a prior probability distribution concerning the “situation” occurring in nature, together with a prior over “rules” that expresses a learner's preference for some rules over others. There is no supposed relationship between the learner's bias for rules and the distribution in nature. This differs from Bayesian inference, where the starting point is a joint distribution over rules and situations, which induces a conditional distribution of rules in certain situations.
Under this set-up, the following PAC–Bayes generalization bound can be
obtained.
Theorem 12.4 (PAC–Bayes Generalization Bound [163]) Let Q be an arbitrary distribution over z := (x, y) ∈ Z := X × Y. Let F be a hypothesis class and let ℓ be a loss function such that for all f and z we have ℓ(f, z) ∈ [0, 1]. Let P be a prior distribution over F and let δ ∈ (0, 1). Then, with probability of at least 1 − δ over the choice of an i.i.d. training set S := {z_n}_{n=1}^N sampled according to Q, for all distributions Q over F (even such that depend on S), we have

$$\mathbb E_{f\sim Q}\left[R(f)\right] \le \mathbb E_{f\sim Q}\left[\hat R_N(f)\right] + \sqrt{\frac{KL(Q\|P) + \ln N/\delta}{2(N-1)}}, \qquad (12.28)$$

where

$$KL(Q\|P) := \mathbb E_{f\sim Q}\left[\ln\frac{Q(f)}{P(f)}\right] \qquad (12.29)$$

denotes the Kullback–Leibler divergence between Q and P.
12.3 Reconciling the Generalization Gap via Double Descent Model

Recall that the following error bound can be obtained for the ERM estimate in (12.2):

$$R(f^*_{ERM}) \le \underbrace{\hat R_N(f^*_{ERM})}_{\text{empirical risk (training error)}} + \underbrace{O\left(\sqrt{\frac{c}{N}}\right)}_{\text{complexity penalty}}, \qquad (12.30)$$

where O(·) denotes the “big O” notation and c refers to the model complexity such as the VC dimension, Rademacher complexity, etc.
In (12.30), with increasing hypothesis class size |F|, the empirical risk or
training error decreases, whereas the complexity penalty increases. The control of
the functional class capacity can be therefore done explicitly by choosing F (e.g.
selection of the neural network architecture). This is summarized in the classic U-
shaped risk curve, which is shown in Fig. 12.3a and was often used as a guide for
model selection. A widely accepted view from this curve is that a model with zero
training error is overfitted to the training data and will typically generalize poorly
[10]. Classical thinking therefore deals with the search for the “sweet spot” between
underfitting and overfitting.
Lately, this view has been challenged by empirical results that seem mysterious.
For example, in [165] the authors trained several standard architectures on a copy
of the data, with the true labels being replaced by random labels. Their central
finding can be summarized as follows: deep neural networks easily fit random labels.
More precisely, neural networks achieve zero training errors if they are trained on
a completely random labeling of the true data. While this observation is easy to
formulate, it has profound implications from a statistical learning perspective: the
effective capacity of neural networks is sufficient to store the entire data set. Despite
the high capacity of the functional classes and the almost perfect fit to training data,
these predictors often give very accurate predictions for new data in the test phase.
These observations rule out VC dimension, Rademacher complexity, etc. from
describing the generalization behavior. In particular, the Rademacher complexity
for the interpolation regime, which leads to a training error of 0, assumes the
maximum value of 1, as previously explained in an example. Therefore, the classic
generalization bounds are vacuous and cannot explain the amazing generalization
ability of the neural network.
The recent breakthrough in Belkin et al.’s “double descent” risk curve [154, 155]
reconciles the classic bias–variance trade-off with behaviors that have been observed
in over-parameterized regimes for a large number of machine learning models. In
particular, when the functional class capacity is below the “interpolation threshold”,
learned predictors show the classic U-shaped curve from Fig. 12.3a, where the
function class capacity is identified with the number of parameters needed to specify
a function within the class. The bottom of the U-shaped risk can be achieved at
Fig. 12.3 Curves for training risk (dashed line) and test risk (solid line). (a) The classical U-
shaped risk curve arising from the bias–variance trade-off. (b) The double descent risk curve,
which incorporates the U-shaped risk curve (i.e., the “classical” regime) together with the observed
behavior from using high-capacity function classes (i.e., the “modern” interpolating regime),
separated by the interpolation threshold. The predictors to the right of the interpolation threshold
have zero training risk
the sweet spot which balances the fit to the training data and the susceptibility
to over-fitting. When we increase the function class capacity high enough by
increasing the size of the neural network architecture, the learned predictors achieve
(almost) perfect fits to the training data. Although the learned predictors obtained
at the interpolation threshold typically have high risk, increasing the function class
capacity beyond this point leads to decreasing risk, which typically falls below the
risk achieved at the sweet spot in the “classic” regime (see Fig. 12.3b).
In the following example we provide concrete and explicit evidence for the
double descent behavior in the context of simple linear regression models. The
analysis shows the transition from under- to over-parameterized regimes. It also
allows us to compare the risks at any point on the curve and explain how the risk in
the over-parameterized regime can be lower than any risk in the under-parameterized
regime.
Consider the linear model

$$y = x^\top\beta + \epsilon, \qquad (12.31)$$

where β ∈ R^D, and x and ε are a normal random vector and variable with x ∼ N(0, I_D) and ε ∼ N(0, σ²). Given training data {x_n, y_n}_{n=1}^N, we fit a linear model to the data using only a subset T ⊂ [D] of cardinality p, where [D] := {1, · · · , D}. Let X = [x_1, · · · , x_N] ∈ R^{D×N} be the design matrix and y = [y_1, · · · , y_N]^⊤ the vector of responses. For a subset T, we use β_T to denote its |T|-dimensional subvector of entries from T; we also use X_T to denote the N × p sub-matrix of X^⊤ whose columns are indexed by T. Then, the risk of β̂, where β̂_T = X_T^† y and β̂_{T^c} = 0, is given by

$$\mathbb E\left[(y - x^\top\hat\beta)^2\right] = \begin{cases}
\left(\|\beta_{T^c}\|^2 + \sigma^2\right)\left(1 + \dfrac{p}{N-p-1}\right), & \text{if } p \le N-2,\\[4pt]
\infty, & \text{if } N-1 \le p \le N+1,\\[4pt]
\|\beta_T\|^2\left(1 - \dfrac{N}{p}\right) + \left(\|\beta_{T^c}\|^2 + \sigma^2\right)\left(1 + \dfrac{N}{p-N-1}\right), & \text{if } p \ge N+2.
\end{cases} \qquad (12.32)$$
(Classical regime) For p ≤ N − 2, the risk can be decomposed as

$$\mathbb E\left[(y - x^\top\hat\beta)^2\right] = \sigma^2 + \|\beta_{T^c}\|^2 + \mathbb E\left[\|\beta_T - \hat\beta_T\|^2\right],$$

where

$$\hat\beta_T = (X_T^\top X_T)^{-1}X_T^\top y = (X_T^\top X_T)^{-1}X_T^\top X_T\beta_T + (X_T^\top X_T)^{-1}X_T^\top\eta = \beta_T + (X_T^\top X_T)^{-1}X_T^\top\eta,$$

and

$$\eta := y - X_T\beta_T = \epsilon + X_{T^c}\beta_{T^c}.$$

In addition, we have

$$\mathbb E\left[\eta\eta^\top\right] = \mathbb E\left[\epsilon\epsilon^\top\right] + \mathbb E\left[X_{T^c}\beta_{T^c}\beta_{T^c}^\top X_{T^c}^\top\right] = \left(\sigma^2 + \|\beta_{T^c}\|^2\right)I_N,$$

where R(X_T) denotes the range space of X_T and P_{R(X_T)} the projection onto it. Furthermore, the relevant quadratic form follows Hotelling's T-squared distribution with parameters p and N − p + 1, so that

$$\mathrm{Tr}\,\mathbb E\left[(X_T^\top X_T)^{-1}\right] = \begin{cases}\dfrac{p}{N-p-1}, & \text{if } p \le N-2,\\[4pt] +\infty, & \text{if } p = N-1.\end{cases} \qquad (12.33)$$
Therefore, by putting them together we conclude the proof for the classical
regime.
(Modern interpolating regime) We consider p ≥ N. Then, we have

$$\hat\beta_T = X_T^\top(X_TX_T^\top)^{-1}y = X_T^\top(X_TX_T^\top)^{-1}X_T\beta_T + X_T^\top(X_TX_T^\top)^{-1}\eta = P_{R(X_T^\top)}\beta_T + X_T^\top(X_TX_T^\top)^{-1}\eta,$$

where

$$\eta := y - X_T\beta_T = \epsilon + X_{T^c}\beta_{T^c}.$$

Therefore,

$$\mathbb E\left[\|\beta_T - \hat\beta_T\|^2\right] = \mathbb E\left[\left\|P^\perp_{R(X_T^\top)}\beta_T\right\|^2\right] + \mathbb E\left[\eta^\top(X_TX_T^\top)^{-1}\eta\right].$$

Furthermore, we have

$$\mathbb E\left[\left\|P^\perp_{R(X_T^\top)}\beta_T\right\|^2\right] = \left(1 - \frac{N}{p}\right)\|\beta_T\|^2, \qquad
\mathbb E\left[\eta^\top(X_TX_T^\top)^{-1}\eta\right] = \mathrm{Tr}\left(\mathbb E\left[(X_TX_T^\top)^{-1}\right]\mathbb E\left[\eta\eta^\top\right]\right),$$

where we use the independence between X_T, X_{T^c}, and ε for the second equality. In addition, we have

$$\mathbb E\left[\eta\eta^\top\right] = \mathbb E\left[\epsilon\epsilon^\top\right] + \mathbb E\left[X_{T^c}\beta_{T^c}\beta_{T^c}^\top X_{T^c}^\top\right] = \left(\sigma^2 + \|\beta_{T^c}\|^2\right)I_N.$$

Combining these terms yields the expression in (12.32) for p ≥ N + 2.
Figure 12.4 illustrates an example plot for the linear regression problem analyzed
above for a particular parameter set.
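The risk formula (12.32) can be evaluated directly to reproduce a curve of this type. In the sketch below (not from the text), ‖β‖², σ², and N follow the caption of Fig. 12.4, while the total dimension D and the assumption that a random subset T carries a p/D fraction of the signal energy are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Risk (12.32) under random selection of T, so that on average
# ||beta_T||^2 = (p/D)||beta||^2 and ||beta_Tc||^2 = (1 - p/D)||beta||^2.
D, N, sigma2, beta2 = 100, 40, 1 / 25, 1.0   # D is an illustrative choice

def risk(p):
    beta_Tc2 = (1 - p / D) * beta2
    if p <= N - 2:
        return (beta_Tc2 + sigma2) * (1 + p / (N - p - 1))
    if p >= N + 2:
        return (p / D) * beta2 * (1 - N / p) + (beta_Tc2 + sigma2) * (1 + N / (p - N - 1))
    return np.inf                            # N - 1 <= p <= N + 1

ps = np.arange(1, D + 1)
plt.plot(ps, [risk(p) for p in ps])
plt.axvline(N, linestyle="--")               # interpolation threshold p = N
plt.xlabel("p"); plt.ylabel("risk"); plt.ylim(0, 3)
plt.show()
```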
All learned predictors to the right of the interpolation threshold fit perfectly with
the training data and have no empirical risk. Then, why should some—especially
those from larger functional classes—have a lower test risk than others so that
they generalize better? The answer is that the functional class capacity, such
as VC dimension, or Rademacher complexity, does not necessarily reflect the
inductive bias of the predictor appropriate for the problem at hand. Indeed, one
Fig. 12.4 Plot of the risk in (12.32) as a function of p under the random selection of T. Here ‖β‖² = 1, σ² = 1/25, and N = 40
of the underlying reasons for the appearance of the double descent behavior in the previous linear regression problem is that we impose an inductive bias by choosing the minimum-norm solution β̂_T = X_T^⊤(X_T X_T^⊤)^{-1} y in the over-parameterized regime, which leads to a smooth solution.
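A minimal numerical sketch of this inductive bias (not from the text): in the over-parameterized regime the pseudoinverse returns exactly the minimum-norm interpolating solution, while any null-space perturbation interpolates equally well but has a larger norm. The problem sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 40, 120                          # over-parameterized: p > N (illustrative sizes)
XT = rng.standard_normal((N, p))
y = rng.standard_normal(N)

# The pseudoinverse gives the minimum-norm interpolator beta = X_T^T (X_T X_T^T)^{-1} y.
beta_min = np.linalg.pinv(XT) @ y

# Adding a null-space direction keeps interpolation but increases the norm.
_, _, Vt = np.linalg.svd(XT)
beta_other = beta_min + 5.0 * Vt[-1]    # Vt[-1] lies in the null space since rank = N < p

for name, b in [("minimum-norm", beta_min), ("other interpolator", beta_other)]:
    print(f"{name:18s} residual = {np.linalg.norm(XT @ b - y):.1e}   ||beta|| = {np.linalg.norm(b):.2f}")
```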
Among the various interpolation solutions, choosing the smooth or simple
function that perfectly fits the observed data is a form of Occam’s razor: the simplest
explanation compatible with the observations should be preferred. By considering
larger functional classes that contain more candidate predictors that are compatible
with the data, we can find interpolation functions that are “simpler”. Increasing
the functional class capacity thus improves the performance of classifiers. One
of the important advantages of choosing a simpler solution is that it is easy to
generalize by avoiding unnecessary glitches in the data. Increasing the functional
class capacity to the over-parameterized area thus improves the performance of the
resulting classifiers.
Then, one of the remaining questions is: what is the underlying mechanism by
which a trained network becomes smooth or simple? This is closely related to
the inductive bias (or implicit bias) of an optimization algorithm such as gradient
descent, stochastic gradient descent (SGD), etc. [166–171]. Indeed, this is an active
area of research. For example, the authors in [168] show that gradient descent for a linear classifier with a specific loss function leads to the maximum margin SVM classifier. Other researchers have shown that gradient descent in deep neural
network training leads to a simple solution [169–171].
12.5 Generalization Bounds via Algorithm Robustness

Another important question is how we can quantify the inductive bias of the algorithm in terms of a generalization error bound. In this section, we introduce a notion of algorithmic robustness for quantifying the generalization error, which
was originally proposed in [172], but has been largely neglected in deep learning
research. It turns out that the generalization bound based on algorithmic robustness
has all the ingredients to quantify the fascinating generalization behavior of the deep
neural network, so it can be a useful tool for studying generalization.
Recall that the underlying assumption for the classical generalization bounds is
the uniform convergence of empirical quantities to their mean [10], which provides
ways to bound the gap between the expected risk and the empirical risk by the
complexity of the hypothesis set. On the other hand, robustness requires that a
prediction rule has comparable performance if tested on a sample close to a training
sample. This is formally defined as follows.
Definition 12.5 (Algorithm Robustness [172]) Algorithm A is said to be (K, ε(·))-robust, for K ∈ N and ε(·) : Z^N → R, if Z := X × Y can be partitioned into K disjoint sets, denoted by {C_i}_{i=1}^K, such that the following holds for all training sets S ⊂ Z: for all s ∈ S and all z ∈ Z,

$$s, z \in C_i \;\Longrightarrow\; |\ell(A_S, s) - \ell(A_S, z)| \le \epsilon(S), \qquad (12.34)$$

for all i = 1, · · · , K, where A_S denotes the algorithm A trained with the data set S.
Then, we can obtain the generalization bound based on algorithmic robustness.
First, we need the following concentration inequality.
Lemma 12.4 (Bretagnolle–Huber–Carol Inequality [173]) If the random vector (N_1, · · · , N_k) is multinomially distributed with parameters N and (p_1, · · · , p_k), then

$$P\left\{\sum_{i=1}^{k}|N_i - Np_i| \ge 2\sqrt N\lambda\right\} \le 2^k\exp\left(-2\lambda^2\right), \qquad \lambda > 0. \qquad (12.35)$$
Theorem 12.5 (Generalization Bound via Robustness [172]) If a learning algorithm A is (K, ε(·))-robust, then for any δ > 0, with probability at least 1 − δ over an i.i.d. training set S of size N,

$$\left|R(A_S) - \hat R_N(A_S)\right| \le \epsilon(S) + M\sqrt{\frac{2K\ln 2 + 2\ln(1/\delta)}{N}}, \qquad (12.36)$$

where M := max_{z∈Z} |ℓ(A_S, z)|.
Proof Let N_i be the set of indices of the points of S that fall into C_i. Note that (|N_1|, · · · , |N_K|) is a multinomially distributed random vector with parameters N and (μ(C_1), · · · , μ(C_K)). Then, the following holds by Lemma 12.4:
$$P\left\{\sum_{i=1}^{K}\left|\frac{|N_i|}{N} - \mu(C_i)\right| \ge \lambda\right\} \le 2^K\exp\left(-\frac{N\lambda^2}{2}\right). \qquad (12.37)$$

Hence, with probability at least 1 − δ,

$$\sum_{i=1}^{K}\left|\frac{|N_i|}{N} - \mu(C_i)\right| \le \sqrt{\frac{2K\ln 2 + 2\ln(1/\delta)}{N}}. \qquad (12.38)$$
Therefore,

$$\left|R(A_S) - \hat R_N(A_S)\right| \le \left|\sum_{i=1}^{K}\mathbb E_{z\sim\mu}\left[\ell(A_S, z)\,|\,z\in C_i\right]\mu(C_i) - \frac{1}{N}\sum_{n=1}^{N}\ell(A_S, s_n)\right|$$
$$\overset{(a)}{\le} \left|\sum_{i=1}^{K}\mathbb E_{z\sim\mu}\left[\ell(A_S, z)\,|\,z\in C_i\right]\frac{|N_i|}{N} - \frac{1}{N}\sum_{n=1}^{N}\ell(A_S, s_n)\right|
+ \left|\sum_{i=1}^{K}\mathbb E_{z\sim\mu}\left[\ell(A_S, z)\,|\,z\in C_i\right]\mu(C_i) - \sum_{i=1}^{K}\mathbb E_{z\sim\mu}\left[\ell(A_S, z)\,|\,z\in C_i\right]\frac{|N_i|}{N}\right|$$
$$\overset{(b)}{\le} \frac{1}{N}\left|\sum_{i=1}^{K}\sum_{j\in N_i}\max_{z_2\in C_i}\left|\ell(A_S, s_j) - \ell(A_S, z_2)\right|\right| + \max_{z\in\mathcal Z}\left|\ell(A_S, z)\right|\sum_{i=1}^{K}\left|\frac{|N_i|}{N} - \mu(C_i)\right|$$
$$\overset{(c)}{\le} \epsilon(S) + M\sum_{i=1}^{K}\left|\frac{|N_i|}{N} - \mu(C_i)\right| \overset{(d)}{\le} \epsilon(S) + M\sqrt{\frac{2K\ln 2 + 2\ln(1/\delta)}{N}},$$

where (a), (b), and (c) are due to the triangle inequality, the definition of N_i, and the definitions of ε(S) and M, respectively, and (d) follows from (12.38).
Note that the definition of robustness requires that (12.34) holds for every training
sample. The parameters K and ε(·) quantify the robustness of an algorithm. Since ε(·) is a function of the training samples, an algorithm can have different robustness properties for different training patterns. For example, a classification algorithm is more robust to a training set with a larger margin. Since (12.34) includes both the
trained solution AS and the training set S, robustness is a property of the learning
algorithm, rather than the property of the “effective hypothesis space”. This is why
the robustness-based generalization bound can account for the inductive bias from
the algorithm.
For example, consider the case of a single-layer ReLU neural network f : R² → R² with the following weight matrix and bias:

$$W^{(0)} = \begin{bmatrix} 2 & -1 \\ 1 & 1 \end{bmatrix}, \qquad b^{(0)} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}.$$
Therefore, in spite of the twice larger parameter size, the number of partitions is K = 4, which is the same as for the single-layer neural network. Therefore, in terms of the generalization bounds, the two algorithms have the same upper bound up to the parameter ε(S). This example clearly confirms that generalization is a property of the learning algorithm, rather than a property of the effective hypothesis space or the number of parameters.
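The partition count K for such a ReLU network can be checked numerically by enumerating the activation patterns over a grid of inputs. The sketch below (not from the text) does this for the single-layer network above; the grid range and resolution are illustrative assumptions.

```python
import numpy as np

W0 = np.array([[2.0, -1.0], [1.0, 1.0]])
b0 = np.array([1.0, -1.0])

# sample the input plane and record the ReLU on/off pattern at each point
xs = np.linspace(-5, 5, 401)
grid = np.array([[x, y] for x in xs for y in xs])
patterns = (grid @ W0.T + b0 > 0)
K = len({tuple(p) for p in patterns})
print("number of activation regions K =", K)   # 4 regions for this network
```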
12.6 Exercises
for γ > 0.
Chapter 13
Generative Models and Unsupervised
Learning
13.1 Introduction
The last part of our voyage toward the understanding of the geometry of deep learn-
ing concerns perhaps the most exciting aspect of deep learning—generative models.
Generative models cover a large spectrum of research activities, which include the
variational autoencoder (VAE) [174, 175], generative adversarial network (GAN)
[88, 176, 177], normalizing flow [178–181], optimal transport (OT) [182–184], etc.
This field has evolved very quickly, and at any machine learning conference like
NeurIPS, CVPR, ICML, ICLR, etc., you may have seen exciting new developments
that far surpass existing approaches. In fact, this may be one of the excuses why the
writing of this chapter has been deferred till the last minute, since there could be
new updates during the writing.
For example, Fig. 13.1 shows the examples of fake human faces generated by
various generative models starting from the GAN[88] in 2014 to styleGAN[89]
in 2018. You may be amazed to see how the images become so realistic with so
much detail within such a short time period. In fact, this may be another reason why
DeepFake by generative models has become a societal problem in the modern deep
learning era.
Besides creating fake faces, another reason that a generative model is so
important is that it is a systematic means of designing unsupervised learning
algorithms. For example, in Yann LeCun’s famous cake analogy at NeurIPS 2016,
he emphasized the importance of unsupervised learning by saying “If intelligence
is a cake, the bulk of the cake is unsupervised learning, the icing on the cake is
supervised learning, and the cherry on the cake is reinforcement learning (RL).”
Referring to the GAN, Yann LeCun said that it was “the most interesting idea in the
last 10 years in machine learning,” and predicted that it may become one of the most
important engines for modern unsupervised learning.
Despite their popularity, one of the reasons generative models are difficult to understand is that there are so many variations, such as the VAE [174], β-VAE [175],
GAN [88], f-GAN [176], W-GAN [177], normalizing flow [178–180], GLOW [181], optimal transport [182–184], cycleGAN [185], starGAN [87],
CollaGAN [186], to name just a few. Moreover, the modern deep generative models,
in particular GANs, have been characterized by the public media as magical black
boxes which can generate anything from nothing. Therefore, one of the main goals
of this chapter is to demystify the public belief of generative models by providing a
coherent geometric picture of generative models.
Specifically, our unified geometric view starts from Fig. 13.2. Here, the ambient
image space is X, where we can take samples with the real data distribution μ.
If the latent space is Z, the generator G can be treated as a mapping from the
latent space to the ambient space, G : Z → X, often realized by a deep network
with parameter θ , i.e. G := Gθ . Let ζ be a fixed distribution on the latent space,
such as uniform or Gaussian distribution. The generator Gθ pushes forward ζ to a
distribution μθ = Gθ# ζ in the ambient space X (don’t worry about the term “push-
forward” at this point, as it will be explained later). Then, the goal of the generative
model training is to make μθ as close as possible to the real data distribution μ.
Additionally, for the case of autoencoding generative model, the generator works as
a decoder, and there exists an additional encoder. More specifically, an encoder F
maps from the sample space to the latent space F : X → Z, parameterized by φ,
i.e. F = Fφ so that the encoder pushes forward μ to a distribution ζφ = Fφ# μ in the
latent space. Accordingly, the additional constraint is again to minimize the distance
between ζφ and ζ .
Using this unified geometric model, we can show that various types of generative
models such as VAE, β-VAE, GAN, OT, normalizing flow, etc. only differ in their
choices of distances between μθ and μ or between ζφ and ζ , and how to train the
generator and encoder to minimize the distances.
Therefore, this chapter is structured somewhat differently from the conventional
approaches to describing generative models. Rather than directly diving into specific
details of each generative model, here we try to first provide a unified theoretical
view, and then derive each generative model as a special case. Specifically, we
first provide a brief review of probability theory, statistical distances, and optimal
transport theory [182, 184]. Using these tools, we discuss in detail how each specific
algorithm can be derived by simply changing the choice of statistical distance.
In this section, we assume that the readers are familiar with basic probability and
measure theory [2]. For more background on the formal definition of probability
space and related terms from the measure theory, see Chap. 1.
Definition 13.1 (Push-Forward of a Measure) Let (X, F, μ) be a probability space, let Y be a set, and let f : X → Y be a function. The push-forward of μ by f is the probability measure ν : f(F) → [0, 1] defined by

$$\nu(B) := \mu\left(f^{-1}(B)\right), \qquad B \in f(\mathcal F).$$

Accordingly, we can regard the random variable X as pushing forward the measure μ onto a measure ν on R.
One important appearance of the Radon–Nikodym derivative in probability theory is the probability density function (pdf) or probability mass function (pmf), as discussed below. The Radon–Nikodym derivative is also a key to defining an f-divergence as a statistical distance measure.
This is often called the discrete cumulative distribution function (cdf), and for
this discrete case, it increases stepwise. Then, the corresponding probability
measure is
$$P(A) = \sum_{i:\,a_i\in A}p_i. \qquad (13.5)$$
$$f(a_i) = p_i, \qquad i = 1, 2, \cdots, \qquad (13.7)$$
where f (y) is the probability density function (pdf). Then, the corresponding
probability belonging to an interval A can be computed by
$$P(A) = \int_A f(y)\,dy \qquad (13.9)$$
for any interval A. Therefore, we can easily see that the pdf f is the Radon–
Nikodym derivative with respect to the Lebesgue measure.
Although the Radon–Nikodym derivative is used to derive the pdf and pmf, it
is a more general concept often used for any integral operation with respect to a
measure. The following proposition is quite helpful for evaluating integrals with
respect to a push-forward measure.
Proposition 13.1 (Change-of-Variable Formula) Let (X, F, μ) be a probability
space, and let f : X → Y be a function, such that a push-forward measure ν is
defined by ν = f# μ. Then, we have
gdν = g ◦ f dμ, (13.10)
Y X
As discussed before, the distance in the probability space is one of the key concepts
for understanding the generative models. In statistics, a statistical distance quantifies
the distance between two statistical objects, which can be two random variables, or
two probability distributions or samples. The distance can be between an individual
sample point and a population or a wider sample of points.
13.3.1 f -Divergence
Definition 13.3 (f-Divergence) Let μ and ν be two probability distributions over a common space such that μ ≪ ν. Then, for a convex function f such that f(1) = 0, the f-divergence of μ from ν is defined as

$$D_f(\mu\|\nu) := \int f\left(\frac{d\mu}{d\nu}\right)d\nu. \qquad (13.11)$$
One thing which is very important and should be treated carefully is the condition μ ≪ ν. For example, if μ is the measure of the original data and ν is the distribution of the generated data, their absolute continuity w.r.t. each other should be checked first to choose the right form of divergence.
For the discrete case, when Q(x) and P(x) become the respective probability mass functions, the f-divergence can be written as

$$D_f(P\|Q) := \sum_x Q(x)\,f\left(\frac{P(x)}{Q(x)}\right). \qquad (13.13)$$
Depending on the choice of the convex function f , we can obtain various special
cases. Some of the representative special cases are as follows.
For example, the KL divergence corresponds to the generator f(t) = t log t, for which

$$D_{KL}(P\|Q) = \sum_x P(x)\log\frac{P(x)}{Q(x)} = H(P, Q) - H(P), \qquad (13.14)$$

where H(P, Q) denotes the cross entropy and H(P) the entropy of P.
Using this, we can show that the JS divergence is closely related to the KL divergence as:

$$D_{JS}(P\|Q) = \frac{1}{2}D_{KL}(P\|M) + \frac{1}{2}D_{KL}(Q\|M), \qquad (13.17)$$

where M = (P + Q)/2.
Note that the JS divergence has important advantages over the KL divergence. Since M = (P + Q)/2, we can always guarantee P ≪ M and Q ≪ M. Therefore, the Radon–Nikodym derivatives dP/dM and dQ/dM are always well defined and the f-divergence in (13.11) can be obtained. On the other hand, to use the KL divergence D_KL(P||Q) or D_KL(Q||P), we should have P ≪ Q or Q ≪ P, respectively, which is difficult to know a priori in practice.
The generators for other forms of f -divergence are defined in Table 13.1. Later,
we will show that various types of GAN architecture emerge depending on the
choice of the generator.
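For discrete distributions, (13.13) and (13.17) can be computed directly, which also makes the absolute-continuity issue concrete: the KL divergence blows up as soon as the second argument assigns zero mass where the first does not, whereas the JS divergence stays finite. A minimal sketch (not from the text; the two distributions are illustrative assumptions):

```python
import numpy as np

def kl(p, q):
    # D_KL(P || Q) = sum_x P(x) log(P(x)/Q(x)); finite only when P << Q
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)     # Eq. (13.17)

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.1, 0.1, 0.8])
print("JS(P||Q) =", round(js(p, q), 4))        # always finite and well defined
print("KL(Q||P) =", kl(q, p))                  # inf: Q is not absolutely continuous w.r.t. P
```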
13.3.2 Wasserstein Metric

Unlike the f-divergence, the Wasserstein metric satisfies all four properties of a metric in Definition 1.1. Therefore, it becomes a powerful way of
measuring distance in the probability space. For example, to define an f -divergence,
we should always check the absolute continuity w.r.t. each other, which is difficult
in practice. In the Wasserstein metric, such hassles are no longer necessary.
Let (M, d) be a metric space with a metric d. For p ≥ 1, let Pp (M) denote the
collection of all probability measures μ on M with a finite p-th moment. Then, the Wasserstein p-distance between two probability measures μ, ν ∈ P_p(M) is defined as

$$W_p(\mu, \nu) := \left(\inf_{\pi\in\Pi(\mu,\nu)}\int_{M\times M}d(x, y)^p\,d\pi(x, y)\right)^{1/p},$$

where Π(μ, ν) denotes the set of all couplings of μ and ν, i.e., joint probability measures on M × M whose marginals are μ and ν.
In the one-dimensional case, the Wasserstein distance has a closed form in terms of the cumulative distribution functions F and G of μ and ν:

$$W_p(\mu, \nu) = \left(\int_0^1\left|F^{-1}(z) - G^{-1}(z)\right|^p dz\right)^{1/p}. \qquad (13.21)$$

For two Gaussian measures μ = N(m_1, Σ_1) and ν = N(m_2, Σ_2), the Wasserstein-2 distance also has a closed form:

$$W_2(\mu, \nu)^2 = \|m_1 - m_2\|^2 + B^2(\Sigma_1, \Sigma_2), \qquad (13.22)$$

where

$$B^2(\Sigma_1, \Sigma_2) = \mathrm{Tr}(\Sigma_1) + \mathrm{Tr}(\Sigma_2) - 2\,\mathrm{Tr}\left(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}\right)^{1/2}, \qquad (13.23)$$
13.4 Optimal Transport

In optimal transport, we seek a transport map T : X → Y that transports a measure μ on X to a measure ν on Y, which is simply the push-forward of the measure, i.e., ν = T_#μ. See Fig. 13.4 for an example of an optimal transport.
Suppose there is a cost function c : X×Y → R∪{∞} such that c(x, y) represents
the cost of moving one unit of mass from x ∈ X to y ∈ Y. Monge’s original OT
problem [182, 184] is then to find a transport map T that transports μ to ν at the
minimum total transportation cost:
A
min M(T ) := X c(x, T (x))dμ(x) (13.25)
T
subject to ν = T# μ.
$$W_p(\mu, \nu) = \left(\int_0^1\left|F^{-1}(z) - G^{-1}(z)\right|^p dz\right)^{1/p} = \left(\int_{\mathbb R}\left|x - G^{-1}(F(x))\right|^p dF(x)\right)^{1/p}. \qquad (13.26)$$
Therefore, for the given transport cost c(x, y) = |x − y|^p, we can see that Monge's optimal transport map in one dimension is given by T = G^{-1} ∘ F. Similarly, for two Gaussian measures μ = N(m_1, Σ_1) and ν = N(m_2, Σ_2) and the quadratic cost, Monge's optimal transport map is given by

$$T : x \mapsto m_2 + A(x - m_1), \qquad (13.27)$$

where

$$A = \Sigma_1^{-1/2}\left(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}\right)^{1/2}\Sigma_1^{-1/2}. \qquad (13.28)$$
In the Kantorovich formulation, instead of a deterministic map, one seeks a joint measure (transport plan) π on X × Y that minimizes the total cost

$$\min_\pi\int_{X\times Y}c(x, y)\,d\pi(x, y) \quad\text{subject to}\quad \pi(A\times Y) = \mu(A),\;\; \pi(X\times B) = \nu(B),$$

for all measurable sets A ∈ X and B ∈ Y. Here, the last two constraints come from the observation that the total amount of mass removed from any measurable set has to be equal to the marginal distributions [182, 184].
Another important advantage of the Kantorovich formulation is the dual formu-
lation, as stated in the following theorem:
Theorem 13.2 (Kantorovich Duality Theorem [182, Theorem 5.10, pp. 57–59]) Let (X, μ) and (Y, ν) be two probability spaces and let c : X × Y → R be a continuous cost function, such that |c(x, y)| ≤ c_X(x) + c_Y(y) for some c_X ∈ L¹(μ) and c_Y ∈ L¹(ν), where L¹(μ) denotes a Lebesgue space of integrable functions with respect to the measure μ. Then, there is a duality:

$$\min_{\pi\in\Pi(\mu,\nu)}\int_{X\times Y}c(x, y)\,d\pi(x, y)$$
$$= \sup_{\varphi\in L^1(\mu)}\left[\int_X\varphi(x)\,d\mu(x) + \int_Y\varphi^c(y)\,d\nu(y)\right] \qquad (13.31)$$
$$= \sup_{\psi\in L^1(\nu)}\left[\int_X\psi^c(x)\,d\mu(x) + \int_Y\psi(y)\,d\nu(y)\right], \qquad (13.32)$$
where

$$\Pi(\mu, \nu) := \left\{\pi : \pi(A\times Y) = \mu(A),\; \pi(X\times B) = \nu(B)\ \text{for all measurable}\ A, B\right\}, \qquad (13.33)$$

and the above maximum is taken over the so-called Kantorovich potentials ϕ and ψ, whose c-transforms are defined as

$$\varphi^c(y) := \inf_x\left[c(x, y) - \varphi(x)\right], \qquad \psi^c(x) := \inf_y\left[c(x, y) - \psi(y)\right].$$

For example, for the quadratic cost c(x, y) = ½‖x − y‖², we have

$$\varphi^c(y) = \inf_x\left[\frac{1}{2}\|x - y\|^2 - \varphi(x)\right] = \inf_x\left[\frac{\|x\|^2}{2} + \frac{\|y\|^2}{2} - \langle x, y\rangle - \varphi(x)\right],$$

which leads to

$$\frac{\|y\|^2}{2} - \varphi^c(y) = \sup_x\left[\langle x, y\rangle - \left(\frac{\|x\|^2}{2} - \varphi(x)\right)\right] = \left(\frac{\|\cdot\|^2}{2} - \varphi\right)^*(y).$$

Therefore, we have

$$\varphi^c(x) = \frac{\|x\|^2}{2} - \left(\frac{\|\cdot\|^2}{2} - \varphi\right)^*(x).$$
where Lip1 (X) = {ϕ ∈ L1 (μ) : |ϕ(x) − ϕ(y)| ≤ ||x − y||}. Compared to the primal
form (13.36) which requires the integration with respect to the joint measures,
the dual formulation in (13.37) just requires the marginals μ and ν, which makes the computation much more tractable. This is why the dual form is more widely used in
generative models.
To make the computation more tractable, one can consider the following entropy-regularized optimal transport problem:

$$\min_{\pi\in\Pi(\mu,\nu)}\int_{X\times Y}c(x, y)\,d\pi(x, y) + \gamma\int_{X\times Y}\pi(x, y)\left(\log\pi(x, y) - 1\right)d(x, y), \qquad (13.38)$$

where Π(μ, ν) denotes the set of joint distributions whose marginal distributions are μ(x) and ν(y), respectively. Then, the following proposition shows that the associated dual problem has a very interesting formulation.
Proposition 13.2 The dual of the primal problem in (13.38) is given by

$$\sup_{\phi,\varphi}\int_X\phi(x)\,d\mu(x) + \int_Y\varphi(y)\,d\nu(y) - \gamma\int_{X\times Y}\exp\left(\frac{-c(x, y) + \phi(x) + \varphi(y)}{\gamma}\right)d(x, y). \qquad (13.39)$$
Proof Using the convex conjugate formulation in Chap. 1, we know that e^x is the convex conjugate of x log x − x for x > 0. Accordingly, we have

$$\sup_{\phi,\varphi}\int_X\phi\,d\mu + \int_Y\varphi\,d\nu - \gamma\int_{X\times Y}\exp\left(\frac{-c + \phi + \varphi}{\gamma}\right)d(x, y)$$
$$= \sup_{\phi,\varphi}\int_X\phi\,d\mu + \int_Y\varphi\,d\nu + \inf_{\pi>0}\int_{X\times Y}(c - \phi - \varphi)\,d\pi + \gamma\left(\pi\log\pi - \pi\right)d(x, y)$$
$$= \inf_{\pi>0}\int_{X\times Y}c\,\pi + \gamma\,\pi(\log\pi - 1)\,d(x, y) + \inf_{\pi>0}\sup_{\phi,\varphi}\int_X\phi\,d\mu - \int_{X\times Y}\phi\,d\pi + \int_Y\varphi\,d\nu - \int_{X\times Y}\varphi\,d\pi.$$

Under the constraint that π ∈ Π(μ, ν), the last four terms vanish. Therefore, we have

$$\sup_{\phi,\varphi}\int_X\phi(x)\,d\mu(x) + \int_Y\varphi(y)\,d\nu(y) - \gamma\int_{X\times Y}\exp\left(\frac{-c(x, y) + \phi(x) + \varphi(y)}{\gamma}\right)d(x, y)$$
$$= \inf_{\pi\in\Pi(\mu,\nu),\,\pi>0}\int_{X\times Y}c(x, y)\,d\pi(x, y) + \gamma\int_{X\times Y}\pi(x, y)\left(\log\pi(x, y) - 1\right)d(x, y).$$
By the change of variables α(x) = exp(φ(x)/γ) and β(y) = exp(ϕ(y)/γ), this leads to

$$\sup_{\alpha,\beta}\;\gamma\int_X\log\alpha(x)\,d\mu(x) + \gamma\int_Y\log\beta(y)\,d\nu(y) - \gamma\int_{X\times Y}\alpha(x)\exp\left(-\frac{c(x, y)}{\gamma}\right)\beta(y)\,d(x, y). \qquad (13.41)$$

Using the variational calculus, for a given perturbation α → α + δα, the first-order variation is given by

$$\int_X\frac{\delta\alpha(x)}{\alpha(x)}\frac{d\mu(x)}{dx}\,dx - \int_X\int_Y\delta\alpha(x)\exp\left(-\frac{c(x, y)}{\gamma}\right)\beta(y)\,dy\,dx \qquad (13.42)$$
$$= \int_X\delta\alpha(x)\left[\frac{1}{\alpha(x)}\frac{d\mu}{dx}(x) - \int_Y\exp\left(-\frac{c(x, y)}{\gamma}\right)\beta(y)\,dy\right]dx = 0. \qquad (13.43)$$
Thus, we have

$$\alpha(x) = \frac{\frac{d\mu}{dx}(x)}{\int_Y\exp\left(-\frac{c(x, y)}{\gamma}\right)\beta(y)\,dy}. \qquad (13.44)$$

Similarly, we have

$$\beta(y) = \frac{\frac{d\nu}{dy}(y)}{\int_X\exp\left(-\frac{c(x, y)}{\gamma}\right)\alpha(x)\,dx}. \qquad (13.45)$$
In fact, the update rules (13.44) and (13.45) are the main iterations of Sinkhorn's fixed point algorithm [183].
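A minimal discrete sketch of Sinkhorn's iteration (not from the text): on a grid, the updates (13.44) and (13.45) become alternating rescalings of two vectors against the kernel K = exp(−C/γ). The two marginals, the cost, and the value of γ below are illustrative assumptions.

```python
import numpy as np

def sinkhorn(mu, nu, C, gamma=0.01, iters=500):
    # Sinkhorn fixed-point iteration for entropy-regularized OT:
    # alternately rescale so that the coupling's marginals match mu and nu.
    K = np.exp(-C / gamma)
    a = np.ones_like(mu)
    for _ in range(iters):
        b = nu / (K.T @ a)        # discrete analogue of (13.45)
        a = mu / (K @ b)          # discrete analogue of (13.44)
    return a[:, None] * K * b[None, :]   # transport plan pi(x, y) = a(x) K(x, y) b(y)

x = np.linspace(0, 1, 50)
mu = np.exp(-((x - 0.3) ** 2) / 0.01); mu /= mu.sum()
nu = np.exp(-((x - 0.7) ** 2) / 0.02); nu /= nu.sum()
C = (x[:, None] - x[None, :]) ** 2        # quadratic transport cost

pi = sinkhorn(mu, nu, C)
print("marginal errors:", np.abs(pi.sum(1) - mu).max(), np.abs(pi.sum(0) - nu).max())
print("regularized transport cost:", (pi * C).sum())
```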
13.5 Generative Adversarial Networks

With the mathematical backgrounds set, we are now ready to discuss specific forms
of the generative models, and explain how they can be derived in a unified theoretical
framework. In this section, we will mainly describe the decoder-type generative
models, which we simply call generative models. Later, we will explain how this
analysis can be extended to the autoencoder-type generative models.
The original form of generative adversarial network (GAN) [88] was inspired by the
success of discriminative models for classification. In particular, Goodfellow et al.
[88] formulated generative model training as a minimax game between a generative
network (generator) that maps a random latent vector into the data in the ambient
space, and a discriminative network trying to distinguish the generated samples from
real samples. Surprisingly, this minimax formulation of a deep generative model can transfer the success of deep discriminative models to generative models, resulting in a significant improvement in generative model performance [88]. In fact, the success
of GANs has generated significant interest in the generative model in general, which
has been followed by many breakthrough ideas.
Before we explain the geometric structure of the GAN and its variants from a
unified framework, we briefly present the original explanation of the GAN, since it
is more intuitive to the general public. Let X and Z denote the ambient and latent
space equipped with the measure μ and ζ , respectively (recall the geometric picture
in Fig. 13.2). Then, the original form of the GAN solves the following minimax game:

$$\min_G\max_D\;\ell_{GAN}(D, G), \qquad (13.46)$$

where

$$\ell_{GAN}(D, G) := \mathbb E_\mu\left[\log D(x)\right] + \mathbb E_\zeta\left[\log\left(1 - D(G(z))\right)\right],$$

where D(x) is the discriminator that takes as input a sample and outputs a scalar between [0, 1], G(z) is the generator that maps a latent vector z to the ambient space, and

$$\mathbb E_\mu\left[\log D(x)\right] = \int_X\log D(x)\,d\mu(x), \qquad \mathbb E_\zeta\left[\log\left(1 - D(G(z))\right)\right] = \int_Z\log\left(1 - D(G(z))\right)d\zeta(z).$$
The meaning of (13.46) is that the generator tries to fool the discriminator, while
the discriminator wants to maximize the differentiation power between the true
and generated samples. In GANs, the discriminator and generator are usually
implemented as deep networks which are parameterized by network parameters φ
and θ , i.e. D(x) := Dφ (x), G(z) := Gθ (z). Therefore, (13.46) can be formulated
as a minmax problem with respect to θ and φ.
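A minimal PyTorch sketch of this minmax training on one-dimensional data (not from the text): the two small networks, the data distribution μ = N(2, 0.5²), and the use of the common non-saturating generator loss (maximizing log D(G(z)) instead of minimizing log(1 − D(G(z)))) are illustrative assumptions rather than the setup of [88].

```python
import torch

torch.manual_seed(0)
G = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
D = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.ReLU(),
                        torch.nn.Linear(16, 1), torch.nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = torch.nn.BCELoss()

for step in range(3000):
    x = 2.0 + 0.5 * torch.randn(64, 1)      # real samples ~ mu
    z = torch.randn(64, 1)                  # latent samples ~ zeta
    # discriminator step on  log D(x) + log(1 - D(G(z)))
    opt_D.zero_grad()
    loss_D = bce(D(x), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
    loss_D.backward(); opt_D.step()
    # generator step (non-saturating variant)
    opt_G.zero_grad()
    loss_G = bce(D(G(z)), torch.ones(64, 1))
    loss_G.backward(); opt_G.step()

fake = G(torch.randn(5000, 1))
print("generated mean/std:", fake.mean().item(), fake.std().item())  # should approach 2.0 / 0.5
```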
Figure 13.5 illustrates some of the samples generated by GANs from this minmax
optimization that appeared in their original paper [88]. By current standards, the
results look very poor, but when these were published in 2014, they shocked
the world and were considered state-of-the art. We can again see the light-speed
progress of generative model technology.
Since it was first published, one of the puzzling questions about GANs is the
mathematical origin of the minmax problems, and why it is important. In fact, the
pursuit of understanding such questions has been very rewarding, and has led to the
discovery of numerous key results that are essential toward the understanding of the
geometric structure of GANs.
Among them, two most notable results are the f -GAN [176] and Wasserstein
GAN (W-GAN) [177], which will be reviewed in the following sections. These
works reveal that the GAN indeed originates from minimizing statistical distances
using dual formulation. These two methods differ only in their choices of statistical
distances and associated dual formulations.
Fig. 13.5 Examples of GAN-generated samples in [88]. The rightmost columns show the nearest
training example of the neighboring sample, in order to demonstrate that the model has not
memorized the training set. These images show actual samples from the model distributions,
not conditional means given samples of hidden units. (a) TFD, (b) MNIST, (c) CIFAR-10 (fully
connected model), (d) CIFAR-10
13.5.2 f -GAN
The f -GAN [176] was perhaps one of the most important theoretical results in the
early history of GANs, and clearly demonstrates the importance of the statistical
distances and dual formulation. As the name suggests, the f -GAN starts with f -
divergence.
Recall that the f-divergence is defined by

$$D_f(\mu\|\nu) = \int f\left(\frac{d\mu}{d\nu}\right)d\nu \qquad (13.47)$$

if μ ≪ ν. The main idea of the f-GAN (which includes the original GAN) is to
use f -divergence as a statistical distance between the real data distribution X with
the measure μ and the synthesized data distribution in the ambient space X with the
measure ν := μθ so that the probability measure ν gets closer to μ (see Fig. 13.2,
where μθ is now considered as ν for notational simplicity). The key observation
is that instead of directly minimizing the f -divergence, something very interesting
emerges if we formulate its dual problems. More specifically, the authors exploit the following dual formulation of the f-divergence [176], whose proof is repeated here for educational purposes. Recall the following definition of a convex conjugate (for more detail, see Chap. 1):

Definition 13.4 ([6]) For a given function f : I → R, its convex conjugate is defined by

$$f^*(t) = \sup_{u\in I}\left\{ut - f(u)\right\}.$$
where the last equality holds when dμ = p dξ and dν = q dξ for a common measure ξ [176].
While the lower bound in (13.50) is intuitive, one of the complications in the
derivation of the f -GAN is that the function τ should be within the domain of f ∗ ,
i.e. τ ∈ I ∗ . To address this, the authors in [176] proposed the following trick:
where

$$\ell_{f\text{-}GAN}(G, g_f) := \mathbb E_\mu\left[g_f(V(x))\right] - \mathbb E_\zeta\left[f^*\left(g_f(V(G(z)))\right)\right].$$

For the generator f corresponding to the original GAN, the convex conjugate is given by

$$f^*(u) = -\log\left(1 - e^u\right).$$
The domain of the conjugate function f^* should be R_− in order to make 1 − e^u > 0. One of the functions g_f that allows this is given by

$$g_f(V) = \log\frac{1}{1 + e^{-V}} = \log\mathrm{Sig}(V),$$
Therefore, if we use a discriminator with the sigmoid being the last layer, we have D(x) = Sig(V(x)), and this leads to the following f-GAN cost function:

$$\sup_{\tau\in I^*}\int_X\tau(x)\,d\mu(x) - \int_X f^*(\tau(x))\,d\nu(x)$$
$$= \sup_{g_f, V}\int_X g_f(V(x))\,d\mu(x) - \int_X f^*\left(g_f(V(x))\right)d\nu(x)$$
$$= \sup_D\int_X\log D(x)\,d\mu(x) + \int_X\log\left(1 - D(x)\right)d\nu(x).$$
Finally, the measure ν corresponds to the samples generated from the latent space Z with the measure ζ by the generator G(z), z ∈ Z, so ν is the push-forward measure G_#ζ (see Fig. 13.2). Using the change-of-variable formula in Proposition 13.1, the final loss function is given by

$$\ell(D, G) := \sup_D\int_X\log D(x)\,d\mu(x) + \int_Z\log\left(1 - D(G(z))\right)d\zeta(z).$$
Note that the f -GAN interprets the GAN training as a statistical distance mini-
mization in the form of dual formulation. However, its main limitation is that the
f -divergence is not a metric, limiting the fundamental performance.
13.5.3 Wasserstein GAN

A similar minimization idea is employed for the Wasserstein GAN, but now with a real metric in the probability space. More specifically, the W-GAN minimizes the following Wasserstein-1 norm:

$$W_1(\mu, \nu) := \min_{\pi\in\Pi(\mu,\nu)}\int_{X\times X}\|x - x'\|\,d\pi(x, x'), \qquad (13.54)$$

where X is the ambient space, μ and ν are the measures for the real data and generated data, respectively, and π(x, x') is the joint distribution with the marginals μ and ν, respectively (recall the definition of Π(μ, ν) in (13.33)).
Similar to the f-GAN, rather than solving the complicated primal problem, a dual problem is solved. Recall that the Kantorovich dual formulation leads to the following dual formulation of the Wasserstein 1-norm:

$$W_1(\mu, \nu) = \sup_{\varphi\in\mathrm{Lip}_1(X)}\left[\int_X\varphi(x)\,d\mu(x) - \int_X\varphi(x')\,d\nu(x')\right], \qquad (13.55)$$

where Lip₁(X) denotes the 1-Lipschitz function space with domain X. Again, the measure ν is for the generated samples from the latent space Z with the measure ζ by the generator G(z), z ∈ Z, so ν can be considered as the push-forward measure ν = G_#ζ. Using the change-of-variable formula in Proposition 13.1, the final loss function is given by

$$W_1(\mu, \nu) = \sup_{\varphi\in\mathrm{Lip}_1(X)}\left[\int_X\varphi(x)\,d\mu(x) - \int_Z\varphi(G(z))\,d\zeta(z)\right]. \qquad (13.56)$$
Accordingly, the W-GAN training is formulated as the following minimax problem:

$$\min_\nu W_1(\mu, \nu) = \min_G\max_{\varphi\in\mathrm{Lip}_1(X)}\left[\int_X\varphi(x)\,d\mu(x) - \int_Z\varphi(G(z))\,d\zeta(z)\right],$$

where G(z) is called the generator, and the Kantorovich potential ϕ is called the discriminator.
Therefore, imposing a 1-Lipschitz condition on the discriminator is necessary in the W-GAN [177]. There are many approaches to address this. For example, in the
original W-GAN paper [177], weight clipping was used to impose a 1-Lipschitz
condition. Another method is to use spectral normalization [188], which utilizes the
power iteration method to impose a constraint on the largest singular value of the
weight matrix in each layer. Yet another popular method is the W-GAN with the
gradient penalty (WGAN-GP), where the gradient of the Kantorovich potential is
constrained to be 1 [189]. Specifically, the following modified loss function is used
for the minmax problem:
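The modified loss of [189] augments the W-GAN critic objective with the penalty λ·E[(‖∇_x̂ D(x̂)‖₂ − 1)²], where x̂ is sampled on line segments between real and generated samples. A minimal PyTorch sketch of this penalty term (not from the text; the networks, data, and λ = 10 are illustrative assumptions):

```python
import torch

def gradient_penalty(D, x_real, x_fake, lam=10.0):
    # interpolate between real and generated samples and penalize deviations of
    # the critic's input-gradient norm from 1 (the 1-Lipschitz target)
    eps = torch.rand(x_real.size(0), 1)
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    grads = torch.autograd.grad(outputs=D(x_hat).sum(), inputs=x_hat, create_graph=True)[0]
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

# usage inside a critic (discriminator) update:
D = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
x_real, x_fake = torch.randn(32, 2), torch.randn(32, 2)
loss_D = D(x_fake).mean() - D(x_real).mean() + gradient_penalty(D, x_real, x_fake)
loss_D.backward()
print(loss_D.item())
```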
13.5.4 StyleGAN
As mentioned before, one of the most exciting developments at CVPR 2019 was the introduction of a novel generative adversarial network (GAN) called StyleGAN by Nvidia [89], which can produce very realistic high-resolution images.
Aside from various sophisticated tricks, StyleGAN also introduced impor-
tant innovations from a theoretical perspective. For example, one of the main
breakthroughs of styleGAN comes from AdaIN. The neural network in Fig. 13.6
generates the latent codes that are used as style image feature vectors. Then,
the AdaIN layer combines the style features and the content features together to
generate more realistic features at each resolution.
Yet another breakthrough idea is that StyleGAN introduces noise into each layer to create stochastic variation, as shown in Fig. 13.6. Recall that most GANs start with a simple latent vector z in the latent space as an input to the generator.
On the other hand, the noise at each layer of StyleGAN can be considered as a more
complicated latent space, so that a mapping from a more complicated input latent
space to the data domain produces more realistic images. In fact, by introducing a
more complicated latent space, StyleGAN enables local changes at the pixel level and targets stochastic variation in generating local variants of features.
13.6 Autoencoder-Type Generative Models

Although we have already discussed generative models such as the GAN, historically the autoencoder-type generative model precedes the GAN-type models. In fact, the autoencoder-type generative model goes back to the denoising autoencoder [190], which is a deterministic form of encoder–decoder network.
The real generative autoencoder model in fact originates from the variational
autoencoder (VAE) [174], which enables the generation of the target samples by
changing latent variables using random samples. Another breakthrough in the VAE
comes from the normalizing flow [178–181], which significantly improves the
quality of generated samples by allowing invertible mapping. In this section, we
review the two ideas in a unified geometric framework. To do this, we first explain
the important concept in variational inference—the evidence lower bound (ELBO)
or the variational lower bound [191].
13.6.1 ELBO
Here, the goal is to find the parameter θ that maximizes the log-likelihood for the given data x ∈ X.
which is often called the evidence lower bound (ELBO) or the variational lower
bound [191].
Since the choice of posterior qφ (z|x) could be arbitrary, the goal of the variational
inference is to find qφ to maximize the ELBO, or, equivalently, minimize the
following loss function:
$$\ell_{ELBO}(x;\theta,\phi) := -\int\log p_\theta(x|z)\,q_\phi(z|x)\,dz + D_{KL}\left(q_\phi(z|x)\,\|\,p(z)\right), \qquad (13.59)$$
where the first term is the likelihood term and the second KL term can be interpreted
as the penalty term. Then, variational inference tries to find θ and φ to minimize the
loss for a given x, or average loss for all x.
Using the ELBO, we are now ready to derive the VAE. However, our derivation is
somewhat different from the original derivation of the VAE [174], since the original
derivation makes it difficult to show the link to normalizing flow [178–181]. The
following derivation originates from the f -VAE [193].
Specifically, among the various choices of qφ (z|x) for the ELBO, we choose the
following form:
$$q_\phi(z|x) = \int\delta\left(z - F_\phi^x(u)\right)r(u)\,du, \qquad (13.60)$$
where r(u) is the standard Gaussian, and Fφx (u) is the encoder function for a given
x which has another noisy input u. See Fig. 13.7a for the concept of the encoder
Fφx (u). For the given encoder function, we have the following key result for the
ELBO loss.
Proposition 13.3 For the given encoder in (13.60), the ELBO loss in (13.59) can
be represented by
ELBO (x; θ, φ) := − log pθ (x|Fφx (u))r(u)du
) *
r(u)
+ log r(u)du
p(Fφx (u))
+ ) x *+
+ ∂Fφ (u) ++
+
− log +det + r(u)du. (13.61)
+ ∂u +
Using the encoder representation in (13.60), the first term of (13.62) becomes
$$\int\int\log\left(p_\theta(x|z)p(z)\right)\delta\left(z - F_\phi^x(u)\right)r(u)\,du\,dz = \int\log\left(p_\theta\left(x|F_\phi^x(u)\right)p\left(F_\phi^x(u)\right)\right)r(u)\,du$$
$$= \int\log p_\theta\left(x|F_\phi^x(u)\right)r(u)\,du + \int\log p\left(F_\phi^x(u)\right)r(u)\,du.$$
Then, we have

$$\int\log\left(\int\delta\left(F_\phi^x(u) - F_\phi^x(u')\right)r(u')\,du'\right)r(u)\,du
= \int\log\left(\int\delta\left(F_\phi^x(u) - v\right)\frac{r(H_x(v))}{\left|\det\frac{\partial F_\phi^x(u')}{\partial u'}\right|}\,dv\right)r(u)\,du$$
$$= \int\log\left(\frac{r\left(H_x(F_\phi^x(u))\right)}{\left|\det\frac{\partial F_\phi^x(u')}{\partial u'}\right|_{v = F_\phi^x(u)}}\right)r(u)\,du
= \int\log r(u)\,r(u)\,du - \int\log\left|\det\left(\frac{\partial F_\phi^x(u)}{\partial u}\right)\right|r(u)\,du,$$

where H_x denotes the inverse of the encoder mapping u ↦ F_φ^x(u).
In particular, consider the Gaussian encoder given by

$$F_\phi^x(u) = \mu_\phi(x) + \sigma_\phi(x)\odot u, \qquad u \sim \mathcal N(0, I_d),$$

where I_d is the d × d identity matrix and d is the dimension of the latent space. This was referred to as the reparameterization trick in the original VAE paper [174]. Under this choice, the second term in (13.61) becomes
$$\int\log\left(\frac{r(u)}{p(F_\phi^x(u))}\right)r(u)\,du = -\frac{1}{2}\int\|u\|^2\,r(u)\,du + \frac{1}{2}\int\left\|\mu(x) + \sigma(x)\odot u\right\|^2r(u)\,du = \frac{1}{2}\sum_{i=1}^{d}\left(\sigma_i^2(x) + \mu_i^2(x) - 1\right). \qquad (13.64)$$
Similarly, the third term in (13.61) becomes

$$-\int\log\left|\det\left(\frac{\partial F_\phi^x(u)}{\partial u}\right)\right|r(u)\,du = -\frac{1}{2}\sum_{i=1}^{d}\log\sigma_i^2(x). \qquad (13.65)$$
Finally, the first term in (13.61) is the likelihood term, which can be represented as
follows by assuming the Gaussian distribution:
$$-\int\log p_\theta\left(x|F_\phi^x(u)\right)r(u)\,du = \frac{1}{2}\int\left\|x - G_\theta\left(F_\phi^x(u)\right)\right\|^2r(u)\,du = \frac{1}{2}\int\left\|x - G_\theta\left(\mu_\phi(x) + \sigma_\phi(x)\odot u\right)\right\|^2r(u)\,du. \qquad (13.66)$$
Therefore, the encoder and decoder parameter optimization problem for the VAE can be obtained as

$$\min_{\theta,\phi}\;\ell_{VAE}(\theta, \phi),$$

where

$$\ell_{VAE}(\theta, \phi) = \frac{1}{2}\int_X\int\left\|x - G_\theta\left(\mu_\phi(x) + \sigma_\phi(x)\odot u\right)\right\|^2r(u)\,du\,d\mu(x) + \frac{1}{2}\int_X\sum_{i=1}^{d}\left(\sigma_i^2(x) + \mu_i^2(x) - \log\sigma_i^2(x) - 1\right)d\mu(x). \qquad (13.67)$$
Once the neural network is trained, one very important advantage of the VAE is that we can simply control the decoder output by changing the random samples. More specifically, the decoder output is now given by

\hat{x} = G_\theta\big(\mu_\phi(x) + \sigma_\phi(x)\odot u\big),

which has an explicit dependency on the random variable u. Therefore, for a given x, we can change the output by drawing different samples u.
13.6.3 β-VAE
By inspection of the VAE loss in (13.67), we can easily see that the first term represents the distance between the generated samples and the real ones, whereas the second term is the KL distance between the real latent space measure and the posterior distribution. Therefore, the VAE loss is a distance measure that accounts for the discrepancy between real and generated samples in both the latent space and the ambient space.
In fact, this observation nicely fits into our geometric view of the autoencoder
illustrated in Fig. 13.2. Here, the ambient image space is X, the real data distribution
is μ, whereas the autoencoder output data distribution is μθ . The latent space is
Z. In the autoencoder, the generator Gθ corresponds to the decoder, which is a
mapping from the latent space to the sample space, Gθ : Z → X, realized by a
deep network. Then, the goal of the decoder training is to make the push-forward
measure μθ = Gθ# ζ as close as possible to the real data distribution μ. Additionally,
an encoder F_φ maps the real data in X to the latent space, F_φ : X → Z, so that the encoder pushes forward the measure μ to a distribution ζ_φ = F_{φ#}μ in the latent space. Therefore, the VAE design problem can be formulated as minimizing the sum of both distances, which are measured by the average sample distance and the KL distance, respectively.
Rather than giving uniform weights to both distances, β-VAE [175] relaxes this constraint of the VAE. Following the same incentive as in the VAE, we want to maximize the probability of generating real data, while keeping the distance between the real and the estimated posterior distributions small (say, under a small constant). This leads to the following β-VAE cost function:

\ell_{\beta\text{-}VAE}(\theta,\phi) = \frac{1}{2}\int_X\!\!\int \big\|x - G_\theta\big(\mu_\phi(x) + \sigma_\phi(x)\odot u\big)\big\|^2\, r(u)\, du\, d\mu(x)
+ \frac{\beta}{2}\int_X \sum_{i=1}^{d}\big(\sigma_i^2(x) + \mu_i^2(x) - \log\sigma_i^2(x) - 1\big)\, d\mu(x),

where β now controls the importance of the distance measure in the latent space. When β = 1, it is the same as the VAE. When β > 1, it applies a stronger constraint on the latent space.
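In code, β-VAE is a one-line modification of the VAE loss sketched earlier: the KL term is simply scaled by β. The snippet below reuses the hypothetical Encoder/Decoder from that sketch and is only an illustration of the weighting.

```python
import torch

def beta_vae_loss(x, encoder, decoder, beta=4.0):
    mu, logvar = encoder(x)
    sigma = torch.exp(0.5 * logvar)
    u = torch.randn_like(sigma)
    recon = 0.5 * ((x - decoder(mu + sigma * u)) ** 2).sum(dim=1)
    kl = 0.5 * (sigma ** 2 + mu ** 2 - logvar - 1.0).sum(dim=1)
    return (recon + beta * kl).mean()   # beta = 1 recovers the plain VAE loss (13.67)
```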
As a higher β imposes more constraint on the latent space, it turns out that the
latent space is more interpretable and controllable, which is known as the disen-
tanglement. More specifically, if each variable in the inferred latent representation
z is only sensitive to one single generative factor and relatively invariant to other
factors, we say that this representation is disentangled or factorized. One benefit that often comes with a disentangled representation is good interpretability and easy generalization to a variety of tasks. For conditionally independent generative factors, keeping them disentangled is the most efficient representation, and β-VAE provides a more disentangled representation. For example, the faces generated by the original VAE point in various directions, whereas they point in specific directions in the β-VAE, implying that the factor for the face direction has been successfully disentangled [175].
13.6.4 Normalizing Flow

The normalizing flow (NF) [178–181] is a modern way of overcoming the limitations of the VAE. As shown in Fig. 13.8, a normalizing flow transforms a simple distribution into a complex one by applying a sequence of invertible transformation functions. Flowing through a chain of transformations, we repeatedly substitute the variable for the new one according to the change-of-variables theorem and eventually obtain a probability distribution of the final target variable. Such a sequence of invertible transformations is the origin of the name “normalizing flow” [179].
The derivation of the cost function for a normalizing flow also starts with the
same ELBO and encoder model in (13.60). However, the normalizing flow chooses
a different encoder function:
Now the main technical difficulty of NF arises from the last term, which involves
a complicated determinant calculation for a huge matrix. As discussed before, NF
mainly focuses on the encoder function Fφ (and, likewise, the decoder G), which is
composed of a sequence of invertible transformations mapping u = h_0 to h_1, then to h_2, and so on up to h_K = F_φ(u). Since the chain rule gives

\frac{\partial F_\phi(u)}{\partial u} = \frac{\partial h_K}{\partial h_{K-1}}\cdots\frac{\partial h_2}{\partial h_1}\,\frac{\partial h_1}{\partial u},   (13.74)

we have

\log\left|\det\frac{\partial F_\phi(u)}{\partial u}\right| = \sum_{i=1}^{K}\log\left|\det\frac{\partial h_i}{\partial h_{i-1}}\right|,   (13.75)

where h_0 = u. Therefore, most of the current research efforts for NF have focused
on how to design an invertible block such that the determinant calculation is simple.
Now, we review a few representative techniques.
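Before turning to these representative techniques, the following minimal sketch shows how (13.75) is used in practice: each invertible block returns its output together with its log-determinant contribution, and the flow simply accumulates these terms. The block interface and the toy element-wise affine block are assumptions made for this illustration, not a particular library API.

```python
import torch

class AffineBlock:
    """Toy invertible block h -> a * h + b with a fixed nonzero scale a,
    used only to illustrate the (forward, logdet) interface."""
    def __init__(self, a, b):
        self.a, self.b = a, b

    def forward(self, h):
        logdet = torch.log(torch.abs(self.a)).sum() * torch.ones(h.shape[0])
        return self.a * h + self.b, logdet

    def inverse(self, y):
        return (y - self.b) / self.a

class Flow:
    """Chain of invertible blocks; log-determinants add up as in (13.75)."""
    def __init__(self, blocks):
        self.blocks = blocks

    def forward(self, u):
        h, logdet = u, torch.zeros(u.shape[0])
        for block in self.blocks:          # h_0 = u, h_i = block_i(h_{i-1})
            h, ld = block.forward(h)
            logdet = logdet + ld
        return h, logdet                   # z = F_phi(u) and log|det dF_phi/du|

# Example: two toy blocks acting on 3-dimensional inputs
u = torch.randn(5, 3)
flow = Flow([AffineBlock(torch.tensor([2.0, 0.5, 1.0]), torch.zeros(3)),
             AffineBlock(torch.tensor([1.5, 1.0, 3.0]), torch.ones(3))])
z, logdet = flow.forward(u)
```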
NICE (nonlinear independent component estimation) [178] is based on learning
a non-linear bijective transformation between the data space and a latent space. The
architecture is composed of a series of blocks defined as follows, where x1 and x2
are a partition of the input in each layer, and y1 and y2 are partitions of the output.
y1 = x1 ,
y2 = x2 + F(x1 ), (13.76)
where F(·) is a neural network. Then, the block inversion can be readily done by
x1 = y1 ,
x2 = y2 − F(y1 ). (13.77)
Furthermore, it is easy to see that its Jacobian has a unit determinant, so the cost function in (13.72) and its gradient can be computed tractably.
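As an illustration of (13.76)–(13.77), a minimal additive coupling block in the style of NICE might look as follows; the split into halves and the small network F are assumptions for this sketch, and the log-determinant is identically zero because the block is volume preserving.

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """NICE-style additive coupling (13.76)-(13.77): unit Jacobian determinant."""
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        self.F = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(),
                               nn.Linear(64, dim - self.half))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        y1, y2 = x1, x2 + self.F(x1)              # (13.76)
        logdet = torch.zeros(x.shape[0])          # log|det| = 0 for this block
        return torch.cat([y1, y2], dim=1), logdet

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        return torch.cat([y1, y2 - self.F(y1)], dim=1)   # (13.77)

# Round-trip check on random data
block = AdditiveCoupling(dim=6)
x = torch.randn(4, 6)
y, _ = block(x)
assert torch.allclose(block.inverse(y), x, atol=1e-6)
```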
However, this architecture imposes some constraints on the functions the network
can represent; for instance, it can only represent volume-preserving mappings.
Follow-up work [180] addressed this limitation by introducing a new reversible
transformation. More specifically, they extend the space of such models using
real-valued non-volume-preserving (real NVP) transformations using the following
operation [180]:
y1 = x1 ,
y2 = x2 & exp(s(x1 )) + t (x1 ), (13.78)
Given the observation that this Jacobian is triangular, we can efficiently compute
its determinant as
⎛ ⎞
∂y
det = exp ⎝ s(x1 [j ])⎠ , (13.80)
∂x
j
where x1 [j ] denotes the j -th element of x1 . The inverse of the transform can also
be easily implemented by
x1 = y1 ,
x2 = (y2 − t (y1 )) & exp(−s(y1 )). (13.81)
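A corresponding affine coupling block, sketched under the same assumptions as the NICE example above (halved input, small networks for s and t), shows how the log-determinant in (13.80) comes out essentially for free during the forward pass.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Real NVP-style affine coupling (13.78)-(13.81)."""
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        self.s = nn.Sequential(nn.Linear(self.half, 64), nn.Tanh(),
                               nn.Linear(64, dim - self.half))
        self.t = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(),
                               nn.Linear(64, dim - self.half))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.s(x1), self.t(x1)
        y2 = x2 * torch.exp(s) + t                # (13.78)
        logdet = s.sum(dim=1)                     # log det = sum_j s(x1)[j], cf. (13.80)
        return torch.cat([x1, y2], dim=1), logdet

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        x2 = (y2 - self.t(y1)) * torch.exp(-self.s(y1))   # (13.81)
        return torch.cat([y1, x2], dim=1)

# Round-trip check
block = AffineCoupling(dim=6)
x = torch.randn(4, 6)
y, logdet = block(x)
assert torch.allclose(block.inverse(y), x, atol=1e-5)
```

In practice, alternating which half of the input is updated between consecutive blocks (or permuting the dimensions) ensures that every coordinate is eventually transformed.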
Fig. 13.9 Forward and inverse architecture of a building block in real NVP transform [180]. (a)
Forward propagation. (b) Inverse propagation
Fig. 13.10 Example of normalizing flow using GLOW [181]. Figure courtesy of https://openai.com/blog/glow/
13.7 Unsupervised Learning via Image Translation

So far, we have discussed generative models that generate samples from noise. Generative models are also useful for converting one distribution into another, which is why they have become the main workhorse for unsupervised learning tasks. Among the various unsupervised learning tasks, in this section we mainly focus on image translation, which is a very active area of research.
13.7.1 Pix2pix
Pix2pix [194] was presented in 2016 by researchers from Berkeley in their work
“Image-to-Image Translation with Conditional Adversarial Networks.” This is not
unsupervised learning per se, as it requires matched data sets, but it opened a new era of image translation, so we review it here.
Most of the problems in image processing and computer vision can be posed
as “translating” an input image into a corresponding output image. For example, a
scene may be rendered as an RGB image, a gradient field, an edge map, a semantic
label map, etc. In analogy to automatic language translation, we define automatic
image-to-image translation as the task of translating one possible representation of
a scene into another, given a large amount of training data.
Pix2pix uses a generative adversarial network (GAN) [88] to learn a function that maps an input image to an output image. The network is made up of two main pieces: the generator and the discriminator. The generator transforms the input image to produce the output image. The discriminator compares the generated image with the target image from the data set and tries to guess whether it was produced by the generator.
For example, in Fig. 13.11, the generator produces a photo-realistic shoe image from a sketch, and the discriminator tries to tell whether the image paired with the sketch is the real photo or the generated fake.
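To make the generator/discriminator interplay concrete, the following is a minimal sketch of a conditional-GAN training objective in the spirit of pix2pix. The toy networks, the binary cross-entropy adversarial loss, and the pixel-wise term are assumptions made for illustration rather than the exact architecture and losses of [194].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

class ToyG(nn.Module):                       # stands in for the pix2pix generator
    def __init__(self): super().__init__(); self.conv = nn.Conv2d(3, 3, 3, padding=1)
    def forward(self, x): return self.conv(x)

class ToyD(nn.Module):                       # conditional discriminator: sees (input, output) pairs
    def __init__(self): super().__init__(); self.conv = nn.Conv2d(6, 1, 3, padding=1)
    def forward(self, x, y): return self.conv(torch.cat([x, y], dim=1))

def pix2pix_losses(G, D, sketch, photo, lam=100.0):
    fake = G(sketch)
    bce = F_nn.binary_cross_entropy_with_logits
    # Discriminator: real pairs -> 1, generated pairs -> 0
    d_real, d_fake = D(sketch, photo), D(sketch, fake.detach())
    loss_D = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    # Generator: fool D, plus a pixel-wise term pulling the output toward the target
    d_fake_for_G = D(sketch, fake)                       # no detach: gradients flow to G
    loss_G = bce(d_fake_for_G, torch.ones_like(d_fake_for_G)) + lam * (fake - photo).abs().mean()
    return loss_D, loss_G

sketch, photo = torch.rand(2, 3, 32, 32), torch.rand(2, 3, 32, 32)
loss_D, loss_G = pix2pix_losses(ToyG(), ToyD(), sketch, photo)
```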
The nice thing about pix2pix is that it is generic and does not require the user to define any relationship between the two types of images. It makes no assumptions about this relationship and instead infers the objective during training by comparing the given input–output pairs. This makes pix2pix highly adaptable to a wide variety of situations, including ones where the desired task is not easy to define verbally or explicitly.
That said, one downside of pix2pix is that it requires paired data sets to learn
their relationship, and these are often difficult to obtain in practice. This issue is
largely addressed by cycleGAN [185], which is the topic of the following section.
13.7.2 CycleGAN
Here, the target image space X is equipped with a probability measure μ, whereas
the original image space Y has a probability measure ν. Since there are no paired
data, the goal of unsupervised learning is to match the probability distributions
rather than each individual sample. This can be done by finding transportation
maps that transport the measure μ to ν, and vice versa. More specifically, the
transportation from a measure space (Y, ν) to another measure space (X, μ) is done
by a generator Gθ : Y → X, realized by a deep network parameterized with θ .
Then, the generator Gθ “pushes forward” the measure ν in Y to a measure μθ
in the target space X [182, 184]. Similarly, the transport from (X, μ) to (Y, ν)
is performed by another neural network generator Fφ , so that the generator Fφ
pushes forward the measure μ in X to νφ in the original space Y. Then, the
optimal transport map for unsupervised learning can be achieved by minimizing
the statistical distances dist(μθ , μ) between μ and μθ , and dist(νφ , ν) between ν
and νφ , and our proposal is to use the Wasserstein-1 metric as a means to measure
the statistical distance.
More specifically, for the choice of the metric d(x, x') = \|x - x'\| in X, the Wasserstein-1 metric between μ and μ_θ can be computed as (Villani [182], Peyré et al. [184])

W_1(\mu, \mu_\theta) = \inf_{\pi \in \Pi(\mu,\nu)} \int_{X\times Y} \|x - G_\theta(y)\|\, d\pi(x, y),   (13.82)

where \Pi(\mu,\nu) denotes the set of joint distributions on X × Y with marginals μ and ν.
Similarly, the Wasserstein-1 metric between ν and ν_φ is given by

W_1(\nu, \nu_\phi) = \inf_{\pi \in \Pi(\mu,\nu)} \int_{X\times Y} \|F_\phi(x) - y\|\, d\pi(x, y).   (13.83)

Rather than minimizing (13.82) and (13.83) separately with distinct joint distributions, a better way of finding the transportation map is to minimize them together with the same joint distribution π:

\inf_{\pi \in \Pi(\mu,\nu)} \int_{X\times Y} \Big( \|x - G_\theta(y)\| + \|F_\phi(x) - y\| \Big)\, d\pi(x, y).   (13.84)
One of the most important contributions of [195] is to show that the primal formulation of unsupervised learning in (13.84) can be represented by a dual formulation. Here, ϕ and ψ are often called Kantorovich potentials and satisfy the 1-Lipschitz condition (i.e., |ϕ(x) − ϕ(x')| ≤ ‖x − x'‖ for all x, x'). The resulting dual formulation enforces the correspondence between the original and target domains, removing the mode-collapsing behavior of GANs. The corresponding network architecture is
represented in Fig. 13.14. Specifically, ϕ tries to find the difference between the
true image x and the generated image Gθ (y), whereas ψ attempts to find the
fake measurement data that are generated by the synthetic measurement procedure
Fφ (x). In fact, this formulation is equivalent to the cycleGAN formulation [185]
except for the use of 1-Lipschitz discriminators.
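While the dual derivation of [195] is not reproduced here, the training objective it leads to has the familiar cycleGAN structure: two generators tied together by a cycle-consistency penalty, plus adversarial terms coming from the potentials ϕ and ψ. The sketch below, with all networks reduced to toy linear modules, illustrates that structure under these assumptions rather than the exact loss of [195] or [185].

```python
import torch
import torch.nn as nn

def cycle_losses(G, F, phi, psi, x, y, lam=10.0):
    """Sketch of a cycleGAN-style objective: G maps Y -> X, F maps X -> Y,
    and phi, psi act as (ideally 1-Lipschitz) discriminators/potentials."""
    x_fake, y_fake = G(y), F(x)

    # Cycle consistency: y -> G(y) -> F(G(y)) should return to y, and x -> F(x) -> G(F(x)) to x
    loss_cycle = (F(x_fake) - y).abs().mean() + (G(y_fake) - x).abs().mean()

    # Adversarial terms in Wasserstein style: generators try to raise the potentials on fakes
    loss_gen = lam * loss_cycle - phi(x_fake).mean() - psi(y_fake).mean()
    loss_disc = (phi(x_fake.detach()) - phi(x)).mean() + (psi(y_fake.detach()) - psi(y)).mean()
    return loss_gen, loss_disc

# Toy 1-D example: all maps realized by small linear layers (illustration only)
G, F = nn.Linear(8, 8), nn.Linear(8, 8)
phi, psi = nn.Linear(8, 1), nn.Linear(8, 1)
x, y = torch.randn(16, 8), torch.randn(16, 8)
loss_gen, loss_disc = cycle_losses(G, F, phi, psi, x, y)
```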
CycleGAN has been very successful for various unsupervised learning tasks.
Figure 13.15 shows examples of unsupervised style transfer between two different styles of paintings.
13.7.3 StarGAN
As seen in Fig. 13.15, one downside of cycleGAN is that we need to train separate generators for each pair of domains. For example, if there are N different painting styles, N(N − 1) distinct generators are required to translate between all of them (see Fig. 13.16a).
13.7.4 Collaborative GAN (CollaGAN)

In many applications requiring multiple inputs to obtain the desired output, if any of the input data are missing, a large bias is often introduced. Although many techniques have been developed for imputing missing data, image imputation is still difficult due to the complicated nature of natural images. To address this problem, a novel framework, the collaborative GAN (CollaGAN) [186], was proposed.
Fig. 13.17 Generator and discriminator architecture of StarGAN [87]. (a) Training discriminator. (b) Training generator. (c) Fooling the discriminator
Fig. 13.19 Comparison of various multi-domain translation architectures. (a) Cross-domain models. (b) StarGAN. (c) Collaborative GAN
clean data set. Since the missing data domain is not known a priori, the imputation algorithm should be designed such that a single algorithm can estimate the missing data in any domain by exploiting the data from the remaining domains.
Due to its specific applications, CollaGAN is not an unsupervised learning method. However, one of the key concepts in CollaGAN is the cycle consistency for multiple inputs, which is useful for other applications. Specifically, since the inputs are multiple images, the cycle loss should be redefined. In particular, for N-domain data, from a generated output we should be able to form N − 1 new input combinations for the backward flow of the generator (Fig. 13.20, middle). For example, when N = 4, there are three combinations of multi-input and single-output, so that we can reconstruct the three images of the original domains using the backward flow of the generator; a minimal sketch of this multiple-input cycle consistency is given below. Regarding the discriminator, it should have a classifier head in addition to the discriminator part, similar to that of StarGAN.
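The following sketch illustrates the multiple-input cycle consistency described above for N = 4 domains. The generator signature (a list of available images plus a target-domain index) and the toy generator are assumptions made for the illustration, not the CollaGAN architecture of [186].

```python
import torch
import torch.nn as nn

def multi_cycle_loss(G, images, missing):
    """images: list of N domain images; index `missing` is the one being imputed.
    G(inputs, target_idx) generates the image of domain target_idx from the others."""
    N = len(images)
    inputs = [img for i, img in enumerate(images) if i != missing]
    fake = G(inputs, missing)                       # forward flow: impute the missing domain

    loss = 0.0
    for k in range(N):
        if k == missing:
            continue
        # backward flow: re-generate domain k from the imputed image and the remaining originals
        backward_inputs = [fake if i == missing else images[i] for i in range(N) if i != k]
        loss = loss + (G(backward_inputs, k) - images[k]).abs().mean()
    return loss / (N - 1)

class ToyG(nn.Module):
    """Toy generator: concatenates its N-1 inputs along channels (target_idx unused here)."""
    def __init__(self, C=1, N=4):
        super().__init__()
        self.conv = nn.Conv2d(C * (N - 1), C, 3, padding=1)
    def forward(self, inputs, target_idx):
        return self.conv(torch.cat(inputs, dim=1))

images = [torch.rand(2, 1, 16, 16) for _ in range(4)]
loss = multi_cycle_loss(ToyG(), images, missing=2)
```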
Figure 13.21 shows an example of missing domain imputation, where CollaGAN
produces very realistic images.
13.9 Exercises
D_{JS}(P\,\|\,Q) = \frac{1}{2} D_{KL}(P\,\|\,M) + \frac{1}{2} D_{KL}(Q\,\|\,M),   (13.88)

where M = (P + Q)/2.
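As a quick numerical illustration of (13.88), the snippet below evaluates the JS divergence of two discrete distributions via the two KL terms against the mixture M; it is only a worked example accompanying the formula.

```python
import torch

def kl(p, q):
    # Discrete KL divergence D_KL(p||q) for strictly positive probability vectors
    return (p * (p / q).log()).sum()

def js(p, q):
    m = 0.5 * (p + q)                        # M = (P + Q)/2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)   # (13.88)

p = torch.tensor([0.1, 0.4, 0.5])
q = torch.tensor([0.3, 0.3, 0.4])
print(js(p, q))                               # note the symmetry: js(p, q) == js(q, p)
```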
y = T(x) = ∇u(x),

-\int \log\left|\det\!\left(\frac{\partial F_\phi^x(u)}{\partial u}\right)\right| r(u)\, du = -\frac{1}{2}\sum_{i=1}^{d} \log\sigma_i^2(x),
y1 = x1 , y2 = x2 + F(y1 ). (13.91)
(a) Why does the Jacobian of this block have a unit determinant? Please derive this explicitly.
(b) Suppose we are interested in a more expressive network given by
for some function G. What is the inverse operation? How can you make the
corresponding normalizing flow cost function simple in terms of Jacobian
calculation? You may want to split the update into two steps to simplify the
derivation.
Chapter 14
Summary and Outlook
With the tremendous success of deep learning in recent years, the field of data
science has undergone unprecedented changes that can be considered a “revolution”.
Despite the great successes of deep learning in various areas, there is a tremendous
lack of rigorous mathematical foundations which enable us to understand why deep
learning methods perform well. In fact, the recent development of deep learning
is largely empirical, and the theory that explains the success remains seriously
behind. For this reason, until recently, deep learning was viewed as pseudoscience
by rigorous scientists, including mathematicians.
In fact, the success of deep learning appears very mysterious. Although sophis-
ticated network architectures have been proposed by many researchers in recent
years, the basic building blocks of deep neural networks are the convolution, pooling
and nonlinearity, which from a mathematical point of view are regarded as very
primitive tools from the “Stone Age”. However, one of the most mysterious aspects
of deep learning is that the cascaded connection of these “Stone Age” tools results
in superior performance that far exceeds that of sophisticated mathematical tools. Nowadays, in order to develop high-performance data processing algorithms, we do not have to hire highly educated doctoral students or postdocs; we can simply hand TensorFlow and plenty of training data to undergraduate students. Does this mean a dark age of mathematics? What, then, is the role of mathematicians in this data-driven world?
A popular explanation for the success of deep neural networks is that the neural
network was developed by mimicking the human brain and is therefore destined
for success. In fact, as discussed in Chap. 5, one of the most famous numerical
experiments is the emergence of the hierarchical features from a deep neural
network when it is trained to classify human faces. Interestingly, this phenomenon
is similarly observed in human brains, where hierarchical features of the objects
emerge during visual information processing. Based on these numerical observa-
tions, some of the artificial neural network “hardliners” even claim that instead
of mathematics we need to investigate the biology of the brain to design more
sophisticated artificial neural networks and to understand the working principle of
Chapter 15
Bibliography
40. Y. Han and J. C. Ye, “Framing U-Net via deep convolutional framelets: Application to sparse-
view CT,” IEEE Transactions on Medical Imaging, vol. 37, no. 6, pp. 1418–1429, 2018.
41. S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by
reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
42. J. C. Ye, Y. Han, and E. Cha, “Deep convolutional framelets: A general deep learning
framework for inverse problems,” SIAM Journal on Imaging Sciences, vol. 11, no. 2, pp.
991–1048, 2018.
43. J. Bruna and S. Mallat, “Invariant scattering convolution networks,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1872–1886, 2013.
44. I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT Press, 2016.
45. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a
simple way to prevent neural networks from overfitting,” The Journal of Machine Learning
Research, vol. 15, no. 1, pp. 1929–1958, 2014.
46. D. L. Donoho, “Compressed sensing,” IEEE Trans. Information Theory, vol. 52, no. 4, pp.
1289–1306, 2006.
47. E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Found.
Comput. Math., vol. 9, no. 6, pp. 717–772, 2009.
48. G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of
Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.
49. S. Ryu, J. Lim, S. H. Hong, and W. Y. Kim, “Deeply learning molecular structure-
property relationships using attention-and gate-augmented graph convolutional network,”
arXiv preprint arXiv:1805.10988, 2018.
50. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of
words and phrases and their compositionality,” in Advances in Neural Information Processing
Systems, 2013, pp. 3111–3119.
51. T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations
in vector space,” arXiv preprint arXiv:1301.3781, 2013.
52. W. L. Hamilton, R. Ying, and J. Leskovec, “Representation learning on graphs: Methods and
applications,” arXiv preprint arXiv:1709.05584, 2017.
53. B. Perozzi, R. Al-Rfou, and S. Skiena, “DeepWalk: Online learning of social representations,”
in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, 2014, pp. 701–710.
54. A. Grover and J. Leskovec, “Node2vec: Scalable feature learning for networks,” in Proceed-
ings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, 2016, pp. 855–864.
55. M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep
learning: going beyond Euclidean data,” IEEE Signal Processing Magazine, vol. 34, no. 4,
pp. 18–42, 2017.
56. T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional net-
works,” arXiv preprint arXiv:1609.02907, 2016.
57. K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” arXiv
preprint arXiv:1810.00826, 2018.
58. W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,”
in Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
59. C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe,
“Weisfeiler and Leman go neural: Higher-order graph neural networks,” in Proceedings of
the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 4602–4609.
60. Z. Chen, S. Villar, L. Chen, and J. Bruna, “On the equivalence between graph isomorphism
testing and function approximation with GNNs,” in Advances in Neural Information Process-
ing Systems, 2019, pp. 15 868–15 876.
61. P. Barceló, E. V. Kostylev, M. Monet, J. Pérez, J. Reutter, and J. P. Silva, “The logical expres-
siveness of graph neural networks,” in International Conference on Learning Representations,
2019.
62. M. Grohe, “word2vec, node2vec, graph2vec, x2vec: Towards a theory of vector embeddings
of structured data,” arXiv preprint arXiv:2003.12590, 2020.
63. N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt,
“Weisfeiler–Lehman graph kernels,” Journal of Machine Learning Research, vol. 12, no. 77,
pp. 2539–2561, 2011.
64. J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint
arXiv:1607.06450, 2016.
65. D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The missing ingredient
for fast stylization,” arXiv preprint arXiv:1607.08022, 2016.
66. Y. Wu and K. He, “Group normalization,” in Proceedings of the European Conference on
Computer Vision (ECCV), 2018, pp. 3–19.
67. X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance
normalization,” in Proceedings of the IEEE International Conference on Computer Vision,
2017, pp. 1501–1510.
68. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
69. P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention
networks,” arXiv preprint arXiv:1710.10903, 2017.
70. X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
71. H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial
networks,” in International conference on machine learning. PMLR, 2019, pp. 7354–7363.
72. T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “AttnGAN: Fine-grained
text to image generation with attentional generative adversarial networks,” in Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1316–1324.
73. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and
I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing
Systems, 2017, pp. 5998–6008.
74. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional
transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
75. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are
unsupervised multitask learners,” OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
76. T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan,
P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” arXiv preprint
arXiv:2005.14165, 2020.
77. L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional
neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2016, pp. 2414–2423.
78. Y. Taigman, A. Polyak, and L. Wolf, “Unsupervised cross-domain image generation,” arXiv
preprint arXiv:1611.02200, 2016.
79. Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, “Universal style transfer via feature
transforms,” in Advances in Neural Information Processing Systems, 2017, pp. 386–396.
80. Y. Li, M.-Y. Liu, X. Li, M.-H. Yang, and J. Kautz, “A closed-form solution to photorealistic
image stylization,” in Proceedings of the European Conference on Computer Vision (ECCV),
2018, pp. 453–468.
81. D. Y. Park and K. H. Lee, “Arbitrary style transfer with style-attentional networks,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019,
pp. 5880–5888.
82. J. Yoo, Y. Uh, S. Chun, B. Kang, and J.-W. Ha, “Photorealistic style transfer via wavelet
transforms,” in Proceedings of the IEEE International Conference on Computer Vision, 2019,
pp. 9036–9045.
83. T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, “Semantic image synthesis with spatially-
adaptive normalization,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2019, pp. 2337–2346.
84. X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, “Multimodal unsupervised image-to-image
translation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018,
pp. 172–189.
85. J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman, “Toward
multimodal image-to-image translation,” in Advances in Neural Information Processing
Systems, 2017, pp. 465–476.
86. H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang, “Diverse image-to-image
translation via disentangled representations,” in Proceedings of the European Conference on
Computer Vision (ECCV), 2018, pp. 35–51.
87. Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative
adversarial networks for multi-domain image-to-image translation,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8789–8797.
88. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,
and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing
Systems, 2014, pp. 2672–2680.
89. T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative
adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2019, pp. 4401–4410.
90. K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual
learning of deep CNN for image denoising,” IEEE Transactions on Image Processing, vol. 26,
no. 7, pp. 3142–3155, 2017.
91. M. Bear, B. Connors, and M. A. Paradiso, Neuroscience: Exploring the brain. Jones & Bartlett
Learning, LLC, 2020.
92. K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “LSTM: a
search space odyssey,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28,
no. 10, pp. 2222–2232, 2016.
93. J. Pérez, J. Marinković, and P. Barceló, “On the Turing completeness of modern neural
network architectures,” in International Conference on Learning Representations, 2018.
94. J.-B. Cordonnier, A. Loukas, and M. Jaggi, “On the relationship between self-attention and
convolutional layers,” arXiv preprint arXiv:1911.03584, 2019.
95. G. Marcus and E. Davis, “GPT-3, bloviator: OpenAI’s language generator has no idea what
it’s talking about,” Technology Review, 2020.
96. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner,
M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16×16 words:
Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
97. G. Kwon and J. C. Ye, “Diagonal attention and style-based GAN for content-style disentan-
glement in image generation and translation,” arXiv preprint arXiv:2103.16146, 2021.
98. J. Xie, L. Xu, and E. Chen, “Image denoising and inpainting with deep neural networks,” in
Advances in Neural Information Processing Systems, 2012, pp. 341–349.
99. C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional
networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2,
pp. 295–307, 2015.
100. J. Kim, J. K. Lee, and K. Lee, “Accurate image super-resolution using very deep convolu-
tional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2016, pp. 1646–1654.
101. M. Telgarsky, “Representation benefits of deep feedforward networks,” arXiv preprint
arXiv:1509.08101, 2015.
102. R. Eldan and O. Shamir, “The power of depth for feedforward neural networks,” in 29th
Annual Conference on Learning Theory, 2016, pp. 907–940.
103. M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. S. Dickstein, “On the expressive power
of deep neural networks,” in Proceedings of the 34th International Conference on Machine
Learning. JMLR, 2017, pp. 2847–2854.
104. D. Yarotsky, “Error bounds for approximations with deep ReLU networks,” Neural Networks,
vol. 94, pp. 103–114, 2017.
105. R. Arora, A. Basu, P. Mianjy, and A. Mukherjee, “Understanding deep neural networks with
rectified linear units,” arXiv preprint arXiv:1611.01491, 2016.
106. S. Mallat, A wavelet tour of signal processing. Academic Press, 1999.
107. D. L. Donoho, “De-noising by soft-thresholding,” IEEE Transactions on Information Theory,
vol. 41, no. 3, pp. 613–627, 1995.
108. Y. C. Eldar and M. Mishali, “Robust recovery of signals from a structured union of
subspaces,” IEEE Transactions on Information Theory, vol. 55, no. 11, pp. 5302–5316, 2009.
109. R. Yin, T. Gao, Y. M. Lu, and I. Daubechies, “A tale of two bases: Local-nonlocal
regularization on image patches with convolution framelets,” SIAM Journal on Imaging
Sciences, vol. 10, no. 2, pp. 711–750, 2017.
110. J. C. Ye, J. M. Kim, K. H. Jin, and K. Lee, “Compressive sampling using annihilating filter-
based low-rank interpolation,” IEEE Transactions on Information Theory, vol. 63, no. 2, pp.
777–801, 2016.
111. K. H. Jin and J. C. Ye, “Annihilating filter-based low-rank Hankel matrix approach for image
inpainting,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 3498–3511, 2015.
112. K. H. Jin, D. Lee, and J. C. Ye, “A general framework for compressed sensing and
parallel MRI using annihilating filter based low-rank Hankel matrix,” IEEE Transactions on
Computational Imaging, vol. 2, no. 4, pp. 480–495, 2016.
113. J.-F. Cai, B. Dong, S. Osher, and Z. Shen, “Image restoration: total variation, wavelet frames,
and beyond,” Journal of the American Mathematical Society, vol. 25, no. 4, pp. 1033–1089,
2012.
114. N. Lei, D. An, Y. Guo, K. Su, S. Liu, Z. Luo, S.-T. Yau, and X. Gu, “A geometric
understanding of deep learning,” Engineering, 2020.
115. B. Hanin and D. Rolnick, “Complexity of linear regions in deep networks," in International
Conference on Machine Learning. PMLR, 2019, pp. 2596–2604.
116. B. Hanin and D. Rolnick. “Deep ReLU networks have surprisingly few activation patterns,”
Advances in Neural Information Processing Systems, vol. 32, pp. 361–370, 2019.
117. X. Zhang and D. Wu, “Empirical studies on the properties of linear regions in deep neural
networks,” arXiv preprint arXiv:2001.01072, 2020.
118. G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio, “On the number of linear regions of deep
neural networks,” in Advances in Neural Information Processing Systems, 2014, pp. 2924–
2932.
119. Z. Allen-Zhu, Y. Li, and Z. Song, “A convergence theory for deep learning via over-
parameterization,” in International Conference on Machine Learning. PMLR, 2019, pp.
242–252.
120. S. Du, J. Lee, H. Li, L. Wang, and X. Zhai, “Gradient descent finds global minima of deep
neural networks,” in International Conference on Machine Learning. PMLR, 2019, pp. 1675–
1685.
121. D. Zou, Y. Cao, D. Zhou, and Q. Gu, “Stochastic gradient descent optimizes over-
parameterized deep ReLU networks,” arXiv preprint arXiv:1811.08888, 2018.
122. H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient
methods under the Polyak–Łojasiewicz condition,” in Joint European Conference on Machine
Learning and Knowledge Discovery in Databases. Springer, 2016, pp. 795–811.
123. Q. Nguyen, “On connected sublevel sets in deep learning,” in International Conference on
Machine Learning. PMLR, 2019, pp. 4790–4799.
124. C. Liu, L. Zhu, and M. Belkin, “Toward a theory of optimization for over-parameterized sys-
tems of non-linear equations: the lessons of deep learning,” arXiv preprint arXiv:2003.00307,
2020.
125. Z. Allen-Zhu, Y. Li, and Y. Liang, “Learning and generalization in overparameterized neural
networks, going beyond two layers,” arXiv preprint arXiv:1811.04918, 2018.
126. M. Soltanolkotabi, A. Javanmard, and J. D. Lee, “Theoretical insights into the optimization
landscape of over-parameterized shallow neural networks,” IEEE Transactions on Informa-
tion Theory, vol. 65, no. 2, pp. 742–769, 2018.
151. S. Arora, R. Ge, B. Neyshabur, and Y. Zhang, “Stronger generalization bounds for deep nets
via a compression approach,” in International Conference on Machine Learning. PMLR,
2018, pp. 254–263.
152. N. Golowich, A. Rakhlin, and O. Shamir, “Size-independent sample complexity of neural
networks,” in Conference On Learning Theory. PMLR, 2018, pp. 297–299.
153. B. Neyshabur, S. Bhojanapalli, and N. Srebro, “A pac-Bayesian approach to spectrally-
normalized margin bounds for neural networks,” arXiv preprint arXiv:1707.09564, 2017.
154. M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine-learning practice
and the classical bias–variance trade-off,” Proceedings of the National Academy of Sciences,
vol. 116, no. 32, pp. 15 849–15 854, 2019.
155. M. Belkin, D. Hsu, and J. Xu, “Two models of double descent for weak features,” SIAM
Journal on Mathematics of Data Science, vol. 2, no. 4, pp. 1167–1180, 2020.
156. L. G. Valiant, “A theory of the learnable,” Communications of the ACM, vol. 27, no. 11, pp.
1134–1142, 1984.
157. W. Hoeffding, “Probability inequalities for sums of bounded random variables,” in The
Collected Works of Wassily Hoeffding. Springer, 1994, pp. 409–426.
158. N. Sauer, “On the density of families of sets,” Journal of Combinatorial Theory, Series A,
vol. 13, no. 1, pp. 145–147, 1972.
159. Y. Jiang, B. Neyshabur, H. Mobahi, D. Krishnan, and S. Bengio, “Fantastic generalization
measures and where to find them,” arXiv preprint arXiv:1912.02178, 2019.
160. P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian, “Nearly-tight VC-dimension and
pseudodimension bounds for piecewise linear neural networks.” Journal of Machine Learning
Research, vol. 20, no. 63, pp. 1–17, 2019.
161. P. L. Bartlett and S. Mendelson, “Rademacher and Gaussian complexities: Risk bounds and
structural results,” Journal of Machine Learning Research, vol. 3, pp. 463–482, 2002.
162. A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, “Learnability and the Vapnik–
Chervonenkis dimension,” Journal of the ACM (JACM), vol. 36, no. 4, pp. 929–965, 1989.
163. D. A. McAllester, “Some PAC-Bayesian theorems,” Machine Learning, vol. 37, no. 3, pp.
355–363, 1999.
164. P. Germain, A. Lacasse, F. Laviolette, and M. Marchand, “PAC-Bayesian learning of
linear classifiers,” in Proceedings of the 26th Annual International Conference on Machine
Learning, 2009, pp. 353–360.
165. C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning
requires rethinking generalization,” arXiv preprint arXiv:1611.03530, 2016.
166. A. Bietti and J. Mairal, “On the inductive bias of neural tangent kernels,” arXiv preprint
arXiv:1905.12173, 2019.
167. B. Neyshabur, R. Tomioka, and N. Srebro, “In search of the real inductive bias: On the role
of implicit regularization in deep learning,” arXiv preprint arXiv:1412.6614, 2014.
168. D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro, “The implicit bias of
gradient descent on separable data,” The Journal of Machine Learning Research, vol. 19,
no. 1, pp. 2822–2878, 2018.
169. S. Gunasekar, J. Lee, D. Soudry, and N. Srebro, “Implicit bias of gradient descent on linear
convolutional networks,” arXiv preprint arXiv:1806.00468, 2018.
170. L. Chizat and F. Bach, “Implicit bias of gradient descent for wide two-layer neural networks
trained with the logistic loss,” in Conference on Learning Theory. PMLR, 2020, pp. 1305–
1338.
171. S. Gunasekar, J. Lee, D. Soudry, and N. Srebro, “Characterizing implicit bias in terms of
optimization geometry,” in International Conference on Machine Learning. PMLR, 2018, pp.
1832–1841.
172. H. Xu and S. Mannor, “Robustness and generalization,” Machine Learning, vol. 86, no. 3, pp.
391–423, 2012.
173. A. W. Van Der Vaart and J. A. Wellner, “Weak convergence,” in Weak Convergence and
Empirical Processes. Springer, 1996, pp. 16–28.
Index

Symbols
β-VAE, 297
σ-algebra, 9
c-transforms, 279
f-GAN, 285
f-divergence, 272, 285

A
Absolutely continuous, 270
Absolutely continuous measure, 10
Action potential, 82
Activation function, 93
Adaptive instance normalization (AdaIN), 161
Adjacency matrix, 138
Adjoint, 5
Affine function, 18
AlexNet, 114
Algorithmic robustness, 261
Ambient space, 61
Atlas, 221
Attention, 164
Attentional GAN (AttnGAN), 172
Autoencoder, 220
Auxiliary classifier, 307
Average pooling, 120
Axon, 80
Axon hillock, 80, 82

B
Backpropagation, 102
Backward-propagated estimation error, 106
Bag-of-words (BOW) kernel, 64
Banach space, 6
Basis, 8
Basis pursuit, 200
Basis vectors, 8
Batch norm (BN), 158
Batch normalization, 157
Benign optimization landscape, 233
Bias–variance trade-off, 56
Bidirectional Encoder Representations from Transformers (BERT), 178
Biological neural network, 79
Break point, 249

C
Calculus of variations, 106
Cauchy–Schwarz inequality, 6
Cauchy sequence, 4
Channel attention, 168
Chemical synapses, 81
Classifier, 29
Collaborative GAN (CollaGAN), 307
Colored graph, 139
Community detection, 135
Complete, 4
Compressed sensing, 200
Concave, 20
Concentration inequalities, 245
Content image, 161
Continuous bag-of-words (CBOW), 141
Convex, 19
Convex conjugate, 21
Convolution, 119
Convolutional neural network (CNN), 113
Convolution framelet, 205

E
Earth-mover distance, 276
Edges, 135, 138
Eigen-decomposition, 12
Electric synapses, 80
Empirical risk minimization (ERM), 244
Encoder–decoder CNN, 209
Entropy regularization, 281
Evidence lower bound (ELBO), 292
Excitatory postsynaptic potentials (EPSPs), 81
Expressivity, 196, 212

F
Feature engineering, 42
Feature space, 61
Feedforward neural network, 95
First-order necessary conditions (FONC), 34
Fixed points, 17
Forward-propagated input, 106
Fréchet differentiable, 21
Frame, 9, 198
Frame condition, 199
Framelets, 200

I
ImageNet, 114
Image style transfer, 161
Implicit bias, 234, 261
Inception module, 115
Independent variable, 46
Indicator function, 18
Induced norm, 6
Inductive bias, 260
Inhibitory postsynaptic potentials (IPSPs), 81
Inner product, 5
Instance normalization, 160
Interpolation threshold, 256
Invexity, 233
Ionotropic receptors, 164

J
Jennifer Aniston cell, 88
Jensen–Shannon (JS) divergence, 274

K
Kantorovich formulation, 278
Karush–Kuhn–Tucker (KKT) conditions, 35
Kernel, 63