
Deep Learning
2. Deep Feedforward Networks

Lars Schmidt-Thieme

Information Systems and Machine Learning Lab (ISMLL)


Institute for Computer Science
University of Hildesheim, Germany


Syllabus

Tue. 21.4. (1) 1. Supervised Learning (Review 1)


Tue. 28.4. (2) 2. Neural Networks (Review 2)
Tue. 5.5. (3) 3. Regularization
Tue. 12.5. (4) 4. Optimization
Tue. 19.5. (5) 5. Convolutional Neural Networks
Tue. 26.5. (6) 6. Recurrent Neural Networks
Tue. 2.6. — — Pentecost Break —
Tue. 9.6. (7) 7. Autoencoders
Tue. 16.6. (8) 8. Generative Adversarial Networks
Tue. 23.6. (9) 9. Recent Advances
Tue. 30.6. (10) 10. Engineering Deep Learning Models
Tue. 7.7. (11) tbd.
Tue. 14.7. (12) Q&A


Outline

1. What is a Neural Network?

2. An example: XOR

3. Loss and Output Layer

4. Basic Feedforward Network Architecture

5. Backpropagation

1. What is a Neural Network?

What is a Deep Feedforward Network (DFN)?

I Feedforward networks
(aka feedforward neural networks or multilayer perceptrons)

I Given a function y = f ∗ (x) that maps input x to output y

I A DFN defines a parametric mapping ŷ = f(x; θ) with parameters θ

I The aim is to learn θ such that f(x; θ) best approximates f∗(x)!


Why Feedforward?
I Given a Feedforward Network ŷ = f(x; θ)
I The input x is passed through a chain of steps before the output ŷ is produced

I Example: f_1(x), f_2(x) and f_3(x) can be chained as:

I f(x) = f_3(f_2(f_1(x)))
I x is the zero-th layer, or the input layer
I f_1 is the first layer, or the first hidden layer
I f_2 is the second layer, or the second hidden layer
I f_3 is the last layer, or the output layer

I No feedback exists between the steps of the chain

I Feedback connections yield recurrent neural networks

I The number of hidden layers defines the depth of the network

I The dimensionality of the hidden layers defines the width of the network



Why Neural?

I Loosely inspired by neuroscience, hence Artificial Neural Network

I Each hidden layer node resembles a neuron

I The input to a neuron arrives via the synaptic connections from the attached neurons of the previous layer

I The output of a neuron is an aggregation of its input vector

I The signal propagates forward in a chain of neuron-to-neuron transmissions

I However, modern deep learning research is steered mainly by mathematical and engineering principles!


Why Network?
I A feed-forward network is a directed acyclic graph, where:
I Graph nodes are structured in layers
I Directed links between nodes carry the parameters/weights
I Each node is a computational function
I There are no intra-layer or layer-skipping connections (though they are possible in general)
I The input to the first layer is given (the features x)
I The output is the computation of the last layer (the target ŷ)

Figure 1: FNN. Source: www.analyticsvidhya.com

Nonlinear Mapping
I We can easily solve linear regression, but not every problem is linear.

I Can the function f(x) = (x + 1)^2 be approximated by a linear function?

I Yes, but only if we map the feature x into a new space:

Figure 2: Mapping the feature x into a new dimensionality x → φ(x) = (a, b), with a = x^2 and b = x. Left panel: f(x) = (x + 1)^2 plotted over x; right panel: f(a, b) = a + 2b + 1, which is linear in (a, b).
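To make this concrete, here is a small NumPy sketch (my own illustration, not part of the slides) that fits f(x) = (x + 1)^2 exactly with ordinary least squares once x is mapped to the features (x^2, x, 1):

import numpy as np

# Illustrative only: fit f(x) = (x + 1)^2 = x^2 + 2x + 1 by ordinary least
# squares on the mapped features phi(x) = (x^2, x, 1).
x = np.linspace(-10, 10, 101)
y = (x + 1) ** 2

Phi = np.column_stack([x ** 2, x, np.ones_like(x)])   # feature map phi(x)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)           # linear model in phi-space

print(w)                           # approximately [1. 2. 1.]
print(np.abs(Phi @ w - y).max())   # the fit is exact up to numerical error

A plain linear model in x alone cannot achieve this, since (x + 1)^2 is not an affine function of x.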

Nonlinear Mapping (II)

I Which mapping φ(x) is the best?

There are various ways of designing φ(x):


1. Hand-craft (manually engineer) φ(x)
2. Use a very generic φ(x), e.g., an RBF or polynomial expansion
3. Parametrize and learn the mapping f(x; θ, w) := φ(x; θ)^T w

Deep Feedforward Networks follow the third approach, where:


I the hidden layers (weights θ) learn the mapping φ(x; θ)

I the output layer (weights w) learns the function g(z; w) := z^T w


Nonlinear Mapping

I Consider the function f(x) = x^2 + 2e^x + 3x − 5

I From which latent features can it be obtained as a linear combination?
  A. x^2
  B. x^2, x, e^x
  C. x, e^x
  D. x^2, x, e^x, sin(x)

2. An example: XOR

An example - Learn XOR


I XOR is a function:
x1 x2 y = f ∗ (x)
0 0 0
0 1 1
1 0 1
1 1 0

I Can we learn a DFN ŷ = f(x; θ) such that f resembles f ∗ ?

I Our dataset:

D^train := {((0, 0)^T, 0), ((1, 0)^T, 1), ((0, 1)^T, 1), ((1, 1)^T, 0)}

I Leading to the optimization:

arg min_θ J(θ) := (1/4) Σ_{(x,y) ∈ D^train} (y − f(x; θ))^2


An example - Learn XOR (2)


I We will learn a simple DFN with one hidden layer:

Figure 3: Left: Detailed, Right: Compact, Source: Goodfellow et al., 2016

I Two functions are chained: h = f_1(x; W, c) and ŷ = f_2(h; w, b)

I For the n-th instance, hidden layer: h_i^(n) = g(W_{:,i}^T x^(n) + c_i)
I For the n-th instance, output layer: ŷ^(n) = w^T h^(n) + b
I W ∈ R^{2×2}, c ∈ R^{2×1}, w ∈ R^{2×1}, b ∈ R

Rectified Linear Unit (ReLU)


Non-linear activation function:

relu(z) := max{0, z}

Node:

f(z) := relu(Wz) = max{0, Wz} = (max{0, W_{k,·} z})_{k=1:K}

Figure 4: The ReLU activation, Source: Goodfellow et al., 2016


”Deus ex machina” solution?

Suppose I magically found out that:

     
W = (1 1; 1 1) (i.e., all entries equal to 1),   c = (0, −1)^T,   w = (1, −2)^T,   b = 0

We will later see an optimization technique called backpropagation for learning these network parameters.


XOR Solution - Hidden Layer Computations


   
h_1^(1) = g(W_{:,1}^T x^(1) + c_1) = g(1·0 + 1·0 + 0) = g(0) = 0
h_2^(1) = g(W_{:,2}^T x^(1) + c_2) = g(1·0 + 1·0 − 1) = g(−1) = 0

h_1^(2) = g(W_{:,1}^T x^(2) + c_1) = g(1·0 + 1·1 + 0) = g(1) = 1
h_2^(2) = g(W_{:,2}^T x^(2) + c_2) = g(1·0 + 1·1 − 1) = g(0) = 0

h_1^(3) = g(W_{:,1}^T x^(3) + c_1) = g(1·1 + 1·0 + 0) = g(1) = 1
h_2^(3) = g(W_{:,2}^T x^(3) + c_2) = g(1·1 + 1·0 − 1) = g(0) = 0

h_1^(4) = g(W_{:,1}^T x^(4) + c_1) = g(1·1 + 1·1 + 0) = g(2) = 2
h_2^(4) = g(W_{:,2}^T x^(4) + c_2) = g(1·1 + 1·1 − 1) = g(1) = 1

XOR Solution - Output Layer Computations

ŷ^(1) = w^T h^(1) + b = 1·0 − 2·0 + 0 = 0
ŷ^(2) = w^T h^(2) + b = 1·1 − 2·0 + 0 = 1
ŷ^(3) = w^T h^(3) + b = 1·1 − 2·0 + 0 = 1
ŷ^(4) = w^T h^(4) + b = 1·2 − 2·1 + 0 = 0

The computations of the final layer match exactly those of the XOR function.
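As a sanity check, this forward pass can be reproduced in a few lines of NumPy (a sketch using exactly the parameter values from the slides; the code itself is not part of the deck):

import numpy as np

# Parameters from the "deus ex machina" solution above.
W = np.array([[1., 1.],
              [1., 1.]])
c = np.array([0., -1.])
w = np.array([1., -2.])
b = 0.

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # the four XOR inputs, row-wise

H = np.maximum(0., X @ W + c)   # hidden layer: h = relu(W^T x + c) for each row x
y_hat = H @ w + b               # output layer: y = w^T h + b

print(y_hat)                    # [0. 1. 1. 0.], exactly the XOR targets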

3. Loss and Output Layer

Maximum Likelihood as Objective

I The loss can be expressed in probabilistic terms as

J(θ) = −E(x,y )∼pdata log pmodel (y | x)

I If our model outputs normal uncertainty:

pmodel (y | x) = N (y ; f (x; θ), σ 2 )


J(θ) = (1/2) E_{(x,y)∼p_data} (y − f(x; θ))^2 + const

I the model just outputs the mean f(x; θ); σ^2 is its error variance.
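For completeness, the standard step behind this identity, which the slide leaves implicit:

−log p_model(y | x) = −log N(y; f(x; θ), σ^2) = 1/(2σ^2) (y − f(x; θ))^2 + (1/2) log(2πσ^2)

Taking the expectation over (x, y) ∼ p_data and dropping the additive constant and the θ-independent factor 1/σ^2 (which does not change the minimizer) yields the squared-error objective above.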


Output Layer — Gaussian Output Distribution

I Affine transformation without nonlinearity


I Given features h, produces ŷ = w^T h + b
I the activation function is the identity a(h) := h

I Interpreted as mean of a conditional Gaussian distribution


I p(y | x) = N (y ; ŷ , σ 2 ), ŷ := f (x; θ)


Bernoulli Output Distributions

I Binary target variables follow a Bernoulli distribution


P(y = 1) = p, P(y = 0) = 1 − p

I Train a DFN such that ŷ = f (x; θ) ∈ [0, 1]


I Naive Option: clip a linear output layer:

I P(y = 1 | x) = max{0, min{1, w^T h + b}}

I What is the problem with the clipped linear output layer?


Bernoulli Output Distributions (2)

I Use a smooth sigmoid output unit:

ŷ = σ(z) = e^z / (e^z + 1),   z = w^T h + b

I The loss for a DFN f(x; θ) with a sigmoid output is:

J(θ) = Σ_{n=1}^N [ −y_n log(f(x_n; θ)) − (1 − y_n) log(1 − f(x_n; θ)) ]

I Also called the cross-entropy loss
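A minimal NumPy sketch (illustrative, not from the slides) of the sigmoid output and this cross-entropy loss; computing the loss directly from the logits z via np.logaddexp avoids overflow for large |z|:

import numpy as np

def sigmoid(z):
    # sigma(z) = e^z / (e^z + 1) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(z, y):
    # J = sum_n [ -y_n log sigma(z_n) - (1 - y_n) log(1 - sigma(z_n)) ],
    # rewritten in terms of the logits z = w^T h + b:
    #   -log sigma(z) = log(1 + e^(-z)),   -log(1 - sigma(z)) = log(1 + e^z)
    return np.sum(y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z))

z = np.array([-2.0, 0.5, 3.0])   # example logits
y = np.array([0.0, 1.0, 1.0])    # binary targets
print(sigmoid(z), binary_cross_entropy(z, y))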


Multinoulli Output Distribution


I For multi-category targets: ŷ_c = P(y = c | x), c ∈ {1, ..., C}

I last latent layer: unnormalized log probabilities:

z_c = log P̃(y = c | x) := w_c^T h + b

I yields probabilities:

P(y = c | x) := softmax(z)_c := e^{z_c} / Σ_d e^{z_d}

I Minimizing the log-likelihood loss:

J(θ) = Σ_{n=1}^N Σ_{c=1}^C −I(y_n = c) log P(y = c | x_n)
     = − Σ_{n=1}^N Σ_{c=1}^C I(y_n = c) ( z_c − log Σ_d e^{z_d} )
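The same, sketched for the softmax case in NumPy (illustrative, not from the slides); subtracting the row-wise maximum before exponentiating is the usual stabilization and leaves softmax(z) unchanged:

import numpy as np

def log_softmax(Z):
    # log softmax(z)_c = z_c - log sum_d e^(z_d), row-wise for a batch Z of shape (N, C)
    Z = Z - Z.max(axis=1, keepdims=True)                 # stabilization only
    return Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))

def cross_entropy(Z, y):
    # J = - sum_n log P(y = y_n | x_n), with y given as class indices
    return -log_softmax(Z)[np.arange(len(y)), y].sum()

Z = np.array([[2.0, 0.5, -1.0],    # unnormalized log probabilities (logits)
              [0.1, 0.2,  3.0]])
y = np.array([0, 2])               # true classes
print(cross_entropy(Z, y))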

4. Basic Feedforward Network Architecture

Types of Hidden Units

I Question: Can we use no activation function,i.e., only purely linear


layers h = W T x + b?

I Remember: the most commonly used hidden-layer activation is the ReLU:

h = relu(W^T x + b) = max{0, W^T x + b}

I Alternatively, the sigmoid function:

h = σ(z)

I or, the hyperbolic tangent:

h = tanh(z) = 2σ(2z) − 1
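The three activations side by side in NumPy (a tiny illustrative sketch; the last line checks the identity tanh(z) = 2σ(2z) − 1 numerically):

import numpy as np

relu = lambda z: np.maximum(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3.0, 3.0, 7)
print(relu(z))
print(sigmoid(z))
print(np.tanh(z))
print(np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0))   # True: tanh(z) = 2*sigma(2z) - 1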


Architecture of Hidden Layers

A DFN with L hidden layers:


h^(1) = g^(1)((W^(1))^T x + b^(1))
h^(2) = g^(2)((W^(2))^T h^(1) + b^(2))
...
h^(L) = g^(L)((W^(L))^T h^(L−1) + b^(L))

Different layers can have different activation functions.
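A compact NumPy sketch of this stacked forward pass (illustrative; the layer widths, the random initialization, and the choice of activations are my own assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)

# Three hidden layers of widths 5, 4, 3 on a 2-dimensional input.
sizes = [2, 5, 4, 3]
activations = [relu, relu, np.tanh]                    # g^(1), g^(2), g^(3)
params = [(rng.normal(size=(m, n)), np.zeros(n))       # (W^(l), b^(l)), W^(l) of shape (m, n)
          for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, params, activations):
    h = x
    for (W, b), g in zip(params, activations):
        h = g(h @ W + b)   # h^(l) = g^(l)((W^(l))^T h^(l-1) + b^(l)), with h as a row vector
    return h

print(forward(rng.normal(size=2), params, activations))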

5. Backpropagation

Computational Graphs

[Two computational graphs: left, z = dot(x, w); right, z_1 = dot(x, w) followed by z_2 = z_1 + b]

z = x^T w

z_2 = z_1 + b = x^T w + b


Computational Graphs

[Two computational graphs: left, z_3 = relu(dot(x, w) + b); right, z_6 = relu(dot(w_2, relu(dot(x, w_1) + b_1)) + b_2)]

z_3 = relu(x^T w + b)

z_6 = relu(w_2^T relu(x^T w_1 + b_1) + b_2)

Forward Computation
I computational graph (Z , E ), a DAG.
I Z a set of node IDs.
I E ⊆ Z × Z a set of directed edges.
For every node z ∈ Z :
I T_z: the domain of the node (e.g., R^17)

and additionally, for every non-root node z ∈ Z:

I f_z : Π_{z′ ∈ fanin(z)} T_{z′} → T_z   (the node operation)

I forward computation:
  Given values v_z ∈ T_z for all root nodes z ∈ Z,
  compute a value for every node z ∈ Z via

  v_z := f_z((v_{z′})_{z′ ∈ fanin(z)}) = f_z(v_{fanin(z)}),   writing v_{fanin(z)} := (v_{z′})_{z′ ∈ fanin(z)}

Note: fanin_{(Z,E)}(z) := {z′ ∈ Z | (z′, z) ∈ E}, the nodes with edges into z.

Forward Computation / Example


[Computational graph: z_1 = dot(x, w), z_2 = z_1 + b, z_3 = relu(z_2)]

I types for each node:
T_x := R^2,  T_w := R^2,  T_b := R,  T_{z_1} = T_{z_2} = T_{z_3} := R

I functions for each non-root node:
f_1(x, w) := x^T w,   f_2(z_1, b) := z_1 + b,   f_3(z_2) := relu(z_2)

I given values for all root nodes:
x = (2, 1)^T,   w = (1, −1)^T,   b = 0.5

I compute values for all non-root nodes:
z_3 = relu(x^T w + b)
z_1 = 1,   z_2 = 1.5,   z_3 = 1.5

Forward Computation / Algorithm


cg-forward((Z, E, f), (v_z)_{z ∈ roots(Z,E)}):
    for z ∈ Z \ roots(Z, E) in topological order:
        v_z := f_z((v_{z′})_{z′ ∈ fanin(z)})
    return (v_z)_{z ∈ Z}

topological-order(Z, E):
    x := ()
    while Z ≠ ∅:
        choose z ∈ roots(Z, E) arbitrarily
        delete z from the graph (Z, E)
        x := x _ (z)
    return x

Note: x _ y denotes the concatenation of two lists.
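A runnable Python sketch of these two routines (my own translation of the pseudocode, under the assumption that the graph is stored as a dict of fan-in lists per node):

def topological_order(fanin):
    # repeatedly pick a node whose parents have all been ordered already
    order, remaining = [], dict(fanin)
    while remaining:
        z = next(n for n, parents in remaining.items()
                 if all(p not in remaining for p in parents))
        del remaining[z]
        order.append(z)
    return order

def cg_forward(fanin, f, values):
    # values: given value v_z for every root node z
    v = dict(values)
    for z in topological_order(fanin):
        if fanin[z]:                                    # non-root node
            v[z] = f[z](*[v[p] for p in fanin[z]])      # v_z := f_z(v_fanin(z))
    return v

# Example: the graph z1 = dot(x, w), z2 = z1 + b, z3 = relu(z2) from the slides.
import numpy as np
fanin = {"x": [], "w": [], "b": [],
         "z1": ["x", "w"], "z2": ["z1", "b"], "z3": ["z2"]}
f = {"z1": lambda x, w: x @ w,
     "z2": lambda z1, b: z1 + b,
     "z3": lambda z2: max(0.0, z2)}
v = cg_forward(fanin, f, {"x": np.array([2.0, 1.0]), "w": np.array([1.0, -1.0]), "b": 0.5})
print(v["z1"], v["z2"], v["z3"])   # 1.0 1.5 1.5, as in the example above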



Gradients in Computational Graphs (1/2)

I let's assume every operation is differentiable and that we have, for every non-root node z, its gradient functions ∇_{z′} z = ∂z/∂z′:

g_{z,z′} : Π_{z̃′ ∈ fanin(z)} T_{z̃′} → T_z × T_{z′},   z′ ∈ fanin(z)

I for any node we can then compute its gradient values w.r.t. its inputs:

w_{z,z′} = g_{z,z′}((v_{z̃′})_{z̃′ ∈ fanin(z)})


Gradient Computation / Example


[Computational graph: z_1 = dot(x, w), z_2 = z_1 + b, z_3 = relu(z_2)]

I functions for each non-root node:
f_1(x, w) := x^T w,   f_2(z_1, b) := z_1 + b,   f_3(z_2) := relu(z_2)

I gradient functions for each non-root node:
g_{1,x}(x, w) := w,   g_{1,w}(x, w) := x,
g_{2,z_1}(z_1, b) := 1,   g_{2,b}(z_1, b) := 1,
g_{3,z_2}(z_2) := I(z_2 ≥ 0)

I given values for all root nodes:
x = (2, 1)^T,   w = (1, −1)^T,   b = 0.5

I compute gradient values for all neighbors:
∇_{z_2} z_3 = 1,   ∇_{z_1} z_2 = 1,   ∇_b z_2 = 1,
∇_x z_1 = (1, −1)^T,   ∇_w z_1 = (2, 1)^T

z_3 = relu(x^T w + b)

Gradients in Computational Graphs (2/2)

I for any subgraph z″ → z′_1, z′_2, ..., z′_K → z, by the chain rule:

∇_{z″} z = Σ_{k=1}^K ∇_{z′_k} z · ∇_{z″} z′_k = Σ_{k=1}^K (∂z/∂z′_k)(∂z′_k/∂z″)

w_{z,z″} = Σ_{k=1}^K w_{z,z′_k} g_{z′_k,z″}(v_{fanin(z′_k)})
         = Σ_{k=1}^K Σ_{i ∈ dim T_{z′_k}} (w_{z,z′_k})_{·,i} (g_{z′_k,z″}(v_{fanin(z′_k)}))_{i,·}   ∈ T_z × T_{z″}

I this way, gradients between any two nodes in a computational graph can be computed automatically.


Gradient Computation / Example


I gradient values for all neighbors:
∇_{z_2} z_3 = 1,   ∇_{z_1} z_2 = 1,   ∇_b z_2 = 1,
∇_x z_1 = (1, −1)^T,   ∇_w z_1 = (2, 1)^T

I gradient values for all node pairs:
∇_{z_1} z_3 = 1,   ∇_b z_3 = 1, and either

A.   ∇_x z_3 = (2, 1)^T,   ∇_w z_3 = (1, −1)^T
B.   ∇_x z_3 = (1, −1)^T,   ∇_w z_3 = (2, 1)^T ?

∇_x z_2 = (1, −1)^T,   ∇_w z_2 = (2, 1)^T

[Computational graph: z_3 = relu(x^T w + b)]

Gradient Computation / Example

I gradient values for all neighbors:
∇_{z_2} z_3 = 1,   ∇_{z_1} z_2 = 1,   ∇_b z_2 = 1,
∇_x z_1 = (1, −1)^T,   ∇_w z_1 = (2, 1)^T

I gradient values for all node pairs:
∇_{z_1} z_3 = 1,   ∇_b z_3 = 1,
∇_x z_3 = (1, −1)^T,   ∇_w z_3 = (2, 1)^T,
∇_x z_2 = (1, −1)^T,   ∇_w z_2 = (2, 1)^T

[Computational graph: z_3 = relu(x^T w + b)]


Gradient Computation / Algorithm / Single Leaf


cg-gradient((Z, E, f, g), (v_z)_{z ∈ roots(Z,E)}):
    v := cg-forward((Z, E, f), (v_z)_{z ∈ roots(Z,E)})
    z := the single leaf in (Z, E)
    w_{z,z} := 1
    for z″ ∈ Z \ {z} in reverse topological order:
        w_{z,z″} := 0
        for z′ ∈ fanout(z″):
            w_{z′,z″} := g_{z′,z″}((v_{z̃″})_{z̃″ ∈ fanin(z′)})
            w_{z,z″} := w_{z,z″} + w_{z,z′} w_{z′,z″}
    return (w_{z,z′})_{z′ ∈ roots(Z,E)}

I compute gradients ∇_{z′} z for the single leaf node z and all root nodes z′

I take the subgraph on ancestors(z) ∩ descendants(Z_in) to compute all gradients ∇_{z′} z for nodes z′ ∈ Z_in
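A Python sketch of this backward pass, reusing cg_forward and topological_order from the sketch above (again my own translation; for simplicity all gradient values here are plain scalars or vectors, which suffices for the running example with the scalar leaf z3):

import numpy as np

fanin = {"x": [], "w": [], "b": [],
         "z1": ["x", "w"], "z2": ["z1", "b"], "z3": ["z2"]}
f = {"z1": lambda x, w: x @ w,
     "z2": lambda z1, b: z1 + b,
     "z3": lambda z2: max(0.0, z2)}
g = {("z1", "x"): lambda x, w: w,      ("z1", "w"): lambda x, w: x,
     ("z2", "z1"): lambda z1, b: 1.0,  ("z2", "b"): lambda z1, b: 1.0,
     ("z3", "z2"): lambda z2: float(z2 >= 0)}           # manually specified gradients

def cg_gradient(fanin, f, g, values, leaf):
    v = cg_forward(fanin, f, values)                    # forward pass first
    fanout = {z: [c for c, ps in fanin.items() if z in ps] for z in fanin}
    grad = {leaf: 1.0}                                  # w_{z,z} := 1
    for z2 in reversed(topological_order(fanin)):       # reverse topological order
        if z2 == leaf:
            continue
        grad[z2] = 0.0
        for z1 in fanout[z2]:                           # w_{z,z''} += w_{z,z'} * g_{z',z''}
            grad[z2] = grad[z2] + grad[z1] * g[(z1, z2)](*[v[p] for p in fanin[z1]])
    return grad

values = {"x": np.array([2.0, 1.0]), "w": np.array([1.0, -1.0]), "b": 0.5}
grads = cg_gradient(fanin, f, g, values, leaf="z3")
print(grads["x"], grads["w"], grads["b"])   # [ 1. -1.] [2. 1.] 1.0, matching the example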

Gradient Computation

I automatic gradient computation in computational graphs combines
  I manually specified gradients of elementary functions,
  I the chain rule, and
  I a useful arrangement of computations:
    I forward function computations
    I backward gradient computations
    I computing each neighbor gradient only once (for multiple leaf nodes)

I for feedforward neural networks this is called backpropagation

I the algorithm can be formulated without graph terminology


Summary
I Feedforward neural networks are models for supervised prediction
(regression, classification).

I They consist of L stacked layers, each of the form

z_ℓ(z_{ℓ−1}) := a(W_ℓ^T z_{ℓ−1})

consisting of
  I a linear combination of the previous layer values with parameters W_ℓ
  I a nonlinear function (activation function),
    often just the rectified linear unit relu(z) := max{0, z}

I The output layer contains an activation function that reflects the target type / output distribution:
  I identity: for continuous targets (with normally distributed uncertainty)
  I logistic function: for a Bernoulli probability (binary classification)
  I softmax function: for a multinoulli probability (multi-class classification)

Summary (2/2)

I Any loss can be used, especially the negative log-likelihood.
  I for classification problems this is called the cross-entropy

I To learn a feedforward neural network, gradient-descent type algorithms can be used (especially stochastic gradient descent).

I Gradients of neural networks, and more generally of any computational graph, can be computed automatically:
  I backpropagation


Further Readings
I Goodfellow et al. 2016, ch. 6

I Zhang et al. 2020, ch. 2.5, 3–5

I lecture Machine Learning, chapter B.2

Acknowledgement: An earlier version of the slides for this lecture was written by my former postdoc Dr Josif Grabocka.

References
Charu C. Aggarwal. Neural Networks and Deep Learning: A Textbook. Springer International Publishing, 2018. ISBN 978-3-319-94462-3. doi: 10.1007/978-3-319-94463-0.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. The MIT Press, Cambridge, Massachusetts, November 2016. ISBN 978-0-262-03561-3.
Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander Smola. Dive into Deep Learning. https://d2l.ai/, 2020.
