
Deep Learning
2. Deep Feedforward Networks

Lars Schmidt-Thieme

Information Systems and Machine Learning Lab (ISMLL)


Institute for Computer Science
University of Hildesheim, Germany


Syllabus

Tue. 21.4. (1) 1. Supervised Learning (Review 1)


Tue. 28.4. (2) 2. Neural Networks (Review 2)
Tue. 5.5. (3) 3. Regularization
Tue. 12.5. (4) 4. Optimization
Tue. 19.5. (5) 5. Convolutional Neural Networks
Tue. 26.5. (6) 6. Recurrent Neural Networks
Tue. 2.6. — — Pentecost Break —
Tue. 9.6. (7) 7. Autoencoders
Tue. 16.6. (8) 8. Generative Adversarial Networks
Tue. 23.6. (9) 9. Recent Advances
Tue. 30.6. (10) 10. Engineering Deep Learning Models
Tue. 7.7. (11) tbd.
Tue. 14.7. (12) Q&A


Outline

1. What is a Neural Network?

2. An example: XOR

3. Loss and Output Layer

4. Basic Feedforward Network Architecture

5. Backpropagation

1. What is a Neural Network?

What is a Deep Feedforward Network (DFN)?

I Feedforward networks
(aka feedforward neural networks or multilayer perceptrons)

I Given a function y = f ∗ (x) that maps input x to output y

I A DFN defines a parametric mapping ŷ = f(x; θ) with parameters θ

I The aim is to learn θ such that f(x; θ) best approximates f∗(x)!


Why Feedforward?
I Given a Feedforward Network ŷ = f(x; θ)
I The input x is passed through a chain of steps before the output ŷ is produced

I Example: f_1(x), f_2(x) and f_3(x) can be chained as:

I f(x) = f_3(f_2(f_1(x)))
I x is the zero-th layer, or the input layer
I f_1 is the first layer, or the first hidden layer
I f_2 is the second layer, or the second hidden layer
I f_3 is the last layer, or the output layer

I No feedback exists between the steps of the chain

I Feedback connections yield recurrent neural networks

I The number of hidden layers defines the depth of the network

I The dimensionality of the hidden layers defines the width of the network



Why Neural?

I Loosely inspired by neuroscience, hence Artificial Neural Network

I Each hidden layer node resembles a neuron

I The input to a neuron arrives via the synaptic connections from the attached neurons of the previous layer

I The output of a neuron is an aggregation of its input vector

I The signal propagates forward in a chain of neuron-to-neuron transmissions

I However, modern deep learning research is steered mainly by mathematical and engineering principles!


Why Network?
I A feed-forward network is a directed acyclic graph, where:
I Graph nodes are structured in layers
I Directed links between nodes carry the parameters/weights
I Each node is a computational function
I There are no intra-layer or layer-skipping connections (though they are possible in general)
I The input to the first layer is given (the features x)
I The output is the computation of the last layer (the target ŷ)

Figure 1: FNN. Source: www.analyticsvidhya.com

Nonlinear Mapping
I We can easily solve linear regression, but not every problem is linear.

I Can the function f(x) = (x + 1)^2 be approximated by a linear function?

I Yes, but only if we map the feature x into a new space:

Figure 2: Mapping the feature x into a new dimensionality x → φ(x) = (a, b), with a = x^2 and b = x. Left panel: f(x) = (x + 1)^2 plotted over x; right panel: f(a, b) = a + 2b + 1, which is linear in (a, b).
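To make this concrete, here is a small NumPy sketch (my own illustration, not part of the slides) that fits f(x) = (x + 1)^2 exactly with ordinary least squares once x is mapped to the features (x^2, x, 1):

import numpy as np

# Illustrative only: fit f(x) = (x + 1)^2 = x^2 + 2x + 1 by ordinary least
# squares on the mapped features phi(x) = (x^2, x, 1).
x = np.linspace(-10, 10, 101)
y = (x + 1) ** 2

Phi = np.column_stack([x ** 2, x, np.ones_like(x)])   # feature map phi(x)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)           # linear model in phi-space

print(w)                           # approximately [1. 2. 1.]
print(np.abs(Phi @ w - y).max())   # the fit is exact up to numerical error

A plain linear model in x alone cannot achieve this, since (x + 1)^2 is not an affine function of x.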

Nonlinear Mapping (II)

I Which mapping φ(x) is the best?

There are various ways of designing φ(x):


1. Hand-craft (manually engineer) φ(x)
2. Use a very generic φ(x), e.g., an RBF or polynomial expansion
3. Parametrize and learn the mapping f(x; θ, w) := φ(x; θ)^T w

Deep Feedforward Networks follow the third approach, where:


I the hidden layers (weights θ) learn the mapping φ(x; θ)

I the output layer (weights w) learns the function g(z; w) := z^T w


Nonlinear Mapping

I Consider the function f(x) = x^2 + 2e^x + 3x − 5

I From which latent features can it be obtained as a linear combination?
  A. x^2
  B. x^2, x, e^x
  C. x, e^x
  D. x^2, x, e^x, sin(x)

2. An example: XOR

An example - Learn XOR


I XOR is a function:
x1 x2 y = f ∗ (x)
0 0 0
0 1 1
1 0 1
1 1 0

I Can we learn a DFN ŷ = f(x; θ) such that f resembles f ∗ ?

I Our dataset:

D^train := {((0, 0)^T, 0), ((1, 0)^T, 1), ((0, 1)^T, 1), ((1, 1)^T, 0)}

I Leading to the optimization:

arg min_θ J(θ) := (1/4) Σ_{(x,y) ∈ D^train} (y − f(x; θ))^2


An example - Learn XOR (2)


I We will learn a simple DFN with one hidden layer:

Figure 3: Left: Detailed, Right: Compact, Source: Goodfellow et al., 2016

I Two functions are chained: h = f_1(x; W, c) and ŷ = f_2(h; w, b)

I For the n-th instance, hidden layer: h_i^(n) = g(W_{:,i}^T x^(n) + c_i)
I For the n-th instance, output layer: ŷ^(n) = w^T h^(n) + b
I W ∈ R^{2×2}, c ∈ R^{2×1}, w ∈ R^{2×1}, b ∈ R

Rectified Linear Unit (ReLU)


Non-linear activation function:

relu(z) := max{0, z}

Node:

f(z) := relu(Wz) = max{0, Wz} = (max{0, W_{k,·} z})_{k=1:K}

Figure 4: The ReLU activation, Source: Goodfellow et al., 2016


”Deus ex machina” solution?

Suppose I magically found out that:

     
W = (1 1; 1 1) (i.e., all entries equal to 1),   c = (0, −1)^T,   w = (1, −2)^T,   b = 0

We will later see an optimization technique called backpropagation for learning these network parameters.


XOR Solution - Hidden Layer Computations


   
h_1^(1) = g(W_{:,1}^T x^(1) + c_1) = g(1·0 + 1·0 + 0) = g(0) = 0
h_2^(1) = g(W_{:,2}^T x^(1) + c_2) = g(1·0 + 1·0 − 1) = g(−1) = 0

h_1^(2) = g(W_{:,1}^T x^(2) + c_1) = g(1·0 + 1·1 + 0) = g(1) = 1
h_2^(2) = g(W_{:,2}^T x^(2) + c_2) = g(1·0 + 1·1 − 1) = g(0) = 0

h_1^(3) = g(W_{:,1}^T x^(3) + c_1) = g(1·1 + 1·0 + 0) = g(1) = 1
h_2^(3) = g(W_{:,2}^T x^(3) + c_2) = g(1·1 + 1·0 − 1) = g(0) = 0

h_1^(4) = g(W_{:,1}^T x^(4) + c_1) = g(1·1 + 1·1 + 0) = g(2) = 2
h_2^(4) = g(W_{:,2}^T x^(4) + c_2) = g(1·1 + 1·1 − 1) = g(1) = 1

XOR Solution - Output Layer Computations

ŷ^(1) = w^T h^(1) + b = 1·0 − 2·0 + 0 = 0
ŷ^(2) = w^T h^(2) + b = 1·1 − 2·0 + 0 = 1
ŷ^(3) = w^T h^(3) + b = 1·1 − 2·0 + 0 = 1
ŷ^(4) = w^T h^(4) + b = 1·2 − 2·1 + 0 = 0

The computations of the final layer match exactly those of the XOR function.
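As a sanity check, this forward pass can be reproduced in a few lines of NumPy (a sketch using exactly the parameter values from the slides; the code itself is not part of the deck):

import numpy as np

# Parameters from the "deus ex machina" solution above.
W = np.array([[1., 1.],
              [1., 1.]])
c = np.array([0., -1.])
w = np.array([1., -2.])
b = 0.

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # the four XOR inputs, row-wise

H = np.maximum(0., X @ W + c)   # hidden layer: h = relu(W^T x + c) for each row x
y_hat = H @ w + b               # output layer: y = w^T h + b

print(y_hat)                    # [0. 1. 1. 0.], exactly the XOR targets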

3. Loss and Output Layer

Maximum Likelihood as Objective

I The loss can be expressed in probabilistic terms as

J(θ) = −E(x,y )∼pdata log pmodel (y | x)

I If our model outputs normal uncertainty:

pmodel (y | x) = N (y ; f (x; θ), σ 2 )


J(θ) = (1/2) E_{(x,y)∼p_data} (y − f(x; θ))^2 + const

I the model just outputs the mean f(x; θ); σ^2 is its error variance.
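For completeness, the standard step behind this identity, which the slide leaves implicit:

−log p_model(y | x) = −log N(y; f(x; θ), σ^2) = 1/(2σ^2) (y − f(x; θ))^2 + (1/2) log(2πσ^2)

Taking the expectation over (x, y) ∼ p_data and dropping the additive constant and the θ-independent factor 1/σ^2 (which does not change the minimizer) yields the squared-error objective above.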


Output Layer — Gaussian Output Distribution

I Affine transformation without nonlinearity


I Given features h, produces ŷ = w^T h + b
I the activation function is the identity a(h) := h

I Interpreted as mean of a conditional Gaussian distribution


I p(y | x) = N (y ; ŷ , σ 2 ), ŷ := f (x; θ)


Bernoulli Output Distributions

I Binary target variables follow a Bernoulli distribution


P(y = 1) = p, P(y = 0) = 1 − p

I Train a DFN such that ŷ = f (x; θ) ∈ [0, 1]


I Naive Option: clip a linear output layer:

I P(y = 1 | x) = max{0, min{1, w^T h + b}}

I What is the problem with the clipped linear output layer?


Bernoulli Output Distributions (2)

I Use a smooth sigmoid output unit:

ŷ = σ(z) = e^z / (e^z + 1),   z = w^T h + b

I The loss for a DFN f(x; θ) with a sigmoid output is:

J(θ) = Σ_{n=1}^N [ −y_n log(f(x_n; θ)) − (1 − y_n) log(1 − f(x_n; θ)) ]

I Also called the cross-entropy loss
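A minimal NumPy sketch (illustrative, not from the slides) of the sigmoid output and this cross-entropy loss; computing the loss directly from the logits z via np.logaddexp avoids overflow for large |z|:

import numpy as np

def sigmoid(z):
    # sigma(z) = e^z / (e^z + 1) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(z, y):
    # J = sum_n [ -y_n log sigma(z_n) - (1 - y_n) log(1 - sigma(z_n)) ],
    # rewritten in terms of the logits z = w^T h + b:
    #   -log sigma(z) = log(1 + e^(-z)),   -log(1 - sigma(z)) = log(1 + e^z)
    return np.sum(y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z))

z = np.array([-2.0, 0.5, 3.0])   # example logits
y = np.array([0.0, 1.0, 1.0])    # binary targets
print(sigmoid(z), binary_cross_entropy(z, y))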


Multinoulli Output Distribution


I For multi-category targets: ŷ_c = P(y = c | x), c ∈ {1, ..., C}

I last latent layer: unnormalized log probabilities:

z_c = log P̃(y = c | x) := w_c^T h + b

I yields probabilities:

P(y = c | x) := softmax(z)_c := e^{z_c} / Σ_d e^{z_d}

I Minimizing the log-likelihood loss:

J(θ) = Σ_{n=1}^N Σ_{c=1}^C −I(y_n = c) log P(y = c | x_n)
     = − Σ_{n=1}^N Σ_{c=1}^C I(y_n = c) ( z_c − log Σ_d e^{z_d} )
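The same, sketched for the softmax case in NumPy (illustrative, not from the slides); subtracting the row-wise maximum before exponentiating is the usual stabilization and leaves softmax(z) unchanged:

import numpy as np

def log_softmax(Z):
    # log softmax(z)_c = z_c - log sum_d e^(z_d), row-wise for a batch Z of shape (N, C)
    Z = Z - Z.max(axis=1, keepdims=True)                 # stabilization only
    return Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))

def cross_entropy(Z, y):
    # J = - sum_n log P(y = y_n | x_n), with y given as class indices
    return -log_softmax(Z)[np.arange(len(y)), y].sum()

Z = np.array([[2.0, 0.5, -1.0],    # unnormalized log probabilities (logits)
              [0.1, 0.2,  3.0]])
y = np.array([0, 2])               # true classes
print(cross_entropy(Z, y))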

4. Basic Feedforward Network Architecture

Types of Hidden Units

I Question: Can we use no activation function,i.e., only purely linear


layers h = W T x + b?

I Remember: the most commonly used hidden-layer activation is the ReLU:

h = relu(W^T x + b) = max{0, W^T x + b}

I Alternatively, the sigmoid function:

h = σ(z)

I or, the hyperbolic tangent:

h = tanh(z) = 2σ(2z) − 1
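The three activations side by side in NumPy (a tiny illustrative sketch; the last line checks the identity tanh(z) = 2σ(2z) − 1 numerically):

import numpy as np

relu = lambda z: np.maximum(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3.0, 3.0, 7)
print(relu(z))
print(sigmoid(z))
print(np.tanh(z))
print(np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0))   # True: tanh(z) = 2*sigma(2z) - 1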


Architecture of Hidden Layers

A DFN with L hidden layers:


h^(1) = g^(1)((W^(1))^T x + b^(1))
h^(2) = g^(2)((W^(2))^T h^(1) + b^(2))
...
h^(L) = g^(L)((W^(L))^T h^(L−1) + b^(L))

Different layers can have different activation functions.
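A compact NumPy sketch of this stacked forward pass (illustrative; the layer widths, the random initialization, and the choice of activations are my own assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)

# Three hidden layers of widths 5, 4, 3 on a 2-dimensional input.
sizes = [2, 5, 4, 3]
activations = [relu, relu, np.tanh]                    # g^(1), g^(2), g^(3)
params = [(rng.normal(size=(m, n)), np.zeros(n))       # (W^(l), b^(l)), W^(l) of shape (m, n)
          for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, params, activations):
    h = x
    for (W, b), g in zip(params, activations):
        h = g(h @ W + b)   # h^(l) = g^(l)((W^(l))^T h^(l-1) + b^(l)), with h as a row vector
    return h

print(forward(rng.normal(size=2), params, activations))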

5. Backpropagation

Computational Graphs

[Two computational graphs: left, z = dot(x, w); right, z_1 = dot(x, w) followed by z_2 = z_1 + b]

z = x^T w

z_2 = z_1 + b = x^T w + b


Computational Graphs

[Two computational graphs: left, z_3 = relu(dot(x, w) + b); right, z_6 = relu(dot(w_2, relu(dot(x, w_1) + b_1)) + b_2)]

z_3 = relu(x^T w + b)

z_6 = relu(w_2^T relu(x^T w_1 + b_1) + b_2)

Forward Computation
I computational graph (Z , E ), a DAG.
I Z a set of node IDs.
I E ⊆ Z × Z a set of directed edges.
For every node z ∈ Z :
I T_z: the domain of the node (e.g., R^17)

and additionally, for every non-root node z ∈ Z:

I f_z : Π_{z′ ∈ fanin(z)} T_{z′} → T_z   (the node operation)

I forward computation:
  Given values v_z ∈ T_z for all root nodes z ∈ Z,
  compute a value for every node z ∈ Z via

  v_z := f_z((v_{z′})_{z′ ∈ fanin(z)}) = f_z(v_{fanin(z)}),   writing v_{fanin(z)} := (v_{z′})_{z′ ∈ fanin(z)}

Note: fanin_{(Z,E)}(z) := {z′ ∈ Z | (z′, z) ∈ E}, the nodes with edges into z.

Forward Computation / Example


[Computational graph: z_1 = dot(x, w), z_2 = z_1 + b, z_3 = relu(z_2)]

I types for each node:
T_x := R^2,  T_w := R^2,  T_b := R,  T_{z_1} = T_{z_2} = T_{z_3} := R

I functions for each non-root node:
f_1(x, w) := x^T w,   f_2(z_1, b) := z_1 + b,   f_3(z_2) := relu(z_2)

I given values for all root nodes:
x = (2, 1)^T,   w = (1, −1)^T,   b = 0.5

I compute values for all non-root nodes:
z_3 = relu(x^T w + b)
z_1 = 1,   z_2 = 1.5,   z_3 = 1.5

Forward Computation / Algorithm


cg-forward((Z, E, f), (v_z)_{z ∈ roots(Z,E)}):
    for z ∈ Z \ roots(Z, E) in topological order:
        v_z := f_z((v_{z′})_{z′ ∈ fanin(z)})
    return (v_z)_{z ∈ Z}

topological-order(Z, E):
    x := ()
    while Z ≠ ∅:
        choose z ∈ roots(Z, E) arbitrarily
        delete z from the graph (Z, E)
        x := x _ (z)
    return x

Note: x _ y denotes the concatenation of two lists.
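A runnable Python sketch of these two routines (my own translation of the pseudocode, under the assumption that the graph is stored as a dict of fan-in lists per node):

def topological_order(fanin):
    # repeatedly pick a node whose parents have all been ordered already
    order, remaining = [], dict(fanin)
    while remaining:
        z = next(n for n, parents in remaining.items()
                 if all(p not in remaining for p in parents))
        del remaining[z]
        order.append(z)
    return order

def cg_forward(fanin, f, values):
    # values: given value v_z for every root node z
    v = dict(values)
    for z in topological_order(fanin):
        if fanin[z]:                                    # non-root node
            v[z] = f[z](*[v[p] for p in fanin[z]])      # v_z := f_z(v_fanin(z))
    return v

# Example: the graph z1 = dot(x, w), z2 = z1 + b, z3 = relu(z2) from the slides.
import numpy as np
fanin = {"x": [], "w": [], "b": [],
         "z1": ["x", "w"], "z2": ["z1", "b"], "z3": ["z2"]}
f = {"z1": lambda x, w: x @ w,
     "z2": lambda z1, b: z1 + b,
     "z3": lambda z2: max(0.0, z2)}
v = cg_forward(fanin, f, {"x": np.array([2.0, 1.0]), "w": np.array([1.0, -1.0]), "b": 0.5})
print(v["z1"], v["z2"], v["z3"])   # 1.0 1.5 1.5, as in the example above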



Gradients in Computational Graphs (1/2)

I let's assume every operation is differentiable and that we have, for every non-root node z, its gradient functions ∇_{z′} z = ∂z/∂z′:

g_{z,z′} : Π_{z̃′ ∈ fanin(z)} T_{z̃′} → T_z × T_{z′},   z′ ∈ fanin(z)

I for any node we can then compute its gradient values w.r.t. its inputs:

w_{z,z′} = g_{z,z′}((v_{z̃′})_{z̃′ ∈ fanin(z)})


Gradient Computation / Example


[Computational graph: z_1 = dot(x, w), z_2 = z_1 + b, z_3 = relu(z_2)]

I functions for each non-root node:
f_1(x, w) := x^T w,   f_2(z_1, b) := z_1 + b,   f_3(z_2) := relu(z_2)

I gradient functions for each non-root node:
g_{1,x}(x, w) := w,   g_{1,w}(x, w) := x,
g_{2,z_1}(z_1, b) := 1,   g_{2,b}(z_1, b) := 1,
g_{3,z_2}(z_2) := I(z_2 ≥ 0)

I given values for all root nodes:
x = (2, 1)^T,   w = (1, −1)^T,   b = 0.5

I compute gradient values for all neighbors:
∇_{z_2} z_3 = 1,   ∇_{z_1} z_2 = 1,   ∇_b z_2 = 1,
∇_x z_1 = (1, −1)^T,   ∇_w z_1 = (2, 1)^T

z_3 = relu(x^T w + b)

Gradients in Computational Graphs (2/2)

I for any subgraph z″ → z′_1, z′_2, ..., z′_K → z, by the chain rule:

∇_{z″} z = Σ_{k=1}^K ∇_{z′_k} z · ∇_{z″} z′_k = Σ_{k=1}^K (∂z/∂z′_k)(∂z′_k/∂z″)

w_{z,z″} = Σ_{k=1}^K w_{z,z′_k} g_{z′_k,z″}(v_{fanin(z′_k)})
         = Σ_{k=1}^K Σ_{i ∈ dim T_{z′_k}} (w_{z,z′_k})_{·,i} (g_{z′_k,z″}(v_{fanin(z′_k)}))_{i,·}   ∈ T_z × T_{z″}

I this way, gradients between any two nodes in a computational graph can be computed automatically.


Gradient Computation / Example


I gradient values for all neighbors:
∇_{z_2} z_3 = 1,   ∇_{z_1} z_2 = 1,   ∇_b z_2 = 1,
∇_x z_1 = (1, −1)^T,   ∇_w z_1 = (2, 1)^T

I gradient values for all node pairs:
∇_{z_1} z_3 = 1,   ∇_b z_3 = 1, and either

A.   ∇_x z_3 = (2, 1)^T,   ∇_w z_3 = (1, −1)^T
B.   ∇_x z_3 = (1, −1)^T,   ∇_w z_3 = (2, 1)^T ?

∇_x z_2 = (1, −1)^T,   ∇_w z_2 = (2, 1)^T

[Computational graph: z_3 = relu(x^T w + b)]

Gradient Computation / Example

I gradient values for all neighbors:
∇_{z_2} z_3 = 1,   ∇_{z_1} z_2 = 1,   ∇_b z_2 = 1,
∇_x z_1 = (1, −1)^T,   ∇_w z_1 = (2, 1)^T

I gradient values for all node pairs:
∇_{z_1} z_3 = 1,   ∇_b z_3 = 1,
∇_x z_3 = (1, −1)^T,   ∇_w z_3 = (2, 1)^T,
∇_x z_2 = (1, −1)^T,   ∇_w z_2 = (2, 1)^T

[Computational graph: z_3 = relu(x^T w + b)]


Gradient Computation / Algorithm / Single Leaf


cg-gradient((Z, E, f, g), (v_z)_{z ∈ roots(Z,E)}):
    v := cg-forward((Z, E, f), (v_z)_{z ∈ roots(Z,E)})
    z := the single leaf in (Z, E)
    w_{z,z} := 1
    for z″ ∈ Z \ {z} in reverse topological order:
        w_{z,z″} := 0
        for z′ ∈ fanout(z″):
            w_{z′,z″} := g_{z′,z″}((v_{z̃″})_{z̃″ ∈ fanin(z′)})
            w_{z,z″} := w_{z,z″} + w_{z,z′} w_{z′,z″}
    return (w_{z,z′})_{z′ ∈ roots(Z,E)}

I compute gradients ∇_{z′} z for the single leaf node z and all root nodes z′

I take the subgraph on ancestors(z) ∩ descendants(Z_in) to compute all gradients ∇_{z′} z for nodes z′ ∈ Z_in
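A Python sketch of this backward pass, reusing cg_forward and topological_order from the sketch above (again my own translation; for simplicity all gradient values here are plain scalars or vectors, which suffices for the running example with the scalar leaf z3):

import numpy as np

fanin = {"x": [], "w": [], "b": [],
         "z1": ["x", "w"], "z2": ["z1", "b"], "z3": ["z2"]}
f = {"z1": lambda x, w: x @ w,
     "z2": lambda z1, b: z1 + b,
     "z3": lambda z2: max(0.0, z2)}
g = {("z1", "x"): lambda x, w: w,      ("z1", "w"): lambda x, w: x,
     ("z2", "z1"): lambda z1, b: 1.0,  ("z2", "b"): lambda z1, b: 1.0,
     ("z3", "z2"): lambda z2: float(z2 >= 0)}           # manually specified gradients

def cg_gradient(fanin, f, g, values, leaf):
    v = cg_forward(fanin, f, values)                    # forward pass first
    fanout = {z: [c for c, ps in fanin.items() if z in ps] for z in fanin}
    grad = {leaf: 1.0}                                  # w_{z,z} := 1
    for z2 in reversed(topological_order(fanin)):       # reverse topological order
        if z2 == leaf:
            continue
        grad[z2] = 0.0
        for z1 in fanout[z2]:                           # w_{z,z''} += w_{z,z'} * g_{z',z''}
            grad[z2] = grad[z2] + grad[z1] * g[(z1, z2)](*[v[p] for p in fanin[z1]])
    return grad

values = {"x": np.array([2.0, 1.0]), "w": np.array([1.0, -1.0]), "b": 0.5}
grads = cg_gradient(fanin, f, g, values, leaf="z3")
print(grads["x"], grads["w"], grads["b"])   # [ 1. -1.] [2. 1.] 1.0, matching the example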

Gradient Computation

I automatic gradient computation in computational graphs combines
  I manually specified gradients of elementary functions,
  I the chain rule, and
  I a useful arrangement of computations:
    I forward function computations
    I backward gradient computations
    I computing each neighbor gradient only once (for multiple leaf nodes)

I for feedforward neural networks this is called backpropagation

I the algorithm can be formulated without graph terminology


Summary
I Feedforward neural networks are models for supervised prediction
(regression, classification).

I They consist of L stacked layers, each of the form

z_ℓ(z_{ℓ−1}) := a(W_ℓ^T z_{ℓ−1})

consisting of
  I a linear combination of the previous layer values with parameters W_ℓ
  I a nonlinear function (activation function),
    often just the rectified linear unit relu(z) := max{0, z}

I The output layer contains an activation function that reflects the target type / output distribution:
  I identity: for continuous targets (with normally distributed uncertainty)
  I logistic function: for a Bernoulli probability (binary classification)
  I softmax function: for a multinoulli probability (multi-class classification)

Summary (2/2)

I Any loss can be used, especially the negative log-likelihood.
  I for classification problems this is called the cross-entropy

I To learn a feedforward neural network, gradient-descent type algorithms can be used (especially stochastic gradient descent).

I Gradients of neural networks, and more generally of any computational graph, can be computed automatically:
  I backpropagation


Further Readings
I Goodfellow et al. 2016, ch. 6

I Zhang et al. 2020, ch. 2.5, 3–5

I lecture Machine Learning, chapter B.2

Acknowledgement: An earlier version of the slides for this lecture was written by my former postdoc Dr Josif Grabocka.

References
Charu C. Aggarwal. Neural Networks and Deep Learning: A Textbook. Springer International Publishing, 2018. ISBN 978-3-319-94462-3. doi: 10.1007/978-3-319-94463-0.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. The MIT Press, Cambridge, Massachusetts, November 2016. ISBN 978-0-262-03561-3.
Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander Smola. Dive into Deep Learning. https://d2l.ai/, 2020.
