
Algorithmic

Intelligence Laboratory

Introduction to Neural Networks:


DNN / CNN / RNN

EE807: Recent Advances in Deep Learning


Lecture 1

Slides made by Hyungwon Choi and Yunhun Jang
KAIST EE

Algorithmic Intelligence Laboratory


What is Machine/Deep Learning?

• Human Learning

• Machine Learning = Build an algorithm from data


• Deep learning is a special class of algorithms within machine learning

[Figure: learning perceptions, learning interactions]

Algorithmic Intelligence Laboratory 2
Definition of Deep Learning

• An algorithm that learns multiple levels of abstractions in data

• Deep & large networks trained on lots of data learn multi-layer data representations (a feature hierarchy): edges → parts → objects
Algorithmic Intelligence Laboratory 3
Deep Learning = Feature Learning

• Why does deep learning outperform other machine learning (ML) approaches for vision, speech, and language?

• Traditional pipeline: Input → hand-crafted feature extraction (e.g., SIFT) → other ML → Output
• Deep learning: Input → Deep Network → Output (the features are learned)

Algorithmic Intelligence Laboratory 4


Table of Contents

1. Deep Neural Networks (DNN)


• Basics
• Training : Back propagation

2. Convolutional Neural Networks (CNN)


• Basics
• Convolution and pooling
• Some applications

3. Recurrent Neural Networks (RNN)


• Basics
• Character-level language model (example)

4. Question
• Why is it difficult to train a deep neural network?

Algorithmic Intelligence Laboratory 5


Table of Contents

1. Deep Neural Networks (DNN)


• Basics
• Training : Back propagation

2. Convolutional Neural Networks (CNN)


• Basics
• Convolution and pooling
• Some applications

3. Recurrent Neural Networks (RNN)


• Basics
• Character-level language model (example)

4. Question
• Why is it difficult to train a deep neural network?

Algorithmic Intelligence Laboratory 6


DNN: Neurons in the Brain

• The human brain is made up of about 100 billion neurons


• Neurons receive electric signals at the dendrites and send them to the axon
• Dendrites can perform complex non-linear computations
• Synapses are not a single weight but a complex non-linear dynamical system

Algorithmic Intelligence Laboratory *source : https://pt.slideshare.net/hammawan/deep-neural-networks 7


DNN: Artificial Neural Networks

• Artificial neural networks


• A simplified version of biological neural network

[Diagram: artificial neuron with inputs, weights, summation, bias, a nonlinear activation function, and the output (activation) of the neuron, i.e., output = f(Σ_i w_i x_i + b)]

Algorithmic Intelligence Laboratory 8
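As a rough illustration (not from the slides), the computation a single artificial neuron performs can be sketched in a few lines of Python; the weights, bias, and sigmoid choice below are illustrative assumptions.

```python
import numpy as np

def neuron(x, w, b):
    """Single artificial neuron: weighted sum of inputs plus bias,
    passed through a nonlinear activation (sigmoid here)."""
    z = np.dot(w, x) + b             # summation of weighted inputs + bias
    return 1.0 / (1.0 + np.exp(-z))  # nonlinear activation (sigmoid)

x = np.array([1.0, -0.5])            # inputs (illustrative)
w = np.array([0.8, 0.2])             # weights (illustrative)
b = 0.1                              # bias (illustrative)
print(neuron(x, w, b))               # output / activation of the neuron
```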


DNN: The Brain vs. Artificial Neural Networks

• Similarities
• Consists of neurons & connections between neurons
• Learning process = Update of connections
• Massive parallel processing

• Differences
• Computation within a neuron is vastly simplified
• Discrete time steps
• Typically some form of supervised learning with a massive number of stimuli

Algorithmic Intelligence Laboratory *source : http://mt-class.org/jhu/slides/lecture-nn-intro.pdf 9


DNN: Basics

• Deep neural networks


• Neural network with more than 2 layers
• Can model more complex functions

[Diagram: a single neuron (inputs, weights, summation, bias, nonlinear activation function) next to a network with an input layer, one hidden layer, and an output layer: a "2-layer Neural Net", also called a "1-hidden-layer Neural Net"]

Algorithmic Intelligence Laboratory 10


DNN: Notation

• Training dataset {(x^(i), y^(i))}, i = 1, ..., N
  • x: input data
  • y: target data (or label for classification)

• Neural network f_θ parameterized by θ

Next, forward propagation


Algorithmic Intelligence Laboratory 11
DNN: Forward Propagation

• Forward propagation: calculate the output of the neural network layer by layer:

  h^(l) = σ(W^(l) h^(l-1) + b^(l)), l = 1, ..., L, with h^(0) = x and output ŷ = h^(L)

  where σ is the activation function (e.g., the sigmoid function) and L is the number of layers

Algorithmic Intelligence Laboratory 12
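A minimal Python sketch of forward propagation under the formulation above (sigmoid activation, layer-by-layer matrix multiplies); the layer sizes and random weights are illustrative assumptions, not the slides' values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Layer-by-layer forward propagation: h^(l) = sigma(W^(l) h^(l-1) + b^(l))."""
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)
    return h

# Illustrative 2-layer network: 2 inputs -> 3 hidden units -> 1 output
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases  = [np.zeros(3), np.zeros(1)]
print(forward(np.array([1.0, -0.5]), weights, biases))
```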


DNN: Forward Propagation (Example)

• Input data: two input units with values 1.0 and -0.5
Algorithmic Intelligence Laboratory 14


DNN: Forward Propagation (Example)

• Compute the hidden units from the inputs (1.0, -0.5): hidden activations 0.79, 0.92, 0.16,
  where each hidden unit applies the sigmoid σ(z) = 1/(1 + e^(-z)) to its weighted input sum

Algorithmic Intelligence Laboratory 15


DNN: Forward Propagation (Example)

• Compute the output from the hidden units (0.79, 0.92, 0.16): output value 0.62

Next, training objective


Algorithmic Intelligence Laboratory 16
DNN: Objective

• Objective: find a parameter θ that minimizes the error (or empirical risk):

  θ* = argmin_θ (1/N) Σ_i ℓ(f_θ(x^(i)), y^(i))

  where ℓ is a loss function, e.g., MSE (mean squared error) or cross-entropy

Next, how to optimize θ?


Algorithmic Intelligence Laboratory 17
DNN: Training

• Gradient descent (GD) updates the parameters iteratively in the negative gradient direction:

  θ ← θ - η ∇_θ L(θ)

  where θ are the parameters, L is the loss function, and η is the learning rate

• Backpropagation
  • First adjust the last-layer weights
  • Propagate the error back to each previous layer
  • Adjust the previous layer's weights

Next, backpropagation in detail


Algorithmic Intelligence Laboratory 18
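A minimal sketch of the gradient descent update θ ← θ - η ∇L(θ) on a toy one-parameter loss; the loss function and learning rate here are illustrative, not from the slides.

```python
# Gradient descent on a toy loss L(theta) = (theta - 3)^2,
# whose gradient is dL/dtheta = 2 * (theta - 3).
theta = 0.0          # parameter
lr = 0.1             # learning rate
for step in range(50):
    grad = 2.0 * (theta - 3.0)   # gradient of the loss at the current parameter
    theta = theta - lr * grad    # update in the negative gradient direction
print(theta)  # approaches the minimizer 3.0
```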
DNN: Backpropagation

• Consider a training input x with target y

• Forward propagation to compute the output ŷ
  • l-th layer intermediate output h^(l) (kept for the backward pass)

Algorithmic Intelligence Laboratory 19


DNN: Backpropagation

• Consider a training input x with target y

• Forward propagation to compute the output ŷ
  • l-th layer intermediate output h^(l)
• Compute the error E = ½ (y - ŷ)² (where the loss is the MSE loss)

Algorithmic Intelligence Laboratory 20


DNN: Backpropagation

• Consider a training input x with target y

• Forward propagation to compute the output ŷ
  • l-th layer intermediate output h^(l)
• Compute the error E = ½ (y - ŷ)² (where the loss is the MSE loss)

• Compute the derivative of E with respect to the last-layer weights via the chain rule

Algorithmic Intelligence Laboratory 21


DNN: Backpropagation

• Consider a training input x with target y

• Forward propagation to compute the output ŷ
  • l-th layer intermediate output h^(l)
• Compute the error E = ½ (y - ŷ)² (where the loss is the MSE loss)

• Compute the derivative of E with respect to the last-layer weights via the chain rule

• Parameter update rule: subtract the gradient scaled by the learning rate η

Algorithmic Intelligence Laboratory 23


DNN: Backpropagation

• Consider a training input x with target y

• Forward propagation to compute the output ŷ
  • l-th layer intermediate output h^(l)
• Compute the error E = ½ (y - ŷ)² (where the loss is the MSE loss)

• Similarly, we can compute the gradients with respect to the earlier-layer weights by propagating the error backward

• And update them using the same update rule

Algorithmic Intelligence Laboratory 25


DNN: Backpropagation (Example)

• Compute the error between the network output (0.62) and the target
  [same example network: inputs 1.0, -0.5; hidden units 0.79, 0.92, 0.16; output 0.62]

• Compute the gradient of the error with respect to the output-layer weights

Algorithmic Intelligence Laboratory 26


DNN: Backpropagation (Example)

• Compute the gradient of the error with respect to the output-layer weights
  [same example network: inputs 1.0, -0.5; hidden units 0.79, 0.92, 0.16; output 0.62]

• Update those weights with the gradient descent rule

Algorithmic Intelligence Laboratory 27


DNN: Backpropagation (Example)

• Compute the gradient of the error with respect to the output-layer weights
  [same example network: inputs 1.0, -0.5; hidden units 0.79, 0.92, 0.16; output 0.62]

• Update those weights with the gradient descent rule

• Similarly, we can update the remaining (hidden-layer) weights

Algorithmic Intelligence Laboratory 28
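Putting the steps above together, here is a hedged sketch of training a tiny 2-3-1 network with sigmoid activations and MSE loss, mirroring the structure (not the exact numbers) of the slides' running example; the initial weights, learning rate, and target are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # hidden-layer weights / bias
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # output-layer weights / bias
x, y = np.array([1.0, -0.5]), np.array([1.0])   # one training example (illustrative target)
lr = 0.5

for step in range(100):
    # forward propagation
    h = sigmoid(W1 @ x + b1)                    # hidden activations
    yhat = sigmoid(W2 @ h + b2)                 # network output
    loss = 0.5 * np.sum((yhat - y) ** 2)        # MSE loss

    # backpropagation (chain rule, output layer first, then the hidden layer)
    delta2 = (yhat - y) * yhat * (1 - yhat)     # dLoss/dz2
    dW2, db2 = np.outer(delta2, h), delta2
    delta1 = (W2.T @ delta2) * h * (1 - h)      # error propagated back: dLoss/dz1
    dW1, db1 = np.outer(delta1, x), delta1

    # gradient descent updates
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print(loss)  # decreases toward 0 as the output approaches the target
```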


Table of Contents

1. Deep Neural Networks (DNN)


• Basics
• Training : Back propagation

2. Convolutional Neural Networks (CNN)


• Basics
• Convolution and Pooling
• Some applications

3. Recurrent Neural Networks (RNN)


• Basics
• Character-level language model (example)

4. Question
• Why is it difficult to train a deep neural network?

Algorithmic Intelligence Laboratory 29


CNN: Drawbacks of Fully-Connected DNN

• Previous DNNs use fully-connected layers


• Connect all the neurons between the layers

• Drawbacks
• (-) Large number of parameters
• Prone to over-fitting
• Large memory consumption

• (-) Does not enforce any structure, e.g., local information


• In many applications, local features are important, e.g., images, language, etc.

Algorithmic Intelligence Laboratory 30


CNN: Basics

• Weight sharing and local connectivity (convolution)


• Use multiple filters that convolve over the inputs
• (+) Reduce the number of parameters (less over-fitting)
• (+) Learn local features
• (+) Translation invariance

• Pooling (or subsampling)


• Make the representations smaller
• (+) Reduce number of parameters and computation

Algorithmic Intelligence Laboratory *source : http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.121.1794&rep=rep1&type=pdf 30


CNN: Weight Sharing and Translation Invariance

• Weight sharing
• Apply same weights over the different spatial regions
• One can achieve translation invariance (not perfect though)

Algorithmic Intelligence Laboratory *source : https://www.cc.gatech.edu/~san37/post/dlhc-cnn/ 32


CNN: Weight Sharing and Translation Invariance

• Weight sharing
• Apply same weights over the different spatial regions
• One can achieve translation invariance

• Translation invariance
• When input is changed spatially (translated or shifted), the corresponding output
to recognize the object should not be changed
• CNN can produce the same output even though the input image is shifted due to
weight sharing

Algorithmic Intelligence Laboratory *source : https://www.cc.gatech.edu/~san37/post/dlhc-cnn/ 33


CNN: Convolution
Fully-connected layer
• 32×32×3 image → stretch to a 3072×1 vector
• Input (3072×1), weight matrix (10×3072), activation (10×1)
• Each activation is the result of taking a dot product between a row of the weight matrix and the input

Convolution layer
• 32×32×3 image, 5×5×3 filter (equivalent to 1×75 weights of an FC layer)
• Convolve the filter with the image, i.e., "slide over the image spatially, computing dot products"

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 34


CNN: Convolution
• Each output value of the convolution layer is the result of taking a dot product between the 5×5×3 filter and a small 5×5×3 chunk of the 32×32×3 image (i.e., a 5×5×3 = 75-dimensional dot product + bias)

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 35


CNN: Convolution
• Convolving (sliding) the 5×5×3 filter over all spatial locations of the 32×32×3 image produces a 28×28×1 activation map

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 36


CNN: Convolution
• If there are four 5×5×3 filters, we get 4 separate activation maps: convolving (sliding) each filter over all spatial locations yields a 28×28×4 output volume

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 37
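A naive (unoptimized) sketch of the convolution described above, assuming "valid" convolution with no padding; it reproduces the 32×32×3 input with a 5×5×3 filter → 28×28 activation map arithmetic. The function name and random data are illustrative.

```python
import numpy as np

def conv2d(image, filt, stride=1):
    """Naive valid convolution: slide the filter over the image spatially,
    taking a dot product at each location (no padding)."""
    H, W, C = image.shape
    k = filt.shape[0]
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+k, j*stride:j*stride+k, :]
            out[i, j] = np.sum(patch * filt)   # 5*5*3 = 75-dimensional dot product
    return out

image = np.random.rand(32, 32, 3)   # 32x32x3 input
filt  = np.random.rand(5, 5, 3)     # one 5x5x3 filter
print(conv2d(image, filt).shape)    # (28, 28) activation map
```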


CNN: An Example

• Closer look at spatial dimensions

• 7×7 input (spatially)
• Assume a 3×3 filter, applied with stride 1

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 38


CNN: An Example

• Closer look at spatial dimensions

• 7×7 input (spatially)
• A 3×3 filter applied with stride 1 → 5×5 output

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 42


CNN: An Example

• Closer look at spatial dimensions

• 7×7 input (spatially)
• Assume a 3×3 filter, applied with stride 2

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 43


CNN: An Example

• Closer look at spatial dimensions

• 7×7 input (spatially)
• A 3×3 filter applied with stride 2 → 3×3 output

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 45


CNN: An Example

• Closer look at spatial dimensions

• 7×7 input (spatially)
• A 3×3 filter applied with stride 3? Doesn't fit!
• A 3×3 filter cannot be applied to a 7×7 input with stride 3

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 46


CNN: An Example

• In practice: it is common to zero-pad the border

  • Padding is used to control the output size

• Example: 7×7 input (spatially), zero-padded with a 1-pixel border (9×9 after padding), 3×3 filter applied with stride 3 → 3×3 output

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 47


CNN: An Example (Animation)

No padding, stride 1 Padding 1, stride 1

Padding 1, stride 2 (odd)

No padding, stride 2 Padding 1, stride 2

Algorithmic Intelligence Laboratory *source : https://github.com/vdumoulin/conv_arithmetic 48


CNN: An Example

• Input volume: 32×32×3
• 10 5×5 filters with stride 1, pad 2
• Output volume size = ?

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 49


CNN: An Example

• Input volume: 32×32×3
• 10 5×5 filters with stride 1, pad 2
• Output volume size: (32 + 2×2 - 5)/1 + 1 = 32 spatially ⇒ 32×32×10

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 50


CNN: An Example

• Input volume: 32×32×3
• 10 5×5 filters with stride 1, pad 2
• Number of parameters in this layer?

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 51


CNN: An Example

• Input volume: 32×32×3
• 10 5×5 filters with stride 1, pad 2
• Number of parameters: each filter has 5×5×3 + 1 = 76 params (+1 for the bias) ⇒ 76×10 = 760

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 52
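A small sketch of the output-size and parameter-count arithmetic used in these examples, assuming the standard formula (W + 2P - F)/S + 1; the helper names are made up for illustration.

```python
def conv_output_size(input_size, filter_size, stride, pad):
    # Spatial output size: (W + 2P - F) / S + 1
    return (input_size + 2 * pad - filter_size) // stride + 1

def conv_param_count(filter_size, in_channels, num_filters):
    # Each filter: F*F*C weights + 1 bias
    return (filter_size * filter_size * in_channels + 1) * num_filters

print(conv_output_size(32, 5, 1, 2))   # 32  (so the output volume is 32x32x10)
print(conv_param_count(5, 3, 10))      # 760
```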


CNN: Convolution

• A ConvNet is a sequence of convolutional layers, each followed by a non-linearity
  • e.g., a 32×32×3 image → Conv (four 5×5×3 filters) + ReLU → 28×28×4 → Conv (six 5×5×4 filters) + ReLU → 24×24×6 → Conv (ten 5×5×6 filters) + ReLU → 20×20×10

• Choices of non-linearity
  • Tanh / Sigmoid
  • ReLU [Nair et al., 2010]
  • Leaky ReLU [Maas et al., 2013]

*reference: http://cs231n.stanford.edu/2017/
Algorithmic Intelligence Laboratory *Image source: https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6 53
CNN: Pooling

• Pooling layer
• Makes the representations smaller and more manageable
• Operates over each activation map independently
• Enhances translation invariance (invariance to small transformations)
• Larger receptive fields (see more of input)
• Regularization effect

• Example: pooling downsamples a 224×224×64 volume to 112×112×64
Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 54
CNN: Pooling

• Max pooling and average pooling
  • e.g., with 2×2 filters and stride 2

• Other kinds of pooling layers are also used
  • e.g., stochastic pooling, ROI pooling

*source:
https://deepsense.ai/region-of-interest-pooling-explained/
http://mlss.tuebingen.mpg.de/2015/slides/fergus/Fergus_1.pdf
Algorithmic Intelligence Laboratory https://vaaaaaanquish.hatenablog.com/entry/2015/01/26/060622 55
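A minimal sketch of 2×2 max/average pooling over a single activation map; the input values and function name are illustrative.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """2x2 max or average pooling over a single activation map."""
    H, W = x.shape
    out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

x = np.array([[1., 1., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]])
print(pool2d(x, mode="max"))   # [[6. 8.] [3. 4.]]  (each 2x2 block reduced to its max)
```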
CNN: Visualization

• Visualization of CNN feature representations [Zeiler et al., 2014]


• VGG-16 [Simonyan et al., 2015]

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 56


CNN in Computer Vision: Everywhere

Classification and retrieval [Krizhevsky et al., 2012]

Algorithmic Intelligence Laboratory 57


CNN in Computer Vision: Everywhere

Detection [Ren et al., 2015] Segmentation [Farabet et al., 2013]

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 58


CNN in Computer Vision: Everywhere
Self-driving cars Human pose estimation [Cao et al., 2017]

Image captioning [Vinyals et al., 2015][Karpathy et al., 2015]

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 59


Table of Contents

1. Deep Neural Networks (DNN)


• Basics
• Training : Back propagation

2. Convolutional Neural Networks (CNN)


• Basics
• Convolution and Pooling
• Some applications

3. Recurrent Neural Networks (RNN)


• Basics
• Character-level language model (example)

4. Question
• Why is it difficult to train a deep neural network ?

Algorithmic Intelligence Laboratory 60


RNN: Basics

• A CNN models spatial information (with some translation invariance)

• Recurrent Neural Network (RNN)

  • Models temporal information
  • The hidden state is a function of the current input and the previous time step's information

• Temporal information is important in many applications


• Language
• Speech
• Video

Algorithmic Intelligence Laboratory 61


RNN: Basics

• Process a sequence of vectors x by applying a recurrence formula at every time step t:

  h_t = f_W(h_{t-1}, x_t)

  where h_t is the new state, h_{t-1} is the old state, x_t is the input vector at time step t, and f_W is a function parameterized by W (e.g., a DNN or CNN)

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 62


RNN: Basics

• Process a sequence of vectors x by applying the recurrence formula h_t = f_W(h_{t-1}, x_t) at every time step t

• The same function f_W and the same set of parameters W are used at every time step

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 63


RNN: Vanilla RNN

• Simple RNN
  • The state consists of a single "hidden" vector h_t
  • Vanilla RNN (sometimes called the Elman RNN): h_t = tanh(W_hh h_{t-1} + W_xh x_t), with output y_t = W_hy h_t

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 64
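A minimal sketch of the vanilla RNN recurrence above; the hidden size, input size, and random weights are illustrative assumptions. The point is that the same weights are reused at every time step.

```python
import numpy as np

def rnn_step(h_prev, x, Wxh, Whh, b):
    """Vanilla (Elman) RNN recurrence: h_t = tanh(Whh h_{t-1} + Wxh x_t + b)."""
    return np.tanh(Whh @ h_prev + Wxh @ x + b)

rng = np.random.default_rng(0)
hidden, inp = 3, 4
Wxh = rng.normal(size=(hidden, inp))
Whh = rng.normal(size=(hidden, hidden))
b = np.zeros(hidden)

h = np.zeros(hidden)
for x in rng.normal(size=(5, inp)):   # a sequence of 5 input vectors
    h = rnn_step(h, x, Wxh, Whh, b)   # same weight matrices reused at every time step
print(h)
```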


RNN: Computation Graph

Re-use the same weight matrix at every time step

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 65


RNN: Computation Graph (Many to Many)

e.g., Machine Translation


(Sequence of words → sequence of words)

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 66


RNN: Computation Graph (Many to One)

e.g., Sentiment Classification


(Sequence of words → sentiment, e.g., good paper or not?)

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 67


RNN: Computation Graph (One to Many)

e.g., Image Captioning


(Image → sequence of words)

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 68


RNN: An Example

• Character-level language model

• Vocabulary: [h, e, l, o]

• Example training sequence: "hello"

• The input chars "h", "e", "l", "l" are one-hot encoded at the input layer; the hidden layer produces a hidden state vector at every time step

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 69


RNN: An Example

• Character-level language model

• Vocabulary: [h, e, l, o]

• Example training sequence: "hello"

• At every time step the output layer produces a score for each character in the vocabulary; for the input chars "h", "e", "l", "l" the target chars are "e", "l", "l", "o"

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 70


RNN: An Example

• Character-level language model

• Vocabulary: [h, e, l, o]

• At test time, the output scores are passed through a softmax to get a probability distribution over characters; sample one character at a time and feed it back into the model as the next input (samples here: "e", "l", "l", "o")

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 71
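A hedged sketch of the test-time sampling loop for a character-level model over [h, e, l, o]. The weights here are random and untrained (so the output will not spell "hello"); the point is the one-hot input → hidden state → softmax → sample → feed-back loop.

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
rng = np.random.default_rng(0)
hidden = 3
Wxh = rng.normal(size=(hidden, 4))      # input-to-hidden weights (illustrative, untrained)
Whh = rng.normal(size=(hidden, hidden)) # hidden-to-hidden weights
Why = rng.normal(size=(4, hidden))      # hidden-to-output weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

h, idx, out = np.zeros(hidden), 0, ['h']   # start from the character "h"
for _ in range(4):
    x = np.eye(4)[idx]                     # one-hot encode the current character
    h = np.tanh(Wxh @ x + Whh @ h)         # hidden layer (vanilla RNN step)
    p = softmax(Why @ h)                   # softmax over output scores
    idx = rng.choice(4, p=p)               # sample the next character
    out.append(vocab[idx])                 # feed the sampled character back as input
print(''.join(out))
```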


RNN: Backpropagation Through Time (BPTT)

• Backpropagation through time (BPTT)


• Forward through the entire sequence to compute the loss, then backward through the
entire sequence to compute the gradients

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 72


Contents

1. Deep Neural Networks (DNN)


• Basics
• Training : Back propagation

2. Convolutional Neural Networks (CNN)


• Basics
• Convolution and Pooling
• Some applications

3. Recurrent Neural Networks (RNN)


• Basics
• Character-level language model (example)

4. Question
• Why is it difficult to train a deep neural network?

Algorithmic Intelligence Laboratory 73


Question

• Why is it difficult to train a deep neural network?

• Can we simply stack multiple layers and train them all?
  • Unfortunately, it does not work well
  • Even if we had an infinite amount of computational resources

• Vanishing gradient problem:

  • The magnitude of the gradients shrinks exponentially as we backpropagate through many layers
  • This is because typical activation functions such as sigmoid or tanh are bounded, so their derivatives are small (< 1)
  • This phenomenon is called the vanishing gradient problem

Algorithmic Intelligence Laboratory 74


Vanishing Gradient Problem

• Why do gradients vanish?


• Think of a simplified 3-layer neural network

Algorithmic Intelligence Laboratory 75


Vanishing Gradient Problem

• Why do gradients vanish?


• Think of a simplified 3-layer neural network

• First, let’s update


• Calculate the gradient of the loss with respect to

Algorithmic Intelligence Laboratory 76


Vanishing Gradient Problem

• Why do gradients vanish?


• Think of a simplified 3-layer neural network

• How about the weights in the earlier layers? (each factor in the chain has gradient < 1)

  • Calculate the gradient of the loss with respect to them: the chain rule multiplies one factor per layer, and each factor is < 1

  • Repeatedly multiplying values < 1 decreases the gradient magnitude exponentially
Algorithmic Intelligence Laboratory 77
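A tiny numerical illustration (not from the slides) of why the product of per-layer factors shrinks: the sigmoid derivative is at most 0.25, so multiplying many such factors drives the gradient toward zero. The depth and the choice of weight = 1 are illustrative simplifications.

```python
import numpy as np

def dsigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)   # at most 0.25 (attained at z = 0)

# Product of per-layer chain-rule factors (derivative * weight, weight taken as 1 here)
grad = 1.0
for layer in range(20):
    grad *= dsigmoid(0.0) * 1.0
print(grad)   # ~0.25**20, essentially zero: the gradient has vanished
```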
Vanishing Gradient Over Time

• This is even more problematic in a vanilla RNN (with tanh/sigmoid activations)

  • When trying to handle long temporal dependencies
  • As in the previous example, the gradient vanishes over time

Algorithmic Intelligence Laboratory *source :https://mediatum.ub.tum.de/doc/673554/file.pdf 78


Quiz

• The vanishing gradient problem is critical in training neural networks

• Q: Can we just use an activation function whose gradient is > 1?

Algorithmic Intelligence Laboratory 79


Answer for Quiz

• Not really. It causes another problem, so-called exploding gradients.

• Consider using an exponential activation function:
  • The magnitude of its gradient is always larger than 1 when the input is > 0
  • If the outputs of the network are positive, the gradients used for the updates will explode

• This makes training very unstable

  • The weights are updated by very large amounts, often resulting in NaN values
  • This is a very critical problem in training neural networks

Algorithmic Intelligence Laboratory 80


How Can We Overcome Vanishing Gradient Problems?

• Possible solutions
• Activation functions
• CNN: Residual networks [He et al., 2016]
• RNN: LSTM (Long Short-Term Memory)

LSTM (Long Short-Term Memory)

*source
https://mediatum.ub.tum.de/doc/673554/file.pdf
Algorithmic Intelligence Laboratory https://medium.com/@shrutijadon10104776/survey-on-activation-functions-for-deep-learning-9689331ba092 81
Solving Vanishing Gradient: Activation Functions

• Use different activation functions that are not bounded:


• Recent works largely use ReLU or its variants
• No saturation (for positive inputs), so they are easier to optimize

*source: https://medium.com/@shrutijadon10104776/survey-on-activation-functions-for-deep-learning-9689331ba092
Algorithmic Intelligence Laboratory 82
Solving Vanishing Gradient: Activation Functions

• Several generalizations of ReLU


• Leaky ReLU [Maas et al., 2013]: introduces a non-zero gradient for "dying ReLUs", f(x) = max(0.01x, x)
• Parametric ReLU (PReLU) [He et al., 2015]: an additional learnable parameter a on the leaky ReLU, f(x) = max(x/a, x)
• Randomized ReLU (RReLU) [Xu et al., 2015]: samples the parameter a from a uniform distribution

• Concatenated ReLU (CReLU) [Shang et al., 2016]: a two-sided ReLU, f(x) = max(-x, x) in concatenated form, motivated by the "opposite pairs" of filters found in CNNs (without it, the network needs to learn twice the information)

Algorithmic Intelligence Laboratory 83
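A small sketch of these ReLU variants as element-wise functions; the PReLU form follows the slide's max(x/a, x) notation (a is learned in PReLU, sampled in RReLU), CReLU is written in its concatenated [ReLU(x), ReLU(-x)] form, and the parameter values are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x):
    return np.maximum(0.01 * x, x)              # f(x) = max(0.01x, x)

def prelu(x, a):
    return np.maximum(x / a, x)                 # slide's form f(x) = max(x/a, x); a is learnable

def crelu(x):
    return np.concatenate([relu(x), relu(-x)])  # concatenated "two-sided" ReLU, doubles the output size

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))
print(leaky_relu(x))
print(prelu(x, a=5.0))   # slope 1/5 for negative inputs
print(crelu(x))
```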


Solving Vanishing Gradient: Residual Networks

• Residual networks (ResNet [He et al., 2016])


• Feed-forward NN with “shortcut connections”
• Can preserve gradient flow throughout the entire depth of the network
• Possible to train more than 100 layers by simply stacking residual blocks

Plain network Residual network

Algorithmic Intelligence Laboratory 84
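A minimal sketch of a residual block in plain numpy: the identity shortcut means the gradient can flow through the addition unchanged. Real residual blocks use convolutions and batch normalization, which are omitted here; the weights and dimensions are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Residual block: output = F(x) + x, where the shortcut (identity) connection
    preserves gradient flow through the addition."""
    out = relu(W1 @ x)       # first transformation + non-linearity
    out = W2 @ out           # second transformation
    return relu(out + x)     # add the shortcut connection, then the final ReLU

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=d)
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
print(residual_block(x, W1, W2))
```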


Solving Vanishing Gradient: LSTM and GRU

• LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Units)


• A specially designed RNN cell that can remember information for much longer periods

3 main steps:
• Forget irrelevant parts of previous state
• Selectively update the cell state based on the
new input
• Selectively decide what part of the cell state to
output as the new hidden state

Preservation of gradient information in LSTM


*source :
http://harinisuresh.com/2016/10/09/lstms/
Algorithmic Intelligence Laboratory https://mediatum.ub.tum.de/doc/673554/file.pdf 85
References

• [Nair et al., 2010] "Rectified linear units improve restricted boltzmann machines." ICML 2010.
link : https://dl.acm.org/citation.cfm?id=3104425
• [Krizhevsky et al., 2012] "Imagenet classification with deep convolutional neural networks." NIPS 2012
link : https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
• [Maas et al., 2013] "Rectifier nonlinearities improve neural network acoustic models." ICML 2013.
link : https://ai.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf
• [Farabet et al., 2013] "Learning hierarchical features for scene labeling." IEEE transactions on PAMI 2013
link : https://www.ncbi.nlm.nih.gov/pubmed/23787344
• [Zeiler et al., 2014] "Visualizing and understanding convolutional networks." ECCV 2014.
link : https://cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf
• [Simonyan et al., 2015] "Very deep convolutional networks for large-scale image recognition.” ICLR 2015
link : https://arxiv.org/abs/1409.1556
• [Ren et al., 2015] "Faster r-cnn: Towards real-time object detection with region proposal networks." NIPS 2015
link : https://arxiv.org/abs/1506.01497
• [Vinyals et al., 2015] "Show and tell: A neural image caption generator." CVPR 2015.
link : https://arxiv.org/abs/1411.4555
• [Karpathy et al., 2015] "Deep visual-semantic alignments for generating image descriptions." CVPR 2015
link : https://cs.stanford.edu/people/karpathy/cvpr2015.pdf
• [He et al., 2015] "Delving deep into rectifiers: Surpassing human-level performance on imagenet
classification." ICCV 2015.
link : https://arxiv.org/abs/1502.01852

Algorithmic Intelligence Laboratory 86


References

• [Xu et al., 2015] "Empirical evaluation of rectified activations in convolutional network." arXiv preprint, 2015.
link : https://arxiv.org/abs/1505.00853
• [Shang et al., 2016] "Understanding and improving convolutional neural networks via concatenated rectified
linear units." ICML 2016.
link : https://arxiv.org/abs/1603.05201
• [He et al., 2016] "Deep residual learning for image recognition." CVPR 2016
link : https://arxiv.org/abs/1512.03385
• [Cao et al., 2017] "Realtime multi-person 2D pose estimation using part affinity fields." CVPR 2017
link : https://arxiv.org/abs/1611.08050
• [Fei-Fei and Karpathy, 2017] “CS231n: Convolutional Neural Networks for Visual Recognition”, 2017. (Stanford
University)
link : http://cs231n.stanford.edu/2017/

Algorithmic Intelligence Laboratory 87
