
Chair for Computer Vision and Artificial Intelligence

Department of Informatics
Technical University of Munich

Esolution
Place student sticker here

Note:
• During the attendance check a sticker containing a unique code will be put on this exam.
• This code contains a unique number that associates this exam with your registration number.
• This number is printed both next to the code and to the signature field in the attendance check list.

Introduction to Deep Learning

Exam: IN2346 / Endterm          Date: Thursday 8th August, 2019
Examiner: Prof. Dr. Leal-Taixé, Prof. Dr. Nießner          Time: 08:00 – 09:30

P1   P2   P3   P4   P5   P6

Working instructions
• This exam consists of 20 pages with a total of 6 problems.
Please make sure now that you received a complete copy of the exam.

• The total amount of achievable credits in this exam is 90 credits.


• Detaching pages from the exam is prohibited.

• Allowed resources:
– none

• Do not write with red or green colors nor use pencils.


• Physically turn off all electronic devices, put them into your bag and close the bag. This includes
calculators.

Left room from to / Early submission at

– Page 1 / 20 –
Problem 1 Multiple Choice (18 credits)
Mark your answer clearly by a cross in the corresponding box. Multiple correct answers per question possible.

a) Your network is overfitting. What are good ways to approach this problem?
Increase the size of the validation set

× Increase the size of the training set


× Reduce your model capacity
Reduce learning rate and continue training

b) A sigmoid layer
has a learnable parameter.
cannot be used during backpropagation.
× is continuous and differentiable everywhere.
maps to values between -1 and 1.

c) Training error does not decrease. What could be a reason?
× Too much regularization.
Too many weights in your network.
× Bad initialization.
× Learning rate is too high.
d) How many network parameters are in ResNet-152?
1,337,337.
× 60,344,232.
more than a billion.
152.

e) What is the correct order of operations for an optimization with gradient descent?

a Update the network weights to minimize the loss.
b Calculate the difference between the predicted and target value.
c Iteratively repeat the procedure until convergence.
d Compute a forward pass.
e Initialize the neural network weights.

bcdea
ebadc
eadbc
× edbac

– Page 2 / 20 –
f) Dropout
has trouble with tanh activations.

× is an efficient way for regularization.


× can be seen as an ensemble of networks.
makes your network train faster.

g) Consider a simple convolutional neural network with a single convolutional layer. Which of the following statements is true about this network?
All input nodes are connected to all output nodes.
It is scale invariant.
It is translation invariant.
It is rotation invariant.

h) You are building a model to predict the presence (labeled 1) or absence (labeled 0) of a tumor in a brain scan. The goal is to ultimately deploy the model to help doctors in hospitals. Which of these metrics would you choose to use?

× Recall = True positive examples / Total positive examples.
Precision = True positive examples / Total predicted positive examples.
Average Precision = (True positive examples + True negative examples) / Total examples.

i) Why would you want to use 1 × 1 convolutions? (check all that apply)
Predict binary class probabilities.
× Collapse number of channels.
× Learn more complex functions by introducing additional non-linearities.
To enforce a fixed size output.

– Page 3 / 20 –
Problem 2 Short Questions (24 credits)

a) You are training a neural network with 15 fully-connected layers with a tanh nonlinearity. Explain the behavior of the gradient of the non-linearity with respect to very large positive inputs.

Because the tanh is almost flat for very large positive values (1pt), its gradient will be almost 0. (1pt)
Comment: Points deducted for saying "gradient saturates" but not mentioning the small value of the gradient; a neuron saturates but not the gradient.

b) Why might this be a problem for training neural networks? Name and explain this phenomenon.

Vanishing gradient (1p); during backprop the gradient of the non-linearity is close to zero, which makes training/parameter updates much slower. (explanation is another 1p)
c) In modern architectures, another type of non-linearity is commonly used. Draw and name this non-linearity (1p) and explain why it helps solve the problem mentioned in the previous two questions (1p).

Rectified Linear Unit (0.5p) + drawing (0.5p).
Because ReLU activations are linear for positive inputs, they do not saturate for large (positive) values, and hence freely allow gradients to change weights in the network. (1p) Comment: Saturation was enough.

d) Why do we often refer to L2-regularization as "weight decay"? Derive a mathematical expression that includes the weights W, the learning rate η, and the L2-regularization hyperparameter λ to explain your point.

Weight update with objective function J including the L2 term (λ/2) Σ_i W_i²:

W ← W − η ∇W ( J + (λ/2) Σ_i W_i² ) = W − η ∇W J − ηλ W = W (1 − ηλ) − η ∇W J,

where η is the learning rate and λ the regularisation parameter, with ηλ ≪ 1.
The value of W is pushed towards zero ("decays") in each iteration.

Points: Qualitative answer: 0.5. Mathematical part: L2 loss 0.5, weight update formula 0.5, final result 0.5.
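A minimal NumPy sketch of this equivalence (assuming a plain SGD step and a stand-in gradient grad_J for ∇W J): adding the L2 term to the objective is the same as shrinking the weights by a factor (1 − ηλ) before the usual gradient step.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
grad_J = rng.normal(size=(4, 4))   # stand-in for the data-loss gradient ∇W J
lr, lam = 0.1, 0.01                # learning rate η and L2 strength λ

# Update with the gradient of J + (λ/2)·Σ W_i²  (the gradient of the L2 term is λ·W)
W_explicit = W - lr * (grad_J + lam * W)

# Equivalent "weight decay" form: shrink W, then take the usual step
W_decay = W * (1 - lr * lam) - lr * grad_J

assert np.allclose(W_explicit, W_decay)
```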

– Page 4 / 20 –
e) You are solving the binary classification task of classifying images as cars vs. persons. You design a CNN with a single output neuron. Let the output of this neuron be z. The final output of your network, ŷ, is given by:

ŷ = σ(ReLU(z))

You classify all inputs with a final value ŷ ≥ 0.5 as car images. What problem are you going to encounter?

Using ReLU then sigmoid will cause all predictions to be positive (0.5p):
σ(ReLU(z)) ≥ 0.5 ∀z. (0.5p)
Writing "all predictions are 'cars'" is enough.
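A quick numerical check of this claim (not part of the exam solution): since ReLU(z) ≥ 0 and σ(0) = 0.5, every possible output already lands in the "car" class.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.linspace(-10, 10, 1001)       # any network output z
y_hat = sigmoid(np.maximum(z, 0))    # σ(ReLU(z))
print(y_hat.min())                   # 0.5 -> every input is classified as "car"
```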

f) Suppose you initialize your weights w with the uniform random distribution U(−α, α). The output s for a given input vector x is given by

s_i = Σ_{j=0}^{n} w_ij · x_j,

where n is the number of input values.
Assume that the input data x and the weights are independent and identically distributed. How do you have to choose α such that the variance of the input data and the output is identical, hence Var(s) = Var(x)?
Hint: For two statistically independent variables X and Y it holds that

Var(X · Y) = E(X)² Var(Y) + E(Y)² Var(X) + Var(X) Var(Y).

Furthermore, the PDF of a uniform distribution U(a, b) is

f(x) = 1/(b − a) for x ∈ [a, b], and 0 otherwise.

The variance of a continuous distribution is calculated as

Var(X) = ∫ℝ x² f(x) dx − μ²,

where μ is the expected value of X.

Var(s_i) = Var(Σ_{j=0}^{n} w_ij · x_j) = Σ_{j=0}^{n} Var(w_ij) Var(x_j) = n · Var(w) Var(x) (using E(w) = E(x) = 0) (1p)

Var(U(−α, α)) = (1/3) α² (0.5p)

α = √(3/n) (0.5p)

Correct result: 2p. If only Var(w) = 1/n is written then 1p.
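A Monte-Carlo sketch of the result, assuming zero-mean, unit-variance inputs: with α = √(3/n) the output variance matches the input variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 256, 20_000
alpha = np.sqrt(3.0 / n)

x = rng.normal(0.0, 1.0, size=(trials, n))         # E(x) = 0, Var(x) = 1
w = rng.uniform(-alpha, alpha, size=(trials, n))   # Var(w) = alpha^2 / 3 = 1/n
s = (w * x).sum(axis=1)                            # s = sum_j w_j * x_j

print(np.var(x), np.var(s))                        # both ≈ 1
```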

g) Consider 2 different models for image classification of the MNIST data set. The models are: (i) a 3-layer perceptron, (ii) LeNet. Which of the two models is more robust to translation of the digits in the images? Give a short explanation why.

LeNet (0.5p), convolutional layers (1.5p).

2p: LeNet mentioned and convolutional layers as reason.
1.5p: LeNet mentioned, convolutional layers are mentioned but students wrote too much text which included wrong statements.
– Page 5 / 20 –
h) Consider the following one-dimensional data points with classes {0, 1}. Sketch a linear (0.5p) and a logistic (0.5p) regression into the figures. Which model is more suitable for this task (1p)?

[Figures: plot linear regression (left) and logistic regression (right).]

Logistic regression. (1pt)
Grading notes: an "S" is not a function (0.5pt); the line should go through the points, not through 0 (0.5p).
i) What is the mean and standard deviation of Xavier initialization? What changes to this initialization would you propose when used with ReLU non-linearities?

Mean = 0, Var = 1/n. With ReLUs: Var = 2/n. (1p each)
Writing variance instead of stddev was fine; both solutions accepted.

j) You have 4000 cat and 100 dog images and want to train a neural network on these images to do binary classification. What problems do you foresee with this dataset distribution? Name two possible solutions.

Network prefers cats as they are more likely, or: imbalance between classes (1pt).
Leave out images / reweight dataloader / reweight loss function / collect more dog images / data augmentation for dogs (0.5pt per solution). No points for: dropout, regularization, batch norm, transfer learning, "get more data".
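A hedged sketch of one of the accepted fixes (reweighting the loss function) in PyTorch; the class weights here are simply the inverse class frequencies of this hypothetical 4000/100 dataset.

```python
import torch
import torch.nn.functional as F

# 4000 cats (label 0) vs. 100 dogs (label 1): weight classes inversely to their frequency
class_counts = torch.tensor([4000.0, 100.0])
class_weights = class_counts.sum() / (2 * class_counts)   # ≈ [0.51, 20.5]

logits = torch.randn(8, 2)             # dummy network outputs
labels = torch.randint(0, 2, (8,))     # dummy labels
loss = F.cross_entropy(logits, labels, weight=class_weights)
print(loss.item())
```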

k) Why is initializing all the weights of a fully connected layer to the same value problematic during training?

If all weights are equal, the nodes will learn the same thing during backpropagation, and this limits the capacity. (2p if correct)
If there is no mention of gradients/weight updating, e.g. by only saying "the network will not learn": 1.5p.

– Page 6 / 20 –
l) What is the difference between dropout for convolutional layers compared to dropout for fully connected layers? Explain both behaviours.

Conv: drop feature maps at random; fully connected: drop weights at random. (1p each)
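A short PyTorch illustration of the two behaviours (a sketch, not part of the grading): nn.Dropout zeroes individual activations, a common realisation of dropout for fully-connected layers, while nn.Dropout2d (spatial dropout) zeroes whole feature maps, which is the variant meant for convolutional layers here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.ones(1, 4, 3, 3)           # (batch, channels, H, W)

drop_fc = nn.Dropout(p=0.5)          # zeroes individual elements
drop_conv = nn.Dropout2d(p=0.5)      # zeroes entire channels (feature maps)

drop_fc.train(); drop_conv.train()
print(drop_fc(x)[0, 0])              # zeros scattered over individual elements
print(drop_conv(x).sum(dim=(2, 3)))  # each channel is all-zero or fully kept (scaled by 1/(1-p))
```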

– Page 7 / 20 –
Problem 3 Optimization (12 credits)

a) Explain the concept behind RMSProp optimization. How does it help converging faster?

Mitigate the step size in directions with high-variance gradients (1p). Can increase the learning rate (1p).
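A minimal sketch of the RMSProp update (with a generic gradient as a stand-in): the running average of squared gradients divides the step, so high-variance directions take smaller effective steps.

```python
import numpy as np

def rmsprop_step(w, g, cache, lr=1e-2, beta=0.9, eps=1e-8):
    """One RMSProp update; `cache` is the running mean of squared gradients."""
    cache = beta * cache + (1 - beta) * g ** 2
    w = w - lr * g / (np.sqrt(cache) + eps)
    return w, cache

# toy example: minimize f(w) = 0.5 * ||w||^2, whose gradient is simply w
w, cache = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    w, cache = rmsprop_step(w, w, cache)
print(w)   # moved towards the minimum at [0, 0]
```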

b) Which SGD variation uses first and second momentum?

Adam.
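For comparison, a sketch of the Adam update, which keeps both a first moment (running mean of gradients) and a second moment (running mean of squared gradients) with bias correction:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1) for parameters w with gradient g."""
    m = beta1 * m + (1 - beta1) * g            # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2       # second moment (RMSProp-style)
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```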
c) Why is it common to use a learning rate decay?

When far away from the optimum (0.5p), one wants larger update steps to get closer to the solution; the closer you get, the less jitter/overshooting you want. (0.5p)

d) What is a saddle point? What is the advantage/disadvantage of Stochastic Gradient Descent (SGD) in dealing with saddle points?

Saddle point: the gradient is zero (0.5p), but it is neither a local minimum nor a local maximum (0.5p) (or: the gradient is zero and the function has a local maximum in one direction, but a local minimum in another direction).
SGD has noisier updates and can help escape from a saddle point (1p).
e) Why would one want to use larger mini-batches in SGD?

Make the gradients less noisy.

f) Why do we usually use small mini-batches in practice?

Limited GPU memory / faster compute (for each batch), so faster updates.

g) Your network's training curve diverges (assuming data loading is correct). Name one way to address the problem through a hyperparameter change.

Reduce the learning rate. (1 point)

– Page 8 / 20 –
h) What is an epoch?

A full run through the entire training set.

i) When is SGD guaranteed to converge to a local minimum (provide formula)?

Robbins-Monro conditions: Σ_{i=1}^{∞} αi = ∞ (1p) and Σ_{i=1}^{∞} αi² < ∞ (1p), where αi is the learning rate at step i.

– Page 9 / 20 –
Problem 4 Convolutional Neural Networks and Advanced Architectures (12 credits)
In the following we assume that the input of our network is a 224 × 224 × 3 color (RGB) image. The task is to perform image classification on 1000 classes. You design a network with the following structure [CONV - RELU] x 20 - FC - FC. That is, you place 20 consecutive convolutional layers (including non-linear activations), followed by two fully-connected layers. Each layer will have its own number of filters and kernel size.

a) The first 3 convolutional layers each have 5 filters with kernels of size 3 × 3, applied with stride 1 and no padding. How large is the receptive field of a feature after the 3 convolutional operations?

1×1 → 3×3 → 5×5 → 7×7 (1p)
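A small sketch reproducing the 1 → 3 → 5 → 7 growth; for stacked convolutions the receptive field grows by (kernel − 1) times the product of the previous strides per layer.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, applied in order."""
    rf, jump = 1, 1                  # receptive field and effective stride so far
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1)] * 1))   # 3
print(receptive_field([(3, 1)] * 2))   # 5
print(receptive_field([(3, 1)] * 3))   # 7
```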
b) What are the dimensions of the feature map after the 3 convolutional operations from (a)?

224 − 2 (first conv layer) − 2 (second conv layer) − 2 (third conv layer) = 218, so the feature map is 218 × 218 × 5 (number of filters). (1p spatial size, 1p number of filters)

c) What are the dimensions of the weight tensor of the first convolutional layer? (1p) What does each dimension represent? (1p)

Shape: (3, 5, 3, 3) (1pt)
Reasoning: input channels (RGB), output channels/number of filters, kernel size = 3 × 3 (1p)
(No points when only 3 dims are mentioned.)
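A cross-check in PyTorch (note that frameworks order these dimensions differently; torch.nn.Conv2d, for instance, stores the weight as (out_channels, in_channels, kH, kW), i.e. (5, 3, 3, 3), containing the same four quantities as the answer above):

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=3, stride=1, padding=0)
print(conv.weight.shape)    # torch.Size([5, 3, 3, 3]) -> (out, in, kH, kW)
print(conv.weight.numel())  # 5 * 3 * 3 * 3 = 135 weights (plus 5 bias values)
```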

d) After the 10th convolutional layer your feature map has size 100 × 100 × 224. You realize the next convolutional filter operation will involve too many multiplications, which makes your network training slow. However, the next layer requires an identical spatial size of the feature map.
Propose a solution for this problem (1p) and demonstrate your solution with an example (1p).

1 × 1 convolutions (1p)
If you use 25 convolutional filters of size 1 × 1 × 224, you reduce the feature map to 100 × 100 × 25, making the next operation cheaper. (1p)
Comment: Any output larger than 100 × 100 × 224 is wrong.
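A quick sketch of the proposed fix with the channel numbers from the example: a 1 × 1 convolution keeps the 100 × 100 spatial size but shrinks the channel dimension, so the following layer needs far fewer multiplications.

```python
import torch
import torch.nn as nn

feat = torch.randn(1, 224, 100, 100)            # (N, C, H, W) feature map
bottleneck = nn.Conv2d(224, 25, kernel_size=1)  # 25 filters of size 1x1x224
print(bottleneck(feat).shape)                   # torch.Size([1, 25, 100, 100])
```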

e) Your network is now trained for the task of image classification. You now want to use the trained weights of this network for the task of image segmentation, for which you need a pixel-wise output. Which layers of your original network described above can you not reuse for the image segmentation task? (1p) Describe briefly how you would adapt the network for image segmentation given any input image size. (1p)

The FC layers, because they take a fixed input size (1p). Make it fully convolutional (1pt). Comment: mentioning only upscaling: 0.5p.

– Page 10 / 20 –
f) You decide to increase the number of layers substantially and therefore you switch to a ResNet architecture. Draw a ResNet block (1p). Describe all the operations inside the block (1pt). What is the advantage of using such a block in terms of training (1p)?

[Drawing of a ResNet block.] (1pt)

Final summation of the features passed through the convolutional layers and the skipped initial features: F(x) + x. (1p)
One of multiple solutions: skip-connections
- provide highways for gradients and make the network easier to train
- resolve the vanishing gradient problem (1p)
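A minimal residual block sketch (assuming equal input and output channels so the identity skip needs no projection):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = ReLU(F(x) + x), with F = conv-BN-ReLU-conv-BN."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        f = torch.relu(self.bn1(self.conv1(x)))
        f = self.bn2(self.conv2(f))
        return torch.relu(f + x)        # skip connection: F(x) + x

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```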


– Page 11 / 20 –
Problem 5 Backpropagation and Convolutional Layers (12 credits)
Your friend is excited to try out those "Convolutional Layers" you were talking about from your lecture. However, he seems to have some issues and requests your help for some theoretical computations on a toy example.
Consider a neural network with a convolutional (without activation) and a max pooling layer. The convolutional layer has a single filter with kernel size (1, 1), no bias, a stride of 1 and no padding. The filter weights are all initialized to a value of 1. The max pooling layer has a kernel size of (2, 2) with stride 2, and 1 zero-padding.

You are given the following input image of dimensions (3, 2, 2):

x = ( [[1, −0.5], [2, −2]], [[−2, 1], [−1.5, 1]], [[1, 0], [0, 0]] )

a) Compute the forward pass of this input and write down your calculations.

Forward pass (the 1 × 1 convolution with all weights 1 sums the three channels):

[[1, −0.5], [2, −2]] + [[−2, 1], [−1.5, 1]] + [[1, 0], [0, 0]] = [[0, 0.5], [0.5, −1]] (1p)

After max pooling (2 × 2, stride 2, zero-padding 1):

[[0, 0.5], [0.5, 0]] (1p)
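A NumPy sketch verifying this forward pass (1 × 1 convolution with unit weights = channel-wise sum, then 2 × 2 max pooling with stride 2 on the zero-padded map):

```python
import numpy as np

x = np.array([[[1, -0.5], [2, -2]],
              [[-2, 1], [-1.5, 1]],
              [[1, 0], [0, 0]]], dtype=float)        # input of shape (3, 2, 2)

conv = x.sum(axis=0)                                  # 1x1 conv, all weights 1, no bias
print(conv)                                           # [[0, 0.5], [0.5, -1]]

padded = np.pad(conv, 1)                              # zero-padding of 1 -> (4, 4)
pooled = padded.reshape(2, 2, 2, 2).max(axis=(1, 3))  # 2x2 max pooling, stride 2
print(pooled)                                         # [[0, 0.5], [0.5, 0]]
```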

b) Consider the corresponding ground truth

y = [[0, 1], [1, 0]]

Calculate the binary cross-entropy with respect to the natural logarithm by summing over all output pixels of the forward pass computed in (a). You may assume log(0) ≈ −10⁹. (Write down the equation and keep the logarithm for the final result.)

BCEloss = − Σᵢ tᵢ log sᵢ (0.5p for either this or the line below)
= − log(2w1 − 1.5w2) − log(−0.5w1 + w2)
= − log(0.5) − log(0.5) = 2 log 2 (1p)

c) You don't recall learning the formula for backpropagation through convolutional layers but those 1 × 1 convolutions seem suspicious. Write down the name of a common layer that is able to produce the same result as the convolutional layer used above.

Fully-connected layer
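A small sketch of the equivalence: a 1 × 1 convolution applies the same linear (here bias-free, fully-connected) layer to the channel vector of every pixel.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 2, 2))        # (channels, H, W)
w = np.array([1.0, 1.0, 1.0])         # the 1x1 filter: one weight per input channel

conv_out = np.tensordot(w, x, axes=([0], [0]))      # 1x1 convolution over channels
fc_out = (x.reshape(3, -1).T @ w).reshape(2, 2)     # the same FC layer applied per pixel
print(np.allclose(conv_out, fc_out))                # True
```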

– Page 12 / 20 –
d) Update the kernel weights accordingly by using gradient descent with a learning rate of 1. (Write down your calculations!)

Partial derivatives for w1/w2 (2p):

∂BCE/∂w1 = − ∂[ln(2w1 − 1.5w2) + ln(−0.5w1 + w2)]/∂w1 = − 2/(2w1 − 1.5w2) − (−0.5)/(−0.5w1 + w2) = −4 + 1 = −3

∂BCE/∂w2 = − ∂[ln(2w1 − 1.5w2) + ln(−0.5w1 + w2)]/∂w2 = − (−1.5)/(2w1 − 1.5w2) − 1/(−0.5w1 + w2) = 3 − 2 = 1

Update using gradient descent for w1/w2 (2p):

w1⁺ = w1 − lr · ∂BCE/∂w1 = 1 − 1 × (−3) = 4
w2⁺ = w2 − lr · ∂BCE/∂w2 = 1 − 1 × 1 = 0

Derivative and update for w3 (1p total):

∂BCE/∂w3 = 0
w3⁺ = w3 − 0 = 1

1p if the person only wrote at least the gradient descent update rule.
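A hedged autograd cross-check of these hand-computed gradients (the max-pooling selection is already folded into the two chosen conv pixels, as in the solution above):

```python
import torch

w = torch.ones(3, requires_grad=True)          # w1, w2, w3

# the two pooled outputs with ground truth 1 (see (a)/(b)):
s1 = 2 * w[0] - 1.5 * w[1] + 0 * w[2]          # conv value at pixel (1, 0)
s2 = -0.5 * w[0] + 1 * w[1] + 0 * w[2]         # conv value at pixel (0, 1)
bce = -(torch.log(s1) + torch.log(s2))

bce.backward()
print(bce.item())                       # 2 ln 2 ≈ 1.386
print(w.grad)                           # tensor([-3., 1., 0.])
print((w - 1.0 * w.grad).detach())      # updated weights: tensor([4., 0., 1.])
```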

– Page 13 / 20 –
e) After helping your friend with debugging, you want to showcase the power of convolutional layers. Deduce what kind of 3 × 3 convolutional filter was used to generate the output (right) of the grayscale image (left) and write down its 3 × 3 values.

[Figure: grayscale input image (left) and filtered output (right).]

Vertical edge detector (1p)

[ 1  0  −1 ]
[ 1  0  −1 ]   (1p)
[ 1  0  −1 ]

Flipping & scaling are OK.

f) He finally introduces you to his real problem. He wants to find 3 × 3 black crosses in grayscale images, i.e., each pixel has a value between 0 (black) and 1 (white).

You notice that you can actually hand-craft such a filter. Write down the numerical values of a 3 × 3 filter that maximally highlights the position of black crosses.

[ −1   1  −1 ]
[  1  −1   1 ]   (2p)
[ −1   1  −1 ]

Flipping & scaling are OK, even though pixel values were given.
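A small demonstration of this filter (assuming, as in the answer, that the cross is the ×-shape with black = 0 on a white = 1 background): the correlation response is largest exactly at the cross centre.

```python
import numpy as np
from scipy.signal import correlate2d

img = np.ones((5, 5))                              # white background
img[1:4, 1:4] = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # black 3x3 cross (x-shape) in the middle

k = np.array([[-1, 1, -1],
              [1, -1, 1],
              [-1, 1, -1]], dtype=float)

response = correlate2d(img, k, mode="valid")        # slide the filter over the image
print(response)
print(np.unravel_index(response.argmax(), response.shape))   # (1, 1): the cross centre
```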

– Page 14 / 20 –
Problem 6 Recurrent Neural Networks and LSTMs (12 credits)

a) Consider a vanilla RNN cell of the form ht = tanh(V · ht−1 + W · xt). The figure below shows the input sequence x1, x2, and x3.

[Figure: the unrolled RNN cell with inputs x1, x2, x3.]

Given the dimensions xt ∈ ℝ⁴ and ht ∈ ℝ¹², what is the number of parameters in the RNN cell? Neglect the bias parameter.

4 × 12 + 12 × 12 (1 pt) = 48 + 144 = 192 (1 pt)
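A cross-check with PyTorch's built-in cell, which uses exactly one input-to-hidden and one hidden-to-hidden matrix, matching W and V above:

```python
import torch.nn as nn

cell = nn.RNNCell(input_size=4, hidden_size=12, bias=False, nonlinearity="tanh")
print(cell.weight_ih.shape)                        # torch.Size([12, 4])  -> W: 48 parameters
print(cell.weight_hh.shape)                        # torch.Size([12, 12]) -> V: 144 parameters
print(sum(p.numel() for p in cell.parameters()))   # 192
```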
b) If xt is the 0 vector, then ht = ht−1. Discuss whether this statement is correct.

False (1 pt):
After the transformation with V and the non-linearity, xt = 0 does not lead to ht = ht−1, since in general tanh(V · ht−1) ≠ ht−1 (1 pt). Full points require an explanation; solely the equation is not sufficient.

– Page 15 / 20 –
c) Now consider the following one-dimensional ReLU-RNN cell:

ht = ReLU(V · ht−1 + W · xt)

(Hidden state, input, and weights are scalars.)

Calculate h1, h2 and h3 where V = 1, W = 2, h0 = −3, x1 = 1, x2 = 2 and x3 = 0.

h0 = −3
h1 = ReLU(1 · (−3) + 2 · 1) = 0 (1 pt)
h2 = ReLU(1 · 0 + 2 · 2) = 4 (1 pt)
h3 = ReLU(1 · 4 + 2 · 0) = 4 (1 pt)
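The same recurrence as a few lines of Python:

```python
V, W, h = 1.0, 2.0, -3.0            # V, W, h0
for x in [1.0, 2.0, 0.0]:           # x1, x2, x3
    h = max(V * h + W * x, 0.0)     # ReLU
    print(h)                        # 0.0, 4.0, 4.0
```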


– Page 16 / 20 –
d) Calculate the derivatives ∂h3/∂V, ∂h3/∂W, and ∂h3/∂x1 for the forward pass of the ReLU-RNN cell of (c). Use that ∂ReLU(x)/∂x |x=0 = 1.

ht = ReLU(V · ht−1 + W · xt) = ReLU(zt)

∂h3/∂V = ReLU′(z3) · h2 + ReLU′(z2) · V · h1 + ReLU′(z1) · V² · h0 = 1 · 4 + 1 · 1 · 0 + 0 · 1 · (−3) = 4 (1 pt)

∂h3/∂W = ReLU′(z3) · x3 + ReLU′(z2) · V · x2 + ReLU′(z1) · V² · x1 = 1 · 0 + 1 · 1 · 2 + 0 · 1 · 1 = 2 (1 pt)

∂h3/∂x1 = ReLU′(z3) · V · ReLU′(z2) · V · ReLU′(z1) · W = 1 · 1 · 1 · 1 · 0 · 2 = 0 (1 pt)

Here ReLU′(zt) denotes the ReLU derivative evaluated at zt. Only a correct and calculated result gives the point.
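A hedged autograd cross-check of the three derivatives (note that z1 = −1 < 0, so the ReLU gradient there is exactly 0 and the convention at x = 0 is never actually needed):

```python
import torch

V = torch.tensor(1.0, requires_grad=True)
W = torch.tensor(2.0, requires_grad=True)
x1 = torch.tensor(1.0, requires_grad=True)
x2, x3, h0 = torch.tensor(2.0), torch.tensor(0.0), torch.tensor(-3.0)

h1 = torch.relu(V * h0 + W * x1)
h2 = torch.relu(V * h1 + W * x2)
h3 = torch.relu(V * h2 + W * x3)

h3.backward()
print(V.grad, W.grad, x1.grad)   # tensor(4.) tensor(2.) tensor(0.)
```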


– Page 17 / 20 –
e) A Long Short-Term Memory (LSTM) unit is defined as

g1 = σ(W1 · xt + U1 · ht−1),
g2 = σ(W2 · xt + U2 · ht−1),
g3 = σ(W3 · xt + U3 · ht−1),
c̃t = tanh(Wc · xt + Uc · ht−1),
ct = g2 ◦ ct−1 + g3 ◦ c̃t,
ht = g1 ◦ ct,

where g1, g2, and g3 are the gates of the LSTM cell.

1) Assign these gates correctly to the forget f, update u, and output o gates. (1p)
2) What does the value ct represent in an LSTM? (1p)

g1 = output gate
g2 = forget gate
g3 = update gate
(1 pt)
ct: cell state (1 pt)
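A sketch implementing exactly the equations above for the scalar case (the weight values are arbitrary, just to make the cell run), with g1/g2/g3 labelled as output/forget/update gate:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    g1 = sigmoid(p["W1"] * x_t + p["U1"] * h_prev)   # output gate
    g2 = sigmoid(p["W2"] * x_t + p["U2"] * h_prev)   # forget gate
    g3 = sigmoid(p["W3"] * x_t + p["U3"] * h_prev)   # update gate
    c_tilde = np.tanh(p["Wc"] * x_t + p["Uc"] * h_prev)
    c_t = g2 * c_prev + g3 * c_tilde                 # cell state
    h_t = g1 * c_t                                   # hidden state (as defined in this exam)
    return h_t, c_t

params = {k: 0.5 for k in ["W1", "U1", "W2", "U2", "W3", "U3", "Wc", "Uc"]}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```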


– Page 18 / 20 –
Additional space for solutions – clearly mark the (sub)problem your answers are related to and strike out invalid solutions.


– Page 19 / 20 –

– Page 20 / 20 –
