Solution PDF
Department of Informatics
Technical University of Munich
Note:
• During the attendance check a sticker containing a unique code will be put on this exam.
• This code contains a unique number that associates this exam with your registration number.
• This number is printed both next to the code and to the signature field in the attendance check list.
Exam: IN2346 / Endterm Date: Thursday 8th August, 2019
Examiner: Prof. Dr. Leal-Taixé, Prof. Dr. Nießner Time: 08:00 – 09:30
Working instructions
• This exam consists of 20 pages with a total of 6 problems.
Please make sure now that you received a complete copy of the exam.
• Allowed resources:
– none
Problem 1 Multiple Choice (18 credits)
Mark your answer clearly with a cross in the corresponding box. Multiple correct answers per question are possible.
a) Your network is overfitting. What are good ways to approach this problem?
Increase the size of the validation set
b) A sigmoid layer
has a learnable parameter.
× is continuous and differentiable everywhere.
maps to values between -1 and 1.
c) Training error does not decrease. What could be a reason?

d)
× 60,344,232.
152.
e) What is the correct order of operations for an optimization with gradient descent?
bcdea
ebadc
eadbc
× edbac
f) Dropout
has trouble with tanh activations.
g) Consider a simple convolutional neural network with a single convolutional layer. Which of the following statements is true about this network?
All input nodes are connected to all output nodes.
It is scale invariant.
It is translation invariant.
It is rotation invariant.
h) You are building a model to predict the presence (labeled 1) or absence (labeled 0) of a tumor in a brain scan. The goal is to ultimately deploy the model to help doctors in hospitals. Which of these two metrics would you choose to use?
× Recall = (true positive examples) / (total positive examples).
i) Why would you want to use 1 × 1 convolutions? (check all that apply)
Predict binary class probabilities.
Problem 2 Short Questions (24 credits)
a) You are training a neural network with 15 fully-connected layers with a tanh nonlinearity. Explain the behavior of the gradient of the non-linearity with respect to very large positive inputs.

Because the tanh is almost flat for very large positive values (1pt), its gradient will be almost 0. (1pt)
Comment: Points deducted for saying "gradient saturates" without mentioning the small value of the gradient; a neuron saturates, but not the gradient.
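As an illustrative check (a minimal numpy sketch, not part of the graded solution), the tanh derivative 1 − tanh²(x) can be evaluated for increasingly large inputs:

```python
import numpy as np

# The gradient of tanh(x) is 1 - tanh(x)^2; it collapses towards 0 for large positive x.
x = np.array([0.0, 2.0, 5.0, 10.0])
grad = 1.0 - np.tanh(x) ** 2
print(grad)  # roughly [1.0, 7.1e-02, 1.8e-04, 8.2e-09]
```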
b) Why might this be a problem for training neural networks? Name and explain this phenomenon.

Vanishing gradient (1p): during backprop the gradient of the non-linearity is close to zero, which makes training/parameter updates much slower. (explanation is another 1p)
c) In modern architectures, another type of non-linearity is commonly used. Draw and name this non-linearity (1p) and explain why it helps solve the problem mentioned in the previous two questions (1p).

ReLU(x) = max(0, x): its gradient is 1 for all positive inputs, so it does not saturate and the gradient does not vanish for large activations.
d) Why do we often refer to L2-regularization as "weight decay"? Derive a mathematical expression that includes the weights W, the learning rate η, and the L2-regularization hyperparameter λ to explain your point.

Weight update with objective function J including the L2 term:
W ← W − η ∇W (J + ½ λ Σi Wi²) = W(1 − ηλ) − η ∇W J,
where η = learning rate and λ = regularization parameter with ηλ ≪ 1.
The value of W is pushed towards zero in each iteration.
Points: qualitative answer: 0.5; mathematical part: L2 loss 0.5, weight update formula 0.5, final result 0.5.
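A minimal numpy sketch of this update (illustrative only; grad_J stands in for whatever gradient the data loss provides):

```python
import numpy as np

eta, lam = 0.1, 0.01              # learning rate and L2 hyperparameter
W = np.random.randn(5, 5)         # weights
grad_J = np.random.randn(5, 5)    # placeholder for the data-loss gradient

# L2-regularized step: the factor (1 - eta*lam) shrinks ("decays") the weights each iteration.
W = W * (1.0 - eta * lam) - eta * grad_J
```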
e) You are solving the binary classification task of classifying images as cars vs. persons. You design a CNN with a single output neuron. Let the output of this neuron be z. The final output of your network, ŷ, is given by:
ŷ = σ(ReLU(z))
You classify all inputs with a final value ŷ ≥ 0.5 as car images. What problem are you going to encounter?

Using ReLU then sigmoid will cause all predictions to be positive (0.5p):
σ(ReLU(z)) ≥ 0.5 ∀z. (0.5p)
Writing "all predictions are 'cars'" is enough.
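An illustrative numeric check (assuming plain numpy):

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
z = np.linspace(-10.0, 10.0, 5)
y_hat = sigmoid(np.maximum(z, 0.0))   # sigmoid(ReLU(z))
print(y_hat)  # every entry is >= 0.5, so every input would be classified as "car"
```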
f) Suppose you initialize your weights w with uniform random distribution U(−α, α). The output s for a given input vector x is given by
si = Σj=0..n wij · xj .

Using Var(X · Y) = E(X)² Var(Y) + E(Y)² Var(X) + Var(X) Var(Y):
Var(si) = Var(Σj=0..n wij · xj) = Σj=0..n Var(wij) Var(xj) = n · Var(w) Var(x) (1p)
Correct result: 2p.
If only Var(w) = 1/n is written then 1p.
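An illustrative numpy check of the n · Var(w) · Var(x) relation, assuming zero-mean inputs with unit variance and Var(w) = 1/n (i.e. α = sqrt(3/n) for U(−α, α)):

```python
import numpy as np

n, trials = 100, 50_000
alpha = np.sqrt(3.0 / n)                      # U(-a, a) has variance a^2 / 3 = 1/n
w = np.random.uniform(-alpha, alpha, size=(trials, n))
x = np.random.randn(trials, n)                # zero mean, unit variance
s = (w * x).sum(axis=1)
print(s.var())                                # close to n * Var(w) * Var(x) = 1.0
```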
g) Consider 2 different models for image classification of the MNIST data set. The models are: (i) a 3-layer perceptron, (ii) LeNet. Which of the two models is more robust to translation of the digits in the images? Give a short explanation why.

LeNet: its convolutional layers share weights across spatial positions (and pooling adds further robustness), so a translated digit produces similar features, whereas the fully-connected perceptron ties each pixel to its own weight.
h) Consider the following one-dimensional data points with classes {0, 1}. Sketch a linear (0.5p) and logistic (0.5p) regression into the figures. Which model is more suitable for this task (1p)?

Plot linear regression (left) and logistic regression (right). Logistic regression is more suitable, since its output stays in [0, 1] and models the class probability.
j) You have 4000 cat and 100 dog images and want to train a neural network on these images to do binary classification. What problems do you foresee with this dataset distribution? Name two possible solutions.

The network prefers cats as they are more likely / imbalance between classes (1pt).
Solutions: leave out images, reweight the dataloader, reweight the loss function, collect more dog images, data augmentation for dogs (0.5pt per solution). No points for: dropout, regularization, batch norm, transfer learning, "get more data".
k) Why is initializing all the weights of a fully connected layer to the same value problematic during training?

If all weights are equal, nodes will learn the same thing during backpropagation, and this limits the capacity. (2p if correct)
If there is no mention of gradients/weight updating, e.g. by only saying "the network will not learn": 1.5p.
l) What is the difference between dropout for convolutional layers compared to dropout for fully connected layers? Explain both behaviours.

Conv: drop feature maps at random; fully connected: drop weights at random (1p each).
Problem 3 Optimization (12 credits)
a) Explain the concept behind RMSProp optimization. How does it help converging faster?

Mitigate the step size in directions with high-variance gradients (1). Can increase the learning rate (1).
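A minimal numpy sketch of an RMSProp step (parameter names are illustrative assumptions):

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=1e-3, beta=0.9, eps=1e-8):
    # Running average of squared gradients per parameter.
    cache = beta * cache + (1.0 - beta) * grad ** 2
    # Divide the step by its square root: high-variance directions get damped,
    # which in turn allows a larger global learning rate.
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache
```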
b)
Adam.
c) Why is it common to use a learning rate decay?

When far away (0.5p), one wants larger update steps to get closer to the solution; the closer you get, the less jitter/overshooting you want. (0.5p)
d) What is a saddle point? What is the advantage/disadvantage of Stochastic Gradient Descent (SGD) in dealing with saddle points?

Saddle point: the gradient is zero (0.5p), but it is neither a local minimum nor a local maximum (0.5p) (or: the gradient is zero and the function has a local maximum in one direction, but a local minimum in another direction).
SGD has noisier updates and can help escape from a saddle point (1p).
e)
Make the gradients less noisy.
f)
Limited GPU memory / faster compute (for each batch), so faster updates.
g) Your network's training curve diverges (assuming data loading is correct). Name one way to address the problem through hyperparameter change.

Decrease the learning rate.
h) What is an epoch?

A full run through the entire training set.
i)
Robbins-Monro condition: Σi=1..∞ αi = ∞ (1p) and Σi=1..∞ αi² < ∞ (1p); for example, αi = 1/i satisfies both.
Problem 4 Convolutional Neural Networks and Advanced Architectures (12 credits)
In the following we assume that the input of our network is a 224 × 224 × 3 color (RGB) image. The task is to perform image classification on 1000 classes. You design a network with the following structure: [CONV - RELU] x 20 - FC - FC. That is, you place 20 consecutive convolutional layers (including non-linear activations), followed by two fully-connected layers. Each layer will have its own number of filters and kernel size.
a) The first 3 convolutional layers each have 5 filters with kernels of size 3 × 3, applied with stride 1 and no padding. How large is the receptive field of a feature after the 3 convolutional operations?

Each additional 3 × 3 convolution (stride 1) grows the receptive field by 2: 3 → 5 → 7, so the receptive field is 7 × 7.
b) What are the dimensions of the feature map after the 3 convolutional operations from (a)?

224 − 2 (first conv layer) − 2 (second conv layer) − 2 (third conv layer) = 218, so the feature map is 218 × 218 × 5 (number of filters). (1p spatial size, 1p kernel size)
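A quick shape check (an illustrative PyTorch sketch with the layer sizes from (a)):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)                  # one RGB input image
layers = nn.Sequential(
    nn.Conv2d(3, 5, kernel_size=3), nn.ReLU(),   # 224 -> 222
    nn.Conv2d(5, 5, kernel_size=3), nn.ReLU(),   # 222 -> 220
    nn.Conv2d(5, 5, kernel_size=3), nn.ReLU(),   # 220 -> 218
)
print(layers(x).shape)                           # torch.Size([1, 5, 218, 218])
```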
c) What are the dimensions of the weight tensor of the first convolutional layer? (1p) What does each dimension represent? (1p)

5 × 3 × 3 × 3: number of filters × input channels × kernel height × kernel width.
d) After the 10th convolutional layer your feature map has size 100x100x224. You realize the next convolutional filter operation will involve too many multiplications that make your network training slow. However, the next layer requires identical spatial size of the feature map. Propose a solution for this problem (1p) and demonstrate your solution with an example (1p).

Use a 1 × 1 convolution to reduce the number of channels while keeping the spatial size. For example, a 1 × 1 convolution with 32 filters maps the 100x100x224 feature map to 100x100x32, after which the next layer needs far fewer multiplications.
e) Your network is now trained for the task of image classification. You now want to use the trained weights of this network for the task of image segmentation, for which you need a pixel-wise output. Which layers of your original network described above can you not reuse for the image segmentation task? (1p) Describe briefly how you would adapt the network for image segmentation given any input image size? (1p)

The FC layers, because they take a fixed input size (1p). Make it fully convolutional (1pt). Comment: mentioning only upscaling: 0.5p.
f) You decide to increase the number of layers substantially and therefore you switch to a ResNet architecture. Draw a ResNet block (1p). Describe all the operations inside the block (1pt). What is the advantage of using such a block in terms of training (1p)?

Drawing of a ResNet block (1pt).
The input x is passed through the convolutional layers to give F(x), and the final summation adds the skipped initial features: F(x) + x. (1p)
One of multiple solutions: skip connections provide highways for gradients and make the network easier to train / resolve the vanishing gradient problem. (1p)
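A minimal PyTorch-style sketch of such a block (an assumed illustration, not the drawing expected in the exam):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions F(x) plus an identity skip connection: out = ReLU(F(x) + x)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))  # F(x)
        return self.relu(f + x)                   # skip connection adds the input back
```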
Problem 5 Backpropagation and Convolutional Layers (12 credits)
Your friend is excited to try out those "Convolutional Layers" you were talking about from your lecture.
However, he seems to have some issues and requests your help for some theoretical computations on a toy
example.
Consider a neural network with a convolutional (without activation) and a max pooling layer. The convolutional
layer has a single filter with kernel size (1, 1), no bias, a stride of 1 and no padding. The filter weights are all
initialized to a value of 1. The max pooling layer has a kernel size of (2, 2) with stride 2, and 1 zero-padding.
You are given the following input image of dimensions (3, 2, 2), written as three 2 × 2 channels:
x = ( [[1, −0.5], [2, −2]], [[−2, 1], [−1.5, 1]], [[1, 0], [0, 0]] )
a) Compute the forward pass of this input and write down your calculations.

Forward pass: the 1 × 1 convolution with all weights 1 sums the three channels:
[[1, −0.5], [2, −2]] + [[−2, 1], [−1.5, 1]] + [[1, 0], [0, 0]] = [[0, 0.5], [0.5, −1]] (1p)
After max pooling (kernel 2 × 2, stride 2, zero-padding 1):
[[0, 0.5], [0.5, 0]] (1p)
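An illustrative numpy re-computation of this forward pass (padding and pooling written out explicitly):

```python
import numpy as np

x = np.array([[[1, -0.5], [2, -2]],
              [[-2, 1], [-1.5, 1]],
              [[1, 0], [0, 0]]])                      # shape (3, 2, 2)

conv = x.sum(axis=0)                                  # 1x1 conv, all weights 1: sum over channels
padded = np.pad(conv, 1)                              # zero-padding of 1 -> shape (4, 4)
pooled = padded.reshape(2, 2, 2, 2).max(axis=(1, 3))  # 2x2 max pooling with stride 2
print(conv)    # [[ 0.   0.5]  [ 0.5 -1. ]]
print(pooled)  # [[ 0.   0.5]  [ 0.5  0. ]]
```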
b) The ground truth is given by
y = [[0, 1], [1, 0]].
Calculate the binary cross-entropy with respect to the natural logarithm by summing over all output pixels of the forward pass computed in (a). You may assume log(0) ≈ −10^9. (Write down the equation and keep the logarithm for the final result.)

BCEloss = − Σi ti log si (0.5p for either this or the line below)
= −(log 0.5 + log 0.5) = −2 log 0.5
c) You don't recall learning the formula for backpropagation through convolutional layers, but those 1 × 1 convolutions seem suspicious. Write down the name of a common layer that is able to produce the same result as the convolutional layer used above.

Fully-connected layer
d) Update the kernel weights accordingly by using gradient descent with a learning rate of 1. (Write down your calculations!)

∂BCE/∂w1 = −∂[ln(2w1 − 1.5w2) + ln(−0.5w1 + w2)]/∂w1 = −2/(2w1 − 1.5w2) − (−0.5)/(−0.5w1 + w2) = −4 + 1 = −3
∂BCE/∂w2 = −∂[ln(2w1 − 1.5w2) + ln(−0.5w1 + w2)]/∂w2 = −(−1.5)/(2w1 − 1.5w2) − 1/(−0.5w1 + w2) = 3 − 2 = 1
∂BCE/∂w3 = 0

Update using gradient descent for w1/w2 (2p):
w1+ = w1 − lr · ∂BCE/∂w1 = 1 − 1 × (−3) = 4
w2+ = w2 − lr · ∂BCE/∂w2 = 1 − 1 × 1 = 0
w3+ = w3 − 0 = 1

1p if the person only wrote at least the gradient descent update rule.
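A small numpy finite-difference check of these gradients (illustrative only; the loss is written directly as a function of the three kernel weights):

```python
import numpy as np

def bce(w):
    w1, w2, w3 = w
    # Only the two target pixels contribute; w3 never appears because x3 = 0 at both positions.
    return -(np.log(2 * w1 - 1.5 * w2) + np.log(-0.5 * w1 + w2))

w = np.array([1.0, 1.0, 1.0])
eps = 1e-6
grad = np.array([(bce(w + eps * np.eye(3)[i]) - bce(w - eps * np.eye(3)[i])) / (2 * eps)
                 for i in range(3)])
print(grad)            # approximately [-3.  1.  0.]
print(w - 1.0 * grad)  # gradient descent step with lr = 1: approximately [4. 0. 1.]
```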
e) After helping your friend debug, you want to showcase the power of convolutional layers. Deduce what kind of 3 × 3 convolutional filter was used to generate the output (right) of the grayscale image (left) and write down its 3 × 3 values.

Vertical edge detector (1p)
1  0  −1
1  0  −1
1  0  −1   (1p)
f) He finally introduces you to his real problem. He wants to find 3 × 3 black crosses in grayscale images, i.e., each pixel has a value between 0 (black) and 1 (white). You notice that you can actually hand-craft such a filter. Write down the numerical values of a 3 × 3 filter that maximally highlights the position of black crosses.

−1   1  −1
 1  −1   1
−1   1  −1   (2p)
Flipping & scaling are OK, even though pixel values were given.
Problem 6 Recurrent Neural Networks and LSTMs (12 credits)
a) Consider a vanilla RNN cell of the form ht = tanh(V · ht−1 + W · xt). The figure below shows the input sequence x1, x2, and x3.
Given the dimensions xt ∈ R^4 and ht ∈ R^12, what is the number of parameters in the RNN cell? Neglect the bias parameter.

V ∈ R^(12×12) and W ∈ R^(12×4), so the number of parameters is 12 · 12 + 12 · 4 = 144 + 48 = 192.
b) If xt is the 0 vector, then ht = ht−1. Discuss whether this statement is correct.

False (1 pt): after the transformation with V and the non-linearity, xt = 0 does not lead to ht = ht−1; instead ht = tanh(V · ht−1) (1 pt). Full points require an explanation; solely the equation is not sufficient.
c) Now consider the following one-dimensional ReLU-RNN cell:
ht = ReLU(V · ht−1 + W · xt), with V = 1, W = 2, h0 = −3, and input sequence x1 = 1, x2 = 2, x3 = 0.

h1 = ReLU(1 · (−3) + 2 · 1) = 0 (1 pt)
h2 = ReLU(1 · 0 + 2 · 2) = 4 (1 pt)
h3 = ReLU(1 · 4 + 2 · 0) = 4 (1 pt)
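A tiny Python sketch of this forward pass (values taken from the solution above):

```python
def relu(v):
    return max(v, 0.0)

V, W = 1.0, 2.0              # recurrent and input weights
h = -3.0                     # h0
for x in [1.0, 2.0, 0.0]:    # x1, x2, x3
    h = relu(V * h + W * x)
    print(h)                 # prints 0.0, 4.0, 4.0
```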
d) Calculate the derivatives ∂h3/∂V, ∂h3/∂W, and ∂h3/∂x1 for the forward pass of the ReLU-RNN cell of (c). Use that ∂ReLU(x)/∂x evaluated at x = 0 equals 1.

ht = ReLU(V · ht−1 + W · xt) = ReLU(zt). Here z1 = −1, z2 = 4, z3 = 4, so ReLU′(z1) = 0 and ReLU′(z2) = ReLU′(z3) = 1, where ReLU′(z) denotes ∂ReLU(x)/∂x evaluated at x = z.

∂h3/∂V = ReLU′(z3) · h2 + ReLU′(z2) · V · h1 + ReLU′(z1) · V² · h0 = 1 · 4 + 1 · 1 · 0 + 0 · 1 · (−3) = 4 (1 pt)
∂h3/∂W = ReLU′(z3) · x3 + ReLU′(z2) · V · x2 + ReLU′(z1) · V² · x1 = 1 · 0 + 1 · 2 + 0 · 0 = 2 (1 pt)
∂h3/∂x1 = ReLU′(z3) · V · ReLU′(z2) · V · ReLU′(z1) · W = 1 · 1 · 1 · 1 · 0 · 2 = 0 (1 pt)

Only a correct and calculated result gives the point.
e) A Long Short-Term Memory (LSTM) unit is defined as
g1 = σ(W1 · xt + U1 · ht−1),
g2 = σ(W2 · xt + U2 · ht−1),
g3 = σ(W3 · xt + U3 · ht−1),
c̃t = tanh(Wc · xt + Uc · ht−1),
ct = g2 ◦ ct−1 + g3 ◦ c̃t,
ht = g1 ◦ ct.

g1 = output gate, g2 = forget gate, g3 = update gate (1 pt)
ct: cell state (1 pt)
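A minimal numpy sketch of one step of this cell, following the equations above (randomly initialized matrices, purely illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d_x, d_h = 4, 12
rng = np.random.default_rng(0)
W1, W2, W3, Wc = (rng.standard_normal((d_h, d_x)) for _ in range(4))
U1, U2, U3, Uc = (rng.standard_normal((d_h, d_h)) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    g1 = sigmoid(W1 @ x_t + U1 @ h_prev)   # output gate
    g2 = sigmoid(W2 @ x_t + U2 @ h_prev)   # forget gate
    g3 = sigmoid(W3 @ x_t + U3 @ h_prev)   # update gate
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev)
    c_t = g2 * c_prev + g3 * c_tilde       # "◦" is element-wise multiplication
    h_t = g1 * c_t
    return h_t, c_t

h, c = lstm_step(np.ones(d_x), np.zeros(d_h), np.zeros(d_h))
```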
Additional space for solutions: clearly mark the (sub)problem your answers are related to and strike out invalid solutions.