Solution: Introduction To Deep Learning
Department of Informatics
Technical University of Munich
Note:
• During the attendance check a sticker containing a unique code will be put on this exam.
• This code contains a unique number that associates this exam with your registration number.
• This number is printed both next to the code and to the signature field in the attendance check list.
Exam: IN2346 / Endterm    Date: Wednesday 19th February, 2020
Examiner: Prof. Dr. Matthias Nießner    Time: 13:30 – 15:00
Working instructions
• This exam consists of 20 pages with a total of 7 problems. Please make sure now that you received a complete copy of the exam.
• If you need additional space for a question, use the additional pages in the back and properly note that you are using additional space in the question's solution box.
Problem 1 Multiple Choice (18 credits)
• For each question, you'll receive 2 points if all boxes are answered correctly (i.e. correct answers are checked, wrong answers are not checked) and 0 otherwise.
• If you change your mind, please completely fill the box (interpreted as not checked).
• If you change your mind again, please put an additional cross beside the box.
a) You train a neural network and the loss diverges. What are reasonable things to do?
x1 x2 x1 AND x2
1.0 1.0 1.0
1.0 0.0 0.0
c) You want to train an autoencoder to overfit a single image with a fixed learning rate. Setting numerical precision aside, which loss function is able to reach zero loss after training with gradient descent?
× L2
L1
d) Regularization:
× Weight decay (L2) is commonly applied in neural networks to spread the decision power among as many neurons as possible.
Is a technique that aims to reduce your validation error and increase your training accuracy.
e) Which statements are correct for a Rectified Linear Unit (ReLU) applied in a CNN?
Despite a small learning rate, without saturation the gradients are very likely to explode.
× A large negative bias in the previous layer can cause the ReLU to always output zero.
× Large and consistent gradients allow for a fast network convergence.
Max pooling must always be applied after the ReLU.
f) You want to use a convolutional layer to decrease your RGB image size from 255x255 to 127x127. What parameter triplets achieve this?
Kernel size 3, stride 2, padding 1
× Kernel size 2, stride 2, padding 0
Kernel size 5, stride 2, padding 2
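As a quick check, the standard output-size formula out = floor((W − K + 2P)/S) + 1 can be evaluated for the three triplets; the short Python sketch below assumes the 255x255 input from the question.

# Conv output size: out = floor((W - K + 2P) / S) + 1  (255x255 input assumed)
def conv_out(w, kernel, stride, padding):
    return (w - kernel + 2 * padding) // stride + 1

for k, s, p in [(3, 2, 1), (2, 2, 0), (5, 2, 2)]:
    print(f"kernel {k}, stride {s}, padding {p} -> {conv_out(255, k, s, p)}")
# Only kernel 2, stride 2, padding 0 yields 127; the other two triplets yield 128.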
g) Your train and val loss converge to about the same value. What would you do to increase the performance
of your model?
h) An autoencoder
has no loss function.
i) What is the correct order of operations for an optimization with gradient descent?
(b) Calculate the difference between the predicted and target value.
(c) Iteratively repeat the procedure until convergence.
bcdea
ebadc
eadbc
× edbac
Problem 2 Short Questions (22 credits)
a) Kaiming initialization corresponds to Xavier initialization with the variance multiplied by two. In which case (1p) and why (1p) would you choose this initialization?
b) You are given a convolutional layer with kernel size 3, number of filters 3, stride 1 and padding 1. Compute the shape of the weights (0.5p) and write them down explicitly such that this convolutional layer represents the identity for an RGB image input (1.5p).
Shape of W = (3, 3, 3, 3) (0.5p)

W[0,0,1,1] = W[1,1,1,1] = W[2,2,1,1] = 1, else 0 (1.5p), i.e. filter k contains a single 1 at the centre of input channel k:

Filter 0, channel 0:   Filter 1, channel 1:   Filter 2, channel 2:
0 0 0                  0 0 0                  0 0 0
0 1 0                  0 1 0                  0 1 0
0 0 0                  0 0 0                  0 0 0

(all other 3x3 channel slices are zero)
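A minimal PyTorch sketch of this identity convolution (the random test image is purely illustrative):

import torch
import torch.nn.functional as F

# Identity convolution for an RGB image: 3 filters, kernel 3, stride 1, padding 1.
# PyTorch stores the weight as (out_channels, in_channels, kH, kW) = (3, 3, 3, 3).
W = torch.zeros(3, 3, 3, 3)
W[0, 0, 1, 1] = W[1, 1, 1, 1] = W[2, 2, 1, 1] = 1.0  # single 1 at the centre of channel k

x = torch.rand(1, 3, 32, 32)                 # arbitrary RGB test image
y = F.conv2d(x, W, bias=None, stride=1, padding=1)
print(torch.allclose(x, y))                  # True: the layer is the identity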
c) Name two reasons to use an inception layer in favor of a standard convolutional layer.

Avoids having to choose a single kernel size / dimensionality reduction leads to lower computational cost.
d) What are the definitions of bias and variance in the context of machine learning?

Bias: underfitting; error caused by the model being too simple.
Variance: overfitting; high variance means the model pays too much attention to the training data and doesn't generalize to unseen data (val / test).
e) In Generative Adversarial Networks the generator and discriminator play a two-player min-max game. Why is this hard to optimize, and what heuristic method is used instead of the default min-max formulation?
f) Consider the quote below. Demonstrate how a fully connected layer with an input size of 512 neurons and an output of 10 neurons can be modeled as a convolutional layer.

512 →(unsqueeze)→ 512x1x1 →(apply 1x1 conv, weight shape (512, 10, 1, 1))→ 10x1x1
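A small PyTorch sketch of this equivalence (layer names and the random input are illustrative; note that PyTorch stores the conv weight as (out, in, kH, kW), i.e. (10, 512, 1, 1)):

import torch
import torch.nn as nn

fc = nn.Linear(512, 10)
conv = nn.Conv2d(512, 10, kernel_size=1)

# Copy the fully-connected parameters into the 1x1 convolution.
conv.weight.data = fc.weight.data.view(10, 512, 1, 1).clone()
conv.bias.data = fc.bias.data.clone()

x = torch.randn(1, 512)                      # batch of one feature vector
out_fc = fc(x)                               # shape (1, 10)
out_conv = conv(x.view(1, 512, 1, 1))        # "unsqueeze" to 512x1x1 -> shape (1, 10, 1, 1)
print(torch.allclose(out_fc, out_conv.view(1, 10), atol=1e-6))  # True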
g) Explain the Markov Assumption in reinforcement learning.

P[s_{t+1} | s_1, ..., s_t] = P[s_{t+1} | s_t], or: the future is independent of the past given the present.
h) Where do we use neural networks in Q-learning (1p) and why are they needed (1p)?

As a function approximator for the Q-values: Q*(s, a) ≈ Q(s, a; θ), where θ are the network parameters.
For anything beyond very small problems, calculating all state-action pairs exactly is not tractable, but we can approximate them.
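Purely as an illustration (a sketch with made-up state and action dimensions), such a Q-value approximator could look like this:

import torch.nn as nn

class QNetwork(nn.Module):
    # Approximates Q(s, a; theta): input = state, output = one Q-value per action.
    def __init__(self, state_dim=4, num_actions=2, hidden=64):  # dimensions are illustrative
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)  # shape (batch, num_actions)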
i) The following image shows a rectangular image of size 16 × 16. Design a 5 × 5 convolutional filter that is maximally activated when sliding over the '3' in the image below (black pixels are 1s and white are −1s; for simplicity use only −1s and 1s in your designed filter).

W =
−1  1  1  1 −1
−1 −1 −1  1 −1
−1  1  1  1 −1
−1 −1 −1  1 −1
−1  1  1  1 −1
Advantage: can model sequences of variable length.
k) Long Short-Term Memory units suffer less from the vanishing gradient problem than vanilla RNNs. What two design changes make this possible?
Problem 3 House Prices and Backpropagation (11 credits)
You are tasked to predict house prices for your first job, i.e. for a given input vector of numbers you have to
predict a single floating point number that indicates the house price (e.g., between 0 and 1m euros).
a) What network loss function would you suggest for this problem (0.5p) and why (0.5p)?

Loss: MSE, since this is a floating-point regression problem.
b) How would you approximate the task as a classification problem (0.5p) and which loss function would you propose in this situation (0.5p)?

Discretize the price range into bins and predict the bin; Cross-Entropy loss.
After collecting some data you start off with a simple architecture.

Your architecture is composed of two fully-connected layers (fc1 and fc2) which both contain
• a linear layer with weights and biases as outlined on the next page,
Weights and biases of the linear layers:
d) You are experiencing difficulties during training and thus decide to check the network in test mode manually. In your test case the input x values are

x = (x1, x2) = (1, 0).

What is the output of your network for that input? Write down all computations for each step.
h1: linear = 1 · 1 + 0 · (−1) + 1 = 2 (0.5p) → Dropout → 1 → LReLU → 1 (0.5p)
h2: linear = 1 · (−1) + 0 · 1 − 1 = −2 (0.5p) → Dropout → −1 → LReLU → −0.5 (0.5p)
o1: linear = 1 · 0.5 + (−0.5) · (−1) + 1 = 2 (0.5p) → Dropout → 1 → LReLU → 1 (0.5p)
o2: linear = 1 · 1 + (−0.5) · 1 + 0 = 0.5 (0.5p) → Dropout → 0.25 → LReLU → 0.25 (0.5p)

⇒ o = (1, 0.25)

(0.5p for each linear step and 0.5p for each dropout + leaky ReLU step)
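The same forward pass can be reproduced with a few lines of numpy (a sketch that assumes test-time dropout scales activations by 0.5, a LeakyReLU slope of 0.5, and the weight/bias values read off from the arithmetic above):

import numpy as np

def lrelu(v, slope=0.5):
    return np.where(v > 0, v, slope * v)

x = np.array([1.0, 0.0])

# fc1: linear -> dropout (test mode, scale by 0.5) -> LeakyReLU
W1, b1 = np.array([[1.0, -1.0], [-1.0, 1.0]]), np.array([1.0, -1.0])
h = lrelu(0.5 * (W1 @ x + b1))   # [1.0, -0.5]

# fc2: linear -> dropout (test mode, scale by 0.5) -> LeakyReLU
W2, b2 = np.array([[0.5, -1.0], [1.0, 1.0]]), np.array([1.0, 0.0])
o = lrelu(0.5 * (W2 @ h + b2))   # [1.0, 0.25]
print(h, o)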
e) As you were unsure of your loss function, you phrase the task as a binary classification problem for each of your two outputs independently. Calculate the binary cross-entropy with respect to the natural logarithm for the labels

y = (y1, y2) = (1, 0).

You may assume 0 · ln(0) = 0. (Write down the equations and keep the simplified logarithm.)
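As a sketch only (not necessarily the intended sample solution), the numbers from d) can be plugged into the BCE formula with the 0 · ln(0) = 0 convention as follows:

import numpy as np

def bce(y, o):
    # Binary cross-entropy -[y ln(o) + (1 - y) ln(1 - o)], with the convention 0 * ln(0) = 0.
    t1 = 0.0 if y == 0 else y * np.log(o)
    t2 = 0.0 if y == 1 else (1 - y) * np.log(1 - o)
    return -(t1 + t2)

o = [1.0, 0.25]   # network outputs from d)
y = [1.0, 0.0]    # labels from e)
losses = [bce(yi, oi) for yi, oi in zip(y, o)]
print(losses, sum(losses))   # [0.0, -ln(0.75)], i.e. approximately [0.0, 0.2877]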
f) Please update the weight v21 using gradient descent with a learning rate of 1.0 with respect to the binary cross-entropy loss as well as the labels from e) and the forward computation from d). (Please write down all your computations. Writing down formulas and values in general form is encouraged.)
Layers: o_lin, o_drop, o_lrelu, BCE.

∂BCE/∂v21 = ∂(−y1 ln(o_lrelu))/∂o_lrelu · ∂o_lrelu/∂o_drop · ∂o_drop/∂o_lin · ∂o_lin/∂v21
          = −1/o_lrelu · 1 · 0.5 · ∂(h2_lrelu · v21 + b_o1)/∂v21
          = −1 · 0.5 · h2_lrelu = −0.5 · (−0.5) = 0.25

v21^+ = v21 − lr · ∂BCE/∂v21 = −1.0 − 1.0 · 0.25 = −1.25

(ln derivative 0.5p, final derivative result 0.5p, update 1p)
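The gradient and the update can be checked with autograd (a sketch using the values from d) and, as in the derivation above, only the −y1 ln(o1) term of the BCE; the bias value 1.0 is read off from the forward computation):

import torch
import torch.nn.functional as F

h2 = torch.tensor(-0.5)                        # LeakyReLU output of h2 from d)
v21 = torch.tensor(-1.0, requires_grad=True)   # weight from h2 to o1
o_lin = 1.0 * 0.5 + h2 * v21 + 1.0             # = 2.0, as in d)
o_drop = 0.5 * o_lin                           # test-time dropout scaling
o_lrelu = F.leaky_relu(o_drop, negative_slope=0.5)

loss = -1.0 * torch.log(o_lrelu)               # -y1 * ln(o1) with y1 = 1
loss.backward()
print(v21.grad)                                # 0.25
print(v21.item() - 1.0 * v21.grad.item())      # updated weight: -1.25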
Problem 4 Optimization (11 credits)
a) Why can't we expect to find a global minimum while training neural networks?

Neural networks are non-convex:
• many (different) local minima
• no (practical) way to say which is globally optimal
The idea of overparameterization in neural networks suggests that many local minima provide equivalent performance. (Note: similar performance does not mean similarity in parameter space.)
c) What is a saddle point (1p)? What is the advantage/disadvantage of Stochastic Gradient Descent (SGD) in dealing with saddle points (1p)?

Saddle point: the gradient is zero (0.5p), but it is neither a local minimum nor a local maximum.
SGD has noisier updates, which can help it escape a saddle point. Disadvantage: no momentum.
Avoid getting stuck in local minima, or accelerate optimization. Uses exponentially weighted averages of past gradients.
e) Which optimizer introduced in the lecture uses second but not first order momentum?

RMSProp
f) Explain the beta (β1 and β2) hyperparameters of Adam with respect to the gradients.

These two are exponential decay rates for the moment estimates: β1 decays the running average of the gradient (first moment) and appears in its bias correction, while β2 decays the running average of the squared gradient (second moment) and appears in its bias correction.
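A bare-bones sketch of a single Adam step, showing where β1 and β2 enter (hyperparameter values are the usual defaults, chosen here for illustration):

import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad         # running average of the gradient (1st moment)
    v = beta2 * v + (1 - beta2) * grad ** 2    # running average of the squared gradient (2nd moment)
    m_hat = m / (1 - beta1 ** t)               # bias correction involving beta1
    v_hat = v / (1 - beta2 ** t)               # bias correction involving beta2
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
theta, m, v = adam_step(theta, grad=np.array([0.5]), m=m, v=v, t=1)
print(theta)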
g) What would be an advantage of a second-order optimization method such as Newton's method (1p)? Why is it not commonly used in the context of neural networks?
h) Name the key idea of Newton's method (1p) and write down the update step formula (1p).

Approximate the loss by its second-order Taylor expansion.
θ* = θ0 − H⁻¹ ∇θ L(θ)
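As a toy illustration of this update, the sketch below applies one Newton step to a simple quadratic loss, which it minimizes exactly in a single step (the quadratic itself is made up for the example):

import numpy as np

# Toy quadratic loss L(theta) = 0.5 * theta^T A theta - b^T theta, minimised at A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

theta0 = np.zeros(2)
grad = A @ theta0 - b                            # gradient of L at theta0
H = A                                            # Hessian of L (constant for a quadratic)
theta_star = theta0 - np.linalg.solve(H, grad)   # Newton step: theta0 - H^{-1} grad
print(theta_star, np.linalg.solve(A, b))         # identical: one step reaches the minimum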
Problem 5 Network Architectures and Training (10 credits)
You are training a neural network with 10 convolutional layers and the activation function shown below:
a) What is the purpose of activation functions in neural networks?

Introduce non-linearity; otherwise the network is only a linear map / prevents layer collapse (1p).
b) You find that your network still is not training properly and discover that your network weights have all been default initialized to 1. Why might this cause issues for training?

With all weights equal to 1, every neuron computes the same function and receives the same gradient, which reduces the effective network capacity.
c) How can you resolve the problems with weight initialization and provide stable training in most scenarios for the proposed network? Explain your solution.

Xavier / Kaiming initialization (or their equation) (1p); aims to maintain the same variance of the output as of the input (1p).
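A short sketch of what this could look like in PyTorch (layer sizes are illustrative):

import torch.nn as nn

layer = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Kaiming (He) initialization: variance 2 / fan_in, suited to ReLU-like activations.
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
nn.init.zeros_(layer.bias)

# Xavier (Glorot) initialization would use variance 2 / (fan_in + fan_out) instead:
# nn.init.xavier_normal_(layer.weight)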
d) After employing your solutions, you are ready to train your network for image segmentation. After 50 epochs, you come to the conclusion that the network is too large for such a task. What is this effect called? How do you make this observation? Make a plot of the corresponding training (regular line) and validation losses (dashed line) and name them appropriately.
Overfitting (1p); Plot (1p): the training loss keeps decreasing while the validation loss flattens out and then rises.
e) Without changing the convolutional layers of your network, name two approaches to counteract the problems encountered in (d).

Weight regularization / weight decay, data augmentation, dropout, more data, early stopping (1p each)
f) You adapt your network training accordingly, and are now performing a grid search to find the optimal hyperparameters for vanilla stochastic gradient descent (SGD). You try three learning rates τi with i ∈ {1, 2, 3}, and obtain the following three curves for the training accuracy. Order the learning rates from larger to smaller.
(Figure: training accuracy over epochs for three curves labeled τ1, τ2, τ3.)

τ1 > τ3 > τ2 (2p for all correct, no half points given)
Problem 6 Batchnormalization (5 credits)

A friend suggested that you use Batchnormalization in your network. Recall that the batch normalization layer takes values x = (x^(1), ..., x^(m)) as input and computes x_norm = (x_norm^(1), ..., x_norm^(m)) according to

x_norm^(k) = (x^(k) − μ) / √(σ²),   where   μ = (1/m) Σ_{i=1}^m x^(i),   σ² = (1/m) Σ_{i=1}^m (x^(i) − μ)².

It then applies a second transformation to get y = (y^(1), ..., y^(m)) using learned parameters γ^(k) and β^(k):

y^(k) = γ^(k) · x_norm^(k) + β^(k).
Replace (x^(k) − μ)/√(σ²) with (x^(k) − μ)/√(σ² + ε). This prevents division by zero if the variance is zero.
Other answers: add a small constant / noise to σ² / the denominator.
b) How can the network undo the normalization operation of Batchnorm? Write down the exact parameters.

γ^(k) = √(Var[x^(k)]),   β^(k) = E[x^(k)]
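A numpy sketch tying together the pieces discussed in this problem: the normalization, the ε in the denominator, and the γ/β that can undo it (a minimal 1-D version for illustration):

import numpy as np

def batchnorm(x, gamma, beta, eps=1e-5):
    mu = x.mean()
    var = x.var()                            # biased variance, (1/m) * sum (x - mu)^2
    x_norm = (x - mu) / np.sqrt(var + eps)   # eps prevents division by zero
    return gamma * x_norm + beta

x = np.array([1.0, 2.0, 3.0, 4.0])
# Choosing gamma = sqrt(Var[x]) and beta = E[x] (approximately) undoes the normalization:
y = batchnorm(x, gamma=np.sqrt(x.var()), beta=x.mean())
print(np.allclose(x, y, atol=1e-3))          # True up to the eps term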
c) Name the main difference of Batchnormalization between training and testing and note down any parameters that need to be stored.

Test: use the stored (running) mean and variance and don't compute any batch statistics.
Problem 7 Convolutional Neural Networks (12 credits)

You are contemplating design choices for a convolutional neural network for the classification of digits. LeCun et al. suggest the following network architecture:

For clarification: the shape after having applied the operation 'conv1' (the first convolutional layer in the network) is 6x28x28.
All operations are done with stride 1 and no padding. For the convolution layers, assume a kernel size of 5x5.
a) Explain the term 'receptive field' (1p). What is the receptive field of one pixel after the operation 'maxpool1' (1p)? What is the receptive field of a neuron in layer 'f1' (1p)?

The receptive field is the size of the region in the input space that affects a given pixel in the output space.
maxpool1: 6x6. One pixel after maxpool1 is affected by 4 pixels (2x2) in conv1; with a 5x5 kernel and stride 1, a 2x2 output comes from a 6x6 region of the input. (0.5p if only the receptive field of maxpool1 w.r.t. conv1 (= 2x2) is specified)
f1: the whole image (32x32)
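These numbers can also be obtained layer by layer with the recursion r_out = r_in + (k − 1) · j, where j is the product of all previous strides; the sketch below assumes 5x5 convolutions with stride 1 and 2x2 max pooling with stride 2, as implied by the shapes above:

# Receptive field per layer: r_out = r_in + (k - 1) * j, j = product of previous strides.
layers = [("conv1", 5, 1), ("maxpool1", 2, 2), ("conv2", 5, 1), ("maxpool2", 2, 2)]

r, j = 1, 1
for name, k, s in layers:
    r = r + (k - 1) * j
    j = j * s
    print(f"{name}: receptive field {r}x{r}")
# conv1: 5x5, maxpool1: 6x6, conv2: 14x14, maxpool2: 16x16.
# f1 is fully connected, so each of its neurons sees the entire 32x32 input.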
b) Instead of digits, you now want to be able to classify handwritten alphabetic characters (26 characters). What is the minimal change in network architecture needed in order to support this?

Change the output layer to have 26 output neurons instead of 10.
c) Instead of taking 32 × 32 images, you now want to train the network to classify images of size 68 × 68. List two possible architecture changes to support this.

• Resize layer to downsample images to 32x32 / downsample images to 32x32 in preprocessing
• conv 5x5 (68 → 64) + maxpool 2x2 (64 → 32) before the current architecture (0.5p if conv+maxpool is specified without parameters)
• Change the input dimension of layer f1 to 16x14x14 (= 3136) (0.5p if only changing the input dimension is suggested without the new dimension)
• Fully convolutional layers + global average pooling (0.5p if only fully conv layers are suggested without global average pooling)
d) Your architecture works and you manage to classify characters fairly well. After reading many online blogs, you decide to try out a much deeper network to boost the network's capacity. Name 2 problems that you might encounter when training very deep networks.

• Vanishing gradients
• Memory issues
• Overfitting
• Increased training time
e) You read that skip connections are beneficial for training deep networks. In the following image you can see a segment of a very deep architecture that uses skip connections. How are skip connections helpful (1p)? Demonstrate this mathematically by computing the gradient of the output z with respect to 'w0' for the network below, in comparison to the case without a skip connection (3p). For simplicity, you can assume that the gradient of ReLU, d(ReLU(p))/dp = 1.
Help prevent vanishing gradients / provide a highway for the gradients in the backward pass (1p).

Let
z′ = G(y) + y,   G(y) = ReLU(w2 y) w3,   z = ReLU(z′)
y′ = F(x) + x,   F(x) = w1 ReLU(w0 x),   y = ReLU(y′)

dz/dw0 = (dz/dz′) (dz′/dy) (dy/dy′) (dy′/dw0)
       = (d ReLU(z′)/dz′) (dG(y)/dy + 1) (d ReLU(y′)/dy′) (dF(x)/dw0)
       = (d ReLU(z′)/dz′) (w3 w2 d ReLU(w2 y)/d(w2 y) + 1) (d ReLU(y′)/dy′) (w1 d ReLU(w0 x)/d(w0 x) x)
       = (w3 w2 + 1) w1 x

(2 points for the full expansion, 1p if dG(y)/dy and dF(x)/dw0 are not expanded)

Comparing this to the derivative without the skip connections, which is

dz/dw0 = (w3 w2) w1 x   (1 point / 0.5p if not expanded)

the extra '+1' term in the skip-connection derivative helps the gradient flow propagate, preventing vanishing gradients.
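The comparison can be verified numerically with autograd (a sketch with arbitrary positive weights and input, so that every ReLU stays in its linear region and d(ReLU(p))/dp = 1 holds):

import torch

w0, w1, w2, w3 = (torch.tensor(v, requires_grad=True) for v in (0.5, 1.5, 2.0, 0.75))
x = torch.tensor(1.0)

y = torch.relu(w1 * torch.relu(w0 * x) + x)   # first residual block:  y = ReLU(F(x) + x)
z = torch.relu(torch.relu(w2 * y) * w3 + y)   # second residual block: z = ReLU(G(y) + y)
z.backward()
print(w0.grad)                                # (w3*w2 + 1) * w1 * x = 2.5 * 1.5 = 3.75

# Without the skip connections (y = ReLU(F(x)), z = ReLU(G(y))) the same gradient
# would be w3 * w2 * w1 * x = 2.25: the '+1' path is what keeps gradients flowing.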
Additional space for solutions – clearly mark the (sub)problem your answers are related to and strike out invalid solutions.