Solution: Introduction To Deep Learning

Chair of Chair of Visual Computing & Artificial Intelligence
Department of Informatics
Technical University of Munich
Note:
• During the attendance check a sticker containing a unique code will be put on this exam.
Esolution • This code contains a unique number that associates this exam with your registration number.
Place student sticker here • This number is printed both next to the code and to the signature field in the attendance
check list.
Introduction to Deep Learning
n
Exam: IN2346 / Endterm Date: Wednesday 19th February, 2020
tio
Examiner: Prof. Dr. Matthias Nießner Time: 13:30 – 15:00
P1 P2 P3 P4
lu P5 P6 P7
So
I
e
pl
Working instructions
• This exam consists of 20 pages with a total of 7 problems.
Please make sure now that you received a complete copy of the exam.
m
• The total amount of achievable credits in this exam is 89 credits.

• Detaching pages from the exam is prohibited.
Sa
• Allowed resources: none
• Do not write with red or green colors nor use pencils.

• Physically turn off all electronic devices, put them into your bag and close the bag.
• If you need additional space for a question, use the additional pages in the back and properly note that
you are using additional space in the question’s solution box.
Left room from to / Early submission at
– Page 1 / 20 –
Problem 1 Multiple Choice (18 credits)
Multiple Choice Questions:

• For all multiple choice questions any number of answers, i.e. either zero (!) or one or multiple answers
can be correct.
• For each question, you’ll receive 2 points if all boxes are answered correctly (i.e. correct answers are
checked, wrong answers are not checked) and 0 otherwise.
How to Check a Box:
• Please cross the respective box: (interpreted as checked)
n
• If you change your mind, please fill the box: (interpreted as not checked)
• If you change your mind again, please put an additional cross besides the box.
tio
a) You train a neural network and the loss diverges. What are reasonable things to do?
× Try a different optimizer.

Add dropout.
× Decrease the learning rate.

Increase the number of parameters.
lu
So
b) Use a neuron to regress the AND function. The activation function of the neuron is f (x) = 0 if x < 0 else
1. What is the weight and bias of the neuron?
x1 x2 x1 AND x2
1.0 1.0 1.0
1.0 0.0 0.0
e
0.0 0.0 0.0

0.0 1.0 0.0
pl
Bias: 1.5, w1 : 2.0, w2 : 2.0
Bias: -1.0, w1 : 1.5, w2 : 1.5

m
× Bias: -1.5, w : 1.0, w : 1.0

1 2
× Bias: -2.0, w : 1.4, w : 1.2

1 2
Sa
c) You want to train an autoencoder to overfit a single image with a fixed learning rate. Setting numerical
precision aside, which loss function is able to reach zero loss after training with gradient descent.
× L2
L1
d) Regularization:
× Is a technique that aims to reduce the generalization gap.

Dropout, the use of ReLU activation functions, and early stopping can all be considered regularization
techniques.
× Weight 2
decay (L ) is commonly applied in neural networks to spread the decision power among as
many neurons as possible.
Is a technique that aims to reduce your validation error and increases your training accuracy.
– Page 2 / 20 –
e) Which statements are correct for a Rectified Linear Unit (ReLU) applied in a CNN?
Despite a small learning rate, without saturation the gradients are very likely to explode.
× A large negative bias in the previous layer can cause the ReLU to always output zero.
× Large and consistent gradients allow for a fast network convergence.
Max pooling must always be applied after the ReLU.
f) You want to use a convolutional layer to decrease your RGB image size from 2545x254 to 127x127. What
parameter triplets achieve this?
Kernel size 3, stride 2, padding 1
× Kernel size 6, stride 2, padding 2
n
× Kernel size 2, stride 2, padding 0
tio
Kernel size 5, stride 2, padding 2
g) Your train and val loss converge to about the same value. What would you do to increase the performance
of your model?
× Add more input features.

× Increase the capacity of your model.
Add more training data.
lu
So
Add regularization.
h) An autoencoder
has no loss function.
× has a bottle neck layer.

e
is often used for fine-tuning pre-trained models.

pl
requires class labels for training.
i) What is the correct order of operations for an optimization with gradient descent?
m
(a) Update the network weights to minimize the loss.
(b) Calculate the difference between the predicted and target value.
(c) Iteratively repeat the procedure until convergence.
Sa
(d) Compute a forward pass.
(e) Initialize the neural network weights.
bcdea
ebadc
eadbc
× edbac
– Page 3 / 20 –
Problem 2 Short Questions (22 credits)
0 a) Kaiming initialization corresponds to Xavier initialization with the variance multiplied by two. In which case
(1p) and why (1p) would you chose this initialization?
1
2 ReLU activation. (1p) → Half of the input. (1p)
0 b) You are given a convolutional layer with kernel size 3, number of filters 3, stride 1 and padding 1. Compute
the shape of the weights (0.5p) and write them down explicitly such that this convolutional layer represents
1 the identity for an RGB image input (1.5p).
2
Shape of W=(3,3,3,3) (0.5 p)
n
W[0,0,1,1] = W[1,1,1,1] = W[2,2,1,1] = 1, else 0 (1.5p) i.e. W =
   
0 0 0
tio
  0 1 0  0 0 
 

 0 0 0  



 0 0 0 

 0  0 1 0  0 
 

 0 0 0  


0 0 0
lu
 
 
 0 0  0 1 0  
0 0 0
So
0 c) Name two reasons to use an inception layer in favor of a standard convolutional layer.
1
Avoid choosing kernel size / dimensionality reduction leads to lower computational cost
2
e
pl
0 d) What are the definitions of bias and variance in the context of machine learning?
m
1
Bias: underfitting, error caused by model being too simple
2 Variance: overfitting, high variance in the model means its pays more attention to training data and
doesnt generalize on unseen data (val / test)
Sa
– Page 4 / 20 –
e) In Generative Adversarial Networks the generator and discriminator play a two player min-max game. 0
Why is this hard to optimize and what heuristical method is used instead of the default min-max formulation.
1
G can not learn when D rejects all generator samples 2

G maximizes the log-probability of D being mistaken
f) Consider the quote below. Demonstrate how a fully connected layer with an input size of 512 neurons and 0
an output of 10 neurons can be modeled as a convolutional layer.
n
1
tio
lu
unsqueeze apply 1x1 conv, weight shape (512, 10, 1, 1)
512 −−−−−−→ 512x1x1 −−−−−−−−−−−−−−−−−−−−−−−−→ 10x1x1
So
g) Explain the Markov Assumption in reinforcement learning. 0
e
1
P[st+1 |s1, ..., st ] = P[st+1 |st ] or
The future is independent of the past given the present 2
pl
m
h) Where do we use neural networks in Q-learning (1p) and why are they needed (1p)? 0
1
Function approximator for Q values: Q ∗ (s, a) = Q(s, a; θ) where θ are the network parameters.
For even semi big problems, calculating all state action pairs is not tractable, but we can approximate 2
Sa
them using neural networks.
– Page 5 / 20 –
0 i) The following image shows a rectangular image of size 16 × 16. Design a 5 × 5 convolutional filter that is
maximally activated when sliding over the ’3’ in the image below (Black pixels are 1s and white are -1. For
1 simplicity use only -1s and 1s in your designed filter).
2
n
tio
 
−1 1 1 1 −1
−1 −1 −1 1 −1
 
−1
W = 1 1 1 −1

−1 −1 −1 1 −1
lu
−1 1 1 1 −1
1’s could be shifted to the left or right

Grading mostly binary (0 or 2 points) with exceptions being ·(−1) or a single wrongly placed operator.
So
0 j) Name one advantage and one disadvantage of Recurrent Neural Networks in general.
1
Advantage: can model series of variable lengths
e
2 Disadvantage: vanishing/exploding gradients

pl
0 k) Long-Short Term Memory Units suffer less from the vanishing gradient problem than vanilla RNNs. What
two design changes make this possible?
m
2 Gate system, esp. the forget gate (1 pt) .

Highway for gradients through the cell state. (1 pt)
Sa
– Page 6 / 20 –
Problem 3 House Prices and Backpropagation (11 credits)
You are tasked to predict house prices for your first job, i.e. for a given input vector of numbers you have to
predict a single floating point number that indicates the house price (e.g., between 0 and 1m euros).
a) What network loss function would you suggest for this problem (0.5p) and why (0.5p)? 0
1
Loss: MSE, floating point regression
b) How would you approximate the task as a classification problem (0.5) and which loss function would you 0
propose in this situation (0.5)?
1
Bucket prices into price ranges
n
Cross Entropy
tio
After collecting some data you start off with a simple architecture
lu
So
e
pl
Your architecture is composed of two fully-connected layers (fc1 and fc2) which both contain
• a linear layer with weights and biases as outlined on the next page,
m
• followed by a Dropout with a probability parameter of 0.5,

• as well as a Leaky ReLu with a parameter of 0.5 as your non-linearity of choice at the end.
Sa
c) How many trainable parameters does your network have? 0
1
12
– Page 7 / 20 –
Weights and biases of the linear layers:
Layer fc1 fc2

Nodes h1 h2 o1 o2
Variable w11 w21 bh1 w12 w22 bh2 v11 v21 bo1 v12 v22 bo2
Value 1.0 -1.0 1.0 -1.0 1.0 -1.0 0.5 -1.0 1.0 1.0 1.0 0.0
0 d) You are experiencing difficulties during training and thus decide to check the network in test mode manually.
In your test case the input x values are
1 x1 1
x= = .
x2 0
2
What is the output of your network for that input? Write down all computations for each step.
3
n
4 h1 linear= 1 · 1 + 0 · −1 + 1 = 2 (0.5) −Dropout > 1 −LReLu > 1 (0.5),
h2 linear= 1 · −1 + 0 · 1 − 1 = −2 (0.5) −Dropout > −1 −LReLu > −0.5 (0.5),
tio
o1 linear= 1 · 0.5 + −0.5 · −1 + 1 = 2 (0.5) −Dropout > 1 −LReLu > 1 (0.5),
o2 linear= 1 · 1+ −0.5 · 1 + 0 = 0.5 (0.5) −Drop > 0.25 −LReLu > 0.25 (0.5),
1
=> o =
0.25
0.5 for linear and 0.5 for dropout + leaky relu
lu
So
0 e) As you were unsure of your loss function, you phrase the task as a binary classification problem for each
e
of your two outputs independently. Calculate the binary cross-entropy with respect to the natural logarithm
1 for the labels
y1 1
y= = .
pl
y2 0
You may assume 0 · ln(0) = 0. (Write down the equations and keep the simplified logarithm.)
m
BCE(x, y) = −(y1 · log(o1 ) + (1 − y1 ) · log(1 − o1 )) − (y2 · log(o2 ) + (1 − y2 ) · log(1 − o2 ))

= − ln(1) − ln(1 − 0.25) = − ln(0.75)) (1p)
-0.5 for minus
BCE for both components: −(y · log(o) + (1 − y) · log(1 − o))
Sa
– Page 8 / 20 –
f) Please update the weight v21 using gradient descent with a learning rate of 1.0 with respect to the binary 0
cross-entropy loss as well as labels from e) and the forward computation from d). (Please write down all your
computations. Writing down formulas and values in general form is encouraged) 1
2
Layers: olin , odrop , olrelu , BCE .
3
∂ BCE ∂ (−y1 ln(olrelu )) ∂ olrelu ∂ odrop ∂ olin
= · · ·
∂ v21 ∂ olrelu ∂ odrop ∂ olin ∂ v21
1 ∂ (hafterlrelu2 · v21 + bo2 )
=− · 1 · 0.5 ·
olrelu ∂ v21
= −1 · 0.5 · h2afterlrelu = −0.5 · −0.5 = 0.25
+ ∂ BCE
n
v21 = v21 − lr · = −1.0 − 1.0 · 0.25 = −1.25
∂ v21
Split in layers: 1p,
tio
ln derivative 0.5p,
final derivative result 0.5p,
update 1p
lu
So
e
pl
m
Sa
– Page 9 / 20 –
Problem 4 Optimization (11 credits)
0 a) Why can’t we expect to find a global minima while training neural networks?
1
Neural networks are non-convex
• many (different) local minima
• no (practical) way to say which is globally optimal
0 b) Why is finding a local minimum often enough?
1
The idea of the overparameterization in neural networks suggests that many local minimia provide
n
equivalent performance (Note: Similar performance does not mean similarity in parameter space)
tio
0 c) What is a saddle point (1p)? What is the advantage/disadvantage of Stochastic Gradient Descent (SGD)
in dealing with saddle points (1p)?
1
lu
2 Saddle point - The gradient is zero (0.5p), but it is neither a local minima nor a local maxima.
SGD has noisier updates and can help escape from a saddle point. Disadvantage: no momentum
So
e
0 d) Explain the concept behind momentum in SGD.

pl
1
Avoid getting stuck in local minima. or accelerate optimization. Uses exponentially weighted averages
of past gradients.
m
0 e) Which optimizer introduced in the lecture uses second but not first order momentum?
Sa
1
RMSProp
0 f) Explain the beta (β1 and β2 ) hyperparameters of Adam with respect to the gradients.
1
These two are exponential decay rates for the moment estimates. β1 is related to bias correction for
2 first moment estimate (decays the running average of the gradient), while β2 is used in bias correction
of second moment estimate (decays running average of square of gradient).
– Page 10 / 20 –
g) What would be an advantage of a second order optimization method such as the Newton method (1p) ? 0
Why is it not commonly used in the context of neural networks?
1
Faster / second order convergence in fewer iterations (1p).

Estimating the Hessian in a stochastic setting is difficult - stochasticity doesnt work well with 2nd order
methods. Note: Second part of the question isn’t graded - the question is worth 1p instead of 2p
originally
h) Name the key idea of Newton’s method (1p) and write down the update step formula (1p). 0
n
1
Approximate by second order taylor expansion
θ∗ = θ0 − H −1 ∇θ L (θ) 2
tio
lu
So
e
pl
m
Sa
– Page 11 / 20 –
Problem 5 Network Architectures and Training (10 credits)
You are training a neural network with 10 convolutional layers and the activation function shown below:
n
tio
lu
0 a) What is the purpose of activation functions in neural networks?
Introduce non-linearity or otherwise network is only linear / prevents layer collapse (1pt).
So
0 b) You find that your network still is not training properly and discover that your network weights have all
been default initialized to 1. Why might this cause issues for training?
e
1
pl
Weights all 1 means everything computes the same function, gradients are all the same -> network
capacity
m
Sa
0 c) How can you resolve the problems with weight initialization and provide stable training in most scenarios
for the proposed network? Explain your solution.
1
2
Xavier / Kaiming initialization (or their equation) (1p); aims to maintain same variance of output as
input (1p).
0
d) After employing your solutions, you are ready to train your network for image segmentation. After 50
1 epochs, you come to the conclusion that the network is too large for such a task. What is this effect called?
How do you make this observation? Make a plot of the corresponding training (regular line) and validation
2
losses (dashed line) and name them appropriately.
– Page 12 / 20 –
n
tio
Overfitting (1p); Plot (1p).
lu
So
e
pl
m
Sa
e) Without changing the convolutional layers of your network, name two approaches to counteract the 0
problems encountered in (f).
1
2
weight regularization / weight decay, data augmentation, dropout, more data, early stopping (1p each)
f) You adapt your network training accordingly, and now are performing a grid search to find the optimal 1
hyperparameters for vanilla stochastic gradient descent (SGD). You try three learning rates τi with i ∈ {1, 2, 3},
2
and obtain the following three curves for the training accuracy. Order the learning rates from larger to smaller.
– Page 13 / 20 –
Training
Accuracy
⌧3
<latexit sha1_base64="AbDA9cHawrbbGG0CDyT+hhtqSPc=">AAACz3ichVFNS8NAEJ3Gr7Z+VT16CRbBU0lVUG8FrXgRqhhbaEvZpNsami+STbEWxatHr/rL9Ld48O2aClrEDZuZffPm7cyOFbpOLAzjLaPNzM7NL2Rz+cWl5ZXVwtr6dRwkkc1NO3CDqGGxmLuOz03hCJc3wogzz3J53Rocy3h9yKPYCfwrMQp522N93+k5NhOA6i3Bks5evlMoGiVDLX3aKadOkdJVCwrv1KIuBWRTQh5x8knAd4lRjK9JZTIoBNamMbAInqPinO4pj9wELA4GAzrAv49TM0V9nKVmrLJt3OJiR8jUaRv7VClaYMtbOfwY9gP7TmH9P28YK2VZ4QjWgmJOKZ4DF3QDxn+ZXsqc1PJ/puxKUI8OVTcO6gsVIvu0v3VOEImADVREp6pi9qFhqfMQL+DDmqhAvvJEQVcdd2GZslyp+Kkig14EK18f9WDM5d9DnXbM3dJRybjYL1aq6byztElbtIOhHlCFzqiGMmSDz/RCr9qldqs9aI9fVC2T5mzQj6U9fQKtSpDl</latexit>
⌧2 <latexit sha1_base64="TpMpisIp41FSmBGIalyAl7cevIs=">AAACz3ichVFNS8NAEJ3Gr7Z+VT16CRbBU0mLoN4KWvEiVDG20JaySbc1NF8km2ItilePXvWX6W/x4Ns1FbSIGzYz++bN25kdK3SdWBjGW0abm19YXMrm8ssrq2vrhY3N6zhIIpubduAGUdNiMXcdn5vCES5vhhFnnuXyhjU8lvHGiEexE/hXYhzyjscGvtN3bCYANdqCJd1KvlsoGiVDLX3WKadOkdJVDwrv1KYeBWRTQh5x8knAd4lRjK9FZTIoBNahCbAInqPinO4pj9wELA4GAzrEf4BTK0V9nKVmrLJt3OJiR8jUaRf7VClaYMtbOfwY9gP7TmGDP2+YKGVZ4RjWgmJOKZ4DF3QDxn+ZXsqc1vJ/puxKUJ8OVTcO6gsVIvu0v3VOEImADVVEp5piDqBhqfMIL+DDmqhAvvJUQVcd92CZslyp+Kkig14EK18f9WDM5d9DnXXMSumoZFzsF6u1dN5Z2qYd2sNQD6hKZ1RHGbLBZ3qhV+1Su9UetMcvqpZJc7box9KePgGq5JDk</latexit>
⌧1 <latexit sha1_base64="r9wDRpTXGE0AaQPIbCXIYe13mGg=">AAACznichVFNS8NAEH3Gr7Z+VT16KRbBU0lFUG8FrXgRWrBaaEvZpNsamy+SbaWKePXoVf+Z/hYPvqxRUBE3bGb2zZu3MztW6DqxMs2XKWN6ZnZuPpPNLSwuLa/kV9fO42AU2bJhB24QNS0RS9fxZUM5ypXNMJLCs1x5YQ0Pk/jFWEaxE/hnahLKjicGvtN3bKEInbeVGHXL3XzRLJl6FX475dQpIl21IP+KNnoIYGMEDxI+FH0XAjG/FsowERLr4JZYRM/RcYk75Jg7IkuSIYgO+R/w1EpRn+dEM9bZNm9xuSNmFrDFfawVLbKTWyX9mPaN+0Zjgz9vuNXKSYUTWouKWa14Slzhkoz/Mr2U+VnL/5lJVwp97OtuHNYXaiTp0/7SOWIkIjbUkQKqmjmghqXPY76AT9tgBckrfyoUdMc9WqGt1Cp+qiioF9Emr896OObyz6H+dho7pYOSWd8tVqrpvDPYwCa2OdQ9VHCCGsuwcYVHPOHZqBvXxp1x/0E1ptKcdXxbxsM7aIeQzw==</latexit>
Epochs
n
τ1 > τ3 > τ2 . (2p for all correct, no half points given)
tio
lu
So
e
pl
m
Sa
– Page 14 / 20 –
Problem 6 Batchnormalization (5 credits)
A friend suggested you to use Batchnormalization in your network. Recall the batch normalization layer
(1) (m)
takes values x = (x(1), ..., x(m)) as input and computes x = (xnorm , ..., xnorm ) according to:
m
(k ) x (k ) − µ 1 X (i) 1 (k )
xnorm = √ where µ = x , σ2 = (x − µ)2 .
σ2 m m
k =1
It then applies a second transformation to get y = (y (1) , ..., y (m) ) using learned parameters γ (k ) and β (k ) :
y k = γ (k ) (x)knorm + β (k ) .
a) How would you make the formulation above numericly stable? 0
n
(i) (i)
1
Replace x√−µ x −µ
with √ .
σ2 σ 2 +
Prevents division by zero if variance is zero
tio
Other answers: add a small constant / noise to σ 2 / denominator
b) How can the network undo the normalization operation of Batchnorm? Write down the exact parameters. 0
lu
p 1
γ= var(x k ), β k = E(x k )
2
So
c) Name the main difference of Batchnormalization during training or testing and note down eventual 0
parameters that need to be stored.
e
Train: compute statistics from minibatch. To store: mean and variance 2

pl
Test: use stored mean and variance and don’t compute any batch statistics
m
Sa
– Page 15 / 20 –
Problem 7 Convolutional Neural Networks (12 credits)
You are contemplating design choices for a convolutional neural network for the classification of digits. LeCun
et. al suggest the following network architecture:
n
For clarification: the shape after having applied the operation ‘conv1’ (the first convolutional layer in the
network) is 6x28x28.
All operations are done with stride 1 and no padding. For the convolution layers, assume a kernel size of 5x5.
tio
0 a) Explain the term ‘receptive field’ (1p). What is the receptive field of one pixel after the operation
‘maxpool1’(1p)? What is the receptive field of a neuron in layer ‘f1’ (1p)?
1
2 Receptive field is the size of the region in the input space that a pixel in the output space is affected by.
lu
maxpool1: 6x6. One pixel after maxpool1 is affected by 4 pixels (2x2) in conv1. with 5x5 kernel and
3 stride 1, a 2x2 output comes from a 6x6 grid.
(0.5p if only maxpool1 wrt to conv1(= 2x2) specified)
f1: whole image (32x32)
So
b) Instead of digits, you now want to be able to classify handwritten alphabetic characters (26 characters).
e
0
What is the minimal change in network architecture needed in order to support this?
1
pl
Change no. of output neurons from 10 to 26

(0.5p if only "change number of output neurons" specified)
m
0 c) Instead of taking 32 × 32 images, you now want to train the network to classify images of size 68 × 68.
List two possible architecture changes to support this?
Sa
2
• Resize layer to downsample images to 32x32 / Downsample images to 32x32 (preprocess)
• conv 5x5 (68 → 64) + maxpool 2x2 (64 → 32) before the current architecture (0.5p if only
specified conv+maxpool without parameters)
• Change input dimension of layer f1 to 16x14x14 (= 3136) (0.5p if only suggested changing input
dimension layer without new dim)
• fully convolutional layers + global average pooling (0.5p if only fully conv layer suggested without
global average pooling)
– Page 16 / 20 –
d) Your architecture works and you manage to classify characters fairly well. After reading many online blogs, 0
you decide to try out a much deeper network to boost the network’s capacity. Name 2 problems that you
might encounter when training very deep networks. 1
• Vanishing gradients
• Memory issues
• Overfitting
n
• Increased training time
tio
lu
e) You read that skip connections are beneficial for training deep networks. In the following image you can
see a segment of a very deep architecture that uses skip connections. How are skip connections helpful?
0
So
(1p). Demonstrate this mathematically by computing gradient of output z with respect to ‘w0 ’ for the network 1
below in comparison to the case without a skip connection (3p). For simplicity, you can assume that gradient
2
of ReLU, d(ReLU(p))
dp
= 1.
3
4
e
pl
m
Sa
– Page 17 / 20 –
Help prevent vanishing gradients / Provides highway for the gradients in backward pass (1p)
Let,
0
z = G(y) + y
G(y) = ReLU(w2 y)w3
0
z = ReLU(z )
0
y = F(x) + x
F(x) = w1 ReLU(w0 x)
0
y = ReLU(y )
0 0
dz dz dz dy dy
= 0 0
dw0 dz dy dy dw0
n
0 0
dz d(ReLU(z )) dG(y) d(ReLU(y )) dF(x)
= 0 ( + 1) 0
dw0 dz dy dy dw0
tio
0 0
dz d(ReLU(z )) d(ReLU(w2 y)) d(ReLU(y )) d(ReLU(w0 x))
= 0 (w3 w2 + 1) 0 w1 x )
dw0 dz d(w2 y) dy d(w0 x)
Putting ReLU derivatives to 1
lu
dz dG(y) dF(x)
= (w3 w2 + 1)w1 x (2 points for full expansion, 1pt if and are not expanded)
dw0 dy dw0
So
Comparing this to derivative without skip connection, which is
dz
= (w3 w2 )w1 x (1 point / 0.5p if not expanded)
dw0
The extra ‘+1’ term in the skip connection derivative help propagation of gradient flow, preventing
e
vanishing gradients
pl
m
Sa
– Page 18 / 20 –
Additional space for solutions–clearly mark the (sub)problem your answers are related to and strike
out invalid solutions.
n
tio
lu
So
e
pl
m
Sa
– Page 19 / 20 –
n
tio
lu
So
e
pl
m
Sa
– Page 20 / 20 –

Solution: Introduction To Deep Learning

Uploaded by

Copyright:

Available Formats

Solution: Introduction To Deep Learning

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Solution: Introduction To Deep Learning

Uploaded by

Copyright:

Available Formats

Chair of Chair of Visual Computing & Artificial Intelligence

Introduction to Deep Learning

• The total amount of achievable credits in this exam is 89 credits.

• Allowed resources: none

• Do not write with red or green colors nor use pencils.

Left room from to / Early submission at

Multiple Choice Questions:

How to Check a Box:

• Please cross the respective box: (interpreted as checked)

× Try a different optimizer.

× Decrease the learning rate.

0.0 0.0 0.0

Bias: 1.5, w1 : 2.0, w2 : 2.0

Bias: -1.0, w1 : 1.5, w2 : 1.5

× Bias: -1.5, w : 1.0, w : 1.0

× Bias: -2.0, w : 1.4, w : 1.2

× Is a technique that aims to reduce the generalization gap.

× Kernel size 6, stride 2, padding 2

× Add more input features.

× has a bottle neck layer.

is often used for fine-tuning pre-trained models.

requires class labels for training.

(a) Update the network weights to minimize the loss.

(d) Compute a forward pass.

(e) Initialize the neural network weights.

2 ReLU activation. (1p) → Half of the input. (1p)

G can not learn when D rejects all generator samples 2

them using neural networks.

1’s could be shifted to the left or right

2 Disadvantage: vanishing/exploding gradients

2 Gate system, esp. the forget gate (1 pt) .

Bucket prices into price ranges

• followed by a Dropout with a probability parameter of 0.5,

c) How many trainable parameters does your network have? 0

Layer fc1 fc2

BCE(x, y) = −(y1 · log(o1 ) + (1 − y1 ) · log(1 − o1 )) − (y2 · log(o2 ) + (1 − y2 ) · log(1 − o2 ))

Split in layers: 1p,

0 b) Why is finding a local minimum often enough?

0 d) Explain the concept behind momentum in SGD.

Faster / second order convergence in fewer iterations (1p).

a) How would you make the formulation above numericly stable? 0

Train: compute statistics from minibatch. To store: mean and variance 2

Change no. of output neurons from 10 to 26

Putting ReLU derivatives to 1

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.