02 CNN Slides
14.1.2025
1
Image classification task
• Example classification problem: Classify images of handwritten digits from the MNIST dataset.
• Inputs x(n): images of 28 × 28 pixels with scalar greyscale values.
• Targets y(n): one of the 10 classes in one-hot encoding.
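A minimal sketch of how such data can be loaded in PyTorch (assuming torchvision is available; the root path ./data and the use of F.one_hot are illustrative choices, not part of the slides):

```python
# Minimal sketch: load one MNIST image and a one-hot coded target.
import torch
import torch.nn.functional as F
from torchvision import datasets, transforms

mnist = datasets.MNIST(root="./data", train=True, download=True,
                       transform=transforms.ToTensor())
x, y = mnist[0]                                         # x: 1 x 28 x 28 greyscale image in [0, 1]
y_onehot = F.one_hot(torch.tensor(y), num_classes=10)   # one-hot coded target
print(x.shape, y_onehot)
```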
2
Spatial structure matters
• If we change the order of the pixels (in the same way for all images), the classification task
becomes much harder for humans.
• This suggests that our model can and should benefit from using the spatial information.
3
Image classification with a multilayer perceptron
4
Problem 1: MLP ignores the spatial structure
5
Problem 1: MLP ignores the spatial structure
6
Problem 2: Number of parameters
• Let us use an MLP with the following structure to solve the MNIST classification task (input x: 784 pixels, hidden layers of 144 units, 10 outputs):
  h_1 = relu(W_1 x + b_1)
  h_2 = relu(W_2 h_1 + b_2)
  f = softmax(W_3 h_2 + b_3)
• Let us count the number of parameters in the network (ignoring the bias terms b).
• If we want to process images that contain millions of pixels, the number of parameters would be several orders of magnitude larger.
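A minimal PyTorch sketch of this MLP for counting the weights; the assumption that both hidden layers have 144 units is read off the figure and may differ from the original network:

```python
# Sketch of the MLP from the slide; assuming both hidden layers have 144 units.
import torch.nn as nn

mlp = nn.Sequential(
    nn.Flatten(),                     # 28 x 28 image -> 784-dimensional vector
    nn.Linear(784, 144), nn.ReLU(),   # h_1 = relu(W_1 x + b_1)
    nn.Linear(144, 144), nn.ReLU(),   # h_2 = relu(W_2 h_1 + b_2)
    nn.Linear(144, 10),               # f = softmax(W_3 h_2 + b_3); the softmax is
)                                     # folded into the cross-entropy loss in practice

n_weights = sum(p.numel() for p in mlp.parameters() if p.dim() > 1)  # ignore biases
print(n_weights)   # 784*144 + 144*144 + 144*10 = 135072
```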
7
Motivation for a layer of a new type
• We want to design an alternative to the fully-connected layer that would address these problems:
• Take into account the order of the inputs
• Change the outputs in a predictable way for simple transformations such as translation
• Reduce the number of parameters in the network
8
Convolutional layer
Fully-connected layer as a starting point
• Let us consider an input with one-dimensional structure. For example, we want to process time
series and the order of the inputs is determined by the time of the measurements.
• Let us start with a fully-connected layer that has 5 inputs and 5 outputs:
[Figure: a fully-connected layer connecting inputs x1 … x5 to 5 outputs.]
10
Local connectivity
11
Parameter sharing
• We can further reduce the number of parameters by using weight sharing (arrows with the same
color red/black/blue represent shared weights).
• Now the layer has only 3 parameters.
• Why parameter sharing is useful: patterns that appear in different parts of the input sequence will
activate neurons in a similar way in the corresponding location of the output layer.
⇒ Position/translation/shift equivariance in the input-output mapping.
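A small check of this claim with PyTorch (the use of Conv1d and of padding to keep 5 outputs for 5 inputs are illustrative assumptions):

```python
# A 1D convolution with kernel size 3 and no bias has exactly 3 shared weights,
# no matter how long the input sequence is.
import torch
import torch.nn as nn

layer = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=3,
                  padding=1, bias=False)           # padding=1 keeps 5 outputs for 5 inputs
x = torch.randn(1, 1, 5)                           # (batch, channels, x1 ... x5)
y = layer(x)
print(sum(p.numel() for p in layer.parameters()))  # 3 parameters
print(y.shape)                                     # torch.Size([1, 1, 5])
```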
12
1D convolutional layer
• The layer is called a (one-dimensional) convolutional layer because the computations are closely
related to (one-dimensional) discrete convolution familiar from signal processing:
(w ∗ x)[t] = ∑_a w[a] x[t − a]
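For illustration, the same formula can be evaluated with NumPy (np.convolve uses this signal-processing definition, i.e. it flips the kernel; the numbers below are arbitrary):

```python
# Discrete convolution (w * x)[t] = sum_a w[a] x[t - a], evaluated with NumPy.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, 0.3, 0.2])
y = np.convolve(x, w)        # "full" convolution; note that the kernel is flipped
print(y)
```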
13
1D convolutional layer
• Inputs and outputs of such a layer usually contain multiple elements (usually called channels):
y_{i,o} = ∑_{∆i} ∑_c w_{∆i,o,c} x_{i+∆i,c} + b_o
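This formula corresponds to PyTorch's Conv1d, which computes a cross-correlation (no kernel flip); the channel counts and sequence length below are arbitrary:

```python
# Multi-channel 1D convolutional layer: 4 input channels -> 8 output channels.
import torch
import torch.nn as nn

layer = nn.Conv1d(in_channels=4, out_channels=8, kernel_size=3)
x = torch.randn(16, 4, 100)    # (batch, channels c, positions i)
y = layer(x)                   # y[n, o, i] = sum over ∆i, c of w[o, c, ∆i] * x[n, c, i+∆i] + b[o]
print(y.shape)                 # torch.Size([16, 8, 98])  (no padding)
```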
14
Inputs with 2D structure
15
2D convolutional layer: Forward computations
16
2D convolutional layer as feature detector
• We can view the filter that we used in this example as a simple feature detector.
• Note that the filter has the shape of a corner, and the output is maximal at the position where this corner is present in the input image.
• The local image structure or image feature “correlates” with the values of the convolution
mask/template/window/kernel.
Input (5 × 5)            Kernel (2 × 2)       Output (4 × 4)
0 0 0 0 0                                     0 1 1 1
0 1 1 1 0      dot       1 1           =      1 3 2 2
0 1 0 1 0                1 0                  1 2 2 2
0 1 1 1 0                                     1 2 2 1
0 0 0 0 0
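The numbers in this example can be reproduced with F.conv2d (a sketch; PyTorch's convolution is a cross-correlation, which matches the sliding dot product shown here):

```python
# Reproducing the corner-detector example with F.conv2d.
import torch
import torch.nn.functional as F

image = torch.tensor([[0., 0., 0., 0., 0.],
                      [0., 1., 1., 1., 0.],
                      [0., 1., 0., 1., 0.],
                      [0., 1., 1., 1., 0.],
                      [0., 0., 0., 0., 0.]]).reshape(1, 1, 5, 5)
kernel = torch.tensor([[1., 1.],
                       [1., 0.]]).reshape(1, 1, 2, 2)   # corner-shaped filter
out = F.conv2d(image, kernel)   # 4 x 4 output, maximum 3 where the corner matches
print(out.squeeze())
```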
17
2D convolutional layer with multiple channels
• The 1D position i and offset ∆i are replaced by their 2D counterparts i, j and ∆i, ∆j:
  y⁰_{i,j,o} = ∑_{∆i} ∑_{∆j} ∑_c w_{∆i,∆j,o,c} x_{i+∆i, j+∆j, c} + b_o
• Just like in multilayer perceptrons, the output of a convolutional layer is usually run through a nonlinear activation function, such as ReLU:
  y_{i,j,o} = relu(y⁰_{i,j,o})
18
Convolution ∗
19
2D convolutional layer in PyTorch
• Padding: rows and columns of zeros are added around the borders of the input.
• Convolution visualization
• The size of the output will in general be different (H_i: input height, k: kernel size, p: padding, s: stride, d: dilation):
  H_o = ⌊(H_i + 2p − k − (k − 1)(d − 1)) / s⌋ + 1
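A quick check of this formula against nn.Conv2d (the particular values of H_i, k, p, s, d below are arbitrary):

```python
# Checking the output-size formula H_o = floor((H_i + 2p - k - (k-1)(d-1)) / s) + 1.
import torch
import torch.nn as nn

H_i, k, p, s, d = 28, 5, 2, 2, 1
conv = nn.Conv2d(in_channels=1, out_channels=9, kernel_size=k,
                 padding=p, stride=s, dilation=d)
H_o = (H_i + 2 * p - k - (k - 1) * (d - 1)) // s + 1
print(H_o)                                      # 14
print(conv(torch.randn(1, 1, H_i, H_i)).shape)  # torch.Size([1, 9, 14, 14])
```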
20
Why do we need padding?
• With padding, the output of a convolutional layer can have the same height and width as the input.
• It is easier to design networks when the height and width are preserved.
• To use skip connections x + conv(x), as in ResNet, we need the dimensions to match (see the sketch below).
• With padding, we can use deeper networks. Without padding, the size would shrink quickly as new layers are added.
• Padding improves performance by preserving information at the borders.
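A minimal sketch of the skip-connection point: with padding = k // 2 for an odd kernel size, the spatial size is preserved and x + conv(x) is well defined (channel counts below are arbitrary; note that the number of channels must also match):

```python
# With padding = k // 2 (for odd k), height and width are preserved,
# so a ResNet-style skip connection x + conv(x) is well defined.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=16, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(8, 16, 28, 28)
y = x + conv(x)                 # shapes match: (8, 16, 28, 28)
print(y.shape)
```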
21
Convolutional layer is equivariant to translation
• Shifting the input image by one pixel to the right changes the output in the same way: it is
shifted by one pixel to the right.
Original input (5 × 5), kernel (2 × 2) and output (4 × 4):
0 0 0 0 0                       0 1 1 1
0 1 1 1 0      dot   1 1        1 3 2 2
0 1 0 1 0            1 0        1 2 2 2
0 1 1 1 0                       1 2 2 1
0 0 0 0 0

Input shifted one pixel to the right, same kernel, output (also shifted one pixel to the right):
0 0 0 0 0                       0 0 1 1
0 0 1 1 1      dot   1 1        0 1 3 2
0 0 1 0 1            1 0        0 1 2 2
0 0 1 1 1                       0 1 2 2
0 0 0 0 0
• Equivariance: f(T(x)) = T(f(x)). Applying the transformation T to the input transforms the output of f in the same way.
• Invariance: f(T(x)) = f(x). The result of f does not change when you apply the transformation to the input.
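A numeric sketch of the equivariance property using the example above; because the image has a zero border, shifting zeros in from the left makes the equality exact here (in general it holds up to border effects):

```python
# Translation equivariance: convolving a shifted image gives a shifted output.
import torch
import torch.nn.functional as F

image = torch.tensor([[0., 0., 0., 0., 0.],
                      [0., 1., 1., 1., 0.],
                      [0., 1., 0., 1., 0.],
                      [0., 1., 1., 1., 0.],
                      [0., 0., 0., 0., 0.]]).reshape(1, 1, 5, 5)
kernel = torch.tensor([[1., 1.], [1., 0.]]).reshape(1, 1, 2, 2)

shifted = torch.roll(image, shifts=1, dims=-1)   # shift one pixel to the right
shifted[..., 0] = 0                              # zeros enter from the left border

out_then_shift = torch.roll(F.conv2d(image, kernel), shifts=1, dims=-1)
out_then_shift[..., 0] = 0
shift_then_out = F.conv2d(shifted, kernel)
print(torch.equal(out_then_shift, shift_then_out))   # True
```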
22
Convolutional networks
Example: MNIST classification
• Let us build a convolutional neural network (a network with convolutional layers) to solve the
MNIST classification task.
• The input is 28 × 28 pixels and 1 channel.
• First convolutional layer: 9 filters with a 5 × 5 kernel and padding.
• First hidden layer: 28 × 28 pixels and 9 channels.
• The number of parameters in the first layer (ignoring biases): 5 × 5 × 9 = 225
• Compare with the fully-connected layer: 28 × 28 × 225 = 176400
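A sketch of this first layer in PyTorch; padding=2 is an assumption chosen so that the 28 × 28 size is preserved, as stated above:

```python
# First convolutional layer of the network: 1 input channel, 9 filters of size 5 x 5.
import torch
import torch.nn as nn

conv1 = nn.Conv2d(in_channels=1, out_channels=9, kernel_size=5, padding=2)
x = torch.randn(1, 1, 28, 28)
print(conv1(x).shape)                                              # torch.Size([1, 9, 28, 28])
print(sum(p.numel() for p in conv1.parameters() if p.dim() > 1))   # 5*5*9 = 225 weights
```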
24
Example: MNIST classification
• The first hidden layer contains 28 × 28 × 9 = 7056 units.
25
Pooling layer
26
Example: MNIST classification
27
Stack more layers
28
Full network
• Finally, we flatten the outputs of the last convolutional layer and feed
them to a fully-connected layer with 10 outputs.
• We apply the softmax nonlinearity to the outputs and use the
cross-entropy loss.
• The network can be trained by any gradient-based optimization
procedure, for example, Adam.
• The gradients are computed by backpropagation as in the multilayer
perceptron. The biggest difference is that we need to take into
account parameter sharing inside the convolutional layers.
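A minimal sketch of such a network and a single training step with Adam; the layer sizes are illustrative and not necessarily the ones used on the slides (CrossEntropyLoss applies the softmax internally):

```python
# Minimal sketch of a convolutional classifier for MNIST and one training step.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 9, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(2),                      # 28 x 28 -> 14 x 14
    nn.Conv2d(9, 16, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(2),                      # 14 x 14 -> 7 x 7
    nn.Flatten(),                         # 16 * 7 * 7 = 784 features
    nn.Linear(16 * 7 * 7, 10),            # 10 class scores
)

loss_fn = nn.CrossEntropyLoss()           # softmax + cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.randn(32, 1, 28, 28)       # a dummy mini-batch
labels = torch.randint(0, 10, (32,))
optimizer.zero_grad()
loss = loss_fn(model(images), labels)
loss.backward()                           # backpropagation (handles parameter sharing)
optimizer.step()
```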
29
Backpropagation through a convolutional layer
∂L/∂x_{i,j,c} = ∑_{∆i} ∑_{∆j} ∑_o w_{∆i,∆j,o,c} · ∂L/∂y_{i−∆i, j−∆j, o}
• The computation of ∂L/∂x is called transposed convolution.
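A sketch verifying this identity numerically: the gradient that autograd computes for the input of F.conv2d equals F.conv_transpose2d applied to the upstream gradient with the same weights (stride 1, no padding):

```python
# The gradient of a convolution with respect to its input is a transposed convolution.
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8, requires_grad=True)
w = torch.randn(5, 3, 3, 3)                  # 5 output channels, 3 input channels, 3x3 kernel

y = F.conv2d(x, w)                           # forward pass (stride 1, no padding)
g = torch.randn_like(y)                      # some upstream gradient dL/dy
y.backward(g)                                # autograd gradient dL/dx

grad_manual = F.conv_transpose2d(g, w)       # same gradient via transposed convolution
print(torch.allclose(x.grad, grad_manual, atol=1e-5))   # True
```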
30
Modern convolutional neural networks
Historical note: First convolutional networks
• (Waibel et al., 1989): Time-delay neural networks, which were similar to convolutional networks but applied to audio (in a moving window).
• (LeCun et al., 1998): LeNet-5, a classical convolutional neural network architecture.
32
ImageNet progress
[Chart: ImageNet classification error (log scale, 32 down to 2) over 2011–2017, with AlexNet, VGG, ResNet, and human-level performance marked.]
33
AlexNet (Krizhevsky et al., 2012)
Image source: oreilly.com
34
ImageNet progress
35
VGG-19 (Simonyan & Zisserman, 2015)
36
VGG-19 (Simonyan & Zisserman, 2015)
• Compared to AlexNet:
• Smaller (3 × 3) filters
• Deeper network (more layers)
37
ImageNet progress
38
ResNet (He et al., 2016)
39
ResNet (He et al., 2016)
• ResNet:
  • Instead of learning f(x), layers learn x + h(x) (see the sketch below).
  • He et al. (2016): if an identity mapping is optimal, it might be easier to push the residual h(x) to zero than to learn an identity mapping with f(x).
• Compared to VGG:
  • Skip connections
  • More layers
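A simplified sketch of a basic residual block in PyTorch; the real ResNet blocks also handle changes in the number of channels and spatial size with a projection on the skip path:

```python
# Sketch of a basic residual block: the layers learn the residual h(x),
# and the skip connection adds the input back: output = relu(x + h(x)).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.h = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.h(x))   # skip connection

block = ResidualBlock(64)
print(block(torch.randn(2, 64, 28, 28)).shape)   # torch.Size([2, 64, 28, 28])
```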
40
Why residual connections help training
• Balduzzi et al. (2017): experiment with a randomly initialized MLP f : R → R, where each hidden layer contains 200 neurons with ReLU activations.
• Gradients ∂f/∂x as a function of the input x:
• Gradients are shattered for deep networks without skip connections: small changes of the input have a significant effect on the gradient, which makes optimization more difficult.
41
Batch normalization in convolutional networks
42
Applications of convolutional networks
Advantages of convolutional networks
44
Temporal convolutions
• The conditional distribution p(x_t | x_1, …, x_{t−1}) is modeled with a 1D convolutional network:
45
WaveNet: Dilated convolutions
• Dilated convolutions allow fast growth of the receptive field, which is good for modeling long-term dependencies (see the sketch below).
• WaveNet (van den Oord et al., 2016) by Google, based on dilated convolutions, used to be the state-of-the-art model for speech generation.
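A sketch of the receptive-field argument with stacked dilated causal 1D convolutions (this is only in the spirit of WaveNet, not the actual architecture; kernel size 3 and dilations 1, 2, 4, 8 are arbitrary choices):

```python
# Stacked 1D convolutions with dilations 1, 2, 4, 8: the receptive field grows
# exponentially with depth (1 + 2*(1 + 2 + 4 + 8) = 31 time steps for kernel size 3).
import torch
import torch.nn as nn

layers = []
for d in [1, 2, 4, 8]:
    # left-only padding makes the convolution causal: the output at time t
    # depends only on inputs at times <= t
    layers += [nn.ConstantPad1d((2 * d, 0), 0.0),
               nn.Conv1d(1, 1, kernel_size=3, dilation=d),
               nn.ReLU()]
net = nn.Sequential(*layers)

x = torch.randn(1, 1, 100)
print(net(x).shape)        # torch.Size([1, 1, 100]) -- length preserved, causal
```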
46
Semantic segmentation
• Segmentation: generating pixel-wise segmentations that give the class of the object visible at each pixel, or "background" otherwise.
47
Semantic segmentation with U-Net (Ronneberger et al., 2015)
48
Convolutional model for neural machine translation (Gehring et al., 2017)
49
Convolutional networks in reinforcement learning
50
Protein folding (DeepMind blog)
• Proteins are large, complex molecules essential to all of life. What any given protein can do
depends on its unique 3D structure.
• Proteins are composed of chains of amino acids. The information about the sequence of amino acids is contained in DNA.
• Protein folding problem: Predicting how these chains will fold into the 3D structure of a protein.
51
CASP competition
52
AlphaFold (Senior et al., 2020)
53
Recommended reading
54
Recap
Summary of Lecture #2
56
Home assignment
Assignment 02 cnn
2. VGG-style network
3. ResNet
58