Chapter 14 CNN
Tsz-Chiu Au
chiu@unist.ac.kr
• Zero padding: In order for a layer to have the same height and width as
the previous layer, it is common to add zeros around the inputs.
Convolutional Layers (cont.)
• It is also possible to connect a large input layer to a much smaller layer by
spacing out the receptive fields.
• Stride: The shift from one receptive field to the next.
• A 5 × 7 input layer (plus zero padding) is connected to a 3 × 4 layer, using 3 × 3
receptive fields and a stride of 2.
• In this example the stride is the same in both directions, but it does not have
to be so.
• A neuron located in row i, column j of the upper layer is connected to the
outputs of the neurons in the previous layer located in rows i × sh to i × sh + fh –
1 and columns j × sw to j × sw + fw – 1, where sh and sw are the vertical and
horizontal strides and fh and fw are the height and width of the receptive field.
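• As a quick sanity check of this formula, the short sketch below plugs in the 5 × 7 example above (fh = fw = 3, sh = sw = 2); the function name is just illustrative.

    # Receptive field of the neuron at row i, column j of the upper layer,
    # assuming 3 x 3 receptive fields and a stride of 2, as in the example above.
    fh, fw, sh, sw = 3, 3, 2, 2

    def receptive_field(i, j):
        rows = range(i * sh, i * sh + fh)   # rows i*sh .. i*sh + fh - 1
        cols = range(j * sw, j * sw + fw)   # columns j*sw .. j*sw + fw - 1
        return list(rows), list(cols)

    print(receptive_field(1, 2))  # ([2, 3, 4], [4, 5, 6])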
Filters
• Convolutional layers can act as filters (or convolution kernels) that output feature
maps, which highlight the areas in an image that activate the filter the most.
• The vertical filter is represented as a black square with a vertical white line in the
middle (it is a 7 × 7 matrix full of 0s except for the central column, which is full of 1s).
» Neurons using this filter ignore everything in their receptive field except for the central vertical line.
• If all neurons in a layer use the same vertical-line filter (and the same bias term), the
output is Feature map 1.
» The vertical white lines get enhanced while the rest gets blurred.
» All neurons within a given feature map share the same parameters.
• During training the convolutional layer will automatically learn the filters.
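• A minimal sketch of applying such a hand-crafted vertical-line filter with TensorFlow's tf.nn.conv2d; the sample images are random placeholders, and in practice the filters are learned rather than set by hand.

    import numpy as np
    import tensorflow as tf

    # Fake mini-batch of two 28 x 28 grayscale images (placeholder data).
    images = np.random.rand(2, 28, 28, 1).astype(np.float32)

    # 7 x 7 vertical-line filter: all zeros except the central column of 1s.
    filters = np.zeros((7, 7, 1, 1), dtype=np.float32)
    filters[:, 3, 0, 0] = 1.0

    # "SAME" padding keeps the spatial size, so the output is 2 x 28 x 28 x 1.
    feature_map = tf.nn.conv2d(images, filters, strides=1, padding="SAME")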
Stacking Multiple Feature Maps
• Typically, a convolutional layer has
multiple filters and outputs one
feature map per filter.
» All neurons in a feature map
share the same parameters
§ Dramatically reduces the number of
parameters in the model.
§ Once the CNN has learned to
recognize a pattern in one location, it
can recognize it in any other location.
» Neurons in different feature maps
use different parameters.
» A neuron's receptive field is the same as described earlier, but it extends
across all the previous layer's feature maps.
Stacking Multiple Feature Maps (cont.)
• Input images are composed of multiple sublayers: one per color channel.
» There are typically three: red, green, and blue (RGB).
» Grayscale images have just one channel.
» But some images may have much more—for example, satellite images that
capture extra light frequencies (such as infrared).
• Specifically, a neuron located in row i, column j of the feature map k in a
given convolutional layer l is connected to the outputs of the neurons in
the previous layer l – 1, located in rows i × sh to i × sh + fh – 1 and columns j
× sw to j × sw + fw – 1, across all feature maps (in layer l – 1).
» All neurons located in the same row i and column j but in different feature
maps are connected to the outputs of the exact same neurons in the previous
layer.
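• The connectivity rule above can be written out directly. The pure-NumPy sketch below computes the output of one such neuron; the names are mine, and the activation function is omitted.

    import numpy as np

    def conv_output(x, w, b, i, j, k, sh=1, sw=1):
        """Output of the neuron at row i, column j of feature map k in layer l.

        x: outputs of layer l - 1, shape (height, width, fn_prev)
        w: filters of layer l, shape (fh, fw, fn_prev, fn)
        b: bias terms of layer l, shape (fn,)
        """
        fh, fw, fn_prev, _ = w.shape
        z = b[k]
        for u in range(fh):                 # rows of the receptive field
            for v in range(fw):             # columns of the receptive field
                for kp in range(fn_prev):   # across ALL feature maps of layer l - 1
                    z += x[i * sh + u, j * sw + v, kp] * w[u, v, kp, k]
        return z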
TensorFlow Implementation of Convolutional Layers
• Each input image is typically represented as a 3D tensor of shape [height,
width, channels]
• A mini-batch is represented as a 4D tensor of shape [mini-batch size,
height, width, channels].
• The weights of a convolutional layer are represented as a 4D tensor of
shape [fh, fw, fnʹ, fn], where fnʹ is the number of feature maps in the previous
layer and fn is the number of feature maps (filters) in the current layer.
• The bias terms of a convolutional layer are simply represented as a 1D
tensor of shape [fn].
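• A small sketch tying these shapes together with tf.nn.conv2d; the sizes (32 images of 28 × 28 × 3, 16 filters of 3 × 3) are made up for illustration.

    import tensorflow as tf

    batch = tf.random.uniform([32, 28, 28, 3])   # [mini-batch size, height, width, channels]
    weights = tf.random.normal([3, 3, 3, 16])    # [fh, fw, fn', fn]
    biases = tf.zeros([16])                      # [fn]

    outputs = tf.nn.conv2d(batch, weights, strides=1, padding="SAME") + biases
    print(outputs.shape)                         # (32, 28, 28, 16)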
Padding
Keras Implementation of Convolutional Layers
• We don’t want to manually define the filters.
» Filters should be trainable variables so the neural net can learn which filters work best
• In Keras, the code below creates a Conv2D layer with 32 filters, each 3 × 3, using a
stride of 1 (both horizontally and vertically) and "same" padding, and applies the
ReLU activation function to its outputs.
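• A sketch of the layer just described, using the standard Keras API:

    from tensorflow import keras

    conv_layer = keras.layers.Conv2D(filters=32, kernel_size=3, strides=1,
                                     padding="same", activation="relu")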
• Hyperparameters: the number of filters, their height and width, the strides,
and the padding type.
» But using cross-validation to find the right hyperparameter values is very time-consuming.
• The convolutional layers require a huge amount of RAM.
» The reverse pass of backpropagation requires all the intermediate values computed during the
forward pass.
» If training crashes because of an out-of-memory error, you can try
§ reducing the mini-batch size.
§ reducing dimensionality using a stride, or removing a few layers.
§ using 16-bit floats instead of 32-bit floats.
§ distributing the CNN across multiple devices.
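• Two of these mitigations take only a line or two; a hedged sketch ("mixed_float16" is the standard Keras policy name, and the batch size of 16 is just an example):

    from tensorflow import keras

    # Compute most activations in 16-bit floats (weights stay in float32).
    keras.mixed_precision.set_global_policy("mixed_float16")

    # The mini-batch size is set at training time, e.g.:
    # model.fit(X_train, y_train, batch_size=16, epochs=10)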
Pooling Layers
• Pooling layers subsample (i.e., shrink) the input image in order to reduce
the computational load, the memory usage, and the number of
parameters
» Thereby limiting the risk of overfitting.
• A neuron in a pooling layer is just like a neuron in a convolutional layer,
except that it has no weights.
» All it does is aggregate the inputs using an aggregation function such as the
max or mean.
• The max pooling layer is the most common type of pooling layer.
» 2 × 2 pooling kernel with a stride of 2 and no padding.
» Only the max input value in each receptive field makes it to the next layer, while the
other inputs are dropped.
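• A sketch of the max pooling layer just described; in Keras, pool_size=2 gives a 2 × 2 kernel, the stride defaults to the pool size, and padding defaults to "valid" (no padding).

    from tensorflow import keras

    max_pool = keras.layers.MaxPool2D(pool_size=2)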
Pooling Layers (cont.)
• A pooling layer typically works on every input channel independently, so the
output depth is the same as the input depth.
• Max pooling offers a small amount of rotational invariance and a slight scale
invariance.
» Such invariance (even if it is limited) can be useful in cases where the prediction should not
depend on these details, such as in classification tasks.
• Global average pooling layer: compute the mean of each entire feature map
» it’s like an average pooling layer using a pooling kernel with the same spatial dimensions as
the inputs.
» it can be useful as the output layer
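• A minimal sketch of the global average pooling layer in Keras:

    from tensorflow import keras

    # Outputs one number per feature map: the mean over the whole map.
    global_avg_pool = keras.layers.GlobalAvgPool2D()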
CNN Architectures
• Typical CNN architectures stack a few convolutional layers (each one generally
followed by a ReLU layer), then a pooling layer, then another few convolutional
layers (+ReLU), then another pooling layer, and so on.
» The image gets smaller and smaller as it progresses through the network, but it also typically
gets deeper and deeper (i.e., with more feature maps)
• At the top of the stack, a few fully connected layers are added, and then the
final layer outputs the prediction.
• This simple CNN reaches over 92% accuracy on the Fashion MNIST dataset
» Much better than what we achieved with dense networks in Chapter 10.
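• A minimal sketch of this kind of stack, assuming 28 × 28 single-channel Fashion MNIST inputs; the numbers of filters and units are illustrative choices, not the only ones that work.

    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=[28, 28, 1]),
        keras.layers.Conv2D(64, 7, activation="relu", padding="same"),
        keras.layers.MaxPool2D(2),
        keras.layers.Conv2D(128, 3, activation="relu", padding="same"),
        keras.layers.Conv2D(128, 3, activation="relu", padding="same"),
        keras.layers.MaxPool2D(2),
        keras.layers.Conv2D(256, 3, activation="relu", padding="same"),
        keras.layers.Conv2D(256, 3, activation="relu", padding="same"),
        keras.layers.MaxPool2D(2),
        keras.layers.Flatten(),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dropout(0.5),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dropout(0.5),
        keras.layers.Dense(10, activation="softmax"),
    ])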
• But variants of this fundamental architecture have been developed.
• A good measure of this progress is the error rate in competitions such as
the ILSVRC ImageNet challenge.
» In this competition the top-five error rate for image classification fell from over 26% to
less than 2.3% in just six years.
» The top-five error rate is the fraction of test images for which the system's top five
predictions did not include the correct answer.
» The images are large (256 pixels high) and there are 1,000 classes, some of which are
really subtle (try distinguishing 120 dog breeds).
• We will first look at the classical LeNet-5 architecture (1998), then four of
the winners of the ILSVRC challenge: AlexNet (2012), GoogLeNet (2014),
ResNet (2015), and SENet (2017)
LeNet-5
• The LeNet-5 architecture was created by Yann LeCun in 1998 and has been
widely used for handwritten digit recognition (MNIST).
» It introduces most of the elements in modern CNN architectures.
» But the choice of average pooling layers and the use, at the output layer, of the square
of the Euclidean distance between each neuron's input vector and its weight vector are now obsolete.
AlexNet
• The AlexNet CNN architecture won the 2012 ImageNet ILSVRC challenge by
a large margin.
» Achieved a top-five error rate of 17%, while the second-best entry achieved 26%.
• Similar to LeNet-5 but larger.
• It was the first to stack convolutional layers directly on top of one another,
instead of stacking a pooling layer on top of each convolutional layer.
AlexNet (cont.)
• Introduced two regularization techniques:
» Dropout with a 50% dropout rate, applied during training to the outputs of layers F9 and F10.
» Data augmentation by randomly shifting the training images by various offsets,
flipping them horizontally, and changing the lighting conditions.
§ Increases the size of the training set by generating many realistic variants of each training instance.
§ Forces the model to be more tolerant to variations in the position, orientation, and size of the objects in
the pictures.
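• A sketch of comparable augmentations using Keras preprocessing layers; these are modern stand-ins rather than AlexNet's original pipeline.

    from tensorflow import keras

    data_augmentation = keras.Sequential([
        keras.layers.RandomTranslation(height_factor=0.1, width_factor=0.1),  # random shifts
        keras.layers.RandomFlip("horizontal"),                                # horizontal flips
        keras.layers.RandomContrast(0.2),                                     # lighting changes
    ])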
AlexNet (cont.)
• AlexNet also uses a competitive normalization step immediately after the
ReLU step of layers C1 and C3, called local response normalization (LRN)
» The most strongly activated neurons inhibit other neurons located at the same
position in neighboring feature maps.
» Such competitive activation has been observed in biological neurons
» This encourages different feature maps to specialize, pushing them apart and
forcing them to explore a wider range of features, ultimately improving
generalization.
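• TensorFlow still exposes this operation; the sketch below applies it with TensorFlow's default hyperparameters (AlexNet's exact values differ and are not reproduced here).

    import tensorflow as tf

    # feature_maps: [batch, height, width, channels], e.g. a conv layer's output
    # after its ReLU activation (random placeholder here).
    feature_maps = tf.random.uniform([1, 55, 55, 96])

    # Each activation is divided by a term that grows with the squared activations
    # at the same position in neighboring feature maps.
    normalized = tf.nn.local_response_normalization(feature_maps)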
• This model can beat the winner in the ILSVRC 2015 challenge.
Using Pretrained Models from Keras
• Load the ResNet-50 model, pretrained on ImageNet.
• To use it, you first need to ensure that the images have the right size.
» A ResNet-50 model expects 224 × 224-pixel images (other models may expect other
sizes, such as 299 × 299),
» Use TensorFlow’s tf.image.resize() function to resize the images
• This model should reach around 95% accuracy on the test set.
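• A sketch of this workflow; images is assumed to be a float batch of shape [batch, height, width, 3] with pixel values in [0, 255].

    import tensorflow as tf
    from tensorflow import keras

    model = keras.applications.ResNet50(weights="imagenet")

    images_resized = tf.image.resize(images, [224, 224])   # ResNet-50 expects 224 x 224
    inputs = keras.applications.resnet50.preprocess_input(images_resized)

    probas = model.predict(inputs)
    top_k = keras.applications.resnet50.decode_predictions(probas, top=3)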
Classification and Localization
• Localizing an object in a picture can be expressed as a regression task that predicts a
bounding box around the object
» horizontal and vertical coordinates of the object’s center, as well as its height and width.
• Just add a second dense output layer with four units (typically on top of the global
average pooling layer), and train the model using the MSE loss.
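• A hedged sketch of such a two-headed model, using a pretrained Xception base as an illustrative choice; the number of classes and the loss weights are made up.

    from tensorflow import keras

    n_classes = 10  # illustrative

    base_model = keras.applications.Xception(weights="imagenet", include_top=False)
    avg = keras.layers.GlobalAveragePooling2D()(base_model.output)
    class_output = keras.layers.Dense(n_classes, activation="softmax")(avg)
    loc_output = keras.layers.Dense(4)(avg)  # center x, center y, height, width

    model = keras.Model(inputs=base_model.input, outputs=[class_output, loc_output])
    model.compile(loss=["sparse_categorical_crossentropy", "mse"],
                  loss_weights=[0.8, 0.2],  # illustrative weighting
                  optimizer="sgd")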