Convolutional Neural Networks (Image Recognition) Part - II: Dr. Syed M. Usman
1
Content
• Regularization Techniques
– L2 and L1 regularization
– Dropout
– Data augmentation
– Early stopping
• Batch Normalization
• Case Study: AlexNet
• Using Pre-Trained Nets
– Transfer Learning
– Fine Tuning
2
Regularization Techniques
• Objective: Avoiding overfitting
3
Overfitting
4
Regularization
• Regularization is a technique which makes slight
modifications to the learning algorithm such that the
model generalizes better.
• This in turn improves the model’s performance on the
unseen data as well.
5
Intuitive Explanation: Regression
Regularization
• Loss function for a linear regression with 4 input variables:
6
Intuitive Explanation: Regression
Regularization
• When we penalize the weights θ3 and θ4 and make them very small,
very close to zero, those terms become negligible, which helps
simplify the model
7
Intuitive Explanation: Regression
Regularization
• We add the regularization term to the sum of squared
differences between the actual and predicted values (a sketch of the resulting cost follows this slide).
• The regularization term keeps the weights small, making the model
simpler and helping avoid overfitting
8
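As a sketch of the cost described above, assuming the standard squared-error loss over m training examples with an L2 penalty on the weights θ_j (this exact form is an assumption, since the original slide's formula image is not reproduced here):

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n}\theta_j^2\right]$$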
Intuitive Explanation: Regression
Regularization
• λ is the penalty term or regularization parameter, which determines
how much to penalize the weights.
• When λ is zero, the regularization term becomes zero and we are back
to the original loss function.
9
Example - NN
• Consider a neural network which is overfitting
on the training data as shown in the image
below.
10
Example - NN
• Regularization penalizes the weight matrices of the nodes.
• Assume that our regularization coefficient is so high that some of the
weight matrices are nearly equal to zero.
• This will result in a much simpler linear network and slight underfitting
of the training data
11
Example - NN
• We need to optimize the value of regularization coefficient in
order to obtain a well-fitted model
12
Regularization Techniques
• L2 and L1 regularization
• Dropout
• Data augmentation
• Early stopping
13
L1 and L2 Regularization
• L1 and L2 are the most common types of regularization. These
update the general cost function by adding another term known as
the regularization term.
– Cost function = Loss (say, binary cross entropy) + Regularization
term
• Due to the addition of this regularization term, the values of the weight
matrices decrease; this reflects the assumption that a neural network with
smaller weight matrices leads to simpler models.
• Therefore, it will also reduce overfitting to quite an extent.
14
L1 and L2 Regularization
• L2 Regularization
Cost function = Loss + (λ / 2m) × Σ ‖w‖²
• L1 Regularization
Cost function = Loss + (λ / 2m) × Σ ‖w‖
15
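A minimal Keras sketch of adding L1 or L2 penalties to a layer's weights via kernel_regularizer (tensorflow.keras is assumed; layer sizes and λ values are illustrative):

from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    # L2 penalty (weight decay) on this layer's kernel, lambda = 0.01
    layers.Dense(64, activation="relu", input_shape=(100,),
                 kernel_regularizer=regularizers.l2(0.01)),
    # L1 penalty encourages sparse weights
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l1(0.001)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

Each regularizer adds its penalty term to the loss during training, matching the cost-function form on this slide.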
L1 and L2 Regularization
16
Dropout
• Dropout is by far the most popular regularization
technique for deep neural networks.
• During training time, at each iteration, a neuron is
temporarily “dropped” or disabled with probability p
• This means all the inputs and outputs to this neuron will
be disabled at the current iteration.
• The dropped-out neurons are resampled with probability
p at every training step, so a dropped out neuron at one
step can be active at the next one.
• The hyperparameter p is called the dropout-rate and it’s
typically a number around 0.5
17
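A minimal Keras sketch of dropout between fully connected layers (tensorflow.keras is assumed; layer sizes are illustrative):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(784,)),
    layers.Dropout(0.5),   # each unit is dropped with probability 0.5 during training
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
# Dropout is applied only during training; it is disabled automatically
# at inference time (model.predict / model.evaluate).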
Dropout
• Why does dropout work?
– Dropout prevents the network from becoming too dependent on a
small number of neurons, and forces every neuron to be able to
operate independently
– Practically: parameters linked to dropped neurons are not updated in
the iteration in which they were dropped
18
Dropout
• Let’s say you were the only person at your company who knows
about finance. If you were guaranteed to be at work every day, your
coworkers wouldn’t have an incentive to pick up finance skills.
• But if every morning you tossed a coin to decide whether you will
go to work or not, then your coworkers will need to adapt.
• Some days you might not be at work but finance tasks still need to
get done, so they can't rely only on you. Your coworkers will need to
learn about finance, and this expertise needs to be spread out
among various people.
• The workers need to cooperate with several other employees, not
with a fixed set of people.
• This makes the company much more resilient overall, increasing the
quality and skillset of the employees.
19
Dropout
20
Data Augmentation
• The simplest way to reduce overfitting is to increase the size
of the training data.
21
Data Augmentation Techniques
• When a computer takes an image as input, it takes in an array of
pixel values.
• Let's say that the whole image is shifted left by 1 pixel. To us, this
change is imperceptible. To a computer, however, the shift is
significant: the classification or label of the image doesn't change,
but the pixel array does.
• Approaches that alter the training data in ways that change the
array representation while keeping the label the same are known
as data augmentation techniques.
• They are a way to artificially expand the dataset.
22
Data Augmentation
23
Data Augmentation
24
Data Augmentation
• Horizontal Flips
25
Data Augmentation
• Random Crops/Scales
26
Data Augmentation
• Color Jitter: Randomly jitter contrast
27
Data Augmentation
• Random Combinations of:
– Translation
– Rotation
– Stretching
– Shearing
– Lens distortion
– …..
28
Data Augmentation
29
Data Augmentation
• In Keras, we can perform all of these transformations
using ImageDataGenerator.
• It has a long list of arguments that can be used to pre-process
the training data (a minimal sketch follows this slide).
30
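A minimal sketch of ImageDataGenerator with a few of its pre-processing arguments (the parameter values are illustrative assumptions):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Each argument enables one type of random transformation at training time.
datagen = ImageDataGenerator(
    rotation_range=20,        # random rotations up to 20 degrees
    width_shift_range=0.1,    # horizontal translation (fraction of width)
    height_shift_range=0.1,   # vertical translation (fraction of height)
    shear_range=0.1,          # shearing
    zoom_range=0.2,           # random zoom (crop/scale)
    horizontal_flip=True,     # horizontal flips
    fill_mode="nearest",      # how to fill pixels created by the transforms
)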
Data Augmentation
31
Data Augmentation
32
Data Augmentation
• Saving augmented copies of images to disk is not desirable
• Instead, generate the copies in memory and pass them directly to the learning algorithm (sketched below)
33
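A sketch of feeding augmented batches straight to training without writing copies to disk. It reuses the datagen defined in the sketch above; x_train, y_train, x_val, y_val and a compiled model are assumed to exist:

# Augmented images are produced on the fly, batch by batch; nothing is saved to disk.
datagen.fit(x_train)  # only needed for statistics-based options (e.g. featurewise centering)
model.fit(datagen.flow(x_train, y_train, batch_size=32),
          epochs=20,
          validation_data=(x_val, y_val))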
Early Stopping
• Early stopping is a kind of cross-validation strategy in which we keep
one part of the training set aside as a validation set.
• When we see that the performance on the validation set is getting
worse, we immediately stop training the model.
35
Early Stopping
patience denotes the number of epochs with no further improvement after which
the training will be stopped. After the dotted line, each epoch will result in a higher
value of validation error. Therefore, 5 epochs after the dotted line (since our
patience is equal to 5), our model will stop because no further improvement is seen.
36
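A minimal Keras sketch of early stopping with patience = 5, as in the description above (a compiled model and validation data are assumed):

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",        # watch the validation error
    patience=5,                # stop after 5 epochs with no improvement
    restore_best_weights=True, # roll back to the weights of the best epoch seen
)
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=100,
          callbacks=[early_stop])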
Content
• Regularization Techniques
– L2 and L1 regularization
– Dropout
– Data augmentation
– Early stopping
• Batch Normalization
• Case Study: AlexNet
• Using Pre-Trained Nets
– Transfer Learning
– Fine Tuning
37
Batch Normalization
• Batch normalization improved the top result of ImageNet
(2014) by a significant margin using only 7% of the training
steps
38
Covariate Shift
• A binary classifier for roses: The output is 1 if the image is that of a
rose and the output is 0 otherwise.
• Consider a subset of the training data that primarily has red
rose buds as rose examples and wildflowers as non-rose
examples.
39
Covariate Shift
• Consider another subset that has fully blown roses of different colors as
rose examples and other non-rose flowers in the picture as non-rose
examples.
40
Covariate Shift
The two subsets have very different distributions. The last column shows the distribution of
the two classes in the feature space using red and green dots. The blue line shows the decision
boundary between the two classes.
41
Covariate Shift
• The natural way to solve this problem for the input layer is to randomize
the data before creating mini-batches.
• How do we solve this for the hidden layers?
• Each hidden unit’s input distribution changes every time there is a
parameter update in the previous layer.
• Since the activations of a previous layer are the inputs of the next
layer, each layer in the neural network is faced with a situation
where the input distribution changes with each step.
• This is called internal covariate shift, and it slows training
• This problem is solved by normalizing the layer's inputs over a mini-batch;
the process is therefore called Batch Normalization.
42
Batch Normalization
• We normalize the input layer by adjusting and scaling the activations.
• For example, when some features range from 0 to 1 and others from 1 to
1000, we should normalize them to speed up learning.
• If the input layer benefits from this, why not do the same thing for the
values in the hidden layers, which change all the time, and get a 10x or
greater improvement in training speed?
• To increase the stability of a neural network, batch
normalization normalizes the output of a previous activation
layer by subtracting the batch mean and dividing by the batch
standard deviation.
43
Why Normalize?
44
Batch Normalization
• The basic idea behind batch normalization is to limit covariate shift
by normalizing the activations of each layer (transforming the
inputs to be mean 0 and unit variance).
• This, supposedly, allows each layer to learn on a more stable
distribution of inputs, and would thus accelerate the training of the
network.
• In practice, restricting the activations of each layer to be
strictly 0 mean and unit variance can limit the expressive
power of the network.
• Therefore, in practice, batch normalization allows the network
to learn parameters gamma and beta that can convert the
mean and variance to any value that the network desires.
45
Batch Normalization
• Normalizing the input values speeds up learning of the parameters.
• The same question arises in a deeper network.
[Diagram: network with hidden activations a[1], a[2], a[3] and parameters W[3], b[3]]
46
Batch Normalization
• Can we normalize a[2] so as to learn W[3], b[3] faster?
• For practical purposes, there is a debate on whether to normalize
a[2] or z[2] (i.e. before or after the activation); the latter is employed in
practice (Andrew Ng)
• Steps:
– Given some intermediate values in the net for some hidden layer (the normalization steps are sketched after this slide)
48
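A sketch of the standard normalization steps for the intermediate values z^(1), …, z^(m) of one hidden layer over a mini-batch (ε is a small constant for numerical stability; γ and β are learned parameters):

$$\mu = \frac{1}{m}\sum_{i} z^{(i)}, \qquad \sigma^2 = \frac{1}{m}\sum_{i}\left(z^{(i)} - \mu\right)^2$$

$$z^{(i)}_{\text{norm}} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad \tilde{z}^{(i)} = \gamma\, z^{(i)}_{\text{norm}} + \beta$$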
BN: Implementation
• Import the Batch Normalization module from keras
49
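A minimal sketch of the import and usage (tensorflow.keras is assumed; placing BN before the activation follows the z-vs-a debate mentioned earlier):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, input_shape=(100,)),
    layers.BatchNormalization(),   # normalize z before the non-linearity
    layers.Activation("relu"),
    layers.Dense(10, activation="softmax"),
])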
Batch Normalization
• Evolution of loss with and without BN
50
Batch Normalization at Test Time
• Training: data is provided one mini-batch at a time
• Test: we may have only one example at a time
• During training – For mini-batches (of size m):
• Mean and variance required for scaling computed from the whole
mini-batch
51
Batch Normalization at Test Time
• There is NO mini-batch at test time
• Need a different way of coming up with mean and variance – Mean
and variance of one example does not make sense
• Estimated using exponentially weighted average (across mini-
batches in the training set)
• During training, consider layer l and a set of mini-batches:
– X{1}, X{2}, X{3}, … For each mini-batch we compute the corresponding
mean and variance values
– μ{1}[l], μ{2}[l], μ{3}[l], …
– An exponentially weighted average of these becomes the estimate of the mean
(and likewise the variance) for layer l
52
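A small sketch of the exponentially weighted (running) average of the per-mini-batch statistics; the momentum value and function name are illustrative assumptions (Keras' BatchNormalization layer keeps such moving averages with momentum 0.99 by default):

import numpy as np

def update_running_stats(running_mean, running_var, batch, momentum=0.9):
    """Update running estimates of mean/variance from one mini-batch."""
    batch_mean = batch.mean(axis=0)
    batch_var = batch.var(axis=0)
    running_mean = momentum * running_mean + (1 - momentum) * batch_mean
    running_var = momentum * running_var + (1 - momentum) * batch_var
    return running_mean, running_var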
Batch Normalization at Test Time
• At test time, instead of the per-mini-batch normalization used during training,
z_norm(i) = (z(i) − μ) / √(σ² + ε),
we normalize a single example using the estimated statistics:
z_norm = (z − μ) / √(σ² + ε)
• And using the β and γ learnt during training, compute:
z̃ = γ · z_norm + β
53
Case Study: AlexNet
54
Case Study: AlexNet
• Input: 227x227x3 images
• First layer (CONV1): 96 11x11 filters applied at
stride 4
• The output volume size?
– (227-11)/4+1 = 55
55
Output Size
Output size = (N − F) / stride + 1, e.g. (227 − 11) / 4 + 1 = 55
56
Case Study: AlexNet
• Input: 227x227x3 images
• First layer (CONV1): 96 11x11 filters applied at
stride 4
• Output Volume: 55 x 55 x 96
• Total Number of Parameters?
– Parameters: (11*11*3)*96 = 35K
57
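A small sketch that reproduces the arithmetic above (generic conv-layer output size and parameter count; bias terms are ignored, as in the 35K figure, and the function names are assumptions for illustration):

def conv_output_size(n, f, stride, pad=0):
    """Spatial output size of a conv/pool layer: (N - F + 2P) / stride + 1."""
    return (n - f + 2 * pad) // stride + 1

def conv_params(f, in_channels, num_filters):
    """Number of weights in a conv layer (biases ignored)."""
    return f * f * in_channels * num_filters

print(conv_output_size(227, 11, stride=4))   # 55  (CONV1)
print(conv_params(11, 3, 96))                # 34848, roughly 35K
print(conv_output_size(55, 3, stride=2))     # 27  (POOL1)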
Case Study: AlexNet
• Input: 227x227x3 images
• After CONV1: 55x55x96
• Second layer (POOL1): 3x3 filters applied at
stride 2
• Output volume?
– 27x27x96
58
Case Study: AlexNet
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
59
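A hedged Keras sketch of the simplified architecture above (the NORM/local response normalization layers, the original grouped convolutions, and dropout in FC6/FC7 are omitted; shapes follow the table):

from tensorflow import keras
from tensorflow.keras import layers

alexnet = keras.Sequential([
    layers.Conv2D(96, 11, strides=4, activation="relu",
                  input_shape=(227, 227, 3)),              # -> 55x55x96  (CONV1)
    layers.MaxPooling2D(3, strides=2),                      # -> 27x27x96  (POOL1)
    layers.Conv2D(256, 5, padding="same", activation="relu"),   # -> 27x27x256 (CONV2)
    layers.MaxPooling2D(3, strides=2),                      # -> 13x13x256 (POOL2)
    layers.Conv2D(384, 3, padding="same", activation="relu"),   # -> 13x13x384 (CONV3)
    layers.Conv2D(384, 3, padding="same", activation="relu"),   # -> 13x13x384 (CONV4)
    layers.Conv2D(256, 3, padding="same", activation="relu"),   # -> 13x13x256 (CONV5)
    layers.MaxPooling2D(3, strides=2),                      # -> 6x6x256   (POOL3)
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),                  # FC6
    layers.Dense(4096, activation="relu"),                  # FC7
    layers.Dense(1000, activation="softmax"),                # FC8 (class scores)
])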
Using Pre-Trained ConvNets
(Transfer Learning)
60
Using Pre-Trained Nets
• A typical CNN has two parts:
– Convolutional base: composed of a stack of
convolutional and pooling layers. The main goal of the
convolutional base is to generate features from the
image.
– Classifier: usually composed of fully connected
layers. The main goal of the classifier is to classify the
image based on the detected features. A fully connected
layer is a layer whose neurons have full connections to
all activations in the previous layer.
61
Using Pre-Trained Nets
• Deep learning models can automatically learn hierarchical feature
representations.
• Features computed by the first layer are general and can be reused
in different problem domains, while features computed by the last
layer are specific and depend on the chosen dataset and task
• A common misconception in the DL community is that
without a Google-esque amount of data, you can’t possibly
hope to create effective deep learning models.
• While data is a critical part of creating the network, the idea
of transfer learning has helped to lessen the data demands.
• Transfer Learning: Taking a pre-trained model and adapting
it to a given problem.
62
Using Pre-Trained Nets
• Strategy I: Train the entire model
– In this case, you use the architecture of the pre-trained
model and train it according to your dataset. You’re
learning the model from scratch, so you’ll need a large
dataset (and a lot of computational power).
63
Using Pre-Trained Nets
• Strategy II: Fine-Tune a pre-trained model
– Change the fully connected layer of the network to match
the data under study
– Continue back propagation and update parameters of all
or a subset of layers (Initial layers can be frozen)
64
Using Pre-Trained Nets
• It is possible to fine-tune all the layers of the ConvNet, or to keep
some of the earlier layers fixed (due to overfitting concerns) and
only fine-tune some higher-level portion of the network.
• Earlier layers of a ConvNet contain more generic features (e.g.
edge detectors or color blob detectors) that should be useful for
many tasks, but later layers become progressively more specific
to the details of the classes contained in the original dataset.
65
Using Pre-Trained Nets
• Strategy III: Use CNN as Feature Extractor
– Freeze the convolutional base
– Pass the data through the network and use the output of
the convolutional base as features
– Feed features to another classifier
– Example:
• For AlexNet: 4096-D vector for every image that contains
the activations of the hidden layer immediately before the
classifier.
• Train Classifier on these features (e.g. SVM)
66
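A sketch of the feature-extractor strategy using a pre-trained VGG16 convolutional base and a linear SVM (VGG16 and scikit-learn are assumptions chosen for illustration; the slide's AlexNet/4096-D example is analogous, and x_train, y_train, x_test, y_test are assumed to exist as preprocessed arrays):

from tensorflow.keras.applications import VGG16
from sklearn.svm import LinearSVC

# Frozen convolutional base: include_top=False drops the original classifier head.
conv_base = VGG16(weights="imagenet", include_top=False,
                  pooling="avg", input_shape=(224, 224, 3))

features = conv_base.predict(x_train)         # one feature vector per image
clf = LinearSVC().fit(features, y_train)      # train a separate classifier on the features
print(clf.score(conv_base.predict(x_test), y_test))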
Summary: Using Pre-Trained ConvNets
70
Implementation: Transfer Learning
• Convert the labels to one-hot encoding (sketched below)
71
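A minimal sketch of the one-hot conversion (integer labels y_train, y_test and the class count are assumptions):

from tensorflow.keras.utils import to_categorical

# e.g. label 2 out of 5 classes -> [0, 0, 1, 0, 0]
y_train_onehot = to_categorical(y_train, num_classes=5)
y_test_onehot = to_categorical(y_test, num_classes=5)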
Implementation: Transfer Learning
• Pass the images through the network using the predict
function to get features (sketched below)
72
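A sketch of passing the images through the frozen convolutional base with predict to obtain features (conv_base and the image arrays are assumptions carried over from the sketch after slide 66):

# Each image becomes a fixed-length feature vector produced by the conv base.
train_features = conv_base.predict(x_train, batch_size=32)
test_features = conv_base.predict(x_test, batch_size=32)
print(train_features.shape)   # (num_images, feature_dim)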
Implementation: Fine Tuning
• Load the model
73
Implementation: Fine Tuning
• Create new model: Add classification layers on
top of convolutional base
74
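A hedged sketch covering both fine-tuning steps on slides 73 and 74: load a pre-trained convolutional base, freeze its earlier layers, and add new classification layers on top (VGG16, the number of frozen layers, and the class count are illustrative assumptions):

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG16

# Step 1: load the pre-trained model without its original classifier head.
conv_base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Freeze the earlier, more generic layers; leave the last block trainable.
for layer in conv_base.layers[:-4]:
    layer.trainable = False

# Step 2: new model = conv base + fresh classification layers for our classes.
model = keras.Sequential([
    conv_base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(5, activation="softmax"),   # 5 classes assumed
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])

A small learning rate is used so that back propagation only gently updates the pre-trained weights while the new classification layers are learned.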
References
• The material in these slides has been taken from the following
sources.
– Slides by CS231n Winter 2016 – Andrej Karpathy
– Convolutional Neural Networks (CNNs): An Illustrated
Explanation, Abhineet Saxena
– An Intuitive Explanation of Convolutional Neural
Networks
– A Beginner's Guide To Understanding Convolutional Neural
Networks, Adit Deshpande
75