
Convolutional Neural Networks

(Image Recognition)
Part - II

Dr. Syed M. Usman

1
Contents
• Regularization Techniques
– L2 and L1 regularization
– Dropout
– Data augmentation
– Early stopping
• Batch Normalization
• Case Study: AlexNet
• Using Pre-Trained Nets
– Transfer Learning
– Fine Tuning

2
Regularization Techniques
• Objective: Avoiding overfitting

3
Overfitting

4
Regularization
• Regularization is a technique which makes slight
modifications to the learning algorithm such that the
model generalizes better.
• This in turn improves the model’s performance on the
unseen data as well.

Regularization is a technique to discourage the complexity of the model. It does this by adding a penalty term to the loss function.

5
Intuitive Explanation: Regression

Regularization
• Loss function for a linear regression with 4 input variables:

• As the degree of the input features increases, the model becomes complex and tries to fit all the data points:

6
Intuitive Explanation: Regression

Regularization
• When we penalize the weights θ3 and θ4 and make them very small, close to zero, those terms become negligible, which helps simplify the model.

• Regularization works on the assumption that smaller weights generate a simpler model and thus help avoid overfitting.

7
Intuitive Explanation: Regression

Regularization
• We add the regularization term to the sum of squared differences between the actual values and predicted values.
• The regularization term keeps the weights small, making the model simpler and avoiding overfitting.
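In symbols, a minimal sketch of this regularized loss (assuming squared-error loss with an L2 penalty; the slide's original equation is not reproduced here):

```latex
J(\theta) = \underbrace{\sum_{i=1}^{m} \left( y^{(i)} - \hat{y}^{(i)} \right)^{2}}_{\text{sum of squared differences}}
          \; + \; \underbrace{\lambda \sum_{j} \theta_j^{2}}_{\text{regularization term}}
```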

8
Intuitive Explanation: Regression

Regularization
• λ is the penalty term or regularization parameter, which determines how much to penalize the weights.
• When λ is zero, the regularization term becomes zero and we are back to the original loss function.
• When λ is large, the weights are penalized heavily and become close to zero. This results in a very simple model that has high bias, i.e., it underfits.

9
Example - NN
• Consider a neural network which is overfitting
on the training data as shown in the image
below.

10
Example - NN
• Regularization penalizes the weight matrices of the nodes.
• Assume that our regularization coefficient is so high that some of the
weight matrices are nearly equal to zero.
• This will result in a much simpler linear network and slight underfitting
of the training data

11
Example - NN
• We need to optimize the value of the regularization coefficient in order to obtain a well-fitted model

12
Regularization Techniques
• L2 and L1 regularization
• Dropout
• Data augmentation
• Early stopping

13
L1 and L2 Regularization
• L1 and L2 are the most common types of regularization. These
update the general cost function by adding another term known as
the regularization term.
– Cost function = Loss (say, binary cross entropy) + Regularization
term
• Due to the addition of this regularization term, the values of the weight matrices decrease, on the assumption that a neural network with smaller weight matrices corresponds to a simpler model.
• Therefore, it will also reduce overfitting to quite an extent.

14
L1 and L2 Regularization
• L2 Regularization

  Cost Function = Loss + (λ/2) × Σ w²

• L1 Regularization

  Cost Function = Loss + λ × Σ |w|

• λ (lambda) is the regularization parameter.
• It is a hyperparameter whose value is tuned for better results.

15
L1 and L2 Regularization

Note: The penalties (regularizers) are applied on a per-layer basis.
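A minimal Keras sketch of per-layer penalties (the layer sizes and the penalty strength 0.01 are illustrative assumptions, not values from the slides):

```python
from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(100,),
                 kernel_regularizer=regularizers.l2(0.01)),  # L2 penalty on this layer's weights
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l1(0.01)),  # L1 penalty on this layer's weights
    layers.Dense(1, activation='sigmoid'),                   # no penalty on the output layer
])
```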

16
Dropout
• Dropout is by far the most popular regularization
technique for deep neural networks.
• During training time, at each iteration, a neuron is
temporarily “dropped” or disabled with probability p
• This means all the inputs and outputs to this neuron will
be disabled at the current iteration.
• The dropped-out neurons are resampled with probability
p at every training step, so a dropped out neuron at one
step can be active at the next one.
• The hyperparameter p is called the dropout-rate and it’s
typically a number around 0.5

17
Dropout
• Why does dropout work?
– Dropout prevents the network from becoming too dependent on a small number of neurons, and forces every neuron to be able to operate independently
– Practically, parameters linked to dropped neurons are not updated in the iteration in which they were dropped

18
Dropout
• Let’s say you were the only person at your company who knows
about finance. If you were guaranteed to be at work every day, your
coworkers wouldn’t have an incentive to pick up finance skills.
• But if every morning you tossed a coin to decide whether you will
go to work or not, then your coworkers will need to adapt.
• Some days you might not be at work but finance tasks still need to
get done, so they can’t rely only on you. Your coworkers will need to
learn about finance and this expertise needs to be spread out
between various people.
• The workers need to cooperate with several other employees, not
with a fixed set of people.
• This makes the company much more resilient overall, increasing the
quality and skillset of the employees.

19
Dropout

Note: Dropout is applied to the fully-connected layers of the ConvNet only.
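A minimal Keras sketch of this placement (the architecture and the rate 0.5 are illustrative assumptions):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),                      # each unit dropped with probability 0.5 during training
    layers.Dense(10, activation='softmax'),   # dropout is applied only to the fully-connected part
])
```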

20
Data Augmentation
• The simplest way to reduce overfitting is to increase the size
of the training data.

21
Data Augmentation Techniques
• When a computer takes an image as input, it takes in an array of pixel values.
• Let’s say that the whole image is shifted left by 1 pixel. To us, this
change is imperceptible. However, to a computer, this shift can be
fairly significant as the classification or label of the image doesn’t
change, while the array does.
• Approaches that alter the training data in ways that change the
array representation while keeping the label the same are known
as data augmentation techniques.
• They are a way to artificially expand the dataset.

22
Data Augmentation

23
Data Augmentation

24
Data Augmentation
• Horizontal Flips

25
Data Augmentation
• Random Crops/Scales

26
Data Augmentation
• Color Jitter: Randomly jitter contrast

27
Data Augmentation
• Random Combinations of:
– Translation
– Rotation
– Stretching
– Shearing
– Lens distortion
– …..

28
Data Augmentation

29
Data Augmentation
• In Keras, we can perform all of these transformations using ImageDataGenerator.
• It has a big list of arguments which can be used to pre-process
training data.
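For example, a minimal sketch (the argument values are illustrative assumptions, not the slides' settings):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,       # random rotations of up to 20 degrees
    width_shift_range=0.1,   # random horizontal shifts (fraction of image width)
    height_shift_range=0.1,  # random vertical shifts (fraction of image height)
    zoom_range=0.2,          # random zooming in/out
    horizontal_flip=True,    # random horizontal flips
)
```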

30
Data Augmentation

31
Data Augmentation

32
Data Augmentation
• Saving copies of images to disk is not desirable
• Make copies and pass them directly to the learning algorithm

33
Data Augmentation
• Saving copies of images to disk is not desirable
• Make copies and pass them directly to the learning algorithm

• A batch of 128 images is picked, a random transformation from the chosen set is applied, and the 128 transformed images are passed to the learning algorithm.
• In the next epoch, some other random transformations are applied
– Size of data in each epoch does not increase
• In each epoch, the learning algorithm sees the same data but
transformed differently
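A minimal sketch of this on-the-fly scheme (model, x_train, and y_train are assumed to already exist; the transforms are illustrative):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rotation_range=20, horizontal_flip=True)
train_gen = datagen.flow(x_train, y_train, batch_size=128)  # yields freshly transformed batches
model.fit(train_gen, epochs=50)   # each epoch sees the same images, transformed differently
```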

34
Early Stopping
• Early stopping is a kind of cross-validation strategy where we keep
one part of the training set as the validation set.
• When we see that the performance on the validation set is getting worse, we immediately stop training the model.

35
Early Stopping

monitor denotes the quantity that needs to be monitored, and ‘val_err’ denotes the validation error.

patience denotes the number of epochs with no further improvement after which
the training will be stopped. After the dotted line, each epoch will result in a higher
value of validation error. Therefore, 5 epochs after the dotted line (since our
patience is equal to 5), our model will stop because no further improvement is seen.
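A minimal sketch of early stopping in Keras (here monitoring 'val_loss'; the model and data are assumed to already exist):

```python
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss',  # quantity to monitor (the slide calls it 'val_err')
                           patience=5)          # stop after 5 epochs with no improvement
model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])
```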

36
Contents
• Regularization Techniques
– L2 and L1 regularization
– Dropout
– Data augmentation
– Early stopping
• Batch Normalization
• Case Study: AlexNet
• Using Pre-Trained Nets
– Transfer Learning
– Fine Tuning

37
Batch Normalization
• Batch normalization improved the top result of ImageNet
(2014) by a significant margin using only 7% of the training
steps

38
Covariate Shift
• A binary classifier for roses: The output is 1 if the image is that of a
rose and the output is 0 otherwise.
• Consider a subset of the training data that primarily has red
rose buds as rose examples and wildflowers as non-rose
examples.

39
Covariate Shift
• Consider another subset that has fully blown roses of different colors as
rose examples and other non-rose flowers in the picture as non-rose
examples.

• Intuitively, it makes sense that every mini-batch used in the training


process should have the same distribution.
• A mini-batch should not have only images from one of the two subsets
above. It should have images randomly selected from both subsets in each
mini-batch.

40
Covariate Shift

The two subsets have very different distributions. The last column shows the distribution of the two classes in the feature space using red and green dots. The blue line shows the decision boundary between the two classes.

41
Covariate Shift
• The natural way to solve this problem for the input layer is to randomize
the data before creating mini-batches.
• How do we solve this for the hidden layers?
• Each hidden unit’s input distribution changes every time there is a
parameter update in the previous layer.
• Since the activations of a previous layer are the inputs of the next
layer, each layer in the neural network is faced with a situation
where the input distribution changes with each step.
• This is called internal covariate shift and makes training slow
• This problem is solved by normalizing the layer’s inputs over a mini-
batch and this process is therefore called Batch Normalization.

42
Batch Normalization
• We normalize the input layer by adjusting and scaling the activations.
• For example, when we have features from 0 to 1 and some from 1 to
1000, we should normalize them to speed up learning.
• If the input layer benefits from it, why not do the same thing for the values in the hidden layers, which are changing all the time, and get a 10x or greater improvement in training speed?
• To increase the stability of a neural network, batch
normalization normalizes the output of a previous activation
layer by subtracting the batch mean and dividing by the batch
standard deviation.

43
Why Normalize?

44
Batch Normalization
• The basic idea behind batch normalization is to limit covariate shift
by normalizing the activations of each layer (transforming the
inputs to be mean 0 and unit variance).
• This, supposedly, allows each layer to learn on a more stable
distribution of inputs, and would thus accelerate the training of the
network.
• In practice, restricting the activations of each layer to be
strictly 0 mean and unit variance can limit the expressive
power of the network.
• Therefore, in practice, batch normalization allows the network
to learn parameters gamma and beta that can convert the
mean and variance to any value that the network desires.

45
Batch Normalization
• Normalizing input values speeds up learning the parameters:

• Deeper network with activations a[1], a[2], a[3], ...

• Normalizing the activations a[2] lets the network learn the parameters W[3], b[3] more efficiently

46
Batch Normalization
• Can we normalize a[2] so as to learn W[3], b[3] faster?
• For practical purposes, there is a debate on whether to normalize a[2] (after the activation) or z[2] (before the activation); the latter is employed in practice (Andrew Ng)
• Steps:
  – Given some intermediate values z(i) in the net for some hidden layer
  – Compute (batch) mean:      μ = (1/m) Σᵢ z(i)
  – Compute (batch) variance:  σ² = (1/m) Σᵢ (z(i) − μ)²
  – Normalize z(i):            z(i)_norm = (z(i) − μ) / √(σ² + ε)

The superscript (i) corresponds to the ith example in the mini-batch
47
Batch Normalization
• The previous steps will ensure that the distribution of z in a layer has zero mean and unit variance
• We do not always want the hidden units to have zero mean and unit variance

  z̃(i) = γ · z(i)_norm + β

• γ (gamma) and β (beta) are learnable parameters of the model
• The effect of these parameters is that the mean (and variance) of z̃(i) can be whatever the network learns is best
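A minimal NumPy sketch of this transform (in practice γ and β are learned; here they are fixed for illustration):

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    mu = z.mean(axis=0)                      # batch mean
    var = z.var(axis=0)                      # batch variance
    z_norm = (z - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * z_norm + beta             # scale and shift with learnable parameters

z = np.random.randn(128, 64)                 # a mini-batch of 128 examples, 64 hidden units
z_tilde = batch_norm(z, gamma=np.ones(64), beta=np.zeros(64))
```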

48
BN: Implementation
• Import the Batch Normalization module from keras

• Batch Normalization is added after the Conv2D or Dense function calls, but before the following Activation function
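A minimal Keras sketch of this placement (the filter count is an illustrative assumption; x is an existing tensor):

```python
from tensorflow.keras.layers import Conv2D, BatchNormalization, Activation

x = Conv2D(64, (3, 3))(x)        # convolution without its activation
x = BatchNormalization()(x)      # normalize before the non-linearity
x = Activation('relu')(x)        # activation applied after batch normalization
```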

49
Batch Normalization
• Evolution of loss with and without BN

50
Batch Normalization at Test Time
• Training: data is provided one mini-batch at a time
• Test: we may have only one example at a time
• During training, for mini-batches of size m, the mean and variance required for scaling are computed from the whole mini-batch

51
Batch Normalization at Test Time
• There is NO mini-batch at test time
• Need a different way of coming up with mean and variance – Mean
and variance of one example does not make sense
• Estimated using exponentially weighted average (across mini-
batches in the training set)
• During training, consider layer l and a set of mini-batches X{1}, X{2}, X{3}, …
  – For each mini-batch we compute the corresponding mean and variance values μ{1}[l], μ{2}[l], μ{3}[l], …
  – An exponentially weighted average of these becomes an estimate of the mean for layer l (and similarly for the variance)

52
Batch Normalization at Test Time
• At test time, instead of the per-mini-batch computation

  z(i)_norm = (z(i) − μ) / √(σ² + ε)

• Compute z_norm for one test example, using the μ and σ² estimated during training:

  z_norm = (z − μ) / √(σ² + ε)

• And, using the β and γ learnt during training, compute:

  z̃ = γ · z_norm + β
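A minimal NumPy sketch of the test-time computation (running_mu and running_var are the exponentially weighted averages accumulated during training; the momentum value is an assumption):

```python
import numpy as np

def update_running_stats(running_mu, running_var, mu, var, momentum=0.9):
    # called once per mini-batch during training
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return running_mu, running_var

def batch_norm_test(z, running_mu, running_var, gamma, beta, eps=1e-5):
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)  # running estimates, not batch statistics
    return gamma * z_norm + beta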

53
Case Study: AlexNet

54
Case Study: AlexNet
• Input: 227x227x3 images
• First layer (CONV1): 96 11x11 filters applied at
stride 4
• The output volume size?
– (227-11)/4+1 = 55

55
Output Size
• Output size = (N − F) / stride + 1

56
Case Study: AlexNet
• Input: 227x227x3 images
• First layer (CONV1): 96 11x11 filters applied at
stride 4
• Output Volume: 55 x 55 x 96
• Total Number of Parameters?
– Parameters: (11*11*3)*96 = 35K
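A quick Python check of these two numbers (ignoring bias terms, as the slide's count does):

```python
def conv_output_size(n, f, stride, pad=0):
    return (n + 2 * pad - f) // stride + 1

print(conv_output_size(227, 11, 4))   # 55 -> output volume 55 x 55 x 96
print(11 * 11 * 3 * 96)               # 34848 weights, i.e. ~35K parameters
```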

57
Case Study: AlexNet
• Input: 227x227x3 images
• After CONV1: 55x55x96
• Second layer (POOL1): 3x3 filters applied at
stride 2
• Output volume?
– 27x27x96

58
Case Study: AlexNet
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)

59
Using Pre-Trained ConvNets
(Transfer Learning)

60
Using Pre-Trained Nets
• A Typical CNN has two parts:
– Convolutional base: composed of a stack of convolutional and pooling layers. The main goal of the convolutional base is to generate features from the image.
– Classifier: usually composed of fully connected layers. The main goal of the classifier is to classify the image based on the detected features. A fully connected layer is a layer whose neurons have full connections to all activations in the previous layer.

61
Using Pre-Trained Nets
• Deep learning models can automatically learn hierarchical feature
representations.
• Features computed by the first layer are general and can be reused
in different problem domains, while features computed by the last
layer are specific and depend on the chosen dataset and task
• A common misconception in the DL community is that
without a Google-esque amount of data, you can’t possibly
hope to create effective deep learning models.
• While data is a critical part of creating the network, the idea
of transfer learning has helped to lessen the data demands.
• Transfer Learning: Taking a pre-trained model and adapting
it to a given problem.

62
Using Pre-Trained Nets
• Strategy I: Train the entire model
– In this case, you use the architecture of the pre-trained
model and train it according to your dataset. You’re
learning the model from scratch, so you’ll need a large
dataset (and a lot of computational power).

63
Using Pre-Trained Nets
• Strategy II: Fine-Tune a pre-trained model
– Change the fully connected layer of the network to match
the data under study
– Continue back propagation and update parameters of all
or a subset of layers (Initial layers can be frozen)

64
Using Pre-Trained Nets
• It is possible to fine-tune all the layers of the ConvNet, or it’s
possible to keep some of the earlier layers fixed (due to overfitting
concerns) and only fine-tune some higher-level portion of the
network.
• Earlier layers of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors) that should be useful for many tasks, but later layers of the ConvNet become progressively more specific to the details of the classes contained in the original dataset.

65
Using Pre-Trained Nets
• Strategy III: Use CNN as Feature Extractor
– Freeze the convolutional base
– Pass the data through network and use the output of
convolutional base as features
– Feed features to another classifier
– Example:
• For AlexNet: 4096-D vector for every image that contains
the activations of the hidden layer immediately before the
classifier.
• Train Classifier on these features (e.g. SVM)

66
Summary: Using Pre-Trained ConvNets

Credits: Transfer Learning using Pre-trained models, Pedro Marcelino


67
Using Pre-Trained Nets: Possibilities

Credits: Transfer Learning using Pre-trained models, Pedro Marcelino


68
When to use What?

Credits: Transfer Learning using Pre-trained models, Pedro Marcelino


69
Implementation: Transfer Learning
• Load the model:

• Load Data and Labels


– Data must be 224x224x3

70
Implementation: Transfer Learning
• Convert labels to one hot encoding

• Create tensors to store features

71
Implementation: Transfer Learning
• Pass images through network using predict
function to get features

• Employ any classifier: feed it with the training features and respective labels
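A minimal sketch of this feature-extraction workflow (VGG16 and the SVM are assumptions consistent with the 224x224x3 input and the "e.g. SVM" note earlier; x_train and y_train are assumed to exist):

```python
from tensorflow.keras.applications import VGG16
from sklearn.svm import SVC

base = VGG16(weights='imagenet', include_top=False, pooling='avg')  # convolutional base only

# x_train: images shaped (num_samples, 224, 224, 3); y_train: class labels
features = base.predict(x_train)   # pass images through the frozen base to get one feature vector each
clf = SVC()                        # any classifier can be trained on these features
clf.fit(features, y_train)
```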

72
Implementation: Fine Tuning
• Load the model

• Freeze the initial layers

73
Implementation: Fine Tuning
• Create new model: Add classification layers on
top of convolutional base
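A minimal fine-tuning sketch (VGG16, the number of frozen layers, and the 10-class head are assumptions, not the slides' exact code):

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers[:-4]:
    layer.trainable = False                   # freeze the initial layers

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(10, activation='softmax'),   # new classification head matching the task at hand
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```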

74
References
• The material in these slides has been taken from the following
sources.
– Slides by CS231n Winter 2016 – Andrej Karpathy
– Convolutional Neural Networks (CNNs): An Illustrated
Explanation, Abhineet Saxena
– An Intuitive Explanation of Convolutional Neural
Networks
– A Beginner's Guide To Understanding Convolutional Neural
Networks, Adit Deshpande

75
