
Convolutional Neural Networks
Anantharaman Palacode Narayana Iyer
narayana dot Anantharaman at gmail dot com
5 Aug 2017
References
"A dramatic moment in the meteoric rise of deep learning came when a convolutional network won this challenge for the first time and by a wide margin, bringing down the state-of-the-art top-5 error rate from 26.1% to 15.3% (Krizhevsky et al., 2012), meaning that the convolutional network produces a ranked list of possible categories for each image and the correct category appeared in the first five entries of this list for all but 15.3% of the test examples. Since then, these competitions are consistently won by deep convolutional nets, and as of this writing, advances in deep learning have brought the latest top-5 error rate in this contest down to 3.6%."
– Goodfellow, Bengio, and Courville, Deep Learning, MIT Press, 2016
What is a convolutional neural network?
• Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
• Convolution is a mathematical operation that is linear in its inputs.
Types of inputs
• Inputs have a structure:
• Color images are three dimensional and so have a volume.
• Time-domain speech signals are 1-d, while frequency-domain representations (e.g. MFCC vectors) take a 2-d form; they can also be viewed as a time sequence, so speech can be modelled as 2 dimensional.
• Medical images (such as CT/MR/etc.) are multidimensional.
• Videos have an additional temporal dimension compared to stationary images.
• Variable-length sequences and time-series data are again multidimensional.

• Hence it makes sense to model these inputs as tensors instead of vectors.

• The classifier then needs to accept a tensor as input and perform the necessary machine learning task. In the case of an image, this tensor represents a volume.
CNNs are everywhere
• Image retrieval
• Object detection
• Self-driving cars
• Semantic segmentation
• Face recognition (FB tagging)
• Pose estimation
• Disease detection
• Speech recognition
• Text processing
• Analysing satellite data



CNNs for applications that involve images
• Why are CNNs well suited to processing images?
• Pixels in an image correlate with each other, but nearby pixels correlate strongly while distant pixels have little influence.
• Local features are important: local receptive fields.
• Invariance to translation (an affine transformation): the class of an image doesn't change with translation. We can build a feature detector that looks for a particular feature (e.g. an edge) anywhere in the image plane by sliding it across. A convolutional layer may have several such filters, constituting the depth dimension of the layer.
Fully connected layers
• Fully connected layers (such as the hidden layers of a traditional neural network) are agnostic to the structure of the input.
• They take inputs as vectors and generate an output vector.
• There is no requirement to share parameters unless it is forced in specific architectures. This blows up the number of parameters as the input and/or output dimensions increase (see the sketch after this list).
• Suppose we are to perform classification on an image of 100 x 100 x 3 dimensions.
• If we implement this with a feed-forward neural network that has an input, a hidden and an output layer, where hidden units (nh) = 1000 and output classes = 10:
• Input layer = 10k pixels * 3 = 30k values; the input-to-hidden weight matrix = 1k * 30k = 30M entries; the output layer matrix = 10 * 1000 = 10k.
• We may handle this by extracting features in a preprocessing step and presenting a lower-dimensional input to the neural network. But this requires expert-engineered features and hence domain knowledge.
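A minimal sketch of this parameter count in plain Python, using the layer sizes assumed above:

```python
# Parameter count for a fully connected network on a 100 x 100 x 3 image.
n_input = 100 * 100 * 3       # 30,000 input values
n_hidden = 1000               # hidden units (nh)
n_classes = 10                # output classes

hidden_weights = n_hidden * n_input     # 30,000,000
output_weights = n_classes * n_hidden   # 10,000
total = hidden_weights + n_hidden + output_weights + n_classes  # weights + biases

print(f"input-to-hidden weights: {hidden_weights:,}")
print(f"hidden-to-output weights: {output_weights:,}")
print(f"total parameters (with biases): {total:,}")
```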
Convolution

Convolution in 1 dimension:

$$y[n] = \sum_{k=-\infty}^{\infty} x[k]\, h[n-k]$$

Convolution in 2 dimensions:

$$y[n_1, n_2] = \sum_{k_1=-\infty}^{\infty} \sum_{k_2=-\infty}^{\infty} x[k_1, k_2]\, h[n_1-k_1,\ n_2-k_2]$$
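The infinite sums above reduce to finite sums for finite-length signals. A minimal NumPy sketch of the 1-D case, with np.convolve used only as a cross-check (the 2-D case extends the same double loop over both indices):

```python
import numpy as np

def conv1d(x, h):
    """Directly evaluate y[n] = sum_k x[k] * h[n - k] ('full' convolution)."""
    y = np.zeros(len(x) + len(h) - 1)
    for n in range(len(y)):
        for k in range(len(x)):
            if 0 <= n - k < len(h):
                y[n] += x[k] * h[n - k]
    return y

x = np.array([1.0, 2.0, 3.0])
h = np.array([0.5, 0.5])
print(conv1d(x, h))        # [0.5 1.5 2.5 1.5]
print(np.convolve(x, h))   # same result
```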
CNNs
Types of layers in a CNN:
• Convolution Layer
• Pooling Layer
• Fully Connected Layer
Convolution Layer
• A layer in a regular neural network takes a vector as input and outputs a vector.
• A convolution layer takes a tensor (a 3-d volume for RGB images) as input and generates a tensor as output.
Fig Credit: Lex Fridman, MIT, 6.S094
Slide Credit: Lex Fridman, MIT, 6.S094
Local Receptive Fields
• A filter (kernel) is applied to the input image like a moving window along the width and height.
• The depth of a filter matches that of the input.
• For each position of the filter, the dot product of the filter and the input is computed (an activation).
• The 2-d arrangement of these activations is called an activation map.
• The number of such filters constitutes the depth of the convolution layer.
Fig Credit: Lex Fridman, MIT, 6.S094

Convolution Operation between filter and image
• The convolution layer computes dot products between the filter and a patch of the image as the filter slides along the image.
• The step size of the slide is called the stride.
• Without any padding, the convolution process decreases the spatial dimensions of the output.
Fig Credit: A Karpathy, CS231n
Activation Maps
• Example:
• Consider a 32 x 32 x 3 image and a 5 x 5 x 3 filter.
• The convolution happens between a 5 x 5 x 3 chunk of the image and the filter: $w^T x + b$.
• The chunk flattens to a 75-dimensional vector, to which we add a bias term.
• With a stride of 1 and no padding, one filter gives a 28 x 28 x 1 activation.
• With 6 filters, we get a 28 x 28 x 6 output without padding.

• In the above example we have an activation map of 28 x 28 per filter.

• Activation maps are the feature inputs to the subsequent layer of the network.

• Without any padding, the 2D spatial extent of the activation map is smaller than that of the input for any stride >= 1 (except a 1 x 1 filter at stride 1).
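A minimal NumPy sketch of the example above, with random stand-ins for the learned filter weights and bias: one 5 x 5 x 3 filter slid over a 32 x 32 x 3 image at stride 1 without padding yields a 28 x 28 activation map:

```python
import numpy as np

img = np.random.randn(32, 32, 3)   # input volume
w = np.random.randn(5, 5, 3)       # one 5 x 5 x 3 filter (75 weights)
b = 0.1                            # bias term

out = np.zeros((28, 28))           # (32 - 5) / 1 + 1 = 28
for i in range(28):
    for j in range(28):
        patch = img[i:i+5, j:j+5, :]        # 5 x 5 x 3 chunk of the image
        out[i, j] = np.sum(patch * w) + b   # dot product w^T x + b

print(out.shape)  # (28, 28): one activation map
# Six such filters, stacked, give a 28 x 28 x 6 output volume.
```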
Stacking Convolution Layers

Fig Credit: A Karpathy, CS231n
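The shrinking spatial extent under stacking can be checked with the standard output-size formula; a small sketch, where the filter counts (6 and 10) are illustrative assumptions rather than values from the slide:

```python
def conv_output_size(w, f, p=0, s=1):
    """Spatial output size of a convolution: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

w = 32  # a 32 x 32 x 3 input
for n_filters, f in [(6, 5), (10, 5)]:
    w = conv_output_size(w, f)
    print(f"after a {f}x{f} conv with {n_filters} filters: {w} x {w} x {n_filters}")
# after a 5x5 conv with 6 filters: 28 x 28 x 6
# after a 5x5 conv with 10 filters: 24 x 24 x 10
```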


Feature Representation as a hierarchy
Padding
• The spatial (x, y) extent of the output produced by the convolutional layer is less than the respective dimensions of the input (except for the special case of a 1 x 1 filter with stride 1).

• As we add more layers and use larger strides, the output surface dimensions keep reducing, and this may impact accuracy.

• Often we want to preserve the spatial extent during the initial layers and downsample at a later stage.

• Padding the input with suitable values (padding with zeros is common) helps preserve the spatial size.
Zero Padding the border

Fig Credit: A Karpathy, CS231n
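A small NumPy sketch of zero padding: for a 5 x 5 filter at stride 1, a border of P = (F - 1)/2 = 2 zeros on each side keeps a 32 x 32 input at 32 x 32:

```python
import numpy as np

img = np.random.randn(32, 32, 3)
p = 2  # (F - 1) / 2 for a 5 x 5 filter
padded = np.pad(img, ((p, p), (p, p), (0, 0)), mode="constant")
print(padded.shape)         # (36, 36, 3)
print((36 - 5) // 1 + 1)    # 32: spatial size preserved after convolution
```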


Hyperparameters of the convolution layer
• Filter size
• # Filters
• Stride
• Padding
Fig Credit: A Karpathy, CS231n
Pooling Layer
• Pooling is a downsampling operation.

• The rationale is that the "meaning" embedded in a piece of an image can be captured using a small subset of "important" pixels.

• Max pooling and average pooling are the two most common operations.

• The pooling layer doesn't have any trainable parameters.
Fig Credit: A Karpathy, CS231n
Max Pooling Illustration
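A minimal NumPy sketch of max pooling, assuming the common 2 x 2 window with stride 2; note there is nothing to train, only the window size and stride:

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max pooling over a 2-D activation map."""
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = x[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
print(max_pool(x))  # [[6. 8.]
                    #  [3. 4.]]
```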
Popular Network Architectures
Current trend: Deeper Models
• CNNs consistently outperform other approaches for the core tasks of computer vision.
• Deeper models work better.
• Increasing the number of parameters in the layers of a CNN without increasing their depth is not effective at increasing test set performance.
• Shallow models overfit at around 20 million parameters while deep ones can benefit from having over 60 million.
• Key insight: a model performs better when it is architected to reflect a composition of simpler functions rather than a single complex function. This may also be explained by viewing the computation as a chain of dependencies.
VGG Net
ResNet
Core Tasks of Computer Vision
• Classification: Given an image, assign a label. Output: class label. Metric: accuracy.
• Localization: Determine the bounding box containing the object in the given image. Output: box given by (x1, y1, x2, y2). Metric: ratio of intersection to union (overlap) between the ground truth and the predicted box.
• Object Detection: Given an image, detect all the objects and their locations in the image. Output: (label, box) for each object. Metrics: Mean Average Best Overlap (MABO), mean Average Precision (mAP).
• Semantic Segmentation: Given an image, assign each pixel to a class label, so that we can view the image as a set of labelled segments. Output: a set of image segments. Metrics: classification metrics, intersection-over-union overlap.
• Instance Segmentation: Same as semantic segmentation, but each instance of a segment class is identified uniquely. Output: a set of image segments.
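Several of the metrics above rest on intersection over union. A minimal sketch for axis-aligned boxes in the (x1, y1, x2, y2) form used in the list above:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```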
Object Localization
• Given an image containing an object of interest, determine the bounding box for the object.
• Classify the object.
Slide Credit: A Karpathy, CS231n
Datasets for evaluation
• ImageNet challenges provide a platform for researchers to benchmark their novel algorithms.
• PASCAL VOC 2010 is great for small-scale experiments. About 1.3 GB download size.
• MS COCO datasets are available for tasks like image captioning. The download size is huge, but selective download is possible.
