UNIT 2 Study Materials 1


UNIT - II

Convolutional Networks (or) Convolutional Neural Networks (or) CNN
UNIT II CONVOLUTIONAL NEURAL NETWORKS
Convolution Operation -- Sparse Interactions -- Parameter Sharing -- Equivariance -- Pooling -- Convolution Variants: Strided -- Tiled -- Transposed and Dilated Convolutions; CNN Learning: Nonlinearity Functions -- Loss Functions -- Regularization -- Optimizers -- Gradient Computation.

Basics of Convolution
A convolutional network is a specialized kind of neural network for processing data that has a known grid-like topology.
Examples: time-series data - a 1D grid taking samples at regular time intervals; image data - which can be thought of as a 2D grid of pixels. The name "convolutional neural network" indicates that the network employs convolution in place of general matrix multiplication in at least one of its layers.
A Convolutional Neural Network (CNN) is a deep learning algorithm that can recognize
and classify features in images for computer vision. It is a multi-layer neural network
designed to analyze visual inputs and perform tasks such as image classification,
segmentation and object detection, which can be useful for autonomous vehicles. CNNs
can also be used for deep learning applications in healthcare, such as medical imaging.

There are two main parts to a CNN:

• A convolution tool that separates the various features of the image for analysis
• A fully connected layer that uses the output of the convolution layers to predict the best description for the image.

Convolution Operation: Convolution is an operation on two functions of a real-valued argument. Suppose we are tracking the location of a spaceship with a laser sensor. Our laser sensor provides a single output x(t), the position of the spaceship at time t. Both x and t are real valued; that is, we can get a different reading from the laser sensor at any instant in time. If our laser sensor is noisy, the spaceship's position can be estimated by a weighted average that gives more weight to recent measurements. We obtain a new function s providing a smoothed estimate of the position of the spaceship.
In convolutional network terminology, the first argument (in this example, the function x) to the convolution is often referred to as the input, and the second argument (the weighting function) as the kernel. The output is sometimes referred to as the feature map.
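As a minimal illustration (a sketch with made-up sensor readings, not from the original text), the smoothed estimate s(t) = sum over a of x(a) w(t - a) can be computed with NumPy:

```python
import numpy as np

# Hypothetical noisy position readings x(t) from the laser sensor.
x = np.array([0.0, 1.1, 1.9, 3.2, 3.8, 5.1, 6.0])

# Kernel w: a weighting function summing to 1, so the result is a
# weighted average that gives the most weight to the latest reading.
w = np.array([0.5, 0.3, 0.2])

# s = x * w, the discrete convolution of input and kernel.
s = np.convolve(x, w, mode="valid")
print(s)  # smoothed position estimates (the feature map)
```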

Discrete Convolution:

The input is a multidimensional array of data, and the kernel is a multidimensional array of parameters adapted by the learning algorithm. Discrete convolution can be viewed as multiplication by a matrix, but the matrix has several entries constrained to be equal to other entries. For example, for univariate discrete convolution, each row of the matrix is constrained to be equal to the row above shifted by one element. This is known as a Toeplitz matrix. In two dimensions, a doubly block circulant matrix corresponds to convolution.
The intuition behind the convolution of 'f' and 'g' is the degree to which 'f' and 'g' overlap as f sweeps over the function g.
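To make the matrix view concrete, here is a small sketch (illustrative values, not from the source) that builds the Toeplitz matrix for a 1D kernel and checks that multiplying by it matches direct convolution:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # input signal
k = np.array([1.0, 2.0, 3.0])            # 1D kernel

# Build the Toeplitz matrix for 'valid' convolution: each row equals
# the row above shifted right by one element.
n_out = len(x) - len(k) + 1
T = np.zeros((n_out, len(x)))
for i in range(n_out):
    T[i, i:i + len(k)] = k[::-1]  # convolution flips the kernel

# Multiplication by T is identical to discrete convolution of x with k.
assert np.allclose(T @ x, np.convolve(x, k, mode="valid"))
print(T @ x)
```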
A CNN is composed of several kinds of layers:
• Convolutional layer - creates a feature map to predict the class probabilities for each feature by applying a filter that scans the whole image, a few pixels at a time.
• Pooling layer (downsampling) - scales down the amount of information the convolutional layer generated for each feature and maintains the most essential information (the process of the convolutional and pooling layers usually repeats several times).
• Fully connected input layer - "flattens" the outputs generated by previous layers to turn them into a single vector that can be used as an input for the next layer.
• Fully connected layer - applies weights over the input generated by the feature analysis to predict an accurate label.
• Fully connected output layer - generates the final probabilities to determine a class for the image.
Convolutional Neural Network

Convolutional Neural Networks are a special type of feed-forward artificial neural network in which the connectivity pattern between the neurons is inspired by the visual cortex.

The visual cortex encompasses small regions of cells that are sensitive to specific regions of the visual field. Individual neuronal cells in the brain fire only when edges of a certain orientation are present: some neurons respond when exposed to vertical edges, while others respond to horizontal or diagonal edges. This behaviour is the motivation behind Convolutional Neural Networks.
Convolutional Neural Networks, also called ConvNets, are simply neural networks that share their parameters. Suppose that there is an image, embodied as a cuboid with a length, a width, and a height, where the depth is represented by the red, green, and blue channels.

Now assume that we take a small patch of this image and run a small neural network on it, producing k outputs, represented vertically. If we slide this small neural network across the whole image, the result is another image with a different width, height, and depth. Instead of just the R, G, and B channels, we now have more channels, each with smaller width and height; this is the concept of convolution. If the patch size were the same as the image size, it would be a regular neural network. It is because of this small patch that we have fewer weights.
Mathematically it can be understood as follows:

• The convolutional layers encompass a set of learnable filters, such that each filter has a small width and height but the same depth as the provided input volume (if the image is the input layer, the depth is typically 3).
• Suppose we want to run the convolution over an image of dimensions 34x34x3; then the size of a filter can be a x a x 3, where a can be 3, 5, 7, etc. The filter must be small in comparison to the dimensions of the image.
• Each filter slides over the input volume during the forward pass. It slides step by step; each individual step is called a stride, which may have a value of 2, 3, or 4 for higher-dimensional images. At each step the dot product between the filter's weights and the corresponding patch of the input volume is calculated.
• Sliding each filter produces a 2-dimensional output; stacking these outputs together yields an output volume whose depth equals the number of filters. The network then learns all the filters.

Question - 1: Working (or) Operation of CNN (or) Architecture of CNN

For Understanding – 1

Generally, a Convolutional Neural Network has the following layers:

• Input: If the image has a width of 32, a height of 32, and three channels (R, G, B), then this layer holds the raw pixel values of the image ([32x32x3]).
• Convolution: It computes the output of those neurons that are associated with local regions of the input, such that each neuron calculates a dot product between its weights and the small region of the input volume to which it is actually linked. For example, if we choose to incorporate 12 filters, then it will result in a volume of [32x32x12].
• ReLU Layer: It applies an activation function elementwise, such as max(0, x) thresholding at zero. The size of the volume is unchanged ([32x32x12]).
• Pooling: This layer performs a downsampling operation along the spatial dimensions (width, height), resulting in a [16x16x12] volume.
• Locally Connected: It can be defined as a regular neural network layer that receives an input from the preceding layer and computes the class scores, resulting in a 1-dimensional array whose size equals the number of classes.
We will start with an input image to which we apply multiple feature detectors, also called filters, to create the feature maps that make up a convolution layer. On top of that layer, we apply the ReLU (Rectified Linear Unit) to remove linearity, i.e. to increase the non-linearity in our images.
Next, we apply a pooling layer to our convolutional layer, so that from every feature map we create a pooled feature map. The main purpose of the pooling layer is to ensure that we have spatial invariance in our images; it also helps to reduce the size of the images and to avoid overfitting of our data. After that, we flatten all of the pooled feature maps into one long vector of values and input these values into an artificial neural network. Lastly, we feed this into the locally connected layer to achieve the final output.

Building a CNN

Basically, a Convolutional Neural Network is built by adding an extra kind of layer, called a convolutional layer, which gives an eye to the Artificial Intelligence or Deep Learning model: with its help we can easily take a 3D frame or image as input, as opposed to our previous artificial neural network, which could only take an input vector containing some features as information. Here we are going to add a convolutional layer at the front, which will be able to visualize images just like humans do.

In our dataset, we have all the images of cats and dogs in the training and test set folders. We are going to train our CNN model on 4000 images of cats and 4000 images of dogs from the training set, and then evaluate the model on 1000 new images of cats and 1000 new images of dogs from the test set, on which our model was not trained. So, we are actually going to build and train a Convolutional Neural Network to recognize whether there is a dog or a cat in the image.
In the second part, we will build the whole architecture of CNN. We will initialize the CNN
as a sequence of layers, and then we will add the convolution layer followed by adding the
max-pooling layer. Then we will add the second convolutional layer to make it a deep neural
network as opposed to a shallow neural network. Next, we will proceed to the flattening layer
to flatten the result of all the convolutions and pooling into a one-dimensional vector, which
will become the input of a fully connected neural network. Finally, we will connect all this to
the output layer.

In the third part, we will first compile the CNN and then train it on the training set. Finally, we will make single predictions to test our model, deploying our CNN on different images: one that has a dog and one that has a cat.
So, this was just a brief description of how we will build our CNN model, let's get started
with its practical implementation.
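The steps above can be sketched in code. Below is a minimal, hedged Keras sketch of such a model; the filter counts, the 64x64 input size, and the dense layer width are illustrative assumptions rather than values given in the text, and the data-loading pipeline for the cat/dog images is omitted:

```python
from tensorflow.keras import layers, models

# Initialize the CNN as a sequence of layers (binary cat-vs-dog classifier).
cnn = models.Sequential([
    # First convolutional layer + max pooling (assumed: 32 filters of 3x3).
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D(pool_size=(2, 2)),
    # Second convolutional layer to make the network deep rather than shallow.
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    # Flatten the results of all the convolutions and pooling into a 1D vector.
    layers.Flatten(),
    # Fully connected layer, then the output layer.
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability of 'dog' vs 'cat'
])

# Third part: compile the CNN before training it on the training set.
cnn.compile(optimizer="adam", loss="binary_crossentropy",
            metrics=["accuracy"])
```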

For Understanding – 2

Introduction
A convolutional neural network (CNN) is a network architecture for deep learning that learns directly from data. CNNs are particularly useful for finding patterns in images to recognize objects. They can also be quite effective for classifying non-image data such as audio, time series, and signal data.

Kernel or Filter or Feature Detectors


In a convolutional neural network, the kernel is nothing but a filter that is used to extract features from the images.
Formula: output size = (i - k) + 1
i -> size of input, k -> size of kernel
Stride
Stride is a parameter of the neural network's filter that modifies the amount of movement over the image or video. With stride 1 the filter moves one pixel at a time; with stride 2 the filter jumps two pixels at each step, skipping the pixel in between.
Formula: output size = floor((i - k) / s) + 1
i -> size of input, k -> size of kernel, s -> stride

Padding
Padding is a term relevant to convolutional neural networks; it refers to the number of pixels added to an image when it is being processed by the kernel of a CNN. For example, if the padding in a CNN is set to zero, then every pixel value that is added will be of value zero.
When we use a filter or kernel to scan the image, the size of the image becomes smaller. We often have to avoid that, because we want to preserve the original size of the image to extract low-level features. Therefore, we add some extra pixels outside the image.
Formula: output size = floor((i - k + 2p) / s) + 1
i -> size of input, k -> size of kernel, s -> stride, p -> padding
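These formulas can be collected into one small helper; the following is a hedged sketch (function name and example values are assumptions for illustration):

```python
import math

def conv_output_size(i, k, s=1, p=0):
    """Output size = floor((i - k + 2p) / s) + 1."""
    return math.floor((i - k + 2 * p) / s) + 1

print(conv_output_size(6, 3))            # kernel only: (6 - 3) + 1 = 4
print(conv_output_size(6, 3, s=2))       # with stride: floor(3 / 2) + 1 = 2
print(conv_output_size(6, 3, s=1, p=1))  # with padding: 6 (size preserved)
```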
Pooling
Pooling in convolutional neural networks is a technique for generalizing features extracted by
convolutional filters and helping the network recognize features independent of their location
in the image.

Flatten
Flattening is used to convert all the resultant 2-Dimensional arrays from pooled feature maps
into a single long continuous linear vector. The flattened matrix is fed as input to the fully
connected layer to classify the image.

Layers used to build CNN


Convolutional neural networks are distinguished from other neural networks by their superior
performance with image, speech, or audio signal inputs. They have three main types of
layers, which are:
• Convolutional layer
• Pooling layer
• Fully-connected (FC) layer
Convolutional layer
This layer is the first layer that is used to extract the various features from the input images. In this layer, we use a filter or kernel to extract features from the input image.

Pooling layer
The primary aim of this layer is to decrease the size of the convolved feature map to reduce
computational costs. This is performed by decreasing the connections between layers and
independently operating on each feature map. Depending upon the method used, there are
several types of Pooling operations. We have Max pooling and average pooling.
Fully-connected layer
The Fully Connected (FC) layer consists of the weights and biases along with the neurons
and is used to connect the neurons between two different layers. These layers are usually
placed before the output layer and form the last few layers of a CNN Architecture.
Dropout
Another typical characteristic of CNNs is a Dropout layer. The Dropout layer is a mask that
nullifies the contribution of some neurons towards the next layer and leaves unmodified all
others.

Activation Function
An Activation Function decides whether a neuron should be activated or not. This means that
it will decide whether the neuron’s input to the network is important or not in the process of
prediction. There are several commonly used activation functions such as the ReLU,
Softmax, tanH, and the Sigmoid functions. Each of these functions has a specific usage.
Sigmoid — For a binary classification in the CNN model
tanH - The tanh function is very similar to the sigmoid function. The only difference is that it
is symmetric around the origin. The range of values, in this case, is from -1 to 1.
Softmax- It is used in multinomial logistic regression and is often used as the last activation
function of a neural network to normalize the output of a network to a probability distribution
over predicted output classes.
ReLU - The main advantage of using the ReLU function over other activation functions is that it does not activate all the neurons at the same time.

For Understanding – 3

Layers in a Convolutional Neural Network


A convolution neural network has multiple hidden layers that help in extracting information from an image. The important layers in a CNN are:
1. Convolution layer
2. ReLU layer
3. Pooling layer
4. Fully connected layer
5. ReLU layer/ Activation Layer
6. Flattening
7. Output Layer
Convolution Layer
This is the first step in the process of extracting valuable features from an image. A
convolution layer has several filters that perform the convolution operation. Every image is
considered as a matrix of pixel values.
Consider the following 5x5 image whose pixel values are either 0 or 1. There’s also a filter
matrix with a dimension of 3x3. Slide the filter matrix over the image and compute the dot
product to get the convolved feature matrix.
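A concrete sketch of this computation (the pixel values below are illustrative, chosen to match the 5x5 image and 3x3 filter setting described above):

```python
import numpy as np

# 5x5 binary image and a 3x3 filter, as in the description above.
image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

# Slide the filter over the image and take the dot product at each position.
out = np.zeros((3, 3), dtype=int)
for r in range(3):
    for c in range(3):
        out[r, c] = np.sum(image[r:r+3, c:c+3] * kernel)

print(out)  # the 3x3 convolved feature matrix
```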

ReLU layer
ReLU stands for the rectified linear unit. Once the feature maps are extracted, the next step is
to move them to a ReLU layer.
ReLU performs an element-wise operation and sets all the negative pixels to 0. It introduces non-linearity to the network, and the generated output is a rectified feature map.

The original image is scanned with multiple convolutions and ReLU layers for locating the
features.

Pooling Layer

Pooling is a down-sampling operation that reduces the dimensionality of the feature map. The
rectified feature map now goes through a pooling layer to generate a pooled feature map.
The pooling layer uses various filters to identify different parts of the image like edges,
corners, body, feathers, eyes, and beak.


The next step in the process is called flattening. Flattening is used to convert all the resultant
2-Dimensional arrays from pooled feature maps into a single long continuous linear vector.

The flattened matrix is fed as input to the fully connected layer to classify the image.
Here’s how exactly CNN recognizes a bird:
• The pixels from the image are fed to the convolutional layer that performs the
convolution operation
• It results in a convolved map
• The convolved map is applied to a ReLU function to generate a rectified feature map
• The image is processed with multiple convolutions and ReLU layers for locating the
features
• Different pooling layers with various filters are used to identify specific parts of the
image
• The pooled feature map is flattened and fed to a fully connected layer to get the final
output

• Activation Layer
The activation layer introduces nonlinearity into the network by applying an activation
function to the output of the previous layer. This is crucial for the network to learn complex
patterns. Common activation functions, such as ReLU, Tanh, and Leaky ReLU, transform the
input while keeping the output size unchanged.
• Flattening
After the convolution and pooling operations, the feature maps still exist in a multi-
dimensional format. Flattening converts these feature maps into a one-dimensional vector.
This process is essential because it prepares the data to be passed into fully connected layers
for classification or regression tasks.
• Output Layer
In the output layer, the final result from the fully connected layers is processed through a
logistic function, such as sigmoid or softmax. These functions convert the raw scores into
probability distributions, enabling the model to predict the most likely class label.

Question – 2: Analyze and explain the three main motivations behind Convolutional Nets (or) CNN

For Understanding – 1

The convolution neural networks are suitable for image processing as they take into account
the neighboring pixels, as for images the neighboring pixels of a pixel get a say in defining it.
Convolution neural networks leverage three important ideas that can help improve machine
learning systems: sparse interactions, parameter sharing, and equivariant
representations. Moreover, convolution neural networks provide a means for working with
inputs of variable size.

Sparse Connectivity
The traditional neural networks use matrix multiplication by a matrix of parameters with a
separate parameter describing the interaction between each input and output unit. This means
that every input unit is connected to every output unit. Convolution neural networks,
however, have sparse connections (also known as sparse weights).

(Figures: fully connected traditional neural network vs sparsely connected convolution neural network)

This is accomplished by making the kernel (a matrix of weights that are multiplied with the
input to extract relevant features) smaller than the input image. The kernels can detect small,
meaningful features taking up only a small part of the image. This means that we only need to
store a few parameters. This allows the network to efficiently describe complicated
interactions between many variables by constructing interactions from simple building blocks
which each describe only sparse interactions.
Parameter Sharing
Parameter sharing refers to the use of the same parameters for more than one function in a
model. In a traditional neural net, each element of the neural network has its own weight, i.e.
each element of the weight matrix is used only once while computing the output of a layer. It
is multiplied by one element of the input and then never revisited.

We can also say that the convolution networks have tied weights because the weights applied
to one input are tied to the value of the weight applied elsewhere. This means that rather than
learning a different set of parameters for each location, we use only one set.
Equivariance Representation
In convolutions, the particular case of parameter sharing causes the networks to have a
property called equivariance. To say a function is equivariant means that if the input changes,
the output changes in the same way.
When processing time series data, this means that convolution produces a sort of timeline of
when different features appear in the input. If we move an event later in time, the exact same
representation of it will appear in the output, just later in time.
Similarly, for images, convolution creates a 2-D map of where certain features appear in the
input. If we move the object in the input, it will move by the same amount in the output. This
is useful when we know that some function of the same number of pixels is useful when
applied to multiple input locations.
How are Convolution Neural Networks time and space efficient?
Let m be the number of inputs and n be the number of outputs. For traditional networks, we would require m*n parameters and O(m*n) runtime. When we limit the number of connections to k using convolution networks, the total parameters required will be k*n and the runtime will be O(k*n), where k < m. k can be several orders of magnitude smaller than m.
Also, since we use the idea of parameter sharing, we only need to store k parameters, thus reducing the memory requirement. Convolution is thus dramatically more efficient than matrix multiplication in terms of memory requirements and statistical efficiency.
Convolutional Neural Networks mark a significant advancement in the field of machine
learning and artificial intelligence, particularly in image and video processing. They are not
only efficient in terms of computational resources but also proficient at capturing the
hierarchical pattern in data. By implementing sparse connectivity, parameter sharing, and
equivariance representation, CNNs are able to handle complex visual tasks that are usually
challenging for traditional neural networks.
Whether it’s facial recognition in security systems or diagnosing diseases from medical
imagery, the applications of CNNs are immense and continue to expand. As we continue to
generate more and more data, it is undeniable that the role of CNNs will become even more
crucial. If we harness their power efficiently and ethically, we’re just scratching the surface
of what’s possible.

For Understanding – 2

Motivation
Sparse Interactions
Each output unit is connected to (affected by) only a subset of the input units.

Fig: Sparse connectivity (upper) vs full connectivity (lower). The grey shaded nodes in the input show the receptive field of the node in the first layer.
If there are m input units and n output units, a fully connected layer would require mn parameters (one per connection) and correspondingly the number of operations would scale as O(mn). On the other hand, if each output unit is sparsely connected to k input units, the layer requires kn parameters and O(kn) computations. In general, for a convolutional layer, the number of output units is a function of kernel size, stride and padding (discussed later). This effectively makes n a function of m. Keeping this in mind, O(mn) ~ O(m²) while O(kn) ~ O(km). By keeping k several orders of magnitude smaller than m, we see that the computational savings from sparse connections are huge.
As a practical example, consider a 3x3 kernel operating on a black and white image of dimensions 224x224 (this is a very standard setting of kernel size and image size, and can be seen in the first layer of VGGNet). For same padding and a stride of 1 (discussed in detail later), the output size will also be 224x224. If this first layer were a fully connected layer, the number of parameters would be ~2.5 billion (= 224² x 224²). On the other hand, using a sparse layer with each output connected to 9 (= 3x3) inputs, the number of parameters is ~451 thousand (= 224² x 9). In fact, a convolutional layer also incorporates parameter sharing (see below), and this number decreases further.
Parameter Sharing
In the previous section, we saw that the output units are only connected to a small number of
input units. In a convolutional layer, each kernel weight is used at every input position
(except maybe at boundaries where different padding rules apply as discussed below), i.e.
parameters used to compute different output units are tied together. By tied together, we mean that their values are the same at all times. This means that even during training, they are updated by the same amount, by collecting the gradients from all output units.
Parameter sharing allows models to capture local connectivity while simultaneously
computing the same features at different spatial locations. We will see the use of this property
soon.
Here we make a short detour to section 5 for discussing locally connected layers and tiled
convolution.
• Locally connected layer/unshared convolution: The connectivity graph of
convolution operation and locally connected layer is the same. The only difference is
that parameter sharing is not performed, i.e. each output unit performs a linear
operation on its neighbourhood but the parameters are not shared across output units.
This allows models to capture local connectivity while allowing different features to
be computed at different spatial locations. This however requires much more
parameters than the convolution operation.
• Tiled convolution is a sort of middle step between locally connected layer and
traditional convolution. It uses a set of kernels that are cycled through. This reduces
the number of parameters in the model while allowing for some freedom provided by
unshared convolution.

Comparison of connectivity and parameters of locally-connected (top), tiled (middle) and standard convolution (bottom)
The parameter complexity and computation complexity can be obtained as below. Note that:
• m = number of input units
• n = number of output units
• k = kernel size
• l = number of kernels in the set (for tiled convolution)
You can see now that the quantity of ~451 thousand parameters corresponds to the locally
connected convolution operation. If we use a set of 200 kernels, the number of parameters for
tiled convolution is 1.8 thousand. For a traditional convolution operation, this number is 9
parameters.
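A quick sketch verifying these counts in code (assuming the 224x224 image, 3x3 kernel, same-padding setting from above; the choice of 200 kernels follows the text):

```python
m = 224 * 224  # input units (224x224 black-and-white image)
n = 224 * 224  # output units (same padding, stride 1)
k = 3 * 3      # kernel size
l = 200        # number of kernels in the set (tiled convolution)

print(f"fully connected:   {m * n:,}")  # ~2.5 billion parameters
print(f"locally connected: {n * k:,}")  # ~451 thousand parameters
print(f"tiled (l = 200):   {l * k:,}")  # 1,800 parameters
print(f"standard conv:     {k:,}")      # 9 parameters
```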
Equivariance
A function f is said to be equivariant to a function g if
f(g(x)) = g(f(x))
i.e. if input changes, the output changes in the same way.
Parameter sharing in a convolutional network provides equivariance to translation. What
this means is that translation of the image results in corresponding translation in the output
map (except maybe for boundary pixels). The reason for this is very intuitive: the same
feature is being computed at all input points.
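A small sketch demonstrating translation equivariance on a toy 1D signal (the data and helper function are assumptions for illustration):

```python
import numpy as np

def conv1d_valid(x, k):
    # Plain 'valid' sliding dot product, as computed by CNN layers.
    return np.array([np.dot(x[i:i + len(k)], k)
                     for i in range(len(x) - len(k) + 1)])

x = np.array([0., 0., 1., 2., 1., 0., 0., 0.])  # input with one 'event'
k = np.array([1., 2., 1.])                      # feature detector

g_x = np.roll(x, 1)                    # g: translate the input one step later

print(np.roll(conv1d_valid(x, k), 1))  # g(f(x)): convolve, then shift
print(conv1d_valid(g_x, k))            # f(g(x)): shift, then convolve
# Both print [0. 1. 4. 6. 4. 1.]: the feature simply appears one step later.
```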

Question – 3: Max pooling introduces invariance and Pooling with downsampling.
Max pooling introduces invariance
The use of pooling can be viewed as adding an infinitely strong prior that the function the layer learns must be invariant to small translations. When this assumption is correct, it can greatly improve the statistical efficiency of the network.
Because pooling summarizes the responses over a whole neighborhood, it is possible to use fewer pooling units than detector units, by reporting summary statistics for pooling regions spaced k pixels apart rather than 1 pixel apart.

Fig: Learned Invariance Example. A pooling unit that pools over multiple features that are learned with separate parameters can learn to be invariant to transformations of the input. Here we show how a set of three learned filters and a max pooling unit can learn to become invariant to rotation. All three filters are intended to detect a handwritten 5. Each filter attempts to match a slightly different orientation of the 5. When a 5 appears in the input, the corresponding filter will match it and cause a large activation in a detector unit. The max pooling unit then has a large activation regardless of which detector unit was activated. We show here how the network processes two different inputs, resulting in two different detector units being activated. The effect on the pooling unit is roughly the same either way. This principle is leveraged by maxout networks.
Pooling with downsampling.
Here we use max pooling with a pool width of three and a stride between
pools of two. This reduces the representation size by a factor of two,
which reduces the computational and statistical burden on the next layer.
Note that the rightmost pooling region has a smaller size but must be
included if we do not want to ignore some of the detector units.
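A minimal sketch of this setting, with pool width three and stride two applied to made-up detector outputs:

```python
import numpy as np

detector = np.array([0.1, 1.0, 0.2, 0.1, 0.5, 0.3])  # detector unit outputs

pool_width, stride = 3, 2
pooled = [detector[i:i + pool_width].max()
          for i in range(0, len(detector), stride)]
print(pooled)  # [1.0, 0.5, 0.5]: roughly half as many units as detector
               # outputs; the rightmost pooling region covers fewer units
```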
Question - 4 - CNN: Stride and Padding

STRIDE

What is Stride?
Stride is a parameter that dictates the movement of the kernel, or filter, across the input data,
such as an image. When performing a convolution operation, the stride determines how many
units the filter shifts at each step. This shift can be horizontal, vertical, or both, depending on
the stride's configuration.
For example, a stride of 1 moves the filter one pixel at a time, while a stride of 2 moves it
two pixels. A larger stride will produce a smaller output dimension, effectively
downsampling the image.
Importance of Stride
The choice of stride affects the model in several ways:
• Output Size: A larger stride will result in a smaller output spatial dimension. This is
because the filter covers a larger area of the input image with each step, thus reducing
the number of positions it can occupy.
• Computational Efficiency: Increasing the stride can decrease the computational load.
Since the filter moves more pixels per step, it performs fewer operations, which can
speed up the training and inference processes.
• Field of View: A higher stride means that each step of the filter takes into account a
wider area of the input image. This can be beneficial when the model needs to capture
more global features rather than focusing on finer details.
• Downsampling: Strides can be used as an alternative to pooling layers for
downsampling the input. Pooling layers, such as max pooling, are often used to
reduce the spatial dimensions and to introduce invariance to small translations.
However, increasing the stride in a convolutional layer can achieve a similar effect
without the need for an additional pooling layer.
Stride in Practice
In practice, stride is often set to 1 or 2. A stride of 1 is common when the model needs to
maintain a high resolution of features, which is particularly important in the initial layers of
the network. A stride of 2 or more may be used in deeper layers or when the input images are
large, and the model needs to reduce dimensionality to control the number of parameters and
computational cost.
It's important to note that while increasing the stride can improve computational efficiency, it
may also lead to a loss of information. Strides larger than 1 skip over pixels, which could
contain useful information for feature extraction. Therefore, the choice of stride is a trade-off
that needs to be carefully considered based on the specific task and dataset.
Calculating Output Size with Stride
The output size of a convolutional operation can be calculated using the following formula:
O = ((W - K + 2P) / S) + 1
Where:
• O is the output size
• W is the input size (width or height)
• K is the kernel size
• P is the padding
• S is the stride
This formula helps to determine the dimensions of the output feature map, which is essential
for designing and understanding the architecture of a CNN.
Conclusion
Stride is a fundamental hyperparameter in convolutional neural networks that influences the
model's performance and efficiency. It controls how the convolutional filters interact with the
input data and affects the size of the output feature maps. Understanding and selecting the
appropriate stride is crucial for optimizing CNNs for various tasks in image and video
analysis, as well as other domains where CNNs are applicable.
When designing a convolutional neural network, one must consider the implications of stride
on the network's ability to capture relevant features, computational requirements, and the
overall performance of the model. Balancing these factors is key to developing effective and
efficient CNNs for machine learning applications.

PADDING

During convolution, the size of the output feature map is determined by the size of the input feature map, the size of the kernel, and the stride. If we simply apply the kernel on the input feature map, then the output feature map will be smaller than the input. This can result in the loss of information at the borders of the input feature map. In order to preserve the border information we use padding.
What Is Padding
Padding is a technique used to preserve the spatial dimensions of the input image after convolution operations on a feature map. Padding involves adding extra pixels around the border of the input feature map before convolution.
This can be done in two ways:
• Valid Padding: In the valid padding, no padding is added to the input feature map,
and the output feature map is smaller than the input feature map. This is useful when
we want to reduce the spatial dimensions of the feature maps.
• Same Padding: In the same padding, padding is added to the input feature map such
that the size of the output feature map is the same as the input feature map. This is
useful when we want to preserve the spatial dimensions of the feature maps.
The number of pixels to be added for padding can be calculated based on the size of the
kernel and the desired output of the feature map size. The most common padding value is
zero-padding, which involves adding zeros to the borders of the input feature map.
Padding can help in reducing the loss of information at the borders of the input feature map
and can improve the performance of the model. However, it also increases the computational
cost of the convolution operation. Overall, padding is an important technique in CNNs that
helps in preserving the spatial dimensions of the feature maps and can improve the
performance of the model.

Effect Of Padding On Input Images


Padding is simply a process of adding layers of zeros to our input images so as to avoid the
problems mentioned above through the following changes to the input image.

Padding prevents the shrinking of the input image.


If p = the number of layers of zeros added to the border of the image, then an (n x n) image becomes an (n + 2p) x (n + 2p) image after padding, and
(n + 2p) x (n + 2p) image * (f x f) filter --> (n + 2p - f + 1) x (n + 2p - f + 1) output image.
For example, by adding one layer of padding to an (8 x 8) image and using a (3 x 3) filter we
would get an (8 x 8) output after performing a convolution operation.
This increases the contribution of the pixels at the border of the original image by bringing
them into the middle of the padded image. Thus, information on the borders is preserved as
well as the information in the middle of the image.

Types of Padding

Valid Padding: It implies no padding at all. The input image is left in its valid/unaltered shape. So

(n x n) image * (f x f) filter --> (n - f + 1) x (n - f + 1) output image

where n x n is the dimension of the input image, f x f is the kernel size, n - f + 1 is the output image size, and * represents the convolution operation.
Same Padding: In this case, we add ‘p’ padding layers such that the output image has the
same dimensions as the input image.
So,

[(n + 2p) x (n + 2p) image] * [(f x f) filter] —> [(n x n) image]

which gives p = (f – 1) / 2 (because n + 2p – f + 1 = n).

So, if we use a (3 x 3) filter on an input image and want the output to have the same dimensions, 1 layer of zeros must be added to the borders for same padding. Similarly, if a (5 x 5) filter is used, 2 layers of zeros must be appended to the border of the image.

Padding serves several purposes:


1. Dimension Preservation: As mentioned, padding can help maintain the spatial
dimensions of the input through the layers of the network.
2. Border Information: Padding allows the network to take into account information at
the borders of the image. Without padding, the filters would mostly be applied to the
central pixels of the image, which could lead to losing information from the edges of
the image.
Like strides, the choice of padding type is a hyperparameter of the network and may need to
be tuned during the model training process.

Question – 5 - CNN - Pooling Layer and types of Pooling

The pooling operation involves sliding a two-dimensional filter over each channel of feature
map and summarizing the features lying within the region covered by the filter.

A common CNN model architecture is to have a number of convolution and pooling layers
stacked one after the other.
Why to use Pooling Layers?
• Pooling layers are used to reduce the dimensions of the feature maps. Thus, it reduces
the number of parameters to learn and the amount of computation performed in the
network.
• The pooling layer summarises the features present in a region of the feature map
generated by a convolution layer. So, further operations are performed on summarised
features instead of precisely positioned features generated by the convolution layer.
This makes the model more robust to variations in the position of the features in the
input image.

Types of Pooling Layers:

Max Pooling
1. Max pooling is a pooling operation that selects the maximum element from the region
of the feature map covered by the filter. Thus, the output after max-pooling layer
would be a feature map containing the most prominent features of the previous feature
map.

Average Pooling
1. Average pooling computes the average of the elements present in the region of feature
map covered by the filter. Thus, while max pooling gives the most prominent feature
in a particular patch of the feature map, average pooling gives the average of features
present in a patch.

Global Pooling
1. Global pooling reduces each channel in the feature map to a single value. Thus, an nh
x nw x nc feature map is reduced to 1 x 1 x nc feature map. This is equivalent to using
a filter of dimensions nh x nw i.e. the dimensions of the feature map.
Further, it can be either global max pooling or global average pooling.
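A short NumPy sketch of global pooling reducing an nh x nw x nc feature map to 1 x 1 x nc (the map sizes here are illustrative assumptions):

```python
import numpy as np

feature_map = np.random.rand(7, 7, 64)  # nh x nw x nc

gap = feature_map.mean(axis=(0, 1), keepdims=True)  # global average pooling
gmp = feature_map.max(axis=(0, 1), keepdims=True)   # global max pooling
print(gap.shape, gmp.shape)  # (1, 1, 64) (1, 1, 64)
```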

In convolutional neural networks (CNNs), the pooling layer is a common type of layer that is
typically added after convolutional layers. The pooling layer is used to reduce the spatial
dimensions (i.e., the width and height) of the feature maps, while preserving the depth (i.e.,
the number of channels).
1. The pooling layer works by dividing the input feature map into a set of non-overlapping regions, called pooling regions. Each pooling region is then transformed into a single output value, which represents the presence of a particular feature in that region. The most common types of pooling operations are max pooling and average pooling.
2. In max pooling, the output value for each pooling region is simply the maximum
value of the input values within that region. This has the effect of preserving the most
salient features in each pooling region, while discarding less relevant information.
Max pooling is often used in CNNs for object recognition tasks, as it helps to identify
the most distinctive features of an object, such as its edges and corners.
3. In average pooling, the output value for each pooling region is the average of the
input values within that region. This has the effect of preserving more information
than max pooling, but may also dilute the most salient features. Average pooling is
often used in CNNs for tasks such as image segmentation and object detection, where
a more fine-grained representation of the input is required.
Pooling layers are typically used in conjunction with convolutional layers in a CNN, with
each pooling layer reducing the spatial dimensions of the feature maps, while the
convolutional layers extract increasingly complex features from the input. The resulting
feature maps are then passed to a fully connected layer, which performs the final
classification or regression task.
Advantages of Pooling Layer:
1. Dimensionality reduction: The main advantage of pooling layers is that they help in
reducing the spatial dimensions of the feature maps. This reduces the computational
cost and also helps in avoiding overfitting by reducing the number of parameters in
the model.
2. Translation invariance: Pooling layers are also useful in achieving translation
invariance in the feature maps. This means that the position of an object in the image
does not affect the classification result, as the same features are detected regardless of
the position of the object.
3. Feature selection: Pooling layers can also help in selecting the most important features
from the input, as max pooling selects the most salient features and average pooling
preserves more information.
Disadvantages of Pooling Layer:
1. Information loss: One of the main disadvantages of pooling layers is that they discard
some information from the input feature maps, which can be important for the final
classification or regression task.
2. Over-smoothing: Pooling layers can also cause over-smoothing of the feature maps,
which can result in the loss of some fine-grained details that are important for the
final classification or regression task.
3. Hyperparameter tuning: Pooling layers also introduce hyperparameters such as the
size of the pooling regions and the stride, which need to be tuned in order to achieve
optimal performance. This can be time-consuming and requires some expertise in
model building.

Question - 6 - Unshared Convolution and Tiled Convolution
(See the discussion of locally connected layers (unshared convolution) and tiled convolution under Question 2 above.)

Question - 6 - Dilated and Atrous Convolution

Dilated Convolution: It is a technique that expands the kernel by inserting holes (gaps) between its consecutive elements. In simpler terms, it is the same as convolution except that it involves pixel skipping, so as to cover a larger area of the input.
Dilated convolution, also known as atrous convolution, is a type of convolution operation
used in convolutional neural networks (CNNs) that enables the network to have a larger
receptive field without increasing the number of parameters.
In a regular convolution operation, a filter of a fixed size slides over the input feature map,
and the values in the filter are multiplied with the corresponding values in the input feature
map to produce a single output value. The receptive field of a neuron in the output feature
map is defined as the area in the input feature map that the filter can “see”. The size of the
receptive field is determined by the size of the filter and the stride of the convolution.
In contrast, in a dilated convolution operation, the filter is “dilated” by inserting gaps between
the filter values. The dilation rate determines the size of the gaps, and it is a hyperparameter
that can be adjusted. When the dilation rate is 1, the dilated convolution reduces to a regular
convolution.
The dilation rate effectively increases the receptive field of the filter without increasing the
number of parameters, because the filter is still the same size, but with gaps between the
values. This can be useful in situations where a larger receptive field is needed, but increasing
the size of the filter would lead to an increase in the number of parameters and computational
complexity.
Dilated convolutions have been used successfully in various applications, such as semantic
segmentation, where a larger context is needed to classify each pixel, and audio processing,
where the network needs to learn patterns with longer time dependencies.

Some advantages of dilated convolutions are:


1. Increased receptive field without increasing parameters
2. Can capture features at multiple scales
3. Reduced spatial resolution loss compared to regular convolutions with larger filters
Some disadvantages of dilated convolutions are:
1. Reduced spatial resolution in the output feature map compared to the input feature
map
2. Increased computational cost compared to regular convolutions with the same filter
size and stride

An additional parameter l (dilation factor) tells how much the input is expanded. In other
words, based on the value of this parameter, (l-1) pixels are skipped in the kernel. Fig 1
depicts the difference between normal vs dilated convolution. In essence, normal convolution
is just a 1-dilated convolution.

Fig 1: Normal Convolution vs Dilated Convolution

Intuition:
Dilated convolution helps expand the area of the input image covered without pooling. The
objective is to cover more information from the output obtained with every convolution
operation. This method offers a wider field of view at the same computational cost. We
determine the value of the dilation factor (l) by seeing how much information is obtained
with each convolution on varying values of l.
By using this method, we are able to obtain more information without increasing the number
of kernel parameters. In Fig 1, the image on the left depicts dilated convolution. On keeping
the value of l = 2, we skip 1 pixel (l – 1 pixel) while mapping the filter onto the input, thus
covering more information in each step.
Advantages of Dilated Convolution:
Using this method rather than normal convolution is better as:
1. Larger receptive field (i.e. no loss of coverage)
2. Computationally efficient (as it provides a larger coverage on the same computation
cost)
3. Less memory consumption (as it skips the pooling step)
4. No loss of resolution of the output image (as we dilate instead of performing pooling)
5. The structure of this convolution helps in maintaining the order of the data.

Example:
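A minimal NumPy sketch (an illustration, not from the source) of dilating a 3x3 kernel with dilation factor l = 2: the receptive field grows from 3x3 to 5x5 while the number of learnable weights stays at 9.

```python
import numpy as np

k = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

l = 2                                  # dilation factor: skip (l - 1) pixels
size = (k.shape[0] - 1) * l + 1        # a 3x3 kernel becomes 5x5 when l = 2
dilated = np.zeros((size, size), dtype=k.dtype)
dilated[::l, ::l] = k                  # place the original weights l apart

print(dilated)
# [[1 0 2 0 3]
#  [0 0 0 0 0]
#  [4 0 5 0 6]
#  [0 0 0 0 0]
#  [7 0 8 0 9]]
```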
Question - 7 - Transposed Convolution

https://www.geeksforgeeks.org/what-is-transposed-convolutional-layer/

Question - 8 - Generalize how to introduce non-linearity in a CNN, providing an


example to illustrate the concept. (or) Explain about the activation functions in CNN.

CNN Learning: The weight layers in a CNN are often followed by a non-linear activation function. The activation function takes a real-valued input and squashes it within a small range such as [0, 1] or [-1, 1]. The application of a non-linear function after the weight layer is highly important, since it allows a neural network to learn non-linear mappings. In the absence of non-linearities, a stacked network of weight layers is equivalent to a linear mapping from the input domain to the output domain.
The Need for Non-Linearity: Activation functions introduce non-linearity into the model,
allowing it to learn and perform complex tasks. Without them, no matter how many layers we
stack in the network, it would still behave as a single-layer perceptron because the
composition of linear functions is a linear function.

Where to Apply Activation Functions in CNNs

In a CNN, activation functions are typically applied after each convolutional layer (e.g. ReLU) and fully connected layer (e.g. Softmax). However, they are not applied after pooling layers. The purpose of using activation functions after the convolutional and fully connected layers is to introduce non-linearity into the model after performing linear operations (convolution and matrix multiplication).
In essence, activation functions serve as the “switch” in artificial neurons that decide whether
that neuron should be activated or not based on the weighted sum of the input. This reflects
how neurons in the human brain work: they either fire, or they don’t. This biological analogy
helps to conceptualize the role of activation functions in a CNN.

Some common activation functions used in CNNs include:

ReLU (Rectified Linear Unit): This is the most commonly used activation function in
CNNs. It returns 0 if it receives any negative input, but for any positive value x, it returns that
value back. Hence, it can be written as f(x) = max(0, x). The function is non-linear, which
means the output is not proportional to the input. It helps to alleviate the vanishing gradient
problem.

RELU Function
• It Stands for Rectified linear unit. It is the most widely used activation function.
Chiefly implemented in hidden layers of Neural network.
• Equation :- A(x) = max(0,x). It gives an output x if x is positive and 0 otherwise.
• Value Range :- [0, inf)
• Nature :- non-linear, which means we can easily backpropagate the errors and have
multiple layers of neurons being activated by the ReLU function.
• Uses :- ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At any given time only a few neurons are activated, making the network sparse and hence efficient and easy for computation.
In simple words, ReLU learns much faster than the sigmoid and tanh functions.

Dying ReLU
Whenever the input to ReLU is negative, the output becomes 0. During back-propagation the network does not learn anything through such a neuron (you cannot backpropagate through it): since it keeps outputting 0 for negative inputs, gradient descent no longer affects it. In other words, if the derivative is 0 the whole activation becomes zero, hence that neuron contributes nothing to the network.

Leaky ReLU: Leaky ReLU is a variant of ReLU. Instead of being 0 when x < 0, a leaky
ReLU allows a small, non-zero, constant gradient α (Normally, α=0.01). Hence, the function
could be written as f(x)=max(αx,x). It mitigates the dying ReLU problem which refers to the
problem when the ReLU neurons become inactive and only output 0 for any input.



Softmax Function

The softmax function is also a type of sigmoid function but is handy when we are trying to handle multi-class classification problems.

• Nature :- non-linear
• Uses :- Usually used when trying to handle multiple classes. The softmax function is commonly found in the output layer of image classification problems. It squeezes the output for each class between 0 and 1 and divides by the sum of the outputs.
• Output :- The softmax function is ideally used in the output layer of the classifier, where we are actually trying to obtain the probabilities that define the class of each input.
• The basic rule of thumb is: if you really don't know what activation function to use, then simply use ReLU, as it is a general activation function for hidden layers and is used in most cases these days.
• If your output is for binary classification, then the sigmoid function is a very natural choice for the output layer.
• If your output is for multi-class classification, then Softmax is very useful to predict the probabilities of each class.

Variants of Activation Functions

Linear Function
• Equation : The linear function has an equation similar to that of a straight line, i.e. y = x
• No matter how many layers we have, if all of them are linear in nature, the final activation function of the last layer is nothing but a linear function of the input of the first layer.
• Range : -inf to +inf
• Uses : The linear activation function is used in just one place, i.e. the output layer.
• Issues : The derivative of a linear function is a constant; it no longer depends on the input x, so gradient descent cannot use it to introduce any useful non-linear behaviour into the algorithm.

For example: calculation of the price of a house is a regression problem. The house price may have any big/small value, so we can apply linear activation at the output layer. Even in this case, the neural network must have non-linear functions at its hidden layers.
Sigmoid Function

• It is a function which is plotted as ‘S’ shaped graph.


• Equation : A = 1 / (1 + e^(-x))
• Nature : Non-linear. Notice that for X values between -2 and 2, the Y values are very steep. This means that small changes in x would also bring about large changes in the value of Y.
• Value Range : 0 to 1
• Uses : Usually used in the output layer of a binary classification, where the result is either 0 or 1. Since the value of the sigmoid function lies between 0 and 1 only, the result can easily be predicted to be 1 if the value is greater than 0.5 and 0 otherwise.

Tanh Function

• The activation that almost always works better than the sigmoid function is the Tanh function, also known as the hyperbolic tangent function. It is actually a mathematically shifted version of the sigmoid function. Both are similar and can be derived from each other.
• Equation :-
f(x) = tanh(x) = 2 / (1 + e^(-2x)) - 1
OR
tanh(x) = 2 * sigmoid(2x) - 1
• Value Range :- -1 to +1
• Nature :- non-linear
• Uses :- Usually used in hidden layers of a neural network, as its values lie between -1 and 1; hence the mean of the hidden layer outputs comes out to be 0 or very close to it. This helps center the data by bringing the mean close to 0, which makes learning for the next layer much easier.
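For concreteness, here is a small NumPy sketch of the activation functions discussed above (a hedged illustration; the sample input vector is made up):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)               # f(x) = max(0, x), range [0, inf)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # small slope alpha for x < 0

def sigmoid(x):
    return 1 / (1 + np.exp(-x))           # 'S'-shaped curve, range (0, 1)

def tanh(x):
    return 2 * sigmoid(2 * x) - 1         # shifted sigmoid, range (-1, 1)

def softmax(z):
    e = np.exp(z - z.max())               # subtract max for numerical stability
    return e / e.sum()                    # outputs sum to 1 (probabilities)

z = np.array([-2.0, 0.0, 3.0])
for f in (relu, leaky_relu, sigmoid, tanh, softmax):
    print(f.__name__, f(z))
```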

Question - 9 - Loss Function in Deep Learning

In mathematical optimization and decision theory, a loss or cost function (sometimes also
called an error function) is a function that maps an event or values of one or more variables
onto a real number intuitively representing some “cost” associated with the event.
In simple terms, the loss function is a method of evaluating how well your algorithm is modeling your dataset. It is a mathematical function of the parameters of the machine learning algorithm.
In simple linear regression, the prediction is calculated using the slope (m) and the intercept (b). The loss function for this is (Yi - Yi_hat)^2, i.e. the loss function is a function of the slope and the intercept.
Regression loss functions like the MSE loss function are commonly used in evaluating the
performance of regression models. Additionally, objective functions play a crucial role in
optimizing machine learning models by minimizing the loss or cost. Other commonly used
loss functions include the Huber loss function, which combines the characteristics of the
MSE and MAE loss functions, providing robustness to outliers in the data.

Cost Functions in Machine Learning


Cost functions are vital in machine learning, measuring the disparity between predicted and
actual outcomes. They guide the training process by quantifying errors and driving parameter
updates. Common ones include Mean Squared Error (MSE) for regression and cross-entropy
for classification. These functions shape model performance and guide optimization
techniques like gradient descent, leading to better predictions.

Role of Loss Functions in Machine Learning Algorithms

Loss functions play a pivotal role in machine learning algorithms, acting as objective
measures of the disparity between predicted and actual values. They serve as the basis for
model training, guiding algorithms to adjust model parameters in a direction that minimizes
the loss and improves predictive accuracy. Here, we explore the significance of loss functions
in the context of machine learning algorithms.
In machine learning, loss functions quantify the extent of error between predicted and actual
outcomes. They provide a means to evaluate the performance of a model on a given dataset
and are instrumental in optimizing model parameters during the training process.

Fundamental Tasks

One of the fundamental tasks of machine learning algorithms is regression, where the goal is
to predict continuous variables. Loss functions such as Mean Squared Error (MSE) and Mean
Absolute Error (MAE) are commonly employed in regression tasks. MSE penalizes larger
errors more heavily than MAE, making it suitable for scenarios where outliers may have a
significant impact on the model’s performance.
For classification problems, where inputs are categorized into discrete classes, cross-entropy
loss functions are widely used. Binary cross-entropy loss is employed in binary classification
tasks, while categorical cross-entropy loss is utilized for multi-class classification. These
functions measure the disparity between predicted probability distributions and the actual
distribution of classes, guiding the model towards more accurate predictions.
The choice of a loss function depends on various factors, including the nature of the problem,
the distribution of the data, and the desired characteristics of the model. Different loss
functions emphasize different aspects of model performance and may be more suitable for
specific applications.
During the training process, machine learning algorithms employ optimization techniques
such as gradient descent to minimize the loss function. By iteratively adjusting model
parameters based on the gradients of the loss function, the algorithm aims to converge to the
optimal solution, resulting in a model that accurately captures the underlying patterns in the
data.
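A minimal sketch of this loop for the linear-regression MSE (toy data; the learning rate and iteration count are arbitrary illustrative choices):

import numpy as np

# toy data that roughly follows y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

m, b, lr = 0.0, 0.0, 0.05  # initial parameters and learning rate
for _ in range(2000):
    y_hat = m * x + b
    # gradients of MSE = mean((y - y_hat)^2) with respect to m and b
    dm = -2.0 * np.mean((y - y_hat) * x)
    db = -2.0 * np.mean(y - y_hat)
    m -= lr * dm
    b -= lr * db

print(m, b)  # converges near the true slope (about 2) and intercept (about 1)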
Overall, loss functions play a crucial role in machine learning algorithms, serving as
objective measures of model performance and guiding the learning process. Understanding
the role of loss functions is essential for effectively training and optimizing machine learning
models for various tasks and applications.

Loss Functions in Deep Learning

Regression Loss Functions

1. Mean Squared Error/Squared loss/ L2 loss


The Mean Squared Error (MSE) is a straightforward and widely used loss function. To
calculate the MSE, you take the difference between the actual value and the model prediction,
square it, and then average it across the entire dataset.

Advantage
• Easy Interpretation: The MSE is straightforward to understand.

• Always Differentiable: Due to the squaring, the MSE is differentiable everywhere.

• Single Local Minimum: It is convex, so it has only one (global) minimum.
Disadvantage
• Error Unit in Squared Form: The error is measured in squared units, which might
not be intuitively interpretable.
• Not Robust to Outliers: MSE is sensitive to outliers.
Note: In regression tasks, at the last neuron, it’s common to use a linear activation function.
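A minimal NumPy sketch of the MSE (the values are chosen for illustration):

import numpy as np

def mse(y, y_hat):
    # average of the squared residuals
    return np.mean((y - y_hat) ** 2)

y     = np.array([2.0, 4.0, 6.0])
y_hat = np.array([2.5, 3.5, 9.0])
print(mse(y, y_hat))  # the large error on the last point dominates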

2. Mean Absolute Error/ L1 loss Functions


The Mean Absolute Error (MAE) is another simple loss function. It calculates the average
absolute difference between the actual value and the model prediction across the dataset.

Advantage
• Intuitive and Easy: MAE is easy to grasp.

• Error Unit Matches Output Column: The error unit is the same as the output
column.
• Robust to Outliers: MAE is less affected by outliers.
Disadvantage
• Not Differentiable at Zero: The MAE has a corner at zero error, so it is not differentiable
there and gradient descent cannot be applied directly; computing a subgradient is the usual alternative.
Note: In regression tasks, at the last neuron, a linear activation function is commonly used.
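The same toy values with the MAE show its gentler treatment of the outlier (again a minimal sketch with illustrative numbers):

import numpy as np

def mae(y, y_hat):
    # average of the absolute residuals
    return np.mean(np.abs(y - y_hat))

y     = np.array([2.0, 4.0, 6.0])
y_hat = np.array([2.5, 3.5, 9.0])
print(mae(y, y_hat))  # the outlier contributes linearly, not quadratically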

3. Huber Loss
The Huber loss is used in robust regression and is less sensitive to outliers compared to
squared error loss. It is quadratic for small residuals and linear for large ones:

L_δ(y, ŷ) = (1/n) Σ [ 0.5 (y – ŷ)^2 if |y – ŷ| ≤ δ, else δ (|y – ŷ| – 0.5 δ) ]

where
• n: The number of data points.
• y: The actual value (true value) of the data point.
• ŷ: The predicted value returned by the model.
• δ: Defines the point where the Huber loss transitions from quadratic to linear.
Advantage
• Robust to Outliers: Huber loss is more robust to outliers.

• Balances MAE and MSE: It lies between MAE and MSE.


Disadvantage
• Complexity: Optimizing the hyperparameter δ increases training requirements.
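A minimal sketch of the piecewise definition above (delta = 1.0 is an arbitrary illustrative choice):

import numpy as np

def huber(y, y_hat, delta=1.0):
    r = y - y_hat
    quad = 0.5 * r ** 2                      # MSE-like region, |r| <= delta
    lin = delta * (np.abs(r) - 0.5 * delta)  # MAE-like region, |r| > delta
    return np.mean(np.where(np.abs(r) <= delta, quad, lin))

y     = np.array([2.0, 4.0, 6.0])
y_hat = np.array([2.5, 3.5, 9.0])
print(huber(y, y_hat))  # the outlier lands in the linear region, so it is penalized mildly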

Classification Loss

1. Binary Cross Entropy / Log Loss

It is used in binary classification problems, i.e., problems with exactly two classes: for
example, whether a person has COVID or not, or whether an article becomes popular or not.
Binary cross entropy compares each predicted probability to the actual class output, which
can be either 0 or 1. It then calculates a score that penalizes the probabilities based on their
distance from the expected value, that is, on how close to or far from the actual value they are:

L = –(1/n) Σ_i [ yi log(ŷi) + (1 – yi) log(1 – ŷi) ]

where
• yi – actual value (0 or 1)
• ŷi – the neural network's predicted probability
Advantage –
• The loss function is differentiable.

Disadvantage –
• Multiple local minima (when combined with a non-convex network)

• Not intuitive
Note – In binary classification, use the sigmoid activation function at the last neuron.
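A minimal sketch of binary cross-entropy (labels and probabilities invented for illustration; clipping guards against log(0)):

import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])          # actual classes
p = np.array([0.9, 0.1, 0.8, 0.4])  # sigmoid outputs of the network
print(binary_cross_entropy(y, p))   # confident correct predictions -> low loss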

2. Categorical Cross Entropy


Categorical cross entropy is used for multiclass classification and softmax regression:

L = –Σ_{j=1..k} yj log(ŷj)

where
• k is the number of classes,
• yj = actual (one-hot encoded) value for class j,
• ŷj – the neural network's predicted probability for class j.
Note – In multi-class classification, use the softmax activation function at the last neuron.

If the problem statement has 3 classes, the softmax activation for class 1 is

f(z1) = e^(z1) / (e^(z1) + e^(z2) + e^(z3))

When to use categorical cross-entropy and sparse categorical cross-entropy?


If the target column is one-hot encoded (e.g., 0 0 1, 0 1 0, 1 0 0), use categorical
cross-entropy. If the target column is integer encoded (e.g., 1, 2, 3, …, n), use sparse
categorical cross-entropy.

Which is Faster?
Sparse categorical cross-entropy is faster than categorical cross-entropy, since it reads off
the predicted probability of the true class directly instead of multiplying by a one-hot vector.
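A minimal sketch showing that the two losses compute the same quantity; the sparse version simply indexes the true class instead of multiplying by a one-hot vector (the logits are invented for illustration):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))  # predicted class probabilities

# categorical cross-entropy: one-hot encoded target
y_onehot = np.array([1.0, 0.0, 0.0])
cce = -np.sum(y_onehot * np.log(p))

# sparse categorical cross-entropy: integer target, no one-hot needed
y_index = 0
scce = -np.log(p[y_index])

print(np.isclose(cce, scce))  # True -- same loss, less work for the sparse form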

Conclusion
The significance of loss functions in deep learning cannot be overstated. They serve as vital
metrics for evaluating model performance, guiding parameter adjustments, and optimizing
algorithms during training. Whether it’s quantifying disparities in regression tasks through
MSE or MAE, penalizing deviations in binary classification with binary cross-entropy, or
ensuring robustness to outliers with the Huber loss function, selecting the appropriate loss
function is crucial. Understanding the distinction between loss and cost functions, as well as
their role in objective functions, provides valuable insights into model optimization.
Ultimately, the choice of loss function profoundly impacts model training and performance,
underscoring its pivotal role in the deep learning landscape.

Computational efficiency
Computational resources are a valuable commodity in machine learning, in both commercial and
research settings. Access to large computing capacity gives practitioners the flexibility to
experiment with large datasets and to solve more complex machine-learning problems. Some
loss functions are more computationally demanding than others, especially when the dataset
is large. This makes the computational efficiency of a loss function a factor to
consider when selecting one.
Factor – Description
• Type of Learning Problem – Classification vs regression; binary vs multiclass classification.
• Model Sensitivity to Outliers – Some loss functions are more sensitive to outliers (e.g., MSE), while others are more robust (e.g., MAE).
• Desired Model Behavior – Influences how the model behaves, e.g., hinge loss in SVMs focuses on maximizing the margin.
• Computational Efficiency – Some loss functions are more computationally intensive, impacting the choice based on available resources.
• Convergence Properties – The smoothness and convexity of a loss function can affect the ease and speed of training.
• Scale of the Task – For large-scale tasks, a loss function that scales well and can be efficiently optimized is crucial.

Impact of outliers and data distribution


Outliers are data samples that fall outside the overall statistical distribution of a dataset; they
are sometimes referred to as anomalies or irregularities. How outliers are managed
determines the performance and accuracy of the trained machine learning model.
As mentioned earlier, outliers in datasets affect the error values utilized in loss functions,
depending on the loss function used. The effect of outliers on the loss functions propagates to
the outcome of the learning process of the machine learning algorithm, which can lead to
intended or unintended behavior from the machine learning algorithm or model.
For example, mean squared error penalizes outliers heavily, since they contribute large squared
error terms; during the training process, the model weights are therefore adjusted to
accommodate these outliers. If this is not the intended behavior of the machine learning
model, the finalized model created after training will generalize poorly to unseen data.
For scenarios where mitigating the impact of outliers is required, functions such as MAE and
Huber loss are more applicable.
Loss Function               Applicable to Classification   Applicable to Regression   Sensitivity to Outliers
Mean Squared Error (MSE)    No                             Yes                        High
Mean Absolute Error (MAE)   No                             Yes                        Low
Cross-Entropy               Yes                            No                         Medium
Hinge Loss                  Yes                            No                         Low
Huber Loss                  No                             Yes                        Medium
Log Loss                    Yes                            No                         Medium

Question - 10 - Gradient Computation

In a Convolutional Neural Network (CNN), the gradient computation is
done using backpropagation. Backpropagation is a method of computing
the gradient of the loss function with respect to the weights of the network.
It is used to update the weights of the network during training.

During backpropagation, the gradient of the loss function is computed with
respect to the input and also with respect to the filter. The convolution
between input X and filter F gives us an output O. This can be represented
as:

O = X * F

If L denotes the loss, the chain rule gives the gradient of L with respect to X:

dL/dX = dL/dO * dO/dX

where dL/dO is the gradient of the loss with respect to the output, and dO/dX is the
gradient of the output with respect to the input. Similarly, the gradient of L with respect
to F can be computed as:

dL/dF = dL/dO * dO/dF

where dO/dF is the gradient of the output with respect to the filter. In practice, dL/dF
works out to be a convolution of the input X with the upstream gradient dL/dO, and dL/dX
a "full" convolution of dL/dO with the 180°-rotated filter.
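A minimal NumPy sketch of these two gradients for a single-channel, stride-1, "valid" convolution (shapes and values invented for illustration); the filter gradient is checked against a finite-difference estimate:

import numpy as np

def corr2d_valid(X, F):
    # 2-D "valid" cross-correlation (what DL frameworks call convolution)
    kh, kw = F.shape
    h, w = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    O = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            O[i, j] = np.sum(X[i:i + kh, j:j + kw] * F)
    return O

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 5))  # input
F = rng.standard_normal((3, 3))  # filter
O = corr2d_valid(X, F)
G = np.ones_like(O)              # upstream gradient dL/dO for L = sum(O)

# dL/dF: valid cross-correlation of the input with the upstream gradient
dF = corr2d_valid(X, G)

# dL/dX: "full" cross-correlation of the upstream gradient with the
# 180-degree-rotated filter (pad G by kernel_size - 1 on every side)
Gp = np.pad(G, F.shape[0] - 1)
dX = corr2d_valid(Gp, np.rot90(F, 2))

# finite-difference check of dL/dF
eps = 1e-6
num = np.zeros_like(F)
for i in range(F.shape[0]):
    for j in range(F.shape[1]):
        Fp = F.copy()
        Fp[i, j] += eps
        num[i, j] = (corr2d_valid(X, Fp).sum() - O.sum()) / eps
print(np.allclose(dF, num, atol=1e-4))  # True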

Manual Hyperparameter Tuning:
➢ Hyperparameters control algorithm behavior, and DL algorithms
come with many hyperparameters.
➢ To choose hyperparameters manually, one must understand the
relationship between hyperparameters, training error,
generalization error, and computational resources.
➢ The goal is to find the lowest generalization error subject to some
runtime and memory budget.
➢ Choosing them manually requires an understanding of what they do and
knowledge of how they achieve good generalization.
Goal of Hyperparameter search:
i) Adjust the effective capacity of the model to match the complexity of the task.
ii) Capacity is controlled by

1. The representational capacity of the model

2. The ability of the learning algorithm to minimize the cost
3. The degree to which the cost function and training procedure regularize the model
A model with more layers and more hidden nodes per layer has
higher capacity, but the learning algorithm may not learn the function.
Generalization error: it is a U-shaped curve as a function of capacity. Not every
hyperparameter will be able to explore the entire U-shaped curve.
Many hyperparameters are discrete, such as the number of linear pieces
in a maxout unit, so it is only possible to visit a few points along the
curve. Some hyperparameters are binary.

Reference Details:

Question - 1: Working (or) Operation of CNN (or) Architecture of CNN


https://www.javatpoint.com/keras-convolutional-neural-network
https://medium.com/@draj0718/convolutional-neural-networks-cnn-architectures-explained-
716fb197b243
https://www.simplilearn.com/tutorials/deep-learning-tutorial/convolutional-neural-network
Question - 2 - Analyze and explain the three main motivations behind Convolutional Nets
(or) CNN

https://medium.com/@rukaiya.rk24/convolution-neural-networks-all-you-need-to-know-
a71fde27e498
https://medium.com/inveterate-learner/deep-learning-book-chapter-9-convolutional-networks-
45e43bfc718d

Question - 4 - CNN - Stride and Padding


https://deepai.org/machine-learning-glossary-and-terms/stride
https://medium.com/@nerdjock/convolutional-neural-network-lesson-5-strides-2ffdeacf8f2c
https://www.geeksforgeeks.org/cnn-introduction-to-padding/

Question – 5 - CNN - Pooling Layer and types of Pooling

https://www.geeksforgeeks.org/cnn-introduction-to-pooling-layer/
https://medium.com/@abhishekjainindore24/pooling-and-their-types-in-cnn-4a4b8a7a4611
Question - 6 - Dilated and Atrous Convolution

https://www.geeksforgeeks.org/dilated-convolution/
https://arinjoyemail.medium.com/dilated-convolutions-deep-learning-eb9fd3121e8e

Question - 7 - Transposed Convolutional


https://www.geeksforgeeks.org/what-is-transposed-convolutional-layer/


Question - 8 - Generalize how to introduce non-linearity in a CNN, providing an


example to illustrate the concept. (or) Explain about the activation functions in CNN.
https://medium.com/@nerdjock/convolutional-neural-network-lesson-9-activation-functions-in-
cnns-57def9c6e759
https://www.geeksforgeeks.org/activation-functions-neural-networks/

Question - 9 - Loss Function in Deep Learning


https://www.analyticsvidhya.com/blog/2022/06/understanding-loss-function-in-deep-learning/
https://www.datacamp.com/tutorial/loss-function-in-machine-learning
https://medium.com/@ibtedaazeem/loss-functions-in-deep-learning-e4bd353ea08a

CNN video lecture link:

https://www.youtube.com/watch?v=Etksi-F5ug8
https://www.youtube.com/watch?v=PGBop7Ka9AU&t=295s
https://www.youtube.com/watch?v=VpSLtKiPhLM
https://www.youtube.com/watch?v=PmZp5VtMwLE
https://www.youtube.com/watch?v=Y1qxI-Df4Lk
https://www.youtube.com/watch?v=O2CBKXr_Tuc
