FODL Unit-4
Artificial neural networks are made up of neurons, which are the core processing units of the
network. For better understanding, refer to the diagram below:
In the given diagram, first, we have the ‘INPUT LAYER’, where the neurons are fed with training
observations. In between is the ‘HIDDEN LAYER’, which performs most of the computations
required by the network. Lastly, the ‘OUTPUT LAYER’ predicts the final output from the features
extracted by the previous two layers.
How does this neural network work?
For instance, if an image with N X N pixels is passed as input, each pixel is fed as an input
to the neurons of the first layer.
Neurons of one layer are connected to the following layers through ‘channels’.
Each of these channels is assigned a numerical value called ‘weight’.
The inputs (x1, x2, …… xn) are multiplied by their corresponding weights, and their sum is
sent to the neurons in the hidden layer.
Each of these neurons is associated with a numerical value called the ‘Bias’, further added
to the input sum.
This value is then passed through a threshold function called the ‘Activation function’,
which determines whether the particular neuron will get activated or not.
The activated neuron transmits data to neurons of the next layer over channels.
Thus, data is propagated through the network, and the neuron with the highest value
determines the output.
Output = f( Σ (wi * xi) + Bias ), where f is the activation function.
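As an illustration of this forward pass, here is a minimal NumPy sketch of one neuron; the input values, weights, bias, and the choice of a sigmoid activation are illustrative assumptions, not values from these notes:

```python
import numpy as np

def sigmoid(z):
    # Example activation (threshold) function; any activation could be used here.
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical inputs x1..x3 and the weights of the channels feeding one neuron.
x = np.array([0.5, 0.1, 0.9])
w = np.array([0.4, -0.2, 0.7])
bias = 0.1

# Weighted sum of inputs plus bias, passed through the activation function.
output = sigmoid(np.dot(w, x) + bias)
print(output)
```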
It is a class of deep neural networks that extracts features from images given as input to
perform specific tasks such as image classification, face recognition and semantic image
segmentation. A CNN has one or more convolution layers for feature extraction, which
execute the convolution operation (i.e. multiplication of a set of weights with the input) while
retaining the critical features (spatial and temporal information) without human
supervision.
A CNN is needed because it is a more accurate approach for image classification
problems. With Artificial Neural Networks, a 2D image would first have to be flattened into a
1-dimensional vector before training the model.
Also, as the size of the image increases, the number of training parameters increases
drastically, which inflates storage and computation cost. Moreover, a plain ANN cannot capture
the spatial relationships between neighbouring pixels.
Thus, a CNN is the preferred choice for 2D image classification problems because of its
ability to work with images directly as data, thereby providing higher accuracy.
1)Convolution layer:
This is the first layer of the convolutional network that performs feature extraction by sliding a
filter over the input image. The output, or convolved feature, is obtained by taking the element-wise
product of the filter and the image patch under it and summing the result for every sliding step.
The output, also known as the feature map, highlights features of the original image such as curves,
sharp edges and textures.
In networks with more convolutional layers, the initial layers extract generic features, while deeper
layers extract increasingly complex features.
2)Pooling Layer:
The primary purpose of this layer is to reduce the number of trainable parameters by decreasing
the spatial size of the image, thereby reducing the computational cost.
The image depth remains unchanged since pooling is done independently on each depth
dimension. Max Pooling is the most common pooling method, where the largest element in each
region of the feature map is retained. Max Pooling thus gives an output image with greatly
reduced dimensions while keeping the essential information.
The last few layers, which determine the output, are the fully connected layers. The output from the
pooling layer is flattened into a one-dimensional vector and then given as input to the fully
connected layer.
The output layer has the same number of neurons as the number of categories in our
classification problem, thus associating features with a particular label.
The process described so far is known as forward propagation; the output so generated is compared to
the actual (ground-truth) output to compute the error.
The error is then backpropagated to update the filters (weights) and bias values. One training
iteration is completed after this forward and backward propagation cycle.
Q2. How can edges be detected from an image? Also explain concepts related to CNN along
with examples (IMPORTANT, Must DO, refer for Numericals)
Early layers of a neural network detect edges from an image. Deeper layers might be able to detect
parts of objects, and even deeper layers might detect complete objects
(like a person’s face).
But how do we detect these edges? To illustrate this, let’s take a 6 X 6 grayscale image (i.e. only
one channel):
After the convolution, we will get a 4 X 4 image. The first element of the 4 X 4 matrix will be
calculated as:
So, we take the first 3 X 3 matrix from the 6 X 6 image and multiply it with the filter. Now, the
first element of the 4 X 4 output will be the sum of the element-wise product of these values, i.e.
3*1 + 0 + 1*-1 + 1*1 + 5*0 + 8*-1 + 2*1 + 7*0 + 2*-1 = -5. To calculate the second element of
the 4 X 4 output, we will shift our filter one step towards the right and again get the sum of the
element-wise product:
Similarly, we will convolve over the entire image and get a 4 X 4 output:
So, convolving a 6 X 6 input with a 3 X 3 filter gave us an output of 4 X 4. Consider one more
example:
Note: Higher pixel values represent the brighter portion of the image and the lower pixel values
represent the darker portions. This is how we can detect a vertical edge in an image.
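The worked example above can be reproduced with a few lines of NumPy. This is a minimal sketch of a ‘valid’ convolution; the filter is the vertical edge detector used above, and the 6 X 6 image is an illustrative example whose top-left 3 X 3 patch matches the numbers quoted in the text (so its first output element is -5):

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel over the image, taking the sum of the element-wise
    # product at every position (no padding, stride 1).
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

# Vertical edge detection filter from the notes.
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])

# Example 6 x 6 grayscale image; its top-left 3 x 3 patch is the one worked
# out in the text, so the first output element should be -5.
image = np.array([[3, 0, 1, 2, 7, 4],
                  [1, 5, 8, 9, 3, 1],
                  [2, 7, 2, 5, 1, 3],
                  [0, 1, 3, 1, 7, 8],
                  [4, 2, 1, 6, 2, 8],
                  [2, 4, 5, 2, 3, 9]])

result = conv2d_valid(image, vertical_edge)
print(result.shape)   # (4, 4)
print(result[0, 0])   # -5.0
```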
The type of filter that we choose determines whether vertical or horizontal edges are detected;
different filters (such as the Sobel or Scharr filters) can be used to detect different kinds of edges.
Padding
Input: n X n
Filter size: f X f
Output: (n-f+1) X (n-f+1)
1. Every time we apply a convolutional operation, the size of the image shrinks
2. Pixels present at the corners and borders of the image are used only a few times during
convolution as compared to the central pixels, so information near the borders is
under-weighted, which can lead to information loss
To overcome these issues, we can pad the image with an additional border, i.e., we add one pixel
all around the edges. This means that the input will be an 8 X 8 matrix (instead of a 6 X 6 matrix).
Applying convolution of 3 X 3 on it will result in a 6 X 6 matrix which is the original shape of the
image. This is where padding comes to the fore:
Input: n X n
Padding: p
Filter size: f X f
Output: (n+2p-f+1) X (n+2p-f+1)
1. Valid: It means no padding. If we are using valid padding, the output will be (n-f+1) X (n-
f+1)
2. Same: Here, we apply padding so that the output size is the same as the input size, i.e.,
n+2p-f+1 = n
So, p = (f-1)/2
We now know how to use padded convolution. This way we don’t lose a lot of information and
the image does not shrink either. Next, we will look at how to implement strided convolutions.
Strided Convolutions
Suppose we choose a stride of 2. So, while convoluting through the image, we will take two steps
– both in the horizontal and vertical directions separately. The dimensions for stride s will be:
Input: n X n
Padding: p
Stride: s
Filter size: f X f
Output: [(n+2p-f)/s+1] X [(n+2p-f)/s+1], where we take the floor of (n+2p-f)/s if it is not an integer
Stride helps to reduce the size of the image, a particularly useful feature.
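A small helper makes these size formulas concrete. This is a sketch; the 7 X 7 input with padding 1 and stride 2 is an illustrative example, not taken from the notes:

```python
import math

def conv_output_size(n, f, p=0, s=1):
    # General formula: floor((n + 2p - f) / s) + 1
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(6, 3))             # valid convolution: 4
print(conv_output_size(6, 3, p=1))        # same padding with p = (f-1)/2: 6
print(conv_output_size(7, 3, p=1, s=2))   # strided convolution: 4
```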
Suppose, instead of a 2-D image, we have a 3-D input image of shape 6 X 6 X 3. How will we
apply convolution on this image? We will use a 3 X 3 X 3 filter instead of a 3 X 3 filter. Let’s look
at an example:
Input: 6 X 6 X 3
Filter: 3 X 3 X 3
The dimensions above represent the height, width and channels in the input and filter. Keep in
mind that the number of channels in the input and filter should be same. This will result in an
output of 4 X 4. Let’s understand it visually:
Since there are three channels in the input, the filter will consequently also have three channels.
After convolution, the output shape is a 4 X 4 matrix. So, the first element of the output is the sum
of the element-wise product of the first 27 values from the input (9 values from each channel) and
the 27 values from the filter. After that we convolve over the entire image.
Instead of using just a single filter, we can use multiple filters as well. How do we do that? Let’s
say the first filter will detect vertical edges and the second filter will detect horizontal edges from
the image. If we use multiple filters, the output dimension will change. So, instead of having a 4
X 4 output as in the above example, we would have a 4 X 4 X 2 output (if we have used 2 filters):
Generalized dimensions can be given as:
Input: n X n X nc
Filter: f X f X nc
Padding: p
Stride: s
Output: [(n+2p-f)/s+1] X [(n+2p-f)/s+1] X nc’
Here, nc is the number of channels in the input and filter, while nc’ is the number of filters.
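The shapes above can be checked with a short NumPy sketch of convolution over a volume with multiple filters; the random 6 X 6 X 3 input and the two 3 X 3 X 3 filters are illustrative assumptions:

```python
import numpy as np

def conv_volume(volume, filters):
    # volume: (n, n, nc); filters: (k, f, f, nc) -> output: (n-f+1, n-f+1, k)
    n, _, nc = volume.shape
    k, f = filters.shape[0], filters.shape[1]
    out = np.zeros((n - f + 1, n - f + 1, k))
    for c in range(k):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, c] = np.sum(volume[i:i+f, j:j+f, :] * filters[c])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 3))       # 6 x 6 x 3 input volume
w = rng.standard_normal((2, 3, 3, 3))    # 2 filters of shape 3 x 3 x 3

print(conv_volume(x, w).shape)           # (4, 4, 2)
```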
Once we get an output after convolving over the entire image using a filter, we add a bias term to
those outputs and finally apply an activation function to generate activations. This is one layer of
a convolutional network. Recall that the equation for one forward pass is given by:
z[1] = w[1] * a[0] + b[1]
a[1] = g(z[1])
In our case, the input (6 X 6 X 3) is a[0] and the filters (3 X 3 X 3) are the weights w[1]. The activations
from layer 1 act as the input for layer 2, and so on. Clearly, the number of parameters in case of
convolutional neural networks is independent of the size of the image. It essentially depends on
the filter size. Suppose we have 10 filters, each of shape 3 X 3 X 3. What will be the number of
parameters in that layer? Each filter has 3*3*3 = 27 weights plus 1 bias, so the layer has
(27 + 1) * 10 = 280 parameters.
Let’s combine all the concepts we have learned so far and look at a convolutional network
example.
We take an input image (size = 39 X 39 X 3 in our case), convolve it with 10 filters of size 3 X 3,
and take the stride as 1 and no padding. This will give us an output of 37 X 37 X 10. We convolve
this output further and get an output of 7 X 7 X 40 as shown above. Finally, we take all these
numbers (7 X 7 X 40 = 1960), unroll them into a large vector, and pass them to a classifier that
will make predictions. This is a microcosm of how a convolutional network works.
There are a number of hyperparameters that we can tweak while building a convolutional network.
These include the number of filters, size of filters, stride to be used, padding, etc. We will look at
each of these in detail later in this article. Just keep in mind that as we go deeper into the network,
the size of the image shrinks whereas the number of channels usually increases.
1. Convolution layer
2. Pooling layer
3. Fully connected layer
Pooling Layers
Pooling layers are generally used to reduce the size of the inputs and hence speed up the
computation. Consider a 4 X 4 matrix as shown below:
For every consecutive 2 X 2 block, we take the max number. Here, we have applied a filter of size
2 and a stride of 2. These are the hyperparameters for the pooling layer. Apart from max pooling,
we can also apply average pooling where, instead of taking the max of the numbers, we take their
average. In summary, the hyperparameters for a pooling layer are:
1. Filter size
2. Stride
3. Max or average pooling
If the input of the pooling layer is nh X nw X nc, then the output will be [{(nh – f) / s + 1} X {(nw –
f) / s + 1} X nc].
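A minimal NumPy sketch of max pooling with filter size 2 and stride 2, applied to an illustrative 4 X 4 matrix (the numbers are made up):

```python
import numpy as np

def max_pool(x, f=2, s=2):
    # Take the maximum of each f x f block, moving with stride s.
    out_h = (x.shape[0] - f) // s + 1
    out_w = (x.shape[1] - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.max(x[i*s:i*s+f, j*s:j*s+f])
    return out

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 2],
              [7, 2, 9, 8],
              [3, 1, 4, 5]])

print(max_pool(x))   # [[6. 5.]
                     #  [7. 9.]]
```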
CNN Example
We’ll take things up a notch now. Let’s look at how a convolution neural network with
convolutional and pooling layer works. Suppose we have an input of shape 32 X 32 X 3:
There is a combination of convolution and pooling layers at the beginning, a few fully connected
layers at the end, and finally a softmax classifier to classify the input into various categories. There
are a lot of hyperparameters in this network which we have to specify as well.
Generally, we take the set of hyperparameters which have been used in proven research and they
end up doing well. As seen in the above example, the height and width of the input shrinks as we
go deeper into the network (from 32 X 32 to 5 X 5) and the number of channels increases (from 3
to 10).
All of these concepts and techniques bring up a very fundamental question – why convolutions?
Why not something else?
Why Convolutions?
There are primarily two major advantages of using convolutional layers over using just fully
connected layers:
1. Parameter sharing
2. Sparsity of connections
If we had used just a fully connected layer between a 32 X 32 X 3 input and a 28 X 28 X 6 output,
the number of parameters would be 32*32*3 * 28*28*6, which is nearly 14 million! That makes no sense.
If we count the parameters of the corresponding convolutional layer instead, it will be (5*5 + 1) * 6 (if
there are 6 filters of size 5 X 5), which is equal to 156. Convolutional layers reduce the number of parameters
and speed up the training of the model significantly.
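A quick sketch of this comparison, using the layer sizes from the example above:

```python
# Fully connected: every one of the 32*32*3 inputs connects to every one of
# the 28*28*6 outputs.
fc_params = (32 * 32 * 3) * (28 * 28 * 6)

# Convolutional: six 5 x 5 filters, each with one bias term (counted as in
# the example above, which ignores the input depth).
conv_params = (5 * 5 + 1) * 6

print(fc_params)    # 14450688  (~14 million)
print(conv_params)  # 156
```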
In convolutions, we share the parameters while convolving through the input. The intuition behind
this is that a feature detector, which is helpful in one part of the image, is probably also useful in
another part of the image. So a single filter is convolved over the entire input and hence the
parameters are shared.
The second advantage of convolution is the sparsity of connections. For each layer, each output
value depends on a small number of inputs, instead of taking into account all the inputs.
Understanding and Calculating the number of Parameters in Convolution Neural Networks
(CNNs)
If you’ve been playing with CNNs, it is common to encounter a summary of parameters as seen in
the above image. We all know it is easy to calculate the activation size, considering it’s merely the
product of width, height and the number of channels in that layer.
For example, as shown in the above image from Coursera, the input layer’s shape is (32, 32, 3), so the
activation size of that layer is 32 * 32 * 3 = 3072. The same holds if you want to calculate the
activation shape of any other layer. Say we want to calculate the activation size for CONV2. All
we have to do is multiply (10, 10, 16), i.e. 10*10*16 = 1600, and we are done calculating the
activation size.
However, what sometimes gets tricky is the approach to calculating the number of parameters in
a given layer. With that said, here are some simple ideas to keep in mind to do the same.
This goes back to the idea of understanding what a convolutional neural network is doing: it is
basically trying to learn the values of its filters using backprop. In other words, if a layer has weight
matrices, it is a “learnable” layer.
Basically, the number of parameters in a given layer is the count of learnable elements of the
filters (i.e. the parameters of the filters) for that layer.
Parameters in general are weights that are learnt during training. They are weight matrices that
contribute to the model’s predictive power and are changed during the back-propagation process. Who governs
the change? The training algorithm you choose, particularly the optimization strategy, makes
them change their values.
Now that you know what “parameters” are, let’s calculate the number of parameters in the sample
network shown above.
Example taken from Coursera: https://www.coursera.org/learn/convolutional-neural-
networks/lecture/uRYL1/cnn-example
1. Input layer: The input layer has nothing to learn; at its core, all it does is provide
the input image’s shape. So there are no learnable parameters here. Thus, number of
parameters = 0.
2. CONV layer: This is where the CNN learns, so we certainly have weight matrices. To
calculate the learnable parameters here, we multiply the filter’s width m, height n and
the number of filters d in the previous layer, and account for all such
filters k in the current layer. Don’t forget the bias term for each filter. The number
of parameters in a CONV layer is therefore ((m * n * d) + 1) * k, where 1 is added because of the
bias term for each filter. The same expression can be written as ((shape of
width of the filter * shape of height of the filter * number of filters in the previous
layer) + 1) * number of filters, where the last term refers to the number of filters in
the current layer.
3. POOL layer: This has got no learnable parameters because all it does is calculate a
specific number, no backprop learning involved! Thus number of parameters = 0.
4. Fully Connected Layer (FC): This certainly has learnable parameters; in fact,
in comparison to the other layers, this category of layers has the highest number of
parameters. Why? Because every neuron is connected to every neuron in the previous
layer. So how do we calculate the number of parameters here? It is the product of
the number of neurons in the current layer c and the number of neurons in the previous
layer p, plus one bias per current-layer neuron. Thus, the number of parameters here
is (c * p) + (1 * c).
Now let’s follow these pointers and calculate the number of parameters, shall we?
1. The first Input layer has nothing to learn: number of parameters = 0.
2. Parameters in the second CONV1 (filter shape = 5*5, stride = 1) layer: ((shape of
width of filter * shape of height of filter * number of filters in the previous
layer) + 1) * number of filters = ((5*5*3) + 1) * 8 = 608.
3. The third POOL1 layer has no learnable parameters: number of parameters = 0.
4. Parameters in the fourth CONV2 (filter shape = 5*5, stride = 1) layer: ((shape of
width of filter * shape of height of filter * number of filters in the previous layer) + 1)
* number of filters = ((5*5*8) + 1) * 16 = 3216.
5. The fifth POOL2 layer has no learnable parameters: number of parameters = 0.
6. Parameters in the sixth FC3 layer: ((current layer c * previous layer p) + 1*c) =
120*400 + 1*120 = 48120.
7. Parameters in the seventh FC4 layer: ((current layer c * previous layer p) + 1*c) = 84*120 + 1*
84 = 10164.
8. The eighth Softmax layer has ((current layer c * previous layer p) + 1*c) parameters =
10*84 + 1*10 = 850.
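These counts can be reproduced with a small helper. This is a sketch based on the formulas above, with the layer list mirroring the Coursera example network referenced earlier:

```python
def conv_params(f, prev_channels, num_filters):
    # ((filter width * filter height * previous channels) + 1 bias) * filters
    return (f * f * prev_channels + 1) * num_filters

def fc_params(curr, prev):
    # (current neurons * previous neurons) + one bias per current neuron
    return curr * prev + curr

print(conv_params(5, 3, 8))    # CONV1: 608
print(conv_params(5, 8, 16))   # CONV2: 3216
print(fc_params(120, 400))     # FC3: 48120
print(fc_params(84, 120))      # FC4: 10164
print(fc_params(10, 84))       # Softmax: 850
```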
FYI:
1. In this article, the term “layer” is used very loosely to explain the separation. Ideally, CONV + Pooling
together is termed a layer.
2. Just because there are no parameters in the pooling layer, it does not imply that pooling has no
role in backprop. The pooling layer is responsible for passing on the values to the next and previous
layers during forward and backward propagation respectively.
In this article we saw what a parameter means, how to calculate the activation size, and
how to calculate the number of parameters in a CNN.
What is CNN?
Before we get to different types of CNN architecture, let’s quickly recall what a CNN is, what a
CNN model looks like, and what the most fundamental components of a CNN architecture are.
Convolutional Neural Networks, commonly referred to as CNNs, are a specialized kind of neural
network architecture that is designed to process data with a grid-like topology. This makes them
particularly well-suited for dealing with spatial and temporal data, like images and videos, that
maintain a high degree of correlation between adjacent elements.
CNNs are similar to other neural networks, but they have an added layer of complexity due to the
fact that they use a series of convolutional layers. Convolutional layers perform a mathematical
operation called convolution, a sort of specialized matrix multiplication, on the input data. The
convolution operation helps to preserve the spatial relationship between pixels by learning image
features using small squares of input data. The picture below represents a typical CNN
architecture.
The following are definitions of different layers shown in the above architecture:
Convolutional layers
Convolutional layers operate by sliding a set of ‘filters’ or ‘kernels’ across the input data. Each
filter is designed to detect a specific feature or pattern, such as edges, corners, or more complex
shapes in the case of deeper layers. As these filters move across the image, they generate a map
that signifies the areas where those features were found. The output of the convolutional layer
is a feature map, which is a representation of the input image with the filters applied.
Convolutional layers can be stacked to create more complex models, which can learn more
intricate features from images. Simply speaking, convolutional layers are responsible for
extracting features from the input images. These features might include edges, corners, textures,
or more complex patterns.
Pooling layers
Pooling layers follow the convolutional layers and are used to reduce the spatial dimension of
the input, making it easier to process and requiring less memory. In the context of images, “spatial
dimensions” refer to the width and height of the image. An image is made up of pixels, and you
can think of it like a grid, with rows and columns of tiny squares (pixels). By reducing the spatial
dimensions, pooling layers help reduce the number of parameters or weights in the network.
This helps to combat overfitting and helps train the model faster. Max pooling reduces
computational complexity owing to the reduction in size of the feature map, and makes the
model invariant to small translations. Without max pooling, the network would not gain the ability
to recognize features irrespective of small shifts or rotations. This would make the model less
robust to variations in object positioning within the image, possibly affecting accuracy.
There are two main types of pooling: max pooling and average pooling. Max pooling takes the
maximum value from each feature map. For example, if the pooling window size is 2×2, it will
pick the pixel with the highest value in that 2×2 region. Max pooling effectively captures the most
prominent feature or characteristic within the pooling window. Average pooling calculates the
average of all values within the pooling window. It provides a smooth, average feature
representation.
Fully connected layers
Fully-connected layers are one of the most basic types of layers in a convolutional neural network
(CNN). As the name suggests, each neuron in a fully-connected layer is connected to every
neuron in the previous layer. Fully connected layers are typically used towards the end of a
CNN, when the goal is to take the features learned by the convolutional and max pooling layers
and use them to make predictions, such as classifying the input to a label. For example, if we were
using a CNN to classify images of animals, the final fully connected layer might take the features
learned by the previous layers and use them to classify an image as containing a dog, cat, bird,
etc.
Fully connected layers take the high-dimensional output from the previous convolutional and
pooling layers and flatten it into a one-dimensional vector. This allows the network to combine
and integrate all the extracted features across the entire image, rather than considering localized
features. It helps in understanding the global context of the image. The fully connected layers are
responsible for mapping the integrated features to the desired output, such as class labels in
classification tasks. They act as the final decision-making part of the network, determining what
the extracted features mean in the context of the specific problem (e.g., recognizing a cat or a dog).
The combination of a convolution layer followed by a max-pooling layer, repeated in similar sets,
creates a hierarchy of features. The first layers detect simple patterns, and subsequent layers build on those
to detect more complex patterns.
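As an illustration of this Conv -> Pool -> Conv -> Pool -> Flatten -> FC -> Output pattern, here is a minimal tf.keras sketch; the 32 X 32 X 3 input shape, the filter counts and the 10 output classes are illustrative assumptions (chosen to mirror the Coursera example), not a prescribed architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),             # hypothetical RGB input
    layers.Conv2D(8, (5, 5), activation="relu"),   # convolution: feature extraction
    layers.MaxPooling2D((2, 2)),                   # pooling: spatial down-sampling
    layers.Conv2D(16, (5, 5), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                              # unroll feature maps into a vector
    layers.Dense(120, activation="relu"),          # fully connected layers
    layers.Dense(84, activation="relu"),
    layers.Dense(10, activation="softmax"),        # output layer: one neuron per class
])

model.summary()   # prints output shapes and parameter counts per layer
```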
Output Layer
The output layer in a Convolutional Neural Network (CNN) plays a critical role: it is the final
layer that produces the actual output of the network, typically in the form of a classification or
regression result.
The LeNet CNN is a simple yet powerful model that has been used for various tasks such as
handwritten digit recognition, traffic sign recognition, and face detection. Although LeNet was
developed more than 20 years ago, its architecture is still relevant today and continues to be used.
The key innovation of ZFNet lies in its approach to improving the AlexNet architecture, which
was the winner of the ILSVRC in 2012. ZFNet addressed some of the limitations of AlexNet
by tweaking the CNN architecture, particularly focusing on the convolutional layers.
ZFNet modified the first few layers of AlexNet. It used smaller filter sizes in the first and second
convolutional layers and altered the stride and filter sizes to improve feature extraction. One of the
most notable contributions of ZFNet was the introduction of a novel visualization
technique that allowed for better understanding and interpretation of the feature maps in CNNs.
By fine-tuning the architecture, ZFNet achieved improved performance in image classification
tasks compared to its predecessor, AlexNet.
GoogLeNet_DeepDream – Generate images based on CNN features
GoogLeNet_DeepDream is a DeepDream CNN architecture that was developed by Alexander
Mordvintsev, Christopher Olah, et al. It uses the Inception (GoogLeNet) network to generate images based on
CNN features. The architecture is often used with the ImageNet dataset to generate psychedelic,
dream-like images or to create abstract artworks.
To summarize the different types of CNN architectures described above in an easy to remember
form, you can use the following:
Classic Networks
1. LeNet-5
2. AlexNet
3. VGG
We will also see how ResNet works and finally go through a case study of an inception neural
network.
LeNet-5
Parameters: 60k
Layers flow: Conv -> Pool -> Conv -> Pool -> FC -> FC -> Output
Activation functions: Sigmoid/tanh and ReLu
AlexNet
This network is similar to LeNet-5 with just more convolution and pooling layers:
Parameters: 60 million
Activation function: ReLu
VGG-16
The underlying idea behind VGG-16 was to use a much simpler network where the focus is on
having convolution layers that have 3 X 3 filters with a stride of 1 (and always using the same
padding). The max pool layer is used after each convolution layer with a filter size of 2 and a stride
of 2. Let’s look at the architecture of VGG-16:
As it is a bigger network, the number of parameters is also much larger (about 138 million).
These are three classic architectures. Next, we’ll look at more advanced architecture starting with
ResNet.
ResNet
Training very deep networks can lead to problems like vanishing and exploding gradients. How
do we deal with these issues? We can use skip connections, where we take activations from one
layer and feed them to another layer much deeper in the network. There are residual blocks
in ResNet which help in training deeper networks.
Residual Blocks
The general flow to calculate activations from different layers can be given as:
z[l+1] = W[l+1] a[l] + b[l+1], a[l+1] = g(z[l+1])
z[l+2] = W[l+2] a[l+1] + b[l+2], a[l+2] = g(z[l+2])
This is how we calculate the activations a[l+2] using the activations a[l] and then a[l+1]. a[l] needs to
go through all these steps to generate a[l+2].
In a residual network, we make a change in this path. We take the activations a[l] and pass them
directly to the second layer, so that:
a[l+2] = g(z[l+2] + a[l])
The benefit of training a residual network is that even if we train deeper networks, the
training error does not increase. Whereas in case of a plain network, the training error first
decreases as we train a deeper network and then starts to rapidly increase:
We now have an overview of how ResNet works. But why does it perform so well? Let’s find out!
In order to make a good model, we first have to make sure that its performance on the training
data is good. That’s the first test, and there really is no point in moving forward if our model fails
here. We have seen earlier that training deeper networks using a plain network increases the
training error after a point of time. But while training a residual network, this isn’t the case. Even
when we build a deeper residual network, the training error generally does not increase.
The reason is that it is easy for a residual block to learn the identity function. As per the research paper,
the residual block is given by:
a[l+2] = g(z[l+2] + a[l]) = g(W[l+2] a[l+1] + b[l+2] + a[l])
If the weights W[l+2] and the bias b[l+2] shrink to zero, this reduces to a[l+2] = g(a[l]), so it is
fairly easy to calculate a[l+2] knowing just the value of a[l]; the extra layers can do no worse than
the identity mapping.
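A minimal tf.keras sketch of a residual block illustrating the skip connection; the 28 X 28 X 64 input shape, the filter count of 64 and the ReLU activation are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    # Main path: two convolutions, as in a plain network.
    shortcut = x
    y = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    # Skip connection: add a[l] to z[l+2] before the final activation.
    y = layers.Add()([y, shortcut])
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(28, 28, 64))   # hypothetical feature map
outputs = residual_block(inputs)
print(outputs.shape)                          # (None, 28, 28, 64)
```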
Let’s see how a 1 X 1 convolution can be helpful. Suppose we have a 28 X 28 X 192 input and we
apply a 1 X 1 convolution using 32 filters. So, the output will be 28 X 28 X 32:
The basic idea of using 1 X 1 convolution is to reduce the number of channels from the image. A
couple of points to keep in mind:
We generally use a pooling layer to shrink the height and width of the image
To reduce the number of channels from an image, we convolve it using a 1 X 1 filter (hence
reducing the computation cost as well)
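A short tf.keras sketch of the 28 X 28 X 192 to 28 X 28 X 32 channel reduction described above:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(28, 28, 192))
# 32 filters of size 1 x 1 x 192 collapse the channel dimension to 32
# while leaving the height and width unchanged.
reduced = layers.Conv2D(32, (1, 1), activation="relu")(inputs)
print(reduced.shape)   # (None, 28, 28, 32)
```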
While designing a convolutional neural network, we have to decide the filter size. Should it be a 1
X 1 filter, or a 3 X 3 filter, or a 5 X 5? Inception does all of that for us! Let’s see how it works.
Suppose we have a 28 X 28 X 192 input volume. Instead of choosing what filter size to use, or
whether to use convolution layer or pooling layer, inception uses all of them and stacks all the
outputs:
A good question to ask here – why are we using all these filters instead of using just a single filter
size, say 5 X 5? Let’s look at how many computations would arise if we would have used only a
5 X 5 filter on our input:
Number of multiplies = 28 * 28 * 32 * 5 * 5 * 192 = 120 million! Can you imagine how expensive
performing all of these will be?
Now, let’s look at the computations a 1 X 1 convolution and then a 5 X 5 convolution will give
us:
Number of multiplies for first convolution = 28 * 28 * 16 * 1 * 1 * 192 = 2.4 million
Number of multiplies for second convolution = 28 * 28 * 32 * 5 * 5 * 16 = 10 million
Total number of multiplies = 12.4 million
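The multiplication counts quoted above can be checked directly (28 X 28 output, 192 input channels, a 16-channel bottleneck, and 32 output channels):

```python
def conv_multiplies(out_h, out_w, out_c, f, in_c):
    # Each output value needs f * f * in_c multiplications.
    return out_h * out_w * out_c * f * f * in_c

direct_5x5 = conv_multiplies(28, 28, 32, 5, 192)
bottleneck = conv_multiplies(28, 28, 16, 1, 192) + conv_multiplies(28, 28, 32, 5, 16)

print(direct_5x5)   # 120422400  (~120 million)
print(bottleneck)   # 12443648   (~12.4 million)
```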
Inception Networks
Computer Vision:
1. Image Classification:
CNNs excel in image classification tasks by learning hierarchical features from pixels to high-
level representations. Applications include identifying objects in images, facial recognition, and
scene classification
2. Object Detection:
CNNs are widely used for object detection tasks where the goal is to not only classify objects but
also locate them in an image. Applications include autonomous vehicles, surveillance, and
augmented reality.
3. Segmentation: CNNs are applied to image segmentation tasks, where the goal is to assign
a label to each pixel in an image. This is useful in medical imaging for identifying tumors,
as well as in general image editing and understanding.
4. Image Generation: CNNs can be used for image generation tasks, such as generating
realistic images from scratch or modifying existing ones. Generative models like GANs
(Generative Adversarial Networks) leverage CNNs for image synthesis.
5. Facial Recognition: CNNs play a crucial role in facial recognition systems, enabling
applications like unlocking devices, verifying identities, and enhancing security.
Speech Processing:
1. Speech Recognition:
CNNs are applied to automatic speech recognition (ASR) systems, converting spoken language
into text. They learn spectral features from audio signals, improving the accuracy of transcriptions
in applications like voice assistants, dictation software, and customer service automation.
2. Speaker Identification: CNNs can be used to identify individuals based on their voice
characteristics. This has applications in security, authentication systems, and personalized
services.
Audio Analysis:
The success of CNNs in these applications lies in their ability to automatically learn hierarchical
features and patterns directly from the raw input data, without the need for manual feature
engineering. This adaptability makes CNNs a powerful tool in various domains, enabling
advancements in technology and improving the efficiency of a wide range of applications.
Transfer learning is a powerful technique used in Deep Learning. By harnessing the ability to reuse
existing models and their knowledge on new problems, transfer learning has opened the door to
training deep neural networks even with limited data. This breakthrough is especially significant
in data science, where practical scenarios often lack sufficient labeled data.
What Is Transfer Learning?
The reuse of a pre-trained model on a new problem is known as transfer learning in machine
learning. In transfer learning, a machine uses the knowledge learned from a prior task to improve
prediction on a new task. For example, when training a classifier to predict whether an image contains
food, you could use the knowledge it gained during training to help recognize beverages.
The knowledge of an already trained machine learning model is transferred to a different but
closely related problem. For example, if you trained a simple classifier to predict whether an image
contains a backpack, you could use the model’s training knowledge to identify other objects, such as
sunglasses. With transfer learning, we basically try to use what we’ve learned in one task to better
understand the concepts in another: the weights that a network has learned on the first task are
shifted to a network performing the second, related task.
Because of the massive amount of CPU power required, transfer learning is typically applied in
computer vision and natural language processing tasks like sentiment analysis.
How Does Transfer Learning Work?
In computer vision, neural networks typically aim to detect edges in the first layers, shapes in the
middle layers, and task-specific features in the later layers. In transfer learning, the early and central
layers are reused, and only the later layers are retrained, making use of the labelled data from the
task the model was originally trained on.
Let’s return to the example of a model that was trained to identify a backpack in an image
and will now be used to detect sunglasses. Because the model has already learned to recognise objects in
the earlier layers, we will simply retrain the later layers to learn what distinguishes sunglasses
from other objects.
Transfer learning offers a number of advantages, the most important of which are reduced training
time, improved neural network performance (in most circumstances), and the ability to get by
without a large amount of data.
To train a neural model from scratch, a lot of data is typically needed, but access to that data isn’t
always available; this is where transfer learning helps.
Because the model has already been pre-trained, a good machine learning model can be generated
with fairly little training data using transfer learning. This is especially useful in natural language
processing, where huge labelled datasets require a lot of expert knowledge. Additionally, training
time is decreased because building a deep neural network from scratch for a complex task can take
days or even weeks.
Transfer learning is suitable when we don’t have enough annotated data to train our model with and
there is a pre-trained model that has been trained on similar data and tasks. If you used TensorFlow to
train the original model, you might simply restore it and retrain some layers for your job. Transfer
learning, on the other hand, only works if the features learnt in the first task are general, meaning they
can be applied to another, related activity. Furthermore, the model’s input must be the same size as it was
when the model was first trained. If it isn’t, add a step to resize your input to the required size, for
example as sketched below:
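A minimal tf.keras sketch of this workflow: resize the new inputs to the size the pre-trained model expects, freeze the pre-trained layers, and retrain only a new classification head. The choice of MobileNetV2, the 224 X 224 input size and the 2-class head are illustrative assumptions, not values from the notes:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Pre-trained convolutional base (ImageNet weights), without its original classifier.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False   # freeze the early and central layers

model = models.Sequential([
    tf.keras.Input(shape=(None, None, 3)),
    layers.Resizing(224, 224),                 # resize step: match the pre-trained input size
    layers.Rescaling(1.0 / 127.5, offset=-1),  # scale [0, 255] pixels to [-1, 1] for MobileNetV2
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(2, activation="softmax"),     # new task-specific output layer (e.g. 2 classes)
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```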
Consider the situation in which you wish to tackle Task A but lack the necessary data to
train a deep neural network. Finding a related task B with a lot of data is one method to get
around this.
Utilize the deep neural network to train on task B and then use the model to solve task A.
The problem you’re seeking to solve will decide whether you need to employ the entire
pre-trained model or only some of its layers. If the input in both tasks is the same, you might
reapply the model and make predictions for your new input. Otherwise, changing and retraining the
task-specific layers and the output layer is the usual approach. There are plenty
of these models out there, so do some research beforehand. The number of layers to reuse
depends on how similar the two tasks are.
Keras provides nine pre-trained models used for transfer learning, prediction, and fine-tuning.
These models, as well as some quick lessons on how to utilise them, may be found in the Keras
Applications documentation. The most popular application of this form of transfer learning is deep learning.
3. Extraction of Features
Another option is to utilise deep learning to identify the optimum representation of your
problem, which comprises identifying the key features. This method is known as
representation learning, and it can often produce significantly better results than hand-
designed representations.
Feature creation in machine learning is mainly done by hand by researchers and domain
specialists. Deep learning, fortunately, can extract features automatically. Of course, this
does not diminish the importance of feature engineering and domain knowledge; you must
still decide which features belong in your network and which don’t. Even for complicated tasks
that would otherwise necessitate a lot of human effort, deep learning can discover useful features
on its own.
The learned representation can then be applied to a variety of other challenges. Simply
utilise the initial layers to find the appropriate feature representation, but avoid using the
network’s final output because it is too task-specific. Instead, send data into your network and
take one of the intermediate layers as the output; this layer can then be interpreted as a
representation of the raw data.
This method is commonly used in computer vision since it can shrink your dataset,
reducing computation time and making it more suited for classical algorithms.
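A minimal sketch of feature extraction with a pre-trained network in tf.keras, using the pooled output of the convolutional base as a feature vector. The choice of VGG16 and the random stand-in image are illustrative assumptions:

```python
import numpy as np
import tensorflow as tf

# Pre-trained convolutional base without its task-specific classifier head.
extractor = tf.keras.applications.VGG16(
    include_top=False, weights="imagenet", pooling="avg")

# Hypothetical batch of one 224 x 224 RGB image (random values stand in for real data).
images = np.random.rand(1, 224, 224, 3).astype("float32") * 255.0
images = tf.keras.applications.vgg16.preprocess_input(images)

# Each image becomes a 512-dimensional feature vector that can be fed to a
# classical classifier (e.g. logistic regression or an SVM).
features = extractor.predict(images)
print(features.shape)   # (1, 512)
```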
There are a number of popular pre-trained machine learning models available. The Inception-v3
model, which was developed for the ImageNet “Large Visual Recognition Challenge,” is one of
them. Participants in this challenge had to categorize pictures into 1,000 classes.