
Chameli Devi Group of Institutions, Indore

Department of Computer Science and Engineering


Subject Notes
CS 601- Machine Learning
UNIT-III

Syllabus: Convolutional neural network, flattening, subsampling, padding, stride, convolution layer, pooling layer, loss layer, dense layer, 1x1 convolution, inception network, input channels, transfer learning, one shot learning, dimension reductions, implementation of CNN with TensorFlow, Keras, etc.
Course Outcome:
Students will be able to design CNN algorithms to solve related real-life problems.

Convolutional Neural Networks (CNNs / ConvNets)


Convolutional Neural Networks are very similar to ordinary Neural Networks. They are made up
of neurons that have learnable weights and biases. Each neuron receives some inputs,
performs a dot product, and optionally follows it with a non-linearity. The whole network still
expresses a single differentiable score function: from the raw image pixels on one end to class
scores at the other. They still have a loss function (e.g., SVM/Softmax) on the last (fully
connected) layer and all the tips/tricks we developed for learning regular Neural Networks still
apply.
ConvNet architectures make the explicit assumption that the inputs are images, which allows
us to encode certain properties into the architecture. These then make the forward function
more efficient to implement and vastly reduce the number of parameters in the network. As
described above, a simple ConvNet is a sequence of layers, and every layer of a ConvNet
transforms one volume of activations to another through a differentiable function. We use
three main types of layers to build ConvNet architectures: Convolutional Layer, Pooling Layer,
and Fully Connected Layer (exactly as seen in regular Neural Networks). We will stack these
layers to form a full ConvNet architecture.

Figure 3.1: ConvNet Architecture

Example of ConvNet Architecture:


 INPUT [32x32x3] will hold the raw pixel values of the image, in this case an image of width
32, height 32, and with three color channels R, G, B.
 CONV layer will compute the output of neurons that are connected to local regions in the
input, each computing a dot product between their weights and a small region they are
connected to in the input volume. This may result in volume such as [32x32x12] if we
decided to use 12 filters.
 RELU layer will apply an element-wise activation function, such as the max(0, x)
thresholding at zero. This leaves the size of the volume unchanged ([32x32x12]).
 POOL layer will perform a downsampling operation along the spatial dimensions (width,
height), resulting in volume such as [16x16x12].
 FC (i.e., fully connected) layer will compute the class scores, resulting in a volume of size
[1x1x10], where each of the 10 numbers corresponds to a class score, such as among the 10
categories of CIFAR-10. As with ordinary Neural Networks and as the name implies, each
neuron in this layer will be connected to all the numbers in the previous volume.
In this way, ConvNets transform the original image layer by layer from the original pixel values
to the final class scores.
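As an illustration, here is a minimal tf.keras sketch of this INPUT -> CONV -> RELU -> POOL -> FC sequence; the filter count and sizes are chosen to match the example above, and any similar values would work:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(12, (3, 3), padding='same', activation='relu',
                  input_shape=(32, 32, 3)),   # CONV (with RELU) on a 32x32x3 input -> [32x32x12]
    layers.MaxPooling2D((2, 2)),              # POOL -> [16x16x12]
    layers.Flatten(),                         # prepare for the fully connected layer
    layers.Dense(10)                          # FC -> 10 class scores
])
model.summary()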

Flattening
Flattening is converting the data into a 1-dimensional array for inputting it to the next layer. We
flatten the output of the convolutional layers to create a single long feature vector. And it is
connected to the final classification model, which is called a fully-connected layer. In other
words, we put all the pixel data in one line and make connections with the final layer.

Figure 3.2: Flattening
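A small NumPy sketch of flattening; the 4x4x64 feature-map shape is just an example value:

import numpy as np

feature_maps = np.random.rand(4, 4, 64)   # output of the last convolutional/pooling layer
flattened = feature_maps.reshape(-1)      # one long 1-D feature vector
print(flattened.shape)                    # (1024,) -> fed into the fully connected layer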


Subsampling
Sub-sampling's purpose is just to reduce the dimensions of the input. In standard CNNs, a
convolution layer has trainable parameters which are tuned during the training process, while
the sub-sampling layer is a constant operation (usually performed by a max-pooling layer). In
CNNs this max-pooling usually helps add some spatial invariance to the model.
You have a filter of a certain size. The output at each position is the element-wise multiplication
of this filter with an area of the input of the same size, summed into a single value. You can use
the stride of the convolution filter to perform sub-sampling of the input.
For an input of 7x7 and a 3x3 filter, a stride of 1 pixel gives an output of 5x5, while a stride of
2 pixels gives an output of 3x3. So technically we get sub-sampling as part of the convolution
layer, but the sub-sampling is not trainable (the stride size is constant).
More often, when talking about a sub-sampling layer, the meaning is a max-pooling layer, which,
similarly to convolution, also has a filter and a stride of some size. However, there are no
trainable weights (the output is just the max pixel of each area).
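The 7x7 example above can be checked with the standard output-size formula; here is a small sketch, assuming a 3x3 filter (which is what produces the 5x5 and 3x3 outputs quoted):

def conv_output_size(n, f, stride, padding=0):
    # floor((n + 2*padding - f) / stride) + 1
    return (n + 2 * padding - f) // stride + 1

print(conv_output_size(7, 3, stride=1))   # 5 -> 5x5 output
print(conv_output_size(7, 3, stride=2))   # 3 -> 3x3 output (sub-sampling via the stride)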

Padding
Padding is a term relevant to convolutional neural networks as it refers to the amount of pixels
added to an image when it is being processed by the kernel of a CNN. For example, if the
padding in a CNN is set to zero, then every pixel value that is added will be of value zero. If,
however, the zero padding is set to one, there will be a one pixel border added to the image
with a pixel value of zero.

Figure 3.3: Padding

How does Padding work?


Padding works by extending the area over which a convolutional neural network processes an
image. The kernel is the neural network's filter, which moves across the image, scanning each
pixel and converting the data into a smaller, or sometimes larger, format. To assist the
kernel with processing the image, padding is added to the frame of the image to give the kernel
more space to cover the image. Adding padding to an image processed by a CNN allows
for more accurate analysis of images.
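A minimal NumPy sketch of zero padding set to one; the 3x3 image values are arbitrary:

import numpy as np

image = np.arange(9).reshape(3, 3)
# A one-pixel border of zeros is added around the image: 3x3 -> 5x5,
# giving the kernel room to cover the edge pixels as well.
padded = np.pad(image, pad_width=1, mode='constant', constant_values=0)
print(padded.shape)   # (5, 5)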

Stride
Stride is a component of convolutional neural networks, or neural networks tuned for the
compression of images and video data. Stride is a parameter of the neural network's filter that
modifies the amount of movement over the image or video. For example, if a neural network's
stride is set to 1, the filter will move one pixel, or unit, at a time. The size of the filter affects the
encoded output volume, so stride is often set to a whole integer, rather than a fraction or
decimal.
Figure 3.4: Stride
Imagine a convolutional neural network taking an image and analysing its content. If the
filter size is 3x3 pixels, the nine pixels it covers will be converted down to 1 pixel in the output
layer. Naturally, as the stride, or movement, is increased, the resulting output will be smaller.
Stride is a parameter that works in conjunction with padding, the feature that adds blank or
empty pixels to the frame of the image to minimize the reduction of size in the output
layer. Roughly, padding is a way of increasing the size of an image to counteract the fact that
stride reduces the size. Padding and stride are the foundational parameters of any convolutional
neural network.
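A short Keras sketch of the effect of stride; the layer sizes here are illustrative only:

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 32, 32, 3))                      # one 32x32 RGB image
conv_s1 = layers.Conv2D(8, (3, 3), strides=1, padding='valid')
conv_s2 = layers.Conv2D(8, (3, 3), strides=2, padding='valid')
print(conv_s1(x).shape)   # (1, 30, 30, 8): the filter moves one pixel at a time
print(conv_s2(x).shape)   # (1, 15, 15, 8): stride 2 roughly halves the spatial output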

Convolution Layer
Convolution is the first layer to extract features from an input image. Convolution preserves the
relationship between pixels by learning image features using small squares of input data. It is a
mathematical operation that takes two inputs, such as an image matrix and a filter or kernel.

Figure 3.5: Convolution layer


Consider a 5 x 5 image matrix whose pixel values are 0 and 1, and a 3 x 3 filter matrix, as shown below:

Figure 3.6: Working of Convolution layer

Then the convolution of the 5 x 5 image matrix with the 3 x 3 filter matrix produces an output
called the "Feature Map", as shown below:

Figure 3.7: Convolution layer output
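A minimal NumPy sketch of this convolution; the 0/1 image values and the filter values below are illustrative and are not taken from the figures:

import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over the image (stride 1, no padding); each output value is
    # the sum of the element-wise product over one window -> the feature map.
    h = image.shape[0] - kernel.shape[0] + 1
    w = image.shape[1] - kernel.shape[1] + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = image[i:i + kernel.shape[0], j:j + kernel.shape[1]]
            out[i, j] = np.sum(window * kernel)
    return out

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
print(convolve2d(image, kernel))   # 3x3 feature map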

Pooling Layer
The pooling layer section reduces the number of parameters when the images are too large.
Spatial pooling (also called subsampling or down sampling) reduces the dimensionality of
each map but retains the important information. Spatial pooling can be of different types:
 Max Pooling
 Average Pooling
 Sum Pooling
Max pooling takes the largest element from the rectified feature map. Average pooling instead
takes the average of the elements in each window, and taking the sum of all elements in the
feature map is called sum pooling.
Figure 3.8: Pooling layer
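A NumPy sketch of 2x2 spatial pooling with stride 2 on a small example feature map (the values are arbitrary):

import numpy as np

feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 0],
                        [3, 1, 1, 2],
                        [0, 2, 4, 3]])

# Group the 4x4 map into 2x2 blocks and reduce each block to one value.
blocks = feature_map.reshape(2, 2, 2, 2).swapaxes(1, 2)
print(blocks.max(axis=(2, 3)))    # max pooling      -> [[6 5] [3 4]]
print(blocks.mean(axis=(2, 3)))   # average pooling
print(blocks.sum(axis=(2, 3)))    # sum pooling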
Loss Layer:
In the context of an optimization algorithm, the function used to evaluate a candidate solution
(i.e., a set of weights) is referred to as the objective function.
We may seek to maximize or minimize the objective function, meaning that we are searching for
a candidate solution that has the highest or lowest score respectively.
Typically, with neural networks, we seek to minimize the error. As such, the objective function is
often referred to as a cost function or a loss function and the value calculated by the loss
function is referred to as simply “loss.”

The function we want to minimize or maximize is called the objective function or criterion. When
we are minimizing it, we may also call it the cost function, loss function, or error function.

The cost or loss function has an important job in that it must faithfully distill all aspects of the
model down into a single number in such a way that improvements in that number are a sign of
a better model.

The cost function reduces all the various good and bad aspects of a possibly complex system
down to a single number, a scalar value, which allows candidate solutions to be ranked and
compared.

In calculating the error of the model during the optimization process, a loss function must be
chosen.
This can be a challenging problem as the function must capture the properties of the problem
and be motivated by concerns that are important to the project and stakeholders.

It is important, therefore, that the function faithfully represent our design goals. If we choose a
poor error function and obtain unsatisfactory results, the fault is ours for badly specifying the
goal of the search.

In the layer we call the FC layer, we flatten our matrix into a vector and feed it into a fully
connected layer, as in an ordinary neural network.
Figure 3.9: Working of Loss Layer
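A small sketch of how a loss function collapses predictions and true labels into a single scalar, using the same sparse categorical cross-entropy that the TensorFlow implementation section uses later; the labels and logits below are made-up values:

import tensorflow as tf

y_true = tf.constant([2, 0])                        # true class indices
logits = tf.constant([[0.1, 0.2, 3.0],
                      [2.5, 0.3, 0.1]])             # raw class scores from the network
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
print(float(loss_fn(y_true, logits)))               # one scalar: the "loss"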

Dense Layer
The dense layer is the regular deeply connected neural network layer. It is the most common and
frequently used layer. The dense layer performs the following operation on the input and returns the
output:
output = activation(dot(input, kernel) + bias)
where,
 Input represents the input data
 Kernel represents the weight data
 Dot represents the NumPy dot product of the input and its corresponding weights
 Bias represents a bias value used in machine learning to optimize the model
 Activation represents the activation function.

The output shape of the Dense layer is determined by the number of neurons/units specified
in the Dense layer. For example, if the input shape is (8) and the number of units is 16, then the
output shape is (16). All layers have the batch size as the first dimension, so the input shape
is represented by (None, 8) and the output shape as (None, 16). Currently, the batch size is
None as it is not set; the batch size is usually set during the training phase.

Figure 3.10: Working of dense layer
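A NumPy sketch of the dense-layer operation above, reproducing the (None, 8) -> (None, 16) shapes from the example; the weights are random placeholders:

import numpy as np

def dense(inputs, kernel, bias, activation):
    # output = activation(dot(input, kernel) + bias)
    return activation(np.dot(inputs, kernel) + bias)

relu = lambda z: np.maximum(z, 0)
inputs = np.random.rand(1, 8)      # one sample with input shape (8)
kernel = np.random.rand(8, 16)     # weight matrix: 8 inputs -> 16 units
bias = np.zeros(16)
print(dense(inputs, kernel, bias, relu).shape)   # (1, 16) -> output shape (None, 16)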


1x1 convolutions
Convolutional layers are lighter than fully connected ones, but they still connect every input
channel with every output channel for every position in the kernel window. This is what gives
the c_in * c_out multiplicative factor in the number of weights.
We would usually have a 3x3 kernel size with 256 input and output channels. Instead of this, we
first do a 1x1 convolutional layer bringing the number of channels down to something like 32.
Then we perform the convolution with a 3x3 kernel size. We finally make another 1x1
convolutional layer to have 256 channels again.
A 1x1 convolution kernel acts as an embedding solution: it reduces the size of the input vector,
i.e., the number of channels, and makes the representation more compact and meaningful. The 1x1
convolutional layer is also called a pointwise convolution.

A 1x1 convolution is also used to reduce the number of channels while introducing non-linearity.
A 1x1 convolution simply means the filter is of size 1x1 (yes, a single number per channel, as
opposed to a matrix like a 3x3 filter). This 1x1 filter convolves over the entire input image,
pixel by pixel.
Staying with our example input of 64x64x3, if we choose a 1x1 filter (which would be 1x1x3),
then the output will have the same height and width as the input but only one channel:
64x64x1.
Now consider inputs with a large number of channels, 192 for example. If we want to reduce
the depth but keep the height x width of the feature maps (the receptive field) the same, then
we can choose 1x1 filters (remember: number of filters = number of output channels) to achieve this
effect. This effect of cross-channel down-sampling is called "dimensionality reduction".

Figure 3.11: Working of 1X1 convolution layer

1X1 Convolution is effectively used for:


1. Dimensionality Reduction/Augmentation
2. Reduce computational load by reducing parameter map
3. Add additional non-linearity to the network
4. Create deeper network through “Bottle-Neck” layer
5. Create smaller CNN network which retains higher degree of accuracy
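A Keras sketch of the 256 -> 32 -> 256 bottleneck described above; the 28x28 spatial size is an arbitrary example:

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 28, 28, 256))                                 # 256 input channels

y = layers.Conv2D(32, (1, 1), activation='relu')(x)                    # 1x1 conv: channels down to 32
y = layers.Conv2D(32, (3, 3), padding='same', activation='relu')(y)    # cheap 3x3 convolution
y = layers.Conv2D(256, (1, 1), activation='relu')(y)                   # 1x1 conv: back to 256 channels
print(y.shape)   # (1, 28, 28, 256): same spatial size, depth reduced then restored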
Inception Network
The Inception network uses modules in convolutional neural networks that allow more efficient
computation and deeper networks through dimensionality reduction with stacked 1×1
convolutions. The modules were designed to solve the problem of computational expense, as
well as overfitting, among other issues. The solution, in short, is to take multiple kernel filter
sizes within the CNN, and rather than stacking them sequentially, order them to operate on
the same level.

Figure 3.12: Inception module (naïve version)

How does an Inception network work?


Inception Modules are incorporated into convolutional neural networks (CNNs) as a way of
reducing computational expense. As a neural net deals with a vast array of images, with wide
variation in the featured image content, also known as the salient parts, they need to be
designed appropriately. The most simplified version of an inception module works by
performing a convolution on an input with not one, but three different sizes of filters (1x1, 3x3,
5x5). Also, max pooling is performed. Then, the resulting outputs are concatenated and sent to
the next layer. By structuring the CNN to perform its convolutions on the same level, the
network gets progressively wider, not deeper. 
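A sketch of the naive Inception module described above, implemented with the Keras functional API; the filter counts 16/32/8 and the 32x32x64 input are arbitrary example values:

import tensorflow as tf
from tensorflow.keras import layers

def naive_inception_module(x, f1, f3, f5):
    # Parallel 1x1, 3x3 and 5x5 convolutions plus max pooling on the same input,
    # concatenated along the channel axis: the network grows wider, not deeper.
    conv1 = layers.Conv2D(f1, (1, 1), padding='same', activation='relu')(x)
    conv3 = layers.Conv2D(f3, (3, 3), padding='same', activation='relu')(x)
    conv5 = layers.Conv2D(f5, (5, 5), padding='same', activation='relu')(x)
    pool = layers.MaxPooling2D((3, 3), strides=1, padding='same')(x)
    return layers.Concatenate(axis=-1)([conv1, conv3, conv5, pool])

inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = naive_inception_module(inputs, 16, 32, 8)
print(tf.keras.Model(inputs, outputs).output_shape)   # (None, 32, 32, 120)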

Input Channels
Channels come from "media". Looking at the broadcast technology behind TVs, you have multiple
channels for different information that is broadcast to your TV. For example, an image
might consist of only three channels that contain information on how much red, green, or blue
each pixel in the image is. Mapping this to a CNN, you would have an RGB image with three
channels. An image can be interpreted in other ways as well. For example, you could take
from an image information about how cyan, magenta, yellow, or black something is. This
would mean your CMYK image would be analysed through four channels (each colour being one
channel).
In a grayscale image, the data is a matrix of dimensions w×h, where w is the width of the image
and h is its height. In a color image, we normally have 3 channels: red, green, and blue; this
way, a color image can be represented as a matrix of dimensions w×h×c, where c is the number
of channels, that is, 3.
A convolution layer receives the image (w×h×c) as input and generates as output an activation
map of dimensions w′×h′×c′. The number of input channels in the convolution is c, while the
number of output channels is c′. The filter for such a convolution is a tensor of dimensions
f×f×c×c′, where f is the filter size (normally 3 or 5).
This way, the number of channels is the depth of the matrices involved in the convolutions.
Also, a convolution operation defines the variation in such depth by specifying input and output
channels. These explanations extend directly to 1D or 3D signals, but the
analogy with image channels made it more appropriate to use 2D signals.
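A short Keras sketch of these shapes: a w×h×c input with c = 3 channels, a convolution with c′ = 8 output channels, and the f×f×c×c′ filter tensor (f = 3 here); all sizes are example values:

import tensorflow as tf
from tensorflow.keras import layers

image = tf.random.normal((1, 64, 64, 3))    # (batch, w, h, c) with c = 3 channels
conv = layers.Conv2D(8, (3, 3), padding='same')
activation_map = conv(image)
print(activation_map.shape)                 # (1, 64, 64, 8) -> w' x h' x c'
print(conv.kernel.shape)                    # (3, 3, 3, 8)   -> f x f x c x c'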

Transfer learning
Transfer learning is the idea of overcoming the isolated learning paradigm and utilizing the
knowledge acquired for one task to solve related ones, as applied to machine learning, and in
particular to the domain of deep learning.
Why transfer learning?
Many deep neural networks trained on natural images exhibit a curious phenomenon in
common: on the first layer they learn features similar to Gabor filters and color blobs. Such first-
layer features appear not too specific to a particular dataset or task but are general in that they
are applicable to many datasets and tasks. As finding these standard features on the first layer
seems to occur regardless of the exact cost function and natural image dataset, we call these
first-layer features general. For example, in a network with an N-dimensional softmax output
layer that has been successfully trained towards a supervised classification objective, each
output unit will be specific to a particular class. We thus call the last-layer features specific.
In transfer learning we first train a base network on a base dataset and task, and then we
repurpose the learned features, or transfer them, to a second target network to be trained on a
target dataset and task. This process will tend to work if the features are general, that is,
suitable to both base and target tasks, instead of being specific to the base task.
In practice, very few people train an entire Convolutional Network from scratch because it is
relatively rare to have a dataset of sufficient size. Instead, it is common to pre-train a ConvNet
on a very large dataset (e.g. ImageNet, which contains 1.2 million images with 1000 categories),
and then use the ConvNet either as an initialization or a fixed feature extractor for the task of
interest.
There are many strategies to follow for the transfer learning process in deep learning. A
widely used strategy in transfer learning is to:
 Load the weight matrices of a pre-trained model, except for the weights of the very last
layers near the output,
 Hold those weights fixed, i.e., untrainable,
 Attach new layers suitable for the task at hand, and train the model with new data.
Figure 3.13: The transfer learning strategy for deep learning networks

This way, we don’t have to train the whole model; we get to repurpose the model for our
specific machine learning task yet can leverage the learned structures and patterns of the data,
contained in the fixed weights, which are loaded from the pre-trained, optimized model.
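A Keras sketch of this strategy; the choice of MobileNetV2 as the pre-trained base, the 160x160 input size, and the 5-class new head are example values rather than part of the notes above:

import tensorflow as tf
from tensorflow.keras import layers

# Load a base pre-trained on ImageNet, without its last (output) layers
base = tf.keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                         include_top=False,
                                         weights='imagenet')
base.trainable = False                      # hold the pre-trained weights fixed

# Attach new layers suitable for the task at hand
model = tf.keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(5, activation='softmax')   # new output layer for a 5-class target task
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(target_images, target_labels, epochs=5)   # only the new layers are trained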

One shot learning


Deep Convolutional Neural Networks have become the state-of-the-art methods for image
classification tasks. However, one of the biggest limitations is they require a lot of labelled data.
In many applications, collecting this much data is sometimes not feasible. One Shot Learning
aims to solve this problem.
One-shot learning is a classification task where one, or a few, examples are used to classify many
new examples in the future.
This characterizes tasks seen in the field of face recognition, such as face identification and face
verification, where people must be classified correctly with different facial expressions, lighting
conditions, accessories, and hairstyles given one or a few template photos.
Modern face recognition systems approach the problem of one-shot learning via face
recognition by learning a rich low-dimensional feature representation, called a face embedding
that can be calculated for faces easily and compared for verification and identification tasks.
Historically, embeddings were learned for one-shot learning problems using a Siamese network.
The training of Siamese networks with comparative loss functions resulted in better
performance, later leading to the triplet loss function used in the FaceNet system by Google that
achieved then state-of-the-art results on benchmark face recognition tasks.
In case of standard classification, the input image is fed into a series of layers, and finally at the
output we generate a probability distribution over all the classes (typically using a Softmax). For
example, if we are trying to classify an image as cat or dog or horse or elephant, then for every
input image, we generate 4 probabilities, indicating the probability of the image belonging to
each of the 4 classes. Two important points must be noticed here. First, during the training
process we require a large number of images for each of the classes (cats, dogs, horses and
elephants). Second, if the network is trained only on the above 4 classes of images, then we
cannot expect to test it on any other class, for example "zebra". If we want our model to classify
images of zebras as well, then we need to first get a lot of zebra images and re-train the
model. There are applications wherein we do not have enough data for each class, and the
total number of classes is huge as well as dynamically changing. Thus, the cost of data
collection and periodic re-training is too high.
On the other hand, in a one shot classification, we require only one training example for each
class.
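A very small sketch of the verification idea: compare a face embedding computed from a single template photo with the embedding of a new photo, and accept if they are close enough. The embedding network itself, the 128-dimensional size, and the threshold value are all assumptions here:

import numpy as np

def verify(embedding_a, embedding_b, threshold=0.7):
    # Same person if the Euclidean distance between the embeddings is small
    return np.linalg.norm(embedding_a - embedding_b) < threshold

stored_template = np.random.rand(128)   # embedding of the one stored template photo
new_photo = np.random.rand(128)         # embedding of a newly captured photo
print(verify(stored_template, new_photo))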

Dimension Reduction
In machine learning classification problems, there are often too many factors on the basis of
which the final classification is done. These factors are basically variables called features. The
higher the number of features, the harder it gets to visualize the training set and then work on
it. Sometimes, most of these features are correlated, and hence redundant. This is where
dimensionality reduction algorithms come into play. Dimensionality reduction is the process of
reducing the number of random variables under consideration, by obtaining a set of principal
variables. It can be divided into feature selection and feature extraction.
Why is Dimensionality Reduction important in Machine Learning and Predictive Modeling?
An intuitive example of dimensionality reduction can be discussed through a simple e-mail
classification problem, where we need to classify whether the e-mail is spam or not. This can
involve a large number of features, such as whether or not the e-mail has a generic title, the
content of the e-mail, whether the e-mail uses a template, etc. However, some of these features
may overlap. Similarly, a classification problem that relies on both humidity and
rainfall can be collapsed into just one underlying feature, since both of them are correlated to a
high degree. Hence, we can reduce the number of features in such problems. A 3-D
classification problem can be hard to visualize, whereas a 2-D one can be mapped to a simple
2-dimensional space, and a 1-D problem to a simple line. The figure below illustrates this concept,
where a 3-D feature space is split into two 2-D feature spaces, and later, if found to be
correlated, the number of features can be reduced even further.

Figure 3.14: Dimensionality reduction


There are two components of dimensionality reduction:
 Feature selection: In this, we try to find a subset of the original set of variables, or
features, to get a smaller subset which can be used to model the problem. It usually
involves three ways:
1. Filter
2. Wrapper
3. Embedded
 Feature extraction: This reduces the data in a high dimensional space to a lower
dimension space, i.e., a space with a smaller number of dimensions.
Methods of Dimensionality Reduction: The various methods used for dimensionality reduction
include:
 Principal Component Analysis (PCA)
 Linear Discriminant Analysis (LDA)
 Generalized Discriminant Analysis (GDA)
Dimensionality reduction may be both linear and non-linear, depending upon the method used.
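A minimal PCA sketch as a feature-extraction example; scikit-learn and the 5-feature random data are assumptions used only for illustration:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)              # 100 samples with 5 (possibly correlated) features
pca = PCA(n_components=2)               # project onto the 2 principal components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # variance captured by each component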

Implementation of CNN: TensorFlow

This example demonstrates training a simple convolutional neural network (CNN) to classify
CIFAR images. The following steps can be used to implement a CNN with TensorFlow:

Step 1: Import TensorFlow

import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

Step 2: Download and prepare the CIFAR10 dataset


The CIFAR10 dataset contains 60,000 color images in 10 classes, with 6,000 images in each class.
The dataset is divided into 50,000 training images and 10,000 testing images. The classes are
mutually exclusive and there is no overlap between them.

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

# Normalize pixel values to be between 0 and 1


train_images, test_images = train_images / 255.0, test_images / 255.0

Step 3: Verify the data


class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

plt.figure(figsize=(10, 10))
for i in range(25):
    plt.subplot(5, 5, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    # The CIFAR labels happen to be arrays,
    # which is why you need the extra index
    plt.xlabel(class_names[train_labels[i][0]])
plt.show()

Step 4: Create the convolutional base


model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
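At this point the convolutional base still ends in a 3-D feature volume (4x4x64 for a 32x32 input), while Step 5 trains against 10 class labels. A likely completion, in line with the standard TensorFlow CIFAR-10 tutorial, flattens the volume and adds dense layers so the model outputs 10 class scores (logits):

model.add(layers.Flatten())                      # 4x4x64 -> 1024-element vector
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))                      # 10 logits, one per CIFAR-10 class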

Step 5: Compile and train the model


model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=10,
                    validation_data=(test_images, test_labels))

Step 6: Evaluate the model


plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0.5, 1])
plt.legend(loc='lower right')

test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)

Implementation of CNN: Keras


First, we will define the Convolutional neural networks architecture as follows:
1- The first hidden layer is a convolutional layer called a Convolution2D. We will use 32 filters
with size 5×5 each.
2- Then a Max pooling layer with a pool size of 2×2.
3- Another convolutional layer with 64 filters with size 5×5 each.
4- Then a Max pooling layer with a pool size of 2×2.
5- Then next is a Flatten layer that converts the 2D matrix data to a 1D vector before building
the fully connected layers.
6- After that we will use a fully connected layer with 1024 neurons and relu activation function.
7- Then we will use a regularization layer called Dropout. It is configured to randomly exclude
20% of neurons in the layer in order to reduce overfitting.
8- Finally, the output layer which has 10 neurons for the 10 classes and softmax activation
function to output probability-like predictions for each class.
After deciding the above, we can set up a neural network model with a few lines of code as
follows:

Step 1: Create a model


With Keras, we first create a new instance of a model object and then add layers to it one after the
other. This is called the sequential model API.
# Importing the required Keras modules containing model and layers
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Dropout, Flatten, MaxPooling2D

input_shape = (28, 28, 1)   # example input shape (28x28 grayscale images); adjust to your data

# Creating a Sequential Model and adding the layers
model = Sequential()
model.add(Conv2D(32, kernel_size=(5, 5), input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, kernel_size=(5, 5)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())   # Flattening the 2D arrays for fully connected layers
model.add(Dense(1024, activation=tf.nn.relu))
model.add(Dropout(0.2))
model.add(Dense(10, activation=tf.nn.softmax))

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

Once we execute the above code, Keras will build a TensorFlow model behind the scenes.

Step 2: Train the model


We can train the model by calling model.fit and passing in the training data and the expected
output. Keras will run the training process and print out the progress to the console. When
training completes, it will report the final accuracy that was achieved with the training data.
model.fit(x=x_train, y=y_train, epochs=10)

Step 3: Test the model


We can test the model by calling model.evaluate and passing in the testing data set and the
expected output.
test_error_rate = model.evaluate(x_test, y_test, verbose=0)
print("The error for the test data set is: {}".format(test_error_rate))

Step 4: Save and Load the model


Once we reach the optimum results we can save the model using model.save and pass in the
file name. This file will contain everything we need to use our model in another program.
model.save("trained_model.h5")

Your model will be saved in the Hierarchical Data Format (HDF) with .h5 extension. It contains
multidimensional arrays of scientific data.
We can load our previously trained model by calling the load_model function and passing in a
file name. Then we call the predict function and pass in the new data for predictions.

import keras
model = keras.models.load_model("trained_model.h5")
predictions = model.predict(new_data)
