CS601 - Machine Learning - Unit 3 - Notes
Flattening
Flattening is the process of converting the data into a 1-dimensional array for input to the next layer. We
flatten the output of the convolutional layers to create a single long feature vector, which is then
connected to the final classification model, called a fully-connected layer. In other
words, we put all the pixel data in one line and make connections with the final layer.
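As a rough illustration (not part of the original notes), the following NumPy sketch flattens a small, made-up feature-map tensor into a single vector:

import numpy as np

# A hypothetical pooled feature map: a 4 x 4 spatial grid with 8 channels.
feature_maps = np.random.rand(4, 4, 8)

# Flattening puts all the values in one line: a vector of 4*4*8 = 128 numbers
# that can be fed to the final fully-connected (classification) layer.
flat = feature_maps.reshape(-1)
print(flat.shape)  # (128,)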
Padding
Padding is a term relevant to convolutional neural networks: it refers to the number of pixels
added to an image when it is being processed by the kernel of a CNN. With zero padding, every
pixel that is added has a value of zero. For example, if the zero padding is set to one, a
one-pixel border of zeros is added around the image.
Stride
Stride is a component of convolutional neural networks, that is, neural networks tuned for
processing image and video data. Stride is a parameter of the network's filter that
controls how far the filter moves over the image or video at each step. For example, if a neural
network's stride is set to 1, the filter moves one pixel, or unit, at a time. Because the filter
movement affects the size of the encoded output volume, stride is normally set to a whole integer
rather than a fraction or decimal.
Figure 3.4: Stride
Imagine a convolutional neural network taking an image and analysing its content. If the
filter size is 3x3 pixels, the nine pixels it contains will be converted to one pixel in the output
layer. Naturally, as the stride, or movement, is increased, the resulting output will be smaller.
Stride works in conjunction with padding, the feature that adds blank or
empty pixels to the border of the image to limit the reduction in size of the output
layer. Roughly, padding is a way of increasing the size of an image to counteract the fact that stride
reduces it. Padding and stride are the foundational parameters of any convolutional
neural network.
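The interaction of stride and padding can be checked with the usual output-size formula. The sketch below is an illustrative calculation (the 7x7 input and 3x3 filter are assumed values, not taken from the notes):

def conv_output_size(input_size, filter_size, padding, stride):
    # Standard formula: output = (W - F + 2P) / S + 1
    return (input_size - filter_size + 2 * padding) // stride + 1

# Assumed example: a 7x7 input and a 3x3 filter.
print(conv_output_size(7, 3, padding=0, stride=1))  # 5 -> stride 1, no padding shrinks the map
print(conv_output_size(7, 3, padding=0, stride=2))  # 3 -> a larger stride shrinks it further
print(conv_output_size(7, 3, padding=1, stride=1))  # 7 -> padding of 1 preserves the size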
Convolution Layer
Convolution is the first layer to extract features from an input image. Convolution preserves the
relationship between pixels by learning image features using small squares of input data. It is a
mathematical operation that takes two inputs, such as an image matrix and a filter or kernel.
For example, convolving a 5 x 5 image matrix with a 3 x 3 filter matrix produces an output
called a “Feature Map”, as shown below:
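As an illustration of this operation, the sketch below convolves a 5 x 5 matrix with a 3 x 3 filter in plain NumPy (stride 1, no padding); the numeric values are made up for the example:

import numpy as np

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

# Slide the 3x3 filter over the 5x5 image (stride 1, no padding):
# each position contributes one element of the 3x3 feature map.
feature_map = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        feature_map[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)
print(feature_map)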
Pooling Layer
Pooling layers reduce the number of parameters when the images are too large.
Spatial pooling, also called subsampling or downsampling, reduces the dimensionality of
each feature map while retaining the important information. Spatial pooling can be of different types:
Max Pooling
Average Pooling
Sum Pooling
Max pooling takes the largest element from the rectified feature map. Average pooling instead
takes the average of the elements, and taking the sum of all elements in the feature map is called
sum pooling.
Figure 3.8: Pooling layer
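A minimal sketch of max pooling and average pooling over a made-up 4 x 4 feature map (the values and the 2x2 window are assumptions for illustration):

import numpy as np

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 1],
                        [3, 4, 6, 8]])

# 2x2 pooling with stride 2: each window is reduced to a single value.
max_pooled = np.zeros((2, 2))
avg_pooled = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        window = feature_map[2*i:2*i+2, 2*j:2*j+2]
        max_pooled[i, j] = window.max()   # max pooling keeps the largest element
        avg_pooled[i, j] = window.mean()  # average pooling keeps the mean
print(max_pooled)  # [[6. 4.] [7. 9.]]
print(avg_pooled)  # [[3.75 2.25] [4.   6.  ]]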
Loss Layer
In the context of an optimization algorithm, the function used to evaluate a candidate solution
(i.e., a set of weights) is referred to as the objective function.
We may seek to maximize or minimize the objective function, meaning that we are searching for
a candidate solution that has the highest or lowest score respectively.
Typically, with neural networks, we seek to minimize the error. As such, the objective function is
often referred to as a cost function or a loss function and the value calculated by the loss
function is referred to as simply “loss.”
The function we want to minimize or maximize is called the objective function or criterion. When
we are minimizing it, we may also call it the cost function, loss function, or error function.
The cost or loss function has an important job in that it must faithfully distill all aspects of the
model down into a single number in such a way that improvements in that number are a sign of
a better model.
The cost function reduces all the various good and bad aspects of a possibly complex system
down to a single number, a scalar value, which allows candidate solutions to be ranked and
compared.
In calculating the error of the model during the optimization process, a loss function must be
chosen.
This can be a challenging problem as the function must capture the properties of the problem
and be motivated by concerns that are important to the project and stakeholders.
It is important, therefore, that the function faithfully represent our design goals. If we choose a
poor error function and obtain unsatisfactory results, the fault is ours for badly specifying the
goal of the search.
In the layer we call the FC (fully-connected) layer, we flatten our matrix into a vector and feed it
into a fully connected layer, like a regular neural network.
Figure 3.9: Working of Loss Layer
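As a concrete illustration of how a loss function distills model quality into a single scalar, the sketch below computes a mean squared error by hand; the targets and predictions are invented for the example:

import numpy as np

# Invented targets and model predictions for three samples.
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])

# Mean squared error: one scalar value that the optimizer tries to minimize,
# so that candidate weight settings can be ranked and compared.
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.07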
Dense Layer
Dense layer is the regular deeply connected neural network layer. It is most common and
frequently used layer. Dense layer does the below operation on the input and return the
output.
output = activation (dot (input, kernel) + bias)
where,
Input represents the input data
Kernel represents the weight data
Dot represents the numpy dot product of all inputs and their corresponding weights
Bias represents a bias value used in machine learning to optimize the model
Activation represents the activation function
The output shape of the Dense layer is determined by the number of neurons/units specified
in the Dense layer. For example, if the input shape is (8,) and the number of units is 16, then the
output shape is (16,). All layers have the batch size as the first dimension, so the input shape
is represented as (None, 8) and the output shape as (None, 16). Here the batch size is
None because it is not yet set; the batch size is usually set during the training phase.
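A small Keras sketch of the shape behaviour described above (input shape (8,), 16 units); the layer choice is purely illustrative:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(8,)),            # batch dimension is left as None
    layers.Dense(16, activation="relu")  # output = activation(dot(input, kernel) + bias)
])
model.summary()  # the Dense layer's output shape is reported as (None, 16)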
1 x 1 Convolution
A 1 x 1 convolution is used to reduce the number of channels while introducing non-linearity.
A 1X1 convolution simply means the filter is of size 1X1 (yes, that means a single number, as
opposed to a matrix like, say, a 3X3 filter). This 1X1 filter convolves over the ENTIRE input image,
pixel by pixel.
Staying with our example input of 64X64X3, if we choose a 1X1 filter (which would be 1X1X3),
then the output will have the same Height and Width as the input but only one channel:
64X64X1
Now consider inputs with a large number of channels, 192 for example. If we want to reduce
the depth but keep the Height X Width of the feature maps (the receptive field) the same, we
can choose 1X1 filters (remember: the number of filters equals the number of output channels) to
achieve this. This effect of cross-channel down-sampling is called ‘dimensionality reduction’.
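A short Keras sketch of this cross-channel down-sampling: a 1X1 convolution that reduces 192 channels to 32 (the 64X64 input size and the choice of 32 filters are assumptions for illustration):

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 64, 64, 192))   # one image, 64x64, 192 channels
conv1x1 = layers.Conv2D(filters=32, kernel_size=1, activation="relu")
y = conv1x1(x)
print(y.shape)  # (1, 64, 64, 32): Height and Width preserved, depth reduced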
Input Channels
Channels come from "media". Looking at broadcast technology behind TVs you have multiple
channels for different information that gets broadcasted to your TV. For example, an image
might consist of only three channels that contain information on how much Red, Green or Blue
each pixel in an image is. Mapping this to a CNN you would have an RGB image with three
channels. An image, however, can be interpreted in different ways as well. For example, you
could take from an image information about how cyan, magenta, yellow, or black each pixel is.
This would mean your CMYK image would be analysed through four channels (each colour being one
channel).
In a grayscale image, the data is a matrix of dimensions w×h, where w is the width of the image
and h is its height. In a color image, we normally have 3 channels: red, green, and blue; this
way, a color image can be represented as a matrix of dimensions w×h×c, where c is the number
of channels, that is, 3.
A convolution layer receives the image (w×h×c) as input and generates as output an activation
map of dimensions w′×h′×c′. The number of input channels in the convolution is c, while the
number of output channels is c′. The filter for such a convolution is a tensor of dimensions
f×f×c×c′, where f is the filter size (normally 3 or 5).
This way, the number of channels is the depth of the matrices involved in the convolutions.
A convolution operation also defines the change in that depth by specifying the input and output
channels. These explanations extend directly to 1D or 3D signals, but the analogy with image
channels makes 2D signals the most natural example.
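The sketch below checks those dimensions with a Keras Conv2D layer: a 3-channel input, 16 output channels, and a 3 x 3 filter, so the kernel is a tensor of shape f×f×c×c′ (the concrete sizes are assumed for illustration):

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 32, 32, 3))   # a w x h x c input with c = 3 channels
conv = layers.Conv2D(filters=16, kernel_size=3, padding="same")
y = conv(x)
print(y.shape)            # (1, 32, 32, 16): c' = 16 output channels
print(conv.kernel.shape)  # (3, 3, 3, 16): f x f x c x c'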
Transfer learning
Transfer learning is the idea of overcoming the isolated learning paradigms and utilizing the
knowledge acquired for one task to solve related ones, as applied to machine learning, and in
particular, to the domain of deep learning.
Why transfer learning?
Many deep neural networks trained on natural images exhibit a curious phenomenon in
common: on the first layer they learn features similar to Gabor filters and color blobs. Such first-
layer features appear not too specific to a particular dataset or task but are general in that they
are applicable to many datasets and tasks. As finding these standard features on the first layer
seems to occur regardless of the exact cost function and natural image dataset, we call these
first-layer features general. For example, in a network with an N-dimensional softmax output
layer that has been successfully trained towards a supervised classification objective, each
output unit will be specific to a particular class. We thus call the last-layer features specific.
In transfer learning we first train a base network on a base dataset and task, and then we
repurpose the learned features, or transfer them, to a second target network to be trained on a
target dataset and task. This process will tend to work if the features are general, that is,
suitable to both base and target tasks, instead of being specific to the base task.
In practice, very few people train an entire Convolutional Network from scratch because it is
relatively rare to have a dataset of sufficient size. Instead, it is common to pre-train a ConvNet
on a very large dataset (e.g. ImageNet, which contains 1.2 million images with 1000 categories),
and then use the ConvNet either as an initialization or a fixed feature extractor for the task of
interest.
There are many strategies to follow for the transfer learning process in deep learning. A
widely used strategy is to:
Load the weight matrices of a pre-trained model, except for the weights of the very last
layers near the output,
Hold those weights fixed, i.e., untrainable,
Attach new layers suitable for the task at hand, and train the model with the new data (see the code sketch below)
Figure 3.13: The transfer learning strategy for deep learning networks
This way, we don’t have to train the whole model; we get to repurpose the model for our
specific machine learning task yet can leverage the learned structures and patterns of the data,
contained in the fixed weights, which are loaded from the pre-trained, optimized model.
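A minimal Keras sketch of this strategy, i.e., loading a pre-trained model without its top layers, freezing its weights, and attaching a new head; the choice of MobileNetV2, the input size, and the 10-class head are assumptions for illustration, not part of the notes:

import tensorflow as tf
from tensorflow.keras import layers

# Load a model pre-trained on ImageNet, without its final classification layers.
base = tf.keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                         include_top=False,
                                         weights="imagenet")
base.trainable = False  # hold the pre-trained weights fixed, i.e., untrainable

# Attach new layers suitable for the task at hand; only these are trained.
model = tf.keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])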
Dimension Reduction
In machine learning classification problems, there are often too many factors on the basis of
which the final classification is done. These factors are basically variables called features. The
higher the number of features, the harder it gets to visualize the training set and then work on
it. Sometimes, most of these features are correlated, and hence redundant. This is where
dimensionality reduction algorithms come into play. Dimensionality reduction is the process of
reducing the number of random variables under consideration, by obtaining a set of principal
variables. It can be divided into feature selection and feature extraction.
Why is Dimensionality Reduction important in Machine Learning and Predictive Modeling?
An intuitive example of dimensionality reduction can be discussed through a simple e-mail
classification problem, where we need to classify whether the e-mail is spam or not. This can
involve a large number of features, such as whether or not the e-mail has a generic title, the
content of the e-mail, whether the e-mail uses a template, etc. However, some of these features
may overlap. In another case, a classification problem that relies on both humidity and
rainfall can be collapsed into just one underlying feature, since the two are correlated to a
high degree. Hence, we can reduce the number of features in such problems. A 3-D
classification problem can be hard to visualize, whereas a 2-D one can be mapped to a simple 2-
dimensional space, and a 1-D problem to a simple line. The below figure illustrates this concept,
where a 3-D feature space is split into two 2-D feature spaces, and later, if found to be
correlated, the number of features can be reduced even further.
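As a brief illustration of feature extraction (one of the two approaches mentioned above), the sketch below uses scikit-learn's PCA to project made-up 3-D data onto 2 principal components; the data and the choice of library are assumptions for illustration:

import numpy as np
from sklearn.decomposition import PCA

# Made-up data: 100 samples with 3 features, two of which are strongly correlated
# (much like the humidity and rainfall example above).
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, 2 * x1 + rng.normal(scale=0.1, size=100), rng.normal(size=100)])

pca = PCA(n_components=2)   # keep 2 principal variables
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # most of the variance is captured by 2 components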
This example demonstrates training a simple convolutional neural network (CNN) to classify
CIFAR images. The following steps can be used to implement a CNN with TensorFlow:
import tensorflow as tf
from tensorflow.keras import datasets
import matplotlib.pyplot as plt

# Load the CIFAR-10 training and test images with their labels.
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

# Human-readable names for the 10 CIFAR-10 classes.
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

# Plot the first 25 training images with their class names.
plt.figure(figsize=(10, 10))
for i in range(25):
    plt.subplot(5, 5, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    # The CIFAR labels happen to be arrays,
    # which is why you need the extra index.
    plt.xlabel(class_names[train_labels[i][0]])
plt.show()
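The snippet above only loads and plots the data; a hedged sketch of the model-building, training, and saving step that the following paragraph refers to (the architecture and the number of epochs are assumptions, and the train/test variables come from the snippet above) might look like this:

import tensorflow as tf
from tensorflow.keras import layers, models

# A simple CNN for the 32x32x3 CIFAR-10 images (illustrative architecture).
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
model.fit(train_images / 255.0, train_labels, epochs=10,
          validation_data=(test_images / 255.0, test_labels))
model.save("trained_model.h5")  # saved in HDF5 format with a .h5 extension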
Once we execute the above code, Keras builds a TensorFlow model behind the scenes.
The model is saved in the Hierarchical Data Format (HDF5) with a .h5 extension; this format
stores multidimensional arrays of scientific data.
We can load our previously trained model by calling the load_model function and passing in a
file name. Then we call the predict function and pass in the new data for predictions.
from tensorflow import keras
model = keras.models.load_model("trained_model.h5")
predictions = model.predict(new_data)