IV Ai & Ds Al3451 ML Unit4
Anna University Regulation: 2021
AL3451 – Machine Learning
II Year / IV Semester
NOTES
Prepared by,
Mrs. N. Nancy Chitra Thilaga, AP/ECE
AL3451_ML
4931_Grace College of Engineering, Thoothukudi
The perceptron is the basic processing element. It has inputs that may come from the
environment or may be the outputs of other perceptrons. Associated with each input,
xj ∈ R, j = 1, . . . , d, is a connection weight, or synaptic weight, wj ∈ R, and
the output, y, in the simplest case is a weighted sum of the inputs:
$$y = \sum_{j=1}^{d} w_j x_j + w_0$$
w0 is the intercept value to make the model more general; it is generally modeled as
the weight coming from an extra bias unit, x0, which is always +1. We can write the output
of the perceptron as a dot product
y = w^T x
where w = [w0, w1, . . . , wd]^T and x = [1, x1, . . . , xd]^T are augmented vectors that
also include the bias weight and input. During testing, with given weights, w, for input x, we
compute the output y. To implement a given task, we need to learn the weights w, the
parameters of the system, such that correct outputs are generated given the inputs.
When d = 1 and x is fed from the environment through an input unit, we have
y = wx + w0
which is the equation of a line with w as the slope and w0 as the intercept. Thus this
perceptron with one input and one output can be used to implement a linear fit. With more
than one input, the line becomes a (hyper)plane, and the perceptron with more than one input
can be used to implement multivariate linear fit.
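The weighted sum above can be sketched in a few lines of plain Python. This is a minimal illustration, not part of the original notes: the function name `perceptron_output` and the example weights are assumptions chosen for the demonstration.

```python
def perceptron_output(weights, inputs):
    """Return y = w^T x, where weights = [w0, w1, ..., wd] (w0 is the bias weight)."""
    augmented = [1.0] + list(inputs)  # prepend the bias unit x0 = +1
    if len(weights) != len(augmented):
        raise ValueError("expected d + 1 weights for d inputs")
    return sum(w * x for w, x in zip(weights, augmented))


# d = 1: a line with slope w = 2 and intercept w0 = 0.5
y_line = perceptron_output([0.5, 2.0], [3.0])

# d = 2: a plane with weights w1 = 2, w2 = -1 and intercept w0 = 1
y_plane = perceptron_output([1.0, 2.0, -1.0], [3.0, 4.0])
```

With one input the function evaluates a line; with two inputs the same code evaluates a plane, which is the multivariate linear fit described above.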
An artificial neural network (ANN) is a computational model inspired by the
structure and function of the human brain's interconnected network of neurons. It consists of
interconnected nodes called artificial neurons, organized into layers. Information flows through
the network, with each neuron processing input signals and producing an output signal that
influences other neurons in the network.
A multi-layer perceptron (MLP) is a type of artificial neural network consisting of
multiple layers of neurons. The neurons in the MLP typically use nonlinear activation
functions, allowing the network to learn complex patterns in data. MLPs are significant in
machine learning because they can learn nonlinear relationships in data, which makes them
powerful models.
A neural network consists of interconnected nodes, called neurons, organized into layers. Each
neuron receives input signals, performs a computation on them using an activation function,
and produces an output signal that may be passed to other neurons in the network.
An activation function determines the output of a neuron given its input. These functions
introduce nonlinearity into the network, enabling it to learn complex patterns in data.
The network is typically organized into layers, starting with the input layer, where data is
introduced, followed by hidden layers, where computations are performed, and finally the
output layer, where predictions or decisions are made.
Neurons in adjacent layers are connected by weighted connections, which transmit signals from
one layer to the next. The strength of these connections, represented by weights, determines
how much influence one neuron's output has on another neuron's input. During the training
process, the network learns to adjust its weights based on examples provided in a training
dataset. Additionally, each neuron typically has an associated bias, which allows the neuron to
adjust its output threshold.
The goal of training a neural network is to minimize a loss function, which measures the
discrepancy between the network's predictions and the target values, by adjusting the
weights and biases. The adjustments are guided by an optimization algorithm, such as gradient
descent.
Types of Neural Network
The ANN depicted on the right of the image is a simple neural network called the
‘perceptron’. It consists of a single layer of input neurons, each with its own weight; there
are no hidden layers. The perceptron algorithm learns the weights for the input signals in
order to draw a linear decision boundary.
There are several types of ANN, each designed for specific tasks and architectural
requirements. Let's briefly discuss some of the most common types before diving deeper into
MLPs.
Feedforward neural networks are the simplest form of ANNs, where information flows in one
direction, from input to output. There are no cycles or loops in the network architecture.
Multilayer perceptrons (MLP) are a type of feedforward neural network.
In recurrent neural networks (RNNs), connections between nodes form directed cycles, allowing
information to persist over time. This makes them suitable for tasks involving sequential data,
such as time series prediction, natural language processing, and speech recognition.
Convolutional neural networks (CNNs) are designed to effectively process grid-like data, such
as images. They consist of layers of convolutional filters that learn hierarchical
representations of features within the input data. CNNs are widely used in tasks like image
classification, object detection, and image segmentation.
Long Short-Term Memory Networks (LSTM) and Gated Recurrent Units (GRU)
Autoencoder
An autoencoder is designed for unsupervised learning and consists of an encoder network that
compresses the input data into a lower-dimensional latent space, and a decoder network that
reconstructs the original input from the latent representation. Autoencoders are often used for
tasks such as dimensionality reduction and denoising.
Generative adversarial networks (GANs) consist of two neural networks, a generator and a
discriminator, trained simultaneously in a competitive setting. The generator learns to generate
synthetic data samples that are indistinguishable from real data, while the discriminator learns
to distinguish between real and fake samples. GANs have been widely used for generating
realistic images, videos, and other types of data.
Multilayer Perceptrons
MLPs have been widely used in various fields, including image recognition, natural
language processing, and speech recognition, among others. Their flexibility in architecture
and ability to approximate any function under certain conditions make them a fundamental
building block in deep learning and neural network research. Let's take a deeper dive into some
of their key concepts.
Input layer
The input layer consists of nodes or neurons that receive the initial input data. Each neuron
represents a feature or dimension of the input data. The number of neurons in the input layer is
determined by the dimensionality of the input data.
Hidden layer
Between the input and output layers, there can be one or more layers of neurons. Each neuron
in a hidden layer receives inputs from all neurons in the previous layer (either the input layer
or another hidden layer) and produces an output that is passed to the next layer. The number of
hidden layers and the number of neurons in each hidden layer are hyperparameters that need to
be determined during the model design phase.
Output layer
This layer consists of neurons that produce the final output of the network. The number of
neurons in the output layer depends on the nature of the task. In binary classification, there
may be one or two neurons, depending on the activation function, with the output representing
the probability of belonging to a class. In multi-class classification tasks, the output layer
contains multiple neurons, typically one per class.
Weights
Neurons in adjacent layers are fully connected to each other. Each connection has an associated
weight, which determines the strength of the connection. These weights are learned during the
training process.
Bias Neurons
In addition to the input and hidden neurons, each layer (except the output layer) usually
includes a bias neuron that provides a constant input to the neurons in the next layer. The bias
neuron has its own weight associated with each connection, which is also learned during
training.
The bias neuron effectively shifts the activation function of the neurons in the
subsequent layer, allowing the network to learn an offset or bias in the decision boundary. By
adjusting the weights connected to the bias neuron, the MLP can learn to control the threshold
for activation and better fit the training data.
Note: In the context of MLPs, bias can refer to two related but distinct concepts: bias as a
general term in machine learning and the bias neuron (defined above). In general machine
learning, bias refers to the error introduced by approximating a real-world problem with a
simplified model. It measures how well the model can capture the underlying patterns in the
data. A high bias indicates that the model is too simplistic and may underfit the data, while a
low bias suggests that the model is capturing the underlying patterns well.
Activation Function
Typically, each neuron in the hidden layers and the output layer applies an activation function
to its weighted sum of inputs. Common activation functions include sigmoid, tanh, ReLU
(Rectified Linear Unit), and softmax. These functions introduce nonlinearity into the network,
allowing it to learn complex patterns in the data.
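As a rough illustration of the activation functions named above, the following sketch implements them with the Python standard library only. The function names are illustrative choices, and the numerically stable forms of sigmoid and softmax are an implementation detail added here, not something the notes prescribe.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def tanh(z):
    """Hyperbolic tangent, output in (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Rectified Linear Unit: the positive part of z."""
    return max(0.0, z)


def softmax(zs):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(zs)  # subtracting the maximum keeps exp() from overflowing
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]
```

Sigmoid, tanh, and ReLU act on one neuron's weighted sum at a time, while softmax acts on the whole vector of output-layer sums.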
MLPs are trained using the backpropagation algorithm, which computes gradients of a loss
function with respect to the model's parameters and updates the parameters iteratively to
minimize the loss.
Input layer
• The input layer of an MLP receives input data, which could be features
extracted from the input samples in a dataset. Each neuron in the input layer
represents one feature.
• Neurons in the input layer do not perform any computations; they simply pass
the input values to the neurons in the first hidden layer.
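Hidden layers
• Each neuron in a hidden layer computes a weighted sum of its inputs:
$$z = \sum_{i=1}^{n} w_i x_i$$
Where n is the total number of input connections, wi is the weight for the i-th input, and xi is
the i-th input value.
• The choice of activation function applied to this weighted sum depends
on the nature of the task and the desired properties of the network.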
Output layer
• The output layer of an MLP produces the final predictions or outputs of the
network. The number of neurons in the output layer depends on the task being
performed (e.g., binary classification, multi-class classification, regression).
• Each neuron in the output layer receives input from the neurons in the last
hidden layer and applies an activation function. This activation function is
usually different from those used in the hidden layers and produces the final
output value or prediction.
During the training process, the network learns to adjust the weights associated with each
neuron's inputs to minimize the discrepancy between the predicted outputs and the true target
values in the training data. By adjusting the weights, the network learns to approximate
complex patterns and relationships in the data, enabling it to make accurate predictions on
new, unseen samples.
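The flow from input layer through a hidden layer to the output layer can be sketched as follows. This is a minimal illustration under stated assumptions: a network with 2 inputs, 2 ReLU hidden neurons, and 1 sigmoid output neuron, with hand-picked (hypothetical) weights and biases rather than trained ones.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases, activation):
    """Fully connected layer: each neuron applies activation(sum_i w_i * x_i + b)."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs


def mlp_forward(x):
    # Hypothetical weights: 2 inputs -> 2 hidden neurons (ReLU) -> 1 output (sigmoid)
    hidden = dense_layer(x, [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0], relu)
    output = dense_layer(hidden, [[1.0, 1.0]], [-0.5], sigmoid)
    return hidden, output


hidden, output = mlp_forward([2.0, 0.5])
```

The input layer performs no computation here; it is simply the list `x` handed to the first hidden layer, as described above.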
2. Iterative Optimization: The aim of this step is to find the minimum of a loss
function by iteratively moving in the direction of the steepest decrease in the
function's value.
For each iteration (or epoch) of training:
• Shuffle the training data to ensure that the model doesn't learn from the same
patterns in the same order every time.
• Split the training data into mini-batches (small subsets of data).
• For each mini-batch:
  • Compute the gradient of the loss function with respect to the model
  parameters using only the data points in the mini-batch. This gradient
  estimation is a stochastic approximation of the true gradient.
$$\theta_{t+1} = \theta_t - \eta \nabla J(\theta_t)$$
Where θt denotes the model parameters at iteration t, η is the learning rate, and ∇J(θt) is
the gradient of the loss function J with respect to the parameters.
4. Learning Rate: The step size taken in each iteration of gradient descent is
determined by a parameter called the learning rate, denoted above as η. This
parameter controls the size of the steps taken towards the minimum. If the
learning rate is too small, convergence may be slow; if it is too large, the
algorithm may oscillate or diverge.
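The shuffle, split, and update loop described above can be sketched on a deliberately small problem. The example below is an assumption-laden illustration: it fits a line y ≈ w·x + w0 to noise-free data using the mean squared error, and the function name, learning rate, batch size, and data are all chosen only for the demonstration.

```python
import random


def sgd_linear_fit(xs, ys, learning_rate=0.05, epochs=500, batch_size=4, seed=0):
    """Fit y ~ w*x + w0 by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, w0 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # a new sample order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w*x + w0 - y)^2), estimated on this mini-batch only
            grad_w = sum(2 * (w * xs[i] + w0 - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_w0 = sum(2 * (w * xs[i] + w0 - ys[i]) for i in batch) / len(batch)
            # Update rule: theta <- theta - learning_rate * gradient
            w -= learning_rate * grad_w
            w0 -= learning_rate * grad_w0
    return w, w0


xs = [i / 10 for i in range(20)]   # inputs 0.0, 0.1, ..., 1.9
ys = [3.0 * x + 1.0 for x in xs]   # targets from the line y = 3x + 1
w, w0 = sgd_linear_fit(xs, ys)
```

Because the data here contain no noise, the estimates approach w = 3 and w0 = 1; with noisy data the parameters would instead fluctuate around the minimum.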
SGD also has some challenges, such as increased noise due to the stochastic
nature of the gradient estimation and the need to tune hyperparameters like the learning rate.
Various extensions and adaptations of SGD, such as mini-batch stochastic gradient descent,
momentum, and adaptive learning rate methods like AdaGrad, RMSProp, and Adam, have
been developed to address these challenges and improve convergence and performance.
You have seen the working of the multilayer perceptron layers and learned about stochastic
gradient descent; to put it all together, there is one last topic to dive into: backpropagation.
Backpropagation
Rather than computing the gradients using
the entire training dataset (which can be computationally expensive for large datasets), SGD
computes the gradients using small random subsets of the data called mini-batches. Here’s an
overview of how backpropagation supplies those gradients:
1. Forward pass: During the forward pass, input data is fed into the neural
network, and the network's output is computed layer by layer. Each neuron
computes a weighted sum of its inputs, applies an activation function to the
result, and passes the output to the neurons in the next layer.
2. Loss computation: After the forward pass, the network's output is compared
to the true target values, and a loss function is computed to measure the
discrepancy between the predicted output and the actual output.
3. Backward pass: The gradients of the loss function with respect to the network's
parameters are computed by propagating the error backward through the
network, from the output layer towards the input layer. These gradients represent
the rate of change of the loss function with respect to each parameter and
provide information about how to adjust the parameters to decrease the loss.
4. Parameter update: Once the gradients have been computed, the network's
parameters are updated in the opposite direction of the gradients in order to
minimize the loss function. This update is typically performed using an
optimization algorithm such as stochastic gradient descent (SGD), which we
discussed earlier.
5. Iterative Process: Steps 1-4 are repeated iteratively for a fixed number of
epochs or until convergence criteria are met. During each iteration, the
network's parameters are adjusted based on the gradients computed in the
backward pass, gradually reducing the loss and improving the model's
performance.
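The steps above can be traced on a very small network. The sketch below is only an illustration under stated assumptions: one input, two sigmoid hidden neurons, one linear output neuron, a squared-error loss, a single training example, and hypothetical starting weights. The gradients follow from the chain rule applied layer by layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """1 input -> 2 sigmoid hidden neurons -> 1 linear output neuron."""
    hidden = [sigmoid(params["w1"][k] * x + params["b1"][k]) for k in range(2)]
    y_hat = sum(params["w2"][k] * hidden[k] for k in range(2)) + params["b2"]
    return hidden, y_hat


def loss(params, x, y):
    _, y_hat = forward(params, x)
    return 0.5 * (y_hat - y) ** 2


def backward(params, x, y):
    """Gradients of the loss, computed from the output back towards the input."""
    hidden, y_hat = forward(params, x)      # step 1: forward pass
    d_out = y_hat - y                       # step 2: derivative of the loss w.r.t. y_hat
    grads = {"w1": [0.0, 0.0], "b1": [0.0, 0.0], "w2": [0.0, 0.0], "b2": d_out}
    for k in range(2):                      # step 3: backward pass (chain rule)
        grads["w2"][k] = d_out * hidden[k]
        d_hidden = d_out * params["w2"][k] * hidden[k] * (1.0 - hidden[k])
        grads["w1"][k] = d_hidden * x
        grads["b1"][k] = d_hidden
    return grads


def sgd_step(params, grads, learning_rate):
    """Step 4: move every parameter against its gradient."""
    return {
        "w1": [w - learning_rate * g for w, g in zip(params["w1"], grads["w1"])],
        "b1": [b - learning_rate * g for b, g in zip(params["b1"], grads["b1"])],
        "w2": [w - learning_rate * g for w, g in zip(params["w2"], grads["w2"])],
        "b2": params["b2"] - learning_rate * grads["b2"],
    }


# Hypothetical starting weights and a single training example (x, y)
params = {"w1": [0.5, -0.3], "b1": [0.1, 0.2], "w2": [0.4, -0.6], "b2": 0.0}
x, y = 1.5, 1.0
initial_loss = loss(params, x, y)
for _ in range(50):                         # step 5: repeat steps 1-4
    params = sgd_step(params, backward(params, x, y), learning_rate=0.1)
final_loss = loss(params, x, y)
```

A useful sanity check for any hand-written backward pass is to compare one of its gradients with a finite-difference estimate of the loss.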
5.7 Backpropagation Algorithm Part 1 in Tamil (youtube.com)
5.8 Backpropagation Algorithm Part 2 in Tamil (youtube.com)
Basic Gradient Descent Algorithm
The gradient descent algorithm is an approximate and iterative method for mathematical
optimization. You can use it to approach the minimum of any differentiable function.
Note: There are many optimization methods and subfields of mathematical programming. If
you want to learn how to use some of them with Python, then check out Scientific Python:
Using SciPy for Optimization and Hands-On Linear Programming: Optimization With Python.
Although gradient descent sometimes gets stuck in a local minimum or a saddle point instead
of finding the global minimum, it’s widely used in practice. Data science and machine
learning methods often apply it internally to optimize model parameters. For example, neural
networks find weights and biases with gradient descent.
In a regression problem, you typically have the vectors of input variables 𝐱 = (𝑥₁, …, 𝑥ᵣ) and
the actual outputs 𝑦. You want to find a model that maps 𝐱 to a predicted response 𝑓(𝐱) so that
𝑓(𝐱) is as close as possible to 𝑦. For example, you might want to predict an output such as a
person’s salary given inputs like the person’s number of years at the company or level of
education.
Your goal is to minimize the difference between the prediction 𝑓(𝐱) and the actual data 𝑦. This
difference is called the residual.
In this type of problem, you want to minimize the sum of squared residuals (SSR), where SSR
= Σᵢ(𝑦ᵢ − 𝑓(𝐱ᵢ))² for all observations 𝑖 = 1, …, 𝑛, where 𝑛 is the total number of observations.
Alternatively, you could use the mean squared error (MSE = SSR / 𝑛) instead of SSR.
Both SSR and MSE use the square of the difference between the actual and predicted outputs.
The lower the difference, the more accurate the prediction. A difference of zero indicates that
the prediction is equal to the actual data.
SSR or MSE is minimized by adjusting the model parameters. For example, in linear
regression, you want to find the function 𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ, so you need to determine
the weights 𝑏₀, 𝑏₁, …, 𝑏ᵣ that minimize SSR or MSE.
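As a small worked example of these definitions, the sketch below computes SSR and MSE for a one-variable linear model. The helper names and the three data points are made up for illustration.

```python
def ssr(actual, predicted):
    """Sum of squared residuals: sum over i of (y_i - f(x_i))^2."""
    return sum((y - f) ** 2 for y, f in zip(actual, predicted))


def mse(actual, predicted):
    """Mean squared error: SSR divided by the number of observations."""
    return ssr(actual, predicted) / len(actual)


def linear_model(b, x):
    """f(x) = b0 + b1*x1 + ... + br*xr, with b = [b0, b1, ..., br]."""
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))


inputs = [[1.0], [2.0], [3.0]]
actual = [3.0, 5.0, 8.0]
# Weights b0 = 1 and b1 = 2 give the predictions 3, 5 and 7
predicted = [linear_model([1.0, 2.0], x) for x in inputs]
total_ssr = ssr(actual, predicted)
mean_error = mse(actual, predicted)
```

Only the third observation has a nonzero residual (8 − 7 = 1), so SSR is 1 and MSE is 1/3.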
In a classification problem, the outputs 𝑦 are categorical, often either 0 or 1. For example, you
might try to predict whether an email is spam or not. In the case of binary outputs, it’s
convenient to minimize the cross-entropy function that also depends on the actual outputs 𝑦ᵢ
and the corresponding predictions 𝑝(𝐱ᵢ):
$$H = -\sum_{i} \left( y_i \log p(\mathbf{x}_i) + (1 - y_i) \log\left(1 - p(\mathbf{x}_i)\right) \right)$$
In logistic regression, which is often used to solve classification problems, the functions 𝑝(𝐱)
and 𝑓(𝐱) are defined as the following:
$$p(\mathbf{x}) = \frac{1}{1 + \exp(-f(\mathbf{x}))}, \qquad f(\mathbf{x}) = b_0 + b_1 x_1 + \cdots + b_r x_r$$
Again, you need to find the weights 𝑏₀, 𝑏₁, …, 𝑏ᵣ, but this time they should minimize the cross-
entropy function.
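To make the cross-entropy objective concrete, the following sketch evaluates it for a logistic regression model with one input variable. It assumes the standard binary cross-entropy and logistic function; the function names, data, and the two weight vectors being compared are illustrative.

```python
import math


def linear(b, x):
    """f(x) = b0 + b1*x1 + ... + br*xr."""
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))


def predict_proba(b, x):
    """p(x) = 1 / (1 + exp(-f(x)))."""
    return 1.0 / (1.0 + math.exp(-linear(b, x)))


def cross_entropy(b, xs, ys):
    """Binary cross-entropy summed over all observations."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict_proba(b, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


xs = [[-2.0], [-1.0], [1.0], [2.0]]
ys = [0, 0, 1, 1]
loss_uninformed = cross_entropy([0.0, 0.0], xs, ys)  # every prediction is 0.5
loss_better = cross_entropy([0.0, 2.0], xs, ys)      # positive slope separates the classes
```

Weights that push 𝑝(𝐱ᵢ) towards the actual labels give a lower cross-entropy, which is why minimizing it yields a useful classifier.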
The gradient of a function 𝐶 of several independent variables 𝑣₁, …, 𝑣ᵣ is denoted with ∇𝐶(𝑣₁,
…, 𝑣ᵣ) and defined as the vector function of the partial derivatives of 𝐶 with respect to each
independent variable: ∇𝐶 = (∂𝐶/∂𝑣₁, …, ∂𝐶/∂𝑣ᵣ). The symbol ∇ is called nabla.
The nonzero value of the gradient of a function 𝐶 at a given point defines the direction and rate
of the fastest increase of 𝐶. When working with gradient descent, you’re interested in the
direction of the fastest decrease in the cost function. This direction is determined by the
negative gradient, −∇𝐶.
The idea behind gradient descent is as follows: you start with an arbitrarily chosen position of
the point or vector 𝐯 = (𝑣₁, …, 𝑣ᵣ) and move it iteratively in the direction of the fastest decrease
of the cost function. As mentioned, this is the direction of the negative gradient vector, −∇𝐶.
Once you have a random starting point 𝐯 = (𝑣₁, …, 𝑣ᵣ), you update it, or move it to a new
position in the direction of the negative gradient: 𝐯 → 𝐯 − 𝜂∇𝐶, where 𝜂 (pronounced “ee-tah”)
is a small positive value called the learning rate.
The learning rate determines how large the update or moving step is. It’s a very important
parameter. If 𝜂 is too small, then the algorithm might converge very slowly. Large 𝜂 values can
also cause issues with convergence or make the algorithm divergent.
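The update 𝐯 → 𝐯 − 𝜂∇𝐶 and the effect of the learning rate can be tried out on a one-variable cost function. The sketch below assumes 𝐶(𝑣) = 𝑣², whose gradient is 2𝑣; the function name, stopping tolerance, and the three learning rates are illustrative choices.

```python
def gradient_descent(gradient, start, learning_rate, n_iter=50, tolerance=1e-6):
    """Repeat v <- v - learning_rate * gradient(v) until the step becomes tiny."""
    v = start
    for _ in range(n_iter):
        step = learning_rate * gradient(v)
        if abs(step) < tolerance:
            break
        v -= step
    return v


# C(v) = v^2 has gradient 2v and its minimum at v = 0
v_good = gradient_descent(lambda v: 2 * v, start=10.0, learning_rate=0.2)
v_slow = gradient_descent(lambda v: 2 * v, start=10.0, learning_rate=0.005)
v_diverged = gradient_descent(lambda v: 2 * v, start=10.0, learning_rate=1.1)
```

With a moderate learning rate the point settles very close to 0. With a very small one it is still far from the minimum after 50 iterations, and with a learning rate above 1 each step overshoots so that the point moves further away.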
Gradient Descent is an iterative optimization process that searches for an objective function’s
optimum value (minimum or maximum). It is one of the most used methods for changing a
model’s parameters in order to reduce a cost function in machine learning projects.
The primary goal of gradient descent is to identify the model parameters that provide the
maximum accuracy on both training and test datasets. In gradient descent, the gradient is a
vector pointing in the direction of the function’s steepest rise at a particular point.
By moving in the opposite direction of the gradient, the algorithm gradually descends towards
lower values of the function until it reaches a minimum.
In SGD, since only one sample from the dataset is chosen at random for each iteration, the
path taken by the algorithm to reach the minimum is usually noisier than that of standard
gradient descent. The noisy path matters little, however, as long as the algorithm reaches the
minimum, and it does so with a significantly shorter training time.
Stability: Stochastic gradient descent is less stable, as it may oscillate around the optimal
solution, whereas batch gradient descent is more stable, as it converges smoothly towards the
optimum.
In deep learning, the optimization process plays a crucial role in training
neural networks. Gradient descent, a fundamental optimization algorithm, can sometimes
encounter two common issues: vanishing gradients and exploding gradients. This section
looks at what these problems are, why they occur, and how to mitigate them.
What is Vanishing Gradient?
The vanishing gradient problem is a challenge that emerges during backpropagation
when the derivatives or slopes of the activation functions become progressively smaller as
we move backward through the layers of a neural network. This phenomenon is particularly
prominent in deep networks with many layers, hindering the effective training of the model.
When the weight updates become extremely small, or even exponentially small, training
time is significantly prolonged, and in the worst-case scenario training halts
altogether.
Why the Problem Occurs?
During backpropagation, as the gradients propagate back through the layers of the network, they
decrease significantly. This means that as they travel from the output layer back to the input
layer, the gradients become progressively smaller. As a result, the weights of the initial
layers, which receive these small gradients, are updated only slightly or not at all at each
iteration of the optimization process.
The vanishing gradient problem is particularly associated with the sigmoid and
hyperbolic tangent (tanh) activation functions because their derivatives fall within the range
of 0 to 0.25 and 0 to 1, respectively. Consequently, the weight updates become very small,
so the updated weights closely resemble the original ones. This persistence of small
updates contributes to the vanishing gradient issue.
The sigmoid and tanh functions squash their inputs into the ranges (0, 1) and (-1, 1), so
they saturate near 0 or 1 for sigmoid and near -1 or 1 for tanh. In these saturated regions,
where the inputs are very negative or very positive, the derivatives are very close to zero,
resulting in little update to the weights of the previous layer. In shallow networks with a few
layers this does not pose much of a problem, but as more layers are added, these small
gradients, which multiply between layers, decay significantly. Consequently, the first layers
learn very slowly, which hinders overall model performance and
can lead to convergence failure.
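The multiplicative decay described above can be illustrated numerically. The sketch below makes a strong simplifying assumption: a chain of layers with one sigmoid neuron each and every weight equal to 1, so that the gradient reaching the first layer is scaled by the product of the sigmoid derivatives along the way. The function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # largest possible value is 0.25, reached at z = 0


def gradient_factor(num_layers, z=0.0, weight=1.0):
    """Product of (weight * sigmoid'(z)) over num_layers layers."""
    factor = 1.0
    for _ in range(num_layers):
        factor *= weight * sigmoid_derivative(z)
    return factor


# Best case for sigmoid: every neuron sits at z = 0, where the derivative is 0.25
best_case = {n: gradient_factor(n) for n in (1, 5, 10, 20)}
# Saturated case: every neuron sits at z = 5, where the derivative is much smaller
saturated = gradient_factor(5, z=5.0)
```

Even in the best case the factor shrinks as 0.25 to the power of the number of layers, and saturation makes it shrink faster still.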
How can we identify?
Identifying the vanishing gradient problem typically involves monitoring the training
dynamics of a deep neural network.
• One key indicator is observing model weights converging to 0 or stagnation in
the improvement of the model’s performance metrics over training epochs.
• During training, if the loss function fails to decrease significantly, or if there is
erratic behavior in the learning curves, it suggests that the gradients may be
vanishing.
• Additionally, examining the gradients themselves during backpropagation can
provide insights. Visualization techniques, such as gradient histograms or
norms, can aid in assessing the distribution of gradients throughout the network.
How can we solve the issue?
• Batch Normalization: Batch normalization normalizes the inputs of each layer,
reducing internal covariate shift. This can help stabilize and accelerate the training
process, allowing for more consistent gradient flow.
• Activation function: An activation function such as the Rectified Linear Unit
(ReLU) can be used. With ReLU, the gradient is 0 for negative and zero input,
and it is 1 for positive input, which helps alleviate the vanishing gradient issue.
ReLU replaces negative input values with 0 and leaves positive input values
unchanged.
• Skip Connections and Residual Networks (ResNets): Skip connections, as
seen in ResNets, allow the gradient to bypass certain layers during
backpropagation.
S. No. | Difference Between | Neural Networks | Deep Learning Systems
1. | Definition | A neural network is a model of neurons inspired by the human brain. It is made up of many neurons that are inter-connected with each other. | Deep learning neural networks are distinguished from neural networks on the basis of their depth or number of hidden layers.
2. | Architecture | Feed Forward Neural Networks; Recurrent Neural Networks; Symmetrically Connected Neural Networks | Recursive Neural Networks; Unsupervised Pre-trained Networks; Convolutional Neural Networks
3. | Structure | Neurons; Connection and weights; Propagation function; Learning rate | Motherboards; PSU; RAM; Processors
4. | Time & Accuracy | It generally takes less time to train them. They have lower accuracy than Deep Learning Systems. | It generally takes more time to train them. They have higher accuracy than Neural Networks.
ReLU
Activation functions in neural networks and deep learning play a significant role in
activating the hidden nodes to produce a more desirable output. The main purpose of
the activation function is to introduce the property of nonlinearity into the model.
The rectified linear unit (ReLU) is an activation function defined as
the positive part of its argument. It is one of the most popular activation functions in deep
learning.
In artificial neural networks, the activation function of a node defines the output of that
node given an input or set of inputs. A standard integrated circuit can be seen as a digital
network of activation functions that can be “ON” or “OFF,” depending on the input.
An example of the sigmoid activation function. | Image: Wikipedia
An example of a tanh linear graph. | Image: Wikipedia
Sigmoid and tanh are monotonic and differentiable, and were previously the more popular
activation functions. However, these functions saturate, and this leads to problems
with vanishing gradients. An alternative, and the most popular activation function to
overcome this issue, is the Rectified Linear Unit (ReLU).
In the figure, the blue line is the ReLU function,
whereas the green line is a variant of ReLU called Softplus. The other variants of ReLU include
leaky ReLU, exponential linear unit (ELU) and Sigmoid linear unit (SiLU), etc.
Example of a ReLU activation function (blue) and Softplus (green). | Image: Wikipedia
In this article, we’ll only look at the rectified linear unit (ReLU) because it’s still the most used
activation function by default for performing a majority of deep learning tasks. Its variants are
typically used for specific purposes in which they might have a slight edge over the ReLU.
This activation function was first introduced to a dynamical network by Hahnloser et al. in
2000, with strong biological motivations and mathematical justifications. It was demonstrated
for the first time in 2011 as a way to enable better training of deeper networks compared to
other widely used activation functions including the logistic sigmoid (which is inspired
by probability theory and logistic regression) and the hyperbolic tangent.
The rectifier is, as of 2017, the most popular activation function for deep neural networks. A
unit employing the rectifier is also called a rectified linear unit (ReLU).
The main reason ReLU wasn’t used until more recently is that it is not differentiable at
the point zero. Researchers tended to use differentiable activation functions like sigmoid and
tanh. However, ReLU is now widely regarded as the default activation function for deep
learning.
Equation for the ReLU activation function. | Image: Wikipedia
The ReLU activation function is differentiable at all points except at zero. For values greater
than zero, the function returns its input; otherwise it returns zero. This can be written as:
f(x) = max{0, x}
In simple terms, this can also be written as follows:
def relu(x):
    if x > 0:
        return x
    else:
        return 0
All negative values map to zero, and positive values pass through unchanged.
For the computation of the backpropagation of neural networks, the differentiation for the
ReLU is relatively easy. The only assumption we will make is the derivative at the point zero,
which will also be considered as zero. This is usually not such a big concern, and it works well
for the most part. The derivative of the function is the value of the slope. The slope for negative
values is 0.0, and the slope for positive values is 1.0.
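The function and its slope can be written side by side. This is a minimal sketch that follows the convention stated above of treating the derivative at exactly zero as zero; the names `relu` and `relu_derivative` are illustrative.

```python
def relu(x):
    """ReLU: returns x for positive inputs and 0 otherwise."""
    return max(0.0, x)


def relu_derivative(x):
    """Slope of ReLU; the value at exactly 0 is taken to be 0 by convention."""
    return 1.0 if x > 0 else 0.0


samples = [-2.0, 0.0, 3.0]
values = [relu(x) for x in samples]
slopes = [relu_derivative(x) for x in samples]
```

During backpropagation the incoming gradient is multiplied by this slope, so it passes through unchanged for positive inputs and is blocked for negative ones.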
1. Convolutional layers and deep learning: It is the most popular activation function for
training convolutional layers and deep learning models.
The main issue with ReLU is that all the negative values become zero immediately, which
decreases the ability of the model to fit or train from the data properly.
That means any negative input given to the ReLU activation function turns the value into zero
immediately in the graph, which in turn affects the resulting graph by not mapping the negative
values appropriately. This can however be easily fixed by using the different variants of the
ReLU activation function, like the leaky ReLU and other functions discussed earlier in the
article.
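One of the variants mentioned above, leaky ReLU, can be sketched as follows. The slope of 0.01 for negative inputs is a commonly used default and is an assumption here, as are the function names.

```python
def leaky_relu(x, negative_slope=0.01):
    """Like ReLU, but negative inputs keep a small slope instead of becoming 0."""
    return x if x > 0 else negative_slope * x


def leaky_relu_derivative(x, negative_slope=0.01):
    """Slope is 1 for positive inputs and negative_slope otherwise."""
    return 1.0 if x > 0 else negative_slope
```

Because the slope for negative inputs is small but nonzero, a neuron that receives negative inputs still passes some gradient back during training.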
This is just a short introduction to the rectified linear unit and its importance in deep learning
technology today. It’s more popular than all other activation functions, and for good reason.
Hyperparameter Tuning
Hyperparameter tuning is the process of selecting the optimal values for a machine
learning model’s hyperparameters. Hyperparameters are settings that control the learning
process of the model, such as the learning rate, the number of neurons in a neural network,
or the kernel in a support vector machine. The goal of hyperparameter tuning is to find
the values that lead to the best performance on a given task.
What are Hyperparameters?
In the context of machine learning, hyperparameters are configuration variables that are set
before the training process of a model begins. They control the learning process itself, rather
than being learned from the data. Hyperparameters are often used to tune the performance of
a model, and they can have a significant impact on the model’s accuracy, generalization, and
other metrics.
Different Ways of Hyperparameters Tuning
Hyperparameters are configuration variables that control the learning process of a machine
learning model. They are distinct from model parameters, which are the weights and biases
that are learned from the data. There are several different types of hyperparameters:
Hyperparameters in Neural Networks
Neural networks have several essential hyperparameters that need to be adjusted, including:
• Learning rate: This hyperparameter controls the step size taken by the
optimizer during each iteration of training. Too small a learning rate can result in
slow convergence, while too large a learning rate can lead to instability and
divergence.
• Epochs: This hyperparameter represents the number of times the entire training
dataset is passed through the model during training. Increasing the number of
epochs can improve the model’s performance but may lead to overfitting if not
done carefully.
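One simple way to tune the two hyperparameters above is an exhaustive grid search: try every combination and keep the best one. The sketch below is a toy illustration, not a real training setup. The `train_and_score` function is a hypothetical stand-in that runs gradient descent on C(v) = v² and returns the final cost, whereas in practice the score would come from evaluating a trained model on validation data.

```python
from itertools import product


def train_and_score(learning_rate, epochs):
    """Toy stand-in for training: gradient descent on C(v) = v^2 from v = 10.

    Returns the final cost, so lower is better.
    """
    v = 10.0
    for _ in range(epochs):
        v -= learning_rate * 2 * v
    return v ** 2


def grid_search(learning_rates, epoch_options):
    """Evaluate every (learning rate, epochs) pair and keep the lowest score."""
    best_score, best_config = float("inf"), None
    for lr, epochs in product(learning_rates, epoch_options):
        score = train_and_score(lr, epochs)
        if score < best_score:
            best_score, best_config = score, (lr, epochs)
    return best_config, best_score


best_config, best_score = grid_search([0.001, 0.1, 1.1], [10, 50])
```

In this toy setting the smallest learning rate converges too slowly and the largest one diverges, so the search settles on the middle value with the larger number of epochs.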
• Kernel: The kernel function that defines the similarity between data points.
Different kernels can capture different relationships between data points, and the
choice of kernel can significantly impact the performance of the SVM. Common
kernels include linear, polynomial, radial basis function (RBF), and sigmoid.
• Gamma: The parameter that controls how far the influence of a single support vector
reaches. A larger value of gamma confines the influence to nearby points, which gives
a more tightly curved decision boundary, while a smaller value lets the influence
extend to more distant points, which gives a smoother boundary. The choice of gamma is
particularly important
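The role of gamma can be checked numerically. The sketch below assumes the usual RBF kernel form K(x, z) = exp(-gamma · ||x - z||²); the points and gamma values are hypothetical and chosen only to show the trend.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF similarity between two points: exp(-gamma * squared distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


support_vector = (0.0, 0.0)
near_point = (0.5, 0.0)
far_point = (3.0, 0.0)

for gamma in (0.1, 1.0, 10.0):
    near = rbf_kernel(support_vector, near_point, gamma)
    far = rbf_kernel(support_vector, far_point, gamma)
    print(f"gamma={gamma}: similarity to near point={near:.4f}, to far point={far:.6f}")
```

As gamma grows, the similarity to the far point falls toward zero much faster than the similarity to the near point, so only close neighbours of a support vector affect the decision.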
convergence, but it may also increase the risk of overfitting. A smaller learning
rate may result in slower convergence but can help prevent overfitting.
• n_estimators: This hyperparameter determines the number of boosting trees to
be trained. A larger number of trees can improve the model’s accuracy, but it can
also increase the risk of overfitting. A smaller number of trees may result in lower
accuracy but can help prevent overfitting.
• max_depth: This hyperparameter determines the maximum depth of each tree
in the ensemble. A larger max_depth can allow the trees to capture more complex
relationships in the data, but it can also increase the risk of overfitting. A smaller
max_depth may result in less complex trees but can help prevent overfitting.
• min_child_weight: This hyperparameter determines the minimum sum of
instance weight (hessian) needed in a child node. A larger min_child_weight can
help prevent overfitting by requiring more data to influence the splitting of trees.
A smaller min_child_weight may allow for more aggressive tree splitting but can
increase the risk of overfitting.
• subsample: This hyperparameter determines the percentage of rows used for
each tree construction. A smaller subsample can improve the efficiency of training
but may reduce the model’s accuracy. A larger subsample can increase the
accuracy but may make training more computationally expensive.
Some other examples of model hyperparameters include:
1. The penalty in Logistic Regression Classifier i.e. L1 or L2 regularization
2. Number of Trees and Depth of Trees for Random Forests.
3. The learning rate for training a neural network.
4. Number of Clusters for Clustering Algorithms.
5. The k in k-nearest neighbors.
Hyperparameter Tuning techniques
Models can have many hyperparameters, and finding the best combination of
hyperparameter values can be treated as a search problem. Three widely used strategies for
hyperparameter tuning are:
1. GridSearchCV
2. RandomizedSearchCV
3. Bayesian Optimization
1. GridSearchCV
Grid search can be considered a “brute force” approach to hyperparameter
optimization. After creating a grid of candidate discrete hyperparameter values, we fit the
model with every possible combination, log each combination’s performance, and then
choose the combination that produces the best results. This approach is called GridSearchCV
because it searches for the best set of hyperparameters from a grid of hyperparameter
values, scoring each combination with cross-validation.
Grid search is exhaustive, so it finds the best combination within the grid. Its
disadvantage is that it is slow: fitting the model with every combination often takes a lot
of processing power and time, which might not be available.
For example, suppose we want to tune two hyperparameters, C and Alpha, of a Logistic
Regression Classifier model, each over a set of candidate values. Grid search constructs a
version of the model for every possible combination of these values and returns the
best one.
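The exhaustive loop behind grid search can be sketched in a few lines. This is a minimal illustration, not the scikit-learn implementation: it uses the C and Alpha value lists from this section, and `validation_score` is a hypothetical stand-in for "train the model with these values and return its cross-validated score".

```python
from itertools import product

C_values = [0.1, 0.2, 0.3, 0.4, 0.5]
alpha_values = [0.1, 0.2, 0.3, 0.4]


def validation_score(C, alpha):
    """Hypothetical score; a real run would train and validate a model here."""
    return 1.0 - (C - 0.3) ** 2 - (alpha - 0.2) ** 2


def grid_search(score_fn, grid):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    results = []
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        results.append((score_fn(**params), params))
    best_score, best_params = max(results, key=lambda r: r[0])
    return best_params, best_score, results


best_params, best_score, results = grid_search(
    validation_score, {"C": C_values, "alpha": alpha_values}
)
print(f"evaluated {len(results)} combinations; best: {best_params}")
```

With 5 values of C and 4 values of Alpha, the loop evaluates 5 × 4 = 20 combinations. Adding a third hyperparameter multiplies that count again, which is why grid search becomes expensive quickly.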
As in the image, for C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4]. For a
2. RandomizedSearchCV
As the name suggests, the random search method selects values at random as opposed to the
grid search method’s use of a predetermined set of numbers. Every iteration, random search
attempts a different set of hyperparameters and logs the model’s performance. It returns the
combination that provided the best outcome after several iterations. This approach reduces
unnecessary computation.
RandomizedSearchCV solves the drawbacks of GridSearchCV, as it goes through only a
fixed number of hyperparameter settings. It moves within the grid in a random fashion to find
the best set of hyperparameters. The advantage is that, in most cases, a random search will
produce a comparable result faster than a grid search.
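Random search replaces the exhaustive loop with a fixed budget of randomly drawn settings. The sketch below is a minimal illustration under assumptions: `validation_score` is a hypothetical stand-in for training and scoring a model, and the bounds and the budget of 8 trials are arbitrary.

```python
import random


def validation_score(C, alpha):
    """Hypothetical score; a real run would train and validate a model here."""
    return 1.0 - (C - 0.3) ** 2 - (alpha - 0.2) ** 2


def random_search(score_fn, bounds, n_iter, seed=0):
    """Try n_iter randomly sampled settings and return the best one found."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    trials = []
    for _ in range(n_iter):
        params = {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}
        trials.append((score_fn(**params), params))
    best_score, best_params = max(trials, key=lambda t: t[0])
    return best_params, best_score, trials


bounds = {"C": (0.1, 0.5), "alpha": (0.1, 0.4)}
best_params, best_score, trials = random_search(validation_score, bounds, n_iter=8)
print(f"tried {len(trials)} settings; best score {best_score:.4f} at {best_params}")
```

The cost is set by `n_iter` rather than by the size of the grid, which is the trade-off described above: fewer evaluations, with no guarantee that the best point in the range is sampled.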
3. Bayesian Optimization
Grid search and random search are often inefficient because they evaluate many unsuitable
hyperparameter combinations without considering the previous iterations’ results. Bayesian
optimization, on the other hand, treats the search for optimal hyperparameters as an
optimization problem. It considers the previous evaluation results when selecting the next
hyperparameter combination and applies a probabilistic function to choose the combination
that will likely yield the best results. This method discovers a good hyperparameter
combination in relatively few iterations.
Data scientists use a probabilistic model when the objective function is unknown. The
probabilistic model estimates the probability of a hyperparameter combination’s objective
function result based on past evaluation results.
P(score(y)|hyperparameters(x))
It is a “surrogate” of the objective function, which can be the root-mean-square error
(RMSE), for example. The objective function is calculated using the training data with the
hyperparameter combination, and we try to optimize it (maximize or minimize, depending
the surrogate probability model every time the objective function runs. Better hyperparameter
predictions decrease the number of objective function evaluations needed to achieve a good
result. Gaussian processes, random forest regression, and tree-structured Parzen estimators
(TPE) are examples of surrogate models.
Bayesian optimization is complex to implement, but off-the-shelf libraries like
Ray Tune can simplify the process. It is worth using because it finds an
adequate hyperparameter combination in relatively few iterations. However, unlike
grid search or random search, Bayesian optimization is inherently sequential: each new
trial depends on the results of the previous ones, so it is harder to distribute across
machines. It can therefore take longer in wall-clock time while using fewer total
computational resources.
Drawback: Requires an understanding of the underlying probabilistic model.
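The loop described in this section (fit a surrogate to past results, pick the most promising setting, evaluate it, repeat) can be sketched as follows. This is a simplified illustration under several assumptions, not a production method: one hyperparameter x in [0, 1], a hypothetical `objective`, a Gaussian-process surrogate with an RBF kernel written in plain Python, and an upper-confidence-bound rule (mean + kappa × standard deviation) as the selection function. Libraries typically offer other choices such as expected improvement.

```python
import math


def objective(x):
    """Hypothetical validation score for one hyperparameter x in [0, 1]."""
    return math.exp(-((x - 0.7) ** 2) / 0.02)


def rbf(a, b, length_scale=0.15):
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Surrogate prediction (mean, std) at x_new given past evaluations."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    y_mean = sum(ys) / len(ys)                      # center the scores
    alpha = solve(K, [y - y_mean for y in ys])
    mean = y_mean + sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)
    var = max(1.0 - sum(k * vi for k, vi in zip(k_star, v)), 0.0)
    return mean, math.sqrt(var)


def bayesian_optimize(fn, candidates, initial, n_iter, kappa=2.0):
    xs = list(initial)
    ys = [fn(x) for x in xs]
    for _ in range(n_iter):
        untried = [c for c in candidates if c not in xs]

        def ucb(c):
            mean, std = gp_posterior(xs, ys, c)
            return mean + kappa * std               # favor high mean or high uncertainty

        x_next = max(untried, key=ucb)              # cheap: uses only the surrogate
        xs.append(x_next)
        ys.append(fn(x_next))                       # expensive: real evaluation
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], xs


candidates = [i / 50 for i in range(51)]
best_x, best_y, tried = bayesian_optimize(
    objective, candidates, initial=[0.0, 0.5, 1.0], n_iter=8
)
print(f"best x = {best_x}, score = {best_y:.4f}, evaluations = {len(tried)}")
```

Only 11 of the 51 candidate values are ever passed to `objective`; the surrogate decides which ones. Each choice depends on all earlier results, which is why the trials run one after another.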
Challenges in Hyperparameter Tuning
• Dealing with High-Dimensional Hyperparameter Spaces: Efficient Exploration
and Optimization
• Handling Expensive Function Evaluations: Balancing Computational Efficiency
and Accuracy
• Incorporating Domain Knowledge: Utilizing Prior Information for Informed
Tuning
• Developing Adaptive Hyperparameter Tuning Methods: Adjusting Parameters
During Training
• No guarantee of optimal performance
• Requires expertise
Batch Normalization
Batch normalization is a deep learning technique that has been shown to significantly
improve the training speed and stability of neural network models. It is particularly useful for
training very deep networks, as it can help to reduce the internal covariate shift that can occur
during training.
of the output distribution from the preceding layer, allowing it to analyze the data more
effectively.
The term “internal covariate shift” describes how updating the parameters of the preceding
layers changes the distribution of inputs to the current layer during deep learning training.
This can make the optimization process more difficult and can slow down the
convergence of the model.
Because normalization keeps activation values from becoming too high or too low, and because it
lets each layer learn more independently of the others, this strategy leads to faster
training. Batch normalization also has a mild regularizing effect, so the dropout rate (the
fraction of units randomly dropped during training) can often be reduced. Together, these
effects tend to improve the model’s accuracy.
The goal of batch normalization is to stabilize the training process and improve the
generalization ability of the model. It can also help to reduce the need for careful initialization
of the model’s weights and can allow the use of higher learning rates, which can speed up the
training process.
It is common practice to apply batch normalization prior to a layer’s activation function,
and it is commonly used in tandem with other regularization methods such as dropout. It is a
widely used technique in modern deep learning and has been shown to be effective in a variety
• Stabilizes the training process. Batch normalization can help to reduce the internal
covariate shift that occurs during training, which can improve the stability of the training
process.
• Reduces the need for careful initialization. Batch normalization can help reduce the
sensitivity of the model to the initial weights, making it easier to train the model.
• Allows for higher learning rates. Batch normalization can allow the use of
higher learning rates that can speed up the training process.
• Scaled and shifted activations: zi = γyi + β, where γ and β are learned parameters
During inference, the activations of a layer are normalized using the running mean and variance
of the activations accumulated during training, rather than the mean and variance of the
current mini-batch:
• Scaled and shifted activations: zi = γyi + β
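import torch.nn as nn

model = nn.Sequential(
    # Assumed preceding layer (not shown in the notes): a convolution with 16 output channels
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(num_features=16),  # one mean/variance pair per channel
    nn.ReLU(),
    # ...
)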
The BatchNorm2d module takes the number of channels (i.e., the number of features) in the
input as an argument. For each channel, it normalizes with the mean and variance computed over
the batch and the spatial dimensions (height and width) of the input. The BatchNorm2d module
also has learnable parameters for scaling and
shifting the normalized activations, which are updated during training.
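The arithmetic behind batch normalization can be written out in plain Python for a single feature. This is a minimal sketch, not the PyTorch implementation: the function names are hypothetical, `eps` is a small constant added for numerical stability, and the running statistics use an exponential moving average with an assumed momentum of 0.1.

```python
import math


def batch_norm_train(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]   # y_i
    scaled = [gamma * y + beta for y in normalized]                   # z_i
    return scaled, mean, var


def update_running(running_value, batch_value, momentum=0.1):
    """Moving average of a batch statistic, kept for use at inference time."""
    return (1 - momentum) * running_value + momentum * batch_value


def batch_norm_infer(x, running_mean, running_var, gamma=1.0, beta=0.0, eps=1e-5):
    """At inference, use the stored statistics instead of the batch statistics."""
    y = (x - running_mean) / math.sqrt(running_var + eps)
    return gamma * y + beta


batch = [1.0, 2.0, 3.0, 4.0]
z, batch_mean, batch_var = batch_norm_train(batch)
running_mean = update_running(0.0, batch_mean)
running_var = update_running(1.0, batch_var)
print("normalized batch:", [round(v, 4) for v in z])
print("inference output for x=2.0:", round(batch_norm_infer(2.0, running_mean, running_var), 4))
```

With gamma = 1 and beta = 0 the training output has mean close to 0 and variance close to 1. During training, gamma and beta are learned so the network can undo the normalization where that helps.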
preventing a model from becoming overly complex and memorizing the training
data instead of learning its underlying patterns.
3. Balancing Bias and Variance: Regularization can help balance the trade-off
between model bias (underfitting) and model variance (overfitting) in machine
learning, which leads to improved performance.
4. Feature Selection: Some regularization methods, such as L1 regularization
(Lasso), promote sparse solutions that drive some feature coefficients to zero. This
automatically selects important features while excluding less important ones.
noise in the training data as well. This is the case when our model memorizes the training data
instead of learning the patterns in it.
Underfitting, on the other hand, occurs when our model is not able to learn even the basic
patterns available in the dataset. An underfitting model is unable to perform well
even on the training data, so we cannot expect it to perform well on the validation data. In
this case we should increase the complexity of the model or add more features
to the feature set.
Finding a proper balance between the two, known as the Bias-Variance Tradeoff, can
help us keep the model from overfitting the training data.
Different Combinations of Bias-Variance
There can be four combinations between bias and variance:
• High Bias, Low Variance: A model that has high bias and low variance is
considered to be underfitting.
• High Variance, Low Bias: A model that has high variance and low bias is
considered to be overfitting.
• High Bias, High Variance: A model with high bias and high variance cannot
capture data patterns or handle variations in training data.
• Low Bias, Low Variance: A model with low bias and low variance can capture
data patterns and handle variations in training data. This is the ideal scenario for
a machine learning model, where it can generalize well to unseen data and make
consistent, accurate predictions. In practice, however, this is rarely achievable.
Regularization in Machine Learning
Regularization is a technique used to reduce errors by fitting the function appropriately on the
given training set and avoiding overfitting. The commonly used regularization techniques are:
1. Lasso Regularization – L1 Regularization
2. Ridge Regularization – L2 Regularization
Lasso Regression
A regression model that uses the L1 regularization technique is
called LASSO (Least Absolute Shrinkage and Selection Operator) regression. Lasso
regression adds the “absolute value of magnitude” of the coefficients as a penalty term to the
loss function (L). Lasso regression also helps us achieve feature selection by shrinking the
weights of features that do not serve any purpose in the model to zero.
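The cost function itself is not reproduced in these notes. A standard form that matches the description above is given here as a sketch. It assumes mean squared error as the base loss and introduces two symbols that are not defined in the notes: λ for the regularization strength and w_j for the coefficient of feature j.

```latex
\text{Cost} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{m} \lvert w_j \rvert
```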
where,
• m – Number of Features
• n – Number of Examples
• y_i – Actual Target Value
• y_i(hat) – Predicted Target Value
Ridge Regression
A regression model that uses the L2 regularization technique is called Ridge
regression. Ridge regression adds the “squared magnitude” of the coefficient as a penalty
term to the loss function(L).
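The two penalties can be compared side by side. The sketch below assumes mean squared error as the base loss and a hypothetical regularization strength `lam`; all function names are illustrative. The last two functions solve a one-weight version of each problem in closed form, which shows why L1 can set a weight to exactly zero while L2 only shrinks it.

```python
def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def lasso_cost(y_true, y_pred, weights, lam):
    """Base loss plus the L1 penalty: lam * sum of |w_j|."""
    return mse(y_true, y_pred) + lam * sum(abs(w) for w in weights)


def ridge_cost(y_true, y_pred, weights, lam):
    """Base loss plus the L2 penalty: lam * sum of w_j squared."""
    return mse(y_true, y_pred) + lam * sum(w ** 2 for w in weights)


def lasso_shrink(a, lam):
    """Minimizer of 0.5*(w - a)**2 + lam*|w| (soft thresholding)."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0


def ridge_shrink(a, lam):
    """Minimizer of 0.5*(w - a)**2 + 0.5*lam*w**2."""
    return a / (1.0 + lam)


weights = [1.0, -2.0]
print("lasso cost:", lasso_cost([1.0, 2.0], [1.0, 2.0], weights, lam=0.1))
print("ridge cost:", ridge_cost([1.0, 2.0], [1.0, 2.0], weights, lam=0.1))
for a in (0.3, 2.0):
    print(f"unregularized weight {a}: lasso -> {lasso_shrink(a, 0.5)}, ridge -> {ridge_shrink(a, 0.5)}")
```

A small unregularized weight (0.3 with lam = 0.5) becomes exactly 0 under the L1 rule but stays nonzero under the L2 rule. This is the feature selection behaviour attributed to Lasso.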
Benefits of Regularization
1. Regularization improves model generalization by reducing overfitting.
Regularized models learn underlying patterns, while overfit models memorize noise
in training data.
2. L1 (Lasso) regularization simplifies models
and improves interpretability by reducing the coefficients of less important features to
zero.
4. Regularization makes models stable across different subsets of the data. It reduces
the sensitivity of model outputs to minor changes in the training set.
5. Regularization prevents models from becoming overly complex, which is
Dropout
Dropout Regularization
If you train a model on your training data for too long, it might overfit, and
when you get the actual test data for making predictions, it will probably not perform
well. Dropout regularization is one technique used to tackle overfitting problems in deep
learning.
This section goes over some theory first and then uses Python code to show how adding a
dropout layer affects the performance of a neural network.
Training with Drop-Out Layers
Dropout is a regularization method that approximates training many neural
networks with different architectures in parallel. During training, the network randomly
ignores or drops some layer outputs. This changes the layer’s apparent size and connectivity
to the preceding layer. In effect, each training update sees a different view of the layer.
Dropout makes the training process noisy, forcing nodes within a layer to take on more
or less responsibility for the inputs on a probabilistic basis.
Viewed this way, dropout can break up situations
in which network layers co-adapt to correct mistakes made by prior layers, making the
model more robust. Dropout is implemented per layer in a neural network. It works with
most layer types, including dense (fully connected), convolutional, and recurrent
layers such as the long short-term memory network layer. Dropout can be applied to any or all
of the network’s hidden layers as well as the visible or input layer. It is not used on the
output layer.
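The random dropping described above can be sketched on a plain list of activations. This is a minimal illustration of "inverted" dropout, the variant in which surviving activations are scaled by 1 / (1 - p) during training so that nothing needs to change at inference. The function name and the example values are hypothetical, and p is assumed to be less than 1.

```python
import random


def dropout(values, p, training, rng):
    """Zero each value with probability p during training; pass through otherwise."""
    if not training or p == 0.0:
        return list(values)                 # inference: layer is the identity
    keep = 1.0 - p
    # Scale the survivors so the expected value of each activation is unchanged.
    return [v / keep if rng.random() >= p else 0.0 for v in values]


rng = random.Random(0)
activations = [1.0] * 10

print("training :", dropout(activations, p=0.25, training=True, rng=rng))
print("inference:", dropout(activations, p=0.25, training=False, rng=rng))
```

Each training call produces a different pattern of zeros, which is the "different view of the layer" on every update. At inference all units are kept.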
Dropout Implementation
Using torch.nn, you can easily add dropout to your PyTorch
models. The nn.Dropout class takes the dropout rate (the probability of deactivating a neuron)
as a parameter.
self.dropout = nn.Dropout(0.25)
To investigate the impact of dropout, train an image classification model: first an
unregularized network, and then a network regularized with dropout.
The CIFAR-10 dataset is used to train the models over 15 epochs.
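import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, input_shape=(3, 32, 32)):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, 3)
        self.conv2 = nn.Conv2d(32, 64, 3)
        self.conv3 = nn.Conv2d(64, 128, 3)
        self.pool = nn.MaxPool2d(2, 2)
        # Size of the flattened convolutional output for the given input shape
        n_size = self._get_conv_output(input_shape)
        self.fc1 = nn.Linear(n_size, 512)
        self.fc2 = nn.Linear(512, 10)
        self.dropout = nn.Dropout(0.25)

    # The two helpers below are not shown in the notes; they are assumed here
    # so that the class is self-contained.
    def _forward_features(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        return x

    def _get_conv_output(self, shape):
        with torch.no_grad():
            features = self._forward_features(torch.zeros(1, *shape))
        return features.view(1, -1).size(1)

    def forward(self, x):
        x = self._forward_features(x)
        x = x.view(x.size(0), -1)
        # Apply dropout to the flattened features
        x = self.dropout(x)
        x = F.relu(self.fc1(x))
        # Apply dropout
        x = self.dropout(x)
        x = self.fc2(x)
        return x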
of the unregularized network, the total validation accuracy has improved. This explains
why the generalization error has decreased.
When combating overfitting, dropout is far from the only option. Regularization
techniques commonly used include:
• Noise: Allow some random variations in the data through augmentation to create
noise (which makes the network robust to a larger distribution of inputs and hence
improves generalization).
• Model Combination: the outputs of separately trained neural networks are
averaged (which requires a lot of computational power, data, and time).
In deep learning regularization, researchers have found that a high momentum and
a large decaying learning rate are effective hyperparameter values with dropout. Constraining
the norm of the weight vectors alongside dropout allows us to employ a high learning rate
without fear of the weights blowing up. Dropout noise, along with a large decaying learning
rate, allows us to explore alternative areas of the loss function and, hopefully, reach a
better minimum.
Although dropout is a potent tool, it has certain downsides. A dropout network may take
2-3 times longer to train than a normal network. One way to reap the benefits of dropout
without slowing down training is to find a regularizer that is essentially equivalent
to a dropout layer. For linear regression, this regularizer is a modified form of L2
regularization. An analogous regularizer for more complex models has yet to be discovered;
until then, when in doubt, use dropout.