
4931_Grace College of Engineering, Thoothukudi

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

B.Tech. Artificial Intelligence and Data Science
Anna University Regulation: 2021

AL3451 – Machine Learning

II Year / IV Semester

NOTES

UNIT IV – NEURAL NETWORKS

Prepared by,
Mrs. N. Nancy Chitra Thilaga, AP/ECE


Multilayer perceptron, activation functions, network training – gradient descent optimization – stochastic gradient descent, error backpropagation, from shallow networks to deep networks – unit saturation (aka the vanishing gradient problem) – ReLU, hyperparameter tuning, batch normalization, regularization, dropout.

4.1 Multilayer perceptron

The multilayer perceptron is an artificial neural network structure and is a nonparametric estimator that can be used for classification and regression. We are interested in artificial neural networks because we believe that they may help us build better computer systems. The brain is an information processing device that has some incredible abilities and surpasses current engineering products in many domains, for example, vision, speech recognition, and learning, to name three.

The perceptron is the basic processing element. It has inputs that may come from the environment or may be the outputs of other perceptrons. Associated with each input xⱼ ∈ R, j = 1, . . . , d, is a connection weight, or synaptic weight, wⱼ ∈ R, and the output, y, in the simplest case is a weighted sum of the inputs:

y = ∑ⱼ₌₁ᵈ wⱼ xⱼ + w₀
w₀ is the intercept value that makes the model more general; it is generally modeled as the weight coming from an extra bias unit, x₀, which is always +1. We can write the output of the perceptron as a dot product:

y = wᵀx

where w = [w₀, w₁, . . . , wd]ᵀ and x = [1, x₁, . . . , xd]ᵀ are augmented vectors that also include the bias weight and input. During testing, with given weights, w, for input x, we compute the output y. To implement a given task, we need to learn the weights w, the parameters of the system, such that correct outputs are generated given the inputs.

When d = 1 and x is fed from the environment through an input unit, we have
y = wx + w0
which is the equation of a line with w as the slope and w0 as the intercept. Thus this
perceptron with one input and one output can be used to implement a linear fit. With more
than one input, the line becomes a (hyper)plane, and the perceptron with more than one input
can be used to implement a multivariate linear fit.
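
To make this concrete, here is a small illustrative sketch (our own code, not part of the original notes) of the perceptron output computed as a dot product with an augmented bias input:

import numpy as np

# Perceptron output y = w^T x, with the bias folded in as x0 = +1.
def perceptron_output(w, x):
    x_aug = np.concatenate(([1.0], x))  # augmented input [1, x1, ..., xd]
    return float(np.dot(w, x_aug))      # weighted sum of the inputs

# d = 2 inputs: w = [w0, w1, w2] gives y = w0 + w1*x1 + w2*x2
w = np.array([0.5, 2.0, -1.0])
print(perceptron_output(w, np.array([3.0, 1.0])))  # 0.5 + 6.0 - 1.0 = 5.5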


An artificial neural network (ANN) is a machine learning model inspired by the structure and function of the human brain's interconnected network of neurons. It consists of interconnected nodes called artificial neurons, organized into layers. Information flows through the network, with each neuron processing input signals and producing an output signal that influences other neurons in the network.

A multi-layer perceptron (MLP) is a type of artificial neural network consisting of multiple layers of neurons. The neurons in the MLP typically use nonlinear activation functions, allowing the network to learn complex patterns in data. MLPs are significant in machine learning because they can learn nonlinear relationships in data, making them powerful models for tasks such as classification, regression, and pattern recognition.

Basics of Neural Networks



Neural networks or artificial neural networks are fundamental tools in machine learning, powering many state-of-the-art algorithms and applications across various domains, including computer vision, natural language processing, robotics, and more.

A neural network consists of interconnected nodes, called neurons, organized into layers. Each
neuron receives input signals, performs a computation on them using an activation function,
and produces an output signal that may be passed to other neurons in the network.
An activation function determines the output of a neuron given its input. These functions
introduce nonlinearity into the network, enabling it to learn complex patterns in data.

The network is typically organized into layers, starting with the input layer, where data is introduced, followed by hidden layers, where computations are performed, and finally the output layer, where predictions or decisions are made.

Neurons in adjacent layers are connected by weighted connections, which transmit signals from
one layer to the next. The strength of these connections, represented by weights, determines
how much influence one neuron's output has on another neuron's input. During the training
process, the network learns to adjust its weights based on examples provided in a training dataset. Additionally, each neuron typically has an associated bias, which allows the neuron to
adjust its output threshold.

Neural networks are trained using techniques called feedforward propagation and backpropagation. During feedforward propagation, input data is passed through the network layer by layer, with each layer performing a computation based on the inputs it receives and passing the result to the next layer.

Backpropagation is an algorithm used to train neural networks by iteratively adjusting the network's weights and biases in order to minimize the loss function. A loss function (also known as a cost function or objective function) is a measure of how well the model's predictions match the true target values in the training data. The loss function quantifies the difference between the predicted output of the model and the actual output, providing a signal that guides the optimization process during training.

The goal of training a neural network is to minimize this loss function by adjusting the
weights and biases. The adjustments are guided by an optimization algorithm, such as gradient
descent.

Types of Neural Network

Picture credit: Keras Tutorial: Deep Learning in Python

The ANN depicted on the right of the image is a simple neural network called
‘perceptron’. It consists of a single layer, which is the input layer, with multiple neurons with
their own weights; there are no hidden layers. The perceptron algorithm learns the weights for
the input signals in order to draw a linear decision boundary.

However, to solve more complicated, non-linear problems related to image processing, computer vision, and natural language processing tasks, we work with deep neural networks.

There are several types of ANN, each designed for specific tasks and architectural
requirements. Let's briefly discuss some of the most common types before diving deeper into
MLPs next.


Feedforward Neural Networks (FNN)

These are the simplest form of ANNs, where information flows in one direction, from input to
output. There are no cycles or loops in the network architecture. Multilayer perceptrons (MLP)
are a type of feedforward neural network.

Recurrent Neural Networks (RNN)

In RNNs, connections between nodes form directed cycles, allowing information to persist
over time. This makes them suitable for tasks involving sequential data, such as time series
prediction, natural language processing, and speech recognition.

Convolutional Neural Networks (CNN)

CNNs are designed to effectively process grid-like data, such as images. They consist of layers
of convolutional filters that learn hierarchical representations of features within the input data.
CNNs are widely used in tasks like image classification, object detection, and image
segmentation.

Long Short-Term Memory Networks (LSTM) and Gated Recurrent Units (GRU)

These are recurrent architectures that incorporate gating mechanisms to capture long-range dependencies in sequential data and to mitigate the vanishing gradient problem (discussed later in this unit).

Autoencoder

It is designed for unsupervised learning and consists of an encoder network that compresses the input data into a lower-dimensional latent space, and a decoder network that reconstructs the original input from the latent representation. Autoencoders are often used for dimensionality reduction, data denoising, and generative modeling.

Generative Adversarial Networks (GAN)

GANs consist of two neural networks, a generator and a discriminator, trained simultaneously in a competitive setting. The generator learns to generate synthetic data samples that are indistinguishable from real data, while the discriminator learns to distinguish between real and fake samples. GANs have been widely used for generating realistic images, videos, and other types of data.

Multilayer Perceptrons

A multilayer perceptron is a type of feedforward neural network consisting of fully connected neurons with a nonlinear kind of activation function. It is widely used to distinguish data that is not linearly separable.

MLPs have been widely used in various fields, including image recognition, natural
language processing, and speech recognition, among others. Their flexibility in architecture
and ability to approximate any function under certain conditions make them a fundamental
building block in deep learning and neural network research. Let's take a deeper dive into some
of its key concepts.


Input layer

The input layer consists of nodes or neurons that receive the initial input data. Each neuron
represents a feature or dimension of the input data. The number of neurons in the input layer is
determined by the dimensionality of the input data.

Hidden layer

Between the input and output layers, there can be one or more layers of neurons. Each neuron
in a hidden layer receives inputs from all neurons in the previous layer (either the input layer
or another hidden layer) and produces an output that is passed to the next layer. The number of
hidden layers and the number of neurons in each hidden layer are hyperparameters that need to
be determined during the model design phase.

Output layer

This layer consists of neurons that produce the final output of the network. The number of neurons in the output layer depends on the nature of the task. In binary classification, there may be either one or two neurons, depending on the activation function, representing the probability of belonging to one class; in multi-class classification tasks, there can be multiple neurons in the output layer.
C
Weights

Neurons in adjacent layers are fully connected to each other. Each connection has an associated weight, which determines the strength of the connection. These weights are learned during the training process.

Bias Neurons

In addition to the input and hidden neurons, each layer (except the input layer) usually includes a bias neuron that provides a constant input to the neurons in the next layer. The bias neuron has its own weight associated with each connection, which is also learned during training.

The bias neuron effectively shifts the activation function of the neurons in the
subsequent layer, allowing the network to learn an offset or bias in the decision boundary. By
adjusting the weights connected to the bias neuron, the MLP can learn to control the threshold
for activation and better fit the training data.

Note: It is important to note that in the context of MLPs, bias can refer to two related but distinct
concepts: bias as a general term in machine learning and the bias neuron (defined above). In
general machine learning, bias refers to the error introduced by approximating a real-world
problem with a simplified model. Bias measures how well the model can capture the underlying
patterns in the data. A high bias indicates that the model is too simplistic and may underfit the
data, while a low bias suggests that the model is capturing the underlying patterns well.

Activation Function

Typically, each neuron in the hidden layers and the output layer applies an activation function
to its weighted sum of inputs. Common activation functions include sigmoid, tanh, ReLU
(Rectified Linear Unit), and softmax. These functions introduce nonlinearity into the network,
allowing it to learn complex patterns in the data.
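
As a quick reference, these four functions can be written out in NumPy as follows (an illustrative sketch, not code from the notes):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - np.max(z))  # shift by max(z) for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), relu(z), softmax(z))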

Training with Backpropagation

MLPs are trained using the backpropagation algorithm, which computes gradients of a loss
function with respect to the model's parameters and updates the parameters iteratively to
minimize the loss.

Workings of a Multilayer Perceptron: Layer by Layer


Example of an MLP having two hidden layers

In a multilayer perceptron, neurons process information in a step-by-step manner, performing computations that involve weighted sums and nonlinear transformations. Let's walk layer by layer to see the magic that goes within.

Input layer

• The input layer of an MLP receives input data, which could be features
extracted from the input samples in a dataset. Each neuron in the input layer
represents one feature.
• Neurons in the input layer do not perform any computations; they simply pass
the input values to the neurons in the first hidden layer.


Hidden layers

• The hidden layers of an MLP consist of interconnected neurons that perform computations on the input data.
• Each neuron in a hidden layer receives input from all neurons in the previous
layer. The inputs are multiplied by corresponding weights, denoted as w. The
weights determine how much influence the input from one neuron has on the
output of another.
• In addition to weights, each neuron in the hidden layer has an associated bias,
denoted as b. The bias provides an additional input to the neuron, allowing it
to adjust its output threshold. Like weights, biases are learned during training.
• For each neuron in a hidden layer or the output layer, the weighted sum of its
inputs is computed. This involves multiplying each input by its corresponding
weight, summing up these products, and adding the bias:

z = ∑ᵢ₌₁ⁿ wᵢ xᵢ + b

where n is the total number of input connections, wᵢ is the weight for the i-th input, and xᵢ is the i-th input value.
• The weighted sum is then passed through an activation function, denoted as f. The activation function introduces nonlinearity into the network, allowing it to learn and represent complex relationships in the data. The activation function determines the output range of the neuron and its behavior in response to different input values. The choice of activation function depends on the nature of the task and the desired properties of the network (see the sketch below).
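
A minimal sketch of this hidden-layer computation, assuming a ReLU activation and made-up sizes (illustrative code, not from the notes):

import numpy as np

# One dense layer: z = W @ x + b, then an elementwise nonlinearity f(z).
def dense_layer(x, W, b, f):
    z = W @ x + b   # weighted sum plus bias for every neuron in the layer
    return f(z)     # activation applied elementwise

relu = lambda z: np.maximum(0.0, z)

x = np.array([0.5, -1.2, 3.0])     # 3 input features
W = np.random.randn(4, 3) * 0.1    # 4 hidden neurons, 3 weights each
b = np.zeros(4)                    # one bias per hidden neuron
print(dense_layer(x, W, b, relu))  # 4 hidden activations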

Output layer

• The output layer of an MLP produces the final predictions or outputs of the
network. The number of neurons in the output layer depends on the task being
performed (e.g., binary classification, multi-class classification, regression).
• Each neuron in the output layer receives input from the neurons in the last
hidden layer and applies an activation function. This activation function is
usually different from those used in the hidden layers and produces the final
output value or prediction.
During the training process, the network learns to adjust the weights associated with each
neuron's inputs to minimize the discrepancy between the predicted outputs and the true target
values in the training data. By adjusting the weights and learning the appropriate activation
functions, the network learns to approximate complex patterns and relationships in the data,
enabling it to make accurate predictions on new, unseen samples.


This adjustment is guided by an optimization algorithm, such as stochastic gradient descent (SGD), which computes the gradients of a loss function with respect to the weights and updates the weights iteratively.

Stochastic Gradient Descent (SGD)

1. Initialization: SGD starts with an initial set of model parameters (weights and biases), chosen randomly or using some predefined method.

2. Iterative Optimization: The aim of this step is to find the minimum of a loss
function, by iteratively moving in the direction of the steepest decrease in the
function's value.
For each iteration (or epoch) of training:

• Shuffle the training data to ensure that the model doesn't learn from the same patterns in the same order every time.
• Split the training data into mini-batches (small subsets of data).
• For each mini-batch:
• Compute the gradient of the loss function with respect to the model parameters using only the data points in the mini-batch. This gradient estimation is a stochastic approximation of the true gradient.
• Update the model parameters by taking a step in the opposite direction of the gradient, scaled by a learning rate:

θₜ₊₁ = θₜ − η ∇J(θₜ)

Where:
θₜ represents the model parameters (for example, the weights) at iteration t,
∇J(θₜ) is the gradient of the loss function J with respect to the parameters θₜ, and
η is the learning rate, which controls the size of the steps taken during optimization.
3. Direction of Descent: The gradient of the loss function indicates the direction
of the steepest ascent. To minimize the loss function, gradient descent moves
in the opposite direction, towards the steepest descent.

4. Learning Rate: The step size taken in each iteration of gradient descent is determined by a parameter called the learning rate, denoted above as η. This parameter controls the size of the steps taken towards the minimum. If the learning rate is too small, convergence may be slow; if it is too large, the algorithm may oscillate or diverge.

5. Convergence: Repeat the process for a fixed number of iterations or until a convergence criterion is met (e.g., the change in the loss function is below a certain threshold).
Stochastic gradient descent updates the model parameters more frequently using smaller
subsets of data, making it computationally efficient, especially for large datasets. The
randomness introduced by SGD can have a regularization effect, preventing the model from
overfitting to the training data. It is also well-suited for online learning scenarios where new
data becomes available incrementally, as it can update the model quickly with each new data
point or mini-batch.
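
To tie the steps together, here is a minimal mini-batch SGD loop for linear regression (an illustrative sketch; the data, model, and hyperparameters are our own assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))               # 200 samples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

w = np.zeros(3)                             # step 1: initialization
eta, batch_size = 0.05, 16                  # learning rate, mini-batch size

for epoch in range(50):                     # step 2: iterative optimization
    idx = rng.permutation(len(X))           # shuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]     # one mini-batch
        err = X[batch] @ w - y[batch]
        grad = 2 * X[batch].T @ err / len(batch)  # gradient of the MSE
        w -= eta * grad                           # step against the gradient

print(w)  # close to true_w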

However, SGD can also have some challenges, such as increased noise due to the stochastic
nature of the gradient estimation and the need to tune hyperparameters like the learning rate.
Various extensions and adaptations of SGD, such as mini-batch stochastic gradient descent, momentum, and adaptive learning rate methods like AdaGrad, RMSProp, and Adam, have been developed to address these challenges and improve convergence and performance.

You have seen the working of the multilayer perceptron layers and learned about stochastic gradient descent; to put it all together, there is one last topic to dive into: backpropagation.
E

Backpropagation

Backpropagation is short for “backward propagation of errors.” In the context of backpropagation, SGD involves updating the network's parameters iteratively based on the gradients computed during each batch of training data. Instead of computing the gradients using the entire training dataset (which can be computationally expensive for large datasets), SGD computes the gradients using small random subsets of the data called mini-batches. Here's an overview of how the backpropagation algorithm works:

1. Forward pass: During the forward pass, input data is fed into the neural
network, and the network's output is computed layer by layer. Each neuron
computes a weighted sum of its inputs, applies an activation function to the
result, and passes the output to the neurons in the next layer.

2. Loss computation: After the forward pass, the network's output is compared
to the true target values, and a loss function is computed to measure the
discrepancy between the predicted output and the actual output.

3. Backward Pass (Gradient Calculation): In the backward pass, the gradients of the loss function with respect to the network's parameters (weights and biases) are computed using the chain rule of calculus. The gradients represent the rate of change of the loss function with respect to each parameter and provide information about how to adjust the parameters to decrease the loss.

4. Parameter update: Once the gradients have been computed, the network's
parameters are updated in the opposite direction of the gradients in order to
minimize the loss function. This update is typically performed using an optimization algorithm such as stochastic gradient descent (SGD), which we discussed earlier.

5. Iterative Process: Steps 1-4 are repeated iteratively for a fixed number of
epochs or until convergence criteria are met. During each iteration, the
network's parameters are adjusted based on the gradients computed in the
backward pass, gradually reducing the loss and improving the model's
performance.
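
The sketch below runs steps 1-4 once for a tiny one-hidden-layer network with a squared-error loss (all sizes and values are our own illustrative assumptions):

import numpy as np

rng = np.random.default_rng(1)
x, y_true = rng.normal(size=3), 1.0
W1, b1 = 0.1 * rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = 0.1 * rng.normal(size=4), 0.0
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

h = sigmoid(W1 @ x + b1)                  # 1. forward pass
y_pred = W2 @ h + b2
loss = (y_pred - y_true) ** 2             # 2. loss computation
dy = 2.0 * (y_pred - y_true)              # 3. backward pass: dL/dy_pred
dW2, db2 = dy * h, dy                     #    output-layer gradients
dz1 = (dy * W2) * h * (1.0 - h)           #    chain rule; sigmoid' = h(1-h)
dW1, db1 = np.outer(dz1, x), dz1          #    hidden-layer gradients
W2, b2 = W2 - 0.1 * dW2, b2 - 0.1 * db2   # 4. parameter update
W1, b1 = W1 - 0.1 * dW1, b1 - 0.1 * db1
print(loss)                               # loss before the update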

5.7 Backpropagation Algorithm Part 1 in Tamil (youtube.com)

5.8 Backpropagation Algorithm Part 2 in Tamil (youtube.com)

Basic Gradient Descent Algorithm

The gradient descent algorithm is an approximate and iterative method for mathematical optimization. You can use it to approach the minimum of any differentiable function.

Note: There are many optimization methods and subfields of mathematical programming. If you want to learn how to use some of them with Python, then check out Scientific Python: Using SciPy for Optimization and Hands-On Linear Programming: Optimization With Python.

Although gradient descent sometimes gets stuck in a local minimum or a saddle point instead
of finding the global minimum, it’s widely used in practice. Data science and machine
learning methods often apply it internally to optimize model parameters. For example, neural
networks find weights and biases with gradient descent.

Cost Function: The Goal of Optimization


The cost function, or loss function, is the function to be minimized (or maximized) by varying
the decision variables. Many machine learning methods solve optimization problems under the
surface. They tend to minimize the difference between actual and predicted outputs by
adjusting the model parameters (like weights and biases for neural networks, decision rules
for random forest or gradient boosting, and so on).

In a regression problem, you typically have the vectors of input variables 𝐱 = (𝑥₁, …, 𝑥ᵣ) and
the actual outputs 𝑦. You want to find a model that maps 𝐱 to a predicted response 𝑓(𝐱) so that
𝑓(𝐱) is as close as possible to 𝑦. For example, you might want to predict an output such as a
person’s salary given inputs like the person’s number of years at the company or level of
education.


Your goal is to minimize the difference between the prediction 𝑓(𝐱) and the actual data 𝑦. This
difference is called the residual.

In this type of problem, you want to minimize the sum of squared residuals (SSR), where SSR
= Σᵢ(𝑦ᵢ − 𝑓(𝐱ᵢ))² for all observations 𝑖 = 1, …, 𝑛, where 𝑛 is the total number of observations.
Alternatively, you could use the mean squared error (MSE = SSR / 𝑛) instead of SSR.

Both SSR and MSE use the square of the difference between the actual and predicted outputs.
The lower the difference, the more accurate the prediction. A difference of zero indicates that
the prediction is equal to the actual data.

SSR or MSE is minimized by adjusting the model parameters. For example, in linear
regression, you want to find the function 𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ, so you need to determine
the weights 𝑏₀, 𝑏₁, …, 𝑏ᵣ that minimize SSR or MSE.
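
As a tiny worked example of these quantities (the numbers are made up):

import numpy as np

y_actual = np.array([3.0, 5.0, 7.5])
y_pred = np.array([2.5, 5.5, 7.0])

residuals = y_actual - y_pred   # yᵢ − f(xᵢ)
ssr = np.sum(residuals ** 2)    # sum of squared residuals
mse = ssr / len(y_actual)       # MSE = SSR / n
print(ssr, mse)                 # 0.75 0.25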

In a classification problem, the outputs 𝑦 are categorical, often either 0 or 1. For example, you
might try to predict whether an email is spam or not. In the case of binary outputs, it’s
convenient to minimize the cross-entropy function that also depends on the actual outputs 𝑦ᵢ

and the corresponding predictions 𝑝(𝐱ᵢ):

H = −Σᵢ [𝑦ᵢ log(𝑝(𝐱ᵢ)) + (1 − 𝑦ᵢ) log(1 − 𝑝(𝐱ᵢ))]

In logistic regression, which is often used to solve classification problems, the functions 𝑝(𝐱)
and 𝑓(𝐱) are defined as the following:

𝑝(𝐱) = 1 / (1 + exp(−𝑓(𝐱))), with 𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ

Again, you need to find the weights 𝑏₀, 𝑏₁, …, 𝑏ᵣ, but this time they should minimize the cross-
entropy function.
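
A small sketch of these definitions in code (the coefficients and data are arbitrary illustrations):

import numpy as np

# Binary cross-entropy with a logistic model p(x) = 1 / (1 + exp(-f(x))).
def cross_entropy(y, p):
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

x1 = np.array([0.5, 1.5, -1.0])
b0, b1 = 0.2, 1.0
p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x1)))  # predicted probabilities
y = np.array([1, 1, 0])                    # actual binary outputs
print(cross_entropy(y, p))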

Gradient of a Function: Calculus Refresher


In calculus, the derivative of a function shows you how much a value changes when you
modify its argument (or arguments). Derivatives are important for optimization because
the zero derivatives might indicate a minimum, maximum, or saddle point.

The gradient of a function 𝐶 of several independent variables 𝑣₁, …, 𝑣ᵣ is denoted by ∇𝐶(𝑣₁, …, 𝑣ᵣ) and defined as the vector function of the partial derivatives of 𝐶 with respect to each independent variable: ∇𝐶 = (∂𝐶/∂𝑣₁, …, ∂𝐶/∂𝑣ᵣ). The symbol ∇ is called nabla.

The nonzero value of the gradient of a function 𝐶 at a given point defines the direction and rate
of the fastest increase of 𝐶. When working with gradient descent, you're interested in the direction of the fastest decrease in the cost function. This direction is determined by the
negative gradient, −∇𝐶.

Intuition Behind Gradient Descent


To understand the gradient descent algorithm, imagine a drop of water sliding down the side
of a bowl or a ball rolling down a hill. The drop and the ball tend to move in the direction of
the fastest decrease until they reach the bottom. With time, they’ll gain momentum and
accelerate.

The idea behind gradient descent is similar: you start with an arbitrarily chosen position of the
point or vector 𝐯 = (𝑣₁, …, 𝑣ᵣ) and move it iteratively in the direction of the fastest decrease of
the cost function. As mentioned, this is the direction of the negative gradient vector, −∇𝐶.

Once you have a random starting point 𝐯 = (𝑣₁, …, 𝑣ᵣ), you update it, or move it to a new
position in the direction of the negative gradient: 𝐯 → 𝐯 − 𝜂∇𝐶, where 𝜂 (pronounced “ee-tah”)
is a small positive value called the learning rate.

The learning rate determines how large the update or moving step is. It's a very important parameter. If 𝜂 is too small, then the algorithm might converge very slowly. Large 𝜂 values can also cause issues with convergence or make the algorithm divergent.
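
A one-dimensional sketch of this update rule on C(v) = v², whose gradient is 2v (the starting point and learning rate are arbitrary):

v, eta = 5.0, 0.1
for _ in range(100):
    grad = 2.0 * v        # ∇C at the current point
    v = v - eta * grad    # move against the gradient: v → v − η∇C
print(v)                  # approaches the minimum at v = 0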
Gradient Descent is an iterative optimization process that searches for an objective function's optimum value (minimum/maximum). It is one of the most used methods for changing a model's parameters in order to reduce a cost function in machine learning projects.

The primary goal of gradient descent is to identify the model parameters that provide the maximum accuracy on both training and test datasets. In gradient descent, the gradient is a vector pointing in the general direction of the function's steepest rise at a particular point. The algorithm might gradually drop towards lower values of the function by moving in the opposite direction of the gradient, until reaching the minimum of the function.

Types of Gradient Descent:


Typically, there are three types of Gradient Descent:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-batch Gradient Descent
Stochastic Gradient Descent (SGD):
Stochastic Gradient Descent (SGD) is a variant of the Gradient Descent algorithm that is used
for optimizing machine learning models. It addresses the computational inefficiency of
traditional Gradient Descent methods when dealing with large datasets in machine learning
projects.
In SGD, instead of using the entire dataset for each iteration, only a single random training
example (or a small batch) is selected to calculate the gradient and update the model
parameters. This random selection introduces randomness into the optimization process,
hence the term “stochastic” in Stochastic Gradient Descent.
The advantage of using SGD is its computational efficiency, especially when dealing with
large datasets. By using a single example or a small batch, the computational cost per iteration
is significantly reduced compared to traditional Gradient Descent methods that require
processing the entire dataset.


Stochastic Gradient Descent Algorithm


• Initialization: Randomly initialize the parameters of the model.
• Set Parameters: Determine the number of iterations and the learning rate (alpha)
for updating the parameters.
• Stochastic Gradient Descent Loop: Repeat the following steps until the model
converges or reaches the maximum number of iterations:
• Shuffle the training dataset to introduce randomness.
• Iterate over each training example (or a small batch) in the shuffled
order.
• Compute the gradient of the cost function with respect to the model parameters using the current training example (or batch).
• Update the model parameters by taking a step in the direction of the negative gradient, scaled by the learning rate.
• Evaluate the convergence criteria, such as the difference in the cost function between iterations.
• Return Optimized Parameters: Once the convergence criteria are met or the
maximum number of iterations is reached, return the optimized model parameters.

In SGD, since only one sample from the dataset is chosen at random for each iteration, the path taken by the algorithm to reach the minima is usually noisier than your typical Gradient Descent algorithm. But that doesn't matter all that much, because the path taken by the algorithm does not matter as long as we reach the minimum, and with a significantly shorter training time.

Figure: batch gradient optimization path vs. stochastic gradient optimization path

Difference between Stochastic Gradient Descent & Batch Gradient Descent

The comparison between Stochastic Gradient Descent (SGD) and Batch Gradient Descent is as follows:

• Dataset Usage: SGD uses a single random sample or a small batch of samples at each iteration; Batch Gradient Descent uses the entire dataset (batch) at each iteration.
• Computational Efficiency: SGD is computationally less expensive per iteration, as it processes fewer data points; Batch Gradient Descent is computationally more expensive per iteration, as it processes the entire dataset.
• Convergence: SGD converges faster due to frequent updates; Batch Gradient Descent converges more slowly due to less frequent updates.
• Noise in Updates: SGD has high noise due to frequent updates with a single or few samples; Batch Gradient Descent has low noise, as it updates parameters using all data points.
• Stability: SGD is less stable, as it may oscillate around the optimal solution; Batch Gradient Descent is more stable, as it converges smoothly towards the optimum.
• Memory Requirement: SGD requires less memory, as it processes fewer data points at a time; Batch Gradient Descent requires more memory to hold the entire dataset in memory.
• Update Frequency: SGD's frequent updates make it suitable for online learning and large datasets; Batch Gradient Descent's less frequent updates make it suitable for smaller datasets.
• Initialization Sensitivity: SGD is less sensitive to initial parameter values due to frequent updates; Batch Gradient Descent is more sensitive to initial parameter values.

Advantages of Stochastic Gradient Descent


• Speed: SGD is faster than other variants of Gradient Descent such as Batch
Gradient Descent and Mini-Batch Gradient Descent since it uses only one
example to update the parameters.
• Memory Efficiency: Since SGD updates the parameters for each training
example one at a time, it is memory-efficient and can handle large datasets that
cannot fit into memory.
• Avoidance of Local Minima: Due to the noisy updates in SGD, it has the ability to escape from local minima and converge toward a global minimum.

Disadvantages of Stochastic Gradient Descent


• Noisy updates: The updates in SGD are noisy and have a high variance, which
can make the optimization process less stable and lead to oscillations around the
minimum.
• Slow Convergence: SGD may require more iterations to converge to the
minimum since it updates the parameters for each training example one at a time.
• Sensitivity to Learning Rate: The choice of learning rate can be critical in SGD
since using a high learning rate can cause the algorithm to overshoot the
minimum, while a low learning rate can make the algorithm converge slowly.
• Less Accurate: Due to the noisy updates, SGD may not converge to the exact
global minimum and can result in a suboptimal solution. This can be mitigated by
using techniques such as learning rate scheduling and momentum-based updates.

UNIT SATURATION (VANISHING GRADIENT PROBLEM)

In the realm of deep learning, the optimization process plays a crucial role in training neural networks. Gradient descent, a fundamental optimization algorithm, can sometimes encounter two common issues: vanishing gradients and exploding gradients. In this article, we will delve into these challenges, providing insights into what they are, why they occur, and how to mitigate them. We will build and train a model, and learn how to face vanishing and exploding problems.
What is Vanishing Gradient?
The vanishing gradient problem is a challenge that emerges during backpropagation when the derivatives or slopes of the activation functions become progressively smaller as we move backward through the layers of a neural network. This phenomenon is particularly prominent in deep networks with many layers, hindering the effective training of the model. When the weight updates become extremely tiny, or even exponentially small, they can significantly prolong the training time and, in the worst-case scenario, halt the training process altogether.
Why the Problem Occurs?
As the gradients propagate back through the layers of the network during backpropagation, they decrease significantly. This means that as they leave the output layer and return to the input layer, the gradients become progressively smaller. As a result, the weights associated with the initial layers, which receive these small gradients, are updated little or not at all at each iteration of the optimization process.
The vanishing gradient problem is particularly associated with the sigmoid and hyperbolic tangent (tanh) activation functions because their derivatives fall within the ranges of 0 to 0.25 and 0 to 1, respectively. Consequently, the weight updates become very small, causing the updated weights to closely resemble the original ones. This persistence of small updates contributes to the vanishing gradient issue.
The sigmoid and tanh functions squash their outputs into the ranges [0, 1] and [-1, 1], saturating at 0 or 1 for sigmoid and at -1 or 1 for tanh. In these saturated regions, especially when inputs are very small or very large, the derivatives become very close to zero. While this may not be a major concern in shallow networks with a few layers, it is a more pronounced issue in deep networks. When the inputs fall in saturated regions, the gradients approach zero, resulting in little update to the weights of the previous layer. In simple networks this does not pose much of a problem, but as more layers are added, these small gradients, which multiply between layers, decay significantly; consequently, the first layers learn very slowly, which hinders overall model performance and can lead to convergence failure.
How can we identify?
Identifying the vanishing gradient problem typically involves monitoring the training
dynamics of a deep neural network.
• One key indicator is observing model weights converging to 0 or stagnation in
the improvement of the model’s performance metrics over training epochs.
• During training, if the loss function fails to decrease significantly, or if there is
erratic behavior in the learning curves, it suggests that the gradients may be
vanishing.
• Additionally, examining the gradients themselves during backpropagation can
provide insights. Visualization techniques, such as gradient histograms or
norms, can aid in assessing the distribution of gradients throughout the network.
How can we solve the issue?
• Batch Normalization: Batch normalization normalizes the inputs of each layer,
reducing internal covariate shift. This can help stabilize and accelerate the training
process, allowing for more consistent gradient flow.
• Activation function: An activation function like the Rectified Linear Unit (ReLU) can be used. With ReLU, the gradient is 0 for negative or zero input and 1 for positive input, which helps alleviate the vanishing gradient issue: ReLU replaces negative input values with 0 and passes positive input values through unchanged (see the calculation after this list).
C
• Skip Connections and Residual Networks (ResNets): Skip connections, as seen in ResNets, allow the gradient to bypass certain layers during backpropagation. This facilitates the flow of information through the network, preventing gradients from vanishing.
• Long Short-Term Memory Networks (LSTMs) and Gated Recurrent Units (GRUs): In the context of recurrent neural networks (RNNs), architectures like LSTMs and GRUs are designed to address the vanishing gradient problem in sequences by incorporating gating mechanisms.


• Gradient Clipping: Gradient clipping imposes a threshold on the gradients during backpropagation. Limiting the magnitude of gradients mainly prevents them from exploding, which can also hinder learning.
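
The saturation argument above can be made concrete with a quick calculation (illustrative only): backpropagation multiplies one activation-derivative factor per layer, so with sigmoid (whose derivative is at most 0.25) the product shrinks geometrically with depth, while ReLU's unit gradient passes through unchanged:

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
d_sigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

factor = d_sigmoid(0.0)  # best case for sigmoid: derivative peaks at 0.25
print(factor ** 10)      # ~9.5e-07 after 10 layers: the gradient vanishes
print(1.0 ** 10)         # ReLU's gradient of 1 for positive inputs survives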

Difference between Neural Networks and Deep Learning Systems

1. Definition: A neural network is a model of neurons inspired by the human brain. It is made up of many neurons that are interconnected with each other. Deep learning neural networks are distinguished from neural networks on the basis of their depth, or number of hidden layers.

2. Architecture: Typical neural network architectures are feedforward neural networks, recurrent neural networks, and symmetrically connected neural networks. Typical deep learning architectures are recursive neural networks, unsupervised pre-trained networks, and convolutional neural networks.

3. Structure: A neural network is described by its neurons, connections and weights, propagation function, and learning rate. Deep learning systems are characterized in this comparison by the hardware they demand: motherboards, PSU, RAM, and processors.

4. Time & Accuracy: Neural networks generally take less time to train but have lower accuracy than deep learning systems; deep learning systems generally take more time to train but have higher accuracy than neural networks.

5. Performance: Neural networks give low performance compared to deep learning networks; deep learning networks give high performance compared to neural networks.

6. Task Interpretation: A task is poorly interpreted by a plain neural network, whereas a deep learning network perceives the task more effectively.

7. Applications: The ability to model non-linear processes makes neural networks excellent tools for addressing a variety of issues, including classification, pattern recognition, prediction and analysis, clustering, decision making, machine learning, deep learning, and more. Deep learning models can be used in a variety of industries, including pattern recognition, speech recognition, natural language processing, computer games, self-driving cars, social network filtering, and more.

8. Critique: Neural network criticism has centered on training problems, theoretical problems, hardware problems, real-world counterexamples to criticisms, and hybrid techniques. Deep learning criticism has centered on theory, errors, cyberthreats, etc.

ReLU

Activation functions in neural networks and deep learning play a significant role in
igniting the hidden nodes to produce a more desirable output. The main purpose of the activation function is to introduce the property of nonlinearity into the model.

What Is the ReLU Activation Function?


The rectified linear unit (ReLU) or rectifier activation function introduces the property of nonlinearity to a deep learning model and solves the vanishing gradients issue. It returns the positive part of its argument. It is one of the most popular activation functions in deep learning.

In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs. A standard integrated circuit can be seen as a digital network of activation functions that can be “ON” or “OFF,” depending on the input.

An example of the sigmoid activation function. | Image: Wikipedia


An example of a tanh graph. | Image: Wikipedia
Sigmoid and tanh were monotonic, differentiable, and previously more popular activation functions. However, these functions suffer saturation over time, and this leads to problems occurring with vanishing gradients. An alternative and the most popular activation function to overcome this issue is the Rectified Linear Unit (ReLU).

The diagram below with the blue line is the representation of the Rectified Linear Unit (ReLU), whereas the green line is a variant of ReLU called Softplus. The other variants of ReLU include leaky ReLU, exponential linear unit (ELU), and Sigmoid linear unit (SiLU), etc., which are used to improve performance in some tasks.



Example of a ReLU activation function (blue) and Softplus (green). | Image: Wikipedia


In this article, we’ll only look at the rectified linear unit (ReLU) because it’s still the most used
activation function by default for performing a majority of deep learning tasks. Its variants are
typically used for specific purposes in which they might have a slight edge over the ReLU.

This activation function was first introduced to a dynamical network by Hahnloser et al. in
2000, with strong biological motivations and mathematical justifications. It was demonstrated
for the first time in 2011 as a way to enable better training of deeper networks compared to
other widely used activation functions including the logistic sigmoid (which is inspired
by probability theory and logistic regression) and the hyperbolic tangent.

The rectifier is, as of 2017, the most popular activation function for deep neural networks. A
unit employing the rectifier is also called a rectified linear unit (ReLU).

Why Is ReLU a Good Activation Function?

The main reason ReLU wasn’t used until more recently is because it was not differentiable at
the point zero. Researchers tended to use differentiable activation functions like sigmoid and
tanh. However, ReLU is now widely regarded as the default activation function for deep learning.

Equation for the ReLU activation function. | Image: Wikipedia
The ReLU activation function is differentiable at all points except at zero. For values greater than zero, we just consider the max of the function. This can be written as:

f(z) = max(0, z)
In simple terms, this can also be written as follows:

if input > 0:
    return input
else:
    return 0
All the negative values default to zero, and the maximum for the positive number is taken into
consideration.

For the computation of the backpropagation of neural networks, the differentiation for the
ReLU is relatively easy. The only assumption we will make is the derivative at the point zero,
which will also be considered as zero. This is usually not such a big concern, and it works well
for the most part. The derivative of the function is the value of the slope. The slope for negative
values is 0.0, and the slope for positive values is 1.0.
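
In NumPy, ReLU and the derivative used during backpropagation can be sketched as follows (taking the derivative at zero to be 0, as noted above):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_derivative(z):
    return (z > 0).astype(float)  # slope 0 for z <= 0, slope 1 for z > 0

z = np.array([-2.0, 0.0, 3.5])
print(relu(z))             # [0.  0.  3.5]
print(relu_derivative(z))  # [0. 0. 1.]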

The main advantages of the ReLU activation function are:

1. Convolutional layers and deep learning: It is the most popular activation function for
training convolutional layers and deep learning models.


2. Computational simplicity: The rectifier function is trivial to implement, requiring only a max() function.
3. Representational sparsity: An important benefit of the rectifier function is that it is
capable of outputting a true zero value.
4. Linear behavior: A neural network is easier to optimize when its behavior is linear or
close to linear.

Disadvantages of the ReLU Activation Function

The main issue with ReLU is that all the negative values become zero immediately, which
decreases the ability of the model to fit or train from the data properly.

That means any negative input given to the ReLU activation function turns the value into zero
immediately in the graph, which in turn affects the resulting graph by not mapping the negative
values appropriately. This can however be easily fixed by using the different variants of the
ReLU activation function, like the leaky ReLU and other functions discussed earlier in the
article.

E
This is just a short introduction to the rectified linear unit and its importance in deep learning
technology today. It’s more popular than all other activation functions, and for good reason.

Hyperparameter Tuning
Hyperparameter tuning is the process of selecting the optimal values for a machine learning model's hyperparameters. Hyperparameters are settings that control the learning process of the model, such as the learning rate, the number of neurons in a neural network, or the kernel size in a support vector machine. The goal of hyperparameter tuning is to find the values that lead to the best performance on a given task.
What are Hyperparameters?
In the context of machine learning, hyperparameters are configuration variables that are set before the training process of a model begins. They control the learning process itself, rather than being learned from the data. Hyperparameters are often used to tune the performance of a model, and they can have a significant impact on the model's accuracy, generalization, and other metrics.
Different Ways of Hyperparameter Tuning
Hyperparameters are configuration variables that control the learning process of a machine
learning model. They are distinct from model parameters, which are the weights and biases
that are learned from the data. There are several different types of hyperparameters:
Hyperparameters in Neural Networks
Neural networks have several essential hyperparameters that need to be adjusted, including:
• Learning rate: This hyperparameter controls the step size taken by the
optimizer during each iteration of training. Too small a learning rate can result in
slow convergence, while too large a learning rate can lead to instability and
divergence.
• Epochs: This hyperparameter represents the number of times the entire training
dataset is passed through the model during training. Increasing the number of
epochs can improve the model’s performance but may lead to overfitting if not
done carefully.


• Number of layers: This hyperparameter determines the depth of the model, which can have a significant impact on its complexity and learning ability.
• Number of nodes per layer: This hyperparameter determines the width of the
model, influencing its capacity to represent complex relationships in the data.
• Architecture: This hyperparameter determines the overall structure of the
neural network, including the number of layers, the number of neurons per
layer, and the connections between layers. The optimal architecture depends on
the complexity of the task and the size of the dataset
• Activation function: This hyperparameter introduces non-linearity into the
model, allowing it to learn complex decision boundaries. Common activation
functions include sigmoid, tanh, and Rectified Linear Unit (ReLU).
Hyperparameters in Support Vector Machine
We take into account some essential hyperparameters for fine-tuning SVMs:
• C: The regularization parameter that controls the trade-off between the margin
and the number of training errors. A larger value of C penalizes training errors
more heavily, resulting in a smaller margin but potentially better generalization
performance. A smaller value of C allows for more training errors but may lead to
overfitting.

• Kernel: The kernel function that defines the similarity between data points. Different kernels can capture different relationships between data points, and the choice of kernel can significantly impact the performance of the SVM. Common kernels include linear, polynomial, radial basis function (RBF), and sigmoid.
• Gamma: The parameter that controls the influence of support vectors on the decision boundary. A larger value of gamma indicates that nearby support vectors have a stronger influence, while a smaller value indicates that distant support vectors have a weaker influence. The choice of gamma is particularly important for RBF kernels.


Hyperparameters in XGBoost
The following essential XGBoost hyperparameters need to be adjusted:
• learning_rate: This hyperparameter determines the step size taken by the optimizer during each iteration of training. A larger learning rate can lead to faster convergence, but it may also increase the risk of overfitting. A smaller learning rate may result in slower convergence but can help prevent overfitting.
• n_estimators: This hyperparameter determines the number of boosting trees to
be trained. A larger number of trees can improve the model’s accuracy, but it can
also increase the risk of overfitting. A smaller number of trees may result in lower
accuracy but can help prevent overfitting.
• max_depth: This hyperparameter determines the maximum depth of each tree
in the ensemble. A larger max_depth can allow the trees to capture more complex
relationships in the data, but it can also increase the risk of overfitting. A smaller
max_depth may result in less complex trees but can help prevent overfitting.
• min_child_weight: This hyperparameter determines the minimum sum of
instance weight (hessian) needed in a child node. A larger min_child_weight can
help prevent overfitting by requiring more data to influence the splitting of trees.
A smaller min_child_weight may allow for more aggressive tree splitting but can
increase the risk of overfitting.
• subsample: This hyperparameter determines the percentage of rows used for
each tree construction. A smaller subsample can improve the efficiency of training but may reduce the model's accuracy. A larger subsample can increase the
accuracy but may make training more computationally expensive.
Some other examples of model hyperparameters include:
1. The penalty in Logistic Regression Classifier i.e. L1 or L2 regularization
2. Number of Trees and Depth of Trees for Random Forests.
3. The learning rate for training a neural network.
4. Number of Clusters for Clustering Algorithms.
5. The k in k-nearest neighbors.
Hyperparameter Tuning techniques
Models can have many hyperparameters and finding the best combination of
parameters can be treated as a search problem. The three main strategies for hyperparameter tuning are:
1. GridSearchCV
2. RandomizedSearchCV
3. Bayesian Optimization

1. GridSearchCV
Grid search can be considered a “brute force” approach to hyperparameter optimization. We fit the model using all possible combinations after creating a grid of potential discrete hyperparameter values. We log each set's model performance and then choose the combination that produces the best results. This approach is called GridSearchCV, because it searches for the best set of hyperparameters from a grid of hyperparameter values.
An exhaustive approach that can identify the ideal hyperparameter combination is grid search. But the slowness is a disadvantage. It often takes a lot of processing power and time to fit the model with every potential combination, which might not be available.
For example, suppose we want to set two hyperparameters, C and Alpha, of the Logistic Regression Classifier model with different sets of values. The grid search technique will construct many versions of the model with all possible combinations of hyperparameters and will return the best one.
As in the image, take C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4]. For the combination of C = 0.3 and Alpha = 0.2, the performance score comes out to be 0.726 (highest), therefore it is selected.
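
A minimal GridSearchCV sketch with scikit-learn, mirroring the C example above (the dataset is a stand-in toy dataset, not from the notes):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 0.2, 0.3, 0.4, 0.5]}  # grid of candidate values
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)  # fits and cross-validates one model per grid point
print(search.best_params_, search.best_score_)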


2. RandomizedSearchCV
As the name suggests, the random search method selects values at random as opposed to the
grid search method’s use of a predetermined set of numbers. Every iteration, random search
attempts a different set of hyperparameters and logs the model’s performance. It returns the
combination that provided the best outcome after several iterations. This approach reduces
unnecessary computation.
RandomizedSearchCV solves the drawbacks of GridSearchCV, as it goes through only a
fixed number of hyperparameter settings. It moves within the grid in a random fashion to find
the best set of hyperparameters. The advantage is that, in most cases, a random search will
produce a comparable result faster than a grid search.
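
A corresponding RandomizedSearchCV sketch, which samples a fixed number of settings (n_iter) instead of exhausting the grid (again with a stand-in dataset):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)
param_distributions = {"n_estimators": [50, 100, 200],
                       "max_depth": [2, 4, 8, None]}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions, n_iter=5, cv=5,
                            random_state=0)
search.fit(X, y)  # tries 5 randomly chosen combinations
print(search.best_params_, search.best_score_)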

3. Bayesian Optimization
Grid search and random search are often inefficient because they evaluate many unsuitable
hyperparameter combinations without considering the previous iterations’ results. Bayesian
optimization, on the other hand, treats the search for optimal hyperparameters as an
optimization problem. It considers the previous evaluation results when selecting the next
hyperparameter combination and applies a probabilistic function to choose the combination
that will likely yield the best results. This method discovers a good hyperparameter combination in relatively few iterations.
Data scientists use a probabilistic model when the objective function is unknown. The probabilistic model estimates the probability of a hyperparameter combination's objective function result based on past evaluation results:
P(score(y)|hyperparameters(x))
It is a “surrogate” of the objective function, which can be the root-mean-square error (RMSE), for example. The objective function is calculated using the training data with the hyperparameter combination, and we try to optimize it (maximize or minimize, depending on the objective function selected).
Applying the probabilistic model to the hyperparameters is computationally inexpensive compared to the objective function. Therefore, this method typically updates and improves the surrogate probability model every time the objective function runs. Better hyperparameter predictions decrease the number of objective function evaluations needed to achieve a good result. Gaussian processes, random forest regression, and tree-structured Parzen estimators (TPE) are examples of surrogate models.
The Bayesian optimization model is complex to implement, but off-the-shelf libraries like
Ray Tune can simplify the process. It’s worth using this type of model because it finds an
adequate hyperparameter combination in relatively few iterations. However, compared to
grid search or random search, we must compute Bayesian optimization sequentially, so it
doesn’t allow distributed processing. Therefore, Bayesian optimization takes longer yet uses
fewer computational resources.
Drawback: Requires an understanding of the underlying probabilistic model.
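Below is a hedged sketch of the idea using the scikit-optimize library (assuming it is installed, e.g. via pip install scikit-optimize); gp_minimize fits a Gaussian-process surrogate to the evaluations seen so far and proposes the next hyperparameter to try. The dataset and search range are illustrative assumptions.

# Bayesian-optimization sketch with scikit-optimize (assumed installed);
# a Gaussian process acts as the surrogate model.
from skopt import gp_minimize
from skopt.space import Real
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

def objective(params):
    (C,) = params
    model = LogisticRegression(C=C, max_iter=1000)
    # gp_minimize minimizes, so return the negative cross-validated accuracy
    return -cross_val_score(model, X, y, cv=5).mean()

result = gp_minimize(objective,
                     [Real(1e-3, 10.0, prior="log-uniform")],  # search space for C
                     n_calls=20, random_state=0)
print(result.x, -result.fun)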
Challenges in Hyperparameter Tuning
• Dealing with High-Dimensional Hyperparameter Spaces: Efficient Exploration
and Optimization
• Handling Expensive Function Evaluations: Balancing Computational Efficiency
and Accuracy
• Incorporating Domain Knowledge: Utilizing Prior Information for Informed
Tuning
• Developing Adaptive Hyperparameter Tuning Methods: Adjusting Parameters
During Training

Applications of Hyperparameter Tuning


• Model Selection: Choosing the Right Model Architecture for the Task
• Regularization Parameter Tuning: Controlling Model Complexity for Optimal
Performance
• Feature Preprocessing Optimization: Enhancing Data Quality and Model
Performance
• Algorithmic Parameter Tuning: Adjusting Algorithm-Specific Parameters for
Optimal Results
Advantages of Hyperparameter tuning:
• Improved model performance
• Reduced overfitting and underfitting
• Enhanced model generalizability
• Optimized resource utilization
• Improved model interpretability
Disadvantages of Hyperparameter tuning:
• Computational cost
• Time-consuming process
• Risk of overfitting

• No guarantee of optimal performance
• Requires expertise O
C
Batch Normalization
Batch normalization is a deep learning approach that has been shown to significantly improve the efficiency and reliability of neural network models. It is particularly useful for training very deep networks, as it can help to reduce the internal covariate shift that can occur during training.

• Batch normalization is a supervised learning method for normalizing the interlayer outputs of a neural network. As a result, the next layer receives a “reset” of the output distribution from the preceding layer, allowing it to analyze the data more effectively.

The term “internal covariate shift” describes the effect that updating the parameters of the preceding layers has on the distribution of inputs to the current layer during deep learning training. This can make the optimization process more difficult and can slow down the convergence of the model.

Since normalization guarantees that no activation value is too high or too low, and since it enables each layer to learn more independently of the others, this strategy permits quicker learning rates. By standardizing inputs, the amount of information lost between processing stages may also be decreased, which ultimately leads to a marked increase in accuracy across the board.


How does batch normalization work?


Batch normalization improves the performance of a deep learning network by first subtracting the batch mean from a layer’s activations and then dividing by the batch standard deviation. If this standardization hurts the loss, stochastic gradient descent can compensate by shifting or scaling the outputs through learned parameters, which in turn affects the weights in the following layer.

When applied to a layer, batch normalization multiplies its normalized output by a standard deviation parameter (gamma) and adds a mean parameter (beta) to it; both are trainable. Thanks to this synergy between batch normalization and gradient descent, the data may be “denormalized” by adjusting just these two weights for each output, which reduces information loss and improves network stability.

The goal of batch normalization is to stabilize the training process and improve the generalization ability of the model. It can also reduce the need for careful initialization of the model’s weights and can allow the use of higher learning rates, which can speed up the training process.

It is common practice to apply batch normalization prior to a layer’s activation function, and it is commonly used in tandem with other regularization methods such as dropout. It is a widely used technique in modern deep learning and has been shown to be effective in a variety of tasks, including image classification, natural language processing, and machine translation.
Advantages of batch normalization

• Stabilizes the training process. Batch normalization can help to reduce the internal covariate shift that occurs during training, which improves the stability of the training process and makes the model easier to optimize.

• Improves generalization. By normalizing the activations of a layer, batch normalization can help to reduce overfitting and improve the generalization ability of the model.

• Reduces the need for careful initialization. Batch normalization can reduce the sensitivity of the model to the initial weights, making it easier to train.

• Allows for higher learning rates. Batch normalization can allow the use of higher learning rates, which can speed up the training process.

Batch normalization overfitting


While batch normalization can help to reduce overfitting, it is not a guarantee that a model will
not overfit. Overfitting can still occur if the model is too complex for the amount of training
data, if there is a lot of noise in the data, or if there are other issues with the training process. It
is important to use other regularization techniques like dropout, and to monitor the performance
of the model on a validation set during training to ensure that it is not overfitting.


Batch normalization equations


During training, the activations of a layer are normalized for each mini-batch of data using the
following equations:

• Mean: mean = (1/m) ∑i=1 to m xi

• Variance: variance = (1/m) ∑i=1 to m (xi – mean)^2

• Normalized activations: yi = (xi – mean) / sqrt(variance + ε)

• Scaled and shifted activations: zi = γyi + β, where γ and β are learned parameters

During inference, the activations of a layer are normalized using the mean and variance calculated during training (typically as running averages), rather than the statistics of the current mini-batch:

• Normalized activations: yi = (xi – mean) / sqrt(variance + ε)

E
• Scaled and shifted activations: zi = γyi + β

Batch normalization in PyTorch O


C
In PyTorch, batch normalization can be implemented using the BatchNorm2d module, which
can be applied to the output of a convolutional layer. For example:
E

import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.BatchNorm2d(num_features=16),  # normalizes the 16 output channels
    nn.ReLU(),
    # ...
)
The BatchNorm2d module takes the number of channels (i.e., the number of feature maps) in the input as an argument and computes its normalization statistics over the batch and spatial dimensions (height and width) for each channel. The module also has learnable parameters for scaling and shifting the normalized activations, which are updated during training.


Regularization in Machine Learning


While developing machine learning models, you may have encountered a situation in which the training accuracy of the model is high but the validation or testing accuracy is low. This is popularly known as overfitting, and it is the last thing a machine learning practitioner wants in a model. In this section, we will learn about regularization, a method that helps us solve the problem of overfitting. But first, let’s understand the role of regularization and what underfitting and overfitting are.
Role Of Regularization

Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function, discouraging the model from assigning too much importance to individual features or coefficients.
Let’s explore the role of regularization in more detail:
1. Complexity Control: Regularization helps control model complexity by preventing overfitting to the training data, resulting in better generalization to new data.
2. Preventing Overfitting: Regularization penalizes large coefficients and constrains their magnitudes, preventing a model from becoming overly complex and memorizing the training data instead of learning its underlying patterns.
3. Balancing Bias and Variance: Regularization can help balance the trade-off between model bias (underfitting) and model variance (overfitting), which leads to improved performance.
4. Feature Selection: Some regularization methods, such as L1 regularization (Lasso), promote sparse solutions that drive some feature coefficients to zero. This automatically selects important features while excluding less important ones.
5. Handling Multicollinearity: When features are highly correlated (multicollinearity), regularization can stabilize the model by reducing coefficient sensitivity to small data changes.
6. Generalization: Regularized models learn the underlying patterns of the data for better generalization to new data, instead of memorizing specific examples.
What are Overfitting and Underfitting?

Overfitting occurs when a machine learning model is fitted too closely to the training set and is not able to perform well on unseen data. This happens when our model learns the noise in the training data as well, i.e., it memorizes the training data instead of learning the patterns in it.
Underfitting, on the other hand, occurs when our model is not able to learn even the basic patterns in the dataset. An underfitting model is unable to perform well even on the training data, so we cannot expect it to perform well on the validation data. In this case, we should increase the complexity of the model or add more features to the feature set.

What are Bias and Variance?


Bias refers to the errors that occur when we try to fit a statistical model to real-world data that does not fit perfectly well into any mathematical model. If we use too simplistic a model to fit the data, we are more likely to face high bias, which refers to the case when the model is unable to learn the patterns in the data at hand and hence performs poorly.
Variance refers to the error that occurs when we make predictions on data not previously seen by the model. High variance occurs when the model has learned noise that is present in the training data.
Finding a proper balance between the two, also known as the bias-variance tradeoff, can help keep the model from overfitting to the training data.
Different Combinations of Bias-Variance
There can be four combinations between bias and variance:

• High Bias, Low Variance: A model with high bias and low variance is considered to be underfitting.
• High Variance, Low Bias: A model with high variance and low bias is considered to be overfitting.
• High Bias, High Variance: A model with high bias and high variance cannot capture the underlying patterns and is also too sensitive to changes in the training data. On average, the model will generate unreliable and inconsistent predictions.
• Low Bias, Low Variance: A model with low bias and low variance can capture the data patterns and handle variations in the training data. This is the ideal scenario for a machine learning model, where it generalizes well to unseen data and makes consistent, accurate predictions. In reality, however, this is rarely feasible.
Bias Variance tradeoff


The bias-variance tradeoff is a fundamental concept in machine learning. It refers to the balance
between bias and variance, which affect predictive model performance. Finding the right
tradeoff is crucial for creating models that generalize well to new data.
• The bias-variance tradeoff demonstrates the inverse relationship between bias and
variance. When one decreases, the other tends to increase, and vice versa.
• Finding the right balance is crucial. An overly simple model with high bias won’t
capture the underlying patterns, while an overly complex model with high variance
will fit the noise in the data.

Regularization in Machine Learning
Regularization is a technique used to reduce errors by fitting the function appropriately on the given training set and avoiding overfitting. The commonly used regularization techniques are:
1. Lasso Regularization – L1 Regularization
2. Ridge Regularization – L2 Regularization
3. Elastic Net Regularization – L1 and L2 Regularization

Lasso Regression
A regression model that uses the L1 regularization technique is called LASSO (Least Absolute Shrinkage and Selection Operator) regression. Lasso regression adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function (L). Lasso regression also helps us achieve feature selection by penalizing weights down to approximately zero if the corresponding feature does not serve any purpose in the model.
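In its standard form (with λ as the regularization strength), the Lasso cost function can be written as:

Cost = (1/n) ∑i=1 to n (y_i – y_i(hat))^2 + λ ∑j=1 to m |w_j|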

where,


• m – Number of Features
• n – Number of Examples
• y_i – Actual Target Value
• y_i(hat) – Predicted Target Value
Ridge Regression
A regression model that uses the L2 regularization technique is called Ridge regression. Ridge regression adds the “squared magnitude” of the coefficients as a penalty term to the loss function (L).
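In the same notation, the Ridge cost function can be written as:

Cost = (1/n) ∑i=1 to n (y_i – y_i(hat))^2 + λ ∑j=1 to m (w_j)^2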

Elastic Net Regression


This model combines L1 and L2 regularization: we add both the absolute norm of the weights and the squared magnitude of the weights, with an extra hyperparameter that controls the ratio of the L1 and L2 penalties.
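With a mixing hyperparameter α ∈ [0, 1] controlling that ratio, the Elastic Net cost function can be written as:

Cost = (1/n) ∑i=1 to n (y_i – y_i(hat))^2 + λ (α ∑j=1 to m |w_j| + (1 – α) ∑j=1 to m (w_j)^2)

A minimal scikit-learn sketch of the three regularized regressors follows; the synthetic dataset and the penalty strengths are illustrative assumptions.

# Lasso (L1), Ridge (L2), and Elastic Net in scikit-learn; the dataset
# and alpha values are assumptions for illustration.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge, ElasticNet

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

lasso = Lasso(alpha=0.1).fit(X, y)                     # L1: drives some weights to zero
ridge = Ridge(alpha=0.1).fit(X, y)                     # L2: shrinks all weights
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mix of L1 and L2

print(lasso.coef_)  # note the exact zeros produced by the L1 penalty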

Benefits of Regularization

1. Regularization improves model generalization by reducing overfitting. Regularized models learn underlying patterns, while overfit models memorize noise in the training data.
2. Regularization techniques such as L1 (Lasso) simplify models and improve interpretability by reducing the coefficients of less important features to zero.
3. Regularization improves model performance by preventing excessive weighting of outliers or irrelevant features.
4. Regularization makes models stable across different subsets of the data. It reduces the sensitivity of model outputs to minor changes in the training set.
5. Regularization prevents models from becoming overly complex, which is especially important when dealing with limited data or noisy environments.
6. Regularization can help handle multicollinearity (high correlation between features) by reducing the magnitudes of correlated coefficients.
7. Regularization introduces hyperparameters (e.g., alpha or lambda) that control the strength of regularization. This allows fine-tuning models to achieve the right balance between bias and variance.
8. Regularization promotes consistent model performance across different datasets. It reduces the risk of dramatic performance changes when encountering new data.

AL3451_ML
4931_Grace College of Engineering, Thoothukudi

Dropout

In machine learning, “dropout” refers to the practice of disregarding certain nodes in a layer at random during training. Dropout regularization in deep learning is a regularization approach that prevents overfitting by ensuring that no units are codependent with one another.

Dropout Regularization

If you train your model too much on the training data, it might overfit, and it will probably not perform well when you make predictions on the actual test data. Dropout regularization is one technique used to tackle overfitting problems in deep learning.

We will go over some theory first and then see, with PyTorch code, how adding a dropout layer affects the performance of a neural network.

Training with Drop-Out Layers
Dropout is a regularization method that approximates concurrent training of many neural networks with various architectures. During training, the network randomly ignores or drops some layer outputs. This changes the layer’s apparent structure and connectivity compared to the preceding layer. In practice, each training update gives the layer a different perspective.

Dropout makes the training process noisy, requiring nodes within a layer to take on more or less responsibility for the inputs on a probabilistic basis.

Under this conception, dropout may break apart situations in which network layers co-adapt to fix mistakes committed by prior layers, making the model more robust. Dropout is implemented per layer in a neural network. It works with the vast majority of layers, including dense (fully connected), convolutional, and recurrent layers such as the long short-term memory (LSTM) layer. Dropout can be applied to any or all of the network’s hidden layers as well as the visible or input layer, but it is not used on the output layer.

Dropout Implementation

Using torch.nn, you can easily add dropout to your PyTorch models. The Dropout class takes the dropout rate (the likelihood of deactivating a neuron) as a parameter.

self.dropout = nn.Dropout(0.25)

Dropout can be used after any non-output layer.

To investigate the impact of dropout, we train an image classification model, starting with an unregularized network and then training a regularized network that uses dropout. The CIFAR-10 dataset is used to train both models over 15 epochs.


A complete example of introducing dropout to a PyTorch model is provided below; the two helper methods referenced in the listing are filled in here, as a reasonable reconstruction, so that the class runs as written.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, input_shape=(3, 32, 32)):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, 3)
        self.conv2 = nn.Conv2d(32, 64, 3)
        self.conv3 = nn.Conv2d(64, 128, 3)
        self.pool = nn.MaxPool2d(2, 2)
        n_size = self._get_conv_output(input_shape)  # flattened feature size
        self.fc1 = nn.Linear(n_size, 512)
        self.fc2 = nn.Linear(512, 10)
        self.dropout = nn.Dropout(0.25)

    def _forward_features(self, x):
        # Convolutional feature extractor: conv -> ReLU -> pool, three times
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        return x

    def _get_conv_output(self, shape):
        # Pass a dummy batch through the feature extractor to infer the
        # number of inputs to the first fully connected layer
        with torch.no_grad():
            x = torch.rand(1, *shape)
            x = self._forward_features(x)
        return x.view(1, -1).size(1)

    def forward(self, x):
        x = self._forward_features(x)
        x = x.view(x.size(0), -1)  # flatten
        x = self.dropout(x)        # apply dropout before the first FC layer
        x = F.relu(self.fc1(x))
        x = self.dropout(x)        # apply dropout again after the activation
        x = self.fc2(x)
        return x
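One practical detail: PyTorch dropout layers behave differently in training and evaluation mode, so the mode should be set explicitly.

model = Net()
model.train()  # dropout active: neurons are randomly zeroed during training
model.eval()   # dropout disabled: the full network is used for inference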
An unregularized network overfits quickly on the training dataset: the validation loss for the no-dropout run diverges dramatically after only a few epochs, which is why its generalization error grows.
Overfitting is avoided by training with two dropout layers and a dropout probability of 25%. However, this lowers training accuracy, so the regularized network needs to be trained over a longer period.
Dropout improves model generalization. Although the training accuracy is lower than that of the unregularized network, the overall validation accuracy has improved, which is why the generalization error has decreased.

Why will dropout help with overfitting?

• A neuron cannot rely on any single input, since that input might be randomly dropped out.
• Neurons will not learn redundant details of the inputs.

Other Popular Regularization Techniques

When combating overfitting, dropout is far from the only choice. Commonly used regularization techniques include:

• Early stopping: automatically terminates training when a performance measure (e.g., validation loss or accuracy) ceases to improve.
• Weight decay: adds a penalty to the loss function to motivate the network to use smaller weights (see the sketch after this list).
• Noise: allows some random variations in the data through augmentation (which makes the network robust to a larger distribution of inputs and hence improves generalization).
• Model combination: the outputs of separately trained neural networks are averaged (which requires a lot of computational power, data, and time).
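A minimal PyTorch sketch of weight decay, which most optimizers expose directly as a parameter; the learning rate and decay strength here are illustrative assumptions.

# Weight decay (an L2-style penalty on the weights) via the optimizer;
# the lr and weight_decay values are assumptions for illustration.
import torch.optim as optim

model = Net()  # the dropout network defined above
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)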
E

Dropout Regularization Hyperparameters


In deep learning regularization, researchers have found that a high momentum and a large, decaying learning rate are effective hyperparameter values to use with dropout. Limiting our weight vectors using dropout allows us to employ a high learning rate without fear of the weights blowing up. Dropout noise, along with the large decaying learning rate, allows us to explore alternative areas of the loss function and, hopefully, reach a better minimum.

The Drawbacks of Dropout

Although dropout is a potent tool, it has certain downsides. A dropout network may take 2–3 times longer to train than a normal network. Finding a regularizer virtually equivalent to a dropout layer is one way to reap the benefits of dropout without slowing down training; for linear regression, such a regularizer is a modified variant of L2 regularization. An analogous regularizer for more complex models has yet to be discovered.
