UNIT-2 Machine Learning
MULTI-LAYER PERCEPTRON
A Multi-Layer Perceptron (MLP) consists of fully connected dense layers that transform
input data from one dimension to another. It is called “multi-layer” because it contains an input
layer, one or more hidden layers, and an output layer. The purpose of an MLP is to model
complex relationships between inputs and outputs, making it a powerful tool for various
machine learning tasks. MLP (Multi-Layer Perceptron) is primarily used for supervised
learning, as it is a type of artificial neural network that requires labeled data to train and learn
relationships between input features and target outputs, making it suitable for tasks like
classification and regression.
Every connection in the diagram is a representation of the fully connected nature of an MLP.
This means that every node in one layer connects to every node in the next layer. As the data
moves through the network, each layer transforms it until the final output is generated in the
output layer.
WORKING OF MULTI-LAYER PERCEPTRON
Step 1: Forward Propagation
1. Weighted Sum: The neuron computes the weighted sum of the inputs:
z = w1·x1 + w2·x2 + … + wn·xn + b
where the xi are the inputs, the wi are the corresponding weights, and b is the bias term.
2. Activation Function: The activation function decides whether a neuron should be activated, based on the weighted sum of the inputs plus the bias term. This helps the model make complex decisions and predictions by introducing non-linearities into the output of each neuron.
Neural networks consist of neurons that operate using weights, biases, and activation
functions.
Without non-linearity, even deep networks would be limited to solving only simple,
linearly separable problems. Activation functions empower neural networks to model
highly complex data distributions and solve advanced deep learning tasks. Adding non-linear
activation functions introduces flexibility and enables the network to learn more complex and
abstract patterns from the data.
Two common choices of activation function are:
1. Sigmoid: f(x) = 1 / (1 + e^(−x))
Here, e is the base of the natural logarithm (approximately equal to 2.71828), and x is the
input to the function; the output always lies between 0 and 1.
2. ReLU (Rectified Linear Unit): In mathematical terms, the ReLU function can be written as:
f(x) = max(0, x)
Where:
• x is the input to the neuron.
• The function returns x if x is greater than 0.
• If x is less than or equal to 0, the function returns 0.
Step 2: Loss Function
For a classification problem, the commonly used binary cross-entropy (BCE) loss function is:
BCE = −(1/N) Σ [ y_i·log(ŷ_i) + (1 − y_i)·log(1 − ŷ_i) ]
For regression problems, the mean squared error (MSE) is often used:
MSE = (1/N) Σ (y_i − ŷ_i)²
where y_i is the true value, ŷ_i is the predicted value, and N is the number of training examples.
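As a small illustration (not part of the original notes), the activation and loss functions above can be written in NumPy as follows; the function names are my own choices:

```python
import numpy as np

def sigmoid(x):
    # Sigmoid activation: squashes any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # ReLU activation: returns x for positive inputs, 0 otherwise.
    return np.maximum(0.0, x)

def mse(y_true, y_pred):
    # Mean squared error, used for regression targets.
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Binary cross-entropy, used for two-class classification.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
```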
Step 3: Backpropagation
The goal of training an MLP is to minimize the loss function by adjusting the network’s
weights and biases. This is achieved through backpropagation. Both MSE and BCE can be
used in backpropagation. Backpropagation computes gradients of the chosen loss function
(MSE or BCE) and updates the network’s weights using gradient descent.
1. Gradient Calculation: The gradients of the loss function with respect to each
weight and bias are calculated using the chain rule of calculus.
2. Error Propagation: The error is propagated back through the network, layer by
layer.
3. Gradient Descent: The network updates the weights and biases by moving in the
opposite direction of the gradient to reduce the loss
For both regression (MSE loss) and classification (BCE loss), the weights are updated using
the gradient descent formula:
w = w − η · (∂L/∂w)
where η is the learning rate and ∂L/∂w is the gradient of the loss with respect to the weight.
Step 4: Iteration
• Forward and backward propagation repeat over multiple epochs until the model
converges (i.e., achieves an acceptable error rate).
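To tie Steps 1 to 4 together, here is a minimal self-contained sketch of training a one-hidden-layer MLP with MSE loss and plain gradient descent in NumPy; the layer sizes, learning rate, and toy data are arbitrary illustration choices, not values from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: learn y = sin(x) on a handful of points.
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X)

# One hidden layer with 10 units (arbitrary size), linear output neuron.
W1 = rng.normal(0, 0.5, size=(1, 10));  b1 = np.zeros((1, 10))
W2 = rng.normal(0, 0.5, size=(10, 1));  b2 = np.zeros((1, 1))
lr = 0.05  # learning rate (eta)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(2000):
    # ---- Step 1: forward propagation ----
    z1 = X @ W1 + b1          # weighted sums of the hidden layer
    a1 = sigmoid(z1)          # hidden activations
    y_hat = a1 @ W2 + b2      # linear output for regression

    # ---- Step 2: loss (mean squared error) ----
    loss = np.mean((y - y_hat) ** 2)

    # ---- Step 3: backpropagation (chain rule, layer by layer) ----
    d_out = 2 * (y_hat - y) / len(X)            # dL/d(y_hat)
    dW2 = a1.T @ d_out;  db2 = d_out.sum(axis=0, keepdims=True)
    d_hidden = (d_out @ W2.T) * a1 * (1 - a1)   # propagate error through sigmoid
    dW1 = X.T @ d_hidden;  db1 = d_hidden.sum(axis=0, keepdims=True)

    # ---- gradient descent update ----
    W1 -= lr * dW1;  b1 -= lr * db1
    W2 -= lr * dW2;  b2 -= lr * db2

# Step 4: the loop above repeats forward and backward passes over many epochs.
```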
MLP ALGORITHM:
The Multi-Layer Perceptron (MLP) Algorithm is like training a digital brain to learn patterns
and make predictions.
This section explores practical considerations for using Multi-Layer Perceptrons (MLPs) to solve
real-world problems, focusing on three critical aspects: the amount of training data, the number
of hidden layers, and when to stop learning.
• For the MLP with one hidden layer there are (L + 1) × M + (M + 1) × N weights, where L, M, N
are the number of nodes in the input, hidden, and output layers, respectively.
• The extra +1s come from the bias nodes, which also have adjustable weights.
• This is a potentially huge number of adjustable parameters that we need to set during the
training phase.
• Setting the values of these weights is the job of the back-propagation algorithm, which is
driven by the errors coming from the training data.
• Clearly, the more training data there is, the better for learning, although the time that the
algorithm takes to learn increases.
• Unfortunately, there is no way to compute what the minimum amount of data required is, since
it depends on the problem.
• A rule of thumb is that you should use a number of training examples that is at least 10 times the
number of weights (see the sketch after this list).
• This is probably going to be a very large number of examples, so neural network training is a
fairly computationally expensive operation, because we need to show the network all of these
inputs lots of times.
• When it comes to the hidden layers, there are two choices to make: how many hidden nodes to use,
and how many hidden layers.
• It is possible to show mathematically that one hidden layer with lots of hidden nodes is
sufficient. This is known as the Universal Approximation Theorem.
• We will never normally need more than two layers (that is, one hidden layer and the output
layer).
• The training of the MLP requires that the algorithm runs over the entire dataset many times,
with the weights changing as the network makes errors in each iteration.
• Two simple options for deciding when to stop are to train for a predetermined number of
iterations, or to train until the error drops below a chosen threshold. Using both of these options
together can help, as can terminating the learning once the error stops decreasing.
• We train the network for some predetermined amount of time, and then use the validation set to
estimate how well the network is generalising.
• We then carry on training for a few more iterations, and repeat the whole process.
• At some stage the error on the validation set will start increasing again, because the network
has stopped learning about the function that generated the data, and started to learn about the
noise that is in the data itself.
• At this stage we stop the training. This technique is called early stopping.
• We will then apply MLP to find solutions to four different types of problem: Regression,
Classification, Time-series prediction, and Data compression.
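Before turning to these four problems, here is a rough sketch (my own, not from the notes) pulling together the practical considerations above: counting the weights, the 10× rule of thumb for training data, and early stopping on a validation set. train_one_epoch and validation_error are hypothetical placeholder functions.

```python
# Count of adjustable weights for an L-M-N network (including bias nodes).
L, M, N = 4, 8, 3                      # example layer sizes (arbitrary)
num_weights = (L + 1) * M + (M + 1) * N
min_examples = 10 * num_weights        # rule of thumb: at least 10x the weights
print(num_weights, min_examples)       # 67 weights -> at least 670 examples

# Early stopping sketch: train in short bursts, watch the validation error,
# and stop once it starts rising (the network is beginning to fit the noise).
def early_stopping_training(train_one_epoch, validation_error, patience=5):
    best_val, best_weights, epochs_without_improvement = float("inf"), None, 0
    while epochs_without_improvement < patience:
        weights = train_one_epoch()          # hypothetical: one pass over the data
        val = validation_error(weights)      # hypothetical: error on a held-out set
        if val < best_val:
            best_val, best_weights = val, weights
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
    return best_weights                      # weights from before the error rose
```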
Regression:
• If you want to predict a single value, you only need a single output neuron and if you want to
predict multiple values, you can add multiple output neurons.
• In general, we don't apply any activation function to the output layer of an MLP when dealing
with regression tasks; the output neuron just computes the weighted sum and sends it out.
• But if you want your output to lie within a given range, for example between -1 and +1, you can
use an activation such as tanh (hyperbolic tangent).
• The loss functions that can be used in Regression MLP include Mean Squared Error(MSE) and
Mean Absolute Error(MAE).
• MSE can be used on datasets with fewer outliers, while MAE is a better measure on datasets
that have more outliers.
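A brief, hedged example of a regression MLP using scikit-learn's MLPRegressor (one possible tool; the notes do not prescribe a library). By default MLPRegressor uses a squared-error loss and no activation on the output layer, matching the description above; the hidden layer size and toy data are arbitrary.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy data: predict a single continuous value from two features.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=300)

# One output neuron (single value), identity output, squared-error loss.
model = MLPRegressor(hidden_layer_sizes=(16,), activation="relu",
                     max_iter=2000, random_state=0)
model.fit(X, y)
print(model.predict(X[:3]))   # predictions for the first three points
```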
Classification:
• If the output variable is categorical, then we have to use classification for prediction.
• The aim is to classify iris flowers among three species (Setosa, Versicolor, or Virginica) from
the sepals’ and petals’ length and width measurements.
• The above neural network has one input layer, two hidden layers and one output layer.
• In the hidden layers we use sigmoid as an activation function for all neurons.
• In the output layer, we use softmax as an activation function for the three output neurons.
• In this regard, all outputs are between 0 and 1, and their sum is 1.
• The neural network has three outputs since the target variable contains three classes (Setosa,
Versicolor, and Virginica).
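A hedged sketch of the iris classifier described above, using scikit-learn's MLPClassifier as one possible implementation. activation="logistic" gives sigmoid hidden units, and the classifier applies softmax over the three output classes internally; the hidden layer sizes are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Four input features (sepal/petal length and width), three output classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers with sigmoid ("logistic") activations; softmax on the output.
clf = MLPClassifier(hidden_layer_sizes=(10, 10), activation="logistic",
                    max_iter=3000, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("class probabilities (sum to 1):", clf.predict_proba(X_test[:1]))
```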
Time-Series Prediction:
• There is a common data analysis task known as time-series prediction, where we have a set of
data that shows how something varies over time, and we want to predict how the data will vary in
the future.
• The problem is that even if there is some regularity in the time-series, it can appear over many
different scales. For example, there is often seasonal variation in temperatures.
• Example: A typical time-series problem is to predict the ozone levels into the future and see if
you can detect an overall drop in the mean ozone level.
Data Compression:
• We train the network to reproduce the inputs at the output layer; this is called auto-associative
learning.
• The network is trained so that whatever you give as the input is reproduced at the output, which
doesn’t seem very useful at first, but suppose that we use a hidden layer that has fewer neurons
than the input layer.
• This bottleneck hidden layer has to represent all of the information in the input, so that it can
be reproduced at the output.
• It therefore performs some compression of the data, representing it using fewer dimensions
than were used in the input.
• Such auto-associative networks find a different representation of the input data, one that extracts
the important components of the data and ignores the noise.
• This auto-associative network can be used to compress images and other data.
DERIVING BACK-PROPAGATION
Backpropagation is an algorithm used in artificial intelligence and machine learning to train
artificial neural networks through error correction. The computer learns by calculating the loss
function, that is, the difference between the desired output and the output the network actually produced. When you
apply backpropagation, you work backward from output nodes to input nodes to reduce the loss
function and produce the desired result.
Backpropagation is the process of adjusting a neural network’s weights and biases to reduce
error. It does this by:
1. performing a forward pass to compute the output and the loss,
2. propagating the error backwards through the network to obtain the gradient of the loss with
respect to every weight and bias (using the chain rule), and
3. updating the weights and biases by gradient descent to reduce the loss.
We use Mean Squared Error (MSE) loss, which is used when predicting continuous values
(e.g., predicting house prices).
Repeating these steps reduces the error over time: the model gradually improves and learns the
weights and biases that minimize the error.
Step 1: Define the network architecture, the input values, and the initial weights.
Step 2: Forward Propagation: We calculate the hidden layer activation, then the output
layer activation.
Note: Hidden layers have their own weights and biases. The hidden layer does have an
input value, but it comes from the previous layer.
Each neuron in a layer is connected to the neurons in the previous layer via weights. Every layer
(except the input layer) has:
• its own weights (one for each incoming connection), and
• its own bias for each neuron.
For a Neural Network with 1 Input, 1 Hidden Layer, and 1 Output Layer, the forward pass and
weight updates are sketched below.
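Since the worked example itself is not reproduced here, below is a minimal numeric sketch for a 1-input, 1-hidden-neuron, 1-output network with sigmoid activations and squared-error loss; the input, target, initial weights, and learning rate are made-up illustration values.

```python
import math

# Made-up values for illustration only.
x, target = 0.5, 0.8
w1, b1 = 0.4, 0.1        # hidden layer weight and bias
w2, b2 = 0.6, -0.2       # output layer weight and bias
lr = 0.5                 # learning rate

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

# Step 2: forward propagation.
h = sigmoid(w1 * x + b1)         # hidden activation
y = sigmoid(w2 * h + b2)         # output activation
loss = (y - target) ** 2         # squared error

# Step 3: backpropagation via the chain rule.
d_y = 2 * (y - target) * y * (1 - y)   # dLoss/d(pre-activation of output)
d_w2, d_b2 = d_y * h, d_y
d_h = d_y * w2 * h * (1 - h)           # error propagated back to the hidden neuron
d_w1, d_b1 = d_h * x, d_h

# Gradient descent update.
w1, b1 = w1 - lr * d_w1, b1 - lr * d_b1
w2, b2 = w2 - lr * d_w2, b2 - lr * d_b2
print(round(loss, 4), round(w1, 4), round(w2, 4))
```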
RADIAL BASIS FUNCTION (RBF) NEURAL NETWORK
A radial basis function (RBF) neural network is a type of artificial neural network that uses radial
basis functions as activation functions. It typically consists of three layers: an input layer, only
one hidden layer, and an output layer. The hidden layer applies a radial basis function, usually
a Gaussian function. RBF neural networks are highly versatile and are extensively used in
pattern classification tasks, function approximation, and a variety of machine learning
applications. They are especially known for their ability to handle non-linear problems
effectively.
• Input layer: This layer simply transmits the inputs to the neurons in the hidden layer.
• Hidden layer: Each neuron in this layer applies a radial basis function to the inputs it
receives. RBF has strictly one hidden layer.
• Output layer: Each neuron in this layer computes a weighted sum of the outputs from
the hidden layer, resulting in the final output.
Working of RBF
• When dealing with non-linear data, we aim to convert it into linearly separable data.
• To achieve this, every hidden layer neuron uses a non-linear radial basis function as the
activation function, transforming the data into a higher-dimensional space.
Types of Radial Basis Functions:
1. Gaussian RBF: φ(x) = exp( −‖x − c‖² / (2r²) )
2. Multiquadric RBF: φ(x) = √( ‖x − c‖² + r² )
where:
x = Input
c = Center
r = Radius
Algorithm of RBF
• Assign weights for each connection from hidden layer to output layer.
• Initially, weights are randomly assigned in the range [-1,1].
Forward Phase: the input is passed to the hidden layer, where each hidden neuron computes its
radial basis activation from the distance between the input and its centre; the output neurons then
compute weighted sums of these activations. The hidden-to-output weights are subsequently
adjusted (for example by gradient descent or least squares) to reduce the output error.
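A small sketch (my own, not from the notes) of an RBF network with Gaussian hidden units: the centres are chosen from the data, the hidden layer maps each input to its RBF activations, and the hidden-to-output weights are then fitted by least squares rather than starting from random values in [-1, 1]; the sizes and toy data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-d regression problem.
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = np.sin(X).ravel()

# Pick 10 centres from the training inputs; fix a common radius r.
centres = X[rng.choice(len(X), size=10, replace=False)]
r = 1.0

def gaussian_rbf(X, centres, r):
    # Hidden layer: one Gaussian activation per centre for each input.
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return np.exp(-(dists ** 2) / (2 * r ** 2))

H = gaussian_rbf(X, centres, r)                 # hidden activations, shape (100, 10)
H = np.hstack([H, np.ones((len(X), 1))])        # bias column for the output layer

# Output layer: weighted sum of RBF activations, weights fitted by least squares.
w, *_ = np.linalg.lstsq(H, y, rcond=None)

y_pred = H @ w
print("training MSE:", np.mean((y - y_pred) ** 2))
```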
THE CURSE OF DIMENSIONALITY
The curse of dimensionality is a common machine learning problem that occurs when a dataset
has many dimensions. This can make it difficult to analyze, organize, and model the data. The
Curse of Dimensionality refers to the various challenges and complications that arise when
analyzing and organizing data in high-dimensional spaces (often hundreds or thousands of
dimensions). In the realm of machine learning, it's crucial to understand this concept because as
the number of features or dimensions in a dataset increases, the amount of data we need to
generalize accurately grows exponentially.
1. Data sparsity: As mentioned, data becomes sparse, meaning that most of the high-
dimensional space is empty. This makes clustering and classification tasks challenging.
2. Increased computation: More dimensions mean more computational resources and time
to process the data.
3. Overfitting: With higher dimensions, models can become overly complex, fitting to the
noise rather than the underlying pattern. This reduces the model's ability to generalize to
new data.
4. Distances lose meaning: In high dimensions, the difference in distances between data
points tends to become negligible, making measures like Euclidean distance less
meaningful.
5. Performance degradation: Algorithms, especially those relying on distance measurements
like k-nearest neighbors, can see a drop in performance.
6. Visualization challenges: High-dimensional data is hard to visualize, making exploratory
data analysis more difficult.
It occurs mainly because as we add more features or dimensions, we're increasing the complexity
of our data without necessarily increasing the amount of useful information. Moreover, in high-
dimensional spaces, most data points are at the "edges" or "corners," making the data sparse.
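A quick numerical illustration (not from the notes) of point 4 in the list above, that distances lose meaning in high dimensions: for random points, the gap between the nearest and farthest neighbour shrinks relative to the distances themselves as the number of dimensions grows.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000):
    # 500 random points in the d-dimensional unit cube.
    points = rng.uniform(size=(500, d))
    query = rng.uniform(size=d)
    dists = np.linalg.norm(points - query, axis=1)
    # Relative contrast: how much farther the farthest point is than the nearest.
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dimensions={d:5d}  relative contrast={contrast:.3f}")
```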
The primary solution to the curse of dimensionality is "dimensionality reduction." It's a process
that reduces the number of random variables under consideration by obtaining a set of principal
variables. By reducing the dimensionality, we can retain the most important information in the
data while discarding the redundant or less important features.
Principal Component Analysis (PCA)
PCA is a statistical method that transforms the original variables into a new set of variables,
which are linear combinations of the original variables. These new variables are called principal
components.
Let's say we have a dataset containing information about different aspects of cars, such as
horsepower, torque, acceleration, and top speed. We want to reduce the dimensionality of this
dataset using PCA.
Using PCA, we can create a new set of variables called principal components. The first principal
component would capture the most variance in the data, which could be a combination of
horsepower and torque. The second principal component might represent acceleration and top
speed. By reducing the dimensionality of the data using PCA, we can visualize and analyze the
dataset more effectively.
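A hedged sketch of the car example above using scikit-learn's PCA; the numbers are made up purely to have something to run, and which variables dominate each component depends entirely on the data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up car data: horsepower, torque, acceleration (0-100 km/h, s), top speed.
cars = np.array([
    [150, 200, 9.5, 190],
    [300, 400, 5.5, 250],
    [120, 160, 11.0, 170],
    [450, 550, 3.8, 300],
    [200, 260, 8.0, 210],
])

# Standardize first so no single unit dominates, then keep two components.
X = StandardScaler().fit_transform(cars)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("first two principal components:\n", X_reduced)
```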
Linear Discriminant Analysis (LDA)
LDA aims to identify attributes that account for the most variance between classes. It's
particularly useful for classification tasks. Suppose we have a dataset with various features of
flowers, such as petal length, petal width, sepal length, and sepal width. Additionally, each
flower in the dataset is labeled as either a rose or a lily. We can use LDA to identify the
attributes that account for the most variance between these two classes.
LDA might find that petal length and petal width are the most discriminative attributes between
roses and lilies. It would create a linear combination of these attributes to form a new variable,
which can then be used for classification tasks. By reducing the dimensionality using LDA, we
can improve the accuracy of flower classification models.
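A short sketch of LDA with scikit-learn; since the rose/lily data from the example is not available, the built-in iris flower measurements stand in for it here.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Flower measurements (petal/sepal length and width) with class labels.
X, y = load_iris(return_X_y=True)

# Project onto the directions that best separate the classes, then classify.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print("reduced shape:", X_lda.shape)            # (150, 2)
print("training accuracy:", lda.score(X, y))    # LDA also works as a classifier
```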
t-Distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE is a non-linear dimensionality reduction technique that's particularly useful for visualizing
high-dimensional datasets. Let's consider a dataset with images of different types of animals,
such as cats, dogs, and birds. Each image is represented by a high-dimensional feature vector
extracted from a deep neural network.
Using t-SNE, we can reduce the dimensionality of these feature vectors to two dimensions,
allowing us to visualize the dataset. The t-SNE algorithm would map similar animals closer
together in the reduced space, enabling us to observe clusters of similar animals. This
visualization can help us understand the relationships and similarities between different animal
types in a more intuitive way.
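A brief sketch of the t-SNE visualization described above; the handwritten-digit images built into scikit-learn stand in for the animal images, since each digit is likewise a high-dimensional feature vector.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 8x8 digit images flattened into 64-dimensional feature vectors.
X, y = load_digits(return_X_y=True)

# Map the 64-dimensional vectors down to 2 dimensions for plotting.
tsne = TSNE(n_components=2, random_state=0)
X_2d = tsne.fit_transform(X)

print("embedded shape:", X_2d.shape)   # (1797, 2); similar digits end up close together
# X_2d can now be scatter-plotted, coloured by the label y, to see the clusters.
```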
Autoencoders
These are neural networks used for dimensionality reduction. They work by compressing the
input into a compact representation and then reconstructing the original input from this
representation. Suppose we have a dataset of images of handwritten digits, such as the MNIST
dataset. Each image is represented by a high-dimensional pixel vector.
We can use an autoencoder, which is a type of neural network, for dimensionality reduction.
The autoencoder would learn to compress the input images into a lower-dimensional
representation, often called the latent space. This latent space would capture the most important
features of the images. We can then use the autoencoder to reconstruct the original images from
the latent space representation. By reducing the dimensionality using autoencoders, we can
effectively capture the essential information from the images while discarding unnecessary
details.
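A hedged sketch of such an autoencoder using Keras (one possible framework; the notes do not prescribe one). The 784-pixel MNIST images are compressed into a 32-dimensional latent space and reconstructed; the layer sizes and training settings are arbitrary.

```python
from tensorflow import keras
from tensorflow.keras import layers

# MNIST digits, flattened to 784-dimensional pixel vectors in [0, 1].
(x_train, _), (x_test, _) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# Encoder compresses to a 32-dimensional latent space; decoder reconstructs.
inputs = keras.Input(shape=(784,))
latent = layers.Dense(32, activation="relu")(inputs)          # bottleneck layer
outputs = layers.Dense(784, activation="sigmoid")(latent)     # reconstruction
autoencoder = keras.Model(inputs, outputs)

# Train to reproduce the input at the output (auto-associative learning).
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x_train, x_train, epochs=5, batch_size=256,
                validation_data=(x_test, x_test))

reconstructed = autoencoder.predict(x_test[:5])   # compare with the originals
```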
INTERPOLATION AND BASIS FUNCTIONS
INTERPOLATION:
In machine learning, interpolation refers to the process of estimating unknown values that
fall between known data points. This can be useful in various scenarios, such as filling in
missing values in a dataset or generating new data points to smooth out a curve.
Applications of interpolation include:
• Geodesy: Interpolation is used to map out features on Earth's surface, such as mountains
or ocean currents, using satellite imagery.
• Statistical analysis: Interpolation can be used to smooth out data sets so that they
become more evenly distributed. For example, if you have a spike in sales one day, you
can use interpolation to smooth out the rest of your sales data for that month so that the
overall trend looks smooth instead of erratic.
TYPES OF INTERPOLATION:
• Linear interpolation: Linear interpolation is a simple method for estimating unknown
values between two known points. It assumes that the data points can be connected by a
straight line.
Formula for Linear Interpolation: given two known points (x1, y1) and (x2, y2), the value at a
point x between them is estimated as
y = y1 + (x − x1) × (y2 − y1) / (x2 − x1)
• Polynomial interpolation: What if we have more than two points? Instead of a straight
line, we can fit a curve using a polynomial. This works like connecting the dots
smoothly so the estimated values follow the trend of the data. A common method for this
is Lagrange interpolation.
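A small sketch of both kinds of interpolation with NumPy/SciPy; the sample points are made up for illustration.

```python
import numpy as np
from scipy.interpolate import lagrange

# Known data points.
x_known = np.array([0.0, 1.0, 2.0, 3.0])
y_known = np.array([1.0, 2.0, 0.0, 5.0])

# Linear interpolation: connect neighbouring points with straight lines.
x_new = 1.5
y_linear = np.interp(x_new, x_known, y_known)

# Polynomial (Lagrange) interpolation: one smooth curve through all the points.
poly = lagrange(x_known, y_known)
y_poly = poly(x_new)

print(f"linear estimate at x={x_new}: {y_linear}")
print(f"Lagrange estimate at x={x_new}: {y_poly:.3f}")
```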
Basis function
Instead of using a single equation to represent a function, we combine multiple small functions
(called basis functions) to form the final function. In other words, the function is broken into
small pieces built from basis functions so that a machine learning model can learn patterns more easily.
Think of it like building a house with Lego blocks—each basis function is a Lego piece.
This method is used in splines and radial basis functions (RBFs) to make models that can fit
complex patterns.
A cubic spline is a smooth curve made up of cubic polynomials that are joined together at
specific points called knotpoints.
Once you have knotpoints, you need to choose how the function behaves in each section.
1. Constant Basis Function (Step Functions)
Problem: The function is not smooth—it jumps from one level to another without a transition.
2. Linear Basis Function (Straight Line Segments)
Problem: If you just use straight lines, they may not connect smoothly at knotpoints—meaning there
might be sharp corners.
Best Choice for Smoothness: Cubic splines! They create smooth curves that don’t have sharp
edges or abrupt changes.
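A short sketch contrasting the straight-line and cubic-spline choices using SciPy's CubicSpline; the knotpoints are made up for illustration.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Knotpoints where the pieces of the function join.
x_knots = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_knots = np.array([0.0, 2.0, 1.0, 3.0, 2.5])

x_new = np.linspace(0, 4, 9)

# Linear pieces: continuous, but with sharp corners at the knotpoints.
y_linear = np.interp(x_new, x_knots, y_knots)

# Cubic spline: a cubic polynomial on each section, smooth at the knotpoints.
spline = CubicSpline(x_knots, y_knots)
y_smooth = spline(x_new)

print(np.round(y_linear, 2))
print(np.round(y_smooth, 2))
```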
SUPPORT VECTOR MACHINE (SVM)
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put the new data point in the correct
category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified using a
decision boundary or hyperplane:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset
can be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called the Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means
that if a dataset cannot be classified by using a straight line, then such data is termed non-
linear data, and the classifier used is called the Non-linear SVM classifier.
Support Vectors: The data points or vectors that are closest to the hyperplane and which
affect the position of the hyperplane are termed support vectors. Since these vectors support
the hyperplane, they are called support vectors.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a
dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a
classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the
below image:
As this is a 2-d space, we can easily separate these two classes just by using a straight line. But
there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or
region is called a hyperplane. The SVM algorithm finds the points from both classes that lie
closest to the line. These points are called support vectors. The distance between these vectors
and the hyperplane is called the margin, and the goal of SVM is to maximize this margin.
The hyperplane with maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If the data is arranged non-linearly (for example, one class of points surrounded by the other), we
cannot separate it with a single straight line, so we add a third dimension, z = x² + y². So now,
SVM will divide the datasets into classes by finding a separating plane in this 3-d space.
Since we are in 3-d space, the boundary looks like a plane parallel to the x-axis. If we convert it
back into 2-d space with z = 1, it becomes the circle x² + y² = 1.
Hence, we get a circumference of radius 1 in the case of non-linear data.
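A hedged sketch with scikit-learn's SVC, showing a linear SVM on linearly separable blobs and the kernel trick (RBF kernel) on circular data like the example above; all settings are defaults or arbitrary choices.

```python
from sklearn.datasets import make_blobs, make_circles
from sklearn.svm import SVC

# Linear SVM: two well-separated blobs can be split by a straight line.
X_lin, y_lin = make_blobs(n_samples=200, centers=2, random_state=0)
linear_svm = SVC(kernel="linear").fit(X_lin, y_lin)
print("linear SVM accuracy:", linear_svm.score(X_lin, y_lin))
print("number of support vectors:", len(linear_svm.support_vectors_))

# Non-linear SVM: one class inside a circle of the other, like the z = x^2 + y^2 example.
X_circ, y_circ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
rbf_svm = SVC(kernel="rbf").fit(X_circ, y_circ)       # kernel trick
print("RBF-kernel SVM accuracy:", rbf_svm.score(X_circ, y_circ))
```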
SVM Algorithm
1. Goal:
o Find the best line (or hyperplane in higher dimensions) that separates two classes
of data points.
2. Steps:
o Step 1: Collect Data:
▪ Gather your data with features (e.g., height, weight) and labels (e.g., cat or
dog).
o Step 2: Plot Data:
▪ Visualize the data points on a graph (if possible).
o Step 3: Find the Best Line:
▪ Draw a line that separates the two classes.
▪ Make sure the line is as far as possible from the closest data points of both
classes (these closest points are called support vectors).
o Step 4: Handle Non-Linear Data:
▪ If the data isn’t linearly separable (you can’t draw a straight line), use a
trick called the kernel trick to transform the data into a higher dimension
where a line can separate the classes.
o Step 5: Make Predictions:
▪ For a new data point, check which side of the hyperplane it falls on and assign
it the corresponding class.
In more detail, the way SVM works can be described as follows:
1. Plot the Data: Each data point is represented in n-dimensional space (n = number of
features). For example, if you have two features, you can plot the data on a 2D graph.
2. Find the Hyperplane: SVM finds the hyperplane (a straight line in 2D, a flat plane in
3D, or more generally, an n-dimensional plane) that separates the two classes of data
points with the maximum margin.
o Maximum Margin: This is the largest possible distance between the hyperplane
and the nearest data points from both classes.
o These closest points are called support vectors because they “support” the
hyperplane.
3. Separate the Classes: The hyperplane divides the data into two regions, each
representing one class. For example:
o One side of the line = Class A.
o Other side = Class B.
4. Non-Linearly Separable Data: If the data cannot be separated with a straight line (e.g.,
spiral data), SVM uses something called a kernel trick to transform the data into a higher
dimension where it becomes linearly separable.
o Kernel Functions: Mathematical functions like polynomial, RBF (Radial Basis
Function), etc., are used to transform the data.
Advantages and Disadvantages of Support Vector Machine (SVM)
1. High-Dimensional Performance: SVM excels in high-dimensional spaces, making
it suitable for image classification and gene expression analysis.
2. Nonlinear Capability: Utilizing kernel functions like RBF and polynomial, SVM
effectively handles nonlinear relationships.
3. Outlier Resilience: The soft margin feature allows SVM to ignore outliers,
enhancing robustness in spam detection and anomaly detection.
4. Binary and Multiclass Support: SVM is effective for both binary
classification and multiclass classification, suitable for applications in text
classification.
5. Memory Efficiency: SVM focuses on support vectors, making it memory efficient
compared to other algorithms.