
Introduction to Deep Learning

The mathematical part of Machine Learning deals with statistics, calculus, linear algebra, and probability.
Introduction to Deep Learning

Evolution of Artificial Intelligence


Example: Detecting email spam
Types of Machine Learning
The components of Machine Learning
• ML is a modelling technique that involves data.
• The final product of machine learning is a compressed form of the data; it extracts knowledge from the data.
Representation of the data

• Each row represents one sample of the data (or one instance of the data).
Example: Recognize apples and oranges
• Each row represents one apple or one orange.
• The columns are the different features of the objects, such as color, texture, shape, smell and so on.
• This matrix is usually called the feature matrix; it represents our data in a way that can be fed into the computer (as sketched below).
• The target array stores the output values: either the labels of the different classes, or the quantities.
Example: Image data: the numerical values of the pixels are stored in an array.
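As a small illustration (not from the slides), here is what a feature matrix and target array might look like for the apples-vs-oranges example; the feature values are made up purely for illustration.

```python
# Illustrative sketch of a feature matrix X and target array y.
import numpy as np

# rows = samples (one apple or one orange), columns = features
# assumed features: [color, texture, shape, smell, weight]
X = np.array([
    [0.9, 0.3, 0.7, 0.2, 150.0],   # an orange
    [0.2, 0.8, 0.6, 0.9, 120.0],   # an apple
    [0.8, 0.4, 0.7, 0.3, 160.0],   # another orange
])

# target array: 0 = orange, 1 = apple
y = np.array([0, 1, 0])

print(X.shape)  # (3, 5): 3 samples, 5 features
print(y.shape)  # (3,)
```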
Tunable Model
After we have the input and output data ready, we need an algorithm that can learn from the data. The algorithms usually have many different parameters that can be tuned so that the model becomes better for our goal, based on the data we present to it. There are many different tunable models, such as artificial neural networks, support vector machines, random forests, logistic regression, and so on. Each model has different ideas behind it and is tuned differently.

Optimization algorithm
This is the main working force behind machine learning algorithms. Usually we need to define some sort of objective function and use an optimization algorithm to either minimize or maximize it. The objective function could be calculated from the difference (error) between the estimation and the target (true answer); therefore, usually the smaller the error, the better. There are different optimization algorithms to tune the model, such as gradient descent, least squares minimization, and so on, as in the sketch below.
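A minimal gradient-descent sketch of the idea: repeatedly move the parameter against the gradient of the objective. The objective f(w) = (w - 3)^2 and the step size are made up here purely for illustration.

```python
# Gradient descent on a simple one-parameter objective.
def f(w):
    return (w - 3.0) ** 2          # objective to minimize (minimum at w = 3)

def grad_f(w):
    return 2.0 * (w - 3.0)         # derivative of the objective

w, lr = 0.0, 0.1                   # initial guess and learning rate (assumed)
for step in range(100):
    w -= lr * grad_f(w)            # step against the gradient to reduce the error

print(w, f(w))                     # w approaches 3, where the objective is minimal
```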

Trained Model
After the tuning of the model, it has the capability of making a classification, a prediction, and so on. This trained model can be used on new data to achieve our goal.
Example:
Classification
Classification of a group of oranges and apples (a binary classification problem, since there are only two classes)

• First calculate the features for each apple and orange and save them into the feature matrix as shown below.
• Target label 0 represents orange and 1 refers to apple.
• Since we have 5 features in the figure, it is not easy to visualize them. If we only plot the first two features, i.e. color and texture, we may see something as below:
• The classification problem is essentially a problem of finding a decision boundary, either a straight line or some other curve, to separate the two classes.
• The tuning of the algorithm basically moves this line or finds the shape of the boundary, but in a higher dimension.
Support Vector Machines
• It is a very intuitive algorithm based on how we make decisions.
• In the following figure, which line boundary is better: the black dotted line or the red solid line? Most people will choose the red solid line, because it is in the middle of the gap between the two groups.
• If we have a new data point (the blue dot), the black dotted line model will make the wrong decision. Therefore, a model whose line is close to the middle of the gap and far away from both classes is the better one. This intuition needs to be formalized in a way that the computer can carry out.

In the design of the SVM algorithm, it first forms a buffer from the boundary line to the points in both classes that are closest to the line (these are the support vectors, which is where the name comes from). The problem then becomes: given a set of these support vectors, which line has the maximum buffer? The black dotted line has a narrow buffer while the red solid line has a wider buffer.
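As a minimal sketch of this idea in code (assuming scikit-learn is available; the toy points and the large C used to approximate a hard margin are assumptions, not from the slides), a linear SVM recovers the maximum-buffer line and its support vectors:

```python
# Fit a linear SVM on a small, linearly separable toy set.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 1], [2, 3], [3, 3],        # class +1
              [0, 0], [-1, 0], [0, -1], [-1, -1]])   # class -1
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # very large C ~ hard margin
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("support vectors:\n", clf.support_vectors_)    # points that define the buffer
```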
Mathematics of Support Vector Machines
Dot product
• Consider a random point X; we want to know whether it lies on the right side of the hyperplane or the left side of the hyperplane (positive or negative).
• We make a vector w which is perpendicular to the hyperplane.
• Let's say the distance from the origin to the decision boundary along w is c. Now we take the projection of the vector X on w. The projection of one vector onto another is obtained from the dot product.
• In SVM we just need the projection of A, not the magnitude of B:
  if X.w ≥ c, the point lies on the right side;
  if X.w < c, the point lies on the left side.

Mathematics of Support Vector Machines

The equation of a hyperplane is w.x + b = 0
• where w is a vector normal to the hyperplane and b is an offset.
To classify a point as negative or positive we define a decision rule:
  predict +1 if w.x + b ≥ 0, and -1 if w.x + b < 0.

Now we need (w, b) such that the margin has a maximum distance. Let's say this distance is d.
To calculate d we need the equations of L1 and L2.
For this, we assume that the equation of L1 is w.x + b = 1 and that of L2 is w.x + b = -1.
Mathematics of Support Vector Machines
• Take 2 support vectors, one from the negative class and one from the positive class. The difference between these two vectors x1 and x2 is the vector (x2 - x1).
• Take the vector w perpendicular to the hyperplane and find the projection of (x2 - x1) on w. To make w a unit vector we divide by the norm of w, so the margin is

  d = (x2 - x1) . w / ||w||

Since x2 and x1 are support vectors lying on the hyperplanes L1 and L2, they satisfy yi (w.xi + b) = 1, i.e. w.x2 + b = +1 and w.x1 + b = -1, so we can write

  w.(x2 - x1) = 2   and therefore   d = 2 / ||w||

Hence the quantity which we have to maximize is 2 / ||w||.

Support Vector Machines - Maximal margin classifier

Maximizing 2/||w|| is equivalent to minimizing (1/2)||w||² subject to yi (w.xi + b) ≥ 1 for all i.
This is a quadratic programming problem with inequality constraints.

It is solved using Lagrange multipliers:

  L(w, b, α) = (1/2)||w||² - Σi αi [ yi (w.xi + b) - 1 ],   αi ≥ 0

Compute the gradient and set it to zero:

  ∂L/∂w = 0  ⇒  w = Σi αi yi xi
  ∂L/∂b = 0  ⇒  Σi αi yi = 0

Support Vector Machines- Maximal Margin Classifier
Non-zero alpha indicates a support vector:
When a data point has a non-zero alpha value, it is considered a support
vector, meaning it directly influences the position of the hyper plane in the
SVM model.
Importance in the decision boundary:
The alpha values of support vectors determine how much each data point
contributes to the final decision boundary.
Lagrange multiplier connection:
Alpha is essentially a Lagrange multiplier used in the SVM optimization
problem, which helps to find the optimal hyper plane by maximizing the
margin between classes while respecting the constraints.
Support Vector Machine
Solved example

  w = 0.5 (1, 2) + 0.5 (2, 1) - 1 (0, 0) = (1.5, 1.5)

The hyperplane is w1 x1 + w2 x2 + b = 0.

L1: w.x + b = +1
  For x = (1, 2):  1.5(1) + 1.5(2) + b = 1  ⇒  b = -3.5
  For x = (2, 1):  1.5(2) + 1.5(1) + b = 1  ⇒  b = -3.5

L2: w.x + b = -1
  For x = (0, 0):  1.5(0) + 1.5(0) + b = -1  ⇒  b = -1

Take the average of b:

  b = (-3.5 - 3.5 - 1) / 3 = -8/3
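The same arithmetic can be checked numerically; this is just an illustration reproducing the numbers on the slide with NumPy.

```python
# Numeric check of the solved example: w = sum_i alpha_i * y_i * x_i,
# and b from y_i (w.x_i + b) = 1 at each support vector.
import numpy as np

alphas = np.array([0.5, 0.5, 1.0])
ys = np.array([+1, +1, -1])
xs = np.array([[1, 2], [2, 1], [0, 0]], dtype=float)

w = (alphas * ys) @ xs          # [1.5, 1.5]
b_values = ys - xs @ w          # b = y_i - w.x_i  ->  [-3.5, -3.5, -1.0]

print(w)
print(b_values)
print(b_values.mean())          # -2.666...  (= -8/3)
```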
Support Vector Machine Solved example
Hard and Soft Margin SVM

• In real-life applications, we rarely encounter datasets that are perfectly linearly separable. Instead, we often come across datasets that are either nearly linearly separable or entirely non-linearly separable.
Hard and Soft Margin SVM
A "hard margin" is applicable only when we have linearly separable data. In real datasets, where data points do not all fall perfectly in the right region, hard margins do not give good results. Hard margins are very sensitive to outliers. We can use soft margin classification to avoid these issues. A hard margin results in an over-fit model because it allows zero errors.
Forcing rigid margins can result in a model that performs perfectly on the training set, but is possibly over-fit / less generalizable when applied to a new dataset.
Hard and Soft Margin SVM
The soft margin SVM follows a somewhat similar optimization procedure, with a couple of differences. First, in this scenario, we allow misclassifications to happen. So we'll need to minimize the misclassification error, which means that we'll have to deal with one more constraint. Second, to minimize the error, we should define a loss function. A common loss function used for the soft margin is the hinge loss.

The loss of a misclassified point is called a slack variable and is added to the primal problem that we had for the hard margin SVM. So the primal problem for the soft margin becomes:

  minimize (1/2)||w||² + C Σi ξi   subject to   yi (w.xi + b) ≥ 1 - ξi,   ξi ≥ 0

A new regularization parameter C controls the trade-off between maximizing the margin and minimizing the loss. As you can see, the difference from the hard margin primal problem is the addition of the slack variables. The new slack variables (ξi in the figure below) add flexibility for misclassifications of the model:
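A small sketch of the C trade-off (assuming scikit-learn; the two overlapping blobs are made-up toy data): smaller C tolerates more slack, larger C penalizes misclassification harder.

```python
# Effect of the soft-margin parameter C on a linear SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, size=(20, 2)),
               rng.normal(loc=+1.0, size=(20, 2))])   # two overlapping blobs
y = np.array([0] * 20 + [1] * 20)

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Smaller C -> wider margin, more support vectors (more slack allowed);
    # larger C -> narrower margin, closer to a hard margin.
    print(f"C={C:>6}: number of support vectors = {clf.n_support_.sum()}")
```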
SVM Kernels
In many real-world scenarios, the data is not linearly separable in the original feature space. Kernels help by implicitly mapping the original feature space into a higher-dimensional space where the data might be more easily separable.

Kernels allow SVMs to handle non-linearly separable data by transforming the feature space.
SVM uses a method called the kernel trick to map the data to a higher-dimensional feature space.
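A minimal sketch of the kernel trick (assuming scikit-learn; the "disc inside a ring" toy data is an assumption chosen because no straight line can separate it):

```python
# Linear kernel vs RBF kernel on data that is not linearly separable.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
r = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])  # radii
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.array([0] * 100 + [1] * 100)          # inner disc vs outer ring

linear = SVC(kernel="linear").fit(X, y)      # struggles: no separating line exists
rbf = SVC(kernel="rbf").fit(X, y)            # implicit higher-dimensional mapping

print("linear kernel training accuracy:", linear.score(X, y))
print("RBF kernel training accuracy:   ", rbf.score(X, y))
```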
Linear Regression Model Representation
Simple linear regression, when we have a single input and output: Y = b0 + b1 x

In estimating the coefficients of the model, the least squares procedure seeks to minimize the sum of the squared residuals. This means that, given a regression line through the data, we calculate the distance from each data point to the regression line, square it, and sum all of the squared errors together. This is the quantity that ordinary least squares seeks to minimize.
Weight = b0 + b1 Height

The output of linear regression should only be continuous values such as price, age, salary, etc.
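A minimal least-squares sketch for Weight = b0 + b1 * Height; the height/weight numbers here are made up purely for illustration.

```python
# Ordinary least squares for a single input:
# b1 = cov(x, y) / var(x),  b0 = mean(y) - b1 * mean(x)
import numpy as np

height = np.array([150, 160, 165, 170, 180], dtype=float)
weight = np.array([50, 56, 61, 64, 72], dtype=float)

b1 = np.sum((height - height.mean()) * (weight - weight.mean())) \
     / np.sum((height - height.mean()) ** 2)
b0 = weight.mean() - b1 * height.mean()

print(f"Weight ≈ {b0:.2f} + {b1:.2f} * Height")
print("predicted weight at 175 cm:", b0 + b1 * 175)
```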
Linear Regression
LR allocates a weight parameter θ for each of the training features. The predicted output h(θ) will be a linear function of the features (x) and coefficients (θ):

Linear regression output equation:  h(θ) = θ0 + θ1 x1 + θ2 x2 + ... + θn xn

At the start of training, each theta is randomly initialized. During training, we correct the theta corresponding to each feature such that the loss (a metric of the deviation between expected and predicted output) is minimized. The gradient descent algorithm is used to move the θ values in the right direction. In the given diagram, each red dot represents a training sample and the blue line shows the derived solution.

Loss function:
In LR, we use the mean squared error as the metric of loss. The deviations between expected and actual outputs are squared and summed up. The derivative of this loss is used by the gradient descent algorithm, as in the sketch below.
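A sketch of gradient descent on the MSE loss for simple linear regression; the toy data (true line y = 2 + 3x plus noise), learning rate, and iteration count are assumptions for illustration.

```python
# Gradient descent for y ≈ theta0 + theta1 * x with mean squared error.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 0.1, 50)

theta0, theta1, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    y_hat = theta0 + theta1 * x
    error = y_hat - y
    # gradients of the MSE with respect to each parameter
    theta0 -= lr * 2 * error.mean()
    theta1 -= lr * 2 * (error * x).mean()

print(theta0, theta1)   # should end up close to 2 and 3
```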
Logistic Regression
• Logistic Regression is used for solving classification problems.
• The output of Logistic Regression can only be between 0 and 1.
• Logistic regression can be used where a probability over two classes is required, such as whether it will rain today or not: either 0 or 1, true or false, etc.
• In logistic regression, we pass the weighted sum of inputs through an activation function that maps values to the range between 0 and 1. Such an activation function is known as the sigmoid function, and the curve obtained is called the sigmoid curve or S-curve. Consider the image:
The equation for logistic regression is  h(θ) = 1 / (1 + e^(-θ.x))
Logistic Regression - Probabilistic way of classification

The output of logistic regression is a probability (0 ≤ x ≤ 1), which can be used to predict the binary 0 or 1 as the output (if x < 0.5, output = 0, else output = 1).

Logistic regression acts very similarly to linear regression. It also calculates a linear output, followed by a squashing function over the regression output. The sigmoid function is the most frequently used logistic function.

The h(θ) value here corresponds to P(y = 1 | x), i.e. the probability of the output being binary 1, given input x. P(y = 0 | x) will be equal to 1 - h(θ). When the value of z is 0, g(z) will be 0.5. Whenever z is positive, h(θ) will be greater than 0.5 and the output will be binary 1. Likewise, whenever z is negative, the value of y will be 0. As we use a linear equation to find the classifier, the resulting model will also be a linear one; that means it splits the input dimension into two spaces, with all points in one space corresponding to the same label.
Logistic Regression
Sigmoid function:  g(z) = 1 / (1 + e^(-z))

Loss function:
We can't use the mean squared error as the loss function (as in linear regression), because we use a non-linear sigmoid function at the end. The MSE would introduce local minima and affect the gradient descent algorithm. So cross entropy is used as the loss function here. Two equations are used, corresponding to y = 1 and y = 0:

  cost = -log(h(θ))       if y = 1
  cost = -log(1 - h(θ))   if y = 0

The basic logic here is that whenever the prediction is badly wrong (e.g. y' = 1 and y = 0), the cost will be -log(0), which is infinity. A small sketch follows below.
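A minimal logistic-regression sketch: sigmoid output, the cross-entropy loss discussed above, and a few gradient steps. The tiny 1-D dataset, learning rate, and iteration count are assumptions made up for illustration.

```python
# Logistic regression on a 1-D toy problem with cross-entropy loss.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])               # binary labels

theta0, theta1, lr = 0.0, 0.0, 0.5
for _ in range(5000):
    h = sigmoid(theta0 + theta1 * x)           # predicted P(y = 1 | x)
    # the cross-entropy gradient reduces to (h - y) terms
    theta0 -= lr * (h - y).mean()
    theta1 -= lr * ((h - y) * x).mean()

loss = -(y * np.log(h) + (1 - y) * np.log(1 - h)).mean()   # cross entropy
print(theta0, theta1, loss)
print((h >= 0.5).astype(int))                  # thresholded predictions: [0 0 0 1 1 1]
```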
Parametric vs Non-parametric models
Statistical learning models can be classified as parametric or non-parametric models. Machine learning models use statistical learning to predict unseen data.
The purpose of a statistical model is to approximate the relationship between the dependent and independent variables. The dependent variable is the center of attention of machine learning models; it is what we need to predict based on one or more independent variables.
Parametric model
A common example of a parametric algorithm is linear regression: the linear function uses parameters like the intercept and slopes, so by training on the data we calculate these parameters.
• Some other examples of parametric machine learning algorithms are: Linear Support Vector Machines, Logistic Regression, Naive Bayes classifier, Perceptron

Non-parametric model
Non-parametric models do not assume a fixed functional form for the model. Such models are flexible; an example of a non-parametric model is K-Nearest Neighbors, which makes predictions based on the k most similar training patterns for a new data instance. The method does not assume anything about the form of the mapping function other than that patterns that are close are likely to have a similar output variable.
• Some other examples of non-parametric machine learning algorithms are: Decision Tree, SVM with Gaussian kernel, ANN
K – nearest neighbors - KNN
K-nearest neighbors is a non-parametric method used for classification and regression. It is one of the easiest ML techniques to use. It is a lazy learning model, with local approximation.
Basic Theory
The basic logic behind KNN is to explore your neighborhood, assume the test data point to be similar to its neighbors, and derive the output. In KNN, we look for k neighbors and come up with the prediction.
In the case of KNN classification, a majority vote is applied over the k nearest data points, whereas in KNN regression, the mean of the k nearest data points is calculated as the output. As a rule of thumb, we select odd numbers as k. KNN is a lazy learning model where the computation happens only at runtime.
K – nearest neighbors
In the given figure, the yellow and violet points correspond to Class A and Class B in the training data. The red star is the test data point which is to be classified. When k = 3, we predict Class B as the output, and when k = 6, we predict Class A as the output.

Loss function:
There is no training involved in KNN. During testing, the k neighbors with minimum distance take part in the classification / regression.
K – nearest neighbors
Example – Consider the following table of height, weight and age values of 10 people. Predict the weight value of the person with ID 11.

Euclidean distance:  D(11, 1) = sqrt((5.5 - 5)^2 + (38 - 45)^2) = 7.017
Manhattan distance:  D(11, 1) = |5.5 - 5| + |38 - 45| = 0.5 + 7 = 7.5

Predicted value of weight for sample 11 is 69.6
K – nearest neighbors
Solution – Consider the following table of height, weight and age values of 10 people. Predict the weight value of the person with ID 11.

Since ID 11 is closer to points 5 and 1, it must have a weight similar to these IDs, probably between 72 and 77 kg (the weights of ID 1 and ID 5 from the table).

How does the KNN algorithm predict this value?
K – nearest neighbors
Solution – The KNN algorithm uses feature similarity for prediction.

• First the distance between the new point and each training input is calculated.
• The closest k data points are selected (based on the distance). In this example, points 1, 5, 6 will be selected if the value of k is 3.
Methods of calculating the distance

Euclidean distance:  d(x, y) = sqrt( Σi (xi - yi)^2 )
Manhattan distance:  d(x, y) = Σi |xi - yi|
Hamming distance: the number of positions at which the corresponding (categorical) values differ.
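Illustrative implementations of the three distance measures in plain Python (the example points reuse the ID 11 vs ID 1 values from the worked example; the string pair is made up).

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    # count of positions where the (categorical) values differ
    return sum(x != y for x, y in zip(a, b))

print(euclidean((5.5, 38), (5.0, 45)))   # 7.017..., the distance used above
print(manhattan((5.5, 38), (5.0, 45)))   # 7.5
print(hamming("karolin", "kathrin"))     # 3
```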
K – nearest neighbors (continued) - How to choose the k factor? (k = number of neighbors)

The KNN Algorithm

1. Load the data
2. Initialize K to your chosen number of neighbors
3. For each example in the data
   3.1 Calculate the distance between the query example and the current example from the data.
   3.2 Add the distance and the index of the example to an ordered collection
4. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances
5. Pick the first K entries from the sorted collection
6. Get the labels of the selected K entries
7. If regression, return the mean of the K labels
8. If classification, return the mode of the K labels
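A direct sketch of these steps in Python (distance, sort, pick K, then mean or mode). The small (height, age) → weight training set here is made up for illustration and is not the table from the slides.

```python
# K-nearest neighbors following the steps above.
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3, regression=True):
    # steps 3-4: compute distances to every training example and sort ascending
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # steps 5-6: take the labels of the K nearest neighbours
    k_labels = [label for _, label in dists[:k]]
    if regression:
        return sum(k_labels) / k                      # step 7: mean
    return Counter(k_labels).most_common(1)[0][0]     # step 8: mode

train_X = [(5.0, 45), (5.5, 26), (5.1, 30), (5.6, 34)]   # (height, age), assumed values
train_y = [77, 60, 55, 59]                               # weights, assumed values
print(knn_predict(train_X, train_y, query=(5.5, 38), k=3))
```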
Naïve Bayes Classifier
Naive Bayes is a classifier which uses Bayes' Theorem.

Bayes Theorem
Conditional probability - the probability that something will happen, given that something else has already occurred.
Using conditional probability, we can calculate the probability of an event using its prior knowledge.
Below is the formula for calculating the conditional probability:

  P(H | E) = P(E | H) P(H) / P(E)

P(H) - probability of hypothesis H being true. This is known as the prior probability.
P(E) - probability of the evidence (regardless of the hypothesis).
P(E|H) - probability of the evidence given that the hypothesis is true (the likelihood of the evidence).
P(H|E) - probability of the hypothesis given that the evidence is there (the posterior probability).
Naïve Bayes Classifier
Example - Consider a school with a total population of 100. What is the conditional probability that a certain member of the school is a 'Teacher' given that he is a 'Man'?

Solution
Naïve Bayes Classifier
Example - Consider 1000 fruits which could be either 'banana', 'orange' or 'other'. These are the 3 possible classes of the Y variable. We have data for the following X variables (binary): long, sweet and yellow. These are the features used in the training dataset, and the first few rows of the training dataset are given below.

Fruit    Long (x1)   Sweet (x2)   Yellow (x3)
Orange       0           1            0
Banana       1           0            1
Banana       1           1            1
Other        1           1            0

Aggregated dataset (counts table)

Predict if the given fruit is banana, orange, or other.
Naïve Bayes Classifier
Solution

Step 1: Compute the ‘Prior’ probabilities for each of the class of fruits.
P(Y=Banana) = 500 / 1000 = 0.50
P(Y=Orange) = 300 / 1000 = 0.30
P(Y=Other) = 200 / 1000 = 0.20

Step 2: Compute the probability of evidence that goes in the denominator.
P(x1=Long) = 500 / 1000 = 0.50
P(x2=Sweet) = 650 / 1000 = 0.65
P(x3=Yellow) = 800 / 1000 = 0.80
Step 3: Compute the probability of likelihood of evidences that goes in the numerator.
Probability of Likelihood for Banana
P(x1=Long | Y=Banana) = 400 / 500 = 0.80
P(x2=Sweet | Y=Banana) = 350 / 500 = 0.70
P(x3=Yellow | Y=Banana) = 450 / 500 = 0.90

If a fruit is long, sweet and yellow, what fruit is it?


Naïve Bayes Classifier
If a fruit is long, sweet and yellow, what fruit is it?

P(Banana | long, sweet, yellow) = P(long | Banana) P(sweet | Banana) P(yellow | Banana) P(Banana) / P(long, sweet, yellow)
                                = (0.8)(0.7)(0.9)(0.5) / P(Evidence)
                                = 0.252 / P(Evidence)

P(Orange | long, sweet, yellow) = P(long | Orange) P(sweet | Orange) P(yellow | Orange) P(Orange) / P(long, sweet, yellow)
                                = 0 / P(Evidence)

P(Other | long, sweet, yellow)  = P(long | Other) P(sweet | Other) P(yellow | Other) P(Other) / P(long, sweet, yellow)
                                = 0.01875 / P(Evidence)

Answer - Banana, since it has the highest probability among the 3 classes.
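The comparison can be reproduced in a few lines of Python. Only the class numerators are compared, since P(Evidence) is the same for every class; Banana's likelihoods are taken from Step 3 above, and the Orange and Other scores are the values reported on the slide (their individual likelihoods are not shown there).

```python
# Comparing the Naive Bayes numerators for the fruit example.
priors = {"Banana": 0.50, "Orange": 0.30, "Other": 0.20}
banana_likelihoods = (0.80, 0.70, 0.90)        # P(long|B), P(sweet|B), P(yellow|B)

banana_score = priors["Banana"]
for p in banana_likelihoods:
    banana_score *= p                          # 0.5 * 0.8 * 0.7 * 0.9 = 0.252

scores = {"Banana": banana_score, "Orange": 0.0, "Other": 0.01875}
print(scores)
print(max(scores, key=scores.get))             # 'Banana'
```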


Naïve Bayes Classifier - Bayes Theorem
Example: the probability that a patient actually has a disease, given a positive test.

• D - disease; the test has two results, "positive" and "negative".
• The lab test guarantees 99% accuracy, i.e. a patient with the disease will test positive 99% of the time, and a patient without the disease will test negative 99% of the time.
• If 3% of the people have this disease and the test gives a positive result, what is the probability that the person actually has the disease?
Solution: p(D) = 3% = 0.03 - probability of a person suffering from the disease
p(~D) = 97% = 0.97 - probability of a person not suffering from the disease
p(pos | D) = 99% = 0.99 - probability of a positive test for a patient having the disease
p(pos | ~D) = 1% = 0.01 - probability of a positive test for a patient not having the disease
The probability that the patient actually has the disease:

  P(D | pos) = P(pos | D) P(D) / P(pos),   where   P(pos) = P(D) P(pos | D) + P(~D) P(pos | ~D)

P(D | pos) = (0.99 × 0.03) / (0.03 × 0.99 + 0.97 × 0.01) ≈ 0.75, i.e. there is a 75% chance that the patient is actually suffering from the disease.
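The same posterior, computed in Python as an illustration of the arithmetic above:

```python
# Bayes theorem for the disease-test example.
p_d = 0.03          # P(D): prior probability of the disease
p_not_d = 0.97      # P(~D)
p_pos_d = 0.99      # P(pos | D)
p_pos_not_d = 0.01  # P(pos | ~D)

p_pos = p_d * p_pos_d + p_not_d * p_pos_not_d   # total probability of a positive test
p_d_pos = (p_pos_d * p_d) / p_pos               # posterior P(D | pos)
print(round(p_d_pos, 2))                        # 0.75
```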
Naïve Bayes Classifier
It predicts membership probabilities for each class, such as the probability that a given record or data point belongs to a particular class. The class with the highest probability is considered the most likely class. This is also known as Maximum A Posteriori (MAP):

  MAP = max( P(H | E) ) = max( P(E | H) P(H) / P(E) ) ∝ max( P(E | H) P(H) )

P(E) is only used to normalize the result, so it can be dropped when comparing classes.

Example: 3 classes associated with animal types: i) Parrot ii) Dog iii) Fish.
4 features for predicting the animal type: i) Swim ii) Wings iii) Green colour iv) Dangerous teeth

Swim      Wings     Green colour   Dangerous teeth   Animal type
50/500    500/500   400/500        0                 Parrot
450/500   0         0              500/500           Dog
500/500   0         150/500        50/500            Fish

Total number of animals: 1500
Naïve Bayes Classifier
Prediction using the Naïve Bayes classifier

Swim    Wings   Green colour   Dangerous teeth   Animal type
True    False   True           False             ?
True    False   True           True              ?

Naive Bayes approach:

  P(H | multiple evidences) = P(E1 | H) P(E2 | H) ... P(En | H) P(H) / P(multiple evidences)

i) For the first record of data, the evidence is Swim and Green.
Hypothesis testing for the animal to be Dog:

  P(Dog | Swim, Green) = P(Swim | Dog) P(Green | Dog) P(Dog) / P(Swim, Green)
                       = (0.9 × 0 × 0.333) / P(Swim, Green)
                       = 0
Naïve Bayes Classifier
Prediction using the Naïve Bayes classifier

Swim    Wings   Green colour   Dangerous teeth   Animal type
True    False   True           False             ?
True    False   True           True              ?

ii) For the second record of data, the evidence is Swim, Green and Dangerous teeth.
Hypothesis testing for the animal to be Dog:

  P(Dog | Swim, Green, Teeth) = P(Swim | Dog) P(Green | Dog) P(Teeth | Dog) P(Dog) / P(Swim, Green, Teeth)
                              = (0.9 × 0 × 1 × 0.333) / P(Swim, Green, Teeth)
                              = 0
Naïve Bayes Classifier
Prediction using the Naïve Bayes classifier

Record 1 (evidence: Swim, Green):
  Dog:    (0.9 × 0 × 0.333) / P(Swim, Green)    = 0
  Parrot: (0.1 × 0.8 × 0.333) / P(Swim, Green)  = 0.02664 / P(Swim, Green)
  Fish:   (1 × 0.2 × 0.333) / P(Swim, Green)    = 0.0666 / P(Swim, Green)
  Predicted animal type: Fish

Record 2 (evidence: Swim, Green, Dangerous teeth):
  Dog:    (0.9 × 0 × 1 × 0.333) / P(Swim, Green, Teeth)    = 0
  Parrot: (0.1 × 0.8 × 0 × 0.333) / P(Swim, Green, Teeth)  = 0
  Fish:   (1 × 0.2 × 0.1 × 0.333) / P(Swim, Green, Teeth)  = 0.00666 / P(Swim, Green, Teeth)
  Predicted animal type: Fish

Using Naive Bayes, we can predict that the class of the first record is Fish and the class of the second record is also Fish.

Naive Bayes can learn the importance of individual features but cannot determine the relationships among features.
