Deep Learning
The mathematical part of Machine Learning deals with statistics, calculus, linear algebra, and probability.
Introduction to Deep Learning
Optimization algorithm
This is the main driving force behind machine learning algorithms. Usually we need to define some sort of objective
function and use an optimization algorithm to either minimize or maximize it. The objective function is often computed
from the difference (error) between the estimate and the target (true answer); therefore, the smaller the error, the
better. There are different optimization algorithms for tuning the model, such as gradient descent, least-squares minimization,
and so on.
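A minimal sketch of gradient descent (the quadratic objective and learning rate below are illustrative assumptions, not from the slides): we repeatedly step against the gradient of the objective until it is minimized.

```python
# Objective: f(w) = (w - 3)^2, minimized at w = 3 (illustrative choice).
def f(w):
    return (w - 3.0) ** 2

def grad_f(w):
    # Derivative of the objective with respect to w.
    return 2.0 * (w - 3.0)

w = 0.0              # initial guess
learning_rate = 0.1  # step size (hyperparameter)

for step in range(100):
    w -= learning_rate * grad_f(w)  # move against the gradient

print(f"w = {w:.4f}, f(w) = {f(w):.6f}")  # w approaches 3, f(w) approaches 0
```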
Trained Model
After tuning, the model is capable of making classifications, predictions, and so on. The trained model can then be
applied to new data to achieve our goal.
Example:
Classification
Classification of a group of oranges and apples (a binary classification problem, since there are only two classes)
• First, calculate features for each apple and orange and save them into the feature matrix as shown below.
• Target label 0 represents orange and 1 refers to apple.
• Since we have 5 features in the figure, it is not easy to visualize them. If we only plot the first two features, i.e. color and texture, we may see something as below:
By design, the SVM algorithm first forms a buffer from the boundary line to the points in both classes that are closest
to the line (these are the support vectors, which is where the name comes from). The problem then becomes: given a set of
these support vectors, which line has the maximum buffer? The black dotted line has a narrow buffer, while the red solid
line has a wider buffer.
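As a hedged illustration of this idea (assuming scikit-learn is available; the two-feature fruit data below is invented, not the slide's actual feature matrix), a linear SVM exposes its maximum-margin boundary and support vectors after fitting:

```python
from sklearn.svm import SVC

# Hypothetical feature matrix: [color, texture] for oranges (label 0) and apples (label 1).
X = [[0.9, 0.8], [0.8, 0.9], [0.85, 0.75],   # oranges
     [0.2, 0.1], [0.1, 0.2], [0.15, 0.25]]   # apples
y = [0, 0, 0, 1, 1, 1]

# A linear kernel finds the maximum-margin separating line;
# the training points closest to it become the support vectors.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("support vectors:", clf.support_vectors_)
print("prediction for a new fruit:", clf.predict([[0.7, 0.8]]))
```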
Mathematics of Support Vector Machines
Dot product
• Consider a random point X; we want to know whether it lies on the right side of the plane or the left side of the plane (positive or negative).
• We construct a vector (w) which is perpendicular to the hyperplane.
• Let's say the distance from the origin to the decision boundary along w is 'c'. Now we take the projection of the vector X onto w. The projection of one vector onto another is obtained with the dot product.
Now we need (w, b) such that the margin has the maximum distance. Let's say this distance is 'd'.
To calculate 'd' we need the equations of L1 and L2.
For this, we make the assumption that the equation of L1 is w.x + b = 1 and that of L2 is w.x + b = -1.
Mathematics of Support Vector Machines
• Take two support vectors, x1 from the negative class and x2 from the positive class. The vector between them is (x2 - x1).
• Take the vector 'w' perpendicular to the hyperplane and find the projection of (x2 - x1) onto 'w'. To make 'w' a unit vector, we divide it by the norm of 'w'.
Since x1 and x2 are support vectors and lie on the margin hyperplanes, they satisfy yi (w.xi + b) = 1,
so we can write the optimization as a quadratic programming problem with an inequality constraint:
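Reconstructing what the slide's figure expresses: projecting (x2 - x1) onto the unit vector w/||w|| and applying the two constraints gives the margin, and maximizing it is equivalent to the quadratic program below.

```latex
d \;=\; \frac{(x_2 - x_1)\cdot w}{\lVert w\rVert}
  \;=\; \frac{(w\cdot x_2 + b) - (w\cdot x_1 + b)}{\lVert w\rVert}
  \;=\; \frac{1 - (-1)}{\lVert w\rVert}
  \;=\; \frac{2}{\lVert w\rVert}

\max_{w,\,b}\ \frac{2}{\lVert w\rVert}
\;\Longleftrightarrow\;
\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,(w\cdot x_i + b) \ge 1 \ \ \forall i
```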
Assuming 𝑤 = (1.5, 1.5):
𝐿1: 𝑤.𝑥 + 𝑏 = +1
For 𝑥 = (1, 2): 1.5(1) + 1.5(2) + 𝑏 = 1, so 𝑏 = −3.5
For 𝑥 = (2, 1): 1.5(2) + 1.5(1) + 𝑏 = 1, so 𝑏 = −3.5
𝐿2: 𝑤.𝑥 + 𝑏 = −1
For 𝑥 = (0, 0): 1.5(0) + 1.5(0) + 𝑏 = −1, so 𝑏 = −1
• In real-life applications, we rarely encounter datasets that are perfectly linearly separable. Instead, we often come across
datasets that are either nearly linearly separable or entirely non-linearly separable.
Hard and Soft Margin SVM
A "hard margin" is applicable only when the data is linearly separable. In real datasets, where data points do not fall
perfectly into the right regions, hard margins do not give good results; they are also very sensitive to outliers. We can
use soft-margin classification to avoid these issues. A hard margin allows zero training errors and will therefore tend to
overfit the model.
Forcing rigid margins can result in a
model that performs perfectly in the
training set, but is possibly over-fit /
less generalizable when applied to a
new dataset.
Hard and Soft Margin SVM
The soft margin SVM follows a somewhat similar optimization procedure with a couple of differences. First, in this
scenario, we allow misclassifications to happen. So we’ll need to minimize the misclassification error, which means that
we’ll have to deal with one more constraint. Second, to minimize the error, we should define a loss function. A common loss
function used for soft margin is the hinge loss.
The loss of a misclassified point is called a slack variable and is added to the primal problem that we had for hard margin
SVM. So the primal problem for the soft margin becomes:
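The formula is shown as a figure in the original; in standard notation, with slack variables ξi and regularization parameter C, it reads:

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} \;+\; C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w\cdot x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0\ \ \forall i
```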
A new regularization parameter (commonly denoted C) controls the trade-off between maximizing the margin and minimizing the loss. As you
can see, the difference between this primal problem and the one for the hard margin is the addition of slack variables.
The new slack variables (ξi in the figure below) add flexibility for misclassifications by the model:
SVM Kernels
In many real-world scenarios, the data is not linearly separable in the original feature space. Kernels help by implicitly
mapping the original feature space into a higher-dimensional space where the data might be more easily separable.
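A hedged sketch of this effect (scikit-learn assumed; the concentric-circles toy data is generated only for illustration): a linear SVM fails on ring-shaped classes, while an RBF kernel separates them through an implicit high-dimensional mapping.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_clf = SVC(kernel="linear").fit(X, y)
rbf_clf = SVC(kernel="rbf", gamma="scale").fit(X, y)  # implicit higher-dimensional mapping

print("linear kernel accuracy:", linear_clf.score(X, y))  # typically near 0.5
print("RBF kernel accuracy:", rbf_clf.score(X, y))        # typically near 1.0
```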
In estimating the coefficients of the model, the least squares procedure seeks to minimize the sum of the squared residuals.
This means that, given a regression line through the data, we calculate the distance from each data point to the regression line,
square it, and sum all of the squared errors together. This is the quantity that ordinary least squares seeks to minimize.
Weight = 𝑏0 + 𝑏1 Height
The output of linear regression should only be continuous values such as price, age, salary, etc.
Linear Regression
LR allocates a weight parameter for each of the training features. The predicted output h(θ) will be a
linear function of the features (x) and the coefficients (θ).
Loss function:
In LR, we use the mean squared error as the loss metric. The deviations between the expected and actual outputs are
squared and summed up. The derivative of this loss is used by the gradient descent algorithm.
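Written out (this is the standard formulation; the 1/2m scaling is a common convention and the superscript (i) indexes training examples), the hypothesis, the mean-squared-error loss, and its gradient are:

```latex
h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n ,
\qquad
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_{\theta}(x^{(i)}) - y^{(i)}\bigr)^{2},
\qquad
\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_{\theta}(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)}
```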
Logistic Regression
• Logistic Regression is used for solving classification problems.
• The output of a logistic regression problem can only be between 0 and 1.
• Logistic regression can be used where the probability between two classes is required, such as whether it will rain today or not, either 0 or 1, true or false, etc.
• In logistic regression, we pass the weighted sum of inputs through an activation function that can map values to between 0 and 1. Such an activation function is known as the sigmoid function, and the curve obtained is called the sigmoid curve or S-curve. Consider the image:
The equation for logistic regression is
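The formula appears as an image in the original slide; in standard notation, with z the weighted sum of the inputs, it is:

```latex
z = \theta^{T}x, \qquad h_{\theta}(x) = g(z) = \frac{1}{1 + e^{-z}}
```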
Logistic Regression - Probabilistic way of classification
The output of the logistic regression will be a probability (0 ≤ 𝑥 ≤ 1), which can be used to predict the binary output 0 or 1
(if 𝑥 < 0.5, output = 0, else output = 1).
The h(θ) value here corresponds to 𝑃(𝑦 = 1|𝑥), i.e., the probability of the output being binary 1 given input x. 𝑃(𝑦 = 0|𝑥) will be
equal to 1 − ℎ(𝜃). When the value of z is 0, 𝑔(𝑧) will be 0.5. Whenever z is positive, ℎ(𝜃) will be greater than 0.5 and the output
will be binary 1. Likewise, whenever z is negative, the value of y will be 0. As we use a linear equation to find the classifier,
the output model will also be a linear one, which means it splits the input dimension into two spaces with all points in one
space corresponding to the same label.
Logistic Regression
Sigmoid function
Loss function:
We can't use mean squared error as the loss function (as in linear regression), because
we use a non-linear sigmoid function at the end. The MSE function may introduce
local minima and would affect the gradient descent algorithm.
So cross-entropy is used as the loss function here. Two equations are used,
corresponding to y = 1 and y = 0. The basic logic is that whenever the
prediction is badly wrong (e.g., 𝑦' = 1 and 𝑦 = 0), the cost will be −log(0), which is
infinity.
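For reference, the standard binary cross-entropy loss that combines the y = 1 and y = 0 cases is:

```latex
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Bigl[\,y^{(i)}\log h_{\theta}\bigl(x^{(i)}\bigr) + \bigl(1 - y^{(i)}\bigr)\log\bigl(1 - h_{\theta}\bigl(x^{(i)}\bigr)\bigr)\Bigr]
```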
Parametric vs Non-parametric models
Statistical learning models can be classified as parametric or nonparametric models. Machine learning models use
statistical learning to predict unseen data.
The purpose of the statistical model is to approximate the relationship between the dependent and independent variables.
The dependent variable is the center of attention of machine learning models; it is what we need to predict based on
one or more independent variables.
Parametric model
A parametric model assumes a fixed functional form (and hence a fixed number of parameters) for the mapping function.
A common example of a parametric algorithm is linear regression: the linear function uses parameters such as the intercept
and slopes, and by training on the data we calculate these parameters.
• Some other examples of parametric machine learning algorithms are: Linear
Support Vector Machines, Logistic Regression, Naive Bayes classifier, Perceptron
Non-parametric model
Non-parametric models do not assume a fixed functional form for the model. Such models are flexible; an example of a
non-parametric model is K-Nearest Neighbors, which makes predictions based on the k most similar training patterns for a
new data instance. The method does not assume anything about the form of the mapping function other than that patterns
which are close are likely to have a similar output variable.
• Some other examples of non-parametric machine learning algorithms are: Decision
Tree, SVM with Gaussian kernel, ANN
K-Nearest Neighbors (KNN)
K-nearest neighbors is a non-parametric method used for classification and regression. It is one of the easiest ML
techniques to use. It is a lazy learning model with local approximation.
Basic Theory
The basic logic behind KNN is to explore the neighborhood of a test data point, assume the test data point to be similar
to its neighbors, and derive the output. In KNN, we look for k neighbors and come up with the prediction.
In the case of KNN classification, majority voting is applied over the k nearest data points, whereas in KNN regression,
the mean of the k nearest data points is calculated as the output. As a rule of thumb, we select an odd number for k. KNN
is a lazy learning model where the computation happens only at runtime.
K-Nearest Neighbors
In the given figure, yellow and violet points correspond to Class A and Class B in the training data. The red star is the
test data point to be classified. When k = 3, we predict Class B as the output, and when k = 6, we predict Class A as the
output.
Loss function:
There is no training involved in KNN. During testing, the k neighbors with the minimum distance take part in the
classification/regression.
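A minimal sketch of both modes (scikit-learn assumed; the tiny training set below is invented): majority voting for KNN classification and averaging for KNN regression.

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Hypothetical training data: two features per point.
X_train = [[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]]
y_class = [0, 0, 0, 1, 1, 1]                   # class labels
y_reg = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]   # continuous targets

# Classification: majority vote among the k nearest neighbors.
knn_clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_class)
print(knn_clf.predict([[2, 2]]))   # -> [0]

# Regression: mean of the k nearest neighbors' targets.
knn_reg = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_reg)
print(knn_reg.predict([[2, 2]]))   # -> mean of 10, 12, 11 = [11.]
```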
K-Nearest Neighbors
Example: Consider the following table of height, weight, and age values of 10 people. Predict the weight of the person
with ID 11.
Hamming distance: a distance metric for categorical features, given by the number of positions at which the feature values differ.
K-Nearest Neighbors (continued): How to choose the k factor? (k = number of neighbors)
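The slide's figure is not reproduced here; one common approach (a hedged sketch assuming scikit-learn, with the iris dataset used purely as a stand-in) is to pick k by cross-validation over a range of odd values:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try odd k values and keep the one with the best cross-validated accuracy.
scores = {}
for k in range(1, 22, 2):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("best k:", best_k, "accuracy:", round(scores[best_k], 3))
```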
Bayes Theorem
Conditional probability - probability that something will happen, given that something else has already
occurred.
Using the conditional probability, we can calculate the probability of an event using its prior knowledge.
Below is the formula for calculating the conditional probability.
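The formula itself appears as an image in the original; in standard notation it is:

```latex
P(H \mid E) \;=\; \frac{P(E \mid H)\, P(H)}{P(E)}
```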
P(H) - probability of hypothesis H being true. This is known as the prior probability.
P(E) - probability of the evidence (regardless of the hypothesis).
P(E|H) - probability of the evidence given that the hypothesis is true (the likelihood of the evidence).
P(H|E) - probability of the hypothesis given that the evidence is there.
Naïve Bayes Classifier
Example: Consider a school with a total population of 100.
What is the conditional probability that a certain member of the school is a 'Teacher', given that he is a 'Man'?
Solution: (worked out in the original slide using P(Teacher | Man) = P(Teacher ∩ Man) / P(Man))
Naïve Bayes Classifier
Example: Consider 1000 fruits which could be either 'banana', 'orange', or 'other'. These are the 3 possible classes of the Y
variable. We have data for the following X variables (binary): long, sweet, and yellow. These are the features used in the training
dataset, and the first few rows of the training dataset are given below.
Step 1: Compute the ‘Prior’ probabilities for each of the class of fruits.
P(Y=Banana) = 500 / 1000 = 0.50
P(Y=Orange) = 300 / 1000 = 0.30
P(Y=Other) = 200 / 1000 = 0.20
[Steps 2 and 3, computing the likelihoods and the posterior for each class, are shown in the original slides; one of the posteriors works out to 0 / P(Evidence) = 0.]
(From a separate worked example on a diagnostic test, shown in the original slides:) P(D|pos) = 0.75, i.e., there is a 75% chance that the patient is actually suffering from the disease.
Naïve Bayes Classifier
It predicts membership probabilities for each class, i.e., the probability that a given record or data point belongs
to a particular class. The class with the highest probability is considered the most likely class. This is also
known as Maximum A Posteriori (MAP):
MAP = max(P(H | E)) = max(P(E | H) P(H) / P(E)) ∝ max(P(E | H) P(H)), since P(E) is only used to normalize the result.
Example: 3 classes associated with animal types: i) Parrot, ii) Dog, iii) Fish.
4 features for predicting the animal type: i) Swim, ii) Wings, iii) Green colour, iv) Dangerous teeth.
[The likelihood table and the posterior computation for the first record (evidence: Swim, Green), with denominator P(Swim, Green), are shown in the original slides.]
Naïve Bayes Classifier
Prediction using Naïve Bayes classifier
ii) For the second record of data, the evidence is Swim, Green, and Dangerous teeth.
Hypothesis: the animal is a Dog.
P(Dog | Swim, Green, Teeth) = P(Swim | Dog) P(Green | Dog) P(Teeth | Dog) P(Dog) / P(Swim, Green, Teeth)
[The original slide tabulates the hypothesis tests for the animal to be a Dog, a Parrot, and a Fish, together with the predicted Animal Type.]
Using Naive Bayes, we can predict that the class of the first record is Fish and that the second record is also Fish.
Naive Bayes can learn the importance of individual features but cannot determine the relationships among features.
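As a hedged, self-contained sketch (scikit-learn assumed; the binary animal features below are invented for illustration and are not the slide's actual table), a Bernoulli Naive Bayes classifier treats each binary feature independently and returns the MAP class:

```python
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary features: [swim, wings, green, dangerous_teeth]
X = [
    [1, 0, 0, 1],  # fish-like
    [1, 0, 1, 1],  # fish-like
    [0, 1, 1, 0],  # parrot-like
    [0, 1, 1, 0],  # parrot-like
    [1, 0, 0, 1],  # dog-like
    [0, 0, 0, 1],  # dog-like
]
y = ["Fish", "Fish", "Parrot", "Parrot", "Dog", "Dog"]

clf = BernoulliNB()  # each feature modeled as an independent Bernoulli variable
clf.fit(X, y)

# Evidence from the text: Swim, Green, Dangerous teeth (no wings).
print(clf.classes_)
print(clf.predict([[1, 0, 1, 1]]))        # most likely class (MAP)
print(clf.predict_proba([[1, 0, 1, 1]]))  # per-class posterior probabilities
```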