ML UNIT-5 Notes PDF
COLLEGE
(Approved by AICTE, Affiliated to JNTUK) (An ISO 9001:2008 Certified Institution)
Prepared by Dr.M.BHEEMALINGAIAH
UNIT V
Bayesian Learning: Probability theory and Bayes rule. Naive Bayes learning algorithm. Parameter
smoothing. Generative vs. discriminative training. Logistic regression. Bayes nets and Markov nets
for representing dependencies.
Instance-Based Learning: Constructing explicit generalizations versus comparing to past specific
examples. K-Nearest-neighbor algorithm. Case-based learning.
Content
5.1 Bayesian learning
5.2 Probability theory and Bayes rule
5.3 Naive Bayes learning algorithm
5.3.1 Advantages of Naïve Bayes Classifier
5.3.2 Disadvantages of Naïve Bayes Classifier
5.3.3 Applications of Naïve Bayes Classifier
5.3.4 Types of Naïve Bayes Model:
5.3.5 Problem-1
5.3.6 Problem-2
5.3.7 Parameter smoothing
5.4 Generative vs. discriminative Models
5.5 Logistic regression
5.6 Bayesian Belief Network
5.6.1 Problem-1
5.6.2 Problem-2
5.7 Hidden Markov Model (Networks)
5.7.1 Hidden Markov Model with an Example
5.7.2 Applications of Hidden Markov Model
5.8 Instance-based learning
5.9 K-Nearest Neighbor Algorithm
5.9.1 Why do we need a K-NN Algorithm?
5.9.2 How does K-NN work?
5.9.3 Problem
5.1 Bayesian learning
• Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct. This provides a more flexible approach to learning than algorithms that completely eliminate a hypothesis if it is found to be inconsistent with any single example.
• Prior knowledge can be combined with observed data to determine the final probability of a hypothesis. In Bayesian learning, prior knowledge is provided by asserting
a prior probability for each candidate hypothesis, and
a probability distribution over observed data for each possible hypothesis.
• New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.
• Even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other practical methods can be measured.
Bayesian learning
• In machine learning, we try to determine the best hypothesis from some hypothesis space H,
given the observed training data D.
• In Bayesian learning, the best hypothesis means the most probable hypothesis, given
the data D plus any initial knowledge about the prior probabilities of the various hypotheses
in H.
• Bayes theorem provides a way to calculate the probability of a hypothesis based on its
prior probability, the probabilities of observing various data given the hypothesis, and the
observed data itself.
P(h) denotes the initial probability that hypothesis h holds, before observing the training data. P(h) may reflect any background knowledge we have about the chance that h is correct. If we have no such prior knowledge, then each candidate hypothesis might simply get the same prior probability.
P(h|D) is called the posterior probability of h, because it reflects our confidence that h holds after
we have seen the training data D.
The posterior probability P(h|D) reflects the influence of the training data D, in contrast to the prior probability P(h), which is independent of D.
P(D|h) denotes the probability of observing data D given some world in which hypothesis h holds.
Generally, we write P(x | y) to denote the probability of event x given event y.
In machine learning problems, we are interested in the probability P(h|D) that h holds given the
observed training data D.
Bayes theorem provides a way to calculate the posterior probability P(h|D), from the prior
probability P(h), together with P(D) and P(D|h).
Bayes Theorem:
P(h | D) = P(D | h) P(h) / P(D)
Example truth-value table for two events over seven trials:
A holds: T T F F T F T
B holds: T F T F T F F
• In many learning scenarios, the learner considers a set of candidate hypotheses H and is interested in finding the most probable hypothesis h ∈ H given the observed data D. Any such maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis, hMAP.
• We can determine the MAP hypothesis by using Bayes theorem to calculate the posterior probability of each candidate hypothesis.
• Any hypothesis that maximizes P(D|h) is called a maximum likelihood (ML) hypothesis, hML.
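In the standard argmax notation, these two hypotheses can be written as:
hMAP = argmax (h ∈ H) P(h | D) = argmax (h ∈ H) P(D | h) P(h)   (P(D) is dropped since it is the same for every h)
hML = argmax (h ∈ H) P(D | h)   (this equals hMAP when every hypothesis is assigned the same prior P(h))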
The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present.
A patient takes a lab test and the result comes back positive. Should we conclude that the patient has the disease? Bayes theorem answers this once the prior probability of the disease is also known.
5.2 Probability theory and Bayes' rule
𝑃(𝐴|𝐵) = 𝑃(𝐵|𝐴) × 𝑃(𝐴) / 𝑃(𝐵)
Problem-1: Consider three baskets, Basket I, Basket II and Basket III, each containing rings of red color and green color. Basket I contains 6 red rings and 5 green rings. Basket II contains 3 green rings and 2 red rings, while Basket III contains 6 rings which are all red. A person chooses a ring randomly from a basket. If the ring picked is red, find the probability that it was taken from Basket II.
Solution:
Let A1, A2 and A3 be the events of choosing Basket I, Basket II and Basket III respectively, and let E be the event of picking a red ring.
Since the basket is chosen at random, P(A1) = P(A2) = P(A3) = 1/3.
P (Picking a red ring from Basket I) = P(E | A1) = 6/11
P (Picking a red ring from Basket II) = P(E | A2) = 2/5
P (Picking a red ring from Basket III) = P(E | A3) = 6/6 = 1
We need the probability of the ring coming from Basket II given that it is red, i.e., P(A2 | E).
Applying Bayes theorem,
P(A2 | E) = P(E | A2) P(A2) / [ P(A1) P(E | A1) + P(A2) P(E | A2) + P(A3) P(E | A3) ]
= (2/5 × 1/3) / (1/3 × 6/11 + 1/3 × 2/5 + 1/3 × 6/6)
= (2/15) / (107/165)
= 22/107 ≈ 0.2056
Problem-2: Assume the following probabilities: the probability of a person having Malaria is 0.02%, the probability of the test being positive given that the person has Malaria is 98%, and the probability of the test being negative given that the person does not have Malaria is 95%. Find the probability of a person having Malaria given that the test result is positive.
Solution:
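The computation follows directly from Bayes' rule. A minimal sketch in Python, assuming the stated 0.02% is read as a prior probability of 0.0002:

# Bayes' rule applied to the numbers given in the problem statement
p_malaria = 0.0002                 # prior: 0.02% of people have Malaria (assumed reading)
p_pos_given_malaria = 0.98         # test is positive in 98% of true Malaria cases
p_neg_given_healthy = 0.95         # test is negative in 95% of healthy cases
p_pos_given_healthy = 1 - p_neg_given_healthy   # false positive rate = 5%

# total probability of a positive result
p_pos = p_pos_given_malaria * p_malaria + p_pos_given_healthy * (1 - p_malaria)

# P(Malaria | positive)
p_malaria_given_pos = p_pos_given_malaria * p_malaria / p_pos
print(round(p_malaria_given_pos, 4))   # ~0.0039, i.e. about 0.39%

Because the disease is so rare, a positive test result still corresponds to only about a 0.39% chance of actually having Malaria.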
5.3 Naive Bayes learning algorithm
Using Bayes theorem, we can find the probability of A happening given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made by Naive Bayes is that the predictors/features are independent, that is, the presence of one particular feature does not affect the others. Hence it is called naive.
For a class variable y and a feature vector X = (x1, x2, …, xn), Bayes theorem can be rewritten as:
P(y | x1, …, xn) = P(x1 | y) P(x2 | y) … P(xn | y) P(y) / ( P(x1) P(x2) … P(xn) )
The variable y is the class variable (label), which represents the output; for example, in the Playtennis training dataset there are two class labels, yes or no, given the conditions. The variable X represents the features: here x1, x2, …, xn are the features/attributes of the training dataset, for example (Outlook, Temp, Humidity and Wind) in the Playtennis dataset.
Now, the value of each term can be obtained by looking at the dataset and substituting it into the equation. For all entries in the dataset, the denominator does not change; it remains constant.
In our case, the class variable y has only two outcomes, yes or no. There could also be cases where the classification is multiclass. In either case, we need to find the class y with the maximum posterior probability:
y = argmax over y of P(y) P(x1 | y) P(x2 | y) … P(xn | y)
Using this function, we can obtain the class, given the predictors.
5.3.1 Advantages of Naïve Bayes Classifier
When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models such as logistic regression, and it needs less training data.
It performs well for categorical input variables compared to numerical variable(s). For numerical variables, a normal distribution (bell curve) is assumed, which is a strong assumption.
5.3.2 Disadvantages of Naïve Bayes Classifier
Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.
If a categorical variable has a category in the test data set that was not observed in the training data set, then the model will assign a 0 (zero) probability and will be unable to make a prediction. This is often known as the “Zero Frequency” problem. To solve it, we can use a smoothing technique; one of the simplest smoothing techniques is called Laplace estimation.
Naive Bayes is also known as a bad estimator, so its probability outputs should not be taken too seriously. Moreover, in real life it is almost impossible to get a set of predictors which are completely independent.
5.3.3 Applications of Naïve Bayes Classifier
Real-time Prediction: Naive Bayes is an eager learning classifier and it is very fast, so it can be used for making predictions in real time.
Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are mostly used in text classification (due to better results in multi-class problems and the independence rule) and have a higher success rate as compared to other algorithms. As a result, Naive Bayes is widely used in Spam filtering (identifying spam e-mail) and Sentiment Analysis (in social media analysis, to identify positive and negative customer sentiments).
Recommendation System: Naive Bayes Classifier together with Collaborative Filtering builds a Recommendation System that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.
5.3.5 Problem-1: Take a real-time example of predicting the result of a student using the Naïve Bayes algorithm. The training dataset T consists of 8 data instances with attributes 'Assessment', 'Assignment', 'Project' and 'Seminar' as shown in the table below (Table 8.17). The target variable is 'Result', which is classified as Pass or Fail for a candidate student. Given a test data instance (Assessment = Average, Assignment = Yes, Project = No and Seminar = Good), predict the result of the student. Apply Laplace Correction if the Zero probability problem occurs.
S.No.  Assessment  Assignment  Project  Seminar  Result
8.     Good        Yes         Yes      Poor     Pass
Solution:
Step 1: Compute the prior probability for the target feature ‘Result’. It has two classes ‘Pass’ and
‘Fail’.
From the training data set, we observe that the frequency or the number of instances with ‘Result
= Pass’ is 5 and ‘Result = Fail’ is 3.
The prior probability for ‘Result = Pass’ is 5/8 and ‘Result = Fail’ is 3/8.
Step 2: Compute Frequency matrix and Likelihood Probability for each of the feature.
Step 2a: Feature - Assessment
Table 1 shows the frequency matrix for the feature Assessment.
Table 1 Frequency Matrix of Assessment
Assessment Result = Pass Result = Fail
Good 4 0
Average 2 2
Total 6 2
Table 2 shows how the likelihood probability is calculated for Assessment using conditional
probability.
Table 2 Likelihood Probability of Assessment
Assessment  P(Result = Pass)  P(Result = Fail)
Good        4/6               0/2
Average     2/6               2/2
Step 2b: Feature - Assignment
Table 3 shows the frequency matrix for the feature Assignment.
Table 3 Frequency Matrix of Assignment
Assignment  Result = Pass  Result = Fail
YES         2              2
NO          3              1
Total       5              3
Table 4 shows how the likelihood probability is calculated for Assignment using conditional
probability.
Table 4 Likelihood Probability of Assignment
Assignment  P(Result = Pass)  P(Result = Fail)
YES         2/5               2/3
NO          3/5               1/3
Step 2c: Feature - Project
Table 5 shows the frequency matrix for the feature Project.
Table 5 Frequency Matrix of Project
Project  Result = Pass  Result = Fail
YES      4              2
NO       1              1
Total    5              3
Table 6 shows how the likelihood probability is calculated for Project using conditional
probability.
Table 6 Likelihood Probability of Project
Project  P(Result = Pass)  P(Result = Fail)
YES      4/5               2/3
NO       1/5               1/3
Step 2d: Feature - Seminar
Table 7 shows the frequency matrix for the feature Seminar and Table 8 its likelihood probabilities.
Table 7 Frequency Matrix of Seminar
Seminar  Result = Pass  Result = Fail
Good     3              2
Poor     1              2
Total    4              4
Table 8 Likelihood Probability of Seminar
Seminar  P(Result = Pass)  P(Result = Fail)
Good     3/4               2/4
Poor     1/4               2/4
Step 3: Compute the posterior probability for the test instance (Assessment = Average, Assignment = Yes, Project = No, Seminar = Good). We can ignore P(Test Data) in the denominator since it is common for all cases to be considered.
P(Result = Pass | Test data) ∝ P(Assessment = Average | Pass) × P(Assignment = Yes | Pass) × P(Project = No | Pass) × P(Seminar = Good | Pass) × P(Result = Pass)
= 2/6 × 2/5 × 1/5 × 3/4 × 5/8 = 0.0125
P(Result = Fail | Test data) ∝ P(Assessment = Average | Fail) × P(Assignment = Yes | Fail) × P(Project = No | Fail) × P(Seminar = Good | Fail) × P(Result = Fail)
= 2/2 × 2/3 × 1/3 × 2/4 × 3/8 ≈ 0.0417
Step 4: Use Maximum A Posteriori (MAP) Hypothesis, 𝒉𝑴𝑨𝑷 to classify the test object to the
hypothesis with the highest probability.
Since P(Result = Fail | Test data) has the highest probability value, the test data is classified as
‘Result= Fail’.
There is no Zero Probability Error, so the algorithm stops.
5.3.6 Problem-2: Let us take an example to get some better intuition. Consider the problem of playing tennis. The dataset is represented below. Classify the new instance x′ = (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong).
Table 2: Training Data Set
Day  Outlook   Temp  Humidity  Wind    Play Tennis (decision)
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rainy     Mild  High      Weak    Yes
D5   Rainy     Cool  Normal    Weak    Yes
D6   Rainy     Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rainy     Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rainy     Mild  High      Strong  No
Step 1: Estimate the frequencies and probabilities for each feature over all instances.
There are 4 features (Outlook, Temperature, Humidity and Wind) with a total of 14 instances. The output column is the classification column, Playtennis.
Playtennis classification: Yes = 9, No = 5, Total = 14
1. Outlook
Instances   Yes  No  Total
Sunny       2    3   5
Overcast    4    0   4
Rainy       3    2   5
Total       9    5   14

2. Temperature
Instances   Yes  No  Total
Hot         2    2   4
Mild        4    2   6
Cool        3    1   4
Total       9    5   14

3. Humidity
Instances   Yes  No  Total
High        3    4   7
Normal      6    1   7
Total       9    5   14

4. Wind
Instances   Yes  No  Total
Strong      3    3   6
Weak        6    2   8
Total       9    5   14
Likelihood (conditional probability) tables:

1. Outlook
Instances   P(· | Yes)  P(· | No)
Sunny       2/9         3/5
Overcast    4/9         0/5
Rainy       3/9         2/5

2. Temperature
Instances   P(· | Yes)  P(· | No)
Hot         2/9         2/5
Mild        4/9         2/5
Cool        3/9         1/5

3. Humidity
Instances   P(· | Yes)  P(· | No)
High        3/9         4/5
Normal      6/9         1/5

4. Wind
Instances   P(· | Yes)  P(· | No)
Strong      3/9         3/5
Weak        6/9         2/5

Prior probabilities: P(Play = Yes) = 9/14 and P(Play = No) = 5/14.
For example, P(Outlook = Sunny | Play = No) = 3/5.
Step 2: Compute the posterior scores for the new instance x′ = (Sunny, Cool, High, Strong).
P(Yes | x′) ∝ P(Sunny | Yes) P(Cool | Yes) P(High | Yes) P(Strong | Yes) P(Play = Yes)
= 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0082 × 0.6428 ≈ 0.0053
P(No | x′) ∝ P(Sunny | No) P(Cool | No) P(High | No) P(Strong | No) P(Play = No)
= 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0576 × 0.3571 ≈ 0.0206
Conclusion: Since P(Yes | x′) < P(No | x′), we label x′ as “No”. The computed probability of Playtennis = No is higher than that of Playtennis = Yes, so the given day's weather conditions do not favour playing tennis.
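A small Python sketch (not part of the original solution) that reproduces the computation above directly from the likelihood tables:

# likelihood values taken from the tables above
likelihood_yes = {"Sunny": 2/9, "Cool": 3/9, "High": 3/9, "Strong": 3/9}
likelihood_no  = {"Sunny": 3/5, "Cool": 1/5, "High": 4/5, "Strong": 3/5}
p_yes, p_no = 9/14, 5/14          # prior probabilities

x_new = ["Sunny", "Cool", "High", "Strong"]   # the new instance x'

score_yes, score_no = p_yes, p_no
for value in x_new:
    score_yes *= likelihood_yes[value]
    score_no  *= likelihood_no[value]

print(round(score_yes, 4), round(score_no, 4))                  # ~0.0053 and ~0.0206
print("PlayTennis =", "Yes" if score_yes > score_no else "No")  # No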
5.3.7 Parameter smoothing
To avoid the Zero Frequency problem, Laplace (add-alpha) smoothing estimates the likelihood of a word w for the class y = positive as
P(w | y = positive) = (number of positive reviews containing w + alpha) / (N + alpha × K)
Here,
alpha represents the smoothing parameter,
K represents the number of dimensions (features) in the data, and
N represents the number of reviews with y = positive.
If we choose a value of alpha!=0 (not equal to 0), the probability will no longer be zero even if a
word is not present in the training dataset.
Interpretation of changing alpha
Let’s say the occurrence of word w is 3 with y=positive in training data. Assuming we have 2
features in our dataset, i.e., K=2 and N=100 (total number of positive reviews).
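Assuming the standard add-alpha estimate P(w | positive) = (count + alpha) / (N + alpha × K), a brief Python sketch with the numbers above (count = 3, N = 100, K = 2):

def smoothed_likelihood(count_w, n_class, k_features, alpha):
    # add-alpha (Laplace) estimate of P(w | class)
    return (count_w + alpha) / (n_class + alpha * k_features)

count_w, N, K = 3, 100, 2
for alpha in (0, 1, 10, 100):
    print(alpha, round(smoothed_likelihood(count_w, N, K, alpha), 4))
# 0 -> 0.03, 1 -> 0.0392, 10 -> 0.1083, 100 -> 0.3433:
# larger alpha pushes the estimate towards the uniform value 1/K.

# An unseen word (count 0) gets probability 0 without smoothing,
# but a small non-zero probability once alpha > 0:
print(round(smoothed_likelihood(0, N, K, 1), 4))   # ~0.0098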
5.4 Generative vs. discriminative Models
Generative models: Generative models are models where the focus is the distribution of
individual classes in a dataset and the learning algorithms tend to model the underlying
patterns/distribution of the data points. These models use the intuition of joint probability in
theory, creating instances where a given feature (x)/input and the desired output/label (y) exist at
the same time.
Generative models use probability estimates and likelihood to model data points and distinguish
between different class labels in a dataset. These models are capable of generating new data
instances. However, they also have a major drawback. The presence of outliers affects these
models to a significant extent.
Discriminative models: Discriminative models, also called conditional models, tend to learn the
boundary between classes/labels in a dataset. Unlike generative models, the goal here is to find
the decision boundary separating one class from another.
So while a generative model will tend to model the joint probability of data points and is capable
of creating new instances using probability estimates and maximum likelihood, discriminative
models (just as in the literal meaning) separate classes by rather modeling the conditional
probability and do not make any assumptions about the data point. They are also not capable of
generating new data instances.
Discriminative models have the advantage of being more robust to outliers, unlike the generative
models.
However, one major drawback is a misclassification problem, i.e., wrongly classifying a data
point.
Another key difference between these two types of models is that while a generative model
focuses on explaining how the data was generated, a discriminative model focuses on predicting
labels of the data.
Examples of discriminative models in machine learning are:
Logistic regression
Support vector machine
Decision tree
Random forest
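As a rough illustration of the contrast (not from the original notes), the sketch below trains one generative model (Gaussian Naive Bayes) and one discriminative model (logistic regression, from the list above) on the same synthetic data with scikit-learn; the dataset and any resulting scores are arbitrary:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

generative = GaussianNB().fit(X_train, y_train)               # models P(x | y) and P(y)
discriminative = LogisticRegression().fit(X_train, y_train)   # models P(y | x) directly

print("Naive Bayes accuracy:        ", generative.score(X_test, y_test))
print("Logistic regression accuracy:", discriminative.score(X_test, y_test))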
5.5 Logistic regression
Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical dependent
variable using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or
False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values
which lie between 0 and 1.
Logistic Regression is very similar to Linear Regression except in how they are used.
Linear Regression is used for solving Regression problems, whereas Logistic regression is
used for solving the classification problems.
In Logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function whose output always lies between two limiting values (0 and 1).
The curve from the logistic function indicates the likelihood of something such as whether the
cells are cancerous or not, a mouse is obese or not based on its weight, etc.
Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
Logistic Regression can be used to classify the observations using different types of data and
can easily determine the most effective variables used for the classification. The below image is
showing the logistic function:
Note: Logistic regression uses the concept of predictive modeling as regression; therefore, it is
called logistic regression, but is used to classify samples; Therefore, it falls under the
classification algorithm.
The sigmoid function is a mathematical function used to map the predicted values to probabilities:
y = f(x) = 1 / (1 + e^(−x))
It maps any real value into another value within a range of 0 and 1.
The value of the logistic regression must be between 0 and 1, which cannot go beyond this
limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function
or the logistic function.
In logistic regression, we use the concept of the threshold value, which defines the
probability of either 0 or 1. Such as values above the threshold value tends to 1, and a value
below the threshold values tends to 0.
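A minimal Python sketch of the sigmoid and the threshold rule described above (the input values are illustrative):

import numpy as np

def sigmoid(x):
    # maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])          # e.g. outputs of a linear model
probabilities = sigmoid(z)
predictions = (probabilities >= 0.5).astype(int)    # threshold at 0.5

print(probabilities.round(3))   # [0.018 0.269 0.5   0.731 0.982]
print(predictions)              # [0 0 1 1 1]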
The independent variables should not have multi-collinearity.
The Logistic Regression equation can be obtained from the Linear Regression equation. The mathematical steps to get the Logistic Regression equation are given below:
We know the equation of a straight line can be written as:
y = b0 + b1x1 + b2x2 + … + bnxn
In Logistic Regression, y can be between 0 and 1 only, so let us divide the above equation by (1 − y):
y / (1 − y)   (which is 0 for y = 0 and infinity for y = 1)
But we need a range between −[infinity] and +[infinity], so taking the logarithm of the equation it becomes:
log[ y / (1 − y) ] = b0 + b1x1 + b2x2 + … + bnxn
This is the final equation of Logistic Regression.
On the basis of the categories, Logistic Regression can be classified into three types:
Binomial: In binomial Logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "Medium", or "High".
5.6 Bayesian Belief Network
A Bayesian belief network is a key technology for dealing with probabilistic events and for solving problems that involve uncertainty. We can define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set of variables and
their conditional dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian model.
Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.
Real world applications are probabilistic in nature, and to represent the relationship between
multiple events, we need a Bayesian network. It can also be used in various tasks
including prediction, anomaly detection, diagnostics, automated insight, reasoning, time series
prediction, and decision making under uncertainty.
A Bayesian Network can be used for building models from data and experts' opinions, and it consists of two parts:
a Directed Acyclic Graph (the network structure), and
a Table of conditional probabilities (the CPTs).
In the above diagram, A, B, C, and D are random variables represented by the nodes of
the network graph.
If we are considering node B, which is connected with node A by a directed arrow, then node
A is called the parent of Node B. Node C is independent of node A.
Note: The Bayesian network graph does not contain any cyclic graph. Hence, it is known as
a directed acyclic graph or DAG.
If we have variables x1, x2, x3, ….., xn, then the probabilities of the different combinations of x1, x2, x3, …, xn are known as the joint probability distribution.
Using the chain rule of probability, the joint distribution P[x1, x2, x3, ….., xn] can be written as:
P[x1, x2, x3, ….., xn] = P[x1 | x2, x3, ….., xn] P[x2 | x3, ….., xn] … P[xn−1 | xn] P[xn]
In general, for each variable Xi in a Bayesian network we can write:
P(Xi | Xi−1, ….., X1) = P(Xi | Parents(Xi))
Let's understand the Bayesian network through an example by creating a directed acyclic graph:
5.6.1 Problem-1: Example: Harry installed a new burglar alarm at his home to detect burglary.
The alarm reliably responds at detecting a burglary but also responds for minor earthquakes.
Harry has two neighbors David and Sophia, who have taken a responsibility to inform Harry at
work when they hear the alarm. David always calls Harry when he hears the alarm, but
sometimes he got confused with the phone ringing and calls at that time too. On the other hand,
Sophia likes to listen to high music, so sometimes she misses to hear the alarm. Here we would
like to compute the probability of Burglary Alarm.
Calculate the probability that alarm has sounded, but there is neither a burglary, nor an
earthquake occurred, and David and Sophia both called the Harry.
Solution:
The Bayesian network for the above problem is given below. The network structure shows that Burglary and Earthquake are the parent nodes of Alarm and directly affect the probability of the alarm going off, whereas David's and Sophia's calls depend only on the alarm.
The network represents that David and Sophia do not directly perceive the burglary, do not notice minor earthquakes, and do not confer with each other before calling.
The conditional distribution for each node is given as a conditional probability table, or CPT.
Each row in a CPT must sum to 1, because all the entries in the row represent an exhaustive set of cases for the variable.
In a CPT, a boolean variable with k boolean parents contains 2^k probability entries. Hence, if there are two parents, then the CPT will contain 4 probability values.
The nodes (random variables) of this network are:
Burglary (B)
Earthquake(E)
Alarm(A)
David Calls(D)
Sophia calls(S)
We can write the events of the problem statement in the form of probability as P[D, S, A, B, E]. Using the joint probability distribution and the network structure, this can be rewritten as:
P[D, S, A, B, E] = P[D | A] P[S | A] P[A | B, E] P[B] P[E]
Let's take the observed probabilities for the Burglary and Earthquake components (and the CPTs of the remaining nodes):
The Conditional probability of Alarm A depends on Burglar and earthquake:
The Conditional probability of David that he will call depends on the probability of Alarm.
The Conditional probability of Sophia that she calls is depending on its Parent Node "Alarm."
From the formula of the joint distribution, we can write the problem statement in terms of the probability distribution:
P(S, D, A, ¬B, ¬E) = P(S | A) P(D | A) P(A | ¬B ∧ ¬E) P(¬B) P(¬E)
= 0.75 × 0.91 × 0.001 × 0.998 × 0.999
= 0.00068045
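A small Python sketch of the same chain-rule product; the probability values are the ones substituted above:

p_not_burglary = 0.998         # P(¬B)
p_not_earthquake = 0.999       # P(¬E)
p_alarm_no_b_no_e = 0.001      # P(A | ¬B, ¬E)
p_david_given_alarm = 0.91     # P(D | A)
p_sophia_given_alarm = 0.75    # P(S | A)

p_joint = (p_david_given_alarm * p_sophia_given_alarm
           * p_alarm_no_b_no_e * p_not_burglary * p_not_earthquake)
print(p_joint)   # ≈ 0.00068045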
5.6.2 Problem-2: Consider the scenario shown in Figure 9.16. The events 'student joins tuition classes' and 'goes to school daily' have a direct effect on how the 'student studies'. The event that the student studies has a direct effect on his scoring marks or playing games. What is the probability that he does not join tuition classes, he goes to school daily, he studies, and he does not score marks?
Solution:
𝑃 (¬𝐽𝐶 ) ∗ 𝑃 (𝐺𝑆) ∗ 𝑃(𝑆| ¬𝐽𝐶 𝑎𝑛𝑑 𝐺𝑆) ∗ 𝑃(¬𝑆𝑀|𝑆)
= 0.6 ∗ 0.9 ∗ 0.6 ∗ 0.1
= 0.0324
5.7 Hidden Markov Model
The Hidden Markov Model (HMM) is a probabilistic model which is used to explain or derive the probabilistic characteristics of a random process. It basically says that an observed event does not correspond directly to the step-by-step state of the process, but is related to a set of probability distributions. Assume the system being modelled is a Markov chain and that, in the process, there are some hidden states. In that case, we can say that the hidden states form a process that depends on the main Markov process/chain.
The main goal of an HMM is to learn about the hidden Markov chain by observing the outputs it emits. Considering a Markov process X with observations Y, the HMM requires that, for each time stamp, the conditional probability distribution of Y at that time depends only on the state of X at that time and not on the earlier history, as shown in the figure.
5.7.1 Hidden Markov Model with an Example
To explain it further, we can take the example of two friends, Rahul and Ashok. Rahul completes his daily-life activities according to the weather conditions. The three major activities performed by Rahul are going jogging, going to the office, and cleaning his residence. What Rahul does today depends on the weather, and whatever Rahul does he tells Ashok. Ashok has no proper information about the weather, but Ashok can infer the weather condition from Rahul's activity.
Ashok believes that the weather operates as a discrete Markov chain in which there are only two states: the weather is either Rainy or Sunny. The condition of the weather cannot be observed by Ashok; the weather states are hidden from him. On each day, there is a certain chance that Rahul will perform one activity from the set {“jog”, “work”, “clean”}, depending on the weather. Since Rahul tells Ashok what he has done, those activities are the observations. The entire system is that of a hidden Markov model (HMM).
Here we can say that the parameters of the HMM are known to Ashok, because he has general information about the weather and he also knows what Rahul likes to do on average.
So let's consider a day when Rahul calls Ashok and tells him that he has cleaned his residence. In that scenario, Ashok will believe that there is a higher chance of it being a rainy day; that belief of Ashok's is the start probability of the HMM.
Because this distribution puts more weight on the rainy state, there will also be a higher chance of the next day being rainy again; the probabilities for the next day's weather states, given today's weather, are the transition probabilities.
Finally, for a given weather state, the probabilities of the activity that Rahul will perform are the emission probabilities. A sketch of all three sets of probabilities is given below.
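A Python sketch of the three sets of probabilities; the numeric values are purely illustrative, since the notes do not state them:

states = ("Rainy", "Sunny")
observations = ("jog", "work", "clean")

start_probability = {"Rainy": 0.6, "Sunny": 0.4}      # Ashok's initial belief (assumed values)

transition_probability = {                            # weather today -> weather tomorrow
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}

emission_probability = {                              # weather -> what Rahul does
    "Rainy": {"jog": 0.1, "work": 0.4, "clean": 0.5},
    "Sunny": {"jog": 0.6, "work": 0.3, "clean": 0.1},
}

# e.g. probability that today is Rainy and Rahul cleans, and that it stays
# Rainy tomorrow and he goes to work:
p = (start_probability["Rainy"] * emission_probability["Rainy"]["clean"]
     * transition_probability["Rainy"]["Rainy"] * emission_probability["Rainy"]["work"])
print(p)   # 0.084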
So here from the above intuition and the example we can understand how we can use this
probabilistic model to make a prediction. Now let’s just discuss the applications where it can
be used.
5.7.2 Applications of Hidden Markov Model
An application where an HMM is used aims to recover a data sequence in which the next element cannot be observed immediately but depends on the earlier elements of the sequence. Taking the above intuition into account, the HMM can be used in the following applications:
Computational finance
speed analysis
Speech recognition
Speech synthesis
Part-of-speech tagging
Document separation in scanning solutions
Machine translation
Handwriting recognition
Time series analysis
Activity recognition
Sequence classification
Transportation forecasting
Hidden Markov Models in NLP
From the above applications of HMM, we can see that the applications where HMM can be used involve sequential data, such as time-series data, audio and video data, and text (NLP) data. Our main focus here is on those NLP applications where the HMM can improve model performance; from the list above, one such application is Part-of-Speech (POS) tagging.
5. 8 Instance-based learning
The Machine Learning systems which are categorized as instance-based learning are the
systems that learn the training examples by heart and then generalize to new instances based on
some similarity measure. It is called instance-based because it builds the hypotheses from the
training instances.
It is also known as memory-based learning or lazy learning. The time taken to classify a query depends upon the size of the training data; the worst-case time complexity is O(n), where n is the number of training instances.
For example, If we were to create a spam filter with an instance-based learning algorithm,
instead of just flagging emails that are already marked as spam emails, our spam filter would
be programmed to also flag emails that are very similar to them. This requires a measure of
resemblance between two emails.
A similarity measure between two emails could be the same sender or the repetitive use of
the same keywords or something else.
Advantages:
Instead of estimating for the entire instance set, local approximations can be made to
the target function.
This algorithm can adapt to new data easily, as data is collected as we go.
Disadvantages:
Classification costs are high
Large amount of memory required to store the data, and each query involves starting
the identification of a local model from scratch.
Some of the instance-based learning algorithms are :
K- Nearest Neighbor (KNN)
Self-Organizing Map (SOM)
Learning Vector Quantization (LVQ)
Locally Weighted Learning (LWL)
5.9 K-Nearest Neighbor Algorithm
Example: Suppose, we have an image of a creature that looks similar to cat and dog, but we
want to know either it is a cat or dog. So for this identification, we can use the KNN algorithm,
as it works on a similarity measure. Our KNN model will find the similar features of the new
data set to the cats and dogs images and based on the most similar features it will put it in either
cat or dog category.
5.9.1 Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and Category B, and we have a new data point
x1, so this data point will lie in which of these categories. To solve this type of problem, we need
a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a
particular dataset. Consider the below diagram:
5.9.2 How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
o Firstly, we will choose the number of neighbors; here we choose k = 5.
o Next, we will calculate the Euclidean distance between the new data point and the existing data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as:
  d = √((x2 − x1)² + (y2 − y1)²)
o By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:
o As we can see, the 3 nearest neighbors are from category A; hence this new data point must belong to category A.
Advantages of the K-NN Algorithm:
It is simple to implement.
It is robust to noisy training data.
It can be more effective if the training data is large.
Disadvantages of the K-NN Algorithm:
The value of K always needs to be determined, which may be complex at times.
The computation cost is high because the distance to all the training samples has to be calculated for each query.
5.9.3 Problem: Consider the following training data set of 10 data instances shown in Table
4.12 which describes the award performance of individual students based on GPA and No. of
projects done. The target variable is ‘Award’ which is a discrete valued variable that takes 2
values ‘Yes’ or ‘No’.
Table : Training Dataset
S.No.  GPA  No. of Projects done  Award
1. 9.5 5 Yes
2. 8.0 4 Yes
3. 7.2 1 No
4. 6.5 5 Yes
5. 9.5 4 Yes
6. 3.2 1 No
7. 6.6 1 No
8. 5.4 1 No
9. 8.9 3 Yes
10. 7.2 4 Yes
Given a test instance (GPA -7.8, No. of projects done - 4), use the training set to classify the test
instance. Choose k=3 by using k-Nearest Neighbor classifier
Solution:
Step 1: Calculate the Euclidean distance between the test instance (GPA -7.8, No. of projects
done - 4) and each of the training instances as shown in Table 1.
Euclidean Distance 𝑑 = √(𝑥2 − 𝑥1)2 + (𝑦2 − 𝑦1)2
where (x1, y1) = (7.8, 4) is the test instance and (x2, y2) are the GPA and number of projects of each training instance.
Table: Euclidean Distance
S.No.  GPA (x2)  No. of Projects done (y2)  Award  Euclidean Distance
1.     9.5       5                          Yes    D1 = √((9.5 − 7.8)² + (5 − 4)²) ≈ 1.9723
2.     8.0       4                          Yes    D2 = √((8.0 − 7.8)² + (4 − 4)²) = 0.2
3.     7.2       1                          No     D3 = √((7.2 − 7.8)² + (1 − 4)²) ≈ 3.0594
4.     6.5       5                          Yes    D4 = √((6.5 − 7.8)² + (5 − 4)²) ≈ 1.6401
5.     9.5       4                          Yes    D5 = √((9.5 − 7.8)² + (4 − 4)²) = 1.7
6.     3.2       1                          No     D6 = √((3.2 − 7.8)² + (1 − 4)²) ≈ 5.4918
7.     6.6       1                          No     D7 = √((6.6 − 7.8)² + (1 − 4)²) ≈ 3.2311
8.     5.4       1                          No     D8 = √((5.4 − 7.8)² + (1 − 4)²) ≈ 3.8419
9.     8.9       3                          Yes    D9 = √((8.9 − 7.8)² + (3 − 4)²) ≈ 1.4866
10.    7.2       4                          Yes    D10 = √((7.2 − 7.8)² + (4 − 4)²) = 0.6
Step 2: Sort the distances in the ascending order and select the first 3 nearest training data
instances to the test instance. The selected nearest neighbors are shown in Table 2. k=3
Table 2 Nearest Neighbors
Instance Euclidean distance Class
2 D2=0.2 Yes
10 D10=0.6 Yes
9 D9=1.487 Yes
Step 3: Since all k = 3 nearest neighbors belong to the class 'Yes', the test instance (GPA = 7.8, No. of projects done = 4) is classified as 'Award = Yes'.
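A minimal Python sketch (not part of the original solution) of the same k-NN classification:

import math
from collections import Counter

# (GPA, projects, award) rows from the training dataset
training_data = [
    (9.5, 5, "Yes"), (8.0, 4, "Yes"), (7.2, 1, "No"), (6.5, 5, "Yes"),
    (9.5, 4, "Yes"), (3.2, 1, "No"), (6.6, 1, "No"), (5.4, 1, "No"),
    (8.9, 3, "Yes"), (7.2, 4, "Yes"),
]
test_instance = (7.8, 4)
k = 3

# Euclidean distance from the test instance to every training instance
distances = [(math.dist(test_instance, (gpa, projects)), label)
             for gpa, projects, label in training_data]

nearest = sorted(distances)[:k]
majority = Counter(label for _, label in nearest).most_common(1)[0][0]

print(nearest)     # the 3 closest neighbours, all labelled 'Yes'
print(majority)    # 'Yes'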