Unit 3 KNN
Machine Learning
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category by using the K-NN algorithm.
o The K-NN algorithm can be used for Regression as well as Classification, but it is mostly used for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at the time of classification.
o At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it classifies that data into the category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are most similar to the cat and dog images and, based on the most similar features, put it in either the cat or the dog category, as sketched below.
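To make this concrete, here is a minimal sketch of K-NN classification using scikit-learn's KNeighborsClassifier. The two-dimensional feature vectors and the cat/dog labels below are invented stand-ins, not real image features:

from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D features (e.g., ear length, snout length) for the example.
X_train = [[1.0, 0.5], [1.2, 0.4], [0.9, 0.6],   # cats
           [3.0, 2.5], [3.2, 2.7], [2.8, 2.4]]   # dogs
y_train = ["cat", "cat", "cat", "dog", "dog", "dog"]

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)            # "training" just stores the data

print(model.predict([[1.1, 0.5]]))     # -> ['cat']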
Why do we need a K-NN Algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.
How does K-NN work?
Suppose we have a new data point and we need to put it in the required category. The working of K-NN can be explained with the following steps:
o Firstly, we will choose the number of neighbors; here we choose k = 5.
o Next, we will calculate the Euclidean distance between the new point and each data point. The Euclidean distance is the distance between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2) it is SQRT((x2-x1)^2 + (y2-y1)^2).
o By calculating the Euclidean distance, we get the nearest neighbors: three nearest neighbors in Category A and two nearest neighbors in Category B.
o As three of the five nearest neighbors are from Category A, this new data point must belong to Category A. A from-scratch sketch of these steps follows.
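The steps above can be written out directly. Below is a small from-scratch sketch: it computes the Euclidean distance from a new point to every training point, keeps the k nearest, and takes a majority vote. The training points and labels are made up for illustration (math.dist needs Python 3.8+):

import math
from collections import Counter

def knn_classify(train, new_point, k=5):
    """train is a list of ((x, y), label) pairs."""
    # Step 2: Euclidean distance from the new point to every training point.
    distances = [(math.dist(point, new_point), label) for point, label in train]
    # Step 3: keep the k nearest neighbors.
    nearest = sorted(distances)[:k]
    # Step 4: assign the majority category among those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1, 2), "A"), ((2, 1), "A"), ((2, 3), "A"),
         ((6, 5), "B"), ((7, 6), "B")]
print(knn_classify(train, (2, 2), k=5))   # 3 of 5 neighbors are 'A' -> 'A'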
How to select the value of K in the K-NN Algorithm?
o There is no particular way to determine the best value for "K", so we need to try some values to find the best out of them (see the cross-validation sketch after this list). The most preferred value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of outliers in the model.
o Large values for K are good, but a very large K may include too many points from other categories.
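One common way to "try some values" of K is to score each candidate with cross-validation and keep the best. A minimal sketch, assuming scikit-learn and using its built-in Iris dataset purely as a stand-in:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Average 5-fold cross-validation accuracy for each candidate K.
scores = {}
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])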
Disadvantages of the K-NN Algorithm:
o It always needs the value of K to be determined, which may be complex at times.
o The computation cost is high because of calculating the distance between the new data point and all the training samples.
Introduction
The K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which can be used for both classification and regression predictive problems. However, it is mainly used for classification predictive problems in industry. The following two properties define KNN well −
Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a specialized training phase; it uses all of the training data at classification time.
Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm because it doesn't assume anything about the underlying data.
As an example, consider the three nearest neighbors of a new data point (say, a black dot on a scatter plot). If two of the three lie in the Red class, the black dot will also be assigned to the Red class.
Applications of KNN
The following are some of the areas in which KNN can be applied successfully −
Banking System
KNN can be used in a banking system to predict whether an individual is fit for loan approval, i.e., whether that individual has characteristics similar to those of defaulters.
Calculating Credit Ratings
KNN algorithms can be used to find an individual's credit rating by comparing the individual with persons having similar traits.
Politics
With the help of KNN algorithms, we can classify a potential voter into various classes like "Will Vote", "Will Not Vote", "Will Vote for Party 'Congress'" or "Will Vote for Party 'BJP'".
Other areas in which the KNN algorithm can be used are Speech Recognition, Handwriting Detection, Image Recognition and Video Recognition.
For example, if a new case has height 161 cm and weight 61 kg, its Euclidean distance to a training case with height 158 cm and weight 58 kg is:
=SQRT((161-158)^2+(61-58)^2)
Similarly, we will calculate the distance of the new case to all the training cases and rank them by distance. The smallest distance value will be ranked 1 and considered the nearest neighbor.
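The same calculation and ranking can be sketched in a few lines of Python, where math.dist computes the Euclidean distance. Apart from the (158, 58) case from the formula above, the training rows are invented for illustration:

import math

new_case = (161, 61)                       # (height in cm, weight in kg)
training = [(158, 58), (160, 60), (163, 61), (165, 63), (170, 68)]

# Rank training cases by distance; rank 1 is the nearest neighbor.
ranked = sorted((math.dist(case, new_case), case) for case in training)
for rank, (d, case) in enumerate(ranked, start=1):
    print(rank, case, round(d, 2))         # rank 1 -> (160, 60)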
1. Standardization
When independent variables in the training data are measured in different units, it is important to standardize the variables before calculating distance. For example, if one variable is based on height in cm and the other is based on weight in kg, then height will influence the distance calculation more. In order to make them comparable, we need to standardize them, which can be done with standard rescaling methods such as z-score standardization, X_new = (X - mean) / sd, or min-max normalization, X_new = (X - min) / (max - min).
After standardization, the 5th closest value changed, as height was dominating the distance before standardization. Hence, it is important to standardize predictors before running the K-nearest neighbor algorithm.
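A minimal sketch, assuming scikit-learn: StandardScaler applies z-score standardization, (X - mean) / sd, to each column before distances are computed, so height and weight contribute on a comparable scale. The rows and size labels are invented for illustration:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical (height cm, weight kg) rows with made-up size labels.
X_train = [[158, 58], [160, 60], [163, 61], [165, 63], [170, 68]]
y_train = ["M", "M", "M", "L", "L"]

# The scaler is fit on the training data and applied to new cases too.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)
print(model.predict([[161, 61]]))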
2. Outlier
A low k-value is sensitive to outliers, while a higher k-value is more resilient to outliers, as it considers more voters when deciding the prediction (see the sketch below).
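A tiny illustration of this, with invented points: a single outlying "B" point sits inside the "A" cluster, so k=1 follows that one outlier while k=5 is decided by the surrounding majority:

from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1], [2, 2],    # class "A" cluster
     [8, 8], [8, 9], [9, 8],            # class "B" cluster
     [2.1, 2.1]]                        # outlier labeled "B" inside the A cluster
y = ["A", "A", "A", "A", "B", "B", "B", "B"]

for k in (1, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, model.predict([[2.2, 2.2]]))   # k=1 -> 'B', k=5 -> 'A'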
Why is KNN non-parametric?
Non-parametric means making no assumptions about the underlying data distribution. Non-parametric methods do not have a fixed number of parameters in the model. Similarly, in KNN the number of model parameters actually grows with the training data set: you can imagine each training case as a "parameter" in the model.
Pros
1. Easy to understand
2. No assumptions about data
3. Can be applied to both classification and regression
4. Works easily on multi-class problems
Cons
1. Choosing the value of K can be complex at times
2. Computationally expensive, as distances to all training samples must be calculated
3. A low K is sensitive to outliers
4. Predictors should be standardized so that no single variable dominates the distance