Unit 3 KNN


K-Nearest Neighbor (KNN) Algorithm for Machine Learning
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based
on the Supervised Learning technique.
o The K-NN algorithm measures the similarity between the new case/data and the
available cases and puts the new case into the category that is most similar
to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data
point based on similarity. This means that when new data appears, it can be
easily classified into a well-suited category using the K-NN algorithm.
o The K-NN algorithm can be used for Regression as well as for Classification,
but it is mostly used for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any
assumption about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the
training set immediately; instead, it stores the dataset and performs the
computation at the time of classification.
o At the training phase, the KNN algorithm just stores the dataset, and when it
gets new data, it classifies that data into the category most similar to the
new data.
o Example: Suppose we have an image of a creature that looks similar to both a
cat and a dog, and we want to know whether it is a cat or a dog. For this
identification, we can use the KNN algorithm, as it works on a similarity
measure. Our KNN model will find the features of the new image most similar
to the cat and dog images and, based on the most similar features, put it in
either the cat or the dog category.
Why do we need a K-NN Algorithm?
Suppose there are two categories, Category A and Category B, and we have a new
data point x1. To which of these categories will this data point belong? To solve
this type of problem, we need a K-NN algorithm. With the help of K-NN, we can
easily identify the category or class of a particular data point.

How does K-NN work?


The working of K-NN can be explained on the basis of the following steps:
o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance from the new data point to each point
in the training data.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of data points in each
category.
o Step-5: Assign the new data point to the category for which the number of
neighbors is maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category.

o Firstly, we will choose the number of neighbors; here we choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The
Euclidean distance is the distance between two points, which we have already
studied in geometry. For two points (x1, y1) and (x2, y2) it can be calculated
as distance = SQRT((x2 - x1)^2 + (y2 - y1)^2).
o By calculating the Euclidean distances we get the nearest neighbors: three
nearest neighbors in Category A and two nearest neighbors in Category B.
o Since 3 of the 5 nearest neighbors are from Category A, the new data point
must belong to Category A.
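
To make these steps concrete, here is a minimal Python sketch of the same
procedure. The 2D points and their categories are made up for illustration; they
are not taken from the original diagram.

import math
from collections import Counter

# Illustrative training data: (x, y) points with their categories
training_data = [
    ((1.0, 2.0), "A"), ((2.0, 1.5), "A"), ((1.5, 1.8), "A"),
    ((6.0, 6.5), "B"), ((7.0, 6.0), "B"), ((6.5, 7.0), "B"),
]

def euclidean(p, q):
    # SQRT((x2 - x1)^2 + (y2 - y1)^2)
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def knn_classify(new_point, data, k=5):
    # Steps 2-3: compute distances and keep the K nearest neighbors
    neighbors = sorted(data, key=lambda item: euclidean(new_point, item[0]))[:k]
    # Steps 4-5: count the categories among the neighbors and pick the majority
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# With k=5, three of the nearest neighbors are in category A and two in B,
# so the new point is assigned to A.
print(knn_classify((2.0, 2.0), training_data, k=5))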

How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN
algorithm:

o There is no particular way to determine the best value for "K", so we need to
try some values to find the best among them. A commonly preferred starting
value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and make the model
sensitive to outliers.
o Larger values for K are generally more robust to noise, but too large a value
can blur the boundary between categories.
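
One practical way to pick K is to try several values and compare their
cross-validated accuracy. A minimal sketch, assuming scikit-learn is installed
and using its bundled Iris dataset purely for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try odd values of K and keep the one with the best cross-validated accuracy
best_k, best_score = None, 0.0
for k in range(1, 22, 2):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print(best_k, round(best_score, 3))

Odd values of K are often preferred for two-class problems so that majority votes
cannot tie.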

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o The value of K always needs to be determined, which may be complex at times.
o The computation cost is high because the distance to every training sample
must be calculated for each new data point.

Introduction
The K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which
can be used for both classification and regression predictive problems. However, it
is mainly used for classification predictive problems in industry. The following two
properties define KNN well −
o Lazy learning algorithm − KNN is a lazy learning algorithm because it does not
have a specialized training phase; it uses all the data for training while
classifying.
o Non-parametric learning algorithm − KNN is also a non-parametric learning
algorithm because it doesn't assume anything about the underlying data.

Working of KNN Algorithm


The K-nearest neighbors (KNN) algorithm uses 'feature similarity' to predict the
values of new data points, which means that a new data point is assigned a value
based on how closely it matches the points in the training set. We can understand
its working with the help of the following steps −
Step 1 − For implementing any algorithm, we need a dataset. So during the first step
of KNN, we must load the training as well as the test data.
Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data
points to consider. K can be any integer.
Step 3 − For each point in the test data do the following −
o 3.1 − Calculate the distance between the test point and each row of the training
data using any of the distance methods, namely Euclidean, Manhattan or Hamming
distance. The most commonly used method to calculate distance is Euclidean.
o 3.2 − Now, based on the distance values, sort the rows in ascending order.
o 3.3 − Next, choose the top K rows from the sorted array.
o 3.4 − Now, assign a class to the test point based on the most frequent class of
these rows.
Step 4 − End
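
The three distance measures mentioned in step 3.1 can be written as small helper
functions. A sketch (Hamming distance is shown here as a simple count of positions
at which two equal-length sequences differ):

import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming_distance(a, b):
    # number of positions at which the two sequences differ
    return sum(x != y for x, y in zip(a, b))

print(euclidean_distance((161, 61), (158, 58)))  # ~4.24
print(manhattan_distance((161, 61), (158, 58)))  # 6
print(hamming_distance("karolin", "kathrin"))    # 3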
Example
The following is an example to understand the concept of K and the working of the
KNN algorithm −
Suppose we have a dataset whose points belong to two classes, blue and red. Now, we
need to classify a new data point, marked with a black dot at (60, 60), into the
blue or the red class. We assume K = 3, i.e. the algorithm finds the three nearest
data points.

Among the three nearest neighbors of the black dot, two lie in the red class, so the
black dot is also assigned to the red class.
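
A sketch of this example, assuming scikit-learn is available; the blue and red
coordinates below are made up, since the points from the original plot are not
given:

from sklearn.neighbors import KNeighborsClassifier

# Illustrative points; the actual coordinates from the original plot are not known
X = [[20, 35], [25, 40], [30, 30], [57, 63],      # blue
     [55, 60], [60, 55], [65, 65], [70, 60]]      # red
y = ["blue"] * 4 + ["red"] * 4

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

# Two of the three nearest neighbors of (60, 60) are red, so it is classified as red
print(model.predict([[60, 60]]))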

Pros and Cons of KNN


Pros
o It is a very simple algorithm to understand and interpret.
o It is very useful for nonlinear data because this algorithm makes no assumption
about the data.
o It is a versatile algorithm, as we can use it for classification as well as
regression.
o It has relatively high accuracy, but there are much better supervised learning
models than KNN.
Cons
o It is a computationally expensive algorithm because it stores all of the
training data.
o It requires high memory storage compared to other supervised learning
algorithms.
o Prediction is slow when the number of training samples (N) is large.
o It is very sensitive to the scale of the data as well as to irrelevant features.

Applications of KNN
The following are some of the areas in which KNN can be applied successfully −
Banking System
KNN can be used in a banking system to predict whether an individual is fit for loan
approval, by checking whether that individual has characteristics similar to those
of defaulters.
Calculating Credit Ratings
KNN algorithms can be used to find an individual's credit rating by comparing the
individual with persons having similar traits.
Politics
With the help of KNN algorithms, we can classify a potential voter into various
classes like "Will Vote", "Will Not Vote", "Will Vote for Party 'Congress'" or
"Will Vote for Party 'BJP'".
Other areas in which the KNN algorithm can be used are Speech Recognition,
Handwriting Detection, Image Recognition and Video Recognition.

How KNN algorithm works


Suppose we have the height, weight and T-shirt size of some customers,
and we need to predict the T-shirt size of a new customer given only the
height and weight information we have.
Step 1 : Calculate Similarity based on distance function
There are many distance functions, but Euclidean is the most
commonly used measure. It is mainly used when the data is continuous.
Manhattan distance is also very common for continuous variables.

The idea of using a distance measure is to find the distance (similarity)
between the new sample and the training cases, and then find the k closest
customers to the new customer in terms of height and weight.

A new customer named 'Monica' has height 161 cm and weight 61 kg.
The Euclidean distance between the first observation (height 158 cm, weight
58 kg) and the new observation (Monica) is as follows -

=SQRT((161-158)^2+(61-58)^2) = 4.24

Similarly, we calculate the distance of all the training cases from the new
case and rank them in terms of distance. The smallest distance value is
ranked 1 and considered the nearest neighbor.

Step 2 : Find K-Nearest Neighbors


Let k be 5. Then the algorithm searches for the 5 customers closest to
Monica, i.e. most similar to Monica in terms of attributes, and checks which
categories those 5 customers were in. If 4 of them had 'Medium' T-shirt sizes
and 1 had a 'Large' T-shirt size, then the best guess for Monica is a
'Medium' T-shirt.
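
A small Python sketch of this calculation. Only Monica's measurements and the
first training row (height 158 cm, weight 58 kg) come from the text above; the
remaining customers and all T-shirt sizes are made up for illustration:

import math
from collections import Counter

# (height_cm, weight_kg, t_shirt_size); only the first row appears in the text,
# the rest are illustrative
customers = [
    (158, 58, "M"), (160, 59, "M"), (160, 60, "M"), (163, 61, "M"),
    (165, 62, "L"), (168, 65, "L"), (170, 68, "L"),
]

monica = (161, 61)

# Rank training cases by Euclidean distance from Monica (rank 1 = nearest neighbor)
ranked = sorted(customers, key=lambda c: math.hypot(c[0] - monica[0], c[1] - monica[1]))

k = 5
votes = Counter(size for _, _, size in ranked[:k])
print(votes)                       # Counter({'M': 4, 'L': 1})
print(votes.most_common(1)[0][0])  # best guess for Monica: 'M'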

Calculate KNN manually


If we plot the data, the binary dependent variable (T-shirt size) can be
displayed in blue and orange: 'Medium T-shirt size' in blue and 'Large
T-shirt size' in orange. The new customer is shown as a yellow circle. Four
blue data points and one orange data point are closest to the yellow circle,
so the prediction for the new case is the blue class, which is the Medium
T-shirt size.
Assumptions of KNN

1. Standardization
When the independent variables in the training data are measured in different
units, it is important to standardize the variables before calculating
distance. For example, if one variable is based on height in cm and the other
is based on weight in kg, then height will influence the distance calculation
more. In order to make them comparable we need to standardize them, typically
by z-score standardization ((x - mean) / standard deviation) or by min-max
scaling ((x - min) / (max - min)).

After standardization, the 5th closest neighbor changes, because height was
dominating the distance calculation before standardization. Hence, it is
important to standardize predictors before running the K-nearest neighbor
algorithm.

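A sketch of z-score standardization before fitting KNN, assuming scikit-learn's
StandardScaler; the data reuses the illustrative T-shirt example above:

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Heights (cm) and weights (kg) are on different scales, so raw height differences
# would dominate the distance calculation
X = [[158, 58], [160, 59], [160, 60], [163, 61], [165, 62], [168, 65], [170, 68]]
y = ["M", "M", "M", "M", "L", "L", "L"]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # z-score: (x - mean) / standard deviation per column

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_scaled, y)

# The new customer must be transformed with the same scaler before prediction
print(model.predict(scaler.transform([[161, 61]])))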

2. Outliers
A low K value is sensitive to outliers, while a higher K value is more
resilient to outliers because it considers more voters to decide the prediction.
Why is KNN non-parametric?
Non-parametric means not making any assumptions about the underlying
data distribution. Non-parametric methods do not have a fixed number
of parameters in the model. Similarly, in KNN the number of model parameters
effectively grows with the training data set - you can imagine each
training case as a "parameter" in the model.

KNN vs. K-means


Many people get confused between these two techniques,
K-means and K-nearest neighbors. Some of the differences are listed below -

1. K-means is an unsupervised learning technique (no dependent
variable), whereas KNN is a supervised learning algorithm
(a dependent variable exists).
2. K-means is a clustering technique which tries to split the data points
into K clusters such that the points in each cluster tend to be
near each other, whereas K-nearest neighbors tries to determine
the classification of a point by combining the classifications of the K
nearest points.

Can KNN be used for regression?


Yes, K-nearest neighbors can be used for regression. In other words, the
K-nearest neighbors algorithm can be applied when the dependent
variable is continuous. In this case, the predicted value is the average
of the values of its k nearest neighbors.
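
A minimal sketch of KNN regression, averaging the target values of the k nearest
neighbors; the one-dimensional data is made up for illustration:

def knn_regress(x_new, data, k=3):
    # data is a list of (x, y) pairs; predict y for x_new as the mean of the
    # y values of the k training points whose x is closest to x_new
    nearest = sorted(data, key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k

# Illustrative data where y is roughly 2 * x
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.1), (5, 9.8)]
print(knn_regress(3.5, points, k=3))  # mean of 6.2, 8.1 and 3.9 -> about 6.07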

Pros and Cons of KNN

Pros
1. Easy to understand
2. No assumptions about data
3. Can be applied to both classification and regression
4. Works easily on multi-class problems

Cons

1. Memory intensive / computationally expensive
2. Sensitive to the scale of the data
3. Does not work well on rare-event (skewed) target variables
4. Struggles when there is a high number of independent variables
