Experiment 2.2 KNN Classifier
COURSE OUTCOMES
CO3 Identify and implement simple learning strategies using data science and statistics principles.
CO4 Evaluate a machine learning model's performance and apply learning strategies to improve the
performance of supervised and unsupervised learning models.
K-Nearest Neighbour (K-NN) is one of the simplest machine learning algorithms, based on the supervised learning technique. The K-NN algorithm assumes similarity between the new case and the available cases and puts the new case into the category that is most similar to the available categories. K-NN stores all the available data and classifies a new data point based on its similarity to the stored points, which means that when new data appears it can easily be classified into a well-suited category. K-NN can be used for regression as well as classification, but it is mostly used for classification problems. K-NN is a non-parametric algorithm, meaning it makes no assumptions about the underlying data. It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and performs the actual computation at classification time. In the training phase, K-NN simply stores the dataset; when it receives new data, it classifies that data into the category it is most similar to.
Why do you need to scale your data for the k-NN algorithm?
Consider a dataset with m examples and n features. Suppose one feature dimension has values exactly between 0 and 1, while another varies from -99999 to 99999. Looking at the formula for Euclidean distance, the larger-magnitude feature will dominate the sum, effectively giving higher weight to variables with a higher magnitude.
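To see this concretely, here is a minimal sketch using scikit-learn's MinMaxScaler; the feature values are illustrative, matching the ranges above.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features: one in [0, 1], one in [-99999, 99999] (illustrative values)
X = np.array([[0.5, -99999.0],
              [0.9,  99999.0],
              [0.1,     10.0]])

# Without scaling, Euclidean distance is dominated by the second feature
print(np.linalg.norm(X[0] - X[1]))  # ~199998.0

# After min-max scaling, both features contribute on a comparable scale
X_scaled = MinMaxScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))  # ~1.12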
Although Euclidean distance is the most commonly used and taught metric, it is not always the optimal choice. In fact, it is hard to come up with the right metric just by looking at the data, so it is worth trying a set of them. There are, however, some special cases: for instance, Hamming distance is used for categorical variables.
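As a quick illustration of Hamming distance on label-encoded categorical data (the vectors below are made-up values), SciPy's hamming returns the fraction of positions that disagree:

from scipy.spatial.distance import hamming

a = [1, 0, 2, 1]  # two encoded categorical feature vectors (illustrative)
b = [1, 2, 2, 0]
print(hamming(a, b))  # 0.5 -> two of the four positions differ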
Why should we not use the KNN algorithm for large datasets?
Here is an overview of the data flow that occurs in the KNN algorithm:
1. Calculate the distances to all vectors in the training set and store them
2. Sort the calculated distances
3. Select the k nearest neighbours and take a majority vote on their labels
Imagine you have a very large dataset. Not only is it a bad decision to store that much data, it is also computationally costly to keep calculating and sorting all the distances for every new query, as the sketch below illustrates.
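A rough sketch of this per-query cost in plain NumPy (the function name and majority-vote formulation are an assumed, minimal version, not the scikit-learn implementation):

import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    # 1. Distance from the query to every training vector: O(n * d)
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # 2. Sort the distances and keep the k smallest
    nearest = np.argsort(dists)[:k]
    # 3. Majority vote among the k nearest labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

Every single prediction repeats all of this work over the full training set, which is why brute-force k-NN scales poorly.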
Advantages
It is simple to implement.
Disadvantages
The value of K always needs to be determined, which may be complex at times.
The computation cost is high because the distance between the new point and all the training samples must be calculated.
Let’s now get into the implementation of KNN in Python. We’ll go over the steps to help you break the
code down and make better sense of it.
1. Importing Modules
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
2. Creating Dataset
Scikit-learn has a lot of tools for creating synthetic datasets, which are great for testing machine learning algorithms. Here we use the make_blobs method.
This code generates a dataset of 500 samples separated into four classes, with two features in total. Using the associated parameters, you can quickly change the number of samples, features, and classes, as well as the spread of each cluster (or class).
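The generating call itself is not shown above; a minimal reconstruction matching that description (cluster_std and random_state are assumed values) is:

# 500 samples, 2 features, 4 classes, as described above
X, y = make_blobs(n_samples=500, n_features=2, centers=4,
                  cluster_std=1.5, random_state=4)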
plt.style.use('seaborn')  # on Matplotlib >= 3.6, use 'seaborn-v0_8' instead
plt.figure(figsize=(10, 10))
plt.scatter(X[:, 0], X[:, 1], c=y)  # colour each point by its class label
plt.show()
Figure: Data visualization of the generated clusters.
It is critical to partition a dataset into train and test sets for every supervised machine learning method. We first train the model and then test it on a separate portion of the dataset. If we don't separate the data, we are simply testing the model with data it has already seen. The train_test_split method makes this split easy.
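The split is not shown above; a typical call, assuming a 75/25 split and a fixed random_state for reproducibility, would be:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4)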
After that, we'll build the kNN classifier objects. We create two classifiers, with k values of 1 and 5, to demonstrate the relevance of the k value. The models are then trained on the train set. The k value is chosen with the n_neighbors argument; it does not need to be explicitly specified, because the default value is 5.
knn5 = KNeighborsClassifier(n_neighbors=5)
knn1 = KNeighborsClassifier(n_neighbors=1)
Then we predict the target values for the test set and compare them with the actual values.
# Train both models on the training set
knn5.fit(X_train, y_train)
knn1.fit(X_train, y_train)

# Predict the test-set labels with each model
y_pred_5 = knn5.predict(X_test)
y_pred_1 = knn1.predict(X_test)
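One simple way to make the comparison concrete (an assumed addition, using scikit-learn's accuracy_score) is:

from sklearn.metrics import accuracy_score

print("Accuracy with k=5:", accuracy_score(y_test, y_pred_5))
print("Accuracy with k=1:", accuracy_score(y_test, y_pred_1))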
Let’s view the test set and predicted values with k=5 and k=1 to see the influence of k values.
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred_5)  # predictions with k=5
plt.title("Predicted values with k=5")
plt.subplot(1, 2, 2)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred_1)  # predictions with k=1
plt.title("Predicted values with k=1")
plt.show()
Viva Questions
Why is it recommended not to use the KNN Algorithm for large datasets?