KNN Datacamp
datacamp.com/tutorial/introduction-machine-learning-python
Introduction
Machine learning evolved from computer science as the study of algorithms that can learn from experience. To learn, these algorithms need data with certain attributes, from which they try to find meaningful predictive patterns. Broadly, ML tasks can be categorized as concept learning, clustering, predictive modeling, and so on. The ultimate goal of ML algorithms is to be able to make correct decisions without any human intervention. Predicting stock prices or the weather are a couple of applications of machine learning algorithms. There are various machine learning algorithms, such as decision trees, Naive Bayes, random forests, support vector machines, k-nearest neighbors, and k-means clustering. From this family of algorithms, the one that you will be using today is k-nearest neighbors (KNN).
First, you load all the data and initialize the value of k.
Then, the distance between the stored data points and the new data point that you want to classify is calculated using a similarity or distance metric such as Manhattan distance (L1), Euclidean distance (L2), cosine similarity, Bhattacharyya distance, or Chebyshev distance.
Next, the distance values are sorted in ascending order, and the k nearest neighbors (those with the smallest distances) are determined.
The labels of the k nearest neighbors are gathered, and a majority vote or a distance-weighted vote is used to classify the new data point: it is assigned the class label that receives the highest score among those k neighbors.
Finally, the predicted class for the new instance is returned.
The prediction can be of two types: classification, in which a class label is assigned to the new data point, or regression, in which a value is assigned to the new data point. Unlike classification, in regression the value assigned to the new data point is the mean of the values of its k nearest neighbors.
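To make these steps concrete, here is a minimal from-scratch sketch of the procedure just described, using Euclidean distance and a simple majority vote. The helper name knn_predict, the toy points, and the choice of k=3 are illustrative assumptions, not code from this tutorial:

import numpy as np
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=3):
    # Steps 1-2: Euclidean (L2) distance from the new point to every stored point
    distances = np.sqrt(np.sum((train_points - new_point) ** 2, axis=1))
    # Step 3: sort the distances and keep the indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: gather the neighbors' labels and take a majority vote
    votes = Counter(train_labels[nearest])
    return votes.most_common(1)[0][0]

# Toy example with two features and two classes (the values are made up)
X = np.array([[1.0, 1.1], [1.2, 0.9], [3.0, 3.2], [3.1, 2.9]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([1.1, 1.0])))  # the 3 nearest labels are 0, 0, 1, so this prints 0

# For regression, you would instead return the mean of the neighbors' values,
# e.g. np.mean(train_labels[nearest]).

In the rest of this tutorial you will not implement KNN by hand; instead you will use scikit-learn's KNeighborsClassifier, which wraps exactly this kind of logic.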
In this tutorial you will use the classic Iris flower dataset, which ships with scikit-learn, but feel free to use some other public dataset or your private dataset.
from sklearn.datasets import load_iris
load_iris has both the data and the class labels for each sample. Let's quickly extract all of it.
data = load_iris().data
data.shape
(150, 4)
labels = load_iris().target
labels.shape
(150,)
Next, you have to combine the data and the class labels, and for that, you will use an excellent Python library called NumPy. NumPy adds support for large, multi-dimensional arrays and matrices, along with an extensive collection of high-level mathematical functions that operate on these arrays. So, let's quickly import it!
import numpy as np
Since data is a 2-d array, you will have to reshape the labels also to a
2-d array.
labels = np.reshape(labels,(150,1))
data = np.concatenate([data,labels],axis=-1)
data.shape
(150, 5)
import pandas as pd
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'species']
dataset = pd.DataFrame(data, columns=names)
Now, you have the dataset data frame that has both data & the class
labels that you need!
Before you dive any further, remember that the labels variable holds the class labels as numeric values, but you will now convert those numeric values to the corresponding flower names, or species.
To do this, you will select only the species column and replace each of the three numeric values with the corresponding species name. You will use inplace=True, which modifies the dataset data frame in place.
dataset['species'].replace(0, 'Iris-setosa',inplace=True)
dataset['species'].replace(1, 'Iris-versicolor',inplace=True)
dataset['species'].replace(2, 'Iris-virginica',inplace=True)
Let's print the first five rows of the dataset and see what it looks like!
dataset.head(5)
Let's visualize the data that you loaded above using a scatter plot, to find out how much one variable is affected by another, or in other words, how much correlation there is between two variables.
Tip: Are you keen on learning different ways of visualizing data in Python? Then check out the Introduction to Data Visualization with Matplotlib course.
import matplotlib.pyplot as plt

# Scatter sepal length vs. sepal width; rows 0-49 are setosa, 50-99 versicolor, 100-149 virginica
plt.scatter(data[:50, 0], data[:50, 1], c='r', label='Iris-setosa')
plt.scatter(data[50:100, 0], data[50:100, 1], c='g', label='Iris-versicolor')
plt.scatter(data[100:, 0], data[100:, 1], c='b', label='Iris-virginica')
plt.xlabel('Sepal length',fontsize=20)
plt.ylabel('Sepal width',fontsize=20)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(prop={'size': 18})
plt.show()
From the above plot, it is quite apparent that there is a high correlation between sepal length and sepal width for the Iris setosa flowers, while the correlation is lower for Iris versicolor and Iris virginica. The data points for versicolor and virginica are more spread out, whereas the setosa points are densely clustered.
Let's just quickly also plot the graph for petal-length and petal-width.
# Scatter petal length vs. petal width for the three species
plt.scatter(data[:50, 2], data[:50, 3], c='r', label='Iris-setosa')
plt.scatter(data[50:100, 2], data[50:100, 3], c='g', label='Iris-versicolor')
plt.scatter(data[100:, 2], data[100:, 3], c='b', label='Iris-virginica')
plt.xlabel('Petal length',fontsize=15)
plt.ylabel('Petal width',fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(prop={'size': 20})
plt.show()
For petal-length and petal-width, too, the above graph indicates a strong correlation overall, with the setosa flowers again densely clustered together. Let's confirm this with the actual correlation values.
dataset.iloc[:,2:].corr()
              petal-length  petal-width
petal-length      1.000000     0.962865
petal-width       0.962865     1.000000
dataset.iloc[:50,:].corr() #setosa
              sepal-length  sepal-width  petal-length  petal-width
sepal-length      1.000000     0.742547      0.267176     0.278098
sepal-width       0.742547     1.000000      0.177700     0.232752
petal-length      0.267176     0.177700      1.000000     0.331630
petal-width       0.278098     0.232752      0.331630     1.000000
dataset.iloc[50:100,:].corr() #versicolor
              sepal-length  sepal-width  petal-length  petal-width
sepal-length      1.000000     0.525911      0.754049     0.546461
sepal-width       0.525911     1.000000      0.560522     0.663999
petal-length      0.754049     0.560522      1.000000     0.786668
petal-width       0.546461     0.663999      0.786668     1.000000
dataset.iloc[100:,:].corr() #virginica
              sepal-length  sepal-width  petal-length  petal-width
sepal-length      1.000000     0.457228      0.864225     0.281108
sepal-width       0.457228     1.000000      0.401045     0.537728
petal-length      0.864225     0.401045      1.000000     0.322108
petal-width       0.281108     0.537728      0.322108     1.000000
From the above three tables, it is clear that the correlation between petal-length and petal-width is only about 0.33 for setosa and 0.32 for virginica, whereas for versicolor it is about 0.79.
Next, let's plot histograms of all four features to see how each one is distributed.
fig = plt.figure(figsize=(10, 10))
ax = fig.gca()
dataset.hist(ax=ax)
plt.show()
The petal-length, petal-width, and sepal-length features show a unimodal distribution, whereas sepal-width shows something close to a Gaussian distribution. All of this analysis is useful because you can then think of using an algorithm that works well with this kind of distribution.
Next, you will analyze whether all four attributes are on the same scale or not; this is an essential aspect of ML. A pandas data frame has a built-in function called describe that gives you the count, mean, max, and min of the data in a tabular format.
dataset.describe()
You can see that all four attributes are on a similar scale, between 0 and 8, and are measured in centimeters. If you want, you can further scale the values down to between 0 and 1.
Even though you already know that there are 50 samples per class, i.e., ~33.3% of the total distribution, let's still recheck it!
print(dataset.groupby('species').size())
species
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
dtype: int64
There are two common ways in which you can normalize your data: min-max normalization, which rescales each feature to the range 0 to 1, and z-score standardization, which rescales each feature to zero mean and unit variance.
So when do you actually need to normalize? Well, the answer is pretty much all the time. It is good practice to normalize your data, as it brings all the samples onto the same scale and range. Normalizing the data is especially crucial when the data you have is not consistent. You can check for inconsistency by using the describe() function that you studied above, which gives you the max and min values. If the max and min values of one feature are significantly larger than those of another feature, then normalizing both features to the same scale is very important. Let's say X is a feature with a larger range and Y is a second feature with a smaller range; then the influence of feature Y can be overpowered by feature X's influence. In such a case, it becomes important to normalize both features X and Y.
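For reference, here is a minimal sketch of the two normalization schemes mentioned above, applied to the four measurement columns with scikit-learn. Whether this tutorial would use these exact classes is an assumption, and, as the next step shows, the Iris features are already on a similar scale, so the transformation is not actually needed here:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

features = dataset.iloc[:, :4]  # the four numeric measurement columns

# Min-max normalization: rescales every feature to the [0, 1] range
minmax_scaled = MinMaxScaler().fit_transform(features)

# Z-score standardization: rescales every feature to zero mean and unit variance
standardized = StandardScaler().fit_transform(features)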
Let's print the output of describe() again and see why you do not need any normalization here.
dataset.describe()
For splitting the data into training and testing sets, you will use the scikit-learn library, which has a built-in splitting function called train_test_split. So, let's split the data.
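The split itself is a single call to train_test_split. A minimal version along the lines described above looks like this; the 80/20 split ratio and the random_state value are assumptions and may differ from whatever produced the results shown later in this tutorial:

from sklearn.model_selection import train_test_split

# Hold out 20% of the samples for testing (the split ratio and seed are assumptions)
train_data, test_data, train_label, test_label = train_test_split(
    dataset.iloc[:, :4], dataset['species'], test_size=0.2, random_state=42)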
Let's quickly print the shapes of the training and testing data along with their labels.
train_data.shape,train_label.shape,test_data.shape,test_label.shape
neighbors = np.arange(1, 9)
train_accuracy = np.zeros(len(neighbors))
test_accuracy = np.zeros(len(neighbors))
The next piece of code is where all the magic happens. You will enumerate over all eight neighbor values, and for each value of k you will fit a classifier and predict on both the training and testing data. Finally, you store the accuracies in the train_accuracy and test_accuracy NumPy arrays.
from sklearn.neighbors import KNeighborsClassifier

# For each candidate k, fit a model and record training and testing accuracy
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_data, train_label)
    train_accuracy[i] = knn.score(train_data, train_label)
    test_accuracy[i] = knn.score(test_data, test_label)
Next, you will plot the training and testing accuracy using matplotlib. From the accuracy vs. number of neighbors graph, you will be able to choose the k value at which your model performs best.
plt.figure(figsize=(10,6))
plt.plot(neighbors, test_accuracy, label='Testing accuracy')
plt.plot(neighbors, train_accuracy, label='Training accuracy')
plt.legend(prop={'size': 20})
plt.xlabel('Number of neighbors',fontsize=20)
plt.ylabel('Accuracy',fontsize=20)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.show()
Well, by looking at the above graph, it looks like the model performs best on both sets when n_neighbors=3. So, let's stick with n_neighbors=3 and re-run the training once again.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train_data, train_label)
Let's first check the accuracy of the model on the testing data.
knn.score(test_data, test_label)
0.9666666666666667
Voila! It looks like the model was able to classify 96.66% of the testing data correctly. Isn't that amazing? With just a few lines of code, you were able to train an ML model that can now tell you the flower's name using only four features, with 96.66% accuracy. Who knows, maybe it even performs better than a human could.
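For example, you can now classify a brand-new measurement in a single call; the sample values below are made up purely for illustration:

# A made-up flower: sepal length, sepal width, petal length, petal width (in cm)
sample = [[5.9, 3.0, 5.1, 1.8]]
print(knn.predict(sample))  # prints the predicted species label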
Confusion Matrix
prediction = knn.predict(test_data)
The following plot_confusion_matrix() function has been modified and
acquired from this source.
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    # Draw the matrix as an image with a color bar and class names on the ticks
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, fontsize=18)
    plt.yticks(tick_marks, classes, fontsize=18)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    # Annotate each cell; use the threshold to pick a readable text color
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label',fontsize=30)
    plt.xlabel('Predicted label',fontsize=30)
    plt.tight_layout()
from sklearn.metrics import confusion_matrix
cnf_matrix = confusion_matrix(test_label, prediction)
class_names = load_iris().target_names
np.set_printoptions(precision=2)
plt.figure(figsize=(10,8))
plot_confusion_matrix(cnf_matrix, classes=class_names)
plt.title('Confusion Matrix',fontsize=30)
plt.show()
From the above confusion matrix plot, you can observe that the model classified all the flowers correctly, except for one virginica flower that was classified as versicolor.
Classification Report
from sklearn.metrics import classification_report
print(classification_report(test_label, prediction))
Go Further!
First of all, congratulations to everyone who successfully made it to the end! But this was just the start; there is still a long way to go. This tutorial dealt mainly with the basics of machine learning and the implementation of one kind of ML algorithm, known as KNN, with Python. The Iris dataset that you used was pretty small and a little simple. If this tutorial ignited an interest in you to learn more, you can try using some other datasets, or learn about some more ML algorithms and perhaps apply them to the Iris dataset to observe the effect on accuracy. This way you will learn a lot more than just the theory! Once you have experimented enough with the basics presented in this tutorial and other machine learning algorithms, you might want to go further into Python and data analysis.