KNN Algorithm in Machine Learning

The document provides an introduction to the K-nearest neighbors (KNN) machine learning algorithm. It explains that KNN is a simple supervised learning method used for classification problems. It works by finding the K closest training examples in the feature space and assigning the data point to the most common class among its K neighbors. The document discusses why KNN is useful, what it is, how to choose the K value, when it should be used, and how the KNN algorithm makes predictions by calculating distances to neighbors.

Introduction to KNN (K-Nearest Neighbors)

Hello and welcome to this K-Nearest Neighbors algorithm tutorial. My name is Richard Kirchner and I'm with the Simplilearn team. Today we're going to cover K-Nearest Neighbors, often referred to as KNN. KNN is really a fundamental place to start in machine learning: it's the basis of a lot of other things, and the logic behind it is easy to understand and to incorporate into other forms of machine learning. So, what's in it for you today? Why do we need KNN, what is KNN, how do we choose the factor K, when do we use KNN, how does the KNN algorithm work, and then we'll dive into my favorite part, the use case: predicting whether a person will have diabetes or not. That is a very common and popular data set for testing out models and learning how to use the different models in machine learning.
Why do we need KNN?
By now we all know that machine learning models make predictions by learning from the past data available. We have our input values, our machine learning model builds on those inputs of what we already know, and then we use that to create a predicted output. "Is that a dog?" asks the little kid looking over there, watching the black cat cross their path. No, dear. You can differentiate between a cat and a dog based on their characteristics. Cats have sharp claws (used to climb), smaller ears, meow and purr, and don't love to run around. Dogs have duller claws, bigger ears, bark, and love to run around; you usually don't see a cat running around people the way dogs do. We can evaluate the sharpness of the claws and the length of the ears, and we can usually sort out cats from dogs based on even those two characteristics.

Now tell me if this is a cat or a dog. Not a hard question; usually by now little kids know cats and dogs. But imagine a place where there aren't many cats or dogs. If we look at the sharpness of the claws and the length of the ears, we can see that this animal has smaller ears and sharper claws than the other animals; its features are more like a cat's, so it must be a cat. Sharp claws, short ears: it goes in the cat group. Because KNN is based on feature similarity, we can do classification using a KNN classifier. So we have our input value, the picture of the black cat; it goes into our trained model, and the model predicts that this is a cat. So what is KNN? What is the KNN algorithm?
What is KNN?
K-Nearest Neighbors is what KNN stands for. It is one of the simplest supervised machine learning algorithms, mostly used for classification: we want to know, is this a dog or not a dog, is it a cat or not a cat? It classifies a data point based on how its neighbors are classified. KNN stores all available cases and classifies new cases based on a similarity measure. And here we've gone from cats and dogs right into wine, another favorite of mine. Here you see a measurement of sulfur dioxide versus chloride level, with the different wines they've tested plotted on that graph according to how much sulfur dioxide and how much chloride each contains. The K in KNN is a parameter that refers to the number of nearest neighbors to include in the majority voting process. So if we add a new glass of wine, red or white, we want to know what its neighbors are. In this case we're going to put K equal to 5 (we'll talk about K in just a minute). A data point is classified by the majority of votes from its five nearest neighbors. Here the unknown point would be classified as red, since four out of five neighbors are red. So how do we choose K? How do we know K equals 5? What value do we put in there?
How do we choose the factor 'K'?
So let's talk about it: how do we choose the factor K? The KNN algorithm is based on feature similarity, and choosing the right value of K is a process called parameter tuning, which is important for better accuracy. At K equals 3, we can classify the question mark in the middle as either a square or, in this case, a triangle. If we set K equal to 3, we look at the three nearest neighbors and say this is a square; if we put K equal to 7, we classify it as a triangle, depending on what the other data around it is. You can see that as K changes, depending on where that point sits, your answer can change drastically.

So how do we choose the factor K? You'll find this throughout machine learning: choosing these factors is the part where you ask, did I choose the right K, did I set my values right in whatever machine learning tool I'm using, so that there isn't a huge bias in one direction or the other. In terms of KNN, if you choose K too low, the result is too noisy: the prediction leans on whatever happens to sit right next to the point, and you can get a skewed answer. If your K is too big, it's going to take forever to process, so you'll run into processing and resource issues. The most common approach (there are other options for choosing K) is to use the square root of n, where n is the total number of values you have: you take the square root of it. And in most cases, if that comes out even, as it can when you're classifying two classes like squares and triangles, you want to make your K value odd; that helps the vote resolve better. In other words, you won't end up with a tie between two classes that get an equal number of votes. So you usually take the square root of n, and if it's even you add one to it or subtract one from it, and that's where the K value comes from. That is the most common method, it's pretty solid, and it works very well; a quick sketch of that rule of thumb follows.
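As an aside for readers following along, here is a minimal sketch of that rule of thumb in Python. This is our own illustration, not code from the video, and the helper name choose_k is made up:

    import math

    def choose_k(n_samples: int) -> int:
        """Rule-of-thumb K: the square root of n, nudged to an odd
        number so a two-class majority vote cannot end in a tie."""
        k = round(math.sqrt(n_samples))
        if k % 2 == 0:
            k -= 1  # subtracting 1 keeps it odd; adding 1 works just as well
        return max(k, 1)

    print(choose_k(154))  # sqrt(154) ~ 12.4 -> 12 -> 11, the K used later in this tutorial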
When do we use KNN?
We can use KNN when the data is labeled: you need labels on it, like a group of pictures labeled dog, dog, cat, cat. The data should also be noise free. You can see here that when the class column is a jumble of entries, say "underweight" next to stray values like 140 and 23 and then "normal", that's pretty confusing; a variety of inconsistent data coming in is very noisy, and that would cause an issue. And the data set should be small. We're usually working with smaller data sets, though you might get up toward a gigabyte of data if it's really clean and doesn't have a lot of noise. That's because KNN is a lazy learner, i.e., it doesn't learn a discriminative function from the training set. So if you have very complicated data and a large amount of it, you're not going to use KNN. But it's a really great place to start: even with large data you can pull out a small sample and get an idea of what it looks like using KNN, and it also works really well on smaller data sets.
How does the KNN algorithm work?
Consider a data set having two variables, height in centimeters and weight in kilograms, where each point is classified as Normal or Underweight. So we have two variables, and the label is effectively binary: either they're normal or they're underweight. On the basis of the given data, we have to classify a new point as Normal or Underweight using KNN. So if new data comes in that says 57 kilograms and 170 centimeters, is that going to be Normal or Underweight?

To find the nearest neighbors, we'll calculate the Euclidean distance. According to the Euclidean distance formula, the distance between two points in the plane with coordinates (x, y) and (a, b) is given by d = sqrt((x - a)^2 + (y - b)^2). You can remember that from the two edges of a right triangle: we're computing the third edge, since we know the x side and the y side. Let's calculate it to understand clearly. We have our unknown point, placed there in red, and we have our other points where the data is scattered around. The distance d1 is the square root of (170 - 167)^2 plus (57 - 51)^2, which is about 6.7; distance d2 is about 13, and distance d3 is about 13.4. Similarly, we calculate the Euclidean distance of the unknown data point from all the points in the data set. Because we're dealing with a small amount of data, that's not hard to do; it's actually pretty quick for a computer, and it's not really complicated math. You can just see how close the data is based on the Euclidean distance.

Hence, we have calculated the Euclidean distance of the unknown data point, where x1 and y1 equal 57 and 170, from all the points as shown; that's the point whose class we have to classify. Now we look at those Euclidean distances and ask which points are going to be its closest neighbors. Let's find the nearest neighbors at K = 3. We can see the three closest neighbors put it at Normal, and that's pretty self-evident when you look at the graph: we're just voting Normal, Normal, Normal. Three votes for Normal, so this is going to be a normal weight. The majority of neighbors are pointing towards Normal; hence, as per the KNN algorithm, the class of (57, 170) should be Normal.

So, a recap of KNN: a positive integer K is specified along with a new sample; we select the K entries in our database which are closest to the new sample; we find the most common classification of these entries; and that is the classification we give to the new sample. As you can see, it's pretty straightforward: we're just looking for the closest things that match what we've got.
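To make that concrete, here is a small from-scratch sketch (again our own illustration, not code from the video). The labeled points are a plausible reconstruction chosen to be consistent with the three distances quoted above (6.7, 13, 13.4); treat them as illustrative rather than the exact slide data:

    import math
    from collections import Counter

    def euclidean(p, q):
        # distance between two (weight_kg, height_cm) points
        return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

    def knn_classify(train, new_point, k=3):
        # majority vote among the k nearest labeled points
        nearest = sorted(train, key=lambda row: euclidean(row[0], new_point))[:k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    # (weight_kg, height_cm) -> class; points chosen so that
    # d1 ~ 6.7, d2 ~ 13.0, d3 ~ 13.4 as in the walkthrough
    train = [
        ((51, 167), "Underweight"),
        ((62, 182), "Normal"),
        ((69, 176), "Normal"),
        ((64, 173), "Normal"),
        ((65, 172), "Normal"),
        ((56, 174), "Underweight"),
        ((58, 169), "Normal"),
        ((57, 173), "Normal"),
        ((55, 170), "Normal"),
    ]

    new_point = (57, 170)
    print(round(euclidean((51, 167), new_point), 1))  # 6.7, the d1 above
    print(knn_classify(train, new_point, k=3))        # Normal, by three votes to none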
Use case - Predict whether a person will have diabetes or not
So let's take a look and see what that looks like in a use case in Python. Let's dive into the predict-diabetes use case. The objective: predict whether a person will be diagnosed with diabetes or not. We have a data set of 768 people who were or were not diagnosed with diabetes. Let's go ahead and open that file and take a look at the data. It's in a simple spreadsheet format; the data itself is comma separated, a very common format for a data set and a very common way to receive data. You can see here we have columns A through I: that's eight columns, each with a particular attribute, and then the ninth column, Outcome, which is whether they have diabetes. As a data scientist, the first thing you should be looking at is insulin: if someone is taking insulin, they have diabetes, that's why they're taking it, and that could cause issues in some machine learning packages. But for a very basic setup this works fine for doing KNN.

The next thing you notice is that it didn't take very much to open it up. I can scroll down to the bottom of the data: there are 768 rows, so it's a pretty small data set. At 768 rows I can easily fit this into the RAM on my computer; I can look at it, I can manipulate it, and it's not going to really tax a regular desktop computer. You don't even need an enterprise setup to run a lot of this. So let's start with importing all the tools we need.
Before that, of course, we need to discuss what IDE I'm using. You can certainly use any editor for Python, but for doing very basic visual work I like to use Anaconda, which is great for doing demos with the Jupyter Notebook. Just a quick view of the Anaconda Navigator, the new release out there, which is really nice: under Home I can choose my application, and we're going to be using Python 3.6; I have a couple of different versions on this particular machine. Under Environments I can create a unique environment for each one, which is nice, and there's even a little button there for installing different packages: if I click on it and open the terminal, I can use a simple pip install to add the packages I'm working with. Let's go back to Home and launch our notebook. A bit like the old cooking shows, I've already prepared a lot of my stuff, so we don't have to wait for it to launch; it takes a few minutes to open up a browser window. In this case it's going to open Chrome, because that's my default. Since the script is pre-done, you'll see I have a number of tabs open; we're working in the one at the top. And since we're working on using KNN to predict whether a person will have diabetes or not, let's go ahead and put that title in there.
I'm also going to go up here and click on Cell; actually, we first want to insert a cell below, and then I'll go back up to the top cell and change its cell type to Markdown. That means it's not going to run as Python; it's a markup language. So if I run this first cell, the title comes up in nice big letters, which is a handy reminder of what we're working on. By now you should be familiar with doing all of our imports: we import pandas as pd and numpy as np. pandas gives us the pandas DataFrame and numpy the numpy array, both very powerful tools to use here.

So we have our imports: we've brought in pandas and numpy, our two general Python tools. Then you can see we have train_test_split; by now you should be familiar with splitting the data. We want to split off part of it for training our particular model, and then we want to test on the remaining data to see how good the model is. From preprocessing we take the StandardScaler, a preprocessor, so we don't have a bias toward really large numbers. Remember, in this data the number of pregnancies never gets very large, while the amount of insulin goes up to around 256, and 256 versus 6 will skew results; so we want to rescale the columns so they're all uniform, on the same small range. Then there's the actual tool, the KNeighborsClassifier we're going to use. And finally, the last three imports are three tools for testing our model, all about how good it is: the confusion matrix, the f1 score, and the accuracy score. So we have our two general Python modules and our six model-specific imports from the sklearn setup. Then we do need to go ahead and run this cell so these are actually imported, and then we can move on to the next step.
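Collected in one cell, those imports look roughly like this; it's a sketch reconstructed from the narration, with module paths as in current scikit-learn:

    import pandas as pd
    import numpy as np

    from sklearn.model_selection import train_test_split   # split data into train/test
    from sklearn.preprocessing import StandardScaler        # standardize the features
    from sklearn.neighbors import KNeighborsClassifier      # the KNN model itself
    from sklearn.metrics import confusion_matrix, f1_score, accuracy_score  # evaluation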
In this step we're going to load the database. We're going to use pandas (remember, pandas as pd), and we'll take a look at the data in Python; we already looked at it in a simple spreadsheet, but I usually like to pull it up in the notebook too, so we can see what we're doing. So here's our dataset = pd.read_csv('diabetes.csv'); that's a pandas command, and I just put the diabetes file in the same folder where my IPython script is. If you put it in a different folder, you need the full path there. We can also do a quick length of the data set; that's the simple Python command len. Let's go ahead and print that. Note that if you put len(dataset) on its own line in a notebook it will print automatically, but in most other setups you want the explicit print in front.

Then we want to look at the actual data, and since we're in pandas we can simply do dataset.head(). Again, let's wrap it in a print: if you put a bunch of these in a row, say one head call after another, the notebook only prints the last one, so I always like to keep the print statement in there. But because most projects only use one pandas DataFrame, doing it either way works just fine. You can see when we hit the Run button that we have the 768 rows, which we knew, and we have our pregnancies column; pandas automatically adds an index label on the left, and head only shows the first five rows, so we have 0 through 4. A quick look at the data shows it matches what we saw before: pregnancies, glucose, blood pressure, all the way to age, and then the outcome on the end.
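As a runnable sketch, assuming the standard Pima Indians diabetes file is saved as diabetes.csv next to the notebook:

    import pandas as pd

    dataset = pd.read_csv('diabetes.csv')  # use the full path if the file lives elsewhere

    print(len(dataset))    # 768 rows
    print(dataset.head())  # first five rows: Pregnancies ... Age, Outcome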
We're going to do a couple of things in this next step. We're going to create a list of columns that can't contain zero: there's no such thing as zero skin thickness, zero blood pressure, or zero glucose; with any of those you'd be dead. So a zero there is not a good value; it's there because they didn't have the data, and we're going to start replacing that missing information with a couple of different things. Let's see what that looks like. First we create the list, and you can see it holds the values we talked about: glucose, blood pressure, skin thickness, and so on. When you're working with columns, listing the columns you need to run some kind of transformation on is a very common thing to do.

For this particular setup, we could certainly use some of the pandas tools that handle missing values more directly, but we're going to do it as dataset[column] = dataset[column].replace(...). This is still pandas; there are a lot of different options here, and the NaN (numpy's np.nan is what that stands for) means none, doesn't exist. So the first thing we do is replace the zero with a numpy NaN; "there's no data there" is what that line says. We're replacing zeros with "no data": if it's a zero, the person is, well, hopefully not dead, they just didn't get the measurement. The next thing we do is compute the mean, as an integer, from the data set column, skipping the NaNs; that's a pandas option, skipna. So we figure out the mean of that column, and then we take the data set column and replace all the np.nan values with that mean.

Why did we do it in two steps? You could actually find a way to skip the zeros and replace them directly, but in this case we do it this way so you can see that we're switching the zeros to a non-existent value and then filling in with the mean, the average person. If they did not get the measurement and the data is missing, one of the standard tricks is to replace it with the average, the most common value for that column; this way you can still use the rest of the values in the row for your computation, and it more or less takes those missing values out of the equation.

Let's go ahead and run it. It doesn't actually print anything; we're still preparing our data. If you wanted to see what it looks like, there's nothing missing in the first few rows, so it won't show up there, but we could certainly look at one column. Let's do that: let's print the glucose column from the data set. If I run this, it prints all the different glucose levels going down, and thankfully we don't see anything that looks like missing data, at least in the rows that show; when there are too many lines, the Jupyter Notebook skips a bunch in the middle and goes on to the rest. Let me go ahead and remove that inspection cell; we'll just zero that out.
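A sketch of that cleaning step. The list name zero_not_accepted and the exact set of columns are our assumption; the video names glucose, blood pressure, and skin thickness explicitly, and the column names follow the standard Pima diabetes CSV:

    import numpy as np

    # columns where a zero can only mean "measurement missing"
    zero_not_accepted = ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI', 'Insulin']

    for column in zero_not_accepted:
        dataset[column] = dataset[column].replace(0, np.nan)      # mark zeros as missing
        mean = int(dataset[column].mean(skipna=True))             # average of the real values
        dataset[column] = dataset[column].replace(np.nan, mean)   # fill the gaps with the mean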
Of course, before proceeding any further, we need to split the data set into our training and testing data, so that we have something to train with and something to test on. And you'll notice we did a little something here with the pandas code: we've added .iloc off the data set. What this says, for the first one, is that within the data set we want the integer location of all rows (that's what the colon means: keep all the rows) but only columns 0 to 8. Remember, the ninth column, which we printed as Outcome, is not part of the training data; that's the answer. It's the ninth column, but it's listed as index 8, and in the slice 0:8 the stop value isn't included, so this actually selects columns 0 through 7, the eight feature columns. Then we come down to y, which is our answer, and we want just the last one, just column 8; you can do that with this same notation.

Then, if you remember, we imported train_test_split; that's part of sklearn. We simply put in our X and our y, and we set random_state=0. You don't necessarily have to seed it; that's a seed number (I'd have to look up the default). And then the test size: test_size=0.2 simply means we're going to take 20% of the data and put it aside so we can test on it later. That's all that is, and again we're going to run it. Not very exciting so far; we haven't had any printout other than looking at the data. But a lot of this is prepping the data, and once you've prepped it, the actual lines of code are quick and easy.
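A sketch of that split, relying on the dataset loaded above:

    from sklearn.model_selection import train_test_split

    X = dataset.iloc[:, 0:8]  # all rows, feature columns 0-7 (the slice excludes index 8)
    y = dataset.iloc[:, 8]    # all rows, the Outcome column

    # hold out 20% of the rows for testing; random_state=0 makes the split repeatable
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=0, test_size=0.2)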
We're almost there, but before the actual running of our KNN we need to scale the data. If you remember correctly, we're fitting the data with a StandardScaler, which means that instead of one column running from, say, 5 to 300 and the next column from 1 to 6, we standardize everything so all the columns sit on the same small scale. That's what the StandardScaler does: it keeps things standardized. And we only want to fit the scaler on the training set, but we want to make sure the testing set, the X_test going in, is also transformed, so it's processed the same way. So here we go with the StandardScaler: we call it sc_X for the scaler, and we assign the StandardScaler to that variable. Then X_train = sc_X.fit_transform(X_train), so we're creating the scaler on the X_train variable; and X_test we only transform. So we've fit and transformed X_train, while X_test isn't part of fitting the transformer; it just gets transformed. That's all it does, and again we're going to run this.

If you look at where we are, we've now gone through all three of these steps: we've taken care of replacing our zeros in the key columns that shouldn't be 0, and we replaced them with the means of those columns so they fit right in with our data models; we've split the data, so now we have our test data and our training data; and we've scaled the data going in. Note that we don't transform the y part; y_train and y_test never have to be scaled. It's only the input data going in that we want to transform.
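A sketch of the scaling step. Strictly speaking, StandardScaler standardizes each column to zero mean and unit variance; the video's "between minus one and one" is an informal description of the same idea:

    from sklearn.preprocessing import StandardScaler

    sc_X = StandardScaler()
    X_train = sc_X.fit_transform(X_train)  # learn mean/variance from training data, then scale it
    X_test = sc_X.transform(X_test)        # reuse the SAME fit to transform the test data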
Then we define the model using the KNeighborsClassifier and fit the training data into the model. We do all that data prep, and you can see that down here there are only a couple of lines of code where we actually build our model and train it. That's one of the cool things about Python and how far we've come; it's such an exciting time to be in machine learning because there are so many automated tools.

Before we do this, let's do a quick length check: len(y) gives us 768. And if we import math and do math.sqrt(len(y_test)), we get 12.409. I want to show you where this number comes from, because we're about to use it: 12 is an even number, and if you're ever voting on things, remember the neighbors all vote; you don't want an even number of neighbors voting. So we want something odd, and we'll just take one away and make it 11. Let me delete that scratch cell out of here; this is one of the reasons I love the Jupyter Notebook, because you can flip around and do all kinds of things on the fly. So we'll go ahead and put in our classifier: we're creating it now, and it's the KNeighborsClassifier with n_neighbors=11 (remember, we did 12 minus 1 to get an odd number of neighbors), p=2, and metric='euclidean', which together specify the standard Euclidean distance. There are other ways of measuring distance, all kinds of metrics, but Euclidean is the most common one and it works quite well.
and it works quite well it's important
23:37
to evaluate the model let's use the
23:39
confusion matrix to do that and we're
23:41
going to do is a confusion matrix
23:42
wonderful tool and then we'll jump into
23:45
the f1 score and finally an accuracy
23:49
score which is probably the most
23:50
commonly used quoted number when you go
23:53
into a meeting or something like that so
23:55
let's go ahead and paste that in there
23:56
and we'll set the CM equal to confusion
23:59
matrix why test why predict so those are
24:02
the two values we're going to put in
24:04
there and let me go ahead and run that
24:05
and print it out and the way you
24:07
interpret this is you have the Y
24:10
predicted which would be your title up
24:12
here you can do assist do PR IDI predict
24:16
it across the top and actual going down
24:21
it's always hard to write in here actual
24:24
that means that this column here down
24:26
the middle
24:27
that's the important column and it means
24:29
that our prediction said 94 and
24:32
prediction in the actual agreed on 94
24:35
and 32 this number here the 13 and the
24:40
15 those are what was wrong
24:42
so you could have like three different
24:43
if you're looking at this across three
24:45
different variables instead of just two
24:47
you'd end up with the third row down
24:48
here and the column going down the
24:50
middle so in the first case we have the
24:52
the and I believe the zero has a 90 for
24:55
people who don't have diabetes the
24:57
prediction said 213 of those people did
24:59
have diabetes and we're at high risk and
25:01
the 32 that had diabetes it had correct
25:04
but our prediction said another 15 out
25:08
of that 15 it classified as incorrect so
25:12
you can see where that classification
25:13
comes in and how that works on the
25:15
confusion matrix then we're gonna go
25:18
ahead and print the f1 score let me just
25:20
run that and you see we got 2 point 6 9
25:23
and our f1 score the f1 takes into
25:28
account both sides of the balance of
25:30
false positives where if we go ahead and
25:33
just do the accuracy account and that's
25:35
what most people think of is it looks at
25:38
just how many we got right out of how
25:40
many we got wrong so a lot of people in
25:42
your data scientists and you're talking
25:44
the other data scientists they're gonna
25:46
ask you what the f1 score the F score is
25:48
if you're talking to the general public
25:50
or the decision-makers in the business
25:53
they're gonna ask what the accuracy is
25:54
and the accuracy is always better than
25:56
the f1 score but the f1 score is more
25:59
telling let's just know that there's
26:01
more false positives than we would like
26:03
on here but 82% not too bad for a quick
26:06
flash look at people's different
26:09
statistics and running an SK learn and
26:11
running the KNN the K nearest neighbor
26:14
on it so we have created a model using
26:16
KN n which can predict whether a person
26:19
will have diabetes or not or at the very
26:21
least whether they should go get a
26:22
checkup and have their glucose checked
26:25
regularly or not the print accurate
26:27
score we got the point eight one eight
26:29
was pretty close to what we got and we
26:31
can pretty much round that off and just
26:32
say we have an accuracy of 80% tells us
26:35
that it's a pretty fair fit in the model
26:37
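A sketch of the evaluation cell. The numbers in the comments are the ones quoted in the video; your exact values will depend on the split and your environment:

    from sklearn.metrics import confusion_matrix, f1_score, accuracy_score

    cm = confusion_matrix(y_test, y_pred)
    print(cm)  # the video's run shows 94 and 32 on the diagonal, 13 and 15 off it

    print(f1_score(y_test, y_pred))        # ~0.69 in the video
    print(accuracy_score(y_test, y_pred))  # ~0.818 in the video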
To pull that all together (it's always fun to make sure we covered everything we went over): we covered why we need KNN, looking at cats and dogs; great if you have a cat door and want to figure out whether it's a cat or a dog coming in, and not let the dog in or out. We used the Euclidean distance, the simple distance calculated from the two sides of a triangle, that is, the square root of the sum of the two sides squared. We discussed choosing the value of K, at least the main rule of thumb people use for choosing K, and how KNN works. Then finally we did a full KNN classifier for diabetes prediction. Thank you for joining us today. For more information, visit www.simplilearn.com.
