Chapter II - Lecture 2 - KNN

This document provides an overview of the k-nearest neighbors (k-NN) algorithm. It explains that k-NN is a simple supervised learning algorithm that stores all training examples and classifies new examples based on their similarity to existing examples. The document outlines the k-NN classification process, discusses how to choose the k value for k nearest neighbors, provides examples of k-NN classification, and reviews the strengths and weaknesses of the k-NN algorithm.


Chapter II:

Supervised Learning
Lecture 2.2
Road Map
◼ Introduction
◼ Generalization, Overfitting, and Underfitting
◼ Some Sample Datasets
◼ Supervised Machine Learning Algorithms
◼ k-Nearest Neighbors
◼ Linear Models
◼ Naive Bayes Classifiers
◼ Decision Trees
◼ Support Vector Machines

What is k-NN?
◼ A powerful classification algorithm used in pattern recognition.
◼ k-NN stores all available cases and classifies new cases based on a similarity measure (e.g., a distance function).
◼ One of the most widely used data mining algorithms today.
◼ A non-parametric, lazy learning algorithm (an instance-based learning method).

k-Nearest Neighbor
◼ The k-NN algorithm is arguably the simplest machine learning algorithm.
◼ To make a prediction for a new data point, the algorithm finds the closest data points in the training dataset, its “nearest neighbors.”
◼ Given an input, the k-NN algorithm chooses the most common class among the k nearest data points to that input.

Simple Analogy!

k-Nearest Neighbor Classification (kNN)
◼ Unlike most other learning methods, kNN does not build a model from the training data.
◼ To classify a test instance d, define the k-neighborhood P as the k nearest neighbors of d.
◼ Count the number nj of training instances in P that belong to class cj, and assign d to the class with the largest count (i.e., estimate Pr(cj | d) ≈ nj / k).
◼ No training is needed. Classification time is linear in the training set size for each test case.
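As a concrete illustration of this decision rule (not from the original slides), here is a minimal from-scratch sketch assuming numeric feature vectors, Euclidean distance, and a simple majority vote:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, d, k=3):
    """Classify a single test instance d by a majority vote over its k nearest neighbors."""
    # Euclidean distance from d to every training instance
    distances = np.linalg.norm(X_train - d, axis=1)
    # Indices of the k closest training instances (the k-neighborhood P)
    neighbors = np.argsort(distances)[:k]
    # Most common class label among those neighbors
    return Counter(y_train[neighbors]).most_common(1)[0][0]

# Tiny illustrative dataset (hypothetical values)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
```

Note that nothing is learned up front: all the work happens at classification time, which is what makes kNN a lazy, instance-based method.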

k-NN Classification Process

kNN Algorithm

◼ k is usually chosen empirically via a validation set or cross-validation, by trying a range of k values.
◼ The distance function is crucial, but depends on the application.
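A hedged sketch of this model-selection step, assuming scikit-learn and training data already stored in X_train and y_train:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Try a range of k values and keep the one with the best cross-validated accuracy
best_k, best_score = 1, 0.0
for k in range(1, 16):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    if scores.mean() > best_score:
        best_k, best_score = k, scores.mean()
print("best k:", best_k, "cross-validated accuracy:", round(best_score, 3))
```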
Example: k = 6 (6-NN)
(Figure: documents from three classes (Government, Science, and Arts) plotted as points; for a new point, Pr(science | new point) is estimated from its six nearest neighbors.)

Cont..

◼ The k nearest neighbors of a record x are the data points that have the k smallest distances to x.

How to choose K?
◼ If K is too small, the classifier is sensitive to noise points.
◼ A larger K works well, but if K is too large, the neighborhood may include many points from other classes.
◼ A rule of thumb is K < sqrt(n), where n is the number of training points (e.g., with n = 100 training points, use K < 10).

k-NN Example!
◼ In its simplest version, the k-NN algorithm considers exactly one nearest neighbor: the closest training data point to the point we want to make a prediction for.
◼ The prediction is then simply the known output for this training point. The figure below illustrates this for classification on the forge dataset:

Input:
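The code screenshot on the original slide did not survive extraction. A plausible reconstruction, assuming the mglearn helper package (which provides the forge dataset and this plotting utility), is:

```python
import mglearn
import matplotlib.pyplot as plt

# Plot the forge dataset and the 1-nearest-neighbor prediction for three new points
mglearn.plots.plot_knn_classification(n_neighbors=1)
plt.show()
```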

Cont..
◼ Here, we added three new data points, shown as stars. For each of them, we marked the closest point in the training set. The prediction of the one-nearest-neighbor algorithm is the label of that point (shown by the color of the cross).
◼ Instead of considering only the closest neighbor, we can also consider an arbitrary number, k, of neighbors. This is where the name of the k-nearest neighbors algorithm comes from. The following example uses the three closest neighbors:
Input:
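Again the screenshot is missing; under the same assumption about the mglearn helper, the three-neighbor version would be:

```python
# Same plot, but predictions now use the three closest neighbors
mglearn.plots.plot_knn_classification(n_neighbors=3)
plt.show()
```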

Cont..
◼ Again, the prediction is shown as the color of the cross. You can see that the prediction for the new data point at the top left is not the same as the prediction when we used only one neighbor.
◼ Now let’s look at how we can apply the k-nearest neighbors algorithm using scikit-learn.
◼ First, we split our data into a training and a test set so we can evaluate generalization performance, as discussed earlier in this chapter.

Cont..
◼ Input:
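The first screenshot most likely generates the forge data and splits it; a hedged reconstruction assuming mglearn and scikit-learn:

```python
from sklearn.model_selection import train_test_split
import mglearn

# Generate the forge dataset and split it into training and test sets
X, y = mglearn.datasets.make_forge()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
```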

◼ Input:
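The second screenshot most likely imports and instantiates the classifier; a minimal sketch:

```python
from sklearn.neighbors import KNeighborsClassifier

# Instantiate the estimator, setting the number of neighbors to 3
clf = KNeighborsClassifier(n_neighbors=3)
```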

Cont..
◼ The train_test_split() function is used to split the dataset into train and test sets. By default, the function shuffles the data (with shuffle=True) before splitting.
◼ The random_state parameter of the train_test_split() function controls the shuffling process.
◼ With random_state=None, we get different train and test sets across different executions, and the shuffling is not reproducible.
◼ With random_state=0, we get the same train and test sets across different executions.
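A small sketch of this reproducibility behavior, assuming the X and y arrays from the earlier split:

```python
from sklearn.model_selection import train_test_split

# Same random_state -> identical, reproducible splits
a_train, a_test, _, _ = train_test_split(X, y, random_state=0)
b_train, b_test, _, _ = train_test_split(X, y, random_state=0)
assert (a_train == b_train).all() and (a_test == b_test).all()

# random_state=None (the default) may shuffle differently on every call
c_train, _, _, _ = train_test_split(X, y, random_state=None)
```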
Cont..
◼ Input:

◼ Input:
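The two missing screenshots on this slide most likely fit the classifier on the training set and then predict on the test set; a hedged reconstruction (the printed output would be the array of predicted class labels, whose exact values depend on the data):

```python
# Fit the classifier on the training set (for k-NN this just stores the data)
clf.fit(X_train, y_train)

# Predict class labels for the test set
print("Test set predictions:", clf.predict(X_test))
```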

◼ Output:

Cont..
◼ Input:
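The missing screenshot here most likely evaluates how well the model generalizes; a minimal sketch using the score method, which reports test-set accuracy:

```python
# Fraction of test examples for which the correct class was predicted
print("Test set accuracy: {:.2f}".format(clf.score(X_test, y_test)))
```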

◼ Output:

Strengths and Weaknesses of KNN
◼ Strengths of KNN
- Very simple and intuitive.
- Can be applied to data from any distribution.
- Gives good classification if the number of samples is large enough.
◼ Weaknesses of KNN
- Takes more time to classify new examples: the distance from each new example to all training examples must be calculated and compared.
- Choosing K may be tricky.
- Needs a large number of samples for good accuracy.

Discussions
◼ kNN can deal with complex and arbitrary decision boundaries.
◼ Despite its simplicity, researchers have shown that the classification accuracy of kNN can be quite strong and, in many cases, as accurate as that of more elaborate methods.
◼ kNN is slow at classification time.
◼ kNN does not produce an understandable model.

Thank You!

