Instance-based learning involves storing all training examples and classifying new examples based on their similarity to stored examples. The k-nearest neighbors algorithm is a common instance-based learning method where the class is determined by the majority class of the k closest training examples. Key considerations for k-NN include how to measure similarity, choosing an appropriate value for k, and addressing the curse of dimensionality from irrelevant features. Distance weighting and feature selection help k-NN perform well even with many irrelevant features.


Foundations of Machine Learning
Module 3: Instance Based Learning and Feature Selection
Part D: Instance Based Learning

Sudeshna Sarkar
IIT Kharagpur
Instance-Based Learning
• One way of approximating discrete- or real-valued target functions
• Training examples: (x_n, f(x_n)), n = 1..N
• Key idea:
  – just store the training examples
  – when a test example is given, find the closest matches
Inductive Assumption

• Similar inputs map to similar outputs
  – If not true => learning is impossible
  – If true => learning reduces to defining “similar”

• Not all similarities are created equal
  – predicting a person’s weight may depend on different attributes than predicting their IQ
Basic k-nearest neighbor classification

• Training method:
– Save the training examples
• At prediction time:
– Find the k training examples (x1,y1),…(xk,yk) that
are closest to the test example x
– Predict the most frequent class among those yi’s.

• Example:
http://cgm.cs.mcgill.ca/~soss/cs644/projects/simard/

What is the decision boundary?
• For 1-NN, the training points induce a Voronoi diagram: each training point owns the region of input space closer to it than to any other training point, and the decision boundary follows the Voronoi edges between points of different classes.
[Figure: Voronoi diagram of the training set]
Basic k-nearest neighbor classification

• Training method:
– Save the training examples
• At prediction time:
– Find the k training examples (x1,y1),…(xk,yk) that are closest
to the test example x
– Predict the most frequent class among those yi’s.

• Improvements:
– Weighting examples from the neighborhood
– Measuring “closeness”
– Finding “close” examples in a large training set quickly

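As a concrete illustration of the store-then-vote procedure above, here is a minimal Python/NumPy sketch. The names (`knn_classify`, `train_X`, `train_y`) and the toy data are illustrative, not from the slides.

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, x, k=3):
    """Basic k-NN: keep the training examples, then predict the most
    frequent class among the k stored examples closest to x."""
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))  # Euclidean distance to every stored example
    nearest = np.argsort(dists)[:k]                    # indices of the k closest examples
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]  # majority vote

# Tiny illustration: two classes in 2-D
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
train_y = np.array(["o", "o", "+", "+"])
print(knn_classify(train_X, train_y, np.array([0.8, 0.9]), k=3))  # -> "+"
```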
k-Nearest Neighbor

$$\mathrm{Dist}(c_1, c_2) = \sqrt{\sum_{i=1}^{N} \bigl(\mathrm{attr}_i(c_1) - \mathrm{attr}_i(c_2)\bigr)^2}$$

$$k\text{-NearestNeighbors} = \text{the } k \text{ examples } c_i \text{ with minimum } \mathrm{Dist}(c_i, c_{\mathit{test}})$$

$$\mathrm{prediction}_{\mathit{test}} = \frac{1}{k}\sum_{i=1}^{k} \mathrm{class}_i \quad\left(\text{or } \frac{1}{k}\sum_{i=1}^{k} \mathrm{value}_i\right)$$

• The average of k points is more reliable when:
  – there is noise in the attributes
  – there is noise in the class labels
  – the classes partially overlap
[Figure: two partially overlapping classes (+ and o) in the attribute_1 / attribute_2 plane]
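The same machinery gives a regression-style prediction by averaging the neighbors' values, matching the (1/k)·Σ value_i form above. A minimal sketch, with illustrative names and toy data:

```python
import numpy as np

def knn_regress(train_X, train_y, x, k=3):
    """Predict the mean value of the k stored examples nearest to x."""
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))  # Dist(c_i, c_test) over all N attributes
    nearest = np.argsort(dists)[:k]                    # the k nearest neighbors
    return train_y[nearest].mean()                     # (1/k) * sum of value_i

# Noisy samples of y ~ 2x
train_X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
train_y = np.array([0.1, 2.2, 3.9, 6.1, 8.0])
print(knn_regress(train_X, train_y, np.array([2.5]), k=2))  # ~5.0
```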
How to choose “k”

• Large k:
– less sensitive to noise (particularly class noise)
– better probability estimates for discrete classes
– larger training sets allow larger values of k
• Small k:
– captures fine structure of problem space better
– may be necessary with small training sets
• Balance must be struck between large and small k
• As the training set approaches infinity and k also grows (while remaining a vanishing fraction of the training-set size), kNN becomes Bayes optimal
[Figures: from Hastie, Tibshirani & Friedman (2001), pp. 418–419]
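One practical way to strike this balance (not prescribed on the slide, but a common choice) is to pick k by leave-one-out evaluation on the training set. A rough sketch, reusing a basic classifier like the one sketched earlier; all names are illustrative:

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, x, k):
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def choose_k(train_X, train_y, candidate_ks=(1, 3, 5, 7, 9)):
    """Return the candidate k with the highest leave-one-out accuracy."""
    best_k, best_acc = None, -1.0
    n = len(train_X)
    for k in candidate_ks:
        correct = 0
        for i in range(n):
            mask = np.arange(n) != i                   # hold out example i
            pred = knn_classify(train_X[mask], train_y[mask], train_X[i], k)
            correct += int(pred == train_y[i])
        acc = correct / n
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k
```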
Distance-Weighted kNN

• The tradeoff between small and large k can be difficult
  – use a large k, but put more emphasis on nearer neighbors?

$$\mathrm{prediction}_{\mathit{test}} = \frac{\sum_{i=1}^{k} w_i \cdot \mathrm{class}_i}{\sum_{i=1}^{k} w_i} \quad\left(\text{or } \frac{\sum_{i=1}^{k} w_i \cdot \mathrm{value}_i}{\sum_{i=1}^{k} w_i}\right)$$

$$w_k = \frac{1}{\mathrm{Dist}(c_k, c_{\mathit{test}})}$$
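A sketch of the weighted vote above with w_i = 1/Dist(c_i, c_test); the epsilon guard against zero distances and the names are my own additions:

```python
import numpy as np
from collections import defaultdict

def distance_weighted_knn(train_X, train_y, x, k=5, eps=1e-12):
    """Each of the k nearest neighbors votes for its class with weight 1/Dist."""
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    votes = defaultdict(float)
    for i in np.argsort(dists)[:k]:
        votes[train_y[i]] += 1.0 / (dists[i] + eps)    # w_i = 1 / Dist(c_i, c_test)
    # The normalization by sum(w_i) cancels when taking the argmax over classes.
    return max(votes, key=votes.get)
```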
Locally Weighted Averaging

• Let k = number of training points
• Let the weight fall off rapidly with distance

$$\mathrm{prediction}_{\mathit{test}} = \frac{\sum_{i=1}^{k} w_i \cdot \mathrm{class}_i}{\sum_{i=1}^{k} w_i} \quad\left(\text{or } \frac{\sum_{i=1}^{k} w_i \cdot \mathrm{value}_i}{\sum_{i=1}^{k} w_i}\right)$$

$$w_k = \frac{1}{e^{\,\mathrm{KernelWidth}\,\cdot\,\mathrm{Dist}(c_k,\, c_{\mathit{test}})}}$$

• KernelWidth controls the size of the neighborhood that has a large effect on the value (analogous to k)
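A sketch of the weighted average over all training points with w_i = e^{-KernelWidth·Dist(c_i, c_test)}; the `kernel_width` argument name is illustrative:

```python
import numpy as np

def locally_weighted_average(train_X, train_y, x, kernel_width=1.0):
    """Weighted average of ALL training values; weights decay exponentially
    with distance, so kernel_width plays the role that k plays in kNN."""
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    w = np.exp(-kernel_width * dists)          # w_i = 1 / e^(KernelWidth * Dist)
    return (w * train_y).sum() / w.sum()       # sum(w_i * value_i) / sum(w_i)
```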
Locally Weighted Regression
• All algorithms so far are strict averagers: they can interpolate, but cannot extrapolate
• Instead, do a weighted regression centered at the test point, with the weights controlled by distance and KernelWidth
• The local regressor can be linear, quadratic, an n-th degree polynomial, a neural net, …
• Yields a piecewise approximation to the surface that is typically more complex than the local regressor
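A sketch of one such local regressor: a weighted least-squares linear fit centered at the test point, using the same exponential kernel weights. The tiny ridge term for numerical stability and all names are my own additions:

```python
import numpy as np

def lwr_predict(train_X, train_y, x, kernel_width=1.0):
    """Fit a linear model by weighted least squares, with weights centered
    on the query x, then evaluate the fitted line at x."""
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    w = np.exp(-kernel_width * dists)                        # kernel weights
    A = np.hstack([np.ones((len(train_X), 1)), train_X])     # prepend an intercept column
    W = np.diag(w)
    # Solve (A^T W A) beta = A^T W y  (small ridge term keeps it well-conditioned)
    beta = np.linalg.solve(A.T @ W @ A + 1e-8 * np.eye(A.shape[1]),
                           A.T @ W @ train_y)
    return float(np.concatenate(([1.0], x)) @ beta)
```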
Euclidean Distance

$$D(c_1, c_2) = \sqrt{\sum_{i=1}^{N} \bigl(\mathrm{attr}_i(c_1) - \mathrm{attr}_i(c_2)\bigr)^2}$$

• Gives all attributes equal weight?
  – only if the scales of the attributes and of their differences are similar
  – scale attributes to equal range or equal variance
• Assumes spherical classes
[Figure: two roughly spherical classes (o and +) in the attribute_1 / attribute_2 plane]
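A common way to give attributes comparable scales before applying Euclidean distance is to standardize each attribute to zero mean and unit variance using training-set statistics. A short sketch (names are illustrative):

```python
import numpy as np

def standardize(train_X, test_X):
    """Rescale every attribute to zero mean and unit variance,
    estimating the statistics on the training set only."""
    mean = train_X.mean(axis=0)
    std = train_X.std(axis=0)
    std[std == 0] = 1.0                 # leave constant attributes unscaled
    return (train_X - mean) / std, (test_X - mean) / std
```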
Euclidean Distance?
[Figures: two example class layouts in the attribute_1 / attribute_2 plane for which plain Euclidean distance is a poor fit]

• What if the classes are not spherical?
• What if some attributes are more or less important than other attributes?
• What if some attributes have more or less noise in them than other attributes?
Weighted Euclidean Distance

$$D(c_1, c_2) = \sqrt{\sum_{i=1}^{N} w_i \cdot \bigl(\mathrm{attr}_i(c_1) - \mathrm{attr}_i(c_2)\bigr)^2}$$

• Large weight => attribute is more important
• Small weight => attribute is less important
• Zero weight => attribute doesn’t matter

• Weights allow kNN to be effective with axis-parallel elliptical classes
• Where do the weights come from?
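A direct sketch of the weighted distance above; where the weights `w` come from (information gain, validation-set tuning, feature selection) is left open, as on the slide, and the example values are arbitrary:

```python
import numpy as np

def weighted_euclidean(c1, c2, w):
    """D(c1, c2) = sqrt( sum_i w_i * (attr_i(c1) - attr_i(c2))^2 ).
    A zero weight removes attribute i; a large weight makes it dominate."""
    return float(np.sqrt((w * (c1 - c2) ** 2).sum()))

# Example: the second attribute is judged twice as important as the first
print(weighted_euclidean(np.array([1.0, 2.0]), np.array([2.0, 4.0]), np.array([1.0, 2.0])))
```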
Curse of Dimensionality

• as number of dimensions increases, distance between points becomes larger and more uniform
• if number of relevant attributes is fixed, increasing the number of less relevant attributes may swamp
distance

• when more irrelevant than relevant dimensions, distance becomes less reliable
• solutions: larger k or KernelWidth, feature selection, feature weights, more complex distance functions

$$D(c_1, c_2) = \sqrt{\sum_{i=1}^{\mathrm{relevant}} \bigl(\mathrm{attr}_i(c_1) - \mathrm{attr}_i(c_2)\bigr)^2 \;+\; \sum_{j=1}^{\mathrm{irrelevant}} \bigl(\mathrm{attr}_j(c_1) - \mathrm{attr}_j(c_2)\bigr)^2}$$
K-NN and irrelevant features
[Figure: + and o training examples and a query point (?) laid out along a single relevant feature]

K-NN and irrelevant features
[Figure: the same examples scattered across a second, irrelevant feature]

K-NN and irrelevant features
[Figure: another scattering of the examples across the irrelevant feature]
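The effect illustrated in the last few slides is easy to reproduce numerically: appending purely random attributes to a single relevant one makes the query's nearest neighbors increasingly arbitrary. A small simulation sketch (all sizes, seeds, and names are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
relevant = rng.uniform(0, 1, size=(n, 1))            # one informative attribute
labels = (relevant[:, 0] > 0.5).astype(int)          # the class depends only on it
query_rel = np.array([0.9])                          # a query clearly in class 1

for n_irrelevant in (0, 2, 10, 50):
    X = np.hstack([relevant, rng.uniform(0, 1, size=(n, n_irrelevant))])
    q = np.concatenate([query_rel, rng.uniform(0, 1, size=n_irrelevant)])
    nearest = np.argsort(np.sqrt(((X - q) ** 2).sum(axis=1)))[:7]
    agree = labels[nearest].mean()                   # fraction of 7-NN with the true class
    print(f"{n_irrelevant:3d} irrelevant attributes -> {agree:.2f} of the 7-NN are class 1")
```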
Ways of rescaling for KNN
• Normalized L1 distance
• Scale attributes by information gain (IG)
• Modified value difference metric (MVDM)
Ways of rescaling for KNN
• Dot product: $x \cdot x' = \sum_i x_i\, x'_i$
• Cosine distance: $1 - \dfrac{x \cdot x'}{\lVert x\rVert\,\lVert x'\rVert}$
• TF-IDF weights for text: for doc $j$ and term $i$, $x_i = \mathit{tf}_{i,j} \cdot \mathit{idf}_i$, where $\mathit{tf}_{i,j}$ is the number of occurrences of term $i$ in doc $j$ and $\mathit{idf}_i = \log\!\left(\dfrac{\#\text{docs in corpus}}{\#\text{docs in corpus that contain term } i}\right)$
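Putting the text rescalings together: represent each document by TF-IDF weights and compare documents by cosine similarity. A minimal from-scratch sketch; in practice a library implementation such as scikit-learn's `TfidfVectorizer` would usually be used, and the function names and toy documents below are illustrative:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: tf * idf} dict per doc."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))   # document frequency of each term
    idf = {t: math.log(n_docs / df[t]) for t in df}           # idf_i = log(#docs / #docs containing i)
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine_similarity(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["nearest", "neighbor", "distance"],
        ["neural", "net", "layer"],
        ["neighbor", "distance", "weighting"]]
vecs = tfidf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[2]))   # > 0: docs 0 and 2 share terms
```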
Combining distances to neighbors

Standard kNN:
$$\hat{y} = \arg\max_{y}\; C(y, \mathrm{Neighbors}(x)), \qquad C(y, D') = \bigl|\{(x', y') \in D' : y' = y\}\bigr|$$

Distance-weighted kNN:
$$\hat{y} = \arg\max_{y}\; C(y, \mathrm{Neighbors}(x))$$
$$C(y, D') = \sum_{\{(x', y') \in D' :\; y' = y\}} \mathrm{SIM}(x, x')
\qquad \text{or} \qquad
C(y, D') = 1 - \prod_{\{(x', y') \in D' :\; y' = y\}} \bigl(1 - \mathrm{SIM}(x, x')\bigr)$$
$$\mathrm{SIM}(x, x') = 1 - \Delta(x, x')$$
(where $\Delta$ is a distance scaled into $[0, 1]$)
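A sketch of the two distance-weighted counting rules above, assuming the similarities have already been scaled into [0, 1]; the names and the toy neighbor list are illustrative:

```python
from collections import defaultdict

def combine_votes(neighbors, rule="sum"):
    """neighbors: list of (SIM(x, x'), y') pairs for the selected neighbors.
    rule='sum':      C(y) = sum of SIM over neighbors labeled y
    rule='noisy_or': C(y) = 1 - prod(1 - SIM) over neighbors labeled y"""
    if rule == "sum":
        scores = defaultdict(float)
        for sim, y in neighbors:
            scores[y] += sim
    else:
        prod = defaultdict(lambda: 1.0)
        for sim, y in neighbors:
            prod[y] *= (1.0 - sim)
        scores = {y: 1.0 - p for y, p in prod.items()}
    return max(scores, key=scores.get)

neighbors = [(0.9, "+"), (0.4, "o"), (0.3, "o")]
print(combine_votes(neighbors, rule="sum"))        # "+" : 0.9 vs "o" : 0.7
print(combine_votes(neighbors, rule="noisy_or"))   # "+" : 0.90 vs "o" : 0.58
```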
Advantages of Memory-Based Methods
• Lazy learning: don’t do any work until you know what you
want to predict (and from what variables!)
– never need to learn a global model
– many simple local models taken together can represent a more
complex global model
– better focussed learning
– handles missing values, time varying distributions, ...
• Very efficient cross-validation
• Intelligible learning method to many users
• Nearest neighbors support explanation and training
• Can use any distance metric: string-edit distance, …
Weaknesses of Memory-Based Methods

• Curse of Dimensionality:
– often works best with 25 or fewer dimensions
• Run-time cost scales with training set size
• Large training sets will not fit in memory
• Many MBL methods are strict averagers
• Sometimes doesn’t seem to perform as well as other methods
such as neural nets
• Predicted values for regression not continuous
