3a KNN
Sudeshna Sarkar
IIT Kharagpur
Instance-Based Learning
• One way of approximating discrete- or real-valued target functions
• Given training examples (xn, f(xn)), n = 1..N
• Key idea:
– just store the training examples
– when a test example is given, find the closest matching stored examples
Inductive Assumption
• Similar inputs map to similar outputs: a test example that is close to stored
training examples is assumed to share their class (or value).
• Example:
http://cgm.cs.mcgill.ca/~soss/cs644/projects/simard/
What is the decision boundary?
Voronoi diagram: each training example owns the region of input space closer to it than to any
other training example; the 1-NN decision boundary follows the Voronoi edges that separate
examples of different classes.
Basic k-nearest neighbor classification
• Training method:
– Save the training examples
• At prediction time:
– Find the k training examples (x1,y1),…(xk,yk) that are closest
to the test example x
– Predict the most frequent class among those yi’s (see the sketch below).
• Improvements:
– Weighting examples from the neighborhood
– Measuring “closeness”
– Finding “close” examples in a large training set quickly
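A minimal sketch of this procedure in Python/NumPy (function and variable names are illustrative, not from the slides):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # distance from the test example to every stored training example (Euclidean)
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # indices of the k closest training examples
    nearest = np.argsort(dists)[:k]
    # predict the most frequent class among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# usage: two classes in 2-D
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))   # -> 0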
k-Nearest Neighbor
$$\mathrm{Dist}(c_1, c_2) = \sqrt{\sum_{i=1}^{N}\bigl(\mathrm{attr}_i(c_1) - \mathrm{attr}_i(c_2)\bigr)^2}$$

$$k\text{-NearestNeighbors} = \text{the } k \text{ training examples with minimum } \mathrm{Dist}(c_i, c_{\mathrm{test}})$$

$$\mathrm{prediction}_{\mathrm{test}} = \frac{1}{k}\sum_{i=1}^{k} \mathrm{class}_i \quad \Bigl(\text{or } \frac{1}{k}\sum_{i=1}^{k} \mathrm{value}_i\Bigr)$$
[Figure: scatter plot of + and o examples over attribute_1 vs. attribute_2, annotated to show
noise in the class labels and partially overlapping classes]
How to choose “k”
• Large k:
– less sensitive to noise (particularly class noise)
– better probability estimates for discrete classes
– larger training sets allow larger values of k
• Small k:
– captures fine structure of problem space better
– may be necessary with small training sets
• A balance must be struck between large and small k
• As the training set size approaches infinity and k grows large (with k/N → 0), kNN becomes Bayes optimal
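In practice this balance is often found empirically, e.g. by cross-validating over candidate values of k; a sketch using scikit-learn (one standard approach, assuming scikit-learn is available — the slides do not prescribe this):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# mean 5-fold cross-validated accuracy for each candidate k
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in (1, 3, 5, 7, 9, 15, 25)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))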
[Figures from Hastie, Tibshirani & Friedman (2001), pp. 418-419]
Distance-Weighted kNN
$$\mathrm{prediction}_{\mathrm{test}} = \frac{\sum_{i=1}^{k} w_i\,\mathrm{class}_i}{\sum_{i=1}^{k} w_i} \quad \Bigl(\text{or } \frac{\sum_{i=1}^{k} w_i\,\mathrm{value}_i}{\sum_{i=1}^{k} w_i}\Bigr)$$

$$w_k = \frac{1}{\mathrm{Dist}(c_k, c_{\mathrm{test}})}$$
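A sketch of this weighting scheme (names illustrative; the small eps term, which the formula above does not include, avoids division by zero when the test point coincides with a training point):

import numpy as np

def distance_weighted_knn(X_train, y_train, x_test, k=3, eps=1e-12):
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    w = 1.0 / (dists[nearest] + eps)              # w_i = 1 / Dist(c_i, c_test)
    # classification: accumulate weight per class and take the heaviest
    votes = {}
    for label, weight in zip(y_train[nearest], w):
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)
    # regression variant: np.dot(w, y_train[nearest]) / w.sum()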
Locally Weighted Averaging
• Let k = N, the number of training points
• Let the weights fall off rapidly with distance
$$\mathrm{prediction}_{\mathrm{test}} = \frac{\sum_{i=1}^{k} w_i\,\mathrm{class}_i}{\sum_{i=1}^{k} w_i} \quad \Bigl(\text{or } \frac{\sum_{i=1}^{k} w_i\,\mathrm{value}_i}{\sum_{i=1}^{k} w_i}\Bigr)$$

$$w_k = \frac{1}{e^{\mathrm{KernelWidth}\cdot\mathrm{Dist}(c_k,\,c_{\mathrm{test}})}}$$
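Compared with the distance-weighted sketch above, only the weight function changes and all training points are used; KernelWidth is a tunable parameter (names illustrative):

import numpy as np

def locally_weighted_average(X_train, y_train, x_test, kernel_width=1.0):
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    w = np.exp(-kernel_width * dists)     # w_i = 1 / e^(KernelWidth * Dist(c_i, c_test))
    return np.dot(w, y_train) / w.sum()   # weighted average over all training values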
Euclidean Distance
$$D(c_1, c_2) = \sqrt{\sum_{i=1}^{N}\bigl(\mathrm{attr}_i(c_1) - \mathrm{attr}_i(c_2)\bigr)^2}$$
– scale attributes to equal range or equal variance
• assumes spherical classes
[Figure: scatter plot of + and o classes over attribute_1 vs. attribute_2 with roughly spherical class regions]
Euclidean Distance?
[Figure: two scatter plots of + and o classes over attribute_1 vs. attribute_2, showing class
layouts for which plain (unweighted) Euclidean distance is a poor fit]
$$D(c_1, c_2) = \sqrt{\sum_{i=1}^{N} w_i\,\bigl(\mathrm{attr}_i(c_1) - \mathrm{attr}_i(c_2)\bigr)^2}$$
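A sketch of the attribute-weighted distance; how the per-attribute weights w_i are chosen is left open here (the rescaling slides below discuss options):

import numpy as np

def weighted_euclidean(c1, c2, w):
    # D(c1, c2) = sqrt( sum_i w_i * (attr_i(c1) - attr_i(c2))^2 )
    return np.sqrt(np.sum(w * (c1 - c2) ** 2))

# usage: down-weight the second (less relevant) attribute
print(weighted_euclidean(np.array([1.0, 5.0]), np.array([2.0, 9.0]), np.array([1.0, 0.1])))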
• as the number of dimensions increases, distances between points become larger and more uniform
• if the number of relevant attributes is fixed, adding less relevant attributes may swamp the distance
• when there are more irrelevant than relevant dimensions, distance becomes less reliable
• solutions: larger k or KernelWidth, feature selection, feature weights, more complex distance functions
K-NN and irrelevant features
[Figure: + and o training examples surrounding a query point "?"]
K-NN and irrelevant features
[Figure: the same + and o examples and query point "?" with an irrelevant feature added; the
nearest neighbors now mix the two classes]
Ways of rescaling for KNN
Normalized L1 distance:
Scale by IG (information gain):
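The slide's formulas are not reproduced above; as an illustrative sketch of the two ideas (my reading, not the slide's exact definitions): divide each attribute difference by that attribute's range in the training data, or weight each attribute by a relevance score such as its information gain with the class:

import numpy as np

def normalized_l1(c1, c2, ranges):
    # sum_i |attr_i(c1) - attr_i(c2)| / range_i : each attribute contributes on a 0..1 scale
    return np.sum(np.abs(c1 - c2) / ranges)

def ig_scaled_l1(c1, c2, ig_weights):
    # "scale by IG": weight each attribute's contribution by its information gain
    # (the weights are supplied by the caller; computing IG is not shown here)
    return np.sum(ig_weights * np.abs(c1 - c2))

X = np.array([[0.0, 100.0], [1.0, 300.0], [0.5, 200.0]])
ranges = X.max(axis=0) - X.min(axis=0)                     # per-attribute range
print(normalized_l1(X[0], X[1], ranges))                   # -> 2.0
print(ig_scaled_l1(X[0], X[1], np.array([0.8, 0.1])))      # illustrative IG weights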
Ways of rescaling for KNN
Dot product: $x \cdot y = \sum_i x_i\, y_i$
Cosine distance: $\dfrac{x \cdot y}{\lVert x \rVert\,\lVert y \rVert}$
TF-IDF term weighting: weight of term i in doc j =
(#occurrences of term i in doc j) × log( #docs in corpus / #docs in corpus that contain term i )
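A sketch of these text-oriented measures for bag-of-words count vectors (helper names are mine; "cosine distance" here is the usual cosine similarity, as labelled on the slide):

import numpy as np

def dot_product(x, y):
    return float(np.dot(x, y))

def cosine(x, y):
    # dot product divided by the product of the vector lengths
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def tfidf(counts):
    # counts: docs x terms matrix of raw term counts
    n_docs = counts.shape[0]
    df = (counts > 0).sum(axis=0)        # #docs in corpus that contain term i
    idf = np.log(n_docs / df)            # log( #docs in corpus / df_i )
    return counts * idf                  # tf * idf weight of term i in doc j

docs = np.array([[2, 0, 1],
                 [0, 3, 1],
                 [1, 1, 0]], dtype=float)
W = tfidf(docs)
print(cosine(W[0], W[1]))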
Combining distances to neighbors
Standard KNN: $\hat{y} = \arg\max_{y} C(y, \mathrm{Neighbors}(x))$, where
$C(y, D') = \bigl|\{(x', y') \in D' : y' = y\}\bigr|$ is the number of neighbors with label $y$
Distance-weighted KNN: weight each neighbor's contribution to the count by its closeness to $x$
(cf. the Distance-Weighted kNN formula earlier)
Practical issues with kNN
• Curse of dimensionality:
– often works best with 25 or fewer dimensions
• Run-time cost scales with training set size
• Large training sets may not fit in memory
• Many memory-based learning (MBL) methods are strict averagers: they cannot predict values outside the range of the stored targets
• Sometimes does not perform as well as other methods such as neural nets
• Predicted values for regression are not continuous
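The run-time and memory concerns above are often mitigated with a spatial index built at training time; a sketch using SciPy's kd-tree (one standard speed-up for low-to-moderate dimensions, not something the slides prescribe, and itself degraded by the curse of dimensionality):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100_000, 3))           # large training set, few dimensions
y_train = rng.integers(0, 2, size=100_000)

tree = cKDTree(X_train)                           # build the index once
dist, idx = tree.query(rng.normal(size=3), k=5)   # fast k-nearest lookup per query
print(np.bincount(y_train[idx]).argmax())         # majority vote among the 5 neighbors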