BookSlides 5B Similarity-based Learning
Similarity-based Learning
Sections 5.4, 5.5
2 Data Normalization
5 Feature Selection
7 Summary
Figure: Is the instance at the top right of the diagram really noise?
The k nearest neighbor model returns the majority target level among the query's k nearest neighbors, where δ(t_i, l) is 1 if t_i = l and 0 otherwise:
\[
\mathbb{M}_k(\mathbf{q}) = \arg\max_{l \in \mathit{levels}(t)} \sum_{i=1}^{k} \delta(t_i, l) \tag{1}
\]
Each neighbor's vote can instead be weighted by the reciprocal of its squared distance to the query,
\[
\frac{1}{dist(\mathbf{q}, \mathbf{d})^2} \tag{2}
\]
which gives the distance-weighted majority vote:
\[
\mathbb{M}_k(\mathbf{q}) = \arg\max_{l \in \mathit{levels}(t)} \sum_{i=1}^{k} \frac{1}{dist(\mathbf{q}, \mathbf{d}_i)^2} \times \delta(t_i, l) \tag{3}
\]
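A minimal Python sketch of both vote rules; it assumes the k nearest neighbors have already been found and are supplied as (distance, target) pairs (the function name and this input format are my own, not from the book):

from collections import defaultdict

def knn_classify(neighbors, weighted=False):
    # neighbors: list of (distance, target_level) pairs for the k
    # training instances closest to the query (distances assumed > 0)
    votes = defaultdict(float)
    for dist, target in neighbors:
        # eq. (1): one vote per neighbor; eqs. (2)-(3): weight each
        # vote by the reciprocal of the squared distance
        votes[target] += 1.0 / dist ** 2 if weighted else 1.0
    return max(votes, key=votes.get)

# two distant 'No' neighbors are outvoted by one very close 'Yes'
print(knn_classify([(0.2, "Yes"), (0.9, "No"), (1.0, "No")], weighted=True))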
Data Normalization
Table: A dataset listing the salary and age information for customers and whether or not they purchased a pension plan.
ID Salary Age Purchased
1 53700 41 No
2 65300 37 No
3 48900 45 Yes
4 64800 49 Yes
5 44200 30 No
6 55900 57 Yes
7 48600 26 No
8 72800 60 Yes
9 45300 34 No
10 73200 52 Yes
Figure: The salary and age feature space with the data in Table 1 plotted. The instances are labelled with their IDs; triangles represent the negative instances and crosses represent the positive instances. The location of the query ⟨SALARY = 56,000, AGE = 35⟩ is indicated by the ?.
Range normalization rescales the values of a feature into the interval [low, high]:
\[
a_i' = \frac{a_i - \min(a)}{\max(a) - \min(a)} \times (\mathit{high} - \mathit{low}) + \mathit{low} \tag{4}
\]
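Eq. (4) as a short Python sketch (the function name and the default [0, 1] range are my own choices); applied to the AGE values of the whiskey dataset shown below, it reproduces the normalized AGE column:

def range_normalize(values, low=0.0, high=1.0):
    # eq. (4): rescale each value into the interval [low, high]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (high - low) + low for v in values]

# whiskey AGE values for IDs 1-20; output starts 0.0, 0.4, 0.3333, 0.7, ...
ages = [0, 12, 10, 21, 12, 15, 16, 18, 18, 16, 19, 6, 8, 22, 6, 8, 10, 30, 1, 4]
print([round(a, 4) for a in range_normalize(ages)])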
For continuous targets, the k nearest neighbor model predicts the average of the targets of the query's k nearest neighbors:
\[
\mathbb{M}_k(\mathbf{q}) = \frac{1}{k} \sum_{i=1}^{k} t_i \tag{5}
\]
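Eq. (5) in the same style, again assuming the k nearest (distance, target) pairs are already available:

def knn_regress(neighbors):
    # eq. (5): predict the average of the k neighbors' targets
    return sum(target for _, target in neighbors) / len(neighbors)

print(knn_regress([(0.2, 55.00), (0.3, 45.00), (0.4, 50.00)]))  # 50.0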
Table: A dataset of whiskeys listing the age (in years), the rating (between 1 and 5, with 5 being the best), and the bottle price of each whiskey.
ID Age Rating Price ID Age Rating Price
1 0 2 30.00 11 19 5 500.00
2 12 3.5 40.00 12 6 4.5 200.00
3 10 4 55.00 13 8 3.5 65.00
4 21 4.5 550.00 14 22 4 120.00
5 12 3 35.00 15 6 2 12.00
6 15 3.5 45.00 16 8 4.5 250.00
7 16 4 70.00 17 10 2 18.00
8 18 3 85.00 18 30 4.5 450.00
9 18 3.5 78.00 19 1 1 10.00
10 16 3 75.00 20 4 3 30.00
Table: The whiskey dataset after the descriptive features have been
normalized.
ID Age Rating Price ID Age Rating Price
1 0.0000 0.25 30.00 11 0.6333 1.00 500.00
2 0.4000 0.63 40.00 12 0.2000 0.88 200.00
3 0.3333 0.75 55.00 13 0.2667 0.63 65.00
4 0.7000 0.88 550.00 14 0.7333 0.75 120.00
5 0.4000 0.50 35.00 15 0.2000 0.25 12.00
6 0.5000 0.63 45.00 16 0.2667 0.88 250.00
7 0.5333 0.75 70.00 17 0.3333 0.25 18.00
8 0.6000 0.50 85.00 18 1.0000 0.88 450.00
9 0.6000 0.63 78.00 19 0.0333 0.00 10.00
10 0.5333 0.50 75.00 20 0.1333 0.50 30.00
Figure: The AGE and RATING feature space for the whiskey dataset. The location of the query instance is indicated by the ? symbol. The circle plotted with a dashed line demarcates the border of the neighborhood around the query when k = 3. The three nearest neighbors to the query are labelled with their ID values.
For a continuous target, the distance-weighted k nearest neighbor model predicts the weighted average of the neighbors' targets:
\[
\mathbb{M}_k(\mathbf{q}) = \frac{\sum_{i=1}^{k} \frac{1}{dist(\mathbf{q}, \mathbf{d}_i)^2} \times t_i}{\sum_{i=1}^{k} \frac{1}{dist(\mathbf{q}, \mathbf{d}_i)^2}} \tag{6}
\]
Table: The calculations for the weighted k nearest neighbor
prediction
ID Price Distance Weight Price×Weight
1 30.00 0.7530 1.7638 52.92
2 40.00 0.5017 3.9724 158.90
3 55.00 0.3655 7.4844 411.64
4 550.00 0.6456 2.3996 1319.78
5 35.00 0.6009 2.7692 96.92
6 45.00 0.5731 3.0450 137.03
7 70.00 0.5294 3.5679 249.75
8 85.00 0.7311 1.8711 159.04
9 78.00 0.6520 2.3526 183.50
10 75.00 0.6839 2.1378 160.33
11 500.00 0.5667 3.1142 1557.09
12 200.00 0.1828 29.9376 5987.53
13 65.00 0.4250 5.5363 359.86
14 120.00 0.7120 1.9726 236.71
15 12.00 0.7618 1.7233 20.68
16 250.00 0.2358 17.9775 4494.38
17 18.00 0.7960 1.5783 28.41
18 450.00 0.9417 1.1277 507.48
19 10.00 1.0006 0.9989 9.99
20 30.00 0.5044 3.9301 117.90
Totals: 99.2604 16,249.85
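The prediction is the ratio of these two totals: 16,249.85 / 99.2604 ≈ 163.71. A Python sketch of eq. (6) that reproduces the calculation from the table's (distance, price) pairs:

def weighted_knn_regress(neighbors):
    # eq. (6): distance-weighted average of the neighbors' targets
    num = sum(t / d ** 2 for d, t in neighbors)
    den = sum(1.0 / d ** 2 for d, t in neighbors)
    return num / den

pairs = [(0.7530, 30.00), (0.5017, 40.00), (0.3655, 55.00), (0.6456, 550.00),
         (0.6009, 35.00), (0.5731, 45.00), (0.5294, 70.00), (0.7311, 85.00),
         (0.6520, 78.00), (0.6839, 75.00), (0.5667, 500.00), (0.1828, 200.00),
         (0.4250, 65.00), (0.7120, 120.00), (0.7618, 12.00), (0.2358, 250.00),
         (0.7960, 18.00), (0.9417, 450.00), (1.0006, 10.00), (0.5044, 30.00)]
print(round(weighted_knn_regress(pairs), 2))  # ~163.71, within rounding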
            q: Pres.   q: Abs.               q: Pres.   q: Abs.
d1: Pres.   CP = 2     PA = 0     d2: Pres.  CP = 1     PA = 1
d1: Abs.    AP = 2     CA = 1     d2: Abs.   AP = 0     CA = 3
Table: The similarity between the current trial user, q, and the two users in the dataset, d1 and d2, in terms of co-presence (CP), co-absence (CA), presence-absence (PA), and absence-presence (AP).
Russell-Rao
The ratio between the number of co-presences and the total number of binary features considered:
\[
sim_{RR}(\mathbf{q}, \mathbf{d}) = \frac{CP(\mathbf{q}, \mathbf{d})}{|\mathbf{q}|} \tag{7}
\]
Russell-Rao
\[
sim_{RR}(\mathbf{q}, \mathbf{d}) = \frac{CP(\mathbf{q}, \mathbf{d})}{|\mathbf{q}|} \tag{8}
\]
            q: Pres.   q: Abs.               q: Pres.   q: Abs.
d1: Pres.   CP = 2     PA = 0     d2: Pres.  CP = 1     PA = 1
d1: Abs.    AP = 2     CA = 1     d2: Abs.   AP = 0     CA = 3
Russell-Rao
\[
sim_{RR}(\mathbf{q}, \mathbf{d}) = \frac{CP(\mathbf{q}, \mathbf{d})}{|\mathbf{q}|} \tag{9}
\]
Example
\[
sim_{RR}(\mathbf{q}, \mathbf{d}_1) = \frac{2}{5} = 0.4 \qquad sim_{RR}(\mathbf{q}, \mathbf{d}_2) = \frac{1}{5} = 0.2
\]
Sokal-Michener
Sokal-Michener is defined as the ratio between the total number of co-presences and co-absences, and the total number of binary features considered:
\[
sim_{SM}(\mathbf{q}, \mathbf{d}) = \frac{CP(\mathbf{q}, \mathbf{d}) + CA(\mathbf{q}, \mathbf{d})}{|\mathbf{q}|} \tag{10}
\]
Sokal-Michener
\[
sim_{SM}(\mathbf{q}, \mathbf{d}) = \frac{CP(\mathbf{q}, \mathbf{d}) + CA(\mathbf{q}, \mathbf{d})}{|\mathbf{q}|} \tag{11}
\]
            q: Pres.   q: Abs.               q: Pres.   q: Abs.
d1: Pres.   CP = 2     PA = 0     d2: Pres.  CP = 1     PA = 1
d1: Abs.    AP = 2     CA = 1     d2: Abs.   AP = 0     CA = 3
Sokal-Michener
\[
sim_{SM}(\mathbf{q}, \mathbf{d}) = \frac{CP(\mathbf{q}, \mathbf{d}) + CA(\mathbf{q}, \mathbf{d})}{|\mathbf{q}|} \tag{12}
\]
Example
\[
sim_{SM}(\mathbf{q}, \mathbf{d}_1) = \frac{3}{5} = 0.6 \qquad sim_{SM}(\mathbf{q}, \mathbf{d}_2) = \frac{4}{5} = 0.8
\]
Jaccard
The Jaccard index ignores co-absences:
\[
sim_{J}(\mathbf{q}, \mathbf{d}) = \frac{CP(\mathbf{q}, \mathbf{d})}{CP(\mathbf{q}, \mathbf{d}) + PA(\mathbf{q}, \mathbf{d}) + AP(\mathbf{q}, \mathbf{d})} \tag{13}
\]
Jaccard
\[
sim_{J}(\mathbf{q}, \mathbf{d}) = \frac{CP(\mathbf{q}, \mathbf{d})}{CP(\mathbf{q}, \mathbf{d}) + PA(\mathbf{q}, \mathbf{d}) + AP(\mathbf{q}, \mathbf{d})} \tag{14}
\]
            q: Pres.   q: Abs.               q: Pres.   q: Abs.
d1: Pres.   CP = 2     PA = 0     d2: Pres.  CP = 1     PA = 1
d1: Abs.    AP = 2     CA = 1     d2: Abs.   AP = 0     CA = 3
Jaccard
\[
sim_{J}(\mathbf{q}, \mathbf{d}) = \frac{CP(\mathbf{q}, \mathbf{d})}{CP(\mathbf{q}, \mathbf{d}) + PA(\mathbf{q}, \mathbf{d}) + AP(\mathbf{q}, \mathbf{d})} \tag{15}
\]
Example
\[
sim_{J}(\mathbf{q}, \mathbf{d}_1) = \frac{2}{4} = 0.5 \qquad sim_{J}(\mathbf{q}, \mathbf{d}_2) = \frac{1}{2} = 0.5
\]
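All three indexes in one Python sketch. The 0/1 vectors for q, d1, and d2 below are one encoding consistent with the contingency tables above (the original feature values are not given), and the counting helper is my own:

def binary_counts(q, d):
    # count co-presences, presence-absences, absence-presences, co-absences
    cp = sum(a == 1 and b == 1 for a, b in zip(q, d))
    pa = sum(a == 1 and b == 0 for a, b in zip(q, d))
    ap = sum(a == 0 and b == 1 for a, b in zip(q, d))
    ca = sum(a == 0 and b == 0 for a, b in zip(q, d))
    return cp, pa, ap, ca

def sim_rr(q, d):                     # Russell-Rao, eq. (7)
    cp, _, _, _ = binary_counts(q, d)
    return cp / len(q)

def sim_sm(q, d):                     # Sokal-Michener, eq. (10)
    cp, _, _, ca = binary_counts(q, d)
    return (cp + ca) / len(q)

def sim_j(q, d):                      # Jaccard, eq. (13)
    cp, pa, ap, _ = binary_counts(q, d)
    return cp / (cp + pa + ap)

q, d1, d2 = [1, 1, 0, 0, 0], [1, 1, 1, 1, 0], [1, 0, 0, 0, 0]
print(sim_rr(q, d1), sim_rr(q, d2))   # 0.4 0.2
print(sim_sm(q, d1), sim_sm(q, d2))   # 0.6 0.8
print(sim_j(q, d1), sim_j(q, d2))     # 0.5 0.5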
Cosine
\[
sim_{COSINE}(\mathbf{a}, \mathbf{b}) = \frac{(\mathbf{a}[1] \times \mathbf{b}[1]) + \cdots + (\mathbf{a}[m] \times \mathbf{b}[m])}{\sqrt{\sum_{i=1}^{m} \mathbf{a}[i]^2} \times \sqrt{\sum_{i=1}^{m} \mathbf{b}[i]^2}}
\]
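Cosine similarity as a short sketch; the two instances are taken from the figure that follows:

import math

def sim_cosine(a, b):
    # dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

d1 = [97, 21]    # <SMS = 97, VOICE = 21>
d2 = [181, 184]  # <SMS = 181, VOICE = 184>
print(round(sim_cosine(d1, d2), 4))  # ~0.8362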
Figure: (a) θ represents the inner angle between the vector emanating from the origin to instance d1 ⟨SMS = 97, VOICE = 21⟩ and the vector emanating from the origin to instance d2 ⟨SMS = 181, VOICE = 184⟩; (b) shows d1 and d2 normalized to the unit circle.
Figure: Scatter plots of three bivariate datasets with the same center point A and two queries B and C both equidistant from A. (a) A dataset uniformly spread around the center point. (b) A dataset with negative covariance. (c) A dataset with positive covariance.
\[
\mathit{Mahalanobis}(\mathbf{a}, \mathbf{b}) = \left[\, \mathbf{a}[1] - \mathbf{b}[1], \ldots, \mathbf{a}[m] - \mathbf{b}[m] \,\right] \times \Sigma^{-1} \times \begin{bmatrix} \mathbf{a}[1] - \mathbf{b}[1] \\ \vdots \\ \mathbf{a}[m] - \mathbf{b}[m] \end{bmatrix} \tag{16}
\]
where \(\Sigma^{-1}\) is the inverse of the covariance matrix of the dataset.
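A numpy sketch of eq. (16); the cov argument must be the covariance matrix computed from the dataset, and the sanity-check points below are hypothetical. With an identity covariance matrix the quadratic form reduces to the squared Euclidean distance:

import numpy as np

def mahalanobis(a, b, cov):
    # eq. (16): (a - b) Sigma^{-1} (a - b)^T
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(diff @ np.linalg.inv(cov) @ diff)

# identity covariance: 10^2 + 20^2 = 500.0
print(mahalanobis([50, 60], [60, 40], np.eye(2)))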
Figure: Mahalanobis distance contours (at distances 1, 3, and 5) plotted over the three datasets from the previous figure; for the dataset with positive covariance, query B lies at Mahalanobis distance 9.35 from the center point A, while query C lies at distance 2.15.
Feature Selection
Figure: Figures (d) and (e) illustrate the cost we must incur if we wish to maintain the density of the instances in the feature space as the dimensionality of the feature space increases.
Figure: Forward selection starts from the empty feature subset and adds one feature (X, Y, or Z) at a time; backward selection starts from the full subset {X, Y, Z} and removes one feature at a time. A greedy sketch of forward selection follows.
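A greedy forward selection sketch corresponding to the left-hand side of the diagram. The evaluate argument (e.g. the cross-validated performance of a kNN model restricted to a feature subset) is a hypothetical placeholder, as is the toy score table:

def forward_selection(features, evaluate):
    # grow the subset greedily, adding the single feature that most
    # improves the score, until no addition improves it
    selected, best_score = set(), float("-inf")
    improved = True
    while improved:
        improved = False
        for f in features - selected:
            score = evaluate(selected | {f})
            if score > best_score:
                best_score, best_feature, improved = score, f, True
        if improved:
            selected.add(best_feature)
    return selected

# toy scores for subsets of {X, Y, Z}; forward selection picks {X, Y}
scores = {frozenset({"X"}): 0.6, frozenset({"Y"}): 0.5, frozenset({"Z"}): 0.4,
          frozenset({"X", "Y"}): 0.8, frozenset({"X", "Z"}): 0.7,
          frozenset({"X", "Y", "Z"}): 0.75}
print(forward_selection({"X", "Y", "Z"}, lambda s: scores.get(frozenset(s), 0.0)))

Backward selection is the mirror image: start from the full feature set and greedily remove one feature at a time.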
Efficiency
Example
Let's build a k-d tree for the college athlete dataset.
Table: The speed and agility ratings for 21 college athletes labelled with whether or not they were drafted.
ID Speed Agility Draft ID Speed Agility Draft
1 2.50 6.00 No 12 5.00 2.50 No
2 3.75 8.00 No 13 8.25 8.50 No
3 2.25 5.50 No 14 5.75 8.75 Yes
4 3.25 8.25 No 15 4.75 6.25 Yes
5 2.75 7.50 No 16 5.50 6.75 Yes
6 4.50 5.00 No 17 5.25 9.50 Yes
7 3.50 5.25 No 18 7.00 4.25 Yes
8 3.00 3.25 No 19 7.50 8.00 Yes
9 4.00 4.00 No 20 7.25 5.75 Yes
10 4.25 3.75 No 21 6.75 3.00 Yes
11 2.00 2.00 No
Example
First split on the SPEED feature: the root node stores instance ID=6 and splits on Speed = 4.5.
  Speed < 4.5: IDs 1, 2, 3, 4, 5, 7, 8, 9, 10, 11
  Speed >= 4.5: IDs 12, 13, 14, 15, 16, 17, 18, 19, 20, 21
Example
Next split on the AGILITY feature: the left child of the root stores instance ID=3 and splits on Agility = 5.5.
  Agility < 5.5: IDs 7, 8, 9, 10, 11
  Agility >= 5.5: IDs 1, 2, 4, 5
Figure: (a) the k-d tree after the dataset at the left child of the root has been split using the AGILITY feature with a threshold of 5.5; (b) the corresponding partition of the feature space.
Example
After completing the tree-building process, the tree is:
  Root: ID=6 (Speed: 4.5)
    Speed < 4.5: ID=3 (Agility: 5.5)
      ID=7 (Speed: 3.5), with children ID=8 (Agility: 3.25) and ID=9 (Agility: 4.0)
      ID=4 (Speed: 3.25), with children ID=5 (Agility: 7.5) and ID=2
    Speed >= 4.5: ID=16 (Agility: 6.75)
      ID=21 (Speed: 6.75), with children ID=15 (Agility: 6.25) and ID=20 (Agility: 5.75)
      ID=19 (Speed: 7.5), with children ID=17 (Agility: 9.5) and ID=13
Figure: (a) The complete k-d tree generated for the dataset in Table 7. (b) The partitioning of the feature space by the k-d tree in (a).
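A sketch of the construction procedure: sort the instances on the current splitting feature, store the median instance at the node, and recurse on the two halves while cycling through the features. This is one standard formulation and may differ in detail from the book's algorithm:

class Node:
    def __init__(self, instance, feature, left, right):
        self.instance = instance   # the training instance stored at this node
        self.feature = feature     # index of the feature the node splits on
        self.left = left           # subtree with feature value < the split
        self.right = right         # subtree with feature value >= the split

def build_kdtree(instances, depth=0):
    # instances: list of feature tuples, e.g. (speed, agility)
    if not instances:
        return None
    f = depth % len(instances[0])          # alternate the splitting feature
    instances = sorted(instances, key=lambda x: x[f])
    mid = len(instances) // 2              # median instance becomes the node
    return Node(instances[mid], f,
                build_kdtree(instances[:mid], depth + 1),
                build_kdtree(instances[mid + 1:], depth + 1))

# on a small subset of the athlete data, the root stores (4.5, 5.0),
# i.e. instance d6, just as in the figure
tree = build_kdtree([(2.50, 6.00), (3.75, 8.00), (4.50, 5.00), (5.00, 2.50)])
print(tree.instance)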
Example
Finding neighbors for the query ⟨SPEED = 6.0, AGILITY = 3.5⟩.
Figure: (a) the path taken from the root node to a leaf node when we descend the tree with the query, ending at the node that stores instance ID=12; (b) the query (?) and the instances 12, 15, 18, and 21 in the feature space.
Example
Finding neighbors for the query ⟨SPEED = 6.0, AGILITY = 3.5⟩: ascending from the leaf, at each node the algorithm asks whether to prune the subtree on the other side of the split and ascend, or to search that subtree, depending on whether the target hypersphere around the query crosses the node's splitting boundary.
Figure: (a) the state of the retrieval process after instance d21 has been stored as current-best. (b) the dashed circle illustrates the extent of the target hypersphere after current-best-distance has been updated.
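A sketch of the descend-then-backtrack search just illustrated: keep a current best; on the way back up, the subtree on the far side of a split is searched only if the splitting boundary is closer to the query than the current-best-distance (that is, the target hypersphere crosses the boundary). It reuses the Node class from the construction sketch, and math.dist requires Python 3.8+:

import math

def nn_search(node, query, best=None, best_dist=math.inf):
    # returns (nearest_instance, its_distance) for the query
    if node is None:
        return best, best_dist
    d = math.dist(query, node.instance)
    if d < best_dist:                      # update current-best
        best, best_dist = node.instance, d
    f = node.feature
    near, far = ((node.left, node.right) if query[f] < node.instance[f]
                 else (node.right, node.left))
    best, best_dist = nn_search(near, query, best, best_dist)
    # search the far subtree only if the target hypersphere crosses
    # the splitting boundary; otherwise prune it and ascend
    if abs(query[f] - node.instance[f]) < best_dist:
        best, best_dist = nn_search(far, query, best, best_dist)
    return best, best_dist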
Example
The retrieval process terminates with the instance ID=21 returned as the nearest neighbor to the query.
Summary