BookSlides 5B Similarity-based Learning
Similarity-based Learning
Sections 5.4, 5.5
2 Data Normalization
5 Feature Selection
7 Summary
Figure: Is the instance at the top right of the diagram really noise?
The k nearest neighbor model returns the majority target level among the query's k nearest neighbors, where δ(t_i, l) is 1 if t_i = l and 0 otherwise:
\[
\mathbb{M}_k(\mathbf{q}) = \arg\max_{l \in \mathit{levels}(t)} \sum_{i=1}^{k} \delta(t_i, l) \tag{1}
\]
Each neighbor's vote can instead be weighted by the reciprocal of its squared distance to the query,
\[
\frac{1}{dist(\mathbf{q}, \mathbf{d})^2} \tag{2}
\]
which gives the distance-weighted majority vote:
\[
\mathbb{M}_k(\mathbf{q}) = \arg\max_{l \in \mathit{levels}(t)} \sum_{i=1}^{k} \frac{1}{dist(\mathbf{q}, \mathbf{d}_i)^2} \times \delta(t_i, l) \tag{3}
\]
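A minimal Python sketch of both vote rules; it assumes the k nearest neighbors have already been found and are supplied as (distance, target) pairs (the function name and this input format are my own, not from the book):

from collections import defaultdict

def knn_classify(neighbors, weighted=False):
    # neighbors: list of (distance, target_level) pairs for the k
    # training instances closest to the query (distances assumed > 0)
    votes = defaultdict(float)
    for dist, target in neighbors:
        # eq. (1): one vote per neighbor; eqs. (2)-(3): weight each
        # vote by the reciprocal of the squared distance
        votes[target] += 1.0 / dist ** 2 if weighted else 1.0
    return max(votes, key=votes.get)

# two distant 'No' neighbors are outvoted by one very close 'Yes'
print(knn_classify([(0.2, "Yes"), (0.9, "No"), (1.0, "No")], weighted=True))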
Data Normalization
Table: A dataset listing the salary and age information for customers and whether or not they purchased a pension plan.
ID Salary Age Purchased
1 53700 41 No
2 65300 37 No
3 48900 45 Yes
4 64800 49 Yes
5 44200 30 No
6 55900 57 Yes
7 48600 26 No
8 72800 60 Yes
9 45300 34 No
10 73200 52 Yes
Figure: The salary and age feature space with the data in Table 1 plotted. The instances are labelled with their IDs; triangles represent the negative instances and crosses represent the positive instances. The location of the query ⟨SALARY = 56,000, AGE = 35⟩ is indicated by the ?.
Range normalization rescales the values of a feature into the interval [low, high]:
\[
a_i' = \frac{a_i - \min(a)}{\max(a) - \min(a)} \times (\mathit{high} - \mathit{low}) + \mathit{low} \tag{4}
\]
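Eq. (4) as a short Python sketch (the function name and the default [0, 1] range are my own choices); applied to the AGE values of the whiskey dataset shown below, it reproduces the normalized AGE column:

def range_normalize(values, low=0.0, high=1.0):
    # eq. (4): rescale each value into the interval [low, high]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (high - low) + low for v in values]

# whiskey AGE values for IDs 1-20; output starts 0.0, 0.4, 0.3333, 0.7, ...
ages = [0, 12, 10, 21, 12, 15, 16, 18, 18, 16, 19, 6, 8, 22, 6, 8, 10, 30, 1, 4]
print([round(a, 4) for a in range_normalize(ages)])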
For continuous targets, the k nearest neighbor model predicts the average of the targets of the query's k nearest neighbors:
\[
\mathbb{M}_k(\mathbf{q}) = \frac{1}{k} \sum_{i=1}^{k} t_i \tag{5}
\]
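Eq. (5) in the same style, again assuming the k nearest (distance, target) pairs are already available:

def knn_regress(neighbors):
    # eq. (5): predict the average of the k neighbors' targets
    return sum(target for _, target in neighbors) / len(neighbors)

print(knn_regress([(0.2, 55.00), (0.3, 45.00), (0.4, 50.00)]))  # 50.0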
Table: A dataset of whiskeys listing the age (in years), the rating (between 1 and 5, with 5 being the best), and the bottle price of each whiskey.
ID Age Rating Price ID Age Rating Price
1 0 2 30.00 11 19 5 500.00
2 12 3.5 40.00 12 6 4.5 200.00
3 10 4 55.00 13 8 3.5 65.00
4 21 4.5 550.00 14 22 4 120.00
5 12 3 35.00 15 6 2 12.00
6 15 3.5 45.00 16 8 4.5 250.00
7 16 4 70.00 17 10 2 18.00
8 18 3 85.00 18 30 4.5 450.00
9 18 3.5 78.00 19 1 1 10.00
10 16 3 75.00 20 4 3 30.00
Table: The whiskey dataset after the descriptive features have been
normalized.
ID Age Rating Price ID Age Rating Price
1 0.0000 0.25 30.00 11 0.6333 1.00 500.00
2 0.4000 0.63 40.00 12 0.2000 0.88 200.00
3 0.3333 0.75 55.00 13 0.2667 0.63 65.00
4 0.7000 0.88 550.00 14 0.7333 0.75 120.00
5 0.4000 0.50 35.00 15 0.2000 0.25 12.00
6 0.5000 0.63 45.00 16 0.2667 0.88 250.00
7 0.5333 0.75 70.00 17 0.3333 0.25 18.00
8 0.6000 0.50 85.00 18 1.0000 0.88 450.00
9 0.6000 0.63 78.00 19 0.0333 0.00 10.00
10 0.5333 0.50 75.00 20 0.1333 0.50 30.00
Figure: The AGE and RATING feature space for the whiskey dataset. The location of the query instance is indicated by the ? symbol. The circle plotted with a dashed line demarcates the border of the neighborhood around the query when k = 3. The three nearest neighbors to the query are labelled with their ID values.
For a continuous target, the distance-weighted k nearest neighbor model predicts the weighted average of the neighbors' targets:
\[
\mathbb{M}_k(\mathbf{q}) = \frac{\sum_{i=1}^{k} \frac{1}{dist(\mathbf{q}, \mathbf{d}_i)^2} \times t_i}{\sum_{i=1}^{k} \frac{1}{dist(\mathbf{q}, \mathbf{d}_i)^2}} \tag{6}
\]
Table: The calculations for the weighted k nearest neighbor
prediction
ID Price Distance Weight Price×Weight
1 30.00 0.7530 1.7638 52.92
2 40.00 0.5017 3.9724 158.90
3 55.00 0.3655 7.4844 411.64
4 550.00 0.6456 2.3996 1319.78
5 35.00 0.6009 2.7692 96.92
6 45.00 0.5731 3.0450 137.03
7 70.00 0.5294 3.5679 249.75
8 85.00 0.7311 1.8711 159.04
9 78.00 0.6520 2.3526 183.50
10 75.00 0.6839 2.1378 160.33
11 500.00 0.5667 3.1142 1557.09
12 200.00 0.1828 29.9376 5987.53
13 65.00 0.4250 5.5363 359.86
14 120.00 0.7120 1.9726 236.71
15 12.00 0.7618 1.7233 20.68
16 250.00 0.2358 17.9775 4494.38
17 18.00 0.7960 1.5783 28.41
18 450.00 0.9417 1.1277 507.48
19 10.00 1.0006 0.9989 9.99
20 30.00 0.5044 3.9301 117.90
Totals: 99.2604 16,249.85
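The prediction is the ratio of these two totals: 16,249.85 / 99.2604 ≈ 163.71. A Python sketch of eq. (6) that reproduces the calculation from the table's (distance, price) pairs:

def weighted_knn_regress(neighbors):
    # eq. (6): distance-weighted average of the neighbors' targets
    num = sum(t / d ** 2 for d, t in neighbors)
    den = sum(1.0 / d ** 2 for d, t in neighbors)
    return num / den

pairs = [(0.7530, 30.00), (0.5017, 40.00), (0.3655, 55.00), (0.6456, 550.00),
         (0.6009, 35.00), (0.5731, 45.00), (0.5294, 70.00), (0.7311, 85.00),
         (0.6520, 78.00), (0.6839, 75.00), (0.5667, 500.00), (0.1828, 200.00),
         (0.4250, 65.00), (0.7120, 120.00), (0.7618, 12.00), (0.2358, 250.00),
         (0.7960, 18.00), (0.9417, 450.00), (1.0006, 10.00), (0.5044, 30.00)]
print(round(weighted_knn_regress(pairs), 2))  # ~163.71, within rounding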
            q: Pres.   q: Abs.               q: Pres.   q: Abs.
d1: Pres.   CP = 2     PA = 0     d2: Pres.  CP = 1     PA = 1
d1: Abs.    AP = 2     CA = 1     d2: Abs.   AP = 0     CA = 3
Table: The similarity between the current trial user, q, and the two users in the dataset, d1 and d2, in terms of co-presence (CP), co-absence (CA), presence-absence (PA), and absence-presence (AP).
Russell-Rao
The ratio between the number of co-presences and the total number of binary features considered:
\[
sim_{RR}(\mathbf{q}, \mathbf{d}) = \frac{CP(\mathbf{q}, \mathbf{d})}{|\mathbf{q}|} \tag{7}
\]
Russell-Rao
\[
sim_{RR}(\mathbf{q}, \mathbf{d}) = \frac{CP(\mathbf{q}, \mathbf{d})}{|\mathbf{q}|} \tag{8}
\]
            q: Pres.   q: Abs.               q: Pres.   q: Abs.
d1: Pres.   CP = 2     PA = 0     d2: Pres.  CP = 1     PA = 1
d1: Abs.    AP = 2     CA = 1     d2: Abs.   AP = 0     CA = 3
Russell-Rao
\[
sim_{RR}(\mathbf{q}, \mathbf{d}) = \frac{CP(\mathbf{q}, \mathbf{d})}{|\mathbf{q}|} \tag{9}
\]
Example
\[
sim_{RR}(\mathbf{q}, \mathbf{d}_1) = \frac{2}{5} = 0.4 \qquad sim_{RR}(\mathbf{q}, \mathbf{d}_2) = \frac{1}{5} = 0.2
\]
Sokal-Michener
Sokal-Michener is defined as the ratio between the total number of co-presences and co-absences, and the total number of binary features considered:
\[
sim_{SM}(\mathbf{q}, \mathbf{d}) = \frac{CP(\mathbf{q}, \mathbf{d}) + CA(\mathbf{q}, \mathbf{d})}{|\mathbf{q}|} \tag{10}
\]
Sokal-Michener
\[
sim_{SM}(\mathbf{q}, \mathbf{d}) = \frac{CP(\mathbf{q}, \mathbf{d}) + CA(\mathbf{q}, \mathbf{d})}{|\mathbf{q}|} \tag{11}
\]
            q: Pres.   q: Abs.               q: Pres.   q: Abs.
d1: Pres.   CP = 2     PA = 0     d2: Pres.  CP = 1     PA = 1
d1: Abs.    AP = 2     CA = 1     d2: Abs.   AP = 0     CA = 3
Sokal-Michener
\[
sim_{SM}(\mathbf{q}, \mathbf{d}) = \frac{CP(\mathbf{q}, \mathbf{d}) + CA(\mathbf{q}, \mathbf{d})}{|\mathbf{q}|} \tag{12}
\]
Example
\[
sim_{SM}(\mathbf{q}, \mathbf{d}_1) = \frac{3}{5} = 0.6 \qquad sim_{SM}(\mathbf{q}, \mathbf{d}_2) = \frac{4}{5} = 0.8
\]
Jaccard
The Jaccard index ignores co-absences:
\[
sim_{J}(\mathbf{q}, \mathbf{d}) = \frac{CP(\mathbf{q}, \mathbf{d})}{CP(\mathbf{q}, \mathbf{d}) + PA(\mathbf{q}, \mathbf{d}) + AP(\mathbf{q}, \mathbf{d})} \tag{13}
\]
Jaccard
\[
sim_{J}(\mathbf{q}, \mathbf{d}) = \frac{CP(\mathbf{q}, \mathbf{d})}{CP(\mathbf{q}, \mathbf{d}) + PA(\mathbf{q}, \mathbf{d}) + AP(\mathbf{q}, \mathbf{d})} \tag{14}
\]
            q: Pres.   q: Abs.               q: Pres.   q: Abs.
d1: Pres.   CP = 2     PA = 0     d2: Pres.  CP = 1     PA = 1
d1: Abs.    AP = 2     CA = 1     d2: Abs.   AP = 0     CA = 3
Jaccard
\[
sim_{J}(\mathbf{q}, \mathbf{d}) = \frac{CP(\mathbf{q}, \mathbf{d})}{CP(\mathbf{q}, \mathbf{d}) + PA(\mathbf{q}, \mathbf{d}) + AP(\mathbf{q}, \mathbf{d})} \tag{15}
\]
Example
\[
sim_{J}(\mathbf{q}, \mathbf{d}_1) = \frac{2}{4} = 0.5 \qquad sim_{J}(\mathbf{q}, \mathbf{d}_2) = \frac{1}{2} = 0.5
\]
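All three indexes in one Python sketch. The 0/1 vectors for q, d1, and d2 below are one encoding consistent with the contingency tables above (the original feature values are not given), and the counting helper is my own:

def binary_counts(q, d):
    # count co-presences, presence-absences, absence-presences, co-absences
    cp = sum(a == 1 and b == 1 for a, b in zip(q, d))
    pa = sum(a == 1 and b == 0 for a, b in zip(q, d))
    ap = sum(a == 0 and b == 1 for a, b in zip(q, d))
    ca = sum(a == 0 and b == 0 for a, b in zip(q, d))
    return cp, pa, ap, ca

def sim_rr(q, d):                     # Russell-Rao, eq. (7)
    cp, _, _, _ = binary_counts(q, d)
    return cp / len(q)

def sim_sm(q, d):                     # Sokal-Michener, eq. (10)
    cp, _, _, ca = binary_counts(q, d)
    return (cp + ca) / len(q)

def sim_j(q, d):                      # Jaccard, eq. (13)
    cp, pa, ap, _ = binary_counts(q, d)
    return cp / (cp + pa + ap)

q, d1, d2 = [1, 1, 0, 0, 0], [1, 1, 1, 1, 0], [1, 0, 0, 0, 0]
print(sim_rr(q, d1), sim_rr(q, d2))   # 0.4 0.2
print(sim_sm(q, d1), sim_sm(q, d2))   # 0.6 0.8
print(sim_j(q, d1), sim_j(q, d2))     # 0.5 0.5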
Cosine
\[
sim_{COSINE}(\mathbf{a}, \mathbf{b}) = \frac{(\mathbf{a}[1] \times \mathbf{b}[1]) + \cdots + (\mathbf{a}[m] \times \mathbf{b}[m])}{\sqrt{\sum_{i=1}^{m} \mathbf{a}[i]^2} \times \sqrt{\sum_{i=1}^{m} \mathbf{b}[i]^2}}
\]
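Cosine similarity as a short sketch; the two instances are taken from the figure that follows:

import math

def sim_cosine(a, b):
    # dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

d1 = [97, 21]    # <SMS = 97, VOICE = 21>
d2 = [181, 184]  # <SMS = 181, VOICE = 184>
print(round(sim_cosine(d1, d2), 4))  # ~0.8362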
Figure: (a) θ represents the inner angle between the vector emanating from the origin to instance d1 ⟨SMS = 97, VOICE = 21⟩ and the vector emanating from the origin to instance d2 ⟨SMS = 181, VOICE = 184⟩; (b) shows d1 and d2 normalized to the unit circle.
Figure: Scatter plots of three bivariate datasets with the same center point A and two queries B and C both equidistant from A. (a) A dataset uniformly spread around the center point. (b) A dataset with negative covariance. (c) A dataset with positive covariance.
\[
\mathit{Mahalanobis}(\mathbf{a}, \mathbf{b}) = \left[\, \mathbf{a}[1] - \mathbf{b}[1], \ldots, \mathbf{a}[m] - \mathbf{b}[m] \,\right] \times \Sigma^{-1} \times \begin{bmatrix} \mathbf{a}[1] - \mathbf{b}[1] \\ \vdots \\ \mathbf{a}[m] - \mathbf{b}[m] \end{bmatrix} \tag{16}
\]
where \(\Sigma^{-1}\) is the inverse of the covariance matrix of the dataset.
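A numpy sketch of eq. (16); the cov argument must be the covariance matrix computed from the dataset, and the sanity-check points below are hypothetical. With an identity covariance matrix the quadratic form reduces to the squared Euclidean distance:

import numpy as np

def mahalanobis(a, b, cov):
    # eq. (16): (a - b) Sigma^{-1} (a - b)^T
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(diff @ np.linalg.inv(cov) @ diff)

# identity covariance: 10^2 + 20^2 = 500.0
print(mahalanobis([50, 60], [60, 40], np.eye(2)))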
Figure: Mahalanobis distance contours (at distances 1, 3, and 5) plotted over the three datasets from the previous figure; for the dataset with positive covariance, query B lies at Mahalanobis distance 9.35 from the center point A, while query C lies at distance 2.15.
Feature Selection
Figure: Figures (d) and (e) illustrate the cost we must incur if we wish to maintain the density of the instances in the feature space as the dimensionality of the feature space increases.
Figure: Forward selection starts from the empty feature subset and adds one feature (X, Y, or Z) at a time; backward selection starts from the full subset {X, Y, Z} and removes one feature at a time. A greedy sketch of forward selection follows.
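A greedy forward selection sketch corresponding to the left-hand side of the diagram. The evaluate argument (e.g. the cross-validated performance of a kNN model restricted to a feature subset) is a hypothetical placeholder, as is the toy score table:

def forward_selection(features, evaluate):
    # grow the subset greedily, adding the single feature that most
    # improves the score, until no addition improves it
    selected, best_score = set(), float("-inf")
    improved = True
    while improved:
        improved = False
        for f in features - selected:
            score = evaluate(selected | {f})
            if score > best_score:
                best_score, best_feature, improved = score, f, True
        if improved:
            selected.add(best_feature)
    return selected

# toy scores for subsets of {X, Y, Z}; forward selection picks {X, Y}
scores = {frozenset({"X"}): 0.6, frozenset({"Y"}): 0.5, frozenset({"Z"}): 0.4,
          frozenset({"X", "Y"}): 0.8, frozenset({"X", "Z"}): 0.7,
          frozenset({"X", "Y", "Z"}): 0.75}
print(forward_selection({"X", "Y", "Z"}, lambda s: scores.get(frozenset(s), 0.0)))

Backward selection is the mirror image: start from the full feature set and greedily remove one feature at a time.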
Efficiency
Example
Let's build a k-d tree for the college athlete dataset.
Table: The speed and agility ratings for 21 college athletes labelled with whether or not they were drafted.
ID Speed Agility Draft ID Speed Agility Draft
1 2.50 6.00 No 12 5.00 2.50 No
2 3.75 8.00 No 13 8.25 8.50 No
3 2.25 5.50 No 14 5.75 8.75 Yes
4 3.25 8.25 No 15 4.75 6.25 Yes
5 2.75 7.50 No 16 5.50 6.75 Yes
6 4.50 5.00 No 17 5.25 9.50 Yes
7 3.50 5.25 No 18 7.00 4.25 Yes
8 3.00 3.25 No 19 7.50 8.00 Yes
9 4.00 4.00 No 20 7.25 5.75 Yes
10 4.25 3.75 No 21 6.75 3.00 Yes
11 2.00 2.00 No
Example
First split on the SPEED feature: the root node stores instance ID=6 and splits on Speed = 4.5.
  Speed < 4.5: IDs 1, 2, 3, 4, 5, 7, 8, 9, 10, 11
  Speed >= 4.5: IDs 12, 13, 14, 15, 16, 17, 18, 19, 20, 21
Example
Next split on the AGILITY feature: the left child of the root stores instance ID=3 and splits on Agility = 5.5.
  Agility < 5.5: IDs 7, 8, 9, 10, 11
  Agility >= 5.5: IDs 1, 2, 4, 5
Figure: (a) the k-d tree after the dataset at the left child of the root has been split using the AGILITY feature with a threshold of 5.5; (b) the corresponding partition of the feature space.
Example
After completing the tree-building process, the tree is:
  Root: ID=6 (Speed: 4.5)
    Speed < 4.5: ID=3 (Agility: 5.5)
      ID=7 (Speed: 3.5), with children ID=8 (Agility: 3.25) and ID=9 (Agility: 4.0)
      ID=4 (Speed: 3.25), with children ID=5 (Agility: 7.5) and ID=2
    Speed >= 4.5: ID=16 (Agility: 6.75)
      ID=21 (Speed: 6.75), with children ID=15 (Agility: 6.25) and ID=20 (Agility: 5.75)
      ID=19 (Speed: 7.5), with children ID=17 (Agility: 9.5) and ID=13
Figure: (a) The complete k-d tree generated for the dataset in Table 7. (b) The partitioning of the feature space by the k-d tree in (a).
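A sketch of the construction procedure: sort the instances on the current splitting feature, store the median instance at the node, and recurse on the two halves while cycling through the features. This is one standard formulation and may differ in detail from the book's algorithm:

class Node:
    def __init__(self, instance, feature, left, right):
        self.instance = instance   # the training instance stored at this node
        self.feature = feature     # index of the feature the node splits on
        self.left = left           # subtree with feature value < the split
        self.right = right         # subtree with feature value >= the split

def build_kdtree(instances, depth=0):
    # instances: list of feature tuples, e.g. (speed, agility)
    if not instances:
        return None
    f = depth % len(instances[0])          # alternate the splitting feature
    instances = sorted(instances, key=lambda x: x[f])
    mid = len(instances) // 2              # median instance becomes the node
    return Node(instances[mid], f,
                build_kdtree(instances[:mid], depth + 1),
                build_kdtree(instances[mid + 1:], depth + 1))

# on a small subset of the athlete data, the root stores (4.5, 5.0),
# i.e. instance d6, just as in the figure
tree = build_kdtree([(2.50, 6.00), (3.75, 8.00), (4.50, 5.00), (5.00, 2.50)])
print(tree.instance)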
Example
Finding neighbors for the query ⟨SPEED = 6.0, AGILITY = 3.5⟩.
Figure: (a) the path taken from the root node to a leaf node when we descend the tree with the query, ending at the node that stores instance ID=12; (b) the query (?) and the instances 12, 15, 18, and 21 in the feature space.
Example
Finding neighbors for the query ⟨SPEED = 6.0, AGILITY = 3.5⟩: ascending from the leaf, at each node the algorithm asks whether to prune the subtree on the other side of the split and ascend, or to search that subtree, depending on whether the target hypersphere around the query crosses the node's splitting boundary.
Figure: (a) the state of the retrieval process after instance d21 has been stored as current-best. (b) the dashed circle illustrates the extent of the target hypersphere after current-best-distance has been updated.
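A sketch of the descend-then-backtrack search just illustrated: keep a current best; on the way back up, the subtree on the far side of a split is searched only if the splitting boundary is closer to the query than the current-best-distance (that is, the target hypersphere crosses the boundary). It reuses the Node class from the construction sketch, and math.dist requires Python 3.8+:

import math

def nn_search(node, query, best=None, best_dist=math.inf):
    # returns (nearest_instance, its_distance) for the query
    if node is None:
        return best, best_dist
    d = math.dist(query, node.instance)
    if d < best_dist:                      # update current-best
        best, best_dist = node.instance, d
    f = node.feature
    near, far = ((node.left, node.right) if query[f] < node.instance[f]
                 else (node.right, node.left))
    best, best_dist = nn_search(near, query, best, best_dist)
    # search the far subtree only if the target hypersphere crosses
    # the splitting boundary; otherwise prune it and ascend
    if abs(query[f] - node.instance[f]) < best_dist:
        best, best_dist = nn_search(far, query, best, best_dist)
    return best, best_dist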
Example
The retrieval process terminates with the instance ID=21 returned as the nearest neighbor to the query.
Summary