Ruiz Modified I2ml3e Chap6


Lecture Slides for

INTRODUCTION TO MACHINE LEARNING
3RD EDITION

ETHEM ALPAYDIN
© The MIT Press, 2014
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e

Modified by Prof. Carolina Ruiz for CS539 Machine Learning at WPI
CHAPTER 6:
DIMENSIONALITY REDUCTION
Why Reduce Dimensionality?

 Reduces time complexity: less computation
 Reduces space complexity: fewer parameters
 Saves the cost of observing the feature
 Simpler models are more robust on small datasets
 More interpretable; simpler explanation
 Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions
Feature Selection vs Extraction

 Feature selection: Choose k < d important features, ignoring the remaining d – k
   Subset selection algorithms
 Feature extraction: Project the original xi, i = 1,...,d dimensions to new k < d dimensions, zj, j = 1,...,k
Subset Selection

 There are 2^d subsets of d features
 Forward search: Add the best feature at each step (a sketch follows this slide)
   The set of features F is initially Ø.
   At each iteration, find the best new feature:
     j = argmin_i E(F ∪ xi), where E(·) is the error on the validation set
   Add xj to F if E(F ∪ xj) < E(F)
   This is a hill-climbing O(d^2) algorithm
 Backward search: Start with all features and remove one at a time, if possible.
 Floating search (add k, remove l)
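As a rough illustration of forward search, here is a minimal Python sketch; it assumes a scikit-learn-style estimator and uses mean cross-validated accuracy as a stand-in for the validation error E(·), so the function name and these choices are illustrative rather than part of the slides.

```python
from sklearn.model_selection import cross_val_score

def forward_select(X, y, estimator, k):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < k:
        # E(F ∪ x_i) approximated by 1 - mean cross-validated accuracy
        scores = {i: cross_val_score(estimator, X[:, selected + [i]], y, cv=5).mean()
                  for i in remaining}
        best = max(scores, key=scores.get)            # lowest error = highest accuracy
        if selected and scores[best] <= cross_val_score(
                estimator, X[:, selected], y, cv=5).mean():
            break                                     # E(F ∪ x_j) is not better than E(F): stop
        selected.append(best)
        remaining.remove(best)
    return selected
```

For example, forward_select(X, y, DecisionTreeClassifier(), k=3) would greedily add up to three features, stopping early if no candidate improves the score.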
Example: weather dataset

 Forward and backward selection illustrated on the weather dataset.

Figure taken from Witten and Frank's "Data Mining: Practical Machine Learning Tools and Techniques" textbook slides, Chapter 7.
Another Example — Iris data: Single feature

 Training data; the chosen single feature is marked.

Iris data: Add one more feature to F4

 The chosen pair of features is marked.
Principal Components Analysis

 Find a low-dimensional space such that when x is projected there, information loss is minimized.
 The projection of x on the direction of w is: z = wT x
 Find w such that Var(z) is maximized (see the numpy sketch after this slide):
   Var(z) = Var(wT x) = E[(wT x – wT μ)2]
          = E[(wT x – wT μ)(wT x – wT μ)]
          = E[wT (x – μ)(x – μ)T w]
          = wT E[(x – μ)(x – μ)T] w = wT ∑ w
   where Var(x) = E[(x – μ)(x – μ)T] = ∑
 Maximize Var(z) subject to ||w|| = 1 (i.e., w1T w1 = 1) using the Lagrange formulation:
   max over w1 of  w1T ∑ w1 – α (w1T w1 – 1)
 Taking the derivative w.r.t. w1 and setting it equal to 0:
   ∑ w1 = α w1, that is, w1 is an eigenvector of ∑ and α its eigenvalue.
   Choose the eigenvector with the largest eigenvalue for Var(z) to be maximal.
 Second principal component: maximize Var(z2) s.t. ||w2|| = 1 and w2 orthogonal to w1:
   max over w2 of  w2T ∑ w2 – α (w2T w2 – 1) – β (w2T w1 – 0)
   ∑ w2 = α w2, that is, w2 is another eigenvector of ∑, and so on.
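A minimal numpy sketch of this eigendecomposition view of PCA, assuming X is an N×d data matrix (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def pca(X, k):
    m = X.mean(axis=0)                        # sample mean m
    Xc = X - m                                # center the data
    S = np.cov(Xc, rowvar=False)              # d x d estimate of the covariance ∑
    eigvals, eigvecs = np.linalg.eigh(S)      # symmetric eigendecomposition (ascending order)
    order = np.argsort(eigvals)[::-1]         # sort eigenvalues in descending order
    W = eigvecs[:, order[:k]]                 # top-k eigenvectors as the columns of W
    Z = Xc @ W                                # z = W^T (x - m) for every instance
    return Z, W, eigvals[order]
```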
What PCA does

 z = WT(x – m)
   where the columns of W are the eigenvectors of ∑ and m is the sample mean
 PCA centers the data at the origin and rotates the axes
How to choose k?

 Proportion of Variance (PoV) explained:
   PoV = (λ1 + λ2 + ... + λk) / (λ1 + λ2 + ... + λk + ... + λd)
   where the λi are sorted in descending order
 Typically, stop at PoV > 0.9 (see the sketch below)
 Scree graph plots PoV vs k; stop at the “elbow”
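As a small illustration, a PoV-based choice of k might look like the following sketch, assuming eigvals holds the covariance eigenvalues sorted in descending order (names and threshold are illustrative):

```python
import numpy as np

def choose_k(eigvals, threshold=0.9):
    pov = np.cumsum(eigvals) / np.sum(eigvals)       # PoV for k = 1, ..., d
    return int(np.searchsorted(pov, threshold)) + 1  # smallest k whose PoV reaches the threshold
```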
Feature Embedding

 When X is the N×d data matrix,
   XT X is the d×d matrix (covariance of features, if mean-centered)
   X XT is the N×N matrix (pairwise similarities of instances)
 PCA uses the eigenvectors of XT X, which are d-dimensional and can be used for projection
 Feature embedding uses the eigenvectors of X XT, which are N-dimensional and which give directly the coordinates after projection (see the sketch below)
 Sometimes we can define pairwise similarities (or distances) between instances; then we can use feature embedding without needing to represent instances as vectors.
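A hedged numpy sketch of this idea, recovering PCA-like coordinates from the eigenvectors of X XT; the scaling by the square root of the eigenvalue is an assumption made here so the coordinates match the PCA projection:

```python
import numpy as np

def feature_embedding(X, k):
    Xc = X - X.mean(axis=0)                    # mean-center
    G = Xc @ Xc.T                              # N x N matrix X X^T
    eigvals, eigvecs = np.linalg.eigh(G)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]      # top-k eigenvectors
    # scale each eigenvector by sqrt(eigenvalue) to get the projected coordinates directly
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))
```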
Factor Analysis

 Find a small number of factors z which, when combined, generate x:
   xi – μi = vi1 z1 + vi2 z2 + ... + vik zk + εi
   where the zj, j = 1,...,k, are the latent factors with
     E[zj] = 0, Var(zj) = 1, Cov(zi, zj) = 0 for i ≠ j,
   the εi are the noise sources with
     E[εi] = 0, Var(εi) = ψi, Cov(εi, εj) = 0 for i ≠ j, Cov(εi, zj) = 0,
   and the vij are the factor loadings
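As an illustration (not from the slides), such a model can be fit with scikit-learn's FactorAnalysis; the data and the number of factors below are placeholder assumptions:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))            # placeholder N x d data
fa = FactorAnalysis(n_components=2)      # k = 2 latent factors
Z = fa.fit_transform(X)                  # factor scores z
V = fa.components_.T                     # d x k factor loadings v_ij
```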
PCA vs FA

 PCA goes from x to z:  z = WT(x – μ)
 FA goes from z to x:   x – μ = V z + ε
Factor Analysis

 In FA, the factors zj are stretched, rotated and translated to generate x
Singular Value Decomposition and Matrix Factorization

 Singular value decomposition: X = V A WT
   V is N×N and contains the eigenvectors of X XT
   W is d×d and contains the eigenvectors of XT X
   A is N×d and contains the singular values on its first k diagonal entries
 X = v1 a1 w1T + ... + vk ak wkT, where k is the rank of X (a sketch follows)
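A brief numpy sketch of the decomposition and a rank-k reconstruction; the data and k are illustrative placeholders, and note that numpy names the factors U, s, Vt rather than V, A, WT:

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 8))   # placeholder N x d matrix
V, a, Wt = np.linalg.svd(X, full_matrices=False)     # X = V diag(a) W^T
k = 3
X_k = (V[:, :k] * a[:k]) @ Wt[:k, :]                 # sum of the first k terms v_i a_i w_i^T
```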
Matrix Factorization

 Matrix factorization: X = F G
   F is N×k and G is k×d
 Latent semantic indexing
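For illustration, a rank-k factorization X ≈ F G can be obtained from a truncated SVD, which is the idea behind latent semantic indexing; the data and k below are placeholders:

```python
import numpy as np

X = np.abs(np.random.default_rng(1).normal(size=(50, 20)))   # e.g., term-document counts
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 5
F = U[:, :k] * s[:k]        # N x k
G = Vt[:k, :]               # k x d
error = np.linalg.norm(X - F @ G)                             # rank-k reconstruction error
```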
Multidimensional Scaling

 Given pairwise distances between N points, dij, i,j = 1,...,N, place the points on a low-dimensional map s.t. the distances are preserved (by feature embedding)
 With z = g(x | θ), find θ that minimizes the Sammon stress (a usage sketch follows this slide):
   E(θ | X) = Σ over (r,s) of ( ||z^r – z^s|| – ||x^r – x^s|| )^2 / ||x^r – x^s||^2
            = Σ over (r,s) of ( ||g(x^r | θ) – g(x^s | θ)|| – ||x^r – x^s|| )^2 / ||x^r – x^s||^2
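As an illustration, scikit-learn's metric MDS can be run from a precomputed distance matrix; it minimizes a related stress criterion (not exactly the Sammon stress above), and the data here are synthetic placeholders:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

X = np.random.default_rng(0).normal(size=(60, 5))    # placeholder points
D = squareform(pdist(X))                             # N x N pairwise distances d_ij
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
Z = mds.fit_transform(D)                             # low-dimensional map preserving distances
```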
Map of Europe by MDS

 Figure: map of Europe recovered by MDS. Map from CIA – The World Factbook: http://www.cia.gov/
Linear Discriminant Analysis

 Find a low-dimensional space such that when x is projected, classes are well-separated.
 Find w that maximizes
   J(w) = (m1 – m2)^2 / (s1^2 + s2^2)
   where m1 = Σt wT xt rt / Σt rt is the mean of the projected class-1 instances
   and s1^2 = Σt (wT xt – m1)^2 rt is their scatter (rt = 1 if xt belongs to class 1, else 0)
 Between-class scatter:
   (m1 – m2)^2 = (wT m1 – wT m2)^2
               = wT (m1 – m2)(m1 – m2)T w
               = wT SB w, where SB = (m1 – m2)(m1 – m2)T
   (here m1 and m2 denote the class mean vectors)
 Within-class scatter:
   s1^2 = Σt rt (wT xt – m1)^2
        = Σt rt wT (xt – m1)(xt – m1)T w = wT S1 w
   where S1 = Σt rt (xt – m1)(xt – m1)T
   s1^2 + s2^2 = wT SW w, where SW = S1 + S2
Fisher’s Linear Discriminant

 Find w that maximizes
   J(w) = wT SB w / wT SW w = |wT (m1 – m2)|^2 / wT SW w
 LDA solution (a sketch follows this slide):
   w = c · SW^-1 (m1 – m2) for some constant c
 Parametric solution:
   w = ∑^-1 (μ1 – μ2)
   when p(x | Ci) ~ N(μi, ∑)
K>2 Classes

 Within-class scatter:
   SW = Σ from i=1 to K of Si, where Si = Σt ri^t (xt – mi)(xt – mi)T
 Between-class scatter:
   SB = Σ from i=1 to K of Ni (mi – m)(mi – m)T, where m = (1/K) Σ from i=1 to K of mi
 Find W that maximizes
   J(W) = |WT SB W| / |WT SW W|
 The solution is given by the largest eigenvectors of SW^-1 SB; the maximum rank is K – 1 (see the sketch below)
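For illustration, scikit-learn's LinearDiscriminantAnalysis computes such a projection directly; the Iris data here is only a convenient stand-in example:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)   # at most K - 1 = 2 dimensions
Z = lda.fit_transform(X, y)                        # class-separating 2-D projection
```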
PCA vs LDA

 Figure: PCA and LDA projections of the same data compared.
Canonical Correlation Analysis

 X = {xt, yt}t: two sets of variables, x and y
 We want to find two projections w and v such that when x is projected along w and y is projected along v, the correlation is maximized:
   maximize Corr(wT x, vT y) over w and v
CCA

 x and y may be two different views or modalities, e.g., an image and its word tags; CCA does a joint mapping (see the sketch below)
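A hedged scikit-learn sketch of CCA on two synthetic, correlated views; the data and dimensions below are illustrative assumptions:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # view 1 (e.g., image features)
Y = X[:, :3] + 0.1 * rng.normal(size=(100, 3))     # view 2, correlated with view 1
cca = CCA(n_components=2)
Zx, Zy = cca.fit_transform(X, Y)                   # projections w^T x and v^T y
```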
Isomap

 Geodesic distance is the distance along the manifold that the data lie on, as opposed to the Euclidean distance in the input space
Isomap

 Instances r and s are connected in the graph if ||xr – xs|| < ε or if xs is one of the k nearest neighbors of xr; the edge length is ||xr – xs||
 For two nodes r and s that are not directly connected, the distance is the length of the shortest path between them
 Once the N×N distance matrix is thus formed, use MDS to find a lower-dimensional mapping (see the sketch below)
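A hedged scikit-learn sketch of Isomap; the digits dataset and neighborhood size stand in for the Optdigits example on the next slide:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

X, y = load_digits(return_X_y=True)
iso = Isomap(n_neighbors=10, n_components=2)   # k-NN graph, geodesic distances, then MDS
Z = iso.fit_transform(X)                       # 2-D coordinates
```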
Optdigits after Isomap (with neighborhood graph).

 Figure: 2-D Isomap embedding of Optdigits, with points labeled by digit class.

Matlab source from http://web.mit.edu/cocosci/isomap/isomap.html
Locally Linear Embedding

 1. Given xr, find its neighbors xs(r)
 2. Find the weights Wrs that minimize
      E(W | X) = Σr || xr – Σs Wrs xs(r) ||^2
 3. Find the new coordinates zr that minimize
      E(z | W) = Σr || zr – Σs Wrs zs(r) ||^2
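A hedged scikit-learn sketch of LLE; the digits dataset and neighborhood size below are illustrative stand-ins for the Optdigits example on the next slide:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import LocallyLinearEmbedding

X, y = load_digits(return_X_y=True)
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
Z = lle.fit_transform(X)   # new coordinates z^r
```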
LLE on Optdigits

 Figure: 2-D LLE embedding of Optdigits, with points labeled by digit class.

Matlab source from http://www.cs.toronto.edu/~roweis/lle/code.html

Laplacian Eigenmaps

 Let r and s be two instances and Brs their similarity; we want to find zr and zs that minimize Σ over (r,s) of ||zr – zs||^2 Brs
 Brs can be defined in terms of similarity in the original space: 0 if xr and xs are too far apart, and e.g. exp(–||xr – xs||^2 / (2σ^2)) otherwise
 This defines a graph Laplacian, and feature embedding returns the zr (see the sketch below)
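A hedged sketch of a Laplacian-eigenmap-style embedding via scikit-learn's SpectralEmbedding; the Iris data stands in for the example on the next slide, and the RBF affinity is an illustrative choice of Brs:

```python
from sklearn.datasets import load_iris
from sklearn.manifold import SpectralEmbedding

X, y = load_iris(return_X_y=True)
se = SpectralEmbedding(n_components=2, affinity='rbf')   # B_rs from an RBF similarity
Z = se.fit_transform(X)                                  # embedding coordinates z^r
```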
Laplacian Eigenmaps on Iris

 Figure: 2-D Laplacian eigenmap of the Iris data.
 Related: spectral clustering (Chapter 7)