
Lecture Slides for

INTRODUCTION
TO
MACHINE
LEARNING
3RD EDITION
ETHEM ALPAYDIN
© The MIT Press, 2014

alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e
CHAPTER 6:

DIMENSIONALITY
REDUCTION
Why Reduce Dimensionality?

- Reduces time complexity: less computation
- Reduces space complexity: fewer parameters
- Saves the cost of observing the feature
- Simpler models are more robust on small datasets
- More interpretable; simpler explanation
- Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions
Feature Selection vs Extraction

- Feature selection: Choose the k < d important features, ignoring the remaining d - k (subset selection algorithms)
- Feature extraction: Project the original x_i, i = 1,...,d dimensions to new k < d dimensions z_j, j = 1,...,k
Subset Selection

- There are 2^d possible subsets of d features
- Forward search: add the best feature at each step
  - The set of features F is initially Ø.
  - At each iteration, find the best new feature:
    j = argmin_i E(F ∪ x_i)
  - Add x_j to F if E(F ∪ x_j) < E(F)
  - A hill-climbing, O(d^2) algorithm (see the sketch below)
- Backward search: start with all features and remove one at a time, if possible
- Floating search: add k, remove l
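To make forward search concrete, here is a minimal Python sketch. The error function E is hypothetical: in practice it would train a model on the candidate feature subset and return its validation error.

```python
def forward_search(d, E):
    """Greedy forward selection over d features, using a caller-supplied
    (hypothetical) error function E that maps a feature subset to a
    validation error."""
    F = set()
    best_err = E(F)  # error with no features (e.g., a baseline predictor)
    while len(F) < d:
        # find the feature whose addition decreases the error most
        err, j = min((E(F | {i}), i) for i in range(d) if i not in F)
        if err >= best_err:  # no candidate improves E: stop (hill climbing)
            break
        F.add(j)
        best_err = err
    return F
```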
Iris data: Single feature

[Figure: error for each single feature on the Iris data; the chosen feature is marked]

Iris data: Add one more feature to F4

[Figure: error after adding one more feature to F4; the chosen feature is marked]
Principal Components Analysis

- Find a low-dimensional space such that when x is projected there, information loss is minimized.
- The projection of x on the direction of w is: z = w^T x
- Find w such that Var(z) is maximized:
  Var(z) = Var(w^T x) = E[(w^T x - w^T μ)^2]
         = E[(w^T x - w^T μ)(w^T x - w^T μ)]
         = E[w^T (x - μ)(x - μ)^T w]
         = w^T E[(x - μ)(x - μ)^T] w = w^T ∑ w
  where Var(x) = E[(x - μ)(x - μ)^T] = ∑
- Maximize Var(z) subject to ||w|| = 1:
  max_{w_1} w_1^T ∑ w_1 - α(w_1^T w_1 - 1)
  Setting the gradient to zero gives ∑ w_1 = α w_1; that is, w_1 is an eigenvector of ∑. Choose the one with the largest eigenvalue for Var(z) to be maximum.
- Second principal component: maximize Var(z_2) subject to ||w_2|| = 1 and w_2 orthogonal to w_1:
  max_{w_2} w_2^T ∑ w_2 - α(w_2^T w_2 - 1) - β(w_2^T w_1 - 0)
  This gives ∑ w_2 = α w_2; that is, w_2 is another eigenvector of ∑, and so on.
What PCA does

z = W^T(x - m)
where the columns of W are the eigenvectors of ∑ and m is the sample mean.
PCA centers the data at the origin and rotates the axes.
How to choose k?

- Proportion of Variance (PoV) explained:
  PoV = (λ_1 + λ_2 + ... + λ_k) / (λ_1 + λ_2 + ... + λ_k + ... + λ_d)
  where the λ_i are sorted in descending order
- Typically, stop at PoV > 0.9
- Scree graph plots PoV vs k; stop at the "elbow"
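As a concrete illustration, here is a small numpy sketch (on synthetic data) of the whole recipe: eigendecomposition of the sample covariance, choosing k by PoV > 0.9, and projecting with z = W^T(x - m).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # toy correlated data

m = X.mean(axis=0)
S = np.cov(X, rowvar=False)             # d x d sample covariance
eigvals, eigvecs = np.linalg.eigh(S)    # eigh returns ascending order
order = np.argsort(eigvals)[::-1]       # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pov = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(pov, 0.9)) + 1  # smallest k with PoV > 0.9

W = eigvecs[:, :k]                      # top-k eigenvectors as columns
Z = (X - m) @ W                         # z = W^T (x - m), one row per instance
```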
Feature Embedding

- When X is the N x d data matrix,
  - X^T X is the d x d matrix (covariance of features, if mean-centered)
  - X X^T is the N x N matrix (pairwise similarities of instances)
- PCA uses the eigenvectors of X^T X, which are d-dimensional and can be used for projection
- Feature embedding uses the eigenvectors of X X^T, which are N-dimensional and give directly the coordinates after projection (see the sketch below)
- Sometimes we can define pairwise similarities (or distances) between instances; then we can use feature embedding without needing to represent instances as vectors.
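A short numpy sketch of this relationship on synthetic, mean-centered data: the eigenvectors of the N x N matrix X X^T, scaled by the square roots of their eigenvalues, directly give the same coordinates (up to sign) as projecting onto the eigenvectors of X^T X.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)                  # mean-center

lam, U = np.linalg.eigh(X @ X.T)        # N x N pairwise-similarity matrix
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

k = 2
Z = U[:, :k] * np.sqrt(np.maximum(lam[:k], 0))  # coordinates, no projection step
```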
Factor Analysis

- Find a small number of factors z which, when combined, generate x:
  x_i - μ_i = v_i1 z_1 + v_i2 z_2 + ... + v_ik z_k + ε_i
  where z_j, j = 1,...,k are the latent factors with
  E[z_j] = 0, Var(z_j) = 1, Cov(z_i, z_j) = 0 for i ≠ j,
  ε_i are the noise sources with
  Var(ε_i) = ψ_i, Cov(ε_i, ε_j) = 0 for i ≠ j, Cov(ε_i, z_j) = 0,
  and v_ij are the factor loadings.
PCA vs FA

- PCA goes from x to z:  z = W^T(x - μ)
- FA goes from z to x:   x - μ = Vz + ε
Factor Analysis

- In FA, factors z_j are stretched, rotated and translated to generate x.
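For illustration, here is a brief sketch using scikit-learn's FactorAnalysis (one implementation choice, not the slides' own code) on synthetic data generated exactly as x = Vz + ε:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
Z_true = rng.normal(size=(200, 2))                  # latent factors z
V = rng.normal(size=(2, 6))                         # factor loadings
X = Z_true @ V + 0.1 * rng.normal(size=(200, 6))    # x = Vz + noise

fa = FactorAnalysis(n_components=2)
Z = fa.fit_transform(X)          # estimated factor scores
V_hat = fa.components_           # estimated loadings (k x d)
psi = fa.noise_variance_         # estimated noise variances ψ_i
```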
Singular Value Decomposition and Matrix Factorization

- Singular value decomposition: X = VAW^T
  - V is N x N and contains the eigenvectors of X X^T
  - W is d x d and contains the eigenvectors of X^T X
  - A is N x d and contains the singular values on its first k diagonal entries
- X = v_1 a_1 w_1^T + ... + v_k a_k w_k^T, where k is the rank of X
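A quick numpy check of the decomposition and of the rank-one expansion, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 5))

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(s) Vt
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # best rank-k approximation

# the same thing written as an explicit sum of rank-one terms
X_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))
assert np.allclose(X_k, X_sum)
```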
Matrix Factorization

- Matrix factorization: X = FG
  where F is N x k and G is k x d
- Example: latent semantic indexing

Multidimensional Scaling

- Given the pairwise distances between N points,
  d_ij, i,j = 1,...,N
  place the points on a low-dimensional map such that the distances are preserved (by feature embedding).
- z = g(x | θ): find θ that minimizes the Sammon stress
  E(θ | X) = Σ_{r,s} ( ||z^r - z^s|| - ||x^r - x^s|| )^2 / ||x^r - x^s||^2
           = Σ_{r,s} ( ||g(x^r | θ) - g(x^s | θ)|| - ||x^r - x^s|| )^2 / ||x^r - x^s||^2
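As an illustration, scikit-learn's MDS can embed a precomputed distance matrix; note that it minimizes a plain (unnormalized) stress rather than the normalized Sammon stress written above.

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 10))
# N x N matrix of pairwise Euclidean distances
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
Z = mds.fit_transform(D)   # 2-D coordinates that approximately preserve D
```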
Map of Europe by MDS

[Figure: cities of Europe placed by MDS from their pairwise distances; map from CIA - The World Factbook: http://www.cia.gov/]
Linear Discriminant Analysis

- Find a low-dimensional space such that when x is projected, classes are well-separated.
- Find w that maximizes
  J(w) = (m_1 - m_2)^2 / (s_1^2 + s_2^2)
  where m_i is the projected mean and s_i^2 the scatter of the samples of class i:
  m_1 = Σ_t w^T x^t r^t / Σ_t r^t
  s_1^2 = Σ_t (w^T x^t - m_1)^2 r^t
  with r^t = 1 if x^t is in class 1 and 0 otherwise (m_2 and s_2^2 are defined analogously).

- Between-class scatter:
  (m_1 - m_2)^2 = (w^T m_1 - w^T m_2)^2
                = w^T (m_1 - m_2)(m_1 - m_2)^T w
                = w^T S_B w,  where S_B = (m_1 - m_2)(m_1 - m_2)^T
  (on the right, m_i denotes the class-mean vector whose projection is the scalar m_i = w^T m_i)
- Within-class scatter:
  s_1^2 = Σ_t (w^T x^t - m_1)^2 r^t
        = Σ_t w^T (x^t - m_1)(x^t - m_1)^T w r^t = w^T S_1 w
  where S_1 = Σ_t (x^t - m_1)(x^t - m_1)^T r^t
  s_1^2 + s_2^2 = w^T S_W w,  where S_W = S_1 + S_2
Fisher's Linear Discriminant

- Find w that maximizes
  J(w) = w^T S_B w / w^T S_W w = (w^T (m_1 - m_2))^2 / w^T S_W w
- LDA solution:
  w = c · S_W^{-1} (m_1 - m_2)
- Parametric solution:
  w = ∑^{-1} (μ_1 - μ_2)
  when p(x | C_i) ~ N(μ_i, ∑)
K > 2 Classes

- Within-class scatter:
  S_W = Σ_{i=1}^K S_i,  where S_i = Σ_t r_i^t (x^t - m_i)(x^t - m_i)^T
- Between-class scatter:
  S_B = Σ_{i=1}^K N_i (m_i - m)(m_i - m)^T,  where m = (1/K) Σ_{i=1}^K m_i
- Find W that maximizes
  J(W) = |W^T S_B W| / |W^T S_W W|
- Solution: the largest eigenvectors of S_W^{-1} S_B (S_B has maximum rank K - 1)
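The multi-class case can be sketched as the generalized eigenproblem S_B w = λ S_W w, which scipy solves directly; shown below on the Iris data.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.datasets import load_iris

def lda_directions(X, y):
    """Columns of W: the top K-1 eigenvectors of S_W^{-1} S_B."""
    classes = np.unique(y)
    d = X.shape[1]
    means = {c: X[y == c].mean(axis=0) for c in classes}
    m = np.mean(list(means.values()), axis=0)   # mean of the class means
    SW, SB = np.zeros((d, d)), np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        SW += (Xc - means[c]).T @ (Xc - means[c])
        diff = (means[c] - m)[:, None]
        SB += len(Xc) * (diff @ diff.T)
    lam, W = eigh(SB, SW)                       # S_B w = lam * S_W w
    order = np.argsort(lam)[::-1]
    return W[:, order[: len(classes) - 1]]

X, y = load_iris(return_X_y=True)
Z = X @ lda_directions(X, y)                    # at most K-1 = 2 dimensions
```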
PCA vs LDA

[Figure: the same data projected by PCA and by LDA]
Canonical Correlation Analysis

- X = {x^t, y^t}_t : two sets of variables, x and y
- We want to find two projections w and v such that when x is projected along w and y is projected along v, the correlation is maximized:
  ρ = Corr(w^T x, v^T y) = Cov(w^T x, v^T y) / ( √Var(w^T x) √Var(v^T y) )
CCA

- x and y may be two different views or modalities, e.g., an image and its word tags; CCA does a joint mapping.
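For illustration, a sketch using scikit-learn's CCA on two synthetic "views" that share a common latent signal:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(6)
shared = rng.normal(size=(100, 2))                   # common latent signal
X = shared @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(100, 5))
Y = shared @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(100, 4))

cca = CCA(n_components=2)
Zx, Zy = cca.fit_transform(X, Y)   # projections w^T x and v^T y
# corresponding columns of Zx and Zy are maximally correlated
```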
Isomap

- Geodesic distance is the distance along the manifold that the data lies on, as opposed to the Euclidean distance in the input space.
Isomap

- Instances r and s are connected in the graph if ||x^r - x^s|| < ε, or if x^s is one of the k nearest neighbors of x^r; the edge length is ||x^r - x^s||.
- For two nodes r and s not directly connected, the distance is the length of the shortest path between them.
- Once the N x N distance matrix is thus formed, use MDS to find a lower-dimensional mapping (see the sketch below).
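A short sketch using scikit-learn's Isomap on its bundled digits data, a small stand-in for the Optdigits figure below:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

X, y = load_digits(return_X_y=True)
# k-nearest-neighbor graph, shortest-path distances, then MDS internally
Z = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
# plotting Z colored by y shows the per-digit clusters seen in the figure
```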
Optdigits after Isomap (with neighborhood graph).

[Figure: 2-D Isomap embedding of Optdigits; digits of the same class cluster together. Matlab source from http://web.mit.edu/cocosci/isomap/isomap.html]
Locally Linear Embedding

1. Given x^r, find its neighbors x^s_(r)
2. Find the weights W_rs that minimize the reconstruction error
   E(W | X) = Σ_r || x^r - Σ_s W_rs x^s_(r) ||^2
3. Find the new coordinates z^r that minimize
   E(z | W) = Σ_r || z^r - Σ_s W_rs z^s_(r) ||^2
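The two optimization steps above are what scikit-learn's LocallyLinearEmbedding implements; a minimal sketch, again on the bundled digits data:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import LocallyLinearEmbedding

X, y = load_digits(return_X_y=True)
# steps 1-2: fit reconstruction weights W; step 3: solve for coordinates z
Z = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X)
```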
LLE on Optdigits

[Figure: 2-D LLE embedding of Optdigits. Matlab source from http://www.cs.toronto.edu/~roweis/lle/code.html]

Laplacian Eigenmaps

- Let r and s be two instances and B_rs their similarity; we want to find z^r and z^s that minimize
  Σ_{r,s} ||z^r - z^s||^2 B_rs
  so that similar instances are placed nearby.
- B_rs can be defined in terms of similarity in the original space: 0 if x^r and x^s are too far apart, and, e.g., exp(-||x^r - x^s||^2 / (2σ^2)) otherwise.
- This defines a graph Laplacian, and feature embedding returns the z^r.
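A sketch using scikit-learn's SpectralEmbedding, which implements Laplacian eigenmaps with a nearest-neighbor affinity graph (shown here on Iris, matching the figure below):

```python
from sklearn.datasets import load_iris
from sklearn.manifold import SpectralEmbedding

X, y = load_iris(return_X_y=True)
# the similarity B_rs comes from a k-nearest-neighbor affinity graph
Z = SpectralEmbedding(n_components=2, affinity="nearest_neighbors",
                      n_neighbors=10).fit_transform(X)
```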
Laplacian Eigenmaps on Iris

[Figure: 2-D Laplacian-eigenmap embedding of the Iris data]

See also: spectral clustering (Chapter 7).