0% found this document useful (0 votes)
4 views158 pages

Mathematial Introduction to Data Science

This document is a working draft of a monograph by Yuan Yao that introduces data science from a mathematical perspective, primarily focusing on Principal Component Analysis (PCA). It covers various topics including multidimensional scaling, random projections, high dimensional statistics, and nonlinear dimensionality reduction, among others. The content is structured as a course material aimed at graduate students in applied mathematics, computer science, and statistics.

Uploaded by

k61294685
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views158 pages

Mathematial Introduction to Data Science

This document is a working draft of a monograph by Yuan Yao that introduces data science from a mathematical perspective, primarily focusing on Principal Component Analysis (PCA). It covers various topics including multidimensional scaling, random projections, high dimensional statistics, and nonlinear dimensionality reduction, among others. The content is structured as a course material aimed at graduate students in applied mathematics, computer science, and statistics.

Uploaded by

k61294685
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 158

A Mathematical Introduction to Data Science

Yuan Yao
School of Mathematical Sciences, Peking University, Beijing, China
100871
E-mail address: yuany@math.pku.edu.cn
URL: http://www.math.pku.edu.cn/teachers/yaoy/Fall2012/lectures.pdf
This is a working draft last updated on
October 14, 2014
2000 Mathematics Subject Classification. Primary
Key words and phrases. keywords

Special thanks to Amit Singer, Weinan E, Xiuyuan Cheng, and the following
students in PKU who help scribe lecture notes with various improvements: Hong
Cheng, Chao Deng, Yanzhen Deng, Chendi Huang, Lei Huang, Shujiao Huang,
Longlong Jiang, Yuwei Jiang, Wei Jin, Changcheng Li, Xiaoguang Li, Tengyuan
Liang, Feng Lin, Yaning Liu, Peng Luo, Wulin Luo, Tangjie Lv, Yuan Lv, Hongyu
Meng, Ping Qin, Jie Ren, Hu Sheng, Zhiming Wang, Yuting Wei, Jiechao Xiong,
Jie Xu, Bowei Yan, Jun Yin, and Yue Zhao.

Abstract. This monograph aims to provide graduate students or senior grad-


uates in applied mathematics, computer science and statistics an introduction
to data science from a mathematical perspective. It is focused around a cen-
tral topic in data analysis, Principal Component Analysis (PCA), with a diver-
gence to some mathematical theories for deeper understanding, such as random
matrix theory, convex optimization, random walks on graphs, geometric and
topological perspectives in data analysis.
Contents

Preface 1
Chapter 1. Multidimensional Scaling and Principal Component Analysis 3
1. Classical MDS 3
2. Theory of MDS (Young/Househölder/Schoenberg’1938) 4
3. Hilbert Space Embedding and Reproducing Kernels 8
4. Linear Dimensionality Reduction 8
5. Principal Component Analysis 9
6. Dual Roles of MDS vs. PCA in SVD 11
Chapter 2. Random Projections and Almost Isometry 13
1. Introduction 13
2. The Johnson-Lindenstrauss Lemma 14
3. Example: MDS in Human Genome Diversity Project 17
4. Random Projections and Compressed Sensing 18
Chapter 3. High Dimensional Statistics: Mean and Covariance in Noise 25
1. Maximum Likelihood Estimation 25
2. Bias-Variance Decomposition of Mean Square Error 27
3. Stein’s Phenomenon and Shrinkage of Sample Mean 29
4. Random Matrix Theory and Phase Transitions in PCA 36
Chapter 4. Generalized PCA/MDS via SDP Relaxations 45
1. Introduction of SDP with a Comparison to LP 45
2. Robust PCA 47
3. Probabilistic Exact Recovery Conditions for RPCA 50
4. Sparse PCA 51
5. MDS with Uncertainty 53
6. Exact Reconstruction and Universal Rigidity 56
7. Maximal Variance Unfolding 58

Chapter 5. Nonlinear Dimensionality Reduction 59


1. Introduction 59
2. ISOMAP 61
3. Locally Linear Embedding (LLE) 64
4. Laplacian LLE (Eigenmap) 66
5. Hessian LLE 67
6. Local Tangent Space Alignment (LTSA) 69
7. Diffusion Map 71
8. Connection Laplacian and Vector Diffusion Maps 73
9. Comparisons 73
3
4 CONTENTS

Chapter 6. Random Walk on Graphs 75


1. Introduction to Perron-Frobenius Theory and PageRank 75
2. Introduction to Fiedler Theory and Cheeger Inequality 81
3. *Laplacians and the Cheeger inequality for directed graphs 88
4. Lumpability of Markov Chain 94
5. Applications of Lumpability: MNcut and Optimal Reduction of
Complex Networks 97
6. Mean First Passage Time 102
7. Transition Path Theory 104
Chapter 7. Diffusion Map 109
1. Diffusion map and Diffusion Distance 109
2. Commute Time Map and Distance 117
3. Diffusion Map: Convergence Theory 120
4. *Vector Diffusion Map 127
Chapter 8. Semi-supervised Learning 135
1. Introduction 135
2. Harmonic Extension of Functions on Graph 135
3. Explanation from Gaussian Markov Random Field 135
4. Explanation from Transition Path Theory 136
5. Well-posedness 137
Chapter 9. Beyond graphs: high dimensional topological/geometric analysis 139
1. From Graph to Simplicial Complex 139
2. Persistent Homology and Discrete Morse Theory 141
3. Exterior Calculus on Complex and Combinatorial Hodge Theory 141
4. Applications of Hodge Theory: Statistical Ranking 143
5. Euler-Calculus 148
Bibliography 151
Preface

This book is used in a course instructed by Yuan Yao at Peking University, part
of which is based on a similar course led by Amit Singer at Princeton University.
If knowledge comes from the impressions made upon us by natural
objects, it is impossible to procure knowledge without the use of
objects which impress the mind. –John Dewey

It is important to understand what you CAN DO before you learn to


measure how WELL you seem to have DONE it. –John W. Tukey

1
CHAPTER 1

Multidimensional Scaling and Principal


Component Analysis

1. Classical MDS
Multidimensional Scaling (MDS) roots in psychology [YH41] which aims to
recover Euclidean coordinates given pairwise distance metrics or dissimilarities.
It is equivalent to PCA when pairwise distances are Euclidean. In the core of
theoretical foundation of MDS lies the notion of positive definite functions [Sch37,
Sch38a, Sch38b] (or see the survey [Bav11]) which has been the foundation
of the kernel method in statistics [Wah90] and modern machine learning society
(http://www.kernel-machines.org/).
In this section we study classical MDS, or metric Multidimensional scaling
problem. The problem of classical MDS or isometric Euclidean embedding: given
pairwise distances between data points, can we find a system of Euclidean coordi-
nates for those points whose pairwise distances meet given constraints?
Consider a forward problem: given a set of points x1 , x2 , ..., xn ∈ Rp , let
X = [x1 , x2 , ..., xn ]p×n .
The distance between point xi and xj satisfies
2 T
d2ij = kxi − xj k = (xi − xj ) (xi − xj ) = xi T xi + xj T xj − 2xi T xj .
Now we are considering the inverse problem: given dij , find a {xi } satisfying the
relations above. Clearly the solutions are not unique as any Euclidean transform
on {xi } gives another solution. General ideas of classic (metric) MDS is:
(1) transform squared distance matrix D = [d2ij ] to an inner product form;
(2) compute the eigen-decomposition for this inner product form.
Below we shall see how to do this given D.
Let K be the inner product matrix
K = X T X,
with k = diag(Kii ) ∈ Rn . So
D = (d2ij ) = k · 1T + 1 · k T − 2K.
where 1 = (1, 1, ..., 1)T ∈ Rn .
Define the mean and the centered data
n
1X 1
µbn = xi = · X · 1,
n i=1 n
1
ei = xi − µ
x bn = xi − · X · 1,
n
3
4 1. MULTIDIMENSIONAL SCALING AND PRINCIPAL COMPONENT ANALYSIS

or
e = X − 1 X · 1 · 1T .
X
n
Thus,
K̃ , X̃ T X̃
T
1 1
= (X − X · 1 · 1T ) (X − X · 1 · 1T )
n n
1 1 1
= K − K · 1 · 1T − 1 · 1T · K + 2 · 1 · 1T · K · 1 · 1T .
n n n
Let
1
B = − H · D · HT
2
1
where H = I − n · 1 · 1T . H is called as a centering matrix.
So
1
B = − H · (k · 1T + 1 · k T − 2K) · H T
2
T
Since k · 1T · H T = k · 1(I − n1 · 1 · 1T ) = k · 1 − k( 1 n·1 ) · 1 = 0, we have
H · k 1 · H T = H · 1 · k T · H T = 0.
Therefore,
1 1
B = H · K · H T = (I − · 1 · 1T ) · K · (I − · 1 · 1T )
n n
1 1 1
= K − · 1 · 1 · K − · K · 1 · 1 + 2 · 1(1T · K1) · 1T
T
n n n
= K̃.
That is,
1
B = − H · D · H T = X̃ T X̃.
2
Note that often we define the covariance matrix
n
bn , 1 1 e eT
X
Σ (xi − µ bn )T =
bn )(xi − µ XX .
n − 1 i=1 n−1
Above we have shown that given a squared distance matrix D = (d2ij ), we can
1
convert it to an inner product matrix by B = − HDH T . Eigen-decomposition
2
applied to B will give rise the Euclidean coordinates centered at the origin.
In practice, one often chooses top k nonzero eigenvectors of B for a k-dimensional
Euclidean embedding of data.
Hence X ek gives k-dimensional Euclidean coordinations for the n points.
In Matlab, the command for computing MDS is ”cmdscale”, short for Clas-
sical Multidimensional Scaling. For non-metric MDS, you may choose ”mdscale”.
Figure 1 shows an example of MDS.

2. Theory of MDS (Young/Househölder/Schoenberg’1938)


Definition (Positive Semi-definite). Suppose An×n is a real symmetric matrix,then:
A is p.s.d.(positive semi-definite)(A  0) ⇐⇒ ∀v ∈ Rn , v T Av ≥ 0 ⇐⇒ A = Y T Y
Property. Suppose An×n , B n×n are real symmetric matrix, A  0, B  0. Then
we have:
2. THEORY OF MDS (YOUNG/HOUSEHÖLDER/SCHOENBERG’1938) 5

Algorithm 1: Classical MDS Algorithm


Input: A squared distance matrix Dn×n with Dij = d2ij .
Output: Euclidean k-dimensional coordinates X ek ∈ Rk×n of data.
1
1 Compute B = − H · D · H T , where H is a centering matrix.
2
2 Compute Eigenvalue decomposition B = U ΛU T with Λ = diag(λ1 , . . . , λn ) where
λ1 ≥ λ2 ≥ . . . ≥ λn ≥ 0;
3 Choose top k nonzero eigenvalues and corresponding eigenvectors, X ek = Uk Λk 12
where
Uk = [u1 , . . . , uk ], uk ∈ Rn ,
Λk = diag(λ1 , . . . , λk )
with λ1 ≥ λ2 ≥ . . . ≥ λk > 0.

(a)

(b) (c)

Figure 1. MDS of nine cities in USA. (a) Pairwise distances be-


1
tween 9 cities; (b) Eigenvalues of B = − H · D · H T ; (c) MDS
2
embedding with top-2 eigenvectors.

(1) A + B  0;
(2) A ◦ B  0;
where A ◦ B is called Hadamard product and (A ◦ B)i,j := Ai,j × Bi,j .
Definition (Conditionally Negative Definite). Let An×n be a real symmetric ma-
trix.
Pn A is c.n.d.(conditionally negative definite) ⇐⇒ ∀v ∈ Rn , such that 1T v =
T
i=1 vi = 0, there holds v Av ≤ 0

Lemma 2.1 (Young/Househölder-Schoenberg


Pn ’1938). For any signed probability
measure α (α ∈ Rn , i=1 αi = 1),
1
Bα = − Hα CHαT  0 ⇐⇒ C is c.n.d.
2
6 1. MULTIDIMENSIONAL SCALING AND PRINCIPAL COMPONENT ANALYSIS

where Hα is Householder centering matrix: Hα = I − 1 · αT .


Proof. ⇐ We are to show if C is c.n.d., then Bα ≥ 0. Taking an
arbitrary x ∈ Rn ,
1 1
xT Bα x = − xT Hα CHαT x = − (HαT x)T C(HαT x).
2 2
Now we are going to show that y = HαT x satisfies 1T y = 0. In fact,
1T · HαT x = 1T · (I − α · 1T )x = (1 − 1T · α)1T · x = 0
as 1T · α = 1 for signed probability measure α. Therefore,
1
xT Bα x = − (HαT x)T C(HαT x) ≥ 0,
2
as C is c.n.d.
⇒ Now it remains to show if Bα ≥ 0 then C is c.n.d. For ∀x ∈ Rn satisfying
1T · x = 0, we have
HαT x = (I − α · 1T )x = x − α · 1T x = x
Thus,
xT Cx = (HαT x)T C(HαT x) = xT Hα CHαT x = −2xT Bα x ≤ 0,
as desired.
This completes the proof. 
1
Theorem 2.2 (Classical MDS). Let Dn×n a real symmetric matrix. C = D − d ·
2
1
1T − 1 · dT , d = diag(D). Then:
2
(1) Bα = − 12 Hα DHαT = − 21 Hα CHαT for ∀α signed probability measrue;
(2) Ci,j = Bi,i (α) + Bj,j (α) − 2Bi,j (α)
(3) D c.n.d. ⇐⇒ C c.n.d.
(4) C c.n.d. ⇒ C is a square distance matrix (i.e. ∃Y n×k s.t. Ci,j =
Pk 2
m=1 (yi,m − yj,m ) )

Proof. (1) Hα DHαT − Hα CHαT = Hα (D − C)HαT = Hα ( 21 d · 1T + 12 1 ·


T T
d )Hα .
Since Hα · 1 = 0, we have
Hα DHαT − Hα CHαT = 0
(2) Bα = − 12 Hα CHαT = − 12 (I − 1 · αT )C(I − α · 1T ) = − 12 C + 12 1 · αT C +
1 T 1 T T
2 Cα · 1 − 2 1 · α Cα · 1 , so we have:
1 1 1 1
Bi,j (α) = − Ci,j + ci + cj − c
2 2 2 2
where ci = (αT C)i , c = αT Cα. This implies

1 1
Bi,i (α) + Bj,j (α) − 2Bi,j (α) = − Cii − Cjj + Cij = Cij ,
2 2
where the last step is due to Ci,i = 0.
(3) According to Lemma 2.1 and the first part of Theorem 2.2: C c.n.d.
⇐⇒ B p.s.d ⇐⇒ D c.n.d.
2. THEORY OF MDS (YOUNG/HOUSEHÖLDER/SCHOENBERG’1938) 7

(4) According to Lemma 2.1 and the second part of Theorem 2.2:
T
C c.n.d. ⇐⇒ B p.s.d
P P ⇐⇒ ∃Y 2s.t. Bα = Y Y ⇐⇒ Bi,j (α) =
k Yi,k Yj,k ⇒ Ci,j = k (Yi,k − Yj,k )
This completes the proof. 

Sometimes, we may want to transform a square distance matrix to another


square distance matrix. The following theorem tells us the form of all the transfor-
mations between squared distance matrices.
Theorem 2.3 (Schoenberg Transform). Given D a squared distance matrix, Ci,j =
Φ(Di,j ). Then
C is a squared distance matrix ⇐⇒ Φ is a Schoenberg Transform.
A Schoenberg Transform Φ is a transform from R+ to R+ , which takes d to
Z ∞
1 − exp (−λd)
Φ(d) = g(λ)dλ,
0 λ
where g(λ) is some nonnegative measure on [0, ∞) s.t
Z ∞
g(λ)
dλ < ∞.
0 λ
Examples of Schoeberg transforms include
• φ0 (d) = d with g0 (λ) = δ(λ);
1 − exp(−ad)
• φ1 (d) = with g1 (λ) = δ(λ − a) (a > 0);
a
• φ2 (d) = ln(1 + d/a) with g2 (λ) = exp(−aλ);
d
• φ3 (d) = with g3 (λ) = λ exp(−aλ);
a(a + d)
p
• φ4 (d) = dp (p ∈ (0, 1)) with g4 (λ) = λ−p (see more in [Bav11]).
Γ(1 − p)
The first one
√ gives the identity transform and the last one implies that for a distance
function, d is also a distance function but d2 is not. To see this, take three
points on a line x = 0, y = 1, z = 2 where d(x, y) = d(y, z) = 1, then for p > 1
dp (x, z) = 2p > dp (x, y) + dp (y, z) = 2 which violates the triangle inequality. In
fact, dp (p ∈ (0, 1)) is Euclidean distance function immediately implies the following
triangle inequality
dp (0, x + y) ≤ dp (0, x) + dp (0, y).
Note that Schoenberg transform satisfies φ(0) = 0,
Z ∞
0
φ (d) = exp(−λd)g(λ)dλ ≥ 0,
0
Z ∞
φ00 (d) = − exp(−λd)λg(λ)dλ ≤ 0,
0
and so on. In other words, φ is a completely monotonic function defined by
(−1)n φ(n) (x) ≥ 0, with additional constraint φ(0) = 0. Schoenberg showed in
1938 that a function φ is completely monotone on [0, ∞) if and only if φ(d2 ) is
positive definite and radial on Rs for all s.
8 1. MULTIDIMENSIONAL SCALING AND PRINCIPAL COMPONENT ANALYSIS

3. Hilbert Space Embedding and Reproducing Kernels


Schoenberg [Sch38b] shows that Euclidean embedding of finite points can be
characterized completely by positive definite functions, which paves a way toward
Hilbert space embedding. Later Aronzajn [Aro50] developed Reproducing Kernel
Hilbert spaces based on positive definite functions which eventually leads to the
kernel methods in statistics and machine learning [Vap98, BTA04, CST03].
Theorem 3.1 (Schoenberg 38). A separable space M with a metric function d(x, y)
can be isometrically imbedded in a Hilbert space H, if and only if the family of
2
functions e−λd are positive definite for all λ > 0 (in fact we just need it for a
sequence of λi whose accumulate point is 0).
Here a symmetric function k(x, y) = k(y, x) is called positive definite if for all
finite xi , xj ,
X
ci cj k(xi , xj ) ≥ 0, ∀ci , cj
i,j

with equality = holds iff ci = cj = 0. In other words the function k restricted on


{(xi , xj ) : i, j = 1, . . . , n} is a positive definite matrix.
Combined this with Schoenberg transform, one shows that if d(x, y) is an Eu-
2
clidean distance matrix, then e−λΦ(d) is positive definite for all λ > 0. Note that for
k
homogeneous function e−λΦ(tx) = e−λt Φ(x) , it suffices to check positive definiteness
for λ = 1.
Symmetric positive definite functions k(x, y) are often called reproducing ker-
nels [Aro50]. In fact the functions spanned by kx (·) = k(x, ·) for x ∈ X made
up of a Hilbert space, where we can associate an inner product induced from
2 2
hkx , ky i = k(x, y). The radial basis function e−λd = e−λkxk is often called Gauss-
ian kernel or heat kernel in literature and has been widely used in machine learning.
On the other hand, every Hilbert space H of functions on X with bounded evalu-
ation functional can be regarded as a reproducing kernel Hilbert space [Wah90]. By
Riesz representation, for every x ∈ X there exists Ex ∈ H such that f (x) = hf, Ex i.
By boundedness of evaluation functional, |f (x)| ≤ kf kH kEx k, one can define a re-
producing kernel k(x, y) = hEx , Ey i which is bounded, symmetric and positive def-
inite. It is called ‘reproducing’ because we can reproduce the function value using
f (x) = hf, kx i where kx (·) := k(x, ·) as a function in H. Such an universal prop-
erty makes RKHS a unified tool to study Hilbert function spaces in nonparametric
statistics, including Sobolev spaces consisting of splines [Wah90].

4. Linear Dimensionality Reduction


We have seen that given a set of paired distances dij , how to find an Eu-
clidean embedding xi ∈ Rp such that kxi − xj k = dij . However the dimension-
ality of such an embedding p can be very large. For example, any n + 1 points
can be isometrically embedded into Rn∞ using (di1 , di2 , . . . , din ) and l∞ -metric:
d∞ (xj , xk ) = maxi=1,...,n |dij − djk | = dik due to triangle inequality. Moreover,
2
via the heat kernel e−λt they can be embedded into Hilbert spaces of infinite
dimensions.
Therefore dimensionality reduction is desired when p is large, at the best preser-
vation of pairwise distances.
5. PRINCIPAL COMPONENT ANALYSIS 9

Given a set of points xi ∈ Rp (i = 1, 2, · · · , n); form a data Matrix X p×n =


[X1 , X2 · · · Xn ]T , when p is large, especially in some cases larger than n, we want to
find k-dimensional projection with which pairwise distances of the data point are
preserved as well as possible. That is to say, if we know the original pairwise distance
dij = kXi − Xj k or data distances with some disturbance d˜ij = kXi − Xj k + , we
want to find Yi ∈ Rk s.t.:
X
(1) min (kYi − Yj k2 − d2ij )2
Yi ∈Rk
i,j

take the derivative w.r.t Yi ∈ Rk :


X
(kYi k2 + kYj k2 − 2YiT Yj − d2ij )(Yi − Yj ) = 0
i,j
P P P
which implies i Yi = j Yj . For simplicity set i Yi = 0, i.e.putting the origin
as data center.
Use a linear transformation to move the sample mean to be the origin of the
coordinates, i.e. define a matrix Bij = − 21 HDH where D = (d2ij ), H = I − n1 11T ,
then, the minimization (1) is equivalent to find Yi ∈ Rk :

min kY T Y − Bk2F
Y ∈Rk×n

then the row vectors of matrix Y are the eigenvectors corresponding to k largest
eigenvalues of B = X e T X,
e or equivalently the top k right singular vectors of X
e =
T
U SV .
We have seen in the first section that the covariance matrix of data Σ bn =
1 e eT 1 2 T
n−1 X X = n U S U , passing through the singular vector decomposition (SVD)
of Xe = U SV T . Taking top k left singular vectors as the embedding coordinates
is often called Principal Component Analysis (PCA). In PCA, given (centralized)
Euclidean coordinate X, e ususally one gets the inner product matrix as covariance
1 e
matrix Σn = n−1 X · X T which is a p × p positive semi-definite matrix, then the top
b e
k eigenvectors of Σb n give rise to a k-dimensional embedding of data, as principal
components. So both MDS and PCA are unified in SVD of centralized data matrix.
The following introduces PCA from another point of view as best k-dimensional
affine space approximation of data.

5. Principal Component Analysis


Principal component analysis (PCA), invented by Pearson (1901) and Hotelling
(1933), is perhaps the most ubiquitous method for dimensionality reduction with
high dimensional Euclidean data, under various names in science and engineering
such as Karhunen-Loève Transform, Empirical Orthogonal Functions, and Principal
Orthogonal Decomposition, etc. In the following we will introduce PCA from its
sampled version.
Let X = [X1 |X2 | · · · |Xn ] ∈ Rp×n . Now we are going to look for a k-dimensional
affine space in Rp to best approximate these n examples. Assume that such an affine
space can be parameterized by µ + U β such that U = [u1 , . . . , uk ] consists of k-
columns of an orthonormal basis of the affine space. Then the best approximation
10 1. MULTIDIMENSIONAL SCALING AND PRINCIPAL COMPONENT ANALYSIS

in terms of Euclidean distance is given by the following optimization problem.


n
X
(2) min I := kXi − (µ + U βi )k2
β,µ,U
i=1
Pn
where U ∈ Rp×k , U T U = Ip , and i=1 βi = 0 (nonzero sum of βi can be repre-
sented by µ). Taking the first order optimality conditions,
n n
∂I X 1X
= −2 (Xi − µ − U βi ) = 0 ⇒ µ̂n = Xi
∂µ i=1
n i=1

∂I
= (xi − µ − U βi )T U = 0 ⇒ βi = U T (Xi − µ)
∂βi
Plug in the expression of µ̂n and βi
n
X
I = kXi − µ̂n − U U T (Xi − µ̂n )k2
i=1
n
X
= kXi − µ̂n − Pk (Xi − µ̂n )k2
i=1
n
X
= kYi − Pk (yi )k2 , Yi := Xi − µ̂n
i=1

where Pk = U U T is a projection operator satisfying the idempotent property Pk2 =


Pk .
Denote Y = [Y1 |Y2 | · · · |Yn ] ∈ Rp×n , whence the original problem turns into
n
X
min kYi − Pk (Yi )k2 = min trace[(Y − Pk Y )T (Y − Pk Y )]
U
i=1
= min trace[Y T (I − Pk )(I − Pk )Y ]
= min trace[Y Y T (I − Pk )2 ]
= min trace[Y Y T (I − Pk )]
= min[trace(Y Y T ) − trace(Y Y T U U T )]
= min[trace(Y Y T ) − trace(U T Y Y T U )].
Above we use cyclic property of trace and idempotent property of projection.
Since Y does not depend on U , the problem above is equivalent to
1
(3) max V ar(U T Y ) = max trace(U T Y Y T U ) = max trace(U T Σ̂n U )
U U T =I k U U T =I k n U U T =Ik

where Σ̂n = n1 Y Y T = n1 (X − µ̂n 1T )(X − µ̂n 1T )T is the sample variance. Assume


that the sample covariance matrix, which is positive semi-definite, has the eigen-
value decomposition Σ̂n = Û Λ̂Û T , where Û T Û = I, Λ = diag(λ̂1 , . . . , λ̂n ), and
λ̂1 ≥ . . . ≥ λ̂n ≥ 0. Then
k
X
max trace(U T Σ̂n U ) = λ̂i
U U T =Ik
i=1
6. DUAL ROLES OF MDS VS. PCA IN SVD 11

In fact when k = 1, the maximal covariance is given by the largest eigenvalue along
the direction of its associated eigenvector,
max uT Σ̂n u =: λ̂1 .
kuk=1

Restricted on the orthogonal subspace u ⊥ û1 will lead to


max uT Σ̂n u =: λ̂2 ,
kuk=1,uT û1 =0

and so on.
Here we conclude that the k-affine space can be discovered by eigenvector de-
composition of Σ̂n . The sample principal components are defined as column vectors
of Q̂ = Û T Y , where the j-th observation has its projection on the k-th component
as q̂k (j) = ûTk yj = ûTk (xi − µ̂n ). Therefore, PCA takes the eigenvector decompo-
sition of Σ̂n = Û Λ̂Û T and studies the projection of centered data points on top k
eigenvectors as the principle components. This is equivalent to the singular value
decomposition (SVD) of X = [x1 , . . . , xn ]T ∈ Rn×p in the following sense,
1 T
Y =X− 11 X = Ũ S̃ Ṽ T , 1 = (1, . . . , 1)T ∈ Rn
n
where top right singular vectors of centered data matrix Y gives the same principle
components. From linear algebra, k-principal components thus gives the best rank-
k approximation of centered data matrix Y .
Given a PCA, the following quantities are often used to measure the variances
• total variance:
Xp
trace(Σ̂n ) = λ̂i ;
i=1
• percentage of variance explained by top-k principal components:
k
X
λ̂i /trace(Σ̂n );
i=1

• generalized variance as total volume:


p
Y
det(Σ̂n ) = λ̂i .
i=1

Example. Take the dataset of hand written digit “3”, X̂ ∈ R658×256 contains
658 images, each of which is of 16-by-16 grayscale image as hand written digit 3.
Figure 2 shows a random selection of 9 images, the sorted singular values divided
by total sum of singular values, and an approximation of x1 by top 3 principle
components: x1 = µ̂n − 2.5184ṽ1 − 0.6385ṽ2 + 2.0223ṽ3 .

6. Dual Roles of MDS vs. PCA in SVD


Consider the data matrix
X = [x1 , . . . , xn ]T ∈ Rn×p .
Let the centered data admits a singular vector decomposition (SVD),
e = X − 1 11T X = U
X e SeVe T , 1 = (1, . . . , 1)T ∈ Rn .
n
12 1. MULTIDIMENSIONAL SCALING AND PRINCIPAL COMPONENT ANALYSIS

(a) (b)

≈ - 2.52 - 0.64 + 2.02


(c)

Figure 2. (a) random 9 images. (b) percentage of singular values


over total sum. (c) approximation of the first image by top 3
principle components (singular vectors).

We have seen that both MDS and PCA can be obtained from such a SVD of
centered data matrix.
1/2
• MDS embedding is given by top k left singular vectors YkM DS = Uek S
fk ∈
Rn×k ;
1/2
• PCA embedding is given by top k right singular vectors YkP CA = Vek S
fk ∈
Rn×k .
Altogether U fk Ve T gives best rank-k approximation of X
ek S e in any unitary invariant
k
norms.
CHAPTER 2

Random Projections and Almost Isometry

1. Introduction
For this class, we introduce Random Projection method which may reduce the
dimensionality of n points in Rp to k = O(c() log n) at the cost of a uniform met-
ric distortion of at most  > 0, with high probability. The theoretical basis of this
method was given as a lemma by Johnson and Lindenstrauss [JL84] in the study
of a Lipschitz extension problem. The result has a widespread application in math-
ematics and computer science. The main application of Johnson-Lindenstrauss
Lemma in computer science is high dimensional data compression via random pro-
jections [Ach03]. In 2001, Sanjoy Dasgupta and Anupam Gupta [DG03a], gave
a simple proof of this theorem using elementary probabilistic techniques in a four-
page paper. Below we are going to present a brief proof of Johnson-Lindenstrauss
Lemma based on the work of Sanjoy Dasgupta, Anupam Gupta [DG03a], and
Dimitris Achlioptas [Ach03].
Recall the problem of MDS: given a set of points xi ∈ Rp (i = 1, 2, · · · , n);
form a data Matrix X p×n = [X1 , X2 · · · Xn ]T , when p is large, especially in some
cases larger than n, we want to find k-dimensional projection with which pairwise
distances of the data point are preserved as well as possible. That is to say, if we
know the original pairwise distance dij = kXi − Xj k or data distances with some
disturbance d˜ij = kXi − Xj k + ij , we want to find Yi ∈ Rk s.t.:
X
(4) min (kYi − Yj k2 − d2ij )2
i,j

take the derivative w.r.t Yi ∈ R :


k
X
(kYi k2 + kYj k2 − 2YiT Yj − d2ij )(Yi − Yj ) = 0
i,j
P P P
which implies i Yi = j Yj . For simplicity set i Yi = 0, i.e.putting the origin
as data center.
Use a linear transformation to move the sample mean to be the origin of the
coordinates, i.e. define a matrix K = − 21 HDH where D = (d2ij ), H = I − n1 11T ,
then, the minimization (4) is equivalent to find Yi ∈ Rk :
(5) min kY T Y − Kk2F
Y ∈Rk×n

then the row vectors of matrix Y are the eigenvectors (singular vectors) correspond-
ing to k largest eigenvalues (singular values) of B.
The main features of MDS are the following.
• MDS looks for Euclidean embedding of data whose total or average metric
distortion are minimized.
13
14 2. RANDOM PROJECTIONS AND ALMOST ISOMETRY

• MDS embedding basis is adaptive to the data, namely as a function of


data via eigen-decomposition.
Note that distortion measure here amounts to a certain distance between the set
of projected points and the original set of points B. Under the Frobenius norm the
distortion equals the sum of the squared lengths of these vectors. It is clear that
such vectors captures a significant global property, but it does not offer any local
guarantees. Chances are that some points deviate greatly from the original if we
only consider the total metric distortion minimization.
What if we want a uniform control on metric distortion at every data pair, say
(1 − )dij ≤ kYi − Yj k ≤ (1 + )dij ?
Such an embedding is an almost isometry or a Lipschitz mapping from metric space
X to Euclidean space Y. If X is an Euclidean space (or more generally Hilbert
space), Johnson-Lindenstrauss Lemma tells us that one can take Y as a subspace
of X of dimension k = O(c() log n) via random projections to obtain an almost
isometry with high probability. As a contrast to MDS, the main features of this
approach are the following.
• Almost isometry is achieved with a uniform metric distortion bound (Lip-
schitz bound), with high probability, rather than average metric distortion
control;
• The mapping is universal, rather than being adaptive to the data.

2. The Johnson-Lindenstrauss Lemma


Theorem 2.1 (Johnson-Lindenstrauss Lemma). For any 0 <  < 1 and any integer
n, let k be a positive integer such that
k ≥ (4 + 2α)(2 /2 − 3 /3)−1 ln n, α > 0.
Then for any set V of n points in Rd , there is a map f : Rd → Rk such that for all
u, v ∈ V
(6) (1 − ) k u − v k2 ≤k f (u) − f (v) k2 ≤ (1 + ) k u − v k2
Such a f in fact can be found in randomized polynomial time. In fact, inequalities
(6) holds with probability at least 1 − 1/nα .
Remark. We have following facts.
(1) The embedding dimension k = O(c() log n) which is independent to am-
bient dimension d and logarithmic to the number of samples n. The
independence to d in fact suggests that the Lemma can be generalized to
the Hilbert spaces of infinite dimension.
(2) How to construct the map f ? In fact we can use random projections:
Y n×k = X n×d Rd×k
where the following random matrices R can cater our needs.
• R = [r1 ,√· · · , rk ] ri ∈ S d−1 ri = (ai1 , · · · , aid )/ k ai k aik ∼ N (0, 1)
• R = A/ k Aij ∼ N ( (0, 1)
√ 1 p = 1/2
• R = A/ k Aij =
−1 p = 1/2
2. THE JOHNSON-LINDENSTRAUSS LEMMA 15


p 1
 p = 1/6
• R = A/ k/3 Aij = 0 p = 2/3

−1 p = 1/6

The proof below actually takes the first form of R as an illustration.
Now we are going to prove Johnson-Lindenstrauss Lemma using a random
projection to k-subspace in Rd . Notice that the distributions of the following two
events are identical:

unit vector was randomly projected to k-subspace


⇐⇒ random vector on S d−1 fixed top-k coordinates.
Based on this observation, we change our target from random k-dimensional pro-
jection to random vector on sphere S d−1 .
If xi ∼ N (0, 1), (i = 1, · · · , d), X = (x1 , · · · , xd ), then Y = X/kxk ∈ S d−1 is
uniformly distributed. Fixing top-k coordinates, we get z = (x1 , · · · , xk , 0, · · · , 0)T /kxk ∈
Rd . Let L = kZk2 and µ = E[L] = k/d.
The following lemma is crucial to reach the main theorem.
Lemma 2.2. let any k < d then we have
(a) if β < 1 then
d−k/2
(1 − β)k
  
k
Prob[L ≤ βµ] ≤ β k/2 1 − ≤ exp (1 − β + ln β)
d−k 2
(b) if β > 1 then
d−k/2
(1 − β)k
  
k
Prob[L ≥ βµ] ≤ β k/2 1 + ≤ exp (1 − β + ln β)
d−k 2
Here µ = k/d.
We first show how to use this lemma to prove the main theorem – Johnson-
Lindenstrauss lemma.

Proof of Johnson-Lindenstrauss Lemma. If d ≤ k,the theorem is trivial.


Otherwise take a random k-dimensional subspace S, and let vi0 be the projection
of point vi ∈ V into S, then setting L = kvi0 − vj0 k2 and µ = (k/d)kvi − vj k2 and
applying Lemma 2(a), we get that

k
Prob[L ≤ (1 − )µ] ≤ exp( (1 − (1 − ) + ln(1 − )))
2
k 2
≤ exp( ( − ( + ))),
2 2
by ln(1 − x) ≤ −x − x2 /2 for 0 ≤ x < 1
k2
= exp(− )
4
≤ exp(−(2 + α) ln n), for k ≥ 4(1 + α/2)(2 /2)−1 ln n
1
= 2+α
n
16 2. RANDOM PROJECTIONS AND ALMOST ISOMETRY

k
Prob[L ≥ (1 + )µ] ≤ exp( (1 − (1 + ) + ln(1 + )))
2
k 2 3
≤ exp( (− + ( − + ))),
2 2 3
by ln(1 + x) ≤ x − x2 /2 + x3 /3 for x ≥ 0
k
= exp(− (2 /2 − 3 /3)),
2
≤ exp(−(2 + α) ln n), for k ≥ 4(1 + α/2)(2 /2 − 3 /3)−1 ln n
1
= 2+α
n
r r
d 0 d
Now set the map f (x) = x = (x1 , . . . , xk , 0, . . . , 0). By the above
k k
calculations, for some fixed pair i, j, the probability that the distortion
kf (vi ) − f (vj )k2
kvi − vj k2
2
does not lie in the range [(1 − ), (1 + )] is at most n(2+α) . Using the trivial union
2
bound with Cn pairs, the chance that some pair of points suffers a large distortion
is at most:  
2 1 1 1
Cn2 (2+α) = α 1 − ≤ α.
n n n n
1
Hence f has the desired properties with probability at least 1 − α . This gives us
n
a randomized polynomial time algorithm. 

Now, it remains to Lemma 3.6.

Proof of Lemma 3.6.


k
X d
X
Prob(L ≤ βµ) =Prob( (x2i ) ≤ βµ( (x2i )))
i=1 i=1
d
X k
X
=Prob(βµ (x2i ) − (x2i ) ≤ 0)
i=1 i=1
d
X k
X
=Prob[exp(tβµ (x2i ) − t (x2i )) ≤ 1] (t > 0)
i=1 i=1
d
X k
X
≤E[exp(tβµ (x2i ) − t (x2i ))] (by M arkov 0 s inequality)
i=1 i=1
=Πki=1 E exp(t(βµ − 1)x2i )Πdi=k+1 Eexp(t(βµ)x2i )
=(E exp(t(βµ − 1)x2 ))k (E exp(tβµ2 ))d−k
=(1 − 2t(βµ − 1))−k/2 (1 − 2tβµ)−(d−k)/2
2 1
We use the fact that if X ∼ N (0, 1),then E[esX ] = p , for −∞ < s < 1/2.
(1 − 2s)
3. EXAMPLE: MDS IN HUMAN GENOME DIVERSITY PROJECT 17

Now we will refer to last expression as g(t). The last line of derivation gives
us the additional constraints that tβµ ≤ 1/2 and t(βµ − 1) ≤ 1/2, and so we have
0 < t < 1/(2βµ). Now to minimize g(t), which is equivalent to maximize
h(t) = 1/g(t) = (1 − 2t(βµ − 1))k/2 (1 − 2tβµ)(d−k)/2
in the interval 0 < t < 1/(2βµ). Setting the derivative h0 (t) = 0, we get the
maximum is achieved at
1−β
t0 =
2β(d − βk)
Hence we have
d − k (d−k)/2
h(t0 ) = ( ) (1/β)k/2
d − kβ
And this is exactly what we need.
The proof of Lemma 3.6 (b) is almost exactly the same as that of Lemma 3.6
(a). 
2.1. Conclusion. As we can see, this proof of Lemma is both simple (using
just some elementary probabilistic techniques) and elegant. And you may find
in the field of machine learning, stochastic method always turns out to be really
powerful. The random projection method we approaching today can be used in
many fields especially huge dimensions of data is concerned. For one example, in
the term document, you may find it really useful for compared with the number
of words in the dictionary, the words included in a document is typically sparse
(with a few thousands of words) while the dictionary is hugh. Random projections
often provide us a useful tool to compress such data without losing much pairwise
distance information.

3. Example: MDS in Human Genome Diversity Project


Now consider a SNPs (Single Nucleid Polymorphisms) dataset in Human Genome
Diversity Project (HGDP, http://www.cephb.fr/en/hgdp_panel.php) which con-
sists of a data matrix of n-by-p for n = 1064 individuals around the world and
p = 644258 SNPs. Each entry in the matrix has 0, 1, 2, and 9, representing “AA”,
“AC”, “CC”, and “missing value”, respectively. After removing 21 rows with all
missing values, we are left with a matrix X of size 1043 × 644258.
Consider the projection of 1043 persons on the MDS (PCA) coordinates. Let
H = I − n1 11T be the centering matrix. Then define
K = HXX T H = U ΛU T
which is a positive semi-define matrix as centered Gram matrix whose√eigenvalue
decomposition is given by U ΛU T . Taking the first two eigenvectors λi ui (i =
1, . . . , 2) as the projections of n individuals, Figure 1 gives the projection plot.
It is interesting to note that the point cloud data exhibits a continuous trend of
human migration in history: origins from Africa, then migrates to the Middle East,
followed by one branch to Europe and another branch to Asia, finally spreading
into America and Oceania.
One computational concern is that the high dimensionality caused by p =
644, 258, which is much larger than the number of samples n = 1043. However
random projections introduced above will provide us an efficient way to compute
MDS (PCA) principal components with an almost isometry.
18 2. RANDOM PROJECTIONS AND ALMOST ISOMETRY

We randomly select (without replacement) {ni , i = 1, . . . , k} from 1, . . . , p with


equal probability. Let R ∈ Rk×p is a Bernoulli random matrix satisfying:
(
1/k j = ni ,
Rij =
0 otherwise.
Now define
e = H(XRT )(RX T )H
K
whose eigenvectors leads to new principal components of MDS. In the middle and
right, Figure 1 plots the such approximate MDS principal components with k =
5, 000, and k = 100, 000, respectively. These plots are qualitatively equivalent to
the original one.

Figure 1. (Left) Projection of 1043 individuals on the top 2 MDS


principal components. (Middle) MDS computed from 5,000 ran-
dom projections. (Right) MDS computed from 100,000 random
projections. Pictures are due to Qing Wang.

4. Random Projections and Compressed Sensing


There are wide applications of random projections in high dimensional data
processing, e.g. [Vem04]. Here we particularly choose a special one, the com-
pressed (or compressive) sensing (CS) where we will use the Johnson-Lindenstrauss
Lemma to prove the Restricted Isometry Property (RIP), a crucial result in CS. A
reference can be found at [BDDW08].
Compressive sensing can be traced back to 1950s in signal processing in geog-
raphy. Its modern version appeared in LASSO [Tib96] and BPDN [CDS98], and
achieved a highly noticeable status by [CT05, CRT06, CT06]. For a comprehen-
sive literature on this topic, readers may refer to http://dsp.rice.edu/cs.
The basic problem of compressive sensing can be expressed by the following
under-determined linear algebra problem. Assume that a signal x∗ ∈ Rp is sparse
with respect to some basis (measurement matrix) Φ ∈ Rn×p where n < p, given
measurement b = Φx∗ ∈ Rn , how can one recover x∗ by solving the linear equation
system
(7) Φx = b?
As n < p, it is an under-determined problem, whence without further constraint,
the problem does not have an unique solution. To overcome this issue, one popular
4. RANDOM PROJECTIONS AND COMPRESSED SENSING 19

assumption is that the signal x∗ is sparse, namely the number of nonzero compo-
nents kx∗ k0 := #{x∗i 6= 0 : 1 ≤ i ≤ p} is small compared to the total dimensionality
p. Figure 2 gives an illustration of such sparse linear equation problem.

Figure 2. Illustration of Compressive Sensing (CS). Φ is a rect-


angular matrix with more columns than rows. The dark elements
represent nonzero elements while the light ones are zeroes. The
signal vector x∗ , although high dimensional, is sparse.

With such a sparse assumption, we would like to find the sparsest solution
satisfying the measurement equation.
(8) (P0 ) min kxk0
s.t. Φx = b.
This is an NP-hard combinatorial optimization problem. A convex relaxation of
(8) is called Basis Pursuit [CDS98],
X
(9) (P1 ) min kxk1 := |xi |
s.t. Φx = b.
This is a linear programming problem. Figure 3 shows different projections of a
sparse vector x∗ under l0 , l1 and l2 , from which one can see in some cases the
convex relaxation (9) does recover the sparse signal solution in (8). Now a natural
problem arises, under what conditions the linear programming problem (P1 ) has
the solution exactly solves (P0 ), i.e. exactly recovers the sparse signal x∗ ?

Figure 3. Comparison between different projections. Left: pro-


jection of x∗ under k · k0 ; middle: projection under k · k1 which
favors sparse solution; right: projection under Euclidean distance.

To understand the equivalence between (P0 ) and (P1 ), one asks the question
when the true signal x∗ is the unique solution of P0 and P1 . In such cases, P1 is
20 2. RANDOM PROJECTIONS AND ALMOST ISOMETRY

equivalent to P0 . For the uniqueness of P1 , one turns to the duality of Linear Pro-
gramming via the Karush-Kuhn-Tucker (KKT) conditions. Take the Lagrangian of
(P1 ),
L(x; λ) = kxk1 + λT (Φx − b), λ ∈ Rn .
Assume the support of x∗ as T ⊆ {1, . . . , p}, i.e. T = {1 ≤ i ≤ p : xi 6= 0}, and
denote its complement by T c . x∗ is an optimal solution of P1 if
0 ∈ ∂L(x∗ , λ)
which implies that sign(x∗T ) = ΦTT λ and |ΦTT c λ| ≤ 1. How to ensure that there are
no other solutions than x∗ ? The following condition is used in [CT05] and other
related works.
Lemma 4.1. Assume that ΦT is of full rank. If there exists λ ∈ Rn such that:
(1) For each i ∈ T ,
(10) ΦTi λ = sign(x∗i );
(2) For each i ∈ T c ,
(11) |ΦTi λ| < 1.
Then P1 has a unique solution x∗ .
These two conditions just ensure a special dual variable λ exists, under which
any optimal solution of P1 must have the same support T as x∗ (strictly comple-
mentary condition in (2)). Since ΦT is of full rank, then P1 must have a unique
solution x∗ . In this case solving P1 is equivalent to P0 . If these conditions fail,
then there exists a problem instance (Φ, b) such that P1 has a solution different to
x∗ . In this sense, these conditions are necessary and sufficient for the equivalence
between P1 and P0 .
Various sufficient conditions have been proposed in literature to meet the KKT
conditions above. For example, these includes the mutual incoherence by Donoho-
Huo (1999) [DH01], Elad-Bruckstein (2001) [EB01] and the Exact Recovery Con-
dition by Tropp [Tro04] or Irrepresentative condition (IRR) by Zhao-Yu [ZY06]
(see also [MY09]). The former condition essentially requires Φ to be a nearly
orthogonal matrix,
µ(Φ) = max |φTi φj |,
i6=j
where Φ = [φ1 , . . . , φp ] and kφi k2 = 1, under which [DH01] shows that as long as
sparsity of x∗ satisfies
1
1 + µ(Φ)
kx∗ k0 = |T | <
2
which is later improved by [EB01] to be

2 − 12
kx∗ k0 = |T | < ,
µ(Φ)
then P1 recovers x∗ . The latter assumes that the dual variable λ lies in the column
space of AT , i.e. λ = ΦT α. Then we solve λ explicitly in equation (10) and plugs
in the solution to the inequality (11)
kΦTT c ΦT (ΦTT ΦT )−1 sign(x∗ |T )k∞ < 1
or simply
kΦTT c ΦT (ΦTT ΦT )−1 k∞ < 1.
4. RANDOM PROJECTIONS AND COMPRESSED SENSING 21

If for every k-sparse signal x∗ with support T , conditions above are satisfied, then
P1 recovers x∗ .
The most popular condition is proposed by [CRT06], called Restricted Isom-
etry Property (RIP).
Definition. Define the isometry constant δk of a matrix Φ to be the smallest
nonnegative number such that
(1 − δk )kxk22 ≤ kΦxk22 ≤ (1 + δk )kxk22
holds for all k-sparse vectors x ∈ Rp . A vector x is called k-sparse if it has at most
k nonzero elements.
[AC09] shows that incoherence conditions implies RIP, whence RIP is a weaker
condition. Under RIP condition, uniqueness of P0 and P1 can be guaranteed for all
k-sparse signals, often called uniform exact recovery[Can08].
Theorem 4.2. The following holds for all k-sparse x∗ satisfying Φx∗ = b.
1, then problem P0 has a unique solution x∗ ;
(1) If δ2k < √
(2) If δ2k < 2 − 1, then the solution of P1 (9) has a unique solution x∗ , i.e.
recovers the original sparse signal x∗ .
The first condition is nothing but every 2k-columns of Φ are linearly dependent.
To see the first condition, assume by contradiction that there is another k-sparse
solution of P0 , x0 . Then by Φy = 0 and y = x∗ − x0 is 2k-sparse. If y 6= 0, it violates
δ2k < 1 such that 0 = kΦyk ≥ (1 − δ2k )kyk > 0. Hence one must have y = 0, i.e.
x∗ = x0 which proves the uniqueness of P0 . The first condition is also necessary for
the uniqueness of P0 ’s solutions. In fact, if δ2k = 1, this implies that there is a 2k-
subset 2T such that columns of Φ2T are linearly dependent, i.e. Φ2T z = 0 for some
2k-vector z. One can define x1 to collect first k nonzero elements of z with zero
otherwise, and x2 to collect the second half nonzero entries of z but zero otherwise.
Hence Φ2T (x1 + x2 ) = 0 ⇒ ΦT1 x1 = 0 = ΦT2 x2 with T1 and T2 consisting the
first and second k columns of Φ2T respectively, which violates the uniqueness of P0
solutions. The proof of the second condition can be found in [Can08].
When measurement noise exists, e.g. b = Φx+e with bound kek2 , the following
Basis Pursuit De-Noising (BPDN) [CDS98] or LASSO [Tib96] are used instead

(12) (BP DN ) min kxk1


s.t. kΦx − bk2 ≤ .

(13) (LASSO) minp kΦx − bk2 + λkxk1


x∈R

For bounded kek∞ , the following formulation is used in network analysis [JYLG12]
(14) min kxk1
s.t. kΦx − bk∞ ≤ 
RIP conditions also lead to upper bounds between solutions above and the
true sparse signal x∗ . For example, in the case of BPDN the follwoing result holds
[Can08].
22 2. RANDOM PROJECTIONS AND ALMOST ISOMETRY


Theorem 4.3. Suppose that kek2 ≤ . If δ2k < 2 − 1, then

kx̂ − x k2 ≤ C1 k −1/2 σk1 (x∗ ) + C2 ,
where x̂ is the solution of BPDN and
σk1 (x∗ ) = min kx∗ − yk1
supp(y)≤k

is the best k-term approximation error in l1 of x∗ .


How to find matrices satisfying RIP? Equipped with Johnson-Lindenstrauss
Lemma, one can construct such matrices by random projections with high proba-
bility [BDDW08].
Recall that in the Johnson-Lindenstrauss Lemma, a random matrix Φ ∈ Rn×p
with each element is i.i.d. according to some distribution satisfying certain bounded
moment conditions, e.g. Φij ∼ N (0, 1). The key step to establish Johnson-
Lindenstrauss Lemma is the following fact
Pr kΦxk22 − kxk22 ≥ kxk22 ≤ 2e−nc0 () .

(15)
With this one can establish a bound on the action of Φ on k-sparse x by an union
bound via covering numbers of k-sparse signals.
Lemma 4.4. Let Φ ∈ Rn×p be a random matrix satisfying the concentration
inequality (15). Then for any δ ∈ (0, 1) and any set all T with |T | = k < n, the
following holds
(16) (1 − δ)kxk2 ≤ kΦxk2 ≤ (1 + δ)kxk2
for all x whose support is contained in T , with probability at least
 2
12
(17) 1−2 e−c0 (δ/2)n .
δ
Proof. It suffices to prove the results when kxk2 = 1 as Φ is linear. Let
XT := {x : supp(x) = T, kxk2 = 1}. We first choose QT , a δ/4-cover of XT , such
that for every x ∈ XT there exists q ∈ QT satisfying kq − xk2 ≤ δ/4. Since XT
has dimension at most k, it is well-known from covering numbers that the capacity
#(QT ) ≤ (12/δ)k . Now we are going to apply the union bound of (15) to the set QT
with  = δ/2. For each q ∈ QT , with probability at most 2e−c0 (δ/2)n , |Φqk22 −kqk22 ≥
δ/2kqk22 . Hence for all q ∈ QT , the same bound holds with probability at most
 2
−c0 (δ/2)n 12
2#(QT )e =2 e−c0 (δ/2)n .
δ
Now we define α to be the smallest constant such that
kΦxk2 ≤ (1 + α)kxk2 , for all x ∈ XT .
We can show that α ≤ δ with the same probability. For this, pick up a q ∈ QT
such that kq − xk2 ≤ δ/4, whence by the triangle inequality
kΦxk2 ≤ kΦqk2 + kΦ(x − q)k2 ≤ 1 + δ/2 + (1 + α)δ/4.
This implies that α ≤ δ/2 + (1 + α)δ/4, whence α ≤ 3δ/4/(1 − δ/4) ≤ δ. This gives
the upper bound. The lower bound also follows this since
kΦxk2 ≥ kΦqk2 − kΦ(x − q)k2 ≥ 1 − δ/2 − (1 + δ)δ/4 ≥ 1 − δ,
which completes the proof. 
4. RANDOM PROJECTIONS AND COMPRESSED SENSING 23

p

With this lemma, note that there are at most k subspaces of k-sparse, an
union bound leads to the following result for RIP.
Theorem 4.5. Let Φ ∈ Rn×p be a random matrix satisfying the concentration
inequality (15) and δ ∈ (0, 1). There exists c1 , c2 > 0 such that if
n
k ≤ c1
log(p/k)
the following RIP holds
(1 − δk )kxk22 ≤ kΦxk22 ≤ (1 + δk )kxk22
with probability at least 1 − 2e−c2 n .
Proof. For each of k-sparse signal (XT ), RIP fails with probability at most
 2
12
2 e−c0 (δ/2)n .
δ
There are kp ≤ (ep/k)k such subspaces. Hence, RIP fails with probability at most


 ep k  12 2
2 e−c0 (δ/2)n = 2e−c0 (δ/2)n+k[log(ep/k)+log(12/δ)] .
k δ
Thus for a fixed c1 > 0, whenever k ≤ c1 n/ log(p/k), the exponent above will
be ≤ −c2 n provided that c2 ≤ c0 (δ/2) − c1 (1 + (1 + log(12/δ))/ log(p/k). c2 can be
always chosen to be > 0 if c1 > 0 is small enough. This leads to the results. 
Another use of random projections (random matrices) can be found in Robust
Principal Component Analysis (RPCA) in the next chapter.
CHAPTER 3

High Dimensional Statistics: Mean and


Covariance in Noise

In this very first lecture, we talk about data representation as vectors, matrices
(esp. graphs, networks), and tensors, etc. Data are mappings of real world based
on sensory measurements, whence the real world puts constraints on the variations
of data. Data science is the study of laws in real world which shapes the data.
We start the first topic on sample mean and variance in high dimensional
Euclidean spaces Rp , as the maximal likelihood estimators based on multivariate
Gaussian assumption. Principle Component Analysis (PCA) is the projection of
high dimensional data on its top singular vectors. In classical statistics with the
Law of Large Numbers, for fixed p when sample size n → ∞, we know such sample
mean and variance will converge, so as to PCA. Although sample mean µ̂n and
sample covariance Σ̂n are the most commonly used statistics in multivariate data
analysis, they may suffer some problems in high dimensional settings, e.g. for large
p and small n scenario. In 1956, Stein [Ste56] shows that the sample mean is
not the best estimator in terms of the mean square error, for p > 2; moreover
in 2006, Jonestone [Joh06] shows by random matrix theory that PCA might be
overwhelmed by random noise for fixed ratio p/n when n → ∞. Among other
works, these two pieces of excellent works inspired a long pursuit toward modern
high dimensional statistics with a large unexplored field ahead.

1. Maximum Likelihood Estimation


Consider the statistical model f (X|θ) as a conditional probability function
on Rp with parameter space θ ∈ Θ. Let X1 , ..., Xn ∈ Rp are independently and
identically distributed (i.i.d.) sampled according to f (X|θ0 ) on Rp for some θ0 ∈ Θ.
The likelihood function is defined as the probability of observing the given data as
a function of θ,
Yn
L(θ) = f (Xi |θ),
i=1
and a maximum likelihood estimator is defined as
n
Y
θ̂nM LE ∈ arg max L(θ) = arg max f (Xi |θ)
θ∈Θ θ∈Θ
i=1
which is equivalent to
n
1X
arg max log f (Xi |θ).
θ∈Θ n i=1
Under some regularity conditions, the maximum likelihood estimator θ̂nM LE has the
following nice limiting properties:
25
26 3. HIGH DIMENSIONAL STATISTICS: MEAN AND COVARIANCE IN NOISE

A. (Consistency) θ̂nM LE → θ0 , in probability and almost surely.



B. (Asymptotic Normality) n(θ̂nM LE − θ0 ) → N (0, I0−1 ) in distribution,
where I0 is the Fisher Information matrix
∂ ∂2
I(θ0 ) := E[( log f (X|θ0 ))2 ] = −E[ 2 log f (X|θ0 )].
∂θ ∂θ
C. (Asymptotic Efficiency) limn→∞ cov(θ̂nM LE ) = I −1 (θ0 ). Hence θ̂nM LE is
the Uniformly Minimum-Variance Unbiased Estimator, i.e. the estimator
with the least variance among the class of unbiased estimators, for any
unbiased estimator θ̂n , limn→∞ var(θ̂nM LE ) ≤ limn→∞ var(θ̂nM LE ).
However in finite sample case, there are better estimators than MLEs, which include
some bias in further reduction of variance.
1.1. Example: Multivariate Normal Distribution. For example, con-
sider the normal distribution N (µ, Σ),
 
1 1
f (X|µ, Σ) = p exp − (X − µ)T Σ−1 (X − µ) ,
(2π)p |Σ| 2
where |Σ| is the determinant of covariance matrix Σ.
To get the MLE of normal distribution, we need to
n
Y 1
max P (X1 , ..., Xn |µ, Σ) = max p exp[−(Xi − µ)T Σ−1 (Xi − µ)]
µ,Σ µ,Σ
i=1
2π|Σ|
It is equivalent to maximize the log-likelihood
n
1X n
I = log P (X1 , ..., Xn |µ, Σ) = − (Xi − µ)T Σ−1 (Xi − µ) − log |Σ| + C
2 i=1 2
Let µ∗ is the MLE of µ, we have
n
∂I X
0= = − Σ−1 (Xi − µ∗ )
∂µ∗ i=1
n
1X
⇒ µ∗ = Xi = µ̂n
n i=1
To get the estimation of Σ, we need to maximize
n
1 X n
I(Σ) = trace(I) = − trace (Xi − µ)T Σ−1 (Xi − µ) − trace log |Σ| + C
2 i=1
2

n n
1 X 1X
− trace (Xi − µ)T Σ−1 (Xi − µ) = − trace[Σ−1 (Xi − µ)(Xi − µ)T ]
2 i=1
2 i=1
1
= − (traceΣ−1 Σ̂n )(n − 1)
2
n−1 1 1
= − trace(Σ−1 Σ̂n2 Σ̂n2 )
2
n−1 1 1
= − trace(Σ̂n2 Σ−1 Σ̂n2 )
2
n−1
= − trace(S)
2
2. BIAS-VARIANCE DECOMPOSITION OF MEAN SQUARE ERROR 27

where
n
1 X
Σ̂n = (Xi − µ̂n )(Xi − µ̂n )T ,
n − 1 i=1
1 1
S = Σ̂n2 Σ−1 Σ̂n2 is symmetric and positive definite. Above we repeatedly use cyclic
property of trace:
• trace(AB) = trace(BA), or more generally
• (invariance under cyclic permutation group) trace(ABCD) = trace(BCDA) =
trace(CDAB) = trace(DABC).
Then we have
−1 −1
Σ = Σ̂n 2 S −1 Σ̂n 2
n n n
− log |Σ| = log |S| + log |Σ̂n | = f (Σ̂n )
2 2 2
Therefore,
n−1 n
max I(Σ) ⇔ min trace(S) − log |S| + Const(Σ̂n , 1)
2 2
Suppose S = U ΛU is the eigenvalue decomposition of S, Λ = diag(λi )
p p
n−1X nX
J= λi − log(λi ) + Const
2 i=1 2 i=1
∂J n−1 n 1 n
= − ⇒ λi =
∂λi 2 2 λi n−1
n
S= Ip
n−1
This gives the MLE solution
n
n−1 1X
Σ∗ = Σ̂n = (Xi − µ̂n )(Xi − µ̂n )T ,
n n i=1

which differs to Σ̂n only in that the denominator (n − 1) is replaced by n. In


covariance matrix, (n − 1) is used because for a single sample n = 1, there is no
variance at all.
Fixed p, when n → ∞, MLE satisfies µ̂n → µ and Σ̂n → Σ. However as we can
see in the following classes, they are not the best estimators when the dimension of
the data p gets large, with finite sample n.

2. Bias-Variance Decomposition of Mean Square Error


Consider multivariate Gaussian model: let X1 , . . . , Xn ∼ N (µ, Σ), Xi ∈ Rp (i =
1 . . . n), then the maximum likelihood estimators (MLE) of the parameters (µ and
Σ) are as follows:
n n
1X 1X
µ̂M
n
LE
= Xi , Σ̂M
n
LE
= (Xi − µ̂n )(Xi − µ̂n )T .
n i=1 n i=1

For simplicity, take a coordinate transform (PCA) Yi = U T Xi where Σ = U ΛU T


is an eigen-decomposition. Assume that Λ = σ 2 Ip and n = 1, then it suffices to
consider Y ∼ N (µ, σ 2 Ip ) in the sequel. In this case µ̂M LE = Y .
28 3. HIGH DIMENSIONAL STATISTICS: MEAN AND COVARIANCE IN NOISE

To measure the performance of an estimator µ̂n , one may look at the following
so-called risk,
R(µ̂n , µ) = EL(µ̂n , µ)
where the loss function takes the square loss here
L(µ̂n , µ) = kµ̂n − µk2 .
The mean square error (MSE) to measure the risk enjoys the following bias-
variance decomposition, from the Pythagorean theorem.
R(µ̂n , µ) = Ekµ̂n − E[µ̂n ] + E[µ̂n ] − µk2
= Ekµ̂n − E[µ̂n ]k2 + kE[µ̂n ] − µk2
=: V ar(µ̂n ) + Bias(µ̂n )2
Example 1. For the simple case Yi ∼ N (µ, σ 2 Ip ) (i = 1, . . . , n), the MLE estimator
satisfies
Bias(µ̂M
n
LE
)=0
and
p 2
V ar(µ̂M
n
LE
)= σ
n
In particular for n = 1, V ar(µ̂M LE ) = σ 2 p for µ̂M LE = Y .
Example 2. MSE of Linear Estimators. Consider Y ∼ N (µ, σ 2 Ip ) and linear
estimator µ̂C = CY . Then we have
Bias(µ̂C ) = k(I − C)µk2
and
V ar(µ̂C ) = E[(CY −Cµ)T (CY −Cµ)] = E[trace((Y −µ)T C T C(Y −µ))] = σ 2 trace(C T C).
In applications, one often consider the diagonal linear estimators C = diag(ci ), e.g.
in Ridge regression
1 λ
min kY − Xβk2 + kβk2 .
µ 2 2
For diagonal linear estimators, the risk
p
X p
X
R(µ̂C , µ) = σ 2 c2i + (1 − ci )2 µ2i .
i=1 i=1

In this case, it is simple to find minimax risk over the hyper-rectangular model class
|µi | ≤ τi ,
p
X σ 2 τi2
inf sup R(µ̂C , µ) = .
ci |µ |≤τ
i i i=1
σ 2 + τi2
From here one can see that for those sparse model classes such that #{i : τi =
O(σ)} = k  p, it is possible to get smaller risk using linear estimators than MLE.
In general, is it possible to introduce some biased estimators which significantly
reduces the variance such that the total risk is smaller than MLE uniformly for all
µ? This is the notion of inadmissibility introduced by Charles Stein in 1956 and he
find the answer is YES by presenting the James-Stein estimators, as the shrinkage
of sample means.
3. STEIN’S PHENOMENON AND SHRINKAGE OF SAMPLE MEAN 29

3. Stein’s Phenomenon and Shrinkage of Sample Mean


Definition (Inadmissible). An estimator µ̂n of the parameter µ is called inadmis-
sible on Rp with respect to the squared risk if there exists another estimator µ∗n
such that
Ekµ∗n − µk2 ≤ Ekµ̂n − µk2 for all µ ∈ Rp ,
and there exist µ0 ∈ Rp such that
Ekµ∗n − µ0 k2 < Ekµ̂n − µ0 k2 .
In this case, we also call that µ∗n dominates µ̂n . Otherwise, the estimator µ̂n is
called admissible.
The notion of inadmissibility or dominance introduces a partial order on the
set of estimators where admissible estimators are local optima in this partial order.
Stein (1956) [Ste56] found that if p ≥ 3, then the MLE estimator µ̂n is inad-
missible. This property is known as Stein’s phenomenon. This phenomenon can
be described like:
For p ≥ 3, there exists µ̂ such that ∀µ,
R(µ̂, µ) < R(µ̂MLE , µ)
which makes MLE inadmissible.
A typical choice is the James-Stein estimator given by James-Stein (1961),
σ 2 (p − 2)
 
JS
µ̃n = 1 − µ̂M
n
LE
, σ = ε.
kµ̂Mn
LE k

Theorem 3.1. Suppose Y ∼ Np (µ, I). Then µ̂MLE = Y . R(µ̂, µ) = Eµ kµ̂ − µk2 ,
and define
p−2
 
JS
µ̂ = 1 − Y
kY k2
then
R(µ̂JS , µ) < R(µ̂MLE , µ)
We’ll prove a useful lemma first.

3.1. Stein’s Unbiased Risk Estimates (SURE). Discussions below are all
under the assumption that Y ∼ Np (µ, I).
Lemma 3.2. (Stein’s Unbiased Risk Estimates (SURE)) Suppose µ̂ = Y + g(Y ),
g satisfies 1
(1) gPis weakly differentiable.
p R
(2) i=1 |∂i gi (x)|dx < ∞
then
(18) R(µ̂, µ) = Eµ (p + 2∇T g(Y ) + kg(Y )k2 )
Pp ∂
where ∇T g(Y ) := i=1 ∂yi gi (Y ).

1cf. p38, Prop 2.4 [GE]


30 3. HIGH DIMENSIONAL STATISTICS: MEAN AND COVARIANCE IN NOISE

Examples of g(x): For James-Stein estimator


p−2
g(x) = − Y
kY k2
and for soft-thresholding, each component

 −λ xi > λ
gi (x) = −xi |xi | ≤ λ
λ xi < −λ

Both of them are weakly differentiable. But Hard-Thresholding:



0 |xi | > λ
gi (x) =
−xi |xi | ≤ λ
which is not weakly differentiable!
Proof. Let φ(y) be the density function of standard Normal distribution
Np (0, I).
R(µ̂, µ) = Eµ kY + g(Y ) − µk2
= Eµ p + 2(Y − µ)T g(Y ) + kg(Y )k2


p Z
X ∞
Eµ (Y − µ)T g(Y ) = (yi − µi )gi (Y )φ(Y − µ)dY
i=1 −∞
p Z ∞
X ∂
= −gi (Y ) φ(Y − µ)dY, derivative of Gaussian function
i=1 −∞
∂yi
p Z ∞
X ∂
= gi (Y )φ(Y − µ)dY, Integration by parts
i=1 −∞ ∂yi
= Eµ ∇ g(Y )T


Thus, we define
(19) U (Y ) := p + 2∇T g(Y ) + kg(Y )k2
for convenience, and R(µ̂, µ) = Eµ U (Y ).
This lemma is in fact called the Stein’s lemma in Tsybakov’s book [Tsy09]
(page 157∼158).

3.2. Risk of Linear Estimator.


µ̂C (Y ) = Cy
g(Y ) = (C − I)Y
X ∂
∇T g(Y ) = − ((C − I)Y ) = trace(C) − p
i
∂yi

U (Y ) = p + 2∇T g(Y ) + kg(Y )k2


= p + 2(trace(C) − p) + k(I − C)Y k2
= −p + 2trace(C) + k(I − C)Y k2
3. STEIN’S PHENOMENON AND SHRINKAGE OF SAMPLE MEAN 31

In applications, C = C(λ) often depends on some regularization parameter λ (e.g.


ridge regression). So one could find optimal λ∗ by minimizing the MSE over λ.
Suppose Y ∼ N (µ, σ 2 I),

R(µ̂C , µ) = k(I − C(λ))Y k2 − pσ 2 + 2σ 2 trace(C(λ)).

3.3. Risk of James-Stein Estimator. Recall


p−2
g(Y ) = − Y
kY k2

U (Y ) = p + 2∇T g(Y ) + kg(Y )k2


(p − 2)2
kg(Y )k2 =
kY k2
p−2 (p − 2)2
X ∂  
T
∇ g(Y ) = − Y = −
i
∂yi kY k2 kY k2
we have
(p − 2)2
R(µ̂JS , µ) = EU (Y ) = p − Eµ < p = R(µ̂MLE , µ)
kY k2
when p ≥ 3.
Problem. What’s wrong when p = 1? Does SURE still hold?
Remark. Indeed, we have the following theorem
Theorem 3.3 (Lemma 2.8 in Johnstone’s book (GE)). Y ∼ N (µ, I), ∀µ̂ = CY , µ̂
is admissable iff
(1) C is symmetric.
(2) 0 ≤ ρi (C) ≤ 1 (eigenvalue).
(3) ρi (C) = 1 for at most two i.
To find an upper bound of the risk of James-Stein estimator, notice that kY k2 ∼
χ (kµk2 , p) and 2
2

kµk2
 
d
χ2 (kµk2 , p) = χ2 (0, p + 2N ), N ∼ Poisson
2
we have
   
1 1
Eµ = EEµ N
kY k2 kY k2
1
= E
p + 2N − 2
1
≥ (Jensen’s Inequality)
p + 2EN − 2
1
=
p + kµk2 − 2
that is

2This is a homework.
32 3. HIGH DIMENSIONAL STATISTICS: MEAN AND COVARIANCE IN NOISE

Proposition 3.4 (Upper bound of MSE for the James-Stein Estimator). Y ∼


N (µ, Ip ),
(p − 2)2 (p − 2)kµk2
R(µ̂JS , µ) ≤ p − = 2 +
12
10
8 p − 2 + kµk2 p − 2 + kµk2
R

JS
4

MLE
2
0

0 2 4 6 8 10

||u||

3.4. Risk of Soft-thresholding. Using Stein’s unbiased risk estimate, we


have soft-thresholding in the form of

µ̂(x) = x + g(x). gi (x) = −I(|xi | ≤ λ)
∂i
We then have
p p
!
X X
Eµ kµ̂λ − µk 2
= Eµ p−2 I(|xi | ≤ λ) + x2i ∧λ 2

i=1 i=1
p
X p
≤ 1 + (2 log p + 1) µ2i ∧ 1 if we take λ = 2 log p
i=1

By using the inequality


1 ab
a∧b≤ ≤a∧b
2 a+b
we can compare the risk of soft-thresholding and James-Stein estimator as
p p
! !
(µi ∧ 1) Q 2 + c
X X
2 2
1 + (2 log p + 1) µi ∧ p c ∈ (1/2, 1)
i=1 i=1
3. STEIN’S PHENOMENON AND SHRINKAGE OF SAMPLE MEAN 33

In LHS, the risk for each µi is bounded by 1 so if µ is sparse (s = #{i : µi 6= 0})


but large in magnitudes (s.t. kµk22 ≥ p), we may expect LHS = O(s log p) < O(p) =
RHS. 3
In addition to L1 penalty in LASSO, there are also other penalty functions like
• λkβk0 This leads to hard -thresholding when X = I. Solving this problem
is normally NP-hard.
• λkβk
P p , 0 < p < 1. Non-convex, also NP-hard.
• λ ρ(βi ). such that
(1) ρ0 (0) singular (for sparsity in variable selection)
(2) ρ0 (∞) = 0 (for unbiasedness in parameter estimation)
Such ρ must be non-convex essentially (Jianqing Fan and Runze Li, 2001).
3.5. How to Optimize the Constants in James-Stein Estimator? Now,
let us look for a function g such that the risk of the estimator µ̃n (Y ) = (1 − g(Y ))Y
is smaller than the MLE of Y ∼ N (µ, ε2 Ip ). We have
p
X
Ekµ̃n − µk2 = E[((1 − g(y))yi − µi )2 ]
i=1
p
X
= {E[(yi − µi )2 ] + 2E[(µi − yi )g(y)yi ]
i=1
+ E[yi2 g(y)2 ]}.
Suppose now that the function g is such that the assumptions of Stein’s Lemma 3.5
hold (page 157∼158 in Tsybakov’s book [Tsy09]), i.e. weakly differentiable.
Lemma 3.5 (Stein’s lemma). Suppose that a function f : Rp → R satisfies:
(i) f (u1 , . . . , up ) is absolutely continuous in each coordinate ui for almost all
values (with respect to the Lebesgue measure on Rp−1 ) of other coordinates
(uj , j 6= i)
(ii)
∂f (y)
E < ∞, i = 1, . . . , p.
∂yi
then  
∂f
E[(µi − yi )f (y)] = −ε E
2
(y) , i = 1, . . . , p.
∂yi
With Stein’s Lemma, therefore
 
∂g
E[(µi − yi )(1 − g(y))yi ] = −ε E g(y) + yi
2
(y) ,
∂yi
with
E[(yi − µi )2 ] = ε2 = σ 2 ,
we have
 
∂g
E[(µ̃n,i − µi )] = ε − 2ε E g(y) + yi
2 2 2
(y) + E[yi2 g(y)2 ].
∂yi
Summing over i gives
Ekµ̃n − µk2 = pε2 + E[W (y)] = Ekµ̂n − µk2 + E[W (y)]
3also cf. p43 [GE]
34 3. HIGH DIMENSIONAL STATISTICS: MEAN AND COVARIANCE IN NOISE

with
p
X ∂g
W (y) = −2pε2 g(y) + 2ε2 yi (y) + kyk2 g(y)2 .
i=1
∂yi
The risk of µ̃n is smaller than that of µ̂n if we choose g such that
E[W (y)] < 0.
In order to satisfy this inequality, we can search for g among the functions of
the form
b
g(y) =
a + kyk2
with an appropriately chosen constants a ≥ 0, b > 0. Therefore, W (y) can be
written as
p
b X 2byi2 b2 kyk2
W (y) = −2pε2 + 2ε 2
+
a + kyk 2
i=1
(a + kyk )2 2 (a + kyk2 )2
4bε2 kyk2 b2 kyk2
 
1 2
= −2pbε + +
a + kyk2 a + kyk2 (a + kyk2 )2
1
≤ (−2pbε2 + 4bε2 + b2 ) kyk2 ≤ a + kyk2 for a ≥ 0
a + kyk2
Q(b)
= , Q(b) = b2 − 2pbε2 + 4bε2 .
a + kyk2
The minimizer in b of quadratic function Q(b) is equal to
bopt = ε2 (p − 2),
where the minimum of W (y) satisfies
b2opt ε4 (p − 2)2
Wmin (y) ≤ − = − < 0.
a + kyk2 a + kyk2
Note that when b ∈ (b1 , b2 ), i.e. between the two roots of Q(b)
b1 = 0, b2 = 2ε2 (p − 2)
we have W (y) < 0, which may lead to other estimators having smaller mean square
errors than MLE estimator.
When a = 0, the function g and the estimator µ̃n = (1 − g(y))y associated to
this choice of g are given by
ε2 (p − 2)
g(y) = ,
kyk2
and
ε2 (p − 2)
 
µ̃n = 1− y =: µ̃JS ,
kyk2
respectively. µ̃JS is called James-Stein estimator. If dimension p ≥ 3 and the
norm kyk2 is sufficiently large, multiplication of y by g(y) shrinks the value of y to
0. This is called the Stein shrinkage. If b = bopt , then
ε4 (p − 2)2
Wmin (y) = − .
kyk2
3. STEIN’S PHENOMENON AND SHRINKAGE OF SAMPLE MEAN 35

Lemma 3.6. Let p ≥ 3. Then, for all µ ∈ Rp ,


 
1
0<E < ∞.
kyk2
The proof of Lemma 3.6 can be found on Tsybakov’s book [Tsy09] (page
158∼159). For the function W , Lemma 3.6 implies −∞ < E[W (y)] < 0, provided
that p ≥ 3. Therefore, if p ≥ 3, the risk of the estimator µ̃n satisfies
 4
ε (p − 2)2

Ekµ̃n − µk2 = pε2 − E < Ekµ̂n − µk2
kyk2
for all µ ∈ Rp .
Besides James-Stein estimator, there are other estimators having smaller mean
square errors than MLE mu ˆ n.
• Stein estimator : a = 0, b = ε2 p,
ε2 p
 
µ̃S := 1 − y
kyk2
• James-Stein estimator : c ∈ (0, 2(p − 2))
ε2 c
 
µ̃cJS := 1 − y
kyk2
• Positive part James-Stein estimator :
ε2 (p − 2)
 
µ̃JS+ := 1 − y
kyk2 +
• Positive part Stein estimator :
ε2 p
 
µ̃S+ := 1 − y
kyk2 +
where (x)+ = min(0, x). Denote the mean square error by M SE(µ̃) = Ekµ̃ − µk2 ,
then we have
M SE(µ̃JS+ ) < M SE(µ̃JS ) < M SE(µ̂n ), M SE(µ̃S+ ) < M SE(µ̃S ) < M SE(µ̂n ).
See Efron’s Book, Chap 1, Table 1.1.
Another dimension of variation is Shrinkage toward any vector rather than the
origin.
ε2 c
 
µ̃µ0 = µ0 + 1 − (y − µ0 ), c ∈ (0, 2(p − 2)).
kyk2
Pp
In particular, one may choose µ0 = ȳ where ȳ = i=1 yi /p.
3.6. Discussion. Stein’s phenomenon firstly shows that in high dimensional
estimation, shrinkage may lead to better performance than MLE, the sample mean.
This opens a new era for modern high dimensional statistics. In fact discussions
above study independent random variables in p-dimensional space, concentration of
measure tells us some priori knowledge about the estimator distribution – samples
are concentrating around certain point. Shrinkage toward such point may naturally
lead to better performance.
However, after Stein’s phenomenon firstly proposed in 1956, for many years
researchers have not found the expected revolution in practice. Mostly because
Stein’s type estimators are too complicated in real applications and very small
36 3. HIGH DIMENSIONAL STATISTICS: MEAN AND COVARIANCE IN NOISE

gain can be achieved in many cases. Researchers struggle to show real application
examples where one can benefit greatly from Stein’s estimators. For example, Efron-
Morris (1974) showed three examples that JS-estimator significantly improves the
multivariate estimation. On other other hand, deeper understanding on Shrinkage-
type estimators has been pursued from various aspects in statistics.
The situation changes dramatically when LASSO-type estimators by Tibshi-
rani, also called Basis Pursuit by Donoho et al. are studied around 1996. This
brings sparsity and L1-regularization into the central theme of high dimensional
statistics and leads to a new type of shrinkage estimator, thresholding. For exam-
ple,
1
min I = min kµ̃ − µk2 + λkµ̃k1
µ̃ µ̃ 2
Subgradients of I over µ̃ leads to
0 ∈ ∂µ̃j I = (µ̃j − µj ) + λsign(µ̃j ) ⇒ µ̃j = sign(µj )(|µj | − λ)+
where the set-valued map sign(x) = 1 if x > 0, sign(x) = −1 if x < 0, and
sign(x) = [−1, 1] if x = 0, is the subgradient of absolute function |x|. Under this
new framework shrinkage estimators achieves a new peak with an ubiquitous spread
in data analysis with high dimensionality.

4. Random Matrix Theory and Phase Transitions in PCA


In PCA, one often looks at the eigenvalue plot in an decreasing order as per-
centage or variations. A large gap in the eigenvalue drops may indicate those top
eigenvectors reflect major variation directions, where those small eigenvalues in-
dicate directions due to noise which will vanish when n → ∞. Is this true in
all situations? The answer is yes in classical setting p << n. Unfortunately, in
high dimensional statistics even with fixed ratio p/n = γ, top eigenvectors of sam-
ple covariance matrices might not reflect the subspace of signals. In the following
we consider one particularly simple example: rank-1 signal (spike) model, where
random matrix theory will tell us when PCA fails to capture the signal subspace.
First of all, let’s introduce some basic results in random matrix theory which
will be used later.

4.1. Marčenko-Pastur Law of Sample Covariance Matrix. Let X ∈


Rp∗n , Xi ∼ N (0, Ip ).
When p fixed and n → ∞, the classical Law of Large Numbers tells us

(20) b n = 1 XX 0 → Ip .
Σ
n
Such a random matrix Σ b n is called Wishart matrix.
p
But when n → γ 6= 0, the distribution of the eigenvalues of Σ
b n follows [BS10]
(Chapter 3), if γ ≤ 1,
(
0 t∈
/ [a, b]
(21) µMP
(t) = √(b−t)(t−a)
2πγt dt t ∈ [a, b]
and has an additional point mass 1 − 1/γ at the origin if γ > 1. Note that a =
√ √
(1 − γ)2 , b = (1 + γ)2 . Figure 1 illustrates the MP-distribution by MATLAB
simulations whose codes can be found below.
4. RANDOM MATRIX THEORY AND PHASE TRANSITIONS IN PCA 37

(a) (b)

Figure 1. (a) Marčenko-Pastur distribution with γ = 2. (b)


Marčenko-Pastur distribution with γ = 0.5.

%Wishart matrix
% S = 1/n*X*X.’, X is p-by-n, X ij i.i.d N(0,1),
% ESD S converge to M.P. with parameter y = p/n

y = 2;

a = (1-sqrt(y))^2;
b = (1+sqrt(y))^2;

f MP = @(t) sqrt(max(b-t, 0).*max(t-a, 0) )./(2*pi*y*t); %MP Distribution

%non-zero eigenvalue part


n = 400;
p = n*y;

X = randn(p,n);
S = 1/n*(X*X.’);
evals = sort( eig(S), ’descend’);

nbin = 100;
[nout, xout] = hist(evals, nbin);
hx = xout(2) - xout(1); % step size, used to compute frequency below
x1 = evals(end) -1;
x2 = evals(1) + 1; % two end points
xx = x1+hx/2: hx: x2;
fre = f MP(xx)*hx;

figure,
h = bar(xout, nout/p);
set(h, ’BarWidth’, 1, ’FaceColor’, ’w’, ’EdgeColor’, ’b’);
hold on;
plot(xx, fre, ’--r’);
38 3. HIGH DIMENSIONAL STATISTICS: MEAN AND COVARIANCE IN NOISE

if y > 1 % there are (1-1/y)*p zero eigenvalues


axis([-1 x2+1 0 max(fre)*2]);
end

In the following, we are going to show that if dimension is relatively large


compared to sample size, i.e. p/n → γ > 0, PCA may fail to identify signals
from noise even the signal lies in a low-dimensional subspace. In fact, there is a
phase transition for signal identifiability by PCA: below a threshold of signal-noise
ratio, PCA will fail with high probability and above that threshold of signal-noise
ratio, PCA will approximate the signal subspace with high probability. This will
be illustrated by the following simplest rank-1 model.
4.2. Phase Transitions of PCA in Rank-1 Model. Consider the following
rank-1 signal-noise model
Y = X + ε,
2
where signal lies in an one-dimensional subspace X = αu with α ∼ N (0, σX ) and
noise ε ∼ N (0, σε2 Ip ) is i.i.d. Gaussian. For multi-rank models, please see [KN08].
Therefore Y ∼ N (0, Σ) where
2
Σ = σX uu0 + σε2 Ip .
The whole question in the remaining part of this section is to ask, can we recover
signal direction u from principal component analysis on noisy measurements Y ?
σ2
Define the signal-noise ratio SN R = R = σX2 , where for simplicity σε2 = 1. We
ε
aim to show how SNR affect the result of PCA when p is large. A fundamental
result by Johnstone in 2006 [Joh06], or see [NBG10], shows that the primary
(largest) eigenvalue of sample covariance matrix satisfies
( √ √
(1 + γ)2 = b, 2
σX ≤ γ
(22) b n) →
λmax (Σ √
2
(1 + σX )(1 + σγ2 ), σX
2
> γ
X

which implies that if signal energy is small, top eigenvalue of sample covariance
matrix never pops up from random matrix ones; only if the signal energy is beyond

the phase transition threshold γ, top eigenvalue can be separated from random
matrix eigenvalues. However, even in the latter case it is a biased estimation.
Moreover, the primary eigenvector associated with the largest eigenvalue (prin-
cipal component) converges to
 2 √
0 σX ≤ γ
|hu, vmax i|2 → 1− σX
γ
(23) 4
2 √
 1+ γ , σX > γ
σ2
X

which means the same phase transition phenomenon: if signal is of low energy,
PCA will tell us nothing about the true signal and the estimated top eigenvector is
orthogonal to the true direction u; if the signal is of high energy, PCA will return a
biased estimation which lies in a cone whose angle with the true signal is no more
than
1 − σγ4
X
.
1 + σγ2
X

Below we are going to show such results.


4. RANDOM MATRIX THEORY AND PHASE TRANSITIONS IN PCA 39

4.3. Stieltjes Transform. The following Stieltjes Transformation of MP-


density will be useful in the next part. Define the Stieltjes Transformation of
MP-density µM P to be
1
Z
(24) s(z) := dµM P (t), z ∈ C
R t−z
If z ∈ R, the transformation is called Hilbert Transformation. Further details can
be found in Terry Tao’s textbook, Topics on Random Matrix Theory [Tao11], Sec.
2.4.3 (the end of page 169) for the definition of Stieltjes transform of a density
p(t)dt on R.
In [BS10], Lemma 3.11 on page 52 gives the following characterization of s(z)
(note that the book contains a typo that 4yσ 4 in numerator should be replaced by
4yzσ 2 ):
p
(1 − γ) − z + (z − 1 − γ)2 − 4γz
(25) s(z) = ,
2γz
which is the largest root of the quadratic equation,
1 1
(26) γzs(z)2 + (z − (1 − γ))s(z) + 1 = 0 ⇐⇒ z + = .
s(z) 1 + γs(z)
From the equation (25), one can take derivative of z on both side to obtain s0 (z)
in terms of s and z. Using s(z) one can compute the following basic integrals.
Lemma 4.1. (1)
b
t
Z
µM P (t)dt = −λs(λ) − 1;
a λ−t
(2)
b
t2
Z
µM P (t)dt = λ2 s0 (λ) + 2λs(λ) + 1
a (λ − t)2
Proof. For convenience, define
b
t
Z
(27) T (λ) := µM P (t)dt.
a λ−t
Note that
b b
t λ − t + t MP
Z Z
(28) 1 + T (λ) = 1 + µM P (t)dt = µ (t)dt = −λs(λ)
a λ−t a λ−t
which give the first result.
4.4. Characterization of Phase Transitions with RMT. First of all, we
give an overview of this part. Following the rank-1 model, consider random vectors
{Yi }ni=1 ∼ N (0, Σ), where Σ = σx2 uuT + σε2 Ip and kuk2 = 1. This covariance matrix
Σ thus has a structure that low-rank plus sparse matrix. Define the Signal-Noise-
σ2
Ratio (SNR) R = σx2 . Without of generality,we assume σε2 = 1.
ε
Pn
The sample covariance matrix of Y is Σ̂n = n1 i=1 Yi Yit = n1 Y Y T where
Y = [Y1 , . . . , Yn ] ∈ Rp×n . Suppose one of its eigenvalue is λ and the corresponding
unit eigenvector is v̂, so Σ̂n v̂ = λv̂.
After that, we relate the λ to the MP distribution by the trick:
1 1
(29) Yi = Σ 2 Zi → Zi ∼ N (0, Ip ), where Σ 2 = σx2 uuT + σε2 Ip = RuuT + Ip
40 3. HIGH DIMENSIONAL STATISTICS: MEAN AND COVARIANCE IN NOISE

Pn
Then Sn = n1 i=1 Zi ZiT is a Wishart random matrix whose eigenvalues follow the
MP distribution.
1 1
Notice that Σ̂n = Σ 2 Sn Σ 2 and (λ, v̂) is eigenvalue-eigenvector pair of matrix
Σ̂n . Therefore
1 1 1 1
(30) Σ 2 Sn Σ 2 v̂ = λv̂ ⇒ Sn Σ(Σ− 2 v̂) = λ(Σ− 2 v̂)
1
In other words, λ and Σ− 2 v̂ are the eigenvalue and eigenvector of matrix Sn Σ.
1
Suppose cΣ− 2 v̂ = v where the constant c makes v a unit eigenvector and thus
satisfies,
(31) c2 = cv̂ T v̂ = v T Σv = v T (σx2 uuT + σε2 )v = σx2 (uT v)2 + σε2 ) = R(uT v)2 + 1.
With the aid of Stieltjes transform, we can calculate the largest eigenvalue of
matrix Σ̂n and the properties of the corresponding eigenvector v̂.
In fact, the eigenvalue λ satisfies
p Z b
2 1X λi 2 t
(32) 1 = σX · ∼ σX · dµM P (t),
p i=1 λ − σε2 λi a λ − σε2 t
and the inner product of u and v satisfies
(33) |uT v|2
Z b
t2
= {σx4 dµM P (t)}−1
a (λ − σ 2 )2
ε
σx4 p λ(2λ − (a + b)) −1
= { (−4λ + (a + b) + 2( (λ − a)(λ − b)) + p )}
4γ (λ − a)(λ − b)
1 − Rγ2
= 2γ
1+γ+ R
σ2
where R = SN R = σx2 = σx2 ,γ = np . We can compute the inner product of u and
p
ε
v̂ which we are really interested in from the above equation:
1 1 1 1 1 1 1 p
|uT v̂|2 = ( uT Σ 2 v)2 = 2 ((Σ 2 u)T v)2 = 2 (((RuuT + Ip ) 2 u)T v)2 = 2 (( (1 + R)u)T v)2
c c c c
γ
(1 + R)(uT v)2 1+R− R − Rγ2 1 − Rγ2
= = γ = γ
R(uT v)2 + 1 1+R+γ+ R 1+ R
Now we are going to present the details.
First of all, from
(34) Sn Σv = λv,
we obtain the following by plugging in the expression of Σ
(35) 2
Sn (σX uu0 + σε2 Ip )v = λv
Rearrange the term with u to one side, we got
(36) (λIp − σε2 Sn )v = σX
2
Sn uu0 v
Assuming that λIp − σε2 Sn is invertable, then multiple its reversion at both sides
of the equality, we get,
(37) 2
v = σX · (λIp − σε2 Sn )−1 · Sn u(u0 v).
4. RANDOM MATRIX THEORY AND PHASE TRANSITIONS IN PCA 41

4.4.1. Primary Eigenvalue. Multiply (37) by u0 at both side,


(38) u0 v = σX
2
· u0 (λIp − σε2 Sn )−1 Sn u · (u0 v)
that is, if u0 v 6= 0,
(39) 2
1 = σX · u0 (λIp − σε2 Sn )−1 Sn u
For SVD Sn = W ΛW 0 , where Λ is diagonal, W · W 0 = W 0 · W = Ip , W =
[W1 , W2 , · · · , Wn ] ∈ Rp×p , α = [α
P1p, α2 , · · · , αn ] ∈ R , in which 0 Wi is the corre-
p×1

sponding eigenvector, then u = i=1 αi Wi = W · α, then, α = W u, and,


2
(40) 1 = σX · u0 [W (λIp − σε2 Λ)−1 W 0 ][W ΛW 0 ]u = σX
2
· (u0 W )(λIp − σε2 Λ)−1 Λ(W 0 u)
Replace W 0 u = α, then,
p
2
X λi
(41) 1 = σX · α2
i=1
λ − σε2 λi i
Pp 2
where i=1 αi
= 1. Since W is a random orthogonal basis on a sphere, αi will
concentrate on its mean αi = √1q . According to the fact that p is large enough(∼
∞), due to Law of Large Numbers(LLN) and λ ∼ µM P (λi can be thought sampled
from the µM P ), the equation (12) can be thought of as the Expected Value (Monte-
Carlo Integration), then equation (12) can be written as,
p Z b
2 1X λi 2 t
(42) 1 = σX · ∼ σ X · dµM P (t)
p i=1 λ − σε2 λi a λ − σ 2t
ε

For convenience, assume without loss of generosity that σε2 = 1, that is the
noise volatility is 1. Now we unveil the story of the ratio γ, do the integration in
equation (13), we got,
Z b p
2 t (b − t)(t − a) σ2 p
(43) 1 = σX · dt = X [2λ − (a + b) − 2 |(λ − a)(b − λ)|]
a λ−t 2πγt 4γ
where the last step can be computed via Stieltjes transform introduced above.
From the definition of T (λ), we have
Z b
t2 0
(44) µM P (t)dt = −T (λ) − λT (λ).
a (λ − t)
2

Combined with the first result, we reach the second one. 


p
If we suppose σε2 = 1 as in the script. Then R = σX
2
p
. Note that when R ≥ n ,
λ ≥ b. Solve the equation:
σ2

p
1 = X [2λ − (a + b) − 2 (λ − a)(λ − b)

γ γ
∴ λ = σX 2
+ 2 + 1 + γ = (1 + σX 2
)(1 + 2 )
σX σX
Loose this assumption. Then all the equations above is true, except that all the λ
will be replaced by σλ2 and λ0 by SN R. Then we get:
ε

γ
λ = (1 + SN R)(1 + )σ 2
SN R ε
Here we observe the following phase transitions for primary eigenvalue:
42 3. HIGH DIMENSIONAL STATISTICS: MEAN AND COVARIANCE IN NOISE

b n has eigenvalue λ within supp(µM P ), so it is undis-


• If λ ∈ [a, b], then Σ
tinguishable from the noise Sn .
• If λ ≥ b, PCA will pick up the top eigenvalue as non-noise. So λ = b is the
phase transition where PCA works to pop up correct eigenvalue. Then
plug in λ = b in equation (14), we get,

σ2
r
2 1 2 p
(45) 1 = σX · [2b − (a + b)] = √X ⇔ σX =
4γ γ n
So, in order to make PCA works, we need to let SN R ≥ np .
p

We know that if PCA works good and noise doesn’t dominate the effect, the inner-
product |u0 v̂| should be close to 1. On the other hand, from RMT we know that if
the top eigenvalue λ is merged in the M. P. distribution, then the top eigenvector
computed is purely random and |u0 v̂| = 0, which means that from v̂ we can know
nothing about the signal u.
4.4.2. Primary Eigenvector. We now study the phase transition of top-eigenvector.
It is convenient to study |u0 v|2 first and then translate back to |u0 v̂|2 . Using
the equation (37),

(46)
1 = |v 0 v| = σX
4
·v 0 uu0 Sn (λIp −σε2 Sn )−2 Sn uu0 v = σX
4
·(|v 0 u|)[u0 Sn (λIp −σε2 Sn )−2 Sn u](|u0 v|)

(47) |u0 v|−2 = σX


4
[u0 Sn (λIp − σε2 Sn )−2 Sn u]
Using the same trick as the equation (39),

b
t2
Z
(48) |u0 v|−2 = σX
4
[u0 Sn (λIp − σε2 Sn )−2 Sn u] ∼ σX
4
· dµM P (t)
a (λ − σε2 t)2
and assume that λ > b, from Stieltjes transform introduced later one can compute
the integral as
(49)
Z b
0 −2 4 t2 MP
4
σX p λ(2λ − (a + b))
|u v| = σX · dµ (t) = (−4λ+(a+b)+2 (λ − a)(λ − b)+ p
a (λ − σ 2 t)2
ε 4γ (λ − a)(λ − b)
γ
from which it can be computed that (using λ = (1 + R)(1 + R) obtained above,
2
σX
where R = SN R = σ2 )
γ
1−
|u0 v|2 = R2
2γ .
1+γ+ R
Using the relation √
 
0 0 1 1/2
1+R 0
u v̂ = u = Σ v (u v)
c c

where the second equality uses Σ1/2 u = 1 + Ru, and with the formula for c2
above, we can compute
1+R
(u0 v̂)2 = (u0 v)2
1 + R(u0 v)2

in terms of R. Note that this number holds under the condition that R > γ.
4. RANDOM MATRIX THEORY AND PHASE TRANSITIONS IN PCA 43

4.5. Further Comments. When log(p) n → 0, we need to add more restric-


tions on Σn in order to estimate it faithfully. There are typically three kinds of
b
restrictions.
• Σ sparse
• Σ−1 sparse, also called–Precision Matrix
• banded structures (e.g. Toeplitz) on Σ or Σ−1
Recent developments can be found by Bickel, Tony Cai, Tsybakov, Wainwright et
al.
For spectral study on random kernel matrices, see El Karoui, Tiefeng Jiang,
Xiuyuan Cheng, and Amit Singer et al.
CHAPTER 4

Generalized PCA/MDS via SDP Relaxations

1. Introduction of SDP with a Comparison to LP


Here we will give a short note on Semidefinite Programming (SDP) formula-
tion of Robust PCA, Sparse PCA, MDS with uncertainty, and Maximal Variance
Unfolding, etc. First of all, we give a short introduction to SDP based on a parallel
comparison with LP.
Semi-definite programming (SDP) involves linear objective functions and linear
(in)equalities constraint with respect to variables as positive semi-definite matri-
ces. SDP is a generalization of linear programming (LP) by replacing nonnegative
variables with positive semi-definite matrices. We will give a brief introduction of
SDP through a comparison with LP.
LP (Linear Programming): for x ∈ Rn and c ∈ Rn ,

(50) min cT x
s.t. Ax = b
x≥0
This is the primal linear programming problem.
In SDP, the inner product between vectors cT x in LP will change to Hadamard
inner product (denoted by •) between matrices.
SDP (Semi-definite Programming): for X, C ∈ Rn×n
X
(51) min C • X = cij Xij
i,j
s.t. Ai • X = bi , for i = 1, · · · , m
X0
Linear programming has a dual problem via the Lagrangian. The Lagrangian
of the primal problem is
max min Lx;y,µ = cT x + y T (b − Ax) − µT x
µ≥0,y x

which implies that


∂L
= c − AT y − µ = 0
∂x
⇐⇒ c − AT y = µ ≥ 0

=⇒ max L = −y T b
µ≥0,y

which leads to the following dual problem.


45
46 4. GENERALIZED PCA/MDS VIA SDP RELAXATIONS

LD (Dual Linear Programming):


(52) min bT y
s.t. µ = c − AT y ≥ 0
In a similar manner, for SDP’s dual form, we have the following.
SDD (Dual Semi-definite Programming):
(53) max −bT y
m
X
s.t. S=C− Ai yi  0 =: C − AT ⊗ y
i=1

where  
A1
A =  ...
 

Am
and  
y1
y =  ... 
 

ym

1.1. Duality of SDP. Define the feasible set of primalP and dual problems are
Fp = {X  0; Ai • X = bi } and Fd = {(y, S) : S = C − i yi Ai  0}, respectively.
Similar to linear programming, semi-definite programming also has properties of
week and strong duality. The week duality says that the primal value is always an
upper bound of dual value. The strong duality says that the existence of an interior
point ensures the vanishing duality gap between primal value and dual value, as
well as the complementary conditions. In this case, to check the optimality of a
primal variable, it suffices to find a dual variable which meets the complementary
condition with the primal. This is often called the witness method.
For more reference on duality of SDP, see e.g. [Ali95].
Theorem 1.1 (Weak Duality of SDP). If Fp 6= ∅, Fd 6= ∅, We have C • X ≥ bT y,
for ∀X ∈ Fp and ∀(y, S) ∈ Fd .
Theorem 1.2 (Strong Duality SDP). Assume the following hold,
(1) Fp 6= ∅, Fd 6= ∅;
(2) At least one feasible set has an interior.
Then X ∗ is optimal iff
(1) X ∗ ∈ Fp
(2) ∃(y ∗ , S ∗ ) ∈ Fd
s.t. C • X ∗ = bT y ∗ or X ∗ S ∗ = 0 (note: in matrix product)
In other words, the existence of an interior solution implies the complementary
condition of optimal solutions. Under the complementary condition, we have
rank(X ∗ ) + rank(S ∗ ) ≤ n
for every optimal primal X ∗ and dual S ∗ .
2. ROBUST PCA 47

2. Robust PCA
Let X ∈ Rp×n be a data matrix. Classical PCA tries to find
(54) min kX − Lk
s.t. rank(L) ≤ k
where the Pnorm here is matrix-norm P
or Frobenius norm. SVD provides a solution
with L = i≤k σi ui viT where X = i σi ui viT (σ1 ≥ σ2 ≥ . . .). In other words,
classical PCA looks for decomposition
X =L+E
where the error matrix E has small matrix/Frobenius norm. However, it is well-
known that classical PCA is sensitive to outliers which are sparse and lie far from
the major population.

Figure 1. Classical PCA is sensitive to outliers

Robust PCA looks for the following decomposition instead


X =L+S
where
• L is a low rank matrix;
• S is a sparse matrix.
Example. Let X = [x1 , . . . , xp ]T ∼ N (0, Σ) be multivariate Gaussian random
variables. The following characterization [CPW12] holds
xi and xj are conditionally independent given other variables
⇔ (Σ−1 )ij = 0
We denote it by xi ⊥ xj |xk (k 6∈ {i, j}). Let G = (V, E) be a undirected graph
where V represent p random variables and (i, j) ∈ E ⇔ xi ⊥ xj |xk (k 6∈ {i, j}). G
is called a (Gaussian) graphical model of X.
Divide the random variables into observed and hidden (a few) variables X =
(Xo , Xh )T (in semi-supervised learning, unlabeled and labeled, respectively) and
   
Σoo Σoh −1 Qoo Qoh
Σ= and Q = Σ =
Σho Σhh Qho Qhh
The following Schur Complement equation holds for covariance matrix of observed
variables
−1
Σoo = Qoo + Qoh Q−1
hh Qho .
Note that
48 4. GENERALIZED PCA/MDS VIA SDP RELAXATIONS

• Observable variables are often conditional independent given hidden vari-


ables, so Qoo is expected to be sparse;
• Hidden variables are of small number, so Qoh Q−1hh Qho is of low-rank.
In semi-supervised learning, the labeled points are of small number, and the unla-
beled points should be as much conditional independent as possible to each other
given labeled points. This implies that the labels should be placed on those most
“influential” points.

Figure 2. Surveilliance video as low rank plus sparse matrices:


Left = low rank (middle) + sparse (right) [CLMW09]

Example (Surveilliance Video Decomposition). Figure 2 gives an example of low


rank vs. sparse decomposition in surveilliance video. On the left column, surveil-
liance video of a movie theatre records a great amount of images with the same
background and the various walking customers. If we vectorize these images (each
image as a vector) to form a matrix, the background image leads to a rank-1 part
and the occasional walking customers contribute to the sparse part.
More examples can be found at [CLMW09, CSPW11, CPW12].
In Robust PCA the purpose is to solve
(55) min kX − Lk0
s.t. rank(L) ≤ k
where kAk0 = #{Aij 6= 0}. However both the objective function and the constraint
are non-convex, whence it is NP-hard to solve in general.
The simplest convexification leads to a Semi-definite relaxation:
kSk0 := #{Sij 6= 0} ⇒ kSk1
2. ROBUST PCA 49

X
rank(L) := #{σi (L) 6= 0} ⇒ kLk∗ = σi (L),
i
where kLk∗ is called the nuclear norm of L, which has a semi-definite representation
1
kLk∗ = min (trace(W1 ) + trace(W2 ))
2 
W1 L
s.t.  0.
LT W 2
With these, the relaxed Robust PCA problem can be solved by the following
semi-definite programming (SDP).
1
(56) min (trace(W1 ) + trace(W2 )) + λkSk1
2
s.t. Lij + Sij = Xij , (i, j) ∈ E
 
W1 L
0
LT W2
The following Matlab codes realized the SDP algorithm above by CVX (http:
//cvxr.com/cvx).
% Construct a random 20-by-20 Gaussian matrix and construct a rank-1
% matrix using its top-1 singular vectors
R = randn(20,20);
[U,S,V] = svds(R,3);
A = U(:,1)*V(:,1)’;

% Construct a 90% uniformly sparse matrix


E0 = rand(20);
E = 1*abs(E0>0.9);

X = A + E;

% Choose the regularization parameter


lambda = 0.25;

% Solve the SDP by calling cvx toolbox


if exist(’cvx setup.m’,’file’),
cd /matlab tools/cvx/
cvx setup
end

cvx begin
variable L(20,20);
variable S(20,20);
variable W1(20,20);
variable W2(20,20);
variable Y(40,40) symmetric;
Y == semidefinite(40);
minimize(.5*trace(W1)+0.5*trace(W2)+lambda*sum(sum(abs(S))));
subject to
L + S >= X-1e-5;
50 4. GENERALIZED PCA/MDS VIA SDP RELAXATIONS

L + S <= X + 1e-5;
Y == [W1, L’;L W2];
cvx end

% The difference between sparse solution S and E


disp(’$\—S-E\— \infty$:’)
norm(S-E,’inf’)

% The difference between the low rank solution L and A


disp(’\—A-L\—’)
norm(A-L)

Typically CVX only solves SDP problem of small sizes (say matrices of size less
than 100). Specific matlab tools have been developed to solve large scale RPCA,
which can be found at http://perception.csl.uiuc.edu/matrix-rank/.

3. Probabilistic Exact Recovery Conditions for RPCA


A fundamental question about Robust PCA is: given X = L0 + S0 with low-
rank L and sparse S, under what conditions that one can recover X by solving SDP
in (56)?
It is necessary to assume that
• the low-rank matrix L0 can not be sparse;
• the sparse matrix S0 can not be of low-rank.
The first assumption is called incoherence condition. Assume that L0 ∈ Rn×n =
U ΣV T and r = rank(L0 ).
Incoherence condition [CR09]: there exists a µ ≥ 1 such that for all ei =
(0, . . . , 0, 1, 0, . . . , 0)T ,
µr µr
kU T ei k2 ≤ , kV T ei k2 ≤ ,
n n
and
µr
|U V T |2ij ≤ 2 .
n
These conditions, roughly speaking, ensure that the singular vectors are not
sparse, i.e. well-spread over all coordinates and won’t concentrate on some coor-
dinates. The incoherence condition holds if |Uij |2 ∨ |Vij |2 ≤ µ/n. In fact, if U
represent random projections to r-dimensional subspaces with r ≥ log n, we have
maxi kU T ei k2  r/n.
To meet the second condition, we simply assume that the sparsity pattern of
S0 is uniformly random.
Theorem 3.1. Assume the following holds,
(1) L0 is n-by-n with rank(L0 ) ≤ ρr nµ−1 (log n)−2 ,
(2) S0 is uniformly sparse of cardinality m ≤ ρs n2 .

Then with probability 1 − O(n−10 ), (56) with λ = 1/ n is exact, i.e. its solution
L̂ = L0 and Ŝ = S0 .

pNote that if L0 is a rectangular matrix of n1 × n2 , the same holds with λ =


1/ (max n1 , n2 ). The result can be generalized to 1−O(n−β ) for β > 0. Extensions
4. SPARSE PCA 51

and improvements of these results to incomplete measurements can be found in


[CT10, Gro11] etc., which solves the following SDP problem.

(57) min kLk∗ + λkSk1


s.t. Lij + Sij = Xij , (i, j) ∈ Ωobs .
Theorem 3.2. Assume the following holds,
(1) L0 is n-by-n with rank(L0 ) ≤ ρr nµ−1 (log n)−2 ,
(2) Ωobs is a uniform random set of size m = 0.1n2 ,
(3) each observed entry is corrupted with probability τ ≤ τs .

Then with probability 1−O(n−10 ), (56) with λ = 1/ 0.1n is exact, i.e. √its solution
L̂ = L0 . The same conclusion holds for rectangular matrices with λ = 1/ max dim.
All these results hold irrespective to the magnitudes of L0 and S0 .
When there are no sparse perturbation in optimization problem (57), the prob-
lem becomes the classical Matrix Completion problem with uniformly random sam-
pling:
(58) min kLk∗
s.t. Lij = L0ij , (i, j) ∈ Ωobs .
Assumed the same condition as before,[CT10] gives the following result: solu-
tion to SDP (58) is exact with probability at least 1 − n−10 if m ≥ µnr loga n where
a ≤ 6, which can be improved by [Gro11] to be near-optimal
m ≥ µnr log2 n.
Another theory based on geometry can be found in [CSPW11, CRPW12].
3.1. Phase Transitions. Take L0 = U V T as a product of n × r i.i.d. N (0, 1)
random matrices. Figure 3 shows the phase transitions of successful recovery prob-
ability over sparsity ratio ρs = m/n2 and low rank ratio r/n. White color indicates
the probability equals to 1 and black color corresponds to the probability being 0.
A sharp phase transition curve can be seen in the pictures. (a) and (b) respectively
use random signs and coherent signs in sparse perturbation, where (c) is purely ma-
trix completion with no perturbation. Increasing successful recovery can be seen
from (a) to (c).

4. Sparse PCA
Sparse PCA is firstly proposed by [ZHT06] which tries to locate sparse prin-
cipal components, which also has a SDP relaxation.
Recall that classical PCA is to solve
max xT Σx
s.t. kxk2 = 1
which gives the maximal variation direction of covariance matrix Σ.
Note that xT Σx = trace(Σ(xxT )). Classical PCA can thus be written as
max trace(ΣX)
s.t. trace(X) = 1
X0
52 4. GENERALIZED PCA/MDS VIA SDP RELAXATIONS

Figure 3. Phase Transitions in Probability of Successful Recovery

The optimal solution gives a rank-1 X along the first principal component. A
recursive application of the algorithm may lead to top k principal components.
That is, one first to find a rank-1 approximation of Σ and extract it from Σ0 = Σ
to get Σ1 = Σ − X, then pursue the rank-1 approximation of Σ1 , and so on.
Now we are looking for sparse principal components, i.e. #{Xij 6= 0} are small.
Using 1-norm convexification, we have the following SDP formulation [dGJL07]
for Sparse PCA
max trace(ΣX) − λkXk1
s.t. trace(X) = 1
X0
The following Matlab codes realized the SDP algorithm above by CVX (http:
//cvxr.com/cvx).
% Construct a 10-by-20 Gaussian random matrix and form a 20-by-20 correlation
% (inner product) matrix R
X0 = randn(10,20);
R = X0’*X0;

d = 20;
e = ones(d,1);

% Call CVX to solve the SPCA given R


if exist(’cvx setup.m’,’file’),
cd /matlab tools/cvx/
cvx setup
end

lambda = 0.5;
k = 10;

cvx begin
5. MDS WITH UNCERTAINTY 53

variable X(d,d) symmetric;


X == semidefinite(d);
minimize(-trace(R*X)+lambda*(e’*abs(X)*e));
subject to
trace(X)==1;
cvx end

5. MDS with Uncertainty


In this lecture, we introduce Semi-Definite Programming (SDP) approach to
solve some generalized Multi-dimensional Scaling (MDS) problems with uncer-
tainty. Recall that in classical MDS, given pairwise distances dij = kxi − xj k2
among a set of points xi ∈ Rp ( i = 1, 2, · · · , n) whose coordinates are unknown,
our purpose is to find yi ∈ Rk (k ≤ p) such that

n
X 2
(59) min kyi − yj k2 − dij .
i,j=1

In classical MDS (Section 1 in Chapter 1) an eigen-decomposition approach is


pursued to find a solution when all pairwise distances dij ’s are known and noise-
free. In case that dij ’s are not from pairwise distances, we often use gradient
descend method to solve it. However there is no guarantee that gradient descent
will converge to the global optimal solution. In this section we will introduce
a method based on convex relaxation, in particular the semi-definite relaxation,
which will guarantee us to find optimal solutions in the following scenarios.
• Noisy perturbations: dij → df ij = dij + ij
• Incomplete measurments: only partial pairwise distance measurements
are available on an edge set of graph, i.e. G = (V, E) and dij is given
when (i, j) ∈ E (e.g. xi and xj in a neighborhood).
• Anchors: sometimes we may fixed the locations of some points called
anchors, e.g. in sensor network localization (SNL) problem.
In other words, we are looking for MDS on graphs with partial and noisy informa-
tion.
5.1. SD Relaxation of MDS. Like PCA, classical MDS has a semi-definite
relaxation. In the following we shall introduce how the constraint
(60) kyi − yj k2 = dij ,
can be relaxed into linear matrix inequality system with positive semidefinite vari-
ables.
Denote Y = [y1 , · · · , yn ]k×n where yi ∈ Rk , and
ei = (0, 0, · · · , 1, 0, · · · , 0) ∈ Rn .
Then we have
kyi − yj k2 = (yi − yj )T (yi − yj ) = (ei − ej )T Y T Y (ei − ej )
Set X = Y T Y , which is symmetric and positive semi-definite. Then
kYi − Yj k2 = (ei − ej )(ei − ej )T • X.
54 4. GENERALIZED PCA/MDS VIA SDP RELAXATIONS

So
kYi − Yj k2 = d2ij ⇔ (ei − ej )(ei − ej )T • X = d2ij
which is linear with respect to X.
Now we relax the constrain X = Y T Y to
X  Y T Y ⇐⇒ X − Y T Y  0.
Through Schur Complement Lemma we know
 
T I Y
X − Y Y  0 ⇐⇒ 0
YT X
We may define a new variable
 
Ik Y
Z ∈ S k+n , Z =
YT X
which gives the following result.
Lemma 5.1. The quadratic constraint
kyi − yj k2 = d2ij , (i, j) ∈ E
has a semi-definite relaxation:


 Z1:k,1:k = I
(0; ei− ej )(0; ei − ej )T • Z = d2ij , (i, j) ∈ E

Ik Y
 Z=  0.


YT X
Pn
where • denotes the Hadamard inner product, i.e. A • B := i,j=1 Aij Bij .
Note that the constraint with equalities of d2ij can be replaced by inequalities
such as ≤ d2ij (1 + ) (or ≥ d2ij (1 − )). This is a system of linear matrix (in)-
equalities with positive semidefinite variable Z. Therefore, the problem becomes a
typical semidefinite programming.
Given such a SD relaxation, we can easily generalize classical MDS to the sce-
narios in the introduction. For example, consider the generalized MDS with anchors
which is often called sensor network localization problem in literature [BLT+ 06].
Given anchors ak (k = 1, . . . , s) with known coordinates, find xi such that
• kxi − xj k2 = d2ij where (i, j) ∈ Ex and xi are unknown locations
2
• kak − xj k2 = dckj where (k, j) ∈ Ea and ak are known locations
We can exploit the following SD relaxation:
• (0; ei − ej )(0; ei − ej )T • Z = dij for (i, j) ∈ Ex ,
• (ai ; ej )(ai ; ej )T • Z = dc
ij for (i, j) ∈ Ea ,
both of which are linear with respect to Z.
Recall that every SDP problem has a dual problem (SDD). The SDD associated
with the primal problem above is
X X
(61) min I • V + wij dij + wbij dc
ij
i,j∈Ex i,j∈Ea

s.t.  
V 0 X X
S= + wij Aij + w
bij A ij  0
d
0 0
i,j∈Ex i,j∈Ea
5. MDS WITH UNCERTAINTY 55

where
Aij = (0; ei − ej )(0; ei − ej )T
T
A
d ij = (ai ; ej )(ai ; ej ) .
The variables wij is the stress matrix on edge between unknown points i and j and
w
bij is the stress matrix on edge between anchor i and unknown point j. Note that
the dual is always feasible, as V = 0, yij = 0 for all (i, j) ∈ Ex and wij = 0 for all
(i, j) ∈ Ea is a feasible solution.
There are many matlab toolboxes for SDP, e.g. CVX, SEDUMI, and recent
toolboxes SNLSDP (http://www.math.nus.edu.sg/~mattohkc/SNLSDP.html) and
DISCO (http://www.math.nus.edu.sg/~mattohkc/disco.html) by Toh et. al.,
adapted to MDS with uncertainty.
A crucial theoretical question is to ask, when X = Y T Y holds such that SDP
embedding Y gives the same answer as the classical MDS? Before looking for an-
swers to this question, we first present an application example of SDP embedding.
5.2. Protein 3D Structure Reconstruction. Here we show an example of
using SDP to find 3-D coordinates of a protein molecule based on noisy pairwise
distances for atoms in -neighbors. We use matlab package SNLSDP by Kim-
Chuan Toh, Pratik Biswas, and Yinyu Ye, downladable at http://www.math.nus.
edu.sg/~mattohkc/SNLSDP.html.

nf = 0.1, λ = 1.0e+00

10

−5

−10

10
10
5 5
0 0
−5 −5
−10 −10
Refinement: RMSD = 5.33e−01

(a) (b)

Figure 4. (a) 3D Protein structure of PDB-1GM2, edges are


chemical bonds between atoms. (b) Recovery of 3D coordinates
from SNLSDP with 5Å-neighbor graph and multiplicative noise at
0.1 level. Red point: estimated position of unknown atom. Green
circle: actual position of unknown atom. Blue line: deviation from
estimation to the actual position.

After installation, Figure 4 shows the results of the following codes.


>> startup
>> testSNLsolver

number of anchors = 0
number of sensors = 166
box scale = 20.00
radius = 5.00
multiplicative noise, noise factor = 1.00e-01
56 4. GENERALIZED PCA/MDS VIA SDP RELAXATIONS

-------------------------------------------------------
estimate sensor positions by SDP
-------------------------------------------------------
num of constraints = 2552,
Please wait:
solving SDP by the SDPT3 software package
sdpobj = -3.341e+03, time = 34.2s
RMSD = 7.19e-01
-------------------------------------------------------
refine positions by steepest descent
-------------------------------------------------------
objstart = 4.2408e+02, objend = 2.7245e+02
number of iterations = 689, time = 0.9s
RMSD = 5.33e-01
-------------------------------------------------------
(noise factor)^2 = -20.0dB,
mean square error (MSE) in estimated positions = -5.0dB
-------------------------------------------------------

6. Exact Reconstruction and Universal Rigidity


Now we are going to answer the fundamental question, when the SDP relaxation
exactly reconstruct the coordinates up to a rigid transformation. We will provide
two theories, one from the optimality rank properties of SDP, and the other from
a geometric criterion, universal rigidity.
Recall that for a standard SDP with X, C ∈ Rn×n
X
(62) min C • X = cij Xij
i,j
s.t. Ai • X = bi , for i = 1, · · · , m
X0
whose SDD is
(63) max −bT y
m
X
s.t. S=C− Ai yi  0.
i=1

Such SDP has the following rank properties [Ali95]:


A. maximal rank solutions X ∗ or S ∗ exist;
B. minimal rank solutions X ∗ or S ∗ exist;
C. if complementary condition X ∗ S ∗ = 0 holds, then rank(X ∗ ) + rank(S ∗ ) ≤
n with equality holds iff strictly complementary condition holds, whence
rank(S ∗ ) ≥ n − k ⇒ rank(X ∗ ) ≤ k.
Strong duality of SDP tells us that an interior point feasible solution in primal
or dual problem will ensure the complementary condition and the zero duality gap.
Now we assume that dij = kxi − xj k precisely for some unknown xi ∈ Rk . Then the
primal problem is feasible with Z = (Id ; Y )T (Id ; Y ). Therefore the complementary
6. EXACT RECONSTRUCTION AND UNIVERSAL RIGIDITY 57

condition holds and the duality gap is zero. In this case, assume that Z ∗ is a primal
feasible solution of SDP embedding and S ∗ is an optimal dual solution, then
(1) rank(Z ∗ ) + rank(S ∗ ) ≤ k + n and rank(Z ∗ ) ≥ k, whence rank(S ∗ ) ≤ n;
(2) rank(Z ∗ ) = k ⇐⇒ X = Y T Y .
It follows that if an optimal dual S ∗ has rank n, then every primal solution Z ∗ has
rank k, which ensures X = Y T Y . Therefore it suffices to find a maximal rank dual
solution S ∗ whose rank is n.
Above we have optimality rank condition from SDP. Now we introduce a geo-
metric criterion based on universal rigidity.
Definition (Universal Rigidity (UR) or Unique Localization (UL)). ∃!yi ∈ Rk ,→
2
Rl where l ≥ k s.t. d2ij = kyi − yj k2 , dc 2
ij = kak − yj k .

It simply says that there is no nontrivial extension of yi ∈ Rk in Rl satisfying


2
d2ij
= kyi − yj k2 and dc 2
ij = k(ak ; 0) − yj k . The following is a short history about
universal rigidity.
[Schoenberg 1938] G is complete =⇒ UR
[So-Ye 2007] G is incomplete =⇒ UR ⇐⇒ SDP has maximal rank solution
rank(Z ∗ ) = k.
Theorem 6.1. [SY07] The following statements are equivalent.
(1) The graph is universally rigid or has a unique localization in Rk .
(2) The max-rank feasible solution of the SDP relaxation has rank k;
(3) The solution matrix has X = Y T Y or trace(X − Y T Y ) = 0.
Moreover, the localization of a UR instance can be computed approximately in a
time polynomial in n, k, and the accuracy log(1/).
In fact, the max-rank solution of SDP embedding is unique. There are many
open problems in characterizing UR conditions, see Ye’s survey at ICCM’2010.
In practice, we often meet problems with noisy measurements αd2ij ≥ d˜2ij ≤
βdij . If we relax the constraint kyi − yj k2 = d2ij or equivalently Ai • X = bi to
2

inequalities, however we can achieve arbitrary small rank solution. To see this,
assume that
Ai X = bi 7→ αbi ≤ Ai X ≤ βbi i = 1, . . . , m, where β ≥ 1, α ∈ (0, 1)
then So, Ye, and Zhang (2008) [SYZ08] show the following result.
Theorem 6.2. For every d ≥ 1, there is a SDP solution X b  0 with rank
rank(X) ≤ d, if the following holds,
b
18 ln 2m

 1+
 1 ≤ d ≤ 18 ln 2m
β= √ d
 1 + 18 ln 2m d ≥ 18 ln 2m

d
 1

 e(2m)2/d
 1 ≤ d ≤ 4 ln 2m
α=
( r )
1 4 ln 2m
 max

 ,1 − d ≥ 4 ln 2m
e(2m)2/d d

Note that α, β are independent to n.


58 4. GENERALIZED PCA/MDS VIA SDP RELAXATIONS

7. Maximal Variance Unfolding


Here we give a special case of SDP embedding, Maximal Variance Unfolding
(MVU) [WS06]. In this case we choose graph G = (V, E) as k-nearest neighbor
graph. As a contrast to the SDP embedding above, we did not pursue a semi-
definite relaxation X  Y T Y , but instead define it as a positive semi-definite
kernel K = Y T Y and maximize the trace of K.
Consider a set of points xi (i = 1, . . . , n) whose pairwise distance dij is known
if xj lies in k-nearest neighbors of xi . In other words, consider a k-nearest neighbor
graph G = (V, E) with V = {xi : i = 1, . . . , n} and (i, j) ∈ E if j is a member of
k-nearest neighbors of i.
Our purpose is to find coordinates yi ∈ Rk for i = 1, 2, . . . , n s.t.
d2ij = kyi − yj k2
P
wherever (i, j) ∈ E and i yi = 0.
Set Kij = hyi , yj i. Then K is symmetric and positive semidefinite, which
satisfies
Kii + Kjj − 2Kij = d2ij .
There are possibly many solutions for such K, and we look for a particular one
with maximal trace which characterizes the maximal variance.
Xn
(64) max trace(K) = λi (K)
i=1
s.t. Kii + Kjj − 2Kij = d2ij ,
X
Kij = 0,
j
K0
Again it is a SDP. The final embedding is obtained by using eigenvector decompo-
sition of K = Y T Y .
However we note here that maximization of trace is not a provably good ap-
proach to “unfold” a manifold. Sometimes, there are better ways than MVU, e.g.
if original data lie on a plane then maximization of the diagonal distance between
two neighboring triangles will unfold and force it to be a plane. This is a special
case of the general k + 1-lateration graphs [SY07]. From here we see that there are
other linear objective functions better than trace for the purpose of “unfolding” a
manifold.
CHAPTER 5

Nonlinear Dimensionality Reduction

1. Introduction
In the past month we talked about two topics: one is the sample mean and
sample covariance matrix (PCA) in high dimensional spaces. We have learned that
when dimension p is large and sample size n is relatively small, in contrast to the
traditional statistics where p is fixed and n → ∞, both sample mean and PCA may
have problems. In particular, Stein’s phenomenon shows that in high dimensional
space with independent Gaussian distributions, the sample mean is worse than a
shrinkage estimator; moreover, random matrix theory sheds light on that in high
dimensional space with sample size in a fixed ratio of dimension, the sample co-
variance matrix and PCA may not reflect the signal faithfully. These phenomena
start a new philosophy in high dimensional data analysis that to overcome the curse
of dimensionality, additional constraints has to be put that data never distribute
in every corner in high dimensional spaces. Sparsity is a common assumption in
modern high dimensional statistics. For example, data variation may only depend
on a small number of variables; independence of Gaussian random fields leads to
sparse covariance matrix; and the assumption of conditional independence can also
lead to sparse inverse covariance matrix. In particular, an assumption that data
concentrate around a low dimensional manifold in high dimensional spaces, leads
to manifold learning or nonlinear dimensionality reduction, e.g. ISOMAP, LLE,
and Diffusion Maps etc. This assumption often finds example in computer vision,
graphics, and image processing.
All the work introduced in this chapter can be regarded as generalized PCA/MDS
on nearest neighbor graphs, which has roots in manifold learning concept. Two
pieces of milestone works, ISOMAP [TdSL00] and Locally Linear Embedding
(LLE) [RL00], are firstly published in science 2000, which opens a new field called
nonlinear dimensionality reduction, or manifold learning in high dimensional data
analysis. Here is the development of manifold learning method:


 Laplacian Eigen Map
Diffusion Map

PCA −→ LLE −→

 Hessian LLE
Local Tangent Space Alignment

MDS −→ ISOMAP
To understand the motivation of such a novel methodology, let’s take a brief
review on PCA/MDS. Given a set of data xi ∈ Rp (i = 1, . . . , n) or merely pairwise
distances d(xi , xj ), PCA/MDS essentially looks for an affine space which best cap-
ture the variation of data distribution, see Figure 1(a). However, this scheme will
not work in the scenario that data are actually distributed on a highly nonlinear
59
60 5. NONLINEAR DIMENSIONALITY REDUCTION

curved surface, i.e. manifolds, see the example of Swiss Roll in Figure 1(b). Can we
extend PCA/MDS in certain sense to capture intrinsic coordinate systems which
charts the manifold?

(a) (b)

Figure 1. (a) Find an affine space to approximate data variation


in PCA/MDS. (b) Swiss Roll data distributed on a nonlinear 2-D
submanifold in Euclidean space R3 . Our purpose is to capture an
intrinsic coordinate system describing the submanifold.

ISOMAP and LLE, as extensions from MDS and local PCA, respectively, leads
to a series of attempts to address this problem.
All the current techniques in manifold learning, as extensions of PCA and
MDS, are often called as Spectral Kernel Embedding. The common theme of these
techniques can be described in Figure 2. The basic problem is: given a set of
data points {x1 , x2 , ..., xn ∈ Rp }, how to find out y1 , y2 , ..., yn ∈ Rd , where d  p,
such that some geometric structures (local or global) among data points are best
preserved.

Figure 2. The generative model for manifold learning. Y is the


hidden parameter space (like rotation angle of faces below), f is
a measure process which maps Y into a sub-manifold in a high
dimensional ambient space, X = f (Y ) ⊂ Rp . All of our purpose is
to recover this hidden parameter space Y given samples {xi ∈ Rp :
i = 1, . . . , n}.

All the manifold learning techniques can be summarized in the following meta-
algorithm, which explains precisely the name of spectral kernel embedding. All the
methods can be called certain eigenmaps associated with some positive semi-definite
kernels.
1. Construct a data graph G = (V, E), where V = {xi : i = 1, ..., n}.
e.g.1. ε-neighborhood, i ∼ j ⇔ d(xi , xj ) 6 ε, which leads to an undirected
graph;
2. ISOMAP 61

e.g.2. k-nearest neighbor, (i, j) ∈ E ⇔ j ∈ Nk (i), which leads to a directed


graph.
2. Construct a positive semi-definite matrix K (kernel).
1
3. Eigen-decomposition K = U ΛU T , then Yd = Ud Λd2 , where choose d eigen-
vectors (top or bottom) Ud .
Example 3 (PCA). G is complete, K = Σ̂n is a covariance matrix.
Example 4 (MDS). G is complete, K = − 12 HDH T , where Dij = d2 (xi , xj ).
Example 5 (ISOMAP). G is incomplete.
(
d(xi , xj ) if (i, j) ∈ E,
Dij = ˆ
dg (xi , xj ) if (i, j) 6∈ E.

where dˆg is a graph shorted path. Then


1
K = − HDH T .
2
Note that K is positive semi-definite if and only if D is a squared distance matrix.
Example 6 (LLE). G is incomplete. K = (I − W )T (I − W ), where
(
n×n wij j ∈ N (i),
Wij =
0 other’s.
and wij solves the following optimization problem
X
P min kXi − wij X̄j k2 , X̄j = Xj − Xi .
j wij =1
j∈N (i)

After obtaining W , compute the global embedding d-by-n embedding matrix Y =


[Y1 , . . . , Yn ],
n
X n
X
min kYi − Wij Yj k2 = trace((I − W )Y T Y (I − W )T ).
Y
i=1 j=1

This is equivalent to find smallest eigenvectors of K = (I − W )T (I − W ).

2. ISOMAP
ISOMAP is an extension of MDS, where pairwise euclidean distances between
data points are replaced by geodesic distances, computed by graph shortest path
distances.
(1) Construct a neighborhood graph G = (V, E, dij ) such that
V = {xi : i = 1, . . . , n}
E = {(i, j) : if j is a neighbor of i, i.e. j ∈ Ni }, e.g. k-nearest
neighbors, -neighbors
dij = d(xi , xj ), e.g. Euclidean distance when xi ∈ Rp
(2) Compute graph shortest path distances
dij = minP =(xi ,...,xj ) (kxi − xt1 k + . . . + kxtk−1 − xj k), is the length
of a graph shortest path connecting i and j
Dijkstra’s algorithm (O(kn2 log n)) and Floyd’s Algorithm (O(n3 ))
62 5. NONLINEAR DIMENSIONALITY REDUCTION

(3) classical MDS with D = (d2ij )


construct a symmetric (positive semi-definite if D is a squared dis-
tance) B = −0.5HDH T where H = I − 11T /n (or H = I − 1aT for any
aT 1 = 1).
Find eigenvector decomposition of B = U ΛU T and choose top d
eigenvectors as embedding coordinates in Rd , i.e. Yd = [y1 , . . . , yd ] =
1/2
[U1 , . . . , Ud ]Λd ∈ Rn×d

Algorithm 2: ISOMAP Algorithm


Input: A weighted undirected graph G = (V, E, d) such that
1 V = {xi : i = 1, . . . , n}
2 E = {(i, j) : if j is a neighbor of i, i.e. j ∈ Ni }, e.g. k-nearest neighbors,
-neighbors
3 dij = d(xi , xj ), e.g. Euclidean distance when xi ∈ Rp
Output: Euclidean k-dimensional coordinates Y = [yi ] ∈ Rk×n of data.
4 Step 1 : Compute graph shortest path distances
dij = min (kxi − xt1 k + . . . + kxtk−1 − xj k),
P =(xi ,...,xj )

which is the length of a graph shortest path connecting i and j;


1
5 Step 2 : Compute K = − H · D · H T (D := [d2ij ]), where H is the Househölder
2
centering matrix;
6 Step 3 : Compute Eigenvalue decomposition K = U ΛU T with
Λ = diag(λ1 , . . . , λn ) where λ1 ≥ λ2 ≥ . . . ≥ λn ≥ 0;
7 Step 4 : Choose top k nonzero eigenvalues and corresponding eigenvectors,
Xek = Uk Λk 12 where
Uk = [u1 , . . . , uk ], uk ∈ Rn ,
Λk = diag(λ1 , . . . , λk )
with λ1 ≥ λ2 ≥ . . . ≥ λk > 0.

The basic feature of ISOMAP can be described as: we find a low dimensional
embedding of data such that points nearby are mapped nearby and points far away
are mapped far away. In other words, we have global control on the data distance
and the method is thus a global method. The major shortcoming of ISOMAP
lies in its computational complexity, characterized by a full matrix eigenvector
decomposition.
2.1. ISOMAP Example. Now we give an example of ISOMAP with matlab
codes.
% load 33-face data
load ../data/face.mat Y
X = reshape(Y,[size(Y,1)*size(Y,2) size(Y,3)]);
p = size(X,1);
n = size(X,2);
D = pdist(X’);
DD = squareform(D);

% ISOMAP embedding with 5-nearest neighbors


[Y iso,R iso,E iso]=isomapII(DD,’k’,5);
2. ISOMAP 63

(a) (b)

Figure 3. (a) Residual Variance plot for ISOMAP. (b) 2-D


ISOMAP embedding, where the first coordinate follows the order
of rotation angles of the face.

% Scatter plot of top 2-D embeddings


y=Y iso.coords{2};
scatter(y(1,:),y(2,:))

2.2. Convergence of ISOMAP. Under dense-sample and regularity condi-


tions on manifolds, ISOMAP is proved to show convergence to preserve geodesic
distances on manifolds. The key is to approximate geodesic distance on manifold
by a sequence of short Euclidean distance hops.
Consider arbitrary two points on manifold x, y ∈ M . Define
dM (x, y) = inf {length(γ)}
γ

dG (x, y) = min(kx0 − x1 k + . . . + kxt−1 − xt k)


P

dS (x, y) = min(dM (x0 , x1 ) + . . . + dM (xt−1 , xt ))


P
where γ varies over the set of smooth arcs connecting x to y in M and P varies
over all paths along the edges of G starting at x0 = x and ending at xt = y. We
are going to show dM ≈ dG with the bridge dS .
It is easy to see the following upper bounds by dS :
(65) dM (x, y) ≤ dS (x, y)

(66) dG (x, y) ≤ dS (x, y)


where the first upper bound is due to triangle inequality for the metric dM and the
second upper bound is due to that Euclidean distances kxi − xi+1 k are smaller than
arc-length dM (xi , xi+1 ).
To see other directions, one has to impose additional conditions on sample
density and regularity of manifolds.
Lemma 2.1 (Sufficient Sampling). Let G = (V, E) where V = {xi : i = 1, . . . , n} ⊆
M is a -net of manifold M , i.e.for every x ∈ M there exists xi ∈ V such that
64 5. NONLINEAR DIMENSIONALITY REDUCTION

dM (x, xi ) < , and {i, j} ∈ E if dM (xi , xj ) ≤ α (α ≥ 4). Then for any pair
x, y ∈ V ,
α
dS (x, y) ≤ max(α − 1, )dM (x, y).
α−2
Proof. Let γ be a shortest path connecting x and y on M whose length is
l. If l ≤ (α − 2), then there is an edge connecting x and y whence dS (x, y) =
dM (x, y). Otherwise split γ into pieces such that l = l0 + tl1 where l1 = (α − 2)
and  ≤ l0 < (α − 2). This divides arc γ into a sequence of points γ0 = x, γ1 ,. . .,
γt+1 = y such that dM (x, γ1 ) = l0 and dM (γi , γi+1 ) = l1 (i ≥ 1). There exists a
sequence of x0 = x, x1 , . . . , xt+1 = y such that dM (xi , γi ) ≤  and
dM (xi , xi+1 ) ≤ dM (xi , γi ) + dM (γi , γi+1 ) + dM (γi+1 , xi+1 )
≤  + l1 + 
= α
= l1 α/(α − 2)
whence (xi , xi+1 ) ∈ E. Similarly dM (x, x1 ) ≤ dM (x, γ1 ) + dM (γ1 , x1 ) ≤ (α − 1) ≤
l0 (α − 1).
t−1
X
dS (x, y) ≤ dM (xi , xi+1 )
i=0
 
α
≤ l max ,α − 1
α−2
Setting α = 4 gives rise to dS (x, y) ≤ 3dM (x, y). 
The other direction, d_S(x, y) ≤ c·d_G(x, y), requires that for neighboring sample points x_i and x_j the geodesic distance be controlled by the Euclidean distance, d_M(x_i, x_j) ≤ c‖x_i − x_j‖. This imposes a regularity condition on the manifold M, whose curvature has to be bounded. We omit this part here and refer the interested reader to the reference by Bernstein, de Silva, Langford, and Tenenbaum (2000), the supporting information to the ISOMAP paper.

3. Locally Linear Embedding (LLE)


In applications, points nearby should be mapped nearby, while points far away should impose no constraint: when points are close enough they are similar, but when they are far apart there is no faithful information about how far they really are. This motivates another type of algorithm, locally linear embedding (LLE). It is a local method, as it only involves local PCA and a sparse eigenvector decomposition.
(1) Construct a neighborhood graph G = (V, E, W) such that
V = {x_i : i = 1, . . . , n},
E = {(i, j) : j is a neighbor of i, i.e. j ∈ N_i}, e.g. k-nearest neighbors or ε-neighbors,
W_ij = d(x_i, x_j) in Euclidean distance.
(2) Local fitting: pick a point x_i and its neighbors N_i, and compute the local fitting weights by
min_{Σ_{j∈N_i} w_ij = 1} ‖x_i − Σ_{j∈N_i} w_ij (x_j − x_i)‖².

This can be done by the Lagrange multiplier method, i.e. solving
min_{w_ij} (1/2)‖x_i − Σ_{j∈N_i} w_ij (x_j − x_i)‖² + λ(1 − Σ_{j∈N_i} w_ij).
Let w_i = [w_{ij_1}, . . . , w_{ij_k}]^T ∈ R^k, X̄_i = [x_{j_1} − x_i, . . . , x_{j_k} − x_i], and let the local Gram (covariance) matrix be C^{(i)}_{jk} = ⟨x_j − x_i, x_k − x_i⟩. Then the weights are
w_i = C_i† (X̄_i^T x_i + λ1),
where the Lagrange multiplier equals
λ = (1 − 1^T C_i† X̄_i^T x_i) / (1^T C_i† 1),
and C_i† is the Moore-Penrose (pseudo) inverse of C_i. Note that C_i is often ill-conditioned; to stabilize the inverse one can use the regularization (C_i + µI)^{−1} for some µ > 0.
(3) Global alignment. Define an n-by-n weight matrix W:
W_ij = w_ij if j ∈ N_i, and W_ij = 0 otherwise.
Compute the global d-by-n embedding matrix Y by
min_Y Σ_i ‖y_i − Σ_{j=1}^n W_ij y_j‖² = trace(Y (I − W)^T (I − W) Y^T).
In other words, construct the positive semi-definite matrix B = (I − W)^T (I − W) and find its d + 1 smallest eigenvectors v_0, v_1, . . . , v_d associated with the smallest eigenvalues λ_0, . . . , λ_d. Drop the smallest eigenvector, which is the constant vector accounting for the translation degree of freedom, and set Y = [v_1/√λ_1, . . . , v_d/√λ_d]^T.
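The following is a minimal MATLAB sketch of the three steps above, in the spirit of the ISOMAP example in Section 2.1; it is not a reference implementation. It assumes the data matrix X is p-by-n with points as columns, and the neighborhood size k, the target dimension d, and the regularization parameter mu are hypothetical user choices.

% LLE sketch: X is p-by-n (columns are data points); k neighbors, d dimensions
n = size(X,2);
DD = squareform(pdist(X'));                  % n-by-n Euclidean distance matrix
[~, idx] = sort(DD, 2);                      % idx(i,2:k+1) = k nearest neighbors of x_i
W = zeros(n,n);
mu = 1e-3;                                   % regularization for ill-conditioned C_i
for i = 1:n
    Ni = idx(i, 2:k+1);
    Xbar = X(:,Ni) - repmat(X(:,i), 1, k);   % columns x_j - x_i
    Ci = Xbar' * Xbar;                       % local Gram matrix C^(i)
    Cinv = pinv(Ci + mu*eye(k));             % regularized pseudo-inverse
    lam = (1 - ones(1,k)*Cinv*(Xbar'*X(:,i))) / (ones(1,k)*Cinv*ones(k,1));
    W(i,Ni) = (Cinv * (Xbar'*X(:,i) + lam*ones(k,1)))';   % local fitting weights
end
B = (eye(n) - W)' * (eye(n) - W);            % global alignment matrix
[V, E] = eig(B);
[ev, order] = sort(diag(E)); V = V(:, order);
Y = V(:, 2:d+1)';                            % drop the constant eigenvector; d-by-n embedding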
The benefits of LLE are:
• Neighborhood graph: k-nearest neighbors costs O(kn);
• W is sparse: only kn of the n² entries (a fraction k/n) are non-zero;
• B = (I − W)^T(I − W) is guaranteed to be positive semi-definite.
However, unlike ISOMAP, it is not clear whether the basic LLE constructed above converges under suitable conditions. Convergence guarantees are only available for some variations of basic LLE, such as Hessian LLE and LTSA.

Table 1. Comparisons between ISOMAP and LLE.

ISOMAP | LLE
MDS on geodesic distance matrix | local PCA + eigen-decomposition
global approach | local approach
fails on nonconvex manifolds with holes | ok with nonconvex manifolds with holes
Extensions: landmark (Nystrom), conformal, isometric, etc. | Extensions: Hessian, Laplacian, LTSA, etc.

4. Laplacian LLE (Eigenmap)


Consider the graph Laplacian with heat kernels [BN01, BN03]. Define a weight matrix W = (w_ij) ∈ R^{n×n} by
w_ij = exp(−‖x_i − x_j‖²/t) if j ∈ N(i), and w_ij = 0 otherwise.
Let D = diag(Σ_j w_ij) be the diagonal matrix with the weighted degrees as diagonal elements.
Define the unnormalized graph Laplacian by
L = D − W,
and the normalized graph Laplacian by
L = D^{−1/2}(D − W)D^{−1/2}.
Note that eigenvectors of L are also generalized eigenvectors of L up to a scaling matrix. This can be seen from the following reasoning:
Lφ = λφ ⇔ D^{−1/2}(D − W)D^{−1/2}φ = λφ ⇔ Lv = (D − W)v = λDv, with v = D^{−1/2}φ.
Generalized eigenvectors v of L are also right eigenvectors of the row Markov matrix P = D^{−1}W (indeed, Pv = λv ⇔ D^{−1}Wv = λv ⇔ (I − D^{−1}W)v = (1 − λ)v ⇔ (D − W)v = (1 − λ)Dv).
Given the meaning of the eigenvectors above, one chooses the bottom d + 1 eigenvectors, drops the smallest one (the constant vector associated with eigenvalue 0), and uses the remaining d eigenvectors to construct a d-dimensional embedding of the data.
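A minimal MATLAB sketch of this eigenmap follows; it assumes X is p-by-n as before, and the neighborhood size k, the heat kernel bandwidth t, and the embedding dimension d are given.

% Laplacian eigenmap sketch: X is p-by-n; k neighbors, bandwidth t, d dimensions
n = size(X,2);
DD = squareform(pdist(X'));
[~, idx] = sort(DD, 2);
W = zeros(n,n);
for i = 1:n
    Ni = idx(i, 2:k+1);
    W(i,Ni) = exp(-DD(i,Ni).^2 / t);         % heat kernel weights on the k-nn graph
end
W = max(W, W');                              % symmetrize the neighborhood graph
D = diag(sum(W,2));
[V, E] = eig(D - W, D);                      % generalized eigenproblem L v = lambda D v
[ev, order] = sort(diag(E)); V = V(:, order);
Y = V(:, 2:d+1)';                            % drop the constant eigenvector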

4.1. Convergence of Laplacian Eigenmap. Why choose the Laplacian? Consider a linear chain graph, on which the difference operator is
(df)(i) = f_{i+1} − f_i = [(z − 1)f](i),
so that
d²f = (z − 1)²f = (z² − 2z + 1)f → f_{i+1} − 2f_i + f_{i−1}.
On graphs, d²f corresponds to (D − W)f = Lf, and the quadratic form satisfies
f^T L f = Σ_{i∼j} w_ij (f_i − f_j)² ≥ 0 ∼ ∫_M ‖∇_M f‖²,
which on a d-dimensional manifold involves only the trace of the Hessian H = [∂²f/∂y_i ∂y_j] ∈ R^{d×d}, rather than the full Hessian.


Some rigorous results about convergence of Laplacian eigenmaps are given
in [BN08]. Assume that M is a compact manifold with vol(M) = 1. Let the
Laplacian-Beltrami operator
∆M : C(M) → L2 (M)
f 7→ − ÷ (∇f )
5. HESSIAN LLE 67

Consider the following operator


L̂t,n : C(M) → C(M)
!
1 X

ky−xi k X ky−xi k2
f 7 → e 4t f (y) − e 4t f (xi )
t(4πt)k/2 i i

where (L̂t,n f )(y) is a function on M, and


Lt : L2 (M) → L2 (M)
Z 
1
Z
ky−xk ky−xk2
− 4t
f 7→ e f (y)dx − e 4t f (x)dx .
t(4πt)k/2 M M
Then [BN08] shows that when those operators have no repeated eigenvalues,
the spectrum of L̂t,n converges to Lt as n → ∞ (variance), where the latter con-
verges to that of ∆M with a suitable choice of t → ∞ (bias). The following gives
a summary.
Theorem 4.1 (Belkin-Niyogi). Assume that all the eigenvalues in consideration are of multiplicity one. For small enough t, let λ̂^t_{n,i} be the i-th eigenvalue of L̂_{t,n} and v̂^t_{n,i} be the corresponding eigenfunction. Let λ_i and v_i be the corresponding eigenvalue and eigenfunction of ∆_M. Then there exists a sequence t_n → 0 such that
lim_{n→∞} λ̂^{t_n}_{n,i} = λ_i,
lim_{n→∞} ‖v̂^{t_n}_{n,i} − v_i‖ = 0,
where the limits are taken in probability.
From above one can see that Laplacian LLE minimizes trace of Hessian. Is
that what you desire? Why not the original Hessian?

5. Hessian LLE
Laplacian Eigenmap looks for coordinate functions via
min ∫_M ‖∇_M f‖², ‖f‖ = 1,
while Hessian Eigenmap looks for
min ∫_M ‖Hf‖², ‖f‖ = 1.
Donoho and Grimes (2003) [DG03b] replace the graph Laplacian, i.e. the trace of the Hessian matrix, by the full Hessian. The reason is that the kernel of the Hessian,
{ f(y_1, . . . , y_d) : ∂²f/(∂y_i ∂y_j) = 0 },
consists exactly of the constant function and functions linear in the y_i (i = 1, . . . , d). Therefore this kernel is a linear subspace of dimension d + 1, and minimizing the Hessian leads exactly to a basis consisting of the constant function and d independent coordinate functions.
1. G is incomplete, often a k-nearest neighbor graph.
2. Local SVD on the neighborhood of x_i: for x_{i_j} ∈ N(x_i),
X̃^{(i)} = [x_{i_1} − µ_i, . . . , x_{i_k} − µ_i]_{p×k} = Ũ^{(i)} Σ̃ (Ṽ^{(i)})^T,
where µ_i = (1/k) Σ_{j=1}^k x_{i_j} = (1/k) X_i 1, and Ũ^{(i)} = [Ũ^{(i)}_1, . . . , Ũ^{(i)}_k] gives an approximate tangent space at x_i.
3. Hessian estimation, assuming dimension d: define
M = [1, Ṽ_1, . . . , Ṽ_d, Ṽ_1 ∘ Ṽ_2, . . . , Ṽ_{d−1} ∘ Ṽ_d] ∈ R^{k×(1+d+C(d,2))},
where Ṽ_i ∘ Ṽ_j = [Ṽ_{il} Ṽ_{jl}]_{l=1}^k ∈ R^k denotes the elementwise (Hadamard) product of Ṽ_i and Ṽ_j, and C(d,2) = d(d−1)/2. Perform a Gram-Schmidt orthogonalization on M to get
M̃ = [1, v̂_1, . . . , v̂_d, ŵ_1, ŵ_2, . . . , ŵ_{C(d,2)}] ∈ R^{k×(1+d+C(d,2))}.
Define the Hessian estimator by
[H^{(i)}]^T = [last C(d,2) columns of M̃] ∈ R^{k×C(d,2)},
as the first d + 1 columns of M̃ constitute an orthonormal basis for the kernel of the Hessian.
Define a selection matrix S^{(i)} ∈ R^{n×k} which selects the data in N(x_i), i.e.
[x_1, . . . , x_n] S^{(i)} = [x_{i_1}, . . . , x_{i_k}].
Then the kernel matrix is defined to be
K = Σ_{i=1}^n S^{(i)} H^{(i)T} H^{(i)} S^{(i)T} ∈ R^{n×n}.
Find the smallest d + 1 eigenvectors of K and drop the smallest one; the remaining d eigenvectors give rise to a d-dimensional embedding of the data points.

5.1. Convergence of Hessian LLE. There are two assumptions underlying the convergence of ISOMAP:
• Isometry: the geodesic distance between any two points on the manifold equals the Euclidean distance between their intrinsic parameters.
• Convexity: the parameter space is a convex subset of R^d.
Therefore, if the manifold contains a hole, ISOMAP will not faithfully recover the intrinsic coordinates. Hessian LLE, in contrast, provably finds local orthogonal coordinates for manifold reconstruction even in the nonconvex case; Figure 4 gives an example.
Donoho and Grimes [DG03b] relax the conditions above into the following ones.
• Local Isometry: in a small enough neighborhood of each point, geodesic distances between two points on the manifold are identical to Euclidean distances between the corresponding parameter points.
• Connectedness: the parameter space is an open connected subset of R^d.
Based on these relaxed conditions, they prove the following result.
Theorem 5.1. Suppose M = ψ(Θ), where Θ is an open connected subset of R^d and ψ is a locally isometric embedding of Θ into R^n. Then the Hessian H(f) has a (d + 1)-dimensional nullspace, consisting of the constant function and a d-dimensional space of functions spanned by the original isometric coordinates.

Algorithm 3: Hessian LLE Algorithm
Input: A weighted undirected graph G = (V, E, d) such that
1 V = {x_i ∈ R^p : i = 1, . . . , n}
2 E = {(i, j) : j is a neighbor of i, i.e. j ∈ N_i}, e.g. k-nearest neighbors
Output: Euclidean k-dimensional coordinates Y = [y_i] ∈ R^{k×n} of the data.
3 Step 1: Compute local PCA on the neighborhood of x_i: for x_{i_j} ∈ N(x_i),
X̃^{(i)} = [x_{i_1} − µ_i, . . . , x_{i_k} − µ_i]_{p×k} = Ũ^{(i)} Σ̃ (Ṽ^{(i)})^T,
where µ_i = (1/k) Σ_{j=1}^k x_{i_j} = (1/k) X_i 1 and Ũ^{(i)} = [Ũ^{(i)}_1, . . . , Ũ^{(i)}_k] is an approximate tangent space at x_i;
4 Step 2: Hessian estimation, assuming dimension d: define
M = [1, Ṽ_1, . . . , Ṽ_d, Ṽ_1 ∘ Ṽ_2, . . . , Ṽ_{d−1} ∘ Ṽ_d] ∈ R^{k×(1+d+C(d,2))},
where Ṽ_i ∘ Ṽ_j denotes the elementwise (Hadamard) product of Ṽ_i and Ṽ_j. Perform a Gram-Schmidt orthogonalization on M to get
M̃ = [1, v̂_1, . . . , v̂_d, ŵ_1, . . . , ŵ_{C(d,2)}] ∈ R^{k×(1+d+C(d,2))}.
Define the Hessian by
[H^{(i)}]^T = [last C(d,2) columns of M̃] ∈ R^{k×C(d,2)},
as the first d + 1 columns of M̃ constitute an orthonormal basis for the kernel of the Hessian.
5 Step 3: Define
K = Σ_{i=1}^n S^{(i)} H^{(i)T} H^{(i)} S^{(i)T} ∈ R^{n×n}, where [x_1, . . . , x_n] S^{(i)} = [x_{i_1}, . . . , x_{i_k}];
find the smallest d + 1 eigenvectors of K and drop the smallest eigenvector; the remaining d eigenvectors give rise to a d-embedding.

Under this theorem, the original isometric coordinates can be recovered, up to a rigid motion, by identifying a suitable basis for the null space of H(f).
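A minimal MATLAB sketch following Algorithm 3 is given below; it is only an illustration under the assumptions that X is p-by-n and that the neighborhood size k satisfies k ≥ 1 + d + d(d−1)/2, and it uses an economy QR factorization in place of explicit Gram-Schmidt.

% Hessian LLE sketch (cf. Algorithm 3): X is p-by-n; k neighbors, d dimensions
n = size(X,2);
DD = squareform(pdist(X'));
[~, idx] = sort(DD, 2);
K = zeros(n,n);
for i = 1:n
    Ni = idx(i, 2:k+1);
    Xbar = X(:,Ni) - repmat(mean(X(:,Ni),2), 1, k);
    [~, ~, Vloc] = svd(Xbar, 'econ');        % local PCA; right singular vectors in R^k
    Vd = Vloc(:, 1:d);
    M = [ones(k,1), Vd];                     % constant and linear columns
    for s = 1:d
        for r = s+1:d
            M = [M, Vd(:,s).*Vd(:,r)];       % Hadamard (cross) products
        end
    end
    [Q, ~] = qr(M, 0);                       % orthogonalization (Gram-Schmidt up to signs)
    Hi = Q(:, d+2:end)';                     % Hessian estimator: last d(d-1)/2 columns
    K(Ni,Ni) = K(Ni,Ni) + Hi'*Hi;            % accumulate S^(i) H^(i)' H^(i) S^(i)'
end
[V, E] = eig(K);
[ev, order] = sort(diag(E)); V = V(:, order);
Y = V(:, 2:d+1)';                            % drop the smallest eigenvector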

6. Local Tangent Space Alignment (LTSA)


A shortcoming of Hessian LLE is the nonlinear construction of the Hessian, which requires Hadamard products between tangent vectors and is therefore prone to noise. Zhenyue Zhang and Hongyuan Zha (2002) [ZZ02] suggest the following procedure, which does not involve the nonlinear Hessian but still leaves an orthogonal basis of the tangent space among the bottom eigenvectors. In contrast to Hessian LLE's minimization of projections onto pairwise products of tangent vectors, LTSA minimizes the projection onto the normal space.
LTSA looks for coordinates
min_Y Σ_{i∼j} ‖y_i − U_i U_j^T y_j‖²,
where U_i is a local PCA basis for the tangent space at point x_i ∈ R^p.



Figure 4. Comparisons of Hessian LLE on the Swiss roll against ISOMAP and LLE. Hessian LLE better recovers the intrinsic coordinates, as the rectangular hole is the least distorted.

Figure 5. Local tangent space approximation.

Note that the Connection Laplacian looks for
min_Y Σ_{i∼j} ‖y_i − O_ij y_j‖², O_ij = arg min_O ‖U_i − O U_j‖²,
where U_i is a local PCA basis for the tangent space at point x_i ∈ R^p.


1. G is incomplete, taken to be the k-nearest neighbor graph here.
2. Local SVD on the neighborhood of x_i: for x_{i_j} ∈ N(x_i),
X̃^{(i)} = [x_{i_1} − µ_i, . . . , x_{i_k} − µ_i]_{p×k} = Ũ^{(i)} Σ̃ (Ṽ^{(i)})^T,
where µ_i = (1/k) Σ_{j=1}^k x_{i_j} = (1/k) X_i 1 and Ũ^{(i)} = [Ũ^{(i)}_1, . . . , Ũ^{(i)}_k] is an approximate tangent space at x_i. Define
G_i = [1/√k, Ṽ^{(i)}_1, . . . , Ṽ^{(i)}_d] ∈ R^{k×(d+1)}.
3. Alignment (kernel) matrix
K = Φ = Σ_{i=1}^n S_i W_i W_i^T S_i^T ∈ R^{n×n},

where the weight matrix is
W_i = I − G_i G_i^T ∈ R^{k×k},
and the selection matrix S_i ∈ R^{n×k} satisfies [x_{i_1}, . . . , x_{i_k}] = [x_1, . . . , x_n] S_i.
As above, choose the bottom d + 1 eigenvectors of K and drop the smallest one, which gives the n-by-d embedding matrix Y.
As with Hessian LLE, LTSA may recover the global coordinates under certain conditions; [ZZ09] presents some analysis of this.

Algorithm 4: LTSA Algorithm
Input: A weighted undirected graph G = (V, E, d) such that
1 V = {x_i ∈ R^p : i = 1, . . . , n}
2 E = {(i, j) : j is a neighbor of i, i.e. j ∈ N_i}, e.g. k-nearest neighbors
Output: Euclidean k-dimensional coordinates Y = [y_i] ∈ R^{k×n} of the data.
3 Step 1: Compute local PCA on the neighborhood of x_i: for x_{i_j} ∈ N(x_i),
X̃^{(i)} = [x_{i_1} − µ_i, . . . , x_{i_k} − µ_i]_{p×k} = Ũ^{(i)} Σ̃ (Ṽ^{(i)})^T,
where µ_i = (1/k) Σ_{j=1}^k x_{i_j} = (1/k) X_i 1 and Ũ^{(i)} = [Ũ^{(i)}_1, . . . , Ũ^{(i)}_k] is an approximate tangent space at x_i. Define
G_i = [1/√k, Ṽ^{(i)}_1, . . . , Ṽ^{(i)}_d] ∈ R^{k×(d+1)};
4 Step 2: Alignment (kernel) matrix
K = Σ_{i=1}^n S_i W_i W_i^T S_i^T ∈ R^{n×n}, W_i = I − G_i G_i^T ∈ R^{k×k},
where the selection matrix S_i ∈ R^{n×k} satisfies [x_{i_1}, . . . , x_{i_k}] = [x_1, . . . , x_n] S_i;
5 Step 3: Find the smallest d + 1 eigenvectors of K and drop the smallest eigenvector; the remaining d eigenvectors give rise to a d-embedding.
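A minimal MATLAB sketch following Algorithm 4; it assumes X is p-by-n and the neighborhood size k and intrinsic dimension d are given.

% LTSA sketch (cf. Algorithm 4): X is p-by-n; k neighbors, d dimensions
n = size(X,2);
DD = squareform(pdist(X'));
[~, idx] = sort(DD, 2);
K = zeros(n,n);
for i = 1:n
    Ni = idx(i, 2:k+1);
    Xbar = X(:,Ni) - repmat(mean(X(:,Ni),2), 1, k);
    [~, ~, Vloc] = svd(Xbar, 'econ');        % local PCA
    Gi = [ones(k,1)/sqrt(k), Vloc(:,1:d)];   % k-by-(d+1)
    Wi = eye(k) - Gi*Gi';                    % projection off the local tangent coordinates
    K(Ni,Ni) = K(Ni,Ni) + Wi*Wi';            % accumulate S_i W_i W_i' S_i'
end
[V, E] = eig(K);
[ev, order] = sort(diag(E)); V = V(:, order);
Y = V(:, 2:d+1)';                            % drop the smallest eigenvector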

7. Diffusion Map
Recall x_i ∈ R^d, i = 1, 2, · · · , n, and
W_ij = exp(−d(x_i, x_j)²/t),
so that W is a symmetric n × n matrix. Let d_i = Σ_{j=1}^n W_ij and define
D = diag(d_i), P = D^{−1}W,
and
S = D^{−1/2} W D^{−1/2} = I − L, L = D^{−1/2}(D − W)D^{−1/2}.
Then:
1) S is symmetric and has n orthonormal eigenvectors V = [v_0, v_1, · · · , v_{n−1}],
S = V Λ V^T, Λ = diag(λ_0, . . . , λ_{n−1}), V^T V = I.
Here we assume 1 = λ_0 ≥ λ_1 ≥ λ_2 ≥ . . . ≥ λ_{n−1} due to the positivity of W.
2) Φ = D^{−1/2} V = [φ_0, φ_1, · · · , φ_{n−1}] are right eigenvectors of P: PΦ = ΦΛ.
3) Ψ = D^{1/2} V = [ψ_0, ψ_1, · · · , ψ_{n−1}] are left eigenvectors of P: Ψ^T P = ΛΨ^T.
Note that φ_0 = 1 ∈ R^n and ψ_0(i) ∼ d_i/Σ_i d_i, so ψ_0 is, up to a scaling factor, the same eigenvector as the stationary distribution π(i) = d_i/Σ_i d_i (π^T 1 = 1).
Φ and Ψ form a bi-orthogonal system, i.e. φ_i^T ψ_j = δ_ij, or simply Φ^T Ψ = I.
Define the diffusion map [CLL+05]
Φ_t(x_i) = [λ_1^t φ_1(i), · · · , λ_{n−1}^t φ_{n−1}(i)], t > 0.
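A minimal MATLAB sketch of the diffusion map; it assumes X is p-by-n, t is the kernel bandwidth as above, and tau (a hypothetical user choice) plays the role of the diffusion time t in Φ_t.

% Diffusion map sketch: X is p-by-n; bandwidth t, diffusion time tau, d dimensions
n = size(X,2);
DD = squareform(pdist(X'));
W = exp(-DD.^2 / t);                         % symmetric heat kernel weights
Dm = diag(1 ./ sqrt(sum(W,2)));
S = Dm * W * Dm;                             % S = D^{-1/2} W D^{-1/2}
[V, E] = eig(S);
[ev, order] = sort(diag(E), 'descend'); V = V(:, order);
Phi = Dm * V;                                % right eigenvectors of P = D^{-1} W
Y = diag(ev(2:d+1).^tau) * Phi(:, 2:d+1)';   % diffusion map, dropping the trivial phi_0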
7.1. General Diffusion Maps and Convergence. In [CLL+05] a general class of diffusion maps is defined which involves a normalized weight matrix,
(67) W_ij^{α,t} = W_ij / (p_i^α p_j^α), p_i := Σ_k exp(−d(x_i, x_k)²/t),
where α = 0 recovers the definition above. With this family, one can define D_α = diag(Σ_j W_ij^{α,t}) and the row Markov matrix
(68) P_{α,t,n} = D_α^{−1} W^{α,t},
whose right eigenvectors Φ_α lead to a family of diffusion maps parameterized by α. Such a definition suggests the following integral operators as diffusion operators. Assume that q(x) is a density on M.
• Let k_t(x, y) = h(‖x − y‖²/t), where h is a radial basis function, e.g. h(z) = exp(−z).
• Define
q_t(x) = ∫_M k_t(x, y) q(y) dy
and form the new kernel
k_t^{(α)}(x, y) = k_t(x, y) / (q_t^α(x) q_t^α(y)).
• Let
d_t^{(α)}(x) = ∫_M k_t^{(α)}(x, y) q(y) dy
and define the transition kernel of a Markov chain by
p_{t,α}(x, y) = k_t^{(α)}(x, y) / d_t^{(α)}(x).
Then the Markov chain can be defined as the operator
(P_{t,α} f)(x) = ∫_M p_{t,α}(x, y) f(y) q(y) dy.
• Define the infinitesimal generator of the Markov chain
L_{t,α} = (I − P_{t,α})/t.
For these operators, Lafon et al. [CL06] show the following pointwise convergence result.
Theorem 7.1. Let M ⊆ R^p be a compact smooth submanifold, q(x) a probability density on M, and ∆_M the Laplacian-Beltrami operator on M. Then
(69) lim_{t→0} L_{t,α} f = ∆_M(f q^{1−α})/q^{1−α} − (∆_M(q^{1−α})/q^{1−α}) f.
This suggests that
• for α = 1, it converges to the Laplacian-Beltrami operator: lim_{t→0} L_{t,1} = ∆_M;
• for α = 1/2, it converges to a Schrödinger operator whose conjugation leads to a forward Fokker-Planck equation;
• for α = 0, it is the normalized graph Laplacian.
A central question in diffusion maps is: why do we choose the right eigenvectors φ_i in the diffusion map? To answer this we will introduce the concept of lumpability of finite Markov chains on graphs.

8. Connection Laplacian and Vector Diffusion Maps


to be finished...

9. Comparisons
According to the comparative studies by Todd Wittman, LTSA has the best
overall performance in current manifold learning techniques. Try yourself his code,
mani.m, and enjoy your new discoveries!

Figure 6. Comparisons of Manifold Learning Techniques on Swiss Roll


CHAPTER 6

Random Walk on Graphs

We have talked about Diffusion Map as a model of Random walk or Markov


Chain on data graph. Among other methods of Manifold Learning, the distinct
feature of Diffusion Map lies in that it combines both geometry and stochastic
process. In the next few sections, we will talk about general theory of random
walks or finite Markov chains on graphs which are related to data analysis. From
this one can learn the origin of many ideas in diffusion maps.
Random Walk on Graphs.
• Perron-Frobenius Vector and Google’s PageRank: this is about Perron-
Frobenius theory for nonnegative matrices, which leads to the character-
ization of nonnegative primary eigenvectors, such as stationary distribu-
tions of Markov chains; application examples include Google’s PageRank.
• Fiedler Vector, Cheeger’s Inequality, and Spectral Bipartition: this is
about the second eigenvector in a Markov chain, mostly reduced from
graph Laplacians (Fiedler theory, Cheeger’s Inequality), which is the ba-
sis for spectral partition.
• Lumpability/Metastability, piecewise constant right eigenvectors, and multiple spectral clustering ("MNcut" by Meila-Shi, 2001): this is about when to use multiple eigenvectors and their relationship with lumpability or metastability of Markov chains, widely used in diffusion maps, image segmentation, etc.
• Mean first passage time, commute time distance: the origins of diffusion
distances.
Today we shall discuss the first part.

1. Introduction to Perron-Frobenius Theory and PageRank


Given An×n , we define A > 0, positive matrix, iff Aij > 0 ∀i, j, and A ≥ 0,
nonnegative matrix, iff Aij ≥ 0 ∀i, j.
Note that this definition is different from positive definiteness:
A ≻ 0 ⇔ A is positive definite ⇔ x^T Ax > 0 ∀x ≠ 0,
A ⪰ 0 ⇔ A is positive semi-definite ⇔ x^T Ax ≥ 0 ∀x.

Theorem 1.1 (Perron Theorem for Positive Matrix). Assume that A > 0, i.e.a
positive matrix. Then
1) ∃λ∗ > 0, ν ∗ > 0, kν ∗ k2 = 1, s.t. Aν ∗ = λ∗ ν ∗ , ν ∗ is a right eigenvector
(∃λ∗ > 0, ω > 0, kωk2 = 1, s.t. (ω T )A = λ∗ ω T , left eigenvector)
2) ∀ other eigenvalue λ of A, |λ| < λ∗
3) ν ∗ is unique up to rescaling or λ∗ is simple

4) Collatz-Wielandt Formula
λ∗ = max_{x≥0, x≠0} min_{i: x_i≠0} [Ax]_i/x_i = min_{x>0} max_i [Ax]_i/x_i.
Such eigenvectors will be called Perron vectors. This result can be extended to
nonnegative matrices.
Theorem 1.2 (Nonnegative Matrix, Perron). Assume that A ≥ 0, i.e.nonnegative.
Then
1’) ∃λ∗ > 0, ν ∗ ≥ 0, kν ∗ k2 = 1, s.t. Aν ∗ = λ∗ ν ∗ (similar to left eigenvector)
2’) ∀ other eigenvalue λ of A, |λ| ≤ λ∗
3’) ν ∗ is NOT unique
4) Collatz-Wielandt Formula
λ∗ = max_{x≥0, x≠0} min_{i: x_i≠0} [Ax]_i/x_i = min_{x>0} max_i [Ax]_i/x_i.
Notice the changes in 1’), 2’), and 3’). Perron vectors are nonnegative rather
than positive. In the nonnegative situation what we lose is the uniqueness in λ∗
(2’)and ν ∗ (3’). The next question is: can we add more conditions such that the
loss can be remedied? Now recall the concept of irreducible and primitive matrices
introduced before.

Irreducibility exactly describes the case that the graph induced from A is connected, i.e. every pair of nodes is connected by a path of some finite length. Primitivity strengthens this condition: there is a common k such that every pair of nodes is connected by a path of length exactly k.
Definition (Irreducible). The following definitions are equivalent:
1) For any 1 ≤ i, j ≤ n, there is an integer k such that A^k_{ij} > 0; ⇔
2) The graph G = (V, E) (with V = {1, . . . , n} and {i, j} ∈ E iff A_ij > 0) is (path-)connected, i.e. for every pair i, j ∈ V there is a path (x_0, x_1, . . . , x_t) with x_0 = i and x_t = j connecting i and j.
Definition (Primitive). The following characterizations hold:
1) There is an integer k such that A^k_{ij} > 0 for all i, j; ⇔
2) Every node pair i, j ∈ V is connected by a path of length exactly k; ⇔
3) A has a unique eigenvalue λ∗ = max|λ| of maximal modulus; ⇐
4) A is irreducible and A_ii > 0 for some i.
Note that condition 4) is sufficient for primitivity but not necessary; the first three conditions are each necessary and sufficient. Irreducible matrices imply a unique primary eigenvector, but not a unique eigenvalue of maximal modulus.
When A is a primitive matrix, Ak becomes a positive matrix for some k, then we
can recover 1), 2) and 3) for positivity and uniqueness. This leads to the following
Perron-Frobenius theorem.
Theorem 1.3 (Nonnegative Matrix, Perron-Frobenius). Assume that A ≥ 0 and
A is primitive. Then
1) ∃λ∗ > 0, ν ∗ > 0, kν ∗ k2 = 1, s.t. Aν ∗ = λ∗ ν ∗ (right eigenvector)
and ∃ω > 0, kωk2 = 1, s.t. (ω T )A = λ∗ ω T (left eigenvector)
2) ∀ other eigenvalue λ of A, |λ| < λ∗
1. INTRODUCTION TO PERRON-FROBENIUS THEORY AND PAGERANK 77

3) ν ∗ is unique
4) Collatz-Wielandt Formula
λ∗ = max_{x>0} min_i [Ax]_i/x_i = min_{x>0} max_i [Ax]_i/x_i.
Such eigenvectors and eigenvalue will be called as Perron-Frobenius or primary
eigenvectors/eigenvalue.
Example (Markov Chain). Given a graph G = (V, E), consider a random walk on G with transition probability P_ij = Prob(x_{t+1} = j | x_t = i) ≥ 0. Thus P is a row-stochastic or row-Markov matrix, i.e. P·1 = 1, where 1 ∈ R^n is the vector with all elements being 1. From the Perron theorem for nonnegative matrices, we know:
ν∗ = 1 > 0 is a right Perron eigenvector of P;
λ∗ = 1 is a Perron eigenvalue and all other eigenvalues satisfy |λ| ≤ 1 = λ∗;
there exists a left PF-eigenvector π such that π^T P = π^T, where π ≥ 0 and 1^T π = 1; such a π is called an invariant/equilibrium distribution;
P irreducible (G connected) ⇒ π is unique;
P primitive (G connected by paths of length k) ⇒ the eigenvalue with |λ| = 1 is unique
⇔ lim_{k→∞} π_0^T P^k = π^T for all π_0 ≥ 0 with 1^T π_0 = 1.
This means when we take powers of P , i.e.P k , all rows of P k will converge to the
stationary distribution π T . Such a convergence only holds when P is primitive. If
P is not primitive, e.g. P = [0, 1; 1, 0] (whose eigenvalues are 1 and −1), P k always
oscillates and never converges.
What is the rate of convergence? Let
γ = max{|λ_2|, · · · , |λ_n|} (< 1 for primitive P), λ_1 = 1,
and π_t = (P^T)^t π_0. Roughly speaking,
‖π_t − π‖_1 ∼ O(γ^t),
i.e. convergence is geometric with rate governed by the spectral gap. Rates of this type will be seen in various mixing time estimates.
A famous application of Markov chain in modern data analysis is Google’s
PageRank [BP98], although Google’s current search engine only exploits that as
one factor among many others. But you can still install Google Toolbar on your
browser and inspect the PageRank scores of webpages. For more details about
PageRank, readers may refer to Langville and Meyer’s book [LM06].
Example (PageRank). Consider a directed weighted graph G = (V, E, W) whose weight matrix encodes the webpage link structure:
w_ij = #{links : i ↦ j} if (i, j) ∈ E, and 0 otherwise.
Define an out-degree vector d^o_i = Σ_{j=1}^n w_ij, which measures the number of out-links from i, a diagonal matrix D = diag(d^o_i), and a row Markov matrix P_1 = D^{−1}W, where for simplicity we assume that every node has nonzero out-degree. This P_1 describes a random walk following the link structure of webpages. One would expect that the stationary distribution of such a random walk discloses the importance of webpages: the more visits, the more important. However, the Perron-Frobenius theory above tells us that to obtain a unique stationary distribution we need a primitive Markov matrix. For this purpose, Google's PageRank does the following trick.
Let P_α = αP_1 + (1 − α)E, where E = (1/n)1·1^T is a random surfer model, i.e. one may jump to any other webpage uniformly at random. So in the model P_α, a surfer rolls a die: with probability α he follows the link structure and with probability 1 − α he surfs randomly. With 0 < α < 1, the random surfer component makes P_α a positive matrix, whence ∃!π s.t. P_α^T π = π (there exists a unique π). Google chooses α = 0.85, and in this case π gives the PageRank scores.
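As an illustration, the following minimal MATLAB sketch computes PageRank scores by power iteration; it assumes a link matrix W with nonzero out-degrees and a teleportation parameter alpha (e.g. 0.85) are given, and it is of course not Google's actual implementation.

% PageRank sketch: W is the n-by-n link matrix, alpha the teleportation parameter
n = size(W,1);
P1 = diag(1 ./ sum(W,2)) * W;            % row Markov matrix (assumes nonzero out-degrees)
Palpha = alpha * P1 + (1 - alpha) * ones(n)/n;
score = ones(n,1)/n;
for iter = 1:200                         % power iteration for the left Perron vector
    score = Palpha' * score;
    score = score / sum(score);
end
% score now approximates the stationary distribution pi, i.e. the PageRank scores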
Now you probably can figure out how to cheat PageRank. If there are many
cross links between a small set of nodes (for example, Wikipedia), those nodes must
appear to be high in PageRank. This phenomenon actually has been exploited by
spam webpages, and even scholar citations. After learning the nature of PageRank,
we should be aware of such mis-behaviors.
Finally we discuss a bit Kleinberg's HITS algorithm [Kle99], which is based on the singular value decomposition (SVD) of the link matrix W. Above we have defined the out-degree d^o; similarly we can define the in-degree d^i_k = Σ_j w_{jk}. High out-degree webpages can be regarded as hubs, as they provide many links to others. On the other hand, high in-degree webpages can be regarded as authorities, as they are cited by others intensively. Basically, in/out-degrees can be used to rank webpages, which gives relative rankings as authorities/hubs. It turns out that Kleinberg's HITS algorithm gives results quite similar to in/out-degree ranking.
Definition (HITS-authority). This uses the primary right singular vector of W as scores to give the ranking. To understand this, define L_a = W^T W. The primary right singular vector of W is just a primary eigenvector of the nonnegative symmetric matrix L_a. Since L_a(i, j) = Σ_k W_{ki} W_{kj}, it counts the number of references which cite both i and j, i.e. Σ_k #{i ← k → j}. The higher the value of L_a(i, j), the more references received by the pair of nodes. Therefore the Perron vector tends to rank the webpages according to authority.
Definition (HITS-hub). This uses the primary left singular vector of W as scores to give the ranking. Define L_h = W W^T, whence the primary left singular vector of W is just a primary eigenvector of the nonnegative symmetric matrix L_h. Similarly, L_h(i, j) = Σ_k W_{ik} W_{jk}, which counts the number of links from both i and j hitting the same target, i.e. Σ_k #{i → k ← j}. Therefore the Perron vector of L_h gives the hub ranking.
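A minimal MATLAB sketch of HITS, assuming the link matrix W is given; the absolute values below only fix the sign ambiguity of the singular vectors.

% HITS sketch: W is the n-by-n link matrix
[u, s, v] = svds(W, 1);                  % primary singular pair of W
authority = abs(v);                      % primary right singular vector: authority scores
hub = abs(u);                            % primary left singular vector: hub scores
[~, auth_rank] = sort(authority, 'descend');
[~, hub_rank]  = sort(hub, 'descend');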
The last example is the economic growth model, into whose study Debreu introduced nonnegative matrices. Similar applications include population growth and exchange markets, etc.
Example (Economic Growth/Population/Exchange Market). Consider a market consisting of n sectors (or families, currencies), where A_ij represents, for each unit of investment in sector j, the amount of output in sector i. The nonnegativity constraint A_ij ≥ 0 requires that i and j are not mutually inhibitory, which means that investment in sector j does not decrease the output of sector i. We study the dynamics x_{t+1} = Ax_t and its long-term behavior as t → ∞, which describes the economic growth.
Moreover, in exchange markets an additional requirement is imposed, A_ij = 1/A_ji, which is called a reciprocal matrix. Such matrices are also used for preference aggregation in decision theory by Saaty.

From Perron-Frobenius theory we get: ∃λ∗ > 0 and ν∗ ≥ 0 with Aν∗ = λ∗ν∗, and ∃ω∗ ≥ 0 with A^T ω∗ = λ∗ω∗.
When A is primitive (A^k > 0, i.e. investment in one sector will increase the output of every other sector within no more than k industrial periods), all other eigenvalues satisfy |λ| < λ∗ and ω∗, ν∗ are unique. In this case one can check that the long-term economic growth is governed by
A^t → (λ∗)^t ν∗ ω∗T,
where
1) for all i, (x_t)_i/(x_{t−1})_i → λ∗;
2) the distribution of resources converges to ν∗/Σ_i ν∗_i, so the distribution is in fact not balanced;
3) ω∗_i gives the relative value of investment in sector i in the long term.

1.1. Proof of Perron Theorem for Positive Matrices. A complete proof can be found in Meyer's book [Mey00], Chapter 8. Our proof below is based on an optimization viewpoint, which is related to the Collatz-Wielandt Formula.
Assume that A > 0. Consider the following optimization problem:
max δ
s.t. Ax ≥ δx, x ≥ 0, x ≠ 0.
Without loss of generality, assume that 1^T x = 1. Let y = Ax and consider the growth factors y_i/x_i for x_i ≠ 0. Our purpose above is to maximize the minimal growth factor δ (y_i/x_i ≥ δ).
Let λ∗ be the optimal value attained at ν∗ ≥ 0, 1^T ν∗ = 1, with Aν∗ ≥ λ∗ν∗. Our purpose is to show:
1) Aν∗ = λ∗ν∗;
2) ν∗ > 0;
3) ν∗ and λ∗ are unique;
4) for any other eigenvalue λ (Az = λz, z ≠ 0), |λ| < λ∗.
Sketch of Proof of the Perron Theorem. 1) If Aν∗ ≠ λ∗ν∗, then for some i, [Aν∗]_i > λ∗ν∗_i. Below we construct a feasible point with a larger objective value than λ∗, which contradicts optimality.
Define ν̃ = ν∗ + εe_i with ε > 0, where e_i denotes the vector which is one in the i-th component and zero otherwise.
For j ≠ i,
(Aν̃)_j = (Aν∗)_j + ε(Ae_i)_j ≥ λ∗ν∗_j + εA_{ji} > λ∗ν∗_j = λ∗ν̃_j,
where the last inequality is due to A > 0. For j = i,
(Aν̃)_i − λ∗ν̃_i = (Aν∗)_i − λ∗ν∗_i + ε(A_{ii} − λ∗) > 0,
where the inequality holds for small enough ε > 0 since (Aν∗)_i − λ∗ν∗_i > 0. That means, for some small ε > 0, Aν̃ > λ∗ν̃ componentwise, so λ∗ is not optimal, which leads to a contradiction.

2) Assume on the contrary that ν∗_k = 0 for some k; then (Aν∗)_k = λ∗ν∗_k = 0. But A > 0, ν∗ ≥ 0 and ν∗ ≠ 0 imply Aν∗ > 0, which contradicts the previous conclusion. So ν∗ > 0, and consequently λ∗ > 0 (otherwise 0 = λ∗ν∗ = Aν∗ > 0, a contradiction).
3) We are going to show that for every ν ≥ 0, Aν = µν ⇒ µ = λ∗ . Following the
same reasoning above, A must have a left Perron vector ω ∗ > 0, s.t. AT ω ∗ = λ∗ ω ∗ .
Then λ∗ (ω ∗T ν) = ω ∗T Aν = µ(ω ∗T ν). Since ω ∗T ν > 0 (ω ∗ > 0, ν ≥ 0), there
must be λ∗ = µ, i.e. λ∗ is unique, and ν ∗ is unique.
4) For any other eigenvalue Az = λz, A|z| ≥ |Az| = |λ||z|, so |λ| ≤ λ∗ . Then
we prove that |λ| < λ∗ . Before proceeding, we need the following lemma.
Lemma 1.4. If Az = λz with z ≠ 0 and |λ| = λ∗ = max_i |λ_i(A)|, then A|z| = λ∗|z|.
Proof of Lemma. Since |λ| = λ∗,
A|z| = |A||z| ≥ |Az| = |λ||z| = λ∗|z|.
Assume on the contrary that for some k, (A|z|)_k > λ∗|z_k|. Denote Y = (1/λ∗)A|z| − |z| ≥ 0; then Y_k > 0. Using the fact that A > 0 and x ≥ 0, x ≠ 0 imply Ax > 0, we get
(1/λ∗)AY > 0 and (1/λ∗)A|z| > 0
⇒ ∃ε > 0 such that ĀY > εĀ|z|, where Ā = A/λ∗,
⇒ Ā²|z| − Ā|z| > εĀ|z|,
⇒ (Ā/(1 + ε)) Ā|z| > Ā|z|,
⇒ with B = Ā/(1 + ε), whose spectral radius is 1/(1 + ε) < 1, 0 = lim_{m→∞} B^m Ā|z| ≥ Ā|z|,
⇒ Ā|z| = 0 ⇒ |z| = 0, a contradiction. Hence Y = 0 and A|z| = λ∗|z|. □



Equipped with this lemma, assume that we have Az = λz (z ≠ 0) with |λ| = λ∗. Then
A|z| = λ∗|z| = |λ||z| = |Az| ⇒ |Σ_j ā_ij z_j| = Σ_j ā_ij |z_j|, Ā = A/λ∗,
which implies that all z_j share the same sign, i.e. z_j ≥ 0 (∀j) or z_j ≤ 0 (∀j). In both cases |z| = ±z (z ≠ 0) is a nonnegative eigenvector with A|z| = λ|z|, which implies λ = λ∗ by 3). Hence any other eigenvalue satisfies |λ| < λ∗. □
1.2. Perron-Frobenius theory for Nonnegative Tensors. Some researchers,
e.g. Liqun Qi (Polytechnic University of Hong Kong), Lek-Heng Lim (U Chicago)
and Kung-Ching Chang (PKU) et al. recently generalize Perron-Frobenius theory
to nonnegative tensors, which may open a field toward PageRank for hypergraphs
and array or tensor data. For example, A(i, j, k) is a 3-tensor of dimension n,
representing for each object 1 ≤ i ≤ n, which object of j and k are closer to i.
A tensor of order m and dimension n means an array of n^m real numbers:
A = (a_{i_1,...,i_m}), 1 ≤ i_1, . . . , i_m ≤ n.
An n-vector ν = (ν_1, . . . , ν_n)^T is called an eigenvector if
Aν^{[m−1]} = λν^{m−1}
for some λ ∈ R, where
(Aν^{[m−1]})_k := Σ_{i_2,...,i_m=1}^n a_{k i_2...i_m} ν_{i_2} · · · ν_{i_m}, ν^{m−1} := (ν_1^{m−1}, . . . , ν_n^{m−1})^T.
Chang-Pearson-Zhang [2008] extend the Perron-Frobenius theorem to show the existence of λ∗ > 0 and ν∗ > 0 when A > 0 is irreducible, with
λ∗ = max_{x>0} min_i [Ax^{[m−1]}]_i / x_i^{m−1} = min_{x>0} max_i [Ax^{[m−1]}]_i / x_i^{m−1}.

2. Introduction to Fiedler Theory and Cheeger Inequality


In this class we introduce random walks on graphs. The last lecture applied Perron-Frobenius theory to the analysis of the primary eigenvector, which gives the stationary distribution. In this lecture we study the second eigenvector. To analyze the properties of the graph, we construct two matrices: the (unnormalized) graph Laplacian and the normalized graph Laplacian. In the first part, we introduce Fiedler theory for the unnormalized graph Laplacian, which shows that the second eigenvector can be used to bipartition the graph into two connected components. In the second part, we study the eigenvalues and eigenvectors of the normalized Laplacian to show its relations with random walks or Markov chains on graphs. In the third part, we introduce the Cheeger inequality for the second eigenvector of the normalized Laplacian, which leads to an approximation algorithm for the normalized graph cut (NCut) problem, an NP-hard problem itself.
2.1. Unnormalized Graph Laplacian and Fiedler Theory. Let G =
(V, E) be an undirected, unweighted simple1 graph. Although the edges here are
unweighted, the theory below still holds when weight is added. We can get a similar
conclusion with the weighted adjacency matrix. However the extension to directed
graphs will lead to different pictures.
We use i ∼ j to denote that node i ∈ V is a neighbor of node j ∈ V .
Definition (Adjacency Matrix).
A_ij = 1 if i ∼ j, and A_ij = 0 otherwise.
Remark. We can use the weight of edge i ∼ j to define Aij if the graph is weighted.
That indicates Aij ∈ R+ . We can also extend Aij to R which involves both positive
and negative weights, like correlation graphs. But the theory below can not be
applied to such weights being positive and negative.
The degree of node i is defined as
d_i = Σ_{j=1}^n A_ij.

1Simple graph means for every pair of nodes there are at most one edge associated with it;
and there is no self loop on each node.

Define a diagonal matrix D = diag(di ). Now let’s come to the definition of Lapla-
cian Matrix L.

Definition (Graph Laplacian).
L_ij = d_i if i = j; L_ij = −1 if i ∼ j; L_ij = 0 otherwise.

This matrix is often called unnormalized graph Laplacian in literature, to dis-


tinguish it from the normalized graph Laplacian below. In fact, L = D − A.

Example. V = {1, 2, 3, 4}, E = {{1, 2}, {2, 3}, {3, 4}}. This is a linear chain with four nodes, for which
L = [ 1 −1 0 0; −1 2 −1 0; 0 −1 2 −1; 0 0 −1 1 ].
Example. A complete graph of n nodes, K_n: V = {1, 2, 3, . . . , n} and every two nodes are connected, as in the figure above with n = 5. Here
L = [ n−1 −1 · · · −1; −1 n−1 · · · −1; · · · ; −1 · · · −1 n−1 ].
From the definition we can see that L is symmetric, so all its eigenvalues are real and there is an orthonormal eigenvector system. Moreover, L is positive semi-definite (p.s.d.). This is due to the fact that
v^T L v = Σ_i v_i Σ_{j:j∼i}(v_i − v_j) = Σ_i (d_i v_i² − Σ_{j:j∼i} v_i v_j) = Σ_{i∼j}(v_i − v_j)² ≥ 0, ∀v ∈ R^n.

In fact, L admits the decomposition L = BB^T, where B ∈ R^{|V|×|E|} is called the incidence matrix (or boundary map in algebraic topology): for any edge {j, k} with 1 ≤ j < k ≤ n,
B(i, {j, k}) = 1 if i = j; −1 if i = k; 0 otherwise.
These two statements imply that the eigenvalues of L cannot be negative, that is, λ(L) ≥ 0.
Theorem 2.1 (Fiedler theory). Let L has n eigenvectors
Lvi = λi vi , vi 6= 0, i = 0, . . . , n − 1
where 0 = λ0 ≤ λ1 ≤ · · · ≤ λn−1 . For the second smallest eigenvector v1 , define
N− = {i : v1 (i) < 0},
N+ = {i : v1 (i) > 0},
N0 = V − N− − N+ .
We have the following results.
(1) #{i, λi = 0} = #{connected components of G};
(2) If G is connected, then both N− and N+ are connected. N− ∪ N0 and
N+ ∪ N0 might be disconnected if N0 6= ∅.
This theorem tells us that the second smallest eigenvalue can be used to tell us
if the graph is connected, i.e.G is connected iff λ1 6= 0, i.e.
λ1 = 0 ⇔ there are at least two connected components.
λ1 > 0 ⇔ the graph is connected.
Moreover, the second smallest eigenvector can be used to bipartite the graph into
two connected components by taking N− and N+ when N0 is empty. For this reason,
we often call the second smallest eigenvalue λ1 as the algebraic connectivity. More
materials can be found in Jim Demmel’s Lecture notes on Fiedler Theory at UC
Berkeley: why we use unnormalized Laplacian eigenvectors for spectral partition
(http://www.cs.berkeley.edu/~demmel/cs267/lecture20/lecture20.html).
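As a quick illustration, here is a minimal MATLAB sketch of spectral bipartition by the Fiedler vector; it assumes an adjacency matrix A is given (e.g. the linear chain above) and simply splits the vertices by the sign of the second eigenvector.

% Fiedler vector bipartition sketch: A is the n-by-n adjacency matrix
L = diag(sum(A,2)) - A;                  % unnormalized graph Laplacian
[V, E] = eig(L);
[ev, order] = sort(diag(E)); V = V(:, order);
v1 = V(:, 2);                            % Fiedler vector (second smallest eigenvalue)
Nminus = find(v1 < 0);                   % bipartition by the sign of the Fiedler vector
Nplus  = find(v1 > 0);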
We can calculate eigenvalues using the Rayleigh quotient, which gives a sketch proof of the first part of the theorem.
Proof of Part I. Let (λ, v) be an eigenvalue-eigenvector pair, i.e. Lv = λv. Since L1 = 0, the constant vector 1 ∈ R^n is always an eigenvector, associated with λ_0 = 0. In general,
λ = v^T L v / v^T v = Σ_{i∼j}(v_i − v_j)² / Σ_i v_i².
Note that
λ = 0 ⇔ v_i = v_j whenever j is path-connected with i.
Therefore such a v is a piecewise constant function on the connected components of G. If G has k components, then there are k independent piecewise constant vectors in the span of the characteristic functions of those components, which can be used as eigenvectors of L with eigenvalue 0. This proves the first part of the theorem. □

2.2. Normalized graph Laplacian and Cheeger's Inequality.
Definition (Normalized Graph Laplacian).
L_ij = 1 if i = j; L_ij = −1/√(d_i d_j) if i ∼ j; L_ij = 0 otherwise.
In fact L = D^{−1/2}(D − A)D^{−1/2} = D^{−1/2}LD^{−1/2} = I − D^{−1/2}AD^{−1/2}. From this one can see the relation between eigenvectors of the normalized L and of the unnormalized L: for eigenvectors Lv = λv, we have
(I − D^{−1/2}AD^{−1/2})v = λv ⇔ Lu = λDu, u = D^{−1/2}v,
whence eigenvectors v of L, after rescaling to u = D^{−1/2}v, become generalized eigenvectors of L.
We can also use the Rayleigh quotient to calculate the eigenvalues of L:
v^T L v / v^T v = v^T D^{−1/2}(D − A)D^{−1/2} v / v^T v = u^T L u / u^T D u = Σ_{i∼j}(u_i − u_j)² / Σ_j u_j² d_j.
Similarly we get the relation between eigenvalues and the connected components of the graph:
#{λ_i(L) = 0} = #{connected components of G}.
Next we show that the eigenvectors of L are related to random walks on graphs; this explains why we choose this matrix to analyze the graph. We can construct a random walk on G whose transition matrix is defined by
P_ij = A_ij / Σ_j A_ij = A_ij / d_i.
An easy calculation shows that
P = D^{−1}A = D^{−1/2}(I − L)D^{1/2}.
Hence P is similar to I − L, so their eigenvalues satisfy λ_i(P) = 1 − λ_i(L). Consider a right eigenvector v and a left eigenvector u of P:
u^T P = λu^T, P v = λv.
Due to the similarity between P and I − L,
u^T P = λu^T ⇔ u^T D^{−1/2}(I − L)D^{1/2} = λu^T.
Let ū = D^{−1/2}u; then
ū^T(I − L) = λū^T ⇔ Lū = (1 − λ)ū.

So ū is an eigenvector of L, and we recover left eigenvectors of P from ū by multiplying by D^{1/2} on the left; similarly the right eigenvectors are v = D^{−1/2}ū. If we choose u_0 = π, with π_i ∼ d_i/Σ_i d_i, then:
ū_0(i) ∼ √d_i,
ū_k^T ū_l = δ_kl, u_k^T v_l = δ_kl,
π_i P_ij = π_j P_ji ∼ A_ij = A_ji,
where the last identity says the Markov chain is time-reversible.
All the conclusions above show that the normalized graph Laplacian L keeps the connectivity information carried by the unnormalized graph Laplacian L. Furthermore, L is closely related to random walks on the graph, through which the eigenvectors of P are easy to interpret and compute. That is why we choose this matrix to analyze the graph. Now we are ready to introduce Cheeger's inequality for the normalized graph Laplacian.
Let G = (V, E) be a graph and S a subset of V whose complement is S̄ = V − S. Define Vol(S), CUT(S) and NCUT(S) as follows:
Vol(S) = Σ_{i∈S} d_i,
CUT(S) = Σ_{i∈S, j∈S̄} A_ij,
NCUT(S) = CUT(S) / min(Vol(S), Vol(S̄)).
NCUT(S) is called the normalized cut. We define the Cheeger constant
h_G = min_S NCUT(S).
Finding the minimal normalized graph cut is NP-hard. One also often defines the
Cheeger ratio (expander): h_S := CUT(S)/Vol(S),
and the
Cheeger constant: h_G := min_S max{h_S, h_{S̄}}.
The Cheeger inequality says that the second smallest eigenvalue provides both upper and lower bounds on the minimal normalized graph cut. Its proof gives us a constructive polynomial-time algorithm to achieve such bounds.
Theorem 2.2 (Cheeger Inequality). For every undirected graph G,
h_G²/2 ≤ λ_1(L) ≤ 2h_G.
Proof. (1) Upper bound. Assume the following function f realizes the optimal normalized graph cut:
f(i) = 1/Vol(S) if i ∈ S, and f(i) = −1/Vol(S̄) if i ∈ S̄.
By using the Rayleigh quotient, we get
λ_1 = inf_{g ⊥ D^{1/2}1} g^T L g / g^T g
≤ Σ_{i∼j}(f_i − f_j)² / Σ_i f_i² d_i
= (1/Vol(S) + 1/Vol(S̄))² CUT(S) / (Vol(S)·(1/Vol(S))² + Vol(S̄)·(1/Vol(S̄))²)
= (1/Vol(S) + 1/Vol(S̄)) CUT(S)
≤ 2 CUT(S)/min(Vol(S), Vol(S̄)) =: 2h_G,
which gives the upper bound.
(2) Lower bound. The proof of the lower bound actually gives a constructive algorithm to compute an approximately optimal cut, as follows.
Let v be the second eigenvector, i.e. Lv = λ_1 v, and let f = D^{−1/2}v. Reorder the node set V such that f_1 ≤ f_2 ≤ . . . ≤ f_n. Denote V_− = {i : f_i < 0} and V_+ = {i : f_i ≥ 0}. Without loss of generality, we may assume
Σ_{i∈V_−} d_i ≥ Σ_{i∈V_+} d_i.
Define a new function f^+ as the restriction of f to V_+:
f_i^+ = f_i if i ∈ V_+, and f_i^+ = 0 otherwise.
Now consider the sweep subsets of V,
S_i = {v_1, v_2, . . . , v_i},
and define
Ṽol(S) = min(Vol(S), Vol(S̄)), α_G = min_i NCUT(S_i).
Clearly, finding the optimal value α_G only requires comparing n − 1 NCUT values.
Below we shall show that
h_G²/2 ≤ α_G²/2 ≤ λ_1.
First, we have Lf = λ_1 Df, so for each i,
(70) Σ_{j:j∼i} f_i(f_i − f_j) = λ_1 d_i f_i².
From this we obtain the following chain of inequalities:



λ_1 = Σ_{i∈V_+} f_i Σ_{j:j∼i}(f_i − f_j) / Σ_{i∈V_+} d_i f_i²
= [Σ_{i∼j, i,j∈V_+}(f_i − f_j)² + Σ_{i∈V_+} f_i Σ_{j∼i, j∈V_−}(f_i − f_j)] / Σ_{i∈V_+} d_i f_i², using (f_i − f_j)² = f_i(f_i − f_j) + f_j(f_j − f_i),
> [Σ_{i∼j, i,j∈V_+}(f_i − f_j)² + Σ_{i∈V_+} f_i Σ_{j∼i, j∈V_−} f_i] / Σ_{i∈V_+} d_i f_i²
= Σ_{i∼j}(f_i^+ − f_j^+)² / Σ_i d_i (f_i^+)²
= [Σ_{i∼j}(f_i^+ − f_j^+)²][Σ_{i∼j}(f_i^+ + f_j^+)²] / {[Σ_i d_i (f_i^+)²][Σ_{i∼j}(f_i^+ + f_j^+)²]}
≥ [Σ_{i∼j}|(f_i^+)² − (f_j^+)²|]² / {[Σ_i d_i (f_i^+)²][Σ_{i∼j}(f_i^+ + f_j^+)²]}, Cauchy-Schwarz inequality,
≥ [Σ_{i∼j}|(f_i^+)² − (f_j^+)²|]² / {2[Σ_i d_i (f_i^+)²]²},
where the second last step is due to the Cauchy-Schwarz inequality |⟨x, y⟩|² ≤ ⟨x, x⟩·⟨y, y⟩, and the last step is due to Σ_{i∼j}(f_i^+ + f_j^+)² ≤ 2Σ_{i∼j}((f_i^+)² + (f_j^+)²) ≤ 2Σ_i d_i (f_i^+)². Continuing from the last inequality,
λ_1 ≥ [Σ_{i∼j}|(f_i^+)² − (f_j^+)²|]² / {2[Σ_i d_i (f_i^+)²]²}
≥ [Σ_i ((f_i^+)² − (f_{i−1}^+)²) CUT(S_{i−1})]² / {2[Σ_i d_i (f_i^+)²]²}, since f_1 ≤ f_2 ≤ . . . ≤ f_n,
≥ [Σ_i ((f_i^+)² − (f_{i−1}^+)²) α_G Ṽol(S_{i−1})]² / {2[Σ_i d_i (f_i^+)²]²}
= (α_G²/2) · [Σ_i (f_i^+)²(Ṽol(S_{i−1}) − Ṽol(S_i))]² / [Σ_i d_i (f_i^+)²]²
= (α_G²/2) · [Σ_i (f_i^+)² d_i]² / [Σ_i d_i (f_i^+)²]² = α_G²/2,
where the last equality is due to the assumption Vol(V_−) ≥ Vol(V_+), whence Ṽol(S_i) = Vol(S̄_i) and Ṽol(S_{i−1}) − Ṽol(S_i) = d_i for i ∈ V_+.
This completes the proof. □

Fan Chung gave a short proof of the lower bound at a Simons Institute workshop in 2014.

Short Proof. The proof is based on the fact that
h_G = inf_{f≠0} sup_{c∈R} Σ_{x∼y}|f(x) − f(y)| / Σ_x |f(x) − c| d_x,
where the supremum over c is attained at c∗ = median(f(x) : x ∈ V). Then
λ_1 = R(f) = sup_c Σ_{x∼y}(f(x) − f(y))² / Σ_x (f(x) − c)² d_x
≥ Σ_{x∼y}(g(x) − g(y))² / Σ_x g(x)² d_x, with g(x) = f(x) − c,
= [Σ_{x∼y}(g(x) − g(y))²][Σ_{x∼y}(g(x) + g(y))²] / {[Σ_{x∈V} g²(x) d_x][Σ_{x∼y}(g(x) + g(y))²]}
≥ [Σ_{x∼y}|g²(x) − g²(y)|]² / {[Σ_{x∈V} g²(x) d_x][Σ_{x∼y}(g(x) + g(y))²]}, Cauchy-Schwarz inequality,
≥ [Σ_{x∼y}|g²(x) − g²(y)|]² / {2[Σ_{x∈V} g²(x) d_x]²}, using (g(x) + g(y))² ≤ 2(g²(x) + g²(y)),
≥ h_G²/2. □


3. *Laplacians and the Cheeger inequality for directed graphs


The following section is mainly based on [Chu05], which describes the following results:
(1) Define Laplacians on directed graphs.
(2) Define Cheeger constants on directed graphs.
(3) Give an example of the singularity of the Cheeger constant on a directed graph.
(4) Use the eigenvalues of the Laplacian and the Cheeger constant to estimate the convergence rate of random walks on a directed graph.
Another good reference is [LZ10].

3.1. Definition of Laplacians on directed graphs. On a finite and strongly connected directed graph G = (V, E) (a directed graph is strongly connected if there is a directed path between any pair of vertices), a weight is a function
w : E → R_{≥0}.
The in-degree and out-degree of a vertex are defined as
d^{in}_i = Σ_{j∈V} w_{ji}, d^{out}_i = Σ_{j∈V} w_{ij}.
Note that d^{in}_i may be different from d^{out}_i.
A random walk on the weighted graph G is a Markov chain with transition probability
P_ij = w_ij / d^{out}_i.

Since G is strongly connected, P is irreducible, and consequently there is a unique stationary distribution φ. (Moreover, the distribution of the Markov chain converges to it if and only if P is aperiodic.)
Example (undirected graph). φ(x) = d_x / Σ_y d_y.
Example (Eulerian graph). If d^{in}_x = d^{out}_x for every vertex x, then φ(x) = d^{out}_x / Σ_y d^{out}_y. This is because d^{out} is an invariant measure:
Σ_x d^{out}_x P_{xy} = Σ_x w_{xy} = d^{in}_y = d^{out}_y.

Example (exponentially small stationary dist.). G is a directed graph with n + 1


vertices formed by the union of a directed circle v0 → v1 → · · · → vn and edges
vi → v0 for i = 1, 2, · · · , n. The weight on any edge is 1. Checking from vn to
v0 with the prerequisite of stationary distribution that the inward probability flow
equals to the outward probability flow, we can see that
φ(v0 ) = 2n φ(vn ), i.e.φ(vn ) = 2−n φ(v0 ).
This exponentially small stationary distribution cannot occur in the undirected case, for then
φ(i) = d_i / Σ_j d_j ≥ 1/(n(n − 1)).
However, the stationary dist. can be no smaller than exponential, because we
have
Theorem 3.1. If G is a strong connected directed graph with w ≡ 1, and doutx ≤
k, ∀x, then max{φ(x) : x ∈ V } ≤ k D min{φ(y) : y ∈ V }, where D is the diameter
of G.
It can be easily proved using induction on the path connecting x and y.
Now we give a definition on those balanced weights.
Definition (circulation). A function F : E → R_{≥0} satisfying
Σ_{u: u→v} F(u, v) = Σ_{w: v→w} F(v, w), ∀v,
is called a circulation.
Note. A circulation is a flow with no source or sink.
Example. For a directed graph, F_φ(u, v) = φ(u)P(u, v) is a circulation, for
Σ_{u: u→v} F_φ(u, v) = φ(v) = Σ_{w: v→w} F_φ(v, w).

Definition (Rayleigh quotient). For a directed graph G with transition probability matrix P and stationary distribution φ, the Rayleigh quotient of any f : V → C is defined as
R(f) = Σ_{u→v} |f(u) − f(v)|² φ(u)P(u, v) / Σ_v |f(v)|² φ(v).
Note. Compare with the undirected case, where
R(f) = Σ_{u∼v} |f(u) − f(v)|² w_uv / Σ_v |f(v)|² d(v).

If we regard every undirected edge (u, v) as two directed edges u → v and v → u, then we get an Eulerian directed graph. So φ(u) ∼ d^{out}_u and d^{out}_u P(u, v) = w_uv, and as a result R(f)(directed) = 2R(f)(undirected). The factor 2 comes from counting every edge twice.

The next step is to extend the definition of the Laplacian to directed graphs. First we review the Laplacian on undirected graphs. On an undirected graph, the adjacency matrix is
A_ij = 1 if i ∼ j, and 0 otherwise,
with D = diag(d(i)) and
L = D^{−1/2}(D − A)D^{−1/2}.
On a directed graph, however, there are two degrees at each vertex which are generally different. Notice that on an undirected graph the stationary distribution is φ(i) ∼ d(i), so D = cΦ, where c is a constant and Φ = diag(φ(i)). Thus
L = I − D^{−1/2}AD^{−1/2} = I − D^{1/2}PD^{−1/2} = I − c^{1/2}Φ^{1/2}P c^{−1/2}Φ^{−1/2} = I − Φ^{1/2}PΦ^{−1/2}.

Extending and symmetrizing this, we define the Laplacian of a directed graph.
Definition (Laplacian).
L = I − (1/2)(Φ^{1/2}PΦ^{−1/2} + Φ^{−1/2}P^∗Φ^{1/2}).
Suppose the eigenvalues of L are 0 = λ_0 ≤ λ_1 ≤ · · · ≤ λ_{n−1}. As in the undirected case, we can calculate λ_1 with a Rayleigh quotient.
Theorem 3.2.
λ_1 = inf_{Σ_x f(x)φ(x)=0} R(f)/2.
Before proving that, we need
Lemma 3.3.
R(f) = 2 gLg^∗ / ‖g‖², where g = fΦ^{1/2}.

Proof.
R(f) = Σ_{u→v} |f(u) − f(v)|² φ(u)P(u, v) / Σ_v |f(v)|² φ(v)
= [Σ_{u→v} |f(u)|²φ(u)P(u, v) + Σ_v |f(v)|²φ(v) − Σ_{u→v} (f(u)f(v)^∗ + f(u)^∗f(v)) φ(u)P(u, v)] / (fΦf^∗)
= [Σ_u |f(u)|²φ(u) + Σ_v |f(v)|²φ(v) − (fΦPf^∗ + (fΦPf^∗)^∗)] / (fΦf^∗)
= 2 − f(P^∗Φ + ΦP)f^∗ / (fΦf^∗)
= 2 − (gΦ^{−1/2})(P^∗Φ + ΦP)(Φ^{−1/2}g^∗) / [(gΦ^{−1/2})Φ(Φ^{−1/2}g^∗)]
= 2 − g(Φ^{−1/2}P^∗Φ^{1/2} + Φ^{1/2}PΦ^{−1/2})g^∗ / (gg^∗)
= 2 · gLg^∗ / ‖g‖². □

Proof of Theorem 3.2. With Lemma 3.3 and L(φ(x)^{1/2})_{n×1} = 0, we have
λ_1 = inf_{Σ_x g(x)φ(x)^{1/2}=0} R(f)/2 = inf_{Σ_x f(x)φ(x)=0} R(f)/2. □

Note.
λ_1 = inf_{f: Σ_x f(x)φ(x)=0} R(f)/2
= inf_{f: Σ_x f(x)φ(x)=0} Σ_{u→v}|f(u) − f(v)|² φ(u)P(u, v) / (2 Σ_v |f(v)|² φ(v))
= inf_{f: Σ_x f(x)φ(x)=0} sup_c Σ_{u→v}|f(u) − f(v)|² φ(u)P(u, v) / (2 Σ_v |f(v) − c|² φ(v)).
Theorem 3.4. Suppose the eigenvalues of P are ρ_0, · · · , ρ_{n−1} with ρ_0 = 1; then
λ_1 ≤ min_{i≠0} (1 − Re ρ_i).

3.2. Definition of Cheeger constants on directed graphs. We have a circulation F_φ(u, v) = φ(u)P(u, v). Define
F(∂S) = Σ_{u∈S, v∉S} F(u, v), F(v) = Σ_{u: u→v} F(u, v) = Σ_{w: v→w} F(v, w), F(S) = Σ_{v∈S} F(v);
then F(∂S) = F(∂S̄).
Definition (Cheeger constant). The Cheeger constant of a directed graph G is defined as
h(G) = inf_{S⊂V} F(∂S) / min(F(S), F(S̄)).

Note. Compare with the undirected case, where
h_G = inf_{S⊂V} |∂S| / min(|S|, |S̄|).
Similarly, we have
h_G(undirected) = inf_{S⊂V} Σ_{u∈S, v∈S̄} w_uv / min(Σ_{u∈S} d(u), Σ_{u∈S̄} d(u)),
h_G(directed) = inf_{S⊂V} Σ_{u∈S, v∈S̄} φ(u)P(u, v) / min(Σ_{u∈S} φ(u), Σ_{u∈S̄} φ(u)) = inf_{S⊂V} F(∂S) / min(F(S), F(S̄)).
Theorem 3.5. For every directed graph G,
h²(G)/2 ≤ λ_1 ≤ 2h(G).
The proof is similar to the undirected case, using the Rayleigh quotient and Theorem 3.2.
3.3. An example of the singularity of the Cheeger constant on a directed graph. We have already given an example of a directed graph with n + 1 vertices and stationary distribution φ satisfying φ(v_n) = 2^{−n}φ(v_0). Now we make a copy of this graph and denote the new n + 1 vertices by u_0, . . . , u_n. Joining the two graphs together by two edges v_n → u_n and u_n → v_n, we get a bigger directed graph. Let S = {v_0, · · · , v_n}; then h(G) ∼ 2^{−n}. In comparison, h(G) ≥ 2/(n(n − 1)) for undirected graphs.
3.4. Estimate of the convergence rate of random walks on directed graphs. Define the distance between P after s steps and φ as
∆(s) = max_{y∈V} ( Σ_{x∈V} (P^s(y, x) − φ(x))² / φ(x) )^{1/2}.
Modify the random walk into a lazy random walk P̃ = (I + P)/2, so that it is aperiodic.
Theorem 3.6.
∆(t)² ≤ C (1 − λ_1/2)^t.
3.5. Random Walks on Digraphs, The Generalized Digraph Lapla-
cian, and The Degree of Asymmetry. In this paper the following have been
discussed:
(1) Define an asymmetric Laplacian L̃ on directed graph;
(2) Use L̃ to estimate the hitting time and commute time of the corresponding
Markov chain;
(3) Introduce a metric to measure the asymmetry of L̃ and use this measure
to give a tighter bound on the Markov chain mixing rate and a bound on
the Cheeger constant.

Let P be the transition matrix of a Markov chain, and let π = (π_1, . . . , π_n)^T (a column vector) denote its stationary distribution (which is unique if the Markov chain is irreducible, or if the directed graph is strongly connected). Let Π = diag{π_1, . . . , π_n}; then we define the normalized Laplacian L̃ on the directed graph:
(71) L̃ = I − Π^{1/2} P Π^{−1/2}.
3.5.1. Hitting time, commute time and the fundamental matrix. We establish the relation between L̃ and the hitting and commute times of the random walk on the directed graph through the fundamental matrix Z = [z_ij], defined as
(72) z_ij = Σ_{t=0}^∞ (p^t_ij − π_j), 1 ≤ i, j ≤ n,
or alternatively as an infinite sum of matrix series:
(73) Z = Σ_{t=0}^∞ (P^t − 1π^T).
With the fundamental matrix, the hitting time and commute time can be expressed as follows:
(74) H_ij = (z_jj − z_ij)/π_j,
(75) C_ij = H_ij + H_ji = (z_jj − z_ij)/π_j + (z_ii − z_ji)/π_i.
Using (73), we can write the fundamental matrix Z in a more explicit form. Notice that
(76) (P − 1π^T)(P − 1π^T) = P² − 1π^TP − P1π^T + 1π^T1π^T = P² − 1π^T,
where we use the fact that 1 and π are the right and left eigenvectors of the transition matrix P with eigenvalue 1, and that π^T1 = 1 since π is a distribution. Then
(77) Z + 1π^T = Σ_{t=0}^∞ (P − 1π^T)^t = (I − P + 1π^T)^{−1}.

3.5.2. Green's function and the Laplacian for directed graphs. If we treat the directed graph Laplacian L̃ as an asymmetric operator on a directed graph G, then we can define a Green's function G̃ (without boundary conditions) for the directed graph. Its entries satisfy
(78) (G̃ L̃)_ij = δ_ij − π_i^{1/2} π_j^{1/2},
or in matrix form
(79) G̃ L̃ = I − π^{1/2} (π^{1/2})^T.
The central theorem in the second paper relates the Green's function G̃, the fundamental matrix Z and the normalized directed graph Laplacian L̃:
Theorem 3.7. Let Z̃ = Π^{1/2} Z Π^{−1/2} and let L̃† denote the Moore-Penrose pseudo-inverse of L̃; then
(80) G̃ = Z̃ = L̃†.

3.6. Measure of asymmetry and its relation to the Cheeger constant and the mixing rate. To measure the asymmetry of a directed graph, write L̃ as the sum of a symmetric part and a skew-symmetric part:
(81) L̃ = (1/2)(L̃ + L̃^T) + (1/2)(L̃ − L̃^T).
Here (1/2)(L̃ + L̃^T) = L is the symmetrized Laplacian introduced in the first paper. Let ∆ = (1/2)(L̃ − L̃^T); ∆ captures the difference between L̃ and its transpose. Let σ_i, λ_i and δ_i (1 ≤ i ≤ n) denote the i-th singular values of L̃, L and ∆ in ascending order (σ_1 = λ_1 = δ_1 = 0). Then the relation L̃ = L + ∆ implies
(82) λ_i ≤ σ_i ≤ λ_i + δ_n.
Therefore δ_n = ‖∆‖_2 is used to measure the degree of asymmetry of the directed graph.
The following two theorems are applications of this measure.
Theorem 3.8. The second singular value of L̃ has the bounds
(83) h(G)²/2 ≤ σ_2 ≤ (1 + δ_n/λ_2) · 2h(G),
where h(G) is the Cheeger constant of the graph G.
Theorem 3.9. For an aperiodic Markov chain P,
(84) δ_n² ≤ max{ ‖P̃f‖²/‖f‖² : f ⊥ π^{1/2} } ≤ (1 − λ_2)² + 2δ_nλ_n + δ_n²,
where P̃ = Π^{1/2}PΠ^{−1/2}.

4. Lumpability of Markov Chain


Let P be the transition matrix of a Markov chain on a graph G = (V, E) with V = {1, 2, · · · , n}, i.e. P_ij = Prob{x_t = j : x_{t−1} = i}. Assume that V admits a partition Ω:
V = ∪_{i=1}^k Ω_i, Ω_i ∩ Ω_j = ∅ for i ≠ j, Ω = {Ω_s : s = 1, · · · , k}.
Observe a sequence {x_0, x_1, · · · , x_t} sampled from the Markov chain with initial distribution π_0.
Definition (Lumpability, Kemeny-Snell 1976). P is lumpable with respect to the partition Ω if the coarse-grained sequence {y_t} defined below is Markovian; in other words, the transition probabilities do not depend on the choice of the initial distribution π_0 or on the history, i.e.
(85) Prob_{π_0}{x_t ∈ Ω_{k_t} : x_{t−1} ∈ Ω_{k_{t−1}}, · · · , x_0 ∈ Ω_{k_0}} = Prob{x_t ∈ Ω_{k_t} : x_{t−1} ∈ Ω_{k_{t−1}}}.
Relabel x_t ↦ y_t ∈ {1, · · · , k} by
y_t = Σ_{s=1}^k s · χ_{Ω_s}(x_t).
Thus we obtain a sequence (y_t) which is a coarse-grained representation of the original sequence. The lumpability condition above can be rewritten as
(86) Prob_{π_0}{y_t = k_t : y_{t−1} = k_{t−1}, · · · , y_0 = k_0} = Prob{y_t = k_t : y_{t−1} = k_{t−1}}.

Theorem 4.1. I. (Kemeny-Snell 1976) P is lumpable with respect to the partition Ω ⇔ for all Ω_s, Ω_t ∈ Ω and all i, j ∈ Ω_s, P̂_{iΩ_t} = P̂_{jΩ_t}, where P̂_{iΩ_t} = Σ_{l∈Ω_t} P_{il}.

Figure 1. Lumpability condition P̂_{iΩ_t} = P̂_{jΩ_t}.

II. (Meila-Shi 2001) P is lumpable with respect to the partition Ω and P̂ (with p̂_{st} = Σ_{i∈Ω_s, j∈Ω_t} p_ij) is nonsingular ⇔ P has k independent piecewise constant right eigenvectors in span{χ_{Ω_s} : s = 1, · · · , k}, where χ denotes the characteristic function.

Figure 2. A linear chain of 2n nodes with a random walk.

Example. Consider a linear chain with 2n nodes (Figure 2) whose adjacency and degree matrices are given by the tridiagonal matrix
A = [0 1; 1 0 1; . . . ; 1 0 1; 1 0], D = diag{1, 2, · · · , 2, 1}.
So the transition matrix is P = D^{−1}A, which is illustrated in Figure 2. The spectrum of P includes two eigenvalues of magnitude 1, i.e. λ_0 = 1 and λ_{2n−1} = −1. Although P is not a primitive matrix here, it is lumpable. Let Ω_1 = {odd nodes} and Ω_2 = {even nodes}. We can check that both I and II are satisfied.
To see I, note that for any two even nodes, say i = 2 and j = 4, their neighbors are all odd nodes, so P̂_{iΩ_1} = P̂_{jΩ_1} = 1 (and P̂_{iΩ_2} = P̂_{jΩ_2} = 0), whence I is satisfied. To see II, note that the eigenvector associated with λ_0 = 1 is a constant vector, while the eigenvector associated with λ_{2n−1} = −1 takes one constant value on the even nodes and another on the odd nodes. Figure 3 (left) shows the lumpable states when n = 4.

Note that lumpable states might not give optimal bi-partitions in the sense of NCUT = Cut(S)/min(vol(S), vol(S̄)). In this example, the optimal bi-partition by NCut is given by S = {1, . . . , n}, shown on the right of Figure 3. In fact the second largest eigenvalue λ_1 = 0.9010, with eigenvector
v_1 = [0.4714, 0.4247, 0.2939, 0.1049, −0.1049, −0.2939, −0.4247, −0.4714],
gives the optimal bi-partition.

Figure 3. Left: two lumpable states; Right: optimal-bipartition


of Ncut.

Example. Uncoupled Markov chains are lumpable, e.g. the block-diagonal chain
P_0 = diag(P_{Ω_1}, P_{Ω_2}, P_{Ω_3}), for which p̂_{st} = 0 whenever s ≠ t.
A Markov chain P̃ = P_0 + O(ε) is called a nearly uncoupled Markov chain. Such Markov chains can be approximately represented as uncoupled Markov chains with metastable states {Ω_s}, where transitions within a metastable state are fast while transitions across metastable states are slow. Such a separation of scales in dynamics appears in many real-life phenomena, such as protein folding, or life transitions: primary school ↦ middle school ↦ high school ↦ college/university ↦ work unit, etc.
Before the proof of the theorem, we note that condition I is in fact equivalent
to
(87) V U P V = P V,
where U is a k-by-n matrix where each row is a uniform probability that
k×n 1
Uis = χΩ (i), i ∈ V, s ∈ Ω,
|Ωs | s
and V is a n-by-k matrix where each column is a characteristic function on Ωs ,
Vsjn×k = χΩs (j).
With this we have P̂ = U P V and U V = I. Such a matrix representation will be
useful in the derivation of condition II. Now we give the proof of the main theorem.
Proof. I. “⇒” To see the necessity, P is lumpable w.r.t. partition Ω, then it
is necessary that
Probπ0 {x1 ∈ Ωt : x0 ∈ Ωs } = Probπ0 {y1 = t : y0 = s} = p̂st
which does not depend on π0 . Now assume there are two different initial distribution
(1) (2)
such that π0 (i) = 1 and π0 (j) = 1 for ∀i, j ∈ Ωs . Thus
p̂iΩt = Probπ(1) {x1 ∈ Ωt : x0 ∈ Ωs } = p̂st = Probπ(2) {x1 ∈ Ωt : x0 ∈ Ωs } = p̂jΩt .
0 0
5. APPLICATIONS OF LUMPABILITY: MNCUT AND OPTIMAL REDUCTION OF COMPLEX NETWORKS
97

“⇐” To show the sufficiency, we are going to show that if the condition is satisfied,
then the probability

Probπ0 {yt = t : yt−1 = s, · · · , y0 = k0 }

depends only on Ωs , Ωt ∈ Ω. Probability above can be written as Probπt−1 (yt = t)


where πt−1 is a distribution with support only on Ωs which depends on π0 and
history up to t − 1.P But since Probi (yt = t) = p̂iΩt ≡ p̂st for all i ∈ Ωs , then
Probπt−1 (yt = t) = i∈Ωs πt−1 p̂iΩt = p̂st which only depends on Ωs and Ωt .
II.
“⇒”
Since P̂ is nonsingular, let {ψi , i = 1, · · · , k} are independent right eigenvectors
of P̂ , i.e., P̂ ψi = λi ψi . Define φi = V ψi , then φi are independent piecewise constant
vectors in span{χΩi , i = 1, · · · , k}. We have

P φi = P V ψi = V U P V ψi = V P̂ ψi = λi V ψi = λi φi ,

i.e.φi are right eigenvectors of P .


“⇐”
Let {φi , i = 1, · · · , k} be k independent piecewise constant right eigenvectors
of P in span{XΩi , i = 1, · · · , k}. There must be k independent vectors ψi ∈ Rk
that satisfied φi = V ψi . Then

P φi = λi φi ⇒ P V ψi = λi V ψi ,

Multiplying V U to the left on both sides of the equation, we have

V U P V ψi = λi V U V ψi = λi V ψi = P V ψi , (U V = I),

which implies
(V U P V − P V )Ψ = 0, Ψ = [ψ1 , . . . , ψk ].
Since Ψ is nonsingular due to independence of ψi , whence we must have V U P V =
PV . 

5. Applications of Lumpability: MNcut and Optimal Reduction of


Complex Networks
If the random walk on a graph P has top k nearly piece-wise constant right
eigenvectors, then the Markov chain P is approximately lumpable. Some spectral
clustering algorithms are proposed in such settings.

5.1. MNcut. Meila-Shi (2001) calls the following algorithm as MNcut, stand-
ing for modified Ncut. Due to the theory above, perhaps we’d better to call it
multiple spectral clustering.
1) Find top k right eigenvectors P Φi = λi Φi , i = 1, · · · , k, λi = 1 − o().
2) Embedding Y n×k = [φ1 , · · · , φk ] → diffusion map when λi ≈ 1.
3) k-means (or other suitable clustering methods) on Y to k-clusters.

5.2. Optimal Reduction and Complex Network.


98 6. RANDOM WALK ON GRAPHS

5.2.1. Random Walk on Graph. Let G = G(S, E) denotes an undirected graph.


Here S has the meaning of ”states”. |S| = n  1 . Let A = e(x, y) denotes its
adjacency matrix, that is,
(
1 x∼y
e(x, y) =
0 otherwise
Here x ∼ y means (x, y) ∈ E . Here, weights on different edges are the same 1.
They may be different in some cases.
Now we define a random walk on G . Let
e(x, y) X
p(x, y) = where d(x) = e(x, y)
d(x)
y∈S

We can check that P = p(x, y) is a stochastic matrix and (S, P ) is a Markov


chain. If G is connected, this Markov chain is irreducible and if G is not a tree,
the chain is even primitive. We assume G is connected from now on. If it is not,
we can focus on each of its connected component.So the Markov chain has unique
invariant distributionµ by irreducibility:
d(x)
µ(x) = P ∀x ∈ S
d(z)
z∈S

A Markov chain defined as above is reversible. That is, detailed balance con-
dition is satisfied:
µ(x)p(x, y) = µ(y)p(y, x) ∀x, y ∈ S
Define an inner product on spaceL2µ :
XX
< f, g >µ = f (x)g(x)µ(x) f, g ∈ L2µ
x∈S y∈S

L2µ is a Hilbert space with this inner product. If we define an operator T on it:
X
T f (x) = p(x, y)f (y) = E[y|x] f (y)
y∈S

We can check that T is a self adjoint operator on L2µ :


X
< T f (x), g(x) >µ = T f (x)g(x)µ(x)
x∈S
XX
= p(x, y)f (y)g(x)µ(x) with detailed balance condition
x∈S y∈S
XX
= p(y, x)f (y)g(x)µ(y)
y∈S x∈S
X
= f (y)T g(y)µ(y)
y∈S
= < f (x), T g(x) >µ
n−1
That means T is self-adjoint. So there is a set of orthonormal basis {φj (x)}j=0 and
a set of eigenvalue {λj }j=0 ⊂ [−1, 1], 1 = λ0 > λ1 > λ2 > · · · > λn−1 , s.t.Probφj =
n−1

λj φj , j = 0, 1, . . . n − 1, and < φi , φj >µ = δij , ∀i, j = 0, 1, . . . n − 1.So φj (x) is right


5. APPLICATIONS OF LUMPABILITY: MNCUT AND OPTIMAL REDUCTION OF COMPLEX NETWORKS
99

eigenvectors. The corresponding left eigenvectors are denoted by {ψj (x)}n−1


j=0 . One
can obtain that ψj (x) = φj (x)µ(x). In fact,because T φj = λj φj ,

P
µ(x) p(x, y)φj (y) = λj φj (x)µ(x) with detailed balance condition
y∈S
P
p(y, x)µ(y)φj (y) = λj φj (x)µ(x) that is
y∈S
P
ψj Prob(x) = p(y, x)φ(y) = λj (x)ψ(x)
y∈S

Generally, T has spectral decomposition


n−1
X n−1
X
p(x, y) = λi ψi (x)φ(y) = p(x, y)φi (x)φi (y)µ(x)
i=0 i=0

Since P is a stochastic matrix, we have λ0 = 1,the corresponding right eigen-


vector is φ0 (x) ≡ 1,and left eigenvector is the invariant distribution ψ0 (x) = µ(x)
5.2.2. Optimal Reduction. This section is by [ELVE08]. Suppose the number
of states n is very large. The scale of Markov chain is so big that we want a smaller
chain to present its behavior. That is, we want to decompose the state space S:
SN T
Let S = i=1 Si , s.t.N  n, Si Sj = ∅, ∀i 6= j, and define a transition probability
P̂ on it. We want the Markov chain ({Si }, P̂ ) has similar property as chain (S, P ).
We call {Si } coarse space. The first difficult we’re facing is whether ({Si }, P̂ )
really Markovian. We want
Pr(Xit+1 ∈ Sit+1 |xit ∈ Sit , . . . X0 ∈ Si0 ) = Pr(Xit+1 ∈ Sit+1 |xit ∈ Sit )
and this probability is independent of initial distribution. This property is so-called
lumpability, which you can refer Lecture 9. Unfortunately, lumpability is a strick
constraint that it seldom holds.
So we must modify our strategy of reduction. One choice is to do a optimization
with some norm on L2µ . First, Let us introduce Hilbert-Schmidt norm on L2µ .
Suppose F is an operator on L2µ , and F f (x) =
P
K(x, y)f (y)µ(y). Here K is
y∈S
called a kernel function. If K is symmetric, F is self adjoint. In fact,
XX
< F f (x), g(x) >µ = K(x, y)f (y)µ(y)g(x)µ(x)
x∈S y∈S
XX
= K(y, x)f (y)µ(y)g(x)µ(x)
y∈S x∈S
= < f (x), F g(x) >µ

So F guarantee a spectral decomposition. Let {λj }n−1 j=0 denote its eigenvalue
n−1
and {φj (x)}j=0 denote its eigenvector, then k(x, y) can be represented as K(x, y) =
n−1
P
λj φj (x)φj (y). Hilbert-Schmidt norm of F is defined as follow:
j=0

n−1
X
kF k2HS = tr(F ∗ F ) = tr(F 2 ) = λ2i
i=0
100 6. RANDOM WALK ON GRAPHS

One can check that kF k2HS = K 2 (x, y)µ(x)µ(y). In fact,


P
x,y∈S
 2
X n−1
X
RHS =  λj φj (x)φj (y) µ(x)µ(y)
x,y∈S j=0
n−1
X n−1
X X
= λj λk φj (x)φk (x)φj (y)φk (y)µ(x)µ(y)
j=0 k=0 x,y∈S
n−1
X
= λ2j
j=0

the last equal sign dues do the orthogonality of eigenvectors. It is clear that if
L2µ = L2 , Hilbert-Schmidt norm is just Frobenius norm.
Now we can write our T as
X X p(x, y)
T f (x) = p(x, y)f (y) = f (y)µ(y)
µ(y)
y∈S y∈S

p(x,y)
and take K(x, y) = µ(y) . By detailed balance condition, K is symmetric. So
X p2 (x, y) X µ(x)
kT k2HS = µ(x)µ(y) = p2 (x, y)
µ2 (y) µ(y)
x,y∈S x,y∈S

We’ll rename kP kHS to kP kµ in the following paragraphs.


Now go back to our reduction problem. Suppose we have a coarse space {Si }N i=1 ,
and a transition probability P̂ (k, l), k, l = 1, 2, . . . N on it. If we want to compare
({Si }, P̂ ) with (S, P ), we must ”lift” the coarse process to fine space. One nature
consideration is as follow: if x ∈ Sk , y ∈ Sl , first, we transit from x to Sl follow the
rule P̂ (k, l), and in Sl , we transit to y ”randomly”. To make ”randomly” rigorously,
one may choose the lifted transition probably as follow:
N
X 1
P̃ (x, y) = 1Sk (x)P̂ (k, l)1Sl (y)
|Sl |
k,l=1

One can check that this P̃ is a stochastic matrix, but it is not reversible. One
more convenient choice is transit ”randomly” by invariant distribution:
N
X µ(y)
P̃ (x, y) = 1Sk (x)P̂ (k, l)1Sl (y)
µ̂(Sl )
k,l=1

where
X
µ̂(Sl ) = µ(z)
z∈Sl

Then you can check this matrix is not only a stochastic matrix, but detailed
balance condition also hold provides P̂ on {Si } is reversible.
Now let us do some summary. Given a decomposition of state space S =
SN
i=1 Si , and a transition probability P̂ on coarse space, we may obtain a lifted
5. APPLICATIONS OF LUMPABILITY: MNCUT AND OPTIMAL REDUCTION OF COMPLEX NETWORKS
101

transition probability P̃ on fine space. Now we can compare ({Si }, P̂ ) and (S, P )
in a clear way: kP − P̃ kµ . So our optimization problem can be defined clearly:
E = min min kP − P̂ k2µ
S1 ...SN P̂

That is, given a partition of S, find the optimal P̂ to minimize kP − P̂ k2µ , and
find the optimal partition to minimize E.
N
5.2.3. Community structure of complex network. Given a partition S = ∪ Sk ,
k=1
the solution of optimization problem
min kp − p̂k2µ

is
1 X
p̂∗kl = µ(x)p(x, y)
µ̂(Sk )
x∈Sk ,y∈Sl
It is easy to show that {p̂∗kl } form a transition probability matrix with detailed
balance condition:
p̂∗kl ≥ 0
X 1 X XX
p̂∗kl = µ(x) p(x, y)
µ̂(Sk )
l x∈Sk l y∈Sl
1 X
= µ(x) = 1
µ̂(Sk )
x∈Sk
X
µ̂(Sk )p̂∗kl = µ(x)p(x, y)
x∈Sk ,y∈Sl
X
= µ(y)p(y, x)
x∈Sk ,y∈Sl
= µ̂(Sl )p̂∗lk
The last equality implies that µ̂ is the invariant distribution of the reduced Markov
chain. Thus we find the optimal transition probability in the coarse space. p̂∗ has
the following property
kp − p∗ k2µ = kpk2µ − kp̂∗ k2µ̂
However, the partition of the original graph is not given in advance, so we
need to minimize E ∗ with respect to all possible partitions. This is a combinatorial
optimization problem, which is extremely difficult to find the exact solution. An
effective approach to obtain an approximate solution, which inherits ideas of K-
means clustering, is proposed as following: First we rewrite E ∗ as
N
X µ(x) X p̂∗
E∗ = |p(x, y) − 1Sk (x) kl 1Sl (y)µ(y)|2
µ(y) µ̂(Sk )
x,y∈S k,l=1
N 2
X X p(x, y) p̂∗
= µ(x)µ(y) − kl
µ(y) µ̂(Sk )
k,l=1 x∈Sk ,y∈Sl
N X
,
X
E ∗ (x, Sk )
k=1 x∈Sk
102 6. RANDOM WALK ON GRAPHS

where
N X 2
X p(x, y) p̂∗
E ∗ (x, Sk ) = µ(x)µ(y) − kl
µ(y) µ̂(Sk )
l=1 y∈Sl
Based on above expression, a variation of K-means is designed:
N
E step: Fix partition ∪ Sk , compute p̂∗ .
k=1
(n+1)
M step: Put x in Sk such that

E (x, Sk ) = min E ∗ (x, Sj )
j

5.2.4. Extensions: Fuzzy Partition. This part is in [LLE09, LL11]. It is un-


necessary to require that each vertex belong to a definite class. We introduce ρk (x)
as the probability of a vertex x belonging to class k, and we lift the Markov chain
in coarse space to fine space using the following transition probability
N
X µ(y)
p̃(x, y) = ρk (x)p̂kl ρl (y)
µ̂l
k,l=1

Now we solve
min kp − p̃k2µ

to obtain a optimal reduction.
5.2.5. Model selection. Note the number of partition N should also not be
given in advance. But in strategies similar to K-means, the value of minimal E ∗ is
monotone decreasing with N . This means larger N is always preferred.
A possible approach is to introduce another quantity which is monotone in-
creasing with N . We take K-means clustering for example. In K-means clustering,
only compactness is reflected. If another quantity indicates separation of centers of
each cluster, we can minimize the ratio of compactness and separation to find an
optimal N .

6. Mean First Passage Time


Consider a Markov chain P on graph G = (V, E). In this section we study the
mean first passage time between vertices, which exploits the unnormalized graph
Laplacian and will be useful for commute time map against diffusion map.
Definition.
(1) First passage time (or hitting time): τij := inf(t ≥ 0|xt = j, x0 = i);
(2) Mean First Passage Time: Tij = Ei τij ;
+
(3) τij := inf(t > 0|xt = j, x0 = i), where τii+ is also called first return time;
(4) Tij+ = Ei τij
+
, where Tii+ is also called mean first return time.
Here Ei denotes the conditional expectation with fixed initial condition x0 = i.
Theorem 6.1. Assume that P is irreducible. Let L = D − W be the unnormalized
graph Laplacian with Moore-Penrose inverse L† , where D = diag(di ) with di =
P
j:j i Wij being the degree of node i. Then
(1) Mean First Passage Time is given by
Tii = 0,
L†ik dk − L†ij vol(G) + L†jj vol(G) − L†jk dk ,
X X
Tij = i 6= j.
k k
6. MEAN FIRST PASSAGE TIME 103

(2) Mean First Return Time is given by


1
Tii+ = , Tij+ = Tij .
πi
Proof. Since P is irreducible, then the stationary distribution is unique, de-
noted by π. By definition, we have
X
(88) Tij+ = Pij · 1 + +
Pik (Tkj + 1)
k6=j

Let E = 1 · 1T where 1 ∈ Rn is a vector with all elements one, Td+ = diag(Tii+ ).


Then 127 becomes
(89) T + = E + P (T + − Td+ ).
For the unique stationary distribution π, π T P = P , whence we have
πT T + = π T 1 · 1T + π T P (T + − Td+ )
πT T + = 1T + π T T + − π T Td+
1 = Td+ π
1
Tii+ =
πi
Before proceeding to solve equation (127), we first show its solution is unique.
Lemma 6.2. P is irreducible ⇒ T + and T are both unique.
Proof. Assume S is also a solution of equation (128), then
(I − P )S = E − P diag(1/πi ) = (I − P )T +

⇔ ((I − P )(T + − S) = 0.
Therefore for irreducible P , S and T + must satisfy
diag(T + − S) = 0


T + − S = 1uT , ∀u

which implies T + = S. T ’s uniqueness follows from T = T + − Td+ . 

Now we continue with the proof of the main theorem. Since T = T + − Td+ ,
then (127) becomes
T = E + P T − Td+
(I − P )T = E − Td+
(I − D−1 W )T = F
(D − W )T = DF
LT = DF
where F = E − Td+ and L = D − W is the (unnormalized)
Pn graph Laplacian. Since
T
L is symmetric and irreducible, we have L = Pn k=1 k k k , where 0 = µ1 < µ2 ≤
µ ν ν
· · · ≤ µn , ν1 = 1/||1||, νkT νl = δkl . Let L† = k=2 µ1k νk νkT , L† is called the pseudo-
inverse (or Moore-Penrose inverse) of L. We can test and verify L† satisfies the
104 6. RANDOM WALK ON GRAPHS

following four conditions


 † †
 L LL = L†
LL† L

= L


 (LL† )T = LL†
(L† L)T = L† L

From LT = D(E − Td+ ), multiplying both sides by L† leads to


T = L† DE − L† DTd+ + 1 · uT ,
as 1 · uT ∈ ker(L), whence
n
1
L†ik dk − L†ij dj ·
X
Tij = + uj
πj
k=1
n
L†ik dk + L†ii vol(G),
X
ui = − j=i
k=1

L†ik dk − L†ij vol(G) + L†jj vol(G) − L†jk dk


X X
Tij =
k k


P
Note that vol(G) = i di and πi = di /vol(G) for all i.

As L† is a positive definite matrix, this leads to the following corollary.


Corollary 6.3.
(90) Tij + Tji = vol(G)(L†ii + L†jj − 2L†ij ).
Therefore the average commute time between i and j leads to an Euclidean distance
metric
p
dc (xi , xj ) := Tij + Tji
often called commute time distance.

7. Transition Path Theory


The transition path theory was originally introduced in the context of continuous-
time Markov process on continuous state space [EVE06] and discrete state space
[MSVE09], see [EVE10] for a review. Another description of discrete transition
path theory for molecular dynamics can be also found in [NSVE+ 09]. The follow-
ing material is adapted to the setting of discrete time Markov chain with transition
probability matrix P [?]. We assume reversibility in the following presentation,
which can be extended to non-reversible Markov chains.
Assume that an irreducible Markov  Chain on graph G = (V, E) admits the
Pll Plu
following decomposition P = D−1 W = . Here Vl = V0 ∪V1 denotes the
Pul Puu
labeled vertices with source set V0 (e.g. reaction state in chemistry) and sink set V1
(e.g. product state in chemistry), and Vu is the unlabeled vertex set (intermediate
states). That is,
• V0 = {i ∈ Vl : fi = f (xi ) = 0}
• V1 = {i ∈ Vl : fi = f (xi ) = 1}
• V = V0 ∪ V1 ∪ Vu where Vl = V0 ∪ V1
7. TRANSITION PATH THEORY 105

Given two sets V0 and V1 in the state space V , the transition path theory tells
how these transitions between the two sets happen (mechanism, rates, etc.). If we
view V0 as a reactant state and V1 as a product state, then one transition from V0
to V1 is a reaction event. The reactve trajectories are those part of the equilibrium
trajectory that the system is going from V0 to V1 .
Let the hitting time of Vl be
τik = inf{t ≥ 0 : x(0) = i, x(t) ∈ Vk }, k = 0, 1.
The central object in transition path theory is the committor function. Its
value at i ∈ Vu gives the probability that a trajectory starting from i will hit the
set V1 first than V0 , i.e., the success rate of the transition at i.
Proposition 7.1. For ∀i ∈ Vu , define the committor function
qi := P rob(τi1 < τi0 ) = P rob(trajectory starting from xi hit V1 before V0 )
which satisfies the following Laplacian equation with Dirichlet boundary conditions
(Lq)(i) = [(I − P )q](i) = 0, i ∈ Vu
qi∈V0 = 0, qi∈V1 = 1.
The solution is
qu = (Du − Wuu )−1 Wul ql .
Proof. By definition,

1
 xi ∈ V 1
1 0
qi = P rob(τi < τi ) = 0 xi ∈ V 0

P
j∈V Pij qj i ∈ Vu
This is because ∀i ∈ Vu ,
qi = P r(τiV1 < τiV0 )
X
= Pij qj
j
X X X
= Pij qj + Pij qj + Pij qj
j∈V1 j∈V0 j∈Vu
X X
= Pij + Pij qj
j∈V1 j∈Vu

∴ qu = Pul ql + Puu qu = Du−1 Wul ql + Du−1 Wuu qu


multiply Du to both side and reorganize
(Du − Wuu )qu = Wul ql
If Du − Wuu is reversible, we get
qu = (Du − Wuu )−1 Wul ql .

The committor function provides natural decomposition of the graph. If q(x)
is less than 0.5, i is more likely to reach V0 first than V1 ; so that {i | q(x) < 0.5}
gives the set of points that are more attached to set V0 .
Once the committor function is given, the statistical properties of the reaction
trajectories between V0 and V1 can be quantified. We state several propositions
106 6. RANDOM WALK ON GRAPHS

characterizing transition mechanism from V0 to V1 . The proof of them is an easy


adaptation of [EVE06, MSVE09] and will be omitted.

Proposition 7.2 (Probability distribution of reactive trajectories). The probabil-


ity distribution of reactive trajectories
(91) πR (x) = P(Xn = x, n ∈ R)
is given by
(92) πR (x) = π(x)q(x)(1 − q(x)).

The distribution πR gives the equilibrium probability that a reactive trajec-


tory visits x. It provides information about the proportion of time the reactive
trajectories spend in state x along the way from V0 to V1 .

Proposition 7.3 (Reactive current from V0 to V1 ). The reactive current from A


to B, defined by
(93) J(xy) = P(Xn = x, Xn+1 = y, {n, n + 1} ⊂ R),
is given by
(
π(x)(1 − q(x))Pxy q(y), x 6= y;
(94) J(xy) =
0, otherwise.

The reactive current J(xy) gives the average rate the reactive trajectories jump
from state x to y. From the reactive current, we may define the effective reactive
current on an edge and transition current through a node which characterizes the
importance of an edge and a node in the transition from A to B, respectively.

Definition. The effective current of an edge xy is defined as


(95) J + (xy) = max(J(xy) − J(yx), 0).
The transition current through a node x ∈ V is defined as
+
 P
 Py∈V J (xy), x∈A
(96) T (x) = J + (yx), x∈B
 Py∈V + P +
y∈V J (xy) = y∈V J (yx), x 6∈ A ∪ B

In applications one often examines partial transition current through a node


connecting two communities V − = {x : q(x) < 0.5} and V + = {x : q(x) ≥ 0.5},
+ −
P
e.g. y∈V + J (xy) for x ∈ V , which shows relative importance of the node in
bridging communities.
The reaction rate ν, defined as the number of transitions from V0 to V1 hap-
pened in a unit time interval, can be obtained from adding up the probability
current flowing out of the reactant state. This is stated by the next proposition.

Proposition 7.4 (Reaction rate). The reaction rate is given by


X X
(97) ν= J(xy) = J(xy).
x∈A,y∈V x∈V,y∈B
7. TRANSITION PATH THEORY 107

Finally, the committor functions also give information about the time propor-
tion that an equilibrium trajectory comes from A (the trajectory hits A last rather
than B).

Proposition 7.5. The proportion of time that the trajectory comes from A (resp. from
B) is given by
X X
(98) ρA = π(x)q(x), ρB = π(x)(1 − q(x)).
x∈V x∈V
CHAPTER 7

Diffusion Map

Finding meaningful low-dimensional structures hidden in high-dimensional ob-


servations is an fundamental task in high-dimensional statistics. The classical tech-
niques for dimensionality reduction, principal component analysis (PCA) and multi-
dimensional scaling (MDS), guaranteed to discover the true structure of data lying
on or near a linear subspace of the high-dimensional input space. PCA finds a
low-dimensional embedding of the data points that best preserves their variance as
measured in the high-dimensional input space. Classical MDS finds an embedding
that preserves the interpoint distances, equivalent to PCA when those distances
are Euclidean [TdL00]. However, these linear techniques cannot adequately han-
dle complex nonlinear data. Recently more emphasis is put on detecting non-linear
features in the data. For example, ISOMAP [TdL00] etc. extends MDS by in-
corporating the geodesic distances imposed by a weighted graph. It defines the
geodesic distance to be the sum of edge weights along the shortest path between
two nodes. The top n eigenvectors of the geodesic distance matrix are used to
represent the coordinates in the new n-dimensional Euclidean space. Nevertheless,
as mentioned in [EST09], in practice robust estimation of geodesic distance on
a manifold is an awkward problem that require rather restrictive assumptions on
the sampling. Moreover, since the MDS step in the ISOMAP algorithm intends to
preserve the geodesic distance between points, it provides a correct embedding if
submanifold is isometric to a convex open set of the subspace. If the submanifold is
not convex, then there exist a pair of points that can not be joined by a straight line
contained in the submanifold. Therefore,their geodesic distance can not be equal
to the Euclidean distance. Diffusion maps [CLL+ 05] leverages the relationship
between heat diffusion and a random walk (Markov Chain); an analogy is drawn
between the diffusion operator on a manifold and a Markov transition matrix op-
erating on functions defined on a weighted graph whose nodes were sampled from
the manifold. A diffusion map, which maps coordinates between data and diffusion
space, aims to re-organize data according to a new metric. In this class, we will
discuss this very metric-diffusion distance and it’s related properties.

1. Diffusion map and Diffusion Distance


Viewing the data points x1 ,x2 ,. . . ,xn as the nodes of a weighted undirected
graph G = (V, EW )(W = (Wij )), where the weight Wij is a measure of the similarity
between xi and xj . There are many ways to define Wij , such as:
(1) Heat kernel. If xi and xj are connected, put:
−kxi −xj k2
(99) Wijε = e ε

109
110 7. DIFFUSION MAP

with some positive parameter ε ∈ R+


0.

(2) Cosine Similarity


xi xj
(100) Wij = cos(∠(xi , xj )) = ·
kxi k kxj k

(3) Kullback-Leibler divergence. P Assume xi and xj are two nonvanishing


k
probability distribution, i.e. x
k i = 1 and xki > 0. Define Kullback-
Leibler divergence
(k)
X (k) xi
D(KL) (xi ||xj ) = xi log (k)
k xj

and its symmetrization D̄ = D(KL) (xi ||xj ) + DKL (xj ||xi ), which measure
a kind of ‘distance’ between distributions; Jensen-Shannon divergence as
the symmetrization of KL-divergence between one distribution and their
average,

D(JS) (xi , xj ) = D(KL) (xi ||(xi + xj )/2) + D(KL) (xj ||(xi + xj )/2)

A similarity kernel can be

(101) Wij = −D(KL) (xi ||xj )

or

(102) Wij = −D(JS) (xi , xj )

The similarity functions are widely used in various applications. Sometimes


the matrix W is positive semi-definite (psd), that for any vector x ∈ Rn ,

(103) xT W x ≥ 0.

PSD kernels includes heat kernels, cosine similarity kernels, and JS-divergence ker-
nels. But in many other cases (e.g. KL-divergence kernels), similarity kernels are
not necessarily PSD. For a PSD kernel, it can be understood as a generalized co-
variance function; otherwise, diffusions as random walks on similarity graphs will
be helpful to disclose their structures.
n
Define A := D−1 W , where D = diag( Wij ) , diag(d1 , d2 , · · · , dn ) for sym-
P
j=1
metric Wij = Wji ≥ 0. So
n
X
(104) Aij = 1 ∀i ∈ {1, 2, · · ·, n} (Aij ≥ 0)
j=1

whence A is a row Markov matrix of the following discrete time Markov chain
{Xt }t∈N satisfying

(105) P (Xt+1 = xj | Xt = xi ) = Aij .


1. DIFFUSION MAP AND DIFFUSION DISTANCE 111

1.1. Spectral Properties of A. We may reach a spectral decomposition of


A with the aid of the following symmetric matrix S which is similar to A. Let
1 1
(106) S := D− 2 W D− 2
which is symmetric and has an eigenvalue decomposition
(107) S = V ΛV T , where V V T = In , Λ = diag(λ1 , λ2 , · · ·, λn )
So
1 1 1 1
A = D−1 W = D−1 (D 2 SD 2 ) = D− 2 SD 2
which is similar to S, whence sharing the same eigenvalues as S. Moreover
1 1
(108) A = D− 2 V ΛV T D 2 = ΦΛΨT
1 1
where Φ = D− 2 V and Ψ = D 2 V give right and left eigenvectors of A respectively,
AΦ = ΦΛ and ΨT A = ΛΨT , and satisfy ΨT Φ = In .
The Markov matrix A satisfies the following properties by Perron-Frobenius
Theory.
Proposition 1.1. (1) A has eigenvalues λ(A) ⊂ [−1, 1].
(2) A is irreducible, if and only if ∀(i, j) ∃t s.t. (At )ij > 0 ⇔ Graph G = (V, E)
is connected
(3) A is irreducible ⇒ λmax = 1
(4) A is primitive, if and only if ∃t > 0 s.t. ∀(i, j) (At )ij > 0 ⇔ Graph
G = (V, E) is path-t connected, i.e. any pair of nodes are connected by a
path of length no more than t
(5) A is irreducible and ∀i, Aii > 0 ⇒ A is primitive
(6) A is primitive ⇒ −1 6∈ λ(A)
(7) Wij is induced from the heat kernel, or any positive definite function
⇒ λ(A) ≥ 0
Proof. (1) assume λ and v are the eigenvalue and eigenvector of A, soAv =
λv. Find j0 s.t. |vj0 | ≥ |vj |, ∀j 6= j0 where vj is the j-th entry of v. Then:
n
X
λvj0 = (Av)j0 = Aj 0 j v j
j=1

So:
n
X n
X
|λ||vj0 | = | Aj 0 j v j | ≤ Aj0 j |vj | ≤ |vj0 |.
j=1 j=1

(7) Let S = D−1/2 W D−1/2 . As W is positive semi-definite, so S has eigenvalues


λ(S) ≥ 0. Note that A = D−1/2 SD1/2 , i.e. similar to S, whence A shares the same
eigenvalues with S. 

Sort the eigenvalues 1 = λ1 ≥ λ2 ≥ . . . ≥ λn ≥ −1. Denote Φ = [φ1 , . . . , φn ]


and Ψ = [ψ1 , . . . , ψn ]. So the primary (first) right and left eigenvectors are
φ1 = 1,

ψ1 = π
as the stationary distribution of the Markov chain, respectively.
112 7. DIFFUSION MAP

1.2. Diffusion Map and Distance. Diffusion map of a point x is defined


as the weighted Euclidean embedding via right eigenvectors of Markov matrix A.
From the interpretation of the matrix A as a Markov transition probability matrix
(109) Aij = P r{s(t + 1) = xj |s(t) = xi }
it follows that
(110) Atij = P r{s(t + 1) = xj |s(0) = xi }
We refer to the i0 th row of the matrix At , denoted Ati,∗ , as the transition prob-
ability of a t-step random walk that starts at xi . We can express At using the
decomposition of A. Indeed, from
(111) A = ΦΛΨT
with ΨT Φ = I, we get
(112) At = ΦΛt ΨT .
Written in a component-wise way, this is equivalent to
Xn
t
(113) Aij = λtk φk (i)ψk (j).
k=1

Therefore Φ and Ψ are right and left eigenvectors of At , respectively.


Let the diffusion map Φt : V 7→ Rn at scale t be
 t 
λ1 φ1 (i)
 λt2 φ2 (i) 
(114) Φt (xi ) :=  ..
 

 . 
λtn φn (i)
The mapping of points onto the diffusion map space spanned the right eigenvectors
of the row Markov matrix has a well defined probabilistic meaning in terms of the
random walks. Lumpable Markov chains with Piece-wise constant right eigenvec-
tors thus help us understand the behavior of diffusion maps and distances in such
cases.
The diffusion distance is defined to be the Euclidean distances between embed-
ded points,
n
!1/2
X
2t 2
(115) dt (xi , xj ) := kΦt (xi ) − Φt (xj )kRn = λk (φk (i) − φk (j)) .
k=1
The main intuition to define diffusion distance is to describe “perceptual dis-
tances” of points in the same and different clusters. For example Figure 1 shows
that points within the same cluster have small diffusion distances while in different
clusters have large diffusion distances. This is because the metastability phenom-
enon of random walk on graphs where each cluster represents a metastable state.
The main properties of diffusion distances are as follows.
• Diffusion distances reflect average path length connecting points via ran-
dom walks.
• Small t represents local random walk, where diffusion distances reflect
local geometric structure.
• Large t represents global random walk, where diffusion distances reflect
large scale cluster or connected components.
1. DIFFUSION MAP AND DIFFUSION DISTANCE 113

Figure 1. Diffusion Distances dt (A, B) >> dt (B, C) while graph


shortest path dgeod (A, B) ∼ dgeod (B, C).

1.3. Examples. Three examples about diffusion map:


EX1: two circles.
Suppose graph G : (V, E). Matrix W satisfies wij > 0, if and only if (i, j) ∈ E.
Choose k(x, y) = Ikx−yk<δ . In this case,
 
A1 0
A= ,
0 A2
where A1 is a n1 × n1 matrix, A2 is a n2 × n2 matrix, n1 + n2 = n.
Notice that the eigenvalue λ0 = 1 of A is of multiplicity 2, the two eigenvectors
0
are φ0 = 1n and φ0 = [c1 1Tn1 , c2 1Tn2 ]T c1 6= c2 .

Φ1D 1D

t (x1 ), · · · , Φt (xn1 ) = c1
Diffusion Map :
Φt (xn1 +1 ), · · · , Φ1D
1D
t (xn ) = c2
EX2: ring graph. ”single circle”
In this case, W is a circulant matrix
 
1 1 0 0 ··· 1
 1 1 1 0 ··· 0 
 
W =  0 1 1 1 ··· 0 
 
 .. .. .. .. .. 
 . . . . ··· . 
1 0 0 0 ··· 1
The eigenvalue of W is λk = cos 2πk n
n k = 0, 1, · · · , 2 and the corresponding eigen-
2π 2πkj 2πkj t
vector is (uk )j = ei n kj j = 1, · · · , n. So we can get Φ2D t (xi ) = (cos n , sin n )c
EX3: order the face. Let
kx − yk2
 
kε (x, y) = exp − ,
ε
Wijε = kε (xi , xj ) and Aε = D−1 W ε where D = diag( j Wijε ). Define a graph
P

Laplacian (recall that L = D−1 A − I)

1 ε→0
Lε := (Aε − I) −→ backward Kolmogorov operator
ε
114 7. DIFFUSION MAP

Figure 2. Two circles

Figure 3. EX2 single circle

Figure 4. Order the face

1 00 0 0

1 2 φ (s) 0− φ (s)V (s) = λφ(s)
Lε f = 4M f − ∇f · ∇V ⇒ Lε φ = λφ ⇒ 0
2 φ (0) = φ (1) = 0

Where V (s) is the Gibbs free energy and p(s) = e−V (x) is the density of data points
along the curve. 4M is Laplace-Beltrami Operator. If p(x) = const, we can get
00
(116) V (s) = const ⇒ φ (s) = 2λφ(s) ⇒ φk (s) = cos(kπs), 2λk = −k 2 π 2

On the other hand p(s) 6= const, one can show 1 that φ1 (s) is monotonic for
arbitrary p(s). As a result, the faces can still be ordered by using φ1 (s).

1.4. Properties of Diffusion Distance.

Lemma 1.2. The diffusion distance is equal to a `2 distance between the proba-
bility clouds Ati,∗ and Atj,∗ with weights 1/dl ,i.e.,

(117) dt (xi , xj ) = kAti,∗ − Atj,∗ k`2 (Rn ,1/d)

1by changing to polar coordinate p(s)φ0 (s) = r(s) cos θ(s), φ(s) = r(s) sin θ(s) ( the so-called
‘Prufer Transform’ ) and then try to show that φ0 (s) is never zero on (0, 1).
1. DIFFUSION MAP AND DIFFUSION DISTANCE 115

Proof.
n
2
X 1
kAti,∗ − Atj,∗ k`2 (Rn ,1/d) = (Atil − Atjl )2
dl
l=1
n X n
X 1
= [ λtk φk (i)ψk (l) − λtk φk (j)ψk (l)]2
dl
l=1 k=1
n X
n
X 1
= λtk (φk (i) − φk (j))ψk (l)λtk0 (φk0 (i) − φk0 (j))ψk0 (l)
dl
l=1 k,k0
n n
X X ψk (l)ψk0 (l)
= λtk λtk0 (φk (i) − φk (j))(φk0 (i) − φk0 (j))
0
dl
k,k l=1
Xn
= λtk λtk0 (φk (i) − φk (j))(φk0 (i) − φk0 (j))δkk0
k,k0
n
X
= λ2t
k (φk (i) − φk (j))
2

k=1
= d2t (xi , xj )

In practice we usually do not use the mapping Φt but rather the truncate
diffusion map Φδt that makes use of fewer than n coordinates. Specifically, Φδt uses
t
only the eigenvectors for which the eigenvalues satisfy |λk | > δ. When t is enough
large, we can use the truncated diffusion distance:
2 21
X
(118) dδt (xi , xj ) = kΦδt (xi ) − Φδt (xj )k = [ λ2t
k (φk (i) − φk (j)) ]
k:|λk |t >δ
2
as an approximation of the weighted ` distance of the probability clouds. We now
derive a simple error bound for this approximation.
Lemma 1.3 (Truncated Diffusion Distance). The truncated diffusion distance sat-
isfies the following upper and lower bounds.
2δ 2
d2t (xi , xj ) − (1 − δij ) ≤ [dδt (xi , xj )]2 ≤ d2t (xi , xj ),
dmin
P
where dmin = min1≤i≤n di with di = j Wij .
1
Proof. Since, Φ = D− 2 V , where V is an orthonormal matrix (V V T =
T
V V = I), it follows that
1 1
(119) ΦΦT = D− 2 V V T D− 2 = D−1
Therefore,
n
X δij
(120) φk (i)φk (j) = (ΦΦT )ij =
di
k=1
and
n
X 1 1 2δij
(121) (φk (i) − φk (j))2 = + −
di dj di
k=1
116 7. DIFFUSION MAP

clearly,
n
X 2
(122) (φk (i) − φk (j))2 ≤ (1 − δij ), f orall i, j = 1, 2, · · · , n
dmin
k=1

As a result,
X
[dδt (xi , xj )]2 = d2t (xi , xj ) − λ2t
k (φk (i) − φk (j))
2

k:|λk |t <δ
X
≥ d2t (xi , xj ) − δ2 (φk (i) − φk (j))2
k:|λk |t <δ
n
X
≥ d2t (xi , xj ) − δ 2 (φk (i) − φk (j))2
k=1
2

≥ d2t (xi , xj ) − (1 − δij )
dmin
on the other hand, it is clear that
(123) [dδt (xi , xj )]2 ≤ d2t (xi , xj )
We conclude that
2δ 2
(124) d2t (xi , xj ) − (1 − δij ) ≤ [dδt (xi , xj )]2 ≤ d2t (xi , xj )
dmin


Therefore, for small δ the truncated diffusion distance provides a very good
approximation to the diffusion distance. Due to the fast decay of the eigenvalues,
the number of coordinates used for the truncated diffusion map is usually much
smaller than n, especially when t is large.

1.5. Is the diffusion distance really a distance? A distance function d :


X × X → R must satisfy the following properties:
(1) Symmetry: d(x, y) = d(y, x)
(2) Non-negativity: d(x, y) ≥ 0
(3) Identity of indiscernibles: d(x, y) = 0 ⇔ x = y
(4) Triangle inequality: d(x, z) + d(z, y) ≥ d(x, y)
Since the diffusion map is an embedding into the Euclidean space Rn , the
diffusion distance inherits all the metric properties of Rn such as symmetry, non-
negativity and the triangle inequality. The only condition that is not immediately
implied is dt (x, y) = 0 ⇔ x = y. Clearly, xi = xj implies that dt (xi , xj ) = 0. But
is it true that dt (xi , xj ) = 0 implies xi = xj ? Suppose dt (xi , xj ) = 0, Then,
n
X
(125) 0 = d2t (xi , xj ) = λ2t
k (φk (i) − φk (j))
2

k=1

It follows that φk (i) = φk (j) for all k with λk 6= 0. But there is still the possibility
that φk (i) 6= φk (j) for k with λk = 0. We claim that this can happen only whenever
i and j have the exact same neighbors and proportional weights, that is:
2. COMMUTE TIME MAP AND DISTANCE 117

Proposition 1.4. The situation dt (xi , xj ) = 0 with xi 6= xj occurs if and only if


node i and j have the exact same neighbors and proportional weights
Wik = αWjk , α > 0, f or all k ∈ V.
n
λ2t 2
P
Proof. (Necessity) If dt (xi , xj ) = 0, then k (φk (i) − φk (j)) = 0 and
k=1
φk (i) = φk (j) for k with λk 6= 0 This implies that dt0 (xi , xj ) = 0 for all t0 , because
n
X 0
(126) dt0 (xi , xj ) = λ2t 2
k (φk (i) − φk (j) = 0.
k=1
0
In particular, for t = 1, we get d1 (xi , xj ) = 0. But
d1 (xi , xj ) = kAi,∗ − Aj,∗ k`2 (Rn ,1/d) ,
and since k · k`2 (Rn ,1/d) is a norm, we must have Ai,∗ = Aj,∗ , which implies for each
k ∈V,
Wik Wjk
= , ∀k ∈ V
di dj
whence Wik = αWjk where α = di /dj , as desired.
n
(Ai,k − Aj,k )2 /dk = d21 (xi , xj ) ==
P
(Sufficiency) If Ai,∗ = Aj,∗ , then 0 =
k=1
n
λ2k (φk (i) 2
P
− φk (j)) and therefore φk (i) = φk (j) for k with λk 6= 0, from which
k=1
it follows that dt (xi , xj ) = 0 for all t. 
Example 7. In a graph with three nodes V = {1, 2, 3} and two edges, say E =
{(1, 2), (2, 3)}, the diffusion distance between nodes 1 and 3 is 0. Here the transition
matrix is  
0 1 0
A =  1/2 0 1/2  .
0 1 0

2. Commute Time Map and Distance


Diffusion distance depends on time scale parameter t which is hard to select in
applications. In this section we introduce another closely related distance, namely
commute time distance, derived from mean first passage time between points. For
such distances we do not need to choose the time scale t.
Definition.
(1) First passage time (or hitting time): τij := inf(t ≥ 0|xt = j, x0 = i);
(2) Mean First Passage Time: Tij = Ei τij ;
+
(3) τij := inf(t > 0|xt = j, x0 = i), where τii+ is also called first return time;
(4) Tij+ = Ei τij
+
, where Tii+ is also called mean first return time.
Here Ei denotes the conditional expectation with fixed initial condition x0 = i.
All the below will show that the (average) commute time between xi and xj ,
i.e.Tij + Tji , in fact leads to an Euclidean distance metric which can be used for
embedding.
p
Theorem 2.1. dc (xi , xj ) := Tij + Tji is an Euclidean distance metric, called
commute time distance.
118 7. DIFFUSION MAP

Proof. For simplicity, we will assume that P is irreducible such that the
stationary distribution is unique. We will give a constructive proof that Tij + Tji
is a squared distance of some Euclidean coordinates for xi and xj .
By definition, we have
X
(127) Tij+ = Pij · 1 + +
Pik (Tkj + 1)
k6=j

Let E = 1 · 1T where 1 ∈ Rn is a vector with all elements one, Td+ = diag(Tii+ ).


Then 127 becomes
(128) T + = E + P (T + − Td+ ).
For the unique stationary distribution π, π T P = P , whence we have
πT T + = π T 1 · 1T + π T P (T + − Td+ )
πT T + = 1T + π T T + − π T Td+
1= Td+ π
1
Tii+ =
πi
Before proceeding to solve equation (127), we first show its solution is unique.
Lemma 2.2. P is irreducible ⇒ T + and T are both unique.
Proof. Assume S is also a solution of equation (128), then
(I − P )S = E − P diag(1/πi ) = (I − P )T +
⇔ ((I − P )(T + − S) = 0.
Therefore for irreducible P , S and T + must satisfy
diag(T + − S) = 0


T + − S = 1uT , ∀u
which implies T + = S. T ’s uniqueness follows from T = T + − Td+ . 
Now we continue with the proof of the main theorem. Since T = T + − Td+ ,
then (127) becomes
T = E + P T − Td+
(I − P )T = E − Td+
(I − D−1 W )T = F
(D − W )T = DF
LT = DF
where F = E − Td+ and L = D − W is the (unnormalized)Pn graph Laplacian. Since
T
L is symmetric and irreducible, we have L = Pn k=1 µ k νk ν k , where 0 = µ1 < µ2 ≤
· · · ≤ µn , ν1 = 1/||1||, νkT νl = δkl . Let L+ = k=2 µ1k νk νkT , L+ is called the pseudo-
inverse (or Moore-Penrose inverse) of L. We can test and verify L+ satisfies the
following four conditions  + +

 L LL = L+
+
LL L = L

+ T

 (LL ) = LL+
+ T
(L L) = L+ L

2. COMMUTE TIME MAP AND DISTANCE 119

From LT = D(E − Td+ ), multiplying both sides by L+ leads to


T = L+ DE − L+ DTd+ + 1 · uT ,
as 1 · uT ∈ ker(L), whence
n
X 1
Tij = L+ +
ik dk − Lij dj · + uj
πj
k=1
Xn
ui = − L+ +
ik dk + Lii vol(G), j=i
k=1
X X
Tij = L+ + +
ik dk − Lij vol(G) + Ljj vol(G) − L+
jk dk
k k

P
Note that vol(G) = i di and πi = di /vol(G) for all i.
Then
(129) Tij + Tji = vol(G)(L+ + +
ii + Ljj − 2Lij ).

To see it is a squared Euclidean distance, we need the following lemma.


Lemma 2.3. If K is a symmetric and positive semidefinite matrix, then
K(x, x)+K(y, y)−2K(x, y) = d2 (Φ(x), Φ(y)) = hΦ(x), Φ(x)i+hΦ(y), Φ(y)i−2hΦ(x), Φ(y)i

P. . . , n) are orthonormal eigenvectors with eigenvalues µi ≥ 0,


where Φ = (φi : i = 1,
such that K(x, y) = i µi φi (x)φi (y).
Clearly L+ is a positive semidefinite matrix and we define the commute time
map by its eigenvectors,
 T
1 1
Ψ(xi ) = √ ν2 (i), · · · , √ νn (i) ∈ Rn−1 .
µ2 µn
q
then L+ + + 2
ii +Ljj −2Lij = ||Ψ(xi )−Ψ(xj ))||l2 , and we call dr (xi , xj ) = L+ + +
ii + Ljj − 2Lij
the resistance distance.

p p
So we have dc (xi , xj ) = Tij + Tji = vol(G)dr (xi , xj ).

Table 1. Comparisons between diffusion map and commute time


map. Here x ∼ y means that x and y are in the same cluster and
x  y for different clusters.

Diffusion Map Commute Time Map


P ’s right eigenvectors L+ ’s eigenvectors
scale parameters: t and ε scale: ε
∃t, s.t. x ∼ y, dt (x, y) → 0 and x  y, dt (x, y) → ∞ x ∼ y, dc (x, y) small and x  y, dc (x, y) large?

2.1. Comparisons between diffusion map and commute time map.


However, recently Radl, von Luxburg, and Hein give a negative answer for the last
desired property of dc (x, y) in geometric random graphs. Their result is as follows.
Let X ⊆ Rp be a compact set and let k : X × X → (0, +∞) be a symmetric
and continuous function. Suppose that (xi )i∈N is a sequence of data points drawn
120 7. DIFFUSION MAP

i.i.d. from X according to a density function p > 0 on X . Define Wij = k(xi , xj ),


P = D−1 W , and L = D − W . Then Radl et al. shows
1 1
lim ndr (xi , xj ) = +
n→∞ d(xi ) d(xj )
d (x ,x )
k(x, y)dp(y) is a smoothed density at x, dr (xi , xj ) = √c i j is the
R
where d(x) = X vol(G)
resistance distance. This result shows that in this setting commute time distance
has no information about cluster information about point cloud data, instead it
simply reflects density information around the two points.

3. Diffusion Map: Convergence Theory


Diffusion distance depends on both the geometry and density of the dataset.
The key concepts in the analysis of these methods, that incorporates the density and
geometry of a dataset. This section we will prove the convergence of diffusion map
with heat kernels to its geometric limit, the eigenfunctions of Laplacian-Beltrami
operators.
This is left by previous lecture. W is positive definite if using Gaussian Kernel.
One can check that, when
Z
Q(x) = e−ixξ dµ(ξ),
R

for some positive finite Borel measure dµ on R, then the (symmetric/Hermitian)


integral kernel
k(x, y) = Q(x − y)
is positive definite, that is, for any function φ(x) on R,
Z Z
φ̄(x)φ(y)k(x, y) ≥ 0.

Proof omitted. The reverse is also true, which is Bochner theorem. High dimen-
sional case is similar.
2
Take 1-dimensional as an example. Since the Gaussian distribution e−ξ /2 dξ
is a positive finite Borel measure, and the Fourier transform of Gaussian kernel is
2
itself, we know that k(x, y) = e−|x−y| /2 is a positive definite integral kernel. The
matrix W as an discretized version of k(x, y) keeps the positive-definiteness (make
this rigorous? Hint: take φ(x) as a linear combination of n delta functions).

3.1. Main Result. In this lecture, we will study the bias and variance de-
composition for sample graph Laplacians and their asymptotic convergence to
Laplacian-Beltrami operators on manifolds.
Let M be a smooth manifold without boundary in Rp (e.g. a d-dimensional
sphere). Randomly draw a set of n data points, x1, ..., xn ∈ M ⊂ Rp , according to
distribution p(x) in an independent and identically distributed (i.i.d.) way. We can
extract an n × n weight matrix Wij as follows:

Wij = k(xi , xj )
where k(x, y) is a symmetric k(x, y) = k(y, x) and positivity-preserving kernel
k(x, y) ≥ 0. As an example, it can be the heat kernel (or Gaussian kernel),
3. DIFFUSION MAP: CONVERGENCE THEORY 121

||xi − xj ||2
 
k (xi , xj ) = exp − ,
2
where ||  ||2 is the Euclidean distance in space Rp and  is the bandwidth of the
kernel. Wij stands for similarity function between xi and xj . A diagonal matrix D
is defined with diagonal elements are the row sums of W :
n
X
Dii = Wij .
j=1

Let’s consider a family of re-weighted similarity matrix, with superscript (α),


W (α) = D−α W D−α
and
n
(α) (α)
X
Dii = Wij .
j=1

(α) (α) −1
Pn (α)
Denote A = (D ) W , and we can verify that j=1 Aij = 1, i.e.a row
Markov matrix. Now define L(α) = A(α) − I = (D(α) )−1 W (α) − I; and

1 (α)
L,α = (A − I)
 
when k (x, y) is used in constructing W . In general, L(α) and L,α are both called
graph Laplacians. In particular L(0) is the unnormalized graph Laplacian in litera-
ture.
The target is to show that graph Laplacian L,α converges to continuous differ-
ential operators acting on smooth functions on M the manifold. The convergence
can be roughly understood as: we say a sequence of n-by-n matrix L(n) as n → ∞
converges to a limiting operator L, if for L’s eigenfunction f (x) (a smooth function
on M) with eigenvalue λ, that is
Lf = λf,
the length-n vector f (n) = (f (xi )), (i = 1, · · · , n) is approximately an eigenvector
of L(n) with eigenvalue λ, that is
L(n) f (n) = λf (n) + o(1),
where o(1) goes to zero as n → ∞.
Specifically, (the convergence is in the sense of multiplying a positive constant)
(I) L,0 = 1 (A − I) → 12 (∆M + 2 ∇p p · ∇) as  → 0 and n → ∞. ∆M is
the Laplace-Beltrami operator of manifold M . At a point on M which
is d-dimensional, in local (orthogonal) geodesic coordinate s1 , · · · , sd , the
Laplace-Beltrami operator has the same form as the laplace in calculus
d
X ∂2
∆M f = f;
i=1
∂s2i

∇ denotes the gradient of a function on M , and · denotes the inner product


on tangent spaces of M. Note that p = e−V , so ∇p p = −∇V .
122 7. DIFFUSION MAP

(Ignore this part if you don’t know stochastic process) Suppose we


have the following diffusion process
(M )
dXt = −∇V (Xt )dt + σdWt ,
(M )
where Wt is the Brownian motion on M , and σ is the volatility, say a
positive constant, then the backward Kolmogorov operator/Fokker-Plank
operator/infinitesimal generator of the process is
σ2
∆M − ∇V · ∇,
2
so we say in (I) the limiting operator is the Fokker-Plank operator. Notice
that in Lafon ’06 paper they differ the case of α = 0 and α = 1/2, and
argue that only in the later case the limiting operator is the Fokker-Plank.
However the difference between α = 0 and α = 1/2 is a 1/2 factor in front
of −∇V , and that can be unified by changing the volatility σ to another
number. (Actually, according to Thm 2. on Page 15 of Lafon’06, one can
1
check that σ 2 = 1−α .) So here we say for α = 0 the limiting operator is
also Fokker-Plank. (not talked in class, open to discussion...)
(1)
(II) L,1 = 1 (A − I) → 21 ∆M as  → 0 and n → ∞. Notice that this
case is of important application value: whatever the density p(x) is, the
Laplacian-Beltrami operator of M is approximated, so the geometry of
the manifold can be understood.
A special case is that samples xi are uniformly distributed on M, whence
∇p = 0. Then (I) and (II) are the same up to multiplying a positive constant, due
to that D’s diagonal entries are almost the same number and the re-weight does
not do anything.
Convergence results like these can be found in Coifman and Lafon [CL06],
Diffusion maps, Applied and Computational Harmonic Analysis.
We also refer [Sin06] From graph to manifold Laplacian: The convergence
rate, Applied and Computational Harmonic Analysis for a complete analysis of the
variance error, while the analysis of bias is very brief in this paper.
3.2. Proof. For a smooth function f (x) on M, let f = (fi ) ∈ Rn as a vector
defined by fi = f (xi ). At a given fixed point xi , we have the formula:
Pn ! 1
Pn !
1 Wij fj 1 j=1 Wij fj
(Lf )i
= Pj=1
n − fi = n
1
Pn − fi
 j=1 Wij  n j=1 Wij
1
P !
1 n j6=i k (xi , xj ).f (xj ) 1
= 1 − f (xi ) + f (xi )O( d )

P
n j6=i k (xi , xj ) n 2
where in the last step the diagonal terms j = i are excluded from the sums resulting
d
in an O(n−1 − 2 ) error. Later we will see that compared to the variance error, this
term is negligible.
We rewrite the Laplacian above as
 
i1 F (xi ) 1
(130) (Lf ) = − f (xi ) + f (xi )O( d )
 G(xi ) n 2
where
3. DIFFUSION MAP: CONVERGENCE THEORY 123

1X 1X
F (xi ) = k (xi , xj )f (xj ), G(xi ) = k (xi, xj ).
n n
j6=i j6=i
depends only on the other n − 1 data points than xi . In what follows we treat
xi as a fixed chosen point and write as x.
Bias-Variance Decomposition. The points xj , j 6= i are independent iden-
tically distributed (i.i.d), therefore every term in the summation of F (x) (G(x))
are i.i.d., and by theR Law of Large Numbers (LLN) one should expect F (x) ≈
Ex1 [k(x, x1 )f (x1 )] = M k(x, y)f (y)p(y)dy (and G(x) ≈ Ek(x, x1 ) = M k(x, y)p(y)dy).
R

Recall that given a random variable x, and a sample estimator θ̂ (e.g. sample mean),
the bias-variance decomposition is given by
Ekx − θ̂k2 = Ekx − Exk2 + EkEx − θ̂k2 .
E[F ]
If we use the same strategy here (though not exactly the same, since E[ G
F
] 6= E[G]
!), we can decompose Eqn. (130) as
1 E[F ] 1 F (xi ) E[F ]
   
1
(Lf )i = − f (xi ) + f (xi )O( d ) + −
 E[G] n 2  G(xi ) E[G]
= bias + variance.
In the below we shall show that for case (I) the estimates are
(131)
1 E[F ] ∇p
 
1 m2  d

bias = − f (x) + f (xi )O( d ) = (∆M f +2∇f · )+O()+O n−1 − 2 .
 E[G] n 2 2 p

1 F (xi ) E[F ]
 
1 d
(132) variance = − = O(n− 2 − 4 −1 ),
 G(xi ) E[G]
whence
1 d 1 d
bias + variance = O(, n− 2 − 4 −1 ) = C1  + C2 n− 2 − 4 −1 .
As the bias is a monotone increasing function of  while the variance is decreasing
w.r.t. , the optimal choice of  is to balance the two terms by taking derivative
1 d
of the right hand side equal to zero (or equivalently setting  ∼ n− 2 − 4 −1 ) whose
solution gives the optimal rates
∗ ∼ n−1/(2+d/2) .
[CL06] gives the bias and [HAvL05] contains the variance parts, which are further
improved by [Sin06] in both bias and variance.
3.3. The Bias Term. Now focus on E[F ]
 
1 n−1
X Z
E[F ] = E  k (xi , xj )f (xj ) =
 k (x, y)f (y)p(y)dy
n n M
j6=i
n−1
n is close to 1 and is treated as 1.
(1) the case of one-dimensional and flat (which means the manifold M is just
a real line, i.e.M = R)
(x−y)2
Let f˜(y) = f (y)p(y), and k (x, y) = √1 e− 2 , by change of variable

y = x + z,
124 7. DIFFUSION MAP

we have
√ 1
Z
2
= f˜(x + z)e− 2 dz = m0 f˜(x) + m2 f 00 (x) + O(2 )
R 2
2 2
where m0 = R e− 2 dz, and m2 = R z 2 e− 2 dz.
R R

(2) 1 Dimensional & Not flat:


Divide the integral into 2 parts:

Z Z Z
k (x, y)f˜(y)p(y)dy = √
·+ √
·
m ||x−y||>c  ||x−y||<c 

First part = ◦

1 2
| ◦ | ≤ ||f˜||∞ a e− 2 ,
2

due to ||x − y||2 > c 

1
c ∼ ln( ).

so this item is tiny and can
√ be ignored.
Locally, that is u ∼ , we have the curve in a plane and has the
following parametrized equation

(x(u), y(u)) = (u, au2 + qu3 + · · · ),

then the chord length

1 1 1
||x − y||2 = [u2 + (au2 + qu3 + ...)2 ] = [u2 + a2 u4 + q5 (u) + · · · ],
  
u
where we mark a2 u4 + 2aqu5 + ... = q5 (u). Next, change variable √

= z,
− ξ2
then with h(ξ) = e

||x − y|| 2 3
h( ) = h(z 2 ) + h0 (z 2 )(2 az 4 +  2 q5 + O(2 )),

also
df˜ 1 d2 f˜
f˜(s) = f˜(x) + (x)s + (x)s2 + · · ·
ds 2 ds2
and
Z u p
s= 1 + (2au + 3quu2 + ...)2 du + · · ·
0

and
ds 2
= 1 + 2a2 u2 + q2 (u) + O(2 ), s = u + a2 u3 + O(2 ).
du 3
3. DIFFUSION MAP: CONVERGENCE THEORY 125

Now come back to the integral


1 x−y ˜
Z

√ h( )f (s)ds
|x−y|<c   
df˜
Z +∞
3 √ 2 3
≈ [h(z 2 ) + h0 (z 2 )(2 az 4 +  2 q5 ] · [f˜(x) + (x)( z + a2 z 2  2 )
−∞ ds 3

1d f
+ (x)z 2 ] · [1 + 2a2 + 3 y3 (z)]dz
2 ds2
m2 d2 f˜
=m0 f˜(x) +  ( (x) + a2 f˜(x)) + O(2 ),
2 ds2
O(2 ) tails are omitted in middle steps, and m0 = h(z 2 )dz,m2 =
R
where
R 2 the
z h(z 2 )dz, are positive constants. In what follows we normalize both of
them by m0 , so only m2 appears as coefficient in the O() term. Also the
ξ
fact that h(ξ) = e− 2 , and so h0 (ξ) = − 12 h(ξ), is used.
(3) For high dimension, M is of dimension d,
1 |x−y|2
k (x, y) = d e− 2 ,
 2

the corresponding result is (Lemma 8 in Appendix B of Lafon ’06 paper)


m2
Z
(133) k (x, y)f˜(y)dy = f˜(x) +  (∆M f˜ + E(x)f˜(x)) + O(2 ),
M 2
where
d
X X
E(x) = ai (x)2 − ai1 (x)ai2 (x),
i=1 i1 6=i2

and ai (x) are the curvatures along coordinates si (i = 1, · · · , d) at point


x.
Now we study the limiting operator and the bias error:
0 2
EF f +  m22 (f 00 + 2f 0 pp + f pp + Ef ) + O(2 )
R
k (x, y)f (y)p(y)dy
= ≈
EG
00
R
k (x, y)p(y)dy 1 +  m22 ( pp + E) + O(2 )
m2 00 p0
(134) = f (x) +  (f + 2f 0 ) + o(2 ),
2 p
and as a result, for generally d-dim case,
1 EF ∇p
 
m2
− f (x) = (∆M f + 2∇f · ) + O().
 EG 2 p
Using the same method and use Eqn. (133), one can show that for case (II)
where α = 1, the limiting operator is exactly the Laplace-Beltrami operator and
the bias error is again O() (homework).
About M with boundary: firstly the limiting differential operator bears
√ Newmann/no-
flux boundary condition. Secondly, the convergence at a belt of width  near ∂M
is slower than the inner part of M, see more in Lafon’06 paper.
126 7. DIFFUSION MAP

3.4. Variance Term. Our purpose is to derive the large deviation bound for2
E[F ]
 
F
(135) P rob − ≥α
G E[G]
where F = F (xi ) = n j6=i k (xi , xj )f (xj ) and G = G(xi ) = n1 j6=i k (x, xj ).
1
P P

With x1 , x2 , ..., xn as i.i.d random variables, F and G are sample means (up to a
scaling constant). Define a new random variable
Y = E[G]F − E[F ]G − αE[G](G − E[G])
which is of mean zero and Eqn. (135) can be rewritten as
P rob(Y ≥ αE[G]2 ).
For simplicity by Markov (Chebyshev) inequality 3 ,
E[Y 2 ]
P rob(Y ≥ αE[G]2 ) ≤
α2 E[G]4
and setting the right hand side to be δ ∈ (0, 1), then with probability at least 1 − δ
the following holds !
E[Y 2 ] E[Y 2 ]
p p
α≤ √ ∼O .
E[G]2 δ E[G]2
It remains to bound
E[Y 2 ] = (EG)2 E(F 2 ) − 2(EG)(EF )E(F G) + (EF )2 E(G2 ) + ...
+2α(EG)[(EF )E(G2 ) − (EG)E(F G)] + α2 (EG)2 (E(G2 ) − (EG)2 ).
So it suffices to give E(F ), E(G), E(F G), E(F 2 ), and E(G2 ). The former two are
given in bias and for the variance parts in latter three, let’s take one simple example
with E(G2 ).
Recall that x1 , x2 , ..., xn are distributed i.i.d according to density p(x), and
1X
G(x) = k (x, xj ),
n
j6=i
so Z 
1 2 2
V ar(G) = 2 (n − 1) k (x, y)) p(y)dy − (Ek (x, y)) .
n M
Look at the simplest case of 1-dimension flat M for an illustrative example:
1 √
Z Z
(k (x, y))2 p(y)dy = √ h2 (z 2 )(p(x) + p0 (x)( z + O()))dz,

M R
R 2 2
let M2 = h (z )dz
R
1 √
Z
(k (x, y))2 p(y)dy = p(x) · √ M2 + O( ).

M
Recall that Ek (x, y) = O(1), we finally have
 
1 p(x)M2 1
V ar(G) ∼ √ + O(1) ∼ √ .
n  n 
2The opposite direction is omitted here.
3It means that P rob(X > α) ≤ E(X 2 )/α2 . A Chernoff bound with exponential tail can be
found in Singer’06.
4. *VECTOR DIFFUSION MAP 127

d
Generally, for d-dimensional case, V ar(G) ∼ n−1 − 2 . Similarly one can derive
estimates on V ar(F ).
Ignoring the joint effect ofpE(F G), one can somehow get a rough estimate
based on F/G = [E(F ) + O( E(F 2 ))]/[E(G) + O( E(G2 ))] where we applied
p

the Markov inequality on both the numerator and denominator. Combining those
estimates together, we have the following,
1 d
F f p +  m22 (∆(f p) + E[f p]) + O(2 , n− 2 − 4 )
= 1 d
G p +  m22 (∆p + E[p]) + O(2 , n− 2 − 4 )
m2 1 d
= f + (∆p + E[p]) + O(2 , n− 2 − 4 ),
2
here O(B1 , B2 ) denotes the dominating one of the two bounds B1 and B2 in the
asymptotic limit. As a result, the error (bias + variance) of L,α (dividing another
) is of the order
1 d
(136) O(, n− 2 − 4 −1 ).
In [Sin06] paper, the last term in the last line is improved to
1 d 1
(137) O(, n− 2 − 4 − 2 ),
F
where the improvement is by carefully analyzing the large deviation bound of G
around EG shown above, making use of the fact that F and G are correlated.
EF

Technical details are not discussed here.


In conclusion, we need to choose  to balance bias error and variance error to
be both small. For example, by setting the two bounds in Eqn. (137) to be of the
same order we have
 ∼ n−1/2 −1/2−d/4 ,
that is
 ∼ n−1/(3+d/2) ,
so the total error is O(n−1/(3+d/2) ).

4. *Vector Diffusion Map


In this class, we introduce the topic of vector Laplacian on graphs and vector
diffusion map.
The ideas for vector Laplacian on graphs and vector diffusion mapping are a
natural extension from graph Laplacian operator and diffusion mapping on graphs.
The reason why diffusion mapping is important is that previous dimension reduction
techniques, such as the PCA and MDS, ignore the intrinsic structure of the man-
ifold. By contrast, diffusion mapping derived from graph Laplacian is the optimal
embedding that preserves locality in a certain way. Moreover, diffusion mapping
gives rise to a kind of metic called diffusion distance. Manifold learning problems
involving vector bundle on graphs provide the demand for vector diffusion mapping.
And since vector diffusion mapping is an extension from diffusion mapping, their
properties and convergence behavior are similar.
The application of vector diffusion mapping is not restricted to manifold learn-
ing however. Due to its usage of optimal registration transformation, it is also a
valuable tool for problems in computer vision and computer graphics, for example,
optimal matching of 3D shapes.
128 7. DIFFUSION MAP

The organization of this lecture notes is as follows: We first review graph Lapla-
cian and diffusion mapping on graphs as the basis for vector diffusion mapping. We
then introduce three examples of vector bundles on graphs. After that, we come to
vector diffusion mapping. Finally, we introduce some conclusions about the con-
vergence of vector diffusion mapping.

4.1. graph Laplacian and diffusion mapping.


    4.2. Graph Laplacian. The goal of the graph Laplacian is to discover the intrinsic
manifold structure given a set of data points in space. There are three steps in
constructing the graph Laplacian operator:
      • construct the graph, using either the ε-neighborhood rule (for any data
        point, connect it with all the points in its ε-neighborhood) or the k-
        nearest neighbor rule (connect it with its k-nearest neighbors);
      • construct the weight matrix. Here we can use the simple-minded
        binary weight (0 or 1), or use the heat kernel weight. For an undirected
        graph, the weight matrix is symmetric;
      • denote D as the diagonal matrix with D(i, i) = deg(i), deg(i) := Σ_j w_ij.
    The graph Laplacian operator is:
                                  L = D − W
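    As a quick illustration, here is a minimal sketch (Python with NumPy; the point cloud
X, the radius eps and the bandwidth sigma are illustrative assumptions, not objects
defined in the text) of building a heat-kernel weight matrix on an ε-neighborhood graph
and the unnormalized Laplacian L = D − W:

    import numpy as np

    def graph_laplacian(X, eps=1.0, sigma=1.0):
        """Unnormalized graph Laplacian L = D - W on an eps-neighborhood graph."""
        # pairwise squared distances between the rows of X
        D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        # heat-kernel weights, restricted to the eps-neighborhood graph
        W = np.exp(-D2 / (2 * sigma ** 2)) * (D2 <= eps ** 2)
        np.fill_diagonal(W, 0.0)          # no self-loops
        deg = W.sum(axis=1)               # deg(i) = sum_j w_ij
        L = np.diag(deg) - W              # L = D - W
        return W, L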
    The graph Laplacian has the following properties:
      • ∀f : V → R, f^T L f = Σ_{(i,j)∈E} w_ij (f_i − f_j)² ≥ 0
      • G is connected ⇔ f^T L f > 0 for all f ⊥ ~1, where ~1 = (1, · · · , 1)^T
      • G has k connected components ⇔ dim(ker(L)) = k
        (this property is compatible with the previous one, since L~1 = 0)
      • Kirchhoff's Matrix-Tree Theorem:
        Consider a connected graph G and the binary weight matrix w_ij = 1 if
        (i, j) ∈ E and 0 otherwise; denote the eigenvalues of L as 0 = λ_1 < λ_2 ≤
        · · · ≤ λ_n. Then #{T : T is a spanning tree of G} = (1/n) λ_2 · · · λ_n.
      • Fiedler theory, which will be introduced in later chapters.
    We can gain a further understanding of the graph Laplacian using the language of
exterior calculus on graphs. We use the following notation.
    V = {1, 2, · · · , |V|}. E⃗ is the oriented edge set: for (i, j) ∈ E and i < j,
⟨i, j⟩ is the positive orientation, and ⟨j, i⟩ is the negative orientation.
    δ_0 : R^V → R^{E⃗} is a coboundary map, such that
        δ_0 ◦ f(i, j) = f_i − f_j if ⟨i, j⟩ ∈ E⃗, and 0 otherwise.
It is easy to see that δ_0 ◦ f(i, j) = −δ_0 ◦ f(j, i).
    The inner product of operators on R^{E⃗} is defined as
        ⟨u, v⟩ = Σ_{i,j} w_ij u_ij v_ij,
        u* := u^T diag(w_ij),
where diag(w_ij) ∈ R^{n(n−1)/2 × n(n−1)/2} is the diagonal matrix that has w_ij on the
diagonal position corresponding to ⟨i, j⟩, so that u* v = ⟨u, v⟩.
    Then,
        L = D − W = δ_0^T diag(w_ij) δ_0 = δ_0* δ_0.
    We first look at the graph Laplacian operator. We solve the generalized eigenvalue
problem
        L f = λ D f,
denote the generalized eigenvalues as
        0 = λ_1 ≤ λ_2 ≤ · · · ≤ λ_n
and the corresponding generalized eigenvectors as
        f_1, · · · , f_n;
the m-dimensional Laplacian eigenmap is then
        x_i → (f_1(i), · · · , f_m(i)).
We now explain why this is the optimal embedding that preserves locality in the
sense that connected points stay as close as possible. Specifically, for the
one-dimensional embedding, the problem is
        min_y Σ_{i,j} (y_i − y_j)² w_ij = 2 min_y y^T L y,
        y^T L y = (D^{1/2} y)^T (I − D^{-1/2} W D^{-1/2}) (D^{1/2} y).
Since I − D^{-1/2} W D^{-1/2} is symmetric, the objective is minimized when D^{1/2} y is
the eigenvector of the second smallest eigenvalue (the smallest eigenvalue is 0) of
I − D^{-1/2} W D^{-1/2}, which coincides with λ_2, the second smallest generalized
eigenvalue of L.
    Similarly, the m-dimensional optimal embedding is given by Y = (f_1, · · · , f_m).
    In the diffusion map, the weights are used to define a discrete random walk. The
transition probability in a single step from i to j is
        a_ij = w_ij / deg(i).
Then the transition matrix is A = D^{-1} W. Since
        A = D^{-1/2} (D^{-1/2} W D^{-1/2}) D^{1/2},
A is similar to a symmetric matrix, and hence has n real eigenvalues μ_1, · · · , μ_n
and corresponding eigenvectors φ_1, · · · , φ_n,
        A φ_i = μ_i φ_i.
A^t is the transition matrix after t steps. Thus we have
        A^t φ_i = μ_i^t φ_i.
Define Λ as the diagonal matrix with Λ(i, i) = μ_i and Φ = [φ_1, · · · , φ_n]. The diffusion
map is given by
        Φ_t := Φ Λ^t = [μ_1^t φ_1, · · · , μ_n^t φ_n].
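    A sketch of this construction (Python with NumPy; the symmetric weight matrix W is
assumed to be given, e.g. by the graph_laplacian sketch above, and only m nontrivial
coordinates of the embedding are returned):

    import numpy as np

    def diffusion_map(W, t=1, m=2):
        """m-dimensional diffusion map at time t from a symmetric weight matrix W."""
        deg = W.sum(axis=1)
        d_inv_sqrt = 1.0 / np.sqrt(deg)
        # symmetric normalization D^{-1/2} W D^{-1/2}, similar to A = D^{-1} W
        S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
        mu, psi = np.linalg.eigh(S)              # real eigenvalues and eigenvectors
        order = np.argsort(-mu)                  # sort by decreasing eigenvalue
        mu, psi = mu[order], psi[:, order]
        phi = d_inv_sqrt[:, None] * psi          # eigenvectors of A = D^{-1} W
        # skip the trivial top eigenvector and keep the next m coordinates
        return (mu[1:m + 1] ** t) * phi[:, 1:m + 1]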

    4.3. The embedding given by the diffusion map. Φ_t(i) denotes the i-th row of
Φ_t. One has
        ⟨Φ_t(i), Φ_t(j)⟩ = Σ_{k=1}^n [A^t(i, k) / √deg(k)] · [A^t(j, k) / √deg(k)];
we can thus define a distance called the diffusion distance,
        d²_{DM,t}(i, j) := ⟨Φ_t(i), Φ_t(i)⟩ + ⟨Φ_t(j), Φ_t(j)⟩ − 2⟨Φ_t(i), Φ_t(j)⟩
                        = Σ_{k=1}^n (A^t(i, k) − A^t(j, k))² / deg(k).

4.4. Examples of vector bundles on graph.


     (1) Wind velocity field on the globe:
         To simplify the problem, we consider a two dimensional mesh on
         the globe (latitude and longitude). Each node on the mesh carries a
         vector f⃗ which is the wind velocity at that place.
     (2) Local linear regression:
         The goal of local linear regression is to give an approximation of the
         regression function at an arbitrary point in the variable space.
         Given the data (y_i, x⃗_i)_{i=1}^n and an arbitrary point x⃗, with x⃗, x⃗_1, · · · , x⃗_n ∈ R^p,
         we want to find β⃗ := (β_0, β_1, · · · , β_p)^T that minimizes Σ_{i=1}^n (y_i − β_0 −
         β_1 x_{i1} − · · · − β_p x_{ip})² K_n(x⃗_i, x⃗). Here K_n(x⃗_i, x⃗) is a kernel function that
         defines the weight for x⃗_i at the point x⃗. For example, we can use the
         Nadaraya-Watson type kernel K_n(x⃗_i, x⃗) = e^{−||x⃗ − x⃗_i||²/ε_n}.
         For a graph G = (V, E), each point x⃗ ∈ V has a corresponding vector
         β⃗(x⃗). We therefore get a vector bundle on the graph G = (V, E).
         Here β⃗ is a kind of gradient. In fact, if y and x⃗ have the relationship
         y = f(x⃗), then β = (f(x⃗), ∇f(x⃗))^T.
(3) Social networks:
If we see users as vertices and the relationship bonds that connected
users as edges, then a social network naturally gives rise to a graph
G=(V,E). Each user has an attribute profile containing all kinds of per-
sonal information, and a certain kind of information can be described by
a vector f~ recording different aspects. Again, we get a vector bundle on
graph.

    4.5. Optimal registration transformation. As in the graph eigenmap, we expect
the embedding f⃗ to preserve locality to a certain extent, which means that we expect
the embeddings of connected points to be sufficiently close. In the graph Laplacian
case, we use Σ_{i∼j} w_ij ||f⃗_i − f⃗_j||². However, for vector bundles on graphs,
subtraction of vectors at different points may not be done directly due to the curvature
of the manifold. What makes sense is the difference of vectors after they are brought
into the tangent space at a common point. Therefore, we borrow the idea of parallel
transport from differential geometry. Denote O_ij as the parallel transport operator
from the tangent space at x_j to the tangent space at x_i. We want to find the
embedding that minimizes
        Σ_{i∼j} w_ij ||f⃗_i − O_ij f⃗_j||².

We will later define the vector diffusion mapping; using a similar argument as for the
diffusion mapping, it is easy to see that the vector diffusion mapping gives the optimal
embedding that preserves locality in this sense.
    We now discuss how to approximate the parallel transport operator from the data set.
The approximation of the tangent space at a point x_i is given by local PCA.
Choose ε_i to be sufficiently small, and denote by x_{i1}, · · · , x_{iN_i} the data points in
the ε_i-neighborhood of x_i. Define
        X_i := [x_{i1} − x_i, · · · , x_{iN_i} − x_i].
Denote D_i as the diagonal matrix with
        D_i(j, j) = √(K(||x_{ij} − x_i|| / ε_i)),   j = 1, · · · , N_i,
and set
        B_i := X_i D_i.
Perform an SVD of B_i:
        B_i = U_i Σ_i V_i^T.
We use the first d columns of U_i (the left singular vectors corresponding to the d
largest singular values of B_i) to form an approximation of the tangent space at x_i.
That is,
        O_i = [u_{i1}, · · · , u_{id}].
Then O_i is a numerical approximation to an orthonormal basis of the tangent space
at x_i.
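    A sketch of this local PCA step (Python with NumPy; the concrete kernel
K(u) = (1 − u²)_+ used here is an illustrative assumption, not a choice made in the text):

    import numpy as np

    def local_pca_basis(X, i, eps_i, d):
        """Approximate an orthonormal basis O_i of the tangent space at x_i."""
        xi = X[i]
        dist = np.linalg.norm(X - xi, axis=1)
        nbrs = np.where((dist > 0) & (dist <= eps_i))[0]   # eps_i-neighborhood of x_i
        Xi = (X[nbrs] - xi).T                              # columns x_{ij} - x_i
        u = dist[nbrs] / eps_i
        K = np.maximum(1.0 - u ** 2, 0.0)                  # illustrative kernel K
        Bi = Xi * np.sqrt(K)[None, :]                      # B_i = X_i D_i
        U, s, Vt = np.linalg.svd(Bi, full_matrices=False)
        return U[:, :d]                                    # top d left singular vectors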

    For connected points x_i and x_j, since they are sufficiently close to each other,
their tangent spaces should be close, so O_i^T O_j should be close to an orthogonal
matrix. We therefore use the closest orthogonal matrix to O_i^T O_j as the approximation
of the parallel transport operator from x_j to x_i:
        ρ_ij := argmin_{O orthogonal} ||O − O_i^T O_j||_HS,
where ||A||²_HS = Tr(AA^T) is the Hilbert-Schmidt norm.
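    This minimizer has a closed form via the SVD: if O_i^T O_j = UΣV^T, then ρ_ij = UV^T
(the orthogonal Procrustes solution). A sketch:

    import numpy as np

    def parallel_transport(Oi, Oj):
        """Closest orthogonal matrix to Oi^T Oj in the Hilbert-Schmidt norm."""
        U, _, Vt = np.linalg.svd(Oi.T @ Oj)
        return U @ Vt      # rho_ij, approximate parallel transport from x_j to x_i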

    4.6. Vector Laplacian. Given the weight matrix W = (w_ij), we denote by D the
block diagonal matrix
        D := diag(deg(1) I_d, · · · , deg(n) I_d) ∈ R^{nd×nd},
where deg(i) = Σ_j w_ij as in the graph Laplacian.
    Define S as the block matrix with
        S_ij = w_ij ρ_ij if i ∼ j, and 0 otherwise.
The vector Laplacian is then defined as L = D − S.

    Like the graph Laplacian, we introduce an orientation on E and a coboundary map
        δ_0 : (R^d)^V → (R^d)^{E⃗},
        δ_0 ◦ f(i, j) = f⃗_i − ρ_ij f⃗_j if ⟨i, j⟩ ∈ E⃗, and 0 otherwise,   where f = (f⃗_1, · · · , f⃗_n)^T.
The inner product on (R^d)^{E⃗} is defined as
        ⟨u, v⟩ = Σ_{i,j} w_ij u_ij^T v_ij,
        u* := u^T diag(w_ij),   u* v = ⟨u, v⟩.
If we let ρ_ij be orthogonal for all i, j such that ⟨i, j⟩ ∈ E⃗, then
        L = D − S = δ_0^T diag(w_ij) δ_0 = δ_0* δ_0.
    Analogous properties to the graph Laplacian hold:
      • G has k connected components ⇔ dim ker(L) = kd
      • a generalized Matrix-Tree theorem.
    4.7. Vector diffusion mapping. Write
        L = D − S = D(I − D^{-1} S),
        D^{-1} S = D^{-1/2} (D^{-1/2} S D^{-1/2}) D^{1/2}.
Denote
        S̃ := D^{-1/2} S D^{-1/2}.
S̃ has nd real eigenvalues λ_1, · · · , λ_{nd} and corresponding eigenvectors v_1, · · · , v_{nd}.
Thinking of these vectors of length nd in blocks of d, we denote by v_k(i) the i-th
block of v_k.
    The spectral decompositions of S̃(i, j) and S̃^{2t}(i, j) are given by
        S̃(i, j) = Σ_{k=1}^{nd} λ_k v_k(i) v_k(j)^T,
        S̃^{2t}(i, j) = Σ_{k=1}^{nd} λ_k^{2t} v_k(i) v_k(j)^T.
We use ||S̃^{2t}(i, j)||²_HS to measure the affinity between i and j. Thus,
        ||S̃^{2t}(i, j)||²_HS = Tr(S̃^{2t}(i, j) S̃^{2t}(i, j)^T)
                            = Σ_{k,l=1}^{nd} (λ_k λ_l)^{2t} Tr(v_k(i) v_k(j)^T v_l(j) v_l(i)^T)
                            = Σ_{k,l=1}^{nd} (λ_k λ_l)^{2t} Tr(v_k(j)^T v_l(j) v_l(i)^T v_k(i))
                            = Σ_{k,l=1}^{nd} (λ_k λ_l)^{2t} ⟨v_k(j), v_l(j)⟩ ⟨v_k(i), v_l(i)⟩.
The vector diffusion mapping is defined as
        V_t : i → ((λ_k λ_l)^t ⟨v_k(i), v_l(i)⟩)_{k,l=1}^{nd}.
As in the graph Laplacian case, ||S̃^{2t}(i, j)||²_HS is actually an inner product:
        ||S̃^{2t}(i, j)||²_HS = ⟨V_t(i), V_t(j)⟩.
This gives rise to a distance called the vector diffusion distance:
        d²_{VDM,t}(i, j) = ⟨V_t(i), V_t(i)⟩ + ⟨V_t(j), V_t(j)⟩ − 2⟨V_t(i), V_t(j)⟩.
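    A sketch of this construction (Python with NumPy; W is the weight matrix and rho a
dictionary mapping each connected ordered pair (i, j) to the d×d transport ρ_ij, with
ρ_ji = ρ_ij^T so that S̃ is symmetric; only the affinity ||S̃^{2t}(i, j)||²_HS is computed,
not the full nd²-dimensional embedding):

    import numpy as np

    def vdm_affinity(W, rho, d, t=1):
        """Vector diffusion affinity ||S_tilde^{2t}(i,j)||_HS^2 for all pairs (i, j)."""
        n = W.shape[0]
        S = np.zeros((n * d, n * d))
        for (i, j), O in rho.items():               # S_ij = w_ij * rho_ij
            S[i*d:(i+1)*d, j*d:(j+1)*d] = W[i, j] * O
        deg = W.sum(axis=1)
        Dis = np.kron(np.diag(1.0 / np.sqrt(deg)), np.eye(d))
        S_tilde = Dis @ S @ Dis                     # D^{-1/2} S D^{-1/2}
        S2t = np.linalg.matrix_power(S_tilde, 2 * t)
        aff = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                block = S2t[i*d:(i+1)*d, j*d:(j+1)*d]
                aff[i, j] = np.sum(block ** 2)      # squared Hilbert-Schmidt norm
        return aff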

    4.8. Normalized Vector Diffusion Mappings. An important kind of normalized
VDM is obtained as follows. Take 0 ≤ α ≤ 1 and set
        W_α := D^{-α} W D^{-α},
        S_α := D^{-α} S D^{-α},
        deg_α(i) := Σ_{j=1}^n W_α(i, j).
We define D_α ∈ R^{n×n} as the diagonal matrix with
        D_α(i, i) = deg_α(i),
and the block diagonal matrix D_α ∈ R^{nd×nd} with
        D_α(i, i) = deg_α(i) I_d.
We then get the vector diffusion mapping V_{α,t} by using S_α and the block diagonal
D_α instead of S and D.
4.9. Convergence of VDM. We first introduce some concepts.
Suppose M is a smooth manifold, and TM is a tensor bundle on M. When
the rank of TM is 0, it is the set of functions on M. When the rank of TM is 1, it
is the set of vector fields on M.
    The connection Laplacian operator is
        ∇²_{X,Y} T = −(∇_X ∇_Y T − ∇_{∇_X Y} T),
where ∇_X Y is the covariant derivative of Y over X.
Intuitively, the first term of the connection Laplacian can be seen as the change of T
over X and then over Y, and the second term as the overlapping part of the change of T
over X and over Y. The remainder can be seen as an operator that differentiates the
vector field in the direction of two orthogonal vector fields.
Now we introduce some results about convergence.
The normalized graph Laplacian converges to the Laplace-Beltrami operator:
(D−1 W − I)f → c∆f
for sufficiently smooth f and some constant c.

For VDM, Dα−1 Sα − I converges to the connection Laplacian operator [SW12]


plus some potential terms. When α = 1, D1−1 S1 − I converges to exactly the
connection Laplacian operator:
(D1−1 S1 − I)X → c∇2 X
CHAPTER 8

Semi-supervised Learning

1. Introduction
Problem: x_1, x_2, ..., x_l ∈ V_l are labeled data, that is, data for which the values f(x_i)
of a function f : V → R are observed. x_{l+1}, x_{l+2}, ..., x_{l+u} ∈ V_u are unlabeled. Our
concern is how to fully exploit the information (like the geometric structure of the
distribution) provided by the labeled and unlabeled data to find the unobserved labels.
    This kind of problem occurs in many situations, like ZIP code recognition: we may
only have a part of the digits labeled and our task is to label the unlabeled ones.

2. Harmonic Extension of Functions on Graph


    Suppose the whole graph is G = (V, E, W), where V = V_l ∪ V_u and the weight
matrix is partitioned into blocks
        W = [ W_ll  W_lu ; W_ul  W_uu ].
As before, we define D = diag(d_1, d_2, ..., d_n) = diag(D_l, D_u), d_i = Σ_{j=1}^n W_ij, and
L = D − W. The goal is to find f_u = (f_{l+1}, ..., f_{l+u})^T such that
        min f^T L f   s.t. f(V_l) = f_l,
where f = (f_l ; f_u). Note that
        f^T L f = (f_l^T, f_u^T) L (f_l ; f_u) = f_u^T L_uu f_u + f_l^T L_ll f_l + 2 f_u^T L_ul f_l.
So we have:
        ∂(f^T L f)/∂f_u = 0  ⇒  2 L_uu f_u + 2 L_ul f_l = 0  ⇒  f_u = −L_uu^{-1} L_ul f_l = (D_u − W_uu)^{-1} W_ul f_l.
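    A direct sketch of this harmonic extension (Python with NumPy; W is the full weight
matrix, labeled the index set V_l and fl the observed labels):

    import numpy as np

    def harmonic_extension(W, labeled, fl):
        """Solve (D_u - W_uu) f_u = W_ul f_l for the unlabeled nodes."""
        n = W.shape[0]
        labeled = np.asarray(labeled)
        unlabeled = np.setdiff1d(np.arange(n), labeled)
        deg = W.sum(axis=1)
        Du = np.diag(deg[unlabeled])
        Wuu = W[np.ix_(unlabeled, unlabeled)]
        Wul = W[np.ix_(unlabeled, labeled)]
        fu = np.linalg.solve(Du - Wuu, Wul @ np.asarray(fl, dtype=float))
        return unlabeled, fu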

3. Explanation from Gaussian Markov Random Field


    If we regard f : V → R as Gaussian random variables on the graph nodes whose
inverse covariance matrix (precision matrix) is given by the unnormalized graph
Laplacian L (sparse but singular), i.e. f ∼ N(0, Σ) where Σ^{-1} = L (interpreted as
a pseudo-inverse), then the conditional expectation of f_u given f_l is
        f_u = Σ_ul Σ_ll^{-1} f_l,
where
        Σ = [ Σ_ll  Σ_lu ; Σ_ul  Σ_uu ].

    The block matrix inversion formula tells us that when A and D are invertible,
        [ A  B ; C  D ] · [ X  Y ; Z  W ] = I  ⇒  [ X  Y ; Z  W ] = [ S_D^{-1}   −A^{-1} B S_A^{-1} ; −D^{-1} C S_D^{-1}   S_A^{-1} ],
        [ X  Y ; Z  W ] · [ A  B ; C  D ] = I  ⇒  [ X  Y ; Z  W ] = [ S_D^{-1}   −S_D^{-1} B D^{-1} ; −S_A^{-1} C A^{-1}   S_A^{-1} ],
where S_A = D − C A^{-1} B and S_D = A − B D^{-1} C are called the Schur complements of
A and D, respectively. The two matrix expressions for the inverse are equivalent when
the matrix is invertible.
    The graph Laplacian
        L = [ D_l − W_ll   −W_lu ; −W_ul   D_u − W_uu ]
is not invertible. D_l − W_ll and D_u − W_uu are both strictly diagonally dominant, i.e.
D_l(i, i) > Σ_j |W_ll(i, j)|, whence they are invertible by the Gershgorin Circle Theorem.
However their Schur complements S_{D_u − W_uu} and S_{D_l − W_ll} are still not invertible
and the block matrix inversion formula above can not be applied directly. To avoid this
issue, we define a regularized version of the graph Laplacian,
        L_λ = L + λI,   λ > 0,
and study its inverse Σ_λ = L_λ^{-1}.
    By the block matrix inversion formula (the right inverse above),
        Σ_λ = [ S_{λ+D_u−W_uu}^{-1}   (λ + D_l − W_ll)^{-1} W_lu S_{λ+D_l−W_ll}^{-1} ;
                (λ + D_u − W_uu)^{-1} W_ul S_{λ+D_u−W_uu}^{-1}   S_{λ+D_l−W_ll}^{-1} ].
Therefore,
        f_{u,λ} = Σ_{ul,λ} Σ_{ll,λ}^{-1} f_l = (λ + D_u − W_uu)^{-1} W_ul f_l,
whose limit exists as λ → 0: lim_{λ→0} f_{u,λ} = (D_u − W_uu)^{-1} W_ul f_l = f_u. This
implies that f_u can be regarded as the conditional mean given f_l.

4. Explanation from Transition Path Theory


    We can also view the problem as a random walk on the graph, with transition matrix
        P = D^{-1} W = [ P_ll  P_lu ; P_ul  P_uu ].
Assume that the labeled data are binary (classification), that is, for x_i ∈ V_l, f(x_i) = 0
or 1. Denote
      • V_0 = {i ∈ V_l : f_i = f(x_i) = 0}
      • V_1 = {i ∈ V_l : f_i = f(x_i) = 1}
      • V = V_0 ∪ V_1 ∪ V_u where V_l = V_0 ∪ V_1
With this random walk on the graph, f_u can be interpreted in terms of hitting times
(first passage times) of V_1 and V_0.
    Proposition 4.1. Define the hitting time
        τ_i^k = inf{t ≥ 0 : x(0) = i, x(t) ∈ V_k},   k = 0, 1.
Then for all i ∈ V_u,
        f_i = Prob(τ_i^1 < τ_i^0),
i.e.
        f_i = Prob(a trajectory starting from x_i hits V_1 before V_0).

    Note that the probability above is also called the committor function in the
Transition Path Theory of Markov chains.

    Proof. Define the committor function
        q_i^+ = Prob(τ_i^1 < τ_i^0) = 1 if x_i ∈ V_1;  0 if x_i ∈ V_0;  Σ_{j∈V} P_ij q_j^+ if i ∈ V_u.
This is because for all i ∈ V_u,
        q_i^+ = Prob(τ_i^{V_1} < τ_i^{V_0})
              = Σ_j P_ij q_j^+
              = Σ_{j∈V_1} P_ij q_j^+ + Σ_{j∈V_0} P_ij q_j^+ + Σ_{j∈V_u} P_ij q_j^+
              = Σ_{j∈V_1} P_ij + Σ_{j∈V_u} P_ij q_j^+.
Therefore
        q_u^+ = P_ul f_l + P_uu q_u^+ = D_u^{-1} W_ul f_l + D_u^{-1} W_uu q_u^+.
Multiplying both sides by D_u and reorganizing:
        (D_u − W_uu) q_u^+ = W_ul f_l.
If D_u − W_uu is invertible, we get
        q_u^+ = (D_u − W_uu)^{-1} W_ul f_l = f_u,
i.e. f_u is the committor function on V_u.                                   □

    The result coincides with the one we obtained through the view of Gaussian Markov
random fields.
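    As a sanity check of this interpretation, one can estimate the committor by simulating
the random walk with transition matrix P = D^{-1} W and counting how often V_1 is hit
before V_0 (a sketch, assuming a small connected graph so that the walk gets absorbed):

    import numpy as np

    def committor_mc(W, V0, V1, i, n_runs=2000, seed=0):
        """Monte Carlo estimate of Prob(hit V1 before V0 | start at node i)."""
        rng = np.random.default_rng(seed)
        P = W / W.sum(axis=1, keepdims=True)     # transition matrix D^{-1} W
        V0, V1 = set(V0), set(V1)
        hits = 0
        for _ in range(n_runs):
            x = i
            while x not in V0 and x not in V1:
                x = rng.choice(len(P), p=P[x])
            hits += x in V1
        # should approximate f_i = ((D_u - W_uu)^{-1} W_ul f_l)_i
        return hits / n_runs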

5. Well-posedness
    One natural problem is: if we only have a fixed amount of labeled data, can we
recover the labels of an infinite amount of unobserved data? This is the question of
well-posedness. [Nadler-Srebro 2009] gives the following result:
      • If x_i ∈ R^1, the problem is well-posed.
      • If x_i ∈ R^d (d ≥ 3), the problem is ill-posed, in which case D_u − W_uu
        becomes singular and f becomes a bump function (f_u is almost always
        zero or one except on some singular points).
    Here we can give a brief explanation. Recall that
        f^T L f ∼ ∫ ||∇f||².
If we have V_l = {0, 1}, f(x_0) = 0, f(x_1) = 1 and let
        f(x) = ||x − x_0||² / ε²  if ||x − x_0|| < ε,  and 1 otherwise,
then from multivariable calculus,
        ∫ ||∇f||² = c ε^{d−2}.

Since d ≥ 3, ε → 0 implies ∫ ||∇f||² → 0. So f(x) converges (as ε → 0) to a bump
function which is one almost everywhere except at x_0, where its value is 0. No
generalization ability is learned from such bump functions.
    This means that in the high dimensional case, to obtain a smooth generalization, we
have to add constraints beyond the norm of the first order derivatives. The following
theorem illustrates what kind of constraint is enough for a good generalization.

    Theorem 5.1 (Sobolev Embedding Theorem). f ∈ W_{s,p}(R^d) ⇐⇒ f has s-th order
weak derivative f^{(s)} ∈ L^p, and
        s > d/2  ⇒  W_{s,2} ↪ C(R^d).

    So in R^d, to obtain a continuous function, one needs a smoothness regularization
∫ ||∇^s f|| of degree s > d/2. To implement this in the discrete Laplacian setting, one
may consider the iterated Laplacian L^s, which might converge to such a high order
smoothness regularization.
CHAPTER 9

Beyond graphs: high dimensional


topological/geometric analysis

1. From Graph to Simplicial Complex


Definition (Simplicial Complex). An abstract simplicial complex is a collection Σ
of subsets of V which is closed under inclusion (or deletion), i.e. τ ∈ Σ and σ ⊆ τ ,
then σ ∈ Σ.
We have the following examples:
• Chess-board Complex
• Point cloud data:
Nerve complex
Cech, Rips, Witness complex
Mayer-Vietoris Blowup
      • Term-document co-occurrence complex
• Clique complex in pairwise comparison graphs
• Strategic complex in flow games
Example (Chess-board Complex). Let V be the positions on a Chess board. Σ
collects position subsets of V where one can place queens (rooks) without capturing
each other. It is easy to check the closedness under deletion: if σ ∈ Σ is a set of
“safe” positions, then any subset τ ⊆ σ is also a set of “safe” positions
Example (Nerve Complex). Take a cover of X, X = ∪_α U_α. Let V = {U_α} and
define Σ = {U_I : ∩_{α∈I} U_α ≠ ∅}.
      • Closedness under deletion holds.
      • Can be applied to any topological space X.
      • In a metric space (X, d), if U_α = B_ε(t_α) := {x ∈ X : d(x, t_α) ≤ ε}, we
        have the Cech complex C_ε.
      • Nerve Theorem: if every U_I is contractible, then X has the same homotopy
        type as Σ.
      • The Cech complex is hard to compute, even in Euclidean space.
      • One can easily compute an upper bound for the Cech complex:
        Construct a Cech subcomplex of dimension 1, i.e. a graph with
        edges connecting point pairs whose distance is no more than ε.
        Find its clique complex, i.e. the maximal complex whose 1-skeleton is
        the graph above, where every k-clique is regarded as a (k − 1)-simplex.
Example (Vietoris-Rips Complex). Let V = {x_α ∈ X}. Define V R_ε = {U_I ⊆ V :
d(x_α, x_β) ≤ ε, ∀α, β ∈ I}.
      • Rips is easier to compute than Cech;

        even so, Rips is exponential in dimension generally.
      • However, the Vietoris-Rips complex CAN NOT preserve the homotopy type as Cech does.
      • But there is still a hope to find a lower bound on homology.

    Theorem 1.1 ("Sandwich").
                              V R_ε ⊆ C_ε ⊆ V R_{2ε}
      • If a homology group "persists" through V R_ε → V R_{2ε}, then it must exist in
        C_ε; but not the vice versa.
      • All of the above gives rise to a filtration of simplicial complexes
                              ∅ = Σ_0 ⊆ Σ_1 ⊆ Σ_2 ⊆ . . .
      • Functoriality of inclusion: there are homomorphisms between homology groups
                              0 → H_1 → H_2 → . . .
      • A persistent homology is the image of H_i in H_j with j > i.

Example (Strong Witness Complex). Let V = {t_α ∈ X}. Define W^s_ε = {U_I ⊆ V :
∃x ∈ X, ∀α ∈ I, d(x, t_α) ≤ d(x, V) + ε}.

Example (Weak Witness Complex). Let V = {t_α ∈ X}. Define W^w_ε = {U_I ⊆ V :
∃x ∈ X, ∀α ∈ I, d(x, t_α) ≤ d(x, V_{−I}) + ε}.
      • V can be a set of landmarks, much smaller than X.
      • Monotonicity: W^*_ε ⊆ W^*_{ε′} if ε ≤ ε′.
      • But it is not easy to control the homotopy types between W^*_ε and X.
Example (Term-Document Occurrence complex, Li & Kwong 2009). Left is a
term-document co-occurrence matrix; Right is a simplicial complex representation
of terms. Connectivity analysis captures more information than Latent Semantic
Index.

            Figure 1. Term-Document Occurrence Complex
Example (Flag Complex of Paired Comparison Graph, Jiang-Lim-Yao-Ye 2011
[JLYY11]). Let V be a set of alternatives to be compared, with an undirected pair
(i, j) ∈ E if the pair is comparable. The flag complex χ_G consists of all cliques as
simplices or faces (e.g. 3-cliques as 2-faces and (k + 1)-cliques as k-faces); it is also
called the clique complex of G.
Example (Strategic Simplicial Complex for Flow Games, Candogan-Menache-Ozdaglar-
Parrilo 2011 [CMOP11]). The strategic simplicial complex is the clique complex of the
pairwise comparison graph G = (V, E) of strategic profiles, where V consists of all
strategy profiles of the players and a pair of strategy profiles (x, x′) ∈ E is comparable
if only one player changes strategy from x to x′. Every finite game can be decomposed
as the direct sum of potential games and zero-sum games (harmonic games).
            Figure 2. Illustration of Game Strategic Complex: Battle of the Sexes

                2. Persistent Homology and Discrete Morse Theory

    Theorem 2.1 ("Sandwich").
                              V R_ε ⊆ C_ε ⊆ V R_{2ε}
      • If a homology group "persists" through V R_ε → V R_{2ε}, then it must exist in
        C_ε; but not the vice versa.
      • All of the above gives rise to a filtration of simplicial complexes
                              ∅ = Σ_0 ⊆ Σ_1 ⊆ Σ_2 ⊆ . . .
      • Functoriality of inclusion: there are homomorphisms between homology groups
                              0 → H_1 → H_2 → . . .
      • A persistent homology is the image of H_i in H_j with j > i.
    Persistent homology was first proposed by Edelsbrunner-Letscher-Zomorodian,
with an algebraic formulation by Zomorodian-Carlsson. The algorithm is equivalent
to Robin Forman's discrete Morse theory.
    to be continued...
3. Exterior Calculus on Complex and Combinatorial Hodge Theory
    We are going to study functions on a simplicial complex, l²(V^d).
    A basis of "forms":
      • l²(V): e_i (i ∈ V), so f ∈ l²(V) has a representation f = Σ_{i∈V} f_i e_i, e.g.
        global ranking scores on V.
      • l²(V²): e_ij = −e_ji, f = Σ_{(i,j)} f_ij e_ij for f ∈ l²(V²), e.g. paired
        comparison scores on V².
      • l²(V³): e_ijk = e_jki = e_kij = −e_jik = −e_kji = −e_ikj, f = Σ_{ijk} f_ijk e_ijk.
      • l²(V^{d+1}): e_{i_0,...,i_d} is an alternating d-form,
                e_{i_0,...,i_d} = sign(σ) e_{σ(i_0),...,σ(i_d)},
        where σ is a permutation of {0, . . . , d}.

    Vector spaces of functions l²(V^{d+1}) represented in such a basis, with an inner
product defined, are called d-forms (cochains).
Example. In the crowdsourcing ranking of world universities,
http://www.allourideas.org/worldcollege/,
V consists of world universities, E are university pairs in comparison, l2 (V ) consists
of ranking scores of universities, l2 (V 2 ) is made up of paired comparison data.
    Discrete differential operators: the k-dimensional coboundary maps δ_k : L²(V^{k+1}) →
L²(V^{k+2}) are defined as the alternating difference operators
        (δ_k u)(i_0, . . . , i_{k+1}) = Σ_{j=0}^{k+1} (−1)^{j+1} u(i_0, . . . , i_{j−1}, i_{j+1}, . . . , i_{k+1}).

      • δ_k plays the role of differentiation
      • δ_{k+1} ◦ δ_k = 0
So we have the chain map
        L²(V) --δ_0--> L²(V²) --δ_1--> L²(V³) → · · · → L²(V^k) --δ_{k−1}--> L²(V^{k+1}) --δ_k--> · · ·
with δ_k ◦ δ_{k−1} = 0.
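    A sketch of δ_0 and δ_1 as dense NumPy matrices for a graph with triangle set T
(edges and triangles are assumed to be listed with increasing vertex indices; this
concrete matrix realization is reused in the HodgeRank sketch later in this chapter):

    import numpy as np

    def coboundaries(n, edges, triangles):
        """delta0: node functions -> edge flows; delta1: edge flows -> triangle flows."""
        edge_idx = {e: k for k, e in enumerate(edges)}
        d0 = np.zeros((len(edges), n))
        for k, (i, j) in enumerate(edges):          # (delta0 s)(i,j) = s_j - s_i
            d0[k, i], d0[k, j] = -1.0, 1.0
        d1 = np.zeros((len(triangles), len(edges)))
        for k, (i, j, l) in enumerate(triangles):   # (delta1 w)(i,j,l) = w_ij + w_jl - w_il
            d1[k, edge_idx[(i, j)]] += 1.0
            d1[k, edge_idx[(j, l)]] += 1.0
            d1[k, edge_idx[(i, l)]] -= 1.0
        return d0, d1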
Example (Gradient, Curl, and Divergence). We can define the discrete gradient and
curl, as well as their adjoints:
      • (δ_0 v)(i, j) = v_j − v_i =: (grad v)(i, j)
      • (δ_1 w)(i, j, k) = (±)(w_ij + w_jk + w_ki) =: (curl w)(i, j, k), which measures
        the total flow-sum along the loop i → j → k → i; and (δ_1 w)(i, j, k) = 0
        implies the paired comparison data is path-independent, which defines the
        triangular transitivity subspace
      • for each alternative i ∈ V, the combinatorial divergence
                (div w)(i) := −(δ_0^T w)(i) := Σ_j w_ij,
        which measures the inflow-outflow sum at i; and (δ_0^T w)(i) = 0 implies
        alternative i is preference-neutral in all pairwise comparisons, as in a
        cyclic ranking passing through the alternatives.
Definition (Combinatorial Hodge Laplacian). Define the k-dimensional combinatorial
Laplacian ∆_k : L²(V^{k+1}) → L²(V^{k+1}) by
        ∆_k = δ_{k−1} δ_{k−1}^T + δ_k^T δ_k,   k > 0.
      • k = 0: ∆_0 = δ_0^T δ_0 is the well-known graph Laplacian.
      • k = 1:
                ∆_1 = curl ◦ curl* − div ◦ grad
      • Important properties:
            ∆_k is positive semi-definite;
            ker(∆_k) = ker(δ_{k−1}^T) ∩ ker(δ_k): the k-harmonics, whose dimension
            equals the k-th Betti number;
            the Hodge Decomposition Theorem below.
    Theorem 3.1 (Hodge Decomposition). The space of k-forms (cochains) C^k(K(G), R)
admits an orthogonal decomposition into three components,
        C^k(K(G), R) = im(δ_{k−1}) ⊕ H_k ⊕ im(δ_k^T),
where
        H_k = ker(δ_{k−1}^T) ∩ ker(δ_k) = ker(∆_k).
      • dim(H_k) = β_k.
    A simple understanding is possible via the Dirac operator
        D = δ + δ* : ⊕_k L²(V^k) → ⊕_k L²(V^k).
Hence D = D* is self-adjoint. It combines the chain map
        L²(V) --δ_0--> L²(V²) --δ_1--> L²(V³) → · · · → L²(V^k) --δ_{k−1}--> L²(V^{k+1}) --δ_k--> · · ·
into one big operator, the Dirac operator. The abstract Hodge Laplacian is
        ∆ = D² = δδ* + δ*δ,
since δ² = 0.
    By the Fundamental Theorem of Linear Algebra (the Closed Range Theorem in Banach
spaces),
        ⊕_k L²(V^k) = im(D) ⊕ ker(D),
where
        im(D) = im(δ) ⊕ im(δ*)
and ker(D) = ker(∆) is the space of harmonic forms.

4. Applications of Hodge Theory: Statistical Ranking


    4.1. HodgeRank on Graphs. Let Λ = {1, ..., m} be a set of participants and
V = {1, ..., n} be the set of videos to be ranked. Paired comparison data is collected
as a function on Λ × V × V, which is skew-symmetric for each participant α, i.e.,
Y_ij^α = −Y_ji^α, representing the degree that α prefers i to j. The simplest setting is
the binary choice, where
        Y_ij^α = 1 if α prefers i to j, and −1 otherwise.
In general, Yijα can be used to represent paired comparison grades, e.g., Yijα > 0
refers to the degree that α prefers i to j and the vice versa Yjiα = −Yijα < 0 measures
the dispreference degree [JLYY11].
In this paper we shall focus on the binary choice, which is the simplest setting
and the data collected in this paper belongs to this case. However the theory can
be applied to the more general case with multiple choices above.
Such paired comparison data can be represented by a directed graph, or hyper-
graph, with n nodes, where each directed edge between i and j refers the preference
indicated by Yijα .
    A nonnegative weight function ω : Λ × V × V −→ [0, ∞) is defined as
(138)        ω_ij^α = 1 if α makes a comparison for {i, j}, and 0 otherwise.
It may reflect the confidence level that a participant compares {i, j} by taking
different values, and this is however not pursued in this paper.

    Our statistical rank aggregation problem is to look for some global ranking
score s : V → R such that
(139)        min_{s∈R^{|V|}} Σ_{i,j,α} ω_ij^α (s_i − s_j − Y_ij^α)²,
which is equivalent to the following weighted least squares problem
(140)        min_{s∈R^{|V|}} Σ_{i,j} ω_ij (s_i − s_j − Ŷ_ij)²,
where Ŷ_ij = (Σ_α ω_ij^α Y_ij^α)/(Σ_α ω_ij^α) and ω_ij = Σ_α ω_ij^α. For the principles behind
such a choice, readers may refer to [JLYY11].
    A graph structure arises naturally from ranking data as follows. Let G = (V, E)
be a paired ranking graph whose vertex set is V, the set of videos to be ranked, and
whose edge set is E, the set of video pairs which receive some comparisons, i.e.,
(141)        E = { {i, j} ∈ (V choose 2) | Σ_α ω_ij^α > 0 }.
    A pairwise ranking is called complete if each participant α in Λ gives a total
judgment of all videos in V; otherwise it is called incomplete. It is called balanced if
the paired comparison graph is k-regular with equal weights ω_ij = Σ_α ω_ij^α ≡ c for
all {i, j} ∈ E; otherwise it is called imbalanced. A complete and balanced ranking
induces a complete graph with equal weights on all edges. The existing paired
comparison methods in VQA often assume complete and balanced data. However,
this is an unrealistic assumption for real world data, e.g. randomized experiments.
Moreover in crowdsourcing, raters and videos come in an unspecified way and it is
hard to control the test process with precise experimental designs. Nevertheless,
as to be shown below, it is efficient to utilize some random sampling design based
on random graph theory where for each participant a fraction of video pairs are
chosen randomly. The HodgeRank approach adopted in this paper provides a
unified scheme which can deal with incomplete and imbalanced data emerging from
random sampling in paired comparisons.
The minimization problem (140) can be generalized to a family of linear models
in paired comparison methods [Dav88]. To see this, we first rewrite (140) in
another simpler form. Assume that for each edge as video pair {i, j}, the number
of comparisons is nij , among which aij participants have a preference on i over j
(aji carries the opposite meaning). So aij + aji = nij if no tie occurs. Therefore,
for each edge {i, j} ∈ E, we have a preference probability estimated from data
π̂ij = aij /nij . With this definition, the problem (140) can be rewritten as
(142)        min_{s∈R^{|V|}} Σ_{{i,j}∈E} n_ij (s_i − s_j − (2π̂_ij − 1))²,

since Ŷij = (aij − aji )/nij = 2π̂ij − 1 due to Equation (138).


    General linear models, which were first formulated by G. Noether [Noe60],
assume that the true preference probability can be fully decided by a linear scaling
function on V, i.e.,
(143)        π_ij = Prob{i is preferred over j} = F(s*_i − s*_j),
for some s* ∈ R^{|V|}. F can be chosen as any symmetric cumulative distribution
function. When only an empirical preference probability π̂_ij is observed, we can

map it to a skew-symmetric function by the inverse of F ,


(144) Ŷij = F −1 (π̂ij ),
where Ŷij = −Ŷji . However, in this case, one can only expect that
(145) Ŷij = s∗i − s∗j + εij ,
where εij accounts for the noise. The case in (142) takes a linear F and is often
called a uniform model. Below we summarize some well known models which have
been studied extensively in [Dav88].
    1. Uniform model:
(146)        Ŷ_ij = 2π̂_ij − 1.
    2. Bradley-Terry model:
(147)        Ŷ_ij = log( π̂_ij / (1 − π̂_ij) ).
    3. Thurstone-Mosteller model:
(148)        Ŷ_ij = F^{−1}(π̂_ij),
where F is essentially the Gauss error function
(149)        F(x) = (1/√(2π)) ∫_{−x/[2σ²(1−ρ)]^{1/2}}^{∞} e^{−t²/2} dt.
Note that constants σ and ρ will only contribute to a rescaling of the solution of
(140).
4. Angular transform model:
(150) Ŷij = arcsin(2π̂ij − 1).
This model is created for the so called variance stabilization property: asymptot-
ically Ŷij has variance only depending on number of ratings on edge {i, j} or the
weight ωij , but not on the true probability pij .
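    A sketch converting an empirical preference probability π̂_ij into Ŷ_ij under these
models (Python; the Thurstone-Mosteller case uses the standard normal quantile
scipy.stats.norm.ppf, i.e. the constants σ and ρ are absorbed into the rescaling noted
above):

    import numpy as np
    from scipy.stats import norm

    def to_skew_score(pi_hat, model="uniform"):
        """Map empirical preference probabilities to skew-symmetric scores Y_ij."""
        p = np.clip(np.asarray(pi_hat, dtype=float), 1e-6, 1 - 1e-6)  # avoid infinities
        if model == "uniform":
            return 2 * p - 1
        if model == "bradley-terry":
            return np.log(p / (1 - p))
        if model == "thurstone-mosteller":
            return norm.ppf(p)            # inverse Gaussian CDF, up to rescaling
        if model == "angular":
            return np.arcsin(2 * p - 1)
        raise ValueError(model)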
Different models will give different Ŷij from the same observation π̂ij , followed
by the same weighted least square problem (140) for the solution. Therefore, a
deeper analysis of problem (140) will disclose more properties about the ranking
problem.
HodgeRank on graph G = (V, E) provides us such a tool, which characterizes
the solution and residue of (140), adaptive to topological structures of G. The
following theorem adapted from [JLYY11] describes a decomposition of Ŷ , which
can be visualized as edge flows on graph G with direction i → j if Ŷij > 0 and vice
versa. Before the statement of the theorem, we first define the triangle set of G as
all the 3-cliques in G:
(151)        T = { {i, j, k} ∈ (V choose 3) | {i, j}, {j, k}, {k, i} ∈ E }.
Equipped with T , graph G becomes an abstract simplicial complex, the clique
complex χ(G) = (V, E, T ).
Theorem 1 [Hodge Decomposition of Paired Ranking] Let Ŷij be a
paired comparison flow on graph G = (V, E), i.e., Ŷij = −Ŷji for {i, j} ∈ E, and
Ŷij = 0 otherwise. There is a unique decomposition of Ŷ satisfying
(152) Ŷ = Ŷ g + Ŷ h + Ŷ c ,

Figure 3. Hodge decomposition (three orthogonal components)


of paired rankings [JLYY11].

where
(153)        Ŷ_ij^g = ŝ_i − ŝ_j,   for some ŝ ∈ R^V,
(154)        Ŷ_ij^h + Ŷ_jk^h + Ŷ_ki^h = 0,   for each {i, j, k} ∈ T,
(155)        Σ_{j∼i} ω_ij Ŷ_ij^h = 0,   for each i ∈ V.
The decomposition above is orthogonal under the following inner product on R^{|E|},
⟨u, v⟩_ω = Σ_{{i,j}∈E} ω_ij u_ij v_ij.
The following provides some remarks on the decomposition.
1. When G is connected, Ŷijg is a rank two skew-symmetric matrix and gives a
linear score function ŝ ∈ RV up to translations. We thus call Ŷ g a gradient flow
since it is given by the difference (discrete gradient) of the score function ŝ on graph
nodes,
(156) Ŷijg = (δ0 ŝ)(i, j) := ŝi − ŝj ,
where δ0 : RV → RE is a finite difference operator (matrix) on G. ŝ can be chosen
as any least square solution of (140), where we often choose the minimal norm
solution,
(157) ŝ = ∆†0 δ0∗ Ŷ ,
where δ_0* = δ_0^T W (W = diag(ω_ij)), ∆_0 = δ_0* · δ_0 is the unnormalized graph Laplacian
defined by (∆_0)_ii = Σ_{j∼i} ω_ij and (∆_0)_ij = −ω_ij, and (·)† is the Moore-Penrose
(pseudo) inverse. On a complete and balanced graph, (157) reduces to
ŝ_i = (1/(n−1)) Σ_{j≠i} Ŷ_ij, often called Borda Count as the earliest preference aggregation rule in
social choice [JLYY11]. For expander graphs like regular graphs, graph Laplacian
∆0 has small condition numbers and thus the global ranking is stable against noise
on data.
    2. Ŷ^h satisfies the two conditions (154) and (155), which are called the curl-free
and divergence-free conditions respectively. The former requires the triangular trace
of Ŷ to be zero on every 3-clique in graph G, while the latter requires the total
sum (inflow minus outflow) to be zero on each node of G. These two conditions
characterize a linear subspace which is called the space of harmonic flows.
    3. The residue Ŷ^c actually satisfies (155) but not (154). In fact, it measures
the amount of intrinsic (local) inconsistency in Ŷ characterized by the triangular

trace. We often call this component the curl flow. In particular, the relative curl,
(158)        curl^r_ijk = |Ŷ_ij + Ŷ_jk + Ŷ_ki| / (|Ŷ_ij| + |Ŷ_jk| + |Ŷ_ki|)
                        = |Ŷ^c_ij + Ŷ^c_jk + Ŷ^c_ki| / (|Ŷ_ij| + |Ŷ_jk| + |Ŷ_ki|)  ∈ [0, 1],
can be used to characterize triangular intransitivity; curl^r_ijk = 1 iff {i, j, k} contains
an intransitive triangle of Ŷ. Note that computing the percentage of curl^r_ijk = 1
is equivalent to calculating the Transitivity Satisfaction Rate (TSR) in complete
graphs.
Figure 3 illustrates the Hodge decomposition for paired comparison flows and
Algorithm 5 shows how to compute global ranking and other components. The
readers may refer to [JLYY11] for the detail of theoretical development. Below we
just make a few comments on the application of HodgeRank in our setting.

Algorithm 5: Procedure of Hodge decomposition in Matlab pseudocode
  Input: A paired comparison hypergraph G provided by assessors.
  Output: Global score ŝ, gradient flow Ŷ^g, curl flow Ŷ^c, and harmonic flow Ŷ^h.
 1 Initialization:
 2   Ŷ (a numEdge-vector consisting of the Ŷ_ij defined above),
 3   W (a numEdge-vector consisting of the ω_ij).
 4 Step 1:
 5   Compute δ0, δ1;                      // δ0 = gradient, δ1 = curl
 6   δ0* = δ0' * diag(W);                 // the conjugate of δ0
 7   ∆0 = δ0* * δ0;                       // unnormalized graph Laplacian
 8   div = δ0* * Ŷ;                       // divergence operator
 9   ŝ = lsqr(∆0, div);                   // global score
10 Step 2:
11   Compute the 1st projection onto the gradient flow: Ŷ^g = δ0 * ŝ;
12 Step 3:
13   δ1* = δ1' * diag(1./W);
14   ∆1 = δ1 * δ1*;
15   curl = δ1 * Ŷ;
16   z = lsqr(∆1, curl);
17   Compute the 3rd projection onto the curl flow: Ŷ^c = δ1* * z;
18 Step 4:
19   Compute the 2nd projection onto the harmonic flow: Ŷ^h = Ŷ − Ŷ^g − Ŷ^c.
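    For readers who prefer NumPy, here is a minimal sketch of the same decomposition
(using the coboundary matrices d0, d1 from the sketch in Section 3 and least squares
in place of lsqr; weights are taken to be uniform for brevity, an assumption rather than
the general weighted case of Algorithm 5):

    import numpy as np

    def hodge_rank(Y, d0, d1):
        """Decompose an edge flow Y into gradient, curl and harmonic parts (unit weights)."""
        s, *_ = np.linalg.lstsq(d0, Y, rcond=None)     # global score: min ||d0 s - Y||
        Y_grad = d0 @ s
        z, *_ = np.linalg.lstsq(d1.T, Y, rcond=None)   # curl potential: min ||d1^T z - Y||
        Y_curl = d1.T @ z
        Y_harm = Y - Y_grad - Y_curl                   # harmonic residual
        return s, Y_grad, Y_curl, Y_harm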

1. To find a global ranking ŝ in (157), the recent developments of Spielman-Teng


[ST04] and Koutis-Miller-Peng [KMP10] suggest fast (almost linear in |E|Poly(log |V |))
algorithms for this purpose.
2. Inconsistency of Ŷ has two parts: global inconsistency measured by harmonic
flow Ŷ h and local inconsistency measured by curls in Ŷ c . Due to the orthogonal
decomposition, kŶ h k2ω /kŶ k2ω and kŶ c k2ω /kŶ k2ω provide percentages of global and
local inconsistencies, respectively.
3. A nontrivial harmonic component Ŷ h 6= 0 implies the fixed tournament issue,
i.e., for any candidate i ∈ V , there is a paired comparison design by removing some
of the edges in G = (V, E) such that i is the overall winner.

    4. One can control the harmonic component by controlling the topology of the
clique complex χ(G). In a loop-free clique complex χ(G), where β_1 = 0, the harmonic
component vanishes. In this case, there are no cycles which traverse all the nodes,
e.g., 1 ≻ 2 ≻ 3 ≻ 4 ≻ . . . ≻ n ≻ 1. All the inconsistency will be summarized in
those triangular cycles, e.g., i ≻ j ≻ k ≻ i.
    Theorem 2. The linear space of harmonic flows has dimension equal to β_1, i.e.,
the number of independent loops in the clique complex χ(G), which is called the first
order Betti number.
    Fortunately, with the aid of some random sampling principles, it is not hard to
obtain graphs whose β_1 is zero.
4.2. Random Graphs. In this section, we first describe two classical random
models: Erdös-Rényi random graph and random regular graph; then we investigate
the relation between them.
4.2.1. Erdös-Rényi Random Graph. Erdös-Rényi random graph G(n, p) starts
from n vertices and draws its edges independently according to a fixed probability
p. Such random graph model is chosen to meet the scenario that in crowdsourc-
ing ranking raters and videos come in an unspecified way. Among various models,
Erdös-Rényi random graph is the simplest one equivalent to I.I.D. sampling. There-
fore, such a model is to be systematically studied in the paper.
However, to exploit Erdös-Rényi random graph in crowdsourcing experimental
designs, one has to meet some conditions depending on our purpose:
1. The resultant graph should be connected, if we hope to derive global scores
for all videos in comparison;
2. The resultant graph should be loop-free in its clique complex, if we hope to
get rid of the global inconsistency in harmonic component.
The two conditions can be easily satisfied for large Erdös-Rényi random graph.
    Theorem 3. Let G(n, p) be the set of Erdös-Rényi random graphs with n nodes and
edge appearance probability p. Then the following holds as n → ∞:
    1. [Erdös-Rényi 1959] [ER59] if p ≫ log n/n, then G(n, p) is almost always
connected; and if p ≪ log n/n then G(n, p) is almost always disconnected;
    2. [Kahle 2009] [Kah09, Kah13] if p = O(n^α), with α < −1 or α > −1/2,
then the expected β_1 of the clique complex χ(G(n, p)) is almost always equal to
zero, i.e., loop-free.
These theories imply that when p is large enough, Erdös-Rényi random graph
will meet the two conditions above with high probability. In particular, almost
linear O(n log n) edges suffice to derive a global ranking, and with O(n3/2 ) edges
harmonic-free condition is met.
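    A small sketch that samples G(n, p), checks connectivity, and computes β_1 of the
clique complex from the ranks of the coboundary maps (β_1 = |E| − rank δ_0 − rank δ_1;
dense matrices, so intended for small n only):

    import numpy as np
    from itertools import combinations

    def er_clique_complex_check(n, p, seed=0):
        """Sample an Erdos-Renyi graph; return (is_connected, beta_1 of its clique complex)."""
        rng = np.random.default_rng(seed)
        A = np.triu(rng.random((n, n)) < p, k=1)
        A = A | A.T
        edges = [(i, j) for i, j in combinations(range(n), 2) if A[i, j]]
        tris = [(i, j, k) for i, j, k in combinations(range(n), 3)
                if A[i, j] and A[j, k] and A[i, k]]
        d0 = np.zeros((len(edges), n))
        for r, (i, j) in enumerate(edges):
            d0[r, i], d0[r, j] = -1.0, 1.0
        eidx = {e: r for r, e in enumerate(edges)}
        d1 = np.zeros((len(tris), len(edges)))
        for r, (i, j, k) in enumerate(tris):
            d1[r, eidx[(i, j)]], d1[r, eidx[(j, k)]], d1[r, eidx[(i, k)]] = 1.0, 1.0, -1.0
        rank = lambda M: np.linalg.matrix_rank(M) if M.size else 0
        beta0 = n - rank(d0)                          # number of connected components
        beta1 = len(edges) - rank(d0) - rank(d1)      # dim ker(delta1) - dim im(delta0)
        return beta0 == 1, beta1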
Despite such an asymptotic theory for large random graphs, it remains a ques-
tion how to ensure that a given graph instance satisfies the two conditions? Fortu-
nately, the recent development in computational topology provides us such a tool,
persistent homology, which will be illustrated in Section ??.

5. Euler-Calculus
to be finished...

Figure 4. Examples of k-regular graphs.


Bibliography

[AC09] R DeVore A Cohen, W Dahmen, Compressed sensing and best k-term approximation,
J. Amer. Math. Soc 22 (2009), no. 1, 211–231.
[Ach03] Dimitris Achlioptas, Database-friendly random projections: Johnson-lindenstrauss
with binary coins, Journal of Computer and System Sciences 66 (2003), 671–687.
[Ali95] F. Alizadeh, Interior point methods in semidefinite programming with applications
to combinatorial optimization, SIAM J. Optim. 5 (1995), no. 1, 13–51.
[Aro50] N. Aronszajn, Theory of reproducing kernels, Transactions of the American Mathe-
matical Society 68 (1950), no. 3, 337–404.
[Bav11] Francois Bavaud, On the schoenberg transformations in data analysis: Theory and
illustrations, Journal of Classification 28 (2011), no. 3, 297–314.
[BDDW08] Richard Baraniuk, Mark Davenport, Ronald DeVore, and Michael Wakin, A simple
proof of the restricted isometry property for random matrices, Constructive Approx-
imation 28 (2008), no. 3, 253–263.
[BLT+ 06] P. Biswas, T.-C. Liang, K.-C. Toh, T.-C. Wang, and Y. Ye, Semidefinite programming
approaches for sensor network localization with noisy distance measurements, IEEE
Transactions on Automation Science and Engineering 3 (2006), 360–371.
[BN01] Mikhail Belkin and Partha Niyogi, Laplacian eigenmaps and spectral techniques
for embedding and clustering, Advances in Neural Information Processing Systems
(NIPS) 14, MIT Press, 2001, pp. 585–591.
[BN03] Mikhail Belkin and Partha Niyogi, Laplacian eigenmaps for dimensionality reduction
and data representation, Neural Computation 15 (2003), 1373–1396.
[BN08] Mikhail Belkin and Partha Niyogi, Convergence of laplacian eigenmaps, Tech. report,
2008.
[BP98] Sergey Brin and Larry Page, The anatomy of a large-scale hypertextual web search
engine, Proceedings of the 7th international conference on World Wide Web (WWW)
(Australia), 1998, pp. 107–117.
[BS10] Zhidong Bai and Jack W. Silverstein, Spectral analysis of large dimensional random
matrices, Springer, 2010.
[BTA04] Alain Berlinet and Christine Thomas-Agnan, Reproducing kernel hilbert spaces in
probability and statistics, Kluwer Academic Publishers, 2004.
[Can08] E. J. Candès, The restricted isometry property and its implications for compressed
sensing, Comptes Rendus de l’Académie des Sciences, Paris, Série I 346 (2008), 589–
592.
[CDS98] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders, Atomic decompo-
sition by basis pursuit, SIAM Journal on Scientific Computing 20 (1998), 33–61.
[Chu05] Fan R. K. Chung, Laplacians and the cheeger inequality for directed graphs, Annals
of Combinatorics 9 (2005), no. 1, 1–19.
[CL06] Ronald R. Coifman and Stéphane. Lafon, Diffusion maps, Applied and Computa-
tional Harmonic Analysis 21 (2006), 5–30.
[CLL+ 05] R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W.
Zucker, Geometric diffusions as a tool for harmonic analysis and structure definition
of data: Diffusion maps i, Proceedings of the National Academy of Sciences of the
United States of America 102 (2005), 7426–7431.
[CLMW09] E. J. Candès, Xiaodong Li, Yi Ma, and John Wright, Robust principal component
analysis, Journal of ACM 58 (2009), no. 1, 1–37.


[CMOP11] Ozan Candogan, Ishai Menache, Asuman Ozdaglar, and Pablo A. Parrilo, Flows and
decompositions of games: Harmonic and potential games, Mathematics of Operations
Research 36 (2011), no. 3, 474–503.
[CPW12] V. Chandrasekaran, P. A. Parrilo, and A. S. Willsky, Latent variable graphical model
selection via convex optimization (with discussion), Annals of Statistics (2012), to
appear, http://arxiv.org/abs/1008.1290.
[CR09] E. J. Candès and B. Recht, Exact matrix completion via convex optimization, Foun-
dation of Computational Mathematics 9 (2009), no. 6, 717–772.
[CRPW12] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, The convex geometry
of linear inverse problems, Foundation of Computational Mathematics (2012), to
appear, http://arxiv.org/abs/1012.0621.
[CRT06] Emmanuel. J. Candès, Justin Romberg, and Terrence Tao, Robust uncertainty prin-
ciples: Exact signal reconstruction from highly incomplete frequency information,
IEEE Trans. on Info. Theory 52 (2006), no. 2, 489–509.
[CSPW11] V. Chandrasekaran, S. Sanghavi, P.A. Parrilo, and A. Willsky, Rank-sparsity inco-
herence for matrix decomposition, SIAM Journal on Optimization 21 (2011), no. 2,
572–596, http://arxiv.org/abs/0906.2220.
[CST03] N. Cristianini and J. Shawe-Taylor, An introduction to support vector machines and
other kernel-based learning methods, Cambridge University Press, 2003.
[CT05] E. J. Candès and Terrence Tao, Decoding by linear programming, IEEE Trans. on
Info. Theory 51 (2005), 4203–4215.
[CT06] Emmanuel. J. Candès and Terrence Tao, Near optimal signal recovery from random
projections: Universal encoding strategies, IEEE Trans. on Info. Theory 52 (2006),
no. 12, 5406–5425.
[CT10] E. J. Candès and T. Tao, The power of convex relaxation: Near-optimal matrix
completion, IEEE Transaction on Information Theory 56 (2010), no. 5, 2053–2080.
[Dav88] H. David, The methods of paired comparisons, 2nd ed., Griffin’s Statistical Mono-
graphs and Courses, 41, Oxford University Press, New York, NY, 1988.
[DG03a] Sanjoy Dasgupta and Anupam Gupta, An elementary proof of a theorem of johnson
and lindenstrauss, Random Structures and Algorithms 22 (2003), no. 1, 60–65.
[DG03b] David L. Donoho and Carrie Grimes, Hessian eigenmaps: Locally linear embedding
techniques for high-dimensional data, Proceedings of the National Academy of Sci-
ences of the United States of America 100 (2003), no. 10, 5591–5596.
[dGJL07] Alexandre d’Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert R. G.
Lanckriet, A direct formulation for sparse pca using semidefinite programming, SIAM
Review 49 (2007), no. 3, http://arxiv.org/abs/cs/0406021.
[DH01] David L. Donoho and Xiaoming Huo, Uncertainty principles and ideal atomic de-
composition, IEEE Transactions on Information Theory 47 (2001), no. 7, 2845–2862.
[EB01] M. Elad and A.M. Bruckstein, On sparse representations, International Conference
on Image Processing (ICIP) (Tsaloniky, Greece), November 2001.
[ELVE08] Weinan E, Tiejun Li, and Eric Vanden-Eijnden, Optimal partition and effective dy-
namics of complex networks, Proc. Nat. Acad. Sci. 105 (2008), 7907–7912.
[ER59] P. Erdős and A. Rényi, On random graphs I, Publ. Math. Debrecen 6 (1959), 290–297.
[EST09] Ioannis Z. Emiris, Frank J. Sottile, and Thorsten Theobald, Nonlinear computational
geometry, Springer, New York, 2009.
[EVE06] Weinan E and Eric Vanden-Eijnden, Towards a theory of transition paths, J. Stat.
Phys. 123 (2006), 503–523.
[EVE10] Weinan E and Eric Vanden-Eijnden, Transition-path theory and path-finding algo-
rithms for the study of rare events, Annual Review of Physical Chemistry 61 (2010),
391–420.
[Gro11] David Gross, Recovering low-rank matrices from few coefficients in any basis, IEEE
Transaction on Information Theory 57 (2011), 1548, arXiv:0910.1879.
[HAvL05] M. Hein, J. Audibert, and U. von Luxburg, From graphs to manifolds: weak and
strong pointwise consistency of graph laplacians, COLT, 2005.
[JL84] W. B. Johnson and J. Lindenstrauss, Extensions of lipschitz maps into a hilbert space,
Contemp Math 26 (1984), 189–206.

[JLYY11] Xiaoye Jiang, Lek-Heng Lim, Yuan Yao, and Yinyu Ye, Statistical ranking and combinatorial Hodge theory, Mathematical Programming 127 (2011), no. 1, 203–244, arXiv:0811.1067 [stat.ML].
[Joh06] I. Johnstone, High dimensional statistical inference and random matrices, Proc. In-
ternational Congress of Mathematicians, 2006.
[JYLG12] Xiaoye Jiang, Yuan Yao, Han Liu, and Leo Guibas, Detecting network cliques with
radon basis pursuit, The Fifteenth International Conference on Artificial Intelligence
and Statistics (AISTATS) (La Palma, Canary Islands), April 21-23 2012.
[Kah09] Matthew Kahle, Topology of random clique complexes, Discrete Mathematics 309
(2009), 1658–1671.
[Kah13] Matthew Kahle, Sharp vanishing thresholds for cohomology of random flag complexes, Annals of Mathematics (2013), arXiv:1207.0149.
[Kle99] Jon Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the
ACM 46 (1999), no. 5, 604–632.
[KMP10] Ioannis Koutis, G. Miller, and Richard Peng, Approaching optimality for solving SDD linear systems, FOCS ’10: 51st Annual IEEE Symposium on Foundations of Computer Science, 2010.
[KN08] S. Kritchman and B. Nadler, Determining the number of components in a factor
model from limited noisy data, Chemometrics and Intelligent Laboratory Systems 94
(2008), 19–32.
[LL11] Jian Li and Tiejun Li, Probabilistic framework for network partition, Phys. A 390
(2011), 3579.
[LLE09] Tiejun Li, Jian Liu, and Weinan E, Probabilistic framework for network partition,
Phys. Rev. E 80 (2009), 026106.
[LM06] Amy N. Langville and Carl D. Meyer, Google’s PageRank and beyond: The science of search engine rankings, Princeton University Press, 2006.
[LZ10] Yanhua Li and Zhi-Li Zhang, Random walks on digraphs, the generalized digraph Laplacian, and the degree of asymmetry, Algorithms and Models for the Web-Graph, Lecture Notes in Computer Science, vol. 6516, 2010, pp. 74–85.
[Mey00] Carl D. Meyer, Matrix analysis and applied linear algebra, SIAM, 2000.
[MSVE09] Philipp Metzner, Christof Schütte, and Eric Vanden-Eijnden, Transition path theory for Markov jump processes, Multiscale Model. Simul. 7 (2009), 1192.
[MY09] Nicolai Meinshausen and Bin Yu, Lasso-type recovery of sparse representations for
high-dimensional data, Annals of Statistics 37 (2009), no. 1, 246–270.
[NBG10] R. R. Nadakuditi and F. Benaych-Georges, The breakdown point of signal subspace
estimation, IEEE Sensor Array and Multichannel Signal Processing Workshop (2010),
177–180.
[Noe60] G. Noether, Remarks about a paired comparison model, Psychometrika 25 (1960),
357–367.
[NSVE+09] Frank Noé, Christof Schütte, Eric Vanden-Eijnden, Lothar Reich, and Thomas R. Weikl, Constructing the equilibrium ensemble of folding pathways from short off-equilibrium simulations, Proceedings of the National Academy of Sciences of the United States of America 106 (2009), no. 45, 19011–19016.
[RL00] Sam T. Roweis and Lawrence K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000), no. 5500, 2323–2326.
[Sch37] I. J. Schoenberg, On certain metric spaces arising from Euclidean spaces by a change of metric and their imbedding in Hilbert space, The Annals of Mathematics 38 (1937), no. 4, 787–793.
[Sch38a] I. J. Schoenberg, Metric spaces and completely monotone functions, The Annals of Mathematics 39 (1938), 811–841.
[Sch38b] I. J. Schoenberg, Metric spaces and positive definite functions, Transactions of the American Mathematical Society 44 (1938), 522–536.
[Sin06] Amit Singer, From graph to manifold Laplacian: The convergence rate, Applied and Computational Harmonic Analysis 21 (2006), 128–134.
[ST04] D. Spielman and Shang-Hua Teng, Nearly-linear time algorithms for graph partition-
ing, graph sparsification, and solving linear systems, STOC ’04 Proceedings of the
thirty-sixth annual ACM symposium on Theory of computing, 2004.
[Ste56] Charles Stein, Inadmissibility of the usual estimator for the mean of a multivariate normal distribution, Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability 1 (1956), 197–206.
[SW12] Amit Singer and Hau-Tieng Wu, Vector diffusion maps and the connection laplacian,
Comm. Pure Appl. Math. 65 (2012), no. 8, 1067–1144.
[SY07] Anthony Man-Cho So and Yinyu Ye, Theory of semidefinite programming for sensor
network localization, Mathematical Programming, Series B 109 (2007), no. 2-3, 367–
384.
[SYZ08] Anthony Man-Cho So, Yinyu Ye, and Jiawei Zhang, A unified theorem on sdp rank
reduction, Mathematics of Operations Research 33 (2008), no. 4, 910–920.
[Tao11] Terence Tao, Topics in random matrix theory, lecture notes, UCLA, 2011.
[TdL00] J. B. Tenenbaum, Vin de Silva, and John C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000), no. 5500, 2319–2323.
[TdSL00] J. Tenenbaum, V. de Silva, and J. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000), no. 5500, 2319–2323.
[Tib96] R. Tibshirani, Regression shrinkage and selection via the lasso, J. of the Royal Sta-
tistical Society, Series B 58 (1996), no. 1, 267–288.
[Tro04] Joel A. Tropp, Greed is good: Algorithmic results for sparse approximation, IEEE
Trans. Inform. Theory 50 (2004), no. 10, 2231–2242.
[Tsy09] Alexandre Tsybakov, Introduction to nonparametric estimation, Springer, 2009.
[Vap98] V. Vapnik, Statistical learning theory, Wiley, New York, 1998.
[Vem04] Santosh Vempala, The random projection method, Am. Math. Soc., Providence, 2004.
[Wah90] Grace Wahba, Spline models for observational data, CBMS-NSF Regional Conference
Series in Applied Mathematics 59, SIAM, 1990.
[WS06] Kilian Q. Weinberger and Lawrence K. Saul, Unsupervised learning of image manifolds by semidefinite programming, International Journal of Computer Vision 70 (2006), no. 1, 77–90.
[YH41] G. Young and A. S. Householder, A note on multidimensional psycho-physical anal-
ysis, Psychometrika 6 (1941), 331–333.
[ZHT06] H. Zou, T. Hastie, and R. Tibshirani, Sparse principal component analysis, Journal
of Computational and Graphical Statistics 15 (2006), no. 2, 262–286.
[ZY06] Peng Zhao and Bin Yu, On model selection consistency of lasso, J. Machine Learning
Research 7 (2006), 2541–2567.
[ZZ02] Zhenyue Zhang and Hongyuan Zha, Principal manifolds and nonlinear dimension reduction via local tangent space alignment, SIAM Journal on Scientific Computing 26 (2002), 313–338.
[ZZ09] Hongyuan Zha and Zhenyue Zhang, Spectral properties of the alignment matrices in
manifold learning, SIAM Review 51 (2009), no. 3, 545–566.
