Mathematial Introduction to Data Science
Mathematial Introduction to Data Science
Yuan Yao
School of Mathematical Sciences, Peking University, Beijing, China
100871
E-mail address: yuany@math.pku.edu.cn
URL: http://www.math.pku.edu.cn/teachers/yaoy/Fall2012/lectures.pdf
This is a working draft last updated on
October 14, 2014
2000 Mathematics Subject Classification. Primary
Key words and phrases. keywords
Special thanks to Amit Singer, Weinan E, Xiuyuan Cheng, and the following
students in PKU who help scribe lecture notes with various improvements: Hong
Cheng, Chao Deng, Yanzhen Deng, Chendi Huang, Lei Huang, Shujiao Huang,
Longlong Jiang, Yuwei Jiang, Wei Jin, Changcheng Li, Xiaoguang Li, Tengyuan
Liang, Feng Lin, Yaning Liu, Peng Luo, Wulin Luo, Tangjie Lv, Yuan Lv, Hongyu
Meng, Ping Qin, Jie Ren, Hu Sheng, Zhiming Wang, Yuting Wei, Jiechao Xiong,
Jie Xu, Bowei Yan, Jun Yin, and Yue Zhao.
Preface 1
Chapter 1. Multidimensional Scaling and Principal Component Analysis 3
1. Classical MDS 3
2. Theory of MDS (Young/Househölder/Schoenberg’1938) 4
3. Hilbert Space Embedding and Reproducing Kernels 8
4. Linear Dimensionality Reduction 8
5. Principal Component Analysis 9
6. Dual Roles of MDS vs. PCA in SVD 11
Chapter 2. Random Projections and Almost Isometry 13
1. Introduction 13
2. The Johnson-Lindenstrauss Lemma 14
3. Example: MDS in Human Genome Diversity Project 17
4. Random Projections and Compressed Sensing 18
Chapter 3. High Dimensional Statistics: Mean and Covariance in Noise 25
1. Maximum Likelihood Estimation 25
2. Bias-Variance Decomposition of Mean Square Error 27
3. Stein’s Phenomenon and Shrinkage of Sample Mean 29
4. Random Matrix Theory and Phase Transitions in PCA 36
Chapter 4. Generalized PCA/MDS via SDP Relaxations 45
1. Introduction of SDP with a Comparison to LP 45
2. Robust PCA 47
3. Probabilistic Exact Recovery Conditions for RPCA 50
4. Sparse PCA 51
5. MDS with Uncertainty 53
6. Exact Reconstruction and Universal Rigidity 56
7. Maximal Variance Unfolding 58
This book is used in a course instructed by Yuan Yao at Peking University, part
of which is based on a similar course led by Amit Singer at Princeton University.
If knowledge comes from the impressions made upon us by natural
objects, it is impossible to procure knowledge without the use of
objects which impress the mind. –John Dewey
1
CHAPTER 1
1. Classical MDS
Multidimensional Scaling (MDS) roots in psychology [YH41] which aims to
recover Euclidean coordinates given pairwise distance metrics or dissimilarities.
It is equivalent to PCA when pairwise distances are Euclidean. In the core of
theoretical foundation of MDS lies the notion of positive definite functions [Sch37,
Sch38a, Sch38b] (or see the survey [Bav11]) which has been the foundation
of the kernel method in statistics [Wah90] and modern machine learning society
(http://www.kernel-machines.org/).
In this section we study classical MDS, or metric Multidimensional scaling
problem. The problem of classical MDS or isometric Euclidean embedding: given
pairwise distances between data points, can we find a system of Euclidean coordi-
nates for those points whose pairwise distances meet given constraints?
Consider a forward problem: given a set of points x1 , x2 , ..., xn ∈ Rp , let
X = [x1 , x2 , ..., xn ]p×n .
The distance between point xi and xj satisfies
2 T
d2ij = kxi − xj k = (xi − xj ) (xi − xj ) = xi T xi + xj T xj − 2xi T xj .
Now we are considering the inverse problem: given dij , find a {xi } satisfying the
relations above. Clearly the solutions are not unique as any Euclidean transform
on {xi } gives another solution. General ideas of classic (metric) MDS is:
(1) transform squared distance matrix D = [d2ij ] to an inner product form;
(2) compute the eigen-decomposition for this inner product form.
Below we shall see how to do this given D.
Let K be the inner product matrix
K = X T X,
with k = diag(Kii ) ∈ Rn . So
D = (d2ij ) = k · 1T + 1 · k T − 2K.
where 1 = (1, 1, ..., 1)T ∈ Rn .
Define the mean and the centered data
n
1X 1
µbn = xi = · X · 1,
n i=1 n
1
ei = xi − µ
x bn = xi − · X · 1,
n
3
4 1. MULTIDIMENSIONAL SCALING AND PRINCIPAL COMPONENT ANALYSIS
or
e = X − 1 X · 1 · 1T .
X
n
Thus,
K̃ , X̃ T X̃
T
1 1
= (X − X · 1 · 1T ) (X − X · 1 · 1T )
n n
1 1 1
= K − K · 1 · 1T − 1 · 1T · K + 2 · 1 · 1T · K · 1 · 1T .
n n n
Let
1
B = − H · D · HT
2
1
where H = I − n · 1 · 1T . H is called as a centering matrix.
So
1
B = − H · (k · 1T + 1 · k T − 2K) · H T
2
T
Since k · 1T · H T = k · 1(I − n1 · 1 · 1T ) = k · 1 − k( 1 n·1 ) · 1 = 0, we have
H · k 1 · H T = H · 1 · k T · H T = 0.
Therefore,
1 1
B = H · K · H T = (I − · 1 · 1T ) · K · (I − · 1 · 1T )
n n
1 1 1
= K − · 1 · 1 · K − · K · 1 · 1 + 2 · 1(1T · K1) · 1T
T
n n n
= K̃.
That is,
1
B = − H · D · H T = X̃ T X̃.
2
Note that often we define the covariance matrix
n
bn , 1 1 e eT
X
Σ (xi − µ bn )T =
bn )(xi − µ XX .
n − 1 i=1 n−1
Above we have shown that given a squared distance matrix D = (d2ij ), we can
1
convert it to an inner product matrix by B = − HDH T . Eigen-decomposition
2
applied to B will give rise the Euclidean coordinates centered at the origin.
In practice, one often chooses top k nonzero eigenvectors of B for a k-dimensional
Euclidean embedding of data.
Hence X ek gives k-dimensional Euclidean coordinations for the n points.
In Matlab, the command for computing MDS is ”cmdscale”, short for Clas-
sical Multidimensional Scaling. For non-metric MDS, you may choose ”mdscale”.
Figure 1 shows an example of MDS.
(a)
(b) (c)
(1) A + B 0;
(2) A ◦ B 0;
where A ◦ B is called Hadamard product and (A ◦ B)i,j := Ai,j × Bi,j .
Definition (Conditionally Negative Definite). Let An×n be a real symmetric ma-
trix.
Pn A is c.n.d.(conditionally negative definite) ⇐⇒ ∀v ∈ Rn , such that 1T v =
T
i=1 vi = 0, there holds v Av ≤ 0
1 1
Bi,i (α) + Bj,j (α) − 2Bi,j (α) = − Cii − Cjj + Cij = Cij ,
2 2
where the last step is due to Ci,i = 0.
(3) According to Lemma 2.1 and the first part of Theorem 2.2: C c.n.d.
⇐⇒ B p.s.d ⇐⇒ D c.n.d.
2. THEORY OF MDS (YOUNG/HOUSEHÖLDER/SCHOENBERG’1938) 7
(4) According to Lemma 2.1 and the second part of Theorem 2.2:
T
C c.n.d. ⇐⇒ B p.s.d
P P ⇐⇒ ∃Y 2s.t. Bα = Y Y ⇐⇒ Bi,j (α) =
k Yi,k Yj,k ⇒ Ci,j = k (Yi,k − Yj,k )
This completes the proof.
min kY T Y − Bk2F
Y ∈Rk×n
then the row vectors of matrix Y are the eigenvectors corresponding to k largest
eigenvalues of B = X e T X,
e or equivalently the top k right singular vectors of X
e =
T
U SV .
We have seen in the first section that the covariance matrix of data Σ bn =
1 e eT 1 2 T
n−1 X X = n U S U , passing through the singular vector decomposition (SVD)
of Xe = U SV T . Taking top k left singular vectors as the embedding coordinates
is often called Principal Component Analysis (PCA). In PCA, given (centralized)
Euclidean coordinate X, e ususally one gets the inner product matrix as covariance
1 e
matrix Σn = n−1 X · X T which is a p × p positive semi-definite matrix, then the top
b e
k eigenvectors of Σb n give rise to a k-dimensional embedding of data, as principal
components. So both MDS and PCA are unified in SVD of centralized data matrix.
The following introduces PCA from another point of view as best k-dimensional
affine space approximation of data.
∂I
= (xi − µ − U βi )T U = 0 ⇒ βi = U T (Xi − µ)
∂βi
Plug in the expression of µ̂n and βi
n
X
I = kXi − µ̂n − U U T (Xi − µ̂n )k2
i=1
n
X
= kXi − µ̂n − Pk (Xi − µ̂n )k2
i=1
n
X
= kYi − Pk (yi )k2 , Yi := Xi − µ̂n
i=1
In fact when k = 1, the maximal covariance is given by the largest eigenvalue along
the direction of its associated eigenvector,
max uT Σ̂n u =: λ̂1 .
kuk=1
and so on.
Here we conclude that the k-affine space can be discovered by eigenvector de-
composition of Σ̂n . The sample principal components are defined as column vectors
of Q̂ = Û T Y , where the j-th observation has its projection on the k-th component
as q̂k (j) = ûTk yj = ûTk (xi − µ̂n ). Therefore, PCA takes the eigenvector decompo-
sition of Σ̂n = Û Λ̂Û T and studies the projection of centered data points on top k
eigenvectors as the principle components. This is equivalent to the singular value
decomposition (SVD) of X = [x1 , . . . , xn ]T ∈ Rn×p in the following sense,
1 T
Y =X− 11 X = Ũ S̃ Ṽ T , 1 = (1, . . . , 1)T ∈ Rn
n
where top right singular vectors of centered data matrix Y gives the same principle
components. From linear algebra, k-principal components thus gives the best rank-
k approximation of centered data matrix Y .
Given a PCA, the following quantities are often used to measure the variances
• total variance:
Xp
trace(Σ̂n ) = λ̂i ;
i=1
• percentage of variance explained by top-k principal components:
k
X
λ̂i /trace(Σ̂n );
i=1
Example. Take the dataset of hand written digit “3”, X̂ ∈ R658×256 contains
658 images, each of which is of 16-by-16 grayscale image as hand written digit 3.
Figure 2 shows a random selection of 9 images, the sorted singular values divided
by total sum of singular values, and an approximation of x1 by top 3 principle
components: x1 = µ̂n − 2.5184ṽ1 − 0.6385ṽ2 + 2.0223ṽ3 .
(a) (b)
We have seen that both MDS and PCA can be obtained from such a SVD of
centered data matrix.
1/2
• MDS embedding is given by top k left singular vectors YkM DS = Uek S
fk ∈
Rn×k ;
1/2
• PCA embedding is given by top k right singular vectors YkP CA = Vek S
fk ∈
Rn×k .
Altogether U fk Ve T gives best rank-k approximation of X
ek S e in any unitary invariant
k
norms.
CHAPTER 2
1. Introduction
For this class, we introduce Random Projection method which may reduce the
dimensionality of n points in Rp to k = O(c() log n) at the cost of a uniform met-
ric distortion of at most > 0, with high probability. The theoretical basis of this
method was given as a lemma by Johnson and Lindenstrauss [JL84] in the study
of a Lipschitz extension problem. The result has a widespread application in math-
ematics and computer science. The main application of Johnson-Lindenstrauss
Lemma in computer science is high dimensional data compression via random pro-
jections [Ach03]. In 2001, Sanjoy Dasgupta and Anupam Gupta [DG03a], gave
a simple proof of this theorem using elementary probabilistic techniques in a four-
page paper. Below we are going to present a brief proof of Johnson-Lindenstrauss
Lemma based on the work of Sanjoy Dasgupta, Anupam Gupta [DG03a], and
Dimitris Achlioptas [Ach03].
Recall the problem of MDS: given a set of points xi ∈ Rp (i = 1, 2, · · · , n);
form a data Matrix X p×n = [X1 , X2 · · · Xn ]T , when p is large, especially in some
cases larger than n, we want to find k-dimensional projection with which pairwise
distances of the data point are preserved as well as possible. That is to say, if we
know the original pairwise distance dij = kXi − Xj k or data distances with some
disturbance d˜ij = kXi − Xj k + ij , we want to find Yi ∈ Rk s.t.:
X
(4) min (kYi − Yj k2 − d2ij )2
i,j
then the row vectors of matrix Y are the eigenvectors (singular vectors) correspond-
ing to k largest eigenvalues (singular values) of B.
The main features of MDS are the following.
• MDS looks for Euclidean embedding of data whose total or average metric
distortion are minimized.
13
14 2. RANDOM PROJECTIONS AND ALMOST ISOMETRY
p 1
p = 1/6
• R = A/ k/3 Aij = 0 p = 2/3
−1 p = 1/6
The proof below actually takes the first form of R as an illustration.
Now we are going to prove Johnson-Lindenstrauss Lemma using a random
projection to k-subspace in Rd . Notice that the distributions of the following two
events are identical:
k
Prob[L ≤ (1 − )µ] ≤ exp( (1 − (1 − ) + ln(1 − )))
2
k 2
≤ exp( ( − ( + ))),
2 2
by ln(1 − x) ≤ −x − x2 /2 for 0 ≤ x < 1
k2
= exp(− )
4
≤ exp(−(2 + α) ln n), for k ≥ 4(1 + α/2)(2 /2)−1 ln n
1
= 2+α
n
16 2. RANDOM PROJECTIONS AND ALMOST ISOMETRY
k
Prob[L ≥ (1 + )µ] ≤ exp( (1 − (1 + ) + ln(1 + )))
2
k 2 3
≤ exp( (− + ( − + ))),
2 2 3
by ln(1 + x) ≤ x − x2 /2 + x3 /3 for x ≥ 0
k
= exp(− (2 /2 − 3 /3)),
2
≤ exp(−(2 + α) ln n), for k ≥ 4(1 + α/2)(2 /2 − 3 /3)−1 ln n
1
= 2+α
n
r r
d 0 d
Now set the map f (x) = x = (x1 , . . . , xk , 0, . . . , 0). By the above
k k
calculations, for some fixed pair i, j, the probability that the distortion
kf (vi ) − f (vj )k2
kvi − vj k2
2
does not lie in the range [(1 − ), (1 + )] is at most n(2+α) . Using the trivial union
2
bound with Cn pairs, the chance that some pair of points suffers a large distortion
is at most:
2 1 1 1
Cn2 (2+α) = α 1 − ≤ α.
n n n n
1
Hence f has the desired properties with probability at least 1 − α . This gives us
n
a randomized polynomial time algorithm.
Now we will refer to last expression as g(t). The last line of derivation gives
us the additional constraints that tβµ ≤ 1/2 and t(βµ − 1) ≤ 1/2, and so we have
0 < t < 1/(2βµ). Now to minimize g(t), which is equivalent to maximize
h(t) = 1/g(t) = (1 − 2t(βµ − 1))k/2 (1 − 2tβµ)(d−k)/2
in the interval 0 < t < 1/(2βµ). Setting the derivative h0 (t) = 0, we get the
maximum is achieved at
1−β
t0 =
2β(d − βk)
Hence we have
d − k (d−k)/2
h(t0 ) = ( ) (1/β)k/2
d − kβ
And this is exactly what we need.
The proof of Lemma 3.6 (b) is almost exactly the same as that of Lemma 3.6
(a).
2.1. Conclusion. As we can see, this proof of Lemma is both simple (using
just some elementary probabilistic techniques) and elegant. And you may find
in the field of machine learning, stochastic method always turns out to be really
powerful. The random projection method we approaching today can be used in
many fields especially huge dimensions of data is concerned. For one example, in
the term document, you may find it really useful for compared with the number
of words in the dictionary, the words included in a document is typically sparse
(with a few thousands of words) while the dictionary is hugh. Random projections
often provide us a useful tool to compress such data without losing much pairwise
distance information.
assumption is that the signal x∗ is sparse, namely the number of nonzero compo-
nents kx∗ k0 := #{x∗i 6= 0 : 1 ≤ i ≤ p} is small compared to the total dimensionality
p. Figure 2 gives an illustration of such sparse linear equation problem.
With such a sparse assumption, we would like to find the sparsest solution
satisfying the measurement equation.
(8) (P0 ) min kxk0
s.t. Φx = b.
This is an NP-hard combinatorial optimization problem. A convex relaxation of
(8) is called Basis Pursuit [CDS98],
X
(9) (P1 ) min kxk1 := |xi |
s.t. Φx = b.
This is a linear programming problem. Figure 3 shows different projections of a
sparse vector x∗ under l0 , l1 and l2 , from which one can see in some cases the
convex relaxation (9) does recover the sparse signal solution in (8). Now a natural
problem arises, under what conditions the linear programming problem (P1 ) has
the solution exactly solves (P0 ), i.e. exactly recovers the sparse signal x∗ ?
To understand the equivalence between (P0 ) and (P1 ), one asks the question
when the true signal x∗ is the unique solution of P0 and P1 . In such cases, P1 is
20 2. RANDOM PROJECTIONS AND ALMOST ISOMETRY
equivalent to P0 . For the uniqueness of P1 , one turns to the duality of Linear Pro-
gramming via the Karush-Kuhn-Tucker (KKT) conditions. Take the Lagrangian of
(P1 ),
L(x; λ) = kxk1 + λT (Φx − b), λ ∈ Rn .
Assume the support of x∗ as T ⊆ {1, . . . , p}, i.e. T = {1 ≤ i ≤ p : xi 6= 0}, and
denote its complement by T c . x∗ is an optimal solution of P1 if
0 ∈ ∂L(x∗ , λ)
which implies that sign(x∗T ) = ΦTT λ and |ΦTT c λ| ≤ 1. How to ensure that there are
no other solutions than x∗ ? The following condition is used in [CT05] and other
related works.
Lemma 4.1. Assume that ΦT is of full rank. If there exists λ ∈ Rn such that:
(1) For each i ∈ T ,
(10) ΦTi λ = sign(x∗i );
(2) For each i ∈ T c ,
(11) |ΦTi λ| < 1.
Then P1 has a unique solution x∗ .
These two conditions just ensure a special dual variable λ exists, under which
any optimal solution of P1 must have the same support T as x∗ (strictly comple-
mentary condition in (2)). Since ΦT is of full rank, then P1 must have a unique
solution x∗ . In this case solving P1 is equivalent to P0 . If these conditions fail,
then there exists a problem instance (Φ, b) such that P1 has a solution different to
x∗ . In this sense, these conditions are necessary and sufficient for the equivalence
between P1 and P0 .
Various sufficient conditions have been proposed in literature to meet the KKT
conditions above. For example, these includes the mutual incoherence by Donoho-
Huo (1999) [DH01], Elad-Bruckstein (2001) [EB01] and the Exact Recovery Con-
dition by Tropp [Tro04] or Irrepresentative condition (IRR) by Zhao-Yu [ZY06]
(see also [MY09]). The former condition essentially requires Φ to be a nearly
orthogonal matrix,
µ(Φ) = max |φTi φj |,
i6=j
where Φ = [φ1 , . . . , φp ] and kφi k2 = 1, under which [DH01] shows that as long as
sparsity of x∗ satisfies
1
1 + µ(Φ)
kx∗ k0 = |T | <
2
which is later improved by [EB01] to be
√
2 − 12
kx∗ k0 = |T | < ,
µ(Φ)
then P1 recovers x∗ . The latter assumes that the dual variable λ lies in the column
space of AT , i.e. λ = ΦT α. Then we solve λ explicitly in equation (10) and plugs
in the solution to the inequality (11)
kΦTT c ΦT (ΦTT ΦT )−1 sign(x∗ |T )k∞ < 1
or simply
kΦTT c ΦT (ΦTT ΦT )−1 k∞ < 1.
4. RANDOM PROJECTIONS AND COMPRESSED SENSING 21
If for every k-sparse signal x∗ with support T , conditions above are satisfied, then
P1 recovers x∗ .
The most popular condition is proposed by [CRT06], called Restricted Isom-
etry Property (RIP).
Definition. Define the isometry constant δk of a matrix Φ to be the smallest
nonnegative number such that
(1 − δk )kxk22 ≤ kΦxk22 ≤ (1 + δk )kxk22
holds for all k-sparse vectors x ∈ Rp . A vector x is called k-sparse if it has at most
k nonzero elements.
[AC09] shows that incoherence conditions implies RIP, whence RIP is a weaker
condition. Under RIP condition, uniqueness of P0 and P1 can be guaranteed for all
k-sparse signals, often called uniform exact recovery[Can08].
Theorem 4.2. The following holds for all k-sparse x∗ satisfying Φx∗ = b.
1, then problem P0 has a unique solution x∗ ;
(1) If δ2k < √
(2) If δ2k < 2 − 1, then the solution of P1 (9) has a unique solution x∗ , i.e.
recovers the original sparse signal x∗ .
The first condition is nothing but every 2k-columns of Φ are linearly dependent.
To see the first condition, assume by contradiction that there is another k-sparse
solution of P0 , x0 . Then by Φy = 0 and y = x∗ − x0 is 2k-sparse. If y 6= 0, it violates
δ2k < 1 such that 0 = kΦyk ≥ (1 − δ2k )kyk > 0. Hence one must have y = 0, i.e.
x∗ = x0 which proves the uniqueness of P0 . The first condition is also necessary for
the uniqueness of P0 ’s solutions. In fact, if δ2k = 1, this implies that there is a 2k-
subset 2T such that columns of Φ2T are linearly dependent, i.e. Φ2T z = 0 for some
2k-vector z. One can define x1 to collect first k nonzero elements of z with zero
otherwise, and x2 to collect the second half nonzero entries of z but zero otherwise.
Hence Φ2T (x1 + x2 ) = 0 ⇒ ΦT1 x1 = 0 = ΦT2 x2 with T1 and T2 consisting the
first and second k columns of Φ2T respectively, which violates the uniqueness of P0
solutions. The proof of the second condition can be found in [Can08].
When measurement noise exists, e.g. b = Φx+e with bound kek2 , the following
Basis Pursuit De-Noising (BPDN) [CDS98] or LASSO [Tib96] are used instead
For bounded kek∞ , the following formulation is used in network analysis [JYLG12]
(14) min kxk1
s.t. kΦx − bk∞ ≤
RIP conditions also lead to upper bounds between solutions above and the
true sparse signal x∗ . For example, in the case of BPDN the follwoing result holds
[Can08].
22 2. RANDOM PROJECTIONS AND ALMOST ISOMETRY
√
Theorem 4.3. Suppose that kek2 ≤ . If δ2k < 2 − 1, then
∗
kx̂ − x k2 ≤ C1 k −1/2 σk1 (x∗ ) + C2 ,
where x̂ is the solution of BPDN and
σk1 (x∗ ) = min kx∗ − yk1
supp(y)≤k
p
With this lemma, note that there are at most k subspaces of k-sparse, an
union bound leads to the following result for RIP.
Theorem 4.5. Let Φ ∈ Rn×p be a random matrix satisfying the concentration
inequality (15) and δ ∈ (0, 1). There exists c1 , c2 > 0 such that if
n
k ≤ c1
log(p/k)
the following RIP holds
(1 − δk )kxk22 ≤ kΦxk22 ≤ (1 + δk )kxk22
with probability at least 1 − 2e−c2 n .
Proof. For each of k-sparse signal (XT ), RIP fails with probability at most
2
12
2 e−c0 (δ/2)n .
δ
There are kp ≤ (ep/k)k such subspaces. Hence, RIP fails with probability at most
ep k 12 2
2 e−c0 (δ/2)n = 2e−c0 (δ/2)n+k[log(ep/k)+log(12/δ)] .
k δ
Thus for a fixed c1 > 0, whenever k ≤ c1 n/ log(p/k), the exponent above will
be ≤ −c2 n provided that c2 ≤ c0 (δ/2) − c1 (1 + (1 + log(12/δ))/ log(p/k). c2 can be
always chosen to be > 0 if c1 > 0 is small enough. This leads to the results.
Another use of random projections (random matrices) can be found in Robust
Principal Component Analysis (RPCA) in the next chapter.
CHAPTER 3
In this very first lecture, we talk about data representation as vectors, matrices
(esp. graphs, networks), and tensors, etc. Data are mappings of real world based
on sensory measurements, whence the real world puts constraints on the variations
of data. Data science is the study of laws in real world which shapes the data.
We start the first topic on sample mean and variance in high dimensional
Euclidean spaces Rp , as the maximal likelihood estimators based on multivariate
Gaussian assumption. Principle Component Analysis (PCA) is the projection of
high dimensional data on its top singular vectors. In classical statistics with the
Law of Large Numbers, for fixed p when sample size n → ∞, we know such sample
mean and variance will converge, so as to PCA. Although sample mean µ̂n and
sample covariance Σ̂n are the most commonly used statistics in multivariate data
analysis, they may suffer some problems in high dimensional settings, e.g. for large
p and small n scenario. In 1956, Stein [Ste56] shows that the sample mean is
not the best estimator in terms of the mean square error, for p > 2; moreover
in 2006, Jonestone [Joh06] shows by random matrix theory that PCA might be
overwhelmed by random noise for fixed ratio p/n when n → ∞. Among other
works, these two pieces of excellent works inspired a long pursuit toward modern
high dimensional statistics with a large unexplored field ahead.
n n
1 X 1X
− trace (Xi − µ)T Σ−1 (Xi − µ) = − trace[Σ−1 (Xi − µ)(Xi − µ)T ]
2 i=1
2 i=1
1
= − (traceΣ−1 Σ̂n )(n − 1)
2
n−1 1 1
= − trace(Σ−1 Σ̂n2 Σ̂n2 )
2
n−1 1 1
= − trace(Σ̂n2 Σ−1 Σ̂n2 )
2
n−1
= − trace(S)
2
2. BIAS-VARIANCE DECOMPOSITION OF MEAN SQUARE ERROR 27
where
n
1 X
Σ̂n = (Xi − µ̂n )(Xi − µ̂n )T ,
n − 1 i=1
1 1
S = Σ̂n2 Σ−1 Σ̂n2 is symmetric and positive definite. Above we repeatedly use cyclic
property of trace:
• trace(AB) = trace(BA), or more generally
• (invariance under cyclic permutation group) trace(ABCD) = trace(BCDA) =
trace(CDAB) = trace(DABC).
Then we have
−1 −1
Σ = Σ̂n 2 S −1 Σ̂n 2
n n n
− log |Σ| = log |S| + log |Σ̂n | = f (Σ̂n )
2 2 2
Therefore,
n−1 n
max I(Σ) ⇔ min trace(S) − log |S| + Const(Σ̂n , 1)
2 2
Suppose S = U ΛU is the eigenvalue decomposition of S, Λ = diag(λi )
p p
n−1X nX
J= λi − log(λi ) + Const
2 i=1 2 i=1
∂J n−1 n 1 n
= − ⇒ λi =
∂λi 2 2 λi n−1
n
S= Ip
n−1
This gives the MLE solution
n
n−1 1X
Σ∗ = Σ̂n = (Xi − µ̂n )(Xi − µ̂n )T ,
n n i=1
To measure the performance of an estimator µ̂n , one may look at the following
so-called risk,
R(µ̂n , µ) = EL(µ̂n , µ)
where the loss function takes the square loss here
L(µ̂n , µ) = kµ̂n − µk2 .
The mean square error (MSE) to measure the risk enjoys the following bias-
variance decomposition, from the Pythagorean theorem.
R(µ̂n , µ) = Ekµ̂n − E[µ̂n ] + E[µ̂n ] − µk2
= Ekµ̂n − E[µ̂n ]k2 + kE[µ̂n ] − µk2
=: V ar(µ̂n ) + Bias(µ̂n )2
Example 1. For the simple case Yi ∼ N (µ, σ 2 Ip ) (i = 1, . . . , n), the MLE estimator
satisfies
Bias(µ̂M
n
LE
)=0
and
p 2
V ar(µ̂M
n
LE
)= σ
n
In particular for n = 1, V ar(µ̂M LE ) = σ 2 p for µ̂M LE = Y .
Example 2. MSE of Linear Estimators. Consider Y ∼ N (µ, σ 2 Ip ) and linear
estimator µ̂C = CY . Then we have
Bias(µ̂C ) = k(I − C)µk2
and
V ar(µ̂C ) = E[(CY −Cµ)T (CY −Cµ)] = E[trace((Y −µ)T C T C(Y −µ))] = σ 2 trace(C T C).
In applications, one often consider the diagonal linear estimators C = diag(ci ), e.g.
in Ridge regression
1 λ
min kY − Xβk2 + kβk2 .
µ 2 2
For diagonal linear estimators, the risk
p
X p
X
R(µ̂C , µ) = σ 2 c2i + (1 − ci )2 µ2i .
i=1 i=1
In this case, it is simple to find minimax risk over the hyper-rectangular model class
|µi | ≤ τi ,
p
X σ 2 τi2
inf sup R(µ̂C , µ) = .
ci |µ |≤τ
i i i=1
σ 2 + τi2
From here one can see that for those sparse model classes such that #{i : τi =
O(σ)} = k p, it is possible to get smaller risk using linear estimators than MLE.
In general, is it possible to introduce some biased estimators which significantly
reduces the variance such that the total risk is smaller than MLE uniformly for all
µ? This is the notion of inadmissibility introduced by Charles Stein in 1956 and he
find the answer is YES by presenting the James-Stein estimators, as the shrinkage
of sample means.
3. STEIN’S PHENOMENON AND SHRINKAGE OF SAMPLE MEAN 29
Theorem 3.1. Suppose Y ∼ Np (µ, I). Then µ̂MLE = Y . R(µ̂, µ) = Eµ kµ̂ − µk2 ,
and define
p−2
JS
µ̂ = 1 − Y
kY k2
then
R(µ̂JS , µ) < R(µ̂MLE , µ)
We’ll prove a useful lemma first.
3.1. Stein’s Unbiased Risk Estimates (SURE). Discussions below are all
under the assumption that Y ∼ Np (µ, I).
Lemma 3.2. (Stein’s Unbiased Risk Estimates (SURE)) Suppose µ̂ = Y + g(Y ),
g satisfies 1
(1) gPis weakly differentiable.
p R
(2) i=1 |∂i gi (x)|dx < ∞
then
(18) R(µ̂, µ) = Eµ (p + 2∇T g(Y ) + kg(Y )k2 )
Pp ∂
where ∇T g(Y ) := i=1 ∂yi gi (Y ).
p Z
X ∞
Eµ (Y − µ)T g(Y ) = (yi − µi )gi (Y )φ(Y − µ)dY
i=1 −∞
p Z ∞
X ∂
= −gi (Y ) φ(Y − µ)dY, derivative of Gaussian function
i=1 −∞
∂yi
p Z ∞
X ∂
= gi (Y )φ(Y − µ)dY, Integration by parts
i=1 −∞ ∂yi
= Eµ ∇ g(Y )T
Thus, we define
(19) U (Y ) := p + 2∇T g(Y ) + kg(Y )k2
for convenience, and R(µ̂, µ) = Eµ U (Y ).
This lemma is in fact called the Stein’s lemma in Tsybakov’s book [Tsy09]
(page 157∼158).
kµk2
d
χ2 (kµk2 , p) = χ2 (0, p + 2N ), N ∼ Poisson
2
we have
1 1
Eµ = EEµ N
kY k2 kY k2
1
= E
p + 2N − 2
1
≥ (Jensen’s Inequality)
p + 2EN − 2
1
=
p + kµk2 − 2
that is
2This is a homework.
32 3. HIGH DIMENSIONAL STATISTICS: MEAN AND COVARIANCE IN NOISE
JS
4
MLE
2
0
0 2 4 6 8 10
||u||
i=1 i=1
p
X p
≤ 1 + (2 log p + 1) µ2i ∧ 1 if we take λ = 2 log p
i=1
with
p
X ∂g
W (y) = −2pε2 g(y) + 2ε2 yi (y) + kyk2 g(y)2 .
i=1
∂yi
The risk of µ̃n is smaller than that of µ̂n if we choose g such that
E[W (y)] < 0.
In order to satisfy this inequality, we can search for g among the functions of
the form
b
g(y) =
a + kyk2
with an appropriately chosen constants a ≥ 0, b > 0. Therefore, W (y) can be
written as
p
b X 2byi2 b2 kyk2
W (y) = −2pε2 + 2ε 2
+
a + kyk 2
i=1
(a + kyk )2 2 (a + kyk2 )2
4bε2 kyk2 b2 kyk2
1 2
= −2pbε + +
a + kyk2 a + kyk2 (a + kyk2 )2
1
≤ (−2pbε2 + 4bε2 + b2 ) kyk2 ≤ a + kyk2 for a ≥ 0
a + kyk2
Q(b)
= , Q(b) = b2 − 2pbε2 + 4bε2 .
a + kyk2
The minimizer in b of quadratic function Q(b) is equal to
bopt = ε2 (p − 2),
where the minimum of W (y) satisfies
b2opt ε4 (p − 2)2
Wmin (y) ≤ − = − < 0.
a + kyk2 a + kyk2
Note that when b ∈ (b1 , b2 ), i.e. between the two roots of Q(b)
b1 = 0, b2 = 2ε2 (p − 2)
we have W (y) < 0, which may lead to other estimators having smaller mean square
errors than MLE estimator.
When a = 0, the function g and the estimator µ̃n = (1 − g(y))y associated to
this choice of g are given by
ε2 (p − 2)
g(y) = ,
kyk2
and
ε2 (p − 2)
µ̃n = 1− y =: µ̃JS ,
kyk2
respectively. µ̃JS is called James-Stein estimator. If dimension p ≥ 3 and the
norm kyk2 is sufficiently large, multiplication of y by g(y) shrinks the value of y to
0. This is called the Stein shrinkage. If b = bopt , then
ε4 (p − 2)2
Wmin (y) = − .
kyk2
3. STEIN’S PHENOMENON AND SHRINKAGE OF SAMPLE MEAN 35
gain can be achieved in many cases. Researchers struggle to show real application
examples where one can benefit greatly from Stein’s estimators. For example, Efron-
Morris (1974) showed three examples that JS-estimator significantly improves the
multivariate estimation. On other other hand, deeper understanding on Shrinkage-
type estimators has been pursued from various aspects in statistics.
The situation changes dramatically when LASSO-type estimators by Tibshi-
rani, also called Basis Pursuit by Donoho et al. are studied around 1996. This
brings sparsity and L1-regularization into the central theme of high dimensional
statistics and leads to a new type of shrinkage estimator, thresholding. For exam-
ple,
1
min I = min kµ̃ − µk2 + λkµ̃k1
µ̃ µ̃ 2
Subgradients of I over µ̃ leads to
0 ∈ ∂µ̃j I = (µ̃j − µj ) + λsign(µ̃j ) ⇒ µ̃j = sign(µj )(|µj | − λ)+
where the set-valued map sign(x) = 1 if x > 0, sign(x) = −1 if x < 0, and
sign(x) = [−1, 1] if x = 0, is the subgradient of absolute function |x|. Under this
new framework shrinkage estimators achieves a new peak with an ubiquitous spread
in data analysis with high dimensionality.
(20) b n = 1 XX 0 → Ip .
Σ
n
Such a random matrix Σ b n is called Wishart matrix.
p
But when n → γ 6= 0, the distribution of the eigenvalues of Σ
b n follows [BS10]
(Chapter 3), if γ ≤ 1,
(
0 t∈
/ [a, b]
(21) µMP
(t) = √(b−t)(t−a)
2πγt dt t ∈ [a, b]
and has an additional point mass 1 − 1/γ at the origin if γ > 1. Note that a =
√ √
(1 − γ)2 , b = (1 + γ)2 . Figure 1 illustrates the MP-distribution by MATLAB
simulations whose codes can be found below.
4. RANDOM MATRIX THEORY AND PHASE TRANSITIONS IN PCA 37
(a) (b)
%Wishart matrix
% S = 1/n*X*X.’, X is p-by-n, X ij i.i.d N(0,1),
% ESD S converge to M.P. with parameter y = p/n
y = 2;
a = (1-sqrt(y))^2;
b = (1+sqrt(y))^2;
X = randn(p,n);
S = 1/n*(X*X.’);
evals = sort( eig(S), ’descend’);
nbin = 100;
[nout, xout] = hist(evals, nbin);
hx = xout(2) - xout(1); % step size, used to compute frequency below
x1 = evals(end) -1;
x2 = evals(1) + 1; % two end points
xx = x1+hx/2: hx: x2;
fre = f MP(xx)*hx;
figure,
h = bar(xout, nout/p);
set(h, ’BarWidth’, 1, ’FaceColor’, ’w’, ’EdgeColor’, ’b’);
hold on;
plot(xx, fre, ’--r’);
38 3. HIGH DIMENSIONAL STATISTICS: MEAN AND COVARIANCE IN NOISE
which implies that if signal energy is small, top eigenvalue of sample covariance
matrix never pops up from random matrix ones; only if the signal energy is beyond
√
the phase transition threshold γ, top eigenvalue can be separated from random
matrix eigenvalues. However, even in the latter case it is a biased estimation.
Moreover, the primary eigenvector associated with the largest eigenvalue (prin-
cipal component) converges to
2 √
0 σX ≤ γ
|hu, vmax i|2 → 1− σX
γ
(23) 4
2 √
1+ γ , σX > γ
σ2
X
which means the same phase transition phenomenon: if signal is of low energy,
PCA will tell us nothing about the true signal and the estimated top eigenvector is
orthogonal to the true direction u; if the signal is of high energy, PCA will return a
biased estimation which lies in a cone whose angle with the true signal is no more
than
1 − σγ4
X
.
1 + σγ2
X
Pn
Then Sn = n1 i=1 Zi ZiT is a Wishart random matrix whose eigenvalues follow the
MP distribution.
1 1
Notice that Σ̂n = Σ 2 Sn Σ 2 and (λ, v̂) is eigenvalue-eigenvector pair of matrix
Σ̂n . Therefore
1 1 1 1
(30) Σ 2 Sn Σ 2 v̂ = λv̂ ⇒ Sn Σ(Σ− 2 v̂) = λ(Σ− 2 v̂)
1
In other words, λ and Σ− 2 v̂ are the eigenvalue and eigenvector of matrix Sn Σ.
1
Suppose cΣ− 2 v̂ = v where the constant c makes v a unit eigenvector and thus
satisfies,
(31) c2 = cv̂ T v̂ = v T Σv = v T (σx2 uuT + σε2 )v = σx2 (uT v)2 + σε2 ) = R(uT v)2 + 1.
With the aid of Stieltjes transform, we can calculate the largest eigenvalue of
matrix Σ̂n and the properties of the corresponding eigenvector v̂.
In fact, the eigenvalue λ satisfies
p Z b
2 1X λi 2 t
(32) 1 = σX · ∼ σX · dµM P (t),
p i=1 λ − σε2 λi a λ − σε2 t
and the inner product of u and v satisfies
(33) |uT v|2
Z b
t2
= {σx4 dµM P (t)}−1
a (λ − σ 2 )2
ε
σx4 p λ(2λ − (a + b)) −1
= { (−4λ + (a + b) + 2( (λ − a)(λ − b)) + p )}
4γ (λ − a)(λ − b)
1 − Rγ2
= 2γ
1+γ+ R
σ2
where R = SN R = σx2 = σx2 ,γ = np . We can compute the inner product of u and
p
ε
v̂ which we are really interested in from the above equation:
1 1 1 1 1 1 1 p
|uT v̂|2 = ( uT Σ 2 v)2 = 2 ((Σ 2 u)T v)2 = 2 (((RuuT + Ip ) 2 u)T v)2 = 2 (( (1 + R)u)T v)2
c c c c
γ
(1 + R)(uT v)2 1+R− R − Rγ2 1 − Rγ2
= = γ = γ
R(uT v)2 + 1 1+R+γ+ R 1+ R
Now we are going to present the details.
First of all, from
(34) Sn Σv = λv,
we obtain the following by plugging in the expression of Σ
(35) 2
Sn (σX uu0 + σε2 Ip )v = λv
Rearrange the term with u to one side, we got
(36) (λIp − σε2 Sn )v = σX
2
Sn uu0 v
Assuming that λIp − σε2 Sn is invertable, then multiple its reversion at both sides
of the equality, we get,
(37) 2
v = σX · (λIp − σε2 Sn )−1 · Sn u(u0 v).
4. RANDOM MATRIX THEORY AND PHASE TRANSITIONS IN PCA 41
For convenience, assume without loss of generosity that σε2 = 1, that is the
noise volatility is 1. Now we unveil the story of the ratio γ, do the integration in
equation (13), we got,
Z b p
2 t (b − t)(t − a) σ2 p
(43) 1 = σX · dt = X [2λ − (a + b) − 2 |(λ − a)(b − λ)|]
a λ−t 2πγt 4γ
where the last step can be computed via Stieltjes transform introduced above.
From the definition of T (λ), we have
Z b
t2 0
(44) µM P (t)dt = −T (λ) − λT (λ).
a (λ − t)
2
γ
λ = (1 + SN R)(1 + )σ 2
SN R ε
Here we observe the following phase transitions for primary eigenvalue:
42 3. HIGH DIMENSIONAL STATISTICS: MEAN AND COVARIANCE IN NOISE
σ2
r
2 1 2 p
(45) 1 = σX · [2b − (a + b)] = √X ⇔ σX =
4γ γ n
So, in order to make PCA works, we need to let SN R ≥ np .
p
We know that if PCA works good and noise doesn’t dominate the effect, the inner-
product |u0 v̂| should be close to 1. On the other hand, from RMT we know that if
the top eigenvalue λ is merged in the M. P. distribution, then the top eigenvector
computed is purely random and |u0 v̂| = 0, which means that from v̂ we can know
nothing about the signal u.
4.4.2. Primary Eigenvector. We now study the phase transition of top-eigenvector.
It is convenient to study |u0 v|2 first and then translate back to |u0 v̂|2 . Using
the equation (37),
(46)
1 = |v 0 v| = σX
4
·v 0 uu0 Sn (λIp −σε2 Sn )−2 Sn uu0 v = σX
4
·(|v 0 u|)[u0 Sn (λIp −σε2 Sn )−2 Sn u](|u0 v|)
b
t2
Z
(48) |u0 v|−2 = σX
4
[u0 Sn (λIp − σε2 Sn )−2 Sn u] ∼ σX
4
· dµM P (t)
a (λ − σε2 t)2
and assume that λ > b, from Stieltjes transform introduced later one can compute
the integral as
(49)
Z b
0 −2 4 t2 MP
4
σX p λ(2λ − (a + b))
|u v| = σX · dµ (t) = (−4λ+(a+b)+2 (λ − a)(λ − b)+ p
a (λ − σ 2 t)2
ε 4γ (λ − a)(λ − b)
γ
from which it can be computed that (using λ = (1 + R)(1 + R) obtained above,
2
σX
where R = SN R = σ2 )
γ
1−
|u0 v|2 = R2
2γ .
1+γ+ R
Using the relation √
0 0 1 1/2
1+R 0
u v̂ = u = Σ v (u v)
c c
√
where the second equality uses Σ1/2 u = 1 + Ru, and with the formula for c2
above, we can compute
1+R
(u0 v̂)2 = (u0 v)2
1 + R(u0 v)2
√
in terms of R. Note that this number holds under the condition that R > γ.
4. RANDOM MATRIX THEORY AND PHASE TRANSITIONS IN PCA 43
(50) min cT x
s.t. Ax = b
x≥0
This is the primal linear programming problem.
In SDP, the inner product between vectors cT x in LP will change to Hadamard
inner product (denoted by •) between matrices.
SDP (Semi-definite Programming): for X, C ∈ Rn×n
X
(51) min C • X = cij Xij
i,j
s.t. Ai • X = bi , for i = 1, · · · , m
X0
Linear programming has a dual problem via the Lagrangian. The Lagrangian
of the primal problem is
max min Lx;y,µ = cT x + y T (b − Ax) − µT x
µ≥0,y x
=⇒ max L = −y T b
µ≥0,y
where
A1
A = ...
Am
and
y1
y = ...
ym
1.1. Duality of SDP. Define the feasible set of primalP and dual problems are
Fp = {X 0; Ai • X = bi } and Fd = {(y, S) : S = C − i yi Ai 0}, respectively.
Similar to linear programming, semi-definite programming also has properties of
week and strong duality. The week duality says that the primal value is always an
upper bound of dual value. The strong duality says that the existence of an interior
point ensures the vanishing duality gap between primal value and dual value, as
well as the complementary conditions. In this case, to check the optimality of a
primal variable, it suffices to find a dual variable which meets the complementary
condition with the primal. This is often called the witness method.
For more reference on duality of SDP, see e.g. [Ali95].
Theorem 1.1 (Weak Duality of SDP). If Fp 6= ∅, Fd 6= ∅, We have C • X ≥ bT y,
for ∀X ∈ Fp and ∀(y, S) ∈ Fd .
Theorem 1.2 (Strong Duality SDP). Assume the following hold,
(1) Fp 6= ∅, Fd 6= ∅;
(2) At least one feasible set has an interior.
Then X ∗ is optimal iff
(1) X ∗ ∈ Fp
(2) ∃(y ∗ , S ∗ ) ∈ Fd
s.t. C • X ∗ = bT y ∗ or X ∗ S ∗ = 0 (note: in matrix product)
In other words, the existence of an interior solution implies the complementary
condition of optimal solutions. Under the complementary condition, we have
rank(X ∗ ) + rank(S ∗ ) ≤ n
for every optimal primal X ∗ and dual S ∗ .
2. ROBUST PCA 47
2. Robust PCA
Let X ∈ Rp×n be a data matrix. Classical PCA tries to find
(54) min kX − Lk
s.t. rank(L) ≤ k
where the Pnorm here is matrix-norm P
or Frobenius norm. SVD provides a solution
with L = i≤k σi ui viT where X = i σi ui viT (σ1 ≥ σ2 ≥ . . .). In other words,
classical PCA looks for decomposition
X =L+E
where the error matrix E has small matrix/Frobenius norm. However, it is well-
known that classical PCA is sensitive to outliers which are sparse and lie far from
the major population.
X
rank(L) := #{σi (L) 6= 0} ⇒ kLk∗ = σi (L),
i
where kLk∗ is called the nuclear norm of L, which has a semi-definite representation
1
kLk∗ = min (trace(W1 ) + trace(W2 ))
2
W1 L
s.t. 0.
LT W 2
With these, the relaxed Robust PCA problem can be solved by the following
semi-definite programming (SDP).
1
(56) min (trace(W1 ) + trace(W2 )) + λkSk1
2
s.t. Lij + Sij = Xij , (i, j) ∈ E
W1 L
0
LT W2
The following Matlab codes realized the SDP algorithm above by CVX (http:
//cvxr.com/cvx).
% Construct a random 20-by-20 Gaussian matrix and construct a rank-1
% matrix using its top-1 singular vectors
R = randn(20,20);
[U,S,V] = svds(R,3);
A = U(:,1)*V(:,1)’;
X = A + E;
cvx begin
variable L(20,20);
variable S(20,20);
variable W1(20,20);
variable W2(20,20);
variable Y(40,40) symmetric;
Y == semidefinite(40);
minimize(.5*trace(W1)+0.5*trace(W2)+lambda*sum(sum(abs(S))));
subject to
L + S >= X-1e-5;
50 4. GENERALIZED PCA/MDS VIA SDP RELAXATIONS
L + S <= X + 1e-5;
Y == [W1, L’;L W2];
cvx end
Typically CVX only solves SDP problem of small sizes (say matrices of size less
than 100). Specific matlab tools have been developed to solve large scale RPCA,
which can be found at http://perception.csl.uiuc.edu/matrix-rank/.
4. Sparse PCA
Sparse PCA is firstly proposed by [ZHT06] which tries to locate sparse prin-
cipal components, which also has a SDP relaxation.
Recall that classical PCA is to solve
max xT Σx
s.t. kxk2 = 1
which gives the maximal variation direction of covariance matrix Σ.
Note that xT Σx = trace(Σ(xxT )). Classical PCA can thus be written as
max trace(ΣX)
s.t. trace(X) = 1
X0
52 4. GENERALIZED PCA/MDS VIA SDP RELAXATIONS
The optimal solution gives a rank-1 X along the first principal component. A
recursive application of the algorithm may lead to top k principal components.
That is, one first to find a rank-1 approximation of Σ and extract it from Σ0 = Σ
to get Σ1 = Σ − X, then pursue the rank-1 approximation of Σ1 , and so on.
Now we are looking for sparse principal components, i.e. #{Xij 6= 0} are small.
Using 1-norm convexification, we have the following SDP formulation [dGJL07]
for Sparse PCA
max trace(ΣX) − λkXk1
s.t. trace(X) = 1
X0
The following Matlab codes realized the SDP algorithm above by CVX (http:
//cvxr.com/cvx).
% Construct a 10-by-20 Gaussian random matrix and form a 20-by-20 correlation
% (inner product) matrix R
X0 = randn(10,20);
R = X0’*X0;
d = 20;
e = ones(d,1);
lambda = 0.5;
k = 10;
cvx begin
5. MDS WITH UNCERTAINTY 53
n
X 2
(59) min kyi − yj k2 − dij .
i,j=1
So
kYi − Yj k2 = d2ij ⇔ (ei − ej )(ei − ej )T • X = d2ij
which is linear with respect to X.
Now we relax the constrain X = Y T Y to
X Y T Y ⇐⇒ X − Y T Y 0.
Through Schur Complement Lemma we know
T I Y
X − Y Y 0 ⇐⇒ 0
YT X
We may define a new variable
Ik Y
Z ∈ S k+n , Z =
YT X
which gives the following result.
Lemma 5.1. The quadratic constraint
kyi − yj k2 = d2ij , (i, j) ∈ E
has a semi-definite relaxation:
Z1:k,1:k = I
(0; ei− ej )(0; ei − ej )T • Z = d2ij , (i, j) ∈ E
Ik Y
Z= 0.
YT X
Pn
where • denotes the Hadamard inner product, i.e. A • B := i,j=1 Aij Bij .
Note that the constraint with equalities of d2ij can be replaced by inequalities
such as ≤ d2ij (1 + ) (or ≥ d2ij (1 − )). This is a system of linear matrix (in)-
equalities with positive semidefinite variable Z. Therefore, the problem becomes a
typical semidefinite programming.
Given such a SD relaxation, we can easily generalize classical MDS to the sce-
narios in the introduction. For example, consider the generalized MDS with anchors
which is often called sensor network localization problem in literature [BLT+ 06].
Given anchors ak (k = 1, . . . , s) with known coordinates, find xi such that
• kxi − xj k2 = d2ij where (i, j) ∈ Ex and xi are unknown locations
2
• kak − xj k2 = dckj where (k, j) ∈ Ea and ak are known locations
We can exploit the following SD relaxation:
• (0; ei − ej )(0; ei − ej )T • Z = dij for (i, j) ∈ Ex ,
• (ai ; ej )(ai ; ej )T • Z = dc
ij for (i, j) ∈ Ea ,
both of which are linear with respect to Z.
Recall that every SDP problem has a dual problem (SDD). The SDD associated
with the primal problem above is
X X
(61) min I • V + wij dij + wbij dc
ij
i,j∈Ex i,j∈Ea
s.t.
V 0 X X
S= + wij Aij + w
bij A ij 0
d
0 0
i,j∈Ex i,j∈Ea
5. MDS WITH UNCERTAINTY 55
where
Aij = (0; ei − ej )(0; ei − ej )T
T
A
d ij = (ai ; ej )(ai ; ej ) .
The variables wij is the stress matrix on edge between unknown points i and j and
w
bij is the stress matrix on edge between anchor i and unknown point j. Note that
the dual is always feasible, as V = 0, yij = 0 for all (i, j) ∈ Ex and wij = 0 for all
(i, j) ∈ Ea is a feasible solution.
There are many matlab toolboxes for SDP, e.g. CVX, SEDUMI, and recent
toolboxes SNLSDP (http://www.math.nus.edu.sg/~mattohkc/SNLSDP.html) and
DISCO (http://www.math.nus.edu.sg/~mattohkc/disco.html) by Toh et. al.,
adapted to MDS with uncertainty.
A crucial theoretical question is to ask, when X = Y T Y holds such that SDP
embedding Y gives the same answer as the classical MDS? Before looking for an-
swers to this question, we first present an application example of SDP embedding.
5.2. Protein 3D Structure Reconstruction. Here we show an example of
using SDP to find 3-D coordinates of a protein molecule based on noisy pairwise
distances for atoms in -neighbors. We use matlab package SNLSDP by Kim-
Chuan Toh, Pratik Biswas, and Yinyu Ye, downladable at http://www.math.nus.
edu.sg/~mattohkc/SNLSDP.html.
nf = 0.1, λ = 1.0e+00
10
−5
−10
10
10
5 5
0 0
−5 −5
−10 −10
Refinement: RMSD = 5.33e−01
(a) (b)
number of anchors = 0
number of sensors = 166
box scale = 20.00
radius = 5.00
multiplicative noise, noise factor = 1.00e-01
56 4. GENERALIZED PCA/MDS VIA SDP RELAXATIONS
-------------------------------------------------------
estimate sensor positions by SDP
-------------------------------------------------------
num of constraints = 2552,
Please wait:
solving SDP by the SDPT3 software package
sdpobj = -3.341e+03, time = 34.2s
RMSD = 7.19e-01
-------------------------------------------------------
refine positions by steepest descent
-------------------------------------------------------
objstart = 4.2408e+02, objend = 2.7245e+02
number of iterations = 689, time = 0.9s
RMSD = 5.33e-01
-------------------------------------------------------
(noise factor)^2 = -20.0dB,
mean square error (MSE) in estimated positions = -5.0dB
-------------------------------------------------------
condition holds and the duality gap is zero. In this case, assume that Z ∗ is a primal
feasible solution of SDP embedding and S ∗ is an optimal dual solution, then
(1) rank(Z ∗ ) + rank(S ∗ ) ≤ k + n and rank(Z ∗ ) ≥ k, whence rank(S ∗ ) ≤ n;
(2) rank(Z ∗ ) = k ⇐⇒ X = Y T Y .
It follows that if an optimal dual S ∗ has rank n, then every primal solution Z ∗ has
rank k, which ensures X = Y T Y . Therefore it suffices to find a maximal rank dual
solution S ∗ whose rank is n.
Above we have optimality rank condition from SDP. Now we introduce a geo-
metric criterion based on universal rigidity.
Definition (Universal Rigidity (UR) or Unique Localization (UL)). ∃!yi ∈ Rk ,→
2
Rl where l ≥ k s.t. d2ij = kyi − yj k2 , dc 2
ij = kak − yj k .
inequalities, however we can achieve arbitrary small rank solution. To see this,
assume that
Ai X = bi 7→ αbi ≤ Ai X ≤ βbi i = 1, . . . , m, where β ≥ 1, α ∈ (0, 1)
then So, Ye, and Zhang (2008) [SYZ08] show the following result.
Theorem 6.2. For every d ≥ 1, there is a SDP solution X b 0 with rank
rank(X) ≤ d, if the following holds,
b
18 ln 2m
1+
1 ≤ d ≤ 18 ln 2m
β= √ d
1 + 18 ln 2m d ≥ 18 ln 2m
d
1
e(2m)2/d
1 ≤ d ≤ 4 ln 2m
α=
( r )
1 4 ln 2m
max
,1 − d ≥ 4 ln 2m
e(2m)2/d d
1. Introduction
In the past month we talked about two topics: one is the sample mean and
sample covariance matrix (PCA) in high dimensional spaces. We have learned that
when dimension p is large and sample size n is relatively small, in contrast to the
traditional statistics where p is fixed and n → ∞, both sample mean and PCA may
have problems. In particular, Stein’s phenomenon shows that in high dimensional
space with independent Gaussian distributions, the sample mean is worse than a
shrinkage estimator; moreover, random matrix theory sheds light on that in high
dimensional space with sample size in a fixed ratio of dimension, the sample co-
variance matrix and PCA may not reflect the signal faithfully. These phenomena
start a new philosophy in high dimensional data analysis that to overcome the curse
of dimensionality, additional constraints has to be put that data never distribute
in every corner in high dimensional spaces. Sparsity is a common assumption in
modern high dimensional statistics. For example, data variation may only depend
on a small number of variables; independence of Gaussian random fields leads to
sparse covariance matrix; and the assumption of conditional independence can also
lead to sparse inverse covariance matrix. In particular, an assumption that data
concentrate around a low dimensional manifold in high dimensional spaces, leads
to manifold learning or nonlinear dimensionality reduction, e.g. ISOMAP, LLE,
and Diffusion Maps etc. This assumption often finds example in computer vision,
graphics, and image processing.
All the work introduced in this chapter can be regarded as generalized PCA/MDS
on nearest neighbor graphs, which has roots in manifold learning concept. Two
pieces of milestone works, ISOMAP [TdSL00] and Locally Linear Embedding
(LLE) [RL00], are firstly published in science 2000, which opens a new field called
nonlinear dimensionality reduction, or manifold learning in high dimensional data
analysis. Here is the development of manifold learning method:
Laplacian Eigen Map
Diffusion Map
PCA −→ LLE −→
Hessian LLE
Local Tangent Space Alignment
MDS −→ ISOMAP
To understand the motivation of such a novel methodology, let’s take a brief
review on PCA/MDS. Given a set of data xi ∈ Rp (i = 1, . . . , n) or merely pairwise
distances d(xi , xj ), PCA/MDS essentially looks for an affine space which best cap-
ture the variation of data distribution, see Figure 1(a). However, this scheme will
not work in the scenario that data are actually distributed on a highly nonlinear
59
60 5. NONLINEAR DIMENSIONALITY REDUCTION
curved surface, i.e. manifolds, see the example of Swiss Roll in Figure 1(b). Can we
extend PCA/MDS in certain sense to capture intrinsic coordinate systems which
charts the manifold?
(a) (b)
ISOMAP and LLE, as extensions from MDS and local PCA, respectively, leads
to a series of attempts to address this problem.
All the current techniques in manifold learning, as extensions of PCA and
MDS, are often called as Spectral Kernel Embedding. The common theme of these
techniques can be described in Figure 2. The basic problem is: given a set of
data points {x1 , x2 , ..., xn ∈ Rp }, how to find out y1 , y2 , ..., yn ∈ Rd , where d p,
such that some geometric structures (local or global) among data points are best
preserved.
All the manifold learning techniques can be summarized in the following meta-
algorithm, which explains precisely the name of spectral kernel embedding. All the
methods can be called certain eigenmaps associated with some positive semi-definite
kernels.
1. Construct a data graph G = (V, E), where V = {xi : i = 1, ..., n}.
e.g.1. ε-neighborhood, i ∼ j ⇔ d(xi , xj ) 6 ε, which leads to an undirected
graph;
2. ISOMAP 61
2. ISOMAP
ISOMAP is an extension of MDS, where pairwise euclidean distances between
data points are replaced by geodesic distances, computed by graph shortest path
distances.
(1) Construct a neighborhood graph G = (V, E, dij ) such that
V = {xi : i = 1, . . . , n}
E = {(i, j) : if j is a neighbor of i, i.e. j ∈ Ni }, e.g. k-nearest
neighbors, -neighbors
dij = d(xi , xj ), e.g. Euclidean distance when xi ∈ Rp
(2) Compute graph shortest path distances
dij = minP =(xi ,...,xj ) (kxi − xt1 k + . . . + kxtk−1 − xj k), is the length
of a graph shortest path connecting i and j
Dijkstra’s algorithm (O(kn2 log n)) and Floyd’s Algorithm (O(n3 ))
62 5. NONLINEAR DIMENSIONALITY REDUCTION
The basic feature of ISOMAP can be described as: we find a low dimensional
embedding of data such that points nearby are mapped nearby and points far away
are mapped far away. In other words, we have global control on the data distance
and the method is thus a global method. The major shortcoming of ISOMAP
lies in its computational complexity, characterized by a full matrix eigenvector
decomposition.
2.1. ISOMAP Example. Now we give an example of ISOMAP with matlab
codes.
% load 33-face data
load ../data/face.mat Y
X = reshape(Y,[size(Y,1)*size(Y,2) size(Y,3)]);
p = size(X,1);
n = size(X,2);
D = pdist(X’);
DD = squareform(D);
(a) (b)
dM (x, xi ) < , and {i, j} ∈ E if dM (xi , xj ) ≤ α (α ≥ 4). Then for any pair
x, y ∈ V ,
α
dS (x, y) ≤ max(α − 1, )dM (x, y).
α−2
Proof. Let γ be a shortest path connecting x and y on M whose length is
l. If l ≤ (α − 2), then there is an edge connecting x and y whence dS (x, y) =
dM (x, y). Otherwise split γ into pieces such that l = l0 + tl1 where l1 = (α − 2)
and ≤ l0 < (α − 2). This divides arc γ into a sequence of points γ0 = x, γ1 ,. . .,
γt+1 = y such that dM (x, γ1 ) = l0 and dM (γi , γi+1 ) = l1 (i ≥ 1). There exists a
sequence of x0 = x, x1 , . . . , xt+1 = y such that dM (xi , γi ) ≤ and
dM (xi , xi+1 ) ≤ dM (xi , γi ) + dM (γi , γi+1 ) + dM (γi+1 , xi+1 )
≤ + l1 +
= α
= l1 α/(α − 2)
whence (xi , xi+1 ) ∈ E. Similarly dM (x, x1 ) ≤ dM (x, γ1 ) + dM (γ1 , x1 ) ≤ (α − 1) ≤
l0 (α − 1).
t−1
X
dS (x, y) ≤ dM (xi , xi+1 )
i=0
α
≤ l max ,α − 1
α−2
Setting α = 4 gives rise to dS (x, y) ≤ 3dM (x, y).
The other lower bound dS (x, y) ≤ cdG (x, y) requires that for every two points
xi and xj , Euclidean distance kxi − xj k ≤ cdM (xi , xj ). This imposes a regularity
on manifold M , whose curvature has to be bounded. We omit this part here and
leave the interested readers to the reference by Bernstein, de Silva, Langford, and
Tenenbaum 2000, as a supporting information to the ISOMAP paper.
ISOMAP LLE
MDS on geodesic distance matrix local PCA + eigen-decomposition
global approach local approach
no for nonconvex manifolds with holes ok with nonconvex manifolds with holes
landmark (Nystrom) Hessian
Extensions: conformal Extensions: Laplacian
isometric, etc. LTSA etc.
66 5. NONLINEAR DIMENSIONALITY REDUCTION
i≥j
5. Hessian LLE
Laplacian Eigenmap looks for coordinate curves
Z
min k∇M f k2 , kf k = 1
Donoho and Grimes (2003) [DG03b] replaces the graph Laplacian, or the trace
of Hessian matrix, by the whole Hessian. This is because the kernel of Hessian,
∂2f
f (y1 , . . . , yd ) : =0
∂yi ∂yj
must be constant function or linear functions in yi (i = 1, . . . , d). Therefore this
kernel space is a linear subspace of dimension d+1. Minimizing Hessian will exactly
leads to a basis with constant function and d independent coordinate functions.
1. G is incomplete, often k-nearest neighbor graph.
2. Local SVD on neighborhood of xi , for xij ∈ N (xi ),
Define Hessian by
(i) T d
[H ] = [last columns of M̃ ]k×(d)
2 2
Find smallest d + 1 eigenvectors of K and drop the smallest eigenvector, the re-
maining d eigenvectors will give rise to a d dimensional embedding of data points.
5.1. Convergence of Hessian LLE. There are two assumptions for the con-
vergence of ISOMAP:
• Isometry: the geodesic distance between two points on manifolds equals
to the Euclidean distances between intrinsic parameters.
• Convexity: the parameter space is a convex subset in Rd .
Therefore, if the manifold contains a hole, ISOMAP will not faithfully recover
the intrinsic coordinates. Hessian LLE above is provable to find local orthogonal
coordinates for manifold reconstruction, even in nonconvex case. Figure [?] gives
an example.
Donoho and Grimes [DG03b] relaxes the conditions above into the following
ones.
• Local Isometry: in a small enough neighborhood of each point, geodesic
distances between two points on manifolds are identical to Euclidean dis-
tances between parameter points.
• Connecteness: the parameter space is an open connected subset in Rd .
Based on the relaxed conditions above, they prove the following result.
Theorem 5.1. Supper M = ψ(Θ) where Θ is an open connected subset of Rd ,
and ψ is a locally isometric embedding of Θ into Rn . Then the Hessian H(f ) has a
d + 1 dimensional nullspace, consisting of the constant function and d-dimensional
space of functions spanned by the original isometric coordinates.
6. LOCAL TANGENT SPACE ALIGNMENT (LTSA) 69
Define Hessian by
!
(i) T d
[H ] = [last columns of M̃ ]k×(d)
2 2
where selection matrix Sin×k : [xi1 , ..., xik ] = [x1 , ..., xn ]Sin×k ;
5 Step 3 : Find smallest d + 1 eigenvectors of K and drop the smallest eigenvector,
the remaining d eigenvectors will give rise to a d-embedding.
7. Diffusion Map
Recall xi ∈ Rd , i = 1, 2, · · · , n,
d(xi , xj )2
Wij = exp − ,
t
W is a symmetrical
Pn n × n matrix.
Let di = j=1 Wij and
D = diag(di ), P = D−1 W
and
S = D−1/2 W D−1/2 = I − L, L = D−1/2 (D − W )D−1/2 .
Then
1) S is symmetrical, has n orthogonal eigenvectors V = [v1 , v2 , · · · , vn ],
S = V ΛV T , Λ = diag(λi )T ∈ Rn−1 , V T V = I.
Here we assume that 1 = λ0 ≥ λ1 ≥ λ2 . . . ≥ λn−1 due to positivity of W .
2) Φ = D−1/2 V = [φ1 , φ2 , · · · , φn ] are right eigenvectors of P , P Φ = ΦΛ.
72 5. NONLINEAR DIMENSIONALITY REDUCTION
• Let Z
(α) (α)
dt (x) = kt (x, y)q(y)dy
M
and define the transition kernel of a Markov chain by
(α)
kt (x, y)
pt,α (x, y) = (α)
.
dt (x)
Then the Markov chain can be defined as the operator
Z
Pt,α f (x) = pt,α (x, y)f (y)q(y)dy.
M
• Define the infinitesimal generator of the Markov chain
I − Pt,α
Lt,α = .
t
For this, Lafon et al.[CL06] shows the following pointwise convergence results.
Theorem 7.1. Let M ∈ Rp be a compact smooth submanifold, q(x) be a proba-
bility density on M, and ∆M be the Laplacian-Beltrami operator on M.
∆M (f q 1−α ) ∆M (q 1−α ))
(69) lim Lt,α = − .
t→0 q 1−α q 1−α
9. COMPARISONS 73
9. Comparisons
According to the comparative studies by Todd Wittman, LTSA has the best
overall performance in current manifold learning techniques. Try yourself his code,
mani.m, and enjoy your new discoveries!
Theorem 1.1 (Perron Theorem for Positive Matrix). Assume that A > 0, i.e.a
positive matrix. Then
1) ∃λ∗ > 0, ν ∗ > 0, kν ∗ k2 = 1, s.t. Aν ∗ = λ∗ ν ∗ , ν ∗ is a right eigenvector
(∃λ∗ > 0, ω > 0, kωk2 = 1, s.t. (ω T )A = λ∗ ω T , left eigenvector)
2) ∀ other eigenvalue λ of A, |λ| < λ∗
3) ν ∗ is unique up to rescaling or λ∗ is simple
75
76 6. RANDOM WALK ON GRAPHS
4) Collatz-Wielandt Formula
[Ax]i [Ax]i
λ∗ = max min = min max .
x≥0,x6=0 xi 6=0 xi x>0 xi
Such eigenvectors will be called Perron vectors. This result can be extended to
nonnegative matrices.
Theorem 1.2 (Nonnegative Matrix, Perron). Assume that A ≥ 0, i.e.nonnegative.
Then
1’) ∃λ∗ > 0, ν ∗ ≥ 0, kν ∗ k2 = 1, s.t. Aν ∗ = λ∗ ν ∗ (similar to left eigenvector)
2’) ∀ other eigenvalue λ of A, |λ| ≤ λ∗
3’) ν ∗ is NOT unique
4) Collatz-Wielandt Formula
[Ax]i [Ax]i
λ∗ = max min = min max
x≥0,x6=0 xi 6=0 xi x>0 xi
Notice the changes in 1’), 2’), and 3’). Perron vectors are nonnegative rather
than positive. In the nonnegative situation what we lose is the uniqueness in λ∗
(2’)and ν ∗ (3’). The next question is: can we add more conditions such that the
loss can be remedied? Now recall the concept of irreducible and primitive matrices
introduced before.
Irreducibility exactly describes the case that the induced graph from A is con-
nected, i.e.every pair of nodes are connected by a path of arbitrary length. However
primitivity strengths this condition to k-connected, i.e.every pair of nodes are con-
nected by a path of length k.
Definition (Irreducible). The following definitions are equivalent:
1) For any 1 ≤ i, j ≤ n, there is an integer k ∈ Z, s.t. Akij > 0; ⇔
2) Graph G = (V, E) (V = {1, . . . , n} and {i, j} ∈ E iff Aij > 0) is (path-)
connected, i.e.∀{i, j} ∈ E, there is a path (x0 , x1 , . . . , xt ) ∈ V n+1 where i = x0 and
xt = j, connecting i and j.
Definition (Primitive). The following characterizations hold:
1) There is an integer k ∈ Z, such that ∀i, j, Akij > 0; ⇔
2) Any node pair {i, j} ∈ E are connected with a path of length no more than k;
⇔
3) A has unique λ∗ = max |λ|; ⇐
4) A is irreducible and Aii > 0, for some i,
Note that condition 4) is sufficient for primitivity but not necessary; all the first
three conditions are necessary and sufficient for primitivity. Irreducible matrices
imply an unique primary eigenvector, but not unique primary eigenvalue.
When A is a primitive matrix, Ak becomes a positive matrix for some k, then we
can recover 1), 2) and 3) for positivity and uniqueness. This leads to the following
Perron-Frobenius theorem.
Theorem 1.3 (Nonnegative Matrix, Perron-Frobenius). Assume that A ≥ 0 and
A is primitive. Then
1) ∃λ∗ > 0, ν ∗ > 0, kν ∗ k2 = 1, s.t. Aν ∗ = λ∗ ν ∗ (right eigenvector)
and ∃ω > 0, kωk2 = 1, s.t. (ω T )A = λ∗ ω T (left eigenvector)
2) ∀ other eigenvalue λ of A, |λ| < λ∗
1. INTRODUCTION TO PERRON-FROBENIUS THEORY AND PAGERANK 77
3) ν ∗ is unique
4) Collatz-Wielandt Formula
[Ax]i [Ax]i
λ∗ = max min = min max
x>0 xi x>0 xi
Such eigenvectors and eigenvalue will be called as Perron-Frobenius or primary
eigenvectors/eigenvalue.
Example (Markov Chain). Given a graph G = (V, E), consider a random walk
on G with transition probability Pij = P rob(xt+1 = j|xt = i) ≥ 0. Thus P is a
→
− →
− →
−
row-stochastic or row-Markov matrix i.e. P · 1 = 1 where 1 ∈ Rn is the vector
with all elements being 1. From Perron theorem for nonnegative matrices, we know
→
−
ν ∗ = 1 > 0 is a right Perron eigenvector of P
λ∗ = 1 is a Perron eigenvalue and all other eigenvalues |λ| ≤ 1 = λ∗
∃ left PF-eigenvector π such that π T P = π T where π ≥ 0, 1T π = 1; such π is
called an invariant/equilibrium distribution
P is irreducible (G is connected) ⇒ π unique
P is primitive (G connected by paths of length ≤ k) ⇒ |λ| = 1 unique
When A is primitive, (Ak > 0, i.e.investment in one sector will increase the product
in another sector in no more than k industrial periods), we have for all other
eigenvalues λ, |λ| < λ∗ and ω ∗ , ν ∗ are unique. In this case one can check that the
long term economic growth is governed by
At → (λ∗ )t ν ∗ ω ∗T
where
1) for all i, (x(xt−1
t )i
)i → λ
∗
Define ν̃ = ν ∗ + ei with > 0 and ei denotes the vector which is one on the ith
component and zero otherwise.
For those j 6= i,
(Aν̃)j = (Aν ∗ )j + (Aei )j = λ∗ νj∗ + Aji > λ∗ νj∗ = λ∗ ν˜j
where the last inequality is due to A > 0.
For those j = i,
(Aν̃)i = (Aν ∗ )i + (Aei )i > λ∗ νi∗ + Aii .
Since λ∗ ν˜i = λ∗ νi∗ + λ∗ , we have
(Aν̃)i − (λ∗ ν̃)i + (Aii − λ∗ ) = (Aν ∗ )i − (λ∗ νi∗ ) − (λ∗ − Aii ) > 0,
where the last inequality holds for small enough > 0. That means, for some small
> 0, (Aν̃) > λ∗ ν̃. Thus λ∗ is not optimal, which leads to a contradiction.
80 6. RANDOM WALK ON GRAPHS
2) Assume on the contrary, for some k, νk∗ = 0, then (Aν ∗ )k = λ∗ νk∗ = 0. But
A > 0, ν ∗ ≥ 0 and ν ∗ 6= 0, so there ∃i, νi∗ > 0, which implies that Aν ∗ > 0.
That contradicts to the previous conclusion. So ν ∗ > 0, which followed by λ∗ > 0
(otherwise Aν ∗ > 0 = λ∗ ν ∗ = Aν ∗ ).
3) We are going to show that for every ν ≥ 0, Aν = µν ⇒ µ = λ∗ . Following the
same reasoning above, A must have a left Perron vector ω ∗ > 0, s.t. AT ω ∗ = λ∗ ω ∗ .
Then λ∗ (ω ∗T ν) = ω ∗T Aν = µ(ω ∗T ν). Since ω ∗T ν > 0 (ω ∗ > 0, ν ≥ 0), there
must be λ∗ = µ, i.e. λ∗ is unique, and ν ∗ is unique.
4) For any other eigenvalue Az = λz, A|z| ≥ |Az| = |λ||z|, so |λ| ≤ λ∗ . Then
we prove that |λ| < λ∗ . Before proceeding, we need the following lemma.
Lemma 1.4. Az = λz, |λ| = λ∗ , z 6= 0 ⇒ A|z| = λ∗ |z|. λ∗ = maxi |λi (A)|
Proof of Lemma. Since |λ| = λ∗ ,
A|z| = |A||z| ≥ |Az| = |λ||z| = λ∗ |z|
Assume that ∃k, λ1∗ A|zk | > |zk |. Denote Y = λ1∗ A|z| − |z| ≥ 0, then Yk > 0.
Using that A > 0, x ≥ 0, x 6= 0, ⇒ Ax > 0, we can get
1 1
⇒ ∗ AY > 0, A|z| > 0
λ λ∗
A A
⇒ ∃ > 0, ∗
Y > ∗ |z|
λ λ
A
⇒ ĀY > Ā|z|, Ā = ∗
λ
⇒ Ā2 |z| − Ā|z| > Ā|z|
Ā2
⇒ |z| > Ā|z|
1+
Ā
⇒B= , 0 = lim B m Ā|z| ≥ Ā|z|
1+ m→∞
which implies that zj has the same sign, i.e.zj ≥ 0 or zj ≤ 0 (∀j). In both cases |z|
(z 6= 0) is a nonnegative eigenvector A|z| = λ|z| which implies λ = λ∗ by 3).
1.2. Perron-Frobenius theory for Nonnegative Tensors. Some researchers,
e.g. Liqun Qi (Polytechnic University of Hong Kong), Lek-Heng Lim (U Chicago)
and Kung-Ching Chang (PKU) et al. recently generalize Perron-Frobenius theory
to nonnegative tensors, which may open a field toward PageRank for hypergraphs
and array or tensor data. For example, A(i, j, k) is a 3-tensor of dimension n,
representing for each object 1 ≤ i ≤ n, which object of j and k are closer to i.
A tensor of order-m and dimension-n means an array of nm real numbers:
A = (ai1 ,...,im ), 1 ≤ i1 , . . . , im ≤ n
2. INTRODUCTION TO FIEDLER THEORY AND CHEEGER INEQUALITY 81
n
X
Aν [m−1] := aki2 ...im νi2 · · · νim , ν m−1 := (ν1m−1 , . . . , νnm−1 )T .
i2 ,...,im =1
1Simple graph means for every pair of nodes there are at most one edge associated with it;
and there is no self loop on each node.
82 6. RANDOM WALK ON GRAPHS
Define a diagonal matrix D = diag(di ). Now let’s come to the definition of Lapla-
cian Matrix L.
Example. V = {1, 2, 3, 4}, E = {{1, 2}, {2, 3}, {3, 4}}. This is a linear chain with
four nodes.
1 −1 0 0
−1 2 −1 0
L= 0 −1 2 −1 .
0 0 −1 1
i∼j
2. INTRODUCTION TO FIEDLER THEORY AND CHEEGER INEQUALITY 83
These two statements imply the eigenvalues of L can’t be negative. That is to say
λ(L) ≥ 0.
Theorem 2.1 (Fiedler theory). Let L has n eigenvectors
Lvi = λi vi , vi 6= 0, i = 0, . . . , n − 1
where 0 = λ0 ≤ λ1 ≤ · · · ≤ λn−1 . For the second smallest eigenvector v1 , define
N− = {i : v1 (i) < 0},
N+ = {i : v1 (i) > 0},
N0 = V − N− − N+ .
We have the following results.
(1) #{i, λi = 0} = #{connected components of G};
(2) If G is connected, then both N− and N+ are connected. N− ∪ N0 and
N+ ∪ N0 might be disconnected if N0 6= ∅.
This theorem tells us that the second smallest eigenvalue can be used to tell us
if the graph is connected, i.e.G is connected iff λ1 6= 0, i.e.
λ1 = 0 ⇔ there are at least two connected components.
λ1 > 0 ⇔ the graph is connected.
Moreover, the second smallest eigenvector can be used to bipartite the graph into
two connected components by taking N− and N+ when N0 is empty. For this reason,
we often call the second smallest eigenvalue λ1 as the algebraic connectivity. More
materials can be found in Jim Demmel’s Lecture notes on Fiedler Theory at UC
Berkeley: why we use unnormalized Laplacian eigenvectors for spectral partition
(http://www.cs.berkeley.edu/~demmel/cs267/lecture20/lecture20.html).
We can calculate eigenvalues by using Rayleigh Quotient. This gives a sketch
proof of the first part of the theory.
Proof of Part I. Let (λ, v) be a pair of eigenvalue-eigenvector, i.e.Lv = λv.
Since L1 = 0, so the constant vector 1 ∈ Rn is always the eigenvector associated
with λ0 = 0. In general,
(vi − vj )2
P
T
v Lv i∼j
λ= T = P 2 .
v v vi
i
Note that
0 = λ1 ⇔ vi = vj (j is path connected with i).
Therefore v is a piecewise constant function on connected components of G. If
G has k components, then there are k independent piecewise constant vectors in
the span of characteristic functions on those components, which can be used as
eigenvectors of L. In this way, we proved the first part of the theory.
84 6. RANDOM WALK ON GRAPHS
Similarly we get the relations between eigenvalue and the connected components of
the graph.
#{λi (L) = 0} = #{connected components of G}.
Next we show that eigenvectors of L are related to random walks on graphs.
This will show you why we choose this matrix to analysis the graph.
We can construct a random walk on G whose transition matrix is defined by
Aij 1
Pij ∼ P = .
Aij di
j
You can see ū is the eigenvector of L, and we can get left eigenvectors of P
from ū by multiply it with D1/2 on the left side. Similarly for the right eigenvectors
v = D−1/2 ū.
If we choose u0 = πi ∼ Pdidi , then:
p
ū0 (i) ∼ di ,
ūTk ūl = δkl ,
uTk Dvl = δkl ,
πi Pij = πj Pji ∼ Aij = Aji ,
where the last identity says the Markov chain is time-reversible.
All the conclusions above show that the normalized graph Laplacian L keeps
some connectivity measure of unnormalized graph Laplacian L. Furthermore, L is
more related with random walks on graph, through which eigenvectors of P are easy
to check and calculate. That’s why we choose this matrix to analysis the graph.
Now we are ready to introduce the Cheeger’s inequality with normalized graph
Laplacian.
Let G be a graph, G = (V, E) and S is a subset of V whose complement is
S̄ = V − S. We define V ol(S), CU T (S) and N CU T (S) as below.
X
V ol(S) = di .
i∈S
X
CU T (S) = Aij .
i∈S,j∈S̄
CU T (S)
N CU T (S) = .
min(V ol(S), V ol(S̄))
N CU T (S) is called normalized-cut. We define the Cheeger constant
hG = min N CU T (S).
S
g T Lg
λ1 = inf
g⊥D 1/2 e g T g
2
P
i∼j (fi − fj )
≤ P 2
fi di
1 1
( V ol(S) + V ol(S̄)
)2 CU T (S)
= 1 1
V ol(S) V ol(S) 2 + V ol(S̄) V ol(S̄)2
1 1
=( + )CU T (S)
V ol(S) V ol(S̄)
2CU T (S)
≤ =: 2hG .
min(V ol(S), V ol(S̄))
which gives the upper bound.
(2) Lower bound: the proof of lower bound actually gives a constructive algo-
rithm to compute an approximate optimal cut as follows.
Let v be the second eigenvector, i.e. Lv = λ1 v, and f = D−1/2 v. Then we
reorder node set V such that f1 ≤ f2 ≤ ... ≤ fn ). Denote V− = {i; vi < 0}, V+ =
{i; vi ≥ vr }. Without Loss of generality, we can assume
X X
dv ≥ dv
i∈V− i∈V+
Si = {v1 , v2 , ...vi },
and define
V
g ol(S) = min(V ol(S), V ol(S̄)).
αG = min N CU T (Si ).
i
Clearly finding the optimal value α just requires comparison over n − 1 NCUT
values.
Below we shall show that
h2G α2
≤ G ≤ λ1 .
2 2
First, we have Lf = λ1 Df , so we must have
X
(70) fi (fi − fj ) = λ1 di fi2 .
j:j∼i
P P
i∈V+ fi j:j∼i (fi − fj )
λ1 = P 2 ,
i∈V+ di fi
− fj )2 +
P P P
i∼j i,j∈V+ (fi i∈V+ fi j∼i j∈V− (fi − fj )
= , (fi − fj )2 = fi (fi − fj ) + fj (fj − fi )
di fi2
P
i∈V+
− fj )2 +
P P P
i∼j i,j∈V+ (fi i∈V+ fi j∼i j∈V− (fi )
> ,
di fi2
P
i∈V+
+
− fj+ )2
P
i∼j (fi
= 2 ,
di fi+
P
i∈V
( i∼j (fi+ − fj+ )2 )( i∼j (fi+ + fj+ )2 )
P P
= 2
( i∈V fi+ di )( i∼j (fi+ + fj+ )2 )
P P
2 2
( i∼j fi+ − fj+ )2
P
≥ 2 , Cauchy-Schwartz Inequality
( i∈V fi+ di )( i∼j (fi+ + fj+ )2 )
P P
2 2
( i∼j fi+ − fj+ )2
P
≥ 2 ,
2( i∈V fi+ di )2
P
where the second last step is due to the Cauchy-Schwartz inequality |hx, yi|2 ≤
P + + 2 P +2
hx, xi · hy, yi, and the last step is due to i∼j∈V (fi + fj ) = i∼j∈V (fi +
+2 + + P +2 +2 P +2
fj + 2fi fj ) ≤ 2 i∼j∈V (fi + fj ) ≤ 2 i∈V fi di . Continued from the last
inequality,
2 2
fi+ − fj+ )2
P
( i∼j
λ1 ≥ 2 ,
2( i∈V fi+ di )2
P
2 + 2
( i∈V (fi+ − fi−1 )CU T (Si−1 ))2
P
≥ 2 , since f1 ≤ f2 ≤ . . . ≤ fn
2( i∈V fi+ di )2
P
2 + 2
( i∈V (fi+ − fi−1 ol(Si−1 ))2
P
)αG V g
≥ 2
2( i∈V fi+ di )2
P
2
( i∈V fi+ (V ol(Si )))2
2
P
αG g ol(Si−1 ) − Vg
= · 2 ,
2 ( i∈V fi+ di )2
P
2 (
P +2 2
αG i∈V fi di ) α2
= 2 = G.
2 ( P + 2 2
i∈V fi di )
where the last inequality is due to the assumption V ol(V− ) ≥ V ol(V+ ), whence
V
g ol(Si ) = V ol(S̄i ) for i ∈ V+ .
This completes the proof.
Fan Chung gives a short proof of the lower bound in Simons Institute workshop,
2014.
88 6. RANDOM WALK ON GRAPHS
The next step is to extend the definition of Laplacian to directed graphs. First
we give a review on Lapalcian on undirected graphs. On an undirected graph,
adjacent matrix is
1, i ∼ j;
Aij =
0, i 6∼ j.
D = diag(d(i)),
L = D−1/2 (D − A)D−1/2 .
On a directed graph, however, there are two degrees on a vertex which are
generally inequivalent. Notice that on an undirected graph, stationary distribution
φ(i) ∼ d(i), so D = cΦ, where c is a constant and Φ = diag(φ(i)).
L = I − D−1/2 AD−1/2
= I − D1/2 P D−1/2
= I − c1/2 Φ1/2 P c−1/2 Φ−1/2
= I − Φ1/2 P Φ−1/2
Definition (Laplacian).
1
L = I − (Φ1/2 P Φ−1/2 + Φ−1/2 P ∗ Φ1/2 ).
2
Suppose the eigenvalues of L are 0 = λ0 ≤ λ1 ≤ · · · ≤ λn−1 . Like the undirected
case, we can calculate λ1 with the Rayleigh quotient.
Theorem 3.2.
R(f )
λ1 = P inf .
f (x)φ(x)=0 2
Lemma 3.3.
gLg ∗
R(f ) = 2 , where g = f Φ1/2 .
k g k2
3. *LAPLACIANS AND THE CHEEGER INEQUALITY FOR DIRECTED GRAPHS 91
Proof.
(u) − f (v) |2 φ(u)P (u, v)
P
u→v | fP
R(f ) =
v | f (v) | φ(v)
2
2 2
P P P
u→v | f (u) | φ(u)P (u, v) + v | f (v) | φ(v) − u→v (f (u)f (v) + f (u)f (v))φ(u)P (u, v)
=
f Φf ∗
2 2 ∗ ∗
P P
u | f (u) | φ(u) + v | f (v) | φ(v) − (f ΦP f + f ΦP f
=
f Φf ∗
∗ ∗
f (P Φ + ΦP )f
= 2−
f Φf ∗
(gΦ−1/2 )(P ∗ Φ + ΦP )(Φ−1/2 g ∗ )
= 2−
(gΦ−1/2 )Φ(Φ−1/2 g ∗ )
g(Φ−1/2 P ∗ Φ1/2 + Φ1/2 P Φ−1/2 )g ∗
= 2−
gg ∗
∗
gLg
= 2·
k g k2
1/2
Proof of Theorem 3.2. With Lemma 3.3 and L(φ(x) )n×1 = 0, we have
R(f )
λ1 = inf
P
2
g(x)φ(x)1/2 =0
R(f )
= P inf .
f (x)φ(x)=0 2
Note.
R(f )
λ1 = inf
2
P
f, f (x)φ(x)=0
With the fundamental matrix, the hitting time and commute time can be ex-
pressed as follows:
zjj − zij
(74) Hij =
πj
3.5.2. Green’s function and Laplacian for directed graph. If we treat the di-
rected graph Laplacian L̃ as an asymmetric operator on a directed graph G, then
we can define the Green’s Function G̃ (without boundary condition) for directed
graph. The entries of G satisfy the conditions:
√
(78) (G̃ L̃)ij = δij − πi πj
or in the matrix form
1 1 T
(79) G̃ L̃ = I − π 2 π 2
The central theorem in the second paper associate the Green’s Function G̃, the
fundamental matrix Z and the normalize directed graph Laplacian L̃:
1 1
Theorem 3.7. Let Z̃ = Π 2 ZΠ− 2 and L̃† denote the Moore-Penrose pseudo-
inverse L̃, then
(80) G̃ = Z̃ = L̃†
94 6. RANDOM WALK ON GRAPHS
1 1
where P̃ = Π 2 P Π− 2
II. (Meila-Shi
P 2001) P is lumpable with respect to partition Ω and P̂ (p̂st =
i∈Ωs ,j∈Ωt pij ) is nonsingular ⇔ P has k independent piecewise constant
right eigenvectors in span{χΩs : s = 1, · · · , k}, χ is the characteristic
function.
Example. Consider a linear chain with 2n nodes (Figure 2) whose adjacency ma-
trix and degree matrix are given by
0 1
1 0 1
. . . .
A= . .. .. , D = diag{1, 2, · · · , 2, 1}
1 0 1
1 0
So the transition matrix is P = D−1 A which is illustrated in Figure 2. The spectrum
of P includes two eigenvalues of magnitude 1, i.e.λ0 = 1 and λn−1 = −1. Although
P is not a primitive matrix here, it is lumpable. Let Ω1 = {odd nodes}, Ω2 = {even
nodes}. We can check that I and II are satisfied.
To see I, note that for any two even nodes, say i = 2 and j = 4, P̂iΩ2 = P̂jΩ2 = 1
as their neighbors are all odd nodes, whence I is satisfied. To see II, note that φ0
(associated with λ0 = 1) is a constant vector while φ1 (associated with λn−1 = −1)
is constant on even nodes and odd nodes respectively. Figure 3 shows the lumpable
states when n = 4 in the left.
96 6. RANDOM WALK ON GRAPHS
“⇐” To show the sufficiency, we are going to show that if the condition is satisfied,
then the probability
P φi = P V ψi = V U P V ψi = V P̂ ψi = λi V ψi = λi φi ,
P φi = λi φi ⇒ P V ψi = λi V ψi ,
V U P V ψi = λi V U V ψi = λi V ψi = P V ψi , (U V = I),
which implies
(V U P V − P V )Ψ = 0, Ψ = [ψ1 , . . . , ψk ].
Since Ψ is nonsingular due to independence of ψi , whence we must have V U P V =
PV .
5.1. MNcut. Meila-Shi (2001) calls the following algorithm as MNcut, stand-
ing for modified Ncut. Due to the theory above, perhaps we’d better to call it
multiple spectral clustering.
1) Find top k right eigenvectors P Φi = λi Φi , i = 1, · · · , k, λi = 1 − o().
2) Embedding Y n×k = [φ1 , · · · , φk ] → diffusion map when λi ≈ 1.
3) k-means (or other suitable clustering methods) on Y to k-clusters.
A Markov chain defined as above is reversible. That is, detailed balance con-
dition is satisfied:
µ(x)p(x, y) = µ(y)p(y, x) ∀x, y ∈ S
Define an inner product on spaceL2µ :
XX
< f, g >µ = f (x)g(x)µ(x) f, g ∈ L2µ
x∈S y∈S
L2µ is a Hilbert space with this inner product. If we define an operator T on it:
X
T f (x) = p(x, y)f (y) = E[y|x] f (y)
y∈S
P
µ(x) p(x, y)φj (y) = λj φj (x)µ(x) with detailed balance condition
y∈S
P
p(y, x)µ(y)φj (y) = λj φj (x)µ(x) that is
y∈S
P
ψj Prob(x) = p(y, x)φ(y) = λj (x)ψ(x)
y∈S
So F guarantee a spectral decomposition. Let {λj }n−1 j=0 denote its eigenvalue
n−1
and {φj (x)}j=0 denote its eigenvector, then k(x, y) can be represented as K(x, y) =
n−1
P
λj φj (x)φj (y). Hilbert-Schmidt norm of F is defined as follow:
j=0
n−1
X
kF k2HS = tr(F ∗ F ) = tr(F 2 ) = λ2i
i=0
100 6. RANDOM WALK ON GRAPHS
the last equal sign dues do the orthogonality of eigenvectors. It is clear that if
L2µ = L2 , Hilbert-Schmidt norm is just Frobenius norm.
Now we can write our T as
X X p(x, y)
T f (x) = p(x, y)f (y) = f (y)µ(y)
µ(y)
y∈S y∈S
p(x,y)
and take K(x, y) = µ(y) . By detailed balance condition, K is symmetric. So
X p2 (x, y) X µ(x)
kT k2HS = µ(x)µ(y) = p2 (x, y)
µ2 (y) µ(y)
x,y∈S x,y∈S
One can check that this P̃ is a stochastic matrix, but it is not reversible. One
more convenient choice is transit ”randomly” by invariant distribution:
N
X µ(y)
P̃ (x, y) = 1Sk (x)P̂ (k, l)1Sl (y)
µ̂(Sl )
k,l=1
where
X
µ̂(Sl ) = µ(z)
z∈Sl
Then you can check this matrix is not only a stochastic matrix, but detailed
balance condition also hold provides P̂ on {Si } is reversible.
Now let us do some summary. Given a decomposition of state space S =
SN
i=1 Si , and a transition probability P̂ on coarse space, we may obtain a lifted
5. APPLICATIONS OF LUMPABILITY: MNCUT AND OPTIMAL REDUCTION OF COMPLEX NETWORKS
101
transition probability P̃ on fine space. Now we can compare ({Si }, P̂ ) and (S, P )
in a clear way: kP − P̃ kµ . So our optimization problem can be defined clearly:
E = min min kP − P̂ k2µ
S1 ...SN P̂
That is, given a partition of S, find the optimal P̂ to minimize kP − P̂ k2µ , and
find the optimal partition to minimize E.
N
5.2.3. Community structure of complex network. Given a partition S = ∪ Sk ,
k=1
the solution of optimization problem
min kp − p̂k2µ
p̂
is
1 X
p̂∗kl = µ(x)p(x, y)
µ̂(Sk )
x∈Sk ,y∈Sl
It is easy to show that {p̂∗kl } form a transition probability matrix with detailed
balance condition:
p̂∗kl ≥ 0
X 1 X XX
p̂∗kl = µ(x) p(x, y)
µ̂(Sk )
l x∈Sk l y∈Sl
1 X
= µ(x) = 1
µ̂(Sk )
x∈Sk
X
µ̂(Sk )p̂∗kl = µ(x)p(x, y)
x∈Sk ,y∈Sl
X
= µ(y)p(y, x)
x∈Sk ,y∈Sl
= µ̂(Sl )p̂∗lk
The last equality implies that µ̂ is the invariant distribution of the reduced Markov
chain. Thus we find the optimal transition probability in the coarse space. p̂∗ has
the following property
kp − p∗ k2µ = kpk2µ − kp̂∗ k2µ̂
However, the partition of the original graph is not given in advance, so we
need to minimize E ∗ with respect to all possible partitions. This is a combinatorial
optimization problem, which is extremely difficult to find the exact solution. An
effective approach to obtain an approximate solution, which inherits ideas of K-
means clustering, is proposed as following: First we rewrite E ∗ as
N
X µ(x) X p̂∗
E∗ = |p(x, y) − 1Sk (x) kl 1Sl (y)µ(y)|2
µ(y) µ̂(Sk )
x,y∈S k,l=1
N 2
X X p(x, y) p̂∗
= µ(x)µ(y) − kl
µ(y) µ̂(Sk )
k,l=1 x∈Sk ,y∈Sl
N X
,
X
E ∗ (x, Sk )
k=1 x∈Sk
102 6. RANDOM WALK ON GRAPHS
where
N X 2
X p(x, y) p̂∗
E ∗ (x, Sk ) = µ(x)µ(y) − kl
µ(y) µ̂(Sk )
l=1 y∈Sl
Based on above expression, a variation of K-means is designed:
N
E step: Fix partition ∪ Sk , compute p̂∗ .
k=1
(n+1)
M step: Put x in Sk such that
∗
E (x, Sk ) = min E ∗ (x, Sj )
j
Now we solve
min kp − p̃k2µ
p̂
to obtain a optimal reduction.
5.2.5. Model selection. Note the number of partition N should also not be
given in advance. But in strategies similar to K-means, the value of minimal E ∗ is
monotone decreasing with N . This means larger N is always preferred.
A possible approach is to introduce another quantity which is monotone in-
creasing with N . We take K-means clustering for example. In K-means clustering,
only compactness is reflected. If another quantity indicates separation of centers of
each cluster, we can minimize the ratio of compactness and separation to find an
optimal N .
⇔ ((I − P )(T + − S) = 0.
Therefore for irreducible P , S and T + must satisfy
diag(T + − S) = 0
T + − S = 1uT , ∀u
Now we continue with the proof of the main theorem. Since T = T + − Td+ ,
then (127) becomes
T = E + P T − Td+
(I − P )T = E − Td+
(I − D−1 W )T = F
(D − W )T = DF
LT = DF
where F = E − Td+ and L = D − W is the (unnormalized)
Pn graph Laplacian. Since
T
L is symmetric and irreducible, we have L = Pn k=1 k k k , where 0 = µ1 < µ2 ≤
µ ν ν
· · · ≤ µn , ν1 = 1/||1||, νkT νl = δkl . Let L† = k=2 µ1k νk νkT , L† is called the pseudo-
inverse (or Moore-Penrose inverse) of L. We can test and verify L† satisfies the
104 6. RANDOM WALK ON GRAPHS
P
Note that vol(G) = i di and πi = di /vol(G) for all i.
Given two sets V0 and V1 in the state space V , the transition path theory tells
how these transitions between the two sets happen (mechanism, rates, etc.). If we
view V0 as a reactant state and V1 as a product state, then one transition from V0
to V1 is a reaction event. The reactve trajectories are those part of the equilibrium
trajectory that the system is going from V0 to V1 .
Let the hitting time of Vl be
τik = inf{t ≥ 0 : x(0) = i, x(t) ∈ Vk }, k = 0, 1.
The central object in transition path theory is the committor function. Its
value at i ∈ Vu gives the probability that a trajectory starting from i will hit the
set V1 first than V0 , i.e., the success rate of the transition at i.
Proposition 7.1. For ∀i ∈ Vu , define the committor function
qi := P rob(τi1 < τi0 ) = P rob(trajectory starting from xi hit V1 before V0 )
which satisfies the following Laplacian equation with Dirichlet boundary conditions
(Lq)(i) = [(I − P )q](i) = 0, i ∈ Vu
qi∈V0 = 0, qi∈V1 = 1.
The solution is
qu = (Du − Wuu )−1 Wul ql .
Proof. By definition,
1
xi ∈ V 1
1 0
qi = P rob(τi < τi ) = 0 xi ∈ V 0
P
j∈V Pij qj i ∈ Vu
This is because ∀i ∈ Vu ,
qi = P r(τiV1 < τiV0 )
X
= Pij qj
j
X X X
= Pij qj + Pij qj + Pij qj
j∈V1 j∈V0 j∈Vu
X X
= Pij + Pij qj
j∈V1 j∈Vu
The reactive current J(xy) gives the average rate the reactive trajectories jump
from state x to y. From the reactive current, we may define the effective reactive
current on an edge and transition current through a node which characterizes the
importance of an edge and a node in the transition from A to B, respectively.
Finally, the committor functions also give information about the time propor-
tion that an equilibrium trajectory comes from A (the trajectory hits A last rather
than B).
Proposition 7.5. The proportion of time that the trajectory comes from A (resp. from
B) is given by
X X
(98) ρA = π(x)q(x), ρB = π(x)(1 − q(x)).
x∈V x∈V
CHAPTER 7
Diffusion Map
109
110 7. DIFFUSION MAP
and its symmetrization D̄ = D(KL) (xi ||xj ) + DKL (xj ||xi ), which measure
a kind of ‘distance’ between distributions; Jensen-Shannon divergence as
the symmetrization of KL-divergence between one distribution and their
average,
D(JS) (xi , xj ) = D(KL) (xi ||(xi + xj )/2) + D(KL) (xj ||(xi + xj )/2)
or
(103) xT W x ≥ 0.
PSD kernels includes heat kernels, cosine similarity kernels, and JS-divergence ker-
nels. But in many other cases (e.g. KL-divergence kernels), similarity kernels are
not necessarily PSD. For a PSD kernel, it can be understood as a generalized co-
variance function; otherwise, diffusions as random walks on similarity graphs will
be helpful to disclose their structures.
n
Define A := D−1 W , where D = diag( Wij ) , diag(d1 , d2 , · · · , dn ) for sym-
P
j=1
metric Wij = Wji ≥ 0. So
n
X
(104) Aij = 1 ∀i ∈ {1, 2, · · ·, n} (Aij ≥ 0)
j=1
whence A is a row Markov matrix of the following discrete time Markov chain
{Xt }t∈N satisfying
So:
n
X n
X
|λ||vj0 | = | Aj 0 j v j | ≤ Aj0 j |vj | ≤ |vj0 |.
j=1 j=1
ψ1 = π
as the stationary distribution of the Markov chain, respectively.
112 7. DIFFUSION MAP
Φ1D 1D
t (x1 ), · · · , Φt (xn1 ) = c1
Diffusion Map :
Φt (xn1 +1 ), · · · , Φ1D
1D
t (xn ) = c2
EX2: ring graph. ”single circle”
In this case, W is a circulant matrix
1 1 0 0 ··· 1
1 1 1 0 ··· 0
W = 0 1 1 1 ··· 0
.. .. .. .. ..
. . . . ··· .
1 0 0 0 ··· 1
The eigenvalue of W is λk = cos 2πk n
n k = 0, 1, · · · , 2 and the corresponding eigen-
2π 2πkj 2πkj t
vector is (uk )j = ei n kj j = 1, · · · , n. So we can get Φ2D t (xi ) = (cos n , sin n )c
EX3: order the face. Let
kx − yk2
kε (x, y) = exp − ,
ε
Wijε = kε (xi , xj ) and Aε = D−1 W ε where D = diag( j Wijε ). Define a graph
P
1 ε→0
Lε := (Aε − I) −→ backward Kolmogorov operator
ε
114 7. DIFFUSION MAP
1 00 0 0
1 2 φ (s) 0− φ (s)V (s) = λφ(s)
Lε f = 4M f − ∇f · ∇V ⇒ Lε φ = λφ ⇒ 0
2 φ (0) = φ (1) = 0
Where V (s) is the Gibbs free energy and p(s) = e−V (x) is the density of data points
along the curve. 4M is Laplace-Beltrami Operator. If p(x) = const, we can get
00
(116) V (s) = const ⇒ φ (s) = 2λφ(s) ⇒ φk (s) = cos(kπs), 2λk = −k 2 π 2
On the other hand p(s) 6= const, one can show 1 that φ1 (s) is monotonic for
arbitrary p(s). As a result, the faces can still be ordered by using φ1 (s).
Lemma 1.2. The diffusion distance is equal to a `2 distance between the proba-
bility clouds Ati,∗ and Atj,∗ with weights 1/dl ,i.e.,
1by changing to polar coordinate p(s)φ0 (s) = r(s) cos θ(s), φ(s) = r(s) sin θ(s) ( the so-called
‘Prufer Transform’ ) and then try to show that φ0 (s) is never zero on (0, 1).
1. DIFFUSION MAP AND DIFFUSION DISTANCE 115
Proof.
n
2
X 1
kAti,∗ − Atj,∗ k`2 (Rn ,1/d) = (Atil − Atjl )2
dl
l=1
n X n
X 1
= [ λtk φk (i)ψk (l) − λtk φk (j)ψk (l)]2
dl
l=1 k=1
n X
n
X 1
= λtk (φk (i) − φk (j))ψk (l)λtk0 (φk0 (i) − φk0 (j))ψk0 (l)
dl
l=1 k,k0
n n
X X ψk (l)ψk0 (l)
= λtk λtk0 (φk (i) − φk (j))(φk0 (i) − φk0 (j))
0
dl
k,k l=1
Xn
= λtk λtk0 (φk (i) − φk (j))(φk0 (i) − φk0 (j))δkk0
k,k0
n
X
= λ2t
k (φk (i) − φk (j))
2
k=1
= d2t (xi , xj )
In practice we usually do not use the mapping Φt but rather the truncate
diffusion map Φδt that makes use of fewer than n coordinates. Specifically, Φδt uses
t
only the eigenvectors for which the eigenvalues satisfy |λk | > δ. When t is enough
large, we can use the truncated diffusion distance:
2 21
X
(118) dδt (xi , xj ) = kΦδt (xi ) − Φδt (xj )k = [ λ2t
k (φk (i) − φk (j)) ]
k:|λk |t >δ
2
as an approximation of the weighted ` distance of the probability clouds. We now
derive a simple error bound for this approximation.
Lemma 1.3 (Truncated Diffusion Distance). The truncated diffusion distance sat-
isfies the following upper and lower bounds.
2δ 2
d2t (xi , xj ) − (1 − δij ) ≤ [dδt (xi , xj )]2 ≤ d2t (xi , xj ),
dmin
P
where dmin = min1≤i≤n di with di = j Wij .
1
Proof. Since, Φ = D− 2 V , where V is an orthonormal matrix (V V T =
T
V V = I), it follows that
1 1
(119) ΦΦT = D− 2 V V T D− 2 = D−1
Therefore,
n
X δij
(120) φk (i)φk (j) = (ΦΦT )ij =
di
k=1
and
n
X 1 1 2δij
(121) (φk (i) − φk (j))2 = + −
di dj di
k=1
116 7. DIFFUSION MAP
clearly,
n
X 2
(122) (φk (i) − φk (j))2 ≤ (1 − δij ), f orall i, j = 1, 2, · · · , n
dmin
k=1
As a result,
X
[dδt (xi , xj )]2 = d2t (xi , xj ) − λ2t
k (φk (i) − φk (j))
2
k:|λk |t <δ
X
≥ d2t (xi , xj ) − δ2 (φk (i) − φk (j))2
k:|λk |t <δ
n
X
≥ d2t (xi , xj ) − δ 2 (φk (i) − φk (j))2
k=1
2
2δ
≥ d2t (xi , xj ) − (1 − δij )
dmin
on the other hand, it is clear that
(123) [dδt (xi , xj )]2 ≤ d2t (xi , xj )
We conclude that
2δ 2
(124) d2t (xi , xj ) − (1 − δij ) ≤ [dδt (xi , xj )]2 ≤ d2t (xi , xj )
dmin
Therefore, for small δ the truncated diffusion distance provides a very good
approximation to the diffusion distance. Due to the fast decay of the eigenvalues,
the number of coordinates used for the truncated diffusion map is usually much
smaller than n, especially when t is large.
k=1
It follows that φk (i) = φk (j) for all k with λk 6= 0. But there is still the possibility
that φk (i) 6= φk (j) for k with λk = 0. We claim that this can happen only whenever
i and j have the exact same neighbors and proportional weights, that is:
2. COMMUTE TIME MAP AND DISTANCE 117
Proof. For simplicity, we will assume that P is irreducible such that the
stationary distribution is unique. We will give a constructive proof that Tij + Tji
is a squared distance of some Euclidean coordinates for xi and xj .
By definition, we have
X
(127) Tij+ = Pij · 1 + +
Pik (Tkj + 1)
k6=j
T + − S = 1uT , ∀u
which implies T + = S. T ’s uniqueness follows from T = T + − Td+ .
Now we continue with the proof of the main theorem. Since T = T + − Td+ ,
then (127) becomes
T = E + P T − Td+
(I − P )T = E − Td+
(I − D−1 W )T = F
(D − W )T = DF
LT = DF
where F = E − Td+ and L = D − W is the (unnormalized)Pn graph Laplacian. Since
T
L is symmetric and irreducible, we have L = Pn k=1 µ k νk ν k , where 0 = µ1 < µ2 ≤
· · · ≤ µn , ν1 = 1/||1||, νkT νl = δkl . Let L+ = k=2 µ1k νk νkT , L+ is called the pseudo-
inverse (or Moore-Penrose inverse) of L. We can test and verify L+ satisfies the
following four conditions + +
L LL = L+
+
LL L = L
+ T
(LL ) = LL+
+ T
(L L) = L+ L
2. COMMUTE TIME MAP AND DISTANCE 119
P
Note that vol(G) = i di and πi = di /vol(G) for all i.
Then
(129) Tij + Tji = vol(G)(L+ + +
ii + Ljj − 2Lij ).
Proof omitted. The reverse is also true, which is Bochner theorem. High dimen-
sional case is similar.
2
Take 1-dimensional as an example. Since the Gaussian distribution e−ξ /2 dξ
is a positive finite Borel measure, and the Fourier transform of Gaussian kernel is
2
itself, we know that k(x, y) = e−|x−y| /2 is a positive definite integral kernel. The
matrix W as an discretized version of k(x, y) keeps the positive-definiteness (make
this rigorous? Hint: take φ(x) as a linear combination of n delta functions).
3.1. Main Result. In this lecture, we will study the bias and variance de-
composition for sample graph Laplacians and their asymptotic convergence to
Laplacian-Beltrami operators on manifolds.
Let M be a smooth manifold without boundary in Rp (e.g. a d-dimensional
sphere). Randomly draw a set of n data points, x1, ..., xn ∈ M ⊂ Rp , according to
distribution p(x) in an independent and identically distributed (i.i.d.) way. We can
extract an n × n weight matrix Wij as follows:
Wij = k(xi , xj )
where k(x, y) is a symmetric k(x, y) = k(y, x) and positivity-preserving kernel
k(x, y) ≥ 0. As an example, it can be the heat kernel (or Gaussian kernel),
3. DIFFUSION MAP: CONVERGENCE THEORY 121
||xi − xj ||2
k (xi , xj ) = exp − ,
2
where || ||2 is the Euclidean distance in space Rp and is the bandwidth of the
kernel. Wij stands for similarity function between xi and xj . A diagonal matrix D
is defined with diagonal elements are the row sums of W :
n
X
Dii = Wij .
j=1
(α) (α) −1
Pn (α)
Denote A = (D ) W , and we can verify that j=1 Aij = 1, i.e.a row
Markov matrix. Now define L(α) = A(α) − I = (D(α) )−1 W (α) − I; and
1 (α)
L,α = (A − I)
when k (x, y) is used in constructing W . In general, L(α) and L,α are both called
graph Laplacians. In particular L(0) is the unnormalized graph Laplacian in litera-
ture.
The target is to show that graph Laplacian L,α converges to continuous differ-
ential operators acting on smooth functions on M the manifold. The convergence
can be roughly understood as: we say a sequence of n-by-n matrix L(n) as n → ∞
converges to a limiting operator L, if for L’s eigenfunction f (x) (a smooth function
on M) with eigenvalue λ, that is
Lf = λf,
the length-n vector f (n) = (f (xi )), (i = 1, · · · , n) is approximately an eigenvector
of L(n) with eigenvalue λ, that is
L(n) f (n) = λf (n) + o(1),
where o(1) goes to zero as n → ∞.
Specifically, (the convergence is in the sense of multiplying a positive constant)
(I) L,0 = 1 (A − I) → 12 (∆M + 2 ∇p p · ∇) as → 0 and n → ∞. ∆M is
the Laplace-Beltrami operator of manifold M . At a point on M which
is d-dimensional, in local (orthogonal) geodesic coordinate s1 , · · · , sd , the
Laplace-Beltrami operator has the same form as the laplace in calculus
d
X ∂2
∆M f = f;
i=1
∂s2i
1X 1X
F (xi ) = k (xi , xj )f (xj ), G(xi ) = k (xi, xj ).
n n
j6=i j6=i
depends only on the other n − 1 data points than xi . In what follows we treat
xi as a fixed chosen point and write as x.
Bias-Variance Decomposition. The points xj , j 6= i are independent iden-
tically distributed (i.i.d), therefore every term in the summation of F (x) (G(x))
are i.i.d., and by theR Law of Large Numbers (LLN) one should expect F (x) ≈
Ex1 [k(x, x1 )f (x1 )] = M k(x, y)f (y)p(y)dy (and G(x) ≈ Ek(x, x1 ) = M k(x, y)p(y)dy).
R
Recall that given a random variable x, and a sample estimator θ̂ (e.g. sample mean),
the bias-variance decomposition is given by
Ekx − θ̂k2 = Ekx − Exk2 + EkEx − θ̂k2 .
E[F ]
If we use the same strategy here (though not exactly the same, since E[ G
F
] 6= E[G]
!), we can decompose Eqn. (130) as
1 E[F ] 1 F (xi ) E[F ]
1
(Lf )i = − f (xi ) + f (xi )O( d ) + −
E[G] n 2 G(xi ) E[G]
= bias + variance.
In the below we shall show that for case (I) the estimates are
(131)
1 E[F ] ∇p
1 m2 d
bias = − f (x) + f (xi )O( d ) = (∆M f +2∇f · )+O()+O n−1 − 2 .
E[G] n 2 2 p
1 F (xi ) E[F ]
1 d
(132) variance = − = O(n− 2 − 4 −1 ),
G(xi ) E[G]
whence
1 d 1 d
bias + variance = O(, n− 2 − 4 −1 ) = C1 + C2 n− 2 − 4 −1 .
As the bias is a monotone increasing function of while the variance is decreasing
w.r.t. , the optimal choice of is to balance the two terms by taking derivative
1 d
of the right hand side equal to zero (or equivalently setting ∼ n− 2 − 4 −1 ) whose
solution gives the optimal rates
∗ ∼ n−1/(2+d/2) .
[CL06] gives the bias and [HAvL05] contains the variance parts, which are further
improved by [Sin06] in both bias and variance.
3.3. The Bias Term. Now focus on E[F ]
1 n−1
X Z
E[F ] = E k (xi , xj )f (xj ) =
k (x, y)f (y)p(y)dy
n n M
j6=i
n−1
n is close to 1 and is treated as 1.
(1) the case of one-dimensional and flat (which means the manifold M is just
a real line, i.e.M = R)
(x−y)2
Let f˜(y) = f (y)p(y), and k (x, y) = √1 e− 2 , by change of variable
√
y = x + z,
124 7. DIFFUSION MAP
we have
√ 1
Z
2
= f˜(x + z)e− 2 dz = m0 f˜(x) + m2 f 00 (x) + O(2 )
R 2
2 2
where m0 = R e− 2 dz, and m2 = R z 2 e− 2 dz.
R R
Z Z Z
k (x, y)f˜(y)p(y)dy = √
·+ √
·
m ||x−y||>c ||x−y||<c
First part = ◦
1 2
| ◦ | ≤ ||f˜||∞ a e− 2 ,
2
√
due to ||x − y||2 > c
1
c ∼ ln( ).
so this item is tiny and can
√ be ignored.
Locally, that is u ∼ , we have the curve in a plane and has the
following parametrized equation
1 1 1
||x − y||2 = [u2 + (au2 + qu3 + ...)2 ] = [u2 + a2 u4 + q5 (u) + · · · ],
u
where we mark a2 u4 + 2aqu5 + ... = q5 (u). Next, change variable √
= z,
− ξ2
then with h(ξ) = e
||x − y|| 2 3
h( ) = h(z 2 ) + h0 (z 2 )(2 az 4 + 2 q5 + O(2 )),
also
df˜ 1 d2 f˜
f˜(s) = f˜(x) + (x)s + (x)s2 + · · ·
ds 2 ds2
and
Z u p
s= 1 + (2au + 3quu2 + ...)2 du + · · ·
0
and
ds 2
= 1 + 2a2 u2 + q2 (u) + O(2 ), s = u + a2 u3 + O(2 ).
du 3
3. DIFFUSION MAP: CONVERGENCE THEORY 125
3.4. Variance Term. Our purpose is to derive the large deviation bound for2
E[F ]
F
(135) P rob − ≥α
G E[G]
where F = F (xi ) = n j6=i k (xi , xj )f (xj ) and G = G(xi ) = n1 j6=i k (x, xj ).
1
P P
With x1 , x2 , ..., xn as i.i.d random variables, F and G are sample means (up to a
scaling constant). Define a new random variable
Y = E[G]F − E[F ]G − αE[G](G − E[G])
which is of mean zero and Eqn. (135) can be rewritten as
P rob(Y ≥ αE[G]2 ).
For simplicity by Markov (Chebyshev) inequality 3 ,
E[Y 2 ]
P rob(Y ≥ αE[G]2 ) ≤
α2 E[G]4
and setting the right hand side to be δ ∈ (0, 1), then with probability at least 1 − δ
the following holds !
E[Y 2 ] E[Y 2 ]
p p
α≤ √ ∼O .
E[G]2 δ E[G]2
It remains to bound
E[Y 2 ] = (EG)2 E(F 2 ) − 2(EG)(EF )E(F G) + (EF )2 E(G2 ) + ...
+2α(EG)[(EF )E(G2 ) − (EG)E(F G)] + α2 (EG)2 (E(G2 ) − (EG)2 ).
So it suffices to give E(F ), E(G), E(F G), E(F 2 ), and E(G2 ). The former two are
given in bias and for the variance parts in latter three, let’s take one simple example
with E(G2 ).
Recall that x1 , x2 , ..., xn are distributed i.i.d according to density p(x), and
1X
G(x) = k (x, xj ),
n
j6=i
so Z
1 2 2
V ar(G) = 2 (n − 1) k (x, y)) p(y)dy − (Ek (x, y)) .
n M
Look at the simplest case of 1-dimension flat M for an illustrative example:
1 √
Z Z
(k (x, y))2 p(y)dy = √ h2 (z 2 )(p(x) + p0 (x)( z + O()))dz,
M R
R 2 2
let M2 = h (z )dz
R
1 √
Z
(k (x, y))2 p(y)dy = p(x) · √ M2 + O( ).
M
Recall that Ek (x, y) = O(1), we finally have
1 p(x)M2 1
V ar(G) ∼ √ + O(1) ∼ √ .
n n
2The opposite direction is omitted here.
3It means that P rob(X > α) ≤ E(X 2 )/α2 . A Chernoff bound with exponential tail can be
found in Singer’06.
4. *VECTOR DIFFUSION MAP 127
d
Generally, for d-dimensional case, V ar(G) ∼ n−1 − 2 . Similarly one can derive
estimates on V ar(F ).
Ignoring the joint effect ofpE(F G), one can somehow get a rough estimate
based on F/G = [E(F ) + O( E(F 2 ))]/[E(G) + O( E(G2 ))] where we applied
p
the Markov inequality on both the numerator and denominator. Combining those
estimates together, we have the following,
1 d
F f p + m22 (∆(f p) + E[f p]) + O(2 , n− 2 − 4 )
= 1 d
G p + m22 (∆p + E[p]) + O(2 , n− 2 − 4 )
m2 1 d
= f + (∆p + E[p]) + O(2 , n− 2 − 4 ),
2
here O(B1 , B2 ) denotes the dominating one of the two bounds B1 and B2 in the
asymptotic limit. As a result, the error (bias + variance) of L,α (dividing another
) is of the order
1 d
(136) O(, n− 2 − 4 −1 ).
In [Sin06] paper, the last term in the last line is improved to
1 d 1
(137) O(, n− 2 − 4 − 2 ),
F
where the improvement is by carefully analyzing the large deviation bound of G
around EG shown above, making use of the fact that F and G are correlated.
EF
The organization of this lecture notes is as follows: We first review graph Lapla-
cian and diffusion mapping on graphs as the basis for vector diffusion mapping. We
then introduce three examples of vector bundles on graphs. After that, we come to
vector diffusion mapping. Finally, we introduce some conclusions about the con-
vergence of vector diffusion mapping.
n(n−1) n(n−1)
where diag(wij ) ∈ R 2 × 2 is the diagonal matrix that has wij on the diag-
onal position corresponding to hi, ji.
u∗ v = hu, vi
Then,
L = D − W = δ0T diag(wij )δ0 = δ0∗ δ0
We first look at the graph Laplacian operator. We solve the generalized eigen-
value problem:
Lf = λDf
denote the generalized eigenvalues as:
0 = λ1 ≤ λ2 ≤ · · · ≤ λn
and the corresponding generalized eigenvectors:
f1 , · · · , fn
we have already obtained the m-dimensional Laplacian eigenmap:
xi → (f1 (i), · · · , fm (i))
We now explains that this is the optimal embedding that preserves locality in the
sense that connected points stays as close as possible. Specifically speaking, for the
one-dimensional embedding, the problem is:
X
min (yi − yj )2 wij = 2miny yT Ly
i,j
1 1 1 1
y T Ly = y T D− 2 (I − D− 2 W D− 2 )D− 2 y
1 1 1 1
Since I − D− 2 W D− 2 is symmetric, the object is minimized when D− 2 )D− 2 y is
the eigenvector for the second smallest eigenvalue(the first smallest eigenvalue is
1 1
0) of I − D− 2 W D− 2 , which is the same with λ2 , the second smallest generalized
eigenvalue of L.
Similarly, the m-dimensional optimal embedding is given by Y = (f1 , · · · , fm ).
In diffusion map, the weights are used to define a discrete random walk. The
transition probability in a single step from i to j is:
wij
aij =
deg(i)
Then the transition matrix A = D−1 W .
1 1 1 1
A = D− 2 (D− 2 W D− 2 )D 2
Therefore, A is similar to a symmetric matrix, and has n real eigenvalues µ1 , · · · , µn
and the corresponding eigenvectors φ1 , · · · , φn .
Aφi = µi φi
At is the transition matrix after t steps. Thus, we have:
At φi = µti φi
Define Λ as the diagonal matrix with Λ(i, i) = µi , Φ = [φ1 , · · · , φn ]. The diffusion
map is given by:
Φt := ΦΛt = [µt1 φ1 , · · · , µtn φn ]
130 7. DIFFUSION MAP
4.3. the embedding given by diffusion map. Φt (i) denotes the ith row of
Φt .
n
X At (i, k) At (j, k)
hΦt (i), Φ(j)i = p p
k=1
deg(k) deg(k)
we can thus define a distance called diffusion distance
n
X (At (i, k) − At (j, k))2
d2DM,t (i, j) := hΦt (i), Φ(i)i+hΦt (j), Φ(j)i−2hΦt (i), Φ(j)i =
deg(k)
k=1
i∼j
4. *VECTOR DIFFUSION MAP 131
we will later define the vector diffusion mapping, and using the similar argument
as in diffusion mapping, it is easy to see that vector diffusion mapping gives the
optimal embedding that preserves locality in this sense.
we now discuss how we get the approximation of parallel transport operator
given the data set.
The approximation of the tangent space at a certain point xi is given by local PCA.
Choose i to be sufficiently small, and denote xi1 , · · · , xiNi as the data points in
the i -neighborhood of xi . Define
Xi := [xi1 − xi , · · · , xiNi − xi ]
Denote Di as the diagonal matrix with
s
||xij − xi ||
Di (j, j) = K( ), j = 1, · · · , Ni
i
Bi := Xi Di
Perform SVD on Bi :
Bi = Ui Σi ViT
We use the first d columns of Ui (which are the left eigenvectors of the d largest
eigenvalues of Bi ) to form an approximation of the tangent space at xi . That is,
Oi = [ui1 , · · · , uid ]
Then Oi is a numerical approximation to an orthonormal basis of the tangent space
at xi .
For connected points xi and xj , since they are sufficiently close to each other,
their tangent space should be close. Therefore, Oi Oij and Oj should also be close.
We there use the closest orthogonal matrix to OiT Oj as the approximation of the
parallel transport operator from xj to xi :
ρij := argminOorthogonol ||O − OiT Oj ||HS
where ||A||2HS = T r(AAT ) is the Hilbert-Schimidt norm.
u := u diag(wij ), u∗ v = hu, vi
∗
k=1
We use ||S̃ 2t (i, j)||2HS to measure the affinity between i and j. Thus,
||S̃ 2t (i, j)||2HS = T r(S̃ 2t (i, j)S̃ 2t (i, j)T )
Pnd
= (λk λl )2t T r(vk (i)vk (j)T vl (j)vl (i)T )
Pk,l=1
nd
= (λk λl )2t T r(vk (j)T vl (j)vl (i)T vk (i))
Pk,l=1
nd 2t
= k,l=1 (λk λl ) hvk (j), vl (j)ihvk (i), vl (i)i
The vector diffusion mapping is defined as:
Vt : i → ((λk λl )t hvk (i), vl (i)i)nd
k,l=1
Semi-supervised Learning
1. Introduction
Problem: x1 , x2 , ..., xl ∈ Vl are labled data, that is data with the value f (xi ), f ∈
V → R observed. xl+1 , xl+2 , ..., xl+u ∈ Vu are unlabled. Our concern is how to fully
exploiting the information (like geometric structure in disbution) provided in the
labeled and unlabeled data to find the unobserved labels.
This kind of problem may occur in many situations, like ZIP Code recognition.
We may only have a part of digits labeled and our task is to label the unlabeled
ones.
where
Σll Σlu
Σ=
Σul Σuu
135
136 8. SEMI-SUPERVISED LEARNING
Block matrix inversion formula tells us that when A and D are invertible,
−1 −1
−A−1 BSA
A B X Y X Y SD
· =I⇒ = −1 −1
C D Z W Z W −D−1 CSD SA
−1 −1
BD−1
X Y A B X Y SD −SD
· =I⇒ = −1 −1
Z W C D Z W −SA CA−1 SA
where SA = D − CA−1 B and SD = A − BD−1 C are called Schur complements of
A and D, respectively. The matrix expressions for inverse are equivalent when the
matrix is invertible.
The graph Laplacian
Dl − Wll Wlu
L=
Wul Du − Wuu
is not invertible.
P Dl − Wll and Du − Wuu are both strictly diagonally dominant, i.e.
Dl (i, i) > j |Wll (i, j)|, whence they are invertible by Gershgorin Circle Theorem.
However their Schur complements SDu −Wuu and SDl −Wll are still not invertible and
the block matrix inversion formula above can not be applied directly. To avoid this
issue, we define a regularized version of graph Laplacian
Lλ = L + λI, λ>0
and study its inverse Σλ = L−1
λ .
By the block matrix inversion formula, we can set Σ as its right inverse above,
−1 −1
−(λ + Dl − Wll )−1 Wlu Sλ+D
Sλ+Du −Wuu l −Wll
Σλ = −1 −1
−(λ + Du − Wuu )−1 Wul Sλ+D u −Wuu
Sλ+D l −Wll
Therefore,
fu,λ = Σul,λ Σ−1
ll,λ fl = (λ + Du − Wuu )
−1
Wul fl ,
whose limit however exits limλ→0 fu,λ = (Du − Wuu )−1 Wul fl = fu . This implies
that fu can be regarded as the conditional mean given fl .
Note that the probability above also called committor function in Transition
Path Theory of Markov Chains.
The result coincides with we obtained through the view of gaussian markov
random field.
5. Well-posedness
One natural problem is: if we only have a fixed amount of labeled data, can
we recover labels of an infinite amount of unobserved data? This is called well-
posedness. [Nadler-Srebro 2009] gives the following result:
• If xi ∈ R1 , the problem is well-posed.
• If xi ∈ Rd (d ≥ 3), the problem is ill-posed in which case Du − Wuu
becomes singular and f becomes a bump function (fu is almost always
zeros or ones except on some singular points).
Here we can give a brief explanation:
Z
f T Lf ∼ k∇f k2
kx−x0 k22
(
2 kx − x0 k2 <
If we have Vl = {0, 1}, f (x0 ) = 0, f (x1 ) = 1 and let f (x) = .
1 otherwise
From multivariable calculus,
Z
k∇f k2 = cd−2 .
138 8. SEMI-SUPERVISED LEARNING
R
Since d ≥ 3, so → 0 ⇒ k∇f k2 → 0. So f (x) ( → 0) converges to a bump func-
tion which is one almost everywhere except x0 whose value is 0. No generalization
ability is learned for such bump functions.
This means in high dimensional case, to obtain a smooth generalization, we
have to add constraints more than the norm of the first order derivatives. We
also have a theorem to illustrate what kind of constraint is enough for a good
generalization:
Theorem 5.1 (Sobolev embedding Theorem). f ∈ Ws,p (Rd ) ⇐⇒ f has s’th
order weak derivative f (s) ∈ Lp ,
d
s> ⇒ Ws,2 ,→ C(Rd ).
2
So in Rd , to obtain a continuous function, one needs smoothness regularization
k∇s f k with degree s > d/2. To implement this in discrete Laplacian setting, one
R
may consider iterative Laplacian Ls which might converge to high order smoothness
regularization.
CHAPTER 9
• Monotonicity: W∗ ⊆ W∗0 if ≤ 0 four members. The following definition is used to model this
connecting two simplexes,
kindthen we set
of topological theWelength
property. between
have modified the original
• But not easy to control homotopy types between W If∗ two
them to ∞. andsimplexes
X definitionareofq-connected,
“connectiveness” in then
present application.
they
Q-analysis also
to cater for our
columns contain “0.” We associate with r 1 a 0-simplex c1equivalence c2 c3 c4 c5 classes are called
0
σ(r ) and the
0
σ(r 5) q-connected
are not connected tocomponents
6
any of the other four
0 = (c ). In a similar way, we obtain the following r1 1 of0".0 Let simplexes.
σ(r 1) 1 r2 1 1 1 0 0 q
0 0
Q denote the number A furtherof structure can be definedcomponents
q-connected on a simplicial family,
simplexes for the remaining rows: r3 0 in0".1 The 1 determination
0 (8) of the components and Qq for each
as follows.
Example (Flag Complex of Paired Comparison Graph, r4 0 0 Jiang-Lim-Yao-Ye
1 1 0 2011[JLYY11]).
r5 0 value 0 0 of0 q 1is termed a Q-analysis Definition 4. of The ".relation “is q-connected to” on a simplicial
Let V be a set ofσ(r2alternatives
2 ) = (c 1 , c2 , c3to
), be compared and r6 undirected
0 0 0 0 1 pair (i, j) family ∈ E", if theby rq, is an equivalence relation. Let "q
denoted
be the set of simplexes in " with dimension greater than
pair is comparable. σ(r1 A =flag
(c3 , ccomplex
4 ),
χG consists
with six rows all r1 , r2 ,cliques
, r6 and fiveas
. . .Example 5.simplices
The
columns c1 , cresult orQ-analysis
2 , . . . , c5 .of orfaces
equal to q,(e.g. whereforq =the 0, 1,simplicial
. . . , dim". Then, family
rq partitions
3) For row r 1 , the column c1 contains a “1” and the other "q into equivalence classes of q-connected simplexes. These
3-cliques as 2-faces1 and k + 1-cliques as k-faces), columns contain “0.” in also We Example
associate with 3
called r 1 is
clique given in equivalence
a 0-simplexcomplex Tableof2.classes G.Since the the
are called highest
q-connected dimen-
components
σ(r4 ) = (c3 , c4 ), 0 = (c ).(9) In a similar way, we obtain the following of ". Let Qq denote the number of q-connected components
σ(r ) 1
simplexes for the remaining rows:
1sion of the simplexes is 2, the Q-analysis of the simplicial
in ". The determination of the components and Qq for each
Example (Strategic 0 Simplicial Complex for Flow Games,
σ(r5 ) = (c5 ), family Candogan-Menache-Ozdaglar–
2 = (c , c , c ),
σ(r
has three levels corresponding value of q is termed ato q = 0,1
Q-analysis of ".and 2. The
1 2 3
Parrilo 2011 [CMOP11]). 0
σ(r6 ) = (c5 ). Strategic simplicial complex
σ(r level
1
)
q =is2 the consists
2
clique of those complex
simplexes
Example
ofwith
5. The result dimension
of Q-analysis greater
for the simplicial family
) = (c3 , c4 ), 3
pairwise comparison graph G = (V, E) of strategicσ(r1profiles, than
) = (c 3 ,or
c4 ), equal
where
4
to 2;
V hence,
consists
(9) this
in Example level
of3 isall
given contains
in Table 2. Since one thesimplex
sion of the simplexes is 2, the Q-analysis of the simplicial
highest dimen-
2 1 and
We draw the six simplexes in Figure 4, from which we σ(r0 σ) =(r2(c)5.),Next, at the level q = 1, hastwo threemore simplexesto qσ=(r
0
5 family levels corresponding 0,1) and 2. The
3
σ(r ) = 1 (c5 ). level q = 2 consists of those simplexes with dimension greater
see clearly that they do form a simplicial family. However, σ(r4 ) come in, which are 1-connected
6
than or equal to 2; by hence, a this
chain level of length
contains 1
one simplex
2 . Next, at the level q = 1, two more simplexes σ 1 and
We draw the six simplexes in Figure 4, from which we σ(r 2) (r3 )
see clearly that they do form a simplicial family. However, 1 come in, which are 1-connected by a chain of length 1
σ(r 4)
2
(O, O) (O, F )
O F O F
O 3, 2 0, 0 3O 4, 2 0,2 0
F 0, 0 2, 3 F 1, 0 2, 3
3
(a) Battle of the sexes (F,(b)
O)Modified(F, F ) of
battle
the sexes
Figure 2: Flows on the game graph corresponding to “battle of the sexes” (Example 2.2).
Figure 2. Illustration of Game Strategic Complex: Battle of Sex
It is easy to see that these two games have the same pairwise comparisons, which will lead to
identical equilibria for the two games: (O, O) and (F, F ). It is only the actual equilibrium payoffs
that would differ. In
Example 2.3. particular,
2. ConsiderinaHomology
Persistent the equilibrium
three-player (O, O),
andgame, the
where
Discrete payoff
eachTheory
Morse of the can
player row chooseplayer isbetween
increased
two strategies
by 1. {a, b}. Recall
We represent
that the strategic interactions among the players by the directed graph in Figure
3a, where the payoff of player i is −1 if its strategy is identical to the strategy of its successor
The usual solution
Theorem concepts
2.1 in games (e.g., Nash, mixed Nash, correlated equilibria) are defined
(“Sandwich”).
in terms of pairwise comparisons only. VGames R ⊆ Cwith identical pairwise comparisons share the same
⊆ V R2
equilibrium sets. Thus, we refer to games with identical pairwise comparisons as strategically
• If a homology group “persists” through R 7 → R2 , then it must exists in
equivalent games. C ; but not the vice versa.
By employing•the Allnotion
above givesof pairwise
rise to acomparisons, we can concisely
filtration of simplicial complex represent any strategic-form
game in terms of a flow in a graph. We recall this notion next. Let G = (N, L) be an undirected
∅ = Σ0 ⊆ Σ1 ⊆ Σ2 ⊆ . . .
graph, with set of nodes N and set of links L. An edge flow (or just flow ) on this graph is a function
• Functoriality of inclusion: there are homomorphisms between homology
Y : N × N → R such that Y (p, q) = −Y (q, p) and Y (p, q) = 0 for (p, q) ∈ / L [21, 2]. Note that
groups
the flow conservation equations are not0 → enforced under
H1 → H2 → . . . this general definition.
Given a game• G, we define a graph where each node corresponds to a strategy profile, and
A persistent homology is the image of Hi in Hj with j > i.
each edge connects two comparable strategy profiles. This undirected graph is referred to as the
Persistent Homology is firstly proposed by Edelsbrunner-Letscher-Zomorodian,
game graphwithandanisalgebraic
denotedformulation
by G(G) by � (E, A), where E andThe
Zomorodian-Carlsson. A are the strategy
algorithm is equivalent profiles and pairs
of comparable strategy
to Robin Forman’sprofiles defined
discrete Morseabove,
theory.respectively. Notice that, by definition, the graph
G(G) has the structure of a direct product of M cliques (one per player), with clique m having
to be continued...
hm vertices. The pairwise comparison function X : E × E → R defines a flow on G(G), as it
3. Exterior Calculus on Complex and Combinatorial Hodge Theory
satisfies X(p, q) = −X(q, p) and X(p, q) = 0 for (p, q) ∈ / A. This flow may thus serve as an
2 d
equivalent representation of any game (up to a “non-strategic”l (V
We are going to study functions on simplicial complex, ).
component). It follows directly
A basis of “forms”:
from the statements above that two games are strategically equivalent if and only if they have the
2 2
P
• l (V ): e (i
same flow representation and game graph.
i ∈ V ), so f ∈ l (V ) has a representation f = i∈V f i e i , e.g.
global ranking score on VP.
Two examples of 2 game
2 graph representations are given below.
2 2
• l (V ): eij = −eji , f = (i,j) fij eij for f ∈ l (V ), e.g. paired compari-
2
son scores
Example 2.2. Consider on Vthe
again . “battle of the sexes” game from Example 2.1. The game graph
2 3
P
• l (V ):
has four vertices, correspondinge ijk = e
tojkithe kij = −e
= edirect jik = −e
product ofkji
two= −eikj , f = and
2-cliques, fijk
ijk is eijk
presented in Figure 2.
2 d+1
• l (V ): ei0 ,...,id is an alternating d-form
2 σ(i0 ),...,σ(id ) ,
ei0 ,...,id = sign(σ)e
(O, O) (O, F )
where σ ∈ Sd is a permutation on {0, . . . , d}.
3 2
3
(F, O) (F, F )
Figure 2: Flows on the game graph corresponding to “battle of the sexes” (Example 2.2).
142 9. BEYOND GRAPHS: HIGH DIMENSIONAL TOPOLOGICAL/GEOMETRIC ANALYSIS
Vector spaces of functions l2 (V d+1 ) represented on such basis with an inner product
defined, are called d-forms (cochains).
Example. In the crowdsourcing ranking of world universities,
http://www.allourideas.org/worldcollege/,
V consists of world universities, E are university pairs in comparison, l2 (V ) consists
of ranking scores of universities, l2 (V 2 ) is made up of paired comparison data.
Discrete differential operators: k-dimensional coboundary maps δk : L2 (V k ) →
L (V k+1 ) are defined as the alternating difference operator
2
k+1
X
(δk u)(i0 , . . . , ik+1 ) = (−1)j+1 u(i0 , . . . , ij−1 , ij+1 , . . . , ik+1 )
j=0
where
Hk = ker(δk−1 ) ∩ ker(δkT ) = ker(∆k ).
• dim(Hk ) = βk .
A simple understanding is possible via Dirac operator:
D = δ + δ ∗ : ⊕k L2 (V k ) → ⊕k L2 (V k )
Hence D = D∗ is self-adjoint. Combine the chain map
δ δ δk−1 δ
L2 (V ) −→
0
L2 (V 2 ) −→
1
L2 (V 3 ) → . . . L2 (V k ) −−−→ L2 (V k+1 ) −→
k
...
into a big operator: Dirac operator.
Abstract Hodge Laplacian:
∆ = D2 = δδ ∗ + δ ∗ δ,
since δ 2 = 0.
By the Fundamental Theorem of Linear Algebra (Closed Range Theorem in
Banach Space),
⊕k L2 (V k ) = im(D) ⊕ ker(D)
where
im(D) = im(δ) ⊕ im(δ ∗ )
and ker(D) = ker(∆) is the space of harmonic forms.
Our statistical rank aggregation problem is to look for some global ranking
score s : V → R such that
X
α
(139) min ωij (si − sj − Yijα )2 ,
s∈R|V |
i,j,α
where
(153) Ŷijg = ŝi − ŝj , for some ŝ ∈ RV ,
The decomposition
P above is orthogonal under the following inner product on R|E| ,
hu, viω = {i,j}∈E ωij uij vij .
The following provides some remarks on the decomposition.
1. When G is connected, Ŷijg is a rank two skew-symmetric matrix and gives a
linear score function ŝ ∈ RV up to translations. We thus call Ŷ g a gradient flow
since it is given by the difference (discrete gradient) of the score function ŝ on graph
nodes,
(156) Ŷijg = (δ0 ŝ)(i, j) := ŝi − ŝj ,
where δ0 : RV → RE is a finite difference operator (matrix) on G. ŝ can be chosen
as any least square solution of (140), where we often choose the minimal norm
solution,
(157) ŝ = ∆†0 δ0∗ Ŷ ,
where δ0∗ = δ0T W (W P = diag(ωij )), ∆0 = δ0∗ ·δ0 is the unnormalized graph Laplacian
defined by (∆0 )ii = j∼i ωij and (∆0 )ij = −ωij , and (·)† is the Moore-Penrose
(pseudo) inverse. On a complete and balanced graph, (157) is reduced to ŝi =
1
P
n−1 j6=i Ŷij , often called Borda Count as the earliest preference aggregation rule in
social choice [JLYY11]. For expander graphs like regular graphs, graph Laplacian
∆0 has small condition numbers and thus the global ranking is stable against noise
on data.
2. Ŷ h satisfies two conditions (154) and (155), which are called curl-free and
divergence-free conditions respectively. The former requires the triangular trace
of Ŷ to be zero, on every 3-clique in graph G; while the later requires the total
sum (inflow minus outflow) to be zero on each node of G. These two conditions
characterize a linear subspace which is called harmonic flows.
3. The residual Ŷ^c satisfies (155) but not (154). In fact, it measures the amount of intrinsic (local) inconsistency in Ŷ captured by the triangular trace; we often call this component the curl flow. In particular, the relative curl
(158)    curl^r_{ijk} = |Ŷ_{ij} + Ŷ_{jk} + Ŷ_{ki}| / (|Ŷ_{ij}| + |Ŷ_{jk}| + |Ŷ_{ki}|) = |Ŷ^c_{ij} + Ŷ^c_{jk} + Ŷ^c_{ki}| / (|Ŷ_{ij}| + |Ŷ_{jk}| + |Ŷ_{ki}|) ∈ [0, 1],
can be used to characterize triangular intransitivity: curl^r_{ijk} = 1 if and only if {i, j, k} contains an intransitive triangle of Ŷ. Note that computing the percentage of triangles with curl^r_{ijk} = 1 is equivalent to calculating the Transitivity Satisfaction Rate (TSR) in complete graphs.
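The relative curl (158) is easy to evaluate on every 3-clique. The sketch below does so for hypothetical skew-symmetric data containing one intransitive cycle; only that triangle attains the value 1.

import numpy as np
from itertools import combinations

def relative_curl(Yhat, tri):
    """Relative curl (158) of a skew-symmetric pairwise array on a 3-clique."""
    i, j, k = tri
    num = abs(Yhat[i, j] + Yhat[j, k] + Yhat[k, i])
    den = abs(Yhat[i, j]) + abs(Yhat[j, k]) + abs(Yhat[k, i])
    return num / den if den > 0 else 0.0

# Hypothetical data: an intransitive cycle 0 > 1 > 2 > 0, plus a node 3 losing to all.
Yhat = np.zeros((4, 4))
for (i, j), v in {(0, 1): 1.0, (1, 2): 1.0, (2, 0): 1.0,
                  (0, 3): 2.0, (1, 3): 1.0, (2, 3): 0.5}.items():
    Yhat[i, j], Yhat[j, i] = v, -v

for tri in combinations(range(4), 3):
    print(tri, round(relative_curl(Yhat, tri), 3))
# Only (0, 1, 2) attains relative curl 1.0, flagging the intransitive triangle.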
Figure 3 illustrates the Hodge decomposition of paired comparison flows, and Algorithm 5 shows how to compute the global ranking and the other components. The reader may refer to [JLYY11] for the details of the theoretical development; here we have only made a few remarks on the application of HodgeRank in our setting.
5. Euler-Calculus
to be finished...
Bibliography
[AC09] A. Cohen, W. Dahmen, and R. DeVore, Compressed sensing and best k-term approximation, J. Amer. Math. Soc. 22 (2009), no. 1, 211–231.
[Ach03] Dimitris Achlioptas, Database-friendly random projections: Johnson-lindenstrauss with binary coins, Journal of Computer and System Sciences 66 (2003), 671–687.
[Ali95] F. Alizadeh, Interior point methods in semidefinite programming with applications
to combinatorial optimization, SIAM J. Optim. 5 (1995), no. 1, 13–51.
[Aro50] N. Aronszajn, Theory of reproducing kernels, Transactions of the American Mathe-
matical Society 68 (1950), no. 3, 337–404.
[Bav11] Francois Bavaud, On the schoenberg transformations in data analysis: Theory and
illustrations, Journal of Classification 28 (2011), no. 3, 297–314.
[BDDW08] Richard Baraniuk, Mark Davenport, Ronald DeVore, and Michael Wakin, A simple
proof of the restricted isometry property for random matrices, Constructive Approx-
imation 28 (2008), no. 3, 253–263.
[BLT+ 06] P. Biswas, T.-C. Liang, K.-C. Toh, T.-C. Wang, and Y. Ye, Semidefinite programming
approaches for sensor network localization with noisy distance measurements, IEEE
Transactions on Automation Science and Engineering 3 (2006), 360–371.
[BN01] Mikhail Belkin and Partha Niyogi, Laplacian eigenmaps and spectral techniques
for embedding and clustering, Advances in Neural Information Processing Systems
(NIPS) 14, MIT Press, 2001, pp. 585–591.
[BN03] Mikhail Belkin and Partha Niyogi, Laplacian eigenmaps for dimensionality reduction
and data representation, Neural Computation 15 (2003), 1373–1396.
[BN08] Mikhail Belkin and Partha Niyogi, Convergence of laplacian eigenmaps, Tech. report,
2008.
[BP98] Sergey Brin and Larry Page, The anatomy of a large-scale hypertextual web search
engine, Proceedings of the 7th international conference on World Wide Web (WWW)
(Australia), 1998, pp. 107–117.
[BS10] Zhidong Bai and Jack W. Silverstein, Spectral analysis of large dimensional random
matrices, Springer, 2010.
[BTA04] Alain Berlinet and Christine Thomas-Agnan, Reproducing kernel hilbert spaces in
probability and statistics, Kluwer Academic Publishers, 2004.
[Can08] E. J. Candès, The restricted isometry property and its implications for compressed
sensing, Comptes Rendus de l’Académie des Sciences, Paris, Série I 346 (2008), 589–
592.
[CDS98] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders, Atomic decompo-
sition by basis pursuit, SIAM Journal on Scientific Computing 20 (1998), 33–61.
[Chu05] Fan R. K. Chung, Laplacians and the cheeger inequality for directed graphs, Annals
of Combinatorics 9 (2005), no. 1, 1–19.
[CL06] Ronald R. Coifman and Stéphane Lafon, Diffusion maps, Applied and Computational Harmonic Analysis 21 (2006), 5–30.
[CLL+ 05] R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W.
Zucker, Geometric diffusions as a tool for harmonic analysis and structure definition
of data: Diffusion maps i, Proceedings of the National Academy of Sciences of the
United States of America 102 (2005), 7426–7431.
[CLMW09] E. J. Candès, Xiaodong Li, Yi Ma, and John Wright, Robust principal component
analysis, Journal of ACM 58 (2009), no. 1, 1–37.
[CMOP11] Ozan Candogan, Ishai Menache, Asuman Ozdaglar, and Pablo A. Parrilo, Flows and
decompositions of games: Harmonic and potential games, Mathematics of Operations
Research 36 (2011), no. 3, 474–503.
[CPW12] V. Chandrasekaran, P. A. Parrilo, and A. S. Willsky, Latent variable graphical model
selection via convex optimization (with discussion), Annals of Statistics (2012), to
appear, http://arxiv.org/abs/1008.1290.
[CR09] E. J. Candès and B. Recht, Exact matrix completion via convex optimization, Foundation of Computational Mathematics 9 (2009), no. 6, 717–772.
[CRPW12] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, The convex geometry
of linear inverse problems, Foundation of Computational Mathematics (2012), to
appear, http://arxiv.org/abs/1012.0621.
[CRT06] Emmanuel J. Candès, Justin Romberg, and Terence Tao, Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information, IEEE Trans. on Info. Theory 52 (2006), no. 2, 489–509.
[CSPW11] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. Willsky, Rank-sparsity incoherence for matrix decomposition, SIAM Journal on Optimization 21 (2011), no. 2, 572–596, http://arxiv.org/abs/0906.2220.
[CST03] N. Cristianini and J. Shawe-Taylor, An introduction to support vector machines and
other kernel-based learning methods, Cambridge University Press, 2003.
[CT05] E. J. Candès and Terence Tao, Decoding by linear programming, IEEE Trans. on Info. Theory 51 (2005), 4203–4215.
[CT06] Emmanuel J. Candès and Terence Tao, Near optimal signal recovery from random projections: Universal encoding strategies, IEEE Trans. on Info. Theory 52 (2006), no. 12, 5406–5425.
[CT10] E. J. Candès and T. Tao, The power of convex relaxation: Near-optimal matrix
completion, IEEE Transaction on Information Theory 56 (2010), no. 5, 2053–2080.
[Dav88] H. David, The methods of paired comparisons, 2nd ed., Griffin’s Statistical Mono-
graphs and Courses, 41, Oxford University Press, New York, NY, 1988.
[DG03a] Sanjoy Dasgupta and Anupam Gupta, An elementary proof of a theorem of johnson
and lindenstrauss, Random Structures and Algorithms 22 (2003), no. 1, 60–65.
[DG03b] David L. Donoho and Carrie Grimes, Hessian eigenmaps: Locally linear embedding
techniques for high-dimensional data, Proceedings of the National Academy of Sci-
ences of the United States of America 100 (2003), no. 10, 5591–5596.
[dGJL07] Alexandre d’Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert R. G.
Lanckriet, A direct formulation for sparse pca using semidefinite programming, SIAM
Review 49 (2007), no. 3, http://arxiv.org/abs/cs/0406021.
[DH01] David L. Donoho and Xiaoming Huo, Uncertainty principles and ideal atomic de-
composition, IEEE Transactions on Information Theory 47 (2001), no. 7, 2845–2862.
[EB01] M. Elad and A.M. Bruckstein, On sparse representations, International Conference
on Image Processing (ICIP) (Tsaloniky, Greece), November 2001.
[ELVE08] Weinan E, Tiejun Li, and Eric Vanden-Eijnden, Optimal partition and effective dy-
namics of complex networks, Proc. Nat. Acad. Sci. 105 (2008), 7907–7912.
[ER59] P. Erdős and A. Rényi, On random graphs i, Publ. Math. Debrecen 6 (1959), 290–297.
[EST09] Ioannis Z. Emiris, Frank J. Sottile, and Thorsten Theobald, Nonlinear computational
geometry, Springer, New York, 2009.
[EVE06] Weinan E and Eric Vanden-Eijnden, Towards a theory of transition paths, J. Stat.
Phys. 123 (2006), 503–523.
[EVE10] Weinan E and Eric Vanden-Eijnden, Transition-path theory and path-finding algo-
rithms for the study of rare events, Annual Review of Physical Chemistry 61 (2010),
391–420.
[Gro11] David Gross, Recovering low-rank matrices from few coefficients in any basis, IEEE
Transaction on Information Theory 57 (2011), 1548, arXiv:0910.1879.
[HAvL05] M. Hein, J. Audibert, and U. von Luxburg, From graphs to manifolds: weak and
strong pointwise consistency of graph laplacians, COLT, 2005.
[JL84] W. B. Johnson and J. Lindenstrauss, Extensions of lipschitz maps into a hilbert space,
Contemp Math 26 (1984), 189–206.
[JLYY11] Xiaoye Jiang, Lek-Heng Lim, Yuan Yao, and Yinyu Ye, Statistical ranking and com-
binatorial hodge theory, Mathematical Programming 127 (2011), no. 1, 203–244,
arXiv:0811.1067 [stat.ML].
[Joh06] I. Johnstone, High dimensional statistical inference and random matrices, Proc. In-
ternational Congress of Mathematicians, 2006.
[JYLG12] Xiaoye Jiang, Yuan Yao, Han Liu, and Leo Guibas, Detecting network cliques with
radon basis pursuit, The Fifteenth International Conference on Artificial Intelligence
and Statistics (AISTATS) (La Palma, Canary Islands), April 21-23 2012.
[Kah09] Matthew Kahle, Topology of random clique complexes, Discrete Mathematics 309
(2009), 1658–1671.
[Kah13] , Sharp vanishing thresholds for cohomology of random flag complexes, Annals
of Mathematics (2013), arXiv:1207.0149.
[Kle99] Jon Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the
ACM 46 (1999), no. 5, 604–632.
[KMP10] Ioannis Koutis, G. Miller, and Richard Peng, Approaching optimality for solving
sdd systems, FOCS ’10 51st Annual IEEE Symposium on Foundations of Computer
Science, 2010.
[KN08] S. Kritchman and B. Nadler, Determining the number of components in a factor
model from limited noisy data, Chemometrics and Intelligent Laboratory Systems 94
(2008), 19–32.
[LL11] Jian Li and Tiejun Li, Probabilistic framework for network partition, Phys. A 390
(2011), 3579.
[LLE09] Tiejun Li, Jian Liu, and Weinan E, Probabilistic framework for network partition,
Phys. Rev. E 80 (2009), 026106.
[LM06] Amy N. Langville and Carl D. Meyer, Google’s pagerank and beyond: The science of
search engine rankings, Princeton University Press, 2006.
[LZ10] Yanhua Li and Zhili Zhang, Random walks on digraphs, the generalized digraph lapla-
cian, and the degree of asymmetry, Algorithms and Models for the Web-Graph, Lec-
ture Notes in Computer Science, vol. 6516, 2010, pp. 74–85.
[Mey00] Carl D. Meyer, Matrix analysis and applied linear algebra, SIAM, 2000.
[MSVE09] Philipp Metzner, Christof Schütte, and Eric Vanden-Eijnden, Transition path theory
for markov jump processes, Multiscale Model. Simul. 7 (2009), 1192.
[MY09] Nicolai Meinshausen and Bin Yu, Lasso-type recovery of sparse representations for
high-dimensional data, Annals of Statistics 37 (2009), no. 1, 246–270.
[NBG10] R. R. Nadakuditi and F. Benaych-Georges, The breakdown point of signal subspace
estimation, IEEE Sensor Array and Multichannel Signal Processing Workshop (2010),
177–180.
[Noe60] G. Noether, Remarks about a paired comparison model, Psychometrika 25 (1960),
357–367.
[NSVE+ 09] Frank Noé, Christof Schütte, Eric Vanden-Eijnden, Lothar Reich, and Thomas R. Weikl, Constructing the equilibrium ensemble of folding pathways from short off-equilibrium simulations, Proceedings of the National Academy of Sciences of the United States of America 106 (2009), no. 45, 19011–19016.
[RL00] Sam T. Roweis and Lawrence K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000), no. 5500, 2323–2326.
[Sch37] I. J. Schoenberg, On certain metric spaces arising from euclidean spaces by a change
of metric and their imbedding in hilbert space, The Annals of Mathematics 38 (1937),
no. 4, 787–793.
[Sch38a] , Metric spaces and completely monotone functions, The Annals of Mathe-
matics 39 (1938), 811–841.
[Sch38b] , Metric spaces and positive definite functions, Transactions of the American Mathematical Society 44 (1938), 522–536.
[Sin06] Amit Singer, From graph to manifold laplacian: The convergence rate, Applied and
Computational Harmonic Analysis 21 (2006), 128–134.
[ST04] D. Spielman and Shang-Hua Teng, Nearly-linear time algorithms for graph partition-
ing, graph sparsification, and solving linear systems, STOC ’04 Proceedings of the
thirty-sixth annual ACM symposium on Theory of computing, 2004.
[Ste56] Charles Stein, Inadmissibility of the usual estimator for the mean of a multivariate
distribution, Proceedings of the Third Berkeley Symposium on Mathematical Statis-
tics and Probability 1 (1956), 197–206.
[SW12] Amit Singer and Hau-Tieng Wu, Vector diffusion maps and the connection laplacian,
Comm. Pure Appl. Math. 65 (2012), no. 8, 1067–1144.
[SY07] Anthony Man-Cho So and Yinyu Ye, Theory of semidefinite programming for sensor
network localization, Mathematical Programming, Series B 109 (2007), no. 2-3, 367–
384.
[SYZ08] Anthony Man-Cho So, Yinyu Ye, and Jiawei Zhang, A unified theorem on sdp rank
reduction, Mathematics of Operations Research 33 (2008), no. 4, 910–920.
[Tao11] Terence Tao, Topics in random matrix theory, lecture notes, UCLA, 2011.
[TdL00] J. B. Tenenbaum, Vin de Silva, and John C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000), no. 5500, 2319–2323.
[TdSL00] J. Tenenbaum, V. de Silva, and J. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000), no. 5500, 2319–2323.
[Tib96] R. Tibshirani, Regression shrinkage and selection via the lasso, J. of the Royal Sta-
tistical Society, Series B 58 (1996), no. 1, 267–288.
[Tro04] Joel A. Tropp, Greed is good: Algorithmic results for sparse approximation, IEEE
Trans. Inform. Theory 50 (2004), no. 10, 2231–2242.
[Tsy09] Alexandre Tsybakov, Introduction to nonparametric estimation, Springer, 2009.
[Vap98] V. Vapnik, Statistical learning theory, Wiley, New York, 1998.
[Vem04] Santosh Vempala, The random projection method, Am. Math. Soc., Providence, 2004.
[Wah90] Grace Wahba, Spline models for observational data, CBMS-NSF Regional Conference
Series in Applied Mathematics 59, SIAM, 1990.
[WS06] Kilian Q. Weinberger and Lawrence K. Saul, Unsupervised learning of image manifolds by semidefinite programming, International Journal of Computer Vision 70 (2006), no. 1, 77–90.
[YH41] G. Young and A. S. Householder, A note on multidimensional psycho-physical anal-
ysis, Psychometrika 6 (1941), 331–333.
[ZHT06] H. Zou, T. Hastie, and R. Tibshirani, Sparse principal component analysis, Journal
of Computational and Graphical Statistics 15 (2006), no. 2, 262–286.
[ZY06] Peng Zhao and Bin Yu, On model selection consistency of lasso, J. Machine Learning
Research 7 (2006), 2541–2567.
[ZZ02] Zhenyue Zhang and Hongyuan Zha, Principal manifolds and nonlinear dimension
reduction via local tangent space alignment, SIAM Journal of Scientific Computing
26 (2002), 313–338.
[ZZ09] Hongyuan Zha and Zhenyue Zhang, Spectral properties of the alignment matrices in
manifold learning, SIAM Review 51 (2009), no. 3, 545–566.