Geometric and Topological Data Reduction
Yuan Yao
Current address: Department of Mathematics, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong SAR, P. R. China.
Special thanks to Prof. Amit Singer, Weinan E, Xiuyuan Cheng, Feng Ruan,
Jiangshu Wang, Kaizheng Wang, and the following students at PKU who helped
scribe lecture notes with various improvements: Hong Cheng, Chao Deng,
Yanzhen Deng, Chendi Huang, Lei Huang, Shujiao Huang, Longlong Jiang, Yuwei
Jiang, Wei Jin, Changcheng Li, Xiaoguang Li, Zhen Li, Tengyuan Liang, Feng
Lin, Yaning Liu, Peng Luo, Wulin Luo, Tangjie Lv, Yuan Lv, Hongyu Meng, Ping
Qin, Jie Ren, Hu Sheng, Zhiming Wang, Yuting Wei, Jiechao Xiong, Jie Xu,
Bowei Yan, Jun Yin, and Yue Zhao.
Preface
as some kernel method defined on various data graphs, where the spectral decomposition of the kernel gives rise to a small number of geometric coordinates of data shapes, while topological data reduction can be inferred from various simplicial complex representations of data, extended from graphs, with spectral decomposition of Hodge Laplacians.
Therefore, in this course we shall see an interplay among geometric, topological, and statistical data reductions. We are going to present these stories in four parts.
In general, data representations can be vectors, matrices (esp. graphs, networks), tensors, and possibly unstructured data such as images, videos, languages, sequences, etc.
************************
This book is used in a course instructed by Yuan Yao at Peking University and
the Hong Kong University of Science and Technology, part of which is based on a
similar course led by Amit Singer at Princeton University.
Part 1
\[
\frac{\partial I}{\partial \beta_i} = (x_i - \mu - U\beta_i)^T U = 0 \;\Rightarrow\; \hat{\beta}_i = U^T(x_i - \hat{\mu}_n).
\]
Plugging in the expressions of $\hat{\mu}_n$ and $\hat{\beta}_i$,
\[
(2)\quad I = \sum_{i=1}^n \|x_i - \hat{\mu}_n - UU^T(x_i - \hat{\mu}_n)\|^2
= \sum_{i=1}^n \|x_i - \hat{\mu}_n - P_k(x_i - \hat{\mu}_n)\|^2
= \sum_{i=1}^n \|y_i - P_k(y_i)\|^2, \quad y_i := x_i - \hat{\mu}_n.
\]
Above we use the cyclic property of trace, $\mathrm{trace}(ABC) = \mathrm{trace}(BCA)$, and the idempotent property of projections, $P^2 = P$.
Since $Y$ does not depend on $U$, the problem above is equivalent to
\[
(3)\quad \max_{U^T U = I_k} \mathrm{Var}(U^T Y) = \max_{U^T U = I_k} \frac{1}{n}\mathrm{trace}(U^T Y Y^T U) = \max_{U^T U = I_k} \mathrm{trace}(U^T \hat{\Sigma}_n U).
\]
In fact, when $k = 1$ the maximal variance is given by the largest eigenvalue, attained along the direction of its associated eigenvector,
\[
\max_{\|u\|=1} u^T \hat{\Sigma}_n u = \hat{\lambda}_1,
\]
and so on. Therefore, PCA takes the eigenvector decomposition $\hat{\Sigma}_n = \hat{U}\hat{\Lambda}\hat{U}^T$ and studies the projections of the centred data points onto the top $k$ eigenvectors as the principal components. In this way, we conclude that the $k$-dimensional affine space can be discovered by the eigenvector decomposition of $\hat{\Sigma}_n$.
$^1$Note that in statistics the sample covariance matrix is often defined, for $n \ge 2$, by
\[
\hat{\Sigma}_n \triangleq \frac{1}{n-1}\sum_{i=1}^n (x_i - \hat{\mu}_n)(x_i - \hat{\mu}_n)^T = \frac{1}{n-1}\widetilde{X}\widetilde{X}^T.
\]
where the $i$-th observation $x_i$ has its $j$-th principal component score given by the $j$-th projection on $u_j$, $\hat{\beta}_{ji} = u_j^T y_i = u_j^T(x_i - \hat{\mu})$. The importance or variance of the $j$-th principal component is characterized by the $j$-th eigenvalue $\hat{\lambda}_j$. Given the eigenvalues, the following quantities are often used to measure the variances.
• Total variance:
\[
\mathrm{trace}(\hat{\Sigma}_n) = \sum_{j=1}^p \hat{\lambda}_j;
\]
• Percentage of variance explained by the top-$k$ principal components:
\[
\sum_{j=1}^{k} \hat{\lambda}_j \Big/ \mathrm{trace}(\hat{\Sigma}_n).
\]
The images in Figure 3 are the 201st to 205th PCs of the same dataset. They look more like noise, and hence probably should not be kept.
In general, the following schemes are often adopted in applications. For example, one can inspect the eigenvalue plot in non-increasing order to see if there is a change point or an elbow at which to truncate; one can also keep enough principal components to explain a prescribed $q$-percentage of the total variation (e.g. $q = 95\%$); sometimes one can select the principal components whose variance is larger than the average.
The following example shows that one can set a threshold on the empirical eigenvalues based on the percentage of variance explained. For instance, choose $k$ such that
\[
\sum_{i=1}^{k} \hat{\lambda}_i \Big/ \mathrm{trace}(\hat{\Sigma}_n) > 0.95.
\]
In Section 2.4, we will discuss Horn's Parallel Analysis with a random permutation test, whose interpretation is based on Random Matrix Theory.
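The following is a minimal sketch (not the book's lab code) of this selection rule using scikit-learn's PCA; the random data matrix and the 95% threshold are only illustrative.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(200, 50)                 # placeholder data, shape (n_samples, n_features)
pca = PCA().fit(X)                           # eigen-decomposition of the sample covariance
ratio = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(ratio, 0.95) + 1)    # smallest k explaining more than 95% of the variance
print(k, ratio[k - 1])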
The classical metric MDS gives a specific solution based on the following idea:
(a) transform the squared distance matrix $D = [d_{ij}^2]$ into an inner product form;
(b) compute the eigen-decomposition of this inner product form.
The key observation is that the two-sided centering transform of the squared distance matrix $D$ gives the Gram matrix (inner product or kernel matrix) of the centered data matrix, i.e.
\[
(4)\quad -\frac{1}{2} H D H^T = (XH)^T(XH) =: \hat{K},
\]
where $H := I - \frac{1}{n}\mathbf{1}\mathbf{1}^T = H^T$, with $\mathbf{1} = (1, 1, \ldots, 1)^T \in \mathbb{R}^n$, is the centering matrix.
To see this, let $K$ be the inner product (kernel, or Gram) matrix
\[
K = X^T X, \quad X = [x_i] \in \mathbb{R}^{p\times n},
\]
with $k = \mathrm{diag}(K_{ii}) \in \mathbb{R}^n$. Note that
\[
D = (d_{ij}^2) = k\cdot \mathbf{1}^T + \mathbf{1}\cdot k^T - 2K.
\]
The following lines establish the fact that
\[
(5)\quad -\frac{1}{2} H\cdot D\cdot H^T = H K H^T.
\]
In fact, note that
\[
-\frac{1}{2} H\cdot D\cdot H^T = -\frac{1}{2} H\cdot (k\cdot\mathbf{1}^T + \mathbf{1}\cdot k^T - 2K)\cdot H^T.
\]
Since $k\cdot\mathbf{1}^T\cdot H^T = k\cdot\mathbf{1}^T\left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right) = k\cdot\mathbf{1}^T - k\left(\frac{\mathbf{1}^T\mathbf{1}}{n}\right)\mathbf{1}^T = 0$, we have $H\cdot k\cdot\mathbf{1}^T\cdot H^T = H\cdot\mathbf{1}\cdot k^T\cdot H^T = 0$. This implies that
\[
-\frac{1}{2} H\cdot D\cdot H^T = H\cdot K\cdot H^T = H X^T X H^T = (XH)^T(XH),
\]
since $H = H^T$, which establishes (4).
Therefore $Y = XH = X - \frac{1}{n}X\mathbf{1}\mathbf{1}^T$ is the centered data matrix and $\hat{K} = Y^T Y$ is the inner product matrix of the centered data, which is positive semi-definite and admits an orthogonal eigenvector decomposition.
Above we have shown that, given a squared distance matrix $D = (d_{ij}^2)$, we can convert it to an inner product matrix by $\hat{K} = -\frac{1}{2}HDH^T = (XH)^T(XH)$. Eigen-decomposition applied to $\hat{K}$ gives rise to Euclidean coordinates centred at the origin.
In practice, one often chooses the top $k$ nonzero eigenvectors of $\hat{K}$ for a $k$-dimensional Euclidean embedding or approximation of the $n$ data points, as summarized in the classical MDS Algorithm 2.
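Below is a minimal sketch (not the book's Algorithm 2 listing) of this double-centering procedure in Python; the random input is only illustrative.

import numpy as np

def classical_mds(D2, k):
    """D2: (n, n) matrix of squared distances; returns a (k, n) Euclidean embedding."""
    n = D2.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    K = -0.5 * H @ D2 @ H.T                       # Gram matrix of centered data
    evals, evecs = np.linalg.eigh(K)              # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:k]             # top-k eigenpairs
    return np.sqrt(np.maximum(evals[idx], 0))[:, None] * evecs[:, idx].T

X = np.random.randn(3, 10)                        # 10 points in R^3
D2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
Y = classical_mds(D2, 2)                          # 2-D embedding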
1.2.2. Example: Metric MDS of Cities on the Earth. Consider pairwise geodesic distances among cities on the earth. Figure 4 shows the results of a 2-D embedding of eight cities using the classical metric MDS Algorithm 2. Since the cities lie on the surface of a sphere, their geodesic distances$^2$ can not be isometrically embedded into a 2-D Euclidean space. Hence one can see some negative eigenvalues of $\hat{K}$. Details will be discussed in Section 1.5.
\[
\hat{V}_k = [\hat{v}_1, \ldots, \hat{v}_k], \quad \hat{v}_k \in \mathbb{R}^n, \qquad
\hat{\Lambda}_k = \mathrm{diag}(\hat{\lambda}_1, \ldots, \hat{\lambda}_k), \quad \text{with } \hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \cdots \ge \hat{\lambda}_k \ge 0.
\]
\[
(6)\quad \min_{Y_i \in \mathbb{R}^k} \sum_{i,j} \left( \|Y_i - Y_j\|^2 - \tilde{d}_{ij}^2 \right)^2.
\]
Without loss of generality, we set $\sum_i Y_i = 0$, i.e. putting the origin at the data center. This is called nonmetric MDS for general $\tilde{d}_{ij}$, which is not necessarily a distance.
sense,
\[
(7)\quad Y = XH = X - \frac{1}{n} X\mathbf{1}\mathbf{1}^T = \hat{U}\hat{S}\hat{V}^T, \qquad H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T, \quad \mathbf{1} = (1, \ldots, 1)^T \in \mathbb{R}^n,
\]
where the top left singular vectors of the centred data matrix $Y \in \mathbb{R}^{p\times n}$ are the eigenvectors of the sample covariance matrix $\hat{\Sigma}$. The singular vectors are not unique, but the singular subspace spanned by singular vectors associated with distinct singular values is unique.
How about the right singular vectors here? In Section 1.2, we have seen that metric Multidimensional Scaling (MDS) is characterized by eigenvectors of the positive semi-definite kernel matrix $\hat{K} = -\frac{1}{2}H\cdot D\cdot H^T = (XH)^T(XH) = Y^T Y = \hat{V}\hat{S}^2\hat{V}^T$, where the last step is by (7). Hence MDS is equivalent to applying the top $k$ right singular vectors of the centered data matrix for the Euclidean embedding.
Therefore both PCA and MDS can be obtained from the SVD of the centred data matrix (7) in the following way.
• PCA has principal components given by the top $k$ left singular vectors $\hat{U}_k \in \mathbb{R}^{p\times k}$, and the projection of the centred data onto this subspace gives the principal component scores $\hat{\beta}_k = \hat{U}_k^T Y = \hat{S}_k\hat{V}_k^T \in \mathbb{R}^{k\times n}$;
• The MDS embedding is given by the top $k$ right singular vectors as $Z_k^{MDS} = \hat{S}_k\hat{V}_k^T \in \mathbb{R}^{k\times n}$.
Note that both PCA and MDS share the same $k$-dimensional representation $\hat{\beta}_k = Z_k^{MDS} = \hat{S}_k\hat{V}_k^T \in \mathbb{R}^{k\times n}$. Unified under the framework of SVD for the centered data matrix, PCA and MDS thus play dual roles in the same linear dimensionality reduction.
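A minimal numerical check of this duality (not from the text) is sketched below: the PCA scores and the classical MDS embedding both come from the SVD of the centered data matrix; the data are random placeholders.

import numpy as np

p, n, k = 5, 20, 2
X = np.random.randn(p, n)
H = np.eye(n) - np.ones((n, n)) / n
Y = X @ H                                     # centered data matrix, p x n
U, S, Vt = np.linalg.svd(Y, full_matrices=False)

pca_scores = U[:, :k].T @ Y                   # projection on top-k left singular vectors
mds_embed = S[:k, None] * Vt[:k, :]           # top-k right singular vectors scaled by singular values
print(np.allclose(pca_scores, mds_embed))     # True: the two representations coincide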
From the properties of the singular value decomposition (? ), PCA and MDS provide the best rank-$k$ approximation of the centred data matrix $Y$ in any unitarily invariant norm. That is, if $Y = U\Sigma V^T$ is the SVD of $Y$, then $Y_k = U\Sigma_k V^T$ (where $\Sigma_k = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k, 0, \ldots, 0)$ is a diagonal matrix containing the largest $k$ singular values) is a rank-$k$ matrix which satisfies
\[
\|Y - Y_k\|_\star = \min_{\mathrm{rank}(Z) = k} \|Y - Z\|_\star,
\]
This theorem tells us that when $d_{ij}$ is exactly given by the Euclidean distance between points, the kernel matrix $\hat{K} = -\frac{1}{2}HDH^T$ is positive semi-definite, where $D = (d_{ij}^2)$ and $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$. Therefore, for positive semi-definite $\hat{K}$ as an inner product matrix, the optimization problem
\[
(8)\quad \min_{Y \in \mathbb{R}^{k\times n}} \|Y^T Y - \hat{K}\|_F^2
\]
is equivalent to (6) and has a solution $Y$ whose rows are the eigenvectors corresponding to the $k$ largest eigenvalues of $\hat{K}$. This is exactly the classical or metric MDS Algorithm 2.
1.4.2. Isometric Hilbertian Embedding. Schoenberg (Sch38b) shows that Euclidean embedding of finite points can be characterized completely by positive definite functions, which paves a way toward Hilbert space embeddings. Later Aronszajn (Aro50) developed Reproducing Kernel Hilbert Spaces based on positive definite functions, which eventually led to the kernel methods in statistics and machine learning (BTA04? ; Vap98; CST03).
Theorem 1.4.3 (Schoenberg 38). A separable space $\mathcal{M}$ with a metric function $d(x, y)$ can be isometrically imbedded in a Hilbert space $\mathcal{H}$ if and only if the family of functions $e^{-\lambda d^2}$ is positive definite for all $\lambda > 0$ (in fact we just need it for a sequence of $\lambda_i$ whose accumulation point is 0).
Here a symmetric function $k(x, y) = k(y, x)$ is called positive definite if for all finite $\{x_i\}$,
\[
\sum_{i,j} c_i c_j k(x_i, x_j) \ge 0, \quad \forall c_i, c_j,
\]
with equality holding iff $c_i = c_j = 0$. In other words, the function $k$ restricted to $\{(x_i, x_j) : i, j = 1, \ldots, n\}$ is a positive definite matrix.
To see the theorem, recall that by the classical MDS theorem, a set of $n$ points with distances $d_{ij}$ can be isometrically imbedded in a Euclidean space if and only if the squared distance matrix is conditionally negative definite, i.e.
\[
(9)\quad \sum_{i,j} c_i c_j d_{ij}^2 \le 0, \qquad \sum_i c_i = 0,
\]
by Taylor expansion. For sufficiently small $\lambda$, this implies (9). Since the number of sample points is arbitrary, the positive definite function $e^{-\lambda d^2}$ ensures a Hilbert space embedding of possibly infinite dimension.
for all $\alpha \in (0, 2)$. Then letting $\alpha$ approach the limit 2 gives the condition (9).
A slightly more general formula than (11) gives the Schoenberg Transform.
Definition 1.4.3 (Schoenberg Transform). The Schoenberg Transform $\Phi : \mathbb{R}_+ \to \mathbb{R}_+$ is defined by
\[
(14)\quad \Phi(t) := \int_0^\infty \frac{1 - \exp(-\lambda t)}{\lambda}\, g(\lambda)\, d\lambda,
\]
where $g(\lambda)$ is some nonnegative measure on $[0, \infty)$ such that
\[
\int_0^\infty \frac{g(\lambda)}{\lambda}\, d\lambda < \infty.
\]
Sometimes, we may want to transform a squared distance matrix into another squared distance matrix. The following theorem tells us that the Schoenberg Transform (Sch38a; Sch38b) characterizes all such transformations between squared distance matrices.
Theorem 1.4.4 (Schoenberg Transform). Given a squared distance matrix $D$, let $C_{i,j} = \Phi(D_{i,j})$. Then
$C$ is a squared distance matrix $\iff$ $\Phi$ is a Schoenberg Transform.
Examples of Schoenberg Transforms include
• $\Phi_0(t) = t$ with $g_0(\lambda) = \delta(\lambda)$;
• $\Phi_1(t) = \dfrac{1 - \exp(-at)}{a}$ with $g_1(\lambda) = \delta(\lambda - a)$ ($a > 0$);
• $\Phi_2(t) = \ln(1 + t/a)$ with $g_2(\lambda) = \exp(-a\lambda)$;
• $\Phi_3(t) = \dfrac{t}{a(a + t)}$ with $g_3(\lambda) = \lambda\exp(-a\lambda)$;
• $\Phi_4(t) = t^p$ ($p \in (0, 1)$) with $g_4(\lambda) = \dfrac{p}{\Gamma(1-p)}\lambda^{-p}$.
For more examples, see (Bav11). The first one gives the identity transform, and the last one implies that for a distance function $d$, $\sqrt{d}$ is also a distance function but $d^2$ is not. To see this, take three points on a line, $x = 0$, $y = 1$, $z = 2$, where $d(x, y) = d(y, z) = 1$; then for $p > 1$, $d^p(x, z) = 2^p > d^p(x, y) + d^p(y, z) = 2$, which violates the triangle inequality. In fact, that $d^p$ ($p \in (0, 1)$) is a Euclidean distance function immediately implies the following triangle inequality
\[
d^p(0, x + y) \le d^p(0, x) + d^p(0, y).
\]
be the $k$-th eigenvalue of $L_K$ and $\{\phi_k\}_{k\in\mathbb{N}}$ the corresponding eigenfunctions. For all $x, t \in \mathcal{X}$,
\[
(15)\quad K(x, t) = \sum_{k=1}^\infty \lambda_k \phi_k(x)\phi_k(t),
\]
\[
(16)\quad L_K^r : L^2_{\rho_X} \to L^2_{\rho_X}, \qquad \sum_k a_k\phi_k \mapsto \sum_k a_k\lambda_k^r \phi_k.
\]
In particular, $L_K^{1/2} : L^2_{\rho_X} \to \mathcal{H}_K$ is an isometric isomorphism between the quotient space $L^2_{\rho_X}/\ker(L_K)$ and $\mathcal{H}_K$. For simplicity we assume that $\ker(L_K) = \{0\}$, which happens when $K$ is a universal kernel (Ste01) such that $\mathcal{H}_K$ is dense in $L^2_{\rho_X}$. With $L_K^{1/2}$, $\langle\phi_k, \phi_{k'}\rangle_{\mathcal{H}_K} = \langle L_K^{-1/2}\phi_k, L_K^{-1/2}\phi_{k'}\rangle_{L^2_{\rho_X}} = \lambda_k^{-1}\langle\phi_k, \phi_{k'}\rangle_{L^2_{\rho_X}}$, whence $(\phi_k)$ is a bi-orthogonal system in $\mathcal{H}_K$ and $L^2_{\rho_X}$. Some examples based on spherical harmonics are given in (MNY06), and for further examples see (? Wah90).
For example, the radial basis function $k_\lambda(x, x') = e^{-\lambda d^2} = e^{-\lambda\|x - x'\|^2}$ is a universal kernel, often called the Gaussian kernel or heat kernel in the literature, and has been widely used in statistics and machine learning.
Reproducing Kernel Hilbert Spaces are universal in statistical learning of func-
tions, in the sense that every Hilbert space H of functions on X with bounded eval-
uation functional can be regarded as a reproducing kernel Hilbert space (Wah90).
This is a result of the Riesz representation theorem, that for every x ∈ X there
exists Ex ∈ H such that f (x) = ⟨f, Ex ⟩. By boundedness of evaluation functional,
|f (x)| ≤ ∥f ∥∥Ex ∥, one can define a reproducing kernel k(x, y) = ⟨Ex , Ey ⟩ which is
bounded, symmetric and positive definite. It is called ‘reproducing’ because we can
reproduce the function value using f (x) = ⟨f, kx ⟩ where kx (·) := k(x, ·) as a func-
tion in H . Such a universal property makes RKHS a unified tool to study Hilbert
function spaces in nonparametric statistics, including Sobolev spaces consisting of
splines (Wah90).
Mercer's Theorem shows that the spectral decomposition of a Mercer kernel, a continuous positive definite function on a compact domain, renders an orthogonal basis adaptive to the probability measure $\rho_X$. In practice, one can compute an empirical version of such a basis based on finite samples drawn from $\rho_X$. This is often known as kernel PCA (SSM98), or more precisely kernel MDS, in the following procedure.
Definition 1.4.4 (Kernel PCA/MDS). Given a data sample $\{x_i : i = 1, \ldots, n\}$ drawn independently and identically distributed from $\rho_X$, the kernel matrix $K = (k(x_i, x_j) : i, j = 1, \ldots, n)$ is a positive definite matrix. Then the following procedure gives a $k$-dimensional Euclidean embedding of the data.
(a) Find the top-$k$ eigen-decomposition of the centred matrix
\[
\hat{K} = H K H^T, \quad \text{where } K = (k(x_i, x_j) : i, j = 1, \ldots, n).
\]
(b) Embed the data in the same way as classical MDS in Algorithm 2.
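A minimal sketch (not the book's code) of this procedure with a Gaussian kernel follows; the data and the bandwidth lam are illustrative assumptions.

import numpy as np

def kernel_pca_mds(X, k, lam=1.0):
    """X: (p, n) data; returns a (k, n) embedding from the centered Gaussian kernel."""
    n = X.shape[1]
    D2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)   # squared distances
    K = np.exp(-lam * D2)                                      # Gaussian kernel matrix
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H.T                                           # centered kernel, as in step (a)
    evals, evecs = np.linalg.eigh(Kc)
    idx = np.argsort(evals)[::-1][:k]
    return np.sqrt(np.maximum(evals[idx], 0))[:, None] * evecs[:, idx].T   # step (b)

Y = kernel_pca_mds(np.random.randn(3, 30), k=2)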
\[
\hat{V}_k = [\hat{v}_1, \ldots, \hat{v}_k], \quad \hat{v}_k \in \mathbb{R}^n, \qquad
\hat{\Lambda}_k = \mathrm{diag}(\hat{\lambda}_1, \ldots, \hat{\lambda}_k), \quad \text{with } \hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \cdots \ge \hat{\lambda}_k \ge 0.
\]
"""
=========================================================
Principal Component Analysis ( PCA ) : an example on dataset zip digit 3
=========================================================
The PCA does an unsupervised dim ensional ity reduction , as the best
affine
20 1. GEOMETRY OF PCA AND MDS
"""
print ( __doc__ )
import pandas as pd
import io
import requests
import numpy as np
url = " https :// statweb . stanford . edu /~ tibs / ElemStatLearn / datasets / zip .
digits / train .3 "
s = requests . get ( url ) . content
c = pd . read_csv ( io . StringIO ( s . decode ( ’utf -8 ’) ) )
data = np . array (c , dtype = ’ float32 ’) ;
# data = np . array ( pd . read_csv ( ’ train .3 ’) , dtype = ’ float32 ’) ;
data . shape
# Reshape the data into image of 16 x16 and show the image .
import matplotlib . pyplot as plt
img1 = np . reshape ( data [1 ,:] ,(16 ,16) ) ;
imgshow = plt . imshow ( img1 , cmap = ’ gray ’)
# #########################################
# PCA
print ( pca . e x p l a i n e d _ v a r i a n c e _ r a t i o _ )
# Plot the ’ e x p l a i n e d _ v a r i a n c e _ r a t i o _ ’
plt . ylabel ( ’ e x p l a i n e d _ v a r i a n c e _ r a t i o _ ’)
# Principal components
Y = pca . components_ ;
Y . shape
# coding: utf-8
# ## Multidimensional Scaling (MDS)
# In this example, we show the MDS embedding of some cities with their geodesic distances on earth.
# In [1]:
# In [2]:
import numpy as np
import matplotlib.pyplot as plt
# In [3]:
# Cities
cities = ["Beijing", "Shanghai", "Hongkong", "Tokyo", "Hawaii", "Seattle", "San Francisco", "Los Angeles"]
# cities = ["Boston", "New York", "Miami", "Chicago", "Seattle", "San Francisco", "Los Angeles"]
# In [4]:
def get_coordinates(city):
    # 'geolocator' is assumed to be constructed in an earlier cell omitted here (e.g. a geopy geocoder)
    loc = geolocator.geocode(city)
    return (loc.latitude, loc.longitude)
# In [5]:
# In [6]:
# In [7]:
_, ax = plt.subplots()
ax.scatter(Y0[:, 0], -Y0[:, 1])
# In [8]:
# euclidean_distances is from sklearn.metrics.pairwise (import omitted in the original listing)
D0 = euclidean_distances(X)
# In [9]:
n = D0.shape[0]
# Compute K (H and the squared distance matrix D02 come from cells omitted above)
K = -1/2 * np.matmul(np.matmul(H, D02), H.T)
evals, evecs
# In [10]:
# In [11]:
Y1
# In [23]:
# In [13]:
# In [14]:
# In [15]:
n = D.shape[0]
# Compute K
K = -1/2 * np.matmul(np.matmul(H, D2), H.T)
evals, evecs
# In [16]:
# In [17]:
Y2
# In [22]:
# In [19]:
# In [20]:
# In [24]:
We have seen that the sample mean and covariance in high dimensional Euclidean space $\mathbb{R}^p$ are exploited in Principal Component Analysis (PCA) or its equivalent Multidimensional Scaling (MDS), which project high dimensional data onto the top singular vectors of the centered data matrix. In statistics, the sample mean and covariance are Fisher's Maximum Likelihood estimators based on multivariate Gaussian models. In classical statistics, with the Law of Large Numbers, for fixed $p$ as the sample size $n \to \infty$, the sample mean and covariance converge, and so does PCA. Although the sample mean $\hat{\mu}_n$ and sample covariance $\hat{\Sigma}_n$ are the most widely used statistics in multivariate data analysis, they may suffer from problems in high dimensional settings, e.g. in the large $p$, small $n$ scenario. In 1956, Stein (Ste56) showed that the sample mean is not the best estimator in terms of prediction measured by the mean square error, for $p > 2$; furthermore, in 2006 Johnstone (Joh06) showed by random matrix theory that PCA might be overwhelmed by random noise for a fixed ratio $p/n = \gamma$ when both $n, p \to \infty$. Among other works, these two pieces of excellent work inspired a long pursuit in modern high dimensional statistics of biased estimators with shrinkage or regularization, which trade variance for bias toward a reduced prediction error.
which is equivalent to
\[
\arg\max_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n \log f(X_i|\theta).
\]
The following example shows that the sample mean and covariance can be derived from the maximum likelihood estimator under multivariate normal models of data.
\[
\Rightarrow\ \hat{\mu}_n = \frac{1}{n}\sum_{i=1}^n X_i.
\]
To get the estimate of $\Sigma$, we need to maximize
\[
I(\Sigma) = -\frac{1}{2}\sum_{i=1}^n \mathrm{trace}\left[(X_i - \mu)^T\Sigma^{-1}(X_i - \mu)\right] - \frac{n}{2}\log|\Sigma| + C,
\]
\begin{align*}
-\frac{1}{2}\sum_{i=1}^n \mathrm{trace}\left[(X_i - \hat{\mu}_n)^T\Sigma^{-1}(X_i - \hat{\mu}_n)\right]
&= -\frac{1}{2}\sum_{i=1}^n \mathrm{trace}\left[\Sigma^{-1}(X_i - \hat{\mu}_n)(X_i - \hat{\mu}_n)^T\right]\\
&= -\frac{n}{2}\,\mathrm{trace}(\Sigma^{-1}\hat{\Sigma}_n)\\
&= -\frac{n}{2}\,\mathrm{trace}(\Sigma^{-1}\hat{\Sigma}_n^{1/2}\hat{\Sigma}_n^{1/2})\\
&= -\frac{n}{2}\,\mathrm{trace}(\hat{\Sigma}_n^{1/2}\Sigma^{-1}\hat{\Sigma}_n^{1/2})\\
&= -\frac{n}{2}\,\mathrm{trace}(S),
\end{align*}
where
\[
\hat{\Sigma}_n = \frac{1}{n}\sum_{i=1}^n (X_i - \hat{\mu}_n)(X_i - \hat{\mu}_n)^T,
\]
and $S = \hat{\Sigma}_n^{1/2}\Sigma^{-1}\hat{\Sigma}_n^{1/2}$ is symmetric and positive definite. Above we repeatedly use the cyclic property of trace:
• $\mathrm{trace}(AB) = \mathrm{trace}(BA)$, or more generally
• (invariance under cyclic permutations) $\mathrm{trace}(ABCD) = \mathrm{trace}(BCDA) = \mathrm{trace}(CDAB) = \mathrm{trace}(DABC)$.
Then we have
\[
\Sigma = \hat{\Sigma}_n^{1/2} S^{-1} \hat{\Sigma}_n^{1/2}, \qquad
-\frac{n}{2}\log|\Sigma| = \frac{n}{2}\log|S| - \frac{n}{2}\log|\hat{\Sigma}_n|,
\]
where we use, for determinants of square matrices of equal size, $\det(AB) = |AB| = \det(A)\det(B) = |A|\cdot|B|$. Therefore,
\[
\max_\Sigma I(\Sigma) \;\Longleftrightarrow\; \min_S\ \frac{n}{2}\,\mathrm{trace}(S) - \frac{n}{2}\log|S| + C,
\]
where $C$ is a constant. Suppose $S = U\Lambda U^T$ is the eigenvalue decomposition of $S$ with $\Lambda = \mathrm{diag}(\lambda_i)$; then
\[
J = \frac{n}{2}\sum_{i=1}^p \lambda_i - \frac{n}{2}\sum_{i=1}^p \log\lambda_i + C,
\]
\[
0 = \frac{\partial J}{\partial\lambda_i} = \frac{n}{2} - \frac{n}{2}\cdot\frac{1}{\lambda_i} \;\Rightarrow\; \lambda_i = 1 \;\Rightarrow\; S = I_p.
\]
This gives the MLE solution
\[
\hat{\Sigma}_n^{MLE} = \frac{1}{n}\sum_{i=1}^n (X_i - \hat{\mu}_n)(X_i - \hat{\mu}_n)^T.
\]
Under some regularity conditions, the maximum likelihood estimator $\hat{\theta}_n^{MLE}$ has the following nice asymptotic properties as $n\to\infty$.
A. (Consistency) $\hat{\theta}_n^{MLE} \to \theta_0$, in probability and almost surely.
B. (Asymptotic Normality) $\sqrt{n}(\hat{\theta}_n^{MLE} - \theta_0) \to N(0, I_0^{-1})$ in distribution, where $I_0$ is the Fisher Information matrix
\[
I(\theta_0) := \mathbb{E}\left[\left(\frac{\partial}{\partial\theta}\log f(X|\theta_0)\right)^2\right] = -\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log f(X|\theta_0)\right].
\]
Linear estimators include an important case, Ridge regression (also known as Tikhonov regularization in applied mathematics), with $C = X(X^TX + \lambda I)^{-1}X^T$,
\[
(17)\quad \min_{\beta}\ \frac{1}{2}\|Y - X\beta\|^2 + \frac{\lambda}{2}\|\beta\|^2, \quad \lambda > 0.
\]
For simplicity, one may restrict the discussion to diagonal linear estimators $C = \mathrm{diag}(c_i)$ (up to a change of orthonormal basis for Ridge regression), whose risk is
\[
R(\hat{\mu}_C, \mu) = \sigma^2\sum_{i=1}^p c_i^2 + \sum_{i=1}^p (1 - c_i)^2\mu_i^2.
\]
In this case, it is simple to find the minimax risk over the hyper-rectangular model class $|\mu_i| \le \tau_i$,
\[
\inf_{c_i}\ \sup_{|\mu_i|\le\tau_i} R(\hat{\mu}_C, \mu) = \sum_{i=1}^p \frac{\sigma^2\tau_i^2}{\sigma^2 + \tau_i^2}.
\]
From here one can see that for those sparse model classes such that #{i : τi =
O(σ)} = k ≪ p, it is possible to get smaller risk using linear estimators than MLE.
In general, is it possible to introduce some biased estimators which significantly reduce the variance so that the total risk is smaller than that of the MLE uniformly for all $\mu$? This is the notion of inadmissibility introduced by Charles Stein in 1956, and he found that the answer is YES by presenting the James-Stein estimators as shrinkage of sample means.
Definition 2.2.1 (Inadmissible). An estimator $\hat{\mu}_n$ of the parameter $\mu$ is called inadmissible on $\mathbb{R}^p$ with respect to the squared risk if there exists another estimator $\mu_n^*$ such that
\[
\mathbb{E}\|\mu_n^* - \mu\|^2 \le \mathbb{E}\|\hat{\mu}_n - \mu\|^2 \quad\text{for all } \mu\in\mathbb{R}^p,
\]
and there exists $\mu_0 \in \mathbb{R}^p$ such that
\[
\mathbb{E}\|\mu_n^* - \mu_0\|^2 < \mathbb{E}\|\hat{\mu}_n - \mu_0\|^2.
\]
In this case, we also say that $\mu_n^*$ dominates $\hat{\mu}_n$. Otherwise, the estimator $\hat{\mu}_n$ is called admissible.
The notion of inadmissibility or dominance introduces a partial order on the set of estimators, where admissible estimators are local optima in this partial order.
Stein (1956) (Ste56) found that if $p \ge 3$, then the MLE estimator $\hat{\mu}_n$ is inadmissible. This property is known as Stein's phenomenon. It can be described as follows: for $p\ge 3$, there exists $\hat{\mu}$ such that for all $\mu\in\mathbb{R}^p$,
\[
R(\hat{\mu}, \mu) < R(\hat{\mu}^{MLE}, \mu),
\]
which makes the MLE inadmissible.
A typical choice is the James-Stein estimator.
Example 3 (James-Stein Estimator). Charles Stein showed in 1956 that the MLE is inadmissible, while the following original form of the James-Stein estimator was demonstrated by his student Willard James in 1961. Bradley Efron (Efr10) summarizes the history and gives a simple derivation of these estimators from an Empirical Bayes point of view.
[Figure: simulated errors of MLE (err_MLE) versus the James-Stein estimator (err_JSE), with $\mu_i$ generated from Normal(0,1) and from Uniform[0,1], sample size $N = 100$.]
\[
(18)\quad \hat{\mu}^{JS0} = \left(1 - \frac{\sigma^2(p-2)}{\|\hat{\mu}^{MLE}\|^2}\right)\hat{\mu}^{MLE},
\]
where $(x)_+$ takes the positive part of $x$ if $x > 0$ and zero otherwise. The James-Stein estimator can be written as a multitask Ridge regression:
\[
(21)\quad (\hat{\mu}_i, \hat{\mu}) := \arg\min_{\mu_i,\mu}\ \sum_{i=1}^p \left[(\mu_i - X_i)^2 + \lambda(\mu_i - \mu)^2\right].
\]
Taking $\lambda = \sigma^2(p-3)/(S - \sigma^2(p-3))$, $\hat{\mu}$ gives $\hat{\mu}^{JS}$; taking $\lambda = \min(S, \sigma^2(p-3))/(S - \min(S, \sigma^2(p-3)))$ with $1/0 = 0$, it gives $\hat{\mu}^{JS+}$.
Theorem 2.2.1. Suppose $Y \sim N_p(\mu, I)$ and $\hat{\mu}^{MLE} = Y$. Let $R(\hat{\mu}, \mu) = \mathbb{E}_\mu\|\hat{\mu} - \mu\|^2$, and define
\[
\hat{\mu}^{JS} = \left(1 - \frac{p-2}{\|Y\|^2}\right)Y;
\]
then
\[
R(\hat{\mu}^{JS}, \mu) < R(\hat{\mu}^{MLE}, \mu).
\]
Next we outline the proof of such results. First of all, we'll prove a useful lemma.
\begin{align*}
\mathbb{E}_\mu\left[(Y-\mu)^T g(Y)\right] &= \sum_{i=1}^p\int_{-\infty}^{\infty}(y_i-\mu_i)\,g_i(Y)\,\phi(Y-\mu)\,dY\\
&= \sum_{i=1}^p\int_{-\infty}^{\infty} -g_i(Y)\,\frac{\partial}{\partial y_i}\phi(Y-\mu)\,dY, \quad\text{derivative of the Gaussian density}\\
&= \sum_{i=1}^p\int_{-\infty}^{\infty} \frac{\partial g_i}{\partial y_i}(Y)\,\phi(Y-\mu)\,dY, \quad\text{integration by parts}\\
&= \mathbb{E}_\mu\,\nabla^T g(Y),
\end{align*}
which gives
\[
R(\hat{\mu}_C, \mu) = \|(I - C(\lambda))Y\|^2 - p + 2\,\mathrm{trace}(C(\lambda)).
\]
In applications, $C = C(\lambda)$ often depends on some regularization parameter $\lambda$ (e.g. ridge regression), so one can find an optimal $\lambda^*$ by minimizing this MSE estimate over $\lambda$.
we have
\begin{align*}
\mathbb{E}_\mu\frac{1}{\|Y\|^2} &= \mathbb{E}\,\mathbb{E}_\mu\left[\frac{1}{\|Y\|^2}\,\Big|\,J\right]\\
&= \mathbb{E}\,\frac{1}{p + 2J - 2}\\
&\ge \frac{1}{p + 2\,\mathbb{E}J - 2}, \quad\text{by Jensen's inequality}\\
&= \frac{1}{p + \|\mu\|^2 - 2}.
\end{align*}
This gives the following result.
Proposition 2.2.4 (Upper bound of MSE for the James-Stein Estimator). For $Y\sim N(\mu, I_p)$,
\[
R(\hat{\mu}^{JS}, \mu) \le p - \frac{(p-2)^2}{p-2+\|\mu\|^2} = 2 + \frac{(p-2)\|\mu\|^2}{p-2+\|\mu\|^2}.
\]
Using the inequality
\[
\frac{ab}{a+b} \le a\wedge b,
\]
it gives the upper bound
\[
R(\hat{\mu}^{JS},\mu) \le 2 + \min\left(p-2,\ \|\mu\|^2\right).
\]
Therefore for $\|\mu\| \ll \sqrt{p}$, the risk of the James-Stein estimator is dominated by $2 + \|\mu\|^2$, which approaches 2 as $\|\mu\|\to 0$. In comparison with the risk of the MLE, $R(\hat{\mu}^{MLE},\mu) = p$, the James-Stein estimator clearly wins by a large gap in high dimensions. This is illustrated in Figure 2.
2.2.5. Risk of Soft-thresholding. Using Stein's unbiased risk estimate, we have soft-thresholding in the form of
\[
\hat{\mu}(x) = x + g(x), \qquad \frac{\partial}{\partial x_i} g_i(x) = -I(|x_i|\le\lambda).
\]
We then have
\[
\mathbb{E}_\mu\|\hat{\mu}_\lambda - \mu\|^2 = \mathbb{E}_\mu\left(p - 2\sum_{i=1}^p I(|x_i|\le\lambda) + \sum_{i=1}^p x_i^2\wedge\lambda^2\right)
\le 1 + (2\log p + 1)\sum_{i=1}^p(\mu_i^2\wedge 1), \quad\text{if we take } \lambda = \sqrt{2\log p}.
\]
By using the inequality
\[
\frac{1}{2}(a\wedge b) \le \frac{ab}{a+b} \le a\wedge b,
\]
we can compare the risks of soft-thresholding and the James-Stein estimator:
\[
1 + (2\log p+1)\sum_{i=1}^p(\mu_i^2\wedge 1) \;\lessgtr\; 2 + c\left(\sum_{i=1}^p\mu_i^2\wedge p\right), \quad c\in(1/2, 1).
\]
[Figure 2: risks $R(\hat{\mu},\mu)$ of the James-Stein estimator (JS) and the MLE as functions of $\|\mu\|$.]
Subgradients of $I$ over $\hat{\mu}$ lead to the soft-thresholding,
\[
0 \in \partial_{\hat{\mu}_j} I = (\hat{\mu}_j - \mu_j) + \lambda\,\mathrm{sign}(\hat{\mu}_j) \;\Rightarrow\; \hat{\mu}_j = \mathrm{sign}(\mu_j)(|\mu_j| - \lambda)_+,
\]
where the set-valued map $\mathrm{sign}(x) = 1$ if $x>0$, $\mathrm{sign}(x) = -1$ if $x<0$, and $\mathrm{sign}(x) = [-1,1]$ if $x = 0$, is the subgradient of the absolute value function $|x|$. Under this framework, shrinkage estimators have reached a new peak and become ubiquitous in high dimensional data analysis.
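A minimal sketch (not from the text) of the componentwise soft-thresholding map above; the test vector and threshold are illustrative.

import numpy as np

def soft_threshold(x, lam):
    """Componentwise soft-thresholding: sign(x) * (|x| - lam)_+."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(soft_threshold(np.array([3.0, -0.5, 1.2]), 1.0))   # [ 2.  -0.   0.2]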
In addition to the $\ell_1$ penalty in LASSO, there are also other penalty functions, such as
• $\lambda\|\beta\|_0$: this leads to hard-thresholding when $X = I$; solving this problem is in general NP-hard.
• $\lambda\|\beta\|_p$, $0 < p < 1$: non-convex, also NP-hard.
• $\lambda\sum_i\rho(\beta_i)$, such that
(1) $\rho'(0)$ is singular (for sparsity in variable selection);
(2) $\rho'(\infty) = 0$ (for unbiasedness in parameter estimation).
Such a $\rho$ must be essentially non-convex (FL01).
In Section 3.5, we also introduce a new type of dynamic regularization path of shrinkage estimators, called the inverse scale space method, developed in applied mathematics (OBG+05; ORX+16).
Suppose now that the function $g$ is such that the assumptions of Stein's Lemma 2.2.5 hold (Lemma 3.6 in (Tsy09)), i.e. it is weakly differentiable.
Lemma 2.2.5 (Stein's lemma). Suppose that a function $f : \mathbb{R}^p\to\mathbb{R}$ satisfies:
(1) $f(u_1,\ldots,u_p)$ is absolutely continuous in each coordinate $u_i$ for almost all values (with respect to the Lebesgue measure on $\mathbb{R}^{p-1}$) of the other coordinates $(u_j, j\ne i)$;
(2)
\[
\mathbb{E}\left|\frac{\partial f(y)}{\partial y_i}\right| < \infty, \quad i = 1,\ldots,p.
\]
Then
\[
\mathbb{E}[(\mu_i - y_i) f(y)] = -\varepsilon^2\,\mathbb{E}\left[\frac{\partial f}{\partial y_i}(y)\right], \quad i = 1,\ldots,p.
\]
With Stein's Lemma, therefore
\[
\mathbb{E}[(\mu_i - y_i)\,g(y)\,y_i] = -\varepsilon^2\,\mathbb{E}\left[g(y) + y_i\frac{\partial g}{\partial y_i}(y)\right],
\]
and with
\[
\mathbb{E}[(y_i - \mu_i)^2] = \varepsilon^2,
\]
we have
\[
\mathbb{E}[(\hat{\mu}_{n,i} - \mu_i)^2] = \varepsilon^2 - 2\varepsilon^2\,\mathbb{E}\left[g(y) + y_i\frac{\partial g}{\partial y_i}(y)\right] + \mathbb{E}[y_i^2 g(y)^2].
\]
Summing over $i$ gives
\[
\mathbb{E}\|\hat{\mu}_n - \mu\|^2 = \underbrace{p\varepsilon^2}_{=:R(\hat{\mu}_n^{MLE}) = \mathbb{E}\|\hat{\mu}_n^{MLE} - \mu\|^2} +\ \mathbb{E}[W(y)],
\]
with
\[
W(y) = -2p\varepsilon^2 g(y) - 2\varepsilon^2\sum_{i=1}^p y_i\frac{\partial g}{\partial y_i}(y) + \|y\|^2 g(y)^2.
\]
The risk of $\hat{\mu}_n$ is smaller than that of the MLE $\hat{\mu}_n^{MLE}$ if we choose $g$ such that $\mathbb{E}[W(y)] < 0$.
In order to satisfy this inequality, we can search for $g$ among the functions of the form
\[
g(y) = \frac{b}{a + \|y\|^2}
\]
with appropriately chosen constants $a \ge 0$, $b > 0$. Then $W(y)$ can be written as
\begin{align*}
W(y) &= -2p\varepsilon^2\frac{b}{a+\|y\|^2} + 2\varepsilon^2\sum_{i=1}^p\frac{2b\,y_i^2}{(a+\|y\|^2)^2} + \frac{b^2\|y\|^2}{(a+\|y\|^2)^2}\\
&= \frac{1}{a+\|y\|^2}\left(-2pb\varepsilon^2 + \frac{4b\varepsilon^2\|y\|^2}{a+\|y\|^2} + \frac{b^2\|y\|^2}{a+\|y\|^2}\right)\\
&\le \left(-2pb\varepsilon^2 + 4b\varepsilon^2 + b^2\right)\frac{1}{a+\|y\|^2}, \qquad\text{using } \|y\|^2 \le a + \|y\|^2 \text{ for } a\ge 0\\
&= \frac{Q(b)}{a+\|y\|^2}, \qquad Q(b) = b^2 - 2pb\varepsilon^2 + 4b\varepsilon^2.
\end{align*}
The minimizer in $b$ of the quadratic function $Q(b)$ is
\[
b_{opt} = \varepsilon^2(p-2),
\]
at which the minimum of $W(y)$ satisfies
\[
W_{\min}(y) \le -\frac{b_{opt}^2}{a+\|y\|^2} = -\frac{\varepsilon^4(p-2)^2}{a+\|y\|^2} < 0.
\]
Note that when $b\in(b_1, b_2)$, i.e. between the two roots of $Q(b)$,
\[
b_1 = 0, \qquad b_2 = 2\varepsilon^2(p-2),
\]
we have $W(y) < 0$, which may lead to other estimators having smaller mean squared errors than the MLE.
When $a = 0$, the function $g$ and the estimator $\hat{\mu}_n = (1 - g(y))y$ associated with this choice of $g$ are given by
\[
g(y) = \frac{\varepsilon^2(p-2)}{\|y\|^2}
\]
and
\[
\hat{\mu}_n = \left(1 - \frac{\varepsilon^2(p-2)}{\|y\|^2}\right)y =: \hat{\mu}^{JS},
\]
respectively. $\hat{\mu}^{JS}$ is called the James-Stein estimator. If the dimension $p\ge 3$ and the norm $\|y\|^2$ is sufficiently large, multiplication of $y$ by $1 - g(y)$ shrinks the value of $y$ toward 0. This is called Stein shrinkage. If $b = b_{opt}$, then
\[
W_{\min}(y) = -\frac{\varepsilon^4(p-2)^2}{\|y\|^2}.
\]
Lemma 2.2.6. Let $p\ge 3$. Then, for all $\mu\in\mathbb{R}^p$,
\[
0 < \mathbb{E}\frac{1}{\|y\|^2} < \infty.
\]
The proof of Lemma 2.2.6 can be found in Lemma 3.9 of (Tsy09). For the function $W$, Lemma 2.2.6 implies $-\infty < \mathbb{E}[W(y)] < 0$, provided that $p\ge 3$. Therefore, if $p\ge 3$, the risk of the estimator $\hat{\mu}_n$ satisfies
\[
\mathbb{E}\|\hat{\mu}_n - \mu\|^2 = p\varepsilon^2 - \mathbb{E}\left[\frac{\varepsilon^4(p-2)^2}{\|y\|^2}\right] < \mathbb{E}\|\hat{\mu}_n^{MLE} - \mu\|^2
\]
for all $\mu\in\mathbb{R}^p$.
Besides the James-Stein estimator, there are other estimators having smaller mean squared errors than the MLE.
• Stein estimator: $a = 0$, $b = \varepsilon^2 p$,
\[
\hat{\mu}^{S} := \left(1 - \frac{\varepsilon^2 p}{\|y\|^2}\right)y;
\]
• James-Stein estimator: $c\in(0, 2(p-2))$,
\[
\hat{\mu}^{JS}_c := \left(1 - \frac{\varepsilon^2 c}{\|y\|^2}\right)y;
\]
• Positive part James-Stein estimator:
\[
\hat{\mu}^{JS+} := \left(1 - \frac{\varepsilon^2(p-2)}{\|y\|^2}\right)_+ y,
\]
where $(x)_+ = \max(0, x)$. Comparisons of their risks as mean squared errors are as follows:
\[
R(\hat{\mu}^{JS+}) < R(\hat{\mu}^{JS}) < R(\hat{\mu}_n^{MLE}), \qquad R(\hat{\mu}^{S+}) < R(\hat{\mu}^{S}) < R(\hat{\mu}_n^{MLE}).
\]
Another dimension of variation is shrinkage toward any vector rather than the origin:
\[
\hat{\mu}^{\mu_0} = \mu_0 + \left(1 - \frac{\varepsilon^2 c}{\|y\|^2}\right)(y - \mu_0), \quad c\in(0, 2(p-2)).
\]
In particular, one may choose $\mu_0 = \bar{y}\,\mathbf{1}$, where $\bar{y} = \sum_{i=1}^p y_i/p$.
The answer is yes in the classical setting where γ = 0 governed by the Law of Large
Numbers. Unfortunately, in high dimensional statistics with γ > 0, top eigenvectors
of sample covariance matrices might not reflect the subspace of signals. In fact,
there is a phase transition for signal identifiability by PCA: below a threshold of
signal-noise ratio, PCA will fail with high probability and above that threshold of
signal-noise ratio, PCA will approximate the signal subspace with high probability.
This will be illustrated by the following simplest rank-1 (spike) signal model, in which random matrix theory sheds light on the phase transition where PCA fails to capture the signal subspace, depending on the signal-noise ratio.
in 2006 (Joh06), or see (NBG10), shows that the primary (largest) eigenvalue of the sample covariance matrix satisfies
\[
(25)\quad \lambda_{\max}(\hat{\Sigma}_n) \to \begin{cases}(1+\sqrt{\gamma})^2 = b, & \sigma_X^2 \le \sqrt{\gamma},\\[4pt] (1+\sigma_X^2)\left(1 + \dfrac{\gamma}{\sigma_X^2}\right), & \sigma_X^2 > \sqrt{\gamma},\end{cases}
\]
which implies that if the signal energy is small, the top eigenvalue of the sample covariance matrix never pops up from the random matrix ones; only if the signal energy is beyond the phase transition threshold $\sqrt{\gamma}$ can the top eigenvalue be separated from the random matrix eigenvalues. However, even in the latter case it is a biased estimate.
Moreover, the primary eigenvector (principal component) associated with the largest eigenvalue converges to
\[
(26)\quad |\langle u, v_{\max}\rangle|^2 \to \begin{cases} 0, & \sigma_X^2 \le \sqrt{\gamma},\\[4pt] \dfrac{1 - \gamma/\sigma_X^4}{1 + \gamma/\sigma_X^2}, & \sigma_X^2 > \sqrt{\gamma},\end{cases}
\]
which exhibits the same phase transition phenomenon: if the signal is of low energy, PCA tells us nothing about the true signal and the estimated top eigenvector is orthogonal to the true direction $u$; if the signal is of high energy, PCA returns a biased estimate which lies in a cone whose angle with the true signal is no more than
\[
\arccos\sqrt{\frac{1 - \gamma/\sigma_X^4}{1 + \gamma/\sigma_X^2}}.
\]
\[
(27)\quad \hat{\Sigma}_n = \frac{1}{n}XX^T.
\]
Such a random matrix $\hat{\Sigma}_n$ is called a Wishart matrix.
• In classical statistics, when $p$ is fixed and $n\to\infty$, the classical Law of Large Numbers tells us $\hat{\Sigma}_n \to I_p$.
• In high dimensional statistics, when both $n$ and $p$ grow with $\frac{p}{n}\to\gamma \ne 0$, the distribution of the eigenvalues of $\hat{\Sigma}_n$ follows the so-called Marčenko-Pastur (MP) distribution (BS10),
\[
(28)\quad \mu_{MP}(t) = \left(1 - \frac{1}{\gamma}\right)\delta(t)\,I(\gamma > 1) + \frac{\sqrt{(b-t)(t-a)}}{2\pi\gamma t}\,I(t\in[a,b])\,dt,
\]
where $a = (1-\sqrt{\gamma})^2$, $b = (1+\sqrt{\gamma})^2$. In other words, if $\gamma\le 1$ the distribution has support on $[a,b]$, and if $\gamma > 1$ it has an additional point mass $1 - 1/\gamma$ at the origin.
Figure 3 illustrates the MP distribution by MATLAB simulations whose codes can be found below.
$\hat{\Sigma}_n = \frac{1}{n}YY^T$ where $Y = [Y_1,\ldots,Y_n]\in\mathbb{R}^{p\times n}$. Suppose one of its eigenvalues is $\hat{\lambda}$ and the corresponding unit eigenvector is $\hat{v}$, so $\hat{\Sigma}_n\hat{v} = \hat{\lambda}\hat{v}$.
First of all, we relate $\hat{\lambda}$ to the MP distribution by the trick
\[
(29)\quad Z_i = \Sigma^{-\frac12}Y_i \;\Rightarrow\; Z_i\sim N(0, I_p), \quad\text{where } \Sigma = \sigma_x^2uu^T + \sigma_\varepsilon^2 I_p = Ruu^T + I_p.
\]
Then $S_n = \frac1n\sum_{i=1}^n Z_iZ_i^T = \frac1n ZZ^T$ is a Wishart random matrix whose eigenvalues follow the Marčenko-Pastur distribution.
Notice that $\hat{\Sigma}_n = \frac1n YY^T = \Sigma^{1/2}\left(\frac1n ZZ^T\right)\Sigma^{1/2} = \Sigma^{\frac12}S_n\Sigma^{\frac12}$ and $(\hat{\lambda}, \hat{v})$ is an eigenvalue-eigenvector pair of the matrix $\hat{\Sigma}_n$. Therefore
\[
(30)\quad \Sigma^{\frac12}S_n\Sigma^{\frac12}\hat{v} = \hat{\lambda}\hat{v} \;\Rightarrow\; S_n\Sigma(\Sigma^{-\frac12}\hat{v}) = \hat{\lambda}(\Sigma^{-\frac12}\hat{v}),
\]
that is, if $u^Tv\ne 0$,
\[
(37)\quad 1 = \sigma_X^2\cdot u^T(\hat{\lambda} I_p - \sigma_\varepsilon^2 S_n)^{-1}S_n u,
\]
which is
\[
(39)\quad 1 = \sigma_X^2\cdot\sum_{i=1}^p\frac{\lambda_i}{\hat{\lambda} - \sigma_\varepsilon^2\lambda_i}\,\alpha_i^2,
\]
where $\sum_{i=1}^p\alpha_i^2 = 1$. Since $W$ consists of a random orthonormal basis on a sphere, $\alpha_i$ will concentrate around its mean $\alpha_i = \frac{1}{\sqrt{p}}$. For large $p$, the $\lambda_i\sim\mu_{MP}(\lambda_i)$ can be thought of as sampled from $\mu_{MP}$, and the sum in (39) can thus be regarded as the following Monte-Carlo integration with respect to the MP distribution,
\[
(40)\quad 1 = \sigma_X^2\cdot\frac1p\sum_{i=1}^p\frac{\lambda_i}{\hat{\lambda} - \sigma_\varepsilon^2\lambda_i}\ \sim\ \sigma_X^2\cdot\int_a^b\frac{t}{\hat{\lambda} - \sigma_\varepsilon^2 t}\,d\mu_{MP}(t).
\]
Since we had assumed without loss of generality that $\sigma_\varepsilon^2 = 1$, we can compute the integral above using the Stieltjes transform and obtain
\[
(41)\quad 1 = \sigma_X^2\cdot\int_a^b\frac{t}{\hat{\lambda} - t}\cdot\frac{\sqrt{(b-t)(t-a)}}{2\pi\gamma t}\,dt = \frac{\sigma_X^2}{4\gamma}\left[2\hat{\lambda} - (a+b) - 2\sqrt{|(\hat{\lambda} - a)(b - \hat{\lambda})|}\right].
\]
For $\hat{\lambda}\ge b$ and $R = \sigma_X^2\ge\sqrt{\gamma}$, we have
\[
1 = \frac{\sigma_X^2}{4\gamma}\left[2\hat{\lambda} - (a+b) - 2\sqrt{(\hat{\lambda}-a)(\hat{\lambda}-b)}\right]
\;\Longrightarrow\;
\hat{\lambda} = \sigma_X^2 + \frac{\gamma}{\sigma_X^2} + 1 + \gamma = (1+\sigma_X^2)\left(1 + \frac{\gamma}{\sigma_X^2}\right).
\]
More generally, for $\sigma_\varepsilon^2\ne 1$ all the equations above remain true, except that $\hat{\lambda}$ is replaced by $\frac{\hat{\lambda}}{\sigma_\varepsilon^2}$ and $\sigma_X^2$ by the signal-noise ratio $R = \sigma_X^2/\sigma_\varepsilon^2$. Then we get
\[
\hat{\lambda} = (1+R)\left(1 + \frac{\gamma}{R}\right)\sigma_\varepsilon^2.
\]
Here we observe the following phase transition for the primary eigenvalue:
• If $\hat{\lambda}\in[a,b]$, then $\hat{\Sigma}_n$ has its primary eigenvalue $\hat{\lambda}$ within $\mathrm{supp}(\mu_{MP})$, so it is indistinguishable from the noise $S_n$.
• If $\hat{\lambda}\ge b$, PCA will pick up the top eigenvalue as a signal.
• So $\hat{\lambda} = b$ is the phase transition at which PCA starts to pop up the signal rather than the noise. Plugging $\hat{\lambda} = b$ into (41), we get
\[
(42)\quad 1 = \sigma_X^2\cdot\frac{1}{4\gamma}\left[2b - (a+b)\right] = \frac{\sigma_X^2}{\sqrt{\gamma}} \;\Longleftrightarrow\; \sigma_X^2 = \sqrt{\gamma} = \sqrt{\frac{p}{n}}.
\]
Hence, in order to make PCA work, we need the signal-noise ratio $R \ge \sqrt{\frac{p}{n}}$.
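A minimal simulation sketch (not the book's code) of this phase transition follows: compare the top eigenvalue of the sample covariance with the Marchenko-Pastur edge $b = (1+\sqrt{\gamma})^2$ for a signal strength below and above $\sqrt{\gamma}$; the sizes and strengths are illustrative.

import numpy as np

n, p = 1000, 400
gamma = p / n                                      # sqrt(gamma) ~ 0.63
u = np.zeros(p); u[0] = 1.0                        # true signal direction
b_edge = (1 + np.sqrt(gamma)) ** 2
for sigma2 in [0.3, 2.0]:                          # below / above sqrt(gamma)
    Y = np.sqrt(sigma2) * u[:, None] * np.random.randn(1, n) + np.random.randn(p, n)
    evals, evecs = np.linalg.eigh(Y @ Y.T / n)
    lam_max, v = evals[-1], evecs[:, -1]
    print(sigma2, lam_max, b_edge, abs(u @ v))     # above threshold: lam_max > b and |<u, v>| stays away from 0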
2.3.3.2. Primary Eigenvector. We now study the phase transition of the primary eigenvector. It is convenient to study $|u^Tv|^2$ first and then translate back to $|u^T\hat{v}|^2$. From Equation (35), we obtain
\[
1 = v^Tv = \sigma_X^4\cdot v^Tuu^TS_n(\lambda I_p - \sigma_\varepsilon^2 S_n)^{-2}S_nuu^Tv = \sigma_X^4\cdot(|v^Tu|)\left[u^TS_n(\lambda I_p - \sigma_\varepsilon^2S_n)^{-2}S_nu\right](|u^Tv|),
\]
which implies that
\[
(43)\quad |u^Tv|^{-2} = \sigma_X^4\left[u^TS_n(\lambda I_p - \sigma_\varepsilon^2S_n)^{-2}S_nu\right].
\]
Using the same trick as in equation (37), we reach the following Monte-Carlo integration
\[
(44)\quad |u^Tv|^{-2} = \sigma_X^4\left[u^TS_n(\lambda I_p - \sigma_\varepsilon^2S_n)^{-2}S_nu\right]\ \sim\ \sigma_X^4\int_a^b\frac{t^2}{(\lambda - \sigma_\varepsilon^2 t)^2}\,d\mu_{MP}(t),
\]
and assuming that $\lambda\ge b$, from the Stieltjes transform introduced later one can compute the integral as
\[
|u^Tv|^{-2} = \sigma_X^4\cdot\int_a^b\frac{t^2}{(\lambda - \sigma_\varepsilon^2t)^2}\,d\mu_{MP}(t) = \frac{\sigma_X^4}{4\gamma}\left(-4\lambda + (a+b) + 2\sqrt{(\lambda-a)(\lambda-b)} + \frac{\lambda(2\lambda - (a+b))}{\sqrt{(\lambda-a)(\lambda-b)}}\right),
\]
from which it can be computed (using $\hat{\lambda} = (1+R)(1+\frac{\gamma}{R})$ obtained above, where $R = \frac{\sigma_X^2}{\sigma_\epsilon^2}$) that
\[
|u^Tv|^2 = \frac{1 - \frac{\gamma}{R^2}}{1 + \gamma + \frac{2\gamma}{R}}.
\]
Now we can compute the inner product of $u$ and $\hat{v}$ that we are really interested in:
\begin{align*}
|u^T\hat{v}|^2 &= \left(\frac1c u^T\Sigma^{\frac12}v\right)^2 = \frac1{c^2}\left((\Sigma^{\frac12}u)^Tv\right)^2\\
&= \frac1{c^2}\left(\big((Ruu^T + I_p)^{\frac12}u\big)^Tv\right)^2\\
&\overset{*}{=} \frac1{c^2}\left((\sqrt{1+R}\,u)^Tv\right)^2\\
&\overset{**}{=} \frac{(1+R)(u^Tv)^2}{R(u^Tv)^2 + 1}\\
&= \frac{1+R - \frac{\gamma}{R} - \frac{\gamma}{R^2}}{1+R+\gamma+\frac{\gamma}{R}} = \frac{1 - \frac{\gamma}{R^2}}{1 + \frac{\gamma}{R}},
\end{align*}
where the equality $(*)$ uses $\Sigma^{1/2}u = \sqrt{1+R}\,u$, and the equality $(**)$ is due to the formula for $c^2$ (Equation (31) above). Note that this identity holds under the condition $R\ge\sqrt{\gamma}$, which makes the numerator above non-negative.
Therefore if PCA works well and the noise does not dominate, the inner product $|u^T\hat{v}|$ should be close to 1. In particular, when $\gamma = 0$ we have $|u^T\hat{v}| = 1$, as disclosed by the classical Law of Large Numbers in statistics. On the other hand, from RMT we know that if the top eigenvalue $\hat{\lambda}\in[a,b]$ is overwhelmed within the support of the M.P. distribution, then the primary eigenvector computed by PCA is purely random and $|u^T\hat{v}| = 0$, which means that from $\hat{v}$ we can learn nothing about the signal $u$.
2.3.5. Bibliographic Remarks. Random Matrix Theory can only deal with homogeneous Gaussian noise $\sigma_\varepsilon^2 I_p$ here. Moreover, it is still an open problem how to deal with heteroscedastic noise, where Art Owen and Jingshu Wang have some preliminary studies (OW16).
When $\frac{\log(p)}{n} \to 0$, we need to add more restrictions on $\hat{\Sigma}_n$ in order to estimate it faithfully. There are typically three kinds of restrictions:
• $\Sigma$ sparse;
• $\Sigma^{-1}$ sparse, also called the precision matrix;
• banded structures (e.g. Toeplitz) on $\Sigma$ or $\Sigma^{-1}$.
Recent developments can be found in the works of Bickel, Tony Cai, Tsybakov, Wainwright et al.
For spectral studies of random kernel matrices, see El Karoui, Tiefeng Jiang, Xiuyuan Cheng, and Amit Singer et al.
values $\{\hat{\lambda}_i^2\}_{i=1,\ldots,p}$. Repeating such a procedure $R$ times, we can get $R$ sets of singular values. They can be put together as a matrix
\[
\begin{pmatrix}
\hat{\lambda}_1^1 & \hat{\lambda}_2^1 & \cdots & \hat{\lambda}_p^1\\
\hat{\lambda}_1^2 & \hat{\lambda}_2^2 & \cdots & \hat{\lambda}_p^2\\
\vdots & \vdots & \ddots & \vdots\\
\hat{\lambda}_1^R & \hat{\lambda}_2^R & \cdots & \hat{\lambda}_p^R
\end{pmatrix}
\]
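A minimal sketch (not the book's lab code) of this random permutation procedure in the spirit of Horn's Parallel Analysis: permute each column independently, recompute the spectrum, and keep components whose eigenvalues exceed the permutation quantile. The data, the number of repetitions R and the quantile are illustrative.

import numpy as np

def parallel_analysis(X, R=20, q=0.95):
    """X: (n, p) data; returns the number of components passing the permutation test."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    evals = np.linalg.svd(Xc, compute_uv=False) ** 2 / n
    perm_evals = np.zeros((R, len(evals)))
    rng = np.random.default_rng(0)
    for r in range(R):
        Xp = np.column_stack([rng.permutation(Xc[:, j]) for j in range(p)])
        perm_evals[r] = np.linalg.svd(Xp, compute_uv=False) ** 2 / n
    thresh = np.quantile(perm_evals, q, axis=0)
    return int(np.sum(evals > thresh))

print(parallel_analysis(np.random.randn(100, 20)))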
variable $y$. Does PCA really take the response variable into account in supervised learning, e.g., in classification or regression?
In the 2005 Fisher Lecture, R. Dennis Cook (Coo07) described PCA as a sufficient dimensionality reduction in regression, and also extended it to principal fitted components (PFC). Here we introduce his idea, together with several variations of supervised PCA: Fisher's Linear Discriminant Analysis and Li's Sliced Inverse Regression.
A sufficient dimension reduction Γ (Γ ∈ Rp×d , ΓT Γ = Id ) refers to the setting
that the conditional distribution of Y |X is the same as the distribution of Y |ΓT X
for all X.
For example, in regression Y = f (X, ε), for some unknown function f , sufficient
dimensionality reduction implies that Y = f (ΓT X, ε). However f is unknown here.
How can we find Γ independent of the choice of f?
The answer is a possible Yes when we consider the inverse problem, based on
the conditional distribution X|Y .
For example, consider the following inverse model: for each value $y$ of the response variable,
\[
(51)\quad X_y = \mu + \Gamma\nu_y + \varepsilon,
\]
and the MLE tries to find $\arg\max_{\mu,\Gamma,\nu_y}\prod_y f(X_y|\mu,\Gamma,\nu_y)$, which is equivalent to the following optimization problem after a logarithmic transform:
\[
\max_{\mu,\Gamma,\nu_y}\ -\frac{1}{2\sigma^2}\sum_y\|X_y - \mu - \Gamma\nu_y\|^2 - p\sum_y\log\sigma + C.
\]
\[
\hat{\nu}_y = \hat{\Gamma}^T(X_y - \hat{\mu}),
\]
and
\[
(52)\quad \hat{\Gamma} = \arg\min_{\Gamma^T\Gamma = I}\sum_y\|X_y - \hat{\mu} - P_\Gamma(X_y - \hat{\mu})\|^2, \qquad P_\Gamma = \Gamma\Gamma^T.
\]
Comparing (52) with (2) shows that when $y$ takes distinct values (e.g. the unknown function $f$ is injective), this is exactly PCA in unsupervised learning. Therefore PCA can also be derived as a sufficient dimensionality reduction in supervised learning, even though the function $f$ is unknown here.
For $y$ with discrete or repeated values, with an equal number $N_y$ of samples at each value, it suffices to replace $X_y$ by
\[
\hat{\mu}_y = \frac{1}{N_y}\sum_{y_i = y}X_i.
\]
Two famous examples are Fisher's Linear Discriminant Analysis (LDA) for classification (HTF01) and Ker-Chau Li's Sliced Inverse Regression (SIR) (Li91), which will be called supervised PCAs here. See Cook (Coo07) for a general class of principal fitted components adapted to supervised learning.
but not ordered. LDA captures the variance between classes and meanwhile discards the variance within classes.
Define the between-class covariance matrix
\[
\hat{\Sigma}_B^{p\times p} = \frac{1}{K}\sum_{k=1}^K(\hat{\mu}_k - \hat{\mu})(\hat{\mu}_k - \hat{\mu})^T,
\]
where $\hat{\mu}$ is the sample mean and the $\hat{\mu}_k$ are the within-class means, i.e.
\[
\hat{\mu}_k = \frac{1}{N_k}\sum_{y_i = k}X_i.
\]
the principal curve or inverse regression curve. Such a curve might be easier to deal
with than the high dimensional regression function.
ordered discrete)
Output: Effective dimension reducing directions $\Gamma_d$
Step 1: Divide the range of $y_i$ into $S$ non-overlapping slices $H_s$ ($s = 1, \ldots, S$); $N_s$ is the number of observations within each slice.
Step 2: Compute the sample mean and total covariance matrix
\[
\hat{\mu} = \frac{1}{N}\sum_{i=1}^N X_i, \qquad \hat{\Sigma}^{p\times p} = \frac{1}{N}\sum_{i=1}^N(X_i - \hat{\mu})(X_i - \hat{\mu})^T;
\]
Step 3: Compute the mean of $X_i$ over each slice and the between-slice covariance matrix
\[
\hat{\mu}_s = \frac{1}{N_s}\sum_{y_i\in H_s}X_i, \qquad \hat{\Sigma}_B^{p\times p} = \frac{1}{S}\sum_{s=1}^S(\hat{\mu}_s - \hat{\mu})(\hat{\mu}_s - \hat{\mu})^T;
\]
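A minimal sketch (not the book's algorithm listing) of SIR following the steps above; the final eigen-step, taking the top eigenvectors of $\hat{\Sigma}^{-1}\hat{\Sigma}_B$, is an assumption here, and the data are illustrative.

import numpy as np

def sir(X, y, n_slices=10, d=1):
    """X: (N, p) predictors, y: (N,) response; returns (p, d) e.d.r. directions."""
    N, p = X.shape
    mu = X.mean(axis=0)
    Sigma = (X - mu).T @ (X - mu) / N                    # total covariance
    slices = np.array_split(np.argsort(y), n_slices)     # slice by sorted response
    Sigma_B = np.zeros((p, p))
    for idx in slices:
        diff = X[idx].mean(axis=0) - mu
        Sigma_B += np.outer(diff, diff) / n_slices       # between-slice covariance
    evals, evecs = np.linalg.eig(np.linalg.solve(Sigma, Sigma_B))
    top = np.argsort(evals.real)[::-1][:d]
    return evecs[:, top].real

Gamma = sir(np.random.randn(200, 5), np.random.randn(200), d=2)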
# ## A simulation to show that JSE has smaller Mean Square Error than MLE
# ### $\mu$ is generated from Normal(0,1)
import numpy as np
import pandas as pd
ordered discrete)
Output: Effective dimension reducing directions $\Gamma_d$
Step 1: Compute the total covariance matrix $\hat{\Sigma}$ as in SIR;
Step 2: Divide the range of $y_i$ into $S$ non-overlapping slices $H_s$ ($s = 1, \ldots, S$); for each sample $(X_i, y_i)$ compute the localized mean and covariance
\[
\hat{\mu}_{i,loc} = \frac{1}{|s_i|}\sum_{j\in s_i}X_j, \qquad
\hat{\Sigma}_{loc} = \frac{1}{N}\sum_i(\hat{\mu}_{i,loc} - \hat{\mu})(\hat{\mu}_{i,loc} - \hat{\mu})^T;
\]
nrep = 100
# p = N in the following
N = 100
err_MLE = np.zeros(nrep)   # initialization assumed; omitted in the original listing
err_JSE = np.zeros(nrep)
for i in range(nrep):
    mu = np.random.normal(0, 1, N)
    z = np.random.normal(mu, 1, N)
    mu_MLE = z
    mu_JSE = (1 - (N - 2) / np.sum(z ** 2)) * z
    err_MLE[i] = np.sum((mu_MLE - mu) ** 2) / N
    err_JSE[i] = np.sum((mu_JSE - mu) ** 2) / N
err1 = pd.DataFrame({'err_MLE': err_MLE, 'err_JSE': err_JSE})

# Baseball batting data for the James-Stein example
names = ["Clemente", "F. Robinson", "F. Howard", "Johnstone", "Berry", "Spencer",
         "Kessinger", "L. Alvarado", "Santo", "Swoboda", "Unser", "Williams",
         "Scott", "Petrocelli", "E. Rodriguez", "Campaneris", "Munson", "Alvis"]
hits = [18, 17, 16, 15, 14, 14, 13, 12, 11, 11, 10, 10, 10, 10, 10, 9, 8, 7]
n = 45
mu = [.346, .298, .276, .222, .273, .270, .263, .210, .269,
      .230, .264, .256, .303, .264, .226, .286, .316, .200]
p = len(hits)
# mu_mle, mu_js and mu_js1 are computed in lines omitted from this listing
X = pd.DataFrame([names, hits, np.round(mu_mle, 3), mu, np.round(mu_js, 3),
                  np.round(mu_js1, 3)]).T
# X.columns = ['Names', 'hits', '$\hat{\mu}_i^{(MLE)}$', '$\mu_i$', '$\hat{\mu}_i^{(JS0)}$', '$\hat{\mu}_i^{(JS+)}$']
X.columns = ['Names', 'hits', '$\mu_{MLE}$', '$\mu_i$', '$\mu_{JS0}$', '$\mu_{JS+}$']
# ## MP-law
#
# Eigenvalue distribution of S converges to the Marcenko-Pastur distribution with parameter gamma = p/n
import numpy as np
import matplotlib.pyplot as plt

# plotting part (S, p, a, b, gamma and the MP density f_MP are defined in cells omitted here)
bins = 100
evals = np.linalg.eigvals(S).real
hist, edges = np.histogram(evals, bins=bins)
width = edges[1] - edges[0]
hist = hist / p
plt.bar(edges[0:-1], hist, align="edge", width=width, alpha=0.75)
ts = np.linspace(edges[0], b * 1.05, num=1000)
f_mps = []
for t in ts:
    f_mps.append(f_MP(a, b, t, gamma))
f_mps = np.array(f_mps)
plt.plot(ts, f_mps * width, color="r")
plt.ylim(0, max(f_mps[1:-1] * width) * 2)
plt.show()
Figure 6 shows the mean image and the top 24 PCs for digit "3". Run the Python code papca_image.py.
import numpy as np
import scipy.io as sio
import matplotlib.pyplot as plt
################################################################
X = np.loadtxt('train.3', delimiter=',')
n, dim = X.shape
mean = np.mean(X, axis=0)
X0 = X - mean
# X0 = X
Cov = np.dot(X0.T, X0) / (n - 1)
################################################################
# (n_perm, evals, pvals, pv1, perc and modes are computed in lines omitted from this listing)
Xcp = X0.copy()
evals_perm = np.zeros([n_perm, dim])
################################################################
plt.figure(figsize=(20, 10))
ax = plt.subplot(111)
ax.loglog(evals, 'r-o', linewidth=2, label=r'original')
# ax.loglog(evals0, 'g-*', linewidth=2, label=r'permuted mean')
ax1 = ax.twinx()
ax1.plot(pvals, 'b', linewidth=2)
ax1.vlines(pv1, 0, 1, 'b', 'dashed', linewidth=2, label=r'1-st nonzero p-values')
ax1.hlines(perc, pv1, dim, 'k', 'dotted', linewidth=2)
ax1.fill_between(np.arange(dim), np.ones(dim), where=(pvals > perc),
                 alpha=0.2, label=r'color fill: for p-value > %s%s' % (perc * 100, '%'))
ax1.tick_params(axis='y', labelsize=20)
ax1.set_yticks([perc, 1])
ax1.set_yticklabels(['%s%s' % (100 * perc, '%'), '%s%s' % (100, '%')])
ax1.set_ylabel('p-values', fontsize=20)
ax1.legend(loc='upper right', fontsize=20)
################################################################
# img = X[0].reshape(16, 16)
plt.figure(figsize=(10, 10))
# plt.title('mean and principal components', fontsize=20)
for j in range(25):
    # img = Xcp[j].reshape(16, 16)
    img = modes[:, j].reshape(16, 16)
    ax = plt.subplot(5, 5, j + 1)
    ax.imshow(img, cmap='gray')
    ax.set_title('%d' % (j + 1), fontsize=20)
    ax.set_xticks([])
    ax.set_yticks([])
plt.tight_layout()
plt.show()
CHAPTER 3
then the row vectors of matrix Y are the eigenvectors (singular vectors) correspond-
ing to k largest eigenvalues (singular values) of B.
The main features of MDS are the following.
• MDS looks for a Euclidean embedding of data whose total or average metric distortion is minimized.
• The MDS embedding basis is adaptive to the data, namely a function of the data via eigen-decomposition.
Note that the distortion measure here amounts to a certain distance between the set of projected points and the original set of points B. Under the Frobenius norm the distortion equals the sum of the squared lengths of these vectors. It is clear that such vectors capture a significant global property, but they do not offer any local guarantees. Chances are that some points deviate greatly from the original if we only consider minimization of the total metric distortion.
What if we want a uniform control on the metric distortion at every data pair, say
\[
(1-\epsilon) \le \frac{\|Y_i - Y_j\|^2}{d_{ij}^2} \le (1+\epsilon)?
\]
Such an embedding is an almost isometry or a Lipschitz mapping from the metric space $\mathcal{X}$ to the Euclidean space $\mathcal{Y}$. If $\mathcal{X}$ is a Euclidean space (or more generally a Hilbert space), the Johnson-Lindenstrauss Lemma tells us that one can take $\mathcal{Y}$ to be a subspace of $\mathcal{X}$ of dimension $k = O(c(\epsilon)\log n)$ via random projections to obtain an almost isometry with high probability. In contrast to MDS, the main features of this approach are the following.
• Almost isometry is achieved with a uniform metric distortion bound (bi-Lipschitz bound), with high probability, rather than average metric distortion control;
• The mapping is universal, rather than being adaptive to the data.
• $R = A/\sqrt{k/3}$, where $A_{ij} = \begin{cases} 1 & \text{with probability } 1/6,\\ 0 & \text{with probability } 2/3,\\ -1 & \text{with probability } 1/6.\end{cases}$
The proof below actually takes the first form of R as an illustration.
Now we are going to prove the Johnson-Lindenstrauss Lemma using a random projection onto a $k$-subspace in $\mathbb{R}^d$. Notice that the distributions of the following two events are identical:
\begin{align*}
\mathrm{Prob}[L \ge (1+\epsilon)\mu] &\le \exp\left(\frac{k}{2}\big(1 - (1+\epsilon) + \ln(1+\epsilon)\big)\right)\\
&\le \exp\left(\frac{k}{2}\Big(-\epsilon + \big(\epsilon - \frac{\epsilon^2}{2} + \frac{\epsilon^3}{3}\big)\Big)\right), \quad\text{by } \ln(1+x)\le x - x^2/2 + x^3/3 \text{ for } x\ge 0\\
&= \exp\left(-\frac{k}{2}\big(\epsilon^2/2 - \epsilon^3/3\big)\right)\\
&\le \exp(-(2+\alpha)\ln n), \quad\text{for } k \ge 4(1+\alpha/2)\big(\epsilon^2/2 - \epsilon^3/3\big)^{-1}\ln n\\
&= \frac{1}{n^{2+\alpha}}.
\end{align*}
Now set the map $f(x) = \sqrt{\frac{d}{k}}\,x' = \sqrt{\frac{d}{k}}\,(x_1, \ldots, x_k, 0, \ldots, 0)$. By the above calculations, for some fixed pair $i, j$, the probability that the distortion
\[
\frac{\|f(v_i) - f(v_j)\|^2}{\|v_i - v_j\|^2}
\]
does not lie in the range $[(1-\epsilon), (1+\epsilon)]$ is at most $\frac{2}{n^{2+\alpha}}$. Using the trivial union bound over the $\binom{n}{2}$ pairs, the chance that some pair of points suffers a large distortion is at most
\[
\binom{n}{2}\frac{2}{n^{(2+\alpha)}} = \frac{1}{n^\alpha}\left(1 - \frac1n\right) \le \frac{1}{n^\alpha}.
\]
Hence $f$ has the desired properties with probability at least $1 - \frac{1}{n^\alpha}$. This gives us a randomized polynomial time algorithm. □
Now, it remains to prove Lemma 3.2.2.
Proof of Lemma 3.2.2.
\begin{align*}
\mathrm{Prob}(L\le\beta\mu) &= \mathrm{Prob}\left(\sum_{i=1}^k x_i^2 \le \beta\mu\sum_{i=1}^p x_i^2\right)\\
&= \mathrm{Prob}\left(\beta\mu\sum_{i=1}^p x_i^2 - \sum_{i=1}^k x_i^2 \ge 0\right)\\
&= \mathrm{Prob}\left[\exp\left(t\beta\mu\sum_{i=1}^p x_i^2 - t\sum_{i=1}^k x_i^2\right)\ge 1\right] \quad (t>0)\\
&\le \mathbb{E}\left[\exp\left(t\beta\mu\sum_{i=1}^p x_i^2 - t\sum_{i=1}^k x_i^2\right)\right] \quad\text{(by Markov's inequality)}\\
&= \prod_{i=1}^k\mathbb{E}\exp\big(t(\beta\mu - 1)x_i^2\big)\prod_{i=k+1}^p\mathbb{E}\exp\big(t\beta\mu x_i^2\big)\\
&= \left(\mathbb{E}\exp\big(t(\beta\mu-1)x^2\big)\right)^k\left(\mathbb{E}\exp\big(t\beta\mu x^2\big)\right)^{p-k}\\
&= \big(1 - 2t(\beta\mu - 1)\big)^{-k/2}\big(1 - 2t\beta\mu\big)^{-(p-k)/2}.
\end{align*}
We use the fact that if $X\sim N(0,1)$, then $\mathbb{E}[e^{sX^2}] = \frac{1}{\sqrt{1-2s}}$ for $-\infty < s < 1/2$.
Now we refer to the last expression as $g(t)$. The last line of the derivation gives us the additional constraints $t\beta\mu \le 1/2$ and $t(\beta\mu - 1)\le 1/2$, so we have $0 < t < 1/(2\beta\mu)$. Minimizing $g(t)$ is equivalent to maximizing
\[
h(t) = 1/g(t) = \big(1 - 2t(\beta\mu - 1)\big)^{k/2}\big(1 - 2t\beta\mu\big)^{(p-k)/2}
\]
in the interval $0 < t < 1/(2\beta\mu)$. Setting the derivative $h'(t) = 0$, we find the maximum is achieved at
\[
t_0 = \frac{1-\beta}{2\beta(p - \beta k)}.
\]
Hence we have
\[
h(t_0) = \left(\frac{p-k}{p - k\beta}\right)^{(p-k)/2}\left(\frac{1}{\beta}\right)^{k/2},
\]
and this is exactly what we need.
The proof of Lemma 3.2.2 (b) is almost exactly the same as that of Lemma 3.2.2 (a). □
3.2.1. Conclusion. As we can see, this proof of the lemma is both simple (using just some elementary probabilistic techniques) and elegant. In the field of machine learning, such stochastic methods often turn out to be really powerful. The random projection method introduced here can be used in many fields, especially when data of huge dimensionality are concerned. For one example, in term-document data it is really useful: compared with the number of words in the dictionary, the set of words appearing in a document is typically sparse (a few thousand words) while the dictionary is huge. Random projections often provide a useful tool to compress such data without losing much pairwise distance information.
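A minimal sketch (not from the text) of such a random projection follows: project $n$ points in $\mathbb{R}^d$ down to $k = O(\log n/\epsilon^2)$ dimensions with a Gaussian random matrix and check the distortion of one pairwise distance; the sizes and $\epsilon$ are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 200, 1000, 0.3
k = int(np.ceil(4 * np.log(n) / (eps**2 / 2 - eps**3 / 3)))   # dimension from the bound above
X = rng.standard_normal((n, d))
R = rng.standard_normal((d, k)) / np.sqrt(k)                   # Gaussian random projection
Y = X @ R
i, j = 0, 1
ratio = np.sum((Y[i] - Y[j])**2) / np.sum((X[i] - X[j])**2)
print(k, ratio)                                                # ratio lies in [1 - eps, 1 + eps] w.h.p.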
assumption is that the signal $x^*$ is sparse, namely the number of nonzero components $\|x^*\|_0 := \#\{x_i^* \ne 0 : 1\le i\le p\}$ is small compared to the total dimensionality $p$. Figure 2 gives an illustration of such a sparse linear equation problem.
3. It is natural to ask how well OMP can recover $x^*$; the answer is yes under some conditions we will discuss below.
3.4.1.3. LASSO. Least Absolute Shrinkage and Selection Operator (LASSO)
(Tib96) solves the following problem for noisy measurement b = Ax + e:
then P1 recovers $x^*$.
Tropp (Tro04) also shows that the incoherence condition is stronger than the irrepresentable condition in the following sense:
Lemma 3.4.1 (Tropp, 2004 (Tro04)).
\[
(63)\quad \mu < \frac{1}{2k-1} \;\Rightarrow\; M \le \frac{k\mu}{1 - (k-1)\mu} < 1.
\]
On the other hand, Tony Cai et al. (CXZ09; CW11) show that the irrepresentable or the incoherence condition is tight in the sense that if it fails, there exist data $A$, $x^*$, and $b$ such that sparse recovery is not possible.
\begin{align*}
(66)\qquad & \max_{i,j}|\Delta_{i,j}| \le \mu, \quad \mathrm{diag}(\Delta) = 0;\\
\Rightarrow\ & \|\Delta\|_\infty \le \frac{k-1}{2k-1} < 1;\\
\Rightarrow\ & (A_S^*A_S)^{-1} = (I_k + \Delta)^{-1} = \sum_{j=0}^\infty(-\Delta)^j;\\
\Rightarrow\ & \|(A_S^*A_S)^{-1}\|_\infty = \Big\|\sum_{j=0}^\infty(-\Delta)^j\Big\|_\infty \le \sum_{j=0}^\infty\|\Delta\|_\infty^j = \frac{1}{1-\|\Delta\|_\infty} \le \frac{1}{1-(k-1)\mu}.
\end{align*}
In the noise-free case,
\[
(69)\quad \left.\begin{aligned} b &= Ax^*\in\mathrm{im}(A_S)\\ r^t &= b - Ax^t\in\mathrm{im}(A_S)\end{aligned}\right\}\;\Rightarrow\; r^t\in\mathrm{im}(A_S).
\]
$P_S = A_S(A_S^*A_S)^{-1}A_S^*$ is the projection operator onto $\mathrm{im}(A_S)$, thus we have $r^t = P_Sr^t$. Hence,
\[
(70)\quad \rho(r^t) = \frac{\|A_{S^c}^*(P_Sr^t)\|_\infty}{\|A_S^*r^t\|_\infty} = \frac{\|A_{S^c}^*A_S(A_S^*A_S)^{-1}A_S^*r^t\|_\infty}{\|A_S^*r^t\|_\infty} \le \|A_{S^c}^*A_S(A_S^*A_S)^{-1}\|_\infty < 1.
\]
(II) BP recovers $x^*$.
Assume $\hat{x}\ne x^*$ solves
\[
(71)\quad P_1: \ \min\|x\|_1, \quad\text{s.t. } Ax = b.
\]
Denote $\hat{S} = \mathrm{supp}(\hat{x})$ and $\hat{S}\setminus S\ne\emptyset$. We have
\begin{align*}
(72)\qquad \|x^*\|_1 &= \|(A_S^*A_S)^{-1}A_S^*b\|_1\\
&= \|(A_S^*A_S)^{-1}A_S^*A_{\hat{S}}\hat{x}_{\hat{S}}\|_1 \quad (A\hat{x} = b)\\
&= \|(A_S^*A_S)^{-1}A_S^*A_S\hat{x}_S + (A_S^*A_S)^{-1}A_S^*A_{\hat{S}\setminus S}\hat{x}_{\hat{S}\setminus S}\|_1 \quad (\hat{x}_{\hat{S}} = \hat{x}_S + \hat{x}_{\hat{S}\setminus S})\\
&< \|\hat{x}_S\|_1 + \|\hat{x}_{\hat{S}\setminus S}\|_1 = \|\hat{x}_{\hat{S}}\|_1,
\end{align*}
which is a contradiction. □
3.4.2.3. RIP and Random Projections. (CDD09) shows that the incoherence condition implies RIP, whence RIP is a weaker condition. Under the RIP condition, uniqueness of P0 and P1 can be guaranteed for all $k$-sparse signals, often called uniform exact recovery (Can08).
Theorem 3.4.3. The following holds for all $k$-sparse $x^*$ satisfying $Ax^* = b$.
(1) If $\delta_{2k} < 1$, then problem P0 has a unique solution $x^*$;
(2) If $\delta_{2k} < \sqrt{2} - 1$, then P1 (58) has a unique solution $x^*$, i.e. it recovers the original sparse signal $x^*$.
The first condition$^2$ says nothing but that every $2k$ columns of $A$ are linearly independent. To see the first condition, assume by contradiction that there is another $k$-sparse solution of P0, $x'$. Then $Ay = 0$ with $y = x^* - x'$ being $2k$-sparse. If $y\ne 0$, it violates $\delta_{2k} < 1$ since $0 = \|Ay\| \ge (1 - \delta_{2k})\|y\| > 0$. Hence one must have $y = 0$, i.e. $x^* = x'$, which proves the uniqueness of P0. The proof of the second condition can be found in (Can08).
RIP conditions also lead to upper bounds on the distance between the solutions above and the true sparse signal $x^*$. For example, in the case of BPDN the following result holds (Can08).
Theorem 3.4.4. Suppose that $\|e\|_2 \le \epsilon$. If $\delta_{2k} < \sqrt{2} - 1$, then
\[
\|\hat{x} - x^*\|_2 \le C_1k^{-1/2}\sigma_k^1(x^*) + C_2\epsilon,
\]
$^2$The necessity of the first condition fails. As pointed out to me by Mr. Kaizheng Wang, a counterexample can be constructed as follows. Let $A = [1, 1, 1, 0;\ 1, 1, 0, 1]$, $x^* = [0, 0, 1, 0]^T$, $b = [1, 0]^T$, $x = [1, -1, 0, 0]^T$, $k = 1$. Then $x^*$ is the unique $k$-sparse solution to $Ax^* = b$. On the other hand, $x$ is $2k$-sparse, but $Ax = 0$. Hence dependence of columns in $A$ implies that $\delta_{2k} \ge 1$, which disproves the necessity of $\delta_{2k} < 1$.
\[
2\left(\frac{ep}{k}\right)^k\left(\frac{12}{\delta}\right)^k e^{-c_0(\delta/2)n} = 2e^{-c_0(\delta/2)n + k[\log(ep/k) + \log(12/\delta)]}.
\]
Thus for a fixed $c_1 > 0$, whenever $k \le c_1 n/\log(p/k)$, the exponent above will be $\le -c_2 n$ provided that $c_2 \le c_0(\delta/2) - c_1\big(1 + (1 + \log(12/\delta))/\log(p/k)\big)$. $c_2$ can always be chosen to be $> 0$ if $c_1 > 0$ is small enough. This leads to the results. □
Another use of random projections (random matrices) can be found in Robust Principal Component Analysis (RPCA) in the next chapter.
estimators (FL01), and produce unbiased estimators under nearly the same model
selection consistency conditions as Lasso. Furthermore, (HSXY20) showed that un-
der a strictly weaker condition than generalized Lasso (LST13), statistical path con-
sistency could be achieved by (220) equipped with the variable splitting technique.
These studies laid down a theoretical foundation for the statistical consistency of
regularization paths generated by the solutions of differential inclusion (220). A free
R package is released, Libra (Linearized Bregman Algorithms) (RXY18). Another
Matlab package3 was also released with the Split LBI algorithm (HSXY16). These
works fostered various successful applications, such as high dimensional statistics
(XRY18), computer vision (FHX+ 16; ZSF+ 18), medical image analysis (SHYW17),
multimedia (XXCY16b), machine learning (XXCY16a; HSXY16), and AI (HY18).
Below we are going to show the bias of LASSO and a way of deriving the inverse
scale space ?? together with its statistical model selection consistency.
3.5.1. The Bias of LASSO. Consider a sparse linear model as follows. Assume that $\beta^* \in \mathbb{R}^p$ is sparse and unknown. Our purpose is to recover $\beta^*$ from $n$ linear measurements
\[
y = X\beta^* + \epsilon, \quad y\in\mathbb{R}^n,
\]
where the noise $\epsilon \sim N(0, \sigma^2)$, $S := \mathrm{supp}(\beta^*)$ with $s = |S| \le n \le p$, and $T$ is the complement of $S$.
Recall that LASSO ((Tib96); or equivalently BPDN (CDS98)) solves the following $\ell_1$-regularized Maximum Likelihood Estimate problem
\[
(77)\quad \min_\beta\ \|\beta\|_1 + \frac{t}{2n}\|y - X\beta\|_2^2,
\]
where the regularization parameter $\lambda = 1/t$ is often used in the literature. But here we adopt the parameter $t$ for the purpose of deriving the inverse scale space dynamics ??.
LASSO is biased in the sense that $\mathbb{E}(\hat{\beta}_t) \ne \beta^*$ for all $t > 0$. Let's look at some simple examples.
Example 5. (a) $X = \mathrm{Id}$, $n = p = 1$: LASSO is soft-thresholding,
\[
\hat{\beta}_\tau = \begin{cases} 0, & \text{if } \tau < 1/\tilde{\beta}^*;\\[2pt] \tilde{\beta}^* - \dfrac{1}{\tau}, & \text{otherwise;}\end{cases}
\]
(b) $n = 100$, $p = 256$, $X_{ij} \sim N(0, 1)$, $\epsilon_i \sim N(0, 0.1)$.
3https://github.com/yuany-pku/split-lbi
3.5. INVERSE SCALE SPACE METHOD FOR SPARSE LEARNING 73
Even when the following model selection consistency (conditions given by (ZY06;
Zou06; YL07; Wai09), etc.) is reached at a certain τ_n:
∃ τ_n ∈ (0, ∞) s.t. supp(\hat\beta_{\tau_n}) = S,
the LASSO estimate is biased away from the oracle estimator:
(78) (\hat\beta_{\tau_n})_S = \tilde\beta^*_S - \frac{1}{\tau_n}\Sigma_{n,S}^{-1}\,\mathrm{sign}(\beta^*_S), \quad \tau_n > 0,
where the oracle estimator is defined as the subset least squares solution (MLE)
with \tilde\beta^*_T = 0, had God revealed S to us,
(79) \tilde\beta^*_S = \beta^*_S + \frac{1}{n}\Sigma_n^{-1}X_S^T\epsilon, \quad \text{where } \Sigma_n = \frac{1}{n}X_S^TX_S.
Estimate (78) can be derived from the first order optimality condition of LASSO,
(80a) \frac{\rho_t}{t} = \frac{1}{n}X^T(y - X\beta_t),
(80b) \rho_t \in \partial\|\beta_t\|_1,
by setting \beta_T(t) = 0 and solving for \beta_S(t).
How to remove the bias and recover the oracle estimator? To reduce bias, various
non-convex regularization schemes were proposed (Fan-Li's SCAD, Zhang's MC+ (MCP),
Zou's Adaptive LASSO, ℓ_q (q < 1), etc.):
\min_\beta \sum_i \phi(|\beta_i|) + \frac{t}{2n}\|y - X\beta\|_2^2,
where φ is a nonnegative function that is singular (non-differentiable) at 0 to enforce
sparsity and whose derivative satisfies \lim_{t\to\infty}\phi'(t) = 0 for debiasing. Such a φ
must be nonconvex, and it is in general computationally hard to locate the global optimizer
in nonconvex optimization. Various studies give conditions under which any local
optimizer achieves the statistical precision. Is there any other simple scheme?
3.5.2. Deriving Differential Inclusion as Debiasing. The crucial idea is
as follows.
• LASSO:
\min_\beta \|\beta\|_1 + \frac{t}{2n}\|y - X\beta\|_2^2.
• KKT optimality condition:
\Rightarrow \rho_t = \frac{t}{n}X^T(y - X\beta_t)
\Rightarrow \dot\beta_{\tau_n}\tau_n + \beta_{\tau_n} = \tilde\beta^*
• Equivalently, the derivative term removes the bias of LASSO automatically:
\beta^{lasso}_{\tau_n} = \tilde\beta^* - \frac{1}{\tau_n}\Sigma_n^{-1}\mathrm{sign}(\beta^*) \;\Rightarrow\; \dot\beta^{lasso}_{\tau_n}\tau_n + \beta^{lasso}_{\tau_n} = \tilde\beta^* \ \text{(oracle)!}
Remark. (a) “Irrepresentable” means that one can not represent (regress)
column vectors in XT by covariates in XS .
(b) The incoherence/irrepresentable condition is used independently in (Tro04;
YL06; ZY06; Zou06; Wai09; CWX10; CW11) etc.
ISS is a kind of restricted gradient descent (also known as Bregman gradient or
mirror descent):
\dot\rho_t = -\nabla L(\beta_t) = \frac{1}{n}X^T(y - X\beta_t), \quad \rho_t \in \partial\|\beta_t\|_1,
such that
• incoherence condition and strong signals ensure it firstly evolves on index
set S (Oracle Subspace) to reduce the loss
Theorem 3.5.1 ((ORX+16)). Assume (A1) and (A2). Define an early stopping
time
\tau := \frac{\eta}{2\sigma}\sqrt{\frac{n}{\log p}}\left(\max_{j\in T}\|X_j\|\right)^{-1},
and the smallest magnitude \beta^*_{\min} = \min\{|\beta^*_i| : i \in S\}. Then
(a) No false positive: for all t ≤ τ, the path has no false positive with high
probability, supp(β(t)) ⊆ S;
(b) Model selection consistency: moreover, if the signal is strong enough
such that
\beta^*_{\min} \ge \frac{8\sigma\sqrt{2+\log s}}{\sqrt{\gamma}} \;\vee\; \frac{4\sigma(\max_{j\in T}\|X_j\|)}{\gamma\eta}\sqrt{\frac{\log p}{n}},
(84) \dot\rho_t + \frac{\dot x_t}{\kappa} = -\nabla_x \ell(x_t), \quad \rho_t \in \partial\Omega(x_t).
Its Euler forward discretization gives the Linearized Bregman Iteration (LBI).
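The explicit LBI update is not reproduced in this excerpt. The following is a minimal Python sketch of the standard linearized Bregman iteration, under the assumption ℓ(x) = (1/2n)‖y − Ax‖₂² and Ω = ‖·‖₁ (so the proximal step is soft-thresholding); the step size alpha, damping kappa, and all variable names are illustrative.

import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lbi(A, y, kappa=10.0, alpha=None, n_iter=500):
    # Linearized Bregman Iteration sketch: z accumulates (sub)gradients,
    # x = kappa * shrink(z, 1) is the sparse primal iterate.
    n, p = A.shape
    if alpha is None:
        alpha = n / (kappa * np.linalg.norm(A, 2) ** 2)   # conservative step size
    z, x = np.zeros(p), np.zeros(p)
    path = []
    for _ in range(n_iter):
        grad = -A.T @ (y - A @ x) / n        # gradient of the quadratic loss
        z = z - alpha * grad                 # dual (Bregman) update
        x = kappa * soft_threshold(z, 1.0)   # primal update via shrinkage
        path.append(x.copy())
    return np.array(path)

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 256))
beta = np.zeros(256); beta[:5] = 5.0
y = A @ beta + 0.1 * rng.standard_normal(100)
path = lbi(A, y)
print("support of final iterate:", np.nonzero(np.abs(path[-1]) > 1e-6)[0][:10])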
# Covariance Matrix
K = H . dot ( X ) . dot ( X . transpose () ) . dot ( H . transpose () )
4https://scikit-learn.org/stable/modules/random_projection.html
where
A = \begin{pmatrix} A_1 \\ \vdots \\ A_m \end{pmatrix} \quad\text{and}\quad y = \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix}.
4.1.1. Duality of SDP. Define the feasible sets of the primal and dual problems
as F_p = {X ⪰ 0 : A_i • X = b_i} and F_d = {(y, S) : S = C − \sum_i y_iA_i ⪰ 0},
respectively. Similar to linear programming, semi-definite programming also has
the properties of weak and strong duality. Weak duality says that the primal value
is always an upper bound of the dual value. Strong duality says that the existence
of an interior point ensures a vanishing duality gap between primal and dual values,
as well as the complementary conditions. In this case, to check the optimality of a
primal variable, it suffices to find a dual variable which meets the complementary
condition with the primal. This is often called the witness method. For more
references on SDP duality, see e.g. (Ali95).
Theorem 4.1.1 (Weak Duality of SDP). If F_p ≠ ∅ and F_d ≠ ∅, then C • X ≥ b^T y
for all X ∈ F_p and (y, S) ∈ F_d. Indeed, C • X − b^T y = (C − \sum_i y_iA_i) • X = S • X ≥ 0,
since both S and X are positive semidefinite.
To address this issue, Robust PCA looks for the following decomposition instead
X =L+S
where
• L is a low rank matrix;
• S is a sparse matrix.
Example 7. In the spike signal model, X = αu + σ_ϵ ϵ, where α ∼ N(0, σ_u²)
and ϵ ∼ N(0, I_p). X is thus subject to the normal distribution N(0, Σ)
where Σ = σ_u² uu^T + σ_ϵ² I. So Σ = L + S has such a rank-sparsity structure with
L = σ_u² uu^T and S = σ_ϵ² I.
Example 8. Let X = [x1 , . . . , xp ]T ∼ N (0, Σ) be multivariate Gaussian ran-
dom variables. The following characterization (CPW12) holds
xi and xj are conditionally independent given other variables
⇔ (Σ−1 )ij = 0
We denote it by xi ⊥ xj |xk (k ̸∈ {i, j}). Let G = (V, E) be a undirected graph
where V represent p random variables and (i, j) ∈ E ⇔ xi ⊥ xj |xk (k ̸∈ {i, j}). G
is called a (Gaussian) graphical model of X.
Divide the random variables into observed and hidden (a few) variables X =
(Xo , Xh )T (in semi-supervised learning, unlabeled and labeled, respectively) and
\Sigma = \begin{pmatrix} \Sigma_{oo} & \Sigma_{oh} \\ \Sigma_{ho} & \Sigma_{hh} \end{pmatrix} \quad\text{and}\quad Q = \Sigma^{-1} = \begin{pmatrix} Q_{oo} & Q_{oh} \\ Q_{ho} & Q_{hh} \end{pmatrix}.
The following Schur complement equation holds for the covariance matrix of the observed
variables:
\Sigma_{oo}^{-1} = Q_{oo} - Q_{oh}Q_{hh}^{-1}Q_{ho}.
Note that
• Observable variables are often conditional independent given hidden vari-
ables, so Qoo is expected to be sparse;
• Hidden variables are few in number, so Q_{oh}Q_{hh}^{-1}Q_{ho} is of low rank.
In semi-supervised learning, the labeled points are of small number, and the unla-
beled points should be as much conditional independent as possible to each other
given labeled points. This implies that the labels should be placed on those most
“influential” points.
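As a sanity check of the Schur complement identity above, here is a small numpy sketch (illustrative, not from the text) that builds a random precision matrix with observed and hidden blocks and verifies Σ_{oo}^{-1} = Q_{oo} − Q_{oh}Q_{hh}^{-1}Q_{ho} numerically.

import numpy as np

rng = np.random.default_rng(0)
no, nh = 8, 2                       # observed and hidden dimensions
p = no + nh
B = rng.standard_normal((p, p))
Q = B @ B.T + p * np.eye(p)         # a random positive definite precision matrix
Qoo, Qoh = Q[:no, :no], Q[:no, no:]
Qho, Qhh = Q[no:, :no], Q[no:, no:]

Sigma_oo = np.linalg.inv(Q)[:no, :no]            # marginal covariance of observed block
marginal_precision = np.linalg.inv(Sigma_oo)
schur = Qoo - Qoh @ np.linalg.inv(Qhh) @ Qho
print(np.allclose(marginal_precision, schur))    # True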
distribution with ρ(t) ∼ t. Due to the scale invariance L(cΣ) = L(Σ) in this case,
one often adds the constraint trace(Σ) = 1 to make the minimizer unique, i.e.
(93) \hat\Sigma^{Tyler} = \arg\min_{\mathrm{trace}(\Sigma)=1,\ \Sigma\succeq 0} L(\Sigma) := \frac{1}{n}\sum_{i=1}^n \log(x_i^T\Sigma x_i) + \frac{1}{2}\log\det\Sigma.
(CSPW11) shows the following uncertainty principle: for any matrix M, µ(M)·ξ(M) ≥ 1.
Therefore a sufficient condition holds,
µ(S_0)·ξ(L_0) < 1 \;\Rightarrow\; T(L_0) \cap \Omega(S_0) = \{0\}.
Moreover, (CSPW11) shows the following deterministic recovery condition by SDP:
µ(S_0)·ξ(L_0) < 1/6 \;\Rightarrow\; \text{SDP recovers } L_0 \text{ and } S_0.
Probabilistic recovery conditions are given earlier in (CR09). First of all we
need some incoherence conditions for the identifiability. Assume that L0 ∈ Rn×n =
U ΣV T and r = rank(L0 ).
Sparse PCA:
(97) \max\ \mathrm{trace}(\Sigma X) - \lambda\|X\|_1 \quad \text{s.t.}\ \mathrm{trace}(X) = 1,\ X \succeq 0.
Some consistency studies can be found in the literature and references therein.
The SDP algorithm above has a simple Matlab implementation based on CVX
(http://cvxr.com/cvx), shown in Section 4.5.2.
(98) \min \sum_{i,j=1}^n \left| \|y_i - y_j\|_2^2 - d_{ij}^2 \right|.
Note that the constraint with equalities of d2ij can be replaced by inequalities
such as ≤ d2ij (1 + ϵ) (or ≥ d2ij (1 − ϵ)). This is a system of linear matrix (in)-
equalities with positive semidefinite variable Z. Therefore, the problem becomes a
typical semidefinite programming.
Given such a SD relaxation, we can easily generalize classical MDS to the sce-
narios in the introduction. For example, consider the generalized MDS with anchors
which is often called sensor network localization problem in literature (BLT+ 06).
Given anchors a_k (k = 1, ..., s) with known coordinates, find x_i such that
• \|x_i - x_j\|^2 = d_{ij}^2, where (i, j) ∈ E_x and the x_i are unknown locations;
• \|a_k - x_j\|^2 = \hat d_{kj}^2, where (k, j) ∈ E_a and the a_k are known locations.
We can exploit the following SD relaxation:
• (0; e_i - e_j)(0; e_i - e_j)^T • Z = d_{ij}^2 for (i, j) ∈ E_x,
• (a_i; e_j)(a_i; e_j)^T • Z = \hat d_{ij}^2 for (i, j) ∈ E_a,
both of which are linear with respect to Z.
Recall that every SDP problem has a dual problem (SDD). The SDD associated
with the primal problem above is
(100) \min\ I \bullet V + \sum_{(i,j)\in E_x} w_{ij} d_{ij}^2 + \sum_{(i,j)\in E_a} \hat w_{ij}\hat d_{ij}^2
s.t.
S = \begin{pmatrix} V & 0 \\ 0 & 0 \end{pmatrix} + \sum_{(i,j)\in E_x} w_{ij}A_{ij} + \sum_{(i,j)\in E_a} \hat w_{ij}\hat A_{ij} \succeq 0,
where
A_{ij} = (0; e_i - e_j)(0; e_i - e_j)^T, \qquad \hat A_{ij} = (a_i; e_j)(a_i; e_j)^T.
The variable w_{ij} is the stress on the edge between unknown points i and j, and
\hat w_{ij} is the stress on the edge between anchor i and unknown point j. Note that
the dual is always feasible, as V = 0, w_{ij} = 0 for all (i, j) ∈ E_x and \hat w_{ij} = 0 for all
(i, j) ∈ E_a is a feasible solution.
There are many Matlab toolboxes for SDP, e.g. CVX, SeDuMi, and the recent
toolboxes SNLSDP (http://www.math.nus.edu.sg/~mattohkc/SNLSDP.html) and
DISCO (http://www.math.nus.edu.sg/~mattohkc/disco.html) by Toh et al.,
adapted to MDS with uncertainty.
A crucial theoretical question is to ask, when X = Y T Y holds such that SDP
embedding Y gives the same answer as the classical MDS? Before looking for an-
swers to this question, we first present an application example of SDP embedding.
[Figure: SDP embedding for sensor network localization with noise factor nf = 0.1 and λ = 1.0; after refinement, RMSD = 5.33e−01. Panels (a) and (b).]
In fact, the max-rank solution of SDP embedding is unique. There are many
open problems in characterizing UR conditions, see Ye’s survey at ICCM’2010.
In practice, we often meet problems with noisy measurements \alpha d_{ij}^2 \le \tilde d_{ij}^2 \le \beta d_{ij}^2.
If we relax the constraint \|y_i - y_j\|^2 = d_{ij}^2, or equivalently A_i • X = b_i, to
inequalities, we can achieve an arbitrarily small rank solution. To see this, assume that
A_i \bullet X = b_i \;\mapsto\; \alpha b_i \le A_i \bullet X \le \beta b_i, \quad i = 1, \dots, m, \quad\text{where } \beta \ge 1,\ \alpha \in (0, 1);
then So, Ye, and Zhang (2008) (SYZ08) show the following result.
Theorem 4.4.3. For every d ≥ 1, there is an SDP solution \hat X ⪰ 0 with
rank(\hat X) ≤ d, if the following holds:
\beta = \begin{cases} 1 + \frac{18\ln 2m}{d}, & 1 \le d \le 18\ln 2m, \\ 1 + \sqrt{\frac{18\ln 2m}{d}}, & d \ge 18\ln 2m, \end{cases}
\qquad
\alpha = \begin{cases} \frac{1}{e(2m)^{2/d}}, & 1 \le d \le 4\ln 2m, \\ \max\left\{\frac{1}{e(2m)^{2/d}},\ 1 - \sqrt{\frac{4\ln 2m}{d}}\right\}, & d \ge 4\ln 2m. \end{cases}
X = A + E;
cvx_begin
variable L(20,20);
variable S(20,20);
variable W1(20,20);
variable W2(20,20);
variable Y(40,40) symmetric;
Y == semidefinite(40);
minimize(.5*trace(W1)+0.5*trace(W2)+lambda*sum(sum(abs(S))));
subject to
L + S >= X-1e-5;
L + S <= X + 1e-5;
Y == [W1, L’;L W2];
cvx_end
4.5.2. SPCA by CVX. The SDP algorithm (97) has a simple Matlab im-
plementation based on CVX (http://cvxr.com/cvx).
% Construct a 10-by-20 Gaussian random matrix and form a 20-by-20 correlation
% (inner product) matrix R
X0 = randn(10,20);
R = X0’*X0;
d = 20;
e = ones(d,1);
lambda = 0.5;
k = 10;
cvx_begin
variable X(d,d) symmetric;
X == semidefinite(d);
minimize(-trace(R*X)+lambda*(e’*abs(X)*e));
subject to
trace(X)==1;
cvx_end
4.5.3. RPCA by ADMM. Some ADMM-based Matlab codes for RPCA are
given by Stephen Boyd1. The following codes use cvxpy to implement RPCA:
# coding: utf-8
# In [1]:
import numpy as np
import cvxpy as cp
# In [2]:
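# NOTE: the contents of this cell are not shown in the excerpt. A minimal
# reconstruction (assumed, for runnability): a 20x20 low-rank L0 plus sparse S0,
# and the CVXPY variables and objective used by the constraints below.
np.random.seed(0)
L0 = np.random.randn(20, 3) @ np.random.randn(3, 20)                    # rank-3 ground truth
S0 = 10 * np.random.randn(20, 20) * (np.random.rand(20, 20) < 0.05)     # sparse outliers
X = L0 + S0
lambd = 0.1                                                             # illustrative weight

L = cp.Variable((20, 20))
S = cp.Variable((20, 20))
W1 = cp.Variable((20, 20))
W2 = cp.Variable((20, 20))
Y = cp.Variable((40, 40), symmetric=True)
objective = cp.Minimize(0.5 * cp.trace(W1) + 0.5 * cp.trace(W2)
                        + lambd * cp.sum(cp.abs(S)))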
# In [3]:
constraints = [Y >> 0]
constraints += [L + S >= X - 1e-5, L + S <= X + 1e-5,
                Y == cp.vstack([cp.hstack([W1, L.T]), cp.hstack([L, W2])])]
prob = cp.Problem(objective, constraints)
# In [4]:
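# (assumed, for runnability) solve the SDP with the default installed solver
prob.solve()
print("optimal value:", prob.value)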
# In [5]:
print('The difference between the low rank solution L and true L0 %s: %f'
      % (r'\|L-L0\|_2', np.linalg.norm(L.value - L0, ord=2)))
# In [6]:
# Another simple CVX implementation directly using the matrix nuclear norm
X1 = cp.Variable((20, 20))
X2 = cp.Variable((20, 20))
# In [7]:
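# NOTE: the original problem definition is not shown in the excerpt; a minimal
# reconstruction (assumed): nuclear norm + l1 decomposition of the same X.
prob2 = cp.Problem(cp.Minimize(cp.normNuc(X1) + lambd * cp.sum(cp.abs(X2))),
                   [X1 + X2 == X])
prob2.solve()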
# In [8]:
print('The difference between the low-rank solution X1 and true L0 %s: %f'
      % (r'\|X1-L0\|_2', np.linalg.norm(X1.value - L0, ord=2)))
2https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.SparsePCA.html
# coding: utf-8
# In [1]:
import numpy as np
import cvxpy as cp
np . random . seed (0)
# In [2]:
# In [3]:
# In [4]:
# In [5]:
cp.installed_solvers()
# In [6]:
d = 20
lambdas = 5
k = 10
e = np.ones((d, 1))
# In [7]:
cp.installed_solvers()
# In [8]:
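# NOTE: the remaining cells are not shown in the excerpt. A minimal CVXPY
# reconstruction (assumed) of the SDP relaxation (97), mirroring the Matlab
# code in Section 4.5.2; R is an assumed 20x20 correlation-like matrix.
X0 = np.random.randn(10, d)
R = X0.T @ X0
Xv = cp.Variable((d, d), symmetric=True)
spca = cp.Problem(cp.Minimize(-cp.trace(R @ Xv) + lambdas * cp.sum(cp.abs(Xv))),
                  [Xv >> 0, cp.trace(Xv) == 1])
spca.solve()
# indices of the k largest entries of the leading eigenvector of the solution
print(np.argsort(-np.abs(np.linalg.eigh(Xv.value)[1][:, -1]))[:k])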
# In [9]:
Nonlinear Dimensionality Reduction: Kernels on Graphs
CHAPTER 5
Manifold Learning
5.1. Introduction
In the past month we talked about two topics: one is the sample mean and
sample covariance matrix (PCA) in high dimensional spaces. We have learned that
when dimension p is large and sample size n is relatively small, in contrast to the
traditional statistics where p is fixed and n → ∞, both sample mean and PCA may
have problems. In particular, Stein’s phenomenon shows that in high dimensional
space with independent Gaussian distributions, the sample mean is worse than a
shrinkage estimator; moreover, random matrix theory sheds light on that in high
dimensional space with sample size in a fixed ratio of dimension, the sample co-
variance matrix and PCA may not reflect the signal faithfully. These phenomena
start a new philosophy in high dimensional data analysis: to overcome the curse
of dimensionality, additional constraints have to be imposed so that data do not
spread over every corner of the high dimensional space. Sparsity is a common assumption in
modern high dimensional statistics. For example, data variation may only depend
on a small number of variables; independence of Gaussian random fields leads to
sparse covariance matrix; and the assumption of conditional independence can also
lead to sparse inverse covariance matrix. In particular, an assumption that data
concentrate around a low dimensional manifold in high dimensional spaces, leads
to manifold learning or nonlinear dimensionality reduction, e.g. ISOMAP, LLE,
and Diffusion Maps etc. This assumption often finds example in computer vision,
graphics, and image processing.
All the work introduced in this chapter can be regarded as generalized PCA/MDS
on nearest neighbor graphs, rooted in the concept of manifold learning. Two
milestone works, ISOMAP (TdSL00) and Locally Linear Embedding
(LLE) (RL00), were first published in Science in 2000, opening a new field called
nonlinear dimensionality reduction, or manifold learning, in high dimensional data
analysis. Here is the development of manifold learning methods:
(104) MDS −→ ISOMAP
PCA −→ LLE −→ { Local Tangent Space Alignment, Hessian LLE, Laplacian Eigenmap, Diffusion Map }
To understand the motivation of such a novel methodology, let's take a brief
review of PCA/MDS. Given a set of data points x_i ∈ R^p (i = 1, ..., n), or merely pairwise
distances d(x_i, x_j), PCA/MDS essentially looks for an affine space which best captures
the variation of the data distribution, see Figure 1(a). However, this scheme will
not work when the data are actually distributed on a highly nonlinear
curved surface, i.e. a manifold; see the example of the Swiss Roll in Figure 1(b). Can we
extend PCA/MDS in a certain sense to capture intrinsic coordinate systems which
chart the manifold?
[Figure 1. (a) An affine subspace found by PCA/MDS; (b) the Swiss Roll, a nonlinear manifold.]
ISOMAP and LLE, as extensions of MDS and local PCA respectively, led
to a series of attempts to address this problem.
All the current techniques in manifold learning, as extensions of PCA and
MDS, are often called Spectral Kernel Embedding. The common theme of these
techniques can be described in Figure 2. The basic problem is: given a set of
data points {x_1, x_2, ..., x_n ∈ R^p}, how to find y_1, y_2, ..., y_n ∈ R^d, where d ≪ p,
such that some geometric structures (local or global) among the data points are best
preserved.
All the manifold learning techniques can be summarized in the following meta-
algorithm, which explains precisely the name of spectral kernel embedding. All the
methods can be called certain eigenmaps associated with some positive semi-definite
kernels.
(1) Construct a data graph G = (V, E), where V = {xi : i = 1, ..., n}. For
example,
5.2. ISOMAP
ISOMAP is an extension of MDS, where pairwise euclidean distances between
data points are replaced by geodesic distances, computed by graph shortest path
distances.
(1) Construct a neighborhood graph G = (V, E, d_{ij}) such that
    V = {x_i : i = 1, ..., n},
    E = {(i, j) : j is a neighbor of i, i.e. j ∈ N_i}, e.g. k-nearest neighbors or ϵ-neighbors,
    d_{ij} = d(x_i, x_j), e.g. the Euclidean distance when x_i ∈ R^p.
(2) Compute graph shortest path distances:
    d_{ij} = \min_{P=(x_i,...,x_j)} (\|x_i - x_{t_1}\| + \dots + \|x_{t_{k-1}} - x_j\|), the length of a shortest path on the graph connecting i and j,
    via Dijkstra's algorithm (O(kn² log n)) or Floyd's algorithm (O(n³)).
The basic feature of ISOMAP can be described as: we find a low dimensional
embedding of data such that points nearby are mapped nearby and points far away
are mapped far away. In other words, we have global control on the data distance
and the method is thus a global method. The major shortcoming of ISOMAP
lies in its computational complexity, characterized by a full matrix eigenvector
decomposition.
5.2.1. ISOMAP Example. Now we give an example of ISOMAP with mat-
lab codes.
% load 33-face data
load ../data/face.mat Y
X = reshape(Y,[size(Y,1)*size(Y,2) size(Y,3)]);
p = size(X,1);
n = size(X,2);
D = pdist(X’);
DD = squareform(D);
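For readers who prefer Python, a rough equivalent of the above pipeline using scikit-learn's Isomap is sketched below (illustrative; the file face.mat and the variable names follow the Matlab snippet and are assumptions).

import numpy as np
from scipy.io import loadmat
from sklearn.manifold import Isomap

# load the 33-face dataset used above (assumed path and variable name)
Y = loadmat('../data/face.mat')['Y']                       # h x w x n array of images
X = Y.reshape(-1, Y.shape[2], order='F').T                 # n x p data matrix

# ISOMAP: k-nearest-neighbor graph + shortest paths + MDS
embedding = Isomap(n_neighbors=5, n_components=2).fit_transform(X)
print(embedding.shape)                                     # (n, 2)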
such that dM (x, xi ) < ϵ, and {i, j} ∈ E if dM (xi , xj ) ≤ αϵ (α ≥ 4). Then for any
pair x, y ∈ V ,
d_S(x, y) \le \max\left(\alpha - 1,\ \frac{\alpha}{\alpha-2}\right) d_M(x, y).
Proof. Let γ be a shortest path connecting x and y on M whose length is
l. If l ≤ (α − 2)ϵ, then there is an edge connecting x and y whence dS (x, y) =
dM (x, y). Otherwise split γ into pieces such that l = l0 + tl1 where l1 = (α − 2)ϵ
and ϵ ≤ l0 < (α − 2)ϵ. This divides arc γ into a sequence of points γ0 = x, γ1 ,. . .,
γt+1 = y such that dM (x, γ1 ) = l0 and dM (γi , γi+1 ) = l1 (i ≥ 1). There exists a
sequence of x0 = x, x1 , . . . , xt+1 = y such that dM (xi , γi ) ≤ ϵ and
dM (xi , xi+1 ) ≤ dM (xi , γi ) + dM (γi , γi+1 ) + dM (γi+1 , xi+1 )
≤ ϵ + l1 + ϵ
= αϵ
= l1 α/(α − 2)
whence (xi , xi+1 ) ∈ E. Similarly dM (x, x1 ) ≤ dM (x, γ1 ) + dM (γ1 , x1 ) ≤ (α − 1)ϵ ≤
l0 (α − 1).
d_S(x, y) \le \sum_{i=0}^{t} d_M(x_i, x_{i+1}) \le l\,\max\left(\frac{\alpha}{\alpha-2},\ \alpha - 1\right).
Setting α = 4 gives rise to dS (x, y) ≤ 3dM (x, y). □
The other bound d_S(x, y) ≤ c\,d_M(x, y) requires that for every two points
x_i and x_j the Euclidean distance satisfies ∥x_i − x_j∥ ≤ c\,d_M(x_i, x_j). This imposes a regularity
on the manifold M, whose curvature has to be bounded. We omit this part here and
refer interested readers to the reference by Bernstein, de Silva, Langford, and
Tenenbaum (2000), the supporting information to the ISOMAP paper.
by \hat w_i(µ) = (C_i + µI)^{-1}\mathbf{1} for some regularization parameter µ > 0 and
w_i = \hat w_i/(\hat w_i^T\mathbf{1});
Step 2 (global alignment): define the weight embedding matrix
W_{ij} = \begin{cases} w_{ij}, & j \in N_i, \\ 0, & \text{otherwise.} \end{cases}
Compute K = (I − W)^T(I − W), which is a positive semi-definite kernel matrix;
Step 3 (Eigenmap): compute the eigenvalue decomposition K = UΛU^T with
Λ = diag(λ_1, ..., λ_n), where λ_1 ≥ λ_2 ≥ ... ≥ λ_{n−1} > λ_n = 0; choose the bottom d + 1
eigenvalues and corresponding eigenvectors and drop the smallest
(0-constant) eigenvalue-eigenvector pair, so that
U_d = [u_{n−d}, ..., u_{n−1}], \ u_j ∈ R^n, \qquad Λ_d = diag(λ_{n−d}, ..., λ_{n−1}).
Define Y_d = U_dΛ_d^{1/2}.
that is, finding a linear combination (possibly not unique!) for the subspace
spanned by {(x_j − x_i) : j ∈ N_i}. This can be done by the Lagrange
multiplier method, i.e. solving
\min_{w_{ij}} \frac12\Big\|\sum_{j\in N_i} w_{ij}(x_j - x_i)\Big\|^2 + \lambda\Big(1 - \sum_{j\in N_i} w_{ij}\Big).
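As a concrete illustration of this regularized local least squares step, here is a small Python sketch (not from the text; the names are illustrative) computing the LLE weights of one point from its neighbors.

import numpy as np

def lle_weights(xi, neighbors, mu=1e-3):
    # Weights w minimizing ||sum_j w_j (x_j - x_i)||^2 with sum_j w_j = 1,
    # computed as w_hat = (C + mu I)^{-1} 1, w = w_hat / (1^T w_hat).
    Z = neighbors - xi            # k x p matrix of centered neighbors
    C = Z @ Z.T                   # k x k local Gram matrix
    k = C.shape[0]
    w_hat = np.linalg.solve(C + mu * np.eye(k), np.ones(k))
    return w_hat / w_hat.sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
w = lle_weights(X[0], X[1:6])
print(w, w.sum())                 # weights summing to one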
ISOMAP                                      | LLE
MDS on the geodesic distance matrix         | local PCA + eigen-decomposition
global approach                             | local approach
fails on nonconvex manifolds with holes     | works on nonconvex manifolds with holes
Extensions: landmark (Nystrom), conformal,  | Extensions: Hessian LLE, Laplacian eigenmap,
isometric, etc.                             | LTSA, etc.
by ŵi (µ) = (Ci + µI)−1 1 for some regularization parameter µ > 0 and
wi = ŵi /ŵiT 1;
Step 2 (local residue PCA): for each x_i and its neighbors N_i (k_i = |N_i|), let
C_i = VΛV^T be its eigenvalue decomposition, where Λ = diag(λ_1, ..., λ_{k_i}) with
λ_1 ≥ ... ≥ λ_{k_i}. Find the size s_i of the almost-normal subspace as the maximal size
such that the ratio of the residual eigenvalue sum over the principal eigenvalue sum is below
a threshold, i.e.
s_i = \max\Big\{ l \le k_i - d : \frac{\sum_{j=k_i-l+1}^{k_i}\lambda_j}{\sum_{j=1}^{k_i-l}\lambda_j} \le \eta \Big\}
(or u = 0 if it is small).
Step 3 (global alignment): define the weight embedding matrix
\widetilde W_i(j,:) = \begin{cases} -\mathbf{1}_{s_i}^T, & j = i, \\ W_i, & j \in N_i, \\ 0, & \text{otherwise.} \end{cases}
sensitive to the noise direction and might be mixed with signal directions. LLE in
this case might not capture well the signal directions in local PCA. Hessian LLE
and LTSA are both improvements over this by exploiting all the local principal
components. On the other hand, Modified Locally Linear Embedding (MLLE)
(ZW) remedies the issue using multiple weight vectors projected from orthogonal
complement of local PCA.
MLLE replaces the weight vector above by a weight matrix W_i ∈ R^{k_i×s_i}, a family
of s_i weight vectors obtained from the bottom s_i eigenvectors of C_i, V_i = [v_{k_i−s_i+1}, ..., v_{k_i}] ∈ R^{k_i×s_i}.
In LLE, one chooses the weights w_{ij} to minimize the following energy
\min_{\sum_{j\in N_i} w_{ij}=1} \Big\|\sum_{j\in N_i} w_{ij}(x_j - x_i)\Big\|^2.
In the ideal case, if the points x̃_j = x_j − x_i are linearly dependent, then there
is some w_{ij}, possibly not unique, such that 0 = \sum_{j\in N_i} w_{ij}x̃_j. In this local chart
(Figure 4), we have
0 = \sum_{j\in N_i} w_{ij}x̃_j, \quad\text{and}\quad 1 = \sum_{j\in N_i} w_{ij}.
For any smooth function y(x), consider its Taylor expansion up to the second order
1
y(x) = y(0) + xT ∇y(0) + xT (Hy)(0)x + o(∥x∥2 ).
2
Therefore
(I - W)y(0) := y(0) - \sum_{j\in N_i} w_{ij}y(x̃_j)
 \approx y(0) - \sum_{j\in N_i} w_{ij}y(0) - \sum_{j\in N_i} w_{ij}x̃_j^T\nabla y(0) - \frac12\sum_{j\in N_i} x̃_j^T(Hy)(0)x̃_j
 = -\frac12\sum_{j\in N_i} x̃_j^T(Hy)(0)x̃_j.
\widetilde M = [\mathbf{1}, \hat v_1, ..., \hat v_d, \hat w_1, \hat w_2, ..., \hat w_{\binom{d+1}{2}}] \in R^{k\times(1+d+\binom{d+1}{2})}.
Find the smallest d + 1 eigenvectors of K and drop the smallest eigenvector; the
remaining d eigenvectors will give rise to a d-dimensional embedding of the data points.
Define
[H^{(i)}]^T = [\text{last } \tbinom{d+1}{2} \text{ columns of } \widetilde M]_{k\times\binom{d+1}{2}}.
Step 3: Define
K = \sum_{i=1}^n S^{(i)}H^{(i)T}H^{(i)}S^{(i)T} \in R^{n\times n}, \qquad [x_1, .., x_n]S^{(i)} = [x_{i_1}, ..., x_{i_k}];
find the smallest d + 1 eigenvectors of K and drop the smallest eigenvector, and the
remaining d eigenvectors give rise to a d-embedding.
5.4.1. Convergence of Hessian LLE. There are two assumptions for the
convergence of ISOMAP:
• Isometry: the geodesic distance between two points on manifolds equals
to the Euclidean distances between intrinsic parameters.
• Convexity: the parameter space is a convex subset in Rd .
Therefore, if the manifold contains a hole, ISOMAP will not faithfully recover
the intrinsic coordinates. Hessian LLE is provably able to find local orthogonal
coordinates for manifold reconstruction, even in the nonconvex case; see (DG03b)
for an example.
Donoho and Grimes (DG03b) relax the conditions above into the following
ones.
• Local Isometry: in a small enough neighborhood of each point, geodesic
distances between two points on manifolds are identical to Euclidean dis-
tances between parameter points.
• Connectedness: the parameter space is an open connected subset of R^d.
Based on the relaxed conditions above, they prove the following result.
Theorem 5.4.1. Suppose M = ψ(Θ), where Θ is an open connected subset of
R^d, and ψ is a locally isometric embedding of Θ into R^n. Then the Hessian H(f) has
a (d+1)-dimensional nullspace, consisting of the constant function and a d-dimensional
space of functions spanned by the original isometric coordinates.
Under this theorem, the original isometric coordinates can be recovered, up to
a rigid motion, by identifying a suitable basis for the null space of H(f ).
dimensionality is high, and it is also not stable when noise is present. On the other
hand, Zhenyue Zhang and Hongyuan Zha (2002) (ZZ02) suggest the Local Tangent
Space Alignment (LTSA) algorithm, which only needs the linear part of local PCA
and is thus more stable and cheaper than Hessian LLE.
The basic idea of LTSA is illustrated in Figure 6, where given a smooth curve
(black), one can use discrete samples to find a good approximation of the tangent
space of the original curve at each sample point. Finding such an approximation is
in the spirit of principal curve or principal manifold proposed by Werner Stuetzle
and Trevor Hastie (HS89). Zhenyue Zhang and Hongyuan Zha (2002) (ZZ02) propose
to use the sampled data to find a good approximation of the tangent space via local
PCA; the reconstructed coordinates then try to preserve these approximate
tangent spaces at each point to reach a global alignment.
where the selection matrix S_i ∈ R^{n×k_i} satisfies [x_{i_1}, ..., x_{i_{k_i}}] = [x_1, ..., x_n]S_i;
5 Step 3 : Find smallest d + 1 eigenvectors of K and drop the smallest
eigenvector, the remaining d eigenvectors will give rise to a d-embedding.
For each x_i ∈ R^p with neighborhood N_i of size |N_i| = k_i − 1, let X^{(i)} = [x_{j_1}, x_{j_2}, ..., x_{j_{k_i}}] ∈ R^{p×k_i}
be the local coordinate matrix. Consider the local SVD (PCA)
\widetilde X^{(i)} = [x_{i_1} - µ_i, ..., x_{i_{k_i}} - µ_i]_{p\times k_i} = X^{(i)}H = \widetilde U^{(i)}\widetilde\Sigma(\widetilde V^{(i)})^T,
where H = I - \frac{1}{k_i}\mathbf{1}_{k_i}\mathbf{1}_{k_i}^T. The left singular vectors \{\widetilde U^{(i)}_1, ..., \widetilde U^{(i)}_d\} give an orthonormal
basis of the approximate d-dimensional tangent space at x_i. The right singular vectors
(\widetilde V^{(i)}_1, ..., \widetilde V^{(i)}_d) \in R^{k_i\times d} present the d coordinates of the k_i samples with respect to the
tangent space basis.
Let Y_i ∈ R^{d×k_i} be the embedding coordinates of the samples in R^d and L_i ∈ R^{p×d}
be an estimated basis of the tangent space at x_i in R^p. Let Θ_i = \widetilde U^{(i)}_d\widetilde\Sigma_d(\widetilde V^{(i)}_d)^T \in R^{p\times k_i}
be the truncated SVD using the top d components. LTSA looks for the minimizer
of the following problem
(111) \min_{Y,L}\sum_i\|E_i\|^2 = \sum_i\Big\|Y_i\big(I - \tfrac{1}{k_i}\mathbf{1}\mathbf{1}^T\big) - L_i\Theta_i\Big\|^2.
One can estimate L_i^T = Y_i\big(I - \tfrac{1}{k_i}\mathbf{1}\mathbf{1}^T\big)\Theta_i^\dagger. Hence it reduces to
(112) \min_Y\sum_i\|E_i\|^2 = \sum_i\Big\|Y_i\big(I - \tfrac{1}{k_i}\mathbf{1}\mathbf{1}^T\big)\big(I - \Theta_i^\dagger\Theta_i\big)\Big\|^2,
a weight matrix
W_i \in R^{k_i\times k_i}, \quad W_i = I - G_iG_i^T,
and a positive semi-definite kernel matrix for alignment,
K^{n\times n} = \Phi = \sum_{i=1}^n S_iW_iW_i^TS_i^T.
For any smooth function f(x), consider its Taylor expansion up to second order
f(x) = f(0) + x^T\nabla f(0) + \frac12 x^TH(0)x + o(\|x\|^2).
(I - W)y(0) := y(0) - \sum_{j\in N_i} w_{ij}y(x̃_j)
 \approx y(0) - \sum_{j\in N_i} w_{ij}y(0) - \sum_{j\in N_i} w_{ij}x̃_j^T\nabla y(0) - \frac12\sum_{j\in N_i} x̃_j^T(Hy)(0)x̃_j
 = -\frac12\sum_{j\in N_i} x̃_j^T(Hy)(0)x̃_j.
When the {x̃_j} in the last step form an orthonormal basis¹, the equation above gives
-\frac12\sum_{j\in N_i} x̃_j^TH(0)x̃_j \approx \mathrm{trace}(H(0)) = \Delta f(0),
where the Laplacian operator \Delta = \mathrm{trace}(H) = \sum_{i=1}^d\frac{\partial^2}{\partial y_i^2} in a local coordinate
system (y_i). Such an observation leads to Laplacian LLE, which looks for embedding
functions
\min_{y\perp\mathbf{1},\ \|y\|=1}\int\|\nabla y\|^2 = \int y^T\Delta y,
The kernel of Laplacian consists of constant, linear functions, and bilinear functions
of coordinates, of dimensionality 1 + d + d2 . Therefore Laplacian LLE does not
recover linear coordinates. However, Laplacian LLE converges to the spectrum of
Laplacian-Beltrami operator, which enables us to choose wij as heat kernels. It has
various connections with spectral graph theory and random walks on graphs, which
further leads to Diffusion Map and relates to topology of data graph, namely the
connectivity or the 0-th homology.
How to define Laplacian with discrete data? Graph Laplacians with heat
kernels provide us an answer (BN01; BN03). To see the idea, first consider a
weighted oriented graph G = (V, E, W ) where V = {x1 , . . . , xn } is the vertex set,
E = {(i, j) : i, j ∈ V } is the set of oriented edges, and W = [wij = wji ≥ 0] is
the weight matrix. Consider a particular weight matrix induced by heat kernels
W = (w_{ij}) \in R^{n\times n} as
w_{ij} = \begin{cases} e^{-\|x_i - x_j\|^2/t}, & j \in N(i), \\ 0, & \text{otherwise.} \end{cases}
In particular, t → ∞ gives binary weights. Let D = diag(\sum_{j\in N_i} w_{ij}) be the
diagonal matrix with the weighted degrees as diagonal elements. Define the unnormalized
graph Laplacian by
L = D - W,
\hat L_{t,n}, and \hat v^t_{n,i} be the corresponding eigenfunction. Let λ_i and v_i be the corresponding
eigenvalue and eigenfunction of Δ_M. Then there exists a sequence t_n → 0 such
that
\lim_{n\to\infty}\hat\lambda^{t_n}_{n,i} = \lambda_i, \qquad \lim_{n\to\infty}\|\hat v^{t_n}_{n,i} - v_i\| = 0,
where the limits are taken in probability.
Define diffusion map at scale t (CLL+ 05), by dropping the constant eigenvector
ϕ0 for connected graph G,
Φτ (xi ) = [λτ1 ϕ1 (i), · · · , λτn−1 ϕn−1 (i)], τ ≥ 0.
Clearly, Laplacian LLE corresponds to such a diffusion map at τ = 0; as τ grows,
small eigenvalues |λ_i| < 1 are damped to zero exponentially fast, which leads
to a multiscale analysis of dimensionality reduction. For example, one can set a
threshold δ > 0 and only keep d_δ dimensions such that |λ_i|^τ ≥ δ for 1 ≤ i ≤ d_δ.
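A compact Python sketch of this construction (illustrative, not from the text): build a heat-kernel weight matrix, form the row-Markov matrix, and embed with the leading nontrivial right eigenvectors scaled by λ^τ.

import numpy as np
from scipy.spatial.distance import cdist

def diffusion_map(X, tau=1, n_components=2, t=None):
    D2 = cdist(X, X, 'sqeuclidean')
    if t is None:
        t = np.median(D2)                    # heuristic kernel bandwidth (assumed)
    W = np.exp(-D2 / t)
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))          # S = D^{-1/2} W D^{-1/2}
    lam, V = np.linalg.eigh(S)
    order = np.argsort(-lam)
    lam, V = lam[order], V[:, order]
    Phi = V / np.sqrt(d)[:, None]            # right eigenvectors of A = D^{-1} W
    # drop the constant eigenvector (lam[0] = 1) and scale by lambda^tau
    return Phi[:, 1:n_components + 1] * lam[1:n_components + 1] ** tau

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((200, 2))
print(diffusion_map(X, tau=2).shape)         # (200, 2): a noisy circle is recovered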
5.7.1. General Diffusion Maps and Convergence. In (CLL+ 05) a general
class of diffusion maps are defined which involves a normalized weight matrix,
(113) W^{\alpha,t}_{ij} = \frac{W_{ij}}{p_i^\alpha\,p_j^\alpha}, \qquad p_i := \sum_k\exp\Big(-\frac{d(x_i,x_k)^2}{t}\Big),
where α = 0 recovers the definition above. With this family, one can define D_\alpha = diag(\sum_j W^{\alpha,t}_{ij})
and the row Markov matrix
(114) P_{\alpha,t,n} = D_\alpha^{-1}W^{\alpha},
whose right eigenvectors Φα lead to a family of diffusion maps parameterized by α.
Such a definition suggests the following integral operators as diffusion operators.
Assume that q(x) is a density on M.
• Let kt (x, y) = h(∥x − y∥2 /t) where h is a radial basis function, e.g. h(z) =
exp(−z).
• Define
q_t(x) = \int_M k_t(x,y)q(y)\,dy
and form the new kernel
k_t^{(\alpha)}(x,y) = \frac{k_t(x,y)}{q_t^{\alpha}(x)q_t^{\alpha}(y)}.
• Let
d_t^{(\alpha)}(x) = \int_M k_t^{(\alpha)}(x,y)q(y)\,dy
and define the transition kernel of a Markov chain by
p_{t,\alpha}(x,y) = \frac{k_t^{(\alpha)}(x,y)}{d_t^{(\alpha)}(x)}.
Then the Markov chain can be defined as the operator
P_{t,\alpha}f(x) = \int_M p_{t,\alpha}(x,y)f(y)q(y)\,dy.
• Define the infinitesimal generator of the Markov chain
L_{t,\alpha} = \frac{I - P_{t,\alpha}}{t}.
For this, Lafon et al. (CL06) show the following pointwise convergence result.
Theorem 5.7.1. Let M ⊂ R^p be a compact smooth submanifold, q(x) be a
probability density on M, and Δ_M be the Laplace-Beltrami operator on M. Then
(115) \lim_{t\to 0}L_{t,\alpha}f = \frac{\Delta_M(fq^{1-\alpha})}{q^{1-\alpha}} - \frac{\Delta_M(q^{1-\alpha})}{q^{1-\alpha}}f.
Theorem 6.1.1 (Perron Theorem for Positive Matrix). Assume that A > 0,
i.e.a positive matrix. Then
1) ∃λ∗ > 0, ν ∗ > 0, ∥ν ∗ ∥2 = 1, s.t. Aν ∗ = λ∗ ν ∗ , ν ∗ is a right eigenvector
(∃λ∗ > 0, ω > 0, ∥ω∥2 = 1, s.t. (ω T )A = λ∗ ω T , left eigenvector)
2) ∀ other eigenvalue λ of A, |λ| < λ∗
3) ν ∗ is unique up to rescaling or λ∗ is simple
4) Collatz-Wielandt Formula:
\lambda^* = \max_{x\ge 0, x\ne 0}\min_{i:\,x_i\ne 0}\frac{[Ax]_i}{x_i} = \min_{x>0}\max_i\frac{[Ax]_i}{x_i}.
Such eigenvectors will be called Perron vectors. This result can be extended to
nonnegative matrices.
Theorem 6.1.2 (Nonnegative Matrix, Perron). Assume that A ≥ 0, i.e.nonnegative.
Then
1’) ∃λ∗ > 0, ν ∗ ≥ 0, ∥ν ∗ ∥2 = 1, s.t. Aν ∗ = λ∗ ν ∗ (similar to left eigenvector)
2’) ∀ other eigenvalue λ of A, |λ| ≤ λ∗
3’) ν ∗ is NOT unique
4) Collatz-Wielandt Formula
[Ax]i [Ax]i
λ∗ = max min = min max
x≥0,x̸=0 xi ̸=0 xi x>0 xi
Notice the changes in 1’), 2’), and 3’). Perron vectors are nonnegative rather
than positive. In the nonnegative situation what we lose is the uniqueness in λ∗
(2’)and ν ∗ (3’). The next question is: can we add more conditions such that the
loss can be remedied? The answer is yes, if we add the concepts of irreducible and
primitive matrices.
Irreducibility exactly describes the case that the induced graph from A is con-
nected, i.e.every pair of nodes are connected by a path of arbitrary length. However
primitivity strengths this condition to k-connected, i.e.every pair of nodes are con-
nected by a path of length k.
Definition 6.1.1 (Irreducible). The following definitions are equivalent:
1) for any 1 ≤ i, j ≤ n, there is an integer k ∈ Z such that A^k_{ij} > 0; ⇔
2) the graph G = (V, E) (with V = {1, ..., n} and {i, j} ∈ E iff A_{ij} > 0) is (path-)
connected, i.e. for all i, j ∈ V there is a path (x_0, x_1, ..., x_t) with i = x_0 and
x_t = j connecting i and j.
Definition 6.1.2 (Primitive). The following characterizations hold:
1) there is an integer k ∈ Z such that A^k_{ij} > 0 for all i, j; ⇔
2) any node pair i, j ∈ V is connected by a path of length no more than k; ⇔
3) A has a unique eigenvalue λ* = max|λ| of maximal modulus; ⇐
4) A is irreducible and A_{ii} > 0 for some i.
Note that condition 4) is sufficient for primitivity but not necessary; all the first
three conditions are necessary and sufficient for primitivity. Irreducible matrices
have a simple primary eigenvalue λ∗ and 1-dimensional primary (left and right)
eigenspaces, with unique left and right eigenvectors. However, there might be other
eigenvalues whose absolute values (module) equal to the primary eigenvalue, i.e.,
λ∗ eiω .
When A is a primitive matrix, Ak becomes a positive matrix for some k, then we
can recover 1), 2) and 3) for positivity and uniqueness. This leads to the following
Perron-Frobenius theorem.
Theorem 6.1.3 (Nonnegative Matrix, Perron-Frobenius). Assume that A ≥ 0
and A is primitive. Then
assumed for simplicity that all nodes have non-empty out-degree. This P1 accounts
for a random walk according to the link structure of webpages. One would expect
that stationary distributions of such random walks will disclose the importance of
webpages: the more visits, the more important. However Perron-Frobenius above
tells us that to obtain a unique stationary distribution, we need a primitive Markov
matrix. For this purpose, Google’s PageRank does the following trick.
Let P_α = αP_1 + (1 − α)E, where E = \frac1n\mathbf{1}\mathbf{1}^T is the random surfer model, i.e. one
can jump to any other webpage uniformly. So in the model P_α a browser plays
dice: it jumps according to the link structure with probability α or surfs randomly
with probability 1 − α. With 0 < α < 1, the random surfer component
makes P_α a positive matrix, whence ∃!π s.t. P_α^Tπ = π (there exists a unique
π). Google chooses α = 0.85, and in this case π gives the PageRank scores.
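A minimal Python sketch of this construction (illustrative; the small link matrix is made up): form P_α = αP_1 + (1−α)E and find π by power iteration.

import numpy as np

def pagerank(W, alpha=0.85, tol=1e-12):
    # W[i, j] = 1 if page i links to page j; empty rows are made uniform
    n = W.shape[0]
    out = W.sum(axis=1, keepdims=True)
    P1 = np.where(out > 0, W / np.maximum(out, 1), 1.0 / n)
    P = alpha * P1 + (1 - alpha) * np.ones((n, n)) / n
    pi = np.ones(n) / n
    while True:
        new = pi @ P                       # pi^T P_alpha
        if np.abs(new - pi).sum() < tol:
            return new
        pi = new

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 1, 0, 0]], dtype=float)
print(pagerank(W))                          # stationary distribution, sums to 1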
Now you probably can figure out how to cheat PageRank. If there are many
cross links between a small set of nodes (for example, Wikipedia), those nodes must
appear to be high in PageRank. This phenomenon actually has been exploited by
spam webpages, and even scholarly citations. After learning the nature of PageRank,
we should be aware of such misbehavior.
Finally we discuss briefly Kleinberg's HITS algorithm (Kle99), which is
based on the singular value decomposition (SVD) of the link matrix W. Above we have
defined the out-degree d^o. Similarly we can define the in-degree d^i_k = \sum_j w_{jk}. High out-
degree webpages can be regarded as hubs, as they provide more links to others. On
the other hand, high in-degree webpages are regarded as authorities, as they were
cited by others intensively. Basically in/out-degrees can be used to rank webpages,
which gives relative ranking as authorities/hubs. It turns out Kleinberg’s HITS
algorithm gives pretty similar results to in/out-degree ranking.
The last example is the economic growth model, into whose study Debreu
introduced nonnegative matrices. Similar applications include population
growth and exchange markets, etc.
dynamics xt+1 = Axt and its long term behavior as t → ∞ which describes the
economic growth.
Moreover in exchange market, an additional requirement is put as Aij = 1/Aji ,
which is called reciprocal matrix. Such matrices are also used for preference aggre-
gation in decision theory by Saaty.
From Perron-Frobenius theory we get: ∃λ* > 0 and ν* ≥ 0 with Aν* = λ*ν*, and
∃ω* ≥ 0 with A^Tω* = λ*ω*.
When A is primitive (A^k > 0, i.e. investment in one sector increases the product
in every other sector within no more than k industrial periods), we have for all other
eigenvalues λ, |λ| < λ*, and ω*, ν* are unique. In this case one can check that the
long term economic growth is governed by
A^t \to (\lambda^*)^t\,\nu^*\omega^{*T},
where
1) for all i, \frac{(x_t)_i}{(x_{t-1})_i} \to \lambda^*;
\Rightarrow \frac{\bar A^2}{1+\epsilon}|z| > \bar A|z|
\Rightarrow \text{with } B = \frac{\bar A}{1+\epsilon}, \quad 0 = \lim_{m\to\infty}B^m\bar A|z| \ge \bar A|z|,
which implies that the z_j have the same sign, i.e. z_j ≥ 0 (∀j) or z_j ≤ 0 (∀j). In both cases |z|
(z ≠ 0) is a nonnegative eigenvector, A|z| = λ|z|, which implies λ = λ* by 3). □
(A\nu^{[m-1]})_k := \sum_{i_2,\dots,i_m=1}^n a_{k i_2\cdots i_m}\nu_{i_2}\cdots\nu_{i_m}, \qquad \nu^{[m-1]} := (\nu_1^{m-1},\dots,\nu_n^{m-1})^T.
Remark. We can use the weight of edge i ∼ j to define Aij if the graph is
weighted. That indicates Aij ∈ R+ . We can also extend Aij to R which involves
both positive and negative weights, like correlation graphs. But the theory below
can not be applied to such weights being positive and negative.
The degree of node i is defined as
d_i = \sum_{j=1}^n A_{ij}.
Define a diagonal matrix D = diag(di ). Now let’s come to the definition of Lapla-
cian Matrix L.
Definition 6.2.2 (Graph Laplacian).
L_{ij} = \begin{cases} d_i & i = j, \\ -1 & i \sim j, \\ 0 & \text{otherwise.} \end{cases}
From the definition, we can see that L is symmetric, so all its eigenvalues will
be real and there is an orthonormal eigenvector system. Moreover L is positive
semi-definite (p.s.d.). This is due to the fact that
v^TLv = \sum_i\sum_{j:\,j\sim i}v_i(v_i - v_j) = \sum_i d_iv_i^2 - \sum_i\sum_{j:\,j\sim i}v_iv_j = \sum_{i\sim j}(v_i - v_j)^2 \ge 0, \quad \forall v\in R^n.
These two statements imply the eigenvalues of L can’t be negative. That is to say
λ(L) ≥ 0.
Theorem 6.2.1 (Fiedler theory). Let L have n eigenvectors
Lv_i = \lambda_iv_i, \ v_i \ne 0, \ i = 0, \dots, n-1,
where 0 = λ_0 ≤ λ_1 ≤ ... ≤ λ_{n−1}. For the second smallest eigenvector v_1, define
N_- = \{i : v_1(i) < 0\},
\lambda = \frac{v^TLv}{v^Tv} = \frac{\sum_{i\sim j}(v_i - v_j)^2}{\sum_i v_i^2}.
Note that
0 = λ1 ⇔ vi = vj (j is path connected with i).
Therefore v is a piecewise constant function on connected components of G. If
G has k components, then there are k independent piecewise constant vectors in
the span of characteristic functions on those components, which can be used as
eigenvectors of L. In this way, we proved the first part of the theory. □
6.2.2. Normalized graph Laplacian and Cheeger’s Inequality.
Definition 6.2.3 (Normalized Graph Laplacian).
\mathcal L_{ij} = \begin{cases} 1 & i = j, \\ -\frac{1}{\sqrt{d_id_j}} & i \sim j, \\ 0 & \text{otherwise.} \end{cases}
Similarly we get the relations between eigenvalue and the connected components of
the graph.
#{λi (L) = 0} = #{connected components of G}.
Next we show that the eigenvectors of L are related to random walks on graphs;
this explains why we choose this matrix to analyze the graph.
We can construct a random walk on G whose transition matrix is defined by
P_{ij} = \frac{A_{ij}}{\sum_j A_{ij}} = \frac{A_{ij}}{d_i}.
NCUT(S) = \frac{CUT(S)}{\min(Vol(S), Vol(\bar S))}.
NCUT(S) is called the normalized cut. We define the Cheeger constant
h_G = \min_S NCUT(S).
Cheeger Inequality says the second smallest eigenvalue provides both upper and
lower bounds on the minimal normalized graph cut. Its proof gives us a constructive
polynomial algorithm to achieve such bounds.
Theorem 6.2.2 (Cheeger Inequality). For every undirected graph G,
\frac{h_G^2}{2} \le \lambda_1(L) \le 2h_G.
Proof. (1) Upper bound: assume the following function f realizes the optimal normalized graph cut,
f(i) = \begin{cases} \frac{1}{Vol(S)} & i \in S, \\ -\frac{1}{Vol(\bar S)} & i \in \bar S. \end{cases}
Then
\Big(\frac{1}{Vol(S)} + \frac{1}{Vol(\bar S)}\Big)CUT(S) \le \frac{2\,CUT(S)}{\min(Vol(S), Vol(\bar S))} =: 2h_G,
which gives the upper bound.
(2) Lower bound: the proof of the lower bound actually gives a constructive algorithm
to compute an approximately optimal cut, as follows.
Let v be the second eigenvector, i.e. \mathcal Lv = λ_1v, and f = D^{-1/2}v. Reorder the node set V
such that f_1 ≤ f_2 ≤ ... ≤ f_n. Denote V_- = \{i : f_i < 0\} and V_+ = \{i : f_i \ge 0\}. Without
loss of generality, we can assume
\sum_{i\in V_-} d_i \ge \sum_{i\in V_+} d_i.
Define a new function f^+ to be the magnitude of f on V_+:
f_i^+ = \begin{cases} f_i & i \in V_+, \\ 0 & \text{otherwise.} \end{cases}
Now consider a series of particular subsets of V,
S_i = \{v_1, v_2, \dots, v_i\},
and define
\widetilde{Vol}(S) = \min(Vol(S), Vol(\bar S)), \qquad \alpha_G = \min_i NCUT(S_i).
Clearly finding the optimal value α just requires comparison over n − 1 NCUT
values.
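The lower-bound argument is constructive; the sketch below (illustrative Python, not from the text) implements the resulting spectral sweep cut: compute the second eigenvector of the normalized Laplacian, order vertices by f = D^{-1/2}v, and return the prefix S_i with the smallest NCUT among the n − 1 candidates.

import numpy as np

def sweep_cut(W):
    d = W.sum(axis=1)
    Dm12 = 1.0 / np.sqrt(d)
    L = np.eye(len(d)) - Dm12[:, None] * W * Dm12[None, :]   # normalized Laplacian
    lam, V = np.linalg.eigh(L)
    f = Dm12 * V[:, 1]                       # f = D^{-1/2} v_1
    order = np.argsort(f)
    vol = d.sum()
    best, best_S = np.inf, None
    cut, volS = 0.0, 0.0
    inS = np.zeros(len(d), dtype=bool)
    for i in order[:-1]:
        cut += W[i, ~inS].sum() - W[i, inS].sum()   # update CUT(S) incrementally
        inS[i] = True
        volS += d[i]
        ncut = cut / min(volS, vol - volS)
        if ncut < best:
            best, best_S = ncut, inS.copy()
    return best_S, best

# a toy graph with two dense blocks joined by one weak link
A = np.zeros((6, 6))
A[:3, :3] = 1; A[3:, 3:] = 1; np.fill_diagonal(A, 0); A[2, 3] = A[3, 2] = 0.1
print(sweep_cut(A))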
\ge \frac{\big(\sum_{i\sim j}(f_i^{+2} - f_j^{+2})\big)^2}{\big(\sum_{i\in V}f_i^{+2}d_i\big)\big(\sum_{i\sim j}(f_i^+ + f_j^+)^2\big)} \quad\text{(Cauchy-Schwarz)}
\ge \frac{\big(\sum_{i\sim j}(f_i^{+2} - f_j^{+2})\big)^2}{2\big(\sum_{i\in V}f_i^{+2}d_i\big)^2},
where the second-to-last step is due to the Cauchy-Schwarz inequality |⟨x,y⟩|² ≤ ⟨x,x⟩·⟨y,y⟩,
and the last step is due to \sum_{i\sim j}(f_i^+ + f_j^+)^2 = \sum_{i\sim j}(f_i^{+2} + f_j^{+2} + 2f_i^+f_j^+) \le 2\sum_{i\sim j}(f_i^{+2} + f_j^{+2}) \le 2\sum_{i\in V}f_i^{+2}d_i.
Continuing from the last inequality,
\lambda_1 \ge \frac{\big(\sum_{i\sim j}(f_i^{+2} - f_j^{+2})\big)^2}{2\big(\sum_{i\in V}f_i^{+2}d_i\big)^2}
 \ge \frac{\big(\sum_{i\in V}(f_i^{+2} - f_{i-1}^{+2})\,CUT(S_{i-1})\big)^2}{2\big(\sum_{i\in V}f_i^{+2}d_i\big)^2} \quad\text{(since } f_1 \le f_2 \le \dots \le f_n\text{)}
 \ge \frac{\big(\sum_{i\in V}(f_i^{+2} - f_{i-1}^{+2})\,\alpha_G\,\widetilde{Vol}(S_{i-1})\big)^2}{2\big(\sum_{i\in V}f_i^{+2}d_i\big)^2}
 = \frac{\alpha_G^2}{2}\cdot\frac{\big(\sum_{i\in V}f_i^{+2}(\widetilde{Vol}(S_{i-1}) - \widetilde{Vol}(S_i))\big)^2}{\big(\sum_{i\in V}f_i^{+2}d_i\big)^2}
 = \frac{\alpha_G^2}{2}\cdot\frac{\big(\sum_{i\in V}f_i^{+2}d_i\big)^2}{\big(\sum_{i\in V}f_i^{+2}d_i\big)^2} = \frac{\alpha_G^2}{2},
where the last inequality is due to the assumption Vol(V_-) ≥ Vol(V_+), whence
\widetilde{Vol}(S_i) = Vol(\bar S_i) for i ∈ V_+.
This completes the proof. □
This completes the proof. □
Fan Chung gives a short proof of the lower bound in Simons Institute workshop,
2014.
Short Proof. The proof is based on the fact that
h_G = \inf_{f\ne 0}\sup_{c\in\mathbb R}\frac{\sum_{x\sim y}|f(x) - f(y)|}{\sum_x|f(x) - c|\,d_x},
where the supremum over c is attained at c^* = \mathrm{median}(f(x) : x \in V). Then
\lambda_1 = R(f) = \sup_c\frac{\sum_{x\sim y}(f(x) - f(y))^2}{\sum_x(f(x) - c)^2 d_x}
 \ge \frac{\sum_{x\sim y}(g(x) - g(y))^2}{\sum_x g(x)^2 d_x}, \qquad g(x) = f(x) - c,
 = \frac{\big(\sum_{x\sim y}(g(x) - g(y))^2\big)\big(\sum_{x\sim y}(g(x) + g(y))^2\big)}{\big(\sum_{x\in V}g^2(x)d_x\big)\big(\sum_{x\sim y}(g(x) + g(y))^2\big)}
 \ge \frac{\big(\sum_{x\sim y}|g^2(x) - g^2(y)|\big)^2}{\big(\sum_{x\in V}g^2(x)d_x\big)\big(\sum_{x\sim y}(g(x) + g(y))^2\big)} \quad\text{(Cauchy-Schwarz)}
 \ge \frac{\big(\sum_{x\sim y}|g^2(x) - g^2(y)|\big)^2}{2\big(\sum_{x\in V}g^2(x)d_x\big)^2} \quad\text{(since }(g(x)+g(y))^2 \le 2(g^2(x)+g^2(y))\text{)}
 \ge \frac{h_G^2}{2}.
□
Proof.
R(f) = \frac{\sum_{u\to v}|f(u) - f(v)|^2\phi(u)P(u,v)}{\sum_v|f(v)|^2\phi(v)}
 = \frac{\sum_{u\to v}|f(u)|^2\phi(u)P(u,v) + \sum_v|f(v)|^2\phi(v) - \sum_{u\to v}(f(u)\overline{f(v)} + \overline{f(u)}f(v))\phi(u)P(u,v)}{f\Phi f^*}
 = \frac{\sum_u|f(u)|^2\phi(u) + \sum_v|f(v)|^2\phi(v) - (f\Phi Pf^* + \overline{f\Phi Pf^*})}{f\Phi f^*}
 = 2 - \frac{f(P^*\Phi + \Phi P)f^*}{f\Phi f^*}
 = 2 - \frac{(g\Phi^{-1/2})(P^*\Phi + \Phi P)(\Phi^{-1/2}g^*)}{(g\Phi^{-1/2})\Phi(\Phi^{-1/2}g^*)}
 = 2 - \frac{g(\Phi^{-1/2}P^*\Phi^{1/2} + \Phi^{1/2}P\Phi^{-1/2})g^*}{gg^*}
 = 2\cdot\frac{g\mathcal Lg^*}{\|g\|^2}.
□
= \inf_{\sum_x f(x)\phi(x)=0}\frac{R(f)}{2}.
□
Note.
\lambda_1 = \inf_{f:\ \sum_x f(x)\phi(x)=0}\frac{R(f)}{2}.
Similarly, we have
h_G(\text{undirected}) = \inf_{S\subset V}\frac{|\partial S|}{\min\{|S|, |\bar S|\}} = \inf_{S\subset V}\frac{\sum_{u\in S, v\in\bar S}w_{uv}}{\min\{\sum_{u\in S}d(u), \sum_{u\in\bar S}d(u)\}},
h_G(\text{directed}) = \inf_{S\subset V}\frac{\sum_{u\in S, v\in\bar S}\phi(u)P(u,v)}{\min\{\sum_{u\in S}\phi(u), \sum_{u\in\bar S}\phi(u)\}} = \inf_{S\subset V}\frac{F(\partial S)}{\min\{F(S), F(\bar S)\}}.
\frac{h^2(G)}{2} \le \lambda_1 \le 2h(G).
The proof is similar to the undirected case using the Rayleigh quotient and Theorem
6.3.2.
Modify the random walk into a lazy random walk \tilde P = \frac{I+P}{2}, so that it is aperiodic.
Theorem 6.3.6.
\Delta(t)^2 \le C\Big(1 - \frac{\lambda_1}{2}\Big)^t.
With the fundamental matrix, the hitting time and commute time can be expressed
as follows:
(120) H_{ij} = \frac{z_{jj} - z_{ij}}{\pi_j}
6.3.5.2. Green's function and Laplacian for directed graphs. If we treat the directed
graph Laplacian \tilde{\mathcal L} as an asymmetric operator on a directed graph G, then
we can define the Green's function \tilde G (without boundary conditions) for the directed
graph. The entries of \tilde G satisfy
(124) (\tilde G\tilde{\mathcal L})_{ij} = \delta_{ij} - \sqrt{\pi_i\pi_j}.
II. (Meila-Shi 2001) P is lumpable with respect to partition Ω and \hat P (\hat p_{st} =
\sum_{i\in\Omega_s, j\in\Omega_t}p_{ij}) is nonsingular ⇔ P has k independent piecewise-constant
right eigenvectors in span\{\chi_{\Omega_s} : s = 1, \dots, k\}.
Example 6.4.1. Consider a linear chain with 2n nodes (Figure 2) whose adjacency
matrix and degree matrix are given by
A = \begin{pmatrix} 0 & 1 & & & \\ 1 & 0 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & 0 & 1 \\ & & & 1 & 0 \end{pmatrix}, \qquad D = \mathrm{diag}\{1, 2, \dots, 2, 1\}.
So the transition matrix is P = D^{-1}A, illustrated in Figure 2. The spectrum
of P includes two eigenvalues of magnitude 1, i.e. λ_0 = 1 and λ_{n−1} = −1. Although
“⇐” To show the sufficiency, we are going to show that if the condition is satisfied,
then the probability
Probπ0 {yt = t : yt−1 = s, · · · , y0 = k0 }
depends only on Ωs , Ωt ∈ Ω. Probability above can be written as Probπt−1 (yt = t)
where πt−1 is a distribution with support only on Ωs which depends on π0 and
history up to t − 1.P But since Probi (yt = t) = p̂iΩt ≡ p̂st for all i ∈ Ωs , then
Probπt−1 (yt = t) = i∈Ωs πt−1 p̂iΩt = p̂st which only depends on Ωs and Ωt .
II.
“⇒”
Since \hat P is nonsingular, let {ψ_i, i = 1, ..., k} be independent right eigenvectors
of \hat P, i.e., \hat Pψ_i = λ_iψ_i. Define ϕ_i = Vψ_i; then the ϕ_i are independent piecewise constant
vectors in span{χΩi , i = 1, · · · , k}. We have
P ϕi = P V ψi = V U P V ψi = V P̂ ψi = λi V ψi = λi ϕi ,
i.e.ϕi are right eigenvectors of P .
“⇐”
Let {ϕi , i = 1, · · · , k} be k independent piecewise constant right eigenvectors
of P in span{XΩi , i = 1, · · · , k}. There must be k independent vectors ψi ∈ Rk
that satisfied ϕi = V ψi . Then
P ϕi = λi ϕi ⇒ P V ψi = λi V ψi ,
Multiplying V U to the left on both sides of the equation, we have
V U P V ψi = λi V U V ψi = λi V ψi = P V ψi , (U V = I),
which implies
(V U P V − P V )Ψ = 0, Ψ = [ψ1 , . . . , ψk ].
Since Ψ is nonsingular due to the independence of the ψ_i, we must have VUPV =
PV. □
A Markov chain defined as above is reversible. That is, detailed balance con-
dition is satisfied:
µ(x)p(x, y) = µ(y)p(y, x) ∀x, y ∈ S
Define an inner product on the space L²_µ:
⟨f, g⟩_µ = \sum_{x\in S}f(x)g(x)\mu(x), \qquad f, g \in L^2_\mu.
L²_µ is a Hilbert space with this inner product. If we define an operator T on it by
Tf(x) = \sum_{y\in S}p(x,y)f(y) = E[f(y)\mid x],
\mu(x)\sum_{y\in S}p(x,y)\phi_j(y) = \lambda_j\phi_j(x)\mu(x); with the detailed balance condition,
\sum_{y\in S}p(y,x)\mu(y)\phi_j(y) = \lambda_j\phi_j(x)\mu(x),
that is, ψ_j(x) := \mu(x)\phi_j(x) satisfies \sum_{y\in S}p(y,x)\psi_j(y) = \lambda_j\psi_j(x).
So F admits a spectral decomposition. Let \{λ_j\}_{j=0}^{n-1} denote its eigenvalues and
\{\phi_j(x)\}_{j=0}^{n-1} the eigenvectors; then the kernel can be represented as K(x,y) =
\sum_{j=0}^{n-1}\lambda_j\phi_j(x)\phi_j(y). The Hilbert-Schmidt norm of F is defined as
\|F\|_{HS}^2 = \mathrm{tr}(F^*F) = \mathrm{tr}(F^2) = \sum_{i=0}^{n-1}\lambda_i^2,
where the last equality is due to the orthogonality of the eigenvectors. It is clear that if
L²_µ = L², the Hilbert-Schmidt norm is just the Frobenius norm.
Now we can write our T as
Tf(x) = \sum_{y\in S}p(x,y)f(y) = \sum_{y\in S}\frac{p(x,y)}{\mu(y)}f(y)\mu(y),
and take K(x,y) = \frac{p(x,y)}{\mu(y)}. By the detailed balance condition, K is symmetric. So
\|T\|_{HS}^2 = \sum_{x,y\in S}\frac{p^2(x,y)}{\mu^2(y)}\mu(x)\mu(y) = \sum_{x,y\in S}\frac{\mu(x)}{\mu(y)}p^2(x,y).
One can check that this \tilde P is a stochastic matrix, but it is not reversible. A
more convenient choice is to transit "randomly" according to the invariant distribution:
\tilde P(x,y) = \sum_{k,l=1}^N 1_{S_k}(x)\hat P(k,l)\frac{\mu(y)}{\hat\mu(S_l)}1_{S_l}(y),
where
\hat\mu(S_l) = \sum_{z\in S_l}\mu(z).
Then one can check that this matrix is not only a stochastic matrix, but the detailed
balance condition also holds, provided \hat P on \{S_i\} is reversible.
Now let us summarize. Given a decomposition of the state space S = \bigcup_{i=1}^N S_i
and a transition probability \hat P on the coarse space, we obtain a lifted
transition probability \tilde P on the fine space. Now we can compare (\{S_i\}, \hat P) and (S, P)
in a clear way via \|P - \tilde P\|_\mu. So our optimization problem can be defined clearly:
E = \min_{S_1\dots S_N}\min_{\hat P}\|P - \tilde P\|_\mu^2.
That is, given a partition of S, find the optimal \hat P to minimize \|P - \tilde P\|_\mu^2, and
find the optimal partition to minimize E.
6.5.2.3. Community structure of complex networks. Given a partition S = \bigcup_{k=1}^N S_k,
the solution of the optimization problem
\min_{\hat p}\|p - \tilde p\|_\mu^2
is
\hat p^*_{kl} = \frac{1}{\hat\mu(S_k)}\sum_{x\in S_k, y\in S_l}\mu(x)p(x,y).
It is easy to show that \{\hat p^*_{kl}\} forms a transition probability matrix with the detailed
balance condition:
\hat p^*_{kl} \ge 0,
\sum_l\hat p^*_{kl} = \frac{1}{\hat\mu(S_k)}\sum_{x\in S_k}\mu(x)\sum_l\sum_{y\in S_l}p(x,y) = \frac{1}{\hat\mu(S_k)}\sum_{x\in S_k}\mu(x) = 1,
\hat\mu(S_k)\hat p^*_{kl} = \sum_{x\in S_k, y\in S_l}\mu(x)p(x,y) = \sum_{x\in S_k, y\in S_l}\mu(y)p(y,x) = \hat\mu(S_l)\hat p^*_{lk}.
The last equality implies that \hat\mu is the invariant distribution of the reduced Markov
chain. Thus we find the optimal transition probability in the coarse space. \hat p^* has
the following property:
\|p - \tilde p^*\|_\mu^2 = \|p\|_\mu^2 - \|\hat p^*\|_{\hat\mu}^2.
However, the partition of the original graph is not given in advance, so we
need to minimize E^* with respect to all possible partitions. This is a combinatorial
optimization problem, for which it is extremely difficult to find the exact solution. An
effective approach to obtain an approximate solution, which inherits the ideas of K-means
clustering, is proposed as follows. First we rewrite E^* as
E^* = \sum_{x,y\in S}\frac{\mu(x)}{\mu(y)}\Big|p(x,y) - \sum_{k,l=1}^N 1_{S_k}(x)\frac{\hat p^*_{kl}}{\hat\mu(S_k)}1_{S_l}(y)\mu(y)\Big|^2
 = \sum_{k,l=1}^N\sum_{x\in S_k, y\in S_l}\mu(x)\mu(y)\Big|\frac{p(x,y)}{\mu(y)} - \frac{\hat p^*_{kl}}{\hat\mu(S_k)}\Big|^2
 \triangleq \sum_{k=1}^N\sum_{x\in S_k}E^*(x, S_k),
where
E^*(x, S_k) = \sum_{l=1}^N\sum_{y\in S_l}\mu(x)\mu(y)\Big|\frac{p(x,y)}{\mu(y)} - \frac{\hat p^*_{kl}}{\hat\mu(S_k)}\Big|^2.
Based on the above expression, a variation of K-means is designed:
E step: fix the partition \bigcup_{k=1}^N S_k and compute \hat p^*.
M step: put x in S_k^{(n+1)} such that
E^*(x, S_k) = \min_j E^*(x, S_j).
Now we solve
\min_{\hat p}\|p - \tilde p\|_\mu^2
to obtain an optimal reduction.
6.5.2.5. Model selection. Note that the number of partitions N should also not be
given in advance. But in strategies similar to K-means, the minimal value of E^* is
monotonically decreasing in N, which means a larger N is always preferred.
A possible approach is to introduce another quantity which is monotone in-
creasing with N . We take K-means clustering for example. In K-means clustering,
only compactness is reflected. If another quantity indicates separation of centers of
each cluster, we can minimize the ratio of compactness and separation to find an
optimal N .
⇔ (I − P)(T^+ − S) = 0.
Therefore for irreducible P, S and T^+ must satisfy
\mathrm{diag}(T^+ - S) = 0, \qquad T^+ - S = \mathbf{1}u^T \ \text{for some } u.
Now we continue with the proof of the main theorem. Since T = T^+ − T_d^+,
(173) becomes
T = E + PT - T_d^+,
(I - P)T = E - T_d^+,
(I - D^{-1}W)T = F,
(D - W)T = DF,
LT = DF,
where F = E − T_d^+ and L = D − W is the (unnormalized) graph Laplacian. Since
L is symmetric and irreducible, we have L = \sum_{k=1}^n\mu_k\nu_k\nu_k^T, where 0 = \mu_1 < \mu_2 \le
\cdots \le \mu_n, \nu_1 = \mathbf{1}/\|\mathbf{1}\|, \nu_k^T\nu_l = \delta_{kl}. Let L^\dagger = \sum_{k=2}^n\frac{1}{\mu_k}\nu_k\nu_k^T; L^\dagger is called the pseudo-
inverse (or Moore-Penrose inverse) of L. One can verify that L^\dagger satisfies the
□
Note that vol(G) = \sum_i d_i and \pi_i = d_i/vol(G) for all i.
Given two sets V_0 and V_1 in the state space V, transition path theory tells
how the transitions between the two sets happen (mechanism, rates, etc.). If we
view V_0 as a reactant state and V_1 as a product state, then one transition from V_0
to V_1 is a reaction event. The reactive trajectories are those parts of the equilibrium
trajectory during which the system goes from V_0 to V_1.
Let the hitting time of V_k be
\tau_i^k = \inf\{t \ge 0 : x(0) = i,\ x(t) \in V_k\}, \quad k = 0, 1.
The central object in transition path theory is the committor function. Its
value at i ∈ V_u gives the probability that a trajectory starting from i will hit the
set V_1 before V_0, i.e., the success rate of the transition at i.
Proposition 6.7.1. For ∀i ∈ Vu , define the committor function
qi := P rob(τi1 < τi0 ) = P rob(trajectory starting from xi hit V1 before V0 )
which satisfies the following Laplacian equation with Dirichlet boundary conditions
(Lq)(i) = [(I − P )q](i) = 0, i ∈ Vu
qi∈V0 = 0, qi∈V1 = 1.
The solution is
qu = (Du − Wuu )−1 Wul ql .
Proof. By definition,
q_i = \mathrm{Prob}(\tau_i^1 < \tau_i^0) = \begin{cases} 1 & x_i \in V_1, \\ 0 & x_i \in V_0, \\ \sum_{j\in V}P_{ij}q_j & i \in V_u. \end{cases}
This is because for all i ∈ V_u,
q_i = \mathrm{Prob}(\tau_i^{V_1} < \tau_i^{V_0}) = \sum_j P_{ij}q_j = \sum_{j\in V_1}P_{ij}q_j + \sum_{j\in V_0}P_{ij}q_j + \sum_{j\in V_u}P_{ij}q_j = \sum_{j\in V_1}P_{ij} + \sum_{j\in V_u}P_{ij}q_j.
The reactive current J(xy) gives the average rate the reactive trajectories jump
from state x to y. From the reactive current, we may define the effective reactive
current on an edge and transition current through a node which characterizes the
importance of an edge and a node in the transition from A to B, respectively.
Finally, the committor functions also give information about the time propor-
tion that an equilibrium trajectory comes from A (the trajectory hits A last rather
than B).
Proposition 6.7.5. The proportion of time that the trajectory comes from A
(resp. from B) is given by
(144) \rho_A = \sum_{x\in V}\pi(x)q(x), \qquad \rho_B = \sum_{x\in V}\pi(x)(1 - q(x)).
6.8.2. Explanation from Transition Path Theory. We can also view the
problem as a random walk on a graph. Construct a graph model with transition matrix
P = D^{-1}W = \begin{pmatrix} P_{ll} & P_{lu} \\ P_{ul} & P_{uu} \end{pmatrix}.
Assume that the labeled data are binary (classification), that is, for x_i ∈ V_l, f(x_i) = 0 or 1. Denote
• V_0 = \{i \in V_l : f_i = f(x_i) = 0\},
• V_1 = \{i \in V_l : f_i = f(x_i) = 1\},
• V = V_0 \cup V_1 \cup V_u, where V_l = V_0 \cup V_1.
With this random walk on the graph, f_u can be interpreted via hitting probabilities:
the probability of hitting V_1 before V_0.
\begin{pmatrix} X & Y \\ Z & W \end{pmatrix}\cdot\begin{pmatrix} A & B \\ C & D \end{pmatrix} = I \;\Rightarrow\; \begin{pmatrix} X & Y \\ Z & W \end{pmatrix} = \begin{pmatrix} S_D^{-1} & -S_D^{-1}BD^{-1} \\ -S_A^{-1}CA^{-1} & S_A^{-1} \end{pmatrix},
where S_A = D − CA^{-1}B and S_D = A − BD^{-1}C are called the Schur complements of
A and D, respectively. The matrix expressions for the inverse are equivalent when the
matrix is invertible.
The graph Laplacian
L = \begin{pmatrix} D_l - W_{ll} & -W_{lu} \\ -W_{ul} & D_u - W_{uu} \end{pmatrix}
is not invertible.
D_l − W_{ll} and D_u − W_{uu} are both strictly diagonally dominant, i.e.
D_l(i,i) > \sum_j|W_{ll}(i,j)|, whence they are invertible by the Gershgorin circle theorem.
However, their Schur complements S_{D_u-W_{uu}} and S_{D_l-W_{ll}} are still not invertible, so
the block matrix inversion formula above cannot be applied directly. To avoid this
issue, we define a regularized version of the graph Laplacian
L_\lambda = L + \lambda I, \quad \lambda > 0,
and study its inverse \Sigma_\lambda = L_\lambda^{-1}.
By the block matrix inversion formula, we can write \Sigma_\lambda as
\Sigma_\lambda = \begin{pmatrix} S_{\lambda+D_u-W_{uu}}^{-1} & (\lambda+D_l-W_{ll})^{-1}W_{lu}S_{\lambda+D_l-W_{ll}}^{-1} \\ (\lambda+D_u-W_{uu})^{-1}W_{ul}S_{\lambda+D_u-W_{uu}}^{-1} & S_{\lambda+D_l-W_{ll}}^{-1} \end{pmatrix}.
Therefore,
f_{u,\lambda} = \Sigma_{ul,\lambda}\Sigma_{ll,\lambda}^{-1}f_l = (\lambda + D_u - W_{uu})^{-1}W_{ul}f_l,
whose limit exists: \lim_{\lambda\to 0}f_{u,\lambda} = (D_u - W_{uu})^{-1}W_{ul}f_l = f_u. This implies
that f_u can be regarded as the conditional mean given f_l.
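A short Python sketch of this harmonic solution (illustrative; the toy graph and names are made up): given a weight matrix W and binary labels on V_l, the unlabeled scores are f_u = (D_u − W_{uu})^{-1}W_{ul}f_l.

import numpy as np

def harmonic_label_propagation(W, labeled_idx, fl):
    # semi-supervised labels via f_u = (D_u - W_uu)^{-1} W_ul f_l
    n = W.shape[0]
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    D = np.diag(W.sum(axis=1))
    Duu_Wuu = D[np.ix_(unlabeled_idx, unlabeled_idx)] - W[np.ix_(unlabeled_idx, unlabeled_idx)]
    Wul = W[np.ix_(unlabeled_idx, labeled_idx)]
    f = np.zeros(n)
    f[labeled_idx] = fl
    f[unlabeled_idx] = np.linalg.solve(Duu_Wuu, Wul @ fl)
    return f

# toy path graph on 5 nodes: label the two endpoints, propagate to the middle
W = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
print(harmonic_label_propagation(W, np.array([0, 4]), np.array([0.0, 1.0])))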
6.8.4. Remarks. One natural problem is: if we only have a fixed amount of
labeled data, can we recover labels of an infinite amount of unobserved data? This
is called well-posedness. [Nadler-Srebro 2009] gives the following result:
• If xi ∈ R1 , the problem is well-posed.
• If xi ∈ Rd (d ≥ 3), the problem is ill-posed in which case Du − Wuu
becomes singular and f becomes a bump function (fu is almost always
zeros or ones except on some singular points).
Here we can give a brief explanation:
f^TLf \sim \int\|\nabla f\|^2.
If we have V_l = \{0, 1\}, f(x_0) = 0, f(x_1) = 1 and let
f_\epsilon(x) = \begin{cases} \frac{\|x - x_0\|_2^2}{\epsilon^2} & \|x - x_0\|_2 < \epsilon, \\ 1 & \text{otherwise,} \end{cases}
then from multivariable calculus,
\int\|\nabla f_\epsilon\|^2 = c\,\epsilon^{d-2}.
Since d ≥ 3, ϵ → 0 implies \int\|\nabla f_\epsilon\|^2 → 0. So f_ϵ (as ϵ → 0) converges to a bump
function which equals one almost everywhere except at x_0, where its value is 0. No generalization
ability is learned from such bump functions.
This means that in the high dimensional case, to obtain a smooth generalization, we
have to add constraints beyond the norm of the first order derivatives. We
also have a theorem to illustrate what kind of constraint is enough for a good
generalization:
Theorem 6.8.2 (Sobolev Embedding Theorem). f ∈ W_{s,p}(R^d) iff f has s-th order
weak derivatives f^{(s)} ∈ L^p, and
s > \frac{d}{2} \;\Rightarrow\; W_{s,2} \hookrightarrow C(R^d).
So in R^d, to obtain a continuous function, one needs a smoothness regularization
\int\|\nabla^s f\| of degree s > d/2. To implement this in the discrete Laplacian setting, one
may consider the iterated Laplacian L^s, which might converge to a high order smoothness
regularization.
D = sum(A, 2);
N = length(D);
Label = [0:N-1];
TransProb = diag(1./D) * A;
LMat = TransProb - diag(ones(N, 1));
% EquiMeasure is not defined in this excerpt; for the random walk P = D^{-1}A
% the invariant measure is proportional to the degrees (assumed here):
EquiMeasure = D / sum(D);
% nodes where no neighbor has a larger invariant measure
for i = 1:N
    localmin = true;
    for j = setdiff(1:N, i)
        if ((LMat(i,j) > 0) & (EquiMeasure(j) > EquiMeasure(i)))
            localmin = false;
            break
        end
    end
    if (localmin)
        i
    end
end

% mean first passage time to the source set
mfpt = zeros(N, 1);
SourceSet = 11;
RemainSet = setdiff(1:N, SourceSet);
mfpt(RemainSet) = -LMat(RemainSet, RemainSet) \ ones(N-1, 1);

SourceSet = SetA;
TargetSet = SetB;
RemainSet = setdiff(1:N, union(SourceSet, TargetSet));
Diffusion Geometry
whence A is a row Markov matrix of the following discrete time Markov chain
{Xt }t∈N satisfying
(151) P (Xt+1 = xj | Xt = xi ) = Aij .
7.1.1. Spectral Properties of A. We may reach a spectral decomposition
of A with the aid of the following symmetric matrix S which is similar to A. Let
(152) S := D^{-1/2}WD^{-1/2},
which is symmetric and has an eigenvalue decomposition
(153) S = V\Lambda V^T, \quad\text{where } VV^T = I_n,\ \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n).
So
A = D^{-1}W = D^{-1}(D^{1/2}SD^{1/2}) = D^{-1/2}SD^{1/2},
which is similar to S, whence it shares the same eigenvalues as S. Moreover,
(154) A = D^{-1/2}V\Lambda V^TD^{1/2} = \Phi\Lambda\Psi^T,
where \Phi = D^{-1/2}V and \Psi = D^{1/2}V give the right and left eigenvectors of A respectively,
A\Phi = \Phi\Lambda and \Psi^TA = \Lambda\Psi^T, and they satisfy \Psi^T\Phi = I_n.
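A short numerical check of (154) (illustrative Python, not from the text): build a small weight matrix, form S, and verify that Φ = D^{-1/2}V and Ψ = D^{1/2}V are bi-orthogonal right/left eigenvectors of A.

import numpy as np

rng = np.random.default_rng(0)
B = rng.random((6, 6))
W = (B + B.T) / 2                    # symmetric nonnegative weights
np.fill_diagonal(W, 0)
d = W.sum(axis=1)

S = W / np.sqrt(np.outer(d, d))      # S = D^{-1/2} W D^{-1/2}
lam, V = np.linalg.eigh(S)
Phi = V / np.sqrt(d)[:, None]        # right eigenvectors of A
Psi = V * np.sqrt(d)[:, None]        # left eigenvectors of A
A = W / d[:, None]                   # row Markov matrix D^{-1} W

print(np.allclose(A @ Phi, Phi * lam))         # A Phi = Phi Lambda
print(np.allclose(Psi.T @ A, (Psi * lam).T))   # Psi^T A = Lambda Psi^T
print(np.allclose(Psi.T @ Phi, np.eye(6)))     # Psi^T Phi = I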
The Markov matrix A satisfies the following properties by Perron-Frobenius
Theory.
Proposition 7.1.1. (1) A has eigenvalues λ(A) ⊂ [−1, 1].
(2) A is irreducible, if and only if ∀(i, j) ∃t s.t. (At )ij > 0 ⇔ Graph G = (V, E)
is connected
(3) A is irreducible ⇒ λmax = 1
(4) A is primitive, if and only if ∃t > 0 s.t. ∀(i, j) (At )ij > 0 ⇔ Graph
G = (V, E) is path-t connected, i.e. any pair of nodes are connected by a
path of length no more than t
(5) A is irreducible and ∀i, Aii > 0 ⇒ A is primitive
(6) A is primitive ⇒ −1 ̸∈ λ(A)
(7) Wij is induced from the heat kernel, or any positive definite function
⇒ λ(A) ≥ 0
Proof. (1) Assume λ and v are an eigenvalue and eigenvector of A, so Av = λv.
Find j_0 such that |v_{j_0}| ≥ |v_j| for all j ≠ j_0, where v_j is the j-th entry of v. Then
\lambda v_{j_0} = (Av)_{j_0} = \sum_{j=1}^n A_{j_0j}v_j,
so
|\lambda||v_{j_0}| = \Big|\sum_{j=1}^n A_{j_0j}v_j\Big| \le \sum_{j=1}^n A_{j_0j}|v_j| \le |v_{j_0}|.
We refer to the i-th row of the matrix A^t, denoted A^t_{i,*}, as the transition probability
of a t-step random walk that starts at x_i. We can express A^t using the
decomposition of A. Indeed, from
(157) A = \Phi\Lambda\Psi^T
with \Psi^T\Phi = I, we get
(158) A^t = \Phi\Lambda^t\Psi^T.
Written componentwise, this is equivalent to
(159) A^t_{ij} = \sum_{k=1}^n\lambda_k^t\phi_k(i)\psi_k(j).
Diffusion Map: Φ^{1D}_t(x_1) = · · · = Φ^{1D}_t(x_{n_1}) = c_1, Φ^{1D}_t(x_{n_1+1}) = · · · = Φ^{1D}_t(x_n) = c_2.
EX2: ring graph ("single circle"). In this case, W is a circulant matrix
W = [ 1 1 0 0 · · · 1
      1 1 1 0 · · · 0
      0 1 1 1 · · · 0
      ⋮ ⋮ ⋮ ⋮ ⋱ ⋮
      1 0 0 0 · · · 1 ].
The eigenvalues of W are λ_k = cos(2πk/n), k = 0, 1, · · · , n/2, and the corresponding eigenvectors are (u_k)_j = e^{i 2πkj/n}, j = 1, · · · , n. So we can get Φ^{2D}_t(x_j) = (cos(2πkj/n), sin(2πkj/n)) c^t.
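A minimal MATLAB sketch of this example (the ring size, the time t, and the plotting are placeholders): the two-dimensional diffusion map lays the ring nodes out on a circle.

n = 100;  t = 10;
W = eye(n) + diag(ones(n-1,1), 1) + diag(ones(n-1,1), -1);
W(1, n) = 1;  W(n, 1) = 1;                   % circulant: self loop + two neighbors
d = sum(W, 2);
S = diag(d.^(-1/2)) * W * diag(d.^(-1/2));
[V, Lam] = eig((S + S')/2);
[lambda, idx] = sort(diag(Lam), 'descend');
Phi = diag(d.^(-1/2)) * V(:, idx);
Phi2D = (lambda(2:3)'.^t) .* Phi(:, 2:3);    % 2D diffusion map at time t
plot(Phi2D(:, 1), Phi2D(:, 2), 'o');         % points fall on a circle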
EX3: ordering the faces. Let
k_ε(x, y) = exp(−∥x − y∥^2 / ε),
L_ε := (1/ε)(A_ε − I) → backward Kolmogorov operator as ε → 0,
L_ε f = (1/2) Δ_M f − ∇f · ∇V  ⇒  L_ε ϕ = λϕ  ⇒  (1/2) ϕ''(s) − ϕ'(s) V'(s) = λ ϕ(s),  ϕ'(0) = ϕ'(1) = 0,
where V(s) is the Gibbs free energy, p(s) = e^{−V(s)} is the density of data points along the curve, and Δ_M is the Laplace-Beltrami operator. If p(s) = const, we get
(162) V(s) = const ⇒ ϕ''(s) = 2λϕ(s) ⇒ ϕ_k(s) = cos(kπs), 2λ_k = −k^2 π^2.
On the other hand, if p(s) ≠ const, one can show^1 that ϕ_1(s) is monotonic for arbitrary p(s). As a result, the faces can still be ordered by using ϕ_1(s).
^1 By changing to polar coordinates p(s)ϕ'(s) = r(s) cos θ(s), ϕ(s) = r(s) sin θ(s) (the so-called 'Prüfer transform') and then trying to show that ϕ'(s) is never zero on (0, 1).
Proof.
∥A^t_{i,∗} − A^t_{j,∗}∥^2_{ℓ^2(R^n, 1/d)} = Σ_{l=1}^n (1/d_l) (A^t_{il} − A^t_{jl})^2
  = Σ_{l=1}^n (1/d_l) [Σ_{k=1}^n λ_k^t ϕ_k(i) ψ_k(l) − λ_k^t ϕ_k(j) ψ_k(l)]^2
  = Σ_{l=1}^n (1/d_l) Σ_{k,k'} λ_k^t (ϕ_k(i) − ϕ_k(j)) ψ_k(l) λ_{k'}^t (ϕ_{k'}(i) − ϕ_{k'}(j)) ψ_{k'}(l)
  = Σ_{k,k'} λ_k^t λ_{k'}^t (ϕ_k(i) − ϕ_k(j)) (ϕ_{k'}(i) − ϕ_{k'}(j)) Σ_{l=1}^n ψ_k(l) ψ_{k'}(l) / d_l
  = Σ_{k,k'} λ_k^t λ_{k'}^t (ϕ_k(i) − ϕ_k(j)) (ϕ_{k'}(i) − ϕ_{k'}(j)) δ_{kk'}
  = Σ_{k=1}^n λ_k^{2t} (ϕ_k(i) − ϕ_k(j))^2
  = d_t^2(x_i, x_j).
□
In practice we usually do not use the mapping Φ_t but rather the truncated diffusion map Φ^δ_t that makes use of fewer than n coordinates. Specifically, Φ^δ_t uses only the eigenvectors for which the eigenvalues satisfy |λ_k|^t > δ. When t is large enough, we can use the truncated diffusion distance
(164) d^δ_t(x_i, x_j) = ∥Φ^δ_t(x_i) − Φ^δ_t(x_j)∥ = [Σ_{k: |λ_k|^t > δ} λ_k^{2t} (ϕ_k(i) − ϕ_k(j))^2]^{1/2}
as an approximation of the weighted ℓ^2 distance of the probability clouds. We now
derive a simple error bound for this approximation.
Lemma 7.1.3 (Truncated Diffusion Distance). The truncated diffusion distance satisfies the following upper and lower bounds:
d_t^2(x_i, x_j) − (2δ^2 / d_min)(1 − δ_{ij}) ≤ [d^δ_t(x_i, x_j)]^2 ≤ d_t^2(x_i, x_j),
where d_min = min_{1≤i≤n} d_i with d_i = Σ_j W_{ij}.
Proof. Since Φ = D^{-1/2} V, where V is an orthonormal matrix (V V^T = V^T V = I), it follows that
(165) Φ Φ^T = D^{-1/2} V V^T D^{-1/2} = D^{-1}.
Therefore,
(166) Σ_{k=1}^n ϕ_k(i) ϕ_k(j) = (ΦΦ^T)_{ij} = δ_{ij} / d_i
and
(167) Σ_{k=1}^n (ϕ_k(i) − ϕ_k(j))^2 = 1/d_i + 1/d_j − 2δ_{ij}/d_i.
Clearly,
(168) Σ_{k=1}^n (ϕ_k(i) − ϕ_k(j))^2 ≤ (2/d_min)(1 − δ_{ij}), for all i, j = 1, 2, · · · , n.
As a result,
[d^δ_t(x_i, x_j)]^2 = d_t^2(x_i, x_j) − Σ_{k: |λ_k|^t < δ} λ_k^{2t} (ϕ_k(i) − ϕ_k(j))^2
  ≥ d_t^2(x_i, x_j) − δ^2 Σ_{k: |λ_k|^t < δ} (ϕ_k(i) − ϕ_k(j))^2
  ≥ d_t^2(x_i, x_j) − δ^2 Σ_{k=1}^n (ϕ_k(i) − ϕ_k(j))^2
  ≥ d_t^2(x_i, x_j) − (2δ^2/d_min)(1 − δ_{ij}).
On the other hand, it is clear that
(169) [d^δ_t(x_i, x_j)]^2 ≤ d_t^2(x_i, x_j).
We conclude that
(170) d_t^2(x_i, x_j) − (2δ^2/d_min)(1 − δ_{ij}) ≤ [d^δ_t(x_i, x_j)]^2 ≤ d_t^2(x_i, x_j).
□
Therefore, for small δ the truncated diffusion distance provides a very good
approximation to the diffusion distance. Due to the fast decay of the eigenvalues,
the number of coordinates used for the truncated diffusion map is usually much
smaller than n, especially when t is large.
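A minimal MATLAB sketch of the truncation (reusing Phi and lambda from the spectral decomposition sketch earlier; the values of t and delta are placeholders):

t = 10;  delta = 0.1;
keep = find(abs(lambda).^t > delta);               % indices with |lambda_k|^t > delta
PhiTrunc = (lambda(keep)'.^t) .* Phi(:, keep);     % truncated diffusion map
i = 1;  j = 2;
d_trunc = norm(PhiTrunc(i, :) - PhiTrunc(j, :));   % truncated diffusion distance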
7.1.5. Is the diffusion distance really a distance? A distance function
d : X × X → R must satisfy the following properties:
(1) Symmetry: d(x, y) = d(y, x)
(2) Non-negativity: d(x, y) ≥ 0
(3) Identity of indiscernibles: d(x, y) = 0 ⇔ x = y
(4) Triangle inequality: d(x, z) + d(z, y) ≥ d(x, y)
Since the diffusion map is an embedding into the Euclidean space Rn , the
diffusion distance inherits all the metric properties of Rn such as symmetry, non-
negativity and the triangle inequality. The only condition that is not immediately
implied is dt (x, y) = 0 ⇔ x = y. Clearly, xi = xj implies that dt (xi , xj ) = 0. But
is it true that d_t(x_i, x_j) = 0 implies x_i = x_j? Suppose d_t(x_i, x_j) = 0. Then
(171) 0 = d_t^2(x_i, x_j) = Σ_{k=1}^n λ_k^{2t} (ϕ_k(i) − ϕ_k(j))^2.
It follows that ϕ_k(i) = ϕ_k(j) for all k with λ_k ≠ 0. But there is still the possibility that ϕ_k(i) ≠ ϕ_k(j) for k with λ_k = 0. We claim that this can happen only when i and j have exactly the same neighbors and proportional weights, that is:
Proposition 7.1.4. The situation d_t(x_i, x_j) = 0 with x_i ≠ x_j occurs if and only if nodes i and j have exactly the same neighbors and proportional weights
W_{ik} = α W_{jk}, α > 0, for all k ∈ V.
Proof. (Necessity) If d_t(x_i, x_j) = 0, then Σ_{k=1}^n λ_k^{2t} (ϕ_k(i) − ϕ_k(j))^2 = 0 and ϕ_k(i) = ϕ_k(j) for k with λ_k ≠ 0. This implies that d_{t'}(x_i, x_j) = 0 for all t', because
(172) d_{t'}^2(x_i, x_j) = Σ_{k=1}^n λ_k^{2t'} (ϕ_k(i) − ϕ_k(j))^2 = 0.
In particular, for t' = 1, we get d_1(x_i, x_j) = 0. But
d_1(x_i, x_j) = ∥A_{i,∗} − A_{j,∗}∥_{ℓ^2(R^n, 1/d)},
and since ∥·∥_{ℓ^2(R^n, 1/d)} is a norm, we must have A_{i,∗} = A_{j,∗}, which implies for each k ∈ V,
W_{ik}/d_i = W_{jk}/d_j, ∀k ∈ V,
whence W_{ik} = α W_{jk} where α = d_i/d_j, as desired.
(Sufficiency) If A_{i,∗} = A_{j,∗}, then 0 = Σ_{k=1}^n (A_{i,k} − A_{j,k})^2/d_k = d_1^2(x_i, x_j) = Σ_{k=1}^n λ_k^2 (ϕ_k(i) − ϕ_k(j))^2, and therefore ϕ_k(i) = ϕ_k(j) for k with λ_k ≠ 0, from which it follows that d_t(x_i, x_j) = 0 for all t. □
Example 14. In a graph with three nodes V = {1, 2, 3} and two edges, say
E = {(1, 2), (2, 3)}, the diffusion distance between nodes 1 and 3 is 0. Here the
transition matrix is
A = [  0   1   0
      1/2  0  1/2
       0   1   0 ].
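A quick numerical check of this example in MATLAB (a sketch): rows 1 and 3 of A^t coincide for every t, so their weighted ℓ^2(1/d) distance vanishes.

W = [0 1 0; 1 0 1; 0 1 0];
d = sum(W, 2);
A = diag(1./d) * W;
t = 3;
At = A^t;
dist2 = sum((At(1, :) - At(3, :)).^2 ./ d');  % equals 0 up to round-off, for any t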
By definition, we have
(173) T^+_{ij} = P_{ij} · 1 + Σ_{k≠j} P_{ik} (T^+_{kj} + 1)
T^+ − S = 1 u^T, ∀u,
which implies T^+ = S. The uniqueness of T follows from T = T^+ − T^+_d. □
Now we continue with the proof of the main theorem. Since T = T^+ − T^+_d, (173) becomes
T = E + P T − T^+_d
(I − P) T = E − T^+_d
(I − D^{-1} W) T = F
(D − W) T = D F
L T = D F
where F = E − T^+_d and L = D − W is the (unnormalized) graph Laplacian. Since L is symmetric and irreducible, we have L = Σ_{k=1}^n μ_k ν_k ν_k^T, where 0 = μ_1 < μ_2 ≤ · · · ≤ μ_n, ν_1 = 1/∥1∥, ν_k^T ν_l = δ_{kl}. Let L^+ = Σ_{k=2}^n (1/μ_k) ν_k ν_k^T; L^+ is called the pseudo-inverse (or Moore-Penrose inverse) of L. We can verify that L^+ satisfies the following four conditions:
L^+ L L^+ = L^+,
L L^+ L = L,
(L L^+)^T = L L^+,
(L^+ L)^T = L^+ L,
as 1 · u^T ∈ ker(L), whence
T_{ij} = Σ_{k=1}^n L^+_{ik} d_k − L^+_{ij} d_j · (1/π_j) + u_j,
u_i = −Σ_{k=1}^n L^+_{ik} d_k + L^+_{ii} vol(G)   (taking j = i),
T_{ij} = Σ_k L^+_{ik} d_k − L^+_{ij} vol(G) + L^+_{jj} vol(G) − Σ_k L^+_{jk} d_k.
Note that vol(G) = Σ_i d_i and π_i = d_i / vol(G) for all i. Then
(175) T_{ij} + T_{ji} = vol(G) (L^+_{ii} + L^+_{jj} − 2 L^+_{ij}).
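A minimal MATLAB sketch of formula (175), assuming a symmetric weight matrix W:

d = sum(W, 2);
L = diag(d) - W;
Lp = pinv(full(L));                            % Moore-Penrose pseudo-inverse of L
volG = sum(d);
C = volG * (diag(Lp) + diag(Lp)' - 2*Lp);      % C(i,j) = T_ij + T_ji, the commute time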
where d(x) = ∫_X k(x, y) dp(y) is a smoothed density at x, and d_r(x_i, x_j) = d_c(x_i, x_j)/√vol(G) is the resistance distance. This result shows that in this setting the commute time distance carries no cluster information about the point cloud data; instead, it simply reflects the density information around the two points.
Proof omitted. The reverse is also true, which is Bochner's theorem. The high dimensional case is similar.
Take the 1-dimensional case as an example. Since the Gaussian distribution e^{−ξ^2/2} dξ is a positive finite Borel measure, and the Fourier transform of the Gaussian kernel is itself, we know that k(x, y) = e^{−|x−y|^2/2} is a positive definite integral kernel. The matrix W, as a discretized version of k(x, y), keeps the positive-definiteness (make this rigorous? Hint: take ϕ(x) as a linear combination of n delta functions).
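One way to make the hint precise (a sketch, not the text's own argument): taking ϕ as the combination Σ_j c_j δ_{x_j} of delta functions and using the Fourier representation of the Gaussian,
Σ_{i,j=1}^n c_i c_j e^{−|x_i − x_j|^2/2} = (1/√(2π)) Σ_{i,j} c_i c_j ∫_R e^{iξ(x_i − x_j)} e^{−ξ^2/2} dξ = (1/√(2π)) ∫_R |Σ_{j=1}^n c_j e^{iξ x_j}|^2 e^{−ξ^2/2} dξ ≥ 0,
so c^T W c ≥ 0 for every c ∈ R^n, i.e. W inherits positive semi-definiteness from the kernel.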
7.3.1. Main Result. In this lecture, we will study the bias and variance decomposition for sample graph Laplacians and their asymptotic convergence to Laplace-Beltrami operators on manifolds.
Let M be a smooth manifold without boundary in R^p (e.g. a d-dimensional sphere). Randomly draw a set of n data points, x_1, . . . , x_n ∈ M ⊂ R^p, according to
distribution p(x) in an independent and identically distributed (i.i.d.) way. We can
extract an n × n weight matrix Wij as follows:
Wij = k(xi , xj )
where k(x, y) is a symmetric (k(x, y) = k(y, x)) and positivity-preserving (k(x, y) ≥ 0) kernel. As an example, it can be the heat kernel (or Gaussian kernel),
k_ϵ(x_i, x_j) = exp(−∥x_i − x_j∥^2 / (2ϵ)),
where ∥·∥ is the Euclidean norm in R^p and ϵ is the bandwidth of the kernel. W_{ij} stands for a similarity function between x_i and x_j. A diagonal matrix D is defined whose diagonal elements are the row sums of W:
D_{ii} = Σ_{j=1}^n W_{ij}.
Denote A^{(α)} = (D^{(α)})^{-1} W^{(α)}, and one can verify that Σ_{j=1}^n A^{(α)}_{ij} = 1, i.e. a row Markov matrix. Now define L^{(α)} = A^{(α)} − I = (D^{(α)})^{-1} W^{(α)} − I, and
L_{ϵ,α} = (1/ϵ) (A^{(α)}_ϵ − I)
when k_ϵ(x, y) is used in constructing W.
when kϵ (x, y) is used in constructing W . In general, L(α) and Lϵ,α are both called
graph Laplacians. In particular L(0) is the unnormalized graph Laplacian in litera-
ture.
The target is to show that the graph Laplacian L_{ϵ,α} converges to continuous differential operators acting on smooth functions on the manifold M. The convergence can be roughly understood as follows: we say a sequence of n-by-n matrices L^{(n)} converges, as n → ∞, to a limiting operator L if, for L's eigenfunction f(x) (a smooth function on M) with eigenvalue λ, that is
Lf = λf,
the length-n vector f (n) = (f (xi )), (i = 1, · · · , n) is approximately an eigenvector
of L(n) with eigenvalue λ, that is
L(n) f (n) = λf (n) + o(1),
where o(1) goes to zero as n → ∞.
Specifically (the convergence is in the sense of multiplying a positive constant),
(I) L_{ϵ,0} = (1/ϵ)(A_ϵ − I) → (1/2)(Δ_M + 2 (∇p/p) · ∇) as ϵ → 0 and n → ∞. Δ_M is the Laplace-Beltrami operator of the manifold M. At a point on M which is d-dimensional, in local (orthogonal) geodesic coordinates s_1, · · · , s_d, the Laplace-Beltrami operator has the same form as the Laplacian in calculus:
Δ_M f = Σ_{i=1}^d ∂^2 f / ∂s_i^2;
F(x_i) = (1/n) Σ_{j≠i} k_ϵ(x_i, x_j) f(x_j),   G(x_i) = (1/n) Σ_{j≠i} k_ϵ(x_i, x_j),
which depend only on the other n − 1 data points than x_i. In what follows we treat x_i as a fixed chosen point and write it as x.
Recall that given a random variable x and a sample estimator θ̂ (e.g. the sample mean), the bias-variance decomposition is given by
E∥x − θ̂∥^2 = E∥x − Ex∥^2 + E∥Ex − θ̂∥^2.
If we use the same strategy here (though not exactly the same, since E[F/G] ≠ E[F]/E[G]!), we can decompose Eqn. (176) as
(Lf)_i = (1/ϵ)(E[F]/E[G] − f(x_i)) + f(x_i) O(1/(n ϵ^{d/2})) + (1/ϵ)(F(x_i)/G(x_i) − E[F]/E[G])
       = bias + variance.
In the below we shall show that for case (I) the estimates are
(177) bias = (1/ϵ)(E[F]/E[G] − f(x)) + f(x_i) O(1/(n ϵ^{d/2})) = (m_2/2)(Δ_M f + 2∇f · ∇p/p) + O(ϵ) + O(n^{-1} ϵ^{-d/2}),
(178) variance = (1/ϵ)(F(x_i)/G(x_i) − E[F]/E[G]) = O(n^{-1/2} ϵ^{-d/4-1}),
whence
bias + variance = O(ϵ, n^{-1/2} ϵ^{-d/4-1}) = C_1 ϵ + C_2 n^{-1/2} ϵ^{-d/4-1}.
As the bias is a monotone increasing function of ϵ while the variance is decreasing w.r.t. ϵ, the optimal choice of ϵ balances the two terms by setting the derivative of the right hand side to zero (or equivalently setting ϵ ∼ n^{-1/2} ϵ^{-d/4-1}), whose solution gives the optimal rate
ϵ^* ∼ n^{-1/(2+d/2)}.
(CL06) gives the bias and (HAvL05) contains the variance parts, which are further improved by (Sin06) in both bias and variance.
7.3.3. The Bias Term. Now focus on E[F]:
E[F] = E[(1/n) Σ_{j≠i} k_ϵ(x_i, x_j) f(x_j)] = ((n−1)/n) ∫_M k_ϵ(x, y) f(y) p(y) dy;
(n−1)/n is close to 1 and is treated as 1.
(1) The case of one dimension and flat (which means the manifold M is just the real line, i.e. M = R).
Let f̃(y) = f(y) p(y) and k_ϵ(x, y) = (1/√ϵ) e^{−(x−y)^2/(2ϵ)}. By the change of variable y = x + √ϵ z, we have
∫_M k_ϵ(x, y) f(y) p(y) dy = ∫_R f̃(x + √ϵ z) e^{−z^2/2} dz = m_0 f̃(x) + (m_2/2) f̃''(x) ϵ + O(ϵ^2),
where m_0 = ∫_R e^{−z^2/2} dz and m_2 = ∫_R z^2 e^{−z^2/2} dz.
The first part, over the region ∥x − y∥ > c√ϵ with c ∼ ln(1/ϵ), satisfies
|○| ≤ ∥f̃∥_∞ ϵ^{-1/2} e^{−c^2/2},
so this term is tiny and can be ignored.
Locally, that is for u ∼ √ϵ, the curve lies in a plane and has the parametrized equation
(x(u), y(u)) = (u, a u^2 + q u^3 + · · · ),
so the chord length satisfies
(1/ϵ) ∥x − y∥^2 = (1/ϵ)[u^2 + (a u^2 + q u^3 + · · ·)^2] = (1/ϵ)[u^2 + a^2 u^4 + q_5(u) + · · · ],
where we write q_5(u) = 2 a q u^5 + · · · . Next, change variable u/√ϵ = z; then with h(ξ) = e^{−ξ/2},
h(∥x − y∥^2/ϵ) = h(z^2) + h'(z^2)(ϵ a^2 z^4 + ϵ^{3/2} q_5 + O(ϵ^2)),
also
f̃(s) = f̃(x) + (df̃/ds)(x) s + (1/2)(d^2 f̃/ds^2)(x) s^2 + · · ·
and
s = ∫_0^u √(1 + (2 a u + 3 q u^2 + · · ·)^2) du + · · ·
and
ds/du = 1 + 2 a^2 u^2 + q_2(u) + O(ϵ^2),   s = u + (2/3) a^2 u^3 + O(ϵ^2).
Now come back to the integral
∫_{|x−y|<c√ϵ} (1/√ϵ) h(∥x − y∥^2/ϵ) f̃(s) ds
  ≈ ∫_{−∞}^{+∞} [h(z^2) + h'(z^2)(ϵ a^2 z^4 + ϵ^{3/2} q_5)] · [f̃(x) + (df̃/ds)(x)(√ϵ z + (2/3) a^2 z^3 ϵ^{3/2}) + (1/2)(d^2 f̃/ds^2)(x) ϵ z^2] · [1 + 2 a^2 ϵ z^2 + ϵ^{3/2} y_3(z)] dz
  = m_0 f̃(x) + ϵ (m_2/2)((d^2 f̃/ds^2)(x) + a^2 f̃(x)) + O(ϵ^2),
where the O(ϵ^2) tails are omitted in the middle steps, and m_0 = ∫ h(z^2) dz, m_2 = ∫ z^2 h(z^2) dz are positive constants. In what follows we normalize both of
7.3.4. Variance Term. Our purpose is to derive the large deviation bound^2 for
(181) Prob( F/G − E[F]/E[G] ≥ α ),
where F = F(x_i) = (1/n) Σ_{j≠i} k_ϵ(x_i, x_j) f(x_j) and G = G(x_i) = (1/n) Σ_{j≠i} k_ϵ(x, x_j).
With x_1, x_2, . . . , x_n as i.i.d. random variables, F and G are sample means (up to a scaling constant). Define a new random variable
Y = E[G] F − E[F] G − α E[G](G − E[G]),
which is of mean zero, and Eqn. (181) can be rewritten as
Prob(Y ≥ α E[G]^2).
the Markov inequality on both the numerator and denominator. Combining those estimates together, we have the following:
F/G = [f p + ϵ (m_2/2)(Δ(f p) + E[f p]) + O(ϵ^2, n^{-1/2} ϵ^{-d/4})] / [p + ϵ (m_2/2)(Δ p + E[p]) + O(ϵ^2, n^{-1/2} ϵ^{-d/4})]
    = f + ϵ (m_2/2)(Δ p + E[p]) + O(ϵ^2, n^{-1/2} ϵ^{-d/4}),
^2 It means that Prob(X > α) ≤ E(X^2)/α^2. A Chernoff bound with exponential tail can be found in Singer '06.
here O(B1 , B2 ) denotes the dominating one of the two bounds B1 and B2 in the
asymptotic limit. As a result, the error (bias + variance) of Lϵ,α (dividing another
ϵ) is of the order
(182) O(ϵ, n^{-1/2} ϵ^{-d/4-1}).
7.4.2. Graph Laplacian. The goal of the graph Laplacian is to discover the intrinsic manifold structure given a set of data points in space. There are three steps in constructing the graph Laplacian operator:
• construct the graph, using either the ϵ-neighborhood way (for any data point, connect it with all the points in its ϵ-neighborhood) or the k-nearest neighbor way (connect it with its k-nearest neighbors);
• construct the weight matrix; here we can use the simple-minded binary weight (0 or 1), or use the heat kernel weight. For an undirected graph, the weight matrix is symmetric;
• denote D as the diagonal matrix with D(i, i) = deg(i), deg(i) := Σ_j w_{ij}.
The graph Laplacian operator is
L = D − W.
(A code sketch of these three steps is given below.) The graph Laplacian has the following properties:
• ∀f : V → R, f^T L f = Σ_{(i,j)∈E} w_{ij} (f_i − f_j)^2 ≥ 0
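A minimal MATLAB sketch of the three construction steps above, for a hypothetical data matrix X (n × p), using a k-nearest-neighbor graph with heat kernel weights:

k = 10;  epsW = 1;                             % placeholder parameters
n = size(X, 1);
D2 = sum(X.^2, 2) + sum(X.^2, 2)' - 2*(X*X');  % pairwise squared distances
W = zeros(n);
[~, idx] = sort(D2, 2);
for i = 1:n
    nbr = idx(i, 2:k+1);                       % k nearest neighbors (excluding i)
    W(i, nbr) = exp(-D2(i, nbr) / epsW);       % heat kernel weights
end
W = max(W, W');                                % symmetrize for an undirected graph
L = diag(sum(W, 2)) - W;                       % graph Laplacian L = D - W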
u^∗ := u^T diag(w_{ij}),
where diag(w_{ij}) ∈ R^{n(n−1)/2 × n(n−1)/2} is the diagonal matrix that has w_{ij} at the diagonal position corresponding to ⟨i, j⟩, so that
u^∗ v = ⟨u, v⟩.
Then,
L = D − W = δ_0^T diag(w_{ij}) δ_0 = δ_0^∗ δ_0.
We first look at the graph Laplacian operator. We solve the generalized eigenvalue problem
L f = λ D f
and denote the generalized eigenvalues and eigenvectors as
0 = λ_1 ≤ λ_2 ≤ · · · ≤ λ_n,   f_1, · · · , f_n.
We now explain that this is the optimal embedding that preserves locality in the sense that connected points stay as close as possible. Specifically, for the one-dimensional embedding, the problem is
min Σ_{i,j} (y_i − y_j)^2 w_{ij} = 2 min_y y^T L y,
y^T L y = y^T D^{1/2} (I − D^{-1/2} W D^{-1/2}) D^{1/2} y.
Since I − D^{-1/2} W D^{-1/2} is symmetric, the objective is minimized when D^{1/2} y is the eigenvector of the second smallest eigenvalue (the smallest eigenvalue is 0) of I − D^{-1/2} W D^{-1/2}, which coincides with λ_2, the second smallest generalized eigenvalue of L.
Similarly, the m-dimensional optimal embedding is given by Y = (f1 , · · · , fm ).
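A minimal MATLAB sketch of this embedding (W and L as constructed above; the constant generalized eigenvector with eigenvalue 0 is skipped, as in the one-dimensional discussion):

m = 2;
D = diag(sum(W, 2));
[F, Lam] = eig(full(L), full(D));              % generalized eigenproblem L f = lambda D f
[~, idx] = sort(diag(Lam), 'ascend');
Y = F(:, idx(2:m+1));                          % rows of Y embed the data points in R^m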
In the diffusion map, the weights are used to define a discrete random walk. The transition probability in a single step from i to j is
a_{ij} = w_{ij} / deg(i),
A ϕ_i = μ_i ϕ_i,   A^t ϕ_i = μ_i^t ϕ_i.
7.4.3. The embedding given by diffusion map. Φ_t(i) denotes the i-th row of Φ_t, and
⟨Φ_t(i), Φ_t(j)⟩ = Σ_{k=1}^n A^t(i, k) A^t(j, k) / deg(k).
We can thus define a distance called the diffusion distance:
d^2_{DM,t}(i, j) := ⟨Φ_t(i), Φ_t(i)⟩ + ⟨Φ_t(j), Φ_t(j)⟩ − 2⟨Φ_t(i), Φ_t(j)⟩ = Σ_{k=1}^n (A^t(i, k) − A^t(j, k))^2 / deg(k).
We will later define the vector diffusion mapping, and, using a similar argument as in the diffusion mapping, it is easy to see that the vector diffusion mapping gives the optimal embedding that preserves locality in this sense.
We now discuss how to obtain an approximation of the parallel transport operator from the data set.
The approximation of the tangent space at a certain point x_i is given by local PCA. Choose ϵ_i to be sufficiently small, and denote by x_{i_1}, · · · , x_{i_{N_i}} the data points in the ϵ_i-neighborhood of x_i. Define
X_i := [x_{i_1} − x_i, · · · , x_{i_{N_i}} − x_i].
Denote D_i as the diagonal matrix with
D_i(j, j) = √(K(∥x_{i_j} − x_i∥ / ϵ_i)),   j = 1, · · · , N_i,
and let
B_i := X_i D_i.
Perform an SVD on B_i:
B_i = U_i Σ_i V_i^T.
We use the first d columns of U_i (which are the left singular vectors corresponding to the d largest singular values of B_i) to form an approximation of the tangent space at x_i. That is,
O_i = [u_{i_1}, · · · , u_{i_d}].
Then O_i is a numerical approximation to an orthonormal basis of the tangent space at x_i.
For connected points x_i and x_j, since they are sufficiently close to each other, their tangent spaces should be close. Therefore, O_i and O_j should also be close. We therefore use the closest orthogonal matrix to O_i^T O_j as the approximation of the parallel transport operator from x_j to x_i:
ρ_{ij} := argmin_{O orthogonal} ∥O − O_i^T O_j∥_{HS},
where ∥A∥^2_{HS} = Tr(A A^T) is the Hilbert-Schmidt norm.
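A minimal MATLAB sketch of the local PCA and the alignment ρ_{ij} (the data matrix X, the kernel K, the radius, and the tangent dimension d are hypothetical placeholders):

n = size(X, 1);  d = 2;  epsi = 1;
K = @(u) (1 - u.^2) .* (u <= 1);               % assumed kernel; any K >= 0 works
O = cell(n, 1);
for i = 1:n
    dist = sqrt(sum((X - X(i, :)).^2, 2));
    nbr = find(dist <= epsi & dist > 0);
    Xi = (X(nbr, :) - X(i, :))';               % p x N_i matrix of centered neighbors
    Di = diag(sqrt(K(dist(nbr) / epsi)));
    [Ui, ~, ~] = svd(Xi * Di, 'econ');
    O{i} = Ui(:, 1:d);                         % approximate tangent basis at x_i
end
i = 1;  j = 2;                                 % an example pair of connected points
[U, ~, V] = svd(O{i}' * O{j});
rho_ij = U * V';                               % closest orthogonal matrix to O_i' * O_j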
We use ∥S̃^{2t}(i, j)∥^2_{HS} to measure the affinity between i and j. Thus,
∥S̃^{2t}(i, j)∥^2_{HS} = Tr(S̃^{2t}(i, j) S̃^{2t}(i, j)^T)
  = Σ_{k,l=1}^{nd} (λ_k λ_l)^{2t} Tr(v_k(i) v_k(j)^T v_l(j) v_l(i)^T)
  = Σ_{k,l=1}^{nd} (λ_k λ_l)^{2t} Tr(v_k(j)^T v_l(j) v_l(i)^T v_k(i))
  = Σ_{k,l=1}^{nd} (λ_k λ_l)^{2t} ⟨v_k(j), v_l(j)⟩ ⟨v_k(i), v_l(i)⟩.
The vector diffusion mapping is defined as
V_t : i ↦ ((λ_k λ_l)^t ⟨v_k(i), v_l(i)⟩)_{k,l=1}^{nd}.
and finance (FK18; FKL18), among others (Hub81). The presence of an unknown
contamination distribution poses both statistical and computational challenges to
the problem. The search for both statistically optimal and computationally feasible
procedures has become a fundamental problem in areas including statistics and
computer science.
Robust estimation of normal mean and covariance gives two paradigms in this
challenge.
where B̂ gives the Tukey's median (Tuk75).
(b) (Regression depth) When m = 1,
D_U(β, P) = inf_{u∈U} P( u^T X (y − β^T X) ≥ 0 ),
X Y^T | X ∼ N(X X^T B, σ^2 X X^T).
Here, G′ (t) is a subgradient of G at the point t. Moreover, the statement also holds
for strictly proper scoring rules when convex is replaced by strictly convex. Typical
examples of scoring rules are listed as follows that lead to many popular GANs.
(1) Log Score. The log score is perhaps the most commonly used rule because
of its various intriguing properties (JCVW15). The scoring rule with
S(t, 1) = log t and S(t, 0) = log(1 − t) is regular and strictly proper. Its
Savage representation is given by the convex function G(t) = t log t +
(1 − t) log(1 − t), which is interpreted as the negative Shannon entropy
of Bernoulli(t). It leads to the original GAN proposed by (GPAM+ 14)
that aims to minimize a variational lower bound of the Jensen-Shannon
divergence
JS(P, Q) = (1/2) ∫ log( dP / (dP + dQ) ) dP + (1/2) ∫ log( dQ / (dP + dQ) ) dQ + log 2.
(A short verification of this Savage representation for the log score is sketched after this list.)
(2) Zero-One Score. The zero-one score S(t, 1) = 2I{t ≥ 1/2} and S(t, 0) =
2I{t < 1/2} is also known as the misclassification loss. This is a regular
proper scoring rule but not strictly proper. It leads to the TV-GAN that
was extensively studied by (GLYZ19) in the context of robust estimation,
toward minimizing a variational lower bound of the total variation distance
TV(P, Q) = P( dP/dQ ≥ 1 ) − Q( dP/dQ ≥ 1 ) = (1/2) ∫ |dP − dQ|.
(3) Quadratic Score. Also known as the Brier score (Bri50), the definition is
given by S(t, 1) = −(1 − t)2 and S(t, 0) = −t2 . The corresponding convex
function in the Savage representation is given by G(t) = −t(1−t). It leads
to the family of least-squares GANs proposed by (MLX+ 17), minimizing
a variational lower bound of the following divergence function,
Δ(P, Q) = (1/8) ∫ (dP − dQ)^2 / (dP + dQ),
known as the triangular discrimination.
(4) Boosting Score. The boosting score was introduced by (BSS05) with S(t, 1) = −((1 − t)/t)^{1/2} and S(t, 0) = −(t/(1 − t))^{1/2}, and has a connection to the AdaBoost algorithm. The corresponding convex function in the Savage representation is given by G(t) = −2√(t(1 − t)). It leads to a GAN
toward minimizing a variational lower bound of the squared Hellinger dis-
tance
H^2(P, Q) = (1/2) ∫ (√dP − √dQ)^2.
(5) Beta Score. A general Beta family of proper scoring rules was introduced by (BSS05) with S(t, 1) = −∫_t^1 c^{α−1} (1 − c)^β dc and S(t, 0) = −∫_0^t c^α (1 − c)^{β−1} dc for any α, β > −1. The log score, the quadratic score and the boosting score are special cases of the Beta score with α = β = 0, α = β = 1, and α = β = −1/2, respectively. The zero-one score is a limiting case of the Beta score obtained by letting α = β → ∞. Moreover, it also leads to asymmetric scoring rules with α ≠ β. They lead to (α, β)-GANs in the sequel.
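As a sanity check (a sketch under the standard Savage representation S(t, 1) = G(t) + (1 − t)G'(t), S(t, 0) = G(t) − tG'(t), which is assumed here rather than quoted from the text), the log score in item (1) is recovered from G(t) = t log t + (1 − t) log(1 − t):
G'(t) = log(t/(1 − t)),
G(t) + (1 − t)G'(t) = t log t + (1 − t) log(1 − t) + (1 − t) log(t/(1 − t)) = log t,
G(t) − t G'(t) = t log t + (1 − t) log(1 − t) − t log(t/(1 − t)) = log(1 − t),
matching S(t, 1) = log t and S(t, 0) = log(1 − t).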
Now we introduce a general discriminator class of deep neural nets. We first define a sigmoid first layer
G_sigmoid = { g(x) = sigmoid(u^T x + b) : u ∈ R^p, b ∈ R }.
Note that the neighboring two layers are connected via ReLU activation functions.
Finally, the network structure is defined by
(192) T^L(κ, B) = { T(x) = sigmoid( Σ_{j≥1} w_j g_j(x) ) : Σ_{j≥1} |w_j| ≤ κ, g_j ∈ G^L(B) }.
results to smooth ones. The condition 2 G^{(2)}(1/2) ≥ G^{(3)}(1/2) + c_0 is automatically satisfied by a symmetric scoring rule, because S(t, 1) = S(1 − t, 0) immediately implies that G^{(3)}(1/2) = 0. For the Beta score with S(t, 1) = −∫_t^1 c^{α−1} (1 − c)^β dc and S(t, 0) = −∫_0^t c^α (1 − c)^{β−1} dc for any α, β > −1, it is easy to check that such a c_0 (only depending on α, β) exists as long as |α − β| < 1. The following proposition shows the statistical optimality of our proposal.
Such an error rate is the same as that of multi-task regression depth (Gao17) and statistically optimal. After obtaining θ̂ and Σ̂, if Σ_X is known or easy to estimate, we can obtain the estimators B̂ = Σ_X^{-1} θ̂ and σ̂^2 from Σ_X^{-1} Σ̂; otherwise, if X_i is also contaminated, we can exploit the same technique in (GYZ20) for an optimal estimate Σ̂_X.
8.2.1.4. Generalization to Elliptical Distributions. Robust estimators induced
by GANs can adapt to general elliptical distributions as depth-based estimator
(188) (CGR18).
To achieve this goal, we further require that H belongs to the following class
H(M') = { H ∈ H : ∫_{1/4}^{1/3} dH(t) ≥ 1/M' },
where the number M ′ > 0 is assumed to be some large constant. The regularity
condition H ∈ H(M ′ ) will be easily satisfied as long as there is a constant proba-
bility mass of H contained in the interval [1/4, 1/3]. This condition prevents some
of the probability mass from escaping to infinity.
Define the estimator
(194) (θ̂, Σ̂, Ĥ) = argmin_{η∈R^p, Γ∈E_p(M), H∈H(M')} max_{T∈T} [ (1/n) Σ_{i=1}^n S(T(X_i), 1) + E_{X∼E(η,Γ,H)} S(T(X), 0) ].
To accommodate the more general generator class in (194), we consider the discriminator class T̄^L(κ, B), which has the same definition as (192), except that
G^1(B) = G_ramp = { g(x) = ramp(u^T x + b) : u ∈ R^p, b ∈ R }.
In other words, T̄^L(κ, B) and T^L(κ, B) only differ in the choice of the nonlinear activation function of the first layer. We remark that the discriminator class T^L(κ, B) also works for the elliptical distributions, but the theory would require a condition that is less transparent. The theoretical guarantee of the estimator (194) is given by the following proposition.
Proposition 8.2.2. Consider the estimator (194) that is induced by a regular proper scoring rule that satisfies Condition 1. The discriminator class is specified by T = T̄^L(κ, B) with the dimension of (w_j) to be at least 2. Assume p/n + ϵ^2 ≤ c for some sufficiently small constant c > 0. Set 2 ≤ L = O(1), 1 ≤ B = O(1), and κ = O(√(p/n) + ϵ). Then, under the data generating process (193), we have
∥θ̂ − θ∥^2_{ℓ_2} ≤ C (p/n ∨ ϵ^2),
∥Σ̂ − Σ∥^2_{op} ≤ C (p/n ∨ ϵ^2),
with probability at least 1 − e^{−C'(p + nϵ^2)} uniformly over all θ ∈ R^p, all ∥Σ∥_{op} ≤ M = O(1), and all H ∈ H(M') with M' = O(1). The constants C, C' > 0 are universal.
Figure 5. Scale ϵ1 : β0 = 1, β1 = 3
Example 9.1.4 (Strong Witness Complex). Let V = {tα ∈ X}. Define Wϵs =
{UI ⊆ V : ∃x ∈ X, ∀α ∈ I, d(x, tα ) ≤ d(x, V ) + ϵ}.
Example 9.1.5 (Weak Witness Complex). Let V = {t_α ∈ X}. Define W^w_ϵ = {U_I ⊆ V : ∃x ∈ X, ∀α ∈ I, d(x, t_α) ≤ d(x, V_{−I}) + ϵ}.
• V can be a set of landmarks, much smaller than X
• Monotonicity: Wϵ∗ ⊆ Wϵ∗′ if ϵ ≤ ϵ′
• But not easy to control homotopy types between W ∗ and X
Example 9.1.6 (Author collaboration complex). Let V consist of n authors.
Σ collects subsets of authors σ ∈ Σ if they join the same paper. Then Σ becomes
a simplicial complex. See Table 1 for an illustration and Figure 7. (LK10) gives a
similar term-document co-occurrence complex.
Figure 7. A simplicial complex of the ten authors in seven papers example. In the complex, there are four 2-simplices (triangles) of collaboration: {Jordan, Blei, Ng}, {Jordan, Weiss, Ng}, {Lafferty, McCallum, Pereira}, and {Koller, Bach, Friedman}; three 1-simplices (edges) other than the eleven faces of triangles: {Blei, Lafferty}, {Koller, Ng}, and {Bach, Jordan}; and ten 0-simplices (nodes). So the total number of faces is f_0 = 10, f_1 = 14, f_2 = 4. Euler curvature: Jordan (−1/3 = 1 − 4/2 + 2/3), Blei (−1/3 = 1 − 3/2 + 1/3), Ng (−1/3 = 1 − 4/2 + 2/3), Weiss (1/3 = 1 − 2/2 + 1/3), Koller (−1/3 = 1 − 3/2 + 1/3), Bach (−1/3 = 1 − 3/2 + 1/3), Friedman (1/3 = 1 − 2/2 + 1/3), Lafferty (−1/3 = 1 − 3/2 + 1/3), Pereira (1/3 = 1 − 2/2 + 1/3), McCallum (1/3 = 1 − 2/2 + 1/3).

Figure 8. Illustration of Game Strategic Complex: Battle of the Sexes. (Panels from the source figure: (a) Battle of the sexes; (b) Modified battle of the sexes.)

9.2. Betti Numbers

9.3. Consistency and Sample Complexity of Čech Complexes (Niyogi-Smale-Weinberger Theorem)

9.4. Lab and Further Studies

Henry Adams maintains a collection of applied topology software:
https://www.math.colostate.edu/~adams/advising/appliedTopologySoftware/
CHAPTER 10
Persistent Homology
Figure 1. Cluster trees: Average, complete, and single linkage. From Intro-
duction to Statistical Learning with Applications in R.
>> load_javaplex
Installation is complete. Confirm that Javaplex is working properly with the following command.
>> api.Plex4.createExplicitSimplexStream()
ans = edu.stanford.math.plex4.streams.impl.ExplicitSimplexStream@16966ef
Your output should be the same except for the last several characters. Each time upon starting a new Matlab session, you will need to run load_javaplex.m.
Now conduct the following numerical experiment with the example shown in
class:
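The specific in-class example is not reproduced here; as a hypothetical illustration only, the following commands build a small explicit filtration in Javaplex (a filtered triangle with 7 simplices; the stream used below has 11):

>> stream = api.Plex4.createExplicitSimplexStream();
>> stream.addVertex(0, 0);                % vertices enter at filtration value 0
>> stream.addVertex(1, 0);
>> stream.addVertex(2, 1);
>> stream.addElement([0, 1], 1);          % edges enter later
>> stream.addElement([0, 2], 2);
>> stream.addElement([1, 2], 3);
>> stream.addElement([0, 1, 2], 4);       % the filled triangle enters last
>> stream.finalizeStream();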
Figure 12. Origins of H1N1 2009 pandemic virus. Using phylogenetic trees, the
history of the HA gene of the 2009 H1N1 pandemic virus was reconstructed. It was related
to viruses that circulated in pigs potentially since the 1918 H1N1 pandemic. These viruses
had diverged since that date into various independent strains, infecting humans and swine.
Major reassortments between strains led to new sets of segments from different sources.
In 1998, triple reassortant viruses were found infecting pigs in North America. These
triple reassortant viruses contained segments that were circulating in swine, humans and
birds. Further reassortment of these viruses with other swine viruses created the ancestors
of this pandemic. Until this day, it is unclear how, where or when these reassortments
happened. Source: [506]. From New England Journal of Medicine, Vladimir Trifonov, Hossein Khiabanian, and Raúl Rabadán, Geographic dependence, surveillance, and origins of the 2009 influenza A (H1N1) virus, 361.2, 115–119.
where you can check the number of simplices in the filtration (stream)
is 11
>> num_simplices = stream . getSize ()
Figure 13. In case of vanishing higher dimensional homology, zero dimensional homology generates trees. When applied to only one gene of influenza A, in this case hemagglutinin, the only significant homology occurs in dimension zero (panel A). The barcode represents a summary of a clustering procedure (panel B), that recapitulates the known phylogenetic relation between different hemagglutinin types (panel C). Source: [100]. From Joseph Minhow Chan, Gunnar Carlsson, and Raúl Rabadán, 'Topology of viral evolution', Proceedings of the National Academy of Sciences 110.46 (2013): 18566–18571.
num_simplices = 11
(2) Compute the persistent homology for the filtration and plot the barcode
as Figure 16 (b).
>> % Compute the Z/2Z persistent homology of dimension less than 3:
>> persistence = api.Plex4.getModularSimplicialAlgorithm(3, 2);
>> intervals = persistence.computeIntervals(stream);
>> options.filename = 'Persistent-Betti-Numbers';
>> options.max_filtration_value = 11;
>> % Plot the barcode of persistent Betti numbers:
>> plot_barcodes(intervals, options);
Note: it extends Reeb Graph from R to general topological space Z; may lead
to a particular implementation of Nerve theorem through filter map h.
Reference Mapping: Typical one dimensional filters/mappings:
• Density estimators
• Measures of data (ec-)centrality: e.g. Σ_{x'∈X} d(x, x')^p
• Geometric embeddings: PCA/MDS, Manifold learning, Diffusion Maps
etc.
• Response variable in statistics: progression stage of disease etc.
Figure 8. A patient with two focal glioblastomas, on the left and right hemispheres.
After surgery and standard treatment, the tumor reappeared on the left side. Genomic
analysis shows that the initial tumors were seeded by two independent, but related clones.
The recurrent tumor was genetically similar to the left one. Jin-Ku Lee et al. Nature
Genetics 49.4 (2017): 594-599.
CHAPTER 12
*Euler Calculus
(δ_k u)(i_0, . . . , i_{k+1}) = Σ_{j=0}^{k+1} (−1)^{j+1} u(i_0, . . . , i_{j−1}, i_{j+1}, . . . , i_{k+1})
Vote-for-top-1: (1, 0, . . . , 0)
Vote-for-top-2: (1, 1, 0, . . . , 0)
• II. Pairwise rules: convert the voting profile, a (distribution) function on the n!-element set S_n, into a paired comparison matrix X ∈ R^{n×n}, where X(i, j) is the number (distribution) of voters with i ≻ j; define the social order based on the paired comparison data X.
Kemeny Optimization: minimizes the number of pairwise mismatches to X over S_n (NP-hard)
Plurality: the number of wins in paired comparisons (tournaments)
– equivalent to Borda count in complete Round-Robin tournaments
Let’s apply these rules to the three candidate example:
• Position:
s < 1/2, C wins
s = 1/2, ties
s > 1/2, A, B wins
• Pairwise:
A, B: 13 wins
C: 14 wins
Condorcet winner: C
so completely in chaos!
dimension: n! − 2n−1 (n − 2) − 2
• Borda profile: all ranking methods give the same result
dimension: n − 1
basis: {1(σ(1) = i, ∗) − 1(∗, σ(n) = i) : i = 1, . . . , n}
• Condorcet profile: all positional rules give the same result
dimension: (n−1)!/2
basis: sum of the Z_n orbit of σ minus their reversals
• Departure profile: all pairwise rules give the same result
• So, if you look for a best possibility from impossibility, Borda count is
perhaps the choice
• Borda Count is the projection onto the Borda Profile subspace
Borda Count is equivalent to
min_{β∈R^{|V|}} Σ_{α, {i,j}∈E} ω^α_{ij} (β_i − β_j − Y^α_{ij})^2,
where
• e.g. Y^α_{ij} = 1 if i ⪰ j by voter α, and Y^α_{ij} = −1 in the opposite case.
• Note: the NP-hard (n > 3) Kemeny Optimization, or Minimum-Feedback-Arc-Set, is
min_{s∈R^{|V|}} Σ_{α, {i,j}∈E} ω^α_{ij} (sign(s_i − s_j) − Ŷ^α_{ij})^2.
The least squares problem for the Borda count above is equivalent to the aggregated form
min_{x∈R^{|V|}} Σ_{{i,j}∈E} ω_{ij} ((x_i − x_j) − ŷ_{ij})^2,
where ŷ_{ij} = Ê_α y^α_{ij} = (Σ_α ω^α_{ij} y^α_{ij}) / ω_{ij} = −ŷ_{ji} and ω_{ij} = Σ_α ω^α_{ij}. So ŷ ∈ ℓ^2_ω(E), the inner product space with ⟨u, v⟩_ω = Σ u_{ij} v_{ij} ω_{ij} for skew-symmetric u, v.
Statistical Majority Voting: ℓ^2(E)
• ŷ_{ij} = (Σ_α ω^α_{ij} y^α_{ij}) / (Σ_α ω^α_{ij}) = −ŷ_{ji}, ω_{ij} = Σ_α ω^α_{ij}
• ŷ from generalized linear models:
[1] Uniform model: ŷ_{ij} = 2π̂_{ij} − 1.
[2] Bradley-Terry model: ŷ_{ij} = log(π̂_{ij}/(1 − π̂_{ij})).
[3] Thurstone-Mosteller model: ŷ_{ij} = Φ^{-1}(π̂_{ij}), where Φ(x) is the Gaussian CDF.
[4] Angular transform model: ŷ_{ij} = arcsin(2π̂_{ij} − 1).
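A minimal MATLAB sketch of these links, given a hypothetical matrix pihat of empirical preference probabilities (pihat(i, j) is the fraction of comparisons on {i, j} preferring i):

yhat_uniform = 2*pihat - 1;                       % [1] uniform model
yhat_bt      = log(pihat ./ (1 - pihat));         % [2] Bradley-Terry
yhat_tm      = sqrt(2) * erfinv(2*pihat - 1);     % [3] Thurstone-Mosteller, Phi^{-1}(p)
yhat_ang     = asin(2*pihat - 1);                 % [4] angular transform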
incomplete data is the pairwise comparison experiment, in which all partial orders
can be reduced.
14.2.1. HodgeRank on Graphs. Let Λ = {1, ..., m} be a set of participants and V = {1, ..., n} be the set of videos to be ranked. Paired comparison data is collected as a function on Λ × V × V, which is skew-symmetric for each participant α, i.e., Y^α_{ij} = −Y^α_{ji}, representing the degree that α prefers i to j. The simplest setting is the binary choice, where
Y^α_{ij} = 1 if α prefers i to j, and −1 otherwise.
In general, Y^α_{ij} can be used to represent paired comparison grades, e.g., Y^α_{ij} > 0 refers to the degree that α prefers i to j and, vice versa, Y^α_{ji} = −Y^α_{ij} < 0 measures the degree of dispreference (JLYY11).
In this paper we shall focus on the binary choice, which is the simplest setting
and the data collected in this paper belongs to this case. However the theory can
be applied to the more general case with multiple choices above.
Such paired comparison data can be represented by a directed graph, or hypergraph, with n nodes, where each directed edge between i and j refers to the preference indicated by Y^α_{ij}.
A nonnegative weight function ω : Λ × V × V → [0, ∞) is defined as
(196) ω^α_{ij} = 1 if α makes a comparison for {i, j}, and 0 otherwise.
It may reflect the confidence level that a participant compares {i, j} by taking
different values, and this is however not pursued in this paper.
Our statistical rank aggregation problem is to look for some global ranking
score s : V → R such that
(197) min_{s∈R^{|V|}} Σ_{i,j,α} ω^α_{ij} (s_i − s_j − Y^α_{ij})^2,
The decomposition above is orthogonal under the following inner product on R^{|E|}: ⟨u, v⟩_ω = Σ_{{i,j}∈E} ω_{ij} u_{ij} v_{ij}.
Note B ◦ A = 0 since
(B ◦ Ax)(i, j, k) = (xi − xj ) + (xj − xk ) + (xk − xi ) = 0.
Hence
AT ŷ = AT (Ax + B T z + w) = AT Ax ⇒ x = (AT A)† AT ŷ
B ŷ = B(Ax + B T z + w) = BB T z ⇒ z = (BB T )† B ŷ
AT w = Bw = 0 ⇒ w ∈ ker(∆1 ), ∆1 = AAT + B T B.
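A minimal MATLAB sketch of the global rating step (a weighted least squares as in (197)) and the cyclic residual; the edge list E, weights w, and aggregated flow yhat are placeholders, and splitting the residual further into curl and harmonic parts requires the triangle operator B as above:

nV = max(E(:));  m = size(E, 1);
A = sparse([(1:m)'; (1:m)'], [E(:, 1); E(:, 2)], [ones(m, 1); -ones(m, 1)], m, nV);
% (A*x) on edge e equals x(E(e,1)) - x(E(e,2))
Wd = spdiags(w, 0, m, m);
x = pinv(full(A' * Wd * A)) * (A' * Wd * yhat);   % global ranking score
ygrad = A * x;                                     % gradient (consistent) component
yres  = yhat - ygrad;                              % curl + harmonic residual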
sum (inflow minus outflow) to be zero on each node of G. These two conditions
characterize a linear subspace which is called harmonic flows.
3. The residue Ŷ^c actually satisfies (213) but not (212). In fact, it measures the amount of intrinsic (local) inconsistency in Ŷ characterized by the triangular trace. We often call this component curl flow. In particular, the following relative curl,
(216) curl^r_{ijk} = |Ŷ_{ij} + Ŷ_{jk} + Ŷ_{ki}| / (|Ŷ_{ij}| + |Ŷ_{jk}| + |Ŷ_{ki}|) = |Ŷ^c_{ij} + Ŷ^c_{jk} + Ŷ^c_{ki}| / (|Ŷ_{ij}| + |Ŷ_{jk}| + |Ŷ_{ki}|) ∈ [0, 1],
can be used to characterize triangular intransitivity; curlrijk = 1 iff {i, j, k} contains
an intransitive triangle of Ŷ . Note that computing the percentage of curlrijk = 1
is equivalent to calculating the Transitivity Satisfaction Rate (TSR) in complete
graphs.
Figure 3 illustrates the Hodge decomposition for paired comparison flows and
Algorithm 14 shows how to compute global ranking and other components. The
readers may refer to (JLYY11) for the detail of theoretical development. Below we
just make a few comments on the application of HodgeRank in our setting.
decomposition, ∥Ŷ h ∥2ω /∥Ŷ ∥2ω and ∥Ŷ c ∥2ω /∥Ŷ ∥2ω provide percentages of global and
local inconsistencies, respectively.
3. A nontrivial harmonic component Ŷ h ̸= 0 implies the fixed tournament issue,
i.e., for any candidate i ∈ V , there is a paired comparison design by removing some
of the edges in G = (V, E) such that i is the overall winner.
4. One can control the harmonic component by controlling the topology of
clique complex χ(G). In a loop-free clique complex χ(G) where β1 = 0, harmonic
component vanishes. In this case, there are no cycles which traverse all the nodes,
e.g., 1 ≻ 2 ≻ 3 ≻ 4 ≻ . . . ≻ n ≻ 1. All the inconsistency will be summarized in
those triangular cycles, e.g., i ≻ j ≻ k ≻ i.
Theorem 2. The linear space of harmonic flows has the dimension equal to
β1 , i.e., the number of independent loops in clique complex χ(G), which is called
the first order Betti number.
Condorcet Profile splits into Local vs. Global Cycles: the residues ŷ^{(c)} = B^T z and ŷ^{(h)} = w are cyclic rankings, accounting for conflicts of interests:
• ŷ^{(c)}, the local/triangular inconsistency, triangular curls (Z_3-invariant):
ŷ^{(c)}_{ij} + ŷ^{(c)}_{jk} + ŷ^{(c)}_{ki} ≠ 0, {i, j, k} ∈ T
Fortunately, with the aid of some random sampling principles, it is not hard to
obtain graphs whose β1 are zero.
These theories imply that when p is large enough, Erdös-Rényi random graph
will meet the two conditions above with high probability. In particular, almost
linear O(n log n) edges suffice to derive a global ranking, and with O(n3/2 ) edges
harmonic-free condition is met.
Despite such an asymptotic theory for large random graphs, it remains a question how to ensure that a given graph instance satisfies the two conditions. Fortunately, the recent development in computational topology provides us with such a tool, persistent homology, which will be illustrated in Section ??.
• V = {(x_1, . . . , x_n) =: (x_i, x_{−i})} = Π_{i=1}^n S_i, an n-person game;
• undirected edge: {(x_i, x_{−i}), (x'_i, x_{−i})} ∈ E;
• each player has utility function u_i(x_i, x_{−i});
• Edge flow (1-form): u_i(x_i, x_{−i}) − u_i(x'_i, x_{−i})
Σ_{x_{−i}} π(x_i, x_{−i}) (u_i(x_i, x_{−i}) − u_i(x'_i, x_{−i})) ≥ 0,
Bibliography
[ABET00] Nina Amenta, Marshall Bern, David Eppstein, and S-H Teng, Regres-
sion depth and center points, Discrete & Computational Geometry 23
(2000), no. 3, 305–323. 188, 189, 190
[Ach03] Dimitris Achlioptas, Database-friendly random projections: Johnson-
lindenstrauss with binary coins, Journal of Computer and System Sci-
ences 66 (2003), 671–687. 59
[Ali95] F. Alizadeh, Interior point methods in semidefinite programming with
applications to combinatorial optimization, SIAM J. Optim. 5 (1995),
no. 1, 13–51. 80, 90
[Aro50] N. Aronszajn, Theory of reproducing kernels, Transactions of the
American Mathematical Society 68 (1950), no. 3, 337–404. 15, 17
[Arr63] Kenneth J. Arrow, Social choice and individual values, 2nd ed., Yale
University Press, New Haven, CT, 1963. 233
[Bav11] Francois Bavaud, On the schoenberg transformations in data analysis:
Theory and illustrations, Journal of Classification 28 (2011), no. 3,
297–314. 9, 16
[BDDW08] Richard Baraniuk, Mark Davenport, Ronald DeVore, and Michael
Wakin, A simple proof of the restricted isometry property for random
matrices, Constructive Approximation 28 (2008), no. 3, 253–263. 64,
70
[BE92] Andreas Buja and Nermin Eyuboglu, Remarks on parallel analysis,
Multivariate Behavioral Research 27 (1992), no. 4, 509–540. 47, 56
[BFOS07] M. Burger, K. Frick, S. Osher, and O. Scherzer, Inverse total variation
flow, SIAM Multiscale Model. Simul. 6 (2007), no. 2, 366–395. 71
[BG10] Y. Baryshnikov and Robert Ghrist, Euler integration over definable
functions, PNAS 107 (2010), no. 21, 9525–9530. 227
[BGOX06] Martin Burger, Guy Gilboa, Stanley Osher, and Jinjun Xu, Non-
linear inverse scale space methods, Communications in Mathematical
Sciences 4 (2006), no. 1, 179–212. 71, 74
[BLT+ 06] P. Biswas, T.-C. Liang, K.-C. Toh, T.-C. Wang, and Y. Ye, Semi-
definite programming approaches for sensor network localization with
noisy distance measurements, IEEE Transactions on Automation Sci-
ence and Engineering 3 (2006), 360–371. 88
[BN01] Mikhail Belkin and Partha Niyogi, Laplacian eigenmaps and spectral
techniques for embedding and clustering, Advances in Neural Informa-
tion Processing Systems (NIPS) 14, MIT Press, 2001, pp. 585–591.
116
[BN03] Mikhail Belkin and Partha Niyogi, Laplacian eigenmaps for dimen-
sionality reduction and data representation, Neural Computation 15
4680–4688. 68, 75
[CWX10] Tony Cai, Lie Wang, and Guangwu Xu, Stable recovery of sparse sig-
nals and an oracle inequality, IEEE Transactions on Information The-
ory 56 (2010), no. 7, 3516–3522. 75
[CXZ09] Tony Cai, Guangwu Xu, and Jun Zhang, On recovery of sparse signals
via l1 minimization, IEEE Transactions on Information Theory 55
(2009), no. 7, 3588–3397. 68
[Dav88] H. David, The methods of paired comparisons, 2nd ed., Griffin’s Sta-
tistical Monographs and Courses, 41, Oxford University Press, New
York, NY, 1988. 239
[Daw07] A Philip Dawid, The geometry of proper scoring rules, Annals of the
Institute of Statistical Mathematics 59 (2007), no. 1, 77–93. 190
[DBS17] Simon S Du, Sivaraman Balakrishnan, and Aarti Singh, Computation-
ally efficient robust estimation of sparse functionals, arXiv preprint
arXiv:1702.07709 (2017). 188
[DG03a] Sanjoy Dasgupta and Anupam Gupta, An elementary proof of a theo-
rem of johnson and lindenstrauss, Random Structures and Algorithms
22 (2003), no. 1, 60–65. 59
[DG03b] David L. Donoho and Carrie Grimes, Hessian eigenmaps: Locally lin-
ear embedding techniques for high-dimensional data, Proceedings of
the National Academy of Sciences of the United States of America
100 (2003), no. 10, 5591–5596. 111, 113
[dGJL07] Alexandre d’Aspremont, Laurent El Ghaoui, Michael I. Jordan, and
Gert R. G. Lanckriet, A direct formulation for sparse pca using
semidefinite programming, SIAM Review 49 (2007), no. 3, http:
//arxiv.org/abs/cs/0406021. 86
[DH01] David L. Donoho and Xiaoming Huo, Uncertainty principles and ideal
atomic decomposition, IEEE Transactions on Information Theory 47
(2001), no. 7, 2845–2862. 67
[DKK+ 16] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur
Moitra, and Alistair Stewart, Robust estimators in high dimensions
without the computational intractability, Foundations of Computer
Science (FOCS), 2016 IEEE 57th Annual Symposium on, IEEE, 2016,
pp. 655–664. 188
[DKK+ 17] , Being robust (in high dimensions) can be practical, arXiv
preprint arXiv:1703.00893 (2017). 188
[DKK+ 18] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Jacob
Steinhardt, and Alistair Stewart, Sever: A robust meta-algorithm for
stochastic optimization, arXiv preprint arXiv:1803.02815 (2018). 188
[DKS16] Ilias Diakonikolas, Daniel Kane, and Alistair Stewart, Robust learning
of fixed-structure bayesian networks, arXiv preprint arXiv:1606.07384
(2016). 188
[DKS18a] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart, List-
decodable robust mean estimation and learning mixtures of spherical
gaussians, Proceedings of the 50th Annual ACM SIGACT Symposium
on Theory of Computing, ACM, 2018, pp. 1047–1060. 188
[DKS18b] Ilias Diakonikolas, Weihao Kong, and Alistair Stewart, Efficient algo-
rithms and lower bounds for robust linear regression, arXiv preprint
[GPAM+ 14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio,
Generative adversarial nets, Advances in neural information process-
ing systems, 2014, pp. 2672–2680. 190, 191
[GR07] Tilmann Gneiting and Adrian E Raftery, Strictly proper scoring rules,
prediction, and estimation, Journal of the American Statistical Asso-
ciation 102 (2007), no. 477, 359–378. 190
[Gro11] David Gross, Recovering low-rank matrices from few coefficients in
any basis, IEEE Transaction on Information Theory 57 (2011), 1548,
arXiv:0910.1879. 85
[GUHY20] Hanlin Gu, Ilona Christy Unarta, Xuhui Huang, and Yuan Yao, Ro-
bust autoencoder gan for cryo-em image denoising, arXiv preprint
arXiv:2008.07307 (2020). 187
[GYZ20] Chao Gao, Yuan Yao, and Weizhi Zhu, Generative adversarial nets for
robust scatter estimation: A proper scoring rule perspective, Journal
of Machine Learning Research 21 (2020), 1 – 48, arXiv:1903.01944.
187, 190, 193
[HAvL05] M. Hein, J. Audibert, and U. von Luxburg, From graphs to manifolds:
weak and strong pointwise consistency of graph laplacians, COLT,
2005. 175
[Hor65] John L. Horn, A rationale and test for the number of factors in factor
analysis, Psychometrika 30 (1965), no. 2, 179–185. 47
[Hot33] Harold Hotelling, Analysis of a complex of statistical variables into
principal components, Journal of Educational Psychology 24 (1933),
417–441 and 498–520. 5
[HS78] Richard Paul Halmos and Viakalathur Shankar Sunder, Bounded in-
tegral operators in l2 spaces, Vol. 96 of Ergebnisse der Mathematik
und ihrer Grenzgebiete (Results in Mathematics and Related Areas),
Springer-Verlag, Berlin, 1978. 17
[HS89] Trevor Hastie and Werner Stuetzle, Principal curves, Journal of the
American Statistical Association 84 (1989), no. 406, 502–516. 114
[HSXY16] Chendi Huang, Xinwei Sun, Jiechao Xiong, and Yuan Yao, Split lbi:
An iterative regularization path with structural sparsity, Advances
in Neural Information Processing Systems (NIPS) 29 (D. D. Lee,
M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, eds.), 2016,
pp. 3369–3377. 72
[HSXY20] , Boosting with structural sparsity: A differential inclusion ap-
proach, Applied and Computational Harmonic Analysis 48 (2020),
no. 1, 1–45, arXiv preprint arXiv:1704.04833. 72
[HTF01] Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The elements
of statistical learning, Springer, 2001. 51
[Hub64] Peter J Huber, Robust estimation of a location parameter, The annals
of mathematical statistics 35 (1964), no. 1, 73–101. 187
[Hub65] , A robust version of the probability ratio test, The Annals of
Mathematical Statistics 36 (1965), no. 6, 1753–1758. 187
[Hub81] P. J. Huber, Robust statistics, New York: Wiley, 1981. 83, 188
[HY18] Chendi Huang and Yuan Yao, A unified dynamic approach to sparse
model selection, The 21st International Conference on Artificial Intel-
ligence and Statistics (AISTATS) (Lanzarote, Spain), 2018. 72
[JCVW15] Jiantao Jiao, Thomas A Courtade, Kartik Venkat, and Tsachy Weiss-
man, Justification of logarithmic loss via the benefit of side informa-
tion, IEEE Transactions on Information Theory 61 (2015), no. 10,
5357–5365. 191
[JL84] W. B. Johnson and J. Lindenstrauss, Extensions of lipschitz maps into
a hilbert space, Contemp Math 26 (1984), 189–206. 59
[JLYY11] Xiaoye Jiang, Lek-Heng Lim, Yuan Yao, and Yinyu Ye, Statistical
ranking and combinatorial hodge theory, Mathematical Programming
127 (2011), no. 1, 203–244, arXiv:0811.1067 [stat.ML]. 202, 238,
240, 241, 242
[Joh06] I. Johnstone, High dimensional statistical inference and random ma-
trices, Proc. International Congress of Mathematicians, 2006. 27, 42
[JYLG12] Xiaoye Jiang, Yuan Yao, Han Liu, and Leo Guibas, Detecting network
cliques with radon basis pursuit, The Fifteenth International Confer-
ence on Artificial Intelligence and Statistics (AISTATS) (La Palma,
Canary Islands), April 21-23 2012. 66
[Kah09] Matthew Kahle, Topology of random clique complexes, Discrete Math-
ematics 309 (2009), 1658–1671. 244
[Kah13] , Sharp vanishing thresholds for cohomology of random flag
complexes, Annals of Mathematics (2013), arXiv:1207.0149. 244
[Kle99] Jon Kleinberg, Authoritative sources in a hyperlinked environment,
Journal of the ACM 46 (1999), no. 5, 604–632. 126
[KMM04] T. Kaczynski, K. Mischaikow, and M. Mrozek, Computational homol-
ogy, Springer, New York, 2004. 200
[KMP10] Ioannis Koutis, G. Miller, and Richard Peng, Approaching optimality
for solving sdd systems, FOCS ’10 51st Annual IEEE Symposium on
Foundations of Computer Science, 2010, pp. 235–244. 242
[KN08] S. Kritchman and B. Nadler, Determining the number of components
in a factor model from limited noisy data, Chemometrics and Intelli-
gent Laboratory Systems 94 (2008), 19–32. 41
[KSS18] Pravesh K Kothari, Jacob Steinhardt, and David Steurer, Robust mo-
ment estimation and improved clustering via sum of squares, Pro-
ceedings of the 50th Annual ACM SIGACT Symposium on Theory of
Computing, ACM, 2018, pp. 1035–1046. 188
[Li91] Ker-Chau Li, Sliced inverse regression for dimension reduction, Jour-
nal of the American Statistical Association 86 (1991), no. 414, 316–
327. 51, 52
[LK10] Dandan Li and Chung-Ping Kwong, Understanding latent semantic
indexing: A topological structure analysis using q-analysis, J. Am.
Soc. Inf. Sci. Technol. 61 (2010), no. 3, 592–608. 202
[LL11] Jian Li and Tiejun Li, Probabilistic framework for network partition,
Phys. A 390 (2011), 3579. 150
[LLE09] Tiejun Li, Jian Liu, and Weinan E, Probabilistic framework for net-
work partition, Phys. Rev. E 80 (2009), 026106. 150
[LM06] Amy N. Langville and Carl D. Meyer, Google’s pagerank and beyond:
The science of search engine rankings, Princeton University Press,
2006. 125
[LRV16] Kevin A Lai, Anup B Rao, and Santosh Vempala, Agnostic estimation
of mean and covariance, Foundations of Computer Science (FOCS),
2016 IEEE 57th Annual Symposium on, IEEE, 2016, pp. 665–674. 188
[LST13] Jason D Lee, Yuekai Sun, and Jonathan E Taylor, On model selection
consistency of penalized m-estimators: a geometric theory, Advances
in Neural Information Processing Systems (NIPS) 26, 2013, pp. 342–
350. 72
[LZ10] Yanhua Li and Zhili Zhang, Random walks on digraphs, the general-
ized digraph laplacian, and the degree of asymmetry, Algorithms and
Models for the Web-Graph, Lecture Notes in Computer Science, vol.
6516, 2010, pp. 74–85. 136
[LZ11] Gilad Lerman and Teng Zhang, Robust recovery of multiple subspaces
by geometric lp minimization, Annals of Statistics 39 (2011), no. 5,
2686–2715. 83
[Mey00] Carl D. Meyer, Matrix analysis and applied linear algebra, SIAM,
2000. 127
[Miz02] Ivan Mizera, On depth and deep points: a calculus, The Annals of
Statistics 30 (2002), no. 6, 1681–1736. 188, 189
[MLX+ 17] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang,
and Stephen Paul Smolley, Least squares generative adversarial net-
works, Computer Vision (ICCV), 2017 IEEE International Conference
on, IEEE, 2017, pp. 2813–2821. 192
[MM04] Ivan Mizera and Christine H Müller, Location–scale depth, Journal of
the American Statistical Association 99 (2004), no. 468, 949–966. 188
[MNY06] Ha Quang Minh, Partha Niyogi, and Yuan Yao, Mercer's theorem, feature maps, and smoothing, Proc. of Computational Learning Theory (COLT), vol. 4005, 2006, pp. 154–168. 18
[MSVE09] Philipp Metzner, Christof Schütte, and Eric Vanden-Eijnden, Transi-
tion path theory for markov jump processes, Multiscale Model. Simul.
7 (2009), 1192. 152, 154
[MZ93] S. G. Mallat and Z. Zhang, Matching pursuits with time-frequency dic-
tionaries, IEEE Transactions on Signal Processing 41 (1993), no. 12,
3397–3415. 65
[NBG10] R. R. Nadakuditi and F. Benaych-Georges, The breakdown point of
signal subspace estimation, IEEE Sensor Array and Multichannel Sig-
nal Processing Workshop (2010), 177–180. 42
[NCT16] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka, f-gan: Train-
ing generative neural samplers using variational divergence mini-
mization, Advances in Neural Information Processing Systems, 2016,
pp. 271–279. 190
[Noe60] G. Noether, Remarks about a paired comparison model, Psychometrika
25 (1960), 357–367. 239
[NSVE+ 09] Frank Noé, Christof Schütte, Eric Vanden-Eijnden, Lothar Reich,
and Thomas R. Weikl, Constructing the equilibrium ensemble of fold-
ing pathways from short off-equilibrium simulations, Proceedings of
the National Academy of Sciences of the United States of America
106 (2009), no. 45, 19011–19016. 152
[OBG+ 05] Stanley Osher, Martin Burger, Donald Goldfarb, Jinjun Xu, and
Wotao Yin, An iterative regularization method for total variation-
based image restoration, SIAM Journal on Multiscale Modeling and
Simulation 4 (2005), no. 2, 460–489. 38, 76
[ORX+ 16] Stanley Osher, Feng Ruan, Jiechao Xiong, Yuan Yao, and Wotao Yin,
Sparse recovery via differential inclusions, Applied and Computational
Harmonic Analysis 41 (2016), no. 2, 436–469, arXiv:1406.7728. 38,
68, 71, 76
[OW16] Art Owen and Jingshu Wang, Bi-cross-validation for factor analysis,
Statist. Sci. 31 (2016), no. 1, 119–139. 47
[Pea01] Karl Pearson, On lines and planes of closest fit to systems of points
in space, Philosophical Magazine 2 (1901), no. 11, 559–572. 5
[PVB17] Davy Paindaveine and Germain Van Bever, Halfspace depths
for scatter, concentration and shape matrices, arXiv preprint
arXiv:1704.06160 (2017). 188
[RH99] Peter J Rousseeuw and Mia Hubert, Regression depth, Journal of the
American Statistical Association 94 (1999), no. 446, 388–402. 188,
189
[RL00] Sam T. Roweis and Lawrence K. Saul, Nonlinear dimensionality re-
duction by locally linear embedding, Science 290 (2000), no. 5500,
2323–2326. 101
[RS98] Peter J Rousseeuw and Anja Struyf, Computing location depth and
regression depth in higher dimensions, Statistics and Computing 8
(1998), no. 3, 193–203. 188, 189
[RXY18] Feng Ruan, Jiechao Xiong, and Yuan Yao, Libra: Linearized bregman
algorithms for generalized linear models, 2018, R package version 1.6,
https://cran.r-project.org/web/packages/Libra. 72
[Saa95] Donald G. Saari, Basic geometry of voting, Springer, 1995. 234
[Sch37] I. J. Schoenberg, On certain metric spaces arising from euclidean
spaces by a change of metric and their imbedding in hilbert space,
The Annals of Mathematics 38 (1937), no. 4, 787–793. 9
[Sch38a] , Metric spaces and completely monotone functions, The An-
nals of Mathematics 39 (1938), 811–841. 9, 16, 17
[Sch38b] , Metric spaces and positive definite functions, Transactions of
the American Mathematical Society 44 (1938), 522–536. 9, 15, 16
[Sen70] Amartya Sen, The impossibility of a paretian liberal, Journal of Polit-
ical Economy 78 (1970), no. 1, 152–157. 233
[She18] Peter S Shen, The 2017 nobel prize in chemistry: cryo-em comes of
age, Analytical and bioanalytical chemistry 410 (2018), no. 8, 2053–
2057. 195
[SHYW17] Xinwei Sun, Lingjing Hu, Yuan Yao, and Yizhou Wang, Gsplit lbi:
Taming the procedural bias in neuroimaging for disease prediction, In-
ternational Conference on Medical Image Computing and Computer-
Assisted Intervention (MICCAI), Springer, 2017, pp. 107–115. 72
[Sin06] Amit Singer, From graph to manifold laplacian: The convergence rate,
Applied and Computational Harmonic Analysis 21 (2006), 128–134.
174, 175, 179
[SSM98] B. Schölkopf, A. Smola, and K.-R. Müller, Nonlinear component anal-
ysis as a kernel eigenvalue problem, Neural Computation 10 (1998),
1299–1319. 18
[ST04] D. Spielman and Shang-Hua Teng, Nearly-linear time algorithms for
graph partitioning, graph sparsification, and solving linear systems,
STOC ’04 Proceedings of the thirty-sixth annual ACM symposium on
Theory of computing, 2004. 242
[Ste56] Charles Stein, Inadmissibility of the usual estimator for the mean of
a multivariate normal distribution, Proceedings of the Third Berkeley
Symposium on Mathematical Statistics and Probability 1 (1956), 197–206.
27, 31
[Ste01] Ingo Steinwart, On the influence of the kernel on the consistency of
support vector machines, Journal of Machine Learning Research 2
(2001), 67–93. 18
[SW12] Amit Singer and Hau-Tieng Wu, Vector diffusion maps and the con-
nection laplacian, Comm. Pure Appl. Math. 65 (2012), no. 8, 1067–
1144. 185
[SY07] Anthony Man-Cho So and Yinyu Ye, Theory of semidefinite program-
ming for sensor network localization, Mathematical Programming, Se-
ries B 109 (2007), no. 2-3, 367–384. 90, 92
[SYZ08] Anthony Man-Cho So, Yinyu Ye, and Jiawei Zhang, A unified theorem
on sdp rank reduction, Mathematics of Operations Research 33 (2008),
no. 4, 910–920. 91
[Tao11] Terence Tao, Topics in random matrix theory, Lecture notes, UCLA,
2011. 46
[TdL00] J. B. Tenenbaum, Vin de Silva, and John C. Langford, A global geo-
metric framework for nonlinear dimensionality reduction, Science 290
(2000), no. 5500, 2319–2323. 161
[TdSL00] J. B. Tenenbaum, Vin de Silva, and John C. Langford, A global geo-
metric framework for nonlinear dimensionality reduction, Science 290
(2000), no. 5500, 2319–2323. 101
[Tib96] R. Tibshirani, Regression shrinkage and selection via the lasso, J. of
the Royal Statistical Society, Series B 58 (1996), no. 1, 267–288. 38,
64, 66, 72
[Tro04] Joel A. Tropp, Greed is good: Algorithmic results for sparse approx-
imation, IEEE Trans. Inform. Theory 50 (2004), no. 10, 2231–2242.
67, 68, 75
[Tsy09] Alexandre Tsybakov, Introduction to nonparametric estimation,
Springer, 2009. 34, 38, 40
[Tuk75] John W Tukey, Mathematics and the picturing of data, Proceedings
of the International Congress of Mathematicians (Vancouver, 1975),
vol. 2, pp. 523–531. 188, 189
[Tyl87a] D. E. Tyler, A distribution-free m-estimator of multivariate scatter,
Annals of Statistics 15 (1987), no. 1, 234–251. 83, 84
[YL06] Ming Yuan and Yi Lin, Model selection and estimation in regression
with grouped variables, Journal of the Royal Statistical Society: Series
B (Statistical Methodology) 68 (2006), no. 1, 49–67. 75
[YL07] , On the nonnegative garrote estimator, Journal of the Royal
Statistical Society, Series B 69 (2007), no. 2, 143–161. 73
[YODG08] Wotao Yin, Stanley Osher, Jerome Darbon, and Donald Goldfarb,
Bregman iterative algorithms for compressed sensing and related prob-
lems, SIAM Journal on Imaging Sciences 1 (2008), no. 1, 143–168.
76
[ZCS14] Teng Zhang, Xiuyuan Cheng, and Amit Singer, Marcenko-pastur law
for tyler’s m-estimator. 83, 84
[Zha02] Jian Zhang, Some extensions of Tukey’s depth function, Journal of
Multivariate Analysis 82 (2002), no. 1, 134–165. 188, 189
[Zha16] Teng Zhang, Robust subspace recovery by tyler’s m-estimator, Infor-
mation and Inference: A Journal of the IMA (2016), 1–23. 83, 84
[ZHT06] H. Zou, T. Hastie, and R. Tibshirani, Sparse principal compo-
nent analysis, Journal of Computational and Graphical Statistics 15
(2006), no. 2, 262–286. 86
[Zou06] Hui Zou, The adaptive lasso and its oracle properties, Journal of the
American Statistical Association 101 (2006), no. 476, 1418–1429. 73,
75
[ZSF+ 18] Bo Zhao, Xinwei Sun, Yanwei Fu, Yuan Yao, and Yizhou Wang, Msplit
lbi: Realizing feature selection and dense estimation simultaneously in
few-shot and zero-shot learning, International Conference on Machine
Learning (ICML), 2018. 72
[ZW] Zhenyue Zhang and Jing Wang, Mlle: Modified locally linear em-
bedding using multiple weights, http://citeseerx.ist.psu.edu/
viewdoc/summary?doi=10.1.1.70.382. 109, 110
[ZY06] Peng Zhao and Bin Yu, On model selection consistency of lasso, J.
Machine Learning Research 7 (2006), 2541–2567. 73, 75
[ZZ02] Zhenyue Zhang and Hongyuan Zha, Principal manifold and nonlinear
dimension reduction via local tangent space alignment, SIAM Journal
on Scientific Computing 26 (2002), 313–338. 114
[ZZ09] Hongyuan Zha and Zhenyue Zhang, Spectral properties of the align-
ment matrices in manifold learning, SIAM Review 51 (2009), no. 3,
545–566. 115
Index
K, 17
L_K, 17
H_K, 17
Algorithm
  Classical/Metric MDS, 11
  Kernel PCA/MDS, 19
  PCA, 7
Command
  SMACOF, 21
  cmdscale, 21
  mdscale, 21
  prcomp, 19
  princomp, 19
  sklearn.decomposition.PCA, 19
  sklearn.manifold.MDS, 21
covariance operator, 17
Mercer kernel, 17
Mercer’s Theorem, 17
Multidimensional Scaling (MDS), 9
PCA
  parallel analysis, 47