
Geometric and Topological Data Reduction

A Mathematical Introduction to Data Science

Yuan Yao
Current address: Department of Mathematics, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong SAR, P. R. China
School of Mathematical Sciences, Peking University, Beijing, P. R. China 100871
Email address: yuany@ust.hk
URL: https://yao-lab.github.io/book_datasci/
This is a working draft last updated on February 19, 2023.
2010 Mathematics Subject Classification. Primary 62, 68; Secondary 57
Key words and phrases. Principal Component Analysis, Multidimensional Scaling,
Reproducing Kernel, High Dimensional Statistics, Random Matrix Theory,
Manifold Learning, Spectral Graph Theory, Topological Data Analysis,
Combinatorial Hodge Theory

Special thanks to Prof. Amit Singer, Weinan E, Xiuyuan Cheng, Feng Ruan,
Jiangshu Wang, Kaizheng Wang, and the following students at PKU who helped
scribe lecture notes with various improvements: Hong Cheng, Chao Deng,
Yanzhen Deng, Chendi Huang, Lei Huang, Shujiao Huang, Longlong Jiang, Yuwei
Jiang, Wei Jin, Changcheng Li, Xiaoguang Li, Zhen Li, Tengyuan Liang, Feng
Lin, Yaning Liu, Peng Luo, Wulin Luo, Tangjie Lv, Yuan Lv, Hongyu Meng, Ping
Qin, Jie Ren, Hu Sheng, Zhiming Wang, Yuting Wei, Jiechao Xiong, Jie Xu,
Bowei Yan, Jun Yin, and Yue Zhao.

Abstract. This monograph aims to provide graduate students or senior undergraduates in applied mathematics, computer science, statistics and computational sciences an introduction to data analysis from a mathematical perspective. The lecture notes have been used in related courses (A Mathematical Introduction to Data Analysis; Topological and Geometric Data Reduction) at Peking University and Hong Kong University of Science and Technology, which can be found at https://yao-lab.github.io/course.html, part of which is based on a related course led by Professor Amit Singer at Princeton University.
The lectures focus on a geometric and topological perspective on data analysis (reduction and visualization), with a central and hidden theme of spectral methods, e.g. can one hear the shape of data? Such a theme runs from simple principal component analysis and manifold learning to the Hodge-theoretic approach to topological data analysis. A wide range of topics is covered, including Principal Component Analysis, Multidimensional Scaling, Stein's phenomenon and high dimensionality, shrinkage and regularization, random matrix theory and random projections, robust PCA, sparse PCA, graph realization, manifold learning, topological data analysis, and applied Hodge Theory.
Contents

Preface 1

Part 1. Linear Dimensionality Reduction: PCA, MDS, and Beyond 3


Chapter 1. Geometry of PCA and MDS 5
1.1. Principal Component Analysis (PCA) 5
1.2. Multidimensional Scaling (MDS) 9
1.3. Duality of PCA and MDS in Singular Value Decomposition 11
1.4. Schoenberg Isometric Embedding and Reproducing Kernels 12
1.5. Lab and Further Studies 19
Chapter 2. Curse of Dimensionality: High Dimensional Statistics 27
2.1. Maximum Likelihood Estimate of Mean and Covariance 27
2.2. Stein’s Phenomenon and Shrinkage of Sample Mean 30
2.3. Random Matrix Theory and Phase Transitions in PCA 41
2.4. Spectral Shrinkage by Horn’s Parallel Analysis 47
2.5. Sufficient Dimensionality Reduction and Supervised PCA 48
2.6. Lab and Further Studies 53

Chapter 3. Blessing of Dimensionality: Concentration of Measure 59


3.1. Introduction to Almost Isometric Embedding 59
3.2. Johnson-Lindenstrauss Lemma for Random Projections 60
3.3. Application: Human Genome Diversity Project 63
3.4. Random Projections and Compressed Sensing 64
3.5. Inverse Scale Space Method for Sparse Learning 71
3.6. Lab and Further Studies 77
Chapter 4. Generalized PCA and MDS via Semidefinite Programming 79
4.1. Introduction of Semi-Definite Programming (SDP) 79
4.2. Robust PCA via SDP 81
4.3. Sparse PCA via SDP 86
4.4. Graph Realization and Universal Rigidity 87
4.5. Lab and Further Studies 92

Part 2. Nonlinear Dimensionality Reduction: Kernels on Graphs 99


Chapter 5. Manifold Learning 101
5.1. Introduction 101
5.2. ISOMAP 103
5.3. Locally Linear Embedding (LLE) 106
5.4. Hessian LLE 110

5.5. Local Tangent Space Alignment (LTSA) 113


5.6. Laplacian LLE (Eigenmap) 115
5.7. Diffusion Map 119
5.8. Stochastic Neighbor Embedding 121
5.9. Lab: Comparative Studies 121

Chapter 6. Random Walk on Graphs 123


6.1. Introduction to Perron-Frobenius Theory and PageRank 123
6.2. Introduction to Fiedler Theory and Cheeger Inequality 129
6.3. *Laplacians and the Cheeger inequality for directed graphs 136
6.4. Lumpability of Markov Chain 142
6.5. Applications of Lumpability: MNcut and Network Reduction 145
6.6. Mean First Passage Time 150
6.7. Transition Path Theory 152
6.8. Semi-supervised Learning and Transition Path Theory 155
6.9. Lab and Further Studies 158

Chapter 7. Diffusion Geometry 161


7.1. Diffusion Map and Diffusion Distance 161
7.2. Commute Time Map and Distance 169
7.3. Diffusion Map: Convergence Theory 172
7.4. *Vector Diffusion Map and Connection Laplacian 179
7.5. *Synchronization on Graphs 185

Chapter 8. Robust PCA via Generative Adversarial Networks 187


8.1. Huber’s Contamination Model and Tukey’s Median 187
8.2. Generative Adversarial Networks (GAN) 190
8.3. Robust PCA via GANs 194
8.4. Lab and Further Studies 196

Part 3. Introduction to Topological Data Analysis 197

Chapter 9. Simplicial Complex Representation of Data 199


9.1. From Graphs to Simplicial Complexes 199
9.2. Betti Numbers 203
9.3. Consistency and Sample Complexity of Čech Complexes (Niyogi-Smale-Weinberger Theorem) 203
9.4. Lab and Further Studies 203

Chapter 10. Persistent Homology 205


10.1. *Hierarchical Clustering, Metric Trees, and Persistent β0 205
10.2. Persistent Homology and Betti Numbers 205
10.3. Application Examples of Persistent Homology 207
10.4. *Stability of Persistent Barcode/Diagram 209
10.5. Lab and Further Studies 209

Chapter 11. Mapper and Morse Theory 217


11.1. Morse Theory, Reeb Graph, and Mapper 217
11.2. Applications Examples 218
11.3. *Discrete Morse Theory and Persistent Homology 222

11.4. Lab and Further Studies 222

Chapter 12. *Euler Calculus 227


12.1. Euler Characteristics 227
12.2. *Euler Calculus and Integral Geometry 227
12.3. Applications Examples 227

Part 4. Combinatorial Hodge Theory and Applications 229


Chapter 13. Combinatorial Hodge Theory 231
13.1. Exterior Calculus on Simplicial Complex and Cohomology 231
13.2. Combinatorial Hodge Theory 232
13.3. Lab and Further Studies 232

Chapter 14. Social Choice and Hodge Decomposition of Preferences 233


14.1. Social Choice Theory 233
14.2. Crowdsourced Ranking on Graphs 237
14.3. Hodge Decomposition of Pairwise Preference 240
14.4. Random Graph Theory and Sampling 244
14.5. Online HodgeRank 246
14.6. Robust HodgeRank 248
14.7. From Social Choice to Individual Preferences 248
14.8. Lab and Further Studies 248
Chapter 15. Game Theory and Hodge Decomposition of Utilities 251
15.1. Nash and Correlated Equilibrium 252
15.2. Hodge Decomposition of Utilities 253
15.3. Potential Game and Shapley-Monderer Condition 254
15.4. Zero-sum Games 254
Chapter 16. *Towards Quantum Hodge Decomposition and TDA 255
16.1. An Introduction to Quantum Linear Algebra 255
16.2. Quantum Hodge Decomposition 255
16.3. Quantum Persistent Homology 255
16.4. A Prototype Demo 255
Exercise 255
Bibliography 257

Index 269
Preface

... the objective of statistical methods is the reduction of data. A quantity of data... is to be replaced by relatively few quantities which shall adequately represent ... the relevant information contained in the original data.
Since the number of independent facts supplied in the data is usually far greater than the number of facts sought, much of the information supplied by an actual sample is irrelevant. It is the object of the statistical process employed in the reduction of data to exclude this irrelevant information, and to isolate the whole of the relevant information contained in the data. – R. A. Fisher

As the mathematical founder of modern statistics, R. A. Fisher exploits a population distribution governed by relatively few "quantities" to characterize the relevant information, which leads to maximum likelihood inference and related methods. Yet in nature, relevant vs. irrelevant variations of data can often be well described or separated by geometry and even topology, beyond statistical models. This is especially motivated by data visualization.
Geometry and topology have played a long-term role in the spirit of characterizing relevant variations against irrelevant ones. For example, in the Erlangen Program, geometries and topology are classified by their invariants under transformation groups: Euclidean geometry is invariant under the Euclidean group (rotations and translations), projective geometry is invariant under the projective group, and topology is invariant under the group of homeomorphisms (continuous invertible transformations). Here the "data" are transformation groups. In modern data analysis, data are typically associated with certain natural geometric structures, e.g. Euclidean and/or metric representations. Dimensionality reduction looks for a small set of coordinates characterizing the relevant variations of data, where PCA or MDS relies on global metric information and manifold learning relies on local metric information. When even the local metric information is contaminated by irrelevant variations, topological reduction looks for more robust and coarse-grained invariants, such as clusters and the higher-order interactions among them.
Data geometry and topology therefore play central roles in the reduction. An interesting bridge is that geometry and topology are connected by spectral methods. Like the old problem raised by Weyl, "can one hear the shape of a drum?", a similar question can be asked here: can one hear the shape of data? Here "shape" refers to the relevant information in data, in a geometric and topological flavor. This question is in fact addressed in this monograph: as one can see, geometric data reduction from PCA/MDS to manifold learning can be regarded as kernel methods defined on various data graphs, where the spectral decomposition of the kernel gives rise to a small number of geometric coordinates of data shapes, while topological data reduction can be inferred from various simplicial complex representations of data, extended from graphs, with spectral decompositions of Hodge Laplacians.
Therefore, in this course we shall see a dance among geometric, topological, and statistical data reductions. We are going to present these stories in four parts.
In general, data representations can be vectors, matrices (especially graphs and networks), tensors, and possibly unstructured objects such as images, videos, languages, sequences, etc.
************************
This book is used in a course instructed by Yuan Yao at Peking University and
the Hong Kong University of Science and Technology, part of which is based on a
similar course led by Amit Singer at Princeton University.
Part 1

Linear Dimensionality Reduction: PCA, MDS, and Beyond
CHAPTER 1

Geometry of PCA and MDS

In this chapter, we start from a basic data representation as Euclidean vectors or points in metric spaces. In Principal Component Analysis (PCA), one starts from a high dimensional Euclidean representation and looks for the best affine (linear) approximation of data variations; while in Multidimensional Scaling (MDS), one is equipped with some pairwise distance metric among data, up to some noise, and pursues a Euclidean representation preserving such a metric. PCA and MDS are dual to each other when the data points indeed lie in a Euclidean space. In fact, PCA and MDS correspond to the left and right singular vectors of the same centered data matrix, respectively. Such a geometric picture can be extended to Hilbert spaces of infinite dimension via reproducing kernels as positive definite functions.

1.1. Principal Component Analysis (PCA)


Principal component analysis (PCA), invented by Pearson (1901) (Pea01) with the name coined by Hotelling (1933) (Hot33), is perhaps the most popular method for dimensionality reduction with high dimensional Euclidean data, appearing under various names in science and engineering such as the Karhunen-Loève Transform for stochastic processes (? ), Empirical Orthogonal Functions, and Proper Orthogonal Decomposition, etc. In the following we introduce PCA from a geometric perspective as the best low dimensional affine subspace approximation of data.

Figure 1. Illustration of Principal Component Analysis as the best affine subspace approximation of data.


1.1.1. Best Affine Approximation of Data. Let $x_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, be n samples in $\mathbb{R}^p$. Denote the data matrix $X = [x_1|x_2|\cdots|x_n] \in \mathbb{R}^{p\times n}$. We are going to look for a k-dimensional affine space in $\mathbb{R}^p$ that best approximates these n samples (see Figure 1). Assume that such an affine space can be parameterized by $\mu + U\beta$, where $U = [u_1, \ldots, u_k]$ consists of k columns of an orthonormal basis of the affine space, so that $U^T U = I_k$. Then the best approximation in terms of Euclidean distance is given by the following optimization problem:

(1)   $\min_{\beta,\mu,U} I := \sum_{i=1}^n \|x_i - (\mu + U\beta_i)\|^2,$

where $U \in \mathbb{R}^{p\times k}$, $U^T U = I_k$, and $\sum_{i=1}^n \beta_i = 0$ (a nonzero sum of the $\beta_i$ can be absorbed into $\mu$). Taking the first order optimality conditions,

$\frac{\partial I}{\partial\mu} = -2\sum_{i=1}^n (x_i - \mu - U\beta_i) = 0 \;\Rightarrow\; \hat{\mu}_n = \frac{1}{n}\sum_{i=1}^n x_i,$

$\frac{\partial I}{\partial\beta_i} = (x_i - \mu - U\beta_i)^T U = 0 \;\Rightarrow\; \hat{\beta}_i = U^T(x_i - \hat{\mu}_n).$

Plugging in the expressions of $\hat{\mu}_n$ and $\hat{\beta}_i$,

(2)   $I = \sum_{i=1}^n \|x_i - \hat{\mu}_n - UU^T(x_i - \hat{\mu}_n)\|^2 = \sum_{i=1}^n \|x_i - \hat{\mu}_n - P_k(x_i - \hat{\mu}_n)\|^2 = \sum_{i=1}^n \|y_i - P_k(y_i)\|^2, \quad y_i := x_i - \hat{\mu}_n,$

where $P_k = UU^T$ is a projection operator satisfying the idempotent property $P_k^2 = P_k$.
Denote $Y = [y_1|y_2|\cdots|y_n] \in \mathbb{R}^{p\times n}$; then the original problem turns into

$\min_U \sum_{i=1}^n \|y_i - P_k(y_i)\|^2 = \min \mathrm{trace}[(Y - P_kY)^T(Y - P_kY)]$
$\qquad = \min \mathrm{trace}[Y^T(I - P_k)(I - P_k)Y]$
$\qquad = \min \mathrm{trace}[YY^T(I - P_k)^2]$
$\qquad = \min \mathrm{trace}[YY^T(I - P_k)]$
$\qquad = \min [\mathrm{trace}(YY^T) - \mathrm{trace}(YY^TUU^T)]$
$\qquad = \min [\mathrm{trace}(YY^T) - \mathrm{trace}(U^TYY^TU)].$

Above we use the cyclic property of the trace, $\mathrm{trace}(ABC) = \mathrm{trace}(BCA)$, and the idempotent property of the projection, $P^2 = P$.
Since Y does not depend on U, the problem above is equivalent to

(3)   $\max_{U^TU=I_k} \mathrm{Var}(U^TY) = \max_{U^TU=I_k} \frac{1}{n}\mathrm{trace}(U^TYY^TU) = \max_{U^TU=I_k} \mathrm{trace}(U^T\hat{\Sigma}_nU),$
where $\hat{\Sigma}_n = \frac{1}{n}YY^T = \frac{1}{n}(X - \hat{\mu}_n\mathbf{1}^T)(X - \hat{\mu}_n\mathbf{1}^T)^T$ is the sample covariance matrix¹. Assume that the sample covariance matrix, which is positive semi-definite, has the eigenvalue decomposition $\hat{\Sigma}_n = \hat{U}\hat{\Lambda}\hat{U}^T$, where $\hat{U}^T\hat{U} = I$, $\hat{\Lambda} = \mathrm{diag}(\hat{\lambda}_1, \ldots, \hat{\lambda}_p)$, and $\hat{\lambda}_1 \geq \ldots \geq \hat{\lambda}_p \geq 0$. Then

$\max_{U^TU=I_k} \mathrm{trace}(U^T\hat{\Sigma}_nU) = \sum_{i=1}^k \hat{\lambda}_i.$

In fact, when k = 1 the maximal variance is given by the largest eigenvalue, attained along the direction of its associated eigenvector,

$\max_{\|u\|=1} u^T\hat{\Sigma}_n u = \hat{\lambda}_1.$

Restricting to the orthogonal subspace $u \perp \hat{u}_1$ leads to

$\max_{\|u\|=1,\, u^T\hat{u}_1=0} u^T\hat{\Sigma}_n u = \hat{\lambda}_2,$

and so on. Therefore, PCA takes the eigenvector decomposition $\hat{\Sigma}_n = \hat{U}\hat{\Lambda}\hat{U}^T$ and studies the projections of the centered data points onto the top k eigenvectors as the principal components. In this way, we conclude that the k-dimensional affine space can be discovered by the eigenvector decomposition of $\hat{\Sigma}_n$.

Algorithm 1: Principal Component Analysis

Input: Data $X = [x_i] \in \mathbb{R}^{p\times n}$ with $x_i = (x_{i1},\ldots,x_{ip})^T \in \mathbb{R}^p$, $i = 1,\ldots,n$.
Output: Euclidean k-dimensional coordinates $\hat{\beta} \in \mathbb{R}^{k\times n}$ of data.
1. Compute the sample covariance matrix $\hat{\Sigma}_n = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu})(x_i - \hat{\mu})^T = \frac{1}{n}XHH^TX^T$, where $\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i$ and the centering matrix $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ ($\mathbf{1} = (1,\ldots,1)^T \in \mathbb{R}^n$);
2. Compute the eigen-decomposition $\hat{\Sigma}_n = \hat{U}\hat{\Lambda}\hat{U}^T$ with $\hat{\Lambda} = \mathrm{diag}(\hat{\lambda}_1,\ldots,\hat{\lambda}_p)$ where $\hat{\lambda}_1 \geq \hat{\lambda}_2 \geq \ldots \geq \hat{\lambda}_p \geq 0$;
3. Choose the top k nonzero eigenvalues and corresponding eigenvectors, $\hat{U}_k = [u_1,\ldots,u_k]$, $u_k \in \mathbb{R}^p$, $\hat{\Lambda}_k = \mathrm{diag}(\hat{\lambda}_1,\ldots,\hat{\lambda}_k)$;
4. Compute the j-th principal component score of sample i: $\hat{\beta}_{ji} = u_j^T(x_i - \hat{\mu})$.
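For concreteness, here is a minimal NumPy sketch of Algorithm 1 (the toy data matrix and the choice k = 2 below are illustrative assumptions, not part of the lab code of Section 1.5):

import numpy as np

def pca(X, k):
    """Algorithm 1: return top-k loadings U_k, scores beta (k x n), and eigenvalues."""
    p, n = X.shape
    mu = X.mean(axis=1, keepdims=True)      # sample mean, p x 1
    Y = X - mu                              # centered data, Y = X H
    Sigma = Y @ Y.T / n                     # sample covariance, p x p
    lam, U = np.linalg.eigh(Sigma)          # eigh returns ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]          # reorder: descending
    Uk = U[:, :k]                           # top-k loading vectors
    beta = Uk.T @ Y                         # principal component scores
    return Uk, beta, lam

# toy usage: 200 points in R^5 with most variance in two directions
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 200)) * np.array([3., 2., .5, .3, .1])[:, None]
Uk, beta, lam = pca(X, k=2)
print(lam[:2] / lam.sum())                  # fraction of variance explained by top 2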

1.1.2. PCA Algorithm. The Principal Component Analysis above can be summarized in Algorithm 1. The eigenvectors in $\hat{U}_k$ are called principal component loading vectors. The sample principal components are defined as the column vectors of $\hat{\beta} = \hat{U}_k^T Y \in \mathbb{R}^{k\times n}$, where the i-th observation $x_i$ has its j-th principal component score as its projection onto $u_j$: $\hat{\beta}_{ji} = u_j^T y_i = u_j^T(x_i - \hat{\mu})$. The importance or variance of the j-th principal component is characterized by the j-th eigenvalue $\hat{\lambda}_j$. Given the eigenvalues, the following quantities are often used to measure the variances.
• Total variance: $\mathrm{trace}(\hat{\Sigma}_n) = \sum_{j=1}^p \hat{\lambda}_j$;
• Percentage of variance explained by the top-k principal components: $\sum_{j=1}^k \hat{\lambda}_j / \mathrm{trace}(\hat{\Sigma}_n)$;
• Generalized variance as total volume: $\det(\hat{\Sigma}_n) = \prod_{j=1}^p \hat{\lambda}_j$.

¹Note that in statistics the sample covariance matrix is often defined, for n ≥ 2, as $\hat{\Sigma}_n \triangleq \frac{1}{n-1}\sum_{i=1}^n (x_i - \hat{\mu}_n)(x_i - \hat{\mu}_n)^T = \frac{1}{n-1}\tilde{X}\tilde{X}^T$; the numerator vanishes when the sample size is 1.

Figure 2. (a) A random selection of 9 images. (b) Percentage of singular values over the total sum. (c) Approximation of the first image by the top 3 principal components (singular vectors).

1.1.3. Example: PCA of Handwritten Digits. Take the dataset of the handwritten digit "3": $\hat{X} \in \mathbb{R}^{658\times 256}$ contains 658 images, each a 16-by-16 grayscale image of a handwritten digit 3. Figure 2 shows a random selection of 9 images, the sorted singular values divided by the total sum of singular values, and an approximation of $x_1$ by the top 3 principal components: $x_1 \approx \hat{\mu}_n - 2.5184\,\tilde{v}_1 - 0.6385\,\tilde{v}_2 + 2.0223\,\tilde{v}_3$.
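A hedged snippet reproducing such an approximation with scikit-learn (it assumes the digit-"3" array data loaded as in Section 1.5; the signs of the coefficients may differ, since principal components are only determined up to sign):

from sklearn.decomposition import PCA

pca = PCA(n_components=3)
scores = pca.fit_transform(data)                      # per-image top-3 scores
x1_approx = pca.mean_ + scores[0] @ pca.components_   # mu_hat + sum_j beta_j * v_j
print(scores[0])                                      # compare with the coefficients above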
How many principal components does one need? There is no universal rule, and the answer depends on the specific problem. The key point of PCA is to describe the dataset by only a few principal components (PCs); the question is how many PCs should be kept. In the example of digit "3", we have seen that the images of the top eigenvectors are quite relevant, hence are to be kept. The images in Figure 3 are the 201st to 205th PCs of the same dataset. They look more like noise, hence are probably not to be kept.

Figure 3. The 201st to 205th principal components.

In general, the following schemes are often adopted in applications. For example, one can inspect the eigenvalue plot in non-increasing order to see if there is a change point or an elbow where one can truncate; one can also keep enough principal components to explain a prescribed q-percentage of the total variation (e.g. q = 95%); sometimes one can select the principal components with variance larger than the average.
For instance, one can set a threshold on the empirical eigenvalues based on the percentage of variance, choosing k such that

$\sum_{i=1}^k \hat{\lambda}_i / \mathrm{trace}(\hat{\Sigma}_n) > 0.95.$

In Section 2.4, we will discuss Horn's Parallel Analysis with a random permutation test, whose interpretation is based on Random Matrix Theory.

1.2. Multidimensional Scaling (MDS)


Multidimensional Scaling (MDS) has its roots in psychology (YH41) and aims to recover Euclidean coordinates given pairwise distance metrics or dissimilarities. It is equivalent to PCA when the pairwise distances are Euclidean. At the core of the theoretical foundation of MDS lies the notion of positive definite functions (Sch37; Sch38a; Sch38b) (or see the survey (Bav11)), which has also been the foundation of kernel methods in statistics (Wah90) and the modern machine learning community (http://www.kernel-machines.org/).
1.2.1. Metric MDS: Isometric Euclidean Embedding. In this section we introduce the classical MDS, or metric embedding problem. The problem of classical MDS or isometric Euclidean embedding is: given pairwise distances between data points, can one find a system of Euclidean coordinates for those points whose pairwise distances meet the given constraints?
Consider a forward problem: given a set of points $x_1, x_2, \ldots, x_n \in \mathbb{R}^p$, let $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{p\times n}$. The distance between points $x_i$ and $x_j$ satisfies

$d_{ij}^2 = \|x_i - x_j\|^2 = (x_i - x_j)^T(x_i - x_j) = x_i^Tx_i + x_j^Tx_j - 2x_i^Tx_j.$

Now consider the inverse problem: given only $d_{ij}$, can one find $\{x_i \in \mathbb{R}^p : i = 1,\ldots,n\}$ for some p satisfying the constraints $d_{ij} = \|x_i - x_j\|$? This is the classical metric MDS problem. Clearly the solutions are not unique, as any Euclidean transform of $\{x_i\}$ gives another solution.
The classical metric MDS gives a specific solution based on the following idea:
(a) transform the squared distance matrix $D = [d_{ij}^2]$ into an inner product form;
(b) compute the eigen-decomposition of this inner product form.
The key observation is that the two-sided centering transform of the squared distance matrix D gives the Gram matrix (inner product or kernel matrix) of the centered data matrix, i.e.

(4)   $-\frac{1}{2}HDH^T = (XH)^T(XH) =: \hat{K},$

where $H := I - \frac{1}{n}\mathbf{1}\mathbf{1}^T = H^T$, with $\mathbf{1} = (1,1,\ldots,1)^T \in \mathbb{R}^n$, is the Householder centering matrix.
To see this, let K be the inner product (kernel, or Gram) matrix

$K = X^TX, \quad X = [x_i] \in \mathbb{R}^{p\times n},$

with $k = \mathrm{diag}(K_{ii}) \in \mathbb{R}^n$. Note that

$D = (d_{ij}^2) = k\cdot\mathbf{1}^T + \mathbf{1}\cdot k^T - 2K.$

The following lines establish the fact that

(5)   $-\frac{1}{2}HDH^T = HKH^T.$

In fact, note that

$-\frac{1}{2}HDH^T = -\frac{1}{2}H(k\cdot\mathbf{1}^T + \mathbf{1}\cdot k^T - 2K)H^T.$

Since $k\cdot\mathbf{1}^T\cdot H^T = k\cdot\mathbf{1}^T(I - \frac{1}{n}\mathbf{1}\mathbf{1}^T) = k\cdot\mathbf{1}^T - k\left(\frac{\mathbf{1}^T\mathbf{1}}{n}\right)\mathbf{1}^T = 0$, we have $H\cdot k\cdot\mathbf{1}^T\cdot H^T = H\cdot\mathbf{1}\cdot k^T\cdot H^T = 0$. This implies that

$-\frac{1}{2}HDH^T = HKH^T = HX^TXH^T = (XH)^T(XH),$

since $H = H^T$, which establishes (4).
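As a quick numerical sanity check of identity (4) (a hedged sketch on random data, not from the text):

import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 10
X = rng.standard_normal((p, n))

# squared distance matrix D_ij = ||x_i - x_j||^2
G = X.T @ X
D = np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G

H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
K_hat = -0.5 * H @ D @ H.T               # left-hand side of (4)
XH = X @ H                               # centered data
print(np.allclose(K_hat, XH.T @ XH))     # True: identity (4) holds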
Therefore $Y = XH = X - \frac{1}{n}X\mathbf{1}\mathbf{1}^T$ is the centered data matrix and $\hat{K} = Y^TY$ is the inner product matrix of the centered data, which is positive semi-definite and admits an orthogonal eigenvector decomposition.
Above we have shown that, given a squared distance matrix $D = (d_{ij}^2)$, we can convert it to an inner product matrix by $\hat{K} = -\frac{1}{2}HDH^T = (XH)^T(XH)$. The eigen-decomposition applied to $\hat{K}$ will give rise to Euclidean coordinates centered at the origin.
In practice, one often chooses the top k nonzero eigenvectors of $\hat{K}$ for a k-dimensional Euclidean embedding or approximation of the n data points, as summarized in the classical MDS Algorithm 2.
1.2.2. Example: Metric MDS of Cities on the Earth. Consider pairwise geodesic distances among cities on the earth. Figure 4 shows the result of a 2-D embedding of eight cities using the classical metric MDS Algorithm 2. Since the cities lie on the surface of a sphere, their geodesic distances² cannot be isometrically embedded into a 2-D Euclidean space. Hence one can see some negative eigenvalues of $\hat{K}$. Details will be discussed in Section 1.5.

²For example, the distances can be computed via https://www.distancecalculator.net.

Algorithm 2: Classical MDS Algorithm

Input: A squared distance matrix $D \in \mathbb{R}^{n\times n}$ with $D_{ij} = d_{ij}^2$.
Output: Euclidean k-dimensional coordinates $Z_k \in \mathbb{R}^{k\times n}$ of data.
1. Compute $\hat{K} = -\frac{1}{2}HDH^T$, where H is the Householder centering matrix;
2. Compute the eigenvalue decomposition $\hat{K} = \hat{V}\hat{\Lambda}\hat{V}^T$ with $\hat{\Lambda} = \mathrm{diag}(\hat{\lambda}_1,\ldots,\hat{\lambda}_n)$ where $\hat{\lambda}_1 \geq \hat{\lambda}_2 \geq \ldots \geq \hat{\lambda}_n$;
3. Choose the top k nonzero eigenvalues and corresponding eigenvectors, and set the embedding coordinates $Z_k = \hat{\Lambda}_k^{1/2}\hat{V}_k^T$, where $\hat{V}_k = [\hat{v}_1,\ldots,\hat{v}_k]$, $\hat{v}_k \in \mathbb{R}^n$, $\hat{\Lambda}_k = \mathrm{diag}(\hat{\lambda}_1,\ldots,\hat{\lambda}_k)$, with $\hat{\lambda}_1 \geq \hat{\lambda}_2 \geq \ldots \geq \hat{\lambda}_k \geq 0$.

Figure 4. MDS of eight cities in the world. (a) MDS embedding with top-2 eigenvectors. (b) Eigenvalues of $\hat{K} = -\frac{1}{2}HDH^T$.

1.2.3. Nonmetric MDS. Given a set of points $x_i \in \mathbb{R}^p$ ($i = 1, 2, \ldots, n$), form a data matrix $X = [X_1, X_2, \ldots, X_n] \in \mathbb{R}^{p\times n}$. When p is large, especially in some cases larger than n, we want to find a k-dimensional projection under which the pairwise distances of the data points are preserved as well as possible. That is to say, if we know the original pairwise distances $d_{ij} = \|X_i - X_j\|$, or distances with some disturbance $\tilde{d}_{ij} = \|X_i - X_j\| + \epsilon$, we want to find $Y_i \in \mathbb{R}^k$ such that

(6)   $\min_{Y_i \in \mathbb{R}^k} \sum_{i,j} (\|Y_i - Y_j\|^2 - \tilde{d}_{ij}^2)^2.$

Without loss of generality, we set $\sum_i Y_i = 0$, i.e. putting the origin at the data center. This is called nonmetric MDS for general $\tilde{d}_{ij}$, which is not necessarily a distance.
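In practice, problems of this type are solved iteratively. A hedged usage sketch with sklearn.manifold.MDS (also used in the lab of Section 1.5.2); note that scikit-learn minimizes a closely related stress criterion on the dissimilarities rather than formula (6) exactly, and the random dissimilarity matrix below is only illustrative:

import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))                                  # illustrative data
D_tilde = np.linalg.norm(X[:, None] - X[None, :], axis=-1)        # true distances
D_tilde += 0.01 * rng.standard_normal(D_tilde.shape)              # add disturbance
D_tilde = (D_tilde + D_tilde.T) / 2                               # symmetrize
np.fill_diagonal(D_tilde, 0.0)

emb = MDS(n_components=2, metric=False, dissimilarity='precomputed',
          random_state=0).fit_transform(D_tilde)
print(emb.shape)                                                  # (20, 2)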

1.3. Duality of PCA and MDS in Singular Value Decomposition


We have seen that PCA arises from the sample covariance matrix of the data, $\hat{\Sigma}_n = \frac{1}{n}(XH)(XH)^T = \hat{U}\hat{\Lambda}\hat{U}^T$, which gives rise to the principal components by projecting the centered data onto the top k eigenvectors in $\hat{U}_k$. This is equivalent to the singular value decomposition (SVD) of $X = [x_1,\ldots,x_n] \in \mathbb{R}^{p\times n}$ in the following sense,

(7)   $Y = XH = X - \frac{1}{n}X\mathbf{1}\mathbf{1}^T = \hat{U}\hat{S}\hat{V}^T, \quad H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T, \quad \mathbf{1} = (1,\ldots,1)^T \in \mathbb{R}^n,$

where the top left singular vectors of the centered data matrix $Y \in \mathbb{R}^{p\times n}$ are the eigenvectors of the sample covariance matrix $\hat{\Sigma}$. The singular vectors are not unique, but the singular subspace spanned by singular vectors associated with distinct singular values is unique.
How about the right singular vectors here? In Section 1.2, we have seen that metric Multidimensional Scaling (MDS) is characterized by the eigenvectors of the positive semi-definite kernel matrix $\hat{K} = -\frac{1}{2}HDH^T = (XH)^T(XH) = Y^TY = \hat{V}\hat{S}^2\hat{V}^T$, where the last step is by (7). Hence MDS is equivalent to applying the top k right singular vectors of the centered data matrix for the Euclidean embedding.
Therefore both PCA and MDS can be obtained from the SVD of the centered data matrix (7) in the following way.
• PCA has principal components given by the top k left singular vectors $\hat{U}_k \in \mathbb{R}^{p\times k}$, and the projection of the centered data onto such a subspace gives the principal component scores $\hat{\beta}_k = \hat{U}_k^TY = \hat{S}_k\hat{V}_k^T \in \mathbb{R}^{k\times n}$;
• the MDS embedding is given by the top k right singular vectors as $Z_k^{MDS} = \hat{S}_k\hat{V}_k^T \in \mathbb{R}^{k\times n}$.
Note that PCA and MDS share the same k-dimensional representation $\hat{\beta}_k = Z_k^{MDS} = \hat{S}_k\hat{V}_k^T \in \mathbb{R}^{k\times n}$. Unified under the framework of SVD for the centered data matrix, PCA and MDS thus play dual roles in the same linear dimensionality reduction.
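A minimal numerical illustration of this duality (a hedged sketch on random data, not the book's lab code): compute the PCA scores from the covariance eigenvectors and the MDS embedding from the centered Gram matrix, and check that they agree up to a sign flip of each coordinate.

import numpy as np

rng = np.random.default_rng(0)
p, n, k = 6, 40, 2
X = rng.standard_normal((p, n))
H = np.eye(n) - np.ones((n, n)) / n
Y = X @ H                                    # centered data

# PCA side: top-k left singular vectors -> scores U_k^T Y
U, S, Vt = np.linalg.svd(Y, full_matrices=False)
beta = U[:, :k].T @ Y                        # k x n scores

# MDS side: eigenvectors of K_hat = Y^T Y -> S_k V_k^T
lam, V = np.linalg.eigh(Y.T @ Y)
lam, V = lam[::-1], V[:, ::-1]
Z = np.diag(np.sqrt(lam[:k])) @ V[:, :k].T   # k x n embedding

for j in range(k):                           # identical up to sign per coordinate
    s = np.sign(beta[j] @ Z[j])
    assert np.allclose(beta[j], s * Z[j])
print("PCA scores and MDS embedding coincide up to sign.")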
From the properties of the singular value decomposition (? ), PCA and MDS provide the best rank-k approximation of the centered data matrix Y in any unitarily invariant norm. That is, if $Y = U\Sigma V^T$ is the SVD of Y, then $Y_k = U\Sigma_kV^T$ (where $\Sigma_k = \mathrm{diag}(\sigma_1,\sigma_2,\ldots,\sigma_k,0,\ldots,0)$ is a diagonal matrix containing the largest k singular values) is a rank-k matrix which satisfies

$\|Y - Y_k\|_\star = \min_{\mathrm{rank}(Z)=k}\|Y - Z\|_\star,$

where $\|\cdot\|_\star$ is a unitarily invariant norm. A matrix norm $\|Y\|_\star$ that satisfies

$\|PYQ\|_\star = \|Y\|_\star$

for all orthogonal matrices P and Q is called a unitarily invariant norm, such as the operator norm $\|Y\|_\sigma = \max_{\|u\|=1}\|Yu\| = \sigma_1$, the Frobenius norm $\|Y\|_F = \sqrt{\mathrm{trace}(Y^TY)} = (\sum_i\sigma_i^2)^{1/2}$, and the Schatten p-norms $\|Y\|_p = (\sum_i\sigma_i^p)^{1/p}$.
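A small hedged check of this best rank-k approximation property in the Frobenius norm (random data, illustrative only):

import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal((8, 12))
k = 3

U, s, Vt = np.linalg.svd(Y, full_matrices=False)
Yk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]                # truncated SVD

# any other rank-k matrix (here a random one) does no better in Frobenius norm
A, B = rng.standard_normal((8, k)), rng.standard_normal((k, 12))
Z = A @ B
print(np.linalg.norm(Y - Yk, 'fro') <= np.linalg.norm(Y - Z, 'fro'))  # True
print(np.linalg.norm(Y - Yk, 'fro'), np.sqrt((s[k:] ** 2).sum()))     # equal values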

1.4. Schoenberg Isometric Embedding and Reproducing Kernels


In this section, we introduce some mathematical foundations of Multidimensional Scaling, i.e. Schoenberg's theory of isometric embedding. First, the necessary and sufficient condition for a finite sample set in a metric space to be isometrically embedded into a Euclidean space is introduced: the squared distance matrix must be conditionally negative definite, which leads to a positive semi-definite matrix after a negative two-sided Householder centering transform. Furthermore, this is generalized to arbitrary separable metric spaces that are isometrically embeddable in a Hilbert space of infinite dimension, if and only if the Gaussian or heat kernel is a positive definite function, known as a Mercer kernel or reproducing kernel. This enables us to construct Reproducing Kernel Hilbert Spaces, widely used in machine learning and statistics. In the end, kernel PCA or MDS is introduced as a finite sample approximation of Mercer's Theorem.

1.4.1. Isometric Euclidean Embedding. In this section, we shall see that a metric space of n points with distances $d_{ij}$ can be isometrically embedded into a Euclidean space if and only if the squared distance matrix $D = [d_{ij}^2]$ is conditionally negative definite. This lays down a foundation for the classical MDS algorithm.
Definition 1.4.1 (Positive Semi-definite Matrix). Suppose $A \in \mathbb{R}^{n\times n}$ is a real symmetric matrix; then A is called positive semi-definite (p.s.d.), denoted by $A \succeq 0$, if $\forall v \in \mathbb{R}^n$, $v^TAv \geq 0$.
Positive semi-definiteness completely characterizes the inner product matrices: $A \succeq 0 \iff A = Y^TY$ for some Y.
Property. Suppose $A, B \in \mathbb{R}^{n\times n}$ are real symmetric matrices with $A \succeq 0$, $B \succeq 0$. Then we have:
(a) $A + B \succeq 0$;
(b) $A \circ B \succeq 0$;
where $A \circ B$ is the Hadamard product, $(A\circ B)_{i,j} := A_{i,j}\cdot B_{i,j}$.
Definition 1.4.2 (Conditionally Negative Definite Matrix). Let $A \in \mathbb{R}^{n\times n}$ be a real symmetric matrix. A is conditionally negative definite (c.n.d.) if for all $v \in \mathbb{R}^n$ such that $\mathbf{1}^Tv = \sum_{i=1}^n v_i = 0$, there holds $v^TAv \leq 0$.
The following lemma shows that conditionally negative definite matrices are precisely those matrices whose negative two-sided Householder centering transforms are positive semi-definite.
Lemma 1.4.1 (Young/Householder-Schoenberg, 1938). For any signed probability measure $\alpha$ ($\alpha \in \mathbb{R}^n$, $\sum_{i=1}^n \alpha_i = 1$),

$B_\alpha = -\frac{1}{2}H_\alpha CH_\alpha^T \succeq 0 \iff C \text{ is c.n.d.},$

where $H_\alpha$ is the Householder centering matrix $H_\alpha = I - \mathbf{1}\cdot\alpha^T$.
Proof. The proof is divided into two directions.
($\Leftarrow$) We are to show that if C is c.n.d., then $B_\alpha \succeq 0$. Taking an arbitrary $x \in \mathbb{R}^n$,

$x^TB_\alpha x = -\frac{1}{2}x^TH_\alpha CH_\alpha^Tx = -\frac{1}{2}(H_\alpha^Tx)^TC(H_\alpha^Tx).$

Now we show that $y = H_\alpha^Tx$ satisfies $\mathbf{1}^Ty = 0$. In fact,

$\mathbf{1}^TH_\alpha^Tx = \mathbf{1}^T(I - \alpha\cdot\mathbf{1}^T)x = (1 - \mathbf{1}^T\alpha)\mathbf{1}^Tx = 0,$

as $\mathbf{1}^T\alpha = 1$ for a signed probability measure $\alpha$. Therefore,

$x^TB_\alpha x = -\frac{1}{2}(H_\alpha^Tx)^TC(H_\alpha^Tx) \geq 0,$

as C is c.n.d.
($\Rightarrow$) It remains to show that if $B_\alpha \succeq 0$ then C is c.n.d. For any $x \in \mathbb{R}^n$ satisfying $\mathbf{1}^Tx = 0$, we have

$H_\alpha^Tx = (I - \alpha\cdot\mathbf{1}^T)x = x - \alpha\cdot\mathbf{1}^Tx = x.$

Thus,

$x^TCx = (H_\alpha^Tx)^TC(H_\alpha^Tx) = x^TH_\alpha CH_\alpha^Tx = -2x^TB_\alpha x \leq 0.$

This completes the proof. □
The following theorem states that conditionally negative definite matrices, after centering, give exactly the squared distance matrices that allow an isometric embedding by the classical MDS algorithm.
Theorem 1.4.2 (Classical MDS). Let $D \in \mathbb{R}^{n\times n}$ be a real symmetric matrix and

$C = D - \frac{1}{2}d\cdot\mathbf{1}^T - \frac{1}{2}\mathbf{1}\cdot d^T, \quad \text{with } d = \mathrm{diag}(D).$

Then the following hold.
(1) $B_\alpha = -\frac{1}{2}H_\alpha DH_\alpha^T = -\frac{1}{2}H_\alpha CH_\alpha^T$ for every signed probability measure $\alpha$;
(2) $C_{i,j} = B_{i,i}(\alpha) + B_{j,j}(\alpha) - 2B_{i,j}(\alpha)$;
(3) D is c.n.d. $\iff$ C is c.n.d.;
(4) C c.n.d. $\Rightarrow$ C is a squared distance matrix (i.e. $\exists Y \in \mathbb{R}^{n\times k}$ such that $C_{i,j} = \sum_{m=1}^k (y_{i,m} - y_{j,m})^2$).

Proof. The proof is presented in four steps, respectively.
(1) Note that

$H_\alpha DH_\alpha^T - H_\alpha CH_\alpha^T = H_\alpha(D - C)H_\alpha^T = H_\alpha\left(\frac{1}{2}d\cdot\mathbf{1}^T + \frac{1}{2}\mathbf{1}\cdot d^T\right)H_\alpha^T.$

Since $H_\alpha\mathbf{1} = 0$, we have $H_\alpha DH_\alpha^T - H_\alpha CH_\alpha^T = 0$.
(2) Consider

$B_\alpha = -\frac{1}{2}H_\alpha CH_\alpha^T = -\frac{1}{2}(I - \mathbf{1}\cdot\alpha^T)C(I - \alpha\cdot\mathbf{1}^T) = -\frac{1}{2}C + \frac{1}{2}\mathbf{1}\cdot\alpha^TC + \frac{1}{2}C\alpha\cdot\mathbf{1}^T - \frac{1}{2}\mathbf{1}\cdot\alpha^TC\alpha\cdot\mathbf{1}^T,$

which gives

$B_{i,j}(\alpha) = -\frac{1}{2}C_{i,j} + \frac{1}{2}c_i + \frac{1}{2}c_j - \frac{1}{2}c,$

where $c_i = (\alpha^TC)_i$ and $c = \alpha^TC\alpha$. This implies that

$B_{i,i}(\alpha) + B_{j,j}(\alpha) - 2B_{i,j}(\alpha) = -\frac{1}{2}C_{ii} - \frac{1}{2}C_{jj} + C_{ij} = C_{ij},$

where the last step is due to $C_{i,i} = 0$.
(3) According to Lemma 1.4.1 and the first part of Theorem 1.4.2, respectively: C c.n.d. $\iff$ $B_\alpha$ p.s.d. $\iff$ D c.n.d.
(4) According to Lemma 1.4.1 and the second part of Theorem 1.4.2: C c.n.d. $\iff$ $B_\alpha$ p.s.d. $\iff$ $\exists Y$ s.t. $B_\alpha = YY^T$ $\iff$ $B_{i,j}(\alpha) = \sum_k Y_{i,k}Y_{j,k}$ $\Rightarrow$ $C_{i,j} = \sum_k (Y_{i,k} - Y_{j,k})^2$.
This completes the proof. □
This theorem tells us that when $d_{ij}$ is exactly given by the Euclidean distance between points, the kernel matrix $\hat{K} = -\frac{1}{2}HDH^T$ is positive semi-definite, where $D = (d_{ij}^2)$ and $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$. Therefore, for positive semi-definite $\hat{K}$ as an inner product matrix, the optimization problem

(8)   $\min_{Y\in\mathbb{R}^{k\times n}} \|Y^TY - \hat{K}\|_F^2$

is equivalent to (6) and has a solution Y whose rows are the eigenvectors corresponding to the k largest eigenvalues of $\hat{K}$, scaled by the square roots of those eigenvalues. This is exactly the classical or metric MDS Algorithm 2.
1.4.2. Isometric Hilbertian Embedding. Schoenberg (Sch38b) shows that the Euclidean embedding of finite point sets can be characterized completely by positive definite functions, which paves a way toward Hilbert space embeddings. Later Aronszajn (Aro50) developed Reproducing Kernel Hilbert Spaces based on positive definite functions, which eventually led to the kernel methods in statistics and machine learning (BTA04? ; Vap98; CST03).
Theorem 1.4.3 (Schoenberg 38). A separable space M with a metric function d(x, y) can be isometrically imbedded in a Hilbert space H, if and only if the family of functions $e^{-\lambda d^2}$ is positive definite for all $\lambda > 0$ (in fact we just need it for a sequence of $\lambda_i$ whose accumulation point is 0).
Here a symmetric function $k(x, y) = k(y, x)$ is called positive definite if for all finite $\{x_i\}$,

$\sum_{i,j} c_ic_jk(x_i, x_j) \geq 0, \quad \forall c_i, c_j,$

with equality iff $c_i = c_j = 0$. In other words, the function k restricted to $\{(x_i, x_j): i, j = 1,\ldots,n\}$ is a positive definite matrix.
To see the theorem, recall that by the classical MDS theorem, a set of n points with distances $d_{ij}$ can be isometrically imbedded in a Euclidean space if and only if the squared distance matrix is conditionally negative definite, i.e.

(9)   $\sum_{i,j} c_ic_jd_{ij}^2 \leq 0, \quad \sum_i c_i = 0.$

Using the inverse Fourier transform,

(10)   $e^{-t^2} = \frac{1}{2\sqrt{\pi}}\int_{-\infty}^{\infty} e^{itu}e^{-u^2/4}\,du,$

the function $e^{-t^2}$ is positive definite over the reals since it has a positive even spectrum (Sch38b), and so is $e^{-\lambda d^2}$ for $\lambda > 0$, which shows the necessity.
There are two ways to see the sufficiency. First, if $e^{-\lambda d^2}$ ($\lambda > 0$) is a positive definite function, then

$0 \leq \sum_{i,j} c_ic_j\exp(-\lambda d_{ij}^2) = -\lambda\sum_{i,j} c_ic_jd_{ij}^2 + \frac{\lambda^2}{2}\sum_{i,j} c_ic_jd_{ij}^4 - \ldots$

by Taylor expansion, using $\sum_i c_i = 0$ to kill the constant term. For sufficiently small $\lambda$, this implies (9). Since the number of sample points is arbitrary, the positive definite function $e^{-\lambda d^2}$ ensures a Hilbert space embedding of possibly infinite dimension.
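A hedged numerical illustration of the necessity direction (random data, not from the text): for points in Euclidean space, the Gaussian kernel matrix $e^{-\lambda d_{ij}^2}$ has no negative eigenvalues beyond round-off.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))                       # 30 points in R^4
D2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)          # squared distances

for lam in [0.1, 1.0, 10.0]:
    K = np.exp(-lam * D2)
    print(lam, np.linalg.eigvalsh(K).min() >= -1e-10)  # True: positive semi-definite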
Second, a powerful observation in (Sch38b) leads to the Schoenberg transform, which gives an alternative proof with fruitful results. Notice the formula

(11)   $t^\alpha = c(\alpha)\int_0^\infty (1 - e^{-\lambda^2t^2})\lambda^{-1-\alpha}\,d\lambda,$

where

(12)   $c(\alpha) = \left(\int_0^\infty (1 - e^{-\lambda^2})\lambda^{-1-\alpha}\,d\lambda\right)^{-1} > 0, \quad 0 < \alpha < 2,\ t \geq 0.$

Substituting t by $d_{ij}$ and noticing $\sum_i c_i = 0$ leads to

(13)   $\sum_{i,j} c_ic_jd_{ij}^\alpha = -c(\alpha)\int_0^\infty \sum_{i,j} c_ic_je^{-\lambda^2d_{ij}^2}\lambda^{-1-\alpha}\,d\lambda \leq 0$

for all $\alpha \in (0, 2)$. Then letting $\alpha$ approach the limit 2 gives condition (9).
A slightly more general formula than (11) gives the Schoenberg transform.
Definition 1.4.3 (Schoenberg Transform). The Schoenberg transform $\Phi: \mathbb{R}_+ \to \mathbb{R}_+$ is defined by

(14)   $\Phi(t) := \int_0^\infty \frac{1 - \exp(-\lambda t)}{\lambda}g(\lambda)\,d\lambda,$

where $g(\lambda)$ is some nonnegative measure on $[0, \infty)$ such that

$\int_0^\infty \frac{g(\lambda)}{\lambda}\,d\lambda < \infty.$

Sometimes we may want to transform a squared distance matrix into another squared distance matrix. The following theorem tells us that the Schoenberg transform (Sch38a; Sch38b) characterizes all such transformations between squared distance matrices.
Theorem 1.4.4 (Schoenberg Transform). Given a squared distance matrix D, let $C_{i,j} = \Phi(D_{i,j})$. Then
C is a squared distance matrix $\iff$ $\Phi$ is a Schoenberg transform.
Examples of Schoenberg transforms include
• $\Phi_0(t) = t$ with $g_0(\lambda) = \delta(\lambda)$;
• $\Phi_1(t) = \frac{1 - \exp(-at)}{a}$ with $g_1(\lambda) = \delta(\lambda - a)$ ($a > 0$);
• $\Phi_2(t) = \ln(1 + t/a)$ with $g_2(\lambda) = \exp(-a\lambda)$;
• $\Phi_3(t) = \frac{t}{a(a + t)}$ with $g_3(\lambda) = \lambda\exp(-a\lambda)$;
• $\Phi_4(t) = t^p$ ($p \in (0, 1)$) with $g_4(\lambda) = \frac{p}{\Gamma(1 - p)}\lambda^{-p}$.
For more examples, see (Bav11). The first one gives the identity transform, and the last one implies that for a distance function d, $\sqrt{d}$ is also a distance function but $d^2$ is not. To see this, take three points on a line, $x = 0$, $y = 1$, $z = 2$, where $d(x, y) = d(y, z) = 1$; then for $p > 1$, $d^p(x, z) = 2^p > d^p(x, y) + d^p(y, z) = 2$, which violates the triangle inequality. In fact, that $d^p$ ($p \in (0, 1)$) is a Euclidean distance function immediately implies the triangle inequality

$d^p(0, x + y) \leq d^p(0, x) + d^p(0, y).$
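As a hedged illustration of Theorem 1.4.4 (random data, illustrative parameters), one can apply, e.g., $\Phi_2(t) = \ln(1 + t/a)$ entrywise to a squared Euclidean distance matrix and check that the result is still conditionally negative definite, i.e. its negative two-sided centering stays positive semi-definite:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((25, 3))
D = ((X[:, None] - X[None, :]) ** 2).sum(-1)    # squared distance matrix

a = 2.0
C = np.log(1.0 + D / a)                         # Schoenberg transform Phi_2 applied entrywise

n = C.shape[0]
H = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * H @ C @ H.T
print(np.linalg.eigvalsh(B).min() >= -1e-10)    # True: C is again a squared distance matrix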
Note that a Schoenberg transform satisfies $\Phi(0) = 0$,

$\Phi'(t) = \int_0^\infty \exp(-\lambda t)g(\lambda)\,d\lambda \geq 0,$

$\Phi''(t) = -\int_0^\infty \exp(-\lambda t)\lambda g(\lambda)\,d\lambda \leq 0,$

and so on. In other words, $\Phi$ is a completely monotonic function, defined by $(-1)^n\Phi^{(n)}(x) \geq 0$, with the additional constraint $\Phi(0) = 0$. Schoenberg (Sch38a) showed the following result that connects positive definite and completely monotone functions.
Theorem 1.4.5 (Schoenberg, 1938). A function $\phi$ is completely monotone on $[0, \infty)$ if and only if $\phi(d^2)$ is positive definite and radial on $\mathbb{R}^k$ for all k.
Combining this with the Schoenberg transform, one shows that if d(x, y) is a Euclidean distance matrix, then $e^{-\lambda\Phi(d^2)}$ is positive definite for all $\lambda > 0$. Note that for a k-homogeneous function, $e^{-\lambda\Phi(tx)} = e^{-\lambda t^k\Phi(x)}$, so it suffices to check positive definiteness for $\lambda = 1$.

1.4.3. Reproducing Kernel Hilbert Spaces and Kernel PCA. Symmetric positive definite functions $k(x, y)$ are often called reproducing kernels (Aro50). In fact, the functions spanned by $k_x(\cdot) = k(x, \cdot)$ for $x \in \mathcal{X}$ make up a Hilbert space, where we can associate an inner product induced from $\langle k_x, k_{x'}\rangle = k(x, x')$.
To be precise, let $\mathcal{X} \subseteq \mathbb{R}^d$ be a compact Euclidean domain. Consider a Mercer kernel $K: \mathcal{X}\times\mathcal{X} \to \mathbb{R}$, i.e. a continuous symmetric real-valued function which is positive definite. A Mercer kernel K induces a function $K_x: \mathcal{X} \to \mathbb{R}$ ($x \in \mathcal{X}$) defined by $K_x(t) = K(x, t)$ for $t \in \mathcal{X}$. An inner product between two functions $K_x$ and $K_{x'}$ can be defined as the bilinear form $\langle K_x, K_{x'}\rangle_{H_K} = K(x, x')$ ($x, x' \in \mathcal{X}$), thanks to the positive definiteness of K. Now take the completion of $\mathrm{span}\{K_x: x \in \mathcal{X}\}$ with respect to the inner product given by the unique linear extension of this bilinear form, which yields a Hilbert space $H_K$, called the reproducing kernel Hilbert space (RKHS) associated with the Mercer kernel K. The most important property of an RKHS is the reproducing property: for all $f \in H_K$ and $x \in \mathcal{X}$, $f(x) = \langle f, K_x\rangle_{H_K}$. The norm of $H_K$ is denoted by $\|\cdot\|_{H_K}$.
Let $C(\mathcal{X})$ be the Banach space of continuous functions on $\mathcal{X}$ and $L^2_{\rho_{\mathcal{X}}}$ be the Hilbert space of square integrable functions on $\mathcal{X}$ with respect to the probability measure $\rho_{\mathcal{X}}$. Define a linear map $L_K: L^2_{\rho_{\mathcal{X}}} \to C(\mathcal{X})$ by $L_K(f)(x) = \int_{\mathcal{X}} K(x, t)f(t)\,d\rho_{\mathcal{X}}$. The restriction of $L_K$ to $H_K$ induces an operator $L_K|_{H_K}: H_K \to H_K$, which is called the covariance operator of $\rho_{\mathcal{X}}$ in $H_K$. Together with the inclusion $J: C(\mathcal{X}) \to L^2_{\rho_{\mathcal{X}}}$, $J\circ L_K: L^2_{\rho_{\mathcal{X}}} \to L^2_{\rho_{\mathcal{X}}}$ is a compact operator on $L^2_{\rho_{\mathcal{X}}}$ [e.g. see (HS78)]. When the domain is clear from the context, $J\circ L_K$ and $L_K|_{H_K}$ are also denoted by $L_K$, abusing the notation.
The compact operator $L_K: L^2_{\rho_{\mathcal{X}}} \to L^2_{\rho_{\mathcal{X}}}$ implies the existence of a discrete spectrum, i.e. an orthonormal eigensystem $(\lambda_k, \phi_k)_{k\in\mathbb{N}}$ such that $L_K\phi_k = \lambda_k\phi_k$. This leads to the following Mercer's Theorem, given in (? ) for $\mathcal{X} = [0, 1]$ and (? , p. 245) for $\mathcal{X} = [a, b]$, with the current format in (? ).
Theorem 1.4.6 (Mercer's Theorem). Let $\mathcal{X}$ be a compact domain or a manifold, $\rho_{\mathcal{X}}$ a Borel measure on $\mathcal{X}$, and $K: \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ a Mercer kernel. Let $\lambda_k$ be the k-th eigenvalue of $L_K$ and $\{\phi_k\}_{k\in\mathbb{N}}$ the corresponding eigenvectors. For all $x, t \in \mathcal{X}$,

(15)   $K(x, t) = \sum_{k=1}^\infty \lambda_k\phi_k(x)\phi_k(t),$

where the convergence is absolute (for each $x, t \in \mathcal{X}\times\mathcal{X}$) and uniform (on $\mathcal{X}\times\mathcal{X}$).


Define LrK : Lρ2X → Lρ2X by

LrK : XLρ2X →L X
2
ρX
(16) a k ϕk 7→ ak λrk ϕk
k k

1/2
In particular, LK : Lρ2X → HK is an isometrical isomorphism between the quo-
tient space Lρ2X / ker(LK ) and HK . For simplicity we assume that ker(LK ) = {0},
which happens when K is a universal kernel (Ste01) such that HK is dense in
1/2 −1/2 −1/2
Lρ2X . With LK , ⟨ϕk , ϕk′ ⟩HK = ⟨LK ϕk , LK ϕk′ ⟩Lρ2 = λ−1 k ⟨ϕk , ϕk ⟩Lρ2 ,

X X
whence (ϕk ) is a bi-orthogonal system in HK and Lρ2X . Some examples based on
spherical harmonics are given in (MNY06) and for further examples see (? Wah90).
2 ′ 2
For example, the radial basis function kλ (x, x′ ) = e−λd = e−λ∥x−x ∥ is a
universal kernel, often called Gaussian kernel or heat kernel in literature and has
been widely used in statistics and machine learning.
Reproducing Kernel Hilbert Spaces are universal in statistical learning of functions, in the sense that every Hilbert space H of functions on $\mathcal{X}$ with bounded evaluation functionals can be regarded as a reproducing kernel Hilbert space (Wah90). This is a result of the Riesz representation theorem: for every $x \in \mathcal{X}$ there exists $E_x \in H$ such that $f(x) = \langle f, E_x\rangle$. By boundedness of the evaluation functional, $|f(x)| \leq \|f\|\|E_x\|$, one can define a reproducing kernel $k(x, y) = \langle E_x, E_y\rangle$ which is bounded, symmetric and positive definite. It is called 'reproducing' because we can reproduce the function value using $f(x) = \langle f, k_x\rangle$, where $k_x(\cdot) := k(x, \cdot)$ is a function in H. Such a universal property makes RKHS a unified tool to study Hilbert function spaces in nonparametric statistics, including Sobolev spaces consisting of splines (Wah90).
Mercer's Theorem shows that the spectral decomposition of a Mercer kernel, a continuous positive definite function on a compact domain, renders an orthogonal basis adaptive to the probability measure $\rho_{\mathcal{X}}$. In practice, one can compute an empirical version of such a basis based on finite samples drawn according to $\rho_{\mathcal{X}}$. This is often known as kernel PCA (SSM98), or more precisely kernel MDS, in the following procedure.
Definition 1.4.4 (Kernel PCA/MDS). Given a data sample $\{x_i: i = 1,\ldots,n\}$ drawn independently and identically distributed from $\rho_{\mathcal{X}}$, the kernel matrix $K = (k(x_i, x_j): i, j = 1,\ldots,n)$ is a positive definite matrix. Then the following procedure gives a k-dimensional Euclidean embedding of the data.
(a) Find the top-k eigen-decomposition of the centered matrix $\hat{K} = HKH^T$, where $K = (k(x_i, x_j): i, j = 1,\ldots,n)$.
(b) Embed the data in the same way as classical MDS in Algorithm 2.
The multiplication by the Householder centering matrix H in (a) is to find orthogonal coordinates which are perpendicular to the constant vector, as in classical MDS. Yet if the constant vector is not an eigenvector of K, one can also directly apply the eigen-decomposition to the kernel matrix K without the multiplications by H, as in Mercer's Theorem. Details of such a kernel PCA/MDS procedure are summarized in Algorithm 3.

Algorithm 3: Kernel PCA/MDS Algorithm

Input: A positive definite kernel matrix $K = (k(x_i, x_j): i, j = 1,\ldots,n)$.
Output: Euclidean k-dimensional coordinates $Z_k \in \mathbb{R}^{k\times n}$ of data.
1. Compute $\hat{K} = HKH^T$, where H is the Householder centering matrix;
2. Compute the eigenvalue decomposition $\hat{K} = \hat{V}\hat{\Lambda}\hat{V}^T$ with $\hat{\Lambda} = \mathrm{diag}(\hat{\lambda}_1,\ldots,\hat{\lambda}_n)$ where $\hat{\lambda}_1 \geq \hat{\lambda}_2 \geq \ldots \geq \hat{\lambda}_n \geq 0$;
3. Choose the top k nonzero eigenvalues and corresponding eigenvectors, and set the embedding coordinates $Z_k = \hat{\Lambda}_k^{1/2}\hat{V}_k^T$, where $\hat{V}_k = [\hat{v}_1,\ldots,\hat{v}_k]$, $\hat{v}_k \in \mathbb{R}^n$, $\hat{\Lambda}_k = \mathrm{diag}(\hat{\lambda}_1,\ldots,\hat{\lambda}_k)$, with $\hat{\lambda}_1 \geq \hat{\lambda}_2 \geq \ldots \geq \hat{\lambda}_k \geq 0$.
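A minimal NumPy sketch of Algorithm 3 with a Gaussian kernel (the bandwidth and toy data are illustrative assumptions); sklearn.decomposition.KernelPCA offers an off-the-shelf implementation, see Section 1.5.3.

import numpy as np

def kernel_mds(X, k, lam=1.0):
    """Algorithm 3 with the Gaussian kernel k(x, x') = exp(-lam * ||x - x'||^2)."""
    n = X.shape[0]
    D2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
    K = np.exp(-lam * D2)                        # kernel matrix
    H = np.eye(n) - np.ones((n, n)) / n
    K_hat = H @ K @ H.T                          # centered kernel
    w, V = np.linalg.eigh(K_hat)
    w, V = w[::-1], V[:, ::-1]                   # descending eigenvalues
    return np.diag(np.sqrt(np.maximum(w[:k], 0))) @ V[:, :k].T   # k x n coordinates

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
Z = kernel_mds(X, k=2, lam=0.5)
print(Z.shape)                                   # (2, 50)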

In summary, given a finite set of pairwise distances $d_{ij}$, it is not hard to find a Euclidean embedding $x_i \in \mathbb{R}^p$ such that $\|x_i - x_j\| = d_{ij}$ for large enough p. For example, any n + 1 points can be isometrically embedded into $\mathbb{R}^n_\infty$ using the coordinates $(d_{i1}, d_{i2}, \ldots, d_{in})$ and the $\ell^\infty$-metric: $d_\infty(x_j, x_k) = \max_{i=1,\ldots,n}|d_{ij} - d_{ik}| = d_{jk}$, by the triangle inequality. However, the dimensionality p of such an embedding is large, p = n. Furthermore, we have seen that via the heat kernel $e^{-\lambda t^2}$ the points can even be embedded into Hilbert spaces of infinite dimension. Therefore dimensionality reduction is desired when p is large, with the best possible preservation of pairwise distances. A central idea in this chapter is the truncated spectral decomposition of a positive definite kernel matrix properly formulated from the data. We shall see that such an idea appears repeatedly in history, e.g. in Chapter 5 on manifold learning.

1.5. Lab and Further Studies


1.5.1. PCA. There are many packages for PCA, e.g. in R the built-in functions prcomp and princomp; in Python, scikit-learn includes the class sklearn.decomposition.PCA.
The following Python code shows an example of handwritten digit data analysis using sklearn.decomposition.PCA.
#!/usr/bin/python
# -*- coding: utf-8 -*-

"""
=========================================================
Principal Component Analysis (PCA): an example on dataset zip digit 3
=========================================================

The PCA does an unsupervised dimensionality reduction, as the best affine
k-space approximation of the Euclidean data.

"""
print(__doc__)

# Created by Yuan YAO, HKUST
# 6 Feb, 2017

import pandas as pd
import io
import requests

import numpy as np

# Load dataset as 16x16 gray scale images of handwritten zip code 3,
# of total number 657.

url = "https://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/zip.digits/train.3"
s = requests.get(url).content
c = pd.read_csv(io.StringIO(s.decode('utf-8')))
data = np.array(c, dtype='float32')
# data = np.array(pd.read_csv('train.3'), dtype='float32')
data.shape

# Reshape the data into images of 16x16 and show the image.
import matplotlib.pyplot as plt
img1 = np.reshape(data[1, :], (16, 16))
imgshow = plt.imshow(img1, cmap='gray')

img2 = np.reshape(data[39, :], (16, 16))
imgshow = plt.imshow(img2, cmap='gray')

# Now show the mean image.

mu = np.mean(data, axis=0)
img_mu = np.reshape(mu, (16, 16))
imgshow = plt.imshow(img_mu, cmap='gray')

##########################################
# PCA

from sklearn.decomposition import PCA

pca = PCA(n_components=50, svd_solver='arpack')
pca.fit(data)

print(pca.explained_variance_ratio_)

# Plot the 'explained_variance_ratio_'

plt.plot(pca.explained_variance_ratio_, "o", linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_ratio_')

# Principal components

Y = pca.components_
Y.shape

# Show the image of the 1st principal component (row 0 of components_)

img_pca1 = np.reshape(Y[0, :], (16, 16))
imgshow = plt.imshow(img_pca1, cmap='gray')

# Show the image of the 2nd principal component

img_pca2 = np.reshape(Y[1, :], (16, 16))
imgshow = plt.imshow(img_pca2, cmap='gray')

# Show the image of the 3rd principal component

img_pca3 = np.reshape(Y[2, :], (16, 16))
imgshow = plt.imshow(img_pca3, cmap='gray')

1.5.2. MDS of Cities. In Matlab, the command for computing classical MDS is cmdscale, short for Classical Multidimensional Scaling. For non-metric MDS, you may choose mdscale. In R, SMACOF is often used to solve non-metric MDS by finding a local optimizer. In the Python package scikit-learn, sklearn.manifold.MDS performs both metric and non-metric MDS.
Figure 4 shows an example of classical MDS. In this part, we show how to generate the embeddings of eight cities in Python.

# coding: utf-8

# # Multidimensional Scaling (MDS)
# In this example, we show the MDS embedding of some cities with their
# geodesic distances on earth.

# In[1]:

get_ipython().magic('matplotlib inline')

# In[2]:

import numpy as np
import matplotlib.pyplot as plt

# In[3]:

# Cities
cities = ["Beijing", "Shanghai", "Hongkong", "Tokyo", "Hawaii", "Seattle",
          "San Francisco", "Los Angeles"]
# cities = ["Boston", "New York", "Miami", "Chicago", "Seattle", "San Francisco", "Los Angeles"]

# ## Using coordinates as input for MDS

# In[4]:

# Fetch distances between two cities.

from geopy.geocoders import Nominatim
from geopy import distance
geolocator = Nominatim(user_agent="Safari")

def get_coordinates(city):
    loc = geolocator.geocode(city)
    return (loc.latitude, loc.longitude)

def get_distance(city1, city2):
    coo1 = get_coordinates(city1)
    coo2 = get_coordinates(city2)
    print("Getting geodesic distance between", city1, "and", city2 + ".")
    return distance.distance(coo1, coo2).km

# In[5]:

# Get geographic locations of cities

X = np.array([get_coordinates(city) for city in cities])
X

# In[6]:

# Use sklearn.manifold.MDS

from sklearn.manifold import MDS

# Classical (metric) MDS for embedding

mds = MDS(n_components=X.shape[0], dissimilarity='euclidean')
Y0 = mds.fit(X).embedding_
Y0.shape

# In[7]:

_, ax = plt.subplots()
ax.scatter(Y0[:, 0], -Y0[:, 1])

for i, txt in enumerate(cities):
    ax.annotate(txt, (Y0[i, 0], -Y0[i, 1]))

# ## Compute Euclidean distance matrix

# In[8]:

from sklearn.metrics import euclidean_distances

D0 = euclidean_distances(X)

# ## Use Euclidean distance matrix as input for self-defined MDS

# In[9]:

n = D0.shape[0]

# Squared distance matrix

D02 = np.square(D0)

# Compute centering matrix

H = np.identity(n) - 1 / n * np.ones((n, n))

# Compute K
K = -1/2 * np.matmul(np.matmul(H, D02), H.T)

# Compute eigenvalues and eigenvectors

evals, evecs = np.linalg.eigh(K)

# Sort eigenvalues in descending order

idx = np.argsort(evals)[::-1]
evals = evals[idx]
evecs = evecs[:, idx]

evals, evecs

# In[10]:

# Plot the eigenvalues

plt.plot(evals / evals.sum())
plt.savefig("cities-mymds-euc-evals.png", dpi=500, bbox_inches='tight')

# In[11]:

# Find k-dimensional embedding

k = 2

# Compute the coordinates using positive eigenvalued components only

ip, = np.where(evals > 0)

# Diag matrix of eigenvalues

Lambda = np.diag(np.sqrt(evals[ip]))
V = evecs[:, ip]
Y1 = V.dot(Lambda)
# Y = np.matmul(evecs[:k].T, Lambda)
# Y = evecs[:, :k].dot(L)

Y1

# In[23]:

# Embedding by top 2 eigenvectors

_, ax = plt.subplots()
ax.scatter(Y1[:, 0], Y1[:, 1])

for i, txt in enumerate(cities):
    ax.annotate(txt, (Y1[i, 0], Y1[i, 1]))

# ## Compute geodesic distance matrix

# In[13]:

# Compute D, the matrix of squares of distances

import pandas as pd
import itertools
D = pd.DataFrame(columns=cities, index=cities, dtype=float)
for pair in itertools.product(cities, cities):
    D.loc[pair[0], pair[1]] = get_distance(pair[0], pair[1]) ** 2

# In[14]:

# Show the distance matrix

D

# ## Using geodesic distance as input for self-defined MDS

# In[15]:

n = D.shape[0]

# Squared distance matrix
# (D already stores squared geodesic distances, so no further squaring is needed)

D2 = D.to_numpy()

# Compute centering matrix

H = np.identity(n) - 1 / n * np.ones((n, n))

# Compute K
K = -1/2 * np.matmul(np.matmul(H, D2), H.T)

# Compute eigenvalues and eigenvectors

evals, evecs = np.linalg.eigh(K)

# Sort eigenvalues in descending order

idx = np.argsort(evals)[::-1]
evals = evals[idx]
evecs = evecs[:, idx]

evals, evecs

# There are negative eigenvalues, hence this is not a Euclidean
# embeddable distance matrix.

# In[16]:

# Plot the eigenvalues

plt.plot(evals / evals.sum())
plt.savefig("cities-mymds-geo-evals.png", dpi=500, bbox_inches='tight')

# In[17]:

# Find k-dimensional embedding

k = 2

# Compute the coordinates using positive eigenvalued components only

ip, = np.where(evals > 0)

# Diag matrix of eigenvalues

Lambda = np.diag(np.sqrt(evals[ip]))
V = evecs[:, ip]
Y2 = V.dot(Lambda)

Y2

# In[22]:

# Embedding by top 2 eigenvectors

_, ax = plt.subplots()
ax.scatter(Y2[:, 0], Y2[:, 1])

for i, txt in enumerate(D.index):
    ax.annotate(txt, (Y2[i, 0], Y2[i, 1]))

# ## Using geodesic distance matrix for MDS

# In[19]:

# Use sklearn.manifold.MDS

from sklearn.manifold import MDS

# In[20]:

# Find MDS embedding

mds = MDS(n_components=X.shape[0], metric=True, dissimilarity="precomputed")
Y3 = mds.fit(D.to_numpy()).embedding_
Y3.shape

# In[24]:

# Embedding by top 2 eigenvectors

_, ax = plt.subplots()
ax.scatter(-Y3[:, 0], Y3[:, 1])

for i, txt in enumerate(D.index):
    ax.annotate(txt, (-Y3[i, 0], Y3[i, 1]))

1.5.3. Remarks. Kernel PCA is implemented in sklearn.decomposition.KernelPCA.
CHAPTER 2

Curse of Dimensionality: High Dimensional Statistics

We have seen that the sample mean and covariance in a high dimensional Euclidean space $\mathbb{R}^p$ are exploited in Principal Component Analysis (PCA) or its equivalent Multidimensional Scaling (MDS), which are the projections of high dimensional data onto the top singular vectors of the centered data matrix. In statistics, the sample mean and covariance are Fisher's Maximum Likelihood estimators based on multivariate Gaussian models. In classical statistics with the Law of Large Numbers, for fixed p when the sample size $n \to \infty$, the sample mean and covariance converge, and so does PCA. Although the sample mean $\hat{\mu}_n$ and sample covariance $\hat{\Sigma}_n$ are the most widely used statistics in multivariate data analysis, they may suffer from problems in high dimensional settings, e.g. the large p, small n scenario. In 1956, Stein (Ste56) showed that the sample mean is not the best estimator in terms of prediction measured by the mean square error, for p > 2; furthermore, in 2006 Johnstone (Joh06) showed by random matrix theory that PCA might be overwhelmed by random noise for a fixed ratio $p/n = \gamma$ when both $n, p \to \infty$. Among other works, these two pieces of excellent work inspired a long pursuit in modern high dimensional statistics of biased estimators with shrinkage or regularization, which trade variance for bias toward a reduced prediction error.

2.1. Maximum Likelihood Estimate of Mean and Covariance


Consider a statistical model $f(X|\theta)$ as a conditional probability function on $\mathbb{R}^p$ with parameter space $\theta \in \Theta$. Let $X_1, \ldots, X_n \in \mathbb{R}^p$ be independently and identically distributed (i.i.d.) samples according to $f(X|\theta_0)$ on $\mathbb{R}^p$ for some $\theta_0 \in \Theta$. The likelihood function is defined as the probability of observing the given data as a function of $\theta$,

$L(\theta) = \prod_{i=1}^n f(X_i|\theta),$

and a maximum likelihood estimator is defined as

$\hat{\theta}_n^{MLE} \in \arg\max_{\theta\in\Theta} L(\theta) = \arg\max_{\theta\in\Theta} \prod_{i=1}^n f(X_i|\theta),$

which is equivalent to

$\arg\max_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n \log f(X_i|\theta).$
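As a hedged one-dimensional illustration (not from the text), this definition can also be optimized numerically, e.g. maximizing the Gaussian log-likelihood over $(\mu, \log\sigma)$ with scipy; the simulated sample and starting point below are assumptions:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=500)       # i.i.d. sample, theta_0 = (2, 1.5)

def neg_log_likelihood(theta):
    mu, log_sigma = theta
    sigma2 = np.exp(2 * log_sigma)
    return 0.5 * np.sum((X - mu) ** 2) / sigma2 + 0.5 * len(X) * np.log(2 * np.pi * sigma2)

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)     # close to the sample mean and the (1/n) standard deviation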
The following example shows that the sample mean and covariance can be
derived from the maximum likelihood estimator under multivariate normal models
of data.

2.1.1. Example: Multivariate Normal Distribution. For example, consider the normal distribution N (µ, Σ),
$$f(X|\mu,\Sigma) = \frac{1}{\sqrt{(2\pi)^p|\Sigma|}}\exp\left(-\frac{1}{2}(X-\mu)^T\Sigma^{-1}(X-\mu)\right),$$
where |Σ| is the determinant of the covariance matrix Σ.

To get the MLE of the normal distribution, we need to solve
$$\max_{\mu,\Sigma} P(X_1,\dots,X_n|\mu,\Sigma) = \max_{\mu,\Sigma}\prod_{i=1}^n\frac{1}{\sqrt{(2\pi)^p|\Sigma|}}\exp\left[-\frac{1}{2}(X_i-\mu)^T\Sigma^{-1}(X_i-\mu)\right].$$
It is equivalent to maximize the log-likelihood
$$I = \log P(X_1,\dots,X_n|\mu,\Sigma) = -\frac{1}{2}\sum_{i=1}^n(X_i-\mu)^T\Sigma^{-1}(X_i-\mu) - \frac{n}{2}\log|\Sigma| + C.$$

The MLE of µ satisfies
$$0 = \frac{\partial I}{\partial\mu}\bigg|_{\hat\mu_n} = \sum_{i=1}^n\Sigma^{-1}(X_i-\hat\mu_n) \;\Rightarrow\; \hat\mu_n = \frac{1}{n}\sum_{i=1}^n X_i.$$
To get the estimate of Σ, we need to maximize
$$I(\Sigma) = -\frac{1}{2}\sum_{i=1}^n\mathrm{trace}\big[(X_i-\mu)^T\Sigma^{-1}(X_i-\mu)\big] - \frac{n}{2}\log|\Sigma| + C.$$
Using the cyclic property of trace,
$$-\frac{1}{2}\sum_{i=1}^n\mathrm{trace}\big[(X_i-\hat\mu_n)^T\Sigma^{-1}(X_i-\hat\mu_n)\big] = -\frac{1}{2}\sum_{i=1}^n\mathrm{trace}\big[\Sigma^{-1}(X_i-\hat\mu_n)(X_i-\hat\mu_n)^T\big] = -\frac{n}{2}\mathrm{trace}(\Sigma^{-1}\hat\Sigma_n) = -\frac{n}{2}\mathrm{trace}(\Sigma^{-1}\hat\Sigma_n^{1/2}\hat\Sigma_n^{1/2}) = -\frac{n}{2}\mathrm{trace}(\hat\Sigma_n^{1/2}\Sigma^{-1}\hat\Sigma_n^{1/2}) = -\frac{n}{2}\mathrm{trace}(S),$$
where
$$\hat\Sigma_n = \frac{1}{n}\sum_{i=1}^n(X_i-\hat\mu_n)(X_i-\hat\mu_n)^T,$$
and $S = \hat\Sigma_n^{1/2}\Sigma^{-1}\hat\Sigma_n^{1/2}$ is symmetric and positive definite. Above we repeatedly use the
cyclic property of trace:
• trace(AB) = trace(BA), or more generally,
• (invariance under the cyclic permutation group) trace(ABCD) = trace(BCDA) = trace(CDAB) = trace(DABC).

Then we have
$$\Sigma = \hat\Sigma_n^{1/2}S^{-1}\hat\Sigma_n^{1/2}, \qquad -\frac{n}{2}\log|\Sigma| = \frac{n}{2}\log|S| - \frac{n}{2}\log|\hat\Sigma_n| =: \frac{n}{2}\log|S| + f(\hat\Sigma_n),$$
where we use, for determinants of square matrices of equal size, det(AB) = |AB| = det(A) det(B) = |A| · |B|, and the term $f(\hat\Sigma_n)$ does not depend on S. Therefore,
$$\max_\Sigma I(\Sigma) \;\Longleftrightarrow\; \min_S\ \frac{n}{2}\mathrm{trace}(S) - \frac{n}{2}\log|S| + C,$$
where C is a constant. Suppose $S = U\Lambda U^T$ is the eigenvalue decomposition of S with Λ = diag(λi). Then
$$J = \frac{n}{2}\sum_{i=1}^p\lambda_i - \frac{n}{2}\sum_{i=1}^p\log\lambda_i + C, \qquad 0 = \frac{\partial J}{\partial\lambda_i} = \frac{n}{2} - \frac{n}{2}\frac{1}{\lambda_i} \;\Rightarrow\; \lambda_i = 1 \;\Rightarrow\; S = I_p.$$
This gives the MLE solution
$$\hat\Sigma_n^{MLE} = \frac{1}{n}\sum_{i=1}^n(X_i-\hat\mu_n)(X_i-\hat\mu_n)^T.$$
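As a quick numerical illustration of the closed forms above (not part of the derivation), the following minimal sketch draws a synthetic Gaussian sample and checks the MLE against numpy; the data, sizes and parameter values are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.multivariate_normal(mean=[1.0, -2.0, 0.5],
                            cov=np.array([[2.0, 0.3, 0.0],
                                          [0.3, 1.0, 0.2],
                                          [0.0, 0.2, 0.5]]),
                            size=n)               # X is n-by-p here

mu_hat = X.mean(axis=0)                           # sample mean = MLE of mu
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / n     # MLE of Sigma (divide by n, not n-1)

# numpy's np.cov uses the unbiased 1/(n-1) normalization unless bias=True
assert np.allclose(Sigma_hat, np.cov(X.T, bias=True))
print(mu_hat)
print(Sigma_hat)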

Under some regularity conditions, the maximum likelihood estimator $\hat\theta_n^{MLE}$ has
the following nice asymptotic properties as n → ∞.
A. (Consistency) $\hat\theta_n^{MLE}\to\theta_0$, in probability and almost surely.
B. (Asymptotic Normality) $\sqrt{n}(\hat\theta_n^{MLE}-\theta_0)\to N(0, I_0^{-1})$ in distribution,
where I0 is the Fisher information matrix
$$I(\theta_0) := E\left[\left(\frac{\partial}{\partial\theta}\log f(X|\theta_0)\right)^2\right] = -E\left[\frac{\partial^2}{\partial\theta^2}\log f(X|\theta_0)\right].$$
C. (Asymptotic Efficiency) $\lim_{n\to\infty} n\,\mathrm{cov}(\hat\theta_n^{MLE}) = I^{-1}(\theta_0)$. Hence $\hat\theta_n^{MLE}$ is
asymptotically the uniformly Minimum-Variance Unbiased Estimator, i.e. the estimator
with the least variance among the class of unbiased estimators: for any
unbiased estimator $\tilde\theta_n$, $\lim_{n\to\infty} n\,\mathrm{var}(\hat\theta_n^{MLE}) \le \lim_{n\to\infty} n\,\mathrm{var}(\tilde\theta_n)$.
We note that maximum likelihood estimators are generally biased estimators with a
small bias of order $E(\hat\theta)-\theta_0\sim O(1/n)$, which is smaller than the standard
deviation of order $O(1/\sqrt{n})$. Such results can be found in textbooks on classical
statistics, for example (EH16).
The asymptotic results all hold under the assumption of fixing p and taking
n → ∞, where the MLE satisfies $\hat\mu_n\to\mu$ and $\hat\Sigma_n\to\Sigma$. However, as we will see in the
following, $\hat\mu_n$ is not the best estimator for prediction when the dimension p of the
data gets large with a finite sample size n.

2.2. Stein’s Phenomenon and Shrinkage of Sample Mean


2.2.1. Risk and Bias-Variance Decomposition. To measure the prediction performance of an estimator $\hat\mu_n$, it is natural to consider the expected squared
loss in regression, i.e. given a response y = µ + ϵ with zero-mean noise Eϵ = 0,
$$E\|y - \hat\mu_n\|^2 = E\|\mu - \hat\mu_n + \epsilon\|^2 = E\|\mu - \hat\mu_n\|^2 + \mathrm{Var}(\epsilon), \qquad \mathrm{Var}(\epsilon) = E(\epsilon^T\epsilon).$$
Since Var(ϵ) is a constant for all estimators $\hat\mu$, one may simply look at the first
part, which is often called the risk in the literature,
$$R(\hat\mu_n,\mu) = E\,L(\hat\mu_n,\mu),$$
where the loss function here is the squared loss,
$$L(\hat\mu_n,\mu) = \|\hat\mu_n-\mu\|^2.$$
It is the mean square error (MSE) between µ and its estimator $\hat\mu_n$.
The risk or MSE enjoys the following important bias-variance decomposition, as
a result of the Pythagorean theorem:
$$R(\hat\mu_n,\mu) = E\|\hat\mu_n - E[\hat\mu_n] + E[\hat\mu_n] - \mu\|^2 = \underbrace{E\|\hat\mu_n - E[\hat\mu_n]\|^2}_{\mathrm{Var}(\hat\mu_n)} + \underbrace{\|E[\hat\mu_n]-\mu\|^2}_{\mathrm{Bias}(\hat\mu_n)}.$$
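As a sanity check of this decomposition (an illustrative sketch, not part of the text's argument), the following Monte Carlo verifies MSE = Var + Bias for a simple shrinkage estimator $\hat\mu = cY$; the constant c and all numbers are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
p, c = 20, 0.7                          # dimension and (arbitrary) shrinkage factor
mu = rng.normal(size=p)                 # fixed true mean
Y = mu + rng.normal(size=(100000, p))   # many replications of Y ~ N(mu, I_p)
mu_hat = c * Y

mse = np.mean(np.sum((mu_hat - mu) ** 2, axis=1))
var = np.mean(np.sum((mu_hat - mu_hat.mean(axis=0)) ** 2, axis=1))
bias2 = np.sum((mu_hat.mean(axis=0) - mu) ** 2)
print(mse, var + bias2)                 # the two numbers agree up to Monte Carlo error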

Consider the multivariate Gaussian model: let X1 , . . . , Xn ∼ N (µ, Σ), Xi ∈ Rp (i = 1, . . . , n); then the maximum likelihood estimators (MLE) of the parameters µ and Σ are
$$\hat\mu_n^{MLE} = \frac{1}{n}\sum_{i=1}^n X_i, \qquad \hat\Sigma_n^{MLE} = \frac{1}{n}\sum_{i=1}^n(X_i-\hat\mu_n)(X_i-\hat\mu_n)^T.$$
For simplicity, take a coordinate transform (PCA) $Y_i = U^TX_i$ where $\Sigma = U\Lambda U^T$
is an eigen-decomposition of the population covariance matrix Σ. Assume that
Λ = σ²Ip and n = 1; then it suffices to consider Y ∼ N (µ, σ²Ip) in the sequel. In
this case $\hat\mu^{MLE} = Y$.
The following example shows the bias and variance of the MLE.
Example 1. For the simple case Yi ∼ N (µ, σ²Ip) (i = 1, . . . , n), the MLE satisfies
$$\mathrm{Bias}(\hat\mu_n^{MLE}) = 0 \quad\text{and}\quad \mathrm{Var}(\hat\mu_n^{MLE}) = \frac{p}{n}\sigma^2.$$
In particular, for n = 1, $\mathrm{Var}(\hat\mu^{MLE}) = \sigma^2 p$ for $\hat\mu^{MLE} = Y$.
Example 2 (MSE of Linear Estimators). Consider Y ∼ N (µ, σ²Ip) and the linear
estimator $\hat\mu_C = CY$. Then we have
$$\mathrm{Bias}(\hat\mu_C) = \|(I-C)\mu\|^2$$
and
$$\mathrm{Var}(\hat\mu_C) = E[(CY-C\mu)^T(CY-C\mu)] = E[\mathrm{trace}((Y-\mu)^TC^TC(Y-\mu))] = \sigma^2\,\mathrm{trace}(C^TC).$$

Linear estimators include an important case, the ridge regression (also known as
Tikhonov regularization in applied mathematics) with $C = X(X^TX+\lambda I)^{-1}X^T$,
(17) $$\min_\beta\ \frac{1}{2}\|Y - X\beta\|^2 + \frac{\lambda}{2}\|\beta\|^2, \quad \lambda > 0.$$
For simplicity, one may restrict the discussion to diagonal linear estimators
C = diag(ci) (up to a change of orthonormal basis for ridge regression), whose
risk is
$$R(\hat\mu_C,\mu) = \sigma^2\sum_{i=1}^p c_i^2 + \sum_{i=1}^p(1-c_i)^2\mu_i^2.$$
In this case, it is simple to find the minimax risk over the hyper-rectangular model class
|µi| ≤ τi,
$$\inf_{c_i}\sup_{|\mu_i|\le\tau_i}R(\hat\mu_C,\mu) = \sum_{i=1}^p\frac{\sigma^2\tau_i^2}{\sigma^2+\tau_i^2}.$$
From here one can see that for those sparse model classes such that #{i : τi =
O(σ)} = k ≪ p, it is possible to get smaller risk using linear estimators than MLE.
In general, is it possible to introduce some biased estimator which significantly
reduces the variance such that the total risk is smaller than that of the MLE uniformly for all
µ? This is the notion of inadmissibility introduced by Charles Stein in 1956, and he
found the answer to be YES by presenting the James-Stein estimators, as shrinkage
of the sample mean.
of sample means.
Definition 2.2.1 (Inadmissible). An estimator $\hat\mu_n$ of the parameter µ is called
inadmissible on Rp with respect to the squared risk if there exists another estimator $\mu^*_n$ such that
$$E\|\mu^*_n - \mu\|^2 \le E\|\hat\mu_n-\mu\|^2 \quad\text{for all } \mu\in\mathbb{R}^p,$$
and there exists µ0 ∈ Rp such that
$$E\|\mu^*_n - \mu_0\|^2 < E\|\hat\mu_n-\mu_0\|^2.$$
In this case, we also say that $\mu^*_n$ dominates $\hat\mu_n$. Otherwise, the estimator $\hat\mu_n$ is
called admissible.
The notion of inadmissibility or dominance introduces a partial order on the
set of estimators, where admissible estimators are local optima in this partial order.
Stein (1956) (Ste56) found that if p ≥ 3, then the MLE estimator $\hat\mu_n$ is inadmissible. This property is known as Stein's phenomenon, which can
be described as follows:
For p ≥ 3, there exists $\hat\mu$ such that for all µ ∈ Rp,
$$R(\hat\mu,\mu) < R(\hat\mu^{MLE},\mu),$$
which makes the MLE inadmissible.
A typical choice is the James-Stein estimator.
Example 3 (James-Stein Estimator). Charles Stein showed in 1956 that the MLE is
inadmissible, while the following original form of the James-Stein estimator was demonstrated by his student Willard James in 1961. Bradley Efron (Efr10) summarizes

Figure 1. Comparison of risks between Maximum Likelihood Estimators and James-Stein Estimators with Xi ∼ N (0, Ip) (i = 1, . . . , N), where p = N = 100. (Left panel: µi generated from Normal(0,1); right panel: µi generated from Uniform[0,1]; boxplots of the error ∥µ̂ − µ∥²/N for err_MLE and err_JSE.)

the history and gives a simple derivation of these estimators from an Empirical
Bayes point of view.
(18) $$\hat\mu^{JS_0} = \left(1 - \frac{\sigma^2(p-2)}{\|\hat\mu^{MLE}\|^2}\right)\hat\mu^{MLE},$$
where $\hat\mu^{MLE} = X$ for n = 1. Such an estimator shrinks each component of $\hat\mu^{MLE}$
toward 0. However, one can shrink it toward other points such as the mean component $\bar X = \sum_{i=1}^p X_i/p$; with $S = S(z) := \sum_{i=1}^p(X_i-\bar X)^2$,
(19) $$\hat\mu_i^{JS} = \bar X + \left(1 - \frac{\sigma^2(p-3)}{S}\right)(\hat\mu_i^{MLE} - \bar X),$$
at the sacrifice of requiring p ≥ 4 to reach the same phenomenon. To avoid the issue of
a negative sign, we can define the positive part of the James-Stein estimator,
(20) $$\hat\mu_i^{JS+} = \bar X + \left(1 - \frac{\sigma^2(p-3)}{S}\right)_+(\hat\mu_i^{MLE} - \bar X),$$
where $(x)_+$ takes the positive part of x if x > 0 and zero otherwise. The James-Stein
estimator can be written as a multitask ridge regression:
(21) $$(\hat\mu_i,\hat\mu) := \arg\min_{\mu_i,\mu}\sum_{i=1}^p\big[(\mu_i-X_i)^2 + \lambda(\mu_i-\mu)^2\big].$$
Taking λ = σ²(p − 3)/(S − σ²(p − 3)) gives $\hat\mu^{JS}$; taking λ = min(S, σ²(p − 3))/(S −
min(S, σ²(p − 3))) (with the convention 1/0 = 0) gives $\hat\mu^{JS+}$.
Theorem 2.2.1. Suppose Y ∼ Np(µ, I) and $\hat\mu^{MLE} = Y$, with risk $R(\hat\mu,\mu) = E_\mu\|\hat\mu-\mu\|^2$. Define
$$\hat\mu^{JS} = \left(1 - \frac{p-2}{\|Y\|^2}\right)Y.$$
Then
$$R(\hat\mu^{JS},\mu) < R(\hat\mu^{MLE},\mu).$$

Figure 1 shows some simulations where James-Stein Estimator dominates Max-


imum Likelihood Estimator. Table 1 gives a real world example by Bradley Efron
((Efr10), Chap 1, Table 1.1.) showing that JSEs improve MLE, yet for some ex-
treme case like Clemente, JSE may suffer from over-shrinkage toward the average.

Table 1. Efron's batting example. There are p = 18 players and
n = 45 samples in the early part of the 1970 season. $\hat\mu^{MLE}$ is obtained
from the mean hits in these early games, while µ is obtained by
averaging over the remainder of the season. $\hat\mu^{JS_0}$ takes the shrinkage toward 0 (Eq. (18)), and $\hat\mu^{JS}$ takes the shrinkage toward the
average X̄ (Eq. (19)). Both forms of JS-estimators improve the MLE
by reducing the mean square error in prediction, while the latter
enjoys a much more noticeable improvement than the former.

Name    hits/AB    $\hat\mu_i^{(MLE)}$    $\mu_i$    $\hat\mu_i^{(JS_0)}$    $\hat\mu_i^{(JS)}$
Clemente 18/45 0.4 0.346 0.378 0.294
F.Robinson 17/45 0.378 0.298 0.357 0.289
F.Howard 16/45 0.356 0.276 0.336 0.285
Johnstone 15/45 0.333 0.222 0.315 0.28
Berry 14/45 0.311 0.273 0.294 0.275
Spencer 14/45 0.311 0.27 0.294 0.275
Kessinger 13/45 0.289 0.263 0.273 0.27
L.Alvarado 12/45 0.267 0.21 0.252 0.266
Santo 11/45 0.244 0.269 0.231 0.261
Swoboda 11/45 0.244 0.23 0.231 0.261
Unser 10/45 0.222 0.264 0.21 0.256
Williams 10/45 0.222 0.256 0.21 0.256
Scott 10/45 0.222 0.303 0.21 0.256
Petrocelli 10/45 0.222 0.264 0.21 0.256
E.Rodriguez 10/45 0.222 0.226 0.21 0.256
Campaneris 9/45 0.2 0.286 0.189 0.252
Munson 8/45 0.178 0.316 0.168 0.247
Alvis 7/45 0.156 0.2 0.147 0.242
Mean Square Error - 0.075545 - 0.072055 0.021387

Next we outline the proof of such results. First of all, we’ll prove a useful
lemma.

2.2.2. Stein’s Unbiased Risk Estimates (SURE). Discussions below are


all under the assumption that Y ∼ Np (µ, I).
Lemma 2.2.2 (Stein's Unbiased Risk Estimate (SURE)). Suppose $\hat\mu = Y + g(Y)$, where g satisfies¹
(1) g is weakly differentiable;
(2) $\sum_{i=1}^p\int|\partial_i g_i(x)|\,dx < \infty$.
Then
(22) $$R(\hat\mu,\mu) = E_\mu\big(p + 2\nabla^Tg(Y) + \|g(Y)\|^2\big),$$
where $\nabla^Tg(Y) := \sum_{i=1}^p\frac{\partial}{\partial y_i}g_i(Y)$.

¹cf. p. 38, Prop. 2.4 [GE]



Example 4 (Examples of g(x)). For the James-Stein estimator,
$$g(Y) = -\frac{p-2}{\|Y\|^2}Y,$$
and for soft-thresholding, each component is
$$g_i(x) = \begin{cases} -\lambda & x_i > \lambda, \\ -x_i & |x_i|\le\lambda, \\ \lambda & x_i < -\lambda. \end{cases}$$
Both of them are weakly differentiable. But hard-thresholding,
$$g_i(x) = \begin{cases} 0 & |x_i| > \lambda, \\ -x_i & |x_i|\le\lambda, \end{cases}$$
is not weakly differentiable!
This lemma is in fact Stein's lemma in Tsybakov's book (Tsy09, p. 157-158). Now we present its proof.
Proof. Let ϕ(y) be the density function of the standard normal distribution Np(0, I). Then
$$R(\hat\mu,\mu) = E_\mu\|Y + g(Y) - \mu\|^2 = E_\mu\big(p + 2(Y-\mu)^Tg(Y) + \|g(Y)\|^2\big),$$
and
$$E_\mu(Y-\mu)^Tg(Y) = \sum_{i=1}^p\int_{-\infty}^{\infty}(y_i-\mu_i)g_i(Y)\phi(Y-\mu)\,dY = \sum_{i=1}^p\int_{-\infty}^{\infty}-g_i(Y)\frac{\partial}{\partial y_i}\phi(Y-\mu)\,dY \quad\text{(derivative of the Gaussian)}$$
$$= \sum_{i=1}^p\int_{-\infty}^{\infty}\frac{\partial}{\partial y_i}g_i(Y)\,\phi(Y-\mu)\,dY \quad\text{(integration by parts)} \;=\; E_\mu\nabla^Tg(Y),$$
which gives (22). □


Now for convenience, define
(23) $$U(Y) := p + 2\nabla^Tg(Y) + \|g(Y)\|^2,$$
so that $R(\hat\mu,\mu) = E_\mu U(Y)$.
2.2.3. Risk of Linear Estimator. Consider the linear estimator $\hat\mu_C(Y) = CY = Y + g(Y)$, where
$$g(Y) = (C-I)Y \;\Rightarrow\; \nabla^Tg(Y) = \sum_i\frac{\partial}{\partial y_i}\big((C-I)Y\big)_i = \mathrm{trace}(C) - p.$$
Therefore,
$$U(Y) = p + 2\nabla^Tg(Y) + \|g(Y)\|^2 = p + 2(\mathrm{trace}(C)-p) + \|(I-C)Y\|^2 = -p + 2\,\mathrm{trace}(C) + \|(I-C)Y\|^2,$$
which gives the unbiased risk estimate
$$U(Y) = \|(I-C(\lambda))Y\|^2 - p + 2\,\mathrm{trace}(C(\lambda)), \qquad E_\mu U(Y) = R(\hat\mu_C,\mu).$$
In applications, C = C(λ) often depends on some regularization parameter λ (e.g.
ridge regression), so one can find an optimal λ* by minimizing this unbiased risk estimate over λ.
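As a small illustration of this tuning strategy (a sketch only, not the text's procedure), the code below evaluates the unbiased risk estimate U(Y) for the simple diagonal shrinker C(λ) = (1+λ)⁻¹I and picks the minimizing λ; the choice of C(λ), the sparse mean and all constants are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(2)
p = 50
mu = np.concatenate([3.0 * np.ones(5), np.zeros(p - 5)])   # an illustrative sparse-ish mean
Y = mu + rng.normal(size=p)                                # Y ~ N(mu, I_p), sigma^2 = 1

# U(Y) = -p + 2*trace(C) + ||(I - C)Y||^2 with C = I/(1+lambda)
lambdas = np.linspace(0.0, 5.0, 201)
sure = np.array([-p + 2 * p / (1 + lam) + (lam / (1 + lam)) ** 2 * np.sum(Y ** 2)
                 for lam in lambdas])
loss = np.array([np.sum((Y / (1 + lam) - mu) ** 2) for lam in lambdas])  # loss for this draw

lam_star = lambdas[np.argmin(sure)]
print("lambda minimizing SURE:", lam_star)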

2.2.4. Risk of James-Stein Estimator. Recall the James-Stein estimator (18),
$$\hat\mu^{JS_0} = \left(1 - \frac{p-2}{\|Y\|^2}\right)Y = (I + g(Y))Y \;\Rightarrow\; g(Y) = -\frac{p-2}{\|Y\|^2}Y \;\Rightarrow\; \|g(Y)\|^2 = \frac{(p-2)^2}{\|Y\|^2},$$
and
$$\nabla^Tg(Y) = \sum_i\frac{\partial}{\partial y_i}\left(-\frac{p-2}{\|Y\|^2}y_i\right) = -\frac{(p-2)^2}{\|Y\|^2}.$$
Therefore, plugging them into $U(Y) = p + 2\nabla^Tg(Y) + \|g(Y)\|^2$, we have Stein's unbiased risk estimate
$$R(\hat\mu^{JS},\mu) = E\,U(Y) = p - E_\mu\frac{(p-2)^2}{\|Y\|^2} < p = R(\hat\mu^{MLE},\mu)$$
when p ≥ 3.
Remark. What goes wrong when p = 1 or 2? In fact, the maximum likelihood estimator $\hat\mu^{MLE} = Y$ for Y ∼ N (µ, Ip) is known to be admissible for p = 1, 2; see e.g.
Lehmann and Casella (? , Ch. 5). The following theorem characterizes the admissibility of
linear estimators.
Theorem 2.2.3 (Theorem 2.6 in (? )). Let Y ∼ N (µ, I) and $\hat\mu = CY$. Then $\hat\mu$ is admissible if and only if
(1) C is symmetric;
(2) 0 ≤ ρi(C) ≤ 1 for all eigenvalues ρi(C);
(3) ρi(C) = 1 for at most two i.
To find an upper bound of the risk of the James-Stein estimator, notice that for
Y ∼ N (µ, I), $\|Y\|^2 = \sum_{i=1}^pY_i^2 \sim \chi^2(\|\mu\|^2, p)$, a noncentral χ²-distribution with
noncentrality parameter ∥µ∥² and p degrees of freedom, which can be viewed as a
Poisson-weighted mixture of central χ²-distributions. In fact, suppose that a random variable J has a Poisson distribution with mean ∥µ∥²/2, and the conditional
distribution of Z given J = i is χ² with p + 2i degrees of freedom. Then the
unconditional distribution of Z is noncentral χ² with p degrees of freedom and
noncentrality parameter ∥µ∥² (see e.g. (? , p. 132)), i.e.
$$\chi^2(\|\mu\|^2, p) \overset{d}{=} \chi^2(0, p+2J), \qquad J\sim\mathrm{Poisson}\left(\frac{\|\mu\|^2}{2}\right).$$
Hence we have
$$E_\mu\left[\frac{1}{\|Y\|^2}\right] = E\,E_\mu\left[\frac{1}{\|Y\|^2}\,\bigg|\,J\right] = E\,\frac{1}{p+2J-2} \ge \frac{1}{p+2EJ-2} \quad\text{(by Jensen's inequality)} = \frac{1}{p+\|\mu\|^2-2}.$$
This gives the following result.
Proposition 2.2.4 (Upper bound of MSE for the James-Stein Estimator). For
Y ∼ N (µ, Ip),
$$R(\hat\mu^{JS},\mu) \le p - \frac{(p-2)^2}{p-2+\|\mu\|^2} = 2 + \frac{(p-2)\|\mu\|^2}{p-2+\|\mu\|^2}.$$
Using the inequality
$$\frac{ab}{a+b} \le a\wedge b,$$
it gives the upper bound
$$R(\hat\mu^{JS},\mu) \le 2 + \min(p-2, \|\mu\|^2).$$
Therefore for $\|\mu\|^2 \ll p$, the risk of the James-Stein estimator is dominated by 2 + ∥µ∥²,
which approaches 2 as ∥µ∥ → 0. In comparison with the risk of the MLE,
$R(\hat\mu^{MLE},\mu) = p$, the James-Stein estimator clearly wins by a large gap in high dimensions.
This is illustrated in Figure 2.
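A quick Monte Carlo check of Proposition 2.2.4 (an illustrative sketch; the grid of signal strengths, dimension and replication count are arbitrary choices, not from the text):

import numpy as np

rng = np.random.default_rng(3)
p, nrep = 10, 20000
for norm_mu in [0.0, 1.0, 3.0, 10.0]:
    mu = np.zeros(p); mu[0] = norm_mu            # any mu with the given norm works
    Y = mu + rng.normal(size=(nrep, p))
    mu_js = (1 - (p - 2) / np.sum(Y ** 2, axis=1, keepdims=True)) * Y
    risk_js = np.mean(np.sum((mu_js - mu) ** 2, axis=1))
    bound = 2 + min(p - 2, norm_mu ** 2)
    print(f"||mu||={norm_mu:5.1f}  JS risk={risk_js:6.3f}  bound={bound:6.3f}  MLE risk={p}")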
2.2.5. Risk of Soft-thresholding. Using Stein's unbiased risk estimate with the soft-thresholding estimator
$$\hat\mu(x) = x + g(x), \qquad \frac{\partial}{\partial x_i}g_i(x) = -I(|x_i|\le\lambda),$$
we then have
$$E_\mu\|\hat\mu_\lambda - \mu\|^2 = E_\mu\left(p - 2\sum_{i=1}^pI(|x_i|\le\lambda) + \sum_{i=1}^p x_i^2\wedge\lambda^2\right) \le 1 + (2\log p + 1)\sum_{i=1}^p\mu_i^2\wedge 1 \quad\text{if we take }\lambda=\sqrt{2\log p}.$$
By using the inequality
$$\frac{1}{2}(a\wedge b) \le \frac{ab}{a+b} \le a\wedge b,$$
we can compare the risks of soft-thresholding and the James-Stein estimator:
$$1 + (2\log p + 1)\sum_{i=1}^p(\mu_i^2\wedge 1) \quad\lessgtr\quad 2 + c\left(\sum_{i=1}^p\mu_i^2\wedge p\right), \qquad c\in(1/2,1).$$
On the LHS, the risk for each µi is bounded by 1, so if µ is sparse (s = #{i : µi ≠ 0})
but large in magnitude (s.t. ∥µ∥₂² ≥ p), we may expect LHS = O(s log p) < O(p) =
RHS (see also (? , p. 43)).

Figure 2. Illustration of the risk comparison between James-Stein (JS) and MLE for p = 10 (risk R versus ∥µ∥), courtesy of Iain M. Johnstone.

2.2.6. Discussion. Stein's phenomenon first showed that in high dimensional
estimation, shrinkage may lead to better performance than the MLE, the sample mean.
This opened a new era for modern high dimensional statistics. In fact, the discussions
above study independent random variables in p-dimensional space, where concentration of
measure provides some prior knowledge about the estimator distribution: samples
concentrate around a certain point. Shrinkage toward such a point may naturally
lead to better performance.
However, after Stein's phenomenon was first proposed in 1956, for many years
researchers did not find the expected revolution in practice, mostly because
Stein-type estimators are too complicated in real applications and only very small
gains can be achieved in many cases. Researchers struggled to show real application
examples where one can benefit greatly from Stein's estimators. For example, Efron-Morris (1974) showed three examples where the JS-estimator significantly improves the
multivariate estimation. On the other hand, deeper understanding of shrinkage-type estimators has been pursued from various aspects in statistics.
The James-Stein estimator can be written as a multitask ridge regression:
$$(\hat\mu_i,\hat\mu) := \arg\min_{\mu_i,\mu}\sum_{i=1}^p\big[(\mu_i-X_i)^2 + \lambda(\mu_i-\mu)^2\big].$$

Taking λ = σ²(p − 3)/(S − σ²(p − 3)) gives $\hat\mu^{JS}$; taking λ = min(S, σ²(p − 3))/(S −
min(S, σ²(p − 3))) (with the convention 1/0 = 0) gives $\hat\mu^{JS+}$. So they are linear shrinkage estimators.
Soft-thresholding, a nonlinear shrinkage estimator, was introduced when LASSO-type estimators, proposed by Rob Tibshirani (Tib96) and also called Basis Pursuit by Donoho
et al. (CDS98), were studied around 1996. This brought sparsity and
ℓ1-regularization into the central theme of high dimensional statistics and led to
a new type of nonlinear shrinkage estimator, soft-thresholding. For example, consider
$$\min_{\hat\mu} I = \min_{\hat\mu}\ \frac{1}{2}\|\hat\mu-\mu\|^2 + \lambda\|\hat\mu\|_1.$$
Subgradients of I over $\hat\mu$ lead to the soft-thresholding,
$$0\in\partial_{\hat\mu_j}I = (\hat\mu_j-\mu_j) + \lambda\,\mathrm{sign}(\hat\mu_j) \;\Rightarrow\; \hat\mu_j = \mathrm{sign}(\mu_j)(|\mu_j|-\lambda)_+,$$
where the set-valued map sign(x) = 1 if x > 0, sign(x) = −1 if x < 0, and
sign(x) = [−1, 1] if x = 0, is the subgradient of the absolute value function |x|. Under this
new framework shrinkage estimators reached a new peak with a ubiquitous spread
in data analysis with high dimensionality.
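A direct implementation of the coordinatewise soft-thresholding rule derived above (a minimal sketch; the test values are arbitrary):

import numpy as np

def soft_threshold(mu, lam):
    """Return sign(mu) * max(|mu| - lam, 0), applied coordinatewise."""
    return np.sign(mu) * np.maximum(np.abs(mu) - lam, 0.0)

mu = np.array([-3.0, -0.5, 0.0, 0.4, 2.0])
print(soft_threshold(mu, lam=1.0))   # [-2. -0.  0.  0.  1.]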
In addition to the ℓ1 penalty in LASSO, there are also other penalty functions, like:
• λ∥β∥0: this leads to hard-thresholding when X = I; solving this problem is normally NP-hard.
• λ∥β∥p, 0 < p < 1: non-convex, also NP-hard.
• λ Σ ρ(βi), such that
(1) ρ′(0) is singular (for sparsity in variable selection);
(2) ρ′(∞) = 0 (for unbiasedness in parameter estimation).
Such ρ must essentially be non-convex (FL01).
In Section 3.5, we also introduce a new type of dynamic regularization path of
shrinkage estimators, called the inverse scale space method developed in applied
mathematics (OBG+ 05; ORX+ 16).

2.2.7. Appendix: Deriving the James-Stein Estimator with Details. In
the following, we provide some details on looking for a function g such that the risk
of the estimator $\hat\mu_n(Y) = (1-g(Y))Y$ is smaller than that of the MLE for $Y\sim N(\mu,\varepsilon^2I_p)$.
See (Tsy09) for more details.
First of all, compute the risk $R(\hat\mu_n)$ by
$$E\|\hat\mu_n-\mu\|^2 = \sum_{i=1}^pE\big[((1-g(y))y_i-\mu_i)^2\big] = \sum_{i=1}^p\big\{E[(y_i-\mu_i)^2] + 2E[(\mu_i-y_i)g(y)y_i] + E[y_i^2g(y)^2]\big\}.$$
Suppose now that the function g is such that the assumptions of Stein's Lemma 2.2.5
hold (Lemma 3.6 in (Tsy09)), i.e. it is weakly differentiable.
Lemma 2.2.5 (Stein's lemma). Suppose that a function f : Rp → R satisfies:
(1) f (u1 , . . . , up ) is absolutely continuous in each coordinate ui for almost all
values (with respect to the Lebesgue measure on Rp−1) of the other coordinates (uj , j ≠ i);

(2) $E\left|\dfrac{\partial f(y)}{\partial y_i}\right| < \infty$, i = 1, . . . , p.
Then
$$E[(\mu_i - y_i)f(y)] = -\varepsilon^2E\left[\frac{\partial f}{\partial y_i}(y)\right], \quad i = 1,\ldots,p.$$
With Stein's lemma applied to f(y) = g(y)yi, therefore
$$E[(\mu_i-y_i)g(y)y_i] = -\varepsilon^2E\left[g(y) + y_i\frac{\partial g}{\partial y_i}(y)\right],$$
and with
$$E[(y_i-\mu_i)^2] = \varepsilon^2,$$
we have
$$E[(\hat\mu_{n,i}-\mu_i)^2] = \varepsilon^2 - 2\varepsilon^2E\left[g(y) + y_i\frac{\partial g}{\partial y_i}(y)\right] + E[y_i^2g(y)^2].$$
Summing over i gives
$$E\|\hat\mu_n-\mu\|^2 = \underbrace{p\varepsilon^2}_{=:R(\hat\mu_n^{MLE})=E\|\hat\mu_n^{MLE}-\mu\|^2} + E[W(y)],$$
with
$$W(y) = -2p\varepsilon^2g(y) - 2\varepsilon^2\sum_{i=1}^p y_i\frac{\partial g}{\partial y_i}(y) + \|y\|^2g(y)^2.$$
The risk of $\hat\mu_n$ is smaller than that of the MLE $\hat\mu_n^{MLE}$ if we choose g such that
E[W(y)] < 0.
In order to satisfy this inequality, we can search for g among the functions of
the form
$$g(y) = \frac{b}{a+\|y\|^2}$$
with appropriately chosen constants a ≥ 0, b > 0. Therefore, W(y) can be
written as
$$W(y) = -2p\varepsilon^2\frac{b}{a+\|y\|^2} + 2\varepsilon^2\sum_{i=1}^p\frac{2by_i^2}{(a+\|y\|^2)^2} + \frac{b^2\|y\|^2}{(a+\|y\|^2)^2}$$
$$= \frac{1}{a+\|y\|^2}\left(-2pb\varepsilon^2 + \frac{4b\varepsilon^2\|y\|^2}{a+\|y\|^2} + \frac{b^2\|y\|^2}{a+\|y\|^2}\right) \le (-2pb\varepsilon^2 + 4b\varepsilon^2 + b^2)\frac{1}{a+\|y\|^2}, \quad\text{since } \|y\|^2\le a+\|y\|^2 \text{ for } a\ge 0,$$
$$= \frac{Q(b)}{a+\|y\|^2}, \qquad Q(b) = b^2 - 2pb\varepsilon^2 + 4b\varepsilon^2.$$
The minimizer in b of the quadratic function Q(b) is equal to
$$b_{opt} = \varepsilon^2(p-2),$$
for which the bound on W(y) satisfies
$$W_{\min}(y) \le -\frac{b_{opt}^2}{a+\|y\|^2} = -\frac{\varepsilon^4(p-2)^2}{a+\|y\|^2} < 0.$$

Note that when b ∈ (b1 , b2 ), i.e. between the two roots of Q(b),
$$b_1 = 0, \qquad b_2 = 2\varepsilon^2(p-2),$$
we have W(y) < 0, which may lead to other estimators having smaller mean square
error than the MLE estimator.
When a = 0, the function g and the estimator $\hat\mu_n = (1-g(y))y$ associated with
this choice of g are given by
$$g(y) = \frac{\varepsilon^2(p-2)}{\|y\|^2}, \qquad \hat\mu_n = \left(1 - \frac{\varepsilon^2(p-2)}{\|y\|^2}\right)y =: \hat\mu^{JS},$$
respectively. $\hat\mu^{JS}$ is called the James-Stein estimator. If the dimension p ≥ 3 and the
norm ∥y∥² is sufficiently large, multiplication of y by (1 − g(y)) shrinks the value of y toward
0. This is called Stein shrinkage. If b = b_opt, then
$$W_{\min}(y) = -\frac{\varepsilon^4(p-2)^2}{\|y\|^2}.$$
Lemma 2.2.6. Let p ≥ 3. Then, for all µ ∈ Rp,
$$0 < E\left[\frac{1}{\|y\|^2}\right] < \infty.$$
The proof of Lemma 2.2.6 can be found in Lemma 3.9 of (Tsy09). For
the function W, Lemma 2.2.6 implies −∞ < E[W(y)] < 0, provided that p ≥ 3.
Therefore, if p ≥ 3, the risk of the estimator $\hat\mu_n$ satisfies
$$E\|\hat\mu_n-\mu\|^2 = p\varepsilon^2 - E\left[\frac{\varepsilon^4(p-2)^2}{\|y\|^2}\right] < E\|\hat\mu_n^{MLE}-\mu\|^2$$
for all µ ∈ Rp.
Besides the James-Stein estimator, there are other estimators having smaller mean
square error than the MLE.
• Stein estimator: a = 0, b = ε²p,
$$\hat\mu^{S} := \left(1 - \frac{\varepsilon^2p}{\|y\|^2}\right)y;$$
• James-Stein estimators: c ∈ (0, 2(p − 2)),
$$\hat\mu^{JS_c} := \left(1 - \frac{\varepsilon^2c}{\|y\|^2}\right)y;$$
• Positive-part James-Stein estimator:
$$\hat\mu^{JS+} := \left(1 - \frac{\varepsilon^2(p-2)}{\|y\|^2}\right)_+y;$$
• Positive-part Stein estimator:
$$\hat\mu^{S+} := \left(1 - \frac{\varepsilon^2p}{\|y\|^2}\right)_+y;$$

where $(x)_+ = \max(0, x)$. Comparisons of their risks as mean square errors are as
follows:
$$R(\hat\mu^{JS+}) < R(\hat\mu^{JS}) < R(\hat\mu_n^{MLE}), \qquad R(\hat\mu^{S+}) < R(\hat\mu^{S}) < R(\hat\mu_n^{MLE}).$$
Another dimension of variation is shrinkage toward any vector µ0 rather than the
origin:
$$\hat\mu_{\mu_0} = \mu_0 + \left(1 - \frac{\varepsilon^2c}{\|y-\mu_0\|^2}\right)(y-\mu_0), \qquad c\in(0,2(p-2)).$$
In particular, one may choose $\mu_0 = \bar y\,\mathbf{1}$ where $\bar y = \sum_{i=1}^p y_i/p$.
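The family of shrinkage estimators listed above can be implemented in a few lines; the sketch below is illustrative (the helper name, the toy data and the choice c = p − 3 for shrinkage toward the mean, cf. Eq. (19), are assumptions made here, not prescribed by the text).

import numpy as np

def shrink(y, c, eps2=1.0, mu0=None, positive_part=False):
    """Return mu0 + (1 - eps2*c/||y - mu0||^2)(y - mu0), optionally with positive part."""
    mu0 = np.zeros_like(y) if mu0 is None else mu0
    factor = 1.0 - eps2 * c / np.sum((y - mu0) ** 2)
    if positive_part:
        factor = max(factor, 0.0)
    return mu0 + factor * (y - mu0)

rng = np.random.default_rng(4)
p = 10
y = rng.normal(size=p)
mu_js    = shrink(y, c=p - 2)                                 # James-Stein
mu_jsp   = shrink(y, c=p - 2, positive_part=True)             # positive-part James-Stein
mu_stein = shrink(y, c=p)                                     # Stein estimator
mu_bar   = shrink(y, c=p - 3, mu0=np.full(p, y.mean()))       # shrink toward the mean (cf. Eq. (19))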

2.3. Random Matrix Theory and Phase Transitions in PCA


In PCA, one often looks at the eigenvalue plot in decreasing order as percentages
of variation. A large gap in the eigenvalue drops may indicate that the top
eigenvectors reflect major variation directions, while the small eigenvalues indicate directions due to noise, which will vanish when n → ∞. Is this true in all
situations? The answer depends on the following parameter,
(24) $$\gamma = \lim_{p,n\to\infty}\frac{p}{n}.$$
The answer is yes in the classical setting where γ = 0, governed by the Law of Large
Numbers. Unfortunately, in high dimensional statistics with γ > 0, top eigenvectors
of sample covariance matrices might not reflect the subspace of signals. In fact,
there is a phase transition for signal identifiability by PCA: below a threshold of
signal-noise ratio, PCA will fail with high probability, and above that threshold of
signal-noise ratio, PCA will approximate the signal subspace with high probability.
This will be illustrated by the following simplest rank-1 (spike) signal model, in
which a leverage of random matrix theory sheds light on the phase transition
where PCA fails to capture the signal subspace depending on the signal-noise ratio.
where PCA fails to capture the signal subspace depending on the signal-noise ratio.

2.3.1. Phase Transitions of PCA in the Rank-1 Model. Consider the following rank-1 signal-noise model,
$$Y = X + \varepsilon,$$
where
• the signal lies in a one-dimensional subspace, X = αu with α ∼ N(0, σ_X²);
• the noise ε ∼ N(0, σ_ε² I_p) is i.i.d. Gaussian.
Therefore Y ∼ N(0, Σ), where the limiting covariance matrix Σ is a rank-one matrix plus a scaled identity:
$$\Sigma = \sigma_X^2uu^T + \sigma_\varepsilon^2I_p.$$
For multi-rank generalizations, please see (KN08).
The whole question in the remaining part of this section is: can we recover the
signal direction u from principal component analysis on the noisy measurements Y?
Define the signal-noise ratio
$$SNR = R = \frac{\sigma_X^2}{\sigma_\varepsilon^2}.$$
For simplicity we assume that σ_ε² = 1 without loss of generality. We aim to show how
the SNR affects the result of PCA when p is large. A fundamental result by Johnstone

in 2006 (Joh06), or see (NBG10), shows that the primary (largest) eigenvalue of
sample covariance matrix satisfies
(25) $$\lambda_{\max}(\hat\Sigma_n)\to\begin{cases}(1+\sqrt{\gamma})^2 = b, & \sigma_X^2\le\sqrt{\gamma},\\[4pt] (1+\sigma_X^2)\left(1+\dfrac{\gamma}{\sigma_X^2}\right), & \sigma_X^2>\sqrt{\gamma},\end{cases}$$
which implies that if the signal energy is small, the top eigenvalue of the sample covariance
matrix never pops up from the random-matrix bulk; only if the signal energy is beyond
the phase transition threshold √γ can the top eigenvalue be separated from the random-matrix eigenvalues. However, even in the latter case it is a biased estimate.
Moreover, the primary eigenvector (principal component) associated with the
largest eigenvalue converges to
(26) $$|\langle u, v_{\max}\rangle|^2\to\begin{cases}0, & \sigma_X^2\le\sqrt{\gamma},\\[4pt] \dfrac{1-\gamma/\sigma_X^4}{1+\gamma/\sigma_X^2}, & \sigma_X^2>\sqrt{\gamma},\end{cases}$$
which exhibits the same phase transition phenomenon: if the signal is of low energy,
PCA will tell us nothing about the true signal and the estimated top eigenvector is
orthogonal to the true direction u; if the signal is of high energy, PCA will return a
biased estimate which lies in a cone whose angle with the true signal is no more
than
$$\arccos\sqrt{\frac{1-\gamma/\sigma_X^4}{1+\gamma/\sigma_X^2}}.$$
Below we are going to show such results.
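Before the derivation, the following simulation sketch compares the top sample eigenvalue and $|\langle u, v_{\max}\rangle|^2$ of the rank-one spike model with the formulas (25)–(26); the dimensions, random seed and SNR grid are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(5)
p, n = 400, 800
gamma = p / n
u = rng.normal(size=p); u /= np.linalg.norm(u)

for snr in [0.3, np.sqrt(gamma), 1.0, 4.0]:          # sigma_X^2, with sigma_eps^2 = 1
    alpha = rng.normal(scale=np.sqrt(snr), size=n)   # signal coefficients
    Y = np.outer(u, alpha) + rng.normal(size=(p, n)) # p-by-n data, Y = X + eps
    S = Y @ Y.T / n
    evals, evecs = np.linalg.eigh(S)
    lam_max, v_max = evals[-1], evecs[:, -1]
    if snr > np.sqrt(gamma):
        lam_pred = (1 + snr) * (1 + gamma / snr)
        cos2_pred = (1 - gamma / snr ** 2) / (1 + gamma / snr)
    else:
        lam_pred = (1 + np.sqrt(gamma)) ** 2
        cos2_pred = 0.0
    print(f"SNR={snr:5.2f}  lambda_max={lam_max:6.3f} (pred {lam_pred:6.3f})  "
          f"|<u,v>|^2={np.dot(u, v_max)**2:5.3f} (pred {cos2_pred:5.3f})")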

2.3.2. Marčenko-Pastur Law of the Sample Covariance Matrix. First of
all, we show that even for white noise the sample covariance matrix has its eigenvalue distribution governed by the Marčenko-Pastur Law. This is the null distribution
describing pure noise. Let xi ∼ N(0, Ip) (i = 1, . . . , n) and X = [x1 , x2 , . . . , xn ] ∈ Rp×n.
Then the sample covariance matrix is defined as
(27) $$\hat\Sigma_n = \frac{1}{n}XX^T.$$
Such a random matrix $\hat\Sigma_n$ is called a Wishart matrix.
• In classical statistics, when p is fixed and n → ∞, the classical Law of Large
Numbers tells us $\hat\Sigma_n\to I_p$.
• In high dimensional statistics, when both n and p grow with p/n → γ ≠ 0, the
distribution of the eigenvalues of $\hat\Sigma_n$ follows the so-called Marčenko-Pastur
(MP) distribution (BS10),
(28) $$\mu_{MP}(dt) = \left(1-\frac{1}{\gamma}\right)\delta_0(dt)\,I(\gamma>1) + \frac{\sqrt{(b-t)(t-a)}}{2\pi\gamma t}\,I(t\in[a,b])\,dt,$$
where a = (1 − √γ)², b = (1 + √γ)². In other words, if γ ≤ 1 the
distribution has support on [a, b], and if γ > 1 it has an additional point
mass 1 − 1/γ at the origin.
Figure 3 illustrates the MP distribution by numerical simulations whose codes
can be found below in Section 2.6.2.

Figure 3. (a) Marčenko-Pastur distribution with γ = 2. (b) Marčenko-Pastur distribution with γ = 0.5.

2.3.3. Characterization of Phase Transitions with RMT. Having learned
that the null distribution of noise eigenvalues is the Marčenko-Pastur Law, we are now ready to
come back to the rank-1 spike model.
Following the rank-1 model, consider random vectors $\{Y_i\}_{i=1}^n\sim N(0,\Sigma)$, where
Σ = σ_x² uu^T + σ_ε² I_p and u is an arbitrarily chosen unit vector (∥u∥₂ = 1) indicating
the signal direction. Define the Signal-Noise-Ratio (SNR) R = σ_x²/σ_ε². Without loss of
generality, we assume σ_ε² = 1. The covariance matrix Σ thus has a low-rank plus
sparse structure. The sample covariance matrix is $\hat\Sigma_n = \frac{1}{n}\sum_{i=1}^nY_iY_i^T = \frac{1}{n}YY^T$ where $Y = [Y_1,\dots,Y_n]\in\mathbb{R}^{p\times n}$. Suppose one of its eigenvalues is $\hat\lambda$ with
corresponding unit eigenvector $\hat v$, so $\hat\Sigma_n\hat v = \hat\lambda\hat v$.
First of all, we relate $\hat\lambda$ to the MP distribution by the following trick:
(29) $$Z_i = \Sigma^{-1/2}Y_i \;\Rightarrow\; Z_i\sim N(0, I_p), \qquad \text{where } \Sigma = \sigma_x^2uu^T + \sigma_\varepsilon^2I_p = Ruu^T + I_p.$$
Then $S_n = \frac{1}{n}\sum_{i=1}^nZ_iZ_i^T = \frac{1}{n}ZZ^T$ is a Wishart random matrix whose eigenvalues
follow the Marčenko-Pastur distribution.
Notice that $\hat\Sigma_n = \frac{1}{n}YY^T = \Sigma^{1/2}(\frac{1}{n}ZZ^T)\Sigma^{1/2} = \Sigma^{1/2}S_n\Sigma^{1/2}$ and $(\hat\lambda,\hat v)$ is an eigenvalue-eigenvector pair of the matrix $\hat\Sigma_n$. Therefore
(30) $$\Sigma^{1/2}S_n\Sigma^{1/2}\hat v = \hat\lambda\hat v \;\Rightarrow\; S_n\Sigma(\Sigma^{-1/2}\hat v) = \hat\lambda(\Sigma^{-1/2}\hat v).$$
In other words, $\hat\lambda$ and $\Sigma^{-1/2}\hat v$ are an eigenvalue and eigenvector of the matrix $S_n\Sigma$.
Suppose $c\,\Sigma^{-1/2}\hat v = v$ where the constant c makes v a unit vector, and thus
(31) $$c^2 = c^2\hat v^T\hat v = v^T\Sigma v = v^T(\sigma_x^2uu^T + \sigma_\varepsilon^2I_p)v = \sigma_x^2(u^Tv)^2 + \sigma_\varepsilon^2 = R(u^Tv)^2 + 1.$$
Now we have
(32) $$S_n\Sigma v = \hat\lambda v.$$
Plugging in the expression of Σ, it gives
(33) $$S_n(\sigma_X^2uu^T + \sigma_\varepsilon^2I_p)v = \hat\lambda v.$$
Rearranging the term with u to one side, we get
(34) $$(\hat\lambda I_p - \sigma_\varepsilon^2S_n)v = \sigma_X^2S_nu(u^Tv).$$

Assuming that $\hat\lambda I_p - \sigma_\varepsilon^2S_n$ is invertible, multiplying both sides of the equality by its inverse gives
(35) $$v = \sigma_X^2(\hat\lambda I_p - \sigma_\varepsilon^2S_n)^{-1}S_nu(u^Tv).$$
Now we are going to present the estimates on the eigenvalue $\hat\lambda$ and on |⟨u, v⟩|.
2.3.3.1. Primary Eigenvalue. Multiplying (35) by $u^T$ on both sides,
(36) $$u^Tv = \sigma_X^2\,u^T(\hat\lambda I_p - \sigma_\varepsilon^2S_n)^{-1}S_nu\,(u^Tv);$$
that is, if $u^Tv\ne 0$,
(37) $$1 = \sigma_X^2\,u^T(\hat\lambda I_p - \sigma_\varepsilon^2S_n)^{-1}S_nu.$$
Assume that $S_n$ has the eigenvalue decomposition $S_n = W\Lambda W^T$, where Λ = diag(λi : i = 1, . . . , p) (λi ≥ λi+1), $W^TW = WW^T = I_p$, with $W = [w_1, w_2,\dots,w_p]\in\mathbb{R}^{p\times p}$ an
orthonormal basis of eigenvectors. Define $\alpha = [\alpha_1,\dots,\alpha_p]^T\in\mathbb{R}^{p\times 1}$ with $\alpha_i = w_i^Tu$, hence $u = \sum_{i=1}^p\alpha_iw_i = W\alpha$. Now (37) leads to
(38) $$1 = \sigma_X^2\,u^T[W(\hat\lambda I_p - \sigma_\varepsilon^2\Lambda)^{-1}W^T][W\Lambda W^T]u = \sigma_X^2\,\alpha^T(\hat\lambda I_p - \sigma_\varepsilon^2\Lambda)^{-1}\Lambda\alpha,$$
which is
(39) $$1 = \sigma_X^2\sum_{i=1}^p\frac{\lambda_i}{\hat\lambda - \sigma_\varepsilon^2\lambda_i}\alpha_i^2,$$
where $\sum_{i=1}^p\alpha_i^2 = 1$. Since W is a random orthonormal basis on the sphere, $\alpha_i^2$
concentrates on its mean 1/p. For large p, λi ∼ µMP can be regarded as sampled
from the MP law, and the sum (39) can thus be regarded as the following
Monte-Carlo integration with respect to the MP distribution,
(40) $$1 = \sigma_X^2\cdot\frac{1}{p}\sum_{i=1}^p\frac{\lambda_i}{\hat\lambda - \sigma_\varepsilon^2\lambda_i} \;\sim\; \sigma_X^2\int_a^b\frac{t}{\hat\lambda - \sigma_\varepsilon^2t}\,d\mu_{MP}(t).$$
Since we had assumed without loss of generality that σ_ε² = 1, we can compute
the integral above using the Stieltjes transform and obtain
(41) $$1 = \sigma_X^2\int_a^b\frac{t}{\hat\lambda - t}\cdot\frac{\sqrt{(b-t)(t-a)}}{2\pi\gamma t}\,dt = \frac{\sigma_X^2}{4\gamma}\Big[2\hat\lambda - (a+b) - 2\sqrt{|(\hat\lambda-a)(\hat\lambda-b)|}\Big].$$
For $\hat\lambda\ge b$ and $R = \sigma_X^2\ge\sqrt\gamma$, solving
$$1 = \frac{\sigma_X^2}{4\gamma}\Big[2\hat\lambda - (a+b) - 2\sqrt{(\hat\lambda-a)(\hat\lambda-b)}\Big]$$
gives
$$\hat\lambda = \sigma_X^2 + \gamma + 1 + \frac{\gamma}{\sigma_X^2} = (1+\sigma_X^2)\Big(1+\frac{\gamma}{\sigma_X^2}\Big).$$
More generally, for σ_ε² ≠ 1 all the equations above remain true with $\hat\lambda$ replaced by
$\hat\lambda/\sigma_\varepsilon^2$ and σ_X² by the signal-noise ratio R = σ_X²/σ_ε². Then we get
$$\hat\lambda = (1+R)\Big(1+\frac{\gamma}{R}\Big)\sigma_\varepsilon^2.$$
Here we observe the following phase transition for the primary eigenvalue:
• If $\hat\lambda\in[a,b]$, then $\hat\Sigma_n$ has its primary eigenvalue $\hat\lambda$ within supp(µMP), so
it is indistinguishable from the noise Sn.
• If $\hat\lambda\ge b$, PCA will pick up the top eigenvalue as a signal.
• So $\hat\lambda = b$ is the phase transition where PCA starts to pop up signal rather
than noise. Plugging $\hat\lambda = b$ into (41), we get
(42) $$1 = \sigma_X^2\cdot\frac{1}{4\gamma}[2b-(a+b)] = \frac{\sigma_X^2}{\sqrt\gamma} \;\Longleftrightarrow\; \sigma_X^2 = \sqrt\gamma = \sqrt{\frac{p}{n}}.$$
Hence, in order to make PCA work, we need the signal-noise ratio $R\ge\sqrt{p/n}$.

2.3.3.2. Primary Eigenvector. We now study the phase transition of the primary eigenvector. It is convenient to study $|u^Tv|^2$ first and then translate back to
$|u^T\hat v|^2$. From Equation (35), we obtain
$$1 = v^Tv = \sigma_X^4\,v^Tuu^TS_n(\hat\lambda I_p - \sigma_\varepsilon^2S_n)^{-2}S_nuu^Tv = \sigma_X^4\,|v^Tu|\,[u^TS_n(\hat\lambda I_p - \sigma_\varepsilon^2S_n)^{-2}S_nu]\,|u^Tv|,$$
which implies that
(43) $$|u^Tv|^{-2} = \sigma_X^4\,[u^TS_n(\hat\lambda I_p - \sigma_\varepsilon^2S_n)^{-2}S_nu].$$
Using the same trick as in equation (37), we reach the following Monte-Carlo
integration,
(44) $$|u^Tv|^{-2} = \sigma_X^4[u^TS_n(\hat\lambda I_p - \sigma_\varepsilon^2S_n)^{-2}S_nu] \;\sim\; \sigma_X^4\int_a^b\frac{t^2}{(\hat\lambda - \sigma_\varepsilon^2t)^2}\,d\mu_{MP}(t),$$
and assuming $\hat\lambda\ge b$, from the Stieltjes transform introduced later one can compute
the integral as
$$|u^Tv|^{-2} = \sigma_X^4\int_a^b\frac{t^2}{(\hat\lambda-\sigma_\varepsilon^2t)^2}\,d\mu_{MP}(t) = \frac{\sigma_X^4}{4\gamma}\left(-4\hat\lambda + (a+b) + 2\sqrt{(\hat\lambda-a)(\hat\lambda-b)} + \frac{\hat\lambda(2\hat\lambda-(a+b))}{\sqrt{(\hat\lambda-a)(\hat\lambda-b)}}\right),$$
from which it can be computed that (using $\hat\lambda = (1+R)(1+\frac{\gamma}{R})$ obtained above,
where $R = \sigma_X^2/\sigma_\epsilon^2$)
$$|u^Tv|^2 = \frac{1-\frac{\gamma}{R^2}}{1+\gamma+\frac{2\gamma}{R}}.$$
Now we can compute the inner product of u and $\hat v$ that we are really interested
in:
$$|u^T\hat v|^2 = \Big(\frac{1}{c}u^T\Sigma^{1/2}v\Big)^2 = \frac{1}{c^2}\big((\Sigma^{1/2}u)^Tv\big)^2 = \frac{1}{c^2}\big(((Ruu^T+I_p)^{1/2}u)^Tv\big)^2 \overset{(*)}{=} \frac{1}{c^2}\big((\sqrt{1+R}\,u)^Tv\big)^2 \overset{(**)}{=} \frac{(1+R)(u^Tv)^2}{R(u^Tv)^2+1}$$
$$= \frac{1+R-\frac{\gamma}{R}-\frac{\gamma}{R^2}}{1+R+\gamma+\frac{\gamma}{R}} = \frac{1-\frac{\gamma}{R^2}}{1+\frac{\gamma}{R}},$$


where the equality (∗) uses $\Sigma^{1/2}u = \sqrt{1+R}\,u$, and the equality (∗∗) is due to
the formula for c² (Equation (31) above). Note that this identity holds under the
condition $R\ge\sqrt\gamma$, which makes the numerator above non-negative.
Therefore if PCA works well and the noise does not dominate, the inner
product $|u^T\hat v|$ should be close to 1. In particular, when γ = 0 we have $|u^T\hat v| = 1$, as
disclosed by the classical Law of Large Numbers in statistics. On the other hand,
from RMT we know that if the top eigenvalue $\hat\lambda\in[a,b]$ is overwhelmed within the
support of the MP distribution, then the primary eigenvector computed from PCA
is purely random and $|u^T\hat v| = 0$, which means that from $\hat v$ we can know nothing
about the signal u.

2.3.4. Stieltjes Transform. Now we present the Stieltjes transform of the
MP density, which has been crucial in computing the integrals above. Define the
Stieltjes transform of the MP density µMP to be
(45) $$s(z) := \int_{\mathbb{R}}\frac{1}{t-z}\,d\mu_{MP}(t), \qquad z\in\mathbb{C}.$$
If z ∈ R, the transform is called the Hilbert transform. Further details can
be found in Terry Tao's textbook, Topics in Random Matrix Theory (Tao11), Sec.
2.4.3 (the end of page 169), for the definition of the Stieltjes transform of a density
p(t)dt on R.
In (BS10), Lemma 3.11 on page 52 gives the following characterization of s(z)
(written here so that the square-root term equals $\sqrt{(z-a)(z-b)}$):
(46) $$s(z) = \frac{(1-\gamma) - z + \sqrt{(z-1-\gamma)^2 - 4\gamma}}{2\gamma z} = \frac{(1-\gamma) - z + \sqrt{(z-a)(z-b)}}{2\gamma z},$$
which is the largest root of the quadratic equation
(47) $$\gamma z\,s(z)^2 + (z-(1-\gamma))s(z) + 1 = 0 \;\Longleftrightarrow\; z + \frac{1}{s(z)} = \frac{1}{1+\gamma s(z)}.$$
From equation (46), one can take the derivative in z on both sides to obtain s′(z)
in terms of s and z. Using s(z) one can compute the following basic integrals.
Lemma 2.3.1.
(1) $$\int_a^b\frac{t}{\lambda-t}\,\mu_{MP}(t)\,dt = -\lambda s(\lambda) - 1;$$
(2) $$\int_a^b\frac{t^2}{(\lambda-t)^2}\,\mu_{MP}(t)\,dt = \lambda^2s'(\lambda) + 2\lambda s(\lambda) + 1.$$
Proof. For convenience, define
(48) $$T(\lambda) := \int_a^b\frac{t}{\lambda-t}\,\mu_{MP}(t)\,dt.$$
Note that
(49) $$1 + T(\lambda) = \int_a^b\frac{\lambda-t+t}{\lambda-t}\,\mu_{MP}(t)\,dt = \int_a^b\frac{\lambda}{\lambda-t}\,\mu_{MP}(t)\,dt = -\lambda s(\lambda),$$
which gives the first result.

From the definition of T(λ), we have
(50) $$\int_a^b\frac{t^2}{(\lambda-t)^2}\,\mu_{MP}(t)\,dt = -T(\lambda) - \lambda T'(\lambda).$$
Combined with the first result, we reach the second one. □
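A numerical sanity check of Lemma 2.3.1 (an illustrative sketch): the closed forms are compared against averages over the eigenvalues of a Wishart matrix, whose spectrum approximately follows the MP law; the values of γ, n, λ and the finite-difference derivative of s are assumptions made here for illustration.

import numpy as np

gamma, n = 0.5, 2000
p = int(gamma * n)
a, b = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2

def s(z):      # Stieltjes transform, Eq. (46)
    return ((1 - gamma) - z + np.sqrt((z - 1 - gamma) ** 2 - 4 * gamma)) / (2 * gamma * z)

def s_prime(z, h=1e-6):   # numerical derivative of s
    return (s(z + h) - s(z - h)) / (2 * h)

rng = np.random.default_rng(6)
X = rng.normal(size=(p, n))
evals = np.linalg.eigvalsh(X @ X.T / n)      # eigenvalues approximately MP(gamma)

lam = b + 1.0                                # any lambda > b
print(np.mean(evals / (lam - evals)),           -lam * s(lam) - 1)
print(np.mean(evals ** 2 / (lam - evals) ** 2), lam ** 2 * s_prime(lam) + 2 * lam * s(lam) + 1)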

2.3.5. Bibliographic Remarks. The random matrix theory above can only deal with
homogeneous Gaussian noise σ_ε² Ip. Moreover, it is still an open problem how
to deal with heteroscedastic noise, where Art Owen and Jingshu Wang have some
preliminary studies (OW16).
When log(p)/n → 0, we need to add more restrictions on $\hat\Sigma_n$ in order to estimate
it faithfully. There are typically three kinds of restrictions:
• Σ sparse;
• Σ⁻¹ sparse, also called the precision matrix;
• banded structures (e.g. Toeplitz) on Σ or Σ⁻¹.
Recent developments can be found in works by Bickel, Tony Cai, Tsybakov, Wainwright et al.
For spectral studies on random kernel matrices, see El Karoui, Tiefeng Jiang,
Xiuyuan Cheng, and Amit Singer et al.

2.4. Spectral Shrinkage by Horn’s Parallel Analysis


In practice, it is usually not easy to apply the random matrix theory above
to real world data analysis, as the noise is unknown and the spectral cut based
on noise assumptions might not be correct. In this section, we shall introduce a
new approach based on a random permutation test, called Horn's parallel analysis
(Hor65; BE92). Parallel analysis uses random permutations of a given data matrix to simulate random
matrices, then estimates the random spectrum to decide
the shrinkage level empirically.

2.4.1. Parallel Analysis for Principal Component Analysis. Take the
data matrix X = [x1 |x2 | · · · |xn ] ∈ Rp×n and generate its parallel data matrices by
randomly permuting entries within rows. More precisely, suppose
$$X = \begin{pmatrix} X_{1,1} & X_{1,2} & \cdots & X_{1,n} \\ X_{2,1} & X_{2,2} & \cdots & X_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ X_{p,1} & X_{p,2} & \cdots & X_{p,n}\end{pmatrix}.$$
Randomly take p permutations π1 , . . . , πp of the n column indices (usually π1 is set as the identity). We get a parallel data matrix
$$X^1 = \begin{pmatrix} X_{1,\pi_1(1)} & X_{1,\pi_1(2)} & \cdots & X_{1,\pi_1(n)} \\ X_{2,\pi_2(1)} & X_{2,\pi_2(2)} & \cdots & X_{2,\pi_2(n)} \\ \vdots & \vdots & \ddots & \vdots \\ X_{p,\pi_p(1)} & X_{p,\pi_p(2)} & \cdots & X_{p,\pi_p(n)}\end{pmatrix}.$$
Then we can calculate its principal eigenvalues $\{\hat\lambda_i^1\}_{i=1,\dots,p}$. By choosing a different
set of permutations, we can get another parallel data matrix $X^2$ and its singular
values $\{\hat\lambda_i^2\}_{i=1,\dots,p}$. Repeating this procedure R times, we get R sets of singular
values, which can be put together as a matrix
$$\begin{pmatrix} \hat\lambda_1^1 & \hat\lambda_2^1 & \cdots & \hat\lambda_p^1 \\ \hat\lambda_1^2 & \hat\lambda_2^2 & \cdots & \hat\lambda_p^2 \\ \vdots & \vdots & \ddots & \vdots \\ \hat\lambda_1^R & \hat\lambda_2^R & \cdots & \hat\lambda_p^R\end{pmatrix}.$$

For each i = 1, . . . , p, compare $\{\hat\lambda_i^r\}_{r=1,\dots,R}$ with the i-th singular value $\hat\lambda_i$ of the
original data X. Define
$$\mathrm{pval}_i = \frac{1}{R}\#\{r : \hat\lambda_i^r > \hat\lambda_i\}.$$
Here # stands for the cardinality of the set. Notice that pval_i ∈ [0, 1]. It can be
regarded as the probability that $\hat\lambda_i$ is indistinguishable from noise. The smaller it
is, the more confident we are to think of $\hat\lambda_i$ as a true signal in the data X. Thus we
can set a threshold on {pval_i}_{i=1,...,p}; for example, we keep $\hat\lambda_i$ if pval_i < 0.05.
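The permutation procedure above fits in a few lines of code; the following is a minimal sketch for a generic p-by-n data matrix (the helper name, the synthetic rank-3 example, R and the threshold are illustrative assumptions; a full example on the digit "3" data is given in Section 2.6.3).

import numpy as np

def parallel_analysis(X, R=100, alpha=0.05, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    p, n = X.shape
    evals = np.sort(np.linalg.eigvalsh(np.cov(X)))[::-1]          # eigenvalues of the data
    null = np.empty((R, p))
    for r in range(R):
        Xperm = np.array([rng.permutation(row) for row in X])     # permute within each row
        null[r] = np.sort(np.linalg.eigvalsh(np.cov(Xperm)))[::-1]
    pvals = np.mean(null > evals, axis=0)                         # fraction of null exceedances
    return np.sum(pvals < alpha), pvals                           # number of retained components

# Example on synthetic data with 3 planted components (all sizes are arbitrary):
rng = np.random.default_rng(7)
X = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 200)) + 0.1 * rng.normal(size=(50, 200))
k, pvals = parallel_analysis(X, R=50, rng=rng)
print("retained components:", k)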
Let's apply the above parallel analysis to the digit "3" dataset X. The first
step is to perform PCA on X, which has been done. The second step is to generate
parallel data matrices $\{X^r\}_{r=1,\dots,R}$ (R = 100, for example) by randomly permuting
entries within rows of X. Some examples from one of the parallel data matrices are
shown in Figure 4. One can see that the randomly permuted images are still
informative for the digit "3" rather than being random images, which implies that each pixel
value is highly restricted to some specific range. This suggests that the pixel-wise
vectors are restricted to a low-dimensional sub-manifold, which will be discussed later.

Figure 4. Examples of randomly permuted data.

The third step is to calculate the singular values $\{\hat\lambda_i^r\}_{i=1,\dots,p}$ of each $X^r$. The
fourth step is to compare them with the corresponding singular values of X and
compute {pval_i}_{i=1,...,p}. Here we choose the threshold pval_i < 0.05 for each i, which
means that a selected singular value $\hat\lambda_i$ should beat 95% of $\{\hat\lambda_i^r\}$. According
to the results shown in Figure 5, we keep the top 19 PCs.
Figure 6 shows the mean image and the top 24 PCs. As we can see, beyond the top
19 PCs the remaining PCs are still informative for the digit "3", as the randomly permuted
images remain close to that digit as well. So if the sample points lie on a sub-manifold, the permutation test may be conservative in selecting the number of PCs.

2.5. Sufficient Dimensionality Reduction and Supervised PCA


Previously we have seen the geometric interpretation of PCA. It is a type of
"unsupervised learning" in the sense that there seems to be no relation with the response

Figure 5. Results of parallel analysis on PCA. Considering the
exponential decay of eigenvalues and to emphasize the top eigenvalues, log scales are adopted for both axes. The top 5% singular
values of the parallel data matrices are drawn as a reference.

Figure 6. Images of the sample mean and the top 24 principal


components (top 19 are suggested by parallel analysis). The image
No.0 is the sample mean.

variable y. Does PCA bear any relation to the response variable in supervised learning,
e.g. in classification or regression?

In his 2005 Fisher Lecture, R. Dennis Cook (Coo07) described PCA as a sufficient dimensionality reduction in regression, and also extended it to principal fitted
components (PFC). Here we introduce his idea, together with several variations
of supervised PCA: Fisher's Linear Discriminant Analysis and Li's Sliced Inverse
Regression.
A sufficient dimension reduction Γ (Γ ∈ Rp×d, Γ^TΓ = Id) refers to the setting
that the conditional distribution of Y |X is the same as the distribution of Y |Γ^TX
for all X.
For example, in regression Y = f (X, ε) for some unknown function f, sufficient
dimensionality reduction implies that Y = f (Γ^TX, ε). However, f is unknown here.
How can we find Γ independently of the choice of f?
The answer is a possible Yes when we consider the inverse problem, based on
the conditional distribution X|Y.
For example, consider the following inverse model: for each value y of the response
variable,
(51) $$X_y = \mu + \Gamma\nu_y + \varepsilon,$$
where $X_y\in\mathbb{R}^p$, $\nu_y\in\mathbb{R}^d$, d < p, the basis $\Gamma\in\mathbb{R}^{p\times d}$ with $\Gamma^T\Gamma = I_d$, and the noise $\varepsilon\sim N_p(0,\sigma^2I_p)$. Assume $\sum_y\nu_y = 0$ to remove the degree of freedom in translation.
The following proposition states that under the inverse model, Γ is
indeed a sufficient reduction. See (Coo07) for more details.
Proposition 2.5.1. Under the inverse model, the distribution of Y |X is the
same as the distribution of Y |Γ^TX.
Proof. First, X|Y = y ∼ Np(µ + Γνy , σ²Ip). By the Bayes formula, we have
$$f_{Y|X}(y|x) \propto f_{X|Y}(x|y)f_Y(y) \propto \exp\Big(-\frac{1}{2\sigma^2}\|x-\mu-\Gamma\nu_y\|^2\Big)f_Y(y) \propto \exp\Big(-\frac{1}{2\sigma^2}(\nu_y^T\nu_y - 2\nu_y^T\Gamma^T(x-\mu))\Big)f_Y(y).$$
The last line is given by the orthogonality of Γ. Similarly, since $\Gamma^TX|Y=y\sim N_d(\Gamma^T\mu+\nu_y,\sigma^2I_d)$, we have
$$f_{Y|\Gamma^TX}(y|\Gamma^Tx) \propto f_{\Gamma^TX|Y}(\Gamma^Tx|y)f_Y(y) \propto \exp\Big(-\frac{1}{2\sigma^2}\|\Gamma^Tx-\Gamma^T\mu-\nu_y\|^2\Big)f_Y(y) \propto \exp\Big(-\frac{1}{2\sigma^2}(\nu_y^T\nu_y - 2\nu_y^T\Gamma^T(x-\mu))\Big)f_Y(y).$$
Therefore, the kernels of Y |X and Y |Γ^TX are the same, which implies the result. □

Now consider the Maximum Likelihood Estimates (MLE) of µ, Γ and νy. Under
the inverse model, the conditional likelihood function is
$$f(X_y|\mu,\Gamma,\nu_y) = \frac{1}{\sigma^p\sqrt{(2\pi)^p}}\exp\Big(-\frac{1}{2\sigma^2}(X_y-\mu-\Gamma\nu_y)^T(X_y-\mu-\Gamma\nu_y)\Big),$$
and the MLE tries to find $\arg\max_{\mu,\Gamma,\nu_y}\prod_yf(X_y|\mu,\Gamma,\nu_y)$, which is equivalent to
the following optimization problem after a logarithmic transform:
$$\max_{\mu,\Gamma,\nu_y}\ -\frac{1}{2\sigma^2}\sum_y\|X_y-\mu-\Gamma\nu_y\|^2 - \sum_yp\log\sigma + C.$$
This leads to the MLE for n examples,
$$\hat\mu = \frac{1}{n}\sum_yX_y, \qquad \nu_y = \hat\Gamma^T(X_y-\hat\mu),$$
and
(52) $$\hat\Gamma = \arg\min_{\Gamma^T\Gamma=I}\sum_y\|X_y-\hat\mu-P_\Gamma(X_y-\hat\mu)\|^2, \qquad P_\Gamma = \Gamma\Gamma^T.$$
Comparison of (52) with (2) shows that when y takes distinct values (e.g. the
unknown function f is injective), this is exactly the PCA in unsupervised learning. Therefore PCA can also be derived as a sufficient dimensionality reduction in
supervised learning, even though the function f is unknown.
For y with discrete or repeated values with equal numbers Ny of samples at different
values, it suffices to replace Xy by
$$\hat\mu_y = \frac{1}{N_y}\sum_{y_i=y}X_i.$$
In such cases, it suffices to look at the eigenvalue decomposition of the between-class
covariance matrix defined by
$$\hat\Sigma_B = \frac{1}{|\mathcal{Y}|}\sum_{y\in\mathcal{Y}}(\hat\mu_y-\hat\mu)(\hat\mu_y-\hat\mu)^T.$$

Two famous examples are Fisher’s Linear Discriminant Analysis (LDA) for classi-
fication (HTF01) and Ker-Chau Li’s Sliced Inverse Regression (Li91) (SIR), which
will be called as supervised PCAs here. See Cook (Coo07) for a general class of
principal fitted components adapted to supervised learning.

2.5.1. Linear Discriminant Analysis. Fisher's linear discriminant analysis
(LDA) in classification, like PCA, looks for linear combinations of features which
best explain the data. However, LDA attempts to capture the variation between
different classes of data.
For given data $(X_i, y_i)_{i=1}^N$, assume that Xi ∈ Rp and yi is discrete in {1, 2, ..., K}
but not ordered. LDA captures the variance between classes and meanwhile discards
the variance within classes.
Define the between-class covariance matrix
$$\hat\Sigma_B^{p\times p} = \frac{1}{K}\sum_{k=1}^K(\hat\mu_k-\hat\mu)(\hat\mu_k-\hat\mu)^T,$$
and the within-class covariance matrix
$$\hat\Sigma_W^{p\times p} = \frac{1}{N-K}\sum_{k=1}^K\sum_{y_i=k}(X_i-\hat\mu_k)(X_i-\hat\mu_k)^T,$$

Algorithm 4: Linear Discriminant Analysis (LDA)
Input: Data with labels $\{X_i, y_i\}_{i=1}^N$ where yi is discrete in {1, 2, ..., K} but not
ordered.
Output: Effective dimension-reducing directions Ud.
Step 1: Compute the sample mean and within-class means
$$\hat\mu = \frac{1}{N}\sum_{i=1}^NX_i, \qquad \hat\mu_k = \frac{1}{N_k}\sum_{y_i=k}X_i;$$
Step 2: Compute the between-class covariance matrix
$$\hat\Sigma_B^{p\times p} = \frac{1}{K}\sum_{k=1}^K(\hat\mu_k-\hat\mu)(\hat\mu_k-\hat\mu)^T;$$
Step 3: Compute the within-class covariance matrix
$$\hat\Sigma_W^{p\times p} = \frac{1}{N-K}\sum_{k=1}^K\sum_{y_i=k}(X_i-\hat\mu_k)(X_i-\hat\mu_k)^T;$$
Step 4: Generalized eigen-decomposition $\hat\Sigma_BU = \hat\Sigma_WU\Lambda$ with
Λ = diag(λ1 , λ2 , ...λp) where λ1 ≥ λ2 ≥ ... ≥ λp; choose the eigenvectors
corresponding to the top d ≤ K nonzero eigenvalues, i.e. return Ud such that
Ud = [u1 , . . . , ud ], uk ∈ Rp.

where $\hat\mu$ is the sample mean and $\hat\mu_k$ are the within-class means, i.e.
$$\hat\mu_k = \frac{1}{N_k}\sum_{y_i=k}X_i.$$
Now define the Rayleigh quotient by
$$R(w) = \frac{w^T\hat\Sigma_Bw}{w^T\hat\Sigma_Ww},$$
which measures, in some sense, the 'signal-to-noise ratio' in the direction w.
Intuitively, if $\hat\Sigma_W$ is invertible, the eigenvector corresponding to the largest eigenvalue of $\hat\Sigma_W^{-1}\hat\Sigma_B$ (or the generalized eigenvector of the pair $(\hat\Sigma_B,\hat\Sigma_W)$) maximizes R.
Accordingly, the best feature vectors are the eigenvectors corresponding to the top k
eigenvalues, i.e.
$$U_k = [u_1, u_2, ..., u_k], \quad u_k\in\mathbb{R}^p,$$
where $\hat\Sigma_Bu_k = \lambda_k\hat\Sigma_Wu_k$ and λ1 ≥ λ2 ≥ ... ≥ λk.

Note. For the Generalized Eigen-Decomposition (G.E.D.) problem $\hat\Sigma_Bu = \lambda\hat\Sigma_Wu$,
it is more efficient to first solve the eigen-decomposition problem $\hat\Sigma_W^{-1/2}\hat\Sigma_B\hat\Sigma_W^{-1/2}\varphi = \lambda\varphi$
and then scale φ by $\hat\Sigma_W^{-1/2}$, i.e. $u = \hat\Sigma_W^{-1/2}\varphi$.
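A sketch of Algorithm 4 in code, using scipy's generalized symmetric eigensolver for the pair $(\hat\Sigma_B,\hat\Sigma_W)$; the synthetic data, function name, and the small ridge added to $\hat\Sigma_W$ for numerical stability are assumptions made here for illustration.

import numpy as np
from scipy.linalg import eigh

def lda_directions(X, y, d):
    """X: N-by-p data, y: labels in {1,...,K}; return p-by-d discriminant directions."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    Sigma_B = np.zeros((X.shape[1], X.shape[1]))
    Sigma_W = np.zeros_like(Sigma_B)
    for k in classes:
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        Sigma_B += np.outer(mk - mu, mk - mu)
        Sigma_W += (Xk - mk).T @ (Xk - mk)
    Sigma_B /= len(classes)
    Sigma_W /= (len(X) - len(classes))
    # generalized eigendecomposition Sigma_B u = lambda Sigma_W u (ascending eigenvalues)
    evals, evecs = eigh(Sigma_B, Sigma_W + 1e-8 * np.eye(Sigma_W.shape[0]))
    return evecs[:, ::-1][:, :d]

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(loc=m, size=(100, 5)) for m in ([0]*5, [2, 0, 0, 0, 0], [0, 2, 0, 0, 0])])
y = np.repeat([1, 2, 3], 100)
U = lda_directions(X, y, d=2)
print(U.shape)   # (5, 2)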

2.5.2. Sliced Inverse Regression. Ker-Chau Li (Li91) extended LDA from


classification to regression by proposing Sliced Inverse Regression (SIR). In regres-
sion, we are interested in the conditional mean f (X) = E[y|X], which is a real
valued mapping from high dimensional space Rp and often called the regression
function; on the other hand, it is also interesting to look at the inverse conditional
mean g(y) = E[X|y], which is a 1-dimensional curve (manifold) in Rp often called

the principal curve or inverse regression curve. Such a curve might be easier to deal
with than the high dimensional regression function.

Algorithm 5: Sliced Inverse Regression (SIR)
Input: Data with labels $\{X_i, y_i\}_{i=1}^N$, where Xi ∈ Rp and yi ∈ R is continuous (or
ordered discrete).
Output: Effective dimension-reducing directions Γd.
Step 1: Divide the range of yi into S non-overlapping slices Hs (s = 1, ..., S); Ns
is the number of observations within each slice.
Step 2: Compute the sample mean and the total covariance matrix
$$\hat\mu = \frac{1}{N}\sum_{i=1}^NX_i, \qquad \hat\Sigma^{p\times p} = \frac{1}{N}\sum_{i=1}^N(X_i-\hat\mu)(X_i-\hat\mu)^T;$$
Step 3: Compute the mean of Xi within each slice and the between-slice covariance matrix
$$\hat\mu_s = \frac{1}{N_s}\sum_{y_i\in H_s}X_i, \qquad \hat\Sigma_B^{p\times p} = \frac{1}{S}\sum_{s=1}^S(\hat\mu_s-\hat\mu)(\hat\mu_s-\hat\mu)^T;$$
Step 4: Generalized eigen-decomposition $\hat\Sigma_BU = \hat\Sigma U\Lambda$ with
Λ = diag(λ1 , λ2 , ...λp) where λ1 ≥ λ2 ≥ ... ≥ λp; choose the generalized
eigenvectors corresponding to the top d nonzero eigenvalues, i.e.
Γd = [u1 , . . . , ud ], uk ∈ Rp.

Given a response variable Y and a random vector X ∈ Rp of explanatory
variables, SIR is based on the model
$$Y = f(\Gamma X, \epsilon),$$
where Γ ∈ Rk×p is an unknown projection with k < p, and f is an unknown link function. One does not need to know f to reconstruct the projection or dimensionality
reduction matrix Γ : Rp → Rd. In SIR, the range of response values is divided into
non-overlapping slices; then one replaces the between-class covariance in LDA by
the between-slice covariance, and the within-class covariance in LDA by the total covariance, respectively; the same generalized eigen-decomposition gives the sufficient
dimensionality reduction.
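A compact sketch of Algorithm 5 on a toy single-index model; the slicing scheme, the link function, the function name and the small ridge on the total covariance are illustrative assumptions, not part of the algorithm's statement.

import numpy as np
from scipy.linalg import eigh

def sir_directions(X, y, d, n_slices=10):
    """X: N-by-p, y: continuous response; return p-by-d e.d.r. directions."""
    N, p = X.shape
    mu = X.mean(axis=0)
    Sigma = (X - mu).T @ (X - mu) / N                       # total covariance
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)                # non-overlapping slices of the response
    Sigma_B = np.zeros((p, p))
    for idx in slices:
        mk = X[idx].mean(axis=0)
        Sigma_B += np.outer(mk - mu, mk - mu) / n_slices    # between-slice covariance
    evals, evecs = eigh(Sigma_B, Sigma + 1e-8 * np.eye(p))
    return evecs[:, ::-1][:, :d]

rng = np.random.default_rng(9)
N, p = 1000, 6
X = rng.normal(size=(N, p))
beta = np.zeros(p); beta[0] = 1.0                           # true e.d.r. direction
y = np.exp(X @ beta) + 0.1 * rng.normal(size=N)
B = sir_directions(X, y, d=1)
print(np.abs(B[:, 0] @ beta) / np.linalg.norm(B[:, 0]))     # close to 1 if the direction is recovered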
In (WLM09), this algorithm is extended to Localized Sliced Inverse Regression
(LSIR) which allows for supervised dimension reduction by projection onto a linear
subspace that captures the nonlinear subspace relevant to predicting the response.

2.6. Lab and Further Studies


2.6.1. James-Stein Estimator. JSE.py:

# # A simulation to show that JSE has smaller Mean Square Error than MLE
# ## $\mu$ is generated from Normal(0,1)

# A simulation to show that JSE has smaller Mean Square Error than MLE
import numpy as np
import pandas as pd

Algorithm 6: Localized Sliced Inverse Regression (LSIR)
Input: Data with labels $\{X_i, y_i\}_{i=1}^N$, where Xi ∈ Rp and yi ∈ R is continuous (or
ordered discrete).
Output: Effective dimension-reducing directions Γd.
Step 1: Compute the total covariance matrix $\hat\Sigma$ as in SIR;
Step 2: Divide the range of yi into S non-overlapping slices Hs (s = 1, ..., S); for
each sample (Xi , yi) compute the localized mean
$$\hat\mu_{i,loc} = \frac{1}{|s_i|}\sum_{j\in s_i}X_j,$$
where $s_i = \{j : x_j \text{ belongs to the } k \text{ nearest neighbors of } x_i \text{ in } H_s\}$ and s
indexes the slice Hs to which i belongs;
Step 3: Compute a localized version of the between-slice covariance $\hat\Sigma_B$:
$$\hat\Sigma_{loc} = \frac{1}{N}\sum_i(\hat\mu_{i,loc}-\hat\mu)(\hat\mu_{i,loc}-\hat\mu)^T;$$
Step 4: Generalized eigen-decomposition $\hat\Sigma_{loc}U = \hat\Sigma U\Lambda$ with
Λ = diag(λ1 , λ2 , ...λp) where λ1 ≥ λ2 ≥ ... ≥ λp; choose the generalized
eigenvectors corresponding to the top d nonzero eigenvalues, i.e.
Γd = [u1 , . . . , ud ], uk ∈ Rp.

import matplotlib . pyplot as plt

nrep =100

err_MLE = np . zeros ( nrep )


err_JSE = np . zeros ( nrep )

# p = N in the following
N = 100
for i in range(nrep):
    mu = np.random.normal(0, 1, N)
    z = np.random.normal(mu, 1, N)
    mu_MLE = z
    mu_JSE = (1 - (N - 2) / np.sum(z**2)) * z
    err_MLE[i] = np.sum((mu_MLE - mu)**2) / N
    err_JSE[i] = np.sum((mu_JSE - mu)**2) / N
err1 = pd.DataFrame({'err_MLE': err_MLE, 'err_JSE': err_JSE})

fig1 , ax1 = plt . subplots ()


ax1.set_title(r'$\mu_i$ is generated from Normal(0,1), sample size N=100')
ax1.boxplot(err1, labels=err1.columns)
ax1.set_ylabel(r'Error: $\frac{||\hat{\mu}-\mu||^2}{N}$')
# plt.show()
plt.savefig("normal.jpg", dpi=500, bbox_inches='tight')

# ## $ \ mu$ is generated from Uniform (0 ,1)



err_MLE = np . zeros ( nrep )


err_JSE = np . zeros ( nrep )

for i in range(nrep):
    mu = np.random.uniform(0, 1, N)
    z = np.random.normal(mu, 1, N)
    mu_MLE = z
    mu_JSE = (1 - (N - 2) / np.sum(z**2)) * z
    err_MLE[i] = np.sum((mu_MLE - mu)**2) / N
    err_JSE[i] = np.sum((mu_JSE - mu)**2) / N
err2 = pd.DataFrame({'err_MLE': err_MLE, 'err_JSE': err_JSE})

fig2 , ax2 = plt . subplots ()


ax2.set_title(r'$\mu_i$ is generated from Uniform(0,1), sample size N=100')
ax2.boxplot(err2, labels=err2.columns)
ax2.set_ylabel(r'Error: $\frac{||\hat{\mu}-\mu||^2}{N}$')
# plt.show()
plt.savefig("uniform.jpg", dpi=500, bbox_inches='tight')

# # Efron ’s Batting example

names =[ " Clemente " ," F . Robinson " ," F . Howard " ," Johnstone " ," Berry " ," Spencer
" ," Kessinger " ," L . Alvarado " ," Santo " ," Swoboda " ," Unser " ," Williams " ,"
Scott " ," Petrocelli " ," E . Rodriguez " ," Campaneris " ," Munson " ," Alvis " ]
hits =[18 ,17 ,16 ,15 ,14 ,14 ,13 ,12 ,11 ,11 ,10 ,10 ,10 ,10 ,10 ,9 ,8 ,7]
n = 45
mu =[.346 ,.298 ,.276 ,.222 ,.273 ,.270 ,.263 ,.210 ,.269 ,
.230 ,.264 ,.256 ,.303 ,.264 ,.226 ,.286 ,.316 ,.200]
p = len ( hits )

mu_mle = np . array ( hits ) / n


z = mu_mle
z_bar = np . mean ( z )
S = np . sum (( z - z_bar ) **2)
sigma02 = z_bar *(1 - z_bar ) / n

mu_js =(1 - S / p *( p -2) / np . dot ( z .T , z ) ) * z


mu_js1 = z_bar +(1 -( p -3) * sigma02 / S ) *( z - z_bar )

err_js = np . sum (( np . round ( mu_js ,3) - mu ) **2)


err_js1 = np . sum (( np . round ( mu_js1 ,3) - mu ) **2)
err_mle = np . sum (( np . round ( mu_mle ,3) - mu ) **2)

X = pd . DataFrame ([ names , hits , np . round ( mu_mle ,3) ,mu , np . round ( mu_js ,3) , np .
round ( mu_js1 ,3) ]) . T
# X.columns = ['Names', 'hits', r'$\hat{\mu}_i^{(MLE)}$', r'$\mu_i$', r'$\hat{\mu}_i^{(JS0)}$', r'$\hat{\mu}_i^{(JS+)}$']
X.columns = ['Names', 'hits', r'$\mu_{MLE}$', r'$\mu_i$', r'$\mu_{JS0}$', r'$\mu_{JS+}$']

X.loc['Mean Squared Error'] = ['-', '-', err_mle, '-', err_js, err_js1]

X.to_csv('somefile.txt', index=False)

2.6.2. Marcenko-Pastur Distribution and Wishart Random Matri-


ces. Figure 3 illustrates the MP-distribution by Python simulations whose codes
can be found below.

# # MP-law
#
# Eigenvalue distribution of S converges to the Marcenko-Pastur
# distribution with parameter gamma = p/n

import numpy as np
import matplotlib . pyplot as plt

def f_MP(a, b, t, gamma):
    # matlab: f_MP = @(t) sqrt(max(b-t, 0).*max(t-a, 0))./(2*pi*gamma*t)
    return np.sqrt(max(b - t, 0) * max(t - a, 0)) / (2 * np.pi * gamma * t)

gamma = 2; # if gamma >1 , there will be a spike in MP distribution at 0


a = (1 - np . sqrt ( gamma ) ) **2
b = (1+ np . sqrt ( gamma ) ) **2
n = 400
p = int ( n * gamma )
X = np . random . randn (p , n ) # X is p - by -n , X ij i . i . d N (0 ,1)
S = 1/ n *( X@X . transpose () )

# plotting part
bins =100
evals = np . linalg . eigvals ( S ) . real
hist , edges = np . histogram ( evals , bins = bins )
width = edges [1] - edges [0]
hist = hist / p
plt.bar(edges[0:-1], hist, align="edge", width=width, alpha=0.75)
ts = np.linspace(edges[0], b * 1.05, num=1000)
f_mps = []
for t in ts:
    f_mps.append(f_MP(a, b, t, gamma))
f_mps = np.array(f_mps)
plt.plot(ts, f_mps * width, color="r")
plt . ylim (0 , max ( f_mps [1: -1]* width ) *2)
plt . show ()

2.6.3. Parallel Analysis in PCA. In practice, Horn's parallel analysis (BE92)
is widely used to find the number of principal components or factors using simulations on the given data, with implementations available in R² and Matlab³.
²https://cran.r-project.org/web/packages/paran/
³https://www.mathworks.com/matlabcentral/fileexchange/44996-parallel-analysis--pa--to-for-determining-the-numb

Figure 6 shows the mean image and the top 24 PCs for digit “3”. Run Python
code papca image.py.

import numpy as np
import scipy . io as sio
import matplotlib . pyplot as plt

n_perm = 10 # number of parallel dataset


perc = 0.05 # percentage

# ###############################################################
X = np.loadtxt('train.3', delimiter=',')

n , dim = X . shape
mean = np . mean (X , axis = 0)
X0 = X - mean
# X0 = X
Cov = np . dot ( X0 .T , X0 ) /( n -1)

evals , evecs = np . linalg . eigh ( Cov )


evals = evals [:: -1]
evecs = evecs [: , :: -1]

modes = np . c_ [ mean . reshape ( -1 , 1) , evecs ]

# ###############################################################
Xcp = X0 . copy ()
evals_perm = np . zeros ([ n_perm , dim ])

for i in range(n_perm):
    for j in range(1, dim):
        np.random.shuffle(Xcp[:, j])
    Cov_perm = np.dot(Xcp.T, Xcp) / (n - 1)
    evals_perm[i] = np.linalg.eigvalsh(Cov_perm)[::-1]

evals0 = np . mean ( evals_perm , axis = 0)


evals_perm = np . sort ( evals_perm , axis = 0) [:: -1]
evals_perc = evals_perm [ int ( np . floor ( perc * n_perm ) ) ]
pvals = np . mean (( evals_perm > evals ) . astype ( float ) , axis = 0)

# index of the first nonzero p - values


for j in range(dim):
    if pvals[j] > 0:
        pv1 = j
        break

# ###############################################################
plt . figure ( figsize = (20 , 10) )
ax = plt . subplot (111)
ax.loglog(evals, 'r-o', linewidth=2, label=r'original')
# ax.loglog(evals0, 'g-*', linewidth=2, label=r'permuted mean')
ax.loglog(evals_perc, 'y-^', linewidth=2, label=r'permuted top %s%s' % (perc * 100, '%'))
ax.plot(np.nan, 'b', linewidth=2, label=r'p-value')  # agent
ax.set_xticks([pv1 + 1, dim])
ax.set_xticklabels([pv1 + 1, dim])
ax.tick_params(axis='x', labelsize=20)
ax.tick_params(axis='y', labelsize=20)
ax.set_xlabel('dimensions (log scale)', fontsize=20)
ax.set_ylabel('eigenvalues', fontsize=20)
ax.legend(loc='lower left', fontsize=20)

ax1 = ax.twinx()
ax1.plot(pvals, 'b', linewidth=2)
ax1.vlines(pv1, 0, 1, 'b', 'dashed', linewidth=2, label=r'1-st nonzero p-values')
ax1.hlines(perc, pv1, dim, 'k', 'dotted', linewidth=2)
ax1.fill_between(np.arange(dim), np.ones(dim), where=(pvals > perc),
                 alpha=0.2, label=r'color fill: for p-value > %s%s' % (perc * 100, '%'))
ax1.tick_params(axis='y', labelsize=20)
ax1.set_yticks([perc, 1])
ax1.set_yticklabels(['%s%s' % (100 * perc, '%'), '%s%s' % (100, '%')])
ax1.set_ylabel('p-values', fontsize=20)
ax1.legend(loc='upper right', fontsize=20)

# plt . title ( ’ parallel analysis of PCA ’, fontsize = 30)


plt . xlim (0 , dim )
plt . show ()

# ###############################################################
# img = X [0]. reshape (16 , 16)
plt . figure ( figsize = (10 , 10) )
# plt . title ( ’ mean and principal components ’, fontsize = 20)
for j in range (25) :
# img = Xcp [ j ]. reshape (16 , 16)
img = modes [: , j ]. reshape (16 , 16)
ax = plt . subplot (5 , 5 , j +1)
ax . imshow ( img , cmap = ’ gray ’)
ax . set_title ( ’% d ’ %( j +1) , fontsize = 20)
ax . set_xticks ([])
ax . set_yticks ([])

plt . tight_layout ()
plt . show ()
CHAPTER 3

Blessing of Dimensionality: Concentration of Measure

3.1. Introduction to Almost Isometric Embedding


In this chapter, we introduce the random projection method, which can reduce the dimensionality of n points in Rp to k = O(c(ϵ) log n) at the cost of a uniform metric distortion of at most ϵ > 0, with high probability. The theoretical basis of this method was given as a lemma by Johnson and Lindenstrauss (JL84) in the study of a Lipschitz extension problem, and the result has widespread applications in mathematics and computer science. The main application of the Johnson-Lindenstrauss Lemma in computer science is high dimensional data compression via random projections (Ach03). Sanjoy Dasgupta and Anupam Gupta (DG03a) gave a simple proof of this theorem using elementary probabilistic techniques in a four-page paper. Below we present a brief proof of the Johnson-Lindenstrauss Lemma based on the work of Dasgupta and Gupta (DG03a) and Achlioptas (Ach03).
Recall the problem of MDS: given a set of points x_i ∈ R^p (i = 1, 2, · · · , n), form a data matrix X = [x_1, x_2, · · · , x_n] ∈ R^{p×n}. When p is large, especially in some cases larger than n, we want to find a k-dimensional projection under which the pairwise distances of the data points are preserved as well as possible. That is to say, if we know the original pairwise distances d_ij = ∥x_i − x_j∥, or distances with some disturbance d̃_ij = ∥x_i − x_j∥ + ϵ_ij, we want to find Y_i ∈ R^k such that
(53) min Σ_{i,j} (∥Y_i − Y_j∥² − d_ij²)²,
assuming Σ_i Y_i = 0, i.e. putting the origin at the data center.
When D is given exactly by the squared distances of points in Euclidean space, classical MDS defines a kernel matrix B = −(1/2) H D H where D = (d_ij²) and H = I − (1/n) 1 1^T; then the minimization (53) is equivalent to finding Y ∈ R^{k×n} such that
(54) min_{Y ∈ R^{k×n}} ∥Y^T Y − B∥_F²,
whose solution is given by taking the row vectors of Y to be the scaled eigenvectors (singular vectors) corresponding to the k largest eigenvalues (singular values) of B.
The main features of MDS are the following.
• MDS looks for Euclidean embedding of data whose total or average metric
distortion are minimized.
• MDS embedding basis is adaptive to the data, namely as a function of
data via eigen-decomposition.

Note that the distortion measure here amounts to a certain distance between the projected configuration and the original one encoded by B. Under the Frobenius norm, the distortion equals the sum of squared entrywise deviations. Such a measure captures a significant global property, but it offers no local guarantee: some points may deviate greatly from the original configuration if we only minimize the total metric distortion.
What if we want a uniform control on the metric distortion at every data pair, say
(1 − ϵ) ≤ ∥Y_i − Y_j∥²/d_ij² ≤ (1 + ϵ)?
Such an embedding is an almost isometry or a Lipschitz mapping from metric space
X to Euclidean space Y. If X is an Euclidean space (or more generally Hilbert
space), Johnson-Lindenstrauss Lemma tells us that one can take Y as a subspace
of X of dimension k = O(c(ϵ) log n) via random projections to obtain an almost
isometry with high probability. As a contrast to MDS, the main features of this
approach are the following.
• Almost isometry is achieved with a uniform metric distortion bound (Bi-
Lipschitz bound), with high probability, rather than average metric dis-
tortion control;
• The mapping is universal, rather than being adaptive to the data.

3.2. Johnson-Lindenstrauss Lemma for Random Projections


Theorem 3.2.1 (Johnson-Lindenstrauss Lemma). For any 0 < ϵ < 1 and any integer n, let k be a positive integer such that
k ≥ (4 + 2α)(ϵ²/2 − ϵ³/3)^{-1} ln n, α > 0.
Then for any set V of n points in R^p, there is a map f : R^p → R^k such that for all u, v ∈ V,
(55) (1 − ϵ) ∥u − v∥² ≤ ∥f(u) − f(v)∥² ≤ (1 + ϵ) ∥u − v∥².
Such an f can in fact be found in randomized polynomial time; moreover, the inequalities (55) hold with probability at least 1 − 1/n^α.
Remark. We have the following facts.
(1) The embedding dimension k = O(c(ϵ) log n) is independent of the ambient dimension p and logarithmic in the number of samples n. The independence of p in fact suggests that the Lemma can be generalized to Hilbert spaces of infinite dimension.
(2) How to construct the map f? In fact we can use random projections,
Y_{n×k} = X_{n×p} R_{p×k},
where any of the following random matrices R can cater to our needs:
• R = [r_1, · · · , r_k] with r_i = a_i/∥a_i∥ ∈ S^{p−1}, a_i = (a_{i1}, · · · , a_{ip}) and a_{ij} ∼ N(0, 1);
• R = A/√k with A_ij ∼ N(0, 1);
• R = A/√k with A_ij = +1 or −1, each with probability 1/2;
• R = A/√(k/3) with A_ij = +1 with probability 1/6, 0 with probability 2/3, and −1 with probability 1/6.

The proof below actually takes the first form of R as an illustration.
Now we are going to prove the Johnson-Lindenstrauss Lemma using a random projection onto a k-dimensional subspace of R^p. Notice that the following two events have identical distributions:
a fixed unit vector randomly projected onto a uniformly random k-dimensional subspace
⇐⇒ a uniformly random vector on S^{p−1} projected onto its top-k coordinates.
Based on this observation, we change our target from a random k-dimensional projection to a random vector on the sphere S^{p−1}.
Let x_i ∼ N(0, 1) (i = 1, · · · , p) and X = (x_1, · · · , x_p); then Y = X/∥X∥ ∈ S^{p−1} is uniformly distributed. Fixing the top-k coordinates, we get z = (x_1, · · · , x_k, 0, · · · , 0)^T/∥X∥ ∈ R^p. Let L = ∥z∥² and µ := k/p. Note that E∥(x_1, · · · , x_k, 0, · · · , 0)∥² = k = µ · E∥X∥². The following lemma, which is crucial to reach the main theorem, shows that L is concentrated around µ.
Lemma 3.2.2. Let k < p and µ = k/p. Then:
(a) if β < 1,
Prob[L ≤ βµ] ≤ β^{k/2} (1 + (1 − β)k/(p − k))^{(p−k)/2} ≤ exp((k/2)(1 − β + ln β));
(b) if β > 1,
Prob[L ≥ βµ] ≤ β^{k/2} (1 + (1 − β)k/(p − k))^{(p−k)/2} ≤ exp((k/2)(1 − β + ln β)).
We first show how to use this lemma to prove the main theorem – Johnson-
Lindenstrauss lemma.
Proof of Johnson-Lindenstrauss Lemma. If p ≤ k, the theorem is trivial. Otherwise take a random k-dimensional subspace S, and let v_i' be the projection of point v_i ∈ V onto S. Setting L = ∥v_i' − v_j'∥² and µ = (k/p)∥v_i − v_j∥², and applying Lemma 3.2.2(a) with β = 1 − ϵ, we get
Prob[L ≤ (1 − ϵ)µ] ≤ exp((k/2)(1 − (1 − ϵ) + ln(1 − ϵ)))
≤ exp((k/2)(ϵ − (ϵ + ϵ²/2))),   by ln(1 − x) ≤ −x − x²/2 for 0 ≤ x < 1,
= exp(−kϵ²/4)
≤ exp(−(2 + α) ln n),   for k ≥ 4(1 + α/2)(ϵ²/2)^{-1} ln n,
= 1/n^{2+α}.
Similarly, applying Lemma 3.2.2(b) with β = 1 + ϵ,
Prob[L ≥ (1 + ϵ)µ] ≤ exp((k/2)(1 − (1 + ϵ) + ln(1 + ϵ)))
≤ exp((k/2)(−ϵ + (ϵ − ϵ²/2 + ϵ³/3))),   by ln(1 + x) ≤ x − x²/2 + x³/3 for x ≥ 0,
= exp(−(k/2)(ϵ²/2 − ϵ³/3))
≤ exp(−(2 + α) ln n),   for k ≥ 4(1 + α/2)(ϵ²/2 − ϵ³/3)^{-1} ln n,
= 1/n^{2+α}.
Now set the map f(x) = √(p/k) x' = √(p/k) (x_1, . . . , x_k, 0, . . . , 0). By the above calculations, for a fixed pair (i, j), the probability that the distortion
∥f(v_i) − f(v_j)∥² / ∥v_i − v_j∥²
does not lie in the range [(1 − ϵ), (1 + ϵ)] is at most 2/n^{2+α}. Using the trivial union bound over the (n choose 2) pairs, the chance that some pair of points suffers a large distortion is at most
(n choose 2) · 2/n^{2+α} = (1/n^α)(1 − 1/n) ≤ 1/n^α.
Hence f has the desired properties with probability at least 1 − 1/n^α. This gives us a randomized polynomial time algorithm. □
Now it remains to prove Lemma 3.2.2.
Proof of Lemma 3.2.2. For t > 0,
Prob(L ≤ βµ) = Prob(Σ_{i=1}^k x_i² ≤ βµ Σ_{i=1}^p x_i²)
= Prob(βµ Σ_{i=1}^p x_i² − Σ_{i=1}^k x_i² ≥ 0)
= Prob(exp(tβµ Σ_{i=1}^p x_i² − t Σ_{i=1}^k x_i²) ≥ 1)
≤ E exp(tβµ Σ_{i=1}^p x_i² − t Σ_{i=1}^k x_i²)   (by Markov's inequality)
= Π_{i=1}^k E exp(t(βµ − 1)x_i²) · Π_{i=k+1}^p E exp(tβµ x_i²)
= (E exp(t(βµ − 1)x²))^k (E exp(tβµ x²))^{p−k}
= (1 − 2t(βµ − 1))^{−k/2} (1 − 2tβµ)^{−(p−k)/2},
where we use the fact that if X ∼ N(0, 1), then E[e^{sX²}] = 1/√(1 − 2s) for −∞ < s < 1/2.
We refer to the last expression as g(t). The last line of the derivation requires the additional constraints tβµ < 1/2 and t(βµ − 1) < 1/2, so we restrict to 0 < t < 1/(2βµ). Minimizing g(t) is equivalent to maximizing
h(t) = 1/g(t) = (1 − 2t(βµ − 1))^{k/2} (1 − 2tβµ)^{(p−k)/2}
on the interval 0 < t < 1/(2βµ). Setting the derivative h'(t) = 0, the maximum is attained at
t_0 = (1 − β)p / (2β(p − βk)).
Hence we have
h(t_0) = ((p − k)/(p − kβ))^{(p−k)/2} (1/β)^{k/2},
and this is exactly what we need.
The proof of Lemma 3.2.2(b) is almost exactly the same as that of Lemma 3.2.2(a). □
3.2.1. Conclusion. As we can see, this proof of the Lemma is both simple (using just elementary probabilistic techniques) and elegant. In machine learning, such stochastic methods often turn out to be remarkably powerful. The random projection method introduced here can be used in many fields where data of huge dimension are concerned. For example, in term-document data, random projections are particularly useful: compared with the number of words in the dictionary, the words contained in a single document are typically sparse (a few thousand words) while the dictionary is huge. Random projections often provide a useful tool to compress such data without losing much pairwise distance information.
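To make the uniform distortion bound concrete, the following small numpy sketch (synthetic data, not part of the lab sections later) projects n random points with the Gaussian form R = A/√k from the remark above and reports the extreme ratios of squared pairwise distances before and after projection; they typically fall within [1 − ϵ, 1 + ϵ].

import numpy as np

np.random.seed(0)
n, p, eps = 100, 10000, 0.2
k = int(np.ceil(4 * np.log(n) / (eps ** 2 / 2 - eps ** 3 / 3)))   # JL embedding dimension

X = np.random.rand(n, p)                   # n points in R^p (synthetic)
R = np.random.randn(p, k) / np.sqrt(k)     # Gaussian random projection R = A / sqrt(k)
Y = X @ R

ratios = []
for i in range(n):
    for j in range(i + 1, n):
        d2 = np.sum((X[i] - X[j]) ** 2)
        ratios.append(np.sum((Y[i] - Y[j]) ** 2) / d2)
print(min(ratios), max(ratios))            # typically within [1 - eps, 1 + eps]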

3.3. Application: Human Genome Diversity Project


Now consider a SNPs (Single Nucleotide Polymorphisms) dataset in the Human Genome Diversity Project (HGDP, http://www.cephb.fr/en/hgdp_panel.php), which consists of a data matrix of n-by-p for n = 1064 individuals around the world and p = 644258 SNPs. Each entry in the matrix is 0, 1, 2, or 9, representing "AA", "AC", "CC", and "missing value", respectively. After removing 21 rows with all missing values, we are left with a matrix X of size 1043 × 644258.
Consider the projection of the 1043 persons onto the MDS (PCA) coordinates. Let H = I − (1/n) 1 1^T be the centering matrix, and define
K = HXX^T H = UΛU^T,
which is a positive semi-definite matrix (the centered Gram matrix) whose eigenvalue decomposition is given by UΛU^T. Taking the first two scaled eigenvectors √λ_i u_i (i = 1, 2) as the projections of the n individuals, Figure 1 gives the projection plot. It is interesting to note that the point cloud exhibits a continuous trend of human migration in history: origins in Africa, then migration to the Middle East, followed by one branch to Europe and another branch to Asia, finally spreading into America and Oceania.
One computational concern is the high dimensionality p = 644,258, which is much larger than the number of samples n = 1043. However, the random projections introduced above provide an efficient way to compute MDS (PCA) principal components with an almost isometry.

We randomly select (without replacement) {n_i, i = 1, . . . , k} from {1, . . . , p} with equal probability. Let R ∈ R^{k×p} be a Bernoulli random matrix satisfying
R_ij = 1/k if j = n_i, and R_ij = 0 otherwise.
Now define
K̃ = H(XR^T)(RX^T)H,
whose eigenvectors lead to new principal components of MDS. In the middle and right panels, Figure 1 plots such approximate MDS principal components with k = 5,000 and k = 100,000, respectively. These plots are qualitatively equivalent to the original one.

Figure 1. (Left) Projection of 1043 individuals on the top 2 MDS


principal components. (Middle) MDS computed from 5,000 ran-
dom projections. (Right) MDS computed from 100,000 random
projections. Pictures are due to Qing Wang.

3.4. Random Projections and Compressed Sensing


There are wide applications of random projections in high dimensional data processing, e.g. (Vem04). Here we particularly choose a special one, compressed (or compressive) sensing (CS), where we will use the Johnson-Lindenstrauss Lemma to prove the Restricted Isometry Property (RIP), a crucial result in CS. A reference can be found at (BDDW08).
Compressive sensing can be traced back to the 1950s in signal processing in geophysics. Its modern version appeared in LASSO (Tib96) and BPDN (CDS98), and achieved a highly noticeable status with (CT05; CRT06; CT06).
The basic problem of compressive sensing can be expressed by the following under-determined linear algebra problem. Assume that a signal x* ∈ R^p is sparse with respect to some basis, and let A ∈ R^{n×p} (also written Φ ∈ R^{n×p}) with n < p be the measurement matrix. Given the measurement b = Ax* = Φx* ∈ R^n¹, how can one recover x* by solving the linear equation system
(56) Ax = b?
As n < p, it is an under-determined problem, whence without further constraints the problem does not have a unique solution. To overcome this issue, one popular
¹Below we abuse both notations A and Φ for the measurement matrix. The noisy version is b = Ax + ϵ, which will be discussed later.

assumption is that the signal x* is sparse, namely the number of nonzero components ∥x*∥₀ := #{x*_i ≠ 0 : 1 ≤ i ≤ p} is small compared to the total dimensionality p. Figure 2 gives an illustration of such a sparse linear equation problem.

Figure 2. Illustration of Compressive Sensing (CS). A is a rect-


angular matrix with more columns than rows. The dark elements
represent nonzero elements while the light ones are zeroes. The
signal vector x∗ , although high dimensional, is sparse.

3.4.1. Some Sparse Recovery Algorithms. Now we formally give some algorithms to solve the problem. Without loss of generality, we assume that each column of the design matrix A = [A_1, . . . , A_p] has been standardized, that is, ∥A_j∥₂ = 1 for j = 1, ..., p.
With the sparsity assumption above, a simple idea is to find the sparsest solution satisfying the measurement equation:
(57) (P0) min ∥x∥₀
s.t. Ax = b.
This is an NP-hard combinatorial optimization problem.
3.4.1.1. Basis Pursuit. A convex relaxation of (57) is called Basis Pursuit (CDS98),
(58) (P1) min ∥x∥₁ := Σ_i |x_i|
s.t. Ax = b.
This is a linear programming problem. Figure 3 shows different projections of a sparse vector x* under ℓ0, ℓ1 and ℓ2, from which one can see that in some cases the convex relaxation (58) does recover the sparse signal solution of (57). A natural question then arises: under what conditions does the linear programming problem (P1) have a solution that exactly solves (P0), i.e. exactly recovers the sparse signal x*?
When measurement noise exists, e.g. b = Ax + e with bounded ∥e∥₂ ≤ ϵ, the following Basis Pursuit De-Noising (BPDN) (CDS98) is used instead:
(59) (BPDN) min ∥x∥₁
s.t. ∥Ax − b∥₂ ≤ ϵ.
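As a quick illustration of (58) and (59), the following sketch solves Basis Pursuit with the cvxpy package (the same package used in the lab sections); the Gaussian sensing matrix and the sparse signal here are synthetic assumptions for this demo, and the BPDN variant is indicated in the final comment.

import numpy as np
import cvxpy as cp

np.random.seed(0)
n, p, k = 50, 200, 5
A = np.random.randn(n, p) / np.sqrt(n)     # synthetic Gaussian sensing matrix
x_true = np.zeros(p)
x_true[np.random.choice(p, k, replace=False)] = np.random.randn(k)
b = A @ x_true

# Basis Pursuit (P1): min ||x||_1  s.t.  Ax = b
x = cp.Variable(p)
prob = cp.Problem(cp.Minimize(cp.norm1(x)), [A @ x == b])
prob.solve()
print(np.max(np.abs(x.value - x_true)))    # close to zero when recovery succeeds
# BPDN (59) for noisy b: replace the constraint by [cp.norm(A @ x - b, 2) <= eps]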
3.4.1.2. Orthogonal Matching Pursuit. Another popular algorithm, Orthogonal Matching Pursuit (OMP), was proposed by Stephane Mallat and Zhifeng Zhang in 1993 (MZ93); see Algorithm 7.
Remarks:
1. OMP chooses the column of maximal correlation with the residual; in fact, this is also the column giving the steepest decline of the residual, which is why OMP is called greedy.
2. In the noisy case, we stop the algorithm once ∥r_t∥ ≤ ε for a given ε.

Figure 3. Comparison between different projections. Left: pro-


jection of x∗ under ∥ · ∥0 ; middle: projection under ∥ · ∥1 which
favors sparse solution; right: projection under Euclidean distance.

Algorithm 7: Orthogonal Matching Pursuit (OMP), Mallat-Zhang 1993
Input: A, b
Output: x ∈ R^p
1 initialization: r_0 = b, x_0 = 0, S_0 = ∅
2 while ∥r_t∥₂ ≠ 0 do
3   j_t = arg max_{1≤j≤p} |⟨A_j, r_{t−1}⟩|
4   S_t = S_{t−1} ∪ {j_t}
5   x_t = arg min_{supp(x) ⊆ S_t} ∥b − Ax∥₂
6   r_t = b − Ax_t
7 end

3. It is natural to ask how well OMP can recover x*; the answer is affirmative under some conditions discussed below. A minimal numerical sketch of OMP follows.
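The following numpy sketch of Algorithm 7 is a minimal, non-optimized illustration; stopping after k steps (or when the residual is small) is a choice made for this sketch rather than part of the algorithm statement above.

import numpy as np

def omp(A, b, k, tol=1e-8):
    # Orthogonal Matching Pursuit: greedily select at most k columns of A
    n, p = A.shape
    r = b.copy()
    S = []
    x = np.zeros(p)
    for _ in range(k):
        if np.linalg.norm(r) <= tol:
            break
        j = int(np.argmax(np.abs(A.T @ r)))                # column most correlated with the residual
        S.append(j)
        xS, *_ = np.linalg.lstsq(A[:, S], b, rcond=None)   # least squares on the current support
        x = np.zeros(p)
        x[S] = xS
        r = b - A @ x                                      # update the residual
    return x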
3.4.1.3. LASSO. The Least Absolute Shrinkage and Selection Operator (LASSO) (Tib96) solves the following problem for noisy measurements b = Ax + e:
(60) (LASSO) min_{x ∈ R^p} ∥Ax − b∥₂² + λ∥x∥₁,
which is a convex quadratic programming problem.


3.4.1.4. Dantzig Selector. The Dantzig Selector (CT07) is proposed to deal
with noisy measurement b = Ax + e:
(61) min ∥x∥1
s.t. ∥AT (Ax − b)∥∞ ≤ λ
which is a linear programming problem, more scalable than quadratic programming
for large scale problems. It was shown in (BRT09) that Dantzig Selector and LASSO
share similar statistical properties.
For bounded noise ∥e∥∞ , the following formulation is used in network analysis
(JYLG12)
(62) min ∥x∥1
s.t. ∥Ax − b∥∞ ≤ ϵ

3.4.1.5. Differential Inclusions and Linearized Bregman Iterations. See Section 3.5 for details.

3.4.2. Uniformly Sparse Recovery Conditions. We are interested in the conditions under which we can recover an arbitrary k-sparse x* ∈ R^p by those algorithms, for k = |S| = |supp(x*)| ≪ n < p.
Now we turn to consider the conditions under which the algorithms above can recover x*. Below A* denotes the adjoint (conjugate transpose) of the matrix A, which is A^T if A is real.
(a) Uniqueness condition:
A*_S A_S ⪰ rI, for some r > 0.
(b) Incoherence (Donoho-Huo, 1999): Donoho and Huo (DH01) show the following sufficient condition for sparse recovery by BP,
µ = max_{i≠j} |⟨A_i, A_j⟩| < 1/(2k − 1).
This condition is numerically verifiable, so it is the simplest condition.
(c) Irrepresentable condition (Exact-Recovery-Condition by (Tro04)): Joel Tropp shows that under the condition (Tro04)
M := ∥A*_{S^c} A_S (A*_S A_S)^{-1}∥∞ < 1,
both OMP and BP recover x*. This condition is impossible to verify unless the true support set S is known.
(d) Restricted Isometry Property (R.I.P. by (CRT06)): for all k-sparse x ∈ R^p, there exists δ_k ∈ (0, 1) such that
(1 − δ_k)∥x∥₂² ≤ ∥Ax∥₂² ≤ (1 + δ_k)∥x∥₂².
Remarks:
1. The uniqueness condition is a basic one; without it we cannot even know which x* we are going to recover.
2. The irrepresentable condition says that the correlation between A_S and A_{S^c} should be controlled. In fact, we may regard the rows of A*_{S^c} A_S (A*_S A_S)^{-1} as the regression coefficients in A_j = A_S β + ε for j ∈ S^c.
3. In fact, the irrepresentable condition cannot be verified unless we already know S = supp(x*).
4. The only condition that is easy to check is the incoherence condition, which is stronger than the irrepresentable condition.
5. The weakest condition, R.I.P., is also not easy to verify. But the Johnson-Lindenstrauss Lemma implies that suitable random matrices satisfy R.I.P. with high probability.
3.4.2.1. Incoherence Condition. (DH01) shows that as long as the sparsity of x* satisfies the incoherence condition
µ(A) < 1/(2k − 1),
which is later improved by (EB01) to
µ(A) < (√2 − 1/2)/k,
then P1 recovers x*.
Tropp (Tro04) also shows that the incoherence condition is stronger than the irrepresentable condition in the following sense.
Lemma 3.4.1 (Tropp, 2004 (Tro04)).
(63) µ < 1/(2k − 1) ⇒ M ≤ kµ/(1 − (k − 1)µ) < 1.
On the other hand, Tony Cai et al. (CXZ09; CW11) shows that the irrepre-
sentable or the incoherence condition is tight in the sense that if it fails, there exists
data A, x∗ , and b such that sparse recovery is not possible.

Proof of Lemma 3.4.1. First, we have
(64) M = ∥A*_{S^c} A_S (A*_S A_S)^{-1}∥∞ ≤ ∥(A*_S A_S)^{-1}∥∞ ∥A*_{S^c} A_S∥∞.
It is easy to verify that
(65) ∥A*_{S^c} A_S∥∞ ≤ kµ.
Then we consider ∥(A*_S A_S)^{-1}∥∞. Decompose A*_S A_S = I_k + ∆; then
(66) max_{i,j} |∆_{i,j}| ≤ µ, diag(∆) = 0
⇒ ∥∆∥∞ ≤ (k − 1)µ ≤ (k − 1)/(2k − 1) < 1
⇒ (A*_S A_S)^{-1} = (I_k + ∆)^{-1} = Σ_{j=0}^∞ (−∆)^j
⇒ ∥(A*_S A_S)^{-1}∥∞ = ∥Σ_{j=0}^∞ (−∆)^j∥∞ ≤ Σ_{j=0}^∞ ∥∆∥∞^j = 1/(1 − ∥∆∥∞) ≤ 1/(1 − (k − 1)µ).
Thus we reach our conclusion,
(67) M ≤ kµ/(1 − (k − 1)µ).

3.4.2.2. Irrepresentable Condition. For the irrepresentable condition, Joel Tropp (Tro04) shows the following sufficiency for both OMP and BP. As a corollary in (ORX+16), with zero noise the differential inclusion (220) recovers x* if the irrepresentable condition M < 1 is satisfied.
Theorem 3.4.2 (Tropp, 2004). Under the uniqueness and irrepresentable conditions, OMP and BP recover x*.
Proof of Theorem 3.4.2. (I) OMP recovers x*.
The key to the proof is to show that at each step t ≤ k, OMP selects an atom from S rather than S^c. For this, it suffices to examine
(68) ρ(r_t) = ∥A*_{S^c} r_t∥∞ / ∥A*_S r_t∥∞ < 1.
In the noise-free case,
(69) b = Ax* ∈ im(A_S) and r_t = b − Ax_t ∈ im(A_S) ⇒ r_t ∈ im(A_S).
P_S = A_S (A*_S A_S)^{-1} A*_S is the projection operator onto im(A_S), thus we have r_t = P_S r_t. Hence,
(70) ρ(r_t) = ∥A*_{S^c}(P_S r_t)∥∞ / ∥A*_S r_t∥∞ = ∥A*_{S^c} A_S (A*_S A_S)^{-1} A*_S r_t∥∞ / ∥A*_S r_t∥∞ ≤ ∥A*_{S^c} A_S (A*_S A_S)^{-1}∥∞ < 1.
(II) BP recovers x*.
Assume that x̂ ≠ x* solves
(71) P1: min ∥x∥₁, s.t. Ax = b.
Denote Ŝ = supp(x̂) and suppose Ŝ \ S ≠ ∅. We have
(72) ∥x*∥₁ = ∥(A*_S A_S)^{-1} A*_S b∥₁
= ∥(A*_S A_S)^{-1} A*_S A_Ŝ x̂_Ŝ∥₁   (Ax̂ = b)
= ∥(A*_S A_S)^{-1} A*_S A_S x̂_S + (A*_S A_S)^{-1} A*_S A_{Ŝ\S} x̂_{Ŝ\S}∥₁   (x̂_Ŝ = x̂_S + x̂_{Ŝ\S})
< ∥x̂_S∥₁ + ∥x̂_{Ŝ\S}∥₁ = ∥x̂_Ŝ∥₁,
which is a contradiction. □
3.4.2.3. RIP and Random Projections. (CDD09) shows that the incoherence condition implies RIP, whence RIP is a weaker condition. Under the RIP condition, uniqueness of P0 and P1 can be guaranteed for all k-sparse signals, often called uniform exact recovery (Can08).
Theorem 3.4.3. The following holds for all k-sparse x* satisfying Ax* = b.
(1) If δ_{2k} < 1, then problem P0 has a unique solution x*;
(2) If δ_{2k} < √2 − 1, then P1 (58) has a unique solution x*, i.e. it recovers the original sparse signal x*.
The first condition² is nothing but that every 2k columns of A are linearly independent. To see the first condition, assume by contradiction that there is another k-sparse solution x' of P0. Then y = x* − x' is 2k-sparse and Ay = 0. If y ≠ 0, it violates δ_{2k} < 1, since then 0 = ∥Ay∥² ≥ (1 − δ_{2k})∥y∥² > 0. Hence one must have y = 0, i.e. x* = x', which proves the uniqueness of P0. The proof of the second condition can be found in (Can08).
RIP conditions also lead to upper bounds on the distance between the solutions above and the true sparse signal x*. For example, in the case of BPDN the following result holds (Can08).
Theorem 3.4.4. Suppose that ∥e∥₂ ≤ ϵ. If δ_{2k} < √2 − 1, then
∥x̂ − x*∥₂ ≤ C₁ k^{−1/2} σ_k¹(x*) + C₂ ϵ,
²The necessity of the first condition fails. As pointed out to me by Mr. Kaizheng Wang, a counterexample can be constructed as follows: let A = [1, 1, 1, 0; 1, 1, 0, 1], x* = [0, 0, 1, 0]^T, b = [1, 0]^T, x = [1, −1, 0, 0]^T, k = 1. Then x* is the unique k-sparse solution to Ax* = b. On the other hand, x is 2k-sparse, but Ax = 0. Hence dependence of columns in A implies that δ_{2k} ≥ 1, which disproves the necessity of δ_{2k} < 1.

where x̂ is the solution of BPDN and
σ_k¹(x*) = min_{|supp(y)| ≤ k} ∥x* − y∥₁
is the best k-term approximation error of x* in ℓ₁.


How can one find matrices satisfying RIP? Equipped with the Johnson-Lindenstrauss Lemma, one can construct such matrices by random projections with high probability (BDDW08).
Recall that in the Johnson-Lindenstrauss Lemma, one takes a random matrix A ∈ R^{n×p} with i.i.d. entries drawn from some distribution satisfying certain bounded moment conditions, e.g. A_ij ∼ N(0, 1/n). The key step in establishing the Johnson-Lindenstrauss Lemma is the following concentration fact:
(73) Pr( |∥Ax∥₂² − ∥x∥₂²| ≥ ϵ∥x∥₂² ) ≤ 2e^{−n c₀(ϵ)}.
With this one can establish a bound on the action of A on k-sparse x by a union bound via covering numbers of k-sparse signals.
Lemma 3.4.5. Let A ∈ R^{n×p} be a random matrix satisfying the concentration inequality (73). Then for any δ ∈ (0, 1) and any index set T with |T| = k < n, the following holds:
(74) (1 − δ)∥x∥₂ ≤ ∥Ax∥₂ ≤ (1 + δ)∥x∥₂
for all x whose support is contained in T, with probability at least
(75) 1 − 2(12/δ)^k e^{−c₀(δ/2)n}.
Proof. It suffices to prove the result when ∥x∥₂ = 1, as A is linear. Let X_T := {x : supp(x) = T, ∥x∥₂ = 1}. We first choose Q_T, a δ/4-cover of X_T, such that for every x ∈ X_T there exists q ∈ Q_T satisfying ∥q − x∥₂ ≤ δ/4. Since X_T has dimension at most k, it is well known from covering numbers that the capacity #(Q_T) ≤ (12/δ)^k. Now we apply the union bound of (73) to the set Q_T with ϵ = δ/2. For each q ∈ Q_T, with probability at most 2e^{−c₀(δ/2)n}, |∥Aq∥₂² − ∥q∥₂²| ≥ (δ/2)∥q∥₂². Hence the same bound holds uniformly for all q ∈ Q_T with probability at most
2#(Q_T) e^{−c₀(δ/2)n} ≤ 2(12/δ)^k e^{−c₀(δ/2)n}.
Now we define α to be the smallest constant such that
∥Ax∥₂ ≤ (1 + α)∥x∥₂, for all x ∈ X_T.
We can show that α ≤ δ with the same probability. For this, pick a q ∈ Q_T such that ∥q − x∥₂ ≤ δ/4, whence by the triangle inequality
∥Ax∥₂ ≤ ∥Aq∥₂ + ∥A(x − q)∥₂ ≤ 1 + δ/2 + (1 + α)δ/4.
This implies that α ≤ δ/2 + (1 + α)δ/4, whence α ≤ (3δ/4)/(1 − δ/4) ≤ δ. This gives the upper bound. The lower bound also follows, since
∥Ax∥₂ ≥ ∥Aq∥₂ − ∥A(x − q)∥₂ ≥ 1 − δ/2 − (1 + δ)δ/4 ≥ 1 − δ,
which completes the proof. □
With this lemma, note that there are at most (p choose k) subspaces of k-sparse vectors; a union bound leads to the following result for RIP.

Theorem 3.4.6. Let A ∈ R^{n×p} be a random matrix satisfying the concentration inequality (73) and δ ∈ (0, 1). There exist c₁, c₂ > 0 such that if
k ≤ c₁ n / log(p/k),
the following RIP holds for all k-sparse x,
(1 − δ_k)∥x∥₂² ≤ ∥Ax∥₂² ≤ (1 + δ_k)∥x∥₂²,
with probability at least 1 − 2e^{−c₂ n}.
Proof. For each fixed support set T of size k (i.e. each X_T), RIP fails with probability at most
2(12/δ)^k e^{−c₀(δ/2)n}.
There are (p choose k) ≤ (ep/k)^k such subspaces. Hence, RIP fails with probability at most
2(ep/k)^k (12/δ)^k e^{−c₀(δ/2)n} = 2 e^{−c₀(δ/2)n + k[log(ep/k) + log(12/δ)]}.
Thus for a fixed c₁ > 0, whenever k ≤ c₁ n/log(p/k), the exponent above will be ≤ −c₂ n provided that c₂ ≤ c₀(δ/2) − c₁(1 + (1 + log(12/δ))/log(p/k)). The constant c₂ can always be chosen to be > 0 if c₁ > 0 is small enough. This leads to the result. □
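The following Monte Carlo sketch illustrates the statement empirically for a Gaussian matrix with N(0, 1/n) entries; note that it only samples random k-sparse vectors and reports the observed distortion range, it does not certify the RIP constant (which is computationally hard to evaluate exactly).

import numpy as np

np.random.seed(0)
n, p, k, trials = 200, 1000, 10, 2000
A = np.random.randn(n, p) / np.sqrt(n)     # entries N(0, 1/n)

ratios = []
for _ in range(trials):
    x = np.zeros(p)
    idx = np.random.choice(p, k, replace=False)
    x[idx] = np.random.randn(k)
    ratios.append(np.sum((A @ x) ** 2) / np.sum(x ** 2))
print(min(ratios), max(ratios))            # empirical range suggests (1 - delta_k, 1 + delta_k)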
Another use of random projections (random matrices) can be found in Robust
Principal Component Analysis (RPCA) in the next chapter.

3.5. Inverse Scale Space Method for Sparse Learning


Consider the following differential inclusion,
(76a) ρ̇t = −∇x ℓ(xt ),
(76b) ρt ∈ ∂Ω(xt ),
where
• ℓ(x) measures the (empirical) loss of the model on a set of samples, such as the mean square loss ℓ(x) = (1/2n)∥Ax − b∥²; and
• Ω(x) is a sparsity enforcement penalty function, with typical examples
including Lasso (∥x∥1 ), elastic-net (α∥x∥1 + (1 − α)∥x∥2 ), group Lasso,
and matrix nuclear norm.
The dynamics above form a differential inclusion due to the inclusion constraint (76b). The unique feature of the dynamics (76) is that it renders a regularization path as a family of models at various levels of sparsity or parsimony. (? BGOX06? ; BFOS07) and (Bur08) studied it as the Inverse Scale Space method in image restoration, and (? ) conducted a careful analysis and implementation. The key observation of these works was that large-scale (image) features are recovered before small-scale ones along the dynamics, from which the method derives its name.
Why are we interested in such dynamics? They render regularization paths which are distinct from, but statistically equivalent to, LASSO paths. Moreover, ISS may remove the bias in LASSO.
In literature, statistical path consistency for (220) was established in (ORX+ 16)
with its discretized algorithm, the LBI, under sparse linear regression models. This
work showed that (220) is able to remove the well-known statistical bias in Lasso


Figure 4. True vs LASSO (t hand-tuned, courtesy of Wotao Yin)

estimators (FL01), and produce unbiased estimators under nearly the same model
selection consistency conditions as Lasso. Furthermore, (HSXY20) showed that un-
der a strictly weaker condition than generalized Lasso (LST13), statistical path con-
sistency could be achieved by (220) equipped with the variable splitting technique.
These studies laid down a theoretical foundation for the statistical consistency of
regularization paths generated by the solutions of differential inclusion (220). A free
R package is released, Libra (Linearized Bregman Algorithms) (RXY18). Another
Matlab package3 was also released with the Split LBI algorithm (HSXY16). These
works fostered various successful applications, such as high dimensional statistics
(XRY18), computer vision (FHX+ 16; ZSF+ 18), medical image analysis (SHYW17),
multimedia (XXCY16b), machine learning (XXCY16a; HSXY16), and AI (HY18).
Below we are going to show the bias of LASSO and a way of deriving the inverse scale space dynamics, together with its statistical model selection consistency.

3.5.1. The Bias of LASSO. Consider a sparse linear model as follows. Assume that β* ∈ R^p is sparse and unknown. Our purpose is to recover β* from n linear measurements
y = Xβ* + ϵ, y ∈ R^n,
where the noise ϵ ∼ N(0, σ²), S := supp(β*) with s = |S| ≤ n ≤ p, and T is the complement of S.
Recall that LASSO ((Tib96); or equivalently BPDN (CDS98)) solves the following ℓ1-regularized Maximum Likelihood Estimate problem
(77) min_β ∥β∥₁ + (t/2n)∥y − Xβ∥₂²,
where the regularization parameter λ = 1/t is often used in the literature; here we adopt the parameter t for the purpose of deriving the inverse scale space dynamics.
LASSO is biased in the sense that E(β̂_t) ≠ β* for all t > 0. Let us take some simple examples.
Example 5. (a) X = Id, n = p = 1: LASSO is soft-thresholding,
β̂_τ = 0 if τ < 1/β̃*, and β̂_τ = β̃* − 1/τ otherwise;
(b) n = 100, p = 256, X_ij ∼ N(0, 1), ϵ_i ∼ N(0, 0.1).

3https://github.com/yuany-pku/split-lbi

Figure 5. l2 , l1 , and SCAD, etc.

Even when the following model selection consistency (conditions given by (ZY06; Zou06; YL07; Wai09), etc.) is reached at a certain τ_n,
∃τ_n ∈ (0, ∞) s.t. supp(β̂_{τ_n}) = S,
the LASSO estimate is biased away from the oracle estimator:
(78) (β̂_{τ_n})_S = β̃*_S − (1/τ_n) Σ_{n,S}^{-1} sign(β*_S), τ_n > 0,
where the oracle estimator is defined as the subset least squares solution (MLE) with β̃*_T = 0, had God revealed S to us,
(79) β̃*_S = β*_S + (1/n) Σ_n^{-1} X_S^T ϵ, where Σ_n = (1/n) X_S^T X_S.
Estimate (78) can be derived from the first order optimality condition of LASSO,
(80a) ρ_t/t = (1/n) X^T(y − Xβ_t),
(80b) ρ_t ∈ ∂∥β_t∥₁,
by setting β_T(t) = 0 and solving for β_S(t).
How can one remove the bias and return the oracle estimator? To reduce bias, various non-convex regularization schemes were proposed (Fan-Li's SCAD, Zhang's MC+, Zou's adaptive LASSO, ℓ_q (q < 1), etc.):
min_β Σ_i ϕ(|β_i|) + (t/2n)∥y − Xβ∥₂²,
where ϕ is a nonnegative function that is singular (non-differentiable) at 0 to enforce sparsity, and whose derivative satisfies lim_{t→∞} ϕ'(t) = 0 for debiasing. Such a ϕ must be nonconvex, and it is in general computationally hard to locate the global optimizer in nonconvex optimization (? ). Various studies give conditions under which any local optimizer can achieve statistical precision. Are there any other simple schemes?
3.5.2. Deriving Differential Inclusion as Debiasing. The crucial idea is as follows.
• LASSO:
min_β ∥β∥₁ + (t/2n)∥y − Xβ∥₂².
• KKT optimality condition:
⇒ ρ_t = (t/n) X^T(y − Xβ_t), ρ_t ∈ ∂∥β_t∥₁.
• Taking the derivative (assuming differentiability) w.r.t. t:
(81) ⇒ ρ̇_t = (1/n) X^T(y − X(β̇_t t + β_t)), ρ_t ∈ ∂∥β_t∥₁.
• Assuming sign-consistency in a neighborhood of τ_n: for i ∈ S, ρ_{τ_n}(i) = sign(β*(i)) ∈ {±1} ⇒ ρ̇_{τ_n}(i) = 0, whence
β̇_{τ_n} τ_n + β_{τ_n} = β̃*.
• Equivalently, the combination β̇_t t + β_t removes the bias of LASSO automatically:
β^{lasso}_{τ_n} = β̃* − (1/τ_n) Σ_n^{-1} sign(β*) ⇒ β̇^{lasso}_{τ_n} τ_n + β^{lasso}_{τ_n} = β̃* (the oracle)!
Now replacing β̇^{lasso}_{τ_n} τ_n + β^{lasso}_{τ_n} by β_t in (81) renders the following differential inclusion,
(82a) ρ̇_t = (1/n) X^T(y − Xβ_t),
(82b) ρ_t ∈ ∂∥β_t∥₁,
starting at t = 0 with ρ(0) = β(0) = 0.
Remark. (a) This amounts to replacing ρ_t/t in the LASSO KKT condition,
ρ_t/t = (1/n) X^T(y − Xβ_t),
by the derivative ρ̇_t.
(b) (BGOX06) shows that in image recovery it recovers the objects in an inverse-scale order as t increases (larger objects appear in β_t first).
Example 6. (a) X = Id, n = p = 1: ISS gives hard-thresholding,
β_τ = 0 if τ < 1/β̃*, and β_τ = β̃* otherwise;
(b) the same example shown before (figures by courtesy of Wotao Yin): (left) true signal vs. LASSO (BPDN) recovery; (right) true signal vs. ISS (Bregman) recovery.


3.5.3. Solution Path of ISS: Sequential Restricted Maximum Likelihood Estimate. The differential inclusion (82a) has piecewise-constant solution paths.
• ρ_t is piecewise linear in t,
ρ_t = ρ_{t_k} + ((t − t_k)/n) X^T(y − Xβ_{t_k}), t ∈ [t_k, t_{k+1}),
where t_{k+1} = sup{t > t_k : ρ_{t_k} + ((t − t_k)/n) X^T(y − Xβ_{t_k}) ∈ ∂∥β_{t_k}∥₁}.

• β_t is piecewise constant in t: β_t = β_{t_k} for t ∈ [t_k, t_{k+1}), and β_{t_{k+1}} is the sequential restricted Maximum Likelihood Estimate obtained by solving a nonnegative least squares problem (Burger et al. '13; Osher et al. '16),
(83) β_{t_{k+1}} = arg min_β ∥y − Xβ∥₂²
subject to (ρ_{t_{k+1}})_i β_i ≥ 0 ∀ i ∈ S_{k+1},
β_j = 0 ∀ j ∈ T_{k+1}.
• Note: sign consistency ρ_t = sign(β*) ⇒ β_t = β̃*, the oracle estimator.

Figure 6. Diabetes data (Efron et al. '04): the regularization paths are different, yet bear similarities in the order in which parameters become nonzero.

3.5.4. Statistical Path Consistency. Assumptions:
(A1) Restricted Strong Convexity: ∃γ ∈ (0, 1],
(1/n) X_S^T X_S ⪰ γI;
(A2) Incoherence/Irrepresentable Condition: ∃η ∈ (0, 1),
∥ (1/n) X_T^T X_S ((1/n) X_S^T X_S)^{-1} ∥∞ ≤ 1 − η.
Remark. (a) "Irrepresentable" means that one cannot represent (regress) the column vectors of X_T well by the covariates in X_S.
(b) The incoherence/irrepresentable condition is used independently in (Tro04; YL06; ZY06; Zou06; Wai09; CWX10; CW11), etc.
ISS is a kind of restricted gradient descent (also known as Bregman gradient or mirror descent),
ρ̇_t = −grad L(β_t) = (1/n) X^T(y − Xβ_t), ρ_t ∈ ∂∥β_t∥₁,
such that
• the incoherence condition and strong signals ensure that it first evolves on the index set S (the oracle subspace) to reduce the loss

• strongly convex in subspace restricted on index set S ⇒ fast decay in loss


• early stopping after all strong signals are detected, before overfitting noise

Theorem 3.5.1 ((ORX+16)). Assume (A1) and (A2). Define an early stopping time
τ̄ := (η/(2σ)) √(n/log p) (max_{j∈T} ∥X_j∥)^{-1},
and the smallest magnitude β*_min = min(|β*_i| : i ∈ S). Then
(a) No-false-positive: for all t ≤ τ̄, the path has no false positive with high probability, supp(β(t)) ⊆ S;
(b) Model selection consistency: moreover, if the signal is strong enough such that
β*_min ≥ (8σ√(2 + log s))/γ^{1/2} ∨ (4σ max_{j∈T} ∥X_j∥/(γη)) √(log p/n),
then there is a τ ≤ τ̄ such that the solution path satisfies β(t) = β̃* for every t ∈ [τ, τ̄].

Remark. By this theorem, ISS is equivalent to LASSO with λ* = 1/τ̄ (Wai09) up to a log s factor.

3.5.5. Linearized Bregman Iterations. The differential inclusion (220) has piecewise-constant solution paths which can be expensive to compute. The following damped dynamics with κ > 0 has a continuous solution x_t that converges to that of (220) as κ → ∞:
(84) ρ̇_t + ẋ_t/κ = −∇_x ℓ(x_t),
ρ_t ∈ ∂Ω(x_t).
Its Euler forward discretization gives the Linearized Bregman Iterations (LBI),
(85) z_{k+1} = z_k − α ∇_x ℓ(x_k),
x_{k+1} = κ · prox_Ω(z_{k+1}),
where z_{k+1} = ρ_{k+1} + x_{k+1}/κ, the initial choice is z_0 = x_0 = 0 (or a small Gaussian perturbation), the parameters satisfy κ > 0 and α > 0, and the proximal map associated with a convex function Ω is defined by
prox_Ω(z) = arg min_x (1/2)∥z − x∥² + Ω(x).

The discretizations of (220) are known as the Linearized Bregman Iteration


(LBI) ((OBG+ 05), or equation (3.7) and (5.19-20) of (YODG08)), which is devel-
oped independently of the continuous dynamics and has been widely used in image
processing and compressed sensing. In particular, in terms of matrix completion,
(CCS12) studied a discretized version of (220) with a matrix nuclear norm. Re-
cently, (FLL+ 20? ) applied such Inverse Scale Space methods with early stopping
to find sparse subnets in deep convolutional neural networks.
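For the ℓ1 penalty Ω(x) = ∥x∥₁, the proximal map in (85) is simply soft-thresholding, so LBI takes only a few lines of numpy. The following sketch is a minimal illustration; the default step size α is a heuristic choice (roughly ακ∥A∥²/n ≈ 1) rather than a prescription from the text, and early stopping along the returned path plays the role of regularization.

import numpy as np

def soft_threshold(z, t):
    # prox of t * ||.||_1: componentwise shrinkage towards zero
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lbi_l1(A, b, kappa=100.0, alpha=None, n_iter=2000):
    # Linearized Bregman Iterations (85) for the ell_1 penalty (a sketch)
    n, p = A.shape
    if alpha is None:
        # heuristic step size: alpha * kappa * ||A||^2 / n ~ 1
        alpha = n / (kappa * np.linalg.norm(A, 2) ** 2)
    z = np.zeros(p)
    x = np.zeros(p)
    path = []
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b) / n            # gradient of (1/2n)||Ax - b||^2
        z = z - alpha * grad
        x = kappa * soft_threshold(z, 1.0)      # x_{k+1} = kappa * prox_Omega(z_{k+1})
        path.append(x.copy())
    return np.array(path)                       # regularization path; early stopping selects a model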

3.6. Lab and Further Studies


3.6.1. Random Projections in PCA/MDS. The following Python code shows the 2-dimensional PCA/MDS embeddings of the SNPs dataset using the full samples and 5,000 random projections. Since the original dimension is quite large, random projection can accelerate the computation significantly without deteriorating the quality of the data shapes, showing that random projection is helpful for high dimensional data analysis. The Python package sklearn.random_projection (https://scikit-learn.org/stable/modules/random_projection.html) can be used for further explorations.
Now we load the raw data from ?.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import cvxpy as cp
import seaborn as sns

# Read the data file and the information file
df = pd.read_csv("ceph_hgdp_minor_code_XNA.betterAnnotated.csv")
df_info = pd.read_csv("ceph_hgdp_minor_code_XNA.sampleInformation.csv")
labels = df_info["Geographic.area"]
X = df.to_numpy()
X = X[:, 3:].astype(float).transpose()

PCA/MDS with original data.

# The centering matrix
H = np.eye(X.shape[0]) - (1 / X.shape[0]) * np.ones((X.shape[0], 1)).dot(np.ones((1, X.shape[0])))

# Covariance (centered Gram) matrix
K = H.dot(X).dot(X.transpose()).dot(H.transpose())

# Eigen-decomposition of the covariance matrix
eigen_values, eigen_vec = np.linalg.eig(K)

# Show 2-dimensional PCA embeddings
labels = df_info["Geographic.area"]
labels = labels.to_numpy()
plt.figure(figsize=(9, 9))
for label in set(list(labels)):
    index = labels == label
    plt.scatter(eigen_vec[index, 0], eigen_vec[index, 1], label=label)
plt.legend()
plt.show()
It shows the left picture in Figure 7.
Now we are going to show PCA after 5,000 random projections.
# Random projections to a k-dimensional subspace
k = 5000
n_k = np.random.choice(X.shape[1], k, replace=False)
R = np.eye(X.shape[1])[n_k] / k

# Covariance matrix after random projections
K_rp = H.dot(X.dot(R.transpose())).dot(R.dot(X.transpose())).dot(H.transpose())

# Eigen-decomposition of the covariance matrix
eigen_values_rp, eigen_vec_rp = np.linalg.eig(K_rp)

# Show 2-dimensional PCA embeddings
labels = df_info["Geographic.area"]
labels = labels.to_numpy()
plt.figure(figsize=(9, 9))
for label in set(list(labels)):
    index = labels == label
    plt.scatter(eigen_vec_rp[index, 0], eigen_vec_rp[index, 1], label=label)
plt.legend()
plt.show()
It shows the right picture in Figure 7. One can see that after 5,000 random projections, the top-2 principal components preserve the shape of the original PCA of dimensionality 488,919.

Figure 7. (Left) Projection of 1043 individuals on the top 2 MDS


principal components. (Right) MDS computed from 5,000 random
projections. Courtesy of Donghao LI.
CHAPTER 4

Generalized PCA and MDS via Semidefinite Programming

4.1. Introduction of Semi-Definite Programming (SDP)


Here we will give a short note on the Semidefinite Programming (SDP) formulations of Robust PCA, Sparse PCA, MDS with uncertainty, and Maximal Variance Unfolding, etc. First of all, we give a short introduction to SDP based on a parallel comparison with Linear Programming (LP).
Semi-definite programming (SDP) involves a linear objective function and linear (in)equality constraints with respect to variables that are positive semi-definite matrices. SDP is a generalization of linear programming (LP) obtained by replacing nonnegative vector variables with positive semi-definite matrices. We will give a brief introduction to SDP through a comparison with LP.
LP (Linear Programming): for x ∈ R^n and c ∈ R^n,
(86) min c^T x
s.t. Ax = b,
x ≥ 0.
This is the primal linear programming problem.
In SDP, the inner product between vectors c^T x in LP is replaced by the Hadamard (entrywise) inner product (denoted by •) between matrices.
SDP (Semi-definite Programming): for X, C ∈ Rn×n
X
(87) min C • X = cij Xij
i,j
s.t. Ai • X = bi , for i = 1, · · · , m
X⪰0
Linear programming has a dual problem via the Lagrangian. The Lagrangian of the primal problem is
L(x; y, µ) = c^T x + y^T(b − Ax) − µ^T x, µ ≥ 0,
and the dual is max_{µ≥0, y} min_x L(x; y, µ). The inner minimization implies that
∂L/∂x = c − A^T y − µ = 0 ⇐⇒ c − A^T y = µ ≥ 0 ⇒ min_x L = b^T y,
which leads to the following dual problem.

LD (Dual Linear Programming):


(88) max bT y
s.t. µ = c − AT y ≥ 0
In a similar manner, for SDP’s dual form, we have the following.
SDD (Dual Semi-definite Programming):
(89) max b^T y
s.t. S = C − Σ_{i=1}^m A_i y_i ⪰ 0 =: C − A^T ⊗ y,
where A = (A_1; . . . ; A_m) stacks the constraint matrices and y = (y_1, . . . , y_m)^T.

4.1.1. Duality of SDP. Define the feasible sets of the primal and dual problems as F_p = {X ⪰ 0 : A_i • X = b_i} and F_d = {(y, S) : S = C − Σ_i y_i A_i ⪰ 0}, respectively. Similar to linear programming, semi-definite programming also has weak and strong duality properties. Weak duality says that the primal value is always an upper bound of the dual value. Strong duality says that the existence of an interior point ensures a vanishing duality gap between primal and dual values, as well as the complementary conditions. In this case, to check the optimality of a primal variable, it suffices to find a dual variable which meets the complementary condition with the primal. This is often called the witness method. For more references on the duality of SDP, see e.g. (Ali95).
Theorem 4.1.1 (Weak Duality of SDP). If F_p ≠ ∅ and F_d ≠ ∅, then C • X ≥ b^T y for all X ∈ F_p and all (y, S) ∈ F_d.

Theorem 4.1.2 (Strong Duality SDP). Assume the following hold,


(1) Fp ̸= ∅, Fd ̸= ∅;
(2) At least one feasible set has an interior.
Then X ∗ is optimal iff
(1) X ∗ ∈ Fp
(2) ∃(y ∗ , S ∗ ) ∈ Fd
s.t. C • X ∗ = bT y ∗ or X ∗ S ∗ = 0 (note: in matrix product)
In other words, the existence of an interior solution implies the complementary
condition of optimal solutions. Under the complementary condition, we have
rank(X ∗ ) + rank(S ∗ ) ≤ n
for every optimal primal X ∗ and dual S ∗ .

4.2. Robust PCA via SDP


Let X ∈ R^{p×n} be a data matrix. Classical PCA tries to find
(90) min ∥X − L∥
s.t. rank(L) ≤ k,
where the norm here is any unitarily invariant matrix norm, e.g. Schatten's p-norm ∥M∥_p = (Σ_i σ_i(M)^p)^{1/p} (p ≥ 1) where M admits the Singular Value Decomposition (SVD) M = USV^T with S = diag(σ_1, . . . , σ_k, . . .) (p = 2 is the Frobenius norm, p = 1 is the nuclear norm, and p = ∞ gives the spectral norm). SVD provides a solution with L = Σ_{i≤k} σ_i u_i v_i^T where X = Σ_i σ_i u_i v_i^T (σ_1 ≥ σ_2 ≥ . . .). In other words, classical PCA looks for the decomposition
X = L + E,
where the error matrix E has a small Frobenius norm, which usually is the case for Gaussian noise. However, if some outliers exist, i.e. there are a small number of sample points that deviate largely from the main population of samples, classical PCA is well known to be very sensitive to such outliers.

Figure 1. Classical PCA is sensitive to outliers

To address this issue, Robust PCA looks for the following decomposition instead
X =L+S
where
• L is a low rank matrix;
• S is a sparse matrix.
Example 7. In the spike signal model, X = αu + σ_ϵ ϵ, where α ∼ N(0, σ_u²) and ϵ ∼ N(0, I_p). X thus follows the normal distribution N(0, Σ) with Σ = σ_u² uu^T + σ_ϵ² I. So Σ = L + S has such a rank-sparsity structure with L = σ_u² uu^T and S = σ_ϵ² I.
Example 8. Let X = (x_1, . . . , x_p)^T ∼ N(0, Σ) be multivariate Gaussian random variables. The following characterization (CPW12) holds:
x_i and x_j are conditionally independent given the other variables ⇔ (Σ^{-1})_ij = 0.
We denote this by x_i ⊥ x_j | x_k (k ∉ {i, j}). Let G = (V, E) be an undirected graph where V represents the p random variables and (i, j) ∉ E ⇔ x_i ⊥ x_j | x_k (k ∉ {i, j}). G is called a (Gaussian) graphical model of X.

Divide the random variables into observed and hidden (a few) variables, X = (X_o, X_h)^T (in semi-supervised learning, unlabeled and labeled, respectively), and write
Σ = [Σ_oo, Σ_oh; Σ_ho, Σ_hh] and Q = Σ^{-1} = [Q_oo, Q_oh; Q_ho, Q_hh].
The following Schur complement equation holds for the covariance matrix of the observed variables:
Σ_oo^{-1} = Q_oo − Q_oh Q_hh^{-1} Q_ho.
Note that
• observed variables are often conditionally independent given the hidden variables, so Q_oo is expected to be sparse;
• hidden variables are few in number, so Q_oh Q_hh^{-1} Q_ho is of low rank.
In semi-supervised learning, the labeled points are of small number, and the unla-
beled points should be as much conditional independent as possible to each other
given labeled points. This implies that the labels should be placed on those most
“influential” points.

Figure 2. Surveillance video as low rank plus sparse matrices:


Left = low rank (middle) + sparse (right) (CLMW09)

Example 9 (Surveillance Video Decomposition). Figure 2 gives an example of low rank vs. sparse decomposition in surveillance video. In the left column, surveillance video of a movie theatre records a great number of images with the same background and various walking customers. If we vectorize these images (each image as a vector) to form a matrix, the background image leads to a rank-1 part and the occasional walking customers contribute to the sparse part.

More examples can be found at (CLMW09; CSPW11; CPW12).


In Robust PCA the purpose is to solve
(91) min ∥X − L∥0
s.t. rank(L) ≤ k
where ∥A∥0 = #{Aij ̸= 0}. However both the objective function and the constraint
are non-convex, whence it is NP-hard to solve in general.
The simplest convexification leads to a semi-definite relaxation:
∥S∥₀ := #{S_ij ≠ 0} ⇒ ∥S∥₁,
rank(L) := #{σ_i(L) ≠ 0} ⇒ ∥L∥_* = Σ_i σ_i(L),
where ∥L∥_* is called the nuclear norm of L, which has the semi-definite representation
∥L∥_* = min (1/2)(trace(W₁) + trace(W₂)) s.t. [W₁, L; L^T, W₂] ⪰ 0.
With these, the relaxed Robust PCA problem can be solved by the following semi-definite programming (SDP):
(92) min (1/2)(trace(W₁) + trace(W₂)) + λ∥S∥₁
s.t. L_ij + S_ij = X_ij, (i, j) ∈ E,
[W₁, L; L^T, W₂] ⪰ 0.
A Matlab package CVX (http://cvxr.com/cvx) implementation of the SDP
algorithm above is shown in 4.5.1. Typically CVX only solves SDP problem of
small sizes (say matrices of size less than 100). Specific matlab tools have been
developed to solve large scale RPCA, which can be found at http://perception.
csl.uiuc.edu/matrix-rank/home.html, with an example shown in 4.5.3.
Some theory based on convex geometry can be found in (CSPW11; CRPW12).
Besides the SDP approach, some other developments can be found in lp distance
(LZ11) and Tyler’s M-estimator (Zha16; ZCS14).
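Besides the Matlab CVX code in Section 4.5.1, a cvxpy sketch of the relaxation (92) with all entries observed may look as follows; the synthetic low-rank-plus-sparse data and the choice λ = 1/√n (cf. Theorem 4.2.1 below) are assumptions of this small demo. Note that cvxpy handles the nuclear norm directly, so the auxiliary variables W1, W2 of (92) need not appear explicitly.

import numpy as np
import cvxpy as cp

np.random.seed(0)
n, r = 30, 2
L0 = np.random.randn(n, r) @ np.random.randn(r, n)     # low-rank part
S0 = np.zeros((n, n))
mask = np.random.rand(n, n) < 0.05
S0[mask] = 5 * np.random.randn(mask.sum())             # sparse gross corruptions
X = L0 + S0

lam = 1.0 / np.sqrt(n)
L = cp.Variable((n, n))
S = cp.Variable((n, n))
prob = cp.Problem(cp.Minimize(cp.normNuc(L) + lam * cp.sum(cp.abs(S))),
                  [L + S == X])
prob.solve()
print(np.linalg.norm(L.value - L0) / np.linalg.norm(L0))   # small relative error on success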
4.2.1. Tyler's M-estimator. In this part we introduce a simple robust covariance estimator, Tyler's M-estimator. In general, Huber's M-estimators (Hub81) are generalizations of MLE estimators designed to achieve additional properties such as robustness against outliers. M-estimators of covariance are motivated from the MLE under the assumption that data samples are i.i.d. drawn from an elliptical distribution, x_i ∼ C(ρ) e^{−ρ(x^T Σ^{-1} x)}/√det(Σ), where C(ρ) is a normalization constant. That is, the M-estimator of covariance is defined as the minimizer of
L(Σ) = (1/n) Σ_{i=1}^n ρ(x_i^T Σ^{-1} x_i) + (1/2) log det Σ.
By choosing ρ(t) to be some heavy-tailed function, one can achieve robust estimators against outliers.
Tyler's M-estimator (Tyl87a) is a special case of the M-estimators of covariance with ρ(t) = (p/2) log(t), which allows more possibility for large deviations than the normal distribution with ρ(t) ∼ t. Due to the scale invariance L(cΣ) = L(Σ) in this case, one often adds the constraint trace(Σ) = 1 to make the minimizer unique, i.e.
(93) Σ̂_Tyler = arg min_{trace(Σ)=1, Σ⪰0} L(Σ) := (p/2n) Σ_{i=1}^n log(x_i^T Σ^{-1} x_i) + (1/2) log det Σ.
The following simple iterative algorithm is also given in (Tyl87a),
(94) Σ_{k+1} = [trace(S_k)]^{-1} · S_k, where S_k := Σ_i x_i x_i^T / (x_i^T [Σ_k]^{-1} x_i).
In particular, Tyler showed that it is the “most robust” estimator of the scatter
matrix of an elliptical distribution in the sense of minimizing the maximum asymp-
totic variance. Therefore, we can expect that it leads to robust PCA. For more
details, see (Zha16; ZCS14).
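The fixed-point iteration (94) is straightforward to implement; the following numpy sketch assumes n > p and that no sample x_i is zero, so that each weight x_i^T Σ_k^{-1} x_i is positive.

import numpy as np

def tyler_m_estimator(X, n_iter=200, tol=1e-8):
    # Fixed-point iteration (94); X is n-by-p with rows x_i, n > p assumed
    n, p = X.shape
    Sigma = np.eye(p) / p                                # scaled identity, trace = 1
    for _ in range(n_iter):
        Sinv = np.linalg.inv(Sigma)
        w = 1.0 / np.einsum('ij,jk,ik->i', X, Sinv, X)   # 1 / (x_i^T Sigma^{-1} x_i)
        S = (X * w[:, None]).T @ X                       # sum_i x_i x_i^T / (x_i^T Sigma^{-1} x_i)
        Sigma_new = S / np.trace(S)
        if np.linalg.norm(Sigma_new - Sigma) < tol:
            return Sigma_new
        Sigma = Sigma_new
    return Sigma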

4.2.2. Exact Recovery Conditions for RPCA. A fundamental question


about Robust PCA is: given X = L0 + S0 with low-rank L and sparse S, under
what conditions that one can recover X by solving SDP in (92)?
It is necessary to assume that
• the low-rank matrix L0 can not be sparse;
• the sparse matrix S0 can not be of low-rank.
Such an assumption can be characterized using the following algebraic language.
Define
T (L0 ) = {U AT + BV T : ∀A, B ∈ Rn×p , L0 = U SV T }
which is the tangent space at L0 varying in the same column and row spaces of L0 ,
and
Ω(S0 ) = {S : supp(S) ⊆ supp(S0 )},
which is the tangent space at S0 varying within the same support of S0 . The
assumptions above are equivalent to say that tangent spaces T (L0 ) and Ω(S0 ) are
transversal with only intersection at 0,
\
Transversality: T (L0 ) Ω(S0 ) = {0}.
The following two incoherence constants measure the "diffusive behaviours" of sparse (low-rank) matrices onto low-rank (sparse) opponents:
µ(S0) = max{ ∥S∥₂ : S ∈ Ω(S0), ∥S∥∞ ≤ 1 },
ξ(L0) = max{ ∥L∥∞ : L ∈ T(L0), ∥L∥₂ ≤ 1 }.

(CSPW11) shows the following uncertainty principle, for any matrix M , µ(M )·
ξ(M ) ≥ 1. Therefore a sufficient condition holds,
\
µ(S0 ) · ξ(L0 ) < 1, ⇒ T (L0 ) Ω(S0 ) = {0}.
Moreover, (CSPW11) shows the following deterministic recovery conditions by SDP
µ(S0 ) · ξ(L0 ) < 1/6, ⇒ SDP recovers L0 and S0 .
Probabilistic recovery conditions are given earlier in (CR09). First of all we
need some incoherence conditions for the identifiability. Assume that L0 ∈ Rn×n =
U ΣV T and r = rank(L0 ).

Incoherence condition (CR09): there exists a µ ≥ 1 such that for all standard basis vectors e_i = (0, . . . , 0, 1, 0, . . . , 0)^T,
∥U^T e_i∥² ≤ µr/n, ∥V^T e_i∥² ≤ µr/n,
and
|UV^T|²_ij ≤ µr/n².
These conditions, roughly speaking, ensure that the singular vectors are not
sparse, i.e. well-spread over all coordinates and won’t concentrate on some coor-
dinates. The incoherence condition holds if |Uij |2 ∨ |Vij |2 ≤ µ/n. In fact, if U
represent random projections to r-dimensional subspaces with r ≥ log n, we have
maxi ∥U T ei ∥2 ≍ r/n.
To meet the second condition, we simply assume that the sparsity pattern of
S0 is uniformly random.
Theorem 4.2.1. Assume the following holds:
(1) L0 is n-by-n with rank(L0) ≤ ρ_r n µ^{-1} (log n)^{-2};
(2) S0 is uniformly sparse of cardinality m ≤ ρ_s n².
Then with probability 1 − O(n^{-10}), (92) with λ = 1/√n is exact, i.e. its solution satisfies L̂ = L0 and Ŝ = S0.
Note that if L0 is a rectangular matrix of size n1 × n2, the same holds with λ = 1/√(max(n1, n2)). The result can be generalized to 1 − O(n^{-β}) for β > 0. Extensions and improvements of these results to incomplete measurements can be found in (CT10; Gro11) etc., which solve the following SDP problem:

(95) min ∥L∥∗ + λ∥S∥1


s.t. Lij + Sij = Xij , (i, j) ∈ Ωobs .
Theorem 4.2.2. Assume the following holds:
(1) L0 is n-by-n with rank(L0) ≤ ρ_r n µ^{-1}(log n)^{-2};
(2) Ω_obs is a uniformly random set of size m = 0.1 n²;
(3) each observed entry is corrupted with probability τ ≤ τ_s.
Then with probability 1 − O(n^{-10}), (92) with λ = 1/√(0.1 n) is exact, i.e. its solution satisfies L̂ = L0. The same conclusion holds for rectangular matrices with λ = 1/√(max dim). All these results hold irrespective of the magnitudes of L0 and S0.
When there are no sparse perturbation in optimization problem (95), the prob-
lem becomes the classical Matrix Completion problem with uniformly random sam-
pling:
(96) min ∥L∥∗
s.t. Lij = L0ij , (i, j) ∈ Ωobs .
Assuming the same conditions as before, (CT10) gives the following result: the solution to SDP (96) is exact with probability at least 1 − n^{-10} if m ≥ µ n r log^a n where a ≤ 6, which is improved by (Gro11) to the near-optimal bound
m ≥ µ n r log² n.
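A cvxpy sketch of the matrix completion problem (96) on synthetic data may look as follows; the rank, matrix size and 50% observation rate are hypothetical choices for this demo.

import numpy as np
import cvxpy as cp

np.random.seed(0)
n, r = 30, 2
L0 = np.random.randn(n, r) @ np.random.randn(r, n)     # rank-r ground truth
mask = (np.random.rand(n, n) < 0.5).astype(float)      # observed entries Omega_obs

L = cp.Variable((n, n))
prob = cp.Problem(cp.Minimize(cp.normNuc(L)),
                  [cp.multiply(mask, L) == mask * L0])
prob.solve()
print(np.linalg.norm(L.value - L0) / np.linalg.norm(L0))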

Phase Transitions. Take L0 = U V T as a product of n × r i.i.d. N (0, 1)


random matrices. Figure 3 shows the phase transitions of successful recovery prob-
ability over sparsity ratio ρs = m/n2 and low rank ratio r/n. White color indicates
the probability equals to 1 and black color corresponds to the probability being 0.
A sharp phase transition curve can be seen in the pictures. (a) and (b) respectively
use random signs and coherent signs in sparse perturbation, where (c) is purely ma-
trix completion with no perturbation. Increasing successful recovery can be seen
from (a) to (c).

Figure 3. Phase Transitions in Probability of Successful Recovery

4.3. Sparse PCA via SDP


Sparse PCA was first proposed by (ZHT06) to locate sparse principal components; it also has an SDP relaxation.
Recall that classical PCA solves
max x^T Σ x
s.t. ∥x∥₂ = 1,
which gives the maximal variance direction of the covariance matrix Σ.
Note that xT Σx = trace(Σ(xxT )). Classical PCA can thus be written as
max trace(ΣX)
s.t. trace(X) = 1
X⪰0
The optimal solution gives a rank-1 X along the first principal component. A
recursive application of the algorithm may lead to top k principal components.
That is, one first to find a rank-1 approximation of Σ and extract it from Σ0 = Σ
to get Σ1 = Σ − X, then pursue the rank-1 approximation of Σ1 , and so on.
Now we are looking for sparse principal components, i.e. #{Xij ̸= 0} are small.
Using 1-norm convexification, we have the following SDP formulation (dGJL07) for

Sparse PCA
(97) max trace(ΣX) − λ∥X∥1
s.t. trace(X) = 1
X⪰0
Some consistency studies can be found at (? ) and references therein.
The SDP algorithm above has a simple Matlab implementation based on CVX
(http://cvxr.com/cvx), shown in Section 4.5.2.
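Besides that Matlab CVX code, a cvxpy sketch of (97) on a synthetic spiked covariance may look as follows; the spike direction, its sparsity and the value of λ are assumptions of this small demo.

import numpy as np
import cvxpy as cp

np.random.seed(0)
p, lam = 10, 0.5
u = np.zeros(p)
u[:3] = 1.0 / np.sqrt(3)                         # sparse leading direction (hypothetical)
Sigma = 5.0 * np.outer(u, u) + np.eye(p)         # spiked covariance

X = cp.Variable((p, p), symmetric=True)
prob = cp.Problem(cp.Maximize(cp.trace(Sigma @ X) - lam * cp.sum(cp.abs(X))),
                  [cp.trace(X) == 1, X >> 0])
prob.solve()
evals, evecs = np.linalg.eigh(X.value)
print(np.round(evecs[:, -1], 2))                 # leading eigenvector is approximately sparse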

4.4. Graph Realization and Universal Rigidity


In this lecture, we introduce Semi-Definite Programming (SDP) approach to
solve some generalized Multi-dimensional Scaling (MDS) problems with uncer-
tainty. Recall that in classical MDS, given pairwise distances dij = ∥xi − xj ∥2
among a set of points xi ∈ Rp ( i = 1, 2, · · · , n) whose coordinates are unknown,
our purpose is to find yi ∈ Rk (k ≤ p) such that

n
X 2
(98) min ∥yi − yj ∥2 − dij .
i,j=1

In classical MDS (Section 1.2 in Chapter 1) an eigen-decomposition approach


is pursued to find a solution when all pairwise distances dij ’s are known and noise-
free. In case that dij ’s are not from pairwise distances, we often use gradient
descend method to solve it. However there is no guarantee that gradient descent
will converge to the global optimal solution. In this section we will introduce
a method based on convex relaxation, in particular the semi-definite relaxation,
which will guarantee us to find optimal solutions in the following scenarios.
• Noisy perturbations: d_ij → d̃_ij = d_ij + ϵ_ij;
• Incomplete measurements: only partial pairwise distance measurements are available, on the edge set of a graph G = (V, E), i.e. d_ij is given when (i, j) ∈ E (e.g. x_i and x_j are in a neighborhood);
• Anchors: sometimes we may fix the locations of some points called anchors, e.g. in the sensor network localization (SNL) problem.
In other words, we are looking for MDS on graphs with partial and noisy information.

4.4.1. SD Relaxation of MDS. Like PCA, classical MDS has a semi-definite


relaxation. In the following we shall show how the constraint
(99)    ∥yi − yj∥² = d²ij
can be relaxed into a linear matrix inequality system with positive semidefinite variables.
Denote Y = [y1 , · · · , yn ]k×n where yi ∈ Rk , and
ei = (0, 0, · · · , 1, 0, · · · , 0) ∈ Rn .
Then we have
∥yi − yj ∥2 = (yi − yj )T (yi − yj ) = (ei − ej )T Y T Y (ei − ej )

Set X = Y^T Y, which is symmetric and positive semi-definite. Then
    ∥yi − yj∥² = (ei − ej)(ei − ej)^T • X.
So
    ∥yi − yj∥² = d²ij ⇔ (ei − ej)(ei − ej)^T • X = d²ij,
which is linear with respect to X.
Now we relax the constraint X = Y^T Y to
    X ⪰ Y^T Y ⟺ X − Y^T Y ⪰ 0.
By the Schur Complement Lemma,
    X − Y^T Y ⪰ 0 ⟺ [I  Y; Y^T  X] ⪰ 0.
We may define a new variable
    Z ∈ S^{k+n},  Z = [I_k  Y; Y^T  X],
which gives the following result.
which gives the following result.
Lemma 4.4.1. The quadratic constraints
    ∥yi − yj∥² = d²ij,  (i, j) ∈ E,
have the semi-definite relaxation
    Z_{1:k,1:k} = I_k,
    (0; ei − ej)(0; ei − ej)^T • Z = d²ij,  (i, j) ∈ E,
    Z = [I_k  Y; Y^T  X] ⪰ 0,
where • denotes the Hadamard (entrywise) inner product, i.e. A • B := Σ_{i,j} Aij Bij.

Note that the equality constraints on d²ij can be replaced by inequalities, such as ≤ d²ij(1 + ϵ) (or ≥ d²ij(1 − ϵ)). This is a system of linear matrix (in)equalities with a positive semidefinite variable Z; the problem therefore becomes a standard semidefinite program.
Given such an SD relaxation, we can easily generalize classical MDS to the scenarios in the introduction. For example, consider the generalized MDS with anchors, often called the sensor network localization problem in the literature (BLT+06).
Given anchors ak (k = 1, . . . , s) with known coordinates, find xi such that
• ∥xi − xj∥² = d²ij where (i, j) ∈ Ex and xi, xj are unknown locations;
• ∥ak − xj∥² = d̂²kj where (k, j) ∈ Ea and ak are known locations.
We can exploit the following SD relaxation (a minimal cvxpy sketch is given below):
• (0; ei − ej)(0; ei − ej)^T • Z = d²ij for (i, j) ∈ Ex,
• (ak; −ej)(ak; −ej)^T • Z = d̂²kj for (k, j) ∈ Ea,
both of which are linear with respect to Z.
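The following is a minimal cvxpy sketch of the relaxation above (our own illustrative code, not the SNLSDP/DISCO packages mentioned later; the feasibility objective, variable names, and the SCS solver choice are assumptions):

import numpy as np
import cvxpy as cp

def snl_sdp(k, n, anchors, Ex, d2x, Ea, d2a):
    """Graph realization with anchors via the linear matrix (in)equality system.
    anchors: k-by-s array of anchor coordinates a_1,...,a_s;
    Ex, d2x: edges (i,j) between unknown points and their squared distances;
    Ea, d2a: edges (l,j) from anchor a_l to unknown x_j and their squared distances."""
    Z = cp.Variable((k + n, k + n), symmetric=True)
    cons = [Z >> 0, Z[:k, :k] == np.eye(k)]        # Z_{1:k,1:k} = I_k and Z >= 0
    for (i, j), d2 in zip(Ex, d2x):
        u = np.zeros(k + n)
        u[k + i], u[k + j] = 1.0, -1.0             # (0; e_i - e_j)
        cons.append(cp.trace(np.outer(u, u) @ Z) == d2)
    for (l, j), d2 in zip(Ea, d2a):
        u = np.zeros(k + n)
        u[:k], u[k + j] = anchors[:, l], -1.0      # (a_l; -e_j)
        cons.append(cp.trace(np.outer(u, u) @ Z) == d2)
    prob = cp.Problem(cp.Minimize(0), cons)        # pure feasibility SDP
    prob.solve(solver='SCS')
    return Z.value[:k, k:]                         # estimated coordinates Y = Z_{1:k, k+1:k+n}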
Recall that every SDP problem has a dual problem (SDD). The SDD associated
with the primal problem above is
(100)    min  I • V + Σ_{(i,j)∈Ex} wij d²ij + Σ_{(i,j)∈Ea} ŵij d̂²ij

s.t.    S = [V  0; 0  0] + Σ_{(i,j)∈Ex} wij Aij + Σ_{(i,j)∈Ea} ŵij Âij ⪰ 0,

where
    Aij = (0; ei − ej)(0; ei − ej)^T,
    Âij = (ai; −ej)(ai; −ej)^T.
The variable wij is the stress on the edge between the unknown points i and j, and ŵij is the stress on the edge between anchor i and unknown point j. Note that the dual is always feasible, since V = 0, wij = 0 for all (i, j) ∈ Ex and ŵij = 0 for all (i, j) ∈ Ea is a feasible solution.
There are many Matlab toolboxes for SDP, e.g. CVX, SeDuMi, and the more recent toolboxes SNLSDP (http://www.math.nus.edu.sg/~mattohkc/SNLSDP.html) and DISCO (http://www.math.nus.edu.sg/~mattohkc/disco.html) by Toh et al., adapted to MDS with uncertainty.
A crucial theoretical question is: when does X = Y^T Y hold, so that the SDP embedding Y gives the same answer as classical MDS? Before answering this question, we first present an application example of SDP embedding.

4.4.2. Protein 3D Structure Reconstruction. Here we show an example


of using SDP to find the 3-D coordinates of a protein molecule from noisy pairwise distances between atoms within ϵ-neighborhoods.

Figure 4. (a) 3D Protein structure of PDB-1GM2, edges are


chemical bonds between atoms. (b) Recovery of 3D coordinates
from SNLSDP with 5Å-neighbor graph and multiplicative noise at
0.1 level. Red point: estimated position of unknown atom. Green
circle: actual position of unknown atom. Blue line: deviation from
estimation to the actual position.

4.4.3. Exact Reconstruction and Universal Rigidity. Now we are going


to answer the fundamental question: when does the SDP relaxation exactly reconstruct the coordinates up to a rigid transformation? We will provide two theories, one from
the optimality rank properties of SDP, and the other from a geometric criterion,
universal rigidity.

Recall that for a standard SDP with X, C ∈ R^{n×n},
(101)    min  C • X = Σ_{i,j} cij Xij
         s.t. Ai • X = bi,  i = 1, · · · , m,
              X ⪰ 0,
whose SDD is
(102)    max  b^T y
         s.t. S = C − Σ_{i=1}^m Ai yi ⪰ 0.

Such SDP has the following rank properties (Ali95):


A. maximal rank solutions X ∗ or S ∗ exist;
B. minimal rank solutions X ∗ or S ∗ exist;
C. if the complementary condition X*S* = 0 holds, then rank(X*) + rank(S*) ≤ n, with equality if and only if the strictly complementary condition holds; whence rank(S*) ≥ n − k ⇒ rank(X*) ≤ k.
Strong duality of SDP tells us that an interior point feasible solution in primal
or dual problem will ensure the complementary condition and the zero duality gap.
Now we assume that dij = ∥xi − xj ∥ precisely for some unknown xi ∈ Rk . Then the
primal problem is feasible with Z = (Id ; Y )T (Id ; Y ). Therefore the complementary
condition holds and the duality gap is zero. In this case, assume that Z ∗ is a primal
feasible solution of SDP embedding and S ∗ is an optimal dual solution, then
(1) rank(Z ∗ ) + rank(S ∗ ) ≤ k + n and rank(Z ∗ ) ≥ k, whence rank(S ∗ ) ≤ n;
(2) rank(Z ∗ ) = k ⇐⇒ X = Y T Y .
It follows that if an optimal dual S ∗ has rank n, then every primal solution Z ∗ has
rank k, which ensures X = Y T Y . Therefore it suffices to find a maximal rank dual
solution S ∗ whose rank is n.
Above we have optimality rank condition from SDP. Now we introduce a geo-
metric criterion based on universal rigidity.
Definition 4.4.1 (Universal Rigidity (UR) or Unique Localization (UL)).
    ∃! yi ∈ Rk ↪ Rl (l ≥ k) such that d²ij = ∥yi − yj∥² and d̂²kj = ∥ak − yj∥².

It simply says that there is no nontrivial extension of yi ∈ Rk in Rl satisfying d²ij = ∥yi − yj∥² and d̂²kj = ∥(ak; 0) − yj∥². The following is a short history of universal rigidity.
[Schoenberg 1938] G is complete ⟹ UR.
[So-Ye 2007] G is incomplete: UR ⟺ the SDP has a max-rank solution with rank(Z*) = k.
Theorem 4.4.2. (SY07) The following statements are equivalent.
(1) The graph is universally rigid or has a unique localization in Rk .
(2) The max-rank feasible solution of the SDP relaxation has rank k;
(3) The solution matrix has X = Y T Y or trace(X − Y T Y ) = 0.
Moreover, the localization of a UR instance can be computed approximately in a
time polynomial in n, k, and the accuracy log(1/ϵ).

In fact, the max-rank solution of the SDP embedding is unique. There are many open problems in characterizing UR conditions; see Ye's survey at ICCM 2010.
In practice, we often meet problems with noisy measurements αd²ij ≤ d̃²ij ≤ βd²ij. If we relax the constraint ∥yi − yj∥² = d²ij, or equivalently Ai • X = bi, to inequalities in this way, we can achieve solutions of arbitrarily small rank. To see this, assume that
    Ai • X = bi  ↦  αbi ≤ Ai • X ≤ βbi,  i = 1, . . . , m,  where β ≥ 1, α ∈ (0, 1);
then So, Ye, and Zhang (2008) (SYZ08) show the following result.
Theorem 4.4.3. For every d ≥ 1, there is an SDP solution X̂ ⪰ 0 with rank(X̂) ≤ d, provided that
    β = 1 + 18 ln(2m)/d,          for 1 ≤ d ≤ 18 ln(2m),
    β = 1 + sqrt(18 ln(2m)/d),    for d ≥ 18 ln(2m),
and
    α = 1/(e(2m)^{2/d}),                                  for 1 ≤ d ≤ 4 ln(2m),
    α = max{ 1/(e(2m)^{2/d}), 1 − sqrt(4 ln(2m)/d) },     for d ≥ 4 ln(2m).
Note that α and β are independent of n.

4.4.4. Maximal Variance Unfolding. Here we give a special case of SDP


embedding, Maximal Variance Unfolding (MVU) (WS06). In this case we choose the graph G = (V, E) to be the k-nearest neighbor graph. In contrast to the SDP embedding above, we do not pursue a semi-definite relaxation X ⪰ Y^T Y, but instead define a positive semi-definite kernel K = Y^T Y and maximize the trace of K.
Consider a set of points xi (i = 1, . . . , n) whose pairwise distance dij is known if xj lies among the k-nearest neighbors of xi. In other words, consider a k-nearest neighbor graph G = (V, E) with V = {xi : i = 1, . . . , n} and (i, j) ∈ E if j is one of the k-nearest neighbors of i.
Our purpose is to find coordinates yi ∈ Rk for i = 1, 2, . . . , n such that
    d²ij = ∥yi − yj∥²
whenever (i, j) ∈ E, with Σi yi = 0.
Set Kij = ⟨yi , yj ⟩. Then K is symmetric and positive semidefinite, which
satisfies
Kii + Kjj − 2Kij = d2ij .
There are possibly many solutions for such K, and we look for a particular one
with maximal trace which characterizes the maximal variance.
(103)    max  trace(K) = Σ_{i=1}^n λi(K)
         s.t.  Kii + Kjj − 2Kij = d²ij,  (i, j) ∈ E,
               Σ_j Kij = 0,
               K ⪰ 0.

Again this is an SDP. The final embedding is obtained from the eigenvector decomposition of K = Y^T Y; a minimal cvxpy sketch is given below.
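The following is a minimal cvxpy sketch of program (103) (illustrative only; the k-NN graph construction, solver choice, and parameter values are our own assumptions):

import numpy as np
import cvxpy as cp
from sklearn.neighbors import kneighbors_graph

def mvu(X, n_neighbors=5, dim=2):
    """Maximal Variance Unfolding: maximize trace(K) subject to local isometry."""
    n = X.shape[0]
    G = kneighbors_graph(X, n_neighbors, mode='distance').tocoo()   # k-NN distances
    K = cp.Variable((n, n), PSD=True)
    cons = [cp.sum(K, axis=1) == 0]                  # centering: sum_j K_ij = 0
    for i, j, d in zip(G.row, G.col, G.data):        # each edge may appear twice; harmless
        cons.append(K[i, i] + K[j, j] - 2 * K[i, j] == d**2)
    prob = cp.Problem(cp.Maximize(cp.trace(K)), cons)
    prob.solve(solver='SCS')
    # embed via the top eigenvectors of the learned kernel K = Y^T Y
    w, V = np.linalg.eigh(K.value)
    idx = np.argsort(w)[::-1][:dim]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))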
However, we note that maximizing the trace is not a provably good approach to "unfold" a manifold. Sometimes there are better choices than MVU: e.g. if the original data lie on a plane, then maximizing the diagonal distance between two neighboring triangles will unfold the data and force them onto a plane. This is a special case of general (k + 1)-lateration graphs (SY07). From this we see that there are other linear objective functions better than the trace for the purpose of "unfolding" a manifold.

4.5. Lab and Further Studies


4.5.1. RPCA by CVX. The following Matlab code implements the SDP algorithm (92) using CVX (http://cvxr.com/cvx).
% Construct a random 20-by-20 Gaussian matrix and construct a rank-1
% matrix using its top-1 singular vectors
R = randn(20,20);
[U,S,V] = svds(R,3);
A = U(:,1)*V(:,1)';

% Construct a 90% uniformly sparse matrix


E0 = rand(20);
E = 1*abs(E0>0.9);

X = A + E;

% Choose the regularization parameter


lambda = 0.25;

% Solve the SDP by calling cvx toolbox


if exist('cvx_setup.m','file'),
    cd /matlab_tools/cvx/
    cvx_setup
end

cvx_begin
variable L(20,20);
variable S(20,20);
variable W1(20,20);
variable W2(20,20);
variable Y(40,40) symmetric;
Y == semidefinite(40);
minimize(.5*trace(W1)+0.5*trace(W2)+lambda*sum(sum(abs(S))));
subject to
L + S >= X-1e-5;
L + S <= X + 1e-5;
Y == [W1, L’;L W2];
cvx_end

% The difference between the sparse solution S and E
disp('$\|S-E\|_\infty$:')
norm(S-E,'inf')

% The difference between the low rank solution L and A
disp('$\|A-L\|$:')
norm(A-L)

4.5.2. SPCA by CVX. The SDP algorithm (97) has a simple Matlab im-
plementation based on CVX (http://cvxr.com/cvx).
% Construct a 10-by-20 Gaussian random matrix and form a 20-by-20 correlation
% (inner product) matrix R
X0 = randn(10,20);
R = X0’*X0;

d = 20;
e = ones(d,1);

% Call CVX to solve the SPCA given R


if exist('cvx_setup.m','file'),
    cd /matlab_tools/cvx/
    cvx_setup
end

lambda = 0.5;
k = 10;

cvx_begin
variable X(d,d) symmetric;
X == semidefinite(d);
minimize(-trace(R*X)+lambda*(e’*abs(X)*e));
subject to
trace(X)==1;
cvx_end

4.5.3. RPCA by ADMM. Some ADMM-based Matlab codes for RPCA are
given by Stephen Boyd1. The following codes use cvxpy to implement RPCA:

# coding: utf-8

# ## Robust Principal Component Analysis via Semidefinite Programming (RPCA via SDP)
# This code shows RPCA examples using cvxpy.

# In [1]:

1https://web.stanford.edu/~boyd/papers/prox_algs/matrix_decomp.html



import numpy as np
import cvxpy as cp

# In [2]:

# Construct a random 20-by-20 Gaussian matrix and construct a rank-1
# matrix using its top-1 singular vectors
np.random.seed(0)

R = np.random.randn(20, 20)

# top-3 singular vectors/values of R (only the leading pair is used below)
U, sdiag, VH = np.linalg.svd(R, full_matrices=False)
S = np.diag(sdiag[:3])
U = U[:, :3]
V = VH.T.conj()[:, :3]

L0 = U[:, 0].reshape(-1, 1) @ V[:, 0].reshape(1, -1)

# 10% uniformly sparse perturbation
E0 = np.random.rand(20, 20)
S0 = 1 * np.absolute(E0 > 0.9)
X = L0 + S0
lambdas = 0.25

# In [3]:

# Now find a low rank + sparse solution using CVXPY

L = cp.Variable((20, 20))
S = cp.Variable((20, 20))
W1 = cp.Variable((20, 20))
W2 = cp.Variable((20, 20))
Y = cp.Variable((40, 40), symmetric=True)

objective = cp.Minimize(0.5*cp.trace(W1) + 0.5*cp.trace(W2) + lambdas*cp.sum(cp.abs(S)))

constraints = [Y >> 0]
constraints += [L + S >= X - 1e-5, L + S <= X + 1e-5,
                Y == cp.vstack([cp.hstack([W1, L.T]), cp.hstack([L, W2])])]
prob = cp.Problem(objective, constraints)

result = prob.solve()

# In [4]:

print('The difference between the sparse solution S and true S0 $%s$: %f'
      % (r'\|S-S0\|_\infty', np.linalg.norm(S.value - S0, ord=np.inf)))

# In [5]:

print('The difference between the low rank solution L and true L0 $%s$: %f'
      % (r'\|L-L0\|_2', np.linalg.norm(L.value - L0, ord=2)))

# In [6]:

# Another simple CVXPY implementation directly using the matrix nuclear norm
X1 = cp.Variable((20, 20))
X2 = cp.Variable((20, 20))

objective = cp.Minimize(cp.norm(X1, 'nuc') + lambdas*cp.norm(cp.reshape(X2, 400), 1))
constraints = [X1 + X2 == X]

prob = cp.Problem(objective, constraints)

result = prob.solve()

# In [7]:

print('The difference between the sparse solution X2 and true S0 $%s$: %f'
      % (r'\|X2-S0\|_\infty', np.linalg.norm(X2.value - S0, ord=np.inf)))

# Therefore the algorithm converges to a sparse solution X2 with $\|X2-S0\|_\infty = 0.000012$.

# In [8]:

print('The difference between the low-rank solution X1 and true L0 $%s$: %f'
      % (r'\|X1-L0\|_2', np.linalg.norm(X1.value - L0, ord=2)))

# This indicates the algorithm finds a low-rank solution of error $\|X1-L0\|_2 = 0.000007$.

4.5.4. SPCA. A Python implementation of sparse PCA is available as sklearn.decomposition.SparsePCA2. The following code uses cvxpy to implement the SDP formulation of SPCA:

2https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.SparsePCA.html

# coding: utf-8

# ## Sparse Principal Component Analysis via SDP (SPCA-SDP)

# In [1]:

import numpy as np
import cvxpy as cp
np.random.seed(0)

# Construct a 10-by-20 Gaussian random matrix and form a 20-by-20 correlation
# (inner product) matrix R
X0 = np.random.randn(10, 20)
R = X0.T @ X0
d = 20
e = np.ones((d, 1))

# Small lambda will give dense PCA.
lambdas = 5
k = 10

# In [2]:

# construct the cvxpy problem

X = cp.Variable((d, d), symmetric=True)
# e^T |X| e equals the sum of absolute entries of X
objective = cp.Minimize(-cp.trace(R @ X) + lambdas*cp.sum(cp.abs(X)))
constraints = [X >> 0]
constraints += [cp.trace(X) == 1]
prob = cp.Problem(objective, constraints)

result = prob.solve(solver='SCS')

# In [3]:

import matplotlib.pyplot as plt

plt.scatter(X.value[:, 0], X.value[:, 1])
plt.xlabel('first principal component')
plt.ylabel('second principal component')
plt.show()

# ### SPCA compared with matlab data

# In [4]:

# Using R and X produced by matlab as input to compare results between matlab and python.
# Python produces the same optimal value of the objective function,
# but the value of X is slightly different due to the different solver used by python.
from scipy import io

matlab_R = io.loadmat('data.mat')['R']
matlab_X = io.loadmat('data.mat')['X']

# In [5]:

cp.installed_solvers()

# In [6]:

d = 20
lambdas = 5
k = 10
e = np.ones((d, 1))

X = cp.Variable((d, d), symmetric=True)
# e^T |X| e equals the sum of absolute entries of X
objective = cp.Minimize(-cp.trace(matlab_R @ X) + lambdas*cp.sum(cp.abs(X)))
constraints = [X >> 0]
constraints += [cp.trace(X) == 1]
prob = cp.Problem(objective, constraints)

result = prob.solve(solver='SCS')

# In [7]:

cp.installed_solvers()

# In [8]:

import matplotlib.pyplot as plt

plt.scatter(X.value[:, 0], X.value[:, 1])
plt.xlabel('first principal component')
plt.ylabel('second principal component')
plt.title('python cvx -- cvxopt solver')
plt.show()

# In [9]:

import matplotlib.pyplot as plt

plt.scatter(matlab_X[:, 0], matlab_X[:, 1])
plt.xlabel('first principal component')
plt.ylabel('second principal component')
plt.title('matlab cvx')
plt.show()

4.5.5. Graph Realization and Sensor Network Localization. We use the Matlab package SNLSDP by Kim-Chuan Toh, Pratik Biswas, and Yinyu Ye, downloadable at http://www.math.nus.edu.sg/~mattohkc/SNLSDP.html.
After installation, run the following codes:
>> startup
>> testSNLsolver
The following output shows that the program runs successfully, producing Figure 4.
number of anchors = 0
number of sensors = 166
box scale = 20.00
radius = 5.00
multiplicative noise, noise factor = 1.00e-01
-------------------------------------------------------
estimate sensor positions by SDP
-------------------------------------------------------
num of constraints = 2552,
Please wait:
solving SDP by the SDPT3 software package
sdpobj = -3.341e+03, time = 34.2s
RMSD = 7.19e-01
-------------------------------------------------------
refine positions by steepest descent
-------------------------------------------------------
objstart = 4.2408e+02, objend = 2.7245e+02
number of iterations = 689, time = 0.9s
RMSD = 5.33e-01
-------------------------------------------------------
(noise factor)^2 = -20.0 dB,
mean square error (MSE) in estimated positions = -5.0 dB
-------------------------------------------------------
Part 2

Nonlinear Dimensionality
Reduction: Kernels on Graphs
CHAPTER 5

Manifold Learning

5.1. Introduction
In the previous chapters we talked about two topics: the sample mean and the sample covariance matrix (PCA) in high dimensional spaces. We have learned that when the dimension p is large and the sample size n is relatively small, in contrast to traditional statistics where p is fixed and n → ∞, both the sample mean and PCA may run into problems. In particular, Stein's phenomenon shows that in high dimensional spaces with independent Gaussian distributions the sample mean is worse than a shrinkage estimator; moreover, random matrix theory shows that when the sample size grows in a fixed ratio with the dimension, the sample covariance matrix and PCA may not reflect the signal faithfully. These phenomena start a new philosophy in high dimensional data analysis: to overcome the curse of dimensionality, additional constraints have to be imposed so that the data do not spread over every corner of the high dimensional space. Sparsity is a common assumption in modern high dimensional statistics. For example, data variation may only depend on a small number of variables; independence of Gaussian random fields leads to a sparse covariance matrix; and the assumption of conditional independence leads to a sparse inverse covariance matrix. In particular, the assumption that data concentrate around a low dimensional manifold in a high dimensional space leads to manifold learning, or nonlinear dimensionality reduction, e.g. ISOMAP, LLE, and Diffusion Maps. This assumption often finds examples in computer vision, graphics, and image processing.
All the works introduced in this chapter can be regarded as generalized PCA/MDS on nearest neighbor graphs, rooted in the concept of manifold learning. Two milestone works, ISOMAP (TdSL00) and Locally Linear Embedding (LLE) (RL00), were first published in Science in 2000, opening a new field called nonlinear dimensionality reduction, or manifold learning, in high dimensional data analysis. The development of manifold learning methods can be sketched as follows:
(104)
    MDS −→ ISOMAP
    PCA −→ LLE −→ { Local Tangent Space Alignment, Hessian LLE, Laplacian Eigenmap, Diffusion Map }

To understand the motivation of such a novel methodology, let us take a brief review of PCA/MDS. Given a set of data points xi ∈ Rp (i = 1, . . . , n), or merely their pairwise distances d(xi, xj), PCA/MDS essentially looks for an affine space which best captures the variation of the data distribution, see Figure 1(a). However, this scheme will not work when the data are actually distributed on a highly nonlinear

curved surface, i.e. a manifold; see the example of the Swiss Roll in Figure 1(b). Can we extend PCA/MDS in a certain sense to capture an intrinsic coordinate system which charts the manifold?


Figure 1. (a) Find an affine space to approximate data variation


in PCA/MDS. (b) Swiss Roll data distributed on a nonlinear 2-D
submanifold in Euclidean space R3 . Our purpose is to capture an
intrinsic coordinate system describing the submanifold.

ISOMAP and LLE, as extensions of MDS and local PCA respectively, lead to a series of attempts to address this problem.
All the current techniques in manifold learning, as extensions of PCA and MDS, are often called Spectral Kernel Embedding. The common theme of these techniques is described in Figure 2. The basic problem is: given a set of data points {x1, x2, ..., xn ∈ Rp}, how do we find y1, y2, ..., yn ∈ Rd, where d ≪ p, such that some geometric structures (local or global) among the data points are best preserved?

Figure 2. The generative model for manifold learning. Y is the


hidden parameter space (like rotation angle of faces below), f is
a measure process which maps Y into a sub-manifold in a high
dimensional ambient space, X = f (Y ) ⊂ Rp . All of our purpose is
to recover this hidden parameter space Y given samples {xi ∈ Rp :
i = 1, . . . , n}.

All the manifold learning techniques can be summarized in the following meta-
algorithm, which explains precisely the name of spectral kernel embedding. All the
methods can be called certain eigenmaps associated with some positive semi-definite
kernels.
(1) Construct a data graph G = (V, E), where V = {xi : i = 1, ..., n}. For example:
    ε-neighborhood: i ∼ j ⇔ d(xi, xj) ⩽ ε, which leads to an undirected graph;
    k-nearest neighbor: (i, j) ∈ E ⇔ j ∈ Nk(i), which leads to a directed graph.
(2) Construct a positive semi-definite matrix K (kernel).
(3) Eigen-decompose K = UΛU^T, and set Yd = Ud Λd^{1/2}, where Ud consists of d chosen eigenvectors (top or bottom); a minimal numpy sketch of this step is given below.
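The following is a minimal numpy sketch of the eigenmap step of the meta-algorithm, assuming a positive semi-definite kernel matrix K has already been built from the data graph (the function name and defaults are illustrative):

import numpy as np

def spectral_embedding(K, d, top=True):
    """Eigen-decompose K = U Lambda U^T and return Y_d = U_d Lambda_d^{1/2}."""
    w, U = np.linalg.eigh(K)                                     # ascending eigenvalues
    idx = np.argsort(w)[::-1][:d] if top else np.argsort(w)[:d]  # top or bottom d
    return U[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))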
Example 10 (PCA). G is complete, K = Σ̂n is a covariance matrix.
Example 11 (MDS). G is complete, K = −(1/2) HDH^T, where Dij = d²(xi, xj).
Example 12 (ISOMAP). G is incomplete. Let
    Dij = d(xi, xj) if (i, j) ∈ E,  and  Dij = d̂g(xi, xj) if (i, j) ∉ E,
where d̂g is the graph shortest path distance. Then
    K = −(1/2) HDH^T.
Note that K is positive semi-definite if and only if D is a squared distance matrix.
Example 13 (LLE). G is incomplete. K = (I − W)^T(I − W), where W ∈ R^{n×n} with
    Wij = wij if j ∈ N(i),  and  Wij = 0 otherwise,
and the weights wij solve the optimization problem
    min_{Σj wij = 1} ∥Xi − Σ_{j∈N(i)} wij Xj∥².
After obtaining W, compute the global d-by-n embedding matrix Y = [Y1, . . . , Yn] by
    min_Y Σ_{i=1}^n ∥Yi − Σ_{j=1}^n Wij Yj∥² = trace((I − W) Y^T Y (I − W)^T).
This is equivalent to finding the smallest eigenvectors of K = (I − W)^T(I − W).

5.2. ISOMAP
ISOMAP is an extension of MDS in which pairwise Euclidean distances between data points are replaced by geodesic distances, computed via graph shortest path distances.
(1) Construct a neighborhood graph G = (V, E, dij) such that
    V = {xi : i = 1, . . . , n};
    E = {(i, j) : j is a neighbor of i, i.e. j ∈ Ni}, e.g. k-nearest neighbors or ϵ-neighbors;
    dij = d(xi, xj), e.g. the Euclidean distance when xi ∈ Rp.
(2) Compute graph shortest path distances
    dij = min_{P=(xi,...,xj)} (∥xi − xt1∥ + . . . + ∥xtk−1 − xj∥), the length of a shortest path in G connecting i and j,
    using Dijkstra's algorithm (O(kn² log n)) or Floyd's algorithm (O(n³)).

(3) Classical MDS with D = (d²ij):
    Construct the symmetric matrix B = −0.5 HDH^T (positive semi-definite if D is a squared distance matrix), where H = I − 11^T/n (or H = I − 1a^T for any a with a^T 1 = 1).
    Find the eigenvector decomposition B = UΛU^T and choose the top d eigenvectors as embedding coordinates in Rd, i.e. Yd = [y1, . . . , yd] = [U1, . . . , Ud] Λd^{1/2} ∈ R^{n×d}.

Algorithm 8: ISOMAP Algorithm


Input: Metric distance dij = d(xi , xj ) between data points and a weighted
graph G = (V, E) such that
1 V = {xi : i = 1, . . . , n}
2 E = {(i, j) : if j is a neighbor of i, i.e. j ∈ Ni }, e.g. k-nearest neighbors,
ϵ-neighbors
3 dij = d(xi , xj ) for (i, j) ∈ E, e.g. Euclidean distance when xi ∈ Rp
Output: Euclidean d-dimensional coordinates Y = [yi] ∈ R^{d×n} of data.
4 Step 1 : Compute graph shortest path distances
dij = min (∥xi − xt1 ∥ + . . . + ∥xtk−1 − xj ∥),
P =(xi ,...,xj )

which is the length of a graph shortest path connecting i and j;


5 Step 2: Compute K = −(1/2) H · D · H^T (D := [d²ij]), where H = I − 11^T/n is the centering matrix;
6 Step 3 : Compute Eigenvalue decomposition K = U ΛU T with
Λ = diag(λ1 , . . . , λn ) where λ1 ≥ λ2 ≥ . . . ≥ λn ≥ 0;
7 Step 4: Choose the top d nonzero eigenvalues and corresponding eigenvectors, and set Yd = Ud Λd^{1/2}, where
    Ud = [u1, . . . , ud], uj ∈ Rn,
    Λd = diag(λ1, . . . , λd)
with λ1 ≥ λ2 ≥ . . . ≥ λd > 0.
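Below is a minimal Python sketch of Algorithm 8 (our own illustrative implementation, not the isomapII package used in the example of Section 5.2.1); it assumes the k-NN graph is connected so that all shortest path distances are finite:

import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=5, d=2):
    n = X.shape[0]
    G = kneighbors_graph(X, n_neighbors, mode='distance')     # weighted k-NN graph
    D = shortest_path(G, method='D', directed=False)          # Dijkstra on the symmetrized graph
    H = np.eye(n) - np.ones((n, n)) / n                        # centering matrix H = I - 11^T/n
    K = -0.5 * H @ (D ** 2) @ H.T                              # K = -1/2 H D H^T with D = [d_ij^2]
    w, U = np.linalg.eigh(K)
    idx = np.argsort(w)[::-1][:d]                              # top d eigenpairs
    return U[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))        # Y_d = U_d Lambda_d^{1/2}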

The basic feature of ISOMAP can be described as: we find a low dimensional
embedding of data such that points nearby are mapped nearby and points far away
are mapped far away. In other words, we have global control on the data distance
and the method is thus a global method. The major shortcoming of ISOMAP
lies in its computational complexity, characterized by a full matrix eigenvector
decomposition.
5.2.1. ISOMAP Example. Now we give an example of ISOMAP with mat-
lab codes.
% load 33-face data
load ../data/face.mat Y
X = reshape(Y,[size(Y,1)*size(Y,2) size(Y,3)]);
p = size(X,1);
n = size(X,2);
D = pdist(X');
DD = squareform(D);

% ISOMAP embedding with 5-nearest neighbors




Figure 3. (a) Residual Variance plot for ISOMAP. (b) 2-D


ISOMAP embedding, where the first coordinate follows the order
of rotation angles of the face.

[Y_iso,R_iso,E_iso]=isomapII(DD,'k',5);

% Scatter plot of top 2-D embeddings

y=Y_iso.coords{2};
scatter(y(1,:),y(2,:))

5.2.2. Convergence of ISOMAP. Under dense-sample and regularity con-


ditions on manifolds, ISOMAP is proved to show convergence to preserve geodesic
distances on manifolds. The key is to approximate geodesic distance on manifold
by a sequence of short Euclidean distance hops.
Consider two arbitrary points x, y ∈ M on the manifold. Define
    dM(x, y) = inf_γ {length(γ)},
    dG(x, y) = min_P (∥x0 − x1∥ + . . . + ∥xt−1 − xt∥),
    dS(x, y) = min_P (dM(x0, x1) + . . . + dM(xt−1, xt)),
where γ varies over the set of smooth arcs connecting x to y in M, and P varies over all paths along the edges of G starting at x0 = x and ending at xt = y. We are going to show dM ≈ dG with dS as a bridge.
are going to show dM ≈ dG with the bridge dS .
It is easy to see the following upper bounds by dS :
(105) dM (x, y) ≤ dS (x, y)

(106) dG (x, y) ≤ dS (x, y)


where the first upper bound is due to triangle inequality for the metric dM and the
second upper bound is due to that Euclidean distances ∥xi − xi+1 ∥ are smaller than
arc-length dM (xi , xi+1 ).
To see other directions, one has to impose additional conditions on sample
density and regularity of manifolds.
Lemma 5.2.1 (Sufficient Sampling). Let G = (V, E), where V = {xi : i = 1, . . . , n} ⊆ M is an ϵ-net of the manifold M, i.e. for every x ∈ M there exists xi ∈ V such that dM(x, xi) < ϵ, and {i, j} ∈ E if dM(xi, xj) ≤ αϵ (α ≥ 4). Then for any pair x, y ∈ V,
    dS(x, y) ≤ max(α − 1, α/(α − 2)) dM(x, y).
Proof. Let γ be a shortest path connecting x and y on M whose length is
l. If l ≤ (α − 2)ϵ, then there is an edge connecting x and y whence dS (x, y) =
dM (x, y). Otherwise split γ into pieces such that l = l0 + tl1 where l1 = (α − 2)ϵ
and ϵ ≤ l0 < (α − 2)ϵ. This divides arc γ into a sequence of points γ0 = x, γ1 ,. . .,
γt+1 = y such that dM (x, γ1 ) = l0 and dM (γi , γi+1 ) = l1 (i ≥ 1). There exists a
sequence of x0 = x, x1 , . . . , xt+1 = y such that dM (xi , γi ) ≤ ϵ and
dM (xi , xi+1 ) ≤ dM (xi , γi ) + dM (γi , γi+1 ) + dM (γi+1 , xi+1 )
≤ ϵ + l1 + ϵ
= αϵ
= l1 α/(α − 2)
whence (xi , xi+1 ) ∈ E. Similarly dM (x, x1 ) ≤ dM (x, γ1 ) + dM (γ1 , x1 ) ≤ (α − 1)ϵ ≤
l0 (α − 1).
    dS(x, y) ≤ Σ_i dM(xi, xi+1) ≤ l · max(α/(α − 2), α − 1).
Setting α = 4 gives rise to dS (x, y) ≤ 3dM (x, y). □

The remaining bound dS(x, y) ≤ c·dG(x, y) requires that for every pair of neighboring points xi and xj the geodesic distance is controlled by the Euclidean distance, dM(xi, xj) ≤ c∥xi − xj∥. This imposes a regularity condition on the manifold M, whose curvature has to be bounded. We omit this part here and refer the interested reader to the note by Bernstein, de Silva, Langford, and Tenenbaum (2000), supporting information to the ISOMAP paper.

5.3. Locally Linear Embedding (LLE)


In applications, points nearby should be mapped nearby, while points far away should impose no constraint. This is because points that are close enough are typically similar, while for points that are far apart there is no faithful information to measure how far they are. Therefore global information about geodesic distances might not be accurate, in addition to its expensive computational cost. This motivates another type of algorithm, Locally Linear Embedding. The algorithm assumes that any data point in a high dimensional ambient space can be written as a linear combination of the data points in its neighborhood; in other words, the neighborhood xj ∈ Ni of a data point xi provides its sufficient statistics. Alignment of such local linear structures can lead to a global unfolding of the data manifold, often described as "fit locally, think globally". This is a local method, as it involves only data points in local neighborhoods and hence a sparse eigenvector decomposition.
Now we are going to describe the procedure of LLE.
The reason behind the crucial steps can be explained as follows.
(1) Local fitting:

Algorithm 9: LLE Algorithm


Input: A graph G = (V, E) such that
1 V = {xi : i = 1, . . . , n}
2 E = {(i, j) : if j is a neighbor of i, i.e. j ∈ Ni }, e.g. k-nearest neighbors,
ϵ-neighbors
Output: Euclidean d-dimensional coordinates Y = [yi ] ∈ Rd×n of data.
3 Step 1 (local fitting): for each xi and its neighbors Ni, solve
    min_{Σ_{j∈Ni} wij = 1} ∥xi − Σ_{j∈Ni} wij xj∥²,
by ŵi(µ) = (Ci + µI)^{−1} 1 for some regularization parameter µ > 0 and wi = ŵi/(ŵi^T 1);
5 Step 2 (global alignment): define the weight embedding matrix
    Wij = wij if j ∈ Ni,  and  Wij = 0 otherwise.
Compute K = (I − W)^T(I − W), which is a positive semi-definite kernel matrix;
7 Step 3 (Eigenmap): Compute the eigenvalue decomposition K = UΛU^T with Λ = diag(λ1, . . . , λn), where λ1 ≥ λ2 ≥ . . . ≥ λn−1 > λn = 0; choose the bottom d + 1 eigenvalues and corresponding eigenvectors and drop the smallest (zero eigenvalue, constant eigenvector) pair, such that
    Ud = [un−d, . . . , un−1], uj ∈ Rn,
    Λd = diag(λn−d, . . . , λn−1).
Define Yd = Ud Λd^{1/2}.

Pick up a point xi and its neighbors Ni. Compute the local fitting weights
    min_{Σ_{j∈Ni} wij = 1} ∥xi − Σ_{j∈Ni} wij xj∥²,
which is equivalent to
    min_{Σ_{j∈Ni} wij = 1} ∥Σ_{j∈Ni} wij (xj − xi)∥²,
that is, finding a linear combination (possibly not unique!) within the subspace spanned by {(xj − xi) : j ∈ Ni}. This can be done by the Lagrange multiplier method, i.e. solving
    min_{wij} (1/2) ∥Σ_{j∈Ni} wij (xj − xi)∥² + λ(1 − Σ_{j∈Ni} wij).
Let wi = [wij1, . . . , wijk]^T ∈ Rk, X̄i = [xj1 − xi, . . . , xjk − xi], and let the local Gram (covariance) matrix be Ci(j, k) = ⟨xj − xi, xk − xi⟩, whence the weights are
(107)    wi = λ Ci† 1,
where the Lagrange multiplier equals the normalization parameter
(108)    λ = 1/(1^T Ci† 1),

Table 1. Comparisons between ISOMAP and LLE.

                                 ISOMAP                               LLE
Construction                     MDS on the geodesic distance matrix  local PCA + eigen-decomposition
Nature                           global approach                      local approach
Nonconvex manifolds with holes   fails                                works
Extensions                       landmark (Nystrom), conformal,       Hessian, Laplacian,
                                 isometric, etc.                      LTSA, etc.

and Ci† is the Moore–Penrose (pseudo) inverse of Ci. Note that Ci is often ill-conditioned, and to approximate its Moore–Penrose inverse one can use the regularized inverse (Ci + µI)^{−1} for some µ > 0.
(2) Global alignment:
Define an n-by-n weight matrix W:
    Wij = wij if j ∈ Ni,  and  Wij = 0 otherwise.
Compute the global d-by-n embedding matrix Y by
    min_Y Σ_i ∥yi − Σ_{j=1}^n Wij yj∥² = trace(Y (I − W)^T (I − W) Y^T).
In other words, construct the positive semi-definite matrix K = (I − W)^T(I − W) and find the d + 1 smallest eigenvectors of K, v0, v1, . . . , vd, associated with the smallest eigenvalues λ0, . . . , λd. Drop the smallest eigenvector, which is the constant vector explaining the translational degree of freedom, and set Y = [v1/√λ1, . . . , vd/√λd]^T; a minimal Python sketch of the two steps is given after this list.
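Below is a minimal numpy sketch of the two LLE steps above (the k-NN search, the regularization µ, and the omission of the 1/√λ scaling are our own simplifying choices):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def lle(X, n_neighbors=10, d=2, mu=1e-3):
    n = X.shape[0]
    nbrs = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    _, idx = nbrs.kneighbors(X)                       # first neighbor is the point itself
    W = np.zeros((n, n))
    for i in range(n):
        Ni = idx[i, 1:]
        Xc = X[Ni] - X[i]                             # rows x_j - x_i
        C = Xc @ Xc.T                                 # local Gram matrix C_i
        w = np.linalg.solve(C + mu * np.eye(len(Ni)), np.ones(len(Ni)))   # (C_i + mu I)^{-1} 1
        W[i, Ni] = w / w.sum()                        # normalize so that sum_j w_ij = 1
    K = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(K)                    # ascending; vals[0] ~ 0 (constant vector)
    return vecs[:, 1:d + 1]                           # drop the constant eigenvector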
The benefits of LLE are:
• Neighbor graph: k-nearest neighbors is of O(kn)
• W is sparse: kn/n2 = k/n non-zeroes
• K = (I − W )T (I − W ) is guaranteed to be positive semi-definite
However, unlike ISOMAP, it is not clear whether the basic LLE constructed above converges under suitable conditions. Convergence guarantees are left to variations of basic LLE, such as Laplacian LLE, Hessian LLE, and LTSA.
5.3.1. Issues of LLE and a Modified Version. Using the regularization, (107) leads to a family of weight vectors
(109)    wi(µ) = λ(Ci + µI)^{−1} 1 = Σ_j 1/(λj^{(i)} + µ) · vj vj^T 1,
where Ci = VΛV^T is the local PCA (Λ = diag(λj^{(i)}), V = [vj]).
So basically wi(µ) acts as a low-pass filter: the projections of 1 onto those directions vj with λj^{(i)} ≪ µ are preserved, while the projections with λj^{(i)} ≫ µ are attenuated. In the ideal case without noise, such a low-pass filter makes wi(µ) lie in the normal subspace orthogonal to the local PCA, so that the reconstructed Y will follow the directions of the local PCA. However, in applications where noise is present, especially when it is not well separated from the signal, such wi(µ) is

Algorithm 10: MLLE Algorithm


Input: A graph G = (V, E) such that
1 V = {xi : i = 1, . . . , n}
2 E = {(i, j) : if j is a neighbor of i, i.e. j ∈ Ni }, e.g. k-nearest neighbors,
ϵ-neighbors
Output: Euclidean d-dimensional coordinates Y = [yi ] ∈ Rd×n of data.
3 Step 1 (local fitting): for each xi and its neighbors Ni, solve
    min_{Σ_{j∈Ni} wij = 1} ∥xi − Σ_{j∈Ni} wij xj∥²,
by ŵi(µ) = (Ci + µI)^{−1} 1 for some regularization parameter µ > 0 and wi = ŵi/(ŵi^T 1);
5 Step 2 (local residue PCA): for each xi and its neighbors Ni (ki = |Ni|), let Ci = VΛV^T be its eigenvalue decomposition, where Λ = diag(λ1, . . . , λki) with λ1 ≥ · · · ≥ λki. Find the size si of the almost-normal subspace as the maximal size such that the ratio of the residual eigenvalue sum over the principal eigenvalue sum is below a threshold, i.e.
    si = max { l : l ≤ ki − d,  (Σ_{j=ki−l+1}^{ki} λj) / (Σ_{j=1}^{ki−l} λj) ≤ η },
where η is a parameter, such as the median of the ratios of residual eigenvalue sums over principal eigenvalue sums. Construct the normal subspace basis matrix from the si bottom eigenvectors of Ci, Vi = [v_{ki−si+1}, . . . , v_{ki}] ∈ R^{ki×si}, and define the weight matrix
    Wi = (1 − αi) wi(µ) 1_{si}^T + Vi Hi^T ∈ R^{ki×si},
where αi = ∥Vi^T 1_{ki}∥2/√si and Hi = I_{si} − 2uu^T/∥u∥² with u = Vi^T 1_{ki} − αi 1_{si} (or Hi = I_{si} if u is small).
6 Step 3 (global alignment): define the weight embedding matrix Ŵi ∈ R^{n×si} with rows
    Ŵi(j, :) = −1_{si}^T if j = i,  Ŵi(j, :) = Wi(j, :) (the row of Wi corresponding to neighbor j) if j ∈ Ni,  and 0 otherwise.
Compute K = Σ_{i=1}^n Ŵi Ŵi^T, which is a positive semi-definite kernel matrix;
8 Step 4 (Eigenmap): Compute the eigenvalue decomposition K = UΛU^T with Λ = diag(λ1, . . . , λn), where λ1 ≥ λ2 ≥ . . . ≥ λn−1 > λn = 0; choose the bottom d + 1 eigenvalues and corresponding eigenvectors and drop the smallest (zero eigenvalue, constant eigenvector) pair, such that
    Ud = [un−d, . . . , un−1], uj ∈ Rn,
    Λd = diag(λn−d, . . . , λn−1).
Define Yd = Ud Λd^{1/2}.

sensitive to the noise direction and might be mixed with signal directions. LLE in
this case might not capture well the signal directions in local PCA. Hessian LLE
and LTSA are both improvements over this by exploiting all the local principal
components. On the other hand, Modified Locally Linear Embedding (MLLE)
(ZW) remedies the issue using multiple weight vectors projected from orthogonal
complement of local PCA.
MLLE replaces the weight vector above by a weight matrix Wi ∈ R^{ki×si}, a family of si weight vectors using the bottom si eigenvectors of Ci, Vi = [v_{ki−si+1}, . . . , v_{ki}] ∈

R^{ki×si}, such that
(110)    Wi = (1 − αi) wi(µ) 1_{si}^T + Vi Hi^T,
where αi = ∥Vi^T 1_{ki}∥2/√si and Hi = I_{si} − 2uu^T (with ∥u∥2 = 1, or Hi := I_{si} if u = 0) is a Householder matrix such that Hi Vi^T 1_{ki} = αi 1_{si} (hence Wi^T 1_{ki} = 1_{si}, i.e. every column of Wi is a legal weight vector). In fact, one can choose u in the direction of Vi^T 1_{ki} − αi 1_{si}. An adaptive choice of si is given in (ZW) using the trade-off between residual variation and explained variation. Equipped with this weight matrix, one can set the objective function by simultaneously minimizing the residual over all reconstruction weights:
    min_Y Σ_i Σ_{l=1}^{si} ∥yi − Σ_{j∈Ni} Wi(j, l) yj∥² = Σ_i ∥Y Ŵi∥_F² = trace[Y (Σ_i Ŵi Ŵi^T) Y^T],
where Ŵi ∈ R^{n×si} is the embedding of Wi into R^{n×si}, with rows
    Ŵi(j, :) = −1_{si}^T if j = i,  Ŵi(j, :) = Wi(j, :) if j ∈ Ni,  and 0 otherwise.

The Python scikit-learn package contains an implementation of MLLE; a usage sketch is given below. The error analysis of MLLE is similar to that of LTSA (ZW), hence both are expected to lead to similar results in applications. Yet, due to the adaptive choice of si, MLLE can adapt to heterogeneity in the curvature of the manifold.
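A short usage sketch of the scikit-learn MLLE implementation mentioned above (the dataset and parameter values are illustrative choices):

import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)
# method='modified' selects MLLE; 'hessian' and 'ltsa' select the variants below
mlle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, method='modified')
Y = mlle.fit_transform(X)    # 2-D embedding coordinates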
A shortcoming of MLLE lies in its projection of the vector 1 onto the local normal subspace spanned by the bottom eigenvectors, which might be totally contaminated by noise. Such a computation may thus capture noise instead of signal, and it might be expensive (since the full spectrum of Ci is required) when the intrinsic dimensionality is low. On the other hand, both Hessian LLE and LTSA only exploit a partial local SVD, which is more robust to noise and cheaper in computational cost.

5.4. Hessian LLE


Hessian LLE only exploits the top eigenvectors of a local SVD, which is more robust to noise than MLLE, and it provably finds a linear coordinate chart under local isometry without convexity assumptions, while ISOMAP requires global isometry and convexity.

Figure 4. Local coordinate system at the origin O = xi.

In LLE, one chooses the weights wij to minimize the energy
    min_{Σ_{j∈Ni} wij = 1} ∥Σ_{j∈Ni} wij (xj − xi)∥².

In the ideal case, if the points x̃j = xj − xi are linearly dependent, then there exist some wij, possibly not unique, such that 0 = Σ_{j∈Ni} wij x̃j. In this local chart (Figure 4), we have
    0 = Σ_{j∈Ni} wij x̃j,  and  1 = Σ_{j∈Ni} wij.
For any smooth function y(x), consider its Taylor expansion up to second order,
    y(x) = y(0) + x^T ∇y(0) + (1/2) x^T (Hy)(0) x + o(∥x∥²).
Therefore
    (I − W)y(0) := y(0) − Σ_{j∈Ni} wij y(x̃j)
                 ≈ y(0) − Σ_{j∈Ni} wij y(0) − Σ_{j∈Ni} wij x̃j^T ∇y(0) − (1/2) Σ_{j∈Ni} x̃j^T (Hy)(0) x̃j
                 = −(1/2) Σ_{j∈Ni} x̃j^T (Hy)(0) x̃j.
If the function y(x) is a linear transform of the d coordinates of x in the tangent space at xi, then the Hessian matrix
    (Hy)(0) := [∂²y(x)/∂x(i)∂x(j)]_{x=0} = 0.
In this case (I − W)y(0) = 0 and y attains the minimum.
In other words, the kernel of the Hessian operator H has dimension d + 1, consisting of the constant function and d linearly independent coordinate functions. Inspired by this observation, Donoho and Grimes (DG03b) proposed Hessian LLE (Eigenmap), in search of
    min_{y⊥1, ∥y∥=1} ∫ ∥Hy∥².
The basic algorithmic idea is as follows.
1. G is incomplete, often the k-nearest neighbour graph.
2. Local SVD on the neighbourhood of xi: for xij ∈ N(xi),
    X̃^{(i)} = [xi1 − µi, ..., xik − µi]_{p×k} = Ũ^{(i)} Σ̃ (Ṽ^{(i)})^T,
where µi = (1/k) Σ_{j=1}^k xij = (1/k) Xi 1. Here
    • the left top singular vectors {Ũ1^{(i)}, ..., Ũd^{(i)}} give an orthonormal basis of the approximate tangent space at xi;
    • the right top singular vectors [Ṽ1^{(i)}, ..., Ṽd^{(i)}] are the representation coordinates, in the tangent space, of the local sample points around xi.
3. Null Hessian estimation: define
    M = [1, Ṽ1, ..., Ṽd, Ṽ1 ⊙ Ṽ1, Ṽ1 ⊙ Ṽ2, ..., Ṽd ⊙ Ṽd] ∈ R^{k×(1+d+(d+1 choose 2))},
where Ṽi ⊙ Ṽj = [Ṽik Ṽjk]_k ∈ Rk denotes the elementwise (Hadamard) product of Ṽi and Ṽj and all products with i ≤ j are included.
Now perform a Gram–Schmidt orthogonalization procedure on M to get

    M̃ = [1, v̂1, ..., v̂d, ŵ1, ŵ2, ..., ŵ_{(d+1 choose 2)}] ∈ R^{k×(1+d+(d+1 choose 2))}.
Define the null Hessian by
    [H^{(i)}]^T = [last (d+1 choose 2) columns of M̃] ∈ R^{k×(d+1 choose 2)},
as the first d + 1 columns of M̃ constitute an orthonormal basis for the kernel of the Hessian together with the constant vector.
Define a selection matrix S^{(i)} ∈ R^{n×k} which selects those data points in N(xi), i.e.
    [x1, .., xn] S^{(i)} = [xi1, ..., xik].
Then the kernel matrix is defined to be
    K = Σ_{i=1}^n S^{(i)} H^{(i)T} H^{(i)} S^{(i)T} ∈ R^{n×n}.

Find smallest d + 1 eigenvectors of K and drop the smallest eigenvector, the re-
maining d eigenvectors will give rise to a d dimensional embedding of data points.

Algorithm 11: Hessian LLE Algorithm


Input: A weighted undirected graph G = (V, E, d) such that
1 V = {xi ∈ Rp : i = 1, . . . , n}
2 E = {(i, j) : if j is a neighbor of i, i.e. j ∈ Ni }, e.g. k-nearest neighbors
Output: Euclidean d-dimensional coordinates Y = [yi ] ∈ Rd×n of data.
3 Step 1: Compute the local PCA on the neighborhood of xi: for xij ∈ N(xi),
    X̃^{(i)} = [xi1 − µi, ..., xik − µi]_{p×k} = Ũ^{(i)} Σ̃ (Ṽ^{(i)})^T,
where µi = (1/k) Σ_{j=1}^k xij = (1/k) Xi 1;
4 Step 2: Null Hessian estimation: define
    M = [1, Ṽ1, ..., Ṽd, Ṽ1 ⊙ Ṽ1, Ṽ1 ⊙ Ṽ2, ..., Ṽd ⊙ Ṽd] ∈ R^{k×(1+d+(d+1 choose 2))},
where Ṽi ⊙ Ṽj = [Ṽik Ṽjk] ∈ Rk denotes the elementwise (Hadamard) product of Ṽi and Ṽj. Perform a Gram–Schmidt orthogonalization on M to get
    M̃ = [1, v̂1, ..., v̂d, ŵ1, ŵ2, ..., ŵ_{(d+1 choose 2)}] ∈ R^{k×(1+d+(d+1 choose 2))}.
Define
    [H^{(i)}]^T = [last (d+1 choose 2) columns of M̃] ∈ R^{k×(d+1 choose 2)}.
5 Step 3: Define
    K = Σ_{i=1}^n S^{(i)} H^{(i)T} H^{(i)} S^{(i)T} ∈ R^{n×n},  [x1, .., xn] S^{(i)} = [xi1, ..., xik],

find smallest d + 1 eigenvectors of K and drop the smallest eigenvector, and the
remaining d eigenvectors will give rise to a d-embedding.

5.4.1. Convergence of Hessian LLE. There are two assumptions for the
convergence of ISOMAP:
• Isometry: the geodesic distance between two points on manifolds equals
to the Euclidean distances between intrinsic parameters.
• Convexity: the parameter space is a convex subset in Rd .
Therefore, if the manifold contains a hole, ISOMAP will not faithfully recover the intrinsic coordinates. Hessian LLE, in contrast, provably finds local orthogonal coordinates for manifold reconstruction even in the nonconvex case; Figure 5 gives an example.

Figure 5. Comparisons of Hessian LLE on Swiss roll against


ISOMAP and LLE. Hessian better recovers the intrinsic coordi-
nates as the rectangular hole is the least distorted.

Donoho and Grimes (DG03b) relax the conditions above into the following ones.
• Local Isometry: in a small enough neighborhood of each point, geodesic distances between two points on the manifold are identical to Euclidean distances between the corresponding parameter points.
• Connectedness: the parameter space is an open connected subset of Rd.
Based on these relaxed conditions, they prove the following result.
Theorem 5.4.1. Suppose M = ψ(Θ), where Θ is an open connected subset of Rd, and ψ is a locally isometric embedding of Θ into Rn. Then the Hessian H(f) has a (d+1)-dimensional nullspace, consisting of the constant function and a d-dimensional space of functions spanned by the original isometric coordinates.
Under this theorem, the original isometric coordinates can be recovered, up to
a rigid motion, by identifying a suitable basis for the null space of H(f ).

5.5. Local Tangent Space Alignment (LTSA)


A shortcoming of Hessian LLE lies in its bilinear form of local singular vectors
(local PCA/MDS) to estimate the null Hessian. This is expensive when the intrinsic

dimensionality is high and is also not stable when noise is present. On the other hand, Zhenyue Zhang and Hongyuan Zha (2002) (ZZ02) suggest the Local Tangent Space Alignment (LTSA) algorithm, which needs only the linear form of local PCA and is therefore more stable and cheaper than Hessian LLE.

Figure 6. Local tangent space approximation.

The basic idea of LTSA is illustrated in Figure 6, where given a smooth curve
(black), one can use discrete samples to find a good approximation of the tangent
space of the original curve at each sample point. Finding such an approximation is
in the spirit of principal curve or principal manifold proposed by Werner Stuetzle
and Trevor Hastie (HS89). Zhenyue Zhang and Hongyuan Zha (2002) (ZZ02) pro-
pose to use sampled data to find a good approximation of tangent space via local
PCA, then the reconstruction data coordinates tries to preserve such approximate
tangent space at each point to reach a global alignment.

Algorithm 12: LTSA Algorithm


Input: A weighted undirected graph G = (V, E) such that
1 V = {xi ∈ Rp : i = 1, . . . , n}
2 E = {(i, j) : if j is a neighbor of i, i.e. j ∈ Ni }, e.g. k-nearest neighbors
Output: Euclidean d-dimensional coordinates Y = [yi] ∈ R^{d×n} of data.
3 Step 1 (local PCA): Compute the local SVD on the neighborhood of xi, xij ∈ N(xi),
    X̃^{(i)} = [xi1 − µi, ..., xik − µi]_{p×k} = Ũ^{(i)} Σ̃ (Ṽ^{(i)})^T,
where µi = (1/k) Σ_{j=1}^k xij. Define
    Gi = [1/√k, Ṽ1^{(i)}, ..., Ṽd^{(i)}]_{k×(d+1)};
4 Step 2 (tangent space alignment): form the alignment (kernel) matrix
    K_{n×n} = Σ_{i=1}^n Si Wi Wi^T Si^T,  Wi_{k×k} = I − Gi Gi^T,
where the selection matrix Si ∈ R^{n×k} satisfies [xi1, ..., xik] = [x1, ..., xn] Si;
5 Step 3: Find the smallest d + 1 eigenvectors of K and drop the smallest eigenvector; the remaining d eigenvectors give rise to a d-embedding.
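Below is a minimal numpy sketch of Algorithm 12 (our own illustrative code; the k-NN search and the dense eigen-solver are assumptions):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def ltsa(X, n_neighbors=10, d=2):
    n = X.shape[0]
    nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
    _, idx = nbrs.kneighbors(X)
    K = np.zeros((n, n))
    for i in range(n):
        Ni = idx[i]                                    # neighborhood indices, including x_i itself
        k = len(Ni)
        Xc = X[Ni] - X[Ni].mean(axis=0)                # centered local data, k-by-p
        # left singular vectors of the k-by-p matrix = right singular vectors of X~^{(i)} in the text
        U, _, _ = np.linalg.svd(Xc, full_matrices=False)
        G = np.hstack([np.ones((k, 1)) / np.sqrt(k), U[:, :d]])   # G_i = [1/sqrt(k), V_1,...,V_d]
        W = np.eye(k) - G @ G.T                        # W_i = I - G_i G_i^T
        K[np.ix_(Ni, Ni)] += W @ W.T                   # alignment: K += S_i W_i W_i^T S_i^T
    vals, vecs = np.linalg.eigh(K)
    return vecs[:, 1:d + 1]                            # drop the constant eigenvector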

For each xi ∈ Rp with neighborhood Ni of size ki (including xi itself), let X^{(i)} = [xj1, xj2, . . . , xjki] ∈ R^{p×ki} be the local coordinate matrix. Consider the local SVD (PCA)
    X̃^{(i)} = [xi1 − µi, ..., xiki − µi]_{p×ki} = X^{(i)} H = Ũ^{(i)} Σ̃ (Ṽ^{(i)})^T,
where H = I − (1/ki) 1_{ki} 1_{ki}^T. The left singular vectors {Ũ1^{(i)}, ..., Ũd^{(i)}} give an orthonormal basis of the approximate d-dimensional tangent space at xi. The right singular vectors (Ṽ1^{(i)}, . . . , Ṽd^{(i)}) ∈ R^{ki×d} give the d coordinates of the ki samples with respect to the tangent space basis.
Let Yi ∈ R^{d×ki} be the embedding coordinates of the samples in Rd and let Li ∈ R^{p×d} be an estimated basis of the tangent space at xi in Rp. Let Θi = Ũd^{(i)} Σ̃d (Ṽd^{(i)})^T ∈ R^{p×ki} be the truncated SVD using the top d components. LTSA looks for the minimizer of the following problem
(111)    min_{Y,L} Σ_i ∥Ei∥² = Σ_i ∥Yi (I − (1/ki) 1 1^T) − Li^T Θi∥².
One can estimate Li^T = Yi (I − (1/ki) 1 1^T) Θi†. Hence the problem reduces to
(112)    min_Y Σ_i ∥Ei∥² = Σ_i ∥Yi (I − (1/ki) 1 1^T)(I − Θi† Θi)∥²,
where I − Θi† Θi is the projection onto the normal space at xi. This is equivalent to defining
    Gi = [1/√ki, Ṽ1^{(i)}, ..., Ṽd^{(i)}]_{ki×(d+1)},
a weight matrix
    Wi_{ki×ki} = I − Gi Gi^T,
and a positive semi-definite kernel matrix for alignment,
    K_{n×n} = Φ = Σ_{i=1}^n Si Wi Wi^T Si^T,
where the selection matrix Si ∈ R^{n×ki} satisfies [xi1, ..., xiki] = [x1, ..., xn] Si. Notice that the constant vector is an eigenvector corresponding to the 0 eigenvalue. Hence, similar to LLE, one can choose the bottom d + 1 eigenvectors and drop the constant eigenvector, which gives the embedding matrix Y ∈ R^{n×d}. An error analysis is given in (ZZ09), which shows that LTSA may recover the global coordinates asymptotically.
Remark. We note that LTSA can also be applied when we are only given local pairwise distances between samples. Since MDS and PCA are dual to each other, one can replace the local PCA in the algorithm by local MDS, which leads to the same results, as only the right singular vectors Ṽ^{(i)} are used there. LTSA and Hessian LLE both may recover the linear coordinates, though the former is less expensive.

5.6. Laplacian LLE (Eigenmap)


Recall that in LLE one chooses the weights wij to minimize the energy
    min_{Σ_{j∈Ni} wij = 1} ∥Σ_{j∈Ni} wij (xj − xi)∥².
In the ideal case, if the points x̃j = xj − xi are linearly dependent, then there exist some wij, possibly not unique, such that 0 = Σ_{j∈Ni} wij x̃j. Thus we have
    0 = Σ_{j∈Ni} wij x̃j,  and  1 = Σ_{j∈Ni} wij.

For any smooth function f(x), consider its Taylor expansion up to second order,
    f(x) = f(0) + x^T ∇f(0) + (1/2) x^T H(0) x + o(∥x∥²).
Then
    (I − W)f(0) := f(0) − Σ_{j∈Ni} wij f(x̃j)
                 ≈ f(0) − Σ_{j∈Ni} wij f(0) − Σ_{j∈Ni} wij x̃j^T ∇f(0) − (1/2) Σ_{j∈Ni} x̃j^T H(0) x̃j
                 = −(1/2) Σ_{j∈Ni} x̃j^T H(0) x̃j.
When the {x̃j} in the last step form an orthonormal basis¹, the equation above gives
    −(1/2) Σ_{j∈Ni} x̃j^T H(0) x̃j ≈ trace(H(0)) = ∆f(0),
where the Laplacian operator ∆ = trace(H) = Σ_{i=1}^d ∂²/∂yi² in a local coordinate system (yi). Such an observation leads to Laplacian LLE, which looks for embedding functions via
    min_{y⊥1, ∥y∥=1} ∫ ∥∇y∥² = ∫ y^T ∆y,
instead of, as in Hessian LLE,
    min_{y⊥1, ∥y∥=1} ∫ ∥Hy∥².

The kernel of Laplacian consists of constant, linear functions, and bilinear functions
of coordinates, of dimensionality 1 + d + d2 . Therefore Laplacian LLE does not
recover linear coordinates. However, Laplacian LLE converges to the spectrum of
Laplacian-Beltrami operator, which enables us to choose wij as heat kernels. It has
various connections with spectral graph theory and random walks on graphs, which
further leads to Diffusion Map and relates to topology of data graph, namely the
connectivity or the 0-th homology.
How to define Laplacian with discrete data? Graph Laplacians with heat
kernels provide us an answer (BN01; BN03). To see the idea, first consider a
weighted oriented graph G = (V, E, W ) where V = {x1 , . . . , xn } is the vertex set,
E = {(i, j) : i, j ∈ V } is the set of oriented edges, and W = [wij = wji ≥ 0] is
the weight matrix. Consider the particular weight matrix induced by heat kernels, W = (wij) ∈ R^{n×n}, with
    wij = exp(−∥xi − xj∥²/t) if j ∈ N(i),  and  wij = 0 otherwise.
In particular, t → ∞ gives binary weights. Let D = diag(Σ_{j∈Ni} wij) be the diagonal matrix with the weighted degrees as diagonal elements. Define the unnormalized graph Laplacian by
    L = D − W,

1This is in general not true, which inspires Hessian LLE though.



and the normalized graph Laplacian by
    ℒ = D^{−1/2}(D − W)D^{−1/2}.
To see the nature of such graph Laplacians, define the graph gradient map (also known as the co-boundary map in algebraic topology) to be
    ∇ : R^V → R^E,  f ↦ (∇f)(i, j) = −(∇f)(j, i) = f(i) − f(j).
Let D̂w = diag(wij) ∈ R^{|E|×|E|} be the diagonal matrix of edge weights. Then one can check that the unnormalized graph Laplacian satisfies L = ∇^T D̂w ∇. In other words,
    f^T L f = (∇f)^T D̂w (∇f) = Σ_{i≥j} wij (fi − fj)².
This is the discrete analogue of the continuous identity (Green's formula/Stokes' theorem) on a manifold,
    ∫_M ∥∇_M f∥² = ∫_M f ∆_M f,
where ∆_M = −div(∇_M ·) is the Laplace–Beltrami operator, equal to −trace(H(·)) in local coordinates with the Hessian H = [∂²/∂i∂j] ∈ R^{d×d}.


For Laplacian Eigenmaps, there are two natural candidates: either the eigenvectors of the unnormalized graph Laplacian L,
    min_{y^T 1 = 0} (y^T L y)/(y^T y),
or the generalized eigenvectors of L,
    min_{y^T D1 = 0} (y^T L y)/(y^T D y).
A generalized eigenvector ϕ of L is also a right eigenvector of the row Markov matrix P = D^{−1}W. To see this,
    (D − W)ϕ = λDϕ ⇔ (I − D^{−1}W)ϕ = λϕ ⇔ D^{−1}Wϕ = (1 − λ)ϕ ⇔ Pϕ = (1 − λ)ϕ.
So the eigenvectors are the same; only the eigenvalues are translated from λ to 1 − λ.
(BN03) suggests using the generalized eigenvectors of L for Laplacian LLE, which scales the importance of a vertex by its weighted degree and thus connects to random walks on graphs and the diffusion map to be discussed later. Note that the eigenvectors of the normalized Laplacian ℒ are related to the generalized eigenvectors of L up to a scaling matrix. This can be seen from the following reasoning:
    ℒv = λv ⇔ D^{−1/2}(D − W)D^{−1/2} v = λv ⇔ Lϕ = (D − W)ϕ = λDϕ,  ϕ = D^{−1/2} v.
In spectral graph theory, Fiedler theory actually tells us that the number of zero
eigenvalues/generalized eigenvalues of L is the number of connected components of
graph G (0-th Betti number); the corresponding eigenvectors can be used to parti-
tion the graph into components of small normalized cuts via Cheeger’s inequality.
On the other hand, lumpable Markov Chains on graphs will have piecewise con-
stant Laplacian eigenmaps, which can be used for graph multiple normalized cut
or partition.

Algorithm 13: Laplacian Eigenmap


Input: An adjacency graph G = (V, E, d) such that
1 V = {xi : i = 1, . . . , n}
2 E = {(i, j) : if j is a neighbor of i, i.e. j ∈ Ni }, e.g. k-nearest neighbors,
ϵ-neighbors
3 dij = d(xi, xj), e.g. the Euclidean distance when xi ∼ xj are neighbors
Output: Euclidean d-dimensional coordinates Y = [yi] ∈ R^{d×n} of data.
4 Step 1: Choose weights
5 (a) Heat kernel weights (parameter t):
    Wij = exp(−∥xi − xj∥²/t) if i ∼ j,  and  Wij = 0 otherwise.
(b) Simple-minded (t → ∞): Wij = 1 if i and j are connected by an edge, and Wij = 0 otherwise.
6 Step 2 (Eigenmap): Let D = diag(Σ_j Wij) and L = D − W. Compute the smallest d + 1 generalized eigenvectors
    L yl = λl D yl,  l = 0, 1, . . . , d,
such that 0 = λ0 ≤ λ1 ≤ . . . ≤ λd. Drop the zero eigenvalue λ0 and the constant eigenvector y0, and construct Yd = [y1, . . . , yd] ∈ R^{n×d}.

To embed the data onto a d-dimensional Euclidean space, we can always choose the bottom d + 1 generalized eigenvectors, drop the smallest one (the constant vector associated with eigenvalue 0), and use the remaining d vectors to construct a d-dimensional embedding of the data; a minimal Python sketch follows.
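The following is a minimal numpy/scipy sketch of the Laplacian eigenmap above, using heat kernel weights on a k-NN graph (the graph construction, the value of t, and the symmetrization step are our own illustrative choices; the graph is assumed connected):

import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenmap(X, n_neighbors=10, d=2, t=1.0):
    A = kneighbors_graph(X, n_neighbors, mode='distance').toarray()
    A = np.maximum(A, A.T)                                # symmetrize the k-NN graph
    W = np.where(A > 0, np.exp(-A**2 / t), 0.0)           # heat kernel weights
    D = np.diag(W.sum(axis=1))
    L = D - W
    vals, vecs = eigh(L, D)                               # generalized problem L y = lambda D y
    return vecs[:, 1:d + 1]                               # drop the constant eigenvector (lambda = 0)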
5.6.1. Convergence of Laplacian Eigenmap. Some rigorous results about
convergence of Laplacian eigenmaps are given in (BN08). Assume that M is a
compact manifold with vol(M) = 1. Let the Laplacian-Beltrami operator
    ∆M : C(M) → L²(M),  f ↦ −div(∇f).
Consider the following operator
    L̂t,n : C(M) → C(M),
    (L̂t,n f)(y) = 1/(t(4πt)^{k/2}) ( Σ_i e^{−∥y−xi∥²/(4t)} f(y) − Σ_i e^{−∥y−xi∥²/(4t)} f(xi) ),
which is a function on M, and
    Lt : L²(M) → L²(M),
    (Lt f)(y) = 1/(t(4πt)^{k/2}) ( ∫_M e^{−∥y−x∥²/(4t)} f(y) dx − ∫_M e^{−∥y−x∥²/(4t)} f(x) dx ).
Then (BN08) shows that when these operators have no repeated eigenvalues, the spectrum of L̂t,n converges to that of Lt as n → ∞ (variance), while the latter converges to that of ∆M with a suitable choice of t → 0 (bias). The following theorem gives a summary.
ation are of multiplicity one. For small enough t, let λ̂tn,i be the i-th eigenvalue of
5.7. DIFFUSION MAP 119

t
L̂t,n and v̂n,i be the corresponding eigenfunction. Let λi and vi be the correspond-
ing eigenvalue and eigenfunction of ∆M . Then there exists a sequence tn → 0 such
that
    lim_{n→∞} λ̂^{tn}_{n,i} = λi,
    lim_{n→∞} ∥v̂^{tn}_{n,i} − vi∥ = 0,
where the limits are taken in probability.

5.7. Diffusion Map


A detailed discussion of the Diffusion Map will be given after introducing random walks on graphs. In this section, we just give an introduction in comparison with Laplacian LLE.
Recall that for xi ∈ Rd, i = 1, 2, · · · , n, one can define an undirected weighted graph G = (V, E, W) with V = {xi : i = 1, . . . , n}, oriented edge set E = {(i, j)}, and symmetric weights W = [wij] given by the heat kernel wij = wji = exp(−d(xi, xj)²/t) for i ∼ j and wij = 0 otherwise. Assume that G is connected and thus has a finite diameter. Let di = Σ_{j=1}^n Wij and D = diag(di).
A random walk on graph G can be defined through the following row Markov
matrix,
P = D−1 W,
which is primitive (any two points can be connected by path of length no more
than the diameter) and thus admits the following spectral decomposition
P = ΦΛΨT ,
where
1) Λ = diag(λi) with 1 = λ0 ≥ λ1 ≥ λ2 ≥ . . . ≥ λn−1 > −1 for a primitive Markov chain;
2) Φ = [ϕ0, ϕ1, · · · , ϕn−1] are the right eigenvectors of P, PΦ = ΦΛ;
3) Ψ = [ψ0, ψ1, · · · , ψn−1] are the left eigenvectors of P, Ψ^T P = ΛΨ^T. Note that ϕ0 = 1 ∈ R^n and ψ0(i) = di/Σ_i di. Thus ψ0 coincides, up to a scaling factor, with the stationary distribution π(i) = di/Σ_i di (π^T 1 = 1);
4) Φ and Ψ are bi-orthogonal bases, i.e. ϕi^T ψj = δij, or simply Φ^T Ψ = I.
To see this, consider the normalized Laplacian
    ℒ = D^{−1/2}(D − W)D^{−1/2},
which is symmetric and positive semi-definite, hence
    S = D^{−1/2} W D^{−1/2} = I − ℒ
has n orthogonal eigenvectors V = [v1, v2, · · · , vn],
    S = VΛV^T,  Λ = diag(λi),  V^T V = I,
where 1 = λ0 ≥ λ1 ≥ λ2 ≥ . . . ≥ λn−1. Define Φ = D^{−1/2}V and Ψ = D^{1/2}V; one obtains the spectral decomposition of P. Hence, for any τ ≥ 0, P^τ = ΦΛ^τ Ψ^T defines a diffusion process on the graph G, from which one can define a multiscale Euclidean embedding of the data points.
120 5. MANIFOLD LEARNING

Define diffusion map at scale t (CLL+ 05), by dropping the constant eigenvector
ϕ0 for connected graph G,
Φτ (xi ) = [λτ1 ϕ1 (i), · · · , λτn−1 ϕn−1 (i)], τ ≥ 0.
Clearly, Laplacian LLE corresponds to such a diffusion map at τ = 0; as τ grows
and small eigenvalues |λi |τ < 1 will drop to zero exponentially fast, which leads
to a multiscale analysis on dimensionality reduction. For example, one can set a
threshold δ > 0, and only keep dδ dimensions such that |λi |τ ≥ δ for 1 ≤ i ≤ dδ .
5.7.1. General Diffusion Maps and Convergence. In (CLL+ 05) a general
class of diffusion maps are defined which involves a normalized weight matrix,
d(xi , xk )2
 
α,t Wij X
(113) Wij = α α , pi := exp −
pi · pj t
k
wherePα = 0 recovers the definition above. With this family, one can define Dα =
diag( j Wijα,t ) and the row Markov matrix
(114) Pα,t,n = Dα−1 W α ,
whose right eigenvectors Φα lead to a family of diffusion maps parameterized by α.
Such a definition suggests the following integral operators as diffusion operators.
Assume that q(x) is a density on M.
• Let kt (x, y) = h(∥x − y∥2 /t) where h is a radial basis function, e.g. h(z) =
exp(−z).
• Define Z
qt (x) = kt (x, y)q(y)dy
M
and form the new kernel
(α) kt (x, y)
kt (x, y) = .
qt (x)qtα (y)
α

• Let Z
(α) (α)
dt (x) = kt (x, y)q(y)dy
M
and define the transition kernel of a Markov chain by
(α)
kt (x, y)
pt,α (x, y) = (α)
.
dt (x)
Then the Markov chain can be defined as the operator
Z
Pt,α f (x) = pt,α (x, y)f (y)q(y)dy.
M
• Define the infinitesimal generator of the Markov chain
I − Pt,α
Lt,α = .
t
For this, Lafon et al.(CL06) shows the following pointwise convergence results.
Theorem 5.7.1. Let M ∈ Rp be a compact smooth submanifold, q(x) be a
probability density on M, and ∆M be the Laplacian-Beltrami operator on M.
∆M (f q 1−α ) ∆M (q 1−α ))
(115) lim Lt,α = − .
t→0 q 1−α q 1−α
5.9. LAB: COMPARATIVE STUDIES 121

This suggests that


• for α = 1, it converges to the Laplacian-Beltrami operator limt→0 Lt,1 =
∆M ;
• for α = 1/2, it converges to a Schrödinger operator whose conjugation
leads to a forward Fokker-Planck equation;
• for α = 0, it is the normalized graph Laplacian.
A central question in Laplacian LLE and Diffusion maps is:
Why do we choose right eigenvectors ϕi of row Markov matrix for both Laplacian
LLE and Diffusion map?
To answer this question, we will introduce Markov Chains on finite graphs to see
various properties associated with their spectrum.

5.8. Stochastic Neighbor Embedding


Diffusion map preserves the diffusion distances
!1/2 n
!1/2
t
X
2t 2
X (P (i, k) − P (j, k))2
D (xi , xj ) = λk (ϕk (i) − ϕk (j)) = ,
dk
k k=1

like MDS. Such a diffusion distance is in fact a d−1


i -weighted l2 -distance between
conditional probability representation of data points P (i, ∗) and P (j, ∗).
Instead of preserving diffusion distances, Stochastic Neighbor Embedding looks
for embedding such that the estimated conditional probability QY (i, :) is a faithful
recover of P , by minimizing the Kullback-Leibler divergence between P and Q.
P is similarly estimated using heat kernel as Diffusion maps. However, how to
estimate QY in a low-dimensional embedding space Y = Rd ? The original proposal
of Stochastic Neighbor Embedding (SNE) uses the same heat kernel, which however
suffers the crowding issue which push different classes of data points together. To
overcome this issue, t-SNE exploits Student t-distribution or Cauchy distribution
kernel which allows heavier tail than Gaussian distribution kernel, hence allowing
more moderate distance points lying in the neighbor and attracting the same class
members together.
to-be-finished...

5.9. Lab: Comparative Studies


According to the comparative studies by Todd Wittman, LTSA has the best
overall performance in current manifold learning techniques. Try yourself his code,
mani.m, and enjoy your new discoveries!
122 5. MANIFOLD LEARNING

Figure 7. Comparisons of Manifold Learning Techniques on


Swiss Roll
CHAPTER 6

Random Walk on Graphs

We have talked about Diffusion Map as a model of Random walk or Markov


Chain on data graph. Among other methods of Manifold Learning, the distinct
feature of Diffusion Map lies in that it combines both geometry and stochastic
process. In the next few sections, we will talk about general theory of random
walks or finite Markov chains on graphs which are related to data analysis. From
this one can learn the origin of many ideas in diffusion maps.
Random Walk on Graphs.
• Perron-Frobenius Vector and Google’s PageRank: this is about Perron-
Frobenius theory for nonnegative matrices, which leads to the character-
ization of nonnegative primary eigenvectors, such as stationary distribu-
tions of Markov chains; application examples include Google’s PageRank.
• Fiedler Vector, Cheeger’s Inequality, and Spectral Bipartition: this is
about the second eigenvector in a Markov chain, mostly reduced from
graph Laplacians (Fiedler theory, Cheeger’s Inequality), which is the ba-
sis for spectral partition.
• Lumpability/Metastability, piecewise constant right eigenvector, and Mul-
tiple spectral clustering (“MNcut” by Maila-Shi, 2001): this is about
when to use multiple eigenvectors, whose relationship with lumpability
or metastability of Markov chains, widely used in diffusion map, image
segmentation, etc.
• Mean first passage time, commute time distance: the origins of diffusion
distances.
Today we shall discuss the first part.

6.1. Introduction to Perron-Frobenius Theory and PageRank


Given An×n , we define A > 0, positive matrix, iff Aij > 0 ∀i, j, and A ≥ 0,
nonnegative matrix, iff Aij ≥ 0 ∀i, j.
Note that this definition is different from positive definite:
A ≻ 0 ⇔ A is positive-definite ⇔ xT Ax > 0 ∀x ̸= 0
A ⪰ 0 ⇔ A is semi-positive-definite ⇔ xT Ax ≥ 0 ∀x ̸= 0

Theorem 6.1.1 (Perron Theorem for Positive Matrix). Assume that A > 0,
i.e.a positive matrix. Then
1) ∃λ∗ > 0, ν ∗ > 0, ∥ν ∗ ∥2 = 1, s.t. Aν ∗ = λ∗ ν ∗ , ν ∗ is a right eigenvector
(∃λ∗ > 0, ω > 0, ∥ω∥2 = 1, s.t. (ω T )A = λ∗ ω T , left eigenvector)
2) ∀ other eigenvalue λ of A, |λ| < λ∗
3) ν ∗ is unique up to rescaling or λ∗ is simple
123
124 6. RANDOM WALK ON GRAPHS

4) Collatz-Wielandt Formula
[Ax]i [Ax]i
λ∗ = max min = min max .
x≥0,x̸=0 xi ̸=0 xi x>0 xi
Such eigenvectors will be called Perron vectors. This result can be extended to
nonnegative matrices.
Theorem 6.1.2 (Nonnegative Matrix, Perron). Assume that A ≥ 0, i.e.nonnegative.
Then
1’) ∃λ∗ > 0, ν ∗ ≥ 0, ∥ν ∗ ∥2 = 1, s.t. Aν ∗ = λ∗ ν ∗ (similar to left eigenvector)
2’) ∀ other eigenvalue λ of A, |λ| ≤ λ∗
3’) ν ∗ is NOT unique
4) Collatz-Wielandt Formula
[Ax]i [Ax]i
λ∗ = max min = min max
x≥0,x̸=0 xi ̸=0 xi x>0 xi
Notice the changes in 1’), 2’), and 3’). Perron vectors are nonnegative rather
than positive. In the nonnegative situation what we lose is the uniqueness in λ∗
(2’)and ν ∗ (3’). The next question is: can we add more conditions such that the
loss can be remedied? The answer is yes, if we add the concepts of irreducible and
primitive matrices.

Irreducibility exactly describes the case that the induced graph from A is con-
nected, i.e.every pair of nodes are connected by a path of arbitrary length. However
primitivity strengths this condition to k-connected, i.e.every pair of nodes are con-
nected by a path of length k.
Definition 6.1.1 (Irreducible). The following definitions are equivalent:
1) For any 1 ≤ i, j ≤ n, there is an integer k ∈ Z, s.t. Akij > 0; ⇔
2) Graph G = (V, E) (V = {1, . . . , n} and {i, j} ∈ E iff Aij > 0) is (path-)
connected, i.e.∀{i, j} ∈ E, there is a path (x0 , x1 , . . . , xt ) ∈ V n+1 where i = x0 and
xt = j, connecting i and j.
Definition 6.1.2 (Primitive). The following characterizations hold:
1) There is an integer k ∈ Z, such that ∀i, j, Akij > 0; ⇔
2) Any node pair {i, j} ∈ E are connected with a path of length no more than k;

3) A has unique λ∗ = max |λ|; ⇐
4) A is irreducible and Aii > 0, for some i,
Note that condition 4) is sufficient for primitivity but not necessary; all the first
three conditions are necessary and sufficient for primitivity. Irreducible matrices
have a simple primary eigenvalue λ∗ and 1-dimensional primary (left and right)
eigenspaces, with unique left and right eigenvectors. However, there might be other
eigenvalues whose absolute values (module) equal to the primary eigenvalue, i.e.,
λ∗ eiω .
When A is a primitive matrix, Ak becomes a positive matrix for some k, then we
can recover 1), 2) and 3) for positivity and uniqueness. This leads to the following
Perron-Frobenius theorem.
Theorem 6.1.3 (Nonnegative Matrix, Perron-Frobenius). Assume that A ≥ 0
and A is primitive. Then
6.1. INTRODUCTION TO PERRON-FROBENIUS THEORY AND PAGERANK 125

1) ∃λ∗ > 0, ν ∗ > 0, ∥ν ∗ ∥2 = 1, s.t. Aν ∗ = λ∗ ν ∗ (right eigenvector)


and ∃ω > 0, ∥ω∥2 = 1, s.t. (ω T )A = λ∗ ω T (left eigenvector)
2) ∀ other eigenvalue λ of A, |λ| < λ∗
3) ν ∗ is unique
4) Collatz-Wielandt Formula
[Ax]i [Ax]i
λ∗ = max min = min max
x>0 xi x>0 xi
Such eigenvectors and eigenvalue will be called as Perron-Frobenius or primary
eigenvectors/eigenvalue.
Example 6.1.1 (Markov Chain). Given a graph G = (V, E), consider a ran-
dom walk on G with transition probability Pij = P rob(xt+1 = j|xt = i) ≥ 0. Thus

− →
− →

P is a row-stochastic or row-Markov matrix i.e. P · 1 = 1 where 1 ∈ Rn is the
vector with all elements being 1. From Perron theorem for nonnegative matrices,
we know


ν ∗ = 1 > 0 is a right Perron eigenvector of P
λ = 1 is a Perron eigenvalue and all other eigenvalues |λ| ≤ 1 = λ∗

∃ left PF-eigenvector π such that π T P = π T where π ≥ 0, 1T π = 1; such π is


called an invariant/equilibrium distribution
P is irreducible (G is connected) ⇒ π unique
P is primitive (G connected by paths of length ≤ k) ⇒ |λ| = 1 unique

⇔ lim π0T P k → π T ∀π0 ≥ 0, 1T π0 = 1


t→∞
This means when we take powers of P , i.e.P k , all rows of P k will converge to the
stationary distribution π T . Such a convergence only holds when P is primitive. If
P is not primitive, e.g. P = [0, 1; 1, 0] (whose eigenvalues are 1 and −1), P k always
oscillates and never converges.
What’s the rate of the convergence? Let

γ = max{|λ2 |, · · · , |λn |}, λ1 = 1


T t
and πt = (P ) π0 , roughly speaking we have
∥πt − π∥1 ∼ O(e−γt ).
This type of rates will be seen in various mixing time estimations.
A famous application of Markov chain in modern data analysis is Google’s
PageRank (BP98), although Google’s current search engine only exploits that as
one factor among many others. But you can still install Google Toolbar on your
browser and inspect the PageRank scores of webpages. For more details about
PageRank, readers may refer to Langville and Meyer’s book (LM06).
Example 6.1.2 (Pagerank). Consider a directed weighted graph G = (V, E, W )
whose weight matrix decodes the webpage link structure:
(
#{link : i 7→ j}, (i, j) ∈ E
wij =
0, otherwise
Pn
Define an out-degree vector doi = j=1 wij , which measures the number of out-links
from i. A diagonal matrix D = diag(di ) and a row Markov matrix P1 = D−1 W ,
126 6. RANDOM WALK ON GRAPHS

assumed for simplicity that all nodes have non-empty out-degree. This P1 accounts
for a random walk according to the link structure of webpages. One would expect
that stationary distributions of such random walks will disclose the importance of
webpages: the more visits, the more important. However Perron-Frobenius above
tells us that to obtain a unique stationary distribution, we need a primitive Markov
matrix. For this purpose, Google’s PageRank does the following trick.
Let Pα = αP1 + (1 − α)E, where E = n1 1 · 1T is a random surfer model, i.e.one
can jump to any other webpage uniformly. So in the model Pα , a browser will play
a dice: he will jump according to link structure with probability α or randomly
surf with probability 1 − α. With 1 > α > 0, the existence of random surfer model
makes P a positive matrix, whence ∃!πs.t.PαT π = π (means ’there exists a unique
π’). Google choose α = 0.85 and in this case π gives PageRank scores.
Now you probably can figure out how to cheat PageRank. If there are many
cross links between a small set of nodes (for example, Wikipedia), those nodes must
appear to be high in PageRank. This phenomenon actually has been exploited by
spam webpages, and even scholar citations. After learning the nature of PageRank,
we should be aware of such mis-behaviors.
Finally we discussed a bit on Kleinberg’s HITS algorithm (Kle99), which is
based on singular value decomposition (SVD) of link matrix WP . Above we have
defined the out-degree do . Similarly we can define in-degree dik = j wjk . High out-
degree webpages can be regarded as hubs, as they provide more links to others. On
the other hand, high in-degree webpages are regarded as authorities, as they were
cited by others intensively. Basically in/out-degrees can be used to rank webpages,
which gives relative ranking as authorities/hubs. It turns out Kleinberg’s HITS
algorithm gives pretty similar results to in/out-degree ranking.

Definition 6.1.3 (HITS-authority). This use primary right singular vector of


W as scores to give the ranking. To understand this, define La = W T W . Primary
right singular vector of W is P
just a primary eigenvector of nonnegative symmetric
matrix La . Since La (i, j) = Pk Wki Wkj , thus it counts the number of references
which cites both i and j, i.e. k #{i ← k → j}. The higher value of La (i, j) the
more references received on the pair of nodes. Therefore Perron vector tend to rank
the webpages according to authority.

Definition 6.1.4 (HITS-hub). This use primary left singular vector of W as


scores to give the ranking. Define Lh = W W T , whence primary left singular vector
of W is just
Pa primary eigenvector of nonnegative symmetric matrix Lh . Similarly
Lh (i, j) = k Wik WP jk , which counts the number of links from both i and j, hitting
the same target, i.e. k #{i → k ← j}. Therefore the Perron vector Lh gives hub-
ranking.

The last example is about Economic Growth model where the Debreu intro-
duced nonnegative matrix into its study. Similar applications include population
growth and exchange market, etc.

Example 6.1.3 (Economic Growth/Population/Exchange Market). Consider


a market consisting n sectors (or families, currencies) where Aij represents for each
unit investment on sector j, how much the outcome in sector i. The nonnegative
constraint Aij ≥ 0 requires that i and j are not mutually inhibitor, which means
that investment in sector j does not decrease products in sector i. We study the
6.1. INTRODUCTION TO PERRON-FROBENIUS THEORY AND PAGERANK 127

dynamics xt+1 = Axt and its long term behavior as t → ∞ which describes the
economic growth.
Moreover in exchange market, an additional requirement is put as Aij = 1/Aji ,
which is called reciprocal matrix. Such matrices are also used for preference aggre-
gation in decision theory by Saaty.
From Perron-Frobenius theory we get: ∃λ∗ > 0 ∃ν ∗ ≥ 0 Aν ∗ = λ∗ ν ∗ and
∃ω ≥ 0 AT ω ∗ = λ∗ ω ∗ .

When A is primitive, (Ak > 0, i.e.investment in one sector will increase the product
in another sector in no more than k industrial periods), we have for all other
eigenvalues λ, |λ| < λ∗ and ω ∗ , ν ∗ are unique. In this case one can check that the
long term economic growth is governed by
At → (λ∗ )t ν ∗ ω ∗T
where
1) for all i, (x(xt−1
t )i
)i → λ

2) distribution of resources → ν ∗ / i νi∗ , so the distribution is actually not bal-


P
anced
3) ωi∗ gives the relative value of investment on sector i in long term

6.1.1. Proof of Perron Theorem for Positive Matrices. A complete


proof can be found in Meyer’s book (Mey00), Chapter 8. Our proof below is based
on optimization view, which is related to the Collatz-Wielandt Formula.
Assume that A > 0. Consider the following optimization problem:
max δ
s.t. Ax ≥ δx
x≥0
x ̸= 0
Without loss of generality, assume that 1T x = 1. Let y = Ax and consider the
growth factor xyii , for xi ̸= 0. Our purpose above is to maximize the minimal
growth factor δ (yi /xi ≥ δ).
Let λ∗ be optimal value with ν ∗ ≥ 0, 1T ν ∗ = 1, and Aν ∗ ≥ λ∗ ν ∗ . Our
purpose is to show
1) Aν ∗ = λ∗ ν ∗
2) ν ∗ > 0
3) ν ∗ and λ∗ are unique.
4) For other eigenvalue λ (λz = Az when z ̸= 0), |λ| < λ∗ .
Sketchy Proof of Perron Theorem. 1) If Aν ∗ ̸= λ∗ ν ∗ , then for some i,
[Aν ∗ ]i > λ∗ νi∗ . Below we will find an increase of λ∗ , which is thus not optimal.
Define ν̃ = ν ∗ + ϵei with ϵ > 0 and ei denotes the vector which is one on the ith
component and zero otherwise.
For those j ̸= i,
(Aν̃)j = (Aν ∗ )j + ϵ(Aei )j = λ∗ νj∗ + ϵAji > λ∗ νj∗ = λ∗ ν˜j
where the last inequality is due to A > 0.
For those j = i,
(Aν̃)i = (Aν ∗ )i + ϵ(Aei )i > λ∗ νi∗ + ϵAii .
128 6. RANDOM WALK ON GRAPHS

Since λ∗ ν˜i = λ∗ νi∗ + ϵλ∗ , we have


(Aν̃)i − (λ∗ ν̃)i + ϵ(Aii − λ∗ ) = (Aν ∗ )i − (λ∗ νi∗ ) − ϵ(λ∗ − Aii ) > 0,
where the last inequality holds for small enough ϵ > 0. That means, for some small
ϵ > 0, (Aν̃) > λ∗ ν̃. Thus λ∗ is not optimal, which leads to a contradiction.
2) Assume on the contrary, for some k, νk∗ = 0, then (Aν ∗ )k = λ∗ νk∗ = 0. But
A > 0, ν ∗ ≥ 0 and ν ∗ ̸= 0, so there ∃i, νi∗ > 0, which implies that Aν ∗ > 0.
That contradicts to the previous conclusion. So ν ∗ > 0, which followed by λ∗ > 0
(otherwise Aν ∗ > 0 = λ∗ ν ∗ = Aν ∗ ).
3) We are going to show that for every ν ≥ 0, Aν = µν ⇒ µ = λ∗ . Following the
same reasoning above, A must have a left Perron vector ω ∗ > 0, s.t. AT ω ∗ = λ∗ ω ∗ .
Then λ∗ (ω ∗T ν) = ω ∗T Aν = µ(ω ∗T ν). Since ω ∗T ν > 0 (ω ∗ > 0, ν ≥ 0), there
must be λ∗ = µ, i.e. λ∗ is unique, and ν ∗ is unique.
4) For any other eigenvalue Az = λz, A|z| ≥ |Az| = |λ||z|, so |λ| ≤ λ∗ . Then
we prove that |λ| < λ∗ . Before proceeding, we need the following lemma.
Lemma 6.1.4. Az = λz, |λ| = λ∗ , z ̸= 0 ⇒ A|z| = λ∗ |z|. λ∗ = maxi |λi (A)|
Proof of Lemma. Since |λ| = λ∗ ,
A|z| = |A||z| ≥ |Az| = |λ||z| = λ∗ |z|
Assume that ∃k, λ1∗ A|zk | > |zk |. Denote Y = λ1∗ A|z| − |z| ≥ 0, then Yk > 0.
Using that A > 0, x ≥ 0, x ̸= 0, ⇒ Ax > 0, we can get
1 1
⇒ AY > 0, A|z| > 0
λ∗ λ∗
A A
⇒ ∃ϵ > 0, ∗
Y > ϵ ∗ |z|
λ λ
A
⇒ ĀY > ϵĀ|z|, Ā =
λ∗
⇒ Ā2 |z| − Ā|z| > ϵĀ|z|

Ā2
⇒ |z| > Ā|z|
1+ϵ

⇒B= , 0 = lim B m Ā|z| ≥ Ā|z|
1+ϵ m→∞

⇒ Ā|z| = 0 ⇒ |z| = 0 ⇒ Y =0 ⇒ Ā|z| = λ∗ |z|


Equipped with this lemma, assume that we have Az = λz (z ̸= 0) with |λ| = λ∗ ,


then
X X A
A|z| = λ∗ |z| = |λ||z| = |Az| ⇒ | āij zj | = āij |zj |, Ā = ∗
j j
λ

which implies that zj has the same sign, i.e.zj ≥ 0 or zj ≤ 0 (∀j). In both cases |z|
(z ̸= 0) is a nonnegative eigenvector A|z| = λ|z| which implies λ = λ∗ by 3). □
6.2. INTRODUCTION TO FIEDLER THEORY AND CHEEGER INEQUALITY 129

6.1.2. Perron-Frobenius theory for Nonnegative Tensors. Some researchers,


e.g. Liqun Qi (Polytechnic University of Hong Kong), Lek-Heng Lim (U Chicago)
and Kung-Ching Chang (PKU) et al. recently generalize Perron-Frobenius theory
to nonnegative tensors, which may open a field toward PageRank for hypergraphs
and array or tensor data. For example, A(i, j, k) is a 3-tensor of dimension n,
representing for each object 1 ≤ i ≤ n, which object of j and k are closer to i.
A tensor of order-m and dimension-n means an array of nm real numbers:
A = (ai1 ,...,im ), 1 ≤ i1 , . . . , im ≤ n
An n-vector ν = (ν1 , . . . , νn )T is called an eigenvector, if
Aν [m−1] = λν m−1
for some λ ∈ R, where

n
X
Aν [m−1] := aki2 ...im νi2 · · · νim , ν m−1 := (ν1m−1 , . . . , νnm−1 )T .
i2 ,...,im =1

Chang-Pearson-Zhang [2008] extends Perron-Frobenius theorem to show the exis-


tence of λ∗ > 0 and ν ∗ > 0 when A > 0 is irreducible.
[Ax[m−1] ]i [Ax[m−1] ]i
λ∗ = max min m−1 = min max .
x>0 i xi x>0 i xm−1
i

6.2. Introduction to Fiedler Theory and Cheeger Inequality


In this class, we introduced the random walk on graphs. The last lecture
shows Perron-Frobenius theory to the analysis of primary eigenvectors which is the
stationary distribution. In this lecture we will study the second eigenvector. To
analyze the properties of the graph, we construct two matrices: one is (unnormal-
ized) graph Laplacian and the other is normalized graph Laplacian. In the first
part, we introduce Fiedler Theory for the unnormalized graph Laplacian, which
shows the second eigenvector can be used to bipartite the graph into two connected
components. In the second part, we study the eigenvalues and eigenvectors of nor-
malized Laplacian matrix to show its relations with random walks or Markov chains
on graphs. In the third part, we will introduce the Cheeger Inequality for second
eigenvector of normalized Laplacian, which leads to an approximate algorithm for
Normalized graph cut (NCut) problem, an NP-hard problem itself.
6.2.1. Unnormalized Graph Laplacian and Fiedler Theory. Let G =
(V, E) be an undirected, unweighted simple1 graph. Although the edges here are
unweighted, the theory below still holds when weight is added. We can get a similar
conclusion with the weighted adjacency matrix. However the extension to directed
graphs will lead to different pictures.
We use i ∼ j to denote that node i ∈ V is a neighbor of node j ∈ V .
Definition 6.2.1 (Adjacency Matrix).

1 i∼j
Aij = .
0 otherwise
1Simple graph means for every pair of nodes there are at most one edge associated with it;
and there is no self loop on each node.
130 6. RANDOM WALK ON GRAPHS

Remark. We can use the weight of edge i ∼ j to define Aij if the graph is
weighted. That indicates Aij ∈ R+ . We can also extend Aij to R which involves
both positive and negative weights, like correlation graphs. But the theory below
can not be applied to such weights being positive and negative.
The degree of node i is defined as follows.
n
X
di = Aij .
j=1

Define a diagonal matrix D = diag(di ). Now let’s come to the definition of Lapla-
cian Matrix L.
Definition 6.2.2 (Graph Laplacian).

 di i = j,
Lij = −1 i∼j
0 otherwise

This matrix is often called unnormalized graph Laplacian in literature, to dis-


tinguish it from the normalized graph Laplacian below. In fact, L = D − A.
Example 6.2.1. V = {1, 2, 3, 4}, E = {{1, 2}, {2, 3}, {3, 4}}. This is a linear
chain with four nodes.
 
1 −1 0 0
 −1 2 −1 0 
L= .
 0 −1 2 −1 
0 0 −1 1

Example 6.2.2. A complete graph of n nodes, Kn . V = {1, 2, 3...n}, every


two points are connected, as the figure above with n = 5.
 
n−1 −1 −1 ... −1
 −1 n − 1 −1 ... −1 
L=  −1
.
... −1 n − 1 −1 
−1 ... −1 −1 n−1
6.2. INTRODUCTION TO FIEDLER THEORY AND CHEEGER INEQUALITY 131

From the definition, we can see that L is symmetric, so all its eigenvalues will
be real and there is an orthonormal eigenvector system. Moreover L is positive
semi-definite (p.s.d.). This is due to the fact that
 
XX X X
T 2
v Lv = vi (vi − vj ) = di vi − vi vj 
i j:j∼i i j:j∼i
X
= (vi − vj )2 ≥ 0, ∀v ∈ Rn .
i∼j

In fact, L admits the decomposition L = BB T where B ∈ R|V |×|E| is called inci-


dence matrix (or boundary map in algebraic topology) here, for any 1 ≤ j < k ≤ n,

 1, i = j,
B(i, {j, k}) = −1, i = k,
0, otherwise

These two statements imply the eigenvalues of L can’t be negative. That is to say
λ(L) ≥ 0.
Theorem 6.2.1 (Fiedler theory). Let L has n eigenvectors
Lvi = λi vi , vi ̸= 0, i = 0, . . . , n − 1
where 0 = λ0 ≤ λ1 ≤ · · · ≤ λn−1 . For the second smallest eigenvector v1 , define
N− = {i : v1 (i) < 0},

N+ = {i : v1 (i) > 0},


N0 = V − N− − N+ .
We have the following results.
(1) #{i, λi = 0} = #{connected components of G};
(2) If G is connected, then both N− and N+ are connected. N− ∪ N0 and
N+ ∪ N0 might be disconnected if N0 ̸= ∅.
This theorem tells us that the second smallest eigenvalue can be used to tell us
if the graph is connected, i.e.G is connected iff λ1 ̸= 0, i.e.
λ1 = 0 ⇔ there are at least two connected components.
λ1 > 0 ⇔ the graph is connected.
Moreover, the second smallest eigenvector can be used to bipartite the graph into
two connected components by taking N− and N+ when N0 is empty. For this reason,
we often call the second smallest eigenvalue λ1 as the algebraic connectivity. More
materials can be found in Jim Demmel’s Lecture notes on Fiedler Theory at UC
Berkeley: why we use unnormalized Laplacian eigenvectors for spectral partition
(http://www.cs.berkeley.edu/~demmel/cs267/lecture20/lecture20.html).
We can calculate eigenvalues by using Rayleigh Quotient. This gives a sketch
proof of the first part of the theory.

Proof of Part I. Let (λ, v) be a pair of eigenvalue-eigenvector, i.e.Lv = λv.


Since L1 = 0, so the constant vector 1 ∈ Rn is always the eigenvector associated
with λ0 = 0. In general,
132 6. RANDOM WALK ON GRAPHS

(vi − vj )2
P
T
v Lv i∼j
λ= T = P 2 .
v v vi
i
Note that
0 = λ1 ⇔ vi = vj (j is path connected with i).
Therefore v is a piecewise constant function on connected components of G. If
G has k components, then there are k independent piecewise constant vectors in
the span of characteristic functions on those components, which can be used as
eigenvectors of L. In this way, we proved the first part of the theory. □
6.2.2. Normalized graph Laplacian and Cheeger’s Inequality.
Definition 6.2.3 (Normalized Graph Laplacian).

1 i = j,


Lij = − √ 1
i ∼ j,
 di dj

0 otherwise.

In fact L = D−1/2 (D − A)D−1/2 = D−1/2 LD−1/2 = I − D−1/2 (D − A)D−1/2 .


From this one can see the relations between eigenvectors of normalized L and un-
normalized L. For eigenvectors Lv = λv, we have
(I − D−1/2 LD−1/2 )v = λv ⇔ Lu = λDu, u = D−1/2 v,
whence eigenvectors of L, v after rescaling by D−1/2 v, become generalized eigen-
vectors of L.
We can also use the Rayleigh Quotient to calculate the eigenvalues of L.
1 1
v T Lv v T D− 2 (D − A)D− 2 v
=
vT v vv
T
u Lu
= T Du
uP
(ui − uj )2
i∼j
= P 2 .
uj dj
j

Similarly we get the relations between eigenvalue and the connected components of
the graph.
#{λi (L) = 0} = #{connected components of G}.
Next we show that eigenvectors of L are related to random walks on graphs.
This will show you why we choose this matrix to analysis the graph.
We can construct a random walk on G whose transition matrix is defined by
Aij 1
Pij ∼ P = .
Aij di
j

By easy calculation, we see the result below.


P = D−1 A = D−1/2 (I − L)D1/2 .
6.2. INTRODUCTION TO FIEDLER THEORY AND CHEEGER INEQUALITY 133

Hence P is similar to I −L. So their eigenvalues satisfy λi (P ) = 1−λi (L). Consider


the right eigenvector ϕ and left eigenvector ψ of P .
uT P = λu,
P v = λv.
Due to the similarity between P and L,
uT P = λuT ⇔ uT D−1/2 (I − L)D1/2 = λuT .
Let ū = D−1/2 u, we will get:
ūT (I − L) = λūT
⇔ Lū = (1 − λ)ū.
You can see ū is the eigenvector of L, and we can get left eigenvectors of P
from ū by multiply it with D1/2 on the left side. Similarly for the right eigenvectors
v = D−1/2 ū.
If we choose u0 = πi ∼ Pdidi , then:
p
ū0 (i) ∼ di ,
ūTk ūl = δkl ,
uTk Dvl = δkl ,
πi Pij = πj Pji ∼ Aij = Aji ,
where the last identity says the Markov chain is time-reversible.
All the conclusions above show that the normalized graph Laplacian L keeps
some connectivity measure of unnormalized graph Laplacian L. Furthermore, L is
more related with random walks on graph, through which eigenvectors of P are easy
to check and calculate. That’s why we choose this matrix to analysis the graph.
Now we are ready to introduce the Cheeger’s inequality with normalized graph
Laplacian.
Let G be a graph, G = (V, E) and S is a subset of V whose complement is
S̄ = V − S. We define V ol(S), CU T (S) and N CU T (S) as below.
X
V ol(S) = di .
i∈S
X
CU T (S) = Aij .
i∈S,j∈S̄

CU T (S)
N CU T (S) = .
min(V ol(S), V ol(S̄))
N CU T (S) is called normalized-cut. We define the Cheeger constant
hG = min N CU T (S).
S

Finding minimal normalized graph cut is NP-hard. It is often defined that


CU T (S)
Cheeger ratio (expander): hS :=
V ol(S)
and
Cheeger constant: hG := min max {hS , hS̄ } .
S
134 6. RANDOM WALK ON GRAPHS

Cheeger Inequality says the second smallest eigenvalue provides both upper and
lower bounds on the minimal normalized graph cut. Its proof gives us a constructive
polynomial algorithm to achieve such bounds.
Theorem 6.2.2 (Cheeger Inequality). For every undirected graph G,
h2G
≤ λ1 (L) ≤ 2hG .
2
Proof. (1) Upper bound:
Assume the following function f realizes the optimal normalized graph cut,
(
1
V ol(S) i ∈ S,
f (i) = −1
V ol(S̄)
i ∈ S̄,

By using the Rayleigh Quotient, we get


g T Lg
λ1 = inf
g⊥D 1/2 e g T g
2
P
i∼j (fi − fj )
≤ P 2
fi di
1 1
( V ol(S) + V ol(S̄)
)2 CU T (S)
= 1 1
V ol(S) V ol(S) 2 + V ol(S̄) V ol(S̄)2

1 1
=( + )CU T (S)
V ol(S) V ol(S̄)
2CU T (S)
≤ =: 2hG .
min(V ol(S), V ol(S̄))
which gives the upper bound.
(2) Lower bound: the proof of lower bound actually gives a constructive algo-
rithm to compute an approximate optimal cut as follows.
Let v be the second eigenvector, i.e. Lv = λ1 v, and f = D−1/2 v. Then we
reorder node set V such that f1 ≤ f2 ≤ ... ≤ fn ). Denote V− = {i; vi < 0}, V+ =
{i; vi ≥ vr }. Without Loss of generality, we can assume
X X
dv ≥ dv
i∈V− i∈V+
+
Define new functions f to be the magnitudes of f on V+ .

+ fi i ∈ V+ ,
fi =
0 otherwise,
Now consider a series of particular subsets of V ,
Si = {v1 , v2 , ...vi },
and define
V
g ol(S) = min(V ol(S), V ol(S̄)).
αG = min N CU T (Si ).
i
Clearly finding the optimal value α just requires comparison over n − 1 NCUT
values.
6.2. INTRODUCTION TO FIEDLER THEORY AND CHEEGER INEQUALITY 135

Below we shall show that


h2G α2
≤ G ≤ λ1 .
2 2
First, we have Lf = λ1 Df , so we must have
X
(116) fi (fi − fj ) = λ1 di fi2 .
j:j∼i

From this we will get the following results,


P P
i∈V+ fi j:j∼i (fi − fj )
λ1 = P 2 ,
i∈V+ di fi
− fj )2 +
P P P
i∼j i,j∈V+ (fi i∈V+ fi j∼i j∈V− (fi − fj )
= , (fi − fj )2 = fi (fi − fj ) + fj (fj − fi )
di fi2
P
i∈V+
2
P P P
i∼j i,j∈V+ (fi − fj ) + i∈V+ fi j∼i j∈V− (fi )
> ,
di fi2
P
i∈V+
+
− fj+ )2
P
i∼j (fi
= 2 ,
di fi+
P
i∈V
( i∼j (fi+ − fj+ )2 )( i∼j (fi+ + fj+ )2 )
P P
= 2
( i∈V fi+ di )( i∼j (fi+ + fj+ )2 )
P P

2 2
( i∼j fi+ − fj+ )2
P
≥ 2 , Cauchy-Schwartz Inequality
( i∈V fi+ di )( i∼j (fi+ + fj+ )2 )
P P

2 2
( i∼j fi+ − fj+ )2
P
≥ 2 ,
2( i∈V fi+ di )2
P

where the second last step is due to the Cauchy-Schwartz inequality |⟨x, y⟩|2 ≤
P + + 2 P +2
⟨x, x⟩ · ⟨y, y⟩, and the last step is due to i∼j∈V (fi + fj ) = i∼j∈V (fi +
+2 + + P +2 +2 P +2
fj + 2fi fj ) ≤ 2 i∼j∈V (fi + fj ) ≤ 2 i∈V fi di . Continued from the last
inequality,
2 2
( i∼j fi+ − fj+ )2
P
λ1 ≥ 2 ,
2( i∈V fi+ di )2
P
2 + 2
( i∈V (fi+ − fi−1 )CU T (Si−1 ))2
P
≥ 2 , since f1 ≤ f2 ≤ . . . ≤ fn
2( i∈V fi+ di )2
P
2 + 2
( i∈V (fi+ − fi−1 ol(Si−1 ))2
P
)αG V g
≥ 2
2( i∈V fi+ di )2
P
2
( i∈V fi+ (V ol(Si )))2
2
P
αG g ol(Si−1 ) − Vg
= · 2 ,
2 ( i∈V fi+ di )2
P

2 (
P +2 2 2
αG i∈V fi di ) αG
= = .
2 ( P +2 2 2
i∈V fi di )
136 6. RANDOM WALK ON GRAPHS

where the last inequality is due to the assumption V ol(V− ) ≥ V ol(V+ ), whence
V
g ol(Si ) = V ol(S̄i ) for i ∈ V+ .
This completes the proof. □
Fan Chung gives a short proof of the lower bound in Simons Institute workshop,
2014.
Short Proof. The proof is based on the fact that
P
x∼y |f (x) − f (y)|
hG = inf sup P
f ̸=0 c∈R x |f (x) − c|dx
where the supreme over c is reached at c∗ = median(f (x) : x ∈ V ).
2
P
x∼y (f (x) − f (y))
λ1 = R(f ) = sup P ,
x (f (x) − c) dx
2
c
2
P
x∼y (g(x) − g(y))
≥ P , g(x) = f (x) − c
g(x)2 dx
P x
( x∼y (g(x) − g(y))2 )( x∼y (g(x) + g(y))2 )
P
= P P
( x∈V g 2 (x)dx )(( x∼y (g(x) + g(y))2 )
( x∼y |g 2 (x) − g 2 (y)|)2
P
≥ P P , Cauchy-Schwartz Inequality
( x∈V g 2 (x)dx )(( x∼y (g(x) + g(y))2 )
( x∼y |g 2 (x) − g 2 (y)|)2
P
≥ P , (g(x) + g(y))2 ≤ 2(g 2 (x) + g 2 (y))
2( x∈V g 2 (x)dx )2
h2G
≥ .
2

6.3. *Laplacians and the Cheeger inequality for directed graphs


The following section is mainly contained in (Chu05), which described the
following results:
(1) Define Laplacians on directed graphs.
(2) Define Cheeger constants on directed graphs.
(3) Give an example of the singularity of Cheeger constant on directed graph.
(4) Use the eigenvalue of Lapacian and the Cheeger constant to estimate the
convergence rate of random walk on a directed graph.
Another good reference is (LZ10).
6.3.1. Definition of Laplacians on directed graphs. On a finite and
strong connected directed graph G = (V, E) (A directed graph is strong connected
if there is a path between any pair of vertices), a weight is a function
w: E → R≥0
The in-degree and out-degree of a vertex are defined as
din : V → P R≥0
din
i = j∈V wji
dout : V → P R≥0
dout
i = j∈V wij
6.3. *LAPLACIANS AND THE CHEEGER INEQUALITY FOR DIRECTED GRAPHS 137

Note that din out


i may be different from di .
A random walk on the weighted G is a Markov chain with transition probability
wij
Pij = out .
di
Since G is strong connected, P is irreducible, and consequently there is a unique
stationary distribution ϕ. (And the distribution of the Markov chain will converge
to it if and only if P is aperiodic.)
Example 6.3.1 (undirected graph).
dx
ϕ(x) = P .
y dy

Example 6.3.2 (Eulerian graph). If din out


x = dx for every vertex x, then ϕ(x) =
out
Pdx out .
y dy
This is because dout
x is an unchanged measure with
X X
dout
x Pxy = wxy = din out
y = dy .
x x

Example 6.3.3 (exponentially small stationary dist.). G is a directed graph


with n + 1 vertices formed by the union of a directed circle v0 → v1 → · · · → vn
and edges vi → v0 for i = 1, 2, · · · , n. The weight on any edge is 1. Checking from
vn to v0 with the prerequisite of stationary distribution that the inward probability
flow equals to the outward probability flow, we can see that
ϕ(v0 ) = 2n ϕ(vn ), i.e.ϕ(vn ) = 2−n ϕ(v0 ).
This exponentially small stationary distribution cannot occur in undirected
graph cases for then
di 1
ϕ(i) = P ≥ .
j dj n(n − 1)
However, the stationary dist. can be no smaller than exponential, because we
have
Theorem 6.3.1. If G is a strong connected directed graph with w ≡ 1, and
dout
x ≤ k, ∀x, then max{ϕ(x) : x ∈ V } ≤ k D min{ϕ(y) : y ∈ V }, where D is the
diameter of G.
It can be easily proved using induction on the path connecting x and y.
Now we give a definition on those balanced weights.
Definition 6.3.1 (circulation).
F : E → R≥0
If F satisfies X X
F (u, v) = F (v, w), ∀v,
u,u→v w,v→w

then F is called a circulation.


Note. A circulation is a flow with no source or sink.
138 6. RANDOM WALK ON GRAPHS

Example 6.3.4. For a directed graph, Fϕ (u, v) = ϕ(u)P (u, v) is a circulation,


for X X
Fϕ (u, v) = ϕ(v) = Fϕ (v, w).
u,u→v w,v→w

Definition 6.3.2 (Rayleigh quotient). For a directed graph G with transition


probability matrix P and stationary distribution ϕ, the Rayleigh quotient for any
f : V → C is defined as
| f (u) − f (v) |2 ϕ(u)P (u, v)
P
R(f ) = u→v P .
v | f (v) | ϕ(v)
2

Note. Compare with the undirected graph condition where


| f (u) − f (v) |2 wuv
P
R(f ) = u∼vP .
v | f (v) | d(v)
2

If we look on every undirected edge (u, v) as two directed edges u → v, v → u,


then we get a Eulerian directed graph. So ϕ(u) ∼ doutu and dout
u P (u, v) = wuv , as
a result R(f )(directed) = 2R(f )(undirected). The factor 2 is the result of looking
on every edge as two edges.
The next step is to extend the definition of Laplacian to directed graphs. First
we give a review on Lapalcian on undirected graphs. On an undirected graph,
adjacent matrix is 
1, i ∼ j;
Aij =
0, i ̸∼ j.
D = diag(d(i)),
L = D−1/2 (D − A)D−1/2 .
On a directed graph, however, there are two degrees on a vertex which are
generally inequivalent. Notice that on an undirected graph, stationary distribution
ϕ(i) ∼ d(i), so D = cΦ, where c is a constant and Φ = diag(ϕ(i)).
L = I − D−1/2 AD−1/2
= I − D1/2 P D−1/2
= I − c1/2 Φ1/2 P c−1/2 Φ−1/2
= I − Φ1/2 P Φ−1/2
Extending and symmetrizing it, we define Laplacian on a directed graph
Definition 6.3.3 (Laplacian).
1
L = I − (Φ1/2 P Φ−1/2 + Φ−1/2 P ∗ Φ1/2 ).
2
Suppose the eigenvalues of L are 0 = λ0 ≤ λ1 ≤ · · · ≤ λn−1 . Like the undirected
case, we can calculate λ1 with the Rayleigh quotient.
Theorem 6.3.2.
R(f )
λ1 = P inf .
f (x)ϕ(x)=0 2
Before proving that, we need
Lemma 6.3.3.
gLg ∗
R(f ) = 2 , where g = f Φ1/2 .
∥ g ∥2
6.3. *LAPLACIANS AND THE CHEEGER INEQUALITY FOR DIRECTED GRAPHS 139

Proof.
(u) − f (v) |2 ϕ(u)P (u, v)
P
u→v | fP
R(f ) =
v | f (v) | ϕ(v)
2

2 2
P P P
u→v | f (u) | ϕ(u)P (u, v) + v | f (v) | ϕ(v) − u→v (f (u)f (v) + f (u)f (v))ϕ(u)P (u, v)
=
f Φf ∗
2 2 ∗ ∗
P P
u | f (u) | ϕ(u) + v | f (v) | ϕ(v) − (f ΦP f + f ΦP f
= ∗
f Φf
f (P ∗ Φ + ΦP )f ∗
= 2−
f Φf ∗
(gΦ−1/2 )(P ∗ Φ + ΦP )(Φ−1/2 g ∗ )
= 2−
(gΦ−1/2 )Φ(Φ−1/2 g ∗ )
g(Φ−1/2 P ∗ Φ1/2 + Φ1/2 P Φ−1/2 )g ∗
= 2−
gg ∗

gLg
= 2·
∥ g ∥2

Proof of Theorem 6.3.2. With Lemma 6.3.3 and L(ϕ(x)1/2 )n×1 = 0, we


have
R(f )
λ1 = inf
P
2
g(x)ϕ(x)1/2 =0

R(f )
= P inf .
f (x)ϕ(x)=0 2

Note.
R(f )
λ1 = inf
2
P
f, f (x)ϕ(x)=0

(u) − f (v) |2 ϕ(u)P (u, v)


P
u→v | fP
= P inf
f, f (x)ϕ(x)=0 2 v | f (v) |2 ϕ(v)
| f (u) − f (v) |2 ϕ(u)P (u, v)
P
= P inf sup u→v P
f, f (x)ϕ(x)=0 c 2 v | f (v) − c |2 ϕ(v)
Theorem 6.3.4. Suppose the eigenvalues of P are ρ0 , · · · , ρn−1 with ρ0 = 1,
then
λ1 ≤ min(1 − Reρi ).
i̸=0

6.3.2. Definition of Cheeger constants on directed graphs. We have a


circulation F (u, v) = ϕ(u)P (u, v). Define
X X X X
F (∂S) = F (u, v), F (v) = F (u, v) = F (v, w), F (S) = F (v),
u∈S,v̸∈S u,u→v w,v→w v∈S

then F (∂S) = F (∂ S̄).


140 6. RANDOM WALK ON GRAPHS

Definition 6.3.4 (Cheeger constant). The Cheeger constant of a graph G is


defined as
F (∂S)
h(G) = inf 
S⊂V min F (S), F (S̄)

Note. Compare with the undirected graph condition where


| ∂S |
hG = inf .
S⊂V min | S |, | S̄ |

Similarly, we have
| ∂S |
hG (undirected) = inf 
S⊂V min | S |, | S̄ |
P
u∈S,v∈S̄ wuv
= inf P P 
S⊂V min u∈S d(u), u∈S̄ d(u)
P
u∈S,v∈S̄ϕ(u)P (u, v)
hG (directed) = inf P P 
S⊂V min u∈S ϕ(u), u∈S̄ ϕ(u)
F (∂S)
= inf .
S⊂V min F (S), F (S̄)

Theorem 6.3.5. For every directed graph G,

h2 (G)
≤ λ1 ≤ 2h(G).
2
The proof is similar to the undirected case using Rayleigh quotient and Theorem
6.3.2.

6.3.3. An example of the singularity of Cheeger constant on a di-


rected graph. We have already given an example of a directed graph with n + 1
vertices and stationary distribution ϕ satisfying ϕ(vn ) = 2−n ϕ(v0 ). Now we make
a copy of this graph and denote the new n + 1 vertices u0 , . . . , un . Joining the
two graphs together by two edges vn → un and un → vn , we get a bigger directed
graph. Let S = (v0 , · · · , vn ), we have h(G) ∼ 2−n . In comparison, h(G) ≥ n(n−1)
2

for undirected graph.

6.3.4. Estimate the convergence rate of random walks on directed


graphs. Define the distance of P after s steps and ϕ as
!1/2
X (P s (y, x) − ϕ(x))2
∆(s) = max .
y∈V ϕ(x)
x∈V

I+P
Modify the random walk into a lazy random walk P̃ = 2 , so that it is aperiodic.

Theorem 6.3.6.
λ1 t
∆(t)2 ≤ C(1 − ).
2
6.3. *LAPLACIANS AND THE CHEEGER INEQUALITY FOR DIRECTED GRAPHS 141

6.3.5. Random Walks on Digraphs, The Generalized Digraph Lapla-


cian, and The Degree of Asymmetry. In this paper the following have been
discussed:
(1) Define an asymmetric Laplacian L̃ on directed graph;
(2) Use L̃ to estimate the hitting time and commute time of the corresponding
Markov chain;
(3) Introduce a metric to measure the asymmetry of L̃ and use this measure
to give a tighter bound on the Markov chain mixing rate and a bound on
the Cheeger constant.
Let P be the transition matrix of Markov chain, and π = (π1 , . . . , πn )T (column
vector) denote its stationary distribution (which is unique if the Markov chain is ir-
reducible, or if the directed graph is strongly connected). Let Π = diag{π1 , . . . , πn },
then we define the normalized Laplacian L̃ on directed graph:
1
(117) L̃ = I − Π1/2 P Π− 2
6.3.5.1. Hitting time, commute time and fundamental matrix. We establish the
relations between L̃ and the hitting time and commute time of random walk on
directed graph through the fundamental matrix Z = [zij ], which is defined as:

X
(118) zij = (ptij − πj ), 1 ≤ i, j ≤ n
t=0

or alternatively as an infinite sum of matrix series:



X
(119) Z= (P t − 1π T )
t=0

With the fundamental matrix, the hitting time and commute time can be ex-
pressed as follows:
zjj − zij
(120) Hij =
πj

zjj − zij zii − zji


(121) Cij = Hij + Hji = +
πj πi
Using (119), we can write the fundamental matrix Z in a more explicit form.
Notice that
(122) (P − 1π T )(P − 1π T ) = P 2 − 1π T P − P 1π T + 1π T 1π T = P 2 − 1π T
We use the fact that 1 and π are the right and left eigenvector of the transition
matrix P with eigenvalue 1, and that π T 1 = 1 since π is a distribution. Then

X
(123) Z + 1π T = (P − 1π T )t = (I − P + 1π T )−1
t=0

6.3.5.2. Green’s function and Laplacian for directed graph. If we treat the di-
rected graph Laplacian L̃ as an asymmetric operator on a directed graph G, then
we can define the Green’s Function G̃ (without boundary condition) for directed
graph. The entries of G satisfy the conditions:

(124) (G̃ L̃)ij = δij − πi πj
142 6. RANDOM WALK ON GRAPHS

or in the matrix form


T
(125) G̃ L̃ = I − π 1/2 π 1/2
The central theorem in the second paper associate the Green’s Function G̃, the
fundamental matrix Z and the normalize directed graph Laplacian L̃:
1
Theorem 6.3.7. Let Z̃ = Π1/2 ZΠ− 2 and L̃† denote the Moore-Penrose pseudo-
inverse L̃, then
(126) G̃ = Z̃ = L̃†
6.3.6. measure of asymmetric and its relation to Cheeger constant
and mixing rate. To measure the asymmetry in directed graph, we write the L̃
into the sum of a symmetric part and a skew-symmetric part:
(127) L̃ = 1/2(L̃ + L̃T ) + 1/2(L̃ − L̃T )
1/2(L̃ + L̃T ) = L is the symmetrized Laplacian introduced in the first paper. Let
∆ = 1/2(L̃ − L̃T ), the ∆ captures the difference between L̃ and its transpose. Let
σi , λi and δi (1 ≤ i ≤ n) denotes the i-th singular value of L, L, ∆ in ascending
order (σ1 = λ1 = δ1 = 0). Then the relation L̃ = L + ∆ implies
(128) λi ≤ σi ≤ λi + δn
Therefore δn = ∥∆∥2 is used to measure the degree of asymmetry in the directed
graph.
The following two theorems are application of this measure.
Theorem 6.3.8. The second singular of L̃ has bounds :
h(G)2 δn
(129) ≤ σ2 ≤ (1 + ) · 2h(G)
2 λ2
where h(G) is the Cheeger constant of graph G
Theorem 6.3.9. For a aperiodic Markov chain P ,
∥P̃ f ∥2
(130) δn2 ≤ max{ : f ⊥ π 1/2 } ≤ (1 − λ2 )2 + 2δn λn + δn2
∥f ∥2
1
where P̃ = Π1/2 P Π− 2

6.4. Lumpability of Markov Chain


Let P be the transition matrix of a Markov chain on graph G = (V, E) with
V = {1, 2, · · · , n}, i.e. Pij = Pr{xt = j : xt−1 = i}. Assume that V admits a
partition Ω:
V = ∪ki=1 Ωi , Ωi ∩ Ωj = ∅, i ̸= j.
Ω = {Ωs : s = 1, · · · , k}.
Observe a sequence{x0 , x1 , · · · , xt } sampled from the Markov chain with initial
distribution π0 . Relabel xt 7→ yt ∈ {1, · · · , k} by
k
X
yt = sχΩs (xt ),
s=1

where χ is the characteristic function. Thus we obtain a sequence (yt ) which is a


coarse-grained representation of original sequence.
6.4. LUMPABILITY OF MARKOV CHAIN 143

Definition 6.4.1 (Lumpability, Kemeny-Snell 1976). P is lumpable with re-


spect to partition Ω if the sequence {yt } is Markovian. In other words, the transition
probabilities do not depend on the choice of initial distribution π0 and history, i.e.
(131)
Probπ0 {xt ∈ Ωkt : xt−1 ∈ Ωkt−1 , · · · , x0 ∈ Ωk0 } = Prob{xt ∈ Ωkt : xt−1 ∈ Ωkt−1 }.
The lumpability condition above can be rewritten as
(132) Probπ0 {yt = kt : yt−1 = kt−1 , · · · , y0 = k0 } = Prob{yt = kt : yt−1 = kt−1 }.
Theorem 6.4.1. I. (Kemeny-Snell 1976) P is lumpable with respect
to
P partition Ω ⇔ ∀Ω s , Ωt ∈ Ω, ∀i, j ∈ Ωs , P̂iΩt = P̂jΩt , where P̂iΩt =
P
j∈Ωt ij .

Figure 1. Lumpability condition P̂iΩt = P̂jΩt

II. (Meila-Shi
P 2001) P is lumpable with respect to partition Ω and P̂ (p̂st =
i∈Ωs ,j∈Ωt pij ) is nonsingular ⇔ P has k independent piecewise constant
right eigenvectors in span{χΩs : s = 1, · · · , k}.

Figure 2. A linear chain of 2n nodes with a random walk.

Example 6.4.1. Consider a linear chain with 2n nodes (Figure 2) whose adja-
cency matrix and degree matrix are given by
 
0 1
 1 0 1 
 
A=
 . .
.. .. .. ,

D = diag{1, 2, · · · , 2, 1}
 . 
 1 0 1 
1 0
So the transition matrix is P = D−1 A which is illustrated in Figure 2. The spectrum
of P includes two eigenvalues of magnitude 1, i.e.λ0 = 1 and λn−1 = −1. Although
144 6. RANDOM WALK ON GRAPHS

P is not a primitive matrix here, it is lumpable. Let Ω1 = {odd nodes}, Ω2 = {even


nodes}. We can check that I and II are satisfied.
To see I, note that for any two even nodes, say i = 2 and j = 4, P̂iΩ2 = P̂jΩ2 = 1
as their neighbors are all odd nodes, whence I is satisfied. To see II, note that ϕ0
(associated with λ0 = 1) is a constant vector while ϕ1 (associated with λn−1 = −1)
is constant on even nodes and odd nodes respectively. Figure 3 shows the lumpable
states when n = 4 in the left.
Note that lumpable states might not be optimal bi-partitions in N CU T =
Cut(S)/ min(vol(S), vol(S̄)). In this example, the optimal bi-partition by Ncut is
given by S = {1, . . . , n}, shown in the right of Figure 3. In fact the second largest
eigenvalue λ1 = 0.9010 with eigenvector
v1 = [0.4714, 0.4247, 0.2939, 0.1049, −0.1049, −0.2939, −0.4247, −0.4714],
give the optimal bi-partition.

Figure 3. Left: two lumpable states; Right: optimal-bipartition


of Ncut.

Example 6.4.2. Uncoupled Markov chains are lumpable, e.g.


 
Ω1
P0 =  Ω2  , P̂it = P̂jt = 0.
Ω3
A markov chain P̃ = P0 + O(ϵ) is called nearly uncoupled Markov chain. Such
Markov chains can be approximately represented as uncoupled Markov chains
with metastable states, {Ωs }, where within metastable state transitions are fast
while cross metastable states transitions are slow. Such a separation of scale in
dynamics often appears in many phenomena in real lives, such as protein fold-
ing, your life transitions primary schools 7→ middle schools 7→ high schools 7→
college/university 7→ work unit, etc.
Before the proof of the theorem, we note that condition I is in fact equivalent
to
(133) V U P V = P V,
where U is a k-by-n matrix where each row is a uniform probability that
k×n 1
Uis = χΩ (i), i ∈ V, s ∈ Ω,
|Ωs | s
and V is a n-by-k matrix where each column is a characteristic function on Ωs ,
Vsjn×k = χΩs (j).
With this we have P̂ = U P V and U V = I. Such a matrix representation will be
useful in the derivation of condition II. Now we give the proof of the main theorem.
6.5. APPLICATIONS OF LUMPABILITY: MNCUT AND NETWORK REDUCTION 145

Proof. I. “⇒” To see the necessity, P is lumpable w.r.t. partition Ω, then it


is necessary that
Probπ0 {x1 ∈ Ωt : x0 ∈ Ωs } = Probπ0 {y1 = t : y0 = s} = p̂st
which does not depend on π0 . Now assume there are two different initial distribution
(1) (2)
such that π0 (i) = 1 and π0 (j) = 1 for ∀i, j ∈ Ωs . Thus
p̂iΩt = Probπ(1) {x1 ∈ Ωt : x0 ∈ Ωs } = p̂st = Probπ(2) {x1 ∈ Ωt : x0 ∈ Ωs } = p̂jΩt .
0 0

“⇐” To show the sufficiency, we are going to show that if the condition is satisfied,
then the probability
Probπ0 {yt = t : yt−1 = s, · · · , y0 = k0 }
depends only on Ωs , Ωt ∈ Ω. Probability above can be written as Probπt−1 (yt = t)
where πt−1 is a distribution with support only on Ωs which depends on π0 and
history up to t − 1.P But since Probi (yt = t) = p̂iΩt ≡ p̂st for all i ∈ Ωs , then
Probπt−1 (yt = t) = i∈Ωs πt−1 p̂iΩt = p̂st which only depends on Ωs and Ωt .
II.
“⇒”
Since P̂ is nonsingular, let {ψi , i = 1, · · · , k} are independent right eigenvectors
of P̂ , i.e., P̂ ψi = λi ψi . Define ϕi = V ψi , then ϕi are independent piecewise constant
vectors in span{χΩi , i = 1, · · · , k}. We have
P ϕi = P V ψi = V U P V ψi = V P̂ ψi = λi V ψi = λi ϕi ,
i.e.ϕi are right eigenvectors of P .
“⇐”
Let {ϕi , i = 1, · · · , k} be k independent piecewise constant right eigenvectors
of P in span{XΩi , i = 1, · · · , k}. There must be k independent vectors ψi ∈ Rk
that satisfied ϕi = V ψi . Then
P ϕi = λi ϕi ⇒ P V ψi = λi V ψi ,
Multiplying V U to the left on both sides of the equation, we have
V U P V ψi = λi V U V ψi = λi V ψi = P V ψi , (U V = I),
which implies
(V U P V − P V )Ψ = 0, Ψ = [ψ1 , . . . , ψk ].
Since Ψ is nonsingular due to independence of ψi , whence we must have V U P V =
PV . □

6.5. Applications of Lumpability: MNcut and Network Reduction


If the random walk on a graph P has top k nearly piece-wise constant right
eigenvectors, then the Markov chain P is approximately lumpable. Some spectral
clustering algorithms are proposed in such settings.
6.5.1. MNcut. Meila-Shi (2001) calls the following algorithm as MNcut, stand-
ing for modified Ncut. Due to the theory above, perhaps we’d better to call it
multiple spectral clustering.
1) Find top k right eigenvectors P Φi = λi Φi , i = 1, · · · , k, λi = 1 − o(ϵ).
2) Embedding Y n×k = [ϕ1 , · · · , ϕk ] → diffusion map when λi ≈ 1.
3) k-means (or other suitable clustering methods) on Y to k-clusters.
146 6. RANDOM WALK ON GRAPHS

6.5.2. Optimal Reduction and Complex Network.


6.5.2.1. Random Walk on Graph. Let G = G(S, E) denotes an undirected
graph. Here S has the meaning of ”states”. |S| = n ≫ 1 . Let A = e(x, y) denotes
its adjacency matrix, that is,
(
1 x∼y
e(x, y) =
0 otherwise
Here x ∼ y means (x, y) ∈ E . Here, weights on different edges are the same
1. They may be different in some cases.
Now we define a random walk on G . Let
e(x, y) X
p(x, y) = where d(x) = e(x, y)
d(x)
y∈S

We can check that P = p(x, y) is a stochastic matrix and (S, P ) is a Markov


chain. If G is connected, this Markov chain is irreducible and if G is not a tree,
the chain is even primitive. We assume G is connected from now on. If it is not,
we can focus on each of its connected component.So the Markov chain has unique
invariant distributionµ by irreducibility:
d(x)
µ(x) = P ∀x ∈ S
d(z)
z∈S

A Markov chain defined as above is reversible. That is, detailed balance con-
dition is satisfied:
µ(x)p(x, y) = µ(y)p(y, x) ∀x, y ∈ S
Define an inner product on spaceL2µ :
XX
< f, g >µ = f (x)g(x)µ(x) f, g ∈ L2µ
x∈S y∈S

L2µ is a Hilbert space with this inner product. If we define an operator T on it:
X
T f (x) = p(x, y)f (y) = E[y|x] f (y)
y∈S

We can check that T is a self adjoint operator on L2µ :


X
< T f (x), g(x) >µ = T f (x)g(x)µ(x)
x∈S
XX
= p(x, y)f (y)g(x)µ(x) with detailed balance condition
x∈S y∈S
XX
= p(y, x)f (y)g(x)µ(y)
y∈S x∈S
X
= f (y)T g(y)µ(y)
y∈S
= < f (x), T g(x) >µ
n−1
That means T is self-adjoint. So there is a set of orthonormal basis {ϕj (x)}j=0 and
a set of eigenvalue {λj }j=0 ⊂ [−1, 1], 1 = λ0 > λ1 ⩾ λ2 ⩾ · · · ⩾ λn−1 , s.t.Probϕj =
n−1

λj ϕj , j = 0, 1, . . . n − 1, and < ϕi , ϕj >µ = δij , ∀i, j = 0, 1, . . . n − 1.So ϕj (x) is right


6.5. APPLICATIONS OF LUMPABILITY: MNCUT AND NETWORK REDUCTION 147

eigenvectors. The corresponding left eigenvectors are denoted by {ψj (x)}n−1


j=0 . One
can obtain that ψj (x) = ϕj (x)µ(x). In fact,because T ϕj = λj ϕj ,

P
µ(x) p(x, y)ϕj (y) = λj ϕj (x)µ(x) with detailed balance condition
y∈S
P
p(y, x)µ(y)ϕj (y) = λj ϕj (x)µ(x) that is
y∈S
P
ψj Prob(x) = p(y, x)ϕ(y) = λj (x)ψ(x)
y∈S

Generally, T has spectral decomposition


n−1
X n−1
X
p(x, y) = λi ψi (x)ϕ(y) = p(x, y)ϕi (x)ϕi (y)µ(x)
i=0 i=0

Since P is a stochastic matrix, we have λ0 = 1,the corresponding right eigen-


vector is ϕ0 (x) ≡ 1,and left eigenvector is the invariant distribution ψ0 (x) = µ(x)
6.5.2.2. Optimal Reduction. This section is by (ELVE08). Suppose the number
of states n is very large. The scale of Markov chain is so big that we want a smaller
chain to present its behavior. That is, we want to decompose the state space S:
SN T
Let S = i=1 Si , s.t.N ≪ n, Si Sj = ∅, ∀i ̸= j, and define a transition probability
P̂ on it. We want the Markov chain ({Si }, P̂ ) has similar property as chain (S, P ).
We call {Si } coarse space. The first difficult we’re facing is whether ({Si }, P̂ )
really Markovian. We want
Pr(Xit+1 ∈ Sit+1 |xit ∈ Sit , . . . X0 ∈ Si0 ) = Pr(Xit+1 ∈ Sit+1 |xit ∈ Sit )
and this probability is independent of initial distribution. This property is so-called
lumpability, which you can refer Lecture 9. Unfortunately, lumpability is a strick
constraint that it seldom holds.
So we must modify our strategy of reduction. One choice is to do a optimization
with some norm on L2µ . First, Let us introduce Hilbert-Schmidt norm on L2µ .
Suppose F is an operator on L2µ , and F f (x) =
P
K(x, y)f (y)µ(y). Here K is
y∈S
called a kernel function. If K is symmetric, F is self adjoint. In fact,
XX
< F f (x), g(x) >µ = K(x, y)f (y)µ(y)g(x)µ(x)
x∈S y∈S
XX
= K(y, x)f (y)µ(y)g(x)µ(x)
y∈S x∈S
= < f (x), F g(x) >µ

So F guarantee a spectral decomposition. Let {λj }n−1 j=0 denote its eigenvalue
n−1
and {ϕj (x)}j=0 denote its eigenvector, then k(x, y) can be represented as K(x, y) =
n−1
P
λj ϕj (x)ϕj (y). Hilbert-Schmidt norm of F is defined as follow:
j=0

n−1
X
∥F ∥2HS = tr(F ∗ F ) = tr(F 2 ) = λ2i
i=0
148 6. RANDOM WALK ON GRAPHS

One can check that ∥F ∥2HS = K 2 (x, y)µ(x)µ(y). In fact,


P
x,y∈S
 2
X n−1
X
RHS =  λj ϕj (x)ϕj (y) µ(x)µ(y)
x,y∈S j=0
n−1
X n−1
X X
= λj λk ϕj (x)ϕk (x)ϕj (y)ϕk (y)µ(x)µ(y)
j=0 k=0 x,y∈S
n−1
X
= λ2j
j=0

the last equal sign dues do the orthogonality of eigenvectors. It is clear that if
L2µ = L2 , Hilbert-Schmidt norm is just Frobenius norm.
Now we can write our T as
X X p(x, y)
T f (x) = p(x, y)f (y) = f (y)µ(y)
µ(y)
y∈S y∈S

p(x,y)
and take K(x, y) = µ(y) . By detailed balance condition, K is symmetric. So
X p2 (x, y) X µ(x)
∥T ∥2HS = µ(x)µ(y) = p2 (x, y)
µ2 (y) µ(y)
x,y∈S x,y∈S

We’ll rename ∥P ∥HS to ∥P ∥µ in the following paragraphs.


Now go back to our reduction problem. Suppose we have a coarse space {Si }N i=1 ,
and a transition probability P̂ (k, l), k, l = 1, 2, . . . N on it. If we want to compare
({Si }, P̂ ) with (S, P ), we must ”lift” the coarse process to fine space. One nature
consideration is as follow: if x ∈ Sk , y ∈ Sl , first, we transit from x to Sl follow the
rule P̂ (k, l), and in Sl , we transit to y ”randomly”. To make ”randomly” rigorously,
one may choose the lifted transition probably as follow:
N
X 1
P̃ (x, y) = 1Sk (x)P̂ (k, l)1Sl (y)
|Sl |
k,l=1

One can check that this P̃ is a stochastic matrix, but it is not reversible. One
more convenient choice is transit ”randomly” by invariant distribution:
N
X µ(y)
P̃ (x, y) = 1Sk (x)P̂ (k, l)1Sl (y)
µ̂(Sl )
k,l=1

where
X
µ̂(Sl ) = µ(z)
z∈Sl

Then you can check this matrix is not only a stochastic matrix, but detailed
balance condition also hold provides P̂ on {Si } is reversible.
Now let us do some summary. Given a decomposition of state space S =
SN
i=1 Si , and a transition probability P̂ on coarse space, we may obtain a lifted
6.5. APPLICATIONS OF LUMPABILITY: MNCUT AND NETWORK REDUCTION 149

transition probability P̃ on fine space. Now we can compare ({Si }, P̂ ) and (S, P )
in a clear way: ∥P − P̃ ∥µ . So our optimization problem can be defined clearly:
E = min min ∥P − P̂ ∥2µ
S1 ...SN P̂

That is, given a partition of S, find the optimal P̂ to minimize ∥P − P̂ ∥2µ , and
find the optimal partition to minimize E.
N
6.5.2.3. Community structure of complex network. Given a partition S = ∪ Sk ,
k=1
the solution of optimization problem
min ∥p − p̂∥2µ

is
1 X
p̂∗kl = µ(x)p(x, y)
µ̂(Sk )
x∈Sk ,y∈Sl
It is easy to show that {p̂∗kl } form a transition probability matrix with detailed
balance condition:
p̂∗kl ≥ 0
X 1 X XX
p̂∗kl = µ(x) p(x, y)
µ̂(Sk )
l x∈Sk l y∈Sl
1 X
= µ(x) = 1
µ̂(Sk )
x∈Sk
X
µ̂(Sk )p̂∗kl = µ(x)p(x, y)
x∈Sk ,y∈Sl
X
= µ(y)p(y, x)
x∈Sk ,y∈Sl
= µ̂(Sl )p̂∗lk
The last equality implies that µ̂ is the invariant distribution of the reduced Markov
chain. Thus we find the optimal transition probability in the coarse space. p̂∗ has
the following property
∥p − p∗ ∥2µ = ∥p∥2µ − ∥p̂∗ ∥2µ̂
However, the partition of the original graph is not given in advance, so we
need to minimize E ∗ with respect to all possible partitions. This is a combinatorial
optimization problem, which is extremely difficult to find the exact solution. An
effective approach to obtain an approximate solution, which inherits ideas of K-
means clustering, is proposed as following: First we rewrite E ∗ as
N
X µ(x) X p̂∗
E∗ = |p(x, y) − 1Sk (x) kl 1Sl (y)µ(y)|2
µ(y) µ̂(Sk )
x,y∈S k,l=1
N 2
p̂∗

X X p(x, y)
= µ(x)µ(y) − kl
µ(y) µ̂(Sk )
k,l=1 x∈Sk ,y∈Sl
N X

X
E ∗ (x, Sk )
k=1 x∈Sk
150 6. RANDOM WALK ON GRAPHS

where
N X 2
p̂∗kl


X p(x, y)
E (x, Sk ) = µ(x)µ(y)

µ(y) µ̂(Sk )
l=1 y∈Sl
Based on above expression, a variation of K-means is designed:
N
E step: Fix partition ∪ Sk , compute p̂∗ .
k=1
(n+1)
M step: Put x in Sk such that

E (x, Sk ) = min E ∗ (x, Sj )
j

6.5.2.4. Extensions: Fuzzy Partition. This part is in (LLE09; LL11). It is


unnecessary to require that each vertex belong to a definite class. We introduce
ρk (x) as the probability of a vertex x belonging to class k, and we lift the Markov
chain in coarse space to fine space using the following transition probability
N
X µ(y)
p̃(x, y) = ρk (x)p̂kl ρl (y)
µ̂l
k,l=1

Now we solve
min ∥p − p̃∥2µ

to obtain a optimal reduction.
6.5.2.5. Model selection. Note the number of partition N should also not be
given in advance. But in strategies similar to K-means, the value of minimal E ∗ is
monotone decreasing with N . This means larger N is always preferred.
A possible approach is to introduce another quantity which is monotone in-
creasing with N . We take K-means clustering for example. In K-means clustering,
only compactness is reflected. If another quantity indicates separation of centers of
each cluster, we can minimize the ratio of compactness and separation to find an
optimal N .

6.6. Mean First Passage Time


Consider a Markov chain P on graph G = (V, E). In this section we study the
mean first passage time between vertices, which exploits the unnormalized graph
Laplacian and will be useful for commute time map against diffusion map.
Definition.
(1) First passage time (or hitting time): τij := inf(t ≥ 0|xt = j, x0 = i);
(2) Mean First Passage Time: Tij = Ei τij ;
+
(3) τij := inf(t > 0|xt = j, x0 = i), where τii+ is also called first return time;
(4) Tij+ = Ei τij
+
, where Tii+ is also called mean first return time.
Here Ei denotes the conditional expectation with fixed initial condition x0 = i.
Theorem 6.6.1. Assume that P is irreducible. Let L = D − W be the unnor-

malized
P graph Laplacian with Moore-Penrose inverse L , where D = diag(di ) with
di = j:j i Wij being the degree of node i. Then
(1) Mean First Passage Time is given by
Tii = 0,
L†ik dk − L†ij vol(G) + L†jj vol(G) − L†jk dk ,
X X
Tij = i ̸= j.
k k
6.6. MEAN FIRST PASSAGE TIME 151

(2) Mean First Return Time is given by


1
Tii+ = , Tij+ = Tij .
πi
Proof. Since P is irreducible, then the stationary distribution is unique, de-
noted by π. By definition, we have
X
(134) Tij+ = Pij · 1 + +
Pik (Tkj + 1)
k̸=j

Let E = 1 · 1T where 1 ∈ Rn is a vector with all elements one, Td+ = diag(Tii+ ).


Then 173 becomes
(135) T + = E + P (T + − Td+ ).
For the unique stationary distribution π, π T P = P , whence we have
πT T + = π T 1 · 1T + π T P (T + − Td+ )
πT T + = 1T + π T T + − π T Td+
1 = Td+ π
1
Tii+ =
πi
Before proceeding to solve equation (173), we first show its solution is unique.
Lemma 6.6.2. P is irreducible ⇒ T + and T are both unique.
Proof. Assume S is also a solution of equation (174), then
(I − P )S = E − P diag(1/πi ) = (I − P )T +

⇔ ((I − P )(T + − S) = 0.
Therefore for irreducible P , S and T + must satisfy
diag(T + − S) = 0


T + − S = 1uT , ∀u

which implies T + = S. T ’s uniqueness follows from T = T + − Td+ . □

Now we continue with the proof of the main theorem. Since T = T + − Td+ ,
then (173) becomes
T = E + P T − Td+
(I − P )T = E − Td+
(I − D−1 W )T = F
(D − W )T = DF
LT = DF
where F = E − Td+ and L = D − W is the (unnormalized)
Pn graph Laplacian. Since
T
L is symmetric and irreducible, we have L = Pn k=1 k k k , where 0 = µ1 < µ2 ≤
µ ν ν
· · · ≤ µn , ν1 = 1/||1||, νkT νl = δkl . Let L† = k=2 µ1k νk νkT , L† is called the pseudo-
inverse (or Moore-Penrose inverse) of L. We can test and verify L† satisfies the
152 6. RANDOM WALK ON GRAPHS

following four conditions


 † †
 L LL = L†
LL† L

= L

 (LL† )T
 = LL†
(L† L)T = L† L

From LT = D(E − Td+ ), multiplying both sides by L† leads to


T = L† DE − L† DTd+ + 1 · uT ,
as 1 · uT ∈ ker(L), whence
n
1
L†ik dk − L†ij dj ·
X
Tij = + uj
πj
k=1
n
L†ik dk + L†ii vol(G),
X
ui = − j=i
k=1

L†ik dk − L†ij vol(G) + L†jj vol(G) − L†jk dk


X X
Tij =
k k


P
Note that vol(G) = i di and πi = di /vol(G) for all i.

As L† is a positive definite matrix, this leads to the following corollary.


Corollary 6.6.3.
(136) Tij + Tji = vol(G)(L†ii + L†jj − 2L†ij ).
Therefore the average commute time between i and j leads to an Euclidean distance
metric
p
dc (xi , xj ) := Tij + Tji
often called commute time distance.

6.7. Transition Path Theory


The transition path theory was originally introduced in the context of continuous-
time Markov process on continuous state space (EVE06) and discrete state space
(MSVE09), see (EVE10) for a review. Another description of discrete transition
path theory for molecular dynamics can be also found in (NSVE+ 09). The follow-
ing material is adapted to the setting of discrete time Markov chain with transition
probability matrix P (? ). We assume reversibility in the following presentation,
which can be extended to non-reversible Markov chains.
Assume that an irreducible Markov  Chain on graph G = (V, E) admits the
Pll Plu
following decomposition P = D−1 W = . Here Vl = V0 ∪V1 denotes the
Pul Puu
labeled vertices with source set V0 (e.g. reaction state in chemistry) and sink set V1
(e.g. product state in chemistry), and Vu is the unlabeled vertex set (intermediate
states). That is,
• V0 = {i ∈ Vl : fi = f (xi ) = 0}
• V1 = {i ∈ Vl : fi = f (xi ) = 1}
• V = V0 ∪ V1 ∪ Vu where Vl = V0 ∪ V1
6.7. TRANSITION PATH THEORY 153

Given two sets V0 and V1 in the state space V , the transition path theory tells
how these transitions between the two sets happen (mechanism, rates, etc.). If we
view V0 as a reactant state and V1 as a product state, then one transition from V0
to V1 is a reaction event. The reactve trajectories are those part of the equilibrium
trajectory that the system is going from V0 to V1 .
Let the hitting time of Vl be
τik = inf{t ≥ 0 : x(0) = i, x(t) ∈ Vk }, k = 0, 1.
The central object in transition path theory is the committor function. Its
value at i ∈ Vu gives the probability that a trajectory starting from i will hit the
set V1 first than V0 , i.e., the success rate of the transition at i.
Proposition 6.7.1. For ∀i ∈ Vu , define the committor function
qi := P rob(τi1 < τi0 ) = P rob(trajectory starting from xi hit V1 before V0 )
which satisfies the following Laplacian equation with Dirichlet boundary conditions
(Lq)(i) = [(I − P )q](i) = 0, i ∈ Vu
qi∈V0 = 0, qi∈V1 = 1.
The solution is
qu = (Du − Wuu )−1 Wul ql .
Proof. By definition,

1
 xi ∈ V1
1 0
qi = P rob(τi < τi ) = 0 xi ∈ V0

P
j∈V Pij qj i ∈ Vu
This is because ∀i ∈ Vu ,
qi = P r(τiV1 < τiV0 )
X
= Pij qj
j
X X X
= Pij qj + Pij qj + Pij qj
j∈V1 j∈V0 j∈Vu
X X
= Pij + Pij qj
j∈V1 j∈Vu

∴ qu = Pul ql + Puu qu = Du−1 Wul ql + Du−1 Wuu qu


multiply Du to both side and reorganize
(Du − Wuu )qu = Wul ql
If Du − Wuu is reversible, we get
qu = (Du − Wuu )−1 Wul ql .

The committor function provides natural decomposition of the graph. If q(x)
is less than 0.5, i is more likely to reach V0 first than V1 ; so that {i | q(x) < 0.5}
gives the set of points that are more attached to set V0 .
Once the committor function is given, the statistical properties of the reaction
trajectories between V0 and V1 can be quantified. We state several propositions
154 6. RANDOM WALK ON GRAPHS

characterizing transition mechanism from V0 to V1 . The proof of them is an easy


adaptation of (EVE06; MSVE09) and will be omitted.

Proposition 6.7.2 (Probability distribution of reactive trajectories). The prob-


ability distribution of reactive trajectories
(137) πR (x) = P(Xn = x, n ∈ R)
is given by
(138) πR (x) = π(x)q(x)(1 − q(x)).

The distribution πR gives the equilibrium probability that a reactive trajec-


tory visits x. It provides information about the proportion of time the reactive
trajectories spend in state x along the way from V0 to V1 .

Proposition 6.7.3 (Reactive current from V0 to V1 ). The reactive current


from A to B, defined by
(139) J(xy) = P(Xn = x, Xn+1 = y, {n, n + 1} ⊂ R),
is given by
(
π(x)(1 − q(x))Pxy q(y), x ̸= y;
(140) J(xy) =
0, otherwise.

The reactive current J(xy) gives the average rate the reactive trajectories jump
from state x to y. From the reactive current, we may define the effective reactive
current on an edge and transition current through a node which characterizes the
importance of an edge and a node in the transition from A to B, respectively.

Definition 6.7.1. The effective current of an edge xy is defined as


(141) J + (xy) = max(J(xy) − J(yx), 0).
The transition current through a node x ∈ V is defined as
+
 P
 Py∈V J (xy), x∈A
(142) T (x) = J + (yx), x∈B
 Py∈V + P +
y∈V J (xy) = y∈V J (yx), x ̸∈ A ∪ B

In applications one often examines partial transition current through a node


connecting two communities V − = {x : q(x) < 0.5} and V + = {x : q(x) ≥ 0.5},
+ −
P
e.g. y∈V + J (xy) for x ∈ V , which shows relative importance of the node in
bridging communities.
The reaction rate ν, defined as the number of transitions from V0 to V1 hap-
pened in a unit time interval, can be obtained from adding up the probability
current flowing out of the reactant state. This is stated by the next proposition.

Proposition 6.7.4 (Reaction rate). The reaction rate is given by


X X
(143) ν= J(xy) = J(xy).
x∈A,y∈V x∈V,y∈B
6.8. SEMI-SUPERVISED LEARNING AND TRANSITION PATH THEORY 155

Finally, the committor functions also give information about the time propor-
tion that an equilibrium trajectory comes from A (the trajectory hits A last rather
than B).

Proposition 6.7.5. The proportion of time that the trajectory comes from A
(resp. from B) is given by
X X
(144) ρA = π(x)q(x), ρB = π(x)(1 − q(x)).
x∈V x∈V

6.8. Semi-supervised Learning and Transition Path Theory


Problem: x1 , x2 , ..., xl ∈ Vl are labled data, that is data with the value f (xi ), f ∈
V → R observed. xl+1 , xl+2 , ..., xl+u ∈ Vu are unlabled. Our concern is how to fully
exploiting the information (like geometric structure in disbution) provided in the
labeled and unlabeled data to find the unobserved labels.
This kind of problem may occur in many situations, like ZIP Code recognition.
We may only have a part of digits labeled and our task is to label the unlabeled
ones.

6.8.1. Discrete Harmonic Extension of Functions on Graph. Suppose


the whole graph is G =(V, E, W ), where
 V = Vl ∪ Vu and weight matrix is parti-
Wll Wlu
tioned into blocks W = . As before, we define D = diag(d1 , d2 , ..., dn ) =
Wul Wuu
Pn
diag(Dl , Du ), di = j=1 Wij , L = D −W The goal is to find fu = (fl+1 , ..., fl+u )T
such that
min f T Lf
s.t. f (Vl ) = fl
 
fl
where f = . Note that
fu
 
T fl
f Lf = (flT , fuT )L = fuT Luu fu + flT Lll fl + 2fuT Lul fl
fu
So we have:
∂f T Lf
= 0 ⇒ 2Luu fu + 2Llu fu = 0 ⇒ fu = −L−1
uu Lul fl = (Du − Wuu )
−1
Wul fl
∂fu

6.8.2. Explanation from Transition Path Theory. We can also view the
problem as a random walk  on graph.  Constructing a graph model with transition
−1 Pll Plu
matrix P = D W = . Assume that the labeled data are binary
Pul Puu
(classification). That is, for xi ∈ Vl , f (xi ) = 0 or 1. Denote
• V0 = {i ∈ Vl : fi = f (xi ) = 0}
• V1 = {i ∈ Vl : fi = f (xi ) = 1}
• V = V0 ∪ V1 ∪ Vu where Vl = V0 ∪ V1
With this random walk on graph P , fu can be interpreted as hitting time or
first passage time of V1 .
156 6. RANDOM WALK ON GRAPHS

Proposition 6.8.1. Define hitting time


τik = inf{t ≥ 0 : x(0) = i, x(t) ∈ Vk }, k = 0, 1
Then for ∀i ∈ Vu ,
fi = P rob(τi1 < τi0 )
i.e.
fi = P rob(trajectory starting from xi hit V1 before V0 )
Note that the probability above also called committor function in Transition
Path Theory of Markov Chains.
Proof. Define the committor function,

1
 xi ∈ V 1
qi+ = P rob(τi1 < τi0 ) = 0 xi ∈ V 0
Pij qj+

P
j∈V i ∈ Vu
This is because ∀i ∈ Vu ,
qi+ = P r(τiV1 < τiV0 )
X
= Pij qj+
j
X X X
= Pij qj+ + Pij qj+ + Pij qj+
j∈V1 j∈V0 j∈Vu
X X
= Pij + Pij qj+
j∈V1 j∈Vu

∴ = Pul fl + Puu qu+ = Du−1 Wul fl + Du−1 Wuu qu+


qu+
multiply Du to both side and reorganize:
(Du − Wuu )qu+ = Wul fl
If Du − Wuu is reversible, we get:
qu+ = (Du − Wuu )−1 Wul fl = fu
i.e. fu is the committor function on Vu . □
The result coincides with we obtained through the view of gaussian markov
random field.
6.8.3. Explanation from Gaussian Markov Random Field. If we con-
sider f : V → R are Gaussian random variables on graph nodes whose inverse
covariance matrix (precision matrix) is given by unnormalized graph Laplacian L
(sparse but singular), i.e. f ∼ N (0, Σ) where Σ−1 = L (interpreted as a pseudo
inverse). Then the conditional expectation of fu given fl is:
fu = Σul Σ−1
ll fl
where  
Σll Σlu
Σ=
Σul Σuu
Block matrix inversion formula tells us that when A and D are invertible,
−1 −1
−A−1 BSA
       
A B X Y X Y SD
· =I⇒ = −1 −1
C D Z W Z W −D−1 CSD SA
6.8. SEMI-SUPERVISED LEARNING AND TRANSITION PATH THEORY 157

−1 −1
BD−1
       
X Y A B X Y SD −SD
· =I⇒ = −1 −1
Z W C D Z W −SA CA−1 SA
where SA = D − CA−1 B and SD = A − BD−1 C are called Schur complements of
A and D, respectively. The matrix expressions for inverse are equivalent when the
matrix is invertible.
The graph Laplacian
 
Dl − Wll −Wlu
L=
−Wul Du − Wuu
is not invertible.
P Dl − Wll and Du − Wuu are both strictly diagonally dominant, i.e.
Dl (i, i) > j |Wll (i, j)|, whence they are invertible by Gershgorin Circle Theorem.
However their Schur complements SDu −Wuu and SDl −Wll are still not invertible and
the block matrix inversion formula above can not be applied directly. To avoid this
issue, we define a regularized version of graph Laplacian
Lλ = L + λI, λ>0
and study its inverse Σλ = L−1
λ .
By the block matrix inversion formula, we can set Σ as its right inverse above,
−1 −1
(λ + Dl − Wll )−1 Wlu Sλ+D
 
Sλ+Du −Wuu l −Wll
Σλ = −1 −1
(λ + Du − Wuu )−1 Wul Sλ+D u −Wuu
Sλ+D l −Wll

Therefore,
fu,λ = Σul,λ Σ−1
ll,λ fl = (λ + Du − Wuu )
−1
Wul fl ,
whose limit however exits limλ→0 fu,λ = (Du − Wuu )−1 Wul fl = fu . This implies
that fu can be regarded as the conditional mean given fl .

6.8.4. Remarks. One natural problem is: if we only have a fixed amount of
labeled data, can we recover labels of an infinite amount of unobserved data? This
is called well-posedness. [Nadler-Srebro 2009] gives the following result:
• If xi ∈ R1 , the problem is well-posed.
• If xi ∈ Rd (d ≥ 3), the problem is ill-posed in which case Du − Wuu
becomes singular and f becomes a bump function (fu is almost always
zeros or ones except on some singular points).
Here we can give a brief explanation:
Z
f T Lf ∼ ∥∇f ∥2

∥x−x0 ∥22
(
ϵ2 ∥x − x0 ∥2 < ϵ
If we have Vl = {0, 1}, f (x0 ) = 0, f (x1 ) = 1 and let fϵ (x) = .
1 otherwise
From multivariable calculus,
Z
∥∇f ∥2 = cϵd−2 .
R
Since d ≥ 3, so ϵ → 0 ⇒ ∥∇f ∥2 → 0. So fϵ (x) (ϵ → 0) converges to a bump func-
tion which is one almost everywhere except x0 whose value is 0. No generalization
ability is learned for such bump functions.
This means in high dimensional case, to obtain a smooth generalization, we
have to add constraints more than the norm of the first order derivatives. We
158 6. RANDOM WALK ON GRAPHS

also have a theorem to illustrate what kind of constraint is enough for a good
generalization:
Theorem 6.8.2 (Sobolev embedding Theorem). f ∈ Ws,p (Rd ) ⇐⇒ f has
s’th order weak derivative f (s) ∈ Lp ,
d
s> ⇒ Ws,2 ,→ C(Rd ).
2
So in Rd , to obtain a continuous function, one needs smoothness regularization
∥∇s f ∥ with degree s > d/2. To implement this in discrete Laplacian setting, one
R

may consider iterative Laplacian Ls which might converge to high order smoothness
regularization.

6.9. Lab and Further Studies


6.9.1. Transition Path Analysis for Karate Club Network. In the Mat-
lab command window, run the following command.
% Transition Path Analysis for Karate Club network
%
% Reference :
% Weinan E , Jianfeng Lu , and Yuan Yao (2013)
% The Landscape of Complex Networks : Critical Nodes and A
Hierarchical Decomposition .
% Methods and Applications of Analysis , special issue in honor
of Professor Stanley Osher on his 70 th birthday , 20(4) :383 -404 ,
2013.

% load the Adjacency matrix of Karate Club network


% replace it by your own data
load karate_rand1 . mat A

D = sum (A , 2) ;
N = length ( D ) ;
Label = [0: N -1];
TransProb = diag (1./ D ) * A ;
LMat = TransProb - diag ( ones (N , 1) ) ;

% source set A contains the coach


% target set B contains the president
SetA = 1; % [44:54];%[ find ( ind ==19) ];%[4 4:54];%1 8 + 1;
SetB = 34; % [ find ( ind ==11) ];%10 + 1; % seems to be 11 instead of 10

[ EigV , EigD ] = eig ( LMat ’) ;


EquiMeasure = EigV (: , 1) ./ sign ( EigV (1 ,1) ) ;

for i = 1: N
localmin = true ;
for j = setdiff (1: N , i )
if (( LMat (i , j ) >0) &( EquiMeasure ( j ) > EquiMeasure ( i ) ) )
localmin = false ;
break
end
end
6.9. LAB AND FURTHER STUDIES 159

if ( localmin )
i
end
end

mfpt = zeros (N , 1) ;
SourceSet = 11;
RemainSet = setdiff (1: N , SourceSet ) ;
mfpt ( RemainSet ) = - LMat ( RemainSet , RemainSet ) \ ones (N -1 , 1) ;

TransLMat = diag ( EquiMeasure ) * LMat * diag (1./ EquiMeasure ) ;

SourceSet = SetA ;
TargetSet = SetB ;
RemainSet = setdiff (1: N , union ( SourceSet , TargetSet ) ) ;

% Initia lization of Committor function : transition probability of


reaching
% the target set before returning to the source set .
CommitAB = zeros (N , 1) ;
CommitAB ( SourceSet ) = zeros ( size ( SourceSet ) ) ;
CommitAB ( TargetSet ) = ones ( size ( TargetSet ) ) ;

LMatRestrict = LMat ( RemainSet , RemainSet ) ;


RightHandSide = - LMat ( RemainSet , TargetSet ) * CommitAB ( TargetSet ) ;

% Solve the Dirchelet Boundary problem


CommitAB ( RemainSet ) = LMatRestrict \ RightHandSide ;

% Clustering into two basins according to the transition probability


ClusterA = find ( CommitAB <= 0.5) ;
ClusterB = find ( CommitAB > 0.5) ;

% The inverse transition probability ( committor function )


CommitBA = zeros (N , 1) ;
CommitBA ( SourceSet ) = ones ( size ( SourceSet ) ) ;
CommitBA ( TargetSet ) = zeros ( size ( TargetSet ) ) ;

LMatRestrict = LMat ( RemainSet , RemainSet ) ;


RightHandSide = - LMat ( RemainSet , SourceSet ) * CommitBA ( SourceSet ) ;

% Dirichelet Boundary Problem with inverse transition probability


CommitBA ( RemainSet ) = LMatRestrict \ RightHandSide ;

RhoAB = EquiMeasure .* CommitAB .* CommitBA ;

% Current or Flux on edges


CurrentAB = diag ( EquiMeasure .* CommitBA ) * LMat * diag ( CommitAB ) ;
CurrentAB = CurrentAB - diag ( diag ( CurrentAB ) ) ;

% Effective Current Flux


EffCurrentAB = max ( CurrentAB - CurrentAB ’ , 0) ;
160 6. RANDOM WALK ON GRAPHS

% Transition Current or Flux on each node


TransCurrent = zeros (N , 1) ;
TransCurrent ( ClusterA ) = sum ( EffCurrentAB ( ClusterA , ClusterB ) , 2) ;
TransCurrent ( ClusterB ) = sum ( EffCurrentAB ( ClusterA , ClusterB ) , 1) ;
CHAPTER 7

Diffusion Geometry

Finding meaningful low-dimensional structures hidden in high-dimensional ob-


servations is an fundamental task in high-dimensional statistics. The classical tech-
niques for dimensionality reduction, principal component analysis (PCA) and multi-
dimensional scaling (MDS), guaranteed to discover the true structure of data lying
on or near a linear subspace of the high-dimensional input space. PCA finds a
low-dimensional embedding of the data points that best preserves their variance as
measured in the high-dimensional input space. Classical MDS finds an embedding
that preserves the interpoint distances, equivalent to PCA when those distances
are Euclidean (TdL00). However, these linear techniques cannot adequately han-
dle complex nonlinear data. Recently more emphasis is put on detecting non-linear
features in the data. For example, ISOMAP (TdL00) etc. extends MDS by in-
corporating the geodesic distances imposed by a weighted graph. It defines the
geodesic distance to be the sum of edge weights along the shortest path between
two nodes. The top n eigenvectors of the geodesic distance matrix are used to
represent the coordinates in the new n-dimensional Euclidean space. Nevertheless,
as mentioned in (EST09), in practice robust estimation of geodesic distance on a
manifold is an awkward problem that require rather restrictive assumptions on the
sampling. Moreover, since the MDS step in the ISOMAP algorithm intends to
preserve the geodesic distance between points, it provides a correct embedding if
submanifold is isometric to a convex open set of the subspace. If the submanifold is
not convex, then there exist a pair of points that can not be joined by a straight line
contained in the submanifold. Therefore,their geodesic distance can not be equal
to the Euclidean distance. Diffusion maps (CLL+ 05) leverages the relationship
between heat diffusion and a random walk (Markov Chain); an analogy is drawn
between the diffusion operator on a manifold and a Markov transition matrix op-
erating on functions defined on a weighted graph whose nodes were sampled from
the manifold. A diffusion map, which maps coordinates between data and diffusion
space, aims to re-organize data according to a new metric. In this class, we will
discuss this very metric-diffusion distance and it’s related properties.

7.1. Diffusion Map and Diffusion Distance


Viewing the data points x1 ,x2 ,. . . ,xn as the nodes of a weighted undirected
graph G = (V, EW )(W = (Wij )), where the weight Wij is a measure of the similarity
between xi and xj . There are many ways to define Wij , such as:
(1) Heat kernel. If xi and xj are connected, put:
−∥xi −xj ∥2
(145) Wijε = e ε

161
162 7. DIFFUSION GEOMETRY

with some positive parameter ε ∈ R+


0.

(2) Cosine Similarity


xi xj
(146) Wij = cos(∠(xi , xj )) = ·
∥xi ∥ ∥xj ∥
(3) Kullback-Leibler divergence. P Assume xi and xj are two nonvanishing
k k
probability distribution, i.e. k xi = 1 and xi > 0. Define Kullback-
Leibler divergence
(k)
X (k) xi
D(KL) (xi ||xj ) = xi log (k)
k xj
and its symmetrization D̄ = D(KL) (xi ||xj ) + DKL (xj ||xi ), which measure
a kind of ‘distance’ between distributions; Jensen-Shannon divergence as
the symmetrization of KL-divergence between one distribution and their
average,
D(JS) (xi , xj ) = D(KL) (xi ||(xi + xj )/2) + D(KL) (xj ||(xi + xj )/2)
A similarity kernel can be
(147) Wij = −D(KL) (xi ||xj )
or
(148) Wij = −D(JS) (xi , xj )
The similarity functions are widely used in various applications. Sometimes
the matrix W is positive semi-definite (psd), that for any vector x ∈ Rn ,
(149) xT W x ≥ 0.
PSD kernels includes heat kernels, cosine similarity kernels, and JS-divergence ker-
nels. But in many other cases (e.g. KL-divergence kernels), similarity kernels are
not necessarily PSD. For a PSD kernel, it can be understood as a generalized co-
variance function; otherwise, diffusions as random walks on similarity graphs will
be helpful to disclose their structures.
n
Define A := D−1 W , where D = diag( Wij ) ≜ diag(d1 , d2 , · · · , dn ) for sym-
P
j=1
metric Wij = Wji ≥ 0. So
n
X
(150) Aij = 1 ∀i ∈ {1, 2, · · ·, n} (Aij ≥ 0)
j=1

whence A is a row Markov matrix of the following discrete time Markov chain
{Xt }t∈N satisfying
(151) P (Xt+1 = xj | Xt = xi ) = Aij .
7.1.1. Spectral Properties of A. We may reach a spectral decomposition
of A with the aid of the following symmetric matrix S which is similar to A. Let
1 1
(152) S := D− 2 W D− 2
which is symmetric and has an eigenvalue decomposition
(153) S = V ΛV T , where V V T = In , Λ = diag(λ1 , λ2 , · · ·, λn )
7.1. DIFFUSION MAP AND DIFFUSION DISTANCE 163

So
1 1 1 1
A = D−1 W = D−1 (D 2 SD 2 ) = D− 2 SD 2
which is similar to S, whence sharing the same eigenvalues as S. Moreover
1 1
(154) A = D− 2 V ΛV T D 2 = ΦΛΨT
1 1
where Φ = D− 2 V and Ψ = D 2 V give right and left eigenvectors of A respectively,
AΦ = ΦΛ and ΨT A = ΛΨT , and satisfy ΨT Φ = In .
The Markov matrix A satisfies the following properties by Perron-Frobenius
Theory.
Proposition 7.1.1. (1) A has eigenvalues λ(A) ⊂ [−1, 1].
(2) A is irreducible, if and only if ∀(i, j) ∃t s.t. (At )ij > 0 ⇔ Graph G = (V, E)
is connected
(3) A is irreducible ⇒ λmax = 1
(4) A is primitive, if and only if ∃t > 0 s.t. ∀(i, j) (At )ij > 0 ⇔ Graph
G = (V, E) is path-t connected, i.e. any pair of nodes are connected by a
path of length no more than t
(5) A is irreducible and ∀i, Aii > 0 ⇒ A is primitive
(6) A is primitive ⇒ −1 ̸∈ λ(A)
(7) Wij is induced from the heat kernel, or any positive definite function
⇒ λ(A) ≥ 0
Proof. (1) assume λ and v are the eigenvalue and eigenvector of A, soAv =
λv. Find j0 s.t. |vj0 | ≥ |vj |, ∀j ̸= j0 where vj is the j-th entry of v. Then:
n
X
λvj0 = (Av)j0 = Aj 0 j v j
j=1

So:
n
X n
X
|λ||vj0 | = | Aj 0 j v j | ≤ Aj0 j |vj | ≤ |vj0 |.
j=1 j=1

(7) Let S = D−1/2 W D−1/2 . As W is positive semi-definite, so S has eigenvalues


λ(S) ≥ 0. Note that A = D−1/2 SD1/2 , i.e. similar to S, whence A shares the same
eigenvalues with S. □
Sort the eigenvalues 1 = λ1 ≥ λ2 ≥ . . . ≥ λn ≥ −1. Denote Φ = [ϕ1 , . . . , ϕn ]
and Ψ = [ψ1 , . . . , ψn ]. So the primary (first) right and left eigenvectors are
ϕ1 = 1,
ψ1 = π
as the stationary distribution of the Markov chain, respectively.

7.1.2. Diffusion Map and Distance. Diffusion map of a point x is defined


as the weighted Euclidean embedding via right eigenvectors of Markov matrix A.
From the interpretation of the matrix A as a Markov transition probability matrix
(155) Aij = P r{s(t + 1) = xj |s(t) = xi }
it follows that
(156) Atij = P r{s(t + 1) = xj |s(0) = xi }
164 7. DIFFUSION GEOMETRY

We refer to the i′ th row of the matrix At , denoted Ati,∗ , as the transition prob-
ability of a t-step random walk that starts at xi . We can express At using the
decomposition of A. Indeed, from
(157) A = ΦΛΨT
with ΨT Φ = I, we get
(158) At = ΦΛt ΨT .
Written in a component-wise way, this is equivalent to
Xn
(159) Atij = λtk ϕk (i)ψk (j).
k=1

Therefore Φ and Ψ are right and left eigenvectors of At , respectively.


Let the diffusion map Φt : V 7→ Rn at scale t be
 t 
λ1 ϕ1 (i)
 λt2 ϕ2 (i) 
(160) Φt (xi ) :=  ..
 

 . 
t
λn ϕn (i)
The mapping of points onto the diffusion map space spanned the right eigenvectors
of the row Markov matrix has a well defined probabilistic meaning in terms of the
random walks. Lumpable Markov chains with Piece-wise constant right eigenvec-
tors thus help us understand the behavior of diffusion maps and distances in such
cases.
The diffusion distance is defined to be the Euclidean distances between embed-
ded points,
n
!1/2
X
(161) dt (xi , xj ) := ∥Φt (xi ) − Φt (xj )∥Rn = λ2t
k (ϕk (i) − ϕk (j))
2
.
k=1
The main intuition to define diffusion distance is to describe “perceptual dis-
tances” of points in the same and different clusters. For example Figure 1 shows
that points within the same cluster have small diffusion distances while in different
clusters have large diffusion distances. This is because the metastability phenom-
enon of random walk on graphs where each cluster represents a metastable state.
The main properties of diffusion distances are as follows.
• Diffusion distances reflect average path length connecting points via ran-
dom walks.
• Small t represents local random walk, where diffusion distances reflect
local geometric structure.
• Large t represents global random walk, where diffusion distances reflect
large scale cluster or connected components.

7.1.3. Examples. Three examples about diffusion map:


EX1: two circles.
Suppose graph G : (V, E). Matrix W satisfies wij > 0, if and only if (i, j) ∈ E.
Choose k(x, y) = I∥x−y∥<δ . In this case,
 
A1 0
A= ,
0 A2
7.1. DIFFUSION MAP AND DIFFUSION DISTANCE 165

Figure 1. Diffusion Distances dt (A, B) >> dt (B, C) while graph


shortest path dgeod (A, B) ∼ dgeod (B, C).

Figure 2. Two circles

Figure 3. EX2 single circle

where A1 is a n1 × n1 matrix, A2 is a n2 × n2 matrix, n1 + n2 = n.


Notice that the eigenvalue λ0 = 1 of A is of multiplicity 2, the two eigenvectors

are ϕ0 = 1n and ϕ0 = [c1 1Tn1 , c2 1Tn2 ]T c1 ̸= c2 .

Φ1D 1D

t (x1 ), · · · , Φt (xn1 ) = c1
Diffusion Map :
Φt (xn1 +1 ), · · · , Φ1D
1D
t (xn ) = c2
EX2: ring graph. ”single circle”
In this case, W is a circulant matrix
 
1 1 0 0 ··· 1
 1 1 1
 0 ··· 0 

W = 0 1 1
 1 ··· 0 
 .. .. .. .. ..


 . . . . ··· . 
1 0 0 0 ··· 1
The eigenvalue of W is λk = cos 2πk n
n k = 0, 1, · · · , 2 and the corresponding eigen-
i 2π 2πkj 2πkj t
vector is (uk )j = e n j = 1, · · · , n. So we can get Φ2D
kj
t (xi ) = (cos n , sin n )c
EX3: order the face. Let
166 7. DIFFUSION GEOMETRY

Figure 4. Order the face

∥x − y∥2
 
kε (x, y) = exp − ,
ε

Wijε = kε (xi , xj ) and Aε = D−1 W ε where D = diag( j Wijε ). Define a graph


P

Laplacian (recall that L = D−1 A − I)

1 ε→0
Lε := (Aε − I) −→ backward Kolmogorov operator
ε

1 ′′ ′ ′

1 2 ϕ (s) ′− ϕ (s)V (s) = λϕ(s)
Lε f = △M f − ∇f · ∇V ⇒ Lε ϕ = λϕ ⇒ ′
2 ϕ (0) = ϕ (1) = 0

Where V (s) is the Gibbs free energy and p(s) = e−V (x) is the density of data points
along the curve. △M is Laplace-Beltrami Operator. If p(x) = const, we can get

′′
(162) V (s) = const ⇒ ϕ (s) = 2λϕ(s) ⇒ ϕk (s) = cos(kπs), 2λk = −k 2 π 2

On the other hand p(s) ̸= const, one can show 1 that ϕ1 (s) is monotonic for
arbitrary p(s). As a result, the faces can still be ordered by using ϕ1 (s).

7.1.4. Properties of Diffusion Distance.

Lemma 7.1.2. The diffusion distance is equal to a ℓ2 distance between the


probability clouds Ati,∗ and Atj,∗ with weights 1/dl ,i.e.,

(163) dt (xi , xj ) = ∥Ati,∗ − Atj,∗ ∥ℓ2 (Rn ,1/d)

1by changing to polar coordinate p(s)ϕ′ (s) = r(s) cos θ(s), ϕ(s) = r(s) sin θ(s) ( the so-called
‘Prufer Transform’ ) and then try to show that ϕ′ (s) is never zero on (0, 1).
7.1. DIFFUSION MAP AND DIFFUSION DISTANCE 167

Proof.
n
2
X 1
∥Ati,∗ − Atj,∗ ∥ℓ2 (Rn ,1/d) = (Atil − Atjl )2
dl
l=1
n X n
X 1
= [ λtk ϕk (i)ψk (l) − λtk ϕk (j)ψk (l)]2
dl
l=1 k=1
n X
n
X 1
= λtk (ϕk (i) − ϕk (j))ψk (l)λtk′ (ϕk′ (i) − ϕk′ (j))ψk′ (l)
dl
l=1 k,k′
n n
X X ψk (l)ψk′ (l)
= λtk λtk′ (ϕk (i) − ϕk (j))(ϕk′ (i) − ϕk′ (j))

dl
k,k l=1
Xn
= λtk λtk′ (ϕk (i) − ϕk (j))(ϕk′ (i) − ϕk′ (j))δkk′
k,k′
n
X
= λ2t
k (ϕk (i) − ϕk (j))
2

k=1
= d2t (xi , xj )

In practice we usually do not use the mapping Φt but rather the truncate
diffusion map Φδt that makes use of fewer than n coordinates. Specifically, Φδt uses
t
only the eigenvectors for which the eigenvalues satisfy |λk | > δ. When t is enough
large, we can use the truncated diffusion distance:
2 21
X
(164) dδt (xi , xj ) = ∥Φδt (xi ) − Φδt (xj )∥ = [ λ2t
k (ϕk (i) − ϕk (j)) ]
k:|λk |t >δ
2
as an approximation of the weighted ℓ distance of the probability clouds. We now
derive a simple error bound for this approximation.
Lemma 7.1.3 (Truncated Diffusion Distance). The truncated diffusion distance
satisfies the following upper and lower bounds.
2δ 2
d2t (xi , xj ) − (1 − δij ) ≤ [dδt (xi , xj )]2 ≤ d2t (xi , xj ),
dmin
P
where dmin = min1≤i≤n di with di = j Wij .
1
Proof. Since, Φ = D− 2 V , where V is an orthonormal matrix (V V T =
T
V V = I), it follows that
1 1
(165) ΦΦT = D− 2 V V T D− 2 = D−1
Therefore,
n
X δij
(166) ϕk (i)ϕk (j) = (ΦΦT )ij =
di
k=1
and
n
X 1 1 2δij
(167) (ϕk (i) − ϕk (j))2 = + −
di dj di
k=1
168 7. DIFFUSION GEOMETRY

clearly,
n
X 2
(168) (ϕk (i) − ϕk (j))2 ≤ (1 − δij ), f orall i, j = 1, 2, · · · , n
dmin
k=1
As a result,
X
[dδt (xi , xj )]2 = d2t (xi , xj ) − λ2t
k (ϕk (i) − ϕk (j))
2

k:|λk |t <δ
X
≥ d2t (xi , xj ) − δ2 (ϕk (i) − ϕk (j))2
k:|λk |t <δ
n
X
≥ d2t (xi , xj ) − δ 2 (ϕk (i) − ϕk (j))2
k=1
2δ 2
≥ d2t (xi , xj ) − (1 − δij )
dmin
on the other hand, it is clear that
(169) [dδt (xi , xj )]2 ≤ d2t (xi , xj )
We conclude that
2δ 2
(170) d2t (xi , xj ) − (1 − δij ) ≤ [dδt (xi , xj )]2 ≤ d2t (xi , xj )
dmin

Therefore, for small δ the truncated diffusion distance provides a very good
approximation to the diffusion distance. Due to the fast decay of the eigenvalues,
the number of coordinates used for the truncated diffusion map is usually much
smaller than n, especially when t is large.
7.1.5. Is the diffusion distance really a distance? A distance function
d : X × X → R must satisfy the following properties:
(1) Symmetry: d(x, y) = d(y, x)
(2) Non-negativity: d(x, y) ≥ 0
(3) Identity of indiscernibles: d(x, y) = 0 ⇔ x = y
(4) Triangle inequality: d(x, z) + d(z, y) ≥ d(x, y)
Since the diffusion map is an embedding into the Euclidean space Rn , the
diffusion distance inherits all the metric properties of Rn such as symmetry, non-
negativity and the triangle inequality. The only condition that is not immediately
implied is dt (x, y) = 0 ⇔ x = y. Clearly, xi = xj implies that dt (xi , xj ) = 0. But
is it true that dt (xi , xj ) = 0 implies xi = xj ? Suppose dt (xi , xj ) = 0, Then,
n
X
(171) 0 = d2t (xi , xj ) = λ2t
k (ϕk (i) − ϕk (j))
2

k=1
It follows that ϕk (i) = ϕk (j) for all k with λk ̸= 0. But there is still the possibility
that ϕk (i) ̸= ϕk (j) for k with λk = 0. We claim that this can happen only whenever
i and j have the exact same neighbors and proportional weights, that is:
Proposition 7.1.4. The situation dt (xi , xj ) = 0 with xi ̸= xj occurs if and
only if node i and j have the exact same neighbors and proportional weights
Wik = αWjk , α > 0, f or all k ∈ V.
7.2. COMMUTE TIME MAP AND DISTANCE 169

n
λ2t 2
P
Proof. (Necessity) If dt (xi , xj ) = 0, then k (ϕk (i) − ϕk (j)) = 0 and
k=1
ϕk (i) = ϕk (j) for k with λk ̸= 0 This implies that dt′ (xi , xj ) = 0 for all t′ , because
n
X ′
(172) dt′ (xi , xj ) = λ2t 2
k (ϕk (i) − ϕk (j) = 0.
k=1

In particular, for t = 1, we get d1 (xi , xj ) = 0. But
d1 (xi , xj ) = ∥Ai,∗ − Aj,∗ ∥ℓ2 (Rn ,1/d) ,
and since ∥ · ∥ℓ2 (Rn ,1/d) is a norm, we must have Ai,∗ = Aj,∗ , which implies for each
k ∈V,
Wik Wjk
= , ∀k ∈ V
di dj
whence Wik = αWjk where α = di /dj , as desired.
n
(Ai,k − Aj,k )2 /dk = d21 (xi , xj ) ==
P
(Sufficiency) If Ai,∗ = Aj,∗ , then 0 =
k=1
n
λ2k (ϕk (i) − ϕk (j))2 and therefore ϕk (i) = ϕk (j) for k with λk ̸= 0, from which
P
k=1
it follows that dt (xi , xj ) = 0 for all t. □

Example 14. In a graph with three nodes V = {1, 2, 3} and two edges, say
E = {(1, 2), (2, 3)}, the diffusion distance between nodes 1 and 3 is 0. Here the
transition matrix is  
0 1 0
A =  1/2 0 1/2  .
0 1 0

7.2. Commute Time Map and Distance


Diffusion distance depends on time scale parameter t which is hard to select in
applications. In this section we introduce another closely related distance, namely
commute time distance, derived from mean first passage time between points. For
such distances we do not need to choose the time scale t.
Definition.
(1) First passage time (or hitting time): τij := inf(t ≥ 0|xt = j, x0 = i);
(2) Mean First Passage Time: Tij = Ei τij ;
+
(3) τij := inf(t > 0|xt = j, x0 = i), where τii+ is also called first return time;
(4) Tij+ = Ei τij
+
, where Tii+ is also called mean first return time.
Here Ei denotes the conditional expectation with fixed initial condition x0 = i.
All the below will show that the (average) commute time between xi and xj ,
i.e.Tij + Tji , in fact leads to an Euclidean distance metric which can be used for
embedding.
p
Theorem 7.2.1. dc (xi , xj ) := Tij + Tji is an Euclidean distance metric,
called commute time distance.
Proof. For simplicity, we will assume that P is irreducible such that the
stationary distribution is unique. We will give a constructive proof that Tij + Tji
is a squared distance of some Euclidean coordinates for xi and xj .
170 7. DIFFUSION GEOMETRY

By definition, we have
X
(173) Tij+ = Pij · 1 + +
Pik (Tkj + 1)
k̸=j

Let E = 1 · 1 where 1 ∈ R is a vector with all elements one, Td+ = diag(Tii+ ).


T n

Then 173 becomes


(174) T + = E + P (T + − Td+ ).
For the unique stationary distribution π, π T P = P , whence we have
πT T + = π T 1 · 1T + π T P (T + − Td+ )
πT T + = 1T + π T T + − π T Td+
= Td+ π
1
1
Tii+ =
πi
Before proceeding to solve equation (173), we first show its solution is unique.
Lemma 7.2.2. P is irreducible ⇒ T + and T are both unique.
Proof. Assume S is also a solution of equation (174), then
(I − P )S = E − P diag(1/πi ) = (I − P )T +
⇔ ((I − P )(T + − S) = 0.
Therefore for irreducible P , S and T + must satisfy
diag(T + − S) = 0


T + − S = 1uT , ∀u
which implies T + = S. T ’s uniqueness follows from T = T + − Td+ . □
Now we continue with the proof of the main theorem. Since T = T +
− Td+ ,
then (173) becomes
T = E + P T − Td+
(I − P )T = E − Td+
(I − D−1 W )T = F
(D − W )T = DF
LT = DF
where F = E − Td+ and L = D − W is the (unnormalized)
Pn graph Laplacian. Since
T
L is symmetric and irreducible, we have L = Pn k=1 µ k νk ν k , where 0 = µ1 < µ2 ≤
· · · ≤ µn , ν1 = 1/||1||, νkT νl = δkl . Let L+ = k=2 µ1k νk νkT , L+ is called the pseudo-
inverse (or Moore-Penrose inverse) of L. We can test and verify L+ satisfies the
following four conditions  + +

 L LL = L+
+
LL L = L

 (LL+ )T = LL+

(L+ L)T = L+ L

From LT = D(E − Td+ ), multiplying both sides by L+ leads to


T = L+ DE − L+ DTd+ + 1 · uT ,
7.2. COMMUTE TIME MAP AND DISTANCE 171

Table 1. Comparisons between diffusion map and commute time


map. Here x ∼ y means that x and y are in the same cluster and
x ≁ y for different clusters.

Diffusion Map Commute Time Map


P ’s right eigenvectors L+ ’s eigenvectors
scale parameters: t and ε scale: ε
∃t, s.t. x ∼ y, dt (x, y) → 0 and x ≁ y, dt (x, y) → ∞ x ∼ y, dc (x, y) small and x ≁ y, dc (x, y) large?

as 1 · uT ∈ ker(L), whence
n
X 1
Tij = L+ +
ik dk − Lij dj · + uj
πj
k=1
Xn
ui = − L+ +
ik dk + Lii vol(G), j=i
k=1
X X
Tij = L+ + +
ik dk − Lij vol(G) + Ljj vol(G) − L+
jk dk
k k

P
Note that vol(G) = i di and πi = di /vol(G) for all i.
Then
(175) Tij + Tji = vol(G)(L+ + +
ii + Ljj − 2Lij ).

To see it is a squared Euclidean distance, we need the following lemma.


Lemma 7.2.3. If K is a symmetric and positive semidefinite matrix, then
K(x, x)+K(y, y)−2K(x, y) = d2 (Φ(x), Φ(y)) = ⟨Φ(x), Φ(x)⟩+⟨Φ(y), Φ(y)⟩−2⟨Φ(x), Φ(y)⟩

P. . . , n) are orthonormal eigenvectors with eigenvalues µi ≥ 0,


where Φ = (ϕi : i = 1,
such that K(x, y) = i µi ϕi (x)ϕi (y).
Clearly L+ is a positive semidefinite matrix and we define the commute time
map by its eigenvectors,
 T
1 1
Ψ(xi ) = √ ν2 (i), · · · , √ νn (i) ∈ Rn−1 .
µ2 µn
q
then L+ii +L+
jj −2L +
ij = ||Ψ(x i )−Ψ(x j ))||2
l 2 , and we call d (x
r i , xj ) = L+ + +
ii + Ljj − 2Lij
the resistance distance.

p p
So we have dc (xi , xj ) = Tij + Tji = vol(G)dr (xi , xj ).
7.2.1. Comparisons between diffusion map and commute time map.
However, recently Radl, von Luxburg, and Hein give a negative answer for the last
desired property of dc (x, y) in geometric random graphs. Their result is as follows.
Let X ⊆ Rp be a compact set and let k : X × X → (0, +∞) be a symmetric
and continuous function. Suppose that (xi )i∈N is a sequence of data points drawn
i.i.d. from X according to a density function p > 0 on X . Define Wij = k(xi , xj ),
P = D−1 W , and L = D − W . Then Radl et al. shows
1 1
lim ndr (xi , xj ) = +
n→∞ d(xi ) d(xj )
172 7. DIFFUSION GEOMETRY

d (x ,x )
k(x, y)dp(y) is a smoothed density at x, dr (xi , xj ) = √c i j is the
R
where d(x) = X vol(G)
resistance distance. This result shows that in this setting commute time distance
has no information about cluster information about point cloud data, instead it
simply reflects density information around the two points.

7.3. Diffusion Map: Convergence Theory


Diffusion distance depends on both the geometry and density of the dataset.
The key concepts in the analysis of these methods, that incorporates the density and
geometry of a dataset. This section we will prove the convergence of diffusion map
with heat kernels to its geometric limit, the eigenfunctions of Laplacian-Beltrami
operators.
This is left by previous lecture. W is positive definite if using Gaussian Kernel.
One can check that, when
Z
Q(x) = e−ixξ dµ(ξ),
R
for some positive finite Borel measure dµ on R, then the (symmetric/Hermitian)
integral kernel
k(x, y) = Q(x − y)
is positive definite, that is, for any function ϕ(x) on R,
Z Z
ϕ̄(x)ϕ(y)k(x, y) ≥ 0.

Proof omitted. The reverse is also true, which is Bochner theorem. High dimen-
sional case is similar.
2
Take 1-dimensional as an example. Since the Gaussian distribution e−ξ /2 dξ
is a positive finite Borel measure, and the Fourier transform of Gaussian kernel is
2
itself, we know that k(x, y) = e−|x−y| /2 is a positive definite integral kernel. The
matrix W as an discretized version of k(x, y) keeps the positive-definiteness (make
this rigorous? Hint: take ϕ(x) as a linear combination of n delta functions).
7.3.1. Main Result. In this lecture, we will study the bias and variance
decomposition for sample graph Laplacians and their asymptotic convergence to
Laplacian-Beltrami operators on manifolds.
Let M be a smooth manifold without boundary in Rp (e.g. a d-dimensional
sphere). Randomly draw a set of n data points, x1, ..., xn ∈ M ⊂ Rp , according to
distribution p(x) in an independent and identically distributed (i.i.d.) way. We can
extract an n × n weight matrix Wij as follows:

Wij = k(xi , xj )
where k(x, y) is a symmetric k(x, y) = k(y, x) and positivity-preserving kernel
k(x, y) ≥ 0. As an example, it can be the heat kernel (or Gaussian kernel),

||xi − xj ||2
 
kϵ (xi , xj ) = exp − ,

where ||  ||2 is the Euclidean distance in space Rp and ϵ is the bandwidth of the
kernel. Wij stands for similarity function between xi and xj . A diagonal matrix D
is defined with diagonal elements are the row sums of W :
7.3. DIFFUSION MAP: CONVERGENCE THEORY 173

n
X
Dii = Wij .
j=1

Let’s consider a family of re-weighted similarity matrix, with superscript (α),


W (α) = D−α W D−α
and
n
(α) (α)
X
Dii = Wij .
j=1

(α) (α) −1
Pn (α)
Denote A = (D ) W , and we can verify that j=1 Aij = 1, i.e.a row
Markov matrix. Now define L(α) = A(α) − I = (D(α) )−1 W (α) − I; and

1 (α)
(A − I)
Lϵ,α =
ϵ ϵ
when kϵ (x, y) is used in constructing W . In general, L(α) and Lϵ,α are both called
graph Laplacians. In particular L(0) is the unnormalized graph Laplacian in litera-
ture.
The target is to show that graph Laplacian Lϵ,α converges to continuous differ-
ential operators acting on smooth functions on M the manifold. The convergence
can be roughly understood as: we say a sequence of n-by-n matrix L(n) as n → ∞
converges to a limiting operator L, if for L’s eigenfunction f (x) (a smooth function
on M) with eigenvalue λ, that is
Lf = λf,
the length-n vector f (n) = (f (xi )), (i = 1, · · · , n) is approximately an eigenvector
of L(n) with eigenvalue λ, that is
L(n) f (n) = λf (n) + o(1),
where o(1) goes to zero as n → ∞.
Specifically, (the convergence is in the sense of multiplying a positive constant)
(I) Lϵ,0 = 1ϵ (Aϵ − I) → 12 (∆M + 2 ∇p p · ∇) as ϵ → 0 and n → ∞. ∆M is
the Laplace-Beltrami operator of manifold M . At a point on M which
is d-dimensional, in local (orthogonal) geodesic coordinate s1 , · · · , sd , the
Laplace-Beltrami operator has the same form as the laplace in calculus
d
X ∂2
∆M f = f;
i=1
∂s2i

∇ denotes the gradient of a function on M , and · denotes the inner product


on tangent spaces of M. Note that p = e−V , so ∇p p = −∇V .
(Ignore this part if you don’t know stochastic process) Suppose we
have the following diffusion process
(M )
dXt = −∇V (Xt )dt + σdWt ,
(M )
where Wt is the Brownian motion on M , and σ is the volatility, say a
positive constant, then the backward Kolmogorov operator/Fokker-Plank
174 7. DIFFUSION GEOMETRY

operator/infinitesimal generator of the process is


σ2
∆M − ∇V · ∇,
2
so we say in (I) the limiting operator is the Fokker-Plank operator. Notice
that in Lafon ’06 paper they differ the case of α = 0 and α = 1/2, and
argue that only in the later case the limiting operator is the Fokker-Plank.
However the difference between α = 0 and α = 1/2 is a 1/2 factor in front
of −∇V , and that can be unified by changing the volatility σ to another
number. (Actually, according to Thm 2. on Page 15 of Lafon’06, one can
1
check that σ 2 = 1−α .) So here we say for α = 0 the limiting operator is
also Fokker-Plank. (not talked in class, open to discussion...)
(1)
(II) Lϵ,1 = 1ϵ (Aϵ − I) → 21 ∆M as ϵ → 0 and n → ∞. Notice that this
case is of important application value: whatever the density p(x) is, the
Laplacian-Beltrami operator of M is approximated, so the geometry of
the manifold can be understood.
A special case is that samples xi are uniformly distributed on M, whence
∇p = 0. Then (I) and (II) are the same up to multiplying a positive constant, due
to that D’s diagonal entries are almost the same number and the re-weight does
not do anything.
Convergence results like these can be found in Coifman and Lafon (CL06),
Diffusion maps, Applied and Computational Harmonic Analysis.
We also refer (Sin06) From graph to manifold Laplacian: The convergence rate,
Applied and Computational Harmonic Analysis for a complete analysis of the vari-
ance error, while the analysis of bias is very brief in this paper.

7.3.2. Proof. For a smooth function f (x) on M, let f = (fi ) ∈ Rn as a vector


defined by fi = f (xi ). At a given fixed point xi , we have the formula:
Pn ! 1
Pn !
1 Wij fj 1 j=1 Wij fj
(Lf )i
= Pj=1
n − fi = n
1
Pn − fi
ϵ j=1 Wij ϵ n j=1 Wij
1
P !
1 n j̸=i kϵ (xi , xj ).f (xj ) 1
= 1 − f (xi ) + f (xi )O( d )
ϵ
P
n j̸=i kϵ (xi , xj ) nϵ 2
where in the last step the diagonal terms j = i are excluded from the sums resulting
d
in an O(n−1 ϵ− 2 ) error. Later we will see that compared to the variance error, this
term is negligible.
We rewrite the Laplacian above as
 
1 F (xi ) 1
(176) (Lf )i = − f (xi ) + f (xi )O( d )
ϵ G(xi ) nϵ 2
where

1X 1X
F (xi ) = kϵ (xi , xj )f (xj ), G(xi ) = kϵ (xi, xj ).
n n
j̸=i j̸=i
depends only on the other n − 1 data points than xi . In what follows we treat
xi as a fixed chosen point and write as x.
7.3. DIFFUSION MAP: CONVERGENCE THEORY 175

Bias-Variance Decomposition. The points xj , j ̸= i are independent iden-


tically distributed (i.i.d), therefore every term in the summation of F (x) (G(x))
are i.i.d., and by theR Law of Large Numbers (LLN) one should expect F (x) ≈
Ex1 [k(x, x1 )f (x1 )] = M k(x, y)f (y)p(y)dy (and G(x) ≈ Ek(x, x1 ) = M k(x, y)p(y)dy).
R

Recall that given a random variable x, and a sample estimator θ̂ (e.g. sample mean),
the bias-variance decomposition is given by
E∥x − θ̂∥2 = E∥x − Ex∥2 + E∥Ex − θ̂∥2 .
E[F ]
If we use the same strategy here (though not exactly the same, since E[ G
F
] ̸= E[G]
!), we can decompose Eqn. (176) as
1 E[F ] 1 F (xi ) E[F ]
   
1
(Lf )i = − f (xi ) + f (xi )O( d ) + −
ϵ E[G] nϵ 2 ϵ G(xi ) E[G]
= bias + variance.
In the below we shall show that for case (I) the estimates are
(177)
1 E[F ] ∇p
 
1 m2  d

bias = − f (x) + f (xi )O( d ) = (∆M f +2∇f · )+O(ϵ)+O n−1 ϵ− 2 .
ϵ E[G] nϵ 2 2 p

1 F (xi ) E[F ]
 
1 d
(178) variance = − = O(n− 2 ϵ− 4 −1 ),
ϵ G(xi ) E[G]
whence
1 d 1 d
bias + variance = O(ϵ, n− 2 ϵ− 4 −1 ) = C1 ϵ + C2 n− 2 ϵ− 4 −1 .
As the bias is a monotone increasing function of ϵ while the variance is decreasing
w.r.t. ϵ, the optimal choice of ϵ is to balance the two terms by taking derivative
1 d
of the right hand side equal to zero (or equivalently setting ϵ ∼ n− 2 ϵ− 4 −1 ) whose
solution gives the optimal rates
ϵ∗ ∼ n−1/(2+d/2) .
(CL06) gives the bias and (HAvL05) contains the variance parts, which are further
improved by (Sin06) in both bias and variance.
7.3.3. The Bias Term. Now focus on E[F ]
 
1 n−1
X Z
E[F ] = E  kϵ (xi , xj )f (xj ) = kϵ (x, y)f (y)p(y)dy
n n M
j̸=i
n−1
n is close to 1 and is treated as 1.
(1) the case of one-dimensional and flat (which means the manifold M is just
a real line, i.e.M = R)
(x−y)2
Let f˜(y) = f (y)p(y), and kϵ (x, y) = √1ϵ e− 2ϵ , by change of variable

y = x + ϵz,
we have
√ 1
Z
ϵ2
□= f˜(x + ϵz)e− 2 dz = m0 f˜(x) + m2 f ′′ (x)ϵ + O(ϵ2 )
R 2
ϵ2 ϵ2
where m0 = R e− 2 dz, and m2 = R z 2 e− 2 dz.
R R
176 7. DIFFUSION GEOMETRY

(2) 1 Dimensional & Not flat:


Divide the integral into 2 parts:
Z Z Z
kϵ (x, y)f˜(y)p(y)dy = √
·+ √
·
m ||x−y||>c ϵ ||x−y||<c ϵ

First part = ◦
1 ϵ2
| ◦ | ≤ ||f˜||∞ a e− 2ϵ ,
ϵ2
2

due to ||x − y|| > c ϵ

1
c ∼ ln( ).
ϵ
so this item is tiny and can
√ be ignored.
Locally, that is u ∼ ϵ, we have the curve in a plane and has the
following parametrized equation
(x(u), y(u)) = (u, au2 + qu3 + · · · ),
then the chord length
1 1 1
||x − y||2 = [u2 + (au2 + qu3 + ...)2 ] = [u2 + a2 u4 + q5 (u) + · · · ],
ϵ ϵ ϵ
where we mark a2 u4 + 2aqu5 + ... = q5 (u). Next, change variable √uϵ = z,
ξ
then with h(ξ) = e− 2
||x − y|| 2 3
h( ) = h(z 2 ) + h′ (z 2 )(ϵ2 az 4 + ϵ 2 q5 + O(ϵ2 )),
ϵ
also
df˜ 1 d2 f˜
f˜(s) = f˜(x) + (x)s + (x)s2 + · · ·
ds 2 ds2
and Z up
s= 1 + (2au + 3quu2 + ...)2 du + · · ·
0
and
ds 2
= 1 + 2a2 u2 + q2 (u) + O(ϵ2 ), s = u + a2 u3 + O(ϵ2 ).
du 3
Now come back to the integral
1 x−y ˜
Z

√ h( )f (s)ds
|x−y|<c ϵ ϵ ϵ
df˜
Z +∞
3 √ 2 3
≈ [h(z 2 ) + h′ (z 2 )(ϵ2 az 4 + ϵ 2 q5 ] · [f˜(x) + (x)( ϵz + a2 z 2 ϵ 2 )
−∞ ds 3
1 d2 f˜
+ (x)ϵz 2 ] · [1 + 2a2 + ϵ3 y3 (z)]dz
2 ds2
m2 d2 f˜
=m0 f˜(x) + ϵ ( (x) + a2 f˜(x)) + O(ϵ2 ),
2 ds2
O(ϵ2 ) tails are omitted in middle steps, and m0 = h(z 2 )dz,m2 =
R
where
R 2 the
z h(z 2 )dz, are positive constants. In what follows we normalize both of
7.3. DIFFUSION MAP: CONVERGENCE THEORY 177

them by m0 , so only m2 appears as coefficient in the O(ϵ) term. Also the


ξ
fact that h(ξ) = e− 2 , and so h′ (ξ) = − 12 h(ξ), is used.
(3) For high dimension, M is of dimension d,
1 |x−y|2
kϵ (x, y) = d e−, 2ϵ

ϵ 2

the corresponding result is (Lemma 8 in Appendix B of Lafon ’06 paper)


m2
Z
(179) kϵ (x, y)f˜(y)dy = f˜(x) + ϵ (∆M f˜ + E(x)f˜(x)) + O(ϵ2 ),
M 2
where
d
X X
E(x) = ai (x)2 − ai1 (x)ai2 (x),
i=1 i1 ̸=i2

and ai (x) are the curvatures along coordinates si (i = 1, · · · , d) at point


x.
Now we study the limiting operator and the bias error:
′ 2
EF f + ϵ m22 (f ′′ + 2f ′ pp + f pp + Ef ) + O(ϵ2 )
R
kϵ (x, y)f (y)p(y)dy
= ≈
EG
′′
R
kϵ (x, y)p(y)dy 1 + ϵ m22 ( pp + E) + O(ϵ2 )
m2 ′′ p′
(180) = f (x) + ϵ (f + 2f ′ ) + o(ϵ2 ),
2 p
and as a result, for generally d-dim case,
1 EF ∇p
 
m2
− f (x) = (∆M f + 2∇f · ) + O(ϵ).
ϵ EG 2 p
Using the same method and use Eqn. (179), one can show that for case (II)
where α = 1, the limiting operator is exactly the Laplace-Beltrami operator and
the bias error is again O(ϵ) (homework).
About M with boundary: firstly the limiting differential operator bears
√ Newmann/no-
flux boundary condition. Secondly, the convergence at a belt of width ϵ near ∂M
is slower than the inner part of M, see more in Lafon’06 paper.

7.3.4. Variance Term. Our purpose is to derive the large deviation bound
for2
E[F ]
 
F
(181) P rob − ≥α
G E[G]
where F = F (xi ) = n1 j̸=i kϵ (xi , xj )f (xj ) and G = G(xi ) = n1 j̸=i kϵ (x, xj ).
P P

With x1 , x2 , ..., xn as i.i.d random variables, F and G are sample means (up to a
scaling constant). Define a new random variable
Y = E[G]F − E[F ]G − αE[G](G − E[G])
which is of mean zero and Eqn. (181) can be rewritten as
P rob(Y ≥ αE[G]2 ).

2The opposite direction is omitted here.


178 7. DIFFUSION GEOMETRY

For simplicity by Markov (Chebyshev) inequality 3 ,


E[Y 2 ]
P rob(Y ≥ αE[G]2 ) ≤
α2 E[G]4
and setting the right hand side to be δ ∈ (0, 1), then with probability at least 1 − δ
the following holds !
E[Y 2 ] E[Y 2 ]
p p
α≤ √ ∼O .
E[G]2 δ E[G]2
It remains to bound
E[Y 2 ] = (EG)2 E(F 2 ) − 2(EG)(EF )E(F G) + (EF )2 E(G2 ) + ...
+2α(EG)[(EF )E(G2 ) − (EG)E(F G)] + α2 (EG)2 (E(G2 ) − (EG)2 ).
So it suffices to give E(F ), E(G), E(F G), E(F 2 ), and E(G2 ). The former two are
given in bias and for the variance parts in latter three, let’s take one simple example
with E(G2 ).
Recall that x1 , x2 , ..., xn are distributed i.i.d according to density p(x), and
1X
G(x) = kϵ (x, xj ),
n
j̸=i
so Z 
1 2 2
V ar(G) = 2 (n − 1) kϵ (x, y)) p(y)dy − (Ekϵ (x, y)) .
n M
Look at the simplest case of 1-dimension flat M for an illustrative example:
1 √
Z Z
2
(kϵ (x, y)) p(y)dy = √ h2 (z 2 )(p(x) + p′ (x)( ϵz + O(ϵ)))dz,
ϵ
M R
R 2 2
let M2 = h (z )dz
R
1 √
Z
(kϵ (x, y))2 p(y)dy = p(x) · √ M2 + O( ϵ).
ϵ
M

Recall that Ekϵ (x, y) = O(1), we finally have


 
1 p(x)M2 1
V ar(G) ∼ √ + O(1) ∼ √ .
n ϵ n ϵ
d
Generally, for d-dimensional case, V ar(G) ∼ n−1 ϵ− 2 . Similarly one can derive
estimates on V ar(F ).
Ignoring the joint effect ofpE(F G), one can somehow get a rough estimate
based on F/G = [E(F ) + O( E(F 2 ))]/[E(G) + O( E(G2 ))] where we applied
p

the Markov inequality on both the numerator and denominator. Combining those
estimates together, we have the following,
1 d
F f p + ϵ m22 (∆(f p) + E[f p]) + O(ϵ2 , n− 2 ϵ− 4 )
= 1 d
G p + ϵ m22 (∆p + E[p]) + O(ϵ2 , n− 2 ϵ− 4 )
m2 1 d
= f +ϵ (∆p + E[p]) + O(ϵ2 , n− 2 ϵ− 4 ),
2
3It means that P rob(X > α) ≤ E(X 2 )/α2 . A Chernoff bound with exponential tail can be
found in Singer’06.
7.4. *VECTOR DIFFUSION MAP AND CONNECTION LAPLACIAN 179

here O(B1 , B2 ) denotes the dominating one of the two bounds B1 and B2 in the
asymptotic limit. As a result, the error (bias + variance) of Lϵ,α (dividing another
ϵ) is of the order
1 d
(182) O(ϵ, n− 2 ϵ− 4 −1 ).

In (Sin06) paper, the last term in the last line is improved to


1 d 1
(183) O(ϵ, n− 2 ϵ− 4 − 2 ),
F
where the improvement is by carefully analyzing the large deviation bound of G
around EG shown above, making use of the fact that F and G are correlated.
EF

Technical details are not discussed here.


In conclusion, we need to choose ϵ to balance bias error and variance error to
be both small. For example, by setting the two bounds in Eqn. (183) to be of the
same order we have
ϵ ∼ n−1/2 ϵ−1/2−d/4 ,
that is
ϵ ∼ n−1/(3+d/2) ,
so the total error is O(n−1/(3+d/2) ).

7.4. *Vector Diffusion Map and Connection Laplacian


In this class, we introduce the topic of vector Laplacian on graphs and vector
diffusion map.
The ideas for vector Laplacian on graphs and vector diffusion mapping are a
natural extension from graph Laplacian operator and diffusion mapping on graphs.
The reason why diffusion mapping is important is that previous dimension reduction
techniques, such as the PCA and MDS, ignore the intrinsic structure of the man-
ifold. By contrast, diffusion mapping derived from graph Laplacian is the optimal
embedding that preserves locality in a certain way. Moreover, diffusion mapping
gives rise to a kind of metic called diffusion distance. Manifold learning problems
involving vector bundle on graphs provide the demand for vector diffusion mapping.
And since vector diffusion mapping is an extension from diffusion mapping, their
properties and convergence behavior are similar.
The application of vector diffusion mapping is not restricted to manifold learn-
ing however. Due to its usage of optimal registration transformation, it is also a
valuable tool for problems in computer vision and computer graphics, for example,
optimal matching of 3D shapes.
The organization of this lecture notes is as follows: We first review graph Lapla-
cian and diffusion mapping on graphs as the basis for vector diffusion mapping. We
then introduce three examples of vector bundles on graphs. After that, we come to
vector diffusion mapping. Finally, we introduce some conclusions about the con-
vergence of vector diffusion mapping.

7.4.1. graph Laplacian and diffusion mapping.


180 7. DIFFUSION GEOMETRY

7.4.2. graph Laplacian. The goal of graph Laplacian is to discover the in-
trinsic manifold structure given a set of data points in space. There are three steps
of constructing the graph Laplacian operator:
• construct the graph using either the ϵ−neighborhood way (for any data
point, connect it with all the points in its ϵ−neighborhood) or the k-
nearest neighbor way (connect it with its k-nearest neighbors);
• construct the the weight matrix. Here we can use the simple-minded
binary weight (0 or 1), or use the heat kernel weight. For undirected
graph, the weight matrix is symmetric; P
• denote D as the diagonal matrix with D(i, i) = deg(i), deg(i) := j wij .
The graph Laplacian operator is:
L=D−W
The graph Laplacian has the following properties:
• ∀f : V → R, f T Lf = (i,j)∈E wij (fi − fj )2 ≥ 0
P

• G is connected ⇔ f T Lf > 0, ∀f T ⃗1, where ⃗1 = (1, · · · , 1)T


• G has k-connected components ⇔ dim(ker(L))=k
(this property is compatible with the previous one, since L⃗1 = 0)
• Kirchhofff’s Matrix Tree theorem:
Consider a connected graph G and the binary weight matrix: wij =
(
1, (i, j) ∈ E
, denote the eigenvalues of L as 0 = λ1 < λ1 ≤ λ2 ≤
0, otherwise
· · · ≤ λn , then #{T: T is a spanning tree of G}= n1 λ2 · · · λn
• Fieldler Theory, which will be introduced in later chapters.
We can have a further understanding of Graph Laplacian using the language
of exterior calculus on graph.
We give the following denotations:
V = {1, 2, · · · , |V |}. E⃗ is the oriented edge set that for (i, j) ∈ E and i < j,
⟨i, j⟩ is the positive orientation, and ⟨j, i⟩ is the negative orientation.

δ0 : RV → RE is a coboundary map, such that
(

fi − fj , ⟨i, j⟩ ∈ E
δ0 ◦ f (i, j) =
0, otherwise
It is easy to see that δ0 ◦ f (i, j) = −δ0 ◦ f (j, i)

The inner product of operators on RE is defined as:
X
⟨u, v⟩ = wij uij vij
i,j


u := u diag(wij )
n(n−1) n(n−1)
where diag(wij ) ∈ R 2 × 2is the diagonal matrix that has wij on the diag-
onal position corresponding to ⟨i, j⟩.
u∗ v = ⟨u, v⟩
Then,
L = D − W = δ0T diag(wij )δ0 = δ0∗ δ0
7.4. *VECTOR DIFFUSION MAP AND CONNECTION LAPLACIAN 181

We first look at the graph Laplacian operator. We solve the generalized eigen-
value problem:
Lf = λDf
denote the generalized eigenvalues as:

0 = λ1 ≤ λ2 ≤ · · · ≤ λn

and the corresponding generalized eigenvectors:

f1 , · · · , fn

we have already obtained the m-dimensional Laplacian eigenmap:

xi → (f1 (i), · · · , fm (i))

We now explain that this is the optimal embedding preserving locality in the
sense that connected points stay as close as possible. Specifically, for the
one-dimensional embedding, the problem is
min_y Σ_{i,j} w_{ij}(y_i − y_j)^2 = 2 min_y y^T L y,
subject to the normalization y^T D y = 1 and y^T D⃗1 = 0. Writing
y^T L y = y^T D^{1/2} (I − D^{−1/2} W D^{−1/2}) D^{1/2} y,
and noting that I − D^{−1/2} W D^{−1/2} is symmetric, the objective is minimized when
D^{1/2} y is the eigenvector of the second smallest eigenvalue (the smallest eigenvalue is
0) of I − D^{−1/2} W D^{−1/2}, which coincides with λ_2, the second smallest generalized
eigenvalue of L.
Similarly, the m-dimensional optimal embedding is given by Y = (f_1, · · · , f_m).
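A small numerical sketch of this eigenmap (not from the text; it assumes the graph_laplacian helper sketched earlier and that every vertex has positive degree, so that D is positive definite; scipy's eigh solves the symmetric-definite generalized eigenproblem with eigenvalues in ascending order):

import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(L, D, m=2):
    # Generalized eigenproblem L f = lambda D f; columns of F are f_1, f_2, ...
    lam, F = eigh(L, D)
    # skip the constant eigenvector with lambda = 0 and keep the next m columns
    return F[:, 1:m+1], lam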
In the diffusion map, the weights are used to define a discrete random walk. The
transition probability in a single step from i to j is
a_{ij} = w_{ij} / deg(i).
Then the transition matrix is A = D^{−1} W. Since
A = D^{−1/2} (D^{−1/2} W D^{−1/2}) D^{1/2},
A is similar to a symmetric matrix, and hence has n real eigenvalues µ_1, · · · , µ_n
with corresponding eigenvectors ϕ_1, · · · , ϕ_n,
A ϕ_i = µ_i ϕ_i.
A^t is the transition matrix after t steps. Thus, we have
A^t ϕ_i = µ_i^t ϕ_i.
Define Λ as the diagonal matrix with Λ(i, i) = µ_i and Φ = [ϕ_1, · · · , ϕ_n]. The diffusion
map is given by
Φ_t := Φ Λ^t = [µ_1^t ϕ_1, · · · , µ_n^t ϕ_n].

7.4.3. The embedding given by the diffusion map. Let Φ_t(i) denote the i-th row
of Φ_t. Then
⟨Φ_t(i), Φ_t(j)⟩ = Σ_{k=1}^n A^t(i, k) A^t(j, k) / deg(k),
and we can thus define a distance called the diffusion distance,
d^2_{DM,t}(i, j) := ⟨Φ_t(i), Φ_t(i)⟩ + ⟨Φ_t(j), Φ_t(j)⟩ − 2⟨Φ_t(i), Φ_t(j)⟩
 = Σ_{k=1}^n (A^t(i, k) − A^t(j, k))^2 / deg(k).
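The following sketch (not from the text; it assumes the symmetric weight matrix W from the earlier sketches) computes the diffusion map through the symmetric conjugate of A and evaluates the diffusion distance directly from A^t.

import numpy as np

def diffusion_map(W, t=1):
    # Diffusion map Phi_t = Phi Lambda^t built from the weight matrix W.
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # A = D^{-1} W is similar to the symmetric S = D^{-1/2} W D^{-1/2}
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    mu, Psi = np.linalg.eigh(S)            # S = Psi diag(mu) Psi^T
    Phi = d_inv_sqrt[:, None] * Psi        # eigenvectors of A: A Phi = Phi diag(mu)
    return Phi * (mu ** t)[None, :]        # row i is Phi_t(i)

def diffusion_distance_sq(W, t, i, j):
    # d_{DM,t}(i,j)^2 = sum_k (A^t(i,k) - A^t(j,k))^2 / deg(k)
    deg = W.sum(axis=1)
    At = np.linalg.matrix_power(W / deg[:, None], t)
    return np.sum((At[i] - At[j])**2 / deg)

# With this normalization the two formulas in the text agree (up to round-off):
# diffusion_distance_sq(W, t, i, j) equals ||Phi_t(i) - Phi_t(j)||^2.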

7.4.4. Examples of vector bundles on graphs.
(1) Wind velocity field on the globe:
To simplify the problem, we consider a two-dimensional mesh on
the globe (latitude and longitude). Each node of the mesh carries a
vector f⃗, the wind velocity at that location.
(2) Local linear regression:
The goal of local linear regression is to approximate the regression
function at an arbitrary point of the variable space.
Given data (y_i, x⃗_i)_{i=1}^n and an arbitrary point x⃗, with x⃗, x⃗_1, · · · , x⃗_n ∈ R^p,
we look for β⃗ := (β_0, β_1, · · · , β_p)^T that minimizes
Σ_{i=1}^n (y_i − β_0 − β_1 x_{i1} − · · · − β_p x_{ip})^2 K_n(x⃗_i, x⃗).
Here K_n(x⃗_i, x⃗) is a kernel function that defines the weight of x⃗_i at the
point x⃗; for example, one can use the Nadaraya-Watson kernel
K_n(x⃗_i, x⃗) = e^{−‖x⃗ − x⃗_i‖^2 / n}.
A weighted least-squares sketch of this fit is given after this list.
For a graph G = (V, E), each point x⃗ ∈ V then has a corresponding vector
β⃗(x⃗), so we get a vector bundle on the graph G = (V, E).
Here β⃗ plays the role of a gradient: if y and x⃗ are related by
y = f(x⃗), then β⃗ ≈ (f(x⃗), ∇f(x⃗))^T.
(3) Social networks:
If we view users as vertices and the relationship bonds that connect
users as edges, then a social network naturally gives rise to a graph
G = (V, E). Each user has an attribute profile containing all kinds of per-
sonal information, and a certain kind of information can be described by
a vector f⃗ recording its different aspects. Again, we get a vector bundle on the
graph.
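A minimal weighted least-squares sketch for the local linear regression fit in item (2) above (not from the text; the Gaussian weight, its bandwidth, and the helper name local_linear_beta are our own placeholder choices):

import numpy as np

def local_linear_beta(X, y, x0, bandwidth=1.0):
    # beta = (beta_0, ..., beta_p) minimizing sum_i K(x_i, x0) (y_i - beta_0 - beta^T x_i)^2
    w = np.exp(-np.sum((X - x0)**2, axis=1) / bandwidth)   # kernel weights
    Z = np.hstack([np.ones((len(X), 1)), X])               # design matrix [1, x_i]
    beta = np.linalg.solve(Z.T @ (Z * w[:, None]), Z.T @ (y * w))
    return beta   # attaching beta(x0) to each point gives the vector bundle of item (2)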

7.4.5. Optimal registration transformation. As in the graph eigenmap, we
expect the embedding f⃗ to preserve locality to a certain extent, which means
that we expect the embeddings of connected points to be sufficiently close. In the
graph Laplacian case, we use Σ_{i∼j} w_{ij} ‖f⃗_i − f⃗_j‖^2. However, for a vector bundle on a
graph, vectors at different points may not be subtracted directly due to
the curvature of the manifold. What makes sense is the difference of vectors
compared within the tangent spaces at the corresponding points. Therefore, we borrow the
idea of parallel transport from differential geometry. Denote by O_{ij} the parallel-
transport operator from the tangent space at x_j to the tangent space at x_i. We
want to find the embedding that minimizes
Σ_{i∼j} w_{ij} ‖f⃗_i − O_{ij} f⃗_j‖^2.

We will later define the vector diffusion mapping; by an argument similar to the
one for the diffusion mapping, it is easy to see that the vector diffusion mapping gives the
optimal embedding that preserves locality in this sense.
We now discuss how to approximate the parallel-transport operator from the data set.
The approximation of the tangent space at a point x_i is given by local PCA.
Choose ϵ_i sufficiently small, and denote by x_{i_1}, · · · , x_{i_{N_i}} the data points in
the ϵ_i-neighborhood of x_i. Define
X_i := [x_{i_1} − x_i, · · · , x_{i_{N_i}} − x_i],
and let D_i be the diagonal matrix with
D_i(j, j) = √(K(‖x_{i_j} − x_i‖ / ϵ_i)), j = 1, · · · , N_i,
for a kernel K. Set
B_i := X_i D_i
and perform the SVD of B_i:
B_i = U_i Σ_i V_i^T.
We use the first d columns of U_i (the left singular vectors corresponding to the d largest
singular values of B_i) to form an approximation of the tangent space at x_i. That is,
O_i = [u_{i_1}, · · · , u_{i_d}].
Then O_i is a numerical approximation to an orthonormal basis of the tangent space
at x_i.
For connected points x_i and x_j, since they are sufficiently close to each other,
their tangent spaces should be close. Therefore O_i O_{ij} and O_j should also be close.
We therefore use the orthogonal matrix closest to O_i^T O_j as the approximation of the
parallel-transport operator from x_j to x_i:
ρ_{ij} := argmin_{O orthogonal} ‖O − O_i^T O_j‖_{HS},
where ‖A‖²_{HS} = Tr(AA^T) is the Hilbert-Schmidt norm.
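A sketch of the local PCA and the closest-orthogonal-matrix step above (not from the text; the kernel profile is an arbitrary placeholder, and the closest orthogonal matrix in the Hilbert-Schmidt norm is obtained from the SVD of O_i^T O_j):

import numpy as np

def local_pca_basis(X, i, eps, d, kernel=lambda u: np.exp(-u**2)):
    # Approximate an orthonormal basis O_i of the tangent space at x_i.
    diff = X - X[i]
    dist = np.linalg.norm(diff, axis=1)
    nbr = np.where((dist > 0) & (dist <= eps))[0]    # eps-neighborhood of x_i
    Xi = diff[nbr].T                                 # columns x_{i_j} - x_i
    Di = np.diag(np.sqrt(kernel(dist[nbr] / eps)))   # weights D_i(j, j)
    U, _, _ = np.linalg.svd(Xi @ Di, full_matrices=False)
    return U[:, :d]                                  # top d left singular vectors

def registration(Oi, Oj):
    # rho_ij: orthogonal matrix closest to O_i^T O_j in the Hilbert-Schmidt norm
    U, _, Vt = np.linalg.svd(Oi.T @ Oj)
    return U @ Vt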

7.4.6. Vector Laplacian. Given the weight matrix W = (w_{ij}), denote by
D ∈ R^{nd×nd} the block diagonal matrix
D := diag(deg(1) I_d, · · · , deg(n) I_d),
where deg(i) = Σ_j w_{ij} as in the graph Laplacian. Define S as the nd × nd block
matrix whose (i, j)-th d × d block is
S_{ij} = w_{ij} ρ_{ij} if i ∼ j, and 0 otherwise.
The vector Laplacian is then defined as L = D − S.

As for the graph Laplacian, we introduce an orientation on E and a coboundary map
δ_0 : (R^d)^V → (R^d)^E,
δ_0 ∘ f(i, j) = f⃗_i − ρ_{ij} f⃗_j if ⟨i, j⟩ ∈ E, and 0 otherwise, where f = (f⃗_1, · · · , f⃗_n)^T.
The inner product on (R^d)^E is defined as
⟨u, v⟩ = Σ_{i,j} w_{ij} u_{ij}^T v_{ij},
and with u^∗ := u^T diag(w_{ij}) we again have u^∗ v = ⟨u, v⟩.
If all the ρ_{ij} with ⟨i, j⟩ ∈ E are orthogonal, then
L = D − S = δ_0^T diag(w_{ij}) δ_0 = δ_0^∗ δ_0.
Analogous properties to the graph Laplacian hold:
• G has k connected components ⇔ dim ker(L) = kd;
• a generalized Matrix-Tree theorem.
7.4.7. Vector diffusion mapping. Write
L = D − S = D(I − D^{−1} S),
and note that
D^{−1} S = D^{−1/2} (D^{−1/2} S D^{−1/2}) D^{1/2},
so D^{−1} S is similar to the symmetric matrix
S̃ := D^{−1/2} S D^{−1/2}.
S̃ has nd real eigenvalues λ_1, · · · , λ_{nd} with corresponding eigenvectors v_1, · · · , v_{nd}.
Thinking of these vectors of length nd in blocks of d, we denote by v_k(i) the i-th
block of v_k.
The spectral decompositions of the (i, j)-th d × d blocks of S̃ and S̃^{2t} are given by
S̃(i, j) = Σ_{k=1}^{nd} λ_k v_k(i) v_k(j)^T,
S̃^{2t}(i, j) = Σ_{k=1}^{nd} λ_k^{2t} v_k(i) v_k(j)^T.
We use ‖S̃^{2t}(i, j)‖²_{HS} to measure the affinity between i and j. Thus,
‖S̃^{2t}(i, j)‖²_{HS} = Tr(S̃^{2t}(i, j) S̃^{2t}(i, j)^T)
 = Σ_{k,l=1}^{nd} (λ_k λ_l)^{2t} Tr(v_k(i) v_k(j)^T v_l(j) v_l(i)^T)
 = Σ_{k,l=1}^{nd} (λ_k λ_l)^{2t} Tr(v_k(j)^T v_l(j) v_l(i)^T v_k(i))
 = Σ_{k,l=1}^{nd} (λ_k λ_l)^{2t} ⟨v_k(j), v_l(j)⟩⟨v_k(i), v_l(i)⟩.
The vector diffusion mapping is defined as
V_t : i → ((λ_k λ_l)^t ⟨v_k(i), v_l(i)⟩)_{k,l=1}^{nd}.
As for the graph Laplacian, ‖S̃^{2t}(i, j)‖²_{HS} is actually an inner product:
‖S̃^{2t}(i, j)‖²_{HS} = ⟨V_t(i), V_t(j)⟩.
This gives rise to a distance called the vector diffusion distance:
d²_{VDM,t}(i, j) = ⟨V_t(i), V_t(i)⟩ + ⟨V_t(j), V_t(j)⟩ − 2⟨V_t(i), V_t(j)⟩.
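A sketch of this construction (not from the text; it assumes the transports rho_ij have already been computed, e.g. with the registration helper above, and that rho contains both (i, j) and (j, i) with rho_ji = rho_ij^T so that S is symmetric):

import numpy as np

def vector_diffusion_map(W, rho, d, t=1):
    # W   : (n, n) symmetric weight matrix
    # rho : dict mapping (i, j) with w_ij > 0 to the d x d orthogonal matrix rho_ij
    n = W.shape[0]
    S = np.zeros((n * d, n * d))
    for (i, j), R in rho.items():
        S[i*d:(i+1)*d, j*d:(j+1)*d] = W[i, j] * R
    deg = W.sum(axis=1)
    Dinv_sqrt = np.repeat(1.0 / np.sqrt(deg), d)
    S_tilde = Dinv_sqrt[:, None] * S * Dinv_sqrt[None, :]
    lam, V = np.linalg.eigh(S_tilde)               # nd eigenpairs of S~
    blocks = V.reshape(n, d, n * d)                # blocks[i, :, k] = v_k(i)
    G = np.einsum('idk,idl->ikl', blocks, blocks)  # G[i, k, l] = <v_k(i), v_l(i)>
    scale = (lam[:, None] * lam[None, :]) ** t     # (lambda_k lambda_l)^t
    return (G * scale[None, :, :]).reshape(n, -1)  # row i is V_t(i)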

7.4.8. Normalized Vector Diffusion Mappings. An important kind of
normalized VDM is obtained as follows. Take 0 ≤ α ≤ 1 and set
W_α := D^{−α} W D^{−α},   S_α := D^{−α} S D^{−α},
deg_α(i) := Σ_{j=1}^n W_α(i, j).
We define D_α ∈ R^{n×n} as the diagonal matrix with
D_α(i, i) = deg_α(i),
and (with a slight abuse of notation) D_α ∈ R^{nd×nd} as the block diagonal matrix with
D_α(i, i) = deg_α(i) I_d.
We can then obtain the vector diffusion mapping V_{α,t} by using S_α and D_α instead of S
and D.
7.4.9. Convergence of VDM. We first introduce some concepts.
Suppose M is a smooth manifold and TM is a tensor bundle over M. When
the rank of TM is 0, its sections are the functions on M; when the rank of TM is 1,
its sections are the vector fields on M.
The connection Laplacian operator is built from the second covariant derivative
∇²_{X,Y} T = −(∇_X ∇_Y T − ∇_{∇_X Y} T),
where ∇_X Y is the covariant derivative of Y along X.
Intuitively, the first term measures the change of T along X and then along Y, while
the second term removes the overlapping part of these changes; the remainder can be seen
as an operator that differentiates the tensor field along two orthogonal directions.
Now we state some results about convergence.
The normalized graph Laplacian converges to the Laplace-Beltrami operator:
(D^{−1} W − I) f → c ∆ f
for sufficiently smooth f and some constant c.
For VDM, D_α^{−1} S_α − I converges to the connection Laplacian operator (SW12)
plus some potential terms. When α = 1, D_1^{−1} S_1 − I converges to exactly the
connection Laplacian operator:
(D_1^{−1} S_1 − I) X → c ∇² X.

7.5. *Synchronization on Graphs


CHAPTER 8

Robust PCA via Generative Adversarial Networks

This chapter is about robust PCA (RPCA) via Generative Adversarial Networks (GANs).


Robust statistics under Huber’s contamination model arose in the 1960s; since then
the search for procedures that are both statistically optimal and computationally
feasible has become a fundamental problem in areas including statistics and computer
science. Many depth-based estimators, although statistically optimal, suffer from
computational difficulties. On the other hand, recent developments in computer
science in search of computationally tractable algorithms for robust learning
rely on prior knowledge of moments or of the contamination proportion, and cannot be
adapted to general distributions such as elliptical distributions whose moments do
not necessarily exist.
To overcome these challenges, (GLYZ19; GYZ20) build a new bridge to robust
statistics based on Generative Adversarial Networks (GANs), a technique developed
in the machine learning community. It is possible to construct statistically optimal
mean and covariance (scatter) estimates from contaminated samples under general
elliptical distributions. Equipped with the rapid progress in computational facilities
for deep learning, this provides a new scalable tool for robust statistics. In
the sequel, we introduce the following results with applications.
(A) A unified framework of Generative Adversarial Networks with Proper Scoring
Rules is proposed for the robust multi-task regression problem, which brings many
popular GANs into the toolbox of robust learning.
(B) Statistical optimality of such robust estimates is established under general TV-
perturbation sets, which include Huber’s contamination model as a special case, and
under general elliptical distributions, which include Gaussian distributions as well as
Cauchy distributions whose moments do not exist.
(C) This new methodology may bring new techniques to a variety of applications,
including but not limited to robust factor analysis in economics and finance and
robust Cryo-EM imaging in biomolecular engineering, which will be studied here.

8.1. Huber’s Contamination Model and Tukey’s Median


Consider the following robust learning problem. In the setting of Huber’s ϵ-
contamination model (Hub64; Hub65), one has i.i.d. observations
(184) X_1, ..., X_n ∼ (1 − ϵ)P_θ + ϵQ,
and the goal is to estimate the model parameter θ. Under the data generating
process (184), each observation is drawn from P_θ with probability 1 − ϵ and
from the contamination distribution Q with probability ϵ. Such scenarios appear in
many applications including crowdsourced ranking (XXHY13; XXC+ 19), computer
vision (FHX+ 16; XYJ+ 19), Cryo-EM imaging (WGL+ 16; GUHY20), economics

and finance (FK18; FKL18), among others (Hub81). The presence of an unknown
contamination distribution poses both statistical and computational challenges to
the problem. The search for both statistically optimal and computationally feasible
procedures has become a fundamental problem in areas including statistics and
computer science.
Robust estimation of normal mean and covariance gives two paradigms in this
challenge.

8.1.1. Robust Mean. Consider a normal mean estimation problem with Pθ =


N (θ, Ip ). Due to the contamination of data, the sample average, which is optimal
when ϵ = 0, can be arbitrarily far away from the true mean if Q charges a positive
probability at infinity. Moreover, even robust estimators such as coordinatewise
median and geometric median are proved to be suboptimal under the setting of
(184) (CGR18; DKK+ 16; LRV16). For the normal mean estimation problem, it
has been shown in (CGR18) that the minimax rate with respect to the squared ℓ_2
loss is p/n ∨ ϵ², and is achieved by Tukey’s median (Tuk75). Despite the statistical
optimality of Tukey’s median, its computation is not tractable. In fact, even an
approximate algorithm takes O(e^{Cp}) time (ABET00; Cha04; RS98).
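The following toy simulation (not from the text; the contamination distribution Q and all numerical choices are arbitrary) illustrates the breakdown of the sample average and the comparatively stable, though suboptimal, coordinatewise median under the model (184).

import numpy as np

rng = np.random.default_rng(0)
n, p, eps = 1000, 20, 0.1
theta = np.zeros(p)

# Huber contamination: with prob. 1 - eps draw from N(theta, I_p), else from Q.
is_outlier = rng.random(n) < eps
X = rng.standard_normal((n, p)) + theta
X[is_outlier] = rng.standard_normal((is_outlier.sum(), p)) + 50.0   # Q far away

print(np.linalg.norm(X.mean(axis=0) - theta))       # dragged far off by Q
print(np.linalg.norm(np.median(X, axis=0) - theta)) # coordinatewise median: stable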
Recent developments in theoretical computer science are focused on the search
of computationally tractable algorithms for estimating θ under Huber’s ϵ-contamination
model (184). The success of the efforts started from two fundamental papers
(DKK+ 16; LRV16), where two different but related computational strategies “it-
erative filtering” and “dimension halving” were proposed to robustly estimate the
normal mean. These algorithms can provably achieve the minimax rate p/n ∨ ϵ² up
to a poly-logarithmic factor in polynomial time. The main idea behind the two
methods is a critical fact that a good robust moment estimator can be certified effi-
ciently by higher moments. This idea was later further extended (DKK+ 17; DBS17;
DKS16; DKK+ 18; DKS18b; DKS18a; KSS18) to develop robust and computable
procedures for various other problems.
However, many of the computationally feasible procedures for robust mean es-
timation in the literature rely on the knowledge of covariance matrix and sometimes
the knowledge of contamination proportion. Even though these assumptions can be
relaxed, nontrivial modifications of the algorithms are required for such extensions
and statistical error rates may also be affected. Compared with these computa-
tionally feasible procedures proposed in the recent literature for robust estimation,
Tukey’s median (Tuk75) and other depth-based estimators (RH99; Miz02; Zha02;
MM04; PVB17) have some indispensable advantages in terms of their statistical
properties. First, the depth-based estimators have clear objective functions that
can be interpreted from the perspective of projection pursuit (Miz02). Second, the
depth-based procedures are adaptive to unknown nuisance parameters in the mod-
els such as covariance structures, contamination proportion, and error distributions
(CGR18; Gao17). Last but not least, Tukey’s depth and other depth functions are
mostly designed for robust quantile estimation, while the recent advancements in
the theoretical computer science literature are all focused on robust moments esti-
mation. Although this is not an issue when it comes to normal mean estimation,
the difference is fundamental for robust estimation under general settings such as
elliptical distributions where moments do not necessarily exist.

8.1.2. Robust Covariance. For the normal covariance estimation problem


with PΣ = N (0, Σ), even though many robust covariance matrix estimators have
been proposed and analyzed in the literature (? Tyl87b? ? ? ? ), the problem of
optimal covariance estimation under the Huber contamination model has not been
investigated until the recent work by (CGR18). It was shown in (CGR18) that
the minimax rate with respect to the squared operator norm ∥Σ̂ − Σ∥²_op is p/n ∨ ϵ².
An important feature of the minimax rate is its dimension-free dependence on the
contamination proportion ϵ through the second term ϵ2 . An estimator that can
achieve the minimax rate is given by the maximizer of the covariance matrix depth
function (Zha02; CGR18? ). Despite its statistical optimality, the robust covariance
matrix estimator that maximizes the depth function can not be efficiently computed
unless the dimension of the data is extremely low. This is the same weakness that
is also shared by Tukey’s halfspace depth (Tuk75) and Rousseeuw and Hubert’s
regression depth (RH99). In fact, even an approximate algorithm that computes
these depth functions takes O(eCp ) in time (RS98; ABET00; Cha04; CGR18).
We first introduce the multi-task robust regression problem, followed by a
framework based on generative adversarial networks with proper scoring rules. Con-
ditions under which statistical optimality can be achieved are discussed with general
elliptical distributions and TV-perturbations. Applications with preliminary results
are discussed in the end.

8.1.3. Multi-task Regression. Consider (X, Y) ∈ R^p × R^m ∼ P_B where
(185) Y = B^T X + σε
with B ∈ R^{p×m} and ε ∼ N(0, I_m). Now given an i.i.d. sample {(X_i, Y_i)}_{i=1}^n under
Huber’s contamination model (184) with P_θ = P_B, our purpose is to learn B.
8.1.3.1. Multi-task regression depth. (Miz02) proposed the following multi-task
regression depth, in population version,
(186) D_U(B, P) = inf_{U∈U} P( ⟨U^T X, Y − B^T X⟩ ≥ 0 ),
whose empirical version is
(187) D_U(B, {(X_i, Y_i)}_{i=1}^n) = inf_{U∈U} (1/n) Σ_{i=1}^n I{ ⟨U^T X_i, Y_i − B^T X_i⟩ ≥ 0 }.
(Gao17) shows that
(188) B̂ = argmax_B D_U(B, {(X_i, Y_i)}_{i=1}^n)
gives an optimal robust estimate of B. In particular, it includes Tukey’s median
and the regression depth as special cases.
(a) (Multivariate location depth) When p = 1 and X = 1 ∈ R,
D_U(b, P) = inf_{u∈U} P( u^T(Y − b) ≥ 0 ),
where B̂ gives Tukey’s median (Tuk75).
(b) (Regression depth) When m = 1,
D_U(β, P) = inf_{u∈U} P( u^T X (y − β^T X) ≥ 0 ),
which leads to the robust regression (RS98).



Since Tukey’s median is included as a special case, the computational complexity
of finding B̂ is NP-hard (ABET00). However, inspired by our recent work
(GLYZ19), we propose to exploit the computational facility of generative adversarial
networks toward finding robust estimators of B that are statistically optimal.

8.2. Generative Adversarial Networks (GAN)


To overcome these difficulties, recently (GLYZ19) built up a connection between
depth functions and Generative Adversarial Nets (GANs). The GAN (GPAM+ 14)
is a very popular technique in deep learning to learn complex distributions such as
the generating process of images. In the formulation of GAN, there is a generator
and a discriminator. The generator, modeled by a neural network, is trying to learn
a distribution as close to the data as possible, while the discriminator, modeled
by another neural network, is trying to distinguish samples from the generator
and data. This two-player game will reach its equilibrium when the discriminator
cannot tell the difference between samples from the generator and the data, and
that means the generator has successfully learned the underlying distribution of the
data. Since GAN can be written as a minimax optimization problem, this suggests
a mathematical resemblance to the robust estimators that are maximizers of depth
functions, which are maximin optimization problems.
In (GLYZ19), we established a framework of f -Learning, showing that both
procedures are minimizers of variational lower bounds of f -divergence functions.
While GAN minimizes the Jensen-Shannon divergence, the robust estimators in-
duced by depth functions all minimize the total variation distance. The connection
between GAN, or more generally f-GAN (NCT16), and robust estimation opens a
door to approximating these hard-to-compute depth functions by neural networks,
so that the standard techniques routinely used to train GANs can then be ap-
plied to compute various robust estimators. In particular, this framework can be
extended to include robust scatter matrix estimation under general elliptical distri-
butions, whose moments such as means do not exist. Appropriate choices of neural
network structures have been discussed in (GLYZ19), but only for optimal robust
location estimation.
In (GYZ20), it is shown that the network structures for optimal robust location
estimation proposed in our earlier studies (GLYZ19) may not have sufficient dis-
criminative power for optimal covariance matrix estimation. Instead, we propose
necessary modifications of the network structures so that optimal covariance
matrix estimation under Huber’s contamination model can be achieved using GANs,
initiating a framework of robust learning based on the concept of proper scoring
rules (BSS05; GR07; Daw07).

8.2.1. Generative Adversarial Networks for Robust Regression.
8.2.1.1. Embedding. For the model Y | X ∼ N(B^T X, σ² I_m), consider the follow-
ing embedding of the data into R^{p×m}:
X Y^T | X ∼ N(X X^T B, σ² X X^T).


Therefore, one can use the technique proposed by us (GYZ20) to learn θ = ΣX B


and Σ = σ 2 ΣX where ΣX = E[XX T ].

8.2.1.2. Generative Adversarial Networks with Proper Scoring Rules. In this


section, we consider a more general setting where the data generating process is
(189) Z_1, ..., Z_n ∼_{iid} P for some P satisfying TV(P, N(θ, Σ)) ≤ ϵ,
where the total variation distance between two probability distributions P_1 and P_2
is defined as TV(P_1, P_2) = sup_B |P_1(B) − P_2(B)|. Here Z_i = X_i Y_i^T. Both the mean
vector θ and the covariance matrix Σ are unknown.
We consider the estimation procedure
(190) (θ̂, Σ̂) = argmin_{η∈R^p, Γ∈E_p(M)} max_{T∈T} [ (1/n) Σ_{i=1}^n S(T(Z_i), 1) + E_{Z∼N(η,Γ)} S(T(Z), 0) ].

Note that the generator class is {N (η, Γ) : η ∈ Rp , Γ ∈ Ep (M )} where Ep (M ) is the


set of p × p covariance matrices whose operator norms are no more than M .
Here S is a scoring rule, defined as a pair of functions S(·, 1) and S(·, 0). To
be specific, S(t, 1) is the forecaster’s reward if he or she quotes t when the event 1
occurs, and S(t, 0) is the reward when the event 0 occurs. If the event occurs with
probability p, then the expected reward for the forecaster is S(t; p) =
pS(t, 1) + (1 − p)S(t, 0). Fisher’s consistency requires that the maximal expected
reward be achieved at the correct quote or prediction t = p, i.e. S(p; p) =
max_t S(t; p), in which case we say S is proper. The celebrated Savage representation (?
) asserts that Fisher’s consistency is equivalent to the existence of a convex
function G(·) such that
(191) S(t, 1) = G(t) + (1 − t)G′(t),   S(t, 0) = G(t) − tG′(t).
Here, G′(t) is a subgradient of G at the point t. Moreover, the statement also holds
for strictly proper scoring rules when convexity is replaced by strict convexity. Typical
examples of scoring rules that lead to many popular GANs are listed as follows.
(1) Log Score. The log score is perhaps the most commonly used rule because
of its various intriguing properties (JCVW15). The scoring rule with
S(t, 1) = log t and S(t, 0) = log(1 − t) is regular and strictly proper. Its
Savage representation is given by the convex function G(t) = t log t +
(1 − t) log(1 − t), the negative Shannon entropy of Bernoulli(t). It leads
to the original GAN proposed by (GPAM+ 14), which aims to minimize a
variational lower bound of the Jensen-Shannon divergence
JS(P, Q) = (1/2) ∫ log( dP / (dP + dQ) ) dP + (1/2) ∫ log( dQ / (dP + dQ) ) dQ + log 2.
(2) Zero-One Score. The zero-one score S(t, 1) = 2I{t ≥ 1/2} and S(t, 0) =
2I{t < 1/2} is also known as the misclassification loss. This is a regular
proper scoring rule but not a strictly proper one. It leads to the TV-GAN that
was extensively studied by (GLYZ19) in the context of robust estimation,
toward minimizing a variational lower bound of the total variation distance
TV(P, Q) = P( dP/dQ ≥ 1 ) − Q( dP/dQ ≥ 1 ) = (1/2) ∫ |dP − dQ|.
(3) Quadratic Score. Also known as the Brier score (Bri50), it is defined
by S(t, 1) = −(1 − t)² and S(t, 0) = −t². The corresponding convex
function in the Savage representation is G(t) = −t(1 − t). It leads
to the family of least-squares GANs proposed by (MLX+ 17), minimizing
a variational lower bound of the divergence
∆(P, Q) = (1/8) ∫ (dP − dQ)² / (dP + dQ),
known as the triangular discrimination.
(4) Boosting Score. The boosting score was introduced by (BSS05) with
S(t, 1) = −((1 − t)/t)^{1/2} and S(t, 0) = −(t/(1 − t))^{1/2}, and has a connection
to the AdaBoost algorithm. The corresponding convex function in the
Savage representation is G(t) = −2√(t(1 − t)). It leads to a GAN
minimizing a variational lower bound of the squared Hellinger distance
H²(P, Q) = (1/2) ∫ (√dP − √dQ)².
(5) Beta Score. A general Beta family of proper scoring rules was introduced
by (BSS05) with S(t, 1) = −∫_t^1 c^{α−1}(1 − c)^β dc and S(t, 0) = −∫_0^t c^α (1 −
c)^{β−1} dc for any α, β > −1. The log score, the quadratic score and the
boosting score are special cases of the Beta score with α = β = 0, α = β =
1, and α = β = −1/2, respectively. The zero-one score is a limiting case of the Beta score
obtained by letting α = β → ∞. Moreover, asymmetric scoring
rules arise when α ̸= β. These lead to the (α, β)-GANs in the sequel.
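The Savage representation above can be checked numerically. The sketch below (not from the text) encodes the log, quadratic and boosting scores through their convex functions G and verifies that the expected reward p S(t, 1) + (1 − p) S(t, 0) is maximized at t = p, i.e. that these rules are proper.

import numpy as np

# Savage representation: S(t,1) = G(t) + (1 - t) G'(t), S(t,0) = G(t) - t G'(t)
scores = {
    "log":       (lambda t: t*np.log(t) + (1 - t)*np.log(1 - t),
                  lambda t: np.log(t) - np.log(1 - t)),
    "quadratic": (lambda t: -t*(1 - t),
                  lambda t: 2*t - 1),
    "boosting":  (lambda t: -2*np.sqrt(t*(1 - t)),
                  lambda t: (2*t - 1)/np.sqrt(t*(1 - t))),
}

def S(name, t, y):
    G, dG = scores[name]
    return G(t) + (1 - t)*dG(t) if y == 1 else G(t) - t*dG(t)

p = 0.3
t = np.linspace(0.01, 0.99, 99)
for name in scores:
    expected = p*S(name, t, 1) + (1 - p)*S(name, t, 0)
    print(name, t[np.argmax(expected)])   # maximizer is (approximately) t = p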
Now we introduce a general discriminator class of deep neural nets. We first
define a sigmoid first layer
G^1(B) = G_sigmoid = { g(x) = sigmoid(u^T x + b) : u ∈ R^p, b ∈ R }.
Then we inductively define
G^{l+1}(B) = { g(x) = ReLU( Σ_{h≥1} v_h g_h(x) ) : Σ_{h≥1} |v_h| ≤ B, g_h ∈ G^l(B) }.
Note that neighboring layers are connected via ReLU activation functions.
Finally, the network structure is defined by
(192) T^L(κ, B) = { T(x) = sigmoid( Σ_{j≥1} w_j g_j(x) ) : Σ_{j≥1} |w_j| ≤ κ, g_j ∈ G^L(B) }.
This is a neural network class that consists of L hidden layers.
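A schematic forward pass of a discriminator T in T^L(κ, B) (not the training code of (GLYZ19; GYZ20); the weights are passed in explicitly and the ℓ1 constraints of (192) are only noted in comments):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator(x, first_layer, hidden_weights, w):
    # first_layer    : list of (u, b) pairs giving the units sigmoid(u^T x + b)
    # hidden_weights : list of matrices; each row builds one ReLU(sum_h v_h g_h(x)) unit
    # w              : output weights, T(x) = sigmoid(sum_j w_j g_j(x))
    g = np.array([sigmoid(u @ x + b) for (u, b) in first_layer])
    for V in hidden_weights:
        g = np.maximum(V @ g, 0.0)
    return sigmoid(w @ g)

# During training one would additionally enforce sum_h |v_h| <= B for every row
# of every hidden matrix and sum_j |w_j| <= kappa, as required in (192).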


8.2.1.3. Statistical Optimality of Estimates.
Condition 1 (Smooth Scoring Rules). We assume G^{(2)}(1/2) > 0 and that G^{(3)}(t)
is continuous at t = 1/2. Moreover, there is a universal constant c_0 > 0 such that
2G^{(2)}(1/2) ≥ G^{(3)}(1/2) + c_0.
Condition 1 implies that the scoring rule {S(·, 1), S(·, 0)} is induced by smooth
functions, which excludes the zero-one loss. This is fine, because the zero-one loss
was already studied as the matrix depth function in (CGR18). We only focus on
scoring rules that are feasible to optimize, and thus it is sufficient to restrict our
results to smooth ones. The condition 2G^{(2)}(1/2) ≥ G^{(3)}(1/2) + c_0 is automatically
satisfied by a symmetric scoring rule, because S(t, 1) = S(1 − t, 0) immediately
implies that G^{(3)}(1/2) = 0. For the Beta score with S(t, 1) = −∫_t^1 c^{α−1}(1 − c)^β dc
and S(t, 0) = −∫_0^t c^α (1 − c)^{β−1} dc for any α, β > −1, it is easy to check that such a
c_0 (depending only on α, β) exists as long as |α − β| < 1. The following proposition
shows the statistical optimality of our proposal.

Proposition 8.2.1. Consider the estimator (190) induced by a regular
proper scoring rule satisfying Condition 1, with the discriminator class T = T^L(κ, B)
specified by (192). Assume p/n + ϵ² ≤ c for some sufficiently small constant c > 0.
Set 1 ≤ L = O(1), 1 ≤ B = O(1), and κ = O(√(p/n) + ϵ). Then, under the data
generating process (189), we have
∥θ̂ − θ∥²_{ℓ_2} ≤ C ( pm/n ∨ ϵ² ),
∥Σ̂ − Σ∥²_op ≤ C ( pm/n ∨ ϵ² ),
with probability at least 1 − e^{−C′(p + nϵ²)} uniformly over all θ ∈ R^p and all ∥Σ∥_op ≤
M = O(1). The constants C, C′ > 0 are universal.

Such an error rate is the same as that of the multi-task regression depth (Gao17)
and is statistically optimal. After obtaining θ̂ and Σ̂, if Σ_X is known or easy to
estimate, we can obtain the estimator B̂ = Σ_X^{−1} θ̂ and σ̂² from Σ_X^{−1} Σ̂; otherwise, if X_i
is also contaminated, we can exploit the same technique as in (GYZ20) to obtain an optimal
estimate Σ̂_X.
8.2.1.4. Generalization to Elliptical Distributions. Robust estimators induced
by GANs can adapt to general elliptical distributions, as the depth-based estimator
(188) does (CGR18).

Definition 8.2.1 (Elliptical Distribution). A random vector X ∈ Rp follows


an elliptical distribution if and only if it has the representation X = θ +ξAU , where
θ ∈ Rp and A ∈ Rp×r are model parameters. The random variable U is distributed
uniformly on the unit sphere {u ∈ Rr : ∥u∥ = 1} and ξ ≥ 0 is a random variable in
R independent of U . The vector θ and the matrix Σ = AAT are called the location
and the scatter of the elliptical distribution.

Definition 8.2.2 (Canonical Representation). The elliptical distribution X =


θ + ξAU has a canonical parametrization (θ, Σ, H) with Σ = AAT and H ∈ H.
We use the notation E(θ, Σ, H) to denote the elliptical distribution in its canonical
form.

With the canonical representation, the parameters θ, Σ, H are all identifiable.
The scatter matrix Σ is proportional to the covariance matrix whenever the covari-
ance matrix exists. Moreover, for a multivariate Gaussian N(θ, Σ), the canonical
parametrization is (θ, Σ, Φ), and the scatter matrix and the covariance matrix are
identical.
The goal of this section is to estimate both the location θ and the scatter Σ
from observations
(193) X_1, ..., X_n ∼_{iid} P for some P satisfying TV(P, E(θ, Σ, H)) ≤ ϵ.

To achieve this goal, we further require that H belongs to the following class
H(M′) = { H ∈ H : ∫_{1/4}^{1/3} dH(t) ≥ 1/M′ },
where the number M ′ > 0 is assumed to be some large constant. The regularity
condition H ∈ H(M ′ ) will be easily satisfied as long as there is a constant proba-
bility mass of H contained in the interval [1/4, 1/3]. This condition prevents some
of the probability mass from escaping to infinity.
Define the estimator
(194) (θ̂, Σ̂, Ĥ) = argmin_{η∈R^p, Γ∈E_p(M), H∈H(M′)} max_{T∈T} [ (1/n) Σ_{i=1}^n S(T(X_i), 1) + E_{X∼E(η,Γ,H)} S(T(X), 0) ].
To accommodate the more general generator class in (194), we consider the
discriminator class T̄^L(κ, B), which has the same definition as (192), except that
G^1(B) = G_ramp = { g(x) = ramp(u^T x + b) : u ∈ R^p, b ∈ R }.


In other words, T̄^L(κ, B) and T^L(κ, B) differ only in the choice of the nonlinear
activation function of the first layer. We remark that the discriminator class
T^L(κ, B) also works for the elliptical distributions, but the theory would then require a
condition that is less transparent. The theoretical guarantee of the estimator (194)
is given by the following proposition.
Proposition 8.2.2. Consider the estimator (194) induced by a regular
proper scoring rule satisfying Condition 1. The discriminator class is specified
by T = T̄^L(κ, B) with the dimension of (w_j) at least 2. Assume p/n + ϵ² ≤ c
for some sufficiently small constant c > 0. Set 2 ≤ L = O(1), 1 ≤ B = O(1), and
κ = O(√(p/n) + ϵ). Then, under the data generating process (193), we have
∥θ̂ − θ∥²_{ℓ_2} ≤ C ( p/n ∨ ϵ² ),
∥Σ̂ − Σ∥²_op ≤ C ( p/n ∨ ϵ² ),
with probability at least 1 − e^{−C′(p + nϵ²)} uniformly over all θ ∈ R^p, all ∥Σ∥_op ≤
M = O(1), and all H ∈ H(M′) with M′ = O(1). The constants C, C′ > 0 are
universal.

8.3. Robust PCA via GANs


8.3.0.1. Robust Factor Analysis. Risk control is a fundamental goal in economics
and finance, where temporal observations are often contaminated with unexpected
irregular variations. Robust covariance estimation and regression against
such unknown contamination thus play a key role in portfolio optimization, hedging
and derivative pricing. In a preliminary study, we demonstrate the effectiveness
of our method with simulation data (Table ??-??) and apply the method to factor
analysis of the S&P 500 time series of daily stock prices during the period from
2007-01-01 to 2018-12-31 (Figure 1). This period includes, in particular, the financial
tsunami triggered in the fall of 2008, which is regarded as unknown contamination.
The factors obtained by our method are robust to the drastic price
variations of those banks heavily influenced by the financial crisis, while traditional
principal components are not.

Figure 1. Two-dimensional visualization of 50 companies in the S&P 500 time
series of daily stock prices during the period from 2007-01-01 to 2018-12-31.
This period includes, in particular, the financial tsunami triggered in the fall
of 2008, which is regarded as irregular contamination. (a) Classical PCA;
(b) Robust PCA from robust scatter learning with elliptical distributions.
Classical PCA using the standard covariance matrix is largely influenced by a
few companies, mostly financial banks heavily affected by the
financial crisis, while the robust PCA using elliptical scatter learning has its
top two factors less influenced by the banks after attenuating outliers. This
shows the potential usefulness of robust scatter learning in data reduction and
visualization against unknown contamination.

8.3.0.2. Robust Denoising of Cryo-EM Images. Cryo-electron microscopy
(Cryo-EM) has become one of the most popular techniques to resolve atomic
structures, for which the Nobel Prize in Chemistry in 2017 was awarded to three
pioneers in this field (She18). However, processing raw Cryo-EM images is a
computational challenge, due to heterogeneity in molecular conformations and
high noise. Figure 2 shows a typical noisy Cryo-EM image together with its reference
image, which is totally non-identifiable to human eyes. In extreme cases, some experi-
mental images do not even contain any particles, rendering particle picking difficult
either manually or automatically (WGL+ 16). How to achieve robust denoising
against such contamination thus becomes a critical problem. Here
we propose to combine the Generative Adversarial Network approach with a decon-
volutional autoencoder as generator toward robust denoising of Cryo-EM images
against contaminations. Table ?? shows some preliminary results, and such an
approach is promising.

Figure 2. (a) a noisy Cryo-EM image; (b) reference image.

8.4. Lab and Further Studies


Part 3

Introduction to Topological Data Analysis
Geometric Data Reduction:
• The general method of manifold learning takes the following spectral kernel
embedding approach:
construct a neighborhood graph of data, G
construct a positive semi-definite kernel on graphs, K
find global embedding coordinates of data by eigen-decomposition
of K = Y Y T
• Sometimes ‘distance metric’ is just a similarity measure (nonmetric MDS,
ordinal embedding)
• Sometimes coordinates are not a good way to organize/visualize the data
(e.g. d > 3)
• Sometimes all that is required is a qualitative view
• Distance measurements are noisy
• Physical device like human eyes may ignore differences in proximity (or
as an average effect)
• Topology is the crudest way to capture invariants under distortions of
distances
• In the presence of noise, one needs topology that varies with scale
What kind of topology?
• Topology studies (global) mappings between spaces
• Point-set topology: continuous mappings on open sets
• Differential topology: differentiable mappings on smooth manifolds
Morse theory tells us topology of continuous space can be learned
by discrete information on critical points
• Algebraic topology: homomorphisms on algebraic structures, the most
concise encoder for topology
• Combinatorial topology: mappings on simplicial (cell) complexes
Simplicial complex may be constructed from data
Algebraic, differential structures can be defined here
Topological Data Analysis:
• What kind of topological information is often useful?
0-homology: clustering or connected components
1-homology: coverage of sensor networks; paths in robotic planning
1-homology as obstructions: inconsistency in statistical ranking;
harmonic flow games
high-order homology: high-order connectivity?
• How to compute homology in a stable way?
simplicial complexes for data representation
filtration on simplicial complexes
persistent homology
CHAPTER 9

Simplicial Complex Representation of Data

9.1. From Graphs to Simplicial Complexes


Definition 9.1.1 (Simplicial Complex). An abstract simplicial complex is a
collection Σ of subsets of V which is closed under inclusion (or deletion), i.e. if τ ∈ Σ
and σ ⊆ τ, then σ ∈ Σ.
We have the following examples:
• Chess-board Complex
• Point cloud data:
Nerve complex
Cech, Rips, Witness complex
Mayer-Vietoris Blowup
• Term-document cooccurance complex
• Clique complex in pairwise comparison graphs
• Strategic complex in flow games
Example 9.1.1 (Chess-board Complex). Let V be the positions on a Chess
board. Σ collects position subsets of V where one can place queens (rooks) without
capturing each other. It is easy to check the closedness under deletion: if σ ∈ Σ is
a set of “safe” positions, then any subset τ ⊆ σ is also a set of “safe” positions.

Figure 1. A. Chess board; B. The positions that can be captured


by a queen; C. A set of safe positions with 8 queens that can not
capture each other.

Example 9.1.2 (Nerve Complex). Given a cover of X, X = ∪_α U_α, let V = {U_α}
and define Σ = {U_I : ∩_{α∈I} U_α ≠ ∅}.
• Closedness under deletion holds.
• It can be applied to any topological space X.
• In a metric space (X, d), if U_α = B_ϵ(t_α) := {x ∈ X : d(x, t_α) ≤ ϵ}, we
obtain the Čech complex C_ϵ.

• Nerve Theorem: if every UI is contractible, then X has the same homotopy


type as Σ.

Figure 2. The nerve graph of the 7 bridges of Königsberg.

Figure 3. Čech complex of a circle, Cϵ , covered by a set of balls.

The Čech complex is hard to compute, even in Euclidean space (FGK03). Given
any set of j points, one needs to determine whether the balls of radius ϵ around each
of these points have non-empty intersection toward building up a simplex in the Čech
complex. This problem is related to the smallest enclosing ball problem: given a set
of j points, find the ball with the smallest radius enclosing all these
points. One can check that ∩_{i=1}^j B_ϵ(p_i) ≠ ∅ if and only if this smallest radius is < ϵ.
Fast algorithms for the smallest ball problem exist; see (FGK03) for theoretical
discussion and (KMM04) for downloadable algorithms from the web.
On the other hand, one can easily compute an upper bound for the Čech complex:
1) construct a Čech subcomplex of dimension one, i.e. a graph with edges connecting
point pairs whose distance is no more than ϵ; 2) find its clique complex, i.e. the
maximal simplicial complex whose 1-skeleton is the graph above, where every k-clique is
regarded as a (k − 1)-simplex. This is called the Vietoris-Rips complex.
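A brute-force sketch of this two-step construction (not from the text; exponential in the dimension, as noted below, so only for small examples):

import numpy as np
from itertools import combinations
from scipy.spatial.distance import pdist, squareform

def rips_complex(X, eps, max_dim=2):
    # k-simplices of the Vietoris-Rips complex are the (k+1)-cliques of the eps-graph
    D = squareform(pdist(X))
    n = len(X)
    simplices = [(i,) for i in range(n)]
    for k in range(2, max_dim + 2):              # simplices with k vertices
        for sigma in combinations(range(n), k):
            if all(D[i, j] <= eps for i, j in combinations(sigma, 2)):
                simplices.append(sigma)
    return simplices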
Example 9.1.3 (Vietoris-Rips Complex). Let V = {x_α ∈ X}. Define VR_ϵ =
{U_I ⊆ V : d(x_α, x_β) ≤ ϵ, ∀α, β ∈ I}.
• The Rips complex is easier to compute than the Čech complex;
even so, the Rips complex is generally exponential in the dimension.
• However, the Vietoris-Rips complex need NOT preserve the homotopy type of the Čech complex.

Figure 4. Left: Čech complex gives a circle; Right: Rips complex


gives a sphere S 2 .

• But there is still hope to find a lower bound on homology –
Theorem 9.1.1 (“Sandwich”).
VR_ϵ ⊆ C_ϵ ⊆ VR_{2ϵ}
• If a homology group “persists” through VR_ϵ → VR_{2ϵ}, then it must exist in
C_ϵ; but not vice versa.
• All of the above gives rise to a filtration of simplicial complexes
∅ = Σ_0 ⊆ Σ_1 ⊆ Σ_2 ⊆ . . .
• Functoriality of inclusion: there are homomorphisms between homology
groups
0 → H_1 → H_2 → . . .
• A persistent homology is the image of H_i in H_j with j > i.

Figure 5. Scale ϵ1 : β0 = 1, β1 = 3

Example 9.1.4 (Strong Witness Complex). Let V = {tα ∈ X}. Define Wϵs =
{UI ⊆ V : ∃x ∈ X, ∀α ∈ I, d(x, tα ) ≤ d(x, V ) + ϵ}.

Figure 6. Scale ϵ_2 > ϵ_1: β_0 = 1, β_1 = 2. A 0-homology
class (connected component) and a 1-homology class (loop) persist
through the scales ϵ_1 and ϵ_2.

Author/Paper ICML’01 NIPS’02 JMLR’03 ICML’06 JMLR’02 MIT’09 IJRR’04


F. Bach 1 1
D. Blei 1 1
N. Friedman 1
M. Jordan 1 1 1
D. Koller 1 1
J. Lafferty 1 1
A. McCallum 1
A. Ng 1 1 1
F. Pereira 1
Y. Weiss 1
Citation (2019.4) 1,959 7,297 26,337 2,085 1,845 5,732 871

Table 1. Author collaboration data. Google Scholar citations are


recorded up to April 18, 2019.

Example 9.1.5 (Weak Witness Complex). Let V = {t_α ∈ X}. Define W_ϵ^w =
{U_I ⊆ V : ∃x ∈ X, ∀α ∈ I, d(x, t_α) ≤ d(x, V_{−I}) + ϵ}.
• V can be a set of landmarks, much smaller than X.
• Monotonicity: W_ϵ^∗ ⊆ W_{ϵ′}^∗ if ϵ ≤ ϵ′.
• But it is not easy to control the homotopy type between W^∗ and X.
Example 9.1.6 (Author collaboration complex). Let V consist of n authors,
and let Σ collect subsets of authors σ ∈ Σ whenever they coauthor the same paper. Then Σ
becomes a simplicial complex. See Table 1 and Figure 7 for an illustration. (LK10) gives a
similar term-document co-occurrence complex.

Example 9.1.7 (Flag Complex of a Paired Comparison Graph, Jiang-Lim-Yao-Ye
2011 (JLYY11)). Let V be a set of alternatives to be compared and let the undirected pair
(i, j) ∈ E if the pair is comparable. A flag complex χ_G consists of all cliques as sim-
plices or faces (e.g. 3-cliques as 2-faces and (k + 1)-cliques as k-faces); it is also called the
clique complex of G.
Example 9.1.8 (Strategic Simplicial Complex for Flow Games, Candogan-
Menache-Ozdaglar-Parrilo 2011 (CMOP11)). The strategic simplicial complex is the
clique complex of the pairwise comparison graph G = (V, E) of strategy profiles, where
V consists of all strategy profiles of the players and a pair of strategy profiles (x, x′) ∈ E is
comparable if only one player changes strategy from x to x′. Every finite game can
be decomposed as the direct sum of potential games and zero-sum games (harmonic
games).

Figure 7. A simplicial complex of the ten-authors-in-seven-papers example.
In the complex, there are four 2-simplices (triangles) of collaboration:
{Jordan, Blei, Ng}, {Jordan, Weiss, Ng}, {Lafferty, McCallum, Pereira}, and
{Koller, Bach, Friedman}; three 1-simplices (edges) other than the eleven edges of
the triangles: {Blei, Lafferty}, {Koller, Ng}, and {Bach, Jordan}; and ten
0-simplices (nodes). So the total numbers of faces are f_0 = 10, f_1 = 14, f_2 = 4.
Euler curvature: Jordan (−1/3 = 1 − 4/2 + 2/3), Blei (−1/6 = 1 − 3/2 + 1/3),
Ng (−1/3 = 1 − 4/2 + 2/3), Weiss (1/3 = 1 − 2/2 + 1/3), Koller
(−1/6 = 1 − 3/2 + 1/3), Bach (−1/6 = 1 − 3/2 + 1/3), Friedman
(1/3 = 1 − 2/2 + 1/3), Lafferty (−1/6 = 1 − 3/2 + 1/3), Pereira
(1/3 = 1 − 2/2 + 1/3), McCallum (1/3 = 1 − 2/2 + 1/3).

Figure 8. Illustration of the game strategic complex: Battle of the Sexes.
The bimatrix games (a) battle of the sexes and (b) a modified battle of the sexes
have identical pairwise comparisons, hence the same flow on the game graph over
the strategy profiles (O, O), (O, F), (F, O), (F, F) (adapted from (CMOP11)).

9.2. Betti Numbers

9.3. Consistency and Sample Complexity of Čech Complexes
(Niyogi-Smale-Weinberger Theorem)

9.4. Lab and Further Studies

Henry Adams maintains a collection of applied topology software:
https://www.math.colostate.edu/~adams/advising/appliedTopologySoftware/
CHAPTER 10

Persistent Homology

10.1. *Hierarchical Clustering, Metric Trees, and Persistent β0


Hierarchical Cluster Trees:
(1) Start with each data point as its own cluster;
(2) Repeatedly merge two “closest” clusters, where notions of “distance” be-
tween two clusters are given by:
Single linkage: closest pair of points
Complete linkage: furthest pair of points
Average linkage (several variants):
(i) distance between centroids
(ii) average pairwise distance
(iii) Ward’s method: increase in k-means cost due to merger
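A quick illustration of these linkage rules (not from the text; scipy's hierarchical clustering implements single, complete, average and Ward linkage, and the toy two-cluster data below are an arbitrary choice):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                    # the merge tree (dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into two clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes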

10.2. Persistent Homology and Betti Numbers


Outline:
• Betti numbers are computed as dimensions of Boolean vector spaces (E.
Noether, Z2 -homology group)


Figure 1. Cluster trees: Average, complete, and single linkage. From Intro-
duction to Statistical Learning with Applications in R.


Methods:          K-means      K-center            Average       Complete              Single
Complexity:       NP           NP                  ≈ K-means     ≈ K-center            Minimal Spanning Tree
Approximability:  50-opt       2-opt, O(kn)        ?             k < α(k) < k log 3    ?
Online:           ?            Cover-tree (8-opt)  ?             ?                     Persistent Homology
Hierarchical:     ?            Cover-tree          Yes           Yes                   Yes
Consistency:      Pollard’81   No (metric-net)     ?             ?                     Hartigan’81; Stuetzle’03


Figure 2. Tree (Evolution and Phylogenetics)

• β_i(X) = dim H_i(X, Z_2), the Z_2-homology, or more generally the homology group
associated with any field or integral domain (e.g. Z, Q, and R)
• H_i(X) is functorial, i.e. a continuous mapping f : X → Y induces a linear
transformation H_i(f) : H_i(X) → H_i(Y), preserving structure
• the computation is simple linear algebra over fields or integers
• data are represented by simplicial complexes
• all of the above gives rise to a filtration of simplicial complexes
∅ = Σ_0 ⊆ Σ_1 ⊆ Σ_2 ⊆ . . .
• functoriality of inclusion: there are homomorphisms between homology
groups
0 → H_1 → H_2 → . . .
• a persistent homology is the image of H_i in H_j with j > i.
Recall that
Theorem 10.2.1 (“Sandwich”). VR_ϵ ⊆ C_ϵ ⊆ VR_{2ϵ}.
If a homology group “persists” through VR_ϵ → VR_{2ϵ}, then it must exist in C_ϵ; but
not vice versa.


Persistent homology was first proposed by Edelsbrunner-Letscher-Zomorodian,
with an algebraic formulation by Zomorodian-Carlsson. The algorithm is equivalent
to Robin Forman’s discrete Morse theory.
to be continued...
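The linear-algebra computation of Betti numbers mentioned above can be made concrete on a tiny example (not from the text): the hollow triangle with three vertices, three edges and no 2-simplex, whose Betti numbers over Z_2 are β_0 = 1 and β_1 = 1.

import numpy as np

def rank_gf2(M):
    # Rank of a 0/1 matrix over the field Z_2, by Gaussian elimination mod 2.
    M = M.copy() % 2
    rank = 0
    rows, cols = M.shape
    for col in range(cols):
        pivot = next((r for r in range(rank, rows) if M[r, col]), None)
        if pivot is None:
            continue
        M[[rank, pivot]] = M[[pivot, rank]]
        for r in range(rows):
            if r != rank and M[r, col]:
                M[r] = (M[r] + M[rank]) % 2
        rank += 1
    return rank

# Boundary map d1: edges {01, 02, 12} -> vertices {0, 1, 2}, entries mod 2.
d1 = np.array([[1, 1, 0],
               [1, 0, 1],
               [0, 1, 1]])
r1 = rank_gf2(d1)
beta0 = 3 - r1              # dim C_0 - rank d1 = 1 connected component
beta1 = (3 - r1) - 0        # dim ker d1 - rank d2 = 1 loop (no 2-simplices)
print(beta0, beta1)         # 1 1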

10.3. Application Examples of Persistent Homology


10.3.1. Sensor Network Coverage by Persistent Homology.
• V. de Silva and R. Ghrist (2005) Coverage in sensor networks via persistent
homology.
• Ideally, sensor communication can be modeled by a Rips complex:
if two sensors are within a short range, they receive strong signals
from each other;
if two sensors are within a middle range, they receive weak signals;
otherwise there is no signal.
Theorem 10.3.1 (“Sandwich Theorem”, de Silva-Ghrist 2005). Let X be a set
of points in R^d and C_ϵ(X) the Čech complex of the cover of X by balls of radius
ϵ/2. Then there is a chain of inclusions
R_{ϵ′}(X) ⊂ C_ϵ(X) ⊂ R_ϵ(X) whenever ϵ/ϵ′ ≥ √(2d/(d+1)).
Moreover, this ratio is the smallest for which the inclusions hold in general.
Note: this gives a sufficient condition to detect holes in sensor network coverage:
• the Čech complex is hard to compute while the Rips complex is easy;
• if a hole persists from R_{ϵ′} to R_ϵ, then it must exist in C_ϵ.

10.3.2. Natural Image Patch Statistics.


• G. Carlsson, V. de Silva, T. Ishkanov, A. Zomorodian (2008) On the local
behavior of spaces of natural images, International Journal of Computer
Vision, 76(1):1-12.
• An image taken by black and white digital camera can be viewed as a
vector, with one coordinate for each pixel
• Each pixel has a “gray scale” value, which can be thought of as a real number
(in reality, it takes one of 255 values)
• Typical camera uses tens of thousands of pixels, so images lie in a very
high dimensional space, call it pixel space, P
• D. Mumford: What can be said about the set of images I ⊆ P one
obtains when one takes many images with a digital camera?
• Lee, Mumford, Pedersen: Useful to study local structure of images
statistically
Lee-Mumford-Pedersen [LMP] study only high contrast patches.
• Collect: 4.5M high contrast patches from a collection of images obtained
by van Hateren and van der Schaaf
• Normalize mean intensity by subtracting mean from each pixel value to
obtain patches with mean intensity = 0
• Puts data on an 8-D hyperplane, ≈ R8


Figure 3. Persistent 1-Homology in Rips Complexes. Left: Rϵ′ ;


Right: Rϵ . The middle hole persists from Rϵ′ to Rϵ .

• Furthermore, normalize contrast by dividing by the norm, so obtain patches


with norm = 1, whence data lies on a 7-D ellipsoid, ≈ S 7
High density subsets M(k = 300, t = 0.25):
• Codensity filter: let d_k(x) be the distance from x to its k-th nearest neighbor;
the lower d_k(x), the higher the density at x
• Take k = 300, then extract the 5,000 top t = 25% densest points, which
concentrate on a primary circle
Medium density subsets: three circles
• Take k = 15, then extract the 5,000 top 25% densest points, which show a
persistent β_1 = 5, the three-circle model

10.3.3. H1N1 Evolution.




Figure 4. 3 × 3 patches in images


Figure 5. Natural Image Statistics: Primary Circle

10.4. *Stability of Persistent Barcode/Diagram


10.5. Lab and Further Studies
10.5.1. Simplicial Stream and Persistent Homology. Download the Javaplex
(latest version 4.3.4) from the following site
http://appliedtopology.github.io/javaplex/
Follow the tutorial at
https://github.com/appliedtopology/javaplex/wiki/Tutorial
to get it working under the Matlab environment. For example, extract the zip file
and open Matlab, change Matlab’s “Current Folder” to the directory where the
load_javaplex.m file is located (src/matlab/for_distribution/ in version 4.3.4),
and run the file:


Figure 6. Natural Image Statistics: Three Circles


Figure 7. Natural Image Statistics: Klein Bottle?




Figure 8. Natural Image Statistics: Klein Bottle Model


Figure 9. Are phylogenetic trees good representations for evolu-


tion?

>> load_javaplex

Also in the Matlab command window, type the following command.


>> import edu.stanford.math.plex4.*;

Installation is complete. Confirm that Javaplex is working properly with the fol-
lowing command.
>> api.Plex4.createExplicitSimplexStream()
ans =
edu.stanford.math.plex4.streams.impl.ExplicitSimplexStream@16966ef

Your output should be the same except for the last several characters. Each time
upon starting a new Matlab session, you will need to run load_javaplex.m.
Now conduct the following numerical experiment with the example shown in
class:


Figure 10. Virus gene reassortment may introduce loops

(1) Construct a filtration (stream) of the simplicial complex in the Figure 16


(a). This could be done by

>> stream = api.Plex4.createExplicitSimplexStream();
>> stream.addVertex(0, 0);
>> stream.addVertex(1, 1);
>> stream.addVertex(2, 2);
>> stream.addVertex(3, 3);
>> stream.addElement([0, 2], 4);
>> stream.addElement([0, 1], 5);
>> stream.addElement([2, 3], 6);
>> stream.addElement([1, 3], 7);
>> stream.addElement([1, 2], 8);
>> stream.addElement([1, 2, 3], 9);
>> stream.addElement([0, 1, 2], 10);
>> stream.finalizeStream();


Figure 11. Influenza


Figure 12. Origins of the H1N1 2009 pandemic virus. Using phylogenetic trees, the
history of the HA gene of the 2009 H1N1 pandemic virus was reconstructed. It was related
to viruses that circulated in pigs potentially since the 1918 H1N1 pandemic. These viruses
had diverged since that date into various independent strains, infecting humans and swine.
Major reassortments between strains led to new sets of segments from different sources.
In 1998, triple reassortant viruses were found infecting pigs in North America. These
triple reassortant viruses contained segments that were circulating in swine, humans and
birds. Further reassortment of these viruses with other swine viruses created the ancestors
of this pandemic. Until this day, it is unclear how, where or when these reassortments
happened. Source: [506]. From New England Journal of Medicine, Vladimir Trifonov,
Hossein Khiabanian, and Raul Rabadan, Geographic dependence, surveillance, and origins
of the 2009 influenza A (H1N1) virus, 361.2, 115-119.

where you can check the number of simplices in the filtration (stream)
is 11
>> num_simplices = stream.getSize()


Figure 13. In case of vanishing higher dimensional homology, zero dimensional
homology generates trees. When applied to only one gene of influenza A, in this case
hemagglutinin, the only significant homology occurs in dimension zero (panel A). The
barcode represents a summary of a clustering procedure (panel B) that recapitulates
the known phylogenetic relation between different hemagglutinin types (panel C). Source:
[100]. From Joseph Minhow Chan, Gunnar Carlsson, and Raul Rabadan, “Topology of viral
evolution”, Proceedings of the National Academy of Sciences 110.46 (2013): 18566-18571.

num_simplices = 11
(2) Compute the persistent homology for the filtration and plot the barcode
as in Figure 16 (b).
>> % Compute the Z/2Z persistent homology of dimension less than 3:
>> persistence = api.Plex4.getModularSimplicialAlgorithm(3, 2);
>> intervals = persistence.computeIntervals(stream);
>> options.filename = 'Persistent-Betti-Numbers';
>> options.max_filtration_value = 11;
>> % Plot the barcode of persistent Betti numbers:
>> plot_barcodes(intervals, options);

Figure 14. Whole Genomic Persistent Betti Numbers


Figure 15. Co-reassortment of viral segments as structure in persistent homology
diagrams. Left: The non-random cosegregation of influenza segments was measured by
testing a null model of equal reassortment. Significant cosegregation was identified within
PA, PB1, PB2, NP, consistent with the cooperative function of the polymerase complex.
Source: [100]. Right: The persistence diagram for whole-genome avian flu sequences
revealed bimodal topological structure. Annotating each interval as intra- or inter-subtype
clarified a genetic barrier to reassortment at intermediate scales. From Joseph Minhow
Chan, Gunnar Carlsson, and Raúl Rabadán, 'Topology of viral evolution', Proceedings of
the National Academy of Sciences 110.46 (2013): 18566-18571.

Figure 16. (a) A filtration of simplices; (b) barcodes of persistent Betti numbers.
CHAPTER 11

Mapper and Morse Theory

• How to choose coverings?
• Create a reference map (or filter) h : X → Z, where Z is a topological
space often with interesting metrics (e.g. R, R^2, S^1, etc.), and a covering
U of Z, then construct the covering of X using the inverse maps {h^{-1}(U_α)}.
Morse Theory and Reeb graph:
• a nice (Morse) function h : X → R on a smooth manifold X;
• the topology of X is reconstructed from the level sets h^{-1}(t);
• the topology of h^{-1}(t) only changes at 'critical values';
• Reeb graph: a simplified version, contracting into points the connected
components in h^{-1}(t).
Mapper: from continuous to discrete. Note:
• degree-one nodes contain local minima/maxima;
• degree-three nodes contain saddle points (critical points);
• degree-two nodes consist of regular points.
Reeb graphs have found various applications in computational geometry and statistics
under different names:
• computer science: contour trees, Reeb graphs;
• statistics: density cluster trees (Hartigan).

11.1. Morse Theory, Reeb Graph, and Mapper


Mapper algorithm [Singh-Memoli-Carlsson, Eurographics PBG, 2007]: given a data set X,
• choose a filter map h : X → Z, where Z is a topological space such as R,
S^1, R^d, etc.;
• choose a cover Z ⊆ ∪α Uα;
• cluster/partition each level set h^{-1}(Uα) into pieces Vα,β;
• graph representation: a node for each Vα,β, an edge between (Vα1,β1, Vα2,β2)
iff Uα1 ∩ Uα2 ≠ ∅ and Vα1,β1 ∩ Vα2,β2 ≠ ∅;
• extendable to a simplicial complex representation.

Note: it extends Reeb Graph from R to general topological space Z; may lead
to a particular implementation of Nerve theorem through filter map h.
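To make the construction concrete, here is a minimal Matlab sketch of one-dimensional
Mapper on a toy point cloud (a noisy circle), using the height coordinate as the filter h,
overlapping intervals as the cover of Z = R, and connected components of an
ε-neighborhood graph as the clustering step; the data set, the number of intervals, the
overlap ratio, and ε are illustrative choices, not prescribed by the algorithm.

% A minimal Mapper sketch: filter = height, cover = overlapping intervals,
% clustering = connected components of an epsilon-neighborhood graph.
rng(1);
t = linspace(0, 2*pi, 200)';
X = [cos(t), sin(t)] + 0.05*randn(200, 2);      % toy data: a noisy circle
h = X(:, 2);                                    % filter map h : X -> R (height)
nI = 6; overlap = 0.3; epsl = 0.3;              % cover and clustering parameters (assumed)
lo = min(h); len = (max(h) - lo)/nI;
nodes = {};                                     % Mapper nodes: the clusters V_{alpha,beta}
for a = 1:nI                                    % overlapping intervals U_alpha
    I = [lo + (a-1)*len - overlap*len, lo + a*len + overlap*len];
    idx = find(h >= I(1) & h <= I(2));          % preimage h^{-1}(U_alpha)
    if isempty(idx), continue; end
    Y = X(idx, :);
    D = sqrt((Y(:,1) - Y(:,1)').^2 + (Y(:,2) - Y(:,2)').^2);
    A = D < epsl;                               % epsilon-neighborhood graph on the preimage
    lab = 1:numel(idx);                         % connected components by min-label propagation
    changed = true;
    while changed
        old = lab;
        for v = 1:numel(idx)
            lab(v) = min(lab(A(v,:)));
        end
        changed = any(lab ~= old);
    end
    for c = unique(lab)
        nodes{end+1} = idx(lab == c);           %#ok<AGROW>
    end
end
nN = numel(nodes); Adj = zeros(nN);             % connect two nodes iff their clusters overlap
for p = 1:nN
    for q = p+1:nN
        Adj(p, q) = ~isempty(intersect(nodes{p}, nodes{q}));
        Adj(q, p) = Adj(p, q);
    end
end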
Reference mapping: typical one-dimensional filters/mappings:
• density estimators;
• measures of data (ec-)centrality, e.g. \sum_{x'∈X} d(x, x')^p;
• geometric embeddings: PCA/MDS, manifold learning, diffusion maps, etc.;
• response variable in statistics: progression stage of disease, etc.
Figure 1. Construction of the Reeb graph; h maps each point on the torus to its height.

Figure 2. An illustration of Mapper.

11.2. Applications Examples


11.2.1. RNA Hairpin Folding/Unfolding Pathways. Biological relevance:
• serve as nucleation sites for RNA folding;
• form sequence-specific tertiary interactions;
• protein recognition sites;
• certain tetraloops can pause RNA transcription.
Note: although the system is simple, there are biological debates over the intermediate
states on its folding pathways.
11.2.2. Differentiation Process by Single Cell Data.
• Over time, undifferentiated embryonic cells become differentiated motor
neurons when retinoic acid and sonic hedgehog (a differentiation-promoting
protein) are applied.
• Mapper graph of the differentiation process from murine embryonic stem cells
to motor neurons:
The data generated corresponds to RNA expression profiles from
roughly 2000 single cells.
The distance metric was provided by correlation between expression
vectors.
The filter function used was a multidimensional scaling (MDS) projection
into R^2.
The cover was overlapping rectangles in R^2.

11.2.3. Progression of Cancer.
• We study samples of expression data in R^n (n = 262) from 295 breast
cancers as well as additional samples from normal breast tissue.
The distance metric was given by the correlation between (projected)
expression vectors.
The filter function used was a measure, taking values in R, of the
deviation of the expression of the tumor samples relative to normal controls
(l2-eccentrality).
The cover was overlapping intervals in R.
• Two branches of breast cancer progression are discovered.
• The lower right branch itself has a subbranch (referred to as c-MYB+
tumors), which are some of the most distinct from normal and are
characterized by high expression of genes including c-MYB, ER, DNALI1 and
C9ORF116. Interestingly, all patients with c-MYB+ tumors had very
good survival and no metastasis.
• These tumors do not correspond to any previously known breast cancer
subtype; the grouping seems to be invisible to classical hierarchical
clustering methods.

11.2.4. Brain Tumor.
• Using Mapper, one can appreciate a more continuous structure that
recapitulates the clonal and genetic history.
Figure 3. RNA GCAA-Tetraloop

The tumor on the right appears to be transcriptionally distinct from
the left tumor and the recurrence tumor.
Expression profiles from cells in the recurrence tumor resembled the
originating initial tumor.
Figure 4. Mapper output for unfolding pathways.

Figure 5. RNA hairpin folding pathways: jointly with Xuhui Huang, Jian Sun,
Greg Bowman, Gunnar Carlsson, Leo Guibas, and Vijay Pande, JACS'08, JCP'09.

This is an important finding, as it shows a continued progression at
the expression level, with a few cells at diagnosis having a similar pattern
as cells at relapse.
It also shows that the EGFR mutation is a subclonal event, occurring
only in the tumor at diagnosis, and is not responsible for the relapse. So
tumors with heterogeneous populations of cells are less sensitive to specific
therapies which target a subpopulation.

11.3. *Discrete Morse Theory and Persistent Homology


11.4. Lab and Further Studies
Figure 6. Mapper graph of single cell data, where the different regions in the
Mapper graph nicely line up with different points along the differentiation timeline.
Rizvi et al., Nature Biotechnology 35.6 (2017), 551-560.
Figure 7. Progression of breast cancer: l2-eccentrality, by Monica Nicolau,
A. Levine, and Gunnar Carlsson, PNAS'10.


Figure 8. A patient with two focal glioblastomas, on the left and right hemispheres.
After surgery and standard treatment, the tumor reappeared on the left side. Genomic
analysis shows that the initial tumors were seeded by two independent, but related clones.
The recurrent tumor was genetically similar to the left one. Jin-Ku Lee et al. Nature
Genetics 49.4 (2017): 594-599.
CHAPTER 12

*Euler Calculus

12.1. Euler Characteristics


For a general simplicial complex K of dimension d, the Euler characteristic is defined
to be

χ_K = \sum_{i=0}^{d} (−1)^i f_i,

where f_i is the number of i-faces (0-faces as nodes, 1-faces as edges, and so on).
The relationship between the Euler characteristic and topology is via the following
equation:

χ_K = \sum_{i=0}^{d} (−1)^i β_i,

where β_i is the i-th Betti number.
The famous Gauss-Bonnet-Chern theorem established the relationship between
the total curvature and the Euler characteristic for Riemannian manifolds; it can be
extended to simplicial complexes using the following natural notion of combinatorial
curvature at a node v,

κ(v) := \sum_{i=0}^{d} (−1)^i \frac{f_i(v)}{i+1},

where f_i(v) denotes the number of i-faces that contain v. With this definition,
we clearly have the following combinatorial version of the Gauss-Bonnet-Chern theorem,

(195)    \sum_{v∈V} κ(v) = \sum_{i=0}^{d} (−1)^i f_i = χ_K,

which connects local geometry (curvature) to global topology in terms of the Euler
characteristic.
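As a quick sanity check of (195), the following Matlab sketch computes the combinatorial
curvatures and the Euler characteristic of a single filled triangle (three vertices, three
edges, one 2-face); the complex is an illustrative choice.

% Verify the combinatorial Gauss-Bonnet formula (195) on a filled triangle.
V = [1 2 3];
E = [1 2; 2 3; 3 1];
F = [1 2 3];                                     % the single 2-face
chi = numel(V) - size(E, 1) + size(F, 1);        % Euler characteristic = 3 - 3 + 1 = 1
kappa = zeros(1, numel(V));
for v = V
    f1 = sum(any(E == v, 2));                    % number of edges containing v
    f2 = sum(any(F == v, 2));                    % number of 2-faces containing v
    kappa(v) = 1/1 - f1/2 + f2/3;                % sum_i (-1)^i f_i(v)/(i+1)
end
fprintf('sum of curvatures = %.4f, chi = %d\n', sum(kappa), chi);   % both equal 1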
The Euler characteristic defines a finitely-additive measure, as it satisfies the
fundamental property of a measure:
χ(A ∪ B) = χ(A) + χ(B) − χ(A ∩ B),
which can be used to define the integration of constructible or definable functions,
known as Euler calculus (BG10; CGR12).
to be finished...

12.2. *Euler Calculus and Integral Geometry


12.3. Applications Examples
12.3.1. Sensor Network.

12.3.2. Extremal Analysis of Gaussian Random Fields.


Part 4

Combinatorial Hodge Theory and Applications
CHAPTER 13

Combinatorial Hodge Theory

13.1. Exterior Calculus on Simplicial Complex and Cohomology


We are going to study functions on a simplicial complex, l^2(V^d).
A basis of "forms":
• l^2(V): e_i (i ∈ V), so f ∈ l^2(V) has a representation f = \sum_{i∈V} f_i e_i, e.g.
a global ranking score on V.
• l^2(V^2): e_ij = −e_ji, f = \sum_{(i,j)} f_ij e_ij for f ∈ l^2(V^2), e.g. paired
comparison scores on V^2.
• l^2(V^3): e_ijk = e_jki = e_kij = −e_jik = −e_kji = −e_ikj, f = \sum_{ijk} f_ijk e_ijk.
• l^2(V^{d+1}): e_{i_0,...,i_d} is an alternating d-form,
e_{i_0,...,i_d} = sign(σ) e_{σ(i_0),...,σ(i_d)},
where σ ∈ S_d is a permutation on {0, . . . , d}.
Vector spaces of functions l^2(V^{d+1}), represented on such a basis with an inner product
defined, are called d-forms (cochains).
Example 13.1.1. In the crowdsourcing ranking of world universities,
http://www.allourideas.org/worldcollege/,
V consists of world universities, E are university pairs in comparison, l2 (V ) consists
of ranking scores of universities, l2 (V 2 ) is made up of paired comparison data.
Discrete differential operators: the k-dimensional coboundary maps δ_k : L^2(V^k) →
L^2(V^{k+1}) are defined as the alternating difference operators

(δ_k u)(i_0, . . . , i_{k+1}) = \sum_{j=0}^{k+1} (−1)^{j+1} u(i_0, . . . , i_{j−1}, i_{j+1}, . . . , i_{k+1}).

• δ_k plays the role of differentiation;
• δ_{k+1} ◦ δ_k = 0.
So we have the chain map

L^2(V) \xrightarrow{δ_0} L^2(V^2) \xrightarrow{δ_1} L^2(V^3) → · · · → L^2(V^k) \xrightarrow{δ_{k−1}} L^2(V^{k+1}) \xrightarrow{δ_k} · · ·

with δ_k ◦ δ_{k−1} = 0.
Example 13.1.2 (Gradient, Curl, and Divergence). We can define discrete
gradient and curl, as well as their adjoints
• (δ0 v)(i, j) = vj − vi =: (grad v)(i, j)
• (δ1 w)(i, j, k) = (±)(wij + wjk + wki ) =: (curl w)(i, j, k), which measures
the total flow-sum along the loop i → j → k → i and (δ1 w)(i, j, k) = 0
implies the paired comparison data is path-independent, which defines the
triangular transitivity subspace
• for each alternative i ∈ V , the combinatorial divergence

(div w)(i) := −(δ_0^T w)(i) = \sum_{j: {i,j}∈E} w_ij,

which measures the inflow-outflow sum at i; (δ_0^T w)(i) = 0 implies that
alternative i is preference-neutral in all pairwise comparisons, as in a cyclic
ranking passing through all alternatives.
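To see the operators of Example 13.1.2 in matrix form, here is a small Matlab sketch on
the triangle {1, 2, 3}, with edges ordered (1,2), (2,3), (1,3); the flow values are
illustrative.

% delta0 : l^2(V) -> l^2(V^2), (delta0 v)(i,j) = v_j - v_i  (gradient)
d0 = [-1  1  0;
       0 -1  1;
      -1  0  1];
% delta1 : l^2(V^2) -> l^2(V^3), (delta1 w)(1,2,3) = w_12 + w_23 - w_13  (curl)
d1 = [1  1  -1];
disp(d1 * d0);        % = [0 0 0]: delta1 o delta0 = 0, i.e. gradient flows are curl-free
v = [0; 1; 3];        % a global ranking score on the vertices
grad_v = d0 * v;      % gradient flow on the three edges: [1; 2; 3]
w = [1; 1; -1];       % the cyclic flow 1 -> 2 -> 3 -> 1, written on the ordered edges
curl_w = d1 * w;      % triangular curl = 3 (nonzero: w is inconsistent on the triangle)
div_w = -d0' * w;     % combinatorial divergence = [0; 0; 0]: a cyclic flow is divergence-free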

13.2. Combinatorial Hodge Theory


Definition 13.2.1 (Combinatorial Hodge Laplacian). Define the k-dimensional
combinatorial Laplacian, ∆_k : L^2(V^{k+1}) → L^2(V^{k+1}), by

∆_k = δ_{k−1} δ_{k−1}^T + δ_k^T δ_k,    k > 0.

• k = 0: ∆_0 = δ_0^T δ_0 is the well-known graph Laplacian.
• k = 1:
∆_1 = curl ◦ curl^* − div ◦ grad.
• Important properties:
∆_k is positive semi-definite;
ker(∆_k) = ker(δ_{k−1}^T) ∩ ker(δ_k): the k-harmonics, whose dimension equals the
k-th Betti number.
Hodge Decomposition Theorem:
Theorem 13.2.1 (Hodge Decomposition). The space of k-forms (cochains),
C^k(K(G), R), admits an orthogonal decomposition into three components,
C^k(K(G), R) = im(δ_{k−1}) ⊕ H_k ⊕ im(δ_k^T),
where
H_k = ker(δ_{k−1}^T) ∩ ker(δ_k) = ker(∆_k).
• dim(H_k) = β_k.
A simple understanding is possible via the Dirac operator:
D = δ + δ^* : ⊕_k L^2(V^k) → ⊕_k L^2(V^k).
Hence D = D^* is self-adjoint. It combines the chain map

L^2(V) \xrightarrow{δ_0} L^2(V^2) \xrightarrow{δ_1} L^2(V^3) → · · · → L^2(V^k) \xrightarrow{δ_{k−1}} L^2(V^{k+1}) \xrightarrow{δ_k} · · ·

into one big operator, the Dirac operator.
Abstract Hodge Laplacian:
∆ = D^2 = δδ^* + δ^*δ,
since δ^2 = 0.
By the Fundamental Theorem of Linear Algebra (the Closed Range Theorem in
Banach spaces),
⊕_k L^2(V^k) = im(D) ⊕ ker(D),
where
im(D) = im(δ) ⊕ im(δ^*)
and ker(D) = ker(∆) is the space of harmonic forms.
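As a minimal numerical check that dim ker(∆_1) recovers the first Betti number, consider
the 4-cycle graph, which has one independent loop and no triangles (so the δ_1^T δ_1 term
vanishes); the example below is a Matlab sketch under these illustrative choices.

% 4-cycle 1-2-3-4-1 with edges ordered (1,2),(2,3),(3,4),(1,4); no triangles, so delta1 = 0.
d0 = [-1  1  0  0;
       0 -1  1  0;
       0  0 -1  1;
      -1  0  0  1];
Delta1 = d0 * d0';                 % Hodge 1-Laplacian with unit weights
beta1 = size(d0, 1) - rank(d0);    % dim ker(Delta1) = |E| - rank(delta0) = 1
harmonic = null(Delta1);           % spans the single harmonic flow, a cycle around the loop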

13.3. Lab and Further Studies


CHAPTER 14

Social Choice and Hodge Decomposition of Preferences

14.1. Social Choice Theory


Social choice, also known as preference or rank aggregation, is an important
topic in science and technology. The celebrated impossibility theorems by Nobel
Laureates in Economics have shown its fundamental limitations, in which a pursuit
of a global consensus ranking is doomed to meet conflicts of interests. Hence, a
quantitative description of the conflicts of interests within preference data is even more
important than merely pursuing a global ranking. However, all of the traditional
studies on this topic assume complete preference data with homogeneous voters
who are treated in an equal way.
• Borda 1770: Borda count against the plurality vote
• Condorcet 1785: Condorcet winner, who wins all paired elections
• Impossibility theorems: Kenneth Arrow 1963, Amartya Sen 1973
• Resolving conflicts: Kemeny, Saari, ...
• In these settings, we study complete ranking orders from voters.
The classical social choice problem can be described as follows.
Problem. Given n candidates V = {1, . . . , n} and m voters whose preferences are
total orders (permutations) {⪰i : i = 1, . . . , m} on V , find a social choice mapping
f : (⪰1 , . . . , ⪰m ) 7→ ⪰∗ ,
as a total order on V , which “best” represents the voters’ will.
This is also known as rank aggregation in computer science, and as a form
of unidimensional scaling, it is a fundamental problem in many fields including
economics, voting theory, psychology, statistics, and engineering. Unfortunately,
a set of impossibility theorems, including two brilliant ones by Nobel Laureates
Kenneth Arrow and Amartya Sen, says “no” to its existence under some natural
conditions that many assume to be basic democratic norms.
In particular, Kenneth Arrow (Arr63) first showed that no social choice mapping
exists for all possible inputs of total orders that satisfies: (a) unanimity
(Pareto, i.e. if all voters agree that candidate A is better than B, this is the social
choice); (b) independence of irrelevant alternatives (IIA, i.e. the social choice order
for A and B should not depend on other candidates); and (c) non-dictatorship (i.e.
no single voter's order always coincides with the social order). Even worse, Amartya Sen (Sen70)

further showed an impossibility theorem under weaker conditions: (a) unanimity
and (b) minimal liberalism (i.e. at least two different voters decide the social orders
of two distinct pairs of candidates, respectively). To capture the intrinsic
inconsistency in voting, in the 1980s Don Saari (Saa95) introduced an orthogonal
decomposition of voting profiles, i.e. probability distributions on the permutation
group of V , which splits the common consensus and the conflicts of interests within the
voting data.
14.1.1. Arrow and Sen’s Impossibility Theorems.
Theorem 14.1.1 (Arrow'1963). Consider the unrestricted domain, i.e. voters
may have all complete and transitive preferences. The only social choice rule
satisfying the following conditions is the dictator rule:
• Pareto (unanimity): if all voters agree that A ⪰ B, then such a preference
should appear in the social order;
• Independence of Irrelevant Alternatives (IIA): the social order of any pair
only depends on the voters' relative rankings of that pair.
Theorem 14.1.2 (Sen'1970). With the unrestricted domain, there are cases of
voting data for which no social choice mapping,
f : (⪰1 , . . . , ⪰m ) 7→ 2^V ,
exists under the following conditions:
• Pareto: if all voters agree that A > B, then such a preference should
appear in the social order;
• Minimal liberalism: two distinct voters decide the social orders of two distinct
pairs, respectively.
14.1.2. Three candidate example. Let's take an example of three candidates:
A, B, and C.

Preference order Votes


A⪰B⪰C 2
B⪰A⪰C 3
B⪰C⪰A 1
C⪰B⪰A 3
C⪰A⪰B 2
A⪰C⪰B 2

Table 1. Three Candidates: ABC

There are two important classes of social mappings in reality:
• I. Position rules: assign a score s : V → R such that, for each voter's
order (permutation) σ_i ∈ S_n (i = 1, . . . , m), s_{σ_i(k)} ≥ s_{σ_i(k+1)}. Define the
social order by the descending order of the total score over raters, i.e. the
score for the k-th candidate is

f(k) = \sum_{i=1}^{m} s_{σ_i(k)}.

Borda Count: s : V → R is given by (n − 1, n − 2, . . . , 1, 0)



Vote-for-top-1: (1, 0, . . . , 0)
Vote-for-top-2: (1, 1, 0, . . . , 0)
• II. Pairwise rules: convert the voting profile, a (distribution) function on
the n!-element set S_n, into a paired comparison matrix X ∈ R^{n×n}, where X(i, j) is the
number (distribution) of voters with i ≻ j; define the social order based
on the paired comparison data X.
Kemeny optimization: minimize the number of pairwise mismatches
to X over S_n (NP-hard)
Plurality: the number of wins in paired comparisons (tournaments)
– equivalent to Borda count in complete round-robin tournaments
Let's apply these rules to the three-candidate example:
• Position (normalized score vector (1, s, 0)):
s < 1/2: C wins
s = 1/2: all tie (Borda)
s > 1/2: A, B win
• Pairwise:
A, B: 13 wins
C: 14 wins
Condorcet winner: C
so completely in chaos!


Figure 1. Position vs. Pairwise rules on the ABC example

14.1.3. Saari's Decomposition. Saari's decomposition: every voting profile,
as a distribution on the symmetric group S_n, can be decomposed into the following
components:
• Universal kernel: all ranking methods induce a complete tie on any subset
of V
dimension: n! − 2^{n−1}(n − 2) − 2
• Borda profile: all ranking methods give the same result
dimension: n − 1
basis: {1(σ(1) = i, ∗) − 1(∗, σ(n) = i) : i = 1, . . . , n}
• Condorcet profile: all positional rules give the same result
dimension: (n−1)!/2
basis: sum of the Z_n orbit of σ minus their reversals
• Departure profile: all pairwise rules give the same result


Figure 2. Saari’s Decomposition

• So, if you look for the best possibility out of impossibility, Borda count is
perhaps the choice.
• Borda count is the projection onto the Borda profile subspace.
Borda count is equivalent to

\min_{β∈R^{|V|}} \sum_{α,{i,j}∈E} ω_{ij}^α (β_i − β_j − Y_{ij}^α)^2,

Table 2. Invariant subspaces of social rules (-)

                          Borda Profile    Condorcet       Departure
Borda Count               consistent       -               -
Pairwise                  consistent       inconsistent    -
Position (non-Borda)      consistent       -               inconsistent

where
• e.g. Y_{ij}^α = 1 if i ⪰ j by voter α, and Y_{ij}^α = −1 in the opposite case;
• Note: the NP-hard (for n > 3) Kemeny optimization, or Minimum-Feedback-
Arc-Set, is

\min_{s∈R^{|V|}} \sum_{α,{i,j}∈E} ω_{ij}^α (sign(s_i − s_j) − Ŷ_{ij}^α)^2.

Generalized Borda count with incomplete data:

\min_{x∈R^{|V|}} \sum_{α,{i,j}∈E} ω_{ij}^α (x_i − x_j − y_{ij}^α)^2,

\min_{x∈R^{|V|}} \sum_{{i,j}∈E} ω_{ij} ((x_i − x_j) − ŷ_{ij})^2,

where ŷ_{ij} = Ê_α y_{ij}^α = (\sum_α ω_{ij}^α y_{ij}^α)/ω_{ij} = −ŷ_{ji}, and ω_{ij} = \sum_α ω_{ij}^α.
So ŷ ∈ l_ω^2(E), an inner product space with ⟨u, v⟩_ω = \sum_{{i,j}} ω_{ij} u_{ij} v_{ij}, for u, v skew-symmetric.
Statistical majority voting on l^2(E):
• ŷ_{ij} = (\sum_α ω_{ij}^α y_{ij}^α)/(\sum_α ω_{ij}^α) = −ŷ_{ji}, with ω_{ij} = \sum_α ω_{ij}^α;
• ŷ from generalized linear models:
[1] Uniform model: ŷ_{ij} = 2π̂_{ij} − 1.
[2] Bradley-Terry model: ŷ_{ij} = log(π̂_{ij}/(1 − π̂_{ij})).
[3] Thurstone-Mosteller model: ŷ_{ij} = Φ^{−1}(π̂_{ij}), where Φ(x) is the Gaussian CDF.
[4] Angular transform model: ŷ_{ij} = arcsin(2π̂_{ij} − 1).

14.2. Crowdsourced Ranking on Graphs


However, all of the classical studies above assume complete information,
with inputs given as total orders on the candidate set. Nowadays, with the Internet and
its associated explosive growth of information, individuals throughout the world
are faced with the rapid expansion of multiple choices but also with incomplete
information (e.g., which book to buy, which hotel to book, and which movie to
rent, etc.). For example, the Netflix dataset comprises a database of 17,000 movies
– so many that most raters can never watch them all. Crowdsourcing technology,
such as the platforms MTurk, InnoCentive, CrowdFlower, CrowdRank,
and AllOurIdeas, is becoming a new paradigm for the collection of preference data
from a large crowd or population on the Internet in a less controlled fashion than
traditional lab environments. The data collected is big, incomplete, and often
contaminated with heterogeneous noise. The simplest and most typical scenario with
incomplete data is the pairwise comparison experiment, to which all partial orders
can be reduced.
14.2.1. HodgeRank on Graphs. Let Λ = {1, ..., m} be a set of participants
and V = {1, ..., n} be the set of videos to be ranked. Paired comparison data is
collected as a function on Λ × V × V , which is skew-symmetric for each participant
α, i.e., Y_{ij}^α = −Y_{ji}^α represents the degree that α prefers i to j. The simplest
setting is the binary choice, where

Y_{ij}^α = 1 if α prefers i to j, and Y_{ij}^α = −1 otherwise.

In general, Y_{ij}^α can be used to represent paired comparison grades, e.g., Y_{ij}^α > 0
refers to the degree that α prefers i to j and, vice versa, Y_{ji}^α = −Y_{ij}^α < 0 measures
the degree of dispreference (JLYY11).
In this paper we shall focus on the binary choice, which is the simplest setting,
and the data collected in this paper belong to this case. However, the theory can
be applied to the more general case with multiple choices above.
Such paired comparison data can be represented by a directed graph, or hypergraph,
with n nodes, where each directed edge between i and j refers to the preference
indicated by Y_{ij}^α.
A nonnegative weight function ω : Λ × V × V → [0, ∞) is defined as

(196)    ω_{ij}^α = 1 if α makes a comparison for {i, j}, and 0 otherwise.
It may reflect the confidence level that a participant compares {i, j} by taking
different values, and this is however not pursued in this paper.
Our statistical rank aggregation problem is to look for some global ranking
score s : V → R such that

(197)    \min_{s∈R^{|V|}} \sum_{i,j,α} ω_{ij}^α (s_i − s_j − Y_{ij}^α)^2,

which is equivalent to the following weighted least squares problem

(198)    \min_{s∈R^{|V|}} \sum_{i,j} ω_{ij} (s_i − s_j − Ŷ_{ij})^2,

where Ŷ_{ij} = (\sum_α ω_{ij}^α Y_{ij}^α)/(\sum_α ω_{ij}^α) and ω_{ij} = \sum_α ω_{ij}^α. For the principles behind
such a choice, readers may refer to (JLYY11).
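For a small illustration of (197)-(198), the following Matlab sketch aggregates a handful
of toy comparisons (the data and the unit weights are assumed only for illustration) into a
global score via the unnormalized graph Laplacian.

% Toy paired comparisons on V = {1,2,3,4}: each row is [i  j  Y_ij], with unit weights.
C = [1 2  1;      % a rater prefers 1 over 2 by degree 1
     2 3  1;
     3 4  1;
     1 4  1;
     2 4 -1];     % a rater prefers 4 over 2
n = 4;  m = size(C, 1);  Y = C(:, 3);
A = zeros(m, n);                       % (A s)_e = s_i - s_j on the oriented pair (i, j)
for e = 1:m
    A(e, C(e, 1)) =  1;
    A(e, C(e, 2)) = -1;
end
L = A' * A;                            % unnormalized graph Laplacian (unit weights)
s = pinv(L) * (A' * Y);                % minimal-norm least squares score, cf. (198)
[~, order] = sort(s, 'descend');       % the aggregated global ranking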
A graph structure arises naturally from ranking data as follows. Let G = (V, E)
be a paired ranking graph whose vertex set is V , the set of videos to be ranked, and
whose edge set is E, the set of video pairs which receive some comparisons, i.e.,

(199)    E = { {i, j} ∈ \binom{V}{2} : \sum_α ω_{ij}^α > 0 }.

A pairwise ranking is called complete if each participant α in Λ gives a total
judgment of all videos in V ; otherwise it is called incomplete. It is called balanced if
the paired comparison graph is k-regular with equal weights ω_{ij} = \sum_α ω_{ij}^α ≡ c for
all {i, j} ∈ E; otherwise it is called imbalanced. A complete and balanced ranking
induces a complete graph with equal weights on all edges. The existing paired
comparison methods in VQA often assume complete and balanced data. However,
this is an unrealistic assumption for real world data, e.g. randomized experiments.

Moreover, in crowdsourcing, raters and videos come in an unspecified way and it is
hard to control the test process with precise experimental designs. Nevertheless,
as will be shown below, it is efficient to utilize some random sampling design based
on random graph theory, where for each participant a fraction of video pairs are
chosen randomly. The HodgeRank approach adopted in this paper provides us with a
unified scheme which can deal with incomplete and imbalanced data emerging from
random sampling in paired comparisons.
The minimization problem (198) can be generalized to a family of linear models
in paired comparison methods (Dav88). To see this, we first rewrite (198) in
another, simpler form. Assume that for each edge, i.e. video pair {i, j}, the number
of comparisons is nij , among which aij participants have a preference for i over j
(aji carries the opposite meaning). So aij + aji = nij if no tie occurs. Therefore,
for each edge {i, j} ∈ E, we have a preference probability estimated from data,
π̂ij = aij /nij . With this definition, the problem (198) can be rewritten as

(200)    \min_{s∈R^{|V|}} \sum_{{i,j}∈E} n_{ij} (s_i − s_j − (2π̂_{ij} − 1))^2,

since Ŷij = (aij − aji )/nij = 2π̂ij − 1 due to Equation (196).


General linear models, which were first formulated by G. Noether (Noe60),
assume that the true preference probability can be fully decided by a linear scaling
function on V , i.e.,
(201) πij = Prob{i is preferred over j} = F (s∗i − s∗j ),
for some s∗ ∈ R|V | . F can be chosen as any symmetric cumulative distribution
function. When only an empirical preference probability π̂ij is observed, we can
map it to a skew-symmetric function by the inverse of F ,
(202) Ŷij = F −1 (π̂ij ),
where Ŷij = −Ŷji . However, in this case, one can only expect that
(203) Ŷij = s∗i − s∗j + εij ,
where εij accounts for the noise. The case in (200) takes a linear F and is often
called a uniform model. Below we summarize some well-known models which have
been studied extensively in (Dav88).
1. Uniform model:
(204) Ŷij = 2π̂ij − 1.
2. Bradley-Terry model:
π̂ij
(205) Ŷij = log .
1 − π̂ij
3. Thurstone-Mosteller model:
(206) Ŷij = F −1 (π̂ij ),
where F is essentially the Gauss error function

(207)    F(x) = \frac{1}{\sqrt{2π}} \int_{−x/[2σ^2(1−ρ)]^{1/2}}^{∞} e^{−\frac{1}{2}t^2} dt.

Note that the constants σ and ρ will only contribute to a rescaling of the solution of
(198).
4. Angular transform model:
(208) Ŷij = arcsin(2π̂ij − 1).
This model is designed for the so-called variance stabilization property: asymptotically,
Ŷij has a variance depending only on the number of ratings on edge {i, j}, i.e. the
weight ωij , but not on the true probability πij .
Different models will give different Ŷij from the same observation π̂ij , followed
by the same weighted least squares problem (198) for the solution. Therefore, a
deeper analysis of problem (198) will disclose more properties of the ranking
problem.
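The four models simply postprocess the empirical preference probabilities. A minimal
Matlab sketch follows, with toy probabilities, and assuming the standard Gaussian quantile
for the Thurstone-Mosteller rescaling (i.e. ignoring the constants σ and ρ):

pihat = [0.9; 0.7; 0.55; 0.3];               % empirical preference probabilities (toy values)
Y_uniform = 2*pihat - 1;                     % (204) uniform model
Y_bt      = log(pihat ./ (1 - pihat));       % (205) Bradley-Terry model (log-odds)
Y_tm      = sqrt(2) * erfinv(2*pihat - 1);   % (206) Thurstone-Mosteller: standard normal quantile via erfinv
Y_ang     = asin(2*pihat - 1);               % (208) angular transform model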

14.3. Hodge Decomposition of Pairwise Preference


HodgeRank on a graph G = (V, E) provides us with such a tool, which characterizes
the solution and the residue of (198), adaptive to the topological structure of G. The
following theorem, adapted from (JLYY11), describes a decomposition of Ŷ, which
can be visualized as edge flows on graph G with direction i → j if Ŷij > 0 and vice
versa. Before the statement of the theorem, we first define the triangle set T of G as
all the 3-cliques in G,

(209)    T = { {i, j, k} ∈ \binom{V}{3} : {i, j}, {j, k}, {k, i} ∈ E }.

Equipped with T , graph G becomes an abstract simplicial complex, the clique
complex χ(G) = (V, E, T ).
Theorem 1 [Hodge Decomposition of Paired Ranking]. Let Ŷij be a
paired comparison flow on graph G = (V, E), i.e., Ŷij = −Ŷji for {i, j} ∈ E, and
Ŷij = 0 otherwise. There is a unique decomposition of Ŷ satisfying
(210)    Ŷ = Ŷ^g + Ŷ^h + Ŷ^c,
where
(211)    Ŷ^g_{ij} = ŝ_i − ŝ_j, for some ŝ ∈ R^V,
(212)    Ŷ^h_{ij} + Ŷ^h_{jk} + Ŷ^h_{ki} = 0, for each {i, j, k} ∈ T,
(213)    \sum_{j∼i} ω_{ij} Ŷ^h_{ij} = 0, for each i ∈ V.
The decomposition above is orthogonal under the following inner product on R^{|E|}:
⟨u, v⟩_ω = \sum_{{i,j}∈E} ω_{ij} u_{ij} v_{ij}.
Writing A = δ0 and B = δ1 in matrix form, so that the decomposition reads
ŷ = Ax + B^T z + w, note that B ◦ A = 0, since
(B ◦ A x)(i, j, k) = (x_i − x_j) + (x_j − x_k) + (x_k − x_i) = 0.
Hence
A^T ŷ = A^T (Ax + B^T z + w) = A^T A x ⇒ x = (A^T A)^† A^T ŷ,
B ŷ = B (Ax + B^T z + w) = B B^T z ⇒ z = (B B^T)^† B ŷ,
A^T w = B w = 0 ⇒ w ∈ ker(∆_1), where ∆_1 = A A^T + B^T B.
Figure 3. Hodge decomposition (three orthogonal components) of paired rankings (JLYY11).

Hodge decomposition as a rank-nullity theorem: take the product space V = X × Y × Z
and define

D = \begin{pmatrix} 0 & 0 & 0 \\ A & 0 & 0 \\ 0 & B & 0 \end{pmatrix},    BA = 0.

Rank-nullity theorem: im(D) + ker(D^*) = V; in particular,
Y = im(A) + ker(A^*)
  = im(A) + ker(A^*)/im(B^*) + im(B^*), since im(A) ⊆ ker(B)
  = im(A) + ker(A^*) ∩ ker(B) + im(B^*).
Laplacian:
L = (D + D^*)^2 = diag(A^*A, AA^* + B^*B, BB^*) = diag(L_0, L_1, L_2).
The following provides some remarks on the decomposition.
1. When G is connected, Ŷ^g is a rank-two skew-symmetric matrix and gives a
linear score function ŝ ∈ R^V up to translations. We thus call Ŷ^g a gradient flow,
since it is given by the difference (discrete gradient) of the score function ŝ on graph
nodes,
(214)    Ŷ^g_{ij} = (δ0 ŝ)(i, j) := ŝ_i − ŝ_j,
where δ0 : R^V → R^E is a finite difference operator (matrix) on G. Here ŝ can be chosen
as any least squares solution of (198), where we often choose the minimal norm
solution,
(215)    ŝ = ∆_0^† δ_0^* Ŷ,
where δ_0^* = δ_0^T W with W = diag(ω_{ij}), ∆_0 = δ_0^* · δ_0 is the unnormalized graph Laplacian
defined by (∆_0)_{ii} = \sum_{j∼i} ω_{ij} and (∆_0)_{ij} = −ω_{ij}, and (·)^† is the Moore-Penrose
(pseudo) inverse. On a complete and balanced graph, (215) reduces to
ŝ_i = \frac{1}{n−1} \sum_{j≠i} Ŷ_{ij}, often called the Borda count, the earliest preference aggregation rule
in social choice (JLYY11). For expander graphs such as regular graphs, the graph Laplacian
∆_0 has a small condition number and thus the global ranking is stable against noise
in the data.
2. Ŷ^h satisfies the two conditions (212) and (213), which are called the curl-free and
divergence-free conditions, respectively. The former requires the triangular trace
of Ŷ to be zero on every 3-clique in graph G, while the latter requires the total
sum (inflow minus outflow) to be zero at each node of G. These two conditions
characterize a linear subspace which is called the space of harmonic flows.
3. The residue Ŷ^c actually satisfies (213) but not (212). In fact, it measures
the amount of intrinsic (local) inconsistency in Ŷ characterized by the triangular
trace. We often call this component the curl flow. In particular, the following relative
curl,

(216)    curl^r_{ijk} = \frac{|Ŷ_{ij} + Ŷ_{jk} + Ŷ_{ki}|}{|Ŷ_{ij}| + |Ŷ_{jk}| + |Ŷ_{ki}|} = \frac{|Ŷ^c_{ij} + Ŷ^c_{jk} + Ŷ^c_{ki}|}{|Ŷ_{ij}| + |Ŷ_{jk}| + |Ŷ_{ki}|} ∈ [0, 1],

can be used to characterize triangular intransitivity; curl^r_{ijk} = 1 iff {i, j, k} contains
an intransitive triangle of Ŷ. Note that computing the percentage of curl^r_{ijk} = 1
is equivalent to calculating the Transitivity Satisfaction Rate (TSR) in complete
graphs.
Figure 3 illustrates the Hodge decomposition for paired comparison flows, and
Algorithm 14 shows how to compute the global ranking and the other components. The
readers may refer to (JLYY11) for the details of the theoretical development. Below we
just make a few comments on the application of HodgeRank in our setting.

Algorithm 14: Procedure of Hodge decomposition in Matlab pseudocode
Input: A paired comparison (hyper)graph G provided by assessors.
Output: Global score ŝ, gradient flow Ŷ^g, curl flow Ŷ^c, and harmonic flow Ŷ^h.
Initialization:
  Ŷ (a numEdge-vector consisting of the Ŷij defined above),
  W (a numEdge-vector consisting of the ωij).
Step 1:
  Compute δ0, δ1;              % δ0 = gradient, δ1 = curl
  δ0* = δ0' * diag(W);         % the conjugate (adjoint) of δ0
  ∆0 = δ0* * δ0;               % unnormalized graph Laplacian
  div = δ0* * Ŷ;               % divergence of the flow
  ŝ = lsqr(∆0, div);           % global score
Step 2:
  Compute the 1st projection onto the gradient flow: Ŷ^g = δ0 * ŝ;
Step 3:
  δ1* = δ1' * diag(1./W);
  ∆1 = δ1 * δ1*;
  curl = δ1 * Ŷ;
  z = lsqr(∆1, curl);
  Compute the 3rd projection onto the curl flow: Ŷ^c = δ1* * z;
Step 4:
  Compute the 2nd projection onto the harmonic flow: Ŷ^h = Ŷ − Ŷ^g − Ŷ^c.

1. To find a global ranking ŝ in (215), the recent developments of Spielman-Teng
(ST04) and Koutis-Miller-Peng (KMP10) suggest fast algorithms (almost linear in
|E| Poly(log |V |)) for this purpose.
2. The inconsistency of Ŷ has two parts: the global inconsistency measured by the
harmonic flow Ŷ^h and the local inconsistency measured by the curls in Ŷ^c. Due to the
orthogonal decomposition, ∥Ŷ^h∥^2_ω/∥Ŷ∥^2_ω and ∥Ŷ^c∥^2_ω/∥Ŷ∥^2_ω provide the percentages
of global and local inconsistencies, respectively.
3. A nontrivial harmonic component Ŷ h ̸= 0 implies the fixed tournament issue,
i.e., for any candidate i ∈ V , there is a paired comparison design by removing some
of the edges in G = (V, E) such that i is the overall winner.
4. One can control the harmonic component by controlling the topology of the
clique complex χ(G). In a loop-free clique complex χ(G), where β1 = 0, the harmonic
component vanishes. In this case, there are no cycles which traverse all the nodes,
e.g., 1 ≻ 2 ≻ 3 ≻ 4 ≻ . . . ≻ n ≻ 1. All the inconsistency will be summarized in
triangular cycles, e.g., i ≻ j ≻ k ≻ i.
Theorem 2. The linear space of harmonic flows has the dimension equal to
β1 , i.e., the number of independent loops in clique complex χ(G), which is called
the first order Betti number.
The Condorcet profile splits into local vs. global cycles: the residues ŷ^(c) = B^T z and
ŷ^(h) = w are cyclic rankings, accounting for conflicts of interests:
• ŷ^(c), the local/triangular inconsistency, i.e. triangular curls (Z_3-invariant):
ŷ^(c)_{ij} + ŷ^(c)_{jk} + ŷ^(c)_{ki} ≠ 0, {i, j, k} ∈ T.
Condorcet profile in harmonic ranking:
• ŷ^(h) = w, the global inconsistency, i.e. the harmonic ranking (Z_n-invariant):
voting chaos as circular coordinates on V ⇒ the fixed tournament issue.

Fortunately, with the aid of some random sampling principles, it is not hard to
obtain graphs whose β1 is zero.

14.4. Random Graph Theory and Sampling


In this section, we first describe two classical random models, the Erdös-Rényi
random graph and the random regular graph, and then we investigate the relation between
them.
Random graph models for crowdsourcing:
• Recall that in crowdsourcing ranking on the internet,
unspecified raters compare item pairs randomly,
online or by sequential sampling;
• random graph models for experimental designs:
P is a distribution on random graphs, invariant under permutations
(relabeling).
Generalized de Finetti's theorem [Aldous 1983, Kallenberg 2005]:
P(i, j) (P ergodic) is a uniform mixture of
h(u, v) = h(v, u) : [0, 1]^2 → [0, 1],
with h unique up to sets of zero measure.
Erdös-Rényi: P(i, j) = P(edge) = \int_0^1 \int_0^1 h(u, v) du dv =: p,
an edge-independent process (Chung-Lu'06).
14.4.0.1. Erdös-Rényi Random Graph. The Erdös-Rényi random graph G(n, p) starts
from n vertices and draws its edges independently according to a fixed probability
p. Such a random graph model is chosen to meet the scenario that, in crowdsourced
ranking, raters and videos come in an unspecified way. Among various models, the
Erdös-Rényi random graph is the simplest one, equivalent to i.i.d. sampling; therefore,
such a model is to be systematically studied in the paper.
However, to exploit the Erdös-Rényi random graph in crowdsourcing experimental
designs, one has to meet some conditions depending on our purpose:
1. The resulting graph should be connected, if we hope to derive global scores
for all videos in comparison;
2. The resulting graph should be loop-free in its clique complex, if we hope to
get rid of the global inconsistency in the harmonic component.
The two conditions can be easily satisfied for large Erdös-Rényi random graphs.
Theorem 3. Let G(n, p) be the set of Erdös-Rényi random graphs with n
nodes and edge appearance probability p. Then the following holds as n → ∞:
1. [Erdös-Rényi 1959] (ER59) if p ≫ log n/n, then G(n, p) is almost always
connected; and if p ≪ log n/n then G(n, p) is almost always disconnected;
2. [Kahle 2009] (Kah09; Kah13) if p = O(n^α), with α < −1 or α > −1/2, then
the expected β1 of the clique complex χ(G(n, p)) is almost always equal to zero,
i.e., loop-free.
• (Erdös-Rényi 1959) One phase transition for β0:
p ≪ n^{−(1+ϵ)} (∀ϵ > 0): almost always disconnected;
p ≫ log(n)/n: almost always connected.
• (Kahle 2009) Two phase transitions for βk (k ≥ 1):
p ≪ n^{−1/k} or p ≫ n^{−1/(k+1)}: almost always βk vanishes;
n^{−1/k} ≪ p ≪ n^{−1/(k+1)}: almost always βk is nontrivial.
For example, with n = 16 and 75% of the distinct edges included in G, the clique complex
χ(G) is with high probability connected and loop-free. In general, O(n log(n)) sampled
edges suffice for connectivity and O(n^{3/2}) for being loop-free.

These theories imply that when p is large enough, the Erdös-Rényi random graph
will meet the two conditions above with high probability. In particular, almost
linear O(n log n) edges suffice to derive a global ranking, and with O(n^{3/2}) edges the
harmonic-free condition is met.
Despite such an asymptotic theory for large random graphs, it remains a question
how to ensure that a given graph instance satisfies the two conditions. Fortunately,
the recent development in computational topology provides us with such a tool,
persistent homology, which will be illustrated in Section ??.
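Independently of the persistent homology machinery, for a single sampled instance the two
conditions can also be checked directly by linear algebra on the clique complex. The
following Matlab sketch (with n = 16 and p = 0.75 as in the example above; both values
are illustrative) computes β0 and β1 from the ranks of the coboundary matrices.

% Sample an Erdos-Renyi graph G(n, p) and check connectivity and beta_1 of its clique complex.
rng(2023);  n = 16;  p = 0.75;
Adj = triu(rand(n) < p, 1);  Adj = Adj + Adj';            % adjacency matrix
[I, J] = find(triu(Adj));  E = [I J];  m = size(E, 1);    % edge list with I < J
d0 = zeros(m, n);                                         % vertex-to-edge coboundary
for e = 1:m, d0(e, E(e,1)) = -1; d0(e, E(e,2)) = 1; end
beta0 = n - rank(d0);                                     % number of connected components
T = [];                                                   % enumerate triangles (3-cliques)
for a = 1:n
    for b = a+1:n
        for c = b+1:n
            if Adj(a,b) && Adj(b,c) && Adj(a,c), T = [T; a b c]; end %#ok<AGROW>
        end
    end
end
edgeid = @(i,j) find(E(:,1) == min(i,j) & E(:,2) == max(i,j));
d1 = zeros(size(T,1), m);                                 % edge-to-triangle coboundary (curl)
for t = 1:size(T,1)
    d1(t, edgeid(T(t,1), T(t,2))) =  1;
    d1(t, edgeid(T(t,2), T(t,3))) =  1;
    d1(t, edgeid(T(t,1), T(t,3))) = -1;
end
beta1 = m - rank(d0) - rank(d1);                          % first Betti number of the clique complex
fprintf('beta0 = %d (connected iff 1), beta1 = %d (loop-free iff 0)\n', beta0, beta1);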


14.4.1. Asymptotic Estimates for Fiedler Values. Key estimates of the
Fiedler value near the connectivity threshold [Braxton-Xu-Xiong-Y., ACHA16]:

(218)    G_0(n, m):  λ_2/(np) ≈ a_1(p_0, n) := 1 − \sqrt{\frac{2}{p_0}\left(1 − \frac{2}{n}\right)},

(219)    G(n, m):    λ_2/(np) ≈ a_2(p_0, n) := 1 − \sqrt{\frac{2}{p_0}(1 − p)},

where p_0 := 2m/(n log n) ≥ 1, p = p_0 log n / n, and

a(p_0) = 1 − \sqrt{2/p_0} + O(1/p_0), for p_0 ≫ 1.

Figure 4. Examples of k-regular graphs.

Figure 5. A comparison of the Fiedler value, minimal degree,
and the estimates a(p0), a1(p0), and a2(p0) for graphs generated via
random sampling with/without replacement and greedy sampling
at n = 64. Random sampling without replacement is as good as
greedy sampling!


Figure 6. Note: Crowd-BT is proposed by Chen et al. 2013

14.4.2. Active Sampling. Active sampling [Xu-Xiong-Chen-Huang-Y., AAAI'18]:
• Fisher information maximization: the greedy sampling above, unsupervised;
• Bayesian information maximization: supervised sampling,
with a closed-form online formula based on the Sherman-Morrison-Woodbury identity,
a faster and more accurate sampling scheme than those in the literature.
Supervised active sampling is more accurate, and both supervised and unsupervised
sampling reduce the chance of ranking chaos.

14.5. Online HodgeRank

Online HodgeRank [Xu-Huang-Yao'2012]: the Robbins-Monro (1951) algorithm for
∆_0 x = b̄ := δ_0^* ŷ,
x_{t+1} = x_t − γ_t (A_t x_t − b_t),    x_0 = 0,  E(A_t) = ∆_0,  E(b_t) = b̄.
Note:
• For each Y_t(i_{t+1}, j_{t+1}), updates only occur locally;
• Step size: γ_t = a(t + b)^{−1/2} (e.g. a = 1/λ_1(∆_0) and b large);
• Optimal convergence of x_t to x^* (the population solution) in t:
E∥x_t − x^*∥^2 ≤ O(t^{−1} · λ_2^{−2}(∆_0)),
where λ_2(∆_0) is the Fiedler value of the graph Laplacian;
• Tong Zhang's SVRG: E∥s_t − s^*∥^2 ≤ O(t^{−1} + λ_2^{−2}(∆_0) t^{−2}).
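A tiny simulation of the Robbins-Monro recursion above follows; the ground-truth scores,
the comparison noise, and the step-size constants a and b are illustrative assumptions.

% Online HodgeRank: stochastic local updates of the global score from a stream of comparisons.
rng(0);  n = 10;  T = 5000;
s_true = linspace(1, 0, n)';                  % ground-truth scores (assumed)
x = zeros(n, 1);
for t = 1:T
    i = randi(n);  j = randi(n);              % a random pair (skip if i == j)
    if i == j, continue; end
    y = sign(s_true(i) - s_true(j) + 0.3*randn);   % noisy binary comparison Y_ij
    gamma = 1 / sqrt(t + 100);                % step size gamma_t = a (t + b)^{-1/2}
    g = x(i) - x(j) - y;                      % local residual on edge {i, j}
    x(i) = x(i) - gamma * g;                  % updates only occur locally
    x(j) = x(j) + gamma * g;
end
x = x - mean(x);                              % fix the translation ambiguity of the score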

14.6. Robust HodgeRank

• Outliers are a sparse approximation of the cyclic rankings (curl + harmonic)
[Xu-Xiong-Huang-Y.'13]:
\min_γ ∥Π_{ker(A^*)}(ŷ − γ)∥^2 + λ∥γ∥_1.
• Robust ranking can be formulated as a Huber's LASSO (a minimal sketch follows
after this list):
\min_{x,γ} ∥ŷ − Ax − γ∥^2 + λ∥γ∥_1,
where the outlier γ is an incidental parameter (Neyman-Scott'1948)
and the global rating x is a structural parameter.
• Yet, LASSO is a biased estimator (Fan-Li'2001).
• A dual gradient descent (sparse mirror descent) dynamics [Osher-Ruan-
Xiong-Y.-Yin'2014, Huang-Sun-Xiong-Y.'2020],
• called the Inverse Scale Space dynamics in imaging,
• gives sign consistency under nearly the same conditions as LASSO (Wainwright'99),
yet returns an unbiased estimator,
• with a fast and scalable discretization as the Linearized Bregman Iteration.
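A minimal Matlab sketch of the Huber-LASSO formulation above via alternating
minimization, on synthetic data; this is the plain LASSO approach, not the
inverse-scale-space / Linearized Bregman dynamics advocated above, and all data sizes,
the outlier magnitude, and λ are illustrative.

% Robust ranking by alternating minimization of  min_{x,gamma} ||yhat - A*x - gamma||^2 + lambda*||gamma||_1
rng(1);  n = 8;  s_true = (n:-1:1)'/n;
[I, J] = find(triu(ones(n), 1));  m = numel(I);
A = zeros(m, n);
for e = 1:m, A(e, I(e)) = 1; A(e, J(e)) = -1; end     % (A s)_e = s_i - s_j
yhat = A * s_true + 0.05*randn(m, 1);
out = randperm(m, 5);  yhat(out) = yhat(out) + 3;     % a few gross outliers
lambda = 1;  gamma = zeros(m, 1);  Ap = pinv(A);
soft = @(r, tau) sign(r) .* max(abs(r) - tau, 0);     % soft-thresholding operator
for it = 1:200
    x = Ap * (yhat - gamma);                          % least squares with current outliers removed
    gamma = soft(yhat - A*x, lambda/2);               % sparse outlier update
end
x = x - mean(x);                                      % scores up to translation; compare with s_true - mean(s_true)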

14.7. From Social Choice to Individual Preferences


Conflicts are due to personalization (XXC+ 19):
cycles = personalized ranking + position bias + noise.
Linear mixed-effects model for an annotator's pairwise ranking:

(221)    y^u_{ij} = (θ_i + δ^u_i) − (θ_j + δ^u_j) + γ^u + ε^u_{ij},

where
• θ_i is the common global ranking score, as a fixed effect;
• δ^u_i is the annotator's preference deviation from the common ranking θ_i,
such that θ^u_i := θ_i + δ^u_i is u's personalized ranking;
• γ^u is the annotator's position bias, which captures careless behavior
of clicking one side during the comparisons;
• ε^u_{ij} is the random noise, which is assumed to be independent and identically
distributed with zero mean and bounded.
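A tiny simulation from model (221) follows, with all sizes, effect magnitudes, and noise
levels chosen only for illustration.

% Simulate annotator-specific pairwise comparisons from the linear mixed-effects model (221).
rng(7);  n = 6;  U = 20;  npairs = 15;
theta  = randn(n, 1);                    % common global ranking scores (fixed effects)
delta  = 0.3 * randn(n, U);              % annotator-specific deviations delta_i^u
gammau = 0.2 * randn(1, U);              % annotator position biases gamma^u
data = [];                               % rows: [u  i  j  y_ij^u]
for u = 1:U
    for t = 1:npairs
        ij = randperm(n, 2);  i = ij(1);  j = ij(2);
        y = (theta(i) + delta(i,u)) - (theta(j) + delta(j,u)) + gammau(u) + 0.1*randn;
        data = [data; u i j y];          %#ok<AGROW>
    end
end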

14.8. Lab and Further Studies


Figure 7. A two-level preference learning in MovieLens: (a) the common preference
with six representative occupation group preferences; (b) the purple path is the common
preference, the remaining 21 paths represent the occupation group preferences, the red are
the three groups with the most distinct preferences from the common, the blue are the three
groups with the most similar preferences to the common, and the green ones are the others
[Xu-Xiong-Huang-Cao-Y.'2019].
CHAPTER 15

Game Theory and Hodge Decomposition of Utilities

From single utility to multiple utilities: multiple utility flows for games. Extension to
multiplayer games on G = (V, E):
• V = {(x_1, . . . , x_n) =: (x_i, x_{−i})} = \prod_{i=1}^{n} S_i, an n-person game;
• undirected edges: {(x_i, x_{−i}), (x'_i, x_{−i})} ∈ E;
• each player has a utility function u_i(x_i, x_{−i});
• edge flow (1-form): u_i(x_i, x_{−i}) − u_i(x'_i, x_{−i}).

15.1. Nash and Correlated Equilibrium


Suppose π(x_i, x_{−i}), a joint distribution tensor on \prod_i S_i, satisfies, for all x_i, x'_i,

\sum_{x_{−i}} π(x_i, x_{−i}) (u_i(x_i, x_{−i}) − u_i(x'_i, x_{−i})) ≥ 0,

i.e. the expected flow (E[·|x_i]) is nonnegative. Then,
• the tensor π is a correlated equilibrium (CE, Aumann 1974);
• if π is a rank-one tensor,
π(x) = \prod_i µ_i(x_i),
then it is a Nash equilibrium (NE, Nash 1951);
• pure Nash equilibria are sinks;
• both are fully decided by the edge flow data.
• Players are never independent in reality, e.g. in Bayesian decision processes
(Aumann'87).
• Finding an NE is NP-hard, e.g. solving polynomial equations (Sturmfels'02,
Datta'03).
• Finding a CE is linear programming, and easy for graphical games (Papadimitriou-
Roughgarden'08).
• Some natural learning processes (best-response) converge to CE (Foster-
Vohra'97).
In graphical games:
• n players live on a network of n nodes;
• player i's utility only depends on its neighbor players' N(i) strategies;
• correlated equilibria allow a concise representation with parameters linear
in the size of the network (Kearns et al. 2001; 2003):
π(x) = \frac{1}{Z} \prod_{i=1}^{n} ψ_i(x_{N(i)}),
which is not rank-one, but a low-order interaction;
reduce the complexity from O(e^{2n}) to O(n e^{2d}) (d = max_i |N(i)|);
polynomial algorithms for CE in trees and chordal graphs.

15.2. Hodge Decomposition of Utilities


Theorem 15.2.1 (Candogan-Menache-Ozdaglar-Parrilo, 2011). Every finite game
admits a unique decomposition:
Potential Games ⊕ Harmonic Games ⊕ Neutral Games.
Furthermore:
• Shapley-Monderer condition: potential games ≡ quadrangular-curl free;
• extending G = (V, E) to a complex by adding quadrangular cells, harmonic
games can be further decomposed into (quadrangular) curl games.
For a bi-matrix game (A, B),
• the potential game is decided by ((A + A′)/2, (B + B′)/2);
• the harmonic game is the zero-sum part ((A − A′)/2, (B − B′)/2);
• computation of Nash equilibrium:
each of the components is tractable,
however their direct sum is NP-hard,
and an approximate potential game leads to an approximate NE.

Note: Shapley-Monderer condition ≡ harmonic-free ≡ quadrangular-curl free.


Does it suggest myopic greedy players might lead to
transient potential games + periodic equilibrium?

15.3. Potential Game and Shapley-Monderer Condition


15.4. Zero-sum Games
CHAPTER 16

*Towards Quantum Hodge Decomposition and TDA

This chapter is about Grover's algorithm for quantum eigen-decomposition and
Seth Lloyd's application to quantum Hodge decomposition and TDA (with Jian-Wei
Pan's experimental demonstration).

16.1. An Introduction to Quantum Linear Algebra


16.2. Quantum Hodge Decomposition
16.3. Quantum Persistent Homology
16.4. A Prototype Demo
Exercise

Bibliography

[ABET00] Nina Amenta, Marshall Bern, David Eppstein, and S-H Teng, Regres-
sion depth and center points, Discrete & Computational Geometry 23
(2000), no. 3, 305–323. 188, 189, 190
[Ach03] Dimitris Achlioptas, Database-friendly random projections: Johnson-
Lindenstrauss with binary coins, Journal of Computer and System Sci-
ences 66 (2003), 671–687. 59
[Ali95] F. Alizadeh, Interior point methods in semidefinite programming with
applications to combinatorial optimization, SIAM J. Optim. 5 (1995),
no. 1, 13–51. 80, 90
[Aro50] N. Aronszajn, Theory of reproducing kernels, Transactions of the
American Mathematical Society 68 (1950), no. 3, 337–404. 15, 17
[Arr63] Kenneth J. Arrow, Social choice and individual values, 2nd ed., Yale
University Press, New Haven, CT, 1963. 233
[Bav11] Francois Bavaud, On the schoenberg transformations in data analysis:
Theory and illustrations, Journal of Classification 28 (2011), no. 3,
297–314. 9, 16
[BDDW08] Richard Baraniuk, Mark Davenport, Ronald DeVore, and Michael
Wakin, A simple proof of the restricted isometry property for random
matrices, Constructive Approximation 28 (2008), no. 3, 253–263. 64,
70
[BE92] Andreas Buja and Nermin Eyuboglu, Remarks on parallel analysis,
Multivariate Behavioral Research 27 (1992), no. 4, 509–540. 47, 56
[BFOS07] M. Burger, K. Frick, S. Osher, and O. Scherzer, Inverse total variation
flow, SIAM Multiscale Model. Simul. 6 (2007), no. 2, 366–395. 71
[BG10] Y. Baryshnikov and Robert Ghrist, Euler integration over definable
functions, PNAS 107 (2010), no. 21, 9525–9530. 227
[BGOX06] Martin Burger, Guy Gilboa, Stanley Osher, and Jinjun Xu, Non-
linear inverse scale space methods, Communications in Mathematical
Sciences 4 (2006), no. 1, 179–212. 71, 74
[BLT+ 06] P. Biswas, T.-C. Liang, K.-C. Toh, T.-C. Wang, and Y. Ye, Semi-
definite programming approaches for sensor network localization with
noisy distance measurements, IEEE Transactions on Automation Sci-
ence and Engineering 3 (2006), 360–371. 88
[BN01] Mikhail Belkin and Partha Niyogi, Laplacian eigenmaps and spectral
techniques for embedding and clustering, Advances in Neural Informa-
tion Processing Systems (NIPS) 14, MIT Press, 2001, pp. 585–591.
116
[BN03] Mikhail Belkin and Partha Niyogi, Laplacian eigenmaps for dimen-
sionality reduction and data representation, Neural Computation 15


(2003), 1373–1396. 116, 117


[BN08] Mikhail Belkin and Partha Niyogi, Convergence of laplacian eigen-
maps, Tech. report, 2008. 118
[BP98] Sergey Brin and Larry Page, The anatomy of a large-scale hypertextual
web search engine, Proceedings of the 7th international conference on
World Wide Web (WWW) (Australia), 1998, pp. 107–117. 125
[Bri50] Glenn W Brier, Verification of forecasts expressed in terms of proba-
bility, Monthly Weather Review 78 (1950), no. 1, 1–3. 192
[BRT09] Peter J. Bickel, Ya’acov Ritov, and Alexandre B. Tsybakov, Simulta-
neous analysis of lasso and dantzig selector, Ann. Statist. 37 (2009),
no. 4, 1705–1732. 66
[BS10] Zhidong Bai and Jack W. Silverstein, Spectral analysis of large dimen-
sional random matrices, Springer, 2010. 42, 46
[BSS05] Andreas Buja, Werner Stuetzle, and Yi Shen, Loss functions for binary
class probability estimation and classification: Structure and applica-
tions, Working draft, November 3 (2005). 190, 192
[BTA04] Alain Berlinet and Christine Thomas-Agnan, Reproducing kernel
hilbert spaces in probability and statistics, Kluwer Academic Publish-
ers, 2004. 15
[Bur08] Martin Burger, A note on sparse reconstruction methods, Journal of
Physics Conference Series 124 (2008), no. 1, 012002. 71
[Can08] E. J. Candès, The restricted isometry property and its implications
for compressed sensing, Comptes Rendus de l’Académie des Sciences,
Paris, Série I 346 (2008), 589–592. 69
[CCS12] Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen, A singular
value thresholding algorithm for matrix completion, SIAM J. Optim.
20 (2012), no. 4, 1956–1982. 76
[CDD09] Albert Cohen, W Dahmen, and Ron DeVore, Compressed sensing and
best k-term approximation, J. Amer. Math. Soc 22 (2009), no. 1, 211–
231. 69
[CDS98] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders,
Atomic decomposition by basis pursuit, SIAM Journal on Scientific
Computing 20 (1998), 33–61. 38, 64, 65, 72
[CGR12] J. Curry, R. Ghrist, and M. Robinson, Euler calculus and its applica-
tions to signals and sensing, Proc. Sympos. Appl. Math., AMS, 2012.
227
[CGR18] Mengjie Chen, Chao Gao, and Zhao Ren, Robust covariance and scat-
ter matrix estimation under Huber’s contamination model, The Annals
of Statistics 46 (2018), no. 5, 1932–1960. 188, 189, 192, 193
[Cha04] Timothy M Chan, An optimal randomized algorithm for maximum
Tukey depth, Proceedings of the fifteenth annual ACM-SIAM sym-
posium on Discrete algorithms, Society for Industrial and Applied
Mathematics, 2004, pp. 430–436. 188, 189
[Chu05] Fan R. K. Chung, Laplacians and the cheeger inequality for directed
graphs, Annals of Combinatorics 9 (2005), no. 1, 1–19. 136
[CL06] Ronald R. Coifman and Stéphane. Lafon, Diffusion maps, Applied and
Computational Harmonic Analysis 21 (2006), 5–30. 120, 174, 175

[CLL+ 05] R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler,


F. Warner, and S. W. Zucker, Geometric diffusions as a tool for har-
monic analysis and structure definition of data: Diffusion maps i,
Proceedings of the National Academy of Sciences of the United States
of America 102 (2005), 7426–7431. 120, 161
[CLMW09] E. J. Candès, Xiaodong Li, Yi Ma, and John Wright, Robust principal
component analysis, Journal of ACM 58 (2009), no. 1, 1–37. 82, 83
[CMOP11] Ozan Candogan, Ishai Menache, Asuman Ozdaglar, and Pablo A.
Parrilo, Flows and decompositions of games: Harmonic and poten-
tial games, Mathematics of Operations Research 36 (2011), no. 3,
474–503. 202
[Coo07] R. Dennis Cook, Fisher lecture: Dimension reduction in regression,
Statistical Science 22 (2007), no. 1, 1–26. 50, 51
[CPW12] V. Chandrasekaran, P. A. Parrilo, and A. S. Willsky, Latent variable
graphical model selection via convex optimization (with discussion),
Annals of Statistics (2012), to appear, http://arxiv.org/abs/1008.
1290. 81, 83
[CR09] E. J. Candès and B. Recht, Exact matrix completion via convex opti-
mization, Foundation of Computational Mathematics 9 (2009), no. 6,
717–772. 84, 85
[CRPW12] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, The
convex geometry of linear inverse problems, Foundation of Computa-
tional Mathematics (2012), to appear, http://arxiv.org/abs/1012.
0621. 83
[CRT06] Emmanuel. J. Candès, Justin Romberg, and Terrence Tao, Robust
uncertainty principles: Exact signal reconstruction from highly incom-
plete frequency information, IEEE Trans. on Info. Theory 52 (2006),
no. 2, 489–509. 64, 67
[CSPW11] V. Chandrasekaran, S. Sanghavi, P.A. Parrilo, and A. Willsky, Rank-
sparsity incoherence for matrix decomposition, SIAM Journal on Op-
timization 21 (2011), no. 2, 572–596, http://arxiv.org/abs/0906.
2220. 83, 84
[CST03] N. Cristianini and J. Shawe-Taylor, An introduction to support vector
machines and other kernel-based learning methods, Cambridge Uni-
versity Press, 2003. 15
[CT05] E. J. Candès and Terrence Tao, Decoding by linear programming, IEEE
Trans. on Info. Theory 51 (2005), 4203–4215. 64
[CT06] Emmanuel. J. Candès and Terrence Tao, Near optimal signal recovery
from random projections: Universal encoding strategies, IEEE Trans.
on Info. Theory 52 (2006), no. 12, 5406–5425. 64
[CT07] Emmanuel Candes and Terence Tao, The dantzig selector: Statistical
estimation when p is much larger than n, Ann. Statist. 35 (2007),
no. 6, 2313–2351. 66
[CT10] E. J. Candès and T. Tao, The power of convex relaxation: Near-
optimal matrix completion, IEEE Transaction on Information Theory
56 (2010), no. 5, 2053–2080. 85
[CW11] Tony Cai and Lie Wang, Orthogonal matching pursuit for sparse signal
recovery, IEEE Transactions on Information Theory 57 (2011), no. 7,

4680–4688. 68, 75
[CWX10] Tony Cai, Lie Wang, and Guangwu Xu, Stable recovery of sparse sig-
nals and an oracle inequality, IEEE Transactions on Information The-
ory 56 (2010), no. 7, 3516–3522. 75
[CXZ09] Tony Cai, Guangwu Xu, and Jun Zhang, On recovery of sparse signals
via l1 minimization, IEEE Transactions on Information Theory 55
(2009), no. 7, 3588–3397. 68
[Dav88] H. David, The methods of paired comparisons, 2nd ed., Griffin’s Sta-
tistical Monographs and Courses, 41, Oxford University Press, New
York, NY, 1988. 239
[Daw07] A Philip Dawid, The geometry of proper scoring rules, Annals of the
Institute of Statistical Mathematics 59 (2007), no. 1, 77–93. 190
[DBS17] Simon S Du, Sivaraman Balakrishnan, and Aarti Singh, Computation-
ally efficient robust estimation of sparse functionals, arXiv preprint
arXiv:1702.07709 (2017). 188
[DG03a] Sanjoy Dasgupta and Anupam Gupta, An elementary proof of a theo-
rem of johnson and lindenstrauss, Random Structures and Algorithms
22 (2003), no. 1, 60–65. 59
[DG03b] David L. Donoho and Carrie Grimes, Hessian eigenmaps: Locally lin-
ear embedding techniques for high-dimensional data, Proceedings of
the National Academy of Sciences of the United States of America
100 (2003), no. 10, 5591–5596. 111, 113
[dGJL07] Alexandre d’Aspremont, Laurent El Ghaoui, Michael I. Jordan, and
Gert R. G. Lanckriet, A direct formulation for sparse pca using
semidefinite programming, SIAM Review 49 (2007), no. 3, http:
//arxiv.org/abs/cs/0406021. 86
[DH01] David L. Donoho and Xiaoming Huo, Uncertainty principles and ideal
atomic decomposition, IEEE Transactions on Information Theory 47
(2001), no. 7, 2845–2862. 67
[DKK+ 16] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur
Moitra, and Alistair Stewart, Robust estimators in high dimensions
without the computational intractability, Foundations of Computer
Science (FOCS), 2016 IEEE 57th Annual Symposium on, IEEE, 2016,
pp. 655–664. 188
[DKK+ 17] , Being robust (in high dimensions) can be practical, arXiv
preprint arXiv:1703.00893 (2017). 188
[DKK+ 18] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Jacob
Steinhardt, and Alistair Stewart, Sever: A robust meta-algorithm for
stochastic optimization, arXiv preprint arXiv:1803.02815 (2018). 188
[DKS16] Ilias Diakonikolas, Daniel Kane, and Alistair Stewart, Robust learning
of fixed-structure bayesian networks, arXiv preprint arXiv:1606.07384
(2016). 188
[DKS18a] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart, List-
decodable robust mean estimation and learning mixtures of spherical
gaussians, Proceedings of the 50th Annual ACM SIGACT Symposium
on Theory of Computing, ACM, 2018, pp. 1047–1060. 188
[DKS18b] Ilias Diakonikolas, Weihao Kong, and Alistair Stewart, Efficient algo-
rithms and lower bounds for robust linear regression, arXiv preprint

arXiv:1806.00040 (2018). 188


[EB01] M. Elad and A.M. Bruckstein, On sparse representations, Interna-
tional Conference on Image Processing (ICIP) (Tsaloniky, Greece),
November 2001. 67
[Efr10] Bradley Efron, Large-scale inference: Empirical bayes methods for es-
timation, testing, and prediction, Cambridge University Press, 2010.
31, 33
[EH16] B Efron and T Hastie, Computer age statistical inference: Algorithms,
evidence, and data science, Institute of Mathematical Statistics Mono-
graphs, 2016. 29
[ELVE08] Weinan E, Tiejun Li, and Eric Vanden-Eijnden, Optimal partition and
effective dynamics of complex networks, Proc. Nat. Acad. Sci. 105
(2008), 7907–7912. 147
[ER59] P. Erdős and A. Rényi, On random graphs I, Publ. Math. Debrecen 6
(1959), 290–297. 244
[EST09] Ioannis Z. Emiris, Frank J. Sottile, and Thorsten Theobald, Nonlinear
computational geometry, Springer, New York, 2009. 161
[EVE06] Weinan E and Eric Vanden-Eijnden, Towards a theory of transition
paths, J. Stat. Phys. 123 (2006), 503–523. 152, 154
[EVE10] Weinan E and Eric Vanden-Eijnden, Transition-path theory and path-
finding algorithms for the study of rare events, Annual Review of Phys-
ical Chemistry 61 (2010), 391–420. 152
[FGK03] K. Fischer, B. Gartner, and M. Kutz, Fast smallest-enclosing-ball com-
putation in high dimensions, Proceedings of the 11th Annual Euro-
pean Symposium on Algorithms (ESA) 2832 (2003), 630–641. 200
[FHX+ 16] Yanwei Fu, Timothy M. Hospedales, Tao Xiang, Jiechao Xiong, Shao-
gang Gong, Yizhou Wang, and Yuan Yao, Robust subjective visual
property prediction from crowdsourced pairwise labels, IEEE Transac-
tions on Pattern Analysis and Machine Intelligence 38 (2016), no. 3,
563–577. 72, 187
[FK18] Jianqing Fan and Donggyu Kim, Robust high-dimensional volatil-
ity matrix estimation for high-frequency factor model, Journal of the
American Statistical Association 113 (2018), no. 523, 1268–1283. 188
[FKL18] Jianqing Fan, Yuan Ke, and Yuan Liao, Augmented factor models with
applications to validating market risk factors and forecasting bond risk
premia, Manuscript (2018). 188
[FL01] Jianqing Fan and Runze Li, Variable selection via nonconcave penal-
ized likelihood and its oracle properties, Journal of American Statistical
Association (2001), 1348–1360. 38, 72
[FLL+ 20] Yanwei Fu, Chen Liu, Donghao Li, Xinwei Sun, Jinshan Zeng, and
Yuan Yao, Dessilbi: Exploring structural sparsity of deep networks via
differential inclusion paths, Thirty-seventh International Conference
on Machine Learning (ICML), 2020. 76
[Gao17] Chao Gao, Robust regression via multivariate regression depth,
Bernoulli (2017). 188, 189, 193
[GLYZ19] Chao Gao, Jiyi Liu, Yuan Yao, and Weizhi Zhu, Robust estimation and
generative adversarial networks, International Conference on Learning
Representations (ICLR), 2019, New Orleans, Louisiana. 187, 190, 191
[GPAM+ 14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio,
Generative adversarial nets, Advances in neural information process-
ing systems, 2014, pp. 2672–2680. 190, 191
[GR07] Tilmann Gneiting and Adrian E Raftery, Strictly proper scoring rules,
prediction, and estimation, Journal of the American Statistical Asso-
ciation 102 (2007), no. 477, 359–378. 190
[Gro11] David Gross, Recovering low-rank matrices from few coefficients in
any basis, IEEE Transactions on Information Theory 57 (2011), 1548,
arXiv:0910.1879. 85
[GUHY20] Hanlin Gu, Ilona Christy Unarta, Xuhui Huang, and Yuan Yao, Ro-
bust autoencoder gan for cryo-em image denoising, arXiv preprint
arXiv:2008.07307 (2020). 187
[GYZ20] Chao Gao, Yuan Yao, and Weizhi Zhu, Generative adversarial nets for
robust scatter estimation: A proper scoring rule perspective, Journal
of Machine Learning Research 21 (2020), 1 – 48, arXiv:1903.01944.
187, 190, 193
[HAvL05] M. Hein, J. Audibert, and U. von Luxburg, From graphs to manifolds:
weak and strong pointwise consistency of graph laplacians, COLT,
2005. 175
[Hor65] John L. Horn, A rationale and test for the number of factors in factor
analysis, Psychometrika 30 (1965), no. 2, 179–185. 47
[Hot33] Harold Hotelling, Analysis of a complex of statistical variables into
principal components, Journal of Educational Psychology 24 (1933),
417–441 and 498–520. 5
[HS78] Paul Richard Halmos and Viakalathur Shankar Sunder, Bounded in-
tegral operators on l2 spaces, Vol. 96 of Ergebnisse der Mathematik
und ihrer Grenzgebiete (Results in Mathematics and Related Areas),
Springer-Verlag, Berlin, 1978. 17
[HS89] Trevor Hastie and Werner Stuetzle, Principal curves, Journal of the
American Statistical Association 84 (1989), no. 406, 502–516. 114
[HSXY16] Chendi Huang, Xinwei Sun, Jiechao Xiong, and Yuan Yao, Split lbi:
An iterative regularization path with structural sparsity, Advances
in Neural Information Processing Systems (NIPS) 29 (D. D. Lee,
M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, eds.), 2016,
pp. 3369–3377. 72
[HSXY20] Chendi Huang, Xinwei Sun, Jiechao Xiong, and Yuan Yao, Boosting with structural sparsity: A differential inclusion ap-
proach, Applied and Computational Harmonic Analysis 48 (2020),
no. 1, 1–45, arXiv preprint arXiv:1704.04833. 72
[HTF01] Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The elements
of statistical learning, Springer, 2001. 51
[Hub64] Peter J Huber, Robust estimation of a location parameter, The annals
of mathematical statistics 35 (1964), no. 1, 73–101. 187
[Hub65] Peter J Huber, A robust version of the probability ratio test, The Annals of
Mathematical Statistics 36 (1965), no. 6, 1753–1758. 187
[Hub81] P. J. Huber, Robust statistics, New York: Wiley, 1981. 83, 188
[HY18] Chendi Huang and Yuan Yao, A unified dynamic approach to sparse
model selection, The 21st International Conference on Artificial Intel-
ligence and Statistics (AISTATS) (Lanzarote, Spain), 2018. 72
[JCVW15] Jiantao Jiao, Thomas A Courtade, Kartik Venkat, and Tsachy Weiss-
man, Justification of logarithmic loss via the benefit of side informa-
tion, IEEE Transactions on Information Theory 61 (2015), no. 10,
5357–5365. 191
[JL84] W. B. Johnson and J. Lindenstrauss, Extensions of Lipschitz maps into
a Hilbert space, Contemp Math 26 (1984), 189–206. 59
[JLYY11] Xiaoye Jiang, Lek-Heng Lim, Yuan Yao, and Yinyu Ye, Statistical
ranking and combinatorial Hodge theory, Mathematical Programming
127 (2011), no. 1, 203–244, arXiv:0811.1067 [stat.ML]. 202, 238,
240, 241, 242
[Joh06] I. Johnstone, High dimensional statistical inference and random ma-
trices, Proc. International Congress of Mathematicians, 2006. 27, 42
[JYLG12] Xiaoye Jiang, Yuan Yao, Han Liu, and Leo Guibas, Detecting network
cliques with radon basis pursuit, The Fifteenth International Confer-
ence on Artificial Intelligence and Statistics (AISTATS) (La Palma,
Canary Islands), April 21-23 2012. 66
[Kah09] Matthew Kahle, Topology of random clique complexes, Discrete Math-
ematics 309 (2009), 1658–1671. 244
[Kah13] Matthew Kahle, Sharp vanishing thresholds for cohomology of random flag
complexes, Annals of Mathematics (2013), arXiv:1207.0149. 244
[Kle99] Jon Kleinberg, Authoritative sources in a hyperlinked environment,
Journal of the ACM 46 (1999), no. 5, 604–632. 126
[KMM04] T. Kaczynski, K. Mischaikow, and M. Mrozek, Computational homol-
ogy, Springer, New York, 2004. 200
[KMP10] Ioannis Koutis, G. Miller, and Richard Peng, Approaching optimality
for solving sdd systems, FOCS ’10 51st Annual IEEE Symposium on
Foundations of Computer Science, 2010, pp. 235–244. 242
[KN08] S. Kritchman and B. Nadler, Determining the number of components
in a factor model from limited noisy data, Chemometrics and Intelli-
gent Laboratory Systems 94 (2008), 19–32. 41
[KSS18] Pravesh K Kothari, Jacob Steinhardt, and David Steurer, Robust mo-
ment estimation and improved clustering via sum of squares, Pro-
ceedings of the 50th Annual ACM SIGACT Symposium on Theory of
Computing, ACM, 2018, pp. 1035–1046. 188
[Li91] Ker-Chau Li, Sliced inverse regression for dimension reduction, Jour-
nal of the American Statistical Association 86 (1991), no. 414, 316–
327. 51, 52
[LK10] Dandan Li and Chung-Ping Kwong, Understanding latent semantic
indexing: A topological structure analysis using q-analysis, J. Am.
Soc. Inf. Sci. Technol. 61 (2010), no. 3, 592–608. 202
[LL11] Jian Li and Tiejun Li, Probabilistic framework for network partition,
Phys. A 390 (2011), 3579. 150
[LLE09] Tiejun Li, Jian Liu, and Weinan E, Probabilistic framework for net-
work partition, Phys. Rev. E 80 (2009), 026106. 150
[LM06] Amy N. Langville and Carl D. Meyer, Google’s pagerank and beyond:
The science of search engine rankings, Princeton University Press,
2006. 125
[LRV16] Kevin A Lai, Anup B Rao, and Santosh Vempala, Agnostic estimation
of mean and covariance, Foundations of Computer Science (FOCS),
2016 IEEE 57th Annual Symposium on, IEEE, 2016, pp. 665–674. 188
[LST13] Jason D Lee, Yuekai Sun, and Jonathan E Taylor, On model selection
consistency of penalized m-estimators: a geometric theory, Advances
in Neural Information Processing Systems (NIPS) 26, 2013, pp. 342–
350. 72
[LZ10] Yanhua Li and Zhili Zhang, Random walks on digraphs, the general-
ized digraph laplacian, and the degree of asymmetry, Algorithms and
Models for the Web-Graph, Lecture Notes in Computer Science, vol.
6516, 2010, pp. 74–85. 136
[LZ11] Gilad Lerman and Teng Zhang, Robust recovery of multiple subspaces
by geometric lp minimization, Annals of Statistics 39 (2011), no. 5,
2686–2715. 83
[Mey00] Carl D. Meyer, Matrix analysis and applied linear algebra, SIAM,
2000. 127
[Miz02] Ivan Mizera, On depth and deep points: a calculus, The Annals of
Statistics 30 (2002), no. 6, 1681–1736. 188, 189
[MLX+ 17] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang,
and Stephen Paul Smolley, Least squares generative adversarial net-
works, Computer Vision (ICCV), 2017 IEEE International Conference
on, IEEE, 2017, pp. 2813–2821. 192
[MM04] Ivan Mizera and Christine H Müller, Location–scale depth, Journal of
the American Statistical Association 99 (2004), no. 468, 949–966. 188
[MNY06] Ha Quang Minh, Partha Niyogi, and Yuan Yao, Mercer’s theorem,
feature maps, and smoothing, Proc. of Computational Learning
Theory (COLT), vol. 4005, 2006, pp. 154–168. 18
[MSVE09] Philipp Metzner, Christof Schütte, and Eric Vanden-Eijnden, Transi-
tion path theory for Markov jump processes, Multiscale Model. Simul.
7 (2009), 1192. 152, 154
[MZ93] S. G. Mallat and Z. Zhang, Matching pursuits with time-frequency dic-
tionaries, IEEE Transactions on Signal Processing 41 (1993), no. 12,
3397–3415. 65
[NBG10] R. R. Nadakuditi and F. Benaych-Georges, The breakdown point of
signal subspace estimation, IEEE Sensor Array and Multichannel Sig-
nal Processing Workshop (2010), 177–180. 42
[NCT16] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka, f-gan: Train-
ing generative neural samplers using variational divergence mini-
mization, Advances in Neural Information Processing Systems, 2016,
pp. 271–279. 190
[Noe60] G. Noether, Remarks about a paired comparison model, Psychometrika
25 (1960), 357–367. 239
[NSVE+ 09] Frank Noé, Christof Schütte, Eric Vanden-Eijnden, Lothar Reich,
and Thomas R. Weikl, Constructing the equilibrium ensemble of fold-
ing pathways from short off-equilibrium simulations, Proceedings of
the National Academy of Sciences of the United States of America
106 (2009), no. 45, 19011–19016. 152
[OBG+ 05] Stanley Osher, Martin Burger, Donald Goldfarb, Jinjun Xu, and
Wotao Yin, An iterative regularization method for total variation-
based image restoration, SIAM Journal on Multiscale Modeling and
Simulation 4 (2005), no. 2, 460–489. 38, 76
[ORX+ 16] Stanley Osher, Feng Ruan, Jiechao Xiong, Yuan Yao, and Wotao Yin,
Sparse recovery via differential inclusions, Applied and Computational
Harmonic Analysis 41 (2016), no. 2, 436–469, arXiv:1406.7728. 38,
68, 71, 76
[OW16] Art Owen and Jingshu Wang, Bi-cross-validation for factor analysis,
Statist. Sci. 31 (2016), no. 1, 119–139. 47
[Pea01] Karl Pearson, On lines and planes of closest fit to systems of points
in space, Philosophical Magazine 2 (1901), no. 11, 559–572. 5
[PVB17] Davy Paindaveine and Germain Van Bever, Halfspace depths
for scatter, concentration and shape matrices, arXiv preprint
arXiv:1704.06160 (2017). 188
[RH99] Peter J Rousseeuw and Mia Hubert, Regression depth, Journal of the
American Statistical Association 94 (1999), no. 446, 388–402. 188,
189
[RL00] Sam T. Roweis and Lawrence K. Saul, Locally linear embedding, Sci-
ence 290 (2000), no. 5500, 2323–2326. 101
[RS98] Peter J Rousseeuw and Anja Struyf, Computing location depth and
regression depth in higher dimensions, Statistics and Computing 8
(1998), no. 3, 193–203. 188, 189
[RXY18] Feng Ruan, Jiechao Xiong, and Yuan Yao, Libra: Linearized bregman
algorithms for generalized linear models, 2018, R package version 1.6,
https://cran.r-project.org/web/packages/Libra. 72
[Saa95] Donald G. Saari, Basic geometry of voting, Springer, 1995. 234
[Sch37] I. J. Schoenberg, On certain metric spaces arising from Euclidean
spaces by a change of metric and their imbedding in Hilbert space,
The Annals of Mathematics 38 (1937), no. 4, 787–793. 9
[Sch38a] I. J. Schoenberg, Metric spaces and completely monotone functions, The An-
nals of Mathematics 39 (1938), 811–841. 9, 16, 17
[Sch38b] I. J. Schoenberg, Metric spaces and positive definite functions, Transactions of
the American Mathematical Society 44 (1938), 522–536. 9, 15, 16
[Sen70] Amartya Sen, The impossibility of a paretian liberal, Journal of Polit-
ical Economy 78 (1970), no. 1, 152–157. 233
[She18] Peter S Shen, The 2017 nobel prize in chemistry: cryo-em comes of
age, Analytical and bioanalytical chemistry 410 (2018), no. 8, 2053–
2057. 195
[SHYW17] Xinwei Sun, Lingjing Hu, Yuan Yao, and Yizhou Wang, Gsplit lbi:
Taming the procedural bias in neuroimaging for disease prediction, In-
ternational Conference on Medical Image Computing and Computer-
Assisted Intervention (MICCAI), Springer, 2017, pp. 107–115. 72
[Sin06] Amit Singer, From graph to manifold laplacian: The convergence rate,
Applied and Computational Harmonic Analysis 21 (2006), 128–134.
174, 175, 179
[SSM98] B. Schölkopf, A. Smola, and K.-R. Müller, Nonlinear component anal-
ysis as a kernel eigenvalue problem, Neural Computation 10 (1998),
1299–1319. 18
[ST04] D. Spielman and Shang-Hua Teng, Nearly-linear time algorithms for
graph partitioning, graph sparsification, and solving linear systems,
STOC ’04 Proceedings of the thirty-sixth annual ACM symposium on
Theory of computing, 2004. 242
[Ste56] Charles Stein, Inadmissibility of the usual estimator for the mean of a
multivariate normal distribution, Proceedings of the Third Berkeley Sympo-
sium on Mathematical Statistics and Probability 1 (1956), 197–206.
27, 31
[Ste01] Ingo Steinwart, On the influence of the kernel on the consistency of
support vector machines, Journal of Machine Learning Research 2
(2001), 67–93. 18
[SW12] Amit Singer and Hau-Tieng Wu, Vector diffusion maps and the con-
nection laplacian, Comm. Pure Appl. Math. 65 (2012), no. 8, 1067–
1144. 185
[SY07] Anthony Man-Cho So and Yinyu Ye, Theory of semidefinite program-
ming for sensor network localization, Mathematical Programming, Se-
ries B 109 (2007), no. 2-3, 367–384. 90, 92
[SYZ08] Anthony Man-Cho So, Yinyu Ye, and Jiawei Zhang, A unified theorem
on sdp rank reduction, Mathematics of Operations Research 33 (2008),
no. 4, 910–920. 91
[Tao11] Terence Tao, Topics in random matrix theory, Lecture Notes in
UCLA, 2011. 46
[TdL00] J. B. Tenenbaum, Vin de Silva, and John C. Langford, A global geo-
metric framework for nonlinear dimensionality reduction, Science 290
(2000), 2319–2323. 161
[TdSL00] J. Tenenbaum, V. de Silva, and J. Langford, A global geometric
framework for nonlinear dimensionality reduction, Science 290 (2000),
no. 5500, 2319–2323. 101
[Tib96] R. Tibshirani, Regression shrinkage and selection via the lasso, J. of
the Royal Statistical Society, Series B 58 (1996), no. 1, 267–288. 38,
64, 66, 72
[Tro04] Joel A. Tropp, Greed is good: Algorithmic results for sparse approx-
imation, IEEE Trans. Inform. Theory 50 (2004), no. 10, 2231–2242.
67, 68, 75
[Tsy09] Alexandre Tsybakov, Introduction to nonparametric estimation,
Springer, 2009. 34, 38, 40
[Tuk75] John W Tukey, Mathematics and the picturing of data, Proceedings
of the International Congress of Mathematicians, Vancouver, 1975,
vol. 2, 1975, pp. 523–531. 188, 189
[Tyl87a] D. E. Tyler, A distribution-free m-estimator of multivariate scatter,
Annals of Statistics 15 (1987), no. 1, 234–251. 83, 84
[Tyl87b] David E Tyler, A distribution-free m-estimator of multivariate scatter,
The Annals of Statistics 15 (1987), no. 1, 234–251. 189
[Vap98] V. Vapnik, Statistical learning theory, Wiley, New York, 1998. 15
[Vem04] Santosh Vempala, The random projection method, Am. Math. Soc.,
Providence, 2004. 64
[Wah90] Grace Wahba, Spline models for observational data, CBMS-NSF Re-
gional Conference Series in Applied Mathematics 59, SIAM, 1990. 9,
18
[Wai09] Martin J. Wainwright, Sharp thresholds for high-dimensional and
noisy sparsity recovery using l1-constrained quadratic programming
(lasso), IEEE Transactions on Information Theory 55 (2009), no. 5,
2183–2202. 73, 75, 76
[WGL+ 16] Feng Wang, Huichao Gong, Gaochao Liu, Meijing Li, Chuangye Yan,
Tian Xia, Xueming Li, and Jianyang Zeng, Deeppicker: A deep learn-
ing approach for fully automated particle picking in cryo-em, Journal
of structural biology 195 (2016), no. 3, 325–336. 187, 195
[WLM09] Qiang Wu, Feng Liang, and Sayan Mukherjee, Localized sliced inverse
regression, Annual Conference on Neural Information Processing Sys-
tems (NIPS) (2009). 53
[WS06] Kilian Q. Weinberger and Lawrence K. Saul, Unsupervised learning of
image manifolds by semidefinite programming, International Journal
of Computer Vision 70 (2006), no. 1, 77–90. 91
[XRY18] Jiechao Xiong, Feng Ruan, and Yuan Yao, A tutorial on libra: R pack-
age for the linearized bregman algorithms in high dimensional statis-
tics, Handbook of Big Data Analytics, Springer, 2018, pp. 425 – 453.
72
[XXC+ 19] Qianqian Xu, Jiechao Xiong, Xiaochun Cao, Qingming Huang, and
Yuan Yao, From social to individuals: a parsimonious path of multi-
level models for crowdsourced preference aggregation, IEEE Transac-
tions on Pattern Analysis and Machine Intelligence 41 (2019), no. 4,
844–856. 187, 248
[XXCY16a] Qianqian Xu, Jiechao Xiong, Xiaochun Cao, and Yuan Yao, False dis-
covery rate control and statistical quality assessment of annotators in
crowdsourced ranking, International Conference on Machine Learning
(ICML), 2016, New York, June 19-24. 72
[XXCY16b] Qianqian Xu, Jiechao Xiong, Xiaochun Cao, and Yuan Yao, Parsimonious mixed-effects HodgeRank for crowdsourced
preference aggregation, ACM Multimedia Conference, 2016. 72
[XXHY13] Qianqian Xu, Jiechao Xiong, Qingming Huang, and Yuan Yao, Robust
evaluation for quality of experience in crowdsourcing, ACM Confer-
ence on Multimedia, 2013, pp. 43–52. 187
[XYJ+ 19] Qianqian Xu, Zhiyong Yang, Yangbangyan Jiang, Xiaochun Cao,
Qingming Huang, and Yuan Yao, Deep robust subjective visual prop-
erty prediction in crowdsourcing, IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2019, Long Beach, Califor-
nia, pp. 8985–8993. 187
[YH41] G. Young and A. S. Householder, A note on multidimensional psycho-
physical analysis, Psychometrika 6 (1941), 331–333. 9
[YL06] Ming Yuan and Yi Lin, Model selection and estimation in regression
with grouped variables, Journal of the Royal Statistical Society: Series
B (Statistical Methodology) 68 (2006), no. 1, 49–67. 75
[YL07] Ming Yuan and Yi Lin, On the nonnegative garrote estimator, Journal of the Royal
Statistical Society, Series B 69 (2007), no. 2, 143–161. 73
[YODG08] Wotao Yin, Stanley Osher, Jerome Darbon, and Donald Goldfarb,
Bregman iterative algorithms for compressed sensing and related prob-
lems, SIAM Journal on Imaging Sciences 1 (2008), no. 1, 143–168.
76
[ZCS14] Teng Zhang, Xiuyuan Cheng, and Amit Singer, Marcenko-Pastur law
for Tyler’s m-estimator. 83, 84
[Zha02] Jian Zhang, Some extensions of Tukey’s depth function, Journal of
Multivariate Analysis 82 (2002), no. 1, 134–165. 188, 189
[Zha16] Teng Zhang, Robust subspace recovery by Tyler’s m-estimator, Infor-
mation and Inference: A Journal of the IMA (2016), 1–23. 83, 84
[ZHT06] H. Zou, T. Hastie, and R. Tibshirani, Sparse principal compo-
nent analysis, Journal of Computational and Graphical Statistics 15
(2006), no. 2, 262–286. 86
[Zou06] Hui Zou, The adaptive lasso and its oracle properties, Journal of the
American Statistical Association 101 (2006), no. 476, 1418–1429. 73,
75
[ZSF+ 18] Bo Zhao, Xinwei Sun, Yanwei Fu, Yuan Yao, and Yizhou Wang, Msplit
lbi: Realizing feature selection and dense estimation simultaneously in
few-shot and zero-shot learning, International Conference on Machine
Learning (ICML), 2018. 72
[ZW] Zhenyue Zhang and Jing Wang, MLLE: Modified locally linear em-
bedding using multiple weights, http://citeseerx.ist.psu.edu/
viewdoc/summary?doi=10.1.1.70.382. 109, 110
[ZY06] Peng Zhao and Bin Yu, On model selection consistency of lasso, J.
Machine Learning Research 7 (2006), 2541–2567. 73, 75
[ZZ02] Zhenyue Zhang and Hongyuan Zha, Principal manifold and nonlinear
dimension reduction via local tangent space alignment, SIAM Journal
on Scientific Computing 26 (2002), 313–338. 114
[ZZ09] Hongyuan Zha and Zhenyue Zhang, Spectral properties of the align-
ment matrices in manifold learning, SIAM Review 51 (2009), no. 3,
545–566. 115
Index

K, 17
L_K, 17
H_K, 17

Algorithm
  Classical/Metric MDS, 11
  Kernel PCA/MDS, 19
  PCA, 7

Command
  SMACOF, 21
  cmdscale, 21
  mdscale, 21
  prcomp, 19
  princomp, 19
  sklearn.decomposition.PCA, 19
  sklearn.manifold.MDS, 21

covariance operator, 17

Linear Discriminant Analysis (LDA), 51

Mercer kernel, 17
Mercer’s Theorem, 17
Multidimensional Scaling (MDS), 9

PCA
  parallel analysis, 47

reproducing kernel Hilbert space, 17
reproducing property, 17

Sliced Inverse Regression (SIR), 52