

Contents
1 Singular Value Decomposition (SVD)

1.1 Singular Vectors
1.2 Singular Value Decomposition (SVD)
1.3 Best Rank k Approximations
1.4 Power Method for Computing the Singular Value Decomposition
1.5 Applications of Singular Value Decomposition
    1.5.1 Principal Component Analysis
    1.5.2 Clustering a Mixture of Spherical Gaussians
    1.5.3 An Application of SVD to a Discrete Optimization Problem
    1.5.4 SVD as a Compression Algorithm
    1.5.5 Spectral Decomposition
    1.5.6 Singular Vectors and ranking documents
1.6 Bibliographic Notes
1.7 Exercises


1 Singular Value Decomposition (SVD)


The singular value decomposition of a matrix A is the factorization of A into the product of three matrices A = UDV^T, where the columns of U and V are orthonormal and the matrix D is diagonal with positive real entries. The SVD is useful in many tasks. Here we mention some examples. First, in many applications, the data matrix A is close to a matrix of low rank and it is useful to find a low rank matrix which is a good approximation to the data matrix. We will show that from the singular value decomposition of A, we can get the matrix B of rank k which best approximates A; in fact we can do this for every k. Also, singular value decomposition is defined for all matrices (rectangular or square) unlike the more commonly used spectral decomposition in Linear Algebra. The reader familiar with eigenvectors and eigenvalues (we do not assume familiarity here) will also realize that we need conditions on the matrix to ensure orthogonality of eigenvectors. In contrast, the columns of V in the singular value decomposition, called the right singular vectors of A, always form an orthogonal set with no assumptions on A. The columns of U are called the left singular vectors and they also form an orthogonal set. A simple consequence of the orthogonality is that for a square and invertible matrix A, the inverse of A is V D^{-1} U^T, as the reader can verify.
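The following short numerical sketch, not part of the original text, illustrates the factorization and the inverse formula with numpy; all names and sizes are arbitrary choices for the example.

```python
# A minimal sketch (not from the text): compute the SVD of a small matrix with numpy
# and check the factorization A = U D V^T. Variable names are illustrative only.
import numpy as np

A = np.random.randn(6, 4)                         # an n x d data matrix, n = 6, d = 4
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD: U is 6x4, s has 4 values, Vt is 4x4

D = np.diag(s)                                    # diagonal matrix of singular values
print(np.allclose(A, U @ D @ Vt))                 # True: A = U D V^T

# columns of U and rows of Vt are orthonormal
print(np.allclose(U.T @ U, np.eye(4)))
print(np.allclose(Vt @ Vt.T, np.eye(4)))

# for a square invertible A, the inverse is V D^{-1} U^T
B = np.random.randn(4, 4)
Ub, sb, Vtb = np.linalg.svd(B)
print(np.allclose(np.linalg.inv(B), Vtb.T @ np.diag(1 / sb) @ Ub.T))
```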

To gain insight into the SVD, treat the rows of an n × d matrix A as n points in a d-dimensional
space and consider the problem of finding the best k-dimensional subspace with respect to the
set of points. Here best means minimize the sum of the squares of the perpendicular distances
of the points to the subspace. We begin with a special case of the problem where the subspace
is 1-dimensional, a line through the origin. We will see later that the best-fitting k-dimensional
subspace can be found by k applications of the best fitting line algorithm. Finding the best fitting
line through the origin with respect to a set of points {xi | 1 ≤ i ≤ n} in the plane means minimizing
the sum of the squared distances of the points to the line. Here distance is measured
perpendicular to the line.
The problem is called the best least squares fit.

In the best least squares fit, one is minimizing the distance to a subspace. An alternative
problem is to find the function that best fits some data. Here one variable y is a function of the
variables x1, x2, . . . , xd and one wishes to minimize the vertical distance, i.e., distance in the y
direction, to the subspace of the xi rather than minimize the perpendicular distance to the
subspace being fit to the data.

Returning to the best least squares fit problem, consider projecting a point xi onto a
line through the origin. Then
x_{i1}^2 + x_{i2}^2 + · · · + x_{id}^2 = (length of projection)^2 + (distance of point to line)^2.

See Figure 1.1. Thus

(distance of point to line)^2 = x_{i1}^2 + x_{i2}^2 + · · · + x_{id}^2 − (length of projection)^2.

Figure 1.1: The projection of the point xi onto the line through the origin in the direction of v. Minimizing the sum of the squared distances of the points to the line is equivalent to maximizing the sum of the squared lengths of their projections onto it.

To minimize the sum of the squares of the distances to the line, one could minimize

Σ_{i=1}^n (x_{i1}^2 + x_{i2}^2 + · · · + x_{id}^2) minus the sum of the squares of the lengths of the projections of
the points to the line. However, Σ_{i=1}^n (x_{i1}^2 + x_{i2}^2 + · · · + x_{id}^2) is a constant (independent of the
line), so minimizing the sum of the squares of the distances is equivalent to maximizing the sum of the squares
of the lengths of the projections onto the line. Similarly for best-fit subspaces, we could maximize the sum of the
squared lengths of the projections onto the subspace instead of minimizing the sum of squared distances to the
subspace.
The reader may wonder why we minimize the sum of squared perpendicular distances to the line. We could
alternatively have defined the best-fit line to be the one which minimizes the sum of perpendicular distances to
the line. There are examples where this definition gives a different answer than the line minimizing the sum of
perpendicular distances squared. [The reader could construct such examples.] The choice of the objective
function as the sum of squared distances seems arbitrary and in a way it is. But the square has many nice
mathematical properties - the first of these is the use of the Pythagorean theorem above to say that this is equivalent
to maximizing the sum of squared projections. We will see that in fact we can use the Greedy Algorithm to find
best-fit k dimensional subspaces (which we will define soon) and for this too, the square is important. The reader
should also recall from Calculus that the best-fit function is also defined in terms of least-squares fit. There too,
the existence of nice mathematical properties is the motivation for the square.
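The equivalence between minimizing squared distances and maximizing squared projections is easy to check numerically. The following sketch is not part of the original text; the point set and direction are arbitrary.

```python
# Sketch: for points x_i and a unit vector v, squared distance to the line through the
# origin in direction v plus squared projection length equals |x_i|^2 (Pythagoras),
# so sum of dist^2 = sum of |x_i|^2 - sum of (x_i . v)^2. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))       # 100 points in 5 dimensions (rows)
v = rng.standard_normal(5)
v /= np.linalg.norm(v)                  # unit direction

proj_len = X @ v                        # signed projection lengths x_i . v
dist_sq = np.linalg.norm(X - np.outer(proj_len, v), axis=1) ** 2

total = (np.linalg.norm(X, axis=1) ** 2).sum()
print(np.isclose(dist_sq.sum(), total - (proj_len ** 2).sum()))   # True
```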

1.1 Singular Vectors


We now define the singular vectors of an n × d matrix A. Consider the rows of A as n points in a d-
dimensional space. Consider the best fit line through the origin. Let v be a

3
Machine Translated by Google

unit vector along this line. The length of the projection of ai, the i-th row of A, onto v is |ai · v|.
From this we see that the sum of the squared lengths of the projections is |Av|^2. The best fit line is the
one maximizing |Av|^2 and hence minimizing the sum of the squared distances of the points to
the line.

With this in mind, define the first singular vector, v1, of A, which is a column vector, as the best fit
line through the origin for the n points in d-space that are the rows of A. Thus

v1 = arg max_{|v|=1} |Av|.

The value σ1(A) = |Av1| is called the first singular value of A. Note that σ1^2 is the
sum of the squares of the projections of the points onto the line determined by v1.
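A small numerical check, not in the text, that the first right singular vector returned by a library SVD does maximize |Av| against random competing unit vectors. Names are illustrative.

```python
# Sketch: compare |A v1| with |A v| for random unit vectors v. Illustrative only.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 8))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
v1 = Vt[0]                                   # first right singular vector
sigma1 = s[0]                                # first singular value = |A v1|

print(np.isclose(np.linalg.norm(A @ v1), sigma1))
for _ in range(1000):
    v = rng.standard_normal(8)
    v /= np.linalg.norm(v)
    assert np.linalg.norm(A @ v) <= sigma1 + 1e-9   # no sampled unit vector beats v1
print("v1 maximizes |Av| over the random unit vectors tried")
```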

The greedy approach to find the best fit 2-dimensional subspace for a matrix A takes v1 as the first basis vector for the 2-dimensional subspace and finds the best 2-dimensional subspace containing v1. The fact that we are using the sum of squared distances will again help. For every 2-dimensional subspace containing v1, the sum of squared lengths of the projections onto the subspace equals the sum of squared projections onto v1 plus the sum of squared projections along a vector perpendicular to v1 in the subspace. Thus, instead of looking for the best 2-dimensional subspace containing v1, look for a unit vector, call it v2, perpendicular to v1 that maximizes |Av|^2 among all such unit vectors. Using the same greedy strategy to find the best three and higher dimensional subspaces defines v3, v4, . . . in a similar manner. This is captured in the following definitions. There is no a priori guarantee that the greedy algorithm gives the best fit. But, in fact, the greedy algorithm does work and yields the best-fit subspaces of every dimension, as we will show.

The second singular vector, v2, is defined by the best fit line perpendicular to v1:

v2 = arg max_{v⊥v1, |v|=1} |Av|.

The value σ2(A) = |Av2| is called the second singular value of A. The third singular vector v3 is defined
similarly by

v3 = arg max_{v⊥v1,v2, |v|=1} |Av|

and so on. The process stops when we have found

v1, v2, . . . , vr

as singular vectors and

arg max_{v⊥v1,v2,...,vr, |v|=1} |Av| = 0.

If instead of finding v1 that maximized |Av| and then the best fit 2-dimensional subspace containing
v1, we had found the best fit 2-dimensional subspace, we might have


done better. This is not the case. We now give a simple proof that the greedy algorithm indeed finds the best subspaces of every
dimension.

Theorem 1.1 Let A be an n × d matrix where v1, v2, . . . , vr are the singular vectors defined above. For 1 ≤ k ≤ r, let Vk be the subspace spanned by v1, v2, . . . , vk. Then for each k, Vk is the best-fit k-dimensional subspace for A.

Proof: The statement is obviously true for k = 1. For k = 2, let W be a best-fit 2-dimensional subspace for A. For any orthonormal basis w1, w2 of W, |Aw1|^2 + |Aw2|^2 is the sum of squared lengths of the projections of the rows of A onto W. Now, choose an orthonormal basis w1, w2 of W so that w2 is perpendicular to v1. If v1 is perpendicular to W, any unit vector in W will do as w2. If not, choose w2 to be the unit vector in W perpendicular to the projection of v1 onto W. Since v1 was chosen to maximize |Av1|^2, it follows that |Aw1|^2 ≤ |Av1|^2. Since v2 was chosen to maximize |Av2|^2 over all v perpendicular to v1, |Aw2|^2 ≤ |Av2|^2. Thus

|Aw1|^2 + |Aw2|^2 ≤ |Av1|^2 + |Av2|^2.

Hence, V2 is at least as good as W and so is a best-fit 2-dimensional subspace.

For general k, proceed by induction. By the induction hypothesis, V_{k−1} is a best-fit (k−1)-dimensional subspace. Suppose W is a best-fit k-dimensional subspace. Choose an orthonormal basis w1, w2, . . . , wk of W so that wk is perpendicular to v1, v2, . . . , v_{k−1}. Then

|Aw1|^2 + |Aw2|^2 + · · · + |Awk|^2 ≤ |Av1|^2 + |Av2|^2 + · · · + |Av_{k−1}|^2 + |Awk|^2

since V_{k−1} is an optimal (k−1)-dimensional subspace. Since wk is perpendicular to v1, v2, . . . , v_{k−1}, by the definition of vk, |Awk|^2 ≤ |Avk|^2. Thus

|Aw1|^2 + |Aw2|^2 + · · · + |Aw_{k−1}|^2 + |Awk|^2 ≤ |Av1|^2 + |Av2|^2 + · · · + |Av_{k−1}|^2 + |Avk|^2,

proving that Vk is at least as good as W and hence is optimal.

Note that the n-vector Avi is really a list of lengths (with signs) of the projections of the rows of A onto vi. Think of |Avi| = σi(A) as the “component” of the matrix A along vi. For this interpretation to make sense, it should be true that adding up the
squares of the components of A along each of the vi gives the square of the “whole content of the matrix A”. This is indeed the
case and is the matrix analogy of decomposing a vector into its components along orthogonal directions.

Consider one row, say aj, of A. Since v1, v2, . . . , vr span the space of all rows of A, aj · v = 0 for all v perpendicular to v1, v2, . . . , vr. Thus, for each row aj, Σ_{i=1}^r (aj · vi)^2 = |aj|^2. Summing over all rows j,

Σ_{j=1}^n |aj|^2 = Σ_{j=1}^n Σ_{i=1}^r (aj · vi)^2 = Σ_{i=1}^r Σ_{j=1}^n (aj · vi)^2 = Σ_{i=1}^r |Avi|^2 = Σ_{i=1}^r σi^2(A).


But Σ_{j=1}^n |aj|^2 = Σ_{j=1}^n Σ_{k=1}^d a_{jk}^2, the sum of squares of all the entries of A. Thus, the sum of
squares of the singular values of A is indeed the square of the “whole content of A”, i.e., the sum of
squares of all the entries. There is an important norm associated with this quantity, the Frobenius norm
of A, denoted ||A||_F, defined as

||A||_F = ( Σ_{j,k} a_{jk}^2 )^{1/2}.

Lemma 1.2 For any matrix A, the sum of squares of the singular values equals the square of the Frobenius norm. That is, Σ_i σi^2(A) = ||A||_F^2.

Proof: By the preceding discussion.
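The lemma is easy to confirm numerically; the following check is not from the text and uses an arbitrary random matrix.

```python
# Sketch (not from the text): check Lemma 1.2, the sum of squared singular values
# equals the squared Frobenius norm, i.e. the sum of squares of all entries.
import numpy as np

A = np.random.randn(7, 5)
s = np.linalg.svd(A, compute_uv=False)
print(np.isclose((s ** 2).sum(), (A ** 2).sum()))                 # True
print(np.isclose((s ** 2).sum(), np.linalg.norm(A, 'fro') ** 2))  # True
```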

A matrix A can be described fully by how it transforms the vectors vi. Every vector v can be written as a linear combination of v1, v2, . . . , vr and a vector perpendicular to all the vi. Now, Av is the same linear combination of Av1, Av2, . . . , Avr as v is of v1, v2, . . . , vr. So Av1, Av2, . . . , Avr form a fundamental set of vectors associated with A. We normalize them to length one by

ui = (1/σi(A)) Avi.

The vectors u1, u2, . . . , ur are called the left singular vectors of A. The vi are called the right singular vectors. The SVD theorem (Theorem 1.5) will fully explain the reason for these terms.

Clearly, the right singular vectors are orthogonal by definition. We now show that the left singular vectors are also orthogonal and that A = Σ_{i=1}^r σi ui vi^T.

Theorem 1.3 Let A be a rank r matrix. The left singular vectors of A, u1, u2, . . . , ur, are orthogonal.

Proof: The proof is by induction on r. For r = 1, there is only one ui so the theorem is trivially true. For the
inductive part consider the matrix

B = A − σ1 u1 v1^T.

The implied algorithm in the definition of singular value decomposition applied to B is identical to a run of the algorithm on A for its second and later singular vectors and singular values. To see this, first observe that Bv1 = Av1 − σ1 u1 v1^T v1 = 0. It then follows that the first right singular vector, call it z, of B will be perpendicular to v1, since if it had a component z1 along v1, then |B (z − z1)/|z − z1|| = |Bz|/|z − z1| > |Bz|, contradicting the arg max definition of z. But for any v perpendicular to v1, Bv = Av. Thus, the top singular


vector of B is indeed a second singular vector of A. Repeating this argument shows that a run of the algorithm on B is
the same as a run on A for its second and later singular vectors. This is left as an exercise.

Thus, there is a run of the algorithm that finds that B has right singular vectors v2, v3, . . . , vr and corresponding left singular vectors u2, u3, . . . , ur. By the induction hypothesis, u2, u3, . . . , ur are orthogonal.

It remains to prove that u1 is orthogonal to the other ui. Suppose not, and for some i ≥ 2, u1^T ui ≠ 0. Without loss of generality assume that u1^T ui > 0. The proof is symmetric for the case where u1^T ui < 0.

Now, for infinitesimally small ε > 0, the vector

A (v1 + εvi)/|v1 + εvi| = (σ1 u1 + εσi ui)/(1 + ε^2)^{1/2}

has length at least as large as its component along u1, which is

u1^T ( (σ1 u1 + εσi ui)/(1 + ε^2)^{1/2} ) = (σ1 + εσi u1^T ui)(1 − ε^2/2 + O(ε^4)) = σ1 + εσi u1^T ui − O(ε^2) > σ1,

a contradiction. Thus, u1, u2, . . . , ur are orthogonal.

1.2 Singular Value Decomposition (SVD)

Let A be an n × d matrix with singular vectors v1, v2, . . . , vr and corresponding singular values σ1, σ2, . . . , σr. Then ui = (1/σi) Avi, for i = 1, 2, . . . , r, are the left singular vectors and, by Theorem 1.5, A can be decomposed into a sum of rank one matrices as

A = Σ_{i=1}^r σi ui vi^T.

We first prove a simple lemma stating that two matrices A and B are identical if Av = Bv for all v. The lemma states
that in the abstract, a matrix A can be viewed as a transformation that maps vector v onto Av.

Lemma 1.4 Matrices A and B are identical if and only if for all vectors v, Av = Bv.

Proof: Clearly, if A = B then Av = Bv for all v. For the converse, suppose that Av = Bv for all v. Let ei be the vector that is all zeros except for the i-th component, which has value 1. Now Aei is the i-th column of A, and thus A = B if, for each i, Aei = Bei.


Figure 1.2: The SVD decomposition of an n × d matrix: A (n × d) = U (n × r) D (r × r) V^T (r × d).

Theorem 1.5 Let A be an n × d matrix with right singular vectors v1, v2, . . . , vr, left singular vectors u1, u2, . . . , ur, and corresponding singular values σ1, σ2, . . . , σr. Then

A = Σ_{i=1}^r σi ui vi^T.

Proof: For each singular vector vj, Avj = Σ_{i=1}^r σi ui vi^T vj. Since any vector v can be expressed as a linear combination of the singular vectors plus a vector perpendicular to the vi, Av = Σ_{i=1}^r σi ui vi^T v and, by Lemma 1.4, A = Σ_{i=1}^r σi ui vi^T.

The decomposition is called the singular value decomposition, SVD, of A. In matrix notation A = UDV^T, where the columns of U and V consist of the left and right singular vectors, respectively, and D is a diagonal matrix whose diagonal entries are the singular values of A.

For any matrix A, the sequence of singular values is unique and if the singular values are all distinct, then the
sequence of singular vectors is unique also. However, when some set of singular values are equal, the corresponding
singular vectors span some subspace.
Any set of orthonormal vectors spanning this subspace can be used as the singular vectors.

1.3 Best Rank k Approximations

There are two important matrix norms, the Frobenius norm denoted ||A||_F and the 2-norm denoted ||A||_2. The 2-norm of the matrix A is given by

max_{|v|=1} |Av|


and thus equals the largest singular value of the matrix.

Let A be an n × d matrix and think of the rows of A as n points in d-dimensional space. The Frobenius norm of
A is the square root of the sum of the squared distances of the points to the origin. The 2-norm is the square root of
the sum of squared distances to the origin along the direction that maximizes this quantity.

Let

A = Σ_{i=1}^r σi ui vi^T

be the SVD of A. For k ∈ {1, 2, . . . , r}, let

Ak = Σ_{i=1}^k σi ui vi^T

be the sum truncated after k terms. It is clear that Ak has rank k. Furthermore, Ak is the best rank k approximation to A when the error is measured in either the 2-norm or the Frobenius norm.
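The following sketch, not part of the original text, builds Ak by truncating a library SVD and checks the error statements proved below; the matrix and k are arbitrary.

```python
# Sketch: build Ak by truncating the SVD and check ||A - Ak||_2 = sigma_{k+1} and
# ||A - Ak||_F^2 = sigma_{k+1}^2 + ... + sigma_r^2. Names are illustrative.
import numpy as np

A = np.random.randn(20, 12)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # sum of the first k rank-one terms

print(np.linalg.matrix_rank(Ak) == k)
print(np.isclose(np.linalg.norm(A - Ak, 2), s[k]))                 # 2-norm error
print(np.isclose(np.linalg.norm(A - Ak, 'fro') ** 2, (s[k:] ** 2).sum()))
```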

Lemma 1.6 The rows of Ak are the projections of the rows of A onto the subspace Vk spanned by the first k
singular vectors of A.

Proof: Let a be an arbitrary row vector. Since the vi are orthonormal, the projection of the vector a onto Vk is given by Σ_{i=1}^k (a · vi) vi^T. Thus, the matrix whose rows are the projections of the rows of A onto Vk is given by Σ_{i=1}^k A vi vi^T. This last expression simplifies to

Σ_{i=1}^k A vi vi^T = Σ_{i=1}^k σi ui vi^T = Ak.

The matrix Ak is the best rank k approximation to A in both the Frobenius and the 2-norm. First we show that
the matrix Ak is the best rank k approximation to A in the Frobenius norm.

Theorem 1.7 For any matrix B of rank at most k,

||A − Ak||_F ≤ ||A − B||_F.

Proof: Let B minimize ||A − B||_F^2 among all rank k or less matrices. Let V be the space spanned by the rows of B. The dimension of V is at most k. Since B minimizes ||A − B||_F^2, it must be that each row of B is the projection of the corresponding row of A onto V; otherwise, replacing the row of B with the projection of the corresponding row of A onto V does not change V, and hence the rank of B, but would reduce ||A − B||_F^2. Since each row of B is the projection of the corresponding row of A, it follows that ||A − B||_F^2 is the sum of squared distances of the rows of A to V. Since Ak minimizes the sum of squared distances of the rows of A to any k-dimensional subspace, it follows from Theorem 1.1 that ||A − Ak||_F ≤ ||A − B||_F.

Next we tackle the 2-norm. We first show that the square of the 2-norm of A − Ak is the square of the (k + 1)st singular value of A.

Lemma 1.8 ||A − Ak||_2^2 = σ_{k+1}^2.

Proof: Let A = Σ_{i=1}^r σi ui vi^T be the singular value decomposition of A. Then Ak = Σ_{i=1}^k σi ui vi^T and A − Ak = Σ_{i=k+1}^r σi ui vi^T. Let v be the top singular vector of A − Ak. Express v as a linear combination of v1, v2, . . . , vr. That is, write v = Σ_{i=1}^r αi vi. Then

|(A − Ak)v| = |Σ_{i=k+1}^r σi ui vi^T Σ_{j=1}^r αj vj| = |Σ_{i=k+1}^r αi σi ui| = ( Σ_{i=k+1}^r αi^2 σi^2 )^{1/2}.

The v maximizing this last quantity, subject to the constraint that |v|^2 = Σ_{i=1}^r αi^2 = 1, occurs when α_{k+1} = 1 and the rest of the αi are 0. Thus, ||A − Ak||_2^2 = σ_{k+1}^2, proving the lemma.

Finally, we prove that Ak is the best rank k 2-norm approximation to A.

Theorem 1.9 Let A be an n × d matrix. For any matrix B of rank at most k,

||A − Ak||_2 ≤ ||A − B||_2.

Proof: If A is of rank k or less, the theorem is obviously true since ||A − Ak||_2 = 0. Thus assume that A is of rank greater than k. By Lemma 1.8, ||A − Ak||_2^2 = σ_{k+1}^2. Now suppose there is some matrix B of rank at most k such that B is a better 2-norm approximation to A than Ak. That is, ||A − B||_2 < σ_{k+1}. The null space of B, Null(B) (the set of vectors v such that Bv = 0), has dimension at least d − k. Let v1, v2, . . . , v_{k+1} be the first k + 1 singular vectors of A. By a dimension argument, it follows that there exists a z ≠ 0 in

Null(B) ∩ Span{v1, v2, . . . , v_{k+1}}.


[Indeed, there are d − k independent vectors in Null(B); say, u1, u2, . . . , u_{d−k} are any d − k independent vectors in Null(B). Now, u1, u2, . . . , u_{d−k}, v1, v2, . . . , v_{k+1} are d + 1 vectors in d-space and so they are dependent; thus there are real numbers α1, α2, . . . , α_{d−k} and β1, β2, . . . , β_{k+1}, not all zero, so that Σ_{i=1}^{d−k} αi ui = Σ_{j=1}^{k+1} βj vj. Take z = Σ_{j=1}^{k+1} βj vj and observe that z cannot be the zero vector.] Scale z so that |z| = 1. We now show that for this vector z, which lies in the space of the first k + 1 singular vectors of A, |(A − B) z| ≥ σ_{k+1}. Hence the 2-norm of A − B is at least σ_{k+1}, contradicting the assumption that ||A − B||_2 < σ_{k+1}. First,

||A − B||_2^2 ≥ |(A − B) z|^2.

Since Bz = 0,

||A − B||_2^2 ≥ |Az|^2.

Since z is in Span{v1, v2, . . . , v_{k+1}},

|Az|^2 = |Σ_{i=1}^n σi ui vi^T z|^2 = Σ_{i=1}^n σi^2 (vi^T z)^2 = Σ_{i=1}^{k+1} σi^2 (vi^T z)^2 ≥ σ_{k+1}^2 Σ_{i=1}^{k+1} (vi^T z)^2 = σ_{k+1}^2.

It follows that

||A − B||_2^2 ≥ σ_{k+1}^2,

contradicting the assumption that ||A − B||_2 < σ_{k+1}. This proves the theorem.

1.4 Power Method for Computing the Singular Value Decomposition
Computing the singular value decomposition is an important branch of numerical analysis in which there have been many
sophisticated developments over a long period of time. Here we present an “in-principle” method to establish that the approximate SVD
of a matrix A can be computed in polynomial time. The reader is referred to numerical analysis texts for more details. The method we
present, called the Power Method, is simple and is in fact the conceptual starting point for many algorithms. It is easiest to describe first
in the case when A is square symmetric and has the same right and left singular vectors, namely,

A = Σ_{i=1}^r σi vi vi^T.

In this case, we have

A^2 = ( Σ_{i=1}^r σi vi vi^T )( Σ_{j=1}^r σj vj vj^T ) = Σ_{i,j=1}^r σi σj vi vi^T vj vj^T = Σ_{i=1}^r σi^2 vi vi^T,

where, first, we just multiplied the two sums and then observed that if i ≠ j, the dot product vi^T vj equals 0 by orthogonality. [Caution: The “outer product” vi vj^T is a matrix


and is not zero even for i ≠ j.] Similarly, if we take the k-th power of A, again all the cross terms are zero and we will get

A^k = Σ_{i=1}^r σi^k vi vi^T.

If we had σ1 > σ2, we would have

A^k/σ1^k → v1 v1^T.

Now we do not know σ1 beforehand and cannot find this limit, but if we just take A^k and divide by ||A^k||_F so that the Frobenius norm is normalized to 1, that matrix will converge to the rank 1 matrix v1 v1^T, from which v1 may be computed. [This is still an intuitive description, which we will make precise shortly.] First, we cannot make the assumption that A is square and has the same right and left singular vectors. But B = AA^T satisfies both these conditions. If, again, the SVD of A is Σ_i σi ui vi^T, then by

direct multiplication

B = AA^T = ( Σ_i σi ui vi^T )( Σ_j σj vj uj^T )
  = Σ_{i,j} σi σj ui vi^T vj uj^T = Σ_{i,j} σi σj ui (vi · vj) uj^T
  = Σ_i σi^2 ui ui^T,

since vi^T vj is the dot product of the two vectors and is zero unless i = j. This is the
spectral decomposition of B. Using the same kind of calculation as above,

B^k = Σ_i σi^{2k} ui ui^T.

As k increases, for i > 1, σi^{2k}/σ1^{2k} goes to zero and B^k is approximately equal to

σ1^{2k} u1 u1^T,

provided that, for each i > 1, σi(A) < σ1(A).

This suggests a way of finding σ1 and u1, by successively powering B. But there are two issues. First, if there is a significant gap between the first and second singular values of a matrix, then the above argument applies and the power method will quickly converge to the first left singular vector. Suppose there is no significant gap. In the extreme case, there may be ties for the top singular value. Then the above argument does not work.

There are cumbersome ways of overcoming this by assuming a “gap” between σ1 and σ2;


such proofs do have the advantage that with a greater gap, better results can be proved, but at the cost
of some mess. Here, instead, we will adopt a clean solution in Theorem 1.11 below which states that
even with ties, the power method converges to some vector in the span of those singular vectors
corresponding to the “nearly highest” singular values.

A second issue is that computing B^k costs k matrix multiplications when done in a straightforward manner, or O(log k) multiplications when done by successive squaring. Instead we compute

B^k x,

where x is a random unit length vector, the idea being that the component of x in the direction of u1 would get multiplied by σ1^2 each time, while the component of x along the other ui would be multiplied only by σi^2. Of course, if the component of x along u1 is zero to start with, this would not help at all; it would always remain 0. But this problem is fixed by picking x to be random, as we show in Lemma 1.10.

Each increase in k requires multiplying B by the vector B^{k−1}x, which we can further break up into

B^k x = A (A^T (B^{k−1} x)).

This requires two matrix-vector products, involving the matrices A^T and A. In many applications, data matrices are sparse - many entries are zero. [A leading example is the matrix of hypertext links in the web. There are more than 10^10 web pages and the matrix would be 10^10 by 10^10. But on average only about 10 entries per row are non-zero; so only about 10^11 of the possible 10^20 entries are non-zero.] Sparse matrices are often represented by giving just a linked list of the non-zero entries and their values. If A is represented in this sparse manner, then the reader can convince him/herself that we can do a matrix-vector product in time proportional to the number of nonzero entries in A.

Since B^k x ≈ σ1^{2k} (u1 · x) u1 is a scalar multiple of u1, u1 can be recovered from B^k x by normalization.
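The following sketch, not part of the original text, implements the iteration just described: repeatedly apply B = AA^T to a random unit vector via two matrix-vector products and normalize. The matrix, iteration count, and function name are arbitrary.

```python
# Sketch of the power method described above: iterate x <- A (A^T x), normalizing
# each step, to approximate the top left singular vector u1 and sigma1.
import numpy as np

def power_method(A, iterations=200, seed=0):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)                 # random unit start vector
    for _ in range(iterations):
        x = A @ (A.T @ x)                  # one application of B = A A^T
        x /= np.linalg.norm(x)             # normalize so the iterate stays bounded
    sigma1 = np.linalg.norm(A.T @ x)       # |A^T u1| = sigma1
    return x, sigma1

A = np.random.randn(30, 10)
u1_hat, sigma1_hat = power_method(A)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.isclose(sigma1_hat, s[0]))
print(np.isclose(abs(u1_hat @ U[:, 0]), 1.0))   # aligned with the true u1 (up to sign)
```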

We start with a technical Lemma needed in the proof of the theorem.

Lemma 1.10 Let (x1, x2, . . . , xd) be a unit d-dimensional vector picked at random from the set {x : |x| ≤ 1}. The probability that |x1| ≥ 1/(20√d) is at least 9/10.

Proof: We first show that for a vector v picked at random with |v| ≤ 1, the probability that |v1| ≥ 1/(20√d) is at least 9/10. Then we let x = v/|v|. This can only increase the value of |v1|, so the result follows.

Let α = 1/(20√d). The probability that |v1| ≥ α equals one minus the probability that |v1| ≤ α. The probability that |v1| ≤ α is equal to the fraction of the volume of the unit sphere with |v1| ≤ α. To get an upper bound on the volume of the sphere with |v1| ≤ α, consider twice the volume of the unit radius cylinder of height α. The volume of the


Figure 1.3: The volume of the cylinder of height 1/(20√d) is an upper bound on the volume of the hemisphere below x1 = 1/(20√d).

portion of the sphere with |v1| ≤ α is less than or equal to 2αV(d − 1) (recall the notation from Chapter 2) and

Prob(|v1| ≤ α) ≤ 2αV(d − 1)/V(d).

Now the volume of the unit radius sphere is at least twice the volume of the cylinder of height 1/√(d − 1) and radius √(1 − 1/(d − 1)), or

V(d) ≥ (2/√(d − 1)) V(d − 1) (1 − 1/(d − 1))^{(d−1)/2}.

Using (1 − x)^a ≥ 1 − ax,

V(d) ≥ (2/√(d − 1)) V(d − 1) (1 − ((d − 1)/2) · (1/(d − 1))) ≥ V(d − 1)/√(d − 1),

and

Prob(|v1| ≤ α) ≤ 2αV(d − 1) / (V(d − 1)/√(d − 1)) = 2α√(d − 1) ≤ √(d − 1)/(10√d) ≤ 1/10.

Thus the probability that |v1| ≥ 1/(20√d) is at least 9/10.

Theorem 1.11 Let A be an n × d matrix and x a random unit length vector. Let V be the space spanned by the left singular vectors of A corresponding to singular values greater than (1 − ε)σ1. Let k be Ω(ln(d/ε)/ε). Let w be the unit vector after k iterations of the power method, namely,

w = (AA^T)^k x / |(AA^T)^k x|.

The probability that w has a component of at least ε perpendicular to V is at most 1/10.

Proof: Let

A = Σ_{i=1}^r σi ui vi^T

be the SVD of A. If the rank of A is less than d, then complete {u1, u2, . . . , ur} into a basis {u1, u2, . . . , ud} of d-space. Write x in the basis of the ui's as

x = Σ_{i=1}^d ci ui.

Since (AA^T)^k = Σ_{i=1}^d σi^{2k} ui ui^T, it follows that (AA^T)^k x = Σ_{i=1}^d σi^{2k} ci ui. For a random unit length vector x picked independent of A, the ui are fixed vectors and picking x at random is equivalent to picking random ci. From Lemma 1.10, |c1| ≥ 1/(20√d) with probability at least 9/10.

Suppose that σ1, σ2, . . . , σm are the singular values of A that are greater than or equal to (1 − ε)σ1 and that σ_{m+1}, . . . , σn are the singular values that are less than (1 − ε)σ1. Now

|(AA^T)^k x|^2 = |Σ_{i=1}^n σi^{2k} ci ui|^2 = Σ_{i=1}^n σi^{4k} ci^2 ≥ σ1^{4k} c1^2 ≥ (1/(400d)) σ1^{4k},

with probability at least 9/10. Here we used the fact that a sum of positive quantities is at least as large as its first element, and the first element is greater than or equal to (1/(400d)) σ1^{4k} with probability at least 9/10. [Note: If we did not choose x at random in the beginning and use Lemma 1.10, c1 could well have been zero and this argument would not work.]

The squared length of the component of (AA^T)^k x perpendicular to the space V is

Σ_{i=m+1}^d σi^{4k} ci^2 ≤ (1 − ε)^{4k} σ1^{4k} Σ_{i=m+1}^d ci^2 ≤ (1 − ε)^{4k} σ1^{4k},

since Σ_{i=1}^n ci^2 = |x|^2 = 1. Thus, the component of w perpendicular to V is at most

(1 − ε)^{2k} σ1^{2k} / ( (1/(20√d)) σ1^{2k} ) = O(√d (1 − ε)^{2k}) = O(√d e^{−2εk}) = O(√d e^{−Ω(ln(d/ε))}) = O(ε),

as desired.


1.5 Applications of Singular Value Decomposition


1.5.1 Principal Component Analysis

The traditional use of SVD is in Principal Component Analysis (PCA). PCA is illustrated
by an example - customer-product data where there are n customers buying d products. Let
matrix A with elements a_{ij} represent the probability of customer i purchasing product j (or the
amount or utility of product j to customer i). One hypothesizes that there are really only k
underlying basic factors like age, income, family size, etc. that determine a customer's purchase
behavior. An individual customer's behavior is determined by some weighted combination of
these underlying factors. That is, a customer's purchase behavior can be characterized by a k-
dimensional vector where k is much smaller than n and d. The components of the vector are
weights for each of the basic factors.
Associated with each basic factor is a vector of probabilities, each component of which is the
probability of purchasing a given product by someone whose behavior depends only on that
factor. More abstractly, A is an n×d matrix that can be expressed as the product of two matrices
U and V where U is an n × k matrix expressing the factor weights for each customer and V is a
k × d matrix expressing the purchase probabilities of products that correspond to that factor.
One twist is that A may not be exactly equal to UV but close to it since there may be noise or random perturbations.

Taking the best rank k approximation Ak from the SVD (as described above) gives us such a U, V. In this traditional setting, one assumed that A was available fully and we wished to find U, V to identify the basic factors or, in some applications, to “denoise” A (if we think of A − UV as noise). Now imagine that n and d are very large, on the order of thousands or even millions; there is probably little one could do to estimate or even store A. In this setting, we may assume that we are given just a few entries of A and wish to estimate the rest. If A were an arbitrary matrix of size n × d, this would require Ω(nd) pieces of information and cannot be done with a few entries. But again hypothesize that A is a low rank matrix with added noise. If we now also assume that the given entries are randomly drawn according to some known distribution, then there is a possibility that SVD can be used to estimate the whole of A. This area is called collaborative filtering and one of its uses is to target an ad to a customer based on one or two purchases. We will not be able to describe it here.
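The following hypothetical illustration is not from the text: it generates a synthetic low-rank customer-product matrix plus noise and uses the best rank-k approximation to "denoise" it. All names, sizes, and the noise level are made up.

```python
# Sketch: a low-rank customer-product matrix plus noise, denoised by the best
# rank-k approximation from the SVD. All names and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 200, 50, 3                       # customers, products, hidden factors
U_true = rng.random((n, k))                # factor weights per customer
V_true = rng.random((k, d))                # purchase probabilities per factor
A = U_true @ V_true + 0.1 * rng.standard_normal((n, d))   # noisy observations

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation

print(np.linalg.norm(A - U_true @ V_true, 'fro'))   # error of the raw noisy matrix
print(np.linalg.norm(Ak - U_true @ V_true, 'fro'))  # typically smaller: denoised
```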

1.5.2 Clustering a Mixture of Spherical Gaussians

In clustering, we are given a set of points in d-space and the task is to partition the points
into k subsets (clusters) where each cluster consists of “nearby” points. Different definitions of
the goodness of a clustering lead to different solutions. Clustering is an important area which
we will study in detail in Chapter ??. Here we will see how to solve a particular clustering
problem using Singular Value Decomposition.

In general, a solution to any clustering problem comes up with k cluster centers


Figure 1.4: Customer-product data: the n × d customers-by-products matrix A expressed (approximately) as the product of an n × k customers-by-factors matrix U and a k × d factors-by-products matrix V.

which define the k clusters - a cluster is the set of data points which have a particular
cluster center as the closest cluster center. Hence the Voronoi cells of the cluster centers
determine the clusters. Using this observation, it is relatively easy to cluster points in two
or three dimensions. However, clustering is not so easy in higher dimensions. Many
problems have high-dimensional data and clustering problems are no exception. Clustering
problems tend to be NP-hard, so we do not have polynomial time algorithms to solve
them. One way around this is to assume stochastic models of input data and devise
algorithms to cluster under such models.

Mixture models are a very important class of stochastic models. A mixture is a


probability density or distribution that is the weighted sum of simple component probability densities. It is of the form w1p1 + w2p2 + · · · + wkpk, where p1, p2, . . . , pk are the basic densities and w1, w2, . . . , wk are positive real numbers called weights that add up to one. Clearly, w1p1 + w2p2 + · · · + wkpk is a probability density; it integrates to one.

PUT IN PICTURE OF A 1-DIMENSIONAL GAUSSIAN MIXTURE


The model fitting problem is to fit a mixture of k basic densities to n samples, each
sample drawn according to the same mixture distribution. The class of basic densities
is known, but the component weights of the mixture are not. Here, we deal with the
case where the basic densities are all spherical Gaussians. The samples are generated
by pick-ing an integer i from the set {1, 2, . . . , k} with probabilities w1, w2, . . . , wk, respectively.
Then, picking a sample according to pi and repeating the process n times. This process
generates n samples according to the mixture where the set of samples is naturally par-
titioned into k sets, each set corresponding to one pi .

The model-fitting problem consists of two subproblems. The first subproblem is to cluster the sample into k subsets where each subset was picked according to one component density. The second subproblem is to fit a distribution to each subset. We discuss


only the clustering problem here. The problem of fitting a single Gaussian to a set of data points
is a lot easier and was discussed in Section ?? of Chapter 2.

If the component Gaussians in the mixture have their centers very close together, then the
clustering problem is unresolvable. In the limiting case where a pair of component densities are
the same, there is no way to distinguish between them. What condition on the inter-center
separation will guarantee unambiguous clustering? First, by looking at 1-dimensional examples,
it is clear that this separation should be measured in units of the standard deviation, since the
density is a function of the number of standard deviations from the mean. In one dimension, if two
Gaussians have inter-center separation at least ten times the maximum of their standard
deviations, then they hardly overlap. What is the analog of this statement in higher dimensions?

For a d-dimensional spherical Gaussian with standard deviation σ in each direction¹, it is easy to see that the expected squared distance from the center is dσ^2. Define the radius r of the Gaussian to be the square root of the average squared distance from the center; so r is √d σ. If the inter-center separation between two spherical Gaussians, both of radius r, is at least 2r = 2√d σ, then it is easy to see that the densities hardly overlap. But this separation requirement grows with d. For problems with large d, this would impose a separation requirement not met in practice. The main aim is to answer affirmatively the question:

Can we show an analog of the 1-dimensional statement for large d: in a mixture of k spherical Gaussians in d-space, if the centers of each pair of component Gaussians are Ω(1) standard deviations apart, then we can separate the sample points into the k components (for k ∈ O(1)).

The central idea is the following: suppose, for a moment, we can (magically) find the subspace spanned by the k centers. Imagine projecting the sample points to this subspace. It is easy to see (see Lemma 1.12 below) that the projection of a spherical Gaussian with standard deviation σ is still a spherical Gaussian with standard deviation σ. But in the projection, the inter-center separation still remains the same. So in the projection, the Gaussians are distinct provided the inter-center separation (in the whole space) is Ω(√k σ), which is a lot smaller than the Ω(√d σ) needed for k << d. Interestingly, we will see that the subspace spanned by the k centers is essentially the best-fit k-dimensional subspace, which we can find by Singular Value Decomposition.

Lemma 1.12 Suppose p is a d-dimensional spherical Gaussian with center µ and standard deviation σ. The density of p projected onto an arbitrary k-dimensional subspace V is a spherical Gaussian with the same standard deviation.

Proof: Since p is spherical, the projection is independent of the k-dimensional subspace.


Pick V to be the subspace spanned by the first k coordinate vectors. For a point x =

¹ Since a spherical Gaussian has the same standard deviation in every direction, we call it the standard deviation of the Gaussian.


1. The best fit 1-dimensional subspace to a spherical Gaussian is the line through its center and the origin.

2. Any k-dimensional subspace containing the line is a best fit k-dimensional subspace for the Gaussian.

3. The best fit k-dimensional subspace for k Gaussians is the subspace containing their centers.

Figure 1.5: Best fit subspace to a spherical Gaussian.

(x1, x2, . . . , xd), we will use the notation x' = (x1, x2, . . . , xk) and x'' = (x_{k+1}, x_{k+2}, . . . , xd).
The density of the projected Gaussian at the point (x1, x2, . . . , xk) is

c e^{−|x'−µ'|^2/(2σ^2)} ∫ e^{−|x''−µ''|^2/(2σ^2)} dx'' = c' e^{−|x'−µ'|^2/(2σ^2)}.

This clearly implies the lemma.

We now show that the top k singular vectors produced by the SVD span the space of the k centers. First, we extend the notion
of best fit to probability distributions. Then we show that for a single spherical Gaussian (whose center is not the origin), the best fit
1-dimensional subspace is the line through the center of the Gaussian and the origin. Next, we show that the best fit k-dimensional
subspace for a single Gaussian (whose center is


not the origin) is any k-dimensional subspace containing the line through the Gaussian’s
center and the origin. Finally, for k spherical Gaussians, the best fit k-dimensional subspace
is the subspace containing their centers. Thus, the SVD finds the subspace that contains the
centers.

Recall that for a set of points, the best-fit line is the line passing through the origin which
minimizes the sum of squared distances to the points. We extend this definition to probability
densities instead of a set of points.

Definition 4.1: If p is a probability density in d space, the best fit line for p is the line l passing
through the origin that minimizes the expected squared (perpendicular) distance to the line,
namely,

∫ dist^2(x, l) p(x) dx.

Recall that a k-dimensional subspace is the best-fit subspace if the sum of squared
distances to it is minimized or equivalently, the sum of squared lengths of projections onto it
is maximized. This was defined for a set of points, but again it can be extended to a density
as above.

Definition 4.2: If p is a probability density in d-space and V is a subspace, then the expected
squared perpendicular distance of V to p, denoted f(V, p), is given by

f(V, p) = ∫ dist^2(x, V) p(x) dx,

where dist(x, V ) denotes the perpendicular distance from the point x to the subspace V .

For the uniform density on the unit circle centered at the origin, it is easy to see that
any line passing through the origin is a best fit line for the probability distribution.

Lemma 1.13 Let the probability density p be a spherical Gaussian with center µ ≠ 0.
The best fit 1-dimensional subspace is the line passing through µ and the origin.

Proof: For a randomly chosen x (according to p) and a fixed unit length vector v,

E[(v^T x)^2] = E[(v^T (x − µ) + v^T µ)^2]
            = E[(v^T (x − µ))^2 + 2 (v^T µ)(v^T (x − µ)) + (v^T µ)^2]
            = E[(v^T (x − µ))^2] + 2 (v^T µ) E[v^T (x − µ)] + (v^T µ)^2
            = E[(v^T (x − µ))^2] + (v^T µ)^2
            = σ^2 + (v^T µ)^2


since E[(v^T (x − µ))^2] is the variance in the direction v and E[v^T (x − µ)] = 0. The lemma follows from the fact that the best fit line v is the one that maximizes (v^T µ)^2, which is maximized when v is aligned with the center µ.

Lemma 1.14 For a spherical Gaussian with center µ, a k-dimensional subspace is a best fit subspace if and
only if it contains µ.

Proof: By symmetry, every k-dimensional subspace through µ has the same sum of distances squared to
the density. Now by the SVD procedure, we know that the best-fit k-dimensional subspace contains the
best fit line, i.e., contains µ. Thus, the lemma follows.

This immediately leads to the following theorem.

Theorem 1.15 If p is a mixture of k spherical Gaussians whose centers span a k-dimensional subspace,
then the best fit k-dimensional subspace is the one containing the centers.

Proof: Let p be the mixture w1p1+w2p2+· · ·+wkpk. Let V be any subspace of dimension k or less. Then,
the expected squared perpendicular distance of V to p is

f(V, p) = ∫ dist^2(x, V) p(x) dx
        = Σ_{i=1}^k wi ∫ dist^2(x, V) pi(x) dx
        ≥ Σ_{i=1}^k wi (squared distance of pi to its best fit k-dimensional subspace).

Choose V to be the space spanned by the centers of the densities pi. By Lemma ?? the last inequality becomes an equality, proving the theorem.

For an infinite set of points drawn according to the mixture, the k-dimensional SVD subspace gives
exactly the space of the centers. In reality, we have only a large number of samples drawn according to the
mixture. However, it is intuitively clear that as the number of samples increases, the set of sample points
approximates the probability density and so the SVD subspace of the sample is close to the space spanned
by the centers. The details of how close it gets as a function of the number of samples are technical and we
do not carry this out here.
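The following sketch, not part of the original text, illustrates the projection idea on synthetic data: sample from two well-separated spherical Gaussians in high dimension, project onto the top-k right-singular-vector subspace of the sample matrix, and observe that the components stay far apart while the dimension drops. All parameters are made up.

```python
# Sketch: points from k = 2 spherical Gaussians in d = 100 dimensions, projected onto
# the top-k SVD subspace of the sample matrix. Separation between the components is
# preserved while the dimension drops from d to k. Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(3)
d, k, n, sigma = 100, 2, 500, 1.0
centers = np.zeros((k, d))
centers[1, 0] = 10 * sigma                      # two centers 10 sigma apart
labels = rng.integers(0, k, size=n)
X = centers[labels] + sigma * rng.standard_normal((n, d))

_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = X @ Vt[:k].T                                # project points onto the top-k subspace

# mean distance to the own projected component mean vs. the other component's mean
means = np.stack([P[labels == j].mean(axis=0) for j in range(k)])
own = np.linalg.norm(P - means[labels], axis=1).mean()
other = np.linalg.norm(P - means[1 - labels], axis=1).mean()
print(own, other)     # own is around sigma*sqrt(k); other is around 10*sigma
```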


1.5.3 An Application of SVD to a Discrete Optimization Problem

In the last example, SVD was used as a dimension reduction technique. It found a k-dimensional
subspace (the space of centers) of a d-dimensional space and made the Gaussian clustering problem
easier by projecting the data to the subspace. Here, instead of fitting a model to data, we have an
optimization problem. Again applying dimension reduction to the data makes the problem easier. The use
of SVD to solve discrete optimization problems is a relatively new subject with many applications. We
start with an important NP-hard problem, the Maximum Cut Problem for a directed graph G(V, E).

The Maximum Cut Problem is to partition the node set V of a directed graph into two subsets S and S¯
so that the number of edges from S to S¯ is maximized. Let A be the adjacency matrix of the graph. With
each vertex i, associate an indicator variable xi .
The variable xi will be set to 1 for i ∈ S and 0 for i ∈ S̄. The vector x = (x1, x2, . . . , xn) is unknown and we are trying to find it (or equivalently the cut), so as to maximize the number of edges across the cut. The number of edges across the cut is precisely

Σ_{i,j} xi(1 − xj) a_{ij}.

Thus, the Maximum Cut Problem can be posed as the optimization problem

Maximize Σ_{i,j} xi(1 − xj) a_{ij} subject to xi ∈ {0, 1}.

In matrix notation,

Σ_{i,j} xi(1 − xj) a_{ij} = x^T A(1 − x),

where 1 denotes the vector of all 1's. So, the problem can be restated as

Maximize x^T A(1 − x) subject to xi ∈ {0, 1}.   (1.1)

The SVD is used to solve this problem approximately by computing the SVD of A and replacing A by Ak = Σ_{i=1}^k σi ui vi^T in (1.1) to get

Maximize x^T Ak(1 − x) subject to xi ∈ {0, 1}.   (1.2)

Note that the matrix Ak is no longer a 0-1 adjacency matrix.

We will show that:

1. For each 0-1 vector x, x^T Ak(1 − x) and x^T A(1 − x) differ by at most n^2/√(k + 1). Thus, the maxima in (1.1) and (1.2) differ by at most this amount.

2. A near optimal x for (1.2) can be found by exploiting the low rank of Ak, which by Item 1 is near optimal for (1.1), where near optimal means with additive error of at most n^2/√(k + 1).


First, we prove Item 1. Since x and 1 − x are 0-1 n-vectors, each has length at most √n. By the definition of the 2-norm, |(A − Ak)(1 − x)| ≤ √n ||A − Ak||_2. Now since x^T (A − Ak)(1 − x) is the dot product of the vector x with the vector (A − Ak)(1 − x),

|x^T (A − Ak)(1 − x)| ≤ n ||A − Ak||_2.

By Lemma 1.8, ||A − Ak||_2 = σ_{k+1}(A). The inequalities

(k + 1) σ_{k+1}^2 ≤ σ1^2 + σ2^2 + · · · + σ_{k+1}^2 ≤ ||A||_F^2 = Σ_{i,j} a_{ij}^2 ≤ n^2

imply that σ_{k+1}^2 ≤ n^2/(k + 1) and hence ||A − Ak||_2 ≤ n/√(k + 1), proving Item 1.

Next we focus on Item 2. It is instructive to look at the special case when k=1 and A is approximated by the
rank one matrix A1. An even more special case when the left and right singular vectors u and v are required to be
identical is already NP-hard to solve exactly because it subsumes the problem of whether for a set of n integers,
{a1, a2, . . . , an}, there is a partition into two subsets whose sums are equal. So, we look for algorithms that solve
the Maximum Cut Problem approximately.

For Item 2, we want to maximize Σ_{i=1}^k σi (x^T ui)(vi^T (1 − x)) over 0-1 vectors x. A piece of notation will be useful. For any S ⊆ {1, 2, . . . , n}, write ui(S) for the sum of the coordinates of the vector ui corresponding to elements in the set S, and similarly for vi. That is, ui(S) = Σ_{j∈S} u_{ij}. We will maximize Σ_{i=1}^k σi ui(S) vi(S̄) using dynamic programming.

For a subset S of {1, 2, . . . , n}, define the 2k-dimensional vector w(S) = (u1(S), v1(S̄), u2(S), v2(S̄), . . . , uk(S), vk(S̄)). If we had the list of all such vectors, we could find Σ_{i=1}^k σi ui(S) vi(S̄) for each of them and take the maximum. There are 2^n subsets S, but several S could have the same w(S), and in that case it suffices to list just one of them. Round each coordinate of each ui to the nearest integer multiple of 1/(nk^2). Call the rounded vector ũi. Similarly obtain ṽi. Let w̃(S) denote the vector (ũ1(S), ṽ1(S̄), ũ2(S), ṽ2(S̄), . . . , ũk(S), ṽk(S̄)). We will construct a list of all possible values of the vector w̃(S). [Again, if several different S's lead to the same vector w̃(S), we will keep only one copy on the list.] The list will be constructed by dynamic programming. For the recursive step of dynamic programming, assume we already have a list of all such vectors for S ⊆ {1, 2, . . . , i} and wish to construct the list for S' ⊆ {1, 2, . . . , i + 1}. Each S ⊆ {1, 2, . . . , i} leads to two possible S' ⊆ {1, 2, . . . , i + 1}, namely S and S ∪ {i + 1}. In the first case, w̃(S') = (ũ1(S), ṽ1(S̄) + ṽ_{1,i+1}, ũ2(S), ṽ2(S̄) + ṽ_{2,i+1}, . . .). In the second case, w̃(S') = (ũ1(S) + ũ_{1,i+1}, ṽ1(S̄), ũ2(S) + ũ_{2,i+1}, ṽ2(S̄), . . .). We put in these two vectors for each vector in the previous list. Then, crucially, we prune - i.e., eliminate duplicates.

Assume that k is constant. Now, we show that the error is at most n^2/√(k + 1), as claimed. Since the ui, vi are unit length vectors, |ui(S)|, |vi(S)| ≤ √n. Also |ũi(S) − ui(S)| ≤ n · 1/(nk^2) = 1/k^2, and similarly for vi. To

bound the error, we use an elementary fact: if a, b are reals with |a|, |b| ≤ M and we estimate a by a' and b by b' so that |a − a'|, |b − b'| ≤ δ ≤ M, then

|ab − a'b'| = |a(b − b') + b'(a − a')| ≤ |a||b − b'| + (|b| + |b − b'|)|a − a'| ≤ 3Mδ.


Using this, we get that


| Σ_{i=1}^k σi ũi(S) ṽi(S̄) − Σ_{i=1}^k σi ui(S) vi(S̄) | ≤ 3kσ1 √n / k^2 ≤ 3n^{3/2}/k,

and this meets the claimed error bound.


Next, we show that the running time is polynomially bounded. First, |ũi(S)|, |ṽi(S)| ≤ 2√n. Since ũi(S), ṽi(S) are all integer multiples of 1/(nk^2), there are at most O(n^{3/2}k^2) possible values of ũi(S), ṽi(S), from which it follows that the list of w̃(S) never gets larger than (O(n^{3/2}k^2))^{2k}, which for fixed k is polynomially bounded.
We summarize what we have accomplished.

Theorem 1.16 Given a directed graph G(V, E), a cut of size at least the maximum cut minus O(n^2/√k) can be computed in time polynomial in n for any fixed k.

It would be quite a surprise to have an algorithm that actually achieves the same accuracy
in time polynomial in n and k because this would give an exact max cut in polynomial time.

1.5.4 SVD as a Compression Algorithm

Suppose A is the pixel intensity matrix of a large image. The entry a_{ij} gives the intensity of the ij-th pixel. If A is n × n, the transmission of A requires transmitting O(n^2) real numbers. Instead, one could send Ak, that is, the top k singular values σ1, σ2, . . . , σk along with the left and right singular vectors u1, u2, . . . , uk and v1, v2, . . . , vk. This would require sending O(kn) real numbers instead of O(n^2) real numbers. If k is much smaller than n, this results in savings. For many images, a k much smaller than n can be used to reconstruct the image, provided that a very low resolution version of the image is sufficient. Thus, one could use SVD as a compression method.
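The following sketch of the compression idea is not from the text; no real image is used, a synthetic matrix stands in and the sizes are arbitrary.

```python
# Sketch: keep only the top k singular triplets of an n x n "image" and compare the
# storage, O(kn) numbers vs O(n^2), with the reconstruction error. Illustrative only.
import numpy as np

n, k = 256, 20
x = np.linspace(0, 1, n)
img = np.outer(np.sin(8 * x), np.cos(5 * x)) + 0.05 * np.random.randn(n, n)  # stand-in image

U, s, Vt = np.linalg.svd(img)
img_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

stored = k * (2 * n + 1)                      # k left vectors, k right vectors, k values
print(stored, "numbers instead of", n * n)
print("relative error:", np.linalg.norm(img - img_k, 'fro') / np.linalg.norm(img, 'fro'))
```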

It turns out that in a more sophisticated approach, for certain classes of pictures one
could use a fixed basis so that the top (say) hundred singular vectors are sufficient to
represent any picture approximately. This means that the space spanned by the top hundred
singular vectors is not too different from the space spanned by the top two hundred singular
vectors of a given matrix in that class. Compressing these matrices by this standard basis
can save substantially since the standard basis is transmitted only once and a matrix is
transmitted by sending the top several hundred singular values for the standard basis.

1.5.5 Spectral Decomposition


Let B be a square matrix. If the vector x and scalar ÿ are such that Bx = ÿx, then x is an
eigenvector of the matrix B and ÿ is the corresponding eigenvalue. We present here a
spectral decomposition theorem for the special case where B is of the form


B = AA^T for some (possibly rectangular) matrix A. If A is a real valued matrix, then B is symmetric and positive semi-definite. That is, x^T Bx ≥ 0 for all vectors x. The spectral decomposition theorem holds more generally and the interested reader should consult a linear algebra book.

Theorem 1.17 (Spectral Decomposition) If B = AA^T, then B = Σ_i σi^2 ui ui^T, where A = Σ_i σi ui vi^T is the singular value decomposition of A.

Proof:

B = AA^T = ( Σ_i σi ui vi^T )( Σ_j σj uj vj^T )^T
  = Σ_{i,j} σi σj ui vi^T vj uj^T
  = Σ_i σi^2 ui ui^T.

When the σi are all distinct, the ui are the eigenvectors of B and the σi^2 are the corresponding eigenvalues. If the σi are not distinct, then any vector that is a linear combination of those ui with the same eigenvalue is an eigenvector of B.
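Theorem 1.17 is easy to confirm numerically; the following check is not part of the text and uses an arbitrary random matrix.

```python
# Sketch (not from the text): check that the eigenvalues of B = A A^T are the squared
# singular values of A and that B = sum of sigma_i^2 u_i u_i^T.
import numpy as np

A = np.random.randn(6, 4)
B = A @ A.T
U, s, Vt = np.linalg.svd(A, full_matrices=False)

eigvals, eigvecs = np.linalg.eigh(B)                           # eigh: B is symmetric
print(np.allclose(np.sort(eigvals)[::-1][:len(s)], s ** 2))    # top eigenvalues = sigma_i^2
print(np.allclose(B, U @ np.diag(s ** 2) @ U.T))               # B = sum sigma_i^2 u_i u_i^T
```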

1.5.6 Singular Vectors and ranking documents

An important task for a document collection is to rank the documents. Recall the term-document vector representation from Chapter 2. A naive method would be to rank in order of the total length of the document (which is the sum of the components of its term-document vector). Clearly, this is not a good measure in many cases. This naive method attaches equal weight to each term and takes the projection (dot product) of the term-document vector in the direction of the all 1's vector. Is there a better weighting of terms, i.e., a better projection direction, which would measure the intrinsic relevance of the document to the collection? A good candidate is the best-fit direction for the collection of term-document vectors, namely the top (left) singular vector of the term-document matrix. An intuitive reason for this is that this direction has the maximum sum of squared projections of the collection and so can be thought of as a synthetic term-document vector best representing the document collection.

Ranking in order of the projection of each document's term vector along the best fit direction has a nice interpretation in terms of the power method. For this, we consider a different example, that of the web with hypertext links. The World Wide Web can be represented by a directed graph whose nodes correspond to web pages and directed edges to hypertext links between pages. Some web pages, called authorities, are the most


prominent sources for information on a given topic. Other pages called hubs, are ones that identify
the authorities on a topic. Authority pages are pointed to by many hub pages and hub pages point to
many authorities. One is led to what seems like a circular definition: a hub is a page that points to
many authorities and an authority is a page that is pointed to by many hubs.

One would like to assign hub weights and authority weights to each node of the web. If there are n nodes, the hub weights form an n-dimensional vector u and the authority weights form an n-dimensional vector v. Suppose A is the adjacency matrix representing the directed graph: a_{ij} is 1 if there is a hypertext link from page i to page j and 0 otherwise. Given the hub vector u, the authority vector v could be computed by the formula

vj = Σ_{i=1}^n ui a_{ij},

since the right hand side is the sum of the hub weights of all the nodes that point to node j. In matrix terms,

v = A^T u.

Similarly, given an authority vector v, the hub vector u could be computed by u = Av. Of course, at the start, we have neither vector. But the above suggests a power iteration. Start with any v. Set u = Av; then set v = A^T u and repeat the process. We know from the power method that this converges to the left and right singular vectors. So after sufficiently many iterations, we may use the left vector u as the hub weights vector and project each column of A onto this direction and rank columns (authorities) in order of this projection. But the projections just form the vector A^T u, which equals v. So we can just rank by order of the vj.

This is the basis of an algorithm called the HITS algorithm which was one of the early proposals
for ranking web pages.
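The following sketch, not part of the original text, runs the hub/authority iteration just described on a tiny made-up link graph; the graph and iteration count are arbitrary.

```python
# Sketch of the hub/authority iteration described above, on a small made-up directed
# graph. A[i, j] = 1 if page i links to page j. Illustrative only.
import numpy as np

A = np.array([[0, 1, 1, 0],      # page 0 links to 1 and 2
              [0, 0, 1, 0],      # page 1 links to 2
              [0, 0, 0, 1],      # page 2 links to 3
              [0, 1, 0, 0]],     # page 3 links to 1
             dtype=float)

v = np.ones(A.shape[1])          # start with any authority vector
for _ in range(100):
    u = A @ v                    # hub weights
    u /= np.linalg.norm(u)
    v = A.T @ u                  # authority weights
    v /= np.linalg.norm(v)

print("hub weights:      ", np.round(u, 3))
print("authority weights:", np.round(v, 3))
print("authority ranking:", np.argsort(-v))   # rank pages by v_j
```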

A different ranking called page rank is widely used. It is based on a random walk on the graph described above. (We will study random walks in detail in Chapter 5 and the reader may postpone reading this application until then.)

A random walk on the web goes from web page i to a randomly chosen neighbor of it. So if p_ij is
the probability of going from i to j, then p_ij is just 1/(number of hypertext links from i). Represent the
p_ij in a matrix P. This matrix is called the transition probability matrix of the random walk. Represent
the probabilities of being in each state at time t by the components of a row vector p(t). The
probability of being in state j at time t is given by the equation

p_j(t) = \sum_i p_i(t − 1) p_{ij}.

Then
p(t) = p(t − 1) P


and thus

p(t) = p(0) P^t.

The probability vector p(t) is computed by computing P to the power t. It turns out that under some
conditions, the random walk has a steady state probability vector that we can think of as p(∞). It
has turned out to be very useful to rank pages in decreasing order of p_j(∞), in essence saying that
the web pages with the highest steady state probabilities are the most important.
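
For illustration, a minimal sketch of this computation on a small, hypothetical transition matrix (this is
the plain iteration p(t) = p(t − 1)P, not the modified walk used by production ranking systems and
discussed below):

```python
import numpy as np

# Hypothetical transition matrix: row i holds the probabilities of moving
# from page i to each page; every row sums to 1.
P = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])

p = np.ones(3) / 3               # start from the uniform distribution p(0)
for _ in range(200):
    p = p @ P                    # p(t) = p(t - 1) P
print("steady state probabilities:", np.round(p, 4))
print("pages ranked by p_j(infinity):", np.argsort(-p))
```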

In the above explanation, the random walk goes from page i to one of the web pages pointed
to by i, picked uniformly at random. Modern techniques for ranking pages are more complex. A more
sophisticated random walk is used for several reasons. First, a web page might not contain any
links and thus there is nowhere for the walk to go. Second, a page that has no in-links will never
be reached. Even if every node had at least one in-link and one out-link, the graph might not be
strongly connected and the walk would eventually end up in some strongly connected component
of the graph. Another difficulty occurs when the graph is periodic, that is, the greatest common
divisor of all cycle lengths of the graph is greater than one. In this case, the random walk does not
converge to a stationary probability distribution but rather oscillates between some set of probability
distributions. We will consider this topic further in Chapter 5.

1.6 Bibliographic Notes


Singular value decomposition is fundamental to numerical analysis and linear algebra.
There are many texts on these subjects and the interested reader may want to study these.
A good reference is [?]. The material on clustering a mixture of Gaussians in Section 1.5.2 is from
[?]. Modeling data with a mixture of Gaussians is a standard tool in statistics.
Several well-known heuristics like the expectation-maximization algorithm are used to learn (fit) the
mixture model to data. Recently, in theoretical computer science, there has been modest progress
on provable polynomial-time algorithms for learning mixtures.
Some references are [?], [?], [?], [?]. The application to the discrete optimization problem is from
[?]. The section on ranking documents/webpages is from two influential papers, one on hubs and
authorities by Jon Kleinberg [?] and the other on pagerank by Page, Brin, Motwani and Winograd
[?].


1.7 Exercises
Exercise 1.1 (Best fit functions versus best least squares fit) In many experiments one collects the
value of a parameter at various instances of time. Let yi be the value of the parameter y at time xi.
Suppose we wish to construct the best linear approximation to the data in the sense that we wish
to minimize the mean square error. Here error is measured vertically rather than perpendicular to
the line. Develop formulas for m and b to minimize the mean square error of the points
{(x_i, y_i) | 1 ≤ i ≤ n} to the line y = mx + b.
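
For a numerical sanity check of whatever formulas you derive, the following sketch fits m and b by
the normal equations on made-up data (the data values below are arbitrary):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])     # hypothetical measurements

n = len(x)
# Setting the partial derivatives of (1/n) * sum (y_i - m x_i - b)^2 to zero gives:
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
b = (np.sum(y) - m * np.sum(x)) / n
print("m =", m, " b =", b)
```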

Exercise 1.2 Given five observed parameters, height, weight, age, income, and blood pressure of n
people, how would one find the best least squares fit subspace of the form

a1 (height) + a2 (weight) + a3 (age) + a4 (income) + a5 (blood pressure) = 0

Here a1, a2, . . . , a5 are the unknown parameters. If there is a good best fit 4-dimensional subspace,
then one can think of the points as lying close to a 4-dimensional sheet rather than points lying in 5-
dimensions. Why is it better to use the perpendicular distance to the subspace rather than vertical
distance where vertical distance to the subspace is measured along the coordinate axis
corresponding to one of the unknowns?

Exercise 1.3 What is the best fit line for each of the following sets of points?

1. {(0, 1),(1, 0)}

2. {(0, 1),(2, 0)}

3. The rows of the matrix


    17   4
     2  26
    11   7

Solution: (1) and (2) are easy to do from scratch. (1) y = x and (2) y = 2x. For (3), there is no simple
method. We will describe a general method later and this can be applied. But the best fit line is
v_1 = (1/\sqrt{5}) (1, 2)^T. Convince yourself that this is correct.

Exercise 1.4 Let A be a square n × n matrix whose rows are orthonormal. Prove that the columns
of A are orthonormal.

Solution: Since the rows of A are orthonormal, AA^T = I and hence A^T A A^T = A^T. Since A^T is
nonsingular it has an inverse (A^T)^{-1}, implying that A^T A A^T (A^T)^{-1} = A^T (A^T)^{-1}. Thus
A^T A = I, i.e., the columns of A are orthonormal.


Exercise 1.5 Suppose A is an n × n matrix with block diagonal structure with k equal size blocks
where all entries of the ith block are a_i with a_1 > a_2 > · · · > a_k > 0. Show that A has exactly k
nonzero singular vectors v_1, v_2, . . . , v_k where v_i has the value (k/n)^{1/2} in the coordinates
corresponding to the ith block and 0 elsewhere. In other words, the singular vectors exactly identify
the blocks of the diagonal. What happens if a_1 = a_2 = · · · = a_k? In the case where the a_i are
equal, what is the structure of the set of all possible singular vectors?

Hint: By symmetry, the top singular vector’s components must be constant in each block.

Exercise 1.6 Prove that the left singular vectors of A are the right singular vectors of AT .

Solution: A = UDV^T, thus A^T = V D U^T.

Exercise 1.7 Interpret the right and left singular vectors for the document term matrix.

Solution: The first right singular vector is a synthetic document that best matches the collection of
documents. The first left singular vector is a synthetic word that best matches the collection of terms
appearing in the documents.

Exercise 1.8 Verify that the sum of rank one matrices \sum_{i=1}^{r} σ_i u_i v_i^T can be written as
UDV^T, where the u_i are the columns of U and the v_i are the columns of V . To do this, first
verify that for any two matrices P and Q, we have

PQ^T = \sum_i p_i q_i^T

where p_i is the ith column of P and q_i is the ith column of Q.

Exercise 1.9

1. Show that the rank of A is r where r is the minimum i such that arg max_{v ⊥ v_1, v_2, ..., v_i; |v|=1} |Av| = 0.

2. Show that |u_1^T A| = max_{|u|=1} |u^T A| = σ_1.

Hint: Use SVD.

Exercise 1.10 If σ_1, σ_2, . . . , σ_r are the singular values of A and v_1, v_2, . . . , v_r are the
corresponding right singular vectors, show that

1. A^T A = \sum_{i=1}^{r} σ_i^2 v_i v_i^T


2. v_1, v_2, . . . , v_r are eigenvectors of A^T A.

3. Assuming that the set of eigenvectors of a matrix is unique, conclude that the set of
singular values of the matrix is unique.

See the appendix for the definition of eigenvectors.

Exercise 1.11 Let A be a matrix. Given an algorithm for finding

v_1 = arg max_{|v|=1} |Av|

describe an algorithm to find the SVD of A.

Exercise 1.12 Compute the singular value decomposition of the matrix

    A =   1  2
          3  4
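
A quick numerical check with NumPy (this does not replace working the decomposition out by hand):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
U, s, Vt = np.linalg.svd(A)      # A = U diag(s) V^T
print("singular values:", s)
print("reconstruction error:", np.linalg.norm(A - U @ np.diag(s) @ Vt))
```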

Exercise 1.13 Write a program to implement the power method for computing the first
singular vector of a matrix. Apply your program to the matrix

          1   2   3  · · ·   9  10
          2   3   4  · · ·  10   0
    A =   .   .   .          .   .
          9  10   0  · · ·   0   0
         10   0   0  · · ·   0   0
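
One possible sketch of such a program (the matrix is rebuilt from the pattern shown above; iterating
with B = A^T A and renormalizing is one standard way to organize the power method, and the fixed
iteration count is an arbitrary choice):

```python
import numpy as np

def first_singular_vector(A, iterations=1000, seed=0):
    """Power method on A^T A to approximate the top right singular vector."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[1])
    for _ in range(iterations):
        x = A.T @ (A @ x)            # apply B = A^T A
        x /= np.linalg.norm(x)       # renormalize at each step
    sigma = np.linalg.norm(A @ x)    # corresponding singular value
    return x, sigma

# The matrix from this exercise: entry (i, j) is i + j + 1 when that is at most 10, else 0.
A = np.array([[i + j + 1 if i + j + 1 <= 10 else 0 for j in range(10)]
              for i in range(10)], dtype=float)
v1, sigma1 = first_singular_vector(A)
print("sigma_1 approx:", sigma1)
```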

Exercise 1.14 Modify the power method to find the first four singular vectors of a matrix
A as follows. Randomly select four vectors and find an orthonormal basis for the space
spanned by the four vectors. Then multiply each of the basis vectors times A and find a
new orthonormal basis for the space spanned by the resulting four vectors. Apply your
method to find the first four singular vectors of the matrix A of Exercise 1.13.
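
A sketch of this modification (using QR factorization as the "find an orthonormal basis" step and
applying A^T A at each iteration is one common way to realize it; the small random matrix at the end
is only a stand-in for the matrix of Exercise 1.13):

```python
import numpy as np

def top_singular_vectors(A, k=4, iterations=1000, seed=0):
    """Block power (subspace) iteration for the first k right singular vectors."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((A.shape[1], k))
    V, _ = np.linalg.qr(V)               # orthonormal basis for the random start
    for _ in range(iterations):
        V = A.T @ (A @ V)                # multiply the basis by A^T A
        V, _ = np.linalg.qr(V)           # re-orthonormalize
    return V                             # columns approximate v_1, ..., v_k

A = np.random.default_rng(1).random((10, 10))   # stand-in; use the Exercise 1.13 matrix
V4 = top_singular_vectors(A, k=4)
print(V4.shape)                                 # (10, 4)
```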

Exercise 1.15 Let A be a real valued matrix. Prove that B = AA^T is positive semi-definite.

Exercise 1.16 Prove that the eigenvalues of a symmetric real valued matrix are real.

Exercise 1.17 Suppose A is a square invertible matrix and the SVD of A is A = \sum_i σ_i u_i v_i^T.
Prove that the inverse of A is \sum_i (1/σ_i) v_i u_i^T.

Exercise 1.18 Suppose A is square, but not necessarily invertible, and has SVD
A = \sum_{i=1}^{r} σ_i u_i v_i^T. Let B = \sum_{i=1}^{r} (1/σ_i) v_i u_i^T. Show that BAx = x for all x in
the span of the right singular vectors of A. For this reason B is sometimes called the pseudo inverse
of A and can play the role of A^{-1} in many applications.
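
A numerical sketch of this construction on a small rank-deficient matrix (the matrix is a made-up
example; numpy.linalg.pinv computes the same object and is used here only as a cross-check):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0]])          # rank 2, hence not invertible

U, s, Vt = np.linalg.svd(A)
r = np.sum(s > 1e-10)                    # numerical rank
# B = sum over i <= r of (1/sigma_i) v_i u_i^T
B = Vt[:r].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T

x = Vt[:r].T @ np.array([1.0, -2.0])     # a vector in the span of v_1, ..., v_r
print(np.allclose(B @ A @ x, x))         # True: B A x = x on that span
print(np.allclose(B, np.linalg.pinv(A))) # agrees with NumPy's pseudo inverse
```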


Exercise 1.19

1. For any matrix A, show that σ_k ≤ ||A||_F / \sqrt{k}.

2. Prove that there exists a matrix B of rank at most k such that ||A − B||_2 ≤ ||A||_F / \sqrt{k}.

3. Can the 2-norm on the left hand side in (2) be replaced by Frobenius norm?

Exercise 1.20 Suppose an n × d matrix A is given and you are allowed to preprocess A. Then you
are given a number of d-dimensional vectors x_1, x_2, . . . , x_m and for each of these vectors you must
find the vector Ax_i approximately, in the sense that you must find a vector u_i satisfying
|u_i − Ax_i| ≤ ε ||A||_F |x_i|. Here ε > 0 is a given error bound. Describe an algorithm that accomplishes
this in time O((d + n)/ε^2) per x_i, not counting the preprocessing time.

Exercise 1.21 (Constrained Least Squares Problem using SVD) Given A, b, and m, use the SVD
algorithm to find a vector x with |x| < m minimizing |Ax − b|. This problem is a learning exercise for the
advanced student. For hints/solution consult Golub and van Loan, Chapter 12.

Exercise 1.22 (Document-Term Matrices): Suppose we have an m×n document-term matrix where
each row corresponds to a document where the rows have been normalized to length one. Define
the “similarity” between two such documents by their dot product.

1. Consider a “synthetic” document whose sum of squared similarities with all documents in the
matrix is as high as possible. What is this synthetic document and how would you find it?

2. How does the synthetic document in (1) differ from the center of gravity?

3. Building on (1), given a positive integer k, find a set of k synthetic documents such that the
sum of squares of the mk similarities between each document in the matrix and each synthetic
document is maximized. To avoid the trivial solution of selecting k copies of the document in
(1), require the k synthetic documents to be orthogonal to each other. Relate these synthetic
documents to singular vectors.

4. Suppose that the documents can be partitioned into k subsets (often called clusters), where
documents in the same cluster are similar and documents in different clusters are not very
similar. Consider the computational problem of isolating the clusters.
This is a hard problem in general. But assume that the terms can also be partitioned into k
clusters so that for i ≠ j, no term in the ith cluster occurs in a document in the jth cluster. If we
knew the clusters and arranged the rows and columns in them to be contiguous, then the
matrix would be a block-diagonal matrix. Of course
the clusters are not known. By a “block” of the document-term matrix, we mean a
submatrix with rows corresponding to the ith cluster of documents and columns
corresponding to the ith cluster of terms. We can also partition any n-vector into blocks.
Show that any right singular vector of the matrix must have the property that each of
its blocks is a right singular vector of the corresponding block of the document-term
matrix.

5. Suppose now that the singular values of all the blocks are distinct (also across blocks).
Show how to solve the clustering problem.

Hint: (4) Use the fact that the right singular vectors must be eigenvectors of ATA. Show that
ATA is also block-diagonal and use properties of eigenvectors.

Solution: (1)
(2)
(3)
(4): It is obvious that A^T A is block diagonal. We claim that for any block-diagonal symmetric
matrix B, each eigenvector must be composed of eigenvectors of blocks. To see this, just note that
for an eigenvector v of B, Bv is λv for a real λ, so for a block B_i of B, B_i applied to the
corresponding block of v is also λ times that block.
(5): By the above, it is easy to see that each eigenvector of A^T A has nonzero entries in just
one block.

Exercise 1.23 Generate a number of samples according to a mixture of 1-dimensional Gaussians.
See what happens as the centers get closer. Alternatively, see what happens when the centers are
fixed and the standard deviation is increased.
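
A minimal sketch of such an experiment (the number of samples, the centers, and the equal mixing
weights are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def mixture_samples(n, centers, sigma=1.0):
    """Draw n samples from an equal-weight mixture of 1-d Gaussians."""
    which = rng.integers(len(centers), size=n)          # pick a component per sample
    return rng.normal(loc=np.array(centers)[which], scale=sigma)

for separation in [6.0, 3.0, 1.0]:                      # centers moving closer
    samples = mixture_samples(2000, centers=[0.0, separation])
    plt.hist(samples, bins=60, alpha=0.5, label=f"separation {separation}")
plt.legend()
plt.show()
```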

Exercise 1.24 Show that maximizing x^T u u^T (1 − x) subject to x_i ∈ {0, 1} is equivalent to
partitioning the coordinates of u into two subsets where the sums of the elements in the two subsets
are equal.

Solution: x^T u u^T (1 − x) can be written as the product of two scalars (x^T u)(u^T (1 − x)). The first
scalar is the sum of the coordinates of u corresponding to the subset S and the second scalar is the
sum of the complementary coordinates of u. To maximize the product, one partitions the coordinates
of u so that the two sums are as equal as possible. Given the subset determined by the
maximization, check whether x^T u = u^T (1 − x).

Exercise 1.25 Read in a photo and convert to a matrix. Perform a singular value decomposition
of the matrix. Reconstruct the photo using only 10%, 25%, 50% of the singular
values.

1. Print the reconstructed photo. How good is the quality of the reconstructed photo?

2. What percent of the Frobenius norm is captured in each case?


Hint: If you use Matlab, the command to read a photo is imread. The types of files that can be read are given by imformats. To
print the file use imwrite. Print using jpeg format.
To access the file afterwards you may need to add the file extension .jpg. The command imread will read the file in uint8 and you
will need to convert to double for the SVD code.
Afterwards you will need to convert back to uint8 to write the file. If the photo is a color photo you will get three matrices for the
three colors used.

Exercise 1.26 Find a collection of something such as photographs, drawings, or charts and try the SVD compression technique on
it. How well does the reconstruction work?

Exercise 1.27 Create a 100 × 100 matrix A of random numbers between 0 and 1 such that each
entry is highly correlated with the adjacent entries. Find the SVD of A. What fraction of the Frobenius
norm of A is captured by the top 10 singular vectors? How many singular vectors are required to
capture 95% of the Frobenius norm?

Exercise 1.28 Create a set of 100 matrices, each 100 × 100, of random numbers between 0 and 1
such that each entry is highly correlated with the adjacent entries. Find the first 100 vectors for a
single basis that is reasonably good for all 100 matrices. How does one do this? What fraction of the
Frobenius norm of a new matrix is captured by the basis?

Solution: If v_1, v_2, . . . , v_100 is the basis, then A = A v_1 v_1^T + A v_2 v_2^T + · · · .
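
A sketch of one way to generate such a correlated random matrix and measure how much of its
Frobenius norm the top singular vectors capture (the moving-average smoothing used to induce
correlation is an arbitrary choice among many):

```python
import numpy as np

rng = np.random.default_rng(0)

def correlated_matrix(n=100, window=5):
    """Random entries in [0,1], smoothed so that nearby entries are correlated."""
    M = rng.random((n, n))
    kernel = np.ones(window) / window
    for axis in (0, 1):                      # smooth along rows and along columns
        M = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"),
                                axis, M)
    return M

A = correlated_matrix()
_, s, _ = np.linalg.svd(A)
fraction = np.sqrt(np.cumsum(s**2)) / np.sqrt(np.sum(s**2))
print("fraction of Frobenius norm captured by top 10:", fraction[9])
print("vectors needed for 95%:", int(np.searchsorted(fraction, 0.95) + 1))
```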

Exercise 1.29 Show that the running time for the maximum cut algorithm in Section ?? can be
carried out in time O(n^3 k + poly(n)k), where poly is some polynomial.

Exercise 1.30 Let x1, x2, . . . , xn be n points in d-dimensional space and let X be the n×d matrix whose rows are the n points.
Suppose we know only the matrix D of pairwise distances between points and not the coordinates of the points themselves. The
xij are not unique since any translation, rotation, or reflection of the coordinate system leaves the distances invariant. Fix the origin
of the coordinate system so that the centroid of the set of points is at the origin.

1. Show that the elements of X^T X are given by

x_i^T x_j = −(1/2) ( d_{ij}^2 − (1/n) \sum_{j=1}^{n} d_{ij}^2 − (1/n) \sum_{i=1}^{n} d_{ij}^2 + (1/n^2) \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}^2 ).

2. Describe an algorithm for determining the matrix X whose rows are the xi.

Solution: (1) Since the centroid of the set of points is at the origin of the coordinate axes,
\sum_{i=1}^{n} x_{ij} = 0. Write

d_{ij}^2 = (x_i − x_j)^T (x_i − x_j) = x_i^T x_i + x_j^T x_j − 2 x_i^T x_j.                       (1.3)

Then

(1/n) \sum_{i=1}^{n} d_{ij}^2 = (1/n) \sum_{i=1}^{n} x_i^T x_i + x_j^T x_j                        (1.4)

since (1/n) \sum_{i=1}^{n} x_j^T x_j = x_j^T x_j and (1/n) \sum_{i=1}^{n} x_i^T x_j = 0.

Similarly

(1/n) \sum_{j=1}^{n} d_{ij}^2 = x_i^T x_i + (1/n) \sum_{j=1}^{n} x_j^T x_j.                       (1.5)

Summing (1.4) over j gives

(1/n) \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}^2 = \sum_{i=1}^{n} x_i^T x_i + \sum_{j=1}^{n} x_j^T x_j = 2 \sum_{i=1}^{n} x_i^T x_i.   (1.6)

Rearranging (1.3) and substituting for x_i^T x_i and x_j^T x_j from (1.4) and (1.5) yields

x_i^T x_j = −(1/2) ( d_{ij}^2 − x_i^T x_i − x_j^T x_j )
          = −(1/2) ( d_{ij}^2 − (1/n) \sum_{j=1}^{n} d_{ij}^2 − (1/n) \sum_{i=1}^{n} d_{ij}^2 + (2/n) \sum_{i=1}^{n} x_i^T x_i ).

Finally, substituting (1.6) yields

x_i^T x_j = −(1/2) ( d_{ij}^2 − (1/n) \sum_{j=1}^{n} d_{ij}^2 − (1/n) \sum_{i=1}^{n} d_{ij}^2 + (1/n^2) \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}^2 ).

Note that if D is the matrix of pairwise squared distances, then (1/n) \sum_{j=1}^{n} d_{ij}^2,
(1/n) \sum_{i=1}^{n} d_{ij}^2, and (1/n^2) \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}^2 are the averages of the
squared distances in the ith row, in the jth column, and over all entries, respectively.

(2) Having constructed X^T X we can use an eigenvalue decomposition to determine the coordinate
matrix X. Clearly X^T X is symmetric, and if the distances come from a set of n points in a
d-dimensional space, X^T X will be positive semi-definite and of rank d. Thus we can decompose
X^T X = V Σ V^T where the first d eigenvalues are positive and the remainder are zero. Since
X^T X = V Σ^{1/2} Σ^{1/2} V^T, the coordinates are given by X = V Σ^{1/2}.
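
A sketch of the resulting procedure, often called classical multidimensional scaling: double-center
the squared distances as in part (1), then factor the resulting Gram matrix. The tiny 3-4-5 triangle at
the end is only a correctness check; the same function can be applied to the city distance matrix of
Exercise 1.31.

```python
import numpy as np

def coordinates_from_distances(D, dim=2):
    """Recover coordinates (up to rotation/reflection) from pairwise distances D."""
    D2 = D**2
    # Double centering, matching the formula in part (1):
    # g_ij = -1/2 (d_ij^2 - row mean - column mean + overall mean)
    G = -0.5 * (D2 - D2.mean(axis=1, keepdims=True)
                   - D2.mean(axis=0, keepdims=True) + D2.mean())
    eigvals, eigvecs = np.linalg.eigh(G)        # G is symmetric
    order = np.argsort(eigvals)[::-1][:dim]     # largest eigenvalues first
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))

# Tiny check: three points whose pairwise distances form a 3-4-5 right triangle.
D = np.array([[0.0, 3.0, 4.0],
              [3.0, 0.0, 5.0],
              [4.0, 5.0, 0.0]])
X = coordinates_from_distances(D)
print(np.round(np.linalg.norm(X[:, None] - X[None, :], axis=2), 3))  # recovers D
```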

Exercise 1.31

1. Consider the pairwise distance matrix for twenty US cities given below. Use the
algorithm of Exercise 1.30 to place the cities on a map of the US.

2. Suppose you had airline distances for 50 cities around the world. Could you use
these distances to construct a world map?


                Bos   Buf   Chi   Dal   Den   Hou   LA    Mem   Mia   Min
(columns: Boston, Buffalo, Chicago, Dallas, Denver, Houston, Los Angeles, Memphis, Miami, Minneapolis)
Boston - 400 851 1551 1769 1605 2596 1137 1255 1123
Buffalo 400 - 454 1198 1370 1286 2198 803 1181 731
Chicago 851 454 - 803 920 940 1745 482 1188 355
Dallas 1551 1198 803 - 663 225 1240 420 1111 862
Denver 1769 1370 920 663 - 879 831 879 1726 700
Houston 1605 1286 940 225 879 - 1374 484 968 1056
Los Angeles 2596 2198 1745 1240 831 1374 - 1603 2339 1524
Memphis 1137 803 482 420 879 484 1603 - 872 699
Miami 1255 1181 1188 1111 1726 968 2339 872 - 1511
Minneapolis 1123 731 355 862 700 1056 1524 699 1511 -

New York 188 292 713 1374 1631 1420 2451 957 1092 1018
Omaha 1282 883 432 586 488 794 1315 529 1397 290
Philadelphia 271 279 666 1299 1579 1341 2394 881 1019 985
Phoenix 2300 1906 1453 887 586 1017 357 1263 1982 1280
Pittsburgh 483 178 410 1070 1320 1137 2136 660 1010 743
Saint Louis 1038 662 262 547 796 679 1589 240 1061 466
Salt Lake City 2099 1699 1260 999 371 1200 579 1250 2089 987
San Francisco 2699 2300 1858 1483 949 1645 347 1802 2594 1584
Seattle 2493 2117 1737 1681 1021 1891 959 1867 2734 1395
Washington D.C. 393 292 597 1185 1494 1220 2300 765 923 934


                NY    Oma   Phi   Pho   Pit   StL   SLC   SF    Sea   DC
(columns: New York, Omaha, Philadelphia, Phoenix, Pittsburgh, Saint Louis, Salt Lake City, San Francisco, Seattle, Washington D.C.)
Boston 188 1282 271 2300 483 1038 2099 2699 2493 393
Buffalo 292 883 279 1906 178 662 1699 2300 2117 292
Chicago 713 432 666 1453 410 262 1260 1858 1737 597
Dallas 1374 586 1299 887 1070 547 999 1483 1681 1185
Denver 1631 488 1579 586 1320 796 371 949 1021 1494
Houston 1420 794 1341 1017 1137 679 1200 1645 1891 1220
Los Angeles 2451 1315 2394 357 2136 1589 579 347 959 2300
Memphis 957 529 881 1263 660 240 1250 1802 1867 765
Miami 1092 1397 1019 1982 1010 1061 2089 2594 2734 923
Minneapolis 1018 290 985 1280 743 466 987 1584 1395 934
New York          -  1144    83  2145   317   875  1972  2571  2408   205
Omaha          1144     -  1094  1036   836   354   833  1429  1369  1014
Philadelphia     83  1094     -  2083   259   811  1925  2523  2380   123
Phoenix        2145  1036  2083     -  1828  1272   504   653  1114  1983
Pittsburgh      317   836   259  1828     -   559  1668  2264  2138   192
Saint Louis     875   354   811  1272   559     -  1162  1744  1724   712
Salt Lake City 1972   833  1925   504  1668  1162     -   600   701  1848
San Francisco  2571  1429  2523   653  2264  1744   600     -   678  2442
Seattle        2408  1369  2380  1114  2138  1724   701   678     -  2329
Washington D.C. 205  1014   123  1983   192   712  1848  2442  2329     -

References
Hubs and Authorities
Golub and Van Loan
Clustering Gaussians

