HW 2
(ii) $P(A)$?
\[
\begin{aligned}
P(A) &= P(A \mid W)P(W) + P(A \mid W^c)P(W^c) \\
&= P(A \mid W)P(W) + P(A \mid W^c)(1 - P(W)) \\
&= 0.4 \cdot 0.3 + 0.7 \cdot 0.7 \\
&= 0.61
\end{aligned}
\]
Consider the event space. The event that the archer hits the target exactly once in two shots can
be broken down into two cases.
i. The archer hits the target on the first shot and misses on the second.
ii. The archer misses on the first shot and hits the target on the second.
In both cases, the probability of the case is simply the product of the probability that the archer hits the target and the probability that the archer misses it. That is:
\[ P(A)P(A^c) = P(A)(1 - P(A)) = (0.61)(1 - 0.61) = 0.2379 \]
Because there are two mutually exclusive cases in which this event can occur, the total probability of the archer hitting the target exactly once in two shots is twice the value we found above. Thus:
\[ 2 \cdot 0.2379 = 0.4758 \]
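(iv) $P(W^c \mid A^c)$?
\[
\begin{aligned}
P(W^c \mid A^c) &= \frac{P(W^c, A^c)}{P(A^c)} \\
&= \frac{P(A^c \mid W^c)P(W^c)}{1 - P(A)} \\
&= \frac{(1 - P(A \mid W^c))(1 - P(W))}{1 - P(A)} \\
&= \frac{(1 - 0.7)(1 - 0.3)}{1 - 0.61} \\
&= \frac{0.3 \cdot 0.7}{0.39} \\
&= 0.5385
\end{aligned}
\]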
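The three probability computations above can be reproduced numerically. The sketch below is only a sanity check: the values 0.3, 0.4 and 0.7 are read off the worked equations, and the variable names are illustrative.

```python
# Values taken from the worked solution (assumed: P(W)=0.3, P(A|W)=0.4, P(A|W^c)=0.7).
p_w = 0.3
p_a_given_w = 0.4
p_a_given_wc = 0.7

# Law of total probability for P(A).
p_a = p_a_given_w * p_w + p_a_given_wc * (1 - p_w)

# Exactly one hit in two shots: hit then miss, or miss then hit.
p_exactly_one = 2 * p_a * (1 - p_a)

# Bayes' rule for P(W^c | A^c).
p_wc_given_ac = (1 - p_a_given_wc) * (1 - p_w) / (1 - p_a)

print(round(p_a, 4), round(p_exactly_one, 4), round(p_wc_given_ac, 4))
```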
Solution: We will start with $P(A \mid B, C) > P(A \mid B)$ and massage it to the form $P(A \mid B, C^c) < P(A \mid B)$.
2. Positive Definiteness
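Definition. Let $A \in \mathbb{R}^{n \times n}$ be a symmetric matrix.
We say that $A$ is positive definite if $\forall x \in \mathbb{R}^n \setminus \{0\}$, $x^\top A x > 0$. We denote this with $A \succ 0$.
(a) For a symmetric matrix $A \in \mathbb{R}^{n \times n}$, prove that all of the following are equivalent.
(i) $A \succeq 0$.
(ii) $B^\top A B \succeq 0$, for some invertible matrix $B \in \mathbb{R}^{n \times n}$.
(iii) All the eigenvalues of $A$ are nonnegative.
(iv) There exists a matrix $U \in \mathbb{R}^{n \times n}$ such that $A = U U^\top$.
(Suggested road map: (i) $\Leftrightarrow$ (ii), (i) $\Rightarrow$ (iii) $\Rightarrow$ (iv) $\Rightarrow$ (i). For the implication (iii) $\Rightarrow$ (iv) use the Spectral Theorem for Symmetric Matrices.)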
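Solution: We will follow the suggested road map to show that (i), (ii), (iii), and (iv) are all equivalent with the following proofs:
\[ A \succeq 0 \iff \forall x \in \mathbb{R}^n : x^\top A x \ge 0 \]
Consider $B^\top A B \succeq 0$. For this to hold, $x^\top B^\top A B x \ge 0$ must be true $\forall x \in \mathbb{R}^n$. Let's define $u \in \mathbb{R}^n$ to be $u = Bx$. If we substitute $u$ into $x^\top B^\top A B x \ge 0$, we have:
\[ (Bx)^\top A (Bx) \ge 0 \]
\[ u^\top A u \ge 0 \]
which holds for every $u$ because $A \succeq 0$. Hence $A \succeq 0 \Rightarrow B^\top A B \succeq 0$.
(ii) $A \succeq 0 \Leftarrow B^\top A B \succeq 0$
We can use similar logic to the forward direction and start from $\forall x \in \mathbb{R}^n : x^\top B^\top A B x \ge 0$, which we know because $B^\top A B \succeq 0$. Let's define a vector $u = Bx$ again. Since $B$ is invertible, there is a one-to-one mapping between $u$ and $x$ (take $x = B^{-1}u$), so $u$ can represent any vector in $\mathbb{R}^n$.
The inequality $u^\top A u \ge 0$ must still hold if $B^\top A B \succeq 0$, since the two are equivalent given our definition of $u$. Because $u^\top A u \ge 0$ and $u$ ranges over all of $\mathbb{R}^n$, we can write $\forall u \in \mathbb{R}^n : u^\top A u \ge 0 \Rightarrow A \succeq 0$. Since we started with $B^\top A B \succeq 0$, we have shown that $B^\top A B \succeq 0 \Rightarrow A \succeq 0$, thereby proving the reverse direction of (i) $\Leftrightarrow$ (ii).
Having proven (i) $\Rightarrow$ (ii) and (i) $\Leftarrow$ (ii), we have proved (i) $\Leftrightarrow$ (ii).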
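(2) $A \succeq 0 \Rightarrow$ All the eigenvalues of $A$ are nonnegative.
Let $u_i$ be an eigenvector of $A$ with eigenvalue $\lambda_i$, and let $u_{ij}$ denote the $j$th element of $u_i$, so that $u_i^\top A u_i = \lambda_i u_i^\top u_i = \lambda_i \sum_{j=1}^n u_{ij}^2$.
Given $A \succeq 0$, we know $\lambda_i \sum_{j=1}^n u_{ij}^2 \ge 0$. Because the summation $\sum_{j=1}^n u_{ij}^2$ sums over squared values and an eigenvector is nonzero, the summation must be strictly positive. Since the product of $\lambda_i$ and the summation must be nonnegative for $A \succeq 0$ to hold, $\lambda_i \ge 0$. This extends to every eigenvector $u_i$ and eigenvalue $\lambda_i$ of $A$, so we know that
\[ \forall i \in \{1, \dots, n\} : \lambda_i \ge 0 \]
thereby proving that if $A \succeq 0$, all the eigenvalues of $A$ must be nonnegative: (i) $\Rightarrow$ (iii).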
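(3) All the eigenvalues of $A$ are nonnegative $\Rightarrow \exists U \in \mathbb{R}^{n \times n} \mid A = U U^\top$
Given that all of $A$'s eigenvalues are nonnegative, and that $A$ is symmetric, we will make use of the Spectral Theorem for Symmetric Matrices to show that $\exists U \in \mathbb{R}^{n \times n} \mid A = U U^\top$. First, let's decompose $A$ using Spectral Decomposition:
\[ A = Q \Lambda Q^\top = Q \Lambda^{1/2} \Lambda^{1/2} Q^\top = (Q \Lambda^{1/2})(Q \Lambda^{1/2})^\top = U U^\top, \qquad U = Q \Lambda^{1/2} \]
Here we've shown that if $A$'s eigenvalues are nonnegative, $A$ can be expressed as $U U^\top$ for a matrix $U \in \mathbb{R}^{n \times n}$. Walking back in our proof, we know that our definition of $U = Q \Lambda^{1/2}$ must be in $\mathbb{R}^{n \times n}$ since both $Q$ and $\Lambda^{1/2}$ are in $\mathbb{R}^{n \times n}$: $Q$ contains $A$'s eigenvectors in $\mathbb{R}^n$, and all the elements of $\Lambda^{1/2}$ are real since all of $A$'s eigenvalues are nonnegative. Thus, we've shown (iii) $\Rightarrow$ (iv) in this proof.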
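(4) $\exists U \in \mathbb{R}^{n \times n} \mid A = U U^\top \Rightarrow A \succeq 0$
This is the final piece to showing that (i)-(iv) are all equivalent; we will start with the claim from (iv) that $\exists U \in \mathbb{R}^{n \times n} \mid A = U U^\top$.
\[ A = U U^\top \]
Note that $U^\top x$ gives a vector in $\mathbb{R}^n$. We denote $(U^\top x)_i$ to be the $i$th element of the vector $U^\top x$.
\[ x^\top A x = x^\top U U^\top x = (U^\top x)^\top (U^\top x) = \sum_{i=1}^n (U^\top x)_i^2 \]
Because $\sum_{i=1}^n (U^\top x)_i^2$ sums over squared values, we know that it is nonnegative.
\[ \forall x \in \mathbb{R}^n : \sum_{i=1}^n (U^\top x)_i^2 \ge 0 \]
\[ \forall x \in \mathbb{R}^n : x^\top A x \ge 0 \]
\[ A \succeq 0 \]
Here, we've come full circle and proved that if $\exists U \in \mathbb{R}^{n \times n} \mid A = U U^\top$ then $A \succeq 0$: (iv) $\Rightarrow$ (i).
Compiling our proofs for (i) $\Leftrightarrow$ (ii), (i) $\Rightarrow$ (iii), (iii) $\Rightarrow$ (iv), and (iv) $\Rightarrow$ (i), we've shown that (i), (ii), (iii), and (iv) are all equivalent statements!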
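As a numerical illustration of the step (iv) $\Rightarrow$ (i), the following sketch builds $A = U U^\top$ from a random $U$ and checks that $x^\top A x$ equals $\sum_i (U^\top x)_i^2$ and is nonnegative. The helper names (`matvec`, `matmul`, `quad_form`) and the random test matrix are illustrative choices, not part of the assignment.

```python
import random

def transpose(M):
    return [list(col) for col in zip(*M)]

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def matmul(M, N):
    Nt = transpose(N)
    return [[sum(a * b for a, b in zip(row, col)) for col in Nt] for row in M]

def quad_form(A, x):
    # Computes x^T A x.
    return sum(xi * yi for xi, yi in zip(x, matvec(A, x)))

random.seed(0)
n = 4
U = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
A = matmul(U, transpose(U))  # A = U U^T

for _ in range(1000):
    x = [random.uniform(-1, 1) for _ in range(n)]
    Utx = matvec(transpose(U), x)
    lhs = quad_form(A, x)
    rhs = sum(v * v for v in Utx)  # sum of squares of (U^T x)_i
    assert abs(lhs - rhs) < 1e-9
    assert lhs >= -1e-12
```

A random check like this does not replace the proof; it only confirms that the algebraic identity used in step (4) behaves as expected on examples.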
(b) For a symmetric positive definite matrix $A \succ 0 \in \mathbb{R}^{n \times n}$, prove the following.
(i) For every $\lambda > 0$, we have that $A + \lambda I \succ 0$.
(ii) There exists a $\gamma > 0$ such that $A - \gamma I \succ 0$.
(iii) All the diagonal entries of $A$ are positive; i.e. $A_{ii} > 0$ for $i = 1, \dots, n$.
(iv) $\sum_{i=1}^n \sum_{j=1}^n A_{ij} > 0$, where $A_{ij}$ is the element at the $i$-th row and $j$-th column of $A$.
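We are given that $A \succ 0$ and want to show that $A + \lambda I \succ 0$ for every $\lambda > 0$. Let's rewrite this as:
\[ \forall x \in \mathbb{R}^n \setminus \{0\} : x^\top A x > 0 \;\Rightarrow\; \forall x \in \mathbb{R}^n \setminus \{0\} : x^\top (A + \lambda I) x > 0 \quad \text{for } \lambda > 0 \]
So for any $x \in \mathbb{R}^n \setminus \{0\}$ we know that $x^\top A x > 0$. Let's see if this is the case for $A + \lambda I$:
\[ x^\top (A + \lambda I) x = x^\top A x + \lambda x^\top x = x^\top A x + \lambda \sum_{i=1}^n x_i^2 \]
Let's analyze what we have. Given $A \succ 0$, we know that $\forall x \in \mathbb{R}^n \setminus \{0\} : x^\top A x > 0$. For $A + \lambda I \succ 0$ to hold, we only need to show that $\lambda \sum_{i=1}^n x_i^2$ is nonnegative. The summation is $> 0$ since it sums over squared values and $x \ne 0$; multiplying it by any $\lambda > 0$ gives a value $> 0$. Ultimately, $x^\top A x + \lambda \sum_{i=1}^n x_i^2 > 0$, thereby showing that $A + \lambda I \succ 0$ for every $\lambda > 0$.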
From this manipulation and analysis, we've proven that there exists a $\gamma > 0$ such that $A - \gamma I \succ 0$.
(iii) Proof: All the diagonal entries of $A$ are positive; i.e. $A_{ii} > 0$ for $i = 1, \dots, n$.
We will show that all the diagonal entries of $A$ must be positive in order for $A \succ 0$ to hold. Consider $A \succ 0$ in its form
\[ \forall x \in \mathbb{R}^n \setminus \{0\} : x^\top A x > 0 \]
and we know that
\[ x^\top A x = \sum_{i=1}^n \sum_{j=1}^n A_{i,j} x_i x_j \]
We will utilize the fact that $\sum_{i=1}^n \sum_{j=1}^n A_{i,j} x_i x_j > 0$ must hold $\forall x \in \mathbb{R}^n \setminus \{0\}$ so that $A \succ 0$ holds. We denote by $x_k$ the $k$th element of $x$. Define $x^{(i)}$ to be the basis vector in the $i$th direction, i.e. $x^{(i)}_{j} = 1$ for $j = i$ and $x^{(i)}_{j} = 0$ for $j \ne i$. If we consider $x^{(i)\top} A x^{(i)}$ we observe that
\[
x^{(i)\top} A x^{(i)} = \sum_{a=1}^n \sum_{b=1}^n A_{a,b} x^{(i)}_a x^{(i)}_b = A_{i,i},
\qquad \text{since} \qquad
A_{a,b} x^{(i)}_a x^{(i)}_b =
\begin{cases}
A_{i,i} & \text{if } a = i \wedge b = i \\
0 & \text{otherwise}
\end{cases}
\]
In the context of basis vectors, we see that $x^{(i)\top} A x^{(i)} > 0$ requires $A_{i,i}$ (the diagonal entry of $A$ at $(i, i)$) to be greater than 0 for $A \succ 0$ to hold; in other words, the diagonal entries of $A$ must satisfy $A_{i,i} > 0$ if $A \succ 0$ is to hold. Given this, we have proven by using basis vectors that $A_{i,i} > 0$ for $i = 1, \dots, n$ if $A \succ 0$.
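(iv) Proof: $\sum_{i=1}^n \sum_{j=1}^n A_{ij} > 0$, where $A_{ij}$ is the element at the $i$-th row and $j$-th column of $A$.
To show that $\sum_{i=1}^n \sum_{j=1}^n A_{ij} > 0$ given $A \succ 0$, we will consider the rewritten form of $A$ being positive definite and choose a vector $x$ that makes the proof easy: the all-ones vector $x = \mathbf{1}$, for which $x_i x_j = 1$ for all $i, j$.
\[ \mathbf{1}^\top A \mathbf{1} = \sum_{i=1}^n \sum_{j=1}^n A_{ij} \cdot 1 \cdot 1 = \sum_{i=1}^n \sum_{j=1}^n A_{ij} > 0 \]
Because $\sum_{i=1}^n \sum_{j=1}^n A_{i,j} x_i x_j > 0$ must hold for all $x \in \mathbb{R}^n \setminus \{0\}$ if $A \succ 0$, it holds in particular for $x = \mathbf{1} \ne 0$, which is exactly the statement $\sum_{i=1}^n \sum_{j=1}^n A_{ij} > 0$.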
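The two choices of test vector used in (iii) and (iv) can be seen on a small example. This sketch assumes a specific positive definite matrix (the 3x3 tridiagonal matrix with 2 on the diagonal and -1 off it), chosen only for illustration.

```python
def quad_form(A, x):
    # Computes x^T A x as the double sum of A[i][j] * x[i] * x[j].
    n = len(x)
    return sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

# Illustrative positive definite matrix (an assumption, not from the assignment).
A = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
n = len(A)

# Basis vectors pick out the diagonal entries: e_i^T A e_i = A[i][i].
diag_values = []
for i in range(n):
    e_i = [1.0 if k == i else 0.0 for k in range(n)]
    diag_values.append(quad_form(A, e_i))

# The all-ones vector picks out the sum of all entries.
ones = [1.0] * n
total = quad_form(A, ones)

print(diag_values, total)
```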
3. Derivatives and Norms
In the following questions, show your work, not just the final answer.
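(a) Let $x, a \in \mathbb{R}^n$. Compute $\nabla_x (a^\top x)$.
Solution: We begin by treating $a^\top x$ as a scalar function of the vector $x$ and expand it to clearly see what its gradient is. Let $v_i$ for some vector $v$ denote the $i$th element of $v$.
\[
\nabla_x (a^\top x) = \nabla_x (a_1 x_1 + \dots + a_n x_n)
= \begin{bmatrix} \vdots \\ \frac{\partial}{\partial x_i} a_i x_i \\ \vdots \end{bmatrix} \text{ for all } i \in \{1, \dots, n\}
= \begin{bmatrix} \vdots \\ a_i \\ \vdots \end{bmatrix}
= a
\]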
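Solution: We will rigorously expand this computation to see how it works piece by piece.
\[
\begin{aligned}
\nabla_x (x^\top A x)
&= \nabla_x \left( \begin{bmatrix} x_1 & \dots & x_n \end{bmatrix}
\begin{bmatrix} A_{1,1} & \dots & A_{1,n} \\ \vdots & \ddots & \vdots \\ A_{n,1} & \dots & A_{n,n} \end{bmatrix}
\begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} \right) \\
&= \nabla_x \Big( \sum_{i=1}^n \sum_{j=1}^n A_{i,j} x_i x_j \Big) \\
&= \begin{bmatrix}
\frac{\partial}{\partial x_1} \sum_{i=1}^n \sum_{j=1}^n A_{i,j} x_i x_j \\
\vdots \\
\frac{\partial}{\partial x_n} \sum_{i=1}^n \sum_{j=1}^n A_{i,j} x_i x_j
\end{bmatrix} \\
&= \begin{bmatrix}
\sum_{i \ne 1} \frac{\partial}{\partial x_1} \sum_{j=1}^n A_{i,j} x_i x_j + \frac{\partial}{\partial x_1} \sum_{j=1}^n A_{1,j} x_1 x_j \\
\vdots \\
\sum_{i \ne n} \frac{\partial}{\partial x_n} \sum_{j=1}^n A_{i,j} x_i x_j + \frac{\partial}{\partial x_n} \sum_{j=1}^n A_{n,j} x_n x_j
\end{bmatrix} \\
&= \begin{bmatrix}
\sum_{i \ne 1} \Big( \frac{\partial}{\partial x_1} \sum_{j \ne 1} A_{i,j} x_i x_j + \frac{\partial}{\partial x_1} (A_{i,1} x_i x_1) \Big) + \frac{\partial}{\partial x_1} \sum_{j \ne 1} A_{1,j} x_1 x_j + \frac{\partial}{\partial x_1} (A_{1,1} x_1^2) \\
\vdots \\
\sum_{i \ne n} \Big( \frac{\partial}{\partial x_n} \sum_{j \ne n} A_{i,j} x_i x_j + \frac{\partial}{\partial x_n} (A_{i,n} x_i x_n) \Big) + \frac{\partial}{\partial x_n} \sum_{j \ne n} A_{n,j} x_n x_j + \frac{\partial}{\partial x_n} (A_{n,n} x_n^2)
\end{bmatrix} \\
&= \begin{bmatrix}
\sum_{i \ne 1} A_{i,1} x_i + \sum_{j \ne 1} A_{1,j} x_j + 2 A_{1,1} x_1 \\
\vdots \\
\sum_{i \ne n} A_{i,n} x_i + \sum_{j \ne n} A_{n,j} x_j + 2 A_{n,n} x_n
\end{bmatrix} \\
&= \begin{bmatrix}
\sum_{i=1}^n A_{i,1} x_i + \sum_{j=1}^n A_{1,j} x_j \\
\vdots \\
\sum_{i=1}^n A_{i,n} x_i + \sum_{j=1}^n A_{n,j} x_j
\end{bmatrix} \\
&= \begin{bmatrix} \sum_{i=1}^n A_{i,1} x_i \\ \vdots \\ \sum_{i=1}^n A_{i,n} x_i \end{bmatrix}
+ \begin{bmatrix} \sum_{j=1}^n A_{1,j} x_j \\ \vdots \\ \sum_{j=1}^n A_{n,j} x_j \end{bmatrix} \\
&= (x^\top A)^\top + A x \\
&= A^\top x + A x \\
&= (A^\top + A) x
\end{aligned}
\]
In the case that $A$ is symmetric, $\nabla_x (x^\top A x) = 2Ax$ since $A^\top = A$ when $A$ is symmetric.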
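The closed form $(A^\top + A)x$ can be checked against a central finite-difference approximation of the gradient. The sketch below uses an arbitrary non-symmetric matrix and test point; these values and the helper names are illustrative.

```python
def quad_form(A, x):
    n = len(x)
    return sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def analytic_grad(A, x):
    # k-th entry of (A^T + A) x is sum_i (A[i][k] + A[k][i]) * x[i].
    n = len(x)
    return [sum((A[i][k] + A[k][i]) * x[i] for i in range(n)) for k in range(n)]

def numeric_grad(f, x, h=1e-6):
    # Central differences, one coordinate at a time.
    grad = []
    for k in range(len(x)):
        xp, xm = list(x), list(x)
        xp[k] += h
        xm[k] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

A = [[1.0, 2.0, 0.5],
     [-1.0, 3.0, 4.0],
     [0.0, 1.5, -2.0]]
x = [0.3, -1.2, 2.0]

g_exact = analytic_grad(A, x)
g_num = numeric_grad(lambda v: quad_form(A, v), x)
max_err = max(abs(a - b) for a, b in zip(g_exact, g_num))
print(max_err)
```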
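(c) Let $A, X \in \mathbb{R}^{n \times n}$. Compute $\nabla_X (\operatorname{trace}(A^\top X))$.
Solution: We first focus on finding the trace of $A^\top X$ and then find the matrix gradient of that trace.
\[
A^\top X =
\begin{bmatrix} A^\top_{1,1} & \dots & A^\top_{1,n} \\ \vdots & \ddots & \vdots \\ A^\top_{n,1} & \dots & A^\top_{n,n} \end{bmatrix}
\begin{bmatrix} X_{1,1} & \dots & X_{1,n} \\ \vdots & \ddots & \vdots \\ X_{n,1} & \dots & X_{n,n} \end{bmatrix}
\]
Because we're finding the trace, we only care about what the diagonal entries of $A^\top X$ are. Also note that, by the definition of the transpose, $A^\top_{i,j} = A_{j,i}$ (no symmetry of $A$ is needed).
\[
\begin{aligned}
(A^\top X)_{i,i} &= \sum_{j=1}^n A^\top_{i,j} X_{j,i} \\
&= \sum_{j=1}^n A_{j,i} X_{j,i} \\
\operatorname{trace}(A^\top X) &= \sum_{i=1}^n \sum_{j=1}^n A_{j,i} X_{j,i}
\end{aligned}
\]
Let $f(X) = \operatorname{trace}(A^\top X)$. Then
\[
\nabla_X (\operatorname{trace}(A^\top X)) = \nabla_X (f(X)) =
\begin{bmatrix}
\frac{\partial}{\partial X_{1,1}} f(X) & \dots & \frac{\partial}{\partial X_{1,n}} f(X) \\
\vdots & \ddots & \vdots \\
\frac{\partial}{\partial X_{n,1}} f(X) & \dots & \frac{\partial}{\partial X_{n,n}} f(X)
\end{bmatrix}
\]
Each entry $X_{j,i}$ appears in exactly one term of the double sum, with coefficient $A_{j,i}$, so $\frac{\partial}{\partial X_{j,i}} f(X) = A_{j,i}$. Thus:
\[
\nabla_X (\operatorname{trace}(A^\top X)) =
\begin{bmatrix} A_{1,1} & \dots & A_{1,n} \\ \vdots & \ddots & \vdots \\ A_{n,1} & \dots & A_{n,n} \end{bmatrix}
= A
\]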
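A finite-difference check of the result $\nabla_X \operatorname{trace}(A^\top X) = A$, using a deliberately non-symmetric $A$. The matrices and function names below are illustrative assumptions.

```python
def trace_AtX(A, X):
    # trace(A^T X) = sum over i, j of A[j][i] * X[j][i].
    n = len(A)
    return sum(A[j][i] * X[j][i] for i in range(n) for j in range(n))

def numeric_grad(A, X, h=1e-6):
    # Forward differences in each entry of X; f is linear in X, so this is
    # exact up to floating-point rounding.
    n = len(X)
    base = trace_AtX(A, X)
    G = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            Xp = [row[:] for row in X]
            Xp[r][c] += h
            G[r][c] = (trace_AtX(A, Xp) - base) / h
    return G

A = [[1.0, 2.0], [-3.0, 0.5]]   # not symmetric on purpose
X = [[0.2, -1.0], [4.0, 1.5]]

G = numeric_grad(A, X)
max_err = max(abs(G[r][c] - A[r][c]) for r in range(2) for c in range(2))
print(max_err)
```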
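Solution: To show that $f(x)$ is a norm for vectors $x \in \mathbb{R}^2$, we must show that $f(x)$ satisfies the triangle inequality.
\[ f(x + y) \le f(x) + f(y) \]
\[ \big(\sqrt{|x_1 + y_1|} + \sqrt{|x_2 + y_2|}\big)^2 \le \big(\sqrt{|x_1|} + \sqrt{|x_2|}\big)^2 + \big(\sqrt{|y_1|} + \sqrt{|y_2|}\big)^2 \]
Unfortunately, for $x = (1, -1)$ and $y = (0, -1)$, the triangle inequality does not hold for $f$ and thus $f(x)$ cannot be a norm for vectors $x \in \mathbb{R}^2$.
\[
\begin{aligned}
\big(\sqrt{|1 + 0|} + \sqrt{|(-1) + (-1)|}\big)^2 &\le \big(\sqrt{|1|} + \sqrt{|-1|}\big)^2 + \big(\sqrt{|0|} + \sqrt{|-1|}\big)^2 \\
(1 + \sqrt{2})^2 &\le (1 + 1)^2 + (0 + 1)^2 \\
1 + 2\sqrt{2} + 2 &\le 4 + 1 \\
3 + 2\sqrt{2} &\le 5
\end{aligned}
\]
This inequality doesn't hold since $\sqrt{2} > 1 \Rightarrow 2\sqrt{2} > 2 \Rightarrow 3 + 2\sqrt{2} > 5$. We conclude that the function $f$ is not a norm for vectors $x \in \mathbb{R}^2$ by the counterexample $x = (1, -1)$ and $y = (0, -1)$.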
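The counterexample can be evaluated directly. The definition of `f` below is reconstructed from the displayed inequality (the square of the sum of square roots of absolute values); treat it as an assumption about the original problem statement.

```python
import math

def f(v):
    # Assumed form of f, read off the triangle-inequality expansion above.
    return (math.sqrt(abs(v[0])) + math.sqrt(abs(v[1]))) ** 2

x = (1.0, -1.0)
y = (0.0, -1.0)
s = (x[0] + y[0], x[1] + y[1])

lhs = f(s)           # (1 + sqrt(2))^2 = 3 + 2*sqrt(2)
rhs = f(x) + f(y)    # 4 + 1

print(lhs, rhs, lhs <= rhs)
```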
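(e) Let $x \in \mathbb{R}^n$. Prove that $\|x\|_\infty \le \|x\|_2 \le \sqrt{n}\,\|x\|_\infty$.
Solution: To prove $\|x\|_\infty \le \|x\|_2 \le \sqrt{n}\,\|x\|_\infty$, we break it into two smaller inequality proofs and then combine them at the end:
\[ \|x\|_\infty \le \|x\|_2 \;\wedge\; \|x\|_2 \le \sqrt{n}\,\|x\|_\infty \;\Rightarrow\; \|x\|_\infty \le \|x\|_2 \le \sqrt{n}\,\|x\|_\infty \]
(a) Proof: $\|x\|_\infty \le \|x\|_2$
In the sums over $j \ne i$ below, $i$ denotes an index at which $\max_{1 \le i \le n} |x_i|$ is attained.
\[
\begin{aligned}
\|x\|_\infty &\le \|x\|_2 \\
\max_{1 \le i \le n} |x_i| &\le \sqrt{\sum_{j=1}^n x_j^2} \\
\max_{1 \le i \le n} |x_i| &\le \sqrt{\Big(\max_{1 \le i \le n} |x_i|\Big)^2 + \sum_{j \ne i} x_j^2} \\
\Big(\max_{1 \le i \le n} |x_i|\Big)^2 &\le \Big(\max_{1 \le i \le n} |x_i|\Big)^2 + \sum_{j \ne i} x_j^2 \\
0 &\le \sum_{j \ne i} x_j^2
\end{aligned}
\]
This must hold since $\sum_{j \ne i} x_j^2$ is nonnegative: we are summing only nonnegative values (the $x_j$ values are squared), and the summation is equal to 0 only when all the $x_j$ values being summed over are 0. Thus, $\|x\|_\infty \le \|x\|_2$ holds.
(b) Proof: $\|x\|_2 \le \sqrt{n}\,\|x\|_\infty$
\[
\begin{aligned}
\|x\|_2 &\le \sqrt{n}\,\|x\|_\infty \\
\sqrt{\sum_{j=1}^n x_j^2} &\le \sqrt{n} \max_{1 \le i \le n} |x_i| \\
\sqrt{\Big(\max_{1 \le i \le n} |x_i|\Big)^2 + \sum_{j \ne i} x_j^2} &\le \sqrt{n} \max_{1 \le i \le n} |x_i| \\
\Big(\max_{1 \le i \le n} |x_i|\Big)^2 + \sum_{j \ne i} x_j^2 &\le n \Big(\max_{1 \le i \le n} |x_i|\Big)^2 \\
\sum_{j \ne i} x_j^2 &\le n \Big(\max_{1 \le i \le n} |x_i|\Big)^2 - \Big(\max_{1 \le i \le n} |x_i|\Big)^2 \\
\sum_{j \ne i} x_j^2 &\le (n - 1) \Big(\max_{1 \le i \le n} |x_i|\Big)^2 \\
\sum_{j \ne i} x_j^2 &\le \sum_{k=1}^{n-1} \Big(\max_{1 \le i \le n} |x_i|\Big)^2
\end{aligned}
\]
Define $x_{\max} = \max_{1 \le i \le n} |x_i|$; this inequality must hold since $x_{\max}^2 \ge x_j^2$ for every $j \ne i$. Intuitively, on the left side we are summing $n - 1$ terms $x_j^2$, each of which is at most $x_{\max}^2$, whereas on the right side we are summing $n - 1$ copies of $x_{\max}^2$. A sum of $n$ copies of the square of $x$'s largest-magnitude element will always be greater than or equal to the sum of the squares of all of $x$'s $n$ elements. Mathematically, we can consider the average: $\frac{\sum_{i=1}^n x_{\max}^2}{n} = x_{\max}^2$ and $\frac{\sum_{i=1}^n x_i^2}{n} \le x_{\max}^2$, since the elements of $x$ with magnitude less than $x_{\max}$ drag the average (sum) down. Given this, $\|x\|_2 \le \sqrt{n}\,\|x\|_\infty$ also holds.
Since we've shown $\|x\|_\infty \le \|x\|_2 \wedge \|x\|_2 \le \sqrt{n}\,\|x\|_\infty$, we can combine the inequalities to prove that $\|x\|_\infty \le \|x\|_2 \le \sqrt{n}\,\|x\|_\infty$.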
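A quick randomized check of both inequalities, together with the two vectors where each bound is tight (a basis vector and the all-ones vector). The function names and the sampling ranges are illustrative.

```python
import math
import random

def norm_inf(x):
    return max(abs(v) for v in x)

def norm_2(x):
    return math.sqrt(sum(v * v for v in x))

random.seed(1)
for _ in range(1000):
    n = random.randint(1, 10)
    x = [random.uniform(-5, 5) for _ in range(n)]
    assert norm_inf(x) <= norm_2(x) + 1e-12
    assert norm_2(x) <= math.sqrt(n) * norm_inf(x) + 1e-12

# Tight cases: the lower bound is attained by a basis vector,
# the upper bound by a vector whose entries all have equal magnitude.
e1 = [1.0, 0.0, 0.0, 0.0]
ones = [1.0, 1.0, 1.0, 1.0]
print(norm_inf(e1), norm_2(e1))
print(norm_2(ones), math.sqrt(4) * norm_inf(ones))
```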
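(f) Let $x \in \mathbb{R}^n$. Prove that $\|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2$.
(Hint: The Cauchy-Schwarz inequality may come in handy.)
Solution: We approach this proof similarly to the previous one: we'll break the inequality up into its two component inequalities, show they both hold, and combine them at the end to finish the proof.
\[ \|x\|_2 \le \|x\|_1 \;\wedge\; \|x\|_1 \le \sqrt{n}\,\|x\|_2 \;\Rightarrow\; \|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2 \]
(a) Proof: $\|x\|_2 \le \|x\|_1$
\[
\begin{aligned}
\|x\|_2 &\le \|x\|_1 \\
\sqrt{\sum_{j=1}^n x_j^2} &\le \sum_{j=1}^n |x_j| \\
\sum_{j=1}^n x_j^2 &\le \Big(\sum_{j=1}^n |x_j|\Big)^2 \\
\sum_{j=1}^n x_j^2 &\le (|x_1| + \dots + |x_n|)(|x_1| + \dots + |x_n|) \\
\sum_{j=1}^n x_j^2 &\le \sum_{i=1}^n x_i^2 + \sum_{i=1}^n \sum_{j \ne i} |x_i||x_j| \\
0 &\le \sum_{i=1}^n \sum_{j \ne i} |x_i||x_j|
\end{aligned}
\]
This inequality holds since the summation runs over products of absolute values and therefore attains a nonnegative value. Thus, $\|x\|_2 \le \|x\|_1$ holds.
(b) Proof: $\|x\|_1 \le \sqrt{n}\,\|x\|_2$
We will make use of the Cauchy-Schwarz inequality by pattern matching the inequality we are trying to prove to fit Cauchy-Schwarz. We start with the Cauchy-Schwarz inequality for vectors $u, v \in \mathbb{R}^n$:
\[ |\langle u, v \rangle| \le \|u\|_2 \|v\|_2 \]
To pattern match the inequality we want to prove into Cauchy-Schwarz, we plug in the vector of absolute values $(|x_1|, \dots, |x_n|)$ for $u$, whose 2-norm equals $\|x\|_2$, and the $n \times 1$ vector containing only 1s for $v$.
\[
\begin{aligned}
\left| \left\langle \begin{bmatrix} |x_1| \\ \vdots \\ |x_n| \end{bmatrix}, \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} \right\rangle \right| &\le \|x\|_2 \left\| \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} \right\|_2 \\
\sum_{i=1}^n |x_i| &\le \sqrt{\sum_{i=1}^n x_i^2} \sqrt{\sum_{i=1}^n 1^2} \\
\sum_{i=1}^n |x_i| &\le \sqrt{\sum_{i=1}^n x_i^2} \sqrt{n} \\
\|x\|_1 &\le \sqrt{n}\,\|x\|_2
\end{aligned}
\]
By substituting the vector of absolute values of $x$ and a 1-vector into Cauchy-Schwarz appropriately, we can easily see that the inequality $\|x\|_1 \le \sqrt{n}\,\|x\|_2$ holds.
We've proved both components of the inequality we intend to prove:
\[ \|x\|_2 \le \|x\|_1 \;\wedge\; \|x\|_1 \le \sqrt{n}\,\|x\|_2 \;\Rightarrow\; \|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2 \]
Thus, we can combine the smaller inequalities and show that $\|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2$.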
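The Cauchy-Schwarz substitution used in part (b) can be made concrete. In this sketch `cauchy_schwarz_gap` is a hypothetical helper that returns $\|u\|_2\|v\|_2 - |\langle u, v\rangle|$ for $u = (|x_1|, \dots, |x_n|)$ and $v$ the all-ones vector; the sample vectors are arbitrary.

```python
import math

def norm_1(x):
    return sum(abs(v) for v in x)

def norm_2(x):
    return math.sqrt(sum(v * v for v in x))

def cauchy_schwarz_gap(x):
    # u is the vector of absolute values of x, v is the all-ones vector.
    u = [abs(v) for v in x]
    ones = [1.0] * len(x)
    inner = sum(a * b for a, b in zip(u, ones))  # equals ||x||_1
    return norm_2(u) * norm_2(ones) - abs(inner)

samples = [
    [3.0, -4.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, -2.0, 0.0],
    [0.5, -0.25, 2.0, -7.0, 1.0],
]
for x in samples:
    n = len(x)
    assert norm_2(x) <= norm_1(x) + 1e-12
    assert norm_1(x) <= math.sqrt(n) * norm_2(x) + 1e-12
    assert cauchy_schwarz_gap(x) >= -1e-12
```

The gap is zero exactly when all entries of $x$ have the same magnitude, which is the case where $\|x\|_1 = \sqrt{n}\,\|x\|_2$.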
4. Eigenvalues
Let $A \in \mathbb{R}^{n \times n}$ be a symmetric matrix with $A \succeq 0$.
(a) Prove that the largest eigenvalue of $A$ is
\[ \lambda_{\max}(A) = \max_{\|x\|_2 = 1} x^\top A x. \]
(Hint: Use the Spectral Theorem for Symmetric Matrices to reduce the problem to the diagonal case.)
Solution: We begin with what we are aiming to prove. Note that this part makes the answer to the next part (b) clear and simple, since both follow the same structure; the only difference is which eigenvalue ($\lambda_{\max}(A)$ or $\lambda_{\min}(A)$) is being isolated.
\[ \lambda_{\max}(A) = \max_{\|x\|_2 = 1} x^\top A x \]
Use the Spectral Theorem for Symmetric Matrices to express a simpler, but equivalent problem by decomposing $A$ into $Q \Lambda Q^\top$, where $\Lambda$ is the diagonal matrix containing $A$'s eigenvalues.
\[ \lambda_{\max}(A) = \max_{\|x\|_2 = 1} x^\top (Q \Lambda Q^\top) x \]
Now define a vector $v = Q^\top x$. We will show that $\|x\|_2 = \|v\|_2 = 1$ to reduce the problem to the diagonal case: $\lambda_{\max}(A) = \max_{\|v\|_2 = 1} v^\top \Lambda v$.
\[
\begin{aligned}
\|v\|_2 &= \|Q^\top x\|_2 \\
&= \sqrt{(Q^\top x)^\top Q^\top x} \\
&= \sqrt{x^\top Q Q^\top x}
\end{aligned}
\]
Since $Q$ is orthogonal, $Q Q^\top = I$:
\[ \|v\|_2 = \sqrt{x^\top x} = \|x\|_2 \]
Furthermore, every $v$ corresponds to exactly one $x$, since $v = Q^\top x \Leftrightarrow x = Qv$: by the properties of orthogonal matrices, $Q^\top$ is invertible with inverse $Q$. Hence, as $x$ ranges over all unit vectors, $v$ ranges over all vectors $u \in \mathbb{R}^n$ s.t. $\|u\|_2 = 1$. Combining all this, we have reduced our original problem to the diagonal case. So continuing with our proof:
\[
\begin{aligned}
\max_{\|x\|_2 = 1} x^\top A x &= \max_{\|v\|_2 = 1} v^\top \Lambda v \\
&= \max_{\|v\|_2 = 1} \begin{bmatrix} v_1 & \dots & v_n \end{bmatrix}
\begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{bmatrix}
\begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix} \\
&= \max_{\|v\|_2 = 1} (\lambda_1 v_1^2 + \dots + \lambda_n v_n^2) \\
&= \max_{\|v\|_2 = 1} \sum_{i=1}^n \lambda_i v_i^2
\end{aligned}
\]
Since we are constrained by $\|v\|_2 = 1$, to maximize $\sum_{i=1}^n \lambda_i v_i^2$ all of $v$'s weight must fall on $A$'s maximum eigenvalue: $\forall i \in \{1, \dots, n\} : 0 \le v_i^2 \le 1$ and $\sum_{i=1}^n v_i^2 = 1$, so $\sum_{i=1}^n \lambda_i v_i^2 \le \lambda_{\max}(A) \sum_{i=1}^n v_i^2 = \lambda_{\max}(A)$, with equality when $v$ is the basis vector that isolates the maximum eigenvalue. Thus:
\[
\max_{\|v\|_2 = 1} \sum_{i=1}^n \lambda_i v_i^2 = \max_{1 \le i \le n} \{\lambda_1, \dots, \lambda_n\} = \lambda_{\max}(A)
\]
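For a 2x2 symmetric matrix the claim can be checked by sweeping unit vectors $x = (\cos t, \sin t)$ and comparing the extreme values of $x^\top A x$ with the closed-form eigenvalues. The matrix below is an illustrative assumption; its eigenvalues are 3 and 1.

```python
import math

def quad_form(A, x):
    # x^T A x for a symmetric 2x2 matrix.
    return A[0][0] * x[0] ** 2 + 2 * A[0][1] * x[0] * x[1] + A[1][1] * x[1] ** 2

A = [[2.0, 1.0],
     [1.0, 2.0]]

# Closed-form eigenvalues of a symmetric 2x2 matrix [[a, b], [b, c]].
a, b, c = A[0][0], A[0][1], A[1][1]
center = (a + c) / 2
radius = math.sqrt(((a - c) / 2) ** 2 + b * b)
lam_max, lam_min = center + radius, center - radius

# Unit vectors on the half circle (x and -x give the same value).
steps = 2000
values = []
for k in range(steps):
    t = math.pi * k / steps
    values.append(quad_form(A, (math.cos(t), math.sin(t))))

print(max(values), lam_max)
print(min(values), lam_min)
```

The same sweep also illustrates the analogous minimization statement for the smallest eigenvalue.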
(b) Similarly, prove that the smallest eigenvalue of $A$ is
\[ \lambda_{\min}(A) = \min_{\|x\|_2 = 1} x^\top A x. \]
Solution: We use the same proof that we constructed in part (a); the only difference is that we are now minimizing our objective function $x^\top A x$, which is equivalent to minimizing $v^\top \Lambda v$ as shown in part (a). Following our previous logic, we can go through this proof quickly:
\[
\begin{aligned}
\min_{\|x\|_2 = 1} x^\top A x &= \min_{\|v\|_2 = 1} v^\top \Lambda v \\
&= \min_{\|v\|_2 = 1} \sum_{i=1}^n \lambda_i v_i^2
\end{aligned}
\]
Similar to how all the weight of $v$ should fall on the maximum value in the previous part, we want all the weight of $v$ to fall on the minimum coefficient $\lambda_{\min}$ in order to minimize the objective function. Thus we have as desired:
\[ \min_{\|x\|_2 = 1} x^\top A x = \lambda_{\min}(A) \]
(c) Is either of the optimization problems described in parts (a) and (b) a convex program? Justify your
answer.
Solution: Neither of the optimization problems in parts (a) and (b) is a convex program. Even though the objective function $x^\top A x$ is convex since $A \succeq 0$, the constraint set must also be convex for the optimization problem to be a convex program as a whole. In both cases, the constraint set $\{x : \|x\|_2 = 1\}$ is a circle, or a spherical shell in higher dimensions. Using the method taught in class for checking whether a constraint set is convex, take any two distinct points on the shell: the line segment between them passes through the interior of the ball, which is not part of the constraint set. For example, $x$ and $-x$ both have norm 1, but their midpoint is $0$. Thus this type of constraint set isn't convex, and because our constraints correspond to this shell-like set, the optimization problems in (a) and (b) are not convex.
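(d) Show that if $\lambda$ is an eigenvalue of $A$ then $\lambda^2$ is an eigenvalue of $A^2$, and deduce that
\[ \lambda_{\max}(A^2) = \lambda_{\max}(A)^2 \quad \text{and} \quad \lambda_{\min}(A^2) = \lambda_{\min}(A)^2. \]
\[
\begin{aligned}
Av &= \lambda v \\
A(Av) &= A(\lambda v) \\
A^2 v &= \lambda (Av) \\
A^2 v &= \lambda^2 v
\end{aligned}
\]
Clearly, for an eigenvector $v$ of $A$, we've shown that $A^2 v = \lambda^2 v$, thereby showing that $\lambda^2$ is indeed an eigenvalue of $A^2$. Now we deduce $\lambda_{\max}(A^2) = \lambda_{\max}(A)^2$ and $\lambda_{\min}(A^2) = \lambda_{\min}(A)^2$.
\[
\begin{aligned}
A &= Q \Lambda Q^\top \\
A^2 &= Q \Lambda Q^\top Q \Lambda Q^\top \\
&= Q \Lambda^2 Q^\top
\end{aligned}
\]
Note here that $\Lambda^2$ is $\Lambda$ with the eigenvalues $\lambda_i$ on its diagonal squared. The order of these diagonal entries is also preserved, because $A \succeq 0$ means every $\lambda_i \ge 0$ and squaring is monotone on nonnegative numbers; this will be helpful in making our deduction. Now let's deduce starting with what we proved in parts (a) and (b):
\[ \lambda_{\max}(A^2) = \max_{\|v\|_2 = 1} v^\top \Lambda^2 v, \qquad \lambda_{\min}(A^2) = \min_{\|v\|_2 = 1} v^\top \Lambda^2 v \]
In both cases, we've gotten to the diagonal case of the proof. Because we know that $\Lambda^2$ is simply $\Lambda$ with the values on its diagonal squared and preserved in order, the maximum and minimum eigenvalues of $\Lambda^2$ are simply the maximum and minimum eigenvalues of $\Lambda$ squared.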
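A tiny worked instance of $Av = \lambda v \Rightarrow A^2 v = \lambda^2 v$, using an illustrative 2x2 matrix whose eigenpairs are known by inspection ($\lambda = 3$ with $v = (1, 1)$ and $\lambda = 1$ with $v = (1, -1)$).

```python
def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

A = [[2.0, 1.0],
     [1.0, 2.0]]

v_max = [1.0, 1.0]    # eigenvector for lambda = 3
v_min = [1.0, -1.0]   # eigenvector for lambda = 1

Av = matvec(A, v_max)      # lambda * v
AAv = matvec(A, Av)        # lambda^2 * v
lam_max = Av[0] / v_max[0]

Aw = matvec(A, v_min)
AAw = matvec(A, Aw)
lam_min = Aw[0] / v_min[0]

print(lam_max, AAv, lam_min, AAw)
```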
Solution: We can start by using parts (a) and (b) to construct an inequality that we can begin
our proof with.
Take the square root of the inequality to show the desired inequality.
Solution: We will start with the inequality proved in part (e) and make the deduction for $x \in \mathbb{R}^n$.
Define a vector $y = \frac{x}{\|x\|_2}$ (for $x \ne 0$). $y$ is essentially the unit vector of any vector $x \in \mathbb{R}^n$; thus, we know that $y \in \mathbb{R}^n$ with $\|y\|_2 = 1$. Given this definition of $y$ we can use the inequality from (e).
We've deduced the result from (e) as desired by making use of our defined $y$ (the unit vector of any $x \in \mathbb{R}^n$). Because $y$ is defined by $x$, which is any nonzero vector in $\mathbb{R}^n$, we know that this inequality must hold for any such $x$.
5. Gradient Descent
Consider the optimization problem $\min_{x \in \mathbb{R}^n} \frac{1}{2} x^\top A x - b^\top x$, where $A$ is a symmetric matrix with $0 < \lambda_{\min}(A)$ and $\lambda_{\max}(A) < 1$.
(a) Using the first order optimality conditions, derive a closed-form solution for the minimum possible value of $x$, which we denote $x^*$.
Solution: We consider the given optimization problem and $A$'s properties to find $x^*$. An overview of how we can approach this is:
(i) Take the gradient of the objective function and set it to 0 to find $x^*$ by solving for $x$.
(ii) We then need to show that the value of $x$ we found in the previous step is indeed the minimizer (global minimum) by showing that the Hessian of the objective function is positive definite, so that the objective is convex. Because the objective function is then a convex quadratic, a point where its gradient vanishes is its global minimum.
Define the objective function $f(x) = \frac{1}{2} x^\top A x - b^\top x$ and solve $\nabla f(x^*) = 0$ to find $x^*$. Note that from question 3, we know how to take the gradient of a function of this form. Furthermore, we will use that $A = A^\top$ given that $A$ is symmetric. $A$ must also be invertible (i.e. $A^{-1}$ exists, since 0 is not an eigenvalue of $A$).
\[
\begin{aligned}
\nabla f(x^*) &= 0 \\
\nabla \Big( \frac{1}{2} x^{*\top} A x^* - b^\top x^* \Big) &= 0 \\
\frac{1}{2} (A + A^\top) x^* - b &= 0 \\
\frac{1}{2} (2A) x^* - b &= 0 \\
A x^* &= b \\
x^* &= A^{-1} b
\end{aligned}
\]
We've found $x^*$ but we must now show that $x^*$ actually minimizes $f(x)$ by showing that $f$ is convex, which we do by finding its Hessian $H_f(x)$.
\[
\begin{aligned}
H_f(x) &= \nabla^2 f(x) \\
&= \nabla (\nabla f(x)) \\
&= \nabla (Ax - b) \\
&= A
\end{aligned}
\]
Here we find that for all $x$, the Hessian of $f(x)$ is simply $A$, our symmetric matrix whose eigenvalues all lie strictly between 0 and 1. Since all the eigenvalues of $A$ are strictly positive, $A$ is positive definite (question 2 showed the analogous equivalence between $A \succeq 0$ and nonnegative eigenvalues). The Hessian is therefore positive definite at every $x$, so $f$ is strictly convex and $x^*$ must be its unique global minimum.
\[ x^* = A^{-1} b \]
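The closed form can be checked on a small instance. The sketch below assumes an illustrative 2x2 matrix with eigenvalues in $(0, 1)$ and an arbitrary $b$; `solve_2x2` is a hypothetical helper using Cramer's rule.

```python
import random

def solve_2x2(A, b):
    # Cramer's rule for a 2x2 system A x = b.
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    return [(a22 * b[0] - a12 * b[1]) / det,
            (a11 * b[1] - a21 * b[0]) / det]

def matvec(A, x):
    return [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]

def f(A, b, x):
    # Objective 0.5 * x^T A x - b^T x.
    Ax = matvec(A, x)
    return 0.5 * (x[0] * Ax[0] + x[1] * Ax[1]) - (b[0] * x[0] + b[1] * x[1])

def grad(A, b, x):
    Ax = matvec(A, x)
    return [Ax[0] - b[0], Ax[1] - b[1]]

# Illustrative data: symmetric A with eigenvalues about 0.26 and 0.54.
A = [[0.5, 0.1], [0.1, 0.3]]
b = [1.0, 2.0]

x_star = solve_2x2(A, b)
g = grad(A, b, x_star)

# f should not decrease when moving away from x_star.
random.seed(2)
never_lower = all(
    f(A, b, [x_star[0] + random.uniform(-1, 1), x_star[1] + random.uniform(-1, 1)])
    >= f(A, b, x_star)
    for _ in range(200)
)
print(x_star, g, never_lower)
```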
(b) Solving a linear system directly using Gaussian elimination takes $O(n^3)$ time, which may be wasteful if the matrix $A$ is sparse. For this reason, we will use gradient descent to compute an approximation to the optimal point $x^*$. Write down the update rule for gradient descent with a step size of 1.
Solution: We know that the update rule of gradient descent can be written generally as:
\[ x^{(k+1)} \leftarrow x^{(k)} - \alpha \nabla_x f(x^{(k)}) \]
and that for a step size of 1, $\alpha = 1$. We know from part (a) that $\nabla_x f(x^{(k)}) = A x^{(k)} - b$. Plugging these values in, we obtain the update rule for gradient descent with a step size of 1.
\[ x^{(k+1)} \leftarrow x^{(k)} - (A x^{(k)} - b) = (I - A) x^{(k)} + b \]
\[ x^{(k)} - x^* = (I - A)(x^{(k-1)} - x^*). \]
Solution: To show this, we use our result from (b). We also note that because $x^* = A^{-1} b$, $b = A x^*$.
\[
\begin{aligned}
x^{(k)} &= (I - A) x^{(k-1)} + b \\
x^{(k)} - x^* &= (I - A) x^{(k-1)} + b - x^* \\
x^{(k)} - x^* &= (I - A) x^{(k-1)} + (A x^*) - x^* \\
x^{(k)} - x^* &= (I - A) x^{(k-1)} + (A - I) x^* \\
x^{(k)} - x^* &= (I - A) x^{(k-1)} + (I - A)(-x^*) \\
x^{(k)} - x^* &= (I - A)(x^{(k-1)} - x^*)
\end{aligned}
\]
\[ \|x^{(k)} - x^*\|_2 \le \rho \|x^{(k-1)} - x^*\|_2. \]
Solution: We will use what was shown in (c) along with the fact from 4(f) that for a square $n \times n$ symmetric matrix $A \succeq 0$ and vector $x \in \mathbb{R}^n$,
\[ \|Ax\|_2 \le \lambda_{\max}(A) \|x\|_2, \]
to show that the given inequality for this part holds. Note that the matrix $A$ for this problem is also symmetric and positive semidefinite given that $0 < \lambda_{\min}(A) \le \lambda_{\max}(A) < 1$ (from question 2 we know that since all the eigenvalues of $A$ are nonnegative, $A$ is also positive semidefinite).
First we know that $x^{(k)} - x^* = (I - A)(x^{(k-1)} - x^*)$. We can substitute this into what we're trying to prove.
\[
\begin{aligned}
\|x^{(k)} - x^*\|_2 &\le \rho \|x^{(k-1)} - x^*\|_2 \\
\|(I - A)(x^{(k-1)} - x^*)\|_2 &\le \rho \|x^{(k-1)} - x^*\|_2
\end{aligned}
\]
Now to utilize what we found from 4(f), we can start pattern matching and define a matrix $B$ and vector $v$.
\[
\begin{aligned}
B &= I - A \\
v &= x^{(k-1)} - x^*
\end{aligned}
\]
Plugging these definitions into what we were trying to prove, we are left with something similar in form to 4(f):
\[ \|Bv\|_2 \le \rho \|v\|_2 \]
Now, for our use of 4(f) to work, we must also show that $B$ preserves $A$'s qualities and that $v$ is still any vector in $\mathbb{R}^n$. First, $B$ must still be symmetric since $I - A$ (a difference of symmetric matrices) is symmetric. Next we need to show that $B \succeq 0$, so let's consider $B$'s eigenvalues. We know that for an eigenvector $u$ of $A$ and its corresponding eigenvalue $\lambda$ that
\[ Au = \lambda u \]
Now consider $Bu$. Because $B = I - A$, $A$ and $B$ share eigenvectors: $u$ is also an eigenvector of $B$.
\[ Bu = (I - A)u = Iu - Au = u - \lambda u = (1 - \lambda) u \]
We find that each of $B$'s eigenvalues is simply $\lambda_B = 1 - \lambda_A$. Because we know that all of $A$'s eigenvalues are between 0 and 1, $\lambda_B = 1 - \lambda_A$ must also be between 0 and 1, where $\lambda_A$ is an eigenvalue of $A$. This also means that $B$'s eigenvalues are nonnegative, meaning that $B \succeq 0$.
Lastly, given that $B$ and $v$ have the same qualities as $A$ and $x$ in the context of 4(f), we set $\rho = \lambda_{\max}(B) = 1 - \lambda_{\min}(A)$ and use the fact that $0 < \lambda_{\max}(B) < 1$ to show that this choice satisfies the constraints on $\rho$. We are left with
\[
\begin{aligned}
\|Bv\|_2 &\le \lambda_{\max}(B) \|v\|_2 \\
\|(I - A)(x^{(k-1)} - x^*)\|_2 &\le \lambda_{\max}(I - A) \|x^{(k-1)} - x^*\|_2 \\
\|x^{(k)} - x^*\|_2 &\le \rho \|x^{(k-1)} - x^*\|_2
\end{aligned}
\]
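The update rule and the contraction bound can be observed numerically. This sketch assumes an illustrative 2x2 problem and takes the contraction factor to be $1 - \lambda_{\min}(A)$, computed from the closed-form eigenvalues of a symmetric 2x2 matrix; all names and values are assumptions for illustration.

```python
import math

def matvec(A, x):
    return [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]

def dist(x, y):
    return math.sqrt((x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2)

# Illustrative symmetric A with eigenvalues strictly between 0 and 1.
A = [[0.5, 0.1], [0.1, 0.3]]
b = [1.0, 2.0]

# Exact minimizer via Cramer's rule (x* = A^{-1} b).
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
x_star = [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
          (A[0][0] * b[1] - A[1][0] * b[0]) / det]

# Smallest eigenvalue of the symmetric 2x2 matrix, then rho = 1 - lambda_min.
center = (A[0][0] + A[1][1]) / 2
radius = math.sqrt(((A[0][0] - A[1][1]) / 2) ** 2 + A[0][1] ** 2)
rho = 1 - (center - radius)

x = [0.0, 0.0]
errors = [dist(x, x_star)]
for _ in range(50):
    Ax = matvec(A, x)
    # Gradient step with step size 1: x <- x - (A x - b) = (I - A) x + b.
    x = [x[i] - (Ax[i] - b[i]) for i in range(2)]
    errors.append(dist(x, x_star))

contracts = all(errors[k] <= rho * errors[k - 1] + 1e-12
                for k in range(1, len(errors)))
print(rho, errors[-1], contracts)
```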
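(e) Let $x^{(0)} \in \mathbb{R}^n$ be a starting value for our gradient descent iterations. If we want our solution $x^{(k)}$ to be $\epsilon > 0$ close to $x^*$, i.e. $\|x^{(k)} - x^*\|_2 \le \epsilon$, then how many iterations of gradient descent should we perform? In other words, how large should $k$ be? Give your answer in terms of $\rho$, $\|x^{(0)} - x^*\|_2$, and $\epsilon$. Note that $0 < \rho < 1$, so $\log \rho < 0$.
Solution: Given that our answer should be in terms of $\rho$, we take the cue that we will use (d) to find how large $k$ should be: a lower bound on what $k$ needs to be in order to have our solution $x^{(k)}$ be $\epsilon$-close to $x^*$. Let's begin by considering what we found in (d).
\[
\begin{aligned}
\|x^{(k)} - x^*\|_2 &\le \rho \|x^{(k-1)} - x^*\|_2 \\
\|x^{(k)} - x^*\|_2 &\le \rho (\rho \|x^{(k-2)} - x^*\|_2) \\
\|x^{(k)} - x^*\|_2 &\le \rho (\rho (\rho \|x^{(k-3)} - x^*\|_2)) \\
&\;\;\vdots \\
\|x^{(k)} - x^*\|_2 &\le \rho ( \dots (\rho \|x^{(0)} - x^*\|_2)) \\
\|x^{(k)} - x^*\|_2 &\le \rho^k \|x^{(0)} - x^*\|_2
\end{aligned}
\]
Since $\|x^{(k)} - x^*\|_2 \le \rho^k \|x^{(0)} - x^*\|_2$, it suffices to choose $k$ so that the right-hand side is at most $\epsilon$; we can now solve for $k$ to satisfy our desired error bound $\|x^{(k)} - x^*\|_2 \le \epsilon$.
\[
\begin{aligned}
\rho^k \|x^{(0)} - x^*\|_2 &\le \epsilon \\
\rho^k &\le \frac{\epsilon}{\|x^{(0)} - x^*\|_2} \\
k \ln \rho &\le \ln \frac{\epsilon}{\|x^{(0)} - x^*\|_2} \\
k &\ge \frac{1}{\ln \rho} \ln \frac{\epsilon}{\|x^{(0)} - x^*\|_2} \\
k &\ge \frac{1}{-\ln \frac{1}{\rho}} \Big( -\ln \frac{\|x^{(0)} - x^*\|_2}{\epsilon} \Big) \\
k &\ge \frac{1}{\ln \frac{1}{\rho}} \ln \frac{\|x^{(0)} - x^*\|_2}{\epsilon}
\end{aligned}
\]
Note that the inequality flips when we divide by $\ln \rho < 0$, and that inverting the expressions inside the logarithms makes both factors positive whenever $\epsilon < \|x^{(0)} - x^*\|_2$. If $\|x^{(0)} - x^*\|_2 = 0$ the logarithm is undefined, but in that case no iterations are needed.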
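The bound on $k$ translates directly into a small helper. `iterations_needed` is a hypothetical name; it rounds the bound up to an integer and returns 0 when the starting point is already within the tolerance, which is an added convention rather than part of the derivation.

```python
import math

def iterations_needed(rho, initial_error, eps):
    """Smallest integer k with rho**k * initial_error <= eps, assuming 0 < rho < 1."""
    if initial_error <= eps:
        return 0
    return math.ceil(math.log(initial_error / eps) / math.log(1 / rho))

# Example values (illustrative): rho = 0.5, initial error 8, tolerance 1e-3.
k = iterations_needed(0.5, 8.0, 1e-3)
print(k, 0.5 ** k * 8.0)
```

With these values the bound is $\ln(8000)/\ln 2 \approx 12.97$, so 13 iterations suffice while 12 do not.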
(f) Observe that the running time of each iteration of gradient descent is dominated by a matrix-vector product. What is the overall running time of gradient descent to achieve a solution $x^{(k)}$ which is $\epsilon$-close to $x^*$? Give your answer in terms of $\rho$, $\|x^{(0)} - x^*\|_2$, $\epsilon$, and $n$.
Solution: Given that the runtime of each iteration of gradient descent is dominated by the matrix-vector product, we can find the overall running time of finding an $x^{(k)}$ that is $\epsilon$-close to $x^*$ by reasoning that the total runtime should be $k \cdot O(\text{runtime of matrix-vector multiplication})$, since we'll have $k$ iterations of gradient descent. The runtime of multiplying a matrix in $\mathbb{R}^{n \times n}$ by a vector is $O(n^2)$. Thus the total runtime is $O(kn^2)$. Expressing this in the desired terms using what $k$ should be from (e), the overall running time of gradient descent achieving a solution $x^{(k)}$ $\epsilon$-close to $x^*$ is:
\[ O\left( \left[ \frac{1}{\ln \frac{1}{\rho}} \ln \frac{\|x^{(0)} - x^*\|_2}{\epsilon} \right] n^2 \right) \]
6. Classification
Suppose we have a classification problem with classes labeled $1, \dots, c$ and an additional doubt category labeled $c + 1$. Let $f : \mathbb{R}^d \to \{1, \dots, c + 1\}$ be a decision rule. Define the loss function
\[
L(f(x) = i, y = j) =
\begin{cases}
0 & \text{if } i = j, \quad i, j \in \{1, \dots, c\} \\
\lambda_r & \text{if } i = c + 1 \\
\lambda_s & \text{otherwise}
\end{cases}
\]
where $\lambda_r \ge 0$ is the loss incurred for choosing doubt and $\lambda_s \ge 0$ is the loss incurred for making a misclassification. Hence the risk of classifying a new data point $x$ as class $i \in \{1, 2, \dots, c + 1\}$ is
\[ R(f(x) = i \mid x) = \sum_{j=1}^c L(f(x) = i, y = j) P(Y = j \mid x). \]
(a) Show that the following policy obtains the minimum risk. (1) Choose class $i$ if $P(Y = i \mid x) \ge P(Y = j \mid x)$ for all $j$ and $P(Y = i \mid x) \ge 1 - \lambda_r / \lambda_s$; (2) choose doubt otherwise.
Solution: To show that this policy obtains the minimum risk, we must consider how much risk is associated with each decision, and understand how the rules minimize the risk incurred at every step. This is best done by breaking the policy rules into cases.
Approach
(1) Break the policy down into smaller cases to analyze and consider the basic contributors to risk.
(2) Show how each clause of the conjunction condition for choosing $i$ minimizes risk.
(3) Show how choosing doubt instead of $i$ minimizes risk.
The conjunction rule says that we should choose the class $i$ that we are most confident about if the expected risk penalty of choosing $i$ is no more than the penalty incurred from choosing doubt; otherwise we choose doubt. Intuitively, it makes sense that this policy minimizes risk since it chooses the class/category that minimizes the expected loss every time it classifies a sample.
We can now analyze the policy with this intuition in mind. Whenever we attempt to classify a sample point $x$ in the context of our policy and loss function $L$, we can consider:
The probability that $x$'s true class is $i$: $P(Y = i \mid x)$, $i \in \{1, 2, \dots, c\}$
The probability that $x$'s true class is not $i$: $P(Y \ne i \mid x) = 1 - P(Y = i \mid x)$
The loss whenever we choose the wrong class: $\lambda_s$
The loss whenever we choose doubt: $\lambda_r$
For some class $i$ and true class $i^*$, we know that the risk of choosing $i$ is the same as the expected loss of choosing $i$. Intuitively, this is the sum over $j$ of the probability that $x$'s class is $j$ multiplied by the loss incurred if we choose $x$'s class to be $i$ given that $x$'s class is $j$. The risk function expresses this intuition:
\[ R(f(x) = i \mid x) = \sum_{j=1}^c L(f(x) = i, i^* = j) P(Y = j \mid x) \]
To see mathematically why we choose $i$ to be the class with the highest probability of being correct given $x$, we can expand the risk function and reason about it. First, we know that $x$ has only one correct class $i^* \in \{1, \dots, c\}$, which means that if we expand the risk function, the loss function will output $\lambda_s$ for $c - 1$ of the terms and 0 for one (the correct $i^*$) of the terms in the summation. It helps to consider choosing some bogus $i$ that is not even in the set of categories to choose from, so that all $c$ of the terms have a $\lambda_s$ factor. To visualize:
\[ i \notin \{1, \dots, c + 1\} : R(f(x) = i \mid x) = \lambda_s P(Y = 1 \mid x) + \dots + \lambda_s P(Y = c \mid x) \]
First, notice that the probabilities are now the dominating component in determining the size of the risk, since we can rewrite the risk as $R(f(x) = i \mid x) = \lambda_s \sum_{j=1}^c P(Y = j \mid x)$ in this scenario where we choose a bogus $i$. Since we can choose only one $i$, this is effectively saying we can zero only one term in the summation. Because the probability terms dominate in contributing to the risk size, we want to zero the term in the summation with the highest probability in order to minimize $R$. This translates to picking $x$'s class to be the one ($i$) that we are most confident about, since $P(Y = i \mid x)$ is the largest. This explains why we want to choose $i$ such that $\forall j : P(Y = i \mid x) \ge P(Y = j \mid x)$: this rule minimizes the amount of risk we incur within the context of choosing a class in $\{1, \dots, c\}$.
However, because we also have the option of choosing doubt ($c + 1$) as a category, we need to consider when choosing doubt would benefit us in minimizing the risk. From our analysis so far, we know that choosing the class $i$ that we are most confident about minimizes risk if we had to choose one of the $c$ possible classes. With the option of doubt and its corresponding cost $\lambda_r$ in mind, we only need to ask: when would choosing doubt minimize the expected loss for classifying the given $x$? We know that the expected loss of choosing $i$ is the probability that we are incorrect in choosing $i$ multiplied by the loss for being wrong, $\lambda_s$. This is simply $\lambda_s P(Y \ne i \mid x)$. To choose doubt, we pay $\lambda_r$ loss. Thus, to minimize how much we lose for classifying $x$, we should choose $i$ only when the risk of choosing $i$ is no more than the cost of choosing doubt. This is when
\[ \lambda_s P(Y \ne i \mid x) \le \lambda_r \]
Now we can manipulate this decision criterion to reflect the second clause of the policy's conjunction rule for choosing $i$.
\[
\begin{aligned}
\lambda_s P(Y \ne i \mid x) &\le \lambda_r \\
1 - P(Y = i \mid x) &\le \frac{\lambda_r}{\lambda_s} \\
P(Y = i \mid x) &\ge 1 - \frac{\lambda_r}{\lambda_s}
\end{aligned}
\]
We see that both clauses in the policy's rule for choosing $i$ serve to minimize the risk when classifying $x$. If the risk of choosing $i$ is higher than the cost of simply selecting doubt, we will always choose doubt instead. Ultimately, this policy obtains the minimum risk.
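The policy can be written as a short decision function and compared against a brute-force minimization of the risk over all $c + 1$ actions. This is a sketch under stated assumptions: `decide` and `risks` are hypothetical names, the posteriors are assumed to sum to 1, and the comparison uses the risk form $\lambda_s (1 - P(Y = i \mid x)) \le \lambda_r$ to avoid dividing by $\lambda_s$.

```python
def decide(posteriors, loss_doubt, loss_error):
    """Return a 1-based class index, or c + 1 for the doubt category."""
    c = len(posteriors)
    best = max(range(c), key=lambda j: posteriors[j])
    # Risk of committing to the most probable class.
    risk_best = loss_error * (1 - posteriors[best])
    return best + 1 if risk_best <= loss_doubt else c + 1

def risks(posteriors, loss_doubt, loss_error):
    # Risk of each action 1..c, followed by the risk of doubt (action c + 1).
    class_risks = [loss_error * (1 - p) for p in posteriors]
    return class_risks + [loss_doubt]

# Illustrative posteriors over c = 3 classes.
confident = [0.7, 0.2, 0.1]
uncertain = [0.4, 0.35, 0.25]

print(decide(confident, 0.5, 1.0))   # commits to class 1
print(decide(uncertain, 0.5, 1.0))   # chooses doubt (4)

for post in (confident, uncertain):
    action = decide(post, 0.5, 1.0)
    all_risks = risks(post, 0.5, 1.0)
    assert all_risks[action - 1] == min(all_risks)
```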
(b) What happens if $\lambda_r = 0$? What happens if $\lambda_r > \lambda_s$? Explain why this is consistent with what one would expect intuitively.
Solution: With a firm understanding of how the policy minimizes risk from (a), we can quickly see how $\lambda_r$ affects the policy's decision making.
(1) $\lambda_r = 0$
When $\lambda_r = 0$, the cost of choosing doubt is nothing. Intuitively, the policy will classify $x$ in the doubt category $c + 1$ whenever the risk $R(f(x) = i \mid x) > 0$. Mathematically, consider the second clause of the condition for choosing $i$: $P(Y = i \mid x) \ge 1 - \frac{\lambda_r}{\lambda_s}$. With $\lambda_r$ set to 0, we now have:
\[ P(Y = i \mid x) \ge 1 \]
This translates to: we can only choose $i$ when we are sure that $x$'s class is $i$. Because $P(Y = i \mid x)$ is essentially never 1 unless we can somehow guarantee it (i.e. by cheating in some way), the policy will almost always select doubt as $x$'s category. Ultimately, this can be interpreted as a case that impedes classification.
(2) $\lambda_r > \lambda_s$
Again, we consider the second clause of the condition for choosing $i$: $P(Y = i \mid x) \ge 1 - \frac{\lambda_r}{\lambda_s}$. If $\lambda_r > \lambda_s$ then $\frac{\lambda_r}{\lambda_s} > 1$ and $1 - \frac{\lambda_r}{\lambda_s} < 0$ whenever we look to our policy to classify. The threshold in the second clause is therefore a negative value:
\[ P(Y = i \mid x) \ge 1 - \frac{\lambda_r}{\lambda_s}, \qquad 1 - \frac{\lambda_r}{\lambda_s} < 0 \]
By definition, the probability of an event can never be negative, so the second clause of the condition for choosing $i$ will always hold. The implication of this is that the policy's decision function will never choose doubt ($c + 1$) and will always classify $x$ as the most probable class $i$.
7. Gaussian Classification
Let P (x|i ) N (i , 2 ) for a two-category, one-dimensional classification problem with classes 1 and 2 ,
P (1 ) = P (2 ) = 1/2, and 2 > 1 .
(a) Find the Bayes optimal decision boundary and the corresponding Bayes decision rule.
Solution: The Bayes optimal decision boundary (for this problem) is defined to be where P (Y =
1 , X = x) = P (Y = 2 |X = x) where Y is defined to be xs true class and X is a continuous random
variable. We start with this equality and apply Bayes Rule to solve for the x where the decision
boundary lies.
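$$P(Y = \omega_1|X = x) = P(Y = \omega_2|X = x)$$
$$P(x|\omega_1)P(\omega_1) = P(x|\omega_2)P(\omega_2)$$
Since $P(\omega_1) = P(\omega_2) = 1/2$:
$$P(x|\omega_1) = P(x|\omega_2)$$
$$N(\mu_1, \sigma^2) = N(\mu_2, \sigma^2)$$
$$\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu_1)^2}{2\sigma^2}} = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu_2)^2}{2\sigma^2}}$$
$$\ln e^{-\frac{(x - \mu_1)^2}{2\sigma^2}} = \ln e^{-\frac{(x - \mu_2)^2}{2\sigma^2}}$$
$$(x - \mu_1)^2 = (x - \mu_2)^2$$
$$x^2 - 2\mu_1 x + \mu_1^2 = x^2 - 2\mu_2 x + \mu_2^2$$
$$-2\mu_1 x + 2\mu_2 x = \mu_2^2 - \mu_1^2$$
$$2(\mu_2 - \mu_1)x = (\mu_2 - \mu_1)(\mu_2 + \mu_1)$$
$$x^* = \frac{\mu_2 + \mu_1}{2}$$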
Now that we've found our Bayes optimal decision boundary $x^* = \frac{\mu_2 + \mu_1}{2}$, we can
formulate the corresponding Bayes decision rule. Since $x^*$ is where the two posteriors are equal and
$\mu_2 > \mu_1$, everything to the right of the boundary should be classified as $\omega_2$, whereas
everything to the left of it should be classified as $\omega_1$. This is intuitive because if
$\mu_2 > \mu_1$, the Gaussian representing $P(x|\omega_2)$ is to the right of the one representing
$P(x|\omega_1)$. Visualizing this, we know that $\forall x > x^*: P(x|\omega_2) > P(x|\omega_1)$ and
$\forall x < x^*: P(x|\omega_1) > P(x|\omega_2)$. Thus our Bayes decision rule
$f: \mathbb{R} \to \{\omega_1, \omega_2\}$ is:
$$f(x) = \begin{cases} \omega_1 & \text{if } x < \frac{\mu_2 + \mu_1}{2} \\ \omega_2 & \text{if } x \geq \frac{\mu_2 + \mu_1}{2} \end{cases}$$
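As a quick sanity check on the boundary and the rule, here is a minimal sketch. The helper names `gaussian_pdf` and `bayes_rule` and the values of `mu1`, `mu2`, and `sigma` are illustrative assumptions, not part of the problem statement.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)


def bayes_rule(x, mu1, mu2):
    """Return 1 for omega_1 or 2 for omega_2.

    Assumes equal priors, a shared variance, and mu2 > mu1.
    """
    boundary = (mu1 + mu2) / 2
    return 1 if x < boundary else 2


mu1, mu2, sigma = 0.0, 2.0, 1.5  # illustrative values only
boundary = (mu1 + mu2) / 2

# The two class-conditional densities agree at the boundary.
print(gaussian_pdf(boundary, mu1, sigma), gaussian_pdf(boundary, mu2, sigma))
# Points left of the boundary go to omega_1, points right of it to omega_2.
print(bayes_rule(0.3, mu1, mu2), bayes_rule(1.7, mu1, mu2))
```

Because the priors are equal, comparing the two densities directly gives the same answer as the threshold test on either side of the boundary.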
(b) Show that the Bayes error associated with this decision rule is
$$P_e = \frac{1}{\sqrt{2\pi}} \int_a^\infty e^{-z^2/2}\,dz$$
where $a = \frac{\mu_2 - \mu_1}{2\sigma}$.
Solution: To simplify the math of showing that the Bayes error can be expressed as the $P_e$ above,
we will make use of two facts: (1) the priors of the classes are equal, $P(\omega_1) = P(\omega_2)$, and
(2) the Gaussians of the two classes share the same variance. Given these properties, the Gaussian
distributions of the two classes share the same shape. The only difference between them is where each
is centered, and because $\mu_2 > \mu_1$, $N(\mu_2, \sigma^2)$ lies to the right of $N(\mu_1, \sigma^2)$.
We will use these facts to show that
$$P(\text{misclassified as } \omega_1|\omega_2)P(\omega_2) = P(\text{misclassified as } \omega_2|\omega_1)P(\omega_1)$$
and then collapse two integrals into one to attain the desired expression for the Bayes error $P_e$.
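First we can re-express the Bayes error as the total probability of the two kinds of misclassification:
$$P_e = \frac{1}{2}\left[\int_{-\infty}^{\frac{\mu_2 + \mu_1}{2}} N(\mu_2, \sigma^2)\,dx + \int_{\frac{\mu_2 + \mu_1}{2}}^{\infty} N(\mu_1, \sigma^2)\,dx\right]$$
Given the similar properties of our two Gaussians for $\omega_1, \omega_2$, we can use their geometric
symmetry about the boundary and claim that
$$\int_{-\infty}^{\frac{\mu_2 + \mu_1}{2}} N(\mu_2, \sigma^2)\,dx = \int_{\frac{\mu_2 + \mu_1}{2}}^{\infty} N(\mu_1, \sigma^2)\,dx$$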
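Continuing on:
$$P_e = \frac{1}{2}\left[2\int_{\frac{\mu_2 + \mu_1}{2}}^{\infty} N(\mu_1, \sigma^2)\,dx\right]$$
$$= \int_{\frac{\mu_2 + \mu_1}{2}}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu_1)^2}{2\sigma^2}}\,dx$$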
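In order to express $P_e$ in the form above, we perform a change of variables. Let
$z = \frac{x - \mu_1}{\sigma}$, then $dz = \frac{1}{\sigma}\,dx$. Performing this change of variables,
we transform our integral to:
$$= \int_{z\left(\frac{\mu_2 + \mu_1}{2}\right)}^{z(\infty)} \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}}\,dz$$
Evaluating our new bounds, we have $z(\infty) = \infty$ and
$z\left(\frac{\mu_2 + \mu_1}{2}\right) = \frac{\mu_2 - \mu_1}{2\sigma} = a$. Our finished integral is
now in the $P_e$ form we desired:
$$= \int_a^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}}\,dz$$
$$P_e = \frac{1}{\sqrt{2\pi}} \int_a^{\infty} e^{-\frac{z^2}{2}}\,dz$$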
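The closed form can be compared numerically against the two misclassification probabilities it was derived from. This is a sketch under the problem's assumptions (equal priors, shared $\sigma$, $\mu_2 > \mu_1$); the function names and the test values are illustrative. It relies on the standard identity that the Gaussian tail integral equals $\frac{1}{2}\,\mathrm{erfc}(a/\sqrt{2})$.

```python
import math


def std_normal_cdf(t):
    """CDF of the standard normal, written with the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def bayes_error_closed_form(mu1, mu2, sigma):
    """P_e = (1/sqrt(2*pi)) * integral from a to infinity of exp(-z^2/2) dz."""
    a = (mu2 - mu1) / (2.0 * sigma)
    return 0.5 * math.erfc(a / math.sqrt(2.0))


def bayes_error_direct(mu1, mu2, sigma):
    """Sum of the two prior-weighted misclassification probabilities."""
    boundary = (mu1 + mu2) / 2.0
    # An omega_1 sample is misclassified when it lands at or right of the boundary.
    miss_1 = 1.0 - std_normal_cdf((boundary - mu1) / sigma)
    # An omega_2 sample is misclassified when it lands left of the boundary.
    miss_2 = std_normal_cdf((boundary - mu2) / sigma)
    return 0.5 * miss_1 + 0.5 * miss_2


print(bayes_error_closed_form(0.0, 2.0, 1.5))
print(bayes_error_direct(0.0, 2.0, 1.5))
```

The two printed values should agree up to floating-point error, and the error approaches 1/2 as $\mu_2$ approaches $\mu_1$, as expected when the classes become indistinguishable.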
8. Maximum Likelihood Estimation
Let $X$ be a discrete random variable which takes values in $\{1, 2, 3\}$ with probabilities
$P(X = 1) = p_1$, $P(X = 2) = p_2$, and $P(X = 3) = p_3$, where $p_1 + p_2 + p_3 = 1$. Show how to use
the method of maximum likelihood to estimate $p_1$, $p_2$, and $p_3$ from $n$ observations of $X$:
$x_1, \ldots, x_n$. Express your answer in terms of the counts
$$k_1 = \sum_{i=1}^{n} \mathbb{1}(x_i = 1), \quad k_2 = \sum_{i=1}^{n} \mathbb{1}(x_i = 2), \quad \text{and} \quad k_3 = \sum_{i=1}^{n} \mathbb{1}(x_i = 3),$$
where
$$\mathbb{1}(x = a) = \begin{cases} 1 & \text{if } x = a \\ 0 & \text{if } x \neq a. \end{cases}$$
Solution: Here's a sketch of how I approach this problem using maximum likelihood estimation:
1. Derive the likelihood $L(p_1, p_2, p_3)$ of observing the $n$ observations of $X$: $x_1, \ldots, x_n$.
2. A good estimate of $p_1, p_2, p_3$ is the $\hat{p}_1, \hat{p}_2, \hat{p}_3$ that maximizes $L(p_1, p_2, p_3)$. To find these,
we take the gradient of $L$ w.r.t. $p_1, p_2, p_3$ and solve $\nabla_p L(p_1, p_2, p_3) = 0$.
3. Consider the Hessian of our objective function and show that it is negative semidefinite. This proves that the
objective is concave and ultimately that the values of $p_1, p_2, p_3$ found are indeed values that maximize
the likelihood.
First let's derive the likelihood of observing our events. For each observation $X_i$, we know that
$P(X_i = j) = p_j$ where $j \in \{1, 2, 3\}$. Because we've observed $k_j$ observations of each event $j$, if we
define $P(N_j = n)$ to be the probability of observing $n$ instances of event $j$ in our particular sequence,
then we know $P(N_j = k_j) = p_j^{k_j}$. The probability of observing our $n$ observations of $X$ is then
$P(N_1 = k_1, N_2 = k_2, N_3 = k_3)$. The observations are independent, so we have
$$L(p_1, p_2, p_3) = P(N_1 = k_1, N_2 = k_2, N_3 = k_3) = p_1^{k_1} p_2^{k_2} p_3^{k_3} = \prod_{i=1}^{3} p_i^{k_i}$$
Now to estimate $p_1, p_2, p_3$, we find the $p_1, p_2, p_3$ that maximize $L$. Because our likelihood function is
essentially a product $\prod_{i=1}^{3} p_i^{k_i}$, we should take the log of $L$ so that taking its gradient is easier.
Note that our objective of maximizing $L$ is preserved when we take its log, since the logarithm is
monotonically increasing. This becomes a problem of maximum logarithmic likelihood estimation, which is an
equivalent problem.
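$$\nabla_p L(p_1, p_2, p_3) = 0$$
$$\nabla_p \left(\prod_{i=1}^{3} p_i^{k_i}\right) = 0$$
Here we take the logarithm of the likelihood to make the math easier:
$$\nabla_p \left(\ln \prod_{i=1}^{3} p_i^{k_i}\right) = 0$$
$$\nabla_p \left(\sum_{i=1}^{3} k_i \ln p_i\right) = 0$$
$$\begin{pmatrix} \frac{\partial}{\partial p_1}(k_1 \ln p_1) \\ \frac{\partial}{\partial p_2}(k_2 \ln p_2) \\ \frac{\partial}{\partial p_3}(k_3 \ln p_3) \end{pmatrix} = 0$$
$$\begin{pmatrix} k_1/p_1 \\ k_2/p_2 \\ k_3/p_3 \end{pmatrix} = 0$$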
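Note that we also have $p_1 + p_2 + p_3 = 1$ and $k_1 + k_2 + k_3 = n$. Taken literally, $\frac{k_i}{p_i} = 0$ has no
solution when $k_i > 0$, because the gradient above ignores the constraint $p_1 + p_2 + p_3 = 1$. Enforcing the
constraint with a Lagrange multiplier $\lambda$ changes the stationarity condition to $\frac{k_i}{p_i} = \lambda$ for each $i$,
so $\frac{k_1}{p_1} = \frac{k_2}{p_2} = \frac{k_3}{p_3}$, which gives two equations. We then use $p_1 + p_2 + p_3 = 1$
as a third equation to solve for our three unknowns $p_1, p_2, p_3$. Here are the three
equations that I selected: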
$$\frac{k_1}{p_1} = \frac{k_2}{p_2}$$
$$\frac{k_2}{p_2} = \frac{k_3}{p_3}$$
$$p_1 + p_2 + p_3 = 1$$
Solving this system gives $p_i = \frac{k_i}{k_1 + k_2 + k_3} = \frac{k_i}{n}$ for $i \in \{1, 2, 3\}$. This is intuitive,
since $\frac{k_i}{n}$ is the fraction of the $n$ observations in which event $i$ occurred. Thus, by the method of
maximum likelihood estimation:
$$\hat{p}_1 = \frac{k_1}{k_1 + k_2 + k_3}$$
$$\hat{p}_2 = \frac{k_2}{k_1 + k_2 + k_3}$$
$$\hat{p}_3 = \frac{k_3}{k_1 + k_2 + k_3}$$
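The estimate $k_i / n$ is easy to compute from data. Below is a minimal sketch; the function name `mle_categorical` and the sample observations are made up for illustration.

```python
from collections import Counter


def mle_categorical(observations, values=(1, 2, 3)):
    """Return {value: k_value / n}, the maximum likelihood estimate of each p_i."""
    counts = Counter(observations)  # k_1, k_2, k_3 (missing values count as 0)
    n = len(observations)
    return {v: counts[v] / n for v in values}


sample = [1, 3, 2, 1, 1, 3, 2, 1, 3, 1]  # made-up observations x_1, ..., x_n
estimates = mle_categorical(sample)
print(estimates)
print(sum(estimates.values()))
```

For this sample the counts are $k_1 = 5$, $k_2 = 2$, $k_3 = 3$ with $n = 10$, so the estimates are the corresponding fractions and they sum to 1.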
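Finally, we consider the Hessian matrix of our objective function $f(p_1, p_2, p_3) = \sum_{i=1}^{3} k_i \ln p_i$. It won't be
hard to compute since our objective function isn't too complicated when we consider its second derivatives.
$$H\left(\sum_{i=1}^{3} k_i \ln p_i\right) = \begin{pmatrix} \frac{\partial^2 f}{\partial p_1^2} & 0 & 0 \\ 0 & \frac{\partial^2 f}{\partial p_2^2} & 0 \\ 0 & 0 & \frac{\partial^2 f}{\partial p_3^2} \end{pmatrix}$$
$$= \begin{pmatrix} -\frac{k_1}{p_1^2} & 0 & 0 \\ 0 & -\frac{k_2}{p_2^2} & 0 \\ 0 & 0 & -\frac{k_3}{p_3^2} \end{pmatrix}$$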
Given that $p_1, p_2, p_3$ are probabilities, they must be between 0 and 1. Furthermore, $k_1, k_2, k_3$ are all
nonnegative integers since they are just counts of the events. So each diagonal element of our Hessian,
$-\frac{k_i}{p_i^2}$, is a nonpositive value in $\mathbb{R}$ (strictly negative when $k_i > 0$). If we negate the Hessian,
we notice that it is positive semidefinite, since it is diagonal with nonnegative diagonal values (taken from
question 2). This must mean that our Hessian is negative semidefinite, thereby showing that our objective is
concave. Most importantly, this means that the values of $p_1, p_2, p_3$ we found are indeed the values that
maximize the likelihood and serve as good estimates of the event probabilities.
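The concavity argument can be spot-checked numerically. This is only a sketch: the counts in `k`, the helper names, and the random sampling of alternative probability vectors are illustrative assumptions, and a finite sample of points is evidence, not a proof.

```python
import math
import random


def log_likelihood(p, k):
    """sum_i k_i * ln(p_i); terms with k_i == 0 contribute nothing."""
    return sum(ki * math.log(pi) for ki, pi in zip(k, p) if ki > 0)


def hessian_diagonal(p, k):
    """Diagonal of the Hessian of the log-likelihood: -k_i / p_i^2."""
    return [-ki / pi ** 2 for ki, pi in zip(k, p)]


def random_simplex_point(rng):
    """A random probability vector with strictly positive entries."""
    weights = [rng.uniform(0.01, 1.0) for _ in range(3)]
    total = sum(weights)
    return [w / total for w in weights]


k = [5, 2, 3]  # illustrative counts k_1, k_2, k_3
n = sum(k)
p_hat = [ki / n for ki in k]  # the maximum likelihood estimate k_i / n

rng = random.Random(0)
trials = [random_simplex_point(rng) for _ in range(1000)]

# Every diagonal entry of the Hessian is nonpositive.
print(all(d <= 0 for d in hessian_diagonal(p_hat, k)))
# No sampled probability vector beats the estimate k_i / n.
print(all(log_likelihood(p, k) <= log_likelihood(p_hat, k) + 1e-12 for p in trials))
```

Both checks should print `True`: the Hessian diagonal is nonpositive, and none of the sampled probability vectors attains a higher log-likelihood than $k_i / n$.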