exam2018
ID
STUDENT NAME
Wait for the start of the exam before turning to the next page. This document is
printed double-sided, 18 pages.
• Place on your desk: your student ID, writing utensils, one double-sided A4 page cheat sheet
(handwritten or 11pt min font size) if you have one; place all other personal items below your
desk or on the side.
• For technical reasons, use only black or blue pens for the MCQ part, no pencils! Use
white corrector if necessary.
Newton-Raphson method
An easy method for computing the square root of a real number y > 0 by hand is as follows: starting from a guess $x_0$, repeatedly average the current guess with y divided by it, $x_{t+1} := \frac{1}{2}(x_t + y/x_t)$. For example, for $y = 17$ and $x_0 = 4$ this gives $x_1 = \frac{1}{2}(4 + 17/4) = 4.125$.
This is an instance of the Newton-Raphson method, which defines the sequence {xt }t≥0 of real numbers
by the following equation:
$$x_{t+1} := x_t - \frac{f(x_t)}{f'(x_t)}. \qquad (1)$$
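As a sanity check, recursion (1) is easy to run numerically; a minimal sketch in Python (the function names are illustrative, and the choice $f(z) = z^2 - y$ for extracting square roots is an assumption of this sketch):

```python
# Generic Newton-Raphson iteration, Equation (1): x_{t+1} = x_t - f(x_t)/f'(x_t).
def newton_raphson(f, fprime, x0, steps):
    x = x0
    for _ in range(steps):
        x = x - f(x) / fprime(x)
    return x

# Illustrative choice (an assumption of this sketch): a zero of f(z) = z^2 - y
# is sqrt(y), so the recursion computes square roots.
y = 17.0
x = newton_raphson(lambda z: z * z - y, lambda z: 2.0 * z, x0=4.0, steps=5)
```

Starting from $x_0 = 4$ with $y = 17$, the first iterate is 4.125.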
Question 1 What is the function f(z) of which we aim to find a zero in the example above?
$\sqrt{z} - 17$
$z^2$
$z^2 - 17$
$\sqrt{z}$
Question 2 Now, suppose we are not happy with the solution $x_1 = 4.125$, because $x_1^2 = 17.015625$ is not accurate enough. What is the next iterate $x_2$ in the sequence (for $y = 17$, $x_0 = 4$ as above)?
Use the following values: $\frac{17.015625}{4.125} = 4.125$, $\frac{0.015625}{4.125} \approx 0.0038$.
4.1288
4.1269
4.1212
4.1231
Question 3 How many iterations do you (roughly) have to perform to compute the correct 16 significant digits in the above example ($y = 17$, $x_0 = 4$)?
$10^{16}$
$10^{\sqrt{16}} = 10^4$
$16$
$16/2 = 8$
$\sqrt{16} = 4$
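Since Newton's method converges quadratically near a simple root, the number of correct digits roughly doubles each iteration. A small experiment (the stopping threshold, chosen for 64-bit floats, is an implementation assumption) counts the iterations for $y = 17$, $x_0 = 4$:

```python
import math

# Newton iteration for f(z) = z^2 - 17, counting steps until the iterate
# matches sqrt(17) to roughly 16 significant digits (double precision).
target = math.sqrt(17.0)
x, iterations = 4.0, 0
while abs(x - target) > 2e-15:
    x = x - (x * x - 17.0) / (2.0 * x)
    iterations += 1
```

The successive errors are roughly 1.2e-1, 1.9e-3, 4.4e-7, 2.3e-14, then machine precision, so only a handful of iterations are needed.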
$x_{t+1} := x_t - \gamma f(x_t)$
$x_{t+1} := x_t - \gamma f'(x_t)$
$x_{t+1} := x_t - \gamma x_t$
$x_{t+1} := x_t - \gamma (f'(x_t))^{-1} f(x_t)$
$x_{t+1} := x_t - \gamma (f''(x_t))^{-1} f'(x_t)$
For n = 1, how does this optimization method relate to the Newton-Raphson method from Equation (1) from the previous section?
$f = g'$
$f'' = g$
$f' = g$
$f = g''$
Question 6 Consider a quadratic function $g : \mathbb{R}^n \to \mathbb{R}$ of the form $g(x) = -\frac{1}{2} x^\top A x + b^\top x + c$, where $A \in \mathbb{R}^{n \times n}$ is a symmetric matrix. What are necessary and sufficient conditions for g to be convex?
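Since $\nabla^2 g(x) = -A$ is constant here, a candidate condition can be probed numerically through the eigenvalues of $-A$; a small sketch (the test matrices are illustrative assumptions):

```python
import numpy as np

def is_convex_quadratic(A, tol=1e-10):
    # g(x) = -0.5 x^T A x + b^T x + c has constant Hessian -A (A symmetric),
    # so g is convex exactly when -A is positive semidefinite.
    return bool(np.all(np.linalg.eigvalsh(-np.asarray(A)) >= -tol))

A_convex = -np.eye(3)      # -A = I is PSD, so g is convex
A_nonconvex = np.eye(3)    # -A = -I is negative definite, so g is not convex
```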
Coordinate Descent
Question 7 Consider the least squares objective function
$$f(x) := \tfrac{1}{2}\,\|Ax - b\|^2, \qquad (2)$$
for an $m \times n$ matrix $A = [a_1, \dots, a_n]$ with columns $a_i$.
What is the gradient $\nabla f(x)$?
$A$
$A^\top A x$
$A^\top (Ax - b)$
$A (A^\top x - b)$
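Any of the candidate expressions can be validated against finite differences on a random instance; the sketch below tests $A^\top(Ax - b)$, one of the listed options (sizes and data are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
A, b = rng.normal(size=(m, n)), rng.normal(size=m)
x = rng.normal(size=n)

f = lambda z: 0.5 * np.linalg.norm(A @ z - b) ** 2
candidate = A.T @ (A @ x - b)            # candidate gradient expression

# Central finite differences, coordinate by coordinate.
eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])
```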
Question 8 We are now interested in the complexity of computing the gradient of f as in Equation (2). Each addition or multiplication of two real numbers counts as one operation. How expensive is it to compute the full gradient, given x?
(Note that here $\Theta(k)$ refers to a function growing at least and at most as fast as k in the variables of concern.)
$\Theta(n + m)$
$\Theta(n^2 m^2)$
$\Theta(n)$
$\Theta(mn)$
$\Theta(m^2 n)$
$\Theta(mn^2)$
$\Theta(m)$

$\Theta(n)$
$\Theta(m^2 n)$
$\Theta(m)$
$\Theta(n^2 m^2)$
$\Theta(n + m)$
$\Theta(mn)$
$\Theta(mn^2)$
$L_i = \lambda_{\max}(A^\top A)$
$L_i = \|a_i\|^2$
$L_i = \lambda_{\max}(A^\top A)/n$
$L_i = A^\top A$
$L_i = \|a_i\|$
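For intuition, here is a sketch of coordinate descent on objective (2), using the column norms $\|a_i\|^2$ as coordinate-wise stepsizes; treating $\|a_i\|^2$ as the coordinate-wise smoothness constant is an assumption of this sketch (it is one of the candidates above):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 3
A, b = rng.normal(size=(m, n)), rng.normal(size=m)

x = np.zeros(n)
col_sq = np.sum(A * A, axis=0)        # ||a_i||^2 for each column i
for t in range(1000):
    i = t % n                         # cyclic sweep over the coordinates
    grad_i = A[:, i] @ (A @ x - b)    # i-th partial derivative of (2)
    x[i] -= grad_i / col_sq[i]        # stepsize 1/L_i with L_i = ||a_i||^2

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
```

On this well-conditioned random instance the iterates approach the least-squares solution.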
Frank-Wolfe
Consider the linear minimization oracle (LMO) for matrix completion, that is, for
$$\min_{Y \in \mathcal{X} \subseteq \mathbb{R}^{n \times m}} \sum_{(i,j) \in \Omega} (Z_{ij} - Y_{ij})^2$$
when $\Omega \subseteq [n] \times [m]$ is the set of observed entries from a given matrix Z. Our optimization domain $\mathcal{X}$ is the unit ball of the trace norm (or nuclear norm), which is known to be the convex hull of the rank-1 matrices
$$\mathcal{X} := \operatorname{conv}(\mathcal{A}) \quad \text{with} \quad \mathcal{A} := \left\{ u v^\top \;\middle|\; u \in \mathbb{R}^n,\ \|u\|_2 = 1,\; v \in \mathbb{R}^m,\ \|v\|_2 = 1 \right\}.$$
Question 11 Consider the LMO for this set $\mathcal{X}$ for a gradient at iterate $Y \in \mathbb{R}^{n \times m}$ (derive it if necessary). Compare the computational cost needed to compute the LMO with that of computing the projection onto $\mathcal{X}$.
Hint: Assume that the Singular Value Decomposition of an $n \times m$ matrix takes time $\Theta(n^2 m)$, and computing the top singular vector takes time $\Theta(nm)$.
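For reference, over the trace-norm ball one has $\min_{S \in \mathcal{X}} \langle S, G\rangle = -\sigma_{\max}(G)$, attained at $S = -u_1 v_1^\top$ for the top singular pair of G. A sketch (it uses a full SVD for simplicity; a power method computing only the top pair would match the $\Theta(nm)$ cost in the hint):

```python
import numpy as np

def lmo_trace_norm_ball(G):
    # argmin of <S, G> over {S : ||S||_tr <= 1} is -u1 v1^T,
    # where (u1, v1) is the top singular pair of G.
    U, s, Vt = np.linalg.svd(G)
    return -np.outer(U[:, 0], Vt[0, :])

rng = np.random.default_rng(2)
G = rng.normal(size=(4, 6))      # stand-in for a gradient at some iterate Y
S = lmo_trace_norm_ball(G)
```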
Smoothness and Strong Convexity
Consider an iterative optimization procedure.
Question 12 Which one of the following three inequalities is valid for a smooth convex function f:
$f(x_{t+1}) - f(x_t) \le \nabla f(x_t)^\top (x_{t+1} - x_t) + \frac{L}{2} \|x_{t+1} - x_t\|^2$
$f(x_{t+1}) - f(x_t) \le \nabla f(x_t)^\top (x_{t+1} - x_t) - \frac{L}{2} \|x_{t+1} - x_t\|^2$
$f(x_{t+1}) - f(x_t) \le \nabla f(x_t)^\top (x_t - x_{t+1}) + \frac{L}{2} \|x_{t+1} - x_t\|^2$
Question 13 Which one of the following three inequalities is valid for a strongly convex function f:
$f(x_t) - f(x^\star) \ge \nabla f(x_t)^\top (x_t - x^\star) + \frac{\mu}{2} \|x_t - x^\star\|^2$
$f(x_t) - f(x^\star) \le \nabla f(x_t)^\top (x_t - x^\star) - \frac{\mu}{2} \|x_t - x^\star\|^2$
$f(x_t) - f(x^\star) \le \nabla f(x_t)^\top (x_t - x^\star) + \frac{\mu}{2} \|x_t - x^\star\|^2$
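Both defining bounds can be probed numerically on a function with known constants, e.g. $f(x) = \frac{1}{2} x^\top Q x$ with $L = \lambda_{\max}(Q)$ and $\mu = \lambda_{\min}(Q)$; the sketch below spot-checks them at random point pairs (an illustration of the definitions, not a proof):

```python
import numpy as np

rng = np.random.default_rng(3)
Q = np.diag([0.5, 1.0, 4.0])      # f(x) = 0.5 x^T Q x, grad f(x) = Qx
L, mu = 4.0, 0.5                  # largest and smallest eigenvalues of Q
f = lambda z: 0.5 * (z @ Q @ z)
grad = lambda z: Q @ z

for _ in range(100):
    x, y = rng.normal(size=3), rng.normal(size=3)
    quad = (y - x) @ (y - x)
    # Smoothness upper bound and strong-convexity lower bound at (x, y):
    assert f(y) - f(x) <= grad(x) @ (y - x) + 0.5 * L * quad + 1e-9
    assert f(y) - f(x) >= grad(x) @ (y - x) + 0.5 * mu * quad - 1e-9
```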
Random search
Question 14 Consider derivative-free random search, with line-search, as discussed in the lecture.
Figure 1: Performance of different optimization algorithms.
None
Gradient Descent (with correct stepsize)
Accelerated Gradient Method (with correct parameters)
Newton’s optimization method
None
Newton’s optimization method
Accelerated Gradient Method (with correct parameters)
Gradient Descent (with correct stepsize)
TRUE FALSE
Question 20 (Convexity) The triangle inequality and homogeneity of a norm together imply that
any norm is convex.
TRUE FALSE
TRUE FALSE
Question 22 (Differentiability) The function $\max(0, x)$ is differentiable over $\mathbb{R}$.
TRUE FALSE
Question 23 (Coordinate Descent) Consider Coordinate Descent on a strongly convex and smooth
objective function. depending on the coordinate-wise smoothness constants Li (i.e. different Lipschitz
TRUE FALSE
Question 24 (Coordinate Descent) In the same setting, if we sample coordinate i uniformly and use stepsize $1/L_i$, convergence is typically faster than CD with fixed stepsize.
TRUE FALSE
However we do not care about solving this problem exactly, but have a small leeway of magnitude
ε ≥ 0. To make this more mathematically precise, let us define some notation.
The distance between a set $C \subseteq \mathbb{R}^d$ and any point $y \in \mathbb{R}^d$ is defined as
$$d(C, y) \overset{\text{def}}{=} \min_{w \in C} \|w - y\|_2.$$
We only want to distinguish between the following two cases for any $\varepsilon > 0$:
(N) The intersection of the sets is non-empty, i.e. $\bigcap_{i=1}^{n} C_i \neq \emptyset$.
We want to make as few calls to the projection oracle as possible. Our strategy will be to i) define
a loss function and ii) run gradient descent. Then using our knowledge of convergence of gradient
descent, we can argue about the number of oracle calls required.
First Approach.
Inspired by the condition in case (E), let us define the following loss function:
Question 25: 5 points. What is the sub-gradient of g? How many calls to the gradient oracle are needed to compute $\partial g(x)$ and $g(x)$?
Hint: Show that for two convex functions $g_1(x)$ and $g_2(x)$, any element of $\partial g_i(x)$ is a subgradient in the set $\partial \max(g_1(x), g_2(x))$, where the index $i$ is chosen such that $g_i(x) = \max(g_1(x), g_2(x))$.
T
AF
Question 26: 5 points. Assume you are given a starting point $x_0$ and a constant R such that $\|x_0 - x^\star\|_2 \le R$. Give the update step of gradient descent with an appropriate step-size. Show, using the convergence of gradient descent we proved in class, that for any optimum $x^\star$ of g,
$$\min_{t \in \{0, \dots, T\}} g(x_t) - g(x^\star) \le \frac{R}{\sqrt{T}}.$$
Question 27: 4 points. Using the result from the previous question, show that $O(n/\varepsilon^2)$ calls to the projection oracle are sufficient to distinguish between case (N) and case (E) for our problem.
The convergence of the Frank-Wolfe algorithm was analyzed in class only for smooth functions. In this question we will examine whether smoothness is necessary. Consider the following non-smooth function $f : \mathbb{R}^2 \to \mathbb{R}$:
$$f(w, v) := \max\{w, v\},$$
restricted to a ball of radius 2 around the origin. We are then interested in finding $\min_{\|(w, v)\|_2 \le 2} f(w, v)$.
Suppose we start at the origin (0, 0) and run the Frank-Wolfe algorithm (with any step size rule). Since the function is not smooth, we will call the LMO using an arbitrary subgradient instead of the gradient. Does this algorithm converge to the optimum?
Hint: First show that the iterates of Frank-Wolfe always lie in the convex hull of the starting point and the solutions of the LMO.
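A short simulation illustrates the hint: starting from the origin with ties broken towards the subgradient (1, 0), every LMO answer is either (-2, 0) or (0, -2), so all iterates stay in conv{(0, 0), (-2, 0), (0, -2)}, on which $\max\{w, v\} \ge -1$, while the true minimum is $-\sqrt{2}$ at $(-\sqrt{2}, -\sqrt{2})$. (The tie-breaking rule and step size below are illustrative choices.)

```python
import numpy as np

def subgradient(w, v):
    # A valid subgradient of max{w, v}; ties broken towards (1, 0).
    return np.array([1.0, 0.0]) if w >= v else np.array([0.0, 1.0])

x = np.zeros(2)                        # start at the origin
best = float("inf")                    # best objective value seen so far
for t in range(10000):
    g = subgradient(x[0], x[1])
    s = -2.0 * g                       # LMO over the radius-2 ball: -2 g/||g||, ||g|| = 1 here
    gamma = 2.0 / (t + 2.0)            # standard Frank-Wolfe step size
    x = (1.0 - gamma) * x + gamma * s
    best = min(best, float(x.max()))
```

The run stalls near objective value -1 and never approaches the optimal value $-\sqrt{2} \approx -1.414$.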
Question 29: 2 points. What happens when Newton’s optimization method is run on a convex
quadratic function? Explain.
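A numeric sketch of the situation: for a strictly convex quadratic, the Newton step solves the linear optimality condition exactly, so a single iteration reaches the minimizer (the matrix and starting point below are arbitrary assumptions):

```python
import numpy as np

Q = np.array([[3.0, 1.0], [1.0, 2.0]])    # positive definite Hessian
b = np.array([1.0, -1.0])

# f(x) = 0.5 x^T Q x - b^T x has grad f(x) = Qx - b and constant Hessian Q.
x0 = np.array([10.0, -7.0])
x1 = x0 - np.linalg.solve(Q, Q @ x0 - b)  # a single Newton step
```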
Question 30: 2 points. Affine invariance of Newton's method
Consider $h(x) := g(Mx)$, where $M \in \mathbb{R}^{n \times n}$ is invertible and g is some convex function. Show that the Newton steps for h and g are also related by the same linear transformation, i.e., $\Delta x_t = M \Delta y_t$, where $\Delta x_t$ and $\Delta y_t$ are the Newton steps at the t-th iteration for h and g respectively. We assume $x_0 = M y_0$ are the starting iterates for h and g respectively.
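A numeric sanity check of this invariance, with an illustrative smooth strictly convex g; note that under the pairing $y = Mx$ used below, the step of g at y equals M times the step of h at x (which side carries the factor M depends on how the iterates are paired):

```python
import numpy as np

M = np.triu(np.ones((3, 3)))     # a fixed invertible matrix (det = 1)

# Illustrative convex function g(y) = sum(exp(y)): smooth, strictly convex.
grad_g = lambda y: np.exp(y)
hess_g = lambda y: np.diag(np.exp(y))

# h(x) := g(Mx), so grad_h(x) = M^T grad_g(Mx) and hess_h(x) = M^T hess_g(Mx) M.
x = np.array([0.3, -0.7, 1.1])
y = M @ x                        # paired iterates, y = Mx

dx = -np.linalg.solve(M.T @ hess_g(y) @ M, M.T @ grad_g(y))  # Newton step for h at x
dy = -np.linalg.solve(hess_g(y), grad_g(y))                  # Newton step for g at y
```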
Coordinate Descent
Question 31: 2 points. Given a matrix A, we define $\lambda_{\min}(A^\top A)$ and $\lambda_{\max}(A^\top A)$ to be the smallest and largest eigenvalues of $A^\top A$.
Show that for any $x, y \in \mathbb{R}^n$,
$$\lambda_{\min}(A^\top A)\,\|x - y\|^2 \le \|A(x - y)\|^2 \le \lambda_{\max}(A^\top A)\,\|x - y\|^2.$$
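A quick numeric spot-check of the two bounds on a random instance (sizes and data are arbitrary assumptions of the sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(5, 3))
eigs = np.linalg.eigvalsh(A.T @ A)    # ascending eigenvalues of A^T A
lam_min, lam_max = eigs[0], eigs[-1]

x, y = rng.normal(size=3), rng.normal(size=3)
d = x - y
lower = lam_min * (d @ d)             # lambda_min ||x - y||^2
middle = (A @ d) @ (A @ d)            # ||A(x - y)||^2
upper = lam_max * (d @ d)             # lambda_max ||x - y||^2
```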
Question 32: 3 points. Show that for any $x, y \in \mathbb{R}^n$, for any $b \in \mathbb{R}^n$:
Question 33: 2 points. For $f(x) := \|Ax - b\|^2$, we now perform one step of coordinate descent, i.e. for a given point $x_t \in \mathbb{R}^n$ we do a step of the form $x_{t+1} := x_t + \gamma_t e_i$, where $e_i \in \mathbb{R}^n$ denotes a standard unit vector. For i fixed, compute the best $\gamma_t$.
Smooth strongly convex SGD
We consider a function $f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x)$ on $\mathbb{R}^d$, and we assume that the functions $f_i$ are convex and differentiable.
We furthermore assume that f is L-smooth, that is, that $\nabla f$ is L-Lipschitz.
We consider SGD defined as the following algorithm: let $x_0 \in \mathbb{R}^d$, and for any $t \ge 1$, for a sequence of step sizes $(\gamma_t)$, set $x_{t+1} := x_t - \gamma_t g_t$ for a stochastic gradient estimate $g_t$.
We first consider $g_t := \nabla f_{i_t}(x_t)$, with $i_t$ uniformly and independently sampled from $\{1, \dots, n\}$.
Question 34: 2 points. Show that gt is an unbiased estimator of the gradient ∇f (xt ).
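Unbiasedness says that averaging $\nabla f_i(x)$ over a uniformly random index reproduces $\nabla f(x)$; for a finite sum this can be verified exactly (the quadratic $f_i$ below are an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 10, 4
C = rng.normal(size=(n, d))
x = rng.normal(size=d)

# Illustrative finite sum: f_i(x) = 0.5 (c_i^T x)^2, so grad f_i(x) = (c_i^T x) c_i
# and f(x) = (1/n) sum_i f_i(x) has grad f(x) = (1/n) C^T C x.
per_sample_grads = (C @ x)[:, None] * C       # row i is grad f_i(x)
expected_gt = per_sample_grads.mean(axis=0)   # E[g_t] for i_t ~ Uniform{1,...,n}
full_grad = C.T @ C @ x / n                   # grad f(x) computed directly
```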
Question 35: 6 points. Combining the two valid inequalities of smoothness and strong convexity (as also stated in Questions 12 and 13), prove in detailed steps that, if $\gamma_t \le \frac{1}{L}$, SGD in this setting converges as
$$\mathbb{E}\left[f(x_{t+1}) - f(x^\star)\right] \le \gamma_t\, \mathbb{E}\left[\|g_t - \nabla f(x_t)\|^2\right] + \frac{(1 - \gamma_t \mu)\, \mathbb{E}\left[\|x_t - x^\star\|^2\right] - \mathbb{E}\left[\|x_{t+1} - x^\star\|^2\right]}{2 \gamma_t}. \qquad (3)$$
For comparison, recall the following result from Lecture 6 (slide 6):
$$\mathbb{E}\left[f(x_{t+1}) - f(x^\star)\right] \le \frac{\gamma_t B^2}{2} + \frac{(1 - \gamma_t \mu)\, \mathbb{E}\left[\|x_t - x^\star\|^2\right] - \mathbb{E}\left[\|x_{t+1} - x^\star\|^2\right]}{2 \gamma_t}, \qquad (4)$$
under the bounded gradient assumption $\mathbb{E}[\|g_t\|^2] \le B^2$.
How do the two results compare?
Question 36: 4 points. Recall the possible choices of learning rate $(\gamma_t)$ in the situation of the previous question. What is the resulting rate of convergence? Which estimator do we eventually consider?
Comment on the assumption $\gamma_t \le \frac{1}{L}$. Is it a restriction? Which choice of step size could be used,
a) for getting $O(\log(t)/t)$ convergence, and
b) for getting $O(1/t)$?