

Exam Optimization for Machine Learning – CS-439


Prof. Martin Jaggi

Friday, 6 July 2018, 16h15 to 19h15, in CE1515

ID
STUDENT NAME

SCIPER :        Signature :

Wait for the start of the exam before turning to the next page. This document is
printed double sided, 18 pages.

• This is a closed book exam. No electronic devices of any kind.

• Place on your desk: your student ID, writing utensils, one double-sided A4 page cheat sheet
(handwritten or 11pt min font size) if you have one; place all other personal items below your
desk or on the side.

• You each have a different exam.

• For technical reasons, do use black or blue pens for the MCQ part, no pencils! Use
white corrector if necessary.

For your examination, preferably print documents compiled from auto-multiple-choice.

First part, multiple choice


There is exactly one correct answer per question.

Newton-Raphson method
An easy method for computing the square root of a real number y > 0 by hand is as follows:

(i) Find x_0 such that x_0² ≈ y (e.g. y = 17, x_0 = 4).

(ii) Calculate the difference d = y − x_0² (e.g. d = 17 − 4² = 1).

(iii) Output x_1 = x_0 + d/(2x_0) (e.g. x_1 = 4 + 1/8 = 4.125).

(iv) Repeat (ii)–(iii) for higher accuracy.

This is an instance of the Newton-Raphson method, which defines the sequence {x_t}_{t≥0} of real numbers by the following equation:

    x_{t+1} := x_t − f(x_t)/f′(x_t).    (1)
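The hand procedure above is easy to check with a short script (an illustrative sketch, not part of the exam; the function name is made up):

```python
def newton_sqrt(y, x0, steps):
    """Approximate sqrt(y) by Newton-Raphson on f(z) = z**2 - y."""
    x = x0
    for _ in range(steps):
        x = x - (x * x - y) / (2 * x)  # Equation (1): x - f(x)/f'(x)
    return x

print(newton_sqrt(17, 4, 1))  # 4.125, matching step (iii)
print(newton_sqrt(17, 4, 4))  # agrees with sqrt(17) to machine precision
```

Each iteration roughly doubles the number of correct digits, which is the quadratic convergence of Newton's method.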

Question 1 What is the function f(z) of which we aim to find a zero in the example above?

z − 17
z²
z² − 17
z

Question 2 Now, suppose we are not happy with the solution x_1 = 4.125, because x_1² = 17.015625 is not accurate enough. What is the next iterate x_2 in the sequence (for y = 17, x_0 = 4 as above)?
Use the following values: 17.015625/4.125 = 4.125 and 0.015625/4.125 ≈ 0.0038.

4.1288
4.1269
4.1212
4.1231

Question 3 How many iterations do you (roughly) have to perform to compute the correct 16 significant digits in the above example (y = 17, x_0 = 4)?

10^16
10^√16 = 10^4
16/2 = 8
√16 = 4


Question 4 The Newton-Raphson method to find zeros of f can be interpreted as a second-order optimization method. Of course, one could also use the gradient method instead. What would the iterates of this scheme look like (for a carefully chosen stepsize γ)?

x_{t+1} := x_t − γ f(x_t)
x_{t+1} := x_t − γ f′(x_t)
x_{t+1} := x_t − γ x_t
x_{t+1} := x_t − γ (f′(x_t))⁻¹ f(x_t)
x_{t+1} := x_t − γ (f″(x_t))⁻¹ f′(x_t)

Newton’s second-order optimization method


Question 5 As studied in class, the update step for Newton's optimization method for an objective function g : Rⁿ → R is given by

    x_{t+1} := x_t − (∇²g(x_t))⁻¹ ∇g(x_t).

For n = 1, how does this optimization method relate to the Newton-Raphson method from Equation (1) of the previous section?

f = g′
f″ = g
f′ = g
f = g″

Question 6 Given a quadratic function g : Rⁿ → R of the form g(x) = −(1/2) xᵀAx + bᵀx + c, where A ∈ R^{n×n} is a symmetric matrix. What are necessary and sufficient conditions for g to be convex?

−A positive semidefinite, and b is non-negative
The Hessian of g is negative definite for all x, and b is non-negative
A positive semidefinite
The Hessian of g is negative definite for all x
The Hessian of g is positive definite for all x
A positive semidefinite, and b is non-negative
−A positive semidefinite
The Hessian of g is positive definite for all x, and b is non-negative


Coordinate Descent

Question 7 Consider the least squares objective function

    f(x) := (1/2) ‖Ax − b‖²,    (2)

for an m × n matrix A = [a_1, . . . , a_n] with columns a_i.
What is the gradient ∇f(x)?

A
AᵀAx
Aᵀ(Ax − b)
A(Aᵀx − b)
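A candidate expression for this gradient can be sanity-checked against finite differences; a numpy sketch with made-up dimensions (not part of the exam):

```python
import numpy as np

def f(A, b, x):
    return 0.5 * np.linalg.norm(A @ x - b) ** 2  # Equation (2)

def grad(A, b, x):
    return A.T @ (A @ x - b)  # candidate gradient to verify

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)

# central finite differences along each coordinate direction
eps = 1e-6
fd = np.array([(f(A, b, x + eps * e) - f(A, b, x - eps * e)) / (2 * eps)
               for e in np.eye(3)])
ok = np.allclose(fd, grad(A, b, x), atol=1e-5)
```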

Question 8 We are now interested in the complexity of computing the gradient of f as in Equation (2). Each addition or multiplication of two real numbers counts as one operation. How expensive is it to compute the full gradient, given x?
(Note: here Θ(k) refers to a function growing at least and at most as fast as k in the variables of concern.)

Θ(n + m)
Θ(n²m²)
Θ(n)
Θ(mn)
Θ(m²n)
Θ(mn²)
Θ(m)

Question 9 How expensive is it to compute just a single coordinate of the gradient of f as in Equation (2), given x?
(Note: here Θ(k) refers to a function growing at least and at most as fast as k in the variables of concern.)

Θ(n)
Θ(m²n)
Θ(m)
Θ(n²m²)
Θ(n + m)
Θ(mn)
Θ(mn²)

Question 10 The complexity of Coordinate Descent depends on the coordinate-wise smoothness constants L_i. What is L_i for f as in Equation (2)?

L_i = λ_max(AᵀA)
L_i = ‖a_i‖²
L_i = λ_max(AᵀA)/n
L_i = AᵀA
L_i = ‖a_i‖
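The coordinate-wise curvature can be probed numerically: for f as in Equation (2) the Hessian is constant, AᵀA, and its i-th diagonal entry equals the squared norm of column a_i. A numpy sketch (dimensions made up):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))

# Hessian of f(x) = 0.5*||Ax - b||^2 is A^T A, constant in x
H = A.T @ A

# diagonal curvature vs. squared column norms
match = all(np.isclose(H[i, i], np.linalg.norm(A[:, i]) ** 2)
            for i in range(A.shape[1]))
```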


Frank-Wolfe

Consider the linear minimization oracle (LMO) for matrix completion, that is, for

    min_{Y ∈ X ⊆ R^{n×m}}  Σ_{(i,j)∈Ω} (Z_{ij} − Y_{ij})²,

where Ω ⊆ [n] × [m] is the set of observed entries of a given matrix Z. Our optimization domain X is the unit ball of the trace norm (or nuclear norm), which is known to be the convex hull of the rank-1 matrices:

    X := conv(A)  with  A := { uvᵀ : u ∈ Rⁿ, ‖u‖_2 = 1, v ∈ Rᵐ, ‖v‖_2 = 1 }.

Question 11 Consider the LMO for this set X for a gradient at iterate Y ∈ R^{n×m} (derive it if necessary). Compare the computational cost needed to compute the LMO with the cost of computing the projection onto X.
Hint: Assume that the singular value decomposition of an n × m matrix takes time Θ(n²m), and computing the top singular vector takes time Θ(nm).

LMO and projection both take Θ(n²m)
LMO takes Θ(nm), and projection takes Θ(n²m)
LMO takes Θ(n²m), and projection takes Θ(nm)
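For intuition on the costs: the LMO over X needs only the top singular pair of the (negative) gradient matrix, which power iteration can approximate at Θ(nm) per pass, while projecting onto the trace-norm ball requires a full SVD. A numpy sketch with hypothetical dimensions (assuming the power iteration has converged):

```python
import numpy as np

rng = np.random.default_rng(5)
G = rng.standard_normal((8, 5))  # stands in for a gradient at iterate Y

# power iteration on G^T G: each pass costs Theta(nm)
v = rng.standard_normal(5)
for _ in range(200):
    v = G.T @ (G @ v)
    v /= np.linalg.norm(v)
u = G @ v
sigma = np.linalg.norm(u)
u /= sigma

# compare against the top pair from a full SVD (Theta(n^2 m))
U, S, Vt = np.linalg.svd(G)
top_ok = np.isclose(sigma, S[0]) and np.isclose(abs(u @ U[:, 0]), 1.0, atol=1e-6)
```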

Smoothness and Strong Convexity

Consider an iterative optimization procedure.

Question 12 Which one of the following three inequalities is valid for a smooth convex function f?

f(x_{t+1}) − f(x_t) ≤ ∇f(x_t)ᵀ(x_{t+1} − x_t) + (L/2)‖x_{t+1} − x_t‖²
f(x_{t+1}) − f(x_t) ≤ ∇f(x_t)ᵀ(x_{t+1} − x_t) − (L/2)‖x_{t+1} − x_t‖²
f(x_{t+1}) − f(x_t) ≤ ∇f(x_t)ᵀ(x_t − x_{t+1}) + (L/2)‖x_{t+1} − x_t‖²

Question 13 Which one of the following three inequalities is valid for a strongly convex function f?

f(x_t) − f(x⋆) ≥ ∇f(x_t)ᵀ(x_t − x⋆) + (μ/2)‖x_t − x⋆‖²
f(x_t) − f(x⋆) ≤ ∇f(x_t)ᵀ(x_t − x⋆) − (μ/2)‖x_t − x⋆‖²
f(x_t) − f(x⋆) ≤ ∇f(x_t)ᵀ(x_t − x⋆) + (μ/2)‖x_t − x⋆‖²

Random search

Question 14 Consider derivative-free random search, with line-search, as discussed in the lecture.

For strongly convex functions, random search converges as O(L log(1/ε))


For convex functions, random search converges as O(dL/ε)
For convex functions, random search converges as O(dL log(1/ε))


Empirical comparison of different methods


Donald Duck’s three nephews Huey, Dewey, and Louie have enrolled in CS 439. For their course
project, they analyzed three different algorithms, namely Gradient Descent, Accelerated Gradient
Method and Newton’s second-order optimization method on a strongly convex optimization problem
and plotted the performance of the algorithms on a graph. However, as it turns out, they forgot to
put a legend in their graph, and due to a bug in their code they also plotted a line which corresponds
to none of the algorithms. Can you help them label their graph?

Figure 1: Performance of different optimization algorithms.

Question 15 Which optimization method corresponds to the error-curve for Algorithm 1?

None
Gradient Descent (with correct stepsize)
Accelerated Gradient Method (with correct parameters)
Newton's optimization method

Question 16 Which optimization method corresponds to the error-curve for Algorithm 2?

None
Newton’s optimization method
Accelerated Gradient Method (with correct parameters)
Gradient Descent (with correct stepsize)

Question 17 Which optimization method corresponds to the error-curve for Algorithm 3?

Accelerated Gradient Method (with correct parameters)


Newton’s optimization method
None
Gradient Descent (with correct stepsize)


Question 18 Which optimization method corresponds to the error-curve for Algorithm 4?

Newton’s optimization method


None
Accelerated Gradient Method (with correct parameters)
Gradient Descent (with correct stepsize)


Second part, true/false questions


Question 19 (Convexity) The epigraph of a function f : R^d → R is defined as

    epi(f) := {(x, α) ∈ R^{d+1} | x ∈ dom(f), α ≤ f(x)}.

TRUE FALSE

Question 20 (Convexity) The triangle inequality and homogeneity of a norm together imply that
any norm is convex.

TRUE FALSE

Question 21 (Convex Sets) We consider C1 and C2 two convex sets in Rd .


We define C1 + C2 := {x1 + x2 , x1 ∈ C1 , x2 ∈ C2 }.
Is C1 + C2 a convex set?

TRUE FALSE

Question 22 (Differentiability) The function max(0, x)² is differentiable over R.

TRUE FALSE

Question 23 (Coordinate Descent) Consider Coordinate Descent on a strongly convex and smooth objective function, depending on the coordinate-wise smoothness constants L_i (i.e. different Lipschitz constants for each gradient coordinate).

If we sample coordinate i with probability proportional to L_i, and use stepsize L_i, convergence is typically faster than uniform CD.

TRUE FALSE

Question 24 (Coordinate Descent) In the same setting, if we sample coordinate i uniformly, and
use stepsize 1/Li , convergence is typically faster than CD with fixed stepsize

TRUE FALSE


Third part, open questions


Answer in the space provided! Your answer must be justified with all steps. Do not cross any
checkboxes, they are reserved for correction.

Intersection of Convex Sets


We are given n convex sets C_0 = {C_1, . . . , C_n}, where each set C_i ⊆ R^d. We want to design an algorithm which can check whether the intersection of all of these sets is empty, i.e. we want to check if

    ∩_{C_i ∈ C_0} C_i = ∅.

However, we do not care about solving this problem exactly, but have a small leeway of magnitude ε ≥ 0. To make this mathematically precise, let us define some notation.
The distance between a set C ⊆ R^d and any point y ∈ R^d is defined as

    d(C, y) := min_{w ∈ C} ‖w − y‖_2.

We only want to distinguish between the following two cases, for any ε > 0:

(N) The intersection of the sets is non-empty, i.e. ∩_{i=1}^{n} C_i ≠ ∅.

(E) For any point x ∈ R^d, max_{i ∈ {1,...,n}} d(C_i, x) ≥ ε.

We want to solve this problem using calls to an oracle which can compute the projection onto each C_i ∈ C_0. Let us define the projection oracle P_i(x), for any i ∈ {1, . . . , n} and x ∈ R^d, as

    P_i(x) := argmin_{y ∈ C_i} ‖y − x‖_2.

We want to make as few calls to the projection oracle as possible. Our strategy will be to i) define
a loss function and ii) run gradient descent. Then using our knowledge of convergence of gradient
descent, we can argue about the number of oracle calls required.

First Approach.
Inspired by the condition in case (E), let us define the following loss function:

    g(x) := max_{i ∈ {1,...,n}} d(C_i, x).

Question 24: 2 points. Is the function g(x) convex? Is it Lipschitz?


Hint: maximum of convex functions is also convex.

0 1 2


Question 25: 5 points. What is the subgradient of g? How many calls to the projection oracle are needed to compute ∂g(x) and g(x)?
Hint: Show that for two convex functions g_1(x) and g_2(x), any element of ∂g_i(x) is a subgradient in the set ∂ max(g_1(x), g_2(x)), where i is chosen such that g_i(x) = max(g_1(x), g_2(x)).

0 1 2 3 4 5

Question 26: 5 points. Assume you are given a starting point x_0 and a constant R such that ‖x_0 − x⋆‖_2 ≤ R. Give the update step of gradient descent with an appropriate step size. Show, using the convergence of gradient descent we proved in class, that for any optimum x⋆ of g,

    min_{t ∈ {0,...,T}} g(x_t) − g(x⋆) ≤ R/√T.

0 1 2 3 4 5


Question 27: 4 points. Using the result from the previous question, show that O(n/ε²) calls to the projection oracle are sufficient to distinguish between case (N) and case (E) for our problem.

0 1 2 3 4



Question 28: 6 points.

0 1 2 3 4 5 6

The convergence of the Frank-Wolfe algorithm was analyzed in class only for smooth functions. In this question we will examine whether smoothness is necessary. Consider the following non-smooth function f : R² → R:

    f(w, v) := max{w, v},

restricted to a ball of radius √2 around the origin. We are then interested in finding

    (w⋆, v⋆) := argmin_{w² + v² ≤ 2} (max{w, v}).

Suppose we start at the origin (0, 0) and run the Frank-Wolfe algorithm (with any step size rule). Since the function is not smooth, we will call the LMO oracle using an arbitrary subgradient instead of the gradient. Does this algorithm converge to the optimum?
Hint: First show that the iterates of Frank-Wolfe always lie in the convex hull of the starting point and the solutions of the LMO oracle.



Newton’s second-order optimization method


As studied in class, the update step for Newton's optimization method for an objective function g : Rⁿ → R is given by

    x_{t+1} := x_t − (∇²g(x_t))⁻¹ ∇g(x_t).

Question 29: 2 points. What happens when Newton’s optimization method is run on a convex
quadratic function? Explain.

0 1 2

Question 30: 2 points. Affine invariance of Newton's method.
Consider h(x) := g(Mx), where M ∈ R^{n×n} is invertible and g is some convex function. Show that the Newton steps for g and h are related by the same linear transformation, i.e., Δx_t = M Δy_t, where Δx_t and Δy_t are the Newton steps at the t-th iteration for g and h respectively. We assume x_0 = M y_0 are the starting iterates for g and h respectively.

0 1 2
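The claim can be illustrated numerically before proving it; a numpy sketch with a made-up smooth convex g and a random (almost surely invertible) M:

```python
import numpy as np

def grad_g(z):   # g(z) = sum(exp(z)) + 0.5*||z||^2, smooth and strictly convex
    return np.exp(z) + z

def hess_g(z):
    return np.diag(np.exp(z) + 1.0)

rng = np.random.default_rng(2)
M = rng.standard_normal((3, 3)) + 3 * np.eye(3)
y0 = rng.standard_normal(3)
x0 = M @ y0  # corresponding starting iterates

# Newton step for g at x0
dx = -np.linalg.solve(hess_g(x0), grad_g(x0))

# Newton step for h(y) := g(My) at y0, via the chain rule:
# grad h(y) = M^T grad g(My),  hess h(y) = M^T hess g(My) M
gh = M.T @ grad_g(M @ y0)
Hh = M.T @ hess_g(M @ y0) @ M
dy = -np.linalg.solve(Hh, gh)

steps_match = np.allclose(dx, M @ dy)  # the two steps differ by the map M
```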


Coordinate Descent
Question 31: 2 points. Given a matrix A, we define λ_min(AᵀA) and λ_max(AᵀA) to be the smallest and largest eigenvalues of AᵀA.
Show that for any x, y ∈ Rⁿ,

    λ_min(AᵀA) ‖x − y‖² ≤ ‖A(x − y)‖² ≤ λ_max(AᵀA) ‖x − y‖².

0 1 2

Question 32: 3 points. Show that for any x, y ∈ Rⁿ and any b:

    ‖Ax − b‖² ≤ ‖Ay − b‖² + 2 (Aᵀ(Ay − b))ᵀ (x − y) + λ_max(AᵀA) ‖x − y‖².

What does that imply for f(x) := ‖Ax − b‖²?

0 1 2 3


Question 33: 2 points. For f(x) := ‖Ax − b‖², we now perform one step of coordinate descent, i.e. for a given point x_t ∈ Rⁿ we do a step of the form

    x_{t+1} := x_t − γ_t (∇f(x_t))_i · e_i,

where e_i ∈ Rⁿ denotes a standard unit vector. For fixed i, compute the best γ_t.

0 1 2
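One can probe the answer numerically before deriving it; a numpy sketch (data made up) checking that a step size of the form γ = 1/(2‖a_i‖²) minimizes f along coordinate i, for f without the 1/2 factor as defined here:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 4))
b = rng.standard_normal(6)
x = rng.standard_normal(4)
i = 2

def f(z):
    return np.linalg.norm(A @ z - b) ** 2  # note: no 1/2 factor

g_i = 2 * A[:, i] @ (A @ x - b)  # i-th coordinate of the gradient
e_i = np.eye(4)[i]

gamma = 1.0 / (2 * np.linalg.norm(A[:, i]) ** 2)  # candidate best step

# check nearby step sizes along the same coordinate do not do better
best = f(x - gamma * g_i * e_i)
no_better = (best <= f(x - 0.9 * gamma * g_i * e_i) + 1e-12 and
             best <= f(x - 1.1 * gamma * g_i * e_i) + 1e-12)
```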

Smooth strongly convex SGD

We consider a function f(x) := (1/n) Σ_{i=1}^{n} f_i(x) on R^d, and we assume that the functions f_i are convex and differentiable.
We furthermore assume that f is L-smooth, that is, that ∇f is L-Lipschitz.
We consider SGD defined as the following algorithm: let x_0 ∈ R^d, and for any t ≥ 0, for a sequence of step sizes γ_t, define

    x_{t+1} := x_t − γ_t g_t.

We first consider g_t := ∇f_{i_t}(x_t), with i_t uniformly and independently sampled from {1, . . . , n}.

Question 34: 2 points. Show that gt is an unbiased estimator of the gradient ∇f (xt ).

0 1 2
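A quick numerical illustration of what is to be shown (a sketch with made-up least-squares components f_i): averaging g_t over all n equally likely choices of i_t recovers ∇f(x_t) exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 10, 4
a = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_fi(x, i):  # hypothetical components f_i(x) = 0.5*(a_i @ x - b_i)**2
    return (a[i] @ x - b[i]) * a[i]

def grad_f(x):      # gradient of f = (1/n) * sum_i f_i
    return a.T @ (a @ x - b) / n

x = rng.standard_normal(d)
# expectation of g_t over uniform i_t = plain average of component gradients
avg = np.mean([grad_fi(x, i) for i in range(n)], axis=0)
unbiased = np.allclose(avg, grad_f(x))
```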


Question 35: 6 points. Combining the two valid inequalities of smoothness and strong convexity (as also stated in Questions 12 and 13), prove in detailed steps that, if γ_t ≤ 1/L, SGD in this setting converges as

    E[f(x_{t+1}) − f(x⋆)] ≤ γ_t E[‖g_t − ∇f(x_t)‖²] + ((1 − γ_t μ) E[‖x_t − x⋆‖²] − E[‖x_{t+1} − x⋆‖²]) / (2γ_t).    (3)

For comparison, recall the following result from Lecture 6 (slide 6):

    E[f(x_{t+1}) − f(x⋆)] ≤ γ_t B²/2 + ((1 − γ_t μ) E[‖x_t − x⋆‖²] − E[‖x_{t+1} − x⋆‖²]) / (2γ_t),    (4)

under the bounded gradient assumption E[‖g_t‖²] ≤ B².
How do the two results compare?

0 1 2 3 4 5 6



Question 36: 4 points. Recall the possible choices of learning rate γ_t in the situation of the previous question. What is the resulting rate of convergence? Which estimator do we eventually consider? Comment on the assumption γ_t ≤ 1/L: is it a restriction? Which choice of step size could be used,
a) for getting O(log(t)/t) convergence, and
b) for getting O(1/t)?

0 1 2 3 4


