I. Introduction To Convex Optimization: Georgia Tech ECE 8823a Notes by J. Romberg. Last Updated 13:32, January 11, 2017


I. Introduction to Convex Optimization

Introduction to Optimization
In its most general form, an optimization program
\[
\underset{x}{\text{minimize}} \;\; f_0(x) \quad \text{subject to} \quad x \in \mathcal{X}
\]
searches for the vector $x \in \mathbb{R}^N$ that minimizes a given functional
$f_0 : \mathbb{R}^N \to \mathbb{R}$ over a set $\mathcal{X} \subset \mathbb{R}^N$.

We will rely on $\mathcal{X}$ to be specified by a series of constraint functionals:
\[
x \in \mathcal{X} \;\Leftrightarrow\; f_m(x) \le b_m \quad \text{for } m = 1, \ldots, M.
\]

Solving optimization problems is in general very difficult. In this
class, we will develop a framework for analyzing and solving convex
programs, which simply means each of the functionals above obeys
\[
f_m(\theta x + (1-\theta) y) \le \theta f_m(x) + (1-\theta) f_m(y),
\]
for all $m = 0, 1, \ldots, M$, $0 \le \theta \le 1$, and $x, y \in \mathbb{R}^N$. (Much more on
this later.)

What does convexity tell us? Two important things:

• Local minimizers are also global minimizers. So we can check
if a certain point is optimal by looking in a small neighborhood
and seeing if there is a direction to move that decreases $f_0$.
• First-order necessary conditions for optimality turn out to be
sufficient. When the problem is unconstrained and smooth,
this means we can find an optimal point by finding $\hat{x}$ such that
$\nabla f_0(\hat{x}) = 0$.

The upshot of these two things is that if the $f_m(x)$ and their
derivatives¹ are reasonable to compute, then relatively simple
algorithms (e.g. gradient descent) are provably effective at
performing the optimization.
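
As a rough illustration of the gradient descent claim, here is a minimal Python sketch. The quadratic objective, the matrix `Q`, the vector `b`, and the helper name `gradient_descent` are all assumptions made for this example and do not come from the notes. It runs fixed-step gradient descent on a smooth convex function and compares the result with the point where the gradient vanishes.

```python
import numpy as np

def gradient_descent(grad, x0, step, iters=500):
    """Plain fixed-step gradient descent; `grad` returns the gradient at x."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# Hypothetical smooth convex objective: f0(x) = 0.5 * x^T Q x - b^T x,
# with Q symmetric positive definite.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_f0 = lambda x: Q @ x - b

# A step of 1/L, with L the largest eigenvalue of Q, converges for this f0.
L = np.linalg.eigvalsh(Q).max()
x_hat = gradient_descent(grad_f0, np.zeros(2), step=1.0 / L)

# The gradient vanishes exactly where Q x = b, so this is the minimizer.
x_star = np.linalg.solve(Q, b)
```

Because the objective is convex, the stationary point `x_star` is a global minimizer, and the iterates approach it from any starting point.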

The great watershed in optimization is not between linearity and
non-linearity, but convexity and non-convexity.
— R. Tyrrell Rockafellar

The material in this course has three major components. The first
is the mathematical foundations of convex optimization. We will
see that talking about the solution to convex problems requires a
beautiful combination of algebraic and geometric ideas.

The second component is algorithms for solving convex programs.
We will talk about general purpose algorithms (and their associated
computational guarantees), but we will also look at algorithms that
are specialized to certain classes of problems, and even certain
applications. Rather than focus on the “latest and greatest”, we will
try to understand the key ideas that are combined in different ways
into many solvers.

Finally, we will talk a lot about modeling. That is, how convex
optimization appears in signal processing, machine learning,
statistical inference, etc. We will give many examples of mapping a
word problem into an optimization program. These examples will be
interleaved with the discussion of the first two components, and there
are several examples which we may return to several times.

¹ And as we will see, all of what we do is very naturally extended to
non-smooth functions which do not have any derivatives.

You might have two questions at this point:
• Can all convex programs be solved efficiently?
Unfortunately, no. There are many examples of even seemingly
innocuous convex programs which are NP-hard. One way this
can happen is if the functionals themselves are hard to compute.
For example, suppose we were trying to find the matrix with
minimal (∞, 1) norm that obeyed some convex constraints:

\[
f_0(X) = \|X\|_{\infty,1} = \max_{\|v\|_\infty \le 1} \|Xv\|_1.
\]

This is a valid matrix norm, and we will see later that all
valid norms are convex. But it is known that computing f0
is NP-hard (see [Roh00]), as is approximating it to a fixed
accuracy. So optimizations involving this quantity are bound
to be difficult.

• Are there non-convex programs which can be solved efficiently?
Of course there are. Here is one for which you already know
the answer:
\[
\underset{x}{\text{maximize}} \;\; x^T A x \quad \text{subject to} \quad \|x\|_2 = 1,
\]
where $A$ is an arbitrary symmetric matrix. This is the maximization
of an indefinite quadratic form (not necessarily convex or concave)
over a nonconvex set. But we know that the optimal value of this
program is the largest eigenvalue of $A$, and the optimizer is the
corresponding eigenvector, and there are well-known practical
algorithms for computing these.
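
The eigenvalue answer to this nonconvex program can be checked numerically. The sketch below is only an illustration on an assumed random symmetric matrix (the size 5 and the seed are arbitrary). It uses `numpy.linalg.eigh` to get the top eigenpair and compares it against many random unit vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2          # arbitrary symmetric matrix (generally indefinite)

# eigh returns eigenvalues in ascending order, eigenvectors as columns.
eigvals, eigvecs = np.linalg.eigh(A)
lam_max = eigvals[-1]
x_opt = eigvecs[:, -1]     # unit-norm maximizer of x^T A x

# No random unit vector should beat the top eigenvector.
trials = rng.standard_normal((1000, 5))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
best_random = np.max(np.einsum("ij,jk,ik->i", trials, A, trials))
```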
When there is a solution to a nonconvex problem, it often relies
on nice coincidences in the structure of the problem; perturbing
the problem just a little bit can disturb these coincidences. Consider
another nonconvex problem that we know how to solve:
\[
\underset{X}{\text{minimize}} \;\; \sum_{i,j=1}^{N} (X_{i,j} - A_{i,j})^2 \quad \text{subject to} \quad \operatorname{rank}(X) \le R.
\]

That is, we are looking for the best rank-$R$ approximation (in the
least-squares sense) to the given $N \times N$ matrix $A$. The functional
we are optimizing above is convex, but the rank constraint definitely
is not. Nevertheless, we can compute the answer efficiently using the
SVD of $A$:
\[
A = U \Sigma V^T = \sum_{n=1}^{N} \sigma_n u_n v_n^T, \qquad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_N \ge 0.
\]
The program above is solved simply by truncating this sum to its
first $R$ terms:
\[
\hat{X} = \sum_{n=1}^{R} \sigma_n u_n v_n^T.
\]
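
The truncation step can be sketched in a few lines of Python. The helper name `best_rank_r` and the random test matrix are assumptions for illustration only. The check at the end relies on the standard fact that the squared error of the truncated SVD equals the sum of the discarded squared singular values.

```python
import numpy as np

def best_rank_r(A, R):
    """Best rank-R approximation of A in the least-squares (Frobenius) sense."""
    # Singular values are returned sorted, largest first.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :R] * s[:R]) @ Vt[:R, :]

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
R = 2
X_hat = best_rank_r(A, R)

# Squared error versus the discarded singular values.
s = np.linalg.svd(A, compute_uv=False)
err = np.sum((X_hat - A) ** 2)
```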

But now suppose that instead of the matrix $A$, we are given a subset
of its entries indexed by $\mathcal{I}$. We now want to find the matrix that is
most consistent over this subset while also having rank at most $R$:
\[
\underset{X}{\text{minimize}} \;\; \sum_{(i,j) \in \mathcal{I}} (X_{i,j} - A_{i,j})^2 \quad \text{subject to} \quad \operatorname{rank}(X) \le R.
\]
Despite its similarity to the first problem above, this “matrix
completion” problem is NP-hard.

Convex programs tend to be more robust to variations of this type.


Things like adding subspace constraints, restricting variables to be
positive, and considering functionals of linear transforms of x all
preserve the essential convex structure.

For the rest of this introduction, we will introduce a few of the very
well-known classes of convex program, and give an example of an
application for each.

Least squares
Given an $M \times N$ matrix $A$ and a vector $y \in \mathbb{R}^M$, we solve the
unconstrained problem
\[
\underset{x \in \mathbb{R}^N}{\text{minimize}} \;\; \|y - Ax\|_2^2.
\]
When $A$ has full column rank (and so $M \ge N$), then there is a
unique closed-form solution:
\[
\hat{x} = (A^T A)^{-1} A^T y.
\]
We can also write this in terms of the SVD of $A = U \Sigma V^T$:
\[
\hat{x} = V \Sigma^{-1} U^T y.
\]
The mapping from the data vector $y$ to the solution $\hat{x}$ is linear, and
the corresponding $N \times M$ matrix $V \Sigma^{-1} U^T$ is called the
pseudo-inverse.
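
The two closed forms can be compared numerically. This is a small sketch on assumed random data (the sizes and seed are arbitrary), with NumPy's own `lstsq` and `pinv` routines as a cross-check.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 20, 4
A = rng.standard_normal((M, N))      # full column rank with probability one
y = rng.standard_normal(M)

# Normal equations: solve (A^T A) x = A^T y instead of forming the inverse.
x_normal = np.linalg.solve(A.T @ A, A.T @ y)

# SVD form: x = V Sigma^{-1} U^T y, using the compact SVD of A.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((U.T @ y) / s)

# Library routines for comparison.
x_lstsq = np.linalg.lstsq(A, y, rcond=None)[0]
x_pinv = np.linalg.pinv(A) @ y
```

In practice the SVD or a library least-squares routine is usually preferred to the normal equations, since forming $A^T A$ squares the condition number.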

When $A$ does not have full column rank, then the solution is
non-unique. An interesting case is when $A$ is underdetermined ($M < N$)
with $\operatorname{rank}(A) = M$ (full row rank). Then there are many $x$ such
that $y = Ax$ and so $\|y - Ax\|_2^2 = 0$. Of these, we might choose
the one which has the smallest norm:
\[
\underset{x \in \mathbb{R}^N}{\text{minimize}} \;\; \|x\|_2 \quad \text{subject to} \quad Ax = y.
\]
The solution is again given by the pseudo-inverse. We can still write
$A = U \Sigma V^T$, where $\Sigma$ is $M \times M$, diagonal, and invertible, $U$ is
$M \times M$ and $V$ is $N \times M$. Then $\hat{x} = V \Sigma^{-1} U^T y$ finds the shortest
vector (in the Euclidean sense) that obeys the $M$ specified linear
constraints.
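
A quick numerical sketch of the minimum-norm property, again on assumed random data. It builds a second solution by adding a null-space vector and compares lengths.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 3, 8
A = rng.standard_normal((M, N))      # full row rank with probability one
y = rng.standard_normal(M)

# Compact SVD: U is M x M, s has M entries, Vt is M x N (so V is N x M).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_hat = Vt.T @ ((U.T @ y) / s)

# Any other solution is x_hat plus a null-space vector, and is never shorter.
z = rng.standard_normal(N)
null_part = z - np.linalg.pinv(A) @ (A @ z)   # projection of z onto null(A)
x_other = x_hat + null_part
```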

Example: Regression
A fundamental problem in statistics is to estimate a function given
point samples (that are possibly heavily corrupted). We observe pairs
of points² $(x_m, y_m)$ for $m = 1, \ldots, M$, and want to find a function
$f(x)$ such that
\[
f(x_m) \approx y_m, \quad m = 1, \ldots, M.
\]
Of course, the problem is not well-posed yet, since there are any
number of functions for which $f(x_m) = y_m$ exactly. We regularize
the problem in two ways. The first is by specifying a class that $f(\cdot)$
belongs to. One way of doing this is by building $f$ up out of a linear
combination of basis functions $\phi_n(\cdot)$:
\[
f(x) = \sum_{n=1}^{N} \alpha_n \phi_n(x).
\]
We now fit a function by solving for the expansion coefficients $\alpha$.
There is a classical complexity versus robustness trade-off in choosing
the number of basis functions $N$.

² We are just considering functions of a single variable here, but it is
easy to see how the basic setup extends to functionals of a vector.

The quality of a proposed fit is measured by a loss function; this
loss is typically (but not necessarily) specified pointwise at each of
the samples, and then averaged over all the sample points:
\[
\mathrm{Loss}(\alpha; x, y) = \frac{1}{M} \sum_{m=1}^{M} \ell(\alpha; x_m, y_m).
\]
One choice for $\ell(\cdot)$ is the squared-loss:
\[
\ell(\alpha; x_m, y_m) = \left( y_m - \sum_{n=1}^{N} \alpha_n \phi_n(x_m) \right)^2,
\]
which is just the square of the difference between the observed value
$y_m$ and its prediction using the candidate $\alpha$.

We can express everything more simply by putting it in matrix form.
We create the $M \times N$ matrix $\Phi$:
\[
\Phi = \begin{bmatrix}
\phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_N(x_1) \\
\phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_N(x_2) \\
\vdots & & & \vdots \\
\phi_1(x_M) & \phi_2(x_M) & \cdots & \phi_N(x_M)
\end{bmatrix}.
\]
$\Phi$ maps a set of expansion coefficients $\alpha \in \mathbb{R}^N$ to a set of $M$
predictions for the vector of observations $y \in \mathbb{R}^M$. Finding the $\alpha$
that minimizes the squared-loss is now reduced to the standard
least-squares problem:
\[
\underset{\alpha \in \mathbb{R}^N}{\text{minimize}} \;\; \|y - \Phi\alpha\|_2^2.
\]

It is also possible to smooth the results and stay in the least-squares
framework. If $\Phi$ is ill-conditioned, then the least-squares solution
might do dramatic things to $\alpha$ to make it match $y$ as closely as
possible. To discourage this, we can penalize $\|\alpha\|_2$:
\[
\underset{\alpha \in \mathbb{R}^N}{\text{minimize}} \;\; \|y - \Phi\alpha\|_2^2 + \tau \|\alpha\|_2^2,
\]
where $\tau > 0$ is a parameter we can adjust. This can be converted
back to a standard least-squares problem by concatenating ($\sqrt{\tau}$ times)
the identity to the bottom of $\Phi$ and zeros to the bottom of $y$, since
\[
\|y - \Phi\alpha\|_2^2 + \tau \|\alpha\|_2^2 = \left\| \begin{bmatrix} y \\ 0 \end{bmatrix} - \begin{bmatrix} \Phi \\ \sqrt{\tau}\, I \end{bmatrix} \alpha \right\|_2^2.
\]
At any rate, the formula for the solution to this program is
\[
\hat{\alpha} = (\Phi^T \Phi + \tau I)^{-1} \Phi^T y.
\]
This is called ridge regression in the statistics community (and
Tikhonov regularization in the linear inverse problems community).
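
The following sketch puts the regression pieces together under assumed data: a monomial basis, noisy samples of a sine, and an arbitrary value of `tau`, none of which come from the notes. It builds $\Phi$, solves the penalized problem in closed form, and checks that stacking $\sqrt{\tau}$ times the identity under $\Phi$ (the square root appears because the stacked residual is squared) gives the same coefficients.

```python
import numpy as np

rng = np.random.default_rng(4)
M, N = 30, 6
x = np.linspace(-1.0, 1.0, M)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(M)   # hypothetical noisy samples

# Phi[m, n] = phi_n(x_m), here with monomials phi_n(x) = x^n, n = 0..N-1.
Phi = np.vander(x, N, increasing=True)
tau = 1e-2

# Closed form: (Phi^T Phi + tau I)^{-1} Phi^T y.
alpha_closed = np.linalg.solve(Phi.T @ Phi + tau * np.eye(N), Phi.T @ y)

# Equivalent standard least-squares problem on the stacked system.
Phi_aug = np.vstack([Phi, np.sqrt(tau) * np.eye(N)])
y_aug = np.concatenate([y, np.zeros(N)])
alpha_stacked = np.linalg.lstsq(Phi_aug, y_aug, rcond=None)[0]

# Unregularized fit, for comparison of coefficient sizes.
alpha_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]
```

The penalized coefficients are never longer (in the Euclidean norm) than the unregularized ones, which is the smoothing effect described above.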

Linear programming
A linear program (LP) minimizes a linear functional subject to
multiple linear constraints:
\[
\underset{x}{\text{minimize}} \;\; c^T x \quad \text{subject to} \quad a_m^T x \le b_m, \quad m = 1, \ldots, M.
\]
The general form above can include linear equality constraints
$a_i^T x = b_i$ by enforcing both $a_i^T x \le b_i$ and $(-a_i)^T x \le -b_i$; in our
study later on, we will find it convenient to specifically distinguish
between these two types of constraints. We can also write the $M$
constraints compactly as $Ax \le b$, where $A$ is the $M \times N$ matrix with
the $a_m^T$ as rows.

Linear programs do not necessarily have to have a solution; it is
possible that there is no $x$ such that $Ax \le b$, or that the program
is unbounded in that there exists a sequence $x_1, x_2, \ldots$, all obeying
$Ax_k \le b$, with $c^T x_k \to -\infty$ as $k \to \infty$.

Unlike least squares, there is no formula for the solution of a linear
program. Fortunately, there exists very reliable and efficient software
for solving them. The first LP solver was developed in the late 1940s
(Dantzig’s “simplex algorithm”), and now LP solvers are considered
a mature technology. If the constraint matrix $A$ is structured, then
linear programs with millions of variables can be solved to high
accuracy on a standard computer.

Example: Chebyshev approximations
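Consider the following tweak to the least-squares problem we looked
at previously. Suppose that we want to find the vector $x$ so that $Ax$
does not vary too much from $y$ in its maximum deviation:
\[
\underset{x \in \mathbb{R}^N}{\text{minimize}} \; \max_{m=1,\ldots,M} |y_m - a_m^T x| \;=\; \underset{x \in \mathbb{R}^N}{\text{minimize}} \; \|y - Ax\|_\infty.
\]
This is called the Chebyshev approximation problem.

We cannot solve this problem using the pseudo-inverse, but we can
solve it with linear programming. To do this, we introduce the
auxiliary variable $u \in \mathbb{R}$; it should be easy to see that the program
above is equivalent to
\[
\underset{x \in \mathbb{R}^N,\; u \in \mathbb{R}}{\text{minimize}} \;\; u \quad \text{subject to} \quad
\begin{aligned}
y_m - a_m^T x &\le u \\
y_m - a_m^T x &\ge -u
\end{aligned}
\qquad m = 1, \ldots, M.
\]
To put this in the standard linear programming form, take
\[
z = \begin{bmatrix} x \\ u \end{bmatrix}, \quad
c' = \begin{bmatrix} \mathbf{0} \\ 1 \end{bmatrix}, \quad
A' = \begin{bmatrix} -A & -\mathbf{1} \\ A & -\mathbf{1} \end{bmatrix}, \quad
b' = \begin{bmatrix} -y \\ y \end{bmatrix},
\]
and then solve
\[
\underset{z \in \mathbb{R}^{N+1}}{\text{minimize}} \;\; c'^T z \quad \text{subject to} \quad A' z \le b'.
\]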
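
The reduction to a linear program can be tried out directly. The sketch below assumes SciPy is available and uses `scipy.optimize.linprog` with the HiGHS backend. The function name `chebyshev_fit` and the random data are hypothetical choices for this illustration, not something the notes prescribe.

```python
import numpy as np
from scipy.optimize import linprog

def chebyshev_fit(A, y):
    """Minimize ||y - A x||_inf by solving the equivalent LP in z = (x, u)."""
    M, N = A.shape
    ones = np.ones((M, 1))
    c = np.concatenate([np.zeros(N), [1.0]])       # objective: minimize u
    A_ub = np.block([[-A, -ones], [A, -ones]])     # stacked constraint matrix
    b_ub = np.concatenate([-y, y])                 # stacked right-hand side
    # linprog defaults to nonnegative variables, so lift the bounds explicitly.
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (N + 1), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:N], res.x[N]

rng = np.random.default_rng(5)
A = rng.standard_normal((40, 3))
y = rng.standard_normal(40)
x_cheb, u_opt = chebyshev_fit(A, y)

# Least-squares solution, whose maximum deviation can only be larger.
x_ls = np.linalg.lstsq(A, y, rcond=None)[0]
```

At the optimum, `u_opt` equals the largest absolute residual of `x_cheb`, and it is no larger than the largest residual of the least-squares fit.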

Filter design

The standard “filter synthesis” problem is to find a finite-impulse
response (FIR) filter whose discrete-time Fourier transform (DTFT)
is as close to some target $H_\star(\omega)$ as possible. When the deviation
from the target response is measured using a uniform error, this
is called “equiripple design”, since the error in the solution will tend
to have ripples a uniform distance away from the ideal. That is, we
would like to solve
\[
\underset{H}{\text{minimize}} \; \sup_{\omega \in [-\pi,\pi]} |H_\star(\omega) - H(\omega)| \quad \text{subject to } H(\omega) \text{ being FIR.}
\]
If we restrict ourselves to the case where $H_\star(\omega)$ has linear phase (so
the impulse response is symmetric around some time index)³ we can
recast this as a Chebyshev approximation problem.

A symmetric filter with $2K + 1$ taps has a real DTFT that can be
written as a superposition of a DC term plus $K$ cosines:
\[
h_n = 0 \text{ for } |n| > K \;\;\Rightarrow\;\; H(\omega) = \sum_{k=0}^{K} \tilde{h}_k \cos(k\omega), \qquad
\tilde{h}_k = \begin{cases} h_0, & k = 0 \\ 2h_k, & 1 \le k \le K. \end{cases}
\]
So we are trying to solve
\[
\underset{x \in \mathbb{R}^{K+1}}{\text{minimize}} \; \sup_{\omega \in [-\pi,\pi]} \left| H_\star(\omega) - \sum_{k=0}^{K} x_k \cos(k\omega) \right|.
\]

³ The case with general phase can also be handled using convex
optimization, but it is not naturally stated as a linear program.

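We will approximate the supremum on the inside by measuring it at
$M$ equally spaced points $\omega_1, \ldots, \omega_M$ between $-\pi$ and $\pi$. Then
\[
\underset{x}{\text{minimize}} \; \max_{\omega_m} \left| H_\star(\omega_m) - \sum_{k=0}^{K} x_k \cos(k\omega_m) \right| \;=\; \underset{x}{\text{minimize}} \; \|y - Fx\|_\infty,
\]
where $y \in \mathbb{R}^M$ and the $M \times (K + 1)$ matrix $F$ are defined as
\[
y = \begin{bmatrix} H_\star(\omega_1) \\ H_\star(\omega_2) \\ \vdots \\ H_\star(\omega_M) \end{bmatrix}, \qquad
F = \begin{bmatrix}
1 & \cos(\omega_1) & \cos(2\omega_1) & \cdots & \cos(K\omega_1) \\
1 & \cos(\omega_2) & \cos(2\omega_2) & \cdots & \cos(K\omega_2) \\
\vdots & & & & \vdots \\
1 & \cos(\omega_M) & \cos(2\omega_M) & \cdots & \cos(K\omega_M)
\end{bmatrix}.
\]
It should be noted that since the $\omega_m$ are equally spaced, the matrix
$F$ (and its adjoint) can be applied efficiently using a fast discrete
cosine transform. We will see later how this has a direct impact
on the number of computations we need to solve the Chebyshev
approximation problem above.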
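
Here is a small sketch of the discretized design problem, assuming SciPy is available. The ideal low-pass target with cutoff $\pi/2$, the filter order `K`, and the grid size `M` are hypothetical choices for illustration. It forms the matrix $F$ explicitly (a dense matrix, not the fast transform mentioned above) and solves the Chebyshev problem as an LP with `scipy.optimize.linprog`.

```python
import numpy as np
from scipy.optimize import linprog

K, M = 10, 200
omega = np.linspace(-np.pi, np.pi, M)
# Hypothetical target: ideal low-pass response with cutoff pi/2.
y = (np.abs(omega) <= np.pi / 2).astype(float)

# F[m, k] = cos(k * omega_m) for k = 0..K.
F = np.cos(np.outer(omega, np.arange(K + 1)))

# Chebyshev approximation as an LP in z = (x, u).
ones = np.ones((M, 1))
c = np.concatenate([np.zeros(K + 1), [1.0]])
A_ub = np.block([[-F, -ones], [F, -ones]])
b_ub = np.concatenate([-y, y])
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * (K + 2), method="highs")
x, ripple = res.x[:K + 1], res.x[K + 1]

# Recover the 2K+1 symmetric taps: h_0 = x_0 and h_k = h_{-k} = x_k / 2.
h = np.concatenate([x[:0:-1] / 2, [x[0]], x[1:] / 2])
```

The vector `h` holds the taps for $n = -K, \ldots, K$, and `ripple` is the maximum deviation from the target on the frequency grid.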

Quadratic programming
A quadratic program (QP) minimizes a quadratic functional
subject to linear constraints:
\[
\underset{x}{\text{minimize}} \;\; x^T P x + q^T x \quad \text{subject to} \quad Ax \le b.
\]
When $P$ is symmetric positive semidefinite (and in particular when it
is positive definite), the program is convex. If $P$ has even a single
negative eigenvalue, then solving the program above is NP-hard.

QPs are almost as ubiquitous as LPs; they have been used in finance
since the 1950s (see the example below), and are found all over
operations research, control systems, and machine learning. As with
LPs, there are reliable solvers; QP solving might also be considered a
mature technology.

A quadratically constrained quadratic program (QCQP)
allows (convex) quadratic inequality constraints:
\[
\underset{x}{\text{minimize}} \;\; x^T P x + q^T x \quad \text{subject to} \quad x^T P_m x + q_m^T x \le b_m, \quad m = 1, \ldots, M.
\]
This program is convex if $P$ and all of the $P_m$ are symmetric positive
definite; we are minimizing a convex quadratic functional over a
region defined by an intersection of ellipsoids.

Example: Portfolio optimization
One of the classic examples in convex optimization is finding
investment strategies that “optimally”⁴ balance the risk versus the
return. The following quadratic program formulation is due to
Markowitz, who formulated it in the 1950s, then won a Nobel Prize for
it in 1990.

We want to spread our money over $N$ different assets; the fraction of
our money we invest in asset $n$ is denoted $x_n$. We have the immediate
constraints that
\[
\sum_{n=1}^{N} x_n = 1, \quad \text{and} \quad 0 \le x_n \le 1, \quad \text{for } n = 1, \ldots, N.
\]
The expected returns on these investments, which are usually
calculated using some kind of historical average, are $\mu_1, \ldots, \mu_N$. The
$\mu_n$ are specified as multipliers, so $\mu_n = 1.16$ means that asset $n$ has a
historical return of 16%. We specify some target expected return $\rho$,
which means
\[
\sum_{n=1}^{N} \mu_n x_n \ge \rho.
\]
We want to solve for the $x$ that achieves this level of return while
minimizing our risk. Here, the definition of risk is simply the variance
of our return; if the assets have covariance matrix $R$, then the risk
of a given portfolio allocation $x$ is
\[
\mathrm{Risk}(x) = x^T R x = \sum_{m=1}^{N} \sum_{n=1}^{N} R_{m,n} x_m x_n.
\]
⁴ I put “optimally” in quotes because, like everything in finance and
the world, this technique finds the optimal answer for a specified
model. The big question is then how good your model is ...

Our optimization program is then⁵
\[
\begin{aligned}
\underset{x}{\text{minimize}} \quad & x^T R x \\
\text{subject to} \quad & \mu^T x \ge \rho \\
& \mathbf{1}^T x = 1 \\
& \mathbf{0} \le x \le \mathbf{1}.
\end{aligned}
\]
This is an example of a quadratic program with linear constraints.
It is convex since the matrix $R$ is a covariance matrix, and so by
construction it is symmetric positive semi-definite (i.e. symmetric
with all its eigenvalues $\ge 0$).
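
To make the formulation concrete, here is a toy instance. All numbers are made up for illustration and are not market data. The solve uses SciPy's general-purpose SLSQP routine through `scipy.optimize.minimize`, which is a stand-in for a dedicated QP solver and is not something the notes prescribe.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data for N = 4 assets (illustrative numbers only).
mu = np.array([1.05, 1.08, 1.12, 1.16])     # expected return multipliers
G = np.array([[0.10, 0.02, 0.00, 0.01],
              [0.00, 0.15, 0.03, 0.02],
              [0.00, 0.00, 0.20, 0.05],
              [0.00, 0.00, 0.00, 0.30]])
R = G.T @ G                                  # symmetric PSD by construction
rho = 1.10                                   # target expected return
N = len(mu)

constraints = [
    {"type": "eq", "fun": lambda x: np.sum(x) - 1.0},   # 1^T x = 1
    {"type": "ineq", "fun": lambda x: mu @ x - rho},    # mu^T x >= rho
]
x0 = np.full(N, 1.0 / N)                     # equal weights, feasible here
res = minimize(lambda x: x @ R @ x, x0, jac=lambda x: 2 * R @ x,
               bounds=[(0.0, 1.0)] * N, constraints=constraints,
               method="SLSQP")
x_opt = res.x
```

Since the equal-weight starting point already meets the return target in this toy instance, the optimized risk `x_opt @ R @ x_opt` can be compared directly with the equal-weight risk `x0 @ R @ x0`.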

⁵ Throughout these notes, we will use $\mathbf{1}$ for a vector of all ones, and
$\mathbf{0}$ for a vector of all zeros.

Our last example (for now) does not fit into any of the categories
mentioned above. It is, however, convex, and shows how viewing a
well-known problem through the lens of convex programming can
allow us to systematically exploit a priori structural information
about our problem.

Example: Structured covariance estimation


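Let’s recall the basics of estimating the covariance matrix for a
Gaussian random vector
\[
\boldsymbol{X} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_N \end{bmatrix}.
\]
We assume that $\mathrm{E}[\boldsymbol{X}] = 0$. We observe independent realizations
$\boldsymbol{X}_1, \boldsymbol{X}_2, \ldots, \boldsymbol{X}_K$. How do we estimate the covariance matrix
$\Sigma = \mathrm{E}[\boldsymbol{X}\boldsymbol{X}^T]$?

If you have taken any class in statistics, you know that the standard
estimate is given by the sample covariance:
\[
\hat{\Sigma} = \frac{1}{K} \sum_{k=1}^{K} \boldsymbol{X}_k \boldsymbol{X}_k^T.
\]
This makes intuitive sense, but let’s justify it a little more carefully
by showing it is the maximum likelihood estimate (MLE). Given the
$\{\boldsymbol{X}_k\}$ (writing $y_k$ for the observed value of $\boldsymbol{X}_k$), we want to find
the matrix $R$ that maximizes the likelihood function
\[
\underset{R \in S_{++}^N}{\text{maximize}} \;\; \prod_{k=1}^{K} (2\pi)^{-N/2} (\det R)^{-1/2} \exp\!\left(-y_k^T R^{-1} y_k / 2\right).
\]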

The set $S_{++}^N$ is the set of valid covariance matrices (i.e. the set of
positive definite matrices).
Since the log function is monotonic, we can equivalently maximize
the log-likelihood
\[
\underset{R \in S_{++}^N}{\text{maximize}} \;\; -\frac{KN}{2} \log 2\pi + \frac{K}{2} \log\det R^{-1} - \frac{1}{2} \sum_{k=1}^{K} y_k^T R^{-1} y_k. \tag{1}
\]

The first term does not depend on $R$, so we can ignore it. Since
the inverse of every valid covariance matrix is also a valid covariance
matrix, we can optimize over $S = R^{-1}$. This makes the optimization
program
\[
\underset{S \in S_{++}^N}{\text{maximize}} \;\; \frac{K}{2} \log\det S - \frac{1}{2} \sum_{k=1}^{K} y_k^T S y_k.
\]
Using the (easily checked) fact that $y^T S y = \operatorname{trace}(S y y^T)$, we can
write the above as
\[
\underset{S \in S_{++}^N}{\text{maximize}} \;\; \log\det S - \operatorname{trace}(S \hat{\Sigma}),
\]
where $\hat{\Sigma}$ is the sample covariance matrix (and we have also dropped
the constant $K/2$ factors).

The functional $\log\det S$ is concave in its matrix argument $S$ (check
this), and with $\hat{\Sigma}$ fixed, $\operatorname{trace}(S \hat{\Sigma})$ is linear in $S$ (both convex and
concave). Thus maximizing this functional is the same as minimizing
the convex functional $-\log\det S + \operatorname{trace}(S \hat{\Sigma})$. Moreover, the set
$S_{++}^N$ is convex, and so it is fair to call the optimization program
above a convex program.

Since the functional we are maximizing is smooth and concave, we know
that we have a solution $\hat{S}$ if $\hat{S}$ is feasible (in $S_{++}^N$) and the
gradient is equal to zero. It is easy to see that
\[
\nabla \operatorname{trace}(S \hat{\Sigma}) = \hat{\Sigma},
\]
and it can be shown (do this at home) that
\[
\nabla \log\det S = S^{-1}.
\]
Thus the optimal solution obeys
\[
\hat{S}^{-1} - \hat{\Sigma} = 0 \;\;\Rightarrow\;\; \hat{S} = \hat{\Sigma}^{-1}.
\]
This means that the optimal solution to the original program (1) is
\[
\hat{R} = \hat{\Sigma}.
\]
So the sample covariance really is the maximum likelihood estimate.
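
The optimality condition can be sanity-checked numerically. The sketch below uses synthetic zero-mean Gaussian data; the dimensions, the diagonal ground-truth covariance, and the perturbation scale are all assumptions for illustration. It evaluates the gradient at the inverse sample covariance and confirms that small symmetric perturbations do not increase the objective.

```python
import numpy as np

rng = np.random.default_rng(6)
N, K = 4, 50
true_cov = np.diag([1.0, 2.0, 0.5, 1.5])     # hypothetical ground truth
# Rows of Y are the K independent realizations.
Y = rng.multivariate_normal(np.zeros(N), true_cov, size=K)

Sigma_hat = (Y.T @ Y) / K                    # sample covariance, zero mean

def objective(S):
    """log det S - trace(S Sigma_hat), the functional being maximized."""
    sign, logdet = np.linalg.slogdet(S)
    return logdet - np.trace(S @ Sigma_hat)

S_hat = np.linalg.inv(Sigma_hat)
grad_at_S_hat = np.linalg.inv(S_hat) - Sigma_hat   # gradient S^{-1} - Sigma_hat

# A small symmetric perturbation keeps S_hat + P positive definite.
P = rng.standard_normal((N, N))
P = 1e-3 * (P + P.T)
```

Comparing `objective(S_hat + P)` and `objective(S_hat - P)` with `objective(S_hat)` shows the objective dropping in both directions, as expected at the maximizer of a concave function.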

You don’t need to take a course in convex optimization to tell you
that forming the sample covariance matrix is a good idea. But this
formulation comes in handy if we want to specify additional
constraints on the (inverse) covariance. For example, we might know
that the variance of all of the variables is less than some positive
number $\nu_{\max}$. This means that $R_{n,n} = (S^{-1})_{n,n} \le \nu_{\max}$ for all
$n = 1, \ldots, N$. It turns out (and you should again think about this at
home) that functionals which take a matrix in $S_{++}^N$ and return an
entry along the diagonal of its inverse,
\[
f_n(S) = (S^{-1})_{n,n},
\]
are convex in $S$. So these variance constraints can be incorporated
while still keeping the program tractable.

Another example is if we know different pairs of variables $(i, j) \in \mathcal{I}$
are conditionally independent of one another given the other
variables. Two entries of a Gaussian random vector are conditionally
independent if the corresponding entry in the inverse covariance
matrix is zero; so this information can be captured by introducing the
constraints
\[
S_{i,j} = 0, \quad \text{for all } (i, j) \in \mathcal{I}.
\]
Constraints like these come in useful when we are trying to estimate
the structure of Gaussian graphical models. (We will say more about
this as we revisit this example at different points in the course.)

References
[Roh00] J. Rohn. Computing the norm $\|A\|_{\infty,1}$ is NP-hard. Linear
and Multilinear Algebra, 47:195–204, 2000.
