I. Introduction To Convex Optimization: Georgia Tech ECE 8823a Notes by J. Romberg. Last Updated 13:32, January 11, 2017
Introduction to Convex Optimization
Introduction to Optimization
In its most general form, an optimization program
$$\underset{x}{\text{minimize}}\quad f_0(x) \quad \text{subject to} \quad x \in \mathcal{X}$$
The upshot of these two things is that if the $f_m(x)$ and their
derivatives (footnote 1) are reasonable to compute, then relatively
simple algorithms (e.g. gradient descent) are provably effective at
performing the optimization.
The material in this course has three major components. The first
is the mathematical foundations of convex optimization. We will
see that talking about the solution to convex problems requires a
beautiful combination of algebraic and geometric ideas.
The second component is algorithms for solving convex programs.
We will talk about general purpose algorithms (and their associated
computational guarantees), but we will also look at algorithms that
are specialized to certain classes of problems, and even certain
applications. Rather than focus on the “latest and greatest”, we will
try to understand the key ideas that are combined in different ways
into many solvers.
Finally, we will talk a lot about modeling. That is, how convex
optimization appears in signal processing, machine learning,
statistical inference, etc. We will give many examples of mapping a
word problem into an optimization program. These examples will be
interleaved with the discussion of the first two components, and there
are several examples that we may return to more than once.
Footnote 1: And as we will see, all of what we do is very naturally
extended to non-smooth functions that need not be differentiable.
You might have two questions at this point:
• Can all convex programs be solved efficiently?
Unfortunately, no. There are many examples of even seemingly
innocuous convex programs which are NP-hard. One way this
can happen is if the functionals themselves are hard to compute.
For example, suppose we were trying to find the matrix with
minimal (∞, 1) norm that obeyed some convex constraints:
This is a valid matrix norm, and we will see later that all
valid norms are convex. But it is known that computing f0
is NP-hard (see [Roh00]), as is approximating it to a fixed
accuracy. So optimizations involving this quantity are bound
to be difficult.
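To make the difficulty concrete, here is a minimal sketch that evaluates this norm by brute force. It assumes the standard definition $\|X\|_{\infty,1} = \max_{\|v\|_\infty \leq 1} \|Xv\|_1$, which is not restated in the text above. Because a convex function is being maximized over a cube, the maximum is attained at a sign vector, so the sketch simply enumerates all $2^N$ of them. The function name `inf_one_norm` is illustrative only.

```python
from itertools import product

def inf_one_norm(X):
    """Brute-force (inf,1) norm of a matrix given as a list of rows.

    Assumes the definition max over ||v||_inf <= 1 of ||X v||_1. The maximum
    of this convex function over the cube is attained at a vertex
    v in {-1, +1}^N, so the loop visits 2^N sign vectors: the cost is
    exponential in the number of columns N.
    """
    n_cols = len(X[0])
    best = 0.0
    for signs in product((-1.0, 1.0), repeat=n_cols):
        # ||X v||_1 is the sum over rows of |row . v|
        value = sum(abs(sum(r * s for r, s in zip(row, signs))) for row in X)
        best = max(best, value)
    return best

X = [[1.0, -2.0], [3.0, 4.0]]
print(inf_one_norm(X))
```

Doubling the number of columns squares the number of sign vectors visited, which is the practical face of the hardness result cited above (the sketch itself proves nothing about worst-case complexity).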
another nonconvex problem that we know how to solve:
$$\underset{X}{\text{minimize}}\ \sum_{i,j=1}^{N} (X_{i,j} - A_{i,j})^2 \quad \text{subject to} \quad \operatorname{rank}(X) \leq R.$$
That is, we are looking for the best rank-$R$ approximation (in the
least-squares sense) to the given $N \times N$ matrix $A$. The
functional we are optimizing above is convex, but the rank constraint
definitely is not. Nevertheless, we can compute the answer efficiently
using the SVD of $A$:
$$A = U \Sigma V^T = \sum_{n=1}^{N} \sigma_n u_n v_n^T, \qquad \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_N \geq 0.$$
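As a minimal sketch of how the SVD gives the answer: by the Eckart-Young theorem, keeping the $R$ largest singular values and their singular vectors yields a best rank-$R$ approximation in the least-squares sense. The helper name `best_rank_r` and the random test matrix are assumptions made for illustration.

```python
import numpy as np

def best_rank_r(A, R):
    """Keep the R largest singular triplets of A (Eckart-Young truncation)."""
    # NumPy returns the singular values s sorted in decreasing order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :R] * s[:R]) @ Vt[:R, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
X_hat = best_rank_r(A, 2)

# The squared error equals the sum of the discarded squared singular values.
s = np.linalg.svd(A, compute_uv=False)
err = np.sum((X_hat - A) ** 2)
print(np.linalg.matrix_rank(X_hat), np.isclose(err, np.sum(s[2:] ** 2)))
```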
But now suppose that instead of the matrix A, we are given a subset
of its entries indexed by I. We now want to find the matrix that is
most consistent over this subset while also having rank at most R:
$$\underset{X}{\text{minimize}}\ \sum_{(i,j) \in I} (X_{i,j} - A_{i,j})^2 \quad \text{subject to} \quad \operatorname{rank}(X) \leq R.$$
Despite its similarity to the first problem above, this “matrix
completion” problem is NP-hard.
For the rest of this introduction, we will introduce a few well-known
classes of convex programs, and give an example of an application for
each.
Least squares
Given an $M \times N$ matrix $A$ and a vector $y \in \mathbb{R}^M$, we
solve the unconstrained problem
$$\underset{x \in \mathbb{R}^N}{\text{minimize}}\ \|y - Ax\|_2^2.$$
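When $A$ has full column rank, the solution is unique and has the
closed form
$$\hat{x} = (A^T A)^{-1} A^T y.$$
Equivalently, in terms of the SVD $A = U \Sigma V^T$,
$$\hat{x} = V \Sigma^{-1} U^T y.$$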
The mapping from the data vector $y$ to the solution $\hat{x}$ is
linear, and the corresponding $N \times M$ matrix $V \Sigma^{-1} U^T$
is called the pseudo-inverse.
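The closed-form expressions above can be checked numerically. The sketch below assumes a random tall matrix (which has full column rank with probability one) and compares the normal-equations solution, the SVD solution, and NumPy's own `lstsq` and `pinv` routines. The data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 20, 5
A = rng.standard_normal((M, N))   # tall matrix; full column rank with probability one
y = rng.standard_normal(M)

# Normal equations: solve (A^T A) x = A^T y rather than forming the inverse.
x_normal = np.linalg.solve(A.T @ A, A.T @ y)

# SVD route: x = V Sigma^{-1} U^T y using the compact SVD.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((U.T @ y) / s)

# Library routines for comparison.
x_lstsq = np.linalg.lstsq(A, y, rcond=None)[0]
x_pinv = np.linalg.pinv(A) @ y

print(np.allclose(x_normal, x_svd), np.allclose(x_svd, x_lstsq),
      np.allclose(x_lstsq, x_pinv))
```

In practice the SVD or `lstsq` route is usually preferred to the normal equations, since forming $A^T A$ squares the condition number.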
When $A$ does not have full column rank, then the solution is
non-unique. An interesting case is when $A$ is underdetermined
($M < N$) with $\operatorname{rank}(A) = M$ (full row rank). Then there
are many $x$ such that $y = Ax$ and so $\|y - Ax\|_2^2 = 0$. Of these,
we might choose the one which has the smallest norm:
$$\underset{x \in \mathbb{R}^N}{\text{minimize}}\ \|x\|_2 \quad \text{subject to} \quad Ax = y.$$
The solution is again given by the pseudo-inverse. We can still write
$A = U \Sigma V^T$, where $\Sigma$ is $M \times M$, diagonal, and
invertible, $U$ is $M \times M$ and $V$ is $N \times M$. Then
$\hat{x} = V \Sigma^{-1} U^T y$ finds the shortest vector (in the
Euclidean sense) that obeys the $M$ specified linear constraints.
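A minimal numerical sketch of the minimum-norm solution, assuming a random wide matrix (full row rank with probability one). It also builds a second solution by adding a null-space vector, to illustrate that every other solution is longer.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 3, 8
A = rng.standard_normal((M, N))   # wide matrix; full row rank with probability one
y = rng.standard_normal(M)

# Compact SVD: U is M x M, s holds M singular values, Vt is M x N (so V is N x M).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_hat = Vt.T @ ((U.T @ y) / s)

# Any other solution is x_hat + z with z in the null space of A, and is longer.
z = rng.standard_normal(N)
z -= Vt.T @ (Vt @ z)              # remove the row-space component of z
x_other = x_hat + z

print(np.allclose(A @ x_hat, y), np.allclose(A @ x_other, y))
print(np.linalg.norm(x_hat) < np.linalg.norm(x_other))
```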
Example: Regression
A fundamental problem in statistics is to estimate a function given
point samples (that are possibly heavily corrupted). We observe pairs
of points (footnote 2) $(x_m, y_m)$ for $m = 1, \ldots, M$, and want to
find a function $f(x)$ such that
$$f(x_m) \approx y_m, \quad m = 1, \ldots, M.$$
Of course, the problem is not well-posed yet, since there are any
number of functions for which $f(x_m) = y_m$ exactly. We regularize
the problem in two ways. The first is by specifying a class that
$f(\cdot)$ belongs to. One way of doing this is by building $f$ up out
of a linear combination of basis functions $\phi_n(\cdot)$:
$$f(x) = \sum_{n=1}^{N} \alpha_n \phi_n(x).$$
Footnote 2: We are just considering functions of a single variable
here, but it is easy to see how the basic setup extends to functionals
of a vector.
The quality of a proposed fit is measured by a loss function. This
loss is typically (but not necessarily) specified pointwise at each of
the samples, and then averaged over all the sample points:
$$\mathrm{Loss}(\alpha; x, y) = \frac{1}{M} \sum_{m=1}^{M} \ell(\alpha; x_m, y_m).$$
which is just the square of the difference between the observed value
$y_m$ and its prediction using the candidate $\alpha$.
might do dramatic things to $\alpha$ to make it match $y$ as closely as
possible. To discourage this, we can penalize $\|\alpha\|_2$:
$$\underset{\alpha \in \mathbb{R}^N}{\text{minimize}}\ \|y - \Phi\alpha\|_2^2 + \tau \|\alpha\|_2^2,$$
The minimizer has the closed form
$$\hat{\alpha} = (\Phi^T \Phi + \tau I)^{-1} \Phi^T y.$$
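The regularized fit can be sketched in a few lines. This example assumes that $\Phi$ is the $M \times N$ matrix with entries $\phi_n(x_m)$ and uses monomials as a hypothetical choice of basis; the samples, the noise level, and the value of $\tau$ are all made up for illustration. It checks that the gradient of the objective vanishes at the closed-form solution, and that the penalty shrinks the coefficients relative to plain least squares.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, tau = 30, 6, 0.1
x = np.linspace(-1.0, 1.0, M)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(M)   # synthetic noisy samples

# Assumed basis: monomials phi_n(x) = x^n for n = 0..N-1, so Phi[m, n] = x_m ** n.
Phi = np.vander(x, N, increasing=True)

# Closed-form solution (Phi^T Phi + tau I)^{-1} Phi^T y, computed with a solve.
alpha_hat = np.linalg.solve(Phi.T @ Phi + tau * np.eye(N), Phi.T @ y)

# Gradient of ||y - Phi a||^2 + tau ||a||^2 should vanish at alpha_hat.
grad = 2 * Phi.T @ (Phi @ alpha_hat - y) + 2 * tau * alpha_hat

# Unregularized least-squares coefficients, for comparison.
alpha_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]
print(np.allclose(grad, 0.0, atol=1e-10))
print(np.linalg.norm(alpha_hat) <= np.linalg.norm(alpha_ls))
```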
Linear programming
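A linear program (LP) minimizes a linear functional subject to
multiple linear constraints:
$$\underset{x \in \mathbb{R}^N}{\text{minimize}}\ c^T x \quad \text{subject to} \quad a_m^T x \leq b_m, \quad m = 1, \ldots, M.$$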
The general form above can include linear equality constraints
$a_i^T x = b_i$ by enforcing both $a_i^T x \leq b_i$ and
$(-a_i)^T x \leq -b_i$. In our study later on, we will find it
convenient to specifically distinguish between these two types of
constraints. We can also write the $M$ constraints compactly as
$Ax \leq b$, where $A$ is the $M \times N$ matrix with the $a_m^T$ as
rows.
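As a small illustration of encoding an equality as a pair of inequalities in the compact form $Ax \leq b$, the sketch below uses one made-up constraint in $\mathbb{R}^3$. Note that the second inequality carries $-b$ on its right-hand side. All numbers are hypothetical.

```python
import numpy as np

# Hypothetical data: one equality constraint a^T x = b in R^3.
a = np.array([1.0, 2.0, -1.0])
b = 4.0

# Stack a^T x <= b and (-a)^T x <= -b into the compact form A x <= b_vec.
A = np.vstack([a, -a])
b_vec = np.array([b, -b])

x_on = np.array([1.0, 2.0, 1.0])    # a^T x = 1 + 4 - 1 = 4, satisfies the equality
x_off = np.array([1.0, 2.0, 0.0])   # a^T x = 5, violates it

print(np.all(A @ x_on <= b_vec), np.all(A @ x_off <= b_vec))
```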
Example: Chebyshev approximations
Consider the following tweak to the least-squares problem we looked
at previously. Suppose that we want to find the vector x so that Ax
does not vary too much in its maximum deviation:
$$y_m - a_m^T x \geq -u, \quad m = 1, \ldots, M.$$
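One way to see the Chebyshev problem as a linear program is to minimize a scalar bound $u$ subject to $-u \leq y_m - a_m^T x \leq u$ for every $m$. The sketch below assembles the LP data for the stacked variable $z = (x, u)$; the names `chebyshev_lp_data`, `c`, `G`, and `h` are illustrative assumptions, and no solver is called.

```python
import numpy as np

def chebyshev_lp_data(A, y):
    """Return (c, G, h) for: minimize c^T z subject to G z <= h, with z = (x, u)."""
    M, N = A.shape
    ones = np.ones((M, 1))
    c = np.concatenate([np.zeros(N), [1.0]])     # objective picks out u
    G = np.block([[-A, -ones],                   # encodes y_m - a_m^T x <= u
                  [A, -ones]])                   # encodes y_m - a_m^T x >= -u
    h = np.concatenate([-y, y])
    return c, G, h

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 3))
y = rng.standard_normal(10)
c, G, h = chebyshev_lp_data(A, y)

x = rng.standard_normal(3)                       # any candidate x
u = np.max(np.abs(y - A @ x))                    # its maximum deviation
z_tight = np.concatenate([x, [u]])
z_small = np.concatenate([x, [0.9 * u]])
print(np.all(G @ z_tight <= h + 1e-12), np.all(G @ z_small <= h))
```

The triple `(c, G, h)` can then be handed to a general LP solver; for instance, SciPy's `scipy.optimize.linprog(c, A_ub=G, b_ub=h, bounds=(None, None))` accepts data in this shape, with the bounds argument needed because that routine assumes nonnegative variables by default.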
Filter design
If we restrict ourselves to the case where $H_\star(\omega)$ has linear
phase (so the impulse response is symmetric around some time index)
(footnote 3), we can recast this as a Chebyshev approximation problem.
Footnote 3: The case with general phase can also be handled using
convex optimization, but it is not naturally stated as a linear
program.
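We will approximate the supremum on the inside by measuring it at
$M$ equally spaced points $\omega_1, \ldots, \omega_M$ between $-\pi$
and $\pi$. Then
$$\underset{x}{\text{minimize}}\ \max_{\omega_m} \left| H_\star(\omega_m) - \sum_{k=0}^{K} x_k \cos(k\omega_m) \right| \;=\; \underset{x}{\text{minimize}}\ \|y - Fx\|_\infty,$$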
It should be noted that since the ωm are equally spaced, the matrix
F (and its adjoint) can be applied efficiently using a fast discrete
cosine transform. We will see later how this has a direct impact
on the number of computations we need to solve the Chebyshev
approximation problem above.
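A minimal sketch of the sampled problem, assuming $y_m = H_\star(\omega_m)$ and $F_{m,k} = \cos(k\omega_m)$ as the equation above suggests. The ideal low-pass target with cutoff $\pi/2$ and the sizes $M$ and $K$ are hypothetical choices made only for illustration. The matrix is formed densely here; a fast transform would replace the matrix products in a large-scale solver.

```python
import numpy as np

M, K = 64, 10
omega = np.linspace(-np.pi, np.pi, M, endpoint=False)   # M equally spaced frequencies

# Hypothetical target: ideal low-pass response with cutoff pi/2 (an assumption).
y = (np.abs(omega) <= np.pi / 2).astype(float)

# F[m, k] = cos(k * omega_m) for k = 0..K, so (F x)_m = sum_k x_k cos(k omega_m).
F = np.cos(np.outer(omega, np.arange(K + 1)))

def max_deviation(x):
    """Chebyshev objective ||y - F x||_inf for candidate coefficients x."""
    return np.max(np.abs(y - F @ x))

# Least-squares coefficients give a reasonable starting point, not the Chebyshev optimum.
x_ls = np.linalg.lstsq(F, y, rcond=None)[0]
print(F.shape, max_deviation(x_ls))
```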
Quadratic programming
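A quadratic program (QP) minimizes a quadratic functional subject to
linear constraints:
$$\underset{x \in \mathbb{R}^N}{\text{minimize}}\ \tfrac{1}{2} x^T P x + c^T x \quad \text{subject to} \quad Ax \leq b.$$
The program is convex when $P$ is symmetric positive semidefinite.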
QPs are almost as ubiquitous as LPs; they have been used in finance
since the 1950s (see the example below), and are found all over
operations research, control systems, and machine learning. As with
LPs, there are reliable solvers; it might also be considered a mature
technology.
Example: Portfolio optimization
One of the classic examples in convex optimization is finding
investment strategies that “optimally” (footnote 4) balance the risk
versus the return. The following quadratic program formulation is due
to Markowitz, who formulated it in the 1950s, then won a Nobel Prize
for it in 1990.
We want to solve for the x that achieves this level of return while
minimizing our risk. Here, the definition of risk is simply the variance
of our return — if the assets have covariance matrix R, then the risk
of a given portfolio allocation x is
$$\mathrm{Risk}(x) = x^T R x = \sum_{m=1}^{M} \sum_{n=1}^{M} R_{m,n} x_m x_n.$$
Footnote 4: I put “optimally” in quotes because, like everything in
finance and the world, this technique finds the optimal answer for a
specified model. The big question is then how good your model is ...
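Our optimization program is then (footnote 5)
$$\begin{aligned}
\underset{x}{\text{minimize}}\quad & x^T R x \\
\text{subject to}\quad & \mu^T x \geq \rho \\
& \mathbf{1}^T x = 1 \\
& \mathbf{0} \leq x \leq \mathbf{1}.
\end{aligned}$$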
Footnote 5: Throughout these notes, we will use $\mathbf{1}$ for a
vector of all ones, and $\mathbf{0}$ for a vector of all zeros.
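To make the ingredients of the portfolio program concrete, the sketch below evaluates the risk $x^T R x$ and checks the three constraints for one candidate allocation. The returns `mu`, the covariance `R`, the target `rho`, and the allocation `x` are made-up numbers for three hypothetical assets; nothing is optimized here.

```python
import numpy as np

# Hypothetical data for three assets (all numbers are made up for illustration).
mu = np.array([0.05, 0.10, 0.15])            # expected returns
R = np.array([[0.010, 0.002, 0.001],
              [0.002, 0.040, 0.010],
              [0.001, 0.010, 0.090]])        # covariance of the returns
rho = 0.08                                   # required expected return

def risk(x):
    """Variance of the portfolio return, x^T R x."""
    return float(x @ R @ x)

def is_feasible(x, tol=1e-12):
    """Check mu^T x >= rho, 1^T x = 1, and 0 <= x <= 1 up to a small tolerance."""
    return bool(mu @ x >= rho - tol and abs(x.sum() - 1.0) <= tol
                and np.all(x >= -tol) and np.all(x <= 1.0 + tol))

x = np.array([0.5, 0.3, 0.2])                # a candidate allocation
double_sum = sum(R[m, n] * x[m] * x[n] for m in range(3) for n in range(3))
print(is_feasible(x), np.isclose(risk(x), double_sum))
```

Putting everything in the lowest-return asset, `np.array([1.0, 0.0, 0.0])`, fails the return constraint for these numbers, which is the tension the program is designed to balance.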
Our last example (for now) does not fit into any of the categories
mentioned above. It is, however, convex, and shows how viewing a
well-known problem through the lens of convex programming can allow
us to systematically exploit a priori structural information about
our problem.
Let $X \in \mathbb{R}^N$ be a Gaussian random vector. We assume that
$\mathrm{E}[X] = 0$. We observe independent realizations
$X_1, X_2, \ldots, X_K$. How do we estimate the covariance matrix
$\Sigma = \mathrm{E}[X X^T]$?
If you have taken any class in statistics, you know that the standard
estimate is given by the sample covariance:
$$\hat{\Sigma} = \frac{1}{K} \sum_{k=1}^{K} X_k X_k^T.$$
This makes intuitive sense, but let’s justify it a little more
carefully by showing it is the maximum likelihood estimate (MLE). Given
the $\{X_k\}$, and writing $y_k$ for the observed value of $X_k$, we
want to find the matrix $R$ that maximizes the likelihood function
$$\underset{R \in S_{++}^N}{\text{maximize}}\ \prod_{k=1}^{K} (2\pi)^{-N/2} (\det R)^{-1/2} \exp\left(-y_k^T R^{-1} y_k / 2\right).$$
The set $S_{++}^N$ is the set of valid covariance matrices (i.e. the
set of $N \times N$ positive definite matrices).
Since the log function is monotonic, we can equivalently maximize
the log-likelihood
$$\underset{R \in S_{++}^N}{\text{maximize}}\ -\frac{KN}{2} \log 2\pi + \frac{K}{2} \log\det R^{-1} - \frac{1}{2} \sum_{k=1}^{K} y_k^T R^{-1} y_k. \tag{1}$$
The first term does not depend on $R$, so we can ignore it. Since
the inverse of every valid covariance matrix is also a valid covariance
matrix, we can optimize over $S = R^{-1}$. This makes the optimization
program
$$\underset{S \in S_{++}^N}{\text{maximize}}\ \frac{K}{2} \log\det S - \frac{1}{2} \sum_{k=1}^{K} y_k^T S y_k.$$
Using the (easily checked) fact that
$y^T S y = \operatorname{trace}(S y y^T)$, we can write the above as
$$\underset{S \in S_{++}^N}{\text{maximize}}\ \log\det S - \operatorname{trace}(S \hat{\Sigma}).$$
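Since the functional we are maximizing is smooth, we know that we
have a solution $\hat{S}$ if $\hat{S}$ is feasible (in $S_{++}^N$) and
the gradient is equal to zero. The gradient of $\log\det S$ with
respect to $S$ is $S^{-1}$ and the gradient of
$\operatorname{trace}(S \hat{\Sigma})$ is $\hat{\Sigma}$, so when
$\hat{\Sigma}$ is invertible the gradient vanishes at
$\hat{S} = \hat{\Sigma}^{-1}$, that is,
$$\hat{R} = \hat{\Sigma}.$$
So the sample covariance really is the maximum likelihood estimate.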
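The argument above can be sanity-checked numerically. The sketch below draws synthetic zero-mean samples, forms the sample covariance, and checks three things: the trace identity, that the gradient $S^{-1} - \hat{\Sigma}$ vanishes at the inverse of the sample covariance, and that a nearby positive definite matrix gives a smaller objective value. The data, sizes, and perturbation scale are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
N, K = 4, 200
# Rows of Y play the role of the zero-mean samples y_k (synthetic data).
Y = rng.standard_normal((K, N)) * np.array([1.0, 2.0, 0.5, 1.5])
Sigma_hat = Y.T @ Y / K                      # sample covariance

def objective(S):
    """log det S - trace(S Sigma_hat), the functional being maximized."""
    sign, logdet = np.linalg.slogdet(S)
    return logdet - np.trace(S @ Sigma_hat)

S_hat = np.linalg.inv(Sigma_hat)

# Trace identity: the average of y_k^T S y_k equals trace(S Sigma_hat).
quad_avg = np.mean(np.einsum("ki,ij,kj->k", Y, S_hat, Y))

# Gradient S^{-1} - Sigma_hat should vanish at S_hat.
grad = np.linalg.inv(S_hat) - Sigma_hat

# A small symmetric perturbation that keeps the matrix positive definite.
E = rng.standard_normal((N, N))
E = (E + E.T) / 2
step = 0.05 * np.linalg.eigvalsh(S_hat)[0] / np.linalg.norm(E, 2)
S_near = S_hat + step * E

print(np.isclose(quad_avg, np.trace(S_hat @ Sigma_hat)))
print(np.allclose(grad, 0.0), objective(S_hat) > objective(S_near))
```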
Another example is if we know that certain pairs of variables
$(i, j) \in I$ are conditionally independent of one another given the
other variables. Two entries of a Gaussian random vector are
conditionally independent if the corresponding entry in the inverse
covariance matrix is zero, so this information can be captured by
introducing the constraints
$$S_{i,j} = 0, \quad \text{for all } (i, j) \in I.$$
Constraints like these are useful when we are trying to estimate the
structure of Gaussian graphical models. (We will say more about this
as we revisit this example at different points in the course.)
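A small numerical illustration of the link between zeros in the inverse covariance and conditional independence. The $3 \times 3$ matrix `S` below is a hypothetical example with $S_{0,2} = 0$. The sketch shows that the corresponding covariance entry is not zero, while the conditional covariance of the two variables given the third (computed with the standard Schur complement formula for Gaussians) has a zero off-diagonal entry.

```python
import numpy as np

# Hypothetical 3-variable inverse covariance matrix with S[0, 2] = 0.
S = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
R = np.linalg.inv(S)                      # covariance matrix

# Conditional covariance of (X_0, X_2) given X_1 via the Schur complement.
A, B = [0, 2], [1]
R_cond = (R[np.ix_(A, A)]
          - R[np.ix_(A, B)] @ np.linalg.inv(R[np.ix_(B, B)]) @ R[np.ix_(B, A)])

print(np.round(R, 3))        # covariance is dense: X_0 and X_2 are correlated
print(np.round(R_cond, 3))   # off-diagonal is zero: independent given X_1
```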
References
[Roh00] J. Rohn. Computing the norm $\|A\|_{\infty,1}$ is NP-hard.
Linear and Multilinear Algebra, 47:195–204, 2000.