Handout 1: Introduction
Instructor: Anthony Man–Cho So September 3, 2018
Note that if x∗ is a global minimizer of (P), then we naturally have v∗ = f(x∗). Nevertheless, v∗ can
be finite even if problem (P) does not have any global minimizer. For instance, when f(x) = 1/x
and X = R++, we have v∗ = 0.
A related notion is that of a local minimizer, which is defined as a point x0 ∈ X such that for
some ε > 0, we have f(x0) ≤ f(x) for all x ∈ X ∩ B(x0, ε). Here,

B(x0, ε) = {x ∈ Rn : ‖x − x0‖2 ≤ ε}

is the Euclidean ball of radius ε > 0 centered at x0 (recall that for x ∈ Rn, the 2–norm of
x is defined as ‖x‖2² = Σ_{i=1}^n xi² ≡ xᵀx). Note that a global minimizer is automatically a local
minimizer, but the converse is not necessarily true. In this course we shall devote a substantial
amount of time to characterize the minimizers of (P ) and study how the structures of f and X
affect our ability to solve (P ). Before we do that, however, let us observe that problem (P ) is quite
general. For example, when X = Rn , we have an unconstrained optimization problem; when
X is discrete (i.e., for every x ∈ X, there exists an ε > 0 such that X ∩ B(x, ε) = {x}), we have a
discrete optimization problem. Other important classes of optimization problems include:
with a1 , . . . , am ∈ Rn and b1 , . . . , bm ∈ R. In more compact notation, we may write a linear
programming problem as follows:
minimize cT x
subject to Ax ≤ b,
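To make this concrete, here is a minimal sketch (not from the handout) of solving an LP in this compact form with SciPy's linprog; the data c, A, b below are made up for illustration:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data for the compact LP: minimize c^T x subject to A x <= b.
c = np.array([-1.0, -2.0])          # i.e., maximize x1 + 2*x2
A = np.array([[-1.0, 0.0],          # -x1 <= 0, so x1 >= 0
              [0.0, -1.0],          # -x2 <= 0, so x2 >= 0
              [1.0, 1.0]])          # x1 + x2 <= 4
b = np.array([0.0, 0.0, 4.0])

res = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2)
print(res.x, res.fun)               # optimal point and value
```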
where Q = [Qij ] is an n × n matrix. Note that we may assume without loss that Q is
symmetric. This follows from the fact that
xᵀQx = xᵀ ( (Q + Qᵀ)/2 ) x.
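A quick numerical check of this identity (a sketch, using randomly generated data):

```python
import numpy as np

# Numerical check (on random data) that x^T Q x = x^T ((Q + Q^T)/2) x,
# i.e., the quadratic form only sees the symmetric part of Q.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 4))    # generally non-symmetric
x = rng.standard_normal(4)

Qsym = (Q + Q.T) / 2
print(x @ Q @ x, x @ Qsym @ x)     # the two values agree
```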
y ∈ Rm .
can be cast into the form (2). Hence, problem (3) is an instance of SDP. To determine
the feasible region of (3), observe that since the matrix Z is symmetric, it is completely
determined by, say, the entries on and above the diagonal. Hence, the feasible region of (3)
can be expressed as
X = { (Z11, Z12, . . . , Znn) ∈ R^{n(n+1)/2} : Ai • Z = bi for i = 1, . . . , m; Z ⪰ 0 }.
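As a small illustration (with made-up matrices), the inner product Ai • Z is the trace inner product, and membership Z ⪰ 0 can be checked via eigenvalues:

```python
import numpy as np

# The inner product A • Z = sum_{i,j} A_ij * Z_ij = trace(A^T Z), and a
# feasibility check Z >= 0 (positive semidefinite) via eigenvalues.
# The matrices below are made up for illustration.
A = np.array([[1.0, 0.5], [0.5, 2.0]])
Z = np.array([[2.0, -1.0], [-1.0, 1.0]])

inner = np.trace(A.T @ Z)                       # A • Z
same = np.sum(A * Z)                            # equivalent entrywise formula
psd = np.all(np.linalg.eigvalsh(Z) >= -1e-12)   # Z is symmetric, so eigvalsh applies
print(inner, same, psd)
```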
where α = (α1, . . . , αn) ∈ Nⁿ, |α| = Σ_{i=1}^n αi, and fα is the coefficient of the term x1^{α1} x2^{α2} · · · xn^{αn}.
The set X is defined by polynomial inequalities; i.e., it takes the form
X = {x ∈ Rn : gi(x) ≥ 0 for i = 1, . . . , m}.
The aforementioned classes of problems capture a wide range of applications. However, in order
to convert a particular application into a problem of the form (P ), we need to first identify the
data and decision variables and then formulate the objective function and constraints. Let us now
illustrate this process via some examples.
It is not immediately clear that (4) can be formulated as an LP, but it can be done as follows. Let
z be a new decision variable. Then, we may rewrite (4) as
maximize z
subject to ti+1 − ti ≥ z for i = 1, . . . , n − 1,
ai ≤ ti ≤ bi for i = 1, . . . , n,
ti ≤ ti+1 for i = 1, . . . , n − 1,
which is an LP. We should point out that the above reformulation works only because we are
maximizing instead of minimizing the quantity min1≤j≤n−1 (tj+1 − tj ). In particular, the following
problems:
minimize min1≤j≤n−1 (tj+1 − tj )
subject to ai ≤ ti ≤ bi for i = 1, . . . , n, (5)
ti ≤ ti+1 for i = 1, . . . , n − 1
and
minimize z
subject to ti+1 − ti ≥ z for i = 1, . . . , n − 1,
(6)
ai ≤ ti ≤ bi for i = 1, . . . , n,
ti ≤ ti+1 for i = 1, . . . , n − 1
are not equivalent, since the optimum value of (5) is finite (in fact, it is always non–negative), while
the optimum value of (6) is −∞.
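To see the correct (maximization) reformulation at work, here is a sketch that solves the LP above with SciPy's linprog; the time windows [ai, bi] below are invented for illustration:

```python
import numpy as np
from scipy.optimize import linprog

# Sketch of the max-min-gap reformulation: maximize z subject to
# t_{i+1} - t_i >= z, a_i <= t_i <= b_i, and t_i <= t_{i+1}.
# The time windows below are made-up illustrative data.
a = [0.0, 2.0, 4.0, 6.0]
b = [1.0, 5.0, 8.0, 10.0]
n = len(a)

# Decision variables: (t_1, ..., t_n, z); linprog minimizes, so use -z.
c = np.zeros(n + 1)
c[-1] = -1.0

A_ub, b_ub = [], []
for i in range(n - 1):
    # t_i - t_{i+1} + z <= 0  (i.e., t_{i+1} - t_i >= z)
    row = np.zeros(n + 1)
    row[i], row[i + 1], row[-1] = 1.0, -1.0, 1.0
    A_ub.append(row)
    b_ub.append(0.0)
    # t_i <= t_{i+1}
    row2 = np.zeros(n + 1)
    row2[i], row2[i + 1] = 1.0, -1.0
    A_ub.append(row2)
    b_ub.append(0.0)

bounds = [(a[i], b[i]) for i in range(n)] + [(None, None)]
res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds)
print(res.x[:-1], -res.fun)   # event times and the optimal minimum gap
```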
For another air traffic control application that utilizes optimization techniques, see [1].
where A is the m × n matrix whose i–th row is aiᵀ, and e ∈ Rm is the vector of all ones. In other
words, our optimization problem is simply
min_{x ∈ Rn, t ∈ R}  Σ_{i=1}^m |bi − aiᵀx − t|.   (7)
Here, the objective function is nonlinear. However, we can turn problem (7) into an LP as follows.
We first introduce m new decision variables z1 , . . . , zm ∈ R. Then, it is not hard to see that (7) is
equivalent to the following LP:

minimize  Σ_{i=1}^m zi
subject to  −zi ≤ bi − aiᵀx − t ≤ zi for i = 1, . . . , m.
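The reformulation can be sketched in code as follows (synthetic data; the variable ordering (x, t, z) is a choice made here, not from the handout):

```python
import numpy as np
from scipy.optimize import linprog

# Sketch of the l1 data-fitting LP: minimize sum_i z_i subject to
# -z_i <= b_i - a_i^T x - t <= z_i. The data are randomly generated.
rng = np.random.default_rng(1)
m, n = 20, 2
A = rng.standard_normal((m, n))
b = A @ np.array([1.0, -2.0]) + 0.5 + 0.1 * rng.standard_normal(m)

# Variables: (x_1, ..., x_n, t, z_1, ..., z_m).
c = np.concatenate([np.zeros(n + 1), np.ones(m)])
Abar = np.hstack([A, np.ones((m, 1))])           # rows are (a_i^T, 1)
# b - Abar@(x,t) <= z  and  -(b - Abar@(x,t)) <= z, rewritten as A_ub w <= b_ub:
A_ub = np.vstack([np.hstack([-Abar, -np.eye(m)]),
                  np.hstack([Abar, -np.eye(m)])])
b_ub = np.concatenate([-b, b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * (n + 1) + [(0, None)] * m)
xt = res.x[:n + 1]
print(xt, res.fun)   # fitted (x, t) and the optimal l1 residual
```

At the optimum each zi sits at |bi − aiᵀx − t|, so the LP value equals the 1–norm of the residuals.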
Now, what if we want to minimize the 2–norm of the residual errors? In other words, we would
like to solve the following problem:
min_{x ∈ Rn, t ∈ R}  Δ2 = ‖b − Ax − te‖2² = Σ_{i=1}^m (bi − aiᵀx − t)².   (8)
It turns out that this is a particularly simple QP. In fact, since (8) is an unconstrained optimization
problem with a differentiable objective function, we can solve it using calculus techniques. Indeed,
suppose for simplicity that Ā has full column rank, so that ĀT Ā is invertible. Then, the (unique)
optimal solution (x∗ , t∗ ) ∈ Rn × R to (8) is given by
(x∗; t∗) = (ĀᵀĀ)⁻¹ Āᵀ b,   with Ā ∈ R^{m×(n+1)} the matrix whose i–th row is (aiᵀ, 1), i.e., Ā = [A e].
On the other hand, if Ā does not have full column rank, then it can be shown that for any z ∈ Rn+1 ,
the vector

(x∗; t∗) = (ĀᵀĀ)† Āᵀ b + ( I − (ĀᵀĀ)† ĀᵀĀ ) z
is optimal for (8). It is worth noting that the matrix I − (ĀT Ā)† ĀT Ā is simply the orthogonal
projection onto the nullspace of ĀT Ā. In particular, when Ā does not have full column rank, the
nullspace of ĀT Ā is non–trivial.
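Here is a sketch verifying the closed-form solution on synthetic data; np.linalg.lstsq and the pseudoinverse formula agree when Ā has full column rank:

```python
import numpy as np

# Sketch of the closed-form least-squares fit (x*, t*) = (Abar^T Abar)^+ Abar^T b
# on synthetic data, where Abar = [A e] appends a column of ones for the offset t.
rng = np.random.default_rng(2)
m, n = 30, 3
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 1.5 + 0.05 * rng.standard_normal(m)

Abar = np.hstack([A, np.ones((m, 1))])
sol = np.linalg.pinv(Abar.T @ Abar) @ Abar.T @ b   # pseudoinverse formula
x_star, t_star = sol[:-1], sol[-1]

# np.linalg.lstsq solves the same problem; the two answers agree here
# because Abar has full column rank (with probability one for random A).
sol2, *_ = np.linalg.lstsq(Abar, b, rcond=None)
print(np.allclose(sol, sol2))
```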
In the above discussion, we assume that the number of observations m exceeds the number of
parameters n; specifically, m ≥ n + 1. However, in many modern applications (such as biomedical
imaging and gene expression analyses), the number of observations is much smaller than the number
of parameters. Thus, one can typically find infinitely many parameter pairs (x̄, t̄) ∈ Rn × R that fit
the data perfectly; i.e., bi = aiᵀx̄ + t̄ for i = 1, . . . , m. To make the data fitting problem meaningful,
it is then necessary to impose additional assumptions. An intuitive and popular one is that the
actual number of parameters responsible for the input–output relationship is small. In other words,
most of the entries in the parameter vector x ∈ Rn should be zero, though we do not know a priori
where those entries are. There are several ways to formulate the data fitting problem under such
an assumption. For instance, one can consider the following constrained optimization approach:
minimize kb − Ax − tek22
subject to kxk0 ≤ K, (9)
x ∈ Rn , t ∈ R.
Here, kxk0 is the number of non–zero entries in the parameter vector x ∈ Rn , and K ≥ 0 is a
user–defined threshold that controls the sparsity of x. Alternatively, one can consider the following
penalty approach:
min_{x ∈ Rn, t ∈ R}  ‖b − Ax − te‖2² + µ‖x‖0,   (10)
where µ > 0 is a penalty parameter. However, due to the combinatorial nature of the function
x 7→ kxk0 , both of the above formulations are computationally difficult to solve. In fact, it can be
shown, in a formal sense, that an efficient algorithm for solving problems (9) and (10) is unlikely to
exist. To obtain more tractable formulations, a widely used approach is to replace k · k0 by k · k1 .
We will see later in the course why this is a good idea from a computational perspective. For
now, we should note that such an approach changes the original problems, and a natural question
is whether there is any correspondence between the solutions to the original problems and those to
the modified problems. This question has been extensively studied in the fields of high–dimensional
statistics and compressive sensing over the past decade or so. We refer the interested reader to the
book [2] for details and further pointers to the literature.
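To see the combinatorial nature of (9) concretely, here is a brute-force sketch on a tiny synthetic instance, enumerating every support of size at most K and solving least squares on each; the instance sizes and data are invented:

```python
import numpy as np
from itertools import combinations

# Brute-force sketch of the cardinality-constrained problem (9): try every
# support of size <= K, fit (x restricted to the support, t) by least squares,
# and keep the best. The exponential number of supports is exactly what makes
# (9) intractable at scale. All data here is synthetic and noiseless.
rng = np.random.default_rng(3)
m, n, K = 8, 6, 2
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[[1, 4]] = [2.0, -3.0]              # a 2-sparse ground truth
b = A @ x_true + 0.7                      # offset t = 0.7

best_val, best_x, best_t = np.inf, None, None
for k in range(K + 1):
    for S in combinations(range(n), k):
        cols = np.hstack([A[:, list(S)], np.ones((m, 1))])
        sol, *_ = np.linalg.lstsq(cols, b, rcond=None)
        r = b - cols @ sol
        val = r @ r
        if val < best_val:
            best_val = val
            best_x = np.zeros(n)
            best_x[list(S)] = sol[:-1]
            best_t = sol[-1]
print(np.nonzero(best_x)[0], best_val)    # recovered support and residual
```

On this noiseless instance the enumeration recovers the true support exactly, but it already solves C(6,0) + C(6,1) + C(6,2) = 22 least-squares subproblems for n = 6.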
At this point let us reflect a bit on the above examples. Intuitively, a linear problem (say, an
LP) should be easier than a nonlinear problem, and a differentiable problem should be easier than a
non–differentiable one. However, the above examples show that this need not be the case. Indeed,
even though the 2–norm problem (8) is a QP, its optimal solution has a nice characterization, while
the corresponding 1–norm problem (7) does not have such a feature. On the other hand, even
though the objective function in (7) is non–differentiable, the problem can still be solved easily
via LP. Also, we have seen from problems (9) and (10) that the inclusion of a seemingly simple
constraint or objective may render an originally easy optimization problem (namely, Problem (8))
intractable.
From the above discussion, it is natural to ask what makes an optimization problem difficult.
While it is hard to answer such a question without over–generalizing, let us at least identify
a possible source of difficulty. What distinguishes the seemingly very similar problems (8) and (9)
is that the former is a so–called convex optimization problem, while the latter is not. We shall
define the notion of convexity and study it in more detail later.
Note that by definition, the matrix A(x) is symmetric for any x ∈ Rk . Now, a problem that is
frequently encountered in practice is that of choosing an x ∈ Rk so that the largest eigenvalue of
A(x) is minimized (see, e.g., [4] for details). It turns out that such a problem can be formulated as
an SDP. To prove this, we need the following result:
Proposition 1 Let A be an arbitrary n × n symmetric matrix, and let λmax(A) denote the largest
eigenvalue of A. Then, we have tI ⪰ A if and only if t ≥ λmax(A).
Proof Suppose that tI ⪰ A, or equivalently, tI − A ⪰ 0. Then, for any u ∈ Rn\{0}, we have
uᵀ(tI − A)u = tuᵀu − uᵀAu ≥ 0, or equivalently,
t ≥ (uᵀAu)/(uᵀu).
Since this holds for an arbitrary u ∈ Rn\{0}, we have

t ≥ max_{u ∈ Rn\{0}} (uᵀAu)/(uᵀu).   (11)
By the Courant–Fischer theorem, the right–hand side of (11) is precisely λmax (A).
The converse can be established by reversing the above arguments. This completes the proof.
□
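A quick numerical illustration of Proposition 1 (random symmetric matrix; the tolerance is an invented numerical fudge factor):

```python
import numpy as np

# Numerical illustration of Proposition 1: tI - A is positive semidefinite
# exactly when t >= lambda_max(A). A is a random symmetric matrix.
rng = np.random.default_rng(4)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                       # symmetrize
lam_max = np.linalg.eigvalsh(A)[-1]     # eigvalsh returns ascending eigenvalues

def is_psd(M, tol=1e-10):
    """Check positive semidefiniteness via the smallest eigenvalue."""
    return np.linalg.eigvalsh(M)[0] >= -tol

I = np.eye(5)
print(is_psd((lam_max + 0.1) * I - A))  # t above lambda_max: PSD
print(is_psd((lam_max - 0.1) * I - A))  # t below lambda_max: not PSD
```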
Proposition 1 allows us to formulate the above eigenvalue optimization problem as
minimize t
subject to tI − A(x) ⪰ 0.   (12)

As the function Rᵏ × R ∋ (x, t) ↦ tI − A(x) is linear in (x, t), the constraint is an LMI. Hence,
problem (12) is an SDP.
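Solving (12) in general requires an SDP solver, which is beyond this sketch. As a sanity check, for a one-parameter family A(x) = A0 + x·A1 (the matrices below are made up), one can minimize λmax(A(x)) directly, since λmax composed with an affine map is convex in x:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sanity check for the eigenvalue problem: minimize lambda_max(A0 + x*A1)
# over scalar x. Since lambda_max is convex in x, a bounded scalar method
# finds the global minimum. A0, A1 are made-up symmetric matrices; here
# A0 + x*A1 = [[1, x], [x, -1]], whose largest eigenvalue is sqrt(1 + x^2).
A0 = np.array([[1.0, 0.0], [0.0, -1.0]])
A1 = np.array([[0.0, 1.0], [1.0, 0.0]])

def lam_max(x):
    return np.linalg.eigvalsh(A0 + x * A1)[-1]

res = minimize_scalar(lam_max, bounds=(-10, 10), method="bounded")
print(res.x, res.fun)   # minimizer x = 0 gives value 1
```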
References
[1] D. Bertsimas, M. Frankovich, and A. Odoni. Optimal Selection of Airport Runway Configura-
tions. Operations Research, 59(6):1407–1419, 2011.
[2] P. Bühlmann and S. van de Geer. Statistics for High–Dimensional Data: Methods, Theory and
Applications. Springer Series in Statistics. Springer–Verlag, Berlin/Heidelberg, 2011.
[3] J. B. Lasserre. Moments, Positive Polynomials and Their Applications, volume 1 of Imperial
College Press Optimization Series. Imperial College Press, London, United Kingdom, 2009.
[4] A. S. Lewis and M. L. Overton. Eigenvalue Optimization. Acta Numerica, 5:149–190, 1996.