Matinf 2360 Part 3
March 7, 2013
Contents

1 The basics and applications
2 A crash course in convexity
3 Nonlinear equations
3.1 Equations and fixed points
3.2 Newton's method
4 Unconstrained optimization
4.1 Optimality conditions
4.2 Methods
5 Constrained optimization - theory
6 Constrained optimization - methods
6.1 Equality constraints
6.2 Inequality constraints
Mathematics index
Index for MATLAB commands
Preface
These lecture notes have been written for the course MAT-INF2360. They deal
with the third part of that course, which is about nonlinear optimization. Just
like the first two parts of MAT-INF2360, this third part also has its roots in linear
algebra. In addition, it has a stronger connection than the previous parts to
multivariate calculus, as taught in MAT1110.
Notation
We will follow multivariate calculus and linear algebra notation as you know it
from MAT1110 and MAT1120. In particular, vectors will be in boldface (x, y,
etc.), while matrices will be in uppercase (A, B, etc.). The zero vector, or the
zero matrix, is denoted by 0. All vectors stated will be assumed to be column
vectors. A row vector will always be written as x^T, where x is a (column) vector.
Vector-valued functions will be in uppercase boldface (F, G, etc.). Functions are
written using both uppercase and lowercase. Uppercase is often used for the
component functions of a vector-valued function.
Acknowledgment
Most material has been written by Geir Dahl, with many useful suggestions and
help from Øyvind Ryan. The authors would all like to thank each other for estab-
lishing these notes. The authors would also like to thank Andreas Våvang Solbrå
for his valuable contributions to the notes.
Chapter 1
The basics and applications
This first chapter introduces some of the basic concepts in optimization and
discusses some applications. Many of the ideas and results that you will find in
these lecture notes may be extended to more general linear spaces, even infinite-
dimensional. However, to keep life a bit easier and still cover most applications,
we will only be working in Rn .
Due to its character this chapter is a “proof-free zone”, but in the remaining
text we usually give full proofs of the main results.
So, no point “sufficiently near” x∗ has smaller f-value than x∗. A local maximum
is defined similarly, but with the inequality reversed. A stronger notion is
that x∗ is a global minimum of f, which means that

f(x∗) ≤ f(x) for all x ∈ Rn.
The definition of local minimum has a “variational character”; it concerns
the behavior of f near x ∗ . Due to this it is perhaps natural that Taylor’s formula,
which gives an approximation of f in such a neighborhood, becomes a main
tool for characterizing and finding local minima. We present Taylor’s formula,
in different versions, in Section 1.3.
An extension of the notion of minimum and maximum is for constrained
problems where we want, for instance, to minimize f (x) over all x lying in a
given set C . Then x ∗ ∈ C is a local minimum of f over the set C , or subject to
x ∈ C as we shall say, provided no point in C in some neighborhood of x ∗ has
smaller f -value than x ∗ . A similar extension holds for global minimum over C ,
and for maxima.
Example 1.1. To make these things concrete, consider an example from plane
geometry. Consider the point set C = {(z 1 , z 2 ) : z 1 ≥ 0, z 2 ≥ 0, z 1 + z 2 ≤ 1} in the
plane. We want to find a point x = (x 1 , x 2 ) ∈ C which is closest possible to the
point a = (3, 2). This can be formulated as the minimization problem

minimize (x_1 − 3)² + (x_2 − 2)²  subject to  x = (x_1, x_2) ∈ C.

The function we want to minimize is f(x) = (x_1 − 3)² + (x_2 − 2)², which is a quadratic
function. This is the square of the distance between x and a; and minimizing the
distance or the square of the distance is equivalent (why?). A minimum here is
x ∗ = (1, 0), as can be seen from a simple geometric argument where we draw the
normal from (3, 2) to the line x 1 + x 2 = 1. If we instead minimize f over R2 , the
unique global minimum is clearly x ∗ = a = (3, 2). It is also useful, and not too
hard, to find these minima analytically. ♣
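A quick numerical check of this example in MATLAB (a sketch of our own; the grid resolution and the function handle f are illustration choices, not from the notes):

f = @(x1,x2) (x1-3).^2 + (x2-2).^2;
[x1,x2] = meshgrid(linspace(0,1,501));  % C lies inside the unit square
fv = f(x1,x2);
fv(x1 + x2 > 1) = Inf;                  % exclude points outside C
[fmin,idx] = min(fv(:));
[x1(idx), x2(idx)]                      % approximately (1, 0), as claimed

This brute-force check of course does not replace the geometric or analytic argument, but it is a useful sanity test.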
In optimization one considers minimization and maximization problems.
As
max{ f (x) : x ∈ S} = − min{− f (x) : x ∈ S}
it suffices to consider minimization problems. A classical result which assures that
optimal solutions really exist is the extreme value theorem as stated next. You
may want to look these notions up in [8].
Theorem 1.2. Let C be a subset of Rn which is closed and bounded, and let
f : C → R be a continuous function. Then f attains both its (global) minimum
and maximum on C, i.e., there are points x_1, x_2 ∈ C with

f(x_1) ≤ f(x) ≤ f(x_2)   (x ∈ C).
minimize α ∑_{i,j≤n} c_ij x_i x_j − ∑_{j=1}^n μ_j x_j
subject to
∑_{j=1}^n x_j = 1
x_j ≥ 0 (j ≤ n).

Here the function to be minimized is

f(x) = α ∑_{i,j≤n} c_ij x_i x_j − ∑_{j=1}^n μ_j x_j.
1 The precise term is “Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel”
It can be explained in terms of random variables. Let R_j be the return on stock
j; this is a random variable, and we let μ_j = ER_j be the expectation of R_j. So if X
denotes the random variable X = ∑_{j=1}^n x_j R_j, which is the return on our portfolio,
then the expected return is EX = ∑_{j=1}^n μ_j x_j, which is the second term in f.
The minus sign in front reflects that we really want to maximize the expected
return. The first term in f is there because just looking at expected return is too
simple. We want to spread our investments to reduce the risk. The first term
in f is the variance of X multiplied by a weight factor α; the constant c_ij is the
covariance of R_i and R_j, defined by

c_ij = E(R_i − μ_i)(R_j − μ_j).
y = F_α(x)

so here n = m = 2, α = (α_1, α_2) and F_α(x) = α_1 cos x_1 + x_2^{α_2}, where (say) α_1 ∈ R and
α_2 ∈ [1, 2].
The general model may also be thought of as

y = F_α(x) + error.

Given observed data points (x_1, y_1), . . . , (x_N, y_N), the parameter is then determined
by minimizing the total model error, typically

minimize_α ∑_{i=1}^N (y_i − F_α(x_i))².

The optimization variable is the parameter α. Here the model error is quadratic
(corresponding to the Euclidean norm), but other norms are also used.
This optimization problem above is a constrained nonlinear optimization
problem. When the function F α depends linearly on α, which often is the case in
practice, the problem becomes the classical least squares approximation prob-
lem which is treated in basic linear algebra courses. The solution is then charac-
terized by a certain linear system of equations, the so-called normal equations.
where P denotes probability.
Assume Y is the outcome of an experiment, and that we have observed Y = y
(so y is a known real number or a vector, if several observations were made). On
the basis of y we want to estimate the value of the parameter x which “explains”
best possible our observation Y = y. We have now available the probability den-
sity p x (·). The function x → p x (y), for fixed y, is called the likelihood function.
It gives the “probability mass” in y as a function of the parameter x. The max-
imum likelihood problem is to find a parameter value x which maximizes the
likelihood, i.e., which maximizes the probability of getting precisely y. This is an
optimization problem
max_x p_x(y)
where y is fixed and the optimization variable is x. We may here add a con-
straint on x, say x ∈ C for some set C , which may incorporate possible knowl-
edge of x and assure that p x (y) is positive for x ∈ C . Often it is easier to solve the
equivalent optimization problem of maximizing the logarithm of the likelihood
function
max_x ln p_x(y)
This is a nonlinear optimization problem. Often, in statistics, there are several
parameters, so x ∈ Rn for some n, and we need to solve a nonlinear optimiza-
tion problem in several variables, possibly with constraints on these variables.
If the likelihood function, or its logarithm, is a concave function, we have (after
multiplying by −1) a convex optimization problem. Such problems are easier to
solve than general optimization problems. This will be discussed later.
As a specific example assume we have the linear statistical model
y = Ax + w
where A is a given m × n matrix, x ∈ Rn is an unknown parameter, w ∈ Rm is a
random variable (the “noise”), and y ∈ Rm is the observed quantity. We assume
that the components of w, i.e., w_1, w_2, . . . , w_m, are independent and identically
distributed with common density function p on R. This leads to the likelihood
function

p_x(y) = ∏_{i=1}^m p(y_i − a_i x)
where a_i is the i'th row in A. Taking the logarithm we obtain the maximum
likelihood problem

max_x ∑_{i=1}^m ln p(y_i − a_i x).
In many applications of statistics it is central to solve this optimization problem
numerically.
Example 1.3. Let us take a look at a model taken from physics for the disintegration
of muons. The angle θ in electron radiation for disintegration of muons has a
probability density

p(x; α) = (1 + αx)/2   (1.1)

for x ∈ [−1, 1], where x = cos θ, and where α is an unknown parameter in [−1, 1].
Our goal is to estimate α from n measurements x = (x_1, . . . , x_n). In this case the
likelihood function, which we seek to maximize, takes the form g(α) = ∏_{i=1}^n p(x_i; α).
Equivalently we can minimize f(α) = −ln g(α). We compute

f′(α) = −∑_{i=1}^n (x_i/2)/((1 + αx_i)/2) = −∑_{i=1}^n x_i/(1 + αx_i)

f″(α) = ∑_{i=1}^n x_i²/(1 + αx_i)².
We see that f″(α) ≥ 0, so that f is convex. As explained, this will make the problem
easier to solve using numerical methods. If we try to solve f′(α) = 0 analytically
we will run into problems, however. We see that f′(α) → 0 when α → ±∞,
and since x_i/(1 + αx_i) = 1/(1/x_i + α), we must have that f′(α) → ∞ when α → −1/x_i from
below, and f′(α) → −∞ when α → −1/x_i from above. It is therefore clear that
f has exactly one minimum in every interval of the form [−1/x_i, −1/x_{i+1}] when
we list the x_i in increasing order. It is not certain that there is a minimum within
[−1, 1] at all. If all measurements have the same sign there is no such point; in
this case the minimum must be at one of the endpoints of the interval. We will
later look into numerical methods for finding this minimum. ♣
x_{t+1} = h_t(x_t)   (t = 0, 1, . . .)

If, for instance, h_t(x) = Ax for each t, then the solution is x_t = A^t x_0. For the more
general situation, where the system functions h_t may be different, it may be
difficult to find an explicit solution for x_t. Numerically, however, we compute x_t
simply in a for-loop by computing x_0, then x_1 = h_0(x_0), and then x_2 = h_1(x_1) etc.
Now, consider a dynamical system where we may “control” the system in
each time step. We restrict the attention to a finite time span, t = 0, 1, . . . , T . A
proper model is then
x t +1 = h t (x t , u t ) (t = 0, 1, . . . , T − 1)
where x t is the state of the system at time t and the new variable u t is the control
at time t . We assume x t ∈ Rn and u t ∈ Rm for each t (but these things also work
if these vectors lie in spaces of different dimensions). Thus, when we choose the
controls u 0 , u 1 , . . . , u T −1 and x 0 is known, the sequence {x t } of states is uniquely
determined. Next, assume there are given functions f t : Rn × Rm → R that we call
cost functions. We think of f t (x t , u t ) as the “cost” at time t when the system is
in state x t and we choose control u t . The optimal control problem is
minimize f_T(x_T) + ∑_{t=0}^{T−1} f_t(x_t, u_t)
subject to (1.3)
x_{t+1} = h_t(x_t, u_t)   (t = 0, 1, . . . , T − 1)
Substituting the state equations into the objective, the total cost becomes a function
of the controls alone, say f(u), where u = (u_0, u_1, . . . , u_{T−1}) ∈ R^N and N = mT.
Thus, we see that the optimal control problem may be transformed to the
unconstrained optimization problem
min_{u∈R^N} f(u)
Sometimes there may be constraints on the control variables, for instance that
they each lie in some interval, and then the transformation above results in a
constrained optimization problem.
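To make this transformation concrete, here is a sketch of how one evaluates the objective f(u) by forward simulation (the system function, cost functions and horizon are hypothetical examples, not from the notes):

T = 10; x = 1;                    % time horizon and given x_0 (scalar state)
h  = @(t,x,u) 0.9*x + u;          % hypothetical system function h_t
fc = @(t,x,u) x^2 + u^2;          % hypothetical cost f_t
fT = @(x) x^2;                    % terminal cost f_T
u = zeros(1,T);                   % a choice of controls u_0, ..., u_{T-1}
J = 0;
for t = 1:T                       % MATLAB indices start at 1
    J = J + fc(t, x, u(t));
    x = h(t, x, u(t));            % x_{t+1} = h_t(x_t, u_t)
end
J = J + fT(x)                     % the value f(u) for this control sequence

An optimization method can then treat this simulation as a black box defining f(u).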
1.2.5 Linear optimization
This is not an application, but rather a special case of the general nonlinear opti-
mization problem where all functions are linear. A linear optimization problem,
also called linear programming, has the form
minimize c^T x
subject to (1.4)
Ax = b, x ≥ 0.
The following statements are equivalent for a real symmetric matrix A: (i) A is
positive semidefinite; (ii) all eigenvalues of A are nonnegative.
Similarly, a real symmetric matrix is positive definite if x^T Ax > 0 for all nonzero
x ∈ Rn. The following statements are then equivalent: (i) A is positive definite;
(ii) all eigenvalues of A are positive.
4 See Section 5.9 in [8]
so F_i : Rn → R is the i'th component function of F. F′ denotes the Jacobi matrix⁵,
or simply the derivative, of F:

F′(x) =
[ ∂F_1(x)/∂x_1   ∂F_1(x)/∂x_2   · · ·   ∂F_1(x)/∂x_n ]
[ ∂F_2(x)/∂x_1   ∂F_2(x)/∂x_2   · · ·   ∂F_2(x)/∂x_n ]
[      ...            ...                  ...       ]
[ ∂F_m(x)/∂x_1   ∂F_m(x)/∂x_2   · · ·   ∂F_m(x)/∂x_n ]
The i th row of this matrix is therefore the gradient of F i , now viewed as a row
vector.
Next we recall the Taylor theorems from multivariate calculus⁶. The first is the
first order Taylor theorem: for some t ∈ (0, 1),

f(x + h) = f(x) + ∇f(x + th)^T h.

The next one is known as Taylor's formula, or the second order Taylor theorem⁷:
for some t ∈ (0, 1),

f(x + h) = f(x) + ∇f(x)^T h + (1/2) h^T ∇²f(x + th) h.
6 See [8]
7 See Section 5.9 in [8]
8 See Section 5.9 in [8]
Theorem 1.6 (Second order Taylor theorem, version 2). Let f : Rn → R be a
function having second order partial derivatives that are continuous in some
ball B(x; r). Then there is a function ε : Rn → R such that, for each h ∈ Rn with
‖h‖ < r,

f(x + h) = f(x) + ∇f(x)^T h + (1/2) h^T ∇²f(x) h + ε(h)‖h‖².

Here ε(y) → 0 when y → 0.
Using the O-notation, the very useful approximations we get from the Taylor
theorems can thus be summarized as follows:
Taylor approximations:

First order: f(x + h) = f(x) + ∇f(x)^T h + O(‖h‖)
             ≈ f(x) + ∇f(x)^T h.

Second order: f(x + h) = f(x) + ∇f(x)^T h + (1/2) h^T ∇²f(x) h + O(‖h‖²)
              ≈ f(x) + ∇f(x)^T h + (1/2) h^T ∇²f(x) h.
As we shall see, one can get a lot of optimization out of these approximations!
We also need a Taylor theorem for vector-valued functions, which follows by
applying the Taylor theorem above to each component function:
Theorem 1.7 (First order Taylor theorem for vector-valued functions). Let
F : Rn → Rm be a vector-valued function which is continuously differentiable
in a neighborhood N of x. Then

F(x + h) = F(x) + F′(x)h + O(‖h‖)

when x + h ∈ N.
Under a suitable differentiability assumption the following chain rule⁹ holds:

(F ∘ G)′(x) = F′(G(x)) G′(x).

Here the right-hand side is a product of two matrices, the respective Jacobi
matrices evaluated at the right points.
Finally, we discuss some notions concerning the convergence of sequences.
‖x_{k+1} − x∗‖ ≤ γ‖x_k − x∗‖²   (k = 0, 1, . . .)
7. The sublevel set of a function f : Rn → R is the set S_α(f) = {x ∈ Rn : f(x) ≤ α},
where α ∈ R. Assume that inf{ f(x) : x ∈ Rn } = η exists.
b. Consider the special case where n = 2. Solve the problem (hint: eliminate
one variable) and discuss how the minimum point depends on α.
9. Later in these notes we will need the expression for the gradient of functions
which are expressed in terms of matrices.
c. Show that, with f defined as in b., but with A not symmetric, we obtain
that ∇f(x) = (1/2)(A + A^T)x and ∇²f = (1/2)(A + A^T). Verify that these formulas
are compatible with what you found in b. when A is symmetric.
Chapter 2
A crash course in convexity
Figure 2.1: (a) A square. (b) The ellipse x²/4 + y² ≤ 1. (c) The area x⁴ + y⁴ ≤ 1.
x 1 + x 2 = 3, x 1 ≥ 0, x 2 ≥ 0
is a linear system in the variables x 1 , x 2 . The solution set is the set of points
(x 1 , 3 − x 1 ) where 0 ≤ x 1 ≤ 3. The set of solutions of a linear system is called a
polyhedron. These sets often occur in optimization. Thus, a polyhedron has the
form
P = {x ∈ Rn : Ax ≤ b}
Proposition 2.2. If T : Rn → Rm is a linear transformation, and C ⊆ Rn is a
convex set, then the image T (C ) of this set is also convex.
f((1 − λ)x + λy) ≤ (1 − λ) f(x) + λ f(y).

(This inequality holds for all x, y and λ as specified.) Due to the convexity of C,
the point (1 − λ)x + λy lies in C , so the inequality is well-defined. The geometri-
cal interpretation in one dimension is that whenever you take two points on the
graph of f , say (x, f (x)) and (y, f (y)), the graph of f restricted to the line seg-
ment [x, y] lies below the line segment in Rn+1 between the two chosen points.
A function g is called concave if −g is convex.
For every linear function we have that f ((1−λ)x+λy) = (1−λ) f (x)+λ f (y), so
that every linear function is convex. Some other examples of convex functions
in n variables are
• f (x) = L(x) + α where L is a linear function from Rn into R (a linear func-
tional) and α is a real number. In fact, for such functions we have that
f((1 − λ)x + λy) = (1 − λ) f(x) + λ f(y), just as for linear functions. Functions
of the form f(x) = L(x) + α are called affine functions, and may be
written in the form f(x) = c^T x + α for a suitable vector c.
• f(x) = ‖x‖ (Euclidean norm). That this is convex can be proved by writing
‖(1 − λ)x + λy‖ ≤ ‖(1 − λ)x‖ + ‖λy‖ = (1 − λ)‖x‖ + λ‖y‖. In fact, the same
argument can be used to show that every norm defines a convex function.
Such an example is the l₁-norm, also called the sum norm, defined
by ‖x‖₁ = ∑_{j=1}^n |x_j|.
• f(x) = e^{∑_{j=1}^n x_j} (see Exercise 10).
The pointwise supremum of an arbitrary family of affine functions (or even convex
functions) is convex. This is a very useful fact in convexity and its applications.
The following result is an exercise to prove, and it gives a method for proving
convexity of a function.
The next result is often used, and is called Jensen's inequality. It can be
proved using induction.
A point of the form ∑_{j=1}^r λ_j x_j, where the λ_j's are nonnegative and sum to 1,
is called a convex combination of the points x_1, x_2, . . . , x_r. One can show that a
set is convex if and only if it contains all convex combinations of its points.
Finally, one connection between convex sets and convex functions is the fol-
lowing fact whose proof is an exercise.
{x ∈ C : f (x) ≤ α}
is a convex set.
Figure 2.2: (a) The function f(x, y) = x²/4 + y². (b) Some level curves of f.
Example 2.8. Using Theorem 2.7 it is straightforward to prove that the remaining
sets from Figure 2.1 are convex. They can be written as sublevel sets of the
functions f(x, y) = x²/4 + y² and f(x, y) = x⁴ + y⁴. For the first of these the level sets
are ellipses, and are shown in Figure 2.2, together with f itself. One can quickly
verify that the Hessian matrices of these functions are positive semidefinite. It
follows from Proposition 2.5 that the corresponding sets are convex. ♣
An important class of convex functions consists of (certain) quadratic func-
tions. Let A ∈ Rn×n be a symmetric matrix which is positive semidefinite and
consider the quadratic function f : Rn → R given by

f(x) = (1/2) x^T Ax − b^T x = (1/2) ∑_{i,j} a_ij x_i x_j − ∑_{j=1}^n b_j x_j.
(If A = 0, then the function is linear, and it may be strange to call it quadratic.
But we still do this, for simplicity.) Then (Exercise 1.3.9) the Hessian matrix of
f is A, i.e., ∇²f(x) = A for each x ∈ Rn. Therefore, by Theorem 2.7, f is a convex
function.
We remark that sometimes it may be easy to check that a symmetric matrix
A is positive semidefinite. A (real) symmetric n × n matrix A is called diagonally
dominant if |a_ii| ≥ ∑_{j≠i} |a_ij| for i = 1, . . . , n. These matrices arise in many
applications, e.g. splines and differential equations. It can be shown that every
symmetric diagonally dominant matrix is positive semidefinite. For a simple proof
of this fact using convexity, see [3]. Thus, we get a simple criterion for convexity
of a function: check if the Hessian matrix ∇²f(x) is diagonally dominant for
each x. Be careful here: this matrix may be positive semidefinite without being
diagonally dominant!
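The diagonal dominance test is easy to implement; for instance (a small helper of our own, not from the notes):

isdiagdom = @(A) all(abs(diag(A)) >= sum(abs(A),2) - abs(diag(A)));
A = [2 -1 0; -1 2 -1; 0 -1 2];   % symmetric and diagonally dominant
isdiagdom(A)                     % returns true, so A is positive semidefinite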
This theorem is important. Property (ii) says that the first-order Taylor ap-
proximation of f at x 0 (which is the right-hand side of the inequality) always
underestimates f . This result has interesting consequences for optimization as
we shall see later.
10. Show that f(x) = e^{∑_{j=1}^n x_j} is a convex function.
11. Assume that f and g are convex functions defined on an interval I. Determine
which of the following functions are convex or concave:
a. λ f where λ ∈ R,
b. min{ f , g },
c. | f |.
i.e., a convex function defined on a closed real interval attains its maximum at one
of the endpoints.
13. Let f : (0, ∞) → R be a convex function and define the function g : (0, ∞) → R
by g(x) = x f(1/x). Show that g is then convex. Why is the function x → x e^{1/x} convex?
14. Let C ⊆ Rn be a convex set and consider the distance function d_C defined
by d_C(x) = inf{‖x − y‖ : y ∈ C}. Show that d_C is a convex function.
Chapter 3
Nonlinear equations
Often the problem F (x) = 0 has the following form, or may be rewritten to it:
K (x) = x. (3.2)
Fixed-point algorithm:
1. Choose an initial point x_0, let x = x_0 and err = 1.
2. while err > ε do
(i) Compute x_1 = K(x)
(ii) Compute err = ‖x_1 − x‖
(iii) Update x := x_1
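In MATLAB the loop can be written directly as follows (a sketch; the function name fixedpoint and the argument list are our own choices):

function x = fixedpoint(K, x0, epsilon)
% Fixed-point iteration x_{k+1} = K(x_k); stops when the step
% length ||x1 - x|| is at most epsilon.
x = x0; err = 1;
while err > epsilon
    x1 = K(x);
    err = norm(x1 - x);
    x = x1;
end

For example, x = fixedpoint(@(x) cos(x), 1, 1e-10) converges to the unique fixed point of cos on [−1, 1].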
When does the fixed-point iteration work? Let ‖ · ‖ be a fixed norm, e.g. the
Euclidean norm, on Rn. We say that the function K : Rn → Rn is a contraction if
there is a constant 0 ≤ c < 1 such that

‖K(x) − K(y)‖ ≤ c‖x − y‖   for all x, y ∈ Rn.

We also say that K is c-Lipschitz in this case. The following theorem is called
the Banach contraction principle. It also holds in Banach spaces, i.e., complete
normed vector spaces (possibly infinite-dimensional).
Theorem 3.1. Assume that K is c-Lipschitz with 0 < c < 1. Then K has a
unique fixed point x∗. For any starting point x_0 the fixed-point iteration (3.3)
generates a sequence {x_k}_{k=0}^∞ that converges to x∗. Moreover

‖x_{k+1} − x∗‖ ≤ c‖x_k − x∗‖   (k = 0, 1, . . .)

so that

‖x_k − x∗‖ ≤ c^k ‖x_0 − x∗‖.
Proof: First, note that if both x and y are fixed points of K, then

‖x − y‖ = ‖K(x) − K(y)‖ ≤ c‖x − y‖

which means that x = y (as c < 1); therefore K has at most one fixed point. Next,
we compute

‖x_{k+1} − x_k‖ = ‖K(x_k) − K(x_{k−1})‖ ≤ c‖x_k − x_{k−1}‖ ≤ · · · ≤ c^k ‖x_1 − x_0‖

so

‖x_m − x_0‖ = ‖∑_{k=0}^{m−1} (x_{k+1} − x_k)‖ ≤ ∑_{k=0}^{m−1} ‖x_{k+1} − x_k‖
            ≤ (∑_{k=0}^{m−1} c^k) ‖x_1 − x_0‖ ≤ (1/(1 − c)) ‖x_1 − x_0‖.

From this we derive that {x_k} is a Cauchy sequence, as we have

‖x_m − x_k‖ ≤ (c^k/(1 − c)) ‖x_1 − x_0‖   (m > k)

and 0 < c < 1. Any Cauchy sequence in Rn has a limit point, so x_m → x∗ for some
x∗ ∈ Rn. We now prove that the limit point x∗ is a (actually, the) fixed point:

‖x∗ − K(x∗)‖ ≤ ‖x∗ − x_m‖ + ‖x_m − K(x∗)‖
             = ‖x∗ − x_m‖ + ‖K(x_{m−1}) − K(x∗)‖
             ≤ ‖x∗ − x_m‖ + c‖x_{m−1} − x∗‖

and the right-hand side tends to 0 as m → ∞, so K(x∗) = x∗.
3.2 Newton’s method
We return to the main problem (3.1). Our goal is to present Newton’s method,
a highly efficient iterative method for solving this equation. The method con-
structs a sequence
x 0, x 1, x 2, . . .
in Rn which, hopefully, converges to a root x ∗ of F , so F (x ∗ ) = 0. The idea is to
linearize F at the current iterate x k and choose the next iterate x k+1 as a zero of
this linearized function. The first order Taylor approximation of F at x k is
T¹_F(x_k; x) = F(x_k) + F′(x_k)(x − x_k).
We solve T¹_F(x_k; x) = 0 for x and define the next iterate as x_{k+1} = x. This gives

x_{k+1} = x_k − F′(x_k)^{−1} F(x_k)   (3.5)

which leads to Newton's method. One here assumes that the derivative F′ is
known analytically. Note that we do not (and hardly ever do!) compute the
inverse of the matrix F′; in the code below the corresponding linear system is
solved instead.
function x = newtonmult(x0, F, J, epsilon, N)
% Newton's method for solving F(x) = 0. J is a function handle
% returning the Jacobi matrix of F. Stops when norm(F(x)) <= epsilon
% or after N iterations. (Header and initialization reconstructed;
% the original listing started at the while-loop.)
x = x0; n = 1;
while norm(F(x)) > epsilon && n <= N
    x = x - J(x)\F(x);   % solve J(x)*d = -F(x) and update x := x + d
    n = n + 1;
end
This code also terminates after a given number of iterations, and when a given
accuracy is obtained. Note that this function should work for any function F ,
since it is a parameter to the function.
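For example, with the header reconstructed above, the system x_1² + x_2² = 1, x_2 = x_1² could be solved as follows (our own illustration):

F = @(x) [x(1)^2 + x(2)^2 - 1; x(2) - x(1)^2];
J = @(x) [2*x(1), 2*x(2); -2*x(1), 1];   % the Jacobi matrix of F
x = newtonmult([1; 1], F, J, 1e-10, 100);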
The convergence of Newton’s method may be analyzed using fixed point the-
ory since one may view Newton’s method as a fixed point iteration. Observe that
the Newton iteration (3.5) may be written
x k+1 = G(x k )
Theorem 3.2. Assume that Newton's method with initial point x_0 produces a
sequence {x_k}_{k=0}^∞ which converges to a solution x∗ of (3.1). Then the convergence
rate is superlinear.
Proof: From Theorem 1.7 (first order Taylor),

0 = F(x∗) = F(x_k + (x∗ − x_k)) = F(x_k) + F′(x_k)(x∗ − x_k) + O(‖x_k − x∗‖)

and multiplying by F′(x_k)^{−1} gives

x_k − x∗ − F′(x_k)^{−1} F(x_k) = O(‖x_k − x∗‖).

So

lim_{k→∞} ‖x_{k+1} − x∗‖ / ‖x_k − x∗‖ = 0
where K and L are some constants. Here ‖F′(x_0)‖ denotes the operator norm of
the square matrix F′(x_0), which is defined as

‖F′(x_0)‖ = sup_{‖x‖=1} ‖F′(x_0)x‖

and it measures how much the operator F′(x_0) may increase the size of vectors.
The following convergence result for Newton's method is known as Kantorovich's
theorem.
Then F′(x) is invertible for all x ∈ B(x_0; 1/(KL)) and Newton's method with
initial point x_0 will produce a sequence {x_k}_{k=0}^∞ contained in B(x_0; 1/(KL))
with lim_{k→∞} x_k = x∗ for some limit point x∗ ∈ B̄(x_0; 1/(KL)) with
F(x∗) = 0.
A proof of this theorem is quite long (but not very difficult to understand) [8].
One disadvantage with Newton's method is that one needs to know the Jacobi
matrix F′ explicitly. For complicated functions, or functions being the output
of a simulation, the derivative may be hard or impossible to find. The quasi-Newton
method, also called the secant method, is then a good alternative. The
idea is to approximate F′(x_k) by some matrix B_k and to compute the new search
direction from

B_k p = −F(x_k)
A practical method for finding these approximations B_1, B_2, . . . is Broyden's method.
Provided that the previous iteration gave x_k, with Broyden's method we compute
x_{k+1} by following the search direction, define s_k = x_{k+1} − x_k and y_k =
F(x_{k+1}) − F(x_k), and compute B_{k+1} from B_k by the formula

B_{k+1} = B_k + ((y_k − B_k s_k) s_k^T) / (s_k^T s_k).   (3.7)
It can be shown that B_k approximates the Jacobi matrix F′(x_k) well in each
iteration. Moreover, the update given in (3.7) can be done efficiently (it is a rank one
update of B_k).
Note that this algorithm also computes an α through what we call a line
search, to attempt to find the optimal distance to follow the search direction.
We do not here specify how this line search can be performed. Also, we do not
specify how the initial values can be chosen. For B 0 , any approximation of the
Jacobian of F at x 0 can be used, using a numerical differentiation method of
your own choosing. One can show that Broyden’s method, under certain as-
sumptions, also converges superlinearly, see [11].
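A minimal sketch of Broyden's method along these lines (without the line search, i.e. with α = 1 in every step; the name and signature are our own and differ from the function asked for in the exercises):

function x = broydensketch(x0, F, B0, epsilon, N)
% Broyden's quasi-Newton method for F(x) = 0 (step length 1).
x = x0; B = B0; n = 1;
while norm(F(x)) > epsilon && n <= N
    s = -B\F(x);                    % solve B_k p = -F(x_k)
    y = F(x + s) - F(x);
    B = B + ((y - B*s)*s')/(s'*s);  % the rank one update (3.7)
    x = x + s;
    n = n + 1;
end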
an initial point in I will be guaranteed to converge towards x∗. Then try the fixed
point algorithm with starting point x_0 = √(5/3).
3. Let α ∈ R₊ be fixed, and consider f(x) = x² − α. Then the zeros are ±√α.
Write down Newton's iteration for this problem. Let α = 2 and compute the
first three iterates in Newton's method when x_0 = 1.
4. For any vector norm ‖ · ‖ on Rn, we can more generally define a corresponding
operator norm for n × n matrices by

‖A‖ = sup_{‖x‖=1} ‖Ax‖.
c. Show that f(x) = ‖Ax‖ is convex for any n, and show that the maximum
of f on the set {x : ‖x‖ = 1} is attained at a point x of the form ±e_k.
Hint: For the second statement, use Jensen's inequality with x_j = ±e_j
(Theorem 2.4).

d. Show that, for any n × n matrix A, ‖A‖ = sup_k ∑_{i=1}^n |a_ik|, where the
a_ij are the entries of A.
b. Implement a function
function x=broyden(x0,F)
Chapter 4
Unconstrained optimization
∇ f (x ∗ ) = 0. (4.1)
Proof: Assume that x∗ is a local minimum of f and that ∇f(x∗) ≠ 0. Let h =
−α∇ f (x ∗ ) where α > 0. Then ∇ f (x ∗ )T h = −αk∇ f (x ∗ )k2 < 0 and by continuity
of the partial derivatives of f , ∇ f (x)T h < 0 for all x in some neighborhood of x ∗ .
From Theorem 1.4 (first order Taylor) we obtain
f (x ∗ + h) − f (x ∗ ) = ∇ f (x ∗ + t h)T h (4.2)
for some t ∈ (0, 1) (depending on α). By choosing α small enough, the right-hand
side of (4.2) is negative (as just said), and so f (x ∗ +h) < f (x ∗ ), contradicting that
x ∗ is a local minimum. This proves that ∇ f (x ∗ ) = 0.
To prove the second statement, we get from Theorem 1.5 (second order Tay-
lor)
f(x∗ + h) = f(x∗) + ∇f(x∗)^T h + (1/2) h^T ∇²f(x∗ + th) h
          = f(x∗) + (1/2) h^T ∇²f(x∗ + th) h   (4.3)
Example 4.2. Consider a convex quadratic function

f(x) = (1/2) x^T Ax − b^T x

where A is symmetric and positive semidefinite, so that the Hessian matrix of f
is (constant equal to) A. Then ∇f(x) = Ax − b, so the first-order necessary
optimality condition is

Ax = b
h^T Ah ≥ λ_n ‖h‖²   (h ∈ Rn).

A = V DV^T
Next we consider sufficient optimality conditions in the general differen-
tiable case. These conditions are used to prove that a candidate point (say, found
by an algorithm) is really a local minimum.
Proof: From Theorem 1.6 (second order Taylor) and Proposition 4.3 we get

f(x∗ + h) = f(x∗) + ∇f(x∗)^T h + (1/2) h^T ∇²f(x∗) h + ε(h)‖h‖²
          ≥ f(x∗) + (1/2) λ_n ‖h‖² + ε(h)‖h‖²

where we used that ∇f(x∗) = 0.
and this contradicts that f (x) ≥ f (x 1 ) for all x in a neighborhood of x ∗ . There-
fore x 1 must be a global minimum.
Assume f is convex and differentiable. Due to Theorem 4.1 we only need to
show that if ∇ f (x ∗ ) = 0, then x ∗ is a local and global minimum. So assume that
∇ f (x ∗ ) = 0. Then, from Theorem 2.10 we have
f(x) ≥ f(x∗) + ∇f(x∗)^T (x − x∗) = f(x∗)

for all x, so x∗ is a (global) minimum.
4.2 Methods
Algorithms for unconstrained optimization are iterative methods that generate
a sequence of points with gradually smaller values on the function f which is to
be minimized. There are two main types of algorithms in unconstrained opti-
mization:
• Line search methods: Here one first chooses a search direction d k from
the current point x k , using information about the function f . Then one
chooses a step length αk so that the new point
x k+1 = x k + αk d k
A very natural choice for search direction at a point x k is the negative gradi-
ent, d k = −∇ f (x k ). Recall that the direction of maximum increase of a (differen-
tiable) function f at a point x is ∇ f (x), and the direction of maximum decrease
is −∇ f (x). To verify this, Taylor’s theorem gives
f(x + h) = f(x) + ∇f(x)^T h + (1/2) h^T ∇²f(x + th) h.
So, for small h, the first order term dominates and we would like to make this
term small. By the Cauchy-Schwarz inequality¹,

|∇f(x)^T h| ≤ ‖∇f(x)‖ ‖h‖,

and equality holds for h = −α∇f(x) for some α ≥ 0. In general, we call h a de-
scent direction at x if ∇ f (x)T h < 0. Thus, if we move in a descent direction from
x and make a sufficiently small step, the new point has a smaller f -value. With
this background we shall in the following focus on gradient methods given by
x k+1 = x k + αk d k (4.4)
∇ f (x k )T d k < 0 (4.5)
f (x k + h) ≈ f (x k ) + ∇ f (x k )T h + (1/2)h T ∇2 f (x k )h
1 The Cauchy-Schwarz inequality says: |u · v| ≤ ‖u‖ ‖v‖ for u, v ∈ Rn.
If we minimize this quadratic function w.r.t. h, assuming ∇²f(x_k) is positive
definite, we get (see Exercise 8)

h = −(∇²f(x_k))^{−1} ∇f(x_k)
and choose step length α_k = β^{m_k} s. Here σ is typically chosen very small, e.g. σ =
10^{−3}. The parameter s fixes the search for step size to lie within the interval [0, s].
This can be important: for instance, we can set s so small that the initial step
size we try is within the domain of definition of f. According to [1], β is usually
chosen in [1/10, 1/2]. In the literature one may find a lot more information about
step size rules and how they may be adjusted to the methods for finding search
directions, see [1], [11].
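A sketch of such a backtracking routine in MATLAB (we assume the sufficient decrease test f(x_k + αd_k) ≤ f(x_k) + σα∇f(x_k)^T d_k for condition (4.7); the function name matches Exercise 11 below, the parameter values are those suggested there):

function alpha = armijorule(f, df, x, d)
% Backtracking (Armijo) line search sketch: shrink alpha = beta^m * s
% until the sufficient decrease condition holds.
beta = 0.2; s = 0.5; sigma = 1e-3;   % parameter choices from Exercise 11
alpha = s;                           % m = 0
while f(x + alpha*d) > f(x) + sigma*alpha*(df(x)'*d)
    alpha = beta*alpha;              % increase m by one
end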
Now, we return to the choice of search direction in the gradient method (4.4).
A main question is whether it generates a sequence {x_k}_{k=1}^∞ which converges to
a stationary point x∗, i.e., where ∇f(x∗) = 0. It turns out that this may not be the
case; one needs to be careful about the choice of d k to assure this convergence.
The problem is that if d k tends to be nearly orthogonal to ∇ f (x k ) one may get
into trouble. For this reason one introduces the following notion:
What this condition assures is that ‖d_k‖ is not too small or large compared
to ‖∇f(x_k)‖ and that the angle between the vectors d_k and ∇f(x_k) is not too
close to 90°. The proof of the following theorem may be found in [1].
We remark that in Theorem 4.7 the same conclusion holds if we use exact
minimization as step size rule, i.e., f (x k +αd k ) is minimized exactly with respect
to α.
A very important property of a numerical algorithm is its convergence speed.
Let us consider the steepest descent method first. It turns out that the conver-
gence speed for this algorithm is very well explained by its performance on min-
imizing a quadratic function, so therefore the following result is important. In
this theorem A is a symmetric positive definite matrix with eigenvalues λ1 ≥
λ2 ≥ · · · ≥ λn > 0.
f (x k+1 ) ≤ m A f (x k )
The proof may be found in [1]. Thus, if the largest eigenvalue is much larger
than the smallest one, m_A will be nearly 1 and one typically has slow convergence.
The decisive quantity is cond(A) = λ_1/λ_n, the condition number of the
matrix A. So the rule is: if the condition number of A is small
we get fast convergence, but if cond(A) is large, there will be slow convergence.
A similar behavior holds for most functions f because locally near a minimum
point the function is very close to its second order Taylor approximation in x ∗
which is a quadratic function with A = ∇2 f (x ∗ ).
Thus, Theorem 4.8 says that the sequence obtained in the steepest descent
method converges linearly to a stationary point (at least for quadratic functions).
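A bare-bones steepest descent loop, in the spirit of Exercise 9, might look as follows (a sketch; the simple halving backtracking used for α is one choice among many):

function x = steepestdescent(f, df, x0, epsilon, N)
% Steepest descent sketch: d_k = -grad f(x_k), backtracking step size.
x = x0; n = 1;
while norm(df(x)) > epsilon && n <= N
    d = -df(x);                          % the steepest descent direction
    alpha = 1;
    while f(x + alpha*d) > f(x) + 1e-3*alpha*(df(x)'*d)
        alpha = alpha/2;                 % backtrack
    end
    x = x + alpha*d;
    n = n + 1;
end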
Newton’s method for unconstrained optimization:
1. Choose an initial point x 0 .
2. For k = 1, 2, . . . do
(i) (Newton step) d k := −∇2 f (x)−1 ∇ f (x); η = −∇ f (x)T d k
(ii) (Stopping criterion) If η/2 < ²: stop.
(iii) (Line search) Use backtracking line search to find step size αk
(iv) (Update) x k+1 := x k + αk d k
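A direct MATLAB transcription of this algorithm could be (a sketch in the spirit of Exercise 12 below; the name and the backtracking details are our own choices):

function x = newtonbacktracksketch(f, df, d2f, x0, epsilon)
% Newton's method for unconstrained minimization with backtracking.
x = x0;
while true
    d = -d2f(x)\df(x);                 % Newton step: solve d2f(x)*d = -df(x)
    eta = -df(x)'*d;
    if eta/2 < epsilon, break; end     % stopping criterion
    alpha = 1;                         % backtracking line search
    while f(x + alpha*d) > f(x) + 1e-3*alpha*(df(x)'*d)
        alpha = alpha/2;
    end
    x = x + alpha*d;
end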
Recall that the pure Newton step minimizes the second order Taylor ap-
proximation of f at the current iterate x k . Thus, if the function we minimize
is quadratic, we are done in one step. Similarly, if the function can be well ap-
proximated by a quadratic function, then one would expect fast convergence.
We shall give a result on the convergence of Newton’s method (see [2] for fur-
ther details). When A is symmetric, we let λ_min(A) denote the smallest eigenvalue
of A.
For the convergence result we need a lemma on strictly convex functions.
Assume that x 0 is a starting point for Newton’s method and let S = {x ∈ Rn :
f (x) ≤ f (x 0 )}. We shall assume that f is continuous and convex, and this im-
plies that S is a closed convex set. We also assume that f has a minimum point
x ∗ which then must be a global minimum. Moreover the minimum point will be
unique due to a strict convexity assumption on f . Let f ∗ = f (x ∗ ) be the optimal
value.
The following lemma says that for a convex function as just described, a
point is nearly a minimum point (in terms of the f -value) whenever the gradient
is small in that point.
Lemma 4.9. Assume that f is convex as above and that λ_min(∇²f(x)) ≥ m for
all x ∈ S. Then

f(x) − f∗ ≤ (1/(2m)) ‖∇f(x)‖².   (4.8)
Proof: From Theorem 1.5, the second order Taylor theorem, we have for
each x, y ∈ S

f(y) = f(x) + ∇f(x)^T (y − x) + (1/2) (y − x)^T ∇²f(z) (y − x)

for suitable z on the line segment between x and y. Here a lower bound for the
quadratic term is (m/2)‖y − x‖², due to Proposition 4.3. Therefore

f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2)‖y − x‖².

Now, fix x and view the expression on the right-hand side as a quadratic function
of y. This function is minimized for y∗ = x − (1/m)∇f(x). So, by inserting y = y∗
above we get

f(y) ≥ f(x) − (1/(2m)) ‖∇f(x)‖²   for every y ∈ S,

and taking the infimum over y gives f∗ ≥ f(x) − (1/(2m))‖∇f(x)‖², which is (4.8).
Theorem 4.10. Let f be convex and twice continuously differentiable and
assume that

(i) If ‖∇f(x_k)‖ ≥ η, then

f(x_{k+1}) ≤ f(x_k) − γ.   (4.9)
(ii) If ‖∇f(x_k)‖ < η, then backtracking line search gives α_k = 1 and

(L/(2m²)) ‖∇f(x_{k+1})‖ ≤ ((L/(2m²)) ‖∇f(x_k)‖)².   (4.10)
Writing μ_l = (L/(2m²)) ‖∇f(x_l)‖, inequality (4.10) says that

μ_{l+1} ≤ μ_l²   (l ≥ k).

So (by induction)

μ_l ≤ μ_k^{2^{l−k}} ≤ (1/2)^{2^{l−k}}   (l = k, k + 1, . . .).

Combining this with Lemma 4.9 we get

f(x_l) − f∗ ≤ (1/(2m)) ‖∇f(x_l)‖² ≤ (2m³/L²) (1/2)^{2^{l−k+1}}   (l ≥ k).
This inequality shows that f (x l ) → f ∗ , and since the minimum point is unique,
we must have x l → x ∗ . Moreover, it follows that the convergence is quadratic.
It only remains to explain why case (ii) above indeed occurs for some k. In
each iteration of type (i), f is decreased by at least γ, as seen from equation (4.9),
so the number of such iterations must be bounded by

( f(x_0) − f∗)/γ
which is a finite number. Finally, the proof of the statements in connection with
(i) and (ii) above is quite long and one derives several inequalities using the con-
vexity properties of f .
From the proof it is also possible to say something about how many iterations
are needed to reach a certain accuracy. In fact, if ε > 0, a bound on the
number of iterations until f(x_k) ≤ f∗ + ε is

( f(x_0) − f∗)/γ + log₂ log₂ (2m³/(εL²)).
Here γ is the parameter introduced in the proof above. The second term in this
expression (the logarithmic term) grows very slowly as ² is decreased, and it may
roughly be replaced by the constant 6. So, whenever the second stage (case (ii)
in the proof) occurs, the convergence is extremely fast, it takes about 6 more
Newton iterations. Note that quadratic convergence means, roughly, that the
number of correct digits in the answer doubles for every iteration.
a. Show that ∇_h T²_f(x; x + h) = ∇f(x) + ∇²f(x)h.
9. Implement the steepest descent method. Test the algorithm on the functions
in exercises 4 and 6. Use different starting points.
10. What can go wrong when you apply Armijo's rule (Equation (4.7)) to a function
f where ∇²f is negative definite (i.e. all eigenvalues of ∇²f are negative)?
Hint: Substitute the Taylor approximation

f(x_k + β^m s d_k) ≈ f(x_k) + ∇f(x_k)^T (β^m s d_k)

into Equation (4.7).
11. Write a function

function alpha=armijorule(f,df,x,d)

which returns α chosen according to the Armijo rule for a function f with the
given gradient, at point x, with search direction d. The function should compute
m_k from Equation (4.7) with β = 0.2, s = 0.5, σ = 10^{−3}, and return α = β^{m_k} s.
12. Write a function
[xopt,numit]=newtonbacktrack(f,df,d2f,x0)
x = (0.4992, −0.8661, 0.7916, 0.9107, 0.5357, 0.6574, 0.6353, 0.0342, 0.4988, −0.4607)
Use the start value α_0 = 0 for Newton's method. What estimate for the
minimum of f (and thereby α) did you obtain?
b. The ten measurements from a. were generated from a probability dis-
tribution where α = 0.5. The answer you obtained was quite far from this.
Let us therefore take a look at how many measurements we should use in
order to get quite precise estimates for α. You can use the function
function ret=randmuon(alpha,m,n)
Chapter 5
Constrained optimization - theory
minimize f (x)
subject to
(5.1)
h i (x) = 0 (i ≤ m)
g j (x) ≤ 0 ( j ≤ r )
minimize f(x)
subject to (5.2)
h_i(x) = 0 (i ≤ m)

Recall that a feasible point x∗ is called regular if the gradients ∇h_1(x∗), . . . , ∇h_m(x∗)
are linearly independent.
Theorem 5.1. Let x∗ be a local minimum in problem (5.2) and assume that
x∗ is a regular point. Then there is a unique vector λ∗ = (λ∗_1, λ∗_2, . . . , λ∗_m) ∈ Rm
such that

∇f(x∗) + ∑_{i=1}^m λ∗_i ∇h_i(x∗) = 0.   (5.3)

If f and each h_i are twice continuously differentiable, then the following also
holds:

h^T (∇²f(x∗) + ∑_{i=1}^m λ∗_i ∇²h_i(x∗)) h ≥ 0   for all h ∈ T(x∗).   (5.4)
The numbers λ∗_i in this theorem are called the Lagrangian multipliers. Note
that the Lagrangian multiplier vector λ∗ is unique; this follows directly from the
linear independence assumption as x∗ is assumed regular. The theorem may
also be stated in terms of the Lagrangian function L : Rn × Rm → R given by
L(x, λ) = f(x) + ∑_{i=1}^m λ_i h_i(x) = f(x) + λ^T H(x)   (x ∈ Rn, λ ∈ Rm).

Then

∇_x L(x, λ) = ∇f(x) + ∑_i λ_i ∇h_i(x)
∇_λ L(x, λ) = H(x).
Figure 5.1: The two surfaces h_1(x) = b_1 and h_2(x) = b_2 intersect each other in a
curve. Along this curve the constraints are fulfilled.
Therefore, the first order conditions in Theorem 5.1 may be rewritten as follows:

∇_x L(x∗, λ∗) = 0,   ∇_λ L(x∗, λ∗) = 0.

Here the second equation simply means that H(x∗) = 0. These two equations say
that (x∗, λ∗) is a stationary point for the Lagrangian, and it is a system of n + m
(possibly nonlinear) equations in n + m variables.
We may interpret the theorem in the following way. At the point x ∗ the linear
subspace T(x∗) consists of the “first order feasible directions”. Actually, if each h_i
is linear, then T (x ∗ ) consists of those h such that x ∗ + h is feasible, i.e., h i (x ∗ +
h) = 0 for each i ≤ m. Thus, (5.3) says that in a local minimum x ∗ the gradient
∇ f (x ∗ ) is orthogonal to the subspace T (x ∗ ) of the first order feasible variations.
This is reasonable since otherwise there would be a feasible direction in which
f would decrease. In Figure 5.1 we have plotted a curve where two constraints
are fulfilled. In Figure 5.2 we have then shown an interpretation of Theorem 5.1.
Figure 5.2: An illustration of Theorem 5.1: at a minimum x∗ on the curve where
h_1(x∗) = b_1 and h_2(x∗) = b_2, the gradient ∇f(x∗) is a linear combination of
∇h_1(x∗) and ∇h_2(x∗).
where the last inequality follows from the facts that x̄ ∈ B̄(x∗; ε) and H(x̄) = 0.
Clearly, this gives x̄ = x∗. We have therefore shown that the sequence {x_k} converges
to the local minimum x∗. Since x∗ is the center of the ball B̄(x∗; ε), the
points x_k lie in the interior of S for suitably large k. The conclusion is then that
x_k is the unconstrained minimum of F_k when k is sufficiently large. We may
therefore apply Theorem 4.1, which gives ∇F_k(x_k) = 0, so

0 = ∇F_k(x_k) = ∇f(x_k) + k H′(x_k)^T H(x_k) + α(x_k − x∗).   (5.5)
Here H′ denotes the Jacobi matrix of H. For suitably large k the matrix H′(x_k)H′(x_k)^T
is invertible (as the rows of H′(x_k) are linearly independent due to rank(H′(x∗)) =
m and a continuity argument). Multiply equation (5.5) by (H′(x_k)H′(x_k)^T)^{−1} H′(x_k)
and let k → ∞ to obtain

λ∗ = −(H′(x∗)H′(x∗)^T)^{−1} H′(x∗) ∇f(x∗)

and

0 = ∇f(x∗) + H′(x∗)^T λ∗.
This proves the first part of the theorem; we omit proving the second part which
may be found in [1].
The first order necessary condition (5.3) along with the constraints H (x) = 0
is a system of n+m equations in the n+m variables x 1 , x 2 , . . . , x n and λ1 , λ2 , . . . , λm .
One may use e.g. Newton’s method for solving these equations and find a can-
didate for an optimal solution. But usually there are better numerical methods
for solving the optimization (5.1), as we shall see soon.
Necessary optimality conditions are used for finding a candidate for an optimal
solution. In order to verify optimality we need sufficient optimality
conditions.
where ∇²L(x∗, λ∗) is the Hessian of the Lagrangian function with second order
partial derivatives with respect to x. Then x∗ is a (strict) local minimum
of f subject to H(x) = 0.
This theorem may be proved (see [1] for details) by considering the augmented
Lagrangian function

L_c(x, λ) = f(x) + λ^T H(x) + (c/2)‖H(x)‖²

where c is a positive scalar. This is in fact the Lagrangian function in the modified
problem

minimize f(x) + (c/2)‖H(x)‖²
subject to (5.8)
H(x) = 0

and this problem must have the same local minima as the problem of minimizing
f(x) subject to H(x) = 0. The objective function in (5.8) contains the penalty
term (c/2)‖H(x)‖² which may be interpreted as a penalty (increased function
value) for violating the constraint H(x) = 0. In connection with the proof of
Theorem 5.2 based on the augmented Lagrangian one also obtains the follow-
ing interesting and useful fact: if x ∗ and λ∗ satisfy the sufficient conditions in
Theorem 5.2 then there exists a positive c̄ such that for all c ≥ c̄ the point x ∗ is
also a local minimum of the augmented Lagrangian L c (·, λ∗ ). Thus, the original
constrained problem has been converted to an unconstrained one involving the
augmented Lagrangian. And, as we know, unconstrained problems are easier to
solve (solve the equations saying that the gradient is equal to zero).
minimize f (x)
subject to
(5.9)
h i (x) = 0 (i ≤ m)
g j (x) ≤ 0 ( j ≤ r )
minimize f(x)
subject to
(5.10)
h_i(x) = 0 (i ≤ m)
g_j(x) + z_j² = 0 (j ≤ r).
We have introduced extra variables z j , one for each inequality. The square
of these variables represent slack in each of the original inequalities. Note that
there is no sign constraint on z j . Clearly, the problems (5.9) and (5.10) are equiv-
alent. This transformation can also be useful computationally. Moreover, it is
useful theoretically as one may apply the optimality conditions from the previ-
ous section to problem (5.10) to derive the theorem below (see [1]).
We now present a main result in nonlinear optimization. It gives optimality
conditions for this problem, and these conditions are called the Karush-Kuhn-
Tucker conditions, or simply the KKT conditions. In order to present the KKT
conditions we introduce the Lagrangian function L : Rn × Rm × Rr → R given by
m r
L(x, λ, µ) = f (x) + λi h i (x) + µ j g j (x) = f (x) + λT H (x) + µT G(x). (5.11)
X X
i =1 j =1
Theorem 5.3. Consider problem (5.9) with the usual differentiability as-
sumptions.
(i) Let x∗ be a local minimum of this problem and assume that x∗ is
a regular point. Then there are unique Lagrange multiplier vectors λ∗ =
(λ∗_1, λ∗_2, . . . , λ∗_m) and µ∗ = (µ∗_1, µ∗_2, . . . , µ∗_r) such that

∇_x L(x∗, λ∗, µ∗) = 0
µ∗_j ≥ 0 (j ≤ r)   (5.12)
µ∗_j = 0 (j ∉ A(x∗)).
minimize f (x)
subject to
(5.14)
h i (x) = 0 (i ≤ m)
g j (x) = 0 ( j ∈ A(x ∗ ))
The KKT conditions have an interesting geometrical interpretation. They say
that −∇ f (x ∗ ) may be written as linear combination of the gradients of the h i ’s
plus a nonnegative linear combination of the gradients of the g j ’s that are active
at x ∗ .
Example 5.4. Let us consider the following optimization problem:

minimize x_1
subject to
g_1(x_1, x_2) = −x_2 ≤ 0
g_2(x_1, x_2) = (x_1 − 1)² + x_2² − 1 ≤ 0.

If we compute the gradients we see that the KKT conditions take the form

(1, 0)^T + µ_1 (0, −1)^T + µ_2 (2(x_1 − 1), 2x_2)^T = (0, 0)^T,
where the two last terms on the left hand side only are included if the corresponding
inequalities are active. It is clear that we find no solutions if no inequalities
are active. If only the first inequality is active we find no solution either.
If only the second inequality is active we get the equations

(x_1 − 1)² + x_2² = 1
1 + 2µ_2(x_1 − 1) = 0
2µ_2 x_2 = 0.

The last equation gives x_2 = 0 (µ_2 = 0 is impossible by the second equation), so
x_1 = 0 or x_1 = 2, and only x_1 = 0 gives µ_2 = 1/2 ≥ 0. If both inequalities are active
we get the equations

(x_1 − 1)² + x_2² = 1
x_2 = 0
1 + 2µ_2(x_1 − 1) = 0
−µ_1 + 2µ_2 x_2 = 0.
It is clear that this reduces to the system we just solved, so that (0, 0) is the only
candidate for a minimum. It is clear that we must have a minimum, since any
continuous function defined on a closed, bounded region must have a minimum.
Finally we should comment on any points which are not regular. If the first
inequality is active it is impossible to have ∇g_1 = 0. If the other inequality
is active we must have (2(x_1 − 1), 2x_2) = 0 at a point which is not regular, so
that (x_1, x_2) = (1, 0). However, it is clear that this inequality is not active at
that point. If both inequalities are active it is clear that (x_1, x_2) = (0, 0) or (2, 0).
We have already considered the first point. In the other point the gradients are
∇g 1 = (0, −1) and ∇g 2 = (2, 0), which are linearly independent, so that we get no
candidates for the minimum from points which are not regular. ♣
We remark that the assumption that x ∗ is a regular point may be too restric-
tive in some situations, for instance there may be more than n active inequalities
in x ∗ . There exist several other weaker assumptions that assure the existence of
Lagrangian multipliers (and similar necessary conditions). Let us briefly say a
bit more on this matter.
lim_{k→∞} (x_k − x)/α_k = d.
TC (x) always contains the zero vector and it is a cone, meaning that it con-
tains each positive multiple of its vectors. Consider now problem (5.9) and let C
be the set of feasible solutions (those x satisfying all the equality and inequality
constraints).
d · ∇h_i(x) = 0 (i ≤ m)
d · ∇g_j(x) = 0 (j ∈ A(x)).

These equations are the linearized constraints at x (the first order Taylor approximations)
of each h_i and of each g_j for active constraints at x, i.e., those inequality
constraints that hold with equality. With this notation we have the following lemma.
The proof may be found in [11] and it involves the implicit function theorem from
multivariate calculus [8].
Figure 5.3: The different possibilities for ∇f in a minimum of f, under the
constraints x ≥ 0.
minimize (1/2) x T D x − q T x
subject to
Ax = b
The first order optimality conditions then lead to the linear system

[ D   A^T ] [ x ]   [ q ]
[ A   0   ] [ λ ] = [ b ].   (5.15)
Under the additional assumption that D is positive definite and A has full row
rank, one can show that the coefficient matrix in (5.15) is invertible so this sys-
tem has a unique solution x, λ. Thus, for this problem, we may write down an
explicit solution (in terms of the inverse of the block matrix). Numerically, one
finds x (and the Lagrangian multiplier λ) by solving the linear system (5.15) by
e.g. Gaussian elimination or some faster (direct or iterative) method. ♣
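In MATLAB the system (5.15) can be assembled and solved directly, e.g. by Gaussian elimination via the backslash operator (a sketch with hypothetical data D, q, A, b):

D = [2 0; 0 2]; q = [1; 1];        % hypothetical positive definite D and q
A = [1 1]; b = 1;                  % one equality constraint: x1 + x2 = 1
[m, n] = size(A);
sol = [D, A'; A, zeros(m)] \ [q; b];
x = sol(1:n); lambda = sol(n+1:end);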
Example 5.11. Consider an extension of the previous example by allowing lin-
ear inequality constraints as well:
minimize (1/2) x T D x − q T x
subject to
Ax = b
x ≥0
Example 5.12. Linear optimization is a problem of the form
minimize c^T x subject to Ax = b, x ≥ 0
This is a special case of the convex programming problem (5.16) where g j (x) =
−x j ( j ≤ n). Here ∇ f (x) = c and ∇g k (x) = −e k . Let x be a feasible solution.
The KKT conditions state that there are vectors λ ∈ Rm and µ ∈ Rn such that
c + A T λ − µ = 0, µ ≥ 0 and µk = 0 if x k > 0 (k ≤ n). Here we eliminate µ and
obtain the equivalent set of KKT conditions: there is a vector λ ∈ Rm such that
c + A^T λ ≥ 0, (c + A^T λ)_k · x_k = 0 (k ≤ n). These conditions are the familiar
optimality conditions in linear optimization theory. The vector λ is feasible in the
so-called dual problem and complementary slackness holds. We do not go into details
on this here, but refer to the course INF-MAT3370 Linear optimization where
these matters are treated in detail. ♣
Proof: 1.) The proof of property 1 is exactly as the proof of the first part of
Theorem 4.5, except that we work with local and global minimum of f over C .
2.) Assume the set C∗ of minimum points is nonempty and let α = min_{x∈C} f(x).
Then C ∗ = {x ∈ C : f (x) ≤ α} is a convex set, see Proposition 2.5. Moreover, this
set is closed as f is continuous.
3.) This follows directly from Theorem 2.10.
Next, we consider a quite general convex optimization problem which is of
the form (5.9):
minimize f (x)
subject to
(5.16)
Ax = b
g j (x) ≤ 0 ( j ≤ r )
where all the functions f and g j are differentiable convex functions, and A ∈
Rm×n and b ∈ Rm . Let C denote the feasible set of problem (5.16). Then C is a
convex set, see Proposition 2.5. A special case of (5.16) is linear optimization.
An important concept in convex optimization is duality. To briefly explain
this, introduce again the Lagrangian function L : Rn × Rm × Rr₊ → R given by

L(x, λ, ν) = f(x) + λ^T (Ax − b) + ν^T G(x).

Remark: we use the variable name ν here instead of the µ used before, because
of another parameter µ to be used soon. Note that we require ν ≥ 0.
Define the new function g : Rm × Rr₊ → R̄ by

g(λ, ν) = inf_x L(x, λ, ν).

Note that this infimum may sometimes be equal to −∞ (meaning that the function
x → L(x, λ, ν) is unbounded below). The function g is the pointwise infimum
of a family of affine functions in (λ, ν), one function for each x, and this
implies that g is a concave function. We are interested in g due to the following
fact, which is easy to prove. It is usually referred to as weak duality.
Lemma 5.14. If x is feasible in (5.16) and ν ≥ 0, then g(λ, ν) ≤ f(x).

Proof: We have

g(λ, ν) ≤ L(x, λ, ν) = f(x) + λ^T (Ax − b) + ν^T G(x) ≤ f(x)

as Ax = b, ν ≥ 0 and G(x) ≤ 0.
Thus, g (λ, ν) provides a lower bound on the optimal value in (5.16). It is
natural to look for a best possible such lower bound and this is precisely the so-
called dual problem which is
maximize g (λ, ν)
subject to (5.17)
ν ≥ 0.
Actually, in this dual problem, we may further restrict the attention to those
(λ, ν) for which g (λ, ν) is finite. g (λ, ν) is also called the dual objective function.
The original problem (5.16) will be called the primal problem. It follows from
Lemma 5.14 that
g∗ ≤ f ∗
where f ∗ denotes the optimal value in the primal problem and g ∗ the optimal
value in the dual problem. If g ∗ < f ∗ , we say that there is a duality gap. Note
that the derivation above, and weak duality, holds for arbitrary functions f and
g j ( j ≤ r ). The concavity of g also holds generally.
The dual problem is useful when the dual objective function g may be com-
puted efficiently, either analytically or numerically. Duality provides a powerful
method for proving that a solution is optimal or, possibly, near-optimal. If we
have a feasible x in (5.16) and we have found a dual solution (λ, ν) with ν ≥ 0
such that
f(x) = g(λ, ν) + ε

for some ε (which then has to be nonnegative), then we can conclude that x is
“nearly optimal”: it is not possible to improve f by more than ε. Such a point x
is sometimes called ε-optimal, where the case ε = 0 means optimal.
So, how good is this duality approach? For convex problems it is often per-
fect as the next theorem says. We omit most of the proof; see [5, 1, 14]. For
nonconvex problems one should expect a duality gap. Recall that G′(x) denotes
the Jacobi matrix of G = (g 1 , g 2 , . . . , g r ) at x.
g j (x 0 ) < 0 ( j ≤ r ).
Then f∗ = g∗, so there is no duality gap. Moreover, x is a (local and global)
minimum in (5.16) if and only if there are λ ∈ Rm and ν ∈ Rr with ν ≥ 0 and

∇f(x) + A^T λ + G′(x)^T ν = 0

and

ν_j g_j(x) = 0 (j ≤ r).
Proof: We only prove the second part (see the references above). So assume
that f ∗ = g ∗ and the infimum and supremum are attained in the primal and dual
problems, respectively. Let x be a feasible point in the primal problem. Then x
is a minimum in the primal problem if and only if there are λ ∈ Rm and ν ∈ Rr
such that all the inequalities in the proof of Lemma 5.14 hold with equality. This
means that g(λ, ν) = L(x, λ, ν) and ν^T G(x) = 0. But L(x, λ, ν) is convex in x, so it
is minimized by x if and only if its gradient is the zero vector, i.e., ∇f(x) + A^T λ +
G′(x)^T ν = 0. This leads to the desired characterization.
The assumption stated in the theorem, that g j (x 0 ) < 0 for each j , is called the
weak Slater condition.
Example 5.16. Consider the convex optimization problem where we want to
minimize the function f(x) = x² + 1 subject to the inequality constraint g(x) =
(x − 3)² − 1 ≤ 0. From Figure 5.4(a) it is quite clear that the minimum is attained
for x = 2, and is f (2) = 5. Since both the constraint and the objective function
are convex, and since here the weak Slater condition holds, Theorem 5.15 guar-
antees that the dual problem has the same solution as the primal problem. Let
us verify this by considering the dual problem as well. The Lagrangian function
is given by
L(x, ν) = f(x) + νg(x) = x² + 1 + ν((x − 3)² − 1).

It is easy to see that this function attains its minimum for x = 3ν/(1 + ν). This means
that the dual objective function is given by

g(ν) = L(3ν/(1 + ν), ν) = (3ν/(1 + ν))² + 1 + ν((3ν/(1 + ν) − 3)² − 1).
This is shown in Figure 5.4(b). It is quite clear from this figure that the maxi-
mum is 5, which we already found by solving the primal problem. To prove this
requires some more work, by setting the derivative of the dual objective func-
tion to zero. Therefore, the primal and the dual problem are two very different
problems, where we in practice choose the one which is simplest to solve. ♣
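The dual maximum can also be checked numerically, for instance with fminbnd applied to −g (a sketch; the search interval is our own choice):

g = @(v) (3*v./(1+v)).^2 + 1 + v.*((3*v./(1+v) - 3).^2 - 1);  % dual objective
[vstar, negg] = fminbnd(@(v) -g(v), 0, 100);
gstar = -negg                      % approximately 5, attained near v = 2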
Finally, we mention a theorem on convex optimization which is used in sev-
eral applications.
Figure 5.4: The objective function and the inequality constraint (a), and the
dual objective function (b), of Example 5.16.
so x ∗ is a (global) minimum.
Exercises for Chapter 5
1. In the plane consider a rectangle R with sides of length x and y and with
perimeter equal to α (so 2x + 2y = α). Determine x and y so that the area of R is
largest possible.
2. Consider the optimization problem
minimize f(x_1, x_2) subject to (x_1, x_2) ∈ C
a. f(x_1, x_2) = 1.
b. f(x_1, x_2) = x_1.
c. f(x_1, x_2) = 3x_1 + x_2.
d. f(x_1, x_2) = (x_1 − 1)² + (x_2 − 1)².
e. f(x_1, x_2) = (x_1 − 10)² + (x_2 − 8)².
3. Solve
max{ x_1 x_2 · · · x_n : ∑_{j=1}^n x_j = 1, x_j ≥ 0 }.
problem and find its Lagrangian function L. Find the stationary points of L, and
use this to solve the optimization problem.
6. Solve
minimize x_1 + x_2 subject to x_1² + x_2² = 1.
using the Lagrangian, see Theorem 5.1. Next, solve the problem by eliminating
x 2 (using the constraint).
7. Let g(x_1, x_2) = 3x_1² + 10x_1x_2 + 3x_2² − 2. Solve

min{ ‖(x_1, x_2)‖ : g(x_1, x_2) = 0 }.
8. Same question as in the previous exercise, but with g(x_1, x_2) = 5x_1² − 4x_1x_2 + 4x_2² − 6.
9. Let f be a two times differentiable function f : Rn → R. Consider the opti-
mization problem
max{ x^T Ax : ‖x‖ = 1 }

Rewrite the constraint as ‖x‖ − 1 = 0 and show that an optimal solution of this
problem must be an eigenvector of A. What can you say about the Lagrangian
multiplier?
12. Solve
min{ (1/2)(x_1² + x_2² + x_3²) : x_1 + x_2 + x_3 ≤ −6 }.
Hint: Use KKT and discuss depending on whether the constraint is active or not.
13. Solve
min{ (x_1 − 3)² + (x_2 − 5)² + x_1x_2 : 0 ≤ x_1, x_2 ≤ 1 }.
14. Solve
min{ x_1 + x_2 : x_1² + x_2² ≤ 2 }.
15. Write down the KKT conditions for the portfolio optimization problem of
Section 1.2.1.
16. Write down the KKT conditions for the optimization problem

min{ f(x₁, x₂, . . . , xₙ) : xⱼ ≥ 0 (j ≤ n), Σⱼ₌₁ⁿ xⱼ ≤ 1 }.
17. Consider the problem

min{ (x₁ − 3/2)² + x₂² : x₁ + x₂ ≤ 1, x₁ − x₂ ≤ 1, −x₁ + x₂ ≤ 1, −x₁ − x₂ ≤ 1 }.

a. Draw the region which we minimize over, and find the minimum of
f(x) = (x₁ − 3/2)² + x₂² by a direct geometric argument.
b. Write down the KKT conditions for this problem. Using a., decide which
two constraints g₁ and g₂ are active at the minimum, and verify that you
can find µ₁ ≥ 0, µ₂ ≥ 0 so that ∇f + µ₁∇g₁ + µ₂∇g₂ = 0 (as the KKT conditions
guarantee at a minimum). You are not meant to go through all possibilities
for active inequalities, only those which a. shows must be fulfilled.
18. Consider the problem

min{ −x₁x₂ : x₁² + x₂² ≤ 1 }.

Write down the KKT conditions for this problem, and find the minimum.
Chapter 6
Constrained optimization -
methods
In this final chapter we present numerical methods for solving nonlinear optimization
problems. This is a huge area, so we can only give a small taste of it here!
The algorithms we present are well-established methods.
6.1 Equality constraints

Consider the optimization problem

minimize f(x)
subject to Ax = b        (6.1)
Newton’s method may be applied to this problem. The method is very simi-
lar to the unconstrained case, but with two modifications. First, the initial point
x 0 must be chosen so that it is feasible, i.e., Ax 0 = b. Next, the search direction
d must be such that the new iterate is feasible as well. This means that Ad = 0,
so the search direction lies in the nullspace of A.
The second order Taylor approximation of f at an iterate xₖ is

T₂f(xₖ; xₖ + h) = f(xₖ) + ∇f(xₖ)ᵀh + (1/2)hᵀ∇²f(xₖ)h,
and we want to minimize this w.r.t. h subject to the constraint
A(x k + h) = b
Since Axₖ = b, this means that Ah = 0. Minimizing the quadratic approximation
subject to Ah = 0 leads to the KKT system

[ ∇²f(xₖ)  Aᵀ ] [ h ]   [ −∇f(xₖ) ]
[    A      0 ] [ λ ] = [    0     ]
where λ is the Lagrange multiplier. The Newton step is only defined when the
coefficient matrix in the KKT system is invertible. In that case, the system has
a unique solution (h, λ), and we define d_Nt = h and call this the Newton step.
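Written out in MATLAB, the Newton step can be computed from the KKT system
as in the following sketch; here df and d2f are assumed to be function handles
for the gradient and the Hessian, and x is the current feasible iterate:

m=size(A,1); n=length(x);
KKT=[d2f(x) A'; A zeros(m)];   % the coefficient matrix of the KKT system
rhs=[-df(x); zeros(m,1)];
sol=KKT\rhs;                   % solve for (h, lambda)
dNt=sol(1:n);                  % the Newton step
lambda=sol(n+1:end);           % the Lagrange multiplier

This is exactly the linear algebra used inside the function newtonbacktrackg1g2LEC
below.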
Newton’s method for solving (6.1) may now be described as follows. Again
² > 0 is a small stopping criterion.
This leads to an algorithm for Newton's method for linear equality constrained
optimization which is very similar to the function newtonbacktrack
from Exercise 4.2.12. We do not state a formal convergence theorem for this
method, but it behaves very much like Newton’s method for unconstrained op-
timization. Actually, it can be seen that the method just described corresponds
to eliminating variables based on the equations Ax = b and using the uncon-
strained Newton method for the resulting (smaller) problem. So as soon as the
iterate is “sufficiently near” an optimal solution, the convergence rate is quadratic,
and extremely few iterations are needed in this final stage.
6.2 Inequality constraints

In this section we restrict attention to convex optimization problems, but many of
the ideas are used for nonconvex problems as well.
The method we present is an interior-point method, more precisely, an interior-
point barrier method. This is an iterative method which produces a sequence of
points lying in the relative interior of the feasible set. The barrier idea is to ap-
proximate the problem by a simpler one in which constraints are replaced by a
penalty term. The purpose of this penalty term is to give large objective function
values to points near the (relative) boundary of the feasible set, which effectively
becomes a barrier against leaving the feasible set.
Consider again the convex optimization problem

minimize f(x)
subject to
Ax = b        (6.2)
gⱼ(x) ≤ 0  (j ≤ r).

The KKT conditions for this problem are

Ax = b,  gⱼ(x) ≤ 0  (j ≤ r)
ν ≥ 0,  ∇f(x) + Aᵀλ + G′(x)ᵀν = 0        (6.3)
νⱼgⱼ(x) = 0  (j ≤ r).
So, x is a minimum in (6.2) if and only if there are λ ∈ Rm and ν ∈ Rr such that
(6.3) holds.
Let us state an algorithm for Newton's method for linear equality constrained
optimization with inequality constraints. Before we do this there is one final
problem we need to address: the α we get from backtracking line search may be
such that x + αd_Nt does not satisfy the inequality constraints (in the exercises you
will be asked to verify that this is the case for a certain function). The problem
stems from the fact that the iterates xₖ + βᵐsdₖ from Armijo's rule do not necessarily
satisfy the inequality constraints. However, we can choose m large enough so
that this iterate and all succeeding ones satisfy these constraints. We can reimplement
the function armijorule to address this as follows:
function alpha=armijoruleg1g2(f,df,x,d,g1,g2)
beta=0.2; s=0.5; sigma=10^(-3);
m=0;
% first increase m until the trial point satisfies both inequality constraints
while (g1(x+beta^m*s*d)>0 || g2(x+beta^m*s*d)>0)
m=m+1;
end
% then continue increasing m until the Armijo condition holds
while (f(x)-f(x+beta^m*s*d) < -sigma*beta^m*s*(df(x))'*d)
m=m+1;
end
alpha = beta^m*s;
Here g1 and g2 are function handles which represent the inequality constraints,
and we have added a first loop, which ensures that m is so large that the inequality
constraints are satisfied. The rest of the code is as in the function armijorule.
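As a hypothetical usage example (the concrete handles below are our own choices,
matching the constraints 2 ≤ x ≤ 4 used in Example 6.3 later), the line search
could be invoked as follows:

f=@(x)(x.^2+1); df=@(x)(2*x);
g1=@(x)(2-x); g2=@(x)(x-4);
x=3; d=-df(x);                        % a descent direction at a feasible point
alpha=armijoruleg1g2(f,df,x,d,g1,g2)

Note that the first trial point x + sd = 0 violates g1 here, so it is the new loop
that moves the step back inside the feasible region.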
After this we can also modify the function newtonbacktrack from Exercise 4.2.12
in the obvious way, so that the inequality constraints are passed on to
armijoruleg1g2. For the linear equality constrained problem this gives the following function:
function [x,numit]=newtonbacktrackg1g2LEC(f,df,d2f,A,b,x0,g1,g2)
epsilon=10^(-3);
x=x0;
maxit=100;
for numit=1:maxit
% solve the KKT system for the Newton step
matr=[d2f(x) A'; A zeros(size(A,1))];
vect=[-df(x); zeros(size(A,1),1)];
solvedvals=matr\vect;
d=solvedvals(1:size(A,2));
% compute the quantity used in the stopping criterion
eta=d'*d2f(x)*d;
if eta^2/2<epsilon
break;
end
% line search which also respects the inequality constraints
alpha=armijoruleg1g2(f,df,x,d,g1,g2);
x=x+alpha*d;
end
Both these functions work in all cases where there are exactly two inequality
constraints.
The interior-point barrier method is based on an approximation of problem
(6.2) by the barrier problem

minimize f(x) + µφ(x)
subject to Ax = b        (6.4)

where

φ(x) = − Σⱼ₌₁ʳ ln(−gⱼ(x))

and µ > 0 is a parameter (in R). The function φ is called the (logarithmic) barrier
function, and its domain is the relative interior of the feasible set:

F° = { x : Ax = b, gⱼ(x) < 0 (j ≤ r) }.
The same set F° is the feasible set of the barrier problem. The key properties of
the barrier function are that it is smooth on F°, with gradient and Hessian

∇φ(x) = Σⱼ₌₁ʳ ( 1/(−gⱼ(x)) ) ∇gⱼ(x)        (6.5)

∇²φ(x) = Σⱼ₌₁ʳ [ ( 1/gⱼ(x)² ) ∇gⱼ(x)∇gⱼ(x)ᵀ + ( 1/(−gⱼ(x)) ) ∇²gⱼ(x) ]        (6.6)

and that it is convex on F°: for every h we have hᵀ∇²φ(x)h ≥ 0, since each
(∇gⱼ(x)ᵀh)² ≥ 0, since 1/(−gⱼ(x)) > 0, and since hᵀ∇²gⱼ(x)h ≥ 0 (since all gⱼ are
convex, ∇²gⱼ(x) is positive semidefinite).
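For the case of two inequality constraints, (6.5) and (6.6) can be coded directly.
The following sketch is only an illustration; dg1, dg2 are assumed to be handles
returning the gradients as column vectors, and d2g1, d2g2 the Hessian matrices.
These are exactly the expressions inserted into the interior-point code later in
this section:

% gradient and Hessian of phi(x) = -log(-g1(x)) - log(-g2(x))
dphi=@(x)(-dg1(x)/g1(x) - dg2(x)/g2(x));
d2phi=@(x)(dg1(x)*dg1(x)'/(g1(x)^2) + dg2(x)*dg2(x)'/(g2(x)^2) ...
           - d2g1(x)/g1(x) - d2g2(x)/g2(x));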
The idea here is that for points x near the boundary of F the value of φ(x) is very
large. So, an iterative method which moves around in the interior F ◦ of F will
typically avoid points near the boundary as the logarithmic penalty term makes
the function value f (x) + µφ(x) very large.
The interior point method consists in solving the barrier problem, using
Newton’s method, for a sequence {µk } of (positive) barrier parameters; these
are called the outer iterations. The solution x k found for µ = µk is used as the
starting point in Newton’s method in the next outer iteration where µ = µk+1 .
77
The sequence {µₖ} is chosen such that µₖ → 0. When µ is very small, the barrier
function approximates the "ideal" penalty function η(x) which is zero on the feasible
set and +∞ when one of the inequalities gⱼ(x) ≤ 0 is violated.
A natural question is why one bothers to solve the barrier problems for more
than one single value of µ, typically a very small one. The reason is that it would be
hard to find a good starting point for Newton's method in that case; the Hessian
matrix of µφ is typically ill-conditioned for small µ.
Assume now that the barrier problem has a unique optimal solution x(µ);
this is true under reasonable assumptions that we shall return to. The point
x(µ) is called a central point. Assume also that Newton’s method may be applied
to solve the barrier problem. The set of points x(µ) for µ > 0 is called the central
path; it is a path (or curve) as we know it from multivariate calculus. In order to
investigate the central path we prefer to work with the problem, equivalent to
(6.4), obtained by multiplying the objective function by 1/µ. The central point x(µ)
is then characterized by the conditions

Ax(µ) = b,  gⱼ(x(µ)) < 0  (j ≤ r),

together with stationarity, i.e., that there is a λ ∈ Rᵐ such that

(1/µ)∇f(x(µ)) + Σⱼ₌₁ʳ ( 1/(−gⱼ(x(µ))) ) ∇gⱼ(x(µ)) + Aᵀλ = 0.        (6.8)
A fundamental question is: how far from being optimal is the central point x(µ)?
We now show that duality provides a very elegant way of answering this ques-
tion.
Theorem 6.1. For each µ > 0 the central point x(µ) satisfies
f ∗ ≤ f (x(µ)) ≤ f ∗ + r µ.
Proof: Define ν(µ) = (ν₁(µ), . . . , νᵣ(µ)) ∈ Rʳ and λ(µ) ∈ Rᵐ by

νⱼ(µ) = −µ/gⱼ(x(µ))  (j ≤ r),   λ(µ) = µλ,        (6.9)

where λ and x(µ) satisfy Equation (6.8). We want to show that the pair (λ(µ), ν(µ))
is feasible in the dual problem to (6.2), see Section 5.3. So there are two prop-
erties to verify, that ν(µ) is nonnegative and that x(µ) minimizes the Lagrangian
function for the given (λ(µ), ν(µ)). The first property is immediate: as g j (x(µ)) <
0 and µ > 0, we get ν j (µ) = −µ/g j (x(µ)) > 0 for each j . Concerning the second
property, note first that the Lagrangian function L(x, λ, ν) = f(x) + λᵀ(Ax − b) +
νᵀG(x) is convex in x for given λ and ν ≥ 0. Thus, x minimizes this function if
and only if ∇ₓL = 0. Now,

∇ₓL(x(µ), λ(µ), ν(µ)) = ∇f(x(µ)) + Aᵀλ(µ) + Σⱼ₌₁ʳ νⱼ(µ)∇gⱼ(x(µ))
  = µ ( (1/µ)∇f(x(µ)) + Aᵀλ + Σⱼ₌₁ʳ ( 1/(−gⱼ(x(µ))) ) ∇gⱼ(x(µ)) ) = 0
by (6.8) and the definition of the dual variables (6.9). This shows that (λ(µ), ν(µ))
is feasible in the dual problem.
By weak duality and Lemma 5.14, we therefore obtain

f∗ ≥ g(λ(µ), ν(µ)) = L(x(µ), λ(µ), ν(µ))
   = f(x(µ)) + λ(µ)ᵀ(Ax(µ) − b) + Σⱼ₌₁ʳ νⱼ(µ)gⱼ(x(µ))
   = f(x(µ)) − rµ,

since Ax(µ) = b and νⱼ(µ)gⱼ(x(µ)) = −µ for each j. Thus f(x(µ)) ≤ f∗ + rµ, while
f∗ ≤ f(x(µ)) holds since x(µ) is feasible in (6.2).
Corollary 6.2. The central points x(µ) satisfy

limµ→0 f(x(µ)) = f∗.

In particular, if f is continuous and limµ→0 x(µ) = x∗ for some x∗, then x∗ is a
global minimum in (6.2).
Proof: This follows from Theorem 6.1 by letting µ → 0. The second part follows
from

f(x∗) = f(limµ→0 x(µ)) = limµ→0 f(x(µ)) = f∗

by the first part and the continuity of f; moreover x∗ must be a feasible point by
elementary topology.
After these considerations we may now present the interior-point barrier
method. It uses a tolerance ε > 0 in its stopping criterion.
This leads to the following algorithm for the interior-point barrier method
for the case of linear equality constraints and two inequality constraints:
function xopt=IPBopt(f,g1,g2,df,dg1,dg2,d2f,d2g1,d2g2,A,b,x0)
xopt=x0;
mu=1;
alpha=0.1;
r=2;
epsilon=10^(-3);
numitouter=0;
while (r*mu>epsilon)
% solve the barrier problem for the current mu, starting from the
% previous solution; the three handles are f+mu*phi, its gradient
% and its Hessian, cf. Equations (6.5) and (6.6)
[xopt,numit]=newtonbacktrackg1g2LEC(...
@(x)(f(x)-mu*log(-g1(x))-mu*log(-g2(x))),...
@(x)(df(x) - mu*dg1(x)/g1(x) - mu*dg2(x)/g2(x)),...
@(x)(d2f(x) + mu*dg1(x)*dg1(x)'/(g1(x)^2) ...
+ mu*dg2(x)*dg2(x)'/(g2(x)^2) - mu*d2g1(x)/g1(x)...
- mu*d2g2(x)/g2(x) ),A,b,xopt,g1,g2);
mu=alpha*mu;
numitouter=numitouter+1;
fprintf('Iteration %i:',numitouter);
fprintf('(%f,%f)\n',xopt,f(xopt));
end
Note that we here have inserted the expressions from Equation (6.5) and Equation
(6.6) for the gradient and the Hessian matrix of the barrier function. The inputs
are f, g₁, g₂, their gradients and their Hessian matrices, the matrix A, the vector b,
and an initial feasible point x₀. The function calls newtonbacktrackg1g2LEC,
and returns the optimal solution x∗. It also gives some information on the values
of f during the iterations. The iterations used in Newton's method are called
the inner iterations. There are different implementation details here that we do
not discuss very much. A typical value for α is 0.1. The choice of the initial µ₀
can be difficult: if it is chosen too large, one may experience many outer iterations.
Another issue is how accurately one solves (6.4). It may be sufficient to
find a near-optimal solution here, as this saves inner iterations. For this reason
the method is also called a path-following method; it follows in a neighborhood
of the central path.
Finally, it should be mentioned that there exists a variant of the interior-
point barrier method which permits an infeasible starting point. For more de-
tails on this and various implementation issues one may consult [2] or [11].
Example 6.3. Consider the function f(x) = x² + 1, 2 ≤ x ≤ 4. Minimizing f can
be considered as the problem of finding a minimum subject to the constraints
g₁(x) = 2 − x ≤ 0, and g₂(x) = x − 4 ≤ 0. The barrier problem is to minimize the
function

f(x) + µφ(x) = x² + 1 − µ ln(x − 2) − µ ln(4 − x).
Some of these are drawn in Figure 6.1, where we clearly can see the effect of de-
creasing µ in the barrier function: The function converges to f pointwise, except
at the boundaries. It is easy to see that x = 2 is the minimum of f under the given
constraints, and that f(2) = 5 is the minimum value. There are no equality
constraints in this case, so that we can use the barrier method with Newton's method
for unconstrained optimization, as implemented in Exercise 4.2.12. We
need, however, to make sure also here that the iterates from Armijo’s rule satisfy
the inequality constraints. In fact, in the exercises you will be asked to verify
that, for the function f considered here, some of the iterates from Armijo’s rule
do not satisfy the constraints.
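The following short script, a minimal sketch (not part of the example itself),
generates plots like those in Figure 6.1 by graphing f and the barrier objective
for a few values of µ:

f=@(x)(x.^2+1);
xs=linspace(2.01,3.99,400);
plot(xs,f(xs)); hold on;                       % the objective itself
for mu=[0.2 0.5 1]
    plot(xs,f(xs)-mu*log(xs-2)-mu*log(4-xs));  % the barrier objective
end
hold off;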
It is straightforward to implement a function newtonbacktrackg1g2 which
implements Newton's method for two inequality constraints and no equality
constraints.
Figure 6.1: The function from Example 6.3 and some of its barrier functions: (a) f(x); (b) the barrier problem with µ = 0.2; (c) the barrier problem with µ = 0.5; (d) the barrier problem with µ = 1.

Using newtonbacktrackg1g2 we can modify IPBopt in the obvious way to a
function IPBopt2 for two inequality constraints and no equality constraints:
function xopt=IPBopt2(f,g1,g2,df,dg1,dg2,d2f,d2g1,d2g2,x0)
xopt=x0;
mu=1; alpha=0.1; r=2; epsilon=10^(-3);
numitouter=0;
while (r*mu>epsilon)
[xopt,numit]=newtonbacktrackg1g2(...
@(x)(f(x)-mu*log(-g1(x))-mu*log(-g2(x))),...
@(x)(df(x) - mu*dg1(x)/g1(x) - mu*dg2(x)/g2(x)),...
@(x)(d2f(x) + mu*dg1(x)*dg1(x)'/(g1(x)^2) ...
+ mu*dg2(x)*dg2(x)'/(g2(x)^2) ...
- mu*d2g1(x)/g1(x) - mu*d2g2(x)/g2(x) ),xopt,g1,g2);
mu=alpha*mu;
numitouter=numitouter+1;
fprintf('Iteration %i:',numitouter);
fprintf('(%f,%f)\n',xopt,f(xopt));
end
Note that this function also prints a summary for each of the outer iterations,
so that we can see the progress of the barrier method. We can now find the
minimum of f with the following code, where we have inserted MATLAB
functions for f, gᵢ, their gradients, and their Hessian matrices.
IPBopt2(@(x)(x.^2+1),@(x)(2-x),@(x)(x-4),...
@(x)(2*x),@(x)(-1),@(x)(1),...
@(x)(2),@(x)(0),@(x)(0),3)
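We can also illustrate the bound of Theorem 6.1 directly for this example. The
sketch below is only a check of our own: it computes the central point x(µ) by
solving the stationarity condition 2x − µ/(x − 2) + µ/(4 − x) = 0 with fzero (the
bracket assumes the root lies strictly inside (2, 4), which holds since the left-hand
side is increasing and changes sign there), and compares f(x(µ)) − f∗ with rµ:

fstar=5; r=2;
for mu=[1 0.1 0.01]
    % stationarity condition for f(x) + mu*phi(x) on (2,4)
    h=@(x)(2*x - mu./(x-2) + mu./(4-x));
    xmu=fzero(h,[2+1e-9,4-1e-9]);
    fprintf('mu=%g: f(x(mu))-fstar = %g <= r*mu = %g\n', mu, xmu^2+1-fstar, r*mu);
end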
As another example, consider the problem of minimizing f(x₁, x₂) = x₁² + x₂²
subject to the constraint g₁(x₁, x₂) = 2 − x₁ − x₂ ≤ 0. The KKT conditions say that

(2x₁, 2x₂) + ν₁(−1, −1) = 0

for a ν₁ ≥ 0, where the last term is included only if x₁ + x₂ = 2 (i.e. when the
constraint is active). If the constraint is not active we see that x₁ = x₂ = 0, which
does not satisfy the inequality constraint. If the constraint is active we see that
x₁ = x₂ = ν₁/2, so that x₁ = x₂ = 1 and ν₁ = 2 ≥ 0 in order for x₁ + x₂ = 2. The
minimum value is thus f(1, 1) = 2. It is clear that this must be a minimum: since
f is bounded below and approaches ∞ when either x₁ or x₂ grows large, it must
have a minimum (f has no global maximum). Alternatively, one can argue that
the Hessian of the Lagrangian for the constrained problem is positive
definite. All points are regular for this problem since ∇g₁ ≠ 0.
Let us also see if we can arrive at this same solution by solving the barrier
problem. The barrier function is φ(x₁, x₂) = −ln(x₁ + x₂ − 2), which has gradient
∇φ = (−1/(x₁ + x₂ − 2), −1/(x₁ + x₂ − 2)). We set the gradient of f(x₁, x₂) + µφ(x₁, x₂)
to 0 and get

2x₁ − µ/(x₁ + x₂ − 2) = 0,   2x₂ − µ/(x₁ + x₂ − 2) = 0.

It follows that x₁ = x₂, so that 4x₁² − 4x₁ − µ = 0, which gives x₁ = x₂ =
(1 + √(1 + µ))/2 (the other root is infeasible). As µ → 0 this converges to (1, 1),
the solution we found above.

Exercises for Chapter 6
a. Find A, b, and functions g 1 , g 2 so that the problem takes the same form
as in Equation (6.2).
e. State the KKT conditions for finding the minimum, and solve these.
f. Show that the central path converges to the same solution as you
found in d. and e.
3. Use the function IPBopt to verify the solution you found in Exercise 2. Ini-
tially you must compute a feasible starting point x 0 .
4. State the KKT conditions for finding the minimum for the constrained problem
of Example 6.3, and solve these. Verify that you get the same solution as in
Example 6.3.
5. In the function IPBopt2, replace the call to the function newtonbacktrackg1g2
with a call to the function newtonbacktrack, with the obvious modification to
the parameters. Verify that the code does not return the expected minimum in
this case.
6. Consider the function f(x) = (x − 3)², with the same constraints 2 ≤ x ≤ 4 as
in Example 6.3. Verify in this case that the function IPBopt2 returns the correct
minimum regardless of whether you call newtonbacktrackg1g2 or newtonbacktrack.
This shows that, at least in some cases where the minimum is an interior point,
the iterates from Newton's method satisfy the inequality constraints as well.
7. (Trial Exam UIO V2012) In this exercise we will find the minimum of the
function f (x, y) = 3x + 2y under the constraints x + y = 1 and x, y ≥ 0.
b. State the KKT-conditions for this problem, and find the minimum by
solving these.
c. Write down the barrier function φ(x, y) = − ln(−g 1 (x, y))−ln(−g 2 (x, y))
for this problem, where g 1 and g 2 represent the two constraints of the
problem. Also compute ∇φ.
d. Solve the barrier problem with parameter µ, and denote the solution
by x(µ). Is it the case that the limit limµ→0 x(µ) equals the solution you
found in b.?
Solutions
Chapter 1
3. You can argue in many ways here: for instance, the derivative of f(x)² is
2f(x)f′(x), so that extremal points of f are also extremal points of f².
The gradient of this function is 2αCx − µ, where µ is the vector with µ in all
entries. Lagrange multipliers thus gives that 2αCx − µ = λ, where λ is the vector
with λ in all entries. This gives that xᵢ = (µ + λ)/(2αcᵢ). Since Σᵢ xᵢ = 1 we must
have that Σᵢ (µ + λ)/(2αcᵢ) = 1, so that λ = −µ + 2α/( Σᵢ (1/cᵢ) ).
f(x₁, x₂) = αc₁₁x₁² + αc₂₂x₂² + α(c₁₂ + c₂₁)x₁x₂ − µx₁ − µx₂
  = αc₁₁x₁² + αc₂₂(1 − x₁)² + α(c₁₂ + c₂₁)x₁(1 − x₁) − µx₁ − µ(1 − x₁)
  = α(c₁₁ + c₂₂ − c₁₂ − c₂₁)x₁² + α(−2c₂₂ + c₁₂ + c₂₁)x₁ + αc₂₂ − µ
9.a. We have that f(x) = Σᵢ qᵢxᵢ, so that ∂f/∂xᵢ = qᵢ, and thus ∇f(x) = q. Clearly
∂²f/∂xᵢ∂xⱼ = 0, so that ∇²f(x) = 0.
When A is symmetric we can write

f(x) = (1/2) Σᵢ,ⱼ xᵢAᵢⱼxⱼ = (1/2) Σᵢ Aᵢᵢxᵢ² + (1/2) Σᵢ,ⱼ,ᵢ≠ⱼ xᵢAᵢⱼxⱼ,

so that

∂f/∂xᵢ = Aᵢᵢxᵢ + (1/2) Σⱼ≠ᵢ xⱼ(Aᵢⱼ + Aⱼᵢ) = (1/2) Σⱼ xⱼ 2Aᵢⱼ = Σⱼ Aᵢⱼxⱼ = (Ax)ᵢ,

and

∂²f/∂xⱼ∂xᵢ = ∂/∂xⱼ ( Σⱼ Aᵢⱼxⱼ ) = Aᵢⱼ,

so that ∇²f = A.
For a general (not necessarily symmetric) A we get in the same way

f(x) = (1/2) Σᵢ,ⱼ xᵢAᵢⱼxⱼ = (1/2) Σᵢ Aᵢᵢxᵢ² + (1/2) Σᵢ,ⱼ,ᵢ≠ⱼ xᵢAᵢⱼxⱼ,

∂f/∂xᵢ = Aᵢᵢxᵢ + (1/2) Σⱼ≠ᵢ xⱼ(Aᵢⱼ + Aⱼᵢ) = (1/2) Σⱼ xⱼ(Aᵢⱼ + Aⱼᵢ)
  = (1/2) Σⱼ (Aᵢⱼ + (Aᵀ)ᵢⱼ)xⱼ = ( (1/2)(A + Aᵀ)x )ᵢ,

∂²f/∂xⱼ∂xᵢ = ∂/∂xⱼ ( (1/2) Σⱼ (Aᵢⱼ + (Aᵀ)ᵢⱼ)xⱼ ) = (1/2)(Aᵢⱼ + (Aᵀ)ᵢⱼ),

so that ∇²f = (1/2)(A + Aᵀ).
10. First note that f(0, 0) = 3, and that f(2, 1) = 8. We have that ∇f = (2x₁ +
3x₂, 3x₁ − 10x₂), and that ∇f(0, 0) = (0, 0), and ∇f(2, 1) = (7, −4). The first order
Taylor approximation at (0, 0) is thus f(0, 0) + ∇f(0, 0) · (x₁, x₂) = 3.
Chapter 2
2. We have that h′(x) = f′(x)g(x) + f(x)g′(x), and h″(x) = f″(x)g(x) + f(x)g″(x) +
2f′(x)g′(x). Since f and g are convex we have that f″(x) ≥ 0 and g″(x) ≥ 0.
Since the functions are increasing we have that f′(x) ≥ 0 and g′(x) ≥ 0. Since the
functions also are positive, we see that all three terms in the sum are ≥ 0, so that
h″(x) ≥ 0, and it follows that h also is convex.
8. Write B in row echelon form to see which variables are pivot variables. Express these
variables in terms of the free variables, and replace the pivot variables in all the
equations. Ax ≥ b then takes the form Cx ≥ b (where x now is a shorter vector),
and this can be written as −Cx ≤ −b, which is of the required form with H = −C,
h = −b. Note that this strategy also rewrites the vector c to a shorter vector.
9. Let y = Σⱼ₌₁ᵗ λⱼxⱼ and z = Σⱼ₌₁ᵗ µⱼxⱼ, where all λⱼ, µⱼ ≥ 0, Σⱼ₌₁ᵗ λⱼ = 1, and
Σⱼ₌₁ᵗ µⱼ = 1. For any 0 ≤ λ ≤ 1 we have that

(1 − λ)y + λz = (1 − λ) Σⱼ₌₁ᵗ λⱼxⱼ + λ Σⱼ₌₁ᵗ µⱼxⱼ = Σⱼ₌₁ᵗ ((1 − λ)λⱼ + λµⱼ)xⱼ.

Since the coefficients (1 − λ)λⱼ + λµⱼ are nonnegative and sum to 1, this is again
a convex combination of the xⱼ.
11.b. min{f, g} may be neither convex nor concave; consider the functions f(x) =
x², g(x) = (x − 1)².
11.c. |f| may be neither convex nor concave; consider the function f(x) = x² − 1.
Chapter 3
1. F(x) = 0 is equivalent to ‖F(x)‖² = Σᵢ Fᵢ(x)² = 0, where the Fᵢ are the
component functions of F.
2. Here we construct the function f(x) = T(x) − x = x/2 − 3x³/2, which has
derivative f′(x) = 1/2 − 9x²/2. We can then run Newton's method as follows:
newtonmult(sqrt(5/3),@(x)(0.5*x-1.5*x^3),@(x)(0.5-4.5*x^2))
4.a. The function x ↦ ‖Ax‖ is continuous, and any continuous function attains
its supremum on a closed and bounded set (here the set where ‖x‖ = 1).
4.b. For n = 2, it is clear that the sublevel set is the square with corners (1, 0),
(−1, 0), (0, 1), (0, −1).
4.c. The function f(x) = ‖Ax‖ is the composition of a convex function and an
affine function, so that it must be convex. If x ∈ Rⁿ and ‖x‖₁ = 1, we can write
x = Σᵢ₌₁ⁿ λᵢvᵢ, where 0 ≤ λᵢ ≤ 1, Σᵢ₌₁ⁿ λᵢ = 1, and vᵢ = ±eᵢ (the sign of xᵢ being
absorbed into vᵢ).
5. We are asked to find for which A we have that ‖Ax‖₁ < ‖x‖₁ for any x. From
the previous exercise we know that this happens if and only if ‖A‖ < 1, i.e. when
Σᵢ₌₁ⁿ |aᵢₖ| < 1 for all k.
newtonmult(x0,...
@(x)([x(1)^2-x(1)/x(2)^3+cos(x(1))-1; 5*x(1)^4+2*x(1)^3-tan(x(1)*x(2)^8)-3]),...
@(x)([2*x(1)-1/x(2)^3-sin(x(1)) 3*x(1)/x(2)^4; ...
20*x(1)^3+6*x(1)^2-x(2)^8/(cos(x(1)*x(2)^8))^2 -8*x(1)*x(2)^7/(cos(x(1)*x(2)^8))^2]))
Chapter 4
4. The gradient of f is ∇f = (4 + 2x₁, 6 + 4x₂), and the Hessian matrix is

∇²f = [ 2  0
        0  4 ],

which is positive definite. The only stationary point is (−2, −3/2), which is a minimum.
The gradient of g is ∇g = (4 + 2x₁, 6 − 4x₂), and the Hessian matrix is

∇²g = [ 2   0
        0  −4 ],

which is indefinite. The only stationary point is (−2, 3/2), which must be a saddle point.
6. The gradient is ∇f = (−400x₁(x₂ − x₁²) − 2(1 − x₁), 200(x₂ − x₁²)). The Hessian
matrix is

∇²f = [ 1200x₁² − 400x₂ + 2   −400x₁
        −400x₁                  200 ].

Clearly the only stationary point is x = (1, 1), and we get that

∇²f(1, 1) = [  802  −400
              −400   200 ].
For f(x) = (1/2)xᵀAx − bᵀx and the steepest descent iteration

xₖ₊₁ = xₖ − αₖ∇f(xₖ),

we get

f(xₖ₊₁) = (1/2)∇f(xₖ)ᵀA∇f(xₖ) αₖ²
  + ( bᵀ∇f(xₖ) − (1/2)( xₖᵀA∇f(xₖ) + ∇f(xₖ)ᵀAxₖ ) ) αₖ
  + (1/2)xₖᵀAxₖ − bᵀxₖ.
Now, since we claim that ∇ f (x k ) is an eigenvector, and that A is symmetric, we
get that A∇ f (x k ) = λ∇ f (x k ) and ∇ f (x k )T A = λ∇ f (x k )T , where λ is the corre-
sponding eigenvalue. This means that the above can be written
f(xₖ₊₁) = (1/2)λ‖∇f(xₖ)‖² αₖ² + ( bᵀ∇f(xₖ) − xₖᵀA∇f(xₖ) ) αₖ + (1/2)xₖᵀAxₖ − bᵀxₖ.
If we take the derivative of this w.r.t. αₖ and set it to 0 we get

αₖ = −( bᵀ∇f(xₖ) − xₖᵀA∇f(xₖ) ) / ( λ‖∇f(xₖ)‖² ) = ( xₖᵀA − bᵀ )∇f(xₖ) / ( λ‖∇f(xₖ)‖² )
   = (Axₖ − b)ᵀ∇f(xₖ) / ( λ‖∇f(xₖ)‖² ) = ∇f(xₖ)ᵀ∇f(xₖ) / ( λ‖∇f(xₖ)‖² ) = 1/λ.
This means that αₖ = 1/λ is the step size we should use when we perform exact
line search. We now compute that

∇f(xₖ₊₁) = Axₖ₊₁ − b = A( xₖ − (1/λ)∇f(xₖ) ) − b
  = Axₖ − (1/λ)A∇f(xₖ) − b = Axₖ − ∇f(xₖ) − b
  = ∇f(xₖ) − ∇f(xₖ) = 0,

so that xₖ₊₁ is a stationary point: the minimum is found in a single step.
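This one-step convergence is easy to check numerically; in the following sketch
(an illustration of our own, with the matrix, the vector and the starting point
chosen so that ∇f(xₖ) is an eigenvector of A):

A=[2 0; 0 4]; b=[1; 1];        % f(x) = (1/2)x'Ax - b'x
xk=A\b+[1;0];                  % grad f(xk) = A*xk - b = [2;0], eigenvector with eigenvalue 2
lambda=2;
xnext=xk-(1/lambda)*(A*xk-b);  % exact line search step alpha = 1/lambda
disp(A*xnext-b)                % the zero vector: the minimum is reached in one step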
8.b. If ∇²f(x) is positive definite, its eigenvalues are positive, so that the determinant
is positive and the matrix is invertible. The formula h = −(∇²f(xₖ))⁻¹∇f(xₖ)
follows after multiplying with the inverse.
9. Here we have said nothing about the step length, but we can implement this
as in the function newtonbacktrack as follows:
function [xopt,numit]=steepestdescent(f,df,x0)
epsilon=10^(-3);
xopt=x0;
maxit=100;
for numit=1:maxit
d=-df(xopt);                      % steepest descent direction
eta=-df(xopt)'*d;                 % equals norm(df(xopt))^2
if eta/2<epsilon                  % stopping criterion
break;
end
alpha=armijorule(f,df,xopt,d);    % backtracking line search
xopt=xopt+alpha*d;
end
The algorithm can be tested on the first function from Exercise 4 as follows:
f=@(x)(4*x(1)+6*x(2)+x(1)^2+2*x(2)^2);
df=@(x)([4+2*x(1);6+4*x(2)]);
steepestdescent(f,df,[-1;-1])
function alpha=armijorule(f,df,x,d)
beta=0.2; s=0.5; sigma=10^(-3);
m=0;
while (f(x)-f(x+beta^m*s*d) < -sigma*beta^m*s*(df(x))'*d)
m=m+1;
end
alpha = beta^m*s;
Chapter 5
3. This is the same as finding the minimum of f(x₁, . . . , xₙ) = −x₁x₂ ··· xₙ. This
boils down to the equations −Πᵢ≠ⱼ xᵢ = λ (one for each j), since clearly the minimum
is not attained when there are any active constraints. This implies that x₁ = . . . = xₙ, so
that all xᵢ = 1/n. It is better to give a direct argument here that this must be a
minimum, than to attempt to analyse the second order conditions for a minimum.
6. We rewrite the constraint as g₁(x₁, x₂) = x₁² + x₂² − 1 = 0, and get that ∇g₁(x₁, x₂) =
(2x₁, 2x₂). Clearly all points are regular, since ∇g₁(x₁, x₂) ≠ 0 whenever g₁(x₁, x₂) =
0. Since ∇f = (1, 1) we get that the gradient of the Lagrangian is

(1, 1) + λ(2x₁, 2x₂) = (0, 0),

which gives that x₁ = x₂. This gives us the two possible feasible points (1/√2, 1/√2)
and (−1/√2, −1/√2). For the first we see that λ = −1/√2, for the second we
see that λ = 1/√2. The Hessian of the Lagrangian is

λ [ 2  0
    0  2 ].

For the point (1/√2, 1/√2) this is negative definite since λ is negative; for the point
(−1/√2, −1/√2) it is positive definite since λ is positive. From the second order
conditions it follows that the minimum is attained in (−1/√2, −1/√2).
If we instead eliminate x₂ we must write x₂ = −√(1 − x₁²) (since the positive
square root gives a bigger value for f), so that we must minimize f(x) = x −
√(1 − x²) subject to the constraint −1 ≤ x ≤ 1. The derivative of this is 1 + x/√(1 − x²),
which is zero when x = −1/√2, which we found above. We also could have found
this by considering the two inequality constraints −x − 1 ≤ 0 and x − 1 ≤ 0.
If the first one of these is active (i.e. x = −1), the KKT conditions say that
f′(−1) > 0. However, this is not the case since f′(x) → −∞ when x → −1⁺. If
the second constraint is active (i.e. x = 1), the KKT conditions say that f′(1) < 0.
This is not the case since f′(x) → ∞ when x → 1⁻. When we have no active
constraint, the problem boils down to setting the derivative to zero, in which
case we get the solution we already have found.
We have that ∇f(x) = −Ax, and ∇g₁(x) = 2x/(2‖x‖) = x/‖x‖. Clearly all points are
regular, and we get that

∇f + λ∇g₁ = −Ax + λx/‖x‖ = 0.
Since we require that ‖x‖ = 1 we get that Ax = λx. In other words, the optimal
point x is an eigenvector of A, and the Lagrange multiplier is the corresponding
eigenvalue.
13. Here the constraints are

g₁(x₁, x₂) = −x₁ ≤ 0
g₂(x₁, x₂) = −x₂ ≤ 0
g₃(x₁, x₂) = x₁ − 1 ≤ 0
g₄(x₁, x₂) = x₂ − 1 ≤ 0.

If no constraint is active, setting ∇f = (2x₁ + x₂ − 6, x₁ + 2x₂ − 10) = (0, 0) gives

2x₁ + x₂ = 6,  x₁ + 2x₂ = 10,

which gives that x₁ = 2/3 and x₂ = 14/3. This point does not satisfy the constraints,
however.
Assume now that we have one active constraint. We have four possibilities
in this case (and any solution will be regular). If the first constraint is active the
KKT conditions say that

(2x₁ + x₂ − 6, x₁ + 2x₂ − 10) + µ₁(−1, 0) = (0, 0).
If the fourth constraint is active we get 2x 1 − 5 = 0, which does not satisfy the
constraint.
Assume that we have two active constraints. Also here there are four possi-
bilities (and any solution will be regular):
x₁ = x₂ = 0: The KKT conditions say that (−6, −10) + (−µ₁, −µ₂) = (0, 0), which is
impossible since µ₁, µ₂ are nonnegative.
x₁ = 0, x₂ = 1: The KKT conditions say that (−5, −8) + (−µ₁, µ₄) = (0, 0), which also is
impossible.
x₁ = 1, x₂ = 0: The KKT conditions say that (−4, −9) + (µ₃, −µ₂) = (0, 0), which also is
impossible.
x₁ = x₂ = 1: The KKT conditions say that (−3, −7) + (µ₃, µ₄) = (0, 0), which has a
solution with µ₃, µ₄ ≥ 0.
Clearly it is not possible to have more than two active constraints. The minimum
point is therefore (1, 1).
16. We have that ∇gⱼ = −eⱼ for 1 ≤ j ≤ n, and ∇gₙ₊₁ = (1, 1, . . . , 1). If there are no
active inequalities, we must have that ∇f(x) = 0. If the last constraint is not active we
have that

∇f = Σ_{j∈A(x), j≤n} µⱼeⱼ,

i.e. ∇f points into the cone spanned by the eⱼ, j ∈ A(x). If the last constraint is
active also, we see that

∇f = −µₙ₊₁ Σ_{j∉A(x), j≤n} eⱼ + Σ_{j∈A(x), j≤n} (µⱼ − µₙ₊₁)eⱼ.

∇f is of this form whenever the components outside the active set are equal and
≤ 0, and all the other components are greater than or equal to this common value.
Chapter 6
1. The constraint Ax = b actually yields one constraint per row in A, and the
gradient of the i'th constraint is the i'th row in A. This gives the following sum
in the KKT conditions:

Σᵢ₌₁ᵐ λᵢ∇gᵢ = Σᵢ₌₁ᵐ λᵢaᵢ·ᵀ = Σᵢ₌₁ᵐ λᵢ(Aᵀ)·ᵢ = Aᵀλ.

Writing out the KKT system for the Newton step we get

∇²f(xₖ)h + Aᵀλ = −∇f(xₖ)
Ah + 0λ = 0.
2.e. If no inequality constraint is active, the KKT conditions say that

∇f + Aᵀλ = (1, 1) + λ(−1, 1) = (0, 0),

which has no solution. We consider the cases where inequality constraints are active:
• The first inequality is active (i.e. x = 0): The equality constraint then gives
y = 1, and the KKT conditions become (1, 1) + λ(−1, 1) + ν₁(−1, 0) = (0, 0),
which gives λ = −1 and ν₁ = 2 ≥ 0, so (0, 1) satisfies the KKT conditions.
• The second inequality is active (i.e. y = 0): The equality constraint then gives
that x = −1, which does not give a feasible point.
In conclusion, (0, 1) is the only point which satisfies the KKT conditions. If we
attempt the second order test, we will see that it is inconclusive, since the Hessian
of the Lagrangian is zero. To prove that (0, 1) must be a minimum, you can
argue that f is very large outside a sufficiently big rectangle, so that it must have a
minimum on this rectangle (the rectangle is a closed and bounded set).
2.f. With the barrier method we obtained the solution x(µ) = (µ/2, µ/2 + 1).
Since this converges to (0, 1) as µ → 0, the central path converges to the solution
we have found.
3. A feasible starting point is x₀ = (4, 5) (it satisfies −x₁ + x₂ = 1 and x₁, x₂ > 0),
and we can make the call

IPBopt(@(x)(x(1)+x(2)),@(x)(-x(1)),@(x)(-x(2)),...
@(x)([1;1]),@(x)([-1;0]),@(x)([0;-1]),...
@(x)(zeros(2)),@(x)(zeros(2)),@(x)(zeros(2)),...
[-1 1],1,[4;5])
6. We can make the call

IPBopt2(@(x)((x-3).^2),@(x)(2-x),@(x)(x-4),...
@(x)(2*(x-3)),@(x)(-1),@(x)(1),...
@(x)(2),@(x)(0),@(x)(0),3.5)
7.a. We can set A = (1 1), and b = 1.
7.b. We set g₁(x, y) = −x ≤ 0 and g₂(x, y) = −y ≤ 0, and have that ∇f = (3, 2),
∇g₁ = (−1, 0), ∇g₂ = (0, −1). The KKT conditions therefore take the form x + y = 1
and

(3, 2) + λ(1, 1) + ν₁(−1, 0) + ν₂(0, −1) = (0, 0),

where the two last terms are included only if the corresponding inequalities are
active, and where ν₁, ν₂ ≥ 0.
If none of the inequalities is active we get that (3, 2) + λ(1, 1) = (0, 0), which has no
solution.
If both inequalities are active we get that x = y = 0, which does not fulfill the
constraint x + y = 1. If we have only one active inequality we have two possibil-
ities: If the first inequality is active we get that (3, 2) + λ(1, 1) + ν1 (−1, 0) = 0. The
equation for the second component says that λ = −2, and the equation for the
first component says that 3 − 2 − ν1 = 0, so that ν1 = 1.
If the second inequality is active we get that (3, 2) + λ(1, 1) + ν₂(0, −1) = (0, 0). The
equation for the first component says that λ = −3, and the equation for the second
component says that 2 − 3 − ν₂ = 0, which gives that ν₂ = −1. This possibility
must be discarded since ν₂ < 0. We are left with the first inequality active as
the only possibility. Then x = 0, and the constraint x + y = 1 gives that y = 1, and
a minimum value of 2. Since f clearly is bounded below on the region we work
on, it is clear that this must be a global minimum.
Finally we should look for candidates for the minimum which are not regular points.
If none of the constraints is active we have no candidates, since ∇h₁ = (1, 1) ≠ 0.
If one inequality is active we get no candidates either, since (1, 1) and (−1, 0) are
linearly independent, and since (1, 1) and (0, −1) are linearly independent. If
both inequalities are active (x = y = 0), it is clear that the constraint x + y = 1 is
not fulfilled. All in all, we get no additional candidates from points which are not
regular.
7.d. Setting the gradient of f(x, y) + µφ(x, y) to zero gives
(3, 2) + µ(−1/x, −1/y) + λ(1, 1) = (0, 0), which gives the equations

µ/x = 3 + λ
µ/y = 2 + λ.

This gives that µ/y + 1 = µ/x, which again gives µ(y − x) = xy. If we substitute the
constraint x + y = 1 we get that µ(1 − 2x) = x(1 − x), which can be written x² − (1 +
2µ)x + µ = 0. If we solve this we find that

x = ( 1 + 2µ ± √((1 + 2µ)² − 4µ) ) / 2 = ( 1 + 2µ ± √(1 + 4µ²) ) / 2.
This corresponds to two different points, depending on which sign we choose,
but if we choose + as the sign we see that x > 1, so that y < 0 in order for x + y =
1, so that (x, y) then is outside the domain of definition for the problem. We
therefore have that

x = ( 1 + 2µ − √(1 + 4µ²) ) / 2.

It is clear that x → 0 when µ → 0 here, so that the solution of the barrier problem
converges to the solution of the original problem.
Bibliography

[3] G. Dahl. A note on diagonally dominant matrices. Linear Algebra and its Appl., 317(1-3):217–224, 2000.

[6] C. T. Kelley. Iterative Methods for Linear and Nonlinear Equations. SIAM, 1995.

[7] D. C. Lay. Linear Algebra and its Applications (4th edition). Addison Wesley, 2011.

[13] A. Ruszczynski. Nonlinear Optimization. Princeton University Press, 2006.