Matinf 2360 Part 3

Nonlinear optimization
Lecture notes for the course MAT-INF2360
Øyvind Ryan, Geir Dahl, and Knut Mørken
March 7, 2013
Contents
1 The basics and applications 5

1.1 The basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Some applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.1 Portfolio optimization . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.2 Fitting a model . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.3 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.4 Optimal control problems . . . . . . . . . . . . . . . . . . . . . 12
1.2.5 Linear optimization . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3 Multivariate calculus and linear algebra . . . . . . . . . . . . . . . . 14
2 A crash course in convexity 21

2.1 Convex sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Convex functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Properties of convex functions . . . . . . . . . . . . . . . . . . . . . . 24
3 Nonlinear equations 29
3.1 Equations and fixed points . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4 Unconstrained optimization 37
4.1 Optimality conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5 Constrained optimization - theory 51

5.1 Equality constraints and the Lagrangian . . . . . . . . . . . . . . . . 51
5.2 Inequality constraints and KKT . . . . . . . . . . . . . . . . . . . . . . 56
5.3 Convex optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
i
6 Constrained optimization - methods 73
6.1 Equality constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.2 Inequality constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Mathematics index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Index for MATLAB commands . . . . . . . . . . . . . . . . . . . . . . . . . 103
1
2
Preface
These lecture notes have been written for the course MAT-INF2360. They deal
with the third part of that course, and is about nonlinear optimization. Just as
the first parts of MAT-INF2360, this third part also has its roots in linear algebra.
In addition, it has stronger connection than the previous parts to multivariate
calculus, as taught in MAT1110.
Notation
We will follow multivariate calculus and linear algebra notation as you know it
from MAT1110 and MAT1120. In particular, vectors will be in boldface (x, y,
etc.), while matrices will be in uppercase (A, B , etc.). The zero vector, or the
zero matrix, is denoted by 0. All vectors stated will be assumed to be column
vectors. A row vector will always be written as x T , where x is a (column) vector
Vector-valued functions will be in uppercase boldface (F , G, etc.). Functions are
written using both uppercase and lowercase. Uppercase is often used for the
component functions of a vector-valued function.
Acknowledgment
Most material has been written by Geir Dahl, with many useful suggestions and
help from Øyvind Ryan. The authors would all like to thank each other for estab-
lishing these notes. The authors would also like to thank Andreas Våvang Solbrå
for his valuable contributions to the notes.
Geir Dahl, Knut Mørken, and Øyvind Ryan

Oslo, February 2013.
3
4
Chapter 1
The basics and applications
The problem of minimizing a function of several variables, possibly subject to

constraints on these variables, is what optimization is about. So the main prob-
lem is easy to state! And, more importantly, such problems arise in many ap-
plications in natural science, engineering, economics and business as well as in
mathematics itself.
Nonlinear optimization differs from Fourier analysis and wavelet theory in
that classical multivariate analysis also is an important ingredient. A recom-
mended book on this, used here at the University of Oslo, is [8] (in Norwegian).
It contains a significant amount of fixed point theory, nonlinear equations, and
optimization.
There are many excellent books on nonlinear optimization (or nonlinear
programming, as it is also called). Some of these books that have influenced
these notes are [1, 2, 9, 5, 13, 11]. These are all recommended books for those
who want to go deeper into the subject. These lecture notes are particularly in-
fluenced by the presentations in [1, 2].
Optimization has its mathematical foundation in linear algebra and multi-
variate calculus. In analysis the area of convexity is especially important. For
the brief presentation of convexity given here the author’s own lecture notes [4]
(originally from 2001), and the very nice book [14], have been useful sources.
But, of course, anyone who wants to learn convexity should study the work by
R.T. Rockafellar, see e.g. the classic text [12].
Linear optimization (LP, linear programming) is a special case of nonlinear
optimization, but we do not discuss this in any detail here. The reason for this is
that we, at the University of Oslo, have a separate course in linear optimization
which covers many parts of that subject in some detail.
5
This first chapter introduces some of the basic concepts in optimization and
discusses some applications. Many of the ideas and results that you will find in
these lecture notes may be extended to more general linear spaces, even infinite-
dimensional. However, to keep life a bit easier and still cover most applications,
we will only be working in Rn .
Due to its character this chapter is a “proof-free zone”, but in the remaining
text we usually give full proofs of the main results.
Notation: For z ∈ Rn and δ > 0 define the (closed) ball B̄ (z; ²) = {x ∈ Rn :

kx − zk ≤ ²}. It consists of all points with distance at most ² from z. Similarly,
define the open ball B (z; ²) = {x ∈ Rn : kx − zk < ²}. A neighborhood of z is a set N
containing B (z; ²) for some ² > 0. Vectors are treated as column vectors and they
are identified with the corresponding n-tuple, denoted by x = (x 1 , x 2 , . . . , x n ). A
statement like
P (x) (x ∈ H )
means that the statement P (x) is true for all x ∈ H .
1.1 The basic concepts

Optimization deals with finding optimal solutions! So we need to define what
this is.
Let f : Rn → R be a real-valued function in n variables. The function value
is written as f (x), for x ∈ Rn , or f (x 1 , x 2 , . . . , x n ). This is the function we want to
minimize (or maximize) and it is often called the objective function. Let x ∗ ∈ Rn .
Then x ∗ is a local minimum (or local minimizer) of f if there is an ² > 0 such
that
f (x ∗ ) ≤ f (x) for all x ∈ B (x ∗ ; ²).
So, no point “sufficiently near” x ∗ has smaller f -value than x ∗ . A local max-
imum is defined similarly, but with the inequality reversed. A stronger notion is
that x ∗ is a global minimum of f which means that
f (x ∗ ) ≤ f (x) for all x ∈ Rn .
A global maximum satisfies the opposite inequality.
6
The definition of local minimum has a “variational character”; it concerns
the behavior of f near x ∗ . Due to this it is perhaps natural that Taylor’s formula,
which gives an approximation of f in such a neighborhood, becomes a main
tool for characterizing and finding local minima. We present Taylor’s formula,
in different versions, in Section 1.3.
An extension of the notion of minimum and maximum is for constrained
problems where we want, for instance, to minimize f (x) over all x lying in a
given set C . Then x ∗ ∈ C is a local minimum of f over the set C , or subject to
x ∈ C as we shall say, provided no point in C in some neighborhood of x ∗ has
smaller f -value than x ∗ . A similar extension holds for global minimum over C ,
and for maxima.
Example 1.1. To make these things concrete, consider an example from plane
geometry. Consider the point set C = {(z 1 , z 2 ) : z 1 ≥ 0, z 2 ≥ 0, z 1 + z 2 ≤ 1} in the
plane. We want to find a point x = (x 1 , x 2 ) ∈ C which is closest possible to the
point a = (3, 2). This can be formulated as the minimization problem
minimize (x 1 − 3)2 + (x 2 − 2)2

subject to
x1 + x2 ≤ 1
x 1 ≥ 0, x 2 ≥ 0.
The function we want to minimize is f (x) = (x 1 −3)2 +(x 2 −2)2 which is a quadratic
function. This is the square of the distance between x and a; and minimizing the
distance or the square of the distance is equivalent (why?). A minimum here is
x ∗ = (1, 0), as can be seen from a simple geometric argument where we draw the
normal from (3, 2) to the line x 1 + x 2 = 1. If we instead minimize f over R2 , the
unique global minimum is clearly x ∗ = a = (3, 2). It is also useful, and not too
hard, to find these minima analytically. ♣
In optimization one considers minimization and maximization problems.
As
max{ f (x) : x ∈ S} = − min{− f (x) : x ∈ S}
it is clear how to convert a maximization problem into a minimization problem

(or vise versa). This transformation may, however, change the properties of the
function you work with. For instance, if f is convex (definitions come later!),
then − f is not convex (unless f is linear), so rewriting between minimization
and maximization may take you out of a class of “good problems”. Note that a
minimum or maximum may not exist. A main tool one uses to establish that
7
optimal solutions really exist is the extreme value theorem as stated next. You
may want to look these notions up in [8].
Theorem 1.2. Let C be a subset of Rn which is closed and bounded, and let
f : C → R be a continuous function. Then f attains both its (global) minimum
and maximum, so these are points x 1 , x 2 ∈ C with
f (x 1 ) ≤ f (x) ≤ f (x 2 ) (x ∈ C ).
1.2 Some applications

It is useful to see some application areas for optimization. They are many, and
here we mention a few in some detail. The methods we will learn later will be
applied to these examples.
1.2.1 Portfolio optimization

The following optimization problem was introduced by Markowitz in order to
find an optimal portfolio in a financial market; he later received the Nobel prize
in economics1 (in 1990) for his contributions in this area:
Pn
α j =1 µ j x j
P
minimize i , j ≤n c i j x i x j −
subject to
Pn
j =1 x j =1
xj ≥ 0 ( j ≤ n).
The model may be understood as follows. The decision variables are x 1 , x 2 ,

. . . , x n where x i is the fraction of a total investment that is made in (say) stock i .
Thus one has available a set of stocks in different companies (Statoil, IBM, Apple
etc.) or bonds. The fractions x i must be nonnegative (so we consider no short
sale) and add up to 1. The function f to be minimized is
n
f (x) = α µj x j .
X X
ci j xi x j −
i , j ≤n j =1
1 The precise term is “Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel”
8
It can be explained in terms of random variables. Let R j be the return on stock
j , this is a random variable, and let µ j = ER j be the expectation of R j . So if X
denotes the random variable X = nj=1 x j R j , which is the return on our portfolio
P
(= mix among investments), then E X = nj=1 µ j x j which is the second term in f .

P
The minus sign in front explains that we really want to maximize the expected
return. The first term in f is there because just looking at expected return is too
simple. We want to spread our investments to reduce the risk. The first term
in f is the variance of X multiplied by a weight factor α; the constant c i j is the
covariance of R i and R j , defined by
c i j = E(R i − µi )(R j − µ j ).
c i i is also called the variance of R i .

So f is a weighted difference of variance and expected return. This is what
we want to minimize. The optimization problem is to minimize a quadratic
function subject to linear constraints. We shall discuss theory and methods for
such problems later.
In order to use such a model one needs to find good values for all the param-
eters µ j and c i j ; this is done using historical data from the stock markets. The
weight parameter α is often varied and the optimization problem is solved for
each such “interesting” value. This makes it possible to find a so-called efficient
frontier of expectation versus variance for optimal solutions.
The Markowitz model is a useful tool for financial investments, and now ex-
tensions and variations of the model exist, e.g., by using different ways of mea-
suring risk. All such models involve a balance between risk and expected return.
1.2.2 Fitting a model

In many applications one has a mathematical model of some phenomenon where
the model has some parameters. These parameters represent a flexibility of the
model, and they may be adjusted so that the model explains the phenomenon
best possible.
To be more specific consider a model
y = F α (x)
for some function F α : Rm → R. Here α = (α1 , α2 , . . . , αn ) ∈ Rn is a parameter vec-

tor (so we may have several parameters). Perhaps there are natural constraints
on the parameter, say α ∈ A for a given set A in Rn .
For instance, consider
α
y = α1 cos x 1 + x 2 2
9
α
so here n = m = 2, α = (α1 , α2 ) and F α (x) = α1 cos x 1 + x 2 2 where (say) α1 ∈ R and
α2 ∈ [1, 2].
The general model may also be thought of as
y = F α (x) + error
since it is usually a simplification of the system one considers. In statistics one

specifies this error term as a random variable with some (partially) known dis-
tribution. Sometimes one calls y the dependent variable and x the explaining
variable. The goal is to understand how y depends on x.
To proceed, assume we are given a number of observations of the phenomenon
given by points
(x i , y i ) (i = 1, 2, . . . , m).
meaning that one has observed y i corresponding to x = x i . We have m such ob-
servations. Usually (but not always) we have m ≥ n. The model fit problem is to
adjust the parameter α so that the model fits the given data as good as possible.
This leads to the optimization problem
m
(y i − F α (x i ))2 subject to α ∈ A.
X
minimize
i =1
The optimization variable is the parameter α. Here the model error is quadratic
(corresponding to the Euclidean norm), but other norms are also used.
This optimization problem above is a constrained nonlinear optimization
problem. When the function F α depends linearly on α, which often is the case in
practice, the problem becomes the classical least squares approximation prob-
lem which is treated in basic linear algebra courses. The solution is then charac-
terized by a certain linear system of equations, the so-called normal equations.
1.2.3 Maximum likelihood

A very important problem in statistics, arising in many applications, is param-
eter estimation and, in particular, maximum likelihood estimation. It leads to
optimization.
Let Y be a “continuous” real-valued random variable with probability den-
sisty p x (y). Here x is a parameter (often one uses other symbols for the pa-
rameter, like ξ, θ etc.). For instance, if Y is a normal (Gaussian) variable with
2
expectation x and variance 1, then p x (y) = p1 e −(y−x) /2 and
2π
Z b 1 2
P(a ≤ Y ≤ b) = p e −(y−x) /2 d y
a 2π
10
where P denotes probability.
Assume Y is the outcome of an experiment, and that we have observed Y = y
(so y is a known real number or a vector, if several observations were made). On
the basis of y we want to estimate the value of the parameter x which “explains”
best possible our observation Y = y. We have now available the probability den-
sity p x (·). The function x → p x (y), for fixed y, is called the likelihood function.
It gives the “probability mass” in y as a function of the parameter x. The max-
imum likelihood problem is to find a parameter value x which maximizes the
likelihood, i.e., which maximizes the probability of getting precisely y. This is an
optimization problem
max p x (y)
x
where y is fixed and the optimization variable is x. We may here add a con-
straint on x, say x ∈ C for some set C , which may incorporate possible knowl-
edge of x and assure that p x (y) is positive for x ∈ C . Often it is easier to solve the
equivalent optimization problem of maximizing the logarithm of the likelihood
function
max ln p x (y)
x
This is a nonlinear optimization problem. Often, in statistics, there are several
parameters, so x ∈ Rn for some n, and we need to solve a nonlinear optimiza-
tion problem in several variables, possibly with constraints on these variables.
If the likelihood function, or its logarithm, is a concave function, we have (after
multiplying by −1) a convex optimization problem. Such problems are easier to
solve than general optimization problems. This will be discussed later.
As a specific example assume we have the linear statistical model
y = Ax + w
where A is a given m × n matrix, x ∈ Rn is an unknown parameter, w ∈ Rm is a
random variable (the “noise”), and y ∈ Rn is the observed quantity. We assume
that the components of w , i.e., w 1 , w 2 , . . . , w m are independent and identically
distributed with common density function p on R. This leads to the likelihood
function
m
Y
p x (y) = p(y i − a i x)
i =1
where a i is the i ’th row in A. Taking the logarithm we obtain the maximum
likelihood problem
m
X
max ln p(y i − a i x).
i =1
In many applications of statistics is is central to solve this optimization problem
numerically.
11
Example 1.3. Let us take a look at a model take from physics for desintegration
of muons. The angle θ in electron radiation for desintegration of muons has a
probability density
1 + αx
p(x; α) = (1.1)
2
for x ∈ [−1, 1], where x = cos θ, and where α is an unknown parameter in [−1, 1].
Our goal is to estimate α from n measurements x = (x 1 , . . . , x n ). In this case the
likelihood function, which we seek to maximize, takes the form g (α) = ni=1 p(x i ; α).
Q
Taking logarithms and multiplying by −1, our problem is to minimize

Ã !
n n
p(x i ; α) = − ln((1 + αx i )/2).
Y X
f (α) = − ln g (α) = − ln (1.2)
i =1 i =1
We compute
n x i /2 n xi
f 0 (α) = −
X X
=−
i =1 (1 + αx i )/2 i =1 1 + αx i
n x i2
f 00 (α) =
X
i =1 (1 + αx i )
2
We see that f 00 (α) ≥ 0, so that f is convex. As explained, this will make the prob-
lem easier to solve using numerical methods. If we try to solve f 0 (α) = 0 we will
run into problems, however. We see, however, that f 0 (α) → 0 when α → ±∞,
xi
and since 1+αx i
= 1/x1i +α , we must have that f 0 (α) → ∞ when α → −1/x i from
below, and f (α) → −∞ when α → −1/x i from above. It is therefore clear that
0
f has exactly one minimum in every interval of the form [−1/x i , −1/x i +1 ] when
we list the x i in increasing order. It is not for sure that there is a minimum within
[−1, 1] at all. If all measurements have the same sign we are guaranteed to find
no such point. In this case the minimum must be one of the end points in the
interval. We will later look into numerical method for finding this minimum. ♣
1.2.4 Optimal control problems

Recall that a discrete dynamical system is an equation
x t +1 = h t (x t ) (t = 0, 1, . . .)
where x t ∈ Rn , x 0 is the initial solution, and h t is a given function for each t . We

here think of t as time and x t is the state of the process at time t . For instance, let
n = 1 and consider h t (x) = ax (t = 0, 1, . . .) for some a ∈ R. Then the solution is
x t = a t x 0 . Another example is when A is an n ×n matrix, x t ∈ Rn and h t (x) = Ax
12
for each t . Then the solution is x t = A t x 0 . For the more general situation, where
the system functions h t may be different, it may be difficult to find an explicit
solution for x t . Numerically, however, we compute x t simply in a for-loop by
computing x 0 , then x 1 = f 1 (x 0 ) and then x 2 = f 2 (x 1 ) etc.
Now, consider a dynamical system where we may “control” the system in
each time step. We restrict the attention to a finite time span, t = 0, 1, . . . , T . A
proper model is then
x t +1 = h t (x t , u t ) (t = 0, 1, . . . , T − 1)
where x t is the state of the system at time t and the new variable u t is the control
at time t . We assume x t ∈ Rn and u t ∈ Rm for each t (but these things also work
if these vectors lie in spaces of different dimensions). Thus, when we choose the
controls u 0 , u 1 , . . . , u T −1 and x 0 is known, the sequence {x t } of states is uniquely
determined. Next, assume there are given functions f t : Rn × Rm → R that we call
cost functions. We think of f t (x t , u t ) as the “cost” at time t when the system is
in state x t and we choose control u t . The optimal control problem is
PT −1
minimize f T (x T ) + t =0 f t (x t , u t )
subject to (1.3)
x t +1 = h t (x t , u t ) (t = 0, 1, . . . , T − 1)
where the control is the sequence (u 0 , u 1 , . . . , u T −1 ) to be determined. This

problem arises an many applications, in engineering, finance, economics etc.
We now rewrite this problem. First, let u = (u 1 , u 2 , . . . , u T ) ∈ RN where N = T n.
Since, as we noted, x t is uniquely determined by u, there is a function v t such
that x t = v t (u) (t = 1, 2, . . . , T ); x 0 is given. Therefore the total cost may be writ-
ten
TX
−1 TX
−1
f T (x T ) + f t (x t , u t ) = f T (v T (u)) + f t (v t (u), u t ) := f (u)
t =0 t =0
which is a function of u. Thus, we see that the optimal control problem may be
transformed to the unconstrained optimization problem
min f (u)
u∈RN
Sometimes there may be constraints on the control variables, for instance that
they each lie in some interval, and then the transformation above results in a
constrained optimization problem.
13
1.2.5 Linear optimization
This is not an application, but rather a special case of the general nonlinear opti-
mization problem where all functions are linear. A linear optimization problem,
also called linear programming, has the form
minimize cT x
subject to (1.4)
Ax = b, x ≥ 0.
Here A is an m × n matrix, b ∈ Rm and x ≥ 0 means that x i ≥ 0 for each i ≤ n.

So in linear optimization one minimizes (or maximizes) a linear function subject
to linear equations and nonnegativity on the variables. Actually, one can show
any problem with constraints that are linear equations and/or linear inequali-
ties may be transformed into the form above. Such problems have a wide range
of application in science, engineering, economics, business etc. Applications in-
clude portfolio optimization and many planning problems for e.g. production,
transportation etc. Some of these problems are of a combinatorial nature, but
linear optimization is a main tool here as well.
We shall not treat linear optimization in detail here since this is the topic of a
separate course, INF-MAT3370 Linear optimization. In that course one presents
some powerful methods for such problems, the simplex algorithm and interior
point methods. In addition one considers applications in network flow models
and game theory.
1.3 Multivariate calculus and linear algebra

We first recall some useful facts from linear algebra.
The spectral theorem says that if A is a real symmetric matrix, then there is an
orthogonal matrix V (i.e., its columns are orthonormal) and a diagonal matrix D
such that
A = V DV T .
The diagonal of D contains the eigenvalues of A, and A has an orthonormal set

of eigenvectors (the columns of V ).
A real symmetric matrix is positive semidefinite2 if x T Ax ≥ 0 for all x ∈ Rn .
2 See Section 7.2 in [7]
14
The following statements are equivalent
(i) A is positive semidefinite,

(ii) all eigenvalues of A are nonnegative,
(iii) A = W T W for some matrix W .
Similarly, a real symmetric matrix is positive definite if x T Ax > 0 for all nonzero
x ∈ Rn . The following statements are equivalent
(i) A is positive definite,

(ii) all eigenvalues of A are positive,
(iii) A = W T W for some invertible matrix W .
Every positive definite matrix is therefore invertible.

We also recall some central facts from multivariate calculus. They will be
used repeatedly in these notes. Let f : Rn → R be a real-valued function defined
on Rn . The gradient of f at x is the n-tuple
∂ f (x) ∂ f (x) ∂ f (x)

µ ¶
∇ f (x) = , ,..., .
∂x 1 ∂x 2 ∂x n
We will always identify an n-tuple with the corresponding column vector3 . Of

course, the gradient only exists if all the partial derivatives exist. Second or-
der information is contained in a matrix: assuming f has second order par-
tial derivatives we define the Hessian matrix4 ∇2 f (x) as the n × n matrix whose
(i , j )’th entry is
∂2 f (x)
.
∂x i ∂x j
If these second order partial derivatives are continuous, then we may switch the
order in the derivations, and ∇2 f (x) is a symmetric matrix.
For vector-valued functions we also need the derivative. Consider the vector-
valued function F given by
 
F 1 (x)
 F 2 (x) 
 
F (x) =  .. 
.
 
 
F n (x)
3 This is somewhat different from [8], since the gradient there is always considered as a row
vector
15
so F i : Rn → R is the i th component function of F . F 0 denotes the Jacobi matrix5 ,
or simply the derivative, of F
∂F 1 (x) ∂F 1 (x) ∂F 1 (x)
 
· · ·
 ∂F∂x2 (x)
1 ∂x 2
∂F 2 (x)
∂x n
∂F 1 (x) 

∂x ∂x · · · ∂x n

F 0 (x) =  1 2
 
.. 

 . 

∂F n (x) ∂F n (x) ∂F n (x)
∂x 1 ∂x 2 · · · ∂x n
The i th row of this matrix is therefore the gradient of F i , now viewed as a row
vector.
Next we recall Taylor’s theorems from multivariate calculus 6 :
Theorem 1.4 (First order Taylor theorem). Let f : Rn → R be a function hav-

ing continuous partial derivatives in some ball B (x; r ). Then, for each h ∈ Rn
with khk < r there is some t ∈ (0, 1) such that
f (x + h) = f (x) + ∇ f (x + t h)T h.
The next one is known as Taylor’s formula, or the second order Taylor’s the-
orem7 :
Theorem 1.5 (Second order Taylor theorem). Let f : Rn → R be a function

having second order partial derivatives that are continuous in some ball
B (x; r ). Then, for each h ∈ Rn with khk < r there is some t ∈ (0, 1) such that
1
f (x + h) = f (x) + ∇ f (x)T h + h T ∇2 f (x + t h)h.
2
This may be shown by considering the one-variable function g (t ) = f (x +t h)

and applying the chain rule and Taylor’s formula in one variable.
There is another version of the second order Taylor theorem in which the
Hessian is evaluated in x and, as a result, we get an error term. This theorem
shows how f may be approximated by a quadratic polynomial in n variables8 :
6 This theorem is also the mean value theorem of functions in several variables, see Section 5.5
in [8]
16
Theorem 1.6 (Second order Taylor theorem, version 2). Let f : Rn → R be a
function having second order partial derivatives that are continuous in some
ball B (x; r ). Then there is a function ² : Rn → R such that, for each h ∈ Rn with
khk < r ,
1
f (x + h) = f (x) + ∇ f (x)T h + h T ∇2 f (x)h + ²(h)khk2 .
2
Here ²(y) → 0 when y → 0.
Using the O-notation from Definition ??, the very useful approximations we
get from Taylor’ theorems can thus be summarized as follows:
Taylor approximations:
First order: f (x + h) = f (x) + ∇ f (x)T h + O(khk)
≈ f (x) + ∇ f (x)T h.
Second order: f (x + h) = f (x) + ∇ f (x)T h + 21 h T ∇2 f (x)h + O(khk2 )
≈ f (x) + ∇ f (x)T h + 12 h T ∇2 f (x)h.
We introduce notation for these approximations
T f1 (x; x + h) = f (x) + ∇ f (x)T h

T f2 (x; x + h) = f (x) + ∇ f (x)T h + 12 h T ∇2 f (x)h
As we shall see, one can get a lot of optimization out of these approximations!
We also need a Taylor theorem for vector-valued functions, which follows by
applying Taylor’ theorem above to each component function:
Theorem 1.7 (First order Taylor theorem for vector-valued functions). Let
F : Rn → Rm be a vector-valued function which is continuously differentiable
in a neighborhood N of x. Then
F (x + h) = F (x) + F 0 (x)h + O(khk)
when x + h ∈ N .
Finally, if F : Rn → Rm and G : Rk → Rn we define the composition H = F ◦ G

as the function H : Rk → Rm by H (x) = F (G(x)). Then, under natural differentia-
17
bility assumption the following chain rule9 holds:
H 0 (x) = F 0 (G(x))G 0 (x).
Here the right-hand side is a product of two matrices, the respective Jacobi ma-
trices evaluated in the right points.
Finally, we discuss some notions concerning the convergence of sequences.
Definition 1.8 (Linear convergence). We say that a sequence {x k }∞ k=1

con-
verges to x ∗ linearly (or that the convergence speed in linear) if there is a γ < 1
such that
kx k+1 − x ∗ k ≤ γkx k − x ∗ k (k = 0, 1, . . .).
A faster convergence rate is superlinear convergence which means that
lim kx k+1 − x ∗ k/kx k − x ∗ k = 0

k→∞
A special type of superlinear convergence is quadratic convergence where
kx k+1 − x ∗ k ≤ γkx k − x ∗ k2 (k = 0, 1, . . .)
for some γ < 1.
Exercises for Chapter 1

1. Give an example of a function f : R → R with 10 global minima.
2. Consider the function f (x) = x sin(1/x) defined for x > 0. Find its local min-
ima. What about global minimum?
3. Let f : X → R+ be a function (with nonnegative function values). Explain
why it is equivalent to minimize f over x ∈ X or minimize f 2 (x) over X .
4. In Example 1.2.3 we mentioned that optimizing the function p x (y) is equiv-
alent to optimizing the function ln p x (y). Explain why maximizing/minimizing
g is the same as maximizing/minimizing ln g for any positive function g .
5. Consider f : R2 → R given by f (x) = (x 1 − 3)2 + (x 2 − 2)2 . How would you
explain to anyone that x ∗ = (3, 2) is a minimum point?
6. The level sets of a function f : R2 → R are sets of the form L α = x ∈ R2 : f (x) =
α}. Let f (x) = 41 (x 1 − 1)2 + (x 2 − 3)2 . Draw the level sets in the plane for α =
10, 5, 1, 0.1.
18
7. The sublevel set of a function f : Rn → R is the set S α ( f ) = {x ∈ R2 : f (x) ≤ α},
where α ∈ R. Assume that inf{ f (x) : x ∈ Rn } = η exists.
a. What happens to the sublevel sets S α as α decreases? Give an example.
b. Show that if f is continuous and there is an x 0 such that with α = f (x 0 )

the sublevel set S α ( f ) is bounded, then f attains its minimum.
8. Consider the portfolio optimization problem in Subsection 1.2.1.
a. Assume that c i j = 0 for each i 6= j . Find, analytically, an optimal solu-

tion. Describe the set of all optimal solutions.
b. Consider the special case where n = 2. Solve the problem (hint: elimi-
nate one variable) and discuss how minimum point depends on α.
9. Later in these notes we will need the expression for the gradient of functions
which are expressed in terms of matrices.
a. Let f : Rn → R be defined by f (x) = q T x = x T q, where q is a vector.

Show that ∇ f (x) = q, and that ∇2 f (x) = 0.
b. Let f : Rn → R be the quadratic function f (x) = (1/2)x T Ax, where A is

symmetric. Show that ∇ f (x) = Ax, and that ∇2 f (x) = A.
c. Show that, with f defined as in b., but with A not symmetric, we obtain
that ∇ f (x) = 12 (A + A T )x, and ∇2 f = 12 (A + A T ). Verify that these formulas
are compatibe with what you found in b. when A is symmetric.
10. Consider f (x) = f (x 1 , x 2 ) = x 12 + 3x 1 x 2 − 5x 22 + 3. Determine the first order

Taylor approximation to f at each of the points (0, 0) and (2, 1).
µ ¶
1 2
11. Let A = . Show that A is positive definite. (Try to give two different
2 8
proofs.)
12. Show that if A is positive definite, then its inverse is also positive definite.
19
20
Chapter 2
A crash course in convexity
Convexity is a branch of mathematical analysis dealing with convex sets and

convex functions. It also represents a foundation for optimization.
We just summarize concepts and some results. For proofs one may consult
[4] or [14], see also [1].
2.1 Convex sets

A set C ⊆ Rn is called convex if (1 − λ)x + λy ∈ C whenever x, y ∈ C and 0 ≤ λ ≤ 1.
Geometrically, this means that C contains the line segment between each pair
of points in C , so, loosely speaking, a convex set contains no “holes”.
For instance, the ball B (a; δ) = {x ∈ Rn : kx − ak ≤ δ} is a convex set. Let
us show this. Recall the triangle inequality which says that ku + v k ≤ kuk + kv k
whenever u, v ∈ Rn . Let x, y ∈ B (a; δ) and λ ∈ [0, 1]. Then
k((1 − λ)x + λy) − ak = k(1 − λ)(x − a) + λ(y − a)k

≤ k(1 − λ)(x − a)k + kλ(y − a)k
= (1 − λ)kx − ak + λky − ak
≤ (1 − λ)δ + λδ = δ.
Therefore B (a; δ) is convex.

Every linear subspace is also a convex set, as well as the translate of every
subspace (which is called an affine set). Some other examples of convex sets
in R2 are shown in Figure 2.1. We will come back to why each of these sets are
convex later. Another important property is that the intersection of a family of
convex sets is a convex set.
21
2 2 2
1 1 1
0 0 0
−1 −1 −1
−2 −2 −2
−2 −1 0 1 2 −2 −1 0 1 2 −2 −1 0 1 2
2
(a) A square (b) The ellipse x4 + y 2 ≤ 1 (c) The area x 4 + y 4 ≤ 1
Figure 2.1: Examples of some convex sets.
By a linear system we mean a finite system of linear equations and/or linear

inequalities involving n variables. For example
x 1 + x 2 = 3, x 1 ≥ 0, x 2 ≥ 0
is a linear system in the variables x 1 , x 2 . The solution set is the set of points
(x 1 , 3 − x 1 ) where 0 ≤ x 1 ≤ 3. The set of solutions of a linear system is called a
polyhedron. These sets often occur in optimization. Thus, a polyhedron has the
form
P = {x ∈ Rn : Ax ≤ b}
where A ∈ Rm,n and b ∈ Rm (m is arbitrary, but finite) and ≤ means componen-

twise inequality. There are simple techniques for rewriting any linear system in
the form Ax ≤ b.
Proposition 2.1. Every polyhedron is a convex set.
Proof: Assume that P is the polyhedron given by all points where Ax ≤ b.

Assume that x and y lie in P , so that Ax ≤ b, and Ay ≤ b. We then have that
A(λx + (1 − λ)y) = λAx + (1 − λ)Ay ≤ λb + (1 − λ)b = b.
This shows that λAx + (1 − λ)Ay also lies in P , so that P is convex.

The square from Figure 2.1(a) is defined by the inequalities −1 ≤ x, y ≤ 1. It is
therefore a polyhedron, and therefore convex. The next result shows that convex
sets are preserved under linear maps.
22
Proposition 2.2. If T : Rn → Rm is a linear transformation, and C ⊆ Rn is a
convex set, then the image T (C ) of this set is also convex.
2.2 Convex functions

The notion of a convex function also makes sense for real-valued functions of
several variables. Consider a real-valued function f : C → R where C ⊆ Rn is a
convex set. We say that f is convex provided that
f ((1 − λ)x + λy) ≤ (1 − λ) f (x) + λ f (y) (x, y ∈ C , 0 ≤ λ ≤ 1) (2.1)
(This inequality holds for all x, y and λ as specified). Due to the convexity of C ,
the point (1 − λ)x + λy lies in C , so the inequality is well-defined. The geometri-
cal interpretation in one dimension is that whenever you take two points on the
graph of f , say (x, f (x)) and (y, f (y)), the graph of f restricted to the line seg-
ment [x, y] lies below the line segment in Rn+1 between the two chosen points.
A function g is called concave if −g is convex.
For every linear function we have that f ((1−λ)x+λy) = (1−λ) f (x)+λ f (y), so
that every linear function is convex. Some other examples of convex functions
in n variables are
• f (x) = L(x) + α where L is a linear function from Rn into R (a linear func-
tional) and α is a real number. In fact, for such functions we have that
f ((1 − λ)x + λy) = (1 − λ) f (x) + λ f (y), just as for linear functions. Func-
tions on the form f (x) = L(x) + α are called affine functions, and may be
written on the form f (x) = c T x + α for a suitable vector c.
• f (x) = kxk (Euclidean norm). That this is convex can be proved by writing
k(1 − λ)x + λyk ≤ k(1 − λ)xk + kλyk = (1 − λ)kxk + λkyk. In fact, the same
argument can be used to show that every norm defines a convex func-
tion. Such an example is the l 1 -norm, also called the sum norm, defined
by kxk1 = nj=1 |x j |.
P
Pn
• f (x) = e j =1 x j (see Exercise 10).
• f (x) = e h(x) where h : Rn → R is a convex function.
• f (x) = maxi g i (x) where g i : Rn → R is an affine function (i ≤ m). This

means that the pointwise maximum of affine functions is a convex func-
tion. Note that such convex functions are typically not differentiable ev-
erywhere. A more general result is that the pointwise supremum of an
23
arbitrary family of affine functions (or even convex functions) is convex.
This is a very useful fact in convexity and its applications.
The following result is an exercise to prove, and it gives a method for proving
convexity of a function.
Proposition 2.3. Assume that f : Rn → R is convex and H : Rm → Rn is affine.

Then the composition f ◦ H is convex, where ( f ◦ H )(x) := f (H (x)).
The next result is often used, and is called Jensen’s inequality.. It can be
proved using induction.
Theorem 2.4 (Jensen’s inequality). Let f : C → R be a convex function de-

fined on a convex set C ⊆ Rn . If x 1 , x 2 . . . , x r ∈ C and λ1 , . . . , λr ≥ 0 satisfy
Pr
j =1 λ j = 1, then
r r
f ( λj x j ) ≤ λ j f (x j ).
X X
(2.2)
j =1 j =1
Pr j
A point of the form j =1 λ j x , where the λ j ’s are nonnegative and sum to 1,
is called a convex combination of the points x 1 , x 2 . . . , x r . One can show that a
set is convex if and only if it contains all convex combinations of its points.
Finally, one connection between convex sets and convex functions is the fol-
lowing fact whose proof is an exercise.
Proposition 2.5. Let C ⊆ Rn be a convex set and consider a convex function

f : C → R. Let α ∈ R. Then the “sublevel” set
{x ∈ C : f (x) ≤ α}
is a convex set.
2.3 Properties of convex functions

A convex function may not be differentiable in every point. However, one can
show that a convex function always has one-sided directional derivatives at any
point. But what about continuity?
24
2
6
1
4
0
2
0 −1
2
2
0 0
−2 −2 −2
−2 −1 0 1 2
x2
(a) The function f (x, y) = 4 + y2 (b) Some level curves of f
Figure 2.2: A function and its level curves.
Theorem 2.6. Let f : C → R be a convex function defined on an open convex

set C ⊆ Rn . Then f is continuous on C .
However, a convex function may be discontinuous in points on the boundary

of its domain. For instance, the function f : [0, 1] → R given by f (0) = 1 and
f (x) = 0 for x ∈ (0, 1] is convex, but discontinuous at x = 0. Next we give a useful
technique for checking that a function is convex.
Theorem 2.7. Let f be a real-valued function defined on an open convex set

C ⊆ Rn and assume that f has continuous second-order partial derivatives on
C.
Then f is convex if and only if the Hessian matrix ∇2 f (x) is positive
semidefinite for each x ∈ C .
Example 2.8. Using Theorem 2.7 it is straightforward to prove that the remain-
ing sets from Figure 2.1 are convex. They can be written as sublevel sets of the
2
functions f (x, y) = x4 +y 2 , and f (x, y) = x 4 +y 4 . For the first of these the level sets
are ellipses, and are shown in Figure 2.2, together with f itself. One can quickly
verify that the Hessian matrices of these functions are positive semidefinite. It
follows from Proposition 2.5 that the corresponding sets are convex. ♣
An important class of convex functions consists of (certain) quadratic func-
tions. Let A ∈ Rn×n be a symmetric matrix which is positive semidefinite and
25
consider the quadratic function f : Rn → R given by
n
f (x) = (1/2) x T Ax − b T x = (1/2)
X X
ai j xi x j − bj xj.
i,j j =1
(If A = 0, then the function is linear, and it may be strange to call it quadratic.
But we still do this, for simplicity.) Then (Exercise 1.3.9) the Hessian matrix of
f is A, i.e., ∇2 f (x) = A for each x ∈ Rn . Therefore, by Theorem 2.7 is a convex
function.
We remark that sometimes it may be easy to check that a symmetric matrix
A is positive semidefinite. A (real) symmetric n × n matrix A is called diagonally
P
dominant if |a i i | ≥ j 6=i |a i j | for i = 1, . . . , n. These matrices arise in many appli-
cations, e.g. splines and differential equations. It can be shown that every sym-
metric diagonally dominant matrix is positive semidefinite. For a simple proof
of this fact using convexity, see [3]. Thus, we get a simple criterion for convex-
ity of a function: check if the Hessian matrix ∇2 f (x) is diagonally dominant for
each x. Be careful here: this matrix may be positive semidefinite without being
diagonally dominant!
We now look at differentiability properties of convex functions.
Theorem 2.9. Let f be a real-valued convex function defined on an open con-

vex set C ⊆ Rn . Assume that all the partial derivatives ∂ f (x)/∂x 1 , . . . , ∂ f (x)/∂x n
exist at a point x ∈ C . Then f is differentiable at x.
A convex function may not be differentiable everywhere, but it is differen-

tiable “almost everywhere”. More precisely, for a convex function defined on
an open convex set in Rn , the set of points for which f is not differentiable has
Lebesgue measure zero. We do not go into further details on this here, but refer
to e.g. [5] for a proof and a discussion.
Another characterization of convex functions that involves the gradient may
now be presented.
Theorem 2.10. Let f : C → R be a differentiable function defined on an open

convex set C ⊆ Rn . Then the following conditions are equivalent:
(i ) f is convex.
(i i ) f (x) ≥ f (x 0 ) + ∇ f (x 0 )T (x − x 0 ) for all x, x 0 ∈ C .
(i i i ) (∇ f (x) − ∇ f (x 0 ))T (x − x 0 ) ≥ 0 for all x, x 0 ∈ C .
26
This theorem is important. Property (ii) says that the first-order Taylor ap-
proximation of f at x 0 (which is the right-hand side of the inequality) always
underestimates f . This result has interesting consequences for optimization as
we shall see later.

1. We recall that A ∩ B consists of all points which lie both in A and B . Show
that A ∩ B is convex when A and B are.
2. (Trial Exam UIO V2012) Assume that f , g are convex, positive, and increasing
functions, both two times differentiable and defined on R. Show that h(x) =
f (x)g (x) is convex.
Hint: Look at the second derivative of h(x).
3. (Exam UIO V2012)
a. Let f and g both be two times (continuously) differentiable functions
defined on R. Suppose also that f and g are convex, and that f is increas-
ing. Show that h(x) = f (g (x)) is convex.
Hint: Compute the second derivative of h(x), and consider its sign.
b. Construct two convex functions f , g so that h(x) = f (g (x)) is not con-

vex.
4. Let S = {(x, y, z) : z ≥ x 2 + y 2 } ⊂ R3 . Sketch the set and verify that it is a convex

set.
5. Let f : S → R be a differentiable function, where S is an open set in R. Check
that f is convex if and only if f 00 (x) ≥ 0 for all x ∈ S.
6. Prove Proposition 2.3.
7. Prove Proposition 2.5.
8. Explain how you can write the LP problem max {c T x : Ax ≥ b, B x = d , x ≥ 0}
as an LP problem of the form
max{c T x : H x ≤ h, x ≥ 0}
for suitable matrix H and vector h.
9. Let x 1 , . . . , x t ∈ Rn and let C be the set of vectors of the form
t
λj x j
X
j =1
Pt
where λ j ≥ 0 for each j = 1, . . . , t , and j =1 λ j = 1. Show that C is convex. Make
3
a sketch of such a set in R .
27
Pn
10. Show that f (x) = e j =1 x j is a convex function.
11. Assume that f and g are convex functions defined on an interval I . Deter-
mine which of the functions following functions that are convex or concave:
a. λ f where λ ∈ R,
b. min{ f , g },
c. | f |.
12. Let f : [a, b] → R be a convex function. Show that
max{ f (x) : x ∈ [a, b]} = max{ f (a), f (b)}
i.e., a convex function defined on closed real interval attains its maximum in one
of the endpoints.
13. Let f : 〈0, ∞〉 → R and define the function g : 〈0, ∞〉 → R by g (x) = x f (1/x).
Why is the function x → xe 1/x convex?
14. Let C ⊆ Rn be a convex set and consider the distance function dC defined
by dC (x) = inf{kx − yk : y ∈ C }. Show that dC is a convex function.
28
Chapter 3
Nonlinear equations
A basic mathematical problem is to solve a system of equations in several un-

knowns (variables). There are numerical methods that can solve such equations,
at least within a small error tolerance. We shall briefly discuss such methods
here; for further details, see [6, 11].
3.1 Equations and fixed points

In linear algebra one works a lot with linear equations in several variables, and
Gaussian elimination is a central method for solving such equations. There are
also other faster methods, so-called iterative methods, for linear equations. But
what about nonlinear equations? For instance, consider the system in two vari-
ables x 1 and x 2 :
x 12 − x 1 x 2−3 + cos x 1 = 1
5x 14 + 2x 13 − tan(x 1 x 28 ) = 3
Clearly, such equations can be very hard to solve. The general problem is to solve
the equation
F (x) = 0 (3.1)
for a given function F : Rn → Rn . If F (x) = 0 we call x a root of F (or of the equa-
tion). The example above is equivalent to finding roots in F (x) = (F 1 (x), F 2 (x))
where
F 1 (x) = x 12 − x 1 x 2−3 + cos x 1 − 1
F 2 (x) = 5x 14 + 2x 13 − tan(x 1 x 28 ) − 3
In particular, if F (x) = Ax − b where A is an n × n matrix and b ∈ Rn , then we are
back to linear equations (a square system). More generally one may consider
equations G(x) = 0 where G : Rn → Rm , but we here only discuss the case m = n.
29
Often the problem F (x) = 0 has the following form, or may be rewritten to it:
K (x) = x. (3.2)
for some function K : Rn → Rn . This corresponds to the special choice F (x) =

K (x) − x. A point x ∈ Rn such that x = K (x) is called a fixed point of the func-
tion K . In finding such a fixed point it is tempting to use the following iterative
method: choose a starting point x 0 and repeat the following iteration
x k+1 = K (x k ) for k = 1, 2, . . . (3.3)
This is called a fixed-point iteration. We note that if K is continuous and this

procedure converges to some point x ∗ , then x ∗ must be a fixed point. The fixed-
point iteration is an extremely simple algorithm, and very easy to implement.
Perhaps surprisingly, it also works very well for many such problems. Let ² > 0
denote a small error tolerance used for stopping the process, e.g. 10−6 .
Fixed-point algorithm:
1. Choose an initial point x 0 , let x = x 0 and err = 1.
2. while err > ² do
(i) Compute x 1 = K (x)
(ii) Compute err = kx 1 − xk
(iii) Update x := x 1
When does the fixed-point iteration work? Let k · k be a fixed norm, e.g. the
Eulidean norm, on Rn . We say that the function K : Rn → Rn is a contraction if
there is a constant 0 ≤ c < 1 such that
kK (x) − K (y)k ≤ ckx − yk (x, y ∈ Rn ).
We also say that K is c-Lipschitz in this case. The following theorem is called
the Banach contraction principle. It also holds in Banach spaces, i.e., complete
normed vector spaces (possibly infinite-dimensional).
30
Theorem 3.1. Assume that K is c-Lipschitz with 0 < c < 1. Then K has a
unique fixed point x ∗ . For any starting point x 0 the fixed-point iteration (3.3)
generates a sequence {x k }∞k=0
that converges to x ∗ . Moreover
kx k+1 − x ∗ k ≤ ckx k − x ∗ k for k = 0, 1, . . . (3.4)
so that
kx k − x ∗ k ≤ c k kx 0 − x ∗ k.
Proof: First, note that if both x and y are fixed points of K , then
kx − yk = kK (x) − K (y)k ≤ ckx − yk
which means that x = y (as c < 1); therefore K has at most one fixed point. Next,
we compute
kx k+1 − x k k = kK (x k ) − K (x k−1 )k ≤ ckx k − x k−1 k = · · · ≤ c k kx 1 − x 0 k
so
kx m − x 0 k = k m−1 (x k+1 − x k )k ≤ m−1
P P
k=0 k=0
kx k+1 − x k k
Pn−1 k
≤ ( k=0 c )kx 1 − x 0 k ≤ (1/(1 − c))kx 1 − x 0 k
From this we derive that {x k } is a Cauchy sequence; as we have
kx s+m − x s k = kK (x s+m−1 ) − K (x s−1 )k ≤ ckx s+m−1 − x s−1 k = · · ·

≤ c s kx m − x 0 k ≤ (c s /(1 − c))kx 1 − x 0 k.
and 0 < c < 1. Any Cauchy sequence in Rn has a limit point, so x m → x ∗ for some
x ∗ ∈ Rn . We now prove that the limit point x ∗ is a (actually, the) fixed point:
kx ∗ − K (x ∗ )k ≤ kx ∗ − x m k + kx m − K (x ∗ )k
= kx ∗ − x m k + kK (x m−1 ) − K (x ∗ )k
≤ kx ∗ − x m k + ckx m−1 − x ∗ k
and letting m → ∞ here gives kx ∗ − K (x ∗ )k ≤ 0 so x ∗ = K (x ∗ ) as desired.

Finally,
kx k+1 − x ∗ k = kK (x k ) − K (x ∗ )k ≤ ckx k − x ∗ k ≤ c k+1 kx 0 − x ∗ k
which completes the proof.

We see that x k → x ∗ linearly, and that Equation (3.4) gives an estimate on
the convergence speed.
31
3.2 Newton’s method
We return to the main problem (3.1). Our goal is to present Newton’s method,
a highly efficient iterative method for solving this equation. The method con-
structs a sequence
x 0, x 1, x 2, . . .
in Rn which, hopefully, converges to a root x ∗ of F , so F (x ∗ ) = 0. The idea is to
linearize F at the current iterate x k and choose the next iterate x k+1 as a zero of
this linearized function. The first order Taylor approximation of F at x k is
TF1 (x k ; x) = F (x k ) + F 0 (x k )(x − x k ).
We solve TF1 (x k ; x) = 0 for x and define the next iterate as x k+1 = x. This gives
x k+1 = x k − F 0 (x k )−1 F (x k ) (3.5)
which leads to Newton’s method. One here assumes that the derivative F 0 is
known analytically. Note that we do not (and hardly ever do!) compute the in-
verse of the matrix F 0 .
Newton’s method for nonlinear equations:

1. Choose an initial point x 0 .
2. For k = 0, 1, . . . do
(i) Find the direction p by solving F 0 (x k )p = −F (x k )
(ii) Update: x k+1 = x k + p
In the main step, which is to compute p, one needs to solve an n × n linear

system of equations where the coefficient matrix is the Jacobi matrix of F , eval-
uated at x k . In MAT1110 [8] we implemented the following code for Newton’s
method for nonlinear equations:
function x=newtonmult(x0,F,J)
% Performs Newtons method in many variables
% x: column vector which contains the start point
% F: computes the values of F
% J: computes the Jacobi matrix
epsilon=0.0000001; N=30; n=0;
x=x0;
32
while norm(F(x)) > epsilon && n<=N
x=x-J(x)\F(x);
fval = F(x);
%fprintf(’itnr=%2d x=[%13.10f,%13.10f] F(x)=[%13.10f,%13.10f]\n’,...
% n,x(1),x(2),fval(1),fval(2))
n = n + 1;
end
This code also terminates after a given number of iterations, and when a given
accuracy is obtained. Note that this function should work for any function F ,
since it is a parameter to the function.
The convergence of Newton’s method may be analyzed using fixed point the-
ory since one may view Newton’s method as a fixed point iteration. Observe that
the Newton iteration (3.5) may be written
x k+1 = G(x k )
where G is the function

G(x) = x − F 0 (x)−1 F (x)
From this it is possible to show that if the starting point is sufficiently close to
the root, then Newton’s method will converge to this root at a linear conver-
gence rate. With more clever arguments one may show that the convergence
rate of Newton’s method is even faster: it has superlinear convergence. Actually,
for many functions one even has quadratic convergence rate. The proof of the
following convergence theorem relies purely on Taylor’s theorem.
Theorem 3.2. Assume that Newton’s method with initial point x 0 produces a
sequence {x k }∞
k=0
which converges to a solution x ∗ of (3.1). Then the conver-
gence rate is superlinear.
Proof: From Taylor’s theorem for vector-valued functions, Theorem 1.7, in

the point x k we have
0 = F (x ∗ ) = F (x k + (x ∗ − x k )) = F (x k ) + F 0 (x k )(x ∗ − x k ) + O(kx k − x ∗ k)
Multiplying this equation by F 0 (x k )−1 (which is assumed to exist!) gives
x k − x ∗ − F 0 (x k )−1 F (x k ) = O(kx k − x ∗ k)
Combining this with the Newton iteration x k+1 = x k − F 0 (x k )−1 F (x k ) we get
x k+1 − x ∗ = O(kx k − x ∗ k).
33
So
lim kx k+1 − x ∗ k/kx k − x ∗ k = 0
k→∞
This shows the superlinear convergence.

The previous result is interesting, but it does not say how near to the root the
starting point need to be in order to get convergence. This is the next topic. Let
F : U → Rn where U is an open, convex set in Rn . Consider the conditions on the
derivative F 0
(i ) kF 0 (x) − F 0 (y)k ≤ Lkx − yk for all x, y ∈ U

(3.6)
(i i ) kF 0 (x 0 )k ≤ K for some x 0 ∈ U
where K and L are some constants. Here kF 0 (x 0 )k denotes the operator norm of
the square matrix F 0 (x 0 ) which is defined as
kF 0 (x 0 )k = sup kF 0 (x 0 )xk
kxk=1
and it measures how much the operator F 0 (x 0 ) may increase the size of vectors.
The following convergence result for Newton’s method is known as Kantorovich’
theorem.
Theorem 3.3 (Kantorovich’ theorem). Let F : U → Rn be a differentiable

function satisfying (3.6). Assume that B̄ (x 0 ; 1/(K L)) ⊆ U and that
kF 0 (x 0 )−1 F (x 0 )k ≤ 1/(2K L).
Then F 0 (x) is invertible for all x ∈ B (x 0 ; 1/(K L)) and Newton’s method with
initial point x 0 will produce a sequence {x k }∞ k=0
contained in B (x 0 ; 1/(K L))
and limk→∞ x k = x ∗ for some limit point x ∗ ∈ B̄ (x 0 ; 1/(K L)) with
F (x ∗ ) = 0.
A proof of this theorem is quite long (but not very difficult to understand) [8].
One disadvantage with Newton’s method is that one needs to know the Ja-
cobi matrix F 0 explicitly. For complicated functions, or functions being the out-
put of a simulation, the derivative may be hard or impossible to find. The quasi-
Newton method, also called the secant-method, is then a good alternative. The
idea is to approximate F 0 (x k ) by some matrix B k and to compute the new search
direction from
B k p = −F (x k )
34
A practical method for finding these approximations B 1 , B 2 , . . . is Broyden’s method.
Provided that the previous iteration gave x k , with Broyden’s method we com-
pute x k+1 by following the search direction, define s k = x k+1 − x k and y k =
F (x k+1 ) − F (x k ), and compute B k+1 from B k by the formula
B k+1 = B k + (1/s Tk s k )(y k − B k s k )s Tk . (3.7)
It can be shown that B k approximates the Jacobi matrix F 0 (x k ) well in each itera-
tion. Moreover, the update given in (3.7) can be done efficiently (it is a rank one
update of B k ).
Algorithm: Broyden’s method:

1. Choose an initial point x 0 , and an initial B 0 .
2. For k = 0, 1, . . . do
(i) Find direction p k by solving B k p = −F (x k )
(ii) Use line search (see Section 4.2) along direction p k to find αk
(iii) Update: x k+1 := x k + αk p k
s k := x k+1 − x k
y k := F (x k+1 ) − F (x k )
compute B k+1 from (3.7).
Note that this algorithm also computes an α through what we call a line
search, to attempt to find the optimal distance to follow the search direction.
We do not here specify how this line search can be performed. Also, we do not
specify how the initial values can be chosen. For B 0 , any approximation of the
Jacobian of F at x 0 can be used, using a numerical differentiation method of
your own choosing. One can show that Broyden’s method, under certain as-
sumptions, also converges superlinearly, see [11].

1. Show that the problem of solving nonlinear equations (3.1) may be trans-
formed into a nonlinear optimization problem. (Hint: Square each component
function and sum these up!)
2. Let T : R → R be given by T (x) = (3/2)(x−x 3 ). Draw the graph of this function,
and determine its fixed points. Let x ∗ denote the largest fixed point. Find, using
your graph, an interval I containing x ∗ such that the fixed point algorithm with
35
converge towards x ∗ . Then try the fixed point
an initial point in I will guaranteedp
algorithm with starting point x 0 = 5/3.
p
3. Let α ∈ R+ be fixed, and consider f (x) = x 2 − α. Then the zeros are ± α.
Write down the Newton’s iteration for this problem. Let α = 2 and compute the
first three iterates in Newton’s method when x 0 = 1.
4. For any vector norm k·k on Rn , we can more generally define a corresponding
operator norm for n × n matrices by
kAk = sup kAxk.
kxk=1
a. Explain why this supremum is attained.

Pn
In the rest of this exercise we will use the vector norm kxk = kxk1 = j =1 |x j | on
Rn .
b. For n = 2, draw the sublevel set {x ∈ R2 : kxk1 ≤ 1}.
c. Show that f (x) = kAxk is convex for any n, and show that the maxi-
mum of f on the set {x : kxk = 1} is attained in a point x on the form ±e k .
Hint: For the second statement, use Jensen’s inequality with x j = ±e j
(Theorem 2.4).
d. Show that, for any n × n-matrix A, kAk = supk ni=1 |a i k |, where a i j are
P
the entries of A (i.e. the biggest sum of absolute values in a column).
5. Consider a linear map T : Rn → Rm given by T (x) = Ax where A is an n × n

matrix. When is T a contraction w.r.t. the vector norm k · k1 ?
6. Test the function newtonmult on the equations given initially in Section 3.1.
7. In this exercise we will implement Broyden’s method with Matlab.
a. Given a value x 0 , implement a function which computes an estimate
of F 0 (x 0 ) by estimating the partial derivatives of F , using a numerical dif-
ferentiation method and step size of you own choosing.
b. Implement a function
function x=broyden(x0,F)
which returns an estimate of a zero of F using Broyden’s method. Your

method should set B 0 to be the matrix obtained from the function in a.
Just indicate where line search along the search direction should be per-
formed in your function, without implementing it. The function should
work as newtonmult in that it terminates after a given number of itera-
tions, or after precision of a given accuracy has been obtained.
36
Chapter 4
Unconstrained optimization
How can we know whether a given point x ∗ is a minimum, local or global, of

some given function f : Rn → R? And how can we find such a point x ∗ ?
These are, of course, some main questions in optimization. In order to give
good answers to these questions we need optimality conditions. They provide
tests for optimality, and serve as the basis for algorithms. We here focus on dif-
ferentiable functions; the corresponding results for the nondifferentiable case
are more difficult (but they exist, and are based on convexity, see [5, 13]).
For unconstrained problems it is not difficult to find powerful optimality
conditions from Taylor’s theorem for functions in several variables.
4.1 Optimality conditions

In order to establish optimality conditions in unconstrained optimization, Tay-
lor’s theorem is the starting point, see Section 1.3. We only consider mini-mization
problems, as maximization problems are turned into minimization problems by
multiplying the function f by −1.
First we look at some necessary optimality conditions.
Theorem 4.1. Assume that f : Rn → R has continuous partial derivatives, and

assume that x ∗ is a local minimum of f . Then
∇ f (x ∗ ) = 0. (4.1)
If, moreover, f has continuous second order partial derivatives, then ∇2 f (x ∗ )

is positive semidefinite.
37
Proof: Assume that x ∗ is a local minimum of f and that ∇ f (x ∗ ) 6= 0. Let h =
−α∇ f (x ∗ ) where α > 0. Then ∇ f (x ∗ )T h = −αk∇ f (x ∗ )k2 < 0 and by continuity
of the partial derivatives of f , ∇ f (x)T h < 0 for all x in some neighborhood of x ∗ .
From Theorem 1.4 (first order Taylor) we obtain
f (x ∗ + h) − f (x ∗ ) = ∇ f (x ∗ + t h)T h (4.2)
for some t ∈ (0, 1) (depending on α). By choosing α small enough, the right-hand
side of (4.2) is negative (as just said), and so f (x ∗ +h) < f (x ∗ ), contradicting that
x ∗ is a local minimum. This proves that ∇ f (x ∗ ) = 0.
To prove the second statement, we get from Theorem 1.5 (second order Tay-
lor)
1
f (x ∗ + h) = f (x ∗ ) + ∇ f (x ∗ )T h + h T ∇2 f (x ∗ + t h)h
2
∗ 1 T 2
= f (x ) + h ∇ f (x + t h)h (4.3)
2
If ∇2 f (x ∗ ) is not positive semidefinite, there is an h such that h T ∇2 f (x ∗ )h < 0

and, by continuity of the second order partial derivatives, h T ∇2 f (x)h < 0 for
all x in some neighborhood of x ∗ . But then (4.3) gives f (x ∗ + h) − f (x ∗ ) < 0; a
contradiction. This proves that ∇2 f (x) is positive semidefinite.
The two necessary optimality conditions in Theorem 4.1 are called the first-
order and the second-order conditions, respectively. The first-order condition
says that the gradient must be zero at x ∗ , and such a point if often called a sta-
tionary point. The second-order condition may be interpreted by f being "con-
vex locally" at x ∗ , although this is not a precise term. A stationary point which
is neither a local minimum or a local maximum is called a saddle point. So, ev-
ery neighborhood of a saddle point contains points with larger and points with
smaller f -value.
Theorem 4.1 gives a connection to nonlinear equations. In order to find a
stationary point we may solve ∇ f (x) = 0, which is a n × n (usually nonlinear)
system of equations. (The system is linear whenever f is a quadratic function.)
One may solve this equation, for instance, by Newton’s method and thereby get
a candidate for a local minimum. Sometimes this approach works well, in par-
ticular if f has a unique local minimum and we have an initial point "sufficiently
close". However, there are other better methods which we discuss later.
It is important to point out that any algorithm for finding a minimum of f
has to be able to find a stationary point. Therefore algorithms in this area are
typically iterative and move to gradually better points where the norm of the
gradient becomes smaller, and eventually almost equal to zero.
38
Example 4.2. Consider a convex quadratic function
f (x) = (1/2) x T Ax − b T x
where A is the (symmetric) Hessian matrix is (constant equal to) A and this ma-
trix is positive semidefinite. Then ∇ f (x) = Ax − b so the first-order necessary
optimality condition is
Ax = b
which is a linear system of equations. If f is strictly convex, which happens when

A is positive definite, then A is invertible and the unique solution is x ∗ = A −1 b.
Thus, there is only one candidate for a local (and global) minimum, namely
x ∗ = A −1 b. Actually, this is indeed a unique global minimum, but to verify this
we need a suitable argument. One way is to use convexity (with results pre-
sented later) or an alternative is to use sufficient optimality conditions which
we discuss next. The linear system Ax = b, when A is positive definite, may be
solved by several methods. A popular, and very fast, method is the conjugate
gradient method. This method, and related methods, are discussed in detail in
the course INF-MAT4360 Numerical linear algebra [10]. ♣
In order to present a sufficient optimality condition we need a result from

linear algebra. Recall from linear algebra that a symmetric positive definite ma-
trix has only real eigenvalues and all these are positive.
Proposition 4.3. Let A be an n ×n symmetric positive definite matrix, and let

λn > 0 denote its smallest eigenvalue. Then
h T Ah ≥ λn khk2 (h ∈ Rn ).
Proof: By the spectral theorem there is an orthogonal matrix V (containing

the orthonormal eigenvectors as its columns) such that
A = V DV T
where D is the diagonal matrix with the eigenvalues λ1 , . . . , λn on the diagonal.

Let h ∈ Rn and define y = V T h. Then kyk = khk and
n n
h T Ah = h T V DV T h = y T D y = λi y i2 ≥ λn y i2 = λn kyk2 = λn khk2 .
X X
j =1 i =1
39
Next we consider sufficient optimality conditions in the general differen-
tiable case. These conditions are used to prove that a candidate point (say, found
by an algorithm) is really a local minimum.
Theorem 4.4. Assume that f : Rn → R has continuous second order partial

derivatives in some neighborhood of a point x ∗ . Assume that ∇ f (x ∗ ) = 0 and
∇2 f (x ∗ ) is positive definite. Then x ∗ is a local minimum of f .
Proof: From Theorem 1.6 (second order Taylor) and Proposition 4.3 we get
f (x ∗ + h) = f (x ∗ ) + ∇ f (x ∗ )T h + 12 h T ∇2 f (x ∗ )h + ²(h)khk2
≥ f (x ∗ ) + 12 λn khk2 + ²(h)khk2
where λn > 0 is the smallest eigenvalue of ∇2 f (x ∗ ). Dividing here by khk2 gives

1
( f (x ∗ + h) − f (x ∗ ))/|hk2 = λn + ²(h)
2
Since limh→0 ²(h) = 0, there is an r such that for khk < r , |²(h)| < λn /4. This
implies that
( f (x ∗ + h) − f (x ∗ ))/|hk2 ≥ λn /4
for all h with khk < r . This proves that x ∗ is a local minimum of f .
We remark that the proof of the previous theorem actually shows that x ∗ is
a strict local minimum of f meaning that f (x ∗ ) is strictly smaller than f (x) for
all other points x in some neighborhood of x ∗ . Note the difference between
the necessary and the sufficient optimality conditions: a necessary condition is
that ∇2 f (x) is positive semidefinite, while a part of the sufficient condition is the
stronger property that ∇2 f (x) is positive definite.
Let us see what happens when we work with a convex function.
Theorem 4.5. Let f : Rn → R be a convex function. Then a local minimum is

also a global minimum. If, in addition, f is differentiable, then a point x ∗ is a
local (and then global) minimum of f if and only if
∇ f (x ∗ ) = 0.
Proof: Let x 1 be a local minimum. If x 1 is not a global minimum, there is an

x 2 6= x 1 with f (x 2 ) < f (x 1 ). Then for 0 < λ < 1
f ((1 − λ)x 1 + λx 2 ) ≤ (1 − λ) f (x 1 ) + λ f (x 2 ) < f (x 1 )
40
and this contradicts that f (x) ≥ f (x 1 ) for all x in a neighborhood of x ∗ . There-
fore x 1 must be a global minimum.
Assume f is convex and differentiable. Due to Theorem 4.1 we only need to
show that if ∇ f (x ∗ ) = 0, then x ∗ is a local and global minimum. So assume that
∇ f (x ∗ ) = 0. Then, from Theorem 2.10 we have
f (x) ≥ f (x ∗ ) + ∇ f (x ∗ )T (x − x ∗ )
for all x ∈ Rn . If ∇ f (x ∗ ) = 0, this directly shows that x ∗ is a global minimum.
4.2 Methods
Algorithms for unconstrained optimization are iterative methods that generate
a sequence of points with gradually smaller values on the function f which is to
be minimized. There are two main types of algorithms in unconstrained opti-
mization:
• Line search methods: Here one first chooses a search direction d k from
the current point x k , using information about the function f . Then one
chooses a step length αk so that the new point
x k+1 = x k + αk d k
has a small, perhaps smallest possible, value on the halfline {x k + αd k :

α ≥ 0}. αk describes how far one should go along the search direction.
The problem of choosing αk is a one-dimensional optimization problem.
Sometimes we can find αk exactly, and in such cases we refer to the method
as exact line search. In cases where αk can not be found analytically, algo-
rithms can be used to approximate how we can get close to the minimum
on the halfline. The method is then refered to as backtracking line search.
• Trust region methods: In these methods one chooses an approximation

fˆk to the function in some neighborhood of the current point x k . The
function fˆk is simpler than f and one minimizes fˆk (in the mentioned
neighborhood) and let the next iterate x k+1 be this minimizer.
These types are typically both based on quadratic approximation of f , but

they differ in the order in which one chooses search direction and step size. In
the following we only discuss the first type, the line search methods.
41
A very natural choice for search direction at a point x k is the negative gradi-
ent, d k = −∇ f (x k ). Recall that the direction of maximum increase of a (differen-
tiable) function f at a point x is ∇ f (x), and the direction of maximum decrease
is −∇ f (x). To verify this, Taylor’s theorem gives
1
f (x + h) = f (x) + ∇ f (x)T h + h T ∇2 f (x + t h)h.
2
So, for small h, the first order term dominates and we would like to make this
term small. By the Cauchy-Schwarz inequality1
∇ f (x)T h ≥ −k∇ f (x)k khk
and equality holds for h = −α∇ f (x) for some α ≥ 0. In general, we call h a de-
scent direction at x if ∇ f (x)T h < 0. Thus, if we move in a descent direction from
x and make a sufficiently small step, the new point has a smaller f -value. With
this background we shall in the following focus on gradient methods given by
x k+1 = x k + αk d k (4.4)
where the direction d k satisfies
∇ f (x k )T d k < 0 (4.5)
There are two gradient methods we shall discuss:
• If we choose the search direction d k = −∇ f (x k ), we get the steepest descent

method
x k+1 = x k − αk ∇ f (x k ).
In each step it moves in the direction of the negative gradient. Sometimes
this gives slow convergence, so other methods have been developed where
other choices of direction d k are made.
• An important method is Newton’s method
x k+1 = x k − αk (∇2 f (x k ))−1 ∇ f (x k ). (4.6)
This is the gradient method with d k = −(∇2 f (x k ))−1 ∇ f (x k ); this vector

d k is called the Newton step. The so-called pure Newton method is when
one simply chooses step size αk = 1 for each k. To interpret this method
consider the second order Taylor approximation of f in x k
f (x k + h) ≈ f (x k ) + ∇ f (x k )T h + (1/2)h T ∇2 f (x k )h
1 The Cauchy-Schwarz’ inequality says: |u · v | ≤ kuk kv k for u, v ∈ Rn .
42
If we minimize this quadratic function w.r.t. h, assuming ∇2 f (x k ) is posi-
tive definite, we get (see Exercise 8)
h = −(∇2 f (x k ))−1 ∇ f (x k )
which explains the Newton step.
In the following we follow the presentation in [1]. In a gradient method we

need to choose the step length. This is the one-dimensional optimization prob-
lem
min{ f (x + αd ) : α ≥ 0}.
Sometimes (maybe not too often) we may solve this problem exactly. Most prac-
tical methods try some candidate α’s and pick the one with smallest f -value.
Note that it is not necessary to compute the exact minimum (this may take too
much time). The main thing is to assure that we get a sufficiently large decrease
in f without making a too small step.
A popular step size rule is the Armijo Rule. Here one chooses (in advance)
parameters s, a reduction factor β satisfying 0 < β < 1, and 0 < σ < 1. Define the
integer
m k = min{m : m ≥ 0, f (x k ) − f (x k + βm sd k ) ≥ −σβm s∇ f (x k )T d k } (4.7)
and choose step length αk = βmk s. Here σ is typically chosen very small, e.g. σ =
10−3 . The parameter s fixes the search for step size to lie within the interval [0, s].
This can be important: for instance, we can set s so small that the initial step
size we try is within the domain of definition for f . According to [1] β is usually
chosen in [1/10, 1/2]. In the literature one may find a lot more information about
step size rules and how they may be adjusted to the methods for finding search
direction, see [1], [11].
Now, we return to the choice of search direction in the gradient method (4.4).
A main question is whether it generates a sequence {x k }∞ k=1
which converges to
a stationary point x ∗ , i.e., where ∇ f (x ∗ ) = 0. It turns out that this may not be the
case; one needs to be careful about the choice of d k to assure this convergence.
The problem is that if d k tends to be nearly orthogonal to ∇ f (x k ) one may get
into trouble. For this reason one introduces the following notion:
Definition 4.6 (Gradient related). {d k } is called gradient related to {x k } if

for any subsequence {x k p }∞
p=1 of {x k } converging to a nonstationary point,
then the corresponding subsequence {d k p }∞ p=1 of {d k } is bounded and
lim supp→∞ ∇ f (x k )T d k < 0.
43
What this condition assures is that kd k k is not too small or large compared
to k∇ f (x k )k and that the angle between the vectors d k and ∇ f (x k ) is not too
close to 90◦ . The proof of the following theorem may be found in [1].
Theorem 4.7. Let {x k }∞ k=0

be generated by the gradient method (4.4), where
{d k }k=0 is gradient related to {x k }∞
∞
k=0
and the step size αk is chosen using the
Armijo rule. Then every limit point of {x k }∞ k=0
is a stationary point.
We remark that in Theorem 4.7 the same conclusion holds if we use exact
minimization as step size rule, i.e., f (x k +αd k ) is minimized exactly with respect
to α.
A very important property of a numerical algorithm is its convergence speed.
Let us consider the steepest descent method first. It turns out that the conver-
gence speed for this algorithm is very well explained by its performance on min-
imizing a quadratic function, so therefore the following result is important. In
this theorem A is a symmetric positive definite matrix with eigenvalues λ1 ≥
λ2 ≥ · · · ≥ λn > 0.
Theorem 4.8. If the steepest descent method x k+1 = x k − αk ∇ f (x k ) using ex-

act line search is applied to the quadratic function f (x) = x T Ax where A is
positive definite, then (the minimum value is 0 and)
f (x k+1 ) ≤ m A f (x k )
where m A = ((λ1 − λn )/(λ1 + λn ))2 .
The proof may be found in [1]. Thus, if the largest eigenvalue is much larger
than the smallest one, m A will be nearly 1 and one typically have slow conver-
gence. In this case we have m A ≈ cond(A) where cond(A) = λ1 /λn is the condi-
tion number of the matrix A. So the rule is: if the condition number of A is small
we get fast convergence, but if cond(A) is large, there will be slow convergence.
A similar behavior holds for most functions f because locally near a minimum
point the function is very close to its second order Taylor approximation in x ∗
which is a quadratic function with A = ∇2 f (x ∗ ).
Thus, Theorem 4.8 says that the sequence obtained in the steepest descent
method converges linearly to a stationary point (at least for quadratic functions).
We now turn to Newton’s method.
44
Newton’s method for unconstrained optimization:
1. Choose an initial point x 0 .
2. For k = 1, 2, . . . do
(i) (Newton step) d k := −∇2 f (x)−1 ∇ f (x); η = −∇ f (x)T d k
(ii) (Stopping criterion) If η/2 < ²: stop.
(iii) (Line search) Use backtracking line search to find step size αk
(iv) (Update) x k+1 := x k + αk d k
Recall that the pure Newton step minimizes the second order Taylor ap-
proximation of f at the current iterate x k . Thus, if the function we minimize
is quadratic, we are done in one step. Similarly, if the function can be well ap-
proximated by a quadratic function, then one would expect fast convergence.
We shall give a result on the convergence of Newton’s method (see [2] for fur-
ther details). When A is symmetric, we let λmi n (A) denote that smallest eigen-
value of A.
For the convergence result we need a lemma on strictly convex functions.
Assume that x 0 is a starting point for Newton’s method and let S = {x ∈ Rn :
f (x) ≤ f (x 0 )}. We shall assume that f is continuous and convex, and this im-
plies that S is a closed convex set. We also assume that f has a minimum point
x ∗ which then must be a global minimum. Moreover the minimum point will be
unique due to a strict convexity assumption on f . Let f ∗ = f (x ∗ ) be the optimal
value.
The following lemma says that for a convex function as just described, a
point is nearly a minimum point (in terms of the f -value) whenever the gradient
is small in that point.
Lemma 4.9. Assume that f is convex as above and that λmi n (∇2 f (x)) ≥ m for
all x ∈ S. Then
1
f (x) − f ∗ ≤ k∇ f (x)k2 . (4.8)
2m
Proof: From Theorem 1.5, the second order Taylor’ theorem, we have for
each x, y ∈ S
f (y) = f (x) + ∇ f (x)T (y − x) + (1/2)(y − x)T ∇2 f (z)(y − x)
45
for suitable z on the line segment between x and y. Here a lower bound for the
quadratic term is (m/2)ky − xk2 , due to Proposition 4.3. Therefore
f (y) ≥ f (x) + ∇ f (x)T (y − x) + (m/2)ky − xk2 .
Now, fix x and view the expression on the right-hand side as a quadratic function
of y. This function is minimized for y ∗ = x −(1/m)∇ f (x). So, by inserting y = y ∗
above we get
f (y) ≥ f (x) + ∇ f (x)T (y ∗ − x) + (m/2)ky ∗ − xk2

1
= f (x) − 2m k∇ f (x)k2
This holds for every y ∈ S so letting y = x ∗ gives

1
f ∗ = f (x ∗ ) ≥ f (x) − k∇ f (x)k2
2m
which proves the desired inequality.
In the following convergence result we consider a function f as in Lemma
4.9. Moreover, we assume that the Hessian matrix is Lipschitz continuous over
S; this is essentially a bound on the third derivatives of f . We do not give the
complete proof (it is quite long), but consider some of the main ideas. Recall
the definition of the set S from above. Recall that the spectral norm of a square
matrix A is defined by
kAk2 = max kAxk.
kxk=1
It is a fact that kAk2 is equal to the largest singular value of A.
Theorem 4.10. Let f be convex and twice continuously differentiable and as-
sume that
(i) λmi n (∇2 f (x)) ≥ m for all x ∈ S.
(ii) k∇2 f (x) − ∇2 f (y)k2 ≤ Lkx − yk for all x ∈ S.
Moreover, assume that f has a minimum point x ∗ . Then Newton’s method

generates a sequence {x k }∞
k=0
that converges to x ∗ . From a certain k 0 the con-
vergence speed is quadratic.
Proof: Define f ∗ = f (x ∗ ). It is possible to show that there are numbers η and

γ > 0 with 0 < η ≤ m 2 /L such that the following holds for each k:
(i) If k∇ f (x k )k ≥ η, then
f (x k+1 ) ≤ f (x k ) − γ. (4.9)
46
(ii) If k∇ f (x k )k < η, then backtracking line search gives αk = 1 and
¶2
L L
µ
k∇ f (x k+1 )k ≤ k∇ f (x k )k . (4.10)
2m 2 2m 2
We omit the proof of this fact; it may be found in [2].

We may now prove that if k∇ f (x k )k < η, then also k∇ f (x k+1 )k < η. This fol-
lows from (ii) above and the fact (assumption) η ≤ m 2 /L. Therefore, as soon as
case (ii) occurs in the iterative process, in all the remaining iterations case (ii)
will occur. Actually, as soon as case (ii) “kicks in” quadratic convergence starts
as we shall see now. So assume that case (ii) occurs from a certain k. (Below we
show that such k must exist.)
L 2
Define µl = 2m 2 k∇ f (x l )k for each l ≥ k. Then 0 ≤ µk < 1/2 as η ≤ m /L.
From what we just saw and (4.10)
µl +1 ≤ µ2l (l ≥ k).
So (by induction)
l −k l −k
µl ≤ µk2 ≤ (1/2)2 (l = l , k + 1, . . .).
Next, from Lemma 4.9
1 2m 3 l −k+1
f (x k ) − f ∗ ≤ k∇ f (x l )k2 ≤ 2 (1/2)2 (l ≥ k).
2m L
This inequality shows that f (x l ) → f ∗ , and since the minimum point is unique,
we must have x l → x ∗ . Moreover, it follows that the convergence is quadratic.
It only remains to explain why case (ii) above indeed occurs for some k. In
each iteration of type (i) f is decreased by at least γ, as seen from equation (4.10),
so the number of such iterations must be bounded by
( f (x 0 ) − f ∗ )/γ
which is a finite number. Finally, the proof of the statements in connection with
(i) and (ii) above is quite long and one derives several inequalities using the con-
vexity properties of f .
From the proof it is also possible to say something about haw many itera-
tions that are needed to reach a certain accuracy. In fact, if ² > 0 a bound on the
number of iterations until f (x k ) ≤ f ∗ + ² is
2m 3
( f (x 0 ) − f ∗ )/γ + log2 log2 .
²L 2
47
Here γ is the parameter introduced in the proof above. The second term in this
expression (the logarithmic term) grows very slowly as ² is decreased, and it may
roughly be replaced by the constant 6. So, whenever the second stage (case (ii)
in the proof) occurs, the convergence is extremely fast, it takes about 6 more
Newton iterations. Note that quadratic convergence means, roughly, that the
number of correct digits in the answer doubles for every iteration.

1. Consider the function f (x 1 , x 2 ) = x 12 + ax 22 where a > 0 is a parameter. Draw
some of the level sets of f (for different levels) for each a in the set {1, 4, 100}.
Also draw the gradient in a few points on these level sets.
2. State and prove a theorem similar to Theorem 4.1 for maximization prob-
lems.
3. Let f (x) = x T Ax where A is a symmetric n × n matrix. Assume that A is in-
definite, so it has both positive and negative eigenvalues. Show that x = 0 is a
saddlepoint of f .
4. Let f (x 1 , x 2 ) = 4x 1 + 6x 2 + x 12 + 2x 22 . Find all stationary points and determine
if they are minimum, maximum or saddlepoints. Do the same for the function
g (x 1 , x 2 ) = 4x 1 + 6x 2 + x 12 − 2x 22 .
5. Let the function f be given by f (x 1 , x 2 ) = (x 1 − 1)2 + (x 2 − 2)2 + 1.
a. Compute the search direction d k which is chosen by the steepest de-
scent method in the point x k = (2, 3).
b. Compute in the same way the search direction d k which is chosen

when we instead use Newton’s method in the point x k = (2, 3).
6. The function f (x 1 , x 2 ) = 100(x 2 −x 12 )2 +(1−x 1 )2 is called the Rosenbrock func-

tion. Compute the gradient and the Hessian matrix at every point x. Find every
local minimum. Also draw some of the level sets (contour lines) of f using Mat-
lab.
7. Let f (x) = (1/2)x T Ax − b T x where A is a positive definite n × n matrix. Con-
sider the steepest descent method applied to the minimization of f , where we
assume exact line search is used. Assume that the search direction happens to
be equal to an eigenvector of A. Show that then the minimum is reached in just
one step.
8. Consider the second order Taylor approximation
T f2 (x; x + h) = f (x) + ∇ f (x)T h + (1/2)h T ∇2 f (x)h.
48
a. Show that ∇h T f2 = ∇ f (x) + ∇2 f (x)h.
b. Minimizing T f2 with respect to h implies solving ∇h T f2 = 0, i.e. ∇ f (x)+

∇2 f (x)h = 0 from a.. If ∇2 f (x) is positive definite, explain that it also is in-
vertible, so that this equation has the unique solution h = −(∇2 f (x k ))−1 ∇ f (x k ),
as previously noted for the Newton step.
9. Implement the steepest descent method. Test the algorithm on the functions
in exercises 4 and 6. Use different starting points.
10. What can go wrong when you apply Armijo’s rule (Equation (4.7)) to a func-
tion f where ∇2 f er negative definite (i.e. all eigenvalues of ∇2 f are negative)?
Hint: Substitute the Taylor approximation
f (x k + βm sd k ) ≈ f (x k ) + ∇ f (x k )T (βm sd k )
in Equation (4.7), and remember that σ there is chosen so that σ < 1.

11. Implement a function
function alpha=armijorule(f,df,x,d)
which returns α chosen according to the Armijo rule for a function f with the
given gradient, at point x, with search direction d . The function shuld compute
m k from Equation (4.7) with β = 0.2, s = 0.5, σ = 10−3 , and return α = βmk s.
12. Write a function
[xopt,numit]=newtonbacktrack(f,df,d2f,x0)
which performs Newton’s method for unconstrained optimization. The input

parameters are the function, its gradient, its Hesse matrix, and the initial point.
The function should also return the number of iterations, and at each iteration
write the corresponding function value. The function should use backtracking
line search with the function armijorule from the previous exercise. Test the
algorithm on the functions in exercises 4 and 6. Use different starting points.
13. Let us return to the maximum likelihood example 1.3.
a. Run the function newtonbacktrack with parameters being the func-

tion f and its and derivaties defined as in Example 1.3 with n = 10 and
x = (0.4992, −0.8661, 0.7916, 0.9107, 0.5357, 0.6574, 0.6353, 0.0342, 0.4988, −0.4607)
Use the start value α0 = 0 for Newtons method. What estimate for the
minimum of f (and thereby α) did you obtain?
49
b. The ten measurements from a. were generated from a probability dis-
tribution where α = 0.5. The answer you obtained was quite far from this.
Let us therefore take a look at how many measurements we should use in
order to get quite precise estimates for α. You can use the function
function ret=randmuon(alpha,m,n)
to generate an m × n-matrix with measurements generated with a proba-

bility distribution with a given parameter α. This function can be found
at the homepage of the book.
With α = 0.5, generate n = 10 measurements with the help of the function
randmuon, and find the maximum likelihood estimate as above. Repeat
this 10 times, and plot the ten estimates you obtain. Repeat for n = 1000,
and for n = 100000 (in all cases you are supposed to plot 10 maximum
likelihood estimates). How many measurements do we need in order to
obtain maximum likelihood estimates which are reliable?
Note that it is possible for the maximum likelihood estimates you obtain
here to be outside the domain of definition [−1, 1]. You need not take this
into account.
50
Chapter 5
Constrained optimization - theory
In this chapter we consider constrained optimization problems. A general opti-

mization problem is
minimize f (x) subject to x ∈ S
where S ⊆ Rn is a given set and f : S → R. We here focus on a very general op-

timization problem which often occurs in applications. Consider the nonlinear
optimization problem with equality/inequality constraints
minimize f (x)
subject to
(5.1)
h i (x) = 0 (i ≤ m)
g j (x) ≤ 0 ( j ≤ r )
where f , h 1 , h 2 , . . . , h m and g 1 , g 2 , . . . , g r are continuously differentiable functions

from Rn into R. A point x satisfying all the m + r constraints will be called feasi-
ble. Thus, we look for a feasible point with smallest f -value.
Our goal is to establish optimality conditions for this problem, starting with
the special case with only equality constraints. Then we discuss algorithms for
solving this problem. Our presentation is strongly influenced by [2] and [1].
5.1 Equality constraints and the Lagrangian

Consider the nonlinear optimization problem with equality constraints
51
minimize f (x)
subject to (5.2)
h i (x) = 0 (i ≤ m)
where f and h 1 , h 2 , . . . , h m are continuously differentiable functions from Rn into

R. We introduce the vector field H = (h 1 , h 2 , . . . , h m ), so H : Rn → Rm and H (x) =
(h 1 (x), h 2 (x), . . . , h m (x)).
We first establish necessary optimality conditions for this problem. A point
x ∈ Rn is called regular if the gradient vectors ∇h i (x ∗ ) (i ≤ m) are linearly inde-
∗
pendent.
Theorem 5.1. Let x ∗ be a local minimum in problem (5.1) and assume that
x ∗ is a regular point. Then there is a unique vector λ∗ = (λ∗1 , λ∗2 , . . . , λ∗m ) ∈ Rm
such that
m
∇ f (x ∗ ) + λ∗i ∇h i (x ∗ ) = 0.
X
(5.3)
i =1
If f and each h i are twice continuously differentiable, then the following also
holds
m
h T (∇2 f (x ∗ ) + λ∗i ∇2 h i (x ∗ ))h ≥ 0 for all h ∈ T (x ∗ )
X
(5.4)
i =1
where T (x ) is the subspace T (x ∗ ) = {h ∈ Rn : ∇h i (x ∗ ) · h = 0 (i ≤ m)}.

∗
The numbers λ∗i in this theorem are called the Lagrangian multipliers. Note
that the Lagrangian multiplier vector λ∗ is unique; this follows directly from the
linear independence assumption as x ∗ is assumed regular. The theorem may
also be stated in terms of the Lagrangian function L : Rn × Rm → R given by
m
L(x, λ) = f (x) + λi h i (x) = f (x) + λT H (x) (x ∈ Rn , λ ∈ Rm ).
X
i =1
Then
∇x L(x, λ) = ∇ f (x) + λi ∇h i
X
i
∇λ L(x, λ) = H (x).
52
h 2 (x) = b 2
h 1 (x) = b 1
Figure 5.1: The two surfaces h 1 (x) = b 1 og h 2 (x) = b 2 intersect each other in a
curve. Along this curve the constraints are fulfilled
Therefore, the first order conditions in Theorem 5.1 may be rewritten as follows
∇x L(x ∗ , λ∗ ) = 0, ∇λ L(x ∗ , λ∗ ) = 0.
Here the second equation simply means that H (x) = 0. These two equations say
that (x ∗ , λ∗ ) is a stationary point for the Lagrangian, and it is a system of n + m
(possibly nonlinear) equations in n + m variables.
We may interpret the theorem in the following way. At the point x ∗ the linear
subspace T (x ∗ ) consist of the “first order feasible directions”. Actually, if each h i
is linear, then T (x ∗ ) consists of those h such that x ∗ + h is feasible, i.e., h i (x ∗ +
h) = 0 for each i ≤ m. Thus, (5.3) says that in a local minimum x ∗ the gradient
∇ f (x ∗ ) is orthogonal to the subspace T (x ∗ ) of the first order feasible variations.
This is reasonable since otherwise there would be a feasible direction in which
f would decrease. In Figure 5.1 we have plotted a curve where two constraints
are fulfilled. In Figure 5.2 we have then shown an interpretation of Theorem 5.1.
Note that this necessary optimality condition corresponds to the condition

∇ f (x ∗ ) = 0 in the unconstrained case. The second condition (5.4) is a sim-
ilar generalization of the second order condition in Theorem 4.1 (saying that
∇2 f (x ∗ ) is positive semidefinite).
It is possible to prove the theorem by eliminating variables based on the
equations and thereby reducing the problem to an unconstrained one. Another
proof, which we shall present below is based on the penalty approach. This ap-
proach is also interesting as it leads to algorithms for actually solving the prob-
lem.
53
∇h 1 (x ∗ ) ∇ f (x ∗ )
CO COC
C C
h 2 (x ∗ ) = b 2C C
C
: ∇h 2 (x ∗ )
C
C C
Cs :

h 1 (x ∗ ) = b 1
Figure 5.2: ∇ f (x ∗ ) as a linear combination of ∇h 1 (x ∗ ) and ∇h 2 (x ∗ )
Proof: (Theorem 5.1) For k = 1, 2, . . . consider the modified objective function
F k (x) = f (x) + (k/2)kH (x)k2 + (α/2)kx − x ∗ k2
where x ∗ is the local minimum under consideration, and α is a positive con-

stant. The second term is a penalty term for violating the constraints and the
last term is there for proof technical reasons. As x ∗ is a local minimum there is
an ² > 0 such that f (x ∗ ) ≤ f (x) for all x ∈ B̄ (x ∗ ; ²). Choose now an optimal solu-
tion x k of the problem min{F k (x) : x ∈ B̄ (x ∗ ; ²)}; the existence here follows from
the extreme value theorem (F k is continuous and the ball is compact). For every
k
F k (x k ) = f (x k ) + (k/2)kH (x k )k2 + (α/2)kx k − x ∗ k2 ≤ F k (x ∗ ) = f (x ∗ ).
By letting k → ∞ in this inequality we conclude that limk→∞ kH (x k )k = 0. So
every limit point x̄ of the sequence {x k } satisfies H (x̄) = 0. The inequality above
also implies (by dropping a term on the left-hand side) that f (x k ) + (α/2)kx k −
x ∗ k2 ≤ f (x ∗ ) for all k, so by passing to the limit we get
f (x̄) + (α/2)kx̄ − x ∗ k2 ≤ f (x ∗ ) ≤ f (x̄)
where the last inequality follows from the facts that x̄ ∈ B̄ (x ∗ ; ²) and H (x̄) = 0.
Clearly, this gives x̄ = x ∗ . We have therefore shown that the sequence {x k } con-
verges to the local minimum x ∗ . Since x ∗ is the center of the ball B̄ (x ∗ ; ²), the
points x k lie in the interior of S for suitably large k. The conclusion is then that
x k is the unconstrained minimum of F k when k is sufficiently large. We may
therefore apply Theorem 4.1 so ∇F k (x k ) = 0, so
0 = ∇F k (x k ) = ∇ f (x k ) + k H 0 (x k )T H (x k ) + α(x k − x ∗ ). (5.5)
54
Here H 0 denotes the Jacobi matrix of H . For suitably large k the matrix H 0 (x k )H 0 (x k )T
is invertible (as the rows of H 0 (x k ) are linearly independent due to rank(H 0 (x ∗ )) =
m and a continuity argument). Multiply equation (5.5) by (H 0 (x k )H 0 (x k )T )−1 H 0 (x k )
to obtain
k H (x k ) = −(H 0 (x k )H 0 (x k )T )−1 H 0 (x k )(∇ f (x k ) + α(x k − x ∗ )).
Letting k → ∞ we see that the sequence {k H (x k )} is convergent and its limit

point λ∗ is given by
λ∗ = −(H 0 (x ∗ )H 0 (x ∗ )T )−1 H 0 (x ∗ )∇ f (x ∗ ).
Finally, by passing to the limit in (5.5) we get
0 = ∇ f (x ∗ ) + H 0 (x ∗ )T λ∗
This proves the first part of the theorem; we omit proving the second part which
may be found in [1].
The first order necessary condition (5.3) along with the constraints H (x) = 0
is a system of n+m equations in the n+m variables x 1 , x 2 , . . . , x n and λ1 , λ2 , . . . , λm .
One may use e.g. Newton’s method for solving these equations and find a can-
didate for an optimal solution. But usually there are better numerical methods
for solving the optimization (5.1), as we shall see soon.
Necessary optimality conditions are used for finding a candidate solution
for being optimal. In order to verify optimality we need sufficient optimality
conditions.
Theorem 5.2. Assume that f and H are twice continuously differentiable

functions. Moreover, let x ∗ be a point satisfying the first order necessary op-
timality condition (5.3) and the following condition
y T ∇2 L(x ∗ , λ∗ )y > 0 for all y 6= 0 with H 0 (x ∗ )T y = 0 (5.6)
where ∇2 L(x ∗ , λ∗ ) is the Hessian of the Lagrangian function with second or-
der partial derivatives with respect to x. Then x ∗ is a (strict) local minimum
of f subject to H (x) = 0.
This theorem may be proved (see [1] for details) by considering the aug-
mented Lagrangian function
L c (x, λ) = f (x) + λT H (x) + (c/2)kH (x)k2 (5.7)
55
where c is a positive scalar. This is in fact the Lagrangian function in the modi-
fied problem
minimize f (x) + (c/2)kH (x)k2 subject to H (x) = 0 (5.8)
and this problem must have the same local minima as the problem of minimiz-
ing f (x) subject to H (x) = 0. The objective function in (5.8) contains the penalty
term (c/2)kH (x)k2 which may be interpreted as a penalty (increased function
value) for violating the constraint H (x) = 0. In connection with the proof of
Theorem 5.2 based on the augmented Lagrangian one also obtains the follow-
ing interesting and useful fact: if x ∗ and λ∗ satisfy the sufficient conditions in
Theorem 5.2 then there exists a positive c̄ such that for all c ≥ c̄ the point x ∗ is
also a local minimum of the augmented Lagrangian L c (·, λ∗ ). Thus, the original
constrained problem has been converted to an unconstrained one involving the
augmented Lagrangian. And, as we know, unconstrained problems are easier to
solve (solve the equations saying that the gradient is equal to zero).
5.2 Inequality constraints and KKT

We now consider the general nonlinear optimization problem where there are
both equality and inequality constraints. The problem is then
minimize f (x)
subject to
(5.9)
h i (x) = 0 (i ≤ m)
g j (x) ≤ 0 ( j ≤ r )
We assume, as usual, that all these functions are continuously differentiable

real-valued functions defined on Rn . In short form we write the constraints as
H (x) = 0 and G(x) ≤ 0 where we let H = (h 1 , h 2 , . . . , h m ) and G = (g 1 , g 2 , . . . , g r ).
A main difficulty in problems with inequality constraints is to determine
which of the inequalities that are active in an optimal solution. If we knew the
active inequalities, we would essentially have a problem with only equality con-
straints, H (x) = 0 plus the active equalities, i.e., a problem of the form discussed
in the previous section. For very small problems (solvable by hand-calculation)
a direct method is to consider all possible choices of active inequalities and solve
the corresponding equality-constrained problem by looking at the Lagrangian
function.
Interestingly, one may also transform the problem (5.9) into the following
equality-constrained problem
56
minimize f (x)
subject to
(5.10)
h i (x) = 0 (i ≤ m)
2
g j (x) + z j = 0 ( j ≤ r ).
We have introduced extra variables z j , one for each inequality. The square
of these variables represent slack in each of the original inequalities. Note that
there is no sign constraint on z j . Clearly, the problems (5.9) and (5.10) are equiv-
alent. This transformation can also be useful computationally. Moreover, it is
useful theoretically as one may apply the optimality conditions from the previ-
ous section to problem (5.10) to derive the theorem below (see [1]).
We now present a main result in nonlinear optimization. It gives optimality
conditions for this problem, and these conditions are called the Karush-Kuhn-
Tucker conditions, or simply the KKT conditions. In order to present the KKT
conditions we introduce the Lagrangian function L : Rn × Rm × Rr → R given by
m r
L(x, λ, µ) = f (x) + λi h i (x) + µ j g j (x) = f (x) + λT H (x) + µT G(x). (5.11)
X X
i =1 j =1
The gradient of L with respect to x is given by

m r
∇x L(x, λ, µ) = ∇ f (x) + λi ∇h i (x) + µ j ∇g j (x).
X X
i =1 j =1
The Hessian matrix of L at (x, λ, µ) containing second order partial derivatives

of L with respect to x will be denoted by ∇x x L(x, λ, µ). Finally, the indices of the
active inequalities at x is denoted by A(x), so A(x) = { j ≤ r : g j (x) = 0}. A point
x is called regular if {∇h 1 (x), . . . ∇h m (x)} ∪ {∇g i (x) : i ∈ A(x)} is linearly indepen-
dent.
In the following theorem the first part contains necessary conditions while
the second part contains sufficient conditions for optimality.
Theorem 5.3. Consider problem (5.9) with the usual differentiability as-
sumptions.
(i ) Let x ∗ be a local minimum of this problem and assume that x ∗ is
a regular point. Then there are unique Lagrange multiplier vectors λ∗ =
57
(λ∗1 , λ∗2 , . . . , λ∗m ) and µ∗ = (µ∗1 , µ∗2 , . . . , µ∗r ) such that
∇x L(x ∗ , λ∗ , µ∗ ) = 0
µ∗j ≥ 0 (j ≤ r ) (5.12)
µ∗j = 0 ( j 6∈ A(x ∗ )).
If f , g and h are twice continuously differentiable, then the following also

holds
y T ∇2x x L(x ∗ , λ∗ , µ∗ )y ≥ 0 (5.13)
for all y with ∇h i (x ∗ )T y = 0 (i ≤ m) and ∇g j (x ∗ )T y = 0 ( j ∈ A(x ∗ )).
(i i ) Assume that x ∗ , λ∗ and µ∗ are such that x ∗ is a feasible point and
(5.12) holds. Assume, moreover, that (5.13) holds with strict inequality for
each y. Then x ∗ is a (strict) local minimum in problem (5.9).
Proof: We shall derive this result from Theorem 5.1.

(i) By assumption x ∗ is a local minimum of problem (5.9), and x ∗ is a regular
point. Consider the constrained problem
minimize f (x)
subject to
(5.14)
h i (x) = 0 (i ≤ m)
g j (x) = 0 ( j ∈ A(x ∗ ))
which is obtained by removing all inactive constraints in x ∗ . Then x ∗ must

be a local minimum in (5.14); otherwise there would be a point x 0 in the neigh-
borhood of x ∗ which is feasible in (5.14) and satisfying f (x 0 ) < f (x ∗ ). By choos-
ing x 0 sufficiently near x ∗ we would get g j (x 0 ) < 0 for all j ∈ A(x ∗ ), contradicting
that x ∗ is a local minimum in (5.9). Therefore we may apply Theorem 5.1 to
problem (5.14) and by regularity of x ∗ there must be unique Lagrange multiplier
vectors λ∗ = (λ∗1 , λ∗2 , . . . , λ∗m ) and µ∗j ( j ∈ A(x ∗ )) such that
m
∇ f (x ∗ ) + λ∗i ∇h i (x ∗ ) + µ∗j ∇g j (x ∗ ) = 0
X X
i =1 j ∈A(x ∗ )
By defining µ j = 0 for j 6∈ A(x ∗ ) we get (5.12), except for the nonnegativity of µ.

The remaining part of the theorem may be proved, after some work, by study-
ing the equality-constrained reformulation (5.10) of (5.9) and applying Theorem
5.1 to (5.10). The details may be found in [1].
58
The KKT conditions have an interesting geometrical interpretation. They say
that −∇ f (x ∗ ) may be written as linear combination of the gradients of the h i ’s
plus a nonnegative linear combination of the gradients of the g j ’s that are active
at x ∗ .
Example 5.4. Let us consider the following optimization problem:
min{x 1 : x 2 ≥ 0, 1 − (x 1 − 1)2 − x 22 ≥ 0}.
Here there are two inequality constraints:
g 1 (x 1 , x 2 ) = −x 2 ≤ 0
g 2 (x 1 , x 2 ) = (x 1 − 1)2 + x 22 − 1 ≤ 0.
If we compute the gradients we see that the KKT conditions take the form
µ ¶ µ ¶ µ ¶
1 0 2(x 1 − 1)
+ µ1 + µ2 = 0,
0 −1 2x 2
where the two last terms on the left hand side only are included if the corre-
sponding inequalities are active. It is clear that we find no solutions if no in-
equalities are active. If only the first inequality is active we find no solution ei-
ther. If only the second inequality is active we get the equations
(x 1 − 1)2 + x 22 = 1
1 + 2µ2 (x 1 − 1) = 0
2µ2 x 2 = 0.
From the last equation we see that either x 2 = 0 or µ2 = 0. µ2 = 0 is in conflict

withthe second equation, however. If x 2 = 0, the first equation gives us that x 1 =
0 or x 1 = 2. x 1 = 2 put into the second equation gives that µ2 = −1/2, which
contradicts µ2 ≥ 0. With x 1 = 0 we get that µ2 = 1/2, so that (0, 0) is a candidate
for the minimum.
If both inequalities are active we get the equations
(x 1 − 1)2 + x 22 = 1
x2 = 0
1 + 2µ2 (x 1 − 1) = 0
−µ1 + 2µ2 x 2 = 0.
It is clear that this reduces to the system we just solved, so that (0, 0) is the only
candidate for a minimum. It is clear that we must have a minimum, since any
59
continuous function defined on a closed, bounded region must have a mini-
mum.
Finally we should comment on any points which are not regular. If the first
inequality is active it is impossible to have that ∇g 1 = 0. If the other inequality
is active we must have that (2(x 1 − 1), 2x 2 ) = 0 in a point which is not regular, so
that (x 1 , x 2 ) = (1, 0). However, it is clear that the other inequality is not active at
this point. If both inequalities are active it is clear that (x 1 , x 2 ) = (0, 0), or (2, 0).
We have already considered the first point. In the other point the gradients are
∇g 1 = (0, −1) and ∇g 2 = (2, 0), which are linearly independent, so that we get no
candidates for the minimum from points which are not regular. ♣
We remark that the assumption that x ∗ is a regular point may be too restric-
tive in some situations, for instance there may be more than n active inequalities
in x ∗ . There exist several other weaker assumptions that assure the existence of
Lagrangian multipliers (and similar necessary conditions). Let us briefly say a
bit more on this matter.
Definition 5.5 (Tangent vector). Let C ⊆ Rn and let x ∈ C . A vector d ∈ Rn

is called a tangent (vector) to C at x if there is a sequence {x k } in C and a
sequence {αk } in R+ such that
lim (x k − x)/αk = d .
k→∞
The set of tangent vectors at x is denoted by TC (x).
TC (x) always contains the zero vector and it is a cone, meaning that it con-
tains each positive multiple of its vectors. Consider now problem (5.9) and let C
be the set of feasible solutions (those x satisfying all the equality and inequality
constraints).
Definition 5.6 (Linearized feasible directions). A linearized feasible direc-

tion at x ∈ C is a vector d such that
d · ∇h i (x) = 0 (i ≤ m)
d · ∇g j (x) = 0 ( j ∈ A(x ∗ )).
Let LFC (x) be the set of all linearized feasible directions at x.
So, if we move from x along a linearized feasible direction with a suitably

small step, then the new point is feasible if we only care about the linearized
60
constraints at x (the first order Taylor approximations) of each h i and each g j for
active constraints at x, i.e., those inequality constraints that hold with equality.
With this notation we have the following lemma. The proof may be found in [11]
and it involves the implicit function theorem from multivariate calculus [8].
Lemma 5.7. Let x ∗ ∈ C . Then TC (x ∗ ) ⊆ LFC (x). If x ∗ is a regular point, then

TC (x ∗ ) = LFC (x).
The purpose of constraint qualifications is to assure that TC (x ∗ ) = LFC (x).

This property is central for obtaining the necessary optimality conditions dis-
cussed above. An important example is when C is defined only by linear con-
straints, i.e., each h i and c j is a linear function. Then TC (x ∗ ) = LFC (x) holds for
each x ∈ C .
For a more thorough discussion of these matters, see e.g. [11, 1].
In the remaining part of this section we discuss some examples; the main
tool is to establish the KKT conditions.
Example 5.8. Consider the one-variable problem: minimize f (x) subject to x ≥
0, where f : R → R is a differentiable convex function. We here let g 1 (x) = −x
and m = 0. The KKT conditions then become: there is a number µ such that
f 0 (x) − µ = 0, µ ≥ 0 and µ = 0 if x > 0. This is one of the (rare) occasions where
we can eliminate the Lagrangian variable µ via the equation µ = f 0 (x). So the
optimality conditions are: x ≥ 0 (feasibility), f 0 (x) ≥ 0, and f 0 (x) = 0 if x > 0 (x is
an interior point of the domain so the derivative must be zero), and if x = 0 we
must have f 0 (0) ≥ 0. ♣
Example 5.9. More generally, consider the problem to minimize f (x) subject
to x ≥ 0, where f : Rn → R. So here C = {x ∈ Rn : x ≥ 0} is the nonnegative or-
thant. We have that g i (x) = −x i , so that ∇g i = −e i . The KKT conditions say that
−∇ f (x ∗ ) is a nonnegative combination of −e i for i so that x i = 0. In other words,
∇ f (x ∗ ) is a nonnegative combination of e i for i so that x i = 0. This means that
∂ f (x ∗ )/∂x i = 0 for all i ≤ n with x ∗i > 0, and
∂ f (x ∗ )/∂x i ≥ 0 for all i ≤ n with x ∗i = 0.
It we interpret this for n = 3 we get the following cases:
• No active constraints: This means that x, y, z > 0. The KKT-conditions say
that all partial derivatives are 0, so that ∇ f (x ∗ ) = 0. This is reasonable,
since these points are internal points.
• One active constraint, such as x = 0, y, z > 0 The KKT-conditions say that

∂ f (x ∗ )/∂y = ∂ f (x ∗ )/∂z = 0, so that ∇ f (x ∗ ) points in the positive direction
of e 1 , as shown in Figure 5.3(a).
61
1 1
z
0.5 0.5
0 0
1 1
1 1
0.5 0.5 0.5 0.5
y 0 0 x y 0 0 x
(a) One active constraint (b) Two active constraints
1
z
0.5
0
1
1
0.5 0.5
y 0 0 x
(c) Three active constraints
Figure 5.3: The different possibilities for ∇ f in a minimum of f , under the con-
straints x ≥ 0.
• Two active constraints, such x = y = 0, z > 0. The KKT-conditions say

that ∂ f (x ∗ )/∂z = 0, so that ∇ f (x ∗ ) lies in the cone spanned by e 1 , e 2 , i.e.
∇ f (x ∗ ) lies in the first quadrant of the x y-plane, as shown in Figure 5.3(b).
• Three active constraints: This means that x = y = z = 0. The KKT con-

ditions say that ∇ f (x ∗ ) is in the cone spanned by e 1 , e 2 , e 3 , as shown in
Figure 5.3(c).
In all cases ∇ f (x ∗ ) points into a cone spanned by gradients corresponding to

the active inequalities (in general, by a cone we mean the set of all linear com-
binations of a set of vectors, with positive coefficients). Note that for the third
case above, we are used to finding minimum values from before: if we restrict
f to values where x = y = 0, we have a one-dimensional problem where we
want to minimize g (z) = f (x, y, z), which is equivalent to finding z so that g 0 (z) =
∂ f (x ∗ )/∂z = 0, as stated by the KKT-conditions. ♣
Example 5.10. Consider a quadratic optimization problem with linear equality
constraints
62
minimize (1/2) x T D x − q T x
subject to
Ax = b
where D is positive semidefinite and A ∈ Rm×n , b ∈ Rm . This is a special case

of (5.16) where f (x) = (1/2) x T D x −q T x. Then ∇ f (x) = D x −q (see Exercise 1.9).
Thus, the KKT conditions are: there is some λ ∈ Rm such that D x − q + A T λ =
0. In addition, the vector x is feasible so we have Ax = b. Thus, solving the
quadratic optimization problem amounts to solving the linear system of equa-
tions
D x + A T λ = q, Ax = b
which may be written as
AT
· ¸· ¸ · ¸
D x q
= . (5.15)
A 0 λ b
Under the additional assumption that D is positive definite and A has full row
rank, one can show that the coefficient matrix in (5.15) is invertible so this sys-
tem has a unique solution x, λ. Thus, for this problem, we may write down an
explicit solution (in terms of the inverse of the block matrix). Numerically, one
finds x (and the Lagrangian multiplier λ) by solving the linear system (5.15) by
e.g. Gaussian elimination or some faster (direct or iterative) method. ♣
Example 5.11. Consider an extension of the previous example by allowing lin-
ear inequality constraints as well:
minimize (1/2) x T D x − q T x
subject to
Ax = b
x ≥0
Here D, A and b are as above. Then ∇ f (x) = D x − q and ∇g k (x) = −e k . Thus,

the KKT conditions for this problem are: there are λ ∈ Rm and µ ∈ Rn such that
D x − q + A T λ − µ = 0, µ ≥ 0 and µk = 0 if x k > 0 (k ≤ n). We eliminate µ from
the first equation and obtain the equivalent condition: there is a λ ∈ Rm such
that D x + A T λ ≥ q and (D x + A T λ − q)k · x k = 0 (k ≤ n). In addition, we have
Ax = b, x ≥ 0. This problem may be solved numerically, for instance, by a so-
called active set method, see [9]. ♣
63
Example 5.12. Linear optimization is a problem of the form
minimize c T x subject to Ax = b, x ≥ 0
This is a special case of the convex programming problem (5.16) where g j (x) =
−x j ( j ≤ n). Here ∇ f (x) = c and ∇g k (x) = −e k . Let x be a feasible solution.
The KKT conditions state that there are vectors λ ∈ Rm and µ ∈ Rn such that
c + A T λ − µ = 0, µ ≥ 0 and µk = 0 if x k > 0 (k ≤ n). Here we eliminate µ and
obtain the equivalent set of KKT conditions: there is a vector λ ∈ Rm such that
c +A T λ ≥ 0, (c +A T λ)k ·x k = 0 (k ≤ n). These conditions are the familiar optimal-
ity conditions in linear optimization theory. The vector λ is feasible in the so-
called dual problem and complementary slack holds. We do not go into details
on this here, but refer to the course INF-MAT3370 Linear optimization where
these matters are treated in detail. ♣
5.3 Convex optimization

A convex optimization problem is to minimize a convex function f over a convex
set C in Rn . These problems are especially attractive, both from a theoretic and
algorithmic perspective.
First, let us consider some general results.
Theorem 5.13. Let f : C → R be a convex function defined on a convex set

C ⊆ Rn .
1. Then every local minimum of f over C is also a global minimum.
2. If f is continuous and C is closed, then the set of local (and therefore

global) minimum points of f over C is a closed convex set.
3. Assume, furthermore, that f : C → R is differentiable and C is open. Let

x ∗ ∈ C . Then x ∗ ∈ C is a local (global) minimum if and only if ∇ f (x ∗ ) =
0.
Proof: 1.) The proof of property 1 is exactly as the proof of the first part of
Theorem 4.5, except that we work with local and global minimum of f over C .
2.) Assume the set C ∗ of minimum points is nonempty and let α = minx∈C f (x).
Then C ∗ = {x ∈ C : f (x) ≤ α} is a convex set, see Proposition 2.5. Moreover, this
set is closed as f is continuous.
64
3.) This follows directly from Theorem 2.10.
Next, we consider a quite general convex optimization problem which is of
the form (5.9):
minimize f (x)
subject to
(5.16)
Ax = b
g j (x) ≤ 0 ( j ≤ r )
where all the functions f and g j are differentiable convex functions, and A ∈
Rm×n and b ∈ Rm . Let C denote the feasible set of problem (5.16). Then C is a
convex set, see Proposition 2.5. A special case of (5.16) is linear optimization.
An important concept in convex optimization is duality. To briefly explain
this introduce again the Lagrangian function L : Rn × Rm × Rr+ → R given by
L(x, λ, ν) = f (x) + λT (Ax − b) + νT G(x) (x ∈ Rn , λ ∈ Rm , ν ∈ Rr+ )
Remark: we use the variable name ν here in stead of the µ used before be-
cause of another parameter µ to be used soon. Note that we require ν ≥ 0.
Define the new function g : Rm × Rr+ → R̄ by
g (λ, ν) = inf L(x, λ, ν)

x
Note that this infimum may sometimes be equal to −∞ (meaning that the func-
tion x → L(x, λ, ν) is unbounded below). The function g is the pointwise infi-
mum of a family of affine functions in (λ, µ), one function for each x, and this
implies that g is a concave function. We are interested in g due to the following
fact, which is easy to prove. It is usually referred to as weak duality.
Lemma 5.14. Let x be feasible in problem (5.16) and let λ ∈ Rm , ν ∈ Rr where

ν ≥ 0. Then
g (λ, ν) ≤ f (x).
Proof: For λ ∈ Rm , ν ∈ Rr with ν ≥ 0 and x feasible in problem (5.16) we have
g (λ, ν) ≤ L(x, λ, ν)
= f (x) + λT (Ax − b) + νT G(x)
≤ f (x)
65
as Ax = b, ν ≥ 0 and G(x) ≤ 0.
Thus, g (λ, ν) provides a lower bound on the optimal value in (5.16). It is
natural to look for a best possible such lower bound and this is precisely the so-
called dual problem which is
maximize g (λ, ν)
subject to (5.17)
ν ≥ 0.
Actually, in this dual problem, we may further restrict the attention to those
(λ, ν) for which g (λ, ν) is finite. g (λ, ν) is also called the dual objective function.
The original problem (5.16) will be called the primal problem. It follows from
Lemma 5.14 that
g∗ ≤ f ∗
where f ∗ denotes the optimal value in the primal problem and g ∗ the optimal
value in the dual problem. If g ∗ < f ∗ , we say that there is a duality gap. Note
that the derivation above, and weak duality, holds for arbitrary functions f and
g j ( j ≤ r ). The concavity of g also holds generally.
The dual problem is useful when the dual objective function g may be com-
puted efficiently, either analytically or numerically. Duality provides a powerful
method for proving that a solution is optimal or, possibly, near-optimal. If we
have a feasible x in (5.16) and we have found a dual solution (λ, ν) with ν ≥ 0
such that
f (x) = g (λ, ν) + ²
for some ² (which then has to be nonnegative), then we can conclude that x is
“nearly optimal”, it is not possible to improve f by more than ². Such a point x
is sometimes called ²-optimal, where the case ² = 0 means optimal.
So, how good is this duality approach? For convex problems it is often per-
fect as the next theorem says. We omit most of the proof, see [5, 1, 14]). For
nonconvex problems one should expect a duality gap. Recall that G 0 (x) denotes
the Jacobi matrix of G = (g 1 , g 2 , . . . , g r ) at x.
Theorem 5.15. Consider convex optimization problem (5.16) and assume

this problem has a feasible point satisfying
g j (x 0 ) < 0 ( j ≤ r ).
66
Then f ∗ = g ∗ , so there is no duality gap. Moreover, x is a (local and global )
minimum in (5.16) if and only if there are λ ∈ Rm and ν ∈ Rr with ν ≥ 0 and
∇ f (x) + A T λ +G 0 (x)T ν = 0
and
ν j g j (x) = 0 ( j ≤ r ).
Proof: We only prove the second part (see the references above). So assume
that f ∗ = g ∗ and the infimum and supremum are attained in the primal and dual
problems, respectively. Let x be a feasible point in the primal problem. Then x
is a minimum in the primal problem if and only if there are λ ∈ Rm and ν ∈ Rr
such that all the inequalities in the proof of Lemma 5.14 hold with equality. This
means that g (λ, ν) = L(x, λ, ν) and νT G(x) = 0. But L(x, λ, ν) is convex in x so it
is minimized by x if and only if its gradient is the zero vector, i.e., ∇ f (x) + λT A +
G 0 (x)T ν = 0. This leads to the desired characterization.
The assumption stated in the theorem, that g j (x 0 ) < 0 for each j , is called the
weak Slater condition.
Example 5.16. Consider the convex optimization problem where we want to
minimize the function f (x) = x 2 + 1 subject to the inequality constraint g (x) =
(x − 3)2 − 1 ≤ 0. From Figure 5.4(a) it is quite clear that the minimum is attained
for x = 2, and is f (2) = 5. Since both the constraint and the objective function
are convex, and since here the weak Slater condition holds, Theorem 5.15 guar-
antees that the dual problem has the same solution as the primal problem. Let
us verify this by considering the dual problem as well. The Lagrangian function
is given by
L(x, ν) = f (x) + νg (x) = x 2 + 1 + ν((x − 3)2 − 1).
3ν
It is easy to see that this function attains its minimum for x = 1+ν . This means
that the dual objective function is given by
3ν 2
µ ¶ µ ¶ µµ ¶2 ¶
3ν 3ν
g (ν) = L ,ν = +1+ν −3 −1 .
1+ν 1+ν 1+ν
This is shown in Figure 5.4(b). It is quite clear from this figure that the maxi-
mum is 5, which we already found by solving the primal problem. To prove this
requires some more work, by setting the derivative of the dual objective func-
tion to zero. Therefore, the primal and the dual problem are two very different
problems, where we in practice choose the one which is simplest to solve. ♣
Finally, we mention a theorem on convex optimization which is used in sev-
eral applications.
67
8
Objective function
15 Inequality constraint
6
10
4
5
2
0
0
0 1 2 3 4 0 1 2 3 4
(a) The objective function and the inequality (b) The dual objective function
constraint
Figure 5.4: The objective function and the dual objective function of Exam-
ple 5.16
Theorem 5.17. Let f : C → R be a convex function defined on a convex set

C ⊆ Rn , and x ∗ ∈ C . Then x ∗ is a (local and therefore global) minimum of f
over C if and only if
∇ f (x ∗ )T (x − x ∗ ) ≥ 0 for all x ∈ C . (5.18)
Proof: Assume first that ∇ f (x ∗ )T (x − x ∗ ) < 0 for some x ∈ C . Consider the

function g (²) = f (x ∗ + ²(x − x ∗ )) and apply the first order Taylor theorem to this
function. Thus, for every ² > 0 there exists an t ∈ [0, 1] with
f (x ∗ + ²(x − x ∗ )) = f (x ∗ ) + ²∇ f (x ∗ + t ²(x − x ∗ ))T (x − x ∗ ).
Since ∇ f (x ∗ )T (x −x ∗ ) < 0 and the gradient function is continuous (our standard

assumption!) we have for sufficiently small ² > 0 that ∇ f (x ∗ + t ²(x − x ∗ ))T (x −
x ∗ ) < 0. This implies that f (x ∗ +²(x − x ∗ )) < f (x ∗ ). But, as C is convex, the point
x ∗ + ²(x − x ∗ ) also lies in C and so we conclude that x ∗ is not a local minimum.
This proves that (5.18) is necessary for x ∗ to be a local minimum of f over C .
Next, assume that (5.18) holds. Using Theorem 2.10 we then get
f (x) ≥ f (x ∗ ) + ∇ f (x ∗ )T (x − x ∗ ) ≥ f (x ∗ ) for every x ∈ C
so x ∗ is a (global) minimum.
68
1. In the plane consider a rectangle R with sides of length x and y and with
perimeter equal to α (so 2x + 2y = α). Determine x and y so that the area of R is
largest possible.
2. Consider the optimization problem
minimize f (x 1 , x 2 ) subject to (x , x 2 ) ∈ C
where C = {(x 1 , x 2 ) ∈ R2 : x 1 , x 2 ≥ 0, 4x 1 + x 2 ≥ 8, 2x 1 +3x 3 ≤ 12}. Draw the feasible

set C in the plane. Find the set of optimal solutions in each of the cases given
below.
a. f (x 1 , x 2 ) = 1.
b. f (x 1 , x 2 ) = x 1 .
c. f (x 1 , x 2 ) = 3x 1 + x 2 .
d. f (x 1 , x 2 ) = (x 1 − 1)2 + (x 2 − 1)2 .
e. f (x 1 , x 2 ) = (x 1 − 10)2 + (x 2 − 8)2 .
3. Solve
n
X
max{x 1 x 2 · · · x n : x j = 1, x j ≥ 0}.
j =1
4. Let S = {x ∈ R2 : kxk = 1} be the unit circle in the plane. Let a ∈ R2 be a

given point. Formulate the problem of finding a nearest point in S to a as a
nonlinear optimization problem. How can you solve this problem directly using
a geometrical argument?
5. Let S be the unit circle from the previous exercise. Let a 1 , a 2 be two given
points in the plane. Let f (x) = 2i =1 kx − a i k2 . Formulate this as an optimization
P
problem and find its Lagrangian function L. Find the stationary points of L, and
use this to solve the optimization problem.
6. Solve
minimize x 1 + x 2 subject to x 12 + x 22 = 1.
using the Lagrangian, see Theorem 5.1. Next, solve the problem by eliminating
x 2 (using the constraint).
7. Let g (x 1 , x 2 ) = 3x 12 + 10x 1 x 2 + 3x 22 − 2. Solve
min{k(x 1 , x 2 )k : g (x 1 , x 2 ) = 0}.
69
8. Same question as in previous exercise, but with g (x 1 , x 2 ) = 5x 12 −4x 1 x 2 +4x 22 −
6.
9. Let f be a two times differentiable function f : Rn → R. Consider the opti-
mization problem
minimize f (x) subject to x 1 + x 2 + · · · + x n = 1.
Characterize the stationary points (find the equation they satisfy).

10. Consider the previous exercise. Explain how to convert this into an uncon-
strained problem by eliminating x n .
11. Let A be a real symmetric n × n matrix. Consider the optimization problem
max{x T Ax : kxk = 1}
Rewrite the constraint as kxk − 1 = 0 and show that an optimal solution of this
problem must be an eigenvector of A. What can you say about the Lagrangian
multiplier?
12. Solve
min{(1/2)(x 12 + x 22 + x 32 ) : x 1 + x 2 + x 3 ≤ −6}.
Hint: Use KKT and discuss depending on whether the constraint is active or not.
13. Solve
min{(x 1 − 3)2 + (x 2 − 5)2 + x 1 x 2 : 0 ≤ x 1 , x 2 ≤ 1}.
14. Solve
min{x 1 + x 2 : x 12 + x 22 ≤ 2}.
15. Write down the KKT conditions for the portfolio optimization problem of
Section 1.2.1.
16. Write down the KKT conditions for the optimization problem
n
X
min{ f (x 1 , x 2 , . . . , x n ) : x j ≥ 0 ( j ≤ n), x j ≤ 1}
j =1
where f : Rn → R is a differentiable function.

17. Consider the following optimization problem
3 2
µ ¶
min{ x 1 − + x 22 : x 1 + x 2 ≤ 1, x 1 − x 2 ≤ 1, −x 1 + x 2 ≤ 1, −x 1 − x 2 ≤ 1}.
2
a. Draw the region which we minimize over, and find the minimum of
¢2
f (x) = x 1 − 23 + x 22 by a direct geometric argument.
¡
70
b. Write down the KKT conditions for this problem. From a., decide which
two conditions g 1 and g 2 are active at the minimum, and verify that you
can find µ1 ≥ 0, µ2 ≥ 0 so that ∇ f + µ1 ∇g 1 + µ2 ∇g 2 = 0 (as the KKT condi-
tions guarantee in a minimum) (it is not the meaning here that you should
go through all possibilities for active inequalities, only those you see must
be fulfilled from a.).
18. Consider the following optimization problem
min{−x 1 x 2 : x 12 + x 22 ≤ 1}
Write down the KKT conditions for this problem, and find the minimum.
71
72
Chapter 6
Constrained optimization -
methods
In this final chapter we present numerical methods for solving nonlinear opti-
mization problems. This is a huge area, so we can here only give a small taste of
it! The algorithms we present are known good methods.
6.1 Equality constraints

We here consider the nonlinear optimization problem with linear equality con-
straints
minimize f (x)
subject to (6.1)
Ax = b
Newton’s method may be applied to this problem. The method is very simi-
lar to the unconstrained case, but with two modifications. First, the initial point
x 0 must be chosen so that it is feasible, i.e., Ax 0 = b. Next, the search direction
d must be such that the new iterate is feasible as well. This means that Ad = 0,
so the search direction lies in the nullspace of A.
The second order Taylor approximation of f at an iterate x k is
T f1 (x k ; x k + h) = f (x k ) + ∇ f (x k )T h + (1/2)h T ∇2 f (x k )h
73
and we want to minimize this w.r.t. h subject to the constraint
A(x k + h) = b
This is a quadratic optimization problem in h with a linear equality constraint

(Ah = 0) as in Example 5.10. The KKT conditions for this problem are thus
∇ f (x k ) A T
· 2 ¸· ¸ · ¸
h −∇ f (x k )
=
A 0 λ 0
where λ is the Lagrange multiplier. The Newton step is only defined when the
coefficient matrix in the KKT problem is invertible. In that case, the problem has
a unique solution (h, λ) and we define d N t = h and call this the Newton step.
Newton’s method for solving (6.1) may now be described as follows. Again
² > 0 is a small stopping criterion.
Newton’s method for linear equality constrained optimization:

1. Choose an initial point x 0 satisfying Ax 0 = b and let x = x 0 .
2. repeat
(i) Compute the Newton step d N t and η := d TN t ∇2 f (x)d N t .
(ii) If η2 /2 < ²: stop.
(iii) Use backtracking line search to find step size α
(iv) Update x := x + αd N t
This leads to an algorithm for Newtons’s method for linear equality con-
strained optimization which is very similar to the function newtonbacktrack
from Exercise 4.2.12. We do not state a formal convergence theorem for this
method, but it behaves very much like Newton’s method for unconstrained op-
timization. Actually, it can be seen that the method just described corresponds
to eliminating variables based on the equations Ax = b and using the uncon-
strained Newton method for the resulting (smaller) problem. So as soon as the
solution is “sufficiently near” an optimal solution, the convergence rate is quadratic,
so extremely few iterations are needed in this final stage.
6.2 Inequality constraints

We here briefly discuss an algorithm for inequality constrained nonlinear opti-
mization problems. The presentation is mainly based on [2, 11]. We restrict the
74
attention to convex optimization problems, but many of the ideas are used for
nonconvex problems as well.
The method we present is an interior-point method, more precisely, an interior-
point barrier method. This is an iterative method which produces a sequence of
points lying in the relative interior of the feasible set. The barrier idea is to ap-
proximate the problem by a simpler one in which constraints are replaced by a
penalty term. The purpose of this penalty term is to give large objective function
values to points near the (relative) boundary of the feasible set, which effectively
becomes a barrier against leaving the feasible set.
Consider again the convex optimization problem
minimize f (x)
subject to
(6.2)
Ax = b
g j (x) ≤ 0 ( j ≤ r )
where A is an m × n matrix and b ∈ Rm . The feasible set here is F = {x ∈ Rn :

Ax = b, g j (x) ≤ 0 ( j ≤ r )}. We assume that the weak Slater condition holds, and
therefore by Theorem 5.15 the KKT conditions for problem (6.2) are
Ax = b, g j (x) ≤ 0 ( j ≤ r )
ν ≥ 0, ∇ f (x) + A T λ +G 0 (x)T ν = 0 (6.3)
ν j g j (x) = 0 ( j ≤ r ).
So, x is a minimum in (6.2) if and only if there are λ ∈ Rm and ν ∈ Rr such that
(6.3) holds.
Let us state an algorithm for Newton’s method for linear equality constrained
optimization with inequality constraints. Before we do this there is one final
problem we need to address: The α we get from backtracking line search may be
so that x + αd N t do not satisfty the inequality constraints (in the exercises you
will be asked to verify that this is the case for a certain function). The problem
comes from that the iterates x k + βm sd k from Armijo’s rule do not necessarily
satisfy the inequality constraints. However, we can choose m large enough so
that all succeeding iterates satisfy these constraints. We can reimplement the
function armijorule to address this as follows:
function alpha=armijoruleg1g2(f,df,x,d,g1,g2)
beta=0.2; s=0.5; sigma=10^(-3);
m=0;
while (g1(x+beta^m*s*d)>0 || g2(x+beta^m*s*d)>0)
75
m=m+1;
end
while (f(x)-f(x+beta^m*s*d) < -sigma *beta^m*s *(df(x))’*d)
m=m+1;
end
alpha = beta^m*s;
Here g1 and g2 are function handles which represent the inequality constraints,
and we have added a first loop, which secures that m is so large that the inequal-
ity constraints are satisfied. The rest of the code is as in the function armijorule.
After this we can also modify the function newtonbacktrack from Exercise 4.2.12
to a function newtonbacktrackg1g2 in the obvious way, so that the inequality
constraints are passed to armijoruleg1g2:
function [x,numit]=newtonbacktrackg1g2LEC(f,df,d2f,A,b,x0,g1,g2)
epsilon=10^(-3);
x=x0;
maxit=100;
for numit=1:maxit
matr=[d2f(x) A’; A zeros(size(A,1))];
vect=[-df(x); zeros(size(A,1),1)];
solvedvals=matr\vect;
d=solvedvals(1:size(A,2));
eta=d’*d2f(x)*d;
if eta^2/2<epsilon
break;
end
alpha=armijoruleg1g2(f,df,x,d,g1,g2);
x=x+alpha*d;
end
Both these function work in all cases where there are exactly two inequality con-
straints.
The interior-point barrier method is based on an approximation of problem
(6.2) by the barrier problem
minimize f (x) + µφ(x)

subject to (6.4)
Ax = b
76
where
r
φ(x) = −
X
ln(−g j (x))
j =1
and µ > 0 is a parameter (in R). The function φ is called the (logarithmic) barrier
function and its domain is the relative interior of the feasible set
F ◦ = {x ∈ Rn : Ax = b, g j (x) < 0 ( j ≤ r )}.
The same set F ◦ is the feasible set of the barrier problem. The key properties of
the barrier function are:
• φ is twice differentiable and

r
X 1
∇φ(x) = ∇g j (x) (6.5)
j =1 (−g j (x))
r 1 r 1
∇2 φ(x) = ∇g j (x)∇g j (x)T + ∇2 g j (x)
X X
2
(6.6)
j =1 g j (x) j =1 (−g j (x))
• φ is convex. For this it is enough to show that ∇2 φ is positive semidefinite

at all points, which can be shown from Equation 6.6 as follows:
Ã !
r 1 1
T 2 T T T 2
h ∇ φ(x)h =
X
2
h ∇g j (x)∇g j (x) h + h ∇ g j (x)h
j =1 g j (x) (−g j (x))
Ã !
r 1 1
T 2 T 2
X
= 2
k∇g j (x) hk + h ∇ g j (x)h ≥ 0
j =1 g j (x) (−g j (x))
since (−g1j (x)) > 0 and h T ∇2 g j (x)h ≥ 0 (since all g j are convex, ∇2 g j (x) is
positive semidefinite).
• If {x k } is a sequence in F ◦ such that g j (x k ) → 0 for some j ≤ r , then φ(x k ) →

∞. This is the barrier property.
The idea here is that for points x near the boundary of F the value of φ(x) is very
large. So, an iterative method which moves around in the interior F ◦ of F will
typically avoid points near the boundary as the logarithmic penalty term makes
the function value f (x) + µφ(x) very large.
The interior point method consists in solving the barrier problem, using
Newton’s method, for a sequence {µk } of (positive) barrier parameters; these
are called the outer iterations. The solution x k found for µ = µk is used as the
starting point in Newton’s method in the next outer iteration where µ = µk+1 .
77
The sequence {µk } is chosen such that µk → 0. When µ is very small, the barrier
function approximates the "ideal" penalty function η(x) which is zero in F and
−∞ when one of the inequalities g j (x) ≤ 0 is violated.
A natural question is why one bothers to solve the barrier problems for more
than one single µ, typically a very small value. The reason is that it would be
hard to find a good starting point for Newton’s method in that case; the Hessian
matrix of µφ is typically ill-conditioned for small µ.
Assume now that the barrier problem has a unique optimal solution x(µ);
this is true under reasonable assumptions that we shall return to. The point
x(µ) is called a central point. Assume also that Newton’s method may be applied
to solve the barrier problem. The set of points x(µ) for µ > 0 is called the central
path; it is a path (or curve) as we know it from multivariate calculus. In order to
investigate the central path we prefer to work with the equivalent problem1 to
(6.4) obtained by multiplying the objection function by 1/µ, so
minimize (1/µ) f (x) + φ(x)

subject to (6.7)
Ax = b.
A central point x(µ) is characterized by
Ax(µ) = b
g j (x(µ)) < 0 ( j ≤ r )
and the existence of λ ∈ Rm (the Lagrange multiplier vector) such that
(1/µ)∇ f (x(µ)) + ∇φ(x(µ)) + A T λ = 0
i.e.,
r 1
∇g j (x) + A T λ = 0.
X
(1/µ)∇ f (x(µ)) + (6.8)
j =1 (−g j (x))
A fundamental question is: how far from being optimal is the central point x(µ)?
We now show that duality provides a very elegant way of answering this ques-
tion.
Theorem 6.1. For each µ > 0 the central point x(µ) satisfies
f ∗ ≤ f (x(µ)) ≤ f ∗ + r µ.
1 Equivalent here means the same minimum points.
78
Proof: Define ν(µ) = (ν1 (µ), . . . , νr (µ)) ∈ Rr and λ(µ) ∈ Rm by
ν j (µ) = −µ/g j (x(µ)), ( j ≤ r );

(6.9)
λ(µ) = µλ.
where λ and x(µ) satisfy Equation (6.8). We want to show that the pair (λ(µ), ν(µ))
is feasible in the dual problem to (6.2), see Section 5.3. So there are two prop-
erties to verify, that ν(µ) is nonnegative and that x(µ) minimizes the Lagrangian
function for the given (λ(µ), ν(µ)). The first property is immediate: as g j (x(µ)) <
0 and µ > 0, we get ν j (µ) = −µ/g j (x(µ)) > 0 for each j . Concerning the second
property, note first that the Lagrangian function L(x, λ, ν) = f (x) + λT (Ax − b) +
νT G(x) is convex in x for given λ and µ ≥ 0. Thus, x minimizes this function if
and only if ∇x L = 0. Now,
∇x L(x(µ), λ(µ), ν(µ))

r
= ∇ f (x(µ)) + A T λ(µ) + ν j (µ)∇g j (x(µ))
X
j =1
r 1
= ∇ f (x(µ)) + µA T λ + µ
X
∇g j (x(µ))
j =1 (−g j (x(µ)))
Ã !
1 r 1
T
= µ ∇ f (x(µ)) + A λ +
X
∇g j (x(µ)) = 0,
µ j =1 (−g j (x(µ)))
by (6.8) and the definition of the dual variables (6.9). This shows that (λ(µ), ν(µ))
is feasible in the dual problem.
By weak duality and Lemma 5.14, we therefore obtain
f ∗ ≥ g (λ(µ), ν(µ))
= L(x(µ), λ(µ), ν(µ))
r
= f (x(µ)) + λ(µ)T (Ax(µ) − b) + ν j (µ)g j (x(µ))
X
j =1
= f (x(µ)) − r µ
which proves the result.

This theorem is very useful and shows why letting µ → 0 (more accurately
µ → 0+ ) is a good idea.
Corollary 6.2. The central path has the following property
lim f (x(µ)) = f ∗ .
µ→0
79
In particular, if f is continuous and limµ→0 x(µ) = x ∗ for some x ∗ , then x ∗ is a
global minimum in (6.2).
Proof: This follows from Theorem 6.1 by letting µ → 0. The second part fol-
lows from
f (x ∗ ) = f (lim x(µ)) = lim f (x(µ)) = f ∗
µ→0 µ→0
by the first part and the continuity of f ; moreover x ∗ must be a feasible point by
elementary topology.
After these considerations we may now present the interior-point barrier
method. It uses a tolerance ² > 0 in its stopping criterion.
Interior-point barrier method:

1. Choose an initial point x = x 0 in F ◦ , µ = µ0 and α < 1.
2. while r µ > ² do
(i) (Centering step) Using initial point x find the solution x(µ) of (6.4)
(ii) (Update) x := x(µ)
(iii) (Decrease µ) µ := αµ.
This leads to the following algorithm for the internal point barrier method
for the case of equality constraints, and 2 inequality constraints:
function xopt=IPBopt(f,g1,g2,df,dg1,dg2,d2f,d2g1,d2g2,A,b,x0)
xopt=x0;
mu=1;
alpha=0.1;
r=2;
epsilon=10^(-3);
numitouter=0;
while (r*mu>epsilon)
[xopt,numit]=newtonbacktrackg1g2LEC(...
@(x)(f(x)-mu*log(-g1(x))-mu*log(-g2(x))),...
@(x)(df(x) - mu*dg1(x)/g1(x) - mu*dg2(x)/g2(x)),...
@(x)(d2f(x) + mu*dg1(x)*dg1(x)’/(g1(x)^2) ...
+ mu*dg2(x)*dg2(x)’/(g2(x)^2) - mu*d2g1(x)/g1(x)...
80
- mu*d2g2(x)/g2(x) ),A,b,xopt,g1,g2);
mu=alpha*mu;
numitouter=numitouter+1;
fprintf(’Iteration %i:’,numitouter);
fprintf(’(%f,%f)\n’,xopt,f(xopt));
end
Note that we here have inserted the expressions from Equation 6.5 and Equa-
tion 6.6 for the gradient and the Hesse matrix of the barrier function. The input
are f , g 1 , g 2 , their gradients and their Hesse matrices, the matrix A, the vector b,
and an initial feasible point x 0 . The function calls newtonbacktrackg1g2LEC,
and returns the optimal solution x ∗ . It also gives some information on the val-
ues of f during the iterations. The iterations used in Newton’s method is called
the inner iterations. There are different implementation details here that we do
not discuss very much. A typical value on α is 0.1. The choice of the initial µ0
can be difficult, if it is chosen too large, one may experience many outer itera-
tions. Another issue is how accurately one solves (6.4). It may be sufficient to
find a near-optimal solution here as this saves inner iterations. For this reason
the method is also called a path-following method; it follows in the neighbor-
hood of the central path.
Finally, it should be mentioned that there exists a variant of the interior-
point barrier method which permits an infeasible starting point. For more de-
tails on this and various implementation issues one may consult [2] or [11].
Example 6.3. Consider the function f (x) = x 2 + 1, 2 ≤ x ≤ 4. Minimizing f can
be considered as the problem of finding a minimum subject to the constraints
g 1 (x) = 2 − x ≤ 0, and g 2 (x) = x − 4 ≤ 0. The barrier problem is to minimize the
function
f (x) + µφ(x) = x 2 + 1 − µ ln(x − 2) − µ ln(4 − x).
Some of these are drawn in Figure 6.1, where we clearly can see the effect of de-
creasing µ in the barrier function: The function converges to f pointwise, except
at the boundaries. It is easy to see that x = 2 is the minimum of f under the given
constraints, and that f (2) = 5 is the minimum value. There are no equality con-
strains in this case, so that we can use the barrier method with Newton’s method
for unconstrained optimization, as this was implemented in Exercise 4.2.12. We
need, however, to make sure also here that the iterates from Armijo’s rule satisfy
the inequality constraints. In fact, in the exercises you will be asked to verify
that, for the function f considered here, some of the iterates from Armijo’s rule
do not satisfy the constraints.
It is straightforward to implement a function newtonbacktrackg1g2 which
implements Newtons method for two inequality constraints and no equality
81
20 20
15 15
10 10
5 5
2 2.5 3 3.5 4 2 2.5 3 3.5 4
(a) f (x) (b) Barrier problem with µ = 0.2
20 20
15 15
10 10
5 5
2 2.5 3 3.5 4 2 2.5 3 3.5 4
(c) Barrier problem with µ = 0.5 (d) Barrier problem with µ = 1
Figure 6.1: The function from Example 6.3 and some if its barrier functions.
constraints (this can follow the implementation of the function newtonbacktrack

from Exercise 4.2.12, and use the function armijoruleg1g2, just as the function
newtonbacktrackg1g2LEC). This leads to the following algorithm for the inter-
nal point barrier method for the case of no equality constraints, but 2 inequality
constraints:
function xopt=IPBopt2(f,g1,g2,df,dg1,dg2,d2f,d2g1,d2g2,x0)
xopt=x0;
mu=1; alpha=0.1; r=2; epsilon=10^(-3);
numitouter=0;
while (r*mu>epsilon)
[xopt,numit]=newtonbacktrackg1g2(...
@(x)(f(x)-mu*log(-g1(x))-mu*log(-g2(x))),...
82
@(x)(df(x) - mu*dg1(x)/g1(x) - mu*dg2(x)/g2(x)),...
@(x)(d2f(x) + mu*dg1(x)*dg1(x)’/(g1(x)^2) ...
+ mu*dg2(x)*dg2(x)’/(g2(x)^2) ...
- mu*d2g1(x)/g1(x) - mu*d2g2(x)/g2(x) ),xopt,g1,g2);
mu=alpha*mu;
numitouter=numitouter+1;
fprintf(’Iteration %i:’,numitouter);
fprintf(’(%f,%f)\n’,xopt,f(xopt));
end
Note that this function also prints a summary for each of the outer iterations,
so that we can see the progress in the barrier method. We can now find the
minimum of f with the following code, where we have substituted with Matlab
functions for f , g i , their gradients, and their Hesse matrices.
IPBopt2(@(x)(x.^2+1),@(x)(2-x),@(x)(x-4),...
@(x)(2*x),@(x)(-1),@(x)(1),...
@(x)(2),@(x)(0),@(x)(0),3)
Running this code gives a good approximation to the minimum x = 2 after 4

outer iterations. ♣
Example 6.4. Let us consider the problem of finding the minimum of x 12 + x 22
subject to the constraint x 1 +x 2 ≥ 2. We set f (x 1 , x 2 ) = x 12 +x 22 , and write the con-
straint as g 1 (x 1 , x 2 ) = 2 − x 1 − x 2 ≤ 0. Here it is not difficult to state the KKT con-
ditions and solve these, so let us do this first. The gradients are ∇ f = (2x 1 , 2x 2 ),
∇g 1 = (−1, −1), so that the KKT conditions take the form
(2x 1 , 2x 2 ) + ν1 (−1, −1) = 0
for a ν1 ≥ 0, where the last term is included only if x 1 + x 2 = 2 (i.e. when the
constraint is active). If the constraint is not active we see that x 1 = x 2 = 0, which
does not satisfy the inequality constraint. If the constraint is active we see that
x 1 = x 2 = ν1 /2, so that x 1 = x 2 = 1 and ν1 = 2 ≥ 0 in order for x 1 + x 2 = 2. The
minimum value is thus f (1, 1) = 2. It is clear that this must be a minimum: Since
f is bounded below and approaches ∞ when either x 1 or x 2 grows large, it must
have a mimimum ( f has no global maximum). For this one can also argue that
the Hessian of the Lagrangian for the constrained problem becomes positive
definit. All points are regular for this problem since ∇g 1 6= 0.
Let us also see if we can come to this same solution by solving the barrier
problem. The barrier function is φ(x 1 , x 2 ) = − ln(x 1 + x 2 − 2), which has gradient
∇φ = (−1/(x 1 +x 2 −2), −1/(x 1 +x 2 −2)). We set the gradient of f (x 1 , x 2 )+µφ(x 1 , x 2 )
83
to 0 and get
(2x 1 , 2x 2 ) + µ(−1/(x 1 + x 2 − 2), −1/(x 1 + x 2 − 2)) = 0.

µ
From this we see that x 1 = x 2 must fulfill 2x 1 = so that 4x 1 (x 1 − 1) = µ, so
2x 1 −2 , p
4± 16+16µ
that 4x 12 − 4x 1 − µ = 0. If we solve this problem we find that x 1 = 8 =
p
1± 1+µ
2 . If we choose the negative sign here we find that x 1 < 0, which does not
lie inside the domain of definition for the function we optimize (i.e.
p points where
1+ 1+µ
x 1 + x 2 > 2). If we choose the positive sign we find x 1 = x 2 = 2 . It is clear
that, when µ → 0, this will converge to x 1 = x 2 = 1, which equals the solution we
found when we solved the KKT conditions. ♣

1. Consider problem (6.1) in Section 6.1. Verify that the KKT conditions for this
problem are as stated there.
2. Define the function f (x, y) = x + y. We will attempt to minimize f under the
constraints y − x = 1, and x, y ≥ 0
a. Find A, b, and functions g 1 , g 2 so that the problem takes the same form
as in Equation (6.2).
b. Draw the contours of the barrier function f (x, y) + µφ(x, y) for µ =

0.1, 0.2, 0.5, 1, where φ(x, y) = − ln(−g 1 (x, y)) − ln(−g 2 (x, y)).
c. Solve the barrier problem analytically using the Lagrange method.
d. It is straightforward to find the minimum of f under the mentioned

constraints. State a simple argument for finding this minimum.
e. State the KKT conditions for finding the minimum, and solve these.
f. Show that the central path converges to the same solution which you
found in d. and e..
3. Use the function IPBopt to verify the solution you found in Exercise 2. Ini-
tially you must compute a feasible starting point x 0 .
4. State the KKT conditions for finding the minimum for the contstrained prob-
lem of Example 6.3, and solve these. Verify that you get the same solution as in
Example 6.3.
84
5. In the function IPBopt2, replace the call to the function newtonbacktrackg1g2
with a call to the function newtonbacktrack, with the obvious modification to
the parameters. Verify that the code does not return the expected minimum in
this case.
6. Consider the function f (x) = (x − 3)2 , with the same constraints 2 ≤ x ≤ 4 as
in Example 6.3. Verify in this case that the function IPBopt2 returns the correct
minimum regardless of whether you call newtonbacktrackg1g2 or newtonbacktrack.
This shows that, at least in some cases where the minimum is an interior point,
the iterates from Newtons method satisfy the inequality constraints as well.
7. (Trial Exam UIO V2012) In this exercise we will find the minimum of the
function f (x, y) = 3x + 2y under the constraints x + y = 1 and x, y ≥ 0.
a. Find a matrix A and a vector b so that the constraint x + y = 1 can be

written on the form Ax = b.
b. State the KKT-conditions for this problem, and find the minimum by
solving these.
c. Write down the barrier function φ(x, y) = − ln(−g 1 (x, y))−ln(−g 2 (x, y))
for this problem, where g 1 and g 2 represent the two constraints of the
problem. Also compute ∇φ.
d. Solve the barrier problem with parameter µ, and denote the solution
by x(µ). Is it the case that the limit limµ→0 x(µ) equals the solution you
found in b.?
85
Answers
Chapter 1
Chapter 2
Chapter 3
Chapter 4
Chapter 5
Chapter 6
Solutions
Chapter 1
3 . You can argue in many ways here: For instance the derivative of f 2 (x) is
2 f (x) f 0 (x), so that extremal points of f are also extremal points of f 2 .
8 .a. If c i , j = 0 the function to be minimized is

n
α c i i x i2 − µj x j .
X X
i ≤n j =1
The gradient of this function is 2αC x − µ, where µ is the vector with µ in all
entries. Lagrange multipliers thus gives that 2αC x −µ = λ, where λ is the vector
µ+λ
with λ in all entries. This gives that x i = 2αci . If x i = 1 we must have that
P
µ+λ P 1 P2α
2α c i = 1, so that λ = −µ + 1/c i .
8 .b. When n = 2, we have that x 2 = 1 − x 1 , so that
f (x 1 , x 2 ) = αc 11 x 12 + αc 22 x 22 + α(c 12 + c 21 )x 1 x 2 − µx 1 − µx 2
= αc 11 x 12 + αc 22 (1 − x 1 )2 + α(c 12 + c 21 )x 1 (1 − x 1 ) − µx 1 − µ(1 − x 1 )
= α(c 11 + c 22 − c 12 − c 21 )x 12 + α(−2c 22 + c 12 + c 21 )x 1 + αc 22 − µ
The derivative of this is 2α(c 11 + c 22 − c 12 − c 21 )x 1 + α(−2c 22 + c 12 + c 21 ), which is

0 when x 1 = − 2(c−2c 22 +c 12 +c 21
11 +c 22 −c 12 −c 21 )
. This is not dependent on α.
86
P ∂f
9 .a. We have that f (x) = i q i x i , so that ∂x i = q i , so that ∇ f (x) = q. Clearly
∂2 f
∂x i ∂x j = 0, so that ∇2 f (x) = 0.
9 .b. We have that
1X 1X 1 X
f (x) = xi A i j x j = A i i x i2 + xi A i j x j ,
2 i,j 2 i 2 i , j ,i 6= j
so that
∂f 1 X 1X
= A i i xi + x j (A i j + A j i ) = x j 2A i j
∂x i 2 j , j 6=i 2 j
X
= A i j x j = (Ax)i
j
This gives ∇ f = Ax. Finally we get
∂f ∂ X
= ( Ai j x j ) = Ai j ,
∂x i ∂x j ∂x j j
so that ∇2 f = A.
9 .c. As in b. we have that
1X 1X 1 X
f (x) = xi A i j x j = A i i x i2 + xi A i j x j ,
2 i,j 2 i 2 i , j ,i 6= j
but the further simplifications now take the form
∂f 1 X 1X
= A i i xi + x j (A i j + A j i ) = x j (A i j + A j i )
∂x i 2 j , j 6=i 2 j
1X 1
= (A i j + (A T )i j )x j = ( (A + A T )x)i
2 j 2
This gives ∇ f = 12 (A + A T )x. Finally we get
∂f ∂ 1X 1X
= ( (A i j + (A T )i j )x j ) = (A i j + (A T )i j ),
∂x i ∂x j ∂x j 2 j 2 j
so that ∇2 f = 12 (A + A T ).
87
10 . First note that f (0, 0) = 3, and that f (2, 1) = 8 We have that ∇ f = (2x 1 +
3x 2 , 3x 1 − 10x 2 ), and that ∇ f (0, 0) = (0, 0), and ∇ f (2, 1) = (7, −4). The first order
Taylor approximation at (0, 0) is thus
f (0, 0) + ∇ f (0, 0)T (x − (0, 0)) = 3.
The first order Taylor approximation at (2, 1) is
f (2, 1) + ∇ f (2, 1)T (x − (2, 1)) = 8 + (7, −4)T (x 1 − 2, x 2 − 1)

= 8 + 7(x 1 − 2) − 4(x 2 − 1) = 7x 1 − 4x 2 − 2.
12 . If A is positive definite then its eigenvalues λi are positive. The eigenvalues

of A −1 are 1/λi , which also are positive, so that A −1 also is positive definite.
Chapter 2
2 . We have that h 0 (x) = f 0 (x)g (x)+ f (x)g 0 (x), and h 00 (x) = f 00 (x)g (x)+ f (x)g 00 (x)+
2 f 0 (x)g 0 (x). Since f and g are convex we have that f 00 (x) ≥ 0 and g 00 (x) ≥ 0.
Since the functions are increasing we have that f 0 (x) ≥ 0 and g 0 (x) ≥ 0. Since the
functions also are positive we see that all three terms in the sum are ≥ 0 so that
h 00 (x) ≥ 0, and it follows that h also is convex.
3 .a. We have learnt that (continuously) differentiable functions are convex if

and only if the Hessian is positive semidefinite (here this is translated to that the
second derivative is ≥ 0). We have that h 0 (x) = f 0 (g (x))g 0 (x), and that h 00 (x) =
f 00 (g (x))[g 0 (x)]2 + f 0 (g (x))g 00 (x). Since f is convex we have that f 00 (g (x)) ≥ 0.
Since g also is convex we have that g 00 (x) ≥ 0. Since f is increasing we also have
that f 0 (g (x)) ≥ 0. Therefore both terms in the sum must be ≥ 0, so that h 00 (x) ≥ 0,
so that h is convex.
4 . The function f (x, y, z) = x 2 + y 2 −z is convex (the Hessian is positive semidef-

inite). The set in question can be written as the points where f (x, y, z) ≤ 0, which
is a sublevel set, and therefore convex.
8 . Write B in row echelon form, to see which are pivot variables. Express these
variables in terms of the free variables, and replace the pivot variables in all the
equations. Ax ≥ b then takes the form C x ≥ b (where x now is a shorter vector),
and this can be written as −C x ≤ −b, which is on the new form with H = −C ,
h = −b. Note that this strategy rewrites the vector c to a shorter vector.
88
9 . Let y = tj =1 λ j x j and z = tj =1 µ j x j , where all λ j , µ j ≥ 0, and tj =1 λ j = 1,
P P P
Pt
j =1 µ j = 1. For any 0 ≤ λ ≤ 1 we have that
t t t
(1 − λ)y + λz = (1 − λ) λj x j + λ µj x j = ((1 − λ)λ j + λµ j )x j .
X X X
j =1 j =1 j =1
The sum of the coefficients here is

t t t
((1 − λ)λ j + λµ j ) = (1 − λ) λj + λ µ j = 1 − λ + λ = 1,
X X X
j =1 j =1 j =1
so that C is a convex set.

Pn
10 . Follows from Proposition 2.3, since f (x) = e x is convex, and H (x) = j =1 x j
is affine.
1 1.a. λ f is convex if λ ≥ 0, concave if λ ≤ 0.
1 1.b. min{ f , g } may be neither convex or concave, consider the functions f (x) =
x 2 , g (x) = (x − 1)2 .
1 1.c. | f | may be neither convex or concave, consider the function f (x) = x 2 −1.
Chapter 3
1 . F (x) = 0 is equivalent to kF (x)k2 = i F i (x) = 0, where F i are the compo-
P
nent functions of F . Solving F (x) = 0 thus is equivalent to showing that 0 is the

P
minimum value of i F i (x).
2 . Here we construct the function f (x) = T (x) − x = x/2 − 3x 3 /2, which has
derivative f 0 (x) = 1/2 − 9x 2 /2. We can then run Newton’s method as follows:
newtonmult(sqrt(5/3),@(x)(0.5*x-1.5*x^3),@(x)(0.5-4.5*x^2))
p converges to the zero we are looking for, which we easily compute as x =

This
1/3.
4 .a. The function x → kAxk is continuous, and any continuous function achieves
a supremum in a closed set (here kxk = 1).
4 .b. For n = 2, it is clear that the sublevel set is the square with corners (1, 0),
(−1, 0), (0, 1), (0, −1).
89
4 .c. The function f (x) = kAxk is the composition of a convex function and an
affine function, so that it must be convex. If x ∈ Rn and kxk1 = 1, we can write
x = ni=1 λi v i , where 0 ≤ λi ≤ 1, ni=1 λi = 1, and v i = ±e i (i.e. it absorbs the sign
P P
of the i th component). If w is the vector among {±e j } j so that f (±e j ) ≤ f (w ) for

all j and all signs, Jensen’s inequality (Theorem 2.4) gives
Ã !
n n n
λi v i ≤ λi f (v i ) ≤ λi f (w ) = f (w ),
X X X
f (x) = f
i =1 i =1 i =1
so that f assumes its maximum in w .
4 .d. Since the supremum is attained for some w = ±e k , the maximum is

n
X n
X
kAw k1 = k ± colk Ak = | ± ai k | = |a i k |.
i =1 i =1
Pn
It is now clear that kAk = supk i =1 |a i k |.
5 . We are asked to find for which A we have that kAxk1 < kxk1 for any x. From
the previous exercise we know that this happens if and only if kAk < 1, i.e. when
Pn
i =1 |a i k | < 1 for all k.
6 . You can write
newtonmult(x0,...
@(x)([x(1)^2-x(1)/x(2)^3+cos(x(1))-1; 5*x(1)^4+2*x(1)^3-tan(x(1)*x(2)^8)-3]),...
@(x)([2*x(1)-1/x(2)^3-sin(x(1)) 3*x(1)/x(2)^4; ...
20*x(1)^3+6*x(1)^2-x(2)^8/(cos(x(1)*x(2)^8))^2 -8*x(1)*x(2)^7/(cos(x(1)*x(2)
)
Chapter 4
4 . The gradient of f is ∇ f = (4 + 2x 1 , 6 + 4x 2 ), and the Hessian matrix is ∇2 f =
µ ¶
2 0
, which is positive definite. The only stationary point is (−2, −3/2), which
0 4
is a minimum.
The gradient of g is ∇g = (4 + 2x 1 , 6 − 4x 2 ), and the Hessian matrix is ∇2 g =
µ ¶
2 0
, which is indefinite. The only stationary point is (−2, 3/2), which must
0 −4
be a saddle point.
90
6 . The gradient is ∇ f = (−400x 1 (x 2 − x 12 ) − 2(1 − x 1 ), 200(x 2 − x 12 )). The Hessian
matrix is
1200x 12 − 400x 2 + 2 −400x 1
µ ¶
2
∇ f = .
−400x 1 200
Clearly the only stationary point is x = (1, 1), and we get that
µ ¶
2 802 −400
∇ f (1, 1) = .
−400 200
It is straightforward to check that this matrix is positive definite, so that (1, 1) is

a local minimum.
7 . The steepest descent method takes the form
x k+1 = x k − αk ∇ f (x k ),
where ∇ f (x k ) = Ax k − b. We have that
f (x k+1 ) = (1/2)x Tk+1 Ax k+1 − b T x k+1

1¡ 1¡
= ∇ f (x k )T A∇ f (x k ) α2k − x Tk A∇ f (x k ) + ∇ f (x k )T Ax k αk
¢ ¢
2 2
1 T
+ x k Ax k − b T (x k − αk ∇ f (x k ))
2
1¡
= ∇ f (x k )T A∇ f (x k ) α2k
¢
µ2 ¶
T 1¡ T T
+ b ∇ f (x k ) − x k A∇ f (x k ) + ∇ f (x k ) Ax k αk
¢
2
1 T T
+ x k Ax k − b x k .
2
Now, since we claim that ∇ f (x k ) is an eigenvector, and that A is symmetric, we
get that A∇ f (x k ) = λ∇ f (x k ) and ∇ f (x k )T A = λ∇ f (x k )T , where λ is the corre-
sponding eigenvalue. This means that the above can be written
1 1
f (x k+1 ) = λk∇ f (x k )k2 α2k + b T ∇ f (x k ) − x Tk A∇ f (x k ) αk + x Tk Ax k − b T x k
¡ ¢
2 2
If we take the derivative of this w.r.t. αk and set this to 0 we get
b T ∇ f (x k ) − x Tk A∇ f (x k ) x k A − b T ∇ f (x k )
¡ T ¢
αk = − =
λk∇ f (x k )k2 λk∇ f (x k )k2
(Ax k − b)T ∇ f (x k ) ∇ f (x)T ∇ f (x k ) 1
= = = .
λk∇ f (x k )k2 λk∇ f (x k )k2 λ
91
This means that αk = λ1 is the step size we should use when we perform exact
line search. We now compute that
µ ¶
1
∇ f (x k+1 ) = Ax k+1 − b = A x k − ∇ f (x k ) − b
λ
1
= Ax k − A∇ f (x k ) − b = Ax k − ∇ f (x k ) − b
λ
= ∇ f (x k ) − ∇ f (x k ) = 0,
which shows that the minimum is reached in one step.
8 .a. This is simply Exercise 1.3.9.
8 .b. If ∇2 f (x) is positive definite, its eigenvalues are positive, so that the de-
terminant is positive, and that the matrix is invertible. h = −(∇2 f (x k ))−1 ∇ f (x k )
follows after multiplying with the inverse.
9 . Here we have said nothing about the step length, but we can implement this
as in the function newtonbacktrack as follows:
function [xopt,numit]=steepestdescent(f,df,x0)
epsilon=10^(-3);
xopt=x0;
maxit=100;
for numit=1:maxit
d=-df(xopt);
eta=-df(xopt)’*d;
if eta/2<epsilon
break;
end
alpha=armijorule(f,df,xopt,d);
xopt=xopt+alpha*d;
end
The algorithm can be tested on the first function from Exercise 4 as follows:
f=@(x)(4*x(1)+6*x(2)+x(1)^2+2*x(2)^2);
df=@(x)([4+2*x(1);6+4*x(2)])
steepestdescent(f,df,[-1;-1])
11 . The function can be implemented as follows:
92
function alpha=armijorule(f,df,x,d)
beta=0.2; s=0.5; sigma=10^(-3);
m=0;
while (f(x)-f(x+beta^m*s*d) < -sigma *beta^m*s *(df(x))’*d)
m=m+1;
end
alpha = beta^m*s;
12 . The function can be implemented as follows:

function [xopt,numit]=newtonbacktrack(f,df,d2f,x0)
epsilon=10^(-3);
xopt=x0;
maxit=100;
for numit=1:maxit
d=-d2f(xopt)\df(xopt);
eta=-df(xopt)’*d;
if eta/2<epsilon
break;
end
alpha=armijorule(f,df,xopt,d);
xopt=xopt+alpha*d;
end
Chapter 5
3 . This is the same as finding the minimum of f (x 1 , . . . , x n ) = −x 1 x 2 · · · x n . This
Q
boils down to the equations − i 6= j x i = 1, since clearly the minimum is not at-
tained when there are any active constraints. This implies that x 1 = . . . = x n , so
that all x i = 1/n. It is better to give a direct argument here that this must be a
minimum, than to attempt to analyse the second order conditions for a mini-
mum.
4 . We can formulate the problem as finding the minimum of f (x 1 , x 2 ) = (x 1 −

a 1 )2 + (x 2 − a 2 )2 subject to the constraint h 1 (x 1 , x 2 ) = x 12 + x 22 = 1. The minimum
can be found geometrically by drawing a line which passes through a and the
origin, and reading the intersection with the unit circle. This follows also from
that ∇ f is parallel to x −a, ∇h 1 is parallel to x, and from that the KKT-conditions
say that these should be parallel.
93
6 . We rewrite the constraint as g 1 (x 1 , x 2 ) = x 12 +x 22 −1 = 0, and get that ∇g 1 (x 1 , x 2 ) =
(2x 1 , 2x 2 ). Clearly all points are regular, since ∇g 1 (x 1 , x 2 ) 6= 0 whenever g 1 (x 1 , x 2 ) =
0. Since ∇ f = (1, 1) we get that the gradient of the Lagrangian is
µ ¶ µ ¶
1 2x 1
+λ = 0,
1 2x 2
p p
p that xp
which gives 1 = x 2 . This gives us the two possible feasible
p points (1/ 2, 1/ 2)
and (−1/ 2, −1/ 2). For the first we see that λ = −1/ µ 2, for¶ the second we
p 2 0
see that λ = 1/ 2. The Hessian of the Lagrangian is λ . For the point
0 2
p p p p
(1/ 2, 1/ 2) this is negative definite since λ is negative, for the point (−1/ 2, −1/ 2)
this is positive definite since λ is positive. From
p the p second order conditions it
follows that the minimum is attained in (−1/ 2, −1/ q2).
If we instead eliminated x 2 we must write x 2 = − 1 − x 12 (since the positive
square
p root gives a bigger value for f ), so that we must minimize f (x) = x −
1 − x subject to the constraint −1 ≤ x ≤ 1. The derivative of this is 1 + p x 2 ,
2
1−x
which is zero when x = − p1 , which we found above. We also could have found
2
this by considering the two inequality constraints −x − 1 ≤ 0 and x − 1 ≤ 0.
If the first one of these is active (i.e. x = −1), the KKT conditions say that
f 0 (−1) > 0. However, this is not the case since f 0 (x) → −∞ when x → −1+ . If
the second constraint is active (i.e. x = 1), the KKT conditions say that f 0 (1) < 0.
This is not the case since f 0 (x) → ∞ when x → 1−. When we have no active
constraint, the problem boils down to setting the derivative to zero, in which
case we get the solution we already have found.
9 . We define h 1 (x 1 , . . . , x n ) = x 1 + . . . + x n − 1, and find that ∇h 1 = (1, 1, . . . , 1).

The stationary points are characterized by ∇ f + λT (1, 1, . . . , 1) = 0, which has a
∂f ∂f ∂f
solution exactly when ∂x1 = ∂x2 = . . . = ∂xn .
10 . We substitute x n = 1 − x 1 − . . . − x n−1 in the expression for f , to turn the

problem into one of minimizing a function in n − 1 variables.
11 . The problem can be rewritten to the following minimation problem:
min{−x T Ax : g 1 (x) = kxk − 1 = 0}.
2x x
We have that ∇ f (x) = −Ax, and ∇g 1 (x) = 2kxk = kxk . Clearly all points are regu-
lar, and we get that
x
∇ f + λ∇g 1 = −Ax + λ = 0.
kxk
94
Since we require that kxk = 1 we get that Ax = λx. In other words, the optimal
point x is an eigenvector of A, and the Lagrange multiplier is the corresponding
eigenvalue.
12 . Define f (x 1 , x 2 , x 3 ) = (1/2)(x 12 + x 22 + x 32 ) and g 1 (x 1 , x 2 , x 3 ) = x 1 + x 2 + x 3 . We

have that ∇ f = (x 1 , x 2 , x 3 ), ∇g 1 = (1, 1, 1). Clearly all points are regular points. If
there are no active constraints, we must have that ∇ f = 0, so that x 1 = x 2 = x 3 =
0, which does not fulfill the constraint. If teh constraint is active we must have
that (x 1 , x 2 , x 3 ) + µ(1, 1, 1) = 0 for some µ ≤ 0, which is satisfied when x 1 = x 2 =
x 3 < 0. Clearly we must have that x 1 = x 2 = x 3 = −2. The Hessian of L(x, λ, µ) is
easily computed to be positive definite, so that we have found a minimum.
13 . We need to minimze f (x 1 , x 2 ) = (x 1 − 3)2 + (x 2 − 5)2 + x 1 x 2 subject to the

constraints
g 1 (x 1 , x 2 ) = −x 1 ≤ 0
g 2 (x 1 , x 2 ) = −x 2 ≤ 0
g 3 (x 1 , x 2 ) = x 1 − 1 ≤ 0
g 4 (x 1 , x 2 ) = x 2 − 1 ≤ 0.
We have that ∇ f = (2(x 1 − 3) + x 2 , 2(x 2 − 5) + x 1 ), and ∇g 1 = (−1, 0), ∇g 2 = (0, −1),

∇g 3 = (1, 0), ∇g 4 = (0, 1). If there are no active constraints the KKT conditions say
that ∇ f = 0, so that
2x 1 + x 2 = 6x 1 + 2x 2 = 10
which gives that x 1 = 2/3 and x 2 = 14/3. This point does not satisfy the con-
straints, however.
Assume now that we have one active constraint. We have four possibilities
in this case (and any solution will be regular). If the first constraint is active the
KKT conditions say that
µ ¶ µ ¶
2x 1 + x 2 − 6 −1
+µ =0
x 1 + 2x 2 − 10 0
Setting x 1 = 0 we get that x 2 − 6 = µ and 2x 2 − 10 = 0, so that x 2 = 5, which does

not satisfy the constraints.
If the second constraint is active we get similarly that 2x 1 −6 = 0, which also does
not satisfy the constraints.
If the third constraint is active we get 2x 2 − 9 = 0, which does not satisfy the
constraints.
95
If the fourth constraint is active we get 2x 1 − 5 = 0, which does not satisfy the
constraint.
Assume that we have two active constraints. Also here there are four possi-
bilities (and any solution will be regular):
x 1 = x 2 = 0: The KKT consitions say that (−6, −10) + (−µ1 , −µ2 ) = 0, which is im-
possible since µ1 , µ2 are positive.
x 1 = 0, x 2 = 1: The KKT conditions say that (−5, −8) + (−µ1 , µ4 ) = 0, which also is
impossible
x 1 = 1, x 2 = 0: The KKT conditions say that (−4, −9) + (µ3 , −µ2 ) = 0, which also is
impossible
x 1 = x 2 = 1: The KKT conditions say that (−3, −7)+(µ3 , µ4 ), which has a solution.
Clearly it is not possible to have more than two active constraints. The minimum
point is therefore (1, 1).
14 . We can define g 1 (x 1 , x 2 ) = x 12 +x 22 −2, so that the only constraint is g 1 (x 1 , x 2 ) ≤

0. We have that ∇g 1 = (2x 1 , 2x 2 ), and this can be zero if and only if x 1 = x 2 = 0.
However g 1 (0, 0) = −2 < 0, so that the equality is not active. This means that all
points are regular for this problem.
We compute that ∇ f = (1, 1). If g 1 is not an active inequality, the KKT condi-
tions say that ∇ f = 0, which is impossible. If g 1 is active, we get that
µ ¶ µ ¶ µ ¶
1 2x 1 0
∇ f (x 1 , x 2 ) + µ∇g 1 (x 1 , x 2 ) = +µ = ,
1 2x 2 0
so that 1 = −2µx 1 and 1 = −2µx 2 for some µ ≥ 0. This is satisfied if x 1 = x 2 is

negative. For g 1 to be active we must have that x 12 + x 22 = 2, which implies that
x 1 = x 2 = −1. We have that f (−1, −1) = −2.
16 . We define g j (x) = −x j for j = 1, . . . , n, and g n+1 (x) = nj=1 x j − 1. We have

P
that ∇g j = −e j for 1 ≤ j ≤ n, and ∇g n+1 = (1, 1, . . . , 1). If there are no active in-
equalities, we must have that ∇ f (x) = 0. If the last constraint is not active we
have that
µj e j ,
X
∇f =
j ∈A(x), j ≤n
i.e. ∇ f points into the cone spanned by e j , j ∈ A(x). If the last constraint is
active also , we see that
(µ j − µn+1 )e j .
X X
∇f = −µn+1 e j
j 6∈ A(x), j ≤n j ∈A(x), j ≤n
∇ f is on this form whenever components outside the active set are equal and
≤ 0, and all are components are greater than or equal to this.
96
Chapter 6
1 . The constraint Ax = b actually yields one constraint per row in A, and the
gradient of the i ’th constraint is the i ’th row in A. This gives the following sum
in the KKT conditions:
m m m
∇g i λi = a iT· λi == (A T )·i λi = A T λ.
X X X
i =1 i =1 i =1
The gradient of f (x k ) + ∇ f (x k )T h + 12 h T ∇2 f (x k )h is ∇ f (x k ) + ∇2 f (x k )h. The

KKT conditions are thus ∇ f (x k ) + ∇2 f (x k )h + A T λ = 0 and Ah = 0. This can be
written as the set of equations
∇2 f (x k )h + A T λ = −∇ f (x k )
Ah + 0λ = 0,
from which the stated equation system follows.

¡ ¢ ¡ ¢
2 .a. We can set g 1 (x, y) = −x, g 2 (x, y) = −y, A = −1 1 , and b = 1 .
2 .c. The barrier problem here is to minimize x + y − µ ln x − µ ln y subject to the

constraint y − x = 1. The gradient of the Lagrangian is (1 − µ/x, 1 − µ/y) + A T λ. If
this is 0 we must have that 1−µ/x = µ/y −1, so that 2x y = µ(x + y) The constraint
gives that y = x + 1, so that 2x(x + 1) = µ(2x + 1). This can be written as 2x 2 + (2 −
µ)x − µ = 0, which has the solution
−(2 − µ) ± (2 − µ)2 + 8µ −(2 − µ) ± (2 + µ)

p
x= = ,
4 4
which gives the two solutions (x, y) = (µ/2, µ/2 + 1) and (x, y) = (−1, 0). Only the
first solution here is within the domain of definition for f , so the barrier method
obtains this minimum.
2 .d. By inserting y = x + 1 for the constraint we see that we need to minimize

g (x) = 2x + 1 subject to x ≥ 0, which clearly has a minimum for x = 0, and then
y = 1. This gives the same minimum as in c.
2 .e. The KKT conditions takes one of the following forms:
• If there are no active inequalities:
∇ f + A T λ = (1, 1) + λ(−1, 1) = 0,
which has no solutions.
97
• The first inequality is active (i.e. x = 0):
∇ f + A T λ + µ1 ∇g 1 = (1, 1) + λ(−1, 1) + µ1 (−1, 0) = (1 − λ − µ1 , 1 + λ) = 0,
which gives that λ = −1 and µ1 = µ1 = 2. When x = 0 the equaility con-

straint gives that y = 1, so that (0, 1) satisfies the KKT conditions.
• The second inequality is active (i.e y = 0): The first constraint then gives
that x = −1, which does not give a feasible point.
In conclusion, (0, 1) is the only point which satisfies the KKT conditions. If we
attempt the second order test, we will see that it is inconclusive, since the Hes-
sian of the Lagrangian is zero. To prove that (1, 0) must be a minimum, you can
argue that f is very large outside any rectangle, so that it must have a minimum
on this rectangle (the rectangle is a closed and bounded set).
2 .f. With the barrier method we obtained the solution x(µ) = (µ/2, µ/2 + 1).
Since this converges to (0, 1) as µ → 0, the central path converges to the solution
we have found.
3 . You can use the following code:
IPBopt(@(x)(x(1)+x(2)),@(x)(-x(1)),@(x)(-x(2)),...
@(x)([1;1]),@(x)([-1;0]),@(x)([0;-1]),...
@(x)(zeros(2)),@(x)(zeros(2)),@(x)(zeros(2)),...
[-1 1],1,[4;5])
4 . Here we have that ∇ f = 2x, ∇g 1 = −1, ∇g 2 = 1. If there are no active con-

straints the KKT conditions say that 2x = 0, so that x = 0, which is outside the
domain of definition for f .
If the first constraint is active we get that 2x − µ1 = 4 − µ1 = 0, so that µ1 = 4. This
is a candidate for the minimum (clearly the second order conditions for a mini-
mum is fulfilled here as well, since the Hessian of the Lagrangian is 2).
If the second constraint is active we get that 2x + µ2 = 4 + µ2 = 0, so that µ2 = −4,
so that this gives no candidate for a solution.
It is impossible for both constraints to be active at the same time, so x = 2 is the
unique minimum.
6 . You can use the following code:
98
IPBopt2(@(x)((x-3).^2),@(x)(2-x),@(x)(x-4),...
@(x)(2*(x-3)),@(x)(-1),@(x)(1),...
@(x)(2),@(x)(0),@(x)(0),3.5)
¡ ¢
7 .a. We can set A = 1 1 , og b = 1.
7 .b. We set g 1 (x, y) = −x ≤ 0 and g 2 (x, y) = −y ≤ 0, and have that ∇ f = (3, 2),
∇g 1 = (−1, 0), ∇g 2 = (0, −1). The KKT-conditions therefore take the form x + y = 1
and
∇ f + A T λ + ν1 ∇g 1 + ν2 ∇g 2 = (3, 2) + λ(1, 1) + ν1 (−1, 0) + ν2 (0, −1) = 0,
where the two last terms are included only if the corresponding inequalities are
active, and where ν1 , ν2 ≥ 0.
If none of the inequalities are active we get that (3, 2)+λ(1, 1) = 0, which has now
solution.
If both inequalities are active we get that x = y = 0, which does not fulfill the
constraint x + y = 1. If we have only one active inequality we have two possibil-
ities: If the first inequality is active we get that (3, 2) + λ(1, 1) + ν1 (−1, 0) = 0. The
equation for the second component says that λ = −2, and the equation for the
first component says that 3 − 2 − ν1 = 0, so that ν1 = 1.
If the second inequality is active we get that (3, 2) + λ(1, 1) + ν2 (0, −1) = 0. The
equation for the first component says that λ = −3, and the equation for the sec-
ond component says that 2 − 3 − ν2 = 0, which gives that ν2 = −1. This possibili-
tywe must denounce since ν2 < 0. We are left with the first inquality as active as
the only possibility. Then x = 0, and the constraint x + y = 1 gives that y = 1, and
a minimum value of 2. Since f clearly is bounded below on the region we work
on, it is clear that this must be a global minimum.
Finally we should candidates for the minimum which are not regular points.
If none of the equations are active we have no candidates, since ∇h 1 = (1, 1) 6= 0.
If one inequality is active we get no candidates either, since (1, 1) and (−1, 0) are
linearly independent, and since (1, 1) and (0, −1) are linearly independent. If
both inequalities are active (x = y = 0), it is clear that the constraint x + y = 1 is
not fulfilled. All in all, we get no additional candidates from points which are not
regular.
7 .c. We get that φ(x, y) = − ln x − ln y, and ∇φ = (−1/x, −1/y).
7 .d. In the barrier problem we minimize the function f (x, y) + µφ(x, y) = 3x +

2y − µ ln x − µ ln y under the constraint x + y = 1. The KKT-conditions become
99
(3, 2) + µ(−1/x, −1/y) + λ(1, 1) = 0, which gives the equations
µ
= 3+λ
x
µ
= 2 + λ.
y
µ µ
This gives that y + 1 = x , which again gives µ(y − x) = x y. If we substitute the
constraint x + y = 1 we get that µ(1−2x) = x(1−x), which can be written x 2 −(1+
2µ)x + µ = 0. If we solve this we find that
p p
1 + 2µ ± (1 + 2µ)2 − 4µ 1 + 2µ ± 1 + 4µ2
x= = .
2 2
This corresponds to two different points, depending on which sign we choose,
but if we choose + as sign we see that x > 1, so that y < 0 in order for x + y =
1, so that (x, y) then is outside
p the domain of definition for the problem. We
1+2µ− 1+4µ2
therefore have that x = 2 . It is clear that x → 0 when µ → 0 here, so
that the solution of the barrier problem converges to the solution of the original
problem.
100
Mathematics index
101
Index for MATLAB commands
102
Bibliography
[1] D.P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University

Press, 2004.
[3] G. Dahl. A note on diagonally dominant matrices. Linear Algebra and its
Appl., 317(1-3):217–224, 2000.
[4] G. Dahl. An introduction to convexity. Report, University of Oslo, 2010.
[5] J. B. Hiriart-Urruty and C. Lemaréchal. Convex analysis and minimization

algorithms I. Springer, 1993.
[6] C. T. Kelley. Iterative Methods for Linear and Nonlinear Equations. SIAM,
1995.
[7] D. C. Lay. linear algebra and its applications (4th edition). Addison Wesley,
2011.
[8] T. Lindstrøm and K. Hveberg. Flervariabel analyse med lineær algebra.

Pearson, 2011.
[9] D.G. Luenberger. Linear and nonlinear programming. Addison-Wesley,

1984.
[10] T. Lyche. Numerical linear algebra. Report, University of Oslo, 2010.
[11] J. Nocedal and S.J. Wright. Numerical Optimization. Springer, 2006.
[12] R.T. Rockafellar. Convex Analysis. Princeton University Press, 1970.
103
[13] A. Ruszczynski. Nonlinear optimization. Princeton University Press, 2006.
[14] R. Webster. Convexity. Oxford University Press, Oxford, 1994.
104

Matinf 2360 Part 3

Uploaded by

Copyright:

Available Formats

Matinf 2360 Part 3

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Matinf 2360 Part 3

Uploaded by

Copyright:

Available Formats

Nonlinear optimization

Lecture notes for the course MAT-INF2360

Øyvind Ryan, Geir Dahl, and Knut Mørken

1 The basics and applications 5

2 A crash course in convexity 21

5 Constrained optimization - theory 51

Geir Dahl, Knut Mørken, and Øyvind Ryan

The problem of minimizing a function of several variables, possibly subject to

Notation: For z ∈ Rn and δ > 0 define the (closed) ball B̄ (z; ²) = {x ∈ Rn :

means that the statement P (x) is true for all x ∈ H .

1.1 The basic concepts

f (x ∗ ) ≤ f (x) for all x ∈ B (x ∗ ; ²).

f (x ∗ ) ≤ f (x) for all x ∈ Rn .

A global maximum satisfies the opposite inequality.

minimize (x 1 − 3)2 + (x 2 − 2)2

it is clear how to convert a maximization problem into a minimization problem

1.2 Some applications

1.2.1 Portfolio optimization

The model may be understood as follows. The decision variables are x 1 , x 2 ,

(= mix among investments), then E X = nj=1 µ j x j which is the second term in f .

c i i is also called the variance of R i .

1.2.2 Fitting a model

for some function F α : Rm → R. Here α = (α1 , α2 , . . . , αn ) ∈ Rn is a parameter vec-

since it is usually a simplification of the system one considers. In statistics one

1.2.3 Maximum likelihood

Taking logarithms and multiplying by −1, our problem is to minimize

1.2.4 Optimal control problems

where x t ∈ Rn , x 0 is the initial solution, and h t is a given function for each t . We

where the control is the sequence (u 0 , u 1 , . . . , u T −1 ) to be determined. This

Here A is an m × n matrix, b ∈ Rm and x ≥ 0 means that x i ≥ 0 for each i ≤ n.

1.3 Multivariate calculus and linear algebra

The diagonal of D contains the eigenvalues of A, and A has an orthonormal set

2 See Section 7.2 in [7]

(i) A is positive semidefinite,

(i) A is positive definite,

Every positive definite matrix is therefore invertible.

∂ f (x) ∂ f (x) ∂ f (x)

We will always identify an n-tuple with the corresponding column vector3 . Of

Theorem 1.4 (First order Taylor theorem). Let f : Rn → R be a function hav-

Theorem 1.5 (Second order Taylor theorem). Let f : Rn → R be a function

This may be shown by considering the one-variable function g (t ) = f (x +t h)

We introduce notation for these approximations

T f1 (x; x + h) = f (x) + ∇ f (x)T h

F (x + h) = F (x) + F 0 (x)h + O(khk)

Finally, if F : Rn → Rm and G : Rk → Rn we define the composition H = F ◦ G

H 0 (x) = F 0 (G(x))G 0 (x).

Definition 1.8 (Linear convergence). We say that a sequence {x k }∞ k=1

A faster convergence rate is superlinear convergence which means that

lim kx k+1 − x ∗ k/kx k − x ∗ k = 0

A special type of superlinear convergence is quadratic convergence where

for some γ < 1.

Exercises for Chapter 1

a. What happens to the sublevel sets S α as α decreases? Give an example.

b. Show that if f is continuous and there is an x 0 such that with α = f (x 0 )

8. Consider the portfolio optimization problem in Subsection 1.2.1.

a. Assume that c i j = 0 for each i 6= j . Find, analytically, an optimal solu-

a. Let f : Rn → R be defined by f (x) = q T x = x T q, where q is a vector.

b. Let f : Rn → R be the quadratic function f (x) = (1/2)x T Ax, where A is

10. Consider f (x) = f (x 1 , x 2 ) = x 12 + 3x 1 x 2 − 5x 22 + 3. Determine the first order