School of Computer Science and Applied Mathematics
LEARNING OUTCOMES
By the completion of this lecture you should be able to:
1. Describe the optimization problem and determine if a solution exists.
2. Describe the intrinsic characterization of a solution to an optimization problem.
3. Describe basic algorithms to find a solution to an optimization problem.
Reference: This lecture is largely based on
• Chapter 1 and 2, Jorge Nocedal and Stephen J. Wright, ‘Numerical Optimization’.
• Chapter 1, Dimitri P. Bertsekas, ‘Nonlinear Programming’.
1.1 Introduction
Optimization is central to any problem involving decision making, whether in science, engineering or
economics. The task of decision making entails choosing between various alternatives. This choice is
governed by a desire to make the best decision. For example, investors seek to create portfolios that avoid
excessive risk while achieving a high rate of return. Manufacturers aim for maximum efficiency in the design
and operation of their production processes. Engineers adjust parameters to optimize the performance of
their designs.
Mathematical models for optimization can be generally represented by a constraint set S and an objective
function f that maps elements of S into real numbers. The set S consists of the available decisions x and
the objective f (x) is a scalar measure of goodness of choosing the decision x. We want to find an optimal
decision, x∗ ∈ S such that
f (x∗ ) ≤ f (x), ∀x ∈ S.
Optimization theory and methods deal with selecting the best decision in the sense of the given objective
function.
At this level of generality, very little can be said about the optimization problem. The problem contains,
as special cases, several important classes of problems that have widely different structures. In this course,
our focus will be on nonlinear programming problems. We provide some orientation about the character
of these problems and their relations with other types of optimization problems.
Perhaps the most essential characteristic of an optimization problem is whether it is discrete or continuous. In
discrete optimization problems, the variables make sense only if they take discrete values, and the constraint set S
is finite or countably infinite. Typical examples of discrete problems arise in scheduling, route planning,
and matching. In continuous optimization problems, the variables can take on any real value. In this case,
the constraint set S is uncountably infinite and has a continuous character. Typical examples of continuous problems
are those where there are no constraints, i.e., where S = Rn, or where S is specified by some equations and
inequalities.
In nonlinear programming, either the objective function f is nonlinear, or the constraint set S is specified
by nonlinear equations and inequalities. Nonlinear programming lies squarely within the class of continuous
problems. This course focuses on nonlinear problems, their continuous variables, and the associated
mathematical analysis. Another essential characteristic of an optimization problem worth mentioning is
convexity. In convex programming, the objective function is convex and the constraints define a convex feasible set.
Convex programming is well studied in the literature, and we do not focus on it here.
In this lecture, we formulate a nonlinear programming problem and establish the conditions for existence of
an optimal solution. We also characterize the optimal solution by stating and proving optimality conditions
and give basic algorithmic concepts.
We consider the problem of determining the value of a vector of decision variables x ∈ Rn that minimizes
an objective function f : Rn → R, when x is required to belong to a feasible set S ⊆ Rn; that is, we consider
the problem:
$$\min_{x \in S} f(x). \tag{1.1}$$
Note that optimization problems can be stated as either minimization or maximization. A minimization
problem seeks the least possible value of the objective function. In comparison, a maximization problem
seeks the highest possible value of the objective function. Conventionally, optimization problems are
formulated as minimization problems. The maximization problem can be solved by minimizing −f .
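There are two cases of main interest regarding the feasible set S ⊆ Rn of (1.1):
• When there is no restriction on the decision variable, the feasible set is S = Rn, and problem (1.1)
becomes:
$$\min_{x \in \mathbb{R}^n} f(x).$$
In this case we say that problem (1.1) is unconstrained.
• When the feasible set is described by equality and inequality constraints on the decision variable:
$$S = \{x \in \mathbb{R}^n : c_i(x) = 0,\ i \in E;\ c_i(x) \ge 0,\ i \in I\},$$
where E and I are respectively the index sets for equality and inequality constraints. In this case we
say that problem (1.1) is constrained.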
Problem (1.1) is a Nonlinear Programming (NLP) problem when at least one of the problem functions
f and ci, i ∈ E ∪ I, is nonlinear in its argument x. We assume that the problem functions are at least
continuously differentiable (hence smooth) in Rn.
The process of identifying objective functions, variables and constraints for a given problem is known as
modelling. Construction of an appropriate model is the first step - sometimes the most critical step - in the
optimization process. If the model is too simple, it will not give helpful insight into the practical problem,
and if the model is too complex, it may become too difficult to solve.
Once the model has been formulated, an optimization algorithm can be used to find its solution. Usually,
the algorithm and model are complicated enough that a computer is needed to implement this process. There
is no universal optimization algorithm. Instead, there are numerous algorithms, each of which is tailored
to a particular type of optimization problem. It is often the user’s responsibility to choose an algorithm
that is appropriate for their application. This choice is an important one; it may determine whether the
problem is solved rapidly or slowly and, indeed, whether the solution is found at all.
When f is a convex function and S is a convex set, problem (1.1) is a convex NLP problem. In particular, S
is convex if the equality constraint functions ci for i ∈ E are affine and the inequality constraint functions
ci for i ∈ I are concave, so that each set {x : ci(x) ≥ 0} is convex. Convexity adds much structure to the NLP
problem and can be exploited widely
both from the theoretical and the computational point of view. If f is convex quadratic and the constraint
functions ci are affine, we have a Quadratic Programming problem. This is a case of special interest. Here
we will confine ourselves to general NLP problems without convexity assumptions.
Notion of a solution
Definition 1.1 (Global solution). A point x∗ ∈ S is a global solution of problem (1.1) if f(x∗) ≤ f(x)
for all x ∈ S; it is a strict global solution if f(x∗) < f(x) for all x ∈ S, x ≠ x∗.
A main existence result for a constrained problem is that a global solution exists if f is continuous and S is
nonempty and compact
(Weierstrass Theorem). An easy consequence for unconstrained problems is that a global solution exists if
the level set Lα = {x ∈ Rn : f(x) ≤ α} is nonempty and compact for some finite α ∈ R.
Definition 1.2 (Local solution). A point x∗ ∈ S is a local solution of problem (1.1) if there exists an
open neighbourhood Nx∗ of x∗ such that f(x∗) ≤ f(x) for all x ∈ S ∩ Nx∗; it is a strict local solution if
f(x∗) < f(x) for all x ∈ S ∩ Nx∗, x ≠ x∗.
To determine a global solution of an NLP is generally challenging since we usually only have a local
perspective of the objective function f . Usually, NLP algorithms can determine only local solutions.
Nevertheless, in practical applications, a local solution can be of great worth. We will confine ourselves to
local optimization algorithms.
In the following, we assume for simplicity that the objective function is twice continuously differentiable.
In many cases, continuous differentiability alone will suffice. We also assume that the standard
assumption ensuring the existence of a solution to the problem (1.4) holds, namely:
Assumption 1.1. The level set Lα = {x ∈ Rn : f(x) ≤ α}, with α = f(x0), is compact for some point x0 ∈ Rn.
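The following definition plays a significant role in the design and analysis of unconstrained optimization
algorithms.
Definition 1.3 (Descent direction). A direction d ∈ Rn is a descent direction for f at x if there exists
δ > 0 such that f(x + sd) < f(x) for all s ∈ (0, δ).
If the function f is differentiable, it is possible to give a simple condition guaranteeing that a certain direction
is a descent direction.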
Proposition 1.1. Let f : Rn → R and assume that ∇f (x) exists and is continuous. Let x and d be given.
Then, if ∇f (x)T d < 0 the direction d is a descent direction for f at x.
The proposition establishes that if ∇f(x)T d < 0, then f decreases for sufficiently small positive displacements
from x along d. Likewise, if ∇f(x)T d > 0, then d is a direction
of ascent, that is, f increases for sufficiently small positive displacements from x along d.
If ∇f(x)T d = 0, then d is orthogonal to ∇f(x), and it is not possible to establish the nature of the
direction d without further knowledge of the function f.
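The sign test of Proposition 1.1 can be checked numerically. The sketch below is only an illustration: the function q(x) = x1² + 2x2², the point x = [1, 1]T, the three directions and the trial displacement 10⁻³ are all assumptions chosen for the example, not part of the lecture.

```python
def q(x):
    """Illustrative objective q(x) = x1^2 + 2*x2^2 (an assumption, not from the lecture)."""
    return x[0] ** 2 + 2.0 * x[1] ** 2


def grad_q(x):
    """Gradient of q."""
    return [2.0 * x[0], 4.0 * x[1]]


def slope(grad, d):
    """Directional derivative grad^T d."""
    return sum(gi * di for gi, di in zip(grad, d))


def step(x, d, s):
    """Return the displaced point x + s*d."""
    return [xi + s * di for xi, di in zip(x, d)]


x = [1.0, 1.0]  # here grad_q(x) = [2, 4]
for d in ([-1.0, 0.0], [1.0, 0.0], [2.0, -1.0]):
    sl = slope(grad_q(x), d)
    change = q(step(x, d, 1e-3)) - q(x)
    # Negative slope: q decreases; positive slope: q increases;
    # zero slope: the first-order test alone is inconclusive.
    print(d, "slope:", sl, "change in q:", change)
```

The third direction is orthogonal to the gradient. For this particular q it happens to increase the function, since q(x + sd) = 3 + 6s², but that conclusion needs second-order information and does not follow from the sign test.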
We are now ready to characterize an optimal solution to an optimization problem. We state and prove
some necessary and sufficient conditions for a local minimum.
Theorem 1.1 (First order necessary condition). Let f : Rn → R and assume ∇f exists and is continuous.
The point x∗ is a local minimum of f only if
∇f (x∗ ) = 0.
Theorem 1.2 (Second order necessary condition). Let f : Rn → R and assume ∇2 f exists and is
continuous. The point x∗ is a local minimum of f only if
∇f (x∗ ) = 0,
and
dT ∇2 f (x∗ )d ≥ 0, for all d ∈ Rn .
Proof. The first condition ∇f(x∗) = 0 is a consequence of Theorem 1.1. To prove the second condition,
we use the Taylor approximation. As f is twice differentiable, for any direction d ∈ Rn and any scalar s, one has
$$f(x^* + sd) = f(x^*) + s\,\nabla f(x^*)^T d + \frac{1}{2} s^2\, d^T \nabla^2 f(x^*) d + \beta(x^*, sd),$$
where
$$\lim_{s \to 0} \frac{\beta(x^*, sd)}{s^2} = 0.$$
Moreover, the condition ∇f(x∗) = 0 yields, for s ≠ 0,
$$\frac{f(x^* + sd) - f(x^*)}{s^2} = \frac{1}{2} d^T \nabla^2 f(x^*) d + \frac{\beta(x^*, sd)}{s^2}. \tag{1.5}$$
However, as x∗ is a local minimum, the left-hand side of equation (1.5) must be non-negative for all nonzero s
sufficiently small, hence
$$\frac{1}{2} d^T \nabla^2 f(x^*) d + \frac{\beta(x^*, sd)}{s^2} \ge 0$$
and
$$\lim_{s \to 0} \left[ \frac{1}{2} d^T \nabla^2 f(x^*) d + \frac{\beta(x^*, sd)}{s^2} \right] = \frac{1}{2} d^T \nabla^2 f(x^*) d,$$
so that dT ∇2 f(x∗)d ≥ 0, which proves the second condition. Q.E.D.
The conditions in Theorem 1.2 are necessary; with a slight modification, they are also sufficient.
Theorem 1.3 (Second order sufficient condition). Let f : Rn → R and assume ∇2 f exists and is continuous.
The point x∗ is a strict local minimum of f if
∇f (x∗ ) = 0,
and
dT ∇2 f (x∗ )d > 0, for all non-zero d ∈ Rn .
Proof. To begin with, note that as ∇2 f(x∗) ≻ 0 and ∇2 f is continuous, there is a neighbourhood
Nx∗ of x∗ (which we may take to be an open ball) such that ∇2 f(y) ≻ 0 for all y ∈ Nx∗.
Now consider the Taylor series expansion of f around the point x∗, that is,
$$f(y) = f(x^*) + \nabla f(x^*)^T (y - x^*) + \frac{1}{2} (y - x^*)^T \nabla^2 f(\xi) (y - x^*),$$
where ξ = x∗ + θ(y − x∗) for some θ ∈ [0, 1]. By the first condition one has
$$f(y) = f(x^*) + \frac{1}{2} (y - x^*)^T \nabla^2 f(\xi) (y - x^*),$$
and, for any y ∈ Nx∗ such that y ≠ x∗, the point ξ also lies in Nx∗, so ∇2 f(ξ) ≻ 0 and
f(y) > f(x∗),
which proves the claim. Q.E.D.
The above results can be easily modified to derive necessary and sufficient conditions for a local maximum.
Moreover, if ∇f(x∗) = 0 and the Hessian matrix ∇2 f(x∗) is indefinite, the point x∗ is neither a local minimum nor
a local maximum. Such a point is called a saddle point.
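The second-order tests above reduce to checking the signs of the Hessian eigenvalues at a stationary point. A minimal sketch for the two-variable case follows, using the closed-form eigenvalues of a symmetric 2×2 matrix; the function names, the tolerance and the three test Hessians are assumptions made for illustration.

```python
import math


def eigenvalues_sym2(h):
    """Eigenvalues (smallest, largest) of a symmetric 2x2 matrix [[a, b], [b, c]]."""
    a, b, c = h[0][0], h[0][1], h[1][1]
    mean = 0.5 * (a + c)
    radius = math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    return mean - radius, mean + radius


def classify_stationary_point(hessian, tol=1e-12):
    """Classify a point where the gradient vanishes, given its Hessian there."""
    lo, hi = eigenvalues_sym2(hessian)
    if lo > tol:
        return "strict local minimum"   # positive definite (Theorem 1.3)
    if hi < -tol:
        return "strict local maximum"   # negative definite
    if lo < -tol and hi > tol:
        return "saddle point"           # indefinite
    return "inconclusive"               # semidefinite: second-order test is silent


# Hessians at the origin of x1^2 + x2^2, x1^2 - x2^2 and x1^4 + x2^4.
print(classify_stationary_point([[2.0, 0.0], [0.0, 2.0]]))
print(classify_stationary_point([[2.0, 0.0], [0.0, -2.0]]))
print(classify_stationary_point([[0.0, 0.0], [0.0, 0.0]]))
```

The last case shows the gap between the necessary and the sufficient conditions: the origin is a minimizer of x1⁴ + x2⁴, yet the Hessian there is zero, so neither Theorem 1.3 nor the saddle-point criterion applies.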
In some cases, the original problem of finding a global minimum of (1.4) can be solved. Indeed, for a
convex function, every local minimum is also a global minimum. We say a function is convex if
f (αx + (1 − α)y) ≤ αf (x) + (1 − α)f (y), ∀x, y ∈ Rn and α ∈ [0, 1]. (1.6)
Theorem 1.4. If f is a convex function, then every local minimizer of f is also a global minimizer.
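The lecture states Theorem 1.4 without proof. One standard argument, sketched here for the unconstrained case and using only the convexity inequality (1.6), runs as follows.

```latex
Suppose $x^*$ is a local minimizer of the convex function $f$ but not a global
one, so that there is a point $y \in \mathbb{R}^n$ with $f(y) < f(x^*)$.
For any $\alpha \in (0, 1]$, inequality (1.6) applied with weight $\alpha$ on
$y$ and weight $1 - \alpha$ on $x^*$ gives
\[
  f\bigl(\alpha y + (1 - \alpha) x^*\bigr)
  \le \alpha f(y) + (1 - \alpha) f(x^*)
  < \alpha f(x^*) + (1 - \alpha) f(x^*)
  = f(x^*).
\]
As $\alpha \to 0^+$, the point $\alpha y + (1 - \alpha) x^*$ tends to $x^*$, so
every neighbourhood of $x^*$ contains a point with a strictly smaller function
value. This contradicts the local minimality of $x^*$; hence $x^*$ is a global
minimizer.
```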
1.3.3 Algorithms
Optimization algorithms are iterative: they generate a sequence of iterates {xk}. In doing so, we successively
improve our current solution estimate, and we hope to decrease f to its
minimum. There are two fundamental strategies for moving from the current point xk to the new iterate
xk+1: line search methods and trust-region methods.
In the line search strategy, the algorithm chooses a direction dk and searches along this direction from the
current iterate for a new iterate xk+1 with a lower function value:
xk+1 = xk + sk dk , (1.8)
where sk is the stepsize. The stepsize is chosen such that it results in the maximum descent along dk. That
is, sk is an approximate solution of the following one-dimensional minimization problem:
min_{s>0} f (xk + s dk ). (1.9)
By solving (1.9) exactly, we would derive the maximum benefit from the direction dk , but the exact
minimization is expensive and unnecessary. Instead, the line search algorithm generates a limited number
of trial step lengths until it finds one that loosely approximates the minimum of (1.9). Further discussion of
line search methods is reserved for Lecture 3.
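One common way to generate such trial step lengths is backtracking: start from a trial step and shrink it until the function value has decreased enough. The sketch below is an assumption, not the method of this lecture (line search is treated in Lecture 3). It uses the Armijo sufficient-decrease test, and the constants s0, rho and c, as well as the test function q, are illustrative choices.

```python
def backtracking_step(f, x, d, slope, s0=1.0, rho=0.5, c=1e-4, max_trials=50):
    """Shrink s until f(x + s*d) <= f(x) + c*s*slope (Armijo test, an assumption).

    slope is the directional derivative grad f(x)^T d; it must be negative,
    i.e. d must be a descent direction.
    """
    if slope >= 0.0:
        raise ValueError("d is not a descent direction")
    fx = f(x)
    s = s0
    for _ in range(max_trials):
        trial = [xi + s * di for xi, di in zip(x, d)]
        if f(trial) <= fx + c * s * slope:
            return s
        s *= rho  # trial step rejected: shrink and try again
    return s


def q(x):
    """Illustrative poorly scaled quadratic q(x) = x1^2 + 10*x2^2."""
    return x[0] ** 2 + 10.0 * x[1] ** 2


x = [1.0, 1.0]
grad = [2.0 * x[0], 20.0 * x[1]]
d = [-g for g in grad]                       # negative gradient direction
slope = sum(g * di for g, di in zip(grad, d))
s = backtracking_step(q, x, d, slope)
x_new = [xi + s * di for xi, di in zip(x, d)]
print(s, q(x), q(x_new))
```

Each rejected trial costs one function evaluation, which is the sense in which only a limited number of trial step lengths is generated instead of solving (1.9) exactly.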
In the trust region strategy, the information gathered about f is used to construct a model function mk
whose behaviour near the current point xk is similar to that of the actual objective function. Because
the model mk may not be a good approximation of f when x is far from xk , we restrict the search for a
minimizer of mk to some region around xk. In other words, we find the candidate step d by approximately
solving the following subproblem:
min_d mk (xk + d), (1.10)
where xk + d lies inside the trust region. If the candidate solution does not produce a sufficient decrease
in f, we conclude that the trust region is too large, and we shrink it and re-solve (1.10). Further discussion
of trust region methods is reserved for Lecture 6.
The gradient is the direction of steepest local increase of the objective function, while the negative
gradient is the direction of steepest local decrease. Thus, it is intuitive to
use the negative gradient as the search direction. The negative gradient direction is called the steepest
descent direction, that is,
dk = −∇f (xk ). (1.11)
In the steepest descent method at the k-th iteration, the transition from the current point xk to the new
point xk+1 is given by the following expression:
xk+1 = xk − sk ∇f (xk ), (1.12)
where the stepsize sk can be determined using the line search algorithms (covered in Lecture 3). For the
time being, we assume a constant step length sk = s. The steepest descent algorithm is summarized in
Algorithm 1.
Example 1.1. Compare the progress of the steepest descent method on the following functions: f(x) =
x1² + x2² and g(x) = 4x1² + x2² − 2x1x2.
With a suitably small stepsize, the steepest descent method ensures a reduction in the function value at each
iteration. If the starting
point is far away from the minimum, the gradient is typically large, and the reduction in the function value per
iteration is correspondingly large. Because the gradient decreases to a small
value near the optimal solution, the function reduction is uneven, and the method converges slowly near
the minimum. Steepest descent is badly affected by a poorly scaled function.
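The constant-step iteration can be sketched in a few lines and applied to the two functions of Example 1.1. Several things here are assumptions made for illustration: the second function is read as g(x) = 4x1² + x2² − 2x1x2, the starting point is [1, 1]T, the stepsize is s = 0.1, and the iteration stops when the gradient norm falls below 10⁻⁶.

```python
import math


def steepest_descent(grad, x0, s=0.1, tol=1e-6, max_iter=10_000):
    """Constant-step steepest descent: x <- x - s * grad(x).

    Returns the final point and the number of iterations performed.
    """
    x = list(x0)
    for k in range(max_iter):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) <= tol:
            return x, k
        x = [xi - s * gi for xi, gi in zip(x, g)]
    return x, max_iter


def grad_f(x):
    """Gradient of f(x) = x1^2 + x2^2."""
    return [2.0 * x[0], 2.0 * x[1]]


def grad_g(x):
    """Gradient of g(x) = 4*x1^2 + x2^2 - 2*x1*x2 (assumed reading of Example 1.1)."""
    return [8.0 * x[0] - 2.0 * x[1], 2.0 * x[1] - 2.0 * x[0]]


x_f, iters_f = steepest_descent(grad_f, [1.0, 1.0])
x_g, iters_g = steepest_descent(grad_g, [1.0, 1.0])
print("f:", iters_f, "iterations, final point", x_f)
print("g:", iters_g, "iterations, final point", x_g)
```

Both runs approach the minimizer at the origin, but g needs more iterations than f. Its Hessian has eigenvalues of different sizes (about 1.39 and 8.61), so a single constant stepsize cannot suit both directions, which illustrates the sensitivity to scaling noted above.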
Newton’s method
From one viewpoint, the search direction of steepest descent can be interpreted as being orthogonal to
the linear approximation of (the tangent to) the level curve of the objective function through the point xk. The
idea in Newton’s method
is to minimize at each iteration the quadratic approximation of f at xk given by
f (xk + d) ≈ f (xk ) + dT ∇f (xk ) + ½ dT ∇2 f (xk )d := mk (d). (1.13)
Assuming the Hessian matrix ∇2 f (xk ) is positive definite, the Newton direction is obtained by minimizing
mk (d) in (1.13). The minimum of mk (d) is obtained by differentiating (1.13) with respect to each of the
components of d and equating the resulting expression to zero:
∇f (xk ) + ∇2 f (xk )d = 0.
Solving for d, we obtain the following explicit expression for the Newton direction:
dk = −[∇2 f (xk )]−1 ∇f (xk ). (1.14)
There is a natural step length of one (sk = 1) associated with the Newton direction. An adjustment of the
step length is required when it does not produce a satisfactory reduction in the value of f. If we adjust the
stepsize, the resulting line search method is called the modified Newton’s method. The algorithm for Newton’s
method is described in Algorithm 2 and the modified Newton’s method in Algorithm 3.
Algorithm 2 Newton’s method
1: Set the initial point x0 and k ← 0.
2: while not converged do
3:     Compute the search direction dk ← −[∇2 f (xk )]−1 ∇f (xk ).
4:     Set xk+1 ← xk + dk (unit step length, sk = 1).
5:     Increase iteration counter k ← k + 1.
6: end while
7: return Approximate solution x∗ = xk
Example 1.2. Minimize the function g(x) = 4x1² + x2² − 2x1x2, starting with the initial point x0 = [1, 1]T.
If the starting point is far away from the optimal solution, the Newton direction may not
be a descent direction. Often a restart from a different starting point is required to avoid this difficulty. Since
Newton’s method minimizes the quadratic approximation of the function f obtained from its Taylor series,
the method converges in one iteration for a strictly convex quadratic function. Near the optimal point, the
function can be well approximated by a quadratic function. Thus Newton’s method does not struggle there as
the steepest descent method does. The main drawback of Newton’s method is the evaluation and inversion of the
Hessian matrix, which can be computationally expensive. In Lecture 5 we discuss quasi-Newton
methods, which iteratively build an approximation of the inverse Hessian matrix.
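The one-iteration behaviour on a quadratic can be seen in a small sketch of the pure Newton iteration (unit step) for two variables. The code reads the function of Example 1.2 as g(x) = 4x1² + x2² − 2x1x2, which is an assumption, and the helper names and the tolerance are illustrative. The Newton system is solved directly by Cramer's rule, so the inverse Hessian is never formed.

```python
import math


def solve_2x2(h, rhs):
    """Solve the 2x2 linear system h * d = rhs by Cramer's rule."""
    (a, b), (c, e) = h
    det = a * e - b * c
    if det == 0.0:
        raise ValueError("singular Hessian")
    return [(e * rhs[0] - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det]


def newton(grad, hess, x0, tol=1e-6, max_iter=50):
    """Pure Newton iteration with unit step length for a function of two variables."""
    x = list(x0)
    for k in range(max_iter):
        g = grad(x)
        if math.hypot(g[0], g[1]) <= tol:
            return x, k
        # Newton direction: solve hess(x) * d = -grad(x).
        d = solve_2x2(hess(x), [-g[0], -g[1]])
        x = [x[0] + d[0], x[1] + d[1]]
    return x, max_iter


def grad_g(x):
    """Gradient of g(x) = 4*x1^2 + x2^2 - 2*x1*x2 (assumed reading of Example 1.2)."""
    return [8.0 * x[0] - 2.0 * x[1], 2.0 * x[1] - 2.0 * x[0]]


def hess_g(x):
    """Hessian of g, constant because g is quadratic."""
    return [[8.0, -2.0], [-2.0, 2.0]]


x_star, iters = newton(grad_g, hess_g, [1.0, 1.0])
print(x_star, iters)
```

From x0 = [1, 1]T the gradient is [6, 0]T and the Newton direction is [−1, −1]T, so the first step lands on the minimizer at the origin and the stopping test is met at the next check.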
The general optimization problem (1.1) forms the basis of the constrained optimization formulation. The
derivation of constrained optimality conditions involves interesting mathematical concepts and arguments.
Thus, we state the conditions here with little detail; Lecture 6 is dedicated to a detailed derivation
of constrained optimality conditions.
For the constrained optimization problem (1.1), most of the necessary optimality conditions commonly used
in the development of algorithms assume that at a local solution, the constraints satisfy some qualification
condition to prevent the occurrence of degenerate cases. These conditions are usually called constraint
qualifications, and among them, the linear independence constraint qualification (LICQ) is the simplest
and by far the most invoked.
Let x̂ ∈ S. We say that the inequality constraint ci is active at x̂ if ci(x̂) = 0. We define the index set of
inequality constraints active at x̂ as:
Ia = {i ∈ I : ci (x̂) = 0}.
Of course, any equality constraint is active at x̂. LICQ is satisfied at x̂ if the gradients of the active
constraints, ∇ci(x̂) for i ∈ E ∪ Ia, are linearly independent. Optimality conditions under LICQ for the constrained
optimization problem (1.1) are stated making use of the Lagrangian function:
$$L(x, \lambda) = f(x) - \sum_{i \in E \cup I} \lambda_i c_i(x), \tag{1.16}$$
where λi are the Lagrange multipliers or dual variables and f is the usual objective function. The minus sign is
the convention under which the multipliers of the inequality constraints ci(x) ≥ 0 are non-negative at a solution.
The constrained necessary optimality conditions, also known as the Karush-Kuhn-Tucker (KKT) conditions, are stated
as follows:
Theorem 1.5. Suppose that x∗ is a local solution of (1.1), that the functions f and ci in (1.1) are
continuously differentiable, and that the LICQ holds at x∗. Then there is a Lagrange multiplier vector
λ∗, with components λ∗i, i ∈ E ∪ I, such that the following conditions are satisfied at (x∗, λ∗):
$$\begin{aligned}
\nabla_x L(x^*, \lambda^*) &= 0, \\
c_i(x^*) &= 0, && \text{for all } i \in E, \\
c_i(x^*) &\ge 0, && \text{for all } i \in I, \\
\lambda_i^* &\ge 0, && \text{for all } i \in I, \\
\lambda_i^* c_i(x^*) &= 0, && \text{for all } i \in E \cup I.
\end{aligned} \tag{1.17}$$
The last conditions, λ∗i ci(x∗) = 0 for all i ∈ E ∪ I, are the complementarity conditions; they imply that either
constraint i is active or λ∗i = 0, or both. In particular, the Lagrange multipliers corresponding to inactive
inequality constraints are zero.
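The KKT conditions can be checked numerically at a candidate pair (x, λ). The sketch below is an illustration under stated assumptions: it uses the sign convention L(x, λ) = f(x) − Σ λi ci(x) with inequality constraints written as ci(x) ≥ 0, the layout of the constraint list is hypothetical, and the test problem (minimize x1² + x2² subject to x1 + x2 − 1 ≥ 0) is not from the lecture.

```python
def kkt_violations(grad_f, constraints, x, lam):
    """Largest violation of each KKT condition at the pair (x, lam).

    constraints is a list of (c, grad_c, is_equality) triples (hypothetical layout).
    Assumed convention: L(x, lam) = f(x) - sum_i lam_i * c_i(x), inequalities c_i(x) >= 0.
    """
    stationarity = list(grad_f(x))
    feasibility = dual = complementarity = 0.0
    for (c, grad_c, is_equality), lam_i in zip(constraints, lam):
        ci = c(x)
        # Accumulate grad_x L = grad f(x) - sum_i lam_i * grad c_i(x).
        stationarity = [g - lam_i * gc for g, gc in zip(stationarity, grad_c(x))]
        if is_equality:
            feasibility = max(feasibility, abs(ci))
        else:
            feasibility = max(feasibility, max(0.0, -ci))   # c_i(x) >= 0
            dual = max(dual, max(0.0, -lam_i))              # lam_i >= 0
        complementarity = max(complementarity, abs(lam_i * ci))
    return {
        "stationarity": max(abs(g) for g in stationarity),
        "feasibility": feasibility,
        "dual": dual,
        "complementarity": complementarity,
    }


def is_kkt_point(violations, tol=1e-8):
    """True when every KKT condition holds to within tol."""
    return all(v <= tol for v in violations.values())


# Illustrative problem: minimize x1^2 + x2^2 subject to x1 + x2 - 1 >= 0.
grad_f = lambda x: [2.0 * x[0], 2.0 * x[1]]
constraints = [(lambda x: x[0] + x[1] - 1.0, lambda x: [1.0, 1.0], False)]

good = kkt_violations(grad_f, constraints, [0.5, 0.5], [1.0])
bad = kkt_violations(grad_f, constraints, [1.0, 0.0], [1.0])
print(is_kkt_point(good), is_kkt_point(bad))
```

At x = [0.5, 0.5]T with λ = 1 the constraint is active and ∇f(x) = λ∇c(x), so all four conditions hold. The point [1, 0]T is feasible, but its Lagrangian gradient is [1, −1]T, so stationarity fails. A tolerance-based check of this kind is also how stopping criteria built on the necessary conditions are usually implemented.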
Moreover, the second-order necessary optimality condition of a constrained optimization problem is stated
as:
Theorem 1.6. Suppose that x∗ is a local solution of (1.1) and the LICQ condition is satisfied. Let λ∗ be
the Lagrange multiplier vector for which the KKT conditions (1.17) are satisfied. Then
dT ∇2xx L(x∗ , λ∗ )d ≥ 0, for all d ∈ C(x∗ , λ∗ ), (1.18)
where C(x∗, λ∗) = {d ∈ Rn : ∇ci(x∗)T d = 0 for all i ∈ Ia; ∇ci(x∗)T d = 0 for all i ∈ E}.
A point x∗ satisfying the necessary optimality conditions (1.17) together with some multipliers λ∗ is called
a KKT point.
Sufficient optimality conditions ensure that x∗ is a local minimizer of the constrained optimization problem
(1.1). Unlike the necessary optimality conditions, the sufficient optimality conditions do not require a constraint
qualification, and the inequality in (1.18) is replaced by a strict inequality:
Theorem 1.7. Suppose that for some feasible point x∗ ∈ Rn there is a Lagrange multiplier vector λ∗ such
that the KKT conditions (1.17) are satisfied. Suppose also that
dT ∇2xx L(x∗ , λ∗ )d > 0, for all d ∈ C(x∗ , λ∗ ), d ≠ 0. (1.19)
Then x∗ is a strict local minimizer of (1.1).
We postpone a detailed discussion of these conditions and related constrained algorithms to Lecture 6
onwards.
1.5 Summary
In this lecture, we introduced the mathematical formulation of nonlinear programming and determined
the conditions for the existence of a solution. We also characterized the optimal solution via optimality
conditions. Optimality conditions are fundamental in the solution of nonlinear optimization problems. If
it is known that a global solution exists, the most straightforward method to employ is as follows:
find all points satisfying the first-order necessary conditions and declare as global solutions the points with
the smallest value of the objective function. If the problem functions are twice differentiable, we can also
check the second-order necessary condition and filter out those points that do not satisfy it. For the remaining
candidates, we can check the second-order sufficient conditions to find local minima. It is essential
to realize that using optimality conditions as described above works only in simple
cases.
The principal context in which optimality conditions become helpful is the development and analysis of
algorithms. An algorithm for the solution of a general optimization problem produces a sequence {xk }, k =
0, 1, . . . , of tentative solutions, and terminates when a stopping criterion is satisfied. Usually, the stopping
criterion is based on satisfaction of the necessary optimality conditions within a prefixed tolerance. Moreover,
necessary conditions often suggest how to improve the current tentative solution xk in order to get the
next one, xk+1 , closer to the optimal solution. Thus, necessary optimality conditions provide the basis for the
convergence analysis of algorithms. On the other hand, sufficient optimality conditions play a crucial role in
analyzing the rate of convergence. The subsequent lecture covers the convergence analysis of optimization
algorithms.
PROBLEM SET I
1. A piece of wire six meters long has to be divided into two pieces. One piece is bent into a square,
and the other is bent into an equilateral triangle. What must be the individual lengths so that the
total area enclosed by the two pieces is minimum? Formulate the optimization problem.
2. A rectangle has its lower-left corner at the point Q(−1, 2) and the upper corner at the point P (x, y) on
the straight line 3x + 4y = 5. Find the dimensions, x and y, that maximize the area of the rectangle.
Formulate the optimization problem.
3. Evaluate the gradient of the function f(x) = (x1 + x2)³ x3 + x3² x2² x1² at the point x = [1, 1, 1]T.
4. Explain the difference between the global minimizer and local minimizer.
5. Consider the function f : R2 → R given by f (x) = (x1 − 3)2 + (x2 − 2)2 . How will you explain to
anyone that x∗ = [3, 2]T is a minimum point?
6. Let f(x) = ½ xT Ax + xT b + c, where A ∈ Rn×n is a symmetric positive definite matrix, b ∈ Rn and
c ∈ R.
a. Calculate the gradient and Hessian of f(x).
b. How many local/global minima can f(x) have?
c. Find a formula for the minimizer using only the data A and b.
7. Consider the Rosenbrock function:
f(x) = (x2 − x1²)² + (1 − x1)².
f(x, y) = x² + y² + βxy + x + 2y
13. Given the function f(x) = 3x1² + 3x2² + 3x3² to minimize, would you expect steepest descent or
Newton’s method to be faster in solving the problem from the same starting point x0 = [10, 10, 10]T?
Explain the reason for your answer.
14. Consider the following function
Starting with the initial point x = [4, 2, −1]T , find the minimizer of the function using
a. Steepest descent method.
b. Newton’s method.
15. Write a simple computer programme for implementing (a) the steepest descent algorithm and (b)
Newton’s method. Use a fixed stepsize in both algorithms. For the stopping criterion, use the
condition ‖∇f(xk)‖2 ≤ ε, where ε = 10⁻⁶. Test your programme with the function in problem 14.
Compare the two algorithms using the initial condition [−4, 5, 1]T: determine the number of iterations
and the time required to satisfy the above stopping criterion. Evaluate the objective function at the final
point to see how close it is to 0. Experiment with different values of the stepsize (small and large).
16. Furthermore, compare the two algorithms by applying the computer programmes to minimize the Rosenbrock
function. Use the initial condition x0 = [−2, 2]T. Terminate the algorithm when the norm of the
gradient of the objective function is less than 10⁻⁴.