
Lecture 10: Subgradient method and projected gradient descent


In topic 6, we will study numerical methods for solving constrained optimization problems:
Subgradient method
Projected gradient descent
The quadratic penalty method
Augmented Lagrangian method
Barrier function method
Problem setting

(P)   min  f(x)
      s.t. g(x) = 0,
           h(x) ≤ 0,
           x ∈ X ⊆ R^n

(D)   max  θ(λ, µ)  over  λ ∈ R^m, µ ∈ R^p_+,
      where θ(λ, µ) := inf_{x ∈ X} f(x) + λᵀg(x) + µᵀh(x)

Assumptions:
X is a compact set in R^n
f : R^n → R, g : R^n → R^m, and h : R^n → R^p are continuous
For solving the dual, consider gradient ascent on θ:

[λ; µ]^(k+1) = [λ; µ]^(k) + t_k ∇θ(λ^(k), µ^(k))

t_k : step length
Difficulty: θ(·, ·) is often not differentiable!
E.g. in example 10.11,
θ(λ) = 5λ − 4   if λ ≤ −1,
       λ − 8    if −1 ≤ λ ≤ 2,
       −3λ      if λ ≥ 2.

Another motivating example

Consider the LASSO model where we apply ℓ1 regularization to linear regression, i.e.

f(x) = (1/m) Σ_{i=1}^m ‖b_i − a_iᵀx‖² + λ‖x‖_1

Here, ‖x‖_1 := Σ_{i=1}^n |x_i| is defined to be the vector one-norm of x, and λ is a regularization parameter.
Compared to the usual regression, this model promotes sparsity in the optimal solution x.
Again, the difficulty in solving this problem is that the objective function is not differentiable.
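As an illustration (not from the notes): the least-squares part of f is differentiable, and any vector whose i-th entry is sign(x_i) (or any value in [−1, 1] where x_i = 0) is a subgradient of ‖x‖_1, so one subgradient of the whole objective is easy to compute. A minimal NumPy sketch, with A holding the rows a_iᵀ:

```python
import numpy as np

def lasso_subgradient(x, A, b, lam):
    """One subgradient of f(x) = (1/m)*||Ax - b||^2 + lam*||x||_1.

    The smooth part contributes its gradient (2/m) A^T (Ax - b); for the
    l1 term we take sign(x_i) in each coordinate (0 where x_i = 0, which
    is a valid choice since any value in [-1, 1] works there).
    """
    m = A.shape[0]
    return (2.0 / m) * A.T @ (A @ x - b) + lam * np.sign(x)
```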
Subgradients

Definition 10.1
Suppose
S ⊆ R^n is a nonempty convex set,
f : S → R is a convex function.
A vector ξ ∈ R^n is a subgradient of f at x̄ ∈ S if

f(x) ≥ f(x̄) + ξᵀ(x − x̄),  ∀ x ∈ S.

The set of all subgradients of f at x̄ is called the subdifferential of f at x̄, denoted by ∂f(x̄), i.e.

∂f(x̄) = {ξ : ξ is a subgradient of f at x̄}.

If f is concave, then ξ ∈ R^n is a subgradient of f at x̄ ∈ S if

f(x) ≤ f(x̄) + ξᵀ(x − x̄),  ∀ x ∈ S.

Example 10.2
Find the subdifferential of f defined by f(x) = |x| for all x ∈ R.

Solution. Note that f(x) = x if x ≥ 0 and f(x) = −x if x < 0, and it is differentiable for all x ≠ 0. Hence

∂f(x) = {1}       if x > 0,
        {−1}      if x < 0,
        [−1, 1]   if x = 0.
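As a small computational aside (my own representation, not from the notes): in one dimension the subdifferential above is an interval, so it can be stored as a pair [lo, hi], and testing whether 0 ∈ ∂f(x) is a simple comparison.

```python
def subdiff_abs(x):
    """Subdifferential of |x| at x, returned as an interval (lo, hi)."""
    if x > 0:
        return (1.0, 1.0)
    if x < 0:
        return (-1.0, -1.0)
    return (-1.0, 1.0)  # at x = 0, every slope in [-1, 1] supports |x|

lo, hi = subdiff_abs(0.0)
print(lo <= 0.0 <= hi)  # True: 0 is in the subdifferential, so x = 0 minimizes |x|
```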
Subgradient method (part 2)
Some useful propositions

Proposition 1
∂f(x) is a convex set. If f is differentiable at x, then

∂f(x) = {∇f(x)}.

Proposition 2
If f is continuous and convex, then min_{x ∈ R^n} f(x) is attained at x* if and only if 0 ∈ ∂f(x*).

Proposition 3
Let f, g be two convex functions. Under some mild conditions (which are assumed to hold in this course), the subdifferential of f + g is given by

∂(f + g)(x) = ∂f(x) + ∂g(x)

Convex hull

Definition 10.3
Given a set S, C = conv(S) is the convex hull of S if C is the smallest convex set that contains S.
Proposition 4
If S = {v_1, · · · , v_n}, then

conv(S) = { v = Σ_{i=1}^n λ_i v_i : λ_i ≥ 0, Σ_{i=1}^n λ_i = 1 }.
Proposition 5
Suppose f(x) = max{f_1(x), · · · , f_m(x)}, where the f_i are all convex and continuously differentiable functions. Suppose f(x*) = f_1(x*) = · · · = f_j(x*). Then

∂f(x*) = conv({∇f_1(x*), · · · , ∇f_j(x*)})

Example 10.4
Consider f(x) = x² + |x − 1|. Find the subgradient and use it to find the optimal solution.
Solution. x* = 0.5.
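For the record, the calculation behind this answer (using Proposition 3, Example 10.2, and the optimality condition 0 ∈ ∂f(x*)):

∂f(x) = 2x + ∂|x − 1| = {2x − 1}   if x < 1,
                        [1, 3]     if x = 1,
                        {2x + 1}   if x > 1.

Only the first case can contain 0, namely at 2x* − 1 = 0, i.e. x* = 0.5.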
Example 10.5 (C.f. example 10.11)
Try to maximize
θ(λ) = 5λ − 4   if λ ≤ −1,
       λ − 8    if −1 ≤ λ ≤ 2,
       −3λ      if λ ≥ 2.
Note that θ(λ) = min{5λ − 4, λ − 8, −3λ}.
Solution. When λ < −1, θ is continuously differentiable and θ(λ) = 5λ − 4, so ∂θ(λ) = {5}.

When λ = −1, θ(λ) = −max{4 − 5λ, 8 − λ}, and 4 − 5λ = 8 − λ. By proposition 5, ∂θ(−1) = −conv({−5, −1}) = [1, 5].

When −1 < λ < 2, θ is continuously differentiable and θ(λ) = λ − 8, so ∂θ(λ) = {1}.

When λ = 2, θ(λ) = −max{8 − λ, 3λ}, and 8 − λ = 3λ. By proposition 5, ∂θ(2) = −conv({−1, 3}) = [−3, 1].

When λ > 2, θ is continuously differentiable and θ(λ) = −3λ, so ∂θ(λ) = {−3}.

Running through all cases, the only possibility where 0 ∈ ∂θ(λ*) is when λ* = 2.
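This case analysis can be automated: since θ is a pointwise minimum of affine functions, its superdifferential at λ is the convex hull of the slopes of the active pieces, i.e. an interval. A minimal sketch of that check (the tolerance is my own implementation choice):

```python
# theta(lambda) = min{5l - 4, l - 8, -3l}, stored as (slope, intercept) pairs
PIECES = [(5.0, -4.0), (1.0, -8.0), (-3.0, 0.0)]

def superdiff(lam, tol=1e-9):
    """Superdifferential of theta at lam, as an interval (lo, hi)."""
    values = [a * lam + b for a, b in PIECES]
    theta = min(values)
    active = [a for (a, b), v in zip(PIECES, values) if v <= theta + tol]
    return min(active), max(active)

for lam in (-2.0, -1.0, 0.0, 2.0, 3.0):
    lo, hi = superdiff(lam)
    print(lam, (lo, hi), lo <= 0.0 <= hi)
# Only lam = 2 gives an interval containing 0, matching lambda* = 2.
```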
Example 10.6
Find the subgradient of

f(x) = |x1 + x2| + |x1|

at (1, 0), (0, 1), (1, −1), (0, 0). Which of them is a global minimizer?

Solution. (0, 0), since f ≥ 0 everywhere and f(0, 0) = 0 (equivalently, 0 ∈ ∂f(0, 0)).
Subgradient descent/ascent method
In gradient descent,

x^(k+1) = x^(k) − t_k ∇f(x^(k))

For a nonsmooth function, since the gradient may not be defined, we can replace ∇f with an element of the subdifferential.
However, the full subdifferential can be difficult to find. Normally, we just need to choose one subgradient to replace ∇f.

Specify some initial guess x^(0).
For k = 0, 1, · · · ,
    If 0 ∈ ∂f(x^(k)), then stop;
    otherwise,
        Pick v^(k) ∈ −∂f(x^(k))
        x^(k+1) = x^(k) + t_k v^(k)
    End
End
The last x^(k+1) will be the approximate minimizer.
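A minimal Python sketch of this loop in one dimension, applied to Example 10.4's f(x) = x² + |x − 1|; the diminishing step rule t_k = 1/(k + 1), the stopping tolerance, and the iteration cap are my own choices, not part of the notes:

```python
def subgradient_method(subgrad, x0, max_iter=200, tol=1e-8):
    """Subgradient descent: x <- x - t_k * g, with g any element of the subdifferential."""
    x = x0
    for k in range(max_iter):
        g = subgrad(x)            # pick one subgradient at the current point
        if abs(g) < tol:          # 0 is (approximately) in the subdifferential: stop
            break
        x = x - g / (k + 1)       # diminishing step length t_k = 1/(k+1)
    return x

# Example 10.4: f(x) = x^2 + |x - 1|; one subgradient is 2x + sign(x - 1).
def subgrad_f(x):
    return 2.0 * x + (1.0 if x > 1 else -1.0 if x < 1 else 0.0)

print(subgradient_method(subgrad_f, x0=3.0))  # converges to x* = 0.5
```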
Projected gradient method
In topic 6, we will study numerical methods for solving constrained optimization problems:
Subgradient method
Projected gradient descent
The quadratic penalty method
Augmented Lagrangian method
Barrier function method
Problem setting

min f(x),  x ∈ S

f is a convex, differentiable function
S is a convex set, e.g. S = {x | Ax = b, h_j(x) ≤ 0, j = 1, . . . , p}, where the h_j are convex functions.
Difficulty in applying the usual gradient descent algorithm: the iterates may go out of the feasible region.

The projected gradient method is a method that pulls the iterate back to the feasible region, using the basic template of the usual gradient descent algorithm.
Projection on a convex set
Theorem 10.7 (Projection theorem)
Let C be a closed convex set in R^n.
(a) For every z ∈ R^n, there exists a unique minimizer (denoted by Π_C(z) and called the projection of z onto C) of

min { (1/2)‖x − z‖² | x ∈ C }

where ‖ · ‖ is the Euclidean norm.
(b) x* := Π_C(z) is the projection of z onto C if and only if

⟨z − x*, x − x*⟩ ≤ 0  ∀ x ∈ C.

(c) For any z, w ∈ R^n,

‖Π_C(z) − Π_C(w)‖ ≤ ‖z − w‖.


Example 10.8
For y ∈ R², find Π_S(y) for S = {‖x‖ ≤ 1}.

Solution. Π_S(y) is the KKT solution of the problem

min_x (1/2)‖x − y‖²  s.t. ‖x‖² ≤ 1.

The KKT system gives

x − y + µ(2x) = 0,  ‖x‖² ≤ 1,  µ ≥ 0,  µ(‖x‖² − 1) = 0.

Case 1. y ∈ S. Then Π_S(y) = y.

Case 2. y ∉ S. We have x = y / (1 + 2µ). Also, µ ≠ 0. This implies the inequality constraint must be active, i.e.

‖x‖ = 1  ⇔  ‖y‖ = 1 + 2µ  ⇒  x = y / ‖y‖.

Hence, Π_S(y) = y          if ‖y‖ ≤ 1,
                y / ‖y‖    otherwise.
Example 10.9
For y ≥ 0 ∈ R², find Π_S(y) for S = {0 ≤ x1 ≤ 1, 0 ≤ x2 ≤ 1}.

Solution. Π_S(y) = y          if y ∈ S,
                   (y1, 1)    if 0 ≤ y1 ≤ 1, 1 < y2,
                   (1, y2)    if 1 < y1, 0 ≤ y2 ≤ 1,
                   (1, 1)     otherwise.
Example 10.10
Given a ≠ 0. For y ∈ R^d, find Π_S(y) for S = {aᵀx + b ≤ 0}.

Solution. Π_S(y) = y                           if y ∈ S,
                   y − ((aᵀy + b)/‖a‖²) a      otherwise.
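The three projection formulas above translate directly into code. A minimal NumPy sketch (the function names are mine):

```python
import numpy as np

def proj_unit_ball(y):
    """Projection onto {x : ||x|| <= 1} (example 10.8)."""
    nrm = np.linalg.norm(y)
    return y if nrm <= 1.0 else y / nrm

def proj_box(y, lo=0.0, hi=1.0):
    """Projection onto the box {lo <= x_i <= hi} (example 10.9): componentwise clipping."""
    return np.clip(y, lo, hi)

def proj_halfspace(y, a, b):
    """Projection onto {x : a^T x + b <= 0} (example 10.10)."""
    viol = a @ y + b
    return y if viol <= 0.0 else y - (viol / (a @ a)) * a
```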
Steepest descent method vs projected gradient descent

Steepest descent method, for solving min_{x ∈ R^n} f(x):

[Step 0] Select x^(0), and ε > 0.
[Step k] For k = 0, 1, 2, 3, · · · ,
  (a) d^(k) := −∇f(x^(k)).
  (b) If ‖d^(k)‖ < ε, stop.
  (c) Else,
      (i) Choose fixed t_k, or solve t_k = arg min_{t ≥ 0} f(x^(k) + t d^(k)).
      (ii) x^(k+1) = x^(k) + t_k d^(k).

Projected gradient descent, for solving min_{x ∈ S} f(x):

[Step 0] Select x^(0), and ε > 0.
[Step k] For k = 0, 1, 2, 3, · · · ,
  (a) d^(k) := −∇f(x^(k)).
  (b) If ‖x^(k+1) − x^(k)‖ < ε, stop.
  (c) Else,
      (i) Choose fixed t_k, or solve t_k = arg min_{t ≥ 0} f(x^(k) + t d^(k)).
      (ii) x^(k+1) = Π_S(x^(k) + t_k d^(k)).

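A minimal Python sketch of the projected gradient descent template with a fixed step length; the step size, tolerance, and iteration cap are placeholder choices of mine:

```python
import numpy as np

def projected_gradient_descent(grad, proj, x0, t=0.1, eps=1e-8, max_iter=1000):
    """Projected gradient descent: x <- Proj_S(x - t * grad f(x))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = proj(x - t * grad(x))   # gradient step, then pull back into S
        if np.linalg.norm(x_new - x) < eps:
            return x_new
        x = x_new
    return x

# Example 10.11 below: min x1 + x2 over the unit ball.
grad = lambda x: np.array([1.0, 1.0])
proj = lambda y: y if np.linalg.norm(y) <= 1.0 else y / np.linalg.norm(y)
print(projected_gradient_descent(grad, proj, x0=np.zeros(2)))
# converges to -(1, 1)/sqrt(2), the minimizer on the unit ball
```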
Example 10.11
min f(x) := x1 + x2,  x1² + x2² ≤ 1
Suppose x^(0) = 0; find x^(1) using the projected gradient descent method with step size t0 = 1.
 
Solution. x^(1) = Π_S(x^(0) − t0 ∇f(x^(0))), where S = {x | ‖x‖ ≤ 1}.

x^(0) − t0 ∇f(x^(0)) = 0 − [1; 1] = −[1; 1].

Since −[1; 1] ∉ S, the projection formula given by example 10.8 shows:

x^(1) = Π_S(−[1; 1]) = −(1/√2) [1; 1].
Example 10.12

min f(x) := (1/2)‖Ax − b‖²,  0 ≤ x1, x2 ≤ 1.

Suppose x^(0) = 0; find x^(1) using the projected gradient descent method with the optimal step size if

A = [2 0; 0 1],  b = [1; 2].

Solution. x^(1) = [0.8; 0.8].
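As a check on this answer (the calculation is mine, not in the notes): ∇f(x^(0)) = Aᵀ(Ax^(0) − b) = [−2; −2], exact line search along d^(0) = [2; 2] gives t0 = 0.4, and x^(0) + t0 d^(0) = [0.8; 0.8] already lies in the box, so the projection leaves it unchanged. A short numerical verification:

```python
import numpy as np

A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 2.0])
x0 = np.zeros(2)

d = -(A.T @ (A @ x0 - b))                              # steepest descent direction [2, 2]
t0 = (d @ A.T @ (b - A @ x0)) / (d @ (A.T @ A @ d))    # exact line search for a quadratic
x1 = np.clip(x0 + t0 * d, 0.0, 1.0)                    # projection onto the box [0, 1]^2
print(t0, x1)                                          # 0.4 [0.8 0.8]
```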
