
CHAPTER 0.

PRELIMINARY

Yutong He

1 Norm
Definition 1.1 (Norm). A real-valued function ‖·‖ defined on a linear space E is
called a norm if it satisfies:

1. (positive definiteness) ∀x ∈ E, we have ‖x‖ ≥ 0, and ‖x‖ = 0 if and only if
   x = 0.

2. (absolute homogeneity) ∀x ∈ E, α ∈ R, we have ‖αx‖ = |α| · ‖x‖.

3. (triangle inequality) ∀x, y ∈ E, we have ‖x + y‖ ≤ ‖x‖ + ‖y‖.

1.1 Vector norm


We consider vector norms defined on the vector space E = R^n.

Definition 1.2 (ℓ_p-norm). The ℓ_p-norm (p ≥ 1) is defined as:

‖x‖_p = (|x_1|^p + |x_2|^p + · · · + |x_n|^p)^{1/p},

for all x = (x_1, x_2, · · · , x_n)ᵀ ∈ R^n.

Specifically, we have the following commonly used definitions of the ℓ_1-norm and ℓ_2-norm for
a vector x = (x_1, x_2, · · · , x_n)ᵀ ∈ R^n. When p = 1, the ℓ_1-norm is given by:

‖x‖_1 = |x_1| + |x_2| + · · · + |x_n|.

When p = 2, the ℓ_2-norm is the same as the Euclidean norm:

‖x‖_2 = √(x_1^2 + x_2^2 + · · · + x_n^2).

We have the following useful Cauchy’s inequality for the ℓ_2-norm:

Proposition 1.3 (Cauchy’s inequality). ∀x, y ∈ R^n, we have

|⟨x, y⟩| ≤ ‖x‖_2 · ‖y‖_2,

where ⟨x, y⟩ = xᵀy denotes the inner product of x and y, and the equality holds if
and only if x and y are linearly dependent (i.e., ∃ α, β ∈ R, s.t. α^2 + β^2 > 0 and
αx + βy = 0).

Definition 1.4 (ℓ_∞-norm). The ℓ_∞-norm is defined as:

‖x‖_∞ = max{|x_1|, |x_2|, · · · , |x_n|},

for all x = (x_1, x_2, · · · , x_n)ᵀ ∈ R^n.
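
The following is a minimal numerical sketch of Definitions 1.2–1.4 and Proposition 1.3,
using NumPy; the random vectors and the seed are illustrative assumptions, not part of
the notes.

    import numpy as np

    rng = np.random.default_rng(0)
    x, y = rng.standard_normal(5), rng.standard_normal(5)

    l1 = np.sum(np.abs(x))           # ℓ_1-norm
    l2 = np.sqrt(np.sum(x**2))       # ℓ_2-norm
    linf = np.max(np.abs(x))         # ℓ_∞-norm
    assert np.isclose(l1, np.linalg.norm(x, 1))
    assert np.isclose(l2, np.linalg.norm(x, 2))
    assert np.isclose(linf, np.linalg.norm(x, np.inf))

    # triangle inequality and Cauchy's inequality
    assert np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y)
    assert abs(x @ y) <= np.linalg.norm(x) * np.linalg.norm(y)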

Definition 1.5 (Norm induced by a positive definite matrix A). Given a positive
definite matrix A ∈ R^{n×n}, we define:

‖x‖_A = √(xᵀAx),

for all x = (x_1, x_2, · · · , x_n)ᵀ ∈ R^n.
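
As a sketch of Definition 1.5, one can build a positive definite A (here as BᵀB + I,
an illustrative assumption) and check absolute homogeneity numerically:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    B = rng.standard_normal((n, n))
    A = B.T @ B + np.eye(n)          # positive definite by construction
    x = rng.standard_normal(n)

    norm_A = np.sqrt(x @ A @ x)      # ‖x‖_A = √(xᵀAx)
    alpha = -2.5
    # absolute homogeneity: ‖αx‖_A = |α|·‖x‖_A
    assert np.isclose(np.sqrt((alpha * x) @ A @ (alpha * x)), abs(alpha) * norm_A)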

1.2 Matrix norm


We consider matrix norms defined on the matrix space E = R^{m×n}.

Definition 1.6 (Frobenius norm / F-norm). The Frobenius norm (or F-norm) of a
matrix A ∈ R^{m×n} is defined as:

‖A‖_F = √(tr(AᵀA)) = √( Σ_{i=1}^m Σ_{j=1}^n A_{i,j}^2 ),

where A_{i,j} represents the (i, j)-th element of matrix A.

The Frobenius norm has properties similar to those of the ℓ_2-norm. For example, we have
the following Cauchy’s inequality for the Frobenius norm.

Proposition 1.7 (Cauchy’s inequality). ∀A, B ∈ R^{m×n}, we have

|⟨A, B⟩| ≤ ‖A‖_F · ‖B‖_F,

where ⟨A, B⟩ = tr(AᵀB) = tr(ABᵀ) denotes the Frobenius inner product of A and
B, and the equality holds if and only if A and B are linearly dependent.
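
A quick numerical check of Definition 1.6 and Proposition 1.7 (random matrices and
seed are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))

    fro = np.sqrt(np.trace(A.T @ A))                 # √(tr(AᵀA))
    assert np.isclose(fro, np.sqrt(np.sum(A**2)))    # entrywise formula
    assert np.isclose(fro, np.linalg.norm(A, 'fro'))

    inner = np.trace(A.T @ B)                        # Frobenius inner product
    assert np.isclose(inner, np.trace(A @ B.T))      # tr(AᵀB) = tr(ABᵀ)
    assert abs(inner) <= np.linalg.norm(A, 'fro') * np.linalg.norm(B, 'fro')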

Definition 1.8 (Spectral norm). The spectral norm of a matrix A ∈ R^{m×n} is defined
as:

‖A‖_2 = max_{x ∈ R^n, ‖x‖_2 = 1} ‖Ax‖_2.

For a given matrix A ∈ R^{m×n}, it can be shown that ‖A‖_2 equals the largest singular
value of A, and ‖A‖_2^2 equals the largest eigenvalue of AᵀA (and of AAᵀ).
By definition, we also have the following inequality:

‖Ax‖_2 ≤ ‖A‖_2 ‖x‖_2,

for all A ∈ R^{m×n} and x ∈ R^n.

Definition 1.9 (Nuclear norm). The nuclear norm of a matrix A ∈ R^{m×n} is defined
as:

‖A‖_* = σ_1 + σ_2 + · · · + σ_r,

where r = rank(A) and σ_1 ≥ σ_2 ≥ · · · ≥ σ_r are the non-zero singular values of A.

We have the following inequalities for the matrix norms:

‖AB‖_F ≤ ‖A‖_2 ‖B‖_F,

|⟨A, B⟩| ≤ ‖A‖_2 ‖B‖_*.
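
The SVD characterizations and the two inequalities above can be checked numerically;
the shapes and seed below are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 4))
    s = np.linalg.svd(A, compute_uv=False)          # singular values, descending

    assert np.isclose(s[0], np.linalg.norm(A, 2))           # spectral norm
    assert np.isclose(np.sum(s), np.linalg.norm(A, 'nuc'))  # nuclear norm
    # ‖A‖_2^2 equals the largest eigenvalue of AᵀA
    assert np.isclose(s[0]**2, np.max(np.linalg.eigvalsh(A.T @ A)))

    B = rng.standard_normal((4, 5))
    assert np.linalg.norm(A @ B, 'fro') <= np.linalg.norm(A, 2) * np.linalg.norm(B, 'fro') + 1e-12

    C = rng.standard_normal((3, 4))
    assert abs(np.trace(A.T @ C)) <= np.linalg.norm(A, 2) * np.linalg.norm(C, 'nuc') + 1e-12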

2 Gradient
2.1 Gradient and Hessian matrix

Definition 2.1 (Gradient). Suppose f : R^n → R is continuously differentiable. Its
gradient ∇f : R^n → R^n is defined at a point x = (x_1, x_2, · · · , x_n)ᵀ as:

∇f(x) = (∂f(x)/∂x_1, ∂f(x)/∂x_2, · · · , ∂f(x)/∂x_n)ᵀ.

Definition 2.2 (Hessian matrix). Suppose f : R^n → R is twice continuously differentiable.
Its Hessian matrix ∇²f : R^n → R^{n×n} is defined at a point x = (x_1, x_2, · · · , x_n)ᵀ as
the n × n matrix whose (j, k)-th entry is

[∇²f(x)]_{j,k} = ∂²f(x) / (∂x_j ∂x_k),    j, k = 1, 2, · · · , n.

As examples, we consider the gradients and Hessian matrices of two typical functions.

1. (The least squares problem) Let f(x) := ½‖Ax − b‖_2^2. We have ∇f(x) = Aᵀ(Ax − b)
   and ∇²f(x) = AᵀA.

2. (The logistic regression problem) Let f(x) := (1/M) Σ_{i=1}^M ln(1 + exp(−b_i a_iᵀx)). We have

   ∇f(x) = (1/M) Σ_{i=1}^M [ −b_i exp(−b_i a_iᵀx) / (1 + exp(−b_i a_iᵀx)) ] a_i,

   and

   [∇²f(x)]_{j,k} = (1/M) Σ_{i=1}^M [ b_i^2 exp(−b_i a_iᵀx) a_{i,j} a_{i,k} / (1 + exp(−b_i a_iᵀx))^2 ].
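
Both gradients can be validated against central finite differences; a sketch, where the
random data A, b, the rows a_i, the labels b_i, and the helper num_grad are all
illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    M, n = 20, 4
    A = rng.standard_normal((M, n)); b = rng.standard_normal(M)   # least squares data
    feats = rng.standard_normal((M, n))                           # rows are a_i
    lab = rng.choice([-1.0, 1.0], M)                              # labels b_i
    x = rng.standard_normal(n)

    def num_grad(f, x, eps=1e-6):
        # central finite differences, one coordinate at a time
        g = np.zeros_like(x)
        for j in range(x.size):
            e = np.zeros_like(x); e[j] = eps
            g[j] = (f(x + e) - f(x - e)) / (2 * eps)
        return g

    # least squares: f(x) = ½‖Ax − b‖_2^2, ∇f(x) = Aᵀ(Ax − b)
    f_ls = lambda z: 0.5 * np.sum((A @ z - b) ** 2)
    assert np.allclose(A.T @ (A @ x - b), num_grad(f_ls, x), atol=1e-4)

    # logistic regression: f(x) = (1/M) Σ ln(1 + exp(−b_i a_iᵀx))
    f_lr = lambda z: np.mean(np.log1p(np.exp(-lab * (feats @ z))))
    s = np.exp(-lab * (feats @ x))
    g_lr = feats.T @ (-lab * s / (1 + s)) / M
    assert np.allclose(g_lr, num_grad(f_lr, x), atol=1e-4)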

As with univariate functions, we have Taylor expansions in the multivariate case:

Proposition 2.3 (Taylor expansion). Suppose f : R^n → R is continuously differentiable.
Then ∀x, y ∈ R^n, we have

f(x + y) = f(x) + ⟨∇f(x + θ_1 y), y⟩,

for some θ_1 ∈ (0, 1). If f is further twice continuously differentiable, it holds that

∇f(x + y) = ∇f(x) + ∫_0^1 ∇²f(x + ty) y dt,

and

f(x + y) = f(x) + ⟨∇f(x), y⟩ + ½ yᵀ∇²f(x + θ_2 y) y,

for some θ_2 ∈ (0, 1).
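
The integral form can be checked numerically for a simple non-quadratic function, say
f(z) = Σ z_i^4, using midpoint quadrature (a sketch; the function, seed, and grid size
are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    x, y = rng.standard_normal(3), rng.standard_normal(3)

    grad = lambda z: 4 * z**3             # ∇f for f(z) = Σ z_i^4
    hess = lambda z: np.diag(12 * z**2)   # ∇²f is diagonal here

    N = 10_000
    ts = (np.arange(N) + 0.5) / N         # midpoints of [0, 1]
    integral = sum(hess(x + t * y) @ y for t in ts) / N
    assert np.allclose(grad(x + y), grad(x) + integral, atol=1e-6)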

2.2 L-smoothness

Definition 2.4 (L-smoothness). A function f : R^n → R is said to be L-smooth if
∀x, y ∈ R^n,

‖∇f(x) − ∇f(y)‖_2 ≤ L‖x − y‖_2,

where L > 0 is the Lipschitz constant of ∇f.

Theorem 2.5 (L-smooth property). Suppose f : R^n → R is L-smooth. Then it holds for
∀x, y ∈ R^n that

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖_2^2.

Proof. Let g(t) := f(x + t(y − x)). We have

g′(t) = ⟨∇f(x + t(y − x)), y − x⟩,

thus

f(y) − f(x) − ⟨∇f(x), y − x⟩
= g(1) − g(0) − g′(0)
= ∫_0^1 (g′(t) − g′(0)) dt
= ∫_0^1 ⟨∇f(x + t(y − x)) − ∇f(x), y − x⟩ dt
≤ ∫_0^1 ‖∇f(x + t(y − x)) − ∇f(x)‖_2 · ‖y − x‖_2 dt
≤ ∫_0^1 Lt‖y − x‖_2^2 dt
= (L/2)‖y − x‖_2^2.
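
For the least squares objective of Section 2.1, ∇f is L-smooth with L = ‖AᵀA‖_2, so
Theorem 2.5 can be checked numerically (a sketch with random data as an illustrative
assumption):

    import numpy as np

    rng = np.random.default_rng(0)
    M, n = 10, 4
    A = rng.standard_normal((M, n)); b = rng.standard_normal(M)
    f = lambda z: 0.5 * np.sum((A @ z - b) ** 2)
    grad = lambda z: A.T @ (A @ z - b)
    L = np.linalg.norm(A.T @ A, 2)        # Lipschitz constant of ∇f

    for _ in range(100):
        x, y = rng.standard_normal(n), rng.standard_normal(n)
        bound = f(x) + grad(x) @ (y - x) + 0.5 * L * np.sum((y - x) ** 2)
        assert f(y) <= bound + 1e-10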

[Figure 1 (plot omitted): L-smooth property — f(y) lies below the quadratic upper bound f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖_2^2, which touches f at (x, f(x)).]

3 Convexity
3.1 Convex set

Definition 3.1 (Convex set). A set X ⊆ R^n is called convex if ∀x, y ∈ X, it
holds that

θx + (1 − θ)y ∈ X, ∀θ ∈ [0, 1].

Theorem 3.2 (Projection onto closed convex sets). Suppose X ⊆ R^n is a closed
convex set. Then for any y ∈ R^n, there exists a unique x* ∈ X such that ‖x* − y‖_2 ≤
‖x − y‖_2 for any x ∈ X. The point x* is called the projection of y onto X, denoted
by P_X(y).

Proof. We first prove the existence of the point x*. For a given y ∈ R^n, let d := inf_{x∈X} ‖x − y‖_2.
There exists a sequence {x_k}_{k=1}^∞ ⊆ X such that ‖x_k − y‖_2^2 ≤ d^2 + 1/k. For any integers
0 < m < n, by the convexity of X we have (x_m + x_n)/2 ∈ X. By the definition of d, we
have ‖(x_m + x_n)/2 − y‖_2 ≥ d. Consequently,

⟨x_m − y, x_n − y⟩ = 2‖(x_m + x_n)/2 − y‖_2^2 − (‖x_m − y‖_2^2 + ‖x_n − y‖_2^2)/2 ≥ d^2 − 1/m,
⇒ ‖x_m − x_n‖_2^2 = ‖x_m − y‖_2^2 + ‖x_n − y‖_2^2 − 2⟨x_m − y, x_n − y⟩ ≤ 4/m.

Thus {x_k}_{k=1}^∞ is a Cauchy sequence, which implies the existence of a point x* ∈ R^n such that
x* = lim_{k→∞} x_k. By the closedness of X we know x* ∈ X, and by the continuity of the ℓ_2-norm we
know ‖x* − y‖_2 = d. Next we show the uniqueness of x*. Otherwise, ∃ x*_1, x*_2 ∈ X (x*_1 ≠ x*_2)
such that ‖x*_1 − y‖_2 = ‖x*_2 − y‖_2 = d, and we have

‖(x*_1 + x*_2)/2 − y‖_2^2 = ½‖x*_1 − y‖_2^2 + ½‖x*_2 − y‖_2^2 − ¼‖x*_1 − x*_2‖_2^2 < d^2,

a contradiction with the definition of d.
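
For some sets the projection P_X has a closed form. Below is a sketch for the ℓ_2-ball and
a coordinate box (the radius, bounds, and random trials are illustrative assumptions),
together with a numerical check of the minimal-distance property of Theorem 3.2:

    import numpy as np

    def project_ball(y, r=1.0):
        # projection onto {x : ‖x‖_2 ≤ r}
        nrm = np.linalg.norm(y)
        return y if nrm <= r else (r / nrm) * y

    def project_box(y, lo=-1.0, hi=1.0):
        # projection onto the box [lo, hi]^n, taken coordinate-wise
        return np.clip(y, lo, hi)

    rng = np.random.default_rng(0)
    y = 3.0 * rng.standard_normal(4)
    for proj in (project_ball, project_box):
        p = proj(y)
        for _ in range(1000):
            z = proj(y + rng.standard_normal(4))   # an arbitrary feasible point
            assert np.linalg.norm(p - y) <= np.linalg.norm(z - y) + 1e-12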

3.2 Convex function

Definition 3.3 (Convex function). A function f : X → R is said to be convex if
X ⊆ R^n is a convex set and ∀x, y ∈ X, it holds that

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y), ∀θ ∈ [0, 1].

Here we list some common convex functions (see the numerical sketch after this list):

1. vector norm: f(x) = ‖x‖,

2. quadratic function: f(x) = ½xᵀAx where A is positive semi-definite,

3. linear function: f(x) = ⟨a, x⟩ for some a ∈ R^n,

4. combination of convex functions: f(x) = α_1 f_1(x) + α_2 f_2(x) + · · · + α_m f_m(x), where
   f_1, f_2, · · · , f_m are convex and α_1, α_2, · · · , α_m are non-negative.
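
A chord check of Definition 3.3 for the first three examples (random data as illustrative
assumptions; example 4 then follows from these by non-negative combination):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    B = rng.standard_normal((n, n)); A = B.T @ B     # positive semi-definite
    a = rng.standard_normal(n)

    funcs = [
        lambda z: np.linalg.norm(z),     # vector norm
        lambda z: 0.5 * z @ A @ z,       # quadratic with A ⪰ 0
        lambda z: a @ z,                 # linear
    ]
    for f in funcs:
        for _ in range(100):
            x, y = rng.standard_normal(n), rng.standard_normal(n)
            t = rng.uniform()
            assert f(t * x + (1 - t) * y) <= t * f(x) + (1 - t) * f(y) + 1e-10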

Theorem 3.4 (Convex property). Suppose f : X → R is differentiable. Then f is
convex if and only if

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩, ∀x, y ∈ X. (1)

Similarly, f is convex if and only if

⟨∇f(y) − ∇f(x), y − x⟩ ≥ 0, ∀x, y ∈ X. (2)

Proof. Step 1: we prove that (1) holds if f is convex. By definition we have, for any θ ∈ [0, 1],
that f(θy + (1 − θ)x) ≤ θf(y) + (1 − θ)f(x). Thus,

f(y) − f(x) = lim_{θ→0+} [θf(y) + (1 − θ)f(x) − f(x)] / θ
            ≥ lim_{θ→0+} [f(θy + (1 − θ)x) − f(x)] / θ
            = lim_{θ→0+} [f(x + θ(y − x)) − f(x)] / θ
            = ⟨∇f(x), y − x⟩.

Step 2: we prove that f is convex if (1) holds. By (1) we have, for ∀x, y ∈ X and θ ∈ [0, 1],
that

θf(x) + (1 − θ)f(y) − f(θx + (1 − θ)y)
= θ(f(x) − f(θx + (1 − θ)y)) + (1 − θ)(f(y) − f(θx + (1 − θ)y))
≥ θ⟨∇f(θx + (1 − θ)y), (1 − θ)(x − y)⟩ + (1 − θ)⟨∇f(θx + (1 − θ)y), θ(y − x)⟩
= 0.

Step 3: we prove that (1) implies (2). By (1) we have

f(y) + f(x) ≥ (f(x) + ⟨∇f(x), y − x⟩) + (f(y) + ⟨∇f(y), x − y⟩)
            = f(x) + f(y) − ⟨∇f(y) − ∇f(x), y − x⟩,

which is equivalent to (2).

Step 4: we prove that (2) implies (1). This follows from

f(y) − f(x) = ∫_0^1 ⟨∇f(x + t(y − x)), y − x⟩ dt
            = ∫_0^1 [ ⟨∇f(x), y − x⟩ + (1/t)⟨∇f(x + t(y − x)) − ∇f(x), t(y − x)⟩ ] dt
            ≥ ∫_0^1 ⟨∇f(x), y − x⟩ dt
            = ⟨∇f(x), y − x⟩.

[Figure 2 (plot omitted): convex property — the graph of f lies above its tangent line f(x) + ⟨∇f(x), y − x⟩ at (x, f(x)).]

The following Jensen’s inequality is an important property of convex functions.

Proposition 3.5 (Jensen’s inequality). Let f : X → R be convex. Then for any
x_1, x_2, · · · , x_m ∈ X and non-negative θ_1, θ_2, · · · , θ_m satisfying θ_1 + θ_2 + · · · + θ_m = 1,
it holds that

f(θ_1 x_1 + θ_2 x_2 + · · · + θ_m x_m) ≤ θ_1 f(x_1) + θ_2 f(x_2) + · · · + θ_m f(x_m).

Definition 3.6 (µ-strongly convex function). A function f : X → R is said to be
µ-strongly convex if

g(x) := f(x) − (µ/2)‖x‖_2^2

is a convex function.

It can be proved that f : X → R is µ-strongly convex if and only if for ∀x, y ∈ X and
θ ∈ [0, 1], it holds that

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) − (µ/2)θ(1 − θ)‖x − y‖_2^2.

Theorem 3.7 (µ-strongly convex property). A function f : X → R is µ-strongly
convex if and only if

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2)‖y − x‖_2^2, ∀x, y ∈ X, (3)

and if and only if

⟨∇f(y) − ∇f(x), y − x⟩ ≥ µ‖y − x‖_2^2, ∀x, y ∈ X. (4)

Proof. Let g(x) := f(x) − (µ/2)‖x‖_2^2; then f is µ-strongly convex if and only if g is convex.
Note that (3) is equivalent to

g(y) + (µ/2)‖y‖_2^2 ≥ g(x) + (µ/2)‖x‖_2^2 + ⟨∇g(x) + µx, y − x⟩ + (µ/2)‖y − x‖_2^2
⇔ g(y) ≥ g(x) + ⟨∇g(x), y − x⟩,

and that (4) is equivalent to

⟨∇g(y) + µy − ∇g(x) − µx, y − x⟩ ≥ µ‖y − x‖_2^2
⇔ ⟨∇g(y) − ∇g(x), y − x⟩ ≥ 0.

It then suffices to apply Theorem 3.4.
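
Both (3) and (4) can be checked numerically for the µ-strongly convex quadratic
f(x) = ½xᵀAx with A positive definite and µ = λ_min(A) (a sketch; the random data
are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    B = rng.standard_normal((n, n))
    A = B.T @ B + np.eye(n)                  # positive definite
    mu = np.min(np.linalg.eigvalsh(A))       # strong convexity constant
    f = lambda z: 0.5 * z @ A @ z
    grad = lambda z: A @ z

    for _ in range(100):
        x, y = rng.standard_normal(n), rng.standard_normal(n)
        d2 = np.sum((y - x) ** 2)
        assert f(y) >= f(x) + grad(x) @ (y - x) + 0.5 * mu * d2 - 1e-10   # (3)
        assert (grad(y) - grad(x)) @ (y - x) >= mu * d2 - 1e-10           # (4)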

Theorem 3.8 (µ-strongly convex property). Let f : R^n → R be µ-strongly convex.
Then the following inequalities hold:

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (1/(2µ))‖∇f(y) − ∇f(x)‖_2^2, ∀x, y ∈ R^n,

⟨∇f(y) − ∇f(x), y − x⟩ ≤ (1/µ)‖∇f(y) − ∇f(x)‖_2^2, ∀x, y ∈ R^n.

Proof. Fix x ∈ R^n, and let φ(y) := f(y) − f(x) − ⟨∇f(x), y − x⟩. Then φ is µ-strongly
convex with minimum φ(x) = 0. Thus,

0 = min_v φ(v) ≥ min_v { φ(y) + ⟨∇φ(y), v − y⟩ + (µ/2)‖v − y‖_2^2 }
  = φ(y) − (1/µ)‖∇φ(y)‖_2^2 + (µ/2)‖(1/µ)∇φ(y)‖_2^2 = φ(y) − (1/(2µ))‖∇φ(y)‖_2^2
  = f(y) − f(x) − ⟨∇f(x), y − x⟩ − (1/(2µ))‖∇f(y) − ∇f(x)‖_2^2,

which is equivalent to

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (1/(2µ))‖∇f(y) − ∇f(x)‖_2^2.

Summing two copies of the above inequality with x and y interchanged, we obtain

⟨∇f(y) − ∇f(x), y − x⟩ ≤ (1/µ)‖∇f(y) − ∇f(x)‖_2^2.

L-smooth convex functions are widely considered in optimization; they have the
fundamental properties below.

[Figure 3 (plot omitted): µ-strongly convex property — f(y) lies above the quadratic lower bound f(x) + ⟨∇f(x), y − x⟩ + (µ/2)‖y − x‖_2^2, which in turn lies above the tangent line f(x) + ⟨∇f(x), y − x⟩.]

Theorem 3.9 (L-smooth convex property). Let f : R^n → R be differentiable. Then
f is convex and L-smooth if and only if any one of the following conditions holds:

0 ≤ f(y) − f(x) − ⟨∇f(x), y − x⟩ ≤ (L/2)‖y − x‖_2^2, ∀x, y ∈ R^n, (5)

f(y) − f(x) − ⟨∇f(x), y − x⟩ ≥ (1/(2L))‖∇f(y) − ∇f(x)‖_2^2, ∀x, y ∈ R^n, (6)

⟨∇f(y) − ∇f(x), y − x⟩ ≥ (1/L)‖∇f(y) − ∇f(x)‖_2^2, ∀x, y ∈ R^n, (7)

0 ≤ ⟨∇f(y) − ∇f(x), y − x⟩ ≤ L‖y − x‖_2^2, ∀x, y ∈ R^n. (8)

Proof. Step 1: we show (5) holds when f is L-smooth and convex. In fact, (5) is a direct
consequence of Theorem 3.4 and Theorem 2.5.
Step 2: we show (5) implies (6). Fix x ∈ R^n and let φ(y) := f(y) − f(x) − ⟨∇f(x), y − x⟩.
By (5) we know φ(x) = 0 is the minimum of φ. Substituting y by y − (1/L)∇φ(y) and applying
(5), we obtain

0 ≤ φ(y − (1/L)∇φ(y)) = f(y − (1/L)∇φ(y)) − f(x) − ⟨∇f(x), y − (1/L)∇φ(y) − x⟩
  ≤ f(y) − (1/L)⟨∇f(y), ∇φ(y)⟩ + (L/2)‖(1/L)∇φ(y)‖_2^2 − f(x) − ⟨∇f(x), y − (1/L)∇φ(y) − x⟩
  = f(y) − f(x) − ⟨∇f(x), y − x⟩ − (1/(2L))‖∇f(y) − ∇f(x)‖_2^2.
Step 3: we show (6) implies (7).

(1/L)‖∇f(y) − ∇f(x)‖_2^2 = (1/(2L))‖∇f(y) − ∇f(x)‖_2^2 + (1/(2L))‖∇f(x) − ∇f(y)‖_2^2
  ≤ (f(y) − f(x) − ⟨∇f(x), y − x⟩) + (f(x) − f(y) − ⟨∇f(y), x − y⟩)
  = ⟨∇f(y) − ∇f(x), y − x⟩.

Step 4: we show (7) implies that f is L-smooth and convex. The convexity can be justified by
directly applying Theorem 3.4. By Cauchy’s inequality, we have

(1/L)‖∇f(y) − ∇f(x)‖_2^2 ≤ ⟨∇f(y) − ∇f(x), y − x⟩ ≤ ‖∇f(y) − ∇f(x)‖_2 · ‖y − x‖_2,

which implies

‖∇f(y) − ∇f(x)‖_2 ≤ L‖y − x‖_2.

Step 5: we show (5) implies (8). This can be achieved by directly adding two copies of (5)
with x and y interchanged.
Step 6: we show (8) implies (5). By Theorem 3.4, it remains to show the second inequality
in (5):

f(y) − f(x) − ⟨∇f(x), y − x⟩ = ∫_0^1 ⟨∇f(x + t(y − x)) − ∇f(x), y − x⟩ dt
  ≤ ∫_0^1 (L/t)‖t(y − x)‖_2^2 dt
  = (L/2)‖y − x‖_2^2.
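
Conditions (5)–(8) can be checked numerically for f(x) = ½xᵀAx with A positive
semi-definite and L = λ_max(A) (a sketch; random data as illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    Bm = rng.standard_normal((n, n)); A = Bm.T @ Bm   # positive semi-definite
    L = np.max(np.linalg.eigvalsh(A))
    f = lambda z: 0.5 * z @ A @ z
    g = lambda z: A @ z

    for _ in range(100):
        x, y = rng.standard_normal(n), rng.standard_normal(n)
        gap = f(y) - f(x) - g(x) @ (y - x)
        dg, dx = g(y) - g(x), y - x
        assert -1e-10 <= gap <= 0.5 * L * dx @ dx + 1e-10    # (5)
        assert gap >= dg @ dg / (2 * L) - 1e-10              # (6)
        assert dg @ dx >= dg @ dg / L - 1e-10                # (7)
        assert -1e-10 <= dg @ dx <= L * dx @ dx + 1e-10      # (8)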

We also have the following property for L-smooth and µ-strongly convex functions. By
Theorem 2.5 and Theorem 3.7, it is easy to see that µ ≤ L.

Theorem 3.10 (L-smooth µ-strongly convex property). Let f : R^n → R be L-smooth
and µ-strongly convex. Then for all x, y ∈ R^n, the following inequality holds:

⟨∇f(y) − ∇f(x), y − x⟩ ≥ (µL/(µ + L))‖y − x‖_2^2 + (1/(µ + L))‖∇f(y) − ∇f(x)‖_2^2.

Proof. If µ = L, we have

(µL/(µ + L))‖y − x‖_2^2 + (1/(µ + L))‖∇f(y) − ∇f(x)‖_2^2
= (µ/2)‖y − x‖_2^2 + (1/(2L))‖∇f(y) − ∇f(x)‖_2^2
≤ ½⟨∇f(y) − ∇f(x), y − x⟩ + ½⟨∇f(y) − ∇f(x), y − x⟩
= ⟨∇f(y) − ∇f(x), y − x⟩,

where the inequality uses (4) and (7).

If µ < L, then g(x) := f(x) − (µ/2)‖x‖_2^2 is an (L − µ)-smooth convex function, implying

⟨∇g(y) − ∇g(x), y − x⟩ ≥ (1/(L − µ))‖∇g(y) − ∇g(x)‖_2^2,

which is equivalent to

⟨∇f(y) − ∇f(x), y − x⟩ ≥ (µL/(µ + L))‖y − x‖_2^2 + (1/(µ + L))‖∇f(y) − ∇f(x)‖_2^2.
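
A numerical sketch of Theorem 3.10 for a quadratic with µ = λ_min(A) and
L = λ_max(A) (the random data are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    B = rng.standard_normal((n, n)); A = B.T @ B + np.eye(n)   # positive definite
    w = np.linalg.eigvalsh(A); mu, L = w[0], w[-1]
    g = lambda z: A @ z

    for _ in range(100):
        x, y = rng.standard_normal(n), rng.standard_normal(n)
        dg, dx = g(y) - g(x), y - x
        lhs = dg @ dx
        rhs = (mu * L / (mu + L)) * dx @ dx + dg @ dg / (mu + L)
        assert lhs >= rhs - 1e-10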
