A Projected Semismooth Newton Method For A Class of Nonconvex Composite Programs With Strong Prox-Regularity
Jiang Hu hujiangopt@gmail.com
Massachusetts General Hospital and Harvard Medical School
Harvard University
Boston, MA 02114, USA
Kangkang Deng∗ freedeng1208@gmail.com
Beijing International Center for Mathematical Research
Peking University
Beijing, 100871, China
Jiayuan Wu 1901110043@pku.edu.cn
College of Engineering
Peking University
Beijing, 100871, China
Quanzheng Li li.quanzheng@mgh.harvard.edu
Massachusetts General Hospital and Harvard Medical School
Harvard University
Boston, MA 02114, USA
Abstract
This paper aims to develop a Newton-type method to solve a class of nonconvex composite
programs. In particular, the nonsmooth part is possibly nonconvex. To tackle the non-
convexity, we develop a notion of strong prox-regularity which is related to the singleton
property and Lipschitz continuity of the associated proximal operator, and we verify it
in various classes of functions, including weakly convex functions, indicator functions of
proximally smooth sets, and two specific sphere-related nonconvex nonsmooth functions.
In this case, the problem class we are concerned with covers smooth optimization problems
on manifolds and certain composite optimization problems on manifolds. For the latter,
the proposed algorithm is the first second-order type method. Combined with the semis-
moothness of the proximal operator, we design a projected semismooth Newton method
to find a root of the natural residual induced by the proximal gradient method. Due to
the possible nonconvexity of the feasible domain, an extra projection is added to the usual
semismooth Newton step and new criteria are proposed for the switching between the pro-
jected semismooth Newton step and the proximal step. The global convergence is then
established under the strong prox-regularity. Based on the BD regularity condition, we es-
tablish local superlinear convergence. Numerical experiments demonstrate the effectiveness
of our proposed method compared with state-of-the-art ones.
Keywords: nonconvex composite optimization, strong prox-regularity, projected semis-
mooth Newton method, superlinear convergence
∗. Corresponding author.
1. Introduction
The nonconvex composite minimization problem has attracted lots of attention in signal
processing, statistics, and machine learning. The formulation we are concerned with is
\[
\min_{x\in\mathbb{R}^n}\ \varphi(x) := f(x) + h(x), \tag{1}
\]
where $f$ is smooth and $h$ is possibly nonsmooth and nonconvex.
We call the above definition strong prox-regularity due to the uniform γ for all x ∈ C,
which can be seen as an enhanced version of the prox-regularity (Rockafellar and Wets,
2009, Definition 13.27, Proposition 13.37). Note that the strong prox-regularity holds for
any closed C ⊂ Rn and γ > 0 if h is convex (Moreau, 1965). Here, we present some classes
of nonconvex functions satisfying Definition 1.
(i) h is weakly convex. A function $h$ is called weakly convex with modulus $\rho > 0$ if $h(x) + \frac{\rho}{2}\|x\|_2^2$ is convex. Writing $g(x) := h(x) + \frac{\rho}{2}\|x\|_2^2$, a direct computation gives $\mathrm{prox}_{th}(x) = \mathrm{prox}_{\frac{t}{1-t\rho}g}\big(\frac{x}{1-t\rho}\big)$, so the same argument as for convex functions shows that $\mathrm{prox}_{th}$ is single-valued and Lipschitz continuous when $t < 1/\rho$. Thus, $h$ is strongly prox-regular with $C = \mathbb{R}^n$, $\gamma = t$, and the $\ell_2$-norm $\|\cdot\|_2$ for any $t < 1/\rho$. Optimization with weakly convex objective functions has been considered in (Davis and Drusvyatskiy, 2019).
(ii) h is the indicator function of a proximally smooth set (Clarke et al., 1995). For a set
X ⊂ Rn , define its closed r-neighborhoods
algorithmic design and theoretical analysis. Since the proximal operator is single-valued and
Lipschitz continuous on a closed set, one can further explore its differentiability and design
second-order type algorithms to obtain algorithmic speedups and fast convergence rate
guarantees.
It has been shown in (Böhm and Wright, 2021) that two popular nonsmooth nonconvex
regularizers, the minimax concave penalty (Zhang, 2010) and the smoothly clipped absolute
deviation (Fan, 1997), are weakly convex. Since any smooth manifold is proximally smooth,
the manifold optimization problems (Absil et al., 2009; Hu et al., 2020; Boumal, 2023) take
the form (1). Besides, we are also motivated by the following applications, where h combines
the indicator function of the oblique manifold with either a simple $\ell_1$ norm or a nonnegativity
constraint. Let us note that such h is neither weakly convex nor the indicator function of a smooth manifold.
A motivating application is the sparse PCA problem over the oblique manifold:
\[
\min_{X\in \mathrm{Ob}(n,p)}\ \|X^\top A^\top A X - D^2\|_F^2 + \lambda\|X\|_1, \tag{3}
\]
where $\mathrm{Ob}(n, p) = \{X \in \mathbb{R}^{n\times p} : \mathrm{diag}(X^\top X) = \mathbf{1}_p\}$ with $\mathrm{diag}(B)$ being the vector consisting of the diagonal entries of $B$ and $\mathbf{1}_p \in \mathbb{R}^p$ the vector of all ones, $D$ is a diagonal matrix whose diagonal entries are the first $p$ largest singular values of $A$, $\|\cdot\|_F$ denotes the matrix Frobenius norm, $\|X\|_1 := \sum_{i=1}^n\sum_{j=1}^p |X_{ij}|$, and $\lambda > 0$ is a parameter controlling the sparsity. Problem (3) takes the form (1) by letting
\[
f(X) := \|X^\top A^\top A X - D^2\|_F^2, \qquad h(X) := \lambda\|X\|_1 + \delta_{\mathrm{Ob}(n,p)}(X), \tag{4}
\]
where δC (·) denotes the indicator function of the set C, which takes the value zero on C and
+∞ otherwise. Utilizing the separable structure and the results by (Xiao and Bai, 2021),
the i-th column of proxth (X), denoted by (proxth (X))i , is
\[
(\mathrm{prox}_{th}(X))_i =
\begin{cases}
(0, \ldots, 0, \mathrm{sign}(X_{ij}), 0, \ldots, 0)^\top, & \text{if } w_i \ge 0,\\[4pt]
-\,w_i^- \odot \mathrm{sign}(X_i)\,/\,\|w_i^-\|_2, & \text{otherwise},
\end{cases}
\]
where the nonzero entry in the first case sits in position $j$, $w_i = \lambda t\,\mathbf{1}_n - |X_i|$, $X_i$ is the $i$-th column of $X$, $w_i^- = \min(w_i, 0)$ (componentwise), $\mathrm{sign}(a)$ returns $1$ if $a \ge 0$ and $-1$ otherwise, $\odot$ denotes the entrywise product, and $j = \arg\min_{1\le k\le n} w_i(k)$. Note that $\mathrm{prox}_{th}(X)$ need not be a singleton for every $X \in \mathbb{R}^{n\times p}$ and $t > 0$. We will give the specific $C$, $\gamma$, and $\|\cdot\|$ such that $\mathrm{prox}_{th}$ is strongly
prox-regular later in Section 3.
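To make the closed form above concrete, the following Python/NumPy sketch evaluates the column-wise proximal mapping exactly as stated (with one arbitrary choice at points where the mapping is not single-valued); it is an illustration of the formula from (Xiao and Bai, 2021), not the code used in our experiments, and the function name is ours.

```python
import numpy as np

def prox_l1_sphere(X, t, lam):
    """Column-wise prox of h(X) = lam*||X||_1 + indicator of Ob(n, p),
    following the closed form above (one choice when the prox is not unique)."""
    n, p = X.shape
    P = np.zeros_like(X)
    sgn = np.where(X >= 0, 1.0, -1.0)          # sign(a) = 1 if a >= 0, -1 otherwise
    for i in range(p):
        w = lam * t - np.abs(X[:, i])          # w_i = lam*t*1_n - |X_i|
        if np.all(w >= 0):
            j = np.argmin(w)                   # entry of largest magnitude in column i
            P[j, i] = sgn[j, i]                # signed unit coordinate vector
        else:
            w_minus = np.minimum(w, 0.0)       # keep only the negative part of w_i
            v = -w_minus * sgn[:, i]           # -w_i^- entrywise-times sign(X_i)
            P[:, i] = v / np.linalg.norm(v)    # normalize onto the unit sphere
    return P

# quick sanity check on a random instance: every column has unit norm
X = np.random.randn(6, 3)
P = prox_l1_sphere(X, t=0.1, lam=0.5)
print(np.linalg.norm(P, axis=0))
```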
Another motivating application is the nonnegative PCA problem over the oblique manifold:
\[
\min_{X\in \mathrm{Ob}_+(n,p)}\ \|X^\top A^\top A X - D^2\|_F^2, \tag{5}
\]
where $\mathrm{Ob}_+(n, p) := \mathrm{Ob}(n, p) \cap \{X \in \mathbb{R}^{n\times p} : X_{ij} \ge 0\}$ and $D$ is defined as in (3). Note that a more general formulation with a smooth objective function over $\mathrm{Ob}_+(n, p)$ has been considered in (Jiang et al., 2022). Problem (5) falls into (1) by letting
\[
f(X) := \|X^\top A^\top A X - D^2\|_F^2, \qquad h(X) := \delta_{\mathrm{Ob}_+(n,p)}(X), \tag{6}
\]
where $h$ is the indicator function of $\mathrm{Ob}_+(n, p)$. Due to the separable structure, the $i$-th column of $\mathrm{prox}_{th}(X)$, denoted by $(\mathrm{prox}_{th}(X))_i$, is
\[
(\mathrm{prox}_{th}(X))_i =
\begin{cases}
e_j, & \text{if } \max(X_i) \le 0,\\[2pt]
X_i^+ / \|X_i^+\|_2, & \text{otherwise},
\end{cases}
\]
where $e_j \in \mathbb{R}^n$ is the $j$-th standard basis vector with $j = \arg\max_{1\le k\le n} X_{ik}$ in the first case, $X_i^+ = \max(X_i, 0)$, and $X_i$ is the $i$-th column of $X$. Note that this projection is not unique for all $X \in \mathbb{R}^{n\times p}$, e.g., $X = 0$. We will show
its strong prox-regularity later in Section 3.
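A companion sketch for the nonnegative case reads as follows; again this is only an illustrative Python/NumPy transcription of the closed form above (choosing one value where the projection is not unique), not the experimental code.

```python
import numpy as np

def proj_nonneg_sphere(X):
    """Column-wise projection onto Ob_+(n, p) = {X : diag(X^T X) = 1_p, X >= 0},
    following the case distinction above (one choice where it is not unique)."""
    n, p = X.shape
    P = np.zeros_like(X)
    for i in range(p):
        xi = X[:, i]
        if np.max(xi) <= 0:
            j = np.argmax(xi)                  # entry closest to zero
            P[j, i] = 1.0                      # unit coordinate vector e_j
        else:
            xi_plus = np.maximum(xi, 0.0)      # X_i^+ = max(X_i, 0)
            P[:, i] = xi_plus / np.linalg.norm(xi_plus)
    return P

X = np.random.randn(5, 4)
P = proj_nonneg_sphere(X)
print(np.linalg.norm(P, axis=0), (P >= 0).all())   # unit columns, nonnegative
```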
where $\Delta_n = \{y \in \mathbb{R}^n : y \ge 0,\ \mathbf{1}_n^\top y = 1\}$, $A \in \mathbb{R}^{m\times n}$, and $b \in \mathbb{R}^m$. By decomposing $y = x \circ x$ with the Hadamard product (i.e., $y_i = x_i^2$, $i = 1, \ldots, n$), it holds that
\[
y \in \Delta_n \iff x \in \mathrm{Ob}(n, 1).
\]
et al., 2010; Xiao et al., 2018; Li et al., 2018a,b) are also developed for the nonsmooth
problem (1). If h is nonconvex, proximal gradient methods have been developed for the
$\ell_{1/2}$ regularizer in (Xu et al., 2012) and for more general nonconvex regularizers (Gong et al., 2013; Yang, 2017).
Their global convergence is established by utilizing the smoothness of f and the explicit solution
of the proximal subproblem.
In the case of h being weakly convex, subgradient-type methods (Davis and Drusvyatskiy, 2019; Davis et al., 2018) and proximal point-type methods (Drusvyatskiy, 2018) come with complexity guarantees. Optimization with prox-regular functions has recently attracted much attention. The authors of (Themelis et al., 2018) propose a gradient-type method to minimize the forward-backward envelope of $\varphi$, which can be seen as a variable-metric first-order method. Since the Moreau envelope of a prox-regular function is continuously differentiable, a nonsmooth Newton method is designed to solve the gradient system of the Moreau envelope in (Khanh et al., 2020, 2021). Noting that the indicator function of a proximally smooth set is prox-regular (Clarke et al., 1995), the authors of (Balashov and Tremba, 2022) developed a generalized Newton method for the fixed-point equation induced by the projected gradient method.
In the case of h being the indicator function of a Riemannian manifold, efficient Riemannian algorithms have been extensively studied in the last decades (Absil et al., 2009; Wen and Yin, 2013; Hu et al., 2020; Boumal, 2023). When h takes the form (4), manifold proximal gradient methods (Chen et al., 2020; Huang and Wei, 2021) have been designed. These approaches only use first-order information and do not have superlinear convergence. In addition, manifold augmented Lagrangian methods are proposed in (Deng and Peng, 2022; Zhou et al., 2021), in which the subproblems are solved by first-order or second-order methods. When it comes to the case of (6), a second-order type method is proposed in the recent work (Jiang et al., 2022). However, their subproblems explore only the second-order information of the smooth part.
• We introduce the concept of strong prox-regularity. Different from the classic prox-regularity, the strong prox-regularity enjoys a uniform proximal regularity over a closed region containing all feasible points. A crucial property is that the proximal operator of a strongly prox-regular function locally behaves like that of a convex function. With the strong prox-regularity, the stationarity condition can be reformulated as a single-valued residual mapping which is Lipschitz continuous on the closed region. We present several classes of functions satisfying the strong prox-regularity condition, including weakly convex functions and indicator functions of proximally smooth sets (including manifold constraints). In particular, two specific sphere-related nonsmooth and nonconvex functions, which are neither weakly convex nor indicator functions of a smooth manifold, are verified to satisfy the strong prox-regularity.
• As shown in Section 1.1, two sphere-related nonsmooth and nonconvex functions result
in composite optimization problems on manifolds. In this paper, we propose the first
second-order type method to solve this kind of problem, which outperforms state-of-
the-art first-order methods (Chen et al., 2020; Huang and Wei, 2021). It is worth
mentioning that first-order methods (Chen et al., 2020; Huang and Wei, 2021) fail in
solving the nonnegative PCA on the oblique manifold due to their dependence on the
Lipschitz continuity of the nonsmooth part.
• The global convergence of the proposed projected semismooth Newton method is presented. Other than the strong prox-regularity condition and the semismoothness, the assumptions are standard and are satisfied by various applications including our motivating examples. We prove that the switching conditions are locally satisfied, which allows the local transition to the projected semismooth Newton step. By assuming the BD-regularity condition, we show the local superlinear convergence. Numerical experiments on various applications demonstrate its efficiency over state-of-the-art methods.
1.4 Notation
Given a matrix $A$, we use $\|A\|_F$ to denote its Frobenius norm, $\|A\|_1 := \sum_{ij}|A_{ij}|$ to denote its $\ell_1$ norm, and $\|A\|_2$ to denote its spectral norm. For a vector $x$, we use $\|x\|_2$ and $\|x\|_1$ to denote its Euclidean norm and $\ell_1$ norm, respectively. The symbol $\mathbb{B}$ denotes the closed unit ball in $\mathbb{R}^n$, while $\mathbb{B}(x, \epsilon)$ stands for the closed ball of radius $\epsilon > 0$ centered at $x$.
1.5 Organization
2. Preliminaries
In this section, we first review some basic notions of the subdifferential and give the
definition of prox-regular functions. We also introduce several concepts of stationarity
and present the definition of semismoothness.
By convention, if $x \notin \mathrm{dom}(\varphi)$, then $\partial\varphi(x) = \emptyset$. The domain of $\partial\varphi$ is defined as $\mathrm{dom}(\partial\varphi) = \{x \in \mathbb{R}^n : \partial\varphi(x) \ne \emptyset\}$. For the indicator function $\delta_S : \mathbb{R}^n \to \{0, +\infty\}$ associated with the non-empty closed set $S \subseteq \mathbb{R}^n$, we have
\[
\widehat{\partial}\delta_S(x) = \Big\{ v \in \mathbb{R}^n : \limsup_{y \to x,\, y \in S} \frac{\langle v, y - x\rangle}{\|y - x\|_2} \le 0 \Big\} \quad \text{and} \quad \partial\delta_S(x) = N_S(x).
\]
A stronger stationarity notion is characterized by the fixed-point condition
\[
x \in \mathrm{prox}_{th}\big(x - t\nabla f(x)\big), \tag{12}
\]
where $t > 0$.
It follows from the definition of proxth that any point x satisfying (12) yields 0 ∈ ∇f (x) +
$\partial h(x)$, which implies that $x$ is also a critical point. Conversely, a critical point may not satisfy
(12) due to the nonconvexity of $h$. Therefore, equation (12) defines a stronger notion of stationary
point than (11).
2.3 Semismoothness
By Rademacher's theorem, a locally Lipschitz operator is almost everywhere differentiable. For a locally Lipschitz $F$, denote by $D_F$ the set of points at which $F$ is differentiable. The B-subdifferential at $x$ is defined as
\[
\partial_B F(x) := \Big\{ \lim_{k\to\infty} J(x^k) \;\Big|\; x^k \in D_F,\; x^k \to x \Big\},
\]
where $J(x)$ represents the Jacobian of $F$ at the differentiable point $x$. Obviously, $\partial_B F(x)$ may not be a singleton. The Clarke subdifferential is defined as $\partial_C F(x) := \mathrm{conv}(\partial_B F(x))$, where $\mathrm{conv}(A)$ represents the closed convex hull of $A$. A locally Lipschitz continuous operator $F$ is called semismooth at $x$ with respect to $\partial_B F$ ($\partial_C F$) if
• $F$ is directionally differentiable at $x$, i.e., for any direction $d$, the limit $\lim_{t\downarrow 0} \frac{F(x+td)-F(x)}{t}$ exists.
The minimax concave penalty (MCP) regularizer is given by
\[
h(x) = \begin{cases}
\lambda|x| - \dfrac{x^2}{2\theta}, & |x| \le \theta\lambda,\\[4pt]
\dfrac{\theta\lambda^2}{2}, & |x| > \theta\lambda,
\end{cases}
\]
where $\lambda$ and $\theta$ are two positive parameters. It is weakly convex with modulus $\rho = 1/\theta$. If $t < \theta$, the closed-form expression of its proximal operator is
\[
\mathrm{prox}_{th}(x) =
\begin{cases}
0, & |x| < t\lambda,\\[2pt]
\dfrac{x - \lambda t\,\mathrm{sign}(x)}{1 - t/\theta}, & t\lambda \le |x| \le \theta\lambda,\\[4pt]
x, & |x| > \theta\lambda.
\end{cases}
\]
The semismoothness property of the MCP regularizer is presented in (Shi et al., 2019).
Analogously, one can also verify the weak convexity of the SCAD regularizer and the semis-
moothness of its proximal operator. We refer to (Böhm and Wright, 2021) and (Shi et al.,
2019) for the details. Numerical results in (Shi et al., 2019) exhibit the efficiency of semis-
mooth Newton methods.
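For concreteness, the closed-form MCP proximal operator above can be evaluated elementwise as in the following illustrative Python sketch (assuming $t < \theta$); the function name is ours and the snippet is not taken from any existing code base.

```python
import numpy as np

def prox_mcp(x, t, lam, theta):
    """Elementwise prox of the MCP regularizer for step size t < theta."""
    assert t < theta
    x = np.asarray(x, dtype=float)
    return np.where(
        np.abs(x) < t * lam,
        0.0,                                                  # |x| < t*lam
        np.where(
            np.abs(x) <= theta * lam,
            (x - lam * t * np.sign(x)) / (1.0 - t / theta),   # t*lam <= |x| <= theta*lam
            x,                                                # |x| > theta*lam
        ),
    )

print(prox_mcp(np.array([-3.0, -0.4, 0.05, 0.4, 3.0]), t=0.1, lam=1.0, theta=2.0))
```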
Lemma 3 The functions h defined in both (4) and (6) are strongly prox-regular and their proximal operators are semismooth with respect to their B-subdifferentials. Specifically,
(i) Let $C_1 = \mathrm{Ob}(n, p)$, $\|V\|_{2,\infty} := \max_{i=1,2,\ldots,p}\|V_i\|_2$, and $\gamma_1 = \frac{1}{(\lambda+1)n}$. The function $h(X) = \lambda\|X\|_1 + \delta_{\mathrm{Ob}(n,p)}(X)$ is strongly prox-regular with respect to $C_1$, $\gamma_1$, and $\|\cdot\|_{2,\infty}$. Moreover, the proximal mapping $\mathrm{prox}_{th}$ is semismooth over the set $D_1 = \{X + tV : X \in C_1,\ \|V\|_{2,\infty} = 1,\ 0 \le t \le \gamma_1\}$ with respect to $\partial_B\,\mathrm{prox}_{th}$.
(ii) Let $C_2 = \mathrm{Ob}_+(n, p)$ and $0 < \gamma_2 < 1$. The function $h(X) = \delta_{\mathrm{Ob}_+(n,p)}(X)$ is strongly prox-regular with respect to $C_2$, $\gamma_2$, and $\|\cdot\|_{2,\infty}$. Moreover, the proximal mapping $\mathrm{prox}_{th}$ is semismooth over the set $D_2 = \{X + tV : X \in C_2,\ \|V\|_{2,\infty} = 1,\ 0 \le t \le \gamma_2\}$ with respect to $\partial_B\,\mathrm{prox}_{th}$.
One can draw a similar conclusion for the case $d_i < 0$. Combining the two cases, we
conclude that $\mathrm{prox}_{th}$ is semismooth.
(ii) It follows from the definition of the proximal mapping associated with (6) that $\mathrm{prox}_{th}(X)$ is single-valued and Lipschitz continuous over $D_2$. Analogous to the case above, one can prove the semismoothness of $\mathrm{prox}_{th}$.
The strong prox-regularity and semismoothness established in the above lemma allow us
to design efficient second-order methods for solving the applications in Subsection 1.1.
Corresponding numerical experiments will be conducted in Section 6.
In view of the above properties, we work with the natural residual
\[
F(x) := x - \mathrm{prox}_{th}\big(x - t\nabla f(x)\big),
\]
where $t$ is set as $\min\{\gamma, 1\}/L$. It follows from the SL property of $\mathrm{prox}_{th}$ and the twice continuous differentiability of $f$ that $F$ is single-valued and Lipschitz continuous.
In what follows, we assume that $\mathrm{prox}_{th}$ is semismooth with respect to its B-subdifferential.
Then, $F$ is semismooth with respect to $M(x)$. This allows us to design a semismooth Newton method for solving (1). One typical benefit of second-order methods is the superlinear
or faster local convergence rate. Specifically, we first solve the linear system
\[
(M_k + \mu_k I)\, d^k = -F(x^k), \tag{16}
\]
where $M_k \in M(x^k)$ defined by (13) is a generalized Jacobian and $\mu_k = \kappa\|F(x^k)\|_2$ with
a positive constant $\kappa$. Note that the shift term $\mu_k I$ can be used to promote the positive
definiteness of the coefficient matrix of (16), particularly in the convex setting (Xiao et al.,
2018; Li et al., 2018b). The semismooth Newton step is then defined as
\[
z^k = P_{\mathrm{dom}(h)}(x^k + d^k), \tag{17}
\]
where the projection onto $\mathrm{dom}(h)$ is necessary for the globalization due to the nonconvexity
of $h$. We remark that the strong prox-regularity in Definition 1 is crucial for the design
where $\rho_k$ is the norm of the residual of the last accepted Newton iterate up to iteration $k$, with an
initialization $\rho_0 > 0$, $\eta > 0$, and $\nu, q \in (0, 1)$. Otherwise, the semismooth Newton step $z^k$
fails, and we perform a proximal gradient step, i.e.,
\[
x^{k+1} = x^k - F(x^k) = \mathrm{prox}_{th}\big(x^k - t\nabla f(x^k)\big). \tag{20}
\]
Due to the choice of t = min{γ, 1}/L, we will show in the next section that there is a
sufficient decrease in the objective function value ϕ(xk+1 ). Under the BD-regularity con-
dition (any element of $\partial_B F(x^*)$ at the stationary point $x^*$ is nonsingular (Qi, 1993; Pang
and Qi, 1993)), we show in the next section that the semismooth Newton steps will always
be accepted when the iterates are close to the optimal solution. The proposed switching
between the Newton step and the proximal gradient step ensures that its theoretical con-
vergence is independent of the specific value chosen for κ > 0 in (16). However, selecting an
appropriate κ is beneficial for achieving satisfactory numerical performance. The detailed
algorithm is presented in Algorithm 1.
3: Set z k = Pdom(h) (xk + dk ). If the conditions (18) and (19) are satisfied, set xk+1 = z k .
Otherwise, set xk+1 = xk − F (xk ).
4: Set k = k + 1.
5: end while
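For orientation, the following Python sketch outlines the structure of Algorithm 1. It is schematic only: `generalized_jacobian` stands for the selection of an element $M_k \in M(x^k)$, and `newton_step_accepted` is a placeholder for the switching tests (18) and (19) discussed above; none of these names refer to an actual code base, and the sketch is not the implementation used in our experiments.

```python
import numpy as np

def projected_ssn(x0, grad_f, prox_th, proj_dom, generalized_jacobian,
                  t=1.0, kappa=1.0, max_iter=1000, residual_tol=1e-10,
                  newton_step_accepted=None):
    """Schematic projected semismooth Newton loop (sketch of Algorithm 1).

    F(x) = x - prox_th(x - t*grad_f(x)) is the natural residual; a regularized
    semismooth Newton trial point is computed first and, if the switching test
    rejects it, a proximal gradient step x - F(x) is taken instead."""
    x = x0.copy()
    for k in range(max_iter):
        Fx = x - prox_th(x - t * grad_f(x))            # natural residual F(x^k)
        if np.linalg.norm(Fx) <= residual_tol:
            break
        Mk = generalized_jacobian(x)                   # some M_k in M(x^k)
        mu = kappa * np.linalg.norm(Fx)                # shift mu_k = kappa*||F(x^k)||_2
        d = np.linalg.solve(Mk + mu * np.eye(x.size), -Fx)   # linear system (16)
        z = proj_dom(x + d)                            # step (17): project onto dom(h)
        if newton_step_accepted is not None and newton_step_accepted(x, z):
            x = z                                      # accept the Newton trial point
        else:
            x = x - Fx                                 # proximal gradient fallback (20)
    return x
```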
5. Convergence analysis
In this section, we will present the convergence properties of the proposed projected
semismooth Newton method, i.e., Algorithm 1. The analysis consists of two parts: global convergence to a stationary point from any starting point, and local superlinear convergence.
With the above assumption, the proximal gradient step (20) leads to a sufficient decrease
in $\varphi$.
Lemma 5 Suppose that Assumption 4 holds. Then for any $t_k \in (0, \frac{1}{L}]$ we have
\[
\varphi(x^k) - \varphi(x^{k+1}) \ge \Big(\frac{1}{2t_k} - \frac{L}{2}\Big)\|x^{k+1} - x^k\|_2^2. \tag{21}
\]
From the above lemma, the convergence of the proximal gradient method for solving (1)
can be obtained by the coercive property of ϕ. When the projected semismooth Newton
update z k is accepted, the function value ϕ(z k ) may increase while the residual decreases
as guaranteed by (18) and (19). This allows us to show global convergence.
Theorem 6 Let $\{x^k\}$ be the iterates generated by Algorithm 1. Suppose that Assumption 4 holds. Let $t_k \equiv t \in (0, \min(\gamma, 1)/L]$. Then we have
Proof If $x^{k+1}$ is obtained by the proximal gradient update, it holds from Lemma 5 that
\[
\varphi(x^k) - \varphi(x^{k+1}) \ge \Big(\frac{1}{2t} - \frac{L}{2}\Big)\|F(x^k)\|_2^2. \tag{22}
\]
It follows from the Lipschitz properties of $\mathrm{prox}_{th}$ and $\nabla f$ that $F$ is Lipschitz continuous. Let
$L_F$ be the Lipschitz constant of $F$. From the triangle inequality, we have
\[
\|F(x^{k+1})\|_2 \le \|F(x^k)\|_2 + \|F(x^{k+1}) - F(x^k)\|_2 \le (L_F + 1)\|F(x^k)\|_2.
\]
Combining the two displays above yields
\[
\varphi(x^k) - \varphi(x^{k+1}) \ge c_1\|F(x^{k+1})\|_2^2,
\]
where $c_1 := \big(\frac{1}{2t} - \frac{L}{2}\big)\frac{1}{(L_F+1)^2} > 0$.
If the Newton update $z^k$ is accepted, the conditions (18) and (19) imply that $\rho_{k+1} = \|F(x^{k+1})\|_2 \le \nu\rho_k$. Since $\rho_k \in (0, \rho_0)$ for all $k$, $c_1\|F(x^{k+1})\|_2^{2-q} + \eta\rho_k^{1-q}$ is
bounded by a constant, denoted by $c_2$. Hence, for the projected semismooth Newton step,
it holds that
\[
\varphi(x^k) - \varphi(x^{k+1}) \ge c_1\|F(x^{k+1})\|_2^2 - c_2\rho_{k+1}^q. \tag{24}
\]
Summing over $k = 0, \ldots, K$ gives
\[
\varphi(x^0) - \varphi(x^{K+1}) = \sum_{k=0}^{K}\big(\varphi(x^k) - \varphi(x^{k+1})\big) \ge c_1\sum_{k=0}^{K}\|F(x^{k+1})\|_2^2 - c_2\sum_{k\in\mathcal{K}_N}\rho_{k+1}^q,
\]
where $\mathcal{K}_N \subset \{1, 2, \ldots, K+1\}$ consists of the indices at which the projected semismooth Newton
updates are accepted. It is easy to see that $\sum_{k\in\mathcal{K}_N}\rho_{k+1}^q \le \rho_0^q\sum_{k=1}^{K+1}\nu^{qk} \le \frac{\rho_0^q}{1-\nu^q}$. Therefore,
\[
c_1\sum_{k=0}^{K}\|F(x^{k+1})\|_2^2 \le \varphi(x^0) - \varphi(x^{K+1}) + \frac{c_2\,\rho_0^q}{1 - \nu^q}.
\]
Since the convergence of $\{\|F(x^k)\|_2\}$ is proved in Theorem 6, any accumulation point
of $\{x^k\}$ has zero residual. Assumption (A1) states that the full sequence $\{x^k\}$ is
convergent. Assumption (A2) holds for any twice continuously differentiable $f$.
Assumption (A3) is the standard BD-regularity condition used in (Qi, 1993; Pang and Qi,
1993; Milzarek and Ulbrich, 2014; Xiao et al., 2018).
For the projection operator $P_{\mathrm{dom}(h)}$ in Algorithm 1, we prove the following boundedness
property, which has also been used in the convergence rate analysis of the generalized
power method for group synchronization problems (Liu et al., 2017b, Lemma 1), (Liu
et al., 2017a, Proposition 3.3), (Liu et al., 2020, Lemma 2).
Proposition 8 For all $x \in \mathbb{R}^n$ and $y \in \mathrm{dom}(h)$, it holds that $\|P_{\mathrm{dom}(h)}(x) - y\|_2 \le 2\|x - y\|_2$.
Proof Following the definition of $P_{\mathrm{dom}(h)}$, we have
\[
\|P_{\mathrm{dom}(h)}(x) - y\|_2 \le \|P_{\mathrm{dom}(h)}(x) - x\|_2 + \|x - y\|_2 \le 2\|x - y\|_2,
\]
where the second inequality uses $\|P_{\mathrm{dom}(h)}(x) - x\|_2 \le \|x - y\|_2$, which holds since $y \in \mathrm{dom}(h)$ and $P_{\mathrm{dom}(h)}(x)$ is a nearest point of $\mathrm{dom}(h)$ to $x$.
The following lemma shows that the switching conditions (18) and (19) are satisfied by
the projected semismooth Newton update when k is large enough.
Lemma 9 Let {xk } be the iterates generated by Algorithm 1. Suppose that Assumptions 4
and 7 hold. Then for sufficiently large k, the Newton update z k is always accepted.
Proof Let us first define a constant
\[
\gamma_F \in \Big(0,\ \min\Big\{\frac{1}{8C},\ \frac{\nu}{32C^2L_F},\ \Big(\frac{\eta}{32C^2 L_\varphi 3^q C^q}\Big)^{\frac{1}{1-q}},\ 1\Big\}\Big),
\]
where $C, \nu, \eta, q, L_F, L_\varphi$ are defined previously. It follows from (Qi, 1993, Lemma 2.6) and (A3)
that there exists $\varepsilon > 0$ such that for any $x \in \mathbb{B}(x^*, \varepsilon)$ and $M \in M(x)$,
\[
\|F(x) - F(x^*) - (M + \kappa\|F(x)\|_2 I)(x - x^*)\|_2 \le \gamma_F\|x - x^*\|_2, \quad \|(M + \kappa\|F(x)\|_2 I)^{-1}\|_2 \le 2C. \tag{25}
\]
For the projected semismooth Newton update $z^k = P_{\mathrm{dom}(h)}(x^k - (M_k + \mu_k I)^{-1}F(x^k))$, it
holds that
\[
\begin{aligned}
\|z^k - x^*\|_2 &= \|P_{\mathrm{dom}(h)}(x^k - (M_k + \mu_k I)^{-1}F(x^k)) - x^*\|_2\\
&\le 2\|(M_k + \mu_k I)^{-1}(F(x^k) - F(x^*) - (M_k + \mu_k I)(x^k - x^*))\|_2\\
&\le 4\gamma_F C\|x^k - x^*\|_2,
\end{aligned}
\tag{26}
\]
where we assume $x^k \in \mathbb{B}(x^*, \varepsilon)$. Due to the choice of $\gamma_F$, we have $z^k \in \mathbb{B}(x^*, \varepsilon)$. Note that
\[
\|x^k - x^*\|_2 \le \frac{4C}{1 - 4\gamma_F C}\|F(x^k)\|_2. \tag{28}
\]
Combining (26) and (28) implies
\[
\|z^k - x^*\|_2 \le \frac{16\gamma_F C^2}{1 - 4\gamma_F C}\|F(x^k)\|_2. \tag{29}
\]
Hence,
\[
\|F(z^k)\|_2 = \|F(z^k) - F(x^*)\|_2 \le L_F\|z^k - x^*\|_2 \le \frac{16\gamma_F C^2 L_F}{1 - 4\gamma_F C}\|F(x^k)\|_2 \le \nu\|F(x^k)\|_2. \tag{30}
\]
In addition, note that
\[
\|z^k - x^*\|_2 = \big\|(M_k + \mu_k I)^{-1}\big(F(z^k) - F(x^*) - (M_k + \mu_k I)(z^k - x^*) - F(z^k)\big)\big\|_2 \le 2\gamma_F C\|z^k - x^*\|_2 + 2C\|F(z^k)\|_2.
\]
This gives
\[
\|z^k - x^*\|_2 \le \frac{2C}{1 - 2\gamma_F C}\|F(z^k)\|_2. \tag{31}
\]
The change between $\varphi(z^k)$ and $\varphi(x^k)$ can be estimated by
The above lemma establishes the local transition to the projected semismooth Newton
step. Utilizing the semismoothness, we obtain the local superlinear convergence of the
iterates generated by Algorithm 1.
Theorem 10 Let {xk } be the iterates generated by Algorithm 1. Suppose that Assumptions
4 and 7 hold. Then there exists a finite K > 0, such that for all k ≥ K, {xk } converges to
x∗ Q-superlinearly.
Proof From Lemma 9, there exists a K such that the projected semismooth Newton
update is accepted for k ≥ K. It follows from the semismoothness of F that
where we use $\mu_k = \kappa\|F(x^k)\|_2$ and $F(x^k) \to 0$ (i.e., (A1)) for the last equality. This means
that $\{x^k\}$ converges to $x^*$ Q-superlinearly.
6. Numerical experiments
In this section, some numerical experiments are presented to evaluate the performance
of our proposed Algorithm 1, denoted by ProxSSN. We compare ProxSSN with the existing
methods including AManPG and ARPG (Huang and Wei, 2021). We also test the proximal
gradient descent method (ProxGD for short) as in (14). Here, a nonmonotone line search
with Barzilai–Borwein (BB) step size (Barzilai and Borwein, 1988) is used for acceleration.
Let $s^k = x^k - x^{k-1}$ and $y^k = \nabla f(x^k) - \nabla f(x^{k-1})$. The BB step sizes are defined as
\[
\beta_k^1 = \frac{\langle s^k, s^k\rangle}{|\langle s^k, y^k\rangle|}, \qquad \beta_k^2 = \frac{|\langle s^k, y^k\rangle|}{\langle y^k, y^k\rangle}. \tag{33}
\]
Given $\varrho, \delta \in (0, 1)$, the nonmonotone Armijo line search finds the smallest nonnegative
integer $\ell$ satisfying
\[
\varphi\big(\mathrm{prox}_{t_k(\ell)h}(x^k - t_k(\ell)\nabla f(x^k))\big) \le C_k - \frac{\varrho}{2t_k(\ell)}\big\|\mathrm{prox}_{t_k(\ell)h}(x^k - t_k(\ell)\nabla f(x^k)) - x^k\big\|_2^2. \tag{34}
\]
Here, $t_k(\ell) := \beta_k\delta^\ell$, $\beta_k$ is set to $\beta_k^1$ and $\beta_k^2$ alternately, and the reference value $C_k$ is
calculated via $C_k = (\varpi Q_{k-1}C_{k-1} + \varphi(x^k))/Q_k$, where $\varpi \in [0, 1]$, $C_0 = \varphi(x^0)$, $Q_k = \varpi Q_{k-1} + 1$, and $Q_0 = 1$. Once $\ell$ is obtained, we set $t_k = \beta_k\delta^\ell$ and the next iterate is given by
$x^{k+1} = \mathrm{prox}_{t_kh}(x^k - t_k\nabla f(x^k))$.
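A minimal Python sketch of this ProxGD baseline, following (33) and (34) with the reference value $C_k$ described above, is given below; the callables `phi`, `grad_f`, and `prox_th` are problem-dependent placeholders, and the snippet is an illustration rather than the MATLAB code used in the experiments.

```python
import numpy as np

def prox_gd_bb(x0, phi, grad_f, prox_th, max_iter=1000, tol=1e-8,
               varrho=1e-4, delta=0.5, varpi=0.85, beta_init=1e-3):
    """Proximal gradient with alternating BB step sizes (33) and the
    nonmonotone Armijo test (34) based on the reference value C_k."""
    x = x0.copy()
    g = grad_f(x)
    C, Q = phi(x), 1.0                              # C_0 = phi(x^0), Q_0 = 1
    beta = beta_init
    x_prev = g_prev = None
    for k in range(max_iter):
        if x_prev is not None:                      # BB step sizes from s^k, y^k
            s, y = x - x_prev, g - g_prev
            sy = abs(np.vdot(s, y)) + 1e-20
            beta = np.vdot(s, s) / sy if k % 2 == 0 else sy / (np.vdot(y, y) + 1e-20)
        t = beta
        for _ in range(50):                         # backtracking: t_k(l) = beta_k * delta^l
            x_new = prox_th(x - t * g, t)
            if phi(x_new) <= C - varrho / (2 * t) * np.linalg.norm(x_new - x) ** 2:
                break
            t *= delta
        if np.linalg.norm(x_new - x) / (t * (1 + np.linalg.norm(x))) <= tol:
            x = x_new                               # stop in the spirit of (36)
            break
        x_prev, g_prev = x, g
        x, g = x_new, grad_f(x_new)
        Q_new = varpi * Q + 1.0                     # Q_k = varpi * Q_{k-1} + 1
        C = (varpi * Q * C + phi(x)) / Q_new        # reference value C_k
        Q = Q_new
    return x
```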
The reason for not using ManPG (Chen et al., 2020), RPG (Huang and Wei, 2021),
or the algorithms proposed in (Lai and Osher, 2014; Kovnatsky et al., 2016) is that their
performance does not measure up to AManPG or ARPG in the tests of (Huang and Wei,
2021). For ARPG and AManPG, we use the code provided by (Huang and Wei, 2021).
The codes were written in MATLAB and run on a standard PC with 3.00 GHz AMD R5
microprocessor and 16GB of memory. The reported time is wall-clock time in seconds.
where L̃ > L with L being the Lipschitz constant of f , gradf (X k ) denotes the Riemannian
gradient of f at X k , and TX Ob(n, p) is the tangent space to Ob(n, p) at X. We refer to
(Chen et al., 2020) for more details. In the k-th iteration of ARPG, one needs to solve the
subproblem:
\[
\eta_{X^k} = \mathop{\arg\min}_{\eta \in T_{X^k}\mathrm{Ob}(n,p)} \ \big\langle \mathrm{grad}\, f(X^k), \eta\big\rangle + \frac{\tilde{L}}{2}\|\eta\|_F^2 + \lambda\|R_{X^k}(\eta)\|_1,
\]
where R denotes a retraction operator on Ob(n, p). The termination condition of both
AManPG and ARPG is as follows:
where tol > 0 is a given tolerance. The ProxGD and ProxSSN methods are applied to
solve problem (3) by setting $f(X) := \|X^\top A^\top AX - D^2\|_F^2$ and $h(X) = \lambda\|X\|_1 + \delta_{\mathrm{Ob}(n,p)}(X)$.
ProxGD uses the update rule $X^{k+1} = \mathrm{prox}_{t_kh}(X^k - t_k\nabla f(X^k))$, with the step size $t_k$ chosen by the nonmonotone line search described above.
The following relative KKT condition is set as a stopping criterion for our algorithm and
ProxGD:
\[
\mathrm{err} := \frac{\big\|X^k - \mathrm{prox}_{t_kh}(X^k - t_k\nabla f(X^k))\big\|_F}{t_k(1 + \|X^k\|_F)} \le \mathrm{tol}. \tag{36}
\]
Note that tk is fixed in ProxSSN. Based on Lemma 3, we can calculate the proximal mapping
and its generalized Jacobian in our ProxSSN at a low cost.
Implementation details The parameters of AManPG and ARPG are set the same
as in (Huang and Wei, 2021). For ProxSSN, we set $q = 20$, $\nu = 0.9999$, $\eta = 10^{-6}$, $t = 1/\lambda_{\max}(A^\top A)$, and the initial value $\kappa = 1$. The maximum number of iterations is 10000.
The starting point of all algorithms is the leading $p$ right singular vectors of the matrix $A$.
Since the evaluation criteria differ across algorithms, we first run ARPG until (35) is satisfied with $\mathrm{tol} = 10^{-10}\times n\times p$ or the number of iterations exceeds 10000,
and denote by $F_{\mathrm{ARPG}}$ the obtained objective value. The other algorithms are terminated
when the objective value satisfies $F(X^k) \le F_{\mathrm{ARPG}} + 10^{-6}$, when (35) (respectively (36)) is satisfied with
$\mathrm{tol} = 10^{-10}\times n\times p$, or when the number of iterations exceeds 10000.
In our experiments, the data matrix A ∈ Rm×n is produced by MATLAB function
randn(m, n), in which all entries of A follow the standard Gaussian distribution. Next, we
shift the columns of A such that they have zero-mean, and normalize the resulting matrix
by its spectral norm.
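An equivalent data-generation step in Python/NumPy (a sketch mirroring the MATLAB procedure just described, with a function name of our own) is:

```python
import numpy as np

def generate_data(m, n, seed=0):
    """Random data matrix: standard Gaussian entries, zero-mean columns,
    normalized by the spectral norm."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))
    A -= A.mean(axis=0, keepdims=True)       # shift columns to zero mean
    A /= np.linalg.norm(A, 2)                # divide by the spectral norm
    return A

A = generate_data(50, 300)
print(abs(A.mean(axis=0)).max(), np.linalg.norm(A, 2))   # ~0 and 1
```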
Figure 1: The trajectories of the objective function values with respect to the wall-clock time on the sparse PCA problem (3) with p = 10, λ = 0.01. Left: n = 300; right: n = 400.
We also compare the accuracy and efficiency of ProxSSN with other algorithms using
the performance profiling method proposed in (Dolan and Moré, 2002). Let ti,s be some
performance quantity (e.g. the wall-clock time or the gap between the obtained objective
function value and ϕmin , lower is better) associated with the s-th solver on problem i. Then,
one computes the ratio $r_{i,s}$ as $t_{i,s}$ over the smallest value obtained by the $n_s$ solvers on problem
$i$, i.e., $r_{i,s} := t_{i,s}/\min\{t_{i,s} : 1 \le s \le n_s\}$. For $\tau > 0$, the value
\[
\pi_s(\tau) := \frac{\#\{i : r_{i,s} \le 2^\tau\}}{\#\{\text{problems}\}}
\]
indicates that solver $s$ is within a factor $2^\tau \ge 1$ of the performance obtained by the best
solver. Then the performance plot is the curve $\pi_s(\tau)$ for each solver $s$ as a function of $\tau$.
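The curves $\pi_s(\tau)$ can be computed in a few lines; the sketch below assumes a matrix `T` whose entry `T[i, s]` stores the performance quantity $t_{i,s}$ (smaller is better) and is provided only as an illustration of the definition above.

```python
import numpy as np

def performance_profile(T, tau_grid):
    """T[i, s] = performance of solver s on problem i (lower is better).
    Returns pi[s, j] = fraction of problems with r_{i,s} <= 2**tau_grid[j]."""
    T = np.asarray(T, dtype=float)
    ratios = T / T.min(axis=1, keepdims=True)        # r_{i,s}
    pi = np.array([[np.mean(ratios[:, s] <= 2.0 ** tau) for tau in tau_grid]
                   for s in range(T.shape[1])])
    return pi

# toy example: 3 problems, 2 solvers
T = np.array([[1.0, 2.0], [3.0, 1.5], [2.0, 2.0]])
print(performance_profile(T, tau_grid=np.linspace(0, 3, 4)))
```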
In Figure 4, we show the performance profiles for two criteria: the wall-clock time and the
gap in the objective function values. In particular, the intercept of the curve with the axis
"ratio of problems" in each subfigure is the percentage of problems on which the corresponding solver is the fastest among the
Figure 2: Comparisons of wall-clock time and the objective function values on the sparse PCA problem (3) with p = 20, λ = 0.01 for different n.
four solvers. These figures show that both the wall-clock time and the gap in the objective
function values of ProxSSN are much better than other algorithms on most problems.
Figure 3: Comparisons of wall-clock time and the objective function values on the sparse PCA problem (3) with n = 512, λ = 0.01 for different p.
(n / p) | ProxSSN: time, obj, err, iter | ProxGD: time, obj, err, iter
500 / 10 0.35 1.166866 1.23e-7 66 (9.2) 2.46 1.166866 2.93e-5 2840
500 / 15 0.22 1.619850 6.55e-7 33 (9.5) 1.96 1.619850 2.60e-5 1838
500 / 20 0.53 1.942255 8.52e-7 64 (9.8) 4.60 1.942255 2.29e-5 3413
500 / 25 0.72 2.300220 7.04e-7 73 (9.8) 10.42 2.300220 1.70e-5 7317
500 / 30 0.89 2.523960 6.56e-7 83 (9.9) 15.41 2.523999 2.69e-4 10000
500 / 5 0.04 0.592243 8.56e-8 13 (8.8) 0.13 0.592244 4.83e-5 189
600 / 10 0.12 1.223420 5.71e-7 20 (9.4) 1.32 1.223420 2.57e-5 1362
600 / 15 0.34 1.894680 3.86e-7 47 (9.7) 5.43 1.894680 2.44e-5 4455
600 / 20 1.23 2.261036 1.19e-6 130 (9.6) 13.43 2.261036 1.63e-5 9617
600 / 25 0.90 2.238704 9.12e-7 91 (9.8) 14.71 2.241462 1.54e-3 10000
600 / 30 0.92 2.510240 1.19e-6 94 (9.8) 16.34 2.510453 1.81e-5 10000
600 / 5 0.04 0.782584 8.94e-8 12 (9.2) 0.52 0.782584 2.43e-5 757
700 / 10 0.48 1.332547 3.06e-7 66 (9.5) 4.74 1.332547 2.90e-5 5031
700 / 15 0.23 1.891921 4.99e-7 32 (9.7) 3.09 1.891922 1.61e-5 2448
700 / 20 0.35 2.232710 7.31e-7 38 (9.7) 4.07 2.232710 1.86e-5 2787
700 / 25 1.00 2.578730 1.39e-6 90 (9.9) 18.99 2.578745 1.61e-4 10000
700 / 30 2.01 2.997021 1.90e-6 124 (9.8) 14.99 3.025005 1.36e-5 6748
700 / 5 0.09 0.751121 1.23e-7 19 (9.1) 1.06 0.751121 4.40e-5 1475
800 / 10 0.62 1.361048 6.58e-7 57 (9.5) 5.33 1.361048 2.53e-5 3805
800 / 15 1.07 1.837726 5.13e-7 99 (9.7) 17.68 1.839436 7.48e-4 10000
800 / 20 1.50 2.262145 1.20e-6 115 (9.5) 18.80 2.262147 5.92e-5 10000
800 / 25 2.30 2.621645 1.51e-6 158 (9.8) 19.63 2.623857 1.42e-4 10000
800 / 30 1.58 2.943294 2.20e-6 122 (9.7) 19.78 2.944495 1.71e-3 10000
800 / 5 0.07 0.754357 1.28e-7 10 (9.0) 0.31 0.754357 4.35e-5 257
900 / 10 0.10 1.374185 3.69e-7 14 (9.3) 1.51 1.374185 2.40e-5 1200
900 / 15 0.86 1.933525 1.06e-6 93 (9.7) 7.89 1.933525 1.07e-5 5513
900 / 20 1.18 2.360027 1.36e-6 107 (9.7) 17.38 2.360027 1.74e-5 10000
900 / 25 1.88 2.773641 1.69e-6 153 (9.8) 19.07 2.777065 3.41e-4 10000
900 / 30 1.60 3.157731 2.02e-6 121 (9.8) 21.05 3.159992 3.55e-3 10000
900 / 5 0.27 0.770418 2.27e-7 62 (7.9) 1.54 0.770418 2.59e-5 1672
1000 / 10 0.64 1.376750 2.84e-7 81 (9.0) 3.50 1.376750 3.47e-5 2719
1000 / 15 0.19 2.049750 1.08e-6 21 (9.5) 2.65 2.049750 2.16e-5 1673
1000 / 20 1.05 2.581317 1.41e-6 85 (9.7) 18.11 2.581318 2.08e-5 10000
1000 / 25 1.30 3.043254 1.18e-6 101 (9.8) 20.56 3.045420 4.82e-4 10000
1000 / 30 1.79 3.516861 2.84e-6 129 (9.8) 23.79 3.517976 1.25e-3 10000
1000 / 5 0.05 0.804861 2.75e-7 10 (9.0) 0.33 0.804861 3.53e-5 390
[Figure 4: performance profiles (ratio of problems vs. factor of the best solver) of ProxSSN, ProxGD, AManPG, and ARPG on the sparse PCA problem (3).]
Figure 5: The trajectories of the objective function values with respect to the wall-clock time on the sparse least square regression (8) with m = 20, λ = 0.01. Left: n = 2000; right: n = 3000.
By using a suitable discretization, such as finite differences or the sine pseudo-spectral and
Fourier pseudo-spectral (FP) method, we can reformulate the BEC problem as follows:
\[
\min_{x\in\mathbb{C}^M}\ \frac{1}{2}x^*Ax + \frac{\beta}{2}\sum_{i=1}^{M}|x_i|^4, \quad \text{s.t.}\ x \in \mathbb{S}^M, \tag{39}
\]
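To illustrate how (39) fits the framework (1), the following Python sketch evaluates the discretized energy and the projection onto the unit sphere (taking $\mathbb{S}^M$ as $\{x \in \mathbb{C}^M : \|x\|_2 = 1\}$); the gradient shown is one consistent choice under the real inner product $\mathrm{Re}\langle\cdot,\cdot\rangle$ for a Hermitian $A$, and the whole snippet is an illustrative sketch rather than the code used in this section.

```python
import numpy as np

def bec_energy(x, A, beta):
    """Discretized BEC energy (1/2) x^* A x + (beta/2) * sum_i |x_i|^4."""
    return 0.5 * np.real(np.vdot(x, A @ x)) + 0.5 * beta * np.sum(np.abs(x) ** 4)

def bec_grad(x, A, beta):
    """One consistent Euclidean gradient w.r.t. Re<.,.>, assuming A is Hermitian."""
    return A @ x + 2.0 * beta * (np.abs(x) ** 2) * x

def proj_sphere(x):
    """Prox of the indicator of the unit sphere, i.e., the projection x / ||x||_2."""
    return x / np.linalg.norm(x)

# tiny sanity check with a random Hermitian A
M = 8
B = np.random.randn(M, M) + 1j * np.random.randn(M, M)
A = 0.5 * (B + B.conj().T)
x = proj_sphere(np.random.randn(M) + 1j * np.random.randn(M))
print(bec_energy(x, A, beta=500.0), np.linalg.norm(x))
```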
Figure 6: Comparisons of wall-clock time and the objective function values on the sparse least square regression (8) with m = 30, λ = 0.01 for different n.
Figure 7: The performance profiles on the sparse least square regression (8).
Since problem (39) can be seen as a smooth problem on the complex sphere, we compare
with the adaptive regularized Newton method (ARNT) in (Hu et al., 2018). All
parameters of ProxGD and ProxSSN follow the setup discussed in Subsection 6.1 except that
tol = $10^{-6}$. The parameters of ARNT are the same as in (Hu et al., 2018); we stop ARNT
when the Riemannian gradient norm is less than $10^{-6}$ or the maximum number of 500 iterations
is reached. We take $d = 2$ and $V(x, y) = \frac{1}{2}x^2 + \frac{1}{2}y^2$. The BEC problem is discretized by
FP on the bounded domain $(-16, 16)^2$ with $\beta$ ranging from 500 to 1000 and $\Omega = 0, 0.1, 0.25$.
Following the settings in (Wu et al., 2017), we use a mesh refinement procedure with the
coarse meshes $(2^k+1)\times(2^k+1)$ $(k = 2, \ldots, 5)$ to gradually obtain an initial solution point
on the finest mesh $(2^6+1)\times(2^6+1)$. All algorithms are tested with mesh refinement and
start from the same initial point on the coarsest mesh with
Figure 8: Comparisons of wall-clock time and the objective function values on the nonnegative PCA problem (5) with p = 20 for different n.
[Figure 9: performance profiles of ProxSSN and ProxGD on the nonnegative PCA problem (5).]
where $\phi_1(x, y) = \frac{1}{\sqrt{\pi}}e^{-(x^2+y^2)/2}$ and $\phi_2(x, y) = \frac{x+iy}{\sqrt{\pi}}e^{-(x^2+y^2)/2}$.
Table 2 gives detailed computational results. For the first column, "Initial" denotes the
type of the initial point; "a" and "b" stand for $\phi_a(x, y)$ and $\phi_b(x, y)$, respectively. For the iteration
numbers in the table, "iter" and "siter" denote the number of outer iterations and the average number of
sub-iterations, respectively. Note that ProxGD reaches the maximum number of 1000 iterations, which
shows that ProxGD does not converge to the required accuracy in all cases. ProxSSN
and ARNT find points with almost the same objective function value, while our algorithm
ProxSSN is faster than ARNT in most cases. Figures 10 and 11 demonstrate the superiority
of ProxSSN over ARNT and ProxGD.
Table 4: Computational results of BEC
7. Conclusion
This paper introduces a new concept of strong prox-regularity and validates it for
many interesting existing applications, including composite optimization problems with
weakly convex regularizers, smooth optimization problems on manifolds, and several composite
optimization problems on manifolds. A projected semismooth Newton method is then
proposed for solving a class of nonconvex optimization problems equipped with strong
prox-regularity. The idea is to utilize the locally single-valued and Lipschitz continuous
residual mapping. The global convergence and local superlinear convergence results of the
proposed algorithm are established under standard conditions. Numerical results convincingly
demonstrate the effectiveness of the proposed method on various nonconvex composite
problems, including the sparse PCA problem, the nonnegative PCA problem, the sparse
least squares regression, and the BEC problem.
Acknowledgments
Figure 10: Comparisons of wall-clock time and the objective function values on the BEC problem (39) with Ω = 0.2 and "b".
[Figure 11: performance profiles of ProxSSN, ProxGD, and ARNT on the BEC problem (39).]
The authors are grateful to Prof. Anthony Man-Cho So for his valuable comments and
suggestions.
References
P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix
Manifolds. Princeton University Press, 2009.
Amandine Aftalion and Qiang Du. Vortices in a rotating Bose-Einstein condensate: Critical
angular velocities and energy diagrams in the thomas-fermi regime. Physical Review A,
64(6):063603, 2001.
28
A Projected SSN for Nonconvex and Nonsmooth Programs
Weizhu Bao and Yongyong Cai. Mathematical theory and numerical methods for Bose-
Einstein condensation. Kinetic & Related Models, 6(1):1–135, 2013.
Jonathan Barzilai and Jonathan M Borwein. Two-point step size gradient methods. IMA
Journal of Numerical Analysis, 8(1):141–148, 1988.
Axel Böhm and Stephen J Wright. Variable smoothing for weakly convex composite func-
tions. Journal of Optimization Theory and Applications, 188(3):628–649, 2021.
Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed
optimization and statistical learning via the alternating direction method of multipliers.
Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Figen Oztoprak. A family of second-order methods for convex ℓ1-regularized optimization. Mathematical Programming, 159(1):435–467, 2016.
Zi Xian Chan and Defeng Sun. Constraint nondegeneracy, strong regularity, and nonsin-
gularity in semidefinite programming. SIAM Journal on Optimization, 19(1):370–396,
2008.
Shixiang Chen, Shiqian Ma, Anthony M.-C. So, and Tong Zhang. Proximal gradient method
for nonsmooth optimization over the Stiefel manifold. SIAM Journal on Optimization,
30(1):210–239, 2020.
Francis H Clarke, RJ Stern, and PR Wolenski. Proximal smoothness and the lower-C2
property. Journal of Convex Analysis, 2(1-2):117–144, 1995.
Damek Davis, Dmitriy Drusvyatskiy, Kellie J MacPhee, and Courtney Paquette. Subgra-
dient methods for sharp weakly convex functions. Journal of Optimization Theory and
Applications, 179(3):962–982, 2018.
Kangkang Deng and Zheng Peng. A manifold inexact augmented Lagrangian method for
nonsmooth optimization on Riemannian submanifolds in Euclidean space. IMA Journal
of Numerical Analysis, 2022.
Elizabeth D Dolan and Jorge J Moré. Benchmarking optimization software with perfor-
mance profiles. Mathematical Programming, 91(2):201–213, 2002.
Dmitriy Drusvyatskiy. The proximal point method revisited. SIAG/OPT Views and News,
26:1–7, 2018.
A Projected SSN for Nonconvex and Nonsmooth Programs
Xudong Li, Defeng Sun, and Kim-Chuan Toh. A highly efficient semismooth Newton aug-
mented Lagrangian method for solving Lasso problems. SIAM Journal on Optimization,
28(1):433–458, 2018a.
Yongfeng Li, Zaiwen Wen, Chao Yang, and Yaxiang Yuan. A semismooth Newton method
for semidefinite programs and its applications in electronic structure calculations. SIAM
Journal on Scientific Computing, 40(6):A4131–A4157, 2018b.
Huikang Liu, Man-Chung Yue, and Anthony Man-Cho So. On the estimation performance
and convergence rate of the generalized power method for phase synchronization. SIAM
Journal on Optimization, 27(4):2426–2446, 2017a.
Huikang Liu, Man-Chung Yue, Anthony Man-Cho So, and Wing-Kin Ma. A discrete first-
order method for large-scale MIMO detection with provable guarantees. In 2017 IEEE
18th International Workshop on Signal Processing Advances in Wireless Communications
(SPAWC), pages 1–5. IEEE, 2017b.
Huikang Liu, Man-Chung Yue, and Anthony Man-Cho So. A unified approach to synchro-
nization problems over subgroups of the orthogonal group. arXiv:2009.07514, 2020.
Andre Milzarek and Michael Ulbrich. A semismooth Newton method with multidimensional filter globalization for ℓ1-optimization. SIAM Journal on Optimization, 24(1):298–333, 2014.
Jong-Shi Pang and Liqun Qi. Nonsmooth equations: Motivation and algorithms. SIAM
Journal on Optimization, 3(3):443–465, 1993.
Liqun Qi. Convergence analysis of some algorithms for solving nonsmooth equations. Math-
ematics of Operations Research, 18(1):227–244, 1993.
Liqun Qi and Defeng Sun. A survey of some nonsmooth equations and smoothing Newton
methods. In Progress in optimization, pages 121–146. Springer, 1999.
Liqun Qi and Jie Sun. A nonsmooth version of Newton's method. Mathematical Programming, 58(1):353–367, 1993.
R Tyrrell Rockafellar and Roger J-B Wets. Variational analysis, volume 317. Springer
Science & Business Media, 2009.
Yueyong Shi, Jian Huang, Yuling Jiao, and Qinglong Yang. A semismooth Newton algorithm
for high-dimensional nonconvex sparse learning. IEEE Transactions on Neural Networks
and Learning Systems, 31(8):2993–3006, 2019.
Zaiwen Wen and Wotao Yin. A feasible method for optimization with orthogonality con-
straints. Mathematical Programming, 142(1):397–434, 2013.
Xinming Wu, Zaiwen Wen, and Weizhu Bao. A regularized Newton method for computing
ground states of Bose–Einstein condensates. Journal of Scientific Computing, 73(1):303–
329, 2017.
Guiyun Xiao and Zheng-Jian Bai. A geometric proximal gradient method for sparse least
squares regression with probabilistic simplex constraint. arXiv:2107.00809, 2021.
Xiantao Xiao, Yongfeng Li, Zaiwen Wen, and Liwei Zhang. A regularized semi-smooth New-
ton method with projection steps for composite convex programs. Journal of Scientific
Computing, 76(1):364–389, 2018.
Zongben Xu, Xiangyu Chang, Fengmin Xu, and Hai Zhang. ℓ1/2 regularization: A thresholding representation theory and a fast solver. IEEE Transactions on Neural Networks and Learning Systems, 23(7):1013–1027, 2012.
Lei Yang. Proximal gradient method with extrapolation and line search for a class of
nonconvex and nonsmooth problems. arXiv:1711.06831, 2017.
Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The
Annals of Statistics, 38(2):894–942, 2010.
Xin-Yuan Zhao, Defeng Sun, and Kim-Chuan Toh. A Newton-CG augmented Lagrangian
method for semidefinite programming. SIAM Journal on Optimization, 20(4):1737–1765,
2010.
Yuhao Zhou, Chenglong Bao, Chao Ding, and Jun Zhu. A semi-smooth Newton
based augmented Lagrangian method for nonsmooth optimization on matrix manifolds.
arXiv:2103.02855, 2021.