
Super Gradient Descent: Global Optimization Requires Global Gradient

Seifeddine Achour

arXiv:2410.19706v1 [cs.LG] 25 Oct 2024

Abstract

Global minimization is a fundamental challenge in optimization, especially in machine learning, where finding the global minimum of a function directly impacts model performance and convergence. This report introduces a novel optimization method that we call Super Gradient Descent, designed specifically for one-dimensional functions and guaranteed to converge to the global minimum of any k-Lipschitz function defined on a closed interval [a, b]. Our approach addresses the limitations of traditional optimization algorithms, which often get trapped in local minima. In particular, we introduce the concept of the global gradient, which offers a robust solution for precise and well-guided global optimization. By focusing on the global minimization problem, this work bridges a critical gap in optimization theory, offering new insights and practical advances for a range of optimization problems, in particular machine learning problems such as line search.

1 Introduction
Global optimization plays a critical role in addressing complex real-life challenges across various
fields. In engineering, it is applied to structural design optimization, where minimizing weight or
material use while ensuring durability is essential for cost-effective and safe construction. In financial
services, portfolio optimization requires balancing risk and return by finding the global minimum
or maximum in investment strategies. In logistics and transportation, global optimization is crucial
for solving routing problems such as determining the shortest path or optimizing delivery routes,
which leads to significant cost savings and improved efficiency. Similarly, in energy systems, global
optimization is key to managing and distributing power more efficiently, reducing operational costs,
and optimizing renewable energy usage.

In machine learning, the need for global optimization is especially pronounced. The performance of models often depends on the ability to minimize complex, non-convex loss functions. While traditional methods like gradient descent are effective in many cases, they frequently encounter the problem of getting trapped in local minima, which can hinder the model's overall performance. This is particularly relevant in tasks that require complex models, where the optimization landscape is highly non-linear and fraught with local minima.

The primary contribution of this work is the introduction of a novel algorithm named Super Gradient Descent. Unlike classical gradient descent, which collects only local information, making it prone to local minima, the proposed method bases its update decisions on a global detection of the function's variation to ensure consistent progress towards the global minimum. We evaluate its performance on various one-dimensional functions, demonstrating that it provides superior convergence behavior, particularly in avoiding local minima and achieving the global optimum. This novel approach contributes to overcoming the challenges of non-convex optimization, offering a more reliable method for finding global solutions in machine learning.

2 State of the Art


Optimization algorithms are crucial in training machine learning models, as they guide the parameter updates in response to the loss function. Below, we discuss several prominent optimization algorithms and their mathematical formulations.

2.1 Gradient Descent (GD)


Gradient Descent (GD) is a fundamental optimization algorithm used to minimize a given objective function by iteratively updating parameters in the direction of steepest descent, which is indicated by the negative gradient of the function [1].
The update rule for Gradient Descent is:

x_{t+1} = x_t − η ∇f(x_t)

where:
• x_t represents the vector of parameters at iteration t,

• η is the learning rate, which controls the step size,

• ∇f(x_t) is the gradient of the objective function at x_t.


Pros: Intuitive algorithm and stable convergence for convex problems.
Cons: Can get stuck in local minima, particularly in non-convex problems.
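
To make the update rule concrete, here is a minimal Python sketch of the descent loop; the quadratic test function, starting point, and step size are illustrative choices, not taken from this paper:

def gradient_descent(grad, x0, eta=0.1, steps=100):
    # Iterate x_{t+1} = x_t - eta * grad f(x_t)
    x = x0
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

# Example: f(x) = (x - 3)^2, so grad f(x) = 2(x - 3); the minimum is at x = 3.
print(gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0))  # ~3.0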

2.2 AdaGrad
AdaGrad is an adaptive learning rate optimization algorithm that adjusts the learning rate for each parameter based on the historical gradients, and it handles sparse data well [2]. Parameters with large gradients receive smaller updates, while those with smaller gradients are updated more significantly.
The AdaGrad update rule is:

x_{t+1} = x_t − (η / √(G_t + ϵ)) ∇f(x_t)

where:

• G_t is the sum of the squares of the gradients up to time step t,

• ϵ is a small constant to avoid division by zero.

Pros: Automatically adjusts learning rates; works well with sparse data.
Cons: The learning rate decreases too aggressively over time, leading to premature convergence.
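
A minimal sketch of the AdaGrad update in the scalar case; eta and eps are assumed values for illustration:

import math

def adagrad(grad, x0, eta=0.5, eps=1e-8, steps=200):
    x, G = x0, 0.0
    for _ in range(steps):
        g = grad(x)
        G += g * g                          # G_t: accumulated squared gradients
        x -= eta / math.sqrt(G + eps) * g   # effective learning rate shrinks over time
    return x

print(adagrad(lambda x: 2.0 * (x - 3.0), x0=0.0))  # approaches the minimum at x = 3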

2.3 RMSprop
RMSprop was developed to address the diminishing learning rate issue in AdaGrad by introducing
an exponentially decaying average of squared gradients [3]. This allows RMSprop to adapt the
learning rate dynamically without the aggressive reduction that AdaGrad experiences.
The RMSprop update rule is:

x_{t+1} = x_t − (η / √(E[∇f(x_t)²] + ϵ)) ∇f(x_t)

where:

• E[∇f(x_t)²] is the exponentially weighted average of the squared gradients,

• ϵ is a small constant for numerical stability.

Pros: More robust to the diminishing learning rate issue.


Cons: Sensitive to hyperparameters.
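
A corresponding sketch of RMSprop; the decay rate rho = 0.9 is a conventional assumption, not a value fixed by this paper:

import math

def rmsprop(grad, x0, eta=0.05, rho=0.9, eps=1e-8, steps=300):
    x, avg = x0, 0.0
    for _ in range(steps):
        g = grad(x)
        avg = rho * avg + (1.0 - rho) * g * g   # exponential average of squared gradients
        x -= eta / math.sqrt(avg + eps) * g
    return x

print(rmsprop(lambda x: 2.0 * (x - 3.0), x0=0.0))  # approaches the minimum at x = 3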

2.4 Adam
Adam combines the advantages of both AdaGrad and RMSprop by maintaining an exponentially
decaying average of past gradients (first moment) and squared gradients (second moment) [4]. This
approach allows it to use adaptive learning rates, while also addressing the diminishing learning rate
issue.
The update rule for Adam is:

m_t = β₁ m_{t−1} + (1 − β₁) ∇f(x_t)

v_t = β₂ v_{t−1} + (1 − β₂) ∇f(x_t)²

m̂_t = m_t / (1 − β₁^t),   v̂_t = v_t / (1 − β₂^t)

x_{t+1} = x_t − η m̂_t / (√v̂_t + ϵ)
where:

• m_t and v_t are the first and second moment estimates,

• β₁ and β₂ are hyperparameters that control the decay rates of these moments.

Pros: Efficient and well-suited for a wide variety of problems; adaptive learning rates.
Cons: Lacks formal convergence guarantees in non-convex settings.
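
A sketch of the full Adam step with bias correction; beta_1 = 0.9 and beta_2 = 0.999 are the conventional defaults, assumed here rather than prescribed by the paper:

import math

def adam(grad, x0, eta=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=300):
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g           # first-moment estimate m_t
        v = b2 * v + (1 - b2) * g * g       # second-moment estimate v_t
        m_hat = m / (1 - b1 ** t)           # bias corrections
        v_hat = v / (1 - b2 ** t)
        x -= eta * m_hat / (math.sqrt(v_hat) + eps)
    return x

print(adam(lambda x: 2.0 * (x - 3.0), x0=0.0))  # approaches the minimum at x = 3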

2.5 AdamW
AdamW is a variant of the Adam optimizer that decouples weight decay from the gradient-based
parameter updates [5]. This leads to better generalization performance, especially in deep learning
models, by applying weight decay directly to the parameters as follows:

x_{t+1} = x_t − η m̂_t / (√v̂_t + ϵ) − η λ x_t

where:

• λ is the weight decay coefficient.

Pros: Better generalization than Adam; reduces overfitting by using weight decay.
Cons: Requires careful tuning of the weight decay parameter.
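
The decoupling amounts to one extra term in the Adam step. In this sketch the decay coefficient lam = 0.01 is an illustrative choice; the moment updates are the same as in the Adam sketch above:

import math

def adamw(grad, x0, eta=0.1, b1=0.9, b2=0.999, eps=1e-8, lam=0.01, steps=300):
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
        # weight decay applied directly to the parameter, not through the gradient
        x -= eta * m_hat / (math.sqrt(v_hat) + eps) + eta * lam * x
    return x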

2.6 Nesterov Accelerated Gradient (NAG)


Nesterov Accelerated Gradient (NAG) introduces a look-ahead step that computes the gradient not
at the current position, but at the point where the momentum term would take it [6]. This improves
the convergence rate by anticipating the trajectory of the updates as follows:

v_{t+1} = β v_t + η ∇f(x_t − β v_t)

x_{t+1} = x_t − v_{t+1}

where:

• v_t is the velocity (momentum term),

• β is the momentum coefficient.

Pros: Faster convergence than standard momentum-based methods in general.


Cons: Slightly more complex to implement and tune.
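
A sketch of the look-ahead step; beta = 0.9 is an assumed momentum coefficient:

def nag(grad, x0, eta=0.05, beta=0.9, steps=300):
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v + eta * grad(x - beta * v)  # gradient at the anticipated point
        x -= v
    return x

print(nag(lambda x: 2.0 * (x - 3.0), x0=0.0))  # approaches the minimum at x = 3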

2.7 Heuristic Methods


Heuristic optimization methods, such as Genetic Algorithms (GA) [7] and Particle Swarm Optimization (PSO) [8], are inspired by natural processes and can be useful for finding global optima. They
are particularly advantageous in complex landscapes but come with their own limitations:
Convergence Speed: Heuristic methods can be slow to converge and often require a large number
of evaluations to achieve satisfactory results.

Noisy Landscapes: They may struggle in noisy environments, where the fitness landscape can
mislead the optimization process.
Global Convergence: While they are designed to search the global space, they do not guarantee
convergence to the global minimum.

In summary, while state-of-the-art optimization algorithms provide powerful tools for training machine learning models, they suffer from limitations regarding local minima and global convergence guarantees. In the following section, we introduce our new algorithm, Super Gradient Descent (SuGD), and prove its convergence to the global minimum of any k-Lipschitz one-dimensional function defined on a domain [a, b].

3 Methodology
In this section, we present our methodology for implementing the Super Gradient Descent Algorithm, along with the mathematical proof of its guaranteed global convergence for any k-Lipschitz continuous function defined on the domain [a, b].

3.1 Super Gradient Descent Implementation


In the general case, it is hard to explicitly compute the gradient of an arbitrary function, and in many cases we do not even know the function explicitly. The Finite Difference Method (FDM) is widely used to approximate derivatives using discrete points on a function's domain: the forward difference formula f′(x) ≈ (f(x + h) − f(x)) / h provides a first-order accurate estimate of f′(x) by evaluating the difference between f(x + h) and f(x) over a small step size h. Similarly, the backward difference approximation f′(x) ≈ (f(x) − f(x − h)) / h offers first-order accuracy using previous points, while the central difference scheme f′(x) ≈ (f(x + h) − f(x − h)) / (2h) yields second-order accuracy by averaging the forward and backward differences [9].
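
These three schemes are one-liners in code; a small sketch (the step size h and test function are illustrative):

import math

def forward_diff(f, x, h=1e-6):  return (f(x + h) - f(x)) / h
def backward_diff(f, x, h=1e-6): return (f(x) - f(x - h)) / h
def central_diff(f, x, h=1e-6):  return (f(x + h) - f(x - h)) / (2 * h)

# f'(0) = cos(0) = 1 for f = sin; the central scheme is the most accurate.
print(forward_diff(math.sin, 0.0), central_diff(math.sin, 0.0))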
Generally speaking, approximating the derivative means evaluating the function at two close points and computing the ratio of the difference between the two outputs to the difference between the two evaluation points. This gives what we call the local gradient, which is used in optimization to point toward a local minimum by following the negative of its direction, with an optimization step chosen small enough to ensure convergence. What we aim to find here is a step αϵ > 0 such that ∀α ∈ [0, αϵ] the solution converges to the global minimum. Inspired by the local gradient concept, we introduce the global gradient of a function f defined on a domain D, along dimension i, as:

F_i(x, y) = (f(y) − f(x)) / (y − x)    (1)

It evaluates the function at any two points, not necessarily two neighboring points, in order to capture its global variation. The local gradient is then a particular case, with ∇f(x) = lim_{h→0} F(x + h, x).
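
Numerically, the global gradient of Eq. (1) and its local limit look as follows (the test function and evaluation points are arbitrary):

import math

def global_gradient(f, x, y):
    return (f(y) - f(x)) / (y - x)   # F(x, y), Eq. (1)

print(global_gradient(math.sin, 1.0, 2.0))         # variation over a distant pair
print(global_gradient(math.sin, 1.0, 1.0 + 1e-7))  # ~cos(1): the local gradient limit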

Based on the definition of the global gradient, we introduce our algorithm, which guarantees convergence to the global minimum of any k-Lipschitz one-dimensional function defined on an interval [a, b], and which we call Super Gradient Descent (SuGD):

Algorithm 1 Super Gradient Descent

Input: f, domain [a, b], tolerance η > 0, optimization step α > 0
Initialize: choose initial points x_0^{(1)} = a, x_0^{(2)} = b
while |x_n^{(2)} − x_n^{(1)}| (1 + |F(x_n^{(2)}, x_n^{(1)})|) > η do
    if f(x_n^{(2)}) − f(x_n^{(1)}) < 0 then
        Set x_{n+1}^{(1)} = x_n^{(1)} − α(x_n^{(1)} − x_n^{(2)})(1 − F(x_n^{(2)}, x_n^{(1)})),  x_{n+1}^{(2)} = x_n^{(2)}
    else
        Set x_{n+1}^{(2)} = x_n^{(2)} − α(x_n^{(2)} − x_n^{(1)})(1 + F(x_n^{(2)}, x_n^{(1)})),  x_{n+1}^{(1)} = x_n^{(1)}
    end if
end while
Return: approximate minimum at x_n = x_n^{(1)}

The algorithm leverages global derivative information to ensure fast, well-guided convergence, maintaining a global view of the interval [a, b] while controlling the spacing between the two points so that its local behavior resembles that of local algorithms. Next, we provide the theoretical basis behind this algorithm.
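
For concreteness, here is a direct Python transcription of Algorithm 1. It is a sketch: alpha and the tolerance eta are illustrative values, and the convergence proof below requires alpha to be below a threshold that depends on k and the desired accuracy.

def sugd(f, a, b, alpha=1e-3, eta=1e-6, max_iter=500_000):
    x1, x2 = a, b                       # x^(1) starts at a, x^(2) at b
    for _ in range(max_iter):
        # Stopping test: |x2 - x1| * (1 + |F(x2, x1)|) = |x2 - x1| + |f(x2) - f(x1)|
        if abs(x2 - x1) + abs(f(x2) - f(x1)) <= eta:
            break
        Fg = (f(x1) - f(x2)) / (x1 - x2)   # global gradient F(x2, x1), Eq. (1)
        if f(x2) - f(x1) < 0:              # lower value on the right: move x1 right
            x1 = x1 - alpha * (x1 - x2) * (1.0 - Fg)
        else:                              # lower value on the left: move x2 left
            x2 = x2 - alpha * (x2 - x1) * (1.0 + Fg)
    return x1

import math
print(sugd(lambda x: x * math.sin(x), 0.0, 20.0))  # approximate global minimizer of x*sin(x) on [0, 20]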

3.2 Theoretical Convergence Guarantee


Before introducing the proof of the algorithm’s global convergence, we start by recalling that a
function f is k-Lipschitz if:

|f (x) − f (y)| ≤ k|x − y| ∀x, y ∈ [a, b].

This means that ∀x, y ∈ [a, b], |F(x, y)| = |f(x) − f(y)| / |x − y| ≤ k. In other words, Lipschitz continuity ensures that our global gradient is always finite, which justifies its usefulness in the SuGD algorithm.
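
As a quick numerical illustration of this bound, one can sample the global gradient over random pairs to get a crude lower estimate of k (purely illustrative, not a certified constant):

import math, random

def estimate_lipschitz(f, a, b, samples=100_000):
    k = 0.0
    for _ in range(samples):
        x, y = random.uniform(a, b), random.uniform(a, b)
        if x != y:
            k = max(k, abs(f(y) - f(x)) / abs(y - x))  # |F(x, y)| <= k
    return k

print(estimate_lipschitz(lambda x: x * math.sin(x), 0.0, 20.0))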

Theorem 1. Let f be a one-dimensional k-Lipschitz function defined on a bounded domain [a, b] that admits one global minimum at x∗ ∈ [a, b]. Then there exists an optimization step αϵ > 0 such that for every α ∈ [0, αϵ], the Super Gradient Descent Algorithm 1 converges to this global minimum. More precisely, ∀ϵ > 0, ∃αϵ > 0 such that ∀α ∈ [0, αϵ], |f(x_n) − f(x∗)| < ϵ.

Proof. To ensure that x_n^{(1)} and x_n^{(2)} converge to x∗, the proof of this theorem consists of two main parts:

Part 1:

Proving by induction that ∃αϵ > 0 such that ∀n ∈ ℕ, ∀α ∈ [0, αϵ]:

x_n^{(1)} < x∗  and  x_n^{(2)} > x∗    (2)

where ϵ is the approximation threshold.

For n = 0: x_0^{(1)} = a < x∗  and  x_0^{(2)} = b > x∗    (3)

Assume the inequalities hold up to order n. If a sequence is stationary, the induction step is trivial for it; if not, then

x_{n+1}^{(1)} = x_n^{(1)} − α(x_n^{(1)} − x_n^{(2)})(1 − F(x_n^{(2)}, x_n^{(1)})),  when f(x_n^{(2)}) − f(x_n^{(1)}) < 0
x_{n+1}^{(2)} = x_n^{(2)} − α(x_n^{(2)} − x_n^{(1)})(1 + F(x_n^{(2)}, x_n^{(1)})),  when f(x_n^{(2)}) − f(x_n^{(1)}) ≥ 0    (4)

Let us choose αϵ = ϵ / ((b − a)(1 + k)k) and 0 ≤ α ≤ αϵ.

We have |(x_n^{(1)} − x_n^{(2)})(1 − F(x_n^{(2)}, x_n^{(1)}))| = |x_n^{(1)} − x_n^{(2)} + f(x_n^{(2)}) − f(x_n^{(1)})|.

And |x_n^{(1)} − x_n^{(2)} + f(x_n^{(2)}) − f(x_n^{(1)})| ≤ |x_n^{(1)} − x_n^{(2)}| + |f(x_n^{(1)}) − f(x_n^{(2)})|.

From the k-Lipschitz condition, |x_n^{(1)} − x_n^{(2)}| + |f(x_n^{(1)}) − f(x_n^{(2)})| ≤ (b − a)(1 + k).

Therefore |(x_n^{(1)} − x_n^{(2)})(1 − F(x_n^{(2)}, x_n^{(1)}))| ≤ (b − a)(1 + k),

which means α|(x_n^{(1)} − x_n^{(2)})(1 − F(x_n^{(2)}, x_n^{(1)}))| ≤ ϵ/k.

Since f(x_n^{(2)}) − f(x_n^{(1)}) < 0 and x_n^{(1)} − x_n^{(2)} < 0 by (2), we get x_{n+1}^{(1)} = x_n^{(1)} − α(x_n^{(1)} − x_n^{(2)})(1 − F(x_n^{(2)}, x_n^{(1)})) ≤ x_n^{(1)} + ϵ/k.

Since our target is f(x_n^{(1)}) − f(x∗) ≤ ϵ, suppose that f(x_n^{(1)}) − f(x∗) > ϵ. Based on the Lipschitz condition and (2), we have f(x_n^{(1)}) − f(x∗) ≤ k(x∗ − x_n^{(1)}), hence ϵ < k(x∗ − x_n^{(1)}), i.e. x_n^{(1)} + ϵ/k < x∗.

So x_{n+1}^{(1)} ≤ x_n^{(1)} + ϵ/k < x∗.

This completes the induction step; the statement for the second sequence is proved in the same manner.

Part 2:

We proved in the first part that x_n^{(1)} is bounded above and x_n^{(2)} is bounded below; it remains to prove that x_n^{(1)} and x_n^{(2)} converge to the same limit, so that they are adjacent sequences, and from Part 1 we then deduce that this common limit is x∗.

Returning to the definition of the sequences: by construction, at each iteration one of them changes its value, and if the other is stationary it is trivially convergent, so we focus on the non-stationary case:

x_{n+1}^{(1)} = x_n^{(1)} − α(x_n^{(1)} − x_n^{(2)})(1 − F(x_n^{(2)}, x_n^{(1)})),  when f(x_n^{(2)}) − f(x_n^{(1)}) < 0
x_{n+1}^{(2)} = x_n^{(2)} − α(x_n^{(2)} − x_n^{(1)})(1 + F(x_n^{(2)}, x_n^{(1)})),  when f(x_n^{(2)}) − f(x_n^{(1)}) ≥ 0    (5)

We know that x_n^{(1)} − x_n^{(2)} < 0. On one hand, for f(x_n^{(2)}) − f(x_n^{(1)}) < 0, we have α(x_n^{(1)} − x_n^{(2)})(1 − F(x_n^{(2)}, x_n^{(1)})) < 0; on the other hand, for f(x_n^{(2)}) − f(x_n^{(1)}) ≥ 0, we have α(x_n^{(2)} − x_n^{(1)})(1 + F(x_n^{(2)}, x_n^{(1)})) > 0. This means that x_n^{(1)} is an increasing sequence bounded above and x_n^{(2)} is a decreasing sequence bounded below; therefore they converge to limits l1 and l2 respectively.

Based on the expression of x_{n+1}^{(1)}, we get lim_{n→∞} x_{n+1}^{(1)} = lim_{n→∞} x_n^{(1)} − α(x_n^{(1)} − x_n^{(2)})(1 − F(x_n^{(2)}, x_n^{(1)})) = l1 − α(l1 − l2)(1 − F(l2, l1)) = l1.

Considering that (1 − F(x_n^{(2)}, x_n^{(1)})) > 0 when f(x_n^{(2)}) − f(x_n^{(1)}) < 0, and α > 0, it follows that l2 = l1. We prove the same statement similarly in the second case.

From Part 1 we have l1 ≤ x∗ ≤ l2, and from Part 2 we have l1 = l2; we deduce that l1 = l2 = x∗, which means that x_n converges to x∗ and proves the validity of the theorem.

4 Results and Discussion


In this section, we present our results from testing the Super Gradient Descent algorithm on various
synthetic one-dimensional functions defined on a bounded interval and compare its performance to
the classical optimization methods. The key metric considered is robustness in avoiding local minima and converging to the global minimum.

4.1 Gradient Descent vs Super Gradient Descent


We first try to find the global minimum of the function f(x) = x sin(x), which has several local minima but is fairly regular.
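
Using the sketches above, this comparison can be reproduced in spirit; the starting point and hyperparameters are assumed, and gradient_descent and sugd refer to the sketches from Sections 2.1 and 3.1:

import math

f = lambda x: x * math.sin(x)
df = lambda x: math.sin(x) + x * math.cos(x)   # analytic gradient for GD

x_gd = gradient_descent(df, x0=4.0, eta=0.01, steps=5000)  # settles in a nearby local basin
x_su = sugd(f, 0.0, 20.0)                                  # searches all of [0, 20]
print(x_gd, f(x_gd))
print(x_su, f(x_su))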

Figure 1: Non-convex multi-minima test

The results of our experiment are illustrated in Figure 1. As observed, the traditional gradient descent algorithm converges towards a local minimum, since the function admits several minima. Its convergence path is characterized by oscillations, especially in the final iterations, indicating a struggle to escape local minima.

In contrast, our proposed super gradient descent algorithm demonstrates a more stable and directed approach towards the global minimum. The algorithm effectively navigates the oscillations and maintains a consistent trajectory towards the global minimum. This enhanced performance is attributed to its global information-based update rules, which allow for more nuanced adjustments in the descent path.
 
Next, we apply both algorithms to the more complex function f(x) = 2x sin(x³) − x cos(x³/12), which presents not only the challenge of a huge number of local minima but also hard differentiability.

Figure 2: Hardly differentiable function test

Here, classical gradient descent not only struggles with local minima, failing to reach the global minimum efficiently, but its trajectory also appears erratic and unstable due to sudden local changes in the objective function value, further confirming its sensitivity.

In contrast, the super gradient descent algorithm exhibits a more robust behavior, with a clear
and smooth convergence towards the global minimum, regardless of the function’s complexity. This
can be observed in the final convergence plots where the loss reduction is significantly more consistent
compared to the classical method.

4.2 Benchmarking against existing optimization algorithms


In this section, we compare the performance of our proposed Super Gradient Descent (SuGD) algorithm with the state-of-the-art optimization algorithms described in Section 2: in addition to Gradient Descent (GD), we tested AdaGrad, RMSprop, AdamW, NAG, and Adam. The comparison focuses on the ability to escape local minima, the rate of convergence, and the consistency of reaching the global minimum.

Figure 3: Non-convex multi-minima test

In the first test function, f(x) = x sin(x), AdaGrad, which adapts its learning rate based on previous gradients [2], performs well at first but suffers from a decreasing learning rate over time, which causes premature convergence. RMSprop shows a more stable convergence but still fails to reach the global minimum in this case. The moment/gradient adaptive learning rates in AdamW and Adam accelerate the convergence but are still not enough to escape the local minimum. The anticipatory behaviour of NAG allowed it to climb briefly in the ascending direction, but it quickly fell back to the local minimum.
In contrast, our proposed Super Gradient Descent effectively bypasses local minima and demonstrates a consistent path towards the global minimum, and does so within a reasonable time compared to the other algorithms.

Figure 4: Hardly differentiable non-convex function test
 
For the more complex function f(x) = 2x sin(x³) − x cos(x³/12), the limitations of classical optimization algorithms become more apparent. NAG, while managing to escape local minima and reduce the loss, experienced an exploding-gradient phenomenon and diverged, because the function is much more complex and hardly differentiable. This highlights the stochastic behaviour of these algorithms in escaping local minima and, therefore, their lack of a global convergence guarantee. In comparison, Super Gradient Descent displays much smoother convergence and consistently reaches the global minimum, avoiding the pitfalls of local minima encountered by the other algorithms. The global information-based updates in our method allow it to adapt more effectively to the varying gradients and oscillations present in the function, guaranteeing global convergence.

Figure 5: Non-convex multi-regularity function test

In Figure 5, the experiment tests Super Gradient Descent on a multi-regularity non-convex function defined as f(x) = e^{−0.004(x−35)²} sin(0.3x) + e^{−0.2(x−25)²} sin(5x). This function has two regular regions separated by a hardly differentiable region, and the challenge here is not only escaping local minima but also being robust to the sudden change in regularity in order to reach the global minimum region.

Overall, the results indicate that the Super Gradient Descent algorithm not only converges to the global minimum effectively but also does so with greater stability and robustness than the traditional gradient descent method. This has significant implications for optimization problems in various fields, where the risk of being trapped in local minima can severely impact the performance of learning algorithms.

The comparative analysis presented here establishes the effectiveness of our proposed approach. Future work will involve testing on higher-dimensional functions and integrating additional optimization strategies to further enhance the robustness of our algorithm.

5 Conclusion
In this article, we presented Super Gradient Descent, a novel algorithm capable of converging to the global minimum of any k-Lipschitz one-dimensional function defined on a domain [a, b]. We started with an overview of state-of-the-art optimization algorithms, including Gradient Descent, AdaGrad, RMSprop, Adam, AdamW, and Nesterov Accelerated Gradient (NAG). Then, we introduced our methodology and the mathematical formulation of the algorithm, followed by the proof of its guaranteed global convergence, showing how SuGD leverages global information about the function; in particular, the concept of the global gradient plays a pivotal role in pointing to the global minimum. Finally, we compared the performance of our algorithm with the existing optimization algorithms, demonstrating its stable convergence and robustness in reaching the global minimum where the others struggle to escape local minima and maintain a consistent convergence path.
Overall, this novel algorithm succeeds in resolving one of the most challenging problems in optimization. In particular, the guaranteed one-dimensional global convergence directly impacts higher-dimensional optimization through the line search method on one hand, and on the other opens up perspectives for future extensions of this algorithm to higher-dimensional spaces.

References
[1] Marcin Andrychowicz, Misha Denil, Sergio Gómez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando de Freitas. Learning to Learn by Gradient Descent by Gradient Descent. arXiv preprint arXiv:1606.04474, 2016.

[2] John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning
and Stochastic Optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

[3] T. Tieleman and G. Hinton. Lecture 6.5—RMSProp: Divide the Gradient by a Running Average
of Its Recent Magnitude. Lecture Notes, 2012.

[4] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv
preprint arXiv:1412.6980, 2014.

[5] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2017. URL: https://openreview.net/forum?id=Bkg6RiRAb7.

[6] T. Dozat. Incorporating Nesterov Momentum into Adam. OpenReview.net, 2016. URL:
https://openreview.net/forum?id=H1Z2w9lgl.

[7] R. Chelouah and P. Siarry. A Continuous Genetic Algorithm Designed for the Global Optimization of Multimodal Functions. Journal of Heuristics, 2000.

[8] F. Marini and B. Walczak. Particle Swarm Optimization (PSO): A Tutorial. Chemometrics and Intelligent Laboratory Systems, 2015.

[9] M. L. Shahab, H. Susanto, and H. Hatzikirou. A Finite Difference Method with Symmetry Properties for the High-Dimensional Bratu Equation. arXiv preprint arXiv:2410.12553, 2024.
