Super Gradient Descent: Global Optimization Requires Global Gradient
Seifeddine Achour
Abstract
1 Introduction
Global optimization plays a critical role in addressing complex real-life challenges across various
fields. In engineering, it is applied to structural design optimization, where minimizing weight or
material use while ensuring durability is essential for cost-effective and safe construction. In financial
services, portfolio optimization requires balancing risk and return by finding the global minimum
or maximum in investment strategies. In logistics and transportation, global optimization is crucial
for solving routing problems such as determining the shortest path or optimizing delivery routes, which leads to significant cost savings and improved efficiency. Similarly, in energy systems, global
optimization is key to managing and distributing power more efficiently, reducing operational costs,
and optimizing renewable energy usage.
In machine learning, the need for global optimization is especially pronounced. The performance
of models often depends on the ability to minimize complex, non-convex loss functions. While traditional methods like gradient descent are effective in many cases, they frequently encounter the
problem of getting trapped in local minima, which can hinder the model’s overall performance. This
is particularly relevant in tasks that require complex models where the optimization landscape is
highly non-linear and fraught with local minima.
The primary contribution of this work is the introduction of a novel algorithm named Super Gradient Descent (SuGD). Unlike classical gradient descent, which relies only on local information and is therefore prone to getting stuck in local minima, the proposed method bases each update decision on a global measure of the function's variation to ensure consistent progress towards the global minimum. We evaluate its performance on various one-dimensional functions, demonstrating that it provides superior convergence behavior, particularly in avoiding local minima and achieving the global optimum. This novel approach contributes to overcoming the challenges of non-convex optimization, offering a more reliable method for finding global solutions in machine learning.
2 State-of-the-Art Optimization Algorithms

2.1 Gradient Descent

Gradient descent is the classical first-order optimization method: it updates the parameters in the direction opposite to the gradient of the objective function:

x_{t+1} = x_t − η ∇f(x_t)

where:
• x_t represents the vector of parameters at iteration t,
• η is the learning rate,
• ∇f(x_t) is the gradient of the objective function evaluated at x_t.
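For concreteness, a minimal NumPy sketch of this update (the toy quadratic objective and the learning rate below are illustrative choices, not settings from this paper):

```python
import numpy as np

def gradient_descent_step(x, grad_f, lr=0.1):
    """One classical gradient descent update: x_{t+1} = x_t - lr * grad f(x_t)."""
    return x - lr * grad_f(x)

# Toy quadratic objective f(x) = ||x||^2, whose gradient is 2x (illustrative only).
grad_f = lambda x: 2.0 * x
x = np.array([3.0, -2.0])
for _ in range(50):
    x = gradient_descent_step(x, grad_f)
print(x)  # approaches the minimizer [0, 0]
```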
2.2 AdaGrad
AdaGrad is an adaptive learning rate optimization algorithm that adjusts the learning rate for each parameter based on the historical gradients, and it handles sparse data well. Parameters with large accumulated gradients receive smaller updates, while those with smaller gradients are updated more significantly [2].
The AdaGrad update rule is:

x_{t+1} = x_t − (η / √(G_t + ϵ)) ∇f(x_t)

where:
• G_t is the sum of the squares of the past gradients up to iteration t,
• η is the learning rate,
• ϵ is a small constant that prevents division by zero.
Pros: Automatically adjusts learning rates; works well with sparse data.
Cons: The learning rate decreases too aggressively over time, leading to premature convergence.
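A minimal sketch of the accumulator behind this rule (the hyperparameter values and the toy objective are illustrative):

```python
import numpy as np

def adagrad(grad_f, x0, lr=0.5, eps=1e-8, steps=200):
    """AdaGrad: scale each coordinate's step by the root of its accumulated squared gradients."""
    x = np.asarray(x0, dtype=float)
    G = np.zeros_like(x)                   # running sum of squared gradients
    for _ in range(steps):
        g = grad_f(x)
        G += g * g                         # G_t only grows, so the effective step shrinks
        x = x - lr * g / np.sqrt(G + eps)
    return x

print(adagrad(lambda x: 2.0 * x, [3.0, -2.0]))   # approaches [0, 0]
```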
2.3 RMSprop
RMSprop was developed to address the diminishing learning rate issue in AdaGrad by introducing
an exponentially decaying average of squared gradients [3]. This allows RMSprop to adapt the
learning rate dynamically without the aggressive reduction that AdaGrad experiences.
The RMSprop update rule is:

x_{t+1} = x_t − (η / √(E[∇f(x_t)²] + ϵ)) ∇f(x_t)

where:
• E[∇f(x_t)²] is the exponentially decaying average of the squared gradients,
• η is the learning rate,
• ϵ is a small constant for numerical stability.
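A minimal sketch highlighting the difference from AdaGrad, namely the exponentially decaying average in place of a growing sum (values are illustrative):

```python
import numpy as np

def rmsprop(grad_f, x0, lr=0.05, beta=0.9, eps=1e-8, steps=300):
    """RMSprop: replace AdaGrad's growing sum with an exponentially decaying average."""
    x = np.asarray(x0, dtype=float)
    avg_sq = np.zeros_like(x)
    for _ in range(steps):
        g = grad_f(x)
        avg_sq = beta * avg_sq + (1.0 - beta) * g * g    # E[grad^2]_t
        x = x - lr * g / np.sqrt(avg_sq + eps)
    return x

print(rmsprop(lambda x: 2.0 * x, [3.0, -2.0]))   # moves toward [0, 0]
```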
2.4 Adam
Adam combines the advantages of both AdaGrad and RMSprop by maintaining an exponentially
decaying average of past gradients (first moment) and squared gradients (second moment) [4]. This
approach allows it to use adaptive learning rates, while also addressing the diminishing learning rate
issue.
The update rule for Adam is:

m_t = β1 m_{t−1} + (1 − β1) ∇f(x_t)
v_t = β2 v_{t−1} + (1 − β2) (∇f(x_t))²
m̂_t = m_t / (1 − β1^t),  v̂_t = v_t / (1 − β2^t)
x_{t+1} = x_t − η m̂_t / (√(v̂_t) + ϵ)

where:
• m_t and v_t are the first and second moment estimates of the gradient,
• m̂_t and v̂_t are their bias-corrected versions,
• β1 and β2 are hyperparameters that control the decay rates of these moments.
Pros: Efficient and well-suited for a wide variety of problems; adaptive learning rates.
Cons: Lacks formal convergence guarantees in non-convex settings.
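A minimal sketch of the two moment estimates and the bias correction (hyperparameters are common defaults, shown for illustration only):

```python
import numpy as np

def adam(grad_f, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=300):
    """Adam: bias-corrected first and second moment estimates of the gradient."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)   # first moment (mean of gradients)
    v = np.zeros_like(x)   # second moment (mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad_f(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)       # bias correction
        v_hat = v / (1 - beta2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

print(adam(lambda x: 2.0 * x, [3.0, -2.0]))   # moves toward [0, 0]
```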
2.5 AdamW
AdamW is a variant of the Adam optimizer that decouples weight decay from the gradient-based
parameter updates [5]. This leads to better generalization performance, especially in deep learning
models, by applying weight decay directly to the parameters as follows:
x_{t+1} = x_t − η m̂_t / (√(v̂_t) + ϵ) − η λ x_t

where:
• m̂_t and v̂_t are the bias-corrected moment estimates defined as in Adam,
• λ is the weight decay coefficient.
Pros: Better generalization than Adam; reduces overfitting by using weight decay.
Cons: Requires careful tuning of the weight decay parameter.
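A minimal sketch of the decoupled update, where the weight decay term is applied directly to the parameters rather than folded into the gradient (the weight decay value and toy objective are illustrative):

```python
import numpy as np

def adamw_step(x, g, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """One AdamW step: the usual Adam update plus weight decay applied directly to x."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * x   # decoupled decay term
    return x, m, v

x = np.array([3.0, -2.0])
m, v = np.zeros_like(x), np.zeros_like(x)
for t in range(1, 301):
    x, m, v = adamw_step(x, 2.0 * x, m, v, t)   # gradient of the toy objective ||x||^2
print(x)
```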
2.6 Nesterov Accelerated Gradient (NAG)

NAG improves upon classical momentum by evaluating the gradient at a look-ahead position, which gives the optimizer an anticipatory behaviour [6]:

v_{t+1} = γ v_t + η ∇f(x_t − γ v_t)
x_{t+1} = x_t − v_{t+1}

where:
• v_t is the velocity (momentum) term,
• γ is the momentum coefficient,
• η is the learning rate.
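A minimal sketch of this look-ahead update (the momentum coefficient, learning rate, and toy objective are illustrative):

```python
import numpy as np

def nag(grad_f, x0, lr=0.1, gamma=0.9, steps=200):
    """Nesterov accelerated gradient: evaluate the gradient at the look-ahead point x - gamma*v."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        v = gamma * v + lr * grad_f(x - gamma * v)   # velocity update at the anticipated position
        x = x - v
    return x

print(nag(lambda x: 2.0 * x, [3.0, -2.0]))   # moves toward [0, 0]
```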
Beyond gradient-based optimizers, population-based metaheuristics such as genetic algorithms [7] and particle swarm optimization [8] are designed to search the space globally, but they come with their own limitations.
Noisy Landscapes: They may struggle in noisy environments, where the fitness landscape can mislead the optimization process.
Global Convergence: While they are designed to search the global space, they do not guarantee
convergence to the global minimum.
In summary, while state-of-the-art optimization algorithms provide powerful tools for training machine learning models, they suffer from limitations regarding local minima and global convergence
guarantees. In the following section, we will introduce our new algorithm, Super Gradient Descent (SuGD), and prove its convergence to the global minimum of any k-Lipschitz one-dimensional function defined on a domain [a, b].
3 Methodology
In this section, we present our methodology for implementing the Super Gradient Descent Algorithm along with the mathematical proof of its guaranteed global convergence for any k-Lipschitz
continuous function defined on the domain [a, b].
We define the global gradient of f between any two points x, y ∈ [a, b], x ≠ y, as F(x, y) = (f(x) − f(y)) / (x − y). It evaluates the function at any two points, instead of necessarily two neighboring points, in order to capture its global variation. The local gradient then becomes a particular case, since ∇f(x) = lim_{h→0} F(x + h, x).
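A minimal sketch of the global gradient and a numerical check of the limit above (the test function here is illustrative):

```python
import math

def global_gradient(f, x, y):
    """Global gradient F(x, y) = (f(x) - f(y)) / (x - y), defined for x != y."""
    return (f(x) - f(y)) / (x - y)

f = lambda x: x * math.sin(x)        # illustrative test function
x = 2.0
for h in (1e-1, 1e-3, 1e-6):
    print(h, global_gradient(f, x + h, x))   # tends to the local derivative as h -> 0
print(math.sin(x) + x * math.cos(x))         # f'(2) for comparison
```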
Based on the definition of the global gradient, we introduce the Super Gradient Descent (SuGD) algorithm (Algorithm 1), which guarantees convergence to the global minimum of any k-Lipschitz one-dimensional function defined on an interval [a, b].
The algorithm leverages global derivative information to ensure fast and well-guided convergence, maintaining a global view of the interval [a, b] while controlling the distance between the two points so that its local behavior remains similar to that of local algorithms. Next, we provide the theoretical basis behind this algorithm.
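For illustration, a minimal one-dimensional sketch of this scheme, assuming the two-sequence update rule (4) established in the proof below and initialization at the interval endpoints a and b; the stopping criterion, step size, and test interval are illustrative choices and may differ from Algorithm 1 as specified by the author:

```python
import math

def global_gradient(f, x, y):
    """Global gradient F(x, y) = (f(x) - f(y)) / (x - y)."""
    return (f(x) - f(y)) / (x - y)

def super_gradient_descent(f, a, b, alpha=1e-3, tol=1e-6, max_iter=200_000):
    """Sketch of SuGD: two sequences x1 (from a) and x2 (from b) move toward each other,
    driven by the global gradient F(x2, x1), following update rule (4)."""
    x1, x2 = float(a), float(b)
    for _ in range(max_iter):
        if abs(x2 - x1) < tol:
            break
        f1, f2 = f(x1), f(x2)
        F = global_gradient(f, x2, x1)
        if f2 - f1 < 0:                      # right point is lower: move x1 toward x2
            x1 = x1 - alpha * (x1 - x2) * (1 - F)
        else:                                # left point is lower (or equal): move x2 toward x1
            x2 = x2 - alpha * (x2 - x1) * (1 + F)
    return x1 if f(x1) <= f(x2) else x2

# Illustrative run on f(x) = x sin(x) over [0, 6], which has a local and a global minimum.
f = lambda x: x * math.sin(x)
x_star = super_gradient_descent(f, 0.0, 6.0)
print(x_star, f(x_star))    # close to the global minimum near x ≈ 4.91, f ≈ -4.81
```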
Recall that f is k-Lipschitz on [a, b] if ∀x, y ∈ [a, b], |f(x) − f(y)| ≤ k |x − y|. This means that ∀x, y ∈ [a, b] with x ≠ y, |F(x, y)| = |f(x) − f(y)| / |x − y| ≤ k. In other words, Lipschitz continuity ensures that our global gradient is always finite, which justifies its use in the SuGD algorithm.
Theorem 1. Let f be a one-dimensional, k-Lipschitz function defined on a bounded domain [a, b] that admits one global minimum at x^* ∈ [a, b]. Then there exists an optimization step α_ϵ > 0 such that for every α ∈ (0, α_ϵ], the Super Gradient Descent Algorithm 1 converges to this global minimum. More precisely, ∀ϵ > 0, ∃α_ϵ > 0 such that ∀α ∈ (0, α_ϵ], |f(x_n) − f(x^*)| < ϵ for n large enough.
Proof. To ensure that the two sequences x^{(1)}_n and x^{(2)}_n produced by Algorithm 1 converge to x^*, the proof of this theorem consists of two main parts:
Part 1:

We prove by recurrence that ∃α_ϵ > 0 such that ∀n ∈ N and ∀α ∈ (0, α_ϵ]:

x^{(1)}_n < x^*  and  x^{(2)}_n > x^*.   (2)

For n = 0:

x^{(1)}_0 = a < x^*  and  x^{(2)}_0 = b > x^*.   (3)

Assume that the inequalities (2) hold up to order n. If a sequence is stationary at step n, the recurrence is trivial for it. If not, then

x^{(1)}_{n+1} = x^{(1)}_n − α (x^{(1)}_n − x^{(2)}_n)(1 − F(x^{(2)}_n, x^{(1)}_n)),  when f(x^{(2)}_n) − f(x^{(1)}_n) < 0,
x^{(2)}_{n+1} = x^{(2)}_n − α (x^{(2)}_n − x^{(1)}_n)(1 + F(x^{(2)}_n, x^{(1)}_n)),  when f(x^{(2)}_n) − f(x^{(1)}_n) ≥ 0.   (4)

Let us choose α_ϵ = ϵ / ((b − a)(1 + k)k) and take 0 < α ≤ α_ϵ.

Consider the first case. Since f(x^{(2)}_n) − f(x^{(1)}_n) < 0 and x^{(1)}_n − x^{(2)}_n < 0 by (2), and since |F| ≤ k and |x^{(2)}_n − x^{(1)}_n| ≤ b − a, we get

x^{(1)}_{n+1} = x^{(1)}_n − α (x^{(1)}_n − x^{(2)}_n)(1 − F(x^{(2)}_n, x^{(1)}_n)) ≤ x^{(1)}_n + ϵ/k.

Since our target is f(x^{(1)}_n) − f(x^*) ≤ ϵ, suppose that f(x^{(1)}_n) − f(x^*) > ϵ. Based on the Lipschitz condition and (2), we deduce that f(x^{(1)}_n) − f(x^*) ≤ k (x^* − x^{(1)}_n), hence x^* − x^{(1)}_n > ϵ/k.

So x^{(1)}_{n+1} ≤ x^{(1)}_n + ϵ/k < x^*.

This establishes the recurrence for the first sequence; the statement for the second sequence is proved in the same manner.
Part 2:
We proved in the first part that x^{(1)}_n is bounded above and x^{(2)}_n is bounded below. We still need to prove that x^{(1)}_n and x^{(2)}_n converge toward the same limit, which makes them adjacent sequences; combined with Part 1, we deduce that this common limit is x^*.
We return to the definition of the sequences. By construction, at each iteration one of them changes its value; if the other is stationary, it is trivially convergent, so we focus on the non-stationary case.
4 Experimental Results

Figure 1: Non-convex multi-minima test

The results of our experiments are illustrated in Figure 1. As observed, the traditional gradient descent algorithm shows signs of converging towards a local minimum, since the function admits several minima. Its convergence path is characterized by oscillations, especially in the final iterations, indicating a struggle to escape local minima.
In contrast, our proposed super gradient descent algorithm demonstrates a more stable and directed approach towards the global minimum. The algorithm effectively navigates the oscillations
and maintains a consistent trajectory towards the global minimum. This enhanced performance is
attributed to its global information-based update rules, which allow for more nuanced adjustments
in the descent path.
Next, we apply both algorithms to the more complex function f(x) = 2x sin(x³) − x cos(x³/12), which presents not only a huge number of local minima but also points where differentiation is difficult.
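For reference, a minimal sketch of this test function together with a coarse grid scan over an illustrative interval, just to show how rugged the landscape is (the interval is not taken from the paper):

```python
import math

def f2(x):
    """Second test function: f(x) = 2x sin(x^3) - x cos(x^3 / 12)."""
    return 2 * x * math.sin(x ** 3) - x * math.cos(x ** 3 / 12)

# Coarse grid scan over an illustrative interval; a crude reference, not one of the compared optimizers.
xs = [i / 1000 for i in range(-3000, 3001)]
x_best = min(xs, key=f2)
print(x_best, f2(x_best))
```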
Figure 2: Hardly differentiable function test
Here, classical gradient descent once again struggles with local minima and fails to reach the global minimum efficiently; moreover, its trajectory appears erratic and unstable due to sudden local changes in the objective function value, further confirming its sensitivity.
In contrast, the super gradient descent algorithm exhibits a more robust behavior, with a clear
and smooth convergence towards the global minimum, regardless of the function’s complexity. This
can be observed in the final convergence plots where the loss reduction is significantly more consistent
compared to the classical method.
The comparison focuses on the ability to escape local minima, the rate of convergence, and the consistency of reaching the global minimum.
In the first test function, f(x) = x sin(x), AdaGrad, which adapts its learning rate based on previous gradients [2], performs well at first but suffers from a decreasing learning rate over time, which causes premature convergence. RMSprop shows a more stable convergence but still fails to reach the global minimum in this case. The moment- and gradient-adaptive learning rates of AdamW and Adam accelerate the convergence, but not enough to escape the local minimum. The anticipatory behaviour of NAG allowed it to climb in the ascending direction for a while, but it quickly fell back to the local minimum.

In contrast, our proposed Super Gradient Descent effectively bypasses local minima and follows a consistent path towards the global minimum, and does so within a reasonable time compared to the previous algorithms.
Figure 4: Hardly differentiable non-convex function test
For the more complex function f(x) = 2x sin(x³) − x cos(x³/12), the limitations of classical optimization algorithms become more apparent. NAG, while managing to escape local minima and reduce the loss, suffered from exploding gradients and diverged because the function is much more complex and hard to differentiate. This highlights the stochastic nature of these algorithms' escapes from local minima and, therefore, the absence of any guarantee of global convergence. In
comparison, Super Gradient Descent displays much smoother convergence and consistently reaches
the global minimum, avoiding the pitfalls of local minima encountered by other algorithms. The
global information-based updates in our method allow it to adapt more effectively to the varying
gradients and oscillations present in the function, guaranteeing global convergence.
Figure 5: Non-convex multi-regularity function test
Overall, the results indicate that the super gradient descent algorithm not only converges to the
global minimum effectively but also does so with greater stability and robustness than the traditional gradient descent method. This has significant implications for optimization problems in
various fields, where the risk of being trapped in local minima can severely impact the performance
of learning algorithms.
The comparative analysis presented here establishes the effectiveness of our proposed approach.
Future work will involve testing on higher-dimensional functions and integrating additional optimization strategies to further enhance the robustness of our algorithm.
5 Conclusion
In this article, we presented Super Gradient Descent (SuGD), a novel algorithm capable of converging to the global minimum of any k-Lipschitz one-dimensional function defined on a domain [a, b]. We started with an overview of state-of-the-art optimization algorithms, including Gradient Descent, AdaGrad, RMSprop, Adam, AdamW, and Nesterov Accelerated Gradient (NAG). Then, we introduced our methodology and the mathematical implementation of the algorithm, followed by the proof of its guaranteed global convergence, showing how SuGD leverages global information about the function, more precisely through the concept of the global gradient, which plays a pivotal role in pointing to the global minimum. Finally, we compared the performance of our algorithm with the existing optimization algorithms, demonstrating its stable convergence and robustness in achieving the global minimum where the others struggle to escape local minima and to maintain a consistent convergence path.
Overall, this novel algorithm tackles one of the most challenging problems in optimization, namely guaranteed one-dimensional global convergence. On the one hand, this result directly benefits higher-dimensional optimization through line search methods; on the other hand, it opens up perspectives for extending the algorithm to higher-dimensional spaces.
References
[1] Marcin Andrychowicz, Misha Denil, Sergio Gómez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando de Freitas. Learning to Learn by Gradient Descent by Gradient Descent. arXiv preprint arXiv:1606.04474, 2016.
[2] John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning
and Stochastic Optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
[3] T. Tieleman and G. Hinton. Lecture 6.5—RMSProp: Divide the Gradient by a Running Average
of Its Recent Magnitude. Lecture Notes, 2012.
[4] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv
preprint arXiv:1412.6980, 2014.
[5] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2017. URL: https://openreview.net/forum?id=Bkg6RiRAb7.
[6] T. Dozat. Incorporating Nesterov Momentum into Adam. OpenReview.net, 2016. URL:
https://openreview.net/forum?id=H1Z2w9lgl.
[7] R. Chelouah and P. Siarry. A Continuous Genetic Algorithm Designed for the Global Optimization of Multimodal Functions. Journal of Heuristics, 2000. Publisher: Springer.
[8] F. Marini and B. Walczak. Particle Swarm Optimization (PSO): A Tutorial. Chemometrics and
Intelligent Laboratory Systems, 2015. Publisher: Elsevier.
[9] M. L. Shahab, H. Susanto, and H. Hatzikirou. A Finite Difference Method with Symmetry Properties for the High-Dimensional Bratu Equation. arXiv preprint arXiv:2410.12553, 2024.