Super Gradient Descent: Global Optimization Requires Global Gradient
Seifeddine Achour
Abstract
1 Introduction
Global optimization plays a critical role in addressing complex real-life challenges across various
fields. In engineering, it is applied to structural design optimization, where minimizing weight or
material use while ensuring durability is essential for cost-effective and safe construction. In financial
services, portfolio optimization requires balancing risk and return by finding the global minimum
or maximum in investment strategies. In logistics and transportation, global optimization is crucial
for solving routing problems such as determining the shortest path or optimizing delivery routes, which leads to significant cost savings and improved efficiency. Similarly, in energy systems, global
optimization is key to managing and distributing power more efficiently, reducing operational costs,
and optimizing renewable energy usage.
In machine learning, the need for global optimization is especially pronounced. The performance
of models often depends on the ability to minimize complex, non-convex loss functions. While traditional methods like gradient descent are effective in many cases, they frequently encounter the
problem of getting trapped in local minima, which can hinder the model’s overall performance. This
is particularly relevant in tasks that require complex models where the optimization landscape is
highly non-linear and fraught with local minima.
The primary contribution of this work is the introduction of a novel algorithm named Super Gradient Descent (SuGD). Unlike classical gradient descent, which relies only on local information and is therefore prone to getting stuck in local minima, the proposed method bases each update decision on a global measure of the function's variation to ensure consistent progress towards the global minimum. We evaluate its performance on various one-dimensional functions, demonstrating that it provides superior convergence behavior, particularly in avoiding local minima and achieving the global optimum. This novel approach contributes to overcoming the challenges of non-convex optimization, offering a more reliable method for finding global solutions in machine learning.
2 State-of-the-Art Optimization Algorithms

2.1 Gradient Descent

Gradient descent is the classical first-order optimization method: it updates the parameters in the direction opposite to the gradient of the objective function:

x_{t+1} = x_t − η ∇f(x_t)

where:
• x_t represents the vector of parameters at iteration t,
• η is the learning rate,
• ∇f(x_t) is the gradient of the objective function evaluated at x_t.
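For concreteness, a minimal NumPy sketch of this update (the toy quadratic objective and the learning rate below are illustrative choices, not settings from this paper):

```python
import numpy as np

def gradient_descent_step(x, grad_f, lr=0.1):
    """One classical gradient descent update: x_{t+1} = x_t - lr * grad f(x_t)."""
    return x - lr * grad_f(x)

# Toy quadratic objective f(x) = ||x||^2, whose gradient is 2x (illustrative only).
grad_f = lambda x: 2.0 * x
x = np.array([3.0, -2.0])
for _ in range(50):
    x = gradient_descent_step(x, grad_f)
print(x)  # approaches the minimizer [0, 0]
```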
2.2 AdaGrad
AdaGrad is an adaptive learning rate optimization algorithm that adjusts the learning rate for each parameter based on the historical gradients, and it handles sparse data well. Parameters with large accumulated gradients receive smaller updates, while those with smaller gradients are updated more significantly [2].
The AdaGrad update rule is:

x_{t+1} = x_t − (η / √(G_t + ϵ)) ∇f(x_t)

where:
• G_t is the sum of the squares of the past gradients up to iteration t,
• η is the learning rate,
• ϵ is a small constant that prevents division by zero.
Pros: Automatically adjusts learning rates; works well with sparse data.
Cons: The learning rate decreases too aggressively over time, leading to premature convergence.
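A minimal sketch of the accumulator behind this rule (the hyperparameter values and the toy objective are illustrative):

```python
import numpy as np

def adagrad(grad_f, x0, lr=0.5, eps=1e-8, steps=200):
    """AdaGrad: scale each coordinate's step by the root of its accumulated squared gradients."""
    x = np.asarray(x0, dtype=float)
    G = np.zeros_like(x)                   # running sum of squared gradients
    for _ in range(steps):
        g = grad_f(x)
        G += g * g                         # G_t only grows, so the effective step shrinks
        x = x - lr * g / np.sqrt(G + eps)
    return x

print(adagrad(lambda x: 2.0 * x, [3.0, -2.0]))   # approaches [0, 0]
```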
2.3 RMSprop
RMSprop was developed to address the diminishing learning rate issue in AdaGrad by introducing
an exponentially decaying average of squared gradients [3]. This allows RMSprop to adapt the
learning rate dynamically without the aggressive reduction that AdaGrad experiences.
The RMSprop update rule is:

x_{t+1} = x_t − (η / √(E[∇f(x_t)²] + ϵ)) ∇f(x_t)

where:
• E[∇f(x_t)²] is the exponentially decaying average of the squared gradients,
• η is the learning rate,
• ϵ is a small constant for numerical stability.
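A minimal sketch highlighting the difference from AdaGrad, namely the exponentially decaying average in place of a growing sum (values are illustrative):

```python
import numpy as np

def rmsprop(grad_f, x0, lr=0.05, beta=0.9, eps=1e-8, steps=300):
    """RMSprop: replace AdaGrad's growing sum with an exponentially decaying average."""
    x = np.asarray(x0, dtype=float)
    avg_sq = np.zeros_like(x)
    for _ in range(steps):
        g = grad_f(x)
        avg_sq = beta * avg_sq + (1.0 - beta) * g * g    # E[grad^2]_t
        x = x - lr * g / np.sqrt(avg_sq + eps)
    return x

print(rmsprop(lambda x: 2.0 * x, [3.0, -2.0]))   # moves toward [0, 0]
```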
2.4 Adam
Adam combines the advantages of both AdaGrad and RMSprop by maintaining an exponentially
decaying average of past gradients (first moment) and squared gradients (second moment) [4]. This
approach allows it to use adaptive learning rates, while also addressing the diminishing learning rate
issue.
The update rule for Adam is:

m_t = β1 m_{t−1} + (1 − β1) ∇f(x_t)
v_t = β2 v_{t−1} + (1 − β2) (∇f(x_t))²
m̂_t = m_t / (1 − β1^t),  v̂_t = v_t / (1 − β2^t)
x_{t+1} = x_t − η m̂_t / (√(v̂_t) + ϵ)

where:
• m_t and v_t are the first and second moment estimates of the gradient,
• m̂_t and v̂_t are their bias-corrected versions,
• β1 and β2 are hyperparameters that control the decay rates of these moments.
Pros: Efficient and well-suited for a wide variety of problems; adaptive learning rates.
Cons: Lacks formal convergence guarantees in non-convex settings.
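A minimal sketch of the two moment estimates and the bias correction (hyperparameters are common defaults, shown for illustration only):

```python
import numpy as np

def adam(grad_f, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=300):
    """Adam: bias-corrected first and second moment estimates of the gradient."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)   # first moment (mean of gradients)
    v = np.zeros_like(x)   # second moment (mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad_f(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)       # bias correction
        v_hat = v / (1 - beta2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

print(adam(lambda x: 2.0 * x, [3.0, -2.0]))   # moves toward [0, 0]
```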
2.5 AdamW
AdamW is a variant of the Adam optimizer that decouples weight decay from the gradient-based
parameter updates [5]. This leads to better generalization performance, especially in deep learning
models, by applying weight decay directly to the parameters as follows:
x_{t+1} = x_t − η m̂_t / (√(v̂_t) + ϵ) − η λ x_t

where:
• m̂_t and v̂_t are the bias-corrected moment estimates defined as in Adam,
• λ is the weight decay coefficient.
Pros: Better generalization than Adam; reduces overfitting by using weight decay.
Cons: Requires careful tuning of the weight decay parameter.
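A minimal sketch of the decoupled update, where the weight decay term is applied directly to the parameters rather than folded into the gradient (the weight decay value and toy objective are illustrative):

```python
import numpy as np

def adamw_step(x, g, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """One AdamW step: the usual Adam update plus weight decay applied directly to x."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * x   # decoupled decay term
    return x, m, v

x = np.array([3.0, -2.0])
m, v = np.zeros_like(x), np.zeros_like(x)
for t in range(1, 301):
    x, m, v = adamw_step(x, 2.0 * x, m, v, t)   # gradient of the toy objective ||x||^2
print(x)
```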
2.6 Nesterov Accelerated Gradient (NAG)

NAG improves upon classical momentum by evaluating the gradient at a look-ahead position, which gives the optimizer an anticipatory behaviour [6]:

v_{t+1} = γ v_t + η ∇f(x_t − γ v_t)
x_{t+1} = x_t − v_{t+1}

where:
• v_t is the velocity (momentum) term,
• γ is the momentum coefficient,
• η is the learning rate.
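A minimal sketch of this look-ahead update (the momentum coefficient, learning rate, and toy objective are illustrative):

```python
import numpy as np

def nag(grad_f, x0, lr=0.1, gamma=0.9, steps=200):
    """Nesterov accelerated gradient: evaluate the gradient at the look-ahead point x - gamma*v."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        v = gamma * v + lr * grad_f(x - gamma * v)   # velocity update at the anticipated position
        x = x - v
    return x

print(nag(lambda x: 2.0 * x, [3.0, -2.0]))   # moves toward [0, 0]
```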
Beyond gradient-based optimizers, population-based metaheuristics such as genetic algorithms [7] and particle swarm optimization [8] are designed to search the space globally, but they come with their own limitations.
Noisy Landscapes: They may struggle in noisy environments, where the fitness landscape can mislead the optimization process.
Global Convergence: While they are designed to search the global space, they do not guarantee
convergence to the global minimum.
In summary, while state-of-the-art optimization algorithms provide powerful tools for training machine learning models, they suffer from limitations regarding local minima and global convergence
guarantees. In the following section, we will introduce our new algorithm, Super Gradient Descent (SuGD), and prove its convergence to the global minimum of any k-Lipschitz one-dimensional function defined on a domain [a, b].
3 Methodology
In this section, we present our methodology for implementing the Super Gradient Descent Algorithm along with the mathematical proof of its guaranteed global convergence for any k-Lipschitz
continuous function defined on the domain [a, b].
We define the global gradient of f between any two points x, y ∈ [a, b], x ≠ y, as F(x, y) = (f(x) − f(y)) / (x − y). It evaluates the function at any two points, instead of necessarily two neighboring points, in order to capture its global variation. The local gradient then becomes a particular case, since ∇f(x) = lim_{h→0} F(x + h, x).
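A minimal sketch of the global gradient and a numerical check of the limit above (the test function here is illustrative):

```python
import math

def global_gradient(f, x, y):
    """Global gradient F(x, y) = (f(x) - f(y)) / (x - y), defined for x != y."""
    return (f(x) - f(y)) / (x - y)

f = lambda x: x * math.sin(x)        # illustrative test function
x = 2.0
for h in (1e-1, 1e-3, 1e-6):
    print(h, global_gradient(f, x + h, x))   # tends to the local derivative as h -> 0
print(math.sin(x) + x * math.cos(x))         # f'(2) for comparison
```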
Based on the definition of the global gradient, we introduce the Super Gradient Descent (SuGD) algorithm (Algorithm 1), which guarantees convergence to the global minimum of any k-Lipschitz one-dimensional function defined on an interval [a, b].
The algorithm leverages global derivative information to ensure fast and well-guided convergence, maintaining a global view of the interval [a, b] while controlling the distance between the two points so that its local behavior remains similar to that of local algorithms. Next, we provide the theoretical basis behind this algorithm.
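For illustration, a minimal one-dimensional sketch of this scheme, assuming the two-sequence update rule (4) established in the proof below and initialization at the interval endpoints a and b; the stopping criterion, step size, and test interval are illustrative choices and may differ from Algorithm 1 as specified by the author:

```python
import math

def global_gradient(f, x, y):
    """Global gradient F(x, y) = (f(x) - f(y)) / (x - y)."""
    return (f(x) - f(y)) / (x - y)

def super_gradient_descent(f, a, b, alpha=1e-3, tol=1e-6, max_iter=200_000):
    """Sketch of SuGD: two sequences x1 (from a) and x2 (from b) move toward each other,
    driven by the global gradient F(x2, x1), following update rule (4)."""
    x1, x2 = float(a), float(b)
    for _ in range(max_iter):
        if abs(x2 - x1) < tol:
            break
        f1, f2 = f(x1), f(x2)
        F = global_gradient(f, x2, x1)
        if f2 - f1 < 0:                      # right point is lower: move x1 toward x2
            x1 = x1 - alpha * (x1 - x2) * (1 - F)
        else:                                # left point is lower (or equal): move x2 toward x1
            x2 = x2 - alpha * (x2 - x1) * (1 + F)
    return x1 if f(x1) <= f(x2) else x2

# Illustrative run on f(x) = x sin(x) over [0, 6], which has a local and a global minimum.
f = lambda x: x * math.sin(x)
x_star = super_gradient_descent(f, 0.0, 6.0)
print(x_star, f(x_star))    # close to the global minimum near x ≈ 4.91, f ≈ -4.81
```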
Recall that f is k-Lipschitz on [a, b] if ∀x, y ∈ [a, b], |f(x) − f(y)| ≤ k |x − y|. This means that ∀x, y ∈ [a, b] with x ≠ y, |F(x, y)| = |f(x) − f(y)| / |x − y| ≤ k. In other words, Lipschitz continuity ensures that our global gradient is always finite, which justifies its use in the SuGD algorithm.
Theorem 1. Let f be a one-dimensional, k-Lipschitz function defined on a bounded domain [a, b] that admits one global minimum at x^* ∈ [a, b]. Then there exists an optimization step α_ϵ > 0 such that for every α ∈ (0, α_ϵ], the Super Gradient Descent Algorithm 1 converges to this global minimum. More precisely, ∀ϵ > 0, ∃α_ϵ > 0 such that ∀α ∈ (0, α_ϵ], |f(x_n) − f(x^*)| < ϵ for n large enough.
Proof. To ensure that the two sequences x^{(1)}_n and x^{(2)}_n produced by Algorithm 1 converge to x^*, the proof of this theorem consists of two main parts:
Part 1:

We prove by recurrence that ∃α_ϵ > 0 such that ∀n ∈ N and ∀α ∈ (0, α_ϵ]:

x^{(1)}_n < x^*  and  x^{(2)}_n > x^*.   (2)

For n = 0:

x^{(1)}_0 = a < x^*  and  x^{(2)}_0 = b > x^*.   (3)

Assume that the inequalities (2) hold up to order n. If a sequence is stationary at step n, the recurrence is trivial for it. If not, then

x^{(1)}_{n+1} = x^{(1)}_n − α (x^{(1)}_n − x^{(2)}_n)(1 − F(x^{(2)}_n, x^{(1)}_n)),  when f(x^{(2)}_n) − f(x^{(1)}_n) < 0,
x^{(2)}_{n+1} = x^{(2)}_n − α (x^{(2)}_n − x^{(1)}_n)(1 + F(x^{(2)}_n, x^{(1)}_n)),  when f(x^{(2)}_n) − f(x^{(1)}_n) ≥ 0.   (4)

Let us choose α_ϵ = ϵ / ((b − a)(1 + k)k) and take 0 < α ≤ α_ϵ.

Consider the first case. Since f(x^{(2)}_n) − f(x^{(1)}_n) < 0 and x^{(1)}_n − x^{(2)}_n < 0 by (2), and since |F| ≤ k and |x^{(2)}_n − x^{(1)}_n| ≤ b − a, we get

x^{(1)}_{n+1} = x^{(1)}_n − α (x^{(1)}_n − x^{(2)}_n)(1 − F(x^{(2)}_n, x^{(1)}_n)) ≤ x^{(1)}_n + ϵ/k.

Since our target is f(x^{(1)}_n) − f(x^*) ≤ ϵ, suppose that f(x^{(1)}_n) − f(x^*) > ϵ. Based on the Lipschitz condition and (2), we deduce that f(x^{(1)}_n) − f(x^*) ≤ k (x^* − x^{(1)}_n), hence x^* − x^{(1)}_n > ϵ/k.

So x^{(1)}_{n+1} ≤ x^{(1)}_n + ϵ/k < x^*.

This establishes the recurrence for the first sequence; the statement for the second sequence is proved in the same manner.
Part 2:
We proved in the first part that x^{(1)}_n is bounded above and x^{(2)}_n is bounded below. We still need to prove that x^{(1)}_n and x^{(2)}_n converge toward the same limit, which makes them adjacent sequences; combined with Part 1, we deduce that this common limit is x^*.
We return to the definition of the sequences. By construction, at each iteration one of them changes its value; if the other is stationary, it is trivially convergent, so we focus on the non-stationary case.
4 Experimental Results

Figure 1: Non-convex multi-minima test

The results of our experiments are illustrated in Figure 1. As observed, the traditional gradient descent algorithm shows signs of converging towards a local minimum, since the function admits several minima. Its convergence path is characterized by oscillations, especially in the final iterations, indicating a struggle to escape local minima.
In contrast, our proposed super gradient descent algorithm demonstrates a more stable and directed approach towards the global minimum. The algorithm effectively navigates the oscillations
and maintains a consistent trajectory towards the global minimum. This enhanced performance is
attributed to its global information-based update rules, which allow for more nuanced adjustments
in the descent path.
Next, we apply both algorithms to the more complex function f(x) = 2x sin(x³) − x cos(x³/12), which presents not only a huge number of local minima but also points where differentiation is difficult.
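For reference, a minimal sketch of this test function together with a coarse grid scan over an illustrative interval, just to show how rugged the landscape is (the interval is not taken from the paper):

```python
import math

def f2(x):
    """Second test function: f(x) = 2x sin(x^3) - x cos(x^3 / 12)."""
    return 2 * x * math.sin(x ** 3) - x * math.cos(x ** 3 / 12)

# Coarse grid scan over an illustrative interval; a crude reference, not one of the compared optimizers.
xs = [i / 1000 for i in range(-3000, 3001)]
x_best = min(xs, key=f2)
print(x_best, f2(x_best))
```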
Figure 2: Hardly differentiable function test
Here, classical gradient descent once again struggles with local minima and fails to reach the global minimum efficiently; moreover, its trajectory appears erratic and unstable due to sudden local changes in the objective function value, further confirming its sensitivity.
In contrast, the super gradient descent algorithm exhibits a more robust behavior, with a clear
and smooth convergence towards the global minimum, regardless of the function’s complexity. This
can be observed in the final convergence plots where the loss reduction is significantly more consistent
compared to the classical method.
The comparison focuses on the ability to escape local minima, the rate of convergence, and the consistency of reaching the global minimum.
In the first test function, f(x) = x sin(x), AdaGrad, which adapts its learning rate based on previous gradients [2], performs well at first but suffers from a decreasing learning rate over time, which causes premature convergence. RMSprop shows a more stable convergence but still fails to reach the global minimum in this case. The moment- and gradient-adaptive learning rates of AdamW and Adam accelerate the convergence, but not enough to escape the local minimum. The anticipatory behaviour of NAG allowed it to climb in the ascending direction for a while, but it quickly fell back to the local minimum.

In contrast, our proposed Super Gradient Descent effectively bypasses local minima and follows a consistent path towards the global minimum, and does so within a reasonable time compared to the previous algorithms.
Figure 4: Hardly differentiable non-convex function test
For the more complex function f(x) = 2x sin(x³) − x cos(x³/12), the limitations of classical optimization algorithms become more apparent. NAG, while managing to escape local minima and reduce the loss, suffered from exploding gradients and diverged because the function is much more complex and hard to differentiate. This highlights the stochastic nature of these algorithms' escapes from local minima and, therefore, the absence of any guarantee of global convergence. In
comparison, Super Gradient Descent displays much smoother convergence and consistently reaches
the global minimum, avoiding the pitfalls of local minima encountered by other algorithms. The
global information-based updates in our method allow it to adapt more effectively to the varying
gradients and oscillations present in the function, guaranteeing global convergence.
Figure 5: Non-convex multi-regularity function test
Overall, the results indicate that the super gradient descent algorithm not only converges to the
global minimum effectively but also does so with greater stability and robustness than the traditional gradient descent method. This has significant implications for optimization problems in
various fields, where the risk of being trapped in local minima can severely impact the performance
of learning algorithms.
The comparative analysis presented here establishes the effectiveness of our proposed approach.
Future work will involve testing on higher-dimensional functions and integrating additional optimization strategies to further enhance the robustness of our algorithm.
5 Conclusion
In this article, we presented Super Gradient Descent (SuGD), a novel algorithm capable of converging to the global minimum of any k-Lipschitz one-dimensional function defined on a domain [a, b]. We started with an overview of state-of-the-art optimization algorithms, including Gradient Descent, AdaGrad, RMSprop, Adam, AdamW, and Nesterov Accelerated Gradient (NAG). Then, we introduced our methodology and the mathematical implementation of the algorithm, followed by the proof of its guaranteed global convergence, showing how SuGD leverages global information about the function, more precisely through the concept of the global gradient, which plays a pivotal role in pointing to the global minimum. Finally, we compared the performance of our algorithm with the existing optimization algorithms, demonstrating its stable convergence and robustness in achieving the global minimum where the others struggle to escape local minima and to maintain a consistent convergence path.
Overall, this novel algorithm tackles one of the most challenging problems in optimization, namely guaranteed one-dimensional global convergence. On the one hand, this result directly benefits higher-dimensional optimization through line search methods; on the other hand, it opens up perspectives for extending the algorithm to higher-dimensional spaces.
References
[1] Marcin Andrychowicz, Misha Denil, Sergio Gómez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando de Freitas. Learning to Learn by Gradient Descent by Gradient Descent. arXiv preprint arXiv:1606.04474, 2016.
[2] John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning
and Stochastic Optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
[3] T. Tieleman and G. Hinton. Lecture 6.5—RMSProp: Divide the Gradient by a Running Average
of Its Recent Magnitude. Lecture Notes, 2012.
[4] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv
preprint arXiv:1412.6980, 2014.
[5] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2017. URL: https://openreview.net/forum?id=Bkg6RiRAb7.
[6] T. Dozat. Incorporating Nesterov Momentum into Adam. OpenReview.net, 2016. URL:
https://openreview.net/forum?id=H1Z2w9lgl.
[7] R. Chelouah and P. Siarry. A Continuous Genetic Algorithm Designed for the Global Optimization of Multimodal Functions. Journal of Heuristics, 2000. Publisher: Springer.
[8] F. Marini and B. Walczak. Particle Swarm Optimization (PSO): A Tutorial. Chemometrics and
Intelligent Laboratory Systems, 2015. Publisher: Elsevier.
[9] M. L. Shahab, H. Susanto, and H. Hatzikirou. A Finite Difference Method with Symmetry Properties for the High-Dimensional Bratu Equation. arXiv preprint arXiv:2410.12553, 2024.