
CSE 151B/251B

Deep Learning

Rose Yu
ANNOUNCEMENT
Kaggle Group Signup

• Sign up as a group before April 18th

• Use Piazza's "Search for Teammates" function

• Sign up on Google sheet first

• After that, register the info on Canvas to receive AWS


OPTIMIZATION
Learning as Optimization
[Figure: loss as a function of a weight parameter: learning as optimization]

to learn the weights, we need the derivative of the loss w.r.t. the weight
i.e. “how should the weight be updated to decrease the loss?”

$$w = w - \alpha \frac{\partial L}{\partial w}$$

with multiple weights, we need the gradient of the loss w.r.t. the weights

$$w = w - \alpha \nabla_w L$$

where $\alpha$ is the learning rate (a minimal sketch of this update follows)
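To make the update rule concrete, here is a minimal gradient descent loop in NumPy on a hypothetical least-squares problem; the data, learning rate, and iteration count are illustrative assumptions, not from the slides:

```python
import numpy as np

# Toy least-squares problem (hypothetical data): L(w) = 1/(2n) * ||X w - y||^2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def grad(w):
    # dL/dw for the mean squared error
    return X.T @ (X @ w - y) / len(y)

alpha = 0.1                      # learning rate
w = np.zeros(3)
for _ in range(500):
    w = w - alpha * grad(w)      # w <- w - alpha * dL/dw
```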
Gradient Descent
Convexity
Traditional loss functions are often assumed to be convex:

$$L(x_2) - L(x_1) \ge \nabla L(x_1)^{\top} (x_2 - x_1)$$

Easy to find global optima!
Strictly convex if the inequality is strict (the difference is always $> 0$) for $x_1 \neq x_2$
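As a quick worked check (not from the slides), take the scalar function $L(x) = x^2$, for which $\nabla L(x_1) = 2x_1$:

$$L(x_2) - L(x_1) - \nabla L(x_1)(x_2 - x_1) = x_2^2 - x_1^2 - 2x_1(x_2 - x_1) = (x_2 - x_1)^2 \ge 0,$$

so the first-order inequality holds, and it is strict whenever $x_2 \neq x_1$, which is exactly strict convexity.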
Convexity
• All local optima are global optima:

• Strictly convex: unique global optimum:

• Stochastic gradient descent will find the global optimum with a good learning rate
Poll
Optimizing Deep Neural Nets
[Figure: visualized loss landscape of VGG-56, comparing a convex-looking region with a non-convex region containing saddle points]

Li, Hao, et al. "Visualizing the Loss Landscape of Neural Nets." Advances in Neural Information Processing Systems. 2018.
Stochastic Gradient Descent
large-scale datasets (large $N$) lead to a memory bottleneck

stochastic gradient descent (SGD): $w = w - \alpha\, \tilde\nabla_w L$


use stochastic gradient estimate to descend the surface of the loss function

batch gradient:
$$\nabla_w L = \frac{1}{n}\sum_{i=1}^{n} \nabla_w L\big(f(x_i; w), y_i\big)$$

stochastic gradient (one randomly sampled example $i$):
$$\tilde\nabla_w L = \nabla_w L\big(f(x_i; w), y_i\big)$$
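For illustration (the linear model and data below are hypothetical, not from the slides), the two estimators can be computed as follows; the stochastic gradient is just the per-example gradient at one randomly drawn index:

```python
import numpy as np

# Hypothetical linear model f(x; w) = x @ w with squared-error loss
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5)
w = np.zeros(5)

def example_grad(i, w):
    # Gradient of 0.5 * (f(x_i; w) - y_i)^2 with respect to w
    return (X[i] @ w - y[i]) * X[i]

batch_grad = np.mean([example_grad(i, w) for i in range(len(y))], axis=0)
i = rng.integers(len(y))
stochastic_grad = example_grad(i, w)   # noisy, but unbiased, estimate of batch_grad
```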
Batch and Minibatch
mini-batch size: from $n$ (full batch) down to $1$ (pure SGD)

Batch gradient (mini-batch size $n$):
• Accurate gradient estimate
• Fast training
• Memory bottleneck
• Hard to parallelize
• Worse generalization

Stochastic gradient (mini-batch size $1$):
• Approximate gradient estimate
• Slow training
• Memory efficient
• Easy to parallelize
• Better generalization
The stochastic gradient needs to be an unbiased estimator of the true gradient:

$$\nabla_w L(w) = \mathbb{E}_{\xi}\big[\tilde\nabla_w L(w, \xi)\big]$$
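As a sanity check of the unbiasedness claim (illustrative only; the data and least-squares loss are assumptions), the empirical mean of many single-example gradients matches the full-batch gradient:

```python
import numpy as np

# Hypothetical least-squares setup: L(w) = mean_i 0.5 * (x_i . w - y_i)^2
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = X @ rng.normal(size=4)
w = rng.normal(size=4)

per_example = (X @ w - y)[:, None] * X    # gradient of each example's loss
full_grad = per_example.mean(axis=0)      # the true (batch) gradient

# Sampling the example index uniformly makes the stochastic gradient unbiased
samples = per_example[rng.integers(len(y), size=200_000)]
print(full_grad)
print(samples.mean(axis=0))   # close to full_grad: the estimator is unbiased
```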
Mini-Batch

• mini-batch implementation: https://d2l.ai/chapter_optimization/minibatch-sgd.html
• Mini-batch SGD trains faster than both full-batch GD and single-example SGD
• Trade-off between convergence speed and computational efficiency (a minimal training loop sketch follows)
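A minimal mini-batch SGD training loop in NumPy, in the spirit of the d2l.ai reference above; the least-squares model, batch size, and learning rate are illustrative assumptions:

```python
import numpy as np

# Hypothetical least-squares problem
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))
y = X @ rng.normal(size=8)

w = np.zeros(8)
alpha, batch_size = 0.1, 64
for epoch in range(10):
    perm = rng.permutation(len(y))               # reshuffle the data every epoch
    for start in range(0, len(y), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / len(idx)   # mini-batch gradient estimate
        w -= alpha * grad
```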
SGD is Not Enough
[Figure: loss surfaces with global minima and bad local minima]

• increased frequency and severity of bad local minima

• pathological curvature, like the type seen in the well-known Rosenbrock function
An Ill-Conditioned Example
• Consider optimizing $f(x) = 0.1 x_1^2 + 2 x_2^2$

• The function is very flat along the direction of $x_1$

• The gradient is much larger and changes rapidly along the direction of $x_2$

[Figure: gradient descent trajectories on $f$ with learning rates 0.4 and 0.6]

• Increasing the learning rate improves the convergence along the $x_1$ direction, but the overall solution is much worse (the iterates diverge along $x_2$)

• Key idea: can we average the past gradients?
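A minimal reproduction of this example (the starting point and step count are arbitrary choices, not from the slides): along $x_2$ each update multiplies the coordinate by $1 - 4\alpha$, so $\alpha = 0.6$ gives $|1 - 2.4| > 1$ and the iterates diverge.

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x) = 0.1*x1^2 + 2*x2^2
    return np.array([0.2 * x[0], 4.0 * x[1]])

def run_gd(alpha, steps=20, x0=(-5.0, -2.0)):
    x = np.array(x0)
    for _ in range(steps):
        x = x - alpha * grad_f(x)
    return x

print(run_gd(0.4))   # slow progress along x1; x2 oscillates but shrinks
print(run_gd(0.6))   # faster along x1, but x2 blows up
```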
Momentum
• SGD oscillates a lot and converges slowly in the direction with a flat landscape

• Use a "leaky" gradient: keep a fraction of the past gradient updates

$$v_t = \beta v_{t-1} - g_{t,t-1}, \qquad w_{t+1} = w_t + v_t$$

where $g_{t,t-1}$ denotes the current gradient and $\beta v_{t-1}$ carries the past gradients. Unrolling the recursion (with $v_0 = 0$):

$$v_t = \beta^2 v_{t-2} - \beta\, g_{t-1,t-2} - g_{t,t-1} = \cdots = -\sum_{\tau=0}^{t-1} \beta^{\tau}\, g_{t-\tau,\, t-\tau-1}$$
Example
[Figure: trajectories with learning rate 0.6 and momentum 0.25 vs. momentum 0.5]

• momentum = 0 recovers the original SGD equation

• momentum improves learning even with the same learning rate

• increasing the momentum can lead to oscillating trajectories

• but still much better than diverged solutions
Momentum
• Momentum: physical interpretation, update a velocity

SGD:
$$w_{t+1} = w_t - \alpha\, \tilde\nabla_w L(w_t)$$

SGD with momentum:
$$v_t = \beta v_{t-1} - \alpha\, \tilde\nabla_w L(w_t), \qquad w_{t+1} = w_t + v_t$$

[Figure: trajectories of SGD without momentum vs. SGD with momentum]

• Nesterov momentum: stronger theoretical guarantee; the gradient is evaluated at the look-ahead point
$$v_t = \beta v_{t-1} - \alpha\, \tilde\nabla_w L(w_t + \beta v_{t-1}), \qquad w_{t+1} = w_t + v_t$$
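A sketch of both update rules (the quadratic test function, hyperparameters, and the `grad_fn` interface are illustrative assumptions, not the course code):

```python
import numpy as np

def sgd_momentum(grad_fn, w0, alpha=0.1, beta=0.5, steps=100, nesterov=False):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        # Nesterov evaluates the gradient at the look-ahead point w + beta*v
        g = grad_fn(w + beta * v) if nesterov else grad_fn(w)
        v = beta * v - alpha * g    # v_t = beta * v_{t-1} - alpha * gradient
        w = w + v                   # w_{t+1} = w_t + v_t
    return w

# The ill-conditioned quadratic from earlier: f(x) = 0.1*x1^2 + 2*x2^2
grad_f = lambda x: np.array([0.2 * x[0], 4.0 * x[1]])
print(sgd_momentum(grad_f, [-5.0, -2.0]))
print(sgd_momentum(grad_f, [-5.0, -2.0], nesterov=True))
```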
Learning Rate
$$w = w - \alpha\, \tilde\nabla_w L \qquad \text{($\alpha$ is the learning rate)}$$

Learning rate is essential to convergence speed and accuracy

[Figure: loss curves $L(w)$ vs. $w$ for a learning rate that is too low, too high, and just right]
Simple Strategy
• Divide Loss Function by Number of Examples:

$$w_{t+1} = w_t - \frac{\alpha}{n}\, \tilde\nabla_w L(w_t)$$
• Start with a large step size
• If the loss plateaus, divide the step size by 2: $\alpha_t = \frac{1}{2^t}\, \alpha_0$
• (Can also use advanced optimization methods)
• (Step size must decrease over time to guarantee convergence to the global optimum)

Scale the learning rate with the iteration count: $\alpha_t = \dfrac{\alpha_0}{\sqrt{t + c}}$ (a small schedule sketch follows)
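Two of these schedules written as small helper functions; the plateau threshold and the constants are arbitrary illustrative choices:

```python
import math

def halve_on_plateau(alpha, recent_losses, tol=1e-3):
    # Halve the step size when the loss has essentially stopped improving
    if len(recent_losses) >= 2 and recent_losses[-2] - recent_losses[-1] < tol:
        return alpha / 2
    return alpha

def inverse_sqrt_schedule(alpha0, t, c=1.0):
    # alpha_t = alpha0 / sqrt(t + c)
    return alpha0 / math.sqrt(t + c)

print([round(inverse_sqrt_schedule(0.5, t), 3) for t in range(5)])
```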
Adaptive Learning Rate
• Potential issues of a fixed learning rate:

• parameters for common features converge rather quickly to their optimal values

• for infrequent features, we are still short of observing them sufficiently frequently before their optimal values can be determined

• Remedy: let the learning rate scale according to the features:

$$\alpha_t = \frac{\alpha_0}{\sqrt{s(i,t) + c}}$$

where $s(i, t)$ counts the number of nonzeros for parameter dimension $i$ that we have observed up to time $t$

• This applies only to the input features, but not to the gradients (see the sketch after this list)
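A tiny sketch of the per-feature counter $s(i,t)$ and the resulting per-coordinate learning rates; the data and constants are made up for illustration:

```python
import numpy as np

def feature_count_lr(X_seen, alpha0=0.1, c=1.0):
    # s[i] = number of observed examples with a nonzero value in feature i
    s = (X_seen != 0).sum(axis=0)
    return alpha0 / np.sqrt(s + c)

X_seen = np.array([[1.0, 0.0, 0.0],
                   [2.0, 0.0, 3.0],
                   [1.5, 0.0, 0.0]])
print(feature_count_lr(X_seen))   # rarely observed features keep a larger learning rate
```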
Adaptive Learning Rate
• AdaGrad: adaptively scales the learning rate for each parameter dimension
$$w_{t+1} = w_t - \frac{\epsilon}{\delta + \sqrt{r_t}} \odot g$$
where $g$ is the gradient and $\frac{\epsilon}{\delta + \sqrt{r_t}}$ is the adaptive per-dimension learning rate

• $r_t = r_{t-1} + g \odot g$ is the running sum of squared gradients

• Hessian matrix:
$$H = \begin{bmatrix} \dfrac{\partial^2 L}{\partial w_1^2} & \cdots & \dfrac{\partial^2 L}{\partial w_1 \partial w_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 L}{\partial w_n \partial w_1} & \cdots & \dfrac{\partial^2 L}{\partial w_n^2} \end{bmatrix}$$

• $\operatorname{diag}(H) \approx g \odot g$: the elementwise squared gradient serves as a cheap approximation to the diagonal of the Hessian


Adaptive Learning Rate
Cost is sensitive to the learning rate only in some directions in the parameter space: maintain a "memory" of previous gradients and scale gradients per parameter

• AdaGrad
gradient: $g = \tilde\nabla_w L(w_t)$
approximate Hessian: $r_t = r_{t-1} + g \odot g$
update: $w_{t+1} = w_t - \dfrac{\epsilon}{\delta + \sqrt{r_t}} \odot g$

• RMSProp
gradient: $g = \tilde\nabla_w L(w_t)$
approximate Hessian: $r_t = \rho r_{t-1} + (1 - \rho)\, g \odot g$
update: $w_{t+1} = w_t - \dfrac{\epsilon}{\delta + \sqrt{r_t}} \odot g$
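A minimal sketch of both updates; the step size $\epsilon$, the small constant $\delta$, $\rho$, and the test function are illustrative assumptions:

```python
import numpy as np

def adagrad_step(w, g, r, epsilon=0.1, delta=1e-7):
    r = r + g * g                                     # accumulate squared gradients
    return w - epsilon / (delta + np.sqrt(r)) * g, r

def rmsprop_step(w, g, r, epsilon=0.01, delta=1e-7, rho=0.9):
    r = rho * r + (1 - rho) * g * g                   # exponentially weighted average
    return w - epsilon / (delta + np.sqrt(r)) * g, r

# Example on the ill-conditioned quadratic f(x) = 0.1*x1^2 + 2*x2^2
grad_f = lambda x: np.array([0.2 * x[0], 4.0 * x[1]])
w, r = np.array([-5.0, -2.0]), np.zeros(2)
for _ in range(200):
    w, r = rmsprop_step(w, grad_f(w), r)
print(w)   # both coordinates make steady progress despite very different curvatures
```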
Comparison
• local minima and saddle points are largely not an issue
• in many dimensions, the optimizer can move in exponentially more directions

http://sebastianruder.com/optimizing-gradient-descent/index.html
