
CSE 151B/251B

Deep Learning

Rose Yu
ANNOUNCEMENT
Kaggle Group Signup

• Sign up as a group before April 18th

• Use Piazza's "Search for Teammates" function

• Sign up on Google sheet first

• After that, register the info on Canvas to receive AWS


OPTIMIZATION
Learning as Optimization
[Figure: loss as a function of a weight parameter: learning as optimization]

to learn the weights, we need the derivative of the loss w.r.t. the weight
i.e. “how should the weight be updated to decrease the loss?”

$$w = w - \alpha \frac{\partial L}{\partial w}$$

with multiple weights, we need the gradient of the loss w.r.t. the weights

$$w = w - \alpha \nabla_w L$$

where $\alpha$ is the learning rate (a minimal sketch of this update follows)
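To make the update rule concrete, here is a minimal gradient descent loop in NumPy on a hypothetical least-squares problem; the data, learning rate, and iteration count are illustrative assumptions, not from the slides:

```python
import numpy as np

# Toy least-squares problem (hypothetical data): L(w) = 1/(2n) * ||X w - y||^2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def grad(w):
    # dL/dw for the mean squared error
    return X.T @ (X @ w - y) / len(y)

alpha = 0.1                      # learning rate
w = np.zeros(3)
for _ in range(500):
    w = w - alpha * grad(w)      # w <- w - alpha * dL/dw
```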
Gradient Descent
Convexity
Traditional loss functions are often assumed to be convex:

$$L(x_2) - L(x_1) \ge \nabla L(x_1)^{\top} (x_2 - x_1)$$

Easy to find global optima!
Strictly convex if the inequality is strict (the difference is always $> 0$) for $x_1 \neq x_2$
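As a quick worked check (not from the slides), take the scalar function $L(x) = x^2$, for which $\nabla L(x_1) = 2x_1$:

$$L(x_2) - L(x_1) - \nabla L(x_1)(x_2 - x_1) = x_2^2 - x_1^2 - 2x_1(x_2 - x_1) = (x_2 - x_1)^2 \ge 0,$$

so the first-order inequality holds, and it is strict whenever $x_2 \neq x_1$, which is exactly strict convexity.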
Convexity
• All local optima are global optima:

• Strictly convex: unique global optimum:

• Stochastic gradient descent will find the global optimum with a good learning rate
Poll
Optimizing Deep Neural Nets
[Figure: visualized loss landscape of VGG-56, comparing a convex-looking region with a non-convex region containing saddle points]

Li, Hao, et al. "Visualizing the Loss Landscape of Neural Nets." Advances in Neural Information Processing Systems. 2018.
Stochastic Gradient Descent
large-scale datasets (large $N$) lead to a memory bottleneck

stochastic gradient descent (SGD): $w = w - \alpha\, \tilde\nabla_w L$


use stochastic gradient estimate to descend the surface of the loss function

batch gradient:
$$\nabla_w L = \frac{1}{n}\sum_{i=1}^{n} \nabla_w L\big(f(x_i; w), y_i\big)$$

stochastic gradient (one randomly sampled example $i$):
$$\tilde\nabla_w L = \nabla_w L\big(f(x_i; w), y_i\big)$$
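For illustration (the linear model and data below are hypothetical, not from the slides), the two estimators can be computed as follows; the stochastic gradient is just the per-example gradient at one randomly drawn index:

```python
import numpy as np

# Hypothetical linear model f(x; w) = x @ w with squared-error loss
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5)
w = np.zeros(5)

def example_grad(i, w):
    # Gradient of 0.5 * (f(x_i; w) - y_i)^2 with respect to w
    return (X[i] @ w - y[i]) * X[i]

batch_grad = np.mean([example_grad(i, w) for i in range(len(y))], axis=0)
i = rng.integers(len(y))
stochastic_grad = example_grad(i, w)   # noisy, but unbiased, estimate of batch_grad
```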
Batch and Minibatch
mini-batch size: from $n$ (full batch) down to $1$ (pure SGD)

Batch gradient (mini-batch size $n$):
• Accurate gradient estimate
• Fast training
• Memory bottleneck
• Hard to parallelize
• Worse generalization

Stochastic gradient (mini-batch size $1$):
• Approximate gradient estimate
• Slow training
• Memory efficient
• Easy to parallelize
• Better generalization
The stochastic gradient needs to be an unbiased estimator of the true gradient:

$$\nabla_w L(w) = \mathbb{E}_{\xi}\big[\tilde\nabla_w L(w, \xi)\big]$$
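As a sanity check of the unbiasedness claim (illustrative only; the data and least-squares loss are assumptions), the empirical mean of many single-example gradients matches the full-batch gradient:

```python
import numpy as np

# Hypothetical least-squares setup: L(w) = mean_i 0.5 * (x_i . w - y_i)^2
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = X @ rng.normal(size=4)
w = rng.normal(size=4)

per_example = (X @ w - y)[:, None] * X    # gradient of each example's loss
full_grad = per_example.mean(axis=0)      # the true (batch) gradient

# Sampling the example index uniformly makes the stochastic gradient unbiased
samples = per_example[rng.integers(len(y), size=200_000)]
print(full_grad)
print(samples.mean(axis=0))   # close to full_grad: the estimator is unbiased
```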
Mini-Batch

• mini-batch implementation: https://d2l.ai/chapter_optimization/minibatch-sgd.html
• Mini-batch SGD trains faster than both full-batch GD and single-example SGD
• Trade-off between convergence speed and computational efficiency (a minimal training loop sketch follows)
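A minimal mini-batch SGD training loop in NumPy, in the spirit of the d2l.ai reference above; the least-squares model, batch size, and learning rate are illustrative assumptions:

```python
import numpy as np

# Hypothetical least-squares problem
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))
y = X @ rng.normal(size=8)

w = np.zeros(8)
alpha, batch_size = 0.1, 64
for epoch in range(10):
    perm = rng.permutation(len(y))               # reshuffle the data every epoch
    for start in range(0, len(y), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / len(idx)   # mini-batch gradient estimate
        w -= alpha * grad
```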
SGD is Not Enough
[Figure: loss surfaces with global minima and bad local minima]

• increased frequency and severity of bad local minima

• pathological curvature, like the type seen in the well-known Rosenbrock function
An Ill-Conditioned Example
• Consider optimizing $f(x) = 0.1 x_1^2 + 2 x_2^2$

• The function is very flat along the direction of $x_1$

• The gradient is much larger and changes rapidly along the direction of $x_2$

[Figure: gradient descent trajectories on $f$ with learning rates 0.4 and 0.6]

• Increasing the learning rate improves the convergence along the $x_1$ direction, but the overall solution is much worse (the iterates diverge along $x_2$)

• Key idea: can we average the past gradients?
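A minimal reproduction of this example (the starting point and step count are arbitrary choices, not from the slides): along $x_2$ each update multiplies the coordinate by $1 - 4\alpha$, so $\alpha = 0.6$ gives $|1 - 2.4| > 1$ and the iterates diverge.

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x) = 0.1*x1^2 + 2*x2^2
    return np.array([0.2 * x[0], 4.0 * x[1]])

def run_gd(alpha, steps=20, x0=(-5.0, -2.0)):
    x = np.array(x0)
    for _ in range(steps):
        x = x - alpha * grad_f(x)
    return x

print(run_gd(0.4))   # slow progress along x1; x2 oscillates but shrinks
print(run_gd(0.6))   # faster along x1, but x2 blows up
```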
Momentum
• SGD oscillates a lot and converges slowly in the direction with a flat landscape

• Use a "leaky" gradient: keep a fraction of the past gradient updates

$$v_t = \beta v_{t-1} - g_{t,t-1}, \qquad w_{t+1} = w_t + v_t$$

where $g_{t,t-1}$ denotes the current gradient and $\beta v_{t-1}$ carries the past gradients. Unrolling the recursion (with $v_0 = 0$):

$$v_t = \beta^2 v_{t-2} - \beta\, g_{t-1,t-2} - g_{t,t-1} = \cdots = -\sum_{\tau=0}^{t-1} \beta^{\tau}\, g_{t-\tau,\, t-\tau-1}$$
Example
[Figure: trajectories with learning rate 0.6 and momentum 0.25 vs. momentum 0.5]

• momentum = 0 recovers the original SGD equation

• momentum improves learning even with the same learning rate

• increasing the momentum can lead to oscillating trajectories

• but still much better than diverged solutions
Momentum
• Momentum: physical interpretation, update a velocity

SGD:
$$w_{t+1} = w_t - \alpha\, \tilde\nabla_w L(w_t)$$

SGD with momentum:
$$v_t = \beta v_{t-1} - \alpha\, \tilde\nabla_w L(w_t), \qquad w_{t+1} = w_t + v_t$$

[Figure: trajectories of SGD without momentum vs. SGD with momentum]

• Nesterov momentum: stronger theoretical guarantee; the gradient is evaluated at the look-ahead point
$$v_t = \beta v_{t-1} - \alpha\, \tilde\nabla_w L(w_t + \beta v_{t-1}), \qquad w_{t+1} = w_t + v_t$$
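A sketch of both update rules (the quadratic test function, hyperparameters, and the `grad_fn` interface are illustrative assumptions, not the course code):

```python
import numpy as np

def sgd_momentum(grad_fn, w0, alpha=0.1, beta=0.5, steps=100, nesterov=False):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        # Nesterov evaluates the gradient at the look-ahead point w + beta*v
        g = grad_fn(w + beta * v) if nesterov else grad_fn(w)
        v = beta * v - alpha * g    # v_t = beta * v_{t-1} - alpha * gradient
        w = w + v                   # w_{t+1} = w_t + v_t
    return w

# The ill-conditioned quadratic from earlier: f(x) = 0.1*x1^2 + 2*x2^2
grad_f = lambda x: np.array([0.2 * x[0], 4.0 * x[1]])
print(sgd_momentum(grad_f, [-5.0, -2.0]))
print(sgd_momentum(grad_f, [-5.0, -2.0], nesterov=True))
```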
Learning Rate
$$w = w - \alpha\, \tilde\nabla_w L \qquad \text{($\alpha$ is the learning rate)}$$

Learning rate is essential to convergence speed and accuracy

[Figure: loss curves $L(w)$ vs. $w$ for a learning rate that is too low, too high, and just right]
Simple Strategy
• Divide Loss Function by Number of Examples:

$$w_{t+1} = w_t - \frac{\alpha}{n}\, \tilde\nabla_w L(w_t)$$
• Start with a large step size
• If the loss plateaus, divide the step size by 2: $\alpha_t = \frac{1}{2^t}\, \alpha_0$
• (Can also use advanced optimization methods)
• (Step size must decrease over time to guarantee convergence to the global optimum)

Scale the learning rate with the iteration count: $\alpha_t = \dfrac{\alpha_0}{\sqrt{t + c}}$ (a small schedule sketch follows)
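Two of these schedules written as small helper functions; the plateau threshold and the constants are arbitrary illustrative choices:

```python
import math

def halve_on_plateau(alpha, recent_losses, tol=1e-3):
    # Halve the step size when the loss has essentially stopped improving
    if len(recent_losses) >= 2 and recent_losses[-2] - recent_losses[-1] < tol:
        return alpha / 2
    return alpha

def inverse_sqrt_schedule(alpha0, t, c=1.0):
    # alpha_t = alpha0 / sqrt(t + c)
    return alpha0 / math.sqrt(t + c)

print([round(inverse_sqrt_schedule(0.5, t), 3) for t in range(5)])
```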
Adaptive Learning Rate
• Potential issues of a fixed learning rate:

• parameters for common features converge rather quickly to their optimal values

• for infrequent features, we are still short of observing them sufficiently frequently before their optimal values can be determined

• Remedy: let the learning rate scale according to the features:

$$\alpha_t = \frac{\alpha_0}{\sqrt{s(i,t) + c}}$$

where $s(i, t)$ counts the number of nonzeros for parameter dimension $i$ that we have observed up to time $t$

• This applies only to the input features, but not to the gradients (see the sketch after this list)
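A tiny sketch of the per-feature counter $s(i,t)$ and the resulting per-coordinate learning rates; the data and constants are made up for illustration:

```python
import numpy as np

def feature_count_lr(X_seen, alpha0=0.1, c=1.0):
    # s[i] = number of observed examples with a nonzero value in feature i
    s = (X_seen != 0).sum(axis=0)
    return alpha0 / np.sqrt(s + c)

X_seen = np.array([[1.0, 0.0, 0.0],
                   [2.0, 0.0, 3.0],
                   [1.5, 0.0, 0.0]])
print(feature_count_lr(X_seen))   # rarely observed features keep a larger learning rate
```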
Adaptive Learning Rate
• AdaGrad: adaptively scales the learning rate for each parameter dimension
$$w_{t+1} = w_t - \frac{\epsilon}{\delta + \sqrt{r_t}} \odot g$$
where $g$ is the gradient and $\frac{\epsilon}{\delta + \sqrt{r_t}}$ is the adaptive per-dimension learning rate

• $r_t = r_{t-1} + g \odot g$ is the running sum of squared gradients

• Hessian matrix:
$$H = \begin{bmatrix} \dfrac{\partial^2 L}{\partial w_1^2} & \cdots & \dfrac{\partial^2 L}{\partial w_1 \partial w_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 L}{\partial w_n \partial w_1} & \cdots & \dfrac{\partial^2 L}{\partial w_n^2} \end{bmatrix}$$

• $\operatorname{diag}(H) \approx g \odot g$: the elementwise squared gradient serves as a cheap approximation to the diagonal of the Hessian


Adaptive Learning Rate
Cost is sensitive to the learning rate only in some directions in the parameter space: maintain a "memory" of previous gradients and scale gradients per parameter

• AdaGrad
gradient: $g = \tilde\nabla_w L(w_t)$
approximate Hessian: $r_t = r_{t-1} + g \odot g$
update: $w_{t+1} = w_t - \dfrac{\epsilon}{\delta + \sqrt{r_t}} \odot g$

• RMSProp
gradient: $g = \tilde\nabla_w L(w_t)$
approximate Hessian: $r_t = \rho r_{t-1} + (1 - \rho)\, g \odot g$
update: $w_{t+1} = w_t - \dfrac{\epsilon}{\delta + \sqrt{r_t}} \odot g$
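A minimal sketch of both updates; the step size $\epsilon$, the small constant $\delta$, $\rho$, and the test function are illustrative assumptions:

```python
import numpy as np

def adagrad_step(w, g, r, epsilon=0.1, delta=1e-7):
    r = r + g * g                                     # accumulate squared gradients
    return w - epsilon / (delta + np.sqrt(r)) * g, r

def rmsprop_step(w, g, r, epsilon=0.01, delta=1e-7, rho=0.9):
    r = rho * r + (1 - rho) * g * g                   # exponentially weighted average
    return w - epsilon / (delta + np.sqrt(r)) * g, r

# Example on the ill-conditioned quadratic f(x) = 0.1*x1^2 + 2*x2^2
grad_f = lambda x: np.array([0.2 * x[0], 4.0 * x[1]])
w, r = np.array([-5.0, -2.0]), np.zeros(2)
for _ in range(200):
    w, r = rmsprop_step(w, grad_f(w), r)
print(w)   # both coordinates make steady progress despite very different curvatures
```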
Comparison
• local minima and saddle points are largely not an issue
• in many dimensions, the optimizer can move in exponentially more directions

http://sebastianruder.com/optimizing-gradient-descent/index.html
