Implement 03-1
Deep Learning
Rose Yu
ANNOUNCEMENT
Kaggle Group Signup
learning as optimization
Weight = parameter (what the model learns)
to learn the weights, we need the derivative of the loss w.r.t. the weight
i.e. “how should the weight be updated to decrease the loss?”
w = w − α · ∂L/∂w
with multiple weights, we need the gradient of the loss w.r.t. the weights
w = w − α ∇_w L,   where α is the learning rate
Gradient Descent
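To make the update rule above concrete, here is a minimal NumPy sketch of gradient descent on a toy squared-error loss; the data, the loss, the learning rate, and the iteration count are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Toy squared-error loss for a linear model: L(w) = (1/2N) * sum_i (x_i . w - y_i)^2
def loss_and_grad(w, X, y):
    residual = X @ w - y                 # shape (N,)
    loss = 0.5 * np.mean(residual ** 2)
    grad = X.T @ residual / len(y)       # dL/dw, shape (d,)
    return loss, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # illustrative data
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)
alpha = 0.1                              # learning rate
for t in range(500):
    loss, grad = loss_and_grad(w, X, y)
    w = w - alpha * grad                 # gradient descent update: w <- w - alpha * dL/dw

print(loss, w)                           # loss near 0, w close to w_true
```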
Convexity
Traditional loss functions often assume convexity
L(x_2) − L(x_1) ≥ ∇L(x_1)^⊤ (x_2 − x_1)
Easy to find global optima!
Strictly convex if the inequality is strict (the difference is always > 0)
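As a quick numerical sanity check of the first-order condition above, the snippet below evaluates both sides for the convex function L(x) = ‖x‖²; the choice of function and the random test points are assumptions made only for illustration.

```python
import numpy as np

# First-order convexity condition, checked on L(x) = ||x||^2 (gradient 2x).
def L(x):
    return np.dot(x, x)

def grad_L(x):
    return 2 * x

rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=3), rng.normal(size=3)

lhs = L(x2) - L(x1)
rhs = grad_L(x1) @ (x2 - x1)
print(lhs >= rhs)   # True: a convex function lies above all of its tangent planes
```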
Convexity
• All local optima are global optima:
(figure: a convex loss surface vs. a non-convex loss surface with a saddle point)
stochastic gradient descent
large-scale datasets (large N) lead to a memory bottleneck, so the gradient is estimated on a sampled mini-batch instead of the full dataset
Batch and Minibatch
mini-batch gradient: g = (1/n) Σ_{i∈B} ∇_w L_i(w),   where B is the sampled mini-batch and n is the mini-batch size
• mini-batch implementation: https://d2l.ai/chapter_optimization/minibatch-sgd.html
• Mini-batch SGD is faster than both full-batch GD and single-example SGD (see the sketch below)
• Trade-off between convergence speed and computational efficiency
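The d2l.ai page linked above gives a full implementation; below is only a minimal NumPy sketch of the same idea, where the toy regression data, the batch size of 32, and the fixed learning rate are assumptions chosen for illustration.

```python
import numpy as np

# Minimal mini-batch SGD on a toy linear-regression loss.
rng = np.random.default_rng(0)
N, d = 10_000, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=N)

w = np.zeros(d)
alpha, batch_size = 0.1, 32                      # n = mini-batch size

for epoch in range(5):
    perm = rng.permutation(N)                    # shuffle, then sweep over mini-batches
    for start in range(0, N, batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / len(idx)   # (1/n) * sum of per-example gradients
        w = w - alpha * grad                     # one SGD step on the mini-batch

print(w)   # close to w_true
```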
SGD is Not Enough
w_{t+1} = w_t − α ∇̃_w L(w_t)
(figure: effect of the learning rate α on reaching the global minimum: too low, too high, just right)
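A tiny sketch of the three regimes in the figure, using gradient descent on the 1-D quadratic L(w) = 0.5 w²; the specific step sizes are assumptions chosen only to show the qualitative behavior.

```python
# Gradient descent on L(w) = 0.5 * w^2 (gradient = w) with three step sizes.
def run_gd(alpha, steps=20, w0=5.0):
    w = w0
    for _ in range(steps):
        w = w - alpha * w           # w_{t+1} = w_t - alpha * dL/dw
    return w

print(run_gd(alpha=0.01))   # too low: barely moves toward the minimum at w = 0
print(run_gd(alpha=2.5))    # too high: |w| grows, the iterates diverge
print(run_gd(alpha=0.5))    # just right: converges quickly to roughly 0
```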
Simple Strategy
• Divide Loss Function by Number of Examples:
w_{t+1} = w_t − (α/n) ∇̃_w L(w_t)
• Start with a large step size
• If the loss plateaus, divide the step size by 2: α_t = α_0 / 2^t
• (Can also use advanced optimization methods)
• (Step size must decrease over time to guarantee convergence to the global optimum)
• Scale the learning rate by the iteration count: α_t = α_0 / (t + c) (see the schedule sketch below)
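A sketch of the two schedules above written as plain Python functions; the function names and the example values are assumptions, not from the slides.

```python
def halving_schedule(alpha0, num_plateaus):
    """alpha_t = alpha_0 / 2^t, where t counts how many times the loss has plateaued."""
    return alpha0 / (2 ** num_plateaus)

def inverse_time_schedule(alpha0, t, c=1.0):
    """alpha_t = alpha_0 / (t + c): shrink the step size with the iteration count t."""
    return alpha0 / (t + c)

alpha0 = 0.5                                                    # start with a large step size
print([halving_schedule(alpha0, k) for k in range(4)])          # 0.5, 0.25, 0.125, 0.0625
print([round(inverse_time_schedule(alpha0, t), 3) for t in range(4)])   # 0.5, 0.25, 0.167, 0.125
```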
Adaptive Learning Rate
• Potential issues of a fixed learning rate:
• parameters for common features converge quickly to their optimal values, while parameters for rare features receive few updates and converge slowly
• preprocessing can rescale the input features, but not the gradients seen during training
Adaptive Learning Rate
• AdaGrad: adaptively scales the learning rate for each parameter dimension:
w_{t+1} = w_t − ε/(δ + √r) ⊙ g,   where g is the gradient and ε/(δ + √r) is the per-dimension adaptive learning rate
• Hessian matrix H = [∂²L / (∂w_i ∂w_j)]: the n×n matrix of second derivatives, with ∂²L/∂w_i² on the diagonal and ∂²L/(∂w_i ∂w_j) off the diagonal
• AdaGrad
gradient: g = ∇̃_w L(w_t)
approximate Hessian: r_t = r_{t−1} + g ⊙ g
update: w_{t+1} = w_t − ε/(δ + √r_t) ⊙ g
(a code sketch for both AdaGrad and RMSProp follows below)
• RMSProp
gradient: g = ∇̃_w L(w_t)
approximate Hessian (exponential moving average): r_t = ρ r_{t−1} + (1 − ρ) g ⊙ g
update: w_{t+1} = w_t − ε/(δ + √r_t) ⊙ g
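The sketch below implements both updates as single-step functions and compares them on a toy problem. It assumes the square root of the accumulator in the denominator, as in the standard formulations of AdaGrad and RMSProp; the values for δ and ρ are common defaults, and the data and step sizes are illustrative assumptions tuned by hand for this toy problem.

```python
import numpy as np

def adagrad_step(w, g, r, eps=0.01, delta=1e-7):
    """AdaGrad: accumulate all squared gradients, so the per-dimension step keeps shrinking."""
    r = r + g * g                               # r_t = r_{t-1} + g * g
    w = w - eps / (delta + np.sqrt(r)) * g      # per-dimension adaptive learning rate
    return w, r

def rmsprop_step(w, g, r, rho=0.9, eps=0.01, delta=1e-6):
    """RMSProp: exponential moving average of squared gradients, so old gradients are forgotten."""
    r = rho * r + (1.0 - rho) * g * g           # r_t = rho * r_{t-1} + (1 - rho) * g * g
    w = w - eps / (delta + np.sqrt(r)) * g
    return w, r

# Toy usage: linear regression with exact targets (illustrative data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w_a, r_a = np.zeros(3), np.zeros(3)
w_r, r_r = np.zeros(3), np.zeros(3)
for t in range(2000):
    g_a = X.T @ (X @ w_a - y) / len(y)          # full-batch gradient for the AdaGrad run
    g_r = X.T @ (X @ w_r - y) / len(y)          # full-batch gradient for the RMSProp run
    w_a, r_a = adagrad_step(w_a, g_a, r_a, eps=0.5)
    w_r, r_r = rmsprop_step(w_r, g_r, r_r)

print(w_a, w_r)   # both end up close to w_true
```

The practical difference: AdaGrad's accumulator r only grows, so its effective step size keeps shrinking for the rest of training, while RMSProp's decayed average lets the step size recover, which is why RMSProp tends to work better for non-convex deep models.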
Comparison
local minima and saddle points are largely not an issue in practice
in many dimensions, the optimizer can move in exponentially more directions, so it is unlikely that every direction increases the loss
Source: http://sebastianruder.com/optimizing-gradient-descent/index.html