Gradient Descent_PR
• The size of these steps is called the learning rate, denoted η. With a high learning rate we can cover more ground each step, but we risk overshooting the lowest point, since the slope of the hill is constantly changing. With a very low learning rate we can confidently move in the direction of the negative gradient, because we recalculate it so frequently. A low learning rate is more precise, but calculating the gradient that often is time-consuming, so it will take a very long time to reach the bottom.
Cost function
• Update rule: b_{t+1} = b_t − η ∇J(b_t)
• Loss function: J(W) = ½ Σ_{k=1}^{N} (actual output_k − predicted output_k)²
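As a concrete illustration, here is a minimal Python sketch of these two formulas for a one-parameter least-squares model y ≈ b·x. The toy data, starting point, and learning rate are illustrative assumptions, not part of the original material.

```python
import numpy as np

# Minimal sketch of b_{t+1} = b_t - eta * gradJ(b_t) with the squared-error
# loss J(b) = 1/2 * sum_k (actual_k - predicted_k)^2. All values are assumed.

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])   # roughly y = 2x

def loss(b):
    # J(b) = 1/2 * sum of squared residuals
    return 0.5 * np.sum((y - b * x) ** 2)

def grad(b):
    # dJ/db = -sum_k (y_k - b * x_k) * x_k
    return -np.sum((y - b * x) * x)

eta = 0.01   # learning rate
b = 0.0      # initial guess
for t in range(100):
    b -= eta * grad(b)   # one gradient descent step

print(b, loss(b))        # b converges to ~1.99
```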
• Typically, the value of the learning rate is chosen manually. We usually start with a small value such as 0.1, 0.01, or 0.001 and adapt it based on whether the cost function is decreasing very slowly (increase the learning rate) or exploding / behaving erratically (decrease the learning rate), as the sketch below illustrates.
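To make that tuning rule concrete, this hypothetical sweep tries the three starting values mentioned above on the same quadratic loss as the previous sketch (its gradient is 30·b − 59.7).

```python
# Hypothetical learning-rate sweep on the quadratic loss from the sketch above.

def grad(b):
    return 30.0 * b - 59.7

for eta in (0.1, 0.01, 0.001):
    b = 0.0
    for t in range(50):
        b -= eta * grad(b)
    print(f"eta={eta}: b={b:.3f}")

# eta=0.1 explodes (each step overshoots and grows), so we would decrease it;
# eta=0.001 reduces the cost very slowly, so we would increase it;
# eta=0.01 converges cleanly to b ~= 1.99.
```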
Variants of Gradient Descent
Momentum-Based Gradient Descent
If I am repeatedly asked to move in the same direction, then I should probably gain some confidence and start taking bigger steps in that direction, just as a ball gains momentum while rolling down a slope.
The momentum update keeps an exponentially decaying history of past gradients:
• v_t = γ v_{t−1} + η ∇J(w_t)
• w_{t+1} = w_t − v_t
You can see that the current update is proportional not just to the present gradient but also to the gradients of previous steps, although their contribution shrinks by a factor of γ (gamma) at each time step. That is how we boost the magnitude of the update in gentle regions, where individual gradients are small but consistently point in the same direction, as the sketch below shows.
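As a rough illustration of that effect, this sketch compares plain gradient descent and momentum on the same toy loss with a deliberately small learning rate (a "gentle slope" scenario); γ = 0.9 and the other constants are assumed values.

```python
# Sketch: plain gradient descent vs. momentum on the toy quadratic loss,
# with a small learning rate so individual gradient steps are gentle.

def grad(b):
    return 30.0 * b - 59.7   # gradient of the quadratic loss from earlier sketches

eta, gamma = 0.001, 0.9      # small learning rate, assumed momentum coefficient

# Plain gradient descent for comparison.
b_plain = 0.0
for t in range(100):
    b_plain -= eta * grad(b_plain)

# Momentum: v accumulates past gradients, each decayed by gamma per step.
b_mom, v = 0.0, 0.0
for t in range(100):
    v = gamma * v + eta * grad(b_mom)   # v_t = gamma*v_{t-1} + eta*gradJ(w_t)
    b_mom -= v                          # w_{t+1} = w_t - v_t

print(b_plain, b_mom)  # momentum gets much closer to the optimum b* ~= 1.99
```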