Gradient Descent_PR

Gradient Descent is an optimization technique used to minimize the cost function in deep learning and neural networks by iteratively moving in the direction of the steepest descent. The learning rate determines the size of the steps taken towards the minimum, with high rates risking overshooting and low rates leading to slow convergence. Variants of gradient descent, such as Stochastic Gradient Descent and Mini-Batch Gradient Descent, improve computational efficiency by using subsets of data for gradient calculations.


Gradient Descent

Dr. Preeti Rai


Professor
Department of CSE
Gradient Descent
• Gradient Descent is an optimization technique used to improve deep learning and neural-network models by minimizing the cost function.
• It is an optimization algorithm that minimizes a function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient.
Learning rate

• The size of these steps is called the learning rate (η). With a high learning rate we can cover more ground each step, but we risk overshooting the lowest point, since the slope of the hill is constantly changing. With a very low learning rate, we can confidently move in the direction of the negative gradient, since we are recalculating it so frequently. A low learning rate is more precise, but calculating the gradient is time-consuming, so it will take us a very long time to get to the bottom.
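This trade-off can be seen numerically; a minimal sketch, assuming the simple function f(w) = w² (an illustrative choice, not from the slides), whose gradient is 2w:

```python
# Toy demo: gradient descent on f(w) = w^2, whose gradient is 2w.
# A small learning rate converges slowly; a large one overshoots.

def descend(eta, steps=20, w=5.0):
    for _ in range(steps):
        w = w - eta * 2 * w  # w_new = w_old - eta * gradient
    return w

print(descend(eta=0.01))  # small eta: still far from the minimum at 0
print(descend(eta=0.4))   # moderate eta: very close to 0
print(descend(eta=1.1))   # too large: |w| grows, every step overshoots
```

With η = 1.1 each step jumps past the minimum to a point farther away than where it started, so the iterates diverge instead of settling at w = 0.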
Cost function

• A loss function tells us "how good" our model is at making predictions for a given set of parameters. The cost function has its own curve and its own gradients; the slope of this curve tells us how to update our parameters to make the model more accurate.

Loss Function J(W) = (actual output − predicted output)²


Pseudocode for Gradient Descent

• Gradient descent is used to minimize a cost function J(W) parameterized by the model parameters W.
• The gradient (or derivative) tells us the incline or slope of the cost function. Hence, to minimize the cost function, we move in the direction opposite to the gradient.
• Initialize the weights W randomly.
• Calculate the gradients G of the cost function w.r.t. the parameters. This is done using partial differentiation: G = ∂J(W)/∂W. The value of the gradient G depends on the inputs, the current values of the model parameters, and the cost function.
• Update the weights by an amount proportional to G, i.e. Wnew = Wold − ηG.
• Repeat until the cost J(W) stops reducing, or some other pre-defined termination criterion is met.
1. Initialize the weight (W = 3)
2. Calculate the predicted output = X·W
3. Cost function J(W) = (actual output − predicted output)²
4. Calculate the gradient of the cost function
5. G = ∂J(W)/∂W
6. Wnew = Wold − η·G
• In step 6, η is the learning rate, which determines the size of the steps we take to reach a minimum. We need to be very careful about this parameter: high values of η may overshoot the minimum, and very low values will reach the minimum very slowly.
• A popular choice for the termination criteria is that the cost J(w) stops
reducing on a validation dataset.
Vanishing gradient problem
In neural networks, during backpropagation each weight receives an update proportional to the partial derivative of the error function. In some cases this derivative term is so small that it makes the updates very small, especially in the deep layers of the network, where the update is obtained by multiplying many partial derivatives together.
If these partial derivatives are very small, the overall update becomes very small and approaches zero. In that case the weights cannot update, and there will be slow or no convergence. This problem is known as the vanishing gradient problem.
Exploding gradient problem.
Similarly, if the derivative term is very large, the updates will also be very large. In such a case the algorithm will overshoot the minimum and won't be able to converge. This problem is known as the exploding gradient problem.
There are various methods to avoid these problems; choosing an appropriate activation function is one of them.
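A quick numeric sketch of why long chains of multiplied partial derivatives vanish or explode; the layer count and derivative magnitudes here are illustrative assumptions:

```python
# Backprop multiplies per-layer derivatives along the chain rule.
# Sigmoid derivatives are at most 0.25, so deep chains shrink fast.

layers = 20

vanishing = 1.0
for _ in range(layers):
    vanishing *= 0.25   # typical sigmoid derivative magnitude
print(vanishing)        # ~9e-13: the update effectively disappears

exploding = 1.0
for _ in range(layers):
    exploding *= 3.0    # a large derivative term instead
print(exploding)        # ~3.5e9: the update blows up
```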
Gradient Descent Rule
• wt+1 = wt − η∇wt

• bt+1 = bt − η∇bt

• where ∇wt = ∂J(w, b)/∂w at w = wt, b = bt,

• ∇bt = ∂J(w, b)/∂b at w = wt, b = bt
Gradient Descent Rule
• wt+1 = wt − η∇wt

• where ∇wt = ∂J(w, b)/∂w at w = wt, b = bt,

• Loss Function J(W) = (actual output_i − predicted output_i)²  (for a single example i)

• Loss Function J(W) = Σ_{k=1}^{N} (actual output_k − predicted output_k)²  (summed over all N examples)
• Typically, the value of the learning rate is chosen manually. We usually
start with a small value such as 0.1, 0.01 or 0.001 and adapt it based
on whether the cost function is reducing very slowly (increase
learning rate) or is exploding / being erratic (decrease learning rate).
Variants of Gradient Descent

• There are multiple variants of gradient descent, depending on how much of the data is used to calculate the gradient.
• The main reason for these variations is computational efficiency. A dataset may have millions of data points, and calculating the gradient over the entire dataset can be computationally expensive.
• Batch gradient descent computes the gradient of the cost function w.r.t. the parameters W over the entire training data. Since we need to calculate the gradients for the whole dataset to perform one parameter update, batch gradient descent can be very slow.
Stochastic gradient descent (SGD)
• computes the gradient for each update using a single training data
point x_i (chosen at random). The idea is that the gradient calculated
this way is a stochastic approximation to the gradient calculated using
the entire training data. Each update is now much faster to calculate
than in batch gradient descent, and over many updates, we will head
in the same general direction.
• SGD can be used for larger datasets. It converges faster when the dataset is large because it updates the parameters more frequently.
Stochastic gradient descent (SGD)
• Take an example
• Feed it to the Neural Network
• Calculate its gradient
• Use the gradient we calculated in step 3 to update the weights
• Repeat steps 1–4 for all the examples in the training dataset
• Since we are considering just one example at a time, the cost will fluctuate over the training examples and will not necessarily decrease. But in the long run you will see the cost decreasing, with fluctuations.
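The steps above can be sketched as follows, assuming a toy linear model y = x·w; the dataset, learning rate, and epoch count are illustrative choices:

```python
import random

# Stochastic gradient descent: one randomly chosen example per update.
def sgd(data, w=0.0, eta=0.05, epochs=50):
    for _ in range(epochs):
        random.shuffle(data)            # visit examples in random order
        for x, y in data:               # one example at a time
            g = -2 * x * (y - x * w)    # gradient on this single point
            w = w - eta * g             # immediate, noisy update
    return w

# Points drawn from y = 3x, so w should approach 3:
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
print(sgd(data))
```

Each single-example update is noisy, but every update here pulls w toward 3, so over many epochs the estimate settles near the true weight.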
mini-batch gradient descent
• In mini-batch gradient descent, we calculate the gradient for each small mini-batch of training data. That is, we first divide the training data into small batches (say, M samples per batch) and perform one update per mini-batch. M is usually in the range 30–500, depending on the problem. Mini-batch GD is popular because computing infrastructure (compilers, CPUs, GPUs) is often optimized for performing vector additions and vector multiplications.
• Of these, SGD and mini-batch GD are the most popular.
mini-batch gradient descent
• So, after creating the mini-batches of fixed size, we do the following
steps in one epoch:
• Pick a mini-batch
• Feed it to Neural Network
• Calculate the mean gradient of the mini-batch
• Use the mean gradient we calculated in step 3 to update the weights
• Repeat steps 1–4 for the mini-batches we created
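The epoch loop above can be sketched as follows; the batch size M = 2 and the tiny dataset are illustrative assumptions (real mini-batches are usually 30–500 samples, as noted above):

```python
# Mini-batch gradient descent: one update per batch of M samples,
# using the mean gradient over the batch.
def minibatch_gd(data, w=0.0, eta=0.1, M=2, epochs=100):
    for _ in range(epochs):
        for i in range(0, len(data), M):       # pick a mini-batch
            batch = data[i:i + M]
            grads = [-2 * x * (y - x * w) for x, y in batch]
            mean_g = sum(grads) / len(batch)   # mean gradient of batch
            w = w - eta * mean_g               # one update per batch
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2x
print(minibatch_gd(data))  # approaches 2.0
```

Averaging the gradients within a batch smooths out the per-example noise of SGD while still giving several updates per pass over the data.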
Momentum-Based Gradient Descent
If I am repeatedly being asked to move in the same direction then I should probably gain some confidence and start taking
bigger steps in that direction. Just as a ball gains momentum while rolling down a slope.

We accommodate the momentum concept in the gradient update rule as follows:

update_t = γ·update_{t−1} + η∇wt
wt+1 = wt − update_t

In addition to the current gradient, we also look at the history of updates. Take your time to process the new update rule, and try putting on paper how the update term changes at every step. Breaking it down, you can see that the current update is proportional not just to the present gradient but also to the gradients of previous steps, although their contribution is reduced by a factor of γ (gamma) at each time step. That is how we boost the magnitude of the update in gentle regions.
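A minimal sketch of the momentum update on a toy objective f(w) = w²; the function, η, and step count are illustrative assumptions, with γ = 0.9 being a common choice:

```python
# Momentum-based gradient descent on f(w) = w^2 (gradient 2w).
# Each update accumulates a decaying history of past gradients.
def momentum_gd(w=5.0, eta=0.05, gamma=0.9, steps=200):
    update = 0.0
    for _ in range(steps):
        grad = 2 * w
        update = gamma * update + eta * grad  # history decays by gamma
        w = w - update                        # bigger steps when the
    return w                                  # direction is consistent

print(momentum_gd())  # settles near the minimum at 0
```

Because consecutive gradients on this slope point the same way, the accumulated update grows like a ball gaining speed; once past the minimum, the sign flips and momentum causes a brief overshoot before settling.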
