Gradient Descent_PR

Gradient Descent is an optimization technique used to minimize the cost function in deep learning and neural networks by iteratively moving in the direction of the steepest descent. The learning rate determines the size of the steps taken towards the minimum, with high rates risking overshooting and low rates leading to slow convergence. Variants of gradient descent, such as Stochastic Gradient Descent and Mini-Batch Gradient Descent, improve computational efficiency by using subsets of data for gradient calculations.


Gradient Descent

Dr. Preeti Rai


Professor
Department of CSE
Gradient Descent
• Gradient Descent is an optimization technique used to improve deep learning and neural-network models by minimizing the cost function.
• It is an optimization algorithm that minimizes a function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient.
Learning rate

• The size of these steps is called the learning rate (η). With a high learning rate we can cover more ground each step, but we risk overshooting the lowest point, since the slope of the hill is constantly changing. With a very low learning rate, we can confidently move in the direction of the negative gradient, since we are recalculating it so frequently. A low learning rate is more precise, but calculating the gradient is time-consuming, so it will take us a very long time to get to the bottom.
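This trade-off can be seen numerically; a minimal sketch, assuming the simple function f(w) = w² (an illustrative choice, not from the slides), whose gradient is 2w:

```python
# Toy demo: gradient descent on f(w) = w^2, whose gradient is 2w.
# A small learning rate converges slowly; a large one overshoots.

def descend(eta, steps=20, w=5.0):
    for _ in range(steps):
        w = w - eta * 2 * w  # w_new = w_old - eta * gradient
    return w

print(descend(eta=0.01))  # small eta: still far from the minimum at 0
print(descend(eta=0.4))   # moderate eta: very close to 0
print(descend(eta=1.1))   # too large: |w| grows, every step overshoots
```

With η = 1.1 each step jumps past the minimum to a point farther away than where it started, so the iterates diverge instead of settling at w = 0.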
Cost function

• A loss function tells us "how good" our model is at making predictions for a given set of parameters. The cost function has its own curve and its own gradients; the slope of this curve tells us how to update our parameters to make the model more accurate.

Loss Function J(W) = (actual output − predicted output)²


Pseudocode for Gradient Descent

• Gradient descent is used to minimize a cost function J(W) parameterized by the model parameters W.
• The gradient (or derivative) tells us the incline or slope of the cost function. Hence, to minimize the cost function, we move in the direction opposite to the gradient.
• Initialize the weights W randomly.
• Calculate the gradients G of the cost function w.r.t. the parameters. This is done using partial differentiation: G = ∂J(W)/∂W. The value of the gradient G depends on the inputs, the current values of the model parameters, and the cost function.
• Update the weights by an amount proportional to G, i.e. Wnew = Wold − ηG.
• Repeat until the cost J(W) stops reducing, or some other pre-defined termination criterion is met.
1. Initialize the weight (W = 3)
2. Calculate the predicted output = X·W
3. Cost function J(W) = (actual output − predicted output)²
4. Calculate the gradient of the cost function
5. G = ∂J(W)/∂W
6. Wnew = Wold − η·G
• In step 6, η is the learning rate, which determines the size of the steps we take to reach a minimum. We need to be very careful about this parameter: high values of η may overshoot the minimum, and very low values will reach the minimum very slowly.
• A popular choice for the termination criteria is that the cost J(w) stops
reducing on a validation dataset.
Vanishing gradient problem
In neural networks, during backpropagation each weight receives an update proportional to the partial derivative of the error function. In some cases this derivative term is so small that it makes the updates very small, especially in the deep layers of the network, where the update is obtained by multiplying many partial derivatives together.
If these partial derivatives are very small, the overall update becomes very small and approaches zero. In that case the weights cannot update, and there will be slow or no convergence. This problem is known as the vanishing gradient problem.
Exploding gradient problem.
Similarly, if the derivative term is very large, the updates will also be very large. In such a case the algorithm will overshoot the minimum and won't be able to converge. This problem is known as the exploding gradient problem.
There are various methods to avoid these problems; choosing an appropriate activation function is one of them.
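A quick numeric sketch of why long chains of multiplied partial derivatives vanish or explode; the layer count and derivative magnitudes here are illustrative assumptions:

```python
# Backprop multiplies per-layer derivatives along the chain rule.
# Sigmoid derivatives are at most 0.25, so deep chains shrink fast.

layers = 20

vanishing = 1.0
for _ in range(layers):
    vanishing *= 0.25   # typical sigmoid derivative magnitude
print(vanishing)        # ~9e-13: the update effectively disappears

exploding = 1.0
for _ in range(layers):
    exploding *= 3.0    # a large derivative term instead
print(exploding)        # ~3.5e9: the update blows up
```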
Gradient Descent Rule
• wt+1 = wt − η∇wt

• bt+1 = bt − η∇bt

• where ∇wt = ∂J(w, b)/∂w at w = wt, b = bt,

• ∇bt = ∂J(w, b)/∂b at w = wt, b = bt
Gradient Descent Rule
• wt+1 = wt − η∇wt

• where ∇wt = ∂J(w, b)/∂w at w = wt, b = bt,

• Loss Function J(W) = (actual output_i − predicted output_i)²  (for a single example i)

• Loss Function J(W) = Σ_{k=1}^{N} (actual output_k − predicted output_k)²  (summed over all N examples)
• Typically, the value of the learning rate is chosen manually. We usually
start with a small value such as 0.1, 0.01 or 0.001 and adapt it based
on whether the cost function is reducing very slowly (increase
learning rate) or is exploding / being erratic (decrease learning rate).
Variants of Gradient Descent

• There are multiple variants of gradient descent, depending on how much of the data is used to calculate the gradient.
• The main reason for these variations is computational efficiency. A dataset may have millions of data points, and calculating the gradient over the entire dataset can be computationally expensive.
• Batch gradient descent computes the gradient of the cost function w.r.t. the parameters W over the entire training data. Since we need to calculate the gradients for the whole dataset to perform one parameter update, batch gradient descent can be very slow.
Stochastic gradient descent (SGD)
• computes the gradient for each update using a single training data
point x_i (chosen at random). The idea is that the gradient calculated
this way is a stochastic approximation to the gradient calculated using
the entire training data. Each update is now much faster to calculate
than in batch gradient descent, and over many updates, we will head
in the same general direction.
• SGD can be used for larger datasets. It converges faster when the dataset is large because it updates the parameters more frequently.
Stochastic gradient descent (SGD)
• Take an example
• Feed it to the Neural Network
• Calculate its gradient
• Use the gradient we calculated in step 3 to update the weights
• Repeat steps 1–4 for all the examples in the training dataset
• Since we are considering just one example at a time, the cost will fluctuate over the training examples and will not necessarily decrease. But in the long run you will see the cost decreasing, with fluctuations.
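The steps above can be sketched as follows, assuming a toy linear model y = x·w; the dataset, learning rate, and epoch count are illustrative choices:

```python
import random

# Stochastic gradient descent: one randomly chosen example per update.
def sgd(data, w=0.0, eta=0.05, epochs=50):
    for _ in range(epochs):
        random.shuffle(data)            # visit examples in random order
        for x, y in data:               # one example at a time
            g = -2 * x * (y - x * w)    # gradient on this single point
            w = w - eta * g             # immediate, noisy update
    return w

# Points drawn from y = 3x, so w should approach 3:
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
print(sgd(data))
```

Each single-example update is noisy, but every update here pulls w toward 3, so over many epochs the estimate settles near the true weight.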
mini-batch gradient descent
• In mini-batch gradient descent, we calculate the gradient for each small mini-batch of training data. That is, we first divide the training data into small batches (say, M samples per batch) and perform one update per mini-batch. M is usually in the range 30–500, depending on the problem. Mini-batch GD is popular because computing infrastructure (compilers, CPUs, GPUs) is often optimized for performing vector additions and vector multiplications.
• Of these, SGD and mini-batch GD are the most popular.
mini-batch gradient descent
• So, after creating the mini-batches of fixed size, we do the following
steps in one epoch:
• Pick a mini-batch
• Feed it to Neural Network
• Calculate the mean gradient of the mini-batch
• Use the mean gradient we calculated in step 3 to update the weights
• Repeat steps 1–4 for the mini-batches we created
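The epoch loop above can be sketched as follows; the batch size M = 2 and the tiny dataset are illustrative assumptions (real mini-batches are usually 30–500 samples, as noted above):

```python
# Mini-batch gradient descent: one update per batch of M samples,
# using the mean gradient over the batch.
def minibatch_gd(data, w=0.0, eta=0.1, M=2, epochs=100):
    for _ in range(epochs):
        for i in range(0, len(data), M):       # pick a mini-batch
            batch = data[i:i + M]
            grads = [-2 * x * (y - x * w) for x, y in batch]
            mean_g = sum(grads) / len(batch)   # mean gradient of batch
            w = w - eta * mean_g               # one update per batch
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2x
print(minibatch_gd(data))  # approaches 2.0
```

Averaging the gradients within a batch smooths out the per-example noise of SGD while still giving several updates per pass over the data.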
Momentum-Based Gradient Descent
If I am repeatedly being asked to move in the same direction then I should probably gain some confidence and start taking
bigger steps in that direction. Just as a ball gains momentum while rolling down a slope.

We accommodate the momentum concept in the gradient update rule as follows:

update_t = γ·update_{t−1} + η∇wt
wt+1 = wt − update_t

In addition to the current gradient, we also look at the history of updates. Take your time to process the new update rule, and try putting on paper how the update term changes at every step. Breaking it down, you can see that the current update is proportional not just to the present gradient but also to the gradients of previous steps, although their contribution is reduced by a factor of γ (gamma) at each time step. That is how we boost the magnitude of the update in gentle regions.
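A minimal sketch of the momentum update on a toy objective f(w) = w²; the function, η, and step count are illustrative assumptions, with γ = 0.9 being a common choice:

```python
# Momentum-based gradient descent on f(w) = w^2 (gradient 2w).
# Each update accumulates a decaying history of past gradients.
def momentum_gd(w=5.0, eta=0.05, gamma=0.9, steps=200):
    update = 0.0
    for _ in range(steps):
        grad = 2 * w
        update = gamma * update + eta * grad  # history decays by gamma
        w = w - update                        # bigger steps when the
    return w                                  # direction is consistent

print(momentum_gd())  # settles near the minimum at 0
```

Because consecutive gradients on this slope point the same way, the accumulated update grows like a ball gaining speed; once past the minimum, the sign flips and momentum causes a brief overshoot before settling.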
