
MODULE IV

BCS405C – OPTIMIZATION TECHNIQUE

MODULE IV - Convex Optimization-II


Contents
• Unconstrained optimization – Method of steepest ascent/descent
• Newton-Raphson (NR) method
• Gradient descent
• Mini-batch gradient descent
• Stochastic gradient descent
Unconstrained optimization – Method of steepest descent
https://www.youtube.com/watch?v=oKO_yDg8qw8

Example 1 – Minimize $f(x_1, x_2) = x_1 - x_2 + 2x_1^2 + 2x_1 x_2 + x_2^2$ starting with the initial approximation $X_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$ using the method of steepest descent.
Solution


$\dfrac{\partial f}{\partial x_1} = 1 + 4x_1 + 2x_2$

$\dfrac{\partial f}{\partial x_2} = -1 + 2x_1 + 2x_2$

Gradient $\nabla f = \begin{bmatrix} \partial f/\partial x_1 \\ \partial f/\partial x_2 \end{bmatrix} = \begin{bmatrix} 1 + 4x_1 + 2x_2 \\ -1 + 2x_1 + 2x_2 \end{bmatrix}$

$\nabla f(\text{at } X_1) = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$
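As a quick check, the gradient formula above can be verified numerically. The following is a minimal Python sketch; the names f, grad_f and numerical_grad are illustrative and not part of the module text.

def f(x1, x2):
    # Objective from Example 1
    return x1 - x2 + 2*x1**2 + 2*x1*x2 + x2**2

def grad_f(x1, x2):
    # Analytical gradient derived above
    return (1 + 4*x1 + 2*x2, -1 + 2*x1 + 2*x2)

def numerical_grad(x1, x2, h=1e-6):
    # Central finite differences, as an independent check
    d1 = (f(x1 + h, x2) - f(x1 - h, x2)) / (2*h)
    d2 = (f(x1, x2 + h) - f(x1, x2 - h)) / (2*h)
    return (d1, d2)

print(grad_f(0.0, 0.0))          # (1.0, -1.0), the gradient at X1
print(numerical_grad(0.0, 0.0))  # approximately (1.0, -1.0)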
Iteration 1

First we find the direction $S_1 = -\nabla f = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$

$X_1 + \lambda_1 S_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix} + \lambda_1 \begin{bmatrix} -1 \\ 1 \end{bmatrix} = \begin{bmatrix} -\lambda_1 \\ \lambda_1 \end{bmatrix}$

Next, evaluate the given function at this point:

$f(-\lambda_1, \lambda_1) = x_1 - x_2 + 2x_1^2 + 2x_1 x_2 + x_2^2 = -\lambda_1 - \lambda_1 + 2\lambda_1^2 - 2\lambda_1^2 + \lambda_1^2 = -2\lambda_1 + \lambda_1^2$

Next, $\dfrac{\partial f}{\partial \lambda_1} = -2 + 2\lambda_1$, and $\dfrac{\partial f}{\partial \lambda_1} = 0 \Rightarrow -2 + 2\lambda_1 = 0 \Rightarrow \lambda_1 = 1$

$X_2 = X_1 + \lambda_1 S_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix} + 1 \begin{bmatrix} -1 \\ 1 \end{bmatrix} = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$
$\nabla f = \begin{bmatrix} 1 + 4x_1 + 2x_2 \\ -1 + 2x_1 + 2x_2 \end{bmatrix}$ at $(-1, 1)$:

$\nabla f(\text{at } X_2) = \begin{bmatrix} -1 \\ -1 \end{bmatrix} \neq \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, so we repeat the iteration.
Iteration 2

$S_2 = -\nabla f = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$


$X_2 + \lambda_2 S_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix} + \lambda_2 \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} -1 + \lambda_2 \\ 1 + \lambda_2 \end{bmatrix}$

Next, evaluate the given function at this point:

$f(-1 + \lambda_2,\ 1 + \lambda_2) = (-1 + \lambda_2) - (1 + \lambda_2) + 2(-1 + \lambda_2)^2 + 2(-1 + \lambda_2)(1 + \lambda_2) + (1 + \lambda_2)^2 = 5\lambda_2^2 - 2\lambda_2 - 1$

Next, $\dfrac{\partial f}{\partial \lambda_2} = -2 + 10\lambda_2$, and $\dfrac{\partial f}{\partial \lambda_2} = 0 \Rightarrow \lambda_2 = \dfrac{1}{5}$

$X_3 = X_2 + \lambda_2 S_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix} + \dfrac{1}{5}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} -0.8 \\ 1.2 \end{bmatrix}$

$\nabla f = \begin{bmatrix} 1 + 4x_1 + 2x_2 \\ -1 + 2x_1 + 2x_2 \end{bmatrix}$ at $(-0.8, 1.2)$ gives $\begin{bmatrix} 0.2 \\ -0.2 \end{bmatrix} \neq \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, so we repeat the iteration.

Iteration 3

$S_3 = -\nabla f = \begin{bmatrix} -0.2 \\ 0.2 \end{bmatrix}$

$X_3 + \lambda_3 S_3 = \begin{bmatrix} -0.8 \\ 1.2 \end{bmatrix} + \lambda_3 \begin{bmatrix} -0.2 \\ 0.2 \end{bmatrix} = \begin{bmatrix} -0.8 - 0.2\lambda_3 \\ 1.2 + 0.2\lambda_3 \end{bmatrix}$

$f(-0.8 - 0.2\lambda_3,\ 1.2 + 0.2\lambda_3) = x_1 - x_2 + 2x_1^2 + 2x_1 x_2 + x_2^2 = 0.04\lambda_3^2 - 0.08\lambda_3 - 1.2$

$\dfrac{\partial f}{\partial \lambda_3} = 0.08\lambda_3 - 0.08$, and $\dfrac{\partial f}{\partial \lambda_3} = 0 \Rightarrow \lambda_3 = 1$

$X_4 = X_3 + \lambda_3 S_3 = \begin{bmatrix} -1 \\ 1.4 \end{bmatrix}$

$\nabla f = \begin{bmatrix} 1 + 4x_1 + 2x_2 \\ -1 + 2x_1 + 2x_2 \end{bmatrix}$ at $(-1, 1.4)$ gives $\begin{bmatrix} -0.2 \\ -0.2 \end{bmatrix} \neq \begin{bmatrix} 0 \\ 0 \end{bmatrix}$.

Since the gradient components ($\pm 0.2$) are already close to 0, let us stop the iteration here. If you want, you can do a few more iterations until the gradient is even closer to the zero vector.

We conclude that $(-1, 1.4)$ is approximately the minimum point. To find the minimum of $f$, calculate $f(-1, 1.4) = -1.24$.
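The whole steepest descent iteration above can be reproduced with a short Python script. This is only an illustrative sketch: it rewrites $f$ as the quadratic $\tfrac{1}{2}x^{T}Ax + b^{T}x$ with $A = \begin{bmatrix} 4 & 2 \\ 2 & 2 \end{bmatrix}$ and $b = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$, for which the exact line-search step has the closed form $\lambda = (g \cdot g)/(g^{T}Ag)$; the variable names are not from the module.

import numpy as np

# Steepest descent for f(x) = x1 - x2 + 2*x1**2 + 2*x1*x2 + x2**2, written as
# f(x) = 0.5*x@A@x + b@x with A = [[4, 2], [2, 2]] and b = [1, -1].
# For a quadratic, the exact line search along s = -g has step length
# lam = (g@g) / (g@A@g).

A = np.array([[4.0, 2.0], [2.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(x):
    return A @ x + b

x = np.array([0.0, 0.0])              # X1
for k in range(1, 10):
    g = grad(x)
    if np.linalg.norm(g) < 1e-6:      # stop when the gradient is (almost) zero
        break
    s = -g                            # steepest descent direction
    lam = (g @ g) / (g @ A @ g)       # exact line-search step length
    x = x + lam * s
    print(k, x)                       # X2 = [-1, 1], X3 = [-0.8, 1.2], X4 = [-1, 1.4], ...
# The iterates converge to the true minimum (-1, 1.5), where f = -1.25.

Running a few more iterations than in the hand computation shows the points closing in on $(-1, 1.5)$.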

Newton-Raphson (NR) method
https://www.youtube.com/watch?v=SIqfDj1DTyM
https://www.youtube.com/watch?v=1z1sD202jbE


The Newton-Raphson method is applied to the same function as in Example 1, $f(x_1, x_2) = x_1 - x_2 + 2x_1^2 + 2x_1 x_2 + x_2^2$, starting from $X_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$. The Jacobian (Hessian) matrix is $J_1 = \begin{bmatrix} 4 & 2 \\ 2 & 2 \end{bmatrix}$, so

$J_1^{-1} = \begin{bmatrix} 1/2 & -1/2 \\ -1/2 & 1 \end{bmatrix}$

$g_1 = \begin{bmatrix} \partial f/\partial x_1 \\ \partial f/\partial x_2 \end{bmatrix} = \begin{bmatrix} 1 + 4x_1 + 2x_2 \\ -1 + 2x_1 + 2x_2 \end{bmatrix} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$ (substituting $x_1 = 0,\ x_2 = 0$)

$X_2 = X_1 - J_1^{-1} g_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix} - \begin{bmatrix} 1/2 & -1/2 \\ -1/2 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} -1 \\ 3/2 \end{bmatrix}$

$g_2 = \begin{bmatrix} \partial f/\partial x_1 \\ \partial f/\partial x_2 \end{bmatrix} = \begin{bmatrix} 1 + 4x_1 + 2x_2 \\ -1 + 2x_1 + 2x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$ (substituting $x_1 = -1,\ x_2 = 3/2$)

As $g_2 = 0$, $X_2 = \begin{bmatrix} -1 \\ 3/2 \end{bmatrix}$ is the optimum point.

Hence $f\!\left(-1, \dfrac{3}{2}\right) = -\dfrac{5}{4}$ (by substituting $x_1 = -1$ and $x_2 = 3/2$ in the given function).
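For comparison, the Newton-Raphson step can also be written as a few lines of Python. This is an illustrative sketch only: since $f$ is quadratic the Hessian $J$ is constant and a single step reaches the optimum, and np.linalg.solve is used instead of forming $J^{-1}$ explicitly.

import numpy as np

# Newton-Raphson update X2 = X1 - J^{-1} g(X1) for the same quadratic,
# with Hessian J = [[4, 2], [2, 2]] and gradient g(x) = J @ x + [1, -1].

J = np.array([[4.0, 2.0], [2.0, 2.0]])
b = np.array([1.0, -1.0])

def f(x):
    return x[0] - x[1] + 2*x[0]**2 + 2*x[0]*x[1] + x[1]**2

def grad(x):
    return J @ x + b

x1 = np.array([0.0, 0.0])
x2 = x1 - np.linalg.solve(J, grad(x1))   # one Newton step
print(x2, grad(x2), f(x2))               # [-1.  1.5], [0. 0.], -1.25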

Mini-batch gradient descent & Stochastic gradient descent
https://www.youtube.com/watch?v=kmb5FuFCZKE
Gradient descent minimizes the loss function of a model. Mini-batch gradient descent reduces the loss by randomly shuffling the data points, dividing them into batches, and updating the parameters after the computation of each batch, which keeps the computation manageable and makes the optimization fast.
In gradient descent we update the weights so that the cost function value decreases; with every update we move closer to a local minimum.
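As a concrete illustration of this update rule, here is a minimal sketch of (full-batch) gradient descent on a small least-squares problem. The data, the learning rate eta, and the variable names are made up for illustration and are not from the module.

import numpy as np

# Full-batch gradient descent: w <- w - eta * dL/dw, where L is the mean
# squared error over the ENTIRE dataset. Every update uses all the data.

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))            # 500 samples, 3 features (illustrative)
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=500)

w = np.zeros(3)
eta = 0.1                                # learning rate (step size)
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the MSE loss
    w = w - eta * grad                      # move against the gradient
print(w)                                    # close to true_w = [2, -1, 0.5]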
The problem is that each update requires a pass over the entire dataset. If the dataset is too large (say, 1 million records), then for every weight update the model has to process a huge amount of data, so each step takes a long time. Apart from the 1 million records, the dataset may also have a large number of features, which further increases the processing time.
For example, if we have 100,000 images of size 1000 × 1000 pixels, that is about $10^{11}$ bytes of data (roughly 100 GB). Our RAM/GPU memory is limited (perhaps 64 GB), so it cannot hold or process this much data at once, and we will get an out-of-memory error.


To overcome this we use mini-batch gradient descent. It works by dividing the dataset into batches (for example, 100,000 records make 100 batches of 1,000 records each). Instead of passing the entire dataset, we pass one mini-batch to the model, update the weights, and then pass the next mini-batch. By the time the model has seen the entire dataset, we have already made a lot of progress by updating the weights 100 times. Passing mini-batches reduces the time taken per update and avoids out-of-memory errors. The only drawback of mini-batch gradient descent is that each update is computed from a small batch of data, so a single update reflects only the patterns in that specific mini-batch, not the patterns in the other mini-batches. A sketch of this batching is given below.
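Here is the batching described above as a minimal Python sketch. The dataset, batch size and names are illustrative; setting batch_size = 1 turns this into stochastic gradient descent.

import numpy as np

# Mini-batch gradient descent: shuffle the data, split it into batches, and
# update the weights after every batch instead of after the whole dataset.

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))       # stands in for a large dataset
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=100_000)

w = np.zeros(10)
eta = 0.05
batch_size = 1_000                       # 100,000 records -> 100 batches of 1,000

for epoch in range(3):                   # a few passes over the whole dataset
    order = rng.permutation(len(y))      # shuffle once per epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb)  # gradient on this batch only
        w = w - eta * grad                          # weight update per batch
print(np.max(np.abs(w - true_w)))        # small: w is close to true_w
# batch_size = 1 would give stochastic gradient descent (one record per update)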

For example, if we train using the entire dataset, every weight update moves smoothly towards the local minimum, and the loss curve gradually approaches it. But if we use mini-batches, the descent does not head straight to the local minimum: it follows a zig-zag curve and eventually reaches the neighbourhood of the local minimum.


With mini-batches, the weight updates do not follow the smooth parabolic curve; they follow a zig-zag path until they reach the local minimum. In fact, the iterates do not reach the exact point of the local minimum but only a point around or close to it. This zig-zag behaviour is called noise. If the dataset contains only 1,000 records (rather than 100,000), it is pointless to use mini-batch gradient descent, because most of the time would be wasted following that zig-zag path instead of travelling along the more direct full-batch path.

https://www.youtube.com/watch?v=qOeU9GCnU3w --- Gradient

Dr. Ranjini. P. S, M.Sc., M. Phil, Ph. D, M. Tech in Data Science & Machine Learning,
Professor, Department of Artificial Intelligence & Data Science,
Don Bosco Institute of Bangalore.
