OPTIMIZATION – MODULE IV
Dr. Ranjini. P. S, M.Sc., M. Phil, Ph. D, M. Tech in Data Science & Machine Learning,
Professor, Department of Artificial Intelligence & Data Science,
Don Bosco Institute of Bangalore.
BCS405C – OPTIMIZATION TECHNIQUE
To minimize $f(x_1, x_2) = x_1 - x_2 + 2x_1^2 + 2x_1x_2 + x_2^2$ by the steepest descent method, starting from $X_1 = (0, 0)$, we first compute the partial derivatives:

$$\frac{\partial f}{\partial x_1} = 1 + 4x_1 + 2x_2, \qquad \frac{\partial f}{\partial x_2} = -1 + 2x_1 + 2x_2$$

Gradient:

$$\nabla f = \begin{bmatrix} \partial f/\partial x_1 \\ \partial f/\partial x_2 \end{bmatrix} = \begin{bmatrix} 1 + 4x_1 + 2x_2 \\ -1 + 2x_1 + 2x_2 \end{bmatrix}, \qquad \nabla f(\text{at } X_1) = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$
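To make the hand computations easy to check, here is a minimal Python (NumPy) sketch of this objective and its gradient. The names `f` and `grad_f` are just illustrative choices, not part of the original problem statement.

```python
import numpy as np

def f(x):
    # f(x1, x2) = x1 - x2 + 2*x1^2 + 2*x1*x2 + x2^2
    x1, x2 = x
    return x1 - x2 + 2*x1**2 + 2*x1*x2 + x2**2

def grad_f(x):
    # Gradient components derived above:
    # df/dx1 = 1 + 4*x1 + 2*x2,  df/dx2 = -1 + 2*x1 + 2*x2
    x1, x2 = x
    return np.array([1 + 4*x1 + 2*x2,
                     -1 + 2*x1 + 2*x2])

print(grad_f(np.array([0.0, 0.0])))   # [ 1. -1.]  = gradient at X1 = (0, 0)
```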
Iteration 1

First we find the direction: $S_1 = -\nabla f = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$

$$X_1 + \lambda_1 S_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix} + \lambda_1 \begin{bmatrix} -1 \\ 1 \end{bmatrix} = \begin{bmatrix} -\lambda_1 \\ \lambda_1 \end{bmatrix}$$

Next, calculate the given function at this point:

$$f(-\lambda_1, \lambda_1) = x_1 - x_2 + 2x_1^2 + 2x_1x_2 + x_2^2 = -2\lambda_1 + \lambda_1^2$$

Next, $\dfrac{\partial f}{\partial \lambda_1} = -2 + 2\lambda_1$; setting $\dfrac{\partial f}{\partial \lambda_1} = 0$ gives $-2 + 2\lambda_1 = 0 \Rightarrow \lambda_1 = 1$.
$$X_2 = X_1 + \lambda_1 S_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix} + 1\begin{bmatrix} -1 \\ 1 \end{bmatrix} = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$$

$$\nabla f = \begin{bmatrix} 1 + 4x_1 + 2x_2 \\ -1 + 2x_1 + 2x_2 \end{bmatrix} \text{ at } (-1, 1): \qquad \nabla f(\text{at } X_2) = \begin{bmatrix} -1 \\ -1 \end{bmatrix}$$

Since $\nabla f(\text{at } X_2) = \begin{bmatrix} -1 \\ -1 \end{bmatrix} \neq 0$, we repeat the iteration.
Iteration 2

$$S_2 = -\nabla f = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$
$$X_2 + \lambda_2 S_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix} + \lambda_2 \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} -1 + \lambda_2 \\ 1 + \lambda_2 \end{bmatrix}$$

Next, calculate the given function at this point:

$$f(-1 + \lambda_2,\ 1 + \lambda_2) = x_1 - x_2 + 2x_1^2 + 2x_1x_2 + x_2^2 = 5\lambda_2^2 - 2\lambda_2 - 1$$

Next, $\dfrac{\partial f}{\partial \lambda_2} = -2 + 10\lambda_2$; setting $\dfrac{\partial f}{\partial \lambda_2} = 0$ gives $\lambda_2 = 1/5$.

$$X_3 = X_2 + \lambda_2 S_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix} + \frac{1}{5}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} -0.8 \\ 1.2 \end{bmatrix}$$
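If you want to verify the expansion $f(-1 + \lambda_2, 1 + \lambda_2) = 5\lambda_2^2 - 2\lambda_2 - 1$ without doing the algebra by hand, a quick symbolic check (a sketch assuming SymPy is available) looks like this:

```python
import sympy as sp

lam = sp.symbols('lambda_2')
x1, x2 = -1 + lam, 1 + lam                      # the point X2 + lam*S2
f = x1 - x2 + 2*x1**2 + 2*x1*x2 + x2**2
print(sp.expand(f))                             # 5*lambda_2**2 - 2*lambda_2 - 1
print(sp.solve(sp.diff(f, lam), lam))           # [1/5]  -> optimal step length
```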
$$\nabla f = \begin{bmatrix} 1 + 4x_1 + 2x_2 \\ -1 + 2x_1 + 2x_2 \end{bmatrix} \text{ at } (-0.8, 1.2) = \begin{bmatrix} 0.2 \\ -0.2 \end{bmatrix} \neq 0,$$

so we repeat the iteration.
Iteration 3

$$S_3 = -\nabla f = \begin{bmatrix} -0.2 \\ 0.2 \end{bmatrix}$$

$$X_3 + \lambda_3 S_3 = \begin{bmatrix} -0.8 \\ 1.2 \end{bmatrix} + \lambda_3 \begin{bmatrix} -0.2 \\ 0.2 \end{bmatrix} = \begin{bmatrix} -0.8 - 0.2\lambda_3 \\ 1.2 + 0.2\lambda_3 \end{bmatrix}$$

Calculating the given function at this point:

$$f(-0.8 - 0.2\lambda_3,\ 1.2 + 0.2\lambda_3) = 0.04\lambda_3^2 - 0.08\lambda_3 - 1.2$$

$$\frac{\partial f}{\partial \lambda_3} = 0.08\lambda_3 - 0.08; \qquad \frac{\partial f}{\partial \lambda_3} = 0 \Rightarrow \lambda_3 = 1$$

$$X_4 = X_3 + \lambda_3 S_3 = \begin{bmatrix} -0.8 \\ 1.2 \end{bmatrix} + 1\begin{bmatrix} -0.2 \\ 0.2 \end{bmatrix} = \begin{bmatrix} -1 \\ 1.4 \end{bmatrix}$$
$$\nabla f = \begin{bmatrix} 1 + 4x_1 + 2x_2 \\ -1 + 2x_1 + 2x_2 \end{bmatrix} \text{ at } (-1, 1.4) = \begin{bmatrix} -0.2 \\ -0.2 \end{bmatrix} \neq 0,$$

so in principle we would repeat the iteration.
However, since the gradient components ($\pm 0.2$) are already close to $0$, let us stop the iteration here. If you want, you can carry out a few more iterations until the gradient is even closer to $0$.

We conclude that $(-1, 1.4)$ is (approximately) the minimum point. To find the minimum of $f$, calculate $f(-1, 1.4) = -1.24$.
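The whole procedure can be sketched in a few lines of Python. One assumption worth flagging: instead of expanding $f(\lambda)$ by hand at each step as above, the sketch uses the closed-form exact line-search step $\lambda = (g^T g)/(g^T A g)$, which is valid here because $f$ is quadratic with constant Hessian $A$; it reproduces $\lambda_1 = 1$, $\lambda_2 = 1/5$, $\lambda_3 = 1$ computed above.

```python
import numpy as np

A = np.array([[4.0, 2.0],            # Hessian of f (constant, f is quadratic)
              [2.0, 2.0]])

def grad_f(x):
    return np.array([1 + 4*x[0] + 2*x[1],
                     -1 + 2*x[0] + 2*x[1]])

x = np.array([0.0, 0.0])             # X1
for k in range(1, 10):
    g = grad_f(x)
    if np.linalg.norm(g) < 1e-8:     # stop when the gradient is (nearly) zero
        break
    lam = (g @ g) / (g @ A @ g)      # exact line search along S = -g
    x = x - lam * g                  # X_{k+1} = X_k + lam_k * S_k
    print(f"X{k+1} = {x}")
# X2 = [-1.  1.], X3 = [-0.8  1.2], X4 = [-1.   1.4], ... -> approaches (-1, 1.5)
```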
Newton-Raphson (NR) method
https://www.youtube.com/watch?v=SIqfDj1DTyM
https://www.youtube.com/watch?v=1z1sD202jbE
For the same function and starting point $X_1 = (0, 0)$, the Newton-Raphson iteration is $X_{i+1} = X_i - J_i^{-1} g_i$, where $J_i$ is the Hessian and $g_i$ the gradient at $X_i$. Here

$$J_1 = \begin{bmatrix} 4 & 2 \\ 2 & 2 \end{bmatrix}, \qquad J_1^{-1} = \begin{bmatrix} 1/2 & -1/2 \\ -1/2 & 1 \end{bmatrix}$$

$$g_1 = \begin{bmatrix} 1 + 4x_1 + 2x_2 \\ -1 + 2x_1 + 2x_2 \end{bmatrix} = \begin{bmatrix} 1 \\ -1 \end{bmatrix} \quad \text{(substituting } x_1 = 0,\ x_2 = 0\text{)}$$

$$X_2 = X_1 - J_1^{-1} g_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix} - \begin{bmatrix} 1/2 & -1/2 \\ -1/2 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} -1 \\ 3/2 \end{bmatrix}$$
$$g_2 = \begin{bmatrix} 1 + 4x_1 + 2x_2 \\ -1 + 2x_1 + 2x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \quad \text{(substituting } x_1 = -1,\ x_2 = 3/2\text{)}$$

As $g_2 = 0$, $X_2 = \begin{bmatrix} -1 \\ 3/2 \end{bmatrix}$ is the optimum point.
Hence $f\left(-1, \tfrac{3}{2}\right) = -\tfrac{5}{4}$ (by substituting $x_1 = -1$ and $x_2 = 3/2$ in the given function).
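As a check, here is a minimal Python sketch of the same Newton step (an illustration, using `numpy.linalg.solve` rather than forming $J^{-1}$ explicitly):

```python
import numpy as np

J = np.array([[4.0, 2.0],        # Hessian of f (constant for this quadratic)
              [2.0, 2.0]])

def grad_f(x):
    return np.array([1 + 4*x[0] + 2*x[1],
                     -1 + 2*x[0] + 2*x[1]])

x1 = np.array([0.0, 0.0])                    # X1
x2 = x1 - np.linalg.solve(J, grad_f(x1))     # X2 = X1 - J^{-1} g1
print(x2, grad_f(x2))                        # [-1.   1.5] [0. 0.]
```

Because $f$ is quadratic, a single Newton step lands exactly on the optimum, which is why $g_2 = 0$ immediately.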
Passing the entire data set through the model at once can cause an Out of Memory Error. To overcome this, we use Mini Batch Gradient Descent. It works as follows: we divide the data set into batches (for example, 100,000 records make 100 batches, each of size 1,000). Instead of passing the entire data set, we pass one Mini Batch to the model, update the weights, and then pass the next Mini Batch. By the time the model has seen the entire data set, we have already made a lot of progress by updating the weights 100 times. Passing Mini Batches reduces the time taken to process the data, and we will not face any Out of Memory Error. The only problem with Mini Batch Gradient Descent is that we train the model on small batches of data, so during any single update the model cannot recognize the patterns present in the other Mini Batches; it is only trained on that specific Mini Batch.
For example, if we take the entire data set, the weight updates move smoothly and approach the local minima; this is the behaviour when we train the model using the entire dataset. But if we take the mini batches, the path does not head straight to the local minima; it takes a zig-zag curve and eventually reaches the local minima.

[Figure: smooth descent curve when training on the entire data set vs. zig-zag path when training on mini batches.]
The weight updates do not follow the smooth parabolic curve; they take a zig-zag path until they reach the local minima. In fact, the path will not reach the exact point of the local minima, but it will get around it, close to it. This zig-zag path is what we call noise.

Imagine that our dataset contains only 1,000 records (not 100,000). In that case it is pointless to use Mini Batch Gradient Descent, because most of the time will be wasted following that zig-zag curve instead of travelling along a straight path. A minimal sketch of the batching scheme is given below.
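Here is a minimal Mini Batch Gradient Descent sketch in Python. The data set, model (plain linear regression with a mean-squared-error loss), learning rate, and batch size are all made-up illustrative choices; the point is the batching scheme from the example above: 100,000 records split into 100 batches of 1,000, with one weight update per batch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100,000 records with 3 features (as in the example above).
X = rng.normal(size=(100_000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100_000)

w = np.zeros(3)                       # weights to learn
lr, batch_size = 0.1, 1_000           # 100 batches of 1,000 records each

for epoch in range(5):
    order = rng.permutation(len(X))   # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        b = order[start:start + batch_size]
        err = X[b] @ w - y[b]                 # predictions minus targets
        grad = 2 * X[b].T @ err / len(b)      # MSE gradient on this mini batch only
        w -= lr * grad                        # update: 100 updates per epoch

print(w)                              # close to [ 2.  -1.   0.5]
```

Each update uses the gradient of only one mini batch, which is why the descent path is noisy (zig-zag) rather than the smooth curve that full-batch gradient descent would trace.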