Gradient Descent Learning: Minimize Objective Function: Error Landscape
Objective Function: Error Landscape

SSE (Sum Squared Error): $E = \sum_i (t_i - z_i)^2$

[Figure: error landscape, plotting the SSE over the weight values]
Minimizing the Error

[Figure: error surface over weight space. Starting from the initial error at $w_{initial}$, following the negative derivative produces a positive weight change, descending to the final error at a local minimum, $w_{trained}$.]
Gradient Descent Training Rule
• $\nabla E(\mathbf{w})$ = gradient of the error in weight space.
• $w_i := w_i + \delta w_i$, where $\delta w_i = -\eta \frac{\partial E}{\partial w_i}$, i.e. $\delta \mathbf{w} = -\eta \nabla E(\mathbf{w})$.
• With a sufficiently small learning rate $\eta$, this process converges towards a (local) minimum of the error.
• The gradient is summed over all training cases (batch).

Each update moves in weight space towards the point that minimizes the squared error on the training set.
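The batch training rule above can be sketched as a small update loop. This is a minimal sketch, assuming a generic differentiable objective; the function name `gradient_descent` and the default step count are illustrative, not from the slides:

```python
import numpy as np

def gradient_descent(w, grad_E, eta=0.1, steps=100):
    """Repeatedly apply the batch training rule
    w_i := w_i + delta_w_i, where delta_w = -eta * grad E(w)."""
    for _ in range(steps):
        w = w - eta * grad_E(w)   # move against the error gradient
    return w
```

With a sufficiently small `eta`, each step moves down the error surface towards a (local) minimum.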
Deriving a Gradient Descent Learning Algorithm
• Goal is to decrease the overall error (or other objective function) each time a weight is changed.
• Total Sum Squared Error is one possible objective function: $E = \sum_i (t_i - z_i)^2$
• Seek a weight-changing algorithm such that the error gradient is negative.
• If such a formula can be found, then we have a gradient descent learning algorithm.

$$\nabla E[\mathbf{w}] = \left[\frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n}\right]$$

Training rule: $\Delta \mathbf{w} = -\eta \nabla E[\mathbf{w}]$, i.e. $\Delta w_i = -\eta \dfrac{\partial E}{\partial w_i}$
Linear Unit Gradient Descent Training Rule

Guaranteed to converge to the minimum squared error:
• given a sufficiently small learning rate $\eta$,
• even when the training data contains noise,
• even when the training data is not separable.

$$\Delta \mathbf{w} = -\eta \nabla E(\mathbf{w}), \qquad \Delta w_i = -\eta \frac{\partial E}{\partial w_i}$$

$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{x \in D} \big(t(x) - o(x)\big)^2 = \frac{1}{2} \sum_{x \in D} 2\big(t(x) - o(x)\big) \frac{\partial}{\partial w_i}\big(t(x) - o(x)\big)$$

$$= \sum_{x \in D} \big(t(x) - o(x)\big) \frac{\partial}{\partial w_i}\big(t(x) - \mathbf{w} \cdot \mathbf{x}\big)$$

$$\frac{\partial E}{\partial w_i} = -\sum_{x \in D} \big(t(x) - o(x)\big)\, x_i$$
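The final expression can be implemented directly. A minimal sketch, assuming the training inputs in $D$ are the rows of a matrix `X` (function names are illustrative):

```python
import numpy as np

def sse_gradient(w, X, t):
    """dE/dw = -sum_{x in D} (t(x) - o(x)) * x  for a linear unit o(x) = w.x,
    with E = 1/2 * sum_{x in D} (t(x) - o(x))^2."""
    o = X @ w                 # linear outputs for all training inputs
    return -X.T @ (t - o)     # vector of partial derivatives dE/dw_i

def delta_rule_step(w, X, t, eta=0.01):
    """One batch weight update: delta_w = -eta * dE/dw."""
    return w - eta * sse_gradient(w, X, t)
```

A quick sanity check is to compare `sse_gradient` against a numerical finite-difference gradient and to verify the error decreases after one step with a small learning rate.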
Measuring Error for a Linear Output (Not Perceptron)

• Linear output function: $o(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x}$
• Error measure: $E(\mathbf{w}) = \dfrac{1}{2} \sum_{d \in D} (t_d - o_d)^2$, where $t_d$ is the target value and $o_d$ the linear unit output for example $d$ in the data set $D$.
What about the Perceptron?
• Recall that in a perceptron the output passes through a hard threshold, which is not differentiable, so this gradient cannot be computed directly.
Non-Linear activation functions
Important Relationship Between Sigmoid and tanh
(try the proofs)

$$\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{1 + \tanh(x/2)}{2}$$

$$\tanh(x) = \frac{2}{1 + e^{-2x}} - 1$$
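The two identities can be checked numerically before proving them; a quick sketch using Python's `math` module:

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

# sigma(x) = (1 + tanh(x/2)) / 2   and   tanh(x) = 2 / (1 + e^(-2x)) - 1
for x in [-3.0, -0.5, 0.0, 1.0, 2.5]:
    assert abs(sigmoid(x) - (1.0 + math.tanh(x / 2.0)) / 2.0) < 1e-12
    assert abs(math.tanh(x) - (2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0)) < 1e-12
```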
MSE Gradient for Non-Linear Units

$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \frac{1}{2} \sum_{d} 2(t_d - o_d) \frac{\partial}{\partial w_i}(t_d - o_d) = -\sum_{d} (t_d - o_d) \frac{\partial o_d}{\partial w_i}$$

By the chain rule,

$$\frac{\partial o_d}{\partial w_i} = \frac{\partial o_d}{\partial net_d} \cdot \frac{\partial net_d}{\partial w_i}$$

But we know:

$$o_d = \sigma(net_d) \;\Rightarrow\; \frac{\partial o_d}{\partial net_d} = o_d(1 - o_d), \qquad net_d = \mathbf{w} \cdot \mathbf{x}_d \;\Rightarrow\; \frac{\partial net_d}{\partial w_i} = x_{i,d}$$

So:

$$\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\, o_d (1 - o_d)\, x_{i,d}$$
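The final expression translates directly into code. A minimal sketch, assuming training inputs as rows of `X` (function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_unit_gradient(w, X, t):
    """dE/dw_i = -sum_d (t_d - o_d) * o_d * (1 - o_d) * x_{i,d},
    with o_d = sigmoid(w . x_d) and E = 1/2 * sum_d (t_d - o_d)^2."""
    o = sigmoid(X @ w)                       # outputs o_d for all examples
    return -X.T @ ((t - o) * o * (1 - o))    # chain rule through the sigmoid
```

As with the linear unit, the result can be sanity-checked against a finite-difference approximation of $\partial E / \partial w_i$.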
Sigmoid Function
• Continuous, differentiable, and easy to compute.
• [Figure: plot of the sigmoid and its derivative, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$]
Practice Problem
• For the given problem, apply gradient descent learning to update the weights for one epoch (applying the weight updates to all the data in the training set).
• The solution is provided on the next slide.
• Check whether you can get the calculations on your own.
• Use a sigmoid function as the processing unit.
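Since the problem's actual training data appears on a slide not shown here, the sketch below uses hypothetical data; it shows the shape of one batch epoch for a sigmoid unit using the gradient derived above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_epoch(w, X, t, eta=0.5):
    """One batch epoch for a sigmoid unit: sum the weight updates over all
    training examples, then apply them once (delta_w = -eta * dE/dw)."""
    o = sigmoid(X @ w)
    delta_w = eta * X.T @ ((t - o) * o * (1 - o))
    return w + delta_w

# Hypothetical data (NOT the problem's data): two examples, two inputs each
X = np.array([[1.0, 0.0], [1.0, 1.0]])
t = np.array([0.2, 0.8])
w = one_epoch(np.zeros(2), X, t)
```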
CS 478 - Perceptrons