Gradient Descent Learning: Minimize Objective Function: Error Landscape


Error Landscape

[Figure: error surface — SSE (Sum Squared Error), \sum_i (t_i - z_i)^2, plotted against the weight values.]
Minimizing the Error

[Figure: error surface over weight space, moving from w_initial (initial error) to w_trained (final error) at a local minimum. Where the derivative is negative, the weight change is positive — the weight moves downhill.]
Gradient Descent Training Rule
• \nabla E(\vec{w}) = gradient of the error in weight space.
• w_i := w_i + \delta w_i, where \delta w_i = -\eta \nabla E(\vec{w}).
• For a sufficiently small learning rate \eta, this process converges toward a (local) minimum of the error.
• The gradient is summed over all training cases (batch).

We move in the weight space toward the point that minimizes squared error on the training set.
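The batch update rule above can be sketched in code. The dataset, learning rate, and objective below are illustrative assumptions (a linear unit with squared error), not values from the slides:

```python
import numpy as np

# Batch gradient descent sketch: w := w + delta_w, with delta_w = -eta * grad E(w),
# where the gradient is summed over all training cases (batch mode).

def grad_E(w, X, t):
    """Gradient of the sum squared error, summed over the whole batch."""
    errors = X @ w - t              # (o_d - t_d) for every training case d
    return X.T @ errors             # sum_d (o_d - t_d) * x_d

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))        # 20 training cases, 3 weights
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true                      # noiseless targets, for illustration only

w = np.zeros(3)
eta = 0.01                          # sufficiently small learning rate
for _ in range(500):
    w = w - eta * grad_E(w, X, t)   # delta_w = -eta * grad E(w)

print(np.round(w, 3))               # approaches w_true
```

With a small enough learning rate the repeated updates drive the weights toward the minimum-error point, as the slide claims for this quadratic error surface.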
Deriving a Gradient Descent Learning Algorithm
• Goal is to decrease the overall error (or other objective function) each time a weight is changed.
• Total sum squared error is one possible objective function: E = \sum_i (t_i - z_i)^2.
• Seek a weight-changing algorithm such that the error gradient is negative.
• If such a formula can be found, then we have a gradient descent learning algorithm.

Gradient:
\nabla E[\vec{w}] = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]

Training rule:
\Delta \vec{w} = -\eta \, \nabla E[\vec{w}]

i.e., \Delta w_i = -\eta \, \frac{\partial E}{\partial w_i}
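The gradient is just the vector of partial derivatives, so it can be checked numerically. The tiny dataset and closed-form gradient below are illustrative assumptions:

```python
import numpy as np

# Finite-difference check of the gradient definition
# grad E[w] = [dE/dw0, dE/dw1, ..., dE/dwn].
# Illustrative objective (an assumption, not from the slides):
# E(w) = sum_i (t_i - w . x_i)^2 over a tiny dataset.

X = np.array([[1.0, 2.0], [3.0, -1.0]])
t = np.array([1.0, 0.0])

def E(w):
    return np.sum((t - X @ w) ** 2)

def numerical_gradient(E, w, h=1e-6):
    """Approximate each partial derivative dE/dw_i by central differences."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = h
        g[i] = (E(w + step) - E(w - step)) / (2 * h)
    return g

w = np.array([0.5, -0.5])
analytic = -2 * X.T @ (t - X @ w)   # closed-form gradient of this E
numeric = numerical_gradient(E, w)
print(numeric, analytic)            # the two should nearly agree
```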
Linear Unit Gradient Descent Training Rule

Guaranteed to converge to the minimum squared error:
• given a sufficiently small learning rate \eta,
• even when the training data contains noise,
• even when the training data is not separable.

\Delta \vec{w} = -\eta \, \nabla E(\vec{w}), \qquad \Delta w_i = -\eta \, \frac{\partial E}{\partial w_i}

\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \, \frac{1}{2} \sum_{\vec{x} \in D} \left( t(\vec{x}) - o(\vec{x}) \right)^2
= \frac{1}{2} \sum_{\vec{x} \in D} 2 \left( t(\vec{x}) - o(\vec{x}) \right) \frac{\partial}{\partial w_i} \left( t(\vec{x}) - o(\vec{x}) \right)
= \sum_{\vec{x} \in D} \left( t(\vec{x}) - o(\vec{x}) \right) \frac{\partial}{\partial w_i} \left( t(\vec{x}) - \vec{w} \cdot \vec{x} \right)

\frac{\partial E}{\partial w_i} = -\sum_{\vec{x} \in D} \left( t(\vec{x}) - o(\vec{x}) \right) x_i

Hence the training rule: \Delta w_i = \eta \sum_{\vec{x} \in D} \left( t(\vec{x}) - o(\vec{x}) \right) x_i
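The derived rule \Delta w_i = \eta \sum (t - o) x_i can be applied as a single batch step; one step along the negative gradient should lower the squared error. The tiny dataset below is made up for illustration:

```python
import numpy as np

# One batch update with the derived rule: delta_w_i = eta * sum_d (t_d - o_d) * x_{i,d}.
# Tiny made-up dataset (an assumption for illustration).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
t = np.array([1.0, -1.0, 0.5])

def sse(w):
    """E(w) = 1/2 * sum_d (t_d - o_d)^2 with linear output o = w . x."""
    return 0.5 * np.sum((t - X @ w) ** 2)

w = np.array([0.0, 0.0])
eta = 0.1
before = sse(w)
o = X @ w                       # linear unit outputs for the whole batch
delta_w = eta * X.T @ (t - o)   # the gradient descent training rule, vectorized
w = w + delta_w
after = sse(w)
print(before, after)            # the error decreases after the update
```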
Measuring Error for a Linear Output (not perceptron)
• Linear output function: o(\vec{x}) = \vec{w} \cdot \vec{x}
• Error measure:
E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
where d ranges over the training data D, t_d is the target value, and o_d is the linear unit output.
What about the Perceptron?
• Recall that in a perceptron the output is sign(\vec{w} \cdot \vec{x}) — whether \vec{w} \cdot \vec{x} is greater or less than zero.
• Can we apply gradient descent to the perceptron classifier? Not directly: the sign function is not differentiable, which motivates the non-linear activation functions on the next slide.
Non-Linear activation functions

Important relationship between sigmoid and tanh (try the proofs!)

\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{1 + \tanh(x/2)}{2}

\tanh(x) = \frac{2}{1 + e^{-2x}} - 1
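Before attempting the proofs, the two identities can be sanity-checked numerically (the grid of test points is an arbitrary choice):

```python
import numpy as np

# Numerical check of the sigmoid/tanh identities stated above.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 101)

# sigma(x) = (1 + tanh(x / 2)) / 2
lhs1 = sigmoid(x)
rhs1 = (1 + np.tanh(x / 2)) / 2

# tanh(x) = 2 / (1 + exp(-2 x)) - 1 = 2 * sigma(2 x) - 1
lhs2 = np.tanh(x)
rhs2 = 2 * sigmoid(2 * x) - 1

print(np.allclose(lhs1, rhs1), np.allclose(lhs2, rhs2))  # True True
```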
MSE Gradient for Non-Linear Units

\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \, \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
= \frac{1}{2} \sum_{d \in D} 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)
= -\sum_{d \in D} (t_d - o_d) \frac{\partial o_d}{\partial w_i}
= -\sum_{d \in D} (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}

But we know:
o_d = \sigma(net_d), \quad \text{so} \quad \frac{\partial o_d}{\partial net_d} = o_d (1 - o_d)
net_d = \vec{w} \cdot \vec{x}_d, \quad \text{so} \quad \frac{\partial net_d}{\partial w_i} = x_{i,d}

So:
\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d) \, o_d (1 - o_d) \, x_{i,d}

Sigmoid Function
• Continuous and differentiable, and its derivative is easy to compute: \sigma'(x) = \sigma(x)(1 - \sigma(x)).

[Figure: plot of the sigmoid and its derivative.]
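The derivative \sigma'(x) = \sigma(x)(1 - \sigma(x)) really is easy to compute: it reuses the sigmoid value itself. A small check of its shape:

```python
import numpy as np

# The sigmoid derivative: sigma'(x) = sigma(x) * (1 - sigma(x)).
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)        # reuse the forward value, no extra exp needed
    return s * (1 - s)

# The derivative is a bump peaking at x = 0 with value 0.25
# and vanishing in both tails, matching the plotted curve.
print(sigmoid_prime(0.0))             # 0.25
print(round(sigmoid_prime(5.0), 4))   # 0.0066
```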
Practice Problem
• For the given problem, apply gradient descent learning to update the weights for one epoch (applying the weight updates to all the data in the training set).
• The solution is provided on the next slide.
• Check whether you can get the calculations on your own.
• Use a sigmoid function as the processing unit.

CS 478 - Perceptrons
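The problem's actual data did not survive extraction, so the dataset, starting weights, and learning rate below are stand-ins. The procedure, however, is the one the slide asks for: one epoch of batch gradient descent with a sigmoid processing unit:

```python
import numpy as np

# Stand-in data (assumptions, not the slide's values): three training cases,
# two inputs plus a constant bias input, sigmoid processing unit.

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

X = np.array([[1.0, 0.0, 1.0],    # third column acts as a constant bias input
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 1.0]])
t = np.array([1.0, 1.0, 0.0])
w0 = np.array([0.2, -0.1, 0.05])  # initial weights
eta = 0.5

# One epoch = one batch update summed over all training cases:
# delta_w_i = eta * sum_d (t_d - o_d) * o_d * (1 - o_d) * x_{i,d}
o = sigmoid(X @ w0)
w = w0 + eta * X.T @ ((t - o) * o * (1 - o))
print(np.round(w, 4))
```

After the update, the squared error on the training set is lower than with the initial weights, which is the behavior the exercise is meant to demonstrate.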
