
Lecture Outline

• Introduction
• Adaline (Adaptive Linear Neuron) Networks
• Derivation of the LMS algorithm
• Example
• Limitation of Adaline
Limits of the Perceptron Learning Rule
• Finding the optimal separating hyperplane can be posed as a quadratic optimization problem, both when the two classes are linearly separable and when they are not.
• If there is no separating hyperplane, the perceptron will never classify the samples 100% correctly.
• But there is nothing to stop it from trying, so we need to add something to stop the training, such as:
• Put a limit on the number of iterations, so that the algorithm terminates even if the sample set is not linearly separable.
• Include an error bound. The algorithm can stop as soon as the proportion of misclassified samples is less than this bound. This idea is developed in the Adaline training algorithm (a sketch of such a loop follows below).
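
A minimal sketch of such a capped training loop is given below. It uses the perceptron update together with the two stopping criteria just listed; the function and variable names (train_with_stopping, error_bound, and so on) are illustrative, not from the lecture.

```python
# Perceptron-style training with two extra stopping criteria:
# a cap on the number of epochs and a bound on the misclassification rate.

def step(v):
    """Hard-limiting activation: +1 or -1."""
    return 1 if v >= 0 else -1

def train_with_stopping(samples, n_inputs, eta=0.1,
                        max_epochs=1000, error_bound=0.05):
    # samples: list of (x, target) pairs with targets in {-1, +1};
    # a bias can be folded in by fixing x[0] = 1.
    w = [0.0] * n_inputs
    for _ in range(max_epochs):                    # criterion 1: iteration limit
        errors = 0
        for x, target in samples:
            y = step(sum(wi * xi for wi, xi in zip(w, x)))
            if y != target:
                errors += 1
                w = [wi + eta * (target - y) * xi for wi, xi in zip(w, x)]
        if errors / len(samples) < error_bound:    # criterion 2: error bound
            break
    return w
```

Because the error-bound test runs after each full pass, the loop terminates even on data that is not linearly separable: either the misclassification rate falls below the bound, or the epoch cap is hit.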
Error Correcting Learning
• The objective of this learning is to start from an arbitrary initial point in error space and then move toward a global minimum error, in a step-by-step fashion.
• The arbitrary starting point is determined by the initial values assigned to the synaptic weights.
• It is closed-loop feedback learning.
• Examples of error-correction learning:
• the least-mean-square (LMS) algorithm (Widrow and Hoff), also called the delta rule,
• and its generalization, known as the back-propagation (BP) algorithm.
Error-Correcting Learning Considered as a Search Problem
• The task can be seen as a search problem in the weight
space:
• Start from a random position (defined by the initial weights) and find a set of weights that minimizes the error on the given training set.
• Initial state: a random set of weights.
• Goal state: a set of weights that minimizes the error on the training set.
• Evaluation function (performance index, or cost function): an error function.
• Operators: how to move from one state to another; defined by the learning algorithm.
[Figure: error surface in weight space, with a local minimum and the global minimum marked]
Adaline (Adaptive Linear Neuron) Networks
• 1960 - Bernard Widrow and his student Marcian Hoff introduced the ADALINE network and its learning rule, which they called the least mean square (LMS) algorithm (also known as the Widrow-Hoff algorithm or the delta rule).
• The Widrow-Hoff algorithm can only train single-layer networks.
• Both the Perceptron and the Adaline can only solve linearly separable problems
• (i.e., the input patterns can be separated by a linear plane into groups, as in the AND and OR problems).
Adaline Architecture

• Given:
• xk(n): an input value for neuron k at iteration n,
• dk(n): the desired response or the target response for neuron k.
• Let:
• yk(n): the actual response of neuron k.
The error function is the Mean Square Error
• ADALINEs use the Widrow-Hoff algorithm, or Least Mean Square (LMS) algorithm, to adjust the weights of the linear network in order to minimize the mean square error.
• Error: the difference between the target and the actual network output (delta rule).
• Error signal for neuron k at iteration n: ek(n) = dk(n) - yk(n)
• Mean square error in batch mode, i.e., the mean squared error taken over all m training patterns (see the small sketch below):

$E(n) = \frac{1}{m}\sum_{p=1}^{m}\frac{1}{2}\,(d_p - y_p)^2$
Error Landscape in Weight Space
• Total error signal is a function of the weights.
• Ideally, we would like to find the global minimum (i.e., the optimal solution).
[Figure: E(w) plotted against individual weights w1 and w2, with arrows indicating the direction of decreasing E(w)]
Error Landscape in Weight Space, cont.
• The error surface of a linear network (ADALINE) is a parabola (in 1-D: one weight vs. error) or a paraboloid (in higher dimensions),
• and it has only one minimum, called the global minimum.
Error Landscape in Weight Space, cont.
• Take steps downhill: (w1, w2) → (w1 + Δw1, w2 + Δw2)
• Move down as fast as possible,
• i.e., move in the direction that makes the largest reduction in error.
• What is this direction called?
Steepest Descent
• The gradient points in the direction of steepest ascent and can be computed; the direction of steepest descent is its negative.
• Any function
• increases most rapidly when the direction of movement is in the direction of the gradient;
• decreases most rapidly when the direction of movement is in the direction of the negative of the gradient.
• Change the weights so that we move a short distance in the direction of the greatest rate of decrease of the error, i.e., in the direction of the negative gradient (a toy sketch of this update follows below):

$\Delta w = -\eta\,\frac{\partial E}{\partial w}$
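
A toy sketch of this update on a one-dimensional error function E(w) = (w - 3)^2, chosen only to show the weight moving a short distance against the gradient at each step; the function and numbers are illustrative, not from the lecture.

```python
# Gradient descent on E(w) = (w - 3)^2, whose derivative is dE/dw = 2 * (w - 3).
eta = 0.1      # learning rate
w = 0.0        # arbitrary starting weight
for _ in range(25):
    grad = 2 * (w - 3)     # dE/dw at the current weight
    w = w - eta * grad     # Delta w = -eta * dE/dw
print(w)       # close to 3.0, the global minimum of E(w)
```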
LMS Algorithm - Derivation
• Steepest gradient descent rule for change of the
weights:
Given
• xk(n): an input value for a neuron k at iteration n,
• dk(n): the desired response or the target response for neuron k.
Let:
• yk(n) : the actual response of neuron k.
• ek(n) : error signal = dk(n)- yk(n)
Train the wi's such that they minimize the squared error after each iteration:

$E(n) = \frac{1}{2}\,e_k^2(n)$
LMS Algorithm – Derivation, cont.
• The derivative of the error with respect to each weight $w_{ij}$ (the weight from input $j$ to neuron $i$) can be written as:

$\frac{\partial E}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}}\left[\frac{1}{2}e_i^2\right] = \frac{\partial}{\partial w_{ij}}\left[\frac{1}{2}(d_i - y_i)^2\right]$

• Next we use the chain rule to split this into two derivatives:

$\frac{\partial E}{\partial w_{ij}} = \frac{\partial}{\partial y_i}\left[\frac{1}{2}(d_i - y_i)^2\right] \cdot \frac{\partial y_i}{\partial w_{ij}}$

• Substituting $y_i = f\!\left(\sum_{j=1}^{R} w_{ij} x_j\right)$:

$\frac{\partial E}{\partial w_{ij}} = \frac{1}{2} \cdot 2\,(d_i - y_i) \cdot (-1) \cdot \frac{\partial}{\partial w_{ij}} f\!\left(\sum_{j=1}^{R} w_{ij} x_j\right)$

$\frac{\partial E}{\partial w_{ij}} = -(d_i - y_i) \cdot x_j \cdot f'\!\left(\sum_{j=1}^{R} w_{ij} x_j\right)$
LMS Algorithm – Derivation, cont.
$\frac{\partial E}{\partial w_{ij}} = -(d_i - y_i) \cdot x_j \cdot f'(net_i)$

$\Delta w_{ij} = -\eta\,\frac{\partial E}{\partial w_{ij}} = \eta\,(d_i - y_i)\,f'(net_i)\,x_j$
• This is called the Delta Learning rule.
• The Delta learning rule can therefore be used with neurons that have differentiable activation functions, such as the sigmoid function. (A numerical check of this gradient is sketched below.)
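
As a sanity check on this derivation, the sketch below compares the delta-rule gradient -(d - y)·f'(net)·x with a finite-difference estimate for a single sigmoid neuron. The inputs, weights, and target are made-up values used only for the check.

```python
import math

# Single neuron with a sigmoid activation: y = f(net), net = sum(w_j * x_j).
def f(net):
    return 1.0 / (1.0 + math.exp(-net))

def f_prime(net):
    s = f(net)
    return s * (1.0 - s)

def error(w, x, d):
    net = sum(wj * xj for wj, xj in zip(w, x))
    return 0.5 * (d - f(net)) ** 2

x = [1.0, 0.5, -1.0]   # illustrative input
w = [0.2, -0.3, 0.1]   # illustrative weights
d = 1.0                # illustrative target

net = sum(wj * xj for wj, xj in zip(w, x))
y = f(net)

# Delta-rule gradient: dE/dw_j = -(d - y) * f'(net) * x_j
analytic = [-(d - y) * f_prime(net) * xj for xj in x]

# Finite-difference estimate of the same gradient
eps = 1e-6
numeric = []
for j in range(len(w)):
    w_plus = list(w)
    w_plus[j] += eps
    numeric.append((error(w_plus, x, d) - error(w, x, d)) / eps)

print(analytic)
print(numeric)   # the two lists should agree to several decimal places
```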
LMS Algorithm – Derivation, cont.
• The Widrow-Hoff learning rule is a special case of the Delta learning rule. Since the Adaline's transfer function is a linear function,

$f(net_i) = net_i$ and $f'(net_i) = 1$

the derivative of the error becomes:

$\frac{\partial E}{\partial w_{ij}} = -(d_i - y_i) \cdot x_j$

• The Widrow-Hoff learning rule is therefore:

$\Delta w_{ij} = -\eta\,\frac{\partial E}{\partial w_{ij}} = \eta\,(d_i - y_i)\,x_j$
Adaline Training Algorithm
1- Initialize the weights to small random values and select a learning rate η.
2- Repeat
3- for each of the m training patterns
   select an input vector x with target output t,
   compute the output: v = b + wTx, y = f(v)
   compute the output error: e = t - y
   update the bias and weights:
   wi(new) = wi(old) + η (t - y) xi
4- end for
5- until the stopping criterion is reached, by computing the mean square error across all the training samples:

$mse(n) = \frac{1}{m}\sum_{k=1}^{m}\frac{1}{2}\,e_k^2(n)$

Stopping criterion: if the mean squared error across all the training samples is less than a specified value, stop the training.
Otherwise, cycle through the training set again (go to step 2). (A runnable sketch of this loop follows below.)
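
A runnable sketch of this training loop, assuming the identity transfer function f(v) = v and keeping the bias as a separate term; the names adaline_train, mse_goal, and max_epochs are illustrative, not from the slides.

```python
import random

def adaline_train(patterns, eta=0.1, mse_goal=0.01, max_epochs=1000):
    """patterns: list of (x, t) pairs, x a list of inputs, t a scalar target."""
    n = len(patterns[0][0])
    w = [random.uniform(-0.1, 0.1) for _ in range(n)]   # 1- small random weights
    b = random.uniform(-0.1, 0.1)
    for epoch in range(max_epochs):                      # 2- repeat
        for x, t in patterns:                            # 3- for each training pattern
            y = b + sum(wi * xi for wi, xi in zip(w, x)) # linear output: f(v) = v
            e = t - y                                    # output error
            w = [wi + eta * e * xi for wi, xi in zip(w, x)]  # weight update
            b = b + eta * e                              # bias update
        # 5- mean square error across all training samples
        mse = sum(0.5 * (t - (b + sum(wi * xi for wi, xi in zip(w, x)))) ** 2
                  for x, t in patterns) / len(patterns)
        if mse < mse_goal:                               # stopping criterion
            break
    return w, b, mse
```

The epoch cap plays the same role as the iteration limit discussed earlier, so the loop also terminates when the data cannot be fit down to the requested mse goal.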
Convergence Phenomenon
• The performance of an ADALINE neuron depends heavily on the choice of the learning rate η.
• How to choose it?
• Too big: the system will oscillate and will not converge (see the small demonstration below).
• Too small: the system will take a long time to converge.
• Typically, η is selected by trial and error:
• typical range: 0.01 < η < 1.0
• often start at 0.1
• sometimes it is suggested that:
0.1/m < η < 1.0/m
where m is the number of inputs
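
A tiny demonstration of these two failure modes on a single weight trained toward a target of 1.0 with the Widrow-Hoff update; the values of η are illustrative.

```python
# One input x = 1.0 and target t = 1.0, starting from w = 0.
# Widrow-Hoff update: w <- w + eta * (t - w * x) * x
def run(eta, steps=20):
    w = 0.0
    for _ in range(steps):
        w = w + eta * (1.0 - w)
    return w

print(run(eta=0.1))   # creeps toward 1.0: a small eta converges, but slowly
print(run(eta=2.5))   # the error grows every step: the weight oscillates and diverges
```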
Example
• The input/target pairs for our test problem are:
$\{p_1 = [-1\ \ 1\ \ -1]^T,\ t_1 = -1\}$, $\{p_2 = [1\ \ 1\ \ -1]^T,\ t_2 = 1\}$
• Learning rate: η = 0.4
• Stopping criterion: mse < 0.01
• Initial weights: w(0) = [0 0 0]

Show how the learning proceeds using the LMS algorithm.


Example Iteration One
• First iteration – p1

$y = w(0)\,p_1 = [0\ \ 0\ \ 0]\,[-1\ \ 1\ \ -1]^T = 0$

$e = t_1 - y = -1 - 0 = -1$

$w(1) = w(0) + \eta\,e\,p_1 = [0\ \ 0\ \ 0]^T + 0.4\,(-1)\,[-1\ \ 1\ \ -1]^T = [0.4\ \ -0.4\ \ 0.4]^T$
Example Iteration Two
• Second iteration – p2

$y = w(1)\,p_2 = [0.4\ \ -0.4\ \ 0.4]\,[1\ \ 1\ \ -1]^T = -0.4$

$e = t_2 - y = 1 - (-0.4) = 1.4$

$w(2) = w(1) + \eta\,e\,p_2 = [0.4\ \ -0.4\ \ 0.4]^T + 0.4\,(1.4)\,[1\ \ 1\ \ -1]^T = [0.96\ \ 0.16\ \ -0.16]^T$

End of epoch 1, check the stopping criterion.


Example – Check Stopping Criteria
For input p1:

$e_1 = t_1 - y_1 = t_1 - w(2)^T p_1 = -1 - [0.96\ \ 0.16\ \ -0.16]\,[-1\ \ 1\ \ -1]^T = -1 - (-0.64) = -0.36$

For input p2:

$e_2 = t_2 - y_2 = t_2 - w(2)^T p_2 = 1 - [0.96\ \ 0.16\ \ -0.16]\,[1\ \ 1\ \ -1]^T = 1 - 1.28 = -0.28$

$mse = \frac{1}{2}\cdot\frac{(-0.36)^2 + (-0.28)^2}{2} = 0.052 > 0.01$

The stopping criterion is not satisfied, so continue with epoch 2.
Example – Next Epoch (epoch 2)
• Third iteration – p1

$y = w(2)\,p_1 = [0.96\ \ 0.16\ \ -0.16]\,[-1\ \ 1\ \ -1]^T = -0.64$

$e = t_1 - y = -1 - (-0.64) = -0.36$

$w(3) = w(2) + \eta\,e\,p_1 = [0.96\ \ 0.16\ \ -0.16]^T + 0.4\,(-0.36)\,[-1\ \ 1\ \ -1]^T = [1.104\ \ 0.016\ \ -0.016]^T$

If we continue this procedure, the algorithm converges to:

$W(\infty) = [1\ \ 0\ \ 0]$

(A short script reproducing these steps follows below.)
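
The iterations above can be reproduced with the short script below, using the reconstructed input/target pairs from the start of the example; the loop structure and print statements are only for illustration.

```python
# LMS / Widrow-Hoff updates for the worked example, with eta = 0.4.
patterns = [([-1, 1, -1], -1), ([1, 1, -1], 1)]   # (p, t) pairs as reconstructed above
eta = 0.4
w = [0.0, 0.0, 0.0]                               # w(0)
updates = 0

for epoch in range(60):
    for p, t in patterns:
        y = sum(wi * pi for wi, pi in zip(w, p))   # linear output
        e = t - y                                  # error
        w = [wi + eta * e * pi for wi, pi in zip(w, p)]
        updates += 1
        if updates <= 3:                           # w(1), w(2), w(3) from the slides
            print(updates, [round(wi, 3) for wi in w])

print([round(wi, 3) for wi in w])                  # close to [1, 0, 0]
```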
Compare ADALINE with Perceptron
• Both ADALINE and perceptron suffer from the same inherent
limitation - can only solve linearly separable problems

• LMS, however, is more powerful than the perceptron's learning rule:

• If the patterns are not linearly separable, i.e. a perfect solution does not exist, an ADALINE will find the best solution possible by minimizing the error (provided the learning rate is small enough).
• The Adaline always converges; see what happens with XOR (a sketch follows below).

• The perceptron's rule is guaranteed to converge to a solution that correctly categorizes the training patterns (when such a solution exists), but the resulting network can be sensitive to noise, since patterns often lie close to the decision boundary.
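
A small sketch of the XOR remark above: an Adaline with a linear output, trained on XOR with bipolar inputs and targets, keeps adjusting its weights, but the mean squared error settles near its floor of 0.5 because no linear separation exists. The encoding and learning rate are illustrative.

```python
# Adaline (linear output) trained on XOR with bipolar targets.
# First component of each input is a constant bias input of 1.
xor_patterns = [([1, 1, 1], -1), ([1, 1, -1], 1), ([1, -1, 1], 1), ([1, -1, -1], -1)]
eta = 0.05
w = [0.0, 0.0, 0.0]

for epoch in range(500):
    for x, t in xor_patterns:
        y = sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + eta * (t - y) * xi for wi, xi in zip(w, x)]

mse = sum(0.5 * (t - sum(wi * xi for wi, xi in zip(w, x))) ** 2
          for x, t in xor_patterns) / len(xor_patterns)
print([round(wi, 3) for wi in w], round(mse, 3))
# weights stay near [0, 0, 0] and mse stays near 0.5: the best a single linear unit can do
```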
Compare ADALINE with Perceptron, cont.

• Both use an updating rule to change the weights after each input.

• The perceptron corrects a binary (right/wrong) error; the ADALINE minimizes a continuous error.

• The Adaline is similar to the perceptron, but its transfer function is linear rather than hard-limiting. This allows its output to take on any value.
