Lecture 2


Adaline

• ADALINE: ADAptive LINEar neuron
• Proposed by Widrow & Hoff (1960)
• Typically uses bipolar (+1, -1) activations for its inputs and targets, but is not restricted to such values

Architecture
• Bipolar input
• Bipolar target
• Net input: $y_{in} = W^{T}X + b$
• If the net is being used for pattern classification with bipolar class labels, a threshold function (with threshold = 0) is applied to the net input to obtain the activation (see the sketch below)

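A minimal Python sketch of this forward pass for a single Adaline unit, assuming a weight vector w, bias b, and a bipolar input x; the function names and numeric values are illustrative, not from the lecture:

```python
import numpy as np

def net_input(w, b, x):
    # Net input of a single Adaline unit: y_in = w . x + b
    return np.dot(w, x) + b

def bipolar_output(y_in, threshold=0.0):
    # Threshold function (threshold = 0) mapping the net input
    # to a bipolar class label (+1 or -1)
    return 1 if y_in >= threshold else -1

# Illustrative weights, bias, and one bipolar input pattern
w = np.array([0.4, -0.3])
b = 0.1
x = np.array([1, -1])

y_in = net_input(w, b, x)
print(y_in, bipolar_output(y_in))   # 0.8  +1
```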
Difference between Perceptron and Adaline

• The difference between (single-layer) perceptron and ADALINE networks is the learning method.
• The Adaline learning rule (also known as the least-mean-squares rule, the delta rule, and the Widrow-Hoff rule) is a training rule that minimizes the output error using (approximate) gradient descent.
• The Adaptive Linear Element, or Adaline, is a single-layer linear neural network based on the McCulloch-Pitts neuron.

Difference between Perceptron and Adaline

• The learning rule in Adaline networks is quite straightforward. Input vectors are presented to the Adaline network, and the output values are compared with the desired values.
• If the difference between the output and the desired values is greater than the tolerance, the weights of the neurons are adjusted until the error becomes acceptable. Adaline networks are still used today to implement adaptive filters.

Difference between Perceptron and Adaline

• Thus, in Adaline learning the weights are adjusted by
  $W(t+1) = W(t) + \alpha\,(T - W \cdot X)\,X$
  This corresponds to gradient descent on the quadratic error surface $E = \sum_p [T - W \cdot X]^2$.
• In perceptron learning, the weights are adjusted only when a pattern is misclassified. The correction to the weights after applying the training pattern p is
  $W(t+1) = W(t) + \alpha\,(T - A)\,X$
  where A is the thresholded activation. This corresponds to gradient descent on the error surface $E(W) = \sum_{\text{misclassified}} A\,(W \cdot X)$.
  (Both updates are illustrated in the sketch below.)

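To make the contrast concrete, here is a small Python sketch of both single-pattern updates; W, X, T, A, and the learning rate alpha mirror the symbols above, while the numeric values are illustrative assumptions:

```python
import numpy as np

alpha = 0.1                      # learning rate
W = np.array([0.2, -0.5])        # illustrative weight vector
X = np.array([1.0, -1.0])        # one bipolar training pattern
T = 1.0                          # its target

# Adaline (delta rule): the update is driven by the net input W.X,
# so the weights move even when the pattern is already classified correctly.
y_in = np.dot(W, X)
W_adaline = W + alpha * (T - y_in) * X

# Perceptron: the update is driven by the thresholded activation A,
# so the correction is zero whenever the pattern is classified correctly.
A = 1.0 if y_in >= 0 else -1.0
W_perceptron = W + alpha * (T - A) * X

print(W_adaline)      # [ 0.23 -0.53]
print(W_perceptron)   # unchanged here, since A == T
```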
Training
• The learning rule minimizes the mean squared error between the activations and the target values
• Known as the Delta Rule, the LMS rule, or the Widrow-Hoff rule

Training: Proof for Delta Rule
• Error over all P training samples: the mean squared error

  $$E = \frac{1}{P}\sum_{p=1}^{P}\bigl(t(p) - y_{in}(p)\bigr)^{2}$$

• E is a function of W = {w_1, ..., w_n}
• Learning takes a gradient descent approach to reduce E by modifying W. The gradient of E is

  $$\nabla E = \Bigl(\frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n}\Bigr), \qquad \Delta w_i \propto -\frac{\partial E}{\partial w_i}$$

• Differentiating E with respect to w_i:

  $$\frac{\partial E}{\partial w_i} = \frac{2}{P}\sum_{p=1}^{P}\bigl(t(p) - y_{in}(p)\bigr)\,\frac{\partial}{\partial w_i}\bigl(t(p) - y_{in}(p)\bigr) = -\frac{2}{P}\sum_{p=1}^{P}\bigl(t(p) - y_{in}(p)\bigr)\,x_i(p)$$

• Hence the update follows the negative gradient (checked numerically in the sketch below):

  $$\Delta w_i \propto -\frac{\partial E}{\partial w_i} = \frac{2}{P}\sum_{p=1}^{P}\bigl(t(p) - y_{in}(p)\bigr)\,x_i(p)$$
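As a quick sanity check of this derivation, the following Python sketch compares the delta-rule expression (2/P) Σ (t(p) − y_in(p)) x_i(p) with a numerical estimate of −∂E/∂w_i on a tiny made-up data set; the data, weights, and names are illustrative assumptions:

```python
import numpy as np

# Tiny illustrative data set: P bipolar patterns and their targets
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
t = np.array([1.0, -1.0, -1.0, -1.0])
w, b = np.array([0.3, -0.2]), 0.1
P = len(t)

def mse(w, b):
    # E = (1/P) * sum_p (t(p) - y_in(p))^2
    y_in = X @ w + b
    return np.mean((t - y_in) ** 2)

# Delta-rule expression for -dE/dw_i: (2/P) * sum_p (t(p) - y_in(p)) * x_i(p)
y_in = X @ w + b
delta_rule = (2.0 / P) * (t - y_in) @ X

# Central-difference numerical gradient of E for comparison
eps = 1e-6
num_grad = np.array([
    (mse(w + eps * np.eye(2)[i], b) - mse(w - eps * np.eye(2)[i], b)) / (2 * eps)
    for i in range(2)
])

print(delta_rule)   # analytic  -dE/dw
print(-num_grad)    # numerical -dE/dw (should agree)
```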
[Figure: error surface (left) and error contour (right) plots of the sum squared error as a function of weight W and bias B]
Training…
• Application of the Delta Rule (a sketch of both modes follows this slide)
  • Method 1 (sequential mode): change w_i after each training pattern by $\alpha\,(t(p) - y_{in}(p))\,x_i$
  • Method 2 (batch mode): change w_i at the end of each epoch. Within an epoch, accumulate $\alpha\,(t(p) - y_{in}(p))\,x_i$ for every pattern (x(p), t(p))
  • Method 2 is slower but may provide slightly better results (because Method 1 may be sensitive to the sample ordering)
• Notes:
  • E monotonically decreases until the system reaches a state with (locally) minimum E (a small change of any w_i will cause E to increase)
  • At a local-minimum state, $\partial E / \partial w_i = 0$ for all i, but E is not guaranteed to be zero

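A short Python sketch of one epoch in each mode, assuming a data set X with targets t and a learning rate alpha; all names are illustrative placeholders:

```python
import numpy as np

def epoch_sequential(w, b, X, t, alpha):
    # Method 1: update w and b immediately after every training pattern
    for x_p, t_p in zip(X, t):
        err = t_p - (np.dot(w, x_p) + b)
        w = w + alpha * err * x_p
        b = b + alpha * err
    return w, b

def epoch_batch(w, b, X, t, alpha):
    # Method 2: accumulate the updates over the whole epoch, apply once at the end
    dw, db = np.zeros_like(w), 0.0
    for x_p, t_p in zip(X, t):
        err = t_p - (np.dot(w, x_p) + b)
        dw += alpha * err * x_p
        db += alpha * err
    return w + dw, b + db
```

Repeating `epoch_sequential` or `epoch_batch` over many epochs gives the two training variants described above.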
Training Algorithm

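The slide's pseudocode is not reproduced in this text. Below is a minimal sketch of the commonly presented Adaline training loop (sequential delta rule with a stopping condition on the largest weight change); the learning rate, tolerance, and example data are illustrative assumptions:

```python
import numpy as np

def train_adaline(X, t, alpha=0.1, tol=1e-4, max_epochs=1000):
    # Step 0: small random initial weights and bias
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.5, 0.5, size=X.shape[1])
    b = rng.uniform(-0.5, 0.5)

    for _ in range(max_epochs):
        max_change = 0.0
        for x_p, t_p in zip(X, t):
            y_in = np.dot(w, x_p) + b        # net input
            err = t_p - y_in                 # delta-rule error term
            w += alpha * err * x_p           # weight update
            b += alpha * err                 # bias update
            max_change = max(max_change, np.max(np.abs(alpha * err * x_p)))
        # Stop when the largest weight change in an epoch is tiny
        # (with a constant alpha this may never trigger; max_epochs then caps training)
        if max_change < tol:
            break
    return w, b

# Example: the bipolar AND function
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
t = np.array([1, -1, -1, -1], dtype=float)
w, b = train_adaline(X, t)
print(np.where(X @ w + b >= 0, 1, -1))       # expected: [ 1 -1 -1 -1]
```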
Parameter Initialization
• Learning rate
  • Hecht-Nielsen
  • Practical value for a single neuron

Application

