
SUPERVISED LEARNING NETWORK



DEFINITION OF SUPERVISED LEARNING NETWORKS

• Training and test data sets are used.

• In the training set, both the input and the target output are specified.



• By Rosenblatt (1962)
– Three layers of units: Sensory, Association, and Response
– Learning occurs only on weights from A units to R units
(weights from S units to A units are fixed).
– A single R unit receives inputs from n A units (same
architecture as our simple network)
– For a given training sample s:t, change the weights only if the
computed output y differs from the target output t
(thus the learning is error-driven)



PERCEPTRON NETWORKS



• Linear threshold unit (LTU)

  The unit forms the weighted sum of its inputs x1, …, xn (with x0 = 1 acting as the bias input for weight w0) and applies a hard threshold:

      net = Σ (i = 0 to n) wi xi

      o = f(net) = 1   if net > 0
                   -1  otherwise
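
As a quick illustration (not part of the original slides), the LTU computation can be written in a few lines of Python; the weight and input values below are arbitrary assumptions:

    # Minimal sketch of a linear threshold unit (LTU); weights and inputs are illustrative.
    def ltu(x, w, w0):
        """Return 1 if w0 + sum(wi*xi) > 0, otherwise -1."""
        net = w0 + sum(wi * xi for wi, xi in zip(w, x))
        return 1 if net > 0 else -1

    print(ltu(x=[1, 0], w=[0.6, 0.6], w0=-0.5))   # 1
    print(ltu(x=[0, 0], w=[0.6, 0.6], w0=-0.5))   # -1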



Key Points
• Three layers of units: Sensory, Association, and Response.
• The sensory units are connected to the associator units with fixed weights of value 1, 0 or -1, assigned at random.
• A binary activation function is used in the sensory and associator units.
• The response unit has an activation of 1, 0 or -1. A binary step function with a fixed threshold is used as the activation of the associator. The signals sent from the associator units to the response unit are binary only.



PERCEPTRON LEARNING

wi = wi + Δwi
Δwi = η (t - o) xi

where
t = c(x) is the target value,
o is the perceptron output,
η is a small constant (e.g., 0.1) called the learning rate.

• If the output is correct (t = o), the weights wi are not changed.

• If the output is incorrect (t ≠ o), the weights wi are changed such that the output of the perceptron for the new weights is closer to t.

• The algorithm converges to the correct classification
  – if the training data is linearly separable, and
  – if η is sufficiently small.
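
A minimal Python sketch of this update rule (not from the slides; the example weights, inputs, and η = 0.1 are illustrative assumptions):

    # One perceptron weight update: wi = wi + eta*(t - o)*xi
    def perceptron_update(w, x, t, o, eta=0.1):
        """Return updated weights; they are unchanged when the output o equals the target t."""
        return [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]

    # Example: target 1, output -1, so each weight moves in the direction of its input.
    print(perceptron_update(w=[0.2, -0.4], x=[1.0, 0.5], t=1, o=-1))   # roughly [0.4, -0.3]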



LEARNING ALGORITHM

• Epoch: presentation of the entire training set to the neural network.

• In the case of the AND function, an epoch consists of four sets of inputs being presented to the network (i.e. [0,0], [0,1], [1,0], [1,1]).

• Error: the error value is the amount by which the value output by the network differs from the target value. For example, if we require the network to output 0 and it outputs 1, then Error = -1.



• Target Value, T: when we are training a network, we present it not only with the input but also with a value that we require the network to produce. For example, if we present the network with [1,1] for the AND function, the training value will be 1.

• Output, O: the output value from the neuron.

• Ij: the inputs being presented to the neuron.

• Wj: the weight from input neuron Ij to the output neuron.

• LR: the learning rate. This dictates how quickly the network converges. It is set by experimentation and is typically 0.1.



TRAINING ALGORITHM

• Adjust the neural network weights to map inputs to outputs.

• Use a set of sample patterns for which the desired output (given the inputs presented) is known.

• The purpose is to learn to recognize features which are common to good and bad exemplars.



Perceptron Training Algorithm for Single output classes



Single classification perceptron network



MULTILAYER PERCEPTRON

[Figure: a multilayer perceptron. Input signals (external stimuli) enter at the input layer and pass through adjustable weights to the output layer, which produces the output values.]


LAYERS IN NEURAL NETWORK

• The input layer:
  – Introduces the input values into the network.
  – No activation function or other processing.

• The hidden layer(s):
  – Perform classification of features.
  – Two hidden layers are sufficient to solve any problem.
  – More complex features may imply that more layers are better.

• The output layer:
  – Functionally, it is just like the hidden layers.
  – Its outputs are passed on to the world outside the neural network.

Implement OR function using perceptron model
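
The slide's worked example appears as an image in the original; the sketch below is an illustrative Python reconstruction (not the slide's own working), assuming binary 0/1 inputs and targets, a step activation, a bias weight, and a learning rate of 0.1:

    # Perceptron trained on the OR truth table using the rule wi = wi + eta*(t - o)*xi.
    def step(net):
        return 1 if net > 0 else 0              # binary step activation

    samples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]   # OR truth table
    w, b, eta = [0.0, 0.0], 0.0, 0.1

    for epoch in range(20):                     # one epoch presents all four samples
        errors = 0
        for x, t in samples:
            o = step(b + sum(wi * xi for wi, xi in zip(w, x)))
            if o != t:                          # update only when the output is wrong
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
                b += eta * (t - o)
                errors += 1
        if errors == 0:                         # stop once an epoch is error-free
            break

    print(w, b)
    print([step(b + sum(wi * xi for wi, xi in zip(w, x))) for x, _ in samples])   # [0, 1, 1, 1]

With these settings the network converges after a few epochs to weights that reproduce the OR truth table.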

ADAPTIVE LINEAR NEURON (ADALINE)

In 1959, Bernard Widrow and Marcian Hoff of Stanford developed models they called ADALINE (Adaptive Linear Neuron) and MADALINE (Multilayer ADALINE). These models were named for their use of Multiple ADAptive LINear Elements. MADALINE was the first neural network to be applied to a real-world problem: it is an adaptive filter which eliminates echoes on phone lines.

ADALINE is a net which has only one output unit.

Delta Rule for Single Output
• The Widrow-Hoff rule is very similar to the perceptron learning rule; however, their origins are different.
• The perceptron learning rule originates from Hebbian assumptions, while the delta rule is derived from the gradient-descent method.
• The perceptron learning rule stops after a finite number of learning steps, but the gradient-descent approach continues forever, converging only asymptotically to the solution.
• The delta rule updates the weights of the connections so as to minimize the difference between the net input to the output unit and the target value.
• The major aim is to minimize the error over all training patterns. This is done by reducing the error for each pattern, one at a time.



Delta Rule for Single Output
• The delta rule for adjusting the ith weight (i = 1 to n) for a given training pattern is

      Δwi = α (t - y_in) xi

  where α is the learning rate, xi is the ith input, y_in is the net input to the output unit, and t is the target output.



ADALINE MODEL



ADALINE Training Mechanism

ADALINE LEARNING RULE

The ADALINE network uses the delta learning rule. This rule is also called the Widrow-Hoff learning rule or the least mean square (LMS) rule. The delta rule for adjusting the weights is given as (i = 1 to n):

      Δwi = α (t - y_in) xi
      Δb  = α (t - y_in)



USING ADALINE NETWORKS

• Initialize
  – Assign random weights to all links.

• Training
  – Feed in the known inputs in random sequence.
  – Simulate the network.
  – Compute the error between the target and the output (error function).
  – Adjust the weights (learning function).
  – Repeat until the total error < ε.

• Thinking
  – Simulate the network.
  – The network will respond to any input.
  – It does not guarantee a correct solution, even for trained inputs.

Example (ADALINE)



• Initially all the weights on the links and the bias are assumed to be small random values, say 0.1, and the learning rate is also set to 0.1.
• An acceptable least mean square (LMS) error may also be set.
• The weights are updated until this LMS error is reached.

• The initial weights are taken to be w1 = w2 = b = 0.1 and the learning rate as 0.1.

• For the first input sample, x1 = 1, x2 = 1, t = 1, we calculate the net input as:

      y_in = b + x1 w1 + x2 w2 = 0.1 + 1(0.1) + 1(0.1) = 0.3



• Now compute (t - y_in) = (1 - 0.3) = 0.7. Updating the weights, we obtain:

      w1(new) = w1(old) + α (t - y_in) x1 = 0.1 + (0.1)(0.7)(1) = 0.17
      w2(new) = w2(old) + α (t - y_in) x2 = 0.1 + (0.1)(0.7)(1) = 0.17
      b(new)  = b(old)  + α (t - y_in)    = 0.1 + (0.1)(0.7)    = 0.17

• These calculations are performed for all the input samples and the error is calculated for each.
• One epoch is completed when all the input patterns have been presented.

• Summing up the errors obtained for each input sample during one epoch gives the total mean square error of the epoch.

• The network training is continued until this error is minimized to a very small value.
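
A minimal Python sketch of this procedure (not the slides' own listing): the initial values w1 = w2 = b = 0.1 and α = 0.1 mirror the example above, while the full bipolar AND training set and the number of epochs are assumptions, since the slides only show the first sample:

    # ADALINE trained with the delta (LMS) rule - illustrative sketch.
    samples = [([1, 1], 1), ([1, -1], -1), ([-1, 1], -1), ([-1, -1], -1)]   # assumed bipolar AND data
    w, b, alpha = [0.1, 0.1], 0.1, 0.1

    for epoch in range(20):
        sq_error = 0.0
        for x, t in samples:
            y_in = b + sum(wi * xi for wi, xi in zip(w, x))          # net input
            err = t - y_in
            # Delta rule: wi(new) = wi(old) + alpha*(t - y_in)*xi, b(new) = b(old) + alpha*(t - y_in)
            w = [wi + alpha * err * xi for wi, xi in zip(w, x)]
            b += alpha * err
            sq_error += err ** 2
        print(epoch + 1, round(sq_error, 3))      # total squared error per epoch decreases toward its minimum

    print([round(v, 3) for v in w], round(b, 3))  # weights approach roughly [0.5, 0.5] with bias -0.5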

Difference between perceptron and ADALINE
• The perceptron is the simplest form of feedforward neural network. Perceptron networks use the perceptron rule as the learning algorithm: the weights and biases of the neurons are trained to produce the correct targets when the corresponding inputs are presented.
• ADALINE (Adaptive Linear Element), on the other hand, is a single-layer linear neural network based on the McCulloch-Pitts neuron. The learning phase of the ADALINE network adjusts the weights of the neurons according to the weighted sum of the inputs (the net input). Each node in the ADALINE network accepts more than one input but generates a single output.



MADALINE NETWORK
MADALINE is a Multilayer Adaptive Linear Element. MADALINE was the first neural network to be applied to a real-world problem. It is used in several adaptive filtering processes.



Madaline
• The model consists of many Adalines in parallel with a single output unit.
• The weights connecting the Adaline layer to the Madaline layer are fixed, positive, and of equal value.
• The weights between the input layer and the Adaline layer are adjusted during the training process.
• Each neuron in the Adaline and Madaline layers has a bias of excitation 1 connected to it.
• The training process is similar to that of the Adaline.
• There are "n" units in the input layer, "m" units in the Adaline layer, and one unit in the Madaline layer.
• The Adaline layer lies between the input layer and the Madaline layer, so it acts as a hidden layer.
• The use of this hidden layer gives the net a computational capability which is not found in single-layer nets.
Training Algorithm
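
The training algorithm itself is shown as images in the original slides. The sketch below is an illustrative Python reconstruction of MADALINE Rule I (MRI) under common textbook assumptions (bipolar XOR data, two Adalines, fixed Adaline-to-Madaline weights of 0.5, learning rate 0.5); it is not the slides' own listing, and MRI convergence is not guaranteed in general:

    # MADALINE Rule I (MRI): two Adalines feeding one Madaline output unit.
    import random

    random.seed(1)
    samples = [([1, 1], -1), ([1, -1], 1), ([-1, 1], 1), ([-1, -1], -1)]   # bipolar XOR (assumed data)

    n_in, n_hidden, alpha = 2, 2, 0.5
    W = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]  # input -> Adaline weights
    B = [random.uniform(-0.1, 0.1) for _ in range(n_hidden)]                         # Adaline biases
    v, b_out = [0.5] * n_hidden, 0.5     # Adaline -> Madaline weights: fixed, positive, equal

    def f(net):
        return 1 if net >= 0 else -1     # bipolar step activation

    for epoch in range(50):
        errors = 0
        for x, t in samples:
            z_in = [B[j] + sum(W[j][i] * x[i] for i in range(n_in)) for j in range(n_hidden)]
            z = [f(zi) for zi in z_in]
            y = f(b_out + sum(vj * zj for vj, zj in zip(v, z)))
            if y == t:
                continue
            errors += 1
            if t == 1:
                # Target is +1: update only the Adaline whose net input is closest to zero, toward +1.
                j = min(range(n_hidden), key=lambda k: abs(z_in[k]))
                B[j] += alpha * (1 - z_in[j])
                W[j] = [W[j][i] + alpha * (1 - z_in[j]) * x[i] for i in range(n_in)]
            else:
                # Target is -1: update every Adaline whose net input is positive, toward -1.
                for j in range(n_hidden):
                    if z_in[j] > 0:
                        B[j] += alpha * (-1 - z_in[j])
                        W[j] = [W[j][i] + alpha * (-1 - z_in[j]) * x[i] for i in range(n_in)]
        if errors == 0:
            break

    print("epochs run:", epoch + 1)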

BACK PROPAGATION NETWORK



• A training procedure which allows multilayer feedforward neural networks to be trained.

• Can theoretically perform "any" input-output mapping.

• Can learn to solve linearly inseparable problems.



MULTILAYER FEEDFORWARD NETWORK

[Figure: a multilayer feedforward network with inputs I0-I3, hidden units h0-h2, and outputs o0-o1. Signals flow from the inputs, through the hidden units, to the outputs.]
MULTILAYER FEEDFORWARD NETWORK: ACTIVATION AND TRAINING

• For feedforward networks:
  – A continuous activation function can be differentiated, allowing gradient descent.
  – Backpropagation is an example of a gradient-descent technique.
  – It uses a sigmoid (binary or bipolar) activation function.



In multilayer networks, the activation function is usually more complex than just a threshold function, such as 1/[1 + exp(-x)], or even 2/[1 + exp(-x)] - 1 to allow for inhibition, etc.
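
For reference, the two activation functions mentioned above (the binary and the bipolar sigmoid) can be written directly; this is an illustrative sketch, not part of the slides:

    # Binary and bipolar sigmoid activation functions.
    import math

    def binary_sigmoid(x):
        """Outputs in (0, 1): 1 / (1 + exp(-x))."""
        return 1.0 / (1.0 + math.exp(-x))

    def bipolar_sigmoid(x):
        """Outputs in (-1, 1): 2 / (1 + exp(-x)) - 1, allowing negative (inhibitory) activations."""
        return 2.0 / (1.0 + math.exp(-x)) - 1.0

    print(binary_sigmoid(0.0), bipolar_sigmoid(0.0))   # 0.5 0.0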



GRADIENT DESCENT
• Gradient-Descent(training_examples, η)

• Each training example is a pair of the form <(x1, …, xn), t>, where (x1, …, xn) is the vector of input values, t is the target output value, and η is the learning rate (e.g. 0.1).

• Initialize each wi to some small random value.

• Until the termination condition is met, Do
  – Initialize each Δwi to zero.
  – For each <(x1, …, xn), t> in training_examples, Do
    – Input the instance (x1, …, xn) to the linear unit and compute the output o.
    – For each linear unit weight wi, Do
          Δwi = Δwi + η (t - o) xi
  – For each linear unit weight wi, Do
          wi = wi + Δwi
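
A direct Python transcription of this procedure (an illustrative sketch: the toy training data, learning rate, and fixed number of passes are assumptions, not part of the slides):

    # Batch gradient descent for a single linear unit, following the steps above.
    import random

    random.seed(0)
    # Assumed toy data: targets generated exactly by t = 2*x1 - 1*x2.
    training_examples = [((x1, x2), 2 * x1 - 1 * x2)
                         for x1 in (0.0, 0.5, 1.0) for x2 in (0.0, 0.5, 1.0)]
    eta = 0.1
    w = [random.uniform(-0.05, 0.05) for _ in range(2)]    # small random initial weights

    for _ in range(500):                                   # termination condition: fixed number of passes
        delta_w = [0.0, 0.0]                               # initialize each delta-wi to zero
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))       # linear unit output
            for i, xi in enumerate(x):
                delta_w[i] += eta * (t - o) * xi           # accumulate delta-wi
        w = [wi + dwi for wi, dwi in zip(w, delta_w)]      # wi = wi + delta-wi

    print([round(wi, 3) for wi in w])                      # approaches [2.0, -1.0]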



Algorithm for Training Network



MODES OF GRADIENT DESCENT

• Batch mode: gradient descent over the entire data set D
      w = w - η ∇ED[w]
      ED[w] = (1/2) Σd∈D (td - od)^2

• Incremental mode: gradient descent over individual training examples d
      w = w - η ∇Ed[w]
      Ed[w] = (1/2) (td - od)^2

• Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is small enough.



SIGMOID ACTIVATION FUNCTION

The unit computes net = Σ (i = 0 to n) wi xi (with x0 = 1 as the bias input) and outputs o = σ(net) = 1 / (1 + e^(-net)).

σ(x) is the sigmoid function: σ(x) = 1 / (1 + e^(-x))

Its derivative is dσ(x)/dx = σ(x) (1 - σ(x)).

Gradient-descent rules can be derived to train:
• a single sigmoid unit:
      ∂E/∂wi = -Σd (td - od) od (1 - od) xi,d
• multilayer networks of sigmoid units: backpropagation.



BACKPROPAGATION TRAINING ALGORITHM

BACKPROPAGATION

• Gradient descent over the entire network weight vector.

• Easily generalized to arbitrary directed graphs.

• Will find a local, not necessarily global, error minimum; in practice it often works well (it can be invoked multiple times with different initial weights).

• Often includes a weight momentum term:
      Δwi,j(t) = η δj xi,j + α Δwi,j(t - 1)

• Minimizes the error over the training examples.

• Will it generalize well to unseen instances (over-fitting)?

• Training can be slow: typically 1,000-10,000 iterations (Levenberg-Marquardt can be used instead of plain gradient descent).
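
A minimal NumPy sketch of backpropagation for a two-layer sigmoid network (not the slides' algorithm listing; the XOR data, network size, learning rate, and epoch count are illustrative assumptions):

    # Backpropagation on XOR with one hidden layer of sigmoid units (batch mode).
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs
    T = np.array([[0], [1], [1], [0]], dtype=float)               # XOR targets

    n_in, n_hidden, n_out, eta = 2, 3, 1, 0.5
    # Small random initial weights; the last row of each matrix acts as the bias weight.
    W1 = rng.uniform(-0.5, 0.5, (n_in + 1, n_hidden))
    W2 = rng.uniform(-0.5, 0.5, (n_hidden + 1, n_out))

    for epoch in range(10000):
        # Forward pass (a column of ones supplies the bias input).
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])
        H = sigmoid(Xb @ W1)                                # hidden activations
        Hb = np.hstack([H, np.ones((H.shape[0], 1))])
        O = sigmoid(Hb @ W2)                                # network outputs

        # Backward pass: delta terms use the sigmoid derivative o*(1 - o).
        delta_out = (T - O) * O * (1 - O)
        delta_hidden = (delta_out @ W2[:-1].T) * H * (1 - H)

        # Gradient-descent weight updates.
        W2 += eta * Hb.T @ delta_out
        W1 += eta * Xb.T @ delta_hidden

    print(np.round(O, 2))   # outputs should approach [0, 1, 1, 0] for most initializations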



APPLICATIONS OF BACKPROPAGATION NETWORK

• Load forecasting problems in power systems.
• Image processing.
• Fault diagnosis and fault detection.
• Gesture recognition, speech recognition.
• Signature verification.
• Bioinformatics.
• Structural engineering design (civil).


RADIAL BASIS FUNCTION NETWORK

• The radial basis function (RBF) network is a classification and functional approximation neural network developed by M.J.D. Powell.

• The network uses the most common nonlinearities, such as sigmoidal and Gaussian kernel functions.

• Gaussian functions are also used in regularization networks.

• The Gaussian function is generally defined as

      φ(x) = exp(-||x - c||^2 / (2σ^2))

  where c is the centre and σ is the width (spread) of the Gaussian.
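
As an illustrative sketch (not from the slides), an RBF network with Gaussian hidden units and a linear output layer can be written as follows; the centres, width σ, XOR data, and the least-squares fit of the output weights are all assumptions made for the example:

    # RBF network: Gaussian hidden units plus linear output weights fit by least squares.
    import numpy as np

    def gaussian(x, c, sigma):
        """Gaussian radial basis function exp(-||x - c||^2 / (2*sigma^2))."""
        return np.exp(-np.sum((x - c) ** 2) / (2.0 * sigma ** 2))

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    t = np.array([0, 1, 1, 0], dtype=float)             # XOR targets

    centers = np.array([[0, 1], [1, 0]], dtype=float)   # one centre per "positive" pattern
    sigma = 0.7

    # Hidden-layer design matrix: one Gaussian activation per (sample, centre) pair, plus a bias column.
    Phi = np.array([[gaussian(x, c, sigma) for c in centers] + [1.0] for x in X])

    # Output-layer weights solved directly by linear least squares (instead of iterative training).
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

    print(np.round(Phi @ w, 2))                          # approximately [0, 1, 1, 0]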



Architecture



RADIAL BASIS FUNCTION NETWORK



SUMMARY

This chapter discussed several supervised learning networks:

• Perceptron,
• Adaline,
• Madaline,
• Backpropagation Network,
• Radial Basis Function Network.

Apart from those mentioned above, there are several other supervised neural networks, such as tree neural networks, wavelet neural networks, functional link neural networks, and so on.

