Lecture Three Multi-Layer Perceptron: Backpropagation: Part I: Fundamentals of Neural Networks
Lecture Three
Multi-layer Perceptron:
Xiang Cheng
Associate Professor
Department of Electrical & Computer Engineering
The National University of Singapore
Pattern Recognition:
Goal: To correctly classify the set of externally applied
stimuli x1, x2,…, xn into one of two classes, C1 and C2.
What is the equation describing the decision boundary produced by the perceptron?
Linearly Separable
If two classes can be separated by one line (plane or hyper-plane in higher
dimensional space).
Figure: (a) A pair of linearly separable patterns; (b) A pair of non-linearly separable patterns.
Two classes are linearly separable if and only if there exists a weight vector w
based on which the perceptron can correctly perform the classification.
How to choose the proper weights (the proper decision boundary)?
By off-line calculation of weights (without learning) if the
problem is relatively simple in lower dimensional space.
If the problem is more complex, we can use
Perceptron learning algorithm
Start with a randomly chosen weight vector w(1);
Update the weight vector by the error-correction-learning rule
w(n + 1) = w(n) + ηe(n) x(n)
e( n ) = d ( n ) − y ( n )
If the patterns are linearly separable, then the weights will converge properly in
finite steps.
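The learning rule above can be sketched in a few lines of Python. This is only a minimal illustration under assumptions that are not in the slides: the hard limiter fires at v >= 0, the bias is handled as a weight on a constant +1 input, and the AND gate is used as a hypothetical linearly separable data set.

```python
import numpy as np

def hard_limiter(v):
    """Threshold activation (assumed convention): 1 if v >= 0, else 0."""
    return 1 if v >= 0 else 0

def train_perceptron(X, d, eta=1.0, max_epochs=1000, seed=0):
    """Error-correction learning: w(n+1) = w(n) + eta * e(n) * x(n)."""
    rng = np.random.default_rng(seed)
    # Prepend a constant +1 input so that the bias is learned as w[0].
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = rng.normal(size=Xb.shape[1])        # randomly chosen w(1)
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in zip(Xb, d):
            y = hard_limiter(w @ x)
            e = target - y                  # e(n) = d(n) - y(n)
            if e != 0:
                w = w + eta * e * x
                mistakes += 1
        if mistakes == 0:                   # every pattern is classified correctly
            break
    return w

def predict(w, X):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return (Xb @ w >= 0).astype(int)

# Hypothetical example: the AND gate, which is linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d_and = np.array([0, 0, 0, 1])
w_and = train_perceptron(X, d_and)
print(predict(w_and, X))
```

Because the weights only change when a pattern is misclassified, the loop stops as soon as one full pass produces no errors.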
Regression Problem
Consider a multiple input-single output system whose mathematical
characterization is unknown:
Optimization problem: Minimize the cost function!
How to evaluate the fitting results? Which one is better? Red or Black?
What is the most common cost function?
$E(w) = \sum_{i=1}^{n} e(i)^2 = \sum_{i=1}^{n} (d(i) - y(i))^2$
Big question: How do we choose the iterative algorithm such that the cost is always decreasing?
What is the simplest way if the gradient is known?
Method of Steepest Descent (Gradient Descent)
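$w(n+1) = w(n) - \eta \, \partial E(w) / \partial w$, where the step size η must be sufficiently small!
Of course, for a linear model the answer can be easily found by solving $\partial E(w) / \partial w = 0$:
$w = (X^T X)^{-1} X^T d$
Regression matrix: X, whose i-th row is the input vector $x(i)^T$; d = [d(1), …, d(n)]^T.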
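As a rough check of the two routes, the sketch below fits a straight line both with the closed-form solution and with steepest descent. The synthetic data, the step size eta = 0.005, and the iteration count are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(-1.0, 1.0, size=n)
# Hypothetical data: a noisy line d = 2x + 1.
d = 2.0 * x + 1.0 + 0.1 * rng.normal(size=n)

# Regression matrix: one row per sample, first column is the constant input.
X = np.column_stack([np.ones(n), x])

# Closed-form least-squares solution w = (X^T X)^{-1} X^T d.
w_ls = np.linalg.solve(X.T @ X, X.T @ d)

# Steepest descent on E(w) = sum_i (d(i) - y(i))^2.
# The gradient is -2 X^T e, so the update is w <- w + 2 * eta * X^T e.
eta = 0.005                      # assumed step size, small enough for this data
w = np.zeros(2)
costs = []
for _ in range(2000):
    e = d - X @ w
    costs.append(float(e @ e))
    w = w + 2.0 * eta * (X.T @ e)

print("closed form     :", w_ls)
print("steepest descent:", w)
```

With a sufficiently small step size the recorded cost never increases, and the iterates approach the closed-form answer.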
Can we directly use Rosenblatt’s perceptron to solve this linear regression problem?
No. The output of the perceptron is either 1 or 0 due to the hard limiter.
What is the simplest way to make the range of the output continuous instead of binary?
Linear Neuron
(single-layer perceptron without squash function)
What is the simple logic gate problem that killed perceptron?
Let’s consider the XOR truth table:
Figure: the XOR patterns in the (x1, x2) plane with two decision lines, Line 1 and Line 2.
What would happen if we combine the two perceptrons together?
Figure: Neuron 1 (output y1) and Neuron 2 (output y2), each taking x1 and x2 as inputs.
x1   0  0  1  1
x2   0  1  0  1
y1   0  0  0  1
y2   0  1  1  1
Can you find a line to separate the two classes in the output space (y1,y2) ?
Let’s construct the perceptron to separate the two classes.
What is the equation for this line? x2 = x1 + 0.5, i.e., −x1 + x2 − 0.5 = 0
Figure: perceptron implementing this line (bias −0.5, input weights −1 and +1).
Now, how to combine this perceptron with the previous two neurons?
What are the inputs to this perceptron?
The outputs of the previous two perceptrons serve as the inputs to the output neuron!
The complete solution to XOR problem
Two layers!
The inputs are transformed into another space (y1, y2) such that they become linearly separable!
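The two-layer construction can be checked with a short sketch. The output neuron uses the line −x1 + x2 − 0.5 = 0 from above; the biases of the two hidden neurons (−1.5 and −0.5) and the threshold convention v >= 0 are assumptions, chosen only so that y1 and y2 reproduce the truth table shown earlier.

```python
import numpy as np

def step(v):
    """Hard limiter (assumed convention): 1 if v >= 0, else 0."""
    return (np.asarray(v) >= 0).astype(int)

def xor_net(x1, x2):
    # Hidden layer: both neurons use weights +1, +1 on (x1, x2).
    y1 = step(x1 + x2 - 1.5)     # Neuron 1 (assumed bias -1.5): y1 = 0 0 0 1
    y2 = step(x1 + x2 - 0.5)     # Neuron 2 (assumed bias -0.5): y2 = 0 1 1 1
    # Output neuron: the line -y1 + y2 - 0.5 = 0 in the (y1, y2) space.
    return step(-y1 + y2 - 0.5)

x1 = np.array([0, 0, 1, 1])
x2 = np.array([0, 1, 0, 1])
print(xor_net(x1, x2))           # prints [0 1 1 0]
```

In the (y1, y2) space the four input patterns collapse onto three points, (0, 0), (0, 1) and (1, 1), and a single line separates (0, 1) from the other two.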
Could Frank Rosenblatt find out this solution and answer Minsky’s attack if
he had survived the boating accident?
Yes. He could! Unfortunately, we had to wait another 15 years after his tragic
death in 1971.
Multilayer Perceptron (MLP) and Back Propagation Algorithm
David Rumelhart and the PDP (Parallel Distributed Processing) group, 1986
He obtained his B.A. in psychology and mathematics in 1963 at the
University of South Dakota. He received his Ph. D. in mathematical
psychology at Stanford University in 1967. From 1967 to 1987 he
served on the faculty of the Department of Psychology at the
University of California, San Diego.
The PDP group was led by David Rumelhart and Jay McClelland at
UCSD. They became dissatisfied with symbol-processing machines,
and embarked on a more ambitious “connectionist” program.
The 1986 PDP book was a big success. The book was read eagerly not
only by brain theorists and psychologists but by mathematicians,
physicists, engineers and even by people working in Artificial Intelligence.
Figure: David Rumelhart (1942-2011)
In 1987, Rumelhart moved to Stanford University, serving as Professor there until 1998.
The Robert J. Glushko and Pamela Samuelson Foundation created the David E. Rumelhart
Prize for Contributions to the Theoretical Foundations of Human Cognition in 2000.
Francis Crick was also a member of the PDP group. He joked later, “Almost my only
contribution to their efforts was to insist that they stop using the word neurons for the units of
their networks.”
Multilayer Perceptrons
Multilayer perceptrons (MLPs)
Generalization of the single-layer perceptron
Consists of
An input layer
One or more hidden layers of computation nodes
An output layer of computation nodes
Architectural graph of a multilayer perceptron with two hidden layers:
MLP generally adopts a smooth nonlinear activation function, such as
the following logistic function:
where vj is the induced local field (weighted sum of all synaptic inputs plus
the bias) of neuron j, yj is the output of the neuron.
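The formula itself did not survive in this text, so the sketch below assumes the standard logistic function with unit slope, y = 1 / (1 + exp(−v)), together with its derivative y(1 − y), which is what the back-propagation derivation needs later. The function names are illustrative.

```python
import numpy as np

def logistic(v):
    """Standard logistic function (unit slope assumed): maps v to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def logistic_prime(v):
    """Derivative of the logistic function, written in terms of its output."""
    y = logistic(v)
    return y * (1.0 - y)

v = np.array([-2.0, 0.0, 2.0])
print(logistic(v))
print(logistic_prime(v))
```

The derivative is largest at v = 0 and shrinks toward zero for large |v|, which is why saturated neurons learn slowly.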
What would happen if all the neurons are linear neurons? Would it behave
differently from single layer perceptrons?
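One way to explore this question numerically: stack two linear layers and compare with a single linear layer whose weights are the product of the two. The layer sizes and random values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)   # hidden layer, linear neurons
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)   # output layer, linear neuron
x = rng.normal(size=2)

# Two linear layers applied in sequence.
two_layer = W2 @ (W1 @ x + b1) + b2

# The same mapping as one linear layer.
W = W2 @ W1
b = W2 @ b1 + b2
single = W @ x + b

print(two_layer, single)
```

The two results agree, so a network made only of linear neurons has no more representational power than a single linear layer.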
We already showed that MLP can solve the XOR problem by geometrical construction.
Training Algorithm
Back-Propagation Algorithm
Consider a multilayer perceptron neural network having three layers of
neurons (one output layer and two hidden layers).
Let’s try to figure out the Back-Propagation (BP) algorithm step by step.
The MLP is fed with an input vector x(n), and produces an output vector y(n).
Let d(n) denote the desired network output, and the error is then
$e(n) = d(n) - y(n) = d(n) - x_{out}^{(3)}(n)$
$E(n) = \frac{1}{2} \sum_{j=1}^{n_3} e_j(n)^2 = \frac{1}{2} \sum_{j=1}^{n_3} (d_j(n) - x_{out,j}^{(3)}(n))^2$
Similar to the LMS algorithm, the learning rule for a network weight in layer s is:
$\Delta w_{ji}^{(s)}(n) = -\eta \, \partial E(n) / \partial w_{ji}^{(s)}(n)$
All we need to do is figure out how to compute the derivatives for all the weights!
For Output Layer (neuron j for output layer):
$\Delta w_{ji}^{(3)}(n) = -\eta \frac{\partial E(n)}{\partial w_{ji}^{(3)}(n)}$
$E(n) = \frac{1}{2} \sum_{j=1}^{n_3} e_j(n)^2 = \frac{1}{2} \sum_{j=1}^{n_3} (d_j(n) - x_{out,j}^{(3)}(n))^2$
Do you know how to calculate $\partial E(n) / \partial x_{out,j}^{(3)}(n)$?
$\frac{\partial E(n)}{\partial x_{out,j}^{(3)}(n)} = (d_j(n) - x_{out,j}^{(3)}(n)) \cdot (-1) = -e_j(n)$
How to calculate the outputs of the network, $x_{out,j}^{(3)}(n)$?
$y_j(n) = x_{out,j}^{(3)}(n) = \varphi^{(3)}(v_j^{(3)}(n))$
How to calculate $\partial x_{out,j}^{(3)}(n) / \partial v_j^{(3)}(n)$?
$\frac{\partial x_{out,j}^{(3)}(n)}{\partial v_j^{(3)}(n)} = \varphi^{(3)\prime}(v_j^{(3)}(n))$
How to compute the induced local fields $v_j^{(3)}(n)$?
$v_j^{(3)}(n) = \sum_{i=1}^{n_2} w_{ji}^{(3)}(n) \, x_{out,i}^{(2)}(n)$
How to calculate $\partial v_j^{(3)}(n) / \partial w_{ji}^{(3)}(n)$?
$\frac{\partial v_j^{(3)}(n)}{\partial w_{ji}^{(3)}(n)} = x_{out,i}^{(2)}(n)$
Now the big question is: how to calculate $\partial E(n) / \partial w_{ji}^{(3)}(n)$? Chain Rule!
By chain rule, g(f(x))' = g'(f(x))f'(x).
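$\frac{\partial E(n)}{\partial w_{ji}^{(3)}(n)} = \frac{\partial E(n)}{\partial x_{out,j}^{(3)}(n)} \cdot \frac{\partial x_{out,j}^{(3)}(n)}{\partial v_j^{(3)}(n)} \cdot \frac{\partial v_j^{(3)}(n)}{\partial w_{ji}^{(3)}(n)}$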
Now,
$\frac{\partial E(n)}{\partial x_{out,j}^{(3)}(n)} = -e_j(n)$
$\frac{\partial x_{out,j}^{(3)}(n)}{\partial v_j^{(3)}(n)} = \varphi^{(3)\prime}(v_j^{(3)}(n))$
$\frac{\partial v_j^{(3)}(n)}{\partial w_{ji}^{(3)}(n)} = x_{out,i}^{(2)}(n)$
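So we have
$\frac{\partial E(n)}{\partial w_{ji}^{(3)}(n)} = -e_j(n) \, \varphi^{(3)\prime}(v_j^{(3)}(n)) \, x_{out,i}^{(2)}(n)$
Defining the output error $\delta_j^{(3)}(n) = e_j(n) \, \varphi^{(3)\prime}(v_j^{(3)}(n))$, the update rule becomes
$w_{ji}^{(3)}(n+1) = w_{ji}^{(3)}(n) + \eta \, \delta_j^{(3)}(n) \, x_{out,i}^{(2)}(n)$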
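A quick numerical sanity check of the output-layer result is sketched below: the analytic gradient −e_j φ'(v_j) x_out,i is compared against central finite differences, and one update step is applied. The logistic activation, the layer sizes, the random values and the omission of bias terms are all assumptions made for this illustration.

```python
import numpy as np

def phi(v):
    return 1.0 / (1.0 + np.exp(-v))       # logistic activation (assumed)

def cost(W3, x2, d):
    """E(n) = 1/2 * sum_j (d_j - x_out_j)^2 for one training example."""
    return 0.5 * np.sum((d - phi(W3 @ x2)) ** 2)

rng = np.random.default_rng(0)
x2 = rng.uniform(size=4)                  # outputs of the second hidden layer (hypothetical)
d = np.array([1.0, 0.0])                  # desired outputs (hypothetical)
W3 = rng.normal(size=(2, 4))              # output-layer weights w_ji
eta = 0.1

# Forward pass through the output layer.
v3 = W3 @ x2
y = phi(v3)
e = d - y

# Output error and analytic gradient.
delta3 = e * y * (1.0 - y)                # phi'(v) = y (1 - y) for the logistic
grad = -np.outer(delta3, x2)              # dE/dw_ji = -delta_j * x_out_i

# Central finite differences for comparison.
h = 1e-6
num_grad = np.zeros_like(W3)
for j, i in np.ndindex(W3.shape):
    Wp, Wm = W3.copy(), W3.copy()
    Wp[j, i] += h
    Wm[j, i] -= h
    num_grad[j, i] = (cost(Wp, x2, d) - cost(Wm, x2, d)) / (2 * h)

# One gradient-descent update of the output-layer weights.
W3_new = W3 + eta * np.outer(delta3, x2)
print(np.max(np.abs(grad - num_grad)), cost(W3, x2, d), cost(W3_new, x2, d))
```

The largest difference between the two gradients should be tiny, and the cost after the update should be lower than before.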
Let’s first figure out how the error signal is related to the synaptic weights in the second
hidden layer.
What do the error signals depend upon?
The outputs of the network.
What do the network outputs depend upon?
The induced local fields of the output neurons.
What do induced local fields of the output
neurons depend upon?
The outputs of the hidden neurons and the
synaptic weights in the output layer.
What do the outputs of the hidden neurons
depend upon?
The induced local fields of the hidden neurons.
What do the induced local fields of the hidden neurons depend upon ?
The synaptic weights in the second hidden layer.
There are five levels of dependence between the error signal and the synaptic weights in the
second hidden layer. Which rule shall we use to compute the derivatives?
The chain rule, of course!
Derivatives of cost function with respect to the weights in the second hidden layer:
By chain rule, we need to compute the derivatives for every level and then put them together.
The first level is from the outputs of the network to cost function E(n)
$E(n) = \frac{1}{2} \sum_{j=1}^{n_3} (d_j(n) - x_{out,j}^{(3)}(n))^2$
So we have
$\frac{\partial E(n)}{\partial x_{out,k}^{(3)}(n)} = -(d_k(n) - x_{out,k}^{(3)}(n)) = -e_k(n)$
The second level is from the induced local fields of the output neuron to the
output of the network.
$x_{out,k}^{(3)}(n) = \varphi^{(3)}(v_k^{(3)}(n))$
So we have
$\frac{\partial x_{out,k}^{(3)}(n)}{\partial v_k^{(3)}(n)} = \varphi^{(3)\prime}(v_k^{(3)}(n))$
The third level is from the outputs of the hidden
neurons to the induced local fields of the output neurons.
$v_k^{(3)}(n) = \sum_{j=1}^{n_2} w_{kj}^{(3)}(n) \, x_{out,j}^{(2)}(n)$
Easily we obtain
$\frac{\partial v_k^{(3)}(n)}{\partial x_{out,j}^{(2)}(n)} = w_{kj}^{(3)}(n)$
The fourth level is from the induced local fields of the hidden neurons to
the outputs of the hidden neurons.
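$x_{out,j}^{(2)}(n) = \varphi^{(2)}(v_j^{(2)}(n))$
Easily we obtain
$\frac{\partial x_{out,j}^{(2)}(n)}{\partial v_j^{(2)}(n)} = \varphi^{(2)\prime}(v_j^{(2)}(n))$
The fifth level is from the synaptic weights in the second hidden layer to the
induced local fields of the hidden neurons.
$v_j^{(2)}(n) = \sum_{i=1}^{n_1} w_{ji}^{(2)}(n) \, x_{out,i}^{(1)}(n)$
So we have
$\frac{\partial v_j^{(2)}(n)}{\partial w_{ji}^{(2)}(n)} = x_{out,i}^{(1)}(n)$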
In summary, we have
First level: $\frac{\partial E(n)}{\partial x_{out,k}^{(3)}(n)} = -(d_k(n) - x_{out,k}^{(3)}(n)) = -e_k(n)$
Second level: $\frac{\partial x_{out,k}^{(3)}(n)}{\partial v_k^{(3)}(n)} = \varphi^{(3)\prime}(v_k^{(3)}(n))$
Third level: $\frac{\partial v_k^{(3)}(n)}{\partial x_{out,j}^{(2)}(n)} = w_{kj}^{(3)}(n)$
Fourth level: $\frac{\partial x_{out,j}^{(2)}(n)}{\partial v_j^{(2)}(n)} = \varphi^{(2)\prime}(v_j^{(2)}(n))$
Fifth level: $\frac{\partial v_j^{(2)}(n)}{\partial w_{ji}^{(2)}(n)} = x_{out,i}^{(1)}(n)$
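$E(n) = \frac{1}{2} \sum_{j=1}^{n_3} (d_j(n) - x_{out,j}^{(3)}(n))^2$
$\frac{\partial E(n)}{\partial w_{ji}^{(2)}(n)} = \frac{\partial E(n)}{\partial x_{out,k}^{(3)}(n)} \cdot \frac{\partial x_{out,k}^{(3)}(n)}{\partial v_k^{(3)}(n)} \cdot \frac{\partial v_k^{(3)}(n)}{\partial x_{out,j}^{(2)}(n)} \cdot \frac{\partial x_{out,j}^{(2)}(n)}{\partial v_j^{(2)}(n)} \cdot \frac{\partial v_j^{(2)}(n)}{\partial w_{ji}^{(2)}(n)}$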
Is this the correct way to apply the chain rule? Did we miss anything?
It is not correct: we only considered the k-th output in the above calculation. The weights in the hidden
layer affect all the outputs of the network.
So we should consider all the outputs of the network instead of only one!
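$E(n) = \frac{1}{2} \sum_{j=1}^{n_3} (d_j(n) - x_{out,j}^{(3)}(n))^2$
$\frac{\partial E(n)}{\partial w_{ji}^{(2)}(n)} = \sum_{k=1}^{n_3} \left( \frac{\partial E(n)}{\partial x_{out,k}^{(3)}(n)} \cdot \frac{\partial x_{out,k}^{(3)}(n)}{\partial v_k^{(3)}(n)} \cdot \frac{\partial v_k^{(3)}(n)}{\partial x_{out,j}^{(2)}(n)} \cdot \frac{\partial x_{out,j}^{(2)}(n)}{\partial v_j^{(2)}(n)} \cdot \frac{\partial v_j^{(2)}(n)}{\partial w_{ji}^{(2)}(n)} \right)$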
Why don’t we use the summation for the derivatives with respect to the weights
in the output layer?
The weight in the output layer only affects its connected output neuron!
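The cost function: $E(n) = \frac{1}{2} \sum_{j=1}^{n_3} (d_j(n) - x_{out,j}^{(3)}(n))^2$
The derivatives for all the five levels:
$\frac{\partial E(n)}{\partial x_{out,k}^{(3)}(n)} = -(d_k(n) - x_{out,k}^{(3)}(n)) = -e_k(n)$
$\frac{\partial x_{out,k}^{(3)}(n)}{\partial v_k^{(3)}(n)} = \varphi^{(3)\prime}(v_k^{(3)}(n))$
$\frac{\partial v_k^{(3)}(n)}{\partial x_{out,j}^{(2)}(n)} = w_{kj}^{(3)}(n)$
$\frac{\partial x_{out,j}^{(2)}(n)}{\partial v_j^{(2)}(n)} = \varphi^{(2)\prime}(v_j^{(2)}(n))$
$\frac{\partial v_j^{(2)}(n)}{\partial w_{ji}^{(2)}(n)} = x_{out,i}^{(1)}(n)$
So we have
$\frac{\partial E(n)}{\partial w_{ji}^{(2)}(n)} = -\sum_{k=1}^{n_3} e_k(n) \, \varphi^{(3)\prime}(v_k^{(3)}(n)) \, w_{kj}^{(3)}(n) \, \varphi^{(2)\prime}(v_j^{(2)}(n)) \, x_{out,i}^{(1)}(n)$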
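Also notice that the last two terms do not depend upon the index k, so we can
take them out of the summation. Writing $\delta_k^{(3)}(n) = e_k(n) \, \varphi^{(3)\prime}(v_k^{(3)}(n))$
for the output error of output neuron k, we obtain
$\frac{\partial E(n)}{\partial w_{ji}^{(2)}(n)} = -\varphi^{(2)\prime}(v_j^{(2)}(n)) \, x_{out,i}^{(1)}(n) \sum_{k=1}^{n_3} \delta_k^{(3)}(n) \, w_{kj}^{(3)}(n)$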
Now let’s define the output error for the hidden neuron:
$\delta_j^{(2)}(n) = \left( \sum_{k=1}^{n_3} w_{kj}^{(3)}(n) \, \delta_k^{(3)}(n) \right) \varphi^{(2)\prime}(v_j^{(2)}(n))$
Thus, by the rule of gradient descent, we have
$w_{ji}^{(2)}(n+1) = w_{ji}^{(2)}(n) + \eta \, \delta_j^{(2)}(n) \, x_{out,i}^{(1)}(n)$
For the output layer, the output error is proportional to the network error!
How should we calculate the output error at the hidden layer, $\delta_j^{(2)}$?
$\delta_j^{(2)}(n) = \left( \sum_{k=1}^{n_3} w_{kj}^{(3)}(n) \, \delta_k^{(3)}(n) \right) \varphi^{(2)\prime}(v_j^{(2)}(n))$
For the hidden neurons, the output error is the linear combination of the errors
in the higher layer!
$\sum_{k=1}^{n_3} w_{kj}^{(3)}(n) \, \delta_k^{(3)}(n)$
$\delta_j^{(s)}(n) = (d_j(n) - x_{out,j}^{(s)}(n)) \, \varphi^{(s)\prime}(v_j^{(s)}(n))$ for output layer
or
$\delta_j^{(s)}(n) = \left( \sum_{k=1}^{n_{s+1}} \delta_k^{(s+1)}(n) \, w_{kj}^{(s+1)}(n) \right) \varphi^{(s)\prime}(v_j^{(s)}(n))$ for hidden layer
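The recursion for the output errors can be sketched as a backward loop over the layers and verified against finite differences. Everything concrete here is an assumption for illustration: logistic activations in every layer, no bias terms, layer sizes 3-4-3-2, and random weights.

```python
import numpy as np

def phi(v):
    return 1.0 / (1.0 + np.exp(-v))

def phi_prime(v):
    y = phi(v)
    return y * (1.0 - y)

def forward(weights, x):
    """Return the induced local fields and the outputs of every layer."""
    outs, fields = [x], []
    for W in weights:
        v = W @ outs[-1]
        fields.append(v)
        outs.append(phi(v))
    return fields, outs

def cost(weights, x, d):
    return 0.5 * np.sum((d - forward(weights, x)[1][-1]) ** 2)

def backward(weights, x, d):
    """Gradients dE/dW for every layer, computed with the delta recursion."""
    fields, outs = forward(weights, x)
    # Output layer: delta = (d - x_out) * phi'(v).
    delta = (d - outs[-1]) * phi_prime(fields[-1])
    grads = [None] * len(weights)
    for s in reversed(range(len(weights))):
        grads[s] = -np.outer(delta, outs[s])          # dE/dw_ji = -delta_j * x_out_i
        if s > 0:
            # Hidden layer: weighted sum of the deltas above, times phi'(v).
            delta = (weights[s].T @ delta) * phi_prime(fields[s - 1])
    return grads

def numerical_grads(weights, x, d, h=1e-6):
    grads = []
    for W in weights:
        G = np.zeros_like(W)
        for idx in np.ndindex(W.shape):
            old = W[idx]
            W[idx] = old + h
            e_plus = cost(weights, x, d)
            W[idx] = old - h
            e_minus = cost(weights, x, d)
            W[idx] = old
            G[idx] = (e_plus - e_minus) / (2 * h)
        grads.append(G)
    return grads

rng = np.random.default_rng(0)
sizes = [3, 4, 3, 2]                                   # input, two hidden layers, output
weights = [rng.normal(size=(sizes[k + 1], sizes[k])) for k in range(3)]
x, d = rng.normal(size=3), np.array([1.0, 0.0])

analytic = backward(weights, x, d)
numeric = numerical_grads(weights, x, d)
for s, (a, b) in enumerate(zip(analytic, numeric), start=1):
    print("layer", s, "max difference:", np.max(np.abs(a - b)))
```

Note that each delta is computed from the weights of the layer above before any weight is changed, which matches the derivation where all quantities are taken at step n.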
Signal-flow graphic representation of BP
Is BP a special case of
steepest descent method?
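In general, at step k,
$\Delta w_{ji}^{(s)}(k) = \sum_{t=0}^{k} \alpha^{k-t} \, \eta \, \delta_j^{(s)}(t) \, x_{out,i}^{(s-1)}(t) = \eta \sum_{t=0}^{k} \alpha^{k-t} \, \delta_j^{(s)}(t) \, x_{out,i}^{(s-1)}(t)$
where α is the momentum constant.
From previous analysis, $\partial E(t) / \partial w_{ji}^{(s)}(t) = -\delta_j^{(s)}(t) \, x_{out,i}^{(s-1)}(t)$, so
$\Delta w_{ji}^{(s)}(k) = -\eta \sum_{t=0}^{k} \alpha^{k-t} \frac{\partial E(t)}{\partial w_{ji}^{(s)}(t)}$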
The adjustments depend upon the weighted sum of present and the past derivatives!
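1. For stability, the momentum constant must satisfy $0 \le |\alpha| < 1$.
2. When $\partial E(t) / \partial w_{ji}^{(s)}(t)$ has the same algebraic sign on consecutive iterations,
then $\Delta w_{ji}^{(s)}(k)$ grows in magnitude and $w_{ji}^{(s)}$ is adjusted by a large
amount. The inclusion of momentum in the back-propagation algorithm
tends to accelerate descent in steady downhill directions.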
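A small numerical sketch of these statements follows. It assumes the usual recursive form of the update with momentum, Δw(k) = α Δw(k−1) + η δ(k) x(k), which is not shown in this text but unrolls into the weighted sum above. The values η = 0.1 and α = 0.9 and the random sequence standing in for δ(t) x(t) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, alpha = 0.1, 0.9            # learning rate and momentum constant (assumed values)
g = rng.normal(size=20)          # stands in for delta_j(t) * x_out_i(t), t = 0..19

# Recursive form (assumed): dw(k) = alpha * dw(k-1) + eta * g(k).
dw = 0.0
recursive = []
for t in range(len(g)):
    dw = alpha * dw + eta * g[t]
    recursive.append(dw)

# Unrolled form: dw(k) = eta * sum_t alpha^(k-t) * g(t).
k = len(g) - 1
unrolled = eta * sum(alpha ** (k - t) * g[t] for t in range(k + 1))
print(recursive[-1], unrolled)

# Same sign on every iteration: the adjustment grows toward eta / (1 - alpha).
dw_const = 0.0
for _ in range(500):
    dw_const = alpha * dw_const + eta * 1.0
print(dw_const, eta / (1.0 - alpha))
```

With a constant-sign gradient term the effective step grows by a factor of 1 / (1 − α), which is ten times the plain step for α = 0.9.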
1. Initialization:
weights, biases
2. Presentations of training examples
Present an epoch of training examples
For each example:
perform forward and backward
3. Forward computation
Compute error signal
4. Backward computation
Compute and adjust weights based
on generalized delta rule
5. Iteration
If the stopping criterion is not met, go
through steps 2, 3, and 4 again.
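The five steps can be put together in a compact training loop. The sketch below is one possible arrangement, not a reference implementation: it assumes a single hidden layer with 4 logistic neurons, biases handled as weights on a constant +1 input, sequential (per-example) updates without momentum, η = 0.5, and XOR as the training set.

```python
import numpy as np

def phi(v):
    return 1.0 / (1.0 + np.exp(-v))

def add_bias(x):
    """Prepend the constant +1 input that carries the bias weight."""
    return np.concatenate(([1.0], x))

def forward(W1, W2, x):
    h = phi(W1 @ add_bias(x))
    y = phi(W2 @ add_bias(h))
    return h, y

def epoch_cost(W1, W2, X, D):
    return sum(0.5 * np.sum((d - forward(W1, W2, x)[1]) ** 2) for x, d in zip(X, D))

def train(X, D, n_hidden=4, eta=0.5, max_epochs=10000, tol=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: small random weights and biases.
    W1 = rng.uniform(-1, 1, size=(n_hidden, X.shape[1] + 1))
    W2 = rng.uniform(-1, 1, size=(D.shape[1], n_hidden + 1))
    history = [epoch_cost(W1, W2, X, D)]
    for _ in range(max_epochs):
        # 2. Present one epoch of training examples.
        for x, d in zip(X, D):
            # 3. Forward computation and error signal.
            h, y = forward(W1, W2, x)
            e = d - y
            # 4. Backward computation: deltas first, then weight adjustments.
            delta2 = e * y * (1.0 - y)
            delta1 = (W2[:, 1:].T @ delta2) * h * (1.0 - h)
            W2 += eta * np.outer(delta2, add_bias(h))
            W1 += eta * np.outer(delta1, add_bias(x))
        history.append(epoch_cost(W1, W2, X, D))
        # 5. Iterate until the stopping criterion is met.
        if history[-1] < tol:
            break
    return W1, W2, history

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2, history = train(X, D)
print("initial cost:", history[0], "final cost:", history[-1])
print([float(forward(W1, W2, x)[1][0]) for x in X])
```

The bias column of W2 is skipped when the error is propagated back, because the constant +1 input has no incoming weights to adjust.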
Approximations of Functions
Question: Can Multi-layer Perceptrons approximate any functions?
This question was answered shortly after 1986 by a number of people including
Cybenko, Hecht-Nielsen, Funahashi, Hornik and White.
Universal Approximation Theorem:
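Let $\varphi(\cdot)$ be a non-constant, bounded, and monotone-increasing continuous
function. Let $I_{m_0}$ denote the $m_0$-dimensional unit hypercube $[0, 1]^{m_0}$. Then,
given any continuous function f on $I_{m_0}$ and $\varepsilon > 0$, there exist an integer $m_1$
and sets of real constants $\alpha_i$, $b_i$ and $w_{ij}$, where i = 1, …, $m_1$ and j = 1, …, $m_0$,
such that we may define
$F(x_1, \ldots, x_{m_0}) = \sum_{i=1}^{m_1} \alpha_i \, \varphi\left( \sum_{j=1}^{m_0} w_{ij} x_j + b_i \right)$
as an approximation of f, with $|F(x_1, \ldots, x_{m_0}) - f(x_1, \ldots, x_{m_0})| < \varepsilon$ for all points in $I_{m_0}$.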
The theorem merely states that a single hidden layer is sufficient for a
multilayer perceptron to approximate any bounded continuous function.
Does the theorem tell you how to find out the optimal weights?
No. This is just an existence result.
This theorem assures you that the good solution is out there. How to find it?
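To get a feeling for why sums of sigmoids are so flexible, the sketch below evaluates a single-hidden-layer network of the form F(x) = Σ α_i φ(Σ w_ij x_j + b_i) with two hand-picked hidden neurons. The weights are assumptions chosen by hand, not learned; they make F a "bump" that is close to 1 on roughly [0.3, 0.7] and close to 0 elsewhere on the unit interval.

```python
import numpy as np

def phi(v):
    return 1.0 / (1.0 + np.exp(-v))

def F(x, alpha, W, b):
    """Single-hidden-layer network: F(x) = sum_i alpha_i * phi(sum_j w_ij x_j + b_i)."""
    return alpha @ phi(W @ x + b)

# Hand-picked (hypothetical) parameters for a one-dimensional input, m0 = 1, m1 = 2.
alpha = np.array([1.0, -1.0])
W = np.array([[50.0], [50.0]])
b = np.array([-15.0, -35.0])      # steep sigmoids switching on at x = 0.3 and x = 0.7

for x in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(x, float(F(np.array([x]), alpha, W, b)))
```

Adding more such bumps with different heights and positions gives a staircase-like approximation of a continuous function, which is the intuition behind the existence result. Finding good weights automatically is a separate problem.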
Steve Simpson
David Raubenheimer
Frequency distribution (60 bins)
Network architecture
Feed forward network
60 input (one for each frequency bin)
6 hidden
2 output (0-1 for “Steve”, 1-0 for “David”)
Presenting the data
Presenting the data (untrained network)
Calculate error
0-0.43 = -0.43
1- 0.26 = 0.74
1-0.73 = 0.27
0-0.55 = -0.55
Backpropagate error and adjust weights
Repeat process (sweep) for all training pairs
Present data
Calculate error
Backpropagate error
Adjust weights
Repeat process multiple times
Presenting the data (trained network)
Results – Voice Recognition
Performance of trained network
Discrimination accuracy between known “Hello”s
Discrimination accuracy between new “Hello”’s
It is particularly fascinating to hear the audio examples of the neural network as
it progresses through training: it seems to progress from a baby babbling to what sounds
like a young child reading a kindergarten text, making the occasional mistake, but
clearly demonstrating that it has learned the major rules of reading.
How did they do it?
What are the inputs to NETtalk?
Since the pronunciation of a letter in English depends
on which letters lie before and after it, the
input layer looks at a string of seven letters at
a time.
Can you feed letters directly to MLP?
How do you code the 26 English letters?
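One common answer is one-hot coding: each letter becomes a vector with a single 1. The sketch below is only an illustration of that idea and is not claimed to be the coding used in the original system. The extra space symbol, the padding at the word boundaries, and the function names are assumptions.

```python
import string
import numpy as np

# 26 letters plus a space symbol used for padding and word gaps (assumed).
ALPHABET = string.ascii_lowercase + " "

def one_hot(ch):
    """Code one character as a vector with a single 1."""
    vec = np.zeros(len(ALPHABET))
    vec[ALPHABET.index(ch)] = 1.0
    return vec

def encode_window(text, center, width=7):
    """Code the window of `width` letters centred on text[center]."""
    half = width // 2
    padded = " " * half + text + " " * half
    window = padded[center:center + width]
    return np.concatenate([one_hot(c) for c in window])

v = encode_window("hello", 2)     # window " hello " centred on the first "l"
print(v.shape, v.sum())
```

With this coding a seven-letter window becomes a sparse input vector of length 7 × 27 containing exactly seven 1s, so no artificial ordering between letters is imposed on the network.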
This is because the hidden units and the learned weights do not have a
semantics. What can be learned are operational parameters, not general,
abstract knowledge of a domain.
All gradient-based methods, such as gradient descent and its variations, share
the same weakness: they cannot escape from a local minimum.
Possible remedies for local minima problem:
Try nets with different # of hidden layers and hidden units (they may
lead to different error surfaces, some might be better than others).
Try different initial weights (different starting points on the surface).
Forced escape from local minima by random perturbation (e.g.,
simulated annealing).
Q & A…