Lecture Three Multi-Layer Perceptron: Backpropagation: Part I: Fundamentals of Neural Networks

CEG5301: Machine Learning with Applications
Part I: Fundamentals of Neural Networks
Lecture Three
Multi-layer Perceptron:
Backpropagation
Xiang Cheng
Associate Professor
Department of Electrical & Computer Engineering
The National University of Singapore
Phone: 65166210 Office: Block E4-08-07

Email: elexc@nus.edu.sg 1
Perceptron is built by Frank Rosenblatt in 1958.
What is the difference between McCulloch-Pitts model and Perceptron? Learning! 2

What type of problem is suitable for classical Perceptron?
Pattern Recognition or Regression? And why?
Pattern Recognition:
Goal: To correctly classify the set of externally applied
stimuli x1, x2,…, xn into one of two classes, C1 and C2.
What is the equation describing the decision boundary produced by the perceptron?
What is the geometrical meaning of this equation? Hyper-plane.

Is it possible for perceptron to produce a nonlinear decision boundary? No. 3
Can perceptron separate any two classes of patterns? 10
Linearly Separable
If two classes can be separated by one line (plane or hyper-plane in higher
dimensional space).
Figure: (a) A pair of linearly separable patterns; (b) A pair of non-linearly separable patterns.
Two classes are linearly separable if and only if there exists a weight vector w
based on which the perceptron can correctly perform the classification.
4
How to choose the proper weights (the proper decision boundary?
By off-line calculation of weights (without learning) if the
problem is relatively simple in lower dimensional space.
If the problem is more complex, we can use
Perceptron learning algorithm
Start with a randomly chosen weight vector w(1);
Update the weight vector by the error-correction-learning rule
w(n + 1) = w(n) + ηe(n) x(n)
e( n ) = d ( n ) − y ( n )
Does it always converge to the correct solution?
If the patterns are linearly separable, then the weights will converge properly in
finite steps.
5
2
4 Regression Problem
Consider a multiple input-single output system whose mathematical
characterization is unknown:
Given a set of observations of input-output data:
m = dimensionality of the input space; i = time index.

How to design a model for the unknown system?
6
Optimization problem: Minimize the cost function!
How to evaluate the fitting results? Which one is better? Red or Black?
What is the most common cost function?
n n
E ( w) = ∑ e(i ) = ∑ (d (i ) − y (i )) 2
2
i =1 i =1
What is the Optimality Condition?

∂E
=0
∂w
There are two ways to solve the problem:
∂E
If the model is simple, then directly solve =0
∂w
Iterative descent algorithm: Starting with an initial guess denoted by w(0),

generate a sequence of weight vectors w(1), w(2),…, such that the cost function
E(w) is reduced at each iteration of the algorithm,
E(w(n+1)) < E(w(n))
Big question: How to choose the iterative algorithm such that the cost is always decreasing ?
What is the simplest way if the gradient is known?
7
Method of Steepest Descent (Gradient Descent)
Successive adjustment applied to the weight vector w are in the direction of

steepest descent (a direction opposite to the gradient vector ∇E(w)).
Let g(n) =∇E(w(n)), steepest descent algorithm is formally described by
where η is a positive constant called the stepsize or learning-rate.

Is there any condition on the learning-rate to make the algorithm work?
Sufficiently Small! 8
31
Linear Regression Problem

Consider that we are trying to fit a linear model to a set of input-output
pairs (x(1), d(1)), (x(2), d(2)) …, (x(n), d(n)) observed in an interval of
duration n.
y(x)=w1x1+w2x2+…+wmxm+b
n n
Cost Function: E ( w) = ∑ e(i ) = ∑ (d (i ) − y (i )) 2

2
i =1 i =1
∂E
Of course, the answer can be easily found by solving ∂w
=0
Standard Linear Least Squares
w = ( X T X ) −1 X T d
where
Regression matrix:
9
31
Linear Regression Problem

Consider that we are trying to fit a linear model to a set of input-output
pairs (x(1), d(1)), (x(2), d(2)) …, (x(n), d(n)) observed in an interval of
duration n.
y(x)=w1x1+w2x2+…+wmxm+b
Can we directly use Rosenblatt’s percetron to solve this linear regression problem?
No. The output of the perceptron is either 1 or 0 due to the hard limiter.
What is the simplest way to make the range of the output continuous instead of binary?
Linear Neuron
(single-layer percepton without squash function)
Can the linear neuron learn the function by

itself just like the percetron? Yes.
10
37
The Least-Mean-Square algorithm:

Given n training samples: {x(i), d(i)}, i = 1, 2,…,n
e(n) = d (n) − wT (n) x(n)
w(n + 1) = w(n) + ηe(n) x(n)
Does it take the same form as that for the Perceptron? Yes.
What is the advantage of LMS compared to standard linear least squares?
It requires much less memory space, and can be updated with new data easily.
Can linear neuron solve the regression problem where the underlying process
is nonlinear?
Example: Sinusoid function:

y=sin(x) (unknown to you), only
sampling points are provided.
Single layer perceptron can only

solve linear regression problem!
11
10
The fundamental limits of Single Layer Perceptrons

For Pattern Recognition Problem: Linearly Separable
For Regression Problem: The process has to be close to a linear model!
12
What is the simple logic gate problem that killed perceptron?
Let’s consider the XOR truth table:
Line 1
How many lines do we need to separate these two classes?

Line 2
Let’s write down the equations for these two lines:
Line 1 x2 = − x1 + 1.5 x1 + x2 − 1.5 = 0
Line 2 x2 = − x1 + 0.5 x1 + x2 − 0.5 = 0
Can we construct the corresponding perceptrons for these two decision lines separately?
Single Layer Perceptron with two output neurons
Line 1
Line 2
What would happen if we combine the two perceptrons together?
+1 Neuron 1
x1 y1 x1 0 0 1 1
-1.5
+1 +1 x2 0 1 0 1
Neuron 2 y1 0 0 0 1
x2 +1 y2 y2 0 1 1 1
-0.5
Can you find a line to separate the two classes in the output space (y1,y2) ?
Let’s construct the perceptron to separate the two classes.
What is the equation for this line? x2 = x1 + 0.5 − x1 + x2 − 0.5 = 0
+1 Neuron 1 -0.5
-1.5
+1 +1 -1
Neuron 2
+1
-0.5
Now, how to combine this perceptron with the previous two neurons?
What are the inputs to this perceptron?
The outputs of the previous two perceptrons serve as the inputs to the output neuron!
15
The complete solution to XOR problem
How many layers?
Two layers!
What is the magic?
In the original input space (x1, x2), is it linearly separable? No!

In the output space (y1, y2) of the hidden layer, is it linearly separable? Yes!
The inputs are transformed into another space (y1, y2) such that they become linearly separable!
Could Frank Rosenblatt find out this solution and answer Minsky’s attack if
he had survived the boating accident?
Yes. He could! Unfortunately, we have to wait another 15 years after his tragic
death in 1971.
Multilayer Perceptron (MLP) and Back Propagation Algorithm
David Rumelhart and the PDP (Parallel Distributed Processing) group, 1986
He obtained his B.A. in psychology and mathematics in 1963 at the
University of South Dakota. He received his Ph. D. in mathematical
psychology at Stanford University in 1967. From 1967 to 1987 he
served on the faculty of the Department of Psychology at the
University of California, San Diego.
The PDP group was led by David Rumelhart and Jay McClelland at
UCSD. They became dissatisfied with symbol-processing machines,
and embarked on a more ambitious “connectionist” program.
The 1986 PDP book was a big success. The book was read eagerly not
David Rumelhart only by brain theorists and psychologists but by mathematicians,
(1942-2011) physicists, engineers and even by people working in Artificial
Intelligence.
In 1987, Rumelhart moved to Stanford University, serving as Professor there until 1998.
The Robert J. Glushko and Pamela Samuelson Foundation created the David E. Rumelhart
Prize for Contributions to the Theoretical Foundations of Human Cognition in 2000.
Francis Crick was also a member of the PDP group. He joked later, “Almost my only
contribution to their efforts was to insist that they stop using the word neurons for the units of
their networks.”
1
Multilayer Perceptrons
Multilayer perceptrons (MLPs)
 Generalization of the single-layer perceptron
Consists of
 An input layer
 One or more hidden layers of computation nodes
 An output layer of computation nodes
Architectural graph of a multilayer perceptron with two hidden layers:
18
MLP generally adopts a smooth nonlinear activation function, such as
the following logistic function:
where vj is the induced local field (weighted sum of all synaptic inputs plus
the bias) of neuron j, yj is the output of the neuron.
What would happen if all the neurons are linear neurons? Would it behave
differently from single layer perceptrons?
No.
We already showed that MLP can solve the XOR problem by geometrical construction.
Can MLP solve the XOR problem by learning?
Training Algorithm
 Back-Propagation (BP) algorithm
19
5
Back-Propagation Algorithm
Consider a multilayer perceptron neural network having three layers of
neurons (one output layer and two hidden layers).
20
Let’s try to figure out the Back-Propagation (BP) algorithm step by step.
The MLP is fed with an input vector x(n), and produces an output vector y(n).
Let d(n) denote the desired network output, and the error is then
( 3)
e(n) = d (n) − y (n) = d (n) − xout ( n)
We will only look at the error at this step!

How to use this error signal to adjust the synaptic weights w(n)?
w(n + 1) = w(n) + ∆w(n)
Did we solve a similar problem in lecture 2?
Yes. The LMS is derived by minimizing the instantaneous errors!

What is the cost function used in LMS? 1
E ( n) = e( n ) 2
2
Since we may have multiple outputs for MLP, so we define the cost function as
1 n3 1 n3
E (n) = ∑ e j (n) = ∑ (d j (n) − xout
2 ( 3)
,j ( n )) 2
2 j =1 2 j =1
1 n3 1 n3
E (n) = ∑ e j (n) 2 = ∑ (d j (n) − xout
( 3)
, j ( n))
2
2 j =1 2 j =1
What is the simplest iterative algorithm to solve the minimization problem?
Steepest Descent (Gradient Descent):
Similar to LMS algorithm, the learning rule for a network weight is:
∆w(n) = −ηg (n)

∂E (n)
g T ( n) =
∂w(n)
∂E (n)
∆w(jis ) (n) = −η
∂w(jis ) (n)
where s = 1, 2, 3 designates the appropriate network layer, η > 0 is the

corresponding learning rate parameter.
All we need to do is trying to figure out how to compute the derivatives for all the weights!
For Output Layer (neuron j for output layer):
∂E (n)
∆w (ji3) (n) = −η
∂w (ji3) (n)
1 n3 1 n3
E (n) = ∑ e j (n) = ∑ (d j (n) − xout
2 ( 3)
, j ( n))
2
2 j =1 2 j =1
Do you know how to calculate ∂E (n)
( 3) ?
∂xout , j ( n)
∂E (n) ( 3)
( 3)
= ( d j ( n ) − x out , j ( n)) • ( −1) = −e j ( n)
∂xout , j (n)
How to calculate the outputs of the network, ( 3)
xout , j ( n) ?
, j ( n) = ϕ
( 3) ( 3)
y j (n) = xout (v (j3) (n))
( 3)
∂xout , j ( n)
How to calculate ?
∂v (j3) (n)
( 3)
∂xout , j ( n)
= ϕ (3) ' (v (j3) (n))
∂v (j3) (n)
How to compute the induced local fields v (j3) (n) ?
n2
v (n) = ∑ w(ji3) (n)xout
( 3)
j
( 2)
,i ( n )
i =1
∂v (j3) (n) ∂v (j3) (n)
How to calculate ( 3) ? ( 3)
( 2)
= xout ,i ( n )
∂w (n)
ji ∂w (n)
ji
∂E (n)
Now the big question is: how to calculate ? Chain Rule!
∂w (ji3) (n)
By chain rule, g(f(x))' = g'(f(x))f'(x).
( 3) ( 3)
∂E (n) ∂E (n) ∂xout , j (n) ∂v j (n)
= ( 3)
∂w ji (n) ∂xout , j (n) ∂v (j3) (n) ∂w (ji3) (n)
( 3)
∂E (n)
Now, ( 3)
= −e j ( n)
∂xout , j (n)
( 3)
∂xout , j ( n)
= ϕ (3) ' (v (j3) (n))
∂v (j3) (n)
∂v (j3) (n) ( 2)
= xout ,i ( n )
∂w(ji3) (n)
So we have 𝜕𝜕𝜕𝜕(𝑛𝑛) (3) (3) (2)
(3)
= −𝑒𝑒𝑗𝑗 (𝑛𝑛)𝜙𝜙 ′(𝑣𝑣𝑗𝑗 (𝑛𝑛))𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑖𝑖 (𝑛𝑛)
𝜕𝜕𝑤𝑤𝑗𝑗𝑗𝑗 (𝑛𝑛)
To simplify the notation, let’s define

(3) (3) (3)
𝛿𝛿𝑗𝑗 (𝑛𝑛) = 𝑒𝑒𝑗𝑗 (𝑛𝑛)𝜙𝜙 ′(𝑣𝑣𝑗𝑗 (𝑛𝑛))
∂E (n)
Then ( 3)
= −δ ( 3)
j ( n ) x ( 2)
out ,i ( n )
∂w ji (n) 24
∂E (n)
( 3)
= −δ (j 3) (n) xout
( 2)
,i ( n )
∂w ji (n)
The learning rule for the weights in the output layer is

∂E (n)
∆w(ji3) (n) = −η
∂w(ji3) (n)
∆w(ji3) (n) = ηδ (j 3) (n) xout

( 2)
,i ( n )
or
w(ji3) (n + 1) = w(ji3) (n) + ηδ (j 3) (n) xout
( 2)
,i ( n )
Output Error Input Signal
Is it similar to the LMS algorithm w(n + 1) = w(n) + ηe(n) x(n) ?

Yes.
25
Derivatives of cost function with respect to the weights in the second hidden layer:
1 n3 1 n3
E (n) = ∑ e j (n) = ∑ (d j (n) − xout
2 ( 3)
, j ( n))
2
2 j =1 2 j =1
Let’s first figure out how the error signal is related to the synaptic weights in the second
hidden layer.
What do the error signals depend upon?
The outputs of the network.
What do the network outputs depend upon?
The induced local fields of the output neurons.
What do induced local fields of the output
neurons depend upon?
The outputs of the hidden neurons and the
synaptic weights in the output layer.
What do the outputs of the hidden neurons
depend upon?
The induced local fields of the hidden neurons.
What do the induced local fields of the hidden neurons depend upon ?
The synaptic weights in the second hidden layer.
There are five levels of dependence between the error signal and the synaptic weights in the
second hidden layer. Which rule shall we use to compute the derivatives? 26
The chain rule, of course!
By chain rule, we need to compute the derivatives for every level and then put them together.
The first level is from the outputs of the network to cost function E(n)
1 n3
E (n) = ∑ (d j (n) − xout
( 3)
, j ( n))
2
2 j =1
So we have ,
∂E (n) ( 3)
( 3)
= − ( d k ( n ) − xout , k ( n)) = −ek ( n)
∂xout ,k (n)
27
Derivatives of cost function with respect to the weights in the
second hidden layer:
The second level is from the induced local fields of the output neuron to the
output of the network.
( 3)
xout ,k ( n ) = ϕ ( 3)
( v ( 3)
k ( n))
(3)
∂xout
So we have , ,k
= φ (3)' (vk(3) (n))
∂vk(3) (n)
The third level is from the output of the hidden
neurons to the induced local fields of the output
neuron.
n2
v k(3) (n) = ∑ wkj(3) (n) xout
( 2)
, j ( n)
j =1
Easily we obtain
∂v k(3) (n) ( 3)
( 2)
= w kj ( n)
∂xout , j
28
The fourth level is from the induced local fields of the hidden neurons to
the outputs of the hidden neurons.
2 (2)
𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑗𝑗 𝑛𝑛 = 𝜙𝜙 (2) (𝑣𝑣𝑗𝑗 (𝑛𝑛))
Easily we obtain
(2)
𝜕𝜕𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑗𝑗 (𝑛𝑛) (2)
(2)
= 𝜙𝜙 (2) ′(𝑣𝑣𝑗𝑗 (𝑛𝑛))
𝜕𝜕𝑣𝑣𝑗𝑗 (𝑛𝑛)
The fifth level is from the synaptic weights of

the hidden layer to the induced local fields of
the hidden neurons.
𝑛𝑛1
𝟐𝟐 (2) (1)
𝑣𝑣𝑗𝑗 𝑛𝑛 = � 𝑤𝑤𝑗𝑗𝑗𝑗 (𝑛𝑛) 𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑘𝑘 (𝑛𝑛)
𝑘𝑘=1
So we have , (2)
𝜕𝜕𝑣𝑣𝑗𝑗 (𝑛𝑛) (1)
(2)
= 𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑖𝑖 (𝑛𝑛)
𝜕𝜕𝑤𝑤𝑗𝑗𝑗𝑗 (𝑛𝑛) 29
In summary, we have
∂E (n) ( 3)
First level: ( 3)
= − ( d k ( n ) − xout , k ( n)) = −ek ( n)
∂xout ,k ( n )
(3)
∂xout ,k
Second level: = φ (3)' (vk(3) (n))
∂vk(3) (n)
(3)
𝜕𝜕𝑣𝑣𝑘𝑘 (𝑛𝑛) (3)
Third level: = 𝑤𝑤𝑘𝑘𝑗𝑗 (𝑛𝑛)
(2)
𝜕𝜕𝜕𝜕𝑜𝑜𝑜𝑜𝑜𝑜,𝑗𝑗 (𝑛𝑛)
(2)
𝜕𝜕𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑗𝑗 (𝑛𝑛) (2)
Fourth level: = 𝜙𝜙 (2) ′(𝑣𝑣𝑗𝑗 (𝑛𝑛))
(2)
𝜕𝜕𝑣𝑣𝑗𝑗 (𝑛𝑛)
(2)
Fifth level: 𝜕𝜕𝑣𝑣𝑗𝑗 (𝑛𝑛) (1)
(2)
30
Derivatives of cost function with respect to the weights in the second hidden Layer:
1 n3
E (n) = ∑ (d j (n) − xout
( 3)
,j ( n )) 2
2 j =1
Now let’s apply the chain rule:
𝟑𝟑 (2) (2)
𝜕𝜕𝐸𝐸(𝑛𝑛) 𝜕𝜕𝜕𝜕(𝑛𝑛) 𝜕𝜕𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑘𝑘 (𝑛𝑛) 𝜕𝜕𝑣𝑣𝑘𝑘(3) (𝑛𝑛) 𝜕𝜕𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑗𝑗 (𝑛𝑛) 𝜕𝜕𝑣𝑣𝑗𝑗 (𝑛𝑛)
(2) = (3) (3) 𝟐𝟐 (2) (2)
𝜕𝜕𝑤𝑤𝑗𝑗𝑗𝑗 (𝑛𝑛) 𝜕𝜕𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑘𝑘 (𝑛𝑛) 𝜕𝜕𝑣𝑣𝑘𝑘 (𝑛𝑛) 𝜕𝜕𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑗𝑗 (𝑛𝑛) 𝜕𝜕𝑣𝑣𝑗𝑗 (𝑛𝑛) 𝜕𝜕𝑤𝑤𝑗𝑗𝑗𝑗 (𝑛𝑛)
Is this the correct way to apply the chain rule? Did we miss anything?
No, we only consider the k-th output in above calculation. The weights in the hidden
layer affect all the outputs of the network.
31
So we should consider all the outputs of the network instead of only one!
𝑛𝑛3
1 (3)
𝐸𝐸(𝑛𝑛) = �(𝑑𝑑𝑗𝑗 (𝑛𝑛) − 𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑗𝑗 (𝑛𝑛))2
2
𝑗𝑗=1
Now let’s apply the chain rule:
𝟑𝟑 (2) (2)
𝜕𝜕𝐸𝐸(𝑛𝑛) 𝑛𝑛3 𝜕𝜕𝜕𝜕(𝑛𝑛) 𝜕𝜕𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑘𝑘 (𝑛𝑛) 𝜕𝜕𝑣𝑣𝑘𝑘(3) (𝑛𝑛) 𝜕𝜕𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑗𝑗 (𝑛𝑛) 𝜕𝜕𝑣𝑣𝑗𝑗 (𝑛𝑛)
(2) ∑
= 𝑘𝑘=1( (3) (3) 𝟐𝟐 (2) (2) )
Why don’t we use the summation for the derivatives with respect to the weights
in the output layer?
The weight in the output layer only affects its connected output neuron! 32
𝑛𝑛3
1 (3)
The cost function: 𝐸𝐸(𝑛𝑛) = �(𝑑𝑑𝑗𝑗 (𝑛𝑛) − 𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑗𝑗 (𝑛𝑛))2
2
𝑗𝑗=1
The derivatives for all the five levels:
3
𝜕𝜕𝜕𝜕(𝑛𝑛) (3) 𝜕𝜕𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑘𝑘 (𝑛𝑛) (3)
(3)
= −(𝑑𝑑𝑘𝑘 (𝑛𝑛) − 𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑘𝑘 (𝑛𝑛)) = −𝑒𝑒𝑘𝑘 (𝑛𝑛) = 𝜙𝜙 (3)′ (𝑣𝑣𝑘𝑘 (𝑛𝑛))
(3)
𝜕𝜕𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑘𝑘 (𝑛𝑛) 𝜕𝜕𝑣𝑣𝑘𝑘 (𝑛𝑛)
(3) (2) (2)

𝜕𝜕𝑣𝑣𝑘𝑘 (𝑛𝑛) (3) 𝜕𝜕𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑗𝑗 (𝑛𝑛) (2) 𝜕𝜕𝑣𝑣𝑗𝑗 (𝑛𝑛)
2
= 𝑤𝑤𝑘𝑘𝑘𝑘 (𝑛𝑛) = 𝜙𝜙 (2) ′(𝑣𝑣𝑗𝑗 (𝑛𝑛)) (1)
𝜕𝜕𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑗𝑗 (𝑛𝑛) (2) (2)
𝜕𝜕𝑣𝑣𝑗𝑗 (𝑛𝑛) 𝜕𝜕𝑤𝑤𝑗𝑗𝑗𝑗 (𝑛𝑛)
The chain rule:

𝟑𝟑 (2) (2)
𝜕𝜕𝐸𝐸(𝑛𝑛) 𝑛𝑛 𝜕𝜕𝜕𝜕(𝑛𝑛) 𝜕𝜕𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑘𝑘 (𝑛𝑛) 𝜕𝜕𝑣𝑣𝑘𝑘(3) (𝑛𝑛) 𝜕𝜕𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑗𝑗 (𝑛𝑛) 𝜕𝜕𝑣𝑣𝑗𝑗 (𝑛𝑛)
(2) = ∑𝑘𝑘=1
3
( (3) (3) 𝟐𝟐 (2) (2) )
Plug in all the derivatives:

𝜕𝜕𝐸𝐸(𝑛𝑛) 𝑛𝑛 (3) (3) (2) (1)
(2) =∑𝑘𝑘=1
3
( − 𝑒𝑒𝑘𝑘 (𝑛𝑛) 𝜙𝜙 (3)′ (𝑣𝑣𝑘𝑘 (𝑛𝑛))𝑤𝑤𝑘𝑘𝑘𝑘 (𝑛𝑛)𝜙𝜙 (2) ′(𝑣𝑣𝑗𝑗 (𝑛𝑛)) 𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑖𝑖 (𝑛𝑛))
33
𝜕𝜕𝐸𝐸(𝑛𝑛) 𝑛𝑛 (3) (3) (2) (1)

(2) =∑𝑘𝑘=1
3
( − 𝑒𝑒𝑘𝑘 (𝑛𝑛) 𝜙𝜙 (3)′ (𝑣𝑣𝑘𝑘 (𝑛𝑛))𝑤𝑤𝑘𝑘𝑘𝑘 (𝑛𝑛)𝜙𝜙 (2) ′(𝑣𝑣𝑗𝑗 (𝑛𝑛)) 𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑖𝑖 (𝑛𝑛))
It looks complicated, let’s try to simplify it.
Notice that we already define
(3) (3) (3)

So we have
𝜕𝜕𝐸𝐸(𝑛𝑛) 𝑛𝑛3 (3) (3) (2) (1)

(2) = − ∑𝑘𝑘=1( 𝛿𝛿𝑘𝑘 (𝑛𝑛)𝑤𝑤𝑘𝑘𝑘𝑘 (𝑛𝑛)𝜙𝜙 (2) ′(𝑣𝑣𝑗𝑗 (𝑛𝑛)) 𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑖𝑖 (𝑛𝑛))
Also notice that the last two terms do not depend upon the index k, so we can
take them out from the summation and obtain
𝜕𝜕𝐸𝐸(𝑛𝑛) 2 2 1 𝑛𝑛 3 3
(2) = −𝜙𝜙 ′(𝑣𝑣𝑗𝑗 (𝑛𝑛)) 𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑖𝑖 (𝑛𝑛) ∑𝑘𝑘=1
3
( 𝛿𝛿𝑘𝑘 (𝑛𝑛)𝑤𝑤𝑘𝑘𝑘𝑘 (𝑛𝑛))
34
𝜕𝜕𝐸𝐸(𝑛𝑛) 2 2 1 𝑛𝑛 3 3
(2) = −𝜙𝜙 ′(𝑣𝑣𝑗𝑗 (𝑛𝑛)) 𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑖𝑖 (𝑛𝑛) ∑𝑘𝑘=1
3
( 𝛿𝛿𝑘𝑘 (𝑛𝑛)𝑤𝑤𝑘𝑘𝑘𝑘 (𝑛𝑛))
Now let’s define the output error for the hidden neuron:
n3
δ (j 2) (n) = (∑ wkj(3) (n)δ k(3) (n))ϕ ( 2) ' (v (j2) (n))
k =1
So finally we have a very simple formula to compute the derivatives of the

cost function with respect to the weights in the second hidden layer:
𝜕𝜕𝜕𝜕(𝑛𝑛) (2) (1)

(2)
= −𝛿𝛿𝑗𝑗 (𝑛𝑛)𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑖𝑖 (𝑛𝑛)
35
Thus, by the rule of gradient descent, we have
(2) 𝜕𝜕𝜕𝜕(𝑛𝑛) (2) (1)

Δ𝑤𝑤𝑗𝑗𝑗𝑗 (𝑛𝑛) = −𝜂𝜂 (2)
= 𝜂𝜂𝛿𝛿𝑗𝑗 (𝑛𝑛)𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜,𝑖𝑖 (𝑛𝑛)
or w(ji2 ) (n + 1) = w(ji2 ) (n) + ηδ (j 2 ) (n) xout
(1)
,i ( n )
Let’s compare it to the updating rule for the output layer:

w(ji3) (n + 1) = w(ji3) (n) + ηδ (j 3) (n) xout
( 2)
,i ( n )
Do they have the same form?

They have the same form as that for the LMS algorithm:
w(n + 1) = w(n) + ηe(n) x(n)
Output Error Input Signal
Difference lies in how we compute the output error.

36
How should we calculate the output error at the output layer, δ ( 3) ?
j
(3) (3) (3)
Network Error Gradient of the activation function
For the output layer, the output error is proportional to the network error!
37
How should we calculate the output error at the hidden layer, δ j ?
( 2)
𝑛𝑛3
(2) (3) (3) (2)
𝛿𝛿𝑗𝑗 (𝑛𝑛) = (� 𝑤𝑤𝑘𝑘𝑘𝑘 (𝑛𝑛) 𝛿𝛿𝑘𝑘 (𝑛𝑛))𝜙𝜙 (2) ′(𝑣𝑣𝑗𝑗 (𝑛𝑛))
𝑘𝑘=1
Output Error Gradient of the activation function
For the hidden neurons, the output error is the linear combination of the errors
in the higher layer!
𝑛𝑛3
(3) (3)
� 𝑤𝑤𝑘𝑘𝑘𝑘 (𝑛𝑛) 𝛿𝛿𝑘𝑘 (𝑛𝑛)
𝑘𝑘=1
It can be viewed as the output of the

neuron of another network where all
the connections are “backwards” now!
This is the most beautiful property for BP.

That is why it is called backpropagation!
38
Summary,
w(jis ) (n + 1) = w(jis ) (n) + ηδ (j s ) (n) xout
( s −1)
,i ( n )
where
δ (j s ) (n) = (d (n) − xout
(s)
,j ( n ))ϕ (s)
' ( v (s)
j ( n)) for output layer
or ns +1
δ (j s ) (n) = (∑ δ k( s +1) (n) wkj( s +1) (n))ϕ ( s ) ' (v (js ) (n)) for hidden layer
k =1
Back-propagation algorithm - 2 passes of computation:

1. Forward pass: Computation of function signals for each neuron.
2. Backward pass: Starts at the output layer, backwardly compute δ for each
neuron from output layer towards the first hidden layer. At each layer, the
synaptic weights are changed accordingly to the above delta rule.
39
Break
•Visions of The Future 3
40
Signal-flow graphic representation of BP
Is BP a special case of
steepest descent method?
Sure!
Why is it called back-

propagation?
The errors are propagated

back from the output layer
to the input layer!
41
13
Back-Propagation Algorithm - Rate of learning

Back-propagation algorithm is based upon the method of steepest
descent.
What is the requirement for the learning rate for steepest descent method?
Small learning-rate parameter is desirable to avoid “jumping”.
Small learning rate parameter corresponds to “SLOW Learning”!
To increase the learning speed,

Modification of delta rule to give the generalized delta rule
where  is usually a positive number called the momentum constant or

forgetting factor;  is called the momentum term.
The adjustments depend upon not only the derivatives at present, but also the history!
42
Let’s write down the solutions step by step,
1
At the initial step, k=0
4
∆w (jis ) (0) = η ( s )δ (j s ) (0) xout
( s −1)
,i (0)
At the first step, k=1

∆w (jis ) (1) = α∆w (jis ) (0) + η ( s )δ (j s ) (1) xout
( s −1)
,i (1)
∆w (jis ) (1) = αη ( s )δ (j s ) (0) xout ,i (0) + η

( s −1)
δ j (1) xout
(s) (s) ( s −1)
,i (1)
At the second step, k=2

∆w (jis ) (2) = α∆w (jis ) (1) + η ( s )δ (j s ) (2) xout
( s −1)
,i ( 2)
∆w (jis ) (2) = α 2η ( s )δ (j s ) (0) xout ,i (0) + αη

( s −1)
δ j (1) xout
(s) (s)
,i (1) + η
( s −1)
δ j (2) xout
(s) (s) ( s −1)
,i ( 2)
In general, at step k,
∆w (jis ) (k ) = α∆w (jis ) (k − 1) + η ( s )δ (j s ) (k ) xout

( s −1)
,i ( k )
k k
∆w (k ) = ∑ α
(s)
ji η δ
k −t (s) (s)
j (t ) x ( s −1)
out ,i (t ) = η (s)
∑α k −t
δ (j s ) (t ) xout
( s −1)
,i (t )
43
t =0 t =0
From previous analysis, , so
The adjustments depend upon the weighted sum of present and the past derivatives!
Note:
1. For stability,
2. Normally, the momentum constant is positive to assure good performance. Why?
44
When has the same algebraic sign on consecutive iterations,
then grows in magnitude and is adjusted by a large
amount. The inclusion of momentum in the back-propagation algorithm
tends to accelerate descent in steady downhill directions.
When has the opposite signs on consecutive iterations, the

shrinks in magnitude, the is adjusted by a small
amount. The inclusion of momentum in BP algorithm has a stabilizing
effect in directions along which oscillates in sign. 45
Back-Propagation Algorithm - Stopping Criteria
Epoch: A complete presentation of the entire training set during the

learning process.
For single layer perceptron, does the LMS converge?
Similarly, BP would not converge.
Learning process maintains on an epoch-by-epoch basis until certain
stopping criterion is met.
When to stop training? Any suggestions?
Possible criteria:
 Mean squared error over the entire training set is less than some threshold
value.
 The total number of epochs reaches a threshold.
Absolute rate of change in mean squared error per epoch is sufficiently small.
 Synaptic weights and bias level stabilized.
46
16
Back-Propagation Algorithm - Summary
1. Initialization:
 weights, biases
2. Presentations of training examples
 Present an epoch of training examples
 For each example:
 perform forward and backward
computations
3. Forward computation
 Compute error signal
4. Backward computation
 Compute and adjust weights based
on generalized delta rule
5. Iteration
 If stopping criterion is not met, go
through step 2, 3, 4.
47
26
Approximations of Functions
Question: Can Multi-layer Perceptrons approximate any functions?
This question was answered shortly after 1986 by a number of people including
Cybenko, Hecht-Nielsen, Funahashi, Hornik and White.
Universal Approximation Theorem:
 Let be a non-constant, bounded, and monotone-increasing continuous
function. Let denote the mo -dimensional unit hypercube . Then,
given any continuous function f on and > 0, there exist an integer m1
and sets of real constant  and where i = 1 , …, m1 and j = 1 , …, m0
such that we may define,
as an approximate realization of the function f(·); that is,
for all that lie in the input space.

48
The theorem is directly applicable to multi-layer perceptrons. The logistic
function 1/[1 + exp(-v)] is indeed a non-constant, bounded, and monotone-
increasing function, i.e., it satisfies the conditions imposed on the function .
How to realize this function using a multilayer perceptron?
 How many input neurons? m0
How many hidden layers? One
How many hidden neurons? m1
How many output neurons? One
What are the activation functions?
49
The theorem merely states that a single hidden layer is sufficient for a
multilayer perceptron to approximate any bounded continuous function.
Does the theorem tell you how to find out the optimal weights?
No. This is just an existence result.
This theorem assures you that the good solution is out there. How to find it?
That’s why BP and other learning algorithms are proposed! 50

Function Approximation Examples:
How about the threshold function? Is it continuous?
Can a discontinuous function be approximated by continuous function? Sure! 51

MLP application example: Voice Recognition:
Torsten Reil
torsten.reil@zoo.ox.ac.uk
Task: Learn to discriminate between two different voices

saying “Hello”
Data
Sources
Steve Simpson
David Raubenheimer
Format
Frequency distribution (60 bins)
52
Network architecture
Feed forward network
60 input (one for each frequency bin)
6 hidden
2 output (0-1 for “Steve”, 1-0 for “David”)
53
Presenting the data
Steve
David
54
Presenting the data (untrained network)
Steve
0.43
0.26
David
0.73
0.55
55
Calculate error
Steve
0-0.43 = -0.43
1- 0.26 = 0.74
David
1-0.73 = 0.27
0-0.55 = -0.55
56
Backpropagate error and adjust weights
Steve
0- 0.43 = -0.43
1-0.26 = 0.74
David
1-0.73 = 0.27
0-0.55 = -0.55
57
Repeat process (sweep) for all training pairs
Present data
Calculate error
Backpropagate error
Adjust weights
Repeat process multiple times
58
Presenting the data (trained network)
Steve
0.01
0.99
David
0.99
0.01
59
Results – Voice Recognition
Performance of trained network
Discrimination accuracy between known “Hello”s
100%
Discrimination accuracy between new “Hello”’s
100%
Network has learnt to generalise from original data.
Networks with different weight settings can have same

functionality. (The final solution is not unique!)
Trained networks ‘concentrate’ on lower frequencies.
Network is robust against non-functioning nodes.

60
Let’s look at another more famous example: NETtalk
Terrence Sejnowski and Charles Rosenberg in 1987.
NETtalk was created to learn how to correctly pronounce English from written English
text.
This is not straightforward since English is an especially difficult language to pronounce
because of its irregular spelling.
The network was, of course, not explicitly given any of the rules of English pronunciation.
It had to learn them just from the corrections it received after each of its attempts during the
training sessions.
The input was fed into the network, letter by letter, in a special way. The overall output of
NETtalk was a string of symbols related to spoken sounds. To make the demonstration
more vivid, this output was coupled to a speech synthesizer that produced speech sounds
from the output of NETtalk.
It is particularly fascinating when you hear the audio examples of the neural network as
it progresses through training seems to progress from a baby babbling to what sounds
like a young child reading a kindergarten text, making the occasional mistake, but
clearly demonstrating learned the major rules of reading.
How did they do it?
61
What are the inputs to the NETtalk?
Since the pronunciation of English depends
on what letters lie before and after it, the
input layer looks at string of seven letters at
a time.
Can you feed letters directly to MLP?
How do you code the 26 English letters?
One input unit for each of the 26 letters,

plus three for punctuation and word
boundaries. Thus there are 29x7=203 inputs
What are the outputs?
The output layer has a neuron for each of the 21 “articulatory features” of the required
phonemes plus 5 units to handle syllable boundaries and stresses. As an example, the
consonants p and b are both called “labial stops” because they start with pursed lips.
How many hidden layers? How many hidden neurons?
Originally they used 80, then later increased to 120.
Structure of MLP: 203-80-26, total number of weights: 204*80+81*26=18426 62
"Autonomous Land Vehicle In a Neural Network” (ALVINN)
by Dean Pomerleau and Todd Jochem, 1993, CMU
What are the inputs?
The image captured by the camera.
How do you code an image?
30x32 pixels in the gray-level image
30x32 matrix
How many input neurons?
30x32=960
What are the 30 output units?
Each output neuron corresponds to a
particular steering direction, and the
output values of these units determine
which steering direction is
recommended most strongly
How many hidden layers?
Only one!
How many hidden neurons?
Only four.
Architecture of MLP: 960-4-30
# of weights: 961*4+5*30=3994
Single Layer Perceptron v.s. Multi-layer Perceptrons
Pattern Recognition Problem:
Why can MLP solve the nonlinearly separable problem?

The nonlinearly separable problem can be transformed into linearly separable problem
in the feature space produced by the hidden layer! 64
Single Layer Perceptron v.s. Multi-layer Perceptrons
Regression Problem:
Multi-layer Perceptrons can approximate any bounded continuous functions!

The learning algorithms are based upon the steepest descent method:
w(n + 1) = w(n) + ηe(n) x(n)
Output Error Input Signal Output Error Input Signal

65
21
Deficiencies of BP Trained NNs

Nobody is perfect. BP trained NNs have many limitations too.
What is the most serious weakness associated with Neural Networks ?
The net is essentially a black box!
The most powerful weapon of NN can turn against itself!
The Neural networks can learn almost everything due to its plasticity!
The synaptic weights can adapt to any learning task!
On the other hand, do you have any idea about how to interpret the weights?
Almost impossible!
 It may provide a desired mapping between input and output vectors (x, y)
but does not have the information of why a particular x is mapped to a
particular y.
 It thus cannot provide an intuitive (e.g., causal) explanation for the
computed result.
 This is because the hidden units and the learned weights do not have a
semantics. What can be learned are operational parameters, not general,
abstract knowledge of a domain.
66
21
Deficiencies of BP Trained NNs

There are other minor limitations too:
1. Learning often takes a long time to converge:
Complex functions often need hundreds or thousands of epochs.
2. Gradient descent approach only guarantees to reduce the total error to a

local minimum.
67
All gradient based methods like the gradient descent and its variations share
the same weakness: cannot escape from local minimum
68
 Possible remedies for local minima problem:
 Try nets with different # of hidden layers and hidden units (they may
lead to different error surfaces, some might be better than others).
Try different initial weights (different starting points on the surface).
 Forced escape from local minima by random perturbation (e.g.,
simulated annealing).
The learning (accuracy, speed, and generalization) is highly dependent on a set

of learning parameters:
Initial weights, learning rate, # of hidden layers and # of units...
 Many of them can only be determined empirically (via experiments).
We will discuss all these design issues next week.
69
Q & A…
70

Lecture Three Multi-Layer Perceptron: Backpropagation: Part I: Fundamentals of Neural Networks

Uploaded by

Copyright:

Available Formats

Lecture Three Multi-Layer Perceptron: Backpropagation: Part I: Fundamentals of Neural Networks

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Lecture Three Multi-Layer Perceptron: Backpropagation: Part I: Fundamentals of Neural Networks

Uploaded by

Copyright:

Available Formats

CEG5301: Machine Learning with Applications

Part I: Fundamentals of Neural Networks

Phone: 65166210 Office: Block E4-08-07

What is the difference between McCulloch-Pitts model and Perceptron? Learning! 2

What is the geometrical meaning of this equation? Hyper-plane.

Does it always converge to the correct solution?

Given a set of observations of input-output data:

m = dimensionality of the input space; i = time index.

What is the Optimality Condition?

Iterative descent algorithm: Starting with an initial guess denoted by w(0),

Successive adjustment applied to the weight vector w are in the direction of

where η is a positive constant called the stepsize or learning-rate.

Linear Regression Problem

Cost Function: E ( w) = ∑ e(i ) = ∑ (d (i ) − y (i )) 2

Standard Linear Least Squares

Linear Regression Problem

Can the linear neuron learn the function by

The Least-Mean-Square algorithm:

Example: Sinusoid function:

Single layer perceptron can only

The fundamental limits of Single Layer Perceptrons

For Regression Problem: The process has to be close to a linear model!

How many lines do we need to separate these two classes?

How many layers?

What is the magic?

In the original input space (x1, x2), is it linearly separable? No!

Can MLP solve the XOR problem by learning?

 Back-Propagation (BP) algorithm

We will only look at the error at this step!

Did we solve a similar problem in lecture 2?

Yes. The LMS is derived by minimizing the instantaneous errors!

What is the simplest iterative algorithm to solve the minimization problem?

Steepest Descent (Gradient Descent):

∆w(n) = −ηg (n)

where s = 1, 2, 3 designates the appropriate network layer, η > 0 is the

To simplify the notation, let’s define

The learning rule for the weights in the output layer is

∆w(ji3) (n) = ηδ (j 3) (n) xout

Output Error Input Signal

Is it similar to the LMS algorithm w(n + 1) = w(n) + ηe(n) x(n) ?

The fifth level is from the synaptic weights of

Now let’s apply the chain rule:

Now let’s apply the chain rule:

(3) (2) (2)

The chain rule:

Plug in all the derivatives:

𝜕𝜕𝐸𝐸(𝑛𝑛) 𝑛𝑛 (3) (3) (2) (1)

It looks complicated, let’s try to simplify it.

Notice that we already define

(3) (3) (3)

𝜕𝜕𝐸𝐸(𝑛𝑛) 𝑛𝑛3 (3) (3) (2) (1)

So finally we have a very simple formula to compute the derivatives of the

𝜕𝜕𝜕𝜕(𝑛𝑛) (2) (1)

(2) 𝜕𝜕𝜕𝜕(𝑛𝑛) (2) (1)

Let’s compare it to the updating rule for the output layer:

Do they have the same form?

Output Error Input Signal

Difference lies in how we compute the output error.

Network Error Gradient of the activation function