Lecture NN 2005 (Laboratory)
Neural Networks for Identification, Prediction and Control
Lecture 1: Introduction to Neural Networks (Machine Learning)
Silvio Simani
ssimani@ing.unife.it
References
Textbook (suggested):
Machine Learning
Definition: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at the tasks in T, as measured by P, improves with experience.
Examples of Learning Problems
Issues in Machine Learning
• What algorithms can approximate functions well, and when?
• How does the number of training examples influence accuracy?
• How does the complexity of the hypothesis representation impact it?
• How does noisy data influence accuracy?
• How do you reduce a learning problem to a set of function approximation problems?
Summary
Lecture 2: Introduction
Lecture Outline
1. Introduction (2)
   i. Course introduction
   ii. Introduction to neural networks
   iii. Issues in neural networks
2. Simple Neural Networks (3)
   i. Perceptron
   ii. Adaline
3. Multilayer Perceptron (4)
   i. Basics
   ii. Dynamics
4. Radial Basis Networks (5)
Introduction to Neural Networks
Brain
• About 10^11 neurons (processors)
• On average 1,000-10,000 connections per neuron
Artificial Neuron
The net input of neuron i (with bias b_i) is

net_i = ∑_j w_ij y_j + b_i
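As a concrete illustration, here is a minimal Python sketch of this computation; the function and variable names are illustrative, not from the lecture, and tanh is just one possible choice of activation.

```python
# One artificial neuron: net_i = sum_j w_ij * y_j + b_i, then an activation f(net_i).
import numpy as np

def artificial_neuron(y, weights, bias, activation=np.tanh):
    """y: input vector (outputs of the previous layer);
    weights: weight vector w_i (one weight per input); bias: scalar b_i."""
    net = np.dot(weights, y) + bias   # net_i = sum_j w_ij * y_j + b_i
    return activation(net)            # neuron output = f(net_i)

# Example: three inputs, arbitrary weights and bias (illustrative values)
print(artificial_neuron(np.array([0.5, -1.0, 2.0]),
                        np.array([0.1, 0.4, -0.3]),
                        bias=0.2))
```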
Artificial Neural Networks (ANN)
(Figure: a network of artificial neurons — the input vector is propagated through weighted connections and activation functions, with signal routing between layers, to produce the output vector.)
Historical Development of ANN…
Categorical Variable
A categorical variable is a variable that can belong to one of a number of discrete categories, for example red, green, blue.
Categorical variables are usually encoded using 1-out-of-n coding, e.g. for three colours: red = (1 0 0), green = (0 1 0), blue = (0 0 1).
If we used red = 1, green = 2, blue = 3, this encoding would impose an ordering on the values of the variable that does not exist.
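A minimal sketch of 1-out-of-n (one-hot) coding in Python; the helper name and the colour list are illustrative, not part of the lecture.

```python
# 1-out-of-n (one-hot) coding: each category becomes a binary vector with a single 1.
def one_hot(category, categories):
    return [1 if c == category else 0 for c in categories]

colours = ["red", "green", "blue"]
print(one_hot("red", colours))    # [1, 0, 0]
print(one_hot("green", colours))  # [0, 1, 0]
print(one_hot("blue", colours))   # [0, 0, 1]
```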
Data Pre-processing
CONTINUOUS VARIABLES
A continuous variable can be applied directly to a neural network. However, if the dynamic ranges of the input variables are not approximately the same, it is better to normalize all input variables of the neural network.
Example of Normalized Input Vector
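A minimal sketch of one common normalization (zero mean, unit standard deviation per input variable); this particular scheme and the toy data are illustrative assumptions, as the lecture's own example is not reproduced here.

```python
# Normalize each input variable (column) to zero mean and unit variance,
# so that all inputs have comparable dynamic ranges.
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 250.0]])         # rows = patterns, columns = input variables

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_norm)                        # each column now has mean 0 and std 1
```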
Simple Neural Networks
Lecture 3: Simple Perceptron
Outline
The Perceptron
• Linearly separable problems
• Network structure
• Perceptron learning rule
• Convergence of the Perceptron
THE PERCEPTRON
The perceptron is a simple model of an ANN introduced by Rosenblatt (Cornell) in the late 1950s, with the idea of learning.
The perceptron is designed to accomplish a simple pattern recognition task: after learning with real-valued training data
{ x(i), d(i), i = 1, 2, …, p } where d(i) = 1 or -1,
for a new signal (pattern) x(i+1) the perceptron is capable of telling you to which class the new signal belongs: perceptron output = 1 or -1.
Perceptron: Linear Threshold Unit (LTU)
With x_0 = 1 and w_0 = b (the bias), the output is

o(x) = +1 if ∑_{i=0}^{n} w_i x_i > 0
       -1 otherwise

(Figure: inputs x_1, …, x_n with weights w_1, …, w_n and the bias weight w_0 = b feed the sum ∑_{i=0}^{n} w_i x_i, which is passed through a hard threshold to produce o.)
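A minimal sketch of the linear threshold unit in Python; the names and the example weights (set here to realise the AND function) are illustrative, and the bias is folded into the weight vector as w_0 with x_0 = 1, as on the slide.

```python
# Linear threshold unit: o(x) = +1 if sum_i w_i * x_i > 0, else -1,
# with x_0 = 1 and w_0 = b playing the role of the bias.
import numpy as np

def ltu(x, w):
    x_aug = np.concatenate(([1.0], x))   # prepend x_0 = 1
    return 1 if np.dot(w, x_aug) > 0 else -1

w = np.array([-1.5, 1.0, 1.0])           # w_0 = b = -1.5: realises the AND function
print(ltu(np.array([1, 1]), w))          # +1
print(ltu(np.array([1, 0]), w))          # -1
print(ltu(np.array([0, 0]), w))          # -1
```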
Decision Surface of a Perceptron
(Figure: in the (x1, x2) plane the decision surface is the line w_0 + w_1 x_1 + w_2 x_2 = 0; the AND function is linearly separable, so a single line splits the + and - examples.)
• But some functions are not linearly separable (e.g. XOR) and cannot be represented by a single perceptron.
Mathematically the Perceptron is

y = f( ∑_{i=1}^{m} w_i x_i + b ) = f( ∑_{i=0}^{m} w_i x_i )

with

y = +1 if ∑_{i=1}^{m} w_i x_i + b ≥ 0
    -1 if ∑_{i=1}^{m} w_i x_i + b < 0
Why is the network capable of solving linearly separable problems?
The equation ∑_{i=1}^{m} w_i x_i + b = 0 defines a line (hyperplane) that splits the input space in two:
∑_{i=1}^{m} w_i x_i + b > 0 on the + side
∑_{i=1}^{m} w_i x_i + b < 0 on the - side
Learning rule
An algorithm to update the weights w so that finally the input patterns lie on both sides of the line decided by the perceptron.
Let t be the time index. At t = 0 the separating line is w(0) · x = 0; after successive updates it becomes w(1) · x = 0, w(2) · x = 0, w(3) · x = 0, …
(Figures: the line w(t) · x = 0 moves at each step until the + and - patterns lie on opposite sides.)
In Math
d(t) = +1 if x(t) is in class +
       -1 if x(t) is in class -

Perceptron learning rule:
w(t+1) = w(t) + η(t) [ d(t) - sign( w(t) · x(t) ) ] x(t)

where η(t) > 0 is the learning rate and sign(·) is the hard limiter function:
sign(x) = +1 if x > 0
          -1 if x ≤ 0
NB: d(t) is the same as d(i) and x(t) the same as x(i).
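A minimal sketch of the perceptron learning rule above in Python; the toy data set (the AND function), the learning rate and the stopping rule are illustrative assumptions.

```python
# Perceptron learning rule: w(t+1) = w(t) + eta * [d - sign(w . x)] * x,
# with the bias folded in as w_0 (x_0 = 1).
import numpy as np

def sign(v):
    return 1 if v > 0 else -1          # hard limiter, sign(0) = -1 as on the slide

# Toy linearly separable data: the AND function with targets +/-1
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)  # x_0 = 1
d = np.array([-1, -1, -1, 1])

w, eta = np.zeros(3), 0.5
for epoch in range(100):
    errors = 0
    for x, target in zip(X, d):
        update = eta * (target - sign(np.dot(w, x)))   # zero if correctly classified
        w += update * x
        errors += update != 0
    if errors == 0:                     # E(t) = 0: all patterns correctly classified
        break

print(w, [sign(np.dot(w, x)) for x in X])
```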
In words:
If the pattern x(t) is correctly classified, the weights are left unchanged; if it is misclassified, the weights are moved towards the pattern (class +) or away from it (class -).
Perceptron convergence theorem (Rosenblatt, 1962)
If the training patterns are linearly separable, the perceptron learning rule converges to a weight vector that correctly classifies all training patterns in a finite number of steps.
Summary of Perceptron learning…
b = bias
y(t) = actual response
η(t) = learning rate parameter, a positive constant < 1
d(t) = desired response
Summary of Perceptron learning…
Data { (x(i), d(i)), i = 1, …, p } can be presented cyclically:
(x(1), d(1)), (x(2), d(2)), …, (x(p), d(p)), (x(p+1), d(p+1)), …
or in random order.
Questions remain
Where or when to stop?
(Figure: after learning for t steps the separating line classifies all + and - training patterns correctly, so the training error E(t) = 0.)
How to define the generalization error?
E_g = [ d(t+1) - sign( x(t+1) · w(t) ) ]²
(Figure: after learning for t steps, a new pattern x(t+1) may still fall on the wrong side of the line, giving a nonzero generalization error.)
We next turn to ADALINE learning, from which we can understand the learning rule and, more generally, Back-Propagation (BP) learning.
Simple Neural Networks
Lecture 4: ADALINE Learning
Outline
• ADALINE
• Modes of training
Unhappy with Perceptron Training
(Figure: the ADALINE unit — the inputs are combined in a weighted sum ∑, which is passed through the activation f(x).)
After each training pattern x(i) is presented, the correction applied to the weights is proportional to the error.
E(t) = ∑_i E(i, t)
General Approach: gradient descent method
To find an update rule g:
w(t+1) = w(t) + g( E(w(t)) )
so that w automatically tends to the global minimum of E(w).
The gradient direction is the uphill direction. For example, in the figure, at position 0.4 the gradient F'(0.4) points uphill (here F plays the role of E; consider the one-dimensional case).
(Figure: a one-dimensional function F with the gradient direction F'(0.4) drawn at x = 0.4.)
• In the gradient descent algorithm we have
w(t+1) = w(t) - F'(w(t)) η(t)
so the ball goes downhill, since -F'(w(t)) is the downhill direction.
(Figures: the ball at w(t), then at w(t+1) after one update.)
• Gradually the ball will stop at a local minimum w(t+k), where the gradient is zero.
(Figure: the ball at rest at w(t+k).)
• In words: the gradient method can be thought of as a ball rolling down a hill; the ball rolls down and finally stops at the bottom of a valley.
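A minimal one-dimensional sketch of gradient descent in Python; the example function, learning rate and starting point are illustrative assumptions.

```python
# Gradient descent on a one-dimensional function F(w):
# w(t+1) = w(t) - F'(w(t)) * eta, the "ball rolling downhill".
def F(w):        # illustrative error surface with its minimum at w = 2
    return (w - 2.0) ** 2

def dF(w):       # its derivative F'(w)
    return 2.0 * (w - 2.0)

w = 0.4          # starting position
eta = 0.1        # learning rate eta(t), kept constant here
for t in range(50):
    w = w - eta * dF(w)

print(w, F(w))   # w approaches the minimum at 2.0, F(w) approaches 0
```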
Multi-Layer Perceptron (MLP)
(Figure: signals are routed from the input layer (x_1, …, x_n) through a hidden layer to the output layer.)
Properties of architecture
• No connections within a layer
• No direct connections between input and output layers
• Fully connected between layers
• Often more than 2 layers
• Number of output units need not equal number of input units
• Number of hidden units per layer can be more or less than the number of input or output units

Each unit computes
y_i = f( ∑_{j=1}^{m} w_ij x_j + b_i )
BP (Back Propagation)
Lecture 5: MultiLayer Perceptron I
Back-Propagation Learning
BP learning algorithm
A solution to the "credit assignment problem" in MLPs.
BP Learning for the Simplest MLP
The simplest MLP has one input I, one hidden unit with activity y = f( w(t) I ), and one output o = f( W(t) y ).

Task: given data {I, d}, minimize the error function at the output unit
E = (d - o)² / 2
  = [ d - f( W(t) y(t) ) ]² / 2
  = [ d - f( W(t) f( w(t) I ) ) ]² / 2

Updating the output weight by gradient descent:
W(t+1) = W(t) - η dE/dW(t)
ΔW(t) = -η (dE/df) (df/dW(t)) = η (d - o) f'( W(t) y ) y
where o = f( W(t) y ).
Backward pass phase (output weight W)
W(t+1) = W(t) - η dE/dW(t)
ΔW(t) = -η (dE/df) (df/dW(t)) = η (d - o) f'( W(t) y ) y = η Δ y
where Δ = (d - o) f'( W(t) y ).
Backward pass phase (hidden weight w)
w(t+1) = w(t) - η dE/dw(t)
Δw(t) = -η (dE/dy) (dy/dw(t)) = η (d - o) f'( W(t) y ) W(t) dy/dw(t)
      = η Δ W(t) f'( w(t) I ) I
with o = f( W(t) y ) = f( W(t) f( w(t) I ) ).
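A minimal numerical sketch of these two updates for the 1-1-1 network, using a sigmoid for f; the data point, learning rate and initial weights are illustrative assumptions.

```python
# Back-propagation for the simplest MLP: I -> hidden y -> output o,
# with y = f(w*I), o = f(W*y), E = (d - o)^2 / 2 and a sigmoid f.
import math

def f(x):  return 1.0 / (1.0 + math.exp(-x))   # sigmoid activation
def df(x): return f(x) * (1.0 - f(x))          # its derivative f'(x)

I, d = 0.8, 1.0          # one training pair {I, d} (illustrative)
w, W = 0.5, -0.3         # initial hidden and output weights (illustrative)
eta = 0.5                # learning rate

for t in range(1000):
    # forward pass
    y = f(w * I)
    o = f(W * y)
    # backward pass (weights are fixed while both corrections are computed)
    delta = (d - o) * df(W * y)            # Delta = (d - o) f'(W y)
    dW = eta * delta * y                   # Delta W = eta * Delta * y
    dw = eta * delta * W * df(w * I) * I   # Delta w = eta * Delta * W * f'(w I) * I
    W += dW
    w += dw

print(o, (d - o) ** 2 / 2)                 # output moves towards d, error shrinks
```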
General Two-Layer Network
I inputs, O outputs; w are the connections (weights) into the hidden units, W are the connections into the output units, and y_j is the activity of hidden unit j.
(Figure: inputs I, hidden activities y, outputs O.)
Forward pass
Weights are fixed during the forward and backward pass at time t.
1. Compute values for the hidden units:
   net_j(t) = ∑_i w_ji(t) I_i(t)
   y_j = f( net_j(t) )
2. Compute values for the output units:
   Net_k(t) = ∑_j W_kj(t) y_j
   O_k = f( Net_k(t) )
Backward Pass
Recall the delta rule; the error measure for pattern n is
E(t) = (1/2) ∑_k ( d_k(t) - O_k(t) )²
We want to know how to modify the weights in order to decrease E(t), i.e.
w_ij(t+1) = w_ij(t) - η ∂E(t)/∂w_ij(t)
where, by the chain rule,
∂E(t)/∂w_ij(t) = ( ∂E(t)/∂net_j(t) ) ( ∂net_j(t)/∂w_ij(t) )
both for hidden units and output units.
Summary
Weight updates are local:
w_ji(t+1) = w_ji(t) + η δ_j(t) I_i(t)   (hidden unit)
W_kj(t+1) = W_kj(t) + η δ_k(t) y_j(t)   (output unit)

Output unit:
W_kj(t+1) = W_kj(t) + η δ_k(t) y_j(t)
          = W_kj(t) + η ( d_k(t) - O_k(t) ) f'( Net_k(t) ) y_j(t)

Hidden unit:
w_ji(t+1) = w_ji(t) + η δ_j(t) I_i(t)
          = w_ji(t) + η f'( net_j(t) ) [ ∑_k δ_k(t) W_kj ] I_i(t)
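A minimal vectorized sketch of these update rules for a general two-layer network with sigmoid activations; the array shapes, the single data pattern and the learning rate are illustrative assumptions.

```python
# One back-propagation step for a two-layer MLP with sigmoid activations,
# following the local update rules above. Sizes and data are illustrative.
import numpy as np

def f(x):  return 1.0 / (1.0 + np.exp(-x))    # sigmoid
def df(x): return f(x) * (1.0 - f(x))         # f'(x)

rng = np.random.default_rng(0)
n_in, n_hid, n_out, eta = 3, 4, 2, 0.1
w = rng.uniform(-0.5, 0.5, (n_hid, n_in))     # input-to-hidden weights w_ji
W = rng.uniform(-0.5, 0.5, (n_out, n_hid))    # hidden-to-output weights W_kj

I = np.array([0.2, -0.4, 0.7])                # one input pattern I_i(t)
d = np.array([1.0, 0.0])                      # desired response d_k(t)

# forward pass
net = w @ I;  y = f(net)                      # hidden activities y_j
Net = W @ y;  O = f(Net)                      # outputs O_k

# backward pass
delta_out = (d - O) * df(Net)                 # delta_k = (d_k - O_k) f'(Net_k)
delta_hid = df(net) * (W.T @ delta_out)       # delta_j = f'(net_j) sum_k delta_k W_kj

W += eta * np.outer(delta_out, y)             # W_kj += eta * delta_k * y_j
w += eta * np.outer(delta_hid, I)             # w_ji += eta * delta_j * I_i
```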
Shape of the sigmoidal function derivative
For output units:
δ_i(t) = ( d_i(t) - O_i(t) ) f'( Net_i(t) )
       = ( d_i(t) - O_i(t) ) k O_i(t) ( 1 - O_i(t) )
For hidden units we have, analogously,
δ_j(t) = k y_j(t) ( 1 - y_j(t) ) ∑_k δ_k(t) W_kj(t)
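For completeness, a short derivation (not on the slide) of why the logistic sigmoid with gain k has derivative k·f(x)(1 - f(x)), which is where the factor k O(1 - O) above comes from; this assumes f is the logistic sigmoid, consistent with that factor.

```latex
f(x) = \frac{1}{1 + e^{-kx}}
\quad\Longrightarrow\quad
f'(x) = \frac{k\,e^{-kx}}{\left(1 + e^{-kx}\right)^{2}}
      = k \cdot \frac{1}{1 + e^{-kx}} \cdot \frac{e^{-kx}}{1 + e^{-kx}}
      = k\, f(x)\,\bigl(1 - f(x)\bigr)
```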
Advantages and disadvantages of different modes
Sequential mode:
• Less storage for each weighted connection
• Random order of presentation and updating per pattern means the search of weight space is stochastic, reducing the risk of local minima
• Able to take advantage of any redundancy in the training set (i.e. the same pattern occurs more than once in the training set, especially for large training sets)
• Simpler to implement
Batch mode:
• Faster learning than sequential mode
Lecture 5: MultiLayer Perceptron II
Dynamics of the MultiLayer Perceptron
Summary of Network Training
Backward phase:
Output unit:
W_kj(t+1) = W_kj(t) + η δ_k(t) y_j(t)
          = W_kj(t) + η ( d_k(t) - O_k(t) ) f'( Net_k(t) ) y_j(t)
Hidden unit (input-to-hidden weights):
w_ji(t+1) = w_ji(t) + η δ_j(t) I_i(t)
          = w_ji(t) + η f'( net_j(t) ) [ ∑_k δ_k(t) W_kj(t) ] I_i(t)
Network training:
Goals of Neural Network Training
To give the correct output for an input training vector (Learning)
Training and Testing Problems
• Stuck neurons: the degree of weight change is proportional to the derivative of the activation function, so weight changes are greatest when a unit receives a mid-range functional signal and smallest at the extremes. To avoid stuck neurons, weight initialization should give outputs of all neurons of approximately 0.5.
• Insufficient number of training patterns: in this case the training patterns will be learnt instead of the underlying relationship between inputs and output, i.e. the network just memorizes the patterns.
• Too few hidden neurons: the network will not produce a good model of the problem.
• Over-fitting: the training patterns will be learnt instead of the underlying function between inputs and output because there are too many hidden neurons. This means that the network will have poor generalization capability.
Dynamics of BP learning
The aim is to minimise an error function over all training patterns by adapting the weights of the MLP.
Dynamics of BP learning
In a single-layer perceptron with linear activation functions, the error function is simple, described by a smooth parabolic surface with a single minimum.
Dynamics of BP learning
MLPs with nonlinear activation functions have complex error surfaces (e.g. plateaus, long valleys, etc.) with no single minimum.
Effect of momentum term
• If weight changes tend to have the same sign, the momentum term increases and gradient descent speeds up, accelerating convergence on shallow gradients.
• If weight changes tend to have opposing signs, the momentum term decreases and gradient descent slows down, reducing oscillations (stabilizes).
• Can help escape being trapped in local minima.
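A minimal sketch of one common form of the momentum update (the weight change is the gradient step plus a fraction α of the previous change); the exact formulation and the constants are assumptions, since the slide describes the effect rather than the formula.

```python
# Gradient descent with a momentum term:
# dw(t) = -eta * grad_E(w(t)) + alpha * dw(t-1);  w(t+1) = w(t) + dw(t)
def grad_E(w):                 # illustrative error gradient (minimum at w = 2)
    return 2.0 * (w - 2.0)

w, dw = 0.0, 0.0
eta, alpha = 0.05, 0.9         # learning rate and momentum coefficient (illustrative)
for t in range(200):
    dw = -eta * grad_E(w) + alpha * dw
    w += dw

print(w)                       # converges towards the minimum at 2.0
```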
Selecting Initial Weight Values
• The choice of initial weight values is important, as it decides the starting position in weight space, that is, how far away from the global minimum we start.
• The aim is to select weight values which produce mid-range function signals.
• Select weight values randomly from a uniform probability distribution.
• Normalise the weight values so that the number of weighted connections per unit produces a mid-range function signal.
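A minimal sketch of such an initialization: uniform random weights scaled by the number of incoming connections (fan-in), so that the net input, and hence the activation, stays in the mid-range. The 1/√fan-in scaling rule is an illustrative choice, not taken from the slide.

```python
# Initialize a weight matrix with small uniform random values scaled by
# fan-in, so each unit's net input stays in the activation's mid-range.
import numpy as np

def init_weights(n_out, n_in, rng=np.random.default_rng(0)):
    limit = 1.0 / np.sqrt(n_in)            # scale by number of incoming connections
    return rng.uniform(-limit, limit, (n_out, n_in))

w = init_weights(4, 100)
x = np.ones(100)                           # worst-case all-ones input
print(np.abs(w @ x).max())                 # net inputs stay moderate, not saturated
```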
Convergence of Backprop
Avoiding local minima while keeping convergence fast:
• Add momentum
• Use stochastic gradient descent
• Train multiple nets with different initial weights
Nature of convergence:
• Initialize weights 'near zero', so the initial network is near-linear
• Increasingly non-linear functions become possible as training progresses
Use of the Available Data Set for Training
The available data set is normally split into three sets as follows:
• Training set – used to update the weights.
• Validation set – used to decide when to stop training and to select the model.
• Test set – used to estimate the generalization error.
(Figure: training error and validation error versus number of epochs.)
Validation
Too few hidden units prevent the network from adequately fitting the data and learning the concept.
Too many hidden units lead to overfitting.
Similar cross-validation methods can be used to determine an appropriate number of hidden units, using the error on the validation (test) set to select the model with the optimal number of hidden layers and nodes.
(Figure: training error and validation error versus number of epochs.)
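A minimal sketch of this kind of validation-based early stopping; the training function, patience rule and data split are illustrative assumptions, not the lecture's exact procedure.

```python
# Early stopping sketch: keep training while the validation error improves;
# stop (and keep the best weights) once it stops improving for a while.
import copy

def early_stopping_train(net, train_step, val_error, max_epochs=1000, patience=10):
    """net: any model object; train_step(net) does one epoch of weight updates;
    val_error(net) returns the error on the validation set."""
    best_err, best_net, bad_epochs = float("inf"), copy.deepcopy(net), 0
    for epoch in range(max_epochs):
        train_step(net)                     # one pass over the training set
        err = val_error(net)                # error on the validation set
        if err < best_err:
            best_err, best_net, bad_epochs = err, copy.deepcopy(net), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:      # validation error no longer improving
                break
    return best_net, best_err
```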
Alternative training algorithms
Lecture 8: Genetic Algorithms
History
Background
(Figure: a coding scheme maps binary-string genotypes, e.g. 10011, 01001, 01011, to points x in phenotype space, where the fitness f(x) is evaluated; recombination and selection then operate on the genotypes.)
Pseudo Code of an Evolutionary Algorithm
1. Generate the initial population
2. Evaluate the fitness of each individual
3. Repeat until termination:
   - Select parents according to fitness
   - Recombine parents to produce offspring
   - Mutate offspring
   - Evaluate the offspring
   - Select survivors for the next generation
Genotype | Integer | Phenotype | Fitness | Prop. fitness
11010    |   26    |  2.6349   | 1.2787  |    30%
01011    |   11    |  1.1148   | 1.0008  |    24%
10001    |   17    |  1.7228   | 1.7029  |    40%
00101    |    5    |  0.5067   | 0.2459  |     6%
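A minimal sketch of fitness-proportional (roulette-wheel) selection applied to this table; the random seed and variable names are illustrative.

```python
# Fitness-proportional (roulette-wheel) selection: each genotype is picked
# with probability fitness_i / sum(fitness), matching the "prop. fitness" column.
import random

population = ["11010", "01011", "10001", "00101"]
fitness    = [1.2787, 1.0008, 1.7029, 0.2459]

total = sum(fitness)
probs = [f / total for f in fitness]          # ~ [0.30, 0.24, 0.40, 0.06]

random.seed(0)
parents = random.choices(population, weights=probs, k=4)   # select 4 parents
print(probs, parents)
```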
Some Other Issues
Regarding Evolutionary
Computing
Evolution according to Lamarck.
Individual adapts during lifetime.
Adaptations inherited by children.
In nature, genes don’t change; but for computations we
could allow this...
Baldwin effect.
Individual’s ability to learn has positive effect on evolution.
It supports a more diverse gene pool.
Thus, more “experimentation” with genes possible.
Bacteria and viruses.
New evolutionary computing strategies.
Lecture 7: Radial Basis Functions
Radial-basis function (RBF) networks
RBF networks can handle problems that are not linearly separable (e.g. quadratically separable problems).
Radial-basis function (RBF) networks
So RBFs are functions taking the form
φ_i( ‖ x - x_i ‖ )
i.e. each basis function depends only on the distance of the input x from a centre x_i.
MLP vs RBF: the representation learned by an MLP is distributed, while that of an RBF network is local.
Starting point: exact interpolation
That is, given a set of N vectors x_i and a corresponding set of N real numbers d_i (the targets), find a function F that satisfies the interpolation condition:
F(x_i) = d_i   for i = 1, …, N
or, more exactly, find
F(x) = ∑_{j=1}^{N} w_j φ( ‖ x - x_j ‖ )
satisfying
F(x_i) = ∑_{j=1}^{N} w_j φ( ‖ x_i - x_j ‖ ) = d_i
Single-layer networks
(Figure: inputs y_1, …, y_p feed N hidden basis functions φ_1(y) = φ_1(‖y - x_1‖), …, φ_N(y) = φ_N(‖y - x_N‖), whose outputs are combined with weights w_j into the output, compared with the target d.)
• output = ∑_i w_i φ_i( ‖ y - x_i ‖ )
• the adjustable parameters are the weights w_j
• number of hidden units = number of data points
To summarize:
For a given data set containing N points (x_i, d_i), i = 1, …, N:
• Choose an RBF function φ
• Calculate φ( ‖ x_j - x_i ‖ )
• Solve the linear equation Φ W = D
• Get the unique solution
• Done
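A minimal numerical sketch of this procedure with a Gaussian RBF in one dimension; the toy data, the width σ and the use of numpy's linear solver are illustrative assumptions.

```python
# Exact RBF interpolation: build Phi_ij = phi(||x_i - x_j||), solve Phi w = d,
# then evaluate F(x) = sum_j w_j phi(||x - x_j||). Gaussian phi, toy 1-D data.
import numpy as np

def phi(r, sigma=0.5):
    return np.exp(-r**2 / (2 * sigma**2))     # Gaussian radial basis function

x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])     # N data points (used as centres)
d = np.sin(2 * np.pi * x)                     # N targets

Phi = phi(np.abs(x[:, None] - x[None, :]))    # interpolation matrix Phi_ij
w = np.linalg.solve(Phi, d)                   # unique solution of Phi w = d

def F(x_new):
    return phi(np.abs(x_new - x)) @ w         # F(x) = sum_j w_j phi(||x - x_j||)

print([round(F(xi), 6) for xi in x])          # reproduces the targets d exactly
```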
(Figure: exact interpolation with a small RBF width, σ = 0.2.)
Problems with exact interpolation
• It can produce poor generalisation performance, as only the data points constrain the mapping
• Overfitting problem
• Bishop (1995) example: the interpolant fits all data points but creates oscillations, due to the added noise and to the mapping being unconstrained between data points
(Figure: interpolation using all data points as centres versus using only 5 basis functions.)
To fit an RBF to every data point is very inefficient, due to the computational cost of the matrix inversion, and is very bad for generalization, so: use a smaller number of basis functions than data points.

Lecture 9: Nonlinear Identification, Prediction and Control
Nonlinear System Identification
(Figure: identification scheme with input generation, the plant Pm, and the neural network model.)
Model Reference Control