
Automazione (Laboratorio)
Reti Neurali per l'Identificazione, Predizione ed il Controllo
(Neural Networks for Identification, Prediction, and Control)

Lecture 1:
Introduction to Neural Networks
(Machine Learning)

Silvio Simani
ssimani@ing.unife.it
References

Textbook (suggested):

• Neural Networks for Identification, Prediction, and Control, by Duc Truong Pham and Xing Liu. Springer Verlag (December 1995). ISBN: 3540199594

• Nonlinear Identification and Control: A Neural Network Approach, by G. P. Liu. Springer Verlag (October 2001).
Course Overview

1. Introduction
   i. Course introduction
   ii. Introduction to neural networks
   iii. Issues in neural networks
2. Simple Neural Networks
   i. Perceptron
   ii. Adaline
3. Multilayer Perceptron
   i. Basics
4. Radial Basis Networks
5. Application Examples
Machine Learning

• Programs that improve automatically with experience
• Imitating human learning
• Human learning: fast recognition and classification of complex classes of objects and concepts, and fast adaptation
• Example: neural networks
• Some techniques assume a statistical source and select a statistical model of that source
• Other techniques are based on reasoning or inductive inference (e.g. decision trees)
Disciplines Relevant to ML

• Artificial intelligence
• Bayesian methods
• Control theory
• Information theory
• Computational complexity theory
• Philosophy
• Psychology and neurobiology
• Statistics
Machine Learning Definition

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Examples of Learning Problems

Example 1: Handwriting recognition
• T: recognizing and classifying handwritten words within images
• P: percentage of words correctly classified
• E: a database of handwritten words with given classifications

Example 2: Learning to play checkers
• T: playing checkers
• P: percentage of games won in a tournament
• E: opportunity to play against itself (war games…)
Type of Training Experience

• Direct or indirect?
  • Direct: board state -> correct move
  • Indirect: credit assignment problem (degree of credit or blame for each move toward the final outcome of win or loss)
• Teacher or not?
  • Teacher selects board states and provides correct moves, or
  • Learner can select board states
• Is the training experience representative of the performance goal?
  • Training: playing against itself
  • Performance: evaluated playing against the world champion
Issues in Machine Learning

• What algorithms can approximate functions well, and when?
• How does the number of training examples influence accuracy?
• How does the complexity of the hypothesis representation impact it?
• How does noisy data influence accuracy?
• How do you reduce a learning problem to a set of function approximation problems?
Summary

• Machine Learning is useful for data mining, poorly understood domains (e.g. face recognition), and programs that must dynamically adapt.
• It draws from many diverse disciplines.
• A learning problem needs a well-specified task, performance metric, and training experience.
• Learning involves searching a space of possible hypotheses. Different learning methods search different hypothesis spaces, such as numerical functions, neural networks, decision trees, and symbolic rules.
Topics in Neural Networks

Lecture 2:
Introduction
Lecture Outline

1. Introduction (2)
   i. Course introduction
   ii. Introduction to neural networks
   iii. Issues in neural networks
2. Simple Neural Networks (3)
   i. Perceptron
   ii. Adaline
3. Multilayer Perceptron (4)
   i. Basics
   ii. Dynamics
4. Radial Basis Networks (5)
Introduction to Neural Networks
Brain

• 10^11 neurons (processors)
• On average 1000-10000 connections each
Artificial Neuron

[Diagram: unit i receives inputs y_j through weights w_ij, plus a bias.]

net_i = Σ_j w_ij y_j + b
Artificial Neuron

• Input/output signals may be:
  • real values
  • unipolar {0, 1}
  • bipolar {-1, +1}
• Weight w_ij: strength of the connection.

Note that w_ij refers to the weight from unit j to unit i (not the other way round).
Artificial Neuron

• The bias b is a constant that can be written as w_i0 y_0, with y_0 = b and w_i0 = 1, such that

  net_i = Σ_{j=0}^{n} w_ij y_j

• The function f is the unit's activation function. In the simplest case, f is the identity function, and the unit's output is just its net input. This is called a linear unit.
• Other activation functions are, e.g., step functions (next slide).
Activation Functions

[Plots of common activation functions: identity function, binary step function, bipolar step function, sigmoid function, bipolar sigmoid function, and the Gaussian function]

y(x) = (1 / (σ √(2π))) · exp( -(x - μ)² / (2σ²) )
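As an illustration, these activation functions can be written in a few lines of Python (NumPy); the slope k and the Gaussian parameters μ and σ below are free choices, not values fixed by the slides:

    import numpy as np

    def identity(x):                return x
    def binary_step(x):             return np.where(x >= 0, 1, 0)
    def bipolar_step(x):            return np.where(x >= 0, 1, -1)
    def sigmoid(x, k=1.0):          return 1.0 / (1.0 + np.exp(-k * x))
    def bipolar_sigmoid(x, k=1.0):  return 2.0 / (1.0 + np.exp(-k * x)) - 1.0
    def gaussian(x, mu=0.0, sigma=1.0):
        # y(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))
        return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))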
Artificial Neural Networks (ANN)

[Diagram: an input vector is routed through weighted connections to units with activation functions, producing the output vector.]
Historical Development of ANN…

• William James (1890): describes in words and figures simple distributed networks and Hebbian learning
• McCulloch & Pitts (1943): binary threshold units that perform logical operations (they prove universal computation)
• Hebb (1949): formulation of a physiological (local) learning rule
• Rosenblatt (1958): the perceptron, a first real learning machine
• Widrow & Hoff (1960): ADALINE and the Widrow-Hoff supervised learning rule
Historical Development of ANN

• Kohonen (1982): self-organizing maps
• Hopfield (1982): Hopfield networks
• Rumelhart, Hinton & Williams (1986): back-propagation & multilayer perceptron
• Broomhead & Lowe (1988): radial basis functions (RBF)
• Vapnik (1990s): support vector machines
When Should an ANN Solution Be Considered?

• The solution to the problem cannot be explicitly described by an algorithm, a set of equations, or a set of rules.
• There is some evidence that an input-output mapping exists between a set of input and output variables.
• There should be a large amount of data available to train the network.
Problems That Can Lead to Poor Performance

• The network has to distinguish between very similar cases with a very high degree of accuracy.
• The training data does not represent the range of cases that the network will encounter in practice.
• The network has several hundred inputs.
• The main discriminating factors are not present in the available data, e.g. trying to assess a loan application without having knowledge of the applicants' salaries.
• The network is required to implement a very complex function.
Applications of Artificial Neural Networks

• Manufacturing: fault diagnosis, fraud detection.
• Retailing: fraud detection, forecasting, data mining.
• Finance: fraud detection, forecasting, data mining.
• Engineering: fault diagnosis, signal/image processing.
• Production: fault diagnosis, forecasting.
• Sales & Marketing: forecasting, data mining.
Data Pre-processing

Neural networks very rarely operate on raw data. An initial pre-processing stage is essential. Some examples are as follows:

• Feature extraction of images: for example, the analysis of X-rays requires pre-processing to extract features which may be of interest within a specified region.
• Representing input variables with numbers: for example "+1" if the person is married, "0" if divorced, and "-1" if single. Another example is representing the pixels of an image: 255 = bright white, 0 = black. To ensure the generalization capability of a neural network, the data should be encoded in a suitable form.
Data Pre-processing

• Categorical variables
  • A categorical variable is a variable that can belong to one of a number of discrete categories, for example red, green, blue.
  • Categorical variables are usually encoded using 1-out-of-n coding, e.g. for three colours: red = (1 0 0), green = (0 1 0), blue = (0 0 1).
  • If we used red = 1, green = 2, blue = 3, this type of encoding would impose an ordering on the values of the variable which does not exist.
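A minimal Python sketch of 1-out-of-n coding (the helper name one_hot and the colour list are illustrative, not from the slides):

    import numpy as np

    colours = ["red", "green", "blue"]

    def one_hot(value, categories):
        # 1-out-of-n coding: a vector with a single 1 at the category's index
        code = np.zeros(len(categories))
        code[categories.index(value)] = 1.0
        return code

    print(one_hot("green", colours))   # [0. 1. 0.]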
Data Pre-processing

• Continuous variables

A continuous variable can be directly applied to a neural network. However, if the dynamic ranges of the input variables are not approximately the same, it is better to normalize all input variables of the neural network.
Example of a Normalized Input Vector

• Input vector: x = (2 4 5 6 10 4)^T

• Mean of vector: μ = (1/6) Σ_{i=1}^{6} x_i = 5.167

• Standard deviation: σ = sqrt( (1/(6-1)) Σ_{i=1}^{6} (x_i - μ)² ) = 2.714

• Normalized vector: x_N = (x - μ)/σ = (-1.17  -0.43  -0.06  0.31  1.78  -0.43)^T

• The mean of the normalized vector is zero
• The standard deviation of the normalized vector is unity
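These numbers are easy to check in Python; note that ddof=1 selects the sample standard deviation used on the slide:

    import numpy as np

    x = np.array([2, 4, 5, 6, 10, 4], dtype=float)
    mu = x.mean()                   # 5.167
    sigma = x.std(ddof=1)           # 2.714 (sample standard deviation)
    x_n = (x - mu) / sigma          # [-1.17 -0.43 -0.06  0.31  1.78 -0.43]

    print(x_n.mean())               # ~0
    print(x_n.std(ddof=1))          # 1.0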
Simple Neural Networks

Lecture 3:
Simple Perceptron
Outline

• The Perceptron
  • Linearly separable problems
  • Network structure
  • Perceptron learning rule
  • Convergence of the Perceptron
The Perceptron

The perceptron was a simple ANN model introduced by Rosenblatt (1958) with the idea of learning.

The perceptron is designed to accomplish a simple pattern recognition task: after learning with real-valued training data

{ x(i), d(i), i = 1, 2, …, p }  where d(i) = 1 or -1,

for a new signal (pattern) x(i+1), the perceptron is capable of telling you to which class the new signal belongs:

x(i+1) → perceptron → 1 or -1
Perceptron

• Linear threshold unit (LTU)

[Diagram: inputs x_1 … x_n with weights w_1 … w_n, plus x_0 = 1 with weight w_0 = b; the unit sums Σ_{i=0}^{n} w_i x_i and thresholds the result.]

o(x) = 1 if Σ_{i=0}^{n} w_i x_i > 0, -1 otherwise
Decision Surface of a Perceptron

[Plots: a linearly separable set of + and - points in the (x1, x2) plane, and the AND function.]

• The perceptron is able to represent some useful functions
• AND(x1, x2): choose weights w0 = -1.5, w1 = 1, w2 = 1
• But functions that are not linearly separable (e.g. XOR) cannot be represented
Mathematically, the Perceptron is

y = f( Σ_{i=1}^{m} w_i x_i + b ) = f( Σ_{i=0}^{m} w_i x_i )

We can always treat the bias b as another weight with input equal to 1, where f is the hard limiter function, i.e.

y = +1  if  Σ_{i=1}^{m} w_i x_i + b > 0
y = -1  if  Σ_{i=1}^{m} w_i x_i + b ≤ 0
Why is the network capable of solving linearly separable problems?

The decision boundary is the line

Σ_{i=1}^{m} w_i x_i + b = 0

On the + side, Σ_{i=1}^{m} w_i x_i + b > 0; on the - side, Σ_{i=1}^{m} w_i x_i + b < 0.
Learning Rule

An algorithm to update the weights w so that finally the input patterns lie on the two sides of the line decided by the perceptron.

Let t be the time. At t = 0, 1, 2, 3, … the decision line w(t) · x = 0 is updated step by step until it separates the + patterns from the - patterns.

[Plots: the line w(t) · x = 0 rotating toward a separating position as t increases.]
In Math

d(t) = +1 if x(t) ∈ class (+);  d(t) = -1 if x(t) ∈ class (-)

Perceptron learning rule:

w(t+1) = w(t) + η(t) [ d(t) - sign( w(t) · x(t) ) ] x(t)

where η(t) > 0 is the learning rate, and sign is the hard limiter function:

sign(x) = +1 if x > 0,  -1 if x ≤ 0

NB: d(t) is the same as d(i), and x(t) as x(i).
In words:

• If the classification is right, do not update the weights.
• If the classification is not correct, update the weights in the opposite direction, so that the output moves closer to the right direction.
Perceptron Convergence Theorem (Rosenblatt, 1962)

Let the subsets of training vectors be linearly separable. Then after a finite number of learning steps we have

lim w(t) = w*, which correctly separates the samples.

The idea of the proof is to consider ||w(t+1) - w*|| - ||w(t) - w*||, which is a decreasing function of t.
Summary of Perceptron Learning …

Variables and parameters:

x(t) = (m+1)-dimensional input vector at time t
     = ( b, x_1(t), x_2(t), …, x_m(t) )

w(t) = (m+1)-dimensional weight vector
     = ( 1, w_1(t), …, w_m(t) )

b = bias
y(t) = actual response
η(t) = learning rate parameter, a positive constant < 1
d(t) = desired response
Summary of Perceptron Learning

Data: { (x(i), d(i)), i = 1, …, p }

• Present the data to the network one point at a time.

• The presentation could be cyclic:
  (x(1), d(1)), (x(2), d(2)), …, (x(p), d(p)), (x(p+1), d(p+1)), …
  or random.

(Hence we mix time t with index i here.)


Summary of Perceptron Learning (Algorithm)

1. Initialization: set w(0) = 0. Then perform the following computation for time steps t = 1, 2, …
2. Activation: at time step t, activate the perceptron by applying the input vector x(t) and the desired response d(t).
3. Computation of actual response: compute the actual response of the perceptron,
   y(t) = sign( w(t) · x(t) )
   where sign is the sign function.
4. Adaptation of weight vector: update the weight vector of the perceptron,
   w(t+1) = w(t) + η(t) [ d(t) - y(t) ] x(t)
5. Continuation: increment t and go back to step 2.
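A minimal Python sketch of steps 1-5 above; the fixed epoch count, the learning rate, and the handling of sign(0) are implementation choices, not prescribed by the slides:

    import numpy as np

    def train_perceptron(X, d, eta=0.1, epochs=50):
        # X: p x m array of input patterns; d: p targets in {-1, +1}
        p, m = X.shape
        Xb = np.hstack([np.ones((p, 1)), X])       # constant first input carries the bias
        w = np.zeros(m + 1)                        # step 1: initialization, w(0) = 0
        for _ in range(epochs):
            for x_t, d_t in zip(Xb, d):            # step 2: activation (cyclic presentation)
                y_t = 1.0 if w @ x_t > 0 else -1.0 # step 3: actual response
                w += eta * (d_t - y_t) * x_t       # step 4: adaptation
        return w                                   # step 5: continue until epochs exhausted

    # Example: AND is linearly separable (compare the decision-surface slide)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    d = np.array([-1, -1, -1, 1], dtype=float)
    w = train_perceptron(X, d)
    outputs = np.where(np.hstack([np.ones((4, 1)), X]) @ w > 0, 1, -1)
    print(outputs)                                 # [-1 -1 -1  1]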
Questions Remain

• Where or when to stop?

• By minimizing the generalization error.

• For training data { (x(i), d(i)), i = 1, …, p }, how do we define the training error after t steps of learning?

E(t) = Σ_{i=1}^{p} [ d(i) - sign( w(t) · x(i) ) ]²
[Plot: after learning for t steps, the line separates the + and - training points, so E(t) = 0.]
How to define the generalization error?

For a new signal ( x(t+1), d(t+1) ), we have

E_g = [ d(t+1) - sign( x(t+1) · w(t) ) ]²

[Plot: after learning for t steps, a new pattern may still fall on the wrong side of the line.]
We next turn to ADALINE learning, from which we can understand the learning rule and, more generally, Back-Propagation (BP) learning.
Simple Neural Networks

Lecture 4:
ADALINE Learning
Outline

• ADALINE
• Gradient descent learning
• Modes of training
Unhappy over Perceptron Training

• When a perceptron gives the right answer, no learning takes place.
• Anything below the threshold is interpreted as 'no', even if it is just below the threshold.
• It might be better to train the neuron based on how far below the threshold its net input is.
ADALINE

• ADALINE is an acronym for ADAptive LINear Element (or ADAptive LInear NEuron), developed by Bernard Widrow and Marcian Hoff (1960).
• There are several variations of Adaline. One has a threshold like the perceptron; another is just a bare linear function.
• The Adaline learning rule is also known as the least-mean-squares (LMS) rule, the delta rule, or the Widrow-Hoff rule.
• It is a training rule that minimizes the output error.
• Replace the step function in the perceptron with a continuous (differentiable) function f; the simplest example is the linear function.
• With or without the threshold, the Adaline is trained based on the output of the function f rather than the final output.

[Diagram: weighted sum Σ followed by the function f(x), with optional threshold (Adaline).]
After each training pattern x(i) is presented, the correction applied to the weights is proportional to the error:

E(i, t) = ½ [ d(i) - f( w(t) · x(i) ) ]²,  i = 1, …, p

N.B. If f is a linear function, f( w(t) · x(i) ) = w(t) · x(i).

Summing together, our purpose is to find the w which minimizes

E(t) = Σ_i E(i, t)
General approach: the gradient descent method

We want an update g,

w(t+1) = w(t) + g( E(w(t)) ),

such that w automatically tends to the global minimum of E(w). Gradient descent uses

w(t+1) = w(t) - η(t) E′(w(t))

(see the figures below).
The gradient direction is the uphill direction: for example, in the figure, at position 0.4 the gradient points uphill (here F plays the role of E; consider the one-dimensional case).

[Plot: a one-dimensional surface F with the gradient direction F′(0.4) marked.]
• In the gradient descent algorithm we have

w(t+1) = w(t) - η(t) F′(w(t))

so the ball goes downhill, since -F′(w(t)) is the downhill direction.

[Plots: the ball moves from w(t) to w(t+1), and gradually stops at a local minimum w(t+k), where the gradient is zero.]
• In words: the gradient method can be thought of as a ball rolling down a hill; the ball will roll down and finally stop in the valley.

Thus, the weights are adjusted by

w_j(t+1) = w_j(t) + η(t) Σ_i [ d(i) - f( w(t) · x(i) ) ] f′ x_j(i)

This corresponds to gradient descent on the quadratic error surface E.

When f′ = 1 we recover the perceptron learning rule (in general f′ > 0 in neural networks), and the ball moves in the right direction.
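A minimal Python sketch of this rule for a linear Adaline (f(net) = net, so f′ = 1), here in batch form; the learning rate and epoch count are free choices:

    import numpy as np

    def train_adaline(X, d, eta=0.01, epochs=100):
        # Gradient descent on E = 1/2 * sum_i (d(i) - w . x(i))^2  (LMS / delta rule)
        p, m = X.shape
        Xb = np.hstack([np.ones((p, 1)), X])   # bias treated as an extra weight
        w = np.zeros(m + 1)
        for _ in range(epochs):
            err = d - Xb @ w                   # d(i) - f(w(t) . x(i)) for all patterns
            w += eta * Xb.T @ err              # w_j += eta * sum_i err(i) * x_j(i)
        return w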
Two types of network training:

• Sequential mode (on-line, stochastic, or per-pattern):
  weights are updated after each pattern is presented (the perceptron is in this class).

• Batch mode (off-line or per-epoch):
  weights are updated after all patterns are presented.
Comparison of Perceptron and Gradient Descent Rules

• The perceptron learning rule is guaranteed to succeed if
  • the training examples are linearly separable, and
  • the learning rate η is sufficiently small.
• The linear unit training rule uses gradient descent, guaranteed to converge to the hypothesis with minimum squared error given a sufficiently small learning rate η,
  • even when the training data contains noise,
  • and even when the training data is not linearly separable.
Renaissance of the Perceptron

[Diagram: the perceptron evolved into the Multi-Layer Perceptron through Back-Propagation (1980s), and into Support Vector Machines through learning theory (1990s).]
Summary of Previous Lectures

Perceptron:

w(t+1) = w(t) + η(t) [ d(t) - sign( w(t) · x ) ] x

Adaline (gradient descent method):

w(t+1) = w(t) + η(t) [ d(t) - f( w(t) · x ) ] f′ x
Multi-Layer Perceptron (MLP)

• Idea: the credit assignment problem

• The problem of assigning 'credit' or 'blame' to the individual elements involved in forming the overall response of a learning system (hidden units).

• In neural networks, the problem is to decide which weights should be altered, by how much, and in which direction.
Example: three-layer network

[Diagram: inputs x1, x2, …, xn enter the input layer; signals are routed through a hidden layer to the output layer.]
Properties of the architecture

• No connections within a layer
• No direct connections between input and output layers
• Fully connected between layers
• Often more than 2 layers
• Number of output units need not equal number of input units
• Number of hidden units per layer can be more or less than the number of input or output units

Each unit is a perceptron:

y_i = f( Σ_{j=1}^{m} w_ij x_j + b_i )
BP (Back-Propagation)
Lecture 5:
MultiLayer Perceptron I
Back-Propagation Learning
BP learning algorithm

Solution to the "credit assignment problem" in MLPs: Rumelhart, Hinton and Williams (1986).

BP has two phases:

• Forward pass phase: computes the 'functional signal', the feedforward propagation of input pattern signals through the network.

• Backward pass phase: computes the 'error signal', the propagation of the error (difference between actual and desired output values) backwards through the network, starting at the output units.
BP Learning for the Simplest MLP

[Diagram: input I → weight w(t) → hidden output y → weight W(t) → output o.]

Task: use the data {I, d} to minimize the error function at the output unit,

E = (d - o)² / 2
  = [ d - f( W(t) y(t) ) ]² / 2
  = [ d - f( W(t) f( w(t) I ) ) ]² / 2

The weights at time t are w(t) and W(t); we intend to find the weights w and W at time t+1, where y = f(w(t) I) is the output of the hidden unit.
Forward pass phase

Suppose that at time t we have w(t), W(t). For a given input I we can calculate

y = f( w(t) I )

and

o = f( W(t) y ) = f( W(t) f( w(t) I ) )

The error function of the output unit will be E = (d - o)² / 2.
Backward pass phase

For the output weight:

W(t+1) = W(t) - η dE/dW(t)

ΔW(t) = -η (dE/df) (df/dW(t)) = η (d - o) f′( W(t) y ) y = η Δ y

where Δ = (d - o) f′, E = (d - o)² / 2, and o = f( W(t) y ).
Backward pass phase

For the hidden weight:

w(t+1) = w(t) - η dE/dw(t)

Δw(t) = -η (dE/dy) (dy/dw(t)) = η (d - o) f′( W(t) y ) W(t) dy/dw(t)
      = η Δ W(t) f′( w(t) I ) I

where o = f( W(t) y ) = f( W(t) f( w(t) I ) ).
General Two-Layer Network

I inputs, O outputs; w are the connections into the hidden units, W the connections into the output units; y is the activity of a hidden unit; net(t) = network input to a unit at time t.

[Diagram: input units I → weights w → hidden units y → weights W → output units O.]
Forward pass

Weights are fixed during the forward & backward pass at time t.

1. Compute values for the hidden units:

   net_j(t) = Σ_i w_ji(t) I_i(t)
   y_j = f( net_j(t) )

2. Compute values for the output units:

   Net_k(t) = Σ_j W_kj(t) y_j
   O_k = f( Net_k(t) )
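A minimal Python sketch of this forward pass for the two-layer network, assuming sigmoid units (the weight shapes are one convenient choice):

    import numpy as np

    def f(x, k=1.0):
        # sigmoid activation, as used later in these slides
        return 1.0 / (1.0 + np.exp(-k * x))

    def forward(I, w, W):
        # w: (hidden x inputs) weights, W: (outputs x hidden) weights
        net = w @ I          # net_j(t) = sum_i w_ji(t) I_i(t)
        y = f(net)           # y_j = f(net_j(t))
        Net = W @ y          # Net_k(t) = sum_j W_kj(t) y_j
        O = f(Net)           # O_k = f(Net_k(t))
        return net, y, Net, O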
Backward pass

Recall the delta rule; the error measure for pattern n is

E(t) = ½ Σ_k ( d_k(t) - O_k(t) )²

We want to know how to modify the weights in order to decrease E, where

w_ij(t+1) = w_ij(t) - η ∂E(t)/∂w_ij(t)

both for hidden units and output units.

This can be rewritten as a product of two terms using the chain rule.
∂E(t)/∂w_ij(t) = ( ∂E(t)/∂net_j(t) ) · ( ∂net_j(t)/∂w_ij(t) )

both for hidden units and output units.

• Term A: how the error for the pattern changes as a function of a change in the network input to unit j.

• Term B: how the net input to unit j changes as a function of a change in weight w.
Summary

Weight updates are local:

w_ji(t+1) = w_ji(t) + η Δ_j(t) I_i(t)   (hidden unit)
W_kj(t+1) = W_kj(t) + η Δ_k(t) y_j(t)   (output unit)

Output unit:
W_kj(t+1) = W_kj(t) + η ( d_k(t) - O_k(t) ) f′( Net_k(t) ) y_j(t)

Hidden unit:
w_ji(t+1) = w_ji(t) + η f′( net_j(t) ) [ Σ_k Δ_k(t) W_kj ] I_i(t)

• Once the weight changes are computed for all units, the weights are updated at the same time (biases included as weights here).
Activation Functions

To compute Δ we need to find the derivative of the activation function f; to find the derivative, the activation function must be smooth.

Sigmoidal (logistic) function, common in MLPs:

f( net_i(t) ) = 1 / ( 1 + exp( -k · net_i(t) ) )

where k is a positive constant. The sigmoidal function gives values in the range 0 to 1.
[Plot: shape of the sigmoidal function. Note: when net = 0, f = 0.5.]
[Plot: shape of the derivative of the sigmoidal function.]

The derivative of the sigmoidal function has its maximum at x = 0, is symmetric about this point, and falls to zero as the sigmoid approaches its extreme values.
Returning to the local error gradients in the BP algorithm, for output units we have

Δ_i(t) = ( d_i(t) - O_i(t) ) f′( Net_i(t) ) = ( d_i(t) - O_i(t) ) k O_i(t) ( 1 - O_i(t) )

and for hidden units we have

Δ_i(t) = f′( net_i(t) ) Σ_k Δ_k(t) W_ki = k y_i(t) ( 1 - y_i(t) ) Σ_k Δ_k(t) W_ki

Since the degree of weight change is proportional to the derivative of the activation function, weight changes are greatest when units receive a mid-range functional signal, rather than at the extremes.
Summary of the BP learning algorithm

Set the learning rate η
Set initial weight values (incl. biases): w, W
Loop until stopping criteria are satisfied:
  present input pattern to input units
  compute functional signal for hidden units
  compute functional signal for output units

  present target response to output units
  compute error signal for output units
  compute error signal for hidden units
  update all weights at the same time
  increment n to n+1 and select the next I and d
end loop
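A minimal Python sketch of this loop in sequential mode, for the two-layer sigmoid network of the previous slides; the XOR data, network size, learning rate, and epoch count are illustrative choices, not values from the slides:

    import numpy as np

    k = 1.0
    f = lambda x: 1.0 / (1.0 + np.exp(-k * x))   # sigmoid, so f' = k f (1 - f)

    def bp_step(I, d, w, W, eta=0.5):
        # forward pass
        y = f(w @ I)                                     # hidden functional signals
        O = f(W @ y)                                     # output functional signals
        # backward pass: local error gradients
        Delta_out = (d - O) * k * O * (1 - O)            # output units
        Delta_hid = k * y * (1 - y) * (W.T @ Delta_out)  # hidden units
        # update all weights at the same time
        W = W + eta * np.outer(Delta_out, y)
        w = w + eta * np.outer(Delta_hid, I)
        return w, W

    # Example: XOR, with a constant third input acting as a bias
    rng = np.random.default_rng(0)
    X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
    D = np.array([[0.0], [1.0], [1.0], [0.0]])
    w = rng.uniform(-0.5, 0.5, (4, 3))    # 4 hidden units, 3 inputs
    W = rng.uniform(-0.5, 0.5, (1, 4))    # 1 output unit
    for epoch in range(10000):            # sequential mode: update per pattern
        for x_t, d_t in zip(X, D):
            w, W = bp_step(x_t, d_t, w, W)
    print(f(W @ f(w @ X.T)))              # outputs should approach 0, 1, 1, 0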
Network training:

• The training set is shown repeatedly until the stopping criteria are met.
• Each full presentation of all patterns = an 'epoch'.
• Randomise the order of the training patterns presented in each epoch, in order to avoid correlation between consecutive training pairs being learnt (order effects).

Two types of network training:

• Sequential mode (on-line, stochastic, or per-pattern):
  weights are updated after each pattern is presented.
Advantages and disadvantages of the different modes

Sequential mode:
• Less storage for each weighted connection.
• Random order of presentation and per-pattern updating means the search of weight space is stochastic, reducing the risk of local minima; able to take advantage of any redundancy in the training set (i.e. the same pattern occurring more than once, especially in large training sets).
• Simpler to implement.

Batch mode:
• Faster learning than sequential mode.
Lecture 5:
MultiLayer Perceptron II

Dynamics of the MultiLayer Perceptron
Summary of Network Training

Forward phase: I(t), w(t), net(t), y(t), W(t), Net(t), O(t)

Backward phase:

Output unit:
W_kj(t+1) = W_kj(t) + η ( d_k(t) - O_k(t) ) f′( Net_k(t) ) y_j(t)

Hidden unit:
w_ji(t+1) = w_ji(t) + η f′( net_j(t) ) [ Σ_k Δ_k(t) W_kj(t) ] I_i(t)
Network training:

The training set is shown repeatedly until the stopping criteria are met. Possible convergence criteria are:

• The Euclidean norm of the gradient vector reaches a sufficiently small value, denoted ε.
• The absolute rate of change in the average squared error per epoch is sufficiently small, denoted ε.
• Validation of generalization performance: stop when generalization reaches its peak (illustrated below).
Network training:

Two types of network training:

• Sequential mode (on-line, stochastic, or per-pattern):
  weights are updated after each pattern is presented.

• Batch mode (off-line or per-epoch):
  weights are updated after all the patterns are presented.
Goals of Neural Network Training

• To give the correct output for an input training vector (learning)

• To give good responses to new unseen input patterns (generalization)
Training and Testing Problems

• Stuck neurons: the degree of weight change is proportional to the derivative of the activation function, so weight changes are greatest when a unit receives a mid-range functional signal rather than at the extremes. To avoid stuck neurons, weight initialization should give all neuron outputs approximately 0.5.
• Insufficient number of training patterns: in this case the training patterns will be learnt instead of the underlying relationship between inputs and output, i.e. the network just memorizes the patterns.
• Too few hidden neurons: the network will not produce a good model of the problem.
• Over-fitting: the training patterns will be learnt instead of the underlying function between inputs and output because there are too many hidden neurons. This means that the network will have a poor generalization capability.
Dynamics of BP learning

The aim is to minimise an error function over all training patterns by adapting the weights in the MLP.

Recall that the typical error function is the mean squared error, as follows:

E(t) = ½ Σ_{k=1}^{p} ( d_k(t) - O_k(t) )²

The idea is to reduce E(t) to the global minimum point.
Dynamics of BP learning

In a single-layer perceptron with linear activation functions, the error function is simple, described by a smooth parabolic surface with a single minimum.
Dynamics of BP learning

MLPs with nonlinear activation functions have complex error surfaces (e.g. plateaus, long valleys, etc.) with no single minimum.

For complex error surfaces the problem is that the learning rate must be kept small to prevent divergence. Adding a momentum term is a simple approach to dealing with this problem.
Momentum

• Reduces problems of instability while increasing the rate of convergence.
• Adding a term to the weight update equation effectively holds an exponentially weighted history of previous weight changes.

The modified weight update equation is

w_ij(n+1) = w_ij(n) + η Δ_j(n) y_i(n) + α [ w_ij(n) - w_ij(n-1) ]
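A sketch of this momentum-augmented update in Python; alpha is the momentum coefficient (a common choice is around 0.9), and w_prev is assumed to hold the weights from the previous step:

    import numpy as np

    def momentum_update(w, w_prev, grad_step, alpha=0.9):
        # grad_step is the plain BP change, eta * Delta_j(n) * y_i(n)
        # alpha * (w - w_prev) adds the exponentially weighted history term
        dw = grad_step + alpha * (w - w_prev)
        return w + dw, w.copy()   # (new weights, weights to remember as previous)

At each step the caller keeps the returned pair, so the history term is available at the next update.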
Effect of the momentum term

• If weight changes tend to have the same sign, the momentum term increases and gradient descent speeds up convergence on shallow gradients.
• If weight changes tend to have opposing signs, the momentum term decreases and gradient descent slows down, reducing oscillations (stabilising).
• It can help escape being trapped in local minima.
Selecting Initial Weight Values

• The choice of initial weight values is important, as it decides the starting position in weight space, i.e. how far away from the global minimum.
• The aim is to select weight values which produce mid-range function signals.
• Select weight values randomly from a uniform probability distribution.
• Normalise the weight values so that the number of weighted connections per unit produces a mid-range function signal.
Convergence of Backprop

Avoiding local minima with fast convergence:
• Add momentum
• Stochastic gradient descent
• Train multiple nets with different initial weights

Nature of convergence:
• Initialize weights 'near zero', so the initial network is near-linear
• Increasingly non-linear functions become possible as training progresses
Use of the Available Data Set for Training

The available data set is normally split into three sets, as follows:

• Training set: used to update the weights. Patterns in this set are presented repeatedly in random order. The weight update equation is applied after a certain number of patterns.
• Validation set: used to decide when to stop training, only by monitoring the error.
• Test set: used to test the performance of the neural network. It should not be used as part of the training process.
Early Stopping for Good Generalization

• Running too many epochs may overtrain the network, resulting in overfitting and poor generalization performance.
• Keep a hold-out validation set and test accuracy after every epoch. Maintain the weights of the best-performing network on the validation set, and stop training when the validation error increases beyond this.

[Plot: error vs. number of epochs; the training-set error keeps falling while the validation-set error rises past the optimum stopping point.]
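A minimal Python sketch of this hold-out scheme; train_epoch and validation_error are hypothetical callables standing in for one training epoch and the validation-set error, and the patience of 10 epochs is an arbitrary choice:

    import copy
    import numpy as np

    def train_with_early_stopping(weights, train_epoch, validation_error, max_epochs=1000):
        best_err, best_weights, worse = np.inf, copy.deepcopy(weights), 0
        for epoch in range(max_epochs):
            weights = train_epoch(weights)        # one full pass over the training set
            err = validation_error(weights)       # monitor the hold-out validation set
            if err < best_err:                    # keep the best-performing weights
                best_err, best_weights, worse = err, copy.deepcopy(weights), 0
            else:
                worse += 1
                if worse >= 10:                   # stop once validation error keeps rising
                    break
        return best_weights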
Validation

• Too few hidden units prevent the network from adequately fitting the data and learning the concept.
• Too many hidden units lead to overfitting.
• Similar cross-validation methods can be used to determine an appropriate number of hidden units, using the optimal test error to select the model with the optimal number of hidden layers and nodes.

[Plot: error vs. number of epochs for the training and validation sets.]
Alternative training algorithms

Lecture 8:
Genetic Algorithms
History and Background

• The idea of evolutionary computing was introduced in the 1960s by I. Rechenberg in his work "Evolution strategies" (Evolutionsstrategie in the original). His idea was then developed by other researchers. Genetic Algorithms (GAs) were invented by John Holland and developed by him and his students and colleagues. This led to Holland's book "Adaption in Natural and Artificial Systems", published in 1975.
• In 1992 John Koza used genetic algorithms to evolve programs to perform certain tasks. He called his method "Genetic Programming" (GP). LISP programs were used, because programs in this language can be expressed in the form of a tree.
Biological Background

Chromosome:
• All living organisms consist of cells. In each cell there is the same set of chromosomes. Chromosomes are strings of DNA and serve as a model for the whole organism. A chromosome consists of genes, blocks of DNA. Each gene encodes a particular protein. Basically, it can be said that each gene encodes a trait, for example eye colour. Possible settings for a trait (e.g. blue, brown) are called alleles. Each gene has its own position in the chromosome; this position is called its locus.
• The complete set of genetic material (all chromosomes) is called the genome. A particular set of genes in the genome is called the genotype. The genotype is, with later development after birth, the base for the organism's phenotype.
Biological Background

Reproduction:
• During reproduction, recombination (or crossover) occurs first. Genes from the parents combine in some way to form the whole new chromosome. The newly created offspring can then be mutated. Mutation means that elements of DNA are changed a bit. These changes are mainly caused by errors in copying genes from the parents.
• The fitness of an organism is measured by the success of the organism in its life.
Evolutionary Computation

• Based on evolution as it occurs in nature
  • Lamarck, Darwin, Wallace: evolution of species, survival of the fittest
  • Mendel: genetics provides the inheritance mechanism
• Hence "genetic algorithms"
• Essentially a massively parallel search procedure
  • Start with a random population of individuals
Evolutionary Algorithms

[Diagram: a population of genotypes (binary strings such as 10111, 10001, 01011) is mapped by a coding scheme to points x in phenotype space, where fitness f is evaluated; selection, recombination, and mutation then produce the next population.]
Pseudo Code of an Evolutionary Algorithm

Create initial random population
Evaluate fitness of each individual
Loop:
  If termination criteria are satisfied: stop
  Select parents according to fitness
  Recombine parents to generate offspring
  Mutate offspring
  Replace population by new offspring
A Simple Genetic Algorithm

• Optimization task: find the maximum of f(x), for example f(x) = x·sin(x), x ∈ [0, π]
• Genotype: binary string s ∈ {0,1}^5, e.g. 11010, 01011, 10001
• Mapping genotype → phenotype (binary integer encoding):

  x = π · ( Σ_{i=1}^{n} s_i · 2^{n-i} ) / (2^n - 1)

Initial population:

genotype   integer   phenotype   fitness   prop. fitness
11010      26        2.6349      1.2787    30%
01011      11        1.1148      1.0008    24%
10001      17        1.7228      1.7029    40%
00101      5         0.5067      0.2459    6%
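A minimal Python sketch of this GA, with roulette-wheel selection, one-point crossover, and bit-flip mutation; the population size, generation count, and mutation rate are arbitrary choices:

    import numpy as np

    rng = np.random.default_rng(1)
    n, pop_size, generations = 5, 4, 30

    def decode(s):
        # binary integer encoding: map the bit string to x in [0, pi]
        return np.pi * int("".join(map(str, s)), 2) / (2**n - 1)

    def fitness(s):
        x = decode(s)
        return x * np.sin(x)

    pop = rng.integers(0, 2, (pop_size, n))
    for _ in range(generations):
        f = np.array([fitness(s) for s in pop]) + 1e-9
        parents = pop[rng.choice(pop_size, pop_size, p=f / f.sum())]  # roulette wheel
        children = parents.copy()
        for i in range(0, pop_size, 2):                               # one-point crossover
            cut = rng.integers(1, n)
            children[i, cut:] = parents[i + 1, cut:]
            children[i + 1, cut:] = parents[i, cut:]
        flip = rng.random(children.shape) < 0.05                      # bit-flip mutation
        pop = np.where(flip, 1 - children, children)

    best = max(pop, key=fitness)
    print(decode(best), fitness(best))   # should approach x ~ 2.03, f(x) ~ 1.82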
Some Other Issues Regarding Evolutionary Computing

• Evolution according to Lamarck:
  • The individual adapts during its lifetime, and the adaptations are inherited by its children.
  • In nature, genes don't change this way; but for computations we could allow it...
• Baldwin effect:
  • An individual's ability to learn has a positive effect on evolution, since it supports a more diverse gene pool; thus, more "experimentation" with genes is possible.
• Bacteria and viruses:
  • New evolutionary computing strategies.
Lecture 7:
Radial Basis Functions
Radial-Basis Function (RBF) Networks

RBF = radial-basis function: a function which depends only on the radial distance from a point.

[Example: the XOR problem is not linearly separable, but it is quadratically separable.]
Radial-Basis Function (RBF) Networks

So RBFs are functions taking the form

φ( ||x - x_i|| )

where φ is a nonlinear activation function, x is the input, and x_i is the i-th position, prototype, basis, or centre vector. The idea is that points near the centres will have similar outputs (i.e. if x ≈ x_i then φ(x) ≈ φ(x_i)), since they should have similar properties.
Typical RBFs include:

(a) Multiquadrics

  φ(r) = ( r² + c² )^{1/2},  for some c > 0

(b) Inverse multiquadrics

  φ(r) = ( r² + c² )^{-1/2},  for some c > 0

(c) Gaussian

  φ(r) = exp( -r² / (2σ²) ),  for some σ > 0
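These three basis functions, written as a short Python sketch (r stands for the radial distance ||x - x_i||; c and σ are free parameters):

    import numpy as np

    def multiquadric(r, c=1.0):       return np.sqrt(r**2 + c**2)
    def inv_multiquadric(r, c=1.0):   return 1.0 / np.sqrt(r**2 + c**2)
    def gaussian_rbf(r, sigma=1.0):   return np.exp(-r**2 / (2 * sigma**2))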


[Plots: multiquadrics are 'nonlocalized' functions; inverse multiquadrics and Gaussians are 'localized' functions.]
• The idea is to use a weighted sum of the outputs from the basis functions to represent the data.
• Thus the centres can be thought of as prototypes of the input data.

[Diagram: MLP vs RBF; an MLP forms a distributed representation, while an RBF network forms a local one.]
Starting point: exact interpolation

Each input pattern x must be mapped onto a target value d.
That is, given a set of N vectors x_i and a corresponding set of N real numbers d_i (the targets), find a function F that satisfies the interpolation condition

F(x_i) = d_i  for i = 1, …, N

or, more exactly, find

F(x) = Σ_{j=1}^{N} w_j φ( ||x - x_j|| )

satisfying

F(x_i) = Σ_{j=1}^{N} w_j φ( ||x_i - x_j|| ) = d_i
Single-layer networks

[Diagram: inputs y_1 … y_p feed N basis units φ_i(y) = φ(||y - x_i||), whose outputs are combined by weights w_j into the output, compared with the target d.]

• output = Σ_i w_i φ( ||y - x_i|| )
• the adjustable parameters are the weights w_j
• number of hidden units = number of data points
To summarize:

• For a given data set containing N points (x_i, d_i), i = 1, …, N:
  • Choose an RBF function φ
  • Calculate φ( ||x_j - x_i|| )
  • Solve the linear equation Φ W = D
  • Get the unique solution
  • Done

• Like MLPs, RBFNs can be shown to be able to approximate any function to arbitrary accuracy (using an arbitrarily large number of basis functions).
• Unlike MLPs, however, they have the property of 'best approximation', i.e. there exists an RBFN with minimum approximation error.
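A minimal Python sketch of exact interpolation with Gaussian RBFs, using data in the style of the Bishop example discussed below (the noise level and the width σ are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 30)                     # N = 30 sample points x_i
    d = 0.5 + 0.4 * np.sin(2 * np.pi * x) + rng.normal(0, 0.05, x.size)

    sigma = 0.05
    def phi(r):                                   # Gaussian basis function
        return np.exp(-r**2 / (2 * sigma**2))

    Phi = phi(np.abs(x[:, None] - x[None, :]))    # Phi_ij = phi(||x_i - x_j||)
    W = np.linalg.solve(Phi, d)                   # solve Phi W = D

    F = lambda xq: phi(np.abs(xq[:, None] - x[None, :])) @ W
    print(np.max(np.abs(F(x) - d)))               # ~0: every data point fitted exactly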
[Plot: interpolation with a large width, σ = 1.]
[Plot: interpolation with a small width, σ = 0.2.]
Problems with exact interpolation

Exact interpolation can produce poor generalisation performance, as only the data points constrain the mapping: the overfitting problem.

Bishop (1995) example:
• Underlying function f(x) = 0.5 + 0.4 sin(2πx), sampled randomly at 30 points
• Gaussian noise added to each data point
• 30 data points, 30 hidden RBF units
• Fits all data points, but creates oscillations due to the added noise and the unconstrained behaviour between data points
[Plots: the exact fit through all 30 data points vs. a smoother fit using only 5 basis functions.]
Fitting an RBF to every data point is very inefficient, due to the computational cost of matrix inversion, and is very bad for generalization, so:

• Use fewer RBFs than data points, i.e. M < N
• Therefore the RBFs need not be centred at data points
• Bias terms can be included
• Gaussians with general covariance matrices can be used, but there is a trade-off between complexity and generalization
Application Examples

Lecture 9:
Nonlinear Identification, Prediction and Control
Nonlinear System Identification

[Block diagram:]
• Target function: y_p(k+1) = f(·)
• Identified function: y_NET(k+1) = F(·)
• Estimation error: e(k+1)
Nonlinear System Neural Control

[Block diagram, with:]
d: reference/desired response
y: system output
u: system input/controller output
ū: desired controller input
u*: NN output
e: controller/network error

The goal of training is to find an appropriate plant control u from the desired response d. The weights are adjusted based on the difference between the outputs of networks I & II to minimise e. If network I is trained so that y = d, then u = u*. The networks act as inverse dynamics identifiers.
Nonlinear System Identification

[Figure: neural network input generation, P_m.]
Nonlinear System Identification

[Figures: the neural network target T_m, and the neural network response (angle & velocity).]
Model Reference Control

[Figures: the antenna arm nonlinear model and the linear reference model.]
Model Reference Control

[Figures: the neural controller + nonlinear system diagram; the neural controller, reference model, and neural model.]
Matlab NNtool GUI (Graphical User Interface)

[Screenshot of the Matlab NNtool GUI.]
