Lecture NN 2005 (Laboratory)
Neural Networks for Identification, Prediction and Control
Lecture 1: Introduction to Neural Networks (Machine Learning)
Silvio Simani
ssimani@ing.unife.it
References
Textbook (suggested):
Machine Learning
Definition: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at the tasks in T, as measured by P, improves with experience.
Examples of Learning Problems
Issues in Machine Learning
• What algorithms can approximate functions well, and when?
• How does the number of training examples influence accuracy?
• How does the complexity of the hypothesis representation impact it?
• How does noisy data influence accuracy?
• How do you reduce a learning problem to a set of function approximation problems?
Summary
Lecture 2: Introduction
Lecture Outline
1. Introduction (2)
   i. Course introduction
   ii. Introduction to neural networks
   iii. Issues in neural networks
2. Simple Neural Networks (3)
   i. Perceptron
   ii. Adaline
3. Multilayer Perceptron (4)
   i. Basics
   ii. Dynamics
4. Radial Basis Networks (5)
Introduction to Neural Networks
Brain
• About 10^11 neurons (processors)
• On average 1,000-10,000 connections per neuron
Artificial Neuron
The net input of neuron i (with bias b_i) is

net_i = ∑_j w_ij y_j + b_i
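As a concrete illustration, here is a minimal Python sketch of this computation; the function and variable names are illustrative, not from the lecture, and tanh is just one possible choice of activation.

```python
# One artificial neuron: net_i = sum_j w_ij * y_j + b_i, then an activation f(net_i).
import numpy as np

def artificial_neuron(y, weights, bias, activation=np.tanh):
    """y: input vector (outputs of the previous layer);
    weights: weight vector w_i (one weight per input); bias: scalar b_i."""
    net = np.dot(weights, y) + bias   # net_i = sum_j w_ij * y_j + b_i
    return activation(net)            # neuron output = f(net_i)

# Example: three inputs, arbitrary weights and bias (illustrative values)
print(artificial_neuron(np.array([0.5, -1.0, 2.0]),
                        np.array([0.1, 0.4, -0.3]),
                        bias=0.2))
```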
Artificial Neural Networks (ANN)
(Figure: a network of artificial neurons — the input vector is propagated through weighted connections and activation functions, with signal routing between layers, to produce the output vector.)
Historical Development of ANN…
Categorical Variable
A categorical variable is a variable that can belong to one of a number of discrete categories, for example red, green, blue.
Categorical variables are usually encoded using 1-out-of-n coding, e.g. for three colours: red = (1 0 0), green = (0 1 0), blue = (0 0 1).
If we used red = 1, green = 2, blue = 3, this encoding would impose an ordering on the values of the variable that does not exist.
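A minimal sketch of 1-out-of-n (one-hot) coding in Python; the helper name and the colour list are illustrative, not part of the lecture.

```python
# 1-out-of-n (one-hot) coding: each category becomes a binary vector with a single 1.
def one_hot(category, categories):
    return [1 if c == category else 0 for c in categories]

colours = ["red", "green", "blue"]
print(one_hot("red", colours))    # [1, 0, 0]
print(one_hot("green", colours))  # [0, 1, 0]
print(one_hot("blue", colours))   # [0, 0, 1]
```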
Data Pre-processing
CONTINUOUS VARIABLES
A continuous variable can be applied directly to a neural network. However, if the dynamic ranges of the input variables are not approximately the same, it is better to normalize all input variables of the neural network.
Example of Normalized Input Vector
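A minimal sketch of one common normalization (zero mean, unit standard deviation per input variable); this particular scheme and the toy data are illustrative assumptions, as the lecture's own example is not reproduced here.

```python
# Normalize each input variable (column) to zero mean and unit variance,
# so that all inputs have comparable dynamic ranges.
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 250.0]])         # rows = patterns, columns = input variables

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_norm)                        # each column now has mean 0 and std 1
```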
Simple Neural Networks
Lecture 3: Simple Perceptron
Outline
The Perceptron
• Linearly separable problems
• Network structure
• Perceptron learning rule
• Convergence of the Perceptron
THE PERCEPTRON
The perceptron is a simple model of an ANN introduced by Rosenblatt (Cornell) in the late 1950s, with the idea of learning.
The perceptron is designed to accomplish a simple pattern recognition task: after learning with real-valued training data
{ x(i), d(i), i = 1, 2, …, p } where d(i) = 1 or -1,
for a new signal (pattern) x(i+1) the perceptron is capable of telling you to which class the new signal belongs: perceptron output = 1 or -1.
Perceptron: Linear Threshold Unit (LTU)
With x_0 = 1 and w_0 = b (the bias), the output is

o(x) = +1 if ∑_{i=0}^{n} w_i x_i > 0
       -1 otherwise

(Figure: inputs x_1, …, x_n with weights w_1, …, w_n and the bias weight w_0 = b feed the sum ∑_{i=0}^{n} w_i x_i, which is passed through a hard threshold to produce o.)
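A minimal sketch of the linear threshold unit in Python; the names and the example weights (set here to realise the AND function) are illustrative, and the bias is folded into the weight vector as w_0 with x_0 = 1, as on the slide.

```python
# Linear threshold unit: o(x) = +1 if sum_i w_i * x_i > 0, else -1,
# with x_0 = 1 and w_0 = b playing the role of the bias.
import numpy as np

def ltu(x, w):
    x_aug = np.concatenate(([1.0], x))   # prepend x_0 = 1
    return 1 if np.dot(w, x_aug) > 0 else -1

w = np.array([-1.5, 1.0, 1.0])           # w_0 = b = -1.5: realises the AND function
print(ltu(np.array([1, 1]), w))          # +1
print(ltu(np.array([1, 0]), w))          # -1
print(ltu(np.array([0, 0]), w))          # -1
```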
Decision Surface of a Perceptron
(Figure: in the (x1, x2) plane the decision surface is the line w_0 + w_1 x_1 + w_2 x_2 = 0; the AND function is linearly separable, so a single line splits the + and - examples.)
• But some functions are not linearly separable (e.g. XOR) and cannot be represented by a single perceptron.
Mathematically the Perceptron is

y = f( ∑_{i=1}^{m} w_i x_i + b ) = f( ∑_{i=0}^{m} w_i x_i )

with

y = +1 if ∑_{i=1}^{m} w_i x_i + b ≥ 0
    -1 if ∑_{i=1}^{m} w_i x_i + b < 0
Why is the network capable of solving linearly separable problems?
The equation ∑_{i=1}^{m} w_i x_i + b = 0 defines a line (hyperplane) that splits the input space in two:
∑_{i=1}^{m} w_i x_i + b > 0 on the + side
∑_{i=1}^{m} w_i x_i + b < 0 on the - side
Learning rule
An algorithm to update the weights w so that finally the input patterns lie on both sides of the line decided by the perceptron.
Let t be the time index. At t = 0 the separating line is w(0) · x = 0; after successive updates it becomes w(1) · x = 0, w(2) · x = 0, w(3) · x = 0, …
(Figures: the line w(t) · x = 0 moves at each step until the + and - patterns lie on opposite sides.)
In Math
d(t) = +1 if x(t) is in class +
       -1 if x(t) is in class -

Perceptron learning rule:
w(t+1) = w(t) + η(t) [ d(t) - sign( w(t) · x(t) ) ] x(t)

where η(t) > 0 is the learning rate and sign(·) is the hard limiter function:
sign(x) = +1 if x > 0
          -1 if x ≤ 0
NB: d(t) is the same as d(i) and x(t) the same as x(i).
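A minimal sketch of the perceptron learning rule above in Python; the toy data set (the AND function), the learning rate and the stopping rule are illustrative assumptions.

```python
# Perceptron learning rule: w(t+1) = w(t) + eta * [d - sign(w . x)] * x,
# with the bias folded in as w_0 (x_0 = 1).
import numpy as np

def sign(v):
    return 1 if v > 0 else -1          # hard limiter, sign(0) = -1 as on the slide

# Toy linearly separable data: the AND function with targets +/-1
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)  # x_0 = 1
d = np.array([-1, -1, -1, 1])

w, eta = np.zeros(3), 0.5
for epoch in range(100):
    errors = 0
    for x, target in zip(X, d):
        update = eta * (target - sign(np.dot(w, x)))   # zero if correctly classified
        w += update * x
        errors += update != 0
    if errors == 0:                     # E(t) = 0: all patterns correctly classified
        break

print(w, [sign(np.dot(w, x)) for x in X])
```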
In words:
If the pattern x(t) is correctly classified, the weights are left unchanged; if it is misclassified, the weights are moved towards the pattern (class +) or away from it (class -).
Perceptron convergence theorem (Rosenblatt, 1962)
If the training patterns are linearly separable, the perceptron learning rule converges to a weight vector that correctly classifies all training patterns in a finite number of steps.
Summary of Perceptron learning…
b = bias
y(t) = actual response
η(t) = learning rate parameter, a positive constant < 1
d(t) = desired response
Summary of Perceptron learning…
Data { (x(i), d(i)), i = 1, …, p } can be presented cyclically:
(x(1), d(1)), (x(2), d(2)), …, (x(p), d(p)), (x(p+1), d(p+1)), …
or in random order.
Questions remain
Where or when to stop?
(Figure: after learning for t steps the separating line classifies all + and - training patterns correctly, so the training error E(t) = 0.)
How to define the generalization error?
E_g = [ d(t+1) - sign( x(t+1) · w(t) ) ]²
(Figure: after learning for t steps, a new pattern x(t+1) may still fall on the wrong side of the line, giving a nonzero generalization error.)
We next turn to ADALINE learning, from which we can understand the learning rule and, more generally, Back-Propagation (BP) learning.
Simple Neural Networks
Lecture 4: ADALINE Learning
Outline
• ADALINE
• Modes of training
Unhappy with Perceptron Training
(Figure: the ADALINE unit — the inputs are combined in a weighted sum ∑, which is passed through the activation f(x).)
After each training pattern x(i) is presented, the correction applied to the weights is proportional to the error.
E(t) = ∑_i E(i, t)
General Approach: gradient descent method
To find an update rule g:
w(t+1) = w(t) + g( E(w(t)) )
so that w automatically tends to the global minimum of E(w).
The gradient direction is the uphill direction. For example, in the figure, at position 0.4 the gradient F'(0.4) points uphill (here F plays the role of E; consider the one-dimensional case).
(Figure: a one-dimensional function F with the gradient direction F'(0.4) drawn at x = 0.4.)
• In the gradient descent algorithm we have
w(t+1) = w(t) - F'(w(t)) η(t)
so the ball goes downhill, since -F'(w(t)) is the downhill direction.
(Figures: the ball at w(t), then at w(t+1) after one update.)
• Gradually the ball will stop at a local minimum w(t+k), where the gradient is zero.
(Figure: the ball at rest at w(t+k).)
• In words: the gradient method can be thought of as a ball rolling down a hill; the ball rolls down and finally stops at the bottom of a valley.
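A minimal one-dimensional sketch of gradient descent in Python; the example function, learning rate and starting point are illustrative assumptions.

```python
# Gradient descent on a one-dimensional function F(w):
# w(t+1) = w(t) - F'(w(t)) * eta, the "ball rolling downhill".
def F(w):        # illustrative error surface with its minimum at w = 2
    return (w - 2.0) ** 2

def dF(w):       # its derivative F'(w)
    return 2.0 * (w - 2.0)

w = 0.4          # starting position
eta = 0.1        # learning rate eta(t), kept constant here
for t in range(50):
    w = w - eta * dF(w)

print(w, F(w))   # w approaches the minimum at 2.0, F(w) approaches 0
```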
Multi-Layer Perceptron (MLP)
(Figure: signals are routed from the input layer (x_1, …, x_n) through a hidden layer to the output layer.)
Properties of architecture
• No connections within a layer
• No direct connections between input and output layers
• Fully connected between layers
• Often more than 2 layers
• Number of output units need not equal number of input units
• Number of hidden units per layer can be more or less than the number of input or output units

Each unit computes
y_i = f( ∑_{j=1}^{m} w_ij x_j + b_i )
BP (Back Propagation)
Lecture 5: MultiLayer Perceptron I
Back-Propagation Learning
BP learning algorithm
A solution to the "credit assignment problem" in MLPs.
BP Learning for the Simplest MLP
The simplest MLP has one input I, one hidden unit with activity y = f( w(t) I ), and one output o = f( W(t) y ).

Task: given data {I, d}, minimize the error function at the output unit
E = (d - o)² / 2
  = [ d - f( W(t) y(t) ) ]² / 2
  = [ d - f( W(t) f( w(t) I ) ) ]² / 2

Updating the output weight by gradient descent:
W(t+1) = W(t) - η dE/dW(t)
ΔW(t) = -η (dE/df) (df/dW(t)) = η (d - o) f'( W(t) y ) y
where o = f( W(t) y ).
Backward pass phase (output weight W)
W(t+1) = W(t) - η dE/dW(t)
ΔW(t) = -η (dE/df) (df/dW(t)) = η (d - o) f'( W(t) y ) y = η Δ y
where Δ = (d - o) f'( W(t) y ).
Backward pass phase (hidden weight w)
w(t+1) = w(t) - η dE/dw(t)
Δw(t) = -η (dE/dy) (dy/dw(t)) = η (d - o) f'( W(t) y ) W(t) dy/dw(t)
      = η Δ W(t) f'( w(t) I ) I
with o = f( W(t) y ) = f( W(t) f( w(t) I ) ).
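A minimal numerical sketch of these two updates for the 1-1-1 network, using a sigmoid for f; the data point, learning rate and initial weights are illustrative assumptions.

```python
# Back-propagation for the simplest MLP: I -> hidden y -> output o,
# with y = f(w*I), o = f(W*y), E = (d - o)^2 / 2 and a sigmoid f.
import math

def f(x):  return 1.0 / (1.0 + math.exp(-x))   # sigmoid activation
def df(x): return f(x) * (1.0 - f(x))          # its derivative f'(x)

I, d = 0.8, 1.0          # one training pair {I, d} (illustrative)
w, W = 0.5, -0.3         # initial hidden and output weights (illustrative)
eta = 0.5                # learning rate

for t in range(1000):
    # forward pass
    y = f(w * I)
    o = f(W * y)
    # backward pass (weights are fixed while both corrections are computed)
    delta = (d - o) * df(W * y)            # Delta = (d - o) f'(W y)
    dW = eta * delta * y                   # Delta W = eta * Delta * y
    dw = eta * delta * W * df(w * I) * I   # Delta w = eta * Delta * W * f'(w I) * I
    W += dW
    w += dw

print(o, (d - o) ** 2 / 2)                 # output moves towards d, error shrinks
```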
General Two-Layer Network
I inputs, O outputs; w are the connections (weights) into the hidden units, W are the connections into the output units, and y_j is the activity of hidden unit j.
(Figure: inputs I, hidden activities y, outputs O.)
Forward pass
Weights are fixed during the forward and backward pass at time t.
1. Compute values for the hidden units:
   net_j(t) = ∑_i w_ji(t) I_i(t)
   y_j = f( net_j(t) )
2. Compute values for the output units:
   Net_k(t) = ∑_j W_kj(t) y_j
   O_k = f( Net_k(t) )
Backward Pass
Recall the delta rule; the error measure for pattern n is
E(t) = (1/2) ∑_k ( d_k(t) - O_k(t) )²
We want to know how to modify the weights in order to decrease E(t), i.e.
w_ij(t+1) = w_ij(t) - η ∂E(t)/∂w_ij(t)
where, by the chain rule,
∂E(t)/∂w_ij(t) = ( ∂E(t)/∂net_j(t) ) ( ∂net_j(t)/∂w_ij(t) )
both for hidden units and output units.
Summary
Weight updates are local:
w_ji(t+1) = w_ji(t) + η δ_j(t) I_i(t)   (hidden unit)
W_kj(t+1) = W_kj(t) + η δ_k(t) y_j(t)   (output unit)

Output unit:
W_kj(t+1) = W_kj(t) + η δ_k(t) y_j(t)
          = W_kj(t) + η ( d_k(t) - O_k(t) ) f'( Net_k(t) ) y_j(t)

Hidden unit:
w_ji(t+1) = w_ji(t) + η δ_j(t) I_i(t)
          = w_ji(t) + η f'( net_j(t) ) [ ∑_k δ_k(t) W_kj ] I_i(t)
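A minimal vectorized sketch of these update rules for a general two-layer network with sigmoid activations; the array shapes, the single data pattern and the learning rate are illustrative assumptions.

```python
# One back-propagation step for a two-layer MLP with sigmoid activations,
# following the local update rules above. Sizes and data are illustrative.
import numpy as np

def f(x):  return 1.0 / (1.0 + np.exp(-x))    # sigmoid
def df(x): return f(x) * (1.0 - f(x))         # f'(x)

rng = np.random.default_rng(0)
n_in, n_hid, n_out, eta = 3, 4, 2, 0.1
w = rng.uniform(-0.5, 0.5, (n_hid, n_in))     # input-to-hidden weights w_ji
W = rng.uniform(-0.5, 0.5, (n_out, n_hid))    # hidden-to-output weights W_kj

I = np.array([0.2, -0.4, 0.7])                # one input pattern I_i(t)
d = np.array([1.0, 0.0])                      # desired response d_k(t)

# forward pass
net = w @ I;  y = f(net)                      # hidden activities y_j
Net = W @ y;  O = f(Net)                      # outputs O_k

# backward pass
delta_out = (d - O) * df(Net)                 # delta_k = (d_k - O_k) f'(Net_k)
delta_hid = df(net) * (W.T @ delta_out)       # delta_j = f'(net_j) sum_k delta_k W_kj

W += eta * np.outer(delta_out, y)             # W_kj += eta * delta_k * y_j
w += eta * np.outer(delta_hid, I)             # w_ji += eta * delta_j * I_i
```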
Shape of the sigmoidal function derivative
For output units:
δ_i(t) = ( d_i(t) - O_i(t) ) f'( Net_i(t) )
       = ( d_i(t) - O_i(t) ) k O_i(t) ( 1 - O_i(t) )
For hidden units we have, analogously,
δ_j(t) = k y_j(t) ( 1 - y_j(t) ) ∑_k δ_k(t) W_kj(t)
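For completeness, a short derivation (not on the slide) of why the logistic sigmoid with gain k has derivative k·f(x)(1 - f(x)), which is where the factor k O(1 - O) above comes from; this assumes f is the logistic sigmoid, consistent with that factor.

```latex
f(x) = \frac{1}{1 + e^{-kx}}
\quad\Longrightarrow\quad
f'(x) = \frac{k\,e^{-kx}}{\left(1 + e^{-kx}\right)^{2}}
      = k \cdot \frac{1}{1 + e^{-kx}} \cdot \frac{e^{-kx}}{1 + e^{-kx}}
      = k\, f(x)\,\bigl(1 - f(x)\bigr)
```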
Advantages and disadvantages of different modes
Sequential mode:
• Less storage for each weighted connection
• Random order of presentation and updating per pattern means the search of weight space is stochastic, reducing the risk of local minima
• Able to take advantage of any redundancy in the training set (i.e. the same pattern occurs more than once in the training set, especially for large training sets)
• Simpler to implement
Batch mode:
• Faster learning than sequential mode
Lecture 5: MultiLayer Perceptron II
Dynamics of the MultiLayer Perceptron
Summary of Network Training
Backward phase:
Output unit:
W_kj(t+1) = W_kj(t) + η δ_k(t) y_j(t)
          = W_kj(t) + η ( d_k(t) - O_k(t) ) f'( Net_k(t) ) y_j(t)
Hidden unit (input-to-hidden weights):
w_ji(t+1) = w_ji(t) + η δ_j(t) I_i(t)
          = w_ji(t) + η f'( net_j(t) ) [ ∑_k δ_k(t) W_kj(t) ] I_i(t)
Network training:
Goals of Neural Network Training
To give the correct output for an input training vector (Learning)
Training and Testing Problems
• Stuck neurons: the degree of weight change is proportional to the derivative of the activation function, so weight changes are greatest when a unit receives a mid-range functional signal and smallest at the extremes. To avoid stuck neurons, weight initialization should give outputs of all neurons of approximately 0.5.
• Insufficient number of training patterns: in this case the training patterns will be learnt instead of the underlying relationship between inputs and output, i.e. the network just memorizes the patterns.
• Too few hidden neurons: the network will not produce a good model of the problem.
• Over-fitting: the training patterns will be learnt instead of the underlying function between inputs and output because there are too many hidden neurons. This means that the network will have poor generalization capability.
Dynamics of BP learning
The aim is to minimise an error function over all training patterns by adapting the weights of the MLP.
Dynamics of BP learning
In a single-layer perceptron with linear activation functions, the error function is simple, described by a smooth parabolic surface with a single minimum.
Dynamics of BP learning
MLPs with nonlinear activation functions have complex error surfaces (e.g. plateaus, long valleys, etc.) with no single minimum.
Effect of momentum term
• If weight changes tend to have the same sign, the momentum term increases and gradient descent speeds up, accelerating convergence on shallow gradients.
• If weight changes tend to have opposing signs, the momentum term decreases and gradient descent slows down, reducing oscillations (stabilizes).
• Can help escape being trapped in local minima.
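A minimal sketch of one common form of the momentum update (the weight change is the gradient step plus a fraction α of the previous change); the exact formulation and the constants are assumptions, since the slide describes the effect rather than the formula.

```python
# Gradient descent with a momentum term:
# dw(t) = -eta * grad_E(w(t)) + alpha * dw(t-1);  w(t+1) = w(t) + dw(t)
def grad_E(w):                 # illustrative error gradient (minimum at w = 2)
    return 2.0 * (w - 2.0)

w, dw = 0.0, 0.0
eta, alpha = 0.05, 0.9         # learning rate and momentum coefficient (illustrative)
for t in range(200):
    dw = -eta * grad_E(w) + alpha * dw
    w += dw

print(w)                       # converges towards the minimum at 2.0
```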
Selecting Initial Weight Values
• The choice of initial weight values is important, as it decides the starting position in weight space, that is, how far away from the global minimum we start.
• The aim is to select weight values which produce mid-range function signals.
• Select weight values randomly from a uniform probability distribution.
• Normalise the weight values so that the number of weighted connections per unit produces a mid-range function signal.
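A minimal sketch of such an initialization: uniform random weights scaled by the number of incoming connections (fan-in), so that the net input, and hence the activation, stays in the mid-range. The 1/√fan-in scaling rule is an illustrative choice, not taken from the slide.

```python
# Initialize a weight matrix with small uniform random values scaled by
# fan-in, so each unit's net input stays in the activation's mid-range.
import numpy as np

def init_weights(n_out, n_in, rng=np.random.default_rng(0)):
    limit = 1.0 / np.sqrt(n_in)            # scale by number of incoming connections
    return rng.uniform(-limit, limit, (n_out, n_in))

w = init_weights(4, 100)
x = np.ones(100)                           # worst-case all-ones input
print(np.abs(w @ x).max())                 # net inputs stay moderate, not saturated
```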
Convergence of Backprop
Avoiding local minima while keeping convergence fast:
• Add momentum
• Use stochastic gradient descent
• Train multiple nets with different initial weights
Nature of convergence:
• Initialize weights 'near zero', so the initial network is near-linear
• Increasingly non-linear functions become possible as training progresses
Use of the Available Data Set for Training
The available data set is normally split into three sets as follows:
• Training set – used to update the weights.
• Validation set – used to decide when to stop training and to select the model.
• Test set – used to estimate the generalization error.
(Figure: training error and validation error versus number of epochs.)
Validation
Too few hidden units prevent the network from adequately fitting the data and learning the concept.
Too many hidden units lead to overfitting.
Similar cross-validation methods can be used to determine an appropriate number of hidden units, using the error on the validation (test) set to select the model with the optimal number of hidden layers and nodes.
(Figure: training error and validation error versus number of epochs.)
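A minimal sketch of this kind of validation-based early stopping; the training function, patience rule and data split are illustrative assumptions, not the lecture's exact procedure.

```python
# Early stopping sketch: keep training while the validation error improves;
# stop (and keep the best weights) once it stops improving for a while.
import copy

def early_stopping_train(net, train_step, val_error, max_epochs=1000, patience=10):
    """net: any model object; train_step(net) does one epoch of weight updates;
    val_error(net) returns the error on the validation set."""
    best_err, best_net, bad_epochs = float("inf"), copy.deepcopy(net), 0
    for epoch in range(max_epochs):
        train_step(net)                     # one pass over the training set
        err = val_error(net)                # error on the validation set
        if err < best_err:
            best_err, best_net, bad_epochs = err, copy.deepcopy(net), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:      # validation error no longer improving
                break
    return best_net, best_err
```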
Alternative training algorithms
Lecture 8: Genetic Algorithms
History
Background
(Figure: a coding scheme maps binary-string genotypes, e.g. 10011, 01001, 01011, to points x in phenotype space, where the fitness f(x) is evaluated; recombination and selection then operate on the genotypes.)
Pseudo Code of an Evolutionary Algorithm
1. Generate the initial population
2. Evaluate the fitness of each individual
3. Repeat until termination:
   - Select parents according to fitness
   - Recombine parents to produce offspring
   - Mutate offspring
   - Evaluate the offspring
   - Select survivors for the next generation
Genotype | Integer | Phenotype | Fitness | Prop. fitness
11010    |   26    |  2.6349   | 1.2787  |    30%
01011    |   11    |  1.1148   | 1.0008  |    24%
10001    |   17    |  1.7228   | 1.7029  |    40%
00101    |    5    |  0.5067   | 0.2459  |     6%
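A minimal sketch of fitness-proportional (roulette-wheel) selection applied to this table; the random seed and variable names are illustrative.

```python
# Fitness-proportional (roulette-wheel) selection: each genotype is picked
# with probability fitness_i / sum(fitness), matching the "prop. fitness" column.
import random

population = ["11010", "01011", "10001", "00101"]
fitness    = [1.2787, 1.0008, 1.7029, 0.2459]

total = sum(fitness)
probs = [f / total for f in fitness]          # ~ [0.30, 0.24, 0.40, 0.06]

random.seed(0)
parents = random.choices(population, weights=probs, k=4)   # select 4 parents
print(probs, parents)
```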
Some Other Issues
Regarding Evolutionary
Computing
Evolution according to Lamarck.
Individual adapts during lifetime.
Adaptations inherited by children.
In nature, genes don’t change; but for computations we
could allow this...
Baldwin effect.
Individual’s ability to learn has positive effect on evolution.
It supports a more diverse gene pool.
Thus, more “experimentation” with genes possible.
Bacteria and viruses.
New evolutionary computing strategies.
Lecture 7: Radial Basis Functions
Radial-basis function (RBF) networks
RBF networks can handle problems that are not linearly separable (e.g. quadratically separable problems).
Radial-basis function (RBF) networks
So RBFs are functions taking the form
φ_i( ‖ x - x_i ‖ )
i.e. each basis function depends only on the distance of the input x from a centre x_i.
MLP vs RBF: the representation learned by an MLP is distributed, while that of an RBF network is local.
Starting point: exact interpolation
That is, given a set of N vectors x_i and a corresponding set of N real numbers d_i (the targets), find a function F that satisfies the interpolation condition:
F(x_i) = d_i   for i = 1, …, N
or, more exactly, find
F(x) = ∑_{j=1}^{N} w_j φ( ‖ x - x_j ‖ )
satisfying
F(x_i) = ∑_{j=1}^{N} w_j φ( ‖ x_i - x_j ‖ ) = d_i
Single-layer networks
(Figure: inputs y_1, …, y_p feed N hidden basis functions φ_1(y) = φ_1(‖y - x_1‖), …, φ_N(y) = φ_N(‖y - x_N‖), whose outputs are combined with weights w_j into the output, compared with the target d.)
• output = ∑_i w_i φ_i( ‖ y - x_i ‖ )
• the adjustable parameters are the weights w_j
• number of hidden units = number of data points
To summarize:
For a given data set containing N points (x_i, d_i), i = 1, …, N:
• Choose an RBF function φ
• Calculate φ( ‖ x_j - x_i ‖ )
• Solve the linear equation Φ W = D
• Get the unique solution
• Done
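A minimal numerical sketch of this procedure with a Gaussian RBF in one dimension; the toy data, the width σ and the use of numpy's linear solver are illustrative assumptions.

```python
# Exact RBF interpolation: build Phi_ij = phi(||x_i - x_j||), solve Phi w = d,
# then evaluate F(x) = sum_j w_j phi(||x - x_j||). Gaussian phi, toy 1-D data.
import numpy as np

def phi(r, sigma=0.5):
    return np.exp(-r**2 / (2 * sigma**2))     # Gaussian radial basis function

x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])     # N data points (used as centres)
d = np.sin(2 * np.pi * x)                     # N targets

Phi = phi(np.abs(x[:, None] - x[None, :]))    # interpolation matrix Phi_ij
w = np.linalg.solve(Phi, d)                   # unique solution of Phi w = d

def F(x_new):
    return phi(np.abs(x_new - x)) @ w         # F(x) = sum_j w_j phi(||x - x_j||)

print([round(F(xi), 6) for xi in x])          # reproduces the targets d exactly
```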
(Figure: exact interpolation with a small RBF width, σ = 0.2.)
Problems with exact interpolation
• It can produce poor generalisation performance, as only the data points constrain the mapping
• Overfitting problem
• Bishop (1995) example: the interpolant fits all data points but creates oscillations, due to the added noise and to the mapping being unconstrained between data points
(Figure: interpolation using all data points as centres versus using only 5 basis functions.)
To fit an RBF to every data point is very inefficient, due to the computational cost of the matrix inversion, and is very bad for generalization, so: use a smaller number of basis functions than data points.

Lecture 9: Nonlinear Identification, Prediction and Control
Nonlinear System Identification
(Figure: identification scheme with input generation, the plant Pm, and the neural network model.)
Model Reference Control