UNIT 2
MATHEMATICAL FOUNDATIONS AND LEARNING MECHANISMS
Re-visiting Vector and Matrix algebra
Vector: An n-tuple (pair, triple, quadruple ...) of scalars can be written as a horizontal row or vertical
column. A column is called a vector. A vector is denoted by an uppercase letter. Its entries are
identified by the corresponding lowercase letter, with subscripts. The row with the same entries is
indicated by a superscript t.
Consider a vector X with entries x1, ..., xn, and let O denote the zero vector, all of whose entries are 0. Clearly,
-O = O        X + O = X
-(-X) = X     X + (-X) = O.
You can multiply a vector by a scalar:
Xs = [x1, ..., xn]t s = [x1 s, ..., xn s]t.
This product is also written sX. You should verify these manipulation rules:
s(X + Y) = sX + sY    (s + t)X = sX + tX
(st)X = s(tX)         1X = X.
You can add and subtract rows Xt and Yt with the same number of entries, and define the zero row and the negative of a row. The product of a scalar and a row is
sXt = s[x1, ..., xn] = [sx1, ..., sxn].
These rules are useful:
Xt ± Yt = (X ± Y)t    -(Xt) = (-X)t    s(Xt) = (sX)t.
Finally, you can multiply a row by a vector with the same number of entries to get their scalar product:
XtY = [x1, ..., xn][y1, ..., yn]t = x1 y1 + ... + xn yn.
With a little algebra you can verify the following manipulation rules:
(sXt)Y = s(XtY) = Xt(sY)    Xt(Y + Z) = XtY + XtZ    XtY = YtX.
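These operations can be checked numerically; the following is a small NumPy sketch (an illustration added here, not part of the original notes), in which a 1-D array stands in for both the column X and the row Xt:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([4.0, 5.0, 6.0])
s = 2.0

print(X + Y)           # vector addition, entry by entry
print(s * X)           # scalar multiple sX = [s x1, ..., s xn]
print(X @ Y)           # scalar product XtY = x1 y1 + ... + xn yn
print(X @ Y == Y @ X)  # XtY = YtX
```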
Matrix algebra
An m×n matrix is a rectangular array of mn scalars in m rows and n columns. A matrix is denoted by an uppercase letter. Its entries are identified by the corresponding lowercase letter, with double subscripts: aij is the entry in row i and column j.
A is called square when m = n. The entries aij with i = j are called diagonal entries. m×1 and 1×n matrices
are columns and rows with m and n entries, and 1×1 matrices are handled like scalars. You can add
or subtract m×n matrices by adding or subtracting corresponding entries, just as you add or subtract
columns and rows. A matrix whose entries are all zeros is called a zero matrix, and denoted by O.
You can also define the negative of a matrix, and the product sA of a scalar s and a matrix A.
You can multiply an m×n matrix A by a vector X with n entries; their product AX is the vector with m entries, the products of the rows of A by X: the ith entry of AX is ai1 x1 + ... + ain xn.
Similar manipulation rules hold. Further, you can check the associative law
Xt(AY) = (XtA)Y.
You can multiply an l×m matrix A by an m×n matrix B. Their product AB is an l×n matrix that you
can describe two ways. Its columns are the products of A by the columns of B, and its rows are the
products of the rows of A by B. The i,kth entry of AB is thus ai1 b1k + … + aim bmk. You can check these manipulation rules:
A(BC) = (AB)C    A(B ± C) = AB ± AC    (A ± B)C = AC ± BC    s(AB) = (sA)B = A(sB).
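As a quick check (an added illustration, not from the notes), the i,k entry formula can be compared with NumPy's built-in matrix product:

```python
import numpy as np

l, m, n = 2, 3, 4
rng = np.random.default_rng(0)
A = rng.normal(size=(l, m))   # an l x m matrix
B = rng.normal(size=(m, n))   # an m x n matrix

# Entry (i, k) of AB is ai1 b1k + ... + aim bmk.
AB = np.empty((l, n))
for i in range(l):
    for k in range(n):
        AB[i, k] = sum(A[i, j] * B[j, k] for j in range(m))

print(np.allclose(AB, A @ B))  # True: the entry formula matches A @ B
```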
The definition of the product of two matrices was motivated by the formulas for linear substitution: from X = AY and Y = BZ it follows that X = A(BZ) = (AB)Z.
Every m×n matrix A has a transpose At, the n×m matrix whose j,ith entry is the i,jth entry of A.
The n×n identity matrix I has diagonal entries 1 and all other entries 0; clearly AI = A = IA for every n×n matrix A. Not every square matrix is invertible. When there exists B such that AB = I = BA, that B is unique; if also AC = I = CA, then B = BI = B(AC) = (BA)C = IC = C. Thus an invertible matrix A has a unique inverse A-1 such that AA-1 = I = A-1A.
Clearly, I is invertible and I-1 = I.
The inverse and transpose of an invertible matrix are invertible, and any
product of invertible matrices is invertible:
(A-1)-1 = A    (At)-1 = (A-1)t    (AB)-1 = B-1A-1
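These three rules can also be verified numerically; here is a short NumPy sketch (an added illustration; a random matrix is invertible with probability 1, so no explicit check is made):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
Ainv = np.linalg.inv(A)

print(np.allclose(np.linalg.inv(Ainv), A))        # (A-1)-1 = A
print(np.allclose(np.linalg.inv(A.T), Ainv.T))    # (At)-1 = (A-1)t
print(np.allclose(np.linalg.inv(A @ B),
                  np.linalg.inv(B) @ Ainv))       # (AB)-1 = B-1 A-1
```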
Determinants
The determinant of an n×n matrix A is
det A = Σψ (sign ψ) a1ψ(1) a2ψ(2) … anψ(n),
where the sum ranges over all n! permutations ψ of {1, ..., n}, and sign ψ = ±1 depending on whether ψ is an even or odd permutation. In each term of the sum there's one factor from each row and one from each column.
For the 2×2 case the determinant is det A = a11 a22 − a12 a21.
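The permutation-sum definition can be coded directly and compared with a library determinant; this sketch (an added illustration) computes sign ψ by counting inversions:

```python
import numpy as np
from itertools import permutations

def det_by_permutations(A):
    """det A = sum over all n! permutations psi of sign(psi) * a_{1,psi(1)} * ..."""
    n = A.shape[0]
    total = 0.0
    for psi in permutations(range(n)):
        # An even/odd number of inversions gives sign +1/-1.
        inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                         if psi[i] > psi[j])
        term = -1.0 if inversions % 2 else 1.0
        for i in range(n):
            term *= A[i, psi[i]]
        total += term
    return total

A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(det_by_permutations(A))   # a11 a22 - a12 a21 = -2.0
print(np.linalg.det(A))         # matches, up to floating-point round-off
```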
State-space Concepts
State space search is a process used in the field of computer science, including artificial
intelligence (AI), in which successive configurations or states of an instance are considered, with the
intention of finding a goal state with a desired property.
Problems are often modeled as a state space, a set of states that a problem can be in. The
set of states forms a graph where two states are connected if there is an operation that can be
performed to transform the first state into the second.
State space search often differs from traditional computer science search methods because the
state space is implicit: the typical state space graph is much too large to generate and store
in memory. Instead, nodes are generated as they are explored, and typically discarded thereafter. A
solution to a problem instance may consist of the goal state itself, or of a path from some initial
state to the goal state.
A state space is a set of descriptions or states.
Each search problem consists of:
One or more initial states.
A set of legal actions. Actions are represented by operators or moves applied to each state. For example, the operators in a state space representation of the 8-puzzle problem are left, right, up and down.
One or more goal states.
The number of operators is problem dependent and specific to a particular state space representation. The more operators there are, the larger the branching factor of the state space. Thus, the number of operators should be kept to a minimum; e.g., in the 8-puzzle the operations are defined in terms of moving the blank space instead of the tiles.
In state space search a state space is formally represented as a tuple (S, A, Action(s), Result(s,a), Cost(s,a)), in which:
S is the set of all possible states;
A is the set of possible actions, not tied to a particular state but covering the whole state space;
Action(s) is the function that establishes which actions can be performed in a certain state s;
Result(s,a) is the function that returns the state reached by performing action a in state s;
Cost(s,a) is the cost of performing action a in state s. In many state spaces this is a constant, but this is not true in general.
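A minimal sketch (added here, not in the original notes) of searching such an implicit state space breadth-first; the functions actions and result play the roles of Action(s) and Result(s,a), states are assumed hashable, and Cost(s,a) is taken to be constant:

```python
from collections import deque

def breadth_first_search(initial, is_goal, actions, result):
    """Nodes are generated as they are explored; the full state-space
    graph is never built or stored, matching the description above."""
    frontier = deque([[initial]])      # queue of paths from the initial state
    explored = {initial}
    while frontier:
        path = frontier.popleft()
        if is_goal(path[-1]):
            return path                # path from the initial state to a goal
        for a in actions(path[-1]):
            nxt = result(path[-1], a)
            if nxt not in explored:    # skip states already generated
                explored.add(nxt)
                frontier.append(path + [nxt])
    return None                        # no goal state is reachable

# Tiny demo: reach 11 from 0 when the legal actions add 1 or 3 to the state.
path = breadth_first_search(0, lambda s: s == 11,
                            lambda s: (1, 3), lambda s, a: s + a)
print(path)  # [0, 1, 2, 5, 8, 11] -- a shortest such path
```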
Water Jug Problem
You are given a 4-gallon jug and a 3-gallon jug, neither of which has measuring marks, and a pump with an unlimited supply of water. The goal is to get exactly 2 gallons of water into the 4-gallon jug. A state is written (x, y), where x is the number of gallons in the 4-gallon jug and y the number in the 3-gallon jug; the initial state is (0, 0). The production rules are:
1. (x,y) if x < 4 → (4, y) Fill the 4 gallon jug
2. (x,y) if y < 3 → (x, 3) Fill the 3 gallon jug
3. (x,y) if x > 0 → (x−d, y) Pour some water out of the 4 gallon jug
4. (x,y) if y > 0 → (x, y−d) Pour some water out of the 3 gallon jug
5. (x,y) if x > 0 → (0, y) Empty the 4 gallon jug
6. (x,y) if y > 0 → (x, 0) Empty the 3 gallon jug
7. (x,y) if x+y ≥ 4 and y > 0 → (4, y−(4−x)) Pour some water from the 3 gallon jug to fill the 4 gallon jug
8. (x,y) if x+y ≥ 3 and x > 0 → (x−(3−y), 3) Pour some water from the 4 gallon jug to fill the 3 gallon jug
9. (x,y) if x+y ≤ 4 and y > 0 → (x+y, 0) Pour all water from the 3 gallon jug into the 4 gallon jug
10. (x,y) if x+y ≤ 3 and x > 0 → (0, x+y) Pour all water from the 4 gallon jug into the 3 gallon jug
The listed production rules contain all the actions that could be performed by the agent in transferring the contents of the jugs. But to solve the water jug problem in the minimum number of moves, the following rules should be applied in the given sequence:
Solution of water jug problem according to the production rules:
S.No. 4 gallon jug contents 3 gallon jug contents Rule followed
1. 0 gallon 0 gallon Initial state
2. 0 gallon 3 gallons Rule no.2
3. 3 gallons 0 gallon Rule no. 9
4. 3 gallons 3 gallons Rule no. 2
5. 4 gallons 2 gallons Rule no. 7
6. 0 gallon 2 gallons Rule no. 5
7. 2 gallons 0 gallon Rule no. 9
At the 7th step we reach (2, 0), which is our goal state, and the problem is solved.
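The table above can be reproduced mechanically; here is a small sketch (my addition) that breadth-first searches the water jug state space. Rules 3 and 4, which pour out an arbitrary amount d, are omitted since they only enlarge the branching factor:

```python
from collections import deque

def successors(state):
    """Apply production rules 1, 2, 5-10 to the state (x, y)."""
    x, y = state
    out = []
    if x < 4: out.append((4, y))                           # 1: fill 4 gallon jug
    if y < 3: out.append((x, 3))                           # 2: fill 3 gallon jug
    if x > 0: out.append((0, y))                           # 5: empty 4 gallon jug
    if y > 0: out.append((x, 0))                           # 6: empty 3 gallon jug
    if x + y >= 4 and y > 0: out.append((4, y - (4 - x)))  # 7: fill 4 from 3
    if x + y >= 3 and x > 0: out.append((x - (3 - y), 3))  # 8: fill 3 from 4
    if x + y <= 4 and y > 0: out.append((x + y, 0))        # 9: pour all 3 into 4
    if x + y <= 3 and x > 0: out.append((0, x + y))        # 10: pour all 4 into 3
    return out

def solve(initial=(0, 0), goal=(2, 0)):
    frontier = deque([[initial]])
    seen = {initial}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in successors(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])

print(solve())
# [(0, 0), (0, 3), (3, 0), (3, 3), (4, 2), (0, 2), (2, 0)] -- the table above
```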
8-Puzzle
The 8-puzzle problem is a puzzle invented and popularized by Noyes Palmer Chapman in the 1870s.
It is played on a 3-by-3 grid with 8 square blocks labeled 1 through 8 and a blank square. Your goal
is to rearrange the blocks so that they are in order. You are permitted to slide blocks horizontally or
vertically into the blank square.
For example, given the initial state shown below (left), we may want the tiles to be moved so that the goal state (right) is attained.
Initial State    Final State
1     3          1  2  3
4  2  5          4  5  6
7  8  6          7  8
The following shows a sequence of legal moves from an initial board position (left) to the goal position (right).
1     3       1  2  3       1  2  3       1  2  3
4  2  5   →   4     5   →   4  5      →   4  5  6
7  8  6       7  8  6       7  8  6       7  8
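The legal moves are easy to generate in code; the following sketch (my addition) represents a board as a tuple read row by row, with 0 marking the blank square:

```python
def neighbors(board):
    """All boards reachable by sliding one tile into the blank square."""
    i = board.index(0)                 # position of the blank
    row, col = divmod(i, 3)
    result = []
    for dr, dc in ((0, -1), (0, 1), (-1, 0), (1, 0)):  # left, right, up, down
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:
            j = 3 * r + c
            b = list(board)
            b[i], b[j] = b[j], b[i]    # swap the blank with the adjacent tile
            result.append(tuple(b))
    return result

start = (1, 0, 3, 4, 2, 5, 7, 8, 6)    # the leftmost board above
goal = (1, 2, 3, 4, 5, 6, 7, 8, 0)     # the goal board
print(goal in neighbors((1, 2, 3, 4, 5, 0, 7, 8, 6)))  # True: one move away
```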
Concepts of Optimization
Optimization is the action of making something, such as a design, situation, resource, or system, as effective as possible. Using a resemblance between the cost function and the energy function, we can use highly interconnected neurons to solve optimization problems. One such neural network is the Hopfield network, which consists of a single layer containing one or more fully connected recurrent neurons. It can be used for optimization.
Points to remember while using a Hopfield network for optimization −
The network settles to a minimum of its energy function.
It will find a satisfactory solution rather than select one out of the stored patterns.
The quality of the solution found by a Hopfield network depends significantly on the initial state of the network.
Matrix Representation
Each tour of an n-city TSP can be expressed as an n × n matrix M whose ith row describes the ith city's location in the tour: mx,j = 1 if city x occupies position j of the tour, and mx,j = 0 otherwise. For 4 cities A, B, C, D, M is a 4 × 4 matrix whose rows are indexed by the cities and whose columns are indexed by the tour positions 1, ..., 4.
Constraint-I
Each city can occur in only one position in the tour, hence in each row of matrix M, one element must be equal to 1 and the other elements must be equal to 0: Σj mx,j = 1 for each city x.
Now the energy function to be minimized, based on the above constraint, will contain a term proportional to
Σx (1 − Σj mx,j)².
Constraint-II
Likewise, only one city can occur in each position of the tour, hence in each column of matrix M, one element must be equal to 1 and the other elements must be equal to 0. This constraint can mathematically be written as
Σx mx,j = 1 for each position j.
Now the energy function to be minimized, based on the above constraint, will contain a term proportional to
Σj (1 − Σx mx,j)².
As we know, in matrix M the output value of each node can be either 0 or 1, hence for every pair of cities x, y we can add tour-length terms to the energy function:
Σx Σy≠x Σj dx,y mx,j (my,j+1 + my,j−1),
where dx,y is the distance between cities x and y and the position index j is taken modulo n. On the basis of the above cost term and the two constraint terms, the final energy function E can be given as the weighted sum
E = α Σx (1 − Σj mx,j)² + β Σj (1 − Σx mx,j)² + γ Σx Σy≠x Σj dx,y mx,j (my,j+1 + my,j−1),
with positive weights α, β, γ balancing the constraints against the tour length.
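A small sketch (my addition) of this energy for a candidate matrix M; the weights alpha, beta, gamma are illustrative placeholders, not values from the notes:

```python
import numpy as np

def tsp_energy(M, D, alpha=1.0, beta=1.0, gamma=1.0):
    """Hopfield-style TSP energy: two constraint penalties plus tour length.
    M[x, j] = 1 if city x occupies tour position j; D[x, y] is the distance."""
    n = M.shape[0]
    row_penalty = np.sum((1.0 - M.sum(axis=1)) ** 2)   # one position per city
    col_penalty = np.sum((1.0 - M.sum(axis=0)) ** 2)   # one city per position
    length = 0.0
    for x in range(n):
        for y in range(n):
            if x != y:
                for j in range(n):
                    length += D[x, y] * M[x, j] * (M[y, (j + 1) % n] +
                                                   M[y, (j - 1) % n])
    return alpha * row_penalty + beta * col_penalty + gamma * length

# A valid tour A -> B -> C -> D of 4 cities: both penalties vanish.
M = np.eye(4)
D = np.array([[0, 1, 2, 1], [1, 0, 1, 2], [2, 1, 0, 1], [1, 2, 1, 0]], float)
print(tsp_energy(M, D))   # only the tour-length term remains
```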
Mathematical Concept
Suppose we have a function f(x) and we are trying to find its minimum. These are the steps to find the minimum of f(x).
First, give some initial value x0 for x.
Now take the gradient ∇f of the function, with the intuition that the gradient gives the slope of the curve at x and its direction points toward the increase of the function, in order to find the best direction to minimize it.
Now change x as follows −
xn+1 = xn − θ ∇f(xn)
Here, θ > 0 is the training rate (step size) that forces the algorithm to take small jumps.
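A minimal sketch of these steps (my addition), using the example function f(x) = (x − 3)², whose minimum is at x = 3:

```python
def grad_f(x):
    return 2.0 * (x - 3.0)     # gradient of f(x) = (x - 3)^2

x = 0.0                        # initial value x0
theta = 0.1                    # training rate (step size)
for n in range(100):
    x = x - theta * grad_f(x)  # move against the gradient

print(x)                       # close to 3.0, the minimizer
```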
Estimating Step Size
Actually a wrong step size θ may prevent convergence, hence careful selection of the step size is very important. The following points must be remembered while choosing it:
Do not choose too large a step size, otherwise it will have a negative impact, i.e. the iteration will diverge rather than converge.
Do not choose too small a step size, otherwise it will take a lot of time to converge.
Some options with regard to choosing the step size −
One option is to choose a fixed step size.
Another option is to choose a different step size for every iteration.
Simulated Annealing
The basic concept of Simulated Annealing (SA) is motivated by annealing in solids. In the process of annealing, if we heat a metal above its melting point and cool it down, the structural properties will depend upon the rate of cooling. We can also say that SA simulates the metallurgical process of annealing.
Use in ANN
SA is a stochastic computational method, inspired by the annealing analogy, for approximating the global optimum of a given function. We can use SA to train feed-forward neural networks.
Algorithm
Step 1 − Generate a random solution.
Step 2 − Calculate its cost using some cost function.
Step 3 − Generate a random neighboring solution.
Step 4 − Calculate the new solution cost by the same cost function.
Step 5 − Compare the cost of the new solution with that of the old solution as follows −
If Cost(new solution) < Cost(old solution), then move to the new solution.
Step 6 − Test for the stopping condition, which may be a maximum number of iterations reached or an acceptable solution found.
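A sketch of these six steps (my addition), minimizing a simple one-dimensional cost; the cooling schedule and the probabilistic acceptance of occasional worse moves (the Metropolis rule) are standard SA ingredients beyond the bare comparison in Step 5:

```python
import math
import random

def cost(x):
    return (x - 3.0) ** 2 + math.sin(5.0 * x)   # a bumpy example function

random.seed(0)
x = random.uniform(-10.0, 10.0)        # Step 1: a random solution
c = cost(x)                            # Step 2: its cost
T = 1.0                                # temperature for the acceptance rule
for n in range(10000):                 # Step 6: fixed iteration budget
    x_new = x + random.gauss(0.0, 0.5) # Step 3: a random neighboring solution
    c_new = cost(x_new)                # Step 4: its cost
    # Step 5: always move if cheaper; sometimes accept a worse move
    if c_new < c or random.random() < math.exp(-(c_new - c) / T):
        x, c = x_new, c_new
    T *= 0.999                         # cool down gradually

print(x, c)                            # near the global minimum
```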
Learning Mechanisms
A neural network is able to learn from its environment and to improve its performance through learning. A neural network learns about its environment through an interactive process of adjustments applied to its synaptic weights and bias levels. Ideally, the network becomes more knowledgeable about its environment after each iteration of the learning process.
Learning is a process by which the free parameters of a neural network are adapted through a
process of stimulation by the environment in which the network is embedded. The type of learning is
determined by the manner in which the parameter changes take place. This definition of the learning
process implies the following sequence of events:
1. The neural network is stimulated by an environment.
2. The neural network undergoes changes in its free parameters as a result of this stimulation.
3. The neural network responds in a new way to the environment because of the changes that have
occurred in its internal structure.
A prescribed set of well-defined rules for the solution of a learning problem is called a
learning algorithm. There is no unique learning algorithm for the design of neural networks.
Basically, learning algorithms differ from each other in the way in which the adjustment to a
synaptic weight of a neuron is formulated. Another factor to be considered is the manner in which a
neural network (learning machine) made up of a set of interconnected neurons, relates to its
environment.
ERROR-CORRECTION LEARNING
Consider the simple case of a neuron k constituting the only
computational node in the output layer of a feedforward neural network, as depicted in the following figure.
Neuron k is driven by a signal vector x(n) produced by one or more layers of hidden neurons, which are themselves driven by an input vector (stimulus) applied to the source nodes (i.e., input layer) of the neural network. The argument n denotes discrete time, or more precisely, the time step of an iterative process involved in adjusting the synaptic weights of neuron k. The output signal of neuron k is denoted by yk(n). This output signal, representing the only output of the neural network, is compared to a desired response or target output, denoted by dk(n). Consequently, an error signal, denoted by ek(n), is produced. By definition, we thus have
ek(n) = dk(n) − yk(n).
The error signal actuates a control mechanism, the purpose of which is to apply a sequence of
corrective adjustments to the synaptic weights of neuron k. The corrective adjustments are designed
to make the output signal come closer to the desired response in a step-by-step manner.
This objective is achieved by minimizing a cost function or index of performance, E(n), defined in terms of the error signal as
E(n) = (1/2) ek²(n).
9
That is, E(n) is the instantaneous value of the error energy. The step-by-step adjustments to the synaptic weights of neuron k are continued until the system reaches a steady state (i.e., the synaptic weights are essentially stabilized). At that point the learning process is terminated.
The learning process described here is referred to as error-correction learning. In particular, minimization of the cost function E(n) leads to a learning rule commonly referred to as the delta rule or Widrow-Hoff rule.
Let wkj(n) denote the value of synaptic weight wkj of neuron k excited by element xj(n) of the signal vector x(n) at time step n. According to the delta rule, the adjustment Δwkj(n) applied to the synaptic weight wkj at time step n is defined by
Δwkj(n) = η ek(n) xj(n),
where η is a positive constant that determines the rate of learning as we proceed from one step in the learning process to another. It is therefore natural that we refer to η as the learning-rate parameter. In other words, the delta rule may be stated as:
The adjustment made to a synaptic weight of a neuron is proportional to the product of the error signal and the input signal of the synapse in question.
Having computed the synaptic adjustment Δwkj(n), the updated value of synaptic weight wkj is determined by
wkj(n + 1) = wkj(n) + Δwkj(n).
In effect, wkj(n) and wkj(n + 1) may be viewed as the old and new values of synaptic weight wkj, respectively. In computational terms we may also write
wkj(n) = z⁻¹[wkj(n + 1)],
where z⁻¹ is the unit-delay operator.
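A sketch of the delta rule in action (my addition): a single linear neuron whose desired response comes from hypothetical target weights w_true, introduced only for this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5])   # hypothetical target weights
w = np.zeros(3)                       # synaptic weights w_kj of neuron k
eta = 0.05                            # learning-rate parameter

for n in range(2000):
    x = rng.normal(size=3)            # signal vector x(n)
    d = w_true @ x                    # desired response d_k(n)
    y = w @ x                         # output signal y_k(n)
    e = d - y                         # error signal e_k(n) = d_k(n) - y_k(n)
    w = w + eta * e * x               # delta rule: w(n+1) = w(n) + eta e(n) x(n)

print(w)                              # approaches w_true as E(n) is driven down
```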
MEMORY-BASED LEARNING
In memory-based learning, all (or most) of the past experiences are explicitly stored in a large memory of correctly classified input-output examples {(xi, di)}, i = 1, ..., N, where xi denotes an input vector and di denotes the corresponding desired response. Without loss of generality, we have restricted the desired response to be a scalar. For example, in a binary pattern classification problem there are two classes/hypotheses, denoted by C1 and C2, to be considered. In this example, the desired response takes the value 0 (or −1) for class C1 and the value 1 for class C2. When classification of a test vector xtest (not seen before) is required, the algorithm responds by retrieving and analyzing the training data in a "local neighborhood" of xtest. In the simplest case, the nearest neighbor rule, the vector x′ ∈ {x1, ..., xN} is the nearest neighbor of xtest if
min over i of d(xi, xtest) = d(x′, xtest),
where d(xi, xtest) is the Euclidean distance between the vectors xi and xtest. The class associated with the minimum distance, that is, with the vector x′, is reported as the classification of xtest. This rule is independent of the underlying distribution responsible for generating the training examples.
Cover and Hart (1967) have formally studied the nearest neighbor rule as a tool for pattern
classification. The analysis presented therein is based on two assumptions:
• The classified examples are independently and identically distributed (iid), according to the
joint probability distribution of the example (x, d).
• The sample size N is infinitely large.
Under these two assumptions, it is shown that the probability of classification error incurred by the
nearest neighbor rule is bounded above by twice the Bayes probability of error, that is, the minimum
probability of error over all decision rules.
A variant of the nearest neighbor classifier is the k-nearest neighbor classifier, which proceeds as
follows:
• Identify the k classified patterns that lie nearest to the test vector xtest, for some integer k.
• Assign xtest to the class (hypothesis) that is most frequently represented in the k nearest neighbors to xtest.
Thus the k-nearest neighbor classifier acts like an averaging device.
In particular, it discriminates against a single outlier, as illustrated in the following figure for k = 3.
An outlier is an observation that is improbably large for a nominal model of interest.
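A compact sketch of the k-nearest neighbor rule (my addition), with a toy memory of classified examples containing one outlier:

```python
import numpy as np
from collections import Counter

def knn_classify(x_test, X, d, k):
    """Vote among the k stored examples nearest (in Euclidean distance) to x_test."""
    dist = np.linalg.norm(X - x_test, axis=1)   # d(x_i, x_test) for every i
    nearest = np.argsort(dist)[:k]              # indices of the k nearest
    return Counter(d[nearest]).most_common(1)[0][0]

# Class 0 clusters near the origin; class 1 near (3, 3), plus one
# class-1 outlier sitting inside the class-0 region.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, -0.2],
              [3.0, 3.0], [3.1, 2.9], [0.6, 0.2]])
d = np.array([0, 0, 0, 1, 1, 1])

x_test = np.array([0.5, 0.2])
print(knn_classify(x_test, X, d, k=1))  # 1: the single outlier decides
print(knn_classify(x_test, X, d, k=3))  # 0: the vote discriminates against it
```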
HEBBIAN LEARNING
Hebb's learning rule is the oldest and most famous of all learning rules; it is named in honor of the neuropsychologist Donald Hebb (1949), who postulated:
When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.
Hebb proposed this change as a basis of associative learning, which would result in an enduring modification in the activity pattern of a spatially distributed "assembly of nerve cells." This statement is made in a neurobiological context.
We may expand and rephrase it as a two-part rule:
1. If two neurons on either side of a synapse (connection) are activated simultaneously (i.e.,
synchronously). then the strength of that synapse is selectively increased.
2. If two neurons on either side of a synapse are activated asynchronously, then that synapse is
selectively weakened or eliminated.
Such a synapse is called a Hebbian synapse.
A Hebbian synapse is a synapse that uses a time-dependent, highly local, and strongly interactive mechanism to increase synaptic efficiency as a function of the correlation between the presynaptic and postsynaptic activities.
The key mechanisms (properties) that characterize a Hebbian synapse are :
1. Time-dependent mechanism. This mechanism refers to the fact that the modifications in a
Hebbian synapse depend on the exact time of occurrence of the presynaptic and postsynaptic signals.
2. Local mechanism. A synapse is the transmission site where information-bearing signals are in spatiotemporal contiguity. This locally available information is used by a Hebbian synapse to produce a local synaptic modification that is input specific.
3. Interactive mechanism. The occurrence of a change in a Hebbian synapse depends on signals on
both sides of the synapse. That is, a Hebbian form of learning depends on a "true interaction"
between presynaptic and postsynaptic signals in the sense that we cannot make a prediction from
either one of these two activities by itself.
Note also that this dependence or interaction may be deterministic or statistical in nature.
4. Conjunctional or correlational mechanism. One interpretation of Hebb's learning is that the condition for a change in synaptic efficiency is the conjunction of presynaptic and postsynaptic signals. Thus, according to this interpretation, the co-occurrence of presynaptic and postsynaptic signals is sufficient to produce the synaptic modification.
A Hebbian synapse is sometimes referred to as a conjunctional synapse.
In particular, the correlation over time between presynaptic and postsynaptic signals is viewed as
being responsible for a synaptic change. Accordingly, a Hebbian synapse is also referred to as a
correlational synapse.
Mathematical Models of Hebbian Modifications
To formulate Hebbian learning in mathematical terms, consider a synaptic weight wkj of neuron k with presynaptic and postsynaptic signals denoted by xj and yk, respectively. The adjustment applied to the synaptic weight wkj at time step n is expressed in the general form
Δwkj(n) = F(yk(n), xj(n)),
where F(·, ·) is a function of both the postsynaptic and presynaptic signals.
Hebb's hypothesis. The simplest form of Hebbian learning is described by
Δwkj(n) = η yk(n) xj(n),
where η is a positive constant that determines the rate of learning. The above equation clearly emphasizes the correlational nature of a Hebbian synapse. It is sometimes referred to as the activity product rule. The top curve of the following figure shows a graphical representation of the above equation, with the change Δwkj plotted versus the output signal yk. From this representation we see that the repeated application of the input signal xj leads to an increase in yk, and therefore exponential growth that finally drives the synaptic connection into saturation. At that point no information will be stored in the synapse and selectivity is lost.
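A short sketch of the activity product rule (my addition), showing the runaway growth just described:

```python
import numpy as np

x = np.array([1.0, 0.5])    # a fixed presynaptic signal, applied repeatedly
w = np.array([0.1, 0.1])    # initial synaptic weights w_kj
eta = 0.1                   # learning-rate parameter

for n in range(20):
    y = w @ x               # postsynaptic signal y_k(n)
    w = w + eta * y * x     # Hebb: delta w_kj(n) = eta y_k(n) x_j(n)

print(w)  # the weights keep growing exponentially; nothing in the plain rule
          # bounds them, which is why selectivity is eventually lost
```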
Covariance hypothesis
One way of overcoming the limitation of Hebb's hypothesis is to use the covariance hypothesis. In
this hypothesis, the presynaptic and postsynaptic signals are replaced by the departure of presynaptic
and postsynaptic signals from their respective average values over a certain time interval.
Let x̄ and ȳ denote the time-averaged values of the presynaptic signal xj and the postsynaptic signal yk, respectively. According to the covariance hypothesis, the adjustment applied to the synaptic weight wkj is defined by
Δwkj(n) = η (xj(n) − x̄)(yk(n) − ȳ),
where η is the learning-rate parameter. The average values x̄ and ȳ constitute presynaptic and postsynaptic thresholds, which determine the sign of the synaptic modification.
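A matching sketch for the covariance rule (my addition); the presynaptic and postsynaptic averages are hypothetical fixed thresholds chosen for this illustration:

```python
import random

random.seed(1)
eta = 0.05
x_bar, y_bar = 0.5, 0.5      # time-averaged pre- and postsynaptic signals
w = 0.0

for n in range(1000):
    x = random.random()               # presynaptic signal x_j(n) in [0, 1)
    y = x + 0.1 * random.gauss(0, 1)  # positively correlated postsynaptic signal
    w += eta * (x - x_bar) * (y - y_bar)   # potentiation when both exceed
                                           # their thresholds, depression
                                           # when only one does

print(w)   # positive, since x and y are positively correlated
```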
In particular, the covariance hypothesis allows for the following:
• Convergence to a nontrivial state, which is reached when xj = x̄ or yk = ȳ.
• Prediction of both synaptic potentiation (i.e., increase in synaptic strength) and synaptic depression (i.e., decrease in synaptic strength).
The figure illustrates the difference between Hebb's hypothesis and the covariance hypothesis. In both cases the dependence of Δwkj on yk is linear; the intercept with the yk-axis in Hebb's hypothesis is at the origin, whereas in the covariance hypothesis it is at yk = ȳ.
We make the following important observations from the above equation:
1. Synaptic weight wkj is enhanced if there are sufficient levels of presynaptic and postsynaptic activities, that is, the conditions xj > x̄ and yk > ȳ are both satisfied.
2. Synaptic weight wkj is depressed if there is either
• a presynaptic activation (i.e., xj > x̄) in the absence of sufficient postsynaptic activation (i.e., yk < ȳ), or
• a postsynaptic activation (i.e., yk > ȳ) in the absence of sufficient presynaptic activation (i.e., xj < x̄).
This behavior may be regarded as a form of temporal competition between the incoming patterns.
COMPETITIVE LEARNING
In competitive learning, the output neurons of a neural network compete among themselves to
become active (fired). Whereas in a neural network based on Hebbian learning several output
neurons may be active simultaneously, in competitive learning only a single output neuron is active
at any one time.
There are three basic elements to a competitive learning rule (Rumelhart and Zipser, 1985):
• A set of neurons that are all the same except for some randomly distributed synaptic weights, and
which therefore respond differently to a given set of input patterns.
• A limit imposed on the "strength" of each neuron.
• A mechanism that permits the neurons to compete for the right to respond to a given subset of
inputs, such that only one output neuron, or only one neuron per group, is active at a time. The
neuron that wins the competition is called a winner-takes-all neuron.
In the simplest form of competitive learning, the neural network has a single layer of output neurons,
each of which is fully connected to the input nodes. The network may include feedback connections
among the neurons, as indicated in the following figure. In the network architecture described here, the feedback connections perform lateral inhibition, with each neuron tending to inhibit the neurons to which it is laterally connected. In contrast, the feedforward synaptic connections in the network are all excitatory.
For a neuron k to be the winning neuron, its induced local field vk for a specified input pattern x must be the largest among all the neurons in the network. The output signal yk of winning neuron k is set equal to one; the output signals of all the neurons that lose the competition are set equal to zero. We thus write
yk = 1 if vk > vj for all j, j ≠ k, and yk = 0 otherwise,
where the induced local field vk represents the combined action of all the forward and feedback inputs to neuron k.
Let wkj denote the synaptic weight connecting input node j to neuron k. Suppose that each neuron is allotted a fixed amount of synaptic weight, which is distributed among its input nodes; that is,
Σj wkj = 1 for all k.
A neuron then learns by shifting synaptic weights from its inactive to active input nodes. If a neuron
does not respond to a particular input pattern, no learning takes place in that neuron. If a particular
neuron wins the competition, each input node of that neuron relinquishes some proportion of its
synaptic weight, and the weight relinquished is then distributed equally among the active input
nodes. According to the standard competitive learning rule, the change Δwkj applied to synaptic weight wkj is defined by
Δwkj = η (xj − wkj) if neuron k wins the competition, and Δwkj = 0 if neuron k loses,
where η is the learning-rate parameter. This rule has the overall effect of moving the synaptic weight vector wk of winning neuron k toward the input pattern x.
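A sketch of winner-takes-all learning on three clusters (my addition); the winner is chosen here as the neuron whose weight vector lies closest to the input, which for normalized vectors is equivalent to the largest induced local field:

```python
import numpy as np

rng = np.random.default_rng(2)
# Three natural groupings (clusters) of input patterns on the plane.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.normal(size=(50, 2)) for c in centers])
rng.shuffle(X)

W = X[:3].copy()   # one weight vector per output neuron, seeded with patterns
eta = 0.1
for x in X:
    k = np.argmin(np.linalg.norm(W - x, axis=1))  # the winning neuron
    W[k] += eta * (x - W[k])     # only the winner moves toward the pattern

print(W)   # each row settles near the center of gravity of a cluster it wins
```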
We use the geometric analogy depicted in the above figure to illustrate the essence of competitive learning. It is assumed that each input pattern (vector) x has some constant Euclidean length, so that we may view it as a point on an N-dimensional unit sphere, where N is the number of input nodes. N also represents the dimension of each synaptic weight vector wk. It is further assumed that all neurons in the network are constrained to have the same Euclidean length, as shown by
Σj w²kj = 1 for all k.
When the synaptic weights are properly scaled they form a set of vectors that fall on the same N-
dimensional unit sphere.
In Figure (a) we show three natural groupings (clusters) of the stimulus patterns, represented by dots. This figure also includes a possible initial state of the network that may exist before learning.
Figure (b) shows a typical final state of the network that results from the use of competitive learning.
In particular, each output neuron has discovered a cluster of input patterns by moving its synaptic
weight vector to the center of gravity of the discovered cluster.
This figure illustrates the ability of a neural network to perform clustering through competitive learning. However, for this function to be performed in a "stable" fashion, the input patterns must fall into sufficiently distinct groupings to begin with. Otherwise the network may be unstable because it will no longer respond to a given input pattern with the same output neuron.