Chapter 5 PDF
BACKGROUND
The pattern associator described in the previous chapter has been known
since the late 1950s, when variants of what we have called the delta rule
were first proposed. In one version, in which output units were linear
threshold units, it was known as the perceptron (cf. Rosenblatt, 1959,
1962). In another version, in which the output units were purely linear, it
was known as the LMS or least mean square associator (cf. Widrow & Hoff,
1960). Important theorems were proved about both of these versions. In
the case of the perceptron, there was the so-called perceptron convergence
theorem. In this theorem, the major paradigm is pattern classification.
There is a set of binary input vectors, each of which can be said to belong
to one of two classes. The system is to learn a set of connection strengths
and a threshold value so that it can correctly classify each of the input vec-
tors. The basic structure of the perceptron is illustrated in Figure 1. The
perceptron learning procedure is the following: An input vector is
presented to the system (i.e., the input units are given an activation of 1 if
the corresponding value of the input vector is 1 and are given 0 otherwise).
The net input to the output unit is computed: net = Σ_i w_i i_i. If net is
greater than the threshold θ, the unit is turned on; otherwise it is turned
off. Then the response is compared with the actual category of the input
vector. If the vector was correctly categorized, then no change is made to
the weights. If, however, the output unit turns on when the input vector is
in category 0, then the weights and thresholds are modified as follows: The
threshold is incremented by 1 (to make it less likely that the output unit
will come on if the same vector were presented again). If input ii is 0, no
change is made in the weight Wi (that weight could not have contributed to
its having turned on). However, if ii = 1, then Wi is decremented by 1. In
this way, the system will not be as likely to turn on the next time this input
vector is presented. On the other hand, if the output unit does not come
on when it is supposed to, the opposite changes are made. That is, the
threshold is decremented, and those weights connecting the output units to
input units that are on are incremented.
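The procedure just described can be sketched directly. The following is a minimal Python sketch (the helper names are ours, not from the text); it applies the unit-step corrections to the threshold and to the weights on active inputs, exactly as above, and is run here on AND, which is linearly separable:

```python
# Sketch of the perceptron learning procedure described above.
# The unit turns on when net > theta; on an error, theta and the
# weights on active inputs are moved by 1 in the correcting direction.

def perceptron_output(w, theta, x):
    net = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if net > theta else 0

def perceptron_train(patterns, n_inputs, max_epochs=100):
    w = [0.0] * n_inputs
    theta = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in patterns:
            out = perceptron_output(w, theta, x)
            if out == target:
                continue
            errors += 1
            if out == 1 and target == 0:
                # Unit came on when it should not have: raise the
                # threshold, lower the weights on active inputs.
                theta += 1
                for i, xi in enumerate(x):
                    if xi == 1:
                        w[i] -= 1
            else:
                # Unit failed to come on: the opposite changes.
                theta -= 1
                for i, xi in enumerate(x):
                    if xi == 1:
                        w[i] += 1
        if errors == 0:
            break
    return w, theta

# AND is linearly separable, so the procedure converges.
and_patterns = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, theta = perceptron_train(and_patterns, 2)
```

Running the same loop on the XOR patterns of Table 1 never reaches zero errors, which is the limitation taken up next.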
Mathematically, this amounts to the following: The output, 0, is given
by
    o = 1   if net = Σ_i w_i i_i > θ
    o = 0   otherwise.
FIGURE 1. The one-layer perceptron analyzed by Minsky and Papert. (From Perceptrons by
M. L. Minsky and S. Papert, 1969, Cambridge, MA: MIT Press. Copyright 1969 by MIT
Press. Reprinted by permission.)
5. THE GENERALIZED DELTA RULE 123
The remarkable thing about this procedure is that, in spite of its simpli-
city, such a system is guaranteed to find a set of weights that correctly clas-
sifies the input vectors if such a set of weights exists. Moreover, since the
learning procedure can be applied independently to each of a set of output
units, the perceptron learning procedure will find the appropriate mapping
from a set of input vectors onto a set of output vectors-if such a mapping
exists. Unfortunately, as indicated in Chapter 4, such a mapping does not
always exist, and this is the major problem for the perceptron learning pro-
cedure.
In their famous book Perceptrons, Minsky and Papert (1969) document
the limitations of the perceptron. The simplest example of a function that
cannot be computed by the perceptron is the exclusive-or (XOR), illus-
trated in Table 1. It should be clear enough why this problem is impossi-
ble. In order for a perceptron to solve this problem, the following four in-
equalities must be satisfied:
    0×w1 + 0×w2 < θ  =>  0 < θ
    0×w1 + 1×w2 > θ  =>  w2 > θ
    1×w1 + 0×w2 > θ  =>  w1 > θ
    1×w1 + 1×w2 < θ  =>  w1 + w2 < θ
Obviously, we can't have both w1 and w2 greater than θ while their sum,
w1 + w2, is less than θ. There is a simple geometric interpretation of the
class of problems that can be solved by a perceptron: It is the class of
TABLE 1

Input    Output
0 0        0
0 1        1
1 0        1
1 1        0
FIGURE 2. A: A simple network that can solve the two-dimensional AND and OR functions
but cannot solve the XOR function. B: Geometric representations of the three problems. See
text for details.
TABLE 2

Input      Output
0 0 0        0
0 1 0        1
1 0 0        1
1 1 1        0
functions in which the patterns with output 1 can be separated from the
patterns with output 0 by the hyperplane Σ_i w_i i_i = θ. All functions
for which there exists such a plane
are called linearly separable. Now consider the function in Table 2 and illus-
trated in Figure 3. This is a three-dimensional problem in which the first
two dimensions are identical to the XOR and the third dimension is the
AND of the first two dimensions. (That is, the third dimension is 1 when-
ever both of the first two dimensions are 1, otherwise it is 0). Figure 3
shows how this problem can be represented in three dimensions. The fig-
ure also shows how the addition of the third dimension allows a plane to
separate the patterns classified in category 0 from those in category 1.
Thus, we see that the XOR is not solvable in two dimensions, but if we add
the appropriate third dimension, that is, the appropriate new feature, the
problem is solvable. Moreover, as indicated in Figure 4, if you allow a
multilayered perceptron, it is possible to take the original two-dimensional
problem and convert it into the appropriate three-dimensional problem so it
can be solved. Indeed, as Minsky and Papert knew, it is always possible to
convert any unsolvable problem into a solvable one in a multilayer percep-
tron. In the more general case of multilayer networks, we categorize units
FIGURE 3. The three-dimensional solution of the XOR problem.
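The claim of Table 2 is easy to check numerically: once the third input carries the AND of the first two, a single linear threshold unit separates the classes. The weights and threshold below are one hand-picked solution (our assumption, not given in the text):

```python
# Check that the three-dimensional problem of Table 2 is linearly
# separable. One separating plane (hand-picked): w = (1, 1, -2),
# theta = 0.5. The third input is the AND of the first two.

def threshold_unit(w, theta, x):
    net = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if net > theta else 0

w, theta = (1, 1, -2), 0.5
table2 = [((0, 0, 0), 0), ((0, 1, 0), 1), ((1, 0, 0), 1), ((1, 1, 1), 0)]
results = [threshold_unit(w, theta, x) for x, _ in table2]
```

On the pattern (1, 1, 1) the net input is 1 + 1 - 2 = 0, below threshold, so the unit correctly stays off; on (0, 1, 0) and (1, 0, 0) the net input is 1, above threshold.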
FIGURE 4. A multilayer network that converts the two-dimensional XOR problem into a
three-dimensional linearly separable problem.
into three classes: input units, which receive the input patterns directly; out-
put units, which have associated teaching or target inputs; and hidden units,
which neither receive inputs directly nor are given direct feedback. This is
the stock of units from which new features and new internal representa-
tions can be created. The problem is to know which new features are
required to solve the problem at hand. In short, we must be able to learn
intermediate layers. The question is, how? The original perceptron learn-
ing procedure does not apply to more than one layer. Minsky and Papert
believed that no such general procedure could be found. To examine how
such a procedure can be developed it is useful to consider the other major
one-layer learning system of the 1950s and early 1960s, namely, the least-
mean-square (LMS) learning procedure of Widrow and Hoff (1960).
The LMS procedure makes use of the delta rule for adjusting connection
strengths; the perceptron convergence procedure is very similar, differing
only in that linear threshold units are used instead of units with
continuous-valued outputs. We use the term LMS procedure here to stress
the fact that this family of learning rules may be viewed as minimizing a
measure of the error in their performance.
The LMS procedure cannot be directly applied when the output units are
linear threshold units (like the perceptron). It has been applied most often
with purely linear output units. In this case the activation of an output
unit, o_i, is simply given by o_i = Σ_j w_ij i_j. The error function, as indicated
by the name least-mean-square, is the summed squared error. That is, the
total error, E, is defined to be

    E = Σ_p E_p = Σ_p Σ_i (t_pi - o_pi)²
where the index p ranges over the set of input patterns, i ranges over the
set of output units, and E_p represents the error on pattern p. The variable
t_pi is the desired output, or target, for the ith output unit when the pth pat-
tern has been presented, and o_pi is the actual output of the ith output unit
when pattern p has been presented. The object is to find a set of weights
that minimizes this function. It is useful to consider how the error varies
as a function of any given weight in the system. Figure 5 illustrates the
nature of this dependence. In the case of the simple single-layered linear
system, we always get a smooth error function such as the one shown in
the figure. The LMS procedure finds the values of all of the weights that
minimize this function using a method called gradient descent. That is, after
each pattern has been presented, the error on that pattern is computed and
each weight is moved "down" the error gradient toward its minimum value
for that pattern. Since we cannot map out the entire error function on each
pattern presentation, we must find a simple procedure for determining, for
each weight, how much to increase or decrease each weight. The idea of
gradient descent is to make a change in the weight proportional to the
FIGURE 5. Typical curve showing the relationship between overall error and changes in a
single weight in the network.
negative of the derivative of the error with respect to each weight.1 That is,

    Δw_ij = -k ∂E_p/∂w_ij

where k is the constant of proportionality. Carrying out the differentiation
for the linear unit yields

    Δw_ij = ε δ_pi i_pj

where ε = 2k and δ_pi = t_pi - o_pi is the difference between the target for unit
i on pattern p and the actual output produced by the network. This is
exactly the delta learning rule described in Equation 15 from Chapter 4. It
should also be noted that this rule is essentially the same as that for the
perceptron. In the perceptron the learning rate was 1 (i.e., we made unit
changes in the weights) and the units were binary, but the rule itself is the
same: the weights are changed proportionally to the difference between tar-
get and output times the input.
If we change each weight according to this rule, each weight is moved
toward its own minimum and we think of the system as moving downhill in
weight-space until it reaches its minimum error value. When all of the
weights have reached their minimum points, the system has reached equili-
brium. If the system is able to solve the problem entirely, the system will
reach zero error and the weights will no longer be modified. On the other
hand, if the network is unable to get the problem exactly right, it will find a
set of weights that produces as small an error as possible.
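The behavior just described can be seen in a few lines of Python. This is a minimal sketch of the LMS procedure (per-pattern gradient descent, small fixed learning rate) applied to the two-weight linear OR network taken up next; since OR cannot be computed exactly by such a network, the weights settle at the least-squares compromise, which works out analytically to w1 = w2 = 2/3:

```python
# Per-pattern gradient descent (the LMS procedure) for a linear unit
# with two weights and no bias term, trained toward OR. OR is not
# exactly realizable here, so the weights converge to the
# least-squares compromise w1 = w2 = 2/3.

patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = [0.0, 0.0]
epsilon = 0.02            # learning rate

for epoch in range(5000):
    for x, t in patterns:
        o = w[0] * x[0] + w[1] * x[1]   # linear output
        delta = t - o                   # delta rule error term
        for i in range(2):
            w[i] += epsilon * delta * x[i]
```

With a fixed learning rate and per-pattern updates the weights do not sit exactly at the minimum but cycle within a small neighborhood of it, which shrinks with the learning rate.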
In order to get a fuller understanding of this process it is useful to care-
fully consider the entire error space rather than a one-dimensional slice. In
general this is very difficult to do because of the difficulty of depicting and
visualizing high-dimensional spaces. However, we can usefully go from
one to two dimensions by considering a network with exactly two weights.
Consider, as an example, a linear network with two input units and one
output unit with the task of finding a set of weights that comes as close as
possible to performing the function OR. Assume the network has just two
weights and no bias terms like the network in Figure 2A. We can then give
some idea of the shape of the space by making a contour map of the error
surface.
Figure 6 shows the contour map. In this case the space is shaped like a
kind of oblong bowl. It is relatively flat on the bottom and rises sharply on
the sides. Each equal error contour is elliptically shaped. The arrows
1 It should be clear from Figure 5 why we want the negation of the derivative. If the weight is
above the minimum value, the slope at that point is positive and we want to decrease the
weight; thus when the slope is positive we add a negative amount to the weight. On the other
hand, if the weight is too small, the error curve has a negative slope at that point, so we want
to add a positive amount to the weight.
FIGURE 6. A contour map illustrating the error surface with respect to the two weights w1
and w2 for the OR problem in a linear network with two weights and no bias term. Note that
the OR problem cannot be solved perfectly in a linear system. The minimum sum squared
error over the four input-output pairs occurs when w1 = w2 = 0.67. (The input-output pairs are
00→0, 01→1, 10→1, and 11→1.)
around the ellipses represent the derivatives of the two weights at those
points and thus represent the directions and magnitudes of weight changes
at each point on the error surface. The changes are relatively large where
the sides of the bowl are relatively steep and become smaller and smaller as
we move into the central minimum. The long, curved arrow represents a
typical trajectory in weight-space from a starting point far from the
minimum down to the actual minimum in the space. The weights trace a
curved trajectory following the arrows and crossing the contour lines at
right angles.
The figure illustrates an important aspect of gradient descent learning.
This is the fact that gradient descent involves making larger changes to
parameters that will have the biggest effect on the measure being
minimized. In this case, the LMS procedure makes changes to the weights
proportional to the effect they will have on the summed squared error. The
resulting total change to the weights is a vector that points in the direction
in which the error drops most steeply.
2 Note that the symbol η was used for the learning rate parameter in PDP:8. We use ε here
for consistency with other chapters in this volume.
3 In the networks we will be considering in this chapter, the output of a unit is equal to its
activation. We use the symbol a to designate this variable. This symbol can be used for any
unit, be it an input unit, an output unit, or a hidden unit.
For units with a differentiable activation function, the error signal for an
output unit is

    δ_pi = (t_pi - a_pi) f'_i(net_pi)

where net_pi = Σ_j w_ij a_pj + bias_i and f'_i(net_pi) is the derivative of the activa-
tion function with respect to a change in the net input to the unit. Note
that bias_i is a bias that has a similar function to the threshold, θ, in the per-
ceptron.4
The δ term for hidden units for which there is no specified target is
determined recursively in terms of the δ terms of the units to which it
directly connects and the weights of those connections. That is,

    δ_pi = f'_i(net_pi) Σ_k δ_pk w_ki .

The activation function used in bp is the logistic function,

    a_pi = 1 / (1 + e^(-net_pi)) .
4 Note that the values of the bias can be learned just like any other weights. We simply ima-
gine that the bias is the weight from a unit that is always on.
In order to apply our learning rule, we need to know the derivative of this
function with respect to its total input, net_pi. It is easy to show that this
derivative is given by

    da_pi / dnet_pi = a_pi (1 - a_pi) .
Thus, for the logistic activation function, the error signal, δ_pi, for an output
unit is given by

    δ_pi = (t_pi - a_pi) a_pi (1 - a_pi) .
It should be noted that the derivative, a_pi(1 - a_pi), reaches its maximum
at a_pi = 0.5 and, since 0 ≤ a_pi ≤ 1, approaches its minimum as a_pi approaches
0 or 1. Since the amount of change in a given weight is proportional to this
derivative, weights will be changed most for those units that are near their
midrange and, in some sense, not yet committed to being either on or off.
This feature, we believe, contributes to the stability of the learning of the
system.
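The derivative identity and its midrange peak are easy to confirm numerically. A short Python check (a central finite difference stands in for the analytic derivative):

```python
import math

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

# Numerical check that da/dnet = a(1 - a) for the logistic function.
net = 0.8
h = 1e-6
numeric = (logistic(net + h) - logistic(net - h)) / (2 * h)
a = logistic(net)
analytic = a * (1 - a)

# The derivative a(1 - a) is largest at a = 0.5 (net = 0) and shrinks
# toward 0 as a approaches 0 or 1, so midrange, "uncommitted" units
# produce the largest weight changes.
peak = logistic(0.0) * (1 - logistic(0.0))      # = 0.25
extreme = logistic(6.0) * (1 - logistic(6.0))   # near 0
```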
Local minima. Like the simpler LMS learning paradigm, back propaga-
tion is a gradient descent procedure. Essentially, the system will follow the
contour of the error surface-always moving downhill in the direction of
steepest descent. This is no particular problem for the single-layer linear
model. These systems always have bowl-shaped error surfaces. However,
in multilayer networks there is the possibility of rather more complex sur-
faces with many minima. Some of the minima constitute solutions to the
problems in which the system reaches an errorless state. All such minima
are global minima. However, it is possible for some of the minima to be
deeper than others. In this case, a gradient descent method may not find
the best possible solution to the problem at hand. Part of the study of back
propagation networks and learning involves a study of how frequently and
under what conditions local minima occur. In problems with many hidden
units, local minima seem quite rare. However with few hidden units, local
minima can be more common. Figure 7 shows a very simple network in
which we can demonstrate these phenomena. The network involves a sin-
gle input unit, a single hidden unit, and a single output unit (a 1:1:1 net-
work, for short). The problem is to copy the value of the input unit to the
output unit. There are two basic ways in which the network can solve the
problem. It can have positive biases on the hidden unit and on the output
unit and large negative connections from the input unit to the hidden unit
and from the hidden unit to the output unit, or it can have large negative
biases on the two units and large positive weights from the input unit to the
hidden unit and from the hidden unit to the output unit. These solutions
are illustrated in Table 3. In the first case, the solution works as follows:
Imagine first that the input unit takes on a value of O. In this case, there
will be no activation from the input unit to the hidden unit, but the bias on
the hidden unit will turn it on. Then the hidden unit has a strong negative
connection to the output unit so it will be turned off, as required in this
case. Now suppose that the input unit is set to 1. In this case, the strong
inhibitory connection from the input to the hidden unit will turn the hidden
unit off. Thus, no activation will flow from the hidden unit to the output
unit. In this case, the positive bias on the output unit will turn it on and
the problem will be solved. Now consider the second class of solutions.
For this case, the connections among units are positive and the biases are
negative. When the input unit is off, it cannot turn on the hidden unit.
Since the hidden unit has a negative bias, it too will be off. The output
unit, then, will not receive any input from the hidden unit and since its bias
is negative, it too will turn off as required for zero input. Finally, if the
input unit is turned on, the strong positive connection from the input unit
to the hidden unit will turn on the hidden unit. This in turn will turn on
the output unit as required. Thus we have, it appears, two symmetric solu-
tions to the problem. Depending on the random starting state, the system
will end up in one or the other of these global minima.
Interestingly, it is a simple matter to convert this problem to one with
one local and one global minimum simply by setting the biases to 0 and not
allowing them to change. In this case, the minima correspond to roughly
the same two solutions as before. In one case, which is the global
minimum as it turns out, both connections are large and negative. These
minima are also illustrated in Table 3. Consider first what happens with
TABLE 3

WEIGHTS AND BIASES OF THE SOLUTIONS FOR A 1:1:1 NETWORK

Minima                          w(in→hid)   w(hid→out)   bias_hid   bias_out
Global minimum 1                   -8           -8          +4         +4
Global minimum 2                   +8           +8          -4         -4
Global minimum (biases at 0)       -8           -8           0          0
Local minimum (biases at 0)        +8         +0.73          0          0
both weights negative. When the input unit is turned off, the hidden unit
receives no input. Since the bias is 0, the hidden unit has a net input of O.
A net input of 0 causes the hidden unit to take on a value of 0.5. The 0.5
input from the hidden unit, coupled with a large negative connection from
the hidden unit to the output unit, is sufficient to turn off the output unit
as required. On the other hand, when the input unit is turned on, it turns
off the hidden unit. When the hidden unit is off, the output unit receives
a net input of 0 and takes on a value of 0.5 rather than the desired value of
1.0. Thus there is an error of 0.5 and a squared error of 0.25. This, it
turns out, is the best the system can do with zero biases. Now consider
what happens if both connections are positive. When the input unit is off,
the hidden unit takes on a value of 0.5. Since the output is intended to be
0 in this case, there is pressure for the weight from the hidden unit to the
output unit to be small. On the other hand, when the input unit is on, it
turns on the hidden unit. Since the output unit is to be on in this case,
there is pressure for the weight to be large so it can turn on the output
unit. In fact, these two pressures balance off and the system finds a
compromise value of about 0.73. This compromise yields a summed
squared error of about 0.45, a local minimum.
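The two error values just quoted can be reproduced directly. The Python sketch below evaluates the zero-bias 1:1:1 network at the two solutions; "large" weights are taken as magnitude 8 (matching Table 3; the exact magnitude is not critical):

```python
import math

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

def error_111(w_ih, w_ho):
    # 1:1:1 network of logistic units, biases fixed at 0.
    # Task: copy the input (0 -> 0, 1 -> 1). Summed squared error.
    total = 0.0
    for x, t in [(0.0, 0.0), (1.0, 1.0)]:
        h = logistic(w_ih * x)
        o = logistic(w_ho * h)
        total += (t - o) ** 2
    return total

# Global minimum: both connections large and negative.
e_global = error_111(-8.0, -8.0)   # about 0.25
# Local minimum: positive connections, hidden-to-output near 0.73.
e_local = error_111(8.0, 0.73)     # about 0.45
```

With both weights negative the only residual error comes from the input-1 case, where the output sits at 0.5; with positive weights the 0.73 compromise leaves errors on both patterns, for the larger total of about 0.45.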
Usually, it is difficult to see why a network has been caught in a local
minimum. However, in this very simple case, we have only two weights
and can produce a contour map for the error space. The map is shown in
Figure 8. It is perhaps difficult to visualize, but the map roughly shows a
saddle shape. It is high on the upper left and lower right and slopes down
toward the center. It then slopes off on each side toward the two minima.
If the initial values of the weights begin below the antidiagonal (that is,
below the line WI + W2 = 0), the system will follow the contours down and
to the left into the minimum in which both weights are negative. If, how-
ever, the system begins above the antidiagonal, the system will follow the
slope into the upper right quadrant in which both weights are positive.
Eventually, the system moves into a gently sloping valley in which the
weight from the hidden unit to the output unit is almost constant at about
0.73 and the weight from the input unit to the hidden unit is slowly
increasing. It is slowly being sucked into a local minimum. The directed
arrows superimposed on the map illustrate the lines of force and illustrate
these dynamics. The long arrows represent two trajectories through weight-
space for two different starting points.
It is rare that we can create such a simple illustration of the dynamics of
weight-spaces and how local minima come about. However, it is likely that
many of our spaces contain these kinds of saddle-shaped error surfaces.
Sometimes, as when the biases are free to move, there is a global minimum
on either side of the saddle point. In this case, it doesn't matter which way
you move off. At other times, such as in Figure 8, the two sides are of dif-
ferent depths. There is no way the system can sense the depth of a
minimum from the edge, and once it has slipped in there is no way out.
Importantly, however, we find that high-dimensional spaces (with many
FIGURE 8. A contour map for the 1:1:1 identity problem with biases fixed at 0. The map
shows a local minimum in the positive quadrant and a global minimum in the lower left-hand
negative quadrant. Overall the error surface is saddle-shaped. See the text for further expla-
nation.
weights) have relatively few local minima. It seems that the system can
always, as it were, slip along another dimension to find a path out of most
local minima.
Symmetry breaking. Our learning procedure has one more problem that
can be readily overcome and this is the problem of symmetry breaking. If
all weights start out with equal values and if the solution requires that
unequal weights be developed, the system can never learn. This is because
error is propagated back through the weights in proportion to the values of
the weights. This means that all hidden units connected directly to the out-
put units will get identical error signals, and, since the weight changes
depend on the error signals, the weights from those units to the output
units must always be the same. The system is starting out at a kind of
unstable equilibrium point that keeps the weights equal, but it is higher
than some neighboring points on the error surface, and once it moves away
to one of these points, it will never return. We counteract this problem by
starting the system with small random weights. Under these conditions
symmetry problems of this kind do not arise. This can be seen in Figure 8.
If the system starts at exactly (0,0), there is no pressure for it to move at
all and the system will not learn, but if it starts anywhere off of the antidi-
agonal, it will eventually end up in one minimum or the other.
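The symmetry trap is easy to demonstrate. The following minimal Python sketch (a 2-2-1 logistic network without biases, not the bp program itself) starts both hidden units with identical weights; because they receive identical error signals, they remain bit-for-bit identical no matter how long training runs:

```python
import math

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

def backprop_epoch(w_hidden, w_out, eps=0.5):
    # One epoch of back propagation on XOR for a 2-2-1 logistic
    # network with no bias terms (a minimal sketch).
    patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    for x, t in patterns:
        h = [logistic(sum(w * xi for w, xi in zip(w_hidden[j], x)))
             for j in range(2)]
        o = logistic(sum(w_out[j] * h[j] for j in range(2)))
        delta_o = (t - o) * o * (1 - o)
        for j in range(2):
            # Error reaches each hidden unit through its outgoing weight.
            delta_h = delta_o * w_out[j] * h[j] * (1 - h[j])
            for i in range(2):
                w_hidden[j][i] += eps * delta_h * x[i]
            w_out[j] += eps * delta_o * h[j]

# Equal starting weights: the two hidden units get identical error
# signals and identical updates, so symmetry is never broken.
w_hidden = [[0.3, 0.3], [0.3, 0.3]]
w_out = [0.3, 0.3]
for _ in range(10):
    backprop_epoch(w_hidden, w_out)
```

Replacing the equal initial values with small random ones removes the degeneracy, which is exactly the remedy described above.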
IMPLEMENTATION
The bp program also makes use of a list of pattern pairs, each pair con-
sisting of a name, an input pattern, and a target pattern.
Processing of a single pattern occurs as follows: A pattern pair is chosen,
and the pattern of activation specified by the input pattern is clamped on
the input units; that is, their activations are set to 1 or 0 based on the
values found in the input pattern.
Next, activations are computed. For each noninput unit, the net input to
the unit is computed and then the activation of the unit is set. This occurs
in the order that the units are specified in the network specification, so that
by the time each unit is encountered, the activations of all of the units that
feed into it have already been set. The routine that performs this computa-
tion is
compute_output ()
In this code, note that ninputs, which is the number of input units, is also
the index of the first hidden unit. The arrays first_ weight_to and
last weight to indicate which unit is the first and which is the last to project
to each unit. 5
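The forward sweep can be sketched in Python as follows (the actual bp routine is C and stores its weights differently, as noted; looping over all earlier units here stands in for the first_weight_to/last_weight_to bookkeeping, with absent connections carrying weight 0):

```python
import math

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

def compute_output(activation, weight, bias, ninputs, nunits):
    # Units are visited in index order, so by the time unit i is
    # reached, every unit that feeds into it has its activation set.
    # ninputs is also the index of the first hidden unit.
    for i in range(ninputs, nunits):
        net = bias[i]
        for j in range(i):          # all potential senders precede i
            net += weight[i][j] * activation[j]
        activation[i] = logistic(net)
    return activation
```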
Next, error and delta terms are computed. The error for a unit is
equivalent to the partial derivative of the error with respect to a change in
the activation of the unit. The delta for the unit is the partial derivative of
the error with respect to a change in the net input to the unit.
First, the delta and error terms for all units are set to O. Then, error
terms are calculated for each output unit. For these units, error is the
difference between the target and the obtained activation of the unit.
After the error has been computed for each output unit, we get to the
"heart" of back propagation: the recursive computation of error and delta
terms for hidden units. The program iterates backward over the units,
starting with the last output unit. The first thing it does in each pass
through the loop is set delta for the current unit, which is equal to the error
for the unit times the derivative of the activation function (i.e., the activa-
tion of the unit times one minus its activation). Then, once it has delta for
the current unit, the program passes this back to all units that have
5 Note that for efficiency and other reasons, the weight indexes are not actually implemented
in the form shown here. A description of the actual treatment of weight arrays is given in
Appendix F.
connections coming into the current unit; this is the actual back propaga-
tion process. By the time a particular unit becomes the current unit, all of
the units that it projects to will have already been processed, and all of its
error will have been accumulated, so it is ready to have its delta computed.
The code for this is as follows:
compute_error()
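A Python sketch of the backward computation just described (the actual bp routine is C; here the output units are assumed to be the last noutputs units, and the logistic derivative is used):

```python
def compute_error(activation, target, weight, ninputs, nunits, noutputs):
    # Error for an output unit is (target - activation); hidden units
    # accumulate error from the units they project to. Iterating
    # backward guarantees a unit's error is complete before its delta
    # is computed.
    error = [0.0] * nunits
    delta = [0.0] * nunits
    first_output = nunits - noutputs
    for i in range(first_output, nunits):
        error[i] = target[i - first_output] - activation[i]
    for i in range(nunits - 1, ninputs - 1, -1):
        # delta = error times the derivative of the activation function.
        delta[i] = error[i] * activation[i] * (1 - activation[i])
        for j in range(i):
            # The actual back propagation step: pass delta to senders.
            error[j] += delta[i] * weight[i][j]
    return error, delta
```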
After computing error, if lflag is nonzero, the weight error deriva-
tives are then computed from the deltas and activations. The error deriva-
tives for the bias terms are also computed. (Recall that the bias terms are
equivalent to weights to a unit from a unit whose activation is always 1.0.)
These computations occur in the following routine:
compute_wed()
Note that this routine adds the weight error derivatives occasioned by the
present pattern into an array where they can potentially be accumulated
over patterns.
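The accumulation just noted can be sketched as follows (Python stand-in for the C routine; wed and bed are the weight and bias error derivative arrays):

```python
def compute_wed(wed, bed, delta, activation, ninputs, nunits):
    # Add (rather than assign) each pattern's derivatives, so they can
    # accumulate over all the patterns in an epoch.
    for i in range(ninputs, nunits):
        for j in range(i):
            wed[i][j] += delta[i] * activation[j]
        # The bias acts like a weight from a unit that is always on.
        bed[i] += delta[i]
    return wed, bed
```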
Weight error derivatives actually lead to changes in the weights either
after processing each pattern or after each entire epoch of processing. In
either case, the computation that is actually performed needs to be clearly
understood. For each weight, a delta weight is first calculated. The delta
weight is equal to the accumulated weight error derivative plus a fraction of
the previous delta weight, where the size of the fraction is determined by
the parameter momentum. Then, this delta weight is added into the weight,
so that the weight's new value is equal to its old value plus the delta
weight. Again, the same computation is performed for all of the bias
terms. The following routine performs these computations:
change_weights()
Note that before the weights are actually changed, the sum_linked_weds rou-
tine is called. This routine adds together all the weight error derivative
terms associated with all the weights that are linked together in the same
link group. The idea of linking weights is to ensure that all weights that are
linked together always have the same value, since conceptually they are
thought of as being a single weight. Also, after the weights are changed,
the constrain_neg_pos routine is called to make sure that the values assigned
to the weights conform to the positive or negative constraints that have
been imposed on them in the network specification file. Weights that are
constrained to be positive are reset to 0 by constrain_neg_pos if
change_weights tries to put them below 0, and weights that are constrained to
be negative are reset to 0 if change_weights tries to put them above 0.
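The update described above can be sketched in Python as follows (a simplification of the C routine: the per-weight epsilon scales the derivative, momentum carries over a fraction of the previous delta weight, and the constraint encoding '+'/'-'/None is our assumption):

```python
def change_weights(weight, dweight, wed, epsilon, momentum, constraint):
    # delta weight = epsilon * accumulated derivative
    #                + momentum * previous delta weight;
    # then the weight is moved by the delta weight and any sign
    # constraints are enforced. The derivative accumulators are
    # cleared afterward (assumed here) for the next accumulation.
    for i in range(len(weight)):
        for j in range(len(weight[i])):
            dweight[i][j] = (epsilon[i][j] * wed[i][j]
                             + momentum * dweight[i][j])
            weight[i][j] += dweight[i][j]
            if constraint[i][j] == '+' and weight[i][j] < 0:
                weight[i][j] = 0.0
            elif constraint[i][j] == '-' and weight[i][j] > 0:
                weight[i][j] = 0.0
            wed[i][j] = 0.0
    return weight
```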
We have just described the processing activity that takes place for each
input-target pair in each learning trial. Generally, learning is accomplished
through a sequence of epochs, in which all pattern pairs are presented for
one trial each during each epoch. The change_weights routine may be called
once per pattern or only once per epoch. The presentation is either in
sequential or permuted order. It is also possible to test the processing of
patterns, either individually or by sequentially cycling through the whole
list, with learning turned off. In this case, compute_output and compute_error
are called, but compute_wed and change_weights are not called.
Whether or not learning is occurring, the program also computes sum-
mary statistics after processing each pattern. First it computes the pattern
sum of squares (pss), equal to the squared error terms summed over all of
the output units. Then it adds the pss to the total sum of squares (tss),
which is just the cumulative sum of the pss for all patterns thus far pro-
cessed within the current epoch.
Learning is carried out by the strain and ptrain commands. The first car-
ries out training in sequential order, the second in permuted order. Train-
ing goes on for nepochs or until the value of tss becomes less than the value
of a control parameter called ecrit for "error criterion." (Note that strain and
ptrain do not check the tss until the end of each epoch, so they will always
run through at least one epoch before returning.)
The bp program is used much like earlier programs in this series, particu-
larly pa. Like the other programs, it has a flexible architecture that must
be specified using a .net file. The conventions used in these files are
142 RUNNING THE PROGRAM
Commands
Variables
There are several new variables and a few minor changes to the meaning
of some familiar variables. These new and changed variables are described
in the following list. As in previous programs, all of the variables in bp are
accessed via the set and exam commands.
ncycles
The number of processing cycles run when the cycle routine is
called. This program control variable is used in cascade mode,
described later in this chapter.
stepsize
Size of the processing step taken before updating the screen and
prompting for return in single step mode. Possible values are cycle,
ncycles, pattern, epoch, and nepochs. When cascade mode is not on
and if the value of stepsize is cycle or ncycles, the screen is updated
after the forward processing sweep, then again after the backward
processing sweep and any weight adjustments. When cascade mode
is on and if the stepsize is cycle, updating occurs after each process-
ing cycle; if stepsize is ncycles, updating occurs after ncycles of pro-
cessing.
config / bepsilon
For each bias term, there is an associated bepsilon, or modifiability
parameter, just as for each weight. These are just like epsilons (see
below), except that the user must only indicate one index since
there is just one per unit.
config/ epsilon
For each weight, there is an associated epsilon, or modifiability
parameter. Generally, epsilon[i][j] (to unit i from unit j) is equal to
either lrate or 0.0, according to whether weight[i][j] is modifiable, as
specified in the .net file. When lrate is changed, all the nonzero
epsilons are set to lrate. However, the user may then independently
adjust epsilon on any particular weight to any value desired. The
user must specify both a receiver and a sender to indicate which
epsilon to examine or change.
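The relation between the epsilons and the weight error derivatives can be sketched as follows. This is a hypothetical Python rendering, not the program's actual C code; the names change_weights, wed, and epsilon simply mirror the variables described in this list.

```python
# Sketch of how per-weight epsilons might gate weight changes
# (hypothetical; the real bp program is written in C).
def change_weights(weight, wed, epsilon):
    """weight[i][j]: to unit i from unit j; wed holds error derivatives."""
    for i in range(len(weight)):
        for j in range(len(weight[i])):
            # an epsilon of 0.0 freezes the corresponding weight
            weight[i][j] += epsilon[i][j] * wed[i][j]
    return weight

w = change_weights([[0.5, -0.5]], [[0.2, 0.2]], [[0.05, 0.0]])
# first weight moves by 0.05 * 0.2 = 0.01; second stays at -0.5
```

Setting an individual epsilon to 0.0 thus has the same effect as marking the weight nonmodifiable in the .net file.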
env / ipattern
Input patterns are specified as a sequence of floating-point
numbers, as in earlier programs. The entries "+", ".", and "-" are
allowed shorthand for +1.0, 0.0, and -1.0, respectively. Legal
values are between 0 and 1, inclusive. If an element of an input
pattern has a negative value, the activation of the corresponding
input unit is set to the activation of the unit whose index is the
absolute value of the negative input pattern element. If the parame-
ter mu is nonzero, the activation value of the input unit is incre-
mented by mu times its previous value. For example, if element 3
of a particular input pattern is -12, the activation of input unit 3 is
set equal to the activation calculated for input unit 12 in processing
the previous pattern, plus mu times the previous activation of input
unit 3.
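The decoding rules above can be illustrated with a short sketch. This is a hypothetical Python rendering (decode_element and its arguments are our names, not the program's):

```python
# Hypothetical decoding of one ipattern element into an input activation.
def decode_element(elem, prev_acts, unit_index, mu=0.0):
    """elem: '+', '.', '-', or a numeric string; prev_acts: activations
    from processing the previous pattern, indexed by unit number."""
    shorthand = {'+': 1.0, '.': 0.0, '-': -1.0}
    value = shorthand[elem] if elem in shorthand else float(elem)
    if value >= 0.0:
        return value                      # clamp the unit to this value
    # negative entry: copy the previous activation of unit |value|,
    # plus mu times this input unit's own previous activation
    return prev_acts[int(-value)] + mu * prev_acts[unit_index]

prev = [0.0] * 13
prev[3], prev[12] = 0.2, 0.8
# element 3 of the pattern is -12, mu is 0.5: 0.8 + 0.5 * 0.2
print(decode_element('-12', prev, 3, mu=0.5))
```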
env / tpattern
Target patterns are specified as a sequence of floating-point
numbers as in pa. The entries "+", "-", and "." are interpreted as
in the ipattern. Legal target values are between 0 and 1, inclusive.
If an element of a target pattern has a negative value, then the pro-
gram acts as though no target was specified for the corresponding
output unit; the error for that unit is therefore set to 0.0.
mode / cascade
By default, the forward pass of processing occurs in a single sweep
in bp. If this mode variable is set to 1, however, net inputs accu-
mulate gradually over several processing cycles. In this mode the
ncycles variable determines the number of processing cycles run per
pattern tested, and the cycle command, following the test command,
allows the user to continue cycling.
mode / follow
When this mode switch is set to 1, the bp program computes the
gcor measure, which is the correlation of the gradient in weight
space between successive calls to change_weights.
mode/ lgrain
Refers to the grain of learning or weight adjustment. By default in
the program it is set to pattern, which means that weights are
adjusted after processing each pattern pair. The lgrain variable may
also be set to epoch, in which case weight changes are accumulated
over all patterns presented within an epoch, and then the weights
are actually changed only at the end of the epoch.
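The difference between the two grains of learning might be sketched like this (a hypothetical Python rendering; run_epoch, grad_fn, and acc are our names):

```python
# Contrast lgrain = 'pattern' (update after each pattern) with
# lgrain = 'epoch' (accumulate, update once at epoch end).
def run_epoch(weights, patterns, grad_fn, lrate, lgrain='pattern'):
    acc = [0.0] * len(weights)
    for p in patterns:
        wed = grad_fn(weights, p)          # error derivatives for p
        if lgrain == 'pattern':
            weights = [w + lrate * g for w, g in zip(weights, wed)]
        else:                               # 'epoch': just accumulate
            acc = [a + g for a, g in zip(acc, wed)]
    if lgrain == 'epoch':
        weights = [w + lrate * g for w, g in zip(weights, acc)]
    return weights
```

With pattern-grain learning, later patterns in the epoch see weights already changed by earlier ones; with epoch-grain learning, every pattern's derivatives are computed against the same weights.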
param/ crate
The cascade rate parameter. Default value is 0.05. Determines the
rate of build-up of the net input to each unit on each processing
cycle.
param/ lrate
The learning rate parameter. Its default value is 0.05. When lrate
is reset, all epsilons and bepsilons that are nonzero are reset to the
new value of lrate.
param/ momentum
The value of the momentum parameter (called alpha in PDP:8); it
has a default value of 0.9. Note that when lgrain is set to pattern,
momentum builds up over patterns; when lgrain is epoch, momentum
builds up over epochs.
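One plausible reading of the interaction between lrate, momentum, and the stored delta weights, sketched in Python (momentum_step is our name; in the program the update lives in the weight-change routine):

```python
# Each new increment blends the current error derivative with the
# increment last added (bp's dweight state).
def momentum_step(weight, wed, dweight, lrate=0.05, momentum=0.9):
    """Return (new_weight, new_dweight) for a single connection."""
    new_dw = lrate * wed + momentum * dweight
    return weight + new_dw, new_dw

w, dw = 0.0, 0.0
for _ in range(3):                 # constant error derivative of 1.0
    w, dw = momentum_step(w, 1.0, dw)
# successive increments grow: 0.05, then ~0.095, then ~0.1355
```

When the gradient keeps pointing the same way, the increments compound, which is what makes momentum accelerate travel along shallow ravines in weight space.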
param/ mu
This parameter applies to the extension of bp to sequential net-
works. It specifies the extent to which the previous activation of a
unit is averaged together with input it receives from other units.
param/ tmax
The target activation actually used when the value given in the tar-
get pattern is 1. Defaults to 1.0. The target activation used when
the target string contains a 0 is (1 - tmax), which is 0.0 when tmax
is 1.0.
param/ wrange
Range of variability for random weights. If constrained to be posi-
tive, the random weights range from 0 to +wrange. If constrained
to be negative, they range from -wrange to O. If unconstrained,
they range from -wrange/2 to +wrange/2.
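The three cases can be sketched as follows (a hypothetical Python rendering; the constraint argument is our own device for marking a weight as constrained positive or negative):

```python
# A guess at the wrange initialization rule as stated above.
import random

def init_weight(wrange, constraint=None):
    if constraint == 'positive':
        return random.uniform(0.0, wrange)
    if constraint == 'negative':
        return random.uniform(-wrange, 0.0)
    return random.uniform(-wrange / 2.0, wrange / 2.0)

random.seed(1)
sample = [init_weight(1.0) for _ in range(5)]   # all in [-0.5, 0.5]
```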
state/ activation
Vector of activation values for units. The ith element corresponds
to the activation of the ith unit, as computed during the most
recent processing cycle.
state/ bed
Vector of bias error derivative terms.
state/ dbias
Vector of delta bias terms, comparable to the delta weights (see below).
state/ delta
Vector of delta terms, or partial derivatives of the error with respect
to the net inputs.
state/ dweight
Matrix of delta weights (the Δwij terms from the equations). This
matrix contains the increment last added to the weights; it is used,
when momentum is nonzero, in setting the value of the next increment
to add.
state/ error
Vector of error terms, or the partial derivative of the error with
respect to a change in the activation of each unit.
state/ gcor
Measures the correlation of the direction of the gradient in weight-
space between successive calls to change_weights. This is computed
only if follow mode is on.
state/ netinput
Net input vector.
state/ target
Vector of targets. Note that the ith element corresponds to the
target value for the ith output unit.
state / wed
Matrix of weight error derivative terms.
OVERVIEW OF EXERCISES
FIGURE 9. Architecture of the XOR network used in Exs. 5.1 and 5.2. (From PDP:8, p.
332.)
hidden units project to the output unit; there are no direct connections
from the input units to the output units.
All of the relevant files for doing this exercise are contained in the bp
directory; they are called xor.tem, xor.str, xor.net, xor.pat, and xor.wts.
These contain the template, start-up, network, pattern, and initial weight
specifications needed for running the XOR problem.
Once you have your own directory set up with copies of the relevant
files, you can start the program. To do so, type
bp xor.tem xor.str
If you wish, you may use a different template file, called xor2.tem, instead
of xor.tem. We describe the case where xor.tem is used because the screen
layout used in this file is easier to modify for other problems. The layout
used in xor2.tem should be self-explanatory.
The xor.str file is as follows:
set slevel 1
set lflag 1
set mode lgrain epoch
set mode follow 1
set param lrate 0.5
get weights xor.wts
tall
This file instructs the program to set up the network as specified in the
xor.net file and to read the patterns as specified in the xor.pat file; it also
initializes various variables. Then it reads in an initial set of weights to use
for this exercise. Finally, tall is called, so that the program processes each
pattern. The display that finally appears shows the state of the network at
the end of this initial test of all of the patterns. It is shown in Figure 10.
In this figure, below the prompt and the top-level menu and to the left,
the current epoch number and the total sum of squares (tss) resulting from
testing all four patterns are displayed. Also displayed is the gcor measure,
which is 0.0 at this point since no weight error derivatives have been com-
puted. The next line contains the current pattern number and the sum of
squares associated with this pattern. To the right of these entries is the set
of input and target patterns for XOR. Below all these entries is a horizontal
vector indicating the activations of all of the "sender units." These are units
that send their activations forward to other units in the network. The first
two are the two input units, and the next two are the two hidden units.
Below this row vector of sender activations is the matrix of weights. The
weight in a particular column and row represents the strength of the con-
nection from a particular sender unit indexed by the column to the particu-
lar receiver indexed by the row. Note that only the weights that actually
exist in the network are displayed-these are the weights from the two
input units to the two hidden units (these are the four numbers below the
activations of the two input units) and the weights from the two hidden
units to the single output unit (these are the numbers below the activations
of the two sending units).
To the right of the weights is a column vector indicating the values of the
bias terms for the receiver units-that is, all the units that receive input
from other units. In this case, the receivers are the two hidden units and
the output unit.
To the right of the biases is a column for the net input to each receiving
unit. There is also a column for the activations of each of these receiver
units. (Note that the hidden units' activations appear twice, once in the
row of senders and once in this column of receivers.) The next column
contains the target vector, which in this case has only one element since
there is only one output unit. Finally, the last column contains the delta
values for the hidden and output units.
Note that all activation, weight, and bias values are given in hundredths,
so that, for example, 43 means 0.43 and reverse-video 3 means -0.03. For
deltas, values are given in thousandths so that reverse-video 9 means
-0.009.
The display shows what happened when the last pattern pair in the file
xor.pat was processed. This pattern pair consists of the input pattern (1 1)
and the target pattern (0). This input pattern was clamped on the two input
units. This is why they both have activation values of 1.0, shown as 100 in
the first two entries of the sender activation vector. With these activations
of the input units, coupled with the weights from these units to the hidden
units, and with the values of the bias terms, the net inputs to the hidden
units were set to 0.60 and -0.40, as indicated in the net column. Plugging
these values into the logistic function, the activation values of 0.64 and
0.40 were obtained for these units. These values are recorded both in the
sender activation vector and in the receiver activation vector (labeled act,
next to the net input vector). Given these activations for the hidden units,
coupled with the weights from the hidden units to the output unit and the
bias on the output unit, the net input to the output unit is 0.48, as indi-
cated at the bottom of the net column. This leads to an activation of 0.61,
as shown in the last entry of the act column. Since the target is 0.0, as
indicated in the target column, the error, or (target - activation) is -0.61;
this error, times the derivative of the activation function (that is,
activation(1 - activation)) results in a delta value of -0.146, as indicated in
the last entry of the final column. The delta values of the hidden units are
determined by back propagating this delta term to the hidden units, as
specified by the compute_error subroutine.
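The arithmetic of this walkthrough can be checked in a few lines; note that the display truncates values to hundredths (thousandths for deltas), so exact figures differ slightly from the displayed ones:

```python
import math

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

net_out = 0.48                  # net input to the output unit, from the display
act = logistic(net_out)         # about 0.618 (shown as 61 in hundredths)
error = 0.0 - act               # target is 0.0
delta = error * act * (1.0 - act)
print(round(delta, 3))          # -0.146, matching the last column
```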
Q.5.1.1. Show the calculations of the values of delta for each of the two
hidden units, using the activations and weights as given in this
initial screen display. Explain why these values are so small.
At this point, you will notice that the total sum of squares before any
learning has occurred is 1.0507. Run another tall to understand more about
what is happening.
Q.5.1.2. Report the output the network produces for each input pattern
and explain why the values are all so similar, referring to the
strengths of the weights, the logistic function, and the effects of
passing activation forward through the hidden units before it
reaches the output units.
Now you are ready to begin learning. Use the strain command. This will
run 30 epochs of training because the nepochs variable is set to 30. If you
set single to 1, you can follow the tss and gcor measures as they change
from epoch to epoch. You may find in the course of running this exercise
that you need to go back and start again. To do this, you should use the
reset command, followed by

get weights xor.wts

The xor.wts file contains the initial weights used for this exercise. This
method of reinitializing guarantees that all users will get the same starting
weights, independent of possible differences in random number generators
from system to system.
Q.5.1.3. The total sum of squares is smaller at the end of 30 epochs, but is
only a little smaller. Describe what has happened to the weights
and biases and the resulting effects on the activation of the output
units. Note the small sizes of the deltas for the hidden units and
explain. Do you expect learning to proceed quickly or slowly
from this point? Why?
Run another 90 epochs of training (for a total of 120) and see if your
predictions are confirmed. As you go along, keep a record of the tss at each
30-epoch milestone. (The initial value is given in Figure 10, in case you
did not record this previously.) You might find it interesting to observe
the results of processing each pattern rather than just the last pattern in the
four-pattern set. To do this, you can set the stepsize variable to pattern
rather than the default epoch.
At the end of another 60 epochs (total: 180), some of the weights in the
network have begun to build up. At this point, one of the hidden units is
providing a fairly sensitive index of the number of input units that are on.
The other is very unresponsive.
Q.5.1.4. Explain why the more responsive hidden unit will continue to
change its incoming weights more rapidly than the other unit over
the next few epochs.
150 EXERCISES
Run another 30 epochs. At this point, after a total of 210 epochs, one of
the hidden units is now acting rather like an OR unit: its output is about
the same for all input patterns in which one or more input units is on.
Q.5.1.5. Explain this OR unit in terms of its incoming weights and bias
term. What is the other unit doing at this point?
Now run another 30 epochs. During these epochs, you will see that the
second hidden unit becomes more differentiated in its response.
Q.5.1.6. Describe what the second hidden unit is doing at this point, and
explain why it is leading the network to activate the output unit
most strongly when only one of the two input units is on.
Run another 30 epochs. Here you will see the tss drop very quickly.
Q.5.1.7. Explain the rapid drop in the tss, referring to the forces operating
on the second hidden unit and the change in its behavior. Note
that the size of the delta for this hidden unit at the end of 270
epochs is about as large in absolute magnitude as the size of the
delta for the output unit. Explain.
Run the strain command one more time. Before the end of the 30
epochs, the value of tss drops below ecrit, and so strain returns. The XOR
problem is solved at this point.
Q.5.1.8. Summarize the course of learning, and compare the final state of
the weights with their initial state. Can you give an approximate
intuitive account of what has happened? What suggestions might
you make for improving performance based on this analysis?
There are several further studies one can do with XOR. You can study
the effects of varying:
The possibilities are almost endless, and all of them have effects. For
this exercise, pick one of these possible dimensions of variation, and run at
least three more runs, comparing the results to those you obtained in
Ex. 5.1.
Q.5.2.1. Describe what you have chosen to vary, how you chose to vary it,
and what results you obtained in terms of the rate of learning, the
evolution of the weights, and the eventual solution achieved.
unit, you have to add an additional line in the layout. These extra lines in
the template file go just above the last line that has $'s on it. Both these
things are done by adding blank lines and spaces to the layout. You must
also alter the template specifications themselves, specifying the correct
starting element and number of elements for the vector displays and speci-
fying the correct starting row, number of rows, starting column, and
number of columns for the weights.
Q.5.3.1. Describe the problem you have chosen, and why you find it
interesting. Explain the network architecture that you have
selected for the problem and the set of training patterns that you
have used. Describe the results of your learning experiments.
Evaluate the back propagation method for learning and explain
your feelings as to its adequacy, drawing on the results you have
obtained in this experiment and any other observations you have
made from the readings or from this exercise.
Hints. Try not to be too ambitious. You will learn a lot even if you limit
yourself to a network with 10 units. For an easy way out, you
could choose to study the 4-2-4 encoder problem described in
PDP:8. The files 424.tem, 424.str, 424.net, and 424.pat are already
set up for this problem.
One of the precursors of our work on PDP models was a model called
the cascade model (McClelland, 1979). This model was a purely linear,
feedforward, multilayer network. Units at each level took on activations
based on inputs from the preceding level according to the following equa-
tion:

a_ir(t) = k_r SUM_j w_ij a_js(t) + (1 - k_r) a_ir(t-1)

Here r and s index some receiving level and the level sending to it, i and j
index units within levels, and k_r is a rate constant governing the rate at
which the activations of units at level r reach the value that the summed
input is driving them toward.
Such a system has interesting temporal properties and provides a useful
framework for accounting for many aspects of reaction time data (see
McClelland, 1979, for details), but its computational capabilities are seri-
ously limited. In particular, with linear units, a multilayer system can
always be replaced by a single layer, and as we have noted repeatedly, there
are limits on the computations that can be performed in a single layer. To
overcome these limitations, multiple layers of units, with some form of
nonlinear activation function, are required.
The idea we describe in this section preserves the desirable computa-
tional characteristics of the nonlinear networks we have been considering in
this chapter but combines with these the gradual build-up of activation
characteristic of the cascade model. To achieve this, we introduce one
change into the cascade equation: Instead of directly setting the activation
of units on the basis of this equation, we use it to determine the net input;
the activation is then calculated from the net input using the logistic func-
tion. Thus the equation for the net input to a unit becomes
net_ir(t) = k_r SUM_j w_ij a_js(t) + (1 - k_r) net_ir(t-1)    (4)

The activation of the unit is then computed from the net input with the
logistic function:

a_ir(t) = 1 / (1 + e^(-net_ir(t)))
Thus, we can view the one-pass feedforward computation as one that com-
putes the asymptotic activations that would result from a process that may
in real systems be gradual and continuous. We can then train the network
to achieve a particular asymptotic activation value for each of several pat-
terns and then observe its dynamical properties.
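A minimal sketch of this scheme, assuming k plays the role of the crate parameter and the summed input from the sending units is held fixed across cycles (names are ours):

```python
import math

# Gradual build-up of net input per Equation 4, for one receiving unit.
def cascade(summed_input, k=0.05, ncycles=100):
    net, acts = 0.0, []
    for _ in range(ncycles):
        net = k * summed_input + (1.0 - k) * net   # Equation 4
        acts.append(1.0 / (1.0 + math.exp(-net)))  # logistic of net
    return acts

acts = cascade(2.0)
# activation climbs from near 0.5 toward the one-pass asymptotic value
print(acts[0], acts[-1])
```

Because the net input approaches the summed input geometrically at rate k, the activation approaches the value the one-pass computation would have produced.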
Before we actually turn to an exercise in which we do just what we have
described, we mention one more characteristic of these cascaded feedfor-
ward networks: Their dynamical properties depend on the initial state.
Here we assume that the initial state is the pattern of activation that results
from processing an input pattern consisting of all-zero activations for all of
the input units. For this to work, the network must in general be trained to
produce some appropriate output state for this initial input state. Here we
simply assume that the desired initial output state is also all-zero activations
on all of the output units.
After training the XOR network of Ex. 5.1 in the standard way, turn on
cascade mode and, using the test command, examine the time course of
activation of the hidden and output units for the patterns in the XOR pat-
tern set.
To run this exercise, run the program with the cas.tem and cas.str files.
These differ only slightly from the standard XOR files. The .tem file
displays the cycle number (which is 0 until cascade mode is turned on).
The .str file sets nepochs to 300 and sets ncycles to 100. After start-up,
issue the strain command, which will cause the network to learn the solu-
tion to the XOR problem from the first exercise, finishing at epoch 289,
where the tss falls below 0.04. Then enter set/ mode/ cascade 1 to turn on
cascade mode. In this mode, the activations of the units are initialized to
the asymptotic values they would have for the input (0 0), which, con-
veniently, was in the training set with target output (0). Now use the test
command to test each of the three nonzero patterns (patterns 1, 2, and 3).
The test command will automatically set stepsize to cycle and turn on single
mode, so you will be able to study the time course of activation step by
step. To obtain a simple graph of the results after the run is over, you can
use the log command to open a log file before you begin testing. Then you
can use the plot program described in Appendix D. Once a log file has been
opened with the log command, the .str and .tem files set up the dlevels of
the templates and the global slevel of the program so that the pattern name,
the cycle number, and the activations of the hidden and output units are
logged on each cycle.
Q.5.4.1. Explain why the output unit is initially more strongly activated by
the (1 1) input pattern than by either the (1 0) or the (0 1)
RECURRENT NETWORKS
see that the output of the third stage of processing is close to 0.5 for all
eight of the different input patterns.
The weights and bias terms have been set up as already described, and
the bias terms are additionally constrained to be negative. The other param-
eters were chosen arbitrarily. The problem is an easy one to learn, as you
can prove to yourself by running the strain command. The network will
generally find a solution in about 40 to 200 epochs. Once the solution has
been found, you can run a tall and see how the network has solved the
problem. If you wish to see the weights, biases, and target patterns in addi-
tion to the activation vectors, you can set dlevel to 3. In this state, the
display will show the weight matrix twice, once with the initial set of activa-
tions feeding into it from above with the intermediate set to the left and
once with the second set feeding into it with the third set to its left.
Q.5.5.1. Does the network always find the same solution to the problem?
Describe all the different solutions you get after repeated runs,
using newstart to reinitialize the network between runs. What
happens if you try removing the constraints that force the biases
to be negative and link the weights together? Do these changes
increase the number of different solutions?
SEQUENTIAL NETWORKS
FIGURE 11. Basic structure of Jordan's sequential networks. (From "Attractor Dynamics
and Parallelism in a Connectionist Sequential Machine" by M. I. Jordan, 1986, Proceedings of
the Eighth Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Lawrence Erlbaum
Associates. Copyright 1986 by M. I. Jordan. Reprinted by permission.)
output unit plus mu times the unit's previous activation. This gives the
current-state units a way of capturing the prior history of activation in the
network.
We have provided a simple facility for implementing Jordan's sequential
nets as well as many more complex variants. This facility employs negative
integers in input patterns to indicate that the unit's activation should be
derived from activations produced on the previous cycle. The particular
negative integer indicates which unit's activation should be used in setting
the activation of the input unit. Thus, if element i of a particular input
pattern is equal to -j, the activation of unit i will be set to aj + mu*ai,
where ai and aj are taken to be the activations of units i and j from the
previous cycle. The parameter mu can be set using the set/ param/ mu command
(its value should be between 0 and 1).
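The substitution rule just described can be sketched as follows (a hypothetical Python rendering; next_input is our name):

```python
# Previous-state substitution: elem >= 0 clamps the input unit;
# elem == -j copies unit j's previous activation plus mu times
# input unit i's own previous activation.
def next_input(elem, prev_acts, i, mu):
    if elem >= 0:
        return float(elem)
    return prev_acts[int(-elem)] + mu * prev_acts[i]

prev = [0.0] * 16
prev[12], prev[4] = 0.5, 0.5      # unit 12: an output; unit 4: this input
print(next_input(-12, prev, 4, mu=0.5))   # 0.5 + 0.5 * 0.5 = 0.75
```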
To implement Jordan's sequential networks using this scheme, we first
set up a network, say, with four plan units, four current-state units, four
hidden units, and four next-state units. Then we set up each sequence we
want to train the network to learn as a set of pattern pairs. In each pattern
pair belonging to the same sequence, the plan field stays the same. The
target patterns are the sequence of desired outputs. For the first pattern in
the sequence, the input values for the current-state units are 0. For subse-
quent patterns in the sequence, the input values for the current-state units
are set to -n, where n is the index of the corresponding output unit.
Thus, the patterns to specify that the network should interpret the plan pat-
tern (1 0 1 0) as an instruction to turn on first the first output unit, then
the second, then the third, and then the fourth would be as follows:
Plan       Current State        Target
1 0 1 0    0   0   0   0        1 0 0 0
1 0 1 0    -12 -13 -14 -15      0 1 0 0
1 0 1 0    -12 -13 -14 -15      0 0 1 0
1 0 1 0    -12 -13 -14 -15      0 0 0 1
(In the pattern file each pattern would also have a name at the beginning of
the line.) Note that the assumptions made by Jordan can be seen as a spe-
cial case of a very general class of sequential models made possible by the
facility to set the activation of input units based on prior activations. Inputs
can be set to have activations based on the activations of hidden units as
well as output units, and external input can be intermixed at will with feed-
back. It is even possible to set up any desired combination of previous
inputs by hard-wiring hidden units to receive inputs from particular sets of
input units.6
6. The only restriction on the use of the previous-state facility is that an input unit cannot
receive its input from the previous activation of a lower-numbered unit. In practice this
should not be much of a restriction since lower-numbered units would always be other input
units.
generally takes several hundred epochs). Once the problem is solved (tss
less than 0.1), do a tall again to verify that the network does as instructed.
Q.5.6.1. See if you can figure out some of the reasons why this task is
harder than the XOR task. If you want to explore this kind of
network further, you might want to study the effects of various
parameters on learning time. For example, you might want to
experiment with different values of mu, Irate, and momentum.
Hints. In thinking about the first part of the Question, you might note
what happens to the patterns of activation on the output units,
and therefore the current-state units, during the early parts of
training. In your explorations for the second part of the Question,
the most interesting effects occur with variations in mu. See if
you can discover and understand these effects.