LSTM in RNN

by

FELIX GERS
Diplom in Physik, Universität Hannover, Germany
German national

Lausanne, EPFL
2001
Contents

1 Introduction
  1.1 Recurrent Neural Networks (RNNs)
  1.2 General considerations
    1.2.1 Problem: Exponential decay of gradient information
    1.2.2 Solution: Constant error carousels
  1.3 Previous and Related Work
    1.3.1 RNNs
    1.3.2 RNNs versus Other Sequence Processing Approaches
  1.4 Outline
2 Traditional LSTM
  2.1 Forward Pass
  2.2 Learning
  2.3 Tasks Solved with Traditional LSTM
3 Learning to Forget: Continual Prediction with LSTM
  3.1 Introduction
    3.1.1 Limits of traditional LSTM
  3.2 Solution: Forget Gates
    3.2.1 Forward Pass of Extended LSTM with Forget Gates
    3.2.2 Backward Pass of Extended LSTM with Forget Gates
    3.2.3 Complexity
  3.3 Experiments
    3.3.1 Continual Embedded Reber Grammar Problem
    3.3.2 Network Topology and Parameters
    3.3.3 CERG Results
    3.3.4 Analysis of the CERG Results
    3.3.5 Continual Noisy Temporal Order Problem
  3.4 Conclusion
4 Arithmetic Operations on Continual Input Streams
  4.1 Introduction
  4.2 Experiments
    4.2.1 Network Topology and Parameters
    4.2.2 Results
  4.3 Conclusion
Abstract

For a long time, recurrent neural networks (RNNs) were thought to be theoretically fascinating. Unlike standard feed-forward networks, RNNs can deal with arbitrary input sequences instead of static input data only. This, combined with the ability to memorize relevant events over time, makes recurrent networks in principle more powerful than standard feed-forward networks. The set of potential applications is enormous: any task that requires learning how to use memory is a potential task for recurrent networks. Potential application areas include time series prediction, motor control in non-Markovian environments, and rhythm detection (in music and speech).

Previous successes of recurrent networks in real-world applications were limited, however, due to practical problems that arise when long time lags between relevant events make learning difficult. For these applications, conventional gradient-based recurrent network algorithms take too long to learn to store information over extended time intervals. The main reason for this failure is the rapid decay of back-propagated error. The "Long Short-Term Memory" (LSTM) algorithm overcomes this and related problems by enforcing constant error flow. Using gradient descent, LSTM explicitly learns when to store information and when to access it.

In this thesis we extend, analyze, and apply the LSTM algorithm. In particular, we identify two weaknesses of LSTM, offer solutions, and modify the algorithm accordingly: (1) We recognize a weakness of LSTM networks processing continual input streams that are not a priori segmented into subsequences with explicitly marked ends at which the network's internal state could be reset. Without resets, the state may grow indefinitely and eventually cause the network to break down. Our remedy is a novel, adaptive "forget gate" that enables an LSTM cell to learn to reset itself at appropriate times, thus releasing internal resources. (2) We identify a weakness in LSTM's connection scheme, and extend it by introducing "peephole connections" from LSTM's "Constant Error Carousels" to the multiplicative gates protecting them. These connections provide the gates with explicit information about the state to which they control access. We show that peephole connections are necessary for numerous tasks and do not significantly affect LSTM's performance on previously solved tasks.

We apply the extended LSTM with forget gates and peephole connections to tasks that no other RNN algorithm can solve (including traditional LSTM): grammar tasks and temporal order tasks involving continual input streams, arithmetic operations on continual input streams, tasks that require precise, continual timing, periodic function generation, and context-free and context-sensitive language tasks. Finally, we establish limits of LSTM on time series prediction problems solvable by time window approaches.
Sommario

For a long time, recurrent neural networks have been considered theoretically fascinating. Recurrent networks can naturally process sequences of data instead of receiving only static input data. They can learn to memorize important events. These capabilities make them, in principle, more powerful than feed-forward networks. The class of potential applications is broad: it contains every problem that requires the use of internal memory. Some examples are time series prediction, motor control in non-Markovian environments, and rhythm detection (for example in music or in spoken language).

On the other hand, recurrent networks have so far had little success when applied to real problems characterized by long time intervals between important input events. Conventional gradient-based learning algorithms need too much time to learn to store information across long time intervals. The main reason is the rapid decay of the back-propagated error. Long Short-Term Memory (LSTM) networks offer a solution to this problem by proposing an architecture in which the error flow remains constant. Using gradient descent, LSTM networks can learn when a piece of information must be stored and when it should subsequently be used.

This thesis analyzes, extends, and applies the LSTM algorithm. Two defects of the pre-existing algorithm are identified, and two main extensions that solve the observed problems are proposed. In particular: (1) A defect of the LSTM algorithm is identified that occurs when the input is contiguous, i.e., not a priori divided into subsequences with distinct beginnings and ends. In this case the algorithm is not able to determine when the network should be returned to its initial state, and the internal values can grow without bound, causing a paralysis of the system. The proposed remedy is based on a new, adaptive multiplicative unit (gate unit) called the "forget gate". It allows a cell of the LSTM network to learn to return to an earlier state at appropriate moments, thus freeing internal resources. (2) A defect in the connection scheme of LSTM networks is identified and solved by introducing connections called "peephole connections". They link the central unit ("constant error carousel") of the cells to the multiplicative units that surround them. In this way the gate units are provided with explicit information about the condition of the object whose access they control. It is also shown that peephole connections are necessary for numerous problems and that they do not significantly reduce the performance of LSTM networks on previously addressed problems.

The thesis applies the extended LSTM algorithm with forget gates and peephole connections to problems that no other algorithm for recurrent networks can solve (including traditional LSTM networks): grammar problems; temporal order problems involving continual inputs; arithmetic operations on continual input; problems that require a continual and precise measurement of time; the generation of periodic functions; and the recognition of context-free and context-sensitive grammars. Finally, limits of the extended LSTM algorithm are identified on time series prediction problems that are solvable by the class of methods based on time windows.
Chapter 1

Introduction

The goal of this Ph.D. thesis is to extend, analyze, and apply a recent, novel, promising gradient learning algorithm for recurrent neural networks (RNNs). The algorithm is called "Long Short-Term Memory" (LSTM). It was introduced by Hochreiter and Schmidhuber (1997).
The objective function is the squared error

  E(t) = \frac{1}{2} \sum_k e_k(t)^2 , \qquad e_k(t) := t_k(t) - y^k(t) ,

where e_k denotes the externally injected error; E(t) represents the error at time t for one sequence component, called a pattern. For a typical data set consisting of sequences of patterns, E is the sum of E(t) over all patterns of all sequences in the set.
Figure 1.1: Left: Feed-forward neural network. Middle: Layered network with an input layer, a fully recurrent hidden layer, and an output layer. Right: Fully connected recurrent network.
A gradient descent learning algorithm for RNNs, such as LSTM, computes the gradient of E with respect to each weight w_{lm} to determine the weight changes

  \Delta w_{lm} = -\alpha \frac{\partial E}{\partial w_{lm}} ,

where \alpha is the learning rate.
Elman nets (Elman, 1990). (This topology is equivalent to a network with a hidden layer where each unit feeds into every other one via time-delayed connections with delay one.) Elman nets are trained by back-propagation (Rumelhart, Hinton, & Williams, 1986); thus they do not even propagate errors back through time. In alternative approaches with context units, the hidden units feed (e.g., fully connected) into the context units (their number may be different from the number of the hidden units). Usually BPTT or RTRL (or their truncated versions) are used for training.
Time delay neural networks (TDNNs). Time-Delay Neural Networks (TDNNs) (Haffner & Waibel, 1992) allow access to past events via cascaded internal delay lines. The interval they can access depends on the network topology. Thus they suffer from the same problems as feed-forward networks using a time window.
Nonlinear autoregressive models with exogenous inputs (NARX) networks. NARX networks (Lin, Horne, Tiño, & Giles, 1996) allow for several distinct input time windows (possibly of size one) with different temporal offsets. They can potentially solve tasks with stationary long time lags; it remains a problem to determine the right windows. However, when the long-term dependencies are non-stationary, the approach fails.
Focused back-propagation. To deal with long time lags, Mozer (1989) uses time constants which influence activation changes. However, for long time gaps the time constants need external fine-tuning (Mozer, 1992). Sun et al.'s alternative approach (1993) updates the activation of a recurrent unit by adding the old activation and the (scaled) current net input. The net input, however, tends to perturb the stored information, which again makes long-term storage impracticable.
Continual, Hierarchical, Incremental Learning and Development (CHILD). Ring (1994) proposed the CHILD method for bridging long time lags. Whenever a unit in his network receives conflicting error signals, he adds a higher-order unit influencing appropriate connections. Although his approach can sometimes be extremely fast, bridging a time lag involving 100 steps may require the addition of 100 units. The network cannot generalize to sequences with unseen lag durations.
Chunker systems. Chunker systems (Schmidhuber, 1992b; Mozer, 1992) do have the ability to bridge arbitrary time lags, but only if the input sequence exhibits locally predictable regularities.
LSTM. LSTM does not suffer from the problems above. It seems to be the state-of-the-art method for recurrent networks faced with realistic, long time lags between occurrences of relevant events.
1.3.2 RNNs versus Other Sequence Processing Approaches

Discrete symbolic grammar learning algorithms (SGLAs). SGLAs (Lee, 1996; Sakakibara, 1997) may learn the grammatical structure of discrete, noise-free event sequences faster, but cannot deal well with noise or with sequences of real-valued inputs (Osborne & Briscoe, 1997).
Hidden Markov models (HMMs). HMMs are widely used approaches to sequence processing. They are well suited for noisy inputs and are invariant to non-linear temporal stretching. This makes HMMs especially successful in speech recognition (they do not care about the difference between slow and fast versions of a given spoken word). But for many other tasks HMMs are less suited because, unlike RNNs, they are limited to discrete state spaces. This makes their application to many time series tasks cumbersome and inefficient. For example, for simple counting tasks, HMMs need as many states as the number of symbols in the longest sequence that should be counted, whereas with RNNs the necessary algorithm can be instantiated with networks of 2-5 units (Kalinke & Lehmann, 1998; Rodriguez & Wiles, 1998; Gers & Schmidhuber, 2000e). Thus, in principle, RNNs are applicable to tasks beyond the reach of HMMs.
Input output hidden Markov models (IOHMMs). The input-output HMM architecture (Bengio & Frasconi, 1995) combines elements of mixture-of-experts, RNNs, and hidden Markov models, and is adapted via the EM algorithm. To our knowledge, this architecture has not yet been applied to tasks comparable to the ones discussed here. But it was shown to solve simple tasks involving long time lags.
Genetic Programming and Program Search. Genetic Programming (see e.g., Dickmanns et al., 1987; Cramer, 1985; Koza, 1992) and Probabilistic Incremental Program Evolution (PIPE) (Salustowicz & Schmidhuber, 1997) could in principle search in general algorithm spaces, but are slow due to the absence of gradient information providing a search direction.
Random guessing. For some simple benchmarks, weight guessing finds solutions faster than elaborate gradient algorithms (Hochreiter & Schmidhuber, 1996, 1995; Schmidhuber & Hochreiter, 1996).
1.4 Outline

Traditional LSTM. Chapter 2 describes the traditional LSTM algorithm as introduced by Hochreiter and Schmidhuber (1997).
Forget Gates. In Chapter 3 we identify a weakness of LSTM in dealing with continual input streams that are not a priori segmented into separate training sequences, such that it is not clear when to reset the network's internal state. We introduce "forget gates" as a remedy (Gers, Schmidhuber, & Cummins, 2000, 1999b).
Arithmetic operations. In Chapter 4 we present tasks involving arithmetic operations on continual input streams that traditional LSTM cannot solve. But LSTM extended with forget gates has superior arithmetic capabilities and does solve the tasks (Gers & Schmidhuber, 2000c).
Timing: extending LSTM with "peephole connections". In Chapter 5 we investigate tasks where the temporal distance between events conveys essential information (this is the case for numerous sequential tasks such as motor control and rhythm detection). First we identify a weakness in LSTM's connection scheme, regarding the wiring of the nonlinear, multiplicative gates surrounding and protecting LSTM's constant error carousels (CECs). We extend LSTM by introducing "peephole connections" from the CECs to the gates, and find that LSTM augmented by peephole connections can learn precise timing. It learned, for example, the fine distinction between sequences of spikes separated by either 50 or 49 discrete time steps, without the help of any short training exemplars (Gers & Schmidhuber, 2000e; Gers, Schmidhuber, & Schraudolph).
Context-free and context-sensitive languages. Previous work by Hochreiter and Schmidhuber (1997) and our own work (see Chapter 3) showed that LSTM outperforms traditional RNNs on learning regular languages from exemplary training sequences. In Chapter 6 we demonstrate LSTM's superior performance on context-free language (CFL) benchmarks for recurrent neural networks (RNNs). To the best of our knowledge, LSTM variants are also the first RNNs to learn a simple context-sensitive language (CSL), namely a^n b^n c^n (Gers & Schmidhuber, 2001).
Time series prediction. In Chapter 7 LSTM is applied to time series prediction tasks solvable by time window approaches: the Mackey-Glass series and the Santa Fe FIR laser emission series (Set A) (Gers, Eck, & Schmidhuber, 2000, 2001).
Chapter 2

Traditional LSTM

The basic unit in the hidden layer of an LSTM network is the memory block; it replaces the hidden units in a "traditional" RNN (Figure 2.1). A memory block contains one or more memory cells and a pair of adaptive, multiplicative gating units which gate input and output to all cells in the block. Memory blocks allow cells to share the same gates (provided the task permits this), thus reducing the number of adaptive parameters. Each memory cell has at its core a recurrently self-connected linear unit called the "Constant Error Carousel" (CEC), whose activation we call the cell state.
Figure 2.1: Left: RNN with one fully recurrent hidden layer. Right: LSTM network with memory blocks in the hidden layer (only one is shown).
Figure 2.2: The traditional LSTM cell has a linear unit with a recurrent self-connection with weight 1.0 (the CEC). Input and output gates regulate read and write access to the cell, whose state is denoted s_c. The function g squashes the cell's input; h squashes the cell's output (see text for details).
The CECs solve the vanishing error problem: in the absence of new input or error signals to the cell, the CEC's local error backflow remains constant, neither growing nor decaying. The CEC is protected from both forward-flowing activation and backward-flowing error by the input and output gates, respectively. When the gates are closed (activation around zero), irrelevant inputs and noise do not enter the cell, and the cell state does not perturb the remainder of the network. Figure 2.2 shows a memory block with a single cell.
2.1 Forward Pass

The cell state, s_c, is updated based on its current state and three sources of input: net_c is the input to the cell itself, while net_{in} and net_{out} are the inputs to the input and output gates.

We consider discrete time steps t = 1, 2, .... A single step involves the update of all units (forward pass) and the computation of error signals for all weights (backward pass). The input gate activation y^{in} and the output gate activation y^{out} are computed as follows:

  net_{out_j}(t) = \sum_m w_{out_j m}\, y^m(t-1) , \qquad y^{out_j}(t) = f_{out_j}(net_{out_j}(t)) ;   (2.1)

  net_{in_j}(t) = \sum_m w_{in_j m}\, y^m(t-1) , \qquad y^{in_j}(t) = f_{in_j}(net_{in_j}(t)) .   (2.2)
Throughout this thesis, j indexes memory blocks; v indexes memory cells in block j (with S_j cells), such that c_j^v denotes the v-th cell of the j-th memory block; w_{lm} is the weight on the connection from unit m to unit l. Index m ranges over all source units, as specified by the network topology (if a source unit activation y^m(t-1) refers to an input unit, the current external input y^m(t) is used instead). For the gates, f is a logistic sigmoid (with range [0, 1]):

  f(x) = \frac{1}{1 + e^{-x}} .   (2.3)
The input to the cell itself is

  net_{c_j^v}(t) = \sum_m w_{c_j^v m}\, y^m(t-1) ,   (2.4)

which is squashed by g, a centered logistic sigmoid function with range [-2, 2] (if not specified differently):

  g(x) = \frac{4}{1 + e^{-x}} - 2 .   (2.5)

The internal state of memory cell s_c(t) is calculated by adding the squashed, gated input to the state at the last time step, s_c(t-1):

  s_{c_j^v}(0) = 0 , \qquad s_{c_j^v}(t) = s_{c_j^v}(t-1) + y^{in_j}(t)\, g(net_{c_j^v}(t)) \quad \text{for } t > 0 .   (2.6)
The cell output y^c is calculated by squashing the internal state s_c via the output squashing function h, and then multiplying (gating) it by the output gate activation y^{out}:

  y^{c_j^v}(t) = y^{out_j}(t)\, h(s_{c_j^v}(t)) ,   (2.7)

  h(x) = \frac{2}{1 + e^{-x}} - 1 .   (2.8)
Finally, assuming a layered network topology with a standard input layer, a hidden layer consisting of memory blocks, and a standard output layer, the equations for the output units k are:

  net_k(t) = \sum_m w_{km}\, y^m(t-1) , \qquad y^k(t) = f_k(net_k(t)) ,   (2.9)

where m ranges over all units feeding the output units (typically all cells in the hidden layer and the input units, but not the memory block gates). As squashing function f_k we again use the logistic sigmoid (2.3). This concludes traditional LSTM's forward pass.
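To make the update order concrete, here is a minimal NumPy sketch of equations (2.1)-(2.7) for a single memory block whose cells share one input gate and one output gate. The function and variable names, and the single-block simplification, are ours, not the thesis's; output units would then apply equation (2.9) to the cell outputs and external inputs.

```python
import numpy as np

def f(x):  # logistic sigmoid with range [0, 1], equation (2.3)
    return 1.0 / (1.0 + np.exp(-x))

def g(x):  # centered sigmoid with range [-2, 2], equation (2.5)
    return 4.0 / (1.0 + np.exp(-x)) - 2.0

def h(x):  # centered sigmoid with range [-1, 1], equation (2.8)
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

def block_forward(y_prev, s_prev, w_in, w_out, W_c):
    """One forward step for one memory block with S cells.
    y_prev: source unit activations y^m(t-1), shape (M,)
    s_prev: cell states s(t-1), shape (S,)
    w_in, w_out: gate weight vectors, shape (M,); W_c: cell weights, shape (S, M)."""
    y_in  = f(w_in  @ y_prev)                  # input gate activation,  eq. (2.2)
    y_out = f(w_out @ y_prev)                  # output gate activation, eq. (2.1)
    s = s_prev + y_in * g(W_c @ y_prev)        # CEC update, fixed self-weight 1.0, eq. (2.6)
    return y_out * h(s), s                     # gated cell outputs, eq. (2.7)
```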
2.2 Learning

See Hochreiter & Schmidhuber (1997) for details of traditional LSTM's backward pass. It will be re-derived and discussed in detail in Section 3.2.2 after the introduction of forget gates. Essentially, as in truncated BPTT, errors arriving at net inputs of memory blocks and their gates do not get propagated back further in time, although they do serve to change the incoming weights. In essence, once an error signal arrives at a memory cell output, it gets scaled by the output gate and the output nonlinearity h; then it enters the memory cell's linear CEC, where it can flow back indefinitely without ever being changed (this is why LSTM can bridge arbitrary time lags between input events and target signals). Only when the error escapes from the memory cell through an opening input gate and the additional input nonlinearity g does it get scaled once more and then serves to change incoming weights before being truncated. The consequence of this truncation is that each LSTM block relies on errors from the output for its adaptation. Since blocks do not exchange error signals, it is hard for LSTM to learn tasks where one block exclusively serves other blocks (e.g., as a pointer into a FIFO queue) without directly reducing the output error.
2.3 Tasks Solved with Traditional LSTM

Hochreiter and Schmidhuber (1997) already solved a wide range of tasks with traditional LSTM: (1) the embedded Reber grammar (a popular regular grammar benchmark); (2) noise-free and noisy sequences with time lags of up to 1000 steps (e.g., the "2-sequence problem" proposed by Bengio et al., 1994); (3) continuous-valued tasks that require the storage of values for long time periods and their summation and multiplication (up to a certain precision); (4) temporal order problems with widely separated inputs.

In the following chapters, however, we will present tasks (partly derived from the tasks listed above) on which traditional LSTM fails, and point out its problems.
Chapter 3

Learning to Forget: Continual Prediction with LSTM

3.1.1 Limits of traditional LSTM

Traditional LSTM fails in certain situations: the cell states s_c often tend to grow linearly during the presentation of a time series (the nonlinear aspects of sequence processing are left to the squashing functions and the highly nonlinear gates). If we present a continuous input stream, the cell states may grow in unbounded fashion, causing saturation of the output squashing function, h. This happens even if the nature of the problem suggests that the cell states should be reset occasionally, e.g., at the beginnings of new input sequences (whose starts, however, are not explicitly indicated by a teacher). Saturation will (a) make h's derivative vanish, thus blocking incoming errors, and (b) make the cell output equal the output gate activation, that is, the entire memory cell will degenerate into an ordinary BPTT unit, so that the cell will cease functioning as a memory. The problem did not arise in the experiments reported by Hochreiter & Schmidhuber (1997) because cell states were explicitly reset to zero before the start of each new sequence.
How can we solve this problem without losing LSTM's advantages over time delay neural networks (TDNNs) (Waibel, 1989) or NARX networks (Lin et al., 1996), which depend on a priori knowledge of typical time lag sizes?

The standard technique of weight decay, which helps to contain the level of overall activity within the network, was found to generate solutions which were particularly prone to unbounded state growth.

Variants of focused back-propagation (Mozer, 1989) also do not work well. These let the internal state decay via a self-connection whose weight is smaller than 1. But there is no principled way of designing appropriate decay constants: a potential gain for some tasks is paid for by a loss of ability to deal with arbitrary, unknown causal delays between inputs and targets. In fact, state decay does not significantly improve experimental performance (see "State Decay" in Table 3.2).

Of course we might try to "teacher force" (Jordan, 1986; Doya & Yoshizawa, 1989) the internal states s_c by resetting them once a new training sequence starts. But this requires an external teacher who knows how to segment the input stream into training subsequences. We are precisely interested, however, in those situations where there is no a priori knowledge of this kind.
3.2 Solution: Forget Gates

Our solution to the problem above is to use adaptive "forget gates" which learn to reset memory blocks once their contents are out of date and hence useless. By resets we do not only mean immediate resets to zero but also gradual resets corresponding to slowly fading cell states. More specifically, we replace traditional LSTM's constant CEC weight 1.0 by the multiplicative forget gate activation y^{\varphi}. See Figure 3.1.
3.2.1 Forward Pass of Extended LSTM with Forget Gates

All equations of traditional LSTM's forward pass except for equation (2.6) remain valid for extended LSTM with forget gates.

The forget gate activation y^{\varphi} is calculated like the activations of the other gates; compare equations (2.1) and (2.2):

  net_{\varphi_j}(t) = \sum_m w_{\varphi_j m}\, y^m(t-1) , \qquad y^{\varphi_j}(t) = f_{\varphi_j}(net_{\varphi_j}(t)) .   (3.1)

Here net_{\varphi_j} is the input from the network to the forget gate. We use the logistic sigmoid with range [0, 1] as the squashing function f_{\varphi_j}.
Figure 3.1: Memory block with only one cell for the extended LSTM. A multiplicative forget gate can reset the cell's inner state s_c.
The forget gate's output becomes the weight of the self-recurrent connection of the internal state s_c in equation (2.6). The revised update equation for s_c in the extended LSTM algorithm is (for t > 0):

  s_{c_j^v}(t) = y^{\varphi_j}(t)\, s_{c_j^v}(t-1) + y^{in_j}(t)\, g(net_{c_j^v}(t)) ,   (3.2)

with s_{c_j^v}(0) = 0. Extended LSTM's full forward pass is obtained by adding equations (3.1) to those in Chapter 2 and replacing equation (2.6) by (3.2).

Bias weights for LSTM gates are initialized with negative values for input and output gates (see Section 3.3.2) and with positive values for forget gates. This implies (compare equations (3.1) and (3.2)) that at the beginning of the training phase the forget gate activation will be almost 1.0, and the entire cell will behave like a traditional LSTM cell. It will not explicitly forget anything until it has learned to forget.
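In code, the change relative to the traditional forward pass is a single line. The sketch below is ours, reusing f, g, h and NumPy from the Section 2.1 sketch; it replaces the fixed CEC self-weight 1.0 with the forget gate activation of equation (3.1).

```python
def extended_block_forward(y_prev, s_prev, w_in, w_out, w_phi, W_c):
    """Forward step of an extended LSTM block with a forget gate, eqs. (3.1)-(3.2)."""
    y_in  = f(w_in  @ y_prev)                      # input gate,  eq. (2.2)
    y_out = f(w_out @ y_prev)                      # output gate, eq. (2.1)
    y_phi = f(w_phi @ y_prev)                      # forget gate, eq. (3.1)
    s = y_phi * s_prev + y_in * g(W_c @ y_prev)    # revised state update, eq. (3.2)
    return y_out * h(s), s
```

With a positive forget gate bias, y_phi starts near 1.0, so the block initially reproduces the traditional behavior, as described above.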
3.2.2 Backward Pass of Extended LSTM with Forget Gates

LSTM's backward pass is an efficient fusion of slightly modified, truncated back-propagation through time (BPTT) (e.g., Williams & Peng, 1990) and a customized version of real-time recurrent learning (RTRL) (e.g., Robinson & Fallside, 1987). Output units use BP; output gates use slightly modified, truncated BPTT. Weights to cells, input gates, and the novel forget gates, however, use a truncated version of RTRL. Truncation means that all errors are cut off once they leak out of a memory cell or gate, although they do serve to change the incoming weights. The effect is that the CECs are the only part of the system through which errors can flow back forever. This makes LSTM's updates efficient without significantly affecting learning power: error flow outside of cells tends to decay exponentially anyway (Hochreiter, 1991). In the equations below, \stackrel{tr}{=} will indicate where we use error truncation and, for simplicity, unless otherwise indicated, we assume only a single cell per block.
We start with the usual squared error objective function based on targets t_k:

  E(t) = \frac{1}{2} \sum_k e_k(t)^2 , \qquad e_k(t) := t_k(t) - y^k(t) ,   (3.3)

where e_k denotes the externally injected error. We minimize E via gradient descent by adding weight changes \Delta w_{lm} to the weights w_{lm} (from unit m to unit l) using learning rate \alpha (\delta_{ij} is the Kronecker delta):

  \Delta w_{lm}(t) = -\alpha \frac{\partial E(t)}{\partial w_{lm}}
    = \alpha \sum_k e_k(t)\, \frac{\partial y^k(t)}{\partial w_{lm}}
    = \alpha \sum_k e_k(t) \sum_{l'} \frac{\partial y^k(t)}{\partial net_{l'}(t)}\, \frac{\partial net_{l'}(t)}{\partial w_{lm}}
    = \alpha \sum_k e_k(t) \sum_{l'} \frac{\partial y^k(t)}{\partial net_{l'}(t)} \left( \delta_{l l'}\, y^m(t-1) + \sum_{m'} w_{l' m'}\, \frac{\partial y^{m'}(t-1)}{\partial w_{lm}} \right) .   (3.4)

Errors are truncated when they leave a memory block by setting the following derivatives in the above equation to zero:

  \frac{\partial y^m(t-1)}{\partial net_{l'}(t)} \stackrel{tr}{=} 0 \quad \text{for } l' \in \{\varphi, in, c_j^v\} .
For an arbitrary output unit (l = k'), the sum in (3.4) reduces to e_k (with k = k'). By differentiating equation (2.9) we obtain the usual back-propagation weight changes for the output units:

  \frac{\partial y^k(t)}{\partial net_k(t)} = f_k'(net_k(t)) \;\Longrightarrow\; \delta_k(t) = f_k'(net_k(t))\, e_k(t) .   (3.5)
To compute the weight changes \Delta w_{out_j m} for the output gates we set l = out_j in (3.4). The resulting terms can be determined by differentiating equations (2.1), (2.7), and (2.9):

  \delta_{out_j}(t) = f_{out_j}'(net_{out_j}(t)) \left( \sum_{v=1}^{S_j} h(s_{c_j^v}(t)) \sum_k w_{k\, c_j^v}\, \delta_k(t) \right) .   (3.6)

Equations (3.4), (3.5), and (3.6) define the weight changes for output units and output gates of memory blocks. Their derivation was almost standard BPTT, with error signals truncated once they leave memory blocks (including their gates). This truncation does not affect LSTM's long time lag capabilities, but it is crucial for all equations of the backward pass and should be kept in mind.
For weights to the cell, the input gate, and the forget gate we adopt an RTRL-oriented perspective, by first stating the influence of a cell's internal state s_{c_j^v} on the error and then analyzing how each weight to the cell or to the block's gates contributes to s_{c_j^v}. So we split the gradient in a way different from the one used in equation (3.4), neglecting, however, the same derivatives:

  \Delta w_{lm}(t) = -\alpha \frac{\partial E(t)}{\partial w_{lm}} \stackrel{tr}{=} -\alpha \frac{\partial E(t)}{\partial s_{c_j^v}(t)}\, \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}} = \alpha\, e_{s_{c_j^v}}(t)\, \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}} , \quad \text{with } e_{s_{c_j^v}}(t) := -\frac{\partial E(t)}{\partial s_{c_j^v}(t)} .   (3.7)

These terms are the internal state error e_{s_{c_j^v}} and a partial derivative of s_{c_j^v} with respect to the weights w_{lm} feeding the cell c_j^v (l = c_j^v), the block's input gate (l = in), or the block's forget gate (l = \varphi), as all these weights contribute to the calculation of s_{c_j^v}(t). We treat the internal state error e_{s_{c_j^v}} analogously to (3.4) and obtain:

  e_{s_{c_j^v}}(t) := -\frac{\partial E(t)}{\partial s_{c_j^v}(t)} \stackrel{tr}{=} \frac{\partial y^{c_j^v}(t)}{\partial s_{c_j^v}(t)} \sum_k e_k(t)\, \frac{\partial y^k(t)}{\partial y^{c_j^v}(t)} = y^{out_j}(t)\, h'(s_{c_j^v}(t)) \sum_k w_{k\, c_j^v}\, \delta_k(t) .   (3.8)

The partial of the state with respect to a weight follows from the update equation (3.2):

  \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}} = \frac{\partial}{\partial w_{lm}} \left( y^{\varphi_j}(t)\, s_{c_j^v}(t-1) + y^{in_j}(t)\, g(net_{c_j^v}(t)) \right) .   (3.9)

Differentiating the forward pass equations (3.2), (2.2), and (3.1) for g, y^{in}, and y^{\varphi}, we can substitute the unresolved partials and split the expression on the right-hand side of (3.9) into three separate equations, for the cell (l = c_j^v), the input gate (l = in), and the forget gate (l = \varphi):
  \frac{\partial s_{c_j^v}(t)}{\partial w_{c_j^v m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{c_j^v m}}\, y^{\varphi_j}(t) + g'(net_{c_j^v}(t))\, y^{in_j}(t)\, y^m(t-1) ,   (3.10)

  \frac{\partial s_{c_j^v}(t)}{\partial w_{in_j m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{in_j m}}\, y^{\varphi_j}(t) + g(net_{c_j^v}(t))\, f_{in_j}'(net_{in_j}(t))\, y^m(t-1) ,   (3.11)

  \frac{\partial s_{c_j^v}(t)}{\partial w_{\varphi_j m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{\varphi_j m}}\, y^{\varphi_j}(t) + s_{c_j^v}(t-1)\, f_{\varphi_j}'(net_{\varphi_j}(t))\, y^m(t-1) .   (3.12)

Furthermore, the initial state of the network does not depend on the weights, so we have

  \frac{\partial s_{c_j^v}(t=0)}{\partial w_{lm}} = 0 \quad \text{for } l \in \{\varphi, in, c_j^v\} .   (3.13)
Note that the recursions in equations (3.10)-(3.12) depend on the actual activation of the block's forget gate. When the activation goes to zero, not only the cell's state but also the partials are reset (forgetting includes forgiving previous mistakes). Every cell needs to keep a copy of each of these three partials and update them at every time step.
We can insert the partials in equation (3.7) and calculate the corresponding weight updates, with the internal state error e_{s_{c_j^v}}(t) given by equation (3.8). The difference between updates of weights to a cell itself (l = c_j^v) and updates of weights to the gates is that changes to weights to the cell w_{c_j^v m} only depend on the partials of this cell's own state:

  \Delta w_{c_j^v m}(t) = \alpha\, e_{s_{c_j^v}}(t)\, \frac{\partial s_{c_j^v}(t)}{\partial w_{c_j^v m}} .   (3.14)

To update the weights of the input gate and of the forget gate, however, we have to sum over the contributions of all cells in the block:

  \Delta w_{lm}(t) = \alpha \sum_{v=1}^{S_j} e_{s_{c_j^v}}(t)\, \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}} \quad \text{for } l \in \{\varphi, in\} .   (3.15)

The equations necessary to implement the backward pass are (3.4), (3.5), (3.6), (3.8), (3.10), (3.11), (3.12), (3.13), (3.14), and (3.15).
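As an illustration, the following sketch carries the three RTRL partials across time steps and produces the weight changes for one block. It is our own reading of equations (3.5)-(3.15) under the single-block shapes of the earlier forward-pass sketches (scalar gates, S cells, reusing f, g, h and NumPy); the thesis's own pseudo-code appears in Appendix B.

```python
def fp(x):  # derivative of the logistic sigmoid f
    y = f(x); return y * (1.0 - y)

def gp(x):  # derivative of g (range [-2, 2])
    y = g(x); return (2.0 + y) * (2.0 - y) / 4.0

def hp(x):  # derivative of h (range [-1, 1])
    y = h(x); return (1.0 + y) * (1.0 - y) / 2.0

def backward_step(e_k, net_k, net_in, net_phi, net_out, net_c,
                  y_prev, s, s_prev, dS_c, dS_in, dS_phi, W_kc, alpha=0.5):
    """One truncated backward-pass step for a single extended LSTM block.
    e_k: injected errors t_k - y_k, shape (K,); W_kc: cell-to-output weights (K, S);
    dS_c, dS_in, dS_phi: RTRL partials ds/dw from the previous step, each (S, M)."""
    delta_k = fp(net_k) * e_k                           # output unit deltas, eq. (3.5)
    back = W_kc.T @ delta_k                             # error reaching each cell
    delta_out = fp(net_out) * (h(s) @ back)             # output gate delta, eq. (3.6)
    es = f(net_out) * hp(s) * back                      # internal state error, eq. (3.8)
    y_phi = f(net_phi)                                  # all recursions scale by y^phi(t)
    dS_c   = dS_c   * y_phi + np.outer(gp(net_c) * f(net_in), y_prev)   # eq. (3.10)
    dS_in  = dS_in  * y_phi + np.outer(g(net_c) * fp(net_in), y_prev)   # eq. (3.11)
    dS_phi = dS_phi * y_phi + np.outer(s_prev * fp(net_phi), y_prev)    # eq. (3.12)
    dW_c   = alpha * es[:, None] * dS_c                 # per-cell changes, eq. (3.14)
    dw_in  = alpha * (es @ dS_in)                       # summed over cells, eq. (3.15)
    dw_phi = alpha * (es @ dS_phi)
    dw_out = alpha * delta_out * y_prev                 # output gate weights (BPTT part)
    return (dW_c, dw_in, dw_phi, dw_out), (dS_c, dS_in, dS_phi)
```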
3.2.3 Complexity

To calculate the computational complexity of extended LSTM, we take into account that weights to input gates and forget gates cause more expensive updates than others, because each such weight directly affects all the cells in its memory block. We evaluate a rather typical topology used in the experiments (see Figure 3.3). All memory blocks have the same size; gates have no outgoing connections; output units and gates have a bias connection (from a unit whose activation is always 1.0); other connections to output units stem from memory blocks only; the hidden layer is fully connected. Let B, S, I, K denote the numbers of memory blocks, memory cells in each block, input units, and output units, respectively. We find the update complexity per time step to be:

  W = B [\, S(BS+1) + 2S(BS+1) + (BS+1) \,] + K(BS+1) + I\, B(S + 2S + 1) ,

where the bracketed terms count the recurrent connections and biases to the cells, to the input and forget gates, and to the output gate, respectively; the K-term counts the connections to the output units and the I-term the connections from the input units. The number of weights is:

  N_w = B [\, S(BS+1) + 3(BS+1) \,] + K(BS+1) + I(BS + 3B) .

Hence LSTM's computational complexity per time step and weight is O(1). Considering connections to gates separately, we find that their computational complexity per time step and weight is O(S). But this is compensated by the "less complex" connections to the cells, of O(1). Overall it is essentially the same as for a fully connected BPTT recurrent network. Storage complexity per weight is also O(1), as the last time step's partials from equations (3.10), (3.11), and (3.12) are all that need to be stored for the backward pass. So the storage complexity does not depend on the length of the input sequence. Hence extended LSTM is local in space and time, according to Schmidhuber's definition (1989), just like traditional LSTM.
3.3 Experiments

3.3.1 Continual Embedded Reber Grammar Problem

To generate an infinite input stream, we extend the well-known "embedded Reber grammar" (ERG) benchmark problem; see, e.g., Smith and Zipser (1989), Cleeremans et al. (1989), Fahlman (1991), Hochreiter & Schmidhuber (1997). Consider Figure 3.2.

ERG. The traditional method starts at the leftmost node of the ERG graph and sequentially generates finite symbol strings (beginning with the empty string) by stepping from node to node following the edges of the graph, appending the symbols associated with the edges to the current string until the rightmost node is reached. Edges are chosen randomly if there is a choice (probability 0.5).

Input and target symbols are represented by 7-dimensional binary vectors, each component standing for one of the 7 possible symbols. Hence the network has 7 input units and 7 output units. The task is to read strings, one symbol at a time, and to continually predict the next possible symbol(s). Input vectors have exactly one nonzero component. Target vectors may have two, because sometimes there is a choice of two possible symbols at the next step. A prediction is considered correct if the error at each of the 7 output units is below 0.49 (error signals occur at every time step).
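For concreteness, here is a small generator for ERG strings and CERG streams, written by us from the transition diagram in Figure 3.2; the node numbering is our own convention, not the thesis's.

```python
import random

# Standard Reber grammar as a transition table: state -> [(symbol, next_state), ...]
REBER = {
    0: [('T', 1), ('P', 2)],
    1: [('S', 1), ('X', 3)],
    2: [('T', 2), ('V', 4)],
    3: [('X', 2), ('S', 5)],
    4: [('P', 3), ('V', 5)],
}

def reber_string():
    """Random walk from the leftmost to the rightmost node; edges are chosen
    with probability 0.5 where there is a choice."""
    state, symbols = 0, ['B']
    while state != 5:
        sym, state = random.choice(REBER[state])
        symbols.append(sym)
    return symbols + ['E']

def embedded_reber_string():
    """ERG string: B, then T or P, a full Reber string, the same T/P symbol
    again (the long-time-lag dependency), then E."""
    tp = random.choice('TP')
    return ['B', tp] + reber_string() + [tp, 'E']

def cerg_stream():
    """Continual variant (CERG): an endless concatenation of ERG strings,
    with no resets between them."""
    while True:
        yield from embedded_reber_string()
```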
Figure 3.2: Transition diagrams for the standard (left) and embedded (right) Reber grammars. The dashed line indicates the continual variant.
  Algorithm   | # hidden units | # weights | learning rate | % of success    | success after
  RTRL        | 3              | 170       | 0.05          | "some fraction" | 173,000
  RTRL        | 12             | 494       | 0.1           | "some fraction" | 25,000
  ELM         | 15             | 435       |               | 0               | >200,000
  RCC         | 7-9            | 119-198   |               | 50              | 182,000
  Trad. LSTM  | 3 bl., size 2  | 276       | 0.5           | 100             | 8,440

Table 3.1: Standard embedded Reber grammar (ERG): percentage of successful trials and number of sequence presentations until success, for RTRL (results taken from Smith and Zipser, 1989), an "Elman net trained by Elman's procedure" (ELM; results taken from Cleeremans et al., 1989), "Recurrent Cascade-Correlation" (RCC; results taken from Fahlman, 1991), and traditional LSTM (results taken from Hochreiter and Schmidhuber, 1997). Weight numbers in the first 4 rows are estimates.
To correctly predict the symbol before the last (T or P) in an ERG string, the network has to remember the second symbol (also T or P) without confusing it with identical symbols encountered later. The minimal time lag is 7 (at the limit of what standard recurrent networks can manage); time lags have no upper bound, though. The expected length of a string generated by an ERG is 11.5 symbols. The length of the longest string in a set of N non-identical strings is proportional to log N (statistics of the embedded Reber grammar are discussed in Appendix A). For the training and test sets used in our experiments, the expected value of the longest string is greater than 50.

Table 3.1 summarizes the performance of previous RNNs on the standard ERG problem (testing involved a test set of 256 ERG test strings). Only traditional LSTM always learns to solve the task. Even when we ignore the unsuccessful trials of the other approaches, LSTM learns much faster.
CERG. Our more difficult continual variant of the ERG problem (CERG) does not provide information about the beginnings and ends of symbol strings. Without intermediate resets, the network is required to learn, in an online fashion, from input streams consisting of concatenated ERG strings. Input streams are stopped as soon as the network makes an incorrect prediction or the 10^5-th successive symbol has occurred.
Figure 3.3: Three-layer LSTM topology with recurrence limited to the hidden layer, which consists of four extended LSTM memory blocks (only two shown) with two cells each. Only a limited subset of connections is shown.
Learning and testing alternate: after each training stream we freeze the weights and feed 10 test streams. Our performance measure is the average test stream size; 100,000 corresponds to a so-called "perfect" solution (10^6 successive correct predictions).
3.3.2 Network Topology and Parameters

The 7 input units are fully connected to a hidden layer consisting of 4 memory blocks with 2 cells each (8 cells and 12 gates in total). The cell outputs are fully connected to the cell inputs, to all gates, and to the 7 output units. The output units have additional "shortcut" connections from the input units (see Figure 3.3). All gates and output units are biased. Bias weights to input and output gates are initialized blockwise: -0.5 for the first block, -1.0 for the second, -1.5 for the third, and so forth. In this manner, cell states are initially close to zero, and, as training progresses, the biases become progressively less negative, allowing the serial activation of cells as active participants in the network computation. Forget gates are initialized with symmetric positive values: +0.5 for the first block, +1.0 for the second block, etc. Precise bias initialization is not critical, though; other values work just as well. All other weights, including the output bias, are initialized randomly in the range [-0.2, 0.2]. There are 424 adjustable weights, which is comparable to the number used by LSTM in solving the ERG (see Table 3.1).
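As a consistency check, the weight count and the blockwise bias initialization can be written out as follows. The breakdown is our own reading of the topology just described; in particular, we assume the cells themselves carry no bias (only gates and output units are biased), which reproduces the stated total of 424.

```python
B, S, I, K = 4, 2, 7, 7            # memory blocks, cells per block, inputs, outputs

recurrent = B * S * (B * S)        # cell outputs -> cell inputs (cells assumed unbiased)
gates     = 3 * B * (B * S + 1)    # cell outputs -> 12 gates, plus one bias per gate
output    = K * (B * S + 1)        # cells -> output units, plus output biases
from_in   = I * (B * S + 3 * B)    # input units -> cells and gates
shortcut  = I * K                  # input -> output shortcut connections
assert recurrent + gates + output + from_in + shortcut == 424

# Blockwise gate bias initialization described above:
in_out_gate_bias = [-0.5 * (j + 1) for j in range(B)]   # -0.5, -1.0, -1.5, -2.0
forget_gate_bias = [+0.5 * (j + 1) for j in range(B)]   # +0.5, +1.0, +1.5, +2.0
```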
Weight changes are made after each input symbol presentation. At the beginning of each training stream, the learning rate \alpha is initialized with 0.5. It either remains fixed or decays by a factor of 0.99 per time step (LSTM with \alpha-decay). Learning rate decay is well studied in statistical approximation theory and is also common in neural networks, e.g. (Darken, 1995).
3.3.3 CERG Results

Training was stopped after at most 30,000 training streams, each of which was ended when the first prediction error or the 100,000-th successive input symbol occurred. Table 3.2 compares extended LSTM (with and without learning rate decay) to traditional LSTM and to an LSTM variant with decay of the internal cell state s_c (with a self-recurrent weight < 1). Our results for traditional LSTM with network activation resets (by an external teacher) at sequence ends are slightly better than those based on a different topology (Hochreiter & Schmidhuber, 1997). External resets (the non-continual case) allow LSTM to find excellent solutions in 74% of the trials, according to our stringent testing criterion. Traditional LSTM fails, however, in the continual case. Internal state decay does not help much either (we tried various self-recurrent weight values and report only the best result). Extended LSTM with forget gates, however, can solve the continual problem.

A continually decreasing learning rate led to even better results, but had no effect on the other algorithms. Different topologies may provide better results, too; we did not attempt to optimize the topology.

Can the network learn to recognize appropriate times for opening/closing its gates without using the information conveyed by the marker symbols B and E? To test this we replaced all CERG subnets of the type T/P -> E -> B -> T/P by T/P -> T/P, removing the intermediate E and B symbols. This makes the task more difficult, as the net now needs to keep track of sequences of numerous potentially confusing T and P symbols. But LSTM with forget gates (same topology) was still able to find perfect solutions, although less frequently (learning rate decay was not applied).
Figure 3.4: Evolution of traditional LSTM's internal states s_c during presentation of a test stream, stopped at the first prediction failure. Starts of new ERG strings are indicated by vertical lines labeled by the symbols (P or T) to be stored until the next string start.
3.3.4 Analysis of the CERG Results

How does extended LSTM solve the task on which traditional LSTM fails? Section 3.1.1 already mentioned LSTM's problem of uncontrolled growth of the internal states. Figure 3.4 shows the evolution of the internal states s_c during the presentation of a test stream. The internal states tend to grow linearly. At the starts of successive ERG strings, the network is in an increasingly active state. At some point (here after 13 successive strings), the high level of state activation leads to saturation of the cell outputs, and performance breaks down. Extended LSTM, however, learns to use the forget gates for resetting its state when necessary. Figure 3.5 (top half) shows a typical internal state evolution after learning. We see that the third memory block resets its cells in synchrony with the starts of ERG strings (the vertical lines in Figure 3.5 indicate the third symbol of a string). The internal states oscillate around zero; they never drift out of bounds as with traditional LSTM (Figure 3.4). It also becomes clear how the relevant information gets stored: the second cell of the third block stays negative while the symbol P has to be stored, whereas a T is represented by a positive value. The third block's forget gate activations are plotted in Figure 3.5 (bottom). Most of the time they are equal to 1.0, thus letting the memory cells retain their internal values. At the end of an ERG string the forget gate's activation goes to zero, thus resetting the cell states to zero.

Analyzing the behavior of the other memory blocks, we find that only the third is directly responsible for bridging ERG's longest time lag (which is sufficient, as just one bit has to be stored). Figure 3.6 plots values analogous to those in Figure 3.5 for the first memory block and its first cell. The first block's cell and forget gate show short-term behavior only (necessary for predicting the numerous short-time-lag events of the Reber grammar). The same is true for all other blocks except the third. Common to all memory blocks is that they learned to reset themselves in an appropriate fashion.
3.3.5 Continual Noisy Temporal Order Problem

Extended LSTM solves the CERG problem while traditional LSTM does not. But can traditional LSTM solve problems which extended LSTM cannot?
Figure 3.5: Top: Internal states s_c of the two cells of the self-resetting third memory block in an extended LSTM network during a test stream presentation. The figure shows 170 successive symbols taken from the longer sequence presented to a network that learned the CERG. Starts of new ERG strings are indicated by vertical lines labeled by the symbols (P or T) to be stored until the next string start. Bottom: simultaneous forget gate activations of the same memory block.
We tested extended LSTM on one of the most difficult nonlinear long-time-lag tasks ever solved by an RNN: "Noisy Temporal Order" (NTO) (task 6b taken from Hochreiter & Schmidhuber, 1997).

NTO. The goal is to classify sequences of locally represented symbols. Each sequence starts with an E, ends with a B (the "trigger symbol"), and otherwise consists of randomly chosen symbols from the set {a, b, c, d}, except for three elements at positions t_1, t_2, and t_3 that are either X or Y (Figure 3.7). The sequence length is randomly chosen between 100 and 110, t_1 is randomly chosen between 10 and 20, t_2 is randomly chosen between 33 and 43, and t_3 is randomly chosen between 66 and 76. There are 8 sequence classes Q, R, S, U, V, A, B, C, which depend on the temporal order of the Xs and Ys. The rules are (temporal order -> class): X,X,X -> Q; X,X,Y -> R; X,Y,X -> S; X,Y,Y -> U; Y,X,X -> V; Y,X,Y -> A; Y,Y,X -> B; Y,Y,Y -> C. Target signals occur only at the end of a sequence. The problem's minimal time lag size is 80(!). Forgetting is only harmful, as all relevant information has to be kept until the end of a sequence, after which the network is reset anyway.
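A generator for NTO sequences might look as follows; this is our own sketch written from the description above, with the class table taken verbatim from the rules just listed.

```python
import random

NTO_CLASS = {('X','X','X'): 'Q', ('X','X','Y'): 'R', ('X','Y','X'): 'S',
             ('X','Y','Y'): 'U', ('Y','X','X'): 'V', ('Y','X','Y'): 'A',
             ('Y','Y','X'): 'B', ('Y','Y','Y'): 'C'}

def nto_sequence():
    """One NTO sequence plus its class label: starts with E, ends with the
    trigger symbol B; X/Y at positions t1, t2, t3; noise from {a,b,c,d} elsewhere."""
    length = random.randint(100, 110)
    seq = ['E'] + [random.choice('abcd') for _ in range(length - 2)] + ['B']
    positions = (random.randint(10, 20), random.randint(33, 43), random.randint(66, 76))
    triple = tuple(random.choice('XY') for _ in range(3))
    for pos, sym in zip(positions, triple):
        seq[pos] = sym
    return seq, NTO_CLASS[triple]
```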
We use the network topology described in Section 3.3.2, with 8 input and 8 output units. Using a large bias (5.0) for the forget gates, extended LSTM solved the task as quickly as traditional LSTM (recall that a high forget gate bias makes extended LSTM degenerate into traditional LSTM).
Figure 3.6: Top: Extended LSTM's self-resetting states for the first cell in the first block. Bottom: forget gate activations of the first memory block.
Figure 3.7: NTO and CNTO tasks. See text for details.
Using a moderate bias like the one used for CERG (1.0), extended LSTM took about three times longer on average, but did solve the problem. The slower learning speed results from the net having to learn to remember everything and not to forget.

Generally speaking, we have not yet encountered a problem that traditional LSTM solves while extended LSTM does not.
CNTO. Now we take the next obvious step and transform the NTO into a continual problem that does require forgetting, just as in Section 3.3.1, by generating continual input streams consisting of concatenated NTO sequences (Figure 3.7). Processing such streams without intermediate resets, the network is required to learn to classify NTO sequences in an online fashion. Each input stream is stopped once the network makes an incorrect classification or 100 successive NTO sequences have been classified correctly. Learning and testing alternate; the performance measure is the average size of 10 test streams, measured by the number of their NTO sequences (each containing between 100 and 110 input symbols). Training is stopped after at most 10^5 training streams.
4.2 Experiments

We focus on tasks involving arithmetic operations on input streams that so far have been addressed only in non-continual settings (Tsung & Cottrell, 1989; Hochreiter & Schmidhuber, 1997).

General set-up. We feed the net continual streams of 4-dimensional input vectors generated in an online fashion. We define t_0 = 0 (stream start) and t_n = t_{n-1} + T + (-1)^n V for n = 1, 2, ..., where V \in \{0, 1, ..., T\} is chosen randomly, and the integer T is the minimal time lag. The first component of each input vector is a random number from the interval [-1, +1]. The second and third serve as "markers": they are always 0.0 except at times t_{2m-1} for m = 1, 2, ..., when either the second component is 1.0 with probability p, or the third is 1.0 with probability 1-p. The fourth component is always 0 except at times t_{2m}, when targets are given and its activation is 1.0. The target at t_0 is 0. If the 2nd component was active at t_{2m-1}, then the target at t_{2m} is the sum of the previous target at t_{2m-2} and the "marked" first input component at t_{2m-1}.
Figure 4.1: Illustration of the continual addition (and multiplication) tasks.
Hence non-initial targets depend on events that happened at least 2T steps ago. Note that occurrences of "value markers" and targets oscillate. See Figure 4.1 for an illustration of the task.
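The stream generator below is our own reading of this set-up, specialized to Task 1 (p = 1.0; see below). Event times follow t_n = t_{n-1} + T + (-1)^n V; the one-step minimum spacing is an assumption we add to keep successive events distinct.

```python
import random

def addition_stream(T=20):
    """Continual addition stream (Task 1): yields (input vector, target-or-None).
    Odd events t_(2m-1) mark a value (2nd component); even events t_(2m)
    request a target (4th component) equal to the previous target plus the
    marked value."""
    t, n, running, pending = 0, 1, 0.0, 0.0
    t_next = max(1, T - random.randint(0, T))          # t_1 = t_0 + T + (-1)^1 V
    while True:
        x1 = random.uniform(-1.0, 1.0)                 # 1st component: random value
        vec, target = [x1, 0.0, 0.0, 0.0], None
        if t == t_next:
            if n % 2 == 1:                             # marker event: remember x1
                vec[1], pending = 1.0, x1
            else:                                      # target event: running sum
                vec[3] = 1.0
                running += pending
                target = running
            n += 1
            t_next = t + max(1, T + (-1) ** n * random.randint(0, T))
        yield vec, target
        t += 1
```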
All streams are stopped once the absolute output error exceeds 0.04. Test streams are almost unlimited (max. length = 1000 target occurrences), but training streams end after at most 10 target occurrences. Learning and testing alternate: after each training stream we freeze the weights and feed 100 test streams. Our performance measure is the average test stream size.
Task 1: Continual addition. p = 1.0 (no multiplication), T = 20. Task 1 essentially requires the network to keep adding (possibly negative) values to the already existing internal state.

Task 2: Continual addition and multiplication. p = 0.5, T = 20. If the 3rd input component is active at t_{2m-1} and the 1st is negative, then the latter gets replaced by its absolute value.

Task 3: Gliding addition. Like Task 1, but targets at times t_{2m+2} equal the sum of the two most recent marked values, at times t_{2m+1} and t_{2m-1} (the first target, at t_2, equals the first value, at t_1). T = 10. Task 3 is harder than Task 1 because it requires selective partial resets of the network.
Task 1. Both traditional LSTM and LSTM with forget gates learn the task. The worse performance of LSTM with forget gates is caused by slower convergence, because the net has to learn to remember everything and not to forget.

Task 2. LSTM with forget gates solves the problem even when addition and multiplication are combined, whereas traditional LSTM's solutions are not sufficiently accurate. This shows that forget gates add algorithmic functionality to memory blocks besides releasing resources during runtime (their original purpose, which is not essential here).

Task 3. Traditional LSTM cannot solve the problem at all, whereas LSTM with forget gates does find good and even "perfect" solutions. Why? The forget gates learn to prevent LSTM's uncontrolled internal state growth (see Section 3.3.4) by resetting states once stored information becomes obsolete.

The results confirm that forget gates are mandatory for LSTM fed with continual input streams (Chapter 3), where obsolete memories need to be discarded at some point (see "Task 3: Gliding addition"). Experiment 2 shows that forget gates also greatly facilitate operations involving multiplication.
4.3 Conclusion

In this chapter we demonstrated that forget gates do not only serve for the processing of continual input streams but also augment LSTM's arithmetic capabilities.

We presented tasks on continual input streams with a level of arithmetic complexity where traditional LSTM fails but LSTM with forget gates solves the tasks in an elegant way. On the other hand, we have not yet found a task that traditional LSTM can solve but LSTM with forget gates cannot.
Chapter 5

Learning Precise Timing with Peephole LSTM

Figure 5.1: LSTM memory block with one cell; peephole connections connect s_c to the gates.
there was a helpful "marker" input informing the network that its next action would be crucial. Thus the network did not really have to learn to measure a time interval of 50 steps; it just had to learn to store relevant information for 50 steps and use it once the marker was observed, something that is impossible for traditional RNNs but comparatively easy for LSTM.

But what if there are no such markers at all? What if the network itself has to learn to measure and internally represent the duration of task-specific intervals, or to generate sequences of patterns separated by exact intervals? Here we will study to what extent this is possible. The highly nonlinear tasks in the present chapter do not involve any time marker inputs; instead they require the network to time precisely and robustly across long time lags in continual input streams.

Before we describe our new timing experiments we will first identify a weakness in LSTM's connection scheme and introduce peephole connections as a remedy (Section 5.2). Sections 5.3 and 5.4 describe the modified forward and backward pass for "peephole LSTM."
The peephole connections for the input gate and the forget gate are incorporated in equations 5.1 and 5.2 by including the CECs (containing the cell states) of memory block j as source units.
Step 1. At t = 0, the state s_c(t) of memory cell c is initialized to zero; subsequently (t > 0) it is calculated by adding the squashed, gated input to the state at the previous time step, s_c(t-1), which is multiplied (gated) by the forget gate activation y^{\varphi_j}(t):

  net_{c_j^v}(t) = \sum_m w_{c_j^v m}\, y^m(t-1) , \qquad s_{c_j^v}(t) = y^{\varphi_j}(t)\, s_{c_j^v}(t-1) + y^{in_j}(t)\, g(net_{c_j^v}(t)) .   (5.3)
Step 2. The output gate activation y^{out} is computed as:

  net_{out_j}(t) = \sum_m w_{out_j m}\, y^m(t-1) + \sum_{v=1}^{S_j} w_{out_j c_j^v}\, s_{c_j^v}(t) , \qquad y^{out_j}(t) = f_{out_j}(net_{out_j}(t)) .   (5.4)
Equation 5.4 includes the peephole connections for the output gate from the CECs of memory block j, with the cell states s_c(t) as updated in step 1. The cell output y^c is computed as:

  y^{c_j^v}(t) = y^{out_j}(t)\, s_{c_j^v}(t) .   (5.5)

The equations for the output units k remain as specified in equation (2.9).
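A sketch of this two-step forward pass follows. It is ours, reusing f and g from the Chapter 2 sketch; since equations 5.1 and 5.2 are not shown in this excerpt, the peephole terms for the input and forget gates below follow the verbal description above and should be read as an assumption.

```python
def peephole_block_forward(y_prev, s_prev, w_in, w_out, w_phi, W_c,
                           p_in, p_out, p_phi):
    """Peephole LSTM forward step for one block; no output squashing.
    p_in, p_out, p_phi: peephole weight vectors, one weight per cell.
    Input and forget gates see s(t-1); the output gate sees the new s(t)."""
    y_in  = f(w_in  @ y_prev + p_in  @ s_prev)     # input gate, cf. eq. (5.1)
    y_phi = f(w_phi @ y_prev + p_phi @ s_prev)     # forget gate, cf. eq. (5.2)
    s = y_phi * s_prev + y_in * g(W_c @ y_prev)    # step 1: state update, eq. (5.3)
    y_out = f(w_out @ y_prev + p_out @ s)          # step 2: output gate, eq. (5.4)
    return y_out * s, s                            # cell output, eq. (5.5)
```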
5.4 Gradient-Based Backward Pass

The revised update scheme for memory blocks allows for treating peephole connections like regular connections (see Sections 5.2 and 5.3), and so requires only minor changes to the backward pass (Chapter 3). We will present it below but not fully re-derive it. We will, however, point out the differences to the previous equations in Section 3.2.2. Appendix B gives pseudo-code for the entire algorithm.

In what follows we will present equations for LSTM with forget gates and peephole connections, but without output squashing. The sign \stackrel{tr}{=} will indicate where we use error truncation.

During each step in the forward pass, no matter whether a target is given or not, we need to update the partial derivatives \partial s_{c_j^v}/\partial w_{lm} and \partial s_{c_j^v}/\partial w_{l c_j^{v'}} for weights to the cell (l = c_j^v), to the input gate (l = in), and to the forget gate (l = \varphi):

  \frac{\partial s_{c_j^v}(t)}{\partial w_{c_j^v m}} \stackrel{tr}{=} \frac{\partial s_{c_j^v}(t-1)}{\partial w_{c_j^v m}}\, y^{\varphi_j}(t) + g'(net_{c_j^v}(t))\, y^{in_j}(t)\, y^m(t-1) ,   (5.6)

  \frac{\partial s_{c_j^v}(t)}{\partial w_{in_j m}} \stackrel{tr}{=} \frac{\partial s_{c_j^v}(t-1)}{\partial w_{in_j m}}\, y^{\varphi_j}(t) + g(net_{c_j^v}(t))\, f_{in_j}'(net_{in_j}(t))\, y^m(t-1) ,   (5.7a)

  \frac{\partial s_{c_j^v}(t)}{\partial w_{in_j c_j^{v'}}} \stackrel{tr}{=} \frac{\partial s_{c_j^v}(t-1)}{\partial w_{in_j c_j^{v'}}}\, y^{\varphi_j}(t) + g(net_{c_j^v}(t))\, f_{in_j}'(net_{in_j}(t))\, s_{c_j^{v'}}(t-1) ,   (5.7b)

  \frac{\partial s_{c_j^v}(t)}{\partial w_{\varphi_j m}} \stackrel{tr}{=} \frac{\partial s_{c_j^v}(t-1)}{\partial w_{\varphi_j m}}\, y^{\varphi_j}(t) + s_{c_j^v}(t-1)\, f_{\varphi_j}'(net_{\varphi_j}(t))\, y^m(t-1) ,   (5.8a)

  \frac{\partial s_{c_j^v}(t)}{\partial w_{\varphi_j c_j^{v'}}} \stackrel{tr}{=} \frac{\partial s_{c_j^v}(t-1)}{\partial w_{\varphi_j c_j^{v'}}}\, y^{\varphi_j}(t) + s_{c_j^v}(t-1)\, f_{\varphi_j}'(net_{\varphi_j}(t))\, s_{c_j^{v'}}(t-1) ,   (5.8b)

with \partial s_{c_j^v}(0)/\partial w_{lm} = \partial s_{c_j^v}(0)/\partial w_{l c_j^{v'}} = 0 for l \in \{in, \varphi, c_j^v\}. Equations 5.7b and 5.8b are for the peephole connection weights.
Here we use the
ustomary squared error obje
tive fun
tion based on targets tk , yielding:
Æk (t) = fk0 (netk (t)) ek (t) ; (5.10)
5.5. EXPERIMENTS 37
where ek (t) := tk (t) yk (t) is the externally inje
ted error. The weight
hanges for
onne
tions
to the output gate (of the j -th memory blo
k) from the sour
e units (as spe
ied by the network
topology) woutj m and for the peephole
onne
tions woutj
vj are:
woutj m (t) = Æoutj (t) ym (t) ; woutj
vj (t) = Æoutj (t) s
vj (t) ; (5.11a)
0 1
Sj
X X
Æoutj (t) =tr 0 (netout (t))
fout j j
s
v (t)
j wk
vj Æk (t)A : (5.11b)
v =1 k
Output squashing (removed here) would require the in
orporation of the derivative of the output
squashing fun
tion in (5.11b). To
al
ulate weight
hanges wlm and wl
vj (peephole
onne
- 0
tion weights) for
onne
tions to the
ell (l =
vj ), the input gate (l = in), and the forget gate
(l = ') we use the partials from Equations 5.6, 5.7b, and 5.8b:
s v (t)
w
vj m(t) = es
vj (t) w
j v (5.12)
m j
Sj
X s
vj (t) Sj
X s
vj (t)
winj m(t) = es
vj (t)
winj m
; winj
vj (t) =
0 es
vj (t)
winj
vj 0
(5.13)
v =1 v =1
Sj
X s
vj (t) Sj
X s
vj (t)
w'j m(t) = es
vj (t)
w'j m
; w'j
vj (t) =
0 es
vj (t)
w'j
vj0
(5.14)
v =1 v =1
where the internal state error es
vj is separately
al
ulated for ea
h memory
ell:
!
X
es
vj (t) =
tr
y outj
(t) wk
vj Æk (t) : (5.15)
k
Like traditional LSTM, LSTM with forget gates and peephole connections is still local in space and time. The increase in complexity due to peephole connections is small: 3 weights per cell.
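The bookkeeping behind Equations 5.6-5.8 and 5.12-5.15 can be sketched compactly. The Python fragment below is our own illustration for one block with one cell and scalar partials; the names (dS, e_s) and the per-source-unit call convention are assumptions, not the thesis code. Here g_net and dg_net stand for g(net_c) and g'(net_c), and df_in, df_phi for the gate sigmoid derivatives.

def update_partials(dS, y_phi, y_in, g_net, dg_net, df_in, df_phi, prev_s, src):
    """Truncated forward-pass partials ds/dw (Eqs. 5.6-5.8) for one source
    activation src = y^m(t-1), or src = s(t-1) for the peephole weights."""
    dS['c']   = dS['c']   * y_phi + dg_net * y_in  * src   # Eq. 5.6
    dS['in']  = dS['in']  * y_phi + g_net  * df_in * src   # Eqs. 5.7a/b
    dS['phi'] = dS['phi'] * y_phi + prev_s * df_phi * src  # Eqs. 5.8a/b
    return dS

def weight_changes(dS, e_s, alpha):
    """Eqs. 5.12-5.14: weight changes scaled by the internal state error
    e_s (Eq. 5.15) and the learning rate alpha."""
    return {l: alpha * e_s * dS[l] for l in ('c', 'in', 'phi')}

The key design point is that the partials are updated during the forward pass, so no activations need to be stored over time; this is what keeps the algorithm local in space and time.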
5.5 Experiments

We study LSTM's performance on three tasks that require the precise measurement or generation of delays. We compare conventional to peephole LSTM, analyze the solutions, or explain why none was found.

Measuring spike delays (MSD). See Section 5.5.2. The goal is to classify input sequences consisting of sharp spikes. The class depends on the interval between spikes. We consider two versions of the task: continual (MSD) and non-continual (NMSD). NMSD sequences stop after the second spike, whereas MSD sequences are continual spike trains. Both NMSD and MSD require the network to measure intervals between spikes; MSD also requires the production of stable results in the presence of continually streaming inputs, without any external reset of the network's state. Can LSTM learn the difference between almost identical pattern sequences that differ only by a small lengthening of the interval (e.g., from n to n+1 steps) between input spikes? How does the difficulty of this problem depend on n?
Generating timed spikes (GTS). See Section 5.5.3. The GTS task can be obtained from the MSD task by exchanging inputs and targets. It requires the production of continual spike trains, where the interval between spikes must reflect the magnitude of an input signal that may change after every spike.

GTS is a special case of periodic function generation (PFG, see below). In contrast to previously studied PFG tasks (Williams & Zipser, 1989; Doya & Yoshizawa, 1989; Tsung & Cottrell, 1995), GTS is highly nonlinear and involves long time lags between significant output changes, which cannot be learned by conventional RNNs. Previous work also did not focus on stability issues. Here, by contrast, we demand that the generation be stable for 1000 successive spikes. We systematically investigate the effect of minimal time lag on task difficulty.

Additional periodic function generation tasks (PFG). See Section 5.5.4. We study the problem of generating periodic functions other than the spike trains above. The classic examples are smoothly oscillating outputs such as sine waves, which are learnable by fully connected teacher-forced RNNs whose units are all output units with teacher-defined activations (Williams & Zipser, 1989). An alternative approach trains an RNN to predict the next input; after training, outputs are fed back directly to the input so as to generate the waveform (Doya & Yoshizawa, 1989; Tsung & Cottrell, 1995; Weiss, 1999; Townley et al., 1999).

Here we focus on more difficult, highly nonlinear, triangular and rectangular waveforms, the latter featuring long time lags between significant output changes. Again, traditional RNNs cannot learn tasks involving long time lags (Hochreiter, 1991; Bengio et al., 1994), and previous work did not focus on stability issues. By contrast, we demand that the generation be stable for 1000 successive periods of the waveform.
5.5.1 Network Topology and Experimental Parameters

We found that comparatively small LSTM nets can already solve the tasks above. A single input unit (used only for tasks where there is input) is fully connected to the hidden layer, consisting of a single memory block with one cell. The cell output is connected to the cell input, to all three gates, and to a single output unit (Figure 5.2). All gates, the cell itself, and the output unit are connected to a bias unit (a unit with constant activation one) as well. The bias weights to input gate, forget gate, and output gate are initialized to 0.0, -2.0 and +2.0, respectively. (Although not critical, these values have been found empirically to work well; we use them for all our experiments.) All other weights are initialized to uniform random values in the range [-0.1, 0.1]. In addition to the three peephole connections there are 14 adjustable weights: 9 "unit-to-unit" connections and 5 bias connections. The cell's input squashing function g is the identity function. The squashing function of the output unit is a logistic sigmoid with range [0, 1] for MSD and GTS (except where explicitly stated otherwise), and the identity function for PFG. (A sigmoid function would work as well, but we focus on the simplest system that can solve the task.)
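As a sketch, this initialization can be written down directly; the grouping of the parameters below is our own convention, while the bias values and the uniform range come from the text above.

import numpy as np

rng = np.random.default_rng()

# One block, one cell: 9 unit-to-unit + 3 peephole connections, small random.
unit_to_unit = rng.uniform(-0.1, 0.1, 9)
peepholes    = rng.uniform(-0.1, 0.1, 3)

# 5 bias connections: the three gates (fixed starting values), the cell,
# and the output unit (small random like the other weights).
bias = {'in': 0.0, 'phi': -2.0, 'out': 2.0,
        'cell': rng.uniform(-0.1, 0.1),
        'output_unit': rng.uniform(-0.1, 0.1)}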
Our networks process continual streams of inputs and targets; only at the beginning of a stream are they reset. They must learn to always predict the target t_k(t), producing a stream of output values (predictions) y_k(t). A prediction is considered correct if the absolute output error |e_k(t)| = |t_k(t) - y_k(t)| is below 0.49 for binary targets (MSD, NMSD and GTS tasks), and below 0.3 otherwise (PFG tasks). Streams are stopped as soon as the network makes an incorrect prediction, or after a given maximal number of successive periods (spikes): 100 during training, 1000 during testing.

Learning and testing alternate: after each training stream, we freeze the weights and generate
Figure 5.2: Three-layer LSTM topology with one input and one output unit. Recurrence is limited to the hidden layer, which consists of a single LSTM memory block with a single cell. All 9 "unit-to-unit" connections are shown, but bias and peephole connections are not.
a test stream. Our performance measure is the achieved test stream size: 1000 successive periods are deemed a "perfect" solution. Training is stopped once a task is learned or after a maximal number of 10^7 training streams (10^8 for the MSD and NMSD tasks). Weight changes are made after each target presentation. The learning rate is set to 10^{-5}; we use the momentum algorithm (Plaut, Nowlan, & Hinton, 1986) with momentum parameter 0.999 for the GTS task, 0.99 for the PFG and NMSD tasks, and 0.9999 for the MSD task. We roughly optimized the momentum parameter by trying out different orders of magnitude.

For tasks GTS and MSD, the stochastic input streams are generated online. A perfect solution correctly processes 10 test streams, to make sure the network provides stable performance independent of the stream beginning, which we found to be critical. All results are averages over 10 independently trained networks.
5.5.2 Measuring Spike Delays (MSD)

The network input is a spike train, represented by a series of ones and zeros, where each "one" indicates a spike. Spikes occur at times T(n), set F + I(n) steps apart, where F is the minimum interval between spikes and I(n) is an integer offset, randomly reset for each spike:

    T(0) = F + I(0) ;  T(n) = T(n-1) + F + I(n)   (n \in \mathbb{N}) .

The target given at times t = T(n) is the delay I(n). (Learning to measure the total interval F + I(n), that is, adding the constant F to the output, is no harder.) A perfect solution correctly processes all possible input test streams. For the non-continual version of the task (NMSD) a stream consists of a single period (spike).
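A short sketch of such a stream generator, under our own naming and layout conventions (the thesis generates streams online rather than as arrays):

import numpy as np

def msd_stream(F, delays, n_spikes, rng):
    """MSD input/target stream: spikes F + I(n) steps apart; the delay
    I(n) is the target, given only at the spike times t = T(n)."""
    t, spike_times, offsets = 0, [], []
    for _ in range(n_spikes):
        i = int(rng.choice(delays))
        t += F + i                    # T(n) = T(n-1) + F + I(n)
        spike_times.append(t)
        offsets.append(i)
    x = np.zeros(spike_times[-1] + 1)  # input: 1 at spikes, 0 elsewhere
    targets = {}                       # target defined only at spike times
    for t_n, i_n in zip(spike_times, offsets):
        x[t_n] = 1.0
        targets[t_n] = i_n
    return x, targets

For NMSD, n_spikes would be 2 and the stream would stop after the second spike.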
MSD Results. Table 5.1 reports results for NMSD with I(n) \in \{0, 1\} for various minimum spike intervals F. The results suggest that the difficulty of the task (measured as the average number of training streams necessary to solve it) increases drastically with F (see Figure 5.3). A qualitative explanation is that longer intervals necessitate finer tuning of the weights, which
    T     F    I(n)        LSTM                     Peephole LSTM
                           % Sol.  Train. [10^3]    % Sol.  Train. [10^3]
    NMSD  10   {0,1}       100     160 +- 14        100     125 +- 14
          20   {0,1}       100     732 +- 97        100     763 +- 103
          30   {0,1}       100     17521 +- 2200     80     12885 +- 2091
          40   {0,1}        20     37533 +- 4558     70     25686 +- 2754
          50   {0,1}         0     --                10     32485
    MSD   10   {0,1}        10     8850              20     29257 +- 13758
          10   {0,1,2}      20     27453 +- 11750    60     9791 +- 2660

Table 5.1: Results comparing conventional and peephole LSTM on the NMSD and MSD tasks. Columns show the task T, the minimum spike interval F, the set of delays I(n), the percentage of perfect solutions found, and the mean and standard deviation of the number of training streams required.
Figure 5.3: Average number of training streams required for the NMSD task with I(n) \in \{0, 1\}, plotted against the minimum spike interval F.
requires more training. Peephole LSTM outperforms LSTM. The continual MSD task for F = 10, with I(n) \in \{0, 1\} or I(n) \in \{0, 1, 2\}, is solved with or without peephole connections (Table 5.1).

In the next experiment we evaluate the influence of the range of I(n), using the identity function instead of the logistic sigmoid as output squashing function. We let I(n) range over \{0, i\} or \{0,..,i\} for all i \in \{1,..,10\}. Results are reported in Table 5.2 for NMSD with F = 10. The training duration depends on the size of the set from which I(n) is drawn, and on the maximum distance (MD) between elements in the set. A larger MD leads to a better separation of patterns, thus facilitating recognition. To confirm this, we ran the NMSD task with F = 10 and I(n) \in \{0, i\} with i \in \{2,..,10\} (size 2, MD i), as shown in the bottom half of Table 5.2. As expected, training time decreases with increasing MD. A larger set of possible delays should make the task harder. Surprisingly, for I(n) \in \{0,..,i\} (size i+1, MD i) with i ranging from 1 to 5 the task appears to become easier (due to the simultaneous increase of MD) before the
Table 5.2: Results for NMSD with F = 10, showing the set of delays I(n) and, for LSTM and peephole LSTM, the percentage of solutions and the number of training streams [10^3].
difficulty increases rapidly for larger i. Thus the task's difficulty does not grow linearly with the number of possible delays, corresponding to values (states) inside a cell the network must learn to distinguish. Instead we observe that LSTM fares best at distinguishing 6 or 7 different delays. One is tempted to draw a connection to the "magic number" of 7 +- 2 items that an average human can store in Short Term Memory (STM) (Miller, 1956), but such a link seems rather far-fetched to us.

We also observe that the results for I(n) \in \{0, 1\} are better than those obtained with a sigmoid function (compare Table 5.1). Fluctuations in the stochastic input can cause temporary saturation of sigmoid units; the resulting tiny derivatives for the backward pass will slow down learning (LeCun, Bottou, Orr, & Muller, 1998).

MSD Analysis. LSTM learned to measure time in two principled ways. The first is to slightly increase the cell contents s_c at each time step, so that the elapsed time can be read off the value of s_c. This kind of solution is shown on the left-hand side of Figure 5.4. (The state reset performed by the forget gate is essential only for continual online prediction over many periods.) The second way is to establish internal oscillators and derive the elapsed time from their phases (right-hand side of Figure 5.4). Both kinds of solutions can be learned with or without peephole connections, as it is never necessary here to close the output gate for more than one time step (see bottom row of Figure 5.4).
Figure 5.4: Two ways to time. Test runs with trained LSTM networks for the MSD task with F = 10 and I(n) \in \{0, 1\}. Top: target values t_k and network output y_k; middle: cell state s_c and cell output y_c; bottom: activation of the input gate y^{in}, forget gate y^{\varphi}, and output gate y^{out}.
Why may the output gate be left open? Targets occur rarely, hence the network output can be ignored most of the time. Since there is only one memory block, mutual perturbation of blocks is not possible. This type of reasoning is invalid, though, for more complex measuring tasks involving larger nets or more frequent targets. Figure 5.5 shows the behavior of LSTM in such a regime. With peephole LSTM the output gate opens only when a target is provided, whereas conventional LSTM does not learn this behavior. Note that in some cases these "cleaner" solutions with peephole connections took longer to be learned (compare Tables 5.1 and 5.2), because they require more complex behavior.
Figure 5.5: Behavior of peephole LSTM (left) versus LSTM (right) for the MSD task with F = 10 and I(n) \in \{0, 1, 2\}. Top: target values t_k and network output y_k; middle: cell state s_c and cell output y_c; bottom: activation of the input gate y^{in}, forget gate y^{\varphi}, and output gate y^{out}.
5.5.3 Generating Timed Spikes (GTS)

The GTS task reverses the roles of inputs and targets of the MSD task: the spike train T(n), defined as for the MSD task, is now the network's target, while the delay I(n) is provided as input.

GTS Results. The GTS task could not be learned by networks without peephole connections; thus we report results with peephole LSTM only. Results with various minimum spike intervals F (Figure 5.6) suggest that the required training time increases dramatically with F, as with the NMSD task (Section 5.5.2). The network output during a successful test run for the GTS task with F = 10 is shown at the top left of Figure 5.7. Peephole LSTM also solves the task for F = 10 and I(n) \in \{0, 1\} or \{0, 1, 2\}, as shown in Figure 5.6 (left).
    F    I(n)       % Sol.   Train. [10^3]
    10   {0}        100      41 +- 4
    20   {0}        100      67 +- 8
    30   {0}         80      845 +- 82
    40   {0}        100      1152 +- 101
    50   {0}        100      2538 +- 343
    10   {0,1}       50      1647 +- 46
    10   {0,1,2}     30      954 +- 393

Figure 5.6: Results for the GTS task. The table (left) shows the minimum spike interval F, the set of delays I(n), the percentage of perfect solutions found, and the mean and standard deviation of the number of training streams required. The graph (right) plots the number of training streams against the minimum spike interval F, for I(n) \in \{0\}.
GTS Analysis. Figure 5.7 shows test runs with trained networks for the GTS task. The output gates open only at the onset of a spike and close again immediately afterwards. Hence, during a spike, the output of the cell equals its state (middle row of Figure 5.7). The opening of the output gate is triggered by the cell state s_c: it starts to open once the input from the peephole connection outweighs a negative bias. The opening self-reinforces via a connection from the cell output, which produces the high nonlinearity necessary for generating the spike. This process is terminated by the closing of the forget gate, triggered by the cell output spike. Simultaneously the input gate closes, so that s_c is reset.

In the particular solution shown on the right-hand side of Figure 5.7 for F = 50, the role of the forget gate in this process is taken over by a negative self-recurrent connection of the cell in conjunction with a simultaneous opening of the other two gates. We tentatively removed the forget gate (by pinning its activation to 1.0) without changing the weights learned with the forget gate's help. The network then quickly learned a perfect solution. Learning from scratch without a forget gate, however, never yields a solution! The forget gate is essential during the learning phase, where it prevents the accumulation of irrelevant errors.

The exact timing of a spike is determined by the growth of s_c, which is tuned through connections to input gate, forget gate, and the cell itself. To solve GTS for I(n) \in \{0, 1\} or I(n) \in \{0, 1, 2\}, the network essentially translates the input into a scaling factor for the growth of s_c (Figure 5.8).
5.5.4 Periodic Function Generation (PFG)

We now train LSTM to generate real-valued periodic functions, as opposed to the spike trains of the GTS task. At each discrete time step we provide a real-valued target, sampled with frequency F from a target function f(t). No input is given to the network.

The task's degree of difficulty is influenced by the shape of f and the sampling frequency F. The former can be partially characterized by the absolute maximal values of its first and second derivatives, max|f'| and max|f''|. Since we work in discrete time, and with non-differentiable
Figure 5.7: Test run of a trained peephole LSTM network for the GTS task with I(n) \in \{0\}, and a minimum spike interval of F = 10 (left) vs. F = 50 (right). Top: target values t_k and network output y_k; middle: cell state s_c and cell output y_c; bottom: activation of the input gate y^{in}, forget gate y^{\varphi}, and output gate y^{out}.
Generally speaking, the larger these values, the harder the task. F determines the number of distinguishable internal states required to represent the periodic function in internal state space. The larger F, the harder the task. We generate sine waves f_{cos}, triangular functions f_{tri}, and rectangular functions f_{rect}, all ranging between 0.0 and 1.0, each sampled with two frequencies,
Figure 5.8: Test run of a trained peephole LSTM network for the GTS task with F = 10 and I(n) \in \{0, 1, 2\}. Top: target values t_k and network output y_k; middle: cell state s_c and cell output y_c; bottom: activation of the input gate y^{in}, forget gate y^{\varphi}, and output gate y^{out}.
F = 10 and F = 25:

    f_{cos}(t) := \frac{1}{2}\left(1 - \cos\frac{2\pi t}{F}\right)  \;\Rightarrow\;  \max|f'_{cos}| = \max|f''_{cos}| = \pi/F ;

    f_{tri}(t) := \begin{cases} \frac{2}{F}(t \bmod F) & \text{if } (t \bmod F) \le F/2 \\ 2 - \frac{2}{F}(t \bmod F) & \text{otherwise} \end{cases}  \;\Rightarrow\;  \max|f'_{tri}| = 2/F ,\; \max|f''_{tri}| = 4/F ;

    f_{rect}(t) := \begin{cases} 1 & \text{if } (t \bmod F) > F/2 \\ 0 & \text{otherwise} \end{cases}  \;\Rightarrow\;  \max|f'_{rect}| = \max|f''_{rect}| = 1 .
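The three target functions are simple to write down directly; the sketch below is our own restatement of the definitions above (function names mirror the notation; everything else is standard Python).

import math

def f_cos(t, F):
    # Sine wave in [0, 1] with period F.
    return 0.5 * (1.0 - math.cos(2.0 * math.pi * t / F))

def f_tri(t, F):
    # Triangular wave: linear up in the first half-period, down in the second.
    r = t % F
    return 2.0 * r / F if r <= F / 2 else 2.0 - 2.0 * r / F

def f_rect(t, F):
    # Rectangular wave: 0 in the first half-period, 1 in the second.
    return 1.0 if (t % F) > F / 2 else 0.0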
one half, from 0.17 +- 0.019 down to 0.086 +- 0.002. Perfect solutions were learned in all cases, but only after (2704 +- 49) \cdot 10^3 training streams, as opposed to (149 +- 7) \cdot 10^3 training streams

Figure 5.9: Target values t_k and network output y_k during test runs of trained peephole LSTM networks on the PFG task for the periodic functions f_{cos} (top), f_{tri} (middle), and f_{rect} (bottom), with periods F = 10 (left) and F = 25 (right).

with connections from cell output to gates, which shows that they are indeed used even though they are not mandatory for this task.
The target functions f_{tri} and f_{rect} required peephole connections for both values of F. Figure 5.11 shows typical network solutions for the f_{rect} target function. The cell output y_c equals the cell state s_c in the second half of each period (when f_{rect} = 1) and is zero in the first half, because the output gate closes the cell (triggered by s_c, which is accessed via the peephole connections). The timing information is read off s_c, as explained in Section 5.5.2. Furthermore, the two states of the f_{rect} function are distinguished: s_c is counted up when f_{rect} = 0 and counted down again when f_{rect} = 1. This is achieved through a negative connection from the cell output to the cell input, feeding negative input into the cell only when the output gate
Figure 5.10: Test runs of a trained LSTM network with (right) vs. without (left) peephole connections on the f_{cos} PFG task with F = 10. Top: target values t_k and network output y_k; middle: cell state s_c and cell output y_c; bottom: activation of the input gate y^{in}, forget gate y^{\varphi}, and output gate y^{out}.
is open; otherwise the input is dominated by the positive bias connection. Networks without peephole connections cannot use this mechanism, and did not find any alternative solution. Throughout all experiments peephole connections were necessary to trigger the opening of gates while the output gate was closed, by granting unrestricted access to the timer implemented by the CEC. The gates learned to combine this information with their bias so as to open on reaching a certain trigger threshold.
Figure 5.11: Test runs of trained peephole LSTM networks on the f_{rect} PFG task with F = 10 (left) and F = 25 (right). Top: target values t_k and network output y_k; middle: cell state s_c and cell output y_c; bottom: activation of the input gate y^{in}, forget gate y^{\varphi}, and output gate y^{out}.

Figure 5.12: Cell states and gate activations at the onset (zero phase) of the first 20 cycles during a test run with a trained LSTM network on the f_{cos} PFG task with F = 10. Note that the initial state (at cycle 0) is quite far from the equilibrium state.

5.5.5 General Observation: Network Initialization

At the beginning of each stream, cell states and gate activations are initialized to zero. This initial state is almost always quite far from the corresponding state in the same phase of later periods in the stream. Figure 5.12 illustrates this for the f_{cos} task. After a few consecutive periods, cell states and gate activations of successful networks tend to settle to very stable, phase-specific values, which are typically quite different from the corresponding values in the first period. This suggests that the initial state of the network should be learned as well, as proposed by Forcada and Carrasco (1995), instead of arbitrarily initializing it to zero.
5.6 Conclusion

Previous work on LSTM did not require the network to extract relevant information conveyed by the duration of intervals between events. Here we show that LSTM can solve such highly nonlinear tasks as well, by learning to precisely measure time intervals, provided we furnish LSTM cells with peephole connections that allow them to inspect their current internal states.

It is remarkable that peephole LSTM can learn exact and extremely robust timing algorithms without teacher forcing, even in case of very uninformative, rarely changing target signals. This makes it a promising approach for numerous real-world tasks whose solution partly depends on the precise duration of intervals between relevant events.
Chapter 6

Simple Context Free and Context Sensitive Languages

state automaton with access to two counters that can be incremented or decremented. To our knowledge no RNN has been able to learn a CSL.

We are using LSTM with forget gates and peephole connections as introduced in the previous chapters.
6.2 Experiments

The network sequentially observes exemplary symbol strings of a given language, also referred to as input sequences, presented one input symbol at a time. Following the traditional approach in the RNN literature, we formulate the task as a prediction task. At any given time step the target is to predict the possible next symbols, including the "end of string" symbol T. When more than one symbol can occur in the next step, all possible symbols have to be predicted, and none of the others.

Every input sequence begins with the start symbol S. The empty string, consisting of ST only, is considered part of each language. A string is accepted when all predictions have been correct. Otherwise it is rejected.
This prediction task is equivalent to a classification task with two classes, "accept" and "reject," because the system will make prediction errors for all strings outside the language. A system has learned a given language up to string size n once it is able to correctly predict all strings of size up to n.

Symbols are encoded locally by d-dimensional binary vectors with only one non-zero component, where d equals the number of language symbols plus one for either the start symbol in the input or the "end of string" symbol in the output (d input units, d output units). +1 signifies that a symbol is set and -1 that it is not; the decision boundary for the network output is 0.0.
CFL a^n b^n (Sun et al., 1993; Wiles & Elman, 1995; Tonkes & Wiles, 1997; Rodriguez et al., 1999). Here the strings in the input sequences are of the form a^n b^n; input and output vectors are 3-dimensional. Prior to the first occurrence of b, either a or b (or a or T at sequence beginnings) is possible in the next step. Thus, e.g., for n = 5:

    Input:  S   a   a   a   a   a   b   b   b   b   b
    Target: a/T a/b a/b a/b a/b a/b b   b   b   b   T

An example of a set of context-free production rules for the a^n b^n grammar is: S -> eps | A ; A -> ab | aAb, where S is the starting symbol, A is a non-terminal symbol and eps is the empty string.
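A sketch of the encoding and target construction just described (ours; the symbol ordering and function names are illustrative conventions, not the thesis code):

import numpy as np

IN_SYMS, OUT_SYMS = ('S', 'a', 'b'), ('a', 'b', 'T')

def encode(symbol, alphabet):
    # Local +1/-1 encoding: one component set to +1, all others to -1.
    v = -np.ones(len(alphabet))
    v[alphabet.index(symbol)] = 1.0
    return v

def anbn_sequence(n):
    """Input vectors and, per step, the set of legal next symbols for a^n b^n."""
    string = ['S'] + ['a'] * n + ['b'] * n
    targets = [{'a', 'T'}]                  # after S: another a, or the empty string
    targets += [{'a', 'b'}] * n             # before the first b: a or b possible
    targets += [{'b'}] * (n - 1) + [{'T'}]  # then the b's are forced, finally T
    return [encode(s, IN_SYMS) for s in string], targets

A prediction at some step is correct when exactly the output units of the legal next symbols are above the 0.0 decision boundary.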
CFL a^n b^m B^m A^n (Rodriguez & Wiles, 1998). The second half of a string from this palindrome or mirror language is completely predictable from the first half. The task involves an intermediate time lag of length 2m. Input and output vectors are 5-dimensional. Prior to the first occurrence of B, two symbols are possible in the next step. Thus, e.g., for n = 4, m = 3:

    Input:  S   a   a   a   a   b   b   b   B   B   B   A   A   A   A
    Target: a/T a/b a/b a/b a/b b/B b/B b/B B   B   A   A   A   A   T

The a^n b^m B^m A^n grammar can be produced by context-free rules similar to those of the a^n b^n grammar, using two non-terminal symbols (here written X and Y): S -> eps | X ; X -> aXA | Y ; Y -> bYB | eps.

CSL a^n b^n c^n. Input and output vectors are 4-dimensional. Prior to the first occurrence of b, two symbols are possible in the next step. Thus, e.g., for n = 5:

    Input:  S   a   a   a   a   a   b   b   b   b   b   c   c   c   c   c
    Target: a/T a/b a/b a/b a/b a/b b   b   b   b   c   c   c   c   c   T

The pumping lemma for context-free languages can be applied to show that a^n b^n c^n is not context-free. An intuitive explanation is that it is necessary to consider the number of a symbols when producing the b and c symbols; this requires context information.
6.2.1 Training and Testing

Learning and testing alternate: after each epoch (= 1000 training sequences) we freeze the weights and run a test. Even when all strings are processed correctly during training, it is necessary to test again with frozen weights once all weight changes have been executed. Apart from ensuring the learning of the training set, the test also determines generalization performance, which we did not optimize by using, say, a validation set.

Training and test sets incorporate all legal strings up to a given length: 2n for a^n b^n, 3n for a^n b^n c^n and 2(n + m) for a^n b^m B^m A^n. Training strings are presented in random order. Only exemplars from the class "accept" are presented. Training is stopped once all training sequences have been accepted, or after at most 10^7 training sequences. The generalization set is the largest
Figure 6.1: Three-layer LSTM topology with a single input and output. Recurrence is limited to the hidden layer, consisting here of a single LSTM memory block with a single cell. All 10 "unit-to-unit" connections are shown (but bias and peephole connections are not).

    Reference                    Hidden   Train.      Train.        Sol./    Best Test
                                 Units    Set [n]     Str. [10^3]   Tri.     [n]
    (Sun et al., 1993)^1           5      1,..,160       13.5       1/1      1,..,160
    (Wiles & Elman, 1995)          2      1,..,11      2000         4/20     1,..,18
    (Tonkes & Wiles, 1997)         2      1,..,10        10        13/100    1,..,12
    (Rodriguez et al., 1999)^2     2      1,..,11       267         8/50     1,..,16

Table 6.1: Previous results for the CFL a^n b^n, showing (from left to right) the number of hidden units or state units, the values of n used during training, the number of training sequences, the number of found solutions/trials and the largest accepted test set.
6.2.2 Network Topology and Parameters

CFL a^n b^n. We use one memory block (with one cell). With peephole connections there are 38 adjustable weights (3 peephole, 28 unit-to-unit and 7 bias connections).

CFL a^n b^m B^m A^n. We use two blocks with one cell each, resulting in 110 adjustable weights (6 peephole, 91 unit-to-unit and 13 bias connections).

CSL a^n b^n c^n. We use the same topology as for the a^n b^m B^m A^n language, but with 4 input and output units instead of 5, resulting in 90 adjustable weights (6 peephole, 72 unit-to-unit and 12 bias connections).

6.2.3 Previous Results

CFL a^n b^n. Published results on the a^n b^n language are summarized in Table 6.1. RNNs trained

^1 Sun's training set was augmented stepwise by sequences misclassified during testing, and in the final accepted set n was in {1,..,20} except for 20 random sequences up to length n = 160 (the exact generalization performance was unclear).
^2 Applying brute force search to the weights of the best network of Rodriguez et al. (1999) further improves performance to acceptance up to n = 28.
    Train.     Train.        %      Generalization
    Set [n]    Str. [10^3]   Sol.   Set [n]
    1,..,10    22 (19)       100    1,..,1000 (1,..,118)
    1,..,20    18 (19)       100    1,..,587  (1,..,148)
    1,..,30    16 (19)       100    1,..,1000 (1,..,408)
    1,..,40    25 (28)       100    1,..,1000 (1,..,628)
    1,..,50    42 (40)       100    1,..,767  (1,..,430)

Table 6.2: Results for the a^n b^n language, showing (from left to right) the values of n used during training, the average number of training sequences until best generalization was achieved, the percentage of correct solutions and the best generalization (average over all networks given in parentheses).
with plain BPTT tend to learn to just reproduce the input (Wiles & Elman, 1995; Tonkes & Wiles, 1997; Rodriguez et al., 1999). Sun et al. (1993) used a highly specialized architecture, the "neural pushdown automaton", which also did not generalize well (Sun et al., 1993; Das, Giles, & Sun, 1992).

CFL a^n b^m B^m A^n. Rodriguez and Wiles (1998) used BPTT-RNNs with 5 hidden nodes. After training with 51 \cdot 10^3 strings with n + m <= 12 (sequences of length up to 24), most networks generalized on longer off-training-set strings. The best network generalized to sequences up to length 36 (n = 9, m = 9). But none of them learned the complete training set.

CSL a^n b^n c^n. To our knowledge no previous RNN ever learned a CSL.
6.2.4 Results

CFL a^n b^n. 100% solved for all training sets (Table 6.2). Small training sets (n \in \{1,..,10\}) were already sufficient for perfect generalization up to the tested maximum: n \in \{1,..,1000\}. Note that long sequences of this kind require very stable, finely tuned control of the network's internal counters (Casey, 1996).

This performance is much better than that of previous approaches, where the largest set was learned by the specially designed neural pushdown automaton (Sun et al., 1993; Das et al., 1992): n \in \{1,..,160\}. The latter, however, required training sequences of the same length as the test sequences. From the training set with n \in \{1,..,10\} LSTM generalized to n \in \{1,..,1000\}, whereas the best previous result (see Table 6.1) generalized only to n \in \{1,..,18\} (even with a slightly larger training set: n \in \{1,..,11\}). In contrast to Tonkes and Wiles (1997), we did not observe our networks forgetting solutions as training progresses. So unlike all previous approaches, LSTM reliably finds solutions that generalize well.

The fluctuations in generalization performance for different training sets in Table 6.2 may be due to the fact that we did not optimize generalization performance by using a validation set. Instead we simply stopped after the epoch (= 1000 sequences) in which the training set was learned.

CFL a^n b^m B^m A^n. Training set a): 100% solved; after 29 \cdot 10^3 training sequences on average the best generalization was achieved. Training set b): 100% solved; after 26 \cdot 10^3 training sequences the best network generalized. Unlike the previous approach of Rodriguez and Wiles (1998), LSTM easily learns the complete training set and reliably finds solutions that generalize well.

CSL a^n b^n c^n. LSTM learns 4 of the 5 training sets in 10 out of 10 trials (only 9 out of 10 for the training set with n \in \{1,..,40\}) and generalizes well (Table 6.3). Small training sets (n \in \{1,..,40\}) were already sufficient for perfect generalization up to the tested maximum: n \in \{1,..,500\}, that is, sequences of length up to 1500. Even in the absence of any short training sequences (n \in \{N-1, N\}) LSTM learned well (see bottom half of Table 6.3).

We also modified the training procedure, by presenting each exemplary string without providing all possible next symbols as targets, but only the symbol that actually occurs in the current exemplar. This led to slightly longer training durations, but did not significantly change the results.
6.2.5 Analysis

How do the solutions discovered by LSTM work?

CFL a^n b^n. Figure 6.2 shows a test run with a network solution for n = 5. The cell state s_{c_1} increases while a symbols are fed into the network, then decreases (with the same step size) while b symbols are fed in. At sequence beginnings (when the first a symbols are observed), however, the step size is smaller due to the closed input gate, which is triggered by s_{c_1} itself. This results in "overshooting" the initial value of s_{c_1} at the end of a sequence, which in turn triggers the opening of the output gate, which in turn leads to the prediction of the sequence termination.

CFL a^n b^m B^m A^n. The behavior of a typical network solution is shown in Figure 6.3. The network learned to establish and control two counters. The two symbol pairs (a, A) and (b, B) are treated separately by two different cells, c_1 and c_2, respectively. Cell c_1 tracks the difference
Figure 6.2: CFL a^n b^n (n = 5): test run with a network solution. Top: network output y^k. Middle: cell state s_{c_1} and cell output y^{c_1}. Bottom: activations of the gates (input gate y^{in_1}, forget gate y^{\varphi_1} and output gate y^{out_1}).

between the number of observed a and A symbols. It opens only at the end of a string, where it predicts the final T. Cell c_2 treats the embedded b^m B^m substring in a similar way. While values are stored and manipulated within a cell, the output gate remains closed. This prevents the cell from disturbing the rest of the network and also protects its CEC against incoming errors.

CSL a^n b^n c^n. The network solutions use a combination of two counters, instantiated separately in the two memory blocks (Figure 6.4). Here one cell counts up, given an a input symbol, and counts down, given a b; a c in the input causes the input gate to close and the forget gate to reset the cell state. The other memory block does the same for b, c, and a,
Figure 6.3: CFL a^n b^m B^m A^n (n = 5, m = 4): test run with a network solution. Top: network output y^k. Middle: cell states s_{c_1}, s_{c_2} and cell outputs y^{c_1}, y^{c_2}. Bottom: activations of the gates (input gates y^{in}, forget gates y^{\varphi} and output gates y^{out}).

respectively. The opening of the output gate of the first block indicates the end of a string (and the prediction of the last T), triggered via its peephole connection.

Why does the network not generalize for short strings when using only two training strings, as for the a^n b^n c^n language (see Table 6.3)? The gate activations in Figure 6.4 show that activations
Figure 6.4: CSL a^n b^n c^n (n = 5): test run with a network solution (the system scales up to sequences of length 1000 and more). Top: network output y^k. Middle: cell states s_{c_1}, s_{c_2} and cell outputs y^{c_1}, y^{c_2}. Bottom: activations of the gates (input gates y^{in}, forget gates y^{\varphi} and output gates y^{out}).
slightly drift even when the input stays constant. Solutions take this state drift into account, and will not work without it or with too much of it, as in the case when the sequences are much shorter or longer than the few observed training examples. This imposes a limit on generalization in both directions (towards longer and shorter strings). We found solutions with less drift to generalize better.

Further improvements. Even better results can be obtained through increased training time and stepwise reduction of the learning rate, as done in (Rodriguez et al., 1999). The distribution of lengths of sequences in the training set also affects learning speed and generalization. A set containing more long sequences improves generalization for longer sequences. Omitting the sequence with n = 1 (and m = 1), typically the last one to be learned, has the same effect. Training sets with many short and many long sequences are learned more quickly than uniformly distributed ones.

Related tasks. The (ba^k)^n regular language is related to a^n b^n in the sense that it requires learning a counter, but the counter never needs to count down. This task is equivalent to the "Generating timed spikes" task (Section 5.5.3) learned by LSTM for k = 50 with n >= 1000. A hand-made, hardwired solution (no learning) of a second-order RNN worked for values of k up to 120 (Steijvers & Grunwald, 1996).

For all three tasks peephole connections are mandatory. The output gates remain closed for substantial time periods during each input sequence presentation (compare Figures 6.2, 6.3 and 6.4); the end of such a period is always triggered via peephole connections.
6.3 Conclusion

We found that LSTM clearly outperforms previous RNNs not only on regular language benchmarks (according to previous research) but also on context-free language (CFL) benchmarks; it learns faster and generalizes better. LSTM also is the first RNN to learn a context-sensitive language.

Although CFLs like those studied here may also be learnable by certain discrete symbolic grammar learning algorithms (SGLAs) (Sakakibara, 1997; Lee, 1996; Osborne & Briscoe, 1997), the latter exhibit more task-specific bias, and are not designed to solve numerous other sequence processing tasks involving noise, real-valued inputs / internal states, and continuous output trajectories, which LSTM solves easily (see previous chapters and Hochreiter and Schmidhuber (1997)). SGLAs include a large range of methods, such as decision-tree algorithms (see e.g., Quinlan (1992)), case-based and explanation-based reasoning (see e.g., Mitchell, Keller, and Kedar-Cabelli (1986), Porter, Bruce, Bareiss, and Holte (1990)), and inductive logic programming (see e.g., Zelle and Mooney (1993)).

Our findings reinforce the perception that LSTM is a very general and promising adaptive sequence processing device, with a wider field of potential applications than alternative RNNs.
Chapter 7

Time Series Predictable Through Time-Window Approaches

Figure: schematic of the prediction setup, in which the RNN (LSTM) receives x(t), is trained on the first difference Delta x(t) scaled by f_s, and its output is rescaled (1/f_s) and added to x(t) to yield the prediction x(t+p).

MLP. The MLPs we use for comparison have one hidden layer and are trained with BP. As with LSTM, the one output unit is linear and Delta x is the target. The input differs for each task but in general uses a time window with a time-space embedding. All units are biased and the learning rate is 10^{-3}.

Note that we do not use IO shortcuts, because they become short circuits during self-iteration, causing exponential growth of the output unit's activity.
7.3 Mackey-Glass Chaotic Time Series

The Mackey-Glass chaotic time series (Mackey & Glass, 1977) can be generated from the Mackey-Glass delay-differential equation:

    \dot{x}(t) = \frac{a \, x(t-\tau)}{1 + x^{c}(t-\tau)} - b \, x(t)        (7.1)

We generate benchmark sets using the parameters a = 0.2, b = 0.1, c = 10 and \tau = 17. For \tau > 16.8 the series becomes chaotic. \tau = 17 results in a quasi-periodic series with a characteristic period of about 50, lying on an attractor with fractal dimension D = 2.1. To generate these benchmark sets, Equation 7.1 is integrated using a four-point Runge-Kutta method with step size 0.1 and the initial condition x(t) = 0.8 for t < 0. The equation is integrated up to t = 5500, with the points from t = 200 to t = 3200 used for training and the points from t = 5000 to t = 5500 used for testing.
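A sketch of this benchmark generation (ours, not the thesis code) follows; for simplicity the delayed term is held fixed over each Runge-Kutta step, which is a common and adequate approximation at step size 0.1.

import numpy as np

a, b, c, tau, h = 0.2, 0.1, 10.0, 17.0, 0.1
d = int(tau / h)                              # delay measured in steps

def mg_rhs(x_now, x_delayed):
    return a * x_delayed / (1.0 + x_delayed ** c) - b * x_now

steps = 55_000                                # integrate up to t = 5500 with h = 0.1
x = np.empty(steps + 1)
x[0] = 0.8
for i in range(steps):
    xd = x[i - d] if i >= d else 0.8          # constant pre-history x(t) = 0.8, t < 0
    k1 = mg_rhs(x[i], xd)
    k2 = mg_rhs(x[i] + 0.5 * h * k1, xd)
    k3 = mg_rhs(x[i] + 0.5 * h * k2, xd)
    k4 = mg_rhs(x[i] + h * k3, xd)
    x[i + 1] = x[i] + h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

series = x[::10]                              # sample at unit time steps
train, test = series[200:3200], series[5000:5500]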
Figure 7.2 shows the first 100 points from the test set. Since the Mackey-Glass time series is chaotic, it is difficult to predict for values of T greater than its characteristic period of approximately 50. In the literature a number of different prediction offsets have been tried: T \in \{1, 6, 84, 85, 90\}. For the comparison of results we consider the predictions with offsets T \in \{84, 85, 90\} as equal tasks. For approaches that use as input a time window of past values it is common to use the four delays t, t-6, t-12 and t-18. These points represent an adequate delay-state embedding for the prediction of the Mackey-Glass series assuming T = 6. For further

Figure 7.2: Mackey-Glass time series (test set). Top-left: cut-out of the series. Top-right: the first difference for p = 1: Delta x(t) = x(t+1) - x(t). Bottom-left: x(t+1) against x(t). Bottom-right: Delta x(t) against x(t).
explanation see, for example, Falco, Iazzetta, Natale, and Tarantino (1998). As explained above, LSTM received only the value of x(t) as input.

7.3.1 Previous Work

In the following sections we attempt to summarize existing attempts to predict these time series. To allow comparison among approaches, we did not consider works where noise was added to the task or where training conditions were very different from ours. When not specifically mentioned, an input time window with time delays t, t-6, t-12 and t-18 or larger was used. The different approaches are outlined in Table 7.1. Vesanto (1997) offers the best result to date, according to our knowledge, with a Self-Organizing Map (SOM) approach. The SOM parameters given in Table 7.2 refer to the prototype vectors of the map. The results from these approaches are found in Table 7.2. We re-calculated the results for R. Bone et al., because only the NMSE was given.
7.3.2 Results

The LSTM results are listed at the bottom of Table 7.2. After six single-steps of iterated training (p = 1, T = 6, n = 6) the LSTM NRMSE for single-step prediction (p = T = 1, n = 1) is 0.0452. After 84 single-steps of iterated training (p = 1, T = 84, n = 84) the LSTM NRMSE for single-step prediction (p = T = 1, n = 1) is 0.0809. Figure 7.3 shows iterated prediction results for LSTM. Increasing the number of memory blocks did not significantly improve the results.

Why did LSTM perform worse than the MLP? The AR-LSTM network does not have access to the past as part of its input and therefore has to learn to extract and represent a Markov
    Model        Author                                     Description
    BPNN         Day and Davenport (1993)                   BP continuous-time feed-forward NN with two
                                                            hidden layers and fixed time delays.
    ATNN         Day and Davenport (1993)                   BP continuous-time feed-forward NN with two
                                                            hidden layers and adaptable time delays.
    DCS-LMM      Chudy and Farkas (1998)                    Dynamic Cell Structures combined with Local
                                                            Linear Models.
    EBPTTRNN     R. Bone, Crucianu, Verley, and             RNN with 10 adaptive delayed connections trained
                 Asselin de Beauville (2000)                with BPTT combined with a constructive algorithm.
    BGALR        Falco, Iazzetta, Natale, and               Genetic algorithm with adaptable input time window
                 Tarantino (1998)                           size (Breeder Genetic Algorithm with Line
                                                            Recombination).
    EPNet        Yao and Liu (1997)                         Evolved neural nets (Evolvable Programming Net).
    SOM          Vesanto (1997)                             A self-organizing map.
    Neural Gas   Martinez, Berkovich, and Schulten (1993)   The Neural Gas algorithm for a vector quantization
                                                            approach.
    AMB          Bersini, Birattari, and Bontempi (1998)    An improved memory-based regression (MB) method
                                                            (Platt, 1991) that uses an adaptive approach to
                                                            automatically select the number of regressors (AMB).

Table 7.1: Summary of previous approaches for the prediction of the Mackey-Glass time series.
state (Bakker & Kleij, 2000). In tasks we considered so far this required remembering one or two events from the past, then using this information before overwriting the same memory cells. The Mackey-Glass equation contains the input from t - 17, hence its implementation requires the storage of all inputs from t - 17 to t (time window approaches consider selected inputs back to at least t - 18). Assuming that any dynamic model needs the event from time t - \tau with \tau \approx 17, we note that the AR-RNN has to store all inputs from t - \tau to t and to overwrite them at the adequate time. This requires the implementation of a circular buffer, a structure quite difficult for an RNN to simulate. In a TDNN, on the other hand, a circular buffer is inherent to the network structure.
7.3.3 Analysis

It is interesting that for MLPs (T = 6) it was more effective to transform the task into a one-step-ahead prediction task and iterate than it was to predict directly (compare the results for p = 1 and p = T). It is in general easier to predict fewer steps ahead, the disadvantage being that during iteration input values have to be replaced by predictions. For T = 6 with p = 1 this affects only the latest value. This advantage is lost for T = 84, and the results with p = 1 are worse than with p = 6, where fewer iterations are necessary. For MLPs, iterated training did not in general produce better results: it improved performance when the step-size p was 1, and
    Model            Units   Weights   LR            NRMSE, T=1   NRMSE, T=6   NRMSE, T=84
    MLP, p=1, IT       4       25      1 \cdot 10^-4   0.0089       0.0191       0.4143
    MLP, p=1, IT      16       97      1 \cdot 10^-4   0.0094       0.0205       0.3929
    MLP, p=6           4       25      1 \cdot 10^-4   --           0.1659       --
    MLP, p=6          16       97      1 \cdot 10^-4   --           0.1466       --
    MLP, p=6, IT       4       25      1 \cdot 10^-4   --           0.0946       0.3012
    MLP, p=6, IT      16       97      1 \cdot 10^-4   --           0.0945       0.2820
    LSTM, p=T          4      113      5 \cdot 10^-4   0.0214       0.1184       0.4700
    LSTM, p=1          4      113      5 \cdot 10^-4   0.1981       --           0.5927

Table 7.2 (excerpt): MLP and LSTM results for the Mackey-Glass task, showing the number of hidden units (memory blocks for LSTM), the number of weights, the learning rate, and the NRMSE for prediction offsets T = 1, 6 and 84; "IT" marks iterated training.
Figure 7.3: Mackey-Glass time series: test runs with LSTM network solutions. Shown are the network output (solid lines) and the target t. Top-left: single-step prediction and six iterations (p = 1, T = 1, n = 1...6) after iterated training. Top-right: the prediction for T = 6 with n = 6, extracted from the top-left graph. Bottom-left: the best solution for T = 84 with p = 6 and n = 14. Bottom-right: the best single-step solution for T = 84 with p = 1 and n = 84.
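The iterated (closed-loop) prediction scheme underlying these runs can be summarized in a few lines. The sketch below is ours and assumes a generic one-step predictor over a fixed history window; the thesis networks differ in detail (the AR-LSTM sees only x(t)), but the clamping of predictions back onto the input is the same idea.

def iterate_prediction(model, history, n):
    """Apply a one-step predictor n times, clamping each prediction onto
    the input side, to reach a total offset T = n. 'model' maps a history
    window to the next value; the interface is a placeholder."""
    window = list(history)
    for _ in range(n):
        y = model(window)
        window = window[1:] + [y]      # replace oldest value, append prediction
    return window[-1]                  # estimate of x(t + n)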
Figure 7.4: Test run with a network solution for the Mackey-Glass time series (p = 1, T = 84, n = 84). Shown is a "free" iteration of 250 steps starting with all states set to zero. Top: network output y and the test set target t. Middle: cell states s_c. Bottom: activations of the gates.
7.4 Laser Data

This data is set A from the Santa Fe time series prediction competition (Weigend & Gershenfeld, 1993). It consists of one-dimensional data recorded from a Far-Infrared (FIR) laser in a chaotic state (Huebner, Abraham, & Weiss, 1989).^1 The training set consists of 1,000 points from the laser, with the task being to predict the next 100 points (Figure 7.5). The main difficulty is to predict the collapse of activation in the test set, given only two similar events in the training set. We run tests for stepwise prediction and fully iterated prediction, where the output is clamped to the input for 100 steps.

^1 The data is available from http://www.stern.nyu.edu/aweigend/Time-Series/SantaFe.html.

Figure 7.5: FIR-laser data (set A) from the Santa Fe time series prediction competition (see text for details).

For the experiments with MLPs the setup was as described for the Mackey-Glass data, but with an input embedding of the last 9 time steps, as in Koskela, Varsta, Heikkonen, and Kaski (1998).
7.4.1 Previous Work

Results are listed in Table 7.3. Linear prediction is no better than predicting the data mean. Wan (1994) achieved the best results submitted to the original Santa Fe contest. He used a Finite Input Response Network (FIRN) (25 inputs and 12 hidden units), a method similar to a TDNN. Wan improved performance by replacing the last 25 predicted points by smoothed values (sFIRN).

Koskela, Varsta, Heikkonen, and Kaski (1998) compared recurrent SOMs (RSOMs) and MLPs (trained with the Levenberg-Marquardt algorithm) with an input embedding of dimension 9 (an input window with the last 9 values). Bakker, Schouten, Giles, Takens, and Bleek (2000) used a mixture of predictions and true values as input (Error Propagation, EP). Then Principal Component Analysis (PCA) was applied to reduce the dimensionality of the time embedding for the input from the 40 most recent inputs to 16 principal components. These were fed into an MLP (with two hidden layers of 32 and 24 units) and trained with BPTT using conjugate gradients. The value for the iterated prediction was achieved with a mixture of 90% clamped output and 10% true value (true iteration corresponds to 100% clamped output). The value for the single-step prediction was achieved without applying EP during training.
Kohlmorgen and Muller (1998) pointed out that the prediction problem could be solved by pattern matching, if it can be guaranteed that the best match from the past is always the right one. To resolve ambiguities they propose to up-sample the data using linear extrapolation (as done by Sauer, 1994).

The best result to date, according to our knowledge, was achieved by Weigend and Nix (1994). They used a nonlinear regression approach in a maximum likelihood framework, realized with a feed-forward NN (25 inputs and 12 hidden units) using an additional output to estimate the prediction error. For the iterated prediction, the mean of the values at times 620-700 was used as prediction after the predicted collapse of activity at time-step 1072 (this was based on visual inspection). A similar approach was used by Eric J. Kostelich (1994), who searched for the best match to an embedding of 75 steps using a local linear model.
McNames (2000) proposed a statistical method that used cross-validation error to estimate the model parameters for local models, but the testing conditions were too different to include the results in the comparison. Bontempi (1999) used a similar approach called "Predicted Sum of Squares" (PRESS); here, the dimension of the time embedding was 16.

Table 7.3: Results for the FIR-laser task, showing (from left to right): the number of units, the number of parameters (weights for NNs), the number of training sequence presentations, and the NRMSE.
7.4.2 Results

The results for MLP and LSTM are listed in Table 7.3. The results for these methods are not as good as the other results listed in Table 7.3. This is true in part because we did not replace predicted values by hand with a mean value where we suspected the system to be led astray.

7.4.3 Analysis

The LSTM network could not predict the collapse of emission in the test set (Figure 7.6). Instead, the network tracks the oscillation in the original series for only about 40 steps before desynchronizing. This indicates performance similar to that in the Mackey-Glass task: the LSTM network was able to track the strongest eigen-frequency in the task but was unable to account for high-frequency variance. Though the MLP performed better, it generated inaccurate amplitudes and also desynchronized after about 40 steps. The MLP did, however, manage to predict the collapse of emission (Figure 7.6).

LSTM's ability to track slow oscillations in the chaotic signal is notable. In simple cases, synchronization with a periodic signal is easily achieved using mechanisms such as phase-locked loops (PLLs). But when noisy or complex signals are used, synchronization can be challenging
Figure 7.6: Test runs with network solutions after iterated training for the FIR-laser task. Top: LSTM. Bottom: MLP with 32 hidden units. Left: single-step prediction. Right: iteration of 100 steps.

(McAuley, 1994; Large & Kolen, 1994). Systems like LSTM that can find periodicity in complicated signals should be applicable to cognitive domains such as speech and music (Large & Jones, 1999; Eck, 2000a). See also Eck (2000b) for more on this topic.

Iterated training yielded improved results for iterated prediction, even when stepwise prediction made things worse, as in the case of MLP single-step prediction (prediction step size one) for both the Mackey-Glass task and the FIR task. When multi-step prediction was used (for Mackey-Glass only), iterated training did not improve system performance.
7.5 Conclusion

... approaches must fail. One reasonable hybrid approach to the prediction of unknown time series may be this: start by training a time-window-based MLP, then freeze its weights and use LSTM only to reduce the residual error, if there is any, employing LSTM's ability to cope with long time lags between significant events.

LSTM's ability to track slow oscillations in the chaotic signal may be applicable to cognitive domains such as rhythm detection in speech and music.
Chapter 8

Conclusion

This work has concentrated on improving and applying the original LSTM algorithm as introduced by Hochreiter and Schmidhuber (1997). We proposed to extend LSTM with forget gates and peephole connections. Extended LSTM is clearly superior to traditional LSTM (and other RNNs), and can serve as a basis for future applications. Our findings reinforce the perception that LSTM is a very general and promising adaptive sequence processing device, with a wider field of potential applications than alternative RNNs. In the following we summarize the contributions of this thesis and present some thoughts about future work and possible LSTM applications.

RNNs on context-free language (CFL) benchmarks. Moreover, LSTM is the first RNN to learn a context-sensitive language.

Time series prediction. Time-window-based MLPs outperformed a pure auto-regressive LSTM approach on certain time series prediction benchmarks solvable by looking at a few recent inputs only. Thus LSTM's special strength, namely, to learn to remember single events for very long, unknown time periods, was not necessary for those tasks.
8.2 Future work and possible appli
ations of LSTM.
Gain adaptation. In our experiments we either used a
onstant learning rate (sometimes with
exponential or linear de
ay within sequen
es) or applied the rather simple momentum algorithm
(Plaut et al., 1986). More advan
ed lo
al learning rate adaptation approa
hes like a de
oupled
Kalman ltering (Puskorius & Feldkamp, 1994) or sto
hasti
meta des
ent (S
hraudolph, 1999,
2000) may improve learning speed and redu
e the per
entage of networks that diverge.
Hierarchical decomposition, rhythm and timing. LSTM with forget gates holds promise for any sequential processing task in which we suspect that a hierarchical decomposition may exist, but do not know in advance what this decomposition is (one example is prosodic information in speech). We showed that memory blocks equipped with forget gates and peephole connections are capable of developing into internal oscillators and timers, and that LSTM is able to track slow oscillations in the chaotic signal. This may allow the recognition and generation of hierarchical rhythmic patterns in music. In particular, the ability to perform precise timing and measuring makes LSTM a promising approach for real-world tasks whose solution partly depends on the precise duration of intervals between relevant events.
Growing LSTM networks. It may be useful to grow LSTM networks (e.g., add one memory block at a time), similar to the cascade-correlation algorithm (Fahlman, 1991), to decouple blocks when tracking multiple frequencies in a signal. So far only the fundamental frequency was tracked.
Time series prediction. For the prediction of unknown time series our results suggest using LSTM in a hybrid approach as follows: start by training a time window-based MLP, then freeze its weights and use LSTM only to reduce the residual error, if there is any, employing LSTM's ability to cope with long time lags between significant events. An example of a task where a hybrid approach with LSTM might be promising is the prediction of secondary protein structure from a sequence of amino acids (Brunak, Baldi, Frasconi, Pollastri, & Soda, 1999). The standard solution involves using a fixed window over the protein sequence, centered on a specific amino acid. As a protein is folded, acids that are far apart in the series of acids may be spatially close and have significant interactions. This generates complex, varying long-term dependencies in the series.
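A minimal sketch of this two-stage scheme, assuming generic regressors with a fit/predict interface (illustrative names only; in the approach described above an MLP and an LSTM would take these roles):

    import numpy as np

    def train_hybrid(mlp, lstm, inputs, targets):
        """Stage 1: the window-based model learns the short-range structure.
        Stage 2: its weights are frozen and the second model is trained on
        the residual error only, where long time lags may matter."""
        mlp.fit(inputs, targets)
        residuals = targets - mlp.predict(inputs)
        lstm.fit(inputs, residuals)
        return lambda x: mlp.predict(x) + lstm.predict(x)

    class MeanRegressor:
        """Trivial stand-in exposing the assumed fit/predict interface."""
        def fit(self, x, y):
            self.mean = float(np.mean(y))
        def predict(self, x):
            return np.full(len(x), self.mean)

    # Toy usage with stand-in models on random data.
    X = np.random.randn(50, 4)
    y = X @ np.array([0.5, -0.1, 0.2, 0.0])
    predict = train_hybrid(MeanRegressor(), MeanRegressor(), X, y)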
Appendix A
Embedded Reber Grammar Statistics
The minimal length of an embedded Reber grammar (ERG) string is 9; string lengths have no upper bound. To provide an idea of the string size distribution, Figure A.1 (left) shows a histogram of ERG strings computed from sampled data. We assume that ERG string probabilities decrease exponentially with ERG string size (compare the exponential fit on the left-hand side of Figure A.1), so that the probability p(l) of sampling a string of size l can be written as:
\[
p(l) = b\, e^{-a(l-9)} \quad \text{for } l \ge 9, \qquad p(l) = 0 \text{ otherwise},
\]
with $a, b > 0$; the offset 9 expresses the minimum string length. To compute the probability $P(L)$ of sampling a string of size $l \le L$ we integrate $p(l)$:
\[
P(L) = \int_9^L p(l)\, dl = \frac{b}{a} \left( 1 - e^{-a(L-9)} \right).
\]
Figure A.1: Left: histogram of 10^6 random samples of ERG string sizes (number of ERG strings in %, logarithmic scale, with exponential fit), plotted over ERG string length. Right: joint probability that an ERG string of a given size occurs and is the longest among 80000, plotted over the maximal ERG string length.
Figure A.2: Left: number of embedded Reber strings N plotted against lower bounds of the expected maximal string size (N). Right: the same plot with a logarithmic x-axis (number of samples).
From the normalization $P(\infty) = 1$ it follows that $a = b$. Solving $P(l) = 1 - P(l)$ with the value $a \approx 1/3$ extracted from the data (left-hand side of Figure A.1), we find the expected ERG string size: $l \approx 11$.
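The solving step, whose algebra was lost in the layout above, can be reconstructed as follows (our derivation, under the exponential model with $a = b$):
\[
P(l) = 1 - P(l) \;\Leftrightarrow\; 1 - e^{-a(l-9)} = \frac{1}{2} \;\Leftrightarrow\; l = 9 + \frac{\ln 2}{a},
\]
so $a \approx 1/3$ gives $l \approx 9 + 3 \ln 2 \approx 11$, matching the value above.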
Appendix B
Peephole LSTM with Forget Gates in Pseudo-Code

The pseudo-code in this appendix describes the implementation of LSTM with forget gates and peephole connections as introduced in chapters 3 and 5. This is the LSTM version that we currently use and recommend; the C code can be downloaded from: http://www.idsia.ch/~felix.
The partial derivatives $\partial s / \partial w$ are represented by the variables $dS$:
\[
dS^{v_j}_{lm} := \frac{\partial s_{v_j}}{\partial w_{lm}},
\]
as defined in chapter 2; $j$ indexes memory blocks and $v$ indexes memory cells in block $j$; $l = v_j$ for weights to the cell, $l = in$ for weights to the input gate, and $l = \varphi$ for weights to the forget gate. The variables $dS$ are calculated whether or not a target (and hence an error) is given; thus their calculation is done in the forward pass, whereas the backward pass is carried out only at time steps when a target is present.
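Schematically, the forward-pass update of these partials for weights to a cell looks as follows; a Python sketch with our own variable names, mirroring the truncated-gradient update in which the old partial is carried through the forget gate:

    import numpy as np

    def update_cell_partials(dS, y_phi, y_in, g_prime, y_prev):
        """Forward-pass update of dS = ds/dw for weights to a cell:
        the previous partial survives through the forget gate activation
        y_phi, and the current input contribution is added on top."""
        return y_phi * dS + y_in * g_prime * y_prev

    # Toy usage: one cell with three incoming connections, run for 5 steps.
    dS = np.zeros(3)
    for t in range(5):
        y_prev = np.random.randn(3)  # source-unit activations from step t-1
        dS = update_cell_partials(dS, y_phi=0.9, y_in=0.5,
                                  g_prime=0.25, y_prev=y_prev)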
It is task-specific (see the descriptions in the respective chapters) when the weight updates are executed: after each time step, regularly after a fixed number of time steps, after intervals of varying duration, or at the end of a sequence or epoch.
The momentum algorithm (Plaut et al., 1986), which we used for some of our experiments, is not incorporated into this pseudo-code.
init network:
  reset: CECs: s_vj = ŝ_vj = 0; partials: dS = 0; activations: y = ŷ = 0;

forward pass:
  input units: y = current external input;
  roll over: activations: ŷ = y; cell states: ŝ_vj = s_vj;
  loop over memory blocks, indexed j {
    Step 1a: input gates (5.1):
      net_in_j = Σ_m w_in_j,m ŷ^m + Σ_{v=1..S_j} w_in_j,vj ŝ_vj;   y_in_j = f_in_j(net_in_j);
    ...
    cells (5.12):
      loop over the S_j cells in block j, indexed v {
        Δw_vj,m = e_s_vj · dS_m^vj;
      }
  }
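Step 1a in runnable form, as a sketch (Python rather than the distributed C implementation; the array names are ours): the gate's net input sums weighted activations from the previous step plus peephole contributions from its own block's cell states.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def input_gate(w_units, y_prev, w_peep, s_prev):
        """Step 1a: net input from previous-step activations plus peephole
        connections from the block's own cell states, then squashed."""
        net_in = w_units @ y_prev + w_peep @ s_prev
        return sigmoid(net_in)

    # Toy usage: a block with 2 cells fed by 4 source units.
    y_in = input_gate(np.random.randn(4), np.random.randn(4),
                      np.random.randn(2), np.zeros(2))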
References

Bakker, B., & Kleij, G. van der Voort van der. (2000). Trading off perception with internal state: Reinforcement learning and analysis of Q-Elman networks in a Markovian task. In Proceedings of IJCNN 2000. Como, Italy.
Bakker, R., Schouten, J. C., Giles, C. L., Takens, F., & Bleek, C. M. van den. (2000). Learning chaotic attractors by neural networks. Neural Computation, 12 (10).
Bengio, Y., & Frasconi, P. (1995). An input output HMM architecture. In Advances in Neural Information Processing Systems 7. San Mateo, CA: Morgan Kaufmann.
Bengio, Y., Frasconi, P., Gori, M., & Soda, G. (1993). Recurrent neural networks for adaptive temporal processing. In Proceedings of the 6th Italian Workshop on Parallel Architectures and Neural Networks (WIRN93) (pp. 85-117). Vietri, Italy: World Scientific Pub.
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5 (2), 157-166.
Bersini, H., Birattari, M., & Bontempi, G. (1998). Adaptive memory-based regression methods. In Proceedings of the 1998 IEEE International Joint Conference on Neural Networks (pp. 2102-2106).
Blair, A. D., & Pollack, J. B. (1997). Analysis of dynamical recognizers. Neural Computation, 9 (5), 1127-1142.
Bontempi, G., Birattari, M., & Bersini, H. (1999). Local learning for iterated time-series prediction. In I. Bratko & S. Dzeroski (Eds.), Machine Learning: Proceedings of the Sixteenth International Conference (pp. 32-38). San Francisco, USA: Morgan Kaufmann.
Box, G., & Jenkins, G. (1970). Time series analysis: Forecasting and control. San Francisco: Holden-Day.
Brunak, S., Baldi, P., Frasconi, P., Pollastri, G., & Soda, G. (1999). Exploiting the past and the future in protein secondary structure prediction. Bioinformatics, 15 (11).
Casey, M. P. (1996). The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation, 8 (6), 1135-1178.
Chudy, L., & Farkas, I. (1998). Prediction of chaotic time-series using dynamic cell structures and local linear models. Neural Network World, 8 (5), 481-489.
Cleeremans, A., Servan-Schreiber, D., & McClelland, J. L. (1989). Finite-state automata and simple recurrent networks. Neural Computation, 1, 372-381.
Cramer, N. L. (1985). A representation for the adaptive generation of simple sequential programs. In J. Grefenstette (Ed.), Proceedings of an International Conference on Genetic Algorithms and Their Applications. Hillsdale, NJ: Lawrence Erlbaum Associates.
Crowder, R. S. (1990). Predicting the Mackey-Glass timeseries with cascade correlation learning. In D. S. Touretzky (Ed.), Connectionist Models: Proceedings of the 1990 Summer School.
Cummins, F., Gers, F., & Schmidhuber, J. (1999). Language identification from prosody without explicit features. In Proceedings of EUROSPEECH'99 (Vol. 1, pp. 371-374).
Darken, C. (1995). Stochastic approximation and neural network learning. In M. A. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks (pp. 941-944). Cambridge, Massachusetts: MIT Press.
Das, S., Giles, C., & Sun, G. (1992). Learning context-free grammars: Capabilities and limitations of a recurrent neural network with an external stack memory. In Proceedings of The Fourteenth Annual Conference of the Cognitive Science Society (pp. 791-795). San Mateo, CA: Morgan Kaufmann Publishers.
Day, S. P., & Davenport, M. R. (1993). Continuous-time temporal back-propagation with adaptive time delays. IEEE Transactions on Neural Networks, 4, 348-354.
Deco, G., & Schürmann, B. (1994). Neural learning of chaotic system behavior. IEICE Trans. Fundamentals, E77-A, 1840-1845.
Dickmanns, D., Schmidhuber, J., & Winklhofer, A. (1987). Der genetische Algorithmus: Eine Implementierung in Prolog. Fortgeschrittenenpraktikum, Institut für Informatik, Lehrstuhl Prof. Radig, Technische Universität München.
Doya, K., & Yoshizawa, S. (1989). Adaptive neural oscillator using continuous-time back-propagation learning. Neural Networks, 2 (5), 375-385.
Eck, D. (2000a). Meter Through Synchrony: Processing Rhythmical Patterns with Relaxation Oscillators. Unpublished doctoral dissertation, Indiana University, Bloomington, IN. (www.idsia.ch/doug/publications.html)
Eck, D. (2000b). Tracking rhythms with a relaxation oscillator (Tech. Rep. No. IDSIA-10-00). www.idsia.ch/techrep.html, Galleria 2, 6928 Manno-Lugano, Switzerland: IDSIA.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14 (2), 179-211.
Kostelich, E. J., & Lathrop, D. P. (1994). The prediction of chaotic time series: a variation on the method of analogues. In A. S. Weigend & N. A. Gershenfeld (Eds.), Time Series Prediction: Forecasting the Future and Understanding the Past (pp. 283-295). Addison-Wesley.
Fahlman, S. E. (1991). The recurrent cascade-correlation learning algorithm. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), NIPS 3 (pp. 190-196). San Mateo, CA: Morgan Kaufmann.
Falco, I. de, Iazzetta, A., Natale, P., & Tarantino, E. (1998). Evolutionary neural networks for nonlinear dynamics modeling. In Parallel Problem Solving from Nature 98 (Vol. 1498, pp. 593-602). Springer.
Forcada, M. L., & Carrasco, R. C. (1995). Learning the initial state of a second-order recurrent neural network during regular-language inference [Letter]. Neural Computation, 7 (5), 923-930.
Gers, F. A., Eck, D., & Schmidhuber, J. (2000). Applying LSTM to time series predictable through time-window approaches (Tech. Rep. No. IDSIA-22-00). Manno, CH: IDSIA.
Gers, F. A., Eck, D., & Schmidhuber, J. (2001a). Applying LSTM to time series predictable through time-window approaches. In Proc. ICANN 2001, Int. Conf. on Artificial Neural Networks. Vienna, Austria: IEE, London. (submitted)
Gers, F. A., Eck, D., & Schmidhuber, J. (2001b). Applying LSTM to time series predictable through time-window approaches. In Neural Nets, WIRN Vietri-99, Proceedings 11th Workshop on Neural Nets. Vietri sul Mare, Italy. (submitted)
Gers, F. A., & Schmidhuber, J. (2000a). Neural processing of complex continual input streams. In Proc. IJCNN'2000, Int. Joint Conf. on Neural Networks. Como, Italy.
Gers, F. A., & Schmidhuber, J. (2000b). Neural processing of complex continual input streams (Tech. Rep. No. IDSIA-02-00). Manno, CH: IDSIA.
Gers, F. A., & Schmidhuber, J. (2000c). Recurrent nets that time and count. In Proc. IJCNN'2000, Int. Joint Conf. on Neural Networks. Como, Italy.
Gers, F. A., & Schmidhuber, J. (2000d). Recurrent nets that time and count (Tech. Rep. No. IDSIA-01-00). Manno, CH: IDSIA.
Gers, F. A., & Schmidhuber, J. (2000e). LSTM learns context free languages. In Snowbird 2000 Conference.
Gers, F. A., & Schmidhuber, J. (2000f). Long Short-Term Memory learns context free languages and context sensitive languages (Tech. Rep. No. IDSIA-03-00). Manno, CH: IDSIA.
Gers, F. A., & Schmidhuber, J. (2001a). LSTM recurrent networks learn simple context free and context sensitive languages. IEEE Transactions on Neural Networks. (accepted)
Gers, F. A., & Schmidhuber, J. (2001b). Long Short-Term Memory learns context free and context sensitive languages. In Proceedings of the ICANNGA 2001 conference. Springer. (accepted)
Gers, F. A., Schmidhuber, J., & Cummins, F. (1999a). Continual prediction using LSTM with forget gates. In M. Marinaro & R. Tagliaferri (Eds.), Neural Nets, WIRN Vietri-99, Proceedings 11th Workshop on Neural Nets (pp. 133-138). Vietri sul Mare, Italy: Springer Verlag, Berlin.
Gers, F. A., Schmidhuber, J., & Cummins, F. (1999b). Learning to forget: Continual prediction with LSTM. In Proc. ICANN'99, Int. Conf. on Artificial Neural Networks (Vol. 2, pp. 850-855). Edinburgh, Scotland: IEE, London.
Gers, F. A., Schmidhuber, J., & Cummins, F. (1999c). Learning to forget: Continual prediction with LSTM (Tech. Rep. No. IDSIA-01-99). Lugano, CH: IDSIA.
Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12 (10), 2451-2471.
Gers, F. A., Schmidhuber, J., & Schraudolph, N. Learning precise timing with LSTM recurrent networks. (submitted to Neural Computation)
Haffner, P., & Waibel, A. (1992). Multi-state time delay networks for continuous speech recognition. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in Neural Information Processing Systems (Vol. 4, pp. 135-142). Morgan Kaufmann Publishers, Inc.
Hinton, G. E., Sejnowski, T. J., & Ackley, D. H. (1984). Boltzmann Machines: Constraint satisfaction networks that learn (Tech. Rep. No. CMU-CS-84-119). Carnegie Mellon University.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München. (See www7.informatik.tu-muenchen.de/~hochreit)
Hochreiter, S., & Schmidhuber, J. (1995). Long short-term memory can solve hard long time lag problems. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in Neural Information Processing Systems 7 (NIPS '94). Cambridge, MA: MIT Press.
Hochreiter, S., & Schmidhuber, J. (1996). Bridging long time lags by weight guessing and "Long Short-Term Memory". In F. L. Silva, J. C. Principe, & L. B. Almeida (Eds.), Spatiotemporal models in biological and artificial systems (pp. 65-72). IOS Press, Amsterdam, Netherlands. (Series: Frontiers in Artificial Intelligence and Applications, Volume 37)
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9 (8), 1735-1780.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the USA, 79, 2554-2558.
Huebner, U., Abraham, N. B., & Weiss, C. O. (1989). Dimensions and entropies of chaotic intensity pulsations in a single-mode far-infrared NH3 laser. Phys. Rev. A, 40, 6354.
Jordan, M. I. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Annual Cognitive Science Society Conference. Hillsdale, NJ: Erlbaum.
Kalinke, Y., & Lehmann, H. (1998). Computation in recurrent neural networks: From counters to iterated function systems. In G. Antoniou & J. Slaney (Eds.), Advanced Topics in Artificial Intelligence, Proceedings of the 11th Australian Joint Conference on Artificial Intelligence (Vol. 1502). Berlin, Heidelberg: Springer.
Kohlmorgen, J., & Müller, K.-R. (1998). Data set A is a pattern matching problem. Neural Processing Letters, 7 (1), 43-47.
Koskela, T., Varsta, M., Heikkonen, J., & Kaski, K. (1998). Recurrent SOM with local linear models in time series prediction. In 6th European Symposium on Artificial Neural Networks, ESANN'98, Proceedings (pp. 167-172). Brussels, Belgium: D-Facto.
Koza, J. R. (1992). Genetic programming. Cambridge, MA: MIT Press.
Lapedes, A., & Farber, R. (1987). Nonlinear signal processing using neural networks: Prediction and signal modeling (Tech. Rep. No. LA-UR-87-2662). Los Alamos, New Mexico: Los Alamos National Laboratory.
Large, E. W., & Jones, M. R. (1999). The dynamics of attending: How people track time-varying events. Psychological Review, 106 (1), 119-159.
Large, E. W., & Kolen, J. F. (1994). Resonance and the perception of musical meter. Connection Science, 6, 177-208.
LeCun, Y., Bottou, L., Orr, G., & Müller, K.-R. (1998). Efficient backprop. In G. B. Orr & K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade (Vol. 1524, pp. 5-50). Berlin: Springer Verlag.
Lee, L. (1996). Learning of context-free languages: A survey of the literature (Tech. Rep. No. TR-12-96). Center for Research in Computing Technology, Harvard University, Cambridge, Massachusetts.
Lin, T., Horne, B. G., Tiño, P., & Giles, C. L. (1996). Learning long-term dependencies in NARX recurrent neural networks [Paper]. IEEE Transactions on Neural Networks, 7 (6), 1329-1338.
Mackey, M., & Glass, L. (1977). Oscillation and chaos in a physiological control system. Science, 197 (287).
Martinez, T. M., Berkovich, S. G., & Schulten, K. J. (1993). Neural-gas network for vector quantization and its application to time-series prediction [Paper]. IEEE Transactions on Neural Networks, 4 (4), 558-569.
McAuley, J. (1994). Finding metrical structure in time. In M. Mozer, P. Smolensky, D. Touretzky, J. Elman, & A. S. Weigend (Eds.), Proceedings of the 1993 Connectionist Models Summer School (pp. 219-227). Hillsdale, NJ: Erlbaum.
McNames, J. (2000). Local modeling optimization for time series prediction. In Proceedings of the 8th European Symposium on Artificial Neural Networks (pp. 305-310). Bruges, Belgium.
Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review (63), 81-97.
Mitchell, T. M., Keller, R. M., & Kedar-Cabelli, S. T. (1986). Explanation-based generalization: A unifying view. Machine Learning, 1, 47-80.
Mozer, M. C. (1989). A focused backpropagation algorithm for temporal pattern processing. Complex Systems, 3, 349-381.
Mozer, M. C. (1992). Induction of multiscale temporal structure. In D. S. Lippman, J. E. Moody, & D. S. Touretzky (Eds.), Advances in Neural Information Processing Systems 4 (pp. 275-282). San Mateo, CA: Morgan Kaufmann.
Mozer, M. C. (1993). Neural net architectures for temporal sequence processing. In A. S. Weigend & N. A. Gershenfeld (Eds.), Time series prediction: Forecasting the future and understanding the past (Vol. 15, pp. 243-264). Reading, MA: Addison Wesley.
Osborne, M., & Briscoe, E. (1997). Learning stochastic categorial grammars. In Proceedings of the Assoc. for Comp. Linguistics, Comp. Nat. Lg. Learning (CoNLL97) Workshop (pp. 80-87). Madrid. (http://citeseer.nj.nec.com/osborne97learning.html)
Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6 (5), 1212-1228.
Platt, J. (1991). A resource-allocating network for function interpolation. Neural Computation, 3, 213-225.
Plaut, D. C., Nowlan, S. J., & Hinton, G. E. (1986). Experiments on learning by back propagation (Tech. Rep. No. CMU-CS-86-126). Pittsburgh, PA: Carnegie-Mellon University.
Porter, B. W., Bareiss, R., & Holte, R. C. (1990). Concept learning and heuristic classification in weak-theory domains. Artificial Intelligence, 45 (1-2), 229-263.
Principe, J. C., & Kuo, J.-M. (1995). Dynamic modelling of chaotic time series with neural networks. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in Neural Information Processing Systems (Vol. 7, pp. 311-318). The MIT Press.
Principe, J. C., Rathie, A., & Kuo, J. M. (1992). Prediction of chaotic time series with neural networks and the issue of dynamic modeling. Int. J. of Bifurcation and Chaos, 2 (4), 989-996.
Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5 (2), 279-297.
Quinlan, J. (1992). Programs for machine learning. Morgan Kaufmann.
Bone, R., Crucianu, M., Verley, G., & Asselin de Beauville, J.-P. (2000). A bounded exploration approach to constructive algorithms for recurrent neural networks. In Proceedings of IJCNN 2000. Como, Italy.
Ring, M. B. (1994). Continual learning in reinforcement environments. Unpublished doctoral dissertation, University of Texas at Austin, Austin, Texas 78712.
Robinson, A. J., & Fallside, F. (1987). The utility driven dynamic error propagation network (Tech. Rep. No. CUED/F-INFENG/TR.1). Cambridge University Engineering Department.
Rodriguez, P., & Wiles, J. (1998). Recurrent neural networks can learn to implement symbol-sensitive counting. In Advances in Neural Information Processing Systems (Vol. 10, pp. 87-93). The MIT Press.
Rodriguez, P., Wiles, J., & Elman, J. (1999). A recurrent neural network that learns to count. Connection Science, 11 (1), 5-40.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 318-362). Cambridge, MA: MIT Press.
Sakakibara, Y. (1997). Recent advances of grammatical inference. Theoretical Computer Science, 185 (1), 15-45.
Salustowicz, R. P., & Schmidhuber, J. (1997). Probabilistic incremental program evolution: Stochastic search through program space. In M. van Someren & G. Widmer (Eds.), Machine Learning: ECML-97, Lecture Notes in Artificial Intelligence 1224 (pp. 213-220). Springer-Verlag Berlin Heidelberg.
Sauer, T. (1994). Time series prediction using delay coordinate embedding. In A. S. Weigend & N. A. Gershenfeld (Eds.), Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley.
Schmidhuber, J. (1989). The Neural Bucket Brigade, a local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1 (4), 403-412.
Schmidhuber, J. (1992a). A fixed size storage O(n^3) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4 (2), 243-248.
Schmidhuber, J. (1992b). Learning complex, extended sequences using the principle of history compression. Neural Computation, 4 (2), 234-242.
Schmidhuber, J., & Hochreiter, S. (1996). Guessing can outperform many long time lag algorithms (Tech. Rep. No. IDSIA-19-96). IDSIA.
Schraudolph, N. (1999). Local gain adaptation in stochastic gradient descent. In Proceedings of the 9th International Conference on Artificial Neural Networks. London: IEE.
Schraudolph, N. N. (2000). Fast second-order gradient descent via O(n) curvature matrix-vector products (Tech. Rep. No. IDSIA-12-00). Galleria 2, CH-6928 Manno, Switzerland: Istituto Dalle Molle di Studi sull'Intelligenza Artificiale. (Submitted to Neural Computation)
Siegelmann, H. (1992). Theoretical foundations of recurrent neural networks. Unpublished doctoral dissertation, Rutgers, The State University of New Jersey, New Brunswick.
Siegelmann, H. T., & Sontag, E. D. (1991). Turing computability with neural nets. Applied Mathematics Letters, 4 (6), 77-80.
Smith, A. W., & Zipser, D. (1989). Learning sequential structures with the real-time recurrent learning algorithm. International Journal of Neural Systems, 1 (2), 125-131.
Steijvers, M., & Grunwald, P. (1996). A recurrent network that performs a context-sensitive prediction task. In Proceedings of the 18th Annual Conference of the Cognitive Science Society. Erlbaum.
Sun, G., Chen, H., & Lee, Y. (1993). Time warping invariant neural networks. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in Neural Information Processing Systems 5 (pp. 180-187). San Mateo, CA: Morgan Kaufmann.
Sun, G. Z., Giles, C. L., Chen, H. H., & Lee, Y. C. (1993). The neural network pushdown automaton: Model, stack and learning simulations (Technical Report No. CS-TR-3118). University of Maryland, College Park.
Tonkes, B., & Wiles, J. (1997). Learning a context-free task with a recurrent neural network: An analysis of stability. In Proceedings of the Fourth Biennial Conference of the Australasian Cognitive Science Society.
Townley, S., Ilchmann, A., Weiss, M. G., McClements, W., Ruiz, A. C., Owens, D., & Praetzel-Wolters, D. (1999). Existence and learning of oscillations in recurrent neural networks (Tech. Rep. No. AGTM 202). Kaiserslautern, Germany: Universität Kaiserslautern, Fachbereich Mathematik.
Tsoi, A. C., & Back, A. D. (1994). Locally recurrent globally feedforward networks: A critical review of architectures. IEEE Transactions on Neural Networks, 5 (2), 229-239.
Tsung, F. S., & Cottrell, G. W. (1989). A sequential adder using recurrent networks. In Proceedings of the First International Joint Conference on Neural Networks, Washington, DC. San Diego: IEEE TAB Neural Network Committee.
Tsung, F.-S., & Cottrell, G. W. (1995). Phase-space learning. In Advances in Neural Information Processing Systems (Vol. 7, pp. 481-488). The MIT Press.
Vesanto, J. (1997). Using the SOM and local models in time-series prediction. In Proceedings of WSOM'97, Workshop on Self-Organizing Maps, Espoo, Finland, June 4-6 (pp. 209-214). Espoo, Finland: Helsinki University of Technology, Neural Networks Research Centre.
Vijay-Shanker, K. (1992). Using descriptions of trees in a tree adjoining grammar. Computational Linguistics, 18 (4), 481-517.
Waibel, A. (1989). Modular construction of time-delay neural networks for speech recognition [Letter]. Neural Computation, 1 (1), 39-46.
Wan, E. A. (1994). Time series prediction by using a connectionist network with internal time delays. In A. S. Weigend & N. A. Gershenfeld (Eds.), Time Series Prediction: Forecasting the Future and Understanding the Past (pp. 195-217). Addison-Wesley.
Weigend, A., & Gershenfeld, N. (1993). Time series prediction: Forecasting the future and understanding the past. Addison-Wesley.
Weigend, A. S., & Nix, D. A. (1994). Predictions with confidence intervals (local error bars). In Proceedings of the International Conference on Neural Information Processing (ICONIP'94) (pp. 847-852). Seoul, Korea.
Weiss, M. G. (1999). Learning oscillations using adaptive control (Tech. Rep. No. AGTM 178). Kaiserslautern, Germany: Universität Kaiserslautern, Fachbereich Mathematik.
Werbos, P. J. (1988). Generalisation of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 339-356.
Wiles, J., & Elman, J. (1995). Learning to count without a counter: A case study of dynamics and activation landscapes in recurrent networks. In Proceedings of the Seventeenth Annual Conference of the Cognitive Science Society (pp. 482-487). Cambridge, MA: MIT Press.
Williams, R. J., & Peng, J. (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories [Letter]. Neural Computation, 2 (4), 490-501.
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent networks. Neural Computation, 1 (2), 270-280.
Williams, R. J., & Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin & D. E. Rumelhart (Eds.), Back-propagation: Theory, Architectures and Applications (pp. 433-486). Hillsdale, NJ: Erlbaum.
Yao, X., & Liu, Y. (1997). A new evolutionary system for evolving artificial neural networks. IEEE Transactions on Neural Networks, 8 (3), 694-713.
Zelle, J., & Mooney, R. (1993). Learning semantic grammars with constructive inductive logic programming. In Proceedings of the 11th National Conference on Artificial Intelligence, AAAI (pp. 817-822). MIT Press.
Zeng, Z., Goodman, R., & Smyth, P. (1994). Discrete recurrent neural networks for grammatical inference. IEEE Transactions on Neural Networks, 5 (2).
Personal Record
ord
Gers, F. A., E
k, D., & S
hmidhuber, J. Applying LSTM to time series predi
table
through time-window approa
hes. In Neural Nets, WIRN Vietri-99, Pro
eedings 11th
Workshop on Neural Nets.
Cummins, F., Gers, F., & S
hmidhuber, J. (1999). Language identi
ation from prosody
without expli
it features. In Pro
eedings of EUROSPEECH'99 (Vol. 1, pp. 371{374).
Cummins, F., Gers, F. A., & S
hmidhuber, J. (1999). Automati
dis
rimination among
languages based on prosody alone (Te
h. Rep. No. IDSIA-03-99). Lugano, CH: IDSIA.
De Garis, H., Gers, F. A., Korkin, M., Agah, A., & Nawa, N. E. (1998). Building an
arti
ial brain using an FPGA based 'CAM-brain ma
hine'. Arti
ial Life and Roboti
s
Journal, 2, 56-61.
Gers, F. A., & Czarske, J. W. (1995). Untersu
hungen zur verteilten temperatur-sensorik
mit stimulierter brillouin-streuung. In Laser'95 Conferen
e Pro
eedings C P22.
Gers, F. A., & De Garis, H. (1996a). Porting a
ellular automata based arti
ial brain to
MIT's
ellular automata ma
hine "CAM-8". In Int. Conf. on Simulated Evolution and
Learnin (SEAL) S7-3, Taejon, Korea.
Gers, F. A., & De Garis, H. (1996b). CAM-brain : A new model for ATR's
ellular
automata based arti
ial brain proje
t. In Int. Conf. on Evolvable Systems Conferen
e
Pro
eedings (ICES) S7-5, Tsukuba, Japan.
Gers, F. A., & De Garis, H. (1997). Codi-1bit : A simplied
ellular automata based
neuron model. In Arti
ial Evolution Conferen
e (AE), Nimes, Fran
e.
Gers, F. A., De Garis, H., & Korkin, M. (1997a). Evolution of neural sru
tures based on
ellular automata. In C. J. Lakhmi (Ed.), Soft
omputing te
hniques in knowlage-based
intelligent engineering systems (p. 259-278). Heidelberg New York: Physi
a-Verlag.
Gers, F. A., De Garis, H., & Korkin, M. (1997b). A simplied
ellular automata based
neuron model. In J. Hao, E. Lutton, E. Ronald, M. S
hoennauer, & D. Snyers (Eds.),
Arti
ial Evolution (p. 315-334). Springer Verlag.
Gers, F. A., De Garis, H., & Korkin, M. (1998). Codi-1bit : A
ellular automata
based neural net model simple enough to be implemented in evolvable hardware. In
Int.Symposium on Arti
ial Life and Roboti
s (AROB), Beppu, Oita, Japan.
Gers, F. A., E
k, D., & S
hmidhuber, J. (2000). Applying LSTM to time series predi
table
through time-window approa
hes (Te
h. Rep. No. IDSIA-22-00). Manno, CH: IDSIA.
Gers, F. A., E
k, D., & S
hmidhuber, J. (2001). Applying LSTM to time series predi
table
through time-window approa
hes. In Pro
. ICANN 2001, Int. Conf. on Arti
ial Neural
Networks. Vienna, Austria: IEE, London. (submitted)
Gers, F. A., & S
hmidhuber, J. Long short-term memory learns
ontext free and
ontext
sensitive languages. In ICANNGA 2001 Conferen
e. (a
epted)
ord
Gers, F. A., & S
hmidhuber, J. (2000a). LSTM learns
ontext free languages. In Snowbird
2000 Conferen
e.
Gers, F. A., & S
hmidhuber, J. (2000b). Long short-term memory learns
ontext free
languages and
ontext sensitive languages (Te
h. Rep. No. IDSIA-03-00). Manno, CH:
IDSIA.
Gers, F. A., & S
hmidhuber, J. (2000
). Neural pro
essing of
omplex
ontinual input
streams. In Pro
. IJCNN'2000, Int. Joint Conf. on Neural Networks. Como, Italy.
Gers, F. A., & S
hmidhuber, J. (2000d). Neural pro
essing of
omplex
ontinual input
streams (Te
h. Rep. No. IDSIA-02-00). Manno, CH: IDSIA.
Gers, F. A., & S
hmidhuber, J. (2000e). Re
urrent nets that time and
ount. In Pro
.
IJCNN'2000, Int. Joint Conf. on Neural Networks. Como, Italy.
Gers, F. A., & S
hmidhuber, J. (2000f). Re
urrent nets that time and
ount (Te
h. Rep.
No. IDSIA-01-00). Manno, CH: IDSIA.
Gers, F. A., & S
hmidhuber, J. (2001). Long short-term memory learns simple
ontext
free and
ontext sensitive languages. IEEE Transa
tions on Neural Networks. (a
epted)
Gers, F. A., S
hmidhuber, J., & Cummins, F. (1999a). Continual predi
tion using LSTM
with forget gates. In M. Marinaro & R. Tagliaferri (Eds.), Neural Nets, WIRN Vietri-99,
Pro
eedings 11th Workshop on Neural Nets (p. 133-138). Vietri sul Mare, Italy: Springer
Verlag, Berlin.
Gers, F. A., S
hmidhuber, J., & Cummins, F. (1999b). Learning to forget: Continual
predi
tion with LSTM. In Pro
. ICANN'99, Int. Conf. on Arti
ial Neural Networks
(Vol. 2, p. 850-855). Edinburgh, S
otland: IEE, London.
Gers, F. A., S
hmidhuber, J., & Cummins, F. (1999
). Learning to forget: Continual
predi
tion with LSTM (Te
h. Rep. No. IDSIA-01-99). Lugano, CH: IDSIA.
Gers, F. A., S
hmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual
predi
tion with LSTM. Neural Computation, 12 (10), 2451{2471.
Gers, F. A., S
hmidhuber, J., & S
hraudolph, N. Learning pre
ise timing with LSTM
re
urrent networks. (submitted to Neural Computation)
Hough, M., De Garis, H., Korkin, M., Gers, F. A., & Nawa, N. E. (1999). Spiker : Analog
waveform to digital spiketrain
onversion in atr's arti
ial brain "
am-brain" proje
t. In
Int. Conf. on Roboti
s and Arti
ial Life, Beppu, Japan.
Korkin, M., De Garis, H., Gers, F., & Hemmi, H. (1997). 'CBM (CAM-brain ma
hine) :
A hardware tool whi
h evolves a neural net module in a fra
tion of a se
ond and runs a
million neuron arti
ial brain in real time. In Geneti
Programming Conferen
e, Stanford,
USA.
Nawa, N. E., De Garis, H., Gers, F. A., & Korkin, M. (1998). 'ATR's CAM-brain
ma
hine (CBM) simulation results and representation issues. In Geneti
Programming
Conferen
e.
Acknowledgments