
Long Short-Term Memory

in Recurrent Neural Networks

THESIS No. 2366 (2001)


PRESENTED TO THE DEPARTMENT OF COMPUTER SCIENCE

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

FOR THE DEGREE OF DOCTOR OF SCIENCES

BY

FELIX GERS
Diplom in Physics, Universität Hannover, Germany
of German nationality

submitted for the approval of the jury:

Prof. R. Hersch, president


Prof. Wulfram Gerstner, thesis director
Dr. habil. Jürgen Schmidhuber, co-examiner
Prof. Paolo Frasconi, co-examiner
Dr. MER Martin Rajman, co-examiner

Lausanne, EPFL
2001
Contents

1 Introduction
  1.1 Recurrent Neural Networks (RNNs)
  1.2 General considerations
    1.2.1 Problem: Exponential decay of gradient information
    1.2.2 Solution: Constant error carousels
  1.3 Previous and Related Work
    1.3.1 RNNs
    1.3.2 RNNs versus Other Sequence Processing Approaches
  1.4 Outline

2 Traditional LSTM
  2.1 Forward Pass
  2.2 Learning
  2.3 Tasks Solved with Traditional LSTM

3 Learning to Forget: Continual Prediction with LSTM
  3.1 Introduction
    3.1.1 Limits of traditional LSTM
  3.2 Solution: Forget Gates
    3.2.1 Forward Pass of Extended LSTM with Forget Gates
    3.2.2 Backward Pass of Extended LSTM with Forget Gates
    3.2.3 Complexity
  3.3 Experiments
    3.3.1 Continual Embedded Reber Grammar Problem
    3.3.2 Network Topology and Parameters
    3.3.3 CERG Results
    3.3.4 Analysis of the CERG Results
    3.3.5 Continual Noisy Temporal Order Problem
  3.4 Conclusion

4 Arithmetic Operations on Continual Input Streams
  4.1 Introduction
  4.2 Experiments
    4.2.1 Network Topology and Parameters
    4.2.2 Results
  4.3 Conclusion

5 Learning Precise Timing with Peephole LSTM
  5.1 Introduction
  5.2 Extending LSTM with "Peephole Connections"
  5.3 Forward Pass
  5.4 Gradient-Based Backward Pass
  5.5 Experiments
    5.5.1 Network Topology and Experimental Parameters
    5.5.2 Measuring Spike Delays (MSD)
    5.5.3 Generating Timed Spikes (GTS)
    5.5.4 Periodic Function Generation (PFG)
    5.5.5 General Observation: Network Initialization
  5.6 Conclusion

6 Simple Context Free and Context Sensitive Languages
  6.1 Introduction
  6.2 Experiments
    6.2.1 Training and Testing
    6.2.2 Network Topology and Experimental Parameters
    6.2.3 Previous Results
    6.2.4 LSTM Results
    6.2.5 Analysis
  6.3 Conclusion

7 Time Series Predictable Through Time-Window Approaches
  7.1 Introduction
  7.2 Experimental Setup
    7.2.1 Network Topology
  7.3 Mackey-Glass Chaotic Time Series
    7.3.1 Previous Work
    7.3.2 Results
    7.3.3 Analysis
  7.4 Laser Data
    7.4.1 Previous Work
    7.4.2 Results
    7.4.3 Analysis
  7.5 Conclusion

8 Conclusion
  8.1 Main Contributions
  8.2 Future Work and Possible Applications of LSTM

A Embedded Reber Grammar Statistics
B Peephole LSTM with Forget Gates in Pseudo-code
References
Personal Record

Abstract

For a long time, recurrent neural networks (RNNs) were thought to be theoretically fascinating. Unlike standard feed-forward networks, RNNs can deal with arbitrary input sequences instead of static input data only. This, combined with the ability to memorize relevant events over time, makes recurrent networks in principle more powerful than standard feed-forward networks. The set of potential applications is enormous: any task that requires learning how to use memory is a potential task for recurrent networks. Potential application areas include time series prediction, motor control in non-Markovian environments and rhythm detection (in music and speech).

Previous successes in real-world applications with recurrent networks were limited, however, due to practical problems when long time lags between relevant events make learning difficult. For these applications, conventional gradient-based recurrent network algorithms for learning to store information over extended time intervals take too long. The main reason for this failure is the rapid decay of back-propagated error. The "Long Short-Term Memory" (LSTM) algorithm overcomes this and related problems by enforcing constant error flow. Using gradient descent, LSTM explicitly learns when to store information and when to access it.

In this thesis we extend, analyze, and apply the LSTM algorithm. In particular, we identify two weaknesses of LSTM, offer solutions and modify the algorithm accordingly: (1) We recognize a weakness of LSTM networks processing continual input streams that are not a priori segmented into subsequences with explicitly marked ends at which the network's internal state could be reset. Without resets, the state may grow indefinitely and eventually cause the network to break down. Our remedy is a novel, adaptive "forget gate" that enables an LSTM cell to learn to reset itself at appropriate times, thus releasing internal resources. (2) We identify a weakness in LSTM's connection scheme, and extend it by introducing "peephole connections" from LSTM's "Constant Error Carousels" to the multiplicative gates protecting them. These connections provide the gates with explicit information about the state to which they control access. We show that peephole connections are necessary for numerous tasks and do not significantly affect LSTM's performance on previously solved tasks.

We apply the extended LSTM with forget gates and peephole connections to tasks that no other RNN algorithm can solve (including traditional LSTM): grammar tasks and temporal order tasks involving continual input streams, arithmetic operations on continual input streams, tasks that require precise, continual timing, periodic function generation, and context free and context sensitive language tasks. Finally, we establish limits of LSTM on time series prediction problems solvable by time window approaches.
Sommario

For a long time, recurrent neural networks have been considered theoretically fascinating. Recurrent networks can naturally process sequences of data instead of receiving only static input data. They can learn to memorize important events. These capabilities make them in principle more powerful than feed-forward networks. The class of potential applications is broad: it contains every problem that requires the use of internal memory. Some examples are time series prediction, motor control in non-Markovian environments and rhythm recognition (for example in music or in spoken language).

On the other hand, recurrent networks have so far had little success in applications to real-world problems characterized by long time intervals between important input events. Conventional gradient-based learning algorithms need too much time to learn to store information across long time intervals. The main reason is the rapid decay of the back-propagated error. Long short-term memory (LSTM) networks offer a solution to this problem, proposing an architecture in which the error flow remains constant. Using gradient descent, LSTM networks can learn when a piece of information must be stored and when it should subsequently be used.

This thesis analyzes, extends and applies the LSTM algorithm. Two defects of the preexisting algorithm are identified, and two main extensions of the algorithm are proposed that resolve the problems encountered. In particular: (1) a defect of the LSTM algorithm is identified that occurs when the input is continual, that is, not a priori divided into subsequences with distinct beginnings and ends. In this case, the algorithm cannot determine when the network should be returned to its initial state, and the internal values can grow without bound, causing a paralysis of the system. The proposed remedy is based on a new adaptive multiplicative unit (gate unit) called the "forget gate". It allows a cell of the LSTM network to learn to return to a previous state at appropriate moments, thereby freeing internal resources.

(2) A defect in the connection scheme of LSTM networks is identified and resolved by introducing connections called "peephole connections". They connect the central unit ("constant error carousel") of the cells to the multiplicative units that surround them. In this way the units are provided with explicit information about the state of the object whose access they control. It is also shown that peephole connections are necessary for numerous problems and that they do not significantly reduce the performance of LSTM networks on previously addressed problems.

The thesis applies the extended LSTM algorithm with forget gates and peephole connections to problems that no other algorithm for recurrent networks can solve (including traditional LSTM networks): grammar problems; temporal order problems involving continual input; arithmetic operations on continual input; problems that require continual and precise measurement of time; the generation of periodic functions; and recognition of context-free and context-sensitive grammars. Finally, limits of the extended LSTM algorithm are identified with respect to time series prediction problems that are solvable by the class of methods based on time windows.
Chapter 1

Introduction

The goal of this Ph.D. thesis is to extend, analyze, and apply a recent, novel, promising gradient learning algorithm for recurrent neural networks (RNNs). The algorithm is called "Long Short-Term Memory" (LSTM). It was introduced by Hochreiter and Schmidhuber (1997).

1.1 Recurrent Neural Networks (RNNs)

The RNNs we consider here consist of units interacting in discrete time via directed, weighted connections with weights $w_{lm}$ (from unit $m$ to unit $l$). Every unit has an activation $y(t)$ updated at every time step $t = 1, 2, \ldots$. The activations of the units feeding into other units form the state of the network. An activation $y^l$ of unit $l$ is updated by computing its network input sum $net_l$,

$$net_l(t) = \sum_m w_{lm}\, y^m(t-1)\,,$$

and "squashing" it with a differentiable function $f$, according to:

$$y^l(t) = f(net_l(t))\,.$$

Input and output to the network are time-varying series of vector patterns called sequences. We define learning in RNNs as optimizing a differentiable objective function $E$, summed over all time steps of all sequences, by adapting the connection weights. $E$ is based on supervised targets $t^k$, where $k$ indexes the output units of the network with activations $y^k$. An example is the squared error objective function:

$$E(t) = \frac{1}{2} \sum_k e_k(t)^2\,, \qquad e_k(t) := t^k(t) - y^k(t)\,,$$

where $e_k$ denotes the externally injected error; $E(t)$ represents the error at time $t$ for one sequence component, called a pattern. For a typical data set consisting of sequences of patterns, $E$ is the sum of $E(t)$ over all patterns of all sequences in the set.
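To make the notation concrete, here is a minimal sketch of the update rule and the squared error objective in Python with NumPy; the network size, the random weight range and the choice of the logistic sigmoid for $f$ are assumptions made only for this example.

import numpy as np

def f(x):
    # A differentiable squashing function; the logistic sigmoid is one choice.
    return 1.0 / (1.0 + np.exp(-x))

N = 5                                      # number of units (assumption)
rng = np.random.default_rng(0)
W = rng.uniform(-0.2, 0.2, size=(N, N))    # w_lm: weight from unit m to unit l

y = np.zeros(N)                            # activations y^m(t-1), initially zero
for t in range(1, 11):                     # discrete time steps t = 1, 2, ...
    net = W @ y                            # net_l(t) = sum_m w_lm y^m(t-1)
    y = f(net)                             # y^l(t) = f(net_l(t))

tk = np.ones(N)                            # supervised targets t^k (assumption)
e = tk - y                                 # e_k(t) := t^k(t) - y^k(t)
E = 0.5 * np.sum(e ** 2)                   # E(t) = 1/2 sum_k e_k(t)^2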

Figure 1.1: Left: Feed-forward neural network. Middle: Layered network with an input layer, a fully recurrent hidden layer and an output layer. Right: Fully connected recurrent network.
A gradient descent learning algorithm for RNNs, such as LSTM, computes the gradient of $E$ with respect to each weight $w_{lm}$ to determine the weight changes $\Delta w_{lm}$:

$$\Delta w_{lm}(t) = -\alpha\, \frac{\partial E(t)}{\partial w_{lm}}\,,$$

where $\alpha$ is called the learning rate. For an excellent introduction to gradient learning in RNNs see Williams and Zipser (1992).
The connection scheme of a network is called the network architecture or topology. Architectures without loops are called feed-forward neural networks (Figure 1.1, left). RNN topologies range from partly recurrent to fully recurrent networks. An example of a partly recurrent network is a layered network with distinct input and output layers, where the recurrence is limited to the hidden layer(s), as shown in the middle of Figure 1.1. In fully recurrent networks each node gets input from all other nodes (Figure 1.1, right).
1.2 General considerations

RNNs constitute a very powerful class of computational models, capable of instantiating almost arbitrary dynamics (Siegelmann & Sontag, 1991).

Usually two basic types of RNN are distinguished: autonomous RNNs with converging dynamics where the input is fixed, for example Hopfield networks (Hopfield, 1982) or Boltzmann machines (Hinton, Sejnowski, & Ackley, 1984), versus non-autonomous RNNs with time-varying inputs. The RNNs considered in this thesis are of the latter class. They can perform gradient descent in a very general space of potentially noise-resistant algorithms using distributed, continuous-valued internal states. The ability to map real-valued input sequences to real-valued output sequences, making use of their internal state to incorporate past context, makes them remarkably general sequence processing devices (Bengio, Frasconi, Gori, & G. Soda, 1993). RNNs are especially promising for tasks that require learning how to use memory. Potential applications are: time series prediction (e.g., of financial series), time series production (e.g., motor control in non-Markovian environments) and time series classification or labeling (e.g., rhythm detection in music and speech).
1.2.1 Problem: Exponential decay of gradient information

The extent to which this potential can be exploited is, however, limited by the effectiveness of the training procedure applied. Gradient-based methods (see the survey by Pearlmutter, 1995), such as "Back-Propagation Through Time" (Williams & Zipser, 1992; Werbos, 1988), "Real-Time Recurrent Learning" (Robinson & Fallside, 1987; Williams & Zipser, 1992) and their combination (Schmidhuber, 1992a), share an important limitation. The temporal evolution of the path integral over all error signals "flowing back in time" depends exponentially on the magnitude of the weights (Hochreiter, 1991). This implies that the back-propagated error quickly either vanishes or blows up (Hochreiter & Schmidhuber, 1997; Bengio, Simard, & Frasconi, 1994; Schmidhuber, 1992b). Hence standard RNNs fail to learn in the presence of long time lags between relevant input and target events. Tasks with time lags greater than 5-10 steps already become difficult for them to learn within reasonable time (Hochreiter & Schmidhuber, 1997). The vanishing error problem casts doubt on whether standard RNNs can indeed exhibit significant practical advantages over time-window-based feed-forward networks (Hochreiter, 1991; Bengio et al., 1994).
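To see the exponential dependence in the simplest case, consider a single unit with self-recurrent weight $w$ (a one-unit illustration, not taken from the thesis): an error signal $\vartheta(t)$ propagated back $n$ steps is rescaled at every step by $f'(net)\, w$, so

$$\vartheta(t-n) = \left( \prod_{i=1}^{n} f'(net(t-i))\, w \right) \vartheta(t)\,,$$

which vanishes exponentially in $n$ if $|f'\, w| < 1$ and blows up if $|f'\, w| > 1$.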
1.2.2 Solution: Constant error carousels

The LSTM algorithm overcomes this problem by enforcing non-decaying error flow "back into time." It can learn to bridge minimal time lags in excess of 1000 discrete time steps (Hochreiter & Schmidhuber, 1997) by enforcing constant error flow through "constant error carousels" (CECs) within special units, called cells. Multiplicative gate units learn to open and close access to the cells. Thus LSTM rather quickly solves many tasks traditional RNNs cannot solve. LSTM's learning algorithm is local in space and time; its computational complexity per time step and weight for standard topologies is O(1).
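The core of the CEC can be stated in one line (an illustration consistent with the cell state update of Chapter 2, not a quotation from it): a linear unit with a self-connection of fixed weight 1.0 has state $s(t) = s(t-1) + \text{input}(t)$, so $\partial s(t)/\partial s(t-1) = 1$, and the back-propagated error satisfies $\vartheta(t-n) = 1^n\, \vartheta(t) = \vartheta(t)$. The error neither vanishes nor explodes, no matter how large $n$ becomes.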
1.3 Previous and Related Work

In this section we give a brief overview of alternative approaches to time series processing and provide pointers to the original works. Throughout the thesis we will discuss their applicability to the various tasks we investigate.

Time window approaches. Any static pattern matching device (e.g., a feed-forward network) with a fixed time window of recent inputs can serve as a temporal sequence processing system. This approach has several significant drawbacks: (1) It is difficult to determine the optimal time window size, if there is any. (2) For tasks with long-term dependencies a large input window is necessary. A solution might be to use a combination of several time windows, but this is only applicable when the exact long-term dependencies of the task are known, which is usually not the case. (3) Fixed time windows are inadequate when a task has changing long-term dependencies.

An RNN approach, on the other hand, can avoid these problems, because in principle RNNs do not need explicit access to the past: they can potentially learn to extract and represent a Markov state.
1.3.1 RNNs

Elman networks and RNNs with context units. In Elman networks the content of the hidden units is copied into so-called context units, which feed back into the hidden layer (Elman, 1990). (This topology is equivalent to a network with a hidden layer where each unit feeds into every other one via time-delayed connections with delay one.) Elman nets are trained by back-propagation (Rumelhart, Hinton, & Williams, 1986); thus they do not even propagate errors back through time. In alternative approaches with context units, the hidden units feed (e.g., fully connected) into the context units (whose number may differ from the number of hidden units). Usually BPTT or RTRL (and their truncated versions) are used for training.

Time delay neural networks (TDNNs). Time-Delay Neural Networks (TDNNs) (Haffner & Waibel, 1992) allow access to past events via cascaded internal delay lines. The interval they can access depends on the network topology. Thus they suffer from the same problems as feed-forward networks using a time window.

Nonlinear autoregressive models with exogenous inputs (NARX) networks. NARX networks (Lin, Horne, Tiño, & Giles, 1996) allow for several distinct input time-windows (possibly of size one) with different temporal offsets. They can potentially solve tasks with stationary long time lags; it remains a problem to determine the right windows. When the long-term dependencies are non-stationary, however, the approach fails.

Focused back-propagation. To deal with long time lags, Mozer (1989) uses time constants which influence activation changes. However, for long time gaps the time constants need external fine-tuning (Mozer, 1992). Sun et al.'s alternative approach (1993) updates the activation of a recurrent unit by adding the old activation and the (scaled) current net input. The net input, however, tends to perturb the stored information, which again makes long-term storage impracticable.

Continual, Hierarchical, Incremental Learning and Development (CHILD). Ring (1994) proposed the CHILD method for bridging long time lags. Whenever a unit in his network receives conflicting error signals, he adds a higher-order unit influencing appropriate connections. Although his approach can sometimes be extremely fast, bridging a time lag of 100 steps may require the addition of 100 units. The network cannot generalize to sequences with unseen lag durations.

Chunker systems. Chunker systems (Schmidhuber, 1992b; Mozer, 1992) do have the ability to bridge arbitrary time lags, but only if the input sequence exhibits locally predictable regularities.

LSTM. LSTM does not suffer from the problems above. It seems to be the state-of-the-art method for recurrent networks faced with realistic, long time lags between occurrences of relevant events.
1.3.2 RNNs versus Other Sequence Processing Approaches

Discrete symbolic grammar learning algorithms (SGLAs). SGLAs (Lee, 1996; Sakakibara, 1997) may learn the grammatical structure of discrete, noise-free event sequences faster, but cannot deal well with noise or with sequences of real-valued inputs (Osborne & Briscoe, 1997).

Hidden Markov models (HMMs). HMMs are widely used approaches to sequence processing. They are well suited for noisy inputs and are invariant to non-linear temporal stretching. This makes HMMs especially successful in speech recognition (they do not care about the difference between slow and fast versions of a given spoken word). But for many other tasks HMMs are less suited because, unlike RNNs, they are limited to discrete state spaces. This makes their application to many time series tasks cumbersome and inefficient. For example, for simple counting tasks HMMs need as many states as the number of symbols in the longest sequence that should be counted, whereas with RNNs the necessary algorithm can be instantiated with networks of 2-5 units (Kalinke & Lehmann, 1998; Rodriguez & Wiles, 1998; Gers & Schmidhuber, 2000e). Thus, in principle, RNNs are applicable to tasks beyond the reach of HMMs.

Input output hidden Markov models (IOHMMs). The input-output HMM architecture (Bengio & Frasconi, 1995) combines elements of mixture-of-experts, RNNs, and hidden Markov models, and is adapted via the EM algorithm. To our knowledge, this architecture has not yet been applied to tasks comparable to the ones discussed here. But it was shown to solve simple tasks involving long time lags.

Genetic Programming and Program Search. Genetic Programming (see, e.g., Dickmanns et al., 1987; Cramer, 1985; Koza, 1992) and Probabilistic Incremental Program Evolution (PIPE) (Salustowicz & Schmidhuber, 1997) could in principle search in general algorithm spaces but are slow due to the absence of gradient information providing a search direction.

Random guessing. For some simple benchmarks, weight guessing finds solutions faster than elaborate gradient algorithms (Hochreiter & Schmidhuber, 1996, 1995; Schmidhuber & Hochreiter, 1996).
1.4 Outline

Traditional LSTM. Chapter 2 describes the traditional LSTM algorithm as introduced by Hochreiter and Schmidhuber (1997).

Forget Gates. In Chapter 3 we identify a weakness of LSTM in dealing with continual input streams that are not a priori segmented into separate training sequences, such that it is not clear when to reset the network's internal state. We introduce "forget gates" as a remedy (Gers, Schmidhuber, & Cummins, 2000, 1999b).

Arithmetic operations. In Chapter 4 we present tasks involving arithmetic operations on continual input streams that traditional LSTM cannot solve. But LSTM extended with forget gates has superior arithmetic capabilities and does solve the tasks (Gers & Schmidhuber, 2000c).

Timing, extending LSTM with "peephole connections". In Chapter 5 we investigate tasks where the temporal distance between events conveys essential information (this is the case for numerous sequential tasks such as motor control and rhythm detection). First we identify a weakness in LSTM's connection scheme, regarding the wiring of the nonlinear, multiplicative gates surrounding and protecting LSTM's constant error carousels (CECs). We extend LSTM by introducing "peephole connections" from the CECs to the gates and find that LSTM augmented by peephole connections can learn precise timing. It learned, for example, the fine distinction between sequences of spikes separated by either 50 or 49 discrete time steps, without the help of any short training exemplars (Gers & Schmidhuber, 2000e; Gers, Schmidhuber, & Schraudolph).

Context free and context sensitive languages. Previous work by Hochreiter and Schmidhuber (1997) and our own work (see Chapter 3) showed that LSTM outperforms traditional RNNs in learning regular languages from exemplary training sequences. In Chapter 6 we demonstrate LSTM's superior performance on context free language (CFL) benchmarks for recurrent neural networks (RNNs). To the best of our knowledge, LSTM variants are also the first RNNs to learn a simple context sensitive language (CSL), namely $a^n b^n c^n$ (Gers & Schmidhuber, 2001).

Time series prediction. In Chapter 7 LSTM is applied to time series prediction tasks solvable by time window approaches: the Mackey-Glass series and the Santa Fe FIR laser emission series (Set A) (Gers, Eck, & Schmidhuber, 2000, 2001).
Chapter 2

Traditional LSTM

The basic unit in the hidden layer of an LSTM network is the memory block; it replaces the hidden units in a "traditional" RNN (Figure 2.1). A memory block contains one or more memory cells and a pair of adaptive, multiplicative gating units which gate input and output to all cells in the block. Memory blocks allow cells to share the same gates (provided the task permits this), thus reducing the number of adaptive parameters. Each memory cell has at its core a recurrently self-connected linear unit called the "Constant Error Carousel" (CEC), whose activation we call the cell state. The CECs solve the vanishing error problem: in the absence of new input or error signals to the cell, the CEC's local error back flow remains constant, neither growing nor decaying.
Figure 2.1: Left: RNN with one fully recurrent hidden layer. Right: LSTM network with memory blocks in the hidden layer (only one is shown).

Figure 2.2: The traditional LSTM cell has a linear unit with a recurrent self-connection with weight 1.0 (CEC). Input and output gates regulate read and write access to the cell, whose state is denoted $s_c$. The function $g$ squashes the cell's input; $h$ squashes the cell's output (see text for details).
The CEC is protected from both forward-flowing activation and backward-flowing error by the input and output gates, respectively. When the gates are closed (activation around zero), irrelevant inputs and noise do not enter the cell, and the cell state does not perturb the remainder of the network. Figure 2.2 shows a memory block with a single cell.
2.1 Forward Pass

The cell state, $s_c$, is updated based on its current state and three sources of input: $net_c$ is the input to the cell itself, while $net_{in}$ and $net_{out}$ are the inputs to the input and output gates.

We consider discrete time steps $t = 1, 2, \ldots$. A single step involves the update of all units (forward pass) and the computation of error signals for all weights (backward pass). Input gate activation $y^{in}$ and output gate activation $y^{out}$ are computed as follows:

$$net_{out_j}(t) = \sum_m w_{out_j m}\, y^m(t-1)\,, \qquad y^{out_j}(t) = f_{out_j}(net_{out_j}(t))\,; \eqno(2.1)$$

$$net_{in_j}(t) = \sum_m w_{in_j m}\, y^m(t-1)\,, \qquad y^{in_j}(t) = f_{in_j}(net_{in_j}(t))\,. \eqno(2.2)$$

Throughout this thesis $j$ indexes memory blocks; $v$ indexes memory cells in block $j$ (with $S_j$ cells), such that $c_j^v$ denotes the $v$-th cell of the $j$-th memory block; $w_{lm}$ is the weight on the connection from unit $m$ to unit $l$. Index $m$ ranges over all source units, as specified by the network topology (if a source unit activation $y^m(t-1)$ refers to an input unit, the current external input $y^m(t)$ is used instead). For the gates, $f$ is a logistic sigmoid (with range $[0, 1]$):

$$f(x) = \frac{1}{1 + e^{-x}}\,. \eqno(2.3)$$

The input to the cell itself is

$$net_{c_j^v}(t) = \sum_m w_{c_j^v m}\, y^m(t-1)\,, \eqno(2.4)$$

which is squashed by $g$, a centered logistic sigmoid function with range $[-2, 2]$ (if not specified differently):

$$g(x) = \frac{4}{1 + e^{-x}} - 2\,. \eqno(2.5)$$

The internal state of memory cell $s_c(t)$ is calculated by adding the squashed, gated input to the state at the last time step, $s_c(t-1)$:

$$s_{c_j^v}(0) = 0\,, \qquad s_{c_j^v}(t) = s_{c_j^v}(t-1) + y^{in_j}(t)\, g(net_{c_j^v}(t)) \quad \text{for } t > 0\,. \eqno(2.6)$$

The cell output $y^c$ is calculated by squashing the internal state $s_c$ via the output squashing function $h$, and then multiplying (gating) it by the output gate activation $y^{out}$:

$$y^{c_j^v}(t) = y^{out_j}(t)\, h(s_{c_j^v}(t))\,. \eqno(2.7)$$

$h$ is a centered sigmoid with range $[-1, 1]$:

$$h(x) = \frac{2}{1 + e^{-x}} - 1\,. \eqno(2.8)$$

Finally, assuming a layered network topology with a standard input layer, a hidden layer consisting of memory blocks, and a standard output layer, the equations for the output units $k$ are:

$$net_k(t) = \sum_m w_{km}\, y^m(t-1)\,, \qquad y^k(t) = f_k(net_k(t))\,, \eqno(2.9)$$

where $m$ ranges over all units feeding the output units (typically all cells in the hidden layer and the input units, but not the memory block gates). As squashing function $f_k$ we again use the logistic sigmoid (2.3). This concludes traditional LSTM's forward pass.
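A minimal sketch of equations (2.1)-(2.9) for one memory block with a single cell, in Python with NumPy; the number of source units, the random weights and holding the source activations fixed across steps are assumptions made for this example.

import numpy as np

def f(x):  # logistic sigmoid, range [0, 1], equation (2.3)
    return 1.0 / (1.0 + np.exp(-x))

def g(x):  # centered input squashing, range [-2, 2], equation (2.5)
    return 4.0 / (1.0 + np.exp(-x)) - 2.0

def h(x):  # centered output squashing, range [-1, 1], equation (2.8)
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

rng = np.random.default_rng(0)
M = 4                                    # number of source units (assumption)
w_out, w_in, w_c = (rng.uniform(-0.2, 0.2, M) for _ in range(3))

s = 0.0                                  # cell state, s_c(0) = 0
y_m = rng.uniform(0.0, 1.0, M)           # source activations y^m(t-1), held fixed here
for t in range(1, 6):
    y_out = f(w_out @ y_m)               # output gate activation, equation (2.1)
    y_in = f(w_in @ y_m)                 # input gate activation, equation (2.2)
    net_c = w_c @ y_m                    # cell input, equation (2.4)
    s = s + y_in * g(net_c)              # state update, equation (2.6)
    y_c = y_out * h(s)                   # cell output, equation (2.7)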
2.2 Learning

See Hochreiter & Schmidhuber (1997) for details of traditional LSTM's backward pass. It will be re-derived and discussed in detail in Section 3.2.2 after the introduction of forget gates. Essentially, as in truncated BPTT, errors arriving at the net inputs of memory blocks and their gates do not get propagated back further in time, although they do serve to change the incoming weights. In essence, once an error signal arrives at a memory cell output, it gets scaled by the output gate and the output nonlinearity $h$; then it enters the memory cell's linear CEC, where it can flow back indefinitely without ever being changed (this is why LSTM can bridge arbitrary time lags between input events and target signals). Only when the error escapes from the memory cell through an opening input gate and the additional input nonlinearity $g$ does it get scaled once more, and then it serves to change incoming weights before being truncated. The consequence of this truncation is that each LSTM block relies on errors from the output for its adaptation. Since blocks do not exchange error signals, it is hard for LSTM to learn tasks where one block exclusively serves other blocks (e.g., as a pointer into a FIFO queue) without directly reducing the output error.
2.3 Tasks Solved with Traditional LSTM

Hochreiter and Schmidhuber (1997) already solved a wide range of tasks with traditional LSTM: (1) the embedded Reber grammar (a popular regular grammar benchmark); (2) noise-free and noisy sequences with time lags of up to 1000 steps (e.g., the "2-sequence problem" proposed by Bengio et al., 1994); (3) continuous-valued tasks that require the storage of values for long time periods and their summation and multiplication (up to a certain precision); (4) temporal order problems with widely separated inputs.

In the following chapters, however, we will present tasks (partly derived from the tasks listed above) on which traditional LSTM fails, and point out its problems.
Chapter 3

Learning to Forget: Continual Prediction with LSTM

3.1 Introduction

Hochreiter and Schmidhuber (1997) demonstrated that LSTM can solve numerous tasks not solvable by previous learning algorithms for RNNs. In this chapter, however, we will show that even LSTM fails to learn to correctly process certain very long or continual time series that are not a priori segmented into appropriate training subsequences with clearly defined beginnings and ends at which the network's internal state could be reset. The problem is that a continual input stream eventually may cause the internal values of the cells to grow without bound, even if the repetitive nature of the problem suggests they should be reset occasionally. In this chapter we will present a remedy.

While we present a specific solution to the problem of forgetting in LSTM networks, we recognize that any training procedure for RNNs which is powerful enough to span long time lags must also address the issue of forgetting in short-term memory (unit activations). We know of no other current training method for RNNs which is sufficiently powerful to have encountered this problem.

Outline. Section 3.1.1 explains LSTM's weakness in processing continual input streams. Section 3.2 introduces a remedy called "forget gates." Forget gates learn to reset memory cell contents once they are not needed any more. Forgetting may occur rhythmically or in an input-dependent fashion. In the same section we derive a gradient-based learning algorithm for the LSTM extension with forget gates. Section 3.3 describes experiments: we transform well-known benchmark problems into more complex, continual tasks, report the performance of various RNN algorithms, and analyze and compare the networks found by traditional LSTM and extended LSTM.
3.1.1 Limits of traditional LSTM
LSTM allows information to be stored across arbitrary time lags, and error signals to be carried far back in time. This potential strength, however, can contribute to a weakness in some situations: the cell states $s_c$ often tend to grow linearly during the presentation of a time series (the nonlinear aspects of sequence processing are left to the squashing functions and the highly nonlinear gates). If we present a continual input stream, the cell states may grow in unbounded fashion, causing saturation of the output squashing function $h$. This happens even if the nature of the problem suggests that the cell states should be reset occasionally, e.g., at the beginnings of new input sequences (whose starts, however, are not explicitly indicated by a teacher). Saturation will (a) make $h$'s derivative vanish, thus blocking incoming errors, and (b) make the cell output equal the output gate activation, that is, the entire memory cell will degenerate into an ordinary BPTT unit, so that the cell will cease functioning as a memory. The problem did not arise in the experiments reported by Hochreiter & Schmidhuber (1997) because cell states were explicitly reset to zero before the start of each new sequence.

How can we solve this problem without losing LSTM's advantages over time delay neural networks (TDNNs) (Waibel, 1989) or NARX networks (Lin et al., 1996), which depend on a priori knowledge of typical time lag sizes?

The standard technique of weight decay, which helps to contain the level of overall activity within the network, was found to generate solutions which were particularly prone to unbounded state growth.

Variants of focused back-propagation (Mozer, 1989) also do not work well. These let the internal state decay via a self-connection whose weight is smaller than 1. But there is no principled way of designing appropriate decay constants: a potential gain for some tasks is paid for by a loss of ability to deal with arbitrary, unknown causal delays between inputs and targets. In fact, state decay does not significantly improve experimental performance (see "State Decay" in Table 3.2).

Of course we might try to "teacher force" (Jordan, 1986; Doya & Yoshizawa, 1989) the internal states $s_c$ by resetting them once a new training sequence starts. But this requires an external teacher who knows how to segment the input stream into training subsequences. We are precisely interested, however, in those situations where there is no a priori knowledge of this kind.
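A small numerical illustration of the drift problem (constructed for this discussion, not taken from the experiments): under the update (2.6), a steady positive gated input drives $s_c$ up linearly, so $h(s_c)$ saturates and $h'(s_c)$ vanishes, blocking incoming errors exactly as described above.

import numpy as np

def h(x):        # output squashing, range [-1, 1]
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

def h_prime(x):  # derivative of h
    return 2.0 * np.exp(-x) / (1.0 + np.exp(-x)) ** 2

s = 0.0
for t in range(100):
    s += 0.5     # steady gated input: the state grows linearly, as in eq. (2.6)

print(s, h(s), h_prime(s))  # s = 50.0, h(s) ~ 1.0, h'(s) ~ 4e-22: errors blocked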
3.2 Solution: Forget Gates

Our solution to the problem above is to use adaptive "forget gates" which learn to reset memory blocks once their contents are out of date and hence useless. By resets we do not only mean immediate resets to zero but also gradual resets corresponding to slowly fading cell states. More specifically, we replace traditional LSTM's constant CEC weight 1.0 by the multiplicative forget gate activation $y^{\varphi}$. See Figure 3.1.

3.2.1 Forward Pass of Extended LSTM with Forget Gates

All equations of traditional LSTM's forward pass except for equation (2.6) remain valid for extended LSTM with forget gates.

The forget gate activation $y^{\varphi}$ is calculated like the activations of the other gates; compare equations (2.1) and (2.2):

$$net_{\varphi_j}(t) = \sum_m w_{\varphi_j m}\, y^m(t-1)\,, \qquad y^{\varphi_j}(t) = f_{\varphi_j}(net_{\varphi_j}(t))\,. \eqno(3.1)$$

Here $net_{\varphi_j}$ is the input from the network to the forget gate. We use the logistic sigmoid with range $[0, 1]$ as squashing function $f_{\varphi_j}$. Its output becomes the weight of the self-recurrent connection of the internal state $s_c$ in equation (2.6).

Figure 3.1: Memory block with only one cell for the extended LSTM. A multiplicative forget gate can reset the cell's inner state $s_c$.
The revised update equation for $s_c$ in the extended LSTM algorithm is (for $t > 0$):

$$s_{c_j^v}(t) = y^{\varphi_j}(t)\, s_{c_j^v}(t-1) + y^{in_j}(t)\, g(net_{c_j^v}(t))\,, \eqno(3.2)$$

with $s_{c_j^v}(0) = 0$. Extended LSTM's full forward pass is obtained by adding equation (3.1) to those in Chapter 2 and replacing equation (2.6) by (3.2).

Bias weights for LSTM gates are initialized with negative values for input and output gates (see Section 3.3.2) and positive values for forget gates. This implies (compare equations (3.1) and (3.2)) that at the beginning of the training phase the forget gate activation will be almost 1.0, and the entire cell will behave like a traditional LSTM cell. It will not explicitly forget anything until it has learned to forget.
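The change relative to the traditional forward pass is confined to the state update; a minimal sketch of equations (3.1) and (3.2) in the same style as before (sizes, weights and inputs are again assumptions for the example):

import numpy as np

def f(x):  # logistic sigmoid, range [0, 1]
    return 1.0 / (1.0 + np.exp(-x))

def g(x):  # input squashing, range [-2, 2]
    return 4.0 / (1.0 + np.exp(-x)) - 2.0

rng = np.random.default_rng(0)
M = 4                                     # number of source units (assumption)
w_phi, w_in, w_c = (rng.uniform(-0.2, 0.2, M) for _ in range(3))
bias_phi = 0.5                            # positive forget gate bias: y_phi starts near 1

s = 0.0
y_m = rng.uniform(0.0, 1.0, M)
y_phi = f(w_phi @ y_m + bias_phi)         # forget gate activation, equation (3.1)
y_in = f(w_in @ y_m)                      # input gate activation, equation (2.2)
s = y_phi * s + y_in * g(w_c @ y_m)       # revised state update, equation (3.2)
# y_phi near 1.0 behaves like the traditional CEC; y_phi near 0.0 resets the cell.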
3.2.2 Backward Pass of Extended LSTM with Forget Gates

LSTM's backward pass is an efficient fusion of slightly modified, truncated back-propagation through time (BPTT) (e.g., Williams & Peng, 1990) and a customized version of real-time recurrent learning (RTRL) (e.g., Robinson & Fallside, 1987). Output units use BP; output gates use slightly modified, truncated BPTT. Weights to cells, input gates and the novel forget gates, however, use a truncated version of RTRL. Truncation means that all errors are cut off once they leak out of a memory cell or gate, although they do serve to change the incoming weights. The effect is that the CECs are the only part of the system through which errors can flow back forever. This makes LSTM's updates efficient without significantly affecting learning power: error flow outside of cells tends to decay exponentially anyway (Hochreiter, 1991). In the equations below, $\stackrel{tr}{=}$ indicates where we use error truncation and, for simplicity, unless otherwise indicated, we assume only a single cell per block.
We start with the usual squared error objective function based on targets $t^k$:

$$E(t) = \frac{1}{2} \sum_k e_k(t)^2\,, \qquad e_k(t) := t^k(t) - y^k(t)\,, \eqno(3.3)$$

where $e_k$ denotes the externally injected error. We minimize $E$ via gradient descent by adding weight changes $\Delta w_{lm}$ to the weights $w_{lm}$ (from unit $m$ to unit $l$) using learning rate $\alpha$ ($\delta_{ij}$ is the Kronecker delta):

$$\Delta w_{lm}(t) = -\alpha\, \frac{\partial E(t)}{\partial w_{lm}} = -\alpha \sum_k \frac{\partial E(t)}{\partial y^k(t)}\, \frac{\partial y^k(t)}{\partial w_{lm}} = \alpha \sum_k e_k(t)\, \frac{\partial y^k(t)}{\partial w_{lm}}$$

$$= \alpha \sum_k \sum_{l'} e_k(t)\, \frac{\partial y^k(t)}{\partial y^{l'}(t)}\, \frac{\partial y^{l'}(t)}{\partial net_{l'}(t)}\, \frac{\partial net_{l'}(t)}{\partial w_{lm}}$$

$$= \alpha \sum_k \sum_{l'} e_k(t)\, \frac{\partial y^k(t)}{\partial y^{l'}(t)}\, \frac{\partial y^{l'}(t)}{\partial net_{l'}(t)} \left( \delta_{l l'}\, y^m(t-1) + \sum_{m'} w_{l' m'}\, \frac{\partial y^{m'}(t-1)}{\partial w_{lm}} \right).$$

Errors are truncated when they leave a memory block by setting the following derivatives in the above equation to zero:

$$\frac{\partial y^{m}(t-1)}{\partial net_{l'}(t)} \stackrel{tr}{=} 0 \quad \text{for } l' \in \{\varphi,\, in,\, c_j^v\}\,.$$

$$\Delta w_{lm}(t) \stackrel{tr}{=} \alpha \sum_k e_k(t)\, \frac{\partial y^k(t)}{\partial y^{l}(t)}\, \frac{\partial y^{l}(t)}{\partial net_{l}(t)}\, y^m(t-1) = \alpha\, \underbrace{\frac{\partial y^l(t)}{\partial net_l(t)} \left( \sum_k \frac{\partial y^k(t)}{\partial y^l(t)}\, e_k(t) \right)}_{=:\ \delta_l(t)}\, y^m(t-1)\,. \eqno(3.4)$$

For an arbitrary output unit ($l = k'$) the sum in (3.4) reduces to $e_k$ (with $k = k'$). By differentiating equation (2.9) we obtain the usual back-propagation weight changes for the output units:

$$\frac{\partial y^k(t)}{\partial net_k(t)} = f'_k(net_k(t)) \implies \delta_k(t) = f'_k(net_k(t))\, e_k(t)\,. \eqno(3.5)$$
To compute the weight changes for the output gates $w_{out_j m}$ we set $l = out_j$ in (3.4). The resulting terms can be determined by differentiating equations (2.1), (2.7) and (2.9):

$$\frac{\partial y^{out_j}(t)}{\partial net_{out_j}(t)} = f'_{out_j}(net_{out_j}(t))\,, \qquad \sum_k \frac{\partial y^k(t)}{\partial y^{out_j}(t)}\, e_k(t) = h(s_{c_j^v}(t)) \sum_k w_{k\, c_j^v}\, \delta_k(t)\,.$$

Inserting both terms in equation (3.4) gives $\delta^v_{out_j}$, the contribution of the block's $v$-th cell to $\delta_{out_j}$. As every cell in a memory block contributes to the weight change of the output gate, we have to sum over all cells $v$ in block $j$ to obtain the total $\delta_{out_j}$ of the $j$-th memory block (with $S_j$ cells):

$$\delta_{out_j}(t) = f'_{out_j}(net_{out_j}(t)) \left( \sum_{v=1}^{S_j} h(s_{c_j^v}(t)) \sum_k w_{k\, c_j^v}\, \delta_k(t) \right). \eqno(3.6)$$

Equations (3.4), (3.5) and (3.6) define the weight changes for output units and output gates of memory blocks. Their derivation was almost standard BPTT, with error signals truncated once they leave memory blocks (including their gates). This truncation does not affect LSTM's long time lag capabilities but is crucial for all equations of the backward pass and should be kept in mind.
For weights to the cell, input gate and forget gate we adopt an RTRL-oriented perspective, by first stating the influence of a cell's internal state $s_{c_j^v}$ on the error and then analyzing how each weight to the cell or the block's gates contributes to $s_{c_j^v}$. So we split the gradient in a way different from the one used in equation (3.4), neglecting, however, the same derivatives:

$$\Delta w_{lm}(t) = -\alpha\, \frac{\partial E(t)}{\partial w_{lm}} \stackrel{tr}{=} -\alpha\, \frac{\partial E(t)}{\partial s_{c_j^v}(t)}\, \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}} = \alpha\, \underbrace{\left( -\frac{\partial E(t)}{\partial s_{c_j^v}(t)} \right)}_{=:\ e_{s_{c_j^v}}(t)} \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}}\,. \eqno(3.7)$$

These terms are the internal state error $e_{s_{c_j^v}}$ and a partial $\partial s_{c_j^v} / \partial w_{lm}$ of $s_{c_j^v}$ with respect to weights $w_{lm}$ feeding the cell $c_j^v$ ($l = c_j^v$), the block's input gate ($l = in$) or the block's forget gate ($l = \varphi$), as all these weights contribute to the calculation of $s_{c_j^v}(t)$. We treat the internal state error $e_{s_{c_j^v}}$ analogously to (3.4) and obtain:

$$e_{s_{c_j^v}}(t) := -\frac{\partial E(t)}{\partial s_{c_j^v}(t)} \stackrel{tr}{=} -\sum_k \frac{\partial E(t)}{\partial y^k(t)}\, \frac{\partial y^k(t)}{\partial y^{c_j^v}(t)}\, \frac{\partial y^{c_j^v}(t)}{\partial s_{c_j^v}(t)} = \frac{\partial y^{c_j^v}(t)}{\partial s_{c_j^v}(t)} \sum_k \underbrace{\frac{\partial y^k(t)}{\partial y^{c_j^v}(t)}\, e_k(t)}_{=\ w_{k\, c_j^v}\, \delta_k(t)}\,.$$

Differentiating the forward pass equation (2.7), we obtain:

$$\frac{\partial y^{c_j^v}(t)}{\partial s_{c_j^v}(t)} = y^{out_j}(t)\, h'(s_{c_j^v}(t))\,.$$

Substituting this term in the equation for $e_{s_{c_j^v}}$:

$$e_{s_{c_j^v}}(t) = y^{out_j}(t)\, h'(s_{c_j^v}(t)) \left( \sum_k w_{k\, c_j^v}\, \delta_k(t) \right). \eqno(3.8)$$
This internal state error needs to be calculated for each memory cell. To calculate the partial $\partial s_{c_j^v} / \partial w_{lm}$ in equation (3.7) we differentiate equation (3.2) and obtain a sum of four terms:

$$\frac{\partial s_{c_j^v}(t)}{\partial w_{lm}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{lm}}\, y^{\varphi_j}(t) + y^{in_j}(t)\, \frac{\partial g(net_{c_j^v}(t))}{\partial w_{lm}} + g(net_{c_j^v}(t))\, \frac{\partial y^{in_j}(t)}{\partial w_{lm}} + s_{c_j^v}(t-1)\, \frac{\partial y^{\varphi_j}(t)}{\partial w_{lm}}\,. \eqno(3.9)$$

The first term is nonzero for all $l \in \{\varphi, in, c_j^v\}$; the second is nonzero only for $l = c_j^v$ (cell), the third only for $l = in$ (input gate), and the fourth only for $l = \varphi$ (forget gate).

Differentiating the forward pass equations (2.6), (2.2) and (3.1) for $g$, $y^{in}$ and $y^{\varphi}$, we can substitute the unresolved partials and split the expression on the right-hand side of (3.9) into three separate equations for the cell ($l = c_j^v$), the input gate ($l = in$) and the forget gate ($l = \varphi$):

$$\frac{\partial s_{c_j^v}(t)}{\partial w_{c_j^v m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{c_j^v m}}\, y^{\varphi_j}(t) + g'(net_{c_j^v}(t))\, y^{in_j}(t)\, y^m(t-1)\,, \eqno(3.10)$$

$$\frac{\partial s_{c_j^v}(t)}{\partial w_{in_j m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{in_j m}}\, y^{\varphi_j}(t) + g(net_{c_j^v}(t))\, f'_{in_j}(net_{in_j}(t))\, y^m(t-1)\,, \eqno(3.11)$$

$$\frac{\partial s_{c_j^v}(t)}{\partial w_{\varphi_j m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{\varphi_j m}}\, y^{\varphi_j}(t) + s_{c_j^v}(t-1)\, f'_{\varphi_j}(net_{\varphi_j}(t))\, y^m(t-1)\,. \eqno(3.12)$$

Furthermore, the initial state of the network does not depend on the weights, so we have

$$\frac{\partial s_{c_j^v}(t=0)}{\partial w_{lm}} = 0 \quad \text{for } l \in \{\varphi, in, c_j^v\}\,. \eqno(3.13)$$

Note that the recursions in equations (3.10)-(3.12) depend on the actual activation of the block's forget gate. When the activation goes to zero, not only the cell's state but also the partials are reset (forgetting includes forgiving previous mistakes). Every cell needs to keep a copy of each of these three partials and update them at every time step.

We can insert the partials in equation (3.7) and calculate the corresponding weight updates, with the internal state error $e_{s_{c_j^v}}(t)$ given by equation (3.8). The difference between updates of weights to a cell itself ($l = c_j^v$) and updates of weights to the gates is that changes to weights to the cell $w_{c_j^v m}$ only depend on the partials of this cell's own state:

$$\Delta w_{c_j^v m}(t) = \alpha\, e_{s_{c_j^v}}(t)\, \frac{\partial s_{c_j^v}(t)}{\partial w_{c_j^v m}}\,. \eqno(3.14)$$

To update the weights of the input gate and of the forget gate, however, we have to sum over the contributions of all cells in the block:

$$\Delta w_{lm}(t) = \alpha \sum_{v=1}^{S_j} e_{s_{c_j^v}}(t)\, \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}} \quad \text{for } l \in \{\varphi, in\}\,. \eqno(3.15)$$

The equations necessary to implement the backward pass are (3.4), (3.5), (3.6), (3.8), (3.10), (3.11), (3.12), (3.13), (3.14) and (3.15).
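A sketch of one step of this backward pass for a block with a single cell ($S_j = 1$), in Python; all forward-pass quantities are placeholder values, and the output-unit deltas $\delta_k$ of equation (3.5) are assumed to have been computed already.

import numpy as np

def f(x):        # logistic sigmoid
    return 1.0 / (1.0 + np.exp(-x))

def f_prime(x):  # derivative of f
    return f(x) * (1.0 - f(x))

def g(x):        # input squashing, range [-2, 2]
    return 4.0 / (1.0 + np.exp(-x)) - 2.0

def g_prime(x):  # derivative of g
    return 4.0 * np.exp(-x) / (1.0 + np.exp(-x)) ** 2

def h_prime(x):  # derivative of h(x) = 2/(1+e^-x) - 1
    return 2.0 * np.exp(-x) / (1.0 + np.exp(-x)) ** 2

alpha = 0.5                                   # learning rate
# Placeholder values assumed stored from the forward pass at time t:
y_m, y_in, y_phi, y_out = 0.3, 0.8, 0.9, 0.7  # one source unit and the gate activations
net_c, net_in, net_phi = 0.4, 1.2, 2.0        # net inputs of cell and gates
s, s_prev = 0.5, 0.1                          # s_c(t) and s_c(t-1)
delta_k = np.array([0.05])                    # output deltas, equation (3.5)
w_kc = np.array([0.2])                        # weights from the cell to the output units

# Internal state error, equation (3.8)
e_s = y_out * h_prime(s) * float(w_kc @ delta_k)

# Recursive partials, equations (3.10)-(3.12); they start at zero, equation (3.13)
dS_c = dS_in = dS_phi = 0.0
dS_c = dS_c * y_phi + g_prime(net_c) * y_in * y_m          # weight to the cell
dS_in = dS_in * y_phi + g(net_c) * f_prime(net_in) * y_m   # weight to the input gate
dS_phi = dS_phi * y_phi + s_prev * f_prime(net_phi) * y_m  # weight to the forget gate

# Weight updates, equations (3.14)-(3.15); with one cell the block sums are trivial
dw_c, dw_in, dw_phi = (alpha * e_s * dS for dS in (dS_c, dS_in, dS_phi))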
3.2.3 Complexity

To calculate the computational complexity of extended LSTM we take into account that weights to input gates and forget gates cause more expensive updates than others, because each such weight directly affects all the cells in its memory block. We evaluate a rather typical topology used in the experiments (see Figure 3.3). All memory blocks have the same size; gates have no outgoing connections; output units and gates have a bias connection (from a unit whose activation is always 1.0); other connections to output units stem from memory blocks only; the hidden layer is fully connected. Let $B, S, I, K$ denote the numbers of memory blocks, memory cells in each block, input units, and output units, respectively. We find the update complexity per time step to be:

$$W = B \cdot \Big[ S \cdot \big( (B S + 1) + 2 \cdot (B S + 1) \big) + (B S + 1) \Big] + K \cdot (B S + 1) + I \cdot B \cdot (S + 2 S + 1) = O(B^2 S^2) + O(K B S) + O(I B S)\,. \eqno(3.16)$$

Here the bracketed term counts recurrent connections and bias within the hidden layer: per block, $S \cdot (BS+1)$ operations for the weights to the cells, $S \cdot 2(BS+1)$ for the weights to the input and forget gates (each such weight affects all $S$ cells), and $BS+1$ for the weights to the output gate; $K \cdot (BS+1)$ counts the connections to the output units and $I \cdot B \cdot (S + 2S + 1)$ those from the input units.

Keeping $K$ and $I$ fixed we obtain a total computational complexity of $O(B^2 S^2)$. The number of weights is:

$$N_w = B \cdot \Big[ S \cdot (B S + 1) + 3 \cdot (B S + 1) \Big] + K \cdot (B S + 1) + I \cdot (B S + 3 B)\,,$$

where the bracketed term again counts recurrent connections and bias (to the cells and to the three gates), $K \cdot (BS+1)$ the connections to the output units, and $I \cdot (BS + 3B)$ those from the input units; with $K$ and $I$ fixed:

$$N_w = O(B^2 S^2)\,.$$

Hence LSTM's computational complexity per time step and weight is $O(1)$. Considering connections to gates separately, we find that their computational complexity per time step and weight is $O(S)$. But this is compensated by the "less complex" connections to the cells of $O(1)$. It is essentially the same as for a fully connected BPTT recurrent network. Storage complexity per weight is also $O(1)$, as the last time step's partials from equations (3.10), (3.11) and (3.12) are all that need to be stored for the backward pass. So the storage complexity does not depend on the length of the input sequence. Hence extended LSTM is local in space and time, according to Schmidhuber's definition (1989), just like traditional LSTM.
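As a worked check of equation (3.16) and the weight count, a small Python helper (a sketch of the formulas above, not part of the thesis):

def lstm_counts(B, S, I, K):
    # W: update operations per time step, equation (3.16);
    # N_w: number of weights for the topology analyzed above.
    W = (B * (S * ((B * S + 1) + 2 * (B * S + 1)) + (B * S + 1))
         + K * (B * S + 1) + I * B * (S + 2 * S + 1))
    N_w = (B * (S * (B * S + 1) + 3 * (B * S + 1))
           + K * (B * S + 1) + I * (B * S + 3 * B))
    return W, N_w

# Example: B = 4 blocks of S = 2 cells, I = K = 7 as in Section 3.3.2 (whose
# network adds input-to-output shortcuts, so its 424 weights differ from N_w here).
print(lstm_counts(4, 2, 7, 7))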
3.3 Experiments
3.3.1 Continual Embedded Reber Grammar Problem
To generate an infinite input stream we extend the well-known "embedded Reber grammar" (ERG) benchmark problem, e.g., Smith and Zipser (1989), Cleeremans et al. (1989), Fahlman (1991), Hochreiter & Schmidhuber (1997). Consider Figure 3.2.

ERG. The traditional method starts at the leftmost node of the ERG graph, and sequentially generates finite symbol strings (beginning with the empty string) by stepping from node to node following the edges of the graph, and appending the symbols associated with the edges to the current string until the rightmost node is reached. Edges are chosen randomly if there is a choice (probability = 0.5).

Input and target symbols are represented by 7-dimensional binary vectors, each component standing for one of the 7 possible symbols. Hence the network has 7 input units and 7 output units. The task is to read strings, one symbol at a time, and to continually predict the next possible symbol(s). Input vectors have exactly one nonzero component. Target vectors may have two, because sometimes there is a choice of two possible symbols at the next step. A prediction is considered correct if the error at each of the 7 output units is below 0.49 (error signals occur at every time step).
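For concreteness, a sketch of an ERG/CERG stream generator in Python; the transition table writes out the standard Reber grammar diagram of Figure 3.2 as a dictionary and should be checked against the figure.

import random

# Standard Reber grammar: per state, the (symbol, next state) choices;
# state 6 is the rightmost node, where generation stops.
REBER = {
    0: [('T', 1), ('P', 2)],
    1: [('S', 1), ('X', 3)],
    2: [('T', 2), ('V', 4)],
    3: [('X', 2), ('S', 5)],
    4: [('P', 3), ('V', 5)],
    5: [('E', 6)],
}

def reber_string():
    state, symbols = 0, ['B']
    while state != 6:
        symbol, state = random.choice(REBER[state])  # edges chosen with p = 0.5
        symbols.append(symbol)
    return symbols

def embedded_reber_string():
    branch = random.choice(['T', 'P'])  # the second symbol that must be remembered
    return ['B', branch] + reber_string() + [branch, 'E']

def cerg_stream(n_strings):
    # Concatenate ERG strings without marking their beginnings and ends.
    stream = []
    for _ in range(n_strings):
        stream += embedded_reber_string()
    return stream

print(''.join(cerg_stream(3)))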
Figure 3.2: Transition diagrams for standard (left) and embedded (right) Reber grammars. The dashed line indicates the continual variant.
Algorithm   | # hidden units | # weights  | learning rate | % of success    | success after
RTRL        | 3              | ≈ 170      | 0.05          | "some fraction" | 173,000
RTRL        | 12             | ≈ 494      | 0.1           | "some fraction" | 25,000
ELM         | 15             | ≈ 435      |               | 0               | > 200,000
RCC         | 7-9            | ≈ 119-198  |               | 50              | 182,000
Trad. LSTM  | 3 bl., size 2  | 276        | 0.5           | 100             | 8,440

Table 3.1: Standard embedded Reber grammar (ERG): percentage of successful trials and number of sequence presentations until success for RTRL (results taken from Smith and Zipser 1989), "Elman net trained by Elman's procedure" (results taken from Cleeremans et al. 1989), "Recurrent Cascade-Correlation" (results taken from Fahlman 1991) and traditional LSTM (results taken from Hochreiter and Schmidhuber 1997). Weight numbers in the first 4 rows are estimates.

To orre tly predi t the symbol before the last (T or P) in an ERG string, the network
has to remember the se ond symbol (also T or P) without onfusing it with identi al symbols
en ountered later. The minimal time lag is 7 (at the limit of what standard re urrent networks
an manage); time lags have no upper bound though. The expe ted length of a string generated
by an ERG is 11.5 symbols. The length of the longest string in a set of N non-identi al strings
is proportional to log N (statisti s of the embedded Reber Grammar are dis ussed in Appendix
A). For the training and test sets used in our experiments, the expe ted value of the longest
string is greater than 50.
Table 3.1 summarizes performan e of previous RNNs on the standard ERG problem (testing
involved a test set of 256 ERG test strings). Only traditional LSTM always learns to solve the
task. Even when we ignore the unsu essful trials of the other approa hes, LSTM learns mu h
faster.
CERG. Our more difficult continual variant of the ERG problem (CERG) does not provide information about the beginnings and ends of symbol strings. Without intermediate resets, the network is required to learn, in an on-line fashion, from input streams consisting of concatenated ERG strings. Input streams are stopped as soon as the network makes an incorrect prediction or the $10^5$-th successive symbol has occurred. Learning and testing alternate: after each training stream we freeze the weights and feed 10 test streams. Our performance measure is the average test stream size; 100,000 corresponds to a so-called "perfect" solution ($10^6$ successive correct predictions).

Figure 3.3: Three-layer LSTM topology with recurrence limited to the hidden layer, which consists of four extended LSTM memory blocks (only two shown) with two cells each. Only a limited subset of connections are shown.
3.3.2 Network Topology and Parameters

The 7 input units are fully connected to a hidden layer consisting of 4 memory blocks with 2 cells each (8 cells and 12 gates in total). The cell outputs are fully connected to the cell inputs, to all gates, and to the 7 output units. The output units have additional "shortcut" connections from the input units (see Figure 3.3). All gates and output units are biased. Bias weights to input and output gates are initialized blockwise: -0.5 for the first block, -1.0 for the second, -1.5 for the third, and so forth. In this manner, cell states are initially close to zero, and, as training progresses, the biases become progressively less negative, allowing the serial activation of cells as active participants in the network computation. Forget gates are initialized with symmetric positive values: +0.5 for the first block, +1.0 for the second block, etc. Precise bias initialization is not critical, though; other values work just as well. All other weights including the output bias are initialized randomly in the range [-0.2, 0.2]. There are 424 adjustable weights, which is comparable to the number used by LSTM in solving the ERG (see Table 3.1).

Weight changes are made after each input symbol presentation. At the beginning of each training stream, the learning rate $\alpha$ is initialized with 0.5. It either remains fixed or decays by a factor of 0.99 per time step (LSTM with $\alpha$-decay). Learning rate decay is well studied in statistical approximation theory and is also common in neural networks, e.g., (Darken, 1995).
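A sketch of the blockwise bias initialization just described (the array layout and the placeholder count for the remaining weights are assumptions):

import numpy as np

n_blocks = 4
k = np.arange(1, n_blocks + 1)
in_gate_bias = -0.5 * k      # -0.5, -1.0, -1.5, -2.0: cells start close to zero
out_gate_bias = -0.5 * k     # increasingly negative biases activate blocks serially
forget_gate_bias = +0.5 * k  # +0.5, +1.0, ...: forget gates start near 1.0
rng = np.random.default_rng(0)
other_weights = rng.uniform(-0.2, 0.2, size=100)  # all other weights (placeholder count)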

Algorithm                                   | %Solutions  | %Good Sol.  | %Rest
Trad. LSTM with external reset              | 74 (7441)   | 0 ⟨-⟩       | 26 ⟨31⟩
Traditional LSTM                            | 0 (-)       | 1 ⟨1166⟩    | 99 ⟨37⟩
LSTM with State Decay (0.9)                 | 0 (-)       | 0 ⟨-⟩       | 100 ⟨56⟩
LSTM with Forget Gates                      | 18 (18889)  | 29 ⟨39171⟩  | 53 ⟨145⟩
LSTM with Forget Gates and sequential decay | 62 (14087)  | 6 ⟨68464⟩   | 32 ⟨30⟩

Table 3.2: Continual Embedded Reber Grammar (CERG). Column "%Solutions": percentage of "perfect" solutions (correct prediction of 10 streams of 100,000 symbols each); in parentheses, the number of training streams presented until a solution was reached. Column "%Good Sol.": percentage of solutions with an average stream length > 1000 (mean length of error-free prediction given in angle brackets). Column "%Rest": percentage of "bad" solutions with average stream length ≤ 1000 (mean length of error-free prediction given in angle brackets). The results are averages over 100 independently trained networks. Other algorithms like BPTT are not included in the comparison, because they tend to fail even on the easier, non-continual ERG.

We report results of exponential $\alpha$-decay (as specified above), but also tested several other variants (linear, $1/T$, $1/\sqrt{T}$) and found them all to work as well without extensive optimization of parameters.

3.3.3 CERG Results

Training was stopped after at most 30,000 training streams, each of which was ended when the first prediction error or the 100,000th successive input symbol occurred. Table 3.2 compares extended LSTM (with and without learning rate decay) to traditional LSTM and an LSTM variant with decay of the internal cell state s_c (with a self-recurrent weight < 1). Our results for traditional LSTM with network activation resets (by an external teacher) at sequence ends are slightly better than those based on a different topology (Hochreiter & Schmidhuber, 1997). External resets (non-continual case) allow LSTM to find excellent solutions in 74% of the trials, according to our stringent testing criterion. Traditional LSTM fails, however, in the continual case. Internal state decay does not help much either (we tried various self-recurrent weight values and report only the best result). Extended LSTM with forget gates, however, can solve the continual problem.
A continually decreasing learning rate led to even better results but had no effect on the other algorithms. Different topologies may provide better results, too; we did not attempt to optimize topology.
Can the network learn to recognize appropriate times for opening/closing its gates without using the information conveyed by the marker symbols B and E? To test this we replaced all CERG subnets of the type T/P → E → B → T/P by T/P → T/P. This makes the task more difficult, as the net now needs to keep track of sequences of numerous potentially confusing T and P symbols. But LSTM with forget gates (same topology) was still able to find perfect solutions, although less frequently (sequential decay was not applied).
Figure 3.4: Evolution of traditional LSTM's internal states s_c during presentation of a test stream, stopped at the first prediction failure. Starts of new ERG strings are indicated by vertical lines labeled by the symbols (P or T) to be stored until the next string start.
3.3.4 Analysis of the CERG Results
How does extended LSTM solve the task on which traditional LSTM fails? Section 3.1.1 already mentioned LSTM's problem of uncontrolled growth of the internal states. Figure 3.4 shows the evolution of the internal states s_c during the presentation of a test stream. The internal states tend to grow linearly. At the starts of successive ERG strings, the network is in an increasingly active state. At some point (here after 13 successive strings), the high level of state activation leads to saturation of the cell outputs, and performance breaks down. Extended LSTM, however, learns to use the forget gates for resetting its state when necessary. Figure 3.5 (top half) shows a typical internal state evolution after learning. We see that the third memory block resets its cells in synchrony with the starts of ERG strings (the vertical lines in Figure 3.5 indicate the third symbol of a string). The internal states oscillate around zero; they never drift out of bounds as with traditional LSTM (Figure 3.4). It also becomes clear how the relevant information gets stored: the second cell of the third block stays negative while the symbol P has to be stored, whereas a T is represented by a positive value. The third block's forget gate activations are plotted in Figure 3.5 (bottom). Most of the time they are equal to 1.0, thus letting the memory cells retain their internal values. At the end of an ERG string the forget gate's activation goes to zero, thus resetting cell states to zero.
Analyzing the behavior of the other memory blocks, we find that only the third is directly responsible for bridging ERG's longest time lag (which is sufficient, as just one bit has to be stored). Figure 3.6 plots values analogous to those in Figure 3.5 for the first memory block and its first cell. The first block's cell and forget gate show short-term behavior only (necessary for predicting the numerous short-time-lag events of the Reber grammar). The same is true for all other blocks except the third. Common to all memory blocks is that they learned to reset themselves in an appropriate fashion.
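The mechanism can be made concrete with a minimal sketch of the cell state update (the same update later restated as Equation 5.3; logistic gates are assumed, and the variable names are ours, not the thesis code). A forget gate activation near 1.0 retains the state; an activation near 0.0 erases it, which is exactly the self-reset visible in Figure 3.5:

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def cell_state(s_prev, net_phi, net_in, net_c, g=lambda x: x):
        y_phi = sigmoid(net_phi)  # forget gate: ~1.0 keeps the state, ~0.0 resets it
        y_in = sigmoid(net_in)    # input gate
        return y_phi * s_prev + y_in * g(net_c)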
3.3.5 Continual Noisy Temporal Order Problem
Extended LSTM solves the CERG problem while traditional LSTM does not. But can traditional LSTM solve problems which extended LSTM cannot? We tested extended LSTM on one of the
Figure 3.5: Top: Internal states s_c of the two cells of the self-resetting third memory block in an extended LSTM network during a test stream presentation. The figure shows 170 successive symbols taken from the longer sequence presented to a network that learned the CERG. Starts of new ERG strings are indicated by vertical lines labeled by the symbols (P or T) to be stored until the next string start. Bottom: simultaneous forget gate activations of the same memory block.
most difficult nonlinear long-time-lag tasks ever solved by an RNN: "Noisy Temporal Order" (NTO) (task 6b taken from Hochreiter & Schmidhuber 1997).
NTO. The goal is to classify sequences of locally represented symbols. Each sequence starts with an E, ends with a B (the "trigger symbol"), and otherwise consists of randomly chosen symbols from the set {a, b, c, d} except for three elements at positions t_1, t_2 and t_3 that are either X or Y (Figure 3.7). The sequence length is randomly chosen between 100 and 110, t_1 is randomly chosen between 10 and 20, t_2 is randomly chosen between 33 and 43, and t_3 is randomly chosen between 66 and 76. There are 8 sequence classes Q, R, S, U, V, A, B, C which depend on the temporal order of the Xs and Ys. The rules are (temporal order → class): X,X,X → Q; X,X,Y → R; X,Y,X → S; X,Y,Y → U; Y,X,X → V; Y,X,Y → A; Y,Y,X → B; Y,Y,Y → C. Target signals occur only at the end of a sequence. The problem's minimal time lag size is 80 (!). Forgetting is only harmful, as all relevant information has to be kept until the end of a sequence, after which the network is reset anyway.
We use the network topology described in Section 3.3.2 with 8 input and 8 output units. Using a large bias (5.0) for the forget gates, extended LSTM solved the task as quickly as traditional LSTM (recall that a high forget gate bias makes extended LSTM degenerate into traditional LSTM). Using a moderate bias like the one used for CERG (1.0), extended LSTM
Figure 3.6: Top: Extended LSTM's self-resetting states for the first cell in the first block. Bottom: forget gate activations of the first memory block.
Figure 3.7: NTO and CNTO tasks. See text for details.

took about three times longer on average, but did solve the problem. The slower learning speed results from the net having to learn to remember everything and not to forget.
Generally speaking, we have not yet encountered a problem that LSTM solves while extended LSTM does not.
CNTO. Now we take the next obvious step and transform the NTO into a continual problem that does require forgetting, just as in Section 3.3.1, by generating continual input streams consisting of concatenated NTO sequences (Figure 3.7). Processing such streams without intermediate resets, the network is required to learn to classify NTO sequences in an online fashion. Each input stream is stopped once the network makes an incorrect classification or 100 successive NTO sequences have been classified correctly. Learning and testing alternate; the performance measure is the average size of 10 test streams, measured by the number of their NTO sequences (each containing between 100 and 110 input symbols). Training is stopped after at most 10^5
Algorithm                                    %Perfect Sol.   %Partial Sol.
Traditional LSTM                              0 (-)          100 ⟨4.6⟩
LSTM with Forget Gates                       24 (18077)       76 ⟨12.2⟩
LSTM with Forget Gates and sequential decay  37 (22654)       63 ⟨11.8⟩

Table 3.3: Continual Noisy Temporal Order (CNTO). Column "%Perfect Sol.": percentage of "perfect" solutions (correct classification of 1000 successive NTO sequences in 10 test streams); in parentheses: number of training streams presented. Column "%Partial Sol.": percentage of partial solutions, with average stream size < 100 given in angle brackets. All results are averages over 100 independently trained networks. Other algorithms (BPTT, RTRL, etc.) are not included in the comparison, because they fail even on the easier, non-continual NTO.
training streams.
Results. Table 3.3 summarizes the results. We observe that traditional LSTM again fails to solve the continual problem. Extended LSTM with forget gates, however, can solve it. A continually decreasing learning rate (η decaying by a factor of 0.9 after each NTO sequence in a stream) leads to slightly better results but is not necessary.
3.4 Conclusion
Continual input streams generally require occasional resets of the stream-processing network. Partial resets are also desirable for tasks with hierarchical decomposition. For instance, reoccurring subtasks should be solved by the same network module, which should be reset once the subtask is solved. Since typical real-world input streams are not a priori decomposed into training subsequences, and since typical sequential tasks are not a priori decomposed into appropriate subproblems, RNNs should be able to learn to achieve appropriate decompositions. The novel forget gates naturally permit LSTM to learn local self-resets of memory contents that have become irrelevant.
LSTM extended with forget gates holds promise for any sequential processing task in which we suspect that a hierarchical decomposition may exist, but do not know in advance what this decomposition is. The model has been successfully applied to the task of discriminating languages from very limited prosodic information (Cummins, Gers, & Schmidhuber, 1999), where there is no clear linguistic theory of hierarchical structure.
Chapter 4

Arithmetic Operations on Continual Input Streams

4.1 Introduction


Many typical real-world sequence processing tasks involve continual input streams, distributed input representations, continuous-valued targets, inputs, and internal states, and long time lags between relevant events. So we designed several artificial nonlinear tasks that combine these factors.
Due to its architecture, traditional LSTM is well suited for tasks involving addition, subtraction and integration (Hochreiter & Schmidhuber, 1997). Such operations are essential for many real-world tasks. But another essential arithmetic operation, namely multiplication, does pose problems. Forget gates, however, originally introduced to release irrelevant memory contents, greatly improve LSTM's performance on tasks involving multiplication, as will be seen below.

4.2 Experiments
We focus on tasks involving arithmetic operations on input streams that so far have been addressed only in non-continual settings (Tsung & Cottrell, 1989; Hochreiter & Schmidhuber, 1997).
General set-up. We feed the net continual streams of 4-dimensional input vectors generated in an online fashion. We define t_0 = 0 (stream start) and t_n = t_{n-1} + T + (-1)^n V for n = 1, 2, ..., where V ∈ {0, 1, ..., T/5} is chosen randomly, and integer T is the minimal time lag. The first component of each input vector is a random number from the interval [-1, +1]. The second and third serve as "markers": they are always 0.0 except at times t_{2m-1} for m = 1, 2, ..., when either the second component is 1.0 with probability p, or the third is 1.0 with probability 1-p. The fourth component is always 0 except at times t_{2m}, when targets are given and its activation is 1.0. The target at t_0 is 0. If the 2nd component was active at t_{2m-1}, then the target at t_{2m} is the sum of the previous target at t_{2m-2} and the "marked" first input component at t_{2m-1}. Otherwise it is the product of these two values.
Figure 4.1: Illustration of the continual addition (and multiplication) tasks.
Hence non-initial targets depend on events that happened at least 2T steps ago. Note that occurrences of "value markers" and targets oscillate. See Figure 4.1 for an illustration of the task.
All streams are stopped once the absolute output error exceeds 0.04. Test streams are almost unlimited (max. length = 1000 target occurrences), but training streams end after at most 10 target occurrences. Learning and testing alternate: after each training stream we freeze the weights and feed 100 test streams. Our performance measure is the average test stream size.
Task 1: Continual addition. p = 1.0 (no multiplication), T = 20. Task 1 essentially requires the network to keep adding (possibly negative) values to the already existing internal state.
Task 2: Continual addition and multiplication. p = 0.5, T = 20. If the 3rd input component is active at t_{2m-1} and the 1st is negative, then the latter will get replaced by its absolute value.
Task 3: Gliding addition. Like Task 1, but targets at times t_{2m+2} equal the sum of the two most recent marked values at times t_{2m+1} and t_{2m-1} (the first target at t_2 equals the first value at t_1). T = 10. Task 3 is harder than Task 1 because it requires selective partial resets of the current internal state.
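A hypothetical generator for this set-up (a sketch of the reconstruction above, not the thesis code; names and the exact handling of the first events are our assumptions) could look as follows:

    import random

    def arithmetic_stream(p=0.5, T=20, n_targets=10):
        # event times t_n = t_{n-1} + T + (-1)^n * V, with V in {0, ..., T/5}
        times = [0]
        for k in range(1, 2 * n_targets + 1):
            V = random.randint(0, T // 5)
            times.append(times[-1] + T + (-1) ** k * V)
        target_prev, marked, add = 0.0, 0.0, True
        event = 1
        for t in range(times[-1] + 1):
            x = [random.uniform(-1.0, 1.0), 0.0, 0.0, 0.0]
            target = None
            if event < len(times) and t == times[event]:
                if event % 2 == 1:             # marker step t_{2m-1}
                    add = random.random() < p
                    if not add:
                        x[0] = abs(x[0])       # Task 2: negative values replaced
                    x[1 if add else 2] = 1.0   # 2nd component: "+", 3rd: "*"
                    marked = x[0]
                else:                          # target step t_{2m}
                    x[3] = 1.0
                    target = target_prev + marked if add else target_prev * marked
                    target_prev = target
                event += 1
            yield x, target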

4.2.1 Network Topology and Parameters
The 4 input units are fully connected to a hidden layer consisting of 3 memory blocks with 1 cell each (roughly: fewer blocks decreased performance for LSTM, and more blocks did not improve performance significantly). The cell outputs are fully connected to the cell inputs, to all gates, and to the output unit. All gates and output units are biased. Bias weights to in- and output gates are initialized block-wise: -1.0 for the first block, -2.0 for the second, and so forth. (This is a standard initialization procedure: blocks with higher bias tend to get released later during the learning phase.) Forget gates are initialized with symmetric positive values: +1.0 for the first block, +2.0 for the second, and so forth. The squashing functions g, h and f_k are the identity function.
4.2.2 Results
See Table 4.1. Test stream sizes are measured by the number of target presentations before the first failure. A stream size below 3 counts as an unsuccessful trial. We report the best test performance during a training phase involving 3 × 10^6 training streams, averaged over 10 independent networks.

Algorithm              Task 1       Task 2      Task 3
Traditional LSTM       73 (100%)    -  (0%)     -   (0%)
LSTM + Forget Gates    42 (100%)    40 (60%)    241 (50%)

Table 4.1: Average test stream size (percentage of successful trials given in parentheses). In Task 3 one network with forget gates exceeded the limit of 1000 target occurrences.
Task 1. Both traditional LSTM and LSTM with forget gates learn the task. The worse performance of LSTM with forget gates is caused by slower convergence, because the net has to learn to remember everything and not to forget.
Task 2. LSTM with forget gates solves the problem even when addition and multiplication are combined, whereas traditional LSTM's solutions are not sufficiently accurate. This shows that forget gates add algorithmic functionality to memory blocks besides releasing resources during runtime (their original purpose, which is not essential here).
Task 3. Traditional LSTM cannot solve the problem at all, whereas LSTM with forget gates does find good and even "perfect" solutions. Why? The forget gates learn to prevent LSTM's uncontrolled internal state growth (see Section 3.3.4) by resetting states once stored information becomes obsolete.
The results confirm that forget gates are mandatory for LSTM fed with continual input streams (Chapter 3), where obsolete memories need to be discarded at some point (see "Task 3: Gliding addition"). Experiment 2 shows that forget gates also greatly facilitate operations involving multiplication.
4.3 Conclusion
In this chapter we demonstrated that forget gates not only serve the processing of continual input streams but also augment LSTM's arithmetic capabilities.
We presented tasks on continual input streams with a level of arithmetic complexity where traditional LSTM fails but LSTM with forget gates solves the tasks in an elegant way. On the other hand, we have not yet found a task that traditional LSTM can solve but LSTM with forget gates cannot.
Chapter 5

Learning Precise Timing with Peephole LSTM

5.1 Introduction


Humans quickly learn to recognize rhythmic pattern sequences, whose defining aspects are the temporal intervals between sub-patterns. Conversely, drummers and others are also able to generate precisely timed rhythmic sequences of motor commands. This motivates the study of artificial systems that learn to separate or generate patterns that convey information through the length of intervals between events.
Widely used approaches to sequence processing, such as Hidden Markov Models (HMMs), typically discard such information. They are successful in speech recognition precisely because they do not care about the difference between slow and fast versions of a given spoken word. Other tasks such as rhythm detection, music processing, and the tasks in this chapter, however, do require exact time measurements. Although an HMM could deal with a finite set of intervals between given events by devoting a separate internal state to each interval, this would be cumbersome and inefficient, and would not use the very strength of HMMs, namely their invariance to non-linear temporal stretching.
RNNs hold more promise for recognizing patterns that are defined by temporal distance. In fact, while HMMs and traditional discrete symbolic grammar learning devices are limited to discrete state spaces, RNNs are in principle suited for all sequence learning tasks because they have Turing capabilities (Siegelmann & Sontag, 1991). Typical RNN learning algorithms (Pearlmutter, 1995) perform gradient descent in a very general space of potentially noise-resistant algorithms, using distributed, continuous-valued internal states to map real-valued input sequences to real-valued output sequences. Hybrid HMM-RNN approaches (Bengio & Frasconi, 1995) might be able to combine the virtues of both methodologies, but to our knowledge have never been applied to the problem of precise event timing as discussed here.
Previous tasks already required the LSTM network to act upon events that occurred 50 discrete time steps ago, independently of what happened over the intervening 49 steps (see Chapter 3 and Hochreiter & Schmidhuber, 1997). Right before the critical moment, however,
Figure 5.1: LSTM memory block with one cell; peephole connections connect s_c to the gates.
there was a helpful "marker" input informing the network that its next action would be crucial. Thus the network did not really have to learn to measure a time interval of 50 steps; it just had to learn to store relevant information for 50 steps and use it once the marker was observed, something that is impossible for traditional RNNs but comparatively easy for LSTM.
But what if there are no such markers at all? What if the network itself has to learn to measure and internally represent the duration of task-specific intervals, or to generate sequences of patterns separated by exact intervals? Here we will study to what extent this is possible. The highly nonlinear tasks in the present chapter do not involve any time marker inputs; instead they require the network to time precisely and robustly across long time lags in continual input streams.
Before we describe our new timing experiments we will first identify a weakness in LSTM's connection scheme, and introduce peephole connections as a remedy (Section 5.2). Sections 5.3 and 5.4 describe the modified forward and backward pass for "peephole LSTM."

5.2 Extending LSTM with "Peephole Connections"
We are building on LSTM with forget gates (Chapter 3), simply called "LSTM" in what follows.
A limitation of LSTM. Each gate receives connections from the input units and the outputs of all cells. But there is no direct connection from the CEC it is supposed to control. All it can observe directly is the cell output, which is close to zero as long as the output gate is closed. The resulting lack of essential information may harm network performance, especially in the case of the tasks we are going to study here.
Peephole connections. Our simple but very effective remedy is to add weighted "peephole" connections from the CEC to the gates of the same memory block (Figure 5.1). The gates learn to shield the CEC from unwanted inputs (forward pass) or unwanted error signals (backward pass). To keep the shield intact, during learning no error signals are propagated back from gates via peephole connections to the CEC (see backward pass, Section 5.4). Peephole connections are treated like regular connections to gates (e.g., from the input) except for update timing. For
conventional LSTM the only source of recurrent connections is the cell output y^c, so the order of updates within a layer is arbitrary. Peephole connections from within the cell, or recurrent connections from gates, however, require a refinement of LSTM's update scheme.
Updates for peephole LSTM. Each memory cell component should be updated based on the most recent activations or states of connected sources. In the simplest case this requires a two-phase update scheme; when recurrent connections from gates are present, the first phase must be further subdivided into three steps (a, b, c):
1. (a) Input gate activation y^{in},
   (b) forget gate activation y^{φ},
   (c) cell input and cell state s_c,
2. output gate activation y^{out} and cell output y^c.
Thus the output gate is updated after cell state s_c, seeing via its peephole connection the current value of s_c(t) (already affected by forget gate and recent input), and possibly the current input and forget gate activations.
5.3 Forward Pass
Before specifying the equations for the LSTM model with peephole connections, we introduce a minor simplification unrelated to the central idea of this chapter. So far LSTM memory cells incorporated an input squashing function g and an output squashing function (called h in earlier LSTM publications). We remove the latter for lack of empirical evidence that it is really needed (in fact, the very first LSTM publication (Hochreiter & Schmidhuber, 1997) already omitted the output squashing function for some experiments).
Step 1a, 1b. The input gate activation y^{in} and the forget gate activation y^{φ} are computed as:

net_{in_j}(t) = Σ_m w_{in_j m} y^m(t-1) + Σ_{v=1}^{S_j} w_{in_j c_j^v} s_{c_j^v}(t-1),   y^{in_j}(t) = f_{in_j}(net_{in_j}(t));   (5.1)

net_{φ_j}(t) = Σ_m w_{φ_j m} y^m(t-1) + Σ_{v=1}^{S_j} w_{φ_j c_j^v} s_{c_j^v}(t-1),   y^{φ_j}(t) = f_{φ_j}(net_{φ_j}(t)).   (5.2)

The peephole connections for the input gate and the forget gate are incorporated in equations 5.1 and 5.2 by including the CECs (containing the cell states) of memory block j as source units.
Step 1c. At t = 0, the state s_c(t) of memory cell c is initialized to zero; subsequently (t > 0) it is calculated by adding the squashed, gated input to the state at the previous time step, s_c(t-1), which is multiplied (gated) by the forget gate activation y^{φ_j}(t):

net_{c_j^v}(t) = Σ_m w_{c_j^v m} y^m(t-1),
s_{c_j^v}(t) = y^{φ_j}(t) s_{c_j^v}(t-1) + y^{in_j}(t) g(net_{c_j^v}(t)).   (5.3)

Step 2. The output gate activation y^{out} is computed as:

net_{out_j}(t) = Σ_m w_{out_j m} y^m(t-1) + Σ_{v=1}^{S_j} w_{out_j c_j^v} s_{c_j^v}(t),
y^{out_j}(t) = f_{out_j}(net_{out_j}(t)).   (5.4)
Equation 5.4 includes the peephole connections for the output gate from the CECs of memory block j with the cell states s_{c_j^v}(t), as updated in step 1c. The cell output y^c is computed as:

y^{c_j^v}(t) = y^{out_j}(t) s_{c_j^v}(t).   (5.5)

The equations for the output units k remain as specified in equations 2.9.
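For concreteness, here is a minimal sketch of Equations 5.1-5.5 for a single memory block with one cell (not the thesis implementation; the dictionary-based weight layout and the logistic gate squashing are assumptions of this sketch):

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def peephole_block_forward(w, y_prev, s_prev, g=lambda x: x):
        # w["in"], w["phi"], w["c"], w["out"]: dicts of weights keyed by source unit m;
        # w["*_peep"]: the three peephole weights; y_prev: activations y^m(t-1).
        net = lambda ws: sum(ws[m] * y_prev[m] for m in ws)
        y_in = sigmoid(net(w["in"]) + w["in_peep"] * s_prev)     # Eq. (5.1)
        y_phi = sigmoid(net(w["phi"]) + w["phi_peep"] * s_prev)  # Eq. (5.2)
        s_c = y_phi * s_prev + y_in * g(net(w["c"]))             # Eq. (5.3)
        y_out = sigmoid(net(w["out"]) + w["out_peep"] * s_c)     # Eq. (5.4): current s_c(t)
        return y_out * s_c, s_c                                  # Eq. (5.5): no output squashing

Note how the output gate is the only component that sees the current state s_c(t) through its peephole, reflecting the two-phase update scheme of Section 5.2.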
5.4 Gradient-Based Backward Pass
The revised update scheme for memory blocks allows for treating peephole connections like regular connections (see Sections 5.2 and 5.3), and so requires only minor changes to the backward pass (Chapter 3). We will present it below but not fully re-derive it. We will, however, point out the differences to the previous equations in Section 3.2.2. Appendix B gives pseudo-code for the entire algorithm.
In what follows we will present equations for LSTM with forget gates and peephole connections, but without output squashing. The sign =tr will indicate where we use error truncation.
During each step in the forward pass, no matter whether a target is given or not, we need to update the partial derivatives ∂s_{c_j^v}/∂w_{lm} and ∂s_{c_j^v}/∂w_{l c_j^{v'}} for weights to the cell (l = c_j^v), to the input gate (l = in), and to the forget gate (l = φ):


∂s_{c_j^v}(t)/∂w_{c_j^v m} =tr ∂s_{c_j^v}(t-1)/∂w_{c_j^v m} · y^{φ_j}(t) + g'(net_{c_j^v}(t)) y^{in_j}(t) y^m(t-1),   (5.6)

∂s_{c_j^v}(t)/∂w_{in_j m} =tr ∂s_{c_j^v}(t-1)/∂w_{in_j m} · y^{φ_j}(t) + g(net_{c_j^v}(t)) f'_{in_j}(net_{in_j}(t)) y^m(t-1),   (5.7a)
∂s_{c_j^v}(t)/∂w_{in_j c_j^{v'}} =tr ∂s_{c_j^v}(t-1)/∂w_{in_j c_j^{v'}} · y^{φ_j}(t) + g(net_{c_j^v}(t)) f'_{in_j}(net_{in_j}(t)) s_{c_j^{v'}}(t-1),   (5.7b)

∂s_{c_j^v}(t)/∂w_{φ_j m} =tr ∂s_{c_j^v}(t-1)/∂w_{φ_j m} · y^{φ_j}(t) + s_{c_j^v}(t-1) f'_{φ_j}(net_{φ_j}(t)) y^m(t-1),   (5.8a)
∂s_{c_j^v}(t)/∂w_{φ_j c_j^{v'}} =tr ∂s_{c_j^v}(t-1)/∂w_{φ_j c_j^{v'}} · y^{φ_j}(t) + s_{c_j^v}(t-1) f'_{φ_j}(net_{φ_j}(t)) s_{c_j^{v'}}(t-1),   (5.8b)

with ∂s_{c_j^v}(0)/∂w_{lm} = ∂s_{c_j^v}(0)/∂w_{l c_j^{v'}} = 0 for l ∈ {in, φ, c_j^v}. Equations 5.7b and 5.8b are for the peephole connection weights.


Following previous notation, we minimize the objective function E by gradient descent (subject to error truncation), changing the weights w_{lm} (from unit m to unit l) by an amount Δw_{lm} given by the learning rate α times the negative gradient of E. For the output units we obtain the standard back-propagation weight changes:

Δw_{km}(t) = α δ_k(t) y^m(t-1),   δ_k(t) = -∂E(t)/∂net_k(t).   (5.9)

Here we use the customary squared error objective function based on targets t^k, yielding:

δ_k(t) = f'_k(net_k(t)) e_k(t),   (5.10)
where e_k(t) := t^k(t) - y^k(t) is the externally injected error. The weight changes for connections to the output gate (of the j-th memory block) from the source units (as specified by the network topology), Δw_{out_j m}, and for the peephole connections, Δw_{out_j c_j^v}, are:

Δw_{out_j m}(t) = α δ_{out_j}(t) y^m(t),   Δw_{out_j c_j^v}(t) = α δ_{out_j}(t) s_{c_j^v}(t),   (5.11a)

δ_{out_j}(t) =tr f'_{out_j}(net_{out_j}(t)) ( Σ_{v=1}^{S_j} s_{c_j^v}(t) Σ_k w_{k c_j^v} δ_k(t) ).   (5.11b)

Output squashing (removed here) would require the incorporation of the derivative of the output squashing function in (5.11b). To calculate the weight changes Δw_{lm} and Δw_{l c_j^{v'}} (peephole connection weights) for connections to the cell (l = c_j^v), the input gate (l = in), and the forget gate (l = φ), we use the partials from Equations 5.6, 5.7b, and 5.8b:

Δw_{c_j^v m}(t) = α e_{s_{c_j^v}}(t) ∂s_{c_j^v}(t)/∂w_{c_j^v m},   (5.12)

Δw_{in_j m}(t) = α Σ_{v=1}^{S_j} e_{s_{c_j^v}}(t) ∂s_{c_j^v}(t)/∂w_{in_j m},   Δw_{in_j c_j^{v'}}(t) = α Σ_{v=1}^{S_j} e_{s_{c_j^v}}(t) ∂s_{c_j^v}(t)/∂w_{in_j c_j^{v'}},   (5.13)

Δw_{φ_j m}(t) = α Σ_{v=1}^{S_j} e_{s_{c_j^v}}(t) ∂s_{c_j^v}(t)/∂w_{φ_j m},   Δw_{φ_j c_j^{v'}}(t) = α Σ_{v=1}^{S_j} e_{s_{c_j^v}}(t) ∂s_{c_j^v}(t)/∂w_{φ_j c_j^{v'}},   (5.14)

where the internal state error e_{s_{c_j^v}} is separately calculated for each memory cell:

e_{s_{c_j^v}}(t) =tr y^{out_j}(t) ( Σ_k w_{k c_j^v} δ_k(t) ).   (5.15)

Like traditional LSTM, LSTM with forget gates and peephole connections is still local in space and time. The increase in complexity due to peephole connections is small: 3 weights per cell.
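The truncated derivative bookkeeping of Equations 5.6-5.8 reduces, for a one-cell block, to five running partials that are scaled by the current forget gate activation and incremented by the current contribution (a sketch with our own naming, not the pseudo-code of Appendix B):

    def update_partials(d, y_phi, y_in, y_m, s_prev, g_val, g_prime,
                        f_in_prime, f_phi_prime):
        # d: running partials ds_c/dw; g_val = g(net_c), g_prime = g'(net_c);
        # f_*_prime: gate squashing derivatives at the current net inputs.
        d["w_c_m"] = d["w_c_m"] * y_phi + g_prime * y_in * y_m                     # (5.6)
        d["w_in_m"] = d["w_in_m"] * y_phi + g_val * f_in_prime * y_m               # (5.7a)
        d["w_in_peep"] = d["w_in_peep"] * y_phi + g_val * f_in_prime * s_prev      # (5.7b)
        d["w_phi_m"] = d["w_phi_m"] * y_phi + s_prev * f_phi_prime * y_m           # (5.8a)
        d["w_phi_peep"] = d["w_phi_peep"] * y_phi + s_prev * f_phi_prime * s_prev  # (5.8b)
        return d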
5.5 Experiments
We study LSTM's performance on three tasks that require the precise measurement or generation of delays. We compare conventional to peephole LSTM, analyze the solutions, or explain why none was found.
Measuring spike delays (MSD). See Section 5.5.2. The goal is to classify input sequences consisting of sharp spikes. The class depends on the interval between spikes. We consider two versions of the task: continual (MSD) and non-continual (NMSD). NMSD sequences stop after the second spike, whereas MSD sequences are continual spike trains. Both NMSD and MSD require the network to measure intervals between spikes; MSD also requires the production of stable results in the presence of continually streaming inputs, without any external reset of the network's state. Can LSTM learn the difference between almost identical pattern sequences that differ only by a small lengthening of the interval (e.g., from n to n+1 steps) between input spikes? How does the difficulty of this problem depend on n?
Generating timed spikes (GTS). See Section 5.5.3. The GTS task can be obtained from the MSD task by exchanging inputs and targets. It requires the production of continual spike trains, where the interval between spikes must reflect the magnitude of an input signal that may change after every spike.
GTS is a special case of periodic function generation (PFG, see below). In contrast to previously studied PFG tasks (Williams & Zipser, 1989; Doya & Yoshizawa, 1989; Tsung & Cottrell, 1995), GTS is highly nonlinear and involves long time lags between significant output changes, which cannot be learned by conventional RNNs. Previous work also did not focus on stability issues. Here, by contrast, we demand that the generation be stable for 1000 successive spikes. We systematically investigate the effect of minimal time lag on task difficulty.
Additional periodic function generation tasks (PFG). See Section 5.5.4. We study the problem of generating periodic functions other than the spike trains above. The classic examples are smoothly oscillating outputs such as sine waves, which are learnable by fully connected teacher-forced RNNs whose units are all output units with teacher-defined activations (Williams & Zipser, 1989). An alternative approach trains an RNN to predict the next input; after training, outputs are fed back directly to the input so as to generate the waveform (Doya & Yoshizawa, 1989; Tsung & Cottrell, 1995; Weiss, 1999; Townley et al., 1999).
Here we focus on more difficult, highly nonlinear, triangular and rectangular waveforms, the latter featuring long time lags between significant output changes. Again, traditional RNNs cannot learn tasks involving long time lags (Hochreiter, 1991; Bengio et al., 1994), and previous work did not focus on stability issues. By contrast, we demand that the generation be stable for 1000 successive periods of the waveform.
5.5.1 Network Topology and Experimental Parameters
We found that comparatively small LSTM nets can already solve the tasks above. A single input unit (used only for tasks where there is input) is fully connected to the hidden layer consisting of a single memory block with one cell. The cell output is connected to the cell input, to all three gates, and to a single output unit (Figure 5.2). All gates, the cell itself, and the output unit are connected to a bias unit (a unit with constant activation one) as well. The bias weights to input gate, forget gate, and output gate are initialized to 0.0, -2.0 and +2.0, respectively. (Although not critical, these values have been found empirically to work well; we use them for all our experiments.) All other weights are initialized to uniform random values in the range [-0.1, 0.1]. In addition to the three peephole connections there are 14 adjustable weights: 9 "unit-to-unit" connections and 5 bias connections. The cell's input squashing function g is the identity function. The squashing function of the output unit is a logistic sigmoid with range [0, 1] for MSD and GTS (except where explicitly stated otherwise), and the identity function for PFG. (A sigmoid function would work as well, but we focus on the simplest system that can solve the task.)
Our networks process continual streams of inputs and targets; only at the beginning of a stream are they reset. They must learn to always predict the target t^k(t), producing a stream of output values (predictions) y^k(t). A prediction is considered correct if the absolute output error |e_k(t)| = |t^k(t) - y^k(t)| is below 0.49 for binary targets (MSD, NMSD and GTS tasks), and below 0.3 otherwise (PFG tasks). Streams are stopped as soon as the network makes an incorrect prediction, or after a given maximal number of successive periods (spikes): 100 during training, 1000 during testing.
Learning and testing alternate: after each training stream, we freeze the weights and generate
Figure 5.2: Three-layer LSTM topology with one input and one output unit. Recurrence is limited to the hidden layer, which consists of a single LSTM memory block with a single cell. All 9 "unit-to-unit" connections are shown, but bias and peephole connections are not.
a test stream. Our performance measure is the achieved test stream size: 1000 successive periods are deemed a "perfect" solution. Training is stopped once a task is learned or after a maximal number of 10^7 training streams (10^8 for the MSD and NMSD tasks). Weight changes are made after each target presentation. The learning rate is set to α = 10^{-5}; we use the momentum algorithm (Plaut, Nowlan, & Hinton, 1986) with momentum parameter 0.999 for the GTS task, 0.99 for the PFG and NMSD tasks, and 0.9999 for the MSD task. We roughly optimized the momentum parameter by trying out different orders of magnitude.
For tasks GTS and MSD, the stochastic input streams are generated online. A perfect solution correctly processes 10 test streams, to make sure the network provides stable performance independent of the stream beginning, which we found to be critical. All results are averages over 10 independently trained networks.
5.5.2 Measuring Spike Delays (MSD)
The network input is a spike train, represented by a series of ones and zeros, where each "one" indicates a spike. Spikes occur at times T(n), set F + I(n) steps apart, where F is the minimum interval between spikes, and I(n) is an integer offset, randomly reset for each spike:

T(0) = F + I(0);   T(n) = T(n-1) + F + I(n)   (n ∈ N).

The target given at times t = T(n) is the delay I(n). (Learning to measure the total interval F + I(n), that is, adding the constant F to the output, is no harder.) A perfect solution correctly processes all possible input test streams. For the non-continual version of the task (NMSD) a stream consists of a single period (spike).
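A hypothetical stream generator matching this definition (a sketch, not the thesis code):

    import random

    def msd_stream(F=10, delays=(0, 1), n_spikes=1000):
        delay = random.choice(delays)
        t_spike = F + delay                    # T(0) = F + I(0)
        t = 0
        while n_spikes > 0:
            is_spike = (t == t_spike)
            # input: 1 at spike times, else 0; target: the delay I(n) at t = T(n)
            yield (1 if is_spike else 0), (delay if is_spike else None)
            if is_spike:
                n_spikes -= 1
                delay = random.choice(delays)  # I(n) randomly reset for each spike
                t_spike += F + delay           # T(n) = T(n-1) + F + I(n)
            t += 1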
MSD Results. Table 5.1 reports results for NMSD with I(n) ∈ {0, 1} for various minimum spike intervals F. The results suggest that the difficulty of the task (measured as the average number of training streams necessary to solve it) increases drastically with F (see Figure 5.3). A qualitative explanation is that longer intervals necessitate finer tuning of the weights, which
Task    F    I(n) ∈         LSTM                       Peephole LSTM
                            % Sol.  Train. [10^3]      % Sol.  Train. [10^3]
NMSD    10   {0, 1}         100     160 ± 14           100     125 ± 14
        20   {0, 1}         100     732 ± 97           100     763 ± 103
        30   {0, 1}         100     17521 ± 2200        80     12885 ± 2091
        40   {0, 1}          20     37533 ± 4558        70     25686 ± 2754
        50   {0, 1}           0     -                   10     32485
MSD     10   {0, 1}          10     8850                20     29257 ± 13758
        10   {0, 1, 2}       20     27453 ± 11750       60     9791 ± 2660

Table 5.1: Results comparing conventional and peephole LSTM on the NMSD and MSD tasks. Columns show the task, the minimum spike interval F, the set of delays I(n), the percentage of perfect solutions found, and the mean and standard deviation of the number of training streams required.
Figure 5.3: Average number of training streams required for the NMSD task with I(n) ∈ {0, 1}, plotted against the minimum spike interval F.
requires more training. Peephole LSTM outperforms LSTM. The continual MSD task for F = 10 with I(n) ∈ {0, 1} or I(n) ∈ {0, 1, 2} is solved with or without peephole connections (Table 5.1).
In the next experiment we evaluate the influence of the range of I(n), using the identity function instead of the logistic sigmoid as output squashing function. We let I(n) range over {0, i} or {0,..., i} for all i ∈ {1,..., 10}. Results are reported in Table 5.2 for NMSD with F = 10. The training duration depends on the size of the set from which I(n) is drawn, and on the maximum distance (MD) between elements in the set. A larger MD leads to a better separation of patterns, thus facilitating recognition. To confirm this, we ran the NMSD task with F = 10 and I(n) ∈ {0, i} with i ∈ {2,..., 10} (size 2, MD i), as shown in the bottom half of Table 5.2. As expected, training time decreases with increasing MD. A larger set of possible delays should make the task harder. Surprisingly, for I(n) ∈ {0,..., i} (size i+1, MD i) with i ranging from 1 to 5 the task appears to become easier (due to the simultaneous increase of MD) before the
I(n) ∈         LSTM                           Peephole LSTM
               % Sol.  Training Str. [10^3]   % Sol.  Training Str. [10^3]
{0, 1}         100     48 ± 12                100     46 ± 14
{0, 1, 2}      100     25 ± 4                 100     10.3 ± 3.3
{0,..., 3}     100     12.3 ± 2.4             100     7.4 ± 2.2
{0,..., 4}     100     8.5 ± 1.3              100     3.6 ± 0.4
{0,..., 5}     100     4.5 ± 0.4              100     6.0 ± 1.4
{0,..., 6}     100     6.1 ± 1.0              100     7.1 ± 2.8
{0,..., 7}     100     8.5 ± 2.9               70     15 ± 6.5
{0,..., 8}     100     14.1 ± 4.2              50     22 ± 9
{0,..., 9}      90     39 ± 28                 50     33 ± 17
{0,..., 10}     60     23 ± 5                  20     395 ± 167
{0, 2}         100     33 ± 8                 100     18 ± 5
{0, 3}         100     12.5 ± 4.2             100     23 ± 6
{0, 4}         100     12.1 ± 2.8             100     13.7 ± 2.7
{0, 5}         100     8.5 ± 2.3              100     10.4 ± 2.0
{0, 6}         100     7.7 ± 1.5              100     12.7 ± 3.1
{0, 7}         100     7.7 ± 1.5              100     14.5 ± 6.0
{0, 8}         100     7.5 ± 2.0              100     6.3 ± 1.3
{0, 9}         100     5.8 ± 1.6              100     7.5 ± 1.6
{0, 10}        100     5.6 ± 0.9              100     6.7 ± 1.7

Table 5.2: The percentage of perfect solutions found, and the mean and standard deviation of the number of training streams required, for conventional versus peephole LSTM on the NMSD task with F = 10 and various choices for the set of delays I(n).

difficulty increases rapidly for larger i. Thus the task's difficulty does not grow linearly with the number of possible delays, corresponding to values (states) inside a cell that the network must learn to distinguish. Instead we observe that LSTM fares best at distinguishing 6 or 7 different delays. One is tempted to draw a connection to the "magic number" of 7 ± 2 items that an average human can store in Short Term Memory (STM) (Miller, 1956), but such a link seems rather far-fetched to us.
We also observe that the results for I(n) ∈ {0, 1} are better than those obtained with a sigmoid function (compare Table 5.1). Fluctuations in the stochastic input can cause temporary saturation of sigmoid units; the resulting tiny derivatives for the backward pass will slow down learning (LeCun, Bottou, Orr, & Muller, 1998).
MSD Analysis. LSTM learned to measure time in two principled ways. The first is to slightly increase the cell contents s_c at each time step, so that the elapsed time can be read off the value of s_c. This kind of solution is shown on the left-hand side of Figure 5.4. (The state reset performed by the forget gate is essential only for continual online prediction over many periods.) The second way is to establish internal oscillators and derive the elapsed time from their phases (right-hand side of Figure 5.4). Both kinds of solutions can be learned with or without peephole connections, as it is never necessary here to close the output gate for more than one time step (see bottom row of Figure 5.4).
Figure 5.4: Two ways to time. Test run with trained LSTM networks for the MSD task with F = 10 and I(n) ∈ {0, 1}. Top: target values t^k and network output y^k; middle: cell state s_c and cell output y^c; bottom: activation of the input gate y^{in}, forget gate y^{φ}, and output gate y^{out}.

Why may the output gate be left open? Targets occur rarely, hence the network output can be ignored most of the time. Since there is only one memory block, mutual perturbation of blocks is not possible. This type of reasoning is invalid, though, for more complex measuring tasks involving larger nets or more frequent targets. Figure 5.5 shows the behavior of LSTM in such a regime. With peephole LSTM the output gate opens only when a target is provided, whereas conventional LSTM does not learn this behavior. Note that in some cases these "cleaner" solutions with peephole connections took longer to be learned (compare Tables 5.1 and 5.2), because they require more complex behavior.
Figure 5.5: Behavior of peephole LSTM (left) versus LSTM (right) for the MSD task with F = 10 and I(n) ∈ {0, 1, 2}. Top: target values t^k and network output y^k; middle: cell state s_c and cell output y^c; bottom: activation of the input gate y^{in}, forget gate y^{φ}, and output gate y^{out}.
5.5.3 Generating Timed Spikes (GTS)
The GTS task reverses the roles of inputs and targets of the MSD task: the spike train T(n), defined as for the MSD task, now is the network's target, while the delay I(n) is provided as input.
GTS Results. The GTS task could not be learned by networks without peephole connections; thus we report results with peephole LSTM only. Results with various minimum spike intervals F (Figure 5.6) suggest that the required training time increases dramatically with F, as with the NMSD task (Section 5.5.2). The network output during a successful test run for the GTS task with F = 10 is shown on the top left of Figure 5.7. Peephole LSTM also solves the task for F = 10 and I(n) ∈ {0, 1} or {0, 1, 2}, as shown in Figure 5.6 (left).
F     I(n) ∈       % Sol.   Train. [10^3]
10    {0}          100      41 ± 4
20    {0}          100      67 ± 8
30    {0}           80      845 ± 82
40    {0}          100      1152 ± 101
50    {0}          100      2538 ± 343
10    {0, 1}        50      1647 ± 46
10    {0, 1, 2}     30      954 ± 393

Figure 5.6: Results for the GTS task (peephole LSTM). The table (left) shows the minimum spike interval F, the set of delays I(n), the percentage of perfect solutions found, and the mean and standard deviation of the number of training streams required. The graph (right) plots the number of training streams against the minimum spike interval F, for I(n) ∈ {0}.

GTS Analysis. Figure 5.7 shows test runs with trained networks for the GTS task. The output gates open only at the onset of a spike and close again immediately afterwards. Hence, during a spike, the output of the cell equals its state (middle row of Figure 5.7). The opening of the output gate is triggered by the cell state s_c: it starts to open once the input from the peephole connection outweighs a negative bias. The opening self-reinforces via a connection from the cell output, which produces the high nonlinearity necessary for generating the spike. This process is terminated by the closing of the forget gate, triggered by the cell output spike. Simultaneously the input gate closes, so that s_c is reset.
In the particular solution shown on the right-hand side of Figure 5.7 for F = 50, the role of the forget gate in this process is taken over by a negative self-recurrent connection of the cell in conjunction with a simultaneous opening of the other two gates. We tentatively removed the forget gate (by pinning its activation to 1.0) without changing the weights learned with the forget gate's help. The network then quickly learned a perfect solution. Learning from scratch without a forget gate, however, never yields a solution! The forget gate is essential during the learning phase, where it prevents the accumulation of irrelevant errors.
The exact timing of a spike is determined by the growth of s_c, which is tuned through connections to the input gate, forget gate, and the cell itself. To solve GTS for I(n) ∈ {0, 1} or I(n) ∈ {0, 1, 2}, the network essentially translates the input into a scaling factor for the growth of s_c (Figure 5.8).
5.5.4 Periodic Function Generation (PFG)
We now train LSTM to generate real-valued periodic functions, as opposed to the spike trains of the GTS task. At each discrete time step we provide a real-valued target, sampled with frequency F from a target function f(t). No input is given to the network.
The task's degree of difficulty is influenced by the shape of f and the sampling frequency F. The former can be partially characterized by the absolute maximal values of its first and second derivatives, max |f'| and max |f''|. Since we work in discrete time, and with non-differentiable
Figure 5.7: Test run of a trained peephole LSTM network for the GTS task with I(n) ∈ {0}, and a minimum spike interval of F = 10 (left) vs. F = 50 (right). Top: target values t^k and network output y^k; middle: cell state s_c and cell output y^c; bottom: activation of the input gate y^{in}, forget gate y^{φ}, and output gate y^{out}.

step functions, we define:

f'(t) := f(t+1) - f(t);   max |f'| ≡ max_t |f'(t)|;   max |f''| ≡ max_t |f'(t+1) - f'(t)|.

Generally speaking, the larger these values, the harder the task. F determines the number of distinguishable internal states required to represent the periodic function in internal state space. The larger F, the harder the task. We generate sine waves f_cos, triangular functions f_tri, and rectangular functions f_rect, all ranging between 0.0 and 1.0, each sampled with two frequencies,
Figure 5.8: Test run of a trained peephole LSTM network for the GTS task with F = 10 and I(n) ∈ {0, 1, 2}. Top: target values t^k and network output y^k; middle: cell state s_c and cell output y^c; bottom: activation of the input gate y^{in}, forget gate y^{φ}, and output gate y^{out}.
F = 10 and F = 25:

f_cos(t) ≡ (1/2) (1 - cos(2πt/F))   ⇒   max |f'_cos| = max |f''_cos| = π/F ;

f_tri(t) ≡ { 2 - (2/F)(t mod F)  if (t mod F) > F/2 ;   (2/F)(t mod F)  otherwise }   ⇒   max |f'_tri| = 2/F ,  max |f''_tri| = 4/F ;

f_rect(t) ≡ { 1  if (t mod F) > F/2 ;   0  otherwise }   ⇒   max |f'_rect| = max |f''_rect| = 1 .
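Transcribed directly into code (assuming integer time steps t and even F; our transcription, not the thesis code):

    import math

    def f_cos(t, F):
        return 0.5 * (1.0 - math.cos(2.0 * math.pi * t / F))

    def f_tri(t, F):
        r = t % F
        return 2.0 - 2.0 * r / F if r > F / 2 else 2.0 * r / F

    def f_rect(t, F):
        return 1.0 if (t % F) > F / 2 else 0.0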
tgt.     F    LSTM                                      Peephole LSTM
fn.           % Sol.  Train. Str. [10^3]  √MSE          % Sol.  Train. Str. [10^3]  √MSE
f_cos   10     90     2477 ± 341    0.13 ± 0.033        100     145 ± 32      0.18 ± 0.016
        25      0     > 10000       -                    60     149 ± 7       0.17 ± 0.019
f_tri   10      0     > 10000       -                   100     869 ± 204     0.13 ± 0.014
        25      0     > 10000       -                    50     4063 ± 303    0.13 ± 0.024
f_rect  10      0     > 10000       -                    80     1107 ± 97     0.12 ± 0.014
        25      0     > 10000       -                    20     748 ± 278     0.12 ± 0.012

Table 5.3: Results for the PFG task, showing the target function f, the sampling frequency F, the percentage of perfect solutions found, and the mean and standard deviation of the number of training streams required, as well as of the root mean squared error √MSE for the final test run.
PFG Results. Our experimental results for the PFG task are summarized in Table 5.3. Peephole LSTM found perfect, stable solutions for all target functions (Figure 5.9). LSTM without peephole connections could solve only f_cos with F = 10, requiring many more training streams. Without forget gates, LSTM never learned to predict the waveform for more than two successive periods.
The duration of training roughly reflected our criteria for task difficulty. We did not try to achieve maximal accuracy for each task: training was stopped once the "perfect solution" criteria were fulfilled. Accuracy can be improved by decreasing the tolerated maximum output error e^k_max during training, albeit at a significant increase in training duration. Decreasing e^k_max by one half (to 0.15) for f_cos with F = 25 also reduces the average √MSE of solutions by about one half, from 0.17 ± 0.019 down to 0.086 ± 0.002. Perfect solutions were learned in all cases, but only after (2704 ± 49) × 10^3 training streams, as opposed to (149 ± 7) × 10^3 training streams (yielding 60% solutions) before.


PFG Analysis. For the PFG task, the networks do not have any external input, so updates depend on the internal cell states only. Hence, in a stable solution for a periodic target function t^k(t), the cell states s_c also have to follow some periodic trajectory s(t) phase-locked to t^k(t). Since the cell output is the only time-varying input to gates and output units, it must simultaneously minimize the error at the output units and provide adequate input to the gates. An example of how these two requirements can be combined in one solution is shown in Figure 5.10 for f_cos with F = 10. This task can be solved with or without peephole connections because the output gate never needs to be closed completely, so that all gates can base their control on the cell output.
Why did LSTM networks without peephole connections never learn the target function f_cos for F = 25, although they did learn it for F = 10? The output gate is part of an uncontrolled feedback loop: its activation directly determines its own input (here: its only input, except for the bias) via the connection to the cell output, but no errors are propagated back on this connection. The same is true for the other gates, except that output gating can block their (thus incomplete) feedback loop. This makes an adaptive LSTM memory block without peephole connections more difficult to tune. Additional support for this reasoning stems from the fact that networks with peephole connections learn f_cos with F = 10 much faster (see Table 5.3). The peephole weights of solutions are typically of the same magnitude as the weights of
Figure 5.9: Target values t^k and network output y^k during test runs of trained peephole LSTM networks on the PFG task for the periodic functions f_cos (top), f_tri (middle), and f_rect (bottom), with periods F = 10 (left) and F = 25 (right).

connections from cell output to gates, which shows that they are indeed used even though they are not mandatory for this task.
The target functions f_tri and f_rect required peephole connections for both values of F. Figure 5.11 shows typical network solutions for the f_rect target function. The cell output y^c equals the cell state s_c in the second half of each period (when f_rect = 1) and is zero in the first half, because the output gate closes the cell (triggered by s_c, which is accessed via the peephole connections). The timing information is read off s_c, as explained in Section 5.5.2. Furthermore, the two states of the f_rect function are distinguished: s_c is counted up when f_rect = 0 and counted down again when f_rect = 1. This is achieved through a negative connection from the cell output to the cell input, feeding negative input into the cell only when the output gate
Figure 5.10: Test runs of a trained LSTM network with (right) vs. without (left) peephole connections on the f_cos PFG task with F = 10. Top: target values t^k and network output y^k; middle: cell state s_c and cell output y^c; bottom: activation of the input gate y^{in}, forget gate y^{φ}, and output gate y^{out}.
is open; otherwise the input is dominated by the positive bias connection. Networks without peephole connections cannot use this mechanism, and did not find any alternative solution. Throughout all experiments peephole connections were necessary to trigger the opening of gates while the output gate was closed, by granting unrestricted access to the timer implemented by the CEC. The gates learned to combine this information with their bias so as to open on reaching a certain trigger threshold.
Figure 5.11: Test runs of trained peephole LSTM networks on the f_rect PFG task with F = 10 (left) and F = 25 (right). Top: target values t^k and network output y^k; middle: cell state s_c and cell output y^c; bottom: activation of the input gate y^{in}, forget gate y^{φ}, and output gate y^{out}.
Figure 5.12: Cell states and gate activations at the onset (zero phase) of the first 20 cycles during a test run with a trained LSTM network on the f_cos PFG task with F = 10. Note that the initial state (at cycle 0) is quite far from the equilibrium state.
5.5.5 General Observation: Network Initialization
At the beginning of each stream, cell states and gate activations are initialized to zero. This initial state is almost always quite far from the corresponding state in the same phase of later periods in the stream. Figure 5.12 illustrates this for the f_cos task. After a few consecutive periods, cell states and gate activations of successful networks tend to settle to very stable, phase-specific values, which are typically quite different from the corresponding values in the first period. This suggests that the initial state of the network should be learned as well, as proposed by Forcada and Carrasco (1995), instead of arbitrarily initializing it to zero.
5.6 Conclusion
Previous work on LSTM did not require the network to extract relevant information conveyed by the duration of intervals between events. Here we show that LSTM can solve such highly nonlinear tasks as well, by learning to precisely measure time intervals, provided we furnish LSTM cells with peephole connections that allow them to inspect their current internal states. It is remarkable that peephole LSTM can learn exact and extremely robust timing algorithms without teacher forcing, even in the case of very uninformative, rarely changing target signals. This makes it a promising approach for numerous real-world tasks whose solution partly depends on the precise duration of intervals between relevant events.
Chapter 6

Simple Context Free and Context Sensitive Languages

6.1 Introduction


Previous work showed that LSTM outperforms traditional RNN algorithms on tasks that require learning the rules of regular languages (RLs); see Chapter 3 and Hochreiter and Schmidhuber (1997). RLs are describable by deterministic finite state automata (DFA) (Casey, 1996; Siegelmann, 1992; Blair & Pollack, 1997; Kalinke & Lehmann, 1998; Zeng, Goodman, & Smyth, 1994). Until now, however, it has remained unclear whether LSTM's superiority carries over to tasks involving context free languages (CFLs), such as those discussed in the RNN literature (Sun, Giles, Chen, & Lee, 1993; Wiles & Elman, 1995; Steijvers & Grunwald, 1996; Tonkes & Wiles, 1997; Rodriguez, Wiles, & Elman, 1999; Rodriguez & Wiles, 1998). Their recognition requires the functional equivalent of a stack. It is conceivable that LSTM has just the right bias for RLs but might fail on CFLs.
Here we will focus on the most common CFL benchmarks found in the RNN literature: a^n b^n and a^n b^m B^m A^n. We study questions such as:
• Can LSTM learn the functional equivalent of a pushdown automaton?
• Given training sequences up to size n, can it generalize to n+1, n+2, ...?
• How stable are the solutions?
• Does LSTM outperform previous approaches?
Finally we will apply LSTM to a context sensitive language (CSL). The CSLs include the CFLs, which include the RLs. We will focus on the classic example a^n b^n c^n, which is a CSL but not a CFL (Section 6.2). In general, CSL recognition requires a linear-bounded automaton, a special Turing machine whose tape length is at most linear in the input size. The {a^n b^n c^n} language is one of the simplest CSLs; it can be generated by a tree-adjoining grammar and recognized using a so-called embedded push-down automaton (Vijay-Shanker, 1992) or a finite

state automaton with a ess to two ounters that an be in remented or de remented. To our
knowledge no RNN has been able to learn a CSL.
We are using LSTM with forget gates and peephole onne tions introdu ed in the previous
hapters.
6.2 Experiments
The network sequentially observes exemplary symbol strings of a given language (also referred
to as input sequences), presented one input symbol at a time. Following the traditional approach
in the RNN literature we formulate the task as a prediction task: at any given time step the
target is to predict the possible next symbols, including the "end of string" symbol T. When
more than one symbol can occur in the next step, all possible symbols have to be predicted,
and none of the others.

Every input sequence begins with the start symbol S. The empty string, consisting of ST
only, is considered part of each language. A string is accepted when all predictions have been
correct; otherwise it is rejected. This prediction task is equivalent to a classification task with
the two classes "accept" and "reject," because the system will make prediction errors for all
strings outside the language. A system has learned a given language up to string size n once it
is able to correctly predict all strings with size <= n.
Symbols are encoded locally by d-dimensional binary vectors with only one non-zero component,
where d equals the number of language symbols plus one for either the start symbol in the
input or the "end of string" symbol in the output (d input units, d output units). +1 signifies
that a symbol is set and -1 that it is not; the decision boundary for the network output is 0.0.
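As a minimal sketch of this encoding (symbol set shown for the a^n b^n c^n task; the code is ours, not the thesis implementation):

    def encode(symbol, alphabet=("S", "a", "b", "c")):
        # Local encoding: exactly one +1 component, all others -1 (here d = 4).
        return [1.0 if s == symbol else -1.0 for s in alphabet]

    # Output units use ("T", "a", "b", "c"), since T replaces S in the output;
    # a symbol counts as predicted whenever its output activation exceeds 0.0.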
CFL a^n b^n (Sun et al., 1993; Wiles & Elman, 1995; Tonkes & Wiles, 1997; Rodriguez et al.,
1999). Here the strings in the input sequences are of the form a^n b^n; input and output vectors
are 3-dimensional. Prior to the first occurrence of b, either a or b (or a or T at sequence
beginnings) is possible in the next step. Thus, e.g., for n = 5:

Input:  S    a    a    a    a    a    b  b  b  b  b
Target: a/T  a/b  a/b  a/b  a/b  a/b  b  b  b  b  T

An example of a set of context-free production rules for the a^n b^n grammar is: S -> ε | A;
A -> a A b | ε, where S is the start symbol, A is a non-terminal symbol and ε is the empty string.
CFL a^n b^m B^m A^n (Rodriguez & Wiles, 1998). The second half of a string from this palindrome
or mirror language is completely predictable from the first half. The task involves an
intermediate time lag of length 2m. Input and output vectors are 5-dimensional. Prior to the
first occurrence of B, two symbols are possible in the next step. Thus, e.g., for n = 4, m = 3:

Input:  S    a    a    a    a    b    b    b    B  B  B  A  A  A  A
Target: a/T  a/b  a/b  a/b  a/b  b/B  b/B  b/B  B  B  A  A  A  A  T

The a^n b^m B^m A^n grammar can be produced by context-free rules similar to those of the
a^n b^n grammar, using two non-terminal symbols (here written X and Y): S -> ε | X;
X -> a X A | Y; Y -> b Y B | ε.
CSL a^n b^n c^n. Input and output vectors are 4-dimensional. Prior to the first occurrence of b,
two symbols are possible in the next step. Thus, e.g., for n = 5:

Input:  S    a    a    a    a    a    b  b  b  b  b  c  c  c  c  c
Target: a/T  a/b  a/b  a/b  a/b  a/b  b  b  b  b  c  c  c  c  c  T
The pumping lemma for context-free languages can be applied to show that a^n b^n c^n is not
context-free. An intuitive explanation is that the number of a symbols must still be available
when producing the b and c symbols; this requires context information.
6.2.1 Training and Testing

Learning and testing alternate: after each epoch (= 1000 training sequences) we freeze the
weights and run a test. Even when all strings are processed correctly during training, it is
necessary to test again with frozen weights once all weight changes have been executed. Apart
from ensuring the learning of the training set, the test also determines generalization performance,
which we did not optimize by using, say, a validation set.

Training and test sets incorporate all legal strings up to a given length: 2n for a^n b^n, 3n for
a^n b^n c^n and 2(n+m) for a^n b^m B^m A^n. Training strings are presented in random order. Only
exemplars from the class "accept" are presented. Training is stopped once all training sequences
have been accepted, or after at most 10^7 training sequences. The generalization set is the largest
accepted test set (assuming that the network generalizes at all).

Weight changes are made after each sequence. We apply the momentum algorithm (Plaut et al.,
1986) with learning rate 10^-5 and momentum parameter 0.99. All results are averages over
10 independently trained networks with different weight initializations (these 10 initializations
are identical for each experiment).
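As a sketch, one momentum update then looks as follows (variable names are ours; delta_w denotes the raw weight changes of the learning equations before the learning rate is applied):

    def momentum_step(weights, delta_w, velocity, alpha=1e-5, mu=0.99):
        # Momentum (Plaut et al., 1986): a decaying average of past updates
        # smooths the weight changes executed after each sequence.
        for i in range(len(weights)):
            velocity[i] = mu * velocity[i] + alpha * delta_w[i]
            weights[i] += velocity[i]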
CFL a^n b^n. We study training sets with n ∈ {1,...,N}. We test all sets with n ∈ {1,...,M} and
M ∈ {N,...,1000} (sequences of length <= 2000).

CFL a^n b^m B^m A^n. We use two training sets: a) the same set as used by Rodriguez and
Wiles (1998): n ∈ {1,...,11}, m ∈ {1,...,11} with n + m <= 12 (sequences of length <= 24); b)
the set given by n ∈ {1,...,11}, m ∈ {1,...,11} (sequences of length <= 44). We test all sets with
n ∈ {1,...,M}, m ∈ {1,...,M} and M ∈ {11,...,50} (sequences of length <= 200).

CSL a^n b^n c^n. We study two kinds of training sets: a) with n ∈ {1,...,N} and b) with
n ∈ {N-1, N}. Case b) asks for a major generalization step that seems almost impossible
at first glance: given very similar training sequences of only two different sizes, learn to
process sequences of arbitrary size! We test all sets with n ∈ {L,...,M}, L ∈ {1,...,N-1} and
M ∈ {N,...,500} (sequences of length <= 1500).

6.2.2 Network Topology and Experimental Parameters

The input units are fully connected to a hidden layer consisting of memory blocks with 1 cell
each. The cell outputs are fully connected to the cell inputs, to all gates, and to the output units,
which also have direct "shortcut" connections from the input units (Figure 6.1). For each task
we selected the topology with the minimal number of memory blocks that solved the task, without
extensive parameter optimization. Larger topologies never led to disadvantages except for an
increase in computational complexity.

All gates, the cell itself and the output unit are biased. The bias weights to input gate,
forget gate and output gate are initialized with -1.0, +2.0 and -2.0, respectively. Although
not critical, these values have been found empirically to work well; we use them for all our
experiments. The input and output gates thus start off closed, while the forget gates start
open, so that the cells initially remember everything. We also tried different bias configurations;
the results were qualitatively the same, which supports our claim that precise initialization is
not critical. All other weights are initialized randomly in the range [-0.1, 0.1]. The cell's input
squashing function g is the identity function. The squashing function of the output units is a
sigmoid function with the range [-2, 2].
Figure 6.1: Three-layer LSTM topology with a single input and output. Recurrence is limited
to the hidden layer, consisting here of a single LSTM memory block with a single cell. All 10
"unit-to-unit" connections are shown (bias and peephole connections are not).
Reference                    Hidden Units   Train. Set [n]   Train. Str. [10^3]   Sol./Tri.   Best Test [n]
(Sun et al., 1993)^1              5            1,...,160            13.5             1/1         1,...,160
(Wiles & Elman, 1995)             2            1,...,11             2000             4/20        1,...,18
(Tonkes & Wiles, 1997)            2            1,...,10             10               13/100      1,...,12
(Rodriguez et al., 1999)^2        2            1,...,11             267              8/50        1,...,16

Table 6.1: Previous results for the CFL a^n b^n, showing (from left to right) the number of hidden
units or state units, the values of n used during training, the number of training sequences, the
number of found solutions/trials, and the largest accepted test set.
CFL a^n b^n. We use one memory block (with one cell). With peephole connections there are
38 adjustable weights (3 peephole, 28 unit-to-unit and 7 bias connections).

CFL a^n b^m B^m A^n. We use two blocks with one cell each, resulting in 110 adjustable weights
(6 peephole, 91 unit-to-unit and 13 bias connections).

CSL a^n b^n c^n. We use the same topology as for the a^n b^m B^m A^n language, but with 4 input
and output units instead of 5, resulting in 90 adjustable weights (6 peephole, 72 unit-to-unit
and 12 bias connections).
6.2.3 Previous results

CFL a^n b^n. Published results on the a^n b^n language are summarized in Table 6.1. RNNs
trained with plain BPTT tend to learn to just reproduce the input (Wiles & Elman, 1995;
Tonkes & Wiles, 1997; Rodriguez et al., 1999).

^1 Sun's training set was augmented stepwise by sequences misclassified during testing, and in the final accepted
set n was in {1,...,20} except for 20 random sequences up to length n = 160 (the exact generalization performance
was unclear).
^2 Applying brute-force search to the weights of the best network of Rodriguez et al. (1999) further improves
performance to acceptance up to n = 28.
Train. Set [n]   Train. Str. [10^3]   % Sol.   Generalization Set [n]
1,...,10              22 (19)           100     1,...,1000 (1,...,118)
1,...,20              18 (19)           100     1,...,587  (1,...,148)
1,...,30              16 (19)           100     1,...,1000 (1,...,408)
1,...,40              25 (28)           100     1,...,1000 (1,...,628)
1,...,50              42 (40)           100     1,...,767  (1,...,430)

Table 6.2: Results for the a^n b^n language, showing (from left to right) the values of n used during
training, the average number of training sequences until best generalization was achieved, the
percentage of correct solutions and the best generalization (average over all networks given in
parentheses).
Sun et al. (1993) used a highly specialized architecture, the "neural pushdown automaton",
which also did not generalize well (Sun et al., 1993; Das, Giles, & Sun, 1992).
CFL a^n b^m B^m A^n. Rodriguez and Wiles (1998) used BPTT-RNNs with 5 hidden nodes.
After training with 51×10^3 strings with n + m <= 12 (sequences of length <= 24), most networks
generalized to longer off-training-set strings. The best network generalized to sequences up to
length 36 (n = 9, m = 9). But none of them learned the complete training set.

CSL a^n b^n c^n. To our knowledge no previous RNN ever learned a CSL.

6.2.4 LSTM Results

CFL a^n b^n. 100% solved for all training sets (Table 6.2). Small training sets (n ∈ {1,...,10})
were already sufficient for perfect generalization up to the tested maximum: n ∈ {1,...,1000}.
Note that long sequences of this kind require very stable, finely tuned control of the network's
internal counters (Casey, 1996).

This performance is much better than that of previous approaches, where the largest set was
learned by the specially designed neural pushdown automaton (Sun et al., 1993; Das et al.,
1992): n ∈ {1,...,160}. The latter, however, required training sequences of the same length as the
test sequences. From the training set with n ∈ {1,...,10} LSTM generalized to n ∈ {1,...,1000},
whereas the best previous result (see Table 6.1) generalized only to n ∈ {1,...,18} (even with
a slightly larger training set: n ∈ {1,...,11}). In contrast to Tonkes and Wiles (1997), we did
not observe our networks forgetting solutions as training progresses. So unlike all previous
approaches, LSTM reliably finds solutions that generalize well.

The fluctuations in generalization performance for different training sets in Table 6.2 may be
due to the fact that we did not optimize generalization performance with a validation set.
Instead we simply stopped each epoch (= 1000 sequences) once the training set was learned.

CFL a^n b^m B^m A^n. Training set a): 100% solved; after 29×10^3 training sequences the best
network of 10 generalized to at least n, m ∈ {1,...,22} (all strings up to a length of 88 symbols
processed correctly); the average generalization set was the one with n, m ∈ {1,...,16} (all strings
up to a length of 64 symbols processed correctly), learned after 25×10^3 training sequences on
average.

Training set b): 100% solved; after 26×10^3 training sequences the best network generalized
to at least n, m ∈ {1,...,23} (all strings up to a length of 92 symbols processed correctly).

Train. Set [n]   Train. Str. [10^3]   % Sol.   Generalization Set [n]
1,...,10              54 (62)           100     1,...,52  (1,...,28)
1,...,20              28 (43)           100     1,...,160 (1,...,66)
1,...,30              37 (43)           100     1,...,228 (1,...,91)
1,...,40              51 (48)            90     1,...,500 (1,...,120)
1,...,50              60 (94)           100     1,...,500 (1,...,409)
10, 11                24 (78)           100     9,...,12  (10,...,11)
20, 21               829 (626)           40     10,...,27 (17,...,23)
30, 31                42 (855)           30     29,...,34 (29,...,32)
40, 41               854 (1597)          40     20,...,57 (35,...,45)
50, 51                32 (621)           60     43,...,57 (47,...,55)

Table 6.3: Results for the a^n b^n c^n language, showing (from left to right) the values of n used
during training, the average number of training sequences until best generalization was achieved,
the percentage of correct solutions and the best generalization (average over all networks in
parentheses).
The average generalization set was the one with n, m ∈ {1,...,17} (all strings up to a length of 68
symbols processed correctly), learned after 82×10^3 training sequences on average. Unlike the
previous approach of Rodriguez and Wiles (1998), LSTM easily learns the complete training set
and reliably finds solutions that generalize well.

CSL a^n b^n c^n. LSTM learns 4 of the 5 training sets in 10 out of 10 trials (only 9 out of
10 for the training set with n ∈ {1,...,40}) and generalizes well (Table 6.3). Small training sets
(n ∈ {1,...,40}) were already sufficient for perfect generalization up to the tested maximum:
n ∈ {1,...,500}, that is, sequences of length up to 1500. Even in the absence of any short training
sequences (n ∈ {N-1, N}) LSTM learned well (see bottom half of Table 6.3).

We also modified the training procedure, presenting each exemplary string without providing
all possible next symbols as targets, but only the symbol that actually occurs in the
current exemplar. This led to slightly longer training durations, but did not significantly change
the results.
6.2.5 Analysis

How do the solutions discovered by LSTM work?

CFL a^n b^n. Figure 6.2 shows a test run with a network solution for n = 5. The cell state
s_c increases while a symbols are fed into the network, then decreases (with the same step size)
while b symbols are fed in. At sequence beginnings (when the first a symbols are observed),
however, the step size is smaller due to the closed input gate, which is triggered by s_c itself.
This results in "overshooting" the initial value of s_c at the end of a sequence, which in turn
triggers the opening of the output gate, which in turn leads to the prediction of the sequence
termination.
CFL a^n b^m B^m A^n. The behavior of a typical network solution is shown in Figure 6.3. The
network learned to establish and control two counters. The two symbol pairs (a, A) and (b, B)
are treated separately by two different cells, c2 and c1, respectively. Cell c2 tracks the difference
between the number of observed a and A symbols.
Figure 6.2: CFL a^n b^n (n = 5): Test run with a network solution. Top: Network output y^k.
Middle: Cell state s_c1 and cell output y^c1. Bottom: Activations of the gates (input gate y^in1,
forget gate y^φ1 and output gate y^out1).

It opens only at the end of a string, where it predicts the final T. Cell c1 treats the embedded
b^m B^m substring in a similar way. While values are stored and manipulated within a cell, the
output gate remains closed. This prevents the cell from disturbing the rest of the network and
also protects its CEC against incoming errors.
CSL a^n b^n c^n. The network solutions use a combination of two counters, instantiated separately
in the two memory blocks (Figure 6.4). Here one cell counts up, given an a input symbol,
and counts down, given a b; a c in the input causes the input gate to close and the forget gate
to reset the cell state. The other memory block does the same for b, c and a, respectively.
The opening of the output gate of the first block indicates the end of a string (and the
prediction of the last T), triggered via its peephole connection.
Figure 6.3: CFL a^n b^m B^m A^n (n = 5, m = 4): Test run with a network solution. Top:
Network output y^k. Middle: Cell states s_c1, s_c2 and cell outputs y^c1, y^c2. Bottom:
Activations of the gates (input gates y^in, forget gates y^φ and output gates y^out).
Why does the network not generalize for short strings when using only two training strings, as
for the a^n b^n c^n language (see Table 6.3)? The gate activations in Figure 6.4 show that
activations slightly drift even when the input stays constant.
Figure 6.4: CSL a^n b^n c^n (n = 5): Test run with a network solution (the system scales up to
sequences of length 1000 and more). Top: Network output y^k. Middle: Cell states s_c and cell
outputs y^c. Bottom: Activations of the gates (input gates y^in, forget gates y^φ and output
gates y^out).

Solutions take this state drift into account, and will not work without it or with too much of it,
as is the case when sequences are much shorter or longer than the few observed training examples.
This imposes a limit on generalization in both directions (towards longer and shorter strings).
We found solutions with less drift to generalize better.
Further improvements. Even better results can be obtained through increased training
time and stepwise reduction of the learning rate, as done in Rodriguez et al. (1999). The
distribution of sequence lengths in the training set also affects learning speed and generalization.
A set containing more long sequences improves generalization for longer sequences. Omitting
the sequence with n = 1 (and m = 1), typically the last one to be learned, has the same effect.
Training sets with many short and many long sequences are learned more quickly than uniformly
distributed ones.

Related tasks. The (b a^k)^n regular language is related to a^n b^n in the sense that it requires
learning a counter, but the counter never needs to count down. This task is equivalent to the
"generating timed spikes" task (Section 5.5.3), learned by LSTM for k = 50 with n >= 1000. A
hand-made, hardwired solution (no learning) of a second-order RNN worked for values of k up to
120 (Steijvers & Grunwald, 1996).

For all three tasks peephole connections are mandatory. The output gates remain closed for
substantial time periods during each input sequence presentation (compare Figures 6.2, 6.3 and
6.4); the end of such a period is always triggered via peephole connections.
6.3 Conclusion

We found that LSTM clearly outperforms previous RNNs not only on regular language benchmarks
(according to previous research) but also on context free language (CFL) benchmarks;
it learns faster and generalizes better. LSTM also is the first RNN to learn a context sensitive
language.

Although CFLs like those studied here may also be learnable by certain discrete symbolic
grammar learning algorithms (SGLAs) (Sakakibara, 1997; Lee, 1996; Osborne & Briscoe, 1997),
the latter exhibit more task-specific bias, and are not designed to solve the numerous other
sequence processing tasks involving noise, real-valued inputs / internal states, and continuous
output trajectories, which LSTM solves easily (see previous chapters and Hochreiter and
Schmidhuber (1997)). SGLAs include a large range of methods, such as decision-tree algorithms
(see, e.g., Quinlan (1992)), case-based and explanation-based reasoning (see, e.g., Mitchell,
Keller, and Kedar-Cabelli (1986); Porter, Bruce, Bareiss, and Holte (1990)), and inductive logic
programming (see, e.g., Zelle and Mooney (1993)).

Our findings reinforce the perception that LSTM is a very general and promising adaptive
sequence processing device, with a wider field of potential applications than alternative RNNs.
Chapter 7

Time Series Predictable Through Time-Window Approaches

7.1 Introduction

In the previous chapters we have applied LSTM to numerous temporal processing tasks, such
as: continual grammar problems and recognition of temporally extended, noisy patterns (Chapter
3); arithmetic operations on continual input streams and robust storage of real numbers across
extended time intervals (Chapter 4); extraction of information conveyed by the temporal distance
between events and generation of precisely timed events (Chapter 5); stable generation of smooth
and highly nonlinear periodic trajectories (Chapter 5); recognition of regular, context free
and context sensitive languages (Chapter 6).

Time series benchmark problems found in the literature, however, are often conceptually
simpler than the above. They often do not require RNNs at all, because all relevant information
about the next event is conveyed by a few recent events contained within a small time window.
Here we apply LSTM to such relatively simple tasks, to establish a limit to the capabilities of
the LSTM algorithm in its current form. We focus on two intensively studied tasks, namely,
prediction of the Mackey-Glass series (Mackey & Glass, 1977) and chaotic laser data (Set A)
from a contest at the Santa Fe Institute (1992).
LSTM is run as a "pure" autoregressive (AR) model that can only access input from the
current time step, reading one input at a time, while its competitors (e.g., multi-layer perceptrons
(MLPs) trained by back-propagation (BP)) simultaneously see several successive inputs
in a suitably chosen time window. Note that Time-Delay Neural Networks (TDNNs) (Haffner &
Waibel, 1992) are not purely AR, because they allow for direct access to past events. Neither are
NARX networks (Lin et al., 1996), which allow for several distinct input time windows (possibly
of size one) with different temporal offsets.

We also evaluate stepwise versus iterated training, as proposed by Principe and Kuo (1995) to
make RNNs learn a dynamic attractor rather than simply approximate output. It was found by
Principe, Rathie, and Kuo (1992) that neural networks trained with iterative training outperform
traditional prediction algorithms in approximating "real" chaotic attractors.
Figure 7.1: AR-RNN setup for time series prediction.


Bakker, Schouten, Giles, Takens, and Bleek (2000) refined the iterated training scheme and
found it superior to stepwise training. Here we cannot generally confirm this result.
7.2 Experimental Setup

The task is to use currently available points in the time series to predict the future point at
t + T. The target t^k for the network is the difference between the value x(t+p) of the time
series p steps ahead and the current value x(t), multiplied by a scaling factor f_s:

    t^k(t) = f_s * (x(t+p) - x(t)) = f_s * Δx(t) .

f_s scales Δx(t) between -1 and 1 for the training set; the same value of f_s is used during
testing. The predicted value is the network output divided by f_s, plus x(t) (Figure 7.1).
During iterated prediction with T = n*p the output is clamped to the input (self-iteration) and
the predicted values are fed back n times. For direct prediction p = T and n = 1; for single-step
prediction p = 1 and n = T.
Note that during iterated prediction, the network state after the first prediction has to be
stored and re-established after the last self-iteration. For iterated prediction with p > 1
and n > 1 the setup becomes more complex: p copies of the network have to predict in parallel.
The network predicting x(t + n*p) starts with x(t) and feeds back the predicted values n-1
times to the input, before the same procedure is executed with a second network that starts with
x(t+1) and predicts x(t + 1 + n*p). The internal network state is thereby indirectly also
trained to move from s(t) to s(t+p) in one iteration. Hence, the iterated prediction of one series
with step size p > 1 results in the parallel prediction of p series with p different starting points:
t_start = 0, 1, 2, ..., p-1. For example, given T = 84 and p = 6 (hence n = 14), we start at
t_start = 0 and iterate 14 times to predict the value at t = 84. We then use another copy of the
network to predict t = 84 + 1, starting at t_start = 1, and so forth.
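The bookkeeping for one such network copy can be sketched as follows (predict_one_step, get_state and set_state are hypothetical helpers; the actual implementation is the C code referenced in Appendix B):

    def iterated_prediction(net, x_t, n):
        # Feed the network's own output back n times to predict x(t + n*p).
        # The state reached after the first prediction is saved, and restored
        # at the end, so that stepwise processing of the series can continue.
        x = x_t
        saved_state = None
        for i in range(n):
            x = predict_one_step(net, x)      # output clamped to the input
            if i == 0:
                saved_state = get_state(net)
        set_state(net, saved_state)
        return x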
Bakker, Schouten, Giles, Takens, and Bleek (2000) proposed to mix network predictions with
the target values during iterated training. One challenge with this procedure lies in finding the
right mixing coefficient. Bakker et al. used the same constant value throughout training. This
procedure has the disadvantage that bad predictions at the beginning of training induce a
lot of "input noise". We modified Bakker's idea by introducing a maximum output error e_max
for iterated training in place of a mixture. When the error e^k at the output was larger than
e_max = 0.5, the output was unclamped and training continued with the next true input value at
t + 1 + p.

This scheme has the advantage that the number of iterated steps is coupled to training
performance and is in this way self-regularizing. In preliminary experiments we tested our
method against constant prediction-target mixtures with different coefficients. Our method
always learned faster and with fewer network divergences.
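In pseudo-code form, the rule for choosing the next input during iterated training is simply (a sketch; the names are ours):

    def next_input(y_k, t_k, x_pred, x_true, e_max=0.5):
        # y_k, t_k: network output and target (on the f_s-scaled scale);
        # x_pred, x_true: corresponding series values fed to the input.
        if abs(y_k - t_k) > e_max:
            return x_true    # unclamp: continue with the next true value
        return x_pred        # keep self-iterating on the network's prediction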
The error measure is the normalized root mean squared error:

    NRMSE = <(y^k - t^k)^2>^(1/2) / <(t^k - <t^k>)^2>^(1/2) ,

where y^k is the network output and t^k the target. The reported performance is the
best result of 10 independent trials.
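In code (a direct transcription using NumPy):

    import numpy as np

    def nrmse(y, t):
        # RMSE normalized by the standard deviation of the target;
        # a value of 1.0 corresponds to predicting the target mean.
        return np.sqrt(np.mean((y - t) ** 2) / np.mean((t - np.mean(t)) ** 2))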


7.2.1 Network Topology

LSTM. The input units are fully connected to a hidden layer consisting of memory blocks with
1 cell each. The cell outputs are fully connected to the cell inputs, to all gates, and to the
output units. All gates, the cell itself and the output unit are biased. Bias weights to input and
output gates are initialized block-wise: -0.5 for the first block, -1.0 for the second, -1.5 for
the third, and so forth. Forget gates are initialized with symmetric positive values: +0.5 for the
first block, +1.0 for the second block, etc. We also tried a bias configuration with reversed signs
for the initial values. In this case the gates are open (so there is no gating) and the cells forget
almost immediately; this configuration is similar to an RNN with one fully recurrent hidden layer.
The results were qualitatively the same, which supports our claim that precise initialization is not
critical. All other weights are initialized randomly in the range [-0.1, 0.1]. The cell's input
squashing function g is a sigmoid function with the range [-1, 1]. The squashing function of the
output units is the identity function.

To have statistically independent weight updates, we execute weight changes every 50 +
rand(50) steps (where rand(max) stands for a random positive integer smaller than max, which
changes after every update). We use a constant learning rate α = 10^-4.

MLP. The MLPs we use for comparison have one hidden layer and are trained with BP. As
with LSTM, the single output unit is linear and Δx is the target. The input differs for each task
but in general uses a time window with a time-space embedding. All units are biased and the
learning rate is α = 10^-3.

Note that we do not use IO shortcuts, because they become short circuits during self-iteration,
causing exponential growth of the output unit's activity.
7.3 Mackey-Glass Chaotic Time Series

The Mackey-Glass chaotic time series (Mackey & Glass, 1977) can be generated from the
Mackey-Glass delay-differential equation:

    dx(t)/dt = a * x(t-τ) / (1 + x^c(t-τ)) - b * x(t)        (7.1)

We generate benchmark sets using the parameters a = 0.2, b = 0.1, c = 10 and τ = 17. For
τ > 16.8 the series becomes chaotic. τ = 17 results in a quasi-periodic series with a characteristic
period of approximately 50, lying on an attractor with fractal dimension D = 2.1. To generate
these benchmark sets, Equation 7.1 is integrated using a four-point Runge-Kutta method with
step size 0.1 and the initial condition x(t) = 0.8 for t < 0. The equation is integrated up to
t = 5500, with the points from t = 200 to t = 3200 used for training and the points from
t = 5000 to t = 5500 used for testing.
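A sketch of this generation procedure (a fourth-order Runge-Kutta scheme with the delayed value read off the already computed history; interpolating the delayed term at half-steps is one simple choice among several):

    import numpy as np

    def mackey_glass(n_steps, a=0.2, b=0.1, c=10.0, tau=17.0, h=0.1, x0=0.8):
        # dx/dt = a*x(t-tau) / (1 + x(t-tau)^c) - b*x(t);  x(t) = x0 for t <= 0.
        d = int(round(tau / h))                 # delay in grid steps (170)
        x = np.full(n_steps + d, x0)            # constant history prepended
        f = lambda xn, xd: a * xd / (1.0 + xd ** c) - b * xn
        for i in range(d, n_steps + d - 1):
            xd0 = x[i - d]                          # x(t - tau)
            xd1 = 0.5 * (x[i - d] + x[i - d + 1])   # ~ x(t - tau + h/2)
            k1 = f(x[i], xd0)
            k2 = f(x[i] + 0.5 * h * k1, xd1)
            k3 = f(x[i] + 0.5 * h * k2, xd1)
            k4 = f(x[i] + h * k3, x[i - d + 1])
            x[i + 1] = x[i] + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        return x[d:]                            # drop the constant history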
Figure 7.2 shows the first 100 points from the test set. Since the Mackey-Glass time series
is chaotic, it is difficult to predict for values of T greater than its characteristic period of
approximately 50. In the literature a number of different prediction points have been tried:
T ∈ {1, 6, 84, 85, 90}. For the comparison of results we consider the predictions with offsets
T ∈ {84, 85, 90} as equal tasks. For approaches that use as input a time window of past values
it is common to use the four delays t, t-6, t-12 and t-18. These points represent an adequate
delay-state embedding for the prediction of the Mackey-Glass series assuming T = 6.
Figure 7.2: Mackey-Glass time series (test set). Top-Left: Cut-out of the series. Top-Right:
The first difference for p = 1: Δx(t) = x(t+1) - x(t). Bottom-Left: x(t+1) against x(t).
Bottom-Right: Δx(t) against x(t).
For further explanation see, for example, Falco, Iazzetta, Natale, and Tarantino (1998). As
explained above, LSTM received only the value of x(t) as input.
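Building such a time-window input for the competitor models is a one-liner; for reference (our sketch):

    def delay_embedding(x, t, delays=(0, 6, 12, 18)):
        # Standard Mackey-Glass input window: x(t), x(t-6), x(t-12), x(t-18).
        return [x[t - d] for d in delays]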
7.3.1 Previous Work

In the following sections we attempt to summarize existing attempts to predict these time series.
To allow comparison among approaches, we did not consider works where noise was added to the
task or where training conditions were very different from ours. When not specifically mentioned,
an input time window with time delays t, t-6, t-12 and t-18 or larger was used. The different
approaches are outlined in Table 7.1. Vesanto (1997) offers the best result to date, to our
knowledge, with a Self-Organizing Map (SOM) approach; the SOM parameters given in Table 7.2
refer to the prototype vectors of the map. The results of these approaches are listed in Table 7.2.
We re-calculated the results for Bone et al. (2000), because only the NMSE was given.
7.3.2 Results

The LSTM results are listed at the bottom of Table 7.2. After six single steps of iterated training
(p = 1, T = 6, n = 6) the LSTM NRMSE for single-step prediction (p = T = 1, n = 1) is 0.0452.
After 84 single steps of iterated training (p = 1, T = 84, n = 84) the LSTM NRMSE for single-step
prediction (p = T = 1, n = 1) is 0.0809. Figure 7.3 shows iterated prediction results for
LSTM. Increasing the number of memory blocks did not significantly improve the results.

Why did LSTM perform worse than the MLP? The AR-LSTM network does not have access
to the past as part of its input and therefore has to learn to extract and represent a Markov
state (Bakker & Kleij, 2000).
BPNN (Day and Davenport, 1993): BP continuous-time feed-forward NNs with two hidden
layers and fixed time delays.

ATNN (Day and Davenport, 1993): BP continuous-time feed-forward NNs with two hidden
layers and adaptable time delays.

DCS-LLM (Chudy and Farkas, 1998): Dynamic Cell Structures combined with Local Linear
Models.

EBPTTRNN (Bone, Crucianu, Verley, and Asselin de Beauville, 2000): RNNs with 10 adaptive
delayed connections trained with BPTT combined with a constructive algorithm.

BGALR (Falco, Iazzetta, Natale, and Tarantino, 1998): A genetic algorithm with adaptable
input time-window size (Breeder Genetic Algorithm with Line Recombination).

EPNet (Yao and Liu, 1997): Evolved neural nets (Evolvable Programming Net).

SOM (Vesanto, 1997): A self-organizing map.

Neural Gas (Martinez, Berkovich, and Schulten, 1993): The Neural Gas algorithm for a
Vector Quantization approach.

AMB (Bersini, Birattari, and Bontempi, 1998): An improved memory-based regression (MB)
method (Platt, 1991) that uses an adaptive approach to automatically select the number of
regressors (AMB).

Table 7.1: Summary of previous approaches for the prediction of the Mackey-Glass time series.

In tasks we considered so far this required remembering one or two
events from the past, then using this information before over-writing the same memory cells.
The Mackey-Glass equation contains the input from t-17; hence its implementation requires
the storage of all inputs from t-17 to t (time-window approaches consider selected inputs back
to at least t-18). Assuming that any dynamic model needs the event from time t-τ with
τ ≈ 17, we note that the AR-RNN has to store all inputs from t-τ to t and to overwrite them
at the adequate time. This requires the implementation of a circular buffer, a structure quite
difficult for an RNN to simulate. In a TDNN, on the other hand, a circular buffer is inherent
to the network structure.
7.3.3 Analysis

It is interesting that for MLPs (T = 6) it was more effective to transform the task into a one-step-ahead
prediction task and iterate than to predict directly (compare the results for
p = 1 and p = T). It is in general easier to predict fewer steps ahead, the disadvantage being
that during iteration input values have to be replaced by predictions. For T = 6 with p = 1 this
affects only the latest value. This advantage is lost for T = 84, and the results with p = 1 are
worse than with p = 6, where fewer iterations are necessary. For MLPs, iterated training did
not in general produce better results: it improved performance when the step size p was 1, and
worsened performance for p = 6.
Reference                            Units   Para.   Seq.        T=1        T=6       T=84
Predict input: x(t+T) = x(t)           -       -       -         0.1466     0.8219    1.4485
Linear Predictor                       -       -       -         0.0327     0.7173    1.5035
6th-order Polynom. (Crowder, 1990)     -       -       -         -          0.04      0.85
BPNN (Lapedes & Farber, 1987)          -       -       -         -          0.02      0.06
FTNN (Day & Davenport, 1993)          20      120    7×10^7      -          0.012     -
ATNN (Day & Davenport, 1993)          20      120    7×10^7      -          0.005     -
Cascade-Correlation (Crowder, 1990)   20     ~250      -         -          0.04      0.17
DCS-LLM (Chudy & Farkas, 1998)       200      200   ~1×10^5      -          0.0055    0.03
EBPTTRNN (Bone et al., 2000)           6       65      -         -          0.0115    -
BGALR (Falco et al., 1998)            16     ~150      -         -          0.2373    0.267
EPNet (Yao & Liu, 1997)              ~10     ~100   ~1×10^4      -          0.02      0.06
SOM (Vesanto, 1997)                    -    10x10   ~1.5×10^4    -          0.013     0.06
SOM (Vesanto, 1997)                    -    35x35   ~1.5×10^4    -          0.0048    0.022
Neural Gas (Martinez et al., 1993)   400     3600    2×10^4      -          -         0.05
AMB (Bersini et al., 1998)             -       -       -         -          0.054     -
MLP, p = T                             4       25    1×10^4      0.0102     0.0511    0.4604
MLP, p = T                            16       97    1×10^4      0.0113     0.0502    0.4612
MLP, p = 1                             4       25    1×10^4      (p=T=1)    0.0241    0.4208
MLP, p = 1                            16       97    1×10^4      (p=T=1)    0.0252    0.4734
MLP, p = 1, IT                         4       25    1×10^4      0.0089     0.0191    0.4143
MLP, p = 1, IT                        16       97    1×10^4      0.0094     0.0205    0.3929
MLP, p = 6                             4       25    1×10^4      -          (p=T=6)   0.1659
MLP, p = 6                            16       97    1×10^4      -          (p=T=6)   0.1466
MLP, p = 6, IT                         4       25    1×10^4      -          0.0946    0.3012
MLP, p = 6, IT                        16       97    1×10^4      -          0.0945    0.2820
LSTM, p = T                            4      113    5×10^4      0.0214     0.1184    0.4700
LSTM, p = 1                            4      113    5×10^4      (p=T=1)    0.1981    0.5927
LSTM, p = 1, IT                        4      113    1×10^4      (s. text)  0.1970    0.8157
LSTM, p = 6                            4      113    5×10^4      -          (p=T=6)   0.2910
LSTM, p = 6, IT                        4      113    1×10^4      -          0.1903    0.3595

Table 7.2: Results for the Mackey-Glass task, showing (from left to right) the number of units,
the number of parameters (weights for NNs), the number of training sequence presentations,
and the NRMSE for prediction offsets T ∈ {1, 6, 84}. "IT" stands for iterated training.
Figure 7.3: Mackey-Glass time series: Test runs with LSTM network solutions. Shown are the
network output (solid lines) and the target t. Top-Left: Single-step prediction and six iterations
(p = 1, T = 1, n = 1...6) after iterated training. Top-Right: The prediction for T = 6 with n = 6,
extracted from the top-left graph. Bottom-Left: The best solution for T = 84 with p = 6 and
n = 14. Bottom-Right: The best single-step solution for T = 84 with p = 1 and n = 84.

The results for the AR-LSTM approach are clearly worse than the results for time-window
approaches, for example with MLPs. Iterated training decreased the performance. But surprisingly,
the relative performance decrease for one-step prediction was much larger than for iterated
prediction. This indicates that the iteration capabilities were improved (taking into consideration
the over-proportionally worsened one-step prediction performance).
The single-step predictions of LSTM are not accurate enough to follow the series for as
many as 84 steps (Figure 7.3). Instead the LSTM network starts oscillating, having adapted to
the strongest eigenfrequency in the task. During self-iterations, the memory cells tune into this
eigen-oscillation (Figure 7.4), with time constants determined by the interaction of cell state
and forget gate. Most solutions are stable during iterated testing, as in the solution shown in
Figure 7.4. Applying a sigmoid squashing function g prevents exponential growth of the cell
states by limiting their self-reinforcement to be linear. Linear self-reinforcement in turn can be
compensated for by the forget gate. Still it is possible for networks to diverge, when the damping
induced by the forget gate is always smaller than the constant reinforcement described above.
This situation might be established via feedback from the cell to the forget gate.
Figure 7.4: Test run with a network solution for the Mackey-Glass time series (p = 1, T = 84,
n = 84). Shown is a "free" iteration of 250 steps starting with all states set to zero. Top: Network
output y and the test-set target t. Middle: Cell states s_c. Bottom: Activations of the gates.
7.4 Laser Data

This data is Set A from the Santa Fe time series prediction competition (Weigend & Gershenfeld,
1993).^1 It consists of one-dimensional data recorded from a far-infrared (FIR) laser in a chaotic
state (Huebner, Abraham, & Weiss, 1989). The training set consists of 1,000 points from the
laser, with the task being to predict the next 100 points (Figure 7.5). The main difficulty is to
predict the collapse of activation in the test set, given only two similar events in the training set.
We run tests for stepwise prediction and fully iterated prediction, where the output is clamped
to the input for 100 steps.

^1 The data is available from http://www.stern.nyu.edu/aweigend/Time-Series/SantaFe.html.
Figure 7.5: FIR-laser data (Set A) from the Santa Fe time series prediction competition (see
text for details). Left: the 1000-point training set. Right: the 100-point test set.
For the experiments with MLPs the setup was as described for the Mackey-Glass data, but
with an input embedding of the last 9 time steps, as in Koskela, Varsta, Heikkonen, and Kaski
(1998).
7.4.1 Previous Work

Results are listed in Table 7.3. Linear prediction is no better than predicting the data mean.
Wan (1994) achieved the best results submitted to the original Santa Fe contest. He used a
Finite Input Response Network (FIRN) (25 inputs and 12 hidden units), a method similar to
a TDNN. Wan improved performance by replacing the last 25 predicted points by smoothed
values (sFIRN).

Koskela, Varsta, Heikkonen, and Kaski (1998) compared recurrent SOMs (RSOMs) and
MLPs (trained with the Levenberg-Marquardt algorithm) with an input embedding of dimension
9 (an input window with the last 9 values). Bakker, Schouten, Giles, Takens, and Bleek (2000)
used a mixture of predictions and true values as input (Error Propagation, EP). Principal
Component Analysis (PCA) was then applied to reduce the dimensionality of the time embedding
for the input from the 40 most recent inputs to 16 principal components. These were fed into
an MLP (with two hidden layers of 32 and 24 units) and trained with BPTT using conjugate
gradients. The value for the iterated prediction was achieved with a mixture of 90% clamped
output and 10% true value (true iteration corresponds to 100% clamped output). The value for
the iterated prediction was achieved without applying EP during training.

Kohlmorgen and Muller (1998) pointed out that the prediction problem could be solved by
pattern matching, if it can be guaranteed that the best match from the past is always the right
one. To resolve ambiguities they propose to up-sample the data using linear extrapolation (as
done by Sauer, 1994).

The best result to date, to our knowledge, was achieved by Weigend and Nix (1994).
They used a nonlinear regression approach in a maximum likelihood framework, realized with a
feed-forward NN (25 inputs and 12 hidden units) using an additional output to estimate the
prediction error. For the iterated prediction, the mean of the values at times 620 to 700 was used
as prediction after the predicted collapse of activity at time step 1072 (this was based on visual
inspection). A similar approach was used by Kostelich (1994), who searched for the best
match to an embedding of 75 steps using a local linear model.

McNames (2000) proposed a statistical method that uses cross-validation error to estimate
the model parameters of local models, but the testing conditions were too different to include
the results in the comparison.
Reference                       Units   Para.    Seq.      NRMSE stepwise   NRMSE iterated
Predict input: x(t+T) = x(t)      -       -        -          0.96836          -
Linear Predictor                  -       -        -          1.25056          -
FIRN (Wan, 1994)                 26     ~170       -          0.0230           0.0551
sFIRN (Wan, 1994)                26     ~170       -          -                0.0273
MLP (Koskela et al., 1998)       70      ~30       -          0.01777          -
RSOM (Koskela et al., 1998)      13       -        -          0.0833           -
EP-MLP (Bakker et al., 2000)     73    >1300       -          -                0.2159
(Sauer, 1994)                     -      32        -          -                0.077
(Weigend & Nix, 1994)            27     ~180       -          0.0198           0.016
(Bontempi et al., 1999)           -       -        -          -                0.029
MLP                              16      177    1×10^4        0.36322          >1
MLP                              32      353    1×10^4        0.0996017        0.856932
MLP                              64      769    1×10^4        0.101023         >1
MLP, IT                          32      353    1×10^4        0.158298         0.621936
LSTM                              4      113    1×10^5        0.395959         1.02102
LSTM, IT                          4      113    1×10^5        0.36422          0.96834

Table 7.3: Results for the FIR-laser task, showing (from left to right) the number of units, the
number of parameters (weights for NNs), the number of training sequence presentations, and
the NRMSE for stepwise and iterated prediction.

Bontempi et al. (1999) used a similar approach, called "Predicted Sum of Squares" (PRESS);
here the dimension of the time embedding was 16.

7.4.2 Results

The results for MLP and LSTM are listed in Table 7.3. They are not as good as the other results
listed there, in part because we did not replace predicted values by hand with a mean value
where we suspected the system of being led astray.
7.4.3 Analysis

The LSTM network could not predict the collapse of emission in the test set (Figure 7.6).
Instead, the network tracks the oscillation of the original series for only about 40 steps before
desynchronizing. This indicates performance similar to that on the Mackey-Glass task: the
LSTM network was able to track the strongest eigenfrequency in the task but was unable to
account for the high-frequency variance. Though the MLP performed better, it generated
inaccurate amplitudes and also desynchronized after about 40 steps. The MLP did, however,
manage to predict the collapse of emission (Figure 7.6).

LSTM's ability to track slow oscillations in a chaotic signal is notable. In simple cases,
synchronization with a periodic signal is easily achieved using mechanisms such as phase-locked
loops (PLLs). But when noisy or complex signals are used, synchronization can be challenging
(McAuley, 1994; Large & Kolen, 1994).
Figure 7.6: Test runs with network solutions after iterated training for the FIR-laser task. Top:
LSTM. Bottom: MLP with 32 hidden units. Left: Single-step prediction. Right: Iteration of
100 steps.

Systems like LSTM that can find periodicity in complicated signals should be applicable
to cognitive domains such as speech and music (Large & Jones, 1999; Eck, 2000a). See also
Eck (2000b) for more on this topic.
Iterated training yielded improved results for iterated prediction, even when stepwise prediction
deteriorated, as in the case of MLP single-step prediction (prediction step size one) for both
the Mackey-Glass task and the FIR task. When multi-step prediction was used (for Mackey-Glass
only), iterated training did not improve system performance.

7.5 Conclusion

A time-window-based MLP outperformed the LSTM pure-AR approach on certain time series
prediction benchmarks solvable by looking at a few recent inputs only. Thus LSTM's special
strength, namely, to learn to remember single events for very long, unknown time periods, was
not necessary here.

LSTM learned to tune into the fundamental oscillation of each series but was unable to
accurately follow the signal. The MLP, on the other hand, was able to capture some aspects
of the chaotic behavior. For example, the system could predict the collapse of emission in the
FIR-laser task.

Iterated training has advantages over single-step training for iterated testing only for MLPs
and when the prediction step size is one. The advantage is evident when the number of necessary
iterations is large.

Our results suggest using LSTM only on tasks where traditional time-window-based
approaches must fail. One reasonable hybrid approach to prediction of unknown time series may
be this: start by training a time-window-based MLP, then freeze its weights and use LSTM only
to reduce the residual error if there is any, employing LSTM's ability to cope with long time
lags between significant events.

LSTM's ability to track slow oscillations in chaotic signals may be applicable to cognitive
domains such as rhythm detection in speech and music.
Chapter 8

Conclusion

This work has concentrated on improving and applying the original LSTM algorithm as
introduced by Hochreiter and Schmidhuber (1997). We proposed to extend LSTM with forget
gates and peephole connections. Extended LSTM is clearly superior to traditional LSTM (and
other RNNs), and can serve as a basis for future applications. Our findings reinforce the perception
that LSTM is a very general and promising adaptive sequence processing device, with a
wider field of potential applications than alternative RNNs. In the following we summarize the
contributions of this thesis and present some thoughts about future work and possible LSTM
applications.

8.1 Main Contributions

Forget gates. While previous work focused on training sequences with well-defined beginnings
and ends, typical real-world input streams are not a priori segmented into training subsequences
indicating network resets. RNNs should therefore be able to learn appropriate self-resets. This
is also desirable for tasks with hierarchical but a priori unknown decompositions. For instance,
re-occurring subtasks should be solved by the same network module, which should be reset once
the subtask is solved. Forget gates naturally permit LSTM to learn local self-resets of memory
contents that have become irrelevant. Forget gates also substantially improve LSTM's performance
on tasks involving arithmetic operations, because they make the LSTM architecture more powerful.

Extending LSTM with peephole connections. We identified a weakness in the wiring
scheme of the multiplicative gates surrounding LSTM's constant error carousels (CECs). As
a remedy, we extend LSTM by introducing peephole connections from the CECs to the gates,
which allow them to inspect the current internal cell states.

Timing. We tested LSTM on a special class of tasks that requires the network to extract
relevant information conveyed by the duration of intervals between events. We showed that
LSTM can solve such highly nonlinear tasks as well, by learning to precisely measure time
intervals, provided we furnish LSTM cells with peephole connections.

Context free and context sensitive languages. We showed that LSTM outperforms other
RNNs on context free language (CFL) benchmarks. Moreover, LSTM is the first RNN to learn
a context sensitive language.

Time series prediction. Time-window-based MLPs outperformed a pure auto-regressive
LSTM approach on certain time series prediction benchmarks solvable by looking at a few
recent inputs only. Thus LSTM's special strength, namely, to learn to remember single events
for very long, unknown time periods, was not necessary for those tasks.
8.2 Future Work and Possible Applications of LSTM

Gain adaptation. In our experiments we either used a constant learning rate (sometimes with
exponential or linear decay within sequences) or applied the rather simple momentum algorithm
(Plaut et al., 1986). More advanced local learning rate adaptation approaches, like decoupled
Kalman filtering (Puskorius & Feldkamp, 1994) or stochastic meta-descent (Schraudolph, 1999,
2000), may improve learning speed and reduce the percentage of networks that diverge.

Hierarchical decomposition, rhythm and timing. LSTM with forget gates holds
promise for any sequential processing task in which we suspect that a hierarchical decomposition
may exist, but do not know in advance what this decomposition is (one example is prosodic
information in speech). We showed that memory blocks equipped with forget gates and peephole
connections are capable of developing into internal oscillators and timers, and that LSTM is able
to track slow oscillations in chaotic signals. This may allow the recognition and generation
of hierarchical rhythmic patterns in music. In particular, the ability to perform precise timing
and measuring makes LSTM a promising approach for real-world tasks whose solution partly
depends on the precise duration of intervals between relevant events.

Growing LSTM networks. It may be useful to grow LSTM networks (e.g., add one
memory block at a time), similar to the cascade-correlation algorithm (Fahlman, 1991), to
decouple blocks when tracking multiple frequencies in a signal. So far only the fundamental
frequency was tracked.

Time series prediction. For the prediction of unknown time series our results suggest using
LSTM in a hybrid approach, as follows: start by training a time-window-based MLP, then
freeze its weights and use LSTM only to reduce the residual error if there is any, employing
LSTM's ability to cope with long time lags between significant events. An example of a task
where a hybrid approach with LSTM might be promising is the prediction of secondary protein
structure from a sequence of amino acids (Brunak, Baldi, Frasconi, Pollastri, & Soda, 1999).
The standard solution involves using a fixed window over the protein sequence, centered on a
specific amino acid. As a protein is folded, acids that are far apart in the series of acids may
be spatially close and have significant interaction. This generates complex, varying long-term
dependencies in the series.
Appendix A

Embedded Reber Grammar Statistics

The minimal length of an embedded Reber grammar (ERG) string is 9; string lengths have no
upper bound. To provide an idea of the string size distribution, Figure A.1 (left) shows a
histogram of ERG strings computed from sampled data. We assume that ERG string probabilities
decrease exponentially with ERG string size (compare the exponential fit on the left-hand side of
Figure A.1), so that the probability p(l) of sampling a string of size l can be written as:

    p(l) = b e^(-a(l-o))  for l >= o = 9,  else p(l) = 0,

with a, b > 0; the offset o = 9 expresses the minimum string length. To compute the probability
P(L) of sampling a string of size l <= L we integrate p(l):

    P(L) = ∫ from o to L of p(l) dl = (b/a) (1 - e^(-a(L-o))) .

Figure A.1: Left: Histogram of 10^6 random samples of ERG string sizes (logarithmic scale,
with exponential fit). Right: Joint probability that an ERG string of a given size occurs and is
the longest among 80000.

Figure A.2: Left: Number of embedded Reber strings N plotted against lower bounds of the
expected maximal string size Λ(N). Right: the same with a logarithmic x-axis.
From the normalization P(∞) = 1 it follows that a = b. Solving P(l~) = 1 - P(l~) with the value
of a extracted from the data (left-hand side of Figure A.1), we find the expected ERG string
size:

    l~ = o - (1/a) ln(1/2) ≈ 11.54 .    (A.1)

Given a set of N ERG strings, what is the expected maximal string length Λ(N)? We derive
a lower bound P_N(Λ) for the probability that a set of N ERG strings contains a string of size
λ >= Λ, assuming a sample of N-1 strings of size λ <= Λ and one of size λ >= Λ (we set N-1 ≈ N):

    P_N(Λ) = N * P(Λ)^N * (1 - P(Λ)) .

Figure A.1 (right) plots P_N for N = 80000. The x-value of the distribution maximum is a lower
bound for Λ(N = 80000). Figure A.2 plots N against the lower bound of Λ(N). Λ(N) grows
logarithmically with N. For the test set used in our experiments (N = 80000)
the expected maximal string length is about 50.
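Numerically, this lower bound can be located by evaluating P_N(Λ) on a grid (a sketch; the decay constant a is read off the histogram fit, e.g. a = ln 2 / 2.54 ≈ 0.27 as implied by Eq. A.1):

    import numpy as np

    def expected_max_length(N, a, o=9.0):
        # argmax of P_N(L) = N * P(L)^N * (1 - P(L)),
        # with P(L) = 1 - exp(-a * (L - o)).
        L = np.arange(o, 200.0, 0.1)
        P = 1.0 - np.exp(-a * (L - o))
        return L[np.argmax(N * P ** N * (1.0 - P))]

    print(expected_max_length(80000, 0.693 / 2.54))   # ~50, as quoted above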
Appendix B

Peephole LSTM with Forget Gates in Pseudo-code

The pseudo-code in this chapter describes the implementation of LSTM with forget gates and
peephole connections as introduced in Chapters 3 and 5. This is the LSTM version that we
currently use and recommend; the C code can be downloaded from http://www.idsia.ch/~felix.

The partial derivatives ∂s/∂w are represented by the variables dS:

    dS_lm^jv := ∂ s_cjv / ∂ w_lm ,

as defined in Chapter 2; j indexes memory blocks and v indexes memory cells in block j; l = cjv
for weights to the cell, l = in_j for weights to the input gate, and l = φ_j for weights to the forget
gate. The variables dS are calculated whether or not a target (and hence an error) is given;
thus their calculation is done in the forward pass, whereas the backward pass is only calculated
at time steps when a target is present.

It is task-specific (see the descriptions in the chapters) when the weight updates are executed:
after each time step, regularly after a fixed number of time steps, after intervals of varying
duration, or at the end of a sequence or epoch.

The momentum algorithm (Plaut et al., 1986), which we used for some of our experiments, is
not incorporated into this pseudo-code.


init network:
    reset CECs: s_cjv = ŝ_cjv = 0; partials: dS = 0; activations: y = ŷ = 0;

forward pass:
    input units: y = current external input;
    roll over: activations: ŷ = y; cell states: ŝ_cjv = s_cjv;
    loop over memory blocks, indexed j {
        Step 1a: input gates (5.1):
            net_inj = Σ_m w_inj,m ŷ_m + Σ_{v=1..Sj} w_inj,cjv ŝ_cjv;  y_inj = f_inj(net_inj);
        Step 1b: forget gates (5.2):
            net_φj = Σ_m w_φj,m ŷ_m + Σ_{v=1..Sj} w_φj,cjv ŝ_cjv;  y_φj = f_φj(net_φj);
        Step 1c: CECs, i.e., the cell states (5.3):
            loop over the Sj cells in block j, indexed v {
                net_cjv = Σ_m w_cjv,m ŷ_m;  s_cjv = y_φj ŝ_cjv + y_inj g(net_cjv); }
        Step 2: output gate activation (5.4):
            net_outj = Σ_m w_outj,m ŷ_m + Σ_{v=1..Sj} w_outj,cjv s_cjv;  y_outj = f_outj(net_outj);
        cell outputs (5.5):
            loop over the Sj cells in block j, indexed v { y_cjv = y_outj s_cjv; }
    } end loop over memory blocks
    output units (2.9): net_k = Σ_m w_k,m y_m;  y_k = f_k(net_k);

partial derivatives:
    loop over memory blocks, indexed j {
        loop over the Sj cells in block j, indexed v {
            cells (5.6), with dS(m, jv) := ∂s_cjv / ∂w_cjv,m:
                dS(m, jv) = dS(m, jv) y_φj + g'(net_cjv) y_inj ŷ_m;
            input gates (5.7), (5.7b), with dS_in(m, jv) := ∂s_cjv / ∂w_inj,m and
            dS_in(cjv', jv) := ∂s_cjv / ∂w_inj,cjv':
                dS_in(m, jv) = dS_in(m, jv) y_φj + g(net_cjv) f'_inj(net_inj) ŷ_m;
                loop over peephole connections from all cells, indexed v' {
                    dS_in(cjv', jv) = dS_in(cjv', jv) y_φj + g(net_cjv) f'_inj(net_inj) ŝ_cjv'; }
            forget gates (5.8), (5.8b), with dS_φ(m, jv) := ∂s_cjv / ∂w_φj,m and
            dS_φ(cjv', jv) := ∂s_cjv / ∂w_φj,cjv':
                dS_φ(m, jv) = dS_φ(m, jv) y_φj + ŝ_cjv f'_φj(net_φj) ŷ_m;
                loop over peephole connections from all cells, indexed v' {
                    dS_φ(cjv', jv) = dS_φ(cjv', jv) y_φj + ŝ_cjv f'_φj(net_φj) ŝ_cjv'; }
    } } end loops over cells and memory blocks

backward pass (if error injected):
    errors and deltas:
        injection error: e_k = t_k - y_k;
        deltas of output units (5.10): δ_k = f'_k(net_k) e_k;
        loop over memory blocks, indexed j {
            deltas of output gates (5.11b):
                δ_outj = f'_outj(net_outj) ( Σ_{v=1..Sj} s_cjv Σ_k w_k,cjv δ_k );
            internal state error (5.15):
                loop over the Sj cells in block j, indexed v {
                    e_s(cjv) = y_outj ( Σ_k w_k,cjv δ_k ); }
        } end loop over memory blocks
    weight updates (learning rate α):
        output units (5.9): Δw_k,m = α δ_k y_m;
        loop over memory blocks, indexed j {
            output gates (5.11a):
                Δw_out,m = α δ_outj ŷ_m;  Δw_out,cjv = α δ_outj s_cjv;
            input gates (5.13):
                Δw_in,m = α Σ_{v=1..Sj} e_s(cjv) dS_in(m, jv);
                loop over peephole connections from all cells, indexed v' {
                    Δw_in,cjv' = α Σ_{v=1..Sj} e_s(cjv) dS_in(cjv', jv); }
            forget gates (5.14):
                Δw_φ,m = α Σ_{v=1..Sj} e_s(cjv) dS_φ(m, jv);
                loop over peephole connections from all cells, indexed v' {
                    Δw_φ,cjv' = α Σ_{v=1..Sj} e_s(cjv) dS_φ(cjv', jv); }
            cells (5.12):
                loop over the Sj cells in block j, indexed v {
                    Δw_cjv,m = α e_s(cjv) dS(m, jv); }
        } end loop over memory blocks
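For readers who prefer running code to pseudo-code, the forward pass above can be transcribed compactly in NumPy. The sketch below is our own vectorization (not the downloadable C code) and handles one time step for a layer of B single-cell memory blocks; the derivative bookkeeping and the backward pass follow the same pattern:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward_step(x, state, W):
        # x: external input; state = (s_prev, y_prev) with the CEC states
        # s_prev and previous cell outputs y_prev (both of shape (B,)).
        # W[name]: input-weight matrices (B, d); W[name + "_peep"]: peephole
        # weights (B,); W[name + "_b"]: biases (B,).
        s_prev, y_prev = state
        u = np.concatenate([x, y_prev])          # external + recurrent input
        y_in  = sigmoid(W["in"]  @ u + W["in_peep"]  * s_prev + W["in_b"])
        y_phi = sigmoid(W["phi"] @ u + W["phi_peep"] * s_prev + W["phi_b"])
        g_in  = W["cell"] @ u + W["cell_b"]      # g = identity, as in Chapter 6
        s     = y_phi * s_prev + y_in * g_in     # CEC update, Step 1c / Eq. (5.3)
        y_out = sigmoid(W["out"] @ u + W["out_peep"] * s + W["out_b"])
        y_c   = y_out * s                        # cell outputs, Eq. (5.5)
        return y_c, (s, y_c)

Note that, exactly as in Steps 1a-2 above, the input and forget gates peek at the old cell state, while the output gate peeks at the freshly updated one.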


Referen es

Bakker, B., & Kleij, G. van der Voort van der. (2000). Trading off perception with internal state: Reinforcement learning and analysis of Q-Elman networks in a Markovian task. In Proceedings of IJCNN 2000. Como, Italy.
Bakker, R., Schouten, J. C., Giles, C. L., Takens, F., & Bleek, C. M. van den. (2000). Learning chaotic attractors by neural networks. Neural Computation, 12(10).
Bengio, Y., & Frasconi, P. (1995). An input output HMM architecture. In Advances in Neural Information Processing Systems 7. San Mateo, CA: Morgan Kaufmann.
Bengio, Y., Frasconi, P., Gori, M., & Soda, G. (1993). Recurrent neural networks for adaptive temporal processing. In Proceedings of the 6th Italian Workshop on Parallel Architectures and Neural Networks WIRN93 (pp. 85-117). Vietri (Italy): World Scientific Pub.
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157-166.
Bersini, H., Birattari, M., & Bontempi, G. (1998). Adaptive memory-based regression methods. In Proceedings of the 1998 IEEE International Joint Conference on Neural Networks (pp. 2102-2106).
Blair, A. D., & Pollack, J. B. (1997). Analysis of dynamical recognizers. Neural Computation, 9(5), 1127-1142.
Bontempi, G., Bersini, H., & Birattari, M. (1999). Local learning for iterated time-series prediction. In Machine Learning: Proceedings of the Sixteenth International Conference (pp. 32-38). San Francisco, USA: Morgan Kaufmann.
Box, G., & Jenkins, G. (1970). Time Series Analysis: Forecasting and Control. San Francisco: Holden-Day.
Brunak, S., Baldi, P., Frasconi, P., Pollastri, G., & Soda, G. (1999). Exploiting the past and the future in protein secondary structure prediction. Bioinformatics, 15(11).
Casey, M. P. (1996). The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation, 8(6), 1135-1178.
Chudy, L., & Farkas, I. (1998). Prediction of chaotic time-series using dynamic cell structures and local linear models. Neural Network World, 8(5), 481-489.
Cleeremans, A., Servan-Schreiber, D., & McClelland, J. L. (1989). Finite-state automata and simple recurrent networks. Neural Computation, 1, 372-381.
Cramer, N. L. (1985). A representation for the adaptive generation of simple sequential programs. In J. Grefenstette (Ed.), Proceedings of an International Conference on Genetic Algorithms and their Applications. Hillsdale, NJ: Lawrence Erlbaum Associates.
Crowder, R. S. (1990). Predicting the Mackey-Glass time series with cascade correlation learning. In D. S. Touretzky (Ed.), Connectionist Models: Proceedings of the 1990 Summer School.
Cummins, F., Gers, F., & Schmidhuber, J. (1999). Language identification from prosody without explicit features. In Proceedings of EUROSPEECH'99 (Vol. 1, pp. 371-374).

Darken, C. (1995). Stochastic approximation and neural network learning. In M. A. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks (pp. 941-944). Cambridge, Massachusetts: MIT Press.
Das, S., Giles, C., & Sun, G. (1992). Learning context-free grammars: Capabilities and limitations of a recurrent neural network with an external stack memory. In Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society (pp. 791-795). San Mateo, CA: Morgan Kaufmann Publishers.
Day, S. P., & Davenport, M. R. (1993). Continuous-time temporal back-propagation with adaptive time delays. IEEE Transactions on Neural Networks, 4, 348-354.
Deco, G., & Schürmann, B. (1994). Neural learning of chaotic system behavior. IEICE Trans. Fundamentals, E77-A, 1840-1845.
Dickmanns, D., Schmidhuber, J., & Winklhofer, A. (1987). Der genetische Algorithmus: Eine Implementierung in Prolog. Fortgeschrittenenpraktikum, Institut für Informatik, Lehrstuhl Prof. Radig, Technische Universität München.
Doya, K., & Yoshizawa, S. (1989). Adaptive neural oscillator using continuous-time back-propagation learning. Neural Networks, 2(5), 375-385.
Eck, D. (2000a). Meter Through Synchrony: Processing Rhythmical Patterns with Relaxation Oscillators. Unpublished doctoral dissertation, Indiana University, Bloomington, IN. (www.idsia.ch/doug/publications.html)
Eck, D. (2000b). Tracking rhythms with a relaxation oscillator (Tech. Rep. No. IDSIA-10-00). www.idsia.ch/techrep.html, Galleria 2, 6928 Manno-Lugano, Switzerland: IDSIA.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179-211.
Kostelich, E. J., & Lathrop, D. P. (1994). The prediction of chaotic time series: a variation on the method of analogues. In A. S. Weigend & N. A. Gershenfeld (Eds.), Time Series Prediction: Forecasting the Future and Understanding the Past (pp. 283-295). Addison-Wesley.
Fahlman, S. E. (1991). The recurrent cascade-correlation learning algorithm. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), NIPS 3 (pp. 190-196). San Mateo, CA: Morgan Kaufmann.
Falco, I. de, Iazzetta, A., Natale, P., & Tarantino, E. (1998). Evolutionary neural networks for nonlinear dynamics modeling. In Parallel Problem Solving from Nature 98 (Vol. 1498, pp. 593-602). Springer.
Forcada, M. L., & Carrasco, R. C. (1995). Learning the initial state of a second-order recurrent neural network during regular-language inference [Letter]. Neural Computation, 7(5), 923-930.
Gers, F. A., Eck, D., & Schmidhuber, J. (2000). Applying LSTM to time series predictable through time-window approaches (Tech. Rep. No. IDSIA-22-00). Manno, CH: IDSIA.
Gers, F. A., Eck, D., & Schmidhuber, J. (2001a). Applying LSTM to time series predictable through time-window approaches. In Proc. ICANN 2001, Int. Conf. on Artificial Neural Networks. Vienna, Austria: IEE, London. (submitted)
Gers, F. A., Eck, D., & Schmidhuber, J. (2001b). Applying LSTM to time series predictable through time-window approaches. In Neural Nets, WIRN Vietri-99, Proceedings 11th Workshop on Neural Nets. Vietri sul Mare, Italy. (submitted)
Gers, F. A., & Schmidhuber, J. (2000a). Neural processing of complex continual input streams. In Proc. IJCNN'2000, Int. Joint Conf. on Neural Networks. Como, Italy.
Gers, F. A., & Schmidhuber, J. (2000b). Neural processing of complex continual input streams (Tech. Rep. No. IDSIA-02-00). Manno, CH: IDSIA.
Gers, F. A., & Schmidhuber, J. (2000c). Recurrent nets that time and count. In Proc. IJCNN'2000, Int. Joint Conf. on Neural Networks. Como, Italy.
Gers, F. A., & Schmidhuber, J. (2000d). Recurrent nets that time and count (Tech. Rep. No. IDSIA-01-00). Manno, CH: IDSIA.
Gers, F. A., & Schmidhuber, J. (2000e). LSTM learns context free languages. In Snowbird 2000 Conference.
Gers, F. A., & Schmidhuber, J. (2000f). Long Short-Term Memory learns context free languages and context sensitive languages (Tech. Rep. No. IDSIA-03-00). Manno, CH: IDSIA.
Gers, F. A., & Schmidhuber, J. (2001a). LSTM recurrent networks learn simple context free and context sensitive languages. IEEE Transactions on Neural Networks. (accepted)
Gers, F. A., & Schmidhuber, J. (2001b). Long Short-Term Memory learns context free and context sensitive languages. In Proceedings of the ICANNGA 2001 Conference. Springer. (accepted)
Gers, F. A., Schmidhuber, J., & Cummins, F. (1999a). Continual prediction using LSTM with forget gates. In M. Marinaro & R. Tagliaferri (Eds.), Neural Nets, WIRN Vietri-99, Proceedings 11th Workshop on Neural Nets (pp. 133-138). Vietri sul Mare, Italy: Springer Verlag, Berlin.
Gers, F. A., Schmidhuber, J., & Cummins, F. (1999b). Learning to forget: Continual prediction with LSTM. In Proc. ICANN'99, Int. Conf. on Artificial Neural Networks (Vol. 2, pp. 850-855). Edinburgh, Scotland: IEE, London.
Gers, F. A., Schmidhuber, J., & Cummins, F. (1999c). Learning to forget: Continual prediction with LSTM (Tech. Rep. No. IDSIA-01-99). Lugano, CH: IDSIA.
Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10), 2451-2471.
Gers, F. A., Schmidhuber, J., & Schraudolph, N. Learning precise timing with LSTM recurrent networks. (submitted to Neural Computation)
Haffner, P., & Waibel, A. (1992). Multi-state time delay networks for continuous speech recognition. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in Neural Information Processing Systems (Vol. 4, pp. 135-142). Morgan Kaufmann Publishers, Inc.
Hinton, G. E., Sejnowski, T. J., & Ackley, D. H. (1984). Boltzmann Machines: Constraint satisfaction networks that learn (Tech. Rep. No. CMU-CS-84-119). Carnegie Mellon University.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München. (See www7.informatik.tu-muenchen.de/~hochreit)
Hochreiter, S., & Schmidhuber, J. (1995). Long short-term memory can solve hard long time lag problems. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in Neural Information Processing Systems 7 (NIPS '94). Cambridge, MA: MIT Press.
Hochreiter, S., & Schmidhuber, J. (1996). Bridging long time lags by weight guessing and "Long Short-Term Memory". In F. L. Silva, J. C. Principe, & L. B. Almeida (Eds.), Spatiotemporal Models in Biological and Artificial Systems (pp. 65-72). IOS Press, Amsterdam, Netherlands. (Series: Frontiers in Artificial Intelligence and Applications, Volume 37)
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the USA, 79, 2554-2558.
Huebner, U., Abraham, N. B., & Weiss, C. O. (1989). Dimensions and entropies of chaotic intensity pulsations in a single-mode far-infrared NH3 laser. Phys. Rev. A, 40, 6354.
Jordan, M. I. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Annual Cognitive Science Society Conference. Hillsdale, NJ: Erlbaum.

Kalinke, Y., & Lehmann, H. (1998). Computation in recurrent neural networks: From counters to iterated function systems. In G. Antoniou & J. Slaney (Eds.), Advanced Topics in Artificial Intelligence, Proceedings of the 11th Australian Joint Conference on Artificial Intelligence (Vol. 1502). Berlin, Heidelberg: Springer.
Kohlmorgen, J., & Müller, K.-R. (1998). Data set A is a pattern matching problem. Neural Processing Letters, 7(1), 43-47.
Koskela, T., Varsta, M., Heikkonen, J., & Kaski, K. (1998). Recurrent SOM with local linear models in time series prediction. In 6th European Symposium on Artificial Neural Networks, ESANN'98, Proceedings (pp. 167-172). Brussels, Belgium: D-Facto.
Koza, J. R. (1992). Genetic Programming. Cambridge, MA: MIT Press.
Lapedes, A., & Farber, R. (1987). Nonlinear signal processing using neural networks: Prediction and signal modeling (Tech. Rep. No. LA-UR-87-2662). Los Alamos, New Mexico: Los Alamos National Laboratory.
Large, E. W., & Jones, M. R. (1999). The dynamics of attending: How people track time-varying events. Psychological Review, 106(1), 119-159.
Large, E. W., & Kolen, J. F. (1994). Resonance and the perception of musical meter. Connection Science, 6, 177-208.
LeCun, Y., Bottou, L., Orr, G., & Müller, K.-R. (1998). Efficient backprop. In G. B. Orr & K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade (Vol. 1524, pp. 5-50). Berlin: Springer Verlag.
Lee, L. (1996). Learning of context-free languages: A survey of the literature (Tech. Rep. No. TR-12-96). Center for Research in Computing Technology, Harvard University, Cambridge, Massachusetts.
Lin, T., Horne, B. G., Tiño, P., & Giles, C. L. (1996). Learning long-term dependencies in NARX recurrent neural networks [Paper]. IEEE Transactions on Neural Networks, 7(6), 1329-1338.
Mackey, M., & Glass, L. (1977). Oscillation and chaos in a physiological control system. Science, 197(287).
Martinez, T. M., Berkovich, S. G., & Schulten, K. J. (1993). Neural-gas network for vector quantization and its application to time-series prediction [Paper]. IEEE Transactions on Neural Networks, 4(4), 558-569.
McAuley, J. (1994). Finding metrical structure in time. In M. Mozer, P. Smolensky, D. Touretsky, J. Elman, & A. S. Weigend (Eds.), Proceedings of the 1993 Connectionist Models Summer School (pp. 219-227). Hillsdale, NJ: Erlbaum.
McNames, J. (2000). Local modeling optimization for time series prediction. In Proceedings of the 8th European Symposium on Artificial Neural Networks (pp. 305-310). Bruges, Belgium.
Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63, 81-97.
Mitchell, T. M., Keller, R. M., & Kedar-Cabelli, S. T. (1986). Explanation-based generalization: A unifying view. Machine Learning, 1, 47-80.
Mozer, M. C. (1989). A focused backpropagation algorithm for temporal pattern processing. Complex Systems, 3, 349-381.
Mozer, M. C. (1992). Induction of multiscale temporal structure. In D. S. Lippman, J. E. Moody, & D. S. Touretzky (Eds.), Advances in Neural Information Processing Systems 4 (pp. 275-282). San Mateo, CA: Morgan Kaufmann.
Mozer, M. C. (1993). Neural net architectures for temporal sequence processing. In A. S. Weigend & N. A. Gershenfeld (Eds.), Time Series Prediction: Forecasting the Future and Understanding the Past (Vol. 15, pp. 243-264). Reading, MA: Addison-Wesley.
Osborne, M., & Briscoe, E. (1997). Learning stochastic categorial grammars. In Proceedings of the Assoc. for Comp. Linguistics, Comp. Nat. Lg. Learning (CoNLL97) Workshop (pp. 80-87). Madrid. (http://citeseer.nj.nec.com/osborne97learning.html)
Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6(5), 1212-1228.
Platt, J. (1991). A resource-allocating network for function interpolation. Neural Computation, 3, 213-225.
Plaut, D. C., Nowlan, S. J., & Hinton, G. E. (1986). Experiments on learning by back propagation (Tech. Rep. No. CMU-CS-86-126). Pittsburgh, PA: Carnegie-Mellon University.
Porter, B. W., Bareiss, R., & Holte, R. C. (1990). Concept learning and heuristic classification in weak-theory domains. Artificial Intelligence, 45(1-2), 229-263.
Principe, J. C., & Kuo, J.-M. (1995). Dynamic modelling of chaotic time series with neural networks. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in Neural Information Processing Systems (Vol. 7, pp. 311-318). The MIT Press.
Principe, J. C., Rathie, A., & Kuo, J. M. (1992). Prediction of chaotic time series with neural networks and the issue of dynamic modeling. Int. J. of Bifurcation and Chaos, 2(4), 989-996.
Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2), 279-297.
Quinlan, J. (1992). Programs for machine learning. Morgan Kaufmann.
Bone, R., Crucianu, M., Verley, G., & Asselin de Beauville, J.-P. (2000). A bounded exploration approach to constructive algorithms for recurrent neural networks. In Proceedings of IJCNN 2000. Como, Italy.
Ring, M. B. (1994). Continual learning in reinforcement environments. Unpublished doctoral dissertation, University of Texas at Austin, Austin, Texas 78712.
Robinson, A. J., & Fallside, F. (1987). The utility driven dynamic error propagation network (Tech. Rep. No. CUED/F-INFENG/TR.1). Cambridge University Engineering Department.
Rodriguez, P., & Wiles, J. (1998). Recurrent neural networks can learn to implement symbol-sensitive counting. In Advances in Neural Information Processing Systems (Vol. 10, pp. 87-93). The MIT Press.
Rodriguez, P., Wiles, J., & Elman, J. (1999). A recurrent neural network that learns to count. Connection Science, 11(1), 5-40.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition (Vol. 1, pp. 318-362). Cambridge, MA: MIT Press.
Sakakibara, Y. (1997). Recent advances of grammatical inference. Theoretical Computer Science, 185(1), 15-45.
Salustowicz, R. P., & Schmidhuber, J. (1997). Probabilistic incremental program evolution: Stochastic search through program space. In M. van Someren & G. Widmer (Eds.), Machine Learning: ECML-97, Lecture Notes in Artificial Intelligence 1224 (pp. 213-220). Springer-Verlag Berlin Heidelberg.
Sauer, T. (1994). Time series prediction using delay coordinate embedding. In A. S. Weigend & N. A. Gershenfeld (Eds.), Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley.
Schmidhuber, J. (1989). The Neural Bucket Brigade, a local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1(4), 403-412.

Schmidhuber, J. (1992a). A fixed size storage O(n^3) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2), 243-248.
Schmidhuber, J. (1992b). Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2), 234-242.
Schmidhuber, J., & Hochreiter, S. (1996). Guessing can outperform many long time lag algorithms (Tech. Rep. No. IDSIA-19-96). IDSIA.
Schraudolph, N. (1999). Local gain adaptation in stochastic gradient descent. In Proceedings of the 9th International Conference on Artificial Neural Networks. London: IEE.
Schraudolph, N. N. (2000). Fast second-order gradient descent via O(n) curvature matrix-vector products (Tech. Rep. No. IDSIA-12-00). Galleria 2, CH-6928 Manno, Switzerland: Istituto Dalle Molle di Studi sull'Intelligenza Artificiale. (Submitted to Neural Computation)
Siegelmann, H. (1992). Theoretical foundations of recurrent neural networks. Unpublished doctoral dissertation, Rutgers, The State University of New Jersey, New Brunswick.
Siegelmann, H. T., & Sontag, E. D. (1991). Turing computability with neural nets. Applied Mathematics Letters, 4(6), 77-80.
Smith, A. W., & Zipser, D. (1989). Learning sequential structures with the real-time recurrent learning algorithm. International Journal of Neural Systems, 1(2), 125-131.
Steijvers, M., & Grünwald, P. (1996). A recurrent network that performs a context-sensitive prediction task. In Proceedings of the 18th Annual Conference of the Cognitive Science Society. Erlbaum.
Sun, G., Chen, H., & Lee, Y. (1993). Time warping invariant neural networks. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in Neural Information Processing Systems 5 (pp. 180-187). San Mateo, CA: Morgan Kaufmann.
Sun, G. Z., Giles, C. L., Chen, H. H., & Lee, Y. C. (1993). The neural network pushdown automaton: Model, stack and learning simulations (Technical Report No. CS-TR-3118). University of Maryland, College Park.
Tonkes, B., & Wiles, J. (1997). Learning a context-free task with a recurrent neural network: An analysis of stability. In Proceedings of the Fourth Biennial Conference of the Australasian Cognitive Science Society.
Townley, S., Ilchmann, A., Weiss, M. G., McClements, W., Ruiz, A. C., Owens, D., & Praetzel-Wolters, D. (1999). Existence and learning of oscillations in recurrent neural networks (Tech. Rep. No. AGTM 202). Kaiserslautern, Germany: Universität Kaiserslautern, Fachbereich Mathematik.
Tsoi, A. C., & Back, A. D. (1994). Locally recurrent globally feedforward networks: A critical review of architectures. IEEE Transactions on Neural Networks, 5(2), 229-239.
Tsung, F. S., & Cottrell, G. W. (1989). A sequential adder using recurrent networks. In Proceedings of the First International Joint Conference on Neural Networks, Washington, DC. San Diego: IEEE TAB Neural Network Committee.
Tsung, F.-S., & Cottrell, G. W. (1995). Phase-space learning. In Advances in Neural Information Processing Systems (Vol. 7, pp. 481-488). The MIT Press.
Vesanto, J. (1997). Using the SOM and local models in time-series prediction. In Proceedings of WSOM'97, Workshop on Self-Organizing Maps, Espoo, Finland, June 4-6 (pp. 209-214). Espoo, Finland: Helsinki University of Technology, Neural Networks Research Centre.
Vijay-Shanker, K. (1992). Using descriptions of trees in a tree adjoining grammar. Computational Linguistics, 18(4), 481-517.
Waibel, A. (1989). Modular construction of time-delay neural networks for speech recognition [Letter]. Neural Computation, 1(1), 39-46.
Wan, E. A. (1994). Time series prediction by using a connectionist network with internal time delays. In A. S. Weigend & N. A. Gershenfeld (Eds.), Time Series Prediction: Forecasting the Future and Understanding the Past (pp. 195-217). Addison-Wesley.
Weigend, A., & Gershenfeld, N. (1993). Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley.
Weigend, A. S., & Nix, D. A. (1994). Predictions with confidence intervals (local error bars). In Proceedings of the International Conference on Neural Information Processing (ICONIP'94) (pp. 847-852). Seoul, Korea.
Weiss, M. G. (1999). Learning oscillations using adaptive control (Tech. Rep. No. AGTM 178). Kaiserslautern, Germany: Universität Kaiserslautern, Fachbereich Mathematik.
Werbos, P. J. (1988). Generalisation of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 339-356.
Wiles, J., & Elman, J. (1995). Learning to count without a counter: A case study of dynamics and activation landscapes in recurrent networks. In Proceedings of the Seventeenth Annual Conference of the Cognitive Science Society (pp. 482-487). Cambridge, MA: MIT Press.
Williams, R. J., & Peng, J. (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories [Letter]. Neural Computation, 2(4), 490-501.
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent networks. Neural Computation, 1(2), 270-280.
Williams, R. J., & Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin & D. E. Rumelhart (Eds.), Back-propagation: Theory, Architectures and Applications (pp. 433-486). Hillsdale, NJ: Erlbaum.
Yao, X., & Liu, Y. (1997). A new evolutionary system for evolving artificial neural networks. IEEE Transactions on Neural Networks, 8(3), 694-713.
Zelle, J., & Mooney, R. (1993). Learning semantic grammars with constructive inductive logic programming. In Proceedings of the 11th National Conference on Artificial Intelligence, AAAI (pp. 817-822). MIT Press.
Zeng, Z., Goodman, R., & Smyth, P. (1994). Discrete recurrent neural networks for grammatical inference. IEEE Transactions on Neural Networks, 5(2).

Personal Record
Lugano, January 30, 2001

Name: Felix Alexander Gers
Date of birth: 15.11.1970
Place of birth: Freiburg/Br. (Germany)
Nationality: German
Marital status: single
Parents: Dietmar Gers; Erika Gers, born Marschner

Education:
1976 - 1980   Primary school
1980 - 1982   Orientation school
1982 - 1989   High school (grammar school), Bismarckschule Hannover; qualification for admission to a university (Abitur)
1989 - 1995   Study of physics at the University of Hannover
1991          Intermediate examination
1995          Master degree (Diplom) in physics at the University of Hannover

Work:
1996 - 1997   Advanced Telecommunication Research Center (ATR, Kyoto, Japan), Human Information Processing Laboratories, Evolutionary Systems Department
1997          Laser Zentrum Hannover (LZH), Germany, Optical Measurement Techniques Group
1997 -        Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA, Lugano, Switzerland), Neural Network Group
Publications

Gers, F. A., Eck, D., & Schmidhuber, J. Applying LSTM to time series predictable through time-window approaches. In Neural Nets, WIRN Vietri-99, Proceedings 11th Workshop on Neural Nets.
Cummins, F., Gers, F., & Schmidhuber, J. (1999). Language identification from prosody without explicit features. In Proceedings of EUROSPEECH'99 (Vol. 1, pp. 371-374).
Cummins, F., Gers, F. A., & Schmidhuber, J. (1999). Automatic discrimination among languages based on prosody alone (Tech. Rep. No. IDSIA-03-99). Lugano, CH: IDSIA.
De Garis, H., Gers, F. A., Korkin, M., Agah, A., & Nawa, N. E. (1998). Building an artificial brain using an FPGA based 'CAM-brain machine'. Artificial Life and Robotics Journal, 2, 56-61.
Gers, F. A., & Czarske, J. W. (1995). Untersuchungen zur verteilten Temperatur-Sensorik mit stimulierter Brillouin-Streuung. In Laser'95 Conference Proceedings C P22.
Gers, F. A., & De Garis, H. (1996a). Porting a cellular automata based artificial brain to MIT's cellular automata machine "CAM-8". In Int. Conf. on Simulated Evolution and Learning (SEAL) S7-3, Taejon, Korea.
Gers, F. A., & De Garis, H. (1996b). CAM-brain: A new model for ATR's cellular automata based artificial brain project. In Int. Conf. on Evolvable Systems Conference Proceedings (ICES) S7-5, Tsukuba, Japan.
Gers, F. A., & De Garis, H. (1997). Codi-1bit: A simplified cellular automata based neuron model. In Artificial Evolution Conference (AE), Nimes, France.
Gers, F. A., De Garis, H., & Korkin, M. (1997a). Evolution of neural structures based on cellular automata. In C. J. Lakhmi (Ed.), Soft Computing Techniques in Knowledge-Based Intelligent Engineering Systems (pp. 259-278). Heidelberg, New York: Physica-Verlag.
Gers, F. A., De Garis, H., & Korkin, M. (1997b). A simplified cellular automata based neuron model. In J. Hao, E. Lutton, E. Ronald, M. Schoennauer, & D. Snyers (Eds.), Artificial Evolution (pp. 315-334). Springer Verlag.
Gers, F. A., De Garis, H., & Korkin, M. (1998). Codi-1bit: A cellular automata based neural net model simple enough to be implemented in evolvable hardware. In Int. Symposium on Artificial Life and Robotics (AROB), Beppu, Oita, Japan.
Gers, F. A., Eck, D., & Schmidhuber, J. (2000). Applying LSTM to time series predictable through time-window approaches (Tech. Rep. No. IDSIA-22-00). Manno, CH: IDSIA.
Gers, F. A., Eck, D., & Schmidhuber, J. (2001). Applying LSTM to time series predictable through time-window approaches. In Proc. ICANN 2001, Int. Conf. on Artificial Neural Networks. Vienna, Austria: IEE, London. (submitted)
Gers, F. A., & Schmidhuber, J. Long short-term memory learns context free and context sensitive languages. In ICANNGA 2001 Conference. (accepted)

Gers, F. A., & Schmidhuber, J. (2000a). LSTM learns context free languages. In Snowbird 2000 Conference.
Gers, F. A., & Schmidhuber, J. (2000b). Long short-term memory learns context free languages and context sensitive languages (Tech. Rep. No. IDSIA-03-00). Manno, CH: IDSIA.
Gers, F. A., & Schmidhuber, J. (2000c). Neural processing of complex continual input streams. In Proc. IJCNN'2000, Int. Joint Conf. on Neural Networks. Como, Italy.
Gers, F. A., & Schmidhuber, J. (2000d). Neural processing of complex continual input streams (Tech. Rep. No. IDSIA-02-00). Manno, CH: IDSIA.
Gers, F. A., & Schmidhuber, J. (2000e). Recurrent nets that time and count. In Proc. IJCNN'2000, Int. Joint Conf. on Neural Networks. Como, Italy.
Gers, F. A., & Schmidhuber, J. (2000f). Recurrent nets that time and count (Tech. Rep. No. IDSIA-01-00). Manno, CH: IDSIA.
Gers, F. A., & Schmidhuber, J. (2001). Long short-term memory learns simple context free and context sensitive languages. IEEE Transactions on Neural Networks. (accepted)
Gers, F. A., Schmidhuber, J., & Cummins, F. (1999a). Continual prediction using LSTM with forget gates. In M. Marinaro & R. Tagliaferri (Eds.), Neural Nets, WIRN Vietri-99, Proceedings 11th Workshop on Neural Nets (pp. 133-138). Vietri sul Mare, Italy: Springer Verlag, Berlin.
Gers, F. A., Schmidhuber, J., & Cummins, F. (1999b). Learning to forget: Continual prediction with LSTM. In Proc. ICANN'99, Int. Conf. on Artificial Neural Networks (Vol. 2, pp. 850-855). Edinburgh, Scotland: IEE, London.
Gers, F. A., Schmidhuber, J., & Cummins, F. (1999c). Learning to forget: Continual prediction with LSTM (Tech. Rep. No. IDSIA-01-99). Lugano, CH: IDSIA.
Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10), 2451-2471.
Gers, F. A., Schmidhuber, J., & Schraudolph, N. Learning precise timing with LSTM recurrent networks. (submitted to Neural Computation)
Hough, M., De Garis, H., Korkin, M., Gers, F. A., & Nawa, N. E. (1999). Spiker: Analog waveform to digital spiketrain conversion in ATR's artificial brain "CAM-brain" project. In Int. Conf. on Robotics and Artificial Life, Beppu, Japan.
Korkin, M., De Garis, H., Gers, F., & Hemmi, H. (1997). CBM (CAM-brain machine): A hardware tool which evolves a neural net module in a fraction of a second and runs a million neuron artificial brain in real time. In Genetic Programming Conference, Stanford, USA.
Nawa, N. E., De Garis, H., Gers, F. A., & Korkin, M. (1998). ATR's CAM-brain machine (CBM) simulation results and representation issues. In Genetic Programming Conference.

Acknowledgments

I am grateful to everybody who helped me to start, do, and finish this thesis.
Special thanks go to my parents, who always supported me in everything I wanted to do.
This thesis was only possible because Jürgen Schmidhuber set up the LSTM project at IDSIA, including the position that I took in the last years. During all my time at IDSIA, Jürgen always left me the freedom to follow my own ideas. He was excellent as a scientific reference point and as a critic to test my ideas against. I greatly appreciated working with him.
I want to thank Wulfram Gerstner for accepting the supervision of this thesis. His critical feedback was always very helpful for my work.
I always enjoyed exchanging ideas with Doug and Fred, who worked with me on the LSTM project.
My deep thanks go to Mara for all the things she did for me during my time in Lugano. Rafal accompanied me, working on his thesis, from the day of my interview until now, and I hope we can also celebrate "the days after" (for both of us) together. Nic was always there for any scientific discussion, down to painful details. Everything would have been much more difficult without Ivo driving me through the last part of the thesis while I could not walk. Marco "Zaca" helped to transform my Italian into Italian.
Thanks to everybody at IDSIA and Old-IDSIA for creating a great atmosphere for working with lots of fun.
The persons I mentioned, but also many others outside and around IDSIA, did much more for me than I can write here. They know.
