Finding Temporal Structure in Music: Blues Improvisation With LSTM Recurrent Networks
INTRODUCTION
Music is among the most widely consumed types of signal streams. For
this reason alone, signal processing techniques for finding, extracting and
reproducing musical structure are of considerable interest. In particular,
machine learning techniques for composing (good) music might have not only
academic but also commercial potential.
Weights to output units are trained with truncated BPTT, while weights to cells, input gates and forget gates use truncated
RTRL. LSTM performance is improved in online learning situations by using
a Kalman filter to control weight updates [14].
Data Representation: We avoid psychologically realistic distributed
encodings and instead represent the data in a simple local form (similar to
[19]). We use one input/target unit per note, with 1.0 representing on and
0.0 representing off. (In later experiments we used the common technique
of adjusting input units to have a mean of 0 and a standard deviation of
1.0.) Unlike CONCERT, this representation builds in no bias towards chromatically or harmonically related notes; any such relationships must be learned by the network. Despite this, we prefer a localist representation for several reasons. First, it is implicitly multi-voice and makes no artificial distinction between chords and melodies. (In fact, we implement chords by simply turning on the appropriate notes in the input vector.) Second, it is easy to generate probability distributions over the set of possible notes, with the flexibility to treat single-note probabilities as either independent of or dependent on one another.
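A minimal sketch of this localist encoding in Python (the unit count and the particular note indices are illustrative assumptions, not the exact layout of Figure 1):

```python
# One input/target unit per note: 1.0 = note on, 0.0 = note off.
NUM_UNITS = 25  # illustrative; the experiments use 12 chord + 13 melody units

def encode_slice(active_notes):
    """Encode one time slice. `active_notes` is a set of unit indices;
    a chord is simply several indices turned on at once, so chords and
    melodies share one and the same representation."""
    return [1.0 if i in active_notes else 0.0 for i in range(NUM_UNITS)]

chord_slice = encode_slice({0, 4, 7})  # a triad: three units on together
melody_slice = encode_slice({14})      # a single melody note
```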
The representation of time is straightforward, with one input vector representing a slice of real time. The stepsize of quantization can of course vary;
if the quantization is set at the eighth note level (as it is for all experiments
in this study) then eight network time steps are required to process a whole
note. This method is preferable for LSTM because it forces the network to
learn the relative durations of notes, making it easier for the counting and
timing mechanisms to work [6].
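To make the time representation concrete, the sketch below unrolls note durations into one input vector per network time step (the (note, duration) input format and the function name are assumptions for illustration):

```python
STEPS_PER_WHOLE_NOTE = 8  # eighth-note quantization, as in all experiments here

def unroll(notes, num_units):
    """Expand (note_index, duration_in_whole_notes) pairs into one
    vector per network time step; a held note simply repeats."""
    slices = []
    for note, duration in notes:
        for _ in range(int(duration * STEPS_PER_WHOLE_NOTE)):
            vec = [0.0] * num_units
            vec[note] = 1.0
            slices.append(vec)
    return slices

# A whole note occupies 8 time steps; a quarter note occupies 2:
seq = unroll([(5, 1.0), (7, 0.25)], num_units=13)
assert len(seq) == 10
```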
This representation ignores two issues. First, there
is no explicit way to determine when a note ends. This means that eight
eighth notes of the same pitch are represented exactly the same way as,
say, four quarter notes of the same pitch. One way to implement this without changing input and target data structures is to decrease the stepsize of
quantization and always mark note endings with a zero. With this method,
a quantization level of sixteen steps per whole note would generate unique
codes for eight eighth notes and four quarter notes of the same pitch. A
second method is to have special unit(s) in the network to indicate the beginning of a note. This method was employed by Todd [19] and is certainly
viable. However, it is not clear how such a method would scale to data sets
with multi-voice melodies.
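The first scheme can be illustrated with a short sketch (the input format and names are illustrative assumptions): halving the step size to sixteen steps per whole note and zeroing each note's final step makes repeated pitches distinguishable.

```python
STEPS_PER_WHOLE_NOTE = 16  # finer quantization leaves room for ending markers

def encode_with_endings(durations):
    """Encode notes of a single repeated pitch, given durations in whole
    notes; the last step of every note is zeroed to mark its ending."""
    track = []
    for duration in durations:
        steps = int(duration * STEPS_PER_WHOLE_NOTE)
        track.extend([1.0] * (steps - 1) + [0.0])
    return track

eighths = encode_with_endings([1 / 8] * 8)   # 1,0,1,0,... (16 steps)
quarters = encode_with_endings([1 / 4] * 4)  # 1,1,1,0,... (16 steps)
assert len(eighths) == len(quarters) == 16 and eighths != quarters
```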
In simulations for this study, a range of 12 notes was possible for chords and 13 notes for melodies (Figure 1). Though we divided chord
notes from melody notes for these simulations, this division is artificial:
Chord notes are represented no differently than melody notes and in future
experiments we intend to blend the two in a more realistic manner.
Training Data: For the experiments in this study, a form of 12-bar
blues popular among bebop jazz musicians is used. With a quantization stepsize of 8 time steps per bar, this yields a single song length of 96 network time steps. The chords used did not vary from song to song and are shown in Figure 2. Chord inversions were chosen so that the chords would fit into
the allocated range of notes.
[Figure 2: The chords used in training; the progression draws on C, F7, Fdim, G7, Gm, Em, A7 and Dm chords, notated as melody and chord staves.]
For Experiment 1, only these chords were presented. For Experiment 2, a single melody line was presented along with
the chords. The melody line was built using the pentatonic scale (Figure 3)
commonly used in this style of music. Training melodies were constructed from the notes of this scale.
LSTM Architecture: Four cell blocks containing 2 cells each are fully connected to each other and to the
input layer. The output layer is fully connected to all cells and to the input
layer. Forget gate, input gate and output gate biases for the four blocks are
set at -0.5, -1.0, -1.5 and -2.0. This allows the blocks to come online one by
one. Output biases were set at 0.5. The learning rate was set at .00001 and the momentum rate [15] at .9. Weights were burned (i.e., updated) after every timestep.
Experiments showed that learning was faster if the network was reset after
making one (or a small number) of gross errors. Resetting went as follows: on
error, burn existing weights, reset the input pattern and clear partial derivatives, activations and cell states. Gers et al. [6] use a similar strategy. The
squashing function at the output layer was the logistic sigmoid with range
[0,1].
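For reference, the reported settings can be collected into a plain configuration object, with a sketch of the reset-on-error protocol (the dataclass, the network interface and its method names are assumptions; only the numeric values come from the text above):

```python
from dataclasses import dataclass

@dataclass
class Config:
    num_blocks: int = 4                            # four blocks of 2 cells each
    cells_per_block: int = 2
    gate_biases: tuple = (-0.5, -1.0, -1.5, -2.0)  # staggered so the blocks
                                                   # come online one by one
    output_bias: float = 0.5
    learning_rate: float = 1e-5
    momentum: float = 0.9

def gross_error(prediction, targets, threshold=0.5):
    """A prediction counts as grossly wrong if any unit lands on the
    wrong side of the decision threshold (an illustrative criterion)."""
    return any((p > threshold) != (t > threshold)
               for p, t in zip(prediction, targets))

def run_epoch(net, song):
    """Sketch of the reset strategy: weights are burned every time step;
    on a gross error, clear partial derivatives, activations and cell
    states, then restart from the first pattern. `net` is hypothetical."""
    for inputs, targets in song:
        prediction = net.step(inputs)
        net.burn_weights()
        if gross_error(prediction, targets):
            net.clear_state()
            break  # reset the input pattern to the start of the song
```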
Training and Testing: The goal was to predict at the output the probability for a given note to be on or off. For predicting probabilities root mean
squared error (RMSE) is not appropriate. Instead the network was trained
using cross-entropy as the objective function. The error function $E_i$ for output activation $y_i$ and target value $t_i$ is $E_i = -t_i \ln(y_i) - (1 - t_i)\ln(1 - y_i)$. With a logistic output unit, this yields an error term at the output layer of $(t_i - y_i)$. See, e.g., [11] for details.
By using a series of binomial formulations rather than a single multinomial
formulation (softmax) we treat outputs as statistically independent of one
another. Though this assumption is untrue, it allows the network to predict
chords and melodies in parallel and also allows for multi-voice melodies. The
network was tested by starting it with the inputs from the first timestep and
then using network predictions for ensuing time steps. Chord notes were
predicted using a decision threshold of 0.5. Training was stopped after the
network successfully predicted the entire chord sequence.
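A small sketch of the objective and of the free-running test mode just described (the one-step `net.forward` interface is a hypothetical stand-in):

```python
import math

def cross_entropy(t, y, eps=1e-12):
    """Per-unit error E_i = -t ln(y) - (1 - t) ln(1 - y). With a logistic
    output unit, the resulting output-layer error term is (t - y)."""
    y = min(max(y, eps), 1.0 - eps)  # guard against log(0)
    return -t * math.log(y) - (1.0 - t) * math.log(1.0 - y)

def generate(net, first_input, num_steps, threshold=0.5):
    """Seed the network with the inputs from the first time step, then
    feed thresholded predictions back in for ensuing time steps."""
    inputs, song = first_input, []
    for _ in range(num_steps):
        probs = net.forward(inputs)
        notes = [1.0 if p > threshold else 0.0 for p in probs]
        song.append(notes)
        inputs = notes  # network predictions drive the next step
    return song
```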
Results: LSTM easily handled this task under a wide range of learning
rates and momentum rates. Once a network could successfully generate one
full cycle through the chord sequence, it could generate any number of continuing cycles, so there was no reason to continue training. As it is already well documented that LSTM excels at timing and counting tasks [6], success at this task is not surprising. Fast convergence was not a goal of this study, and learning times were not carefully collected. Informal timing tests show learning times on the order of 15 to 45 minutes of processor time on a 1 GHz Pentium, depending on parameter settings and initial random weights.
quite pleasant. One jazz musician (admittedly not a particularly good one, and also the first author) is struck by how much the compositions sound like real bebop jazz improvisation over this same chord structure. In particular, the network's tendency to intermix snippets of known melodies with less-constrained passages is in keeping with this style.
DISCUSSION
These experiments were successful: LSTM induced both global structure and
local structure from a corpus of musical training data, and used that information to compose in the same form. This answers Mozer's [13] key criticism
of RNN music composition, namely that an RNN is unable to compose music
having global coherence. To our knowledge the model presented in this paper
is the first to accomplish this. That said, several parts of the experimental
setup made the task easier for the model. More research is required to know
whether the LSTM model can deal with more challenging composition tasks.
Training Data: There was no variety in the underlying chord structure.
For this reason it is perhaps better to talk about network performance as improvisation over a predefined (albeit learned) form rather than composition.
This lack of variation made it easier for LSTM to generate appropriately-timed chord changes. Furthermore, the quantization stepsize for these experiments was rather coarse, at 8 time steps per whole note. As LSTM is known to
excel at datasets with long time lags, this does not pose a serious problem.
However it remains to be seen how much more difficult the task will be at,
say, 32 time steps per whole note, a stepsize which would allow two sixteenth
notes to be disambiguated from a single eighth note.
Network Architecture: The network connections were divided between chords and melody, with chords influencing melody but not vice-versa.
We believe this choice makes sense: in real music improvisation the person
playing melody (the soloist) is for the most part following the chord structure
supplied by the rhythm section. However this architectural choice presumes
that we know ahead of time how to segment chords from melodies. When
working with jazz sheet music, chord changes are almost always provided separately from melodies and so this does not pose a great problem. Classical
music compositions on the other hand make no such explicit division. Furthermore in an audio signal (as opposed to sheet music) chords and melodies
are mixed together.
These are preliminary experiments, and much more research is warranted.
A comparison with BPTT and RTRL (and other candidate RNNs) would
help verify the claim that LSTM performance is better. A more interesting
training set would allow for more interesting compositions. Finally, recent
evidence suggests [5] that LSTM works better in similar situations using a
Kalman filter to control weight updates. This should be explored. Lastly, the current architecture is limited to working with symbolic representations (i.e., note-level data rather than raw audio signals).
CONCLUSION
A music composition model based on LSTM successfully learned the global
structure of a musical form, and used that information to compose new pieces
in the form. Two experiments were performed. The first verified that LSTM
was not relying on regularities in the melody to learn the chord structure. The
second experiment explored the ability of LSTM to generate new instances
of a musical form, in this case a bebop-jazz variation of standard 12-bar blues.
These experiments are preliminary and much more work is warranted. For
example, we have yet to compare LSTM performance to non-RNN approaches
such as HMMs and graphical models. Also, we report on LSTM behavior for
a single set of parameters; a more methodical exploration of parameter space
is warranted. However by demonstrating that an RNN can capture both the
local structure of melody and the long-term structure of a musical style, these
experiments represent an advance in neural network music composition.
ACKNOWLEDGEMENTS
We would like to thank Mike Mozer for answering many questions about his
model and simulation methods. The division of labor between authors is as
follows: The first author devised and constructed the datasets, implemented
the program code, ran the simulations and wrote the paper. The second
author wrote the grant proposal and eventually obtained a grant (SNF 21-49144.96) for doing this type of work. He provided guidance on how to
structure the task and how to best use LSTM; he also edited the paper.
REFERENCES
[1] J. J. Bharucha and P. M. Todd, "Modeling the perception of tonal structure with neural nets," Computer Music Journal, vol. 13, no. 4, pp. 44–53, 1989.
[2] G. Cooper and L. B. Meyer, The Rhythmic Structure of Music, The
University of Chicago Press, 1960.
[3] D. Eck, "A Network of Relaxation Oscillators that Finds Downbeats in Rhythms," in G. Dorffner (ed.), Artificial Neural Networks: ICANN 2001 (Proceedings), Berlin: Springer, 2001, pp. 1239–1247.
[20] R. J. Williams and D. Zipser, "Gradient-based learning algorithms for recurrent networks and their computational complexity," in Y. Chauvin and D. E. Rumelhart (eds.), Back-propagation: Theory, Architectures and Applications, Hillsdale, NJ: Erlbaum, chap. 13, pp. 433–486, 1995.