Audio-To-Score Alignment of Piano Music Using RNN-Based Automatic Music Transcription
1. INTRODUCTION
Audio-to-score alignment (also known as score following) is the process of temporally fitting music performance audio to its score. The task has been explored for quite a while and utilized mainly for interactive music applications, for example, automatic page turning, computer-aided accompaniment, or interactive interfaces for active music listening [1, 2]. Another use case of audio-to-score alignment is performance analysis, which examines a performer's interpretation of music pieces in terms of tempo, dynamics, rhythm, and other musical expressions [3]. To this end, the alignment result must be precise, with high temporal resolution. It was reported that the just-noticeable difference (JND) in time displacement of a tone presented in a metrical sequence is about 10 ms for short notes [4], which is beyond the current accuracy of automatic alignment algorithms. This challenge has provided the motivation for our research.

There are two main components in audio-to-score alignment: the features used in comparing audio to score, and the alignment algorithm between the two feature sequences. In this paper, we limit our scope to the feature part.

We follow the AMT-based approach for audio-to-score alignment. To this end, we build two AMT systems by adapting a state-of-the-art method using recurrent neural networks [8] with a few modifications. One system takes spectrograms as input and is trained in a supervised manner to predict a binary representation of MIDI, in either 88 notes or chroma. The prediction does not consider the intensities of notes, in other words, MIDI velocity. Using this system alone, however, does not provide precise alignment because onset frames and sustain frames are weighted equally; in other words, the similarity between matching onset frames becomes identical to that between the following sustain frames. To make up for this limitation, we use another AMT system that is trained to predict the onsets of MIDI notes in the chroma domain. This was inspired by the Decaying Locally-adaptive Normalized Chroma Onset (DLNCO) feature by Ewert et al. [6]. Following this idea, we employ decaying chroma note onset features, which turned out not only to offer temporally precise anchor points but also to make onset frames salient. Finally, we combine the two MIDI-domain features and apply a dynamic time warping algorithm to the feature similarity matrix. The evaluation on the MAPS dataset shows that our proposed framework significantly improves the alignment accuracy compared to previous approaches.
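To make the decaying onset idea concrete, the following is a minimal numpy sketch of one plausible construction, not the authors' exact implementation: it assumes a binary 12 x T chroma onset matrix at 100 fps and a hypothetical linearly decaying 10-frame kernel (the actual decay shape and length may differ).

```python
import numpy as np

def decaying_chroma_onsets(onsets, decay_len=10):
    """Spread each binary chroma onset over a short decaying tail.

    onsets    : (12, T) binary array, 1 at note-onset frames (100 fps).
    decay_len : length of the decay tail in frames (hypothetical value).
    """
    # Linearly decaying kernel, e.g. [1.0, 0.9, ..., 0.1] for decay_len=10.
    kernel = np.linspace(1.0, 1.0 / decay_len, decay_len)
    out = np.zeros_like(onsets, dtype=float)
    for pitch_class in range(onsets.shape[0]):
        # 'full' convolution, trimmed back to the original length T.
        tail = np.convolve(onsets[pitch_class], kernel)[: onsets.shape[1]]
        # Overlapping tails from nearby onsets are capped at 1.
        out[pitch_class] = np.minimum(tail, 1.0)
    return out
```

Spreading each onset over a short decaying tail keeps the alignment cost sensitive to small temporal offsets around onsets while leaving sustain regions untouched.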
filtered spectrogram. We observed a significant increase in transcription performance with this addition.
2.2.2 Network Training

In order to train the networks, we used audio files and aligned MIDI files. The MIDI data was converted into an array form with the same frame rate as the input filter-bank spectrogram, 100 fps. For the 88-note and chroma labels, the array elements between note onset and offset were annotated as 1 and otherwise filled with 0. For the chroma onset labels, the elements that correspond to note onsets were annotated as 1. The corresponding audio data was normalized to zero mean and a standard deviation of 1 over each filter in the training set.
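As an illustration of this label construction, the sketch below converts a list of MIDI notes into the frame-wise targets described above. It is a minimal sketch under our own assumptions (note events given as (onset_sec, offset_sec, midi_pitch) tuples, MIDI pitches 21-108 mapped to rows 0-87), not the authors' code.

```python
import numpy as np

FPS = 100  # frame rate matching the input filter-bank spectrogram

def make_labels(notes, n_frames):
    """notes: iterable of (onset_sec, offset_sec, midi_pitch) tuples."""
    note_roll = np.zeros((88, n_frames), dtype=np.float32)     # 88-note target
    chroma_onset = np.zeros((12, n_frames), dtype=np.float32)  # onset target
    for onset, offset, pitch in notes:
        on = int(round(onset * FPS))
        off = min(int(round(offset * FPS)), n_frames)
        if on >= n_frames:
            continue
        # Frames between note onset and offset are labeled 1.
        note_roll[pitch - 21, on:off] = 1.0
        # Only the onset frame is labeled 1 in the chroma onset target.
        chroma_onset[pitch % 12, on] = 1.0
    return note_roll, chroma_onset
```

A chroma label can be derived from note_roll by summing the rows that share a pitch class and clipping the result to 1.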
We use dropout with a ratio of 0.5 and weight regularization with a value of 10⁻⁴ in each LSTM layer, which effectively improves performance through better generalization. We optimized the networks with stochastic gradient descent to minimize a binary cross-entropy loss function. The learning rate was initially set to 0.1 and iteratively divided by 3 whenever no improvement in validation loss was observed for 10 epochs (i.e., early stopping).
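These hyperparameters map directly onto a standard deep-learning toolkit. The following Keras-style sketch is one possible setup under the stated settings; the layer count, layer sizes, and 229-bin input shape are hypothetical, since the excerpt does not specify the network architecture.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Layer count, layer sizes, and the 229-bin input are hypothetical.
model = keras.Sequential([
    keras.Input(shape=(None, 229)),             # (frames, filter-bank bins)
    layers.LSTM(128, return_sequences=True, dropout=0.5,
                kernel_regularizer=regularizers.l2(1e-4)),
    layers.LSTM(128, return_sequences=True, dropout=0.5,
                kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dense(88, activation='sigmoid'),     # frame-wise 88-note output
])

model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1),
              loss='binary_crossentropy')

callbacks = [
    # Divide the learning rate by 3 when validation loss stalls for 10 epochs.
    keras.callbacks.ReduceLROnPlateau(monitor='val_loss',
                                      factor=1 / 3, patience=10),
    # Stopping patience is a hypothetical value.
    keras.callbacks.EarlyStopping(monitor='val_loss', patience=30),
]
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           callbacks=callbacks)
```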
2.3 Alignment

The AMT systems return two types of MIDI-level features. We combine them and compute a similarity matrix between the AMT outputs and the score MIDI. The MIDI files that correspond to the score are also converted into the 88-note (or chroma) and chroma onset representations. We used the Euclidean distance to measure the similarity between the two combined representations and to compute the similarity matrix. We then applied the FastDTW algorithm [13], an approximate method for dynamic time warping (DTW). FastDTW uses an iterative multilevel approach with constraint windows to reduce the complexity. Because of the high frame rate of the features, it is necessary to employ a low-cost algorithm: while the original DTW algorithm has O(N²) time and space complexity, FastDTW operates in O(N) complexity with almost the same accuracy. Müller et al. [14] also examined a similar multi-level DTW for the audio-to-score alignment task and reported results similar to the original DTW. The radius parameter of the FastDTW algorithm, which defines the size of the window used to find an optimal path at each resolution refinement, was set to 10 in our experiment.
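The alignment step can be sketched with the publicly available fastdtw Python package. This is a minimal sketch, assuming the transcription output and the score representation are given as (frames x dimensions) feature matrices at 100 fps; the function and variable names are ours.

```python
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw

def align(audio_feat, score_feat, radius=10):
    """audio_feat, score_feat: (T, D) arrays of the combined
    88-note (or chroma) + chroma-onset features at 100 fps."""
    # FastDTW compares the two sequences frame by frame with the given
    # distance function and a constraint window of the given radius.
    distance, path = fastdtw(audio_feat, score_feat,
                             radius=radius, dist=euclidean)
    return path  # list of (audio_frame, score_frame) index pairs
```

Each pair in the returned path maps a 10 ms audio frame to a score frame, from which the aligned note onset times can be read off.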
3. EXPERIMENTS
3.1 Dataset
We used the MAPS dataset [15], specifically the ‘MUS’ subset that contains full-length pieces of piano music, for training and evaluation. Each piece consists of audio files of piano and a ground-truth MIDI annotation. The audio files were generated from a MIDI file, either through virtual instruments or through automatic playing on a Disklavier piano. Nine combinations of instruments and recording conditions were applied to each piece of piano music, which helped our model avoid overfitting to a specific piano tone. The MIDI files served as the ground-truth annotation of the corresponding audio, but some of them (ENSTDkCl, ENSTDkAm) are sometimes temporally inaccurate, with errors of more than 65 ms, as described in [16].
3.2 Evaluation method

To evaluate the proposed method, we carried out audio-to-score alignment experiments using the MAPS dataset. Because our method requires data for training, we conducted the experiments with 4-fold cross-validation using the train/test splits from [12], which are publicly accessible¹. For each fold, 43 pieces were detached from the training set and used as a validation set. As a result, each fold was composed of 173 / 43 / 54 pieces for the train / validation / test sets, respectively, as processed in [11].

To make the MIDI files usable as if they were score MIDI, we distorted the aligned MIDI files by changing the duration of every event. This type of evaluation method for the alignment task, based on temporal distortion of MIDI, was also employed in previous research [6, 17, 18]. A number randomly selected in [0.7, 1.3] was multiplied to modify the tempo of each interval. Employing this scheme of temporal distortion prevents the alignment path from being trivial.

¹ http://www.eecs.qmul.ac.uk/sss31/TASLP/info.html
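To make the distortion scheme concrete, the sketch below gives one plausible reading of it: the piece is cut into intervals, the local tempo of each interval is scaled by an independent random factor drawn from [0.7, 1.3], and all note times are remapped accordingly. The 5-second interval length is our assumption; the excerpt does not specify how intervals are chosen.

```python
import numpy as np

def distort_times(times, interval=5.0, low=0.7, high=1.3, seed=0):
    """Piecewise tempo distortion of note times (in seconds).

    times    : 1-D array of note onset/offset times.
    interval : length of each constant-tempo segment (assumed value).
    """
    times = np.asarray(times, dtype=float)
    rng = np.random.default_rng(seed)
    # Segment boundaries covering the whole piece.
    bounds = np.arange(0.0, times.max() + interval, interval)
    factors = rng.uniform(low, high, size=len(bounds) - 1)
    # Cumulative distorted time at each segment boundary.
    warped = np.concatenate([[0.0], np.cumsum(np.diff(bounds) * factors)])
    # Piecewise-linear remapping of every note time.
    return np.interp(times, bounds, warped)
```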
3.3 Compared Algorithms

To compare performance, we reproduced two other alignment algorithms: one proposed by Ewert et al. [6] and an offline algorithm by Carabias-Orti et al. [19], both of which suggested novel features for the alignment task. We performed the experiments on the same test set using only the FastDTW algorithm, without any post-processing. Ewert's algorithm is a representative example that employs a hand-crafted chromagram and an onset feature based on audio filter-bank responses. Carabias-Orti's algorithm employs non-negative matrix factorization to learn a spectral basis for each note combination from the spectrogram. The latter is designed only for audio-to-audio alignment, while Ewert's algorithm can be applied to both audio-to-audio and audio-to-MIDI alignment. Therefore, we made a synthesized version of the distorted MIDI using Synthogy Ivory II and employed it as an input. We tested Ewert's algorithm in both the audio and MIDI cases. The temporal frame rate of the features was adjusted to 100 fps for both algorithms.

For the alignment task employing Ewert's algorithm, we used the same FastDTW algorithm. However, since the FastDTW algorithm cannot be directly applied to Carabias-Orti's algorithm, due to its own distance calculation method, we applied a classic DTW algorithm, which employs the entire frame-wise distance matrix. Because of memory limitations, when reproducing Carabias-Orti's algorithm we excluded 35 pieces longer than 400 seconds from the test sets.

After we obtained the alignment path through DTW, the absolute temporal errors between the estimated note onsets and the ground truth were measured. For each piece of music in the test set, the mean of the temporal errors and the ratio of correctly aligned notes at varied thresholds were used to summarize the results.
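These summary statistics can be computed directly from the matched onsets. A minimal sketch, assuming the estimated and ground-truth onset times are given as matched arrays in seconds (the names are ours):

```python
import numpy as np

def summarize_errors(est_onsets, ref_onsets,
                     thresholds=(0.01, 0.03, 0.05, 0.1)):
    """Mean/median/std of absolute onset errors plus the ratio of
    notes aligned within each threshold (as in Table 1)."""
    errors = np.abs(np.asarray(est_onsets) - np.asarray(ref_onsets))
    ratios = {t: float(np.mean(errors <= t)) for t in thresholds}
    return errors.mean(), np.median(errors), errors.std(), ratios
```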
Figure 4. Ratio of correctly aligned onsets as a function of the threshold. Data points with precision lower than 80% are not shown in this figure.

4. RESULTS AND DISCUSSION

The results of the audio-to-score alignment are shown in Figure 4, which presents the precision of the different algorithms as a function of the error threshold. Typically, a tolerance window of 50 ms is used for evaluation. However, because most notes were aligned within a 50 ms temporal threshold, we varied the tolerance window from 0 ms to 200 ms in 10 ms steps.

Overall, our 88-note framework combined with the chroma onsets achieved the best result. Even with zero threshold, which means the best match at the resolution of our system (10 ms), our proposed model with 88-note output exactly aligned 52.55% of the notes. The ratio increased to 91.60% with a 10 ms threshold. The proposed framework using chroma showed a precision curve similar to that of the 88-note framework, but its accuracy was lower. Compared to Ewert's algorithm with hand-crafted features, our method shows significantly better performance, especially in the high-resolution region. Above a 100 ms threshold, our framework with chroma and Ewert's method show similar precision, but in the intervals under 50 ms the difference becomes significant. Note that we penalized our framework com-
Method                          Mean    Median  Std     ≤ 10 ms  ≤ 30 ms  ≤ 50 ms  ≤ 100 ms
Proposed with onset (chroma)    12.83   6.40    56.22   92.01    97.44    98.31    98.98
Proposed with onset (88 note)   8.62    5.57    31.14   91.60    98.00    98.97    99.61
Proposed w/o onset (chroma)     48.01   27.96   152.06  60.66    84.65    89.36    93.72
Proposed w/o onset (88 note)    25.31   18.69   63.26   56.39    86.42    93.05    97.48
Ewert et al. (audio-to-MIDI)    16.44   13.64   32.52   71.78    91.38    95.50    98.03
Ewert et al. (audio-to-audio)   14.66   11.71   25.38   71.53    92.43    96.91    99.13
Carabias-Orti et al.            131.31  49.96   305.52  23.58    49.40    69.30    91.60

Table 1. Piecewise results of the onset errors. The mean, median, and standard deviation of the errors are in milliseconds. The right columns give the ratio of notes (%) that are aligned within onset errors of 10 ms, 30 ms, 50 ms, and 100 ms, respectively.