
Exploration of techniques for automatic labeling of audio drum tracks' instruments
Fabien Gouyon, Perfecto Herrera
Music Technology Group, Pompeu Fabra University
{fabien.gouyon, perfecto.herrera}@iua.upf.es
http://www.iua.upf.es/mtg

Abstract
We report on the progress of current work regarding the automatic recognition of percussive instruments
embedded in audio excerpts of performances on drum sets. Content-based transformation of audio drum
tracks and loops requires the identification of the instruments that are played in the sound file. Some
supervised and unsupervised techniques are examined in this paper, and classification results for a small
number of classes are discussed. In order to cope with the issue of classifying percussive events embedded in
continuous audio streams, we rely on a method based on an automatic adaptation of the analysis frame size to
the smallest metrical pulse, called the "tick". The success rate with some of the explored techniques has been
quite good (around 80%), but some enhancements are still needed in order to accurately classify sounds
under real application conditions.

1. Introduction

In this paper we define drum tracks as short audio excerpts of constant tempo containing a few sets of percussive timbres: typically 5- to 10-second mixes of acoustic bass drums, snare drums, hi-hats, toms and cymbals. Our long-term objective is to design a set of analysis, transformation and browsing tools anchored in the musical content of these signals, i.e. their rhythmic structures and their timbral features. Here we address the intermediary task of automatic description of drum tracks (see e.g. [15] and [3]). That is, given an excerpt, we aim at determining which percussive timbres are present, and at providing their timing occurrences, without making any assumption regarding the musical style.

When seeking a meaningful representation of audio signals, one must address the issues of characterization and segmentation (i.e. the "what is when?" issue). These concepts are tightly linked, and it is never clear which should come first. Indeed, temporally segmenting a signal into meaningful elements is obviously easier if we know what it is made of, and characterizing an event entails that it has boundaries. How could we meaningfully segment a signal before it has been categorized, or categorize it before it has been segmented? In this "chicken and egg" issue, focusing first on one task or the other implies the explicit or implicit use of heuristics regarding the signal. In our case, we focus on musical signals of constant tempo; we therefore assume that there always exists a relevant segmentation grid of regularly-spaced indexes that is inherent to the performance and corresponds to the smallest metrical pulse. These signals are solely made up of percussive instruments, therefore if the grid's gap and starting index are correctly determined, the resulting segments isolate coherent parts of the percussive instruments' signals around their indexes of occurrence, in a scope determined by the smallest metrical pulse (typically 180 ms), which is assumed to be sufficient for the purpose of characterizing these instruments.

The grid's gap is herein called the tick and corresponds to the smallest pulse implied by a performance on a set of drums (this concept is referred to as the "tatum" in [3], and relates to Yeston's "attack-point" concept (see [17]), as reported in [15] p.33). Occurrences of percussive events are supposed to match the tick grid with little approximation. We also assume that "new events" occur in the signal when energy transients are present. Unlike in the general polyphonic multitimbral case, it seems acceptable to state that occurrences of events in percussive music are linked to abrupt changes in the single energy parameter.

Therefore, we first focus on the segmentation of the signal, based on a prior detection of onsets; characterization comes in a second step. Handling these issues the other way around would be, for instance, to use metadata about the signal (e.g. 'this signal corresponds to the melody X played by the instrument Y') and to adjust the segment boundaries to what we already know regarding what the signal is made of. An example of characterization before segmentation can be seen in e.g. [6], where a blind frame grid -whose gap does not depend on the signal's content- is applied. Each frame is characterized in terms of chord-type membership, and a subsequent segmentation into chord sequences is then applied. In that case, the heuristics lie in the knowledge of the signals to be processed: polyphonic audio signals with a strong harmonicity feature, i.e. audio streams in which it is relevant to assume that a chord is being played at each moment.

This paper uses a method to segment the signal with respect to a regular frame grid that is rhythmically relevant (see [7]). As this segmentation into frames does a good job of isolating significant parts of percussive events, we perform subsequent pattern matching algorithms over these frames to reach the objective of description. These algorithms are compared using the evaluation framework detailed hereafter.

1.1. Tick segmentation

In [7], we propose an algorithm to extract the smallest pulse implied in a performance on a set of drums. Let us remind the reader that the notion addressed here is not that of the beat -the perceptually most prominent pulse- but rather a notion of pulse that most highly coincides with all onsets, at the metrical level which carries the communication of important musical features (see [3]). In [7], we argue that peaks in the inter-onset interval (IOI) histograms of musical signals respect harmonic series, whose gap we define as the tick. The proposed algorithm performs onset detection, generates IOI histograms and makes use of a powerful method of fundamental frequency extraction: the two-way mismatch algorithm proposed in [11]. An example of tick segmentation can be seen in Figure 1.

Figure 1: Example of a drum track tick segmentation grid

1.2. Percussive events characterization

Instrument sounds classification

There are two different approaches to the automatic classification of musical instrument sounds. One of them is oriented towards perceptual classification (i.e. simulation of the perceptual similarity judgments that can be obtained from human subjects in psychoacoustical experiments), whereas the other is oriented towards taxonomic classification (i.e. simulation of learned and culture-influenced judgments regarding the family, subfamily and type of instrument that can generate a test sound presented to a human or computer system). As we are here concerned with the second approach, the interested reader is referred to [8] and [18] for more details on the perceptual approach and its associated techniques. Automatic taxonomic classification of sounds is an important functionality for the retrieval of audio files from large databases. In the case of song databases, locating specific instrument parts is an expected feature (though very difficult to implement given the current state of the art); in the case of sound sample databases, automatic labeling of samples after their incorporation into the database is also a must (and this time it seems an achievable feat). The automatic labeling of sound events in audio drum tracks seems to lie midway, as it can include combinations of several sounds, but without the usual cluttering that is present in songs.

Even though there is a growing literature on instrument classification (see [9] for a comprehensive review), very few works are concerned with the specific case of percussive instruments. In fact, the most specific papers are aimed more at classification based on perceptual similarity ([10], [14]). Although there are several techniques that provide a high percentage of success when classifying isolated sounds (see e.g. [13], which addressed the classification of entire isolated tones -several seconds long- from clean recordings), it is not clear that they can be applied directly and successfully to the more complex task of labeling monophonic phrase segments. With this aim, in [4], Brown addresses the classification of short phrases (1-2 s) of recordings from varied sources. There is no segmentation of the signal into coherent events prior to characterization: a frame grid with a fixed size (23 ms) is applied to the signal, and each frame is treated independently of the others. The analyzed data is supposed to pertain to one and only one of two classes ('oboe' or 'saxophone'); no switching from one instrument to another is envisaged. In [12], Marques and Moreno classify 0.2 s-long segments. As for Brown, the assumption is implicit that all the frames constituting one excerpt pertain to only one of the classes used for training. Moreover, none of these approaches intended to handle the issue of data that would not pertain to any of the training classes (silences are removed from the test sets).

Handling of sequential data

It is not yet clear how instrument classification methods like those discussed above can be applied to audio mixtures without assuming a preliminary source separation, which is still not feasible. Nevertheless, it is of great interest to tackle the issue of automatically describing audio files that are more complex than monotimbral excerpts, which is the case of drum tracks (as several instrument sounds can sometimes overlap along the track, while at other times a single instrument is present). Indeed, a search (matching all the words) for "audio drum tracks download" on the Internet with Google reports 27200 URLs of interest (October 19th 2001), which certainly corresponds to hundreds of thousands of drum track audio files "waiting to be analyzed".

In the task of classifying percussive sounds embedded in drum tracks, unavoidable steps are: (1) segmenting the drum tracks into "coherent" events, and (2) choosing a classification technique to part these events into groups.
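Step (1) relies on the tick extraction summarized in Section 1.1. As a rough sketch of the idea -our own simplification, searching for an approximate common divisor of the inter-onset intervals, which stands in for the IOI-histogram and two-way mismatch procedure of [7]; all names are hypothetical:

```python
import numpy as np

def estimate_tick(onsets, tick_range=(0.30, 0.05), step=0.001, tol=0.05):
    """Estimate the tick as the largest duration whose multiples
    approximately explain all inter-onset intervals (IOIs).
    `onsets`: onset times in seconds."""
    onsets = np.asarray(onsets, dtype=float)
    # all forward inter-onset intervals, up to 1 s
    iois = (onsets[None, :] - onsets[:, None]).ravel()
    iois = iois[(iois > 1e-6) & (iois <= 1.0)]
    # scan candidate ticks from large to small: the first candidate
    # whose multiples fit all IOIs within tolerance is kept
    for tick in np.arange(tick_range[0], tick_range[1], -step):
        deviation = np.abs(iois / tick - np.round(iois / tick))
        if deviation.mean() < tol:
            return float(tick)
    return None
```

For instance, onsets lying on a 125 ms grid yield an estimated tick close to 0.125 s, even when not every grid position carries an onset.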
It is obvious that we cannot assume that a segmentation technique will provide us with isolated clean drum occurrences similar to those of a training set. This has a great consequence if we choose a supervised classification technique. Indeed, even if we assume that a database of isolated sounds could be representative of any drum track's sounds, an issue would still lie in the comparison of (1) sounds that have coherent boundaries regarding a timbral criterion (those of the database) with (2) sounds that have coherent boundaries regarding a rhythmic attribute (the tick size), which can therefore differ from one drum track to another.

As a consequence, we cannot use descriptors that are relative to entire drum occurrences (e.g. "log attack time", "temporal centroid of the tone" or "spectral centroid of the tone", as in the percussive timbre space reported in [14]); we have no choice but to use descriptors of tick-length frames. Therefore, for classifying our specific data, it seems that the following options are thinkable:
1) Set the frame size to the tick size and choose an unsupervised technique to detect grouping characteristics in the frames of a given excerpt. Here, no relation whatsoever is supposed amongst an excerpt's frames, nor between frames and the elements of a sound database.
2) Consider that the tick segmentation isolates the most characteristic parts of percussive instruments and apply supervised classification techniques: train a classifier with a database of labeled tick segments derived from audio drum tracks, and classify incoming tick segments according to the resulting labeled regions in the projection space.
3) Divide all tick segments into smaller frames whose size would be fixed (e.g. 15 ms) and independent of the excerpt's specific tick size. Train a classifier with labeled percussive instrument samples described at the frame level, the frame size being the same as that mentioned above. Here, the goal would be to characterize the evolution of features within a tick.

The eventual goal is to automatically associate with a given drum track a symbolic temporal sequence representing the occurrences of diverse percussive events. Depending on which of the aforementioned options is chosen, this sequence will either represent the sole drum track structure, or its accurate description. That is, in the unsupervised framework, classes do not necessarily correspond to the same instruments in different drum tracks; what is achieved is the separation into clusters, not the assignment of meaningful labels. In this case, we propose to represent drum tracks by symbolic sequences (or "strings") constructed as the conjunction of several time series of occurrences of different percussive timbres. As a trivial example, the string 'acdba' could stand for the sequence 'kick occurrence'-'nothing new'-'snare occurrence'-'hi-hat occurrence'-'kick occurrence'. The pattern-matching techniques we detail hereafter aim at providing symbolic sequences similar to that of Figure 2.

Figure 2: Drum track symbolic sequence. Here 'a' corresponds to an occurrence of kick, 'b' of snare, 'd' of hi-hat and 'c' to 'no new event'. The second 'b' is the sole artifact of the example.

On the other hand, in the supervised framework, it is intended from the very outset to recognize specific instruments, stating a universal membership of any family of percussive sounds to given regions in the representation space defined by the classification scheme's features.

In order to systematically evaluate the goodness of a given pattern-matching algorithm, we propose the following methodology.

2. Evaluation methodology

We developed an algorithm to generate random audio drum tracks, together with an exact score of these drum tracks. Given a tick value, at each position of the tick grid either a percussive instrument (with random amplitude) or silence is randomly assigned. Deviations from the exact tick grid positions are allowed, and white noise is added to the signal to further approximate (though in a quite crude way) realistic performance data (see [7]). The resulting audio files constitute a useful audio material for automatic evaluation.

The evaluation process is the following:
- Audio drum tracks and scores are generated.
- The segmentation into ticks is performed over the drum tracks.
- A tick computation is considered good if the computed tick is the same as the one given in the score of the track, with a possible deviation of ±1%.
- Features are extracted from tick frames (see below).
- The memberships of the frames to different classes are determined by either a supervised or an unsupervised technique, and a symbolic sequence is generated.
- These sequences are evaluated by comparison with the score sequences.

When using unsupervised techniques, the assignment of a letter to a frame is arbitrary; for instance, an 'a' does not have to correspond to the same instrument in different drum tracks. Thus, in this case, the evaluation entails a brute-force string-matching algorithm: for each possible permutation of a computed sequence (if N is the number of letters of the alphabet, there are N! possible permutations), the percentage of elements that differ from those of the actual sequence is computed.
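This brute-force matching over label permutations can be sketched as follows (a minimal illustration; function and variable names are our own):

```python
from itertools import permutations

def best_match_percentage(computed, actual, alphabet):
    """Try every relabeling of the computed sequence (N! permutations
    of the alphabet) and return the best percentage of elements
    matching the actual (score) sequence."""
    best = 0.0
    for perm in permutations(alphabet):
        relabel = dict(zip(alphabet, perm))
        hits = sum(relabel[c] == a for c, a in zip(computed, actual))
        best = max(best, 100.0 * hits / len(actual))
    return best
```

A computed sequence identical to the score up to a swap of two letters (e.g. 'a' and 'b') still scores 100%, which is exactly the invariance that unsupervised labels require.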
Then, among the N! percentages computed for a given sequence, the best one is chosen as the percentage of matching of the computed sequence to the actual sequence.

To check the relevance of this process, we first evaluated the sole extraction of the tick: 1000 5-second drum tracks were generated following the previous algorithm, with four tick sizes being considered (250, 166, 124 and 83 ms). Comparing the ticks extracted from the audio signals to those given in the scores, the results are that 77.3% are correctly computed. An analysis over real audio drum tracks has also been performed. The systematic evaluation is here more difficult, as we do not have scores of the drum tracks that would provide unambiguous knowledge of the tick. Here, the subjectivity of the listener obviously enters the evaluation process, all the more so when the number of excerpts to evaluate is large. Nonetheless, it is interesting to mention the following results: over 57 drum tracks, ranging from 2 to 10 seconds, made up of different bass drums, snares, hi-hats, cymbals and toms, corresponding mainly to reggae, funk, hip-hop and rock styles, comparing subjectively extracted minimal pulses with the tick gaps and starting indexes yielded by the algorithm, the determination of the tick was considered good in 86% of the cases.

The automatic segmentation working sufficiently well, we now report on several pattern matching algorithms that have been run on the 1000 drum tracks mentioned above (totaling 26680 occurrences of percussive sounds). The evaluation can be performed either on all the drum tracks or solely on those segmented correctly.

3. Experiments

As a first step, we restrict the investigations solely to drum tracks made up of kicks, snares and hi-hats. The database of sounds used for the generation of drum tracks consists of 8 hi-hats, 9 kicks and 15 snares. It should be noted that we actually account for more than 32 sounds; indeed, the tick segmentation approach yields multiple variations of the same sound (the estimation is around 5 variations per sound). In the generation algorithm, no simultaneous occurrences of instruments were generated; however, as the tick is generally shorter than the tone lengths, it should be noted that the issue of partially overlapping timbres still exists (e.g. a hit often occurs in the tail of another).

3.1. Features

Following the segmentation comes a process of recentering and windowing: the frame grid is shifted by half a tick, and each frame is multiplied by a Hanning window. There is no overlap of frames. The following descriptors are computed over each frame (they do not necessarily correspond to entire tones' descriptors):
- Spectral kurtosis (SK): This is a measure of how outlier-prone the spectrum is. Spectra whose distribution is more outlier-prone than the normal distribution have kurtosis greater than 3; those that are less outlier-prone have kurtosis less than 3.

  sk = (1 / (n·σ^4)) · Σ (X − X̄)^4 − 3

  (where X is the magnitude spectrum of a frame and σ is the standard deviation of the spectrum distribution)
- Temporal centroid (TC): This is the balance point of the absolute value of the temporal signal.
- "Strong decay" (SD): This feature is built from the non-linear combination of the frame's energy and temporal centroid. A frame with a temporal centroid near its left boundary and a strong energy is said to have a strong decay.

  sd = (1 / tc) · e

  (where e is the energy of a frame and tc its temporal centroid)
- Zero-crossing rate (ZCR).
- Spectral centroid (SC): This is the balance point of the spectrum.
- Energy (E).
- "Strong peak" (SP): Intended to reveal whether the spectrum presents a very pronounced peak. The thinner and the higher the maximum of the spectrum, the higher the value this parameter takes.

  sp = max(X) / bW

  (where X is the magnitude spectrum of a frame and bW is the bandwidth of the maximum peak in the spectrum above a threshold of half its amplitude)

3.2. Hierarchical clustering

Whether performed on-line or off-line (see note 1), a clustering process is based on proximity measures between elements and between groups of elements (see [16], pp. 358-378 for details on different proximity measures). Hierarchical clustering algorithms produce a hierarchy of clusterings that provides different degrees of agglomeration of the data, the number of clusters decreasing at each agglomeration step. The clustering produced at each step results from the previous one by merging two clusters into one, depending on the distance chosen. The process begins by assigning a cluster to each data vector, and it ends when the whole data is agglomerated into a single cluster. The hierarchy of clusterings can be explored in a dendrogram structure (see Figure 4).

(1) On-line procedures progressively adapt the cluster number, shapes and centroids according to incoming data, whereas off-line procedures process the data as a whole.
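As an aside, several of the frame descriptors of Section 3.1 can be sketched in a few lines of NumPy (a rough illustration under our own naming conventions, not the authors' code):

```python
import numpy as np

def frame_descriptors(frame, sr):
    """Compute some Section 3.1 descriptors for one Hanning-windowed
    tick frame `frame` sampled at `sr` Hz."""
    n = len(frame)
    t = np.arange(n) / sr
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(n, 1.0 / sr)

    energy = float(np.sum(frame ** 2))
    # temporal centroid: balance point of |signal|, in seconds
    tc = float(np.sum(t * np.abs(frame)) / np.sum(np.abs(frame)))
    # strong decay: high when energy is large and concentrated early
    sd = energy / tc if tc > 0 else 0.0
    # spectral centroid: balance point of the magnitude spectrum, in Hz
    sc = float(np.sum(freqs * spectrum) / np.sum(spectrum))
    # spectral kurtosis: outlier-proneness of the spectrum distribution
    dev = spectrum - spectrum.mean()
    sk = float(np.sum(dev ** 4) / (len(spectrum) * spectrum.std() ** 4) - 3)
    # zero-crossing rate: fraction of sign changes between samples
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
    return {'E': energy, 'TC': tc, 'SD': sd, 'SC': sc, 'SK': sk, 'ZCR': zcr}
```

Running this on a decaying sinusoid (a crude stand-in for a drum hit) gives a small temporal centroid and hence a large strong-decay value, as intended.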
Seeking an agglomeration into four clusters, and using deviations of the descriptors from their mean value, normalized by the standard deviation, we experimented with the use of several descriptors, several element distances and several group distances (see [16]). A selection of the results is given in Table 1, where we illustrate the changing of features, then of element distance, then of group distance, and then of features again.

Figure 4: Hierarchical clustering dendrogram structure

3.3. Decision tree

We empirically noted that the parameter SD allows separating frames of kick and snare occurrences from the rest: high values of SD correspond to occurrences of kicks and snares, and small values to the remaining frames (that is, occurrences of hi-hats and frames in which no new occurrence takes place). Therefore, we designed an algorithm that defines clusters in a step-by-step agglomerative manner. Decision steps are binary and taken with respect to the last level of agglomeration of a hierarchical clustering scheme (element distances being Mahalanobis and group distances being Ward).

Frames are separated into class 1 and class 2 using the parameter SD. The class that shows the highest values of SD is labeled Cl1, the other Cl2. Then we experimented with different parameters to divide Cl1 data into class A and class B, and Cl2 into class C and class D. A selection of the results is given in Table 2.

3.4. K-means clustering

K-means clustering consists of first deciding the number of clusters that we need to get, and then running an algorithm that picks one "seed" case for each cluster. Then, depending on its distances to the seeds, each case is reassigned a cluster membership, and the seed case is updated as the centroid of its corresponding cluster. A function is computed that depends on the distances (e.g. Euclidean) between elements and centroids, and represents how well the centroids match the data. The process enters an iteration loop that terminates when a local minimum of the function is reached (see [5], p. 526). Though K-means does not perform an exhaustive search through every possible partitioning of the data, it provides results that overall perform quite well.

Restricting tests to correctly segmented excerpts, we tested values of K ranging from 4 to 7, and the best results were obtained for K=4 (coinciding with the number of a priori categories of sounds: kick, snare, hi-hat and no-instrument). From Table 3 it is clear that each cluster predominantly identifies one type of instrument (cluster 1 agglomerates 84% of the hi-hats, cluster 4 agglomerates 73% of the kicks, and cluster 3 agglomerates 73% of the no-instrument ("nothing") frames), but there is a problem with the snares, as cluster 2 only agglomerates 54% of them, whereas 43% fall in cluster 4, apparently "confused" with kicks. The association between cluster and category is supported by the values of a chi-square statistic (χ² = 36303.134, d.f. 9, p < 0.001). The total matching between the algorithm output and the scores is 71.35%.

3.5. Fuzzy c-means clustering

The fuzzy c-means algorithm (like its "crisp" version, the c-means, or k-means) produces successive clusterings while trying to minimize a specific cost function. In the c-means algorithm, the data elements are assigned exclusive memberships with respect to a given number of clusters. Introduced in [19], the concept of fuzzy clusters consists in the assignment, to each data element, of a partial or distributed membership to each cluster. In the fuzzy c-means algorithm, these memberships are used as weights in the computation of the distances between data elements and centroids (see [2]).

Distributed memberships are assigned to the data elements until the process reaches convergence (i.e. improvement in the cost function smaller than a threshold); eventually, the final memberships are made exclusive (for the purpose of classification) and correspond to the highest memberships. For instance, if the memberships to 4 clusters of 7 frames are as in Table 4, the resulting sequence is: 'b b a d a d b'. We tried several parameters and cluster numbers. A selection of the results is given in Table 5. One can see that the best clustering is performed with all the features and the correct number of clusters.

3.6. Linear Discriminant Analysis

A simple supervised technique is that of Linear Discriminant Analysis (LDA), an equivalent of multivariate analysis of variance for categorical variables. LDA attempts to minimize the ratio of within-class scatter to between-class scatter and builds a definite decision region between the classes. It provides linear, quadratic or logistic functions of the variables that "best" separate cases into two or more predefined groups, but it is also useful for determining which are the most discriminative features and the most alike/different groups. Focusing on correctly segmented drum tracks, we checked that our data distributions were Gaussian and that their variances were similar, and then randomly selected 65% of the sounds in order to derive a set of linear discriminant functions. They were then applied to the remaining 35% in order to cross-validate the functions.
Results for this test set are shown in Table 6. Though the overall success is 80%, there is a clear confusion phenomenon between snares and kicks that will require further study. Regarding the usefulness of the variables, it seems that the most relevant are the temporal centroid, the spectral centroid and the strong decay. The least relevant seem to be energy (which, on the other hand, is highly correlated with strong decay) and ZCR (which is highly correlated with the spectral centroid); in fact, the elimination of these two variables improved the snare discrimination, but at the cost of slightly degrading the classification of the hi-hat and "no-instrument" categories.

3.7. Discussion

Clustering methods are sometimes considered "second-league" methods that should be used cautiously (see [1]). However, reasons for exploring this approach can be found in the immediateness of results (as opposed to the time-consuming parameter-setting "learning" phases required by most of the powerful supervised techniques). The other practical advantage is that a large proportion of realistic sounds extracted from drum tracks will be mixtures that cannot be easily labeled (and at the present moment we do not know whether a general "kick+hihat" category will be enough for that, or whether we shall use different combinatory categories according to some difficult-to-define criterion). In that respect, clustering techniques apparently seem more suitable than supervised learning of categories; anyway, this is still something to be studied as a next step of our research. On the other hand, a supervised technique will provide us with a direct labeling, provided an adequate learning or estimation phase, and a faster assignment of labels than clustering. If we take the performance of the LDA as the lowest estimate for a supervised technique, it seems that classification success is similar to that of clustering. Therefore, it is reasonable to expect an improvement when using more powerful techniques. Handling and labeling of multifarious mixtures can be efficiently done with Gaussian Mixture Models or Hidden Markov Models. Anyway, it should be noted that the percentages of the supervised and unsupervised experiments should be compared cautiously; indeed, the influence, on both rationales, of the representativeness of the instruments database used to generate the tracks should be investigated.

4. Future work

The present work illustrates the appropriateness of a set of audio descriptors for the task of percussive sound classification. We used a flexible framework suitable for the generation of drum tracks that allows systematic testing of classification techniques and features. Given the specific windowing method that we have used, it is possible that some important features have been lost or corrupted (specifically, this could be the reason for the snare-kick confusion detected in some of the analyses). Improving the set of descriptors is therefore one of our next steps.

A couple of easy-to-do additional improvements will be the usage of more complex drum tracks, containing more diverse sounds (for example, toms or crash cymbals), and the creation of drum tracks that contain realistically overlapped sounds; indeed, it is frequent that e.g. hi-hats and kicks are played simultaneously. In that respect, we are currently designing an algorithm to generate drum tracks based on MIDI file data, which contain realistic mixtures of sounds and where the labeling is given. This way we may get access to a suitable database as large and complex as specific experiments call for.

More realistic drum tracks will provide us with data actually pertaining, with diverse degrees, to several classes. It will be investigated whether fuzzy clustering permits coping with this issue.

The supervised classification techniques that we have explored are not the most powerful, but they have given an indication of the usability and limitations of our framework. Although we have obtained fairly good results, it is now clear that we need to move towards techniques such as Gaussian Mixture Models, Hidden Markov Models or Support Vector Machines if we want to improve the performance of our system.

5. Acknowledgment

The work reported in this paper has been partially funded by the IST European project CUIDADO.

6. References

[1] Aldenderfer, M. and Blashfield, R., Cluster Analysis, SAGE Publications, Newbury Park, CA, 1984.
[2] Bezdek, J. and Pal, S.K., Fuzzy models for pattern recognition, IEEE Press, New York, 1992.
[3] Bilmes, J., Timing is of the Essence: Perceptual and Computational Techniques for Representing, Learning, and Reproducing Expressive Timing in Percussive Rhythm. MS Thesis, MIT, 1993.
[4] Brown, J., Computer identification of musical instruments using pattern recognition with cepstral coefficients as features. Journal of the Acoustical Society of America 105, 1999.
[5] Duda, R., Hart, P. and Stork, D., Pattern classification, John Wiley & Sons, New York, 2001.
[6] Fujishima, T., Real-time chord recognition of musical sound: a system using Common Lisp Music. In Proc. International Computer Music Conference, 1999.
[7] Gouyon, F., Herrera, P. and Cano, P., Pulse-dependent analyses of percussive music. To appear.
[8] Grey, J., Multidimensional perceptual scaling of musical timbres. Journal of the Acoustical Society of America 61, 1977.
[9] Herrera, P., Amatriain, X., Batlle, E. and Serra, X., Towards Instrument Segmentation for Music Content Description: a Critical Review of Instrument Classification Techniques. In Proc. of the International Symposium on Music Information Retrieval, 2000.
[10] Lakatos, S. and Beauchamp, J., Extended perceptual spaces for pitched and percussive timbres. Journal of the Acoustical Society of America 107 (5), 2000.
[11] Maher, J. and Beauchamp, J., Fundamental frequency estimation of musical signals using a two-way mismatch procedure. Journal of the Acoustical Society of America 95, 1993.
[12] Marques, J. and Moreno, P., A study of musical instrument classification using Gaussian mixture models and Support Vector Machines. CRL 99/4, Cambridge Research Lab, 1999.
[13] Martin, K.D. and Kim, Y., Musical Instrument Identification: A pattern-recognition approach. In Proc. of the 136th meeting of the Acoustical Society of America, 1998.
[14] Peeters, G., McAdams, S. and Herrera, P., Instrument sound description in the context of MPEG-7. In Proc. of the International Computer Music Conference, 2000.
[15] Schloss, A., On the automatic transcription of percussive music - From acoustic signal to high-level analysis. PhD Thesis, CCRMA, Stanford University, 1985.
[16] Theodoridis, S. and Koutroumbas, K., Pattern Recognition, Academic Press, San Diego, 1998.
[17] Yeston, M., The stratification of musical rhythm, Yale University Press, New Haven, 1976.
[18] Young, F., Multidimensional scaling: history, theory and applications, Lawrence Erlbaum, 1987.
[19] Zadeh, L., Fuzzy sets. Information and Control 8, 1965.

Features                    ZCR, E       E, SD        E, SD      E, SD        E, SD, SK
Elements distance           Mahalanobis  Mahalanobis  Euclidean  Mahalanobis  Mahalanobis
Groups distance             Shortest     Shortest     Shortest   Ward         Ward
Excerpts with good ticks    59.2%        73.8%        71.6%      81.1%        79.4%
All excerpts                53.7%        65%          63.5%      71.1%        70%

Table 1: Computed success rates of hierarchical clustering experiments

Features for dividing Cl1   SK      SK, ZCR  SK, SP   SK, SP
Features for dividing Cl2   E       E, TC    E, TC    E, SC
Excerpts with good ticks    83.6%   78.6%    84.6%    84.8%
All excerpts                73.2%   69.2%    73.8%    74.2%

Table 2: Computed success rates of decision tree experiments

          Cluster 1  Cluster 2  Cluster 3  Cluster 4  Total
hihat     84.516%    0.778%     14.679%    0.027%     100%
kick      3.528%     23.039%    0.208%     73.225%    100%
nothing   25.235%    0.514%     73.737%    0.514%     100%
snare     2.838%     53.957%    0.150%     43.054%    100%
Total     29.764%    21.368%    17.110%    31.758%    100%
N         7941       5701       4565       8473       26680

Table 3: Distribution percentages of sounds into clusters by K-means clustering

           Frame 1  Frame 2  Frame 3  Frame 4  Frame 5  Frame 6  Frame 7
Cluster 1  0.0013   0.0016   0.9842   0.0046   0.9963   0.0048   0.0005
Cluster 2  0.9959   0.9950   0.0037   0.0007   0.0008   0.0008   0.9984
Cluster 3  0.0020   0.0024   0.0015   0.0004   0.0003   0.0004   0.0007
Cluster 4  0.0008   0.0010   0.0106   0.9943   0.0026   0.9940   0.0003

Table 4: Example of memberships of some frames to 4 clusters in a fuzzy c-means clustering, after convergence
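The defuzzification step described in Section 3.5 -keeping, for each frame, the cluster with the highest membership- can be sketched as follows; applied to the memberships of Table 4 it yields the sequence 'b b a d a d b':

```python
import numpy as np

def memberships_to_sequence(memberships, letters='abcd'):
    """`memberships`: (n_clusters, n_frames) array of fuzzy c-means
    memberships. Each frame receives the letter of its
    highest-membership cluster (exclusive labels for classification)."""
    return ''.join(letters[i] for i in np.argmax(memberships, axis=0))

# Memberships of Table 4 (rows: clusters 1-4, columns: frames 1-7)
table4 = np.array([
    [0.0013, 0.0016, 0.9842, 0.0046, 0.9963, 0.0048, 0.0005],
    [0.9959, 0.9950, 0.0037, 0.0007, 0.0008, 0.0008, 0.9984],
    [0.0020, 0.0024, 0.0015, 0.0004, 0.0003, 0.0004, 0.0007],
    [0.0008, 0.0010, 0.0106, 0.9943, 0.0026, 0.9940, 0.0003],
])

memberships_to_sequence(table4)  # → 'bbadadb'
```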

Number of clusters          3                  4                  4
Features                    SK, SP, SD, SC, E  SK, SP, SD, SC, E  SD, E
Excerpts with good ticks    76.4%              84.1%              75.9%
All excerpts                67.6%              73.6%              67.1%

Table 5: Computed success rates of fuzzy c-means experiments

                  hihat  kick  nothing  snare  % correct
hihat             2281   11    193      6      92%
kick              26     2031  1        407    82%
nothing           142    25    1165     3      87%
snare             23     823   0        1276   60%
All instruments                                80%

Table 6: Confusion matrix and percentage of correct classifications for a linear discriminant classifier
