UCL Icmc
Jonathan Bell
PRISM-CNRS
belljonathan50@gmail.com
ABSTRACT
1. INTRODUCTION
In 2009, Miller Puckette, the inventor of Max and Pure Data, stated: “Having seen a lot of networked performances of one sort or another, I find myself most excited by the potential of networked “telepresence” as an aid to rehearsal, not performance.” [?]. Meanwhile, the recent emergence of multiplayer capabilities in VR software such as PatchXR [?] calls for meaningful instrument designs that allow remote, collaborative musical interaction with a digital instrument in multiplayer VR (or metaverse) [?]. In search of experiences that would relate to those found in traditional chamber music, the solution proposed here focuses on the exploration of a sound corpus projected onto a 3D space, which users can then navigate with their hand controllers (see Fig. ??): in an etude for piano and saxophone for instance 1, one musician plays the blue buttons (the saxophone samples), and the other the yellow buttons (the piano samples).

Figure 1. A VR interface in which each button in the world corresponds to an item of the sound corpus. Machine learning helps bring sounds that share common spectral characteristics closer together. https://www.youtube.com/playlist?list=PLc WX6wY4JtkNjWelOprx-mOwioStXbUa

COSMIX (Corpus-based Sound Manipulation for Interactive Instruments in eXtended reality) seeks to build upon FluCoMa’s results and apply them to the realm of Virtual Reality Musical Instruments (VRMIs) by simulating acoustic instruments in telematic performances. By its focus on Creative Music Information Retrieval (CMIR), FluCoMa set as its agenda to enable mid-level creative coding, thus finding a sought-after balance between three conflicting affordances: 1/ access to low-level audio feature extraction, 2/ an accessible learning curve and 3/ customisable use. In its typical use case, the toolbox automatically segments and analyses a sound corpus according to a custom array of audio descriptors, in order to reveal similarities by projecting the data onto a two- or three-dimensional plot (or sound map). Initiated at IRCAM at the turn of the millennium by Diemo Schwarz in CataRT/FTM (today fully integrated in MuBu), the idea is today gaining increasing popularity (FluCoMa, AudioStellar, Mosaique, DICY2, Somax2) because of the potential it opens up when combined with machine learning. Focusing on research questions characteristic of the well-established NIME (New Interfaces for Musical Expression) community, COSMIX consists in porting sound maps to Extended Reality (XR) technology within the rapidly evolving PatchXR creative coding environment, in order to build interactive instruments. By combining FluCoMa’s and PatchXR’s capacities in machine learning/listening and Human Computer Interaction (HCI) respectively, the resulting instruments will facilitate multi-user sonic exploration of large corpora through various immersive interfaces. Inherently corpus independent, VRMIs’ affordances will be examined through three criteria: 1/ modes of excitation/interaction, 2/ primarily sonic, but also visual and haptic feedback, 3/ co-creativity in networked music performance.

An important reason here for using VR to explore a 3D dataset is that it allows users to interact with the data in a more natural and immersive way compared to a 2D plane (Chapter 3 will show how the present study derives from the CataRT 2D interface project), using the experience both as a tool for performance and as a means of data visualisation and analysis. Users can move around and explore the data from different angles, which can help them to better understand the relationships between different data points and identify patterns, a benefit that becomes more evident as the number of points increases. The use of machine learning (dimensionality reduction in our case) renders a world in which the absolute coordinates of each point no longer have any link to the descriptor space (the high sounds cannot be mapped to the

1 https://youtu.be/kIi7YdzP2Nw?t=89
Copyright: © 2023 Jonathan Bell et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License 3.0 Unported, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
single project or create something entirely new. In addi-
tion, Patch has a robust library of resources for users to
draw from, including tutorials, documentation, and sample
patches. The community around Patch is also very active,
with regular competitions, events, and meetups happening
around the world.
One of the most exciting aspects of Patch is its poten-
tial for use in music performance and composition. The
modular design of the platform allows users to create com-
plex audio and visual environments that can be controlled
in real time, opening up new possibilities for live music
and audiovisual performances. Patch has been used in a va-
riety of contexts, from creating interactive installations and
exhibits to developing VR training simulations and games.
Its flexibility and modular design make it a powerful tool
for anyone interested in exploring the creative possibilities of VR.
Figure 2. The implementation of the FM algorithm 1/ in Pure Data (left) 2/ in PatchXR (right).
appear as highly customizable: IRCAM’s MuBu [?] and the more recent EU funded FluCoMa [?] project. CataRT is now fully integrated in MuBu, whose purpose encompasses multimodal audio analysis as well as machine learning for movement and gesture recognition [?]. This makes MuBu extremely general purpose, but also difficult to grasp. The data processing tools in MuBu are mostly exposed in the pipo plugin framework [?], which can compute for instance

of feature extraction method that is commonly used in speech and speaker recognition systems. MFCCs are used to represent the spectral characteristics of a sound in a compact form that is easier to analyze and process than the raw waveform. They are calculated by applying a series of transformations to the power spectrum of a sound signal, including a Mel-scale warping of the frequency axis, taking the logarithm of the power spectrum, and applying a discrete cosine transform (DCT) to the resulting coefficients. The resulting coefficients, which are called MFCCs, capture the spectral characteristics of the sound and are commonly used as features for training machine learning models for tasks such as speech recognition and speaker identification.
3 A demo is available at: https://youtu.be/jxo4StjV0Cg
5 https://youtu.be/1LHcbYh2KCI?t=19
analysis and dimensionality reduction, as will be shown in section ??.
Section ?? will present different JavaScript programs and Max patches that were developed in order to diversify the ways in which FluCoMa analysis is represented in PatchXR, and how the user can interact with it.

5. WORKFLOW - ANALYSIS IN FLUCOMA

My experiments have focussed almost exclusively on musical instrument corpora 6. The tools presented here can efficiently generate plausible virtuosic instrumental music, but recent uses found more satisfying results in slower, quieter, “Feldman-like” types of textures. Various limitations on the playback side (either in standalone VR, or on a Pure Data sampler for RaspberryPi described in Section ??) initially restricted the amount of data the system could handle (less than 5 minutes of AIFF audio in PatchXR) and the number of slices a sample could be chunked into (256, because of a limitation on lists in Max). Both limitations were later overcome (use of the compressed ogg format in PatchXR, support for longer sound files since version 672, and an increased internal buffer size in fluid.buf2list in FluCoMa), thus allowing for far more convincing models.
Using concatenative synthesis to model an improvising instrumental musician typically involves several steps:

1. Segmentation of a large soundfile: This involves dividing a large audio recording of the musician’s performance into smaller units or segments.
2. Analysis: These segments are then organised in a database according to various descriptor data (MFCC in our case).
3. Scaling/pre-processing: Scaling is applied for better visualisation.
4. Dimension reduction: Based on MFCC descriptors, the dimensionality of the data is reduced in order to make it more manageable and easier to work with. This can be done using techniques such as principal component analysis (PCA), singular value decomposition (SVD), or Uniform Manifold Approximation and Projection (UMAP, preferred in our case).
5. Near neighbours sequencing: Once the segments have been organised and analysed, the algorithm selects and combines them in real time based on certain input parameters or rules to create a simulated musical performance that sounds like it is being improvised by the musician. We use here a near neighbours algorithm, which selects segments that are similar in some way (e.g., in terms of pitch, loudness, or timbre, thanks to similarities revealed by UMAP on MFCCs in our case) to the current segment being played.

We will now describe these steps in further detail:

5.1 Slicing

Slicing a sound file musically allows various possible exploitations in the realm of corpus-based concatenative synthesis (CBCS). In MuBu, onset detection is done with pipo.onseg or pipo.gate. FluCoMa exposes five different onset detection algorithms:

1. fluid.ampslice: Amplitude-based detrending slicer
2. fluid.ampgate: Gate detection on a signal
3. fluid.onsetslice: Spectral difference-based audio buffer slicer
4. fluid.noveltyslice: Based on self-similarity matrix (SSM)
5. fluid.transientslice: Implements a de-clicking algorithm

Only onsetslice was extensively tested. The only tweaked parameters were a straightforward “threshold” as well as a “minslicelength” argument, which determines the shortest slice allowed (the minimum duration of a slice), expressed in hopSize units. This introduces a common limitation in CBCS: the system strongly biases the user to choose short samples for better analysis results and more interactivity when controlling the database with a gesture follower. Aaron Einbond remarks, about his use of CataRT, how short samples best suited his intention: “Short samples containing rapid, dry attacks, such as close-miked key-clicks, were especially suitable for a convincing impression of motion of the single WFS source. The effect is that of a virtual instrument moving through the concert hall in tandem with changes in its timbral content, realizing Wessel’s initial proposal.” [?]
A related limitation of concatenative synthesis lies in the fact that short samples will demonstrate the efficiency of the algorithm 7, but at the same time move away from the “plausible simulation” sought in the present study. A balance must therefore be found between the freedom imposed by large samples and the refined control one can obtain with short samples.
A direct concatenation of slices clicks in most cases at the edit point, which can be avoided through the use of ramps. The second most noticeable glitch on concatenation concerns the interruption of low-register resonances, which even a large reverb fails to make sound plausible. Having a low threshold and a large “minslicelength” results in equidistant slices, all of identical duration, as the pipo.onseg object in MuBu would do.
Because we listen to sound in time, this parameter, responsible for the duration of samples, is of primary importance.

5.2 MFCC on each slice - across one whole slice/segment

Multidimensional MFCC analysis: MFCC (Mel-Frequency Cepstral Coefficient) analysis is a technique used to extract features from audio signals that are relevant for speech and music recognition. It involves calculating a set of coefficients that represent the spectral envelope of the audio signal, or decomposing a sound signal into a set of frequency bands and representing the power spectrum of each band with a set of coefficients. The resulting MFCCs capture important spectral characteristics of the sound signal (albeit hardly interpretable by the novice user), such as the frequency and magnitude of the spectral peaks. We will see that, combined with UMAP, this analysis is able to capture the spectral characteristics of the musician’s playing style.
By applying UMAP to the MFCC coefficients of a sound signal, it is possible to create a visual representation of the sound that preserves the neighbourhood relationships between the MFCC vectors of its slices (see Fig. ??).

6 For cello: https://youtu.be/L-MiKmsIzjM For various instruments: https://www.youtube.com/playlist?list=PLc WX6wY4JtnNqu4Lwe2YzEUq9S1IMvUk For flute: https://www.youtube.com/playlist?list=PLc WX6wY4JtlbjLuLHDZhlx78sTDm aqs
7 e.g. https://youtu.be/LD0ivjyuqMA?t=3032
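Steps 3 to 5 of the workflow listed above (scaling, dimension reduction, near neighbours sequencing) can be illustrated with a minimal, self-contained Python sketch. It is not the FluCoMa/PatchXR implementation: the function names (`standardize`, `pca_project`, `neighbour_walk`), the toy descriptor values and the `memory` rule that prevents immediate repetitions are all assumptions made for illustration, and PCA (computed by power iteration) stands in for UMAP, which the text prefers, only because it fits in a few lines without external libraries.

```python
import math
import random


def unit(vec):
    """Return vec scaled to unit length (unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def standardize(rows):
    """Step 3: scale every descriptor column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    # A constant column has zero spread; dividing by 1.0 leaves it at zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
            for col, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def pca_project(rows, n_components=2, n_iter=100, seed=0):
    """Step 4: project centred rows onto their leading principal axes.

    Assumption: PCA by power iteration replaces UMAP in this sketch.
    """
    n, dim = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / n for j in range(dim)]
           for i in range(dim)]
    rng = random.Random(seed)
    axes = []
    for _ in range(n_components):
        vec = unit([rng.random() for _ in range(dim)])
        for _ in range(n_iter):
            nxt = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
            if not any(nxt):
                break
            vec = unit(nxt)
        eigenvalue = sum(vec[i] * cov[i][j] * vec[j]
                         for i in range(dim) for j in range(dim))
        axes.append(vec)
        # Deflation: remove the axis just found so the next pass finds another.
        cov = [[cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(dim)]
               for i in range(dim)]
    return [[sum(r[i] * axis[i] for i in range(dim)) for axis in axes]
            for r in rows]


def neighbour_walk(points, start=0, steps=6, memory=3):
    """Step 5: chain slices by jumping to the nearest one not played recently.

    `memory` (>= 1) is a hypothetical rule: the last `memory` slices are skipped.
    """
    path = [start]
    for _ in range(steps):
        recent = set(path[-memory:])
        candidates = [i for i in range(len(points)) if i not in recent]
        if not candidates:
            break
        here = points[path[-1]]
        path.append(min(candidates, key=lambda i: math.dist(here, points[i])))
    return path


# Invented stand-in for per-slice MFCC means: 8 slices x 4 descriptors,
# forming two timbral groups (slices 0-3 and slices 4-7).
slices = [
    [0.0, 0.1, 0.2, 0.0], [0.1, 0.0, 0.0, 0.2],
    [0.2, 0.2, 0.1, 0.1], [0.0, 0.2, 0.1, 0.3],
    [5.0, 5.1, 5.2, 5.1], [5.1, 5.0, 5.0, 5.0],
    [5.2, 5.2, 5.1, 5.3], [5.0, 5.2, 5.3, 5.2],
]
coords = pca_project(standardize(slices), n_components=2)
path = neighbour_walk(coords, start=0, steps=6, memory=3)
print(path)
```

With this toy corpus the walk starts in the first timbral group and keeps circulating among its four slices, because every slice of the other group lies much further away on the map. Using three components instead of two would give the kind of 3D coordinates that a VR scene needs.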
[3] L. Turchet, “Musical Metaverse: vision, opportunities,
and challenges,” Personal and Ubiquitous Computing,
01 2023.
[4] F. Berthaut, “3D interaction techniques for musical
expression,” Journal of New Music Research, vol. 49,
no. 1, pp. 60–72, 2020.
[5] B. Loveridge, “Networked music performance in
virtual reality: current perspectives,” Journal of Network
Music and Arts, vol. 2, no. 1, p. 2, 2020.
[6] A. Çamcı and R. Hamilton, “Audio-first VR: new
perspectives on musical experiences in virtual
environments,” Journal of New Music Research, vol. 49, no. 1,
pp. 1–7, 2020.
[7] D. Schwarz, G. Beller, B. Verbrugghe, and S. Britton,
“Real-Time Corpus-Based Concatenative Synthesis
with CataRT,” in 9th International Conference on
Digital Audio Effects (DAFx), Montreal, Canada, Sep.
2006, pp. 279–282, IRCAM internal reference: Schwarz06c.
[Online]. Available:
https://hal.archives-ouvertes.fr/hal-01161358
[8] J.-P. Briot, G. Hadjeres, and F.-D. Pachet, Deep
Learning Techniques for Music Generation – A
Survey, Aug. 2019. [Online]. Available:
https://hal.sorbonne-universite.fr/hal-01660772
[9] P. Esling, A. Chemla-Romeu-Santos, and A. Bitton,
“Generative timbre spaces with variational audio
synthesis,” CoRR, vol. abs/1805.08501, 2018. [Online].
Available: http://arxiv.org/abs/1805.08501
[10] D. L. Wessel, “Timbre Space as a Musical Control
Structure,” Computer Music Journal, vol. 3, no. 2, pp.
45–52, 1979. [Online]. Available:
http://www.jstor.org/stable/3680283
[16] L. Garber, T. Ciccola, and J. C. Amusategui,
“AudioStellar, an open source corpus-based musical
instrument for latent sound structure discovery and sonic
experimentation,” 12 2020.
[17] B. Hackbarth, N. Schnell, P. Esling, and D. Schwarz,
“Composing Morphology: Concatenative Synthesis
as an Intuitive Medium for Prescribing Sound
in Time,” Contemporary Music Review, vol. 32,
no. 1, pp. 49–59, 2013. [Online]. Available:
https://hal.archives-ouvertes.fr/hal-01577895
[18] N. Schnell, A. Roebel, D. Schwarz, G. Peeters, and
R. Borghesi, “MUBU and FRIENDS - ASSEMBLING
TOOLS FOR CONTENT BASED REAL-TIME
INTERACTIVE AUDIO PROCESSING IN MAX/MSP,”
Proceedings of the International Computer Music
Conference (ICMC 2009), 01 2009.
[19] P. A. Tremblay, G. Roma, and O. Green, “Enabling
Programmatic Data Mining as Musicking: The
Fluid Corpus Manipulation Toolkit,” Computer Music
Journal, vol. 45, no. 2, pp. 9–23, 06 2021. [Online].
Available: https://doi.org/10.1162/comj_a_00600
[20] F. Bevilacqua and R. Müller, “A Gesture follower for
performing arts,” 05 2005.
[21] N. Schnell, D. Schwarz, J. Larralde, and R. Borghesi,
“PiPo, a Plugin Interface for Afferent Data Stream
Processing Operators,” in International Society for Music
Information Retrieval Conference, 2017.
[22] R. Fiebrink and P. Cook, “The Wekinator: A System
for Real-time, Interactive Machine Learning in
Music,” Proceedings of The Eleventh International
Society for Music Information Retrieval Conference
(ISMIR 2010), 01 2010.
[23] A. Einbond and D. Schwarz, “Spatializing Tim-
bre With Corpus-Based Concatenative Synthesis,” 06
2010.
[24] G. Roma, O. Green, and P. A. Tremblay, “Adaptive
Mapping of Sound Collections for Data-driven Musical
Interfaces,” in New Interfaces for Musical Expression,
2019.
[25] B. D. Smith and G. E. Garnett, “Unsupervised Play:
Machine Learning Toolkit for Max,” in New Interfaces
for Musical Expression, 2012.
[26] PrÉ : connected polyphonic immersion. Zenodo,
Jul. 2022. [Online]. Available: https://doi.org/10.5281/
zenodo.6806324
[27] R. Mills, “Tele-Improvisation: Intercultural Interaction
in the Online Global Music Jam Session,” in Springer
Series on Cultural Computing, 2019. [Online].
Available: https://api.semanticscholar.org/CorpusID:
57428481
[28] M. Puckette, “Not Being There,” Contemporary
Music Review, vol. 28, no. 4-5, pp. 409–412,
2009. [Online]. Available: https://doi.org/10.1080/
07494460903422354
[29] G. Hajdu, “Embodiment and disembodiment in
networked music performance,” 2017. [Online].
Available: https://api.semanticscholar.org/CorpusID:
149523160
[30] L. Manovich, “Database as Symbolic Form,” Conver-
gence: The International Journal of Research into New
Media Technologies, vol. 5, pp. 80 – 99, 1999.