Introduction to Computer Music

Roger B. Dannenberg
Contents

1 Introduction
1.1 Theory and Practice
1.2 Fundamentals of Computer Sound
1.3 Nyquist, SAL, Lisp
1.4 Using SAL In the IDE
1.5 Examples
1.6 Constants, Variables and Functions
1.7 Defining Functions
1.8 Simple Commands
1.9 Control Constructs

2 Basics of Synthesis
2.1 Unit Generators
2.2 Storing Sounds or Not Storing Sounds
2.3 Waveforms
2.4 Piece-wise Linear Functions: pwl
2.5 Basic Wavetable Synthesis
2.6 Introduction to Scores
2.7 Scores
2.8 Summary

3 Sampling Theory Introduction
3.9 Summary

4 Frequency Modulation
4.1 Introduction to Frequency Modulation
4.2 Theory of FM
4.3 Frequency Modulation with Nyquist
4.4 Behavioral Abstraction
4.5 Sequential Behavior (seq)
4.6 Simultaneous Behavior (sim)
4.7 Logical Stop Time
4.8 Scores in Nyquist
4.9 Summary

12 Spectral Modeling, Algorithmic Control, 3-D Sound
12.1 Additive Synthesis and Table-Lookup Synthesis
12.2 Spectral Interpolation Synthesis
12.3 Algorithmic Control of Signal Processing
12.4 3-D Sound
Electrical Engineering students for whom this approach made great sense.
Eventually, I created a mostly online version of the course. After committing
lectures to video, it made sense to create some lecture notes to complement
the video format. These lecture notes are now compiled into this book, which
forms a companion to the online course.
Chapters are each intended to cover roughly a week of work and mostly
contain a mix of theory and practice, including many code examples in Nyquist.
Remember there is a Nyquist Reference Manual full of details and additional
tutorials.
Roger B. Dannenberg December, 2020
Acknowledgments
Thanks to Sai Samarth and Shuqi Dai for editing assistance throughout. Sub-
stantial portions of Chapters 4 and 10 were adapted from course notes, on FM
synthesis and panning, by Anders Öland.
Portions of this book are taken almost verbatim from Music and Comput-
ers, A Theoretical and Historical Approach (sites.music.columbia.edu/
cmc/MusicAndComputers/) by Phil Burk, Larry Polansky, Douglas Repetto,
Mary Robert, and Dan Rockmore.
Other portions are taken almost verbatim from Introduction to Computer
Music: Volume One (www.indiana.edu/~emusic/etext/toc.shtml) by
Jeffrey Hass. I would like to thank these authors for generously sharing their
work and knowledge.
Chapter 1
Introduction
Topics Discussed: Sound, Nyquist, SAL, Lisp and Control Constructs
1. There are no limits to the range of sounds that a computer can help ex-
plore. In principle, any sound can be represented digitally, and therefore
any sound can be created.
from the highest level of composition, form and structure, to the min-
utest detail of an individual sound. Unlike with conventional music, we
can use automation to hear the results of these decisions quickly and we
can refine computer programs accordingly.
4. Computers can blur the lines between the traditional roles of the com-
poser and the performer and even the audience. We can build interactive
systems where, thanks to automation, composition is taking place in real
time.
The creative potential for musical composition and sound generation empow-
ered a revolution in the world of music. That revolution in the world of elec-
troacoustic music engendered a wonderful synthesis of music, mathematics
and computing.
1. something moves,
3. something (or someone) hears the results of that movement (though this
is philosophically debatable).
All things that make sound move, and in some very metaphysical sense, all
things that move (if they don’t move too slowly or too quickly) make sound.
As things move, they push and pull at the surrounding air (or water or whatever
medium they occupy), causing pressure variations (compressions and rarefac-
tions). Those pressure variations, or sound waves, are what we hear as sound.
Sound is produced by a rapid variation in the average density or pressure of
air molecules above and below the current atmospheric pressure. We perceive
sound as these pressure fluctuations cause our eardrums to vibrate. When dis-
cussing sound, these usually minute changes in atmospheric pressure are re-
ferred to as sound pressure and the fluctuations in pressure as sound waves.
Figure 1.2: Illustration of how waveform changes with the change in fre-
quency.
able Hertz (Hz) (60 cps = 60 Hz), named after the 19th C. physicist. 1000 Hz
is often referred to as 1 kHz (kilohertz) or simply “1k” in studio parlance.
The range of human hearing in the young is approximately 20 Hz to 20
kHz—the higher number tends to decrease with age (as do many other things).
It may be quite normal for a 60-year-old to hear a maximum of 16,000 Hz.
Frequencies above and below the range of human hearing are also commonly
used in computer music studios.
Amplitude is the objective measurement of the degree of change (posi-
tive or negative) in atmospheric pressure (the compression and rarefaction of
air molecules) caused by sound waves. Sounds with greater amplitude will
produce greater changes in atmospheric pressure from high pressure to low
pressure to the ambient pressure present before sound was produced (equilib-
rium). Humans can hear atmospheric pressure fluctuations of as little as a few
billionths of an atmosphere (the ambient pressure), and this amplitude is called
the threshold of hearing. On the other end of the human perception spectrum,
a super-loud sound near the threshold of pain may have 100,000 times the pressure amplitude of the threshold of hearing, yet that is still only about a 0.03% change in the actual atmospheric pressure at your eardrum. We hear amplitude variations over
about 5 orders of magnitude from threshold to pain.
Figure 1.3: Before audio recording became digital, sounds were “carved” into
vinyl records or written to tape as magnetic waveforms. The left image shows wiggles in vinyl record grooves and the right image shows a typical tape used
to store audio data.
Figure 1.4: An analog waveform and its digital cousin: the analog wave-
form has smooth and continuous changes, and the digital version of the same
waveform consists only of a set of points shown as small black squares. The
grey lines suggest that the rest of the signal is not represented – all that the
computer knows about are the discrete points marked by the black squares.
There is nothing in between those points. (It is important to note, however,
that it might be possible to recover the original continuous signal from just the
samples.)
To convert the digital audio into the analog format, we use Digital to Ana-
log Converters. A Digital to Analog Converter, or DAC, is an electronic device
that converts a digital code to an analog signal such as a voltage, current, or
electric charge. Signals can easily be stored and transmitted in digital form;
a DAC is used to convert the signal into a form that can be perceived by human senses or used by non-digital systems.
ments, and we would use a smoother waveform. (We will talk about why
digital audio waveforms have to be smooth later.) It is common to use this
technique to generate sinusoids. Of course, you could just call sin(phase) for
every sample, but in most cases, pre-computing values of the sin function and
saving them in a table, then reading the samples from memory, is much faster
than computing the sin function once per sample.
Instead of synthesizing sinusoids, we can also synthesize complex wave-
forms such as triangle, sawtooth, and square waves of analog synthesizers, or
waveforms obtained from acoustic instruments or human voices.
We will learn about many other synthesis algorithms and techniques, but
the table-lookup oscillator is a computationally efficient method to produce
sinusoids and more complex periodic signals. Besides being efficient, this
method offers direct control of amplitude and frequency, which are very im-
portant control parameters for making music. The main drawback of table-
lookup oscillators is that the waveform or wave shape is fixed, whereas most
musical tones vary over time and with amplitude and frequency. Later, we
will see alternative approaches to sound synthesis and also learn about filters,
which can be used to alter wave shapes.
2 The Nyquist Reference Manual is included as PDF and HTML in the Nyquist download; also
available online: www.cs.cmu.edu/~rbd/doc/nyquist
3 All language designers tell you this. Don’t believe any of them.
1.3.1 SAL
Nyquist is based on the Lisp language. Many users found Lisp’s syntax un-
familiar, and eventually Nyquist was extended with support for SAL, which
is similar in semantics to Lisp, but similar in syntax to languages such as
Python and Javascript. The NyquistIDE supports two modes, Lisp and SAL.
SAL mode means that Nyquist reads and evaluates SAL commands rather than
Lisp. The SAL mode prompt is “SAL> ” while the Lisp mode prompt is “> ”.
When Nyquist starts, it normally enters SAL mode automatically, but certain
errors may exit SAL mode. You can reenter SAL mode by typing the Lisp
expression (sal) or finding the button labeled SAL in the IDE.
In SAL mode, you type commands in the SAL programming language.
Nyquist reads the commands, compiles them into Lisp, and evaluates the com-
mands. Some examples of SAL commands are the following:
• print expression – evaluate an expression and print the result.
• exec expression – evaluate expression but do not print the result.
• play expression – evaluate an expression and play the result, which
must be a sound.
• set var = expression – set a variable.
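For instance, here are a few commands you could type at the SAL prompt (pluck and the pitch name c4 appear in the examples later in this chapter):

print "hello", 1 + 2    ; prints hello followed by 3
set pitch = c4          ; assign the value of c4 to the variable pitch
play pluck(pitch)       ; compute and play a plucked-string sound
exec list(1, 2, 3)      ; evaluated, but the result is not printed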
cussed in the Nyquist Reference Manual. You should take time to learn:
• How to switch to SAL mode. In particular, you can “pop” out to the top
level of Nyquist by clicking the “Top” button; then, you can enter SAL
mode by clicking the “SAL” button.
• How to run a SAL command, e.g. type print "hello world" in the
input window at the upper left.
• How to create a new file. In particular, you should normally save a new
empty file to a file named something.sal in order to tell the editor this
is a SAL file and thereby invoke the SAL syntax coloring, indentation
support, etc.
• How to execute a file by using the Load menu item or keyboard shortcut.
1.5 Examples
This would be a good time to install and run the NyquistIDE. You can find
Nyquist downloads on sourceforge.net/projects/nyquist, and “readme” files
contain installation guidelines.
The program named NyquistIDE is an “integrated development environ-
ment” for Nyquist. When you run NyquistIDE, it starts the Nyquist program
and displays all Nyquist output in a window. NyquistIDE helps you by provid-
ing a Lisp and SAL editor, hints for command completion and function param-
eters, some graphical interfaces for editing envelopes and graphical equalizers,
and a panel of buttons for common operations. A more complete description
of NyquistIDE is in Chapter “The NyquistIDE Program” in the Nyquist Refer-
ence Manual.
For now, all you really need to know is that you can enter Nyquist com-
mands by typing into the upper left window. When you type return, the ex-
pression you typed is sent to Nyquist, and the results appear in the window
below. You can edit files by clicking on the New File or Open File buttons.
After editing some text, you can load the text into Nyquist by clicking the Load
button. NyquistIDE always saves the file first; then it tells Nyquist to load the
file. You will be prompted for a file name the first time you load a new file.
Try some of these examples. These are SAL commands, so be sure to enter
SAL mode. Then, just type these one-by-one into the upper left window.
play pluck(c4)
play pluck(c4) ~ 3
play piano-note(5, fs1, 100)
play osc(c4)
play osc(c4) * osc(d4)
play pluck(c4) ~ 3
play noise() * env(0.05, 0.1, 0.5, 1, 0.5, 0.4)
SAL and Lisp convert all variable letters to upper case, so foo and FOO
and Foo all denote the same variable. The preferred way to write variables
and functions is in all lower case. (There are ways to create symbols and
variables with lower case letters, but this should be avoided.)
A symbol with a leading colon (:) evaluates to itself. E.g. :foo has the
value :FOO. Otherwise, a symbol denotes either a local variable, a formal pa-
rameter, or a global variable. As in Lisp, variables do not have data types or
type declarations. The type of a variable is determined at runtime by its value.
Functions in SAL include both operators, e.g. 1 + 2 and standard func-
tion notation, e.g. sqrt(2). The most important thing to know about opera-
tors is that you must separate operators from operands with white space. For
example, a + b is an expression that denotes “a plus b”, but a+b (no spaces)
denotes the value of a variable with the unusual name of “A+B”.
Functions are invoked using what should be familiar notation, e.g. sin(pi)
or max(x, 100). Some functions (including max) take a variable number of
arguments. Some functions take keyword arguments, for example
string-downcase("ABCD", start: 2)
Note that space and newlines are ignored, so that could be equivalently
written:
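string-downcase("ABCD",
                start: 2)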
A function definition gives the function name, its formal parameters, and a statement that serves as the function body, e.g.:

define function three()
  return 3
The formal parameters may be positional parameters that are matched with
actual parameters by position from left to right. Syntactically, these are symbols, and these symbols are essentially local variables that exist only until the statement completes or a return statement causes the function evaluation to end. As
in Lisp, parameters are passed by value, so assigning a new value to a formal
parameter has no effect on the caller. However, lists and arrays are not copied,
so internal changes to a list or array produce observable side effects.
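For example, here is a small sketch (the function name bump is made up for illustration):

define function bump(x)
  begin
    set x += 1
    return x
  end

set a = 5
print bump(a)  ; prints 6
print a        ; prints 5: the caller's variable is unchanged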
Alternatively, formal parameters may be keyword parameters. Here the
parameter is actually a pair: a keyword parameter, which is a symbol followed
by a colon, and a default value, given by any expression. Within the body
of the function, the keyword parameter is named by a symbol whose name
matches the keyword parameter except there is no final colon.
exec foo(x: 6, y: 7)
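For instance, a definition that this call could match is the following sketch (the default values 1 and 2 are arbitrary choices):

define function foo(x: 1, y: 2)
  print "x =", x, "y =", y

exec foo(x: 6, y: 7)  ; prints: x = 6 y = 7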
1.8.2 load
load expression
The load command loads a file named by expression, which must evaluate to
a string path name for the file. To load a file, SAL interprets each statement
in the file, stopping when the end of the file or an error is encountered. If the
file ends in .lsp, the file is assumed to contain Lisp expressions, which are
evaluated by the XLISP interpreter. In general, SAL files should end with the
extension .sal.
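For example (the file name here is hypothetical):

load "my-instrument.sal"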
1.8.3 play
play expr
The play statement plays the sound computed by expr, an expression.
1.8.4 plot
plot expr, dur, n
The plot statement plots the sound denoted by expr, an expression. If you
plot a long sound, the plot statement will by default truncate the sound to 2.0
seconds and resample the signal to 1000 points. The optional dur is an ex-
pression that specifies the (maximum) duration to be plotted, and the optional
n specifies the number of points to be plotted. Executing a plot statement is
equivalent to calling the s-plot function.
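For example, this plots roughly one period (about 4 ms) of the sinusoid produced by osc(c4), which was used in the examples above:

plot osc(c4), 0.004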
1.8.5 print
print expr , expr ...
The print statement prints the values separated by spaces and followed by a
newline. There may be 0, 1, or more expressions separated by commas (,).
1.8.6 display
display string, expression, expression
The display statement is handy for debugging. When executed, display
prints the string followed by a colon (:) and then, for each expression, the
expression and its value are printed; after the last expression, a newline is
printed. For example,
display "In function foo", bar, baz
prints
In function foo : bar = 23, baz = 5.3
SAL may print the expressions using Lisp syntax, e.g. if the expression is
“bar + baz,” do not be surprised if the output is:
(sum bar baz) = 28.3
1.8.7 set
set var op expression
The set statement changes the value of a variable var according to the opera-
tor op and the value of expression. The operators are:
= The value of expression is assigned to var.
+= The value of expression is added to var.
*= The value of var is multiplied by the value of the expression.
&= The value of expression is inserted as the last element of the list referenced by var. If var is the empty list (denoted by nil or #f), then var is assigned a newly constructed list of one element, the value of expression.
^= The value of expression, a list, is appended to the list referenced by var. If var is the empty list (denoted by nil or #f), then var is assigned the (list) value of expression.
@= Pushes the value of expression onto the front of the list referenced by var. If var is empty (denoted by nil or #f), then var is assigned a newly constructed list of one element, the value of expression.
<= Sets the new value of var to the minimum of the old value of var and the
value of expression.
>= Sets the new value of var to the maximum of the old value of var and the
value of expression.
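For example, here is a short sketch exercising a few of these operators:

set x = 10
set x += 5       ; x is now 15
set x <= 12      ; x becomes min(15, 12) = 12
set items = nil  ; start with an empty list
set items &= 1   ; items is {1}
set items &= 2   ; items is {1 2}
set items @= 0   ; items is {0 1 2}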
begin
with db = 12.0,
linear = db-to-linear(db)
print db, "dB represents a factor of", linear
set scale-factor = linear
end
1.9.3 loop
The loop statement is by far the most complex statement in SAL, but it offers
great flexibility for just about any kind of iteration. However, when computing
sounds, loops are generally the wrong approach, and there are special func-
tions such as seqrep and simrep to use iteration to create sequential and
simultaneous combinations of sounds as well as special functions to iterate
over scores, apply synthesis functions, and combine the results.
Therefore, loops are mainly for imperative programming where you want
to iterate over lists, arrays, or other discrete structures. You will probably need
loops at some point, so at least scan this section to see what is available, but
there is no need to dwell on this section for now.
The basic function of a loop is to repeatedly evaluate a sequence of actions
which are statements. The syntax for a loop statement is:
loop [ with-stmt ] { stepping }* { stopping }* { action }+
[ final ] end
Before the loop begins, local variables may be declared in a with state-
ment.
The stepping clauses do several things. They introduce and initialize ad-
ditional local variables similar to the with statement. However, these local
variables are updated to new values after the actions. In addition, some step-
ping clauses have associated stopping conditions, which are tested on each
iteration before evaluating the actions.
There are also stopping clauses that provide additional tests to stop the it-
eration. These are also evaluated and tested on each iteration before evaluating
the actions.
When some stepping or stopping condition causes the iteration to stop,
the final clause is evaluated (if present). Local variables and their values can
still be accessed in the final clause. After the final clause, the loop statement
completes.
The stepping clauses are the following:
repeat expression
Sets the number of iterations to the value of expression, which should be
an integer (FIXNUM).
is no to, below, downto, or above clause, no iteration stop test is created for
this stepping clause.)
The stopping clauses are the following:
while expression
until expression
finally statement
; iterate 10 times
loop
repeat 10
print random(100)
end
play pluck-chord(c3, 5, 2)
play pluck-chord(d3, 7, 4) ~ 3
play pluck-chord(c2, 10, 7) ~ 8
Note that this version of the function is substantially smaller (loop is pow-
erful, but sometimes a bit verbose). In addition, one could argue this simrep
version is more correct – in the case where n is 0, this version returns silence,
whereas the loop version always initializes s to a pluck sound, even if n is
zero, so it never returns silence.
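For reference, a simrep version along the lines discussed might look like the following sketch, assuming the three parameters are a starting pitch, an interval in steps, and the number of notes:

define function pluck-chord(pitch, interval, n)
  return simrep(i, n, pluck(pitch + interval * i))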
Chapter 2
Basics of Synthesis
Topics Discussed: Unit Generators, Implementation, Functional Pro-
gramming, Wavetable Synthesis, Scores in Nyquist, Score Manipulation
that are “consumed” by other unit generators and quickly deleted to conserve
memory.
Figure 2.2 shows how unit generators can be combined. Outputs from
an oscillator and an envelope generator serve as inputs to the multiply unit
generator in this figure.
Figure 2.3 shows how the “circuit diagram” or “signal flow diagram” nota-
tion used in Figure 2.2 relates to the functional notation of Nyquist. As you can
see, wherever there is output from one unit generator to the input of another
as shown on the left, we can express that as nested function calls as shown in
the expression on the right.
2.1.2 Evaluation
Normally, Nyquist expressions (whether written in SAL or Lisp syntax) eval-
uate their parameters, then apply the function. If we write f(a, b), Nyquist
will evaluate a and b, then pass the resulting values to function f.
Sounds are different. If Nyquist evaluated sounds immediately, they could
be huge. Even something as simple as multiply could require memory for two
huge input sounds and one equally huge output sound. Multiplying two 10-
minute sounds would require 30 minutes’ worth of memory, or about 300MB.
This might not be a problem, but what happens if you are working with multi-
channel audio, longer sounds, or more parameters?
To avoid storing huge values in memory, Nyquist uses lazy evaluation.
Sounds are more like promises to deliver samples when asked, or you can
think of a sound as an object with the potential to compute samples. Samples
are computed only when they are needed. Nyquist Sounds can contain either
samples or the potential to deliver samples, or some combination.
play sound-expression
f(g(x), h(x))
will have to write some extra code to derive a new sound at the desired starting
time.
Instead of using global variables, you should generally use (global) func-
tions. Here is an example of something to avoid:
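A minimal sketch of the kind of code meant (the global name *mysound* is made up):

set *mysound* = osc(c4) ~ 30  ; a 30-second sound; its samples stay in memory
                              ; as long as the global variable references them
play *mysound*                ; plays, but the stored sound has a fixed start time
play *mysound* ~ 2            ; and ignores transformations: this is not stretched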
2.3 Waveforms
Our next example will be presented in several steps. The goal is to create
a sound using a wavetable consisting of several harmonics as opposed to a
simple sinusoid. We begin with an explanation of harmonics. Then, in order to
build a table, we will use a function that adds harmonics to form a wavetable.
Partials are the individual frequency components of a sound and are not necessarily harmonic. Harmonics or harmonic partials are integer (whole number) multiples of the fundamental frequency f (1f, 2f, 3f, 4f, ...). The term overtones refers to any partials above the fundamental. For convention's sake, we usually refer to the fundamental as partial #1. The first few harmonic partials are the fundamental frequency, the octave above, the octave plus a perfect fifth, two octaves, two octaves and a major third, and two octaves and a perfect fifth, as pictured in Figure 2.4 for the pitch "A." After the eighth partial, the pitches begin to grow ever
closer and do not necessarily correspond closely to equal-tempered pitches, as
shown in the chart. In fact, even the fifths and thirds are slightly off their equal-
tempered frequencies. You may note that the first few pitches correspond to
the harmonic nodes of a violin string (or of any vibrating string).
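Here is such a function; apart from the name of the global variable it sets, it is identical to the *mytable* version shown again below:

define function mkwave()
  begin
    set *table* = 0.5 * build-harmonic(1, 2048) +
                  0.25 * build-harmonic(2, 2048) +
                  0.125 * build-harmonic(3, 2048) +
                  0.0625 * build-harmonic(4, 2048)
    set *table* = list(*table*, hz-to-step(1.0), #t)
  end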
Now that we have defined a function, the last step of this example is to
build the wave. The following code calls mkwave, which sets *table* as a
side effect:
exec mkwave()
This simple approach (setting *table*) is fine if you want to use the same
waveform all the time, but in most cases, you will want to compute or select a
waveform, use it for one sound, and then compute or select another waveform
for the next sound. Using the global default waveform *table* is awkward.
A better way is to pass the waveform directly to osc. Here is an example
to illustrate:
;; redefine mkwave to set *mytable* instead of *table*
define function mkwave()
begin
set *mytable* = 0.5 * build-harmonic(1, 2048) +
0.25 * build-harmonic(2, 2048) +
0.125 * build-harmonic(3, 2048) +
0.0625 * build-harmonic(4, 2048)
set *mytable* = list(*mytable*, hz-to-step(1.0), #t)
end
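With this definition, the table is passed directly to osc as its third parameter (the second parameter, 1.0, is the duration):

exec mkwave()
play osc(c4, 1.0, *mytable*)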
Now, you should be thinking “wait a minute, you said to avoid setting
global variables to sounds, and now you are doing just that with these wave-
form examples. What a hypocrite!” Waveforms are a bit special because they
are
• typically short so they do not claim much memory,
• typically used many times, so there can be significant savings by com-
puting them once and saving them,
• not used directly as sounds but only as parameters to oscillators.
You do not have to save waveforms in variables, but it is common practice, in
spite of the general advice to keep sounds out of global variables.
pwlv(v0, t1, v1, t2, v2, ..., tn, vn) – "v" for "value first" is used for signals with non-zero starting and ending points
pwev(v1, t2, v2, ..., tn, vn) – exponential interpolation, vi > 0
pwlr(i1, v1, i2, v2, ..., in) – relative intervals rather than absolute times
See the Nyquist Reference Manual for more variants and combinations.
(duration given by dur is optional). One advantage of env over pwl is that
env allows you to give fixed attack and decay times that do not stretch with
duration. In contrast, the default behavior for pwl is to stretch each segment in
proportion when the duration changes. (We have not really discussed duration
in Nyquist, but we will get there later.)
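As a concrete sketch (the breakpoint values are chosen arbitrarily; ~ 3 stretches the behavior to 3 seconds):

; with pwl, every segment stretches in proportion to the duration
play (osc(c4) * pwl(0.1, 1, 0.9, 1, 1)) ~ 3
; with env, the 0.05 s attack and 0.1 s decay stay fixed when stretched
play (osc(c4) * env(0.05, 0.1, 0.5, 1, 0.5, 0.4)) ~ 3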
special name to denote half-steps? How about the Bach since J. S. Bach’s Well-Tempered Clavier
is a landmark in the development of the fixed-size half step, or the Schoenberg, honoring Arnold’s
development of 12-tone music. Wouldn’t it be cool to say 440 Hertz is 69 Bachs? Or to argue
whether it’s “Bach” or “Bachs?” But I digress ....
2.6.2 Lists
Scores are built on lists, so let’s learn about lists.
Lists in Nyquist
Lists in Nyquist are represented as standard singly-linked lists. Every element
cell in the list contains a link to the value and a link to the next element. The
last element links to nil, which can be viewed as pointing to an empty list.
Nyquist uses dynamic typing so that lists can contain any types of elements or
any mixture of types; types are not explicitly declared before they are used.
Also, a list can have nested lists within it, which means you can make any
binary tree structure through arbitrary nesting of lists.
Notation
Although we can manipulate pointers directly to construct lists, this is frowned
upon. Instead, we simply write expressions for lists. In SAL, we use curly
brace notations for literal lists, e.g. {a b c}. Note that the three elements here
are literal symbols, not variables (no evaluation takes place, so these symbols
denote themselves, not the values of variables named by the symbols). To con-
struct a list from variables, we call the list function with an arbitrary number
of parameters, which are the list elements, e.g. list(a, b, c). These pa-
rameters are evaluated as expressions in the normal way and the values of these
expressions become list elements.
set a = 1, b = 2, c = 3
print {a b c}
This prints: {a b c}. Why? Remember that the brace notation {} does
not evaluate anything, so in this case, a list of the symbols a, b and c is formed.
To make a list of the values of a, b and c, use list, which evaluates its
arguments:
print list(a, b, c)
This prints: {1 2 3}. The quote() form can enclose any expression, but typically just a symbol; it returns the enclosed expression without evaluation.
If you want to add an element to a list, there is a special function, cons:
print cons(a, {b})
This prints: {1 b}. Study this carefully; the first argument becomes the
first element of a new list. The elements of the second argument (a list) form
the remaining elements of the new list.
In contrast, here is what happens with list:
print list(a, {b c d})
This prints: {1 {b c d}}. Study this carefully too; the first argument
becomes the first element of a new list. The second argument becomes the
second element of the new list, so the new list has two elements.
2.7 Scores
In Nyquist, scores are represented by lists of data. The form of a Nyquist score
is the following:
{ sound-event-1
sound-event-2
...
sound-event-n }
where a sound event is also a list consisting of the event time, duration,
and an expression that can be evaluated to compute the event. The expression,
when evaluated, must return a sound:
{ {time-1 dur-1 expression-1}
{time-2 dur-2 expression-2}
...
{time-n dur-n expression-n} }
• Scores are data. You can compute scores by writing code and using the list construction functions from the previous section (and see the Nyquist Reference Manual for many more).
• Expressions in scores are lists, not SAL expressions. The first element
of the list is the function to call. The remaining elements form the pa-
rameter list.
In this case, the score occupies the time period from 1.2 to 6 seconds.
For example, if we want the previous score, which nominally ends at time 3, to contain an extra second of silence at the end, we can specify that the time span of the score is from 0 to 4 as follows:
{ {0 0 {score-begin-end 0 4}}
{0 1 {note pitch: 60 vel: 100}}
{1 1 {note pitch: 62 vel: 110}}
{2 1 {note pitch: 64 vel: 120}} }
set myscore = {
{0 0 {score-begin-end 0 4}}
{0 1 {note pitch: 60 vel: 100}}
{1 1 {note pitch: 62 vel: 110}}
{2 1 {note pitch: 64 vel: 120}} }
play timed-seq(myscore)
2.8 Summary
Now you should know how to build a simple wavetable instrument with the waveform consisting of any set of harmonics and with an arbitrary envelope controlled by pwl or env. You can also write or compute scores containing many instances of your instrument function organized in time, and you can synthesize the score using timed-seq. You might experiment by creating different waveforms, different envelopes, using non-integer pitches for micro-tuning, or notes overlapping in time to create chords or clusters. The code from this chapter is collected below:

define function mkwave()
  begin
    set *mytable* = 0.5 * build-harmonic(1, 2048) +
                    0.25 * build-harmonic(2, 2048) +
                    0.125 * build-harmonic(3, 2048) +
                    0.0625 * build-harmonic(4, 2048)
    set *mytable* = list(*mytable*, hz-to-step(1.0), #t)
  end

exec mkwave()

define function env-note(p)
  return osc(p, 1.0, *mytable*) *
         env(0.05, 0.1, 0.5, 1.0, 0.5, 0.4)

play timed-seq(myscore)
Chapter 3
Sampling Theory Introduction
Topics Discussed: Sampling Theory, Analog to/from Digital Conversion,
Fourier Synthesis, Aliasing, Quantization Noise, Amplitude Modulation
The principles which underlie almost all digital audio applications and de-
vices, be they digital synthesis, sampling, digital recording, or CD or iPod
playback, are based on the basic concepts presented in this chapter. New forms
of playback, file formats, compression, synthesis, processing and storage of
data are all changing on a seemingly daily basis, but the underlying mecha-
nisms for converting real-world sound into digital values and converting them
back into real-world sound have not varied much since Max Mathews developed
MUSIC I in 1957 at Bell Labs.
The theoretical framework that enabled musical pioneers such as Max
Mathews to develop digital audio programs stretches back over a century. The
groundbreaking theoretical work done at Bell Labs in the mid-20th century is
worth noting. Bell Labs was concerned with transmitting larger amounts of
voice data over existing telephone lines than could normally be sent with ana-
log transmissions, due to bandwidth restrictions. Many of the developments
pertaining to computer music were directly related to work on this topic.
Harry Nyquist, a Swedish-born physicist, laid out the principles for sampling analog signals at even intervals of time and at least twice the rate of the highest
frequency so they could be transmitted over telephone lines as digital signals
[Nyquist, 1928], even though the technology to do so did not exist at the time.
Part of this work is now known as the Nyquist Theorem. Nyquist worked
for AT&T, then Bell Labs. Twenty years later, Claude Shannon, mathemati-
cian and early computer scientist, also working at Bell Labs and then M.I.T.,
developed a proof for the Nyquist theory.1 The importance of their work to
information theory, computing, networks, digital audio, digital photography
and computer graphics (images are just signals in 2D!) cannot be overstated.
Pulse Code Modulation (PCM), a widely used method for encoding and de-
coding binary data, such as that used in digital audio, was also developed early
on at Bell Labs, attributed to John R. Pierce, a brilliant engineer in many fields,
including computer music.
In this chapter, we will learn the basic theory of sampling, or representing
continuous functions with discrete data.
1 The Shannon Theorem [Shannon, 1948], a pioneering work in information theory, should
not be confused with the Shannon Juggling Theory of the same author, in which he worked out
the mathematics of juggling numerous objects ((F + D)H = (V + D)N, where F is the time an object spends in the air, D is the time an object spends in a hand, V is the time a hand is vacant, N is the number of objects juggled, and H is the number of hands) – he was an avid juggler as well.
plifiers, speakers, and subsequently the air and our ears) into a continuous vi-
bration. Interpolation just means smoothly transitioning between the discrete
numerical values.
OK, keep watching, but now blink repeatedly. What does the trail look
like now? Because we are blinking our eyes, we’re only able to see the ball at
discrete moments in time. It’s on the way up, we blink, now it’s moved up a
bit more, we blink again, now it may be on the way down, and so on. We’ll
call these snapshots samples because we’ve been taking visual samples of the
complete trajectory that the ball is following. (See Figure 3.4.) The rate at
which we obtain these samples (blink our eyes) is called the sampling rate.
It’s pretty clear, though, that the faster we sample, the better chance we
have of getting an accurate picture of the entire continuous path along which
the ball has been traveling.
What’s the difference between the two views of the bouncing ball: the
blinking and the nonblinking? Each view pretty much gives the same picture
of the ball’s path. We can tell how fast the ball is moving and how high it’s
bouncing. The only real difference seems to be that in the first case the trail
is continuous, and in the second it is broken, or discrete. That’s the main
distinction between analog and digital representations of information: analog
information is continuous, while digital information is not.
Figure 3.5: An analog waveform and its digital cousin: the analog waveform
has smooth and continuous changes, and the digital version of the same wave-
form has a stairstep look. The black squares are the actual samples taken by
the computer. Note that the grey lines are only for show – all that the com-
puter knows about are the discrete points marked by the black squares. There
is nothing in between those points stored in the computer.
The gray “staircase” waveform in Figure 3.5 emphasizes the point that
digitally recorded waveforms are apparently missing some original informa-
tion from analog sources. However, by increasing the number of samples taken
each second (the sample rate), as well as increasing the accuracy of those samples, we can come very close to capturing the original signal.
Consider Figure 3.6. The graph at the top is our usual time domain graph, or
audiogram, of the waveform created by a five-note whistled melody. Time is
on the x-axis, and amplitude is on the y-axis.
The bottom graph is the same melody, but this time we are looking at a
time-frequency representation. The idea here is that if we think of the whistle
as made up of contiguous small chunks of sound, then over each small time
period the sound is composed of differing amounts of various pieces of fre-
quency. The amount of frequency y at time t is encoded by the brightness of
the pixel at the coordinate (t, y). The darker the pixel, the more of that fre-
quency at that time. For example, if you look at time 0.4 you see a band of
white, except near 2,500, showing that around that time we mainly hear a pure
tone of about 2,500 Hz, while at 0.8 seconds, there are contributions all around
from about 0 Hz to 5,500 Hz, but stronger ones at about 2,500 Hz and 200 Hz.
It looks like the signal only contains frequencies up to 8,000 Hz. If this
were the case, we would need to sample the sound at a rate of 16,000 Hz (16
kHz) in order to accurately reproduce the sound. That is, we would need to
take sound bites (bytes?!) 16,000 times a second.
In the next section, when we talk about representing sounds in the fre-
quency domain (as a combination of various amplitude levels of frequency
components, which change over time) rather than in the time domain (as a nu-
merical list of sample values of amplitudes), we’ll learn a lot more about the
ramifications of the Nyquist theorem for digital sound. For example, since the
human ear only responds to sounds up to about 20,000 Hz, we need to sample
sounds at least 40,000 times a second, or at a rate of 40,000 Hz, to represent
these sounds for human consumption. You may be wondering why we even
need to represent sonic frequencies that high (when the piano, for instance,
only goes up to the high 4,000 Hz range in terms of pitch or repetition rate). The answer is a matter of timbre, and particularly of spectrum: those higher frequencies do exist and fill out the descriptive sonic information.
Figure 3.7: Sine waves and phasors. As the sine wave moves forward in time,
the arrow goes around the circle at the same rate. The height of the arrow
(that is, how far it is above or below the x-axis) as it spins around in a circle is
described by the sine wave.
In other words, if we trace the arrow’s location on the circle and measure
the height of the arrow on the y-axis as our phasor goes around the circle, the
resulting curve is a sine wave!
As time goes on, the phasor goes round and round. At each instant, we
measure the height of the dot over the x-axis. Let’s consider a small exam-
ple first. Suppose the wheel is spinning at a rate of one revolution per sec-
ond. This is its frequency (and remember, this means that the period is 1
second/revolution). This is the same as saying that the phasor spins at a rate
of 360 degrees per second, or better yet, 2π radians per second (if we’re going
to be mathematicians, then we have to measure angles in terms of radians). So
2π radians per second is the angular velocity of the phasor.
This means that after 0.25 second the phasor has gone π/2 radians (90
degrees), and after 0.5 second it’s gone π radians or 180 degrees, and so on.
So, we can describe the amount of angle that the phasor has gone around at
time t as a function, which we call θ(t).
Now, let’s look at the function given by the height of the arrow as time
goes on. The first thing that we need to remember is a little trigonometry.
The sine and cosine of an angle are measured using a right triangle. For our
right triangle, the sine of θ, written sin(θ), is given by the equation: sin(θ) = a/c (see Figure 3.8).
This means that: a = c × sin(θ)
We'll make use of this in a minute, because in this example a is the height of our triangle.
Similarly, the cosine, written cos(θ), is: cos(θ) = b/c
This means that: b = c × cos(θ)
This will come in handy later, too.
Now back to our phasor. We’re interested in measuring the height at time
t, which we’ll denote as h(t). At time t, the phasor’s arrow is making an
angle of θ(t) with the x-axis. Our basic phasor has a radius of 1, so we get the following relationship: h(t) = sin(θ(t)) = sin(2πt). We also get this nice
graph of a function, which is our favorite old sine curve.
Now, how could we change this curve? Well, we could change the am-
plitude—this is the same as changing the length of our arrow on the phasor.
We’ll keep the frequency the same and make the radius of our phasor equal to
3. Then we get: h(t) = 3 × sin(2πt), which gives the nice curve in Figure 3.10, another kind of sinusoid (bigger!).
Now let’s start messing with the frequency, which is the rate of revolution
of the phasor. Let’s ramp it up a notch and instead start spinning at a rate of
five revolutions per second. Now: θ(t) = 5 × (2πt) = 10πt. This is easy to see since after 1 second we will have gone five revolutions, which is a total of 10π radians. Let's suppose that the radius of the phasor is 3. Again, at each moment we measure the height of our arrow (which we call h(t)), and we get: h(t) = 3 × sin(θ(t)) = 3 × sin(10πt). Now we get the sinusoid in Figure 3.11.
Then we get our basic sinusoid, but shifted ahead π/2. Does this look familiar?
This is the graph of the cosine function!
You can do some checking on your own and see that this is also the graph
that you would get if you plotted the displacement of the arrow from the y-axis.
So now we know that a cosine is a phase-shifted sine!
are equivalent, so we started with cosines and sines. Once we have the cosine
R(ω) and sine X(ω) parts, we can derive a representation in terms of phase.
For any given ω of interest, we can write the partial with frequency ω as A(ω) × sin(ωt + θ(ω)), where

A(ω) = √(R(ω)² + X(ω)²)

and

θ(ω) = arctan(X(ω)/R(ω))
Complex Representation
Rather than separating the sine and cosine parts, we can use complex numbers
because, among other properties, each complex number has a real part and an
imaginary part. We can think of a phasor as rotating in the complex plane, so
a single complex number can represent both amplitude and phase:
e^(jx) = cos(x) + j · sin(x)
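Concretely, if the coefficient at frequency ω is the complex number c(ω) = R(ω) + j · X(ω), then its magnitude and angle are exactly the amplitude and phase defined above:

|c(ω)| = √(R(ω)² + X(ω)²) = A(ω)    and    arctan(X(ω)/R(ω)) = θ(ω)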
Example Spectrum
The result of a Fourier Transform is a continuous function of frequency some-
times called the frequency spectrum or just spectrum. Figure 3.14 shows the
spectrum measured from a gamelan tone. You can see there is energy in a
wide range of frequencies, but also some peaks that indicate strong sinusoidal
partials here and there.
Figure 3.15: The spectrum of the sine wave has energy only at one frequency.
The triangle wave has energy at odd-numbered harmonics (meaning odd mul-
tiples of the fundamental), with the energy of each harmonic decreasing as 1
over the square of the harmonic number (1/N²). In other words, at the frequency that is N times the fundamental, we have 1/N² as much energy as in the fundamental.
Figure 3.15 shows some periodic waveforms and their associated spectra.
The partials in the sawtooth wave decrease in energy in proportion to the in-
3 By the way, periodic waveforms generally have clear pitches. In evolutionary terms, peri-
odicity implies an energy source to sustain the vibration, and pitch is a good first-order feature
that tells us whether the energy source is big enough to be a threat or small enough to be food.
Think lion roar vs. goat bleat. It should not be too surprising that animals—remember we are the
life forms capable of running away from danger and chasing after food—developed a sense of
hearing, including pitch perception, which at the most basic level serves to detect energy sources
and estimate size.
verse of the harmonic number (1/N). Pulse (or rectangle or square) waveforms
have energy over a broad area of the spectrum.
Figure 3.17: Sampling in the time domain gives rise to shifted spectral copies
in the frequency domain.
3.3.2 Aliasing
Notice in Figure 3.17 that after sampling, the copies of the spectrum come
close to overlapping. What would happen if the spectrum contained higher frequencies? Figure 3.18 illustrates a broader spectrum.
Figure 3.19 illustrates the signal after sampling. What a mess! The copies
of the spectrum now overlap. The Nyquist frequency (one-half the sampling
rate, and one half the amount by which copies are shifted) is shown by a ver-
tical dotted line. Notice that there is a sort of mirror symmetry around that
line: the original spectrum extending above the Nyquist frequency is “folded”
to become new frequencies below the Nyquist frequency.
This is really bad because new frequencies are introduced and added to the
original spectrum. The frequencies are in fact aliases of the frequencies above
the Nyquist frequency. To avoid aliasing, it is necessary to sample at more than twice the highest frequency present in the original signal.
In the time domain, aliasing is a little more intuitive, at least if we consider
the special case of a sinusoid. Figure 3.20 shows a fast sinusoid at a frequency
more than half the sample rate. This frequency is “folded” to become a lower
frequency below half the sample rate. Note that both sinusoids pass perfectly
through the available samples, which is why they are called aliases. They have
the same “signature” in terms of samples.
To summarize, the frequency range (bandwidth) of a sampled signal is
determined by the sample rate. We can only capture and represent frequencies
below half the sample rate, which is also called the Nyquist rate or Nyquist
frequency.
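You can hear this folding with a simple frequency sweep. The sketch below (assuming the default 44,100 Hz sample rate, and borrowing the fmosc oscillator introduced in the next chapter) sweeps the oscillator frequency from 100 Hz up to about 30,000 Hz; the pitch you hear rises toward the Nyquist frequency and then comes back down as frequencies above 22,050 Hz fold over:

play fmosc(hz-to-step(100), pwl(4.9, 29900, 5))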
the absolute error but the ratio between the signal and the error. We call errors
in audio “noise,” and we measure the signal-to-noise ratio.
Ratio is so important in audio, and the range of signals and noise is so
large, that we use a special unit, the dB or decibel to represent ratios, or more
precisely logarithms of ratios. A bel represents a factor of 10 in power, which
is 10 decibels. Since this is a logarithmic scale, 20 decibels is two factors of 10
or a factor of 100 in power. But power increases with the square of amplitude,
so 20 dB results in only a factor of 10 in amplitude. A handy constant to
remember is a factor of 2 in amplitude is about 6 dB, a factor of 4 is about 12
dB. What can you hear? A change of 1 dB is barely noticeable. A change of
10 dB is easily noticeable, but your range of hearing is about 10 of these steps
(around 100 dB).
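In symbols, a ratio of powers P1/P0 or of amplitudes A1/A0 is expressed in decibels as

dB = 10 log10(P1/P0) = 20 log10(A1/A0),

so a factor of 2 in amplitude is 20 log10(2) ≈ 6.02 dB, and a factor of 10 in amplitude is 20 dB.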
the original signal would pass through integer sample values and we could
capture the signal perfectly! Thus, the effect of quantization error is to add
random values in the range -0.5 to +0.5 to the original signal before sampling.
It turns out that uniform random samples are white noise, so calling error
“noise” is quite appropriate. How loud is the noise? The resulting signal-
to-noise ratio, given M-bit samples, is 6.02M + 1.76 dB. This is very close
to saying “the signal-to-noise ratio is 6dB per bit.” You only get this signal-
to-noise ratio if the signal uses the full range of the integer samples. In real
life, you always record with some “headroom,” so you probably lose 6 to 12
dB of the best possible signal. There is also a limit to how many bits you
can use—remember that the bits are coming from analog hardware, so how
quiet can you make circuits, and how accurately can you measure voltages?
In practice, 16-bit converters offering a theoretical 98 dB signal-to-noise ratio
are readily available and widely used. 20- and 24-bit converters are used in
high-end equipment, but these are not accurate to the least-significant bit, so
do not assume the signal-to-noise ratio can be estimated merely by counting
bits.
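For example, with M = 16 bits, the formula gives 6.02 × 16 + 1.76 ≈ 98.1 dB, which is the theoretical figure quoted above for 16-bit converters.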
Figure 3.23: When a signal is sampled, the real-valued signal must be rounded
at each sample to the nearest integer sample value. The resulting error is called
quantization error.
real world, but the signal-to-noise ratio is about 6 dB per bit, so with 16 bits
or more, we can make noise virtually imperceptible.
3.7.2 Oversampling
Another great idea in sampling is oversampling. A problem with high-quality
digital audio conversion is that we need anti-aliasing filters and reconstruc-
tion filters that are (1) analog and (2) subject to requirements that are very difficult to achieve, namely that the filter passes everything without alteration right up to the Nyquist frequency, then suddenly cuts out everything above that frequency. Filters typically have a smooth transition from passing signal to cutting sig-
nals, so anti-aliasing filters and reconstruction filters used to be marvels of
low-noise audio engineering and fabrication.
With oversampling, the idea is to sample at a very high sample rate to avoid
stringent requirements on the analog anti-aliasing filter. For example, suppose
Figure 3.24: A time-domain waveform. It’s easy to see the attack, steady-
state, and decay portions of the “note” or sound event, because these are all
pretty much variations of amplitude, which time-domain representations show
us quite well. The amplitude envelope is a kind of average of this picture.
steady-signal-expression * envelope-expression
In Nyquist, you can use the pwl or env functions, for example, to make
envelopes, as we have seen before.
Another type of amplitude modulation is amplitude vibrato, where we
want a wavering amplitude. One way to create this effect is to vary the ampli-
tude using a low-frequency oscillator (LFO). However, be careful not to simply
multiply by the LFO. Figure 3.25 shows what happens to a signal multiplied by lfo(6).
Figure 3.25: The output from osc(c4) * lfo(6). The signal has extreme
pulses at twice the LFO rate because lfo(6) crosses through zero twice for
each cycle. Only 1/4 second is plotted.
To get a more typical vibrato, offset the lfo signal so that it does not go
through zero. Figure 3.26 shows an example. You can play the expressions
from Figures 3.25 and 3.26 and hear the difference.
Figure 3.26: The output from osc(c4) * (0.7 + 0.3 * lfo(6)). The
signal vibrates at 6 Hz and does not diminish to zero because the vibrato ex-
pression contains an offset of 0.7.
3.9 Summary
Sampling obtains a discrete digital representation of a continuous signal. Al-
though intuition suggests that the information in a continuous signal is infinite,
this is not the case in practice. Limited bandwidth (or the fact that we only care
about a certain bandwidth) means that we can capture all frequencies of inter-
est with a finite sampling rate. The presence of noise in all physical systems
means that a small amount of quantization noise is not significant, so we can
represent each sample with a finite number of bits.
Dither and noise shaping can be applied to signals before quantization oc-
curs to reduce any artifacts that might occur due to non-random rounding —
typically this occurs in low amplitude sinusoids where rounding errors can
become periodic and audible. Another interesting technique is oversampling,
where digital low-pass filtering performs some of the work typically done by
a hardware low-pass filter to prevent aliasing.
Amplitude modulation at low frequencies is a fundamental operation used
to control loudness in mixing and to shape notes and other sound events in
synthesis. We typically scale the signal by a factor from 0 to 1 that varies over
time. On the other hand, periodic amplitude modulation at audible frequencies
acts to shift frequencies up and down by the modulating frequency. This can
be used as a form of distortion, for synthesis, and other effects.
Chapter 4
Frequency Modulation
Topics Discussed: Frequency Modulation, Behaviors and Transforma-
tions in Nyquist
4.1.1 Examples
Common forms of frequency variation in music include:
• Loose vibrating strings go sharp (higher pitch) as they get louder. Loose
plucked strings get flatter (lower pitch) especially during the initial de-
cay.
• The slide trombone, Theremin, voice, violin, etc. create melodies by
changing pitch, so melodies on these instruments can be thought of as
examples of frequency modulation (as opposed to, say, pianos, where
melodies are created by selecting from fixed pitches).
4.1.2 FM Synthesis
Normally, vibrato is a slow variation created by musicians’ muscles. With
electronics, we can increase the vibrato rate into the audio range, where inter-
esting things happen to the spectrum. The effect is so dramatic, we no longer
refer to it as vibrato but instead call it FM Synthesis.
Frequency modulation (FM) is a synthesis technique based on the simple
idea of periodic modulation of the signal frequency. That is, the frequency of
a carrier sinusoid is modulated by a modulator sinusoid. The peak frequency
deviation, also known as depth of modulation, expresses the strength of the
modulator’s effect on the carrier oscillator’s frequency.
FM synthesis was invented by John Chowning [Chowning, 1973] and be-
came very popular due to its ease of implementation and computationally low
cost, as well as its (somewhat surprisingly) powerful ability to create realistic
and interesting sounds.
4.2 Theory of FM
Let’s look at the equation for a simple frequency controlled sine oscillator.
Often, this is written
y(t) = A sin(2πφt) (4.1)
where φ (phi) is the frequency in Hz. Note that if φ is in Hz (cycles per
second), the frequency in radians per second is 2πφ . At time t, the phase will
have advanced from 0 to 2πφt, which is the integral of 2πφ over a time span
of t. We can express what we are doing in detail as:
y(t) = A sin(2π ∫₀ᵗ φ dx)    (4.2)
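The frequency can itself vary with time; writing it as a function φ(x) inside the integral, and then (for FM) letting it be a sinusoid offset by a constant, gives:

y(t) = A sin(2π ∫₀ᵗ φ(x) dx)    (4.3)

φ(x) = C + D sin(2πMx)    (4.4)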
where C is the carrier, a frequency offset that in many cases is the funda-
mental or “pitch”. D is the depth of modulation that controls the amount of
frequency deviation (called modulation), and M is the frequency of modula-
tion in Hz. Plugging this into Equation 4.3 and simplifying gives the equation
for FM:

y(t) = A sin(2π ∫₀ᵗ (C + D sin(2πMx)) dx)    (4.5)
Note that the integral of sin is cos, but the only difference between the two is
phase. By convention, we ignore this detail and simplify Equation 4.5 to get
this equation for an FM oscillator:
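y(t) = A sin(2πCt + (D/M) sin(2πMt))

The ratio D/M that appears here is the index of modulation, I.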
Figure 4.1: Bessel functions of the first kind, orders 0 to 3. The x axis repre-
sents the index of modulation in FM.
• The spectral bandwidth increases with I. The upper and lower sidebands
represent the higher and lower frequencies, respectively. The larger the
value of I, the more significant sidebands you get.
we assume that all the analysis here still applies and allows us to make pre-
dictions about the short-time spectrum. FM is particularly interesting because
we can make very interesting changes to the spectrum with just a few control
parameters, primarily D(t), and M(t).
4.3.1 Basic FM
The basic FM oscillator in Nyquist is the function fmosc. The signature for
fmosc is
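fmosc(pitch, modulation [, table [, phase]])

where pitch gives the nominal (carrier) frequency as a step number, and the instantaneous amplitude of modulation gives the deviation from that frequency in Hz. Here is a sketch of a typical use, with the modulation amplitude ramped from 0 to 4000 over about 3 seconds (the timing values are arbitrary):

play fmosc(c4, pwl(2.9, 4000, 3) * osc(c4, 3))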
Since the modulation comes from osc(c4), the modulation frequency matches
the carrier frequency (given by fmosc(c4, ...)), so the C:M ratio is 1:1. The
amplitude of modulation ramps from 0 to 4000, giving a final index of mod-
ulation of I = 4000/step-to-hz(c4) = 15.289. Thus, the spectrum will evolve
from a sinusoid to a rich spectrum with the carrier and around I + 1 sidebands,
or about 17 harmonics. Try it!
For a musical tone, you should multiply the fmosc signal by an envelope.
Otherwise, the output amplitude will be constant. Also, you should replace
the pwl function with an envelope that increases and then decreases, at least if
you want the sound to get brighter in the middle.
such as maintaining the desired pitch and stretching only the “sustain” portion
of an envelope to obtain a longer note.
4.4.1 Behaviors
Nyquist sound expressions denote a whole class of behaviors. The specific
sound computed by the expression depends upon the environment. There are
a number of transformations, such as stretch and transpose that alter the
environment and hence the behavior.
The most common transformations are shift and stretch, but remem-
ber that these do not necessarily denote simple time shifts or linear stretching:
When you play a longer note, you don’t simply stretch the signal! The behav-
ior concept is critical for music.
4.4.4 Transformations
Many transformations are relative and can be nested. Stretch does not just set
the stretch factor; instead, it multiplies the stretch factor by a factor, so the
final stretch factor in the new environment is relative to the current one.
Nyquist also has “absolute” transformations, such as stretch-abs and at-abs,
that override any existing value in the environment rather than modifying it.
An Operational View
You can think about this operationally: When Nyquist evaluates expr ~ 3, the
~ operator is not a function in the sense that expressions on the left and right
are evaluated and passed to a function where “stretching” takes place. Instead,
it is better to think of ~ as a control construct that:
• changes the environment by tripling the stretch factor
• evaluates expr in this new environment
• restores the environment
• returns the sound computed by expr
Thus, if expr is a behavior that computes a sound, the computation will be
affected by the environment. If expr is a sound (or a variable that stores a
sound), the sound is already computed and evaluation merely returns the sound
with no transformation.
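As a concrete illustration (a sketch using pluck, whose nominal duration is one second):

set snd = pluck(c4)  ; evaluated now; snd holds a SOUND
play pluck(c4) ~ 3   ; a behavior: evaluated in the stretched environment, about 3 seconds long
play snd ~ 3         ; an existing SOUND: returned unchanged, still about 1 second long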
Transformations are described in detail in the Nyquist Reference Manual
(find “Transformations” in the index). In practice, the most critical transfor-
mations are at (@) and stretch (~), which control when sounds are computed
and how long they are.
2 This is a bit of a simplification. If a function merely returns a pre-computed value of type
SOUND, the transformations in the environment have no effect on the value that is returned.
The idea is to create the first note at time 0, and to start the next note when the
first one finishes. This is all accomplished by manipulating the environment.
In particular, *warp* is modified so that what is locally time 0 for the second
note is transformed, or warped, to the logical stop time of the first note.
One way to understand this in detail is to imagine how it might be exe-
cuted: first, *warp* is set to an initial value that has no effect on time, and
my-note(c4, q) is evaluated. A sound is returned and saved. The sound has
an ending time, which in this case will be 1.0 because the duration q is 1.0.
This ending time, 1.0, is used to construct a new *warp* that has the effect of
shifting time by 1.0. The second note is evaluated, and will start at time 1.0.
The sound that is returned is now added to the first sound to form a composite
sound, whose duration will be 2.0. Finally, *warp* is restored to its initial
value.
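For example, the sequence just described might be written as follows (a sketch; my-note is assumed to be a user-defined behavior taking a pitch and a duration, and q is Nyquist’s predefined quarter-note duration, 1.0):

play seq(my-note(c4, q), my-note(d4, q))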
Notice that the semantics of seq can be expressed in terms of transforma-
tions. To generalize, the operational rule for seq is: evaluate the first behav-
ior according to the current *warp*. Evaluate each successive behavior with
*warp* modified to shift the new note’s starting time to the ending time of
the previous behavior. Restore *warp* to its original value and return a sound
which is the sum of the results.
In the Nyquist implementation, audio samples are only computed when
they are needed, and the second part of the seq is not evaluated until the ending
time (called the logical stop time) of the first part. It is still the case that when
the second part is evaluated, it will see *warp* bound to the ending time of
the first part.
A language detail: Even though Nyquist defers evaluation of the second
part of the seq, the expression can reference variables according to ordinary
Lisp/SAL scope rules. This is because the seq captures the expression in a
closure, which retains all of the variable bindings.
Figure 4.4: Standard quarter notes nominally fill an entire beat. Staccato quar-
ter notes, indicated by the dots, are played shorter, but they still occupy a time
interval of one beat. The “on” and “off” time of these short passages is indi-
cated graphically below each staff of notation.
A sound in Nyquist has one start time, but there are effectively two “stop”
times. The signal stop is when the sound samples end. After that point, the
sound is considered to continue forever with value of zero. The logical stop
marks the “musical” or “rhythmic” end of the sound. If there is a sound after
it (e.g. in a sequence computed by seq), the next sound begins at this logical
stop time of the sound.
The logical stop is usually the signal stop by default, but you can change it
with the set-logical-stop function. For example, the following will play
a sequence of 3 plucked-string sounds with durations of 2 seconds each, but
the IOI, that is, the time between notes, will be only 1 second.
play seq(set-logical-stop(pluck(c4) ~ 2, 1),
         set-logical-stop(pluck(c4) ~ 2, 1),
         set-logical-stop(pluck(c4) ~ 2, 1))
4.9 Summary
In this unit, we covered frequency modulation and FM synthesis. Frequency
modulation refers to anything that changes pitch within a sound. Frequency
modulation at audio rates can give rise to many partials, making FM Synthe-
sis a practical, efficient, and versatile method of generating complex evolving
sounds with a few simple control parameters.
All Nyquist sounds are computed by behaviors in an environment that can
be modified with transformations. Functions describe behaviors. Each be-
havior can have explicit parameters as well as implicit environment values, so
behaviors represent classes of sounds (e.g. piano tones), and each instance of
a behavior can be different. Being “different” includes having different start
times and durations, and some special constructs in Nyquist, such as sim, seq,
at (@) and stretch (~) use the environment to organize sounds in time. These
concepts are also used to implement scores in Nyquist.
Chapter 5
Spectral Analysis and Patterns
In this chapter, we dive into the Fast Fourier Transform for spectral anal-
ysis and use it to compute the Spectral Centroid. Then, we learn about some
Nyquist functions that make interesting sequences of numbers and how to in-
corporate those sequences as note parameters in scores. Finally, we consider a
variety of algorithmic composition techniques.
Figure 5.1: Equations for the Short Time Discrete Fourier Transform. Each
$R_k$, $X_k$ pair represents a different frequency in the spectrum.
waves that make up the sound we have analyzed. It is these data that are
displayed in the sonograms we looked at earlier.
Figure 5.2 illustrates some FFT data. It shows the first 16 bins of a typical
FFT analysis after the conversion is made from real and imaginary numbers to
amplitude/phase pairs. We left out the phases, because, well, it was too much
trouble to just make up a bunch of arbitrary phases between 0 and 2π. In a lot
of cases, you might not need them (and in a lot of cases, you would!). In this
case, the sample rate is 44.1 kHz and the FFT size is 1,024, so the bin width (in
frequency) is the Nyquist frequency (44,100/2 = 22,050) divided by the number
of bins up to the Nyquist frequency (1,024/2 = 512), or about 43 Hz.
We confess that we just sort of made up the numbers; but notice that we
made them up to represent a sound that has a simple, more or less harmonic
structure with a fundamental somewhere in the 66 Hz to 88 Hz range (you
can see its harmonics at around 2, 3, 4, 5, and 6 times its frequency, and note
that the harmonics decrease in amplitude more or less like they would in a
sawtooth wave).
One important consequence of the FFT (or equivalently, DFT) is that we
only “see” the signal during a short period. We pretend that the spectrum is
stable, at least for the duration of the analysis window, but this is rarely true.
Thus, the FFT tells us something about the signal over the analysis window,
but we should always keep in mind that the FFT changes with window size
and the concept of a spectrum at a point in time is not well defined. For
example, harmonics of periodic signals usually show up in the FFT as peaks
in the spectrum with some spread rather than single-frequency spikes, even
though a harmonic intuitively has a single specific frequency.
DFT bin has an amplitude and phase, we end up with about N/2 bins. The
complete story is that there are actually N/2 + 1 bins, but one represents zero
frequency where only the real term is non-zero (because sin(0) is 0), and one
bin represents the Nyquist frequency where again, only the real term is non-
zero (because sin(2πki/N) is zero).1
These N/2 + 1 bins are evenly spaced across the frequency range from 0 to the
Nyquist frequency (half the sample rate), so the spacing between bins is the
sample rate divided by N, the FFT size:

bin spacing = sample rate / N

We can also relate the frequency spacing of bins to the duration of the analysis
window; since duration = N / (sample rate),

bin spacing = 1 / duration
Figure 5.3: Selecting an FFT size involves making trade-offs in terms of time
and frequency accuracy. Basically it boils down to this: The more accurate
the analysis is in one domain, the less accurate it will be in the other. This
figure illustrates what happens when we choose different frame sizes. In the
first illustration, we used an FFT size of 512 samples, giving us pretty good
time resolution. In the second, we used 2,048 samples, giving us pretty good
frequency resolution. As a result, frequencies are smeared vertically in the
first analysis, while time is smeared horizontally in the second. What’s the
solution to the time/frequency uncertainty dilemma? Compromise.
simultaneously know the exact position and the exact speed of an object.
For an N-point FFT (input contains N samples), the output will have N/2+
1 bins and therefore N/2 + 1 amplitudes. The first amplitude corresponds
to 0 Hz, the second to (sample rate / N), etc., up to the last amplitude that
corresponds to the frequency (sample rate / 2).
$c = \dfrac{\sum_i f_i a_i}{\sum_i a_i}$, where $f_i$ is the frequency of the $i$th bin and $a_i$ is its amplitude.   (5.2)
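As a sketch (not a library function), Equation 5.2 can be computed from an array of bin amplitudes; spectral-centroid, amps, and bin-hz are illustrative names, where amps holds the bin magnitudes and bin-hz is the spacing between bins in Hz:

function spectral-centroid(amps, bin-hz)
  begin
    with num = 0.0, den = 0.0
    loop
      for i from 0 below length(amps)
      set num += i * bin-hz * amps[i]  ; frequency of bin i times its amplitude
      set den += amps[i]
    end
    if den > 0 then return num / den
    return 0.0  ; silent frame: report 0 rather than dividing by zero
  end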
5.3 Patterns
Nyquist offers pattern objects that generate data streams. For example, the
cycle-class of objects generates cyclical patterns such as “1 2 3 1 2 3 1 2 3 ...”.
Figure 5.4: A magnitude spectrum and its spectral centroid (dashed line). If
you were to cut out the spectrum shape from cardboard, it would balance on
the centroid. The spectral centroid is a good measure of the overall placement
of energy in a signal from low frequencies to high.
Figure 5.5: The centroid curve of a sound over time. This curve is of a rapidly
changing voice (Australian sound poet Chris Mann). Each point in the hori-
zontal dimension represents a new spectral frame. Note that centroids tend to
be rather high and never the fundamental (this would only occur if our sound
is a pure sine wave).
directly accessible from SAL. Instead, Nyquist defines a functional interface, e.g. make-cycle
creates an instance of cycle-class, and the next function, introduced below, retrieves the next
value from any instance of pattern-class. Using LISP syntax, you can have full access to the
methods of all objects.
Heap
The heap-class selects items in random order from a list without replace-
ment, which means that all items are generated once before any item is re-
peated. For example, two periods of make-heap({a b c}) might be (C A
B) (B A C). Normally, repetitions can occur even if all list elements are dis-
tinct. This happens when the last element of a period is chosen first in the next
period. To avoid repetitions, the max: keyword argument can be set to 1. If the
main argument is a pattern instead of a list, a period from that pattern becomes
the list from which random selections are made, and a new list is generated
every period. (See Section 5.3.2 on nested patterns.)
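For example (the output is random, so the periods shown in the comments are only one possibility):

set pitch-pat = make-heap({c4 e4 g4}, max: 1)
print next(pitch-pat, t)  ; one period, e.g. (E4 G4 C4)
print next(pitch-pat, t)  ; with max: 1, the first item of this period will not
                          ; repeat the last item of the previous period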
Palindrome
The palindrome-class repeatedly traverses a list forwards and then back-
wards. For example, two periods of make-palindrome({a b c}) would be
(A B C C B A) (A B C C B A). The elide: keyword parameter controls
whether the first and/or last elements are repeated:
make-palindrome({a b c}, elide: nil)
;; generates A B C C B A A B C C B A ...
make-palindrome({a b c}, elide: t)
;; generates A B C B A B C B ...
make-palindrome({a b c}, elide: :first)
;; generates A B C C B A B C C B ...
make-palindrome({a b c}, elide: :last)
;; generates A B C B A A B C B A ...
Random
The random-class generates items at random from a list. The default selec-
tion is uniform random with replacement, but items may be further specified
with a weight, a minimum repetition count, and a maximum repetition count.
Weights give the relative probability of the selection of the item (with a default
weight of one). The minimum count specifies how many times an item, once
selected at random, will be repeated. The maximum count specifies the max-
imum number of times an item can be selected in a row. If an item has been
generated n times in succession, and the maximum is equal to n, then the item
is disqualified in the next random selection. Weights (but not currently minima
and maxima) can be patterns. The patterns (thus the weights) are recomputed
every period.
Line
The line-class is similar to the cycle class, but when it reaches the end of
the list of items, it simply repeats the last item in the list. For example, two
periods of make-line({a b c}) would be (A B C) (C C C).
Accumulation
The accumulation-class takes a list of values and returns the first, followed
by the first two, followed by the first three, etc. In other words, for each list
item, return all items from the first through the item. For example, if the list is
(A B C), each generated period is (A A B A B C).
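For example, assuming the constructor make-accumulation, which follows the make- naming convention described in the footnote above:

set pat = make-accumulation({a b c})
print next(pat, t)  ; prints (A A B A B C)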
Copier
The copier-class makes copies of periods from a sub-pattern. For exam-
ple, three periods of make-copier( make-cycle({a b c}, for: 1),
repeat: 2, merge: t) would be (A A) (B B) (C C). Note that en-
tire periods (not individual items) are repeated, so in this example the for:
keyword was used to force periods to be of length one so that each item is
repeated by the repeat: count.
Length
The length-class generates periods of a specified length from another pat-
tern. This is similar to using the for: keyword, but for many patterns, the
for: parameter alters the points at which other patterns are generated. For
example, if the palindrome pattern has an elide: pattern parameter, the value
will be computed every period. If there is also a for: parameter with a value
of 2, then elide: will be recomputed every 2 items. In contrast, if the palin-
drome (without a for: parameter) is embedded in a length pattern with a
length of 2, then the periods will all be of length 2, but the items will come
from default periods of the palindrome, and therefore the elide: values will
be recomputed at the beginnings of default palindrome periods.
Window
The window-class groups items from another pattern by using a sliding win-
dow. If the skip value is 1, each output period is formed by dropping the first
item of the previous period and appending the next item from the pattern. The
skip value and the output period length can change every period. For a simple
example, if the period length is 3 and the skip value is 1, and the input pattern
generates the sequence A, B, C, ..., then the output periods will be (A B C), (B
C D), (C D E), (D E F), ....
Periods
The data returned by a pattern object is structured into logical groups called
periods. You can get an entire period (as a list) by calling next(pattern,
t). For example:
set pitch-source = make-cycle(list(c4, d4, e4, f4))
print next(pitch-source, t)
This prints the list (60 62 64 65), which is one period of the cycle.
You can also get explicit markers that delineate periods by calling
send(pattern, :next). In this case, the value returned is either the next
item of the pattern, or the symbol +eop+ if the end of a period has been
reached. What determines a period? This is up to the specific pattern class, so
see the documentation for specifics. You can override the “natural” period by
using the keyword for:, e.g.
set pitch-source =
make-cycle(list(c4, d4, e4, f4), for: 3)
print next(pitch-source, t)
print next(pitch-source, t)
This prints the lists (60 62 64) (65 60 62). Notice that these periods just
restructure the stream of items into groups of 3.
Nested patterns are probably easier to understand by example than by spec-
ification. Here is a simple nested pattern of cycles:
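Roughly, when an item of a pattern is itself a pattern, a period of the inner pattern is produced each time that item comes up. A small sketch of the idea (not necessarily the exact example being referred to):

set pat = make-cycle(list(make-cycle({a b c}), make-cycle({x y})))
print next(pat, t)  ; something like (A B C X Y)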
The use of keywords like :PITCH helps to make scores readable and easy
to process without specific knowledge about the functions called in the score.
For example, one could write a transpose operation to transform all the :pitch
parameters in a score without having to know that pitch is the first parameter
of pluck and the second parameter of piano-note. Keyword parameters are
also used to give flexibility to note specification with score-gen. Since this
approach requires the use of keywords, the next section is a brief explanation
of how to define functions that use keyword parameters.
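A definition along these lines might look like the following sketch; the defaults and the call to pluck follow the description in the next paragraph:

function k-pluck(pitch: 60, dur: 1)
  return pluck(pitch, dur)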
Notice that within the body of the function, the actual parameter values for
keywords pitch: and dur: are referenced by writing the keywords without
the colons (pitch and dur) as can be seen in the call to pluck. Also, keyword
parameters have default values. Here, they are 60 and 1, respectively.
Now, we can call k-pluck with keyword parameters. A function call would
look like:
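For example (a sketch):

play k-pluck(pitch: c3, dur: 3)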
Usually, it is best to give keyword parameters useful default values. That way,
if a parameter such as dur: is missing, a reasonable default value (1) can
be used automatically. It is never an error to omit a keyword parameter, but
the called function can check to see if a keyword parameter was supplied or
not. Because of default values, we can call k-pluck(pitch: c3) with no
duration, k-pluck(dur: 3) with only a duration, or even k-pluck() with
no parameters.
The general form is score-gen(k1: e1, k2: e2, ..., kn: en), where the k’s are
keywords and the e’s are expressions. A score is generated
by evaluating the expressions once for each note and constructing a list of
keyword-value pairs. A number of keywords have special interpretations. The
rules for interpreting these parameters will be explained through a set of “How
do I ...” questions below.
How many notes will be generated? The keyword parameter score-len:
specifies an upper bound on the number of notes. The keyword score-dur:
specifies an upper bound on the starting time of the last note in the score. (To
be more precise, the score-dur: bound is reached when the default starting
time of the next note is greater than or equal to the score-dur: value. This
definition is necessary because note times are not necessarily increasing.) When
either bound is reached, score generation ends. At least one of these two
parameters must be specified or an error is raised. These keyword parameters
are evaluated just once and are not copied into the parameter lists of generated
notes.
What is the duration of generated notes? The keyword dur: defaults to 1
and specifies the nominal duration in seconds. Since the generated note list is
compatible with timed-seq, the starting time and duration (to be precise, the
stretch factor) are not passed as parameters to the notes. Instead, they control
the Nyquist environment in which the note will be evaluated.
What is the start time of a note? The default start time of the first note
is zero. Given a note, the default start time of the next note is the start time
plus the inter-onset time, which is given by the ioi: parameter. If no ioi:
parameter is specified, the inter-onset time defaults to the duration, given by
dur:. In all cases, the default start time of a note can be overridden by the
keyword parameter time:. In other words, to get the time of each note,
use the expression given by time:. If there is no time: parameter,
compute the time of the previous note plus the value of ioi:, and if there is
no ioi:, then use dur:, and if there is no dur:, use 1.
When does the score begin and end? The behavior SCORE-BEGIN-END
contains the beginning and ending of the score (these are used for score manip-
ulations, e.g. when scores are merged, their begin times can be aligned.) When
timed-seq is used to synthesize a score, the SCORE-BEGIN-END marker is not
evaluated. The score-gen macro inserts an event of the form
(0 0 (SCORE-BEGIN-END begin-time end-time))
with begin-time and end-time determined by the begin: and end: keyword
parameters, respectively. (Recall that these score-begin-end events do not
make sound, but they are used by score manipulation functions for splicing,
stretching, and other operations on scores.) If the begin: keyword is not
provided, the score begins at zero. If the end: keyword is not provided, the
score ends at the default start time of what would be the next note after the last
note in the score (as described in the previous paragraph). Note: if time: is
used to compute note starting times, and these times are not increasing, it is
strongly advised to use end: to specify an end time for the score, because the
default end time may not make sense.
What function is called to synthesize the note? The name: parameter
names the function. Like other parameters, the value can be any expression,
including something like next(fn-name-pattern), allowing function names
to be recomputed for each note. The default value is note.
Can I make parameters depend upon the starting time or the duration of
the note? score-gen sets some handy variables that can be used in expres-
sions that compute parameter values for notes:
• sg:count counts how many notes have been computed so far, starting
at 0.
The order of computation is: sg:count, then sg:start, then sg:ioi and fi-
nally sg:dur, so for example, an expression for dur: can depend on sg:ioi.
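For example, here is a sketch that uses sg:count to generate a rising chromatic line; k-pluck is the keyword-parameter function sketched earlier, and the name: value is quoted so that it evaluates to the function name:

set chromatic-score =
  score-gen(score-len: 12,
            name: quote(k-pluck),
            ioi: 0.2,
            pitch: c4 + sg:count)  ; each note one semitone higher
play timed-seq(chromatic-score)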
Can parameters depend on each other? The keyword pre: introduces
an expression that is evaluated before each note, and post: provides an ex-
pression to be evaluated after each note. The pre: expression can assign one
or more global variables which are then used in one or more expressions for
parameters.
How do I debug score-gen expressions? You can set the trace: parameter
to true (t) to enable a print statement for each generated note.
How can I save scores generated by score-gen that I like? If the keyword
parameter save: is set to a symbol, the global variable named by the symbol
is set to the value of the generated sequence. Of course, the value returned by
score-gen is just an ordinary list that can be saved like any other value.
In summary, the following keywords have special interpretations in
score-gen: begin:, end:, time:, dur:, name:, ioi:, trace:, save:,
score-len:, score-dur:, pre:, post:. All other keyword parameters are
expressions that are evaluated once for each note and become the parameters
of the notes.
In the real world, we have random events in time such as atomic decay or
something as mundane as the time points when a yellow car drives by. The
inter-arrival time of these random events has a negative exponential distribu-
tion, as shown in Figure 5.6. The figure shows that longer and longer intervals
become less and less probable.
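As a sketch of this idea with score-gen, assuming the exponential-dist function from Nyquist’s probability-distribution library (check the Nyquist Reference Manual for its exact parameter meaning):

set sparse-score =
  score-gen(score-dur: 20,
            name: quote(k-pluck),
            ioi: exponential-dist(2.0),  ; random inter-onset times, mostly short
            pitch: c4)
play timed-seq(sparse-score)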
Choosing pitches uniformly at random generates many large melodic intervals. Melodies often have fractal properties, with a mix
of mainly small intervals, but occasional larger ones. An interesting idea is
to randomly choose a direction (up or down) and interval size in a way that
emphasizes smaller intervals over larger ones. Sometimes, this is called a
“random walk,” as illustrated in Figure 5.7.
5.5.5 Serialism
At the risk of over-simplifying, serialism arose as an attempt to move beyond
the tonal concepts that dominated Western musical thought through the begin-
ning of the 20th Century. Arnold Schoenberg created his twelve-tone tech-
nique that organized music around “tone rows” which are permutations of the
12 pitch classes of the chromatic scale (namely, C, C#, D, D#, E, F, F#, G,
G#, A, A#, B). Starting with a single permutation, the composer could gener-
ate new rows through transposition, inversion, and playing the row backwards
(called retrograde). Music created from these sequences tends to be atonal
with no favored pitch center or scale. Because of the formal constraints on
pitch selection and operations on tone rows, serialism has been an inspiration
for many algorithmic music compositions.
5.5.7 Grammars
Formal grammars, most commonly used in formal descriptions of program-
ming languages, have been applied to music. Consider the formal grammar:
melody ::= intro middle ending
middle ::= phrase | middle phrase
phrase ::= A B C B | A C D A
Figure 5.9: Tendency masks offer a way for composers to retain global con-
trol over the evolution of a piece even when the moment-by-moment details of
the piece are generated algorithmically. Here, the vertical axis represents pa-
rameter values and the horizontal axis represents time. The two colored areas
represent possible values (to be randomly selected) for each of two parameters
used for music generation.
5.7 Summary
We have learned about a variety of approaches to algorithmic music gener-
ation. Nyquist supports algorithmic music generation especially through the
score-gen macro, which iteratively evaluates expressions and creates note
lists in the form of Nyquist scores. Nyquist also offers a rich pattern library for
generating parameter values. These patterns can be used to implement many
“standard” algorithmic music techniques such as serialism, random walks,
Markov Chains, probability distributions, pitch and rhythmic grids, etc.
We began by learning details of the FFT and Spectral Centroid. The main
reason to introduce these topics is to prepare you to extract spectral centroid
data from one sound source and use it to control parameters of music syn-
thesis. Hopefully, you can combine this sonic control with some higher-level
algorithmic generation of sound events to create an interesting piece. Good
luck!
Chapter 6
Techniques and Granular Synthesis
much disk space, so to avoid the obvious problem of storing an infinite sound,
we can multiply drum-roll() by an envelope. In limited-drum-roll(),
we make a finite drum roll by multiplying drum-roll() by const(1, 2).
const(1, 2) is a unit generator that returns a constant value of 1 until the
duration of 2, then it drops to 0. Here, multiplying a limited sound by an infi-
nite sound gives us a finite computation and result. Note that the multiplication
operator in Nyquist is quite smart. It knows that when multiplying by 0, the
result is always 0; and when a sound reaches its stop time, it remains 0 forever,
thus Nyquist can terminate the recursion at the sound stop time.
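The functions being discussed might look roughly like this (a sketch, not necessarily the book’s listing; drum-stroke is a hypothetical behavior producing one 0.1-second stroke):

function drum-stroke()
  return noise() * pwl(0.005, 1, 0.1)  ; a short burst of noise, about 0.1 s

function drum-roll()
  ; conceptually infinite: seq evaluates its second part lazily, so the
  ; recursion only unfolds as far as samples are actually needed
  return seq(drum-stroke(), drum-roll())

function limited-drum-roll()
  return const(1, 2) * drum-roll()  ; the product goes to zero at 2 seconds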
Note that in this example, there is a risk that the “envelope” const(1, 2)
might cut off a drum-stroke suddenly.1 Here, drum-strokes happen to be 0.1
seconds long, which means a 2-second drum-roll has exactly 20 full drum-
strokes, so a drum-stroke and the envelope will end exactly together. In
general, either seqrep or an envelope that gives a smooth fadeout would be a
better design.
Remember that Nyquist sounds are immutable. Nyquist will not and cannot
go back and recompute behaviors to get the “right” durations; how would it
know? There are two basic approaches to make durations match. The first is
to make everything have a nominal length of 1 and use the stretch operator to
change durations:
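A sketch of this approach (the scaled breakpoint times are approximate):

play (pwl(0.04, 1, 0.77, 1, 1) * osc(c4)) ~ 13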
Here we have changed pwl to have a duration of 1 rather than 13. This is the
default duration of osc, so they match. Note also the use of parentheses to
ensure that the stretch factor applies to both pwl and osc.
The second method is to provide duration parameters everywhere:
pwl(0.5, 1, 10, 1, 13) * osc(c4, 13)
Here, we kept the original 13-second long pwl function, but we explicitly set
the duration of osc to 13. If you provide duration parameters everywhere, you
will often end up passing duration as a parameter, but that’s not always a bad
thing as it makes duration management more explicit.
Smooth Transitions
Apply envelopes to almost everything. Even control functions can have control
functions! A good example is vibrato. A “standard” vibrato function might
look like lfo(6) * 5, but this would generate constant vibrato throughout
a tone. Instead, consider lfo(6) * 5 * pwl(0.3, 0, 0.5, 1, 0.9, 1,
1) where the vibrato depth is controlled by an envelope. Initially, there is no
vibrato, the vibrato starts to emerge at 0.3 and reaches the full depth at 0.5, and
finally tapers rapidly from 0.9 to 1. Of course this might be stretched, so these
numbers are relative to the whole duration. Thus, we not only use envelopes to
get smooth transitions in amplitude at the beginnings and endings of notes, we
can use envelopes to get smooth transitions in vibrato depth and other controls.
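For instance, here is a sketch that applies this vibrato control to a simple oscillator; hzosc takes a frequency signal in Hz:

play hzosc(440 + lfo(6) * 5 * pwl(0.3, 0, 0.5, 1, 0.9, 1, 1))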
In the simple case of envelopes, you can just apply the global envelope
through multiplication after notes are synthesized. In some other cases, you
might need to actually pass the global function as a parameter to be used for
synthesis. Suppose that in Figure 6.2, the uppermost (global) envelope is sup-
posed to control the index of modulation in FM synthesis. This effect cannot
be applied after synthesis, so you must pass the global envelope as a parameter
to each note. Within each note, you might expect the whole global envelope to
somehow be shifted and stretched according to the note’s start time and dura-
tion, but that does not happen. Instead, since the control function has already
been computed as a SOUND, it is immutable and fixed in time. Thus, the note
only “sees” the portion of the control function over the duration of the note, as
indicated by the dotted lines in Figure 6.2.
In some cases, e.g. FM synthesis, the control function determines the note
start time and duration, so fmosc might try to create a tone over the entire
duration of the global envelope, depending on how you use it. One way to
“slice” out a piece of a global envelope or control function according to the
current environment is to use const(1) which forms a rectangular pulse that
is 1 from the start time to the nominal duration according to the environment.
If you multiply a global envelope parameter by const(1), you are sure to get
something that is confined within the nominal starting and ending times in the
environment.
the discussion above about matching durations). But sometimes you need to
customize what it means to “stretch” a sound. For example, if you stretch
a melody, you make notes longer, but if you stretch a drum roll, you do not
make drum strokes slower. Instead, you add more drum strokes to fill the time.
With Nyquist, you can create your own abstract behaviors to model things like
drum rolls, constant-rate trills, and envelopes with constant attack times, none
of which follow simple default rules of stretching.
Here is an example where you want the number of things to increase with
duration:
The basic idea here is to first “capture” the nominal duration using
get-duration(1) and then compute how many repetitions are needed to fill that duration, as sketched below.
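A sketch of such a behavior, reusing the hypothetical drum-stroke from the earlier sketch:

function stretchable-roll()
  begin
    with dur = get-duration(1)  ; the nominal duration, including any stretch
    ; one 0.1-second stroke per 0.1 seconds of duration; stretching each
    ; stroke by 1.0 / dur cancels the environment's stretch factor, so
    ; strokes stay 0.1 s long and only their number grows with duration
    return seqrep(i, round(dur / 0.1),
                  drum-stroke() ~ (1.0 / dur))
  end

Evaluating stretchable-roll() ~ 4 then produces forty strokes over four seconds rather than ten slowed-down strokes.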
6.1.5 Summary
6.2.1 Grains
To make a grain, we simply take any sound (e.g. a sinusoid or sound from a
sound file) and apply a short smoothing envelope to avoid clicks. (See Figure
6.3.) The duration is typically from around 20ms to 200ms: long enough
to convey a little content and some spectral information, but short enough to
avoid containing an entire note or word (from speech sounds) or anything too
recognizable.
Figure 6.3: Applying a short envelope to a sound to make a grain for granular
synthesis. How would a different amplitude envelope, say a square one, affect
the shape of the grain? What would it do to the sound of the grain? What
would happen if the sound was a recording of a natural sound instead of a
sinusoid? What would be the effect of a longer or shorter envelope?
whole words are obliterated. Granular synthesis can also be used for time
stretching: By moving through a file very slowly, fetching overlapping grains
and outputting them with less overlap, the file is apparently stretched, a shown
in Figure 6.5. There will be artifacts because grains will not add up perfectly
smoothly to form a continuous sound, but this can be a feature as well as a
limitation, depending on your musical goals.
Generating a Grain
Figure 6.6 illustrates Nyquist code to create a smooth envelope and read a
grain’s worth of audio from a file to construct a smooth grain. Note the use
of duration d both to stretch the envelope and control how many samples are
read from the file. Below, we consider two approaches to granular synthesis
implementation. The first uses scores and the second uses seqrep.
The score calls upon grain, which we define below. Notice that grain dura-
tions are specified in the score and implemented through the environment,
so the cos-pulse signal will be stretched by the duration, but s-read is
unaffected by stretching. Therefore, we must obtain the stretch factor using
get-duration(1) and pass that value to s-read as the optional dur: key-
word parameter:
function grain(offset: 0)
  begin
    with dur = get-duration(1)
    return s-read("filename.wav",
                  time-offset: offset, dur: dur) *
           cos-pulse()
  end
Now, we can make a score with score-gen. In the following expres-
sion, we construct 2000 grains with randomized inter-onset intervals, using
pattern objects to compute the grain durations and file offsets:

score-gen(score-len: 2000,
          name: quote(grain),
          ioi: 0.05 + rrandom() * 0.01,
          dur: next(dur-pat),
          offset: next(offset-pat))
You could also use more randomness to compute parameters, e.g. the du-
ration could come from a Gaussian distribution (see the Nyquist function
gaussian-dist), and offset: could be computed by slowly moving through
the file and adding a random jitter, e.g.
max(0, sg:count * 0.01 + rrandom() * 0.2)
The second approach uses seqrep to create and sum 2000 sounds. Each sound is produced by a call to grain,
which is stretched by values from dur-pat. To obtain grain overlap, we
use set-logical-stop with an IOI (logical stop time) parameter of 0.05
+ rrandom() * 0.01, so the grain IOI will be 50 ms ± 10 ms:
seqrep(i, 2000,
       set-logical-stop(grain(offset: next(offset-pat)) ~ next(dur-pat),
                        0.05 + rrandom() * 0.01))
For another example, you can install the gran extension using Nyquist’s
Extension Manager. The package includes a function sf-granulate that im-
plements granular synthesis using a sound file as input.
6.2.8 Summary
Granular synthesis creates sound by summing thousands of sound particles or
grains with short durations to form clouds of sounds. Granular synthesis can
construct a wide range of textures, and rich timbres can be created by tak-
ing grains from recordings of natural sounds. Granular synthesis offers many
choices of details including grain duration, density, random or deterministic
timing, pitch shifts and source.
Chapter 7
Sampling and Filters
reordering them, some noisy and unusual effects can be created. As in collage
visual art, the ironic and interesting juxtaposition of very familiar materials
can be used to create new works that are perhaps greater than the sum of their
constituent parts.
Unique, experimental, and rather strange programs for deconstructing and
reconstructing sounds in the time domain are Herbert Brün’s SAWDUST (see
Figure 7.1) and Argeïphontes Lyre (see Figure 7.2), written by the enigmatic
Akira Rabelais. Argeïphontes Lyre provides a number of techniques for rad-
ical decomposition/recomposition of sounds—techniques that often preclude
the user from making specific decisions in favor of larger, more probabilistic
decisions.
Figure 7.1: Herbert Brün said of his program SAWDUST: “The computer
program which I called SAWDUST allows me to work with the smallest parts
of waveforms, to link them and to mingle or merge them with one another.
Once composed, the links and mixtures are treated, by repetition, as periods,
or by various degrees of continuous change, as passing moments of orientation
in a process of transformations.”
7.1.3 Sampling
Sampling refers to taking small bits of sound, often recognizable ones, and
recontextualizing them via digital techniques. By digitally sampling, we can
easily manipulate the pitch and time characteristics of ordinary sounds and use
them in any number of ways.
We’ve talked a lot about samples and sampling in the preceding chapters.
In popular music (especially electronic dance and beat-oriented music), the
term sampling has acquired a specialized meaning. In this context, a sample
refers to a (usually) short excerpt from some previously recorded source, such
Figure 7.2: Sample GUI from Argeïphontes Lyre, for sound deconstruction.
This is time-domain mutation.
as a drum loop from a song or some dialog from a film soundtrack, that is used
as an element in a new work. A sampler is the hardware used to record, store,
manipulate, and play back samples. Originally, most samplers were stand-
alone pieces of gear. Today sampling tends to be integrated into a studio’s
computer-based digital audio system.
Sampling was pioneered by rap artists in the mid-1980s, and by the early
1990s it had become a standard studio technique used in virtually all types of
music. Issues of copyright violation have plagued many artists working with
sample-based music, notably John Oswald of “Plunderphonics” fame and the
band Negativland, although the motives of the “offended” parties (generally
large record companies) have tended to be more financial than artistic. One
result of this is that the phrase “Contains a sample from xxx, used by permis-
sion” has become ubiquitous on CD cases and in liner notes.
Although the idea of using excerpts from various sources in a new work
is not new (many composers, from Béla Bartók, who used Balkan folk songs,
to Charles Ives, who used American popular music and folk songs, have done so),
digital technology has radically changed the possibilities.
1. Samples of musical tones have definite pitches. You could collect sam-
ples for every pitch, but this requires a lot of storage, so most samplers
need the capability of changing the pitch of samples through calculation.
Pitch Control
The standard approach to pitch control over samples (here, sample means a
short recording of a musical tone) is to change the sample rate using interpo-
lation. If we upsample, or increase the sample rate, but play at the normal
sample rate, the effect will be to slow down and drop the pitch of the original
recording. Similarly, if we downsample, or lower the sample rate, but play at
the normal rate, the effect will be to speed up and raise the pitch of the original
recording.
To change sample rates, we interpolate existing samples at new time points.
Often, “interpolation” means linear interpolation, but with signals, linear in-
terpolation does not result in the desired smooth band-limited signal, which is
another way of saying that linear interpolation will distort the signal; it may
be a rough approximation to proper reconstruction, but it is not good enough
for high-quality audio. Instead, good audio sample interpolation requires a
weighted sum taken over a number of samples. The weights follow an inter-
esting curve as shown in Figure 7.3. The more samples considered, the
better the quality, but the greater the computation. Commercial systems do not
advertise the quality of their interpolation, but a good guess is that interpola-
tion typically involves less than 10 samples. In theoretical terms, maintaining
16-bit quality requires interpolation over about 50 samples in the worst case.
Another problem with resampling is that when pitch is shifted by a lot, the
quality of sound is significantly changed. For example, voices shifted to higher
pitches begin to sound like cartoon voices or people breathing helium. This is
unacceptable, so usually, there are many samples for a given instrument. For
example, one might use four different pitches for each octave. Then, pitch
shifts are limited to the range of 1/4 octave, eliminating the need for extreme
pitch shifts.
Duration Control
Loudness Control
A simple way to produce soft and loud versions of samples is to just multiply
the sample by a scale factor. There is a difference, however, between amplitude
and loudness in that many instruments change their sound when they play
louder. Imagine a whisper in your ear vs. a person shouting from 100 yards
away. Which is louder? Which has higher amplitude?
To produce variations in perceptual loudness, we can use filters to alter
the spectrum of the sound. E.g. for a softer sound, we might filter out higher
frequencies as well as reduce the amplitude by scaling. Another option is
to record samples for different amplitudes and select samples according to
the desired loudness level. Amplitude changes can be used to introduce finer
adjustments.
Sampling in Nyquist
In Nyquist, the sampler function can be used to implement sampling synthe-
sis. The signature of the function is sampler(pitch, modulation, sample); see
the Nyquist Reference Manual for full details.
The pitch parameter gives the desired output pitch, and modulation is a fre-
quency deviation function you can use to implement vibrato or other modu-
lation. The output duration is controlled by modulation, so if you want zero
modulation and a duration of dur, use const(0, dur) for modulation. The
sample parameter is a list of the form: (sound pitch loop-start), where sound is
the sample (audio), pitch is the natural pitch of the sound, and loop-start is the
time at which the loop starts. Resampling is performed to convert the natural
pitch of the sample to the desired pitch computed from the pitch parameter
and modulation. The sampler function returns a sound constructed by read-
ing the audio sample from beginning to end and then splicing on copies of the
same sound from the loop point (given by loop-start) to the end. Currently,
only 2-point (linear) interpolation is implemented, so the quality is not high.
However, you can use resample to upsample your sound to a higher sample
rate, e.g. 4x and this will improve the resulting sound quality. Note also that
the loop point may be fractional, which may help to make clean loops.
Summary
Sampling synthesis combines pre-recorded sounds with signal processing to
adjust pitch, duration, and loudness. For creating music that sounds like
acoustic instruments, sampling is the currently dominant approach. One of
the limitations of sampling is the synthesis of lyrical musical phrases. While
samples are great for reproducing single notes, acoustic instruments perform
connected phrases, and the transitions between notes in phrases are very dif-
ficult to implement with sampling. Lyrical playing by wind instruments and
bowed strings is very difficult to synthesize simply by combining 1-note sam-
ples. Some sample libraries even contain transitions to address this problem,
but the idea of sampling all combinations of articulations, pitches, durations,
and loudness levels and their transitions does not seem feasible.
7.2 Filters
The most common way to think about filters is as functions that take in a signal
and give back some sort of transformed signal. Usually, what comes out is
less than what goes in. That’s why the use of filters is sometimes referred to
as subtractive synthesis.
It probably won’t surprise you to learn that subtractive synthesis is in many
ways the opposite of additive synthesis. In additive synthesis, we start with
simple sounds and add them together to form more complex ones. In subtrac-
tive synthesis, we start with a complex sound (such as noise or a rich harmonic
spectrum) and subtract, or filter out, parts of it. Subtractive synthesis can be
thought of as sound sculpting—you start out with a thick chunk of sound con-
taining many possibilities (frequencies), and then you carve out (filter) parts
of it. Filters are one of the sound sculptor’s most versatile and valued tools.
The action of filters is best explained in the frequency domain. Figure
7.4 explains the action of a filter on the spectrum of a signal. In terms of
the magnitude spectrum, filtering is a multiplication operation. Essentially,
the filter scales the amplitude of each sinusoidal component by a frequency-
dependent factor. The overall frequency-dependent scaling function is called
the frequency response of the filter.
A common misconception among students is that since filtering is multi-
plication in the frequency domain, and since the FFT converts signals to the
frequency domain (and the inverse FFT converts back), then filtering must be
performed by an FFT-multiply-IFFT sequence. This is false! At least sim-
ple filters operate in the time domain (details below). Filtering with FFTs is
possible but problematic because it is not practical to perform FFTs on long
signals.
Figure 7.4: Filtering in the frequency domain. The input signal with spectrum
X is multiplied by the frequency response of the filter H to obtain the output
spectrum Y. The result of filtering is to reduce the amplitude of some frequen-
cies and boost the amplitudes of others. In addition to amplitude changes,
filters usually cause phase shifts that are also frequency dependent. Since we
are so insensitive to phase, we usually ignore the phase response of filters.
Figure 7.5: Four common filter types (clockwise from upper left): low-pass,
high-pass, band-reject, band-pass.
Things really get interesting when you start combining low-pass and high-
pass filters to form band-pass and band-reject filters. Band-pass and band-
reject filters also have transition bands and slopes, but they have two of them:
one on each side. The area in the middle, where frequencies are either passed
or stopped, is called the passband or the stopband. The frequency in the mid-
dle of the band is called the center frequency, and the width of the band is
called the filter’s bandwidth.
You can plainly see that filters can get pretty complicated, even these sim-
ple ones. By varying all these parameters (cutoff frequencies, slopes, band-
widths, etc.), we can create an enormous variety of subtractive synthetic tim-
bres.
Figure 7.6: FIR and IIR filters. Filters are usually designed in the time domain,
by delaying a signal and then averaging (in a wide variety of ways) the delayed
signal and the nondelayed one. These are called finite impulse response (FIR)
filters, because what comes out uses a finite number of samples, and a sample
only has a finite effect.
If we delay, average, and then feed the output of that process back into the
signal, we create what are called infinite impulse response (IIR) filters. The
feedback process actually allows the output to be much greater than the input.
These filters can, as we like to say, “blow up.”
These are typical filter diagrams for FIR and IIR filters, drawn in the standard
technical notation. Note how in the IIR diagram the output of the filter’s delay is
summed back into the input, causing the infinite response characteristic. That’s
the main difference between the two filters.
Designing filters is a difficult but key activity in the field of digital signal
processing, a rich area of study that is well beyond the range of this book. It
is interesting to point out that, surprisingly, even though filters change the fre-
quency content of a signal, a lot of the mathematical work done in filter design
is done in the time domain, not in the frequency domain. By using things like
sample averaging, delays, and feedback, one can create an extraordinarily rich
variety of digital filters.
For example, the following is a simple equation for a low-pass filter. This
equation just averages the last two samples of a signal (where x(n) is the
current sample) to produce a new sample. This equation is said to have a
one-sample delay. You can see easily that quickly changing (that is, high-
frequency) time domain values will be “smoothed” (removed) by this equa-
tion.
y(n) = (x(n) + x(n − 1))/2 (7.1)
In fact, although it may look simple, this kind of filter design can be quite
difficult (although extremely important). How do you know which frequen-
cies you’re removing? It’s not intuitive, unless you’re well schooled in digital
signal processing and filter theory and have some background in mathematics.
The theory introduces another transform, the Laplace transform and its dis-
crete cousin, the Z-transform, which, along with some remarkable geometry,
yields a systematic approach to filter design.
Nyquist provides a number of built-in filters. A basic low-pass filter has the signature
lp(signal, cutoff)
Note that cutoff can also be a signal, allowing you to adjust the cutoff fre-
quency over the course of the signal. Normally, cutoff would be a control-
rate signal computed as an envelope or pwl function. Changing the filter cutoff
frequency involves some trig functions, and Nyquist filters perform these up-
dates at the sample rate of the control signal, so if the control signal sample
rate is low (typically 1/20 of the audio sample rate), you save a lot of expensive
computation.
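For example, here is a sketch of subtractive synthesis with a time-varying cutoff computed at control rate by pwl:

play lp(noise(), 100 + pwl(0.1, 3000, 0.9, 300, 1)) *
     pwl(0.02, 1, 0.9, 0.8, 1)

The cutoff rises quickly to about 3100 Hz and then sweeps down toward 100 Hz, while a separate pwl shapes the amplitude.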
Some other filters in Nyquist include:
• hp - a high-pass filter.
• areson - the opposite of reson; adding the two together recreates the
original signal, so areson is a sort of high-pass filter with a notch at the
cutoff frequency.
• comb - a comb filter has a set of harmonically spaced resonances and
can be very interesting when applied to noisy signals.
Chapter 8
Spectral Processing
Topics Discussed: FFT, Inverse FFT, Overlap Add, Reconstruction from
Spectral Frames
Imaginary part:
$X(\omega) = -\int_{-\infty}^{\infty} f(t) \sin(\omega t)\, dt$   (8.2)
Imaginary part:
$X_k = -\sum_{i=0}^{N-1} x_i \sin(2\pi k i / N)$   (8.4)
Recall from the discussion of the spectral centroid that when we take FFTs
in Nyquist, the spectra appear as floating point arrays. As shown in Figure
8.1, the first element of the array (index 0) is the DC (0 Hz) component,1
and then we have alternating cosine and sine terms all the way up to the top
element of the array, which is the cosine term of the Nyquist frequency. To
visualize this in a different way, in Figure 8.2, we represent the basis functions
(cosine and sine functions that are multiplied by the signal), and the numbers
here are the indices in the array. The second element of the array (the blue
curve labeled 1), is a single period of cosine across the duration of the analysis
frame. The next element in the array is a single sine function over the length of
the array. Proceeding from there, we have two cycles of cosine, two cycles of
sine. Finally, the Nyquist frequency term has n/2 cycles of a cosine, forming
alternating samples of +1, -1, +1, -1, +1, .... (The sine terms at 0 and the
Nyquist frequency N/2 are omitted because sin(2πki/N) = 0 if k = 0 or k =
N/2.)
Following the definition of the Fourier transform, these basis functions
are multiplied by the input signal, the products are summed, and the sums
are the output of the transform, the so-called Fourier coefficients. Each one
of the basis functions can be viewed as a frequency analyzer—it picks out a
1 This component comes from the particular case where ω = 0, so cos ωt = 1, and the integral
is effectively computing the average value of the signal. In the electrical world where we can
describe electrical power as AC (alternating current, voltage oscillates up and down, e.g. at 60 Hz)
or DC (direct current, voltage is constant, e.g. from a 12-volt car battery), the average value of the
signal is the DC component or DC offset, and the rest is “AC.”
Figure 8.1: The spectrum as a floating point array in Nyquist. Note that the
real and imaginary parts are interleaved, with a real/imaginary pair for each
frequency bin. The first and last bin have only one number (the real part)
because the imaginary part for these bins is zero.
Figure 8.2: The so-called basis functions for the Fourier transform labeled
with bin numbers. To compute each Fourier coefficient, form the dot product
of the basis function and the signal. You can think of this as the weighted
average of the signal where the basis function provides the weights. Note that
all basis functions, and thus the frequencies they select from the input signal,
are harmonics: multiples of a fundamental frequency that has a period equal to
the size of the FFT. Thus, if the N input points represent 1/10 second of sound,
the bins will represent 10 Hz, 20 Hz, 30 HZ, ..., etc. Note that we show sin
functions, but the final imaginary (sin) coefficient of the FFT is negated (see
Equation 8.4.)
particular frequency from the input signal. The frequencies selected by the
basis functions are K/duration, where index K ∈ {0, 1, ..., n/2}.
Knowing the frequencies of basis functions is very important for interpret-
ing or operating on spectra. For example, if the analysis window size is 512
samples, the sample rate is 44100 Hz, and value at index 5 of the spectrum is
large, what strong frequency does that indicate? The duration is 512/44100 =
0.01161 s, and from Figure 8.2, we can see there are 3 periods within the anal-
ysis window, so K = 3, and the frequency is 3/0.01161 = 258.398 Hz. A large
value at array index 5 indicates a strong component near 258.398 Hz.
Now, you may ask, what about some near-by frequency, say, 300 Hz? The
next analysis frequency would be 344.531 Hz at K = 4, so what happens to
frequencies between 258.398 and 344.531 Hz? It turns out that each basis
function is not a perfect “frequency selector” because of the finite number of
samples considered. Thus, intermediate frequencies (such as 300 Hz) will
have some correlation with more than one basis function, and the discrete
spectrum will have more than one non-zero term. The highest magnitude co-
efficients will be the ones whose basis function frequencies are nearest that of
the sinusoid(s) being analyzed.
of samples for analysis. From the figure, you would (correctly) expect that
ideal window functions would be smooth and sum to 1. One period of the
cosine function, raised by 1 and scaled by 1/2 (so the range is from 0 to 1), is a good example
of such a function. Raised cosine windows (also called Hann or Hanning
windows, see Figure 8.5) sum to one if they overlap by 50%.
The technique of adding smoothed overlapped intervals of sound is related
to granular synthesis. When the goal is to produce a continuous sound (as
opposed to a turbulent granular texture), this approach is called overlap-add.
But windows are applied twice: Once before FFT analysis because smooth-
ing the signal eliminates some undesirable artifacts from the computed spec-
trum, and once after the IFFT to eliminate discontinuities. If we window
twice, do the envelopes still sum to one? Well, no, but if we change the overlap
Figure 8.5: The raised cosine, Hann, or Hanning window, is named after the
meteorologist Julius von Hann. This figure, from katja (www.katjaas.nl), shows
multiple Hann windows with 50% overlap, which sum to 1. But that’s not
enough! In practice, we multiply by a smoothing window twice: once before
the FFT and once after the IFFT. See the text on how to resolve this problem.
to 75% (i.e. each window steps by 1/4 window length), then the sum of the
windows is one!2
With windowing, we can now alter spectra more-or-less arbitrarily, then
reconstruct the time domain signal, and the result will be smooth and pretty
well behaved.
For example, a simple noise reduction technique begins by converting a
signal to the frequency domain. Since noise has a broad spectrum, we expect
the contribution of noise to the FFT to be a small magnitude at every frequency.
Any magnitude that is below some threshold is likely to be “pure noise,” so
we set it to zero. Any magnitude above threshold is likely to be, at least in
part, a signal we want to keep, and we can hope the signal is strong enough to
mask any noise near the same frequency. We then simply convert these altered
spectra back into the time domain. Most of the noise will be removed and
most of the desired signal will be retained.
If you struggled to learn trig identities for high-school math class but never had any use for them,
this proof will give you a great feeling of fulfillment.
Since SAL does not have objects, but one might want object-like behaviors,
the spectral processing system is carefully implemented with “stateful” object-
oriented processing in mind. The idea is that we pass state variables into the
processing function and the function returns the final values of the state vari-
ables so that they can be passed back to the function on its next invocation.
The definition of processing-fn looks like this:
function processing-fn(sa, frame, p1, p2)
  begin
    ; ... process the frame here ...
    set frame[0] = 0.0  ; simple example: remove DC
    return list(frame, p1 + 1, p2 + length(frame))
  end
In this case, note that processing-fn works with state represented by p1 and
p2. This state is passed into the function each time it is called. The function
returns a list consisting of the altered spectral frame and the state values, which
are saved and passed back on the next call. Here, we update p1 to maintain
a count of frames, and p2 to maintain a count of samples processed so far.
These values are not used here. A more realistic use of state variables: suppose you want to compute the spectral difference between each frame and the previous one. In that case, you could initialize p1 to an array of zeros and, in processing-fn, copy the elements of frame to p1 just before returning. Then, on each call, processing-fn receives the current frame in frame as well as the previous frame in p1.
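As a concrete illustration connecting this framework back to the noise-reduction idea above, here is a minimal sketch of a frame-processing function that zeroes every bin whose magnitude falls below a threshold. It assumes (as the index arithmetic at the start of this section suggests) that a frame is an array holding the DC term followed by real/imaginary pairs for each bin; the function name, the use of the two state slots to carry the threshold and an unused extra value, and the loop bounds are my own choices, not part of Nyquist, and the threshold would be supplied as an initial state value when the processing object is created.

function noise-gate-fn(sa, frame, threshold, extra)
  begin
    with mag = 0.0,
         lim = length(frame) - 1
    ; assumed layout: frame = [DC, R1, I1, R2, I2, ...]
    loop
      for i from 1 below lim by 2
      set mag = sqrt(frame[i] * frame[i] +
                     frame[i + 1] * frame[i + 1])
      if mag < threshold then
        begin
          set frame[i] = 0.0      ; zero the real part ...
          set frame[i + 1] = 0.0  ; ... and the imaginary part
        end
    end
    return list(frame, threshold, extra)
  end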
Getting back to our example, to run the spectral processing, you write the
following:
play sp-to-sound(sp)
The sp-to-sound function takes a spectral processing object created by
sp-init, calls it to produce a sequence of frames, converts the frames back to
the time domain, applies a windowing function, and performs an overlap add,
resulting in a sound.
Chapter 9
Vocal and Spectral Models
The right side of Figure 9.1 shows roughly what the waveform looks like. Pulses from the
vocal folds are rounded, and we have some control over both the shape and
frequency of the pulses by using our muscles.
These pulses go into a resonant chamber that is formed by the oral cavity
(the mouth) and the nasal cavity. The shape of the oral cavity can be manip-
ulated extensively by the tongue and lips. Although you are usually unaware
of your tongue and lips while speaking, you can easily attend to them as you
produce different vocal sounds to get some idea of their function in speech
production and singing.
Figure 9.1: Diagram of a human head to illustrate how the voice works. The
waveform at right is roughly what the “input” sound looks like coming from
the vocal folds.
second formant is around 1500 Hz, we will get vowel sound IR as in “sir.” On
the other hand, if we increase those frequencies to around 800 Hz for the first
formant and close to 2000 Hz for the second one, we will get an A as in “sat,”
which is perceived quite differently. After plotting all of the vowels, we see
that vowels form regions in the space of formant frequencies. We control those
frequencies by changing the shape of our vocal tract to make vowel sounds.
noise signal (whispered / unvoiced sound, where the vocal folds are not vibrat-
ing, used for sounds like “sss”), or a pulse train of periodic impulses creating
voiced speech as is normal for vowels. The selected source input goes into the
vocal tract filter, which is characterized by the resonances or formants that we
saw in the previous paragraphs.
Notice that the frequency of the source (i.e. the rate of pulses) determines
the pitch of the voiced sound, and this is independent of the resonant fre-
quencies (formant frequencies). The source spectrum for voiced sounds is a
harmonic series with many strong harmonics arising from the periodic pulses.
The effect of filtering is to multiply the spectrum by the filter response, which
boosts harmonics at or near the formant frequencies.
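In symbols (a standard statement of the source-filter model): if the source has spectrum $S(f)$ and the vocal tract has frequency response $H(f)$, then the output spectrum is
$$Y(f) = S(f)\,H(f),$$
so source harmonics that fall near peaks (formants) of $H(f)$ are boosted.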
In the previous section, we have learned how the source-filter model is used
to model voice and produce voice sounds. Now we are going to talk about
a specific speech analysis/synthesis method that can be used to construct the
filter part of the source-filter model. This method is called Linear Prediction
Coding (LPC).
Figure 9.4: The physical analogy of LPC: a tube with varying cross-section.
Another important point here is that we are not talking about a fixed shape
but a shape that is rapidly changing, which makes speech possible. We are
making sequences of vowel sounds and consonants by rapidly moving the
tongue and mouth as well as changing the source sound. To apply LPC to
speech, we analyze speech sounds in short segments (and as usual, we call
short analysis segments frames). This is analogous to short-time windows in
the SFFT. At each LPC analysis frame, we re-estimate the geometry of the
tube, which means we re-estimate the coefficient values that determine the fil-
ter. Frames give rise to changing coefficients, which model changes in tube
geometry (or vocal tract shape).
LPC actually creates an inverse filter. Applying the inverse filter to a vocal
sound yields a residual that sounds like the source signal. The residual may ei-
ther be an estimate of glottal pulses, making the residual useful for estimating
the pitch of the source, or noise-like.
Taking all this into account, an overall LPC analysis system is shown in
Figure 9.6. We have an input signal and do formant analysis, which is the
estimation of coefficients that can reconstruct the filter part of the source-filter
model. After applying the inverse filter obtained from the formant analysis to the input signal, we get the residual, which we can analyze further to estimate the pitch. We can also test whether the input is periodic, which gives a decision of whether the input is voiced or unvoiced. Finally, we can compute the RMS amplitude of the signal. So, for each frame of LPC analysis, we get not only the filter part of the source-filter model, but also the estimated pitch, the amplitude, and whether the source is voiced or unvoiced.
There are many musical applications of voice sounds and source-filter models.
One common application pursued by many composers is to replace the source
with some other sound. For example, rather than having a pitched source or noise source based on the human voice, you can apply human-sounding formants to a recording of an orchestra to make it sound like speech. This idea of taking elements from two different sounds and putting them together is sometimes called “cross-synthesis.” You can also “warp” the filter frequencies to get human-like but not really human filters. Also, you can modify the source (glottal pulses or noise) and the LPC coefficients to perform time stretching, slowing speech down or speeding it up.
9.3 Vocoder
9.4 VOSIM
VOSIM is a simple and fun synthesis method inspired by the voice. VOSIM
is also fast and efficient. It was developed in the 70s by Kaegi and Tempelaars
[Kaegi and Tempelaars, 1978]. The basic idea of VOSIM is to think about
what happens in the time domain when a glottal pulse (pulse coming through
the vocal folds) hits a resonance (a formant resonance in the vocal tract). The
answer is that you get an exponentially damped sinusoid. In speech produc-
tion, we have multiple formants. When a pulse enters a complex filter with
multiple resonances, we get the superposition of many damped sinusoids, one
for each resonance. The exact details are influenced by the shape of the pulse.
VOSIM is the first of several time-domain models for speech synthesis that we
will consider.
lence that follows the pulses. As you can see in Figure 9.8, the result is at least
somewhat like a decaying sinusoid, and this gives an approximation of a pulse
filtered by a single resonance. The fundamental frequency (establishing the
pitch) is determined by the entire period, which is NT + M and the resonant
frequency is 1/T , so we expect the harmonics near 1/T to be stronger.
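For example (made-up numbers, just to illustrate the formulas): with a pulse period of T = 1 ms, N = 3 pulses, and M = 2 ms of silence, the overall period is NT + M = 5 ms, so the fundamental frequency is 1/0.005 s = 200 Hz, while harmonics near 1/T = 1000 Hz are emphasized, much like a formant at about 1 kHz.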
pulses do not overlap.) This makes the implementation of FOF more difficult, and we will describe the implementation below.
One final detail about FOF is that the decaying sinusoid has an initial
smooth rise time. The rise time gives some additional control over the re-
sulting spectrum, as shown in Figure 9.10.
Since analysis can be performed using any spectrum, FOF is not limited to
voice synthesis, and FOF has been used for a wide range of sounds.
that input sounds consist of fairly steady sinusoidal partials, there are relatively
few partials and they are well-separated in frequency. If that is the case, then
we can use the FFT to separate the partials into different bins (different coeffi-
cients) of the discrete spectrum. Each frequency bin is assigned an amplitude
and phase. A signal is reconstructed using the inverse FFT (IFFT).
The main attraction of the phase vocoder is that by changing the spacing of
analysis frames, the reconstructed signal can be made to play faster or slower,
but without pitch changes because the partial frequencies stay the same. Time
stretching without pitch change also permits pitch change without time stretch-
ing! This may seem contradictory, but the technique is simple: play the sound
faster or slower (by resampling to a new sample rate) to make the pitch go up
or down. This changes the speed as well, so use a phase vocoder to change
the altered speed back to the original speed. Now you have pitch change with-
out speed change, and of course you can have any combination or even time-
variable pitch and speed.
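As a worked example of this recipe (the numbers are mine): to raise the pitch by one semitone, resample the sound by the ratio $2^{1/12} \approx 1.0595$, which raises every frequency by a semitone but also shortens the duration by the same factor; then use the phase vocoder to time-stretch the result by a factor of 1.0595, restoring the original duration while keeping the new pitch.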
When successive FFT frames are used to reconstruct signals, there is the
potential that partials in one frame will be out of phase with partials in the
next frame and some cancellation will occur where the frames overlap. This
should not be a problem if the signal is simply analyzed and reconstructed
with no manipulation, but if we want to achieve a speed-up or slow-down of
the reconstructed signal, then the partials can become out of phase.
In the Phase Vocoder, we use phase measurements to shift the phase of par-
tials in each frame to match the phase in the previous frame so that cancellation
does not occur. This tends to be more pleasing because it avoids the somewhat
buzzy sound of partials being amplitude-modulated at the FFT frame rate. In
return for eliminating the frame-rate discontinuities, the Phase Vocoder often
smears out transients (by shifting phases more-or-less randomly), resulting in
a sort of fuzzy or smeared quality, especially if frames are long. As we often
see (and hear), there is a time-frequency uncertainty principle at play: shorter
frames give better transient (time) behavior, and longer frames give better re-
production of sustained timbres due to better frequency resolution.
vocoder indicates small deviations from the center frequency of each bin, so the phase vocoder,
when properly implemented, is modeling frequencies accurately, but the model is still one where
partial frequencies are modeled as fixed frequencies.
time stretch and manipulate all the control parameters. The noise modeling in
SMS allows for more accurate reconstruction as well as control over how much
of the noise in the signal should be retained (or for that matter emphasized) in
synthesis.
9.9 Summary
We have considered many approaches to voice modeling and synthesis. The
number of methods is some indication of the broad attraction of the human
voice in computer music, or perhaps in all music, considering the prevalence of
vocals in popular music. The source-filter model is a good way to think about
sound production in the voice: a source signal establishes the pitch, amplitude
and harmonicity or noisiness of the voice, and more-or-less independently,
formant filters modify the source spectrum to impose resonances that we can
perceive as vowel sounds.
LPC and the Vocoder are examples of models that actually use resonant
filters to modify a source sound. These methods are open to interesting cross-
synthesis applications where the source sound of some original vocal sound is
replaced by another, perhaps non-human, sound.
Resonances can also be modeled in the time domain as decaying sinusoids.
VOSIM and FOF synthesis are the main examples. Both can be seen as related
to granular synthesis, where each source pulse creates new grains in the form
of decaying sinusoids, one to model each resonance.
In the spectral domain, the Phase Vocoder is a powerful method that is
especially useful for high-quality time-stretching without pitch change. Time-
stretching also allows us to “undo” time stretch due to resampling, so we can
use the Phase Vocoder to achieve pitch shifting without stretching. MQ analy-
sis and synthesis models sounds as sinusoidal tracks that can vary in amplitude
and frequency, and it uses addition of sinusoidal partials for synthesis. Spec-
tral Modeling Synthesis (SMS) extends MQ analysis/synthesis by modeling
noise separately from partials, leading to a more compact representation and
offering more possibilities for sound manipulation.
Although we have focused on basic sound production, the voice is com-
plex, and singing involves vibrato, phrasing, actual speech production beyond
vowel sounds, and many other details. Undoubtedly, there is room for more
research and for applying recent developments in speech modeling to musical
applications. At the same time, one might hope that the speech community
could become aware of sophisticated control and modeling techniques as well
as the need for very high quality signals arising in the computer music com-
munity.
Chapter 10
Acoustics, Perception,
Effects
Topics Discussed: Pitch vs. Frequency, Loudness vs. Amplitude, Lo-
calization, Linearity, Reverberation, Echo, Equalization, Chorus, Panning,
Dynamics Compression, Sample-Rate Conversion, Convolution Reverber-
ation
10.1 Introduction
Acoustics and perception are very different subjects: acoustics being about
the physics of sound and perception being about our sense and cognitive pro-
cessing of sound. Though these are very different, both are very important
for computer sound and music generation. In sound generation, we generally
strive to create sounds that are similar to those in the real world, sometimes us-
ing models of how sound is produced by acoustic instruments. Alternatively,
we may try to create sound that will cause a certain aural impression. Some-
times, an understanding of physics and perception can inspire sounds that have
no basis in reality.
Sound is vibration or air pressure fluctuations (see Figure 10.1). We can
hear small pressure variations, e.g. 0.001 psi (lbs/in²) for loud sound. (We are
purposefully using English units of pounds and inches because if you inflate a
bicycle tire or check tires on an automobile – at least in the U.S. – you might
have some idea of what that means.) One psi ≈ 6895 pascals (Pa), so 0.001 psi is about 7 Pa. At sea level, air pressure is 14.7 pounds per square inch, while the cabin pressure in an airplane is about 11.5 psi, so 0.001 psi (and remember that is a loud sound) is a tiny, tiny fraction of the nominal constant pressure around us. Changes in air pressure deflect our ear drum. The amplitude of deflection of the ear drum is about the diameter of a hydrogen atom for the softest sounds. So we indeed have extremely sensitive ears!
What can we hear? The frequencies that we hear range over three orders of magnitude, from about 20 Hz to 20 kHz. As we grow older and are exposed to loud sounds, our high-frequency hearing is impaired, so the actual range is typically less.
Our range of hearing, in terms of loudness, is about 120 dB, which is
measured from threshold of hearing to threshold of pain (discomfort from loud
sounds). In practical terms, our dynamic range is actually limited and often
determined by background noise. Listen to your surroundings right now and notice what you can hear. Anything you hear is likely to mask even softer
sounds, limiting your ability to hear them and reducing the effective dynamic
range of sounds you can hear.
We are very sensitive to the amplitude spectrum. Figure 10.2 shows a spec-
tral view that covers our range of frequency, range of amplitude, and suggests
that the shape of the spectrum is something to which we are sensitive.
Real-world sounds are complex. We have seen many synthesis algorithms
that produce very clean, simple, specific spectra and waveforms. These can
be musically interesting, but they are not characteristic of sounds in the “real
world.” This is important; if you want to synthesize musical sounds that are
pleasing to the ear, it is important to know that, for example, real-world sounds
are not simple sinusoids, they do not have constant amplitude, and so on.
Let’s consider some “real-world” sounds. First, noisy sounds, such as the
sound of “shhhh,” tend to be broadband, meaning they contain energy at al-
most all frequencies, and the amount of energy in any particular frequency, or
the amplitude at any particular time, is random. The overall spectrum of noisy
sounds looks like the top spectrum in Figure 10.3.
Figure 10.2: The actual spectrum shown here is not important, but the graph
covers the range of audible frequencies (about 20 kHz) and our range of am-
plitudes (about 120 dB). We are very sensitive to the shape of the spectrum.
Percussion sounds on the other hand, such as a thump, bell clang, ping, or
knock, tend to have resonances, so the energy is more concentrated around cer-
tain frequencies. The middle picture of Figure 10.3 gives the general spectral
appearance of such sounds. Here, you see resonances at different frequen-
cies. Each resonance produces an exponentially decaying sinusoid. The decay
causes the spectral peak to be wider than that of a steady sinusoid. In general,
the faster the decay, the wider the peak. It should also be noted that, depending
on the material, there can be non-linear coupling between modes. In that case,
the simple model of independent sinusoids with exponential decay is not exact
or complete (but even then, this can be a good approximation).
Figure 10.4 shows some modes of vibration in a guitar body. This is the top
of an acoustic guitar, and each picture illustrates how the guitar top plate flexes
in each mode. At the higher frequencies (e.g. “j” in Figure 10.4), you can see
patches of the guitar plate move up while neighboring patches move down.
If you tap a guitar body, you will “excite” these different modes and get a
collection of decaying sinusoids. The spectrum might look something like the
middle of Figure 10.3. (When you play a guitar, of course, you are strumming
strings that have a different set of modes of vibration that give sharper peaks
and a clearer sense of pitch.)
Pitched sounds are often called tones and tend to have harmonically related partials.
Figure 10.3: Some characteristic spectra. Can you identify them? Horizontal
axis is frequency. Vertical axis is magnitude (amplitude). The top represents
noise, with energy at all frequencies. The middle spectrum represents a per-
cussion sound with resonant or “characteristic” frequencies. This might be a
bell sound. The bottom spectrum represents a musical tone, which has many
harmonics that are multiples of the first harmonic, or fundamental.
that our perception does not directly correspond to the underlying physical
properties of sound. For example, two sounds with the same perceived loud-
ness might have very different amplitudes, and two pitch intervals perceived
to be the same musically could be not at all the same in terms of frequency
differences.
Figure 10.6: A piano keyboard. The ratio between frequencies of adjacent keys is $2^{1/12}$. (credit: http://alijamieson.co.uk/2017/12/03/describing-relationship-two-notes/)
We can divide a semitone (a ratio of $2^{1/12}$) into 100 log-spaced frequency intervals
called cents. Often cents are used in studies of tuning and intonation. We are
sensitive to about 5 cents, which is a ratio of about 0.3%. No wonder it is so
hard for musicians to play in tune!
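For reference, the interval in cents between frequencies $f_1$ and $f_2$ is $1200\log_2(f_2/f_1)$, so a 5-cent difference corresponds to a frequency ratio of $2^{5/1200}\approx 1.003$, i.e. about 0.3%.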
10.2.2 Amplitude
The term pitch refers to perception while frequency refers to objective or phys-
ical repetition rates. In a similar way, loudness is the perception of intensity
Figure 10.7: Three different ways of measuring amplitude. Note: the 4 on the
Figure indicates the length of one period. credit: Wikipedia
Volume
There is no one formal definition for what the volume of sound means. Gener-
ally, volume is used as a synonym for loudness. In music production, adjusting
the volume of a sound usually means to move a volume fader on e.g. a mixer.
Faders can be either linear or logarithmic, so again, it is not exactly clear what
volume means (i.e. is it the fader position, or the perceived loudness?).
Power
Power is the amount of work done per unit of time; e.g. how much energy
is transferred by an audio signal. Hence, the average power P̄ is the ratio of
energy E to time t: P̄ = E/t. Power is usually measured in watts (W), which are joules (J) per second: 1 W = 1 J/s. The power of a sound signal is proportional
to the squared amplitude of that signal. Note that in terms of relative power
or amplitude, it does not really matter whether we think of amplitude as the
instantaneous amplitude (i.e. the amplitude at a specific point in time), or the
peak, the peak-to-peak, or the root mean square amplitude over a single period.
It makes no significant difference to ratios.
Pressure
In general, the pressure p is the amount of force F applied perpendicularly
(normal) to a surface area divided by the size a of that area: p = F/a. Pressure
is normally measured in pascals (Pa), which are newtons (N) per square meter: Pa = N/m².
Intensity
Intensity I is the energy E flowing across a surface of area a per unit of time t: $I = \frac{E}{a}\cdot\frac{1}{t}$. For instance, the energy from a sound wave flowing to an ear. As the average power P̄ = E/t, we can express the intensity as I = P̄/a, meaning the
power flowing across a surface of area a. The standard unit area is one square
meter, and therefore we measure intensity in W /m2 ; i.e. watts per square
meter.
The threshold of hearing is about $t_h = 10^{-12}$ W/m² and the limit of hearing (near the threshold of pain) is about $l_h = 1$ W/m², so
$$10 \log_{10} \frac{l_h}{t_h} = 10 \log_{10} \frac{1}{10^{-12}} = 120\ \mathrm{dB}.$$
Hence, the intensity range of human hearing is 120 dB.
More generally, we can express any intensity I in dB relative to a reference intensity $I_{ref}$:
$$10 \log_{10} \frac{I}{I_{ref}}$$
So for example, instead of using th as the reference intensity, we could use
the limit of hearing lh , and thus measure down from pain instead of measuring
up from silence.
Because power is proportional to the square of amplitude, a ratio of amplitudes A and B expressed in dB is
$$\mathrm{dB} = 10 \log_{10} \frac{A^2}{B^2} = 10 \log_{10} \left(\frac{A}{B}\right)^2 = 20 \log_{10} \frac{A}{B}.$$
Main idea: for power or intensity we use $10 \log_{10}$ of the ratio, but for amplitudes, we must use $20 \log_{10}$ of the ratio.
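For example, doubling the power or intensity of a signal is a change of $10\log_{10}2\approx 3$ dB, while doubling its amplitude is $20\log_{10}2\approx 6$ dB.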
Other dB Variants
You see the abbreviation dB in many contexts.
• dBA uses the so-called A-weighting to account for the relative loudness perceived by the human ear, based on equal-loudness contours.
• dBB and dBC are similar to dBA, but apply different weighting schemes.
Sound Pressure
Measuring the intensity of a sound signal is usually not practical (or possible);
i.e. measuring the energy flow over an area is tricky. Fortunately, we can
measure the average variation in pressure. Pressure is the force applied normal
to a surface area, so if we sample a sufficiently large area we’ll get a decent
approximation; this is exactly what a microphone does!
It is worth noting that we can relate sound pressure to intensity by the
following ratio,
$$\frac{\Delta p^2}{V\,\delta}$$
where ∆p is the variation in pressure, V is the velocity of sound in air, and δ
is the density of air. What this tells us is that intensity is proportional to the
square of the variation in pressure.
For more information on loudness, dB, intensity, pressure etc., see Musi-
mathics Vol. 1 [Loy, 2011].
10.2.3 Loudness
Loudness is a perceptual concept. Equal loudness does not necessarily result from equal amplitude, because the human ear's sensitivity to sound varies with frequency. This frequency-dependent sensitivity is depicted in the equal-loudness contours for the human ear, often referred to as the Fletcher-Munson curves. Fletcher
and Munson’s data were revised to create an ISO standard. Both are illus-
trated in Figure 10.9 below. Each curve traces changes in amplitude required
to maintain equal loudness as the frequency of a sine tone varies. In other
words, each curve depicts amplitude as a function of frequency at a constant
loudness level.
Loudness is commonly measured in phons.
Phon
Phon expresses the loudness of a sound in terms of a reference loudness. That
is, the phon level of a sound, A, is the dB SPL (defined earlier) of a reference
sound—of frequency 1 kHz—that has the same (perceived) loudness as A.
Zero phon is the limit of audibility of the human ear; inaudible sounds have
negative phon levels.
Rules of Thumb
Loudness is mainly dependent on amplitude, and this relationship is approx-
imately logarithmic, which means equal ratios of amplitude are more-or-less
perceived as equal increments of loudness. We are very sensitive to small ra-
tios of frequency, but we are not very sensitive to small ratios of amplitude. To
double loudness you need about a 10-fold increase in intensity, which is about 10 dB.
We are sensitive to about 1 dB of amplitude ratio. 1 dB is a change of about
12%.
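Checking the arithmetic: a 1 dB amplitude change is a ratio of $10^{1/20}\approx 1.12$, or about 12%, and a 10-fold increase in intensity is $10\log_{10}10 = 10$ dB.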
The fact that we are not so sensitive to amplitude changes is important to
keep in mind when adjusting amplitude. If you synthesize a sound and it is
too soft, try scaling it by a factor of 2. People are often tempted to make small
changes, e.g. multiply by 1.1 to make something louder, but in most cases, a change that small is barely audible.
It is instructive to listen to this using a quiet volume setting with good head-
phones. You should hear the two tones at the same pitch. Then, listen to the
same sounds on your built-in laptop speaker. You will probably not hear the
second tone, indicating that your computer cannot produce frequencies that
low. And yet the first tone, even with no fundamental frequency present, retains the same pitch!
10.2.4 Localization
Localization is the ability to perceive direction of the source of a sound and
perhaps the distance of that sound. We have multiple cues for localization,
including relative amplitude, phase or timing, and spectral effects of the pinnae
(outer ears). Starting with amplitude, if we hear something louder in our right
ear than our left ear, then we will perceive that the sound must be coming from
the right, all other things being equal.
The third cue is produced by our pinnae or outer ears. Sound reflects off
the pinnae, which are very irregularly shaped. These reflections cause some
cancellation or reinforcement at particular wavelengths, depending on the di-
rection of the sound source and the orientation of our head. Even though we
do not know the exact spectrum of the source sound, our amazing brain is able
to disentangle all this information and compute something about localization.
This is especially important for the perception of elevation, i.e. is the sound
coming from ahead or above? In either case, the distance to our ears is the
same, so whether the source is ahead or above, there should be no difference
in amplitude or timing. The only difference is in spectral changes due to our
outer ears and reflections from our shoulders.
All of these effects or cues can be described in terms of filters. Taken to-
gether, these effects are sometimes called the HRTF, or Head-Related Transfer
Function. (A “transfer function” describes the change in the spectrum from
source to destination.) You may have encountered artificial localization systems, including video games and virtual reality applications, based on HRTFs. The idea is, for each source to be placed in a virtual 3D space, to compute
an appropriate HRTF and apply that to the source sound. There is a different
HRTF for the left ear and right ear, and typically the resulting stereo signal is
presented through headphones. Ideally, the headphones are tracked so that the
HRTFs can be recomputed as the head turns and the angles to virtual sources
change.
Environmental cues are also important for localization. If sounds reflect off
walls, then you get a sense of being in a closed space, how far away the walls
are, and what the walls are made of. Reverberation and ratio of reverberation
to direct sound are important for distance estimation, especially for computer
music if you want to create the effect of a sound source fading off into the
distance. Instead of just turning down the amplitude of the sound, the direct
or dry sound should diminish faster than the reverberation sound to give the
impression of greater distance.
and because filters are linear, they weight or delay frequencies differently but
independently. If we add two sounds and put them through the filter, the result
is equivalent to putting the two sounds through the filter independently and
summing the results.
Figure 10.10: A comparison of concepts and terms from perception (left col-
umn) to acoustics (right column).
Note that the duration of the output of this unit generator is equal to the
duration of the input, so if the input is supposed to come to an end and then
be followed by multiple echoes, we need to append silence to the input source
to avoid a sudden ending. The example below uses s-rest() to construct 10
seconds of silence, which follows the sound.
feedback-delay(seq(sound, s-rest(10)), delay, feedback)
In principle, the exponential decay of the feedback-delay effect never ends,
so it might be prudent to use an envelope to smoothly bring the end of the
signal to zero:
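A minimal sketch (the envelope breakpoints are arbitrary choices of mine, assuming the input plus appended silence lasts about 12 seconds, with the envelope holding at 1 and then ramping smoothly to zero over the final 2 seconds):

feedback-delay(seq(sound, s-rest(10)), delay, feedback) *
    pwl(0.01, 1, 10, 1, 12)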
Figure 10.12: Filter response of a comb filter. The horizontal axis is frequency
and the vertical axis is amplitude. The comb filter has resonances at multiples
of some fundamental frequency.
The code below shows how to apply a comb filter to sound in Nyquist. A
comb filter emphasizes (resonates at) frequencies that are multiples of a hz.
The decay time of the resonance is given by decay. The decay may be a sound
or a number. In either case, it must also be positive. The resulting sound will
have the start time, sample rate, etc. of sound. One limitation of comb is that
the actual delay will be the closest integer number of sample periods to 1/hz, so
the resonance frequency spacing will be one that divides the sample rate evenly.
comb(sound, decay, hz)
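For example, a quick way to hear the effect (the parameter values are mine) is to run noise through the comb filter so that it resonates at 220 Hz and its multiples, with a 2-second decay:

play comb(noise(2), 2.0, 220.0)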
10.3.4 Equalization
Equalization is generally used to adjust spectral balance. For example, we
might want to boost the bass, or boost the high frequencies, or cut some objec-
10.3.5 Chorus
The chorus effect is a very useful way to enrich a simple and dry sound. Es-
sentially, the “chorus” effect is a very short, time-varying delay that is added
to the original sound. Its implementation is shown in Figure 10.13. The orig-
inal sound passes through the top line, while a copy of the sound with some
attenuation is added to the sound after a varying delay, which is indicated by
the diagonal arrow.
Figure 10.13: Chorus effect algorithm. A signal is mixed with a delayed copy
of itself. The delay varies, typically by a small amount and rather slowly.
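A minimal sketch of this diagram in Nyquist might look as follows, assuming the variable delay tap tapv(sound, offset, variation, maxdelay) and the low-frequency oscillator lfo described in the Nyquist Reference Manual; the mix, delay, depth, and rate values are arbitrary choices of mine:

define function chorus-sketch(s)
  ; dry signal plus an attenuated copy whose delay wanders
  ; around 10 ms (+/- 3 ms, modulated at 0.3 Hz); the 30 s
  ; lfo duration should be at least as long as the input
  return 0.7 * s +
         0.3 * tapv(s, 0.01, 0.003 * lfo(0.3, 30), 0.02)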
10.3.6 Panning
Panning refers to the simulation of location by splitting a signal between left
and right (and sometimes more) speakers. When panning a mono source from
left to right in a stereo output, you are basically adjusting the volume of that
source in the left and right channels. Simple enough. However, there are
multiple reasonable ways of making those adjustments. In the following, we
shall cover the three most common ones.
A typical two-speaker stereo setup is depicted in Figure 10.14; the speakers
are placed symmetrically at ±45-degree angles, and equidistant to the listener,
who is located at the so-called “sweet-spot,” while facing the speakers.
Note that the range of panning (for stereo) is thus 90 degrees. However, it
is practical to use radians instead of degrees. By convention, the left speaker is
at 0 radians and the right speaker is at π/2 radians, giving us a panning range
of θ ∈ [0; π/2], with the center position at θ = π/4.
Linear Panning
The simplest panning strategy is to adjust the channel gains (volumes) linearly and in opposite directions: as one channel's gain increases, the other's decreases.
Main idea: for a stereo signal with gain 1, the gains of the left
and right channels should sum to 1; i.e. L(θ ) + R(θ ) = 1.
With the panning angle θ ∈ [0; π/2] we thus get the gain functions
$$L(\theta) = \left(\frac{\pi}{2} - \theta\right)\frac{2}{\pi} = 1 - \frac{2\theta}{\pi}$$
and
$$R(\theta) = \theta\,\frac{2}{\pi} = \frac{2\theta}{\pi}$$
as plotted in Figure 10.15.
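At the center position $\theta = \pi/4$, both gains are 0.5, so each channel is attenuated by $20\log_{10}0.5\approx -6$ dB and the total acoustic power is $0.5^2 + 0.5^2 = 0.5$, about $-3$ dB; this dip in power at the center is the origin of the “hole-in-the-middle” problem mentioned below.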
stereo to mono by adding the left and right channels (so anything panned to the
center is now added perfectly in phase), or if you sit in exactly the right spot
for left and right channels to add in phase.4 The idea behind the -4.5 dB law
is to split the difference between constant power and linear panning – a kind
of compromise between the two. This is achieved by simply taking the square root of the product of the two laws; thus we have
$$L(\theta) = \sqrt{\left(\frac{\pi}{2} - \theta\right)\frac{2}{\pi}\,\cos\theta}
\qquad\text{and}\qquad
R(\theta) = \sqrt{\theta\,\frac{2}{\pi}\,\sin\theta},$$
as plotted in Figure 10.17.
As we can see on the plot, the center gain is now at 0.59, and hence the per-channel attenuation is $10\log_{10}\frac{0.59^2}{1^2}\ \mathrm{dB} = -4.5$ dB, which is exactly in between that for the previous two laws. The power of the signal at the center is now $L^2(\pi/4) + R^2(\pi/4) = 0.59^2 + 0.59^2 = 0.71$, corresponding to $10\log_{10}\frac{0.59^2 + 0.59^2}{1^2}\ \mathrm{dB} = -1.5$ dB. If amplitudes are additive when stereo is converted to mono, the center pan signal is boosted by 1.5 dB. When signals are panned to the center and not heard in phase, the center pan signal is attenuated by 1.5 dB.
cannot stay in phase everywhere, so the total power is conserved even if there are some “hot
spots.”
jects, and found that constant power panning was preferred by the listeners.
Then, in the 1950’s, the BBC conducted a similar experiment and concluded
that the -4.5 dB compromise was the better pan law. Now, we know next to
nothing about the details of those experiments, but their results might make
sense if we assume that Disney’s focus was on movie theaters and the BBC’s
was on TV audiences. Let’s elaborate on that. It all depends on the listening
situation. That is, how are the speakers placed, what is the size of the room,
where is the listener placed, and can we expect the listener to stay in the same
place? If the speakers are placed close to each other in a small room (with
very little reverb) or if the listener is in the sweet spot, then one can reason-
ably expect that the phases of the signals from the two speakers will add up constructively, acting pretty much as one single speaker (at least at lower frequencies), as depicted in Figure 10.18.
In such a case where signals are in phase so that the left and right ampli-
tudes sum, the sounds that are panned to the center will experience up to 3 dB
of boost with constant power panning but a maximum of only 1.5 dB boost
using the -4.5 dB compromise. Thus, the -4.5 dB rule might give more equal
loudness panning. It could also be that the BBC considered mono TV sets
where the left and right channels are added perfectly in phase. In this case,
the -4.5 dB compromise gives a 1.5 dB boost to center-panned signals vs. a
3 dB boost with constant power panning. (Recall that linear panning is ideal
for mono reproduction because the center-panned signals are not boosted at
all. However, linear panning results in the “hole-in-the-middle” problem for
stereo.)
On the other hand, if the speakers are placed far from each other in a big
room, then the phases will not add up constructively; as seen in Figure 10.19.
Furthermore, in this case, the listeners are probably not always placed at
the sweet spot – there are probably multiple listeners placed at different dis-
tances and angles to the speakers (as e.g. in a movie theater). One would
expect constant power panning to produce more uniform loudness at all pan-
ning positions in this situation.
Given the variables of listener position, speaker placement, and possible
Figure 10.19: Two speakers out of phase; phases do not add constructively.
Panning in Nyquist
pan(sound, where)
The pan function pans sound (a behavior) according to where (another behav-
ior or a number). Sound must be monophonic. The where parameter should
range from 0 to 1, where 0 means pan completely left, and 1 means pan com-
pletely right. For intermediate values, the sound is scaled linearly between left
and right.
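For example, the following sketch (the pitch, duration, and scale factor are arbitrary choices of mine) pans a three-second tone from hard left to hard right, using Nyquist's ramp behavior, which rises linearly from 0 to 1:

play pan(osc(c4, 3) * 0.5, ramp(3))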
10.3.7 Compression/Limiting
A Compression/Limiting effect refers to automatic gain control, which reduces
the dynamic range of a signal. Do not confuse dynamics compression with
data compression such as producing an MP3 file. When you have a signal
that ranges from very soft to very loud, you might like to boost the soft part.
Alternatively, when you have a narrow dynamic range, you can expand it to
make soft sounds much quieter and louder sounds even louder.
The basic algorithm for compression is shown in Figure 10.20. The com-
pressor detects the signal level with a Root Mean Square (RMS) detector5 and
uses table-lookup to determine how much gain to place on the original signal
at that point. The implementation in Nyquist is provided by the Nyquist Ex-
tension named compress, and there are two useful functions in the extension.
The first one is compress, which compresses input using map, a compression
curve probably generated by compress-map.6 Adjustments in gain have the
given rise-time and fall-time. lookahead tells how far ahead to look at the sig-
nal, and is rise-time by default. Another function is agc, an automatic gain
5 The RMS analysis consists of squaring the signal, which converts each sample from positive
or negative amplitude to a positive measure of power, then taking the mean of a set of consecutive
power samples perhaps 10 to 50 ms in duration, and finally taking the square root of this “mean
power” to get an amplitude.
6 compress-map(compress-ratio, compress-threshold, expand-ratio, expand-threshold,
limit, transition, verbose) constructs a map for the compress function. The map consists
of two parts: a compression part and an expansion part. The intended use is to compress
everything above compress-threshold by compress-ratio, and to downward expand everything
below expand-threshold by expand-ratio. Thresholds are in dB and ratios are dB-per-dB. 0 dB
corresponds to a peak amplitude of 1.0 or RMS amplitude of 0.7. If the input goes above 0
dB, the output can optionally be limited by setting limit: (a keyword parameter) to T. This
effectively changes the compression ratio to infinity at 0 dB. If limit: is nil (the default), then
the compression-ratio continues to apply above 0 dB.
control applied to input. The maximum gain in dB is range. Peaks are attenu-
ated to 1.0, and gain is controlled with the given rise-time and fall-time. The
look-ahead time default is rise-time.
compress(input, map, rise-time, fall-time, lookahead)
agc(input, range, rise-time, fall-time, lookahead)
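For instance, a plausible call once the compress extension is installed (the parameter values and the name my-input are mine, not from the manual): limit automatic gain boosts to 20 dB, with a 0.1 s rise time and a 0.5 s fall time, letting the lookahead default to the rise time:

play agc(my-input, 20, 0.1, 0.5)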
10.3.8 Reverse
Reverse is simply playing a sound backwards. In Nyquist, the reverse func-
tions can either reverse a sound or a file, and both are part of the Nyquist ex-
tension named reverse. If you reverse a file, Nyquist reads blocks of samples
from the file and reverses them, one-at-a-time. Using this method, Nyquist can
reverse very long sounds without using much memory. See s-read-reverse
in the Nyquist Reference Manual for details.
To reverse a sound, Nyquist must evaluate the whole sound in memory,
which requires 4 bytes per sample plus some overhead. The function
s-reverse(sound) reverses sound, which must be shorter than
*max-reverse-samples* (currently initialized to 25 million samples). This
function does sample-by-sample processing without an efficiently compiled
unit generator, so do not be surprised if it calls the garbage collector a lot
and runs slowly. The result starts at the starting time given by the current
environment (not necessarily the starting time of sound). If sound has multiple
channels, a multiple channel, reversed sound is returned.
force-srate(srate, sound)
resample(snd, rate)
quantize(sound, steps)
10.3.11 Reverberation
A reverberation effect simulates playing a sound in a room or concert hall.
Typical enclosed spaces produce many reflections from walls, floor, ceiling,
chairs, balconies, etc. The number of reflections increases exponentially with
time due to secondary, tertiary, and additional reflections, and also because
sound is following paths in all directions.
Typically, reverberation is modeled in two parts:
• Early reflections, e.g. sounds bouncing off one wall before reaching the
listener, are modeled by discrete delays.
• Late reflections become very dense and diffuse and are modeled using a
network of all-pass and feedback-delay filters.
Reverberation often uses a low-pass filter in the late reflection model because
high frequencies are absorbed by air and room surfaces.
The rate of decay of reverberation is described by RT60, the time to decay
to -60 dB relative to the peak amplitude. (-60 dB is about 1/1000 in amplitude.)
Typical values of RT60 are around 1.5 to 3 s, but much longer times are easy
to create digitally and can be very interesting.
In Nyquist, the reverb function provides a simple reverberator. You will
probably want to mix the reverberated signal with some “dry” original signal,
so you might like this function:
function reverb-mix(s, rt, wet)
  return s * (1 - wet) + reverb(s, rt) * wet
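For example, to mix in a modest amount of reverberation with a 3-second decay time (the 0.2 wet fraction and the name my-sound are arbitrary placeholders):

play reverb-mix(my-sound, 3.0, 0.2)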
Nyquist also has some reverberators from the Synthesis Tool Kit: nrev (simi-
lar to Nyquist’s reverb), jcrev (ported from an implementation by John Chown-
ing), and prcrev (created by Perry Cook). See the Nyquist Reference Manual
for details.
Convolution-based Reverberators
Reverberators can be seen as very big filters with long irregular impulse re-
sponses. Many modern reverberators measure the impulse response of a real
room or concert hall and apply the impulse response to an input signal using convolution (recall that filtering multiplies the spectrum by a frequency response, and convolution in the time domain is equivalent to multiplication in the frequency domain).
With stereo signals, a traditional approach is to mix stereo to mono, com-
pute reverberation, then add the mono reverberation signal to both left and
right channels. In more modern reverberation implementations, convolution-
based reverberators use 4 impulse responses because the input and output are
stereo. There is an impulse response representing how the stage-left (left in-
put) signal reaches the left channel or left ear (left output), the stage-left signal
to the right channel or right ear, stage-right to the left channel, and stage-right
to the right channel.
Nyquist has a convolve function to convolve two sounds. There is no
Nyquist library of impulse responses for reverberation, but see
www.openairlib.net/, www.voxengo.com/impulses/ and other sources.
Convolution with different sounds (even if they are not room responses) is an
interesting effect for creating new sounds.
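For example, assuming an impulse response has been downloaded to a file (the file name and the variable names here are hypothetical):

variable ir = s-read("hall-ir.wav")  ; hypothetical impulse-response file
play convolve(my-dry-sound, ir)      ; my-dry-sound is a placeholder sound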
10.3.12 Summary
Many audio effects are available in Nyquist. Audio effects are crucial in mod-
ern music production and offer a range of creative possibilities. Synthesized
sounds can be greatly enhanced through audio effects including filters, chorus,
and delay. Effects can be modulated to add additional interest.
Panning algorithms are surprisingly non-obvious, largely due to the “hole-
in-the-middle” effect and the desire to minimize the problem. The -4.5 dB
panning law seems to be a reasonable choice unless you know more about the
listening conditions.
Reverberation is the effect of millions of “echoes” caused by reflections
off of walls and other surfaces when sound is created in a room or reflective
space. Reverberation echoes generally become denser with greater delay, and
the sound of reverberation generally has an approximately exponential decay.
Chapter 11
Physical Modeling
Topics Discussed: Mass-Spring Models, Karplus-Strong, Waveguide
Models, Guitar Models
11.1 Introduction
$$Y_t = \frac{1}{2}\,(Y_{t-p} + Y_{t-p-1}) \qquad (11.1)$$
1 When I was an undergraduate at Rice University, a graduate student was working with a PDP-
11 mini-computer with a vector graphics display. There were digital-to-analog converters to drive
the display, and the student had connected them to a stereo system to make a primitive digital
audio system. I remember his description of the synthesis system he used, and it was exactly the
Karplus-Strong algorithm, including initializing the buffer with random numbers. This was in the
late 1970s, so it seems Karplus and Strong reinvented the algorithm, but certainly deserve credit
for publishing their work.
Here, p is the period or length of the buffer, t is the current sample count,
and Y is the output of the system.
To generate a waveform, we start reading through the buffer and using the
values in it as sample values. If we were to just keep reading through the buffer
over and over again, we would get a complex, periodic, pitched waveform. It
would be complex because we started out with noise, but pitched because we
would be repeating the same set of random numbers. (Remember that any time
we repeat a set of values, we end up with a pitched sound.) The pitch we get
is directly related to the size of the buffer (the number of numbers it contains)
we’re using, since each time through the buffer represents one complete cycle
(or period) of the signal.
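For example, at a 44,100 Hz sample rate, a 100-sample buffer repeats 441 times per second, giving a pitch of about 441 Hz; in general the fundamental frequency is roughly the sample rate divided by the buffer length p. (The averaging in Equation 11.1 adds about half a sample of extra delay, so the pitch is actually slightly lower.)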
Now, here’s the trick to the Karplus-Strong algorithm: each time we read
a value from the buffer, we average it with the last value we read. It is this
averaged value that we use as our output sample. (See Figure 11.2.) We then
take that averaged sample and feed it back into the buffer. That way, over
time, the buffer gets more and more averaged (this is a simple filter, like the
averaging filter, Equation 7.1). Let’s look at the effect of these two actions
separately.
Physical models generally offer clear, “real world” controls that can be
used to play an instrument in different ways, and the Karplus-Strong algo-
rithm is no exception: we can relate the buffer size to pitch, the initial random
numbers in the buffer to the energy given to the string by plucking it, and the
low-pass buffer feedback technique to the effect of air friction on the vibrating
string.
Figure 11.4: A waveguide model. The right-going and left-going wave, which
are superimposed on a single string, are modeled separately as simple de-
lays. The signal connections at the ends of the delays represent reflections.
A waveguide can also model a column of air as in a flute.
start there. Figure 11.5 illustrates the oscillation in a bowed string. The bow
alternately sticks to and slips across the string. Rather than reaching a steady
equilibrium where the bow pulls the string to some steady stretched configura-
tion, the “slip” phase reduces friction on the string and allows the string to move almost as if it were plucked. Interestingly, string players put rosin on the
bow, which is normally sticky, but when the string begins to slide, the rosin
heats up and a molecular layer of rosin liquefies and lubricates the bow/string
contact area until the string stops sliding. It’s amazing to think that rosin on
the string and bow can melt and re-solidify at audio rates!
Figure 11.5: A bowed string is pulled by the bow when the bow
sticks to the string. At some point the bow does not have enough
friction to pull the string further, and the string begins to slip. The
sliding reduces the friction on the string, which allows it to stretch
in opposition to the bowing direction. Finally, the stretching slows
the string and the bow sticks again, repeating the cycle. (From
http://physerver.hamilton.edu/courses/Fall12/Phy175/ClassNotes/Violin.html)
Nyquist has a number of built-in physical models. Many of them come from
the Synthesis Tool Kit (STK).
$$y_n = c_0\,x_n + c_1\,x_{n-1} \qquad (11.4)$$
can be used, but this also produces attenuation (low-pass filter), so we can
adjust the loop filter (FIR) to provide only the additional attenuation required.
After all this, the model is still not perfect and might require a compensat-
ing boost at higher frequencies, but Sullivan decided to ignore this problem: sometimes higher frequencies will suffer, but the model is workable.
11.9.3 Distortion
In electric guitars, distortion of a single note just adds harmonics, but distortion
of a sum of notes is not the sum of distorted notes: distortion is not linear, so
all those nice properties of linearity do not apply here.
Sullivan creates distortion using a soft clipping function so that as ampli-
tude increases, there is a gradual introduction of non-linear distortion. The
signal is x and the distorted signal is F(x) in the following equation, which is
plotted in Figure 11.12:
$$F(x) = \begin{cases} \;\;\frac{2}{3} & x \ge 1 \\ \;\;x - \frac{x^3}{3} & -1 < x < 1 \\ -\frac{2}{3} & x \le -1 \end{cases} \qquad (11.5)$$
Figure 11.12: Distortion functions. At left is “hard clipping” where the signal
is unaffected until it reaches limits of 1 and -1, at which points the signal is
limited to those values. At right is a “soft clipping” distortion that is more like
analog amplifiers with limited output range. The amplification is very linear
for small amplitudes, but diminishes as the signal approaches the limits of 1
and -1.
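Equation 11.5 can be written directly as a SAL function on individual sample values; this is only a sketch (the function name is mine), since applying it sample-by-sample to a whole sound would be slow, and in practice one would build a lookup table from it and use Nyquist's waveshaping support:

define function soft-clip(x)
  begin
    ; cubic soft clipping from Equation 11.5
    if x >= 1 then return 2.0 / 3
    if x <= -1 then return -2.0 / 3
    return x - (x * x * x) / 3
  end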
11.9.4 Feedback
A wonderful technique available to electric guitarists is feedback, where the
amplified signal is coupled back to the strings which then resonate in a sus-
tained manner. Figure 11.13 illustrates Sullivan’s configuration for feedback.
There are many parameters to control gain and delay. Sullivan notes that one
of the interesting things about synthesizing guitar sounds with feedback is that
even though it is hard to predict exactly what will happen, once you find pa-
rameter settings that work, the sounds are very reproducible. One guiding
principle is that the instrument will tend to feed back at periods that are mul-
tiples of the feedback delay. This is quite different from real feedback with
real guitars, where the player must interact with the amplifier and guitar, and
particular sounds are hard to reproduce.
Figure 11.14: Initializing the string. The physical string shape (a) contains
both left- and right-going waves (b), so we need to combine them to get a full
round-trip initial waveform (c).
– Distortion,
– Wah-wah pedals,
– Chorus, etc.
11.12 Summary
Physical models simulate physical systems to create sound digitally. A com-
mon approach is to model strings and bores (in wind instruments) with re-
circulating delays, and to “lump” the losses in a filter at some point in the
loop. Non-linear elements are added to model how a driving force (a bow or
breath) interacts with the wave in the recirculating delay to sustain an oscil-
lation. Digital waveguides offer a simple model that separates the left- and
right-going waves of the medium.
iors that tend to arise naturally from models. In spite of potentially complex
behavior, physical models tend to have a relatively small set of controls that
are meaningful and intuitively connected to real-world phenomena and expe-
rience. Models also tend to be modular. It is easy to add coupling between
strings, refine a loop filter, etc. to obtain better sound quality or test theories
about how instruments work.
Chapter 12
Spectral Modeling,
Algorithmic Control,
3-D Sound
Topics Discussed: Additive Synthesis, Table-Lookup Synthesis, Spec-
tral Interpolation Synthesis, Algorithmic Control of Signal Processing, 3-D
Sound, Head-Related Transfer Functions, Multi-Speaker Playback
Perhaps more interesting (or is it just because I invented this?) is the idea that
the spectrum can evolve arbitrarily over time. For example, consider a trum-
pet tone, which typically follows an envelope of getting louder, then softer.
As the tone gets softer, it gets less bright because higher harmonics are re-
duced disproportionately when the amplitude decreases. If we could record a
sequence of spectra, say one spectrum every 100 ms or so, then we could cap-
ture the spectral variation of the trumpet sound. The storage is low (perhaps
20 harmonic amplitudes per table, and 10 tables per second, so that is only 200
samples per second). The computation is also low: Consider that full additive
synthesis would require 20 sine oscillators at the sample rate for 20 harmon-
ics. With spectral interpolation, the cost is only 2 table lookups per sample.
We also need to compute tables from stored harmonic amplitudes, but tables
are small, so the cost is not high. Overall, we can expect spectral interpolation
to run 5 to 10 times faster than additive synthesis, or only a few times slower
than a basic table-lookup oscillator.
Given this framework, we can develop and refine the synthesis model by ex-
tracting control signals from real audio produced by human performers and
acoustic instruments. As shown in Figure 12.5, the human-produced control
signals are used to drive the synthesis, or instrument, model, and we can listen
to the results and compare them directly to the actual recording of the human.
If the synthesis sounds different, we can try to refine the synthesis model. For
example, dissatisfaction with the sound led us to introduce sampled attacks for
brass instruments.
Note that the ability to drive the instrument model with amplitude and
frequency, in other words parameters directly related to the sound itself, is a
big advantage over physical models, where we cannot easily measure physical
Figure 12.4: The divide-and-conquer approach. Even though control and syn-
thesis must both be considered as part of the problem of musical sound syn-
thesis, we can treat them as two sub-problems.
Figure 12.5: Synthesis refinement. To optimize and test the synthesis part of
the SIS model, control signals can be derived from actual performances of
musical phrases. This ensures that the control is “correct,” so any problems
with the sound must be due to the synthesis stage. The synthesis stage is
refined until the output is satisfactory.
parameters, like bow-string friction or reed stiffness, that are critical to the
behavior of the model.
• When the previous pitch is lower and the next pitch is higher (we call
this an “up-up” condition), notes showed a later center of mass than
other combinations;
• large pitch intervals before and after the tone resulted in an earlier center
of mass than small intervals;
• legato articulation gave a later center of mass than others (this is clearly
seen in Figure 12.7).
Envelope Model
The center of mass does not offer enough detail to describe a complete trumpet
envelope. A more refined model is illustrated in Figure 12.8. In this model,
the envelope is basically smooth as shown by the dashed line, but this overall
shape is modified at the beginning and ending, as shown inside the circles. It
is believed that the smooth shape is mainly due to large muscles controlling
pressure in the lungs, and the beginning and ending are modified by the tongue,
which can rapidly block or release the flow of air through the lips and into the
trumpet.
The next figure (Figure 12.9) shows a typical envelope from a slurred note
where the tongue is not used. Here, there is less of a drop at the beginning
Figure 12.8: The “tongue and breath” envelope model is based on the idea
that there is a smooth shape (dashed line) controlled by the diaphragm, upper
body and lungs, and this shape is modified at note transitions by the tongue
and other factors. (Deviations from the smooth shape are circled.)
and ending. The drop in amplitude (which in the data does not actually reach zero or silence) is probably due to the disruption of vibration when valves are
pressed and the pitch is changed. Whatever is going on physically, the idea of
a smooth breath envelope with some alterations at the beginning and ending
seems reasonable.
Figure 12.9: The envelope of a slurred note. This figure shows that the “tongue
and breath” model also applies to the envelopes of slurred notes (shown here).
The deviations at the beginning and ending of the note may be due to trumpet
valves blocking the air or to the fact that oscillations are disrupted when the
pitch changes.
Computing Parameters
All 9 envelope parameters must be automatically computed from the score.
Initially, we used a set of rules designed by hand. Parameters depend on:
Figure 12.11: Computing envelope parameters. This example shows how the
parameter tf is derived from score features from-slur, direction-up and dur.
12.2.7 Conclusions
What have we learned? First, envelopes and control are critical to music syn-
thesis. It is amazing that so much research was done on synthesis algorithms
without much regard for control! Research on the spectral centroid showed
statistically valid relationships between score parameters and envelope shape,
which should immediately raise concerns about sampling synthesis: If samples
have envelopes “baked in,” how can sampling ever create natural-sounding
musical phrases? Certainly not without a variety of alternative samples with
different envelopes and possibly some careful additional control through am-
plitude envelopes, but for brass, that also implies spectral control. Large sam-
ple libraries have moved in this direction.
The idea that envelopes have an overall “breath” shape and some fine details at the beginning and ending of every note seems to fit real data better than ADSR and other commonly used models. Even though spectral interpolation and even phrase-based analysis/synthesis have not become commonplace or standard in the world of synthesis, it seems that the study of musical phrases and notes in context is critical to future synthesis research and systems.
Pat-ctrl
The implementation of pat-ctrl is shown below. This is a recursive se-
quence that produces one segment of output followed by a recursive call to
produce the rest. The duration is infinite:
define function pat-ctrl(durpat, valpat)
  return seq(const(next(valpat), next(durpat)),
             pat-ctrl(durpat, valpat))
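As a quick, purely illustrative use of pat-ctrl (this sketch is not from the text, and the function name pat-ctrl-demo is ours), the following steps the frequency of an oscillator through a cycle of values, changing every 0.2 or 0.3 seconds. Multiplying by a 10-second envelope is assumed here to bound the otherwise infinite control signal:
define function pat-ctrl-demo()
  begin
    with durpat = make-cycle({0.2 0.3}),
         valpat = make-cycle({220 330 440 550})
    ; hzosc treats the stepwise control signal as a frequency in Hz;
    ; the pwl envelope limits the result to about 10 seconds
    return hzosc(pat-ctrl(durpat, valpat)) * pwl(0.1, 1, 9, 1, 10)
  end

play pat-ctrl-demo()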
Using Scores
One long sine tone may not be so interesting, even if it is modulated rapidly
by patterns. The following example shows how we can write a score that
“launches” a number of pat-fm sounds (here, the durations are 30, 20, 18,
and 13 seconds) with different parameters. The point here is that scores are
not limited to conventional note-based music. Here we have a score organiz-
ing long, overlapping, and very abstract sounds.
exec score-play(
  {{0 30 {pat-fm-note :grain-dur 8 :spread 1
          :pitch c3 :fixed-dur t :vel 50}}
   {10 20 {pat-fm-note :grain-dur 3 :spread 10
           :pitch c4 :vel 75}}
   {15 18 {pat-fm-note :grain-dur 1 :spread 20
           :pitch c5}}
   {20 13 {pat-fm-note :grain-dur 1 :spread 10
           :pitch c1}}})
• pwl-pat-fm is a function that will create and use a pattern. Within the
pattern, we see make-eval({get-pitch}). The make-eval pattern
constructor takes an expression, which is expressed in Lisp syntax. Each
time the pattern is called to return a value, the expression is evaluated.
Recall that in Lisp, the function call syntax is (function-name arg1
arg2 ...), so with no arguments, we get (function-name), and in SAL,
we can write a quoted list as {function-name}. Thus, make-eval({get-pitch}) is the pattern that calls get-pitch to produce each value.
define variable pitch-contour =
  pwl(10, 25, 15, 10, 20, 10, 22, 25, 22)

define function get-pitch()
  return sref(pitch-contour, 0)

define function pwl-pat-fm()
  begin
    ...
    make-eval({get-pitch}),
    ...
  end

play pwl-pat-fm()
begin
  with pitch-contour =
         pwl(10, 25, 15, 10, 20, 10, 22, 25, 22),
       ioi-pattern = make-heap({0.2 0.3 0.4})
  exec score-gen(
         save: quote(pwl-score),
         score-dur: 22,
         pitch: truncate(
                  c4 + sref(pitch-contour, sg:start) +
                  #if(oddp(sg:count), 0, -5)),
         ioi: next(ioi-pattern),
         dur: sg:ioi - 0.1,
         vel: 100)
end
You can even use the envelope editor in the NyquistIDE to graphically edit pitch-contour. To evaluate pitch-contour at a specific time, use sref, as in the get-pitch function above.
12.4 3-D Sound
12.4.1 Introduction
To review from our discussion of sound localization, we use a number of cues
to sense direction and distance. Inter-aural time delay and amplitude differences give us information about direction in the horizontal plane, but suffer from the symmetry between sounds in front of us and sounds behind us. Spectral cues help to disambiguate front-to-back directions and also give us a sense of height or elevation. Reverberation, and especially the direct-sound-to-reverberant-sound ratio, gives us the impression of distance, as do spectral cues.
Figure 12.13: The “cone of confusion,” where different sound directions give
the same ITD and ILD. (Credit: Bill Kapralos, Michael Jenkin and Evangelos Milios, “Vir-
tual Audio Systems,” Presence Teleoperators & Virtual Environments, 17, 2008, pp. 527-549).
In fact, the duplex theory ignores the pinnae, our outer ears, which are not
symmetric from front-to-back or top-to-bottom. Reflections in the pinnae cre-
ate interference that is frequency-dependent and can be used by our auditory
system to disambiguate sound directions. The cues from our pinnae are not as
strong as ITD and ILD, so it is not unusual to be confused about sound source
locations, especially along the cone of confusion.
the signal processing details. To fully characterize the HRTF, the loudspeaker
and/or listener must be moved to many different angles and elevations. The
number of different angles and elevations measured can range from one hun-
dred to thousands.
To simulate a sound source at some angle and elevation, we retrieve the
nearest HRTF measurement to that direction and apply the left and right HRTFs
as filters to the sound before playing the sound to the left and right ears through
headphones. (See Figure 12.14.)
One way to implement the HRTF filters is to compute the HRIR, or head-
related impulse response, which is the response of the HRTF to an impulse
input. To filter a sound by the HRTF, we can convolve the sound with the
HRIR. If the sound is moving, it is common to interpolate between the nearest
HRTFs or HRIRs so that there is no sudden switch from one filter to another,
which might cause pops or clicks in the filtered sound.
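As a minimal sketch of this signal flow (not code from the text; the function and file names are hypothetical, and left-hrir and right-hrir are assumed to be SOUNDs holding the left- and right-ear impulse responses):
define function apply-hrir(src, left-hrir, right-hrir)
  ; filter the mono source by each ear's impulse response and
  ; combine the results into a stereo (two-channel) sound
  return vector(convolve(src, left-hrir),
                convolve(src, right-hrir))

; hypothetical usage, loading HRIRs from sound files:
; play apply-hrir(s-read("voice.wav"),
;                 s-read("hrir-az30-left.wav"),
;                 s-read("hrir-az30-right.wav"))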
the headphones. To solve this problem, headphones are often tracked. Then,
if the head turns 20◦ to the left, the virtual sound source is rotated to the right
to compensate. Headphones with head tracking are sometimes combined with
virtual reality (VR) goggles that do a similar thing with computer graphics:
when the listener turns 20◦ to the left, the virtual world is rotated to the right,
giving the impression that the world is real and we are moving the head within
that world to see different views.
12.4.7 Reverberation
The effect of Doppler shift is enhanced through the use of reverberation. If a sound is receding rapidly, we should hear several effects: the pitch drops, the sound decreases in amplitude, and the ratio of reverberation to direct sound increases. Especially with synthesized sounds, we do not have any preconceptions about the absolute loudness of the (virtual) sound source. Simply making the source
quieter does not necessarily give the impression of distance, but through the
careful use of reverberation, we can give the listener clues about distance and
loudness.
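A rough sketch of this idea (ours, not from the text, and assuming a reverberator such as reverb(snd, time) from the Nyquist library): as a virtual source moves away, scale the direct sound down roughly in proportion to distance while leaving the reverberant level nearly constant, so the wet/dry ratio rises with distance.
define function place-at-distance(snd, distance)
  ; direct sound falls off with distance; the reverberant field,
  ; summed over many reflections, stays roughly constant
  return snd * (1.0 / distance) + reverb(snd, 2.0) * 0.2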
12.4.8 Panning
Stereo loudspeaker systems are common. A standard technique to simulate
the location of a sound source between two speakers is to “pan” the source,
sending some of the source to each speaker. Panning techniques and the “hole-
in-the-middle” problem were presented in detail in Section 10.3.6. Panning
between two speakers can create ILD, but not ITD or spectral effects consistent
with the desired sound placement, so our ears are rarely fooled by stereo into
believing there is a real 3-D or even 1-D space of sound source locations.2
Finally, room reflections defeat some of the stereo effect or at least lead to
unpredictable listening conditions.
manufacturers who wanted to sell more gear? There were multi-speaker systems in use before
consumer stereo, and experiments have shown that three-channel systems (adding a center chan-
nel) are significantly better than stereo. My guess is that stereo is ubiquitous today because it was
both interesting enough to sell to consumers yet simple enough to implement in consumer media
including the phonograph record and FM radio. Even though we have only two ears, stereo with
two channels falls far short of delivering a full two-dimensional sound experience, much less three
dimensions.
12.4.10 Summary
In this section, we have reviewed multiple perceptual cues for sound location
and distance. HRTFs offer a powerful model for simulating and reproduc-
ing these cues, but HRTFs require headphones to be most effective, and since
headphones move with the head, head tracking is needed for the best effect.
For audiences, it is more practical to use loudspeakers. Multiple loudspeakers
provide multiple point sources, but usually are restricted to a plane. Panning
is often used as a crude approximation of ideal perceptual cues. Panning can
be scaled up to multiple loudspeaker systems. Rather than treat speakers as a
means of reproducing a virtual soundscape, loudspeaker “orchestras” can be
embraced as a part of the electroacoustic music presentation and exploited for
musical purposes. Finally, wavefield synthesis offers the possibility of repro-
ducing actual 3-D sound waves as if coming from virtual sound sources, but
many speakers are required, so this is an expensive and still mostly experimen-
tal approach.
Chapter 13
Audio Compression
Topics Discussed: Coding Redundancy, Intersample Redundancy, Psycho-
Perceptual Redundancy, Huffman Coding
binary codes through pulses—see Patent US2801281A—which misses the point that this is about
representation, not transmission, but since this work came out of a phone company, maybe it is
not so surprising that PCM was viewed as a form of transmission.
are smaller at low amplitudes. A less common but similar encoding is A-law,
which has a 12-bit dynamic range.
Figure 13.1 shows a schematic comparison of (linear) PCM to µ-law en-
coding. The tick-marks on the left (PCM) and right (µ-law) scales show actual
values to which the signal will be quantized. You can see that at small signal
amplitudes, µ-law has smaller quantization error because the tick marks are
closer together. At higher amplitudes, µ-law has larger quantization error. At
least in perceptual terms, what matters most is the signal-to-noise ratio. In µ-
law, quantization error is approximately proportional to the signal amplitude,
so the signal-to-noise ratio is roughly the same for soft and loud signals. At
low amplitudes, where absolute quantization errors are smaller than those of PCM, the signal-to-noise ratio of µ-law is better than that of PCM. This comes at the cost of
lower signal-to-noise ratios at higher amplitudes. All around, µ-law is gener-
ally better than PCM, at least when you do not have lots of bits to spare, but
we will discuss this further in Section 13.2.2.
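For reference, the standard µ-law companding curve (not written out above) maps a normalized sample x in [−1, 1] to
\[ y = \operatorname{sgn}(x) \, \frac{\ln(1 + \mu |x|)}{\ln(1 + \mu)}, \qquad \mu = 255, \]
and y is then quantized uniformly to 8 bits. Because the curve is steep near zero, small amplitudes receive proportionally finer quantization, which is exactly the behavior described above.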
Figure 13.1: µ-law vs. PCM. Quantization levels are shown at the left and right vertical scales. µ-law has a better signal-to-quantization-error ratio at lower amplitudes but larger quantization error than PCM at higher amplitudes.
signal to be encoded (the solid line) changes rapidly, the 1-bit DPCM encoding
is not able to keep up with it.
Figure 13.2: DPCM. The solid line is encoded into 1’s and 0’s shown below.
The reconstructed signal is shown as small boxes. Each box is at the value of
the previous one plus or minus the step size.
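The reconstruction rule in the caption is simple enough to state as code. Here is a minimal SAL sketch (ours, not from the text): decode a list of 1 and 0 bits into a list of sample values by stepping up or down from the previous value.
define function dpcm-decode(bits, step)
  begin
    with value = 0.0, result = nil
    loop
      for b in bits
      ; step up for a 1 bit, down for a 0 bit
      set value = value + step * (2 * b - 1)
      set result = cons(value, result)
    end
    return reverse(result)
  end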
the encoded signal is overshooting the target, so the step size is decreased.
ADPCM achieves an improvement of 10-11 dB, or about 2 bits per sample,
over PCM for speech.
Figure 13.4 illustrates the action of ADPCM. Notice how the steps in the
encoded signal (shown in small squares) change. In this example, the step
size changes by a factor of two, and there is a minimum step size so that
reasonably large step sizes can be reached without too many doublings.
Figure 13.4: ADPCM. The step size is increased when bits repeat, and the step
size is decreased (down to a minimum allowed step size) when bits alternate.
Figure 13.5: ADPCM Coder. This is similar to DPCM, but the step size is
variable and controlled by additional logic that looks for same bits or alternat-
ing bits in the encoded output.
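The step-size rule by itself can be written as a small helper (a hypothetical sketch, not taken from an actual ADPCM coder): double the step when the new bit repeats the previous one, and otherwise halve it, but never below the minimum.
define function adapt-step(step, bit, prev-bit, min-step)
  begin
    ; repeated bits mean the estimate is lagging the signal: grow the step
    if bit = prev-bit then return step * 2.0
    ; alternating bits mean overshoot: shrink, but not below the minimum
    return max(step * 0.5, min-step)
  end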
13.3.5 Review
The term PCM refers to uncompressed audio where amplitude is converted
to binary numbers using a linear scale. By coding differently with µ-law and
A-law, we can achieve better quality, at least for low-fidelity speech audio.
Using intersample redundancy schemes such as DPCM and ADPCM, we can
achieve further gains, but when we consider very high quality audio (16-bit
PCM), these coding schemes lose their advantage.
As we will see in the next section, psycho-perceptual redundancy offers a
better path to high-quality audio compression.
bits for E than Z. For this example, we encode only symbols A, B, C, D, E and
F, which we assume occur with the probabilities shown in Figure 13.6.
The algorithm produces a binary tree (Figure 13.6) using a simple process: Start-
ing with the symbols as leaves of the tree, create a node whose branches have
the 2 smallest probabilities. (This would combine D and E in Figure 13.6.)
Then, give this node the probability of the sum of the two branches and re-
peat the process. For example, we now have C with probability 0.1 and the
DF node with probability 0.1. Combining them, we get a new internal node
of probability 0.2. Continue until there is just one top-level node. Now, each
symbol is represented by the path from the root to the leaf, encoding a left
branch as 0 and a right branch as 1. The final codes are shown at the right of
Figure 13.6.
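As a quick check on the benefit (not computed in the text above): a fixed-length code for six symbols needs 3 bits per symbol, while a Huffman code with symbol probabilities p_i and code lengths l_i uses on average
\[ \bar{L} = \sum_i p_i \, \ell_i \ \text{bits per symbol}, \qquad H \le \bar{L} < H + 1, \quad \text{where } H = -\sum_i p_i \log_2 p_i , \]
so frequently occurring symbols get shorter codes and the average code length approaches the entropy of the symbol distribution.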
E = [-1, -1, 0, 0]
F = [-1, -1, 0, 1]
etc.
13.4.7 Summary
Masking reduces our ability to hear “everything” in a signal, and in particular
quantization noise. MP3 uses a highly quantized frequency domain represen-
tation. Because of different quantization levels and coefficient sizes, Huffman
coding is used to encode the coefficients.
control parameters and object parameters can be quite small compared to the
audio signal that is generated.
pretation and natural sounding voices and instruments is a big challenge that
has not been automated.
Another representation that is similar to music notation is the performance
information found in MIDI, or Musical Instrument Digital Interface. The
idea of MIDI is to encode performance “gestures,” including keyboard performances (when do keys go down? how fast? when are they released?) as well as continuous controls such as volume pedals and other knobs and sensors.
The bandwidth of MIDI is small, with a maximum over MIDI hardware of about 3 KB/s (31,250 bits per second, with 10 bits per byte of serial framing). Typically, the bandwidth is closer to 3 KB/minute. For example, the
complete works of ragtime composer Scott Joplin, in MIDI form, takes about
1 MB. The complete output of 50 composers (400 days of continuous music)
fits into 500 MB of MIDI—less data than a single compact disc.
MIDI has many uses, but one big limitation is that, in most cases, synthesis from MIDI cannot convincingly recreate, say, the sound of an orchestra or even a rock band. An exception is that we can send MIDI to a player piano to at least
produce a real acoustic piano performance that sounds just like the original.
Another limitation, in terms of data compression, is that there is no “encoder”
that converts music audio into a MIDI representation.
13.8 Summary
We have discussed three kinds of redundancy: coding redundancy, intersam-
ple redundancy, and psycho-perceptual redundancy. µ-law, ADPCM, etc. of-
fer simple, fast, but not high-quality or high-compression representations of
audio. MP3 and related schemes are more general, of higher quality, and of-
fer higher compression ratios, but at a higher computational cost. We also
considered model-based analysis and synthesis, which offer even greater com-
pression when the source can be accurately modeled and control parameters
can be estimated. Music notation and MIDI are examples of very abstract
and compact digital encodings of music, but currently, we do not have good
automatic methods for encoding and decoding.
Chapter 14
Computer Music Futures
14.1 Introduction
Let’s take a step back from details and technical issues to think about where
computer music is heading, and maybe where music is heading. This chapter
is a personal view, and I should state at the beginning that the ideas here are
highly biased by my research, which is in turn guided by my own knowledge,
interests, abilities, and talents.
I believe that if we divide computer music into a “past” and a “future,” the “past” is largely characterized by the pursuit of sound and the concept of
the instrument. Much of the early work on synthesis tried to create interesting
sounds, model acoustic instruments or reproduce their sounds, and build inter-
faces and real-time systems that could be used in live performance in the same
way that traditional musicians master and perform with acoustic instruments.
The “future,” I believe, will be characterized by the model of the musician
rather than the instrument. The focus will be on models of music performance,
musical interaction among players, music described in terms of style, genre,
“feel,” and emotion rather than notes, pitch, amplitude, and spectrum. A key
technology will be the use of AI and machine learning to “understand” music
at these high levels of abstraction and to incorporate music understanding into
the composition, performance, and production of music.
Much of my research has been in the area of music understanding, and
in some sense, the “future” is already here. Everything I envision as part of
the future has some active beginnings well underway. It’s hard to foresee the
future as anything but an extension of what we already know in the present!
However, I have witnessed the visions of others as they developed into reality:
high-level languages enabling sophisticated music creation, real-time gestural control of sound, and all-digital music production. These visions
took a long time to develop, and even if they were merely the logical pro-
gression of what was understood about computer music in the 70s or 80s, the
results are truly marvelous and revolutionary. I think in another 20 or 30 years,
we will have models of musicians and levels of automatic music understanding
that make current systems pale by comparison.
The following sections describe a selection of topics in music understand-
ing. Each section will describe a music understanding problem and outline a
solution or at least some research systems that address the problem and show
some possibilities for the future.
14.2.2 Matching
The Matcher receives input from the Input Processor and attempts to find a cor-
respondence between the real-time performance and the score. The Matcher
has access to the entire score before the performance begins. As each note is
reported by the Input Processor, the matcher looks for a corresponding note in
the score. Whenever a match is found, it is output. The information needed
for Computer Accompaniment is just the real-time occurrence of the note per-
formed by the human and the designated time of the note according to the
score.
Since the Matcher must be tolerant of timing variations, matching is per-
formed on sequences of pitches only. This decision throws away potentially
useful information, but it makes the matcher completely time-independent.
One problem raised by this pitch-only approach is that each pitch is likely to
occur many times in a composition. In a typical melody, a few pitches occur
in many places, so there may be many candidates to match a given performed
note.
The matcher described here overcomes this problem and works well in
practice. Many different matching algorithms have been explored, often in-
troducing more sophisticated probabilistic approaches, but my goal here is to
illustrate one simple approach. The matcher is derived from the dynamic pro-
gramming algorithm for finding the longest common subsequence (LCS) of
two strings. Imagine starting with two strings and eliminating arbitrary charac-
ters from each string until the remaining characters (subsequences) match
exactly. If these strings represent the performance and score, respectively,
then a common subsequence represents a potential correspondence between
performed notes and the score (see Figure 14.2). If we assume that most of
the score will be performed correctly, then the longest possible common sub-
sequence should be close to the “true” correspondence between performance
and score.
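To make the idea concrete, here is a minimal SAL sketch (ours, not the matcher’s actual code) of the classic recursive definition of LCS length over two lists of pitches; the real matcher uses an efficient, windowed dynamic-programming version of the same recurrence.
define function lcs-length(a, b)
  begin
    ; an empty sequence has nothing in common with anything
    if null(a) | null(b) then return 0
    ; matching first elements extend the common subsequence by one
    if car(a) = car(b) then return 1 + lcs-length(cdr(a), cdr(b))
    ; otherwise drop one element from either sequence and take the best
    return max(lcs-length(cdr(a), b), lcs-length(a, cdr(b)))
  end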
In practice, it is necessary to match the performance against the score as
terms is not important. What really matters is the ability of the performer to
consistently produce intentional and different styles of playing at will.
The ultimate test is the following: Suppose, as an improviser, you want to
communicate with a machine through improvisation. You can communicate
four different tokens of information: lyrical, frantic, syncopated, and pointil-
listic. The question is, if you play a style that you identify as frantic, what is
the probability that the machine will perceive the same token? By describing
music as a kind of communication channel and music understanding as a kind
of decoder, we can evaluate style recognition systems in an objective scientific
way.
It is crucial that this classification be responsive in real time. We arbitrarily
constrained the classifier to operate within five seconds.
Figure 14.3: Training data for a style recognition system is obtained by asking
a human performer to play in different styles. Each style is played for about 15
seconds, and 6 overlapping 5-second segments or windows of the performance
are extracted for training. A randomly selected style is requested every 15
seconds for many minutes to produce hundreds of labeled style examples for
training.
Figure 14.3 illustrates the process of collecting data for training and evalu-
ating a classifier. Since notions of style are personal, all of the data is provided by
one performer, but it takes less than one hour to produce enough training data
to create a customized classifier for a new performer or a new set of style la-
bels. The data collection provided a number of 5-second “windows” of music
performance data, each with a label reflecting the intended style.
The data was analyzed by sending music audio from an instrument (a trum-
pet) through an IVL Pitchrider, a hardware device that detects pitch, note-
onsets, amplitude, and other features and encodes them into MIDI. This ap-
proach was used because, at the time (1997), audio signal processing in soft-
ware was very limited on personal computers, and using the MIDI data stream
was a simple way to pre-process the audio into a useful form. We extracted a
number of features from each 5-second segment of the MIDI data. Features
included average pitch, standard deviation of pitch, number of note onsets,
mean and standard deviation of note durations, and the fraction of time filled
with notes (vs. silence).
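To illustrate one of these features concretely (a hypothetical sketch, not code from the system): the fraction of time filled with notes can be computed from the list of note durations within a 5-second window.
define function note-fill-fraction(durations)
  begin
    with total = 0.0
    loop
      for d in durations
      set total = total + d
    end
    ; clip at 1.0 in case overlapping notes over-fill the window
    return min(1.0, total / 5.0)
  end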
Various style classifiers were constructed using rather simple and standard
techniques such as linear classifiers and neural networks.
While attempts to build classifiers by hand were not very successful, this
data-driven machine-learning approach worked very well, and it seems that
our particular formulation of the style-classification problem was not even
very difficult. All of the machine-learning algorithms performed well, achiev-
ing nearly perfect classification in the 4-class case, and around 90% accuracy
in the 8-class case. This was quite surprising since there seems to be no simple
connection between the low-level features such as pitch and duration and the
high-level concepts of style. For example, the “syncopation” style is recogniz-
able from lots of notes on the upbeat, but the low-level features do not include
any beat or tempo information, so evidently, “syncopation” as performed hap-
pens to be strongly correlated with some other features that the machine learn-
ing systems were able to discover and take advantage of. Similarly, “quote,”
which means “play something familiar,” is recognizable to humans because
we might recognize the familiar tune being quoted by the performer, but the
machine has no database of familiar tunes. Again, it must be that playing
melodies from memory differs from free improvisation in terms of low-level
pitch and timing features, and the machine learning systems were able to dis-
cover this.
You can see a demonstration of this system by finding “Automatic style
recognition demonstration” on www.cs.cmu.edu/~rbd/videos.html. Since this
work was published, machine learning and classifiers have been applied to
many music understanding problems including genre recognition from music
audio, detection of commercials, speech, and music in broadcasts, and the
detection of emotion in music.
true, but that is partly because accompaniment systems tend to be built for
monophonic instruments (those that produce one note at a time). Dealing with
polyphony (multiple notes) or something as complex as an orchestra or even
a popular music recording poses a very challenging problem to detect notes
accurately. Our ability to do real-time score following is greatly diminished
when the input consists of polyphonic music audio.
These problems are compensated for by the additional power of using
look-ahead in a non-real-time audio-to-score alignment system. This section
outlines the basic principles that are used and then presents some applications
and opportunities for this technology.
Figure 14.4: A similarity matrix illustrating data used for audio-to-score align-
ment. Darker means more similar.
meta-data or labels for audio. This data can then be used to train machine
learning systems to identify chords, pitches, tempo, downbeats and other mu-
sic information.
mans, much like in computer accompaniment systems, only now with popular
beat-based music. One could imagine a computer using beat-tracking algo-
rithms to find the beat and then synchronize to that. Some experimental sys-
tems have been built with this model, but even state-of-the-art beat tracking
(another interesting research problem) is not very reliable. An alternative is
simple foot tapping using a sensor that reports the beat to the computer.
In addition to beats, the computer needs to know when to start, so we
can add sensors for giving cues to start a performance sequence. Just as hu-
man conductors give directions to humans—play something now, stop play-
ing, play louder or softer, go back and repeat the chorus, etc.—we can use
multiple sensors or different gestures to give similar commands to computer
performers.
We call this scenario Human-Computer Music Performance (HCMP), with
terminology that is deliberately derived from Human-Computer Interaction
(HCI) since many of the problems of HCMP are HCI problems: How can com-
puters and humans communicate and work cooperatively in a music perfor-
mance? The vision of HCMP is a variety of compatible systems for creating,
conducting, cueing, performing, and displaying music. Just as rock musicians
can now set up instruments, effects pedals, mixers, amplifiers and speakers,
musicians in the future should be able to interconnect modular HCMP sys-
tems, combining sensors, virtual performers, conducting systems, and digital
music displays as needed to create configurations for live performance.
Some of the interesting challenges for HCMP are:
• Preparing music for virtual players: Perhaps you need a bass player to
join your band, but you only have chord progressions for your tunes.
Can software write bass parts in a style that complements your musical
tastes?
• Sharing music: Can composers, arrangers, and performers share HCMP
software and content? How can a band assemble the software compo-
nents necessary to perform music based on HCMP?
• Music display systems: Can digital music displays replace printed mu-
sic? If so, can we make the displays interactive and useful for cueing and
conducting? Can a band leader lead both human and virtual performers
through a natural music-notation-based interface?
Clearly, music understanding is important for HCMP systems to interact
with musicians in high-level musical terms. What if a band leader tells a virtual
musician to make the music sound more sad? Or make the bass part more
lively? HCMP is an example of how future music systems can move from a
focus on instruments and sounds to a focus on music performance and musical
interaction.
14.6 Summary
Music understanding encompasses a wide range of research and practice with
the goal of working with music at higher levels of abstraction that include emo-
tion, expressive performance, musical form and structure, and music recogni-
tion. We have examined just a limited selection of music understanding capa-
bilities based on work by the author. These include:
• computer accompaniment based on real-time matching of a performance against a score,
• automatic recognition of improvisational style,
• audio-to-score alignment, and
• human-computer music performance (HCMP).
All of these tasks rely on new techniques for processing musical information
Chapter 15
Where Next?
Almost coincident with the beginnings of NIME, a new field called Music Information Retrieval emerged, combining library science, data retrieval, computer music, and machine learning. Like NIME, the International Society
for Music Information Retrieval (ISMIR) conference draws hundreds of re-
searchers every year who present work which has steadily drifted from re-
trieval problems to more general research on music understanding and mu-
sic composition by computer. This field is strongly weighted toward popular
music and commercial music. Proceedings tend to be very technical and are
available online at ismir.net. Also, ISMIR has become a favorite venue for
applications of machine learning to music.
Some textbooks, mainly for graduate students, have appeared, including
Fundamentals of Music Processing [Müller, 2015]. George Tzanetakis has a
book online as well as course materials for his “Music Retrieval Systems”
class at the University of Victoria (marsyas.cs.uvic.ca/mirBook/course).
[Karplus and Strong, 1983] Karplus, K. and Strong, A. (1983). Digital syn-
thesis of plucked-string and drum timbres. Computer Music Journal,
7(2):43–55.