Sue Harding
More information in Moore (1997), Brown & Wang (in press), and Mackensen (2004)
• Nature of sound sources: how many there are, and what they are
• Position of sound sources: e.g. movement towards us
• Information about the environment (obstacles)
• Improved communication (e.g. identifying a stream of speech)
[Figure: Sources 1–3 around a listener, with frequency–time displays of the signals entering each ear from sources to the right of the listener.]
Two major cues use the difference between the inputs to the two ears:
• interaural time differences (ITDs)
• interaural level differences (ILDs)
These are particularly important for the left/right distinction: if the sound arrives first, or is louder, at the left ear, the source is on the left; if it arrives first, or is louder, at the right ear, the source is on the right.
Sound reaching the ear further from the source must travel around the head; it is delayed in time and less intense than sound reaching the ear nearer to the source.
[Figure: left- and right-ear waveforms (amplitude vs time in ms) for a single source; zoomed views show that the signals are offset in time (ITD) and differ in level (ILD).]
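The two cues can be illustrated numerically. The sketch below is an illustrative assumption (not code from these notes): it simulates a source nearer the left ear by making the right-ear signal a delayed, attenuated copy of the left-ear signal, then recovers the ITD from the cross-correlation peak and the ILD from the RMS level ratio.

```python
# Toy simulation: the right-ear signal is a delayed (ITD) and attenuated (ILD)
# copy of the left-ear signal, as for a source on the listener's left.
import numpy as np

fs = 44100                       # sample rate (Hz)
t = np.arange(0, 0.05, 1 / fs)   # 50 ms of signal
sig = np.sin(2 * np.pi * 500 * t)

delay = 20                       # samples (~0.45 ms interaural delay)
left = sig
right = 0.7 * np.concatenate([np.zeros(delay), sig[:-delay]])

# ITD: lag of the cross-correlation peak (positive = right ear lags the left)
lags = np.arange(-len(left) + 1, len(left))
itd_samples = lags[np.argmax(np.correlate(right, left, mode="full"))]
itd_ms = 1000 * itd_samples / fs

# ILD: level difference between the ears in dB
rms = lambda x: np.sqrt(np.mean(x ** 2))
ild_db = 20 * np.log10(rms(left) / rms(right))

print(f"ITD = {itd_ms:.2f} ms, ILD = {ild_db:.1f} dB")
```

With these parameters the estimated ITD is about 0.45 ms and the ILD about 3 dB; a negative peak lag would indicate a source on the right.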
ITD dominates at low frequencies (Wightman & Kistler 1992)
• Listeners presented with broadband noise at 36 spatial positions
[Figure: judged vs actual elevation of loudspeakers (in degrees), for listener elevations 0° and −13°.]
Pinnae, head and torso modify sound spectra depending on the angle of incidence
Distance perception is affected by:
• interaural level differences
- large ILDs indicate a nearby source
- distance judgements are generally better when one ear is oriented towards the source (Holt and Thurlow 1969)
• changes in spectrum and familiarity with sounds (e.g. Coleman 1962)
- high frequencies are attenuated due to the absorbing properties of the air
[Figure: error (in feet) of distance judgements against trial number.]
Distance perception is affected by:
• sound level and expectations (e.g. Gardner 1969)
• environment & reverberation (covered next)
• 4 loudspeakers at azimuth 0°
• distances 3, 10, 20 and 30 feet (approx. 1 m to 10 m)
• anechoic conditions
http://gbs.glenbrook.k12.il.us/Academics/gbssci/phys/mmedia/waves/er.html
Surfaces can be characterised by the reverberation time T60 (the time taken
for the sound level to decay by 60 dB after the sound source is turned off)
Examples (simulated reverberation using roomsim software):
• anechoic (no reverberation)
• acoustic plaster (T60 = 0.34 s)
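In practice T60 is often estimated from a measured room impulse response using Schroeder's backward energy integration. A minimal sketch on a synthetic impulse response with a known decay (the synthetic response and fit range are assumptions for illustration):

```python
# Estimate T60 from an impulse response via Schroeder backward integration.
import numpy as np

fs = 16000
t = np.arange(0, 1.0, 1 / fs)
rng = np.random.default_rng(0)

# Synthetic impulse response: noise whose level falls by 60 dB in 0.5 s,
# so the true T60 is 0.5 s.
true_t60 = 0.5
h = rng.standard_normal(len(t)) * 10 ** (-3 * t / true_t60)

# Schroeder integration: energy remaining after time t, expressed in dB
edc = np.cumsum(h[::-1] ** 2)[::-1]
edc_db = 10 * np.log10(edc / edc[0])

# Fit a line to the -5 dB .. -35 dB portion and extrapolate to -60 dB
mask = (edc_db <= -5) & (edc_db >= -35)
slope, intercept = np.polyfit(t[mask], edc_db[mask], 1)   # dB per second
t60_est = -60 / slope

print(f"estimated T60 = {t60_est:.2f} s")
```

The fit over a mid-range portion of the decay curve (rather than the full 60 dB) mirrors common practice, since the tail of a measured response is usually dominated by noise.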
[Figure: front/back frequency–time displays for a single source and listener.]
Presence of background noise reduces localisation accuracy of click-train
stimuli, especially front/back distinction (Good & Gilkey 1996)
Later results (Hawley et al. 2004) suggest better intelligibility for speech masked by speech than for speech masked by noise
Listening with two ears can reduce the threshold of audibility:
• if a tone presented to both ears is just masked by a broadband noise, then changing the phase of the tone by 180° in one ear makes it audible
but evidence also exists that ear of presentation doesn’t always segregate, i.e. cues for segregation can be overridden, e.g.:
• a speech sound split between the two ears is fused into a whole (Broadbent 1955; Broadbent & Ladefoged 1957)
• duplex perception: a partial speech sound in one ear plus a non-speech chirp in the other fuses into a complete speech sound plus a segregated chirp
Later developments

a) Jeffress (1948) coincidence-based model, adapted by Colburn (1973), plus later developments (figure from Stern & Trahiotis 1995)

b) ‘Equalisation-cancellation’ model: Kock (1950), developed by Durlach (1963)
• designed to model binaural masking level differences (BMLDs)
• the signal in one ear is transformed so that one component (the ‘masker’) matches that in the other ear; then one signal is subtracted from the other
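The cancellation step can be demonstrated with a toy NoSπ stimulus (masker identical at the two ears, tone phase-inverted in one ear). This is an illustrative sketch, not the original model: it omits the internal noise that limits the real BMLD, and since the maskers start out identical the equalisation step is trivial.

```python
# Toy equalisation-cancellation demo on an NoSpi stimulus.
import numpy as np

fs = 16000
t = np.arange(0, 0.5, 1 / fs)
rng = np.random.default_rng(1)

masker = rng.standard_normal(len(t))        # broadband noise (No: same at both ears)
tone = 0.05 * np.sin(2 * np.pi * 500 * t)   # weak 500 Hz tone (Spi: inverted in one ear)

left = masker + tone
right = masker - tone

# Maskers are already equalised, so cancellation is a simple subtraction:
# the masker vanishes and the two tone components add.
residual = left - right                      # equals 2 * tone
print(np.allclose(residual, 2 * tone))       # True: masker cancelled, tone exposed
```

In the full model the equalisation stage first applies a gain and delay to match the maskers at the two ears before subtracting, and internal noise keeps the cancellation imperfect.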
[Figure: two cross-correlogram displays (frequency channel vs time frame).]
Cross-correlogram example for a single source at azimuth 40° in anechoic conditions (time frame 90)
The highest peak in each frequency channel indicates the ITD, and therefore the position of the source: convert ITD to azimuth (e.g. using empirical data)
Can sum over all channels and/or over time
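The per-channel peak picking can be sketched as follows. This is an illustration with assumed parameters: delayed white noise stands in for gammatone filterbank channels, and every channel's cross-correlation peak recovers the source ITD.

```python
# Toy per-channel cross-correlogram: each "frequency channel" is white noise,
# and the right-ear version is the left delayed by the source ITD.
import numpy as np

rng = np.random.default_rng(2)
itd_true = 8                  # interaural delay in samples
n, max_lag = 1024, 16

def peak_lag(left, right, max_lag):
    """Lag of the cross-correlation peak within +/- max_lag samples."""
    full = np.correlate(right, left, mode="full")
    centre = len(left) - 1    # index of zero lag in the full output
    window = full[centre - max_lag : centre + max_lag + 1]
    return int(np.arange(-max_lag, max_lag + 1)[np.argmax(window)])

itds = []
for ch in range(4):           # four stand-in frequency channels
    band = rng.standard_normal(n)
    itds.append(peak_lag(band, np.roll(band, itd_true), max_lag))

print(itds)                   # every channel's peak lag agrees on the source ITD
```

Converting each peak lag to azimuth (e.g. via an empirically measured lookup table, as the notes suggest) then gives the per-channel azimuth estimates shown in the figures.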
[Figure: ITD cross-correlogram and azimuth cross-correlogram at time frame 90 (frequency channel vs ITD in ms, and vs azimuth in degrees).]
[Figure: skeleton summary cross-correlogram at time frame 90 (frequency channel vs azimuth in degrees).]
Highest peak in each frequency channel indicates azimuth of dominant source in that
channel – but not always accurate, even for a single anechoic source
[Figure: dominant azimuth per time frame; the colour bar shows azimuth value (orange corresponds to azimuth 40°, i.e. the actual azimuth of the source).]
Cross-correlogram example for two sources, one at azimuth 0° and one at azimuth 40°, in anechoic conditions (time frames 90 and 105)
The dominant source differs in different time frames
[Figure: skeleton cross-correlograms and skeleton summary cross-correlograms at time frames 90 and 105 (frequency channel vs azimuth in degrees), plus the dominant azimuth per time frame.]
Localisation accuracy deteriorates in reverberant conditions
Note: this example is for a single source at azimuth 40°
[Figure: azimuth cross-correlograms and dominant azimuth per time frame in reverberant conditions; colour indicates azimuth value (orange corresponds to azimuth 40°; green to 0°).]
ILD cue is less reliable in reverberant conditions
ILD is stronger at higher frequencies
[Figure: ILD (dB) per frequency channel and time frame for a single source at azimuth 40°, in anechoic and in reverberant conditions.]
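The frequency dependence of the ILD can be illustrated with a deliberately crude head-shadow model: the far-ear gain is given a one-pole low-pass roll-off, so the level difference grows with frequency. Both the model and its ~1 kHz corner frequency are assumptions for illustration only.

```python
# Crude illustration: head shadow modelled as a far-ear gain that falls off
# above ~1 kHz, so the ILD grows with frequency.
import numpy as np

fs = 16000
ilds = []
for freq in (250, 500, 1000, 2000, 4000):
    t = np.arange(0, 0.1, 1 / fs)
    near = np.sin(2 * np.pi * freq * t)
    far = near / np.sqrt(1 + (freq / 1000) ** 2)   # assumed head-shadow gain
    ild = 20 * np.log10(np.std(near) / np.std(far))
    ilds.append(ild)
    print(f"{freq:5d} Hz: ILD = {ild:4.1f} dB")
```

The printed ILD rises from a fraction of a dB at 250 Hz to over 10 dB at 4 kHz, matching the qualitative point that ILD is a weak cue at low frequencies and a strong one at high frequencies.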
ILD cue is less reliable in reverberant conditions
ILD is stronger for source on one side of head
[Figure: ILD (dB) per frequency channel and time frame for two sources at azimuths 0° and 40°, in anechoic and in reverberant conditions.]
Problems with cross-correlograms:
a) multiple peaks at high frequencies
b) interactions between sources – incorrect, broad or reduced peaks
c) reverberation effects
d) moving sources
Suggested solutions:
a) Sum cross-correlogram across frequency channels (Lyon 1983)
b) Convert from ITD to azimuth, using supervised training or empirical data
(Bodden 1993)
c) Weight frequency bands according to their importance (Bodden 1993)
d) Track peaks over time, measuring amplitude changes (Bodden 1993)
e) Sharpen cross-correlation peaks – skeleton cross-correlogram (Palomaki et al
2004)
f) Subtract (stationary) background cross-correlogram (Braasch 2002)
g) Ignore low-amplitude peaks in cross-correlogram – use ‘interaural
coherence’ (Faller & Merimaa 2004)
h) Use template matching (‘stencil’) to identify multiple peaks (Liu et al 2000)
i) Track moving sources using hidden Markov models (Roman & Wang 2003)
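Solution (a), summing the cross-correlogram across frequency channels, can be sketched as below. For the ITD-to-azimuth step, the Woodworth spherical-head approximation ITD = (a/c)(θ + sin θ) stands in for empirical training data; the head radius, channel count and all other parameters are illustrative assumptions.

```python
# Sketch: summary cross-correlogram (summed over stand-in channels), plus an
# ITD-to-azimuth lookup via the Woodworth spherical-head approximation.
import numpy as np

a, c = 0.0875, 343.0                 # head radius (m), speed of sound (m/s)

def woodworth_itd(theta):
    """ITD in seconds for azimuth theta in radians."""
    return (a / c) * (theta + np.sin(theta))

def itd_to_azimuth(itd):
    """Invert the Woodworth formula by table lookup (0.1 degree grid)."""
    thetas = np.linspace(-np.pi / 2, np.pi / 2, 1801)
    return float(np.degrees(thetas[np.argmin(np.abs(woodworth_itd(thetas) - itd))]))

fs = 32000
delay = int(round(woodworth_itd(np.radians(40)) * fs))   # source at 40 degrees

rng = np.random.default_rng(3)
max_lag = 40
summary = np.zeros(2 * max_lag + 1)
for ch in range(8):                                       # stand-in channels
    band = rng.standard_normal(2048)
    full = np.correlate(np.roll(band, delay), band, mode="full")
    centre = len(band) - 1
    summary += full[centre - max_lag : centre + max_lag + 1]

peak_lag = int(np.arange(-max_lag, max_lag + 1)[np.argmax(summary)])
azimuth = itd_to_azimuth(peak_lag / fs)
print(f"estimated azimuth: {azimuth:.0f} degrees")
```

Summing before peak picking pools evidence across channels, which suppresses the spurious high-frequency peaks listed as problem (a).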
Summary
Binaural sound localisation uses cues:
• interaural time difference (ITD)
• interaural level difference (ILD)
• pinna cues
ITD dominates, but cues interact in complex ways (not fully understood)