kinnunen_reddots_2017
Tomi Kinnunen1 , Md Sahidullah1 , Mauro Falcone2 , Luca Costantini2 , Rosa González Hautamäki1 ,
Dennis Thomsen3 , Achintya Sarkar3 , Zheng-Hua Tan3 , Héctor Delgado4 , Massimiliano Todisco4 ,
Nicholas Evans4 , Ville Hautamäki1 , Kong Aik Lee5
1 University of Eastern Finland, Finland, 2 Fondazione Ugo Bordoni, Italy, 3 Aalborg University, Denmark,
4 EURECOM, France, 5 Institute for Infocomm Research, Singapore
ABSTRACT

This paper describes a new database for the assessment of automatic speaker verification (ASV) vulnerabilities to spoofing attacks. In contrast to other recent data collection efforts, the new database has been designed to support the development of replay spoofing countermeasures tailored towards the protection of text-dependent ASV systems from replay attacks in the face of variable recording and playback conditions. Derived from the re-recording of the original RedDots database, the effort is aligned with that in text-dependent ASV and thus well positioned for future assessments of replay spoofing countermeasures, not just in isolation, but in integration with ASV. The paper describes the database design and re-recording, a protocol and some early spoofing detection results. The new “RedDots Replayed” database is publicly available through a Creative Commons license.

Index Terms— speaker verification, spoofing, replay

1. INTRODUCTION

Automatic speaker verification (ASV) [1, 2, 3] technology is today exploited in a growing range of real-world user authentication applications. Examples are systems developed by most of today’s global technology corporations and a number of large-scale, collaborative projects such as the European Union Horizon 2020 project, OCTAVE1.

Many of these applications demand not only reliable recognition performance and robustness to environmental and channel variation, but also resilience to circumvention. On this front, recent years have witnessed the emergence of two relatively new, or renewed, research directions within the ASV community. The first focuses on text-dependent ASV. The second relates to spoofing countermeasures [4]. Research in both directions has benefited greatly from community-driven efforts to introduce free and publicly available corpora. These have been essential for the benchmarking of text-dependent ASV systems [5, 6] and spoofing countermeasures [7, 8].

Even if these two research directions have been pursued in relative independence, they are closely intertwined. While text-dependent ASV systems can improve verification reliability beyond what might otherwise be achieved without text constraints, they may be more vulnerable to spoofing through replay attacks. Conversely, spoofing relates to authentication applications which generally demand high usability, short duration and hence text-dependent ASV.

Previous assessments of ASV and replay attacks have generally involved only a small number of recording and playback conditions, e.g. [9, 10, 11]. As a consequence, countermeasures developed with these databases generally perform well. The reality may be different, however. With the nature of recording and playback conditions being totally unconstrained, the existing databases are probably not representative; ASV systems could be far more vulnerable than past results might suggest. Furthermore, even if the first evaluation of ASV spoofing and countermeasures (ASVspoof) [7] was performed using a range of different spoofing attacks, it lacked a focus on replay attacks and on text-dependent ASV.

A new database is thus needed to support a more meaningful study of replay spoofing and its impact on text-dependent ASV. This is the impetus of the work reported in this paper. It describes the creation of a new database derived from a small-scale, crowd-sourced re-recording of the text-dependent RedDots database [5]. The new replay corpus represents a significant number of recording and playback conditions while linking research in spoofing to that of the text-dependent ASV community, recent results from which are based upon the RedDots database. The use of the same base corpus for the new replay corpus will thus provide an ideal starting point for further work to assess the impact of replay spoofing and countermeasures on ASV itself, rather than being assessed in isolation.

2. PRIOR REPLAY ATTACK CORPORA

In general, ASV vulnerabilities to replay attacks have received surprisingly little attention in the literature, likely due to a lack of common data. The AVspoof corpus2 is the only publicly available corpus that includes replay attacks [8]. It contains 44 speakers recorded first using two smart-phones and a laptop. Two smart-phones were then used to replay the utterances. The laptop was used both with its built-in speaker and with a high-quality loudspeaker to generate the replayed signals. Playback and recording experiments were conducted in a controlled environment involving varied room acoustics.

Other studies [9, 10, 11] assessed the impact of replay attacks on ASV accuracy using in-house data. The work in [9] compared the impact of voice conversion and speech synthesis attacks to that of replay attacks emulated with recording and playback impulse responses. In [10], a subset of RSR2015 [6] was used as a source corpus by

The paper reflects some results from the OCTAVE Project (#647850), funded by the Research Executive Agency (REA) of the European Commission, in its framework programme Horizon 2020. The views expressed in this paper are those of the authors and do not engage any official position of the European Commission. The authors would like to further acknowledge the effort of Sebastiano Trigila (FUB) for his coordination of OCTAVE.
1 https://www.octave-project.eu/
2 Not to be confused with the similarly named ASVspoof 2015 corpus [7]
[Fig. 1. An illustration of replay spoofing.]
[Fig. 2, panels: (a) Recording site 1, (b) Recording site 2.]
3. THE “REDDOTS REPLAYED” DATABASE

This section describes the design and collection of the “RedDots Replayed” database. This was accomplished through a crowd-sourcing approach which captures diverse recording and playback conditions.

3.1. Definition and resource-constrained collection plan

A replay attack is illustrated in the upper part of Fig. 1. An attacker first acquires a covert recording of the target speaker’s utterance without his/her consent, then replays it using a replay device over some physical space. The replayed sample is captured by the ASV system terminal (recording device). In contrast, an authentic target speaker recording, illustrated in the lower part of Fig. 1, would also be obtained through some (generally, another) physical space, but captured directly by the ASV system microphone.

Creating a corpus to emulate the full scenario of Fig. 1 would ideally require multiple simultaneously recorded channels, some of which would represent the covert recording microphones, some the target speaker enrolment microphones and some the ASV system test microphones. As such data collection is generally tedious, in this study we assume that the attacker has access to the original digital copy of the target speaker recordings. This simplification allows us to re-use any off-the-shelf source corpus to study replay.

Attackers in the real world are not necessarily IT experts but laypersons who share a common-sense understanding of the potential to bypass an ASV system by capturing and replaying someone else’s voice using common consumer devices. Thus, we recruited a group of volunteers who were given simple tasks while being encouraged to be creative in emulating replay attacks. The goal was to obtain both unpredictable environments and diverse replay-recording device combinations.

The volunteers, recruited from the ongoing EU H2020 OCTAVE project, were instructed to use their favorite recording/replay audio software on their smart-phones or other devices. To keep the re-recording time feasible, volunteers were provided with long audio files merged from the original utterances to make replay recordings a one-shot task. The long audio files would then be segmented automatically using embedded segment identifiers to obtain the individual utterances.

3.2. RedDots as a source corpus

We are interested in text-dependent ASV, for which two recent corpora, RSR2015 [6] and RedDots [5], are widely adopted by the community. The RedDots corpus, which contains short phrases recorded using different brands of smartphones, was used for the work reported here. It was collected over a time-span of several months to a year and contains subjects from different geographical locations around the globe. The database thus encapsulates diverse channel, session and accent variations. The RedDots database consists of speech files in English, with its Quarter 4 Release having 62 speakers (49 M, 13 F) from 21 countries. The total number of sessions for that release is 572 (473 M and 99 F).

To collect the replay data, we used Part 01 of the corpus, which consists of 10 common short phrases. Since replay attacks are concerned with the playback of target speakers’ own voices, we chose all the speech samples from speaker-matched trials, referred to as target-correct (TC) and target-wrong (TW) in the RedDots evaluation plan [5]. In total, 3498 utterances from 49 male speakers were used. We consider male speakers only due to the larger amount of data.

3.3. Replay material preparation and segmentation

The 3498 selected utterances from the RedDots corpus were divided into 13 disjoint sets: 12 sets of 250 utterances and one set with the remaining 498 utterances. The utterances were concatenated with interleaved marker tones serving as embedded segment or utterance identifiers. The markers, dual-tone multi-frequency (DTMF) tones, signify the beginning of each utterance; the corresponding time stamps were used for later segmentation. The 13 concatenated files, of approximately 13 minutes duration each, were distributed to the volunteers. An additional long file containing all 3498 samples was provided to two sites. The replayed files were segmented by synchronizing them manually with the corresponding original files and using the recorded time stamps to identify the individual utterances. In total, 130 replayed long files were received from the volunteers.
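As a sketch of this preparation step, the marker-tone concatenation can be emulated as follows. This is a minimal illustration only: the 16 kHz sampling rate, the single fixed DTMF key and the 0.2 s marker duration are assumptions, not details taken from the paper.

```python
import numpy as np

FS = 16000  # assumed sampling rate (Hz)


def dtmf_tone(key="1", dur=0.2, fs=FS):
    """Generate a DTMF marker tone: the sum of one low (row) and one
    high (column) frequency from the standard DTMF keypad grid."""
    low = {"1": 697, "2": 697, "4": 770, "5": 770}[key]
    high = {"1": 1209, "2": 1336, "4": 1209, "5": 1336}[key]
    t = np.arange(int(dur * fs)) / fs
    return 0.5 * (np.sin(2 * np.pi * low * t) + np.sin(2 * np.pi * high * t))


def concatenate_with_markers(utterances, fs=FS):
    """Concatenate utterances into one long signal, prefixing each with a
    DTMF marker, and return the signal plus per-utterance start times
    (the time stamps later used for segmentation)."""
    marker = dtmf_tone()
    chunks, starts, pos = [], [], 0
    for utt in utterances:
        chunks.append(marker)
        pos += len(marker)
        starts.append(pos / fs)  # utterance begins right after its marker
        chunks.append(utt)
        pos += len(utt)
    return np.concatenate(chunks), starts
```

Segmentation of a replayed file then amounts to aligning it with the original long file and cutting at these stored time stamps.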
Table 1. Summary of replay (left) and re-recording (right) devices collected. The recording devices emulate possible ASV system terminal
devices on which sensor-level attacks are executed using playback devices.
ID Playback device ID Recording device
P1 ACER “Ferrari ONE” netbook R1 AKG C562CM + Marantz PMD670
P2 All-in-one PC speakers R2 BQ Aquaris M5 smartphone. Software: Smart voice recorder
P3 BQ Aquaris M5 smartphone R3 Desktop Computer with headset and arecord
P4 Beyerdynamic DT 770 PRO headphones with PC R4 H6 Handy Recorder
P5 Creative A60 connected to laptop R5 Logitech C920 connected to Dell (SSD) notebook
P6 Dell (SSD) notebook + Edirol UA25 + XXX R6 Nokia Lumia
P7 Dell laptop with internal speakers R7 Røde NT2 microphone with a laptop
P8 Dynaudio BM5A Speaker connected to laptop R8 Røde smartlav+ mic with a laptop
P9 HP Laptop speakers R9 Samsung GT-I9100
P10 High-end GENELEC 8020C studio monitors R10 Samsung GT-P6200
P11 MacBook pro internal speakers R11 Samsung Galaxy 7s
P12 PC with Altec Lansing Orbit USB iML227 speaker R12 Samsung Trend 2
P13 Samsung GT-I9100 R13 Samsung Trend 3
P14 Samsung GT-P6200 R14 ZoomHD1
P15 VIFA M10MD-39-08 Speakers with laptop R15 iPhone 5c
R16 iPhone 4
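Playback–recording chains such as those in Table 1 are often emulated in simulation by convolving clean speech with device and room impulse responses, as in the emulated replay attacks of [9]. The sketch below illustrates the idea; all impulse responses and the optional noise signal are hypothetical placeholders, not measurements from this corpus.

```python
import numpy as np


def emulate_replay(speech, playback_ir, room_ir, mic_ir, noise=None, snr_db=None):
    """Crude replay emulation: pass clean speech through a cascade of
    playback-device, room and recording-device impulse responses, then
    optionally add background noise scaled to a target SNR (dB)."""
    x = speech
    for ir in (playback_ir, room_ir, mic_ir):
        x = np.convolve(x, ir)  # linear convolution with each stage's IR
    if noise is not None and snr_db is not None:
        noise = noise[: len(x)]
        # Gain so that 10*log10(P_signal / P_scaled_noise) == snr_db.
        g = np.sqrt(np.mean(x**2) / (np.mean(noise**2) * 10 ** (snr_db / 10)))
        x = x + g * noise
    return x
```

With unit-impulse responses the chain is transparent, which makes the function easy to sanity-check before plugging in measured IRs.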
3.4. Data collection sites

The replay recordings were executed in four distinct locations representing four OCTAVE project partners: UEF (Finland), FUB (Italy), AAU (Denmark) and EURECOM (France). The volunteers were instructed to make at least one recording in a controlled condition and at least one in a variable condition. The former refers to a silent environment, and the latter to any creative choice by the volunteers. Some set-ups are illustrated in Fig. 2 and a summary of the devices is provided in Table 1. In the following we provide a brief description of each site’s approach.

AAU: Controlled recordings were made in a small office room of approximately 6.5m x 3.5m x 3.5m (H) with a large meeting table in the middle. The replay device is P8 from Table 1, while the recording devices are R7, R8 and R11. Uncontrolled recordings were made in a student canteen with a large entrance hall. The uncontrolled setup was the same as the controlled setup, except that the replay device was changed to P15.

EURECOM: Controlled recordings were made in a silent office. The internal audio card of a desktop PC was used for both playback and recording. The playback device was a pair of Beyerdynamic DT 770 PRO headphones connected to the audio device output. The microphone of a conventional headset device connected to the audio device input was used for audio capture; it was placed immediately between the two headphone speakers. Different mobile and portable devices (BQ Aquaris M5, iPhone 5c, Dell laptop) were used for playback/recording in different environments (silent living room, office room with windows open, bedroom with windows open facing a street producing traffic noise) for uncontrolled recordings.

FUB: Controlled recordings were collected in a silent room of dimensions 6.85m x 3.65m x 2.40m (H). From 500Hz upwards, the isolation level exceeds 40dB, the reverberation time is in the range 0.3-0.5s and the background noise is approximately 2dBSPL. One replay recording came from a notebook controlling an Edirol UA25 digital board connected to a professional amplified loudspeaker (LEM SoundPressure); it was recorded using a binaural microphone based on two AKG C 562 CM capsules with a professional Marantz PMD670 solid-state recorder. Another recording came from the loudspeaker of an ACER ONE (Ferrari) netbook, recorded by a DELL XPS notebook equipped with an external Logitech C920 HD webcam. Uncontrolled recordings were made using a Samsung tablet GT-P6200, a Samsung smartphone GT-I9200, an iPhone 4, and a ZoomHD1 solid-state recorder. In order to have a stable and reproducible condition for the uncontrolled recording, a special housing-base was developed for the devices, as illustrated in Fig. 2.

UEF: Controlled recordings were made in a silent office and in a silent apartment room. The replay devices were a desktop PC with high-quality Genelec studio speakers and All-in-One PC speakers. The audio was recorded using a Zoom H6 Handy recorder with an omni-directional headset mic (Glottal Enterprises M80). Uncontrolled recordings were collected in a coffee room, an office room, and from an open balcony. The office recordings contain additive noise generated from a Nexus 4 smartphone playing bar or small-pub noise in the background. The playback devices include a laptop with external Creative A60 speakers, HP EliteBook laptop speakers and a desktop PC with a portable Altec Lansing Orbit USB iML227 speaker. Recordings used two smartphones: a Nokia Lumia 635 and a Samsung Galaxy Trend 2.

3.5. Analysis of collected data

A total of 130 long files were received within a period of a few days. After segmentation, we extracted 49,432 individual replayed utterances. One of the aims was to collect replay attacks of varied technical quality. To give a sense of how well this was achieved, we measured the signal-to-noise ratios (SNRs) of the collected data, obtained using the NIST STNR tool3. The SNR histograms of the individual utterances for both controlled and variable conditions, along with those of the original RedDots files, are shown in Fig. 3. The SNR distributions of RedDots and the controlled data are similar, though the replay data also contains examples of lower SNRs, as might be expected. Comparing the controlled and variable conditions, the proportion of utterances with high SNR is lower and the mode occurs at a lower SNR for the latter, as expected.

4. EXPERIMENTS

To demonstrate corpus usage, we defined pilot protocols both for standalone replay attack detection and for ASV. For the former, we use 10 training speakers that are disjoint from the test utterance speakers. Further, any long audio file, whose one or more segments were

3 http://www.itl.nist.gov/iad/mig//tools/spqa_23sphere25tarZ.htm
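A per-utterance SNR screen of the kind described in Section 3.5 can be approximated with a simple frame-energy-based estimator. This is a rough stand-in, not the NIST STNR algorithm; the frame sizes assume 16 kHz audio and the percentile choices are arbitrary.

```python
import numpy as np


def frame_energies(x, frame_len=400, hop=160):
    """Short-time log energies (dB): 25 ms frames at a 10 ms hop for 16 kHz audio."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])
    return 10 * np.log10(np.mean(frames**2, axis=1) + 1e-12)


def snr_estimate(x):
    """Crude SNR estimate: difference between a high percentile of the
    frame-energy distribution (speech-dominated frames) and a low
    percentile (noise-dominated frames), in dB."""
    e = frame_energies(x)
    return np.percentile(e, 95) - np.percentile(e, 15)
```

Such an estimator is enough to separate clean controlled recordings from noisy variable-condition ones, which is all the histogram comparison in Fig. 3 requires.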
[Fig. 3: SNR histograms, panels (a) and (b); y-axis: Frequency.]

Table 4. Accuracy of two countermeasures (EER, %) on the spoofing protocol, for controlled condition, variable condition and pooled trials.

Feature       Controlled  Variable  All
LFCC 20-DA    5.88        4.43      5.11
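EERs like those reported in Table 4 can be computed from countermeasure detection scores as the operating point where the miss rate on genuine trials equals the false-acceptance rate on replayed trials. A minimal sketch (one common discrete approximation; evaluation toolkits may interpolate the ROC differently):

```python
import numpy as np


def eer(genuine_scores, spoof_scores):
    """Equal error rate (%): sweep a threshold over the pooled, sorted
    scores and return the point where the false-rejection rate (genuine
    trials scored at or below threshold) is closest to the
    false-acceptance rate (spoof trials scored above threshold)."""
    scores = np.concatenate([genuine_scores, spoof_scores])
    labels = np.concatenate(
        [np.ones(len(genuine_scores)), np.zeros(len(spoof_scores))]
    )
    labels = labels[np.argsort(scores)]
    fnr = np.cumsum(labels) / labels.sum()                 # misses accumulate
    fpr = 1 - np.cumsum(1 - labels) / (1 - labels).sum()   # false accepts remain
    idx = np.argmin(np.abs(fnr - fpr))
    return 100 * (fnr[idx] + fpr[idx]) / 2
```

Here higher scores denote the genuine (non-replayed) class; with perfectly separated score distributions the function returns 0.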