2023 Iwslt-1
Diamond
Gold
Silver
© 2023 Association for Computational Linguistics
ISBN 978-1-959429-84-5
Introduction
The International Conference on Spoken Language Translation (IWSLT) is the premier annual scientific conference for the study, development and evaluation of spoken language translation technology. Launched in 2004 and spun out from the C-STAR speech translation consortium before it (1992-2003), IWSLT is the main venue for scientific exchange on all topics related to speech-to-text translation, speech-to-speech translation, simultaneous and consecutive translation, speech dubbing, cross-lingual communication including all multimodal, emotional, paralinguistic, and stylistic aspects and their applications in the field. The conference organizes evaluations around challenge areas, and presents scientific papers and system descriptions. IWSLT is organized by the Special Interest Group on Spoken Language Translation (SIGSLT), which is supported by ACL, ISCA and ELRA.
This year, IWSLT featured nine shared tasks in spoken language translation: (i) simultaneous and (ii) offline translation, (iii) automatic subtitling and (iv) dubbing, (v) speech-to-speech translation, (vi) multilingual, (vii) dialect and (viii) low-resource speech translation, and (ix) formality control. Each shared task was coordinated by one or more chairs. The resulting evaluation campaigns attracted a total of 31 teams, from academia, research centers, and industry. System submissions resulted in system papers that will be presented at the conference. Following our call for papers, this year 51 submissions were received. In a blind review process, 8 research papers were selected out of 15 for oral presentation (57%) in addition to 37 system papers.
The program committee is excited about the quality of the accepted papers and expects lively discussion and exchange at the conference. The conference chairs and organizers would like to express their gratitude to everyone who contributed to and supported IWSLT. In particular, we wish to thank our Diamond sponsors Apple and Translated, our Gold sponsor aiXplain, and our Silver sponsor AppTek. We thank the shared task chairs, organizers, and participants, the program committee members, as well as all the authors who went the extra mile to submit system and research papers to IWSLT and make this year’s conference a big success. We also wish to express our sincere gratitude to ACL for hosting our conference and for arranging the logistics and infrastructure that allow us to hold IWSLT 2023 as a hybrid conference.
Organizing Committee
Conference Chairs
Marcello Federico, AWS AI Labs, USA
Alex Waibel, CMU, USA
Program Chair
Marine Carpuat, UMD, USA
Sponsorship Chair
Sebastian Stüker, Zoom, Germany
Evaluation Chairs
Jan Niehues, KIT, Germany
Publicity Chair
Atul Kr. Ojha, University of Galway, Ireland
Program Committee
Sweta Agrawal, University of Maryland, USA
Duygu Ataman, University of Zurich, Switzerland
Laurent Besacier, Naver Labs, France
Roldano Cattoni, FBK, Italy
Alexandra Chronopoulou, LMU Munich, Germany
Josep Maria Crego, Systran, France
Mattia Di Gangi, AppTek, Germany
Qianqian Dong, ByteDance AI Lab, China
Akiko Eriguchi, Microsoft, USA
Carlos Escolano, Universitat Politècnica de Catalunya, Spain
Markus Freitag, Google, USA
Hirofumi Inaguma, Meta AI, USA
Tom Ko, ByteDance AI Lab, China
Surafel Melaku Lakew, Amazon AI, USA
Yves Lepage, Waseda University, Japan
Xutai Ma, Meta AI, USA
Wolfgang Macherey, Google, USA
Prashant Mathur, AWS AI Labs, USA
Evgeny Matusov, AppTek, Germany
Kenton Murray, Johns Hopkins University, USA
Maria Nadejde, AWS AI Labs, USA
Matteo Negri, FBK, Italy
Xing Niu, AWS AI Labs, USA
Raghavendra Reddy Pappagari, Johns Hopkins University, USA
Juan Pino, Meta AI, USA
Elijah Rippeth, UMD, USA
Elizabeth Salesky, Johns Hopkins University, USA
Rico Sennrich, University of Zurich, Switzerland
Matthias Sperber, Apple, USA
Sebastian Stüker, Zoom, Germany
Katsuhito Sudoh, NAIST, Japan
Brian Thompson, AWS AI Labs, USA
Marco Turchi, Zoom, Germany
David Vilar, Google, Germany
Changhan Wang, Meta AI, USA
Krzystof Wolk, Polish-Japanese Academy of Information Technology, Poland
Table of Contents
Evaluating Multilingual Speech Translation under Realistic Conditions with Resegmentation and Terminology
    Elizabeth Salesky, Kareem Darwish, Mohamed Al-Badrashiny, Mona Diab and Jan Niehues
The MineTrans Systems for IWSLT 2023 Offline Speech Translation and Speech-to-Speech Translation Tasks
    Yichao Du, Guo Zhengsheng, Jinchuan Tian, Zhirui Zhang, Xing Wang, Jianwei Yu, Zhaopeng Tu, Tong Xu and Enhong Chen
The BIGAI Offline Speech Translation Systems for IWSLT 2023 Evaluation
    Zhihang Xie
NAVER LABS Europe’s Multilingual Speech Translation Systems for the IWSLT 2023 Low-Resource Track
    Edward Gow-Smith, Alexandre Berard, Marcely Zanon Boito and Ioan Calapodescu
Improving Neural Machine Translation Formality Control with Domain Adaptation and Reranking-based Transductive Learning
    Zhanglin Wu, Zongyao Li, Daimeng Wei, Hengchao Shang, Jiaxin Guo, Xiaoyu Chen, Zhiqiang Rao, Zhengzhe YU, Jinlong Yang, Shaojun Li, Yuhao Xie, Bin Wei, Jiawei Zheng, Ming Zhu, Lizhi Lei, Hao Yang and Yanfei Jiang
HW-TSC at IWSLT2023: Break the Quality Ceiling of Offline Track via Pre-Training and Domain Adaptation
    Zongyao Li, Zhanglin Wu, Zhiqiang Rao, Xie YuHao, Guo JiaXin, Daimeng Wei, Hengchao Shang, Wang Minghan, Xiaoyu Chen, Zhengzhe YU, Li ShaoJun, Lei LiZhi and Hao Yang
Submission of USTC’s System for the IWSLT 2023 - Offline Speech Translation Track
    Xinyuan Zhou, Jianwei Cui, Zhongyi Ye, Yichi Wang, Luzhen Xu, Hanyi Zhang, Weitai Zhang and Lirong Dai
I2R’s End-to-End Speech Translation System for IWSLT 2023 Offline Shared Task
    Muhammad Huzaifah, Kye Min Tan and Richeng Duan
The NiuTrans End-to-End Speech Translation System for IWSLT23 English-to-Chinese Offline Task
    Yuchen Han, Xiaoqian Liu, Hao Chen, Yuhao Zhang, Chen Xu, Tong Xiao and Jingbo Zhu
ON-TRAC Consortium Systems for the IWSLT 2023 Dialectal and Low-resource Speech Translation Tasks
    Antoine Laurent, Souhir Gahbiche, Ha Nguyen, Haroun Elleuch, Fethi Bougares, Antoine Thiol, Hugo Riguidel, Salima Mdhaffar, Gaëlle Laperrière, Lucas Maison, Sameer Khurana and Yannick Estève
BUT Systems for IWSLT 2023 Marathi - Hindi Low Resource Speech Translation Task
    Santosh Kesiraju, Karel Beneš, Maksim Tikhonov and Jan Černocký
Improving Low Resource Speech Translation with Data Augmentation and Ensemble Strategies
    Akshaya Vishnu Kudlu Shanbhogue, Ran Xue, Soumya Saha, Daniel Zhang and Ashwinkumar Ganesan
Speech Translation with Style: AppTek’s Submissions to the IWSLT Subtitling and Formality Tracks in 2023
    Parnia Bahar, Patrick Wilken, Javier Iranzo-Sánchez, Mattia Di Gangi, Evgeny Matusov and Zoltán Tüske
QUESPA Submission for the IWSLT 2023 Dialect and Low-resource Speech Translation Tasks
    John E. Ortega, Rodolfo Zevallos and William Chen
GMU Systems for the IWSLT 2023 Dialect and Low-resource Speech Translation Tasks
    Jonathan Mbuya and Antonios Anastasopoulos
Learning Nearest Neighbour Informed Latent Word Embeddings to Improve Zero-Shot Machine Translation
    Nishant Kambhatla, Logan Born and Anoop Sarkar
The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task
    Kun Song, Yi Lei, Peikun Chen, Yiqing Cao, Kun Wei, Yongmao Zhang, Lei Xie, Ning Jiang and Guoqing Zhao
Language Model Based Target Token Importance Rescaling for Simultaneous Neural Machine Translation
    Aditi Jain, Nishant Kambhatla and Anoop Sarkar
Tagged End-to-End Simultaneous Speech Translation Training Using Simultaneous Interpretation Data
    Yuka Ko, Ryo Fukuda, Yuta Nishikawa, Yasumasa Kano, Katsuhito Sudoh and Satoshi Nakamura
The HW-TSC’s Simultaneous Speech-to-Text Translation System for IWSLT 2023 Evaluation
    Jiaxin GUO, Daimeng Wei, Zhanglin Wu, Zongyao Li, Zhiqiang Rao, Minghan Wang, Hengchao Shang, Xiaoyu Chen, Zhengzhe Yu, Shaojun Li, Yuhao Xie, Lizhi Lei and Hao Yang
The HW-TSC’s Simultaneous Speech-to-Speech Translation System for IWSLT 2023 Evaluation
    Hengchao Shang, Zhiqiang Rao, Zongyao Li, Zhanglin Wu, Jiaxin GUO, Minghan Wang, Daimeng Wei, Shaojun Li, Zhengzhe YU, Xiaoyu Chen, Lizhi Lei and Hao Yang
Towards Efficient Simultaneous Speech Translation: CUNI-KIT System for Simultaneous Track at IWSLT 2023
    Peter Polak, Danni Liu, Ngoc-Quan Pham, Jan Niehues, Alexander Waibel and Ondřej Bojar
Speech Translation with Foundation Models and Optimal Transport: UPC at IWSLT23
    Ioannis Tsiamas, Gerard I. Gállego, Jose Fonollosa and Marta R. Costa-jussá
The Xiaomi AI Lab’s Speech Translation Systems for IWSLT 2023 Offline Task, Simultaneous Task and Speech-to-Speech Task
    Wuwei Huang, Mengge Liu, Xiang Li, Yanzhi Tian, Fengyu Yang, Wen Zhang, Jian Luan, Bin Wang, Yuhang Guo and Jinsong Su
Improving Formality-Sensitive Machine Translation Using Data-Centric Approaches and Prompt Engineering
    Seugnjun Lee, Hyeonseok Moon, Chanjun Park and Heuiseok Lim
UM-DFKI Maltese Speech Translation
    Aiden Williams, Kurt Abela, Rishu Kumar, Martin Bär, Hannah Billinghurst, Kurt Micallef, Ahnaf Mozib Samin, Andrea DeMarco, Lonneke van der Plas and Claudia Borg
SRI-B’s Systems for IWSLT 2023 Dialectal and Low-resource Track: Marathi-Hindi Speech Translation
    Balaji Radhakrishnan, Saurabh Agrawal, Raj Prakash Gohil, Kiran Praveen, Advait Vinay Dhopeshwarkar and Abhishek Pandey
On the Copying Problem of Unsupervised NMT: A Training Schedule with a Language Discriminator Loss
    Yihong Liu, Alexandra Chronopoulou, Hinrich Schütze and Alexander Fraser
Program
Friday, July 14, 2023
FINDINGS OF THE IWSLT 2023 EVALUATION CAMPAIGN
Milind Agarwal¹, Sweta Agrawal², Antonios Anastasopoulos¹, Luisa Bentivogli³,
Ondřej Bojar⁴, Claudia Borg⁵, Marine Carpuat², Roldano Cattoni³,
Mauro Cettolo³, Mingda Chen⁶, William Chen⁷, Khalid Choukri⁸,
Alexandra Chronopoulou⁹, Anna Currey¹⁰, Thierry Declerck¹¹, Qianqian Dong¹²,
Kevin Duh¹³, Yannick Estève¹⁴, Marcello Federico¹⁰, Souhir Gahbiche¹⁵,
Barry Haddow¹⁶, Benjamin Hsu¹⁰, Phu Mon Htut¹⁰, Hirofumi Inaguma⁶,
Dávid Javorský⁴, John Judge¹⁷, Yasumasa Kano¹⁸, Tom Ko¹²,
Rishu Kumar⁴, Pengwei Li⁶, Xutai Ma⁶, Prashant Mathur¹⁰,
Evgeny Matusov¹⁹, Paul McNamee¹³, John P. McCrae²⁰, Kenton Murray¹³,
Maria Nadejde¹⁰, Satoshi Nakamura¹⁸, Matteo Negri³, Ha Nguyen¹⁴,
Jan Niehues²¹, Xing Niu¹⁰, Atul Kr. Ojha²⁰, John E. Ortega²²,
Proyag Pal¹⁶, Juan Pino⁶, Lonneke van der Plas²³, Peter Polák⁴,
Elijah Rippeth², Elizabeth Salesky¹³, Jiatong Shi⁷, Matthias Sperber²⁴,
Sebastian Stüker²⁵, Katsuhito Sudoh¹⁸, Yun Tang⁶, Brian Thompson¹⁰,
Kevin Tran⁶, Marco Turchi²⁵, Alex Waibel⁷, Mingxuan Wang¹²,
Shinji Watanabe⁷, Rodolfo Zevallos²⁶
¹GMU, ²UMD, ³FBK, ⁴Charles U., ⁵U. Malta, ⁶Meta, ⁷CMU, ⁸ELDA,
⁹LMU, ¹⁰AWS, ¹¹DFKI, ¹²ByteDance, ¹³JHU, ¹⁴Avignon U., ¹⁵Airbus,
¹⁶U. Edinburgh, ¹⁸NAIST, ¹⁹AppTek, ²⁰U. Galway, ²¹KIT, ²²Northeastern U.,
²³IDIAP, ²⁴Apple, ²⁵Zoom, ²⁶U. Pompeu Fabra
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 1–61
July 13-14, 2023 ©2023 Association for Computational Linguistics
Team          Organization
ALEXA AI      Amazon Alexa AI, USA (Vishnu et al., 2023)
APPTEK        AppTek, Germany (Bahar et al., 2023)
BIGAI         Beijing Institute of General Artificial Intelligence, China (Xie, 2023)
BIT           Beijing Institute of Technology, China (Wang et al., 2023b)
BUT           Brno University of Technology, Czechia (Kesiraju et al., 2023)
CMU           Carnegie Mellon University, USA (Yan et al., 2023)
CUNI-KIT      Charles University, Czechia, and KIT, Germany (Polák et al., 2023)
FBK           Fondazione Bruno Kessler, Italy (Papi et al., 2023)
GMU           George Mason University, USA (Mbuya and Anastasopoulos, 2023)
HW-TSC        Huawei Translation Services Center, China (Li et al., 2023; Wang et al., 2023a; Guo et al., 2023; Shang et al., 2023; Rao et al., 2023)
I2R           Institute for Infocomm Research, A*STAR, Singapore (Huzaifah et al., 2023)
JHU           Johns Hopkins University, USA (Hussein et al., 2023; Xinyuan et al., 2023)
KIT           Karlsruhe Institute of Technology, Germany (Liu et al., 2023)
KU            Kyoto University, Japan (Yang et al., 2023)
KU X UPSTAGE  Korea University X Upstage, South Korea (Wu et al., 2023; Lee et al., 2023)
MATESUB       Translated Srl, Italy (Perone, 2023)
MINETRANS     U. of Sci. and Techn. of China, Tancient AI Lab, State Key Lab. of Cognitive Intelligence (Du et al., 2023)
NAIST         Nara Institute of Science and Technology, Japan (Fukuda et al., 2023)
NAVER         NAVER Labs Europe, France (Gow-Smith et al., 2023)
NIUTRANS      NiuTrans, China (Han et al., 2023)
NPU-MSXF      Northwestern Polytechnical U., Nanjing U., MaShang Co., China (Song et al., 2023)
NEURODUB      NeuroDub, Armenia
NEMO          NVIDIA NeMo, USA (Hrinchuk et al., 2023)
ON-TRAC       ON-TRAC Consortium, France (Laurent et al., 2023)
QUESPA        Northeastern U, USA, U. de Pompeu Fabra, Spain, CMU, USA (Ortega et al., 2023)
UPC           Universitat Politècnica de Catalunya, Spain (Tsiamas et al., 2023)
SRI-B         Samsung R&D Institute Bangalore, India (Radhakrishnan et al., 2023)
UCSC          U. of California, Santa Cruz, USA (Vakharia et al., 2023)
UM-DFKI       U. of Malta, Malta, and DFKI, Germany (Williams et al., 2023)
USTC          U. of Science and Technology of China (Deng et al., 2023; Zhou et al., 2023)
XIAOMI        Xiaomi AI Lab, China (Huang et al., 2023)
English into Arabic, Chinese, Dutch, French, German, Japanese, Farsi, Portuguese, Russian, and Turkish.
• Speech-to-speech translation, focusing on natural-speech to synthetic-speech translation of recorded utterances from English to Chinese.
• Automatic Dubbing, focusing on dubbing of short video clips from German to English.
• Dialect SLT, focusing on speech translation of recorded utterances from Tunisian Arabic to English.
• Low-resource SLT, focusing on speech translation of recorded utterances from Irish to English, Marathi to Hindi, Maltese to English, Pashto to French, Tamasheq to French, and Quechua to Spanish.
• Formality Control for SLT, focusing on formality/register control for spoken language translation from English to Korean, Vietnamese, EU Portuguese, and Russian.

The shared tasks attracted 38 submissions by 31 teams (see Table 1) representing both academic and industrial organizations. The following sections report on each shared task in detail, in particular: the goal and automatic metrics adopted for the task, the data used for training and testing, the received submissions and the summary of results. Detailed results for some of the shared tasks are reported in a corresponding appendix.

2 Offline SLT
Offline speech translation is the task of translating audio speech in one language into text in a different target language, without any specific time or structural constraints (as, for instance, in the simultaneous, subtitling, and dubbing tasks). Under this general problem definition, the goal of
the offline ST track (one of the speech tasks with the longest tradition at the IWSLT campaign) is to constantly challenge a technology in rapid evolution by gradually introducing novelty aspects that raise the difficulty bar.

2.1 Challenge
In continuity with last year, participants were given three sub-tasks corresponding to three language directions, namely English→German/Japanese/Chinese. Participation was allowed both with cascade architectures combining automatic speech recognition (ASR) and machine translation (MT) systems as core components, and by means of end-to-end approaches that directly translate the input speech without intermediate symbolic representations. Also this year, one of the main objectives was to measure the performance difference between the two paradigms, a gap that recent research (Bentivogli et al., 2021) and IWSLT findings (Ansari et al., 2020; Anastasopoulos et al., 2021, 2022b) indicate as gradually decreasing.

The other main objective of this round was to assess the ability of SLT technology to deal with complex scenarios involving different types of input characterized by phenomena like spontaneous speech, noisy audio conditions and overlapping speakers. In light of this, the main novelty of the 2023 offline SLT task lies in a richer variety of speech data to be processed. To this aim, in addition to the classic TED talks test set, two novel test sets were released:
• ACL presentations, in which a single speaker is presenting on a stage. Although similar to the TED talks scenario, additional challenges posed by this test set include the presence of non-native speakers, different accents, variable recording quality, terminology, and controlled interactions with a second speaker.
• Press conferences and interviews, in which two persons interact on different topics. Inherent challenges, therefore, include the presence of spontaneous speech, non-native speakers, different accents, and controlled interaction with a second speaker.
All the test sets were used for evaluation in the English-German sub-task, while only TED Talks and ACL presentations were used to test the submissions to the English-Japanese and English-Chinese sub-tasks.

2.2 Data and Metrics
Training and development data. Participants were offered the possibility to submit systems built under three training data conditions:
1. Constrained: the allowed training data is limited to a medium-sized framework in order to keep the training time and resource requirements manageable. The complete list¹ of allowed training resources (speech, speech-to-text-parallel, text-parallel, text-monolingual) does not include any pre-trained language model.
2. Constrained with large language models (constrained+LLM): in addition to all the constrained resources, a restricted selection¹ of large language models is allowed to give participants the possibility to leverage large language models and medium-sized resources.
3. Unconstrained: any resource, pre-trained language models included, can be used with the exception of evaluation sets. This setup is proposed to allow the participation of teams equipped with high computational power and effective in-house solutions built on additional resources.

The development data allowed under the constrained condition consist of the dev set from IWSLT 2010, as well as the test sets used for the 2010, 2013-2015 and 2018-2020 IWSLT campaigns. Besides this TED-derived material, additional development data were released to cover the two new scenarios included in this round of evaluation. For the ACL domain, 5 presentations from the ACL 2022 conference with translations and transcriptions were provided. Due to additional constraints, these references were generated by human post-editing of automatic transcriptions and translations. For the press conferences and interviews domain, 12 videos (total duration: 1h:3m) were selected from publicly available interviews from the Multimedia Centre of the European Parliament (EPTV)².

¹ See the IWSLT 2023 offline track web page: https://iwslt.org/2023/offline
² https://multimedia.europarl.europa.eu
Test data. Three new test sets were created for the three language directions. The new test sets include heterogeneous material drawn from each scenario. For the traditional TED scenario, a new set of 42 talks not included in the current public release of MuST-C was selected to build the en-de test set.³ Starting from this material, the talks for which Japanese and Chinese translations are available were selected to build the en-ja and en-zh test sets (respectively, 38 and 37 talks). Similar to the 2021 and 2022 editions, we consider two different types of target-language references, namely:
• The original TED translations. Since these references come in the form of subtitles, they are subject to compression and omissions to adhere to the TED subtitling guidelines.⁴ This makes them less literal compared to standard, unconstrained translations;
• Unconstrained translations. These references were created from scratch⁵ by adhering to the usual translation guidelines. They are hence exact translations (i.e. literal and with proper punctuation).

For the ACL presentation scenario, paper presentations from ACL 2022 were transcribed and translated into the target languages. A detailed description of the data set can be found in Salesky et al. (2023). There are 5 presentations in each of the dev and test sets, with a total duration of 1h per split. Talks were selected to include diverse paper topics and speaker backgrounds. This test set is shared with the Multilingual task (§5).

For the press conferences and interviews scenario, the test set comprises 10 EPTV videos of variable duration (6m on average), amounting to a total of 1h:1m. The details of the new test sets are reported in Table 2.

                   Talks / Videos   Duration
English-German
  TED              42               3h:47m:53s
  ACL              5                59m:22s
  EPTV             10               1h:1m
English-Chinese
  TED              37               3h:2m:22s
  ACL              5                59m:22s
English-Japanese
  TED              38               3h:19m:34s
  ACL              5                59m:22s
Table 2: Statistics of the official test sets for the IWSLT 2023 offline speech translation task.

Metrics. Systems were evaluated with respect to their capability to produce translations similar to the target-language references. The similarity was measured in terms of BLEU and COMET (Rei et al., 2020a) metrics. The submitted runs were ranked based on the BLEU calculated on the concatenation of the three test sets by using automatic resegmentation⁶ of the hypotheses based on the reference translations. For the BLEU computed on the concatenation of the three test sets, the new unconstrained references have been used for the TED data. As observed in the IWSLT 2022 manual evaluation of simultaneous speech-to-text translation (Macháček et al., 2023), COMET correlates best with human judgments, and the correlation of BLEU is also satisfactory. Moreover, to meet the requests of last year’s participants, a human evaluation was performed on the best-performing submission of each participant.

³ This set of 42 TED talks is also referred to as the “Common” test set (not to be confused with MuST-C “tst-COMMON”) because it serves in both the Offline and Simultaneous (https://iwslt.org/2023/simultaneous) tasks.
⁴ http://www.ted.com/participate/translate/subtitling-tips
⁵ We would like to thank Meta for providing us with this new set of references.
⁶ Performed with mwerSegmenter - https://www-i6.informatik.rwth-aachen.de/web/Software/mwerSegmenter.tar.gz

2.3 Submissions
This year, 10 teams participated in the offline task, submitting a total of 37 runs. Table 3 provides a breakdown of the participation in each sub-task, showing the number of participants, the number of submitted runs and, for each training data condition (constrained, constrained+LLM, unconstrained), the number of submitted runs obtained with cascade and direct systems.
• BIGAI (Xie, 2023) participated both with cascade and direct models for en-de, en-ja, and en-zh translations, which were trained under the constrained+LLM condition. The cascade is the concatenation of an ASR model and an MT system. The ASR consists of the first 12 Transformer layers
Sub-task          Participants  Runs  Condition        Runs  Cascade  Direct
English-German    6             16    Constrained      2     1        1
                                      Constrained+LLM  12    1        11
                                      Unconstrained    2     2        -
English-Chinese   7             16    Constrained      5     3        2
                                      Constrained+LLM  3     1        2
                                      Unconstrained    8     7        1
English-Japanese  3             5     Constrained      2     1        1
                                      Constrained+LLM  2     1        1
                                      Unconstrained    1     1        -
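As a quick consistency check, the per-condition counts in the breakdown above can be tallied against the totals stated in Section 2.3 (37 runs overall). The sketch below is illustrative only: the dictionary layout and the names `TABLE3` and `runs_in` are our own, and a dash in the table is assumed to mean zero runs.

```python
# Run counts transcribed from the participation breakdown above.
# Each condition maps to (cascade runs, direct runs); a "-" in the
# table is assumed here to mean zero runs.
TABLE3 = {
    "English-German": {
        "participants": 6,
        "runs": 16,
        "conditions": {
            "constrained": (1, 1),
            "constrained+LLM": (1, 11),
            "unconstrained": (2, 0),
        },
    },
    "English-Chinese": {
        "participants": 7,
        "runs": 16,
        "conditions": {
            "constrained": (3, 2),
            "constrained+LLM": (1, 2),
            "unconstrained": (7, 1),
        },
    },
    "English-Japanese": {
        "participants": 3,
        "runs": 5,
        "conditions": {
            "constrained": (1, 1),
            "constrained+LLM": (1, 1),
            "unconstrained": (1, 0),
        },
    },
}


def runs_in(entry):
    """Sum cascade and direct runs over all training data conditions."""
    return sum(cascade + direct for cascade, direct in entry["conditions"].values())


per_direction = {name: runs_in(entry) for name, entry in TABLE3.items()}
total_runs = sum(per_direction.values())

for name, entry in TABLE3.items():
    # The per-condition counts should add up to the "Runs" column.
    assert per_direction[name] == entry["runs"], name

print(per_direction)
print("total runs:", total_runs)
```

With the figures as transcribed, the three sub-tasks account for 16, 16 and 5 runs, which adds up to the 37 runs reported in Section 2.3.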
to increase robustness to ASR noise (through synthetic noise generation and data augmentation).
• MINETRANS (Du et al., 2023) participated with en-zh cascade systems trained under constrained and unconstrained conditions. The submitted runs are obtained with a pipeline of ASR, punctuation recognition, and MT components. The ASR is an RNN-Transducer. For the unconstrained condition, GigaSpeech is added to the training data allowed in the constrained setting. In both conditions, pre-processing and filtering techniques are applied to improve data quality, while SpecAugment is used for data augmentation. Before being passed to the MT component, the unpunctuated ASR output is processed by means of a BERT-based punctuation recognition model. For the MT component, two strategies are implemented. The first one relies on different Transformer-based models for supervised training. A base Transformer and an M2M 100 model are used for the constrained condition. A translation model trained on additional in-house corpora is used for the unconstrained condition. The second strategy adopted for the MT component relies on a large language model (Chat-GPT) for prompt-guided translation.
• NIUTRANS (Han et al., 2023) participated with a direct en-zh system trained under the constrained condition. It consists of two separate encoders for speech and text with an adapter in between, followed by a decoder. The speech encoder is pre-trained with an ASR encoder, while the textual encoder and the decoder are pre-trained with MT components. Different architectures with variable size were tested both for ASR (enhanced with CTC loss and inter-CTC loss to speed up convergence) and MT (used to generate pseudo-references so as to increase the size of the SLT data). The final system is an ensemble aiming at maximizing the diversity between models.
• NEURODUB⁷ participated with a cascade en-de system trained under the unconstrained condition. It consists of a 4-staged process including the ASR, the punctuation module performing both sentence extraction and punctuation placement, the speaker- and gender distinction component, and the translation model. Every stage is trained on data crawled from the web.
• NEMO (Hrinchuk et al., 2023) participated with direct systems for all language directions in the constrained training data condition. Pre-trained models and synthetic training data are exploited in different ways to cope with the scarcity of direct ST data. A Conformer-based ASR model trained on all allowed speech-to-text data is used to initialize the SLT encoder. A Transformer-based NMT model trained on all allowed parallel data and fine-tuned on TED talks is used to generate synthetic translation alternatives for all available speech-to-text and text-to-text data. A TTS model based on FastPitch (Łańcucki, 2021) and trained on the English transcripts of all TED-derived data is used to generate the synthetic speech version of English texts in the available text corpora. The submitted SLT systems are based on a Conformer-based encoder followed by a Transformer decoder trained on this mix of (gold and synthetic) speech-to-text and text-to-text data.
• XIAOMI (Huang et al., 2023) participated with a direct en-zh system trained under the constrained+LLM condition. It consists of a speech encoder, a text encoder, and a text decoder, with all parameters initialized using the pre-trained HuBERT and mBART models. The speech encoder is composed of a feature extractor based on convolutional neural networks and a Transformer encoder. In addition to the cross-entropy loss, ASR and MT losses are added, together with a contrastive loss that tries to learn an encoder producing similar representations for similar instances independently of the modality. Self-training is also used to leverage unlabelled data. In addition to the allowed datasets, a large set of pseudo-references is generated by translating the transcripts of the ASR corpora. During training, a second fine-tuning is performed on MuST-C as in-domain data. The final system is an ensemble of the two best-performing models.

⁷ Unofficial participant, as no system paper is available.

2.4 Results
Also this year, the submissions to the IWSLT Offline translation task were evaluated both with automatic metrics and through human evaluation. The results for each sub-task are shown in detail in the Appendix.
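The automatic ranking described in Section 2.2 relies on BLEU computed over the concatenation of the test sets. A minimal sketch of that computation is given below, under several simplifying assumptions: hypotheses are already resegmented to match the reference segments (the mwerSegmenter step is not reproduced), tokenization is plain whitespace splitting, there is one reference per segment, and no smoothing is applied. The function names (`corpus_bleu`, `rank_systems`) and the toy data are hypothetical and do not correspond to the official scoring scripts, so scores will not match those of standard tools.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per segment."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: each n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


def rank_systems(system_outputs, references_by_set):
    """Rank systems by BLEU over the concatenation of all test sets."""
    set_names = sorted(references_by_set)
    refs = [seg for name in set_names for seg in references_by_set[name]]
    scores = {}
    for system, outputs in system_outputs.items():
        hyps = [seg for name in set_names for seg in outputs[name]]
        assert len(hyps) == len(refs), "hypotheses must be aligned to references"
        scores[system] = corpus_bleu(hyps, refs)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): two test sets and two systems.
references = {
    "TED": ["the cat sat on the mat"],
    "ACL": ["we propose a new model for translation"],
}
systems = {
    "sysA": {
        "TED": ["the cat sat on the mat"],
        "ACL": ["we propose a new model for translation"],
    },
    "sysB": {
        "TED": ["the cat sat on a mat"],
        "ACL": ["we propose new model for translation"],
    },
}

ranking = rank_systems(systems, references)
for system, score in ranking:
    print(f"{system}: {score:.2f}")
```

Scoring the concatenation, rather than averaging per-set scores, means that longer test sets weigh more in the final figure, since n-gram counts are pooled before the precisions are computed.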
ify the overall evaluation of the systems. In the English-to-Chinese task, there are two situations where the metrics differ significantly. The ranking of the USTC end-to-end system relative to the HW-TSC systems changes under COMET, which rewards the HW-TSC submissions. A similar situation is visible for NiuTrans and Xiaomi, where BLEU favors the NiuTrans translations, while COMET assigns higher scores, and a higher ranking, to the Xiaomi submissions.
Data conditions. For the different data conditions, the gains from using additional large language models or additional data are not clear. HW-TSC submitted three primary systems for each data condition and they all perform very similarly. However, for en-zh the unconstrained system by USTC was clearly the best, and for en-de the best system other than HW-TSC was also an unconstrained one. The additional benefit of the pre-trained models is even less clear: there is no clear picture of whether systems with or without this technology perform better.
Domains. One new aspect this year is the evaluation of the systems on three different test sets and domains. First of all, the absolute performance on the different domains is quite different. The systems perform clearly worse on the EPTV test sets. For the relationship between ACL and TED, the picture is not as clear. While the BLEU scores on ACL are higher, the COMET scores are lower. Only for English-to-Japanese are both metrics higher on the ACL test set. One explanation could be that the references for the ACL talks are generated by post-editing an MT output. This could indicate that the post-edited references inflate the BLEU score, while the COMET score seems to be more robust to this phenomenon. When comparing the different systems, the tendency is the same in all cases. However, some perform slightly better in one condition. For example, the end-to-end system from USTC performs very well on TED compared to other systems but less well on ACL.
2.4.2 Human Evaluation
At the time of writing, human evaluation is still in progress. Its results will be reported at the conference and they will appear in the updated version of this paper in Appendix A.
3 Simultaneous SLT
Simultaneous speech translation means the system starts translating before the speaker finishes the sentence. The task is essential to enable people to communicate seamlessly across different backgrounds, in low-latency scenarios such as translation in international conferences or travel.
This year, the task included two tracks: speech-to-text and speech-to-speech, covering three language directions: English to German, Chinese and Japanese.
3.1 Challenge
There are two major updates compared with previous years:
• Removal of the text-to-text track. The task focuses on the real-world live-translation setting, where speech is the input medium.
• Addition of a speech-to-speech track. Translation into synthetic speech has gained increasing attention within the research community, given its potential application to real-time conversations.
To simplify the shared task, a single latency constraint is introduced for each track: 2 seconds of Average Lagging for speech-to-text, and 2.5 seconds of starting offset for speech-to-speech. The participants can submit no more than one system per track / language direction, as long as the latency of the system is under the constraint. The latency of the system is measured on the open MuST-C tst-COMMON test set (Di Gangi et al., 2019a).
The participants submitted their systems as Docker images, which were later run by the organizers on the blind test set in a controlled environment. An example implementation was provided with the SimulEval toolkit (Ma et al., 2020a).
3.2 Data
The training data condition of the simultaneous task follows the “constrained with large language models” setting of the Offline translation task, as described in Section 2.2.
The test data has two parts:
Common: TED talks. It is the same as in the Offline task, as described in Section 2.2. For English to German, Chinese and Japanese.
Non-Native: see Appendix A.1.1. For English to German.
3.3 Evaluation
Two attributes are evaluated in the simultaneous task: quality and latency.
For quality, we conducted both automatic and human evaluation. BLEU score (Papineni et al., 2002a) is used for automatic quality evaluation. For speech output, the BLEU score is computed on the transcripts from the Whisper (Radford et al., 2022) ASR model. The ranking of the submissions is based on the BLEU score on the Common blind test set. Furthermore, we conducted BLASER (Chen et al., 2022) evaluation on the speech output. We also conducted human evaluation on speech-to-text translation quality, including general human evaluation for all three language pairs, and task-specific human evaluation on German and Japanese outputs.
For latency, we only conducted automatic evaluation. We report the following metrics for each speech-to-text system.
• Average Lagging (AL; Ma et al., 2019, 2020b)
• Length Adaptive Average Lagging (LAAL; Polák et al., 2022; Papi et al., 2022)
• Average Token Delay (ATD; Kano et al., 2023)
• Average Proportion (AP; Cho and Esipova, 2016)
• Differentiable Average Lagging (DAL; Cherry and Foster, 2019)
We also measured the computation-aware version of the latency metrics, as described by Ma et al. (2020b). However, due to the new synchronized SimulEval agent pipeline design, the actual computation-aware latency can be smaller with carefully designed parallelism.
For speech-to-speech systems, we report start-offset and end-offset. The latency metrics will not be used for ranking.
3.4 Submissions
The simultaneous shared task received submissions from six teams, all of which participated in at least one language direction in speech-to-text translation. Among the teams, five teams entered the English-to-German track; four teams entered the English-to-Chinese track; three teams entered the English-to-Japanese track. Even though this year is our first time introducing the simultaneous speech-to-speech track, four teams out of six submitted speech-to-speech systems.
• CMU (Yan et al., 2023) participated in both the speech-to-text and speech-to-speech tracks for English-German translation. Their speech-to-text model combined self-supervised speech representations, a Conformer encoder, and an mBART decoder. In addition to the cross-entropy attentional loss, the translation model was also trained with CTC objectives. They used machine translation pseudo labeling for data augmentation. Simultaneous decoding was achieved by chunking the speech signals and employing incremental beam search. For their speech-to-speech system, they incorporated a VITS-based text-to-speech model, which was trained separately.
• HW-TSC (Guo et al., 2023; Shang et al., 2023) participated in both the speech-to-text and speech-to-speech tracks for all three language directions. Their model was a cascaded system that combined a U2 ASR, a Transformer-based machine translation model, and a VITS-based text-to-speech model for speech-to-speech translation. The MT model was multilingual and offered translation in all three directions by conditioning on language embeddings. For data augmentation, they adopted data diversification and forward translation techniques. Their simultaneous decoding policy employed chunk-based incremental decoding with stable hypotheses detection. They also utilized additional TTS models for the speech-to-speech track.
• NAIST (Fukuda et al., 2023) participated in the speech-to-text track for all three language directions and in English-to-Japanese speech-to-speech translation. Their system consisted of a HuBERT encoder and an mBART decoder. They employed three techniques to improve translation quality: inter-connection to combine pre-trained representations, prefix alignment fine-tuning for simultaneous decoding, and local agreement
to find stable prefix hypotheses. They also utilized an additional Tacotron2-based TTS model for speech-to-speech translation with the wait-k decoding policy.
• FBK (Papi et al., 2023) participated in the English-to-German speech-to-text translation track, using an end-to-end Conformer-based speech-to-text model. Considering computational latency, their focus was on efficient usage of offline models. They employed three simultaneous policies, including local agreement, encoder-decoder attention, and EDATT v2, to achieve this.
• CUNI-KIT (Polák et al., 2023) participated in the English-to-German speech-to-text translation track. Their system utilized WavLM and mBART as the base framework. The key highlights of their system were in the decoding strategy and simultaneous policies. They applied empirical hypotheses filtering during decoding and adopted CTC to detect the completion of block inference.
• XIAOMI (Huang et al., 2023) participated in both the speech-to-text and speech-to-speech tracks for English-Chinese translation. Their end-to-end system utilized HuBERT and mBART with a wait-k decoding strategy and an Information-Transport-based architecture. They further enhanced their system by applying data filtering on long sentences and misaligned audio/text, data augmentation with pseudo labeling, and punctuation normalization. They also incorporated contrastive learning objectives.
3.5 Automatic Evaluation
We rank the system performance based on BLEU scores. The detailed results can be found in Appendix B.2.
3.5.1 Speech-to-Text
English-German. On the Common test set, the ranking is HW-TSC, CUNI-KIT, FBK, NAIST, CMU, as shown in Table 17. Meanwhile, on the Non-Native test set, the ranking differs considerably. While HW-TSC performs best on the Common test set, they end up second to last on Non-Native. The situation is reversed for NAIST and CMU, who end up at the tail of the Common scoring but reach the best scores on the Non-Native set. We attribute this to better robustness of NAIST and CMU towards the noise in the Non-Native test set.
English-Chinese. The ranking is HW-TSC, CUNI-KIT, XIAOMI, NAIST, as shown in Table 18.
English-Japanese. The ranking is HW-TSC, CUNI-KIT, NAIST, as shown in Table 19.
3.5.2 Speech-to-Speech
Despite the great novelty and difficulty of the speech-to-speech track, there are 5 submissions in total: 2 in German, 2 in Chinese and 1 in Japanese. The full results can be seen in Table 20. For English-to-German, the ranking is CMU, HW-TSC. For English-to-Chinese, HW-TSC is the only participant. For English-to-Japanese, the ranking is HW-TSC, NAIST.
We also provide the BLASER scores, which directly predict the quality of translations based on speech embeddings. We note that since reference audios are not available in our datasets, we use text LASER (Heffernan et al., 2022) to embed reference text to compute the scores. While the BLASER scores indicate the same quality ranking for English to German as BLEU scores, on the Japanese output they are similar. It's possible that BLASER is adequately developed on Japanese outputs.
3.6 Human Evaluation
In the speech-to-text track of the Simultaneous task, English-German and English-Japanese were manually evaluated, each with a different scoring method.
3.6.1 English-German
For English-to-German, we used the same human evaluation method as last year, originally inspired by Javorský et al. (2022). We evaluated (1) the best system selected by BLEU score, and (2) the transcription of human interpretation, the same as used in last year's evaluation (more details can be found in Anastasopoulos et al. (2022a), Section 2.6.1).
Figure 1 plots automatic and manual evaluation in relation to each other. We confirm the generally good correlation with BLEU (Pearson .952 across the two test set parts), as observed by Macháček et al. (2023), although individual system results are rather interesting this year.
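As a minimal sketch of how a Pearson correlation between automatic and manual scores is computed, the snippet below implements the coefficient directly. The function name `pearson` and the score lists are illustrative assumptions; the numbers are made up and are not the values plotted in Figure 1.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalised; the 1/n factors cancel out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (BLEU, manual score) pairs, one per system and test set part.
bleu = [30.0, 28.5, 27.0, 25.0, 24.0]
manual = [3.9, 3.8, 3.5, 3.2, 3.1]
print(round(pearson(bleu, manual), 3))
```

A value close to 1 would indicate that systems ranked higher by BLEU also tend to receive higher manual scores; it says nothing about individual systems, which is why the per-system results can still diverge.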
Figure 1: Manual and automatic evaluation of Simultaneous speech-to-text English-to-German translation on the
Common (TED talks) and Non-Native test sets. The error bars were obtained by bootstrap resampling, see the
caption of Table 22.
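The exact resampling procedure behind the error bars is given in the caption of Table 22 and is not reproduced here. As a rough illustration only, the sketch below shows a generic percentile bootstrap for the mean of per-item scores; the function name `bootstrap_ci`, the number of resamples and the example scores are all assumptions.

```python
import random


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample n items with replacement and record the resampled mean.
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    # Pick the alpha/2 and 1 - alpha/2 percentiles, clamped to valid indices.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


# Hypothetical per-segment manual scores for one system.
example_scores = [3.0, 4.0, 2.5, 3.5, 4.5, 3.0, 2.0, 4.0]
low, high = bootstrap_ci(example_scores)
print(low, high)
```

The returned pair can be drawn as an error bar around the system's mean score; overlapping intervals suggest that a difference between two systems may not be reliable.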
On the Common test set, HW-TSC performed best in terms of BLEU but the manual scoring seems to prefer CUNI-KIT and FBK. CMU and NAIST are worst in BLEU but on par with HW-TSC in terms of manual scores.
The situation is very different on the Non-Native test set: CMU and NAIST score best both in manual scores and in BLEU, while CUNI-KIT and especially FBK get much worse scores, again both manual and automatic.
The Non-Native test set is substantially harder with respect to sound conditions, and the striking drop observed for both CUNI-KIT and FBK can be an indication of some form of overfitting towards the clean input of Common (TED talks).
Appendix A.1.1 presents details of the human evaluation and results are shown in Table 22.
3.6.2 English-Japanese
For English-to-Japanese, we also followed last year's methodology. We hired a professional interpreter for human evaluation using the JTF Translation Quality Evaluation Guidelines (JTF, 2018) based on Multidimensional Quality Metrics (MQM; Lommel et al., 2014). We applied the error weighting by Freitag et al. (2021a). Appendix A.1.2 presents details of the human evaluation.
The human evaluation results are shown in Table 23. The error score almost correlates with BLEU against the additional reference, but the difference in the error scores was very small between HW-TSC and CUNI-KIT in spite of the 0.8 BLEU difference.
3.7 Final remarks
This year, we simplified the conditions by focusing solely on low-latency systems to reduce the burden of submission and evaluation. We also introduced the novel and challenging speech-to-speech track, and were happy to receive 5 submissions.
We note potential modifications for future editions:
• Providing a further simplified submission format.
• Ranking with better designed metrics to address the overfitting towards BLEU scores.
• Aligning more with offline tasks on more test domains and evaluation metrics.
4 Automatic Subtitling
In recent years, the task of automatically creating subtitles for audiovisual content in another language has gained a lot of attention, as we have
seen a surge in the amount of movies, series and user-generated videos which are being streamed and distributed all over the world.
For the first time, this year IWSLT proposed a specific track on automatic subtitling, where participants were asked to generate subtitles of audio-visual documents, belonging to different domains with increasing levels of complexity.
4.1 Challenge
The task of automatic subtitling is multi-faceted: starting from speech, not only does the translation have to be generated, but it must also be segmented into subtitles compliant with constraints that ensure a high-quality user experience, like a proper reading speed, synchrony with the voices, the maximum number of subtitle lines and characters per line, etc. Most audio-visual companies define their own subtitling guidelines, which can differ slightly from each other. Participants were asked to generate subtitles according to some of the tips listed by TED, in particular:
• the maximum subtitle reading speed is 21 characters / second;
• lines cannot exceed 42 characters, white spaces included;
• never use more than two lines per subtitle.
It was expected that participants used only the audio track from the provided videos (dev and test sets), the video track being of low quality and provided primarily as a means to verify time synchronicity and other aspects of displaying subtitles on screen.
The subtitling track requires automatically subtitling, in German and/or Spanish, audio-visual documents in which the spoken language is always English, and which were collected from the following sources:
• TED talks from the MuST-Cinema[8] corpus;
• press interviews from the Multimedia Centre of the European Parliament (EPTV)[9];
• physical training videos offered by Peloton[10];
• TV series from ITV Studios[11].
[8] https://ict.fbk.eu/must-cinema
[9] https://multimedia.europarl.europa.eu
[10] https://www.onepeloton.com
[11] https://www.itvstudios.com
4.2 Data and Metrics
Data. This track proposed two training conditions to participants: constrained, in which only a pre-defined list of resources is allowed, and unconstrained, without any data restrictions. The constrained setup allowed the use of the same training data as in the Offline Speech Translation task (see Section 2.2 for the detailed list), with the obvious exclusion of the parallel resources not involving the English-{German, Spanish} pairs. In addition, two monolingual German and Spanish text corpora built on OpenSubtitles, enriched with subtitle breaks, document meta-info on genre and automatically predicted line breaks, have been released.
For each language and domain, a development set and a test set were released. Table 4 provides some information about these sets.
domain    set    AV docs   hh:mm   ref subtitles
                                   de      es
TED       dev    17        04:11   4906    4964
TED       test   14        01:22   1375    1422
EPTV      dev    12        01:03   960     909
EPTV      test   10        01:01   891     874
Peloton   dev    9         03:59   4508    4037
Peloton   test   8         02:43   2700    2661
ITV       dev    7         06:01   4489    4763
ITV       test   7         05:08   4807    4897
Table 4: Statistics of the dev and test sets for the subtitling task.
The evaluation was carried out from three perspectives, subtitle quality, translation quality and subtitle compliance, through the following automatic measures:
• Subtitle quality vs. reference subtitles:
– SubER, primary metric, used also for ranking (Wilken et al., 2022)[12];
– Sigma (Karakanta et al., 2022b)[13].
• Translation quality vs. reference translations:
– BLEU[14] and CHRF[15] via sacreBLEU
– BLEURT (Sellam et al., 2020)
[12] https://github.com/apptek/SubER
[13] https://github.com/fyvo/EvalSubtitle
[14] sacreBLEU signature: nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0
[15] sacreBLEU signature: nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0
Automatic subtitles are realigned to the reference subtitles using mwerSegmenter (Matusov et al., 2005a)[16] before running sacreBLEU and BLEURT.
• Subtitle compliance:[17]
– rate of subtitles with reading speed higher than 21 char / sec (CPS);
– rate of lines longer than 42 char (white spaces included) (CPL);
– rate of subtitles with more than two lines (LPB).
[16] https://www-i6.informatik.rwth-aachen.de/web/Software/mwerSegmenter.tar.gz
[17] https://github.com/hlt-mt/FBK-fairseq/blob/master/examples/speech_to_text/scripts/subtitle_compliance.py
4.3 Submissions
Three teams submitted automatically generated subtitles for the test sets of this task.
• APPTEK (Bahar et al., 2023) submitted runs in the constrained setup for both language pairs. The primary submissions came from a cascade architecture composed of the following modules: neural encoder-decoder ASR, followed by a neural Machine Translation model trained on the data allowed in the constrained track, with the source (English) side lowercased and normalized to resemble raw ASR output, as well as adapted to the IWSLT subtitling domains, followed by a subtitle line segmentation model (intelligent line segmentation by APPTEK). A contrastive run was generated for the en→de pair only by a direct speech translation system with CTC-based timestamp prediction, followed by the intelligent line segmentation model of APPTEK. The system was trained on the constrained allowed data plus forward translated synthetic data (translations of allowed ASR transcripts) and synthetic speech data for selected sentences from the allowed parallel data. For the en→de pair, APPTEK also submitted a run in the unconstrained setup, where a cascade architecture was employed consisting of: neural encoder-decoder CTC ASR, followed by a neural punctuation prediction model and inverse text normalization model, followed by an MT model adapted to the IWSLT domains (sentences similar in embedding similarity space to the development sets of the four domains TED, EPTV, ITV, Peloton), followed by a subtitle line segmentation model (intelligent line segmentation by APPTEK).
• FBK (Papi et al., 2023) submitted primary runs for the two language pairs, generated by a direct neural speech translation model, trained in the constrained setup, that works as follows: i) the audio is fed to a Subtitle Generator that produces the (un-timed) subtitle blocks; ii) the computed encoder representations are passed to a Source Timestamp Generator to obtain the caption blocks and their corresponding timestamps; iii) the subtitle timestamps are estimated by the Source-to-Target Timestamp Projector from the generated subtitles, captions, and source timestamps.
• MATESUB (Perone, 2023) submitted primary runs for the two language pairs, automatically generated by the back-end subtitling pipeline of MATESUB, a web-based tool that supports professionals in the creation of high-quality subtitles (https://matesub.com/). The MATESUB subtitling pipeline is based on a cascade architecture, composed of ASR, text segmenter and MT neural models, which allows covering any pair from about 60 languages and their variants, including the two language pairs of the task. Since MATESUB is production software, its neural models are trained on more resources than those allowed for the constrained condition; therefore the submissions fall into the unconstrained setup.
4.4 Results
Scores of all runs as computed by automatic metrics are shown in Tables 24 and 25 in the Appendix. Averaged over the 4 domains, APPTEK achieved the lowest SubER scores with their primary submission for en→de in the constrained and unconstrained condition, with the overall best results for the latter. For en→es, MATESUB obtained the overall lowest SubER with their unconstrained system.
We observe that in terms of domain difficulty, the TV series (from ITV) pose the most challenges for automatic subtitling. This has to do with diverse acoustic conditions in which speech is found in movies and series - background music, noises,
shouts, and cross-talk. All of this makes the task of recognizing speech quite challenging, which results in error accumulation in the downstream components. Unconstrained systems by APPTEK and MATESUB perform significantly better on this domain, which shows the importance of training on additional data that is more representative of real-life content.
The second-hardest domain is the fitness videos from Peloton. Here, despite generally clear single-speaker audio with reduced background noise, the challenge is the MT: some of the fitness- and sports-specific terminology and slang pose significant challenges in translation to their German and Spanish equivalents.
Surprisingly, even the EPTV interviews pose significant challenges for subtitling, despite the fact that the topics discussed in the interviews are found in abundance in the allowed speech-to-text and text-to-text parallel data for the constrained condition (Europarl, Europarl-ST). Here, issues such as spontaneous speech with many pauses, as well as speaker separation, may have been the cause of some of the errors.
The TED talks, which have been the main domain for the IWSLT evaluations in the past years, are the easiest to subtitle automatically. Whereas the current level of subtitle quality for TED talks may require minimal human corrections or can even be shown unedited on the screen, for the other three domains the automatic subtitles will require significant post-editing. This shows the importance of running evaluations not only under very controlled conditions as in the case of TED talks, but on a variety of real-life content where multiple research challenges in speech translation are yet to be overcome.
This year's direct speech translation systems seem to be too weak to compete with the cascaded approaches. In particular, a full end-to-end approach like the one from FBK that directly generates subtitle boundaries is currently inferior in comparison with the systems that adopt a specific solution for segmenting the text (intelligent line segmentation by APPTEK and a neural text segmenter by MATESUB). Such specific solutions lead to almost perfect subtitle compliance. But even in terms of pure speech translation quality as measured e.g. with BLEU and BLEURT, the cascaded systems currently provide better translations even under constrained training data conditions.
Regarding the automatic metrics used in the evaluation, we observed that the metric Sigma provides scores which are not consistent with the other measures: for example, German subtitles from MATESUB seem to be the worst as measured by Sigma, but this is unlikely based on the values of the other metrics. Yet the pure MT quality metrics also exhibit some discrepancies in how the performance of the same system on the four domains is ranked. This ranking sometimes differs depending on whether you choose BLEU, ChrF, or BLEURT as the “primary” metric. The two most striking cases are:
• the en→de APPTEK unconstrained primary submission, for which the BLEU score for the ITV test data was 14.43 and for Peloton 10.47, but the BLEURT scores were very similar: 0.4069 and 0.4028;
• the en→de FBK constrained primary system, for which the BLEU score was 7.73 on the Peloton part of the test data vs. 8.05 on the ITV part, but the BLEURT scores showed a better quality for Peloton translations: 0.3137 vs. 0.2255.
All of these discrepancies highlight the importance of human evaluation, which we have not conducted this time. One of the reasons for this is that in most prior research (Matusov et al., 2019; Karakanta et al., 2022a) the automatic subtitling quality is evaluated in post-editing scenarios, which are too expensive to be run on significant amounts of data as they require professional subtitle translators. On the other hand, as mentioned above, for 3 out of 4 domains the quality of the automatically generated subtitle translations is low, so that an evaluation of user experience when watching subtitles would also be challenging, especially if the users had to assign evaluation scores to individual subtitles or sentences. With all of this in mind, we decided to postpone any human evaluation to the next edition of the subtitling track at IWSLT.
Overall, this first edition of the subtitling track emphasised the crucial role of the following components related to speech processing: noise reduction and/or speech separation, speaker diarization, and sentence segmentation. So far they have been underestimated in speech translation research. Current automatic solutions do not reach the level of quality that is necessary in subtitling. Therefore, we encourage further research
into these areas, for which subtitle translation is a good test case.
5 Multilingual SLT
The NLP and speech communities are rapidly expanding with increasing focus on broader language coverage and multilinguality. However, despite the community's efforts on ASR and SLT, research is rarely focused on applying these efforts to the data within the scientific domain. It is clear from recent initiatives to caption technical presentations at NLP and speech conferences that transcription and translation in the technical domain is needed, desired, and remains a disproportionate challenge for current ASR and SLT models compared to standard datasets in these spaces. Motivated by the ACL 60-60 initiative[18] to translate the ACL Anthology to up to 60 languages for the 60th anniversary of ACL, which will be reported on at this year's ACL conference co-located with IWSLT, this year's Multilingual Task evaluates the ability of current models to translate technical presentations to a set of ten diverse target languages.
[18] https://www.2022.aclweb.org/dispecialinitiative
5.1 Challenge
Translating technical presentations combines several challenging conditions: domain-specific terminology, recording conditions varying from close-range microphones to laptop microphones with light background noise or feedback, diverse speaker demographics, and importantly unsegmented speech typically 10-60 minutes in duration. This task focuses on one-to-many translation from English to ten target languages. Providing English ASR was optional though encouraged. In-domain data is scarce, particularly parallel data, though all language pairs are covered by current publicly available corpora; further challenging for current domain adaptation techniques, monolingual data is typically available for the source language (English) only. We present two conditions: constrained (using only the out-of-domain data allowed and provided for other tasks this year) and unconstrained (allowing any additional data, including crawled data, which may facilitate e.g. domain adaptation). To evaluate submissions, we use evaluation sets curated from presentations at ACL 2022 which were professionally transcribed and translated with the support of ACL and the 60-60 initiative as described in Salesky et al. (2023).
5.2 Data and Metrics
Data. We use the ACL 60-60 evaluation sets created by Salesky et al. (2023) to evaluate this challenge task. The data comes from ACL 2022 technical presentations and is originally spoken in English, and then transcribed and translated to ten target languages from the 60/60 initiative: Arabic, Mandarin Chinese, Dutch, French, German, Japanese, Farsi, Portuguese, Russian, and Turkish. The resulting dataset contains parallel speech, transcripts, and translation for ten language pairs, totaling approximately one hour for the development set and one hour for the evaluation set.
During the evaluation campaign, the only in-domain data provided is the development set. To simulate the realistic use case where recorded technical presentations would be accompanied by a research paper, in addition to the talk audio we provide the corresponding paper title and abstract, which are likely to contain a subset of relevant keywords and terminology and could be used by participants to bias or adapt their systems. Constrained training data follows the Offline task (see Sec. 2.2) with pretrained models and out-of-domain parallel speech and text provided for all 10 language pairs. The unconstrained setting allowed participants to potentially crawl additional in-domain data to assist with adaptation, as was done by one team (JHU). For the official rankings, we use the official evaluation set, which was held blind until after the evaluation campaign.
To mimic realistic test conditions where the audio for technical presentations would be provided as a single file, rather than gold-sentence-segmented, for both the development and evaluation sets we provided the full unsegmented wav files, as well as an automatically generated baseline segmentation using SHAS (Tsiamas et al., 2022) to get participants started. Two teams used the baseline segmentation, while one (JHU) used longer segments which improved the ASR quality of their particular pretrained model. To evaluate translation quality of system output using any input segmentation, we provided gold sentence-segmented transcripts and translations, against which system output could be scored as described below in ‘Metrics.’
Metrics. Translation output was evaluated using multiple metrics for analysis: translation output using chrF (Popović, 2015a), BLEU (Papineni et al., 2002b) as computed by SACREBLEU (Post, 2018), and COMET (Rei et al., 2020b), and ASR output using WER. For BLEU we use the recommended language-specific tokenization in SACREBLEU for Chinese, Japanese, Korean, and the metric-default otherwise. Translation metrics were calculated with case and punctuation. WER was computed on lowercased text with punctuation removed. NFKC normalization was applied on submitted systems and references. All official scores were calculated using automatic resegmentation of the hypothesis based on the reference transcripts (ASR) or translations (SLT) by mwerSegmenter (Matusov et al., 2005b), using character-level segmentation for resegmentation for those languages which do not mark whitespace. The official task ranking is based on average chrF across all 10 translation language pairs.
5.3 Submissions
We received 11 submissions from 3 teams, as described below:
• BIT (Wang et al., 2023b) submitted a single constrained one-to-many multilingual model to cover all 10 language pairs, trained using a collection of multiple versions of the MuST-C dataset (Di Gangi et al., 2019b). They use English ASR pre-training with data augmentation from SpecAugment (Park et al., 2019), and multilingual translation finetuning for all language pairs together. The final model is an ensemble of multiple checkpoints. No adaptation to the technical domain is performed.
• JHU (Xinyuan et al., 2023) submitted two cascaded systems, one constrained and one unconstrained, combining multiple different pretrained speech and translation models, and comparing different domain adaptation techniques. Their unconstrained system uses an adapted Whisper (Radford et al., 2022) ASR model combined with NLLB (NLLB Team et al., 2022), M2M-100 (Fan et al., 2020), or mBART-50 (Tang et al., 2020) MT models depending on the language pair, while the constrained system uses wav2vec2.0 (Baevski et al., 2020a) and mBART-50 or M2M-100. They compare using talk abstracts to prompt Whisper to training in-domain language models on either the small amount of highly-relevant data in the talk abstract or larger LMs trained on significantly more data they scraped from the ACL Anthology and release with their paper. They see slight improvements over the provided SHAS (Tsiamas et al., 2022) segments using longer segments closer to what Whisper observed in training. They show that prompting Whisper is not competitive with in-domain language models, and provide an analysis of technical term recall and other fine-grained details.
• KIT (Liu et al., 2023) submitted multiple constrained multilingual models, both end-to-end and cascaded, which combine several techniques to adapt to the technical domain given the absence of in-domain training data, using pretrained speech and translation models as initializations (WavLM: Chen et al. 2021, DeltaLM: Ma et al. 2021, mBART-50: Tang et al. 2020). These include kNN-MT to bias generated output to the technical domain; data diversification to enrich provided parallel data; adapters for lightweight finetuning to the language pairs for translation (though they note that this does not necessarily stack with data diversification); and for their cascaded model, adaptation of the ASR model to the target technical domain using n-gram re-weighting, noting that it is typically easier to adapt or add lexical constraints to models with separate LMs, as opposed to encoder-decoder models. Additional techniques (ensembling, updated ASR encoder/decoder settings, knowledge distillation, synthesized speech) are also used for further small improvements.
5.4 Results
All task results are shown in Appendix B.4. The official task ranking was determined by the average chrF across all 10 target languages after resegmentation to the reference translations (Table 26). Scores for all submissions by individual language pairs are shown in Table 28 (chrF), Table 29 (COMET), and Table 30 (BLEU).
Overall, the majority of approaches combined strong pretrained speech and translation models to do very well on the ACL 60-60 evalua-
16
tion data. For this task, cascaded models per- System Metric
JHU-unconstrained KIT-primary chrF
formed consistently better than direct/end-to-end JHU-constrained BIT terminology
approaches; all of the top 6 submissions were cas-
cades, and 4/5 of the lowest-performing systems 70
were direct. Optional English ASR transcripts 60
were submitted for 3 systems (JHUunconstrained ,
50
KITprimary , JHUconstrained ), all of which were
cascades; we see that WER aligns with speech 40
translation performance in these cases. The only 30
unconstrained model, from JHU, utilized larger
20
pretrained models and crawled in-domain lan-
guage modeling data for ASR to great success, and 10
was the top system on all metrics (Table 26). The 0
remaining submissions were all constrained (here ar de fa fr ja nl pt ru tr zh
meaning, used the white-listed training data and Language
smaller pretrained models). The KITprimary sys-
Figure 2: Official task metric performance (chrF) vs
tem was the best performing constrained model. terminology recall for teams’ primary submissions.
While BIT trained models from scratch on TED
to reasonable performance on MuST-C, large pre-
trained models and domain adaptation were key
ing, backtranslation, ...). The data diversifica-
for high performance on the technical in-domain
tion applied by KIT via TTS ‘backtranslation’
test set. chrF and BLEU result in the same sys-
(contrastive5, contrastive7) did not affect chrF or
tem rankings, while COMET favors the end-to-
BLEU, but did provide small (0.5-0.6) improve-
end models slightly more, though not affecting
ments on COMET.
the top 3 systems (JHUunconstrained , KITprimary ,
KITconstrastive1 ). In addition to the overall evaluation set, we look
at the recall of specific terminology annotated for
Domain adaptation techniques had consistent
the ACL evaluation sets. For the three submissions
positive impact on system performance. The KIT
(JHUunconstrained , KITprimary , JHUconstrained )
team submitted constrained systems only and thus
which provided supplementary ASR, we first in-
were limited to the dev bitext and talk abstracts
vestigate terminology recall and propagation be-
for domain adaptation. Despite its small size
tween ASR and downstream ST. Recall that the
(<500 sentences) they were able to generate con-
overall WER of these systems was 16.9, 23.7, and
sistent improvements of up to ∼1chrF and ∼ 1
34.1, respectively. Of the 1107 labeled terminol-
BLEU using kNN-MT (primary/contrastive1 vs
ogy words and phrases from the ACL 60-60 eval-
contrastive2); with this method, extending the dev
uation set annotations, 87.8% / 77.3% / 71.7% in-
data to include the abstracts for the evaluation set
dividual instances were correctly transcribed by
talks (primary vs contrastive1) had neglible ef-
these systems, respectively. Of these, 12.0% /
fect on all 3 metrics. The JHU submissions saw
7.4% / 7.9% were then maintained and correctly
that decoding with interpolated in-domain lan-
translated to each target language respectively on
guage models outperformed knowledge distilla-
average. We plot the official task metric (chrF)
tion or prompting pretrained models with informa-
against terminology recall in Figure 2 for all pri-
tion for each talk in this case; small talk-specific
mary submissions. We see that there were consis-
LMs did provide slight improvements in WER, but
tent differences across languages in how terminol-
significant improvements of 2-3 WER were gained
ogy was maintained, which generally but not fully
by extending the limited highly relevant data from
corresponds to overall performance (ex: Dutch,
talk abstracts and the dev set to the larger domain-
Turkish). While the domain adaptation techniques
general data crawled from the 2021 ACL confer-
used ensured strong transcription performance for
ence and workshop proceedings.
the JHU and KIT submissions, this was not gen-
Without in-domain target-language monolin- erally maintained for translation with a significant
gual data, conventional techniques for adaptation drop, converging with BIT which did not perform
of end-to-end ST models did not apply (finetun- domain adaptation. Additional work is needed to
17
ensure targeted lexical terms are correctly transcribed and translated, both in general as well
as comparably across different languages.

While the JHU submissions finetuned to each target language individually, the KIT systems
finetuned multilingually; no contrastive systems were submitted with which to ablate this point,
but both teams’ papers describe consistently worse performance when finetuning multilingually
rather than bilingually. KIT was able to largely mitigate this with language adapters when
evaluated in isolation during development, but in their final submission language adapters were
consistently slightly worse on the evaluation set (contrastive4 ‘with’ vs contrastive3
‘without’). It remains to be seen to what degree one-to-many models can benefit from
multilingual training.

The Offline task additionally used the ACL 60-60 evaluation sets as part of their broader
evaluation for 3 language pairs (en→ de, ja, zh), enabling a wider comparison across 25 total
systems. We show the Multilingual task submissions compared to the Offline on these languages in
Table 27. On these three language pairs, performance is generally higher than the remaining
language pairs in the Multilingual task. We again consistently see stronger performance on this
task from cascaded models, and unconstrained submissions or those with larger pretrained LLMs,
though there are notable outliers such as the HW-TSC constrained model. The Offline submissions
did not perform domain adaptation specifically to the technical ACL domain, but appear to
benefit from better domain-general performance in some cases, particularly for submissions
targeting only Chinese. We note slight differences in system rankings between metrics (COMET and
BLEU) and target languages, particularly for Japanese and Chinese targets, possibly highlighting
the difference in metric tokenization for these pairs.

6 Speech-to-Speech Translation

Speech-to-speech translation (S2ST) involves translating audio in one language to audio in
another language. In the offline setting, the translation system can assume that the entire
input audio is available before beginning the translation process. This differs from streaming
or simultaneous settings where the system only has access to partial input. The primary
objective of this task is to encourage the advancement of automated methods for offline
speech-to-speech translation.

6.1 Challenge

The participants were tasked with creating speech-to-speech translation systems that could
translate from English to Chinese using various methods, such as a cascade system (ASR + MT +
TTS or end-to-end speech-to-text translation + TTS), or an end-to-end / direct system. They were
also allowed to use any techniques to enhance the performance of the system, apart from using
unconstrained data.

6.2 Data and Metrics

Data. This task allowed the same training data from the Offline task on English-Chinese
speech-to-text translation. More details are available in Sec. 2.2. In addition to the Offline
task data, the following training data was allowed to help build English-Chinese
speech-to-speech models and Chinese text-to-speech systems:

• GigaS2S, target synthetic speech for the Chinese target text of GigaST (Ye et al., 2023) that
  was generated with an in-house single-speaker TTS system;

• AISHELL-3 (Shi et al., 2020), a multi-speaker Chinese TTS dataset.

Note that several datasets allowed for the Offline task, such as Common Voice (Ardila et al.,
2019), actually contain multi-speaker Chinese speech and text data that could help for this
task.

Metrics. All systems were evaluated with both automatic and human evaluation metrics.

Automatic metrics. To automatically evaluate translation quality, the speech output was
automatically transcribed with a Chinese ASR system[19] (Yao et al., 2021), and then BLEU[20]
(Papineni et al., 2002a), chrF[21] (Popović, 2015b), COMET[22] (Rei et al., 2022) and
SEScore2[23] (Xu et al., 2022) were computed between the generated transcript and the
human-produced text reference. BLEU and chrF were computed using SacreBLEU

[19] https://github.com/wenet-e2e/wenet/blob/main/docs/pretrained_models.en.md
[20] sacreBLEU signature: nrefs:1|case:mixed|eff:no|tok:zh|smooth:exp|version:2.3.1
[21] sacreBLEU signature: nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.3.1
[22] https://huggingface.co/Unbabel/wmt22-comet-da
[23] https://github.com/xu1998hz/SEScore2
(Post, 2018). Furthermore, the output speech could be evaluated directly using BLASER (Chen
et al., 2022). More information could be found at stopes[24] (Andrews et al., 2022).

[24] https://github.com/facebookresearch/stopes/tree/main/demo/iwslt_blaser_eval

Human evaluation. Output speech translations were evaluated with respect to translation quality
and speech quality.

• Translation quality: Bilingual annotators were presented with the source audio, source
  transcript and the generated target audio, then gave scores on the translation quality between
  1 and 5 (worst-to-best). There were 4 annotators per sample and we retained the median score.

• Output speech quality: In addition to translation quality (capturing meaning), the quality of
  the speech output was also human-evaluated. The annotators were requested to give an overall
  score by considering three dimensions: naturalness (voice and pronunciation), clarity of
  speech (understandability), and sound quality (noise and other artifacts). Each sample was
  assessed by 4 annotators and scored on a scale of 1-5 (worst-to-best), with a minimum score
  interval of 0.5.

The detailed guidelines for output speech quality evaluation were similar to last year
(Anastasopoulos et al., 2022a).

6.3 Submissions

We received eight submissions from five teams. The MineTrans team submitted four systems and
each of the other teams submitted one system.

• HW-TSC (Wang et al., 2023a) submitted a cascaded system composed of an ensemble of Conformer
  and Transformer-based ASR models, a multilingual Transformer-based MT model and a
  diffusion-based TTS model. Their primary focus in their submission is to investigate the
  modeling ability of the diffusion model for TTS tasks in high-resource scenarios. The
  diffusion TTS model takes raw text as input and generates waveform by iteratively denoising on
  pure Gaussian noise. Based on the result, they conclude that the diffusion model outperforms
  normal TTS models and brings positive gain to the entire S2ST system.

• KU (Yang et al., 2023) submitted a cascade system composed of a speech-to-text translation
  (ST) model and a TTS model. Their ST model comprises an ST decoder and an ASR decoder. The two
  decoders can exchange information with each other through the interactive attention mechanism.
  For the TTS part, they use FastSpeech2 as the acoustic model and HiFi-GAN as the vocoder.

• NPU-MSXF (Song et al., 2023) submitted a cascaded system of separate ASR, MT, and TTS models.
  For ASR, they adopt ROVER-based model fusion and data augmentation strategies to improve the
  recognition accuracy and generalization ability. Then they use a three-stage fine-tuning
  process to adapt a pre-trained mBART50 model to translate the output of the ASR model. The
  three-stage fine-tuning is based on Curriculum Learning and involves three sets of data:
  (1) the original MT data, (2) the MT data in ASR transcription format and (3) the ASR outputs.
  For TTS, they leverage a two-stage framework, using network bottleneck features as a robust
  intermediate representation for speaker timbre and linguistic content disentanglement. Based
  on the two-stage framework, a pre-trained speaker embedding is leveraged as a condition to
  transfer the speaker timbre in the source speech to the translated speech.

• Xiaomi (Huang et al., 2023) submitted a cascade system composed of a speech-to-text
  translation (ST) model and a TTS model. The ST model is the same as the one they submitted to
  the Offline SLT track. It is based on an encoder-decoder architecture from the pre-trained
  HuBERT and mBART models. For the TTS model, they use the Tacotron2 framework. It is first
  trained with the AISHELL-3 dataset and then finetuned with the GigaS2S dataset. Furthermore,
  they implement several popular techniques, such as data filtering, data augmentation, speech
  segmentation, and model ensemble, to improve the overall performance of the system.

• MineTrans (Du et al., 2023) submitted three end-to-end S2ST systems (MineTrans E2E, including
  primary, contrastive1, and contrastive2), and a cascade S2ST system (MineTrans Cascade). Their
  end-to-end systems adopt the speech-to-unit translation (S2UT) framework. The end-to-end S2UT
  model comprises a speech encoder, a length adapter and a unit decoder. The S2UT model is
  trained to convert the source speech into units of target speech. A unit-based HiFi-GAN
  vocoder is finally applied to convert the units into waveform. Based on their results, they
  conclude that the widely used multi-task learning technique is not important for model
  convergence once large-scale labeled training data is available, which means that the mapping
  from source speech to target speech units can be learned directly and easily. Furthermore,
  they apply other techniques, such as consistency training, data augmentation, speech
  segmentation, and model ensemble to improve the overall performance of the system. Their
  cascade system consists of ASR, MT and TTS models. Their ASR and MT replicate those used for
  the Offline SLT submission. Their TTS model is a combination of FastSpeech2 and HiFi-GAN.

6.4 Results

Results as scored by automatic metrics are shown in Table 31 and human evaluation results are
shown in Table 32 in the Appendix.

Overall results. According to the automatic metrics used in the evaluation, Xiaomi obtained the
highest score in ASR-BLEU, ASR-chrF, ASR-COMET and ASR-SEScore2. NPU-MSXF obtained the second
highest score, followed subsequently by HW-TSC, MineTrans E2E, KU and MineTrans Cascade. The
BLEU, chrF, COMET and SEScore2 rankings were exactly the same. The scores for the test-expanded
data were lower than those for the test-primary data, likely due to a domain mismatch with the
training data. For human evaluation along the translation quality perspective, Xiaomi obtained
the highest score, followed by NPU-MSXF, then HW-TSC and MineTrans E2E, then MineTrans Cascade,
and finally KU. This ranking was mostly consistent with the automatic ranking, showing that
automatic metrics were useful in evaluating the translation quality of systems. For human
evaluation along the speech quality perspective, NPU-MSXF obtained the highest score, followed
by HW-TSC, Xiaomi, MineTrans E2E, MineTrans Cascade and KU. With an equal weighting of
translation quality and speech quality, NPU-MSXF obtained the highest overall score in human
evaluation, followed by Xiaomi and the others.

S2ST approaches. This year, all systems but MineTrans E2E were cascaded systems, with three
systems adopting an ASR + MT + TTS approach and two systems adopting an end-to-end S2T + TTS
approach. This showed that the cascade approach was still dominant in the community. Although
MineTrans E2E performed better than MineTrans Cascade in all evaluation metrics, we could not
draw conclusions on the comparison between cascade and end-to-end given the limited data points.
Future challenges can encourage more direct or end-to-end submissions.

6.5 Conclusion

This is the second time that speech-to-speech translation (S2ST) is presented in one of the
IWSLT tasks. S2ST is an important benchmark for general AI, as other NLP tasks (e.g., dialogue
systems, question answering and summarization) can also be implemented in a speech-to-speech
manner. Compared to the setting last year, the size of the training data set available to the
participants is much larger. The BLEU scores obtained in this challenge are high in general,
compared to MT and ST of the same language direction. Although not required by the task,
NPU-MSXF is the only team that implemented speaker timbre transfer in their system. We plan to
include evaluation metrics addressing this aspect in the next edition.

7 Dialect SLT

The Dialect Speech Translation shared task is a continuation of last year’s task. We use the
same training data as 2022 and evaluate systems on the 2022 evaluation set to measure progress;
in addition, we added a new 2023 evaluation set as a blind test. From the organizational
perspective, we merged the call for this shared task with the Low-Resource tasks (Section 8) in
order to encourage cross-submission of systems.
7.1 Challenge

Diglossic communities are common around the world. For example, Modern Standard Arabic (MSA) is
used for formal spoken and written communication in most parts of the Arabic-speaking world, but
local dialects such as Egyptian, Moroccan, and Tunisian are used in informal situations.
Diglossia poses unique challenges to speech translation because local “low” dialects tend to be
low-resource with little ASR and MT training data, and may not even have standardized writing,
while resources from “high” dialects like MSA provide opportunities for transfer learning and
multilingual modeling.

7.2 Data and Metrics

Participants were provided with the following datasets:

• (a) 160 hours of Tunisian conversational speech (8kHz), with manual transcripts

• (b) 200k lines of manual translations of the above Tunisian transcripts into English, yielding
  three-way parallel data (i.e. aligned audio, transcript, translation) that supports end-to-end
  speech translation models

• (c) 1200 hours of Modern Standard Arabic (MSA) broadcast news with transcripts for ASR,
  available from MGB-2

• (d) Approximately 42,000k lines of bitext in MSA-English for MT from OPUS (specifically:
  Opensubtitles, UN, QED, TED, GlobalVoices, News-Commentary).

In 2022, we constructed three conditions: the basic condition trains on (a) and (b), provided by
the Linguistic Data Consortium (LDC); the dialect adaptation condition trains on (a), (b), (c),
(d); the unconstrained condition can use any additional data and pre-trained models. In 2023,
due to the coordinated organization with other Low-Resource Tasks this year, we renamed the
basic condition as the “constrained condition”, and the other two conditions are merged as the
“unconstrained condition”.

All train and test sets are time-segmented at the utterance level. Statistics are shown in
Table 5. There are three test sets for evaluation with BLEU[25].

• test1: Participants are encouraged to use this for internal evaluation since references are
  provided. This is part of LDC2022E01 released to participants for training and development,
  obtained by applying the standard data split and preprocessing[26].

• test2: official evaluation for 2022, from LDC2022E02

• test3: official evaluation for 2023, from LDC2023E09

[25] SacreBLEU signature for dialect speech translation task:
     nrefs:1|case:lc|eff:no|tok:13a|smooth:exp|version:2.0.0
[26] https://github.com/kevinduh/iwslt22-dialect

7.3 Submissions

We received submissions from four teams:

• GMU (Mbuya and Anastasopoulos, 2023) participated in five language-pairs in the Low-Resource
  tasks as well as this task. They focused on investigating how different self-supervised speech
  models (Wav2vec 2.0, XLSR-53, and HuBERT) compare when used to initialize an end-to-end (E2E)
  speech translation architecture.

• JHU (Hussein et al., 2023) submitted both cascaded and E2E systems, using transformer and
  branchformer architectures. They investigated the incorporation of pretrained text MT models,
  specifically mBART50 and distilled NLLB-200. Further, they explored different ways for system
  combination and handling of orthographic variation and channel mismatch.

• ON-TRAC (Laurent et al., 2023) participated in two language-pairs in the Low-Resource task as
  well as this task. For this task, they focused on using SAMU-XLS-R as the multilingual,
  multimodal pretrained speech encoder and mBART as the text decoder.

• USTC (Deng et al., 2023) proposed a method for synthesis of pseudo Tunisian-MSA-English paired
  data. For the cascaded system, they explored ASR with different feature extraction (VGG,
  GateCNN) and neural architectures (Conformer, Transformer). For E2E, they proposed using SATE
  and a hybrid SATE architecture to take advantage
Dataset             Speech     Text (#lines)                   Use
                    (#hours)   Tunisian   MSA     English
LDC2022E01 train    160        200k       -       200k         Constrained condition
LDC2022E01 dev      3          3833       -       3833         Constrained condition
LDC2022E01 test1    3          4204       -       4204         Participant’s internal evaluation
LDC2022E02 test2    3          4288       -       4288         Evaluate progress from 2022
LDC2023E09 test3    3          4248       -       4248         Official evaluation for 2023
MGB2                1100       -          1.1M    -            Unconstrained condition
OPUS                -          -          42M     42M          Unconstrained condition
Any other data      -          -          -       -            Unconstrained condition
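
As a reading aid, the "Use" column of the table above and the renaming of the 2022 conditions described in Section 7.2 can be sketched as a small allow-list. This is an illustrative sketch only: the names `CONDITION_2022_TO_2023`, `DATASET_USE` and `may_train_on` are hypothetical and not part of any task tooling, and treating the three test sets as evaluation-only is an assumption made here for illustration rather than a rule stated by the organizers.

```python
# Illustrative sketch only: all names below are hypothetical, not official task tooling.

# 2022 condition names mapped to their 2023 names (as described in Section 7.2).
CONDITION_2022_TO_2023 = {
    "basic": "constrained",
    "dialect adaptation": "unconstrained",
    "unconstrained": "unconstrained",
}

# "Use" column of the dataset table.
# Assumption: the three test sets are treated as evaluation-only here.
DATASET_USE = {
    "LDC2022E01 train": "constrained",
    "LDC2022E01 dev": "constrained",
    "LDC2022E01 test1": "evaluation",
    "LDC2022E02 test2": "evaluation",
    "LDC2023E09 test3": "evaluation",
    "MGB2": "unconstrained",
    "OPUS": "unconstrained",
}


def may_train_on(dataset: str, condition: str) -> bool:
    """Return True if `dataset` may be used for training under a 2023 `condition`."""
    if condition not in ("constrained", "unconstrained"):
        raise ValueError(f"unknown condition: {condition}")
    # Datasets not listed in the table fall under "any other data" (unconstrained).
    use = DATASET_USE.get(dataset, "unconstrained")
    if use == "evaluation":
        return False
    if condition == "unconstrained":
        return True  # the unconstrained condition may use any additional data
    return use == "constrained"
```

For example, `may_train_on("OPUS", "constrained")` is false while `may_train_on("OPUS", "unconstrained")` is true, mirroring the table.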
Language Pairs                Train Set  Dev Set  Test Set  Additional Data
Irish–English      ga–eng     9.46       1.03     0.44      n/a
Marathi–Hindi      mr–hi      15.3       3.7      4.4       monolingual audio with transcriptions (ASR), monolingual text
Maltese–English    mlt–eng    2.5        -        1.35      monolingual audio with transcriptions (ASR), monolingual text
Pashto–French      pus–fra    61         2.5      2         n/a
Tamasheq–French    tmh–fra    17         -        -         untranscribed audio, data in other regional languages
Quechua–Spanish    que–spa    1.60       1.03     1.03      60 hours of monolingual audio with transcriptions (ASR) and MT data (not transcribed)

Table 6: Training, development and test data details (in hours) for the language pairs of the
low-resource shared task.
nantly spoken in the state of Maharashtra in India. It is one of the 22 scheduled languages of
India and the official language of Maharashtra and Goa. As per the 2011 Census of India, it has
around 83 million speakers which covers 6.86% of the country’s total population.[29] Marathi is
the third most spoken language in India.

The provided Marathi–Hindi corpus consists of 22.33 hours of Marathi speech data (see Table 6)
from the news domain, extracted from News On Air[30] and translated into Hindi texts.[31] The
dataset was manually segmented and translated by Panlingua.[32] Additionally, the participants
were directed that they may use monolingual Marathi audio data (with transcription) from Common
Voice (Ardila et al., 2020a),[33] as well as the corpus provided by He et al. (2020)[34] and the
Indian Language Corpora (Abraham et al., 2020).[35]

[29] https://censusindia.gov.in/nada/index.php/catalog/42561
[30] https://newsonair.gov.in
[31] https://github.com/panlingua/iwslt2023_mr-hi
[32] http://panlingua.co.in/
[33] https://commonvoice.mozilla.org/en/datasets
[34] https://www.openslr.org/64/
[35] https://www.cse.iitb.ac.in/~pjyothi/indiccorpora/

Maltese–English Maltese is a Semitic language with about half a million native speakers, and is
an official language of Malta and the EU. It is written in Latin script.

The provided data was divided into three parts. First, around 2.5 hours of audio with Maltese
transcription and an English translation were released, along with about 7.5 hours of audio with
only Maltese transcriptions. Last, the participants were directed to several monolingual Maltese
textual resources. The provided datasets were taken from the MASRI corpus (Hernandez Mena et al.,
2020).

Pashto–French Pashto is spoken by approximately forty to sixty million people in the world. It
is particularly spoken by the Pashtun people in the south, east and southwest of Afghanistan
(where it is one of the two official languages), as well as in the north and northwest of
Pakistan, but also in Iran, Tajikistan and India (Uttar Pradesh and Kashmir).

The corpus was provided entirely by ELDA, and is available on the ELRA catalog: TRAD Pashto
Broadcast News Speech Corpus (ELRA catalogue, 2016b), which consists of audio files, and TRAD
Pashto-French Parallel corpus of transcribed Broadcast News Speech - Training data (ELRA
catalogue, 2016a), which contains their transcriptions.

This dataset is a collection of about 108 hours of Broadcast News with transcriptions in Pashto
and translations into French text. The dataset is built from collected recordings from 5
sources: Ashna TV, Azadi Radio, Deewa Radio, Mashaal Radio and Shamshad TV. Original training
data contains 99 hours of speech in Pashto, which corresponds to 29,447 utterances translated
into French. Training data corresponds to 61 hours of speech (Table 6).

Tamasheq–French Tamasheq is a variety of Tuareg, a Berber macro-language spoken by nomadic
tribes across North Africa in Algeria, Mali, Niger and Burkina Faso. It accounts for
approximately 500,000 native speakers, being mostly spoken in Mali and Niger. This task is about
translating spoken Tamasheq into written French. Almost 20 hours of spoken Tamasheq with French
translation are freely provided by the organizers. A major challenge is that no Tamasheq
transcription is provided, as Tamasheq is a traditionally oral language.

The provided corpus is a collection of radio recordings from Studio Kalangou[36] translated to
French. It comprises 17 hours of clean speech in Tamasheq, translated into the French language.
The organizers also provided a 19-hour version of this corpus, including 2 additional hours of
data that was labeled by annotators as potentially noisy. Both versions of this dataset share
the same validation and test sets. Boito et al. (2022a) provides a thorough description of this
dataset.

In addition to the 17 hours of Tamasheq audio data aligned to French translations, and in light
of recent work in self-supervised models for speech processing, we also provide participants
with unlabeled raw audio data in the Tamasheq language, as well as in 4 other languages spoken
in Niger: French (116 hours), Fulfulde (114 hours), Hausa (105 hours), Tamasheq (234 hours) and
Zarma (100 hours). All this data comes from the radio broadcasts of Studio Kalangou and Studio
Tamani.[37]

[36] https://www.studiokalangou.org/
[37] https://www.studiotamani.org/

Note that this language pair is a continuation of last year’s shared task. An additional
separate test set was provided this year.

Quechua–Spanish Quechua is an indigenous language spoken by more than 8 million people in South
America. It is mainly spoken in Peru, Ecuador, and Bolivia where the official high-resource
language is Spanish. It is a highly inflected, agglutinative language that relies on suffixes,
similar in this respect to languages like Finnish. The average number of morphemes per word
(synthesis) is about two times larger than in English. English typically has around 1.5
morphemes per word and Quechua has about 3 morphemes per word.

There are two main regional divisions of Quechua known as Quechua I and Quechua II. This data
set consists of two main types of Quechua spoken in Ayacucho, Peru (Quechua Chanka ISO: quy) and
Cusco, Peru (Quechua Collao ISO: quz) which are both part of Quechua II and, thus, considered
“southern” languages. We label the data set with que, the ISO norm for Quechua II mixtures.

The constrained setting allowed a Quechua-Spanish speech translation dataset along with the
additional parallel (text-only) data for machine translation compiled from previous work (Ortega
et al., 2020). The audio files for training, validation, and test purposes consisted of excerpts
of the Siminchik corpus (Cardenas et al., 2018) that were translated by native Quechua speakers.
For the unconstrained setting, participants were directed to another larger data set from the
Siminchik corpus which consisted of 60 hours of fully transcribed Quechua audio (monolingual).

8.2.1 Metrics

We use standard lowercase BLEU as well as chrF++ to automatically score all submissions.
Additional analyses for some language pairs are provided below.

Due to the exceptionally hard setting, which currently leads to generally less competent
translation systems, we did not perform a human evaluation of the outputs.

8.3 Submissions

Below we discuss all submissions for all language pairs, given that there were several overlaps.
A brief summary per language is below:

• Irish–English received four submissions from one team (GMU);

• Marathi–Hindi received submissions from four teams (Alexa AI, BUT, GMU, and SRI-B);

• Maltese–English received five submissions from one team (UM-DFKI);

• Pashto–French received submissions from two teams (GMU, ON-TRAC);

• Tamasheq–French received submissions from four teams (Alexa AI, GMU, NAVER, and ON-TRAC);

• Quechua–Spanish received three submissions (GMU, NAVER, and QUESPA).
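
For readers unfamiliar with the character n-gram F-score family behind the chrF++ metric named in Section 8.2.1, the following is a minimal, simplified sketch. It is not the SacreBLEU implementation used for official scoring: the function names `char_ngrams` and `simple_chrf` are hypothetical, whitespace is dropped and text is lowercased as simplifying assumptions, and the word n-grams that chrF++ adds on top of character n-grams are omitted.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n (assumption: whitespace is ignored)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF-style score in [0, 100] for one hypothesis/reference pair."""
    hyp, ref = hyp.lower(), ref.lower()  # lowercase scoring, as in Section 8.2.1
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue  # skip orders longer than either string
        overlap = sum((h & r).values())  # clipped n-gram matches
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Identical strings score 100 and strings with no shared characters score 0; anything in between reflects partial character n-gram overlap, which is why this family of metrics is often preferred for morphologically rich languages.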
Below we discuss each team’s submission in detail:
• ALEXA AI (Vishnu et al., 2023) submitted one primary and three contrastive systems, all of which are in the unconstrained condition (Table 44), for Tamasheq–French, and one primary and five contrastive systems in the unconstrained condition for Marathi–Hindi. For Marathi–Hindi, their systems relied on an end-to-end speech translation approach, using the wav2vec 2.0 base model finetuned on 960 hours of English speech (Baevski et al., 2020b) as the encoder baseline, which was also finetuned on 94 hours of Marathi audio data. The team focused on evaluating three strategies: data augmentation, an ensemble model, and post-processing techniques. For Tamasheq–French, they reuse the same end-to-end AST model proposed by the ON-TRAC Consortium in last year’s IWSLT edition (Boito et al., 2022b). This model consists of a speech encoder that is initialized by the wav2vec 2.0 (Baevski et al., 2020a) base model pre-trained on 243 hours of Tamasheq audio data released by the ON-TRAC Consortium 38. The decoder of this model is a shallow stack of 2 transformer layers with 4 attention heads. A feed-forward layer is placed between the encoder and the decoder to match the dimension of the encoder output to that of the decoder input. In this work, they focus on leveraging different data augmentation techniques, including audio stretching, back translation, paraphrasing, and weighted loss. Another important part of their work is experimenting with different post-processing approaches with LLMs, such as re-ranking, sentence correction, and token masking. In addition, they ensemble AST models trained with different seeds and data augmentation methods, which is shown to improve the performance of their systems. Their primary system scores 9.30 BLEU on the 2023 test set.
38 https://huggingface.co/LIA-AvignonUniversity/IWSLT2022-tamasheq-only
• BUT (Kesiraju et al., 2023) submitted one primary and one contrastive system using the ESPnet (Inaguma et al., 2021) toolkit. The primary system was built with the end-to-end and bilingual ASR model, while the contrastive system was built as a cascade that uses various backbone models including ASR, the bilingual ASR, transformer-based seq2seq MT, an LM for re-scoring, and XLM.
• GMU (Mbuya and Anastasopoulos, 2023) focused on end-to-end speech translation systems. An end-to-end (E2E) transformer-based encoder-decoder architecture (Vaswani et al., 2017) was used for the primary constrained submission. For unconstrained submissions, they explored self-supervised pre-trained speech models and used wav2vec 2.0 (Baevski et al., 2020a) and HuBERT (Hsu et al., 2021) for the low-resource task. They used wav2vec 2.0, with the last three layers removed, for their primary submission. HuBERT was used for the contrastive1 submission, without removing any layer. For contrastive2, the end-to-end with ASR (E2E-ASR) architecture is the same as the E2E architecture; the difference is that a pre-trained ASR model was used to initialize its encoder.
• ON-TRAC (Laurent et al., 2023) participated in the Pashto–French track (one primary and three contrastive systems, for both the constrained and unconstrained settings) and the Tamasheq–French track (one primary and five contrastive systems, all of which are unconstrained; cf. Table 44). For Pashto–French, the primary cascaded system is based on an upgraded convolutional model (Gehring et al., 2017), while contrastive3 is based on small basic transformers. For the primary and contrastive1 systems, SAMU-XLS-R (Khurana et al., 2022) was used, with encoders pre-trained on 100 and 53 languages. The two constrained contrastive E2E systems share the same encoder-decoder architecture using transformers (Vaswani et al., 2017). The difference lies in whether or not they use a transformer language model trained from scratch on the provided dataset.
All of their systems for Tamasheq–French are based on the same end-to-end encoder-decoder architecture. In this architecture, the encoder is initialized by a pre-trained semantic speech representation learning model named SAMU-XLS-R (Khurana et al., 2022), while the decoder is initialized with the decoder of the pre-trained mBART model. Their work relies heavily on different versions of the SAMU-XLS-R model, which are pre-trained on different combinations of multilingual corpora of 53, 60, and 100 languages. In addition, they leverage training data from higher-resource corpora, such as CoVoST-2 (Wang et al., 2020a) and Europarl-ST (Iranzo-Sánchez et al., 2020), for training their end-to-end models. Their primary system, which scores 15.88 BLEU on the Tamasheq–French 2023 test set, was trained on the combination of CoVoST-2, Europarl-ST, and the IWSLT 2022 test set, with the encoder initialized by the SAMU-XLS-R model trained on data gathered from 100 languages.
• NAVER (Gow-Smith et al., 2023) submitted one primary and two contrastive systems to the Tamasheq–French track, as well as one primary and two contrastive systems for the unconstrained condition in the Quechua–Spanish track. In their work for the Tamasheq–French track, they concentrate on parameter-efficient training methods that can perform both ST and MT in a multilingual setting. To do so, they initialize their models with a pre-trained multilingual MT model (mBART (Liu et al., 2020) or NLLB (NLLB Team et al., 2022)), which is then fine-tuned on the ST task by inputting features extracted with a frozen pre-trained speech representation model (wav2vec 2.0 or HuBERT (Hsu et al., 2021)). The encoder of their translation model is slightly modified: they stack several modality-specific layers at the bottom. In addition, adapter layers are inserted between layers of the pre-trained MT model on both the encoder and decoder sides. While these new components are fine-tuned during training, the pre-trained components of the MT model are frozen. One of the appealing characteristics of their approach is that it allows the same model to do both speech-to-text and text-to-text translation (or transcription). Furthermore, their method maximizes knowledge transfer to improve low-resource performance. Their primary system, which is ensembled from 3 different runs on the combination of both ST and ASR data, scores 23.59 BLEU on the 2023 test set.
For the Quechua–Spanish track, the overall architecture of their systems consists of first initializing a PLM, which was then fine-tuned on the speech translation task by inputting features from a frozen pre-trained speech representation. Similar adaptations were done with an MT model to control domain and length mismatch issues. One of the interesting takeaways from their approaches is that their contrastive 2 system (1.3 billion parameters (NLLB Team et al., 2022)) outperformed their contrastive 1 system (3.3 billion parameters (NLLB Team et al., 2022)) despite having fewer parameters. NAVER’s primary submission was an ensemble approach that included the use of PLMs for both the ASR (Baevski et al., 2020a) and MT systems (NLLB Team et al., 2022) and included training on both Tamasheq and Quechua data. Their submissions to QUE–SPA did not include the use of mBART or HuBERT (Hsu et al., 2021) as was done for other language pairs that NLE submitted.
• QUESPA (Ortega et al., 2023) submitted a total of six systems to both conditions (constrained and unconstrained), including a primary, contrastive 1, and contrastive 2 system for each condition. They also claim to have tried several other combinations but did not submit those systems. For the constrained condition, their primary system scored second best, slightly below team GMU, with a BLEU score of 1.25 and chrF2 of 25.35. They also scored third best for the constrained condition with 0.13 BLEU and 10.53 chrF2 using their contrastive 1 system. It is worthwhile to note that chrF2 was used by the organizers when BLEU scores were below five. For their constrained systems, a direct speech translation system was submitted, similar to the GMU team’s primary approach that used Fairseq (Wang et al., 2020b). QUESPA extracted mel-filter bank (MFB) features similar to the S2T approach in previous work (Wang et al., 2020b). The main difference between QUESPA’s submission and GMU’s submissions was that the GMU team
increased the number of decoder layers to 6, which resulted in a slightly better system for GMU. The other systems submitted for the constrained setting were cascade systems where ASR and MT were combined in a pipeline setting. Their contrastive 1 and 2 system submissions for the constrained task respectively used wav2letter++ (Pratap et al., 2019) and a conformer architecture similar to previous work (Gulati et al., 2020), along with an OpenNMT (Klein et al., 2017) translation system trained on the constrained ST and MT data. Both of those systems performed poorly, scoring less than 1 BLEU. For the unconstrained condition, the three systems presented by QUESPA consisted of pipeline approaches of PLMs that were fine-tuned on the additional 60 hours of Siminchik audio data along with the constrained data. Their primary and contrastive 1 unconstrained ASR systems were trained using the 102-language FLEURS (Conneau et al., 2023) model and used an MT system based on NLLB (NLLB Team et al., 2022), which includes Quechua as one of its languages. Their contrastive 2 ASR system was based on wav2letter++ (Pratap et al., 2019), while their contrastive 2 MT system was identical to the MT systems used for their primary and contrastive 1 submissions.
• SRI-B (Radhakrishnan et al., 2023) submitted four systems. For Marathi–Hindi, they submitted one primary and one contrastive system in the constrained setting and one primary and one contrastive system in the unconstrained setting. They used end-to-end speech translation networks comprising a conformer encoder and a transformer decoder for both the constrained and unconstrained settings.
8.4 Results
Irish–English As discussed earlier, only the GMU team participated in the GA–ENG translation track. They submitted one primary system under the constrained condition, and one primary and two contrastive systems under the unconstrained condition. The end-to-end and end-to-end with ASR models were submitted as the primary constrained and contrastive2 unconstrained systems. Both systems achieved 15.1 BLEU. They did not perform well in comparison to the wav2vec 2.0 and HuBERT models. Detailed results for this track can be found in Tables 36 and 37.
Marathi–Hindi The results of this translation track can be found in Tables 38 and 39. Overall, we see varying performance among the systems submitted to this track, with some performing much better on the test set. Out of the 16 submissions, the SRI-B team’s primary system achieved the best result on the constrained condition, with 31.2 BLEU and 54.8 chrF++, while the BUT team’s primary system achieved the best result on the unconstrained condition, with 39.6 BLEU and 63.3 chrF++. In the constrained and unconstrained conditions, the GMU systems achieved the lowest results, with 3.3 and 5.9 BLEU and 16.8 and 20.3 chrF++ respectively.
Maltese–English The results of this translation track can be found in Table 42. UM-DFKI used contrastive approaches in training their ASR system. For their contrastive1 system, their fine-tuning used Maltese, Arabic, French and Italian corpora. Their contrastive2, contrastive3, and contrastive4 approaches respectively use a subset of the Arabic, French and Italian ASR corpora along with Maltese data. The best result of 0.7 BLEU was achieved with their contrastive1 system.
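Several scores in this section are reported in chrF variants (chrF2, chrF++) alongside BLEU. As a rough illustration of what a character n-gram F-score measures, the following is a minimal Python sketch. It assumes precision and recall averaged over n-gram orders 1 to 6, β = 2, and whitespace ignored; the function names are hypothetical. It is not expected to reproduce the official scores, which come from the organizers' tooling, and chrF++ additionally uses word n-grams.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring whitespace (an assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified sentence-level chrF on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # a string shorter than n has no n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: with beta = 2, recall counts more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Because the score is built from character n-grams, a hypothesis that shares only word fragments with the reference still receives partial credit, which is one reason such metrics remain informative when BLEU is close to zero.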
Tamasheq–French The results of this translation track can be found in Tables 43 and 44. Compared to last year’s edition, this year has witnessed a growing interest in this low-resource translation track in terms of both quantity and quality of submissions. Almost all submissions achieve better results than last year’s best system (5.7 BLEU on test2022 (Boito et al., 2022b)). Furthermore, it is notable that cascaded systems were not favored in this track: none of the submitted systems is of this kind.
This year, this language pair remains a challenging low-resource translation track. There is only one submission to the constrained condition, from GMU, with an end-to-end model scoring 0.48 BLEU on this year’s test set. For this reason, all the participants favor exploiting pre-trained models, and their systems therefore fall under the unconstrained condition. Among these pre-trained models, self-supervised learning (SSL) speech models remain a popular choice for initializing the speech encoder. Using a wav2vec 2.0 model pre-trained on unlabelled Tamasheq data to initialize their speech encoder, GMU gains 7.55 BLEU in comparison with their Transformer-based encoder-decoder model trained from scratch (their primary constrained system). On the decoder side, pre-trained models such as mBART or NLLB are commonly leveraged for initializing the decoder of the end-to-end ST model. In addition, data augmentation and ensembling are also beneficial, as shown by ALEXA AI, who consistently achieve ∼9 BLEU in all of their settings.
Outstanding BLEU scores can be found in the work of the ON-TRAC team. An interesting pre-trained model named SAMU-XLS-R is shown to bring significant improvements. This is a multilingual multimodal semantic speech representation learning framework (Khurana et al., 2022) which fine-tunes the pre-trained speech transformer encoder XLS-R (Babu et al., 2021) using semantic supervision from the pre-trained multilingual semantic text encoder LaBSE (Feng et al., 2022). Exploiting this pre-trained model and training end-to-end ST models on combinations of different ST corpora, they achieve more than 15 BLEU in all of their settings.
NAVER tops this translation track with a multilingual parameter-efficient training solution that allows them to leverage strong pre-trained speech and text models to maximize performance in low-resource languages. Because the multilingual setup allows training on both ST and ASR data, all of their submissions outperform the second team, ON-TRAC, by considerable margins. Their primary system, which is ensembled from 3 different runs, uses NLLB1.3B as the pre-trained MT system and wav2vec 2.0 Niger-Mali 39 as the speech representation extractor. After being trained on a combination of both ST corpora (Tamasheq–French, mTEDx fr-en, mTEDx es-fr, mTEDx es-en, mTEDx fr-es (Salesky et al., 2021)) and ASR corpora (TED-LIUM v2 (Rousseau et al., 2014), mTEDx fr, mTEDx es), this system establishes a new state of the art for the Tamasheq–French language pair, scoring 23.59 BLEU on the 2023 test set.
39 https://huggingface.co/LIA-AvignonUniversity/IWSLT2022-Niger-Mali
Quechua–Spanish The QUE–SPA results for all systems submitted to this low-resource translation track can be found in Tables 45 and 46 of the appendix. To our knowledge, this first edition of the QUE–SPA language pair in the low-resource track of IWSLT has witnessed the best BLEU scores achieved by any known system in research for Quechua. The two best-performing systems, with 1.46 BLEU (constrained) and 15.70 BLEU (unconstrained), show that there is plenty of room to improve on the approaches presented here. Nonetheless, submissions from the three teams (GMU, NAVER, and QUESPA) have shown that it is possible to use PLMs to create speech translation systems with as little as 1.6 hours of parallel speech data. This is a notable characteristic of this task and surpasses previous work in the field.
We have found that the inclusion of Quechua in the NLLB (NLLB Team et al., 2022) system in recent years has had a greater impact than expected on ease of use. Similarly, Fairseq (Wang et al., 2020b) seems to be the preferred toolkit for creating direct S2T systems, cascaded or not. The QUE–SPA submissions for the unconstrained condition preferred a cascaded system in a pipeline approach where pre-trained models were fine-tuned first for ASR and then for MT.
The constrained setting leaves much room for improvement. Nonetheless, GMU and QUESPA’s near-identical submissions have shown that the increase of 3 layers during decoding can be powerful and should be explored further. It would be worthwhile for the organizers of the QUE–SPA track to obtain more parallel data including translations for future iterations of this task.
The unconstrained setting clearly can benefit from an ensembling technique and training with multiple languages: in these submissions, training a model with an additional language like Tamasheq alongside Quechua does not seem to have a negative impact on performance. However, it is hard to ascertain whether the slight performance gain of less than 1 BLEU point of the NLE team’s submission compared to QUESPA’s submission was due to the ensembling, the freezing of the models, or the language addition.
As a final takeaway, the NLE team’s submissions scored quite well under the unconstrained condition. It should be noted that for other language pairs NLE’s high system performance was also due to the ensembling of systems that were executed using different initialization parameters on at least three unique runs. As an aside, small gains were achieved under the constrained condition when comparing the GMU submission to the QUESPA system, due to the increase in decoding layers. QUESPA’s inclusion of a language model on top of a state-of-the-art dataset (Fleurs) allowed them to achieve scores similar to NAVER’s without additional tuning or ensembling. State-of-the-art performance was achieved by all three teams that submitted systems.
General Observations As in previous years, the low-resource shared task proved particularly challenging for the participants, but there are several encouraging signs that further reinforce the need for more research in the area.
First, more teams than ever participated in the shared task, showing a continued interest in the field. Second, we note that for the language pair that was repeated from last year (Tamasheq–French), almost all submissions outperformed last year’s best submission, with an increase of more than 17 BLEU points in the unconstrained setting. Last, we highlight the breadth of different approaches employed by the participants, ranging from the use of finetuned pre-trained models to pre-training from scratch, to parameter-efficient fine-tuning as well as cascaded pipeline systems, all of which seem to have benefits to offer, to a certain extent, to different language pairs.
Limitations As noted by some participants, the Irish–English and Maltese–English translation track data has limitations. For Irish–English, the speech translation systems can achieve very high BLEU scores on the test set if they use wav2vec 2.0 and/or the Irish ASR model trained on the Common Voice (Ardila et al., 2020b) dataset. Similarly, the GMU team achieved high BLEU scores, especially when they used wav2vec 2.0 and HuBERT models. We plan to continue this translation track next year with updated test and training data, in order to thoroughly investigate the data quality as well as the reason for the high BLEU scores. For Maltese–English, some participants reported issues with the data quality, which we hope to resolve in future iterations of the shared task.
9 Formality Control for SLT
Different languages encode formality distinctions in different ways, including the use of honorifics, grammatical registers, verb agreement, pronouns, and lexical choices. While machine translation (MT) systems typically produce a single generic translation for each input segment, SLT requires adapting the translation output to be appropriate to the context of communication and target audience. This shared task thus challenges machine translation systems to generate translations of different formality levels.
9.1 Challenge
Task Given a source text, X in English, and a target formality level, l ∈ {F, IF}, the goal in formality-sensitive machine translation (Niu et al., 2017) is to generate a translation, Y, in the target language that accurately preserves the meaning of the source text and conforms to the desired formality level, l. The two formality levels typically considered are “F” for formal and “IF” for informal, resulting in two translations: Y_F and Y_IF respectively. For example, the formal and informal translations into Korean of the source text “Yeah Did your mom know you were throwing the party?” (originally informal) are shown in Table 7.
This shared task builds on last year’s offering, which evaluated systems’ ability to control formality on the following translation tasks: translation from English (EN) into Korean (KO) and Vietnamese (VI) in the supervised setting, and from English (EN) into Portugal Portuguese (PT)
and Russian (RU) in the zero-shot setting. Results showed that formality control is challenging in zero-shot settings and for languages with many grammatical and lexical formality distinctions. This year’s edition invited participants to advance research in effective methods for bridging the gap in formality control for zero-shot cases and for languages with rich grammatical and lexical formality distinctions.
Source: Yeah Did your mom know you were throwing the party?
Korean Informal: 그, 어머님은 [F]네가[/F] 그 파티 연 거 [F]아셔[/F]?
Korean Formal: 그, 어머님은 [F]님이[/F] 그 파티 연 거 [F]아세요[/F]?
Table 7: Contrastive formal and informal translations into Korean. Grammatical formality markers are annotated with [F]text[/F].
9.2 Data and Metrics
Participants were provided with test data, as well as MT quality and formality control metrics. In addition, we provided training data, consisting of formal and informal translations of texts for the supervised language pairs (EN-KO, EN-VI).
9.2.1 Formality Annotated Dataset
We provide targeted datasets comprising source segments paired with two contrastive reference translations, one for each formality level (informal and formal), for EN-VI and EN-KO in the supervised setting and EN-RU and EN-PT in the zero-shot setting (see Table 7) 40. The sizes and properties of the released datasets for all the language pairs are listed in Table 8. Formal translations tend to be longer than informal texts for Vietnamese compared to other language pairs. The number of phrasal formality annotations ranges from 2 to 3.5 per segment, with Korean exhibiting a higher diversity between the formal and informal translations as indicated by the TER score.
40 https://github.com/amazon-science/contrastive-controlled-mt/tree/main/IWSLT2023
9.2.2 Training Conditions
We allowed submissions under the constrained and unconstrained data settings described below:
Constrained (C) Participants were allowed to use the following resources: Textual MuST-C v1.2 (Di Gangi et al., 2019b), CCMatrix (Schwenk et al., 2021), OpenSubtitles (Lison and Tiedemann, 2016), and the dataset in the constrained setting from the Formality Control track at IWSLT22 (Anastasopoulos et al., 2022a).
Unconstrained (U) Participants could use any publicly available datasets and resources; the use of pre-trained language models was also allowed. Additionally, using bitext automatically annotated with formality labels was allowed.
9.3 Formality Classifier
We release a multilingual classifier (MC) trained to predict the formality of a text for all the language pairs: EN-KO, EN-VI, EN-RU, and EN-PT. We finetune an xlm-roberta-base (Conneau et al., 2020) model on human-written formal and informal translations following the setup from Briakou et al. (2021). Our classifier achieves an accuracy of > 98% in detecting the formality of human-written translations for the four target languages (Table 10). Participants were allowed to use the classifier both for model development and for evaluation purposes as discussed below.
9.4 Automatic Metrics
We evaluate the submitted system outputs along the following two dimensions:
1. Overall translation quality, evaluated using SacreBLEU v2.0.0 (Papineni et al., 2002b; Post, 2018) and COMET (Rei et al., 2020b) on both the shared task-provided test sets based on topical chat (Gopalakrishnan et al., 2019) and on the FLORES devtest (NLLB Team et al., 2022; Goyal et al., 2022).
2. Formality control, evaluated using:
• Matched-Accuracy (mACC), a reference-based corpus-level automatic metric that leverages phrase-level formality markers from the references to classify a system-generated hypothesis as formal, informal, or neutral (Nadejde et al., 2022).
• Classifier-Accuracy (cACC), a reference-free metric that uses the multilingual formality classifier discussed above to label a system-generated hypothesis as formal or informal.
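To make the reference-based formality metric more concrete, the following is a minimal Python sketch of a matched-accuracy style computation. It rests on assumptions that are not spelled out in the text: a hypothesis is labeled formal if it contains at least one annotated formal phrase and no annotated informal phrase (and vice versa), it is neutral otherwise, and accuracy is computed over non-neutral hypotheses only. The function names and the substring matching are illustrative and are not the official scorer.

```python
def classify_formality(hypothesis, formal_markers, informal_markers):
    """Label a hypothesis using phrase-level markers taken from the references."""
    has_formal = any(marker in hypothesis for marker in formal_markers)
    has_informal = any(marker in hypothesis for marker in informal_markers)
    if has_formal and not has_informal:
        return "formal"
    if has_informal and not has_formal:
        return "informal"
    return "neutral"  # no marker matched, or conflicting markers


def matched_accuracy(examples, target):
    """Share of non-neutral hypotheses whose label equals the target level.

    `examples` is a list of (hypothesis, formal_markers, informal_markers).
    Excluding neutral hypotheses from the denominator is an assumption.
    """
    labels = [classify_formality(h, f, i) for h, f, i in examples]
    matched = [label for label in labels if label != "neutral"]
    if not matched:
        return 0.0
    return sum(label == target for label in matched) / len(matched)


# Toy data built from marker phrases of the kind annotated with [F]...[/F].
formal = ["님이", "아세요"]
informal = ["네가", "아셔"]
examples = [
    ("그, 어머님은 님이 그 파티 연 거 아세요?", formal, informal),
    ("그, 어머님은 네가 그 파티 연 거 아셔?", formal, informal),
    ("그 파티", formal, informal),
]
score = matched_accuracy(examples, "formal")
```

A classifier-based score could be sketched in the same way by replacing `classify_formality` with the prediction of a formality classifier, which needs no reference markers.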
LANGUAGE  TYPE   SIZE  LENGTH                      # PHRASAL ANNOTATIONS  TER(F, IF)
                       SOURCE  FORMAL  INFORMAL    FORMAL  INFORMAL
EN-VI     Train  400   20.35   28.52   25.48       2.71    1.49           23.70
EN-VI     Test   600   21.82   29.59   26.77       2.79    1.55           23.00
Table 9: Formality Track Submissions Summary. Most participants train bilingual systems but leverage a diverse
set of formality encoding mechanisms for control.
RU. These are Transformer-Big models trained on a large public dataset from the OPUS collection (Tiedemann, 2012), automatically marked with formality using a sequence of regular expressions. The formality level is encoded with a pseudo-token at the beginning of each training source sentence with one of 3 values: formal, informal, or no style.
• HW-TSC (Wang et al., 2023a) describes a system that uses a multi-stage pre-training strategy on task-provided data to train strong bilingual models. Using these bilingual models, they employ beam re-ranking on the outputs generated using the test source. The generated hypotheses are ranked using the formality classifier and phrasal annotations, and the model is iteratively fine-tuned on this data until test performance converges. Initial formality control is enabled by a special token and re-affirmed through classifier output and annotations from training.
• KU X UPSTAGE (Lee et al., 2023) uses large-scale bilingual transformer-based MT systems trained on high-quality datasets and mBART for the supervised and zero-shot settings respectively. They generate a formality-controlled translation dataset for supervision in the zero-shot setting using GPT-4 and filter the generated source-translation pairs using the formality classifier. All bilingual models are then finetuned independently for the two target formality directions to generate formality-controlled outputs, resulting in #(Language-pairs) × 2 (Formal/Informal) models.
• UCSC (Vakharia et al., 2023) focused on using a single multilingual translation model for all the language pairs under the unconstrained setting. They finetune the pre-trained model, mBART-large-50 (Tang et al., 2020), using the provided contrastive translations (§ 9.2.1) with an added style embedding intervention layer.
9.6 Results
Tables 47 and 48 in the Appendix show the main automatic evaluation results for the shared task.
Overall Results For the supervised language pairs in both constrained and unconstrained settings, most submitted systems were able to control formality successfully. The average mAcc scores ranged from 78 to 100. Controlling formality in Korean was found to be more challenging than in Vietnamese, as reflected by the relatively lower mAcc scores, which we believe to be due to the variation in formality expression of Korean honorific speech reflected in pretraining data.
HW-TSC consistently achieves the best scores across the board for all language pairs and both settings due to the use of transductive learning. Interestingly, the constrained submission by HW-TSC achieves better or competitive results compared to their unconstrained system, suggesting that the use of a pre-trained language model or additional resources is not necessary to generate high-quality formality-controlled translations. Generally, the systems generate higher-quality outputs in the formal setting than in the informal setting for both supervised language pairs according to BLEU and COMET, which might be due to the bias of the dataset used during pre-training, which is typically news and hence more formal.
In the zero-shot unconstrained setting, this formality bias is even more prominent. We observe a much wider distribution in the formality scores for English-Portuguese (mAcc: F 90-100, IF: 58-100), possibly due to the high ambiguity in the informal language and the confounding dialectal influence of Brazilian Portuguese dominant in the pre-training corpora, which is known to use formal register even in typically informal contexts (Costa-jussà et al., 2018). HW-TSC and APPTEK achieve the best translation quality for English-Portuguese and English-Russian respectively. The lowest-scoring submission in both quality and formality control (UCSC) did not include any fine-tuning or adaptation of the base mBART model to the two zero-shot language pairs: English-Russian and English-Portuguese. This suggests that formality information is not transferred from the unrelated language pairs, EN-KO and EN-VI, and that some language-specific supervision is needed to mark grammatical formality appropriately in Russian and Portuguese.
How well do systems match the desired target formality? We show the distribution of the scores generated using the formality classifier for
Figure 3: Formality Classifier Scores’ Distribution on the submitted system outputs in the Unconstrained setting:
HW-TSC can precisely match the target formality as depicted by the peaky distribution.
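Several of the submissions described in this section control formality with a pseudo-token or special token placed at the beginning of the source sentence. The following is a minimal Python sketch of that preprocessing step. The token spellings (`<formal>`, `<informal>`, `<nostyle>`) and the function name are assumptions made for illustration; the text does not give the tokens used by any team.

```python
# Hypothetical token inventory; real systems may spell these differently.
FORMALITY_TOKENS = {
    "formal": "<formal>",
    "informal": "<informal>",
    None: "<nostyle>",  # used when no formality label is available
}


def tag_source(sentence, formality=None):
    """Prepend a formality pseudo-token to a source sentence."""
    if formality not in FORMALITY_TOKENS:
        raise ValueError(f"unknown formality level: {formality!r}")
    return f"{FORMALITY_TOKENS[formality]} {sentence}"


tagged = [
    tag_source("Did your mom know you were throwing the party?", "formal"),
    tag_source("Did your mom know you were throwing the party?", "informal"),
    tag_source("Did your mom know you were throwing the party?"),
]
```

At inference time the same function would be applied with the desired target level, so that a single model can produce either register for the same source sentence.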
formality-controlled MT, despite the edit-focused 3. Target (English) phonemes and durations cor-
nature of the contrastive translations. We recom- responding to a translation which adheres to
mend that future work on formality-controlled ma- the desired timing
chine translation targets these challenges. The test data was produced by volunteers and
consists of videos of native German speakers
10 Automatic Dubbing reading individual sentences from the German
10.1 Challenge CoVoST-2 test set.43 This test set was divided in to
two subsets; Subset 1 where there are no pauses in
This task focuses on automatic dubbing: translat-
the speech and Subset 2 where there is one or more
ing the speech in a video into a new language such
pause in the speech. More details on this data are
that the new speech is natural when overlayed on
presented in (Chronopoulou et al., 2023).
the original video (see Figure 5).
Participants were given German videos, along 10.3 Submissions
with their text transcripts, and were asked to pro-
duced dubbed videos where the German speech has been translated into English speech.

Automatic dubbing is a very difficult and complex task (Brannon et al., 2023), and for this shared task we focus on the property that is perhaps most characteristic of dubbing: isochrony. Isochrony refers to the property that the speech translation is time aligned with the original speaker’s video. When the speaker’s mouth is moving, a listener should hear speech; likewise, when their mouth isn’t moving, a listener should not hear speech.

To make this task accessible for small academic teams with limited training resources, we make some simplifications: First, we assume the input speech has already been converted to text using an ASR system and the desired speech/pause times have been extracted from the input speech. Second, to alleviate the challenges of training a TTS model, the output is defined to be phonemes and their durations. These phonemes and durations are played through an open-source FastSpeech2 (Ren et al., 2022) text-to-speech model to produce the final speech.41

10.2 Data and Metrics

Official training and test data sets were provided42 by the organizers. The training data was derived from CoVoST2 (Wang et al., 2021) and consists of:

1. Source (German) text
2. Desired target speech durations (e.g. 2.1s of speech, followed by a pause, followed by 1.3s of speech)

Despite high initial interest, we received only one submission, which was from the Huawei Translation Services Center (HW-TSC) (Rao et al., 2023). However, we had two systems (Chronopoulou et al., 2023; Pal et al., 2023) built for the task for which we had not yet performed human evaluation, so we still had enough systems for an interesting comparison.

• Interleaved (Baseline): Our first baseline and the basis for this shared task is from Chronopoulou et al. (2023). They propose to jointly model translations and speech timing, giving the model the freedom to change the translation to fit the timing, either sacrificing translation quality to meet timing constraints or relaxing timing constraints to improve translation quality. This is achieved by simply binning target phoneme durations and interleaving them with target phonemes during training and inference. To avoid teaching the model that speech durations should be prioritized over translation quality,44 noise with standard deviation 0.1 is added to the target phrase durations to simulate the source durations used at inference.

• Factored (Baseline): Pal et al. (2023) build on the first baseline by using target factors (García-Martínez et al., 2016), where alongside predicting phoneme sequences as the target, they also predict durations for each phoneme as a target factor. Additionally, they propose auxiliary counters, which are similar to target factors except the model is not
41 https://github.com/mtresearcher/FastSpeech2
42 https://github.com/amazon-science/iwslt-autodub-task/tree/main/data
43 Each volunteer provided their consent to use this data for automatic dubbing task.
44 Median speech overlap is just 0.731 in a large corpus of human dubs (Brannon et al., 2023)
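
Isochrony can be made concrete by comparing the time intervals in which the source speaker talks with the intervals in which the dubbed speech is audible. The sketch below computes one plausible overlap score, the intersection over union of the two sets of speech intervals. This definition is an assumption made for illustration (the text here does not define how speech overlap is computed), and the function names and the 0.2s pause used in the usage note are hypothetical.

```python
from typing import List, Tuple

# A speech interval as (start, end) in seconds. Each list is assumed to be
# sorted by start time and free of internal overlaps.
Interval = Tuple[float, float]


def total_length(intervals: List[Interval]) -> float:
    """Total speaking time covered by a list of intervals."""
    return sum(end - start for start, end in intervals)


def intersection_length(a: List[Interval], b: List[Interval]) -> float:
    """Time during which both tracks contain speech (two-pointer sweep)."""
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if end > start:
            total += end - start
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def speech_overlap(source: List[Interval], dub: List[Interval]) -> float:
    """Intersection over union of source and dubbed speech intervals.

    ASSUMPTION: this IoU definition is illustrative only; it is not taken
    from the shared task description.
    """
    inter = intersection_length(source, dub)
    union = total_length(source) + total_length(dub) - inter
    return inter / union if union > 0 else 1.0
```

For a source with 0.4s of speech, a 0.2s pause, and 1.3s of speech, `speech_overlap([(0.0, 0.4), (0.6, 1.9)], [(0.0, 0.5), (0.6, 1.7)])` gives about 0.83: the dub runs 0.1s into the pause and ends 0.2s early in the second phrase.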
Figure 5: An example in which “hallo! wie gehts?” is translated to “hi! how are you?” such that the output fits the desired target speech durations of 0.4s and 1.3s, with a pause in between.
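
The interleaving described for the Interleaved baseline, applied to the example in Figure 5, can be sketched as follows. This is a minimal illustration rather than the baseline implementation: the bin width, the token spellings (`<d4>`, `[pause]`), the per-phoneme durations, and the helper names are all hypothetical, and the noise is assumed to be additive Gaussian noise in seconds, which the description does not specify.

```python
import random
from typing import List, Optional, Tuple

BIN_WIDTH = 0.05   # seconds per duration bin (hypothetical value)
PAUSE = "[pause]"  # hypothetical marker separating speech phrases

# "hi! how are you?" as two phrases of (phoneme, duration in seconds).
# The per-phoneme durations are invented; they sum to 0.4s and 1.3s.
FIGURE5_PHRASES: List[List[Tuple[str, float]]] = [
    [("HH", 0.2), ("AY", 0.2)],
    [("HH", 0.2), ("AW", 0.2), ("AA", 0.2), ("R", 0.2), ("Y", 0.2), ("UW", 0.3)],
]


def bin_token(seconds: float) -> str:
    """Map a duration to a discrete bin token such as '<d4>'."""
    return f"<d{max(1, round(seconds / BIN_WIDTH))}>"


def interleave(phrases: List[List[Tuple[str, float]]]) -> List[str]:
    """Build one target sequence: phoneme, bin, phoneme, bin, ...

    A PAUSE marker is placed between consecutive phrases.
    """
    out: List[str] = []
    for k, phrase in enumerate(phrases):
        if k > 0:
            out.append(PAUSE)
        for phoneme, duration in phrase:
            out.extend([phoneme, bin_token(duration)])
    return out


def noisy_phrase_durations(
    durations: List[float],
    std: float = 0.1,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Perturb phrase durations with Gaussian noise for training.

    ASSUMPTION: additive noise in seconds, clipped at zero.
    """
    rng = rng or random.Random(0)
    return [max(0.0, d + rng.gauss(0.0, std)) for d in durations]
```

With these made-up durations, `interleave(FIGURE5_PHRASES)` begins `HH <d4> AY <d4> [pause] HH <d4>`, and `noisy_phrase_durations([0.4, 1.3])` returns slightly perturbed phrase durations that would stand in for the source-side timings at training time.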
• HW-TSC: In contrast to our three baselines, Rao et al. (2023) took a more traditional approach to dubbing and followed prior work on verbosity control (Lakew et al., 2021, 2019) to first generate a set of translation candidates and later re-rank them. Their system consists of four parts: 1) voice activity detection followed by pause alignment, 2) generating a list of translation candidates, 3) phoneme duration prediction, followed by 4) re-ranking/scaling the candidates based on the durations (see Figure 6). With the last step in the pipeline, the top-scored candidate is ensured to have the best speech overlap with the source speech amongst all candidate translations.

10.4 Evaluation & Metric

The dubbed English videos were judged by a mixture of native and non-native speakers, all of whom were researchers in automatic dubbing. For each video in the test set, one judge was shown the four system outputs in random order and asked to rate them on a scale from 1 to 6. The judges were not given a defined rubric or guidelines to follow but were asked to be consistent.

As a metric we opted for the mean opinion score (MOS) methodology, where the scores for a system as judged by humans are averaged into one score.45

Feedback from the judges indicates that the baseline and submitted systems often produce poor translations (perhaps due to the small amount of training data used by each system), and the voice quality from the FastSpeech 2 model was far from perfect. However, they felt that having all systems share the same voice made it much easier to compare across dubbing systems.

45 https://en.wikipedia.org/wiki/Mean_opinion_score

When we looked at the distribution of scores per
annotator (judge) level, the numbers showed that each annotator had a bias towards dubbing: some liked dubbing more than others, which is intuitive but has not been studied before in the context of automatic dubbing. As shown in Table 11, it is clear that annotator A2 had a significantly higher preference for dubbing as compared to annotator A4 in terms of MOS.

Annotator   MOS↑   CI
A1          3.34   ±0.16
A2          3.74   ±0.19
A3          3.53   ±0.13
A4          3.07   ±0.15

Table 11: MOS (on a scale of 1-6) with confidence interval (CI) at 95% per annotator showing the biases towards general purpose dubbed content.

We also looked at MOS for the two different subsets to understand whether it was difficult for the submitted systems to dub the videos. As it turns out, Subset 1 has a significantly higher MOS of 3.54 (± 0.11) compared to Subset 2 with a MOS of 3.31 (± 0.11). This shows it is significantly more difficult for all systems to dub Subset 2 than Subset 1.

10.5 Results

Results are shown in Table 12. All three dubbing systems outperform the non-isochronic Text2Phone baseline (Chronopoulou et al., 2023), as expected. The factored baseline improves over the interleaved baseline, consistent with the automatic metric results reported by Pal et al. (2023). The HW-TSC system (Rao et al., 2023) outperforms all the baselines in terms of mean opinion score, making it the clear winner of the IWSLT 2023 dubbing shared task. Unfortunately, since the HW-TSC system was unconstrained (it trains on additional bitext compared to the baselines) and uses fundamentally different approaches than the baselines, it is not possible to attribute its performance to any single factor.

System       Constrained?   MOS↑ Mean   CI
Text2Phone   Yes            3.16        ±0.19
Interleaved  Yes            3.33        ±0.18
Factored     Yes            3.43        ±0.19
HW-TSC       No             3.77        ±0.19

Table 12: Mean opinion score for baselines 1) Text2Phone 2) Interleaved (Chronopoulou et al., 2023) 3) Factored (Pal et al., 2023) and 4) submitted system of HW-TSC (Rao et al., 2023).

Lip-sync is an important feature of dubbing: the final generated audio should be in sync with the lip movements of the on-screen speaker in the original video. As an analysis, we looked at Lip-Sync Error Distance (LSE-D) (Chung and Zisserman, 2016) following the evaluation methodology in Hu et al. (2021). LSE-D is not a perfect metric but it is an indication of the amount of lip-sync errors in the video. From Table 13, Subset 1 consistently has a lower lip-sync error than Subset 2 in all cases, indicating that it is difficult to generate lip-synced dubs for Subset 2. This result is also in line with the MOS scores we obtained for the two subsets, where the annotators preferred dubs for Subset 1. Secondly, original videos show a significantly lower lip-sync error distance than dubbed videos (7.x vs. 12.x), showing that automatic dubbing research still has a long way to go to reach the lip-sync quality of original videos.

LSE-D↓
System       Subset1   Subset2
Original     7.39      7.67
Text2Phone   11.64     13.31
Interleaved  11.71     12.35
Factored     11.73     12.48
HW-TSC       12.11     12.77

Table 13: Results of Lip-Sync Error Distance (LSE-D) via Syncnet pre-trained model (Chung and Zisserman, 2016). Lower is better.

Acknowledgements

Claudia Borg, Thierry Declerck, Rishu Kumar and John Judge acknowledge H2020 LT-Bridge Project (GA 952194). Rishu Kumar would also like to thank the EMLCT programme (https://mundus-web.coli.uni-saarland.de/). Atul Kr. Ojha and John P. McCrae would like to thank Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289 P2 Insight 2, and Panlingua Language Processing LLP for providing the Marathi-Hindi speech translation data and for their support. John Judge would also like to acknowledge the support of SFI under grant SFI/13/RC/2106 P2 ADAPT. Ondřej Bojar would like to acknowledge the grant 19-26934X
(NEUREM3) of the Czech Science Foundation. Antonios Anastasopoulos and Milind Agarwal are supported by the US National Science Foundation CCRI-Planning 2234895 award, as well as a National Endowment for the Humanities PR-276810-21 award.

References

Basil Abraham, Danish Goel, Divya Siddarth, Kalika Bali, Manu Chopra, Monojit Choudhury, Pratik Joshi, Preethi Jyoti, Sunayana Sitaram, and Vivek Seshadri. 2020. Crowdsourcing speech data for low-resource languages from low-income workers. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2819–2826.

Yasuhiro Akiba, Marcello Federico, Noriko Kando, Hiromi Nakaiwa, Michael Paul, and Jun’ichi Tsujii. 2004. Overview of the IWSLT04 Evaluation Campaign. In Proceedings of the International Workshop on Spoken Language Translation, pages 1–12, Kyoto, Japan.

Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Věra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nădejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022a. Findings of the IWSLT 2022 evaluation campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 98–157, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Věra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nădejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022b. Findings of the IWSLT 2022 Evaluation Campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 98–157, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Antonios Anastasopoulos, Ondřej Bojar, Jacob Bremerman, Roldano Cattoni, Maha Elbayad, Marcello Federico, Xutai Ma, Satoshi Nakamura, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Alexander Waibel, Changhan Wang, and Matthew Wiesner. 2021. Findings of the IWSLT 2021 Evaluation Campaign. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 1–29, Bangkok, Thailand (online). Association for Computational Linguistics.

Pierre Andrews, Guillaume Wenzek, Kevin Heffernan, Onur Çelebi, Anna Sun, Ammar Kamran, Yingzhe Guo, Alexandre Mourachko, Holger Schwenk, and Angela Fan. 2022. stopes-modular machine translation pipelines. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 258–265.

Ebrahim Ansari, Amittai Axelrod, Nguyen Bach, Ondřej Bojar, Roldano Cattoni, Fahim Dalvi, Nadir Durrani, Marcello Federico, Christian Federmann, Jiatao Gu, Fei Huang, Kevin Knight, Xutai Ma, Ajay Nagesh, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Xing Shi, Sebastian Stüker, Marco Turchi, and Changhan Wang. 2020. Findings of the IWSLT 2020 Evaluation Campaign. In Proceedings of the 17th International Conference on Spoken Language Translation (IWSLT 2020), Seattle, USA.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. 2019. Common voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. 2020a. Common voice: A massively-multilingual speech corpus. In LREC.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. 2020b. Common voice: A massively-multilingual speech corpus. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 4218–4222.

Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.

Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika
Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, et al. 2021. XLS-R: Self-supervised cross-lingual speech representation learning at scale. arXiv preprint arXiv:2111.09296.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020a. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems, volume 33, pages 12449–12460.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020b. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460.

Parnia Bahar, Patrick Wilken, Javier Iranzo-Sánchez, Mattia Di Gangi, Evgeny Matusov, and Zoltán Tüske. 2023. Speech Translation with Style: AppTek’s Submissions to the IWSLT Subtitling and Formality Tracks in 2023. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).

Luisa Bentivogli, Mauro Cettolo, Marco Gaido, Alina Karakanta, Alberto Martinelli, Matteo Negri, and Marco Turchi. 2021. Cascade versus Direct Speech Translation: Do the Differences Still Make a Difference? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand. Association for Computational Linguistics.

Marcely Zanon Boito, Fethi Bougares, Florentin Barbier, Souhir Gahbiche, Loïc Barrault, Mickael Rouvier, and Yannick Estève. 2022a. Speech resources in the tamasheq language. Language Resources and Evaluation Conference (LREC).

Marcely Zanon Boito, John Ortega, Hugo Riguidel, Antoine Laurent, Loïc Barrault, Fethi Bougares, Firas Chaabani, Ha Nguyen, Florentin Barbier, Souhir Gahbiche, and Yannick Estève. 2022b. ON-TRAC Consortium Systems for the IWSLT 2022 Dialect and Low-resource Speech Translation Tasks. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT).

William Brannon, Yogesh Virkar, and Brian Thompson. 2023. Dubbing in Practice: A Large Scale Study of Human Localization With Insights for Automatic Dubbing. Transactions of the Association for Computational Linguistics, 11:419–435.

Eleftheria Briakou, Sweta Agrawal, Joel Tetreault, and Marine Carpuat. 2021. Evaluating the evaluation metrics for style transfer: A case study in multilingual formality transfer. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1321–1336, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Ronald Cardenas, Rodolfo Zevallos, Reynaldo Baquerizo, and Luis Camacho. 2018. Siminchik: A speech corpus for preservation of southern quechua. ISI-NLP 2, page 21.

Mauro Cettolo, Marcello Federico, Luisa Bentivogli, Jan Niehues, Sebastian Stüker, K. Sudoh, K. Yoshino, and Christian Federmann. 2017. Overview of the IWSLT 2017 Evaluation Campaign. In Proceedings of the 14th International Workshop on Spoken Language Translation (IWSLT 2017), pages 2–14, Tokyo, Japan.

Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. 2015. The IWSLT 2015 Evaluation Campaign. In Proceedings of the 12th International Workshop on Spoken Language Translation (IWSLT 2015), Da Nang, Vietnam.

Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. 2013. Report on the 10th IWSLT Evaluation Campaign. In Proceedings of the Tenth International Workshop on Spoken Language Translation (IWSLT 2013), Heidelberg, Germany.

Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. 2014. Report on the 11th IWSLT Evaluation Campaign, IWSLT 2014. In Proceedings of the Eleventh International Workshop on Spoken Language Translation (IWSLT 2014), Lake Tahoe, USA.

Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. 2016. The IWSLT 2016 Evaluation Campaign. In Proceedings of the 13th International Workshop on Spoken Language Translation (IWSLT 2016), Seattle, USA.

Mingda Chen, Paul-Ambroise Duquenne, Pierre Andrews, Justine Kao, Alexandre Mourachko, Holger Schwenk, and Marta R. Costa-jussà. 2022. Blaser: A text-free speech-to-speech translation evaluation metric.

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Michael Zeng, and Furu Wei. 2021. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16:1505–1518.

Colin Cherry and George Foster. 2019. Thinking slow about latency evaluation for simultaneous machine translation. arXiv preprint arXiv:1906.00048.

Kyunghyun Cho and Masha Esipova. 2016. Can neural machine translation do simultaneous translation? arXiv preprint arXiv:1606.02012.
Alexandra Chronopoulou, Brian Thompson, Prashant Mathur, Yogesh Virkar, Surafel M. Lakew, and Marcello Federico. 2023. Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing. ArXiv:2302.12979.

J. S. Chung and A. Zisserman. 2016. Out of time: automated lip sync in the wild. In Workshop on Multi-view Lip-reading, ACCV.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. 2023. Fleurs: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 798–805. IEEE.

Marta R. Costa-jussà, Marcos Zampieri, and Santanu Pal. 2018. A neural approach to language variety translation. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pages 275–282, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Pan Deng, Shihao Chen, Weitai Zhang, Jie Zhang, and Lirong Dai. 2023. The USTC’s Dialect Speech Translation System for IWSLT 2023. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).

Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019a. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2012–2017, Minneapolis, Minnesota.

Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019b. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2012–2017, Minneapolis, Minnesota. Association for Computational Linguistics.

Yichao Du, Guo Zhengsheng, Jinchuan Tian, Zhirui Zhang, Xing Wang, Jianwei Yu, Zhaopeng Tu, Tong Xu, and Enhong Chen. 2023. The MineTrans Systems for IWSLT 2023 Offline Speech Translation and Speech-to-Speech Translation Tasks. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).

Matthias Eck and Chiori Hori. 2005. Overview of the IWSLT 2005 evaluation campaign. In Proceedings of the International Workshop on Spoken Language Translation, pages 1–22, Pittsburgh, PA.

ELRA catalogue. 2016a. Trad pashto broadcast news speech corpus. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0381/. ISLRN: 918-508-885-913-7, ELRA ID: ELRA-S0381.

ELRA catalogue. 2016b. Trad pashto-french parallel corpus of transcribed broadcast news speech - training data. http://catalog.elda.org/en-us/repository/browse/ELRA-W0093/. ISLRN: 802-643-297-429-4, ELRA ID: ELRA-W0093.

Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, and Armand Joulin. 2020. Beyond english-centric multilingual machine translation.

Marcello Federico, Luisa Bentivogli, Michael Paul, and Sebastian Stüker. 2011. Overview of the IWSLT 2011 Evaluation Campaign. In Proceedings of the International Workshop on Spoken Language Translation, pages 11–27, San Francisco, USA.

Marcello Federico, Mauro Cettolo, Luisa Bentivogli, Michael Paul, and Sebastian Stüker. 2012. Overview of the IWSLT 2012 Evaluation Campaign. In Proceedings of the International Workshop on Spoken Language Translation, pages 11–27, Hong Kong, HK.

F. Feng, Y. Yang, D. Cer, N. Arivazhagan, and W. Wang. 2022. Language-agnostic BERT Sentence Embedding. In Proceedings of the 60th ACL.

Cameron Shaw Fordyce. 2007. Overview of the IWSLT 2007 evaluation campaign. In Proceedings of the International Workshop on Spoken Language Translation, pages 1–12, Trento, Italy.

Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. 2021a. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics, 9:1460–1474.

Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, George Foster, Alon Lavie, and Ondřej Bojar. 2021b. Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain. In Proceedings of the Sixth Conference on Machine Translation, pages 733–774, Online. Association for Computational Linguistics.
Ryo Fukuda, Yuta Nishikawa, Yasumasa Kano, Yuka Ko, Tomoya Yanagita, Kosuke Doi, Mana Makinae, Sakriani Sakti, Katsuhito Sudoh, and Satoshi Nakamura. 2023. NAIST Simultaneous Speech-to-speech Translation System for IWSLT 2023. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).

Mercedes García-Martínez, Loïc Barrault, and Fethi Bougares. 2016. Factored neural machine translation architectures. In Proceedings of the 13th International Conference on Spoken Language Translation, Seattle, Washington D.C. International Workshop on Spoken Language Translation.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning.

Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tür. 2019. Topical-Chat: Towards knowledge-grounded open-domain conversations. In Proc. Interspeech 2019, pages 1891–1895.

Edward Gow-Smith, Alexandre Berard, Marcely Zanon Boito, and Ioan Calapodescu. 2023. NAVER LABS Europe’s Multilingual Speech Translation Systems for the IWSLT 2023 Low-Resource Track. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).

Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc’Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2022. The Flores-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics, 10:522–538.

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented transformer for speech recognition. Interspeech, pages 5036–5040.

Jiaxin Guo, Daimeng Wei, Zhanglin Wu, Zongyao Li, Zhiqiang Rao, Minghan Wang, Hengchao Shang, Xiaoyu Chen, Zhengzhe Yu, Shaojun Li, Yuhao Xie, Lizhi Lei, and Hao Yang. 2023. The HW-TSC’s Simultaneous Speech-to-Text Translation system for IWSLT 2023 evaluation. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).

Yuchen Han, Xiaoqian Liu, Hao Chen, Yuhao Zhang, Chen Xu, Tong Xiao, and Jingbo Zhu. 2023. The NiuTrans End-to-End Speech Translation System for IWSLT23 English-to-Chinese Offline Task. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).

Fei He, Shan-Hui Cathy Chu, Oddur Kjartansson, Clara Rivera, Anna Katanova, Alexander Gutkin, Isin Demirsahin, Cibu Johny, Martin Jansche, Supheakmungkol Sarin, and Knot Pipatsrisawat. 2020. Open-source multi-speaker speech corpora for building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu speech synthesis systems. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6494–6503, Marseille, France. European Language Resources Association.

Kevin Heffernan, Onur Çelebi, and Holger Schwenk. 2022. Bitext mining using distilled sentence representations for low-resource languages. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2101–2112, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Carlos Daniel Hernandez Mena, Albert Gatt, Andrea DeMarco, Claudia Borg, Lonneke van der Plas, Amanda Muscat, and Ian Padovani. 2020. MASRI-HEADSET: A Maltese corpus for speech recognition. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6381–6388, Marseille, France. European Language Resources Association.

Oleksii Hrinchuk, Vladimir Bataev, Evelina Bakhturina, and Boris Ginsburg. 2023. NVIDIA NeMo Offline Speech Translation Systems for IWSLT 2023. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio, Speech and Lang. Proc., 29:3451–3460.

Chenxu Hu, Qiao Tian, Tingle Li, Wang Yuping, Yuxuan Wang, and Hang Zhao. 2021. Neural dubber: Dubbing for videos according to scripts. In Thirty-Fifth Conference on Neural Information Processing Systems.

Wuwei Huang, Mengge Liu, Xiang Li, Yanzhi Tian, Fengyu Yang, Wen Zhang, Jian Luan, Bin Wang, Yuhang Guo, and Jinsong Su. 2023. The Xiaomi AI Lab’s Speech Translation Systems for IWSLT 2023 Offline Task, Simultaneous Task and Speech-to-Speech Task. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).

Amir Hussein, Cihan Xiao, Neha Verma, Matthew Wiesner, Thomas Thebaud, and Sanjeev Khudanpur. 2023. JHU IWSLT 2023 Dialect Speech Translation System Description. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Muhammad Huzaifah, Kye Min Tan, and Richeng Duan. 2023. I2R’s End-to-End Speech Translation System for IWSLT 2023 Offline Shared Task. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).

Hirofumi Inaguma, Brian Yan, Siddharth Dalmia, Pengcheng Guo, Jiatong Shi, Kevin Duh, and Shinji Watanabe. 2021. ESPnet-ST IWSLT 2021 Offline Speech Translation System. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT).

Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchis, Jorge Civera, and Alfons Juan. 2020. Europarl-st: A multilingual corpus for speech translation of parliamentary debates. In Proc. of 45th Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2020), pages 8229–8233, Barcelona (Spain).

Dávid Javorský, Dominik Macháček, and Ondřej Bojar. 2022. Continuous rating as reliable human evaluation of simultaneous speech translation. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 154–164, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Japan Translation Federation JTF. 2018. JTF Translation Quality Evaluation Guidelines, 1st Edition (in Japanese).

Yasumasa Kano, Katsuhito Sudoh, and Satoshi Nakamura. 2023. Average Token Delay: A Latency Metric for Simultaneous Translation. In Proceedings of Interspeech 2023. To appear.

Alina Karakanta, Luisa Bentivogli, Mauro Cettolo, Matteo Negri, and Marco Turchi. 2022a. Post-editing in automatic subtitling: A subtitlers’ perspective. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 261–270, Ghent, Belgium. European Association for Machine Translation.

Alina Karakanta, François Buet, Mauro Cettolo, and François Yvon. 2022b. Evaluating subtitle segmentation for end-to-end generation systems. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3069–3078, Marseille, France. European Language Resources Association.

Santosh Kesiraju, Karel Beneš, Maksim Tikhonov, and Jan Černocký. 2023. BUT Systems for IWSLT 2023 Marathi - Hindi Low Resource Speech Translation Task. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).

Sameer Khurana, Antoine Laurent, and James Glass. 2022. Samu-xlsr: Semantically-aligned multimodal utterance-level cross-lingual speech representation. IEEE Journal of Selected Topics in Signal Processing, pages 1–13.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72, Vancouver, Canada. Association for Computational Linguistics.

Surafel M Lakew, Yogesh Virkar, Prashant Mathur, and Marcello Federico. 2021. Isometric mt: Neural machine translation for automatic dubbing. arXiv preprint arXiv:2112.08682.

Surafel Melaku Lakew, Mattia Di Gangi, and Marcello Federico. 2019. Controlling the output length of neural machine translation. In Proc. IWSLT.

Antoine Laurent, Souhir Gahbiche, Ha Nguyen, Haroun Elleuch, Fethi Bougares, Antoine Thiol, Hugo Riguidel, Salima Mdhaffar, Gaëlle Laperrière, Lucas Maison, Sameer Khurana, and Yannick Estève. 2023. ON-TRAC consortium systems for the IWSLT 2023 dialectal and low-resource speech translation tasks. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).

Seungjun Lee, Hyeonseok Moon, Chanjun Park, and Heuiseok Lim. 2023. Improving Formality-Sensitive Machine Translation using Data-Centric Approaches and Prompt Engineering. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).

Zongyao Li, Zhanglin Wu, Zhiqiang Rao, Xie YuHao, Guo JiaXin, Daimeng Wei, Hengchao Shang, Wang Minghan, Xiaoyu Chen, Zhengzhe YU, Li ShaoJun, Lei LiZhi, and Hao Yang. 2023. HW-TSC at IWSLT2023: Break the Quality Ceiling of Offline Track via Pre-Training and Domain Adaptation. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).

Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, et al. 2021. Few-shot learning with multilingual language models. arXiv preprint arXiv:2112.10668.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 923–929, Portorož, Slovenia. European Language Resources Association (ELRA).

Danni Liu, Thai Binh Nguyen, Sai Koneru, Enes Yavuz Ugan, Ngoc-Quan Pham, Tuan Nam Nguyen, Tu Anh Dinh, Carlos Mullov, Alexander Waibel, and Jan Niehues. 2023. KIT’s Multilingual Speech Translation System for IWSLT 2023. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.
Arle Lommel, Hans Uszkoreit, and Aljoscha Burchardt. 2014. Multidimensional Quality Metrics (MQM): A Framework for Declaring and Describing Translation Quality Metrics. Revista Tradumàtica: tecnologies de la traducció, 12:455–463.
Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy. Association for Computational Linguistics.
Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Alexandre Muzio, Saksham Singhal, Hany Hassan Awadalla, Xia Song, and Furu Wei. 2021. DeltaLM: Encoder-decoder pre-training for language generation and translation by augmenting pretrained multilingual encoders. arXiv.
Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, and Juan Pino. 2020a. SIMULEVAL: An evaluation toolkit for simultaneous translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 144–150, Online. Association for Computational Linguistics.
Xutai Ma, Juan Pino, and Philipp Koehn. 2020b. SimulMT to SimulST: Adapting simultaneous text translation to end-to-end simultaneous speech translation. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 582–587, Suzhou, China. Association for Computational Linguistics.
Dominik Macháček, Ondřej Bojar, and Raj Dabre. 2023. MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Evgeny Matusov, Gregor Leusch, Oliver Bender, and Hermann Ney. 2005a. Evaluating machine translation output with automatic sentence segmentation. In Proc. of the International Workshop on Spoken Language Translation (IWSLT), pages 138–144.
Evgeny Matusov, Gregor Leusch, Oliver Bender, and Hermann Ney. 2005b. Evaluating machine translation output with automatic sentence segmentation. In Proceedings of the Second International Workshop on Spoken Language Translation, Pittsburgh, Pennsylvania, USA.
Evgeny Matusov, Patrick Wilken, and Yota Georgakopoulou. 2019. Customizing neural machine translation for subtitling. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 82–93, Florence, Italy. Association for Computational Linguistics.
Jonathan Mbuya and Antonios Anastasopoulos. 2023. GMU Systems for the IWSLT 2023 Dialect and Low-resource Speech Translation Tasks. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Maria Nadejde, Anna Currey, Benjamin Hsu, Xing Niu, Marcello Federico, and Georgiana Dinu. 2022. CoCoA-MT: A dataset and benchmark for contrastive controlled MT with application to formality. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 616–632, Seattle, United States. Association for Computational Linguistics.
J. Niehues, R. Cattoni, S. Stüker, M. Negri, M. Turchi, T. Ha, E. Salesky, R. Sanabria, L. Barrault, L. Specia, and M. Federico. 2019. The IWSLT 2019 Evaluation Campaign. In Proceedings of the 16th International Workshop on Spoken Language Translation (IWSLT 2019), Hong Kong, China.
Jan Niehues, Roldano Cattoni, Sebastian Stüker, Mauro Cettolo, Marco Turchi, and Marcello Federico. 2018. The IWSLT 2018 Evaluation Campaign. In Proceedings of the 15th International Workshop on Spoken Language Translation (IWSLT 2018), pages 2–6, Bruges, Belgium.
Xing Niu, Marianna Martindale, and Marine Carpuat. 2017. A study of style in machine translation: Controlling the formality of machine translation output. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2814–2819, Copenhagen, Denmark. Association for Computational Linguistics.
NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia-Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint.
John E Ortega, Richard Castro Mamani, and Kyunghyun Cho. 2020. Neural machine translation with a polysynthetic low resource language. Machine Translation, 34(4):325–346.
John E. Ortega, Rodolfo Zevallos, and William Chen. 2023. QUESPA Submission for the IWSLT 2023 Dialect and Low-resource Speech Translation Tasks. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Proyag Pal, Brian Thompson, Yogesh Virkar, Prashant Mathur, Alexandra Chronopoulou, and Marcello Federico. 2023. Improving isochronous machine translation with target factors and auxiliary counters.
Sara Papi, Marco Gaido, and Matteo Negri. 2023. Direct Models for Simultaneous Translation and Automatic Subtitling: FBK@IWSLT2023. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Sara Papi, Marco Gaido, Matteo Negri, and Marco Turchi. 2022. Over-generation cannot be rewarded: Length-adaptive average lagging for simultaneous speech translation. In Proceedings of the Third Workshop on Automatic Simultaneous Translation, pages 12–17, Online. Association for Computational Linguistics.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002a. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002b. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Interspeech 2019.
Michael Paul. 2006. Overview of the IWSLT 2006 Evaluation Campaign. In Proceedings of the International Workshop on Spoken Language Translation, pages 1–15, Kyoto, Japan.
Michael Paul. 2008. Overview of the IWSLT 2008 Evaluation Campaign. In Proceedings of the International Workshop on Spoken Language Translation, pages 1–17, Waikiki, Hawaii.
Michael Paul. 2009. Overview of the IWSLT 2009 Evaluation Campaign. In Proceedings of the International Workshop on Spoken Language Translation, pages 1–18, Tokyo, Japan.
Michael Paul, Marcello Federico, and Sebastian Stüker. 2010. Overview of the IWSLT 2010 Evaluation Campaign. In Proceedings of the International Workshop on Spoken Language Translation, pages 3–27, Paris, France.
Simone Perone. 2023. Matesub: the Translated Subtitling Tool at the IWSLT2023 Subtitling task. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Peter Polák, Danni Liu, Ngoc-Quan Pham, Jan Niehues, Alexander Waibel, and Ondřej Bojar. 2023. Towards Efficient Simultaneous Speech Translation: CUNI-KIT System for Simultaneous Track at IWSLT 2023. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Peter Polák, Ngoc-Quan Pham, Tuan Nam Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, Ondřej Bojar, and Alexander Waibel. 2022. CUNI-KIT system for simultaneous speech translation task at IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 277–285, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Maja Popović. 2015a. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.
Maja Popović. 2015b. chrf: character n-gram f-score for automatic mt evaluation. In Proceedings of the tenth workshop on statistical machine translation, pages 392–395.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
Vineel Pratap, Awni Hannun, Qiantong Xu, Jeff Cai, Jacob Kahn, Gabriel Synnaeve, Vitaliy Liptchinsky, and Ronan Collobert. 2019. Wav2letter++: A fast open-source speech recognition system. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6460–6464. IEEE.
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision.
Balaji Radhakrishnan, Saurabh Agrawal, Raj Prakash Gohil, Kiran Praveen, Advait Vinay Dhopeshwarkar, and Abhishek Pandey. 2023. SRI-B’s systems for IWSLT 2023 Dialectal and Low-resource track: Marathi-Hindi Speech Translation. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Zhiqiang Rao, Hengchao Shang, Jinlong Yang, Daimeng Wei, Zongyao Li, Lizhi Lei, and Hao Yang. 2023. Length-Aware NMT and Adaptive Duration for Automatic Dubbing. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Ricardo Rei, José GC de Souza, Duarte Alves, Chrysoula Zerva, Ana C Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and André FT Martins. 2022. Comet-22: Unbabel-ist 2022 submission for the metrics shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578–585.
Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020a. Comet: A neural framework for mt evaluation. arXiv preprint arXiv:2009.09025.
Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020b. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2022. Fastspeech 2: Fast and high-quality end-to-end text to speech.
Anthony Rousseau, Paul Deléglise, and Yannick Esteve. 2014. Enhancing the ted-lium corpus with selected data for language modeling and more ted talks. In LREC.
Elizabeth Salesky, Kareem Darwish, Mohamed Al-Badrashiny, Mona Diab, and Jan Niehues. 2023. Evaluating Multilingual Speech Translation Under Realistic Conditions with Resegmentation and Terminology. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.
Elizabeth Salesky, Matthew Wiesner, Jacob Bremerman, Roldano Cattoni, Matteo Negri, Marco Turchi, Douglas W. Oard, and Matt Post. 2021. The Multilingual TEDx Corpus for Speech Recognition and Translation. In Proc. Interspeech 2021, pages 3655–3659.
Gabriele Sarti, Phu Mon Htut, Xing Niu, Benjamin Hsu, Anna Currey, Georgiana Dinu, and Maria Nadejde. 2023. RAMP: Retrieval and attribute-marking enhanced prompting for attribute-controlled translation.
Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin, and Angela Fan. 2021. CCMatrix: Mining billions of high-quality parallel sentences on the web. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6490–6500, Online. Association for Computational Linguistics.
Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.
Hengchao Shang, Zhiqiang Rao, Zongyao Li, Zhanglin Wu, Jiaxin Guo, Minghan Wang, Daimeng Wei, Shaojun Li, Zhengzhe Yu, Xiaoyu Chen, Lizhi Lei, and Hao Yang. 2023. The HW-TSC’s Simultaneous Speech-to-Speech Translation system for IWSLT 2023 evaluation. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, and Ming Li. 2020. Aishell-3: A multi-speaker mandarin tts corpus and the baselines. arXiv preprint arXiv:2010.11567.
Kun Song, Yi Lei, Peikun Chen, Yiqing Cao, Kun Wei, Yongmao Zhang, Lei Xie, Ning Jiang, and Guoqing Zhao. 2023. The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2020. Multilingual translation with extensible multilingual pretraining and finetuning. arXiv preprint arXiv:2008.00401.
Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 2214–2218, Istanbul, Turkey. European Language Resources Association (ELRA).
Ioannis Tsiamas, Gerard I. Gállego, Jose Fonollosa, and Marta R. Costa-jussà. 2023. Speech Translation with Foundation Models and Optimal Transport: UPC at IWSLT23. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, and Marta R. Costa-jussà. 2022. SHAS: Approaching optimal Segmentation for End-to-End Speech Translation. In Proc. Interspeech 2022, pages 106–110.
Priyesh Vakharia, Shree Vignesh S, Pranjali Basmatkar, and Ian Lane. 2023. Low-Resource Formality Controlled NMT Using Pre-trained LM. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In Proceedings of NIPS 2017.
Akshaya Vishnu, Kudlu Shanbhogue, Ran Xue, Soumya Saha, Daniel Zhang, and Ashwinkumar Ganesan. 2023. Amazon Alexa AI’s Low-Resource Speech Translation System for IWSLT2023. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Changhan Wang, Juan Pino, Anne Wu, and Jiatao Gu. 2020a. Covost: A diverse multilingual speech-to-text translation corpus. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 4197–4203.
Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, and Juan Pino. 2020b. fairseq s2t: Fast speech-to-text modeling with fairseq. arXiv preprint arXiv:2010.05171.
Changhan Wang, Anne Wu, Jiatao Gu, and Juan Pino. 2021. CoVoST 2 and Massively Multilingual Speech Translation. In Proc. Interspeech 2021, pages 2247–2251.
Minghan Wang, Yinglu Li, Jiaxin Guo, Zongyao Li, Hengchao Shang, Daimeng Wei, Min Zhang, Shimin Tao, and Hao Yang. 2023a. The HW-TSC’s Speech-to-Speech Translation System for IWSLT 2023. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Zhipeng Wang, Yuhang Guo, and Shuoying Chen. 2023b. BIT’s System for Multilingual Track. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Patrick Wilken, Panayota Georgakopoulou, and Evgeny Matusov. 2022. SubER - a metric for automatic evaluation of subtitle quality. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 1–10, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Aiden Williams. 2022. The applicability of Wav2Vec 2.0 for low-resource Maltese ASR. B.S. thesis, University of Malta.
Aiden Williams, Kurt Abela, Rishu Kumar, Martin Bär, Hannah Billinghurst, Kurt Micallef, Ahnaf Mozib Samin, Andrea DeMarco, Lonneke van der Plas, and Claudia Borg. 2023. UM-DFKI Maltese Speech Translation. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Zhanglin Wu, Zongyao Li, Daimeng Wei, Hengchao Shang, Jiaxin Guo, Xiaoyu Chen, Zhiqiang Rao, Zhengzhe Yu, Jinlong Yang, Shaojun Li, Yuhao Xie, Bin Wei, Jiawei Zheng, Ming Zhu, Lizhi Lei, Hao Yang, and Yanfei Jiang. 2023. Improving Neural Machine Translation Formality Control with Domain Adaptation and Reranking-based Transductive Learning. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Zhihang Xie. 2023. The BIGAI Offline Speech Translation Systems for IWSLT 2023 Evaluation. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Henry Li Xinyuan, Neha Verma, Bismarck Bamfo Odoom, Ujvala Pradeep, Matthew Wiesner, and Sanjeev Khudanpur. 2023. JHU IWSLT 2023 Multilingual Speech Translation System Description. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Chen Xu, Bojie Hu, Yanyang Li, Yuhao Zhang, Shen Huang, Qi Ju, Tong Xiao, and Jingbo Zhu. 2021. Stacked acoustic-and-textual encoding: Integrating the pre-trained models into speech translation encoders.
Wenda Xu, Xian Qian, Mingxuan Wang, Lei Li, and William Yang Wang. 2022. Sescore2: Retrieval augmented pretraining for text generation evaluation. arXiv preprint arXiv:2212.09305.
Brian Yan, Jiatong Shi, Soumi Maiti, William Chen, Xinjian Li, Yifan Peng, Siddhant Arora, and Shinji Watanabe. 2023. CMU’s IWSLT 2023 Simultaneous Speech Translation System. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Zhengdong Yang, Shuichiro Shimizu, Sheng Li, Wangjin Zhou, and Chenhui Chu. 2023. The Kyoto Speech-to-Speech Translation System for IWSLT 2023. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Zhuoyuan Yao, Di Wu, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, and Xin Lei. 2021. Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. arXiv preprint arXiv:2102.01547.
Rong Ye, Chengqi Zhao, Tom Ko, Chutong Meng, Tao Wang, Mingxuan Wang, and Jun Cao. 2023. Gigast: A 10,000-hour pseudo speech translation corpus. In Interspeech 2023.
Xinyuan Zhou, Jianwei Cui, Zhongyi Ye, Yichi Wang, Luzhen Xu, Hanyi Zhang, Weitai Zhang, and Lirong Dai. 2023. Submission of USTC’s system for the IWSLT 2023 - Offline Speech Translation Track. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT).
Adrian Łańcucki. 2021. Fastpitch: Parallel text-to-speech with pitch prediction. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6588–6592, Toronto, Canada. IEEE.
Appendix A. Human Evaluation
Human evaluation was carried out for the Simultaneous and Offline SLT shared tasks. At the time of
writing, only the former evaluation has been completed; it is reported here. The human evaluation of
the Offline Task will be presented during the conference and possibly in an updated version of this report.
2018), distributed by the Japan Translation Federation (JTF). The guidelines are based on MQM but include
some modifications that account for the properties of the Japanese language.
We hired a professional interpreter who is a native speaker of Japanese as the evaluator; last year's
evaluator was a translator (Anastasopoulos et al., 2022a). The evaluator checked each translation hypothesis
against its source speech transcript and recorded the corresponding error category and severity for each
hypothesis in a spreadsheet. We asked the evaluator to focus only on Accuracy and Fluency errors, because
the other error types (Terminology, Style, and Locale convention) are less serious in the evaluation of
simultaneous translation. Finally, we calculated the cumulative error score for each system using the error
weighting presented by Freitag et al. (2021a), in which Critical and Major errors are not distinguished.
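The cumulative error score described above can be sketched as a weighted count of annotated errors. The weights (5 for Critical and Major, 1 for Minor) are the ones stated in the caption of Table 23; the function name and the sample annotations are hypothetical and only illustrate the arithmetic, not the evaluation tooling actually used.

```python
from collections import Counter

# Weights as stated in the Table 23 caption: Critical and Major errors
# both count 5, Minor errors count 1.
ERROR_WEIGHTS = {"critical": 5, "major": 5, "minor": 1}


def cumulative_error_score(severities):
    """Sum the weights of all annotated errors for one system (lower is better)."""
    counts = Counter(severities)
    unknown = set(counts) - set(ERROR_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown severity labels: {sorted(unknown)}")
    return sum(ERROR_WEIGHTS[label] * n for label, n in counts.items())


# Hypothetical annotations for one system (not data from the evaluation).
annotations = ["major", "minor", "minor", "critical", "minor"]
score = cumulative_error_score(annotations)
print(score)
```

Because Critical and Major share a weight, merging the two labels before scoring would give the same result.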
Appendix B. Automatic Evaluation Results and Details
B.1 Offline SLT
⋅ Systems are ordered according to the BLEU score computed on the concatenation of the three test sets
(Joint BLEU, third column).
⋅ The “D” column indicates the data condition in which each submitted run was trained, namely:
Constrained (C), constrained+LLM (C+), Unconstrained (U).
⋅ For the BLEU scores computed on the TED test set, “Orig” and “New” respectively indicate the results
computed on the original (subtitle-like) TED translations and the unconstrained (exact, more literal)
translations as references.
⋅ Direct systems are indicated by gray background.
⋅ “*” indicates a late submission.
⋅ “+” indicates an unofficial submission.
                 Joint            TED BLEU (Ref)        TED COMET (Ref)    ACL              EPTV
System      D    BLEU   COMET     New    Orig   Both    New      Orig      BLEU   COMET     BLEU   COMET
HW-TSC      C    32.4   0.8213    34.8   30.2   42.1    0.8327   0.8208    38.1   0.8090    16.7   0.3829
HW-TSC      U    32.3   0.8209    34.9   30.9   42.4    0.8331   0.8223    36.9   0.8073    16.9   0.3819
HW-TSC      C+   31.9   0.8210    34.4   30.6   41.9    0.8332   0.8230    37.2   0.8063    16.8   0.3823
NeuroDub+   U    30.4   0.8089    31.8   25.8   38.5    0.8205   0.8082    41.1   0.7956    15.4   0.3784
NEMO        C    28.5   0.7759    30.5   26.4   37.7    0.7977   0.7871    31.9   0.7171    15.6   0.3680
UPC         C+   27.9   0.7892    29.8   25.5   36.6    0.8098   0.7985    32.1   0.7473    15.6   0.3746
I2R         C+   22.4   0.7070    24.0   20.3   29.5    0.7248   0.7172    23.9   0.6841    13.3   0.3506
BIGAI*      C+   20.3   0.6945    22.3   19.3   27.4    0.7128   0.7055    19.6   0.6295    11.5   0.3555
Table 14: Official results of the automatic evaluation for the Offline Speech Translation Task, English to German.
Table 15: Official results of the automatic evaluation for the Offline Speech Translation Task, English to Japanese.
Table 16: Official results of the automatic evaluation for the Offline Speech Translation Task, English to Chinese.
B.2 Simultaneous SLT
Table 17: Simultaneous Speech-to-Text Translation, English to German. Except for AP, the latency is measured in
seconds. Numbers in brackets are computation-aware latency.
Table 18: Simultaneous Speech-to-Text Translation, English to Chinese. Except for AP, the latency is measured in
seconds. Numbers in brackets are computation-aware latency.
Table 19: Simultaneous Speech-to-Text Translation, English to Japanese. Except for AP, the latency is measured
in seconds. Numbers in brackets are computation-aware latency.
Target Language   Team     ASR BLEU   BLASER   Start Offset   End Offset   ATD
German            CMU      22.62       0.122   2.37           5.21         4.22
German            HW-TSC   19.74      -0.442   2.04           5.09         3.75
Japanese          HW-TSC   15.53      -1.70    2.37           3.48         3.56
Japanese          NAIST    10.19      -1.68    2.58           4.32         3.49
Chinese           HW-TSC   31.68      -0.696   1.92           3.12         3.23
Table 20: Simultaneous Speech-to-Speech from English Speech. The latency is measured in seconds. The BLEU
scores are computed on transcripts produced by the default Whisper (Radford et al., 2022) ASR model for each
language direction.
Common Non-native
Number of audios 42 43
Mean audio length (seconds) 400.3 208.8
Mean ratings per audio 65.6 36.5
Table 21: Human evaluation for the English-to-German task on two test sets: the Common one (used also in
automatic scoring) and the Non-native one. We show the size of the test sets, and the number of ratings collected.
On average, our annotators provide a quality judgement every 6 seconds.
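The rating interval quoted in the caption can be checked directly from the two rows of Table 21. The short sketch below divides the mean audio length by the mean number of ratings per audio; the variable names are illustrative only.

```python
# (mean audio length in seconds, mean ratings per audio), copied from Table 21.
test_sets = {
    "Common": (400.3, 65.6),
    "Non-native": (208.8, 36.5),
}

# Average number of seconds between two consecutive quality judgements.
seconds_per_rating = {
    name: length / ratings for name, (length, ratings) in test_sets.items()
}

for name, interval in seconds_per_rating.items():
    print(f"{name}: one rating every {interval:.1f} s")
```

Both test sets come out close to the 6-second figure given in the caption.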
              Common               Non-native
System        Mean   95% CI        Mean   95% CI
CUNI-KIT      3.10   3.04→3.16     1.63   1.54→1.72
FBK           3.08   3.02→3.14     1.26   1.20→1.30
HWTSC         2.91   2.85→2.98     2.04   1.92→2.15
NAIST         2.84   2.78→2.91     2.27   2.18→2.34
CMU           2.79   2.72→2.87     2.38   2.30→2.46
Interpreter   –      –             2.79   2.71→2.87
Table 22: Human evaluation results for English-to-German Simultaneous task on the 1–5 (worst-to-best) scale,
with 95% confidence intervals. We calculate a mean score for each annotated audio file, then a mean across
annotators (for each audio), then a mean across all audio files for each system. To compute confidence intervals,
we take the scores for annotated audios, perform 10,000x bootstrap resampling, compute the mean score for each
resample, then compute [2.5, 97.5] percentiles across the resampled means.
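The confidence-interval procedure in the caption of Table 22 can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the per-audio scores are made up, the function names are hypothetical, and linear interpolation between order statistics is one common percentile convention, not necessarily the one the organizers used.

```python
import random
import statistics


def percentile(sorted_values, q):
    """Percentile q (0-100) of a pre-sorted list, using linear interpolation."""
    if not sorted_values:
        raise ValueError("empty input")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def bootstrap_ci(scores, n_resamples=10_000, seed=0):
    """Return (mean, 2.5th percentile, 97.5th percentile) of resampled means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    # Resample the per-audio scores with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    return statistics.fmean(scores), percentile(means, 2.5), percentile(means, 97.5)


# Hypothetical per-audio mean scores on the 1-5 scale (not data from the paper).
audio_scores = [3.2, 2.8, 3.5, 3.0, 2.9, 3.4, 3.1, 2.7, 3.3, 3.0]
mean, low, high = bootstrap_ci(audio_scores)
print(f"{mean:.2f} {low:.2f}→{high:.2f}")
```

The printed line follows the "mean low→high" layout used in Table 22.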
Table 23: Human evaluation results on two talks (107 lines) in the English-to-Japanese Simultaneous speech-to-
text translation task. Error weights are 5 for Critical and Major errors and 1 for Minor errors.
B.3 Automatic Subtitling
                                       Subtitle quality   Translation quality       Subtitle compliance
team      condition  system   domain   SubER   Sigma      Bleu    ChrF    Bleurt    CPS     CPL      LPB
APPTEK    U          prmry    ALL      70.64   73.35      15.38   38.36   .4376     87.74   100.00   100.00
APPTEK    U          prmry    ted      59.72   74.33      23.74   49.14   .5683     92.58   100.00   100.00
APPTEK    U          prmry    eptv     73.98   67.09      15.81   45.21   .5229     86.65   100.00   100.00
APPTEK    U          prmry    pltn     77.63   72.79      10.47   33.18   .4069     88.98   100.00   100.00
APPTEK    U          prmry    itv      69.83   74.48      14.43   35.27   .4028     86.01   100.00   100.00
MATESUB   U          prmry    ALL      75.41   65.22      14.81   39.50   .4591     84.97    99.25   100.00
MATESUB   U          prmry    ted      67.70   62.01      20.37   50.05   .5500     90.55    98.61   100.00
MATESUB   U          prmry    eptv     87.04   57.73      12.08   43.59   .4705     88.59    99.20   100.00
MATESUB   U          prmry    pltn     79.72   68.27      10.06   34.46   .4264     89.17    99.29   100.00
MATESUB   U          prmry    itv      73.11   67.04      14.92   37.13   .4501     80.21    99.47   100.00
APPTEK    C          prmry    ALL      77.05   72.50      12.74   34.31   .3420     93.35   100.00   100.00
APPTEK    C          prmry    ted      59.61   74.29      26.78   50.93   .5539     97.33   100.00   100.00
APPTEK    C          prmry    eptv     76.25   68.49      14.43   42.37   .4604     95.76   100.00   100.00
APPTEK    C          prmry    pltn     80.72   69.56       9.40   31.20   .3419     93.45   100.00   100.00
APPTEK    C          prmry    itv      80.87   72.62       9.08   27.74   .2612     91.14   100.00   100.00
FBK       C          prmry    ALL      79.70   75.73      11.22   33.32   .3172     69.98    83.50    99.98
FBK       C          prmry    ted      63.85   76.79      21.48   50.31   .5511     71.39    79.83   100.00
FBK       C          prmry    eptv     79.76   69.04      13.20   42.69   .4722     74.95    82.08    99.91
FBK       C          prmry    pltn     83.71   74.02       7.73   30.17   .3137     70.02    84.20    99.96
FBK       C          prmry    itv      82.67   77.17       8.05   26.10   .2255     67.75    85.12   100.00
APPTEK    C          cntrstv  ALL      83.53   70.39       9.73   30.51   .2914     89.60   100.00   100.00
APPTEK    C          cntrstv  ted      68.47   72.97      19.07   46.17   .4921     90.53   100.00   100.00
APPTEK    C          cntrstv  eptv     81.69   66.36      11.46   39.25   .4150     94.57   100.00   100.00
APPTEK    C          cntrstv  pltn     86.37   69.79       7.08   27.89   .2780     91.50   100.00   100.00
APPTEK    C          cntrstv  itv      87.25   68.29       6.70   23.85   .2204     86.85   100.00   100.00
Table 24: Automatic evaluation results for the Subtitling Task: en→de. C and U stand for constrained and uncon-
strained training condition, respectively; prmry and cntrstv for primary and contrastive systems.
                                       Subtitle quality   Translation quality       Subtitle compliance
team      condition  system   domain   SubER   Sigma      Bleu    ChrF    Bleurt    CPS     CPL      LPB
MATESUB   U          prmry    ALL      68.11   68.37      22.34   47.38   .5059     86.07    99.52   100.00
MATESUB   U          prmry    ted      45.94   66.85      40.36   65.72   .7047     92.62    99.48   100.00
MATESUB   U          prmry    eptv     74.47   59.59      21.06   54.11   .5728     90.15    99.44   100.00
MATESUB   U          prmry    pltn     74.87   70.99      15.96   41.86   .4666     88.27    99.60   100.00
MATESUB   U          prmry    itv      71.25   71.06      18.50   41.07   .4592     81.93    99.51   100.00
APPTEK    C          prmry    ALL      71.68   74.99      18.67   40.21   .3637     95.42   100.00   100.00
APPTEK    C          prmry    ted      45.81   74.50      39.37   62.11   .6562     97.20   100.00   100.00
APPTEK    C          prmry    eptv     66.60   73.31      23.57   51.94   .5379     96.27   100.00   100.00
APPTEK    C          prmry    pltn     76.00   74.63      14.03   36.95   .3664     95.18   100.00   100.00
APPTEK    C          prmry    itv      80.20   75.90      11.37   29.75   .2487     94.67   100.00   100.00
FBK       C          prmry    ALL      73.31   74.44      17.79   39.54   .3419     77.00    91.34    99.99
FBK       C          prmry    ted      45.68   74.31      40.21   65.09   .6737     78.95    88.14   100.00
FBK       C          prmry    eptv     68.47   69.63      23.92   52.19   .5490     79.81    88.05   100.00
FBK       C          prmry    pltn     78.45   75.78      12.84   35.89   .3513     77.79    92.67    99.96
FBK       C          prmry    itv      82.00   76.16       9.33   27.14   .2063     74.67    92.94   100.00
Table 25: Automatic evaluation results for the Subtitling Task: en→es. Legend as in Table 24.
B.4 Multilingual Speech Translation
Below we show the Multilingual task (§5) results and overall rankings, ordered according to the
average chrF across all 10 target languages after resegmentation to the reference translations.
We also compare to the Offline submissions on the ACL 60-60 evaluation set
on the 3 language pairs used for the Offline task.
Finally, we show the scores for each metric (chrF, COMET, BLEU) per language pair for all systems.
Table 26: Overall task ranking with metrics averaged across all ten language pairs on the evaluation set.
We show the official task metric (chrF) as well as the unofficial metrics (COMET, BLEU, and English WER).
All metrics are calculated after resegmentation to reference transcripts and translations. Direct / end-to-end systems
are highlighted in gray.
                                     de                      ja                      zh
System      Task   Constrained?      COMET       BLEU        COMET       BLEU        COMET       BLEU
USTC        Off.   –                 –           –           –           –           85.4 (1)    58.0 (1)
HW-TSC      Off.   ✓                 80.9 (2)    38.1 (3)    84.4 (3)    30.1 (7)    84.0 (2)    53.0 (2)
JHU         Mult.  –                 81.3 (1)    41.2 (1)    84.7 (1)    33.9 (4)    82.0 (3)    46.5 (11)
HW-TSC      Off.   –                 80.7 (3)    36.9 (6)    84.7 (1)    30.7 (6)    84.0 (2)    52.8 (3)
HW-TSC      Off.   ✓ + LLM           80.6 (4)    37.2 (5)    84.6 (2)    30.7 (6)    84.0 (2)    53.0 (2)
NeuroDub    Off.   –                 79.6 (5)    41.1 (2)    –           –           –           –
USTC        Off.   –                 –           –           –           –           80.0 (4)    52.5 (4)
KIT pr      Mult.  ✓ + LLM           74.9 (6)    37.5 (4)    82.0 (4)    35.7 (1)    79.3 (5)    49.4 (6)
KIT c1      Mult.  ✓ + LLM           74.6 (8)    36.5 (7)    82.0 (4)    35.2 (2)    79.3 (5)    49.7 (5)
KIT c2      Mult.  ✓ + LLM           74.3 (9)    36.5 (7)    81.6 (6)    34.0 (3)    78.6 (10)   49.4 (6)
KIT c3      Mult.  ✓ + LLM           74.7 (7)    36.1 (9)    81.4 (7)    33.3 (5)    78.4 (11)   48.6 (7)
KIT c4      Mult.  ✓ + LLM           74.2 (10)   36.4 (8)    81.7 (5)    33.9 (4)    78.4 (11)   48.2 (8)
KIT c5      Mult.  ✓ + LLM           74.9 (6)    33.8 (10)   80.3 (8)    27.3 (8)    79.1 (6)    46.7 (10)
UPC         Off.   ✓ + LLM           74.7 (7)    32.1 (12)   –           –           –           –
KIT c6      Mult.  ✓ + LLM           73.9 (11)   32.9 (11)   80.0 (9)    26.6 (9)    78.9 (7)    45.7 (13)
KIT c7      Mult.  ✓ + LLM           73.9 (11)   32.9 (11)   80.3 (8)    25.6 (10)   78.8 (8)    46.0 (12)
Xiaomi      Off.   ✓ + LLM           –           –           –           –           78.7 (9)    46.5 (11)
NiuTrans    Off.   ✓                 –           –           –           –           77.3 (12)   47.1 (9)
NeMo        Off.   ✓                 71.7 (12)   31.9 (13)   77.7 (10)   24.9 (11)   74.0 (13)   41.8 (14)
I2R         Off.   ✓ + LLM           68.4 (13)   23.9 (14)   –           –           –           –
JHU         Mult.  ✓ + LLM           59.0 (15)   23.7 (15)   69.3 (11)   18.9 (12)   67.9 (15)   37.4 (16)
MINE-Trans  Off.   –                 –           –           –           –           70.0 (14)   39.9 (15)
BIGAI*      Off.   ✓ + LLM           63.0 (14)   19.6 (16)   67.7 (12)   10.4 (13)   65.3 (16)   27.4 (18)
MINE-Trans  Off.   ✓                 –           –           –           –           63.5 (17)   31.8 (17)
BIT         Mult.  ✓                 47.2 (16)   11.1 (17)   56.2 (13)   8.0 (14)    55.7 (18)   19.8 (19)
Table 27: Submissions from all tracks on the ACL 60-60 evaluation sets on the three language pairs shared across
tracks (En → De, Ja, Zh), ordered by average metric ranking. Direct / end-to-end systems are highlighted in gray.
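The ordering "by average metric ranking" can be illustrated with the bracketed ranks from Table 27. The sketch below assumes that each system's ranks are averaged over only the metric columns it has a score for; that assumption is ours, though it reproduces the order of the rows used here. The system labels and the helper name are illustrative.

```python
from statistics import fmean

# Bracketed ranks copied from Table 27 (de, ja, zh; COMET then BLEU).
# Systems that did not submit to a language pair have fewer ranks.
ranks = {
    "KIT pr (Mult.)": [6, 4, 4, 1, 5, 6],
    "JHU (Mult.)": [1, 1, 1, 4, 3, 11],
    "USTC (Off.)": [1, 1],
    "HW-TSC (Off., constrained)": [2, 3, 3, 7, 2, 2],
}


def order_by_average_rank(rank_table):
    """Sort systems by mean rank over the metrics they were scored on (assumed rule)."""
    return sorted(rank_table, key=lambda system: fmean(rank_table[system]))


ordering = order_by_average_rank(ranks)
for system in ordering:
    print(f"{system}: {fmean(ranks[system]):.2f}")
```

Note that several systems in the full table tie on this average (for example at 3.5), so the rule alone does not fix their relative order.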
Submission          ar     de     fa     fr     ja     nl     pt     ru     tr     zh     Avg.
JHU unconstrained   62.4   67.6   57.8   73.4   42.0   71.6   75.0   56.8   62.5   42.2   61.1
KIT primary         56.9   64.8   55.4   67.8   42.3   67.6   69.6   51.2   57.3   42.5   57.5
KIT contrastive1    56.9   64.6   55.6   67.8   42.0   67.6   69.6   51.2   56.7   42.7   57.5
KIT contrastive2    56.1   63.6   52.9   67.3   40.8   66.5   69.2   50.6   55.6   41.3   56.4
KIT contrastive4    56.2   63.3   53.0   67.2   40.7   66.5   68.8   50.4   55.1   40.3   56.2
KIT contrastive3    55.5   63.7   52.1   66.9   40.3   66.0   68.9   50.0   55.2   40.6   55.9
KIT contrastive5    55.3   61.3   53.8   65.2   35.9   63.7   67.3   48.6   54.9   39.2   54.5
KIT contrastive7    54.7   60.3   54.0   64.4   34.5   63.4   67.2   47.8   54.2   38.2   53.9
KIT contrastive6    54.6   60.3   52.7   64.3   35.5   62.7   66.4   48.2   53.8   38.4   53.7
JHU constrained     45.2   53.4   44.5   62.4   26.8   62.1   62.2   46.8   46.3   30.8   48.1
BIT                 28.9   36.8   28.8   45.2   14.5   41.7   43.0   28.4   25.9   17.2   31.0
Table 28: chrF with resegmentation for each target language on the evaluation set, sorted by the system average.
Direct / end-to-end systems are highlighted in gray.
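The "Avg." column of Table 28, which determines the Multilingual task ranking, is the unweighted mean of the ten per-language chrF scores. The sketch below recomputes it for three rows copied from the table; the dictionary layout and names are illustrative only.

```python
from statistics import fmean

LANGUAGES = ["ar", "de", "fa", "fr", "ja", "nl", "pt", "ru", "tr", "zh"]

# Per-language chrF scores copied from Table 28, in the column order above.
chrf = {
    "BIT": [28.9, 36.8, 28.8, 45.2, 14.5, 41.7, 43.0, 28.4, 25.9, 17.2],
    "KIT primary": [56.9, 64.8, 55.4, 67.8, 42.3, 67.6, 69.6, 51.2, 57.3, 42.5],
    "JHU unconstrained": [62.4, 67.6, 57.8, 73.4, 42.0, 71.6, 75.0, 56.8, 62.5, 42.2],
}

# Unweighted mean over the ten target languages, rounded as in the table.
averages = {system: round(fmean(scores), 1) for system, scores in chrf.items()}

# Rank systems from highest to lowest average chrF.
ranking = sorted(averages, key=averages.get, reverse=True)
for system in ranking:
    print(f"{system}: {averages[system]}")
```

The recomputed averages match the reported values of 61.1, 57.5, and 31.0.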
Submission ar de fa fr ja nl pt ru tr zh Avg.
JHUunconstrained 82.7 81.3 80.6 81.4 84.7 84.1 84.9 78.9 82.5 82.0 82.3
KITprimary 78.0 74.9 75.8 74.4 82.0 77.7 78.4 72.5 76.6 79.3 77.0
KITcontrastive1 77.7 74.6 75.7 74.5 82.0 77.6 78.4 72.2 76.4 79.3 76.8
KITcontrastive5 78.5 74.9 75.9 74.6 80.3 76.8 78.5 71.6 76.9 79.1 76.7
KITcontrastive7 78.2 73.9 76.3 74.2 80.3 76.7 80.3 71.3 76.2 78.8 76.6
KITcontrastive2 77.3 74.3 74.9 74.3 81.6 77.3 78.4 72.1 75.8 78.6 76.5
KITcontrastive4 77.2 74.2 75.0 74.3 81.7 77.3 78.2 72.0 75.5 78.4 76.4
KITcontrastive3 76.9 74.7 74.6 74.2 81.4 76.9 78.2 71.8 75.7 78.4 76.3
KITcontrastive6 77.8 73.9 75.2 73.3 80.0 75.4 77.7 70.8 75.7 78.9 75.9
JHUconstrained 67.9 59.0 66.1 63.2 69.3 66.2 67.8 62.0 64.0 67.9 65.3
BIT 52.8 47.2 48.7 52.2 56.2 53.8 54.8 47.7 48.0 55.7 51.7
Table 29: COMET with resegmentation for each target language on the evaluation set, sorted by the system average.
Direct / end-to-end systems are highlighted in gray.
Submission ar de fa fr ja nl pt ru tr zh Avg.
JHUunconstrained 33.4 41.2 35.0 50.0 33.9 44.8 51.7 27.9 28.1 46.5 39.3
KITprimary 25.9 37.5 29.8 41.3 35.7 40.4 44.3 22.4 21.8 49.4 34.9
KITcontrastive1 25.6 37.5 30.1 41.1 35.2 40.6 44.5 22.6 21.3 49.7 34.8
KITcontrastive2 24.7 36.5 28.0 42.4 34.0 38.8 43.8 21.9 20.6 49.4 34.0
KITcontrastive4 24.4 36.4 28.4 42.1 33.9 38.9 43.0 21.6 20.3 48.2 33.7
KITcontrastive3 24.0 36.1 27.6 41.9 33.3 38.2 43.6 21.5 20.1 48.6 33.5
KITcontrastive5 23.7 33.8 28.7 39.6 27.3 35.9 40.7 19.6 20.6 46.7 31.7
KITcontrastive7 23.4 32.9 28.6 38.8 25.6 36.0 40.9 19.1 20.1 46.0 31.1
KITcontrastive6 23.0 32.9 28.3 38.9 26.6 35.0 39.7 19.7 19.1 45.7 30.9
JHUconstrained 15.0 23.7 21.9 33.1 18.9 31.3 33.2 17.2 12.8 37.4 24.5
BIT 5.7 11.1 7.4 19.7 8.0 16.3 18.6 6.3 4.1 19.8 11.7
Table 30: BLEU with resegmentation for each target language on the evaluation set, sorted by the system average.
BLEU scores in grey are calculated using language-specific tokenization (ja) or at the character-level (zh); see §5.2
for specific tokenization details. Direct / end-to-end systems are highlighted in gray.
B.5 Speech-to-Speech Translation
Table 31: Official results of the automatic evaluation for the English to Chinese Speech-to-Speech Translation
Task.
Table 32: Official results of the human evaluation for the English to Chinese Speech-to-Speech Translation Task.
B.6 Dialectal SLT
Table 33: Automatic evaluation results for the Dialect Speech Translation task, Unconstrained Condition. Systems
are ordered in terms of the official metric BLEU on test3. We also report brevity penalty (bp) and unigram precision
(pr1) of BLEU, chrF, and TER.
Table 34: Automatic evaluation results for the Dialect Speech Translation task, Constrained Condition.
ASR System                           test2 WER↓    test2 CER↓    test3 WER↓    test3 CER↓
                                     Orig  Norm    Orig  Norm    Orig  Norm    Orig  Norm
JHU / constrained / primary          70.3  43.7    30.7  22.7    74.0  44.9    33.1  24.8
JHU / unconstrained / primary        69.3  40.6    29.0  20.7    72.9  41.6    31.5  22.9
USTC / constrained / primary         49.5  40.8    24.2  20.9    52.3  43.2    27.1  23.8
USTC / unconstrained / primary       47.4  39.3    23.1  20.0    49.2  40.5    25.2  22.1
2022 best: ON-TRAC / unconstrained   65.7  41.5    28.1  21.1    -     -       -     -
Table 35: Word Error Rate (WER) and Character Error Rate (CER) of the ASR component of submitted cascaded
systems on test2 and test3. The original version (Orig) matches the minimal text pre-processing provided by the
organizer’s data preparation scripts, and results in relatively high WER. As diagnosis, we ran additional Arabic-
specific normalization (Norm) for e.g. Alif, Ya, Ta-Marbuta on the hypotheses and transcripts before computing
WER/CER. We are grateful to Ahmed Ali for assistance on this.
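The kind of Arabic-specific normalization described in the Table 35 caption can be sketched as follows. This is a minimal illustration, not the organizers' script: the exact character mappings (Alif variants to bare Alif, Alif Maqsura to Ya, Ta-Marbuta to Ha) and the function name `normalize_arabic` are assumptions chosen to match the three character classes the caption names.

```python
# Assumed mappings for illustration only; the caption names the character
# classes (Alif, Ya, Ta-Marbuta) but not the exact rules that were applied.
ALIF_VARIANTS = "\u0623\u0625\u0622"  # Alif with hamza above, hamza below, madda
BARE_ALIF = "\u0627"
ALIF_MAQSURA = "\u0649"
YA = "\u064A"
TA_MARBUTA = "\u0629"
HA = "\u0647"


def normalize_arabic(text: str) -> str:
    """Collapse common orthographic variants before computing WER/CER."""
    for variant in ALIF_VARIANTS:
        text = text.replace(variant, BARE_ALIF)
    text = text.replace(ALIF_MAQSURA, YA)
    text = text.replace(TA_MARBUTA, HA)
    return text
```

Applying the same function to both hypotheses and transcripts keeps the comparison symmetric, which is what the caption describes for the Norm columns.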
B.7 Low-Resource SLT
Table 36: Automatic evaluation results for the Irish to English task, Constrained Condition.
Table 37: Automatic evaluation results for the Irish to English task, Unconstrained Condition.
Table 38: Automatic evaluation results for the Marathi to Hindi task, Constrained Condition.
Table 39: Automatic evaluation results for the Marathi to Hindi task, Unconstrained Condition.
Pashto→French (Unconstrained Condition)
BLEU
Team System valid test
ON-TRAC primary 24.82 24.87
ON-TRAC contrastive1 23.38 23.87
GMU primary 11.99 16.87
GMU contrastive1 11.27 15.24
ON-TRAC contrastive2 12.26 15.18
ON-TRAC contrastive3 12.16 15.07
GMU contrastive2 9.72 13.32
Table 40: Automatic evaluation results for the Pashto to French task, Unconstrained Condition.
Table 41: Automatic evaluation results for the Pashto to French task, Constrained Condition.
Table 42: Automatic evaluation results for the Maltese to English task, Unconstrained Condition.
Table 43: Automatic evaluation results for the Tamasheq to French task, Constrained Condition.
Tamasheq→French (Unconstrained Condition)
Team System BLEU chrF2 TER
NAVER primary 23.59 49.84 64.00
NAVER contrastive1 21.31 48.15 66.41
NAVER contrastive2 18.73 46.11 70.32
ON-TRAC primary 15.88 43.88 73.85
ON-TRAC contrastive1 16.35 44.22 74.26
ON-TRAC contrastive2 15.46 43.59 75.30
ON-TRAC contrastive3 15.49 43.74 75.07
ON-TRAC contrastive4 16.25 44.11 74.26
ON-TRAC contrastive5 15.54 43.91 75.08
Alexa AI primary 9.30 32.29 81.25
Alexa AI contrastive1 8.87 32.04 81.03
Alexa AI contrastive2 9.50 33.67 80.85
Alexa AI contrastive3 9.28 32.86 82.33
GMU primary 8.03 33.03 87.81
GMU contrastive1 1.30 23.63 96.72
GMU contrastive2 2.10 24.33 94.58
Table 44: Automatic evaluation results for the Tamasheq to French task, Unconstrained Condition.
Table 45: Automatic evaluation results for the Quechua to Spanish task, Constrained Condition. ChrF2 scores
were only taken into account for those systems that scored less than 5 points BLEU.
Table 46: Automatic evaluation results for the Quechua to Spanish task, Unconstrained Condition. ChrF2 scores
were only taken into account for those systems that scored less than 5 points BLEU.
B.8 Formality Control for SLT
                              EN-KO                          EN-VI
Model                         BLEU  COMET   mACC  cACC      BLEU  COMET   mACC  cACC
CONSTRAINED
COCOA (baseline)         F    11.1  0.5044  28.5  55        43.2  0.6189  99    99
                         IF   11.1  0.5125  80.4  58        41.5  0.6021  98    99
Table 47: Results for the Formality Track (Supervised Setting). Most systems perform well in this setting, though
MT quality on formal (F) tends to be higher than informal (IF)
                              EN-PT                          EN-RU
Model                         BLEU  COMET   mACC  cACC      BLEU  COMET   mACC  cACC
CONSTRAINED
Table 48: Results for the Formality Track (Zero-shot Setting). Appreciable differences in formality control exist
between formal (F) and informal (IF), suggesting that formality bias exists in participant systems.
Evaluating Multilingual Speech Translation Under Realistic Conditions
with Resegmentation and Terminology
Abstract
1 Introduction
The NLP and speech communities are rapidly expanding, which has motivated increased interest in
multilingual scientific communication and accessibility. From the automatic captioning at NAACL 2019
provided by Microsoft to the current ACL 60-60 initiative1 for the 60th anniversary of ACL in 2022, it
is clear that transcription and translation in the technical domain is needed, desired, and still a
disproportionate challenge for current models compared to standard datasets in these spaces.
Translating technical presentations presents challenging conditions, from domain-specific terminology
and adaptation, to recordings often captured with a laptop microphone and light background noise,
diverse speaker demographics, and unsegmented speech typically 10-60 minutes in duration. We have
curated evaluation sets from presentations at ACL 2022 which have been professionally transcribed and
translated with the support of ACL and the 60-60 initiative. In this paper we describe the methodology
to create this dataset, considerations and methods to evaluate speech translation models with it, and
open challenges we believe this dataset may support research towards. We release all data and
intermediate steps to support further research in this space.
Figure 1: Multilingual translation of ACL presentations.
We present the ACL 60/60 evaluation sets to enable greater development of tools by the field for the
field. Specifically, we hope that this data enables further research into speech translation and other
NLP applications in the technical domain with resegmentation and terminology, given a diverse speaker
set and realistic recording conditions, with the goal of increased accessibility and multilinguality.
Our dataset is publicly available through the ACL Anthology.2
2 Evaluation under realistic conditions
Evaluating transcription and translation under realistic conditions may require different metrics than
evaluating with, e.g., provided segmentation. Here we present the necessary metrics in order to
discuss the dataset creation process.
2.1 Resegmentation
While most offline speech translation models are trained with provided segmentation, in an application
setting segmentation is unlikely to be provided.
1 https://www.2022.aclweb.org/dispecialinitiative
2 https://aclanthology.org/2023.iwslt-1.2
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 62–78
July 13-14, 2023 c 2023 Association for Computational Linguistics
Most models are typically unable to maintain output quality given audio of typical talk lengths (10+
minutes), necessitating the use of automatic segmentation methods. In order to evaluate output with
variable segmentation, resegmentation to a fixed reference is necessary.
The standard tool within the field for many years has been mwerSegmenter (Matusov et al., 2005), which
resegments model output to match a reference segmentation for downstream evaluation with various
metrics. This is done by dynamically resegmenting the output using a given tokenization to minimize
word error rate to the reference.3 We use mwerSegmenter for all scores in this paper and suggest that
resegmentation be the scoring standard for the ACL 60/60 dataset.
2.2 Evaluation metrics
We compare a variety of evaluation metrics to analyze both transcription and translation quality using
the evaluation sets, as well as the results of intermediate steps in corpus creation such as
post-editing.
For translation, we compare chrF (Popović, 2015), which is tokenization-agnostic and more appropriate
for a wider array of target languages than BLEU; BLEU (Papineni et al., 2002) as computed by SacreBLEU
(Post, 2018); and the model-based metric COMET (Rei et al., 2020), which often has higher correlation
with human judgements (Mathur et al., 2020) though is limited by language coverage in pretrained
models. For BLEU we use the suggested language-specific tokenizers in SacreBLEU for our non-space
delimited target languages, Japanese (MeCab4) and Chinese (character-level).
To analyze both automatic and post-editing transcription quality, we use word error rate (WER). We
note that we use case-sensitive and punctuation-sensitive WER here as these are both maintained in
system output during dataset creation in order to be post-edited and translated. For downstream
evaluation of ASR model quality using the final dataset, it may be desired to compute WER without case
and without punctuation; if so, the scores would not be directly comparable to those presented here.
We also use translation error rate (TER) (Snover et al., 2006) to assess the expected level of editing
necessary to match the final reference quality.5
We caution against using any one translation metric in isolation, and suggest chrF and COMET as the
standard evaluation metrics for this dataset.
3 Creating the ACL 60/60 evaluation sets
3.1 Languages
All data is originally spoken in English and then transcribed and translated to ten diverse languages
from the 60/60 initiative for which speech translation corpora are publicly available (see Table 5:
§A.3): Arabic, Mandarin Chinese, Dutch, French, German, Japanese, Farsi, Portuguese, Russian, and
Turkish. The resulting dataset contains three-way parallel (speech, transcripts, translations)
one-to-many data for ten language pairs, and multi-way parallel text data for 100 language pairs.
3.2 Data selection
Data was selected from the ACL 2022 paper presentations for which prerecorded audio or video
presentations were provided to the ACL Anthology. Talks were selected such that each of the two
evaluation sets, development and evaluation, would have approximately one hour total duration. Oral
presentations were advised to be up to 12 minutes per recording, resulting in 5 talks for each set
with relatively balanced durations of ∼11.5 minutes each. From the 324 available recordings, the final
10 were selected in order to balance speaker demographics, accents, and talk content, while lightly
controlling for recording conditions. The majority of recordings were created using laptop microphones
in quiet conditions, but background noise, microphone feedback, speech rate and/or volume in some
cases affected understanding of the content. We selected talks with representative but minimal noise
where conditions did not affect understanding of the content. We aimed for a gender balance
representative of conference participation,6 resulting in a 3:7 female:male speaker ratio. This is
also a global field with a wide variety of native and non-native English accents, which remains a
necessary challenge for speech models to address to mitigate performance biases (Sanabria et al., 2023;
Feng et al., 2021; Koenecke et al., 2020; Tatman and Kasten, 2017). Talks were chosen and assigned to
each set to maximize accent diversity, aiming for L1s from all continents with language families
frequently represented in the ACL community while balancing topic diversity and gender. We note native
language and country where available. Talks were chosen to cover a diverse set of tracks and topics
and therefore diverse technical vocabulary representative of the needs of the field. Where
presentations were chosen within the same track, they covered different focuses and methodology, e.g.
math word problems versus release note generation or few-shot adaptation for structured data. Metadata
for all talks with exact durations and track and speaker annotations are shown in Table 3 in §A.1.
Holding out speakers and topics per set optimizes for overall system generalization but reduces the
match between dev and eval sets. For example, this reduces the benefit of finetuning on the dev set to
maximize test set performance, and overfitting the model or chosen hyperparameters to the dev set will
adversely affect test set performance. However, high performance on both sets is more likely to
indicate generalizable systems and representative performance beyond these data points than if the dev
and eval data were more closely matched.
3.3 Automatic transcription
The first pass through the data used automatic segmentation and transcription to provide initial
transcripts. We used the Azure API speech-to-text service,7 which has the best cost and quality
balance of currently available models. In addition to transcription, the service performs speaker
diarization, with implicit voice activity detection (VAD), segmenting the initially ∼11.5 minute audio
files into segments of approximately 30 seconds or less based on pauses, speech, and non-speech
phenomena. Figure 2 shows the resulting distribution of segment lengths. Evaluating these initial
automatic transcripts against the final released version with resegmentation (§2.1), the automatic
transcription yielded a WER of 15.4 and 22.4 for the development and evaluation sets, respectively.
[Figure 2 panels: (a) Speech segment length distribution (x-axis: Seconds; y-axis: Num. Segments);
(b) Text segment length distribution (x-axis: Word count; y-axis: Num. Segments); series: VAD,
subtitles, sentences]
Figure 2: Distribution of English segment lengths via speech duration (seconds) and text length (word
count) for each of three segmentations: VAD, subtitles, and sentences.
3.4 Human post-editing: Transcription
We contracted with aiXplain Inc. to professionally post-edit the ASR output. There was a three-tier
review process: an initial annotator post-edited per segment, followed by a quality assurance (QA)
annotator who went through each full talk to ensure quality and consistency, and then finally 10-20%
of the segments were randomly chosen for a final check. In addition to semantic content, annotators
may theoretically also fix segmentation boundaries but in practice this rarely occurs. The annotators
provided additional information about the speakers, namely gender (male, female) and age (child, young
adult, adult, elderly). The annotators were also shown the video of the presentation to aid them in
recognizing technical terms, which may appear in the slides. Disfluencies were standardized such that
false starts and repetitions were kept where there were perceivable pauses between them, and two
hesitation spelling variations (ah, um) were used. The annotator guidelines and LabelStudio interface
are shown in §A.4. After the professional post-editing pass, a domain expert verified and corrected
the technical terms.
Post-editing analysis. ASR output is strongly monotonic with respect to the original speech, and
accordingly most post-edits are for incorrectly tran-
3 We use word-level tokenization for all languages except Japanese and Chinese here, where we use
character-level.
4 https://taku910.github.io/mecab/
5 We calculate TER with --ter-normalized and --ter-asian-support in SacreBLEU.
6 Aggregate conference participation statistics provided by ACL 2022; see §A.2.
7 https://azure.microsoft.com/en-us/products/cognitive-services/speech-to-text
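Resegmentation to a fixed reference (§2.1) can be illustrated with a minimal sketch. This is a simplification under stated assumptions and not the mwerSegmenter implementation: it computes a single word-level edit-distance alignment between the flat hypothesis and the concatenated reference, then assigns each hypothesis token to the reference segment it aligns to. The function name `resegment` and the rule that unaligned hypothesis tokens join the preceding segment are illustrative choices.

```python
from typing import List


def resegment(hyp_tokens: List[str], ref_segments: List[List[str]]) -> List[List[str]]:
    """Split a flat hypothesis into len(ref_segments) pieces via edit-distance alignment."""
    if not ref_segments:
        raise ValueError("at least one reference segment is required")
    ref_tokens = [tok for seg in ref_segments for tok in seg]
    seg_of_ref = [idx for idx, seg in enumerate(ref_segments) for _ in seg]
    n, m = len(hyp_tokens), len(ref_tokens)

    # cost[i][j]: edit distance between the first i hypothesis and first j reference tokens
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (hyp_tokens[i - 1] != ref_tokens[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Backtrace, recording the reference segment for every hypothesis token
    assignment = [0] * n
    i, j = n, m
    while i > 0:
        if j > 0 and cost[i][j] == cost[i - 1][j - 1] + (hyp_tokens[i - 1] != ref_tokens[j - 1]):
            assignment[i - 1] = seg_of_ref[j - 1]  # match or substitution
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            # extra hypothesis token: attach it to the preceding reference segment
            assignment[i - 1] = seg_of_ref[j - 1] if j > 0 else 0
            i -= 1
        else:
            j -= 1  # reference token with no hypothesis counterpart

    segments: List[List[str]] = [[] for _ in ref_segments]
    for token, seg_idx in zip(hyp_tokens, assignment):
        segments[seg_idx].append(token)
    return segments
```

The output always has one hypothesis segment per reference segment, so segment-level metrics can then be computed pairwise regardless of how the system originally segmented the audio.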
REF: we find a BILSTM ** CRF model using flare
HYP: we find a BIAS TM CRF model using flare
S D
REF: also FASTTEXT CHARACTER EMBEDDINGS
HYP: also FASTTEX KITCHEN BEDDINGS
S S S
REF: multilingual BERT PERFORMS better than BETO
HYP: multilingual BIRD PERFORM better than BETTER
S S S
Figure 4: Example of tagged terminology from dev. Terminology lists were not exhaustive; [text-to-speech]
did not appear, leading [text] and [speech] to be tagged separately.
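The REF/HYP alignments above count word-level substitutions and deletions, which is the basis of WER. The sketch below shows how case- and punctuation-sensitive WER, as used in this paper, differs from a normalized variant. It is an illustration under assumptions: the function names, the whitespace tokenization, and the use of Python's ASCII `string.punctuation` for stripping are not taken from the paper.

```python
import string
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance using two rolling rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j - 1] + (r != h), prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str, *, lowercase: bool = False, strip_punct: bool = False) -> float:
    """WER in percent; sensitive to case and punctuation unless told otherwise."""
    def prepare(text: str) -> List[str]:
        if lowercase:
            text = text.lower()
        if strip_punct:
            text = text.translate(str.maketrans("", "", string.punctuation))
        return text.split()

    ref_words, hyp_words = prepare(ref), prepare(hyp)
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hyp_words) / len(ref_words)
```

For example, `wer("We find a model.", "we find a model")` counts two substitutions under the sensitive setting but none with `lowercase=True, strip_punct=True`, which is why scores from the two settings are not directly comparable.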
Set   Metric  ar    de    fa    fr    ja    nl    pt    ru    tr    zh
dev   chrF    75.3  72.8  54.9  80.0  56.9  82.7  82.3  59.3  69.0  60.5
dev   BLEU    54.1  48.3  25.3  63.0  50.7  63.6  65.9  30.5  39.1  65.9
dev   COMET   86.2  83.6  76.8  84.5  89.1  88.1  87.9  82.5  85.9  87.4
eval  chrF    77.2  71.7  56.3  83.7  53.6  86.6  84.8  65.3  77.0  62.7
eval  BLEU    55.4  48.5  27.1  68.3  47.3  71.5  68.7  39.4  51.6  67.9
eval  COMET   86.2  83.6  79.5  84.5  89.1  88.1  87.9  82.5  85.9  87.4
Table 1: Evaluating the initial commercial MT from ground-truth transcripts against the final released references.
BLEU scores in grey are calculated using language-specific tokenization (ja) or at the character-level (zh); see §2.2.
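The spread of scores across target languages in Table 1 can be compared per metric with a few lines of code. The sketch below computes the ratio of BLEU variance to chrF variance over the ten languages for each set, using the values from Table 1. The use of population variance and the per-set computation are assumptions; the paper does not state how its variance comparison was calculated, so this shows one plausible procedure rather than a reproduction.

```python
from statistics import pvariance

# Scores copied from Table 1, in the column order ar de fa fr ja nl pt ru tr zh
TABLE1 = {
    "dev": {
        "chrF": [75.3, 72.8, 54.9, 80.0, 56.9, 82.7, 82.3, 59.3, 69.0, 60.5],
        "BLEU": [54.1, 48.3, 25.3, 63.0, 50.7, 63.6, 65.9, 30.5, 39.1, 65.9],
    },
    "eval": {
        "chrF": [77.2, 71.7, 56.3, 83.7, 53.6, 86.6, 84.8, 65.3, 77.0, 62.7],
        "BLEU": [55.4, 48.5, 27.1, 68.3, 47.3, 71.5, 68.7, 39.4, 51.6, 67.9],
    },
}


def variance_ratio(split: str) -> float:
    """Ratio of cross-language BLEU variance to chrF variance for one set."""
    scores = TABLE1[split]
    return pvariance(scores["BLEU"]) / pvariance(scores["chrF"])


if __name__ == "__main__":
    for split in ("dev", "eval"):
        print(f"{split}: BLEU/chrF variance ratio = {variance_ratio(split):.2f}")
```

A ratio above 1 for both sets indicates that BLEU separates the target languages more sharply than chrF does on the same MT output.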
We compare the distribution of segment lengths for each of the three approaches (VAD, subtitles, and
sentences) in terms of both duration (seconds) and number of words (English) in Figure 2. VAD results
in the most uneven distribution, with segments ranging from <1 second to >30 seconds. Subtitles result
in more uniform but distinctly shorter segments, with 58% containing fewer than 10 words and 19%
shorter than two seconds, likely too short for some downstream tasks or metrics. Sentences result in
less extreme segment lengths. Examples of each segmentation are shown in §A.8. The final data contains
468 sentences in the development set and 416 sentences in the evaluation set.
3.6 Machine translation
The first translation pass used publicly available bilingual MT models to translate the final sentence
segments. We used the ModernMT API9 for the 9 of 10 language pairs supported, and the Azure API10 for
English-Farsi. We evaluate the commercial machine translation output against the final released
translation references (§3.7) using the metrics discussed in §2.2, shown in Table 1.
Each metric suggests a different story about translation quality and the degree to which it is
language-specific. While COMET suggests relatively consistent performance across languages, chrF and
BLEU do not. chrF and BLEU suggest significantly worse performance for a subset of target languages,
including all but one of the non-Latin script and non-Indo-European languages. BLEU yields 1.7×
greater variance than chrF. By all metrics, though, MT quality was consistent between the development
and evaluation sets. We see in the next section that the amount of post-editing required to create the
final references, however, is not necessarily indicated by these metrics.
3.7 Human post-editing: Translation
Post-editing has become the industry standard due to its increased productivity, typically reducing
processing time and cognitive load compared to direct translation, particularly for domain-specific
texts (O’Brien, 2007; Groves and Schmidtke, 2009; Tatsumi, 2009; Plitt and Masselot, 2010).
We contracted with Translated to professionally post-edit the MT output. There was a two-tier review
process: an initial annotator who was a native speaker of the target language post-edited per segment,
followed by a second to review the output and consistency of the first. Annotator guidelines and the
post-editing interface are shown in §A.5.
Technical terms. Terminology was not handled separately during the MT step nor automatically tagged,
given that the MT systems may omit or incorrectly translate technical terms. We did not use
constrained decoding with the terminology list translations, as their validity could be
context-dependent and some terms had multiple possible translations. Instead, translation post-editors
were instructed to correct the translations of tagged terminology on the source if they were not
maintained and then tag the appropriate target translations for each tagged source span. Capitalized
acronyms and terminology not on the lists and unknown to the translators were left in English.
Post-editing analysis. While the metrics in the previous section give a sense for the automatic
translation quality, they do not necessarily reflect the effort required to post-edit the translations
to final reference quality. Using TER to assess the degree of post-editing necessary, we see in
Figure 5 that this varies by language. Most noticeably, we see that Farsi, Russian, and Japanese as
target languages required the highest amount of post-editing.
9 https://www.modernmt.com/api/
10 https://azure.microsoft.com/en-us/products/cognitive-services/translator
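The instruction that post-editors tag a target translation for each tagged source span suggests a simple consistency check. The sketch below assumes terminology is marked with square brackets, as in the Figure 4 caption (`[text]`, `[speech]`). The bracket format in the released files, the function names, and the German example sentence are assumptions for illustration only.

```python
import re
from typing import List

# Assumed tag format: a term enclosed in square brackets, with no nested brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def tagged_spans(text: str) -> List[str]:
    """Return the bracket-tagged terminology spans in order of appearance."""
    return TAG_PATTERN.findall(text)


def tags_are_consistent(source: str, target: str) -> bool:
    """True when source and target carry the same number of tagged spans."""
    return len(tagged_spans(source)) == len(tagged_spans(target))


if __name__ == "__main__":
    src = "We train a [text]-to-[speech] model."
    tgt = "Wir trainieren ein [Text]-zu-[Sprache]-Modell."
    print(tagged_spans(src), tagged_spans(tgt), tags_are_consistent(src, tgt))
```

A count mismatch only flags a segment for review. It cannot tell whether the tagged target span is the right translation, which is why the human review described above is still needed.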
[Figure: TER [dev] and TER [eval] by target language (ar, de, fa, fr, ja, nl, pt, ru, tr, zh)]
Figure 6: Degree of reordering done in MT post-editing.
Figure 7: Range in TER by talk per language.
Figure 8: Correlation in TER across languages.
International Conference on Spoken Language Translation (IWSLT 2021), pages 1–29, Bangkok, Thailand
    (online). Association for Computational Linguistics.
Ebrahim Ansari, Amittai Axelrod, Nguyen Bach, Ondřej Bojar, Roldano Cattoni, Fahim Dalvi, Nadir
    Durrani, Marcello Federico, Christian Federmann, Jiatao Gu, Fei Huang, Kevin Knight, Xutai Ma, Ajay
    Nagesh, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Xing Shi, Sebastian Stüker, Marco
    Turchi, Alexander Waibel, and Changhan Wang. 2020. FINDINGS OF THE IWSLT 2020 EVALUATION CAMPAIGN.
    In Proceedings of the 17th International Conference on Spoken Language Translation, pages 1–34,
    Online. Association for Computational Linguistics.
Ebrahim Ansari, Ondřej Bojar, Barry Haddow, and Mohammad Mahmoudi. 2021. SLTEV: Comprehensive
    evaluation of spoken language translation. In Proceedings of the 16th Conference of the European
    Chapter of the Association for Computational Linguistics: System Demonstrations, pages 71–79,
    Online. Association for Computational Linguistics.
Luisa Bentivogli, Mauro Cettolo, Marcello Federico, and Christian Federmann. 2018. Machine translation
    human evaluation: an investigation of evaluation based on post-editing and its relation with direct
    assessment. In Proceedings of the 15th International Conference on Spoken Language Translation,
    pages 62–69, Brussels. International Conference on Spoken Language Translation.
Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara
    Rivera, and Ankur Bapna. 2023. Fleurs: Few-shot learning evaluation of universal representations of
    speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 798–805.
Oliver Čulo and Jean Nitzke. 2016. Patterns of terminological variation in post-editing and of cognate
    use in machine translation in contrast to human translation. In Proceedings of the 19th Annual
    Conference of the European Association for Machine Translation, pages 106–114.
Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a
    Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American
    Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1
    (Long and Short Papers), pages 2012–2017, Minneapolis, Minnesota. Association for Computational
    Linguistics.
Siyuan Feng, Olya Kudina, Bence Mark Halpern, and Odette Scharenborg. 2021. Quantifying bias in
    automatic speech recognition. ArXiv, abs/2103.15122.
Kata Gábor, Haïfa Zargayouna, Davide Buscaldi, Isabelle Tellier, and Thierry Charnois. 2016. Semantic
    annotation of the ACL Anthology corpus for the automatic analysis of scientific literature. In
    Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16),
    pages 3694–3701, Portorož, Slovenia. European Language Resources Association (ELRA).
Toni Giorgino. 2009. Computing and visualizing dynamic time warping alignments in R: The dtw package.
    Journal of Statistical Software, 31(7).
Declan Groves and Dag Schmidtke. 2009. Identification and analysis of post-editing patterns for MT. In
    Proceedings of Machine Translation Summit XII: Commercial MT User Program, Ottawa, Canada.
Francisco Guzman, Hassan Sajjad, Stephan Vogel, and Ahmed Abdelali. 2013. The AMARA corpus: building
    resources for translating the web’s educational content. In Proceedings of the 10th International
    Workshop on Spoken Language Translation: Papers, Heidelberg, Germany.
Chris Hokamp and Qun Liu. 2017. Lexically constrained decoding for sequence generation using grid beam
    search. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics
    (Volume 1: Long Papers), pages 1535–1546, Vancouver, Canada. Association for Computational
    Linguistics.
J. Edward Hu, Huda Khayrallah, Ryan Culkin, Patrick Xia, Tongfei Chen, Matt Post, and Benjamin Van
    Durme. 2019. Improved lexically constrained decoding for translation and monolingual rewriting. In
    Proceedings of the 2019 Conference of the North American Chapter of the Association for
    Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages
    839–850, Minneapolis, Minnesota. Association for Computational Linguistics.
J. Iranzo-Sánchez, J. A. Silvestre-Cerdà, J. Jorge, N. Roselló, A. Giménez, A. Sanchis, J. Civera, and
    A. Juan. 2020. Europarl-ST: A multilingual corpus for speech translation of parliamentary debates.
    In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing
    (ICASSP), pages 8229–8233.
Yiping Jin, Min-Yen Kan, Jun-Ping Ng, and Xiangnan He. 2013. Mining scientific terms and their
    definitions: A study of the ACL Anthology. In Proceedings of the 2013 Conference on Empirical
    Methods in Natural Language Processing, pages 780–790, Seattle, Washington, USA. Association for
    Computational Linguistics.
Allison Koenecke, Andrew Joo Hun Nam, Emily Lake, Joe Nudell, Minnie Quartey, Zion Mengesha, Connor
    Toups, John R. Rickford, Dan Jurafsky, and Sharad Goel. 2020. Racial disparities in automated speech
    recognition. Proceedings of the National Academy of Sciences of the United States of America,
    117:7684–7689.
Jérôme Louradour. 2023. whisper-timestamped. https://github.com/linto-ai/whisper-timestamped.
Nitika Mathur, Johnny Wei, Markus Freitag, Qingsong Ma, and Ondřej Bojar. 2020. Results of the WMT20
    metrics shared task. In Proceedings of the Fifth Conference on Machine Translation, pages 688–725,
    Online. Association for Computational Linguistics.
Evgeny Matusov, Gregor Leusch, Oliver Bender, and Hermann Ney. 2005. Evaluating machine translation
    output with automatic sentence segmentation. In Proceedings of the Second International Workshop on
    Spoken Language Translation, Pittsburgh, Pennsylvania, USA.
Sharon O’Brien. 2007. An empirical investigation of temporal and technical post-editing effort. The
    Information Society, 2:83–136.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic
    evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for
    Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for
    Computational Linguistics.
Silvio Picinini and Nicola Ueffing. 2017. A detailed investigation of bias errors in post-editing of MT
    output. In Proceedings of Machine Translation Summit XVI: Commercial MT Users and Translators
    Track, pages 79–90, Nagoya, Japan.
Marcis Pinnis, Rihards Kalnins, Raivis Skadins, and Inguna Skadina. 2016. What can we really learn from
    post-editing? In Conferences of the Association for Machine Translation in the Americas: MT Users’
    Track, pages 86–91, Austin, TX, USA. The Association for Machine Translation in the Americas.
Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation
    post-editing in a typical localisation context. In Prague Bulletin of Mathematical Linguistics.
Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the
    Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for
    Computational Linguistics.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference
    on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for
    Computational Linguistics.
Matt Post and David Vilar. 2018. Fast lexically constrained decoding with dynamic beam allocation for
    neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of
    the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers),
    pages 1314–1324, New Orleans, Louisiana. Association for Computational Linguistics.
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022.
    Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.
Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT
    evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language
    Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
Ramon Sanabria, Nikolay Bogoychev, Nina Markl, Andrea Carmantini, Ondrej Klejch, and Peter Bell. 2023.
    The Edinburgh International Accents of English Corpus: Towards the democratization of English ASR.
Carolina Scarton, Mikel L. Forcada, Miquel Esplà-Gomis, and Lucia Specia. 2019. Estimating post-editing
    effort: a study on human judgements, task-based and reference-based metrics of MT quality. In
    Proceedings of the 16th International Conference on Spoken Language Translation, Hong Kong.
    Association for Computational Linguistics.
Anne-Kathrin Schumann and Héctor Martínez Alonso. 2018. Automatic annotation of semantic term types in
    the complete ACL Anthology reference corpus. In Proceedings of the Eleventh International
    Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language
    Resources Association (ELRA).
Sukanta Sen, Ondřej Bojar, and Barry Haddow. 2022. Simultaneous translation for unsegmented input: A
    sliding window approach.
Matthew Snover, Bonnie Dorr, Rich Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of
    translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the
    Association for Machine Translation in the Americas: Technical Papers, pages 223–231, Cambridge,
    Massachusetts, USA. Association for Machine Translation in the Americas.
Rachael Tatman and Conner Kasten. 2017. Effects of talker dialect, gender & race on accuracy of Bing
    Speech and YouTube automatic captions. In Interspeech.
Midori Tatsumi. 2009. Correlation between automatic evaluation metric scores, post-editing speed, and
    some other factors. In Proceedings of Machine Translation Summit XII: Posters, Ottawa, Canada.
Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, and Marta Ruiz Costa-jussà. 2022. SHAS:
    Approaching optimal segmentation for end-to-end speech translation. In Interspeech.
Changhan Wang, Juan Pino, Anne Wu, and Jiatao Gu. 2020. CoVoST: A diverse multilingual speech-to-text
    translation corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference,
    pages 4197–4203, Marseille, France. European Language Resources Association.
72
Vilém Zouhar, Martin Popel, Ondřej Bojar, and Aleš
Tamchyna. 2021. Neural machine translation quality
and post-editing performance. In Proceedings of the
2021 Conference on Empirical Methods in Natural
Language Processing, pages 10204–10214, Online
and Punta Cana, Dominican Republic. Association
for Computational Linguistics.
A Appendix
A.1 Additional Metadata for ACL 60/60 Evaluation Sets
Below we list the duration for talks in the evaluation sets, along with additional demographic metadata
about the presenting author (speaker) and content (conference track). Conference tracks are taken from the
ACL 2022 handbook. Gender annotations were checked with speakers’ listed pronouns¹³ and validated
by speakers where available. For speaker demographics and accent we list L1 and native country where
available, as well as country of affiliation as a rough proxy.
Gender                                      #      %
Woman                                     909   28.7
Man                                      2164   68.3
Non-binary / Genderqueer / Third gender    14     <1
Genderfluid / Gender non-conforming       <10     <1
Prefer not to say                          77    2.4
Specify your own                          <10     <1
TOTAL                                    3170    100
13 Though we note pronouns do not always indicate gender.
A.3 Publicly Available Corpora
Below are the current publicly available multi-way parallel speech translation corpora with English as the
speech source. We note that for MuST-C not all target languages are available in all versions of the corpus
as successive versions added additional language coverage. For full coverage v1.2 or above is required.
Table 5: Current publicly available aligned speech translation corpora covering the ACL 60/60 language pairs.
Target languages are abbreviated using ISO 639-1 codes as follows – Arabic: ar, German: de, Farsi: fa, French: fr,
Japanese: ja, Dutch: nl, Portuguese: pt, Russian: ru, Turkish: tr, Mandarin Chinese: zh.
• Accuracy. Only type the words that are spoken in the audio file. Phrases or words you don’t
understand should NOT be omitted. Instead, they should be annotated using the label “#Unclear”.
• Keep everything verbatim. Include every utterance and sound exactly as you hear it. All filler words
should be included (ex. #ah, #hmm). If the user corrects himself/herself, all the utterances should be
transcribed and corrected words need to be preceded by a # mark (ex. She says #said that).
• Do not paraphrase. Do not correct the speaker’s grammar nor rearrange words. Also, do not cut
words that you think are off-topic or irrelevant. Any words not spoken should not be included. Type
the actual words spoken. If the speaker makes a grammatical mistake, the transcript must reflect the
mistake (ex. If the speaker says: “he were”, it should be transcribed as is without correction).
• Repeat repeated words in the transcript. For example, if the user says: I I said, you must include both
instances of I.
• Do not add additional information such as page numbers, job numbers, titles or your comments in
your submission.
• Foreign words should be transliterated using Latin letters.
• All abbreviations need to be spelled out. For example, doctor should NOT be spelled as Dr. Similarly,
percent should NOT be spelled as %.
• All numbers and special symbols (ex.: %, $, +, @, =, etc.), or combinations of both must be spelled
out as words, and must match what the speaker says exactly.
• All proper names (ex. Google, NATO, Paris) should be transliterated in English.
• Proper punctuation needs to be placed in the text (ex. He, the boy, .). Please pay special attention
and do not miss/omit these punctuation marks: , . ? ! : )(
• Personally identifiable information (like phone number, address, IDs) should be marked in the text as
<PII></PII>. For example: My address is <PII>address</PII>
• Use double dashes “--” to indicate truncated words, attached at either the beginning or the end of
the word (ex. transfor--).
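A few of the mechanical rules above (numbers and symbols spelled out, PII tags paired) could be checked automatically before a transcript is accepted. The following is a minimal sketch only; the function name `lint_transcript` and the issue messages are hypothetical and are not part of the annotation tooling described in this appendix.

```python
import re
from typing import List


def lint_transcript(text: str) -> List[str]:
    """Return a list of guideline violations found in one transcript line.

    Hypothetical helper: it covers only three mechanical rules
    (digits, special symbols, PII tag pairing), not the full guidelines.
    """
    issues = []
    # Ignore the PII tags themselves when looking for digits and symbols.
    stripped = re.sub(r"</?PII>", "", text)
    if re.search(r"\d", stripped):
        issues.append("digits should be spelled out as words")
    if re.search(r"[%$+@=]", stripped):
        issues.append("special symbols should be spelled out as words")
    if text.count("<PII>") != text.count("</PII>"):
        issues.append("unbalanced <PII> tags")
    return issues


clean = lint_transcript("My address is <PII>address</PII>")
flagged = lint_transcript("It costs 5 $")
```

Rules that need human judgment, such as verbatim transcription or the “#Unclear” label, are deliberately left out of this check.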
Figure 10: LabelStudio interface for transcription post-editing.
• Any term found in the 60-60 terminologies list should be translated using the translation in the
terminologies list.
• Any abbreviation not found in the terminologies list should be kept in its English form.
• The terms in the terminologies list may contain one or more translations for each term, separated by
‘:::’. The translator should pick the proper one based on the context.
• If the translator thinks that none of the given translations for a specific term makes sense in the given
context, the translator can use a better translation if they are very confident. If not very confident,
keep the word in the English form.
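To illustrate the ‘:::’ convention, the sketch below splits one terminology entry into its candidate translations. Only the ‘:::’ separator comes from the guidelines above; the tab-separated line layout, the function name `parse_term_entry` and the sample entry are assumptions made for illustration.

```python
from typing import List, Tuple


def parse_term_entry(line: str, sep: str = ":::") -> Tuple[str, List[str]]:
    """Split a hypothetical 'term<TAB>translation:::translation' line.

    Returns the source term and the list of candidate translations,
    from which a translator would pick one based on context.
    """
    term, _, translations = line.rstrip("\n").partition("\t")
    candidates = [t.strip() for t in translations.split(sep) if t.strip()]
    return term.strip(), candidates


# Placeholder entry, not taken from the actual terminology list.
term, candidates = parse_term_entry("beam search\ttranslation A:::translation B")
```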
14 https://site.matecat.com/
Figure 11: Matecat interface for translation post-editing.
Commercial VAD 66.6 68.5 52.7 74.1 46.2 73.6 73.7 53.9 60.6 49.8 62.0
SHAS 66.5 68.6 52.8 73.7 46.9 73.8 73.5 54.3 59.9 49.7 62.0
Sentences 64.0 66.1 51.3 69.0 43.9 71.0 71.9 55.8 63.8 46.0 60.3
eval
Commercial VAD 63.5 66.3 51.1 69.0 43.7 70.4 72.0 55.1 62.9 47.1 60.1
SHAS 64.4 66.4 51.5 69.6 42.0 71.4 72.4 55.7 63.1 45.4 60.2
Table 6: Cascaded ST by language for different source speech segmentations, resegmented and scored with chrF.
If one of the segments created by the VAD does not adhere to the above guidelines, an English model is
used to force-align the long audio segment and its transcript to get the timestamp of each token, and
then the segment is split into shorter subsegments. Note that these guidelines are automatically applied;
this means that if a VAD segment conforms to these guidelines it will not be resegmented, and
subtitle segments may differ from manually created subtitles, where semantic coherence may be prioritized
over longer segments within these guidelines, or text may be lightly changed from what is spoken to
optimize subtitle quality (here not allowed).
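The splitting step described above can be sketched as a greedy pass over force-aligned tokens. This is an illustration under stated assumptions, not the procedure actually used: the limits `max_chars` and `max_dur` are placeholder values, the function name `split_segment` is hypothetical, and the token timestamps are assumed to come from an external forced aligner.

```python
from typing import List, Tuple

# (text, start_sec, end_sec) for one token, as produced by forced alignment.
Token = Tuple[str, float, float]


def split_segment(tokens: List[Token],
                  max_chars: int = 84,
                  max_dur: float = 7.0) -> List[List[Token]]:
    """Greedily split a long segment into subsegments within both limits.

    max_chars and max_dur are placeholder limits (assumptions), to be
    replaced by whatever the applicable subtitle guidelines prescribe.
    """
    subsegments: List[List[Token]] = []
    current: List[Token] = []
    for tok in tokens:
        candidate = current + [tok]
        text_len = len(" ".join(t[0] for t in candidate))
        duration = candidate[-1][2] - candidate[0][1]
        # Close the current subsegment if adding this token breaks a limit.
        if current and (text_len > max_chars or duration > max_dur):
            subsegments.append(current)
            current = [tok]
        else:
            current = candidate
    if current:
        subsegments.append(current)
    return subsegments


toks = [("aaaa", 0.0, 1.0), ("bbbb", 1.0, 2.0), ("cccc", 2.0, 3.0), ("dddd", 3.0, 4.0)]
parts = split_segment(toks, max_chars=9, max_dur=10.0)
```

A segment that already satisfies both limits comes back as a single subsegment, which mirrors the statement that conforming VAD segments are not resegmented.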
15 https://partnerhelp.netflixstudios.com/hc/en-us/articles/217350977-English-Timed-Text-Style-Guide
16 https://www.ted.com/participate/translate/subtitling-tips
17 Varies by program audience, commonly between 17 and 21.
A.8 Segmentation Examples
Examples of each transcript segmentation approach discussed (VAD, subtitles, and sentences) for sample
data from the development set. Examples were chosen to show segments from the longest and shortest
VAD quartiles, and the resulting subtitles following subtitle guidelines from §A.7.
Figure 12: Examples of each discussed transcript segmentation approach for sample data from the development set.
The MINETRANS Systems for IWSLT 2023 Offline Speech Translation and
Speech-to-Speech Translation Tasks
Yichao Du♭‡, Zhengsheng Guo♮, Jinchuan Tian♮, Zhirui Zhang♮, Xing Wang♮, Jianwei Yu♮,
Zhaopeng Tu♮, Tong Xu♭‡ and Enhong Chen♭‡
♭ University of Science and Technology of China  ♮ Tencent AI Lab
‡ State Key Laboratory of Cognitive Intelligence
♭ duyichao@mail.ustc.edu.cn  ♭ {tongxu, cheneh}@ustc.edu.cn  ♮ zrustc11@gmail.com
♮ {zhengshguo, tyriontian, tomasyu, brightxwang, zptu}@tencent.com
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 79–88
July 13-14, 2023 © 2023 Association for Computational Linguistics
leverages the standard sequence-to-sequence model to learn the mapping between source speech and discrete units directly. We found that with a large-scale dataset, such as 10,000 hours of training data, the previous multi-task learning technique (Jia; Lee et al., 2021a,b; Popuri et al., 2022; Dong et al., 2022) is not necessary for model convergence, and this approach can successfully handle the mapping between source speech and discrete units. We also explore various initialization strategies and several techniques to improve model performance, including (1) different self-supervised pre-trained speech encoders and pre-trained text-to-unit models, (2) data filtering and augmentation, consistency training, and model ensembles. To the best of our knowledge, we are the first and only team to successfully train and submit the end-to-end S2ST model on this challenging track. Our code is open-sourced at: https://github.com/duyichao/MINETrans-IWSLT23.
The remainder of this paper is organized as follows: Section 2 describes data preparation, including data statistics, data preprocessing, and data filtering. Section 3 describes our solution for the offline speech translation track. Section 4 describes our solution to the speech-to-speech track. In Section 5, we conclude this paper.
2 Data Preparation
2.1 Data Statistics
Table 1 lists statistics of the speech corpora we used for MINETRANS training, which can be divided into four categories: unlabeled speech, ASR, TTS and S2ST corpora.
Unlabeled Speech. As shown in Table 1, we integrate source-side speech from VoxPopuli (Wang et al., 2021a) and GigaSS² to build a large-scale unlabeled English speech corpus for self-supervised training of the speech encoders Wav2vec 2.0 (Baevski et al., 2020) and HuBert (Hsu et al., 2021), which are used for initializing the S2UT model in the S2ST track. Similarly, we also integrate target speech from GigaSS and AISHELL-3 (Shi et al., 2020) to train the Chinese HuBert, which is used for discretizing Chinese speech.
ASR Corpus. To train data-constrained English ASR models, we merge MuST-C (Gangi et al., 2019), Common Voice v11 (Ardila et al., 2019), Librispeech (Panayotov et al., 2015), and Europarl-ST (Iranzo-Sánchez et al., 2019), resulting in approximately 4500 hours of labeled ASR data, as shown in Table 1. For MuST-C and Europarl-ST, we collect source speech for all translation directions and de-duplicate it based on audio identifiers. In addition, GigaSpeech (Chen et al., 2021) is used to build the data-unconstrained ASR model; it includes 10k hours of data covering various sources (audiobooks, podcasts, and stream media), speaking styles (reading and spontaneous), and topics (arts, science, sports, etc.). Of these corpora, we use MuST-C as the in-domain corpus for the Offline track and the rest as out-of-domain.
MT Corpus. To train data-constrained English-to-Chinese MT models, MuST-C v1&v2 are considered in-domain corpora, while OpenSubtitles2018 (Lison et al., 2018) and NewsCommentary³ corpora are considered out-of-domain. Additionally, we utilize in-house corpora to train data-unconstrained MT models, although we cannot provide further details about them.
TTS Corpus. To ensure target speech timbre matching with the S2ST track, we consider the single-speaker GigaSS-S, a small subset of GigaSS, as in-domain and the multi-speaker AISHELL-3 (Shi et al., 2020) as out-of-domain. These corpora are used to train the TTS model and its corresponding vocoder.
S2ST Corpus. The full version of GigaSS, a large-scale S2ST corpus derived from GigaSpeech (Chen et al., 2021) via MT and TTS, is used to train our end-to-end S2UT model. We also construct S2ST pseudo-data, the details of which will be presented in Section 4.1.5.
           Corpus              Utterances (k)  Duration (h)  S2T CST.  S2ST CST.
Unlabeled  VoxPopuli                   22,905        28,708      ✓         ✓
ASR        MuST-C ASR v1&v2               342           617      ✓         –
           Common Voice v11.0           1,680         3,098      ✓         –
           Librispeech                    281           960      ✓         –
           Europarl-ST                     34            81      ✓         –
           GigaSpeech                   8,030        10,000      ×         –
MT         NewsCommentary                  32             –      ✓         –
           OpenSubtitles                9,969             –      ✓         –
           MuST-C v1&v2                   543             –      ✓         –
           In-house                         –             –      ×         –
TTS        AISHELL 3                       88            85      –         ✓
           GigaSS-S                       210           244      –         ✓
S2ST       GigaSS                       7,635         9,000      –         ✓
           CoVoST synthetic               288           288      –         ✓
           MuST-C synthetic               358           587      –         ✓
Table 1: Statistics of the training data. "CST." indicates that a corpus is in the constrained corpus list of the corresponding S2T or S2ST task. "–" indicates that the corpus is not available in that column.
2 https://github.com/SpeechTranslation/GigaS2S
3 https://opus.nlpl.eu/News-Commentary.php
2.2 Data Pre-processing and Filtering
In general, a simple way to improve model performance is to provide the model with better data. However, through a careful review of the data, we identified issues with the quality of the original data. To address this, we performed the following pre-processing and filtering:
• We convert all audio data to mono-channel 16kHz wav format. Since the sentences of spoken translation are generally short, we discarded sentences with text longer than 100 and speech frames longer than 3000. Then 80-dimensional
log-mel filter bank acoustic features are extracted with a step size of 10ms and a window size of 25ms. The acoustic features are normalized by global channel mean and variance.
• We use an ASR model pre-trained on Librispeech to filter out audio of very poor quality, i.e., with a word error rate (WER) above 75.
• Since the annotation format is not uniform across multiple datasets, we remove non-printing characters, speaker names, laughter, applause and other events. In addition, we also regularize punctuation marks.
• For the English-to-Chinese direction of MuST-C, we first merge the v1 and v2 versions and then remove duplicates based on audio identifiers.
3 Offline Speech Translation
3.1 Cascaded MINETRANS S2T System
3.1.1 Speech Recognition
A standard RNN-Transducer (Graves, 2012) model is used for speech recognition. It consists of an acoustic encoder, a prediction network and a joint network. The acoustic encoder contains 18 Conformer (Gulati et al., 2020) layers with the following dimensions: the attention size is 512, the feed-forward size is 2048, the number of attention heads is 4, and the convolution kernel size is 31. The prediction network is a standard 1-layer LSTM with a hidden size of 1024. The joint network is linear with a size of 512. The input acoustic features are 80-dim Fbank plus 3-dim pitch, which are down-sampled by a 2-layer CNN with a factor of 6 along the time axis before being fed into the acoustic encoder. The overall parameter budget is 126M. During training, SpecAugment (Park et al., 2019) is consistently adopted for data augmentation. Training on both the GigaSpeech and MuST-C datasets lasts for 50 epochs each, using 32 Nvidia V100 GPUs. The Adam optimizer is adopted, with a peak learning rate of 5e-3, 25k warmup steps and an inverse square root decay schedule (Vaswani et al., 2017a). Model weights from the last 10 epochs are averaged before decoding. The default decoding method described in Graves (2012) is adopted with a beam size of 10. External language models in any form are not adopted.
ASR Output Adaptation. In the realm of automatic speech recognition (ASR) and machine translation (MT), it is common for ASR output to lack punctuation, whereas MT models are sensitive to punctuation. To address this issue, we propose an ASR output adaptation method that inserts a punctuation model between ASR and MT. Specifically, we adopt a BERT-based punctuation model that can automatically recover the original punctuation. The objective of this approach is to bridge the disparity between ASR and MT, leading to improved overall performance in speech translation tasks.
Speech Segmentation. Speech translation is a multi-faceted task that requires overcoming the challenges of bridging the gap between automatic speech recognition (ASR) and machine translation (MT) systems. To address these challenges, we employ several text augmentation techniques to improve the quality and accuracy of our training data. Specifically, we have utilized speech-based audio segmentation (SHAS (Tsiamas et al., 2022)) to identify and segment meaningful units of speech that can be accurately translated by the MT system.
3.1.2 Machine Translation
In our systems, we adopt four different types of translation strategies:
• TRANSFORMER is a system trained on the constrained data. We train the Transformer-base (Vaswani et al., 2017b) model on the constrained general data and fine-tune the model on the in-domain MuST-C data.
• M2M-100⁴ (Fan et al., 2021) is a multilingual model trained for many-to-many multilingual translation. We employ the supervised in-domain fine-tuning strategy to fine-tune the M2M-100 1.2B-parameter model on the downstream MuST-C data.
• CHATGPT is a large language model product developed by OpenAI. Previous studies (Jiao et al., 2023; Wang et al., 2023) have demonstrated that ChatGPT is a good translator on high-resource languages. Therefore we use suitable translation prompts with ChatGPT to carry out the translation task.
• IN-HOUSE MODEL is our in-house translation model (Huang et al., 2021), fine-tuned on the MuST-C data. It is a Transformer-big (Vaswani et al., 2017b) model with a deep encoder (Dou et al., 2018).
Data Re-Annotation. We have identified two issues with the annotation of the English-to-Chinese translation direction in the MuST-C v2.0 test set⁵. Firstly, we have observed samples of incorrect literal translations. For example, for the parallel sentence pair “I remember my first fire. ||| 记得我第一场火”, we usually translate the English word “fire” into the Chinese word “火灾 (huo zhai)”, not “火 (huo)”. Secondly, we have noticed inconsistencies in the punctuation annotation, as most Chinese translations lack proper full stop marks. To address these challenges, we have employed the services of a professional translator to accurately translate the English sentences. We will release the data, aiming to facilitate future research in the field.
Domain Augmentation. The MuST-C v2.0 training data contains a considerable number of bilingual sentence pairs that are only partially aligned. In the specific pair “Thank you so much Chris. ||| 非常谢谢,克里斯。的确非常荣幸”, we are unable to locate the counterpart of the Chinese phrase “的确非常荣幸” in the English sentence. As Koehn and Knowles (2017) and Wang et al. (2018) pointed out, data noise (partially aligned data) has been demonstrated to impact the performance of Neural Machine Translation (NMT). To address this issue, we employ a data rejuvenation strategy (Jiao et al., 2020). Specifically, we first fine-tune the model using the raw parallel data and then rejuvenate the low-quality bilingual samples to enhance the training data.
4 https://github.com/facebookresearch/fairseq/tree/main/examples/m2m_100
5 https://ict.fbk.eu/MuST-C/
3.2 Experiment
The Cascaded MINETRANS S2T System we propose comprises an Automatic Speech Recognition (ASR) model and a machine translation (MT) model. In our evaluation, we assess the performance of each component separately. For the ASR system evaluation, we employ the Word Error Rate (WER) metric, while the BLEU score is utilized to evaluate the performance of our machine translation model.
The evaluation results obtained on the MuST-C dataset, with and without fine-tuning, are presented in Table 2. When the GigaSpeech ASR system is used without fine-tuning, we observe a WER of 10.0 on the MuST-C test set. When the system is fine-tuned using the MuST-C dataset, the WER drops from 10.0 to 5.8. This highlights the effectiveness of fine-tuning on the MuST-C dataset in enhancing the overall performance of our system.
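For reference, the WER metric used in the ASR evaluation above can be sketched as a word-level edit distance divided by the reference length. This is a generic illustration, not the scoring script used for Table 2; the function name `word_error_rate` and the whitespace tokenization are assumptions.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level Levenshtein distance / reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    # prev[j] holds the distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / max(len(ref), 1)


# One deleted word out of five reference words.
wer = word_error_rate("i remember my first fire", "i remember my fire")

# Relative WER reduction from fine-tuning, using the figures quoted above.
relative_reduction = (10.0 - 5.8) / 10.0  # about 0.42
```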
System              Dev   Test
Gigaspeech          9.3   10.0
+ MuST-C Finetune   4.8   5.8
Table 2: ASR performance measured in terms of word error rates.
We evaluate various translation strategies using the MuST-C test set. The experimental results are presented in Table 3. In the constrained scenario, TRANSFORMER achieved a test BLEU score of 25.04, whereas M2M-100 attained a marginally higher score of 25.40. In the unconstrained setting, CHATGPT demonstrated superior performance with a BLEU score of 28.25, while IN-HOUSE MODEL obtained the highest BLEU score of 30.91. These results emphasize the significance of utilizing in-domain data for achieving optimal performance in spoken language translation.
System          Dev     tst-COMMON
TRANSFORMER     13.93   25.04
M2M-100         16.53   25.40
CHATGPT         –       28.25
IN-HOUSE MODEL  21.52   30.91
Table 3: Offline speech translation performance measured in terms of the BLEU score.
4 Speech-to-Speech Translation
4.1 End-to-End MINETRANS S2ST System
As shown in Figure 1, we construct an end-to-end S2UT (Lee et al., 2021a) model comprising a speech encoder, length adapter, and unit decoder. Following Lee et al. (2021a), we encode target speech as discrete units via our trained Chinese HuBert and remove consecutive repetitive units to generate a reduced unit sequence. Unlike Lee et al. (2021a), our S2UT model directly learns the mapping between source speech and discrete units without any auxiliary recognition tasks (i.e., ASR and MT tasks), whose hyper-parameters are difficult to tune. Then we leverage a unit-based HiFi-GAN vocoder to achieve unit-to-waveform conversion (Polyak et al., 2021). Next, we detail the efforts made in pre-training for model initialization, data augmentation, consistency training and model ensemble, which are used to improve the translation quality of our system.
[Figure 1 diagram, bottom to top: Source waveform → Speech Encoder → Length Adapter → Unit Decoder → Target unit → Unit HiFi-GAN Vocoder → Target waveform]
Figure 1: The overall architecture of the end-to-end S2ST system.
4.1.1 Pretrained Models
Previous work (Dong et al., 2022; Popuri et al., 2022) has shown that better initialization can reduce learning difficulty, so we explore pre-training of both the speech encoder and the unit decoder.
Speech Encoder Pre-training. We use Wav2vec 2.0 (Baevski et al., 2020) and HuBert (Hsu et al., 2021), which are trained in a self-supervised manner, as speech encoders. Due to the data limitation of the S2ST track, we use the unlabeled speech described in Table 1 for training the speech encoders:
• Wav2vec 2.0 uses a multi-layer convolutional neural network to encode audio and then uses a transformer-based context encoder to construct a contextual representation. The model is trained with a contrastive loss over masked spans of the context encoder's input. In this paper, we replace the Transformer with a Conformer to obtain better performance.
• HuBert has the same model architecture as Wav2vec 2.0. However, its training process differs primarily in the use of a cross-entropy loss and additionally in the construction of targets through a separate clustering process.
Unit Decoder Pre-training. We use the standard sequence-to-sequence model to model the Text-to-unit (T2U) task on GigaSS, and the decoder of
this model will be used for the initialization of the unit decoder of S2UT. The T2U model contains 12 transformer layers for the encoder and decoder, respectively. More specifically, we set the size of the self-attention layer, the feed-forward network, and the number of heads to 1024, 4096, and 8, respectively.
4.1.2 Model Finetuning
We combine the pre-trained speech encoder and unit decoder, and add a randomly initialized length adapter between the pre-trained modules. The length adapter consists of a one-dimensional convolutional layer with a stride of 2, which mitigates the length difference between the source audio and the reduced target unit sequence, as well as the mismatch between representations.
Consistency Training. To further improve the consistency of our model, we employ the R-Drop algorithm (Liang et al., 2021) with a weight α set to 5. The R-Drop algorithm reduces the inconsistency between training and inference that dropout introduces, thereby improving generalization. Specifically, each training sample is passed through the model twice with different dropout masks, and the divergence between the two output distributions is penalized, forcing the model to learn more robust representations that are less sensitive to such perturbations. For a more detailed description of the R-Drop algorithm and its implementation, please refer to Liang et al. (2021).
4.1.3 Unit-based Vocoder
We utilize the unit-based HiFi-GAN (Polyak et al., 2021) vocoder to convert discrete units into waveform for the speech-to-unit model. Following the setup of Lee et al. (2021a), we augment the vocoder with a duration prediction module for the reduced unit output, which consists of two 1D convolutional layers, each with ReLU activation, followed by layer normalization and a linear layer.
4.1.4 Ensemble
Model ensemble can reduce the inconsistency of the system to some extent, and we consider the ensemble of four variants of S2UT models:
• W2V2-CONF-LARGE: The speech encoder is initialized using the Conformer-based Wav2vec 2.0 LARGE model. The unit decoder is initialized randomly.
• W2V2-CONF-LARGE+T2U: The speech encoder is initialized using the Conformer-based Wav2vec 2.0 LARGE model. The unit decoder is initialized from the T2U model.
• W2V2-TRANS-LARGE+T2U: The speech encoder is initialized using the Transformer-based Wav2vec 2.0 LARGE model. The unit decoder is initialized from the T2U model.
• HUBERT-TRANS-LARGE+T2U: The speech encoder is initialized using the Transformer-based HuBert LARGE model. The unit decoder is initialized from the T2U model.
4.1.5 Data Augmentation
We utilize well-trained FastSpeech2 (Ren et al., 2020) TTS models (see Section 4.2 for details) to generate speech for the MuST-C and CoVoST Chinese texts to construct pseudo-corpora. These pseudo-corpora are used as training data together with the original labeled S2ST corpus.
4.2 Experiments
4.2.1 Implementation Details
All end-to-end S2UT models are implemented based on the FAIRSEQ⁶ (Ott et al., 2019) toolkit. We use a pre-trained Chinese HuBERT model and a k-means model to encode Chinese target speech into a vocabulary of 250 units. The Chinese HuBERT and k-means models are learned from the TTS data in Table 1. The architectural details of the S2UT models are given in Section 4.1.4. During training, we use the Adam optimizer with a learning rate of 5e-5 and 8K warm-up updates. The label smoothing and dropout ratios are set to 0.15 and 0.2, respectively. In practice, we train S2UT on 8 Nvidia Tesla A100 GPUs for 150K update steps. The batch size on each GPU is set to 1200K, and we accumulate the gradient for every 9 batches. For the first 5K steps of S2UT model training, we freeze the speech encoder. The Unit HiFi-GAN vocoder is trained using the SPEECH-RESYNTHESIS⁷ toolkit for 500k steps. For FastSpeech2 and HiFi-GAN, we followed the paddlespeech AISHELL recipe⁸ for training. During inference, we average the model parameters of the 30 best checkpoints based on performance on the GigaSS dev set, and adopt a beam search strategy with a beam size of 10.
6 https://github.com/facebookresearch/fairseq
7 https://github.com/facebookresearch/speech-resynthesis
8 https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3
ID  Model                      BLEU  chrF
1   W2V2-CONF-LARGE            27.7  23.4
2   W2V2-CONF-LARGE+T2U        27.8  23.7
3   W2V2-TRANS-LARGE+T2U       25.2  22.3
4   HUBERT-TRANS-LARGE+T2U     26.2  23.2
5   HUBERT-TRANS-LARGE+T2U*    25.7  22.6
6   Ensemble(1, 2, 4)          28.0  23.9
7   Ensemble(2, 4, 5)          27.2  23.0
Table 4: ASR-BLEU and ASR-chrF on GigaSS validation set. ‘*’ indicates adding the GigaST test set to the training data and fine-tuning it for one round.
4.2.2 Results
To evaluate the speech-to-speech translation system, we use a Chinese ASR system⁹ trained on WenetSpeech (Zhang et al., 2021) to transcribe the speech output with the ctc_greedy_search mode. Based on this, we report case-sensitive BLEU and chrF scores between the produced transcript and a textual human reference using sacreBLEU. The results on the GigaSS validation set are shown in Table 4. Comparing W2V2-CONF-LARGE+T2U and W2V2-TRANS-LARGE+T2U, initializing from a speech encoder pre-trained with the Conformer-based architecture yields better performance. In addition, we find that adding the GigaST test set to training leads to a slight performance degradation on the validation set, possibly because the annotations of the test set are calibrated by humans and their style differs from that of the training data.
9 https://github.com/wenet-e2e/wenet/blob/main/docs/pretrained_models.en.md
5 Conclusion
This paper presents the MINETRANS system for two challenge tracks of IWSLT 2023: Offline Speech Translation (S2T) and Speech-to-Speech Translation (S2ST). For the S2T track, MINETRANS employs a cascaded system to investigate the limits of translation performance in both constrained and unconstrained settings. We explore two machine translation strategies: supervised in-domain fine-tuning and prompt-guided translation using a large language model. For the S2ST track, MINETRANS builds an end-to-end model based on the speech-to-unit (S2U) framework. To the best of our knowledge, we are the first and only team to successfully train and submit an end-to-end S2ST model on this track. This model uses our trained HuBERT to encode the target speech as discrete units and leverages the standard sequence-to-sequence model to directly learn the mapping between source speech and discrete units without the need for auxiliary recognition tasks such as ASR and MT. We use several techniques to improve MINETRANS’s performance, including speech encoder pre-training on large-scale data, data filtering, data augmentation, speech segmentation, consistency training, and model ensemble.
Acknowledgements
This work is supported by grants from the National Natural Science Foundation of China (No.62222213, U20A20229, 62072423), and the USTC Research Funds of the Double First-Class Initiative (No.YD2150002009). The authors would like to thank the anonymous reviewers for their valuable comments. Zhirui Zhang and Tong Xu are the corresponding authors.
References
Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutai Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.
Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondrej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Y. Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, B. Hsu, Dávid Javorský, Věra Kloudová, Surafel Melaku Lakew, Xutai Ma, Prashant Mathur,
85
Paul McNamee, Kenton Murray, Maria Nadejde, Alex Graves. 2012. Sequence transduction with
Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing recurrent neural networks. arXiv preprint
Niu, John E. Ortega, Juan Miguel Pino, Elizabeth arXiv:1211.3711.
Salesky, Jiatong Shi, Matthias Sperber, Sebastian
Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki
Virkar, Alexander H. Waibel, Changhan Wang, and Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang,
Shinji Watanabe. 2022. Findings of the iwslt 2022 Zhengdong Zhang, Yonghui Wu, and Ruoming Pang.
evaluation campaign. In IWSLT. 2020. Conformer: Convolution-augmented Trans-
former for Speech Recognition. In Proc. Interspeech
Rosana Ardila, Megan Branson, Kelly Davis, Michael 2020, pages 5036–5040.
Henretty, Michael Kohler, Josh Meyer, Reuben Zhiwei He, Tian Liang, Wenxiang Jiao, Zhuosheng
Morais, Lindsay Saunders, Francis M. Tyers, and Zhang, Yujiu Yang, Rui Wang, Zhaopeng Tu, Shum-
Gregor Weber. 2019. Common voice: A massively- ing Shi, and Xing Wang. 2023. Exploring human-
multilingual speech corpus. In International Confer- like translation strategy with large language models.
ence on Language Resources and Evaluation. arXiv preprint arXiv:2305.04118.
Alexei Baevski, Henry Zhou, Abdel rahman Mohamed, Oleksii Hrinchuk, Vahid Noroozi, Ashwinkumar
and Michael Auli. 2020. wav2vec 2.0: A framework Ganesan, Sarah Campbell, Sandeep Subramanian,
for self-supervised learning of speech representations. Somshubra Majumdar, and Oleksii Kuchaiev. 2022.
Advances in Neural Information Processing Systems. Nvidia nemo offline speech translation systems for
iwslt 2022. In IWSLT.
Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu
Du, Weiqiang Zhang, Chao Weng, Dan Su, Daniel Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai,
Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, San- Kushal Lakhotia, Ruslan Salakhutdinov, and Abdel-
jeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, rahman Mohamed. 2021. Hubert: Self-supervised
Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, speech representation learning by masked prediction
Yujun Wang, Zhao You, and Zhiyong Yan. 2021. Gi- of hidden units. IEEE/ACM Transactions on Audio,
gaspeech: An evolving, multi-domain asr corpus Speech, and Language Processing, 29:3451–3460.
with 10, 000 hours of transcribed audio. ArXiv,
Guoping Huang, Lemao Liu, Xing Wang, Longyue
abs/2106.06909.
Wang, Huayang Li, Zhaopeng Tu, Chengyan Huang,
and Shuming Shi. 2021. Transmart: A practical in-
Qianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan
teractive machine translation system. arXiv preprint
Wang, Qibing Bai, and Yu Zhang. 2022. Leverag-
arXiv:2105.13072.
ing pseudo-labeled data to improve direct speech-to-
speech translation. In Interspeech. Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà,
Javier Jorge, Nahuel Roselló, Adrià Giménez, Al-
Zi-Yi Dou, Zhaopeng Tu, Xing Wang, Shuming Shi, and berto Sanchís, Jorge Civera Saiz, and Alfons Juan-
Tong Zhang. 2018. Exploiting deep representations Císcar. 2019. Europarl-st: A multilingual corpus for
for neural machine translation. In Proceedings of the speech translation of parliamentary debates. ICASSP
2018 Conference on Empirical Methods in Natural 2020 - 2020 IEEE International Conference on
Language Processing, pages 4253–4262. Acoustics, Speech and Signal Processing (ICASSP),
pages 8229–8233.
Yichao Du, Weizhi Wang, Zhirui Zhang, Boxing Chen,
Tong Xu, Jun Xie, and Enhong Chen. 2022. Non- Ye Jia, Ron J Weiss, Fadi Biadsy, Wolfgang Macherey,
parametric domain adaptation for end-to-end speech Melvin Johnson, Zhifeng Chen, and Yonghui Wu.
translation. In Conference on Empirical Methods in 2019. Direct speech-to-speech translation with
Natural Language Processing. a sequence-to-sequence model. arXiv preprint
arXiv:1904.06037.
Yichao Du, Zhirui Zhang, Weizhi Wang, Boxing Chen,
Jun Xie, and Tong Xu. 2021. Regularizing end-to- Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing
end speech translation with triangular decomposition Wang, and Zhaopeng Tu. 2023. Is chatgpt a good
agreement. In AAAI Conference on Artificial Intelli- translator? a preliminary study. arXiv preprint
gence. arXiv:2301.08745.
Wenxiang Jiao, Xing Wang, Shilin He, Irwin King,
Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Michael Lyu, and Zhaopeng Tu. 2020. Data reju-
Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep venation: Exploiting inactive training examples for
Baines, Onur Celebi, Guillaume Wenzek, Vishrav neural machine translation. In Proceedings of the
Chaudhary, et al. 2021. Beyond english-centric multi- 2020 Conference on Empirical Methods in Natural
lingual machine translation. The Journal of Machine Language Processing (EMNLP), pages 2255–2266.
Learning Research, 22(1):4839–4886.
Philipp Koehn and Rebecca Knowles. 2017. Six chal-
Mattia Antonino Di Gangi, R. Cattoni, L. Bentivogli, lenges for neural machine translation. In First Work-
Matteo Negri, and M. Turchi. 2019. Must-c: a multi- shop on Neural Machine Translation, pages 28–39.
lingual speech translation corpus. In NAACL. Association for Computational Linguistics.
Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Tang, Juan Miguel Pino, and Wei-Ning Hsu. 2021a. Direct speech-to-speech translation with discrete units. In Annual Meeting of the Association for Computational Linguistics.

Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Sravya Popuri, Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Tang, Juan Pino, and Wei-Ning Hsu. 2022. Direct speech-to-speech translation with discrete units. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3327–3339, Dublin, Ireland. Association for Computational Linguistics.

Ann Lee, Hongyu Gong, Paul-Ambroise Duquenne, Holger Schwenk, Peng-Jen Chen, Changhan Wang, Sravya Popuri, Juan Miguel Pino, Jiatao Gu, and Wei-Ning Hsu. 2021b. Textless speech-to-speech translation on real data. ArXiv, abs/2112.08352.

Xiaobo Liang, Lijun Wu, Juntao Li, Yue Wang, Qi Meng, Tao Qin, Wei Chen, M. Zhang, and Tie-Yan Liu. 2021. R-drop: Regularized dropout for neural networks. ArXiv, abs/2106.14448.

Pierre Lison, Jörg Tiedemann, and Milen Kouylekov. 2018. Opensubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora. In International Conference on Language Resources and Evaluation.

Yuchen Liu, Hao Xiong, Zhongjun He, Jiajun Zhang, Hua Wu, Haifeng Wang, and Chengqing Zong. 2019. End-to-end speech translation with knowledge distillation. In INTERSPEECH.

H. Ney. 1999. Speech translation: coupling of recognition and translation. 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258), 1:517–520 vol.1.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, S. Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and S. Khudanpur. 2015. Librispeech: An asr corpus based on public domain audio books. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech 2019.

Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, and Emmanuel Dupoux. 2021. Speech resynthesis from discrete disentangled self-supervised representations. ArXiv, abs/2104.00355.

Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Miguel Pino, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, and Ann Lee. 2022. Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation. In Interspeech.

Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2020. Fastspeech 2: Fast and high-quality end-to-end text to speech. ArXiv, abs/2006.04558.

Yao Shi, Hui Bu, Xin Xu, Shaojing Zhang, and Ming Li. 2020. Aishell-3: A multi-speaker mandarin tts corpus and the baselines. In Interspeech.

Matthias Sperber, Graham Neubig, J. Niehues, and A. Waibel. 2017. Neural lattice-to-sequence models for uncertain inputs. In EMNLP.

Ioannis Tsiamas, Gerard I Gállego, José AR Fonollosa, and Marta R Costa-jussà. 2022. Shas: Approaching optimal segmentation for end-to-end speech translation. arXiv preprint arXiv:2202.04774.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017a. Attention is all you need. Advances in neural information processing systems, 30.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017b. Attention is all you need. Advances in neural information processing systems, 30.

Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, and Emmanuel Dupoux. 2021a. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In Annual Meeting of the Association for Computational Linguistics.

Longyue Wang, Chenyang Lyu, Tianbo Ji, Zhirui Zhang, Dian Yu, Shuming Shi, and Zhaopeng Tu. 2023. Document-level machine translation with large language models. arXiv preprint arXiv:2304.02210.

Minghan Wang, Yuxia Wang, Chang Su, Jiaxin Guo, Yingtao Zhang, Yujiao Liu, M. Zhang, Shimin Tao, Xingshan Zeng, Liangyou Li, Hao Yang, and Ying Qin. 2021b. The hw-tsc’s offline speech translation system for iwslt 2022 evaluation. In IWSLT.

Wei Wang, Taro Watanabe, Macduff Hughes, Tetsuji Nakagawa, and Ciprian Chelba. 2018. Denoising neural machine translation training with trusted data and online data selection. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 133–143.

Wenxuan Wang, Wenxiang Jiao, Yongchang Hao, Xing Wang, Shuming Shi, Zhaopeng Tu, and Michael Lyu. 2022. Understanding and improving sequence-to-sequence pretraining for neural machine translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2591–2600.
Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao,
Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen,
Chenchen Zeng, Di Wu, and Zhendong Peng. 2021.
Wenetspeech: A 10000+ hours multi-domain man-
darin corpus for speech recognition. ICASSP 2022
- 2022 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 6182–
6186.
Peidong Zhang, Boxing Chen, Niyu Ge, and Kai Fan.
2019. Lattice transformer for speech translation. In
ACL.
Weitai Zhang, Zhongyi Ye, Haitao Tang, Xiaoxi Li,
Xinyuan Zhou, Jing Yang, Jianwei Cui, Dan Liu,
Junhua Liu, and Lirong Dai. 2022a. The ustc-nelslip
offline speech translation systems for iwslt 2022. In
IWSLT.
Ziqiang Zhang, Junyi Ao, Shujie Liu, Furu Wei, and
Jinyu Li. 2022b. The yitrans end-to-end speech
translation system for iwslt 2022 offline shared task.
ArXiv, abs/2206.05777.
Improving End-to-End Speech Translation by Imitation-Based Knowledge
Distillation with Synthetic Transcripts
Architecture  Hypotheses  #  Decoding Setup                          Source Transcripts  dev-BLEU↑
RNN           full        1  AST                                     -                   11.9
RNN           full        2  ASR transcribes, NMT expert translates  -                   21.8
RNN           partial     3  AST starts, NMT expert completes        gold                21.9
RNN           partial     4  AST starts, NMT expert completes        synthetic           15.6
Transformer   full        5  AST                                     -                   16.7
Transformer   full        6  ASR transcribes, NMT expert translates  -                   25.4
Transformer   partial     7  AST starts, NMT expert completes        gold                25.4
Transformer   partial     8  AST starts, NMT expert completes        synthetic           19.9
Table 3: Feasibility experiment: BLEU score on CoVoST2 development set of NMT expert’s completion of AST
model full or partial hypotheses with greedy decoding; gold denotes the usage of the dataset’s source language
transcripts as NMT inputs and synthetic denotes synthetic transcripts created by the respective ASR model.
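The "AST starts, NMT expert completes" setup in rows 3, 4, 7 and 8 of Table 3 can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `expert_next` is a hypothetical callable that returns the expert's greedy next token given a transcript and a target prefix, and `toy_expert` with its two-word lexicon is a stand-in, not the WMT19 model used in the paper.

```python
from typing import Callable, List, Sequence

EOS = "</s>"  # hypothetical end-of-sequence symbol


def complete_with_expert(
    expert_next: Callable[[Sequence[str], Sequence[str]], str],
    transcript: Sequence[str],
    student_prefix: Sequence[str],
    max_len: int = 50,
) -> List[str]:
    """Greedily extend a partial student hypothesis with expert tokens."""
    hyp = list(student_prefix)
    while len(hyp) < max_len:
        # The expert conditions on the transcript and on the prefix so far.
        token = expert_next(transcript, hyp)
        if token == EOS:
            break
        hyp.append(token)
    return hyp


# Toy stand-in for the expert: a position-wise dictionary lookup.
TOY_LEXICON = {"slow": "langsam", "down": "runter"}


def toy_expert(transcript: Sequence[str], prefix: Sequence[str]) -> str:
    position = len(prefix)
    if position >= len(transcript):
        return EOS
    word = transcript[position]
    return TOY_LEXICON.get(word, word)


if __name__ == "__main__":
    # The student produced the first token; the expert completes the rest.
    print(complete_with_expert(toy_expert, ["slow", "down"], ["langsam"]))
```

Swapping the transcript argument between gold and ASR-generated text corresponds to the "gold" and "synthetic" rows of the table; the completion loop itself does not change.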
Architecture  Models    CoVoST2        MuST-C
                        dev    test    dev    test
              Standard  13.6   10.0    14.6   14.1
ours baseline ours baseline

these two new corpora. As Table 6 shows, Transformer KD+ trained on translated gold transcripts outperforms its coun-
                                          CoVoST2                       MuST-C
                                          BLEU↑         TER↓            BLEU↑         TER↓
IL Algorithm  Model             Data      dev    test   dev    test     dev    test   dev    test
Dagger        Standard          gold      18.4   14.2   69.1   77.1     19.5   19.4   70.8   69.4
Dagger        IKD+              gold      21.8   18.4   63.7   70.0     23.2   23.3   67.4   65.6
Dagger        SynthIKD+         synth     21.8   18.5   63.6   69.8     23.5   23.5   67.2   65.6

                                          BLEU↑         TER↓            BLEU↑         TER↓
IL Algorithm  Warm-start Model  Data      dev    test   dev    test     dev    test   dev    test
sentence-BLEU reward-to-go
AggreVaTe     Standard          gold      18.7   14.6   68.2   76.0     19.9   19.9   70.2   68.1
AggreVaTe     Standard          synth     18.7   14.6   68.2   75.9     20.0   19.7   70.1   68.7
AggreVaTe     IKD+              gold      22.1   18.5   63.1   69.6     23.5   23.4   67.4   65.7
AggreVaTe     SynthIKD+         synth     22.1   18.5   63.1   69.7     23.5   23.6   67.0   65.6
TER reward-to-go
AggreVaTe     Standard          gold      18.7   14.7   67.8   75.4     20.0   19.9   70.0   68.5
AggreVaTe     Standard          synth     18.7   14.6   67.9   75.6     19.9   19.6   69.8   68.4
AggreVaTe     IKD+              gold      22.0   18.5   63.1   69.4     23.3   23.4   67.3   65.5
AggreVaTe     SynthIKD+         synth     22.1   18.5   63.1   69.6     23.5   23.6   67.0   65.3
Table 5: Comparison of Dagger with warm-started AggreVaTe with a maximum of 50 epochs on CoVoST2 and
MuST-C.
Figure 3: NMT expert top-8 output probabilities when translating the incorrect synthetic transcript “The king had
taken possession of Glamis Castle and plywood it.”
Figure 4: NMT expert top-8 output probabilities when translating the incorrect synthetic transcript “Slow down!”
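As a rough illustration of what "top-8 output probabilities" means in Figures 3 and 4, the sketch below turns a dictionary of expert logits into a probability distribution, extracts its top-k entries, and compares the cross-entropy a student would incur against the full (soft) expert distribution with the one it would incur against a one-hot label. The logit values and the student distribution are invented purely for illustration; only the subword strings are taken from the discussion of Figure 3b.

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over a token -> logit mapping."""
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_k(probs: Dict[str, float], k: int = 8) -> List[Tuple[str, float]]:
    """Return the k most probable tokens, most probable first."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]


def one_hot(probs: Dict[str, float]) -> Dict[str, float]:
    """Collapse a distribution onto its single most probable token."""
    best = max(probs, key=probs.get)
    return {tok: 1.0 if tok == best else 0.0 for tok in probs}


def cross_entropy(target: Dict[str, float], student: Dict[str, float]) -> float:
    """Cross-entropy of the student distribution under the target distribution."""
    return -sum(p * math.log(student[tok]) for tok, p in target.items() if p > 0)


if __name__ == "__main__":
    # Made-up logits; not values read off the paper's figures.
    expert = softmax({"pl@@": 2.0, "pflan@@": 1.0, "kl@@": 0.5, "ge": -1.0})
    student = softmax({"pl@@": 1.0, "pflan@@": 1.0, "kl@@": 0.0, "ge": 0.0})
    print(top_k(expert, k=3))
    print("soft-target loss:", cross_entropy(expert, student))
    print("one-hot loss:    ", cross_entropy(one_hot(expert), student))
```

The soft target spreads training signal over several plausible continuations, whereas the one-hot target rewards only the single best token.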
correction. At the next timestep, however, the last symbol in the prefix is the subword unit “ge” and, as Figure 3b shows, the expert, being driven by its decoder language modeling capability, puts the highest probabilities on subword units that are most likely to produce a fluent output (the correct one “pl@@”, and the less probable “pflan@@” and “kl@@”) rather than paying attention to the (wrong) information in the synthetic transcripts.

Similar situations can be observed in samples with entirely wrong synthetic transcripts. In Figure 4, the expert has received the synthetic transcript “Slow down!” as input, which shares no meaning with the gold transcript “Said he’d consider it.” As shown in Figure 4a, the expert assigns the highest probability to “@@low” if it is given the prefix “S” (as the expert has a shared vocabulary, it can complete the output this way), which turns the partial translation into an exact copy of the transcript. Again, the top-8 predictions do not share similar meaning with the transcript. After the expert has received the prefix “Sagte,” in Figure 4b, it still attempts to complete $y_{<t}$ by generating output symbols that would turn $y$ into a valid translation of this wrong transcript (“langsam” (slow), “ruhig” (quiet), “langs@@”), with the remaining options being mostly driven by language modeling rather than by reproducing source semantics (“ent@@”, “verlan@@”).

Overall, with SynthIKD+ training, the expert induces smoothed output distributions and fluency in the student more than it forces the student to predict one-hot labels produced by the expert, as is done by sequence-level KD.

5 Conclusion

We showed that a pretrained NMT model can successfully be used as an oracle for an AST student, without requiring gold source language transcripts as in previous approaches to imitation learning for
AST. This widens the applicability of imitation learning approaches to datasets that do not contain manual transcripts or to pre-trained ASR models for which training transcripts are not available. Our qualitative analysis suggests that the NMT oracle's robustness against mismatches between manual and synthetic transcripts stems from its large language modeling capabilities, which allow it to continue the prefix based solely on its learned contextual knowledge.

6 Limitations

There are several limitations of this study. First, it is done on one language pair, although we believe this should not qualitatively change the results. Second, only one set of standard model sizes was evaluated for the AST student and the NMT expert; we expect it to be in line with reported findings for NMT (Ghorbani et al., 2021). Finally, while alluding to the potential of using large pre-trained ASR models instead of manual transcripts for IL-based AST, our current work must be seen as a proof-of-concept experiment in which we train ASR models on a few hundred hours of audio and discard the manual transcripts in IL training, showing the feasibility of our idea.

Acknowledgements

The authors acknowledge support by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant INST 35/1597-1 FUGG.

References

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.

Alexandre Berard, Laurent Besacier, Ali Can Kocabiyikoglu, and Olivier Pietquin. 2018. End-to-end automatic speech translation of audiobooks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April 15-20, 2018, pages 6224–6228. IEEE.

Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2012–2017, Minneapolis, Minnesota. Association for Computational Linguistics.

Marco Gaido, Mattia A. Di Gangi, Matteo Negri, and Marco Turchi. 2020. End-to-end speech-translation with knowledge distillation: FBK@IWSLT2020. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 80–88, Online. Association for Computational Linguistics.

Behrooz Ghorbani, Orhan Firat, Markus Freitag, Ankur Bapna, Maxim Krikun, Xavier Garcia, Ciprian Chelba, and Colin Cherry. 2021. Scaling laws for neural machine translation. CoRR, abs/2109.07740.

Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.

Luca Hormann and Artem Sokolov. 2021. Fixing exposure bias with imitation learning needs powerful oracles. CoRR, abs/2109.04114.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Alexander Lin, Jeremy Wohlwend, Howard Chen, and Tao Lei. 2020. Autoregressive knowledge distillation through imitation learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6121–6133, Online. Association for Computational Linguistics.

Yuchen Liu, Hao Xiong, Jiajun Zhang, Zhongjun He, Hua Wu, Haifeng Wang, and Chengqing Zong. 2019. End-to-End Speech Translation with Knowledge Distillation. In Proc. Interspeech 2019, pages 1128–1132.

Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR’s WMT19 news translation task submission. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 314–319, Florence, Italy. Association for Computational Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Juan Pino, Liezl Puzon, Jiatao Gu, Xutai Ma, Arya D. McCarthy, and Deepak Gopinath. 2019. Harnessing indirect training data for end-to-end automatic speech translation: Tricks of the trade. In Proceedings of the 16th International Conference on Spoken Language Translation, Hong Kong. Association for Computational Linguistics.

Juan Pino, Qiantong Xu, Xutai Ma, Mohammad Javad Dousti, and Yun Tang. 2020. Self-Training for End-to-End Speech Translation. In Proc. Interspeech 2020, pages 1476–1480.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. CoRR, abs/2212.04356.

Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.

Stefan Riezler and John T. Maxwell. 2005. On some pitfalls in automatic evaluation and significance testing for MT. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 57–64, Ann Arbor, Michigan. Association for Computational Linguistics.

Stéphane Ross and Andrew Bagnell. 2014. Reinforcement and imitation learning via interactive no-regret learning. CoRR, abs/1406.5979.

Stephane Ross, Geoffrey Gordon, and Drew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 627–635, Fort Lauderdale, FL, USA. PMLR.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826.

Yun Tang, Juan Pino, Xian Li, Changhan Wang, and Dmitriy Genzel. 2021a. Improving speech translation by understanding and learning from the auxiliary text translation task. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4252–4261, Online. Association for Computational Linguistics.
Yun Tang, Juan Pino, Changhan Wang, Xutai Ma, and Dmitriy Genzel. 2021b. A general multi-task learning framework to leverage text data for speech to text tasks. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6209–6213.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.

Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, and Juan Pino. 2020. Fairseq S2T: Fast speech-to-text modeling with fairseq. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: System Demonstrations, pages 33–39, Suzhou, China. Association for Computational Linguistics.

Changhan Wang, Anne Wu, Jiatao Gu, and Juan Pino. 2021. CoVoST 2 and Massively Multilingual Speech Translation. In Interspeech, pages 2247–2251.

Chaojun Wang and Rico Sennrich. 2020. On exposure bias, hallucination and domain shift in neural machine translation. In ACL.

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-sequence models can directly translate foreign speech. In Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017, pages 2625–2629. ISCA.

Ronald J. Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280.

Jeremy H.M. Wong and Mark J.F. Gales. 2016. Sequence Student-Teacher Training of Deep Neural Networks. In Proc. Interspeech 2016, pages 2761–2765.

Rong Ye, Mingxuan Wang, and Lei Li. 2021. End-to-end speech translation via cross-modal progressive training. In Proc. of INTERSPEECH.

Biao Zhang, Barry Haddow, and Rico Sennrich. 2022a. Revisiting end-to-end speech-to-text translation from scratch. In International Conference on Machine Learning, pages 26193–26205. PMLR.

Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Francoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, and Yonghui Wu. 2022b. BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition. IEEE Journal of Selected Topics in Signal Processing, 16(6):1519–1532.

Renjie Zheng, Junkun Chen, Mingbo Ma, and Liang Huang. 2021. Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation. In International Conference on Machine Learning, pages 12736–12746. PMLR.

Chunting Zhou, Jiatao Gu, and Graham Neubig. 2020. Understanding knowledge distillation in non-autoregressive machine translation. In International Conference on Learning Representations (ICLR).
                                    BLEU↑
Model                               dev    test
original dataset
Standard                            13.8   14.4
KD+                                 17.4   17.8
SynthKD+                            17.5   18.0
IKD+                                17.0   17.1
SynthIKD+                           17.0   17.0
translated gold training set
Standard                            15.3   15.3
KD+                                 18.2   18.4
IKD                                 16.8   17.0
IKD+                                17.1   17.5
synthetic translated training set
Standard                            14.7   15.3
KD+                                 17.0   16.8
IKD                                 16.1   16.0
IKD+                                16.3   16.6
Table A.1: Results on Europarl-ST

A Models, Meta-parameters, and Training Settings

We use the speech-to-text module of the fairseq framework (Ott et al., 2019; Wang et al., 2020) for all experiments and train both RNNs with convolutional layers for time dimension reduction as in Berard et al. (2018) and small Transformers as in Wang et al. (2020), which consist of a convolutional subsampler of two convolutional blocks, followed by 12 encoder layers and 6 decoder layers. The dimension of the self-attention layer is 256 and the number of attention heads is set to 4. For the NMT oracle, we use the trained Transformer model from Facebook’s submission to WMT19 (Ng et al., 2019),⁵ which is based on the big Transformer (Vaswani et al., 2017) with 6 encoder and decoder layers, 16 attention heads and a dimension of 1024, with a larger feed-forward layer size of 8192. This NMT oracle had been trained on all available WMT19 shared task en-de training data and on back-translated English and German portions of the News crawl dataset.

For all models we use Adam (Kingma and Ba, 2015) with gradient clipping at norm 10 and stop training if the development set loss has not improved for 10 epochs. For RNN architectures, we return the best model on the development set and for Transformers, we create each model by averaging over the last 10 checkpoints. For inference, a beam size of 5 was used and we report case-sensitive detokenized BLEU (Papineni et al., 2002) computed with sacreBLEU (Post, 2018). We tested for statistical significance with the paired approximate randomization test (Riezler and Maxwell, 2005).

For all experiments, we preprocess the datasets as follows: We extract log mel-scale filterbanks with a povey window, 80 bins, a pre-emphasis filter of 0.97, a frame length of 25 ms and a frame shift of 10 ms. We discard samples with fewer than five or more than 3000 frames, subtract the mean of the waveform from each frame and zero-pad the FFT input. For the text data, we normalize punctuation, remove non-printable characters, use the Moses tokenizer (Koehn et al., 2007) for tokenization and segment the text data into subword units with byte-pair encoding (Sennrich et al., 2016). We used a random seed of 1 for all experiments.

We list the final and best-performing hyperparameters in Table A.2. Parameters that do not differ between the training methods are not repeated in the table. We determine the batch size by defining a maximum number of input frames in the batch.

⁵ As the WMT19 submission consists of an ensemble of models, we use the model1.pt for our experiments.

B Europarl-ST

We performed additional experiments on the Europarl-ST dataset (Iranzo-Sánchez et al., 2020), which provides 83 hours of speech training data. We train RNNs with a learning rate of 0.002 and a max-tokens size of 40,000 for a total of 80,000 updates. All other hyper-parameters are the same as listed for MuST-C in Table A.2. We only trained RNNs on the Europarl-ST dataset due to the small amount of available training data. We present the results in Table A.1.

Both the improvements over standard training and those obtained by training on the gold-translated and synthetic-translated training data correspond with the results presented in the main body of this work. Hence, the results presented here hold for relatively small datasets, too.

C Additional Example of NMT Expert Correction

Here we give another example of the NMT expert predicting the correct output token despite receiv-
Model         Hyperparameter            CoVoST2              MuST-C
RNN
standard      learning rate             1e-3                 1e-3
              max-tokens                60000                40000
              scheduler                 fixed                fixed
              warmup-updates            20000                20000
              encoder-freezing updates  10000                10000
              dropout                   0.2                  0.2
KD+           learning rate             1e-3                 2e-3
              max-tokens                50000                30000
              warmup-updates            25000                20000
              max-update                250000               250000
              encoder-freezing updates  20000                10000
              scheduler                 inverse square root  inverse square root
Transformer
ASR           learning rate             2e-3                 1e-3
              max-tokens                50000                40000
              max-update                60000                100000
              scheduler                 inverse square root  inverse square root
              warmup-updates            10000                10000
              dropout                   0.15                 0.1
AST
standard      learning rate             2e-3                 2e-3
              max-update                30000                100000
              encoder-freezing updates  1000                 -
KD+           max-tokens                50000                20000
Table A.2: List of hyperparameters that depend on model and dataset; we list only parameters that differ from the previous model’s.
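Appendix A states that the batch size is determined by a maximum number of input frames (the max-tokens rows of Table A.2) and that samples with fewer than five or more than 3000 frames are discarded. The sketch below shows one way such frame-budgeted batching could work. It is an illustration under stated assumptions, not fairseq's implementation: the function names are hypothetical, and counting a batch's cost as its longest sample times its number of samples (that is, including padding) is an assumption of this sketch.

```python
from typing import List, Sequence

MIN_FRAMES = 5      # samples shorter than this are discarded
MAX_FRAMES = 3000   # samples longer than this are discarded


def keep_sample(n_frames: int) -> bool:
    """Length filter described in Appendix A."""
    return MIN_FRAMES <= n_frames <= MAX_FRAMES


def batch_by_max_frames(lengths: Sequence[int], max_tokens: int) -> List[List[int]]:
    """Group sample indices so that each padded batch stays within max_tokens.

    Assumed cost of a batch: (longest sample in batch) * (number of samples).
    """
    # Sort kept samples by length so that similar lengths share a batch.
    order = sorted(
        (i for i, n in enumerate(lengths) if keep_sample(n)),
        key=lambda i: lengths[i],
    )
    batches: List[List[int]] = []
    current: List[int] = []
    longest = 0
    for i in order:
        new_longest = max(longest, lengths[i])
        if current and new_longest * (len(current) + 1) > max_tokens:
            # Adding this sample would exceed the budget: close the batch.
            batches.append(current)
            current, new_longest = [], lengths[i]
        current.append(i)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    frame_counts = [3, 1000, 2000, 3000, 4000, 500]
    print(batch_by_max_frames(frame_counts, max_tokens=4000))
```

With a frame budget rather than a fixed number of utterances, batches of short utterances contain many samples and batches of long utterances contain few, which keeps memory use roughly constant across batches.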
Pan Deng¹, Shihao Chen¹, Weitai Zhang¹,², Jie Zhang¹, Lirong Dai¹
¹ University of Science and Technology of China, Hefei, China
² iFlytek Research, Hefei, China
{pdeng, shchen16, zwt2021}@mail.ustc.edu.cn; {jzhang6, lrdai}@ustc.edu.cn
Tunisian            A   0.2M   -
OPUS                B   -      42M
OPUS+Private data   C   -      61M
Tunisian            A   0.2M   -
Filtered

time dimension with mask parameters $(m_T, T) = (2, 70)$. Afterwards, we filtered out audio data that is longer than 3k frames. Further, we introduced
[Figure 1 (image not reproduced); recoverable labels: Ta-En Data for Training, MSA-En]
Figure 1: The data augmentation method for Tunisian-English Text, where * indicates the pseudo text.
Conformer model (Simonyan and Zisserman, 2014; Gulati et al., 2020), VGG-Transformer model
(Vaswani et al., 2017) and GateCNN-Conformer model (Dauphin et al., 2017). These ASR models
differ in their feature extractor modules (VGG, GateCNN) and acoustic modules (Conformer,
Transformer). We chose diverse models with the expectation that increasing the variability of ASR
models would improve the final ASR performance when using model ensemble methods. For dialect
transfer in condition B/C, we pre-trained an ASR model using MSA data, which was then fine-tuned
using the Tunisian data. Note that for condition A, we initially attempted to pre-train a phoneme
recognition model for Tunisian but found it to be useless after fine-tuning the pre-trained model.
3.2 Data Augmentation for MT
We considered various data augmentation techniques for MT. To augment the Tunisian-English
(Ta-En) dialect MT data, we used the back translation and forward translation (BTFT) method to
create a synthetic parallel corpus that can be merged with the true bilingual data. To accomplish
dialect transfer from MSA to Tunisian, we constructed a pivot MT model that converts MSA to
Tunisian and produces abundant synthetic Ta-En data.
BTFT: Two MT models were first trained from Tunisian to English (Ta2En) and from English to
Tunisian (En2Ta) using MT data of condition A. The Tunisian text and English text were then
respectively fed to the corresponding MT models for inference, resulting in paired Tunisian to
synthetic-English text and paired synthetic-Tunisian to English text. It is worth noting that the
Ta2En model implements the forward translation approach similarly to the sequence-level knowledge
distillation method (Kim and Rush, 2016), while the En2Ta model employs the backward translation
(Sennrich et al., 2016a) approach. Ultimately, the obtained synthetic data and the original data
were merged to form the BTFT dataset.
Dialect Transfer: In the IWSLT 2022 dialect ST track, (Yang et al., 2022) presented an effective
Ta2En-bt-tune model that generates synthetic Tunisian-English data by converting MSA to
pseudo-Tunisian with an MSA2Ta MT model. In Figure 1, we modified this approach by introducing a
multi-step pre-training technique that improves the quality of pseudo-Tunisian and enhances
downstream translation tasks. Our dialect transfer method is outlined as follows:
(1) Firstly, the En2MSA (English to MSA) model was pre-trained using condition B/C MT data and
then fine-tuned using the MT data from condition A to create the En2Ta model.
(2) The En2MSA and En2Ta models were utilized separately with the English texts from condition A
and condition B/C as inputs to generate paired Ta-MSA-En triple text data for condition A/B/C. The
pseudo-text in condition A is the MSA* text, whereas the pseudo-text in condition B/C is the
Tunisian* text (* representing pseudo-text). Notably, during this step, the pseudo-Tunisian* text
derived from condition B/C is marked as the first iteration.
(3) Next, we trained an MSA2Ta (MSA to Tunisian) model, which serves as a pivot MT model. We
pre-trained the model with the MSA-Ta* data of condition B/C and fine-tuned it using the MSA*-Ta
data of condition A from step 2.
(4) Lastly, we input the MSA text of condition B/C to the MSA2Ta model for inference, generating
the second iteration of the pseudo-Tunisian text (marked as pseudo-Tunisian**). We re-created the
paired triple text data of Ta-MSA-En text by merging the pseudo-Tunisian** text with the primary
MSA-English text from condition B/C.
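The BTFT construction in Section 3.2 can be illustrated with a minimal Python sketch. It assumes the two trained MT models are available as plain callables; the names `build_btft`, `ta2en` and `en2ta` are hypothetical, and the order-preserving removal of exact duplicates is an added assumption rather than something the text states.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (Tunisian text, English text)


def build_btft(
    bitext: List[Pair],
    ta2en: Callable[[str], str],  # hypothetical Tunisian-to-English translator
    en2ta: Callable[[str], str],  # hypothetical English-to-Tunisian translator
) -> List[Pair]:
    """Merge true pairs with forward- and back-translated synthetic pairs."""
    # Forward translation: true Tunisian source, synthetic English target.
    forward = [(ta, ta2en(ta)) for ta, _ in bitext]
    # Back translation: synthetic Tunisian source, true English target.
    backward = [(en2ta(en), en) for _, en in bitext]

    merged: List[Pair] = []
    seen = set()
    # Assumption: exact duplicate pairs are dropped, keeping first occurrences.
    for pair in list(bitext) + forward + backward:
        if pair not in seen:
            seen.add(pair)
            merged.append(pair)
    return merged
```

With real models, `ta2en` and `en2ta` would wrap the inference calls of the Ta2En and En2Ta systems trained on condition A data.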
[Figure 2 (image not reproduced); recoverable block labels: Tunisian Speech, Speech Encoder, CTC Layer, Adaptor, Ta2En MT, English Translation]
Figure 2: The top figure shows the SATE model (Xu et al., 2021), which implements a forward dialect transfer
system from MSA to Tunisian through pre-training and fine-tuning techniques. The bottom part shows the Hybrid
SATE model with a hierarchical text encoder, which can be used to reversely transfer from Tunisian to MSA.
3.3 End-to-end ST Model
The end-to-end ST approaches can mitigate issues of error propagation that often appears in
low-resource scenarios. We developed an E2E ST system utilizing the SATE model (Xu et al., 2021)
due to its effectiveness and simplicity for implementation, which is shown in Figure 2. In
particular, we suggest two dialect transfer approaches for condition B/C, specifically the forward
dialect transfer system from MSA to Tunisian and the reverse dialect transfer method from Tunisian
to MSA.
3.3.1 Forward dialect transfer system
The forward dialect transfer system aims to transfer information from MSA to Tunisian by
pre-training the ASR and MT models on the MSA dataset, respectively. These models are then
fine-tuned using the Tunisian dataset to transfer from MSA to Tunisian. Note that the forward
dialect transfer system is treated as a transfer of model parameters.
In order to create an E2E ST system, we utilize the SATE model with pre-trained Tunisian ASR and MT
models, followed by fine-tuning the SATE model with Tunisian ST dataset.
During training, the SATE model utilizes multi-task optimization, including the CTC loss of the
source language $\mathcal{L}^{Ta}_{CTC}$, the cross-entropy loss for the target language
$\mathcal{L}^{En}_{CE}$ and the knowledge distillation (KD) losses for both the source and target
languages, i.e., $\mathcal{L}^{Ta}_{KD}$ and $\mathcal{L}^{En}_{KD}$. The overall loss function reads
$$\mathcal{L} = \lambda_1 \mathcal{L}^{Ta}_{CTC} + \lambda_2 \mathcal{L}^{En}_{CE} + \lambda_3 \mathcal{L}^{Ta}_{KD} + \lambda_4 \mathcal{L}^{En}_{KD}, \tag{1}$$
with four respective hyper weight parameters. The SATE model utilizes an adaptor to map speech
features into the text feature space but suffers from inconsistent in-between sequence lengths. For
this, we proposed a robust training method. Specifically, the Tunisian ASR model was first decoded
by retaining both the repeated tokens and blank symbols of the CTC output. The resulting output was
then combined with its corresponding English text to fine-tune the Ta2En MT model. The modified
Ta2En MT model was well-suited to initialize the MT module of the SATE model.
3.3.2 Reverse dialect transfer system
It is a common issue that the Tunisian Arabic dialect is considered as being non-standardized at
the linguistic level (Ben Abdallah et al., 2020). To address this, we proposed a reverse dialect
transfer system that converts the Tunisian dialect to MSA, serving as a regularization of the
dialect, which is illustrated in Figure 2. We modified the SATE model with a hierarchical text
encoder (resulting in Hybrid SATE) to enable the reverse dialect transfer system. The proposed
Hybrid SATE model primarily comprises a speech encoder, a Ta2MSA text encoder and an MSA2En MT
module.
In order to initialize the model parameter for the Ta2MSA text encoder module in the Hybrid SATE
model, we trained a Ta2MSA MT model. Based on the generated Ta-MSA* data in condition A and
Ta**-MSA paired data in condition B/C from Section 3.2, we first pre-trained a Ta2MSA MT model with
the Ta**-MSA data from condition B/C. Notably, the Ta2MSA MT model is equipped with a CTC layer on
top of its encoder and is trained with an additional CTC loss for MSA. Then, we fine-tuned the
model using the Ta-MSA* data from condition A. Finally, the encoder attached with a CTC layer of
the Ta2MSA MT model was used to initialize the Ta2MSA text encoder.
The hybrid SATE model is optimized with an additional CTC loss for MSA, denoted as
$\mathcal{L}^{MSA}_{CTC}$, resulting in the overall loss function
$$\mathcal{L} = \lambda_1 \mathcal{L}^{Ta}_{CTC} + \lambda_2 \mathcal{L}^{En}_{CE} + \lambda_3 \mathcal{L}^{Ta}_{KD} + \lambda_4 \mathcal{L}^{En}_{KD} + \lambda_5 \mathcal{L}^{MSA}_{CTC}. \tag{2}$$
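As a rough illustration of the weighted multi-task objectives in Eq. (1) and Eq. (2), the following Python sketch combines already-computed scalar loss terms. The function name `combine_losses` and the dictionary keys are hypothetical, and the weight values in the usage note are placeholders, since the text does not report the actual weights.

```python
from typing import Mapping


def combine_losses(losses: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted sum of named loss terms, as in Eq. (1) and Eq. (2)."""
    missing = set(losses) - set(weights)
    if missing:
        # Every loss term needs an explicit weight (lambda).
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())
```

For the Hybrid SATE objective, one would pass five terms, for example `{"ctc_ta": ..., "ce_en": ..., "kd_ta": ..., "kd_en": ..., "ctc_msa": ...}`, with one weight per term; dropping the `ctc_msa` entry gives the four-term SATE objective.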
3.4 Model Ensemble Method
As training a single model can lead to implicit model bias, it is expected that a model ensemble
decoding method can improve system robustness, especially in low-resource ST scenarios. We
implemented synchronous decoding with multiple models and averaged the posterior probabilities
predicted by each model at each time step. Consistent with single model decoding, the beam search
decoding strategy was used with a beam size of 10. Subsequently, multiple models decoded the next
tokens based on the same historical tokens. It should be noted that either E2E ST or MT models can
be used for the model ensemble. Consequently, we can form ensembles of E2E ST and cascaded ST
systems by using transcriptions from the ASR models as inputs for the MT models.
Model               B dev   B test   C dev   C test
VGG-Conformer       14.3    13.2     12.5    12
VGG-Transformer     16.6    15.5     14.2    13.3
GateCNN-Conformer   15.1    14.2     14.3    13.4
Table 4: The WER of the MSA MGB2 corpus.
Model               A dev   A test1   B dev   B test1   C dev   C test1
VGG-Conformer       48.5    55.4      45.4    53.2      42      49.7
VGG-Transformer     49.2    57        49      56.8      44.7    52.1
GateCNN-Conformer   46.6    53.4      47.2    53.7      46.1    53.3
Ensemble            44.5    51.7      43.4    50.9      40.8    48.7
Table 5: The original WER on Tunisian. Due to the non-standard orthography and grammar in Tunisian,
the value of original WER is relatively higher than the normalized WER in Table 11.
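The posterior averaging used for ensemble decoding in Section 3.4 can be sketched in Python for a single decoding step. This is a simplified illustration, not the system's implementation: each model's next-token distribution is represented as a plain dictionary, and the helper names `average_posteriors` and `top_k` are hypothetical. A full decoder would apply this at every time step inside beam search, with all models conditioned on the same token history.

```python
from typing import Dict, List, Tuple


def average_posteriors(dists: List[Dict[str, float]]) -> Dict[str, float]:
    """Average next-token probabilities over all models for one time step."""
    if not dists:
        raise ValueError("at least one model distribution is required")
    vocab = set().union(*dists)  # union of the tokens proposed by any model
    n = len(dists)
    # A token missing from a model's dictionary is treated as probability 0.
    return {tok: sum(d.get(tok, 0.0) for d in dists) / n for tok in vocab}


def top_k(dist: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k most probable tokens (ties broken alphabetically)."""
    return sorted(dist.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

Calling `top_k(average_posteriors(step_dists), 10)` would correspond to selecting beam candidates with the beam size of 10 mentioned above.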
Data & Method        MT dev   MT test1   Cascaded ST dev   Cascaded ST test1
Baseline             26.3     23.0       19.4              16.7
BTFT data            28.2     24.0       20.3              17.1
+ Constrained FT     28.5     24.3       20.6              17.3
Table 6: The BLEU score of MT and cascaded MT experiments in condition A.
Model      Pretrain Model   MT BLEU dev   MT BLEU test1
En2Ta      -                12.4          10.0
En2Ta      En2MSA           16.6          12.5
MSA2Ta*    -                8.3           6.8
MSA*2Ta    MSA2Ta*          12.1          9.6
Table 7: The BLEU score of different pivot MT models using Ta-MSA*-En triple text data of condition A.
Model                MT dev   MT test1   Cascaded ST dev   Cascaded ST test1
MSA2En-large         -        -          -                 -
+ BTFT data FT       29.3     26.0       22.2              19.0
+ Constrained FT     30.1     26.2       22.5              19.2
Ta*2En-large         16.3     15.6       13.3              11.4
+ BTFT data FT       29.9     26.5       22.5              19.3
+ Constrained FT     30.4     26.6       22.8              19.5
Ta**2En-large        16.7     15.5       13.3              12.0
+ BTFT data FT       30.4     26.6       23.1              19.2
+ Constrained FT     30.8     27.0       23.2              19.5
Table 8: The BLEU score of the MT and the cascaded ST systems in condition C.
Model                      MT dev   MT test1   Cascaded ST dev   Cascaded ST test1
Condition A Best           28.5     24.3       20.6              17.3
+ Error Adaptation FT      28.3     23.9       20.5              17.1
Condition C Best           30.8     27.0       23.2              19.5
+ Error Adaptation FT      30.7     26.6       23.3              19.7
Table 9: The BLEU score of the MT and the cascaded ST systems in condition A/C when using error
adaption fine-tune method.
that combining the training data with BTFT data brings a considerable performance gain for both MT
and cascaded ST. The MT model trained by the BTFT data is further fine-tuned by the original true
paired Ta-En data. In order to prevent excessive over-fitting while fine-tuning, we proposed a
constrained fine-tune method, as depicted in Figure 3. Specifically, the student model is
constrained by the teacher model using KL divergence loss to
[Figure 3 (image not reproduced); recoverable labels: Y Ground Truth, loss_CE, loss_KD]
Table 10: The BLEU scores of our E2E ST in condition A/B/C, where the speech encoder and MT module represent
the sub-modules, and MT and MT-Macaron represent MT large and MT macaron models, respectively.
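The constrained fine-tune method mentioned in the text (a student model constrained by a teacher through a KL divergence loss, alongside the usual cross-entropy against the ground truth) might look roughly like the per-token Python sketch below. Several points are assumptions made only for illustration: the KL direction (teacher to student), the additive combination, and the `kd_weight` parameter are not specified in the surviving text, and all function names are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(student_probs: Sequence[float], gold_index: int) -> float:
    """Negative log-likelihood of the ground-truth token under the student."""
    return -math.log(student_probs[gold_index])


def kl_divergence(teacher_probs: Sequence[float], student_probs: Sequence[float]) -> float:
    """KL(teacher || student); assumes student probabilities are non-zero."""
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0  # terms with zero teacher mass contribute nothing
    )


def constrained_loss(
    student_probs: Sequence[float],
    teacher_probs: Sequence[float],
    gold_index: int,
    kd_weight: float = 1.0,  # hypothetical weight, not taken from the paper
) -> float:
    """Cross-entropy plus a KL penalty that keeps the student near the teacher."""
    ce = cross_entropy(student_probs, gold_index)
    kd = kl_divergence(teacher_probs, student_probs)
    return ce + kd_weight * kd
```

When the student matches the teacher exactly, the KL term vanishes and only the cross-entropy remains, which is the sense in which the penalty discourages the fine-tuned model from drifting.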
Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the
2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, Austin,
Texas. Association for Computational Linguistics.
Yiping Lu, Zhuohan Li, Di He, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2019.
Understanding and improving transformer from a multi-particle dynamic system point of view. arXiv
preprint arXiv:1906.02762.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and
Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint
arXiv:1904.01038.
Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le.
2019. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proc.
Interspeech 2019, pages 2613–2617.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third
Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association
for Computational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models
with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for
Computational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words
with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for
Computational Linguistics.
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556.
Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT – building open translation services for the
world. In Proceedings of the 22nd Annual Conference of the European Association for Machine
Translation, pages 479–480, Lisboa, Portugal. European Association for Machine Translation.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information
processing systems, 30.
Chen Xu, Bojie Hu, Yanyang Li, Yuhao Zhang, Shen Huang, Qi Ju, Tong Xiao, and Jingbo Zhu. 2021.
Stacked acoustic-and-textual encoding: Integrating the pre-trained models into speech translation
encoders. In Proceedings of the 59th Annual Meeting of the Association for Computational
Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1:
Long Papers), pages 2619–2630, Online. Association for Computational Linguistics.
Brian Yan, Patrick Fernandes, Siddharth Dalmia, Jiatong Shi, Yifan Peng, Dan Berrebbi, Xinyi Wang,
Graham Neubig, and Shinji Watanabe. 2022. CMU’s IWSLT 2022 dialect speech translation system. In
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages
298–307, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Jinyi Yang, Amir Hussein, Matthew Wiesner, and Sanjeev Khudanpur. 2022. JHU IWSLT 2022 dialect
speech translation system description. In Proceedings of the 19th International Conference on
Spoken Language Translation (IWSLT 2022), pages 319–326, Dublin, Ireland (in-person and online).
Association for Computational Linguistics.
Weitai Zhang, Zhongyi Ye, Haitao Tang, Xiaoxi Li, Xinyuan Zhou, Jing Yang, Jianwei Cui, Pan Deng,
Mohan Shi, Yifan Song, Dan Liu, Junhua Liu, and Lirong Dai. 2022. The USTC-NELSLIP offline speech
translation systems for IWSLT 2022. In Proceedings of the 19th International Conference on Spoken
Language Translation (IWSLT 2022), pages 198–207, Dublin, Ireland (in-person and online).
Association for Computational Linguistics.
A Appendix. Model configurations
The detailed model configurations for ASR systems are as follows:
• Condition A: The model configurations are almost identical to the ESPnet (Inaguma et al., 2020)
baseline. There are 12-layer encoder and 6-layer decoder. The attention module of both the encoder
and decoder comprises 256 hidden units and 4 attention heads. The size of the FFN module is 1024
for the encoder but 2048 for the decoder. We use two VGG blocks as the feature extractor for both
the VGG-Conformer and the VGG-Transformer models. For the GateCNN-Conformer model, the feature
extractor has a 6-layer GateCNN.
• Condition B/C: The model difference between the condition A and the condition B/C lies in the
model size. For condition B/C, the attention module has 512 hidden units and 8 attention heads, and
the size of FFN is 4096.
Condition   Training Stage                       lr     Max-tokens   Warmup   Dropout rate   Training steps
A           Stage1: BTFT Pretrain                5e-4   12000        4000     0.3            120000
A           Stage2: Constrained Fine-tune        -      4096         -        0.3            40000
B/C         Stage1: MSA-En Pretrain              1e-3   40000×8      4000     0.1            200000
B/C         Stage2: Ta**-En Pretrain             5e-4   40000×8      None     0.1            20000
B/C         Stage3: BTFT Fine-tune               4e-5   6144         4000     0.3            120000
B/C         Stage4: Constrained Fine-tune        -      2048         -        0.3            80000
B/C         Stage5: Error Adaptation Fine-tune   1e-5   4096         None     0.3            10000
Table 12: Hyper parameters in different stages ("-" means reuse from the former stage and "×" the GPU numbers).
KIT’s Multilingual Speech Translation System for IWSLT 2023
Danni Liu, Thai Binh Nguyen, Sai Koneru, Enes Yavuz Ugan, Ngoc-Quan Pham,
Tuan-Nam Nguyen, Tu Anh Dinh, Carlos Mullov, Alexander Waibel, Jan Niehues
Karlsruhe Institute of Technology
firstname.lastname@kit.edu
        Original        After Diversification
Lang.   # sent. (M)     # sent. (M)   # tokens (M)
ar      26.0            65.2          865.0
zh      11.2            21.5          254.3
nl      33.1            82.1          1162.7
fr      38.9            91.6          1427.8
de      23.0            54.4          860.0
ja*     2.6             27.2          832.7
fa      5.8             11.3          162.1
pt      29.0            72.3          1024.3
ru      22.1            51.5          685.3
tr      36.7            89.7          1021.2
Total   228.4           566.8         8295.4
Table 3: MT data overview. *: For ja, the original data of 2.6M sentences did not include
JParaCrawl, which was announced later as allowed data.
As preprocessing, we perform truecasing, deduplication, length ratio filtering, and histogram
filtering using the statistics by Fan et al. (2021). Then we perform subword segmentation using
Sentencepiece (Kudo and Richardson, 2018) based on the vocabulary of mBART50 (Tang et al., 2020).
Data Diversification Different from last years’ shared tasks (Anastasopoulos et al., 2021, 2022),
no monolingual (non-English) data is provided. This means conventional data augmentation
techniques like backward translation are not directly applicable. On the other hand, forward
translation from existing English monolingual data may introduce undesirable errors in the
translation targets, especially on lower-resource languages. In this light, we use data
diversification (Nguyen et al., 2020), a data augmentation method that enriches existing parallel
data by forward and backward translating the training bitext. As the model has seen the parallel
data in training, the synthetic translations are expected to have relatively high quality.
Moreover, either the source or target side of the synthetic data is from the original bitext. The
diversified data amount after deduplication is shown in Table 3. Here we perform one round of
forward and backward translation, as Nguyen et al. (2020) have empirically shown further rounds do
not lead to substantial gains.
2.4 Casing/Punctuation Restoration Data
The ASR outputs are lower-cased and unpunctuated, while the MT model expects cased and punctuated
inputs. We randomly sample 1.5 million English sentences from the MT training data (Table 3), and
remove the casing and punctuation marks as training source data. We then train a model to restore
the casing and punctuation marks.
2.5 Speech Translation Data
The speech translation data are shown in Table 4. We additionally use our trained MT model to
create forward translations based on the following transcript-only datasets: Common Voice,
TEDLIUM, and VoxPopuli. The TTS data described in §2.2 is also used.
Lang.   Corpus / Data Source   Hours   # Utterances
ar      CoVoST                 429     289k
        MuST-C                 463     212k
        TTS                    283     203k
zh      CoVoST                 429     289k
        MuST-C                 596     358k
        TTS                    204     183k
nl      MuST-C                 434     248k
        europarl-ST            75      32k
        TTS                    1138    713k
fr      MuST-C                 485     275k
        europarl-ST            76      32k
        TTS                    1768    998k
de      CoVoST                 429     289k
        MuST-C                 440     269k
        europarl-ST            77      33k
        TTS                    1891    779k
ja      CoVoST                 429     289k
        MuST-C                 541     329k
        TTS                    73      56k
fa      CoVoST                 429     289k
        MuST-C                 347     182k
        TTS                    89      88k
pt      MuST-C                 377     206k
        europarl-ST            75      32k
        TTS                    1678    639k
ru      MuST-C                 482     265k
        TTS                    331     331k
tr      CoVoST                 429     289k
        MuST-C                 446     236k
        TTS                    428     511k
all     Common Voice           1488    948k
        TEDLIUM                453     268k
        VoxPopuli              502     177k
Table 4: ST data overview. The last section “all” indicates forward translated synthetic targets
from transcript-only corpora, which are available for all 10 languages.
3 Cascaded System
For the cascaded system, we introduce our ASR (§3.1) and MT (§3.2) models.
3.1 Automatic Speech Recognition Module
Baseline Models The first baseline is our ASR model for last year’s offline track (Pham et al.,
2022). It is a Wav2vec 2.0 (Baevski et al., 2020) with LARGE configuration pretrained on 960 hours
of Librispeech data. This year, after seeing initial favourable results compared to Wav2vec, we
opt for WavLM (Chen et al., 2022) as audio encoder. We use the LARGE configuration with 24 layers.
We use the mBART50 (Tang et al., 2020) decoder along with the WavLM encoder. As the ASR model only
needs to transcribe English2, we trim the mBART50 vocabulary from 256k down to 62k tokens by
removing all non-alphabetic tokens.
In-Domain TTS Data We also use the synthesized TTS data. Compared to the same model without TTS
data, the word error rate (WER) improves from 11.6% to 10.7% on ACL dev, but degrades from 8.4% to
9.0% on the TEDLIUM test set. There are two potential explanations: First, the noisy TTS speech may
be helpful for handling the non-native utterances prominent in the ACL dev set. Second, the target
side of the TTS data is more relevant to the ACL domain, as we selected them based on n-gram
overlap with ACL data. This in turn improves ASR performance on the ACL dev set.
As shown in Table 5, compared to last year’s submission, this year’s ASR model achieves consistent
gains across domains on ACL dev, tst-COMMON, and tst2020.
Model                          ACL dev   tstCom.   tst2020
ASR 2022 (Pham et al., 2022)   12.5      5.4       5.6
WavLM + mBART50                10.7      3.9       4.8
Table 5: ASR results in WER(↓) in comparison to our submission last year (Pham et al., 2022) which
used Wav2vec trained with CTC and a 5-gram LM. By using WavLM audio encoder and the mBART decoder,
we achieve consistent gains across domains (ACL and TED, i.e., tst*).
Language Model (LM) Adaptation Aside from using TTS data, we also investigate other methods to
adapt towards the ACL domain using the provided paper abstracts. On preliminary experiments with
Connectionist Temporal Classification (CTC) + n-gram LM models, we integrate ACL abstract 5-gram
statistics into the language models. As shown in the upper section of Table 6, this improves on
ACL dev (WER 13.8% → 13.0%) while preserving the performance on TED talks (tst-COMMON WER stays at
7.6%).
2 BART, the English-only predecessor of mBART, is not among the allowed pretrained models.
As our final system is an encoder-decoder model (WavLM + mBART50), adapting the LM alone is less
straightforward. We create pseudo ASR training data with ACL data on the transcript side.
Specifically, we use our TTS model to synthesize speech from the ACL dev and test abstracts. As the
amount of ACL abstract data is very limited (less than 100 sentences in total), we heavily
upsampled them, so that they consist of 60% of the training data. As shown in the lower section of
Table 6, this leads to a minor improvement of WER for ACL dev. However, the gain does not carry
over to ST performance when later cascading with our MT model. Therefore, our final ASR system did
not use the abstracts. The lack of improvement could be related to the low amount of ACL abstract
data, which requires heavy upsampling of the TTS data, and as a result hinders the ability of
transcribing real speech.
The contrast between the two sets of experiments may be related to diminishing gains as WER
improves, i.e., for the Wav2vec + CTC + LM model, gaining over a WER of 13.8% is easier than
starting from a 10.7% WER. Another interpretation of the difference could be that adding specific
constraints to “end-to-end” ASR models is more challenging than the counterparts with separate LMs.
Model                               ACL dev   tst-COMMON
Wav2vec + CTC + 5-gram              13.8      7.6
+ ACL abstract 5-gram               13.0      7.6
WavLM + mBART50                     10.7      3.9
+ ACL abstract TTS (upsampled)      10.5      4.3
Table 6: ASR adaptation results in WER(↓). On preliminary experiments with Wav2vec + CTC + LM
models, we improve ASR performance on ACL dev by integrating n-gram statistics from the ACL
abstracts. For the WavLM + mBART50 model, adding synthesized audio-transcript data based on ACL dev
abstracts does not give consistent gain.
Casing/Punctuation Restoration We take a sequence-to-sequence approach to the casing and
punctuation restoration problem. Specifically, we train a punctuation model initializing from
DeltaLM-base (Ma et al., 2021) to restore the casing and punctuation information, using the
training data described in §2.4.
3.2 Machine Translation Module
Baseline Model We start with the pretrained DeltaLM (Ma et al., 2021) with LARGE configura-
ACL dev (en→X) TED (en→de)
ID de ja zh ar nl fr fa pt ru tr Avg. tst2019 tst2020
From ground-truth transcripts (MT alone)
(1) base 39.8 44.2 47.4 30.4 45.7 48.9 23.6 51.1 19.5 22.9 37.4 29.5 32.9
(2) data divers. all 41.6 44.5 49.8 33.6 50.7 51.1 25.4 52.5 21.5 24.6 39.5 30.0 33.7
(3) (1) + data divers.; adapter 41.4 45.8 48.8 33.3 49.8 51.5 25.2 54.1 21.9 24.1 39.6 29.5 33.2
(4) ensemble (2) + (3) 41.7 46.1 49.6 33.7 50.8 52.1 25.9 54.3 23.1 24.8 40.2 30.4 33.7
(5) (4) + kNN-MT 43.7 47.3 49.8 35.4 52.3 52.8 27.2 55.3 23.9 27.1 41.5 30.4 33.4
From ASR outputs (cascaded ST)
(1) base 34.3 38.2 41.6 25.3 36.6 39.9 19.1 40.7 16.7 18.9 31.1 26.5 28.0
(2) data divers. all 35.4 38.6 44.3 26.8 39.2 41.5 20.5 42.6 18.7 19.5 32.7 27.0 29.3
(3) (1) + data divers.; adapter 35.5 39.0 43.6 26.4 38.9 41.9 20.2 43.0 19.3 19.6 32.7 26.7 28.3
(4) ensemble (2) + (3) 36.1 39.8 44.4 26.9 39.8 42.3 20.7 43.5 19.2 19.7 33.2 26.9 28.7
(5) (4) + kNN-MT 36.8 40.2 44.6 28.2 40.8 42.0 21.8 44.5 19.7 21.1 34.0 26.9 28.5
End-to-end ST
(6) WavLM + mBART50 decoder 31.7 29.2 40.7 25.0 36.7 40.5 19.5 43.0 16.9 18.5 30.2 27.0 29.3
(7) (6) + TTS 33.2 29.2 40.5 25.5 37.9 41.0 20.1 43.9 16.5 18.9 30.7 27.0 29.1
(8) ensemble (6) + (7) 34.0 29.9 41.7 25.5 38.2 42.0 20.2 44.4 18.3 20.2 31.4 27.3 29.6
tion. The pretrained model has 24 and 12 encoder and decoder Transformer layers respectively. It
uses postnorm layer normalization. It is a fully multilingual model where all parameters are shared
across languages. The target language tokens are prepended to the source target sentences. We use
temperature-based sampling (Arivazhagan et al., 2019) with τ = 5.0 to counteract the data
imbalance between languages. When training, we use a relatively large effective batch size of 128k
as preliminary experiments with smaller batch sizes showed more instabilities in training. This
might be a side effect of the postnorm layer normalization (Nguyen and Salazar, 2019). The results
of the baseline are shown in Row (1) of Table 7, with an average score of 37.4 BLEU3 on ACL dev.
Data Diversification As motivated in §2.3, we use data diversification as an alternative data
augmentation method in absence of monolingual target data for backtranslation. As data
diversification needs forward and backward translations on the training data, we additionally
train a 10-to-English model to create the backward translations. Row (2) of Table 7 shows the
results after data diversification on all language pairs. On average, this data augmentation
approach improves MT quality by 2.1 BLEU (37.4 → 39.5), and ST quality by 1.6 BLEU (31.1 → 32.7).
3 By default using tok.13a from sacreBLEU (Post, 2018), except for zh and ja where we use tok.zh
and tok.ja-mecab-0.996-IPA.
Adapters for Incremental Data Retraining on the new training data after diversification (Row (2)
of Table 7) is time-consuming and costly. To adapt the initial model (Row (1) of Table 7) rapidly
towards the augmented data, we use adapters (Bapna and Firat, 2019; Philip et al., 2020). In this
case, the adapters are target-language-specific. The adapters are inserted after each encoder and
decoder layer. We initialize from the trained baseline (Row (1) in Table 7), freeze trained
parameters and update the adapters only. We use the efficient implementation from Baziotis et al.
(2022). As shown in Row (3) of Table 7, only training the adapters on the new diversified training
data performs on par with the re-training setup in Row (2) (39.6 on MT and 32.7 on ST on average
for ACL dev). These results demonstrate that adapters are suitable for fast and effective
incremental learning when additional training data emerges later.
To our surprise, adding adapters to the model trained with full data diversification (Row (2) from
Table 7) does not bring further gain. A similar observation was reported by Pires et al. (2023),
who opted for training the full network from scratch along with adapters instead. In our case, it
therefore would be interesting to see the impact of training on data diversification with adapters
from scratch.
Multilingual vs Bilingual To investigate the impact of interference from multiple target
languages, in preliminary experiments, we also compare the multilingual and bilingual translation
performance for selected language pairs. As shown in Table 8,
compared to bilingual models, the multilingual model lags behind especially on higher-resource
languages. Adding the adapters partly closes this gap. Note the score difference to main result
table (Table 7) is because the preliminary experiments did not fully use diversified data for all
languages.
               ACL dev                 tst-COMMON
Model          en-de   en-ru   en-fa   en-de   en-ru   en-fa
bilingual      41.0    20.0    24.2    34.3    22.7    16.0
multilingual   39.8    19.5    23.6    34.1    21.9    15.9
+ adapters     40.9    20.2    23.7    34.7    22.2    16.3
Table 8: Comparison of bilingual vs multilingual translation performance in BLEU (↑) on German
(de), Russian (ru), Farsi (fa), which are high-, mid-, low-resource in the training data (Table 3).
Multilingual system falls behind bilingual system, while adapters partly close the gap. Note the
score difference to main result table (Table 7) is because the experiments here did not fully use
diversification.
Source (ASR output): ... in a zero shot evaluation setup, meaning that pre trained word embedding
models are applied out of the box without any additional fine tuning
w/o kNN-MT (Table 7 row (4)): ... in einer Null-Shot-Bewertungs-Setup (zero-shot evaluation setup),
was bedeutet, dass vorgebildete (pre-educated) Wort-Einbettungsmodelle ohne zusätzliche
Feinabstimmung direkt angewendet werden.
w/ kNN-MT (Table 7 row (5)): ... in einer Null-Shot-Bewertung (zero-shot evaluation), was bedeutet,
dass vortrainierte (pretrained) Wort-Einbettungsmodelle ohne zusätzliche Feinabstimmung direkt
angewendet werden.
Source (ASR output): Hello. My name is Ramachandra, and I will present our paper.
w/o kNN-MT (Table 7 row (4)): 你好 (Hello; addressing a single person),我叫拉玛钱德拉 我要发表 (publish)我们的论文
w/ kNN-MT (Table 7 row (5)): 大家好 (Hi all; addressing a group of audience),我叫拉玛钱德拉, 我要介绍 (introduce)我们的论文。
Table 9: Examples of kNN-MT improving translation quality for en→de (upper) and en→zh (lower).
kNN-MT creates more accurate terminology translations (“pre trained” for en→de) and creates more
context-appropriate translation (“Hello” for en→zh).
Ensemble Although the models in Row (2) and (3) in Table 7 are trained on the same data and share
the same base architecture, we expect their representations to be sufficiently different, as (3)
additionally uses adapters. We therefore ensemble these two models. The results are in Row (4) of
Table 7. On MT and ST, for ACL, ensembling shows an improvement of 0.6 and 0.5 BLEU respectively
over the single models in Row (2) and (3). On TED, however, ensembling does not seem to impact the
scores compared to the single models. One explanation is that the adapter model from Row (3)
performs worse than its non-adapter counterpart (Row (2)) on TED, which limits the overall
effectiveness of ensembling.
kNN-MT We also adapt the MT model to the target domain of scientific talks. A challenge is that we
do not have sufficient training data to fully fine-tune the MT model towards the desired domain or
style. In this case, we use kNN-MT (Khandelwal et al., 2021) to adapt the model at inference time.
In kNN-MT, bitexts are passed through a trained MT model. For each target token, its decoder
hidden state is stored in a datastore. At inference time, based on the current decoder hidden
state, k candidate target tokens are retrieved from the datastore using a nearest neighbor lookup.
The retrieved token distribution is then interpolated with the MT target distribution, which in
turn generates the output tokens. Hyperparameters for kNN-MT include the number of retrieved
neighbors k, the temperature for smoothing the kNN distribution T, and the interpolation weight w.
In our experiments, we use systems (2) and (3) from Table 7 for creating the datastores. As
different models’ hidden states (which serve as keys in the datastore) also differ substantially,
the datastore is MT-model-dependent. To use kNN-MT when ensembling systems (2) and (3), we
therefore need two datastores for systems (2) and (3) respectively. The kNN-MT candidate tokens are
interpolated with the output vocabulary distribution before the ensembling operation.
We use hyperparameters k = 8, T = 50, w = 0.3, after an initial search with T ∈ [10, 50, 100],
w ∈ [0.1, 0.3, 0.5]. Our implementation mostly follows Zheng et al. (2021), which uses the FAISS
toolkit (Johnson et al., 2019) for efficient kNN operations. Comparing the inference speed of
system (4) and (5), with the same batch size of 64 sentences4, using kNN-MT takes roughly 50% more
time on a Nvidia Titan RTX GPU with 24GB memory.
4 System (5) requires more GPU memory than system (4). The latter would be able to use a larger
batch size of 128 sentences.
Naively using all ACL dev bitext as datastore would lead the model to copying the oracle targets.
To simulate the scenario on the blind test set, when
translating the i-th talk, we use the other talks’ bitext ($j \in [n]$, $j \neq i$) as datastore, where $n$ is the total number of talks.

As shown in Row (5) of Table 7, kNN-MT brings an additional gain of 1.3 BLEU on MT and 0.8 BLEU on ST. These results show that a datastore as small as hundreds of sentence pairs can be effectively used for inference-time domain adaptation.

Table 9 shows two examples of kNN-MT improving translation quality. Apart from generic improvements in fluency and accuracy, in these examples kNN-MT also helps generate correct terminology and context-appropriate greetings.

4 End-to-End System

For the end-to-end system, similar to our ASR model, after seeing initial favourable results of WavLM over Wav2vec, we choose WavLM as the audio encoder. Following last year’s submission (Pham et al., 2022), we use the mBART50 decoder. The results are shown in Row (6) of Table 7. Contrasting Row (6) and (7) reveals that adding the TTS data does not substantially change ST performance. However, ensembling the two models trained with and without TTS data (Row (8)) improves over the single models (on average +0.7 for ACL, +0.4 for TED), despite their identical architecture.

Compared to the strongest cascaded system (Row (5)), the end-to-end system falls behind by 2.6 BLEU on ACL dev. On TED, however, it appears to slightly outperform the cascaded system. One explanation is that the MT model of the cascaded system has not been separately adapted to TED texts (although parts of the full training data do cover TED data), which was shown to be essential in improving performance on TED test sets (Zhang et al., 2022; Pham et al., 2022). The end-to-end system, on the other hand, has seen a larger proportion of TED data in training (Table 4).

Similar to the previous year (Polák et al., 2022), we also adapt our end-to-end offline model for the simultaneous track (Polák et al., 2023).

5 Conclusion

In this paper, we described our systems for the multilingual speech translation track of IWSLT 2023, which translates English speech into 10 target languages. To tackle the task of translating scientific conference talks, which feature non-native input speech and terminology-dense contents, our systems have several novelties. Lacking suitable training data for the target domain, we used kNN-MT for inference-time adaptation and showed an improvement of +0.8 BLEU for the cascaded speech translation system. We also used adapters to integrate incremental data from augmentation, and achieved performance on par with re-training on all data. In our experiments, we observed that cascaded systems are more easily adaptable towards desired target domains due to their separate modules. Our cascaded speech system outperforms its end-to-end counterpart on scientific talk translation, although their performance remains similar on TED talks. For future work, we are interested in the feasibility of applying the adaptation approaches shown effective on MT to end-to-end ST.

Acknowledgement We thank the anonymous reviewers for detailed and insightful feedback. Part of this work was performed on the HoreKa supercomputer funded by the Ministry of Science, Research and the Arts Baden-Württemberg and by the Federal Ministry of Education and Research of Germany. Part of this work was supported by the Federal Ministry of Education and Research of Germany under grant agreement 01EF1803B (RELATER).

References

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutai Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation.
In Proceedings of the 2019 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874–3884, Minneapolis, Minnesota. Association for Computational Linguistics.

Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Vĕra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nǎdejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 evaluation campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 98–157, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Antonios Anastasopoulos, Ondřej Bojar, Jacob Bremerman, Roldano Cattoni, Maha Elbayad, Marcello Federico, Xutai Ma, Satoshi Nakamura, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Alexander Waibel, Changhan Wang, and Matthew Wiesner. 2021. Findings of the IWSLT 2021 evaluation campaign. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 1–29, Bangkok, Thailand (online). Association for Computational Linguistics.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. 2020. Common voice: A massively-multilingual speech corpus. In Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, pages 4218–4222. European Language Resources Association.

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George F. Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen, and Yonghui Wu. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. CoRR, abs/1907.05019.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Ankur Bapna and Orhan Firat. 2019. Simple, scalable adaptation for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1538–1548, Hong Kong, China. Association for Computational Linguistics.

Christos Baziotis, Mikel Artetxe, James Cross, and Shruti Bhosale. 2022. Multilingual machine translation with hyper-adapters. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1170–1185, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, and Furu Wei. 2022. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process., 16(6):1505–1518.

Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2012–2017, Minneapolis, Minnesota. Association for Computational Linguistics.

Matthias Eck, Stephan Vogel, and Alex Waibel. 2005. Low cost portability for statistical machine translation based on n-gram frequency and TF-IDF. In Proceedings of the Second International Workshop on Spoken Language Translation, Pittsburgh, Pennsylvania, USA.

Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Michael Auli, and Armand Joulin. 2021. Beyond english-centric multilingual machine translation. The Journal of Machine Learning Research, 22:107:1–107:48.

François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia A. Tomashenko, and Yannick Estève. 2018. TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation. In Speech and Computer - 20th International Conference, SPECOM 2018, Leipzig, Germany, September 18-22, 2018, Proceedings, volume 11096 of Lecture Notes in Computer Science, pages 198–208. Springer.

Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchís, Jorge Civera, and Alfons Juan. 2020.
Europarl-st: A multilingual corpus for speech translation of parliamentary debates. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020, pages 8229–8233. IEEE.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547.

Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2021. Nearest neighbor machine translation. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

Jaehyeon Kim, Jungil Kong, and Juhee Son. 2021. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 5530–5540. PMLR.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79–86, Phuket, Thailand.

Sai Koneru, Danni Liu, and Jan Niehues. 2022. Cost-effective training in low-resource neural machine translation. CoRR, abs/2201.05700.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 923–929, Portorož, Slovenia. European Language Resources Association (ELRA).

Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Alexandre Muzio, Saksham Singhal, Hany Hassan Awadalla, Xia Song, and Furu Wei. 2021. Deltalm: Encoder-decoder pre-training for language generation and translation by augmenting pretrained multilingual encoders. CoRR, abs/2106.13736.

Makoto Morishita, Katsuki Chousa, Jun Suzuki, and Masaaki Nagata. 2022. JParaCrawl v3.0: A large-scale English-Japanese parallel corpus. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6704–6710, Marseille, France. European Language Resources Association.

Toan Q. Nguyen and Julian Salazar. 2019. Transformers without tears: Improving the normalization of self-attention. In Proceedings of the 16th International Conference on Spoken Language Translation, IWSLT 2019, Hong Kong, November 2-3, 2019. Association for Computational Linguistics.

Xuan-Phi Nguyen, Shafiq R. Joty, Kui Wu, and Ai Ti Aw. 2020. Data diversification: A simple strategy for neural machine translation. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, pages 5206–5210. IEEE.

Ngoc-Quan Pham, Tuan Nam Nguyen, Thai-Binh Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, and Alexander Waibel. 2022. Effective combination of pretrained models - KIT@IWSLT2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 190–197, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Jerin Philip, Alexandre Berard, Matthias Gallé, and Laurent Besacier. 2020. Monolingual adapters for zero-shot neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4465–4470, Online. Association for Computational Linguistics.

Telmo Pessoa Pires, Robin M. Schmidt, Yi-Hsiu Liao, and Stephan Peitz. 2023. Learning language-specific layers for multilingual machine translation. CoRR, abs/2305.02665.

Peter Polák, Danni Liu, Ngoc-Quan Pham, Jan Niehues, Alexander Waibel, and Ondřej Bojar. 2023. Towards efficient simultaneous speech translation: CUNI-KIT system for simultaneous track at IWSLT 2023. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), Toronto, Canada (in-person and online). Association for Computational Linguistics.

Peter Polák, Ngoc-Quan Pham, Tuan Nam Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, Ondřej Bojar, and Alexander Waibel. 2022. CUNI-KIT system for simultaneous speech translation task at IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 277–285, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on
Machine Translation: Research Papers, pages 186–
191, Brussels, Belgium. Association for Computa-
tional Linguistics.
Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea
Vedaldi. 2017. Learning multiple visual domains
with residual adapters. In Advances in Neural Infor-
mation Processing Systems 30: Annual Conference
on Neural Information Processing Systems 2017, De-
cember 4-9, 2017, Long Beach, CA, USA, pages 506–
516.
Nils Reimers and Iryna Gurevych. 2020. Making
monolingual sentence embeddings multilingual us-
ing knowledge distillation. In Proceedings of the
2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 4512–4525,
Online. Association for Computational Linguistics.
Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Na-
man Goyal, Vishrav Chaudhary, Jiatao Gu, and An-
gela Fan. 2020. Multilingual translation with exten-
sible multilingual pretraining and finetuning. CoRR,
abs/2008.00401.
Jörg Tiedemann. 2012. Parallel data, tools and inter-
faces in OPUS. In Proceedings of the Eighth In-
ternational Conference on Language Resources and
Evaluation (LREC’12), pages 2214–2218, Istanbul,
Turkey. European Language Resources Association
(ELRA).
Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu,
Chaitanya Talnikar, Daniel Haziza, Mary Williamson,
Juan Pino, and Emmanuel Dupoux. 2021. VoxPop-
uli: A large-scale multilingual speech corpus for rep-
resentation learning, semi-supervised learning and
interpretation. In Proceedings of the 59th Annual
Meeting of the Association for Computational Lin-
guistics and the 11th International Joint Conference
on Natural Language Processing (Volume 1: Long
Papers), pages 993–1003, Online. Association for
Computational Linguistics.
Changhan Wang, Anne Wu, and Juan Miguel Pino. 2020.
Covost 2: A massively multilingual speech-to-text
translation corpus. CoRR, abs/2007.10310.
Weitai Zhang, Zhongyi Ye, Haitao Tang, Xiaoxi Li,
Xinyuan Zhou, Jing Yang, Jianwei Cui, Pan Deng,
Mohan Shi, Yifan Song, Dan Liu, Junhua Liu, and
Lirong Dai. 2022. The USTC-NELSLIP offline
speech translation systems for IWSLT 2022. In Pro-
ceedings of the 19th International Conference on
Spoken Language Translation (IWSLT 2022), pages
198–207, Dublin, Ireland (in-person and online). As-
sociation for Computational Linguistics.
Xin Zheng, Zhirui Zhang, Junliang Guo, Shujian Huang,
Boxing Chen, Weihua Luo, and Jiajun Chen. 2021.
Adaptive nearest neighbor machine translation. In
Proceedings of the 59th Annual Meeting of the Asso-
ciation for Computational Linguistics and the 11th
International Joint Conference on Natural Language
Processing (Volume 2: Short Papers), pages 368–374,
Online. Association for Computational Linguistics.
The BIGAI Offline Speech Translation Systems for IWSLT 2023 Evaluation
Zhihang Xie
Beijing Institute of General Artificial Intelligence
zhihangxie@gmail.com
For end-to-end speech translation, the models have an architecture similar to the PT36 models in Zhang et al.’s (2022b) work instead of the PT48 models to reduce computational complexity. Within a PT36 model, the speech module and the translation module are initialized with the ASR12 model and the MT24 model respectively. The adapter module that connects the two modules is not trained from random initialization, because it has been trained with the ASR12 model in the first stage. The training loss combines the cross entropy loss for machine translation and the CTC loss for speech recognition with a hyperparameter to balance the weights between the two losses.

3.3 Speech Resegmentation

Past years’ systems (Anastasopoulos et al., 2021; Antonios et al., 2022) have shown that speech resegmentation has a great impact on the translation performance at corpus level. During evaluation, audio clips are split into segments with a simple two-stage strategy using the WebRTCVAD⁴ toolkit. In the split stage, long audios are processed with three-level settings of aggressiveness modes increasing from 1 to 3 and frame sizes decreasing from 30ms to 10ms. In this way, most segments are no longer than a maximum duration $dur_{max}$ and the outliers are further segmented into $\lfloor \frac{duration}{0.75 \times \theta} \rfloor$ chunks brutally. In the merge stage, consecutive segments are merged into final segments no shorter than a minimum duration $dur_{min}$.

⁴ https://github.com/wiseman/py-webrtcvad

4 Experiments

4.1 Settings

All the models are implemented with the SpeechBrain toolkit (Ravanelli et al., 2021). The total number of parameters in a PT36 model is about 794.0M, 183.2M in the speech module and 610.9M in the translation module. The feature extractor processes speech waveform with seven 512-channel convolution layers, in which kernel sizes and strides are [10,3,3,3,3,2,2] and [5,2,2,2,2,2,2]. There are 12 Transformer layers with 16 attention heads, model dimension of 1024 and inner dimension of 4096 in speech encoder, text encoder and decoder. The adapter module has three Conv1D layers with kernel sizes and strides being [3,3,3] and [2,2,2].

In the first stage, the ASR12 model is finetuned on the speech corpora using 16 NVIDIA A100 GPUs for 21 epochs with the batch size of 3 and the update frequency of 8. The parameters in the Wav2Vec2 module and the linear layer are separately optimized by the Adam optimizer (Kingma and Ba, 2014). The learning rates are initialized with 1e−4 and 4e−4 with the annealing factors set to 0.9 and 0.8. The learning rates are updated based on the improvement of the training losses between the previous epoch and the current epoch. During training, speech waveform is perturbed with a random speed rate between 0.9 and 1.1 and speech features are augmented with the SpecAugment technique (Park et al., 2019).

In the second stage, three MT24 models are finetuned on the translation corpora with the batch size of 12 and the update frequency of 4. The en→de MT24 model is trained using 8 A100 GPUs for 2 epochs and the other two models are trained using 4 A100 GPUs for 6 epochs and 3 epochs. The model parameters are optimized with the Adam optimizer and the initial learning rates are set to 5e−5 with the annealing factor set to 0.9.

In the third stage, three PT36 models are finetuned on the corresponding MuSTC datasets, each of which is trained using 4 A100 GPUs for 10 epochs with the batch size of 12 and the update frequency of 4. The learning rates are initialized to 3e−5 for the W2V module and 5e−5 for the mBART module with the annealing factors set to 0.9. The loss weights are set to 0.1 for the ASR module and 0.9 for the MT module since the performance of the ASR module is not good enough.

Table 3: WER scores on test speech datasets

LibriSpeech  TEDLIUM  MuSTC
27.23        32.17    34.73

Table 4: BLEU scores on tst-COMMON datasets

Model       en→de  en→ja  en→zh
MT24        31.04  14.74  22.80
+ finetune  33.00  17.11  23.44
PT36        26.45  14.28  19.65

4.2 Speech Recognition

Table 3 lists WER scores on test speech datasets, where 34.73% is the average WER score of the three MuSTC datasets. The performance of the ASR12 model is much worse than that of other systems (Zhang et al., 2022b; Wang et al., 2021b) with WERs around 10%. Due to the extremely large vocabulary size, the model requires a long
Table 5: Statistics on short segments in the tst2020 dataset with different $dur_{min}$ and $dur_{max}$ settings.
Table 6: BLEU scores calculated on past years’ IWSLT en→de test sets with hypotheses automatically resegmented by the mwerSegmenter toolkit (Ansari et al., 2021) based on source transcriptions and target translations.
time to train. As a result, the model is still far from convergence at the time of this submission.

4.3 Sentence-level Translation

The tst-COMMON datasets are used to evaluate the translation performance at sentence level and the BLEU scores are calculated by the SacreBLEU⁵ toolkit, where Japanese texts are tokenized by the Mecab⁶ morphological analyzer and Chinese texts are tokenized into characters. The BLEU scores on the three datasets are listed in Table 4.

⁵ https://github.com/mjpost/sacrebleu
⁶ https://github.com/taku910/mecab

For machine translation, compared with the base MT24 models, the performance of the finetuned MT24 models is improved by 1.96 (~6.3%), 2.37 (~16.1%) and 0.64 (~2.8%) BLEU scores on en→de, en→ja and en→zh translations. It indicates that adding out-of-domain corpora like OpenSubtitles and NewsCommentaries is able to boost the machine translation quality.

For speech translation, compared with the finetuned MT24 models, the performance of PT36 models is degraded by a large margin with 6.55 (~19.8%), 2.83 (~16.5%) and 3.79 (~16.2%) BLEU scores on en→de, en→ja and en→zh translations. Compared with the base MT24 models, the gaps are still relatively large with 4.59 (~14.8%), 0.46 (~3.1%) and 3.15 (~13.8%) BLEU scores.

4.4 Corpus-level Translation

The translation performance of the en→de PT36 model is further evaluated on past years’ test datasets with challenging scenarios. To keep consistency, all test audios are resegmented using the method described in Section 3.3. Statistics on short segments in the tst2020 dataset are shown in Table 5. The number of brutal segments decreases to zero when $dur_{min}$ is set to more than 15s.

Table 6 lists BLEU scores on past years’ test datasets with different $dur_{min}$ and $dur_{max}$ settings. The performance is boosted as the segment duration gets longer, which means that more contextual information is provided to the model. When $dur_{min}$ and $dur_{max}$ are set to 20s and 90s, the best BLEU scores are achieved on most test datasets with an increase of 3.93 (~18.7%) in mean BLEU score. Further investigation on long audio segments finds that avoiding brutal segmentation is another factor of such improvement. Comparing experiment 2 and experiment 3, the mean BLEU score is increased by 0.95 (~3.9%) points, when the number of brutal segments is decreased from 69 to 0. Comparing experiment 3 and experiment 4, the mean BLEU score is merely increased by 0.22 (~0.8%) points.

4.5 Submissions

The three PT36 models are finally evaluated on tst2023 datasets (Agarwal et al., 2023) with more challenging scenarios like presentations and interviews. Test audios are resegmented with $dur_{min}$ and $dur_{max}$ set to 20s and 90s. Official metrics are presented in Table 7 for en→de datasets, Table 8 for en→ja datasets and Table 9 for en→zh datasets. Comparing the performance between in-domain TED datasets and out-of-domain ACL datasets, the BLEU scores are decreased by 2.7 (~12.1%), 0.3 (~2.8%) and 5.6 (~16.9%) points on en→de, en→ja and en→zh translations. Noticeably, the perfor-
Table 7: Official metrics on the tst2023 en→de subsets with hypotheses automatically resegmented by the mwerSegmenter toolkit (Ansari et al., 2021) based on source transcriptions and target translations.

Table 8: Official metrics on the tst2023 en→ja subsets.

              TED                           ACL
      Comet          BLEU               Comet   BLEU
   ref2    ref1    ref2  ref1  both
 0.7201  0.7228    10.7  13.2  16.8    0.6769   10.4
mance is almost halved (~48.4%) with a BLEU score of only 11.5 on the en→de Sub dataset. The results indicate that the proposed PT36 models are inadequate at handling non-native speakers, different accents, spontaneous speech and controlled interaction with a second speaker.

5 Conclusion

In conclusion, this paper describes the end-to-end speech translation systems for the IWSLT 2023 offline tasks. Built upon pretrained models, the systems are further trained on a large amount of parallel data using the three-stage finetuning strategy. The PT36 model consists of an ASR12 module with an adapter module for ASR and an MT24 module for MT. The training loss sums up the CTC loss for ASR and the cross entropy loss for MT. Experiments demonstrate that the proposed methods have the potential to achieve a reasonable performance. However, due to limited resources, some modules have not been well trained, which has a negative impact on subsequent tasks. Therefore, the end-to-end models still underperform SOTA systems.

References

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutai Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Antonios Anastasopoulos, Ondřej Bojar, Jacob Bremerman, Roldano Cattoni, Maha Elbayad, Marcello Federico, Xutai Ma, Satoshi Nakamura, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Alexander Waibel, Changhan Wang, and Matthew Wiesner. 2021. Findings of the IWSLT 2021 evaluation campaign. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 1–29, Bangkok, Thailand (online). Association for Computational Linguistics.

Ebrahim Ansari, Ondřej Bojar, Barry Haddow, and Mohammad Mahmoudi. 2021. SLTev: Comprehensive evaluation of spoken language translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 71–79.
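The layer configuration given in Section 4.1 can be made concrete with a little arithmetic. The sketch below estimates how many frames the seven-layer feature extractor produces and how much a stack of strided convolutions shortens its input. It assumes unpadded convolutions and 16 kHz input audio, neither of which is stated in the paper, and the function names are illustrative.

```python
def conv_output_length(length, kernel, stride):
    """Output length of a 1-D convolution without padding."""
    return (length - kernel) // stride + 1


def feature_frames(num_samples,
                   kernels=(10, 3, 3, 3, 3, 2, 2),
                   strides=(5, 2, 2, 2, 2, 2, 2)):
    """Frames emitted by the feature extractor for a raw waveform."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = conv_output_length(length, kernel, stride)
    return length


def total_stride(strides):
    """Overall downsampling factor of a stack of strided convolutions."""
    factor = 1
    for stride in strides:
        factor *= stride
    return factor


# One second of audio at the assumed 16 kHz sampling rate.
frames_per_second = feature_frames(16000)
extractor_factor = total_stride((5, 2, 2, 2, 2, 2, 2))
adapter_factor = total_stride((2, 2, 2))
```

Under these assumptions, one second of audio yields 49 frames, the feature extractor has an overall stride of 320 samples (20 ms), and the adapter strides [2,2,2] shorten the sequence by a further factor of 8. The exact adapter output length depends on its padding, which the paper does not specify.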
Table 9: Official metrics on the tst2023 en→zh subsets.

              TED                           ACL
      Comet          BLEU               Comet   BLEU
   ref2    ref1    ref2  ref1  both
 0.7428  0.7014    33.0  23.3  38.6    0.6534   27.4
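The TED-to-ACL gap that Section 4.5 quotes for en→zh can be reproduced from Table 9. The sketch below assumes that the TED BLEU score against ref2 (33.0) is the in-domain baseline, since that choice matches the reported 5.6-point gap; the paper does not say which column it uses, and the helper name is illustrative.

```python
def bleu_drop(in_domain, out_of_domain):
    """Return the absolute and the relative (percent) BLEU decrease."""
    absolute = in_domain - out_of_domain
    relative = 100.0 * absolute / in_domain
    return absolute, relative


# Table 9 (en→zh): TED BLEU against ref2 vs. ACL BLEU.
drop, percent = bleu_drop(33.0, 27.4)
```

This gives a drop of 5.6 BLEU points, just under 17% of the TED score, in line with the roughly 16.9% reported in Section 4.5.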
Enhancing Video Translation Context with Object Labels
Jeremy Gwinnup1,2, Tim Anderson2, Brian Ore2, Eric Hansen2, Kevin Duh1
1 Johns Hopkins University, 2 Air Force Research Laboratory
{jeremy.gwinnup.1, timothy,anderson.20, brian.ore.1, eric.hansen.5}@us.af.mil,
kevinduh@cs.jhu.edu
1 Introduction
Video streams are rich sources of content, and the application of machine translation to videos presents open research challenges. Specifically, we are interested in translating the speech content present in videos, using the visual modality as auxiliary input to improve translation quality. Intuitively, visual signals may help disambiguate under-specified words or correct speech recognition errors.
There has been much research in speech translation, which focuses on speech input, and multimodal machine translation, which focuses on visual and textual inputs; this work combines aspects of both areas. We assume a cascaded pipeline, where the speech in a video input is first passed to a speech recognition component, then the text transcripts together with the video frames are passed to a multimodal machine translation (MMT) system. Our contribution is an MMT system that augments text-based training data with labels obtained from a computer vision object detector (Fig. 1).
In contrast to more complex multimodal fusion techniques that combine vision and translation neural networks into end-to-end models, our modular approach is simple to implement, requiring no [...]

src: And then you’re going to stir it so have your stirrer available. PERSON CUP BOTTLE
tgt: E então você vai mexer, então tenha seu agitador disponível.
Figure 1: Demonstration of augmenting source data with detected object labels to provide additional context.

2 Object Class Label Augmentation
When considering the translation of instructional videos, the speaker’s narration may use ambiguous language when describing the steps to the task, as the viewer may be able to infer the intent through objects or actions in the scene. If MT systems are trained only on the speaker’s words and translations, these cues from the scene are not present. We propose to address this omission by analyzing clips of the video and augmenting the text data with objects found in that clip.
Augmentation Process: To augment training data with object labels, an object recognition model
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 130–137
July 13-14, 2023 ©2023 Association for Computational Linguistics
[Figure 2: processing pipeline. Panels: YoloV5; Objects (PERSON, CUP, BOTTLE, BOTTLE, BOTTLE, BOTTLE, BOTTLE); Classes (PERSON, CUP, BOTTLE).]
set in order to generate lists of objects present. To [...] video clips corresponding to the utterances from the [...] video clip are collated and collapsed in order to keep final sentence length to a manageable size: we are interested in the presence of an object class rather than how many times that class has occurred in the scene or the time slices in the video clip.
Once processed, the per-clip labels are appended to the source side of the training, dev and test sets as “context-markers”. We do not apply these labels to the target side, as we wish to generate coherent sentences in the target language. This processing pipeline is illustrated in Figure 2.
In particular, we note in the example in Figure 1 that the transcription discusses a stirrer but does not give context to what kind of stirrer: a laboratory sample stirrer, a paint stirrer, or in this case a stirrer to mix a drink. Using the object labels from the example, we see that the stirrer in this case refers to a drink, adding valuable context.
The augmented How2 corpus will be available for download at a future date.

Figure 3: Training segments with N object classes detected.

[...] higher class counts forming a long tail. Full class object counts are shown in Table 1.
Observing the most-detected class labels in training segments (shown in Figure 4), we see that PERSON is by far the most common object class with over 164k occurrences, while CUP and BOTTLE are the next most common with around 23.8k occurrences each. As How2 is comprised of instructional videos in which the authors are demonstrating how to perform a task, PERSON’s high occurrence rate seems reasonable. The figure shows the top 15 object classes detected; the full list of detection counts is shown in Table 2.
While the above analyses focus on the training portion of the dataset, similar distributions are present in both the validation and test sets.
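The collapse-and-append step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the first-seen ordering of labels, and the upper-case/underscore normalization are assumptions made for illustration, and the per-frame detections are hypothetical input standing in for the output of an object detector.

```python
def collapse_labels(frame_detections):
    """Collapse per-frame detections into a list of unique class labels.

    Only the presence of a class matters, not how often it was detected.
    Keeping first-seen order is an assumption of this sketch.
    """
    seen = []
    for frame in frame_detections:
        for label in frame:
            # Assumed normalization: upper case, spaces replaced by underscores.
            label = label.upper().replace(" ", "_")
            if label not in seen:
                seen.append(label)
    return seen


def augment_source(sentence, frame_detections):
    """Append the collapsed class labels to the source sentence only."""
    labels = collapse_labels(frame_detections)
    return " ".join([sentence] + labels)


# Hypothetical detections for the clip behind the Figure 1 example.
src = "And then you're going to stir it so have your stirrer available."
frames = [["person", "cup", "bottle"], ["bottle", "bottle"], ["person", "bottle"]]
print(augment_source(src, frames))
```

The target side is left untouched in this sketch, mirroring the choice described above to keep target sentences free of labels.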
Classes Segments Classes Segments Classes Segments
0 15,544 6 7,508 12 143
1 44,496 7 4,300 13 79
2 41,950 8 2,259 14 42
3 32,077 9 1,166 15 14
4 21,428 10 626 16 7
5 13,011 11 293 17 3
hour subset contains the full set of annotations. This work focuses on that subset. This portion consists of 13,493 videos with a total run-time of 305.1 hours, from which 189,276 utterances are extracted. These videos and segments are then divided into training, validation and test sets as shown in Table 3. These segments are then used to train systems in downstream tasks such as MT.

            Videos  Hours  Sentences
train       13,168  298.2  184,949
validation  150     3.2    2,022
test        175     3.7    2,305
[Figure 4: most-detected object class labels in training segments (top 15 classes; axis ticks from 0 to 180,000).]

three methods to prune over-prevalent or under-represented object class labels: naïve dropping of the N most-represented labels, inverse document frequency (IDF) thresholding and normalized term frequency-inverse document frequency (TF-IDF) thresholding. For the first method, object labels are simply removed in order of frequency, most common first; e.g., drop-3 removes the three most common classes: PERSON, CUP, and BOTTLE.
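Two of the pruning strategies named above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: each training segment is treated as a "document", IDF is computed as log2(N / df), and labels whose IDF falls below the threshold are removed. The log base and the direction of the threshold are assumptions, and the normalized TF-IDF variant is omitted because its normalization is not specified in this excerpt.

```python
import math
from collections import Counter


def document_frequency(segments):
    """Count, for each label, the number of segments in which it appears."""
    df = Counter()
    for labels in segments:
        df.update(set(labels))
    return df


def drop_n(segments, n):
    """Naive pruning: remove the n most frequent labels everywhere."""
    banned = {label for label, _ in document_frequency(segments).most_common(n)}
    return [[label for label in labels if label not in banned] for labels in segments]


def idf_prune(segments, threshold):
    """IDF pruning (assumed form): keep labels with log2(N / df) >= threshold."""
    df = document_frequency(segments)
    total = len(segments)
    idf = {label: math.log2(total / count) for label, count in df.items()}
    return [[label for label in labels if idf[label] >= threshold] for labels in segments]


# Hypothetical toy corpus of per-segment label lists.
segments = [["PERSON", "CUP"], ["PERSON", "BOTTLE"], ["PERSON", "CUP"], ["PERSON"]]
print(drop_n(segments, 1))
print(idf_prune(segments, 1.0))
```

On this toy corpus both calls remove only PERSON, which shows how a frequency cut-off and an IDF threshold can coincide when one class dominates.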
(Cieri et al., 2004–2005), TEDLIUM-v3 (Hernandez et al., 2018), and ATC (Godfrey, 1994); the language models (LM) were estimated on 1 billion words from Fisher, News-Crawl 2007-2017 (Kocmi et al., 2022), News-Discuss 2014-2017 (Kocmi et al., 2022), and TED. This system used Mel frequency cepstral coefficient (MFCC) features as input to a factorized time delay neural network (TDNN) with residual network style skip connections. Initial decoding was performed using a finite state transducer (FST) built from a bigram LM, and the resulting lattices were rescored with an RNN LM. The vocabulary included 100k words.

4.5 Results
Armed with an array of label pruning strategies, we run a series of experiments to determine the effectiveness of each method.

4.5.1 Marian Label Augmented Systems
Marian label augmentation and pruning results are shown in Table 4, reporting scores for BLEU (Papineni et al., 2002), chrF2 (Popović, 2015) and TER (Snover et al., 2006) as calculated by SacreBLEU (Post, 2018) and COMET (Rei et al., 2020) with the default wmt20-comet-da model.
We note that drop-3, tfidf at 0.20, and idf at 4.0 each yield a +0.9-1.0 gain in BLEU over baseline. We also report the number of labels pruned at each experimental threshold, noting that drop and tfidf remove approximately 42-43% of object class labels at maximum performance, while idf removes a much larger 74.73%.
As we see from the results, each of the three label pruning methods yields improvements over both the text-only and non-pruned augmented systems. Using the compare-mt (Neubig et al., 2019) tool, we take a closer look at various characteristics of the translation hypotheses of each of these five systems to see if any trends emerge. Table 5 shows averaged sentence BLEU scores for hypotheses with outputs of varying lengths. The intuition is that these average scores will help determine if a given system or pruning strategy is better at certain output lengths.
From these averaged scores, we note that plain label augmentation tends to improve over baseline with hypothesis lengths between 30 and 60 tokens but performs worse outside of that range. Of the three pruning strategies, drop 3 tends to bring the most improvement, especially with shorter hypotheses, and idf 4.0 tends to help the longer sequences.

4.5.2 Nmtpytorch Baseline Experiments
For nmtpytorch baseline comparison systems, we note that maximum training sequence length has an effect on system performance, most likely due to the shallow RNN architecture. Table 6 shows that using the default 120 max token limit from Sanabria et al. (2018) yields better performance (+0.9-1.1 BLEU) with both the visual perturbation and our label augmentation approach. These results show our approach yields a similar performance gain.

4.5.3 ASR Noise Experiments
For the ASR-based experiments shown in Table 7, we see improvements of +0.7 BLEU with both the clean and noisy Kaldi systems. We expect that the speech-recognition based systems would not perform as well as the gold-standard systems, but the use of object labels can help mitigate this loss in performance.

4.6 Analyzing Attention Outputs
We use Marian’s ability to output soft attention weights to compare an augmented system against its baseline counterpart, as shown in Figure 5. For this example, line 221 of the test set, the baseline system scores a sentence-BLEU of 30.66 versus the augmented system’s 61.32. We note the attention contributions of the object labels on the output tokens. Utilizing this feature as part of an unaltered MT toolkit allows for quick and easy analysis of the benefits of object label augmentation.

5 Related Work
Perhaps most closely related to our approach is ViTA (Gupta et al., 2021), which adds object labels extracted from images in an image captioning translation task. While the motivation of adding object labels is similar, there are important differences with our setup: 1) We work on video narration of an author’s task demonstration where objects appear at different points in the clip, which differs significantly from static image captions. 2) Our work focuses on training MT systems from scratch as opposed to fine-tuning existing models.
For a broad survey of multimodal translation, refer to Sulubacak et al. (2020). Specifically for video translation on How2, Sanabria et al. (2018) investigate an MT system that adds a 2048-dimensional feature vector averaging features for every 16 frames to create a global feature vector for
System BLEU chrF2 TER COMET Dropped Labels
Marian baseline 57.9 75.0 29.6 0.6819 –
nmtpy baseline 56.2 74.2 30.7 0.6234 –
nmtpy visual 55.9 74.0 31.1 0.6090 –
drop 0 57.6 74.9 29.9 0.6732 0 (0%)
drop 1 58.6 75.4 28.9 0.6785 164,605 (33.55%)
drop 2 58.7 75.5 28.9 0.6840 188,475 (38.41%)
drop 3 58.9 75.7 28.7 0.6907 212,284 (43.26%)
drop 4 58.5 75.3 29.1 0.6766 230,090 (46.89%)
drop 5 58.5 75.2 29.3 0.6687 247,106 (50.36%)
tfidf 0.10 58.3 75.1 29.5 0.6778 162,762 (33.17%)
tfidf 0.20 58.8 75.4 28.8 0.6817 205,938 (41.97%)
tfidf 0.30 58.8 75.5 29.0 0.6812 398,643 (81.24%)
idf 3.0 58.4 75.2 29.2 0.6832 212,284 (43.26%)
idf 4.0 58.9 75.5 29.0 0.6887 366,695 (74.73%)
idf 5.0 58.5 75.4 29.0 0.6857 428,655 (87.36%)
Table 4: Marian system scores for How2 en–pt test set, measured in BLEU, chrF2, TER and COMET. There are
490,697 object class labels present in the entire augmented training corpus.
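The "Dropped Labels" percentages in Table 4 follow directly from the counts and the corpus total given in the caption. The short Python sketch below recomputes a few of them as a sanity check; the function and variable names are illustrative, and the counts are copied from the table rows above.

```python
TOTAL_LABELS = 490_697  # total object class labels, from the Table 4 caption

# Dropped-label counts copied from selected Table 4 rows.
dropped = {
    "drop 1": 164_605,
    "drop 2": 188_475,
    "drop 3": 212_284,
    "tfidf 0.20": 205_938,
    "idf 4.0": 366_695,
}


def dropped_percentage(count, total=TOTAL_LABELS):
    """Share of all labels removed, as a percentage rounded to two decimals."""
    return round(100.0 * count / total, 2)


for system, count in dropped.items():
    print(f"{system}: {count:,} labels ({dropped_percentage(count)}%)")

# Differences between consecutive drop-N rows give the per-class counts.
second_class = dropped["drop 2"] - dropped["drop 1"]
third_class = dropped["drop 3"] - dropped["drop 2"]
print(second_class, third_class)
```

The differences between consecutive drop-N rows are 23,870 and 23,809, which agrees with the roughly 23.8k occurrences reported for CUP and BOTTLE.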
System      Max Tok  BLEU
nmtpy base  120      55.0
nmtpy vis   120      56.1
nmtpy aug   120      55.9
nmtpy base  250      56.2
nmtpy vis   250      55.9
nmtpy aug   250      55.7
Table 6: Max token length effect on BLEU for nmtpytorch baseline, visual perturbation and our label augmented systems.

System                 BLEU  COMET
Kaldi clean base       52.0  0.556
Kaldi clean aug        52.7  0.583
Kaldi 5 dB noise base  50.8  0.459
Kaldi 5 dB noise aug   51.5  0.459
Table 7: Results for clean and noisy Kaldi systems for both baseline and augmented conditions.

word embedding space to produce image-based first and last words to influence word choice in their bidirectional RNN systems.
While there are a few examples of object detection as a separate task (including our work), Baltrusaitis et al. (2019) note the rapid jump to joint representations as neural networks became popular tools for a variety of multimodal tasks, explaining the prevalence of work following that approach.

6 Future Work
Having proven our object label augmentation technique on How2, future work includes applying label augmentation to other datasets such as the VATEX (Wang et al., 2020) video description and VISA (Li et al., 2022) ambiguous subtitles datasets. Further research into the effects of ASR degraded speech and examining task-agnostic image-language models such as CLIP (Radford et al., 2021) for label augmentation may also be useful.

7 Conclusion
We present a straightforward method to improve MT context quality by augmenting training data with objects detected in corresponding video clips. Using these augmented corpora, we realize gains of up to +1.0 BLEU over baselines without changes to the underlying MT toolkits used to build models. We additionally show improvements of up to +0.7 BLEU with object label augmentation when substituting ASR speech for gold standard inputs.

The views expressed are those of the authors and do not necessarily reflect the official policy or position of the Department of the Air Force, the Department of Defense, or the U.S. government. Distribution Statement A. Approved for public release: distribution is unlimited. Originator reference number RH-22-123269. Case number AFRL-2022-3098.

References
Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2019. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell., 41(2):423–443.
Iacer Calixto and Qun Liu. 2017. Incorporating global visual features into attention-based neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 992–1003, Copenhagen, Denmark. Association for Computational Linguistics.
Christopher Cieri, David Graff, Owen Kimball, David Miller, and Kevin Walker. 2004–2005. Fisher English Training Part 1 and 2 Speech and Transcripts. Linguistic Data Consortium, Philadelphia.
John Godfrey. 1994. Air Traffic Control Complete. Linguistic Data Consortium, Philadelphia.
Kshitij Gupta, Devansh Gautam, and Radhika Mamidi. 2021. ViTA: Visual-linguistic translation by aligning object tags. In Proceedings of the 8th Workshop on Asian Translation (WAT2021), pages 166–173, Online. Association for Computational Linguistics.
François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Estève. 2018. TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation. In Speech and Computer, pages 198–208, Cham. Springer International Publishing.
Glenn Jocher, Alex Stoken, Ayush Chaurasia, Jirka Borovec, NanoCode012, TaoXie, Yonghye Kwon, Kalen Michael, Liu Changyu, Jiacong Fang, Abhiram V, Laughing, tkianai, yxNONG, Piotr Skalski, Adam Hogan, Jebastin Nadar, imyhxy, Lorenzo Mammana, AlexWang1900, Cristi Fati, Diego Montes, Jan Hajek, Laurentiu Diaconu, Mai Thanh Minh, Marc, albinxavi, fatih, oleg, and wanghaoyang0106. 2021. ultralytics/yolov5: v6.0 - YOLOv5n ’Nano’ models, Roboflow integration, TensorFlow export, OpenCV DNN support.
Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. 2018. Marian: Fast
neural machine translation in C++. In Proceedings of ACL 2018, System Demonstrations, pages 116–121, Melbourne, Australia. Association for Computational Linguistics.
Tom Kocmi, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Rebecca Knowles, Philipp Koehn, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Michal Novák, Martin Popel, and Maja Popović. 2022. Findings of the 2022 conference on machine translation (WMT22). In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 1–45, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.
Yihang Li, Shuichiro Shimizu, Weiqi Gu, Chenhui Chu, and Sadao Kurohashi. 2022. VISA: an ambiguous subtitles dataset for visual scene-aware machine translation. CoRR, abs/2201.08054.
Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. 2015. Microsoft coco: Common objects in context.
Pranava Swaroop Madhyastha, Josiah Wang, and Lucia Specia. 2017. Sheffield MultiMT: Using object posterior predictions for multimodal machine translation. In Proceedings of the Second Conference on Machine Translation, pages 470–476, Copenhagen, Denmark. Association for Computational Linguistics.
Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul Michel, Danish Pruthi, Xinyi Wang, and John Wieting. 2019. compare-mt: A tool for holistic comparison of language generation systems. CoRR, abs/1903.07926.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. 2011. The kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society. IEEE Catalog No.: CFP11SRW-USB.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR.
Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. 2018. How2: a large-scale dataset for multimodal language understanding. In Proceedings of the Workshop on Visually Grounded Interaction and Language (ViGIL). NeurIPS.
Matthew Snover, Bonnie Dorr, Rich Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, pages 223–231, Cambridge, Massachusetts, USA. Association for Machine Translation in the Americas.
David Snyder, Guoguo Chen, and Daniel Povey. 2015. MUSAN: A Music, Speech, and Noise Corpus. ArXiv:1510.08484v1.
Umut Sulubacak, Ozan Çağlayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, and Jörg Tiedemann. 2020. Multimodal machine translation through visuals and speech. Machine Translation, 34.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.
Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. 2020. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research.
Length-Aware NMT and Adaptive Duration for Automatic Dubbing
Zhiqiang Rao, Hengchao Shang, Jinlong Yang, Daimeng Wei, Zongyao Li,
Jiaxin Guo, Shaojun Li, Zhengzhe Yu, Zhanglin Wu, Yuhao Xie, Bin Wei,
Jiawei Zheng, Lizhi Lei and Hao Yang
Huawei Translation Service Center, Beijing, China
{raozhiqiang,shanghengchao,yangjinlong7,weidaimeng,lizongyao,
guojiaxin1,lishaojun18,yuzhengzhe,wuzhanglin2,xieyuhao2,weibin29,
zhengjiawei15,leilizhi,yanghao30}@huawei.com
NAVER LABS Europe’s Multilingual Speech Translation Systems
for the IWSLT 2023 Low-Resource Track
egow-smith1@sheffield.ac.uk first.last@naverlabs.com
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 144–158
July 13-14, 2023 ©2023 Association for Computational Linguistics
Figure 1: An illustration of our multilingual ST architecture as described in Section 2. The bold arrow path
corresponds to the speech-to-text training path. At decoding time, we can choose between producing speech-to-text
or text-to-text translations. Figure best seen in color.
less training data and compute.
This paper is organized as follows. We first describe the architecture and training settings of our multilingual ST systems in Section 2. We next list the resources we use in Section 3. Section 4 presents our results in both low and high-resource settings. Lastly, we highlight the zero-shot potential of our approach in Section 5 and present our concluding remarks in Section 6.

2 System Description

In this work we focus on a parameter-efficient training solution that allows us to input the features from a pre-trained speech representation model into a pre-trained multilingual MT model, producing translations from both speech and text in multilingual settings. This setting also allows us to leverage automatic speech recognition (ASR; i.e. speech-to-transcript) data. The general architecture is presented in Figure 1. The architecture is considered parameter-efficient because only a small portion of its parameters is trained (bottom encoder layers and small adapter layers).

Architecture. We initialize our models with a pre-trained multilingual MT model, which we adapt to the ST task by inputting features extracted with a frozen pre-trained speech representation model. The MT model is also frozen, except for the bottom 2 or 3 encoder layers and small adapter modules (those introduced by Bapna and Firat (2019), with bottleneck dimension 64) added after each encoder and decoder layer. As we show in our results, the fine-tuned encoder layers are able to map the speech features into the representation space of the pre-trained MT model and the adapters can help with domain adaptation (and possibly help alleviate the length mismatch). At inference, this model can be used for MT with very little memory overhead: the convolutional layers and adapters are disabled, and the bottom encoder layers are swapped with those of the initial pre-trained model.

Training settings. We train on 4 V100 GPUs (80GB) for up to 200 000 updates, with a maximum batch size of 4 000 source features (or 80 seconds of audio) and accumulated gradients over two batches.³ We sample language pairs with a temperature of 3.⁴ We validate every 5 000 updates and perform early stopping on valid BLEU for the language pair(s) of interest, with a patience of 5, averaging model weights across the last 3 checkpoints.⁵ We find best results using a single convolutional layer with stride 2, which downsamples the sequence of speech features by a factor of 2. The other hyperparameters are listed in Appendix Section A.1.

³ This corresponds to a total of 32 000 features per update, or 640 seconds of audio. In practice, with padding, each update corresponds to approximately 80 utterances or 530 seconds of audio.
⁴ $p_k = u_k^{1/3} / \sum_i u_i^{1/3}$, where $u_k$ is the utterance count for language pair $k$.
⁵ While all the configurations presented in this paper use checkpoint averaging, we later re-trained our contrastive submission for Taq-Fr and found virtually the same results without it.
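The arithmetic behind the training settings can be illustrated with a minimal sketch. It assumes the temperature-based sampling formula from the footnote above and the batch figures quoted in the text; the function names and the choice of three example language pairs are ours, and this is not the authors' training code.

```python
def sampling_probs(utterance_counts, temperature=3.0):
    """Temperature-based sampling: p_k = u_k^(1/T) / sum_i u_i^(1/T)."""
    scaled = {pair: count ** (1.0 / temperature)
              for pair, count in utterance_counts.items()}
    total = sum(scaled.values())
    return {pair: value / total for pair, value in scaled.items()}


def features_per_update(max_features_per_batch, num_gpus, accumulation_steps):
    """Total number of source features contributing to one parameter update."""
    return max_features_per_batch * num_gpus * accumulation_steps


# Utterance counts taken from the dataset tables (three pairs only, for illustration).
counts = {"taq-fr": 5025, "que-es": 698, "fr-en": 31207}
probs = sampling_probs(counts, temperature=3.0)
proportional = sampling_probs(counts, temperature=1.0)

# 4 000 features correspond to 80 seconds of audio, i.e. 50 features per second.
features_per_second = 4000 / 80
update_features = features_per_update(4000, num_gpus=4, accumulation_steps=2)
update_seconds = update_features / features_per_second

for pair in counts:
    print(f"{pair}: T=3 -> {probs[pair]:.3f}, proportional -> {proportional[pair]:.3f}")
print(update_features, update_seconds)
```

With a temperature of 3, the smallest language pair is sampled far more often than under sampling proportional to data size, which is the purpose of the temperature.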
Model                              # params   Transformer layers   Feature dimension
Tamasheq (Boito et al., 2022b)     95M        12                   768
Niger-Mali (Boito et al., 2022b)   95M        12                   768
mHuBERT-Tamasheq                   95M        12                   768
XLSR-53 (Conneau et al., 2021)     317M       24                   1024
XLS-R (Babu et al., 2022)          317M       24                   1024
Table 1: Speech representation models. The top portion presents Tamasheq-dedicated models, while the bottom lists large general purpose multilingual models.

Task   Source     Target    hours:minutes   # utterances
ASR    Quechua    Quechua   51:39           8,301
ST     Quechua    Spanish   2:42            698
ST     Tamasheq   French    15:43           5,025
Table 2: Speech Translation (ST) and Speech Recognition (ASR) data provided by the organizers (train+valid). The ASR data is outside of the constrained setting.

3 Resources

3.1 Pre-trained Speech Representation Models

We experiment with different versions of two speech representation models: HuBERT (Hsu et al., 2021) and wav2vec 2.0 (Baevski et al., 2020). We do not fine-tune these models in any of our configurations, but instead use them as feature extractors (see Figure 1). Because of this, our models are sensitive to the layer we extract features from. Pasad et al. (2021) argue that, for wav2vec 2.0 models that are not fine-tuned on ASR, speech features from middle layers tend to have a higher abstraction from the speech signal, which is beneficial to downstream tasks. The results from Boito et al. (2022b) seem to confirm this observation holds for low-resource ST. To the best of our knowledge, there is no similar investigation for HuBERT models.⁶

Table 1 presents the speech representation models we experiment with. The Tamasheq model is a monolingual wav2vec 2.0 Base model trained on 243 h of Tamasheq speech. The Niger-Mali model is a wav2vec 2.0 Base model trained on the same Tamasheq speech data plus 111 h of French, 109 h of Fulfulde, 100 h of Hausa, and 95 h of Zarma. This gives 658 h in total. The data for both models is sourced from the Niger-Mali audio collection (Boito et al., 2022a). The unreleased mHuBERT-Tamasheq model uses this same audio collection for training, while also including Common Voice (Ardila et al., 2020) data in four other languages (English, French, Arabic and Kabyle), resulting in 5 069 h of speech. XLSR-53 (56k hours) and XLS-R (500k hours) are massively multilingual wav2vec 2.0 Large models covering 53 and 128 languages, respectively. Neither of these two multilingual models has seen Tamasheq or Quechua speech during training.⁷

3.2 Pre-trained Multilingual MT Models

To initialize our ST models, we first experimented with mBART for many-to-many translation (mBART50NN; Tang et al., 2020), but found the NLLB-200 models (Costa-jussà et al., 2022) to give better results. We experiment with the dense NLLB models of various sizes: the distilled 600M-parameter and 1.3B-parameter versions, and the 3.3B-parameter version. We end up using the larger versions in our submissions (1.3B and 3.3B). Note that NLLB covers 202 languages, including Tamasheq and Quechua, which is not the case for mBART. At the same model size, despite covering more languages, NLLB is also a stronger machine translation model overall than mBART. Also, unlike mBART, it is not English-centric.

Contrary to Tang et al. (2021), we keep the original mBART or NLLB vocabularies of size 250k and do not train any embeddings. Instead, like Berard et al. (2021), we find that it is possible to filter the vocabulary at test time to only cover the languages of interest, significantly reducing the memory footprint of the model with a minor reduction in performance.⁸ We can also filter the vocabulary and embeddings before ST fine-tuning and achieve the same performance as with the full vocabulary without needing to train any embeddings. See Table 14 in Appendix for a comparison of these approaches. In order to study the zero-shot translation capabilities of our models (i.e., translating to languages and language pairs unseen at training), we do not apply vocabulary filtering to the configurations presented in the main paper.

⁶ We hypothesize that layer selection is less important for HuBERT architectures due to the multi-iteration approach that increases signal abstraction at each iteration.
⁷ Appendix Table 16 lists all models with links for downloading checkpoints, when available.
⁸ With NLLB, 44k tokens are enough for a 100% coverage of the training data (mTEDx, TED-LIUM, Quechua, Tamasheq), or 35k when restricting to our Taq-Fr setting. This represents a reduction of more than 200M parameters.
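The vocabulary filtering described in Section 3.2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names are hypothetical, embeddings are plain nested lists, and the embedding dimension of 1024 used in the parameter estimate is an assumption about the MT model rather than a figure from the text.

```python
def build_filtered_vocab(token_id_sequences, always_keep=()):
    """Collect the token ids that occur in the data, plus ids to always keep
    (e.g. special tokens), and map each kept id to a new contiguous index."""
    keep = set(always_keep)
    for sequence in token_id_sequences:
        keep.update(sequence)
    kept_ids = sorted(keep)
    old_to_new = {old: new for new, old in enumerate(kept_ids)}
    return kept_ids, old_to_new


def filter_embeddings(embeddings, kept_ids):
    """Keep only the embedding rows of the retained token ids (no retraining)."""
    return [embeddings[token_id] for token_id in kept_ids]


def saved_parameters(full_vocab_size, kept_vocab_size, embed_dim):
    """Embedding parameters removed, assuming one shared embedding matrix."""
    return (full_vocab_size - kept_vocab_size) * embed_dim


# Toy example: a 6-token vocabulary with 2-dimensional embeddings.
toy_embeddings = [[float(i), float(i) + 0.5] for i in range(6)]
toy_corpus = [[0, 3, 5], [3, 1]]
kept_ids, old_to_new = build_filtered_vocab(toy_corpus, always_keep=(0, 1, 2))
small_embeddings = filter_embeddings(toy_embeddings, kept_ids)
remapped = [[old_to_new[t] for t in sequence] for sequence in toy_corpus]

# Rough estimate with the sizes quoted in the text (250k -> 44k tokens),
# under the ASSUMED embedding dimension of 1024.
estimate = saved_parameters(250_000, 44_000, embed_dim=1024)
print(kept_ids, remapped, estimate)
```

Under these assumptions the estimate is a little over 210M parameters, in line with the reduction of more than 200M parameters reported in the footnote.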
Task   Source    Target    hours:minutes   # utterances
ASR    English   English   208:00          91,003
ASR    French    French    218:59          117,081
ASR    Spanish   Spanish   214:15          103,076
ST     French    English   57:39           31,207
ST     French    Spanish   42:14           21,862
ST     Spanish   English   79:37           37,168
ST     Spanish   French    9:34            4,568
Table 3: ASR and ST data in English, French and Spanish sourced from TED talks (unconstrained setting).

Task     Submission      Taq-Fr (IWSLT 2022)   Taq-Fr (IWSLT 2023)   Que-Es (IWSLT 2023)
Taq-Fr   primary         20.75                 23.59                 ✗
Taq-Fr   contrastive 1   19.06                 21.31                 ✗
Taq-Fr   contrastive 2   18.58                 18.73                 17.74
Que-Es   primary         18.58                 18.73                 17.74
Que-Es   contrastive 1   16.84                 ✗                     15.67
Que-Es   contrastive 2   16.21                 ✗                     15.25
Table 4: Results on the official test sets for the IWSLT 2023 Low-Resource Task. We also show results on the IWSLT 2022 Taq-Fr test set. Note that all Quechua models are trained on Tamasheq data, but the reverse is not true (see Appendix Table 15). Lines 3 and 4 correspond to the same model.

3.3 Datasets

We tackle the low-resource setting by building multilingual systems that utilize both ASR and ST data in the languages of interest (Tamasheq and Quechua), and in high-resource directions whose target language is of interest (French and Spanish). Note that we also include X→English data, as we initially planned to participate in the Irish-English task. Including more data in high-resource languages has several advantages. Firstly, it has a regularization effect that prevents us from immediately overfitting the low-resource training data. Secondly, this enables knowledge transfer from common target languages and from similarly-sounding source languages.⁹ Thirdly, as we build multilingual ST systems by mapping the speech representation vectors into the same space as the multilingual MT model, our goal is to produce a model that is as multilingual as possible, not specializing in one specific language. Our results show that training on multiple languages at once achieves this effect, while also producing good zero-shot ST results.

Table 2 presents statistics for the datasets provided by the IWSLT 2023 organizers. The Que-Es dataset¹⁰ is an unreleased dataset prepared for this year’s challenge. It corresponds to a translated subset of the Quechua ASR data (“Siminchik”) from Cardenas et al. (2018). The Taq-Fr dataset was introduced by Boito et al. (2022a). Table 3 presents statistics for the datasets in high-resource languages. English ASR data comes from TED-LIUMv2 (Rousseau et al., 2014), and the other data comes from mTEDx (Salesky et al., 2021). Appendix Table 15 lists the datasets used in each of our submissions. In Section 4.3, we also run experiments in the setting of the IWSLT 2021 Multilingual Task to measure how good our approach is on high-resource languages. The datasets used for this setting are presented in Appendix Table 10.

⁹ Manual inspection revealed that audio from both datasets presents some degree of target language borrowing (e.g., Spanish words present in the Quechua speech, French words present in the Tamasheq speech).
¹⁰ We are aware the dataset reference is Que-Spa. We chose to use the ISO 639-1 two-letter abbreviation for Spanish for consistency with the other datasets used in this work.

4 Experiments and Results

All our submissions to the low-resource ST task are in the unconstrained setting, due to the use of pre-trained models, and from training on data in other languages. The datasets used in each submission are listed in Appendix Table 15. This section is organized as follows. We present our Taq-Fr results (4.1) with a detailed ablation study justifying our architectural choices. We then present our Que-Es results (4.2). Lastly, we evaluate and analyze our approach in a high-resource setting (4.3).

4.1 Tamasheq-French Results

We submit two systems that have Taq-Fr as the only low-resource language pair (primary and contrastive 1). Additionally, we take our primary submission for Que-Es, which has also been trained on Taq-Fr, and submit this as contrastive 2. The top portion of Table 4 gives the test BLEU scores, and the top portion of Appendix Table 11 presents the valid BLEU scores. Table 12 shows statistics (average and standard deviation) over multiple runs when applicable.

System description. The contrastive 1 model uses as a speech feature extractor the Niger-Mali wav2vec 2.0 model (8th layer). It was initialized with NLLB 1.3B, whose bottom 3 encoder layers were finetuned. We took three runs of this setting with different random seeds and picked the best performing one on the validation set (in terms of
Taq-Fr BLEU) as our contrastive submission. We then ensembled the three runs as our primary submission. Finally, contrastive 2 is the ensemble model used as primary submission to the Que-Es task, which covers both low-resource languages, and combines XLS-R Large with NLLB 3.3B.

Results. Our primary submission significantly outperforms the previous state of the art of 13.2 BLEU (+7.5 BLEU) on the IWSLT 2022 test set by Khurana et al. (2022).¹¹ It also ranks first in this year’s edition, with +7.7 BLEU over the second best primary submission. Our contrastive submissions rank second and third (beating the second best primary submission by +5.4 and +2.8 BLEU).

¹¹ Here we are referencing the model pre-trained using the Niger-Mali dataset that was presented at JSALT 2022: https://www.clsp.jhu.edu/jsalt-2022-closing-presentations/

4.1.1 Ablation Study

In Appendix Table 18 we compare our contrastive 1 model (the non-ensembled version of our primary submission) with other architectures trained on the same data to validate our choice of hyperparameters.

Speech features. The wav2vec 2.0 models trained with Tamasheq (Niger-Mali and Tamasheq) largely outperform the well-known massively multilingual models (XLSR-53 and XLS-R) on Taq-Fr (e.g. +2.5 BLEU Tamasheq compared to XLS-R L). The multilingual models are larger and trained on considerably more data, but do not include any Tamasheq speech. Similar to previous works (Pasad et al., 2021; Boito et al., 2022b), when extracting features from wav2vec 2.0 we find that the 8th layer gives better results than the 11th (penultimate) layer (+2.5 BLEU for Niger-Mali).
For HuBERT, on the contrary, features from the 11th layer give the best results (+0.2 BLEU compared to 8th layer). When using the right layer, we find that wav2vec 2.0 outperforms HuBERT (+2.7 BLEU Niger-Mali compared to mHuBERT-Taq).
Finally, Niger-Mali is as good on Taq-Fr as the Tamasheq wav2vec 2.0, but performs considerably better on Fr-En (+4.1 BLEU), probably because it was trained with French audio. The best Fr-En performance is achieved with XLS-R L. We find worse performance on Fr-En with XLS-R XL (-2.0 BLEU), but this may be due to layer selection.

Pre-trained MT model. The larger the model used for initialization, the better the performance (even more so for Fr-En). However, we find that the gain from using NLLB 3.3B over NLLB 1.3B is too small to justify the increase in model size and decoding latency (3 times slower). At the same model size, NLLB 600M performs considerably better than mBART (+1.7 BLEU on Taq-Fr, +3.6 BLEU on Fr-En).

Trained parameters. Fine-tuning too many encoder layers results in overfitting, which hurts Taq-Fr and Fr-En performance. On the other hand, fine-tuning just 1 or 2 layers instead of 3 does not result in a large BLEU drop. Similarly, adapter modules are not always needed. Disabling decoder adapters does not degrade Taq-Fr performance (+0.2 BLEU), but results in a slight drop in Fr-En performance (-0.9 BLEU), which could be attributed to a domain adaptation effect (to the mTEDx domain). Disabling encoder adapters has more impact on performance for Taq-Fr (-0.8 BLEU), with similar effect on performance for Fr-En (-1.0 BLEU). Section 4.3 shows that these adapters are important for domain adaptation.

Convolutions. The number of convolutional layers does not impact performance much (range of 1.1 BLEU on Taq-Fr and 3.2 BLEU on Fr-En for 0 to 3 layers), but it can have a large impact on decoding speed: each layer divides the input length by a factor of 2, resulting in a roughly 3.5× speed-up from 0 to 3 layers. Interestingly, even though it was trained on much shorter sequences, the MT model seems to adapt quite well to any input length, even without any convolutions: we achieve a better Taq-Fr result without any convolutions, but a worse Fr-En result.¹² However, models with fewer convolutional layers seem to converge faster (as shown in Appendix Figure 2).

¹² Without any convolution, the speech feature to target token ratio is 12:1.

Stacked layers. While our approach described in Section 2 fine-tunes some parameters of the pre-trained MT model, we can instead plug new Transformer layers at the bottom of the encoder, without changing any existing parameter. These “stacked layers” result in slightly larger models but are conceptually simpler, as they try to map the speech features into the same representation space as the input text embeddings of the MT model. Appendix Table 17 compares this architecture with the one used in our submission to the Taq-Fr task. We see
that it performs similarly well (sometimes better) and that it does not add any noticeable decoding latency. We can even reach the same Taq-Fr performance as our contrastive submission by just adding a single Transformer layer plus one convolution layer and small adapters (28M trained parameters in total). Finally, disabling all adapters only results in a small BLEU drop, suggesting that it is indeed possible to map the speech features into the text input space, with only one Transformer layer. This is surprising, considering that the input to this layer is 6 times as long as the target sequence on average.

4.2 Quechua-Spanish Results

The test and validation scores of our submissions to the Que-Es task are reported in the second half of Tables 4 and 11, respectively. Because these models are also trained on Taq-Fr data, we additionally report their performance on that task.

System description. As we do not have a speech feature extractor specialized to Quechua speech, our contrastive 1 submission uses a massively multilingual wav2vec 2.0 model: XLS-R Large (18th layer). Compared to our Tamasheq submission, it is also initialized with a larger MT model (NLLB 3.3B), which we found to perform better in this setting. The training settings are the same as for the Tamasheq models, except that we only fine-tune the bottom 2 encoder layers (instead of 3) and validate every 2 500 updates, since this larger model tends to converge faster. Another difference is that we train on both Tamasheq and Quechua data (in addition to the mTEDx and TED-LIUM data). Like in our Tamasheq submission, we train 3 models with different random seeds and ensemble them as our primary submission. Our contrastive 2 submission uses a single model with the same training settings, but starts from a smaller pre-trained MT model (NLLB 1.3B).

Results. Our primary submission in the Que-Es task also ranked first, with 17.7 BLEU on the official test set. The full ranking results were not communicated in time for this camera-ready. They will be made available later through the conference findings paper (Agarwal et al., 2023).

Data contamination. We found shortly after our submission that all the audio files used in the official test and validation sets are also present in the ASR training data shared by the organizers for the unconstrained setting. This means that our Que-Es ST models are evaluated in an unrealistic setting, where they are tasked to translate Quechua utterances for which they have already seen the Quechua transcription. For this reason, we filtered the ASR data to remove all audio files also present in the validation and test sets for Que-Es, and we re-trained models on this filtered data.¹³ While our official submission results presented in Table 4 use the “contaminated” dataset for comparison with the other submissions, we think any future comparison to our work should be done with the updated results in Appendix Table 11. Note that similar care should be taken with the results of other participants.

¹³ In the updated version, we use NLLB 1.3B by default instead of NLLB 3.3B, like for Taq-Fr. Appendix Table 11 presents uncontaminated results.

4.3 Results and Analysis in a High-Resource Setting

The results of our ablation studies (Section 4.1.1) seem to indicate that our models are reasonably good on Fr-En translation, even though we do early stopping and tune our hyper-parameters based on Taq-Fr performance. Here, we further investigate the performance of our approach on high-resource ST by training models in the setting of the IWSLT 2021 Multilingual Task (Anastasopoulos et al., 2021). This task evaluates the performance of multilingual ST models in 4 training directions, for which in-domain training data is provided, and 3 zero-shot directions, for which no training data is provided.

We use XLS-R Large as the speech feature extractor, experiment with both NLLB 1.3B and NLLB 3.3B as the MT model, and perform early stopping based on the average validation BLEU across the 4 official training directions. We train our models on all the mTEDx language pairs that are not zero-shot, along with TED-LIUM (English ASR) and the Tamasheq and Quechua data (see Table 15). Note that the use of pre-trained models and English ASR means our models fall into the unconstrained setting.

Table 5 presents our results on this task, compared with the best unconstrained submission (FAIR; Tang et al., 2021).¹⁴ We find that both our models outperform FAIR’s ensemble submission in the training directions, even though they require substantially less compute and data to train, and they are not ensembled. In the zero-shot directions, our NLLB 1.3B version performs worse than FAIR’s ensemble, which is not surprising since they used training data for the zero-shot language directions (from other datasets), whilst we do not.¹⁵ We find that using the larger NLLB 3.3B model for initialization considerably improves our zero-shot results.

¹⁴ SacreBLEU signature (Post, 2018): nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.1.0

                                                                                  Training directions           Zero-shot directions
Model                                     Total params        Trained params     Es-En  Fr-En  Fr-Es  Pt-En    Pt-Es  It-En  It-Es
FAIR at IWSLT 2021 (Tang et al., 2021)    700M                -                  40.4   36.4   34.4   29.0     34.4   28.4   34.6
FAIR at IWSLT 2021 (Tang et al., 2021)    3×700M (ensemble)   -                  42.2   38.7   36.5   31.0     38.2   29.4   37.3
XLS-R + NLLB 1.3B                         317M + 1.38B        70M                43.7   39.4   38.0   31.5     35.9   28.9   35.0
XLS-R + NLLB 3.3B                         317M + 3.36B        115M               44.0   39.9   38.3   33.1     38.1   29.3   36.9
XLS-R + NLLB 1.3B, ASR + MT cascade       -                   -                  41.8   35.6   34.4   29.7     35.8   29.3   35.2
Table 5: Results on the IWSLT 2021 Multilingual task. We report BLEU scores on the IWSLT 2021 test sets. Our NLLB 1.3B and 3.3B models took respectively 34 and 46 h to train on 4 V100 GPUs, while FAIR’s models each took 7 days to train on 8 V100 GPUs. Also note that FAIR’s models were trained on much larger amounts of data, including data for the “zero-shot” directions (which, in their case, is only zero-shot w.r.t. the in-domain TED data).

Model                           New params   Taq-Fr
Joint training                  0            21.06
Adapters 64 (all)               6.4M         17.60
Adapters 256 (all)              15.9M        18.18
Adapters 256 (bottom)           1.6M         19.24
Conv + Adapters 256 (bottom)    2.5M         19.13
Table 6: BLEU scores on the Taq-Fr validation set, when training jointly with IWSLT 2021 and Tamasheq data, versus incremental (2-stage) training. The “New params” column gives the number of Tamasheq-specific parameters added.

of dimension 256 in the bottom layers and training only those; 4) adding adapters of dimension 256 in the bottom layers and training both those and the convolutional layer.
We keep the same training settings as before, except that: we train on Taq-Fr data only; we train only the parameters mentioned above; we validate more often (every 1 000 updates); and we disable checkpoint averaging. Table 6 shows the performance of these four incremental training methods, compared to training on the entire language set from scratch. Even though incremental training does not perform quite as well, it appears to be a viable option that can achieve decent results. Lastly, we highlight that our experiments were limited to these four incremental learning settings (without hyper-parameter search), and that better results may be obtained with other parameter-efficient adaptation methods, or with more regularization.
Table 7: BLEU and chrF results for Taq-{Fr, En, Ko} using contrastive 1 and its variants (models trained without
adapters or with larger adapters), on the IWSLT 2022 Taq-Fr test set or silver-standard Korean and English references
obtained with MT. The last row is a cascade of speech translation followed by text translation (Taq→Fr→X).
tation space and that the adapters further improve performance by allowing domain adaptation of the MT model (which is hard to do at the very bottom layers). Note that the encoder adapters seem to be the most important ones, which is consistent with the findings of Cooper Stickland et al. (2021) that adapting the encoder is the most effective strategy for domain adaptation. Lastly, we highlight that adapting the MT model directly with MT data (mTEDx’s transcriptions and translations) gives even better results (+4.6 BLEU on average), but this cross-modality domain transfer is an interesting by-product of our parameter-efficient approach.

5 Zero-Shot Capabilities

Throughout this paper we have argued that one advantage of the multilingual models we propose is their potential for zero-shot translation, a setting in which a system produces translation in an unseen language pair by leveraging its existing knowledge of both languages. In Section 4.3 we showed that our models are competitive with the best submission to IWSLT 2021 on the three zero-shot high-resource language pairs, despite the fact that these pairs were not truly zero-shot for that system. In this section, we further illustrate the zero-shot capabilities of our models by translating Tamasheq speech in two settings: 1) target language seen during both MT pre-training and ST adaptation (English); 2) target language only seen during MT pre-training (Korean).

Evaluation settings. To score BLEU and chrF¹⁶ in the chosen target languages, we use a commercial translation service to translate the French side of the IWSLT 2022 test set to English and Korean. Note that this is only a silver-standard made of synthetic data, and thus the evaluation will inevitably be biased.¹⁷ Our goal is solely to assess whether our systems have some zero-shot ST abilities. We evaluate our Taq-Fr contrastive 1 system, and variants of this system with fewer or larger adapters. We compare with a cascade baseline, in which we first perform Taq-Fr ST, followed by Fr-En or Fr-Ko MT using the text-to-text path from Figure 1. In this setting, the adapters are disabled during MT.

¹⁶ SacreBLEU signature: nrefs:1|case:mixed|eff:no|tok:X|smooth:exp|version:2.3.1 (En: X=13a, Ko: X=ko-mecab-0.996/ko-0.9.2-KO). chrF signature: nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.3.1
¹⁷ For instance, we observe that these generated translations contain both the Korean transliteration in Hangul of named entities and the original version in the Latin script. This will likely penalize our produced translation during scoring.

Results. In Table 7, we measure the zero-shot translation capabilities of our approach on this silver-standard test set. We evaluate four models: our contrastive 1 submission presented in Section 4.1, and variants of this model with increased adapter size, adapters only in the encoder, or no adapters. We compare against a cascade baseline that is not zero-shot, which consists in translating the Tamasheq speech into French text and then translating this text into English or Korean.
We observe that, in the case of English, which was seen during ST adaptation, adapters can be helpful (+2 BLEU over the cascade baseline). On the other hand, for Korean, unseen during ST adaptation, systems with adapters in the decoder (first two rows) perform worse, as they likely bring some degree of language confusion. Results are even worse with larger adapters, with over 40% of output sentences being in the wrong language. In this setting, the best results are achieved with only encoder adapters or no adapters at all (-1 BLEU compared to the baseline).
Appendix Table 13 measures the percentage of output sentences in the correct language and the percentage of Hangul versus Latin characters in each system’s outputs. We find that models with
Utterance id / Target / Content

2016-11-23_id_7
  Ref: Chers auditeurs, rappelez-vous que vous écoutez Studio Kalangou en ce moment.
  Fr:  Chers auditeurs, n’oubliez pas que vous êtes avec le Studio Kalangou.
  En:  Well, listeners, don’t forget that you are with Studio Kalangou right now.
  Ko:  청취자 여러분, 지금 Studio Kalangou와 함께 있는 것을 잊지 마세요.

2016-06-27_id_5
  Ref: Les examens du BEPC sont terminés et les corrections ont commencé hier après-midi dans la ville de Niamey.
  Fr:  Les examens du BEPC sont terminés et sur toute l’étendue du territoire, les travaux de leur suivi ont débuté hier après-midi à Niamey.
  En:  The BEPC exams are over and throughout the country, the monitoring activities started yesterday afternoon in Niamey.
  Ko:  BEPC 시험은 끝났습니다. 전국에서 검사 작업은 어제 오후 Niamey에서 시작되었습니다.

2016-10-27_id_39
  Ref: D’autres informations que nous apportons aujourd’hui concernent un projet appelé aniamey.com qui informe que l’État du Nigéria a refoulé des Nigériens, au nombre de 53, qui arrivent (), qui habitent dans la ville de Mina sur le territoire du Niger ou Neja.
  Fr:  D’autres informations que nous apportons aujourd’hui concernent les informations apportées par un programme dénommé Niamey Point Com qui a apporté des informations selon lesquelles le Nigeria a accueilli 53 Nigériens qui habitent la ville de Mena qui se trouve sur le territoire du Niger ou le Niger.
  En:  Today, we’re going to talk about the information about a program called Niamey Point Com, which reports that Nigeria has brought back 53 Nigerians who live in the town of Mena in Niger.
  Ko:  우리 게임의 오늘 기사에서는 Niamey Point Com라는 프로그램으로 나이지리아가 미네에 거주하는 53명의 니그르인을 귀환시켰다는 소식이 있습니다.

Table 8: Some decoding examples for Taq-Fr, Taq-En and Taq-Ko language pairs, accompanied by the French reference (Ref). Utterance id corresponds to the suffix of the audio files in the IWSLT 2022 test set.
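The Hangul-versus-Latin character measurement mentioned for the Korean outputs could be approximated as in the sketch below. It is only an illustration under our own assumptions: the function name is hypothetical, characters are classified by their Unicode names, and digits, punctuation and whitespace are ignored. The paper does not specify how the percentages in Appendix Table 13 were computed.

```python
import unicodedata


def script_shares(text):
    """Return (hangul_share, latin_share) among the Hangul and Latin
    letters of `text`; other characters are ignored."""
    hangul = 0
    latin = 0
    for char in text:
        name = unicodedata.name(char, "")
        if name.startswith("HANGUL"):
            hangul += 1
        elif name.startswith("LATIN") and char.isalpha():
            latin += 1
    total = hangul + latin
    if total == 0:
        return 0.0, 0.0
    return hangul / total, latin / total


# Mixed-script output, as in the Korean examples that keep named
# entities in the Latin script.
mixed = script_shares("지금 Studio")
empty = script_shares("123 !")
print(mixed, empty)
```

A high Latin share in an output that should be Korean is a cheap signal of the language confusion discussed in Section 5, although named entities left in the Latin script also contribute to it.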
erico, Xutai Ma, Satoshi Nakamura, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Alexander Waibel, Changhan Wang, and Matthew Wiesner. 2021. FINDINGS OF THE IWSLT 2021 EVALUATION CAMPAIGN. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 1–29, Bangkok, Thailand (online). Association for Computational Linguistics.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. 2020. Common voice: A massively-multilingual speech corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4218–4222, Marseille, France. European Language Resources Association.

Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. 2022. XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale. In Proc. Interspeech 2022, pages 2278–2282.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460.

Ankur Bapna and Orhan Firat. 2019. Simple, scalable adaptation for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1538–1548, Hong Kong, China. Association for Computational Linguistics.

Alexandre Berard. 2021. Continual learning in multilingual NMT via language-specific embeddings. In Proceedings of the Sixth Conference on Machine Translation, pages 542–565, Online. Association for Computational Linguistics.

Alexandre Berard, Dain Lee, Stephane Clinchant, Kweonwoo Jung, and Vassilina Nikoulina. 2021. Efficient inference for multilingual neural machine translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8563–8583, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

the Tamasheq language. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2066–2071, Marseille, France. European Language Resources Association.

Marcely Zanon Boito, John Ortega, Hugo Riguidel, Antoine Laurent, Loïc Barrault, Fethi Bougares, Firas Chaabani, Ha Nguyen, Florentin Barbier, Souhir Gahbiche, and Yannick Estève. 2022b. ON-TRAC consortium systems for the IWSLT 2022 dialect and low-resource speech translation tasks. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 308–318, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Ronald Cardenas, Rodolfo Zevallos, Reynaldo Baquerizo, and Luis Camacho. 2018. Siminchik: A speech corpus for preservation of southern quechua. ISI-NLP 2, page 21.

Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. 2021. Unsupervised Cross-Lingual Representation Learning for Speech Recognition. In Proc. Interspeech 2021, pages 2426–2430.

Asa Cooper Stickland, Alexandre Berard, and Vassilina Nikoulina. 2021. Multilingual domain adaptation for NMT: Decoupling language and domain information with adapters. In Proceedings of the Sixth Conference on Machine Translation, pages 578–598, Online. Association for Computational Linguistics.

Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Sameer Khurana, Antoine Laurent, and James Glass. 2022. Samu-xlsr: Semantically-aligned multimodal utterance-level cross-lingual speech representation.
Steven Bird. 2011. Bootstrapping the language archive: IEEE Journal of Selected Topics in Signal Processing,
New prospects for natural language processing in 16(6):1493–1504.
preserving linguistic heritage. Linguistic Issues in
Language Technology, 6(4). Ankita Pasad, Ju-Chieh Chou, and Karen Livescu. 2021.
Layer-wise analysis of a self-supervised speech rep-
Marcely Zanon Boito, Fethi Bougares, Florentin Bar- resentation model. In 2021 IEEE Automatic Speech
bier, Souhir Gahbiche, Loïc Barrault, Mickael Rou- Recognition and Understanding Workshop (ASRU),
vier, and Yannick Estève. 2022a. Speech resources in pages 914–921. IEEE.
Matt Post. 2018. A call for clarity in reporting BLEU
scores. In Proceedings of the Third Conference on
Machine Translation: Research Papers, pages 186–
191, Brussels, Belgium. Association for Computa-
tional Linguistics.
Anthony Rousseau, Paul Deléglise, and Yannick Estève.
2014. Enhancing the TED-LIUM corpus with se-
lected data for language modeling and more TED
talks. In Proceedings of the Ninth International
Conference on Language Resources and Evaluation
(LREC’14), pages 3935–3939, Reykjavik, Iceland.
European Language Resources Association (ELRA).
Elizabeth Salesky, Matthew Wiesner, Jacob Bremerman,
Roldano Cattoni, Matteo Negri, Marco Turchi, Dou-
glas W. Oard, and Matt Post. 2021. Multilingual
tedx corpus for speech recognition and translation.
In Proceedings of Interspeech.
A Appendix
A.1 Hyperparameters

Hyper-parameter          Value
Batch size               4 000
Data-parallel GPUs       4
Update freq              2
Max learning rate        0.0005
Initial LR               10^-7
Schedule                 inverse square root
Warmup steps             10 000
Adam betas               0.9, 0.999
Mixed precision          True
Label smoothing          0.2
Weight decay             0.0
Dropout                  0.3†
Checkpoint averaging     3
Patience                 5
Early stopping metric    BLEU
Beam size                5

[Figure: training loss (y-axis) against training steps (x-axis, 0 to 150,000) for models with no conv layer, 1 conv layer, 2 conv layers, and 3 conv layers.]
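The learning-rate settings in the table above (initial LR, max learning rate, warmup steps, inverse square root schedule) can be illustrated with a short sketch. It assumes the common definition of this schedule, linear warmup followed by decay proportional to the inverse square root of the update number; the appendix only names the schedule, so the exact formula and the function name `inverse_sqrt_lr` are assumptions.

```python
def inverse_sqrt_lr(step, max_lr=5e-4, init_lr=1e-7, warmup_steps=10_000):
    """Learning rate at a given update number (assumed schedule definition)."""
    if step < warmup_steps:
        # Linear warmup from init_lr up to max_lr.
        return init_lr + (max_lr - init_lr) * step / warmup_steps
    # After warmup, decay with the inverse square root of the update number.
    return max_lr * (warmup_steps / step) ** 0.5


if __name__ == "__main__":
    for step in (0, 5_000, 10_000, 40_000, 160_000):
        print(step, inverse_sqrt_lr(step))
```

Under this definition the rate peaks at the maximum learning rate exactly when warmup ends and halves each time the number of updates quadruples.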
Task  Source      Target      hours:minutes  # utterances
ASR   French      French      218:59         117,081
ASR   Italian     Italian     118:39         50,895
ASR   Portuguese  Portuguese  179:33         91,257
ASR   Spanish     Spanish     214:15         103,076
ST    French      English     57:39          31,207
ST    French      Spanish     42:14          21,862
ST    French      Portuguese  26:53          14,322
ST    Portuguese  English     63:13          31,868
ST    Spanish     French      9:34           4,568
ST    Spanish     English     79:37          37,168
ST    Spanish     Italian     11:50          5,616
ST    Spanish     Portuguese  47:01          22,012

Table 10: Statistics for all the mTEDx languages (train+valid) seen by our systems for the IWSLT 2021 evaluation setup described in Section 4.3.

Train vocab     Inference vocab  Inference params  Taq-Fr BLEU  Fr-En BLEU  Speed
Full (256k)     Full (256k)      1.38B             19.1         36.6        12.5×
Full (256k)     Filtered (35k)   1.19B             18.9         35.8        13.0×
Filtered (35k)  Filtered (35k)   1.19B             20.0         35.5        13.0×

Table 14: Speech Translation performance on the IWSLT 2022 Taq-Fr and mTEDx Fr-En test sets of our contrastive Taq-Fr submission (non-ensemble version of our primary submission) with several vocabulary filtering strategies: no filtering (first row, corresponds to our submission); inference-time filtering (second row); or training-time filtering (third row). See Table 18 for an explanation of the “speed” column.

                                 Taq-Fr valid  Que-Es valid  Que-Es test
Taq-Fr            primary        26.13         ✗             ✗
Taq-Fr            contrastive 1  24.53         ✗             ✗
Taq-Fr            contrastive 2  22.88         20.29         17.74
Que-Es            primary        22.88         20.29         17.74
Que-Es            contrastive 1  20.81         19.03         15.67
Que-Es            contrastive 2  21.31         16.78         15.25
Que-Es (updated)  primary        22.36         16.52         15.70
Que-Es (updated)  contrastive 1  20.97         15.15         15.55
Que-Es (updated)  contrastive 2  20.31         16.30         13.17
IWSLT 2023 TED-LIUM v2 mTEDx ASR mTEDx ST
Submission Taq-Fr Que-Es Que-Que En-En Fr-Fr Es-Es It-It Pt-Pt Fr-En Fr-Es Es-Fr Es-En Fr-Pt Pt-En Es-It Es-Pt
Taq-Fr primary ✓ ✗ ✗ ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✗ ✗ ✗
Taq-Fr contrastive 1 ✓ ✗ ✗ ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✗ ✗ ✗
Taq-Fr contrastive 2 ✓ ✓ ✓ ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✗ ✗ ✗
Que-Es primary ✓ ✓ ✓ ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✗ ✗ ✗
Que-Es contrastive 1 ✓ ✓ ✓ ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✗ ✗ ✗
Que-Es contrastive 2 ✓ ✓ ✓ ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✗ ✗ ✗
IWSLT 2021 setup ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Table 15: Extensive list of datasets used for training (✓) each system presented in this paper.
Model URL
mHuBERT-Tamasheq Unavailable
Tamasheq https://huggingface.co/LIA-AvignonUniversity/IWSLT2022-tamasheq-only
Niger-Mali https://huggingface.co/LIA-AvignonUniversity/IWSLT2022-Niger-Mali
XLSR-53 https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec
XLS-R large and xlarge https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec/xlsr
Table 16: Downloading sources for the speech representation models checkpoints used in our experiments.
Table 17: Training stacked layers (i.e. adding and training new bottom encoder layers) versus fine-tuning the
existing bottom layers; with or without adapters. The other hyper-parameters are identical to our contrastive
submission (underlined scores).
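Speech features         MT model      Conv. layers  FT layers  Adapters       Total params  Trained params  Taq-Fr BLEU  Fr-En BLEU  Speed
Tamasheq (layer 11)     NLLB 1.3B     1             3          enc+dec (64)   1.38B         70M             16.8         32.5        11.6×
Tamasheq (layer 8)      NLLB 1.3B     1             3          enc+dec (64)   1.38B         70M             19.3         31.6        12.0×
mHuBERT-Taq (layer 11)  NLLB 1.3B     1             3          enc+dec (64)   1.38B         70M             16.4         37.1        12.1×
mHuBERT-Taq (layer 8)   NLLB 1.3B     1             3          enc+dec (64)   1.38B         70M             16.2         36.7        12.1×
Niger-Mali (layer 11)   NLLB 1.3B     1             3          enc+dec (64)   1.38B         70M             16.6         34.6        11.8×
Niger-Mali (layer 8)    NLLB 1.3B     1             3          enc+dec (64)   1.38B         70M             19.1         36.6        12.5×
XLSR-53 (layer 18)      NLLB 1.3B     1             3          enc+dec (64)   1.38B         70M             15.9         38.0        12.4×
XLS-R L (layer 18)      NLLB 1.3B     1             3          enc+dec (64)   1.38B         70M             16.8         39.4        12.7×
XLS-R XL (layer 46)     NLLB 1.3B     1             3          enc+dec (64)   1.38B         70M             15.4         37.4        11.7×

Niger-Mali (layer 8)    mBART (600M)  1             3          enc+dec (64)   0.61B         41M             16.3         28.9        22.9×
Niger-Mali (layer 8)    NLLB (600M)   1             3          enc+dec (64)   0.62B         41M             18.0         32.5        24.2×
Niger-Mali (layer 8)    NLLB (1.3B)   1             3          enc+dec (64)   1.38B         70M             19.1         36.6        12.5×
Niger-Mali (layer 8)    NLLB (3.3B)   1             3          enc+dec (64)   3.36B         165M            19.3         37.3        4.5×

Niger-Mali (layer 8)    NLLB 1.3B     3             3          enc+dec (64)   1.38B         70M             18.5         33.4        25.5×
Niger-Mali (layer 8)    NLLB 1.3B     2             3          enc+dec (64)   1.38B         70M             19.4         35.4        19.5×
Niger-Mali (layer 8)    NLLB 1.3B     1             3          enc+dec (64)   1.38B         70M             19.1         36.6        12.5×
Niger-Mali (layer 8)    NLLB 1.3B     0             3          enc+dec (64)   1.38B         70M             19.6         34.4        7.1×

Niger-Mali (layer 8)    NLLB 1.3B     1             24         enc+dec (64)   1.37B         508M            16.7         30.7        11.9×
Niger-Mali (layer 8)    NLLB 1.3B     1             4          enc+dec (64)   1.38B         91M             19.6         36.8        12.3×
Niger-Mali (layer 8)    NLLB 1.3B     1             3          enc+dec (64)   1.38B         70M             19.1         36.6        12.5×
Niger-Mali (layer 8)    NLLB 1.3B     1             2          enc+dec (64)   1.38B         49M             19.0         36.2        12.0×
Niger-Mali (layer 8)    NLLB 1.3B     1             1          enc+dec (64)   1.38B         28M             18.2         35.1        12.0×

Niger-Mali (layer 8)    NLLB 1.3B     1             1          enc (64)       1.37B         25M             19.1         34.2        12.4×
Niger-Mali (layer 8)    NLLB 1.3B     1             1          none           1.37B         22M             17.5         33.3        12.6×
Niger-Mali (layer 8)    NLLB 1.3B     1             3          enc+dec (256)  1.40B         88M             18.8         35.8        12.2×
Niger-Mali (layer 8)    NLLB 1.3B     1             3          enc+dec (128)  1.38B         76M             19.2         36.3        12.1×
Niger-Mali (layer 8)    NLLB 1.3B     1             3          enc+dec (64)   1.38B         70M             19.1         36.6        12.5×
Niger-Mali (layer 8)    NLLB 1.3B     1             3          enc (64)       1.37B         67M             19.3         35.7        12.7×
Niger-Mali (layer 8)    NLLB 1.3B     1             3          none           1.37B         64M             18.3         35.6        13.1×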
Table 18: Ablation study on Taq-Fr ST, with various speech feature extractors, pre-trained MT models used for
initialization, and trained parameters. The total parameter counts do not include the parameters of the speech feature
extractors. The BLEU scores reported are on the IWSLT 2022 Taq-Fr and mTEDx Fr-En test sets. The speed metric
is relative to real time (i.e., seconds in the test set divided by seconds spent decoding) and does not include feature
extraction time. It is obtained by decoding the Taq-Fr test set on a single T4 with a batch size of 10 utterances
(averaged over 3 decoding runs). The underlined numbers all correspond to the same model, which is our first
contrastive submission to the task (the non-ensemble version of our primary submission). All of these models are
trained with the same data (see Table 15) and early stopping is done based on Taq-Fr valid BLEU scores. The
numbers inside parentheses in the Adapters column correspond to the bottleneck dimension of the trained adapter
modules. Adapters are not added in the encoder layers that are being fine-tuned. These models took between 15 and
47 h each to train on 4 V100 GPUs, with an average training time of 26 h.
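The “speed” column described in the caption of Table 18 is a ratio of audio duration to decoding time. A minimal sketch of that computation follows. It assumes that the three decoding runs are averaged by taking the mean decoding time (the caption does not say whether times or ratios are averaged), and both the function name `relative_speed` and the numbers in the usage example are hypothetical.

```python
def relative_speed(audio_seconds, decoding_seconds_per_run):
    """Speed relative to real time: seconds of audio per second of decoding.

    Assumption: runs are combined by averaging their decoding times.
    Feature extraction time is excluded, as stated in the caption.
    """
    if not decoding_seconds_per_run:
        raise ValueError("at least one decoding run is required")
    mean_decoding = sum(decoding_seconds_per_run) / len(decoding_seconds_per_run)
    return audio_seconds / mean_decoding


if __name__ == "__main__":
    # Hypothetical numbers: one hour of test audio decoded three times.
    print(relative_speed(3600.0, [290.0, 288.0, 286.0]))
```

A value above 1× means the system decodes faster than the audio plays.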
MT  NLLB 3.3B       none     47.4  39.5  39.2  39.8  48.6  34.0  42.4
MT  NLLB 1.3B       none     47.9  38.9  39.6  39.8  48.5  33.8  41.9
MT  NLLB 1.3B       enc+dec  50.2  40.7  42.2  42.1  51.0  37.6  45.2
MT  NLLB 1.3B       enc      49.9  41.3  42.6  41.9  50.6  36.5  44.9
MT  NLLB 1.3B       dec      48.8  39.2  41.0  41.1  49.7  35.6  43.9
MT  NLLB 1.3B (DA)  enc+dec  51.3  43.2  45.2  44.7  53.2  37.8  47.1
Table 19: Top half: Speech translation BLEU scores on the IWSLT 2021 test sets, when deactivating encoder
adapters, decoder adapters, or both in an ST model at inference time. The ST model is the same one as in Table 5,
trained with encoder and decoder adapters. Bottom half: Text-to-text MT BLEU scores when using the ST adapters
in the initial model and disabling the ST bottom layers and convolutions.
Direct Models for Simultaneous Translation and Automatic Subtitling:
FBK@IWSLT2023
Automatic Subtitling. Both the classic encoder-decoder architecture and the triangle architecture are composed of 12 layers of Conformer encoder and 6 layers of Transformer decoder (which is replicated twice in the triangle model). The dimension of the feed-forward layers is 2,048 and d = 512 in the attention. The kernel size of the point- and depth-wise convolutions in the convolutional modules is 31. The dropout was set to 0.1. CTC loss with compression is added with weight 0.5 to the cross entropy loss with label smoothing (smoothing factor 0.1), and the model is optimized with Adam (β1 = 0.9, β2 = 0.98). The source vocabulary is of size 8,000 and the target vocabulary of size 16,000 (<eob> and <eol> included); both are obtained by SentencePiece models. The ST pre-training was done by setting the learning rate to 0.002 with inverse square-root scheduler and 25,000 warm-up updates. The SubST fine-tuning was done by setting a constant learning rate of 0.001. A second fine-tuning was done with the same setting as Papi et al. (2022a), but we restored the punctuation of the ASR datasets which do not contain any (i.e., the TEDLIUM corpus (Hernandez et al., 2018)) by using bert-restore-punctuation,³ before machine-translating and segmenting the target texts into subtitles. We trained the standard architecture with 40,000 maximum tokens on 4 NVIDIA A100 GPUs with 40GB of RAM and we set the update frequency to 2. For the triangle architecture, we set maximum tokens to 20,000 to fit the architecture in memory and the update frequency to 4 to hold the same total batch size of 320,000 tokens. Maximum updates were set to 100,000 for both the pre-training and training phases.

3.3 Evaluation Settings

Simultaneous. We exploit the SimulEval tool (Ma et al., 2020a). To be comparable with the previous years, all the results except this year’s submission are shown for SimulEval v1.0.2, which adopts BLEU (Post, 2018)⁴ to measure translation quality and Average Lagging or AL (Ma et al., 2019) to measure latency. For this year’s submission, instead, we adopt the latest version of SimulEval (1.1.0) with BLEU measured with sacrebleu 2.3.0, and we also report Length-Adaptive Average Lagging or LAAL (Papi et al., 2022c) and Average Token Delay or ATD (Kano et al., 2022) as additional latency metrics. All the evaluations were run on a single NVIDIA K80 with 12GB of RAM, by applying global CMVN to audio input, whose features were estimated on the MuST-C v2 training set. Computational aware metrics (“_CA”) refer to the single NVIDIA K80 setting and also consider the model computational time in the delay calculation.

Automatic Subtitling. We adopt the following metrics: SubER-cased (henceforth, SubER) (Wilken et al., 2022) for overall subtitle quality, Sigma (Karakanta et al., 2022) for the subtitle segmentation quality, and BLEU⁵ for translation quality. We also compute the conformity percentage of 42 characters per line (CPL) and 21 characters per second (CPS) or reading speed, as suggested on the track website.⁶ We neglected the conformity computation of the subtitles with more than two lines, since our model only produces subtitles with two lines or less and is therefore always 100% conformant on this constraint. Conformity scores are computed by using the script released for the paper (Papi et al., 2022a).⁷ Dev/test audios are segmented with SHAS (Tsiamas et al., 2022). No audio cleaning is applied.

3 https://huggingface.co/felflare/bert-restore-punctuation
4 case:mixed|eff:no|tok:13a|smooth:exp|version:1.5.1
5 case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0
6 https://iwslt.org/2023/subtitling#automatic-evaluation
7 Script available at: https://github.com/hlt-mt/FBK-fairseq/blob/master/examples/speech_to_text/scripts/subtitle_compliance.py

4 Results
4.1 Simultaneous Translation

Since we directly employ an offline model for the simultaneous inference, we show in Table 1 the results of the offline ASR pre-training and ST training. Although the model with 12 encoder layers (row 1) obtains a lower (hence better) WER compared to the 16 encoder-layers model (row 2), the highest (hence best) BLEU in ST is achieved by the bigger architecture. The performance is also slightly enhanced by adding the CTC compression (row 3) during training, which is also particularly useful for the SimulST scenario since it speeds up inference (by about 12/15%). Therefore, we select this model for the final submission. Compared to our last year’s submission (row 5), our 16 encoder-layers model scores +0.4 BLEU even if, at this time, we have not fine-tuned it on the in-domain (TED talks) datasets. Our model also performs better than last year’s NAIST system (+11.1 BLEU) [...] models such as wav2vec 2.0 and mBART50. Compared to last year’s cascade model by UPV, we score -1.7 BLEU. This system, however, also outperformed the CUNI-KIT system by 0.7 BLEU points, indicating that a gap between direct and cascade architectures still exists.

id  Model                 WER% (↓)  BLEU (↑)
1   12 encoder layers     9.7       31.6
2   16 encoder layers     9.9       31.9
3   + CTC compress.       -         32.1
4   CUNI-KIT 2022†        -         33.1
5   FBK 2022              -         31.7
6   NAIST 2022‡           -         21.0
7   UPV 2022 (Cascade)*   9.5       33.8

[Figure 1 plot: BLEU (y-axis) against AL / AL_CA in seconds (x-axis); curves for offline, LA, EDAtt, and AlignAtt.]

Figure 1: Comparison between the LA, EDATT, and ALIGNATT policies described in Section 2.1 on MuST-C v2 en→de tst-COMMON. Solid curves represent AL, dashed curves represent AL_CA.
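Average Lagging (Ma et al., 2019), the latency metric on the x-axis of Figure 1, can be illustrated with a small sketch. It is written from the published definition, adapted to speech by measuring the source in time units, and is not the SimulEval implementation; the function name `average_lagging` and its argument layout are assumptions made for illustration.

```python
def average_lagging(delays, source_length, target_length=None):
    """Average Lagging for one utterance (illustrative sketch).

    delays[i] is the amount of source (e.g. seconds of audio) that had been
    consumed when target token i was emitted; source_length is the total
    source duration in the same unit.
    """
    if not delays:
        raise ValueError("at least one target token is required")
    if target_length is None:
        target_length = len(delays)
    # Source consumed per target token by an ideal, evenly paced system.
    step = source_length / target_length
    total, tau = 0.0, 0
    for i, delay in enumerate(delays):
        total += delay - i * step
        tau = i + 1
        # Stop at the first token emitted once the whole source was read.
        if delay >= source_length:
            break
    return total / tau


if __name__ == "__main__":
    print(average_lagging([1.0, 2.0, 3.0, 4.0], source_length=4.0))
    print(average_lagging([4.0, 4.0, 4.0, 4.0], source_length=4.0))
```

With delays of 1, 2, 3 and 4 s on a 4 s utterance, every token trails the ideal pace by 1 s, so the function returns 1.0; a system that waits for the full input before emitting anything returns the full 4 s. The computational aware variant (AL_CA) would additionally include the model's computation time in each delay.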
en-de
Model                 SubER  BLEU  Sigma  CPL   CPS
(Papi et al., 2022a)  59.9   23.4  77.9   86.9  68.9
Triangle              60.8   22.6  74.6   84.5  67.7

en-es
Model                 SubER  BLEU  Sigma  CPL   CPS
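The CPL and CPS columns report the percentage of subtitles that respect the 42 characters-per-line and 21 characters-per-second limits. The sketch below shows one way such percentages could be computed. It is not the authors' subtitle_compliance.py script: the `Subtitle` class, the choice to score each subtitle block as a whole, and the counting of spaces as characters are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Subtitle:
    start: float      # display start time in seconds
    end: float        # display end time in seconds
    lines: List[str]  # text lines shown on screen


def conformity(subs: List[Subtitle], max_cpl: int = 42,
               max_cps: float = 21.0) -> Tuple[float, float]:
    """Return (CPL conformity %, CPS conformity %) over subtitle blocks."""
    if not subs:
        raise ValueError("at least one subtitle is required")
    # A block conforms to CPL if none of its lines exceeds the limit.
    cpl_ok = sum(all(len(line) <= max_cpl for line in s.lines) for s in subs)
    # A block conforms to CPS if its reading speed stays within the limit.
    cps_ok = sum(
        sum(len(line) for line in s.lines) / (s.end - s.start) <= max_cps
        for s in subs
    )
    return 100.0 * cpl_ok / len(subs), 100.0 * cps_ok / len(subs)


if __name__ == "__main__":
    demo = [
        Subtitle(0.0, 2.0, ["Hello there,", "how are you?"]),
        Subtitle(2.0, 3.0, ["x" * 50]),
    ]
    print(conformity(demo))
```

In the usage example the first block satisfies both limits while the second violates both, so each conformity score is 50%.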
5 Conclusions

We presented the FBK’s systems built to participate in the IWSLT 2023 Evaluation Campaigns for simultaneous speech translation (en-de) and automatic subtitling (en-{de, es}). Our submissions are characterized by the use of direct speech translation models to address both tasks, without any further modification nor adaptation for the simultaneous task, and with a fine-tuning on subtitle-like translations for the automatic subtitling task. Our SimulST system achieves a lower computational-aware latency with up to 3.5 BLEU gain compared to the last two years’ winners. Our automatic subtitling system achieves 3.7 and 1.7 SubER improvement on en-de and en-es respectively, compared to the only solution published in the literature based on a direct system.

Acknowledgements

This work has been supported by the project “AI@TN” funded by the Autonomous Province of Trento, Italy.

References

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutai Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023).

Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Vĕra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nǎdejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 evaluation campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 98–157, Dublin, Ireland (in-person and online).

Antonios Anastasopoulos, Ondřej Bojar, Jacob Bremerman, Roldano Cattoni, Maha Elbayad, Marcello Federico, Xutai Ma, Satoshi Nakamura, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Alexander Waibel, Changhan Wang, and Matthew Wiesner. 2021. FINDINGS OF THE IWSLT 2021 EVALUATION CAMPAIGN. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 1–29, Bangkok, Thailand (online).

Antonios Anastasopoulos and David Chiang. 2018. Tied multitask learning for neural speech translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 82–91, New Orleans, Louisiana.

Ebrahim Ansari, Amittai Axelrod, Nguyen Bach, Ondřej Bojar, Roldano Cattoni, Fahim Dalvi, Nadir Durrani, Marcello Federico, Christian Federmann, Jiatao Gu, Fei Huang, Kevin Knight, Xutai Ma, Ajay Nagesh, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Xing Shi, Sebastian Stüker, Marco Turchi, Alexander Waibel, and Changhan Wang. 2020. FINDINGS OF THE IWSLT 2020 EVALUATION CAMPAIGN. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 1–34, Online.

Luisa Bentivogli, Mauro Cettolo, Marco Gaido, Alina Karakanta, Alberto Martinelli, Matteo Negri, and Marco Turchi. 2021. Cascade versus direct speech translation: Do the differences still make a difference? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2873–2887, Online.

Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation. In NIPS Workshop on end-to-end learning for speech and audio processing, Barcelona, Spain.

Ondřej Bojar, Dominik Macháček, Sangeet Sagar, Otakar Smrž, Jonáš Kratochvíl, Peter Polák, Ebrahim Ansari, Mohammad Mahmoudi, Rishu Kumar, Dario
Franceschini, Chiara Canton, Ivan Simonini, Thai-Son Nguyen, Felix Schneider, Sebastian Stüker, Alex Waibel, Barry Haddow, Rico Sennrich, and Philip Williams. 2021. ELITR multilingual live subtitling: Demo and strategy. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 271–277, Online.

Chih-Chiang Chang and Hung-Yi Lee. 2022. Exploring Continuous Integrate-and-Fire for Adaptive Simultaneous Speech Translation. In Proc. Interspeech 2022, pages 5175–5179.

Junkun Chen, Mingbo Ma, Renjie Zheng, and Liang Huang. 2021. Direct simultaneous speech-to-text translation assisted by synchronized streaming ASR. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4618–4624, Online.

Marcello Federico, Yogesh Virkar, Robert Enyedi, and Roberto Barra-Chicote. 2020. Evaluating and Optimizing Prosodic Alignment for Automatic Dubbing. In Proc. Interspeech 2020, pages 1481–1485.

Ryo Fukuda, Yuka Ko, Yasumasa Kano, Kosuke Doi, Hirotaka Tokuyama, Sakriani Sakti, Katsuhito Sudoh, and Satoshi Nakamura. 2022. NAIST simultaneous speech-to-text translation system for IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 286–292, Dublin, Ireland (in-person and online).

Marco Gaido, Mauro Cettolo, Matteo Negri, and Marco Turchi. 2021a. CTC-based compression for direct speech translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 690–696, Online.

Marco Gaido, Mattia A. Di Gangi, Matteo Negri, and Marco Turchi. 2020. End-to-end speech-translation with knowledge distillation: FBK@IWSLT2020. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 80–88, Online.

Marco Gaido, Mattia A. Di Gangi, Matteo Negri, and Marco Turchi. 2021b. On Knowledge Distillation for Direct Speech Translation. In Proceedings of CLiC-IT 2020, Online.

Marco Gaido, Matteo Negri, and Marco Turchi. 2022a. Direct speech-to-text translation models as students of text-to-text models. Italian Journal of Computational Linguistics.

Marco Gaido, Sara Papi, Dennis Fucci, Giuseppe Fiameni, Matteo Negri, and Marco Turchi. 2022b. Efficient yet competitive speech translation: FBK@IWSLT2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 177–189, Dublin, Ireland (in-person and online).

Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber. 2006. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proceedings of the 23rd international conference on Machine learning (ICML), pages 369–376, Pittsburgh, Pennsylvania.

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented Transformer for Speech Recognition. In Proc. Interspeech 2020, pages 5036–5040.

François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Estève. 2018. Ted-lium 3: Twice as much data and corpus repartition for experiments on speaker adaptation. In Speech and Computer, pages 198–208, Cham. Springer International Publishing.

Javier Iranzo-Sánchez, Javier Jorge Cano, Alejandro Pérez-González-de Martos, Adrián Giménez Pastor, Gonçal Garcés Díaz-Munío, Pau Baquero-Arnal, Joan Albert Silvestre-Cerdà, Jorge Civera Saiz, Albert Sanchis, and Alfons Juan. 2022. MLLP-VRAIN UPV systems for the IWSLT 2022 simultaneous speech translation and speech-to-speech translation tasks. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 255–264, Dublin, Ireland (in-person and online).

Yasumasa Kano, Katsuhito Sudoh, and Satoshi Nakamura. 2022. Average token delay: A latency metric for simultaneous translation. arXiv preprint arXiv:2211.13173.

Alina Karakanta, François Buet, Mauro Cettolo, and François Yvon. 2022. Evaluating Subtitle Segmentation for End-to-end Generation Systems. In Proceedings of the 13th Language Resources and Evaluation Conference (LREC), pages 3069–3078, Marseilles, France.

Alina Karakanta, Marco Gaido, Matteo Negri, and Marco Turchi. 2021a. Between flexibility and consistency: Joint generation of captions and subtitles. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 215–225, Bangkok, Thailand (online).

Alina Karakanta, Matteo Negri, and Marco Turchi. 2020a. Is 42 the answer to everything in subtitling-oriented speech translation? In Proceedings of the 17th International Conference on Spoken Language Translation, pages 209–219, Online.

Alina Karakanta, Matteo Negri, and Marco Turchi. 2020b. MuST-cinema: a speech-to-subtitles corpus. In Proc. of the 12th Language Resources and Evaluation Conference, pages 3727–3734, Marseille, France.
Alina Karakanta, Sara Papi, Matteo Negri, and Marco Turchi. 2021b. Simultaneous speech translation for live subtitling: from delay to display. In Proceedings of the 1st Workshop on Automatic Spoken Language Translation in Real-World Settings (ASLTRW), pages 35–48, Virtual.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Maarit Koponen, Umut Sulubacak, Kaisa Vitikainen, and Jörg Tiedemann. 2020. MT for subtitling: User evaluation of post-editing productivity. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 115–124, Lisboa, Portugal.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium.

Dan Liu, Mengge Du, Xiaoxi Li, Yuchen Hu, and Lirong Dai. 2021a. The USTC-NELSLIP systems for simultaneous speech translation task at IWSLT 2021. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 30–38, Bangkok, Thailand (online).

Dan Liu, Mengge Du, Xiaoxi Li, Ya Li, and Enhong Chen. 2021b. Cross attention augmented transducer networks for simultaneous translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 39–55, Online and Punta Cana, Dominican Republic.

Danni Liu, Gerasimos Spanakis, and Jan Niehues. 2020. Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection. In Proc. Interspeech 2020, pages 3620–3624.

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy.

Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, and Juan Pino. 2020a. SIMULEVAL: An evaluation toolkit for simultaneous translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 144–150, Online.

Xutai Ma, Juan Pino, and Philipp Koehn. 2020b. SimulMT to SimulST: Adapting simultaneous text translation to end-to-end simultaneous speech translation. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 582–587, Suzhou, China.

Evgeny Matusov, Patrick Wilken, and Yota Georgakopoulou. 2019. Customizing neural machine translation for subtitling. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 82–93, Florence, Italy.

Maite Melero, Antoni Oliver, and Toni Badia. 2006. Automatic Multilingual Subtitling in the eTITLE Project. In Proceedings of ASLIB Translating and the Computer 28.

Ha Nguyen, Yannick Estève, and Laurent Besacier. 2021. An empirical study of end-to-end simultaneous speech translation decoding strategies. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7528–7532. IEEE.

Alp Öktem, Mireia Farrús, and Antonio Bonafonte. 2019. Prosodic phrase alignment for machine dubbing. ArXiv, abs/1908.07226.

Sara Papi, Marco Gaido, Alina Karakanta, Mauro Cettolo, Matteo Negri, and Marco Turchi. 2022a. Direct speech translation for automatic subtitling. arXiv preprint arXiv:2209.13192.

Sara Papi, Marco Gaido, Matteo Negri, and Andrea Pilzer. 2023a. Reproducibility is Nothing without Correctness: The Importance of Testing Code in NLP. arXiv preprint arXiv:2303.16166.

Sara Papi, Marco Gaido, Matteo Negri, and Marco Turchi. 2021. Dealing with training and test segmentation mismatch: FBK@IWSLT2021. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 84–91, Bangkok, Thailand (online).

Sara Papi, Marco Gaido, Matteo Negri, and Marco Turchi. 2022b. Does simultaneous speech translation need simultaneous models? In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 141–153, Abu Dhabi, United Arab Emirates.

Sara Papi, Marco Gaido, Matteo Negri, and Marco Turchi. 2022c. Over-generation cannot be rewarded: Length-adaptive average lagging for simultaneous speech translation. In Proceedings of the Third Workshop on Automatic Simultaneous Translation, pages 12–17, Online.

Sara Papi, Alina Karakanta, Matteo Negri, and Marco Turchi. 2022d. Dodging the data bottleneck: Automatic subtitling with automatically segmented ST corpora. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint
Conference on Natural Language Processing (Volume 2: Short Papers), pages 480–487, Online only.

Sara Papi, Matteo Negri, and Marco Turchi. 2022e. Attention as a guide for simultaneous speech translation. arXiv preprint arXiv:2212.07850.

Sara Papi, Matteo Negri, and Marco Turchi. 2023b. Alignatt: Using attention-based audio-translation alignments as a guide for simultaneous speech translation. In Proc. of Interspeech 2023, Dublin, Ireland.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proc. Interspeech 2019, pages 2613–2617.

Stelios Piperidis, Iason Demiros, Prokopis Prokopidis, Peter Vanroose, Anja Hoethker, Walter Daelemans, Elsa Sklavounou, Manos Konstantinou, and Yannis Karavidas. 2004. Multimodal, multilingual resources in the subtitling process. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal.

Peter Polák, Ngoc-Quan Pham, Tuan Nam Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, Ondřej Bojar, and Alexander Waibel. 2022. CUNI-KIT system for simultaneous speech translation task at IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 277–285, Dublin, Ireland (in-person and online).

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium.

Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, and Juan Pino. 2020. fairseq s2t: Fast speech-to-text modeling with fairseq. In Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations.

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-Sequence Models Can Directly Translate Foreign Speech. In Proceedings of Interspeech 2017, pages 2625–2629, Stockholm, Sweden.

Patrick Wilken, Panayota Georgakopoulou, and Evgeny Matusov. 2022. SubER - a metric for automatic evaluation of subtitle quality. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 1–10, Dublin, Ireland (in-person and online).

Xingshan Zeng, Liangyou Li, and Qun Liu. 2021. RealTranS: End-to-end simultaneous speech translation with convolutional weighted-shrinking transformer. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2461–2474, Online.

Shaolei Zhang and Yang Feng. 2022. Information-transport-based policy for simultaneous translation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 992–1013, Abu Dhabi, United Arab Emirates.

Baigong Zheng, Kaibo Liu, Renjie Zheng, Mingbo Ma, Hairong Liu, and Liang Huang. 2020. Simultaneous translation policies: From fixed to adaptive. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2847–2853, Online.
MT Metrics Correlate with Human Ratings of
Simultaneous Speech Translation
Dominik Macháček1 and Ondřej Bojar1 and Raj Dabre2
els of speech translation quality, BLEU, chrF2, BERTScore and COMET can be used for reliable assessment of human judgement of SST quality at least on the level of test sets. chrF2, BERTScore and COMET are reliable also at the document level.

Translation vs Interpreting Reference There is an open question whether SST should rather mimic offline translation, or simultaneous interpreting. As

sentences in the document to one single sequence, and then apply the metric on it, as if it was one sentence. mWERSegmenter is a tool for aligning translation candidates to reference, if their sentence segmentation differs. It finds the alignment with the minimum WER when comparing tokens in aligned segments. For translation, we also apply the default sentence alignment (SENT).

In Table 2, we report the correlations of metric,
reference and alignment variants and their significance, with more details in Appendix D.

4.1 Recommendations
Taking CR as the golden truth of human quality, we make the following recommendations of the most correlating metric, reference and sentence alignment method for SST evaluation.

Which metric? COMET, because it correlates significantly better with CR than BERTScore does. From the fall back options, chrF2 should be slightly preferred over BLEU.

Which reference? The metrics give significantly higher correlations with CR with translations than with interpreting as a reference. Difference between translation reference and two references (TRANSL+INTP) is insignificant. Therefore, we recommend translation as a reference for SST.

Which alignment method? With an unaligned reference, COMET and BERTScore correlate significantly more with SINGLESEQ than with MWER, probably because the neural metrics are trained on full, complete sentences, which are often split to multiple segments by mWERSegmenter. chrF2 correlates insignificantly better with MWER than with SINGLESEQ.

5 Conclusion
We found correlation of offline MT metrics to human judgements of simultaneous speech translation. The most correlating and thus preferred metric is COMET, followed by BERTScore and chrF2. We recommend text translation reference over interpreting, and single sequence alignment for neural, and mWERSegmenter for n-gram metrics.

Furthermore, we used only one example of human interpreting. A precise in-depth study of human interpretations is needed to re-assess the recommendation of translation or interpreting as reference in SST.

Acknowledgements
We are thankful to Dávid Javorský and Peter Polák for their reviews.
This research was partially supported by the grants 19-26934X (NEUREM3) of the Czech Science Foundation, SVV project number 260 698, and 398120 of the Grant Agency of Charles University.

References
Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Věra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nădejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 evaluation campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 98–157, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Kyunghyun Cho and Masha Esipova. 2016. Can neural machine translation do simultaneous translation? CoRR, abs/1606.02012.
Spoken Language Translation (IWSLT 2022), pages 286–292, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Marco Gaido, Sara Papi, Dennis Fucci, Giuseppe Fiameni, Matteo Negri, and Marco Turchi. 2022. Efficient yet competitive speech translation: FBK@IWSLT2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 177–189, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, and Nitika Mathur. 2015. Accurate evaluation of segment-level machine translation metrics. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1183–1191, Denver, Colorado. Association for Computational Linguistics.

Javier Iranzo-Sánchez, Javier Jorge Cano, Alejandro Pérez-González-de Martos, Adrián Giménez Pastor, Gonçal Garcés Díaz-Munío, Pau Baquero-Arnal, Joan Albert Silvestre-Cerdà, Jorge Civera Saiz, Albert Sanchis, and Alfons Juan. 2022. MLLP-VRAIN UPV systems for the IWSLT 2022 simultaneous speech translation and speech-to-speech translation tasks. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 255–264, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Dávid Javorský, Dominik Macháček, and Ondřej Bojar. 2022. Continuous rating as reliable human evaluation of simultaneous speech translation. In Proceedings of the Seventh Conference on Machine Translation, pages 154–164, Abu Dhabi. Association for Computational Linguistics.

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy. Association for Computational Linguistics.

Dominik Macháček and Ondřej Bojar. 2020. Presenting simultaneous translation in limited space. In Proceedings of the 20th Conference Information Technologies - Applications and Theory (ITAT 2020), Hotel Tyrapol, Oravská Lesná, Slovakia, September 18-22, 2020, volume 2718 of CEUR Workshop Proceedings, pages 34–39. CEUR-WS.org.

Dominik Macháček, Jonáš Kratochvíl, Tereza Vojtěchová, and Ondřej Bojar. 2019. A speech test set of practice business presentations with additional relevant texts. In Statistical Language and Speech Processing, pages 151–161, Cham. Springer International Publishing.

Dominik Macháček, Matúš Žilinec, and Ondřej Bojar. 2021. Lost in Interpreting: Speech Translation from Source or Interpreter? In Proc. Interspeech 2021, pages 2376–2380.

Evgeny Matusov, Gregor Leusch, Oliver Bender, and Hermann Ney. 2005. Evaluating machine translation output with automatic sentence segmentation. In Proceedings of the Second International Workshop on Spoken Language Translation, Pittsburgh, Pennsylvania, USA.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Peter Polák, Ngoc-Quan Pham, Tuan Nam Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, Ondřej Bojar, and Alexander Waibel. 2022. CUNI-KIT system for simultaneous speech translation task at IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 277–285, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. Unbabel’s participation in the WMT20 metrics shared task. In Proceedings of the Fifth Conference on Machine Translation, pages 911–920, Online. Association for Computational Linguistics.

Elizabeth Salesky, Marcello Federico, and Marta Costa-jussà, editors. 2022. Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022). Association for Computational Linguistics, Dublin, Ireland (in-person and online).

Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, and Marta R. Costa-jussà. 2022. SHAS: Approaching optimal Segmentation for End-to-End Speech Translation. In Proc. Interspeech 2022, pages 106–110.

Minghan Wang, Jiaxin Guo, Yinglu Li, Xiaosong Qiao, Yuxia Wang, Zongyao Li, Chang Su, Yimeng Chen, Min Zhang, Shimin Tao, Hao Yang, and Ying Qin. 2022. The HW-TSC’s simultaneous speech translation system for IWSLT 2022 evaluation. In Proceedings of the 19th International Conference on Spoken
Language Translation (IWSLT 2022), pages 247–254, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.

A Highlights of IWSLT22 Findings
The Findings of IWSLT22 (Anastasopoulos et al., 2022) are available in PDF. The most up-to-date version (version 2) is 61 pages long.² We highlight the relevant parts of Findings with page numbers in Table 3 so that we can refer to them easily.
Note that findings are a part of the conference proceedings (Salesky et al., 2022) as a chapter in a book. The order of findings pages in PDF does not match the page numbers at the footers.
Also note that in Section 2.4 on page 4 (in PDF, 101 in Proceedings), there is a description of MLLP-VRAIN which corresponds to the system denoted as UPV in all other tables and figures.

B Metric Signatures

We found two definitions that can yield different results in certain situations:
(1) The rating (as clicked by the evaluator) is valid at the instant time point when the evaluator clicked the rating button. The final score is the average of all clicks, and each click has equal weight. We denote this interpretation as CR.
(2) The rating is assigned to the time interval from the click time to the next click, or between the last click and the end of the document. The length of the interval is considered in averaging. The final score is the average of ratings weighted by the lengths of the intervals in which the rating is valid. We denote this interpretation as CRi.³
To express them rigorously, let us have a document of duration $T$, and $n$ ratings $(r_i, t_i)$, where $i \in \{1, \dots, n\}$ is an index, $r_i \in \{1, \dots, 4\}$ is the rated value and $0 \le t_1 < \dots < t_n \le T$ are times when the ratings were recorded.
Then, the definitions are as follows:

$$\mathrm{CR} = \frac{1}{n} \sum_{i=1}^{n} r_i$$
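As a rough illustration of the two aggregation schemes described above, the following Python sketch computes both scores from a list of (rating, time) clicks. The function names `cr` and `cri` and the example clicks are hypothetical. Only the formula for CR is given explicitly here, so the normalization of CRi by the total rated time (from the first click to the end of the document) is an assumption based on the verbal definition.

```python
from typing import List, Sequence, Tuple

Click = Tuple[float, float]  # (rating r_i, click time t_i in seconds)


def cr(clicks: Sequence[Click]) -> float:
    """CR: plain average of all clicked ratings, each click weighted equally."""
    if not clicks:
        raise ValueError("at least one rating is required")
    return sum(r for r, _ in clicks) / len(clicks)


def cri(clicks: Sequence[Click], duration: float) -> float:
    """CRi: average of ratings weighted by the interval each rating stays valid.

    Assumption: a rating is valid from its click until the next click (or the
    end of the document), and the sum is normalized by the total rated time.
    """
    if not clicks:
        raise ValueError("at least one rating is required")
    ordered: List[Click] = sorted(clicks, key=lambda click: click[1])
    if ordered[0][1] < 0 or ordered[-1][1] > duration:
        raise ValueError("click times must lie within [0, duration]")
    # End time of each interval: the next click, or the document end.
    ends = [t for _, t in ordered[1:]] + [duration]
    weighted_sum = 0.0
    total_length = 0.0
    for (rating, start), end in zip(ordered, ends):
        length = end - start
        weighted_sum += rating * length
        total_length += length
    if total_length == 0:
        # Degenerate case: the only click is at the very end of the document.
        return cr(ordered)
    return weighted_sum / total_length


# Hypothetical example: three clicks in a 100-second document.
example_clicks = [(4, 0.0), (2, 10.0), (1, 90.0)]
print(cr(example_clicks))          # average over clicks
print(cri(example_clicks, 100.0))  # average over time
```

In this made-up example the rating 2 is valid for most of the document, so the time-weighted score falls below the plain average of clicks, which is the kind of situation in which the two definitions diverge.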
marker                         PDF page   numbered page   description
Section 2                      3-5        100-102         Simultaneous Speech Translation Task
Figure 1                       6          103             Quality-latency trade-off curves
Section 2.6.1                  5          102             Description of human evaluation
Figure 5                       8          105             Manual scores vs BLEU (plot)
Two Test Sets (paragraph)      39         136             Non-Native subset
Test data (paragraph)          9          106             Common (native) subset of test data
Automatic Evaluation Results   44         141             Latency and BLEU results (table)
A1.1 (appendix)                38-39      135-136         Details on human evaluation
Table 17                       48         145             Test subsets duration
Table 18                       48         145             Manual scores and BLEU (table)
Figure 2: Relation between weighted interval averaging of continuous rating (CRi, y-axis) and average of all ratings (CR, x-axis) for each annotation of each document (blue data points).
Both subsets
COMET transl sent 0.80 0.64 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
COMET transl singleseq 0.64 0.79 0.18 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
COMET transl+intp singleseq 0.37 0.18 0.79 0.04 0.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
BertScore transl sent 0.00 0.01 0.04 0.77 0.75 0.93 0.17 0.08 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
BertScore transl+intp sent+mwer 0.00 0.01 0.04 0.75 0.77 0.97 0.20 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
COMET intp singleseq 0.00 0.00 0.00 0.93 0.97 0.77 0.32 0.21 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
BertScore transl+intp singleseq 0.00 0.00 0.00 0.17 0.20 0.32 0.76 0.12 0.03 0.02 0.02 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
BertScore transl singleseq 0.00 0.00 0.00 0.08 0.10 0.21 0.12 0.75 0.06 0.05 0.03 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
chrF2 transl+intp sent+mwer 0.00 0.00 0.00 0.00 0.00 0.01 0.03 0.06 0.73 0.93 0.27 0.27 0.22 0.02 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
BLEU transl+intp singleseq 0.00 0.00 0.00 0.00 0.00 0.01 0.02 0.05 0.93 0.73 0.87 0.42 0.39 0.00 0.08 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
chrF2 transl sent 0.00 0.00 0.00 0.00 0.00 0.01 0.02 0.03 0.27 0.87 0.73 0.41 0.33 0.03 0.09 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
chrF2 transl+intp singleseq 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.27 0.42 0.41 0.72 0.73 0.30 0.34 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00
chrF2 transl singleseq 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.22 0.39 0.33 0.73 0.72 0.32 0.37 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00
BLEU transl singleseq 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.03 0.30 0.32 0.71 0.86 0.20 0.00 0.01 0.00 0.00 0.00 0.00 0.00
COMET intp mwer 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.05 0.08 0.09 0.34 0.37 0.86 0.71 0.24 0.06 0.03 0.00 0.00 0.00 0.00 0.00
BertScore intp singleseq 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.02 0.20 0.24 0.69 0.51 0.11 0.07 0.01 0.01 0.00 0.00
BLEU transl+intp sent+mwer 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.06 0.51 0.68 0.45 0.00 0.12 0.12 0.00 0.00
chrF2 intp singleseq 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.03 0.11 0.45 0.66 0.72 0.45 0.40 0.00 0.00
BLEU transl sent 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.07 0.00 0.72 0.65 0.85 0.80 0.01 0.00
chrF2 intp mwer 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.12 0.45 0.85 0.65 0.93 0.00 0.00
BLEU intp singleseq 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.12 0.40 0.80 0.93 0.65 0.01 0.00
BertScore intp mwer 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.01 0.60 0.43
BLEU intp mwer 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.43 0.58
Figure 3: Results of significance test (p-values rounded to two decimal digits) for difference of correlations of the
metrics variants to CR. The metrics variants are ordered by Pearson correlation to CR on both subsets from most
correlating (top left) to least (bottom right). The bold numbers on the diagonal are the correlation coefficients to CR.
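The ordering used in the figure can be reproduced along the following lines. This is a minimal sketch in plain Python: the function names and the toy scores are hypothetical and are not data from the paper, and the significance test for differences between correlations is not covered here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def rank_variants(
    metric_scores: Dict[str, Sequence[float]], human: Sequence[float]
) -> List[Tuple[str, float]]:
    """Order metric variants from most to least correlating with human scores."""
    correlations = [(name, pearson(s, human)) for name, s in metric_scores.items()]
    return sorted(correlations, key=lambda pair: pair[1], reverse=True)


# Toy, made-up document-level scores (not results from the paper).
human_cr = [1.0, 2.0, 3.0, 4.0]
variants = {
    "metric_b": [0.4, 0.1, 0.3, 0.2],
    "metric_a": [0.1, 0.2, 0.3, 0.4],
}
for name, r in rank_variants(variants, human_cr):
    print(f"{name}: {r:.2f}")
```

Each diagonal entry of the figure corresponds to one such coefficient, computed between a metric variant's scores and CR.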
Common subset
COMET transl sent 0.76 0.46 0.30 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
COMET transl singleseq 0.46 0.75 0.42 0.03 0.03 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
COMET transl+intp singleseq 0.30 0.42 0.74 0.05 0.05 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
COMET intp mwer 0.00 0.03 0.05 0.69 0.77 0.76 0.68 0.05 0.06 0.05 0.06 0.05 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
BertScore transl+intp sent+mwer 0.00 0.03 0.05 0.77 0.69 0.83 0.90 0.26 0.05 0.04 0.09 0.08 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
BertScore transl sent 0.00 0.03 0.05 0.76 0.83 0.68 0.91 0.28 0.06 0.05 0.09 0.09 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
COMET intp singleseq 0.00 0.00 0.00 0.68 0.90 0.91 0.68 0.34 0.21 0.20 0.17 0.16 0.02 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
BertScore intp mwer 0.00 0.00 0.00 0.05 0.26 0.28 0.34 0.65 0.55 0.53 0.48 0.45 0.14 0.08 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00
chrF2 transl+intp sent+mwer 0.00 0.00 0.00 0.06 0.05 0.06 0.21 0.55 0.63 0.56 0.78 0.72 0.26 0.14 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00
chrF2 transl sent 0.00 0.00 0.00 0.05 0.04 0.05 0.20 0.53 0.56 0.63 0.87 0.81 0.27 0.15 0.01 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.00
chrF2 transl singleseq 0.00 0.00 0.00 0.06 0.09 0.09 0.17 0.48 0.78 0.87 0.63 0.08 0.35 0.21 0.00 0.02 0.01 0.00 0.01 0.00 0.00 0.00 0.00
chrF2 transl+intp singleseq 0.00 0.00 0.00 0.05 0.08 0.09 0.16 0.45 0.72 0.81 0.08 0.63 0.37 0.22 0.00 0.02 0.01 0.00 0.01 0.00 0.00 0.00 0.00
BertScore transl+intp singleseq 0.00 0.00 0.00 0.01 0.00 0.00 0.02 0.14 0.26 0.27 0.35 0.37 0.59 0.02 0.43 0.12 0.13 0.13 0.02 0.00 0.02 0.00 0.00
BertScore transl singleseq 0.00 0.00 0.00 0.01 0.00 0.00 0.01 0.08 0.14 0.15 0.21 0.22 0.02 0.58 0.62 0.24 0.23 0.23 0.05 0.01 0.04 0.00 0.00
chrF2 intp mwer 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.00 0.43 0.62 0.55 0.74 0.41 0.07 0.24 0.18 0.03 0.06 0.02
BLEU transl+intp singleseq 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.02 0.02 0.12 0.24 0.74 0.53 0.65 0.67 0.42 0.00 0.10 0.00 0.00
BLEU intp singleseq 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01 0.01 0.13 0.23 0.41 0.65 0.51 0.92 0.62 0.44 0.05 0.21 0.09
chrF2 intp singleseq 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.23 0.07 0.67 0.92 0.51 0.71 0.57 0.33 0.34 0.16
BertScore intp singleseq 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01 0.02 0.05 0.24 0.42 0.62 0.71 0.49 0.78 0.66 0.52 0.25
BLEU transl singleseq 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.18 0.00 0.44 0.57 0.78 0.47 0.88 0.18 0.00
BLEU intp mwer 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.04 0.03 0.10 0.05 0.33 0.66 0.88 0.47 0.74 0.35
BLEU transl+intp sent+mwer 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.06 0.00 0.21 0.34 0.52 0.18 0.74 0.45 0.00
BLEU transl sent 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.09 0.16 0.25 0.00 0.35 0.00 0.42
Figure 4: Results of significance test (p-values rounded to two decimal digits) for difference of correlations of the
metrics variants to CR. The metrics variants are ordered by Pearson correlation to CR on the Common subset from
most correlating (top left) to least (bottom right). The bold numbers on the diagonal are the correlation coefficients
to CR.
Non-Native subset
COMET transl sent 0.75 0.06 0.06 0.07 0.06 0.02 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
BertScore transl+intp sent+mwer 0.06 0.73 0.96 0.74 0.71 0.55 0.30 0.04 0.01 0.01 0.03 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
BertScore transl sent 0.06 0.96 0.73 0.74 0.72 0.55 0.30 0.04 0.01 0.01 0.03 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
BertScore transl singleseq 0.07 0.74 0.74 0.73 0.87 0.72 0.39 0.07 0.02 0.02 0.03 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
BertScore transl+intp singleseq 0.06 0.71 0.72 0.87 0.73 0.74 0.40 0.08 0.03 0.02 0.03 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
COMET transl singleseq 0.02 0.55 0.55 0.72 0.74 0.73 0.06 0.24 0.15 0.12 0.00 0.08 0.08 0.05 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
COMET transl+intp singleseq 0.01 0.30 0.30 0.39 0.40 0.06 0.72 0.45 0.30 0.24 0.00 0.18 0.17 0.11 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
BLEU transl singleseq 0.00 0.04 0.04 0.07 0.08 0.24 0.45 0.71 0.71 0.31 0.68 0.40 0.02 0.04 0.22 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
chrF2 transl+intp sent+mwer 0.00 0.01 0.01 0.02 0.03 0.15 0.30 0.71 0.70 0.87 0.90 0.09 0.56 0.38 0.31 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
BLEU transl+intp singleseq 0.00 0.01 0.01 0.02 0.02 0.12 0.24 0.31 0.87 0.70 1.00 0.80 0.54 0.17 0.39 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
COMET intp singleseq 0.00 0.03 0.03 0.03 0.03 0.00 0.00 0.68 0.90 1.00 0.70 0.84 0.77 0.60 0.36 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
chrF2 transl sent 0.00 0.00 0.00 0.01 0.01 0.08 0.18 0.40 0.09 0.80 0.84 0.70 0.89 0.67 0.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
BLEU transl sent 0.00 0.01 0.01 0.01 0.01 0.08 0.17 0.02 0.56 0.54 0.77 0.89 0.70 0.49 0.57 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
BLEU transl+intp sent+mwer 0.00 0.00 0.00 0.01 0.01 0.05 0.11 0.04 0.38 0.17 0.60 0.67 0.49 0.69 0.73 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00
COMET intp mwer 0.00 0.00 0.00 0.00 0.00 0.01 0.03 0.22 0.31 0.39 0.36 0.50 0.57 0.73 0.69 0.05 0.05 0.00 0.00 0.00 0.00 0.00 0.00
chrF2 transl singleseq 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.05 0.64 0.74 0.54 0.06 0.00 0.00 0.00 0.00
chrF2 transl+intp singleseq 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.05 0.74 0.64 0.58 0.06 0.00 0.00 0.00 0.00
BertScore intp singleseq 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.54 0.58 0.63 0.36 0.00 0.00 0.00 0.00
chrF2 intp mwer 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.06 0.06 0.36 0.61 0.00 0.01 0.00 0.00
chrF2 intp singleseq 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.55 0.90 0.44 0.12
BertScore intp mwer 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.90 0.55 0.58 0.14
BLEU intp singleseq 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.44 0.58 0.53 0.03
BLEU intp mwer 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.12 0.14 0.03 0.50
Figure 5: Results of significance test (p-values rounded to two decimal digits) for difference of correlations of the
metrics variants to CR. The metrics variants are ordered by Pearson correlation to CR on the Non-Native subset from
most correlating (top left) to least (bottom right). The bold numbers on the diagonal are the correlation coefficients
to CR.
Improving Neural Machine Translation Formality Control with Domain
Adaptation and Reranking-based Transductive Learning
Zhanglin Wu, Zongyao Li, Daimeng Wei, Hengchao Shang, Jiaxin Guo, Xiaoyu Chen,
Zhiqiang Rao, Zhengzhe Yu, Jinlong Yang, Shaojun Li, Yuhao Xie, Bin Wei,
Jiawei Zheng, Ming Zhu, Lizhi Lei, Hao Yang, Yanfei Jiang
Huawei Translation Service Center, Beijing, China
{wuzhanglin2,lizongyao,weidaimeng,shanghengchao,guojiaxin1,chenxiaoyu35,
raozhiqiang,yuzhengzhe,yangjinlong7,lishaojun18,xieyuhao2,weibin29,
zhengjiawei15,zhuming47,leilizhi,yanghao30,jiangyanfei}@huawei.com
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 180–186
July 13-14, 2023 ©2023 Association for Computational Linguistics

Abstract
This paper presents Huawei Translation Service Center (HW-TSC)’s submission on the IWSLT 2023 formality control task, which provides two training scenarios: supervised and zero-shot, each containing two language pairs, and sets constrained and unconstrained conditions. We train the formality control models for these four language pairs under these two conditions respectively, and submit the corresponding translation results. Our efforts are divided into two fronts: enhancing general translation quality and improving formality control capability. According to the different requirements of the formality control task, we use a multi-stage pre-training method to train a bilingual or multilingual neural machine translation (NMT) model as the basic model, which can improve the general translation quality of the base model to a relatively high level. Then, under the premise of affecting the general translation quality of the basic model as little as possible, we adopt domain adaptation and reranking-based transductive learning methods to improve the formality control capability of the model.

1 Introduction
Machine translation (MT) (Lopez, 2008; Vaswani et al., 2017) models typically return one single translation for each input sentence. This means that when the input sentence is ambiguous, the MT model must choose a translation from among various valid options, without regard to the intended use case or target audience. Therefore, there is a need to control certain attributes (Schioppa et al., 2021) of the text generated in a target language such as politeness (Sennrich et al., 2016a; Feely et al., 2019) or formality (Niu et al., 2017, 2018; Viswanathan et al., 2020).
The lack of gold translation with alternate formality for supervised training and evaluation has led researchers to rely on synthetic supervision training and manual evaluation in past work (Niu and Carpuat, 2020). Fortunately, the IWSLT formality control task now provides a new benchmark¹ (Nădejde et al., 2022; Agarwal et al., 2023) by contributing high-quality training datasets and test datasets for multiple language pairs.

¹ https://github.com/amazon-science/contrastive-controlled-mt

This paper presents HW-TSC’s submission on the IWSLT 2023 formality control task. How formality distinctions are expressed grammatically and lexically can vary widely by language. Thus, we participate in the formality control task of all these four language pairs to investigate a general formality control method that can be applied to different language pairs. In addition, we also investigate the difference in formality control between constrained and unconstrained conditions by introducing the mBART model (Liu et al., 2020) under the unconstrained condition.

2 Data
2.1 Pre-training Data
We use the CCMatrix² and OpenSubtitles³ bilingual data given by the organizers to train an NMT model from scratch or fine-tune the mBART model as the general basic model. The bilingual data size of each language pair is shown in Table 1:

Language pair   CCMatrix   OpenSubtitles
EN-KO           19.4M      1.4M
EN-VI           50.1M      3.5M
EN-PT           173.7M     33.2M
EN-RU           139.9M     25.9M

Table 1: The bilingual data size of each language pair.

² https://opus.nlpl.eu/CCMatrix.php
³ https://opus.nlpl.eu/OpenSubtitles-v2018.php

In order to achieve a better training effect, we also use some data pre-processing methods to clean bilingual data, such as: remove duplicate data, use
Moses⁴ to normalize punctuation, filter extremely long sentences, use langid⁵ (Lui and Baldwin, 2011, 2012) to filter sentences that do not meet the language requirements, use fast-align⁶ (Dyer et al., 2013) to filter unaligned sentence pairs.

⁴ https://github.com/moses-smt/mosesdecoder
⁵ https://github.com/saffsd/langid.py
⁶ https://github.com/clab/fast_align

2.2 Formality-annotated Data
The formality-annotated data is provided by the organizers, and the data size of each language pair is shown in Table 2:

Setting      Language pair   Train   Test
Supervised   EN-KO           400     597
Supervised   EN-VI           400     598
Zero-shot    EN-PT           0       599
Zero-shot    EN-RU           0       600

Table 2: The formality-annotated data size of each language pair.

For supervised language pairs, we split the formality-annotated train data into a train set and a dev set with a ratio of 3:1, and use the formality-annotated train set and a small amount of bilingual data for formality control training, while for zero-shot language pairs, we use the formality-annotated train set from the other two supervised language pairs for formality control training.

3 Model
3.1 Constrained Model
Transformer (Vaswani et al., 2017) is the state-of-the-art model in recent machine translation evaluations. There are two lines of research to improve this kind of model: the first uses wide networks (e.g., Transformer-Big (Vaswani et al., 2017)), and the other uses deeper language representations (e.g., Deep Transformer (Wang et al., 2019; Wu et al., 2022; Wei et al., 2022)). Under the constrained conditions, we combine these two improvements, adopt the Deep Transformer-Big model structure, and train a one-to-many multilingual NMT model (Johnson et al., 2017; Zhang et al., 2020) from scratch using bilingual data of four language pairs provided by the organizers. The main structure of Deep Transformer-Big is that it features pre-layer-normalization and 25-layer encoder, 6-layer decoder, 16-head self-attention, 1024-dimensional embedding and 4096-dimensional FFN embedding.

3.2 Unconstrained Model
Recently, the multilingual denoising pre-training method (Liu et al., 2020; Tang et al., 2021) has produced significant performance gains across a wide variety of machine translation tasks. As the earliest sequence-to-sequence model using the multilingual denoising pre-training method, mBART (Liu et al., 2020) has also achieved good results in various machine translation-related tasks. Under unconstrained conditions, we use the mBART50 1n model⁷ as the initial model of the unconstrained formality control task. The mBART50 1n model adopts the Transformer structure, which features 12-layer encoder, 12-layer decoder, 16-head self-attention, 1024-dimensional embedding and 4096-dimensional FFN embedding, and an additional layer-normalization layer (Xu et al., 2019) on top of both the encoder and decoder.

⁷ https://dl.fbaipublicfiles.com/fairseq/models/mbart50/mbart50.ft.1n.tar.gz

4 Method
In our implementation, we first use a multi-stage pre-training method to train a general NMT model with relatively high translation quality. Then, we use a domain adaptation method to fine-tune the NMT model so that the model can have basic formality control capability. Finally, we use the reranking-based transductive learning (RTL) method to further improve the formality control capability of the model.

4.1 Multi-stage Pre-training
There are four different types of formality control tasks, which are constrained supervised task, constrained zero-shot task, unconstrained supervised task, and unconstrained zero-shot task. For these four different tasks, we formulate different pre-training strategies and collectively refer to these strategies as the multi-stage pre-training method.
Under the constrained condition, we adopt the Deep Transformer-Big model structure and use bilingual data of all four language pairs to train a one-to-many multilingual NMT model from scratch, which is used as the basic model for the constrained zero-shot task. For the constrained supervised task, we use the bilingual data of this task to further
181
pre-train the multilingual NMT model to obtain a and the formality phrases from formality-annotated
bilingual NMT model as the basic model. training data for reranking. The implementation de-
While under the unconstrained condition, we fur- tails are shown in Algorithm 1. For zero-shot task,
ther pre-train the mBART50 1n model using bilin- due to the lack of formality-annotated training data,
gual data from all these four language pairs as the we just use a reference-free formality classifier for
basic model for unconstrained zero-shot task. For reranking. Among them, the formality classifier
unconstrained supervised task, we use the bilingual under the constrained condition comes from self-
data of this task to further pre-train the pre-trained training (Axelrod et al., 2011), while the formality
model, and use the final pre-trained bilingual model classifier under the unconstrained condition comes
as the basic model. from the organizer8 (Briakou et al., 2021).
182
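The classifier-based reranking used for the zero-shot task can be illustrated with a minimal sketch. It is not the authors' Algorithm 1: the function names, the tie-breaking by model score, and the toy classifier are all assumptions made for illustration, and a real system would plug in a trained formality classifier.

```python
from typing import Callable, List, Tuple

def rerank_by_formality(
    nbest: List[Tuple[str, float]],
    formality_prob: Callable[[str], float],
    want_formal: bool,
) -> str:
    """Return the hypothesis that best matches the requested formality.

    nbest holds (hypothesis, model_score) pairs; formality_prob is a
    hypothetical classifier returning P(formal | hypothesis).
    Ties on the classifier score are broken by the model score (assumption).
    """
    def key(item: Tuple[str, float]) -> Tuple[float, float]:
        hyp, model_score = item
        p_formal = formality_prob(hyp)
        target_prob = p_formal if want_formal else 1.0 - p_formal
        return (target_prob, model_score)

    return max(nbest, key=key)[0]

# Toy stand-in for a real reference-free formality classifier (assumption).
def toy_formality_prob(text: str) -> float:
    return 0.9 if "would you" in text.lower() else 0.1

nbest = [("Can you help me?", -0.2), ("Would you help me, please?", -0.5)]
print(rerank_by_formality(nbest, toy_formality_prob, want_formal=True))
```

In a transductive setup, the selected hypotheses would then serve as additional training targets; that step is omitted here.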
EN-VI                         To Formal                           To Informal                         Flores
                              M-Acc    C-F      BLEU     COMET    M-Acc    C-F      BLEU     COMET    BLEU     COMET
AWS-baseline                  99.40%   99.16%   43.2     0.6189   98.10%   98.49%   41.5     0.6021   -        -
Multilingual pre-training     10.86%   1.67%    25.6     0.2023   89.14%   98.33%   30.0     0.2873   42.3     0.6653
+ Bilingual pre-training      8.80%    3.01%    24.8     0.1782   91.20%   96.99%   28.9     0.2630   42.4     0.6706
+ Domain adaptation           98.17%   97.83%   49.1     0.7248   99.37%   99.83%   48.0     0.6952   41.3     0.6576
+ RTL                         99.59%   100.00%  49.5     0.7296   99.38%   100.00%  48.1     0.7034   41.7     0.6614
+ Iterative RTL               100.00%  99.83%   51.3     0.7522   100.00%  100.00%  49.8     0.7209   41.8     0.6730
UMD-baseline                  96.00%   99.67%   26.7     0.3629   96.00%   98.16%   25.3     0.3452   -        -
mBART50 1n                    3.82%    1.51%    26.7     0.3516   96.18%   98.49%   31.0     0.4426   34.7     0.6040
+ Multilingual pre-training   9.44%    1.84%    25.4     0.2089   90.56%   98.16%   29.9     0.2975   42.2     0.6673
+ Bilingual pre-training      12.20%   2.51%    25.2     0.1579   87.80%   97.49%   29.4     0.2445   42.4     0.6698
+ Domain adaptation           99.02%   99.50%   47.8     0.7181   99.36%   100.00%  47.4     0.6930   43.2     0.6916
+ RTL                         99.22%   100.00%  47.7     0.7190   99.16%   100.00%  47.8     0.7053   43.4     0.7033
+ Iterative RTL               100.00%  100.00%  48.2     0.7214   100.00%  100.00%  48.3     0.7102   43.4     0.6983
Table 3: The overall translation quality and formality control accuracy of EN-VI models.
Table 4: The overall translation quality and formality control accuracy of EN-KO models.
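As a rough illustration of how a matched-accuracy score such as the M-Acc columns above might be computed, the sketch below labels each hypothesis as formal, informal, or neutral by phrase matching and reports accuracy over the non-neutral ones. This is a simplified assumption about the metric, not the official scorer; the function names and the toy German examples are hypothetical.

```python
from typing import List, Tuple

def classify(hyp: str, formal_phrases: List[str], informal_phrases: List[str]) -> str:
    """Label a hypothesis by which annotated phrases it contains (simplified)."""
    has_formal = any(p in hyp for p in formal_phrases)
    has_informal = any(p in hyp for p in informal_phrases)
    if has_formal and not has_informal:
        return "formal"
    if has_informal and not has_formal:
        return "informal"
    return "neutral"

def matched_accuracy(
    examples: List[Tuple[str, List[str], List[str]]], target: str
) -> float:
    """Share of non-neutral hypotheses whose label equals the target formality."""
    labels = [classify(hyp, formal, informal) for hyp, formal, informal in examples]
    matched = [label for label in labels if label != "neutral"]
    if not matched:
        return 0.0
    return sum(label == target for label in matched) / len(matched)

# Toy data for illustration only (not taken from the shared-task test sets).
examples = [
    ("Sie können hier warten.", ["Sie können"], ["du kannst"]),
    ("du kannst hier warten.", ["Sie können"], ["du kannst"]),
    ("Bitte hier warten.", ["Sie können"], ["du kannst"]),
]
print(matched_accuracy(examples, "formal"))
```

The third hypothesis matches neither phrase set, so it is treated as neutral and excluded from the denominator.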
process as iterative RTL method.
5 Experiments
5.1 Training Details
We use the PyTorch-based Fairseq framework⁹ (Ott et al., 2019) to pre-train or fine-tune NMT model, and use Adam optimizer (Kingma and Ba, 2014) with parameters β1=0.9 and β2=0.98. During the multi-stage pre-training phase, each model uses 8 GPUs for training, warmup steps is 4000, batch size is 4096, learning rate is 5 × 10^-4, label smoothing rate (Szegedy et al., 2016) is 0.1, and dropout is 0.1. In the domain adaptation and RTL phases, each model only uses 1 GPU for training without warm-up, batch size is 1024, learning rate is 3 × 10^-5, label smoothing rate is 0.1, and dropout is 0.3.
5.2 Evaluation Metrics
We evaluate the translation results of formality control model from the following two dimensions:
• We use SacreBLEU v2.0.0¹⁰ (Papineni et al., 2002; Post, 2018) and COMET (eamt22-cometinho-da)¹¹ (Rei et al., 2022) to evaluate the overall translation quality of formality control model on the official formality test sets and FLORES-200 devtest sets¹² (Goyal et al., 2022).
• We also use the reference-based corpus-level automatic metric Matched-Accuracy (M-Acc) and the reference-free automatic metric (C-F) that uses a multilingual formality classifier provided by the organizer to evaluate the formality control accuracy of the model on the official formality test sets, respectively.
⁹ https://github.com/facebookresearch/fairseq
¹⁰ https://github.com/mjpost/sacrebleu
¹¹ https://github.com/Unbabel/COMET
¹² https://github.com/facebookresearch/flores/tree/main/flores200
5.3 Evaluation Results
Based on the above evaluation metrics, we evaluate the formality control models trained at different phases for each language pair under constrained and unconstrained conditions, and compare with constrained baseline (AWS-baseline) (Nădejde et al., 2022) and unconstrained baseline
(UMD-baseline) (Lin et al., 2022) provided by the organizers.
5.3.1 EN-VI & EN-KO
The formality control task for EN-VI and EN-KO language pairs is supervised, and we adopt the same training methods on these two language pairs. Table 3 and Table 4 are the evaluation results of the models trained at different phases for these two language pairs. From the experimental results, the multi-stage pre-training method can improve the translation quality of the model on the FLORES-200 devtest sets, while domain adaptation and RTL methods are effective in improving formality control capability of the model. Besides, domain adaptation and RTL methods have relatively little impact on the general translation quality of the model on the FLORES-200 devtest sets. Finally, we submit the Iterative RTL model as primary system.
5.3.2 EN-RU & EN-PT
The formality control tasks for the EN-RU and EN-PT language pairs are zero-shot, and we only use one-stage pre-training on these two tasks. Table 5 and Table 6 are the evaluation results of the models trained in different phases for these two language pairs. The experimental results show that domain adaptation and RTL methods are still effective in improving the zero-shot formality control capability of multilingual model. Finally, we still submit the Iterative RTL model as primary system.
EN-RU                         To Formal                           To Informal                         Flores
                              M-Acc    C-F      BLEU     COMET    M-Acc    C-F      BLEU     COMET    BLEU     COMET
Multilingual pre-training     99.27%   67.83%   29.7     0.4265   0.73%    32.17%   23.7     0.3869   32.2     0.7790
+ Domain adaptation           99.71%   90.67%   33.8     0.5977   85.49%   70.67%   31.2     0.5333   27.8     0.7040
+ RTL                         99.74%   100.00%  34.5     0.6155   97.14%   100.00%  33.4     0.6019   29.4     0.7261
+ Iterative RTL               100.00%  100.00%  36.5     0.6472   100.00%  100.00%  35.6     0.6442   29.0     0.7153
UMD-baseline                  96.20%   92.00%   22.0     0.3492   84.10%   85.17%   21.6     0.3475   -        -
mBART50 1n                    100.00%  91.67%   25.6     0.2916   0.00%    8.33%    19.3     0.2351   25.0     0.5950
+ Multilingual pre-training   98.15%   67.00%   28.9     0.4263   1.85%    33.00%   23.1     0.3904   32.1     0.7638
+ Domain adaptation           99.49%   98.17%   31.8     0.5336   99.73%   99.83%   30.8     0.5214   30.7     0.7386
+ RTL                         98.76%   100.00%  32.3     0.5575   99.73%   99.83%   31.6     0.5363   30.9     0.7417
+ Iterative RTL               100.00%  100.00%  33.7     0.5804   100.00%  99.83%   32.4     0.5558   31.0     0.7521
Table 5: The overall translation quality and formality control accuracy of EN-RU models.
Table 6: The overall translation quality and formality control accuracy of EN-PT models.
6 Conclusions
This paper presents HW-TSC’s submission on the IWSLT 2023 formality control task, in which we participate in both constrained and unconstrained tasks for all four language pairs. For the formality control task, we use a multi-stage pre-training method to improve the general translation quality of the basic model. We also adopt domain adaptation and RTL methods to improve the model’s formality control capability. Experimental results show that these methods we have adopted are extremely effective, but how to improve general translation quality more effectively and achieve formality control with less training resources is still worthy of further research.
References
Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu
Kumar, Pengwei Li, Xutail Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.
Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the 2011 conference on empirical methods in natural language processing, pages 355–362.
Eleftheria Briakou, Sweta Agrawal, Joel Tetreault, and Marine Carpuat. 2021. Evaluating the evaluation metrics for style transfer: A case study in multilingual formality transfer. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1321–1336.
Chenhui Chu, Raj Dabre, and Sadao Kurohashi. 2017. An empirical comparison of domain adaptation methods for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 385–391.
Zi-Yi Dou, Xinyi Wang, Junjie Hu, and Graham Neubig. 2019. Domain differential adaptation for neural machine translation. EMNLP-IJCNLP 2019, page 59.
Chris Dyer, Victor Chahuneau, and Noah A Smith. 2013. A simple, fast, and effective reparameterization of ibm model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648.
Weston Feely, Eva Hasler, and Adrià de Gispert. 2019. Controlling Japanese honorifics in English-to-Japanese neural machine translation. In Proceedings of the 6th Workshop on Asian Translation, pages 45–53, Hong Kong, China. Association for Computational Linguistics.
Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc’Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2022. The flores-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics, 10:522–538.
Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
Ann Lee, Michael Auli, and Marc’Aurelio Ranzato. 2021. Discriminative reranking for neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7250–7264.
Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O’Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, and Xian Li. 2022. Few-shot learning with multilingual generative language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9019–9052, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.
Adam Lopez. 2008. Statistical machine translation. ACM Computing Surveys (CSUR), 40(3):1–49.
Marco Lui and Timothy Baldwin. 2011. Cross-domain feature selection for language identification. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 553–561, Chiang Mai, Thailand. Asian Federation of Natural Language Processing.
Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, pages 25–30, Jeju Island, Korea. Association for Computational Linguistics.
Xing Niu and Marine Carpuat. 2020. Controlling neural machine translation formality with synthetic supervision. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8568–8575.
Xing Niu, Marianna Martindale, and Marine Carpuat. 2017. A study of style in machine translation: Controlling the formality of machine translation output. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2814–2819, Copenhagen, Denmark. Association for Computational Linguistics.
Xing Niu, Sudha Rao, and Marine Carpuat. 2018. Multi-task neural models for translating between styles within and across languages. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1008–1021, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Maria Nădejde, Anna Currey, Benjamin Hsu, Xing Niu, Marcello Federico, and Georgiana Dinu. 2022. CoCoA-MT: A dataset and benchmark for Contrastive Controlled MT with application to formality. In Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, USA. Association for Computational Linguistics.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels. Association for Computational Linguistics.
Ricardo Rei, Ana C Farinha, José G.C. de Souza, Pedro G. Ramos, André F.T. Martins, Luisa Coheur, and Alon Lavie. 2022. Searching for COMETINHO: The little metric that could. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 61–70, Ghent, Belgium. European Association for Machine Translation.
Andrea Schioppa, David Vilar, Artem Sokolov, and Katja Filippova. 2021. Controlling machine translation for multiple attributes with additive interventions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6676–6696, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 35–40.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96.
Weiwei Shi, Yihong Gong, Chris Ding, Zhiheng Ma, Xiaoyu Tao, and Nanning Zheng. 2018. Transductive semi-supervised deep learning using min-max features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 299–315.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826.
Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2021. Multilingual translation from denoising pre-training. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3450–3466.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.
Aditi Viswanathan, Varden Wang, and Antonina Kononova. 2020. Controlling formality and style of machine translation output using automl. In Information Management and Big Data: 6th International Conference, SIMBig 2019, Lima, Peru, August 21–23, 2019, Proceedings 6, pages 306–313. Springer.
Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F Wong, and Lidia S Chao. 2019. Learning deep transformer models for machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1810–1822.
Daimeng Wei, Zhiqiang Rao, Zhanglin Wu, Shaojun Li, Yuanchang Luo, Yuhao Xie, Xiaoyu Chen, Hengchao Shang, Zongyao Li, Zhengzhe Yu, et al. 2022. Hw-tsc’s submissions to the wmt 2022 general machine translation shared task. In Proceedings of the Seventh Conference on Machine Translation, Online. Association for Computational Linguistics.
Zhanglin Wu, Jinlong Yang, Zhiqiang Rao, Zhengzhe Yu, Daimeng Wei, Xiaoyu Chen, Zongyao Li, Hengchao Shang, Shaojun Li, Ming Zhu, et al. 2022. Hwtsc translation systems for the wmt22 biomedical translation task. In Proceedings of the Seventh Conference on Machine Translation, Online. Association for Computational Linguistics.
Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, and Junyang Lin. 2019. Understanding and improving layer normalization. Advances in Neural Information Processing Systems, 32.
Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich. 2020. Improving massively multilingual neural machine translation and zero-shot translation. In 2020 Annual Conference of the Association for Computational Linguistics, pages 1628–1639. Association for Computational Linguistics (ACL).
HW-TSC at IWSLT2023: Break the Quality Ceiling of Offline Track via
Pre-Training and Domain Adaptation
Zongyao Li, Zhanglin Wu, Zhiqiang Rao, Xie YuHao, Guo JiaXin,
Daimeng Wei, Hengchao Shang, Wang Minghan, Xiaoyu Chen
Zhengzhe YU, Li ShaoJun, Lei LiZhi, Hao Yang
Huawei Translation Service Center, Beijing, China
{lizongyao,wuzhanglin2,raozhiqiang,xieyuhao2,guojiaxin1,
weidaimeng,shanghengchao,wangminghan,chenxiaoyu35,
yuzhengzhe,lishaojun18,leilizhi,yanghao30}@huawei.com
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 187–193
July 13-14, 2023 ©2023 Association for Computational Linguistics
Abstract
This paper describes HW-TSC’s submissions to the IWSLT 2023 Offline Speech Translation task, including speech translation of talks from English to German, English to Chinese and English to Japanese. We participated in all three tracks (Constrained training, Constrained with Large Language Models training, Unconstrained training), using cascaded architectures. We use data enhancement, pre-training models and other means to improve the quality of ASR, and use a variety of techniques including R-Drop, deep model, domain data selection, etc. to improve the quality of NMT. Compared with last year’s best results, we have improved by 2.1 BLEU in the MuST-C English-German test set.
1 Introduction
The goal of the Offline Speech Translation Task is to examine automatic methods for translating audio speech in one language into text in the target language. In recent years, end-to-end and cascade systems have been the two fundamental pipelines for speech translation tasks. A traditional cascade system comprises consecutive parts: automatic speech recognition (ASR) is responsible for generating transcripts from audio, and a machine translation (MT) model translates the ASR outputs from the source language into the target language. ASR models like Conformer (Gulati et al., 2020) and S2T-Transformer (Synnaeve et al., 2019) are commonly used. MT models like Transformer (Vaswani et al., 2017) can be considered a standard configuration. End-to-end systems use a single model to directly translate speech into target text in another language.
The cascade system will cause some "missing information" due to the two encoding and decoding processes of ASR and MT. At the same time, the disadvantage of the end-to-end system is the lack of sufficient training data. However, with a fully trained cascade system, the accuracy of ASR and MT will reach a higher level. So from the results, the BLEU of the cascaded system will be higher than that of the end-to-end system. Currently in the industry, the mainstream speech translation system is still based on the cascade system. We use the cascade system for this task, mainly to further improve the performance of speech translation.
In this work, we carefully filter and preprocess the data, and adopt various enhancement techniques, such as pre-training model, data enhancement, domain adaptation, etc., to optimize the performance of ASR. We build machine translation systems with techniques like back translation (Edunov et al., 2018), domain adaptation and R-drop (Wu et al., 2021), which have been proved to be effective practices.
The main contributions of this paper can be summarized as follows:
1) According to the characteristics of the three different tracks (constrained, constrained with large language models (LLM), and unconstrained), we use different strategies to optimize the results of ASR. After careful fine-tuning, the ASR systems of all three tracks achieve good WER.
2) We explored the multilingual machine translation model, tried a variety of model enhancement strategies, and finally achieved good results on the MUST-C test set.
Section 2 focuses on our data processing strategies while section 3 describes the training techniques of ASR, including model architecture and training strategy, etc. Section 4 describes the training techniques of MT, and section 5 presents our experiment results.
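The cascade architecture described in the introduction (ASR followed by MT) can be pictured as plain function composition. The sketch below is only illustrative: the type aliases, `make_cascade`, and the toy ASR and MT stand-ins are hypothetical and unrelated to the actual models used in the submission.

```python
from typing import Callable, List

Audio = List[float]              # hypothetical stand-in for a waveform
ASR = Callable[[Audio], str]     # audio -> source-language transcript
MT = Callable[[str], str]        # source text -> target text

def make_cascade(asr: ASR, mt: MT) -> Callable[[Audio], str]:
    """Compose an ASR model and an MT model into one speech translator."""
    def translate(audio: Audio) -> str:
        transcript = asr(audio)  # first decoding pass: speech -> text
        return mt(transcript)    # second decoding pass: text -> translation
    return translate

# Toy components, for illustration only.
def toy_asr(audio: Audio) -> str:
    return "hello world"

def toy_mt(text: str) -> str:
    lexicon = {"hello": "hallo", "world": "welt"}
    return " ".join(lexicon.get(tok, tok) for tok in text.split())

speech_translate = make_cascade(toy_asr, toy_mt)
print(speech_translate([0.0, 0.1, -0.1]))
```

Because the MT stage only ever sees the transcript, any recognition error propagates into the translation, which is why robustness to ASR noise matters in a cascade.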
Dataset      Duration (h)
LibriSpeech  960
MuST-C       590
CoVoST       1802
TEDLIUM3     453
Europarl     161
VoxPopuli    1270
Table 1: Data statistics of our ASR corpora.
2 Datasets and Preprocessing
2.1 ASR Data
Six datasets are used in the training of our ASR models: MuST-C V2 (Cattoni et al., 2021), LibriSpeech (Panayotov et al., 2015), TED-LIUM 3 (Hernandez et al., 2018), CoVoST 2 (Wang et al., 2020), VoxPopuli (Wang et al., 2021), and Europarl-ST (Iranzo-Sánchez et al., 2020), as described in Table 1. We use exactly the same data processing strategy to train our ASR models, following the configuration of Wang et al. (2022). We extend one data augmentation method (Zhang et al., 2022): adjacent utterances are concatenated to generate longer training samples. Tsiamas et al. (2022) propose Supervised Hybrid Audio Segmentation (SHAS), a method that can effectively learn the optimal segmentation from any manually segmented speech corpus. For the test set, we use SHAS to split long audios into shorter segments.
2.2 MT Data
We used all provided data, including text-parallel, speech-to-text-parallel, and text-monolingual data, and use exactly the same data processing strategy to process our MT data, following Wei et al. (2021). Data sizes before and after cleaning are listed in Table 2.
3 ASR Model
3.1 Constrained training
In this track, we trained the constrained ASR model using the Conformer (Gulati et al., 2020) and U2 (Zhang et al., 2020b) model architectures. The first is a standard auto-regressive ASR model built upon the Transformer architecture. The second is a unified model that can perform both streaming and non-streaming ASR, supported by the dynamic chunking training strategy. The model configurations are as follows:
1) Conformer: The encoder is composed of 2 layers of VGG and 16 layers of Conformer, and the decoder is composed of 6 layers of Transformer. The embedding size is 1024, the hidden size of FFN is 4096, and the number of attention heads is 16.
2) U2: Two convolution subsampling layers with kernel size 3*3 and stride 2 are used in the front of the encoder. We use 12 Conformer layers for the encoder and 6 Transformer layers for the decoder. The embedding size is 1024, the hidden size of FFN is 4096, and the number of attention heads is 16.
During the training of ASR models, we set the batch size to the maximum of 20,000 frames per card. Inverse sqrt is used for lr scheduling with warm-up steps set to 10,000 and peak lr set as 5e-4. Adam is used as the optimizer. All ASR models are trained on 8 A100 GPUs for 100 epochs. Parameters for the last 5 epochs are averaged. Audio features are normalized with utterance-level CMVN for Conformer, and with global CMVN for U2. All audio inputs are augmented with spectral augmentation (Park et al., 2019), and Connectionist Temporal Classification (CTC) is added to make models converge better.
3.2 Constrained with Large Language Models training
Large language models (LLMs) are currently the mainstream method in the field of artificial intelligence. In ASR, the pre-training model has been proved to be an effective means to improve the quality, especially with models such as wav2vec (Schneider et al., 2019) and HuBERT (Hsu et al., 2021) proposed in recent years. Li et al. (2020) combine the encoder of wav2vec2 (Baevski et al., 2020) and the decoder of mBART50 (Tang et al., 2020) to fine-tune an end2end model. We also adopt a similar strategy, but combine the encoder of wav2vec2 and the decoder of mBART50 to fine-tune an ASR model (w2v2-mBART). Due to the modality mismatch between pre-training and fine-tuning, in order to better train cross-attention, we freeze the self-attention of the encoder and decoder. We first use all the constrained data for fine-tuning, and only use the MUST-C data after 30 epochs of training.
3.3 Unconstrained training
Whisper (Radford et al., 2022) is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. It shows that
the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language. The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Even though it enables transcription in multiple languages, we only use its speech recognition feature, transcribing audio files to English text. In this task, we use it as a pre-trained model, and use the MUST-C dataset for fine-tuning to improve its performance in specific domains. We trained for 2 epochs with a small learning rate of 10e-6.
Language pair  Raw Data  Filter Data  LaBSE Filter Data  Domain Selection
En2De          19.8M     14.5M        5.8M               0.4M
En2Zh          8.1M      5.5M         2.2M               0.4M
En2Ja          16.4M     14.1M        5.6M               0.4M
Table 2: Bilingual data sizes before and after filtering used in tasks.
Second, we use LaBSE (Feng et al., 2020) to filter the bilingual data, and use the filtered data for incremental training. Table 2 lists the amount of filtered data for each language pair. Then, for the three languages, the backward models are trained separately, and the monolingual data is used for backward translation (BT). Finally, we combine backward translation and forward translation (FT) for iterative joint training (Zhang et al., 2018). After the above several stages, a base model with better performance is obtained, which can be used for further optimization.
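Filtering bilingual data with a sentence-embedding model such as LaBSE typically amounts to scoring each pair by the cosine similarity of its source and target embeddings and keeping pairs above a threshold. The sketch below assumes the embeddings have already been computed; the function names, the toy vectors, and the 0.7 threshold are illustrative assumptions, not values from the paper.

```python
import math
from typing import List, Sequence, Tuple

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_bitext(
    pairs: List[Tuple[str, str]],
    src_vecs: List[Sequence[float]],
    tgt_vecs: List[Sequence[float]],
    threshold: float = 0.7,  # hypothetical cut-off, not from the paper
) -> List[Tuple[str, str]]:
    """Keep sentence pairs whose embeddings are similar enough."""
    kept = []
    for pair, u, v in zip(pairs, src_vecs, tgt_vecs):
        if cosine(u, v) >= threshold:
            kept.append(pair)
    return kept

# Toy 2-d vectors standing in for real sentence embeddings.
pairs = [("good morning", "guten Morgen"), ("good morning", "Impressum")]
src_vecs = [[1.0, 0.0], [1.0, 0.0]]
tgt_vecs = [[0.9, 0.1], [0.0, 1.0]]
print(filter_bitext(pairs, src_vecs, tgt_vecs))
```

In practice the threshold would be tuned per language pair, since similarity score distributions differ between languages.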
Table 4: The experimental results of ASR. We present WER performance of tst-COM, tst2018, tst2019 and tst2020.
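The WER values summarized in Table 4 are computed after lowercasing the reference and hypothesis and removing punctuation. A minimal sketch of such a computation follows, using word-level edit distance; the helper names are hypothetical, and the normalization only strips ASCII punctuation, which is an assumption rather than the authors' exact script.

```python
import string
from typing import List

def normalize(text: str) -> List[str]:
    """Lowercase, strip ASCII punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

print(wer("Hello, world!", "hello world"))
print(wer("a b c d", "a x c"))
```

Corpus-level WER is usually obtained by summing edit distances and reference lengths over all segments before dividing, rather than averaging per-sentence rates.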
System En2De En2Ja En2Zh data are provided in this task, including audio,
One2Many 36.22 15.43 29.05 source and target. We use the trained ASR to tran-
+ LaBSE bitext 37.58 15.48 29.48 scribe the audio file to get source′ , and finally get
+ Domain adaptation 41.55 17.08 29.27 the MT training data like (source′ , target). The
+ Iter FTBT 43.03 17.86 29.82 source′ transcribed by ASR may have some errors,
+ Dev fine-tuning 43.66 20.88 30.48 but when used in MT, it will increase the robustness
of the MT encoder.
Table 5: The BLEU of MT using tst-COM golden transcription.

System               En2De  En2Ja  En2Zh
One2Many             31.54  14.08  26.69
+ LaBSE bitext       32.65  13.88  27.14
+ Domain adaptation  35.96  15.4   27.15
+ Iter FTBT          36.38  15.81  27.98
+ Dev fine-tuning    37.83  18.6   28.86
+ Robustness         38.71  20.34  28.93
Table 6: The BLEU of MT using tst-COM transcription by the Whisper fine-tuning model.

In this task, we use TED and MuST-C data as in-domain data. We score all the training bilingual data through Equation 1, and filter out 80%–90% of the data according to the score distribution. We use the remaining 0.4M in-domain data to continue training on the previous model.

4.5 Robustness to ASR Noise
We use two methods to improve the robustness of the system to ASR output noise.
Synthetic Noise Generation. We follow the method proposed in Guo et al. (2022) to synthesize part of the noise data to enhance the robustness of the model.
ASR Transcript Data. Because some triplet
When using the data generated above, we follow the tagged BT method (Caswell et al., 2019) and add a special token at the beginning of the source sentence.

5 Experiments and Results
We use the open-source fairseq (Ott et al., 2019) for training, word error rate (WER) to evaluate the ASR models, and report case-sensitive SacreBLEU (Post, 2018) scores for machine translation. We evaluated our system on the test sets of MuST-C tst-COMMON (tst-COM).
Table 3 shows our results on three languages for three tracks (Constrained, Constrained with LLM, Unconstrained). After a series of optimizations, although the ASR results of the three systems are somewhat different, the BLEU scores of all systems are very close. Since there is no test set for IWSLT 2022, we only compare with last year’s teams on tst-COM. Compared with last year’s best results (Zhang et al., 2022), we have improved by 2.1 BLEU on the MuST-C En2De test set; on En2Zh and En2Ja, we have achieved results close to last year’s best.
We analyze the main reasons for the similar results of the three systems: 1. The three systems use the same MT, and our MT system has the ability to correct wrong input after the robustness is enhanced. 2. The three ASR systems are fine-tuned on the same data, so their WERs are relatively close.

5.1 Automatic Speech Recognition
We compare the results of different model architectures; the overall ASR results are reported in Table 4. We evaluated our system on the test sets of tst-COM and IWSLT tst2018/tst2019/tst2020, respectively. For long audio in the test set, we use SHAS for segmentation. We calculate the WER after the reference and hypothesis are lowercased and the punctuation is removed.
In Table 4, all ASR systems achieve good performance, and the results are relatively close. Conformer and U2 are trained using constrained data. w2v2-mBART is obtained through fine-tuning using pre-trained models, which are constrained. Whisper is the result of transcribing long audio without segmentation using the native Whisper medium model. Whisper fine-tuning is obtained by fine-tuning the Whisper medium model on the MuST-C dataset. The WER of Conformer and U2 is relatively close. In submitting the results of the constrained track, we use Conformer as the final ASR system. The experimental results show that pre-trained models exhibit their advantages: w2v2-mBART can achieve better results than just training with constrained data. Whisper itself has a very good performance in the general domain, and after fine-tuning, it has even better results in the specific domain. However, it is very difficult to fine-tune Whisper and improve the performance in all domains: WER on tst2019 and tst2020 has deteriorated.

5.2 Neural Machine Translation
We evaluate the performance of the MT model in detail on the MuST-C test set. Table 5 shows the performance results of each optimization strategy using the golden transcription as the source; Table 6 uses the transcription generated by the Whisper fine-tuning model as the source. The results show that there is a gap in BLEU between the golden transcription and the ASR transcription, which is mainly due to errors (punctuation, capitalization, vocabulary, etc.) in the ASR transcription. On the En2De test set, this gap is particularly wide.
One2Many is a multilingual model trained using the R-drop strategy, and has achieved relatively good performance on the test set. LaBSE can bring a little improvement to the model, and domain adaptation can bring a huge improvement to the model, which proves the effectiveness of our strategy. Iterative joint training with FT and BT (Iter FTBT) is also an effective means to improve quality. After dev fine-tuning, the results are already very competitive. With improved robustness of the system to ASR output, our BLEU scores in En2De, En2Ja, and En2Zh are 38.71, 20.34, and 28.93, respectively.

6 Conclusion
This paper presents our offline speech translation systems in the IWSLT 2023 evaluation. We explored different strategies in the pipeline of building the cascade system. In the data preprocessing, we adopt efficient cleansing approaches to build the training set collected from different data sources. We tried various ASR training strategies and achieved good performance. For the MT system, we have used various methods such as multilingual machine translation, R-drop, domain adaptation, and enhanced robustness. Finally, compared with last year’s best results, we have improved by 2.1 BLEU in the MuST-C English-German test set.

References

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460.

Isaac Caswell, Ciprian Chelba, and David Grangier. 2019. Tagged back-translation. arXiv preprint arXiv:1906.06442.

Roldano Cattoni, Mattia Antonino Di Gangi, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2021. MuST-C: A multilingual corpus for end-to-end speech translation. Computer Speech & Language, 66:101155.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381.

Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2020. Language-agnostic BERT sentence embedding. arXiv preprint arXiv:2007.01852.

Pengzhi Gao, Zhongjun He, Hua Wu, and Haifeng Wang. 2022. Bi-SimCut: A simple strategy for boosting neural machine translation. arXiv preprint arXiv:2206.02368.

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al.
2020. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100.

Bao Guo, Mengge Liu, Wen Zhang, Hexuan Chen, Chang Mu, Xiang Li, Jianwei Cui, Bin Wang, and Yuhang Guo. 2022. The Xiaomi text-to-text simultaneous speech translation system for IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 216–224.

François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Esteve. 2018. TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation. In Speech and Computer: 20th International Conference, SPECOM 2018, Leipzig, Germany, September 18–22, 2018, Proceedings 20, pages 198–208. Springer.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460.

Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerda, Javier Jorge, Nahuel Roselló, Adria Giménez, Albert Sanchis, Jorge Civera, and Alfons Juan. 2020. Europarl-ST: A multilingual corpus for speech translation of parliamentary debates. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8229–8233. IEEE.

Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Xian Li, Changhan Wang, Yun Tang, Chau Tran, Yuqing Tang, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. 2020. Multilingual speech translation with efficient finetuning of pretrained models. arXiv preprint arXiv:2010.12829.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206–5210. IEEE.

Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.

Matt Post. 2018. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.

Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958.

Gabriel Synnaeve, Qiantong Xu, Jacob Kahn, Tatiana Likhomanenko, Edouard Grave, Vineel Pratap, Anuroop Sriram, Vitaliy Liptchinsky, and Ronan Collobert. 2019. End-to-end ASR: From supervised to semi-supervised learning with modern architectures. arXiv preprint arXiv:1911.08460.

Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2020. Multilingual translation with extensible multilingual pretraining and finetuning. arXiv preprint arXiv:2008.00401.

Ioannis Tsiamas, Gerard I Gállego, José AR Fonollosa, and Marta R Costa-jussà. 2022. SHAS: Approaching optimal segmentation for end-to-end speech translation. arXiv preprint arXiv:2202.04774.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.

Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. 2021. VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. arXiv preprint arXiv:2101.00390.

Changhan Wang, Anne Wu, and Juan Pino. 2020. CoVoST 2 and massively multilingual speech-to-text translation. arXiv preprint arXiv:2007.10310.

Minghan Wang, Jiaxin Guo, Yinglu Li, Xiaosong Qiao, Yuxia Wang, Zongyao Li, Chang Su, Yimeng Chen, Min Zhang, Shimin Tao, et al. 2022. The HW-TSC’s simultaneous speech translation system for IWSLT 2022 evaluation. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 247–254.

Mingxuan Wang, Zhengdong Lu, Jie Zhou, and Qun Liu. 2017. Deep neural machine translation with linear associative unit. arXiv preprint arXiv:1705.00861.

Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F Wong, and Lidia S Chao. 2019a. Learning deep transformer models for machine translation. arXiv preprint arXiv:1906.01787.

Wei Wang, Isaac Caswell, and Ciprian Chelba. 2019b. Dynamically composing domain-data selection with clean-data selection by "co-curricular learning" for neural machine translation. arXiv preprint arXiv:1906.01130.

Daimeng Wei, Zongyao Li, Zhanglin Wu, Zhengzhe Yu, Xiaoyu Chen, Hengchao Shang, Jiaxin Guo, Minghan Wang, Lizhi Lei, Min Zhang, Hao Yang, and Ying Qin. 2021. HW-TSC’s participation in the WMT 2021 news translation shared task. In Proceedings of the Sixth Conference on Machine Translation, pages 225–231, Online. Association for Computational Linguistics.

Lijun Wu, Juntao Li, Yue Wang, Qi Meng, Tao Qin, Wei Chen, Min Zhang, Tie-Yan Liu, et al. 2021. R-drop: Regularized dropout for neural networks. Advances in Neural Information Processing Systems, 34:10890–10905.
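The WER protocol described in Section 5.1 (lowercase the reference and hypothesis, remove punctuation, then score) can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation script behind the reported numbers: the function names are hypothetical, punctuation is taken to be the ASCII set in Python's `string.punctuation` (so apostrophes are also removed), and tokens are assumed to be whitespace-separated.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace (assumed rules)."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance computed one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)
```

With this normalization, differences that are purely casing or punctuation do not count as errors, which is why the ASR scores are not directly comparable to a case- and punctuation-sensitive metric such as the BLEU used for MT.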
The USTC’s Offline Speech Translation Systems for IWSLT 2023
Xinyuan Zhou², Jianwei Cui¹, Zhongyi Ye², Yichi Wang¹, Luzhen Xu¹, Hanyi Zhang², Weitai Zhang¹,², Lirong Dai¹
¹University of Science and Technology of China, Hefei, China
²iFlytek Research, Hefei, China
{jwcui,wangyichi,lzxu,zwt2021}@mail.ustc.edu.cn
lrdai@ustc.edu.cn
{xyzhou15,zyye7,hyzhang56}@iflytek.com
         Corpus    Duration (h)  Sample Scale
EN-ZH    MuST-C    593           2
         CovoST2   1092          2
         KD        16000         2
         TTS       27000         1

Table 4: Speech Translation Corpora.

text are utilized to enhance our speech translation dataset, referred to as TTS Corpus in Table 4.

3 Cascaded Speech Translation

3.1 Automatic Speech Recognition
We implement the ASR model in the cascaded condition via Supervised Hybrid Audio Segmentation (SHAS) and Whisper.
Supervised Hybrid Audio Segmentation. Supervised Hybrid Audio Segmentation (SHAS) (Tsiamas et al., 2022) is used to split long audio into short segments with quality comparable to manual segmentation. Hence, we use SHAS as a Voice Activity Detection (VAD) module in the ASR system, as well as a speech segmentation tool in the Speech Translation system. This way, the output of the ASR system can be directly fed into the text translation component.
Whisper. We incorporated the pre-trained Whisper (Radford et al., 2022) as the ASR model of the cascaded system to reduce errors in the intermediate source language text.
Whisper scales weakly supervised speech-to-text tasks to 680,000 hours of labeled audio data and expands the pre-training scope from English-only speech recognition to multilingual and multitask. In comparison with the previous unsupervised pre-training approach (Baevski et al., 2020), Whisper not only improves the quality of the audio encoder, but also trains a pre-trained decoder with high equivalency, enhancing usefulness and robustness. Results demonstrate that the pre-trained Whisper model can be well transferred to different or even zero-shot datasets without any dataset-specific fine-tuning.
We used the large version of the pre-trained Whisper model, which contains 32 layers and a total of 1550M parameters.

3.2 Neural Machine Translation
We adopted the same strategy as last year’s (Zhang et al., 2022a) and built machine translation models based on the Transformer (Vaswani et al., 2017) implemented in the Fairseq (Ott et al., 2019) toolkit. Each single model was executed on 16 NVIDIA V100 GPUs. Our experiments utilized several crucial technologies including Back Translation, Sentence-level Knowledge Distillation, Domain Adaptation, Robust MT Training, and Ensembling.
Back Translation. Back-Translation (Sennrich et al., 2016a) is an effective technique for enhancing translation accuracy. This method generates synthetic sentence pairs by translating target-side monolingual data. It has gained significant popularity in both academic research and commercial applications. We train NMT models with bilingual data, and translate Chinese sentences to English.
Knowledge Distillation. Sentence-level Knowledge Distillation (Kim and Rush, 2016), also known as Self-training, is an effective method for enhancing performance. We expand our training dataset by leveraging a trained NMT model to translate English sentences into Chinese. This approach has proven to be highly beneficial in improving model accuracy.
Domain Adaptation. Due to the critical importance of high-quality, domain-specific translation (Saunders, 2022), we fine-tune the NMT model by using a mix of in-domain data (such as MuST-C, TED-LIUM3, etc.) and out-of-domain data. Additionally, the labelled English sentences from the speech recognition training data are also utilized as augmented in-domain self-training data by translating them.
We adopt a Denoise-based approach (Wang et al., 2018) to assess and select data for domain-specific MT and use it to denoise NMT training. The technique of denoising addresses data quality issues and reduces the adverse effects of noise on MT training, particularly NMT training.
Robust MT Training. To enhance the robustness of the MT model to ASR errors in cascaded ST, the ASR output adaptive training approach (Zhang et al., 2022a) is introduced. The English transcripts of all speech translation datasets are fed into a trained ASR model to generate text on the source side, which is then paired with the transcription text on the target side. We improve the robustness of the MT model through three methods: 1) fine-tuning the MT model with synthetic data; 2) incorporating KL loss during fine-tuning to prevent over-fitting; and 3) distilling the model using clean source text and
ASR output.

models.

• E15D6-v1: 15 layers for the encoder and 6 layers for the decoder. The embedding size is 1024. FFN size is 8192 and attention head is

• Macaron: A version with macaron architecture (Lu et al., 2019) based on data of E18D6. 36 layers for the encoder and FFN size is 2048.

3.3 End-to-End Speech Translation
In the end-to-end condition, we ensemble the encoder-decoder and the Stacked Acoustic-and-Textual Encoding extension (SATE-ex) models described in Section 3.4.
Encoder-Decoder. The encoder-decoder-based end-to-end ST model processes the speech in the source language by its encoder and generates text in the target language by its decoder. The encoder and decoder are initialized using the corresponding parts of the cascade ASR and MT models. As regards model architecture, we investigate 4 variants in end-to-end ST.

• VGG-C: The encoder of VGG-C is initialized by the ASR VGG-Conformer architecture, which consists of 2 layers of VGG and 12 layers of Conformer. The ASR VGG-Conformer is trained using the data in Section 2.1. The decoder of VGG-C is 6 layers of Transformer with embedding size of 1024, attention head of 16 and FFN size of 8192.

• VGG-C-init: The encoder is VGG-Conformer, initialized by the ASR VGG-Conformer architecture. The decoder is 6 layers of Transformer, initialized by the NMT E15D6-v2 variant.

• VGG-T: The encoder of VGG-T is initialized by the ASR VGG-Transformer architecture, which consists of 2 layers of VGG and 16 layers of Transformer. The decoder of VGG-T is 6 layers of Transformer with embedding size of 1024, attention head of 16 and FFN size of 8192.

• VGG-T-init: The VGG-Transformer encoder is initialized by the ASR VGG-Transformer architecture. The decoder is 6 layers of Transformer, initialized by the NMT E15D6-v2 variant.

[Figure 1: diagram not recoverable; surviving labels: L_ASR, L_KD-Trans, L_Trans, Textual Decoder, Linear, Softmax, Cross-Attention]

3.4 Stacked Acoustic-and-Textual Encoding Extension
To further improve the performance of end-to-end ST, we propose the Stacked Acoustic-and-Textual Encoding extension (SATE-ex) based on SATE (Xu et al., 2021).
SATE. The MT encoder captures the long-distance dependency structure, while the ASR encoder focuses on local dependencies in the input sequence. Thus, the encoder-decoder model initialized with the ASR encoder and the MT decoder may have inconsistent intermediate representations.
SATE stacks two encoders, an acoustic encoder and a textual encoder. The acoustic encoder processes the acoustic input, while the textual encoder generates global attention representations for translation. Moreover, an adapter is designed after the acoustic encoder, which maps the acoustic representation to the latent space of the textual encoder while retaining acoustic information. By doing so, SATE can maintain consistency in representation across different pre-trained components. Besides, the multi-teacher knowledge distillation has been
developed to preserve pre-training knowledge during fine-tuning (Hinton et al., 2015).
SATE-ex. Figure 1 shows the SATE-ex architecture, comprising the acoustic encoder, acoustic decoder, textual encoder, and textual decoder components. These components are initialized with their corresponding components in the cascade ASR and MT models. Notably, the textual decoder in SATE-ex has a Cross-Attention module (highlighted in yellow) that processes the acoustic decoder’s output. By doing so, this approach fuses the last-layer decoding hidden states of the ASR decoder into the textual decoder, alongside the Connectionist Temporal Classification (CTC) decoding hidden states of ASR that are injected through the adapter and textual encoder. Similar to Zhang et al. (2020), this idea helps fuse and complement different decoding strategies, which can improve inner recognition accuracy, reduce the propagation of intermediate representation errors, and thereby enhance translation performance.
The loss function of SATE-ex, similar to SATE (Xu et al., 2021), computes the CTC loss $L_{CTC}$, the ASR loss $L_{ASR}$, and the translation loss $L_{Trans}$. Additionally, the losses $L_{KD-CTC}$ and $L_{KD-Trans}$ of multi-teacher knowledge distillation are used to preserve pre-trained knowledge during fine-tuning.
Adaptation Training. To further eliminate the intermediate representation mismatch in pre-trained ASR and MT, before end-to-end training, we adopt adaptation training to fine-tune the MT part of SATE-ex (including the textual encoder and textual decoder). Specifically, we first generate greedy CTC decoding outputs without removing duplicates and blanks through the acoustic encoder. Then, we pair these CTC decoding outputs with text in the target language to fine-tune the textual encoder and textual decoder. Please note that the textual decoder here does not contain the Cross-Attention module (highlighted in yellow) in Figure 1.

System   tst2018  tst2019  tst2020  tst2022  tst-COM
ASR*     95.59    97.55    95.71    96.67    98.04
Whisper  95.75    98.34    97.17    97.86    97.01

Table 5: The recognition accuracy of the ASR fusion model and pre-trained Whisper. ASR* indicates the ASR fusion model.

(5, 54, 0.1). We also provide the results of MT as reference (System #1-5).

4.1 Automatic Speech Recognition
We evaluate the recognition performance of the ASR fusion model and pre-trained Whisper. The ASR fusion model comprises three model structures, each trained with and without Text-to-Speech (TTS) data, resulting in a total of six ASR models. These models are fused to obtain the final ASR* model. The three ASR structures are presented below.

• VGG-Conformer: 2 layers of VGG and 12 layers of Conformer in encoder, 6 layers of Transformer in decoder.

• VGG-Transformer: 2 layers of VGG and 16 layers of Transformer in encoder, 6 layers of Transformer in decoder.

• GateCNN-Transformer: 6 layers of GateCNN and 12 layers of Conformer in encoder, 6 layers of Transformer in decoder.

The recognition results of the ASR fusion model and pre-trained Whisper are presented in Table 5. The results indicate that Whisper has a superior recognition performance compared to the ASR fusion model, with an average improvement of 0.51% absolute over the five test sets. However, the ASR fusion model outperforms Whisper on the tst-COM dataset (98.04 vs. 97.01), which could be due to the upsampling used for the ASR fusion model, making its data distribution closer to tst-COM.
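The average improvement quoted above can be reproduced from the accuracies in Table 5 with a few lines of arithmetic. The sketch below is only a sanity check of that figure; the variable and function names are illustrative.

```python
# Recognition accuracy (%) copied from Table 5.
TEST_SETS = ["tst2018", "tst2019", "tst2020", "tst2022", "tst-COM"]
ASR_FUSION = [95.59, 97.55, 95.71, 96.67, 98.04]
WHISPER = [95.75, 98.34, 97.17, 97.86, 97.01]


def per_set_gain(baseline, candidate):
    """Absolute accuracy difference (candidate - baseline) for each test set."""
    return [round(c - b, 2) for b, c in zip(baseline, candidate)]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


gains = per_set_gain(ASR_FUSION, WHISPER)
average_gain = round(mean(gains), 2)

for name, gain in zip(TEST_SETS, gains):
    print(f"{name}: {gain:+.2f}")
print(f"average: {average_gain:+.2f}")
```

The per-set differences show that the average hides one reversal: Whisper is ahead on the four IWSLT test sets but behind on tst-COM.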
Table 6: The BLEU scores of machine translation (MT), cascaded, end-to-end, and ensemble systems. * indicates fusion models. The parameter of SHAS is (min, max, threshold).

System #7 uses the large version of Whisper³ as ASR, while the MT* is consistent with System #6. As shown, on the Dev set, using Whisper to reduce errors in the source language text has improved the performance of ST. However, on tst-COM, the cascade model with ASR* performs better, presumably due to the closer match between the data distribution of ASR* and that of tst-COM.

³ https://github.com/openai/whisper

4.3 End-to-End Systems
In the end-to-end setting, we adopt the encoder-decoder and SATE-ex architectures. Systems #12-15 are built based on the encoder-decoder, with specific parameters given in Section 3.3. Systems #8-11 adopt the SATE-ex architecture. SATE-ex-T uses the VGG-Conformer ASR model in Section 4.2 to initialize the acoustic encoder and decoder, and the E18D6 MT model in Section 3.2 to initialize the textual encoder and decoder. SATE-ex-M uses the Macaron MT model in Section 3.2 to initialize the textual encoder and decoder.
It can be seen that the results of ensemble SATE-ex (System #16) outperform those of ensemble encoder-decoder (System #17). However, the performance of a single SATE-ex model is slightly worse than that of a single encoder-decoder model, which we attribute to the lack of fine-tuning for the single SATE-ex model. In future work, we will discuss SATE-ex in detail.

4.4 Ensemble Systems
We ensemble the two cascade models (Systems #6 and #7) and the end-to-end model (System #18) separately. The results are shown in Systems #19 and #20 in Table 6. It can be seen that the ensemble
systems achieve excellent performance.

4.5 System Description
Our system is primarily based on the full dataset allowed by IWSLT 2022, supplemented with Whisper large and SHAS for audio segmentation, which is trained on MuST-C. We have trained six ASR models and six MT models based on the IWSLT 2022 training data for model fusion. Additionally, we have trained four end-to-end ST models and four SATE-ex end-to-end ST models for end-to-end model fusion.
For the end-to-end system, we use a fusion of the above-mentioned eight end-to-end models. For the cascaded systems, we build two cascades: one with ASR based on Whisper and the other with ASR based on six-model fusion. The MT side used six-model fusion for both cascades. The submitted systems are based on these two cascades, each combined with the eight-model fusion end-to-end system.
The system structure and SHAS parameter (min, max, threshold) settings of the five submitted systems are shown below.

• Primary Cascade: System #7 with SHAS parameters set to (5, 54, 0.1).

• Contrastive1: System #20 with SHAS parameters set to (1, 18, 0.5).

• Contrastive2: System #19 with SHAS parameters set to (1, 18, 0.5).

• Contrastive3: System #6 with SHAS parameters set to (5, 54, 0.1).

• Primary e2e: System #18 with SHAS parameters set to (1, 18, 0.5).

5 Conclusion
This paper summarizes the results on the IWSLT 2023 Offline Speech Translation task. We employ various model architectures and data augmentation techniques to build speech translation systems in cascaded and end-to-end settings. The experimental results demonstrate the effectiveness of strategies such as pre-trained Whisper models, adaptation training, and the Stacked Acoustic-and-Textual Encoding extension (SATE-ex). In future work, we will further investigate SATE-ex and explore multimodal representation learning in speech translation.

References

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutail Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460.

Edresson Casanova, Christopher Shulby, Eren Gölge, Nicolas Michael Müller, Frederico Santos de Oliveira, Arnaldo Candido Jr., Anderson da Silva Soares, Sandra Maria Aluisio, and Moacir Antonelli Ponti. 2021. SC-GlowTTS: An Efficient Zero-Shot Multi-Speaker Text-To-Speech Model. In Proc. Interspeech 2021, pages 3645–3649.

Edresson Casanova, Christopher Shulby, Alexander Korolev, Arnaldo Candido Junior, Anderson da Silva Soares, Sandra Aluísio, and Moacir Antonelli Ponti. 2022. ASR data augmentation using cross-lingual multi-speaker TTS and cross-lingual voice conversion. arXiv preprint arXiv:2204.00618.

Long Duong, Antonios Anastasopoulos, David Chiang, Steven Bird, and Trevor Cohn. 2016. An attentional model for speech translation without transcription. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 949–959, San Diego, California. Association for Computational Linguistics.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Yoon Kim and Alexander M Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327.
Hang Le, Florentin Barbier, Ha Nguyen, Natalia Tomashenko, Salima Mdhaffar, Souhir Gabiche Gahbiche, Benjamin Lecouteux, Didier Schwab, and Yannick Estève. 2021. ON-TRAC’ systems for the IWSLT 2021 low-resource speech translation and multilingual speech translation shared tasks. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 169–174, Bangkok, Thailand (online). Association for Computational Linguistics.

Dan Liu, Mengge Du, Xiaoxi Li, Yuchen Hu, and Lirong Dai. 2021. The USTC-NELSLIP systems for simultaneous speech translation task at IWSLT 2021. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 30–38, Bangkok, Thailand (online). Association for Computational Linguistics.

Yuchen Liu, Hao Xiong, Jiajun Zhang, Zhongjun He, Hua Wu, Haifeng Wang, and Chengqing Zong. 2019. End-to-end speech translation with knowledge distillation. Proc. Interspeech 2019, pages 1128–1132.

Yiping Lu, Zhuohan Li, Di He, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2019. Understanding and improving transformer from a multi-particle dynamic system point of view. arXiv preprint arXiv:1906.02762.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.

Danielle Saunders. 2022. Domain adaptation and multi-domain adaptation for neural machine translation: A survey. Journal of Artificial Intelligence Research, 75:351–424.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Ioannis Tsiamas, Gerard I Gállego, José AR Fonollosa, and Marta R Costa-jussà. 2022. SHAS: Approaching optimal segmentation for end-to-end speech translation. arXiv preprint arXiv:2202.04774.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.

Wei Wang, Taro Watanabe, Macduff Hughes, Tetsuji Nakagawa, and Ciprian Chelba. 2018. Denoising neural machine translation training with trusted data and online data selection. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 133–143, Brussels, Belgium. Association for Computational Linguistics.

Che Wanxiang, Feng Yunlong, Qin Libo, and Liu Ting. 2020. N-LTP: An open-source neural Chinese language technology platform with pretrained models. arXiv preprint.

Chen Xu, Bojie Hu, Yanyang Li, Yuhao Zhang, Shen Huang, Qi Ju, Tong Xiao, and Jingbo Zhu. 2021. Stacked acoustic-and-textual encoding: Integrating the pre-trained models into speech translation encoders. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2619–2630.

Binbin Zhang, Di Wu, Zhuoyuan Yao, Xiong Wang, Fan Yu, Chao Yang, Liyong Guo, Yaguang Hu, Lei Xie, and Xin Lei. 2020. Unified streaming and non-streaming two-pass end-to-end model for speech recognition. arXiv preprint arXiv:2012.05481.

Weitai Zhang, Zhongyi Ye, Haitao Tang, Xiaoxi Li, Xinyuan Zhou, Jing Yang, Jianwei Cui, Pan Deng, Mohan Shi, Yifan Song, et al. 2022a. The USTC-NELSLIP offline speech translation systems for IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 198–207.

Ziqiang Zhang, Long Zhou, Junyi Ao, Shujie Liu, Lirong Dai, Jinyu Li, and Furu Wei. 2022b. SpeechUT: Bridging speech and text with hidden-unit for encoder-decoder based speech-text pre-training. arXiv preprint arXiv:2210.03730.
I2R’s End-to-End Speech Translation System
for IWSLT 2023 Offline Shared Task
                                        BLEU
Model                   tst2020  tst2019  MuST-C v3  MuST-C v2  CoVoST v2
in-domain
1 base (best)           25.70    22.68    30.29      30.56      27.92
2 base (avg 5)          24.81    22.25    29.98      30.29      28.11
extended-domain
3 base (best)           22.80    21.17    29.33      29.50      28.63
4 base (avg 3)          23.21    21.20    29.61      29.95      29.30
Ensemble (1 + 2 + 4)    24.99    22.64    29.99      30.35      29.13
The NiuTrans End-to-End Speech Translation System
for IWSLT23 English-to-Chinese Offline Task
Yuchen Han1∗, Xiaoqian Liu1∗, Hao Chen1, Yuhao Zhang1,
Chen Xu1, Tong Xiao1,2, Jingbo Zhu1,2
1 School of Computer Science and Engineering, Northeastern University, Shenyang, China
2 NiuTrans Research, Shenyang, China
{hanyuchen114,yoohao.zhang}@gmail.com,methanechen@126.com
{liuxiaoqian0319,xuchennlp}@outlook.com
{xiaotong,zhujingbo}@mail.neu.edu.cn
Abstract
This paper describes the NiuTrans end-to-end speech translation system submitted for the IWSLT 2023 English-to-Chinese offline task. Our speech translation models are composed of pre-trained ASR and MT models under the stacked acoustic and textual encoding framework. Several pre-trained models with diverse architectures and input representations (e.g., log Mel-filterbank and waveform) were utilized. We proposed an iterative data augmentation method to iteratively improve the performance of the MT models and generate the pseudo ST data through MT systems. We then trained ST models with different structures and data settings to enhance ensemble performance. Experimental results demonstrate that our NiuTrans system achieved a BLEU score of 29.22 on the MuST-C En-Zh tst-COMMON set, outperforming the previous year's submission by 0.12 BLEU despite using less MT training data.
[Figure 1: Overview of our system. Diagram labels: labeled ASR, labeled MT, pseudo MT, labeled ST, pseudo ST, FBank/wav, ASR, MT (ensemble), ST; arrows: train, decode, initialize, iteratively.]
1 Introduction
End-to-end speech translation (E2E ST) directly translates speech in the source language into text in the target language without generating an intermediate representation. It has gained significant attention in recent years due to several advantages over cascade methods, including low latency and the ability to avoid error propagation (Berard et al., 2016; Weiss et al., 2017). In this paper, we describe our NiuTrans E2E ST system that participated in the IWSLT23 English-to-Chinese offline track; an overview of our system is shown in Fig. 1.
To improve the performance of our system, we aim to maximize the diversity of our ensemble of E2E ST models. Our E2E ST models are based on the stacked acoustic and textual encoding (SATE) method (Xu et al., 2021a), a framework that makes the best of pre-trained automatic speech recognition (ASR) and machine translation (MT) components. Using this framework, we explore multiple architectures of pre-trained ASR and MT models with varying numbers of parameters and input representations such as FBank features or waveform data.
Pseudo data is a crucial component of E2E ST, often generated by ensemble MT systems (Gaido et al., 2020). This year, we focused more on the performance of MT models and developed an Iterative Data Augmentation method to leverage text data from all corpora, improving the MT models and enabling the generation of multiple sets of pseudo data. We then used these pseudo data sets to train diverse E2E ST models for optimal performance. Our best ST ensemble system includes models with different input representations, architectures, and training corpora, achieving a BLEU score of 29.22 on the MuST-C En-Zh tst-COMMON set.
The remainder of the paper is organized as follows: Section 2 describes the data processing, data augmentation and speech segmentation. Section 3 outlines the construction of the vocabulary and the structures of our ASR, MT and ST models. The experimental settings and final results are presented in Section 4. Finally, Section 5 concludes the submission.
∗ Authors contributed equally.
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 211–218
July 13-14, 2023 © 2023 Association for Computational Linguistics
2 Data
2.1 Data Processing
Our system was built under the “constrained” training condition. The training data can be divided into three categories: ASR, MT, and ST corpora. We used the NiuTrans toolkit (Xiao et al., 2012) to segment English and Chinese text in all corpora.

Task  Corpus             Sentence  Hour
ASR   LibriSpeech        0.28      960
ASR   Europarl-ST        0.03      77
ASR   TED LIUM           0.26      448
ASR   ST TED             0.16      235
ASR   VoxPopuli          0.17      478
ASR   MuST-C V1 En-De    0.07      138
ASR   MuST-C V2 En-Zh    0.36      572
ASR   CoVoST v2 En-Zh    0.28      416
ASR   Total              1.61      3324
MT    News Commentary    0.31      -
MT    OpenSubtitle       8.62      -
MT    MuST-C V2 En-Zh    0.36      -
MT    CoVoST V2 En-Zh    0.28      -
MT    Tatoeba            0.05      -
MT    Total              9.62      -
ST    MuST-C En-Zh       0.36      572
ST    CoVoST V2 En-Zh    0.28      416
ST    Total              0.64      988

Table 1: Details about the size of all labeled corpora. The unit of sentence is million (M).

Task  Corpus             Sentence  Hour
MT    ASR corpora+MT     1.38      -
ST    ASR corpora+MT     1.61      3323
ST    Audio+ASR+MT       1.4e-2    3

Table 2: Details about the size of all pseudo corpora.

ASR corpora. Following previous work (Xu et al., 2021b), we standardized all audio samples to a single channel and a sample rate of 16,000 Hz. For the Common Voice corpus, we selected only the cleaner parts according to the CoVoST v2 En-Zh corpus. In the MuST-C v1 En-De corpus, we removed repetitive items by comparing against the MuST-C v2 En-Zh transcriptions. We used the Librispeech corpus to train an ASR model and used it to score the Common Voice, TED LIUM, and ST TED corpora. Data with a WER greater than 0.75 were removed, and utterances with fewer than 5 or more than 3000 frames were filtered out. In addition, utterances with more than 400 characters were removed.
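The three ASR filtering rules above can be sketched as a single predicate. This is only an illustration under stated assumptions: the `Utterance` record and its field names are hypothetical, the WER is assumed to have been computed beforehand by the scoring ASR model, and the sketch applies the WER test to every sample even though the text scores only some of the corpora.

```python
from dataclasses import dataclass

# Thresholds quoted from the text; the constant names are our own.
MAX_WER = 0.75
MIN_FRAMES = 5
MAX_FRAMES = 3000
MAX_CHARS = 400


@dataclass
class Utterance:
    """Hypothetical record for one ASR training sample."""
    transcript: str
    num_frames: int
    wer: float  # WER of the scoring model's hypothesis against the transcript


def keep_utterance(utt: Utterance) -> bool:
    """Return True if the sample survives all three filters."""
    if utt.wer > MAX_WER:
        return False
    if utt.num_frames < MIN_FRAMES or utt.num_frames > MAX_FRAMES:
        return False
    if len(utt.transcript) > MAX_CHARS:
        return False
    return True


def filter_corpus(utterances):
    """Keep only the samples that pass keep_utterance."""
    return [u for u in utterances if keep_utterance(u)]
```

Values exactly at a threshold (a WER of 0.75, 5 or 3000 frames, 400 characters) are kept here, which is one reading of "greater than" and "less than" in the text.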
MT corpora. Following the methodology of Zhang et al. (2020), we cleaned the parallel texts of the OpenSubtitle corpus and used fast-align to score all sentences. We averaged the scores by the sentence length and filtered out sentences with scores below -6.0. In the News Commentary v16 corpus, we used langid (Lui and Baldwin, 2012) to filter out sentences with incorrect language identification results. In the Tatoeba corpus, we converted 90% of the sentences from traditional Chinese to simplified Chinese using OpenCC¹.
ST corpora. For the MuST-C v2 En-Zh and CoVoST v2 En-Zh corpora, we only filtered utterances by frame length, as for the ASR corpora. For the pseudo ST data, we removed sentences in which an n-gram (n from 2 to 4) is repeated more than four times. Additionally, sentences with length ratios outside the range of 0.25 to 4 and those with incorrect language identification results were filtered out.
¹ https://github.com/BYVoid/OpenCC
2.2 Data Augmentation
We used only SpecAugment (Bahar et al., 2019) and did not use speed perturbation for ASR data augmentation, because speed perturbation requires more training resources but yields limited improvement. It is also worth noting that we did not use back translation in either MT or E2E ST, as there was no target-side monolingual data available.
The MT model or ensemble MT systems represent the upper limit for E2E ST. Translating the transcripts in the ASR corpus into the target language using MT models is a simpler and more effective way to augment the ST corpus than generating source speech features from the source texts in the MT corpus using TTS models. Based on this, we propose an Iterative Data Augmentation (IDA) method, which aims to use text data from all corpora to improve the performance of MT models and generate a high-quality ST corpus iteratively, as illustrated in Algorithm 1.
We also discovered incomplete transcriptions in a few sentences from the TED LIUM, ST-TED, and VoxPopuli corpora. Therefore, we generated pseudo transcriptions using the ASR model and then translated them using the best MT ensemble systems.
training corpora for the SPM. The vocabulary size for English and Chinese is 10k and 44k, respectively.
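Pseudo ST data produced this way is subject to the pseudo-data filters stated in Section 2.1 (repeated n-grams and length ratio). A minimal sketch of those two checks follows. Several details are assumptions rather than the authors' implementation: inputs are pre-segmented token lists, the repetition check is applied to the translated side, "more than four times" is read as an n-gram occurring more than four times, and the ratio is source length over target length. The language-identification filter is omitted.

```python
from collections import Counter


def max_ngram_repeats(tokens, n_values=(2, 3, 4)):
    """Highest occurrence count of any single n-gram, over n in n_values."""
    best = 0
    for n in n_values:
        counts = Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
        if counts:
            best = max(best, max(counts.values()))
    return best


def length_ratio_ok(src_tokens, tgt_tokens, low=0.25, high=4.0):
    """True if len(src) / len(tgt) lies within [low, high]."""
    if not src_tokens or not tgt_tokens:
        return False
    ratio = len(src_tokens) / len(tgt_tokens)
    return low <= ratio <= high


def keep_pair(src_tokens, tgt_tokens, max_repeats=4):
    """Apply the repetition filter to the target side, then the ratio filter."""
    if max_ngram_repeats(tgt_tokens) > max_repeats:
        return False
    return length_ratio_ok(src_tokens, tgt_tokens)
```

Degenerate MT output such as one token repeated many times is rejected by the first check, while a pair whose source is more than four times longer than its translation is rejected by the second.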
3.2 ASR Models
Alexandre Berard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and translate: A
Bei Li, Quan Du, Tao Zhou, Yi Jing, Shuhan Zhou, Xin Zeng, Tong Xiao, Jingbo Zhu, Xuebo Liu, and Min Zhang. 2022. ODE transformer: An ordinary differential equation-inspired model for sequence generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 8335–8351. Association for Computational Linguistics.
Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the System Demonstrations, July 10, 2012, Jeju Island, Korea, pages 25–30. The Association for Computer Linguistics.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Demonstrations, pages 48–53. Association for Computational Linguistics.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, pages 186–191. Association for Computational Linguistics.
Weiqiao Shan, Zhiquan Cao, Yuchen Han, Siming Wu, Yimin Hu, Jie Wang, Yi Zhang, Hou Baoyu, Hang Cao, Chenghao Gao, Xiaowen Liu, Tong Xiao, Anxiang Ma, and Jingbo Zhu. 2022. The niutrans machine translation systems for WMT22. In Proceedings of the Seventh Conference on Machine Translation, WMT 2022, Abu Dhabi, United Arab Emirates (Hybrid), December 7-8, 2022, pages 366–374. Association for Computational Linguistics.
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages 464–468. Association for Computational Linguistics.
Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, and Marta R. Costa-jussà. 2022. SHAS: approaching optimal segmentation for end-to-end speech translation. In Interspeech 2022, 23rd Annual Conference of the International Speech Communication Association, Incheon, Korea, 18-22 September 2022, pages 106–110. ISCA.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. 2019. Learning deep transformer models for machine translation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 1810–1822. Association for Computational Linguistics.
Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-sequence models can directly translate foreign speech. In Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017, pages 2625–2629. ISCA.
Tong Xiao, Jingbo Zhu, Hao Zhang, and Qiang Li. 2012. Niutrans: An open source toolkit for phrase-based and syntax-based machine translation. In The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the System Demonstrations, July 10, 2012, Jeju Island, Korea, pages 19–24. The Association for Computer Linguistics.
Chen Xu, Bojie Hu, Yanyang Li, Yuhao Zhang, Shen Huang, Qi Ju, Tong Xiao, and Jingbo Zhu. 2021a. Stacked acoustic-and-textual encoding: Integrating the pre-trained models into speech translation encoders. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 2619–2630. Association for Computational Linguistics.
Chen Xu, Xiaoqian Liu, Xiaowen Liu, Laohu Wang, Canan Huang, Tong Xiao, and Jingbo Zhu. 2021b. The niutrans end-to-end speech translation system for IWSLT 2021 offline task. In Proceedings of the 18th International Conference on Spoken Language Translation, IWSLT 2021, Bangkok, Thailand (online), August 5-6, 2021, pages 92–99. Association for Computational Linguistics.
Chen Xu, Yuhao Zhang, Chengbo Jiao, Xiaoqian Liu, Chi Hu, Xin Zeng, Tong Xiao, Anxiang Ma, Huizhen Wang, and Jingbo Zhu. 2023. Bridging the granularity gap for acoustic modeling. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics.
Yuhao Zhang, Canan Huang, Chen Xu, Xiaoqian Liu, Bei Li, Anxiang Ma, Tong Xiao, and Jingbo Zhu. 2022a. The niutrans's submission to the IWSLT22 english-to-chinese offline speech translation task. In Proceedings of the 19th International Conference on Spoken Language Translation, IWSLT@ACL 2022, Dublin, Ireland (in-person and online), May 26-27, 2022, pages 232–238. Association for Computational Linguistics.
Yuhao Zhang, Ziyang Wang, Runzhe Cao, Binghao Wei, Weiqiao Shan, Shuhan Zhou, Abudurexiti Reheman, Tao Zhou, Xin Zeng, Laohu Wang, Yongyu Mu, Jingnan Zhang, Xiaoqian Liu, Xuanjun Zhou, Yinqiao Li, Bei Li, Tong Xiao, and Jingbo Zhu. 2020. The niutrans machine translation systems for WMT20. In Proceedings of the Fifth Conference on Machine Translation, WMT@EMNLP 2020, Online, November 19-20, 2020, pages 338–345. Association for Computational Linguistics.
Yuhao Zhang, Chen Xu, Bojie Hu, Chunliang Zhang, Tong Xiao, and Jingbo Zhu. 2022b. Improving end-to-end speech translation by leveraging auxiliary speech and text data. CoRR, abs/2212.01778.
ON-TRAC consortium systems for the IWSLT 2023 dialectal and
low-resource speech translation tasks
Antoine Laurent1, Souhir Gahbiche5, Ha Nguyen2, Haroun Elleuch4,
Fethi Bougares4, Antoine Thiol5, Hugo Riguidel1,3, Salima Mdhaffar2,
Gaëlle Laperrière2, Lucas Maison2, Sameer Khurana6, Yannick Estève2
1 LIUM - Le Mans University, France
2 LIA - Avignon University, France
3 Systran - France
4 ELYADATA - Tunis, Tunisia
5 Airbus - France
6 MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
translation). This dataset is made available by LDC under reference LDC2022E01. The goal of this track is to train speech translation systems under two training conditions: constrained, in which only the provided dataset resources are allowed, and unconstrained, where participants may use any public or private resources.
4.2 End-to-end ST
We used the end-to-end translation model presented in section 3.2. The model was trained directly on the Tunisian to English task (no pre-training of the encoder-decoder model), using SAMU-XLS-R trained on 100 languages. We used adapters (Houlsby et al., 2019) inside the encoder to keep the semantic information while fine-tuning.
4.3 Results
Table 1 presents our ST results for the Tunisian to English Dialectal and Low-resource track. Our primary system obtained a BLEU of 20.7 on our validation set. As shown in the table, the official evaluation scores are low compared to the result obtained on the validation set. We suspect that our test submission did not conform to the evaluation specifications. We speculate that this difference between validation and test scores arises because we did not remove the punctuation or the disfluency tokens from the case-sensitive translation we submitted, while the evaluation is performed on lowercased text without punctuation. We mistakenly expected this normalization step to be applied by the organizers instead of the participants. We were able to ask the organizers to evaluate our normalized output after the evaluation period. The results are reported in Table 1. Test2 refers to the IWSLT 2022 evaluation campaign test set, and test3 refers to that of IWSLT 2023. We expect that applying this normalization before training our translation model would further improve our results; in any case, we believe that the post-deadline fix more accurately reflects our system's true performance.

System             Description     valid  test2  test3
primary            SAMU-XLS-R 100  20.7   9.6    8.8
post-deadline fix  SAMU-XLS-R 100  20.7   18.2   16.3

Table 1: Results for Tunisian Arabic to English translation systems in terms of %BLEU for low-resource (LR) track.

5 Tamasheq-French Experiments
In this section we present our experiments for the Tamasheq-French dataset in the context of the low-resource ST track.
5.1 Data
This dataset, recently introduced in Boito et al. (2022), contains 14 h of speech in the Tamasheq language for the training split, which corresponds to 4,444 utterances translated to French. The development set contains 581 utterances (a little less than 2 h of speech), and the 2022 test set contains 804 utterances (approximately 2 h of speech). The 2023 test set contains 374 utterances (approximately 1 h of speech). Additional audio data was also made available through the Niger-Mali audio collection: 224 h in Tamasheq and 417 h in geographically close languages (French from Niger, Fulfulde, Hausa, and Zarma).³ For all this data, the speech style is radio broadcasting, and the dataset presents no transcription.
5.2 Models
For the Tamasheq to French task, we performed several experiments. First of all, we ran the same experiment that was done for the Pashto-French and Tunisian-English tasks. We used the end-to-end translation model presented in section 3.2, directly trained on the Tamasheq→French task. "Directly" means that we used SAMU-XLS-R-xx (xx corresponds to the number of languages in the training set, equal to 53, 60 and 100) to initialise the encoder and trained the encoder-decoder model using the Tamasheq→French training set.
We used the CoVoST-2 (Wang et al., 2020) X→EN speech-translation dataset, in which we translated the EN text into French (using mBART Many-to-Many). Additionally, we exploited the Europarl benchmark, which comprises 72 translation tasks (denoted as X→Y), with the source language set (X) consisting of nine languages: FR, DE, ES, IT, PL, PT, RO, NL, and EN. The target language set (Y) is equivalent to the source language set. For the specific training data distribution of each of the 72 translation tasks, refer to Iranzo-Sánchez et al. (2019).
We trained a translation model using CoVoST-2 X→FR,EN and Europarl X→FR, namely models
³ https://demo-lia.univ-avignon.fr/studios-tamani-kalangou/
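As a quick check of the Europarl figures quoted in Section 5.2, the 72 translation tasks are the ordered pairs of distinct languages drawn from the nine-language set (9 × 8 = 72). The sketch below only illustrates that count and the X→FR subset mentioned in the text; the variable names are ours.

```python
from itertools import permutations

# The nine Europarl languages listed in the text.
LANGS = ["FR", "DE", "ES", "IT", "PL", "PT", "RO", "NL", "EN"]

# Every ordered (source, target) pair with source != target.
tasks = list(permutations(LANGS, 2))

# The X -> FR directions, i.e. the subset with French as the target.
to_french = [(src, tgt) for src, tgt in tasks if tgt == "FR"]

print(len(tasks))      # 72
print(len(to_french))  # 8
```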
System Description valid test 2023
primary samu100l[cv2_xx→(en,fr)+europarl_xx→fr] + test22 21.39 16.00
contrastive1 samu100l[cv2_xx→(en,fr)+europarl_xx→fr] 21.41 16.52
contrastive2 samu60l[cv2_xx→(en,fr)+europarl_xx→fr] + test22 20.80 15.84
contrastive3 samu60l[cv2_xx→(en,fr)+europarl_xx→fr] 20.66 15.35
contrastive4 samu100l continue training + test22 21.39 16.30
contrastive5 samu100l continue training 20.78 15.60
baseline best system from IWSLT2022 8.34 5.70
Alexei Baevski, Michael Auli, and Abdelrahman Mohamed. 2019. Effectiveness of self-supervised pre-training for speech recognition. arXiv preprint arXiv:1911.03912.
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460.
Luisa Bentivogli, Mauro Cettolo, Marco Gaido, Alina Karakanta, Alberto Martinelli, Matteo Negri, and Marco Turchi. 2021. Cascade versus direct speech translation: Do the differences still make a difference? CoRR, abs/2106.01045.
Alexandre Berard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and translate: A proof of concept for end-to-end speech-to-text translation. CoRR, abs/1612.01744.
Kaushal Bhogale, Abhigyan Raman, Tahir Javed, Sumanth Doddapaneni, Anoop Kunchukuttan, Pratyush Kumar, and Mitesh M Khapra. 2023. Effectiveness of mining audio and text pairs from public data for improving asr systems for low-resource languages. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
Marcely Zanon Boito, Fethi Bougares, Florentin Barbier, Souhir Gahbiche, Loïc Barrault, Mickael Rouvier, and Yannick Estève. 2022. Speech resources in the tamasheq language. Language Resources and Evaluation Conference (LREC).
Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. 2020. Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979.
Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, et al. 2022. Xtreme-s: Evaluating cross-lingual speech representations. arXiv preprint arXiv:2203.10752.
Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.
ELRA catalogue. 2016a. Trad pashto broadcast news speech corpus. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0381/. ISLRN: 918-508-885-913-7, ELRA ID: ELRA-S0381.
Solène Evain, Ha Nguyen, Hang Le, Marcely Zanon Boito, Salima Mdhaffar, Sina Alisamir, Ziyi Tong, Natalia Tomashenko, Marco Dinarelli, Titouan Parcollet, et al. 2021a. Task agnostic and task specific self-supervised learning from speech with LeBenchmark. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
Solène Evain, Ha Nguyen, Hang Le, Marcely Zanon Boito, Salima Mdhaffar, Sina Alisamir, Ziyi Tong, Natalia Tomashenko, Marco Dinarelli, Titouan Parcollet, Alexandre Allauzen, Yannick Estève, Benjamin Lecouteux, François Portet, Solange Rossato, Fabien Ringeval, Didier Schwab, and Laurent Besacier. 2021b. LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech. In Interspeech, pages 1439–1443.
F. Feng, Y. Yang, D. Cer, N. Arivazhagan, and W. Wang. 2022. Language-agnostic BERT Sentence Embedding. In Proceedings of the 60th ACL.
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for nlp. In Proc. ICML.
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460.
Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchis, Jorge Civera, and Alfons Juan. 2019. Europarl-st: A multilingual corpus for speech translation of parliamentary debates. arXiv:1911.03167.
Tahir Javed, Kaushal Santosh Bhogale, Abhigyan Raman, Anoop Kunchukuttan, Pratyush Kumar, and Mitesh M Khapra. 2022. Indicsuperb: A speech processing universal performance benchmark for indian languages. arXiv preprint arXiv:2208.11761.
Kazuya Kawakami, Luyu Wang, Chris Dyer, Phil Blunsom, and Aaron van den Oord. 2020. Learning robust and multilingual speech representations. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1182–1192, Online. Association for Computational Linguistics.
ELRA catalogue. 2016b. Trad pashto-french paral- Sameer Khurana, Antoine Laurent, and James Glass.
lel corpus of transcribed broadcast news speech - 2022. Samu-xlsr: Semantically-aligned multimodal
training data. http://catalog.elda.org/en-us/ utterance-level cross-lingual speech representation.
repository/browse/ELRA-W0093/. ISLRN: 802- IEEE Journal of Selected Topics in Signal Processing,
643-297-429-4, ELRA ID: ELRA-W0093. pages 1–13.
225
Hang Le, Juan Pino, Changhan Wang, Jiatao Gu, Didier Schwab, and Laurent Besacier. 2020. Dual-decoder transformer for joint automatic speech recognition and multilingual speech translation. arXiv preprint arXiv:2011.00747.

Xian Li, Changhan Wang, Yun Tang, Chau Tran, Yuqing Tang, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. 2020. Multilingual speech translation with efficient finetuning of pretrained models. arXiv:2010.12829.

Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. 2020. Mls: A large-scale multilingual dataset for speech research. arXiv:2012.03411.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision.

Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, and Hung yi Lee. 2021. SUPERB: Speech Processing Universal PERformance Benchmark. In Interspeech, pages 1194–1198.

Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, et al. 2022. Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6182–6186. IEEE.
BUT Systems for IWSLT 2023 Marathi - Hindi Low Resource
Speech Translation Task
Duration in hours (number of utterances)

Dataset  Language  Training            Dev           Test
GV       hi          97.9 (37,152)      4.9 (1885)   2.8 (1032)
ILC      mr         109.2 (92,471)      -            -
MCV      hi           5.3 (4481)        2.8 (2179)   4.1 (2962)
MCV      mr          12.0 (7321)        3.0 (1678)   3.2 (1827)
MUCS     hi          95.1 (99,925)      5.6 (3843)   -
MUCS     mr          93.8 (79,432)      5.0 (4675)   -
MSSC     mr           3.0 (1569)        -            -
SL       hi        1478.6 (764,237)     -            -
SL       mr         894.8 (466,203)     -            -
Total    hi        1676.8 (898,369)    13.3 (7895)   6.9 (3994)
Total    mr        1112.8 (638,159)     8.0 (6353)   3.2 (1827)

Table 1: Statistics of the data used for training ASR systems. The dev and test splits are only used for internal evaluation of the ASR systems.
Duration in hours (# utterances)

Training      Dev          Test
15.9 (7990)   3.7 (2103)   4.4 (2164)

Table 3: Statistics of Marathi-Hindi IWSLT2023 speech translation data.

3.3 MT

The MT model is a transformer-based seq2seq model initialized from XLM. Both the encoder and decoder parameters are initialized from the XLM encoder, except for the cross-attention parameters in the decoder, which are randomly initialized. The model is then fine-tuned on the same 1.6 M parallel sentences with a batch size of 64 and a maximum of 1000 epochs. The model achieved 23.0 and 22.6 BLEU on the internal valid and test sets (Table 2), respectively.

3.4 LM for re-scoring

For Hindi, we used an LSTM with three layers of 4096 units each, with no dropout. The model was trained on 217 M sub-word tokens obtained by tokenizing the monolingual Hindi corpus into a 10k Unigram vocabulary (Kudo, 2018). The model achieved a validation perplexity of 46. Thereafter, we fine-tuned it on text data from Shrutilipi (SL) for 500 steps.

For Marathi, we used an LSTM with 2 layers of 2048 units each, again with no dropout. This model also utilized a 10k Unigram vocabulary and was trained on 8.2 M tokens. This model achieved a validation perplexity of 120.

4 Speech translation systems

Here, we briefly describe both the end-to-end and cascade systems.

4.1 End-to-end

The E2E models are initialized from pre-trained ASR models. We use both the encoder and decoder from the ASR, as it provides a better initialization since the representations from the encoder are readily compatible with the decoder (Bansal et al., 2019). The model is then trained for direct speech translation, with the auxiliary CTC objective also for translation (Zhang et al., 2022; Yan et al., 2023; Kesiraju et al., 2023).

$$\mathcal{L}_{st} = \lambda \, \mathcal{L}_{ctc}(x, z) + (1 - \lambda) \, \mathcal{L}_{att}(x, z) \tag{2}$$

[Figure 1 diagram; recoverable labels: input x, Encoder, CTC with loss L_ctc(x, z), Decoder with loss L_att(x, z).]

Figure 1: End-to-end framework for speech translation. x is the input speech (features), z is the target text translation.

The effect of various initializations and their influence on downstream speech translation is discussed later in Section 5.

The E2E speech translation was also trained using the ESPnet toolkit. Our changes to the original toolkit, along with the training recipes, are available online⁹.

A beam-search-based joint decoding (Karita et al., 2019), which relies on the weighted average of log-likelihoods from both the CTC and transformer decoder modules, is used to produce the most likely hypothesis according to

$$\hat{z} = \arg\max_{z} \; \beta \log p_{ctc}(z \mid x) + (1 - \beta) \log p_{att}(z \mid x) \tag{3}$$

We found λ ∈ {0.1, 0.3} and β ∈ {0.1, 0.3} suitable for joint training and decoding, respectively.

4.2 Cascade systems

For the cascade speech translation systems, we first decode n-best hypotheses from the ASR model and obtain the 1-best from the Marathi LM rescorer. These are then passed directly to the MT system, which gives us n-best translation hypotheses in the target language, Hindi. These are then re-scored by the Hindi LM to give us the 1-best translation hypotheses.

⁹ https://github.com/BUTSpeechFIT/espnet/tree/main/egs2/iwslt23_low_resource/st1
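As an illustration of Eqs. (2) and (3) in Section 4.1, the following is a minimal Python sketch of the two interpolations. It is not the ESPnet implementation: the function names, the toy hypothesis names and all numeric scores are made up for illustration, and the decoding function simply re-ranks complete hypotheses, whereas the actual system applies the weighted score inside beam search.

```python
import math


def joint_loss(l_ctc: float, l_att: float, lam: float) -> float:
    """Eq. (2): interpolate the CTC and attention losses with weight lam."""
    return lam * l_ctc + (1.0 - lam) * l_att


def joint_decode(hyps: dict, beta: float) -> str:
    """Eq. (3): return the hypothesis with the best weighted log-likelihood.

    `hyps` maps each candidate translation z to the pair
    (log p_ctc(z|x), log p_att(z|x)). This is a simplification that scores
    complete hypotheses only (assumption made for this sketch).
    """
    def score(item):
        _, (logp_ctc, logp_att) = item
        return beta * logp_ctc + (1.0 - beta) * logp_att

    best, _ = max(hyps.items(), key=score)
    return best


# Toy n-best list with hypothetical probabilities (not from the paper).
nbest = {
    "hyp_a": (math.log(0.20), math.log(0.50)),
    "hyp_b": (math.log(0.60), math.log(0.30)),
}

print(joint_decode(nbest, beta=0.3))   # attention-dominated choice
print(joint_decode(nbest, beta=0.9))   # CTC-dominated choice
print(joint_loss(l_ctc=2.0, l_att=1.0, lam=0.3))
```

With a small β the attention decoder dominates the ranking; raising β shifts the decision toward the CTC scores, which is the trade-off the β values reported above control.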
Model  Training data (hrs)  Model type          Sub-word vocab  Dev WER  Dev WER  Test WER  Test WER
name                                            per language    mr       hi       mr        hi
H1     198†                 Mono (hi)           1000            -        30.7     -         35.9
H2     1676                 Mono (hi)           8000            -        24.7     -         28.4
M1     218†                 Mono (mr)           1000            14.3     -        42.4      -
M2     1112                 Mono (mr)           8000            19.0     -        36.0      -
B1     416†                 Bilingual (mr, hi)  1000            11.1     31.5     31.9      35.1
B2     2789                 Bilingual (mr, hi)  8000            16.0     24.2     23.7      26.9

Table 4: Word-error-rates (WER) of various mono and bilingual ASR systems, trained on various amounts of data. † implies that the training data contains everything from Table 1 except Shrutilipi (SL).
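For readers unfamiliar with the metric in Table 4, the sketch below shows one common way to compute word error rate (word-level edit distance divided by reference length) and why retaining punctuation in the reference text, as noted in Section 5.1, can raise it. The function name and the English example sentences are illustrative assumptions; the paper does not state which scoring tool was used.

```python
import string


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    if not r:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(h) + 1))          # distances against an empty reference
    for i, rw in enumerate(r, start=1):
        cur = [i]
        for j, hw in enumerate(h, start=1):
            cur.append(min(
                prev[j] + 1,                # deletion of a reference word
                cur[j - 1] + 1,             # insertion of a hypothesis word
                prev[j - 1] + (rw != hw),   # substitution (0 cost on match)
            ))
        prev = cur
    return prev[-1] / len(r)


def strip_punct(text: str) -> str:
    """Remove ASCII punctuation (illustrative normalization only)."""
    return text.translate(str.maketrans("", "", string.punctuation))


# Hypothetical example: the hypothesis has the right words but no punctuation.
reference = "hello, how are you?"
hypothesis = "hello how are you"

print(wer(reference, hypothesis))               # punctuation kept in reference
print(wer(strip_punct(reference), hypothesis))  # punctuation removed
```

In this toy case the two punctuated reference words count as substitutions, so the first score is higher than the second even though the spoken words match.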
A further fine-tuning of the MT system using 1-best hypotheses from the Marathi-to-Hindi IWSLT training set did not improve the results. Due to time constraints, we did not try various strategies (Bentivogli et al., 2021) or hyperparameter tuning for the cascade systems.

4.3 Re-scoring n-best hypotheses

We utilized the language models to re-score up to 100-best hypotheses in both languages. The language model scores were introduced using BrnoLM¹⁰. We tuned two hyperparameters: the weight of the LM score (additive to the 1.0 weight of the acoustic system) and an insertion bonus, added for each token of the hypothesis in the LM tokenization. For the E2E system, we achieved optimal results with LM weight 1.2 and insertion bonus 5.5. For the Marathi ASR in the cascade system, the optimal setting was 0.3 and 3.5. For the translation system in the cascade, we did not achieve any improvement by re-scoring the output with the Hindi LM.

¹⁰ https://github.com/BUTSpeechFIT/BrnoLM

5 Results and analysis

Here, we present the performance of various backbone models, along with analysis showing the effectiveness of various factors such as initializations, data augmentation, auxiliary objectives and joint decoding.

5.1 Performance of ASR systems

From Table 4 we can see that the bilingual models (B1, B2) perform better than their monolingual counterparts (H1, M1, H2, M2). Here, H1, M1 and B1 are smaller models with d_model = 256, whereas H2, M2 and B2 are bigger ones with d_model = 512. All the ASR models were trained with joint CTC and attention loss, where the CTC weight of 0.3 was found to be optimal. The same weight was used during joint decoding. Since we retained the original punctuation in the text, the WER is slightly affected.

5.2 Performance of ST

Here we present the results of speech translation systems based on the end-to-end architecture. As shown in Table 5, all the ST models were initialized either from mono or bilingual ASR systems and fine-tuned using the speech translation data (with or without data augmentation). While most of these systems can be considered direct end-to-end, those that use an external LM for re-scoring the n-best are an exception. Using a Marathi monolingual ASR model would be suboptimal because the internal language model represented in the decoder of the ASR would not be suitable for generating linguistically acceptable text sequences in Hindi.

Fig. 2 shows the effect of the CTC weight during joint training and decoding. We can see that 0.3 is the optimal weight both for training and decoding. Since we have a separate vocabulary for each language, the posterior probabilities from CTC during joint decoding will only correspond to tokens from the target language, Hindi. This is important because both languages come from the same family with high phonetic similarity and use the same Devanagari script; with separate vocabularies, the non-autoregressive CTC decoder does not accidentally assign higher scores to tokens from the source language, Marathi. The latter scenario can happen when using a joint sub-word vocabulary for both languages.

Sacrebleu library (Post, 2018) was used to com-
pute BLEU¹¹ and CHRF2¹² scores on the dev sets.

From Table 5, we can see that independent improvements come from using a bilingual ASR trained on more data, data augmentation (speed perturbation) and LM re-scoring. In the case of the cascade system, the LM re-scoring did not improve the results. We believe this is because the Marathi LM was trained on much less data (400K sentences). We plan to rerun these experiments in the near future.

Finally, our primary submission was based on B2 + ST fine-tuning with data augmentation + LM re-scoring, which obtained 39.6 BLEU and 63.3 CHRF2 on the official test set. Our contrastive system was based on B2 + MT + LM re-scoring, which obtained 28.6 BLEU and 54.4 CHRF2.

ST model         Speed     Dev set  Dev set
initialization   perturb   BLEU     CHRF2
H1               ✗         16.3     45.0
H2               ✓         24.9     51.0
Cascade          -         21.7     48.2

Table 5: Speech translation results on Marathi - Hindi dev set. All the ST models are fine-tuned on training data from Table 3.

[Figure 2 plot; recoverable labels: y-axis "BLEU on dev set" (ticks include 27.0 and 28.5), x-axis "(β) CTC weight during joint decoding" (0.0, 0.1, 0.3, 0.5, 0.7, 0.9), one curve each for λ = 0.1 and λ = 0.3.]

Figure 2: Effect of hyperparameters in joint training and decoding for direct speech translation. The model is initialized from B2 and trained on augmented training data.

6 Conclusions

In this paper, we presented the systems submitted to the IWSLT 2023 Marathi-Hindi low-resource track. Our main efforts were along the end-to-end direct speech translation system, initialized from a bilingual ASR. The model was jointly trained with CTC and attention objectives directly for translation. The joint decoding provided additional benefits. These strategies, combined with speed perturbation for data augmentation and re-scoring the n-best hypotheses using an external LM, provided further significant improvements. We also submitted a cascade system which uses the same bilingual ASR as the backbone, followed by an MT system. Both systems performed competitively, while the one based on end-to-end provided superior results in terms of BLEU. It remains to be investigated whether large pre-trained MT systems would close the gap between cascade and end-to-end systems.
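The n-best re-scoring described in Section 4.3 can be sketched as follows. This is one plausible reading of the description (system score with weight 1.0, plus a weighted LM log-probability, plus a per-token insertion bonus); it is not the BrnoLM code, and the `Hyp` class, the `rescore` function and all scores below are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hyp:
    text: str
    system_logp: float   # log-score from the ASR/ST system (weight fixed at 1.0)
    lm_logp: float       # log-probability of the text under the external LM
    n_lm_tokens: int     # hypothesis length in LM sub-word tokens


def total_score(h: Hyp, lm_weight: float, insertion_bonus: float) -> float:
    """Combined score: system + lm_weight * LM + bonus per LM token."""
    return h.system_logp + lm_weight * h.lm_logp + insertion_bonus * h.n_lm_tokens


def rescore(nbest: List[Hyp], lm_weight: float, insertion_bonus: float) -> Hyp:
    """Return the 1-best hypothesis after re-scoring the n-best list."""
    return max(nbest, key=lambda h: total_score(h, lm_weight, insertion_bonus))


# Toy 2-best list with made-up scores.
nbest = [
    Hyp("short hypothesis", system_logp=-10.0, lm_logp=-8.0, n_lm_tokens=3),
    Hyp("a longer hypothesis", system_logp=-11.0, lm_logp=-6.0, n_lm_tokens=4),
]

# Without the LM the system score alone decides.
print(rescore(nbest, lm_weight=0.0, insertion_bonus=0.0).text)
# With the Marathi ASR setting reported above (0.3 and 3.5) the ranking can change.
print(rescore(nbest, lm_weight=0.3, insertion_bonus=3.5).text)
```

The insertion bonus offsets the tendency of summed log-probabilities to favour shorter outputs, which is why it is tuned jointly with the LM weight.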
Guillaume Lample and Alexis Conneau. 2019. Cross-
lingual language model pretraining. CoRR,
abs/1901.07291.
Tanvina Patel and Odette Scharenborg. 2022. Using
cross-model learnings for the Gram Vaani ASR Chal-
lenge 2022. In Proc. Interspeech 2022, pages 4880–
4884.
Matt Post. 2018. A call for clarity in reporting BLEU
scores. In Proceedings of the Third Conference on
Machine Translation: Research Papers, pages 186–
191, Brussels, Belgium. Association for Computa-
tional Linguistics.
Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukáš
Burget, Ondřej Glembek, K. Nagendra Goel, Mirko
Hannemann, Petr Motlíček, Yanmin Qian, Petr
Schwarz, Jan Silovský, Georg Stemmer, and Karel
Veselý. 2011. The kaldi speech recognition toolkit.
In Proceedings of ASRU 2011, pages 1–4. IEEE Sig-
nal Processing Society.
Gowtham Ramesh, Sumanth Doddapaneni, Aravinth
Bheemaraj, Mayank Jobanputra, Raghavan AK,
Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Ma-
halakshmi J, Divyanshu Kakwani, Navneet Kumar,
Aswin Pradeep, Srihari Nagaraj, Kumar Deepak,
Vivek Raghavan, Anoop Kunchukuttan, Pratyush Ku-
mar, and Mitesh Shantadevi Khapra. 2022. Samanan-
tar: The largest publicly available parallel corpora
collection for 11 indic languages. Transactions of the
Association for Computational Linguistics, 10:145–
162.
Shashank Siripragada, Jerin Philip, Vinay P. Nambood-
iri, and C V Jawahar. 2020. A multilingual parallel
corpora collection effort for Indian languages. In
Proceedings of the 12th Language Resources and
Evaluation Conference, pages 3743–3751, Marseille,
France. European Language Resources Association.
Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki
Hayashi, Jiro Nishitoba, Yuya Unno, Nelson En-
rique Yalta Soplin, Jahn Heymann, Matthew Wiesner,
Nanxin Chen, Adithya Renduchintala, and Tsubasa
Ochiai. 2018. ESPnet: End-to-end speech process-
ing toolkit. In Proceedings of Interspeech, pages
2207–2211.
Brian Yan, Siddharth Dalmia, Yosuke Higuchi, Graham
Neubig, Florian Metze, Alan W Black, and Shinji
Watanabe. 2023. CTC alignments improve autore-
gressive translation. In Proceedings of the 17th Con-
ference of the European Chapter of the Association
for Computational Linguistics, pages 1623–1639,
Dubrovnik, Croatia. Association for Computational
Linguistics.
Biao Zhang, Barry Haddow, and Rico Sennrich. 2022.
Revisiting End-to-End Speech-to-Text Translation
From Scratch. In International Conference on Ma-
chine Learning, volume 162 of Proc. of Machine
Learning Research, pages 26193–26205. PMLR.
CMU’s IWSLT 2023 Simultaneous Speech Translation System
Brian Yan*1 Jiatong Shi*1 Soumi Maiti1 William Chen1
Xinjian Li1 Yifan Peng2 Siddhant Arora1 Shinji Watanabe1,3
1 Language Technologies Institute, Carnegie Mellon University, USA
2 Electrical and Computer Engineering, Carnegie Mellon University, USA
3 Human Language Technology Center of Excellence, Johns Hopkins University, USA
{byan, jiatongs}@cs.cmu.edu
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 235–240
July 13-14, 2023 © 2023 Association for Computational Linguistics
Figure 1: Offline ST model architecture based on the joint CTC/attention framework with a WavLM front-end and mBART decoder.

Figure 2: Incremental encoding strategy which processes chunks of input speech by re-computing representations corresponding to earlier chunks.

front-end features to train ASR models. In these models, a pre-encoder module (Chang et al., 2021) applies feature dimension down-sampling and a learned weighted combination of WavLM layers before feeding to a Conformer encoder (Gulati et al., 2020). The pre-encoder and encoder modules from ASR are then used to initialize our ST models.

To leverage unpaired text data, we use the mBART decoder (Tang et al., 2020) as an initialization for our ST models. Following Li et al. (2020), we freeze all feed-forward layers during fine-tuning and use a post-encoder down-sampling layer to reduce the computational load.

We fine-tune our ST models using the following interpolated loss function: $\mathcal{L} = \lambda_1 \mathcal{L}_{\text{ASR\_CE}} + \lambda_2 \mathcal{L}_{\text{ASR\_CTC}} + \lambda_3 \mathcal{L}_{\text{ST\_CE}} + \lambda_4 \mathcal{L}_{\text{ST\_CTC}}$. Here, the cross-entropy (CE) losses are used to train attentional decoders. Note that in Figure 1, we omit the ASR attentional decoder and CTC components as these function as training regularizations and do not factor into the inference procedure. We perform fine-tuning on in-domain data consisting primarily of MuST-C (Di Gangi et al., 2019).

To leverage additional in-domain data, we apply MT pseudolabeling to TEDLIUM ASR data (Zhou et al., 2020). We also use the same MT model to apply sequence-level knowledge distillation to the MuST-C data. The MT model is a pre-trained DeltaLM-large (Ma et al., 2021) fine-tuned on the corpora listed in Section 2. The pseudo-labels and distilled sequences were then translated from English to German using a beam size of 10.

3.2 Simultaneous Speech Translation (SST)

We adapt our offline ST model for streaming inference by using a chunk-based processing of input speech. As shown in Figure 2, our scheme uses a fixed duration (e.g. 2 seconds) to compute front-end and encoder representations on chunks of input speech. With each new chunk, we re-compute front-end and encoder representations using the incrementally longer input speech.

Algorithm 1 Beam search step with rewinding of unreliable hypotheses on non-final chunks and incremental pruning upon end-detection.

procedure BEAMSTEP(hyps, prevHyps, isFinal)
    newHyps = {}; endDetected = False
    for y_{1:l−1} ∈ hyps do
        attnCnds = top-k(P_Attn(y_l | X, y_{1:l−1}), k = p)
        for c ∈ attnCnds do
            y_{1:l} = y_{1:l−1} ⊕ c
            α_CTC = CTCScore(y_{1:l}, X_{1:T})
            α_Attn = AttnScore(y_{1:l}, X_{1:T})
            β = LengthPen(y_{1:l})
            P_Beam(y_{1:l} | X) = α_CTC + α_Attn + β
            newHyps[y_{1:l}] = P_Beam(·)
            if (!isFinal) and (c is <eos> or repeat) then
                endDetected = True
                newHyps = prevHyps        ▷ rewind
            else if l is maxL then
                endDetected = True
            end if
        end for
    end for
    if endDetected then                   ▷ incremental pruning
        newHyps = top-k(P_Beam(·), k = 1)
    else                                  ▷ standard pruning
        newHyps = top-k(P_Beam(·), k = b)
    end if
    return newHyps, endDetected
end procedure

To produce incremental translation outputs, we apply several modifications to the offline joint CTC/attention beam search. As shown in Algorithm 1, we run beam search for each chunk of input. Unless we know that the current chunk is the final chunk, we perform end-detection using the
heuristics introduced by Tsunoo et al. (2021). If any of the hypotheses in our beam propose a next candidate which is the special end-of-sequence token or a token which already appeared in the hypothesis, then this strategy determines that the outputs have likely covered all of the available input. At this point, the current hypotheses should be considered unreliable and thus the algorithm rewinds hypotheses to the previous step.

After the end has been detected within the current chunk, we prune the beam to the 1-best hypothesis and select this as our incremental output; this pruning step is necessary to avoid re-translation. When the next input chunk is received, beam search continues from this 1-best hypothesis.

3.3 Simultaneous Speech-to-Speech Translation (S2ST)

The simultaneous S2ST model is created by feeding incremental text outputs to a German text-to-speech model. We use the end-to-end TTS model VITS (Kim et al., 2021) and train a single-speaker German TTS model using the CommonVoice dataset (Ardila et al., 2020). VITS consists of a text encoder, a flow-based stochastic duration predictor from text, a variational auto-encoder for learning latent features from audio, and a generator-discriminator based decoder for generating speech from latent features. We use characters as input to the TTS model.

We select a suitable speaker from the CommonVoice German dataset and train a single-speaker TTS. As CommonVoice may contain many noisy utterances which can hurt the performance of TTS, we use data selection to obtain a high-quality subset. The data selection process involves identifying the speaker who has the highest number of utterances with high speech quality. To determine the speech quality, we use the speech enhancement metric DNSMOS (Reddy et al., 2021), which provides an estimation of the speech quality. We evaluate the speech quality for the top five speakers with the largest number of utterances. To establish the high-quality subset, we set a threshold of 4.0 for selecting sentences that meet the desired quality level. Based on this criterion, we choose the second speaker, who has approximately 12 hours of high-quality data.

Finally, we combine our trained German TTS model with the SST module during inference. We feed incremental translation text outputs to TTS and synthesize translated speech.

Model                                              Quality      Latency
Offline Speech Translation (ST)                    BLEU ↑       -
Multi-Decoder CTC/Attn (Yan et al., 2023b)         30.1         -        -
WavLM-mBART CTC/Attn (Ours)                        32.5         -        -
Simul Speech Translation (SST)                     BLEU ↑       AL ↓     LAAL ↓
Time-Sync Blockwise CTC/Attn (Yan et al., 2023b)   26.6         1.93     1.98
WavLM-mBART CTC/Attn (Ours)                        30.4         1.92     1.99
Simul Speech-to-Speech Translation (SS2ST)         ASR-BLEU ↑   SO ↓     EO ↓
WavLM-mBART CTC/Attn + VITS (Ours)                 26.7         2.33     5.67

4 Experimental Setup

Our models were developed using the ESPnet-ST-v2 toolkit (Yan et al., 2023b). Our ST/SST model uses WavLM-large as a front-end (Chen et al., 2022). A linear pre-encoder down-samples from 1024 to 80 feature dim. Our encoder is a 12 layer Conformer with 1024 attention dim, 8 attention heads, and 2048 linear dim (Gulati et al., 2020). A convolutional post-encoder then down-samples along the length dimension by a factor of 2. Our decoder follows the mBART architecture and we initialize using the mBART-large-50-many-to-many model (Tang et al., 2020). Our ST CTC branch uses the same 250k vocabulary as the mBART decoder to enable joint decoding. Our TTS model consists of 6 transformer encoder layers for the text encoder, 4 normalizing flow layers for the duration predictor, 16 residual dilated convolutional blocks as the posterior encoder and a multi-period HiFiGan (Kong et al., 2020) style decoder. We train the VITS model for 400 epochs with the AdamW (Loshchilov and Hutter, 2019) optimizer.

During inference, we use a chunk size of 2 sec-
onds for SST and 2.5 seconds for SS2ST. For both SST and SS2ST we use beam size 5, CTC weight 0.2, and no length penalty/bonus. To account for incremental outputs which end in a prefix of a word rather than a whole word, we delay outputs for scoring by 1 token. There are two exceptions to this token delay: if the last token is a valid German word or a punctuation, then we do not delay.

We evaluate translation quality using BLEU score (Papineni et al., 2002) for ST/SST and ASR-BLEU score for SS2ST. ST/SST references are case-sensitive and punctuated while SS2ST references are case-insensitive and un-punctuated. The ASR model used for ASR-BLEU is Whisper-small (Radford et al., 2022). We evaluate translation latency for SST using average lagging (AL) (Ma et al., 2020) and length-adaptive average lagging (LAAL) (Papi et al., 2022). We evaluate translation latency for SS2ST using start (SO) and end-offset (EO) (Ma et al., 2020).

5 Results

Table 1 shows the quality and latency of our SST and SS2ST models as measured on En-De tst-COMMON. We also show the ST performance of our model for reference. As a baseline, we compare to the IWSLT-scale ST and SST systems developed in Yan et al. (2023b); our systems show improved quality, primarily due to the use of WavLM and mBART self-supervised representations.

From ST to SST, we observe a 6% quality degradation. Note that the average duration of tst-COMMON utterances is around 5 seconds, meaning the corresponding latency gain is 60%. From SST to SS2ST, we observe a 12% quality degradation. Note that both the TTS model and the Whisper ASR model powering the ASR-BLEU metric contribute to this gap.

6 Conclusion

We describe our English to German simultaneous speech-to-text and speech-to-speech translation systems for the IWSLT 2023 shared task. We start by building large-scale offline speech-to-text systems which leverage self-supervised speech and text representations. We then adapt these offline models for online inference, enabling simultaneous speech-to-text translation. Finally, we feed streaming text outputs to a down-stream TTS model, enabling simultaneous speech-to-speech translation.

Acknowledgements

Brian Yan and Shinji Watanabe are supported by the Human Language Technology Center of Excellence. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) (Towns et al., 2014), which is supported by National Science Foundation grant number ACI-1548562; specifically, the Bridges system (Nystrom et al., 2015), as part of project cis210027p, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center. This work also used GPUs donated by the NVIDIA Corporation.

References

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutail Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. 2020. Common voice: A massively-multilingual speech corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4218–4222, Marseille, France. European Language Resources Association.

Xuankai Chang, Takashi Maekaku, Pengcheng Guo, Jing Shi, Yen-Ju Lu, Aswin Shanmugam Subramanian, Tianzi Wang, Shu-wen Yang, Yu Tsao, Hung-yi Lee, et al. 2021. An exploration of self-supervised pretrained representations for end-to-end speech recognition. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 228–235. IEEE.

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki
Kanda, Takuya Yoshioka, Xiong Xiao, et al. 2022. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518.

Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2012–2017, Minneapolis, Minnesota. Association for Computational Linguistics.

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented Transformer for speech recognition. In Proceedings of Interspeech, pages 5036–5040.

Jaehyeon Kim, Jungil Kong, and Juhee Son. 2021. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International Conference on Machine Learning. PMLR.

Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. volume 33, pages 17022–17033.

Xian Li, Changhan Wang, Yun Tang, Chau Tran, Yuqing Tang, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. 2020. Multilingual speech translation with efficient finetuning of pretrained models. arXiv preprint arXiv:2010.12829.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.

Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Alexandre Muzio, Saksham Singhal, Hany Hassan Awadalla, Xia Song, and Furu Wei. 2021. DeltaLM: Encoder-decoder pre-training for language generation and translation by augmenting pretrained multilingual encoders.

Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, and Juan Pino. 2020. Simuleval: An evaluation toolkit for simultaneous translation. In Proceedings of the EMNLP.

Nicholas A Nystrom, Michael J Levine, Ralph Z Roskies, and J Ray Scott. 2015. Bridges: a uniquely flexible hpc resource for new communities and data analytics. In Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure, pages 1–8.

Sara Papi, Marco Gaido, Matteo Negri, and Marco Turchi. 2022. Over-generation cannot be rewarded: Length-adaptive average lagging for simultaneous speech translation. In Proceedings of the Third Workshop on Automatic Simultaneous Translation, pages 12–17.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.

Chandan KA Reddy, Vishak Gopal, and Ross Cutler. 2021. Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6493–6497. IEEE.

Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2020. Multilingual translation with extensible multilingual pretraining and finetuning.

Jörg Tiedemann, Santhosh Thottingal, et al. 2020. Opus-mt–building open translation services for the world. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation. European Association for Machine Translation.

J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood, S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. R. Scott, and N. Wilkins-Diehr. 2014. Xsede: Accelerating scientific discovery. Computing in Science & Engineering, 16(5):62–74.

Emiru Tsunoo, Yosuke Kashiwagi, and Shinji Watanabe. 2021. Streaming transformer asr with blockwise synchronous beam search. In 2021 IEEE Spoken Language Technology Workshop (SLT), pages 22–29. IEEE.

Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R Hershey, and Tomoki Hayashi. 2017. Hybrid ctc/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8):1240–1253.

Brian Yan, Siddharth Dalmia, Yosuke Higuchi, Graham Neubig, Florian Metze, Alan W Black, and Shinji Watanabe. 2023a. CTC alignments improve autoregressive translation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1615–1631, Dubrovnik, Croatia. Association for Computational Linguistics.

Brian Yan, Jiatong Shi, Yun Tang, Hirofumi Inaguma, Yifan Peng, Siddharth Dalmia, Peter Polák, Patrick Fernandes, Dan Berrebbi, Tomoki Hayashi, et al. 2023b. Espnet-st-v2: Multipurpose spoken language translation toolkit. arXiv preprint arXiv:2304.04596.

Wei Zhou, Wilfried Michel, Kazuki Irie, Markus Kitza, Ralf Schlüter, and Hermann Ney. 2020. The rwth asr system for ted-lium release 2: Improving hybrid hmm with specaugment. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7839–7843. IEEE.
240
Improving Low Resource Speech Translation with Data Augmentation and
Ensemble Strategies
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 241–250
July 13-14, 2023 ©2023 Association for Computational Linguistics
for Tmh→Fra with audio stretching (Yang et al., 2021).
• The baseline model for Tmh→Fra is trained with a back-translation corpus generated using the NLLB-200 machine translation model (Team et al., 2022).
• For Tmh→Fra, we build a separate training corpus of paraphrases and show that model performance improves when trained on this dataset (Bhavsar et al., 2022).
• We show how a weighted cross entropy loss further improves the performance of the Tmh→Fra translation model. The model trained with this loss, additional data generated using paraphrases, and audio stretching is shown to perform 5.2% better than the baseline.
• An ensemble of models trained on the above strategies shows the best performance, with a BLEU score that is 17.2% higher than the average BLEU score of the individual models within the ensemble.
• In the case of Mr→Hi, our best independent ensemble model shows a 23% improvement over the average BLEU score of the individual models within the ensemble.
Apart from these contributions, we also explore post-processing techniques with large language models (LLMs), focusing on re-ranking generated translations (Kannan et al., 2018), correcting the grammar of translations, and masking tokens so that the LLM can complete the translated sentence. These methods, though, did not yield any noticeable improvement.
The paper is organized as follows: Section 2 describes our speech translation system, Section 3.1 has details about the datasets for the various language pairs, Section 3.2 contains an analysis of our experimental results, and we conclude in Section 4.

2 Speech Translation System
2.1 Baseline Model
Our base model for the Tmh→Fra ST task is an end-to-end speech translation system which employs an encoder-decoder architecture (Vaswani et al., 2017). We initialize the audio feature extractor and the 6-layer transformer encoder from a pretrained wav2vec 2.0 base model (Baevski et al., 2020). We reuse the wav2vec 2.0 model pretrained on 243 hours of Tamasheq audio data released by ON-TRAC Consortium Systems (Boito et al., 2022b). During initialization, the last 6 layers of the pretrained wav2vec 2.0 model are discarded. We use a shallow decoder which consists of 2 transformer layers with 4 attention heads. Between encoder and decoder, we use one feed-forward layer to match the dimension of the encoder output and the decoder input.
During training, the model directly performs the speech-to-text translation task without generating intermediate source language text. The training loss is the cross entropy loss between ground truth and hypothesis with label smoothing of 0.1. Each experiment is trained for 200 epochs and checkpoints are selected based on best validation BLEU.
For the Marathi-Hindi speech-to-text (ST) model, we chose a Wav2Vec 2.0 base model finetuned on 960 h of English speech (Baevski et al., 2020) as the encoder baseline. We also used the same encoder model finetuned on 94 hours of Marathi audio data (Chadha et al., 2022) in our experiments. For these models, the last 6 layers of the pretrained models were discarded, while the decoder architecture and other hyperparameters were kept the same as for the Tmh→Fra models.4 For the audio encoder, we also experimented with the Wav2vec 2.0 XLS-R 0.3B model (Babu et al., 2021) and another XLS-R 0.3B model specifically finetuned on Marathi audio (Bhattacharjee, 2022). Because the XLS-R base model was trained on audio from a range of Indian languages including Marathi and Hindi, we chose to incorporate XLS-R in our experimentation. For the XLS-R based models, we utilized the first 12 out of 24 encoder layers to initialize the encoder, followed by a linear projection layer to transform the output features of 1024 dimensions to the desired decoder dimensionality of 256. We trained all Marathi-Hindi ST models for 300 epochs and chose the best checkpoint based on validation BLEU score.

2.2 Data Augmentation
2.2.1 Audio Stretching
We apply audio stretching directly on waveform data using the torchaudio library (Yang et al., 2021).5 For each audio sample, we alter the speed of the audio with a rate uniformly sampled from [0.8, 1.2] with a probability of 0.8 while maintaining the audio sample rate.

4 Detailed hyperparameters used can be found in A.1.
5 https://github.com/pytorch/audio
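
The audio stretching step of Section 2.2.1 can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' torchaudio pipeline: the linear-interpolation resampler and the function names `change_speed` and `maybe_stretch` are assumptions made for the example. Only the rate interval [0.8, 1.2] and the application probability of 0.8 come from the text.

```python
import random


def change_speed(samples, rate):
    """Resample `samples` by linear interpolation so the clip plays `rate` times faster.

    The nominal sample rate is left unchanged, so rate > 1 shortens the clip
    and rate < 1 lengthens it. (Stand-in for a library speed effect.)
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if len(samples) < 2:
        return list(samples)
    new_len = max(1, int(round(len(samples) / rate)))
    out = []
    for k in range(new_len):
        pos = min(k * rate, len(samples) - 1)  # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def maybe_stretch(samples, rng, p=0.8, low=0.8, high=1.2):
    """With probability p, change the speed by a rate drawn from U[low, high]."""
    if rng.random() >= p:
        return list(samples)  # leave this sample untouched
    return change_speed(samples, rng.uniform(low, high))


if __name__ == "__main__":
    rng = random.Random(0)
    clip = [float(i % 50) for i in range(16000)]  # toy one-second clip at 16 kHz
    print(len(clip), "->", len(maybe_stretch(clip, rng)))
```

Because the rate is drawn per sample, the same utterance can appear at a different speed in every epoch, which is what makes the augmentation useful on a small corpus.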
2.2.2 Back-Translation
We use the NLLB-200 machine translation model to generate variations of the target text in French (Team et al., 2022). The original French data is first translated into English, and then translated back into French. For French to English translation, only the 1-best prediction is used. For English to French translation, we take the top 5 results with a beam size of 5.
We also try to generate synthetic transcriptions of the Tamasheq audio by translating French text into Tamasheq. However, we notice that the translation quality is unstable and decide not to use it for the experiment.

2.2.3 Paraphrasing
We use a French paraphrase model (Bhavsar, 2022), which is a fine-tuned version of the mBART model (Liu et al., 2020), to generate variations of the target text in French. We take the top 5 paraphrases using beam search with a beam size of 5.

2.2.4 Weighted Loss
As the quality of synthetically generated sentences varies, we apply a sentence-level weight to the corresponding sample's cross entropy loss during training.

$$ l = \sum_{i=1}^{N} w_i \cdot \mathrm{CE}(y_i, \hat{y}_i) \qquad (1) $$

where $N$ is the size of the corpus, and $y_i$, $\hat{y}_i$, $w_i$ are the ground truth, prediction, and loss weight for sample $i$ respectively. For back-translation data, the weights are directly taken from the prediction score of NLLB-200. For paraphrasing data, we calculate the perplexity of each generated paraphrase and then take the exponential of the perplexity as the weight. For original training data (clean and full), weights are set to 1.

2.3 Ensemble Model
Ensemble decoding (Liu et al., 2018; Zhang and Ao, 2022) is a method of combining probability values generated by multiple models while decoding the next token. We give equal weight to the N different ensemble models, as shown in Equation 2.

$$ \log P(y_t \mid x, y_{1 \ldots t-1}) = \frac{1}{N} \sum_{i=1}^{N} \log P_{\theta_i}(y_t \mid x, y_{1 \ldots t-1}) \qquad (2) $$

where $y_t$ denotes the decoded token at time $t$, $x$ denotes the input, and $\theta_i$ denotes the $i$th model in the ensemble.
We apply the following ensemble decoding strategies:
• Independent ensemble: we ensemble checkpoints having the highest BLEU scores on the validation set, over N training runs. The N different models have the same architecture, but are initialized with different seed values.
• Data-augmented ensemble: we ensemble checkpoints having the highest BLEU scores on the validation set, over N training runs. The N different models have the same architecture, but are trained on different data augmentation strategies.
We additionally attempt a checkpoint ensemble, where N different checkpoints having the highest validation BLEU within the same training run are ensembled. Since we notice only marginal improvements with the checkpoint ensemble, we decide not to explore it in depth for our experiments.

2.4 Post Processing with LLMs
We further explore a set of post-processing strategies that leverage large language models (LLMs) to 1) rerank the top-k generated samples; 2) correct the grammar of the output; and 3) guess the missing tokens of the sentence. The strategy is based on the observation that translation outputs from the validation set often carry incomplete sentences and broken grammar. We found that LLMs are a good fit to address this problem, as they have brought promising improvements in sentence re-ranking and rewriting tasks (Liu et al., 2023). We summarize our proposed strategies as follows:

2.4.1 Re-ranking
The reranking approach takes the top 5 results from the best-performing candidate and reranks these outputs with language models. We first explore performing shallow fusion (Kannan et al., 2018) with a language model (GPT2-Fr).6 Additionally, we leverage an LLM (French finetuned-Alpaca 7B7) to guess the most probable sentence that is from a radio broadcast news with the prompt:
quelle phrase est plus susceptible d’apparaître dans un journal télévisé

6 https://github.com/aquadzn/gpt2-french
7 https://github.com/bofenghuang/vigogne
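
As a rough illustration of Equations 1 and 2, the following Python sketch computes the sentence-weighted loss and an equal-weight ensemble over per-model log-probabilities. The function names and toy inputs are assumptions for illustration and are not taken from the authors' code. As a simplification, each sample's cross entropy is reduced to the negative log-probability of a single reference label rather than a token sequence.

```python
import math


def weighted_cross_entropy(probs_of_truth, weights):
    """Eq. (1): sum_i w_i * CE_i, with CE_i = -log p_i(ground truth).

    `probs_of_truth[i]` is the probability the model assigns to the reference
    of sample i; `weights[i]` is that sample's loss weight.
    """
    return sum(w * -math.log(p) for p, w in zip(probs_of_truth, weights))


def ensemble_log_probs(per_model_log_probs):
    """Eq. (2): equal-weight average of the per-model log-probabilities.

    `per_model_log_probs` is a list of N dicts, one per model, each mapping a
    candidate next token to log P_theta_i(token | x, y_1..t-1).
    """
    n = len(per_model_log_probs)
    vocab = per_model_log_probs[0].keys()
    return {tok: sum(m[tok] for m in per_model_log_probs) / n for tok in vocab}


def pick_next_token(per_model_log_probs):
    """Greedy choice of the next token under the ensemble score."""
    scores = ensemble_log_probs(per_model_log_probs)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Original data gets weight 1.0; a synthetic sample gets a smaller weight.
    print(weighted_cross_entropy([0.5, 0.5], [1.0, 0.5]))

    model_a = {"oui": math.log(0.6), "non": math.log(0.4)}
    model_b = {"oui": math.log(0.1), "non": math.log(0.9)}
    print(pick_next_token([model_a, model_b]))
```

Averaging in log space means a token that one model considers very unlikely is penalised heavily even if another model prefers it. In the toy example the ensemble picks "non" although the first model alone would pick "oui".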
2.4.2 Sentence Correction
The sentence correction approach rewrites the whole output prediction by correcting the grammatical and spelling errors. We use two LLMs for this task, the aforementioned Alpaca model and Bloom 7B, with the following prompt:8
Corrigez la faute de frappe et la grammaire de la phrase sans changer la structure

2.4.3 Token Masking
The token masking approach first masks the translation output with <blank> tokens for out-of-vocabulary (OOV) tokens. For example, the predicted output "...Les questions sont [pi];." is replaced with "<blank> Les questions sont <blank>.", where [pi] is a common token we observed in the prediction output that does not carry meaning. We then apply the following prompt to let the LLMs complete the sentence:

3.1.2 Marathi-Hindi Corpus
For Marathi-Hindi we use the data from Panlingua (2023) containing approximately 25 hours of speech. The audio recordings are sourced from the news domain. The statistics of the dataset are shown in Table 3.

| Data Split | Hours | # Utterances |
|---|---|---|
| train | 16 | 7,990 |
| valid | 3.7 | 2,103 |
| test | 4.5 | 2,164 |

Table 3: Data statistics for mr→hi corpus. Hours shows the number of hours of audio samples available while # Utterances is the associated number of utterances.

3.2 Experimental Results
In this section, we compare the effects of data augmentation, ensembling and post-processing strategies on the tmh→fra task on the test 2022 dataset. We additionally compare results on the mr→hi task on the validation dataset.
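
The masking step of Section 2.4.3 could look roughly like the sketch below. How out-of-vocabulary tokens are detected is not specified in the text, so the whitespace tokenisation, the punctuation stripping, the collapsing of consecutive blanks, and the prompt layout in `build_prompt` are all assumptions. The instruction string itself is the one listed for token masking in Table 5.

```python
import re

BLANK = "<blank>"


def mask_oov(sentence, vocabulary):
    """Replace whitespace-separated tokens outside `vocabulary` with <blank>.

    `vocabulary` is a set of lower-cased known words (hypothetical input).
    """
    masked = []
    for token in sentence.split():
        # Strip brackets and punctuation before the vocabulary lookup.
        word = re.sub(r"[^\w'’-]", "", token).lower()
        if word and word in vocabulary:
            masked.append(token)
        elif not masked or masked[-1] != BLANK:
            masked.append(BLANK)  # collapse a run of OOV tokens into one blank
    return " ".join(masked)


def build_prompt(masked_sentence):
    """Combine the Table 5 instruction with the masked sentence (layout assumed)."""
    instruction = "complétez la phrase en remplaçant les jetons <blank>?"
    return f"{instruction}\n{masked_sentence}"


if __name__ == "__main__":
    vocab = {"les", "questions", "sont"}
    masked = mask_oov("Les questions sont [pi];", vocab)
    print(build_prompt(masked))
```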
Table 1: Impact of Data Augmentation on tmh→fra models. The table shows the BLEU scores for different strategies
in comparison to the baseline trained on clean and full dataset. Back-Translation + audio stretching and Paraphrase dataset
augmentation improve the BLEU score. Back-Translation alone can improve model performance when combined with a weighted
loss.
Table 4: Impact of Post Processing on tmh→fra corpus. The post-processing steps outlined are applied to an Ensembled
Wav2Vec2 model. The post-processing with a LLM does not provide any additional benefit.
| Strategy | Instruct | Input | Output |
|---|---|---|---|
| Reranking | quelle phrase est plus susceptible d’apparaître dans un journal télévisé | top k hypothesis | best hypothesis picked by LLM |
| Token Masking | complétez la phrase en remplaçant les jetons <blank>? | Donc, on dirait que l’organisation de l’UENA, elle est <blank> | Donc, on dirait que l’organisation de l’UENA, elle est un organisme de bienfaits |
| Sentence Correction | Corrigez la faute de frappe et la grammaire de la phrase sans changer la structure | Les a été libérés et ceux qui sont rentrés. | Ils ont été libéré et ceux rentrant. |

Table 5: Prompt Designs. Example LLM Prompts for Post Processing tmh→fra corpus.
| Ensemble Models (Refer Table 1) | Ensemble Type | Test2022 BLEU |
|---|---|---|
| cb-ensemble | Independent | 10.32 |
| fb-ensemble | Independent | 10.79 |
| ft+ftw+fta+ftaw | Data Augmented Back-translation | 10.95 |
| fp+fpw+fpa+fpaw | Data Augmented Paraphrase | 11.26 |

| Number of models | Avg Test BLEU |
|---|---|
| 4 | 10.83 |
| 3 | 10.60 |
| 2 | 10.23 |
| 1 (No Ensemble) | 9.24 |

Table 6: Impact of Ensembling tmh→fra ST models. Ensembling models trained with different seeds increases the BLEU score. Increasing the number of models in the ensemble also increases performance.

leads to significant performance degradation compared to the ensemble baseline. We attribute this observation to the fact that the pretrained LLMs lack context-specific data of the Tamasheq corpus. For example, when asked to correct the output sentence, LLMs tend to re-frame the phrases related to more generic topics like sports or events.
Second, we find reranking and token masking strategies both lead to slight degradation compared to the baseline. This is due to the fact that both approaches make less aggressive changes to the original output. In general, we find LLMs do not perform well when the predicted text deviates too much from the ground truth.
Finally, we perform the same set of strategies but using translated English output from the original French translation. We present the best performing candidates (Translation+Reranking in Table 4). We find that this strategy caused the worst performance degradation due to error propagation
| # | Model | Vocab size | Validation BLEU |
|---|---|---|---|
| mwb | wav2vec2-base-960h | 1k | 11.41 |
| mwbm1k | wav2vec2-base-marathi | 1k | 13.19 |
| mwbm3k | wav2vec2-base-marathi | 3k | 11.85 |
| mwx | wav2vec2-xls-r-300m | 1k | 15.94 |
| mwxm | wav2vec2-xls-r-300m-marathi | 1k | 10.76 |

Table 7: Model performance on mr→hi task. Average BLEU scores are shown for the models which we trained with multiple seeds. Moving to the XLS-R model as encoder improved BLEU by 40% over the baseline. Complete results are in Table 13.
vallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.
Antonios Anastasopoulos, Loic Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondrej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Esteve, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, David Javorsky, Vera Kloudova, Surafel Melaku Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nădejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stuker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alex Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the iwslt 2022 evaluation campaign. In IWSLT 2022.
Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. 2021. Xls-r: Self-supervised cross-lingual speech representation learning at scale.
Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations.
Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2018. Low-resource speech-to-text translation. arXiv preprint arXiv:1803.09164.
Joydeep Bhattacharjee. 2022. Xls-r marathi pretrained model. https://huggingface.co/infinitejoy/wav2vec2-large-xls-r-300m-marathi-cv8. Accessed: 2023-04-15.
Nidhir Bhavsar. 2022. French paraphrase model. https://huggingface.co/enimai/mbart-large-50-paraphrase-finetuned-for-fr. Accessed: 2023-04-12.
Nidhir Bhavsar, Rishikesh Devanathan, Aakash Bhatnagar, Muskaan Singh, Petr Motlicek, and Tirthankar Ghosal. 2022. Team innovators at SemEval-2022 for task 8: Multi-task training with hyperpartisan and semantic relation for multi-lingual news article similarity. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 1163–1170, Seattle, United States. Association for Computational Linguistics.
Marcely Zanon Boito, Fethi Bougares, Florentin Barbier, Souhir Gahbiche, Loïc Barrault, Mickael Rouvier, and Yannick Estéve. 2022a. Speech resources in the tamasheq language. Language Resources and Evaluation Conference (LREC).
Marcely Zanon Boito, John Ortega, Hugo Riguidel, Antoine Laurent, Loïc Barrault, Fethi Bougares, Firas Chaabani, Ha Nguyen, Florentin Barbier, Souhir Gahbiche, et al. 2022b. On-trac consortium systems for the iwslt 2022 dialect and low-resource speech translation tasks. IWSLT.
Harveen Singh Chadha, Anirudh Gupta, Priyanshi Shah, Neeraj Chhimwal, Ankur Dhuriya, Rishabh Gaur, and Vivek Raghavan. 2022. Vakyansh: Asr toolkit for low resource indic languages.
Yao-Fei Cheng, Hung-Shin Lee, and Hsin-Min Wang. 2021. Allost: Low-resource speech translation without source transcription. arXiv preprint arXiv:2105.00171.
Anjuli Kannan, Yonghui Wu, Patrick Nguyen, Tara N Sainath, Zhijeng Chen, and Rohit Prabhavalkar. 2018. An analysis of incorporating an external language model into a sequence-to-sequence model. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5828. IEEE.
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35.
Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.
Yuchen Liu, Long Zhou, Yining Wang, Yang Zhao, Jiajun Zhang, and Chengqing Zong. 2018. A comparable study on model averaging, ensembling and reranking in nmt. In Natural Language Processing and Chinese Computing.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization.
Chenggang Mi, Lei Xie, and Yanning Zhang. 2022. Improving data augmentation for low resource speech-to-text translation with diverse paraphrasing. Neural Networks, 148:194–205.
Language Processing LLP Panlingua. 2023. Dataset for marathi-hindi speech translation shared task@iwslt-2023. Contributor/©holder: Panlingua Language Processing LLP, India and Insight Centre for Data Analytics, Data Science Institute, University of Galway, Ireland.
Mihaela C Stoian, Sameer Bansal, and Sharon Goldwater. 2020. Analyzing asr pretraining for low-resource speech-to-text translation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7909–7913. IEEE.
NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2022. No language left behind: Scaling human-centered machine translation.

A Appendix
A.1 Hyperparameters and Computing Resource
• encoder
  – n layers: 6
  – hidden dim: 1024 for mr-hi xls-r model, 768 for tmh-fra model and other mr-hi model
  – n head: 12
  – activation: gelu
• decoder
  – n layers: 2
  – hidden dim: 256
  – n head: 4
  – activation: gelu
• training
  – optimizer: AdamW (Loshchilov and Hutter, 2019)
  – lr: 1e-3
  – encoder lr: 1e-5
  – label smoothing: 0.1
  – batch size: 4
• computing resource: AWS g5.12xlarge instance (4x NVIDIA A10G Tensor Core GPUs)

A.2 Full Results
| # | Data | Data Augmentation | Vocab size | Loss | Seed | Test2022 BLEU |
|---|---|---|---|---|---|---|
| cb1 | clean | baseline | 1k | baseline | v1 | 8.98 |
| cb2 | clean | baseline | 1k | baseline | v2 | 8.91 |
| cb3 | clean | baseline | 1k | baseline | v3 | 8.82 |
| cb4 | clean | baseline | 1k | baseline | v4 | 8.69 |
| fb1 | full | baseline | 1k | baseline | v1 | 9.53 |
| fb2 | full | baseline | 1k | baseline | v2 | 9.10 |
| fb3 | full | baseline | 1k | baseline | v3 | 9.21 |
| fb4 | full | baseline | 1k | baseline | v4 | 9.17 |
| # | Model | Vocab size | Seed | Validation BLEU |
|---|---|---|---|---|
| mwbm1k1 | wav2vec2-base-marathi | 1k | v1 | 13.19 |
| mwbm1k2 | wav2vec2-base-marathi | 1k | v2 | 13.15 |
| mwbm1k3 | wav2vec2-base-marathi | 1k | v3 | 13.39 |
| mwbm1k4 | wav2vec2-base-marathi | 1k | v4 | 13.01 |
| mwbm3k1 | wav2vec2-base-marathi | 3k | v1 | 11.63 |
| mwbm3k2 | wav2vec2-base-marathi | 3k | v2 | 11.71 |
| mwbm3k3 | wav2vec2-base-marathi | 3k | v3 | 11.80 |
| mwbm3k4 | wav2vec2-base-marathi | 3k | v4 | 12.26 |
| mwx1 | wav2vec2-xls-r-300m | 1k | v1 | 16.31 |
| mwx2 | wav2vec2-xls-r-300m | 1k | v2 | 15.35 |
| mwx3 | wav2vec2-xls-r-300m | 1k | v4 | 16.09 |
| mwx4 | wav2vec2-xls-r-300m | 1k | v4 | 16.00 |
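
The per-seed scores above can be related to the averages in Table 7 with a few lines of Python. This is only a worked check of the arithmetic, assuming a plain mean over seeds and a relative gain of the form (new − baseline) / baseline; the variable and function names are made up for the example.

```python
from statistics import mean

# Validation BLEU per seed, copied from the per-seed table above.
seed_bleu = {
    "mwbm1k": [13.19, 13.15, 13.39, 13.01],
    "mwbm3k": [11.63, 11.71, 11.80, 12.26],
    "mwx": [16.31, 15.35, 16.09, 16.00],
}
BASELINE_BLEU = 11.41  # mwb (wav2vec2-base-960h) as reported in Table 7


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


averages = {model: mean(scores) for model, scores in seed_bleu.items()}
xlsr_gain = relative_improvement(averages["mwx"], BASELINE_BLEU)

if __name__ == "__main__":
    for model, avg in averages.items():
        print(f"{model}: {avg:.4f}")
    print(f"XLS-R gain over baseline: {xlsr_gain:.1%}")
```

The three means come out at roughly 13.185, 11.85 and 15.9375, in line with the 13.19, 11.85 and 15.94 listed in Table 7, and the XLS-R gain of just under 40% matches the improvement quoted in its caption.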
Speech Translation with Style: AppTek’s Submissions to the IWSLT
Subtitling and Formality Tracks in 2023
Abstract
AppTek participated in the subtitling and formality tracks of the IWSLT 2023 evaluation. This paper describes the details of our subtitling pipeline - speech segmentation, speech recognition, punctuation prediction and inverse text normalization, text machine translation and direct speech-to-text translation, intelligent line segmentation - and how we make use of the provided subtitling-specific data in training and fine-tuning. The evaluation results show that our final submissions are competitive, in particular outperforming the submissions by other participants by 5% absolute as measured by the SubER subtitle quality metric. For the formality track, we participated with our En-Ru and En-Pt production models, which support formality control via prefix tokens. Except for informal Portuguese, we achieved near perfect formality level accuracy while at the same time offering high general translation quality.

1 Introduction
This paper presents AppTek’s submissions to the subtitling and formality tracks of the IWSLT 2023 evaluation campaign. In the subtitling track, we participate in constrained and unconstrained conditions and in both language pairs English-to-German (En-De) and English-to-Spanish (En-Es). In the formality track, we participate in the zero-shot unconstrained condition for English-to-Portuguese (En-Pt) and English-to-Russian (En-Ru).
This paper is organized as follows: Section 2 briefly describes our data preparation. Section 3 presents AppTek’s pipeline for subtitle translation. Its different components, namely audio segmentation, speech translation (ST), automatic speech recognition (ASR), machine translation (MT) models, and our subtitle segmentation algorithm are described in Sections 3.1-3.5. Section 3.6 contains experiments and an analysis of our subtitling systems. Section 4 presents AppTek’s approach to formality-controlled machine translation. Finally, Section 4.1 shows the results of our formality track submission.

2 Data Preparation
2.1 Text Data
We use all of the allowed “speech-to-text parallel” and “text-parallel” data, including Europarl, Europarl-ST, News Commentary, CORDIS News, Tatoeba, TED2020, IWSLT TED, MuST-C v3, CoVoST v2, and OpenSubtitles.1 We apply common parallel data filtering steps based on language identification, sentence length ratios between source and target sentences, and additional heuristics. After filtering, we obtain 13.5M sentence pairs with 152M running words (counted on the English side) for En-De and 16.5M sentence pairs with 183M words for En-Es.
Next, we clone this data and process the En side of the clone with our text normalization tool NEWTN. It implements elaborate regular expressions to convert numbers, dates, monetary amounts, and other entities with digits into their spoken form. It is also used to remove punctuation and word case information. After training on such source data, our MT systems are able to directly translate from raw ASR output that lacks punctuation and casing into properly formatted written target language text.
For the parallel corpora which have document labels, we also create a version in which we concatenate two subsequent sentences from the same document using a separator symbol. Our past experience shows that adding such data is beneficial even if we do not add the context of the previous sentence at inference time.
Finally, for each language pair, we extract about 4M words of bilingual phrases (based on unsupervised word alignment) as additional training “sen-

∗ equal contribution
1 The filtered version provided by the track organizers.
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 251–260
July 13-14, 2023 ©2023 Association for Computational Linguistics
tence” pairs to make sure that the MT system can cope well with incomplete sentences or too fine-grained automatic sentence segmentation.

2.2 Speech Data
We use all the allowed datasets marked as “speech” and “speech-to-text parallel”, including Europarl-ST, How2, MuST-C, TED-LIUM, LibriSpeech, Mozilla Common Voice, VoxPopuli, CoVoST, and IWSLT TED. After removing very short (< 0.1s) and long (> 120s) segments, we obtain about 3590 hours of speech with transcripts. From each dataset, we only take the train sets, where applicable. The English text is processed to be lower-cased and punctuation-free using NEWTN, and split into 10k byte-pair-encoding (BPE) tokens (Sennrich et al., 2016).

2.3 Direct Speech Translation Data
All data marked as “speech-to-text parallel”, i.e. Europarl-ST, MuST-C, CoVoST, and IWSLT TED – except MuST-Cinema – is utilized for direct speech translation. It results in a total of approximately 1220 hours of speech with transcripts and corresponding translations after only keeping segments between 0.1 and 120 seconds. As for our data processing, on the English text we carried out the same scheme as for speech data, while for German we followed almost the same data processing scheme as described in Section 2.1, plus tokenization using the Moses toolkit (Koehn et al., 2007). Then 10k and 20k BPEs are used on the English and German texts, respectively. The dev set for the direct model is chosen to be the concatenation of the IWSLT dev2010, MuST-C, Europarl-ST, and CoVoST dev sets, resulting in a large dev set of 33 hours.

2.3.1 Synthetic Data
To leverage more training data for our direct model, we translate the English transcripts of the allowed “speech” data (Jia et al., 2019) using our constrained machine translation model described in Section 3.4 with output length control “short” (Wilken and Matusov, 2022). Combining the real ST data with the synthetic data, we obtain about 4100 hours of translated-speech parallel utterances.

3 Subtitle Translation
3.1 Audio Segmentation
We use the SHAS method (Tsiamas et al., 2022) for audio segmentation. SHAS scores every audio frame with a binary classifier (speech/no-speech), followed by a probabilistic divide-and-conquer (pDAC) algorithm that iteratively splits audio at the positions with the lowest probability of the speech class. For the unconstrained condition, we use the English segmentation model published by the authors of SHAS, which is an XLS-R 300M model (Babu et al., 2022) fine-tuned for the frame classification task on the MuST-C train set. For the constrained condition, we train our own frame classifier with Wav2Vec2 (Baevski et al., 2020), pre-trained on LibriSpeech, followed by fine-tuning for the frame classification task using MuST-C.
A hyper-parameter search was conducted to find the number of layers (constrained model), as well as the inference parameters (max. segment length and pDAC threshold) that optimize the performance of the downstream speech translation pipeline. We found that the pDAC threshold, which is the minimum probability required to keep a frame, has significant effects on the translation quality, and that the optimal value can vary depending on the task and acoustic conditions.

3.2 Direct Speech Translation
3.2.1 Attention Encoder-Decoder
We train an attention-based model (Bahdanau et al., 2015) composed of a Conformer encoder (Gulati et al., 2020) and a Transformer decoder (Vaswani et al., 2017). The encoder consists of 12 layers with a size of 512, a feed-forward size of 2048, and 8 heads, whereas the decoder has 6 layers with the same hidden size and number of heads. For fast yet stable convergence, we apply a layer-wise network construction scheme (Zeyer et al., 2018, 2019). Specifically, we start with 2 layers of halved hidden dimensions in both encoder and decoder (18M parameters) and linearly scale the model depth and width to full size (125M parameters) in the first 5 sub-epochs, where each sub-epoch is one-twentieth of the whole training data. Also, L2-norm regularization and dropout are scaled up from 0 to 0.0001 and 0.1 respectively. Label smoothing is enabled only afterwards. We apply Adam (Kingma and Ba, 2015) with an initial learning rate of 0.0005 and dynamic learning rate scheduling based on dev set loss.
Audio log mel 80-dimensional features are extracted every 10ms. The first layer of the Conformer is composed of 2 convolution layers with strides of 3 and 2 over time, giving a reduction factor of 6. We use SpecAugment (Park et al., 2019; Bahar et al., 2019b) and speed perturbation in a random interval of [0.9, 1.1] as data augmentation. In order
to train a single direct speech translation model that also supports time alignment between the source label sequence and time frames, we add the source CTC loss (Graves et al., 2006; Kim et al., 2017; Bahar et al., 2019a) on top of the encoder in training.
We also add a second shallow 1-layer Transformer decoder (with 14M parameters) in order to generate better source transcripts for time alignment. Given this network with a shared speech encoder and two independent decoders, multi-task learning is employed to train all model parameters jointly. The final objective function is computed as a sum of the 3 losses (source CTC, source enc-dec, and target enc-dec).

3.2.2 Forced Alignment
CTC relies on Viterbi alignment to obtain the best path going through the source token at position n at time frame t. It is therefore possible to obtain word timings from CTC which can be used for subtitle generation. To do so, we first generate the source transcripts using the source decoder of the network and then use them to run forced alignment on the CTC output. The model’s alignments are at the BPE level; we therefore combine the timings of all subwords belonging to a word to obtain the final word-level timestamps.
We experimented with this approach and were able to generate accurate timestamps appropriate for creating subtitles in the source language. However, as we decided against using the source template approach for the constrained systems (see Section 3.5), only the timings of the first and last word in a segment are used for the target subtitles of the constrained submission. We plan to explore how to make better use of the CTC timings from this model in future experiments. In particular, we plan to add silence modeling to obtain information about pauses within speech segments, which can then be reflected in the subtitle timings.

3.3 Automatic Speech Recognition
Constrained. We train a Conformer-Transformer model for the constrained task mainly following Section 3.2.1 using 3590 hours of speech. Layer-wise network construction, SpecAugment, and CTC loss are applied. Since the model is not trained for multiple tasks (no additional decoder is added), it has better performance in terms of WER compared to the source decoder part of the ST model. The final checkpoint achieves a WER of 9.6% on the concatenated dev set of 33h.
Unconstrained. We train an attention-based encoder-decoder model to run ASR decoding and also a CTC model which is used to generate word timings by force-aligning the audio with the decoded hypotheses. Here, the CTC model uses an explicit word boundary <space> symbol between words. It serves as silence modeling. Both models are trained on the same training set of 15K hours of speech mixing publicly available data with a commercial license and in-house data.
The 185M-parameter attention-based model uses a 31-layer Conformer encoder of hidden size 384; 8 heads with 64 dimensions per head; Macaron-style (Lu et al., 2019) feed-forward layers with size 2048; convolutional layers with 1024 channels and kernel size 31. The decoder is a single-headed attention-based model (Tüske et al., 2020), and consists of 4 stacked projected long short-term memory (pLSTM) recurrent layers with layer size 2048 (Hochreiter and Schmidhuber, 1997; Sak et al., 2014). The first two LSTMs operate on the embedding of the label sequence only. The other two decoder LSTM layers also process the acoustic information extracted by the encoder using a single-head, additive, location-aware cross-attention. The decoder predicts 1K BPE units. Decoding is done using an external neural LM consisting of 4 stacked LSTM layers of size 3072 with the same output vocabulary as the ASR models. The 273M-parameter language model is trained on 2.4B running words segmented to BPE units. The language model data are selected from a wide range of domains, e.g. books, movies, news, reviews, Wikipedia, talks, etc. ASR transcription is obtained after decoding with beam search limited to 16 hypotheses without any vocabulary constraints. The CTC model uses the same encoder structure as the attention-based model.

3.4 Machine Translation
3.4.1 Unconstrained Condition
For the unconstrained subtitling pipeline we use AppTek’s production MT systems which have been trained on large amounts of parallel data, mostly from the OPUS collection (Tiedemann, 2012). Both En-De and En-Es systems are Transformer Big systems that support additional API parameters which can in particular control the genre (e.g. patents, news articles, dialogs) and length (automatic, short, long, etc.). The control is implemented via pseudo-tokens in the beginning of the source or target sentence (Matusov et al., 2020).
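As a rough illustration of pseudo-token control, the sketch below prepends genre and length tags to a source sentence before it would be passed to an MT model. The tag format (`<genre:...>`, `<length:...>`) and the function name `add_control_tokens` are assumptions made for this example only; the actual token inventory of AppTek's systems is not specified here.

```python
from typing import Optional


def add_control_tokens(
    source: str,
    genre: Optional[str] = None,
    length: Optional[str] = None,
) -> str:
    """Prepend hypothetical control pseudo-tokens to a source sentence.

    The tag spelling is illustrative; a real system defines its own
    reserved vocabulary entries for each control value.
    """
    tokens = []
    if genre is not None:
        tokens.append(f"<genre:{genre}>")
    if length is not None:
        tokens.append(f"<length:{length}>")
    # The model sees the tags as ordinary leading tokens of the input.
    return " ".join(tokens + [source])


tagged = add_control_tokens("so what do we do now", genre="dialogs", length="short")
print(tagged)  # <genre:dialogs> <length:short> so what do we do now
```

Because the tags are ordinary input tokens, the same trained model can serve several genres and length settings, and omitting a tag leaves that dimension uncontrolled.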
For the IWSLT experiments, we set the genre to “dialogs” because it best reflects the spontaneous spoken style in the dev 2023 data. When not mentioned otherwise, we set the length to “short”. This yields more condensed translations, similar to how human subtitlers would translate to comply with a given reading speed limit.

3.4.2 Constrained Condition
For the constrained condition we use the parallel training data prepared as described in Section 2.1. As the dev data for learning rate control, we use the Europarl-ST and MuST-C dev sets.
Our MT model is a variant of the Transformer Big model (Vaswani et al., 2017) with additional encoder layers and using relative positional encoding (Shaw et al., 2018). We use a batch size of 800 words, but the effective batch size is increased by accumulating gradients over 8 batches. We add the same length control feature as for the unconstrained system by classifying the training data into 5 bins of target-to-source length ratios and adding the class label as a target-side prefix token.
We apply SentencePiece (Kudo and Richardson, 2018) segmentation with a vocabulary size of 10K for En and 20K for De/Es and use a translation factor to predict the casing of the target words (Wilken and Matusov, 2019). Our MT models have been trained for 100 sub-epochs with 1M lines in each; thus, all of the prepared data has been observed in training 1-3 times. For each sub-epoch, we select sentence pairs proportionally to the following distribution and then randomly mix them:

20% Europarl and Europarl-ST data
20% TED data (MuST-C, IWSLT, TED2020)
20% OpenSubtitles (other)
10% News (Commentary+CORDIS), Tatoeba, CoVoST
15% Concatenated neighboring sentence pairs²
5% OpenSubtitles (documentaries)
5% OpenSubtitles (sports)
5% Bilingual phrases

² See Section 2.1.

3.4.3 Length ROVER
For all final submissions, we optimize the length control of MT by using a length ROVER (Wilken and Matusov, 2022). For each segment we create 3 translations: without forcing the target-side length token, forcing length bin 2 ("short"), and forcing length bin 1 ("extra short"). From those translations we select the first one, in the order above, that has a target-to-source character ratio of less than 1.1. This is motivated by the fact that translations need to be fitted into the source subtitle template (Section 3.5.1). We note that the reading speed compliance of our submission could have been increased even further by exploiting timing information to select the MT length variants.

System            MuST-C   TED    EPTV   ITV    Peloton
English-to-German
unconstrained     33.7     27.1   19.0   30.6   23.9
+ fine-tuning     35.0     27.7   20.3   31.0   24.4
constrained       32.3     34.2   18.4   27.2   20.3
+ fine-tuning     32.9     –      19.0   28.1   21.5
English-to-Spanish
baseline          37.2     46.1   34.1   24.5   23.6
+ fine-tuning     38.2     46.4   34.8   25.5   24.7

Table 1: BLEU scores in % for text-only MT fine-tuning experiments on the MuST-C tst-COMMON set and on AppTek’s aligned subsets of the 2023 subtitling track dev data.

3.4.4 Fine-tuning Experiments
For our fine-tuning experiments, we first select “in-domain” training data in terms of similarity to the seed data (the dev 2023 set) from the real parallel data, as well as the synthetic data described in Section 2.3.1. The selection is done by clustering distributed sentence representations in the embedding space, and then keeping sentence pairs from the clusters which correspond to the seed data clusters. This is done considering both source and target seed data sentences, but independently, so that no sentence-level alignment of seed data is necessary. For details on this data selection method, please refer to our 2020 submission to the offline speech translation track (Bahar et al., 2020). With this method, we create two versions of the in-domain data: one using all 4 parts of the dev 2023 set as seed data (in-domain A: En-De: 1.9M lines, 27M En words; En-Es: 1.7M lines, 25M words), and one, for En-De only, using just ITV and Peloton dev 2023 parts as seed data (in-domain B: 1.5M lines, 20M words).
We then use the dev 2023 set as a dev set in fine-tuning of the MT model for learning rate control. Since the dev 2023 data is not aligned at sentence level, but is available as (in part) independently created subtitle files, we had to sentence-align it. To do so, we first extracted full sentences from the English subtitles based on sentence-final punctuation marks, translated these sentences with the (constrained) baseline MT, and then re-segmented the target side into sentences that match the source sentences using Levenshtein alignment as implemented by the SubER tool (Wilken et al., 2022). The source-target segments obtained this way are kept in the final dev set only if the BERT F-score (Zhang et al., 2019) for a given pair is > 0.5 for TED, EPTV, and Peloton sets and > 0.55 for the ITV set. With this method, the obtained dev set contains 7645 sentence-like units with 27.7K words for TED, 2.3K for EPTV, 20.7K for Peloton, and 13.9K for ITV.
We perform fine-tuning for up to 20 sub-epochs ranging in size from 100K to 400K sentence pairs using a small learning rate between $10^{-6}$ and $10^{-5}$, and select the best configuration for each of the four dev 2023 domains.
The fine-tuning results are shown in Table 1. Despite the fact that no real in-domain data, not even the dev 2023 set, is used as training data in fine-tuning, we are able to improve MT quality in terms of BLEU scores (Papineni et al., 2002; Post, 2018), as well as BERT and other scores skipped due to space constraints. The improvements are more pronounced for the constrained system, but the absolute scores are generally better with the unconstrained setup³. However, since the TED talk and Europarl domains are covered well in the data allowed for the constrained condition, the difference between our unconstrained and constrained system for the TED and EPTV domains is small. It is worth noting that for ITV and Peloton domains we could only improve MT quality by fine-tuning on the in-domain B set that did not include any TED-related data, and also not using any TED or EPTV dev data for learning rate control.

³ The BLEU score of the constrained system on the En-De TED part is higher because, as we found out shortly before submission, some of the dev 2023 TED talks were part of the allowed TED2020 training corpus. Hence, further fine-tuning did not help for this system on this set. The unconstrained system had not been trained on this corpus.

3.5 Subtitle Creation
3.5.1 Source Template Approach
To create subtitle files from translation hypotheses, the text has to be segmented into blocks with start/end time information. One challenge is to transfer timings extracted from the source speech to the target subtitles. An approach to generate timings that is also used in human subtitling workflows (Georgakopoulou, 2019) is to first create subtitles in the source language (a so-called subtitle template) and to keep the same subtitle blocks during translation. This creates a nice viewing experience, since subtitles appear on the screen only during the actual speech. However, the source template constraints might be sub-optimal in terms of target language reading speed.
We use the source template approach for the unconstrained submission. To create subtitles in the original language of the videos (English), we start with a timed word list provided by the ASR system. We train a 3-layer bidirectional LSTM model (hidden size 256, embedding dim 128) to jointly add basic punctuation marks ( .,!? ) and casing information to the word list. As training data, we use 14M English sentences from the Gigaword and OpenSubtitles corpora. The model operates on full words and has two softmax output layers, one with the four punctuation tokens and "no punctuation" as target classes (to be added after the word), the other one with lower-cased, capitalized, all-upper, and mixed-cased classes as targets.
In addition, we train an inverse text normalization model to convert spoken forms of numbers, dates, currencies, etc. into the proper written form. This model is a Transformer Big trained on data where the source data is processed using our text normalization tool NEWTN, see Section 2.1. Applying it to the transcriptions helps MT to produce proper digits also on the target side. This has a slight positive effect on automatic scores (0.8% SubER for Peloton, only up to 0.4% for the other domains), but mainly helps subjectively perceived quality and also reduces the number of characters.
The resulting timed, punctuated, and cased word list is split into sentences using punctuation ( .!? ) and pauses between words longer than 3 seconds. Those are fed into a subtitle segmentation algorithm similar to the one described in (Matusov et al., 2019). Its core component is an LSTM segmentation model that is trained on English OpenSubtitles XML data, which includes subtitle block boundary information⁴, to estimate the probability of a subtitle break after each word of a given input sentence. Within a beam search framework, this model is combined with hard subtitling constraints such as the character limit per line to create valid subtitles. Here, we adjust it for the creation of subtitles from timed words by including minimum and maximum subtitle duration as constraints, and not forcing any predefined number of subtitles.

⁴ https://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/xml/en.zip

After segmentation, we use the start time of the
first word and the end time of the last word in each subtitle block as the subtitle start and end time. The subtitle template defined this way is then translated using the fine-tuned MT system described in Section 3.4.4, employing the length ROVER (Section 3.4.3) to avoid long translations that do not fit the template. Sentences as defined above are used as translation units; note that they may span several subtitle blocks. To insert the translations back into the template, we again apply the subtitle segmentation algorithm, this time with the exact settings as in (Matusov et al., 2019).

3.5.2 Template-Free Approach
By definition, the source template approach is not desirable for direct speech translation without intermediate source text representation. Also, the constrained condition does not include English OpenSubtitles data with subtitle breaks. We hence fall back to a simpler subtitle creation approach for our constrained direct and cascade systems. We use the segments provided by the audio segmenter as translation units. For the cascade system, we translate the transcription of each segment with the fine-tuned constrained MT, also using the length ROVER (Section 3.4.3). End-of-line and end-of-block tokens are inserted into the translated text of each segment using the subtitle segmentation algorithm configured similarly to the case of template creation in the previous section but without duration-based constraints. Timestamps for the additional subtitle block boundaries are then created by linearly interpolating the audio segment timings according to character count ratios. Assuming the translation of an audio segment with start time $T_{\text{start}}$ and end time $T_{\text{end}}$ is split into $N$ blocks with $c_1, \ldots, c_N$ characters, respectively, the start time of block $n$ is set to

$$T_{\text{start}} + (T_{\text{end}} - T_{\text{start}}) \cdot \frac{\sum_{n'=1}^{n-1} c_{n'}}{\sum_{n'=1}^{N} c_{n'}}.$$

This method leads to reasonable timings in most cases but can create temporary time shifts between speech and subtitles inside long audio segments.

3.5.3 Subtitle Post-Processing
To all subtitles, we apply a final post-processing that splits rare cases of subtitles with more than 2 lines (same segmentation method as for the template-free approach) and shifts subtitle end times to later in time if needed to comply with the maximum reading speed of 21 characters per second. The latter is only possible if there is a large enough gap after a given subtitle and will therefore not guarantee low enough reading speed in all cases.

3.6 Results
We first decide which audio segmentation to use based on dev set results using our final ASR and MT unconstrained systems. We set different pDAC thresholds for the unconstrained SHAS (0.31, 0.50, and 0.71) and compare them with an in-house segmenter optimized for ASR. The results in Table 2 show that a low threshold of 0.31 leads to better translations overall. There is however variation depending on the domain: it is 1.3 BLEU points worse than SHAS 0.50 on TED, but as good or up to 1.7 BLEU points better in all other domains. Results for ITV are highly sensitive to the threshold. We attribute this to the fact that in TV series speech is often mixed with music and other sounds and a lower threshold is required not to miss speech segments. Given these results, we use SHAS 0.31 as our segmenter for unconstrained experiments. For the constrained experiments, we use SHAS 0.31 everywhere except on TED, where we use SHAS 0.50.

system       TED    EPTV   Peloton   ITV
SHAS 0.31    21.1   14.9   12.1      15.6
SHAS 0.50    22.4   14.9   11.6      13.9
SHAS 0.71    20.8   14.6   10.8      10.7
ASR Segm.    19.8   14.8   11.3      13.5

Table 2: Impact of different segmentation schemes on the translation quality (BLEU in %).

Table 3 compares the performance of the final constrained cascade (separate ASR + MT) and direct En-De subtitling systems as well as the unconstrained cascade system. All metrics are computed using the SubER tool⁵ (Wilken et al., 2022) directly on subtitle files. To calculate the BLEU and chrF (Popović, 2015) metrics, it performs an alignment of hypothesis to reference sentences similar to (Matusov et al., 2005). On all metrics, the constrained cascade system outperforms our direct model. We observe imperfections in the direct model’s output such as repetitions. This can be partially attributed to the fact that it has been trained jointly for 3 tasks, leading to sub-optimal optimization for the final translation process. The lack of length control of our direct ST model is another reason for the gap between the two constrained systems. For the cascade systems, we find length control via the length ROVER to be crucial, giving consistent improvements of 4 to 5% points in SubER compared to no length control at all.

⁵ https://github.com/apptek/SubER

As seen in Table 3, the unconstrained system outperforms both constrained systems except on the TED set. This is due to a data overlap: some TED talks present in the dev set have also been part of the constrained training data. To analyze the impact of the source template approach we re-create the subtitles of the unconstrained system using the template-free approach. We find that this deteriorates the SubER scores for TED, Peloton and ITV by 0.7, 3.6 and 3.8% points, respectively, while actually giving better results for EPTV by 0.7%. In general, the results in Table 3 show a higher automatic subtitling quality for the TED domain, which represents the case of well recorded and prepared speech, but also show the need to focus research on harder conditions such as interviews and TV series. Table 4 contains the scores we are able to achieve for En-Es under constrained conditions. Also here, acceptable subtitle quality can only be reached for TED and EPTV content, but not for the more challenging Peloton and ITV content.

system     constr.   SubER (↓)   BLEU   chrF
TED
cascade    yes       63.0        26.0   53.9
direct     yes       75.9        17.1   47.6
cascade    no        64.3        22.1   51.0
EPTV
cascade    yes       78.7        13.5   45.2
direct     yes       85.1        10.9   42.6
cascade    no        75.8        14.8   44.1
Peloton
cascade    yes       87.6        9.9    32.0
direct     yes       86.1        6.8    26.9
cascade    no        71.9        11.6   34.3
ITV
cascade    yes       83.6        8.5    26.1
direct     yes       90.9        5.7    21.0
cascade    no        71.4        14.8   35.2

Table 3: En-De subtitle translation results in % (constrained and unconstrained setting) on the dev2023 sets.

Domain     SubER (↓)   BLEU (↑)   chrF (↑)
TED        48.8        37.8       61.8
EPTV       70.2        20.4       50.6
Peloton    79.0        12.2       36.2
ITV        82.1        9.2        26.8

Table 4: Subtitle translation results in % on the dev2023 sets for En-Es via the constrained cascade system.

4 Formality Control
AppTek’s production systems support formality or, as we call it, style control for selected language pairs (Matusov et al., 2020). This year, we decided to test these systems in the unconstrained condition of the IWSLT formality track for En-Pt and En-Ru. Each of these two systems is trained in a Transformer Big setup (Vaswani et al., 2017). The formality level is encoded with a pseudo-token in the beginning of each training source sentence with one of 3 values: formal, informal, no style. The system is trained on large public data from the OPUS collection (Tiedemann, 2012) that has been partitioned into the 3 style classes as follows.
First, we write a sequence of regular expressions for the target language (in this case, European Pt and Ru) which try to match sentences containing formal or informal features. Thus, for Russian, we try to match either the formal or informal second-person pronoun that corresponds to English “you”, including their possessive forms. For Portuguese, we additionally match the forms of most common verbs which agree with the corresponding pronoun. The regex list for Russian is given in Table 5.⁶ Each list of regular expressions uses standard regex syntax and makes either case-sensitive or insensitive matches. For each sentence pair from the parallel data, the regex list is processed from top to bottom. As soon as a match in the target sentence is found, the FORMAL or INFORMAL label is assigned to the sentence pair. The sentence pair is labeled with NO_STYLE if there is no match.

⁶ We released the En-Pt and En-Ru lists of regular expressions as part of our evaluation submission.

If document information is available and at least 5% of the document sentence pairs are labeled as formal/informal according to the regex rules (with no sentences labeled with the opposite class), then all of the sentence pairs in the document are assigned the corresponding label. Such data is useful to model stylistic traits which are not limited to the choice of second-person pronouns. Note that document annotations are available for some of the IWSLT data, including TED talks, OpenSubtitles (each subtitle file corresponds to a document), individual sessions of European Parliament, etc.
We further smooth the three style classes to ensure that e.g., sentences containing second-person pronouns can be translated well even when no style is specified at inference time. To this end, 5 to 8% of sentence pairs which had been assigned to one of the 3 style classes as described above are randomly re-assigned to one of the other two classes.
For En-Ru, the training data that had been partitioned into style classes in this way included about
40M sentence pairs. At the time this model was trained in early 2022, the larger CCMatrix corpus (Schwenk et al., 2021) was not included. For En-Pt, we did use a filtered version of CCMatrix in training, so that the total number of parallel sentence pairs was 140M. The filtering of CCMatrix and other large crawled data included removing sentence pairs with low cross-lingual sentence embedding similarity as given by the LaBSE scores (Feng et al., 2022). All of our parallel training data is also filtered based on sentence-level language identification scores and other heuristics.
When training the Transformer Big model, we balanced the contribution of formal, informal, and “no style” data by adding them in equal proportions (number of lines) to each sub-epoch.

INFORMAL   IGNORECASE   \b(ты|теб[яе]|тобой|тво[йеёяю]|твоей|твоего|твоему|твоим|тво[ёе]м)\b
FORMAL     IGNORECASE   \b(вы|вами?|ваш[ае]?|вашей|вашего|вашему?|вашу|вас|вашим)\b

Table 5: The regular expressions used to partition En-Ru training data into formal, informal, and (in case of no match) “no style” classes.

4.1 Results
We did not perform any experiments, but just set the API parameter style=formal or style=informal and translated the evaluation data with AppTek’s production systems, trained as described above. The results in terms of automatic error metrics, as reported by the track organizers, are summarized in Table 6.

language pair   requested style   BLEU [%]   COMET    M-Acc [%]
En-Pt           formal            34.6       0.6089   99
                informal          42.4       0.6776   64
En-Ru           formal            35.4       0.6165   99
                informal          33.3       0.6026   98

Table 6: Automatic evaluation results for AppTek’s submission to the formality track of IWSLT 2023.

Among the 5 participants of the unconstrained condition, we obtain the best results for En-Ru in terms of BLEU and COMET (Rei et al., 2020), while producing the correct formality level for more than 98% of the sentences. The second-best competitor system obtains formality accuracy of 100%, but scores 1.7% absolute lower in BLEU for the formal and 0.9% BLEU absolute for the informal class. For En-Pt, our system scores second in terms of automatic MT quality metrics and correctly produced the formal style for 99% of the sentences in the evaluation data. However, when the informal style was requested, our system could generate it in only 64% of the cases. We attribute this low score to the imperfect regular expressions we defined for informal Portuguese pronouns and corresponding verb forms, since some of them are ambiguous. However, we find it difficult to explain that e.g. the BLEU score of AppTek’s “informal” MT output with respect to the informal reference is almost 8% absolute higher than for our “formal” output with respect to the formal reference. This may indicate that the human reference translation also has not always followed the requested style, the informal one in particular.

5 Conclusion
We described AppTek’s submissions to the subtitling and formality tracks of IWSLT 2023.
For the subtitling track, we obtained good results, outperforming the other two evaluation participants either with our constrained or unconstrained cascaded approach on all 4 domains. Part of this success is due to our subtitle creation process, in which we employ AppTek’s intelligent line segmentation models. However, the results varied by domain, with the domain of movie subtitles posing the most challenges for ASR, and the domain of fitness-related videos (Peloton) being hardest for MT. Yet our biggest overall challenge, especially for the direct (end-to-end) submission, was speech segmentation and creating sentence-like units, on real ITV movies in particular, in which there is music, background noise, and multiple speakers. In the future, we plan to improve this component of our speech translation technology. We also plan to include length control in our direct models, which proved to be an important factor for those applications with time constraints.
Our formality track participation was a one-shot attempt at a zero-shot task that showed the competitiveness of the formality control that we have implemented in AppTek’s production systems. However, our approach currently requires the creation of manual regular expression rules for partitioning the parallel training data into formality classes, and the participation in the IWSLT evaluation revealed some weaknesses of this approach for one of the involved target languages. In the future, we plan to further improve our approach, reducing or eliminating the need for writing rules.
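To make the rule-based partitioning of Section 4 concrete, the following is a minimal sketch of sentence-level labeling with the two Russian patterns from Table 5, followed by the document-level propagation rule (at least 5% labeled sentences and no sentence of the opposite class). The function names `label_sentence` and `label_document` and the parameter `min_fraction` are assumptions for illustration; the random smoothing step is omitted, and this is not AppTek's released implementation.

```python
import re
from typing import List

# Rules are tried from top to bottom, in the order given in Table 5.
RULES = [
    ("INFORMAL", re.compile(
        r"\b(ты|теб[яе]|тобой|тво[йеёяю]|твоей|твоего|твоему|твоим|тво[ёе]м)\b",
        re.IGNORECASE)),
    ("FORMAL", re.compile(
        r"\b(вы|вами?|ваш[ае]?|вашей|вашего|вашему?|вашу|вас|вашим)\b",
        re.IGNORECASE)),
]


def label_sentence(target: str) -> str:
    """Return the label of the first matching rule, else NO_STYLE."""
    for label, pattern in RULES:
        if pattern.search(target):
            return label
    return "NO_STYLE"


def label_document(targets: List[str], min_fraction: float = 0.05) -> List[str]:
    """Label each target sentence, then propagate a style to the whole
    document if enough sentences carry it and none carry the opposite one."""
    labels = [label_sentence(t) for t in targets]
    n = len(labels)
    if n == 0:
        return labels
    formal = labels.count("FORMAL")
    informal = labels.count("INFORMAL")
    if formal and not informal and formal / n >= min_fraction:
        return ["FORMAL"] * n
    if informal and not formal and informal / n >= min_fraction:
        return ["INFORMAL"] * n
    return labels


print(label_sentence("Как вы поживаете?"))  # FORMAL
print(label_document(["Что вы думаете?", "Это хорошо.", "Да."]))
```

In Python 3, `\b` and `re.IGNORECASE` operate on Unicode strings by default, so the Cyrillic patterns work without extra flags. A document containing both formal and informal matches is left with its per-sentence labels.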
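The linear interpolation of subtitle block start times from Section 3.5.2 can likewise be sketched in a few lines. Block n starts at the segment start plus the segment duration scaled by the share of characters in all preceding blocks. The function name `block_start_times` and its signature are assumptions for this example.

```python
from typing import List


def block_start_times(t_start: float, t_end: float, char_counts: List[int]) -> List[float]:
    """Interpolate start times of subtitle blocks inside one audio segment.

    char_counts[i] is the number of characters in block i of the translation.
    """
    total = sum(char_counts)
    if total <= 0:
        raise ValueError("the segment translation must contain characters")
    starts = []
    chars_before = 0  # characters in all blocks preceding the current one
    for count in char_counts:
        starts.append(t_start + (t_end - t_start) * chars_before / total)
        chars_before += count
    return starts


# A 10-second segment split into blocks of 10, 30 and 10 characters.
print(block_start_times(10.0, 20.0, [10, 30, 10]))  # [10.0, 12.0, 18.0]
```

The first block always starts at the segment start. Because the mapping depends only on character counts, a long pause inside the segment is not reflected, which is the source of the temporary time shifts mentioned in Section 3.5.2.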
References

Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. 2022. XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale. In Proc. Interspeech 2022, pages 2278–2282.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Parnia Bahar, Tobias Bieschke, and Hermann Ney. 2019a. A comparative study on end-to-end speech to text translation. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 792–799, Sentosa, Singapore.

Parnia Bahar, Patrick Wilken, Tamer Alkhouli, Andreas Guta, Pavel Golik, Evgeny Matusov, and Christian Herold. 2020. Start-before-end and end-to-end: Neural speech translation by AppTek and RWTH Aachen University. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 44–54.

Parnia Bahar, Albert Zeyer, Ralf Schlüter, and Hermann Ney. 2019b. On using SpecAugment for end-to-end speech translation. In International Workshop on Spoken Language Translation (IWSLT).

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 878–891, Dublin, Ireland. Association for Computational Linguistics.

Panayota Georgakopoulou. 2019. Template files: The holy grail of subtitling. Journal of Audiovisual Translation, 2(2):137–160.

Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In International Conference on Machine Learning (ICML), volume 148, pages 369–376, Pittsburgh, PA, USA.

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. 2020. Conformer: Convolution-augmented transformer for speech recognition. 21st Annual Conference of the International Speech Communication Association (INTERSPEECH), pages 5036–5040.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

Ye Jia, Melvin Johnson, Wolfgang Macherey, Ron J. Weiss, Yuan Cao, Chung-Cheng Chiu, Naveen Ari, Stella Laurenzo, and Yonghui Wu. 2019. Leveraging weakly supervised data to improve end-to-end speech-to-text translation. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, Brighton, United Kingdom, May 12-17, 2019, pages 7180–7184. IEEE.

Suyoun Kim, Takaaki Hori, and Shinji Watanabe. 2017. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, pages 4835–4839, New Orleans, LA, USA.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71.

Yiping Lu, Zhuohan Li, Di He, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, and Tie-yan Liu. 2019. Understanding and improving transformer from a multi-particle dynamic system point of view. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations.

Evgeny Matusov, Gregor Leusch, Oliver Bender, and Hermann Ney. 2005. Evaluating machine translation output with automatic sentence segmentation. In International Workshop on Spoken Language Translation, pages 148–154, Pittsburgh, PA, USA.

Evgeny Matusov, Patrick Wilken, and Yota Georgakopoulou. 2019. Customizing neural machine translation for subtitling. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 82–93, Florence, Italy. Association for Computational Linguistics.
Evgeny Matusov, Patrick Wilken, and Christian Herold. In Proceedings of the 2018 Conference of the North
2020. Flexible customization of a single neural American Chapter of the Association for Computa-
machine translation system with multi-dimensional tional Linguistics: Human Language Technologies,
metadata inputs. In Proceedings of the 14th Confer- Volume 2 (Short Papers), pages 464–468.
ence of the Association for Machine Translation in
the Americas (Volume 2: User Track), pages 204– Jörg Tiedemann. 2012. Parallel data, tools and inter-
216, Virtual. Association for Machine Translation in faces in OPUS. In Proceedings of the Eighth In-
the Americas. ternational Conference on Language Resources and
Evaluation (LREC’12), pages 2214–2218, Istanbul,
Kishore Papineni, Salim Roukos, Todd Ward, and Wei- Turkey. European Language Resources Association
Jing Zhu. 2002. Bleu: a method for automatic evalu- (ELRA).
ation of machine translation. In Proceedings of the
40th Annual Meeting of the Association for Compu- Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonol-
tational Linguistics, pages 311–318, Philadelphia, losa, and Marta R. Costa-jussà. 2022. SHAS: Ap-
Pennsylvania, USA. Association for Computational proaching optimal Segmentation for End-to-End
Linguistics. Speech Translation. In Proc. Interspeech 2022, pages
106–110.
Daniel S Park, William Chan, Yu Zhang, Chung-Cheng
Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. Zoltán Tüske, George Saon, Kartik Audhkhasi, and
2019. SpecAugment: A simple data augmentation Brian Kingsbury. 2020. Single headed attention
method for automatic speech recognition. based sequence-to-sequence model for state-of-the-
art results on Switchboard. In Interspeech, pages
Maja Popović. 2015. chrF: character n-gram F-score 551–555, Shanghai, China.
for automatic MT evaluation. In Proceedings of the
Tenth Workshop on Statistical Machine Translation, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
pages 392–395, Lisbon, Portugal. Association for Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Computational Linguistics. Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In Advances in Neural Information Pro-
Matt Post. 2018. A call for clarity in reporting BLEU cessing Systems, pages 5998–6008.
scores. In Proceedings of the Third Conference on
Patrick Wilken, Panayota Georgakopoulou, and Evgeny
Machine Translation: Research Papers, pages 186–
Matusov. 2022. SubER - a metric for automatic eval-
191, Brussels, Belgium. Association for Computa-
uation of subtitle quality. In Proceedings of the 19th
tional Linguistics.
International Conference on Spoken Language Trans-
Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon lation (IWSLT 2022), pages 1–10, Dublin, Ireland (in-
Lavie. 2020. COMET: A neural framework for MT person and online). Association for Computational
evaluation. In Proceedings of the 2020 Conference Linguistics.
on Empirical Methods in Natural Language Process- Patrick Wilken and Evgeny Matusov. 2019. Novel appli-
ing (EMNLP), pages 2685–2702, Online. Association cations of factored neural machine translation. arXiv
for Computational Linguistics. preprint arXiv:1910.03912.
Haşim Sak, Andrew W. Senior, and Françoise Beaufays. Patrick Wilken and Evgeny Matusov. 2022. AppTek’s
2014. Long short-term memory based recurrent neu- submission to the IWSLT 2022 isometric spoken lan-
ral network architectures for large vocabulary speech guage translation task. In Proceedings of the 19th
recognition. arXiv preprint arXiv:1402.1128. International Conference on Spoken Language Trans-
lation (IWSLT 2022), pages 369–378, Dublin, Ire-
Holger Schwenk, Guillaume Wenzek, Sergey Edunov,
land (in-person and online). Association for Compu-
Edouard Grave, Armand Joulin, and Angela Fan.
tational Linguistics.
2021. CCMatrix: Mining billions of high-quality
parallel sentences on the web. In Proceedings of the Albert Zeyer, Parnia Bahar, Kazuki Irie, Ralf Schlüter,
59th Annual Meeting of the Association for Compu- and Hermann Ney. 2019. A comparison of trans-
tational Linguistics and the 11th International Joint former and lstm encoder decoder models for asr. In
Conference on Natural Language Processing (Vol- IEEE Automatic Speech Recognition and Understand-
ume 1: Long Papers), pages 6490–6500, Online. As- ing Workshop, pages 8–15, Sentosa, Singapore.
sociation for Computational Linguistics.
Albert Zeyer, Kazuki Irie, Ralf Schlüter, and Hermann
Rico Sennrich, Barry Haddow, and Alexandra Birch. Ney. 2018. Improved training of end-to-end attention
2016. Neural machine translation of rare words with models for speech recognition. In 19th Annual Conf.
subword units. In Proceedings of the 54th Annual Interspeech, Hyderabad, India, 2-6 Sep., pages 7–11.
Meeting of the Association for Computational Lin-
guistics, ACL 2016, August 7-12, 2016, Berlin, Ger- Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q
many, Volume 1: Long Papers. Weinberger, and Yoav Artzi. 2019. Bertscore: Eval-
uating text generation with bert. arXiv preprint
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. arXiv:1904.09675.
Self-attention with relative position representations.
QUESPA Submission for the IWSLT 2023
Dialect and Low-resource Speech Translation Tasks
enables zero-shot cross-lingual transfer for many low-resource languages, including Quechua.

Previous work includes direct, or end-to-end, ST models (Berard et al., 2016; Weiss et al., 2017). More traditional approaches use a cascade, which first transcribes the audio with an ASR model and then translates the transcription with an MT model. While recent work (Bentivogli et al., 2021; Anastasopoulos et al., 2021; Antonios et al., 2022) has shown that direct ST approaches are viable, traditional approaches work well in low-resource situations too. All of our submitted systems, with the exception of the primary constrained system, use the cascade approach.

3 Quechua-Spanish

In this section we present our experiments on the QUE–SPA dataset provided in the low-resource ST track at IWSLT 2023. This is the first time that this dataset has been officially introduced in its current state, which contains 1 hour and 40 minutes of constrained speech audio along with its corresponding translations and nearly 60 hours of ASR data (with transcriptions) from the Siminchik (Cardenas et al., 2018) corpus. The AmericasNLP 2022 task used a smaller part of the dataset, but the data was not presented or compiled with the same offering and, as of this writing, its results have not been published. The dataset also aggregates the QUE–SPA MT corpus from previous neural MT work (Ortega et al., 2020). The audio, with its corresponding transcriptions and translations, consists mostly of radio broadcasts, similar to the work of Boito et al. (2022), which contains 17 hours of speech in the Tamasheq language.

We present six submissions, covering both the constrained and unconstrained settings:

1. a primary constrained system that uses a direct ST approach;

2. a contrastive 1 constrained system consisting of a wav2letter (Pratap et al., 2019) ASR system and a neural MT system created from scratch;

3. a contrastive 2 constrained system consisting of a conformer-based (Gulati et al., 2020) ASR system and a neural MT system created from scratch;

4. a primary unconstrained system consisting of a multilingual PLM ASR model, a Quechua recurrent neural-network language model, and a fine-tuned neural MT system based on a PLM;

5. a contrastive 1 unconstrained system consisting of a multilingual PLM ASR model and a fine-tuned neural MT system based on a PLM;

6. a contrastive 2 unconstrained system consisting of a wav2letter ASR system and a fine-tuned neural MT system based on a PLM.

We present the experimental settings for all systems, starting with the constrained systems in Section 3.1 and continuing with the unconstrained systems in Section 3.2. We then describe other, less successful approaches in Section 3.3. Finally, we offer results and discussion in Section 4.

3.1 Constrained Setting

The IWSLT 2023 constrained setting for QUE–SPA consists of two main datasets. First, the speech translation dataset consists of 1 hour and 40 minutes of audio divided into 573 training files, 125 validation files, and 125 test files, where each file is a .wav file with a corresponding transcription and human-validated translation from Siminchik (Cardenas et al., 2018). Second, there is an MT dataset compiled in previous work (Ortega et al., 2020), which consists of 100 sentences from daily magazine articles and 51140 sentences of a religious nature.

3.1.1 Primary System

The Primary System uses a direct ST approach. Since the constrained setting does not allow external data, we used only the data provided. We use the Fairseq (Ott et al., 2019) toolkit to perform direct ST using the 573 training files, a total of 1.6 hours of audio. The system extracts log mel-filter bank (MFB) features and is based on the S2T approach of Wang et al. (2020). We generate a 1k unigram vocabulary for the Spanish text using SentencePiece (Kudo and Richardson, 2018), with no pre-tokenization. Our model consists of a convolutional feature extractor and a transformer encoder-decoder (Vaswani et al., 2017) with 6 encoder layers and 3 decoder layers. Error is measured using cross entropy and optimization is done using Adam. Our model was trained for 500 epochs with a learning rate of 0.0002.
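
As an illustration of the log mel-filter bank (MFB) features used by the Primary System in Section 3.1.1, the following is a minimal, dependency-free sketch of how such features can be computed for a single audio frame. It is not the Fairseq S2T feature extractor: the 16 kHz sample rate, the 25 ms frame, the 512-point transform, the 80 filters, and all function names are assumptions chosen for the example.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    top_mel = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        # Rising slope up to the centre frequency, falling slope after it.
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank


def log_mel_frame(frame, sample_rate=16000, n_mels=80, n_fft=512):
    """Log mel-filter bank energies for one frame (len(frame) <= n_fft)."""
    n = len(frame)
    # Hamming window to reduce spectral leakage.
    windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    # Naive DFT of the (implicitly zero-padded) frame: slow but dependency-free.
    power = []
    for k in range(n_fft // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n_fft)
                  for i, x in enumerate(windowed))
        power.append(abs(acc) ** 2)
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    # Small constant avoids log(0) for filters that receive no energy.
    return [math.log(sum(w * p for w, p in zip(row, power)) + 1e-10)
            for row in bank]


# One 25 ms frame (400 samples at 16 kHz) of a synthetic 440 Hz tone.
tone = [math.sin(2.0 * math.pi * 440.0 * t / 16000.0) for t in range(400)]
features = log_mel_frame(tone)
print(len(features))  # one value per mel filter
```

In a real pipeline the same computation is applied to overlapping frames of every utterance, which yields the feature matrix consumed by the convolutional feature extractor.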
3.1.2 Contrastive 1 System

The Contrastive 1 System is a cascade system in which ASR is first performed to produce transcriptions, which are then translated by a separate MT system. For the ASR system, we used the wav2letter++ (Pratap et al., 2019) model. The wav2letter++ model consists of an RNN with 30M parameters (2 spatial convolution layers, 5 bidirectional LSTM layers, and 2 linear layers) and a CNN with 100M parameters (18 temporal convolution layers and 1 linear layer). We use the convolutional gated linear unit (GLU) (Dauphin et al., 2017) architecture proposed in the wav2letter WSJ recipe (Collobert et al., 2016). Training wav2letter++ took 134 epochs, using Stochastic Gradient Descent (SGD) with Nesterov momentum and a minibatch of 8 utterances. The initial learning rate was set to 0.006 for faster convergence, and it was annealed with a constant factor of 3.6 after each epoch, with momentum set to 0. The model was optimized using the Auto Segmentation Criterion (ASG) (Collobert et al., 2016). During development, the ASR system WER was 72.15 on the validation set. The MT system was created from scratch using the OpenNMT framework (Klein et al., 2020) with the MT data provided for the constrained task along with the ASR training data. More specifically, the MT system's encoder and decoder are based on a transformer (Vaswani et al., 2017) architecture of 6 layers. Hidden layer and vector sizes were 512. Dropout was set to 0.1. Optimization was done using the Adam optimizer. Tokenization was done using SentencePiece (Kudo and Richardson, 2018). Both source and target vocabularies were 50k. The initial BLEU score on the validation set was 21.13.

3.1.3 Contrastive 2 System

Like the Contrastive 1 System, the Contrastive 2 System is a cascade approach. The ASR system, however, is distinct. It uses MFB features, as in previous work (Berrebbi et al., 2022), and a conformer encoder (Gulati et al., 2020) instead of a transformer encoder. Training was performed using a hybrid CTC/attention loss (Watanabe et al., 2017). The model was optimized using Adam (Kingma and Ba, 2015) and a Noam learning rate scheduler (Vaswani et al., 2017) with 4000 warmup steps. The MT system is identical to the OpenNMT MT system described for the Contrastive 1 submission in Section 3.1.2.

3.2 Unconstrained Setting

For the unconstrained setting in IWSLT 2023, an additional 60 hours of speech data with corresponding transcriptions was made available by the organizers. This allowed for greater monolingual fine-tuning on the ASR data. Additionally, for both the ASR and MT components of all three of our submitted unconstrained systems, PLMs were used along with fine-tuning. All three submissions were cascade systems.

3.2.1 Primary System

The Primary System for the unconstrained setting consists of two systems, the ASR system and the MT system. Both systems are fine-tuned. First, the ASR system is a multilingual model pre-trained on the 102-language FLEURS (Conneau et al., 2023) dataset. The model consists of a conformer (Gulati et al., 2020) encoder and a transformer decoder, and is trained using hybrid CTC/attention loss (Watanabe et al., 2017) and hierarchical language identification conditioning (Chen et al., 2023). The model inputs are encoded representations extracted from a pre-trained XLS-R 128 model (Babu et al., 2021) with its weights frozen, augmented with SpecAugment (Park et al., 2019) and speed perturbation (Ko et al., 2015). In order to jointly decode, we also trained an RNN language model. The RNN consists of 2 layers with a hidden size of 650, trained using SGD with a flat learning rate of 0.1. The word-error rate on the validation set was 15. For the MT system, we use the Fairseq (Ott et al., 2019) toolkit for translation. The Flores 101 model (Guzmán et al., 2019) was used as the PLM; it is based on a transformer (Vaswani et al., 2017) architecture used at WMT 2021² by Facebook. Fine-tuning was performed on the same ASR+MT training data from the constrained task that was used for the Constrained Contrastive 1 system in Section 3.1.2.

3.2.2 Contrastive 1 System

The Contrastive 1 System is nearly identical to the Primary System for the unconstrained setting. The MT system is identical to that of the Primary System submission for the unconstrained setting. For the ASR system, a FLEURS approach is used, identical to the unconstrained Primary System in Section 3.2.1. The only difference is that this Unconstrained Contrastive 1 system does not use a language model.

² https://www.statmt.org/wmt21/large-scale-multilingual-translation-task.html
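
The difference between the unconstrained Primary System (joint decoding with an RNN language model, Section 3.2.1) and the unconstrained Contrastive 1 System (no language model, Section 3.2.2) can be illustrated with a small scoring sketch. It assumes the usual log-linear combination used with hybrid CTC/attention models (Watanabe et al., 2017); the interpolation weights, the hypothesis names, and the probabilities are made up for the example and are not the values used in the submission.

```python
import math


def joint_score(log_p_att, log_p_ctc, log_p_lm=None, ctc_weight=0.3, lm_weight=0.5):
    """Log-linear combination of attention, CTC and (optional) LM scores."""
    score = (1.0 - ctc_weight) * log_p_att + ctc_weight * log_p_ctc
    if log_p_lm is not None:
        score += lm_weight * log_p_lm
    return score


# Hypothetical scores for two competing transcriptions of one utterance:
# (attention log-prob, CTC log-prob, language-model log-prob).
hypotheses = {
    "hyp_a": (math.log(0.40), math.log(0.35), math.log(0.01)),
    "hyp_b": (math.log(0.35), math.log(0.30), math.log(0.20)),
}


def best(use_lm):
    """Return the name of the highest-scoring hypothesis."""
    return max(
        hypotheses,
        key=lambda h: joint_score(
            hypotheses[h][0],
            hypotheses[h][1],
            hypotheses[h][2] if use_lm else None,
        ),
    )


print(best(use_lm=True), best(use_lm=False))  # prints "hyp_b hyp_a"
```

With these illustrative numbers the acoustic scores alone prefer hyp_a, while the language model shifts the decision to hyp_b, which the acoustic model rates slightly lower but the language model finds far more plausible.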
3.2.3 Contrastive 2 System

The Contrastive 2 System is also a cascade (ASR+MT) system. The MT system is identical to that of the Primary System submission for the unconstrained setting. The ASR system architecture is identical to the Constrained Contrastive 1 System in Section 3.1.2, but with other hyperparameters. This experiment took 243 epochs to train, using Stochastic Gradient Descent (SGD) with Nesterov momentum and a minibatch of 16 utterances. The initial learning rate was set to 0.002 for faster convergence, and it was annealed with a constant factor of 1.2 after each epoch, with momentum set to 0. In this system, we add the additional 60 hours of monolingual transcribed speech data from the unconstrained setting of the IWSLT 2023 low-resource task to the 1.6 hours provided for the constrained setting.

3.3 Other Approaches

As noted in Section 2, there have been other successful approaches worth visiting. While we could not exhaustively attempt all of those approaches, we did focus on several that are worth noting.

For ASR approaches, we focused on experimenting with different model architectures. This included using different encoders (transformer, conformer) and decoders (auto-regressive transformer, CTC-only). Regardless, all of the ASR systems achieved at best 100 WER in the constrained setting, limiting the effectiveness of any cascaded approach. In the unconstrained setting, we also looked at different ways to incorporate pre-training. For example, we tried directly fine-tuning a pre-trained XLS-R model (Babu et al., 2021; Baevski et al., 2020) instead of using extracted layer-wise features from a frozen model. These approaches were somewhat more successful, achieving up to 20.4 WER on the validation set; however, the top three systems reported achieved better ASR performance.

For MT approaches, several attempts were made to experiment with other systems. For example, the OpenNMT (Klein et al., 2020) toolkit now offers PLMs that include the Flores 101 (Guzmán et al., 2019) dataset. However, since Quechua was not included in the language list, the performance was extremely low on the validation set (0.06 BLEU). The Hugging Face version of the Flores 200 dataset was also tested and resulted in 23.5 on its own data. However, when testing on the validation set, the score was 6.27 BLEU. The Flores 200 model is made available as the NLLB task on Fairseq; however, we experienced several conflicts with the machine infrastructure, causing complexity with the Stopes tokenization, that prevented us from moving forward.

For direct ST approaches, we were also unsuccessful using w2v feature encoding without major modification. Overall, the cascade approaches seemed to work better for this task and, thus, we decided to use those instead. The results for the constrained task, nonetheless, show that the direct S2T approach worked well using MFB features.

4 Results and Discussion

Team QUESPA BLEU and CHRF Scores

Constrained
System          Description           BLEU    CHRF
primary         mfb+s2t               1.25    25.35
contrastive 1   w2vl+onmt             0.13    10.53
contrastive 2   conformer+onmt        0.11    10.63

Unconstrained
System          Description           BLEU    CHRF
primary         fleurs+lm+floresmt    15.36   47.89
contrastive 1   fleurs+floresmt       15.27   47.74
contrastive 2   w2vl+floresmt         10.75   42.89

Table 1: Team QUESPA results for the Quechua to Spanish low-resource task at IWSLT 2023.

Results are presented in Table 1. For the constrained task, we were unable to create a system that would be viable for deployment. Nevertheless, we believe that the primary submission, which used MFB features along with the default Fairseq S2T recipe, could be used to further research in the field. The other systems, based on wav2letter (Pratap et al., 2019) and a conformer (Gulati et al., 2020), resulted in a near-zero BLEU score and are probably only valid as proof of the non-functional status of the two systems when performing ASR on the QUE–SPA language pair. It is clear that with 1.6 hours of data for training, few constrained systems will perform better than 5 BLEU, as seen in previous IWSLT tasks.

For the unconstrained setting, our findings show that for both the ASR and MT models, the use of a PLM with fine-tuning is necessary. We were unable to create a system from scratch that would perform as well as those presented in previ-
Figure 1: The best-performing unconstrained speech translation pipeline.
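
The word error rate (WER) figures quoted for the ASR components in Sections 3.1.2 to 3.3 are ratios of word-level edit distance to reference length. A minimal sketch of that computation is given below; the function name and the placeholder word sequences are illustrative assumptions, not data or tooling from the submission.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Placeholder tokens stand in for real transcriptions.
print(word_error_rate("a b c d", "a x c"))  # prints 50.0
print(word_error_rate("a", "x y z"))        # prints 300.0
```

Because insertions are counted against the reference length, WER is not bounded by 100; a value of 100 therefore does not correspond to a maximum, only to as many errors as there are reference words.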
Anastasopoulos Antonios, Barrault Loïc, Luisa Bentivogli, Marcely Zanon Boito, Bojar Ondřej, Roldano Cattoni, Currey Anna, Dinu Georgiana, Duh Kevin, Elbayad Maha, et al. 2022. Findings of the IWSLT 2022 evaluation campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 98–157. Association for Computational Linguistics.

Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.

Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, et al. 2021. XLS-R: Self-supervised cross-lingual speech representation learning at scale. arXiv preprint arXiv:2111.09296.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460.

Luisa Bentivogli, Mauro Cettolo, Marco Gaido, Alina Karakanta, Alberto Martinelli, Matteo Negri, and Marco Turchi. 2021. Cascade versus direct speech translation: Do the differences still make a difference? CoRR, abs/2106.01045.

Alexandre Berard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and translate: A proof of concept for end-to-end speech-to-text translation. CoRR, abs/1612.01744.

Dan Berrebbi, Jiatong Shi, Brian Yan, Osbel López-Francisco, Jonathan Amith, and Shinji Watanabe. 2022. Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation. In Proc. Interspeech 2022, pages 3533–3537.

Marcely Zanon Boito, Fethi Bougares, Florentin Barbier, Souhir Gahbiche, Loïc Barrault, Mickael Rouvier, and Yannick Estève. 2022. Speech resources in the Tamasheq language. Language Resources and Evaluation Conference (LREC).

Ronald Cardenas, Rodolfo Zevallos, Reynaldo Baquerizo, and Luis Camacho. 2018. Siminchik: A speech corpus for preservation of Southern Quechua. ISI-NLP 2, page 21.

William Chen and Brett Fazio. 2021. Morphologically-guided segmentation for translation of agglutinative low-resource languages. Proceedings of Machine Translation Summit XVIII.

William Chen, Brian Yan, Jiatong Shi, Yifan Peng, Soumi Maiti, and Shinji Watanabe. 2023. Improving massively multilingual ASR with auxiliary CTC objectives. arXiv preprint arXiv:2302.12829.

Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve. 2016. Wav2letter: an end-to-end convnet-based speech recognition system. arXiv preprint arXiv:1609.03193.

Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. 2020. Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979.

Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. 2023. FLEURS: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 798–805.

Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.

Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In International Conference on Machine Learning, pages 933–941. PMLR.

Abteen Ebrahimi, Manuel Mager, Arturo Oncevay, Vishrav Chaudhary, Luis Chiruzzo, Angela Fan, John Ortega, Ricardo Ramos, Annette Rios Gonzales, Ivan Meza-Ruiz, et al. 2022. AmericasNLI: Evaluating zero-shot natural language understanding of pretrained multilingual models in truly low-resource languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6279–6299.

Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, et al. 2021. Beyond English-centric multilingual machine translation. The Journal of Machine Learning Research, 22(1):4839–4886.

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented Transformer for Speech Recognition. In Proc. Interspeech 2020, pages 5036–5040.

Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato. 2019. The FLORES evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6098–6111.

Barry Haddow, Rachel Bawden, Antonio Valerio Miceli Barone, Jindřich Helcl, and Alexandra Birch. 2022. Survey of low-resource machine translation. Computational Linguistics, 48(3):673–732.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR 2015, Conference Track Proceedings.

Guillaume Klein, François Hernandez, Vincent Nguyen, and Jean Senellart. 2020. The OpenNMT neural machine translation toolkit: 2020 edition. In Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 102–109.

Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. Audio augmentation for speech recognition. In Proc. Interspeech 2015, pages 3586–3589.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

Manuel Mager, Arturo Oncevay, Abteen Ebrahimi, John Ortega, Annette Rios, Angela Fan, Ximena Gutierrez-Vasques, Luis Chiruzzo, Gustavo A Giménez-Lugo, Ricardo Ramos, et al. 2021. Findings of the AmericasNLP 2021 shared task on open machine translation for indigenous languages of the Americas. NAACL-HLT 2021, page 202.

NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia-Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2022. No language left behind: Scaling human-centered machine translation.

Atul Kr Ojha, Valentin Malykh, Alina Karakanta, and Chao-Hong Liu. 2020. Findings of the LoResMT 2020 shared task on zero-shot for low-resource languages. In Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages, pages 33–37.

John E Ortega, Richard Castro Mamani, and Kyunghyun Cho. 2020. Neural machine translation with a polysynthetic low resource language. Machine Translation, 34(4):325–346.

John E Ortega and Krishnan Pillaipakkamnatt. 2018. Using morphemes from agglutinative languages like Quechua and Finnish to aid in low-resource translation. Technologies for MT of Low Resource Languages (LoResMT 2018), page 1.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.

Juan Pino, Qiantong Xu, Xutai Ma, Mohammad Javad Dousti, and Yun Tang. 2020. Self-training for end-to-end speech translation.

Vineel Pratap, Awni Y. Hannun, Qiantong Xu, Jeff Cai, Jacob Kahn, Gabriel Synnaeve, Vitaliy Liptchinsky, and Ronan Collobert. 2019. Wav2letter++: A fast open-source speech recognition system. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6460–6464.

Annette Rios. 2015. A basic language technology toolkit for Quechua. Ph.D. thesis, University of Zurich.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, and Juan Pino. 2020. fairseq S2T: Fast speech-to-text modeling with fairseq. arXiv preprint arXiv:2010.05171.

Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi. 2017. Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8):1240–1253.

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-Sequence Models Can Directly Translate Foreign Speech. In Proc. Interspeech 2017, pages 2625–2629.

Marion Weller-Di Marco and Alexander Fraser. Findings of the WMT 2022 shared tasks in unsupervised MT and very low resource supervised MT.
GMU Systems for the IWSLT 2023 Dialect and Low-resource Speech
Translation Tasks
resource setting has limited training data, while the dialectal one lacks standard orthography and formal grammar. Both shared tasks allowed the submission of models trained under constrained and unconstrained conditions. In the constrained condition, models are only trained on data provided by the organizers. In contrast, models in the unconstrained condition can be trained on any available resources, including pre-trained models.

2.1 Data

Six low-resource languages and one dialectal language were made available. However, due to data quality issues (see Section 5) we do not report results on the Maltese to English task. Table 1 shows the data details for each language pair. The organizers shared additional data for specific languages, including data for automatic speech recognition (ASR) and machine translation (MT). However, our approach used the data described in Table 1. The exception is Tamasheq-French, where we used the provided 234 hours of unlabeled Tamasheq audio to pre-train a self-supervised speech model.

For the unconstrained condition, we used data from MUST-C¹ (Di Gangi et al., 2019) to train an ASR model whose encoder we used to initialize the speech translation training. We used publicly available pre-trained self-supervised models (Wav2vec 2.0 (Baevski et al., 2020), XLSR-53 (Conneau et al., 2020), and Hubert (Hsu et al., 2021)). The Wav2vec 2.0 and Hubert checkpoints we used were trained on the Librispeech 960hr English-only data (Panayotov et al., 2015), while XLSR-53 was trained on 53 different languages (Conneau et al., 2020). None of the source languages appears in the pre-training data of any of these self-supervised models. The exception is Tamasheq-French, where the Wav2vec 2.0 model we used was pre-trained on Tamasheq audio-only data. Though Tunisian Arabic is not part of XLSR-53, XLSR-53 contains Arabic data that may be related to Tunisian Arabic.

¹ English to French only

3 Proposed Methods

Our methods consist of three different architectures. The first is an end-to-end transformer-based architecture (E2E) trained only on the provided data. The second architecture, which we name E2E-ASR, is the same as the first, except that we initialize the encoder with an ASR encoder. The third architecture uses self-supervised speech models as an encoder and a transformer-based decoder. We used three different self-supervised models, Wav2vec 2.0, XLSR-53, and Hubert, and refer to these architectures as W2V-E2E, XLSR-E2E, and Hubert-E2E respectively.

We used the Fairseq ST (Wang et al., 2020) framework for all our experiments and modified this framework to accommodate our new custom model architectures.

3.1 End-to-end and End-to-end with ASR

For the End-to-end (E2E) architecture, we used a transformer-based encoder-decoder architecture (Vaswani et al., 2017) (st_tranformer_s) as implemented in the Fairseq S2T framework (Wang et al., 2020). The E2E architecture consists of a 6-block transformer encoder and a 6-block transformer decoder and is optimized using the cross-entropy loss with label smoothing. We used this model architecture to train the model for the primary constrained category (primary-constrained).

The End-to-end with ASR (E2E-ASR) architecture, similar to Stoian et al. (2019) and Bansal et al. (2019), uses the same architecture as the E2E. The difference is that we use a pre-trained ASR model to initialize its encoder. We used a transformer-based architecture identical to the one
270
for E2E to train the ASR on the English data of fine-tuning these models on a downstream task as
the English-French Must-C dataset (Di Gangi et al., done in Pasad et al. (2022), we explored the idea
2019). We chose this architecture for the ASR of removing these layers and then fine-tuning the
model to facilitate the transfer of the ASR encoder modified model on a downstream task. Through a
weights to initialize the E2E-ASR encoder. The series of experiments, we found that removing the
decoder of the E2E-ASR was randomly initialized last three layers for the Wav2vec 2.0 and XLSR-53
and did not use the ASR decoder because it was models yields the highest BLEU score.
trained on a different language with a different vo- We found the Wav2vec 2.0 helpful for the low-
cabulary. We used this model architecture to train resource languages, while the XLSR-53 was more
the model for the second contrastive unconstrained beneficial for the dialectal language. Therefore,
category (contrastive2-unconstrained). we used the Wav2vec 2.0 for the primary uncon-
strained category (primary unconstrained) for the
3.2 Self-Supervised Approaches low-resource task. The XLSR-53 was used as the
The self-supervised approach uses self-supervised primary unconstrained category (primary uncon-
speech models as acoustic encoders with a strained) for the dialectal transfer task.
transformer-based decoder. The use of these self- The Wav2vec 2.0 we used for all the low-
supervised models is motivated by the scarcity of resource languages (except Tamasheq-French) was
data in the low-resource setting. However, we trained on the English raw audio of the Librispeech
found these models useful even for the dialectal 960hr data (Panayotov et al., 2015). However, due
task. The self-supervised architecture is illustrated to the availability of Tamasheq raw audio, we also
in figure 1. trained a Wav2vec 2.0 model on Tamasheq raw au-
We used three different self-supervised models, dio that used this model on the Tamasheq to French
Wav2vec 2.0, XLSR-53, and Hubert, which cor- language pair. The XLSR-53 model we used was
respond to the respective architectures W2V-E2E, trained on 53 raw audio data from 53 different lan-
XLSR-E2E, and Hubert-E2E. These models con- guages.
sist of a feature encoder and a context network.
The feature encoder has seven temporal convolu- 3.2.2 Using Hubert
tion blocks, and the context network consists of Unlike Wav2vec 2.0 and XSLR-53, we did not re-
several transformer blocks. The Wav2vec 2.0 and move any layers for the Hubert model. We rather
Hubert models used in our experiments have 12 fine-tuned the out-of-the-box pre-trained Hubert
transformer blocks, whereas the XLSR-53 has 24.2 model on the English raw audio data of Librispeech
We use these self-supervised models as encoders 960hr. As discussed by (Pasad et al., 2022), Hubert
following the traditional encoder-decoder model does not follow the autoencoder pattern, given that
architecture. The decoder consists of a transformer the higher layers appear to encode more phonetic
network with six layers preceded by a linear layer. and word information. The choice of not removing
top layers for the Hubert model was also corrobo-
3.2.1 Using Wav2vec 2.0 and XLSR-53 rated through our empirical experiments, where we
Instead of using all the layers of the context net- achieved the highest BLEU score for the Hubert
work for the Wav2vec 2.0 and XLSR-53 models, model when we did not remove any top layers.
we explored the impact of removing the top n most We used the Hubert model for the first con-
layers. The exploration of removing the top layers trastive constrained category (contrastive1 uncon-
was inspired by Pasad et al. (2022), who analyzed strained) for the low-resource and dialectal tasks.
self-supervised speech models and measures the
acoustic, phonetic, and word-level properties en- 3.3 Data
coded in individual layers of the context network. The input to architectures E2E and E2E-ASR con-
For Wav2vec 2.0 and XLSR, the analyses show sist of 80-channel log-mel filterbank features com-
that the initial and the final layers are more simi- puted on a 25 ms window with a 10 ms shift. We
lar to the inputs than the intermediate layers. In- used raw audio as input for all the architectures
stead of re-initializing the top n layers and then using self-supervised models. For the translation
2 text, we use the byte pair encoding (BPE) (Sen-
We refer the reader to the following papers (Baevski et al.,
2020), (Conneau et al., 2020) and (Hsu et al., 2021) for more nrich et al., 2016) algorithm with the sentencepiece
details on these models. toolkit from the Fairseq ST framework (Wang et al.,
Figure 1: Self-supervised model architecture. This is an end-to-end architecture that uses self-supervised speech
models as the encoder. The encoder is one of the Wav2vec 2.0, XLSR, or Hubert models. We removed the top 3
layers of the Wav2vec 2.0 and XLSR models.
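As a rough illustration of the layer removal mentioned in this caption, the sketch below drops the top n blocks of a context network before fine-tuning. It is a toy, framework-free Python example: `ContextNetwork`, `drop_top_layers`, and the add-one layers are hypothetical names invented for this illustration and are not part of Fairseq or of the submitted systems.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A "layer" is any function mapping a feature vector to a feature vector.
Layer = Callable[[List[float]], List[float]]


@dataclass
class ContextNetwork:
    """Toy stand-in for the stack of transformer blocks (hypothetical)."""
    layers: List[Layer] = field(default_factory=list)

    def forward(self, x: List[float]) -> List[float]:
        for layer in self.layers:
            x = layer(x)
        return x


def drop_top_layers(net: ContextNetwork, n: int) -> ContextNetwork:
    """Return a copy of the network without its top n layers."""
    if not 0 <= n < len(net.layers):
        raise ValueError("n must satisfy 0 <= n < number of layers")
    # Slice by explicit length so that n == 0 keeps every layer.
    return ContextNetwork(layers=net.layers[: len(net.layers) - n])


def add_one(x: List[float]) -> List[float]:
    """Dummy layer used only to make the depth observable."""
    return [v + 1.0 for v in x]


wav2vec_like = ContextNetwork(layers=[add_one] * 12)  # 12 blocks
xlsr_like = ContextNetwork(layers=[add_one] * 24)     # 24 blocks

wav2vec_trimmed = drop_top_layers(wav2vec_like, 3)
xlsr_trimmed = drop_top_layers(xlsr_like, 3)
```

With three layers removed, the 12-block network keeps 9 blocks and the 24-block network keeps 21; the remaining blocks would then be fine-tuned together with the decoder.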
Language  Task  System                  Architecture  dev/valid  test1  test2  test3
ga-eng    LR    primary constr.         E2E           -          -      15.1   -
ga-eng    LR    primary unconstr.       W2V-E2E       -          -      66.5   -
ga-eng    LR    contrastive1 unconstr.  Hubert-E2E    -          -      77.4   -
ga-eng    LR    contrastive2 unconstr.  E2E-ASR       -          -      15.1   -
mr-hi     LR    primary constr.         E2E           0.77       -      3.3    -
mr-hi     LR    primary unconstr.       W2V-E2E       4.76       -      7.7    -
mr-hi     LR    contrastive1 unconstr.  Hubert-E2E    5.78       -      8.6    -
mr-hi     LR    contrastive2 unconstr.  E2E-ASR       4.07       -      5.9    -
pus-fra   LR    primary constr.         E2E           2.66       -      5.92   -
pus-fra   LR    primary unconstr.       W2V-E2E       11.99      -      16.87  -
pus-fra   LR    contrastive1 unconstr.  Hubert-E2E    11.27      -      15.24  -
pus-fra   LR    contrastive2 unconstr.  E2E-ASR       9.72       -      13.32  -
tmh-fra   LR    primary constr.         E2E           1.24       1.0    0.48   -
tmh-fra   LR    primary unconstr.       W2V-E2E       12.07      7.63   8.03   -
tmh-fra   LR    contrastive1 unconstr.  Hubert-E2E    4.79       2.77   1.3    -
tmh-fra   LR    contrastive2 unconstr.  E2E-ASR       5.24       3.77   2.1    -
que-spa   LR    primary constr.         E2E           1.46       -      1.46   -
que-spa   LR    primary unconstr.       W2V-E2E       1.2        -      1.78   -
que-spa   LR    contrastive1 unconstr.  Hubert-E2E    1.84       -      1.86   -
que-spa   LR    contrastive2 unconstr.  E2E-ASR       1.63       -      1.63   -
aeb-eng   DT    primary constr.         E2E           11.49      8.94   5.0    4.5
aeb-eng   DT    primary unconstr.       XLSR-E2E      19.35      16.31  16.6   14.6
aeb-eng   DT    contrastive1 unconstr.  Hubert-E2E    17.69      14.52  15.0   13.4
aeb-eng   DT    contrastive2 unconstr.  W2V-E2E       16.7       14.4   14.1   12.9
Table 3: BLEU score for all the submitted systems. LR and DT indicate low-resource and dialectal transfer,
respectively. dev/valid refers to the validation or development sets we used during training. test1 refers to the test set
we used during training (some language pairs did not have this set). test2 refers to the blind test set. Some language
pairs (i.e., aeb-eng) had an additional blind test set called test3. The "-" character indicates that we do not have
BLEU results for that category. We did not report the dev/valid results for the Irish to English (ga-eng) task due to
the data quality issue discussed in section 5.
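To make the comparison in Table 3 easier to check, the following sketch picks the best architecture per language pair on the blind test set (test2) and recomputes the Tamasheq-French gain over the 5.7 BLEU reported for the best IWSLT 2022 system (Anastasopoulos et al., 2022). The scores are copied from Table 3; the dictionary layout and the helper name `best_architecture` are assumptions made only for this illustration.

```python
# test2 (blind test) BLEU scores copied from Table 3.
TEST2_BLEU = {
    "ga-eng": {"E2E": 15.1, "W2V-E2E": 66.5, "Hubert-E2E": 77.4, "E2E-ASR": 15.1},
    "mr-hi": {"E2E": 3.3, "W2V-E2E": 7.7, "Hubert-E2E": 8.6, "E2E-ASR": 5.9},
    "pus-fra": {"E2E": 5.92, "W2V-E2E": 16.87, "Hubert-E2E": 15.24, "E2E-ASR": 13.32},
    "tmh-fra": {"E2E": 0.48, "W2V-E2E": 8.03, "Hubert-E2E": 1.3, "E2E-ASR": 2.1},
    "que-spa": {"E2E": 1.46, "W2V-E2E": 1.78, "Hubert-E2E": 1.86, "E2E-ASR": 1.63},
    "aeb-eng": {"E2E": 5.0, "XLSR-E2E": 16.6, "Hubert-E2E": 15.0, "W2V-E2E": 14.1},
}

# BLEU of the best IWSLT 2022 Tamasheq-French system, as quoted in the text.
TMH_FRA_2022_BASELINE = 5.7


def best_architecture(scores):
    """Return the architecture name with the highest BLEU score."""
    return max(scores, key=scores.get)


best_per_pair = {pair: best_architecture(s) for pair, s in TEST2_BLEU.items()}
tmh_gain = round(TEST2_BLEU["tmh-fra"]["W2V-E2E"] - TMH_FRA_2022_BASELINE, 2)
```

Note that the ga-eng scores are included only for completeness; as the caption indicates, they are affected by the data quality issue discussed in Section 5.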
pre-training, we still see the same pattern for self-supervised pre-training.
In particular, for Tamasheq-French, where the best IWSLT 2022 system had a BLEU score of 5.7 (Anastasopoulos et al., 2022), we improved upon this baseline by more than 2 BLEU on the blind test set.
4.2 Dialectal Task
Unlike the low-resource task, the highest BLEU for the dialectal task was achieved by using the XLSR-53 model (XLSR-E2E). Therefore, we used this architecture for our primary unconstrained setting. Table 3 shows the results for Tunisian Arabic-English.
For this task, Wav2vec 2.0 and Hubert had comparable BLEU scores. However, surprisingly, they did not perform as well as XLSR-53. This finding was counterintuitive given that the XLSR-53 model did not perform as well as the Wav2vec 2.0 or Hubert on all the low-resource languages. The XLSR-53 model was also reported to have poor performance on a low-resource language by Zanon Boito et al. (2022). Based on our experiments, we think that the poor performance of the XLSR-53 model on the low-resource task was related to its size. We speculate that a model of the size of XLSR-53 may fail to adapt when fine-tuned on little data. However, fine-tuning it on a larger amount of data, as in the case of Tunisian Arabic-English, may yield an overall improvement.
It is also possible that the XLSR-53 model performs best on the Tunisian Arabic-English data because it was trained on more languages. It will be interesting to investigate the impact of model size and multilinguality of self-supervised pre-trained speech models on the performance of downstream speech translation tasks. In addition, we think there may be room to study further the speech representations of the XLSR-53 model across layers so that it can be better adapted to low-resource settings.
5 Data Quality Issues
The low-resource shared task of IWSLT 2023 consists of six tasks, each corresponding to one language pair. As we worked on these shared tasks, we noticed issues with the data of two tasks: Maltese to English and Irish to English.
The Maltese to English data had a number of issues that made it hard to work with. For instance, the metadata of about 1001 out of 1698 samples indicated a zero or negative duration for the audio samples (start_time >= end_time), while the aligned utterances had several words in most cases. Therefore, we were not able to align most of the audio data with their utterances.
The Irish to English data had an issue with the development set. Initially, the samples in the development set were also present in the training set. The organizers later fixed this issue by updating the development set data. However, no matter how we trained our models, we never achieved a BLEU score above 1 on the updated development set. After troubleshooting our model on the training data, we were confident that we should have obtained a BLEU score well above 1. We proceeded with submitting our system for this task. However, we are very suspicious of the high BLEU score reported on the blind test set, as shown in Table 3, as it suggests that there is an overlap between the training and test sets.
6 Conclusion
In this paper, we presented the GMU Systems for the IWSLT 2023 Dialect and Low-resource Speech Translation Tasks. Our approach mainly focused on using self-supervised pre-trained speech models to improve the performance of speech translation on downstream tasks. The self-supervised pre-trained speech models used in this paper are Wav2vec 2.0, XLSR-53, and Hubert. We showed that the Wav2vec 2.0 and Hubert models have comparable results in the low-resource and dialectal transfer tasks. However, the Wav2vec 2.0 performs well when we remove the top three layers, while the Hubert model has no such requirement.
Our experiments showed that the XLSR-53 model performs poorly in the low-resource setting compared to the Wav2vec 2.0 and Hubert models. However, in the dialectal task, the XLSR-53 model outperforms the Wav2vec 2.0 and Hubert models. In the future, we plan to conduct an in-depth analysis to understand the advantages and limitations of these self-supervised pre-trained speech models when fine-tuning them on downstream speech translation tasks.
Acknowledgements
We are thankful to the organizers of the IWSLT 2023 low resource and dialectal shared tasks. This work was generously supported by NSF grant
IIS-2125466 and by a Meta Sponsored Research Award. We are also thankful to the Office of Research Computing at George Mason University (https://orc.gmu.edu), funded in part by grants from the National Science Foundation (Awards Number 1625039 and 2018631), for the computing resources we used to train our models.
References
Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutai Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.
Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Vĕra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nǎdejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 evaluation campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 98–157, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. ArXiv, abs/2006.11477.
Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2019. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 58–68, Minneapolis, Minnesota. Association for Computational Linguistics.
Marcely Zanon Boito, Fethi Bougares, Florentin Barbier, Souhir Gahbiche, Loïc Barrault, Mickael Rouvier, and Y. Estève. 2022. Speech resources in the tamasheq language. In International Conference on Language Resources and Evaluation.
Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. 2020. Unsupervised cross-lingual representation learning for speech recognition. In Interspeech.
Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2012–2017, Minneapolis, Minnesota. Association for Computational Linguistics.
ELRA. Elra catalogue (http://catalog.elra.info), trad pashto-french parallel corpus of transcribed broadcast news speech - training data, islrn: 802-643-297-429-4, elra id: Elra-w0093, trad pashto broadcast news speech corpus, islrn: 918-508-885-913-7, elra id: Elra-s0381.
Qingkai Fang, Rong Ye, Lei Li, Yang Feng, and Mingxuan Wang. 2022. STEMM: Self-learning with speech-text manifold mixup for speech translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7050–7062, Dublin, Ireland. Association for Computational Linguistics.
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460.
Ha Nguyen, Fethi Bougares, Natalia Tomashenko, Yannick Estève, and Laurent Besacier. 2020. Investigating self-supervised pre-training for end-to-end speech translation. In ICML 2020 Workshop on Self-supervision in Audio and Speech.
Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An asr corpus based on public domain audio books. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210.
Ankita Pasad, Bowen Shi, and Karen Livescu. 2022. Comparative layer-wise analysis of self-supervised speech models. ArXiv, abs/2211.03929.
Sravya Popuri, Peng-Jen Chen, Changhan Wang,
Juan Miguel Pino, Yossi Adi, Jiatao Gu, Wei-Ning
Hsu, and Ann Lee. 2022. Enhanced direct speech-to-
speech translation using self-supervised pre-training
and data augmentation. In Interspeech.
Rico Sennrich, Barry Haddow, and Alexandra Birch.
2016. Neural machine translation of rare words with
subword units. In Proceedings of the 54th Annual
Meeting of the Association for Computational Lin-
guistics (Volume 1: Long Papers), pages 1715–1725,
Berlin, Germany. Association for Computational Lin-
guistics.
Matthias Sperber and Matthias Paulik. 2020. Speech
translation and the end-to-end promise: Taking stock
of where we are. In Annual Meeting of the Associa-
tion for Computational Linguistics.
Mihaela Stoian, Sameer Bansal, and Sharon Goldwater.
2019. Analyzing asr pretraining for low-resource
speech-to-text translation. ICASSP 2020 - 2020 IEEE
International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pages 7909–7913.
Yun Tang, Hongyu Gong, Ning Dong, Changhan Wang,
Wei-Ning Hsu, Jiatao Gu, Alexei Baevski, Xian Li,
Abdelrahman Mohamed, Michael Auli, and Juan
Pino. 2022. Unified speech-text pre-training for
speech translation and recognition. In Proceedings
of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
pages 1488–1499, Dublin, Ireland. Association for
Computational Linguistics.
Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. ArXiv, abs/1706.03762.
Changhan Wang, Yun Tang, Xutai Ma, Anne Wu,
Dmytro Okhonko, and Juan Miguel Pino. 2020.
Fairseq s2t: Fast speech-to-text modeling with
fairseq. In AACL.
Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui
Wu, and Z. Chen. 2017. Sequence-to-sequence mod-
els can directly translate foreign speech. In Inter-
speech.
Anne Wu, Changhan Wang, Juan Miguel Pino, and
Jiatao Gu. 2020. Self-supervised representations
improve end-to-end speech translation. ArXiv,
abs/2006.12124.
Marcely Zanon Boito, John Ortega, Hugo Riguidel, An-
toine Laurent, Loïc Barrault, Fethi Bougares, Firas
Chaabani, Ha Nguyen, Florentin Barbier, Souhir Gah-
biche, and Yannick Estève. 2022. ON-TRAC con-
sortium systems for the IWSLT 2022 dialect and
low-resource speech translation tasks. In Proceed-
ings of the 19th International Conference on Spoken
Language Translation (IWSLT 2022), pages 308–318,
Dublin, Ireland (in-person and online). Association
for Computational Linguistics.
The HW-TSC’s Speech-to-Speech Translation System for IWSLT 2023
Minghan Wang, Yinglu Li, Jiaxin Guo, Zongyao Li, Hengchao Shang, Daimeng Wei,
Chang Su, Min Zhang, Shimin Tao, Hao Yang
Huawei Translation Services Center, Beijing, China
{wangminghan,liyinglu,guojiaxin1,lizongyao,shanghengchao,
weidaimeng,suchang8,zhangmin186,taoshimin,yanghao30}@huawei.com
[0, T].
During inference, the model iteratively samples $x_{t-1}$ from $x_t$:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t, t, c) \right) + \sigma_t z \tag{8}$$
$$\epsilon_\theta = \frac{1}{\sqrt{1 - \bar{\alpha}_t}} \left( x_t - \sqrt{\bar{\alpha}_t} \, \hat{x}_\theta(x_t, t, c) \right) \tag{9}$$
where $\sigma_t = \sqrt{1 - \alpha_t}$ and $z \sim \mathcal{N}(0, I)$. In our experiments, to allow for flexible determination of the maximum step $T$, we choose to use a continuous $t$ ranging from 0 to 1. During training, $t$ is uniformly sampled, and we use the cosine noise scheduler (Nichol and Dhariwal, 2021).
In addition to modeling the denoising process, DTS also needs to predict the length of the target audio in advance, as DTS is essentially a non-autoregressive (NAR) model. However, unlike previous TTS models that predict the duration of each phoneme, we directly model the total number of frames in the target audio, which is more convenient. Specifically, we use the text representation after average pooling, denoted as $h_c$, as the input to the classifier $\phi$ to predict the length distribution. Then, we calculate the cross-entropy loss with the frame number $N_{x_0}$ of $x_0$:
$$L_{length} = \mathrm{CE}(\phi(h_c; \theta), N_{x_0}) \tag{10}$$
2.4.2 Model Architecture
The DTS model is essentially a parameterized denoising function $\hat{x}(x_t, t, c)$ which takes $x_t$ and $t$ as input, conditions on $c$, and predicts $x_0$ for the sampling of $x_{t-1}$. The model makes some modifications on top of the Transformer model to make it more suitable for speech synthesis. As shown in Figure 1, the main modifications are as follows:
• On top of the Encoder, we add a two-layer FFN network to predict the length of the target audio.
• In the input part of the Decoder, we use two 1D convolutions with a proper setting of kernel size, stride, and padding, so that the sequence length before and after convolution remains unchanged.
• As the Diffusion model depends on the time step t, we additionally introduce a Timestep Embedding, and use the same implementation as Ho et al. (2020).
• To make the time step encoding more comprehensive, we add a layer-wise time encoding at each layer, which is added to the encoded hidden states from the last layer.
• In the output part of the decoder, we add two 1D deconvolutions to restore the hidden state back to the waveform. We use deconvolution because we found that using only a linear projection leads to a lack of dependency between the generated waveform and the previous waveform, resulting in noticeable jitter, which is significantly reduced by using deconvolution.
Figure 1: The architecture of the DTS model, which takes $C = [c_1, ..., c_M]$ as the encoder input to predict the frame length $N$. The decoder takes $x_t$ and $t$ as input and conditions on $C$ to predict $x_0$ for the sampling of $x_{t-1}$ according to Eq. 8 and 9. (Diagram labels: Timestep-Embedding, Positional Embedding, 2 x 1DConvolution.)
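The sampling rule in Eq. 8 and 9 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the DTS implementation: the function names `alpha_bar` and `reverse_step`, the offset `s = 0.008`, the way the per-step $\alpha_t$ is derived from the continuous cosine schedule, and the lower clamp on $\alpha_t$ are all choices made here for the example. The prediction `x0_hat` stands in for the output of the denoising network $\hat{x}_\theta(x_t, t, c)$.

```python
import math
import random


def alpha_bar(t, s=0.008):
    """Cumulative signal level for continuous t in [0, 1] (cosine schedule)."""
    def f(u):
        return math.cos((u + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return f(t) / f(0.0)


def reverse_step(x_t, x0_hat, t, t_prev, rng=random):
    """Sample x at time t_prev from x_t, given the predicted x0 (Eq. 8 and 9)."""
    if not 0.0 <= t_prev < t <= 1.0:
        raise ValueError("expected 0 <= t_prev < t <= 1")
    ab_t = alpha_bar(t)
    # Per-step alpha implied by the schedule (assumption). The clamp keeps the
    # very first step from t = 1 finite, similar to clipping beta_t at 0.999.
    alpha_t = max(ab_t / alpha_bar(t_prev), 1e-3)
    sigma_t = math.sqrt(1.0 - alpha_t)
    out = []
    for x, x0 in zip(x_t, x0_hat):
        # Eq. 9: recover the noise estimate from the predicted x0.
        eps = (x - math.sqrt(ab_t) * x0) / math.sqrt(1.0 - ab_t)
        # Eq. 8: posterior mean plus scaled Gaussian noise.
        mean = (x - (1.0 - alpha_t) / math.sqrt(1.0 - ab_t) * eps) / math.sqrt(alpha_t)
        out.append(mean + sigma_t * rng.gauss(0.0, 1.0))
    return out
```

A full sampler would start from Gaussian noise at t = 1 and call `reverse_step` repeatedly on a decreasing grid of t values, querying the network for a fresh `x0_hat` at every step.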
Model WER-all-punct WER-all WER-code-switch WER-zh
FastSpeech 2 13.18 10.75 15.70 8.37
DTS-Mel 13.32 10.28 15.66 7.69
DTS-Wave 12.68 9.82 15.33 7.17
Table 2: This table shows the performance of our TTS models on the GigaS2S dev set, using ground truth transcripts
as input. We compare our models against FastSpeech 2 (Ren et al., 2021), which serves as the baseline. Additionally,
we present a DTS model trained to predict mel-spectrograms (DTS-Mel) for comparison with DTS for waveform
(DTS-Wave). The table reports the word error rate (WER) for the entire set with punctuation (WER-all-punct), WER
for all samples without punctuation (WER-all), WER for code-switch samples without punctuation (WER-code-
switch), and WER for Chinese-only samples without punctuation (WER-zh). The results indicate that DTS-Wave
outperforms the other models, achieving the lowest WER values in all categories.
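For readers unfamiliar with the metric, the word error rate reported in Table 2 is the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The sketch below is a generic illustration, not the evaluation script used for the table: whitespace tokenization, ASCII-only punctuation stripping, and the helper names `wer` and `strip_punct` are assumptions (Chinese text in particular would need its own tokenization and punctuation rules).

```python
import string


def strip_punct(text):
    """Remove ASCII punctuation and normalize whitespace (illustrative only)."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.translate(table).split())


def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    # prev[j] holds the distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


with_punct = wer("hello, world!", "hello world")
without_punct = wer(strip_punct("hello, world!"), "hello world")
```

The two calls at the end mirror the distinction between the with-punctuation and without-punctuation columns: punctuation attached to a word counts as an error unless it is stripped first.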
Model Input    BLEU  ChrF
ASR output     29.0  25.4
Ground Truth   30.7  27.3
Table 4: The performance of our MT models with ground truth input and ASR outputs as the input.
3 Experiment
3.1 Experimental Setup
For the ASR and MT parts of our S2S system, we directly used the same setting as in the Offline track. For the TTS part, we trained the model on the GigaS2S dataset for 360k steps, with a maximum learning rate of 1e-4, a warmup of 20000 steps, and a batch size of 32 samples per GPU. The maximum and minimum audio lengths were restricted to 25 seconds and 0.5 seconds, respectively. The model has 12 layers in the encoder and 16 layers in the decoder, with a hidden dimension of 512 and an FFN dimension of 2048. DTS can directly generate waveforms, but since audio waveforms are usually long, we pre-segment them into equally sized non-overlapping frames. In this way, the model learns to generate the waveform frame by frame, and we only need to flatten these frames to get the final output. In our experiments, we used a frame length of 1200 and a sampling rate of 24000. During inference, we set the number of sampling steps to 100. In addition to the raw waveform, DTS can also learn to generate mel-spectrograms, simply by changing wave frames to spectrogram frames. This is also evaluated in our experiment.
3.2 Experimental Results
In the experiments, we tested the performance of each module in our S2S system separately. In addition to testing with the cascaded results as input, we also conducted independent tests with ground truth input. For the three modules, we mainly used the dev set of GigaS2S for evaluation. In terms of evaluation metrics, we used WER for ASR, and BLEU and ChrF for MT. For TTS, we used a Whisper-medium (Radford et al., 2022) model to transcribe the TTS-generated audio back into text for automatic evaluation and calculated the WER.
ASR Results We evaluated the results of two ASR models trained on the same corpus separately, as well as the ensemble version. As shown in Table 3, the ensemble results were slightly better.
MT Results In the evaluation of MT, we considered two scenarios: using ground truth transcripts as input and using the output of the previous ASR module as input. The experimental results showed that the robustness of MT was relatively good: even if there were errors in the ASR output, the difference in BLEU score was not significant as shown
in Table 4.
TTS Results In the TTS experiments, because the development set of GigaS2S contains code-switching samples, we evaluated not only the WER of the entire set but also separately evaluated the cases without code-switching. As for the models, we chose FastSpeech 2 (FS2) as the baseline. In addition, we trained an additional DTS based on mel-spectrograms for comparison with the waveform-based DTS. Both FS2 and DTS-Mel used the Griffin-Lim vocoder. As shown in Table 2, DTS-Wave outperformed the other two models, especially on Chinese monolingual data.
Full Pipeline Results In addition to testing each module separately, we also tested the final metrics of the entire pipeline. We compared the speech generated by the three TTS models with the MT results as input by computing the BLEU and ChrF against the ground truth translation. Table 5 shows that a difference exists, but it is not significant. Therefore, we can conclude that the quality of the speech generated by TTS does affect the final performance of the S2S system in terms of automatic evaluation, but the impact is still limited.
4 Conclusion
In this paper, we present the system we developed for the IWSLT 2023 speech-to-speech competition. The system includes relatively simple and effective ASR and MT modules, as well as a TTS module that we propose based on the Diffusion Model. In the experiments, we demonstrate that the denoising diffusion process can effectively learn the end-to-end TTS task, simplifying both training and inference. However, its generation speed is relatively slow. In our future work, we will continue to optimize its quality and generation efficiency, and further explore the application of diffusion in end-to-end S2S tasks.
References
Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, William Chen, Khalid Choukri, Alexandra Chronopoulou, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Benjamin Hsu, John Judge, Tom Ko, Rishu Kumar, Xutai Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Matteo Negri, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Elijah Rippeth, Elizabeth Salesky, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Brian Thompson, Marco Turchi, Alex Waibel, Mingxuan Wang, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.
Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondrej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Vera Kloudová, Surafel Melaku Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Miguel Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 evaluation campaign. In Proceedings of the 19th International Conference on Spoken Language Translation, IWSLT@ACL 2022, Dublin, Ireland (in-person and online), May 26-27, 2022, pages 98–157. Association for Computational Linguistics.
Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Zhao You, and Zhiyong Yan. 2021. Gigaspeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. In Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021, pages 3670–3674. ISCA.
Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 878–891. Association for Computational Linguistics.
Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented transformer for speech recognition. In Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020, pages 5036–5040. ISCA.
Jiaxin Guo, Yinglu Li, Minghan Wang, Xiaosong Qiao, Yuxia Wang, Hengchao Shang, Chang Su, Yimeng Chen, Min Zhang, Shimin Tao, Hao Yang, and Ying Qin. 2022. The hw-tsc's speech to speech translation system for IWSLT 2022 evaluation. In Proceedings of the 19th International Conference on Spoken Language Translation, IWSLT@ACL 2022, Dublin, Ireland (in-person and online), May 26-27, 2022, pages 293–297. Association for Computational Linguistics.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Xiaobo Liang, Lijun Wu, Juntao Li, Yue Wang, Qi Meng, Tao Qin, Wei Chen, Min Zhang, and Tie-Yan Liu. 2021. R-drop: Regularized dropout for neural networks. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 10890–10905.
Alexander Quinn Nichol and Prafulla Dhariwal. 2021. Improved denoising diffusion probabilistic models. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8162–8171. PMLR.
Minghan Wang, Jiaxin Guo, Yinglu Li, Xiaosong Qiao, Yuxia Wang, Zongyao Li, Chang Su, Yimeng Chen, Min Zhang, Shimin Tao, Hao Yang, and Ying Qin. 2022a. The hw-tsc's simultaneous speech translation system for IWSLT 2022 evaluation. In Proceedings of the 19th International Conference on Spoken Language Translation, IWSLT@ACL 2022, Dublin, Ireland (in-person and online), May 26-27, 2022, pages 247–254. Association for Computational Linguistics.
Minghan Wang, Jiaxin Guo, Xiaosong Qiao, Yuxia Wang, Daimeng Wei, Chang Su, Yimeng Chen, Min Zhang, Shimin Tao, Hao Yang, and Ying Qin. 2022b. The hw-tsc's offline speech translation system for IWSLT 2022 evaluation. In Proceedings of the 19th International Conference on Spoken Language Translation, IWSLT@ACL 2022, Dublin, Ireland (in-person and online), May 26-27, 2022, pages 239–246. Association for Computational Linguistics.
Rong Ye, Chengqi Zhao, Tom Ko, Chutong Meng, Tao Wang, Mingxuan Wang, and Jun Cao. 2022. Gigast: A 10,000-hour pseudo speech translation corpus. CoRR, abs/2204.03939.
JHU IWSLT 2023 Dialect Speech Translation System Description
Amir Hussein† Cihan Xiao† Neha Verma† Thomas Thebaud†
Matthew Wiesner‡ Sanjeev Khudanpur†‡
† Center for Language and Speech Processing, and
‡ Human Language Technology Center of Excellence,
Johns Hopkins University
{ahussei6, cxiao7, nverma7, tthebau1, wiesner, khudanpur}@jhu.edu
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 283–290
July 13-14, 2023 ©2023 Association for Computational Linguistics
| Condition | ASR | MT |
| (A) Basic | 166 hours of manually transcribed Tunisian telephone speech | 212K lines of manual English translation of the Tunisian transcripts |
| (B) Unconstrained | 1200 hours of Modern Standard Arabic broadcast speech (MGB-2) (Ali et al., 2016). 250 hours of Levantine Arabic telephone conversations (LDC2006S29¹, LDC2006T07²) | Any other English, Arabic dialects, or multilingual models beyond English and Arabic |
¹ https://catalog.ldc.upenn.edu/LDC2006S29
² https://catalog.ldc.upenn.edu/LDC2006T07

brevity, we will refer to these conditions as (A) and (B) respectively.

2.1 Data description
The data we used for the conditions (A) and (B) are listed in Table 1, and sizes of the training, development-testing and test partitions are listed in Table 2. The development and test sets for Tunisian data are provided by the organizers of IWSLT 2023. The data is 3-way parallel: Tunisian Arabic transcripts and English translations are available for each Tunisian Arabic audio utterance. We use the development set for model comparison and hyperparameter tuning, and the test1 set for evaluating our ST systems. Finally, the task organizers provided blind evaluation (test2, test3) sets for final comparison of submissions.

| | ASR (hours) | MT (lines) |
| train (condition A) | 160 | ∼202k |
| train (condition B) | 1200+160+250 | - |
| dev | 3.0 | 3833 |
| test1 | 3.3 | 4204 |
| test2 | 3.6 | 4288 |
| test3 | 3.5 | 4284 |

Table 2: Details for train, dev and test1 sets for constrained condition (A) and unconstrained condition (B).

3 Methods
In this section we describe our cascaded (§3.1), and end-to-end (E2E) (§3.2) speech translation systems as well as our strategy for combining both approaches (§3.3).

3.1 Cascaded ASR-MT
3.1.1 Automatic Speech Recognition
To train ASR models for E2E and cascaded systems, we use the ESPnet (Watanabe et al., 2018) toolkit. Our ASR architecture uses a Branchformer encoder (Peng et al., 2022), a Transformer decoder (Vaswani et al., 2017) and follows the hybrid CTC/attention (Watanabe et al., 2017) approach. Each Branchformer encoder block consists of two branches that work in parallel. One branch uses self-attention to capture long-range dependencies while the other branch uses a multi-layer perceptron with convolutional gating (Sakuma et al., 2021) to capture local dependencies. To mitigate orthographic variations (or inconsistencies) in the ASR transcripts, we augment the training data during the fine-tuning stage by reusing the audio training samples paired with their ASR transcripts, which tend to be orthographically more consistent. We refer to this approach as pseudo-labeling.

Condition (A). We train the ASR model described previously using the constrained Tunisian Arabic audio and transcripts.

Condition (B). The ASR Branchformer in this condition is pretrained on our MGB-2 standard Arabic data (Ali et al., 2016) and then fine-tuned on the provided Tunisian Arabic data. The MGB-2 MSA data differ from the Tunisian data in channel and dialect. Since the Tunisian data are telephone conversations sampled at 8kHz, we downsample the MGB-2 speech from 16kHz to 8kHz, which we previously found was more effective than upsampling the telephone conversations to 16kHz (Yang et al., 2022). We also added additional telephone speech from the Levantine Arabic dialect (Maamouri et al., 2006). Note that Levantine Arabic is very different from Tunisian, and the hope here is to benefit from matched genre and channel conditions, not dialect. We did not explicitly attempt to reduce the dialect mismatch. However, we mitigated some of the spurious orthographic variations in transcripts of dialectal speech by using pseudo-labels for training instead of the manual transcripts, as noted above, in the final fine-tuning step.

3.1.2 Machine Translation
Condition (A). We train an MT model on Tunisian Arabic transcripts paired with their English translations. The MT architecture is similar to the §3.1.1 model architecture, and uses a Branchformer encoder and Transformer decoder.
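
The pseudo-labeling augmentation described in §3.1.1 can be illustrated with a minimal sketch. This is not the ESPnet recipe used for the submission: the `transcribe` callable is a hypothetical stand-in for the first-pass ASR model, and the `(utterance_id, transcript)` tuple layout is an assumption made only for this example.

```python
from typing import Callable, List, Tuple

# Hypothetical layout: (utterance_id, transcript).
Example = Tuple[str, str]


def add_pseudo_labels(
    train_set: List[Example],
    transcribe: Callable[[str], str],
) -> List[Example]:
    """Return the manual examples plus one pseudo-labeled copy per utterance."""
    augmented = list(train_set)  # keep every manually transcribed example
    for utt_id, _manual_transcript in train_set:
        # Reuse the same audio, but pair it with the ASR hypothesis.
        augmented.append((utt_id, transcribe(utt_id)))
    return augmented


if __name__ == "__main__":
    manual = [("utt1", "Ayh"), ("utt2", "gdwA")]
    toy_asr = {"utt1": "Ay", "utt2": "gdwh"}  # toy first-pass ASR outputs
    for example in add_pseudo_labels(manual, toy_asr.__getitem__):
        print(example)
```

The fine-tuning set then contains each audio sample twice, once with its manual transcript and once with the ASR hypothesis. Replacing the manual transcripts outright, as the Condition (B) description suggests for the final fine-tuning step, would correspond to returning only the appended pairs.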
Condition (B). We experiment with two main pre-trained models: mBART and NLLB-200. In the first setting, we use the mBART25 model, which was shown to be slightly better for MSA versus the newer mBART50 model (Liu et al., 2020a; Tang et al., 2020). mBART25 also contains French, Turkish, Italian, and Spanish, all of which contribute loanwords to Tunisian (Zribi et al., 2014). Although these loanwords are transcribed in the Arabic script in our data, there is prior evidence that multilingual language models can benefit from cross-lingual transfer even between different scripts of the same language (Pires et al., 2019).

For NLLB-200, we use the distilled 1.3 billion parameter version of the model, due to space constraints. This model is a dense Transformer distilled from the original NLLB-200 model, which is a 54 billion parameter Mixture-of-Experts model that can translate into and out of 200 different languages. We note that this model supports Tunisian Arabic, the aforementioned contact languages, MSA, as well as other closely related Maghrebi dialects (Moroccan, Egyptian, Maltese). The breadth of language coverage seen during the training of NLLB-200 makes this model an attractive choice for a dialect speech translation task.

We fine-tune these models on the provided ∼200K lines of Tunisian Arabic-English data. The source side is normalized as described in Section 4. We preprocess all data with the provided pretrained sentencepiece vocabularies released with the models with no pre-tokenization. Results on MT systems are included in Table 8.

3.2 End-to-End Speech Translation
For the constrained condition we adopt the hierarchical multi-decoder architecture proposed by Yan et al. (2022).

Condition (A). The system consists of a multi-task learning approach, which combines ASR and MT sub-nets into one differentiable E2E system where the hidden representation of the speech decoder is fed as input to the MT encoder. Addition-
tion on the E2E-ST system and pre-trained MT initialization.

Condition (B). For the unconstrained condition, we propose a novel E2E-ST system that incorporates the combination of a pretrained ASR module and a pretrained MT module. Specifically, we combine the Branchformer ASR module described in Section 3.1, with mBART (Liu et al., 2020b), which was fine-tuned on Tunisian data. We modify the ESPnet ST recipe to incorporate the mBART model trained by the fairseq (Ott et al., 2019) framework. The architecture of the model is shown in Figure 1. In contrast to the modified Hierarchical Multi-Decoder architecture for Condition (A), to fully exploit the effect of MT pretraining, we removed the speech attention from the MT decoder that attends to the hierarchical encoder’s hidden representations.

Specifically, the ASR encoder module in the proposed architecture takes in a sequence of audio features $x_1, x_2, \cdots, x_T$ and generates a sequence of hidden representations with length $N$, optimized with respect to the ASR CTC objective. The ASR decoder takes in the ASR encoder’s hidden representations and autoregressively produces a sequence of logits with length $L$ trained by the label-smoothing loss. The hierarchical speech encoder module is trained directly by the ST CTC loss for generating auxiliary frame-level labels in the target language to aid the ST decoding process. The primary innovation of the proposed system lies in the fully-connected layer that maps the ASR decoder’s output hidden representations to some representations that resemble mBART’s encoder’s embedding layer’s outputs, making the full system differentiable. The ST encoder subsequently encodes the input representations and feeds them into its decoder. The ST decoder, slightly different from the vanilla mBART decoder, optionally runs hybrid/joint CTC decoding at inference time, with the ST-CTC auxiliary labels and the autoregressively generated ST outputs with target length $M$, i.e. $y_1^{ST}, y_2^{ST}, \cdots, y_M^{ST}$.
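
The role of the fully-connected bridging layer in Condition (B) can be shown with a small, dependency-free sketch. It only illustrates the shape mapping (one linear transformation applied at every ASR decoder position), not the authors' ESPnet/fairseq implementation. The class name `LinearBridge` is hypothetical, and the dimensions mentioned in the comments (256 for the ASR decoder, 1024 for the mBART embedding size) are assumptions.

```python
import random
from typing import List

Vector = List[float]


class LinearBridge:
    """y = W x + b, applied independently to every ASR decoder position.

    In the setting described above one might use in_dim=256 (ASR decoder)
    and out_dim=1024 (mBART embedding size); both values are assumptions.
    """

    def __init__(self, in_dim: int, out_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        scale = in_dim ** -0.5  # simple uniform initialisation range
        self.weight = [
            [rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)
        ]
        self.bias = [0.0] * out_dim

    def __call__(self, hidden_states: List[Vector]) -> List[Vector]:
        # One output vector per input position; the sequence length is unchanged.
        return [
            [sum(w * x for w, x in zip(row, h)) + b
             for row, b in zip(self.weight, self.bias)]
            for h in hidden_states
        ]


if __name__ == "__main__":
    bridge = LinearBridge(in_dim=4, out_dim=8)
    asr_decoder_states = [[0.1, -0.2, 0.3, 0.0] for _ in range(3)]  # length L = 3
    mt_encoder_inputs = bridge(asr_decoder_states)
    print(len(mt_encoder_inputs), len(mt_encoder_inputs[0]))
```

Because the mapping is a plain linear layer, gradients from the translation loss can flow back into the ASR decoder, which is what makes the combined system trainable end to end.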
Figure 1: E2E model architecture with mBART MT module. The fully-connected (FC) layer applies a linear
transformation to the ASR decoder’s final hidden representation, which is then used to replace mBART’s encoder’s
embedding layer’s output.
and the same NLLB-200 MT component as in our best cascaded system. In Table 6, the 5 combined systems are referred to as A3, B1, B3, B4, and B5, in order.

3.3.1 Minimum Bayes Risk
We applied Minimum Bayes Risk decoding (Kumar and Byrne, 2004) to combine the hypotheses produced by five systems. For a given speech utterance $x_i$, and for a given system $s^j_{\theta_j}$ ($j \in S$ and $\theta_j$ the set of parameters used by the $j$-th trained system), we can define the translation hypothesis as $y_i^j = f^j_{\theta_j}(x_i)$ and let $p_i^j$ be the probability that the hypothesis $y_i^j$ would be output. We use this probability as a self-confidence score. Let $L$ be a similarity metric used to compare two hypotheses, outputting a scalar that rises if the two hypotheses are more similar. Then, for a given speech utterance $x_i$, and for a given set of systems $S$, we define the best output as the one minimizing the distance with others while having the highest confidence:

$$y_i^{mbr} = \max_{y_i^j} \sum_{j \in F} p_i^j \sum_{k \in F} L(y_i^j, y_i^k) \qquad (1)$$

3.3.2 Variations of MBR
Baseline MBR For our first combination, we compute the outputs according to the MBR using the BLEU score of sacrebleu (Post, 2018a) as the similarity metric $L$, and the posterior probabilities $p_i^j$ used are the log-likelihood ratios given by the end-to-end systems and the MT systems.

Unscored MBR For our second combination, we use the same technique but with a constant $p_i^j = 1$ for every system, as a simplified version of the Generalized MBR (Duh et al., 2011).

COMET-MBR For our third combination, we utilized the comet-mbr framework, which employs the COMET score between the source and hypothesis as the similarity metric ($L$), using the same Equation (1), without the use of posterior probabilities (Fernandes et al., 2022). We used wmt20-comet-da for MBR scoring (Rei et al., 2020). Despite Tunisian Arabic not being a COMET-supported language, we observed an improvement compared to our single best system, suggesting that this approach may extend to dialects of languages covered by COMET.

4 Experiments
In this section, we describe our experiments on the ASR, MT, and ST tasks. In order to reduce the orthographic variation in the Tunisian speech transcription we performed additional text normalization similar to Yang et al. (2022), which showed significant improvements on ASR, MT and ST tasks. The normalization is performed on both Tunisian and MSA transcripts and includes removing diacritics and single character words, and Alif/Ya/Ta-Marbuta normalization (see Yang et al. (2022) for more details).

4.1 ASR
First we augment the raw audio segments by applying speed perturbation with three speed factors of 0.9, 1.0 and 1.1 (Ko et al., 2015). Then we transform the augmented audio to a sequence of 83-dimensional feature frames for the E2E model: 80-dimensional log-mel filterbank coefficients with 3 pitch features (Ghahremani et al., 2014). We normalize the features by the mean and the standard deviation calculated on the entire training set. In addition, we augment the features with the SpecAugment approach (Park et al., 2019), with mask parameters (mT, mF, T, F) = (5, 2, 27, 0.05) and bi-cubic time-warping. The E2E Branchformer-based ASR model was trained using the Adam optimizer for 50 epochs with dropout-rate 0.001, warmup-steps of 25000 for condition (A) and 40000 for condition (B). The BPE vocabulary size is 500 for condition (A) and 2000 for condition (B). Table 3 summarizes the best set of parameters that were found for the Branchformer architecture. We note here that the Branchformer has 28.28 M parameters, which is approximately one-fourth the number of parameters in the Conformer (Yang et al., 2022), which is 116.15 M.

| Att heads | CNN | Enc layers | Dec layers | dk | FF |
| 4 | 31 | 12 | 6 | 256 | 2048 |

Table 3: Values of condition (A) and (B) hyperparameters. CNN: refers to CNN module kernel, Att: attention, Enc: encoder, Dec: decoder, and FF: fully connected layer.

MGB2-tune: the pretrained model on MGB-2 is fine-tuned on Tunisian data from condition (A) by updating all model parameters with 1/10 of the learning rate that was used during the training, similar to Hussein et al. (2021). In addition, we examine the effect of adding ASR outputs to the ground truth source during finetuning (pseudo labeling) and adding additional telephone data (Tel). The ASR results are summarized in Table 4 and compared to the state-of-the-art conformer results from Yang et al. (2022). The MD refers to the hierarchical multi-decoder ST architecture adopted from Yan et al. (2022), and MD-ASR refers to the ASR sub-module of the ST. It can be observed that the Branchformer provides slightly better results compared to the previous best conformer with similar size on both conditions (A) and (B). In addition, it can also be seen that pseudo labeling provides 2% relative improvement. We found that there is a high inconsistency between different transcribers since there is no standard orthography in Tunisian dialect. By incorporating the ASR predictions in this way, we aim to provide the model with more examples of the Tunisian dialect and help it better generalize to variations in the spoken language. To confirm this hypothesis we take a closer look at the most frequent top four substitutions shown in Table 5. The words are transliterated using Buckwalter transliteration (BW)³ to make it readable for non-Arabic speakers. It can be seen that each substituted form appears both as a hypothesis and as a correct reference, which indicates that the assumption of reference inconsistency holds true. Finally, channel matching using more telephone data provides an additional 2.5% relative improvement.

| ASR-ID | Model | dev WER (↓) | test1 WER (↓) | test2 WER (↓) | test3 WER (↓) |
| A1 | Conformer (Yang et al., 2022) | 40.8 | 44.8 | 43.8 | - |
| A2 | Branchformer | 40.1 | 44.5 | - | - |
| B1 | MGB2-tune (Yang et al., 2022) | 38.8 | 43.8 | 42.8 | - |
| B2 | MGB2-tune Branchformer | 38.3 | 43.1 | - | - |
| B3 | + Pseudo | 37.5 | 42.6 | - | - |
| B4 | + Tel | 36.5 | 41.7 | 40.6 | 41.6 |
| B5 | E2E-MD-ASR | 40.6 | 45.1 | 43.7 | 44.9 |
| B6 | E2E-mBART-ASR | 37.7 | 43.2 | 41.5 | 42.6 |

Table 4: WER (%) of ASR models on dev, test1, test2 and test3 sets. A* and B* IDs are the ASR models developed under condition (A) and condition (B) respectively. B5 refers to the ASR submodule of the MD-ASR system under the constrained condition and B6 refers to the ASR sub-module of the E2E-mBART system, both described in Section 3.2.

| Count: BW (REF / HYP) | Arabic (REF / HYP) | English Translation |
| 69: Ayh / Ay | ايه / اي | yes |
| 61: Ay / Ayh | اي / ايه | yes |
| 18: Akhw / khw | اكهو / كهو | it’s |
| 17: khw / Akhw | كهو / اكهو | it’s |
| 8: gdwA / gdwh | غدوا / غدوه | tomorrow |
| 7: gdwh / gdwA | غدوه / غدوا | tomorrow |

Table 5: Top 6 substitutions with inconsistencies for ASR system transliterated using Buckwalter (BW). The number of times each error occurs is followed by the word in the reference and the corresponding hypothesis.

4.2 MT
We train the MT models as described in Section 3.1.2. For condition (A) the MT system parameters are shown in Table 7. In this condition, our MT system is finetuned on the training Tunisian data where the source data is mixed with ASR outputs, in order to be more robust to noisy source data. We use 5000 Byte-pair encoding (BPE) units shared between Tunisian Arabic and English. We train

³ https://en.wikipedia.org/wiki/Buckwalter_transliteration
| ST-ID | Type | Pretrained ASR | Pretrained MT | dev BLEU (↑) | test1 BLEU (↑) | test2 BLEU (↑) | test3 BLEU (↑) |
| A1 | Cascade | A2 | A3 | 18.9 | 15.6 | - | - |
| A2 | E2E-MD (Yan et al., 2022) | A2 | - | 20.6 | 17.1 | - | - |
| A3 | E2E-MD+norm | A2 | - | 20.7 | 17.5 | 19.1 | 17.6 |
| B1 | E2E-mBART | B4 | B2 | 20.7 | 17.5 | 17.5 | 17.1 |
| B2 | Cascade-mBART | B4 | B2 | 20.9 | 17.9 | - | - |
| B3 | Cascade-Base-NLLB200 | B4 | B3 | 22.2 | 19.2 | 21.2 | 18.7 |
| B4 | Cascade-B5-ASR-NLLB200 | B5 | B3 | 21.1 | 18.3 | 19.9 | 18.2 |
| B5 | Cascade-B6-ASR-NLLB200 | B6 | B3 | 22.2 | 18.8 | 20.7 | 18.3 |
| B6 | MBR with scores | - | - | 21.7 | 18.8 | 18.7 | 17.1 |
| B7 | MBR no scores | - | - | 22.7 | 19.6 | 20.6 | 18.8 |
| B8 | comet-mbr | - | - | 22.7 | 19.6 | 21.6 | 19.1 |

Table 6: Results of cascaded, E2E, and combined systems measured by BLEU score on the dev, test1, test2 and test3 sets. E2E-MD is the hierarchical multi-decoder described in §3.2. Norm indicates the use of text normalization (§4), which is used with all systems except A2. Pretrained indicates the use of pretrained ASR and MT systems from Tables 4 and 8. A* and B* IDs are the models developed under condition (A) and condition (B) respectively.
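
The MBR combinations reported as B6 to B8 in Table 6 follow Equation (1) of §3.3.1. The sketch below shows one reading of that equation: each hypothesis is scored by its confidence times its summed similarity to all hypotheses, and the highest-scoring hypothesis is selected. The function names are hypothetical, and the Jaccard word overlap is only a toy stand-in for the BLEU and COMET similarities used in the paper. Passing no confidences corresponds to the unscored variant with a constant $p_i^j = 1$.

```python
from typing import Callable, Optional, Sequence


def token_overlap(hyp: str, other: str) -> float:
    """Toy similarity in [0, 1]: Jaccard overlap of the two word sets."""
    a, b = set(hyp.split()), set(other.split())
    return len(a & b) / len(a | b) if (a | b) else 1.0


def mbr_select(
    hypotheses: Sequence[str],
    similarity: Callable[[str, str], float] = token_overlap,
    confidences: Optional[Sequence[float]] = None,
) -> str:
    """Pick the hypothesis with the largest confidence-weighted agreement."""
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    if confidences is None:
        confidences = [1.0] * len(hypotheses)  # unscored MBR
    scores = [
        p * sum(similarity(hyp, other) for other in hypotheses)
        for hyp, p in zip(hypotheses, confidences)
    ]
    best = max(range(len(hypotheses)), key=scores.__getitem__)
    return hypotheses[best]


if __name__ == "__main__":
    system_outputs = ["i am going home", "i am going home now", "i am going"]
    print(mbr_select(system_outputs))
    print(mbr_select(system_outputs, confidences=[0.1, 1.0, 1.0]))
```

Swapping in a sentence-level BLEU or COMET scorer only requires passing a different `similarity` callable; the selection logic stays the same.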
Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020a. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020b. Multilingual denoising pre-training for neural machine translation.

Mohamed Maamouri et al. 2006. Levantine Arabic QT training data set 5, speech LDC2006S29. Web Download.

NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia-Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2022. No language left behind: Scaling human-centered machine translation.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. Proc. Interspeech, pages 2613–2617.

Yifan Peng, Siddharth Dalmia, Ian Lane, and Shinji Watanabe. 2022. Branchformer: Parallel MLP-attention architectures to capture local and global context for speech recognition and understanding. In International Conference on Machine Learning, pages 17627–17643. PMLR.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Matt Post. 2018a. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Matt Post. 2018b. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. arXiv preprint arXiv:2009.09025.

Jin Sakuma, Tatsuya Komatsu, and Robin Scheibler. 2021. MLP-based architecture with variable length input for automatic speech recognition.

Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2020. Multilingual translation with extensible multilingual pretraining and finetuning. arXiv preprint arXiv:2008.00401.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, Ł. Kaiser, and I. Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Yalta, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai. 2018. ESPnet: End-to-end speech processing toolkit. ArXiv, abs/1804.00015.

Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi. 2017. Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11:1240–1253.

Brian Yan, Patrick Fernandes, Siddharth Dalmia, Jiatong Shi, Yifan Peng, Dan Berrebbi, Xinyi Wang, Graham Neubig, and Shinji Watanabe. 2022. CMU’s IWSLT 2022 dialect speech translation system. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 298–307, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Jinyi Yang, Amir Hussein, Matthew Wiesner, and Sanjeev Khudanpur. 2022. JHU IWSLT 2022 dialect speech translation system description. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 319–326.

Inès Zribi, Rahma Boujelbane, Abir Masmoudi, Mariem Ellouze, Lamia Belguith, and Nizar Habash. 2014. A conventional orthography for Tunisian Arabic. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 2355–2361, Reykjavik, Iceland. European Language Resources Association (ELRA).
Learning Nearest Neighbour Informed Latent Word Embeddings
to Improve Zero-Shot Machine Translation
Figure 2: A NN-informed embedding for an arbitrary subword shire is produced by averaging across nearby
subwords from various languages, and combining with a semantic representation extracted from this average.
| | De-It ← | De-It → | De-Nl ← | De-Nl → | De-Ro ← | De-Ro → | It-Nl ← | It-Nl → | It-Ro ← | It-Ro → | Nl-Ro ← | Nl-Ro → | zero | sup. |
| Base M2M | 15.64 | 15.28 | 18.46 | 18.14 | 14.42 | 14.98 | 18.16 | 18.79 | 17.91 | 20.14 | 15.81 | 16.41 | 17.01 | 30.62 |
| SRA (2019) | 16.44 | 16.45 | 18.44 | 19.15 | 15.07 | 15.83 | 19.30 | 19.10 | 18.52 | 21.52 | 16.83 | 17.66 | 17.85 | 30.41 |
| SF (2019) | 16.34 | 15.77 | 18.37 | 18.16 | 14.74 | 15.25 | 18.6 | 19.18 | 18.54 | 21.64 | 16.09 | 16.94 | 17.46 | 30.50 |
| LV (2021) | 16.82 | 15.81 | 18.74 | 18.64 | 15.12 | 16.32 | 18.92 | 19.29 | 18.70 | 22.13 | 16.21 | 18.22 | 17.91 | 30.51 |
| CL (2021b) | 17.31 | 16.21 | 19.70 | 19.57 | 15.32 | 16.25 | 18.90 | 20.09 | 19.07 | 22.44 | 17.14 | 17.99 | 18.33 | 30.29 |
| DP (2021) | 16.62 | 15.64 | 19.64 | 18.78 | 15.07 | 15.96 | 19.01 | 20.15 | 18.67 | 21.56 | 16.46 | 18.18 | 17.97 | 30.49 |
| Ours | 17.41 | 16.89 | 19.71 | 19.21 | 15.60 | 16.22 | 19.30 | 20.10 | 19.60 | 21.88 | 17.25 | 18.40 | 18.47 | 30.62 |

Table 1: BLEU on IWSLT17 test set (mean of 3 runs). Zero and sup. are average zero-shot and supervised results.
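
As a quick check of how the "zero" column is formed, the snippet below averages the twelve zero-shot scores of the Base M2M row. The variable names are illustrative only. Because the table reports means over three runs, the same check on other rows may differ from the reported value in the last digit.

```python
# Twelve zero-shot BLEU scores of the "Base M2M" row in Table 1.
base_m2m = [15.64, 15.28, 18.46, 18.14, 14.42, 14.98,
            18.16, 18.79, 17.91, 20.14, 15.81, 16.41]

zero_avg = sum(base_m2m) / len(base_m2m)
print(f"{zero_avg:.2f}")  # 17.01, matching the "zero" column
```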
the averaged embedding $\mathrm{EMB}_\mu(w)$ as query:

$$\mathrm{EMB}_{latent}(w) = \mathrm{Softmax}\big(\mathrm{EMB}_\mu(w) \cdot W_{sem}^{T}\big)\, W_{sem} \qquad (3)$$

A residual connection from $\mathrm{EMB}_\mu(w)$ gives the final NN-informed word embedding:

$$\mathrm{EMB}_{knn}(w) = \mathrm{EMB}_{latent}(w) + \mathrm{EMB}_\mu(w) \qquad (4)$$

$\mathrm{EMB}_{knn}(w)$ is a drop-in replacement for a conventional word embedding $\mathrm{EMB}(w)$.

Modelling Prediction Consistency Given a source sentence represented using conventional word embeddings and using NN-informed embeddings, following Kambhatla et al. (2022b) we model the loss with respect to target sentence $y_i$ as:

$$\mathcal{L}^i = \underbrace{\alpha_1 \mathcal{L}^i_{NLL}\big(p_\Theta(y_i \mid x_i)\big)}_{\text{source x-entropy}} + \underbrace{\alpha_2 \mathcal{L}^i_{NLL}\big(p_\Theta(y_i \mid \mathrm{kNN}(x_i))\big)}_{\text{k-NN embeds. source x-entropy}} + \underbrace{\beta\, \mathcal{L}^i_{dist}\big(p_\Theta(y_i \mid x_i),\, p_\Theta(y_i \mid \mathrm{kNN}(x_i))\big)}_{\text{agreement loss}} \qquad (5)$$

where $\mathrm{kNN}(x_i)$ denotes the set of k-nearest neighbors to token $x_i$. This loss combines three terms: the first two are conventional negative log-likelihoods, while the third is an agreement loss measuring pairwise symmetric KL divergence between the output distributions for $x_i$ and $\mathrm{kNN}(x_i)$. This agreement-loss term performs co-regularization by allowing explicit interactions between source sentences with and without NN-informed embeddings.

3 Experiments
3.1 Datasets
We conduct experiments on 2 multilingual datasets, each with BPE (Sennrich et al., 2016) vocabulary size of 32k subwords:

IWSLT17 (Cettolo et al., 2012) is an English-centric dataset³ totalling 1.8M parallel sentences. It has 8 supervised directions to and from German, Italian, Dutch and Romanian, each with about 220,000 parallel sentences, and 12 zero-shot directions. We use the official validation and test sets.

TED59 (Qi et al., 2018) is a massively multilingual English-centric dataset⁴ with 116 translation directions totalling 10.8M parallel sentences. The imbalanced data (from 0.25M to just 2000 parallel samples for some language pairs) makes it ideal to study the effects of our method. Following Aharoni et al. (2019) and Raganato et al. (2021), we evaluate on 16 supervised pairs and 4 zero-shot (Arabic ↔ French, Ukrainian ↔ Russian).

³ https://wit3.fbk.eu/2017-01
⁴ github.com/neulab/word-embeddings-for-nmt

3.2 Baselines and Related Work
We compare against methods for encoder manifold alignment. These include strong baselines such as sentence representation alignment (SRA; Arivazhagan et al. 2019), softmax forcing (SF; Pham et al. 2019), the contrastive multilingual model (CL; Pan et al. 2021b), multilingual Transformer with disentangled positional embedding (DP; Liu et al. 2021), and latent variable based denoising (LV; Wang et al. 2021), along with the vanilla many-to-many zero-shot model (M2M). On TED59, we compare against CL and 3 explicit multilingual alignment techniques proposed by Raganato et al. (2021): word-alignment, language tag alignment, and the union of the two. We also implement and compare against Raganato et al.’s (2021) sparse 1.5entmax cross-attention variant.

3.3 Model and Implementation Details
All models use the configuration in Vaswani et al. (2017) using the fairseq toolkit (Ott et al., 2019). See reproducibility details in Appendix A.
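
Equations (3) to (5) can be made concrete with a small, dependency-free sketch. It is an illustration under stated assumptions rather than the authors' fairseq implementation: all function names are hypothetical, $W_{sem}$ is assumed to be stored as N rows of embedding dimension, the loss is written for a single target position, and the symmetric KL term is taken as the average of the two directed divergences because the text does not spell out its exact form.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> Vector:
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nn_informed_embedding(emb_mu: Vector, w_sem: List[Vector]) -> Vector:
    """Eqs. (3)-(4): attend over the N rows of W_sem, then add the residual."""
    attn = softmax([sum(q * k for q, k in zip(emb_mu, row)) for row in w_sem])
    latent = [
        sum(a * row[d] for a, row in zip(attn, w_sem))
        for d in range(len(emb_mu))
    ]
    return [lat + q for lat, q in zip(latent, emb_mu)]


def symmetric_kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Average of KL(p||q) and KL(q||p); assumes strictly positive entries."""
    kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


def consistency_loss(
    p_plain: Sequence[float],
    p_knn: Sequence[float],
    target: int,
    alpha1: float = 1.0,
    alpha2: float = 1.0,
    beta: float = 5.0,
) -> float:
    """Eq. (5) for one target position: two NLL terms plus the agreement term."""
    nll_plain = -math.log(p_plain[target])
    nll_knn = -math.log(p_knn[target])
    return alpha1 * nll_plain + alpha2 * nll_knn + beta * symmetric_kl(p_plain, p_knn)


if __name__ == "__main__":
    w_sem = [[1.0, 0.0], [0.0, 1.0]]  # N = 2 latent rows, dimension 2
    print(nn_informed_embedding([0.0, 0.0], w_sem))
    print(consistency_loss([0.7, 0.3], [0.6, 0.4], target=0))
```

When the two output distributions coincide, the agreement term vanishes and the loss reduces to the weighted sum of the two negative log-likelihoods.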
| Model | Θ | En→X | X→En | Zero-Shot | Acc0 |
| Aharoni et al. – 106 langs | 473M | 20.11 | 29.97 | 9.17 | - |
| Aharoni et al. – 59 langs | 93M | 19.54 | 28.03 | - | - |
| Transformer M2M reimp. | 93M | 18.98 | 27.22 | 7.12 | 74.10 |
| Contrastive (2021b) | 93M | 19.09 | 27.29 | 8.16 | 73.90 |
| Ours | 77M | 19.01 | 27.11 | 10.03 | 95.81 |
| Raganato et al. (2021): | | | | | |
| ZS + 1.5entmax (ibid.) | 93M | 18.90 | 27.21 | 10.02 | 87.81 |
| ↳ Word Align (ibid.) | 93M | 18.99 | 27.58 | 8.38 | 73.12 |
| ↳ LangID Align (ibid.) | 93M | 18.98 | 27.48 | 6.35 | 65.01 |
| ↳ Word + LangID Align | 93M | 19.06 | 27.37 | 11.94 | 97.25 |
| Ours + 1.5entmax | 77M | 18.94 | 27.42 | 12.11 | 98.90 |

Table 2: Average BLEU scores on the TED59 dataset. Our model produces zero-shot translations in the correct output language with high accuracy (Acc0).
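
The target-language accuracy (Acc0) reported in Table 2 counts how often the language detected in a zero-shot translation matches the intended target language. A minimal sketch of that computation follows. The paper uses FastText for language identification; here `identify_language` is a hypothetical callable that stands in for any such tool, and the keyword-based detector in the demo is purely a toy.

```python
from typing import Callable, Sequence


def target_language_accuracy(
    translations: Sequence[str],
    target_langs: Sequence[str],
    identify_language: Callable[[str], str],
) -> float:
    """Percentage of translations whose detected language equals the target."""
    if len(translations) != len(target_langs):
        raise ValueError("translations and target_langs must have equal length")
    if not translations:
        return 0.0
    matches = sum(
        identify_language(text) == lang
        for text, lang in zip(translations, target_langs)
    )
    return 100.0 * matches / len(translations)


if __name__ == "__main__":
    def toy_detector(text: str) -> str:
        # Toy rule: treat sentences containing "der" as German, else Italian.
        return "de" if "der" in text.split() else "it"

    outputs = ["der Hund schläft", "il cane dorme", "der Mann liest"]
    wanted = ["de", "it", "it"]  # the third output is off-target
    print(target_language_accuracy(outputs, wanted, toy_detector))
```

An off-target translation, where the model ignores the language tag, lowers this score even if the output would be a fluent sentence in some other language.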
We use ScANN (Guo et al., 2020) for efficient ANN search⁵ with $k = 3$. To increase training speeds, we cache each subword’s ANNs for 400 iterations before recomputing them. We only (periodically) cache subword IDs: the embedding $\mathrm{EMB}_\mu(\cdot)$ is always computed directly from $W_{emb}$.

⁵ We use asymmetric hashing with 2-dimensional blocks and a quantization threshold of 0.2, and re-order the top 100 ANN candidates.

We set $\lambda = 0.5$, $\alpha_1, \alpha_2 = 1$, and $\beta = 5$. The attentional latent semantic representation layer has 512 dim (same as the embedding layer) and a size $N$ of 1000 for IWSLT17 (smaller dataset) and 5000 for TED59 (larger dataset). We did not tune this hyperparameter and chose the values based on the size of the datasets. For evaluation, we report sacreBLEU (Post, 2018).

3.4 Results
Main Results. Tables 1 and 2 show our main results. On IWSLT17, our latent k-NN embedding model outperforms several strong baselines, including sentence-representation alignment and contrastive learning, by an average of 0.62 and 0.11 BLEU respectively across the 12 zero-shot pairs. Compared to the baseline many-to-many model, our method yields a 1.5 BLEU gain on average. Our method is able to improve zero-shot performance without deteriorating supervised performance.

On the TED59 dataset, we follow Raganato et al. (2021) in comparing against two multilingual model variants: the standard Transformer, and the Transformer with sparse entmax instead of standard softmax cross-attention. Our approach gains ∼3 BLEU points against the baseline, and 2 BLEU against the stronger contrastive model. Further, our model consistently outperforms strong, explicitly alignment-based methods.

Target-language Accuracy. To supplement the evaluation, we provide the accuracy score for target language identification⁶ in zero-shot scenarios, called Acc0. While the classical many-to-many NMT models (Johnson et al., 2017; Aharoni et al., 2019) enable zero-shot translations, several studies have shown that these models fail to reliably generalize to unseen language pairs, ending up with an off-target translation issue (Zhang et al., 2020). The model ignores the language label and the wrong target language is produced as a result. We observe significant improvements in target language accuracy, up to nearly 99% (absolute).

⁶ We utilize FastText (Joulin et al., 2017) as a language identification tool to compare the translation language with the reference target language and keep count of the number of matches.

4 Analysis
Ablation Study. Table 3 reports ablations on the IWSLT17 test set. We find that kNN embeddings alone yield improvements over the baseline many-to-many model. By contrast, absent the other parts of our model, the attentional semantic layer deteriorates model performance. Only in combination with the agreement loss do we observe a benefit from this component.

| ID | Component | dev.2010 | test.2010 |
| 1 | many-to-many (zero-shot) | 15.95 | 18.46 |
| 2 | 1 + attn. semantic repr. | 15.43 | 17.83 |
| 3 | 1 + kNN embeds | 17.11 | 19.69 |
| 4 | 2 + kNN embeds | 16.60 | 19.08 |
| 5 | 3 + agreement loss | 17.99 | 20.91 |
| 6 | 4 + agreement loss | 18.31 | 21.01 |

Table 3: Effect of different components of our model on the IWSLT17 datasets. We report sacreBLEU scores on the two official validation sets with beam size 1.

Embedding Analysis. Figure 3 visualizes subword representations from models trained on IWSLT17. Each subword is colored according to the language in which it is most frequent. The overall layout of the two spaces is similar, although the baseline model (left) exhibits a clear ring-shaped gap dividing the embeddings into two groups. With ANN embeddings (right), this gap is eliminated and the layout of the embeddings appears more homogeneous. Quantitatively, the average distance from a subword to its neighbors exhibits a smaller variance in the ANN model than in the baseline, which further supports the reading that ANN training creates a more homogeneous representation space in which subwords are more uniformly distributed.

We quantify this trend by labeling each subword according to the language in which it is most frequently attested. In the baseline model, we find that on average only 2.7 of a subword’s 6 nearest neighbors come from the same language as that subword. This average rises to 3.6 in the ANN model, demonstrating that ANN training significantly increases the number of same-language neighbors on average.

In the ANN model, a few rare subwords (?, ž, ć) are disproportionately common among the nearest neighbors of many other subwords. We speculate that these tokens may act as pivots for information to flow between their many neighbours. Their high centrality means that these tokens provide avenues for information to flow between a large number of subwords, even those which never occur in sentences together. Because these tokens are rare, there is also very little penalty for the model to “corrupt” their representations with information from neighboring subwords.
5 Other Related Work
A vast body of work addresses zero-shot transla-
tion. Most methods focus on producing language-
agnostic encoder outputs (Pham et al., 2019). Wei
et al. (2021) introduce multilingual contrastive
learning, while Yang et al. (2021) adopt auxiliary
target language prediction. To enable the input to-
kens to be positioned without constraints, Liu et al.
Figure 3: t-SNE visualization of subword embeddings (2021) eliminate the residual connections within
from IWSLT17 models trained without (left) and with
a middle layer of the encoder. Yang et al. (2022);
(right) ANN embeddings. Points are colored according
to the language where the corresponding subword is
Gu and Feng (2022) employ optimal transport to
most frequent. ANN embeddings decrease the separa- improve contextual cross-alignments, in contrast
tion between some monolingual subspaces, and remove to our method which performs soft, non-contextual
others entirely. alignment between subwords in the continuously-
updating embedding space. Other methods ex-
Table 4 shows nearest neighbors for a random tend the training data using monolingual data (Al-
sample of subwords (additional examples in Table 5 Shedivat and Parikh, 2019) to pretrain the decoder
in Appendix B). With ANN training, a subword’s (Gu et al., 2019), and random-online backtransla-
nearest neighbors are generally its synonyms (e.g. tion (Zhang et al., 2020). Lin et al. (2021); Reid and
_wonderful, _large _tremendous, and _big Artetxe (2022) use dictionary based alignments to
as neighbors to _great) or derived forms (e.g. produce pseudo-cross-lingual sentences. Other ap-
_încep, _începem, _început, _începe be- proaches that enhance token level representations
side _înceap). In the baseline, it is more likely include multiple subword segmentations (Wu et al.,
to find neighbors with no apparent relation, such as 2020; Kambhatla et al., 2022a), enciphered source
_erzählen ‘tell’ and _stemmen ‘hoist’ or ‘accom- text (Kambhatla et al., 2022b) and stroke sequence
plish’ beside _America. This suggests that ANN modelling (Wang et al., 2022). While all these tech-
embeddings help a model to better organize its sub- niques rely on multilingual training paradigm for
word embedding space into coherent, semantically- machine translation, they either rely on external
related subspaces. data and use explicit augmentations. We do not
use any external data or explicit alignments and our model can be trained end-to-end like a regular multilingual model.

Subword | Nearest Neighbors (Baseline) | Nearest Neighbors (Ours)
_great | _gesproken _schaffen ppy ită _prosper _senior | _wonderful _large _tremendous _big _great ?
_înceapă | _popolare _condotto _mişcă _bekijken _crească _creeze | _gepubliceerd _încep _începem _început _începe muovono
_America | tate _erzählen _stemmen dine _facultate _chestiune | _USA _Asia _Africa _American _America ć
_play | _lavori eranno _tenuto _bekijken - möglichkeiten | play _playing _Play _played _play ?
_football | _pesci bon _surf _betrachten _Hintergrund möglichkeiten | _weather _baseball ball _montagna _biodiversità _football
ing | ificazione izăm amento tung erende ende | ling ting ung ž ingen ing
_fish | _petrec schen _Sachen _feed _chestii möglichkeiten | fisch _pesce _pesca _Fisch _fish ?

Table 4: Approximate nearest neighbors for a sample of subwords, computed with (right) and without (left) ANN training.

6 Conclusion

We described a novel approach to harness nearest neighbors at the token level and learn nearest-neighbour informed word embeddings for every word in a source language for many-to-many multilingual translation. Our experiments show that this simple yet effective approach results in consistently better zero-shot translations across multiple multilingual datasets. Additionally, our model produces translations in the right target language with high accuracy. Our analysis shows that our model learns to organize subwords into semantically-related neighborhoods, and reduces the separation between monolingual subspaces in the embedding space.

Limitations

While our method is effective in zero-shot settings, we find that it has limited implications in supervised settings. This is because improving zero-shot translation presents a tug-of-war between language-agnostic and language-specific representations, each of which has a distinct effect on the model. Another major downside is reduced training speed relative to the baseline many-to-many model. We note that this is an artifact of the agreement loss (KLDiv.) which entails two forward-passes for each update. Finally, in the present work, we compute k-NNs for every source word in a sentence. Although this has yielded strong results, we would like to explore a more explainable setting where k-NNs can be applied to specific source words. We leave such explorations to future work.

NSERC RGPIN-2018-06437 and RGPAS-2018-522574 and a Department of National Defence (DND) and NSERC grant DGDND-2018-00025 to the third author, and by an NSERC award CGSD3-547773-2020 to the second author.

References

Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874–3884, Minneapolis, Minnesota. Association for Computational Linguistics.
Maruan Al-Shedivat and Ankur Parikh. 2019. Consistency by agreement in zero-shot neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1184–1197, Minneapolis, Minnesota. Association for Computational Linguistics.
Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Roee Aharoni, Melvin Johnson, and Wolfgang Macherey. 2019. The missing ingredient in zero-shot neural machine translation. arXiv preprint arXiv:1903.07091.
Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation, pages 261–268.
Guanhua Chen, Shuming Ma, Yun Chen, Dongdong Zhang, Jia Pan, Wenping Wang, and Furu Wei. 2022. Towards making the most of cross-lingual transfer for zero-shot neural machine translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 142–157, Dublin, Ireland. Association for Computational Linguistics.
Xiangyu Duan, Baijun Ji, Hao Jia, Min Tan, Min Zhang, Boxing Chen, Weihua Luo, and Yue Zhang. 2020. Bilingual dictionary based neural machine translation without using parallel sentences. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1570–1579, Online. Association for Computational Linguistics.
Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, and Armand Joulin. 2022. Beyond English-centric multilingual machine translation. J. Mach. Learn. Res., 22(1).
Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875, San Diego, California. Association for Computational Linguistics.
Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O.K. Li. 2018. Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 344–354, New Orleans, Louisiana. Association for Computational Linguistics.
Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor O.K. Li. 2019. Improved zero-shot neural machine translation via ignoring spurious correlations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1258–1268, Florence, Italy. Association for Computational Linguistics.
Shuhao Gu and Yang Feng. 2022. Improving zero-shot multilingual translation with universal representations and cross-mappings. In Proceedings of the EMNLP 2022 Long Findings.
Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. 2020. Accelerating large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning, pages 3887–3896. PMLR.
Thanh-Le Ha, Jan Niehues, and Alexander Waibel. 2017. Effective strategies in zero-shot neural machine translation. In Proceedings of the 14th International Conference on Spoken Language Translation, pages 105–112, Tokyo, Japan. International Workshop on Spoken Language Translation.
Baijun Ji, Zhirui Zhang, Xiangyu Duan, Min Zhang, Boxing Chen, and Weihua Luo. 2020. Cross-lingual pre-training based transfer for zero-shot neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 115–122.
Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain. Association for Computational Linguistics.
Nishant Kambhatla, Logan Born, and Anoop Sarkar. 2022a. Auxiliary subword segmentations as related languages for low resource multilingual translation. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 131–140, Ghent, Belgium. European Association for Machine Translation.
Nishant Kambhatla, Logan Born, and Anoop Sarkar. 2022b. CipherDAug: Ciphertext based data augmentation for neural machine translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 201–218, Dublin, Ireland. Association for Computational Linguistics.
Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Nearest neighbor machine translation. In International Conference on Learning Representations.
Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2019. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations.
Yusen Lin, Jiayong Lin, Shuaicheng Zhang, and Haoying Dai. 2021. Bilingual dictionary-based language model pretraining for neural machine translation.
Danni Liu, Jan Niehues, James Cross, Francisco Guzmán, and Xian Li. 2021. Improving zero-shot translation by disentangling positional information. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1259–1273, Online. Association for Computational Linguistics.
Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.
Xiao Pan, Mingxuan Wang, Liwei Wu, and Lei Li. 2021a. Contrastive learning for many-to-many multilingual neural machine translation. In Proceedings of ACL 2021.
Xiao Pan, Mingxuan Wang, Liwei Wu, and Lei Li. 2021b. Contrastive learning for many-to-many multilingual neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 244–258, Online. Association for Computational Linguistics.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Ngoc-Quan Pham, Jan Niehues, Thanh-Le Ha, and Alex Waibel. 2019. Improving zero-shot translation with language-independent constraints. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 13–23.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535, New Orleans, Louisiana. Association for Computational Linguistics.
Alessandro Raganato, Raúl Vázquez, Mathias Creutz, and Jörg Tiedemann. 2021. An empirical investigation of word alignment supervision for zero-shot multilingual neural machine translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8449–8456, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Machel Reid and Mikel Artetxe. 2022. PARADISE: Exploiting parallel data for multilingual sequence-to-sequence pretraining. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 800–810, Seattle, United States. Association for Computational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
Weizhi Wang, Zhirui Zhang, Yichao Du, Boxing Chen, Jun Xie, and Weihua Luo. 2021. Rethinking zero-shot neural machine translation: From a perspective of latent variables. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4321–4327, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Xinyi Wang, Hieu Pham, Philip Arthur, and Graham Neubig. 2018. Multilingual neural machine translation with soft decoupled encoding. In International Conference on Learning Representations.
Zhijun Wang, Xuebo Liu, and Min Zhang. 2022. Breaking the representation bottleneck of Chinese characters: Neural machine translation with stroke sequence modeling. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6473–6484, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Xiangpeng Wei, Rongxiang Weng, Yue Hu, Luxi Xing, Heng Yu, and Weihua Luo. 2021. On learning universal representations across languages. In International Conference on Learning Representations.
Lijun Wu, Shufang Xie, Yingce Xia, Yang Fan, Jianhuang Lai, Tao Qin, and Tieyan Liu. 2020. Sequence generation with mixed representations. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 10388–10398. PMLR.
Shijie Wu, Benjamin Van Durme, and Mark Dredze. 2022. Zero-shot cross-lingual transfer is under-specified optimization. In Proceedings of the 7th Workshop on Representation Learning for NLP, pages 236–248, Dublin, Ireland. Association for Computational Linguistics.
Yilin Yang, Akiko Eriguchi, Alexandre Muzio, Prasad Tadepalli, Stefan Lee, and Hany Hassan. 2021. Improving multilingual translation by representation and gradient regularization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7266–7279, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Zhe Yang, Qingkai Fang, and Yang Feng. 2022. Low-resource neural machine translation with cross-modal alignment. pages arXiv–2210.
Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich. 2020. Improving massively multilingual neural machine translation and zero-shot translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1628–1639, Online. Association for Computational Linguistics.
A Reproducibility Details

A.1 Data

IWSLT17 (Cettolo et al., 2012) is an English-centric dataset totalling 1.8M parallel sentences. It has 8 supervised directions to and from German, Italian, Dutch and Romanian, each with about 220,000 parallel sentences, and 12 zero-shot directions. We use the official validation and test sets.

TED59 (Qi et al., 2018) is a massively multilingual English-centric dataset with 116 translation directions totalling 10.8M parallel sentences. The imbalanced data (from 0.25M to just 2000 parallel samples for some language pairs) makes it ideal to study the effects of our method. Following Aharoni et al. (2019) and Raganato et al. (2021), we evaluate on 16 supervised pairs (Azerbaijani, Belarusian, Galician, Slovak, Arabic, German, Hebrew, and Italian to and from English) and 4 zero-shot pairs (Arabic ↔ French, Ukrainian ↔ Russian). Note that of these languages, Azerbaijani, Belarusian, Galician, and Slovak are low resource with only 5.9k, 4.5k, 10k and 61.5k parallel samples to/from English.

All settings and baselines use sentencepiece for subword tokenization using byte-pair encodings (BPEs; Sennrich et al. 2016) with 32000 merge operations.

the IWSLT17 model and 2.5M parameters to the TED59 model. However, note that the total trainable parameters are still much lower than those of the baselines; this is because our models have shared embedding layers.

We use the Adam optimizer with inverse square root learning rate scheduling and 6k warmup steps, lr = 0.0007 and dropout of 0.3 (IWSLT17), or 10k warmup steps, lr = 0.005 and dropout of 0.2 (TED59). The batch size is 4096 tokens for each of four A100 GPUs.

We use ScANN (Guo et al., 2020) for efficient ANN search with k = 3. To increase training speeds, we cache each subword’s ANNs for 400 iterations before recomputing them. We only (periodically) cache subword IDs: the embedding EMB_µ(·) is always computed directly from W_emb. The value of λ is set to 0.5 (Equation 1). We follow Kambhatla et al. (2022b) to set the values of α1, α2 to 1, and β to 5 (Equation 5).

Evaluation. For evaluation, all translations are generated with beam size 5. We report case-sensitive BLEU scores (Papineni et al., 2002) using sacreBLEU (Post, 2018). We report detokenized BLEU for IWSLT17 and tokenized BLEU for TED59 for fair comparison with prior work (Aharoni et al., 2019; Raganato et al., 2021).
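
The neighbor caching described in the training details above (subword IDs cached for 400 iterations, embeddings always read from the current matrix) might be sketched as follows. This is a minimal illustration rather than the authors' implementation: `exact_neighbors` is a brute-force stand-in for a real ANN index such as ScaNN, the class and function names are hypothetical, and taking EMB_µ to be the mean of the neighbors' embedding rows is an assumption, since its definition lies outside this section.

```python
import math
from typing import Dict, List

Matrix = List[List[float]]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def exact_neighbors(emb: Matrix, k: int) -> Dict[int, List[int]]:
    """Exact top-k search; a stand-in for a real ANN index such as ScaNN."""
    ids = {}
    for i, u in enumerate(emb):
        # Sort by descending similarity, breaking ties by row index.
        scored = [(-cosine(u, v), j) for j, v in enumerate(emb) if j != i]
        ids[i] = [j for _, j in sorted(scored)[:k]]
    return ids


class NeighborCache:
    """Caches neighbor IDs only and refreshes them every `refresh_every` steps."""

    def __init__(self, k: int = 3, refresh_every: int = 400) -> None:
        self.k = k
        self.refresh_every = refresh_every
        self.ids: Dict[int, List[int]] = {}
        self.last_refresh = -1  # no search has run yet
        self.refreshes = 0

    def lookup(self, emb: Matrix, step: int) -> Dict[int, List[int]]:
        stale = self.last_refresh < 0 or step - self.last_refresh >= self.refresh_every
        if stale:
            self.ids = exact_neighbors(emb, self.k)
            self.last_refresh = step
            self.refreshes += 1
        return self.ids


def neighbor_mean(emb: Matrix, ids: Dict[int, List[int]], token: int) -> List[float]:
    """Mean of the neighbors' *current* embedding rows (assumed form of EMB_mu)."""
    rows = [emb[j] for j in ids[token]]
    return [sum(col) / len(rows) for col in zip(*rows)]
```

Because only IDs are cached, an update to the embedding matrix between refreshes is still reflected in `neighbor_mean`; only the identity of the neighbors can go stale.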
Subword | Nearest Neighbors (Baseline) | Nearest Neighbors (Ours)
_Fisch | _findet œ _chestii _Netz fisch möglichkeiten | erei fisch _pesca _fish _Fisch ž
schaft | hood erung ungen gaat _gehabt schaft | würdig lichkeit ship nisse äglich schaft
the | tje ped own asta by _solamente | tech ther th by the ?
_the | isce izăm _erzählen ”& _gehabt oara | _your _their _our _the ć ž
_Music | mat _cartoon hood _connessione zia _şcoala | _musica _music ž _Music dine ć
_picior | _sfârşit _plaatje _mesaj _teren _avion _gehabt | _corpul _braț _pagină _picior ž ?
ern | eien iere eren erung _tenuto _gehabt | uren ungen ert eren stern ern
_înceapă | _popolare _condotto _mişcă _bekijken _crească _creeze | _gepubliceerd _încep _începem _început _începe muovono
_democrația | analisi _înțelege _popolare izăm _şcoala _deshalb | _terorism muovono _democratic dine _biodiversità ć
_pure | rische _giovane _appena _tare ย _avesse | _semplicemente _unique _tragic _complete _sole _pure
_genomic | _finanzia  _popolare _răspândi _genomen möglichkeiten | _electronic _genome _robotic ž _genetic _genomic
_Abbiamo | _perciò _gehabt _spunem _condotto izăm _avesse | abbiamo mmo iamo _Abbiamo _abbiamo ć
izări | amento isieren ierung izzazione _răspândi izare | izare ităţi aţie izări muovono nelli
_negative | _altele azioni iere _bune _enormous oase | _illegal _alternative _evil _positive _negativ _negative
_take | _solamente _gemacht _spinge _accompagna _preso _tenuto | _takes _taken _taking _took ć _take
_muziek | _percorso _besef _onderwijs _erzählen _vreugde oara | _music muovono _Musik _musica _muziek ć
_Karte | _Bibliothek _lavori strategie _chestii _cifre kaart | _Weise _Sprache _carta _montagna kjes _Karte
_funcționa | _mişcă _munci matig _realiza _funcție _funcționa | _funcţionează _funcționează _funziona _funcţiona _funcționa ć
_naţional | _popolare iere _bază _condotto _esenţial _politic | juist _rural äglich _National _naţional _national
_America | tate _erzählen _stemmen dine _facultate _chestiune | _USA _Asia _Africa _American _America ć

Table 5: Approximate nearest-neighbors for a sample of subwords, computed with (right) and without (left) ANN training.
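
The analysis in Section 4 reports how many of a subword's six nearest neighbors share its majority language (2.7 on average for the baseline, 3.6 with ANN training). A statistic of that kind could be computed from neighbor lists like those in Table 5 roughly as below. This is a sketch under stated assumptions: the function names are hypothetical, the per-language frequency counts are assumed to be available as `Counter` objects, and any toy counts passed in are illustrative rather than taken from the paper.

```python
from collections import Counter
from typing import Dict, List


def majority_language(counts: Dict[str, Counter]) -> Dict[str, str]:
    """Label each subword with the language in which it occurs most often."""
    return {subword: c.most_common(1)[0][0] for subword, c in counts.items()}


def mean_same_language_neighbors(
    neighbors: Dict[str, List[str]], lang: Dict[str, str]
) -> float:
    """Average, over query subwords, of neighbors sharing the query's language."""
    total = 0
    for subword, nbrs in neighbors.items():
        total += sum(1 for n in nbrs if lang.get(n) == lang[subword])
    return total / len(neighbors)
```

Running this once on the baseline neighbor lists and once on the ANN-trained lists would give the two averages to compare.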
JHU IWSLT 2023 Multilingual Speech Translation System Description
Henry Li Xinyuan1∗ Neha Verma1∗ Bismarck Bamfo Odoom1 Ujvala Pradeep1
Matthew Wiesner2 Sanjeev Khudanpur1,2
1 Center for Language and Speech Processing, and
2 Human Language Technology Center of Excellence,
Johns Hopkins University
{xli257, nverma7, bodoom1, upradee1, wiesner, khudanpur}@jhu.edu
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 302–310
July 13-14, 2023 ©2023 Association for Computational Linguistics
robustness of translation systems. In light of this, we scraped talks and papers from the proceedings and workshops of ACL 2021.

3.1 Data Collection

About 65% of the papers accepted in ACL 2021 have video presentations recorded and uploaded on the ACL website. We scraped 1847 papers and 1193 talks from the proceedings and workshops. The papers and talks are in PDF and MP4 format, respectively. We extract the text from the papers using pypdf.1 The talks are split into 30-second chunks, converted into FLAC format, and resampled to 16 kHz. This amounts to about 155 hours of speech and about 200K lines of text. We plan to release the data under a CC BY 4.0 license2 (same as the license for the ACL talks).

1 https://github.com/py-pdf/pypdf
2 https://github.com/IWSLT-23/60_60_data/tree/main/acl_data

3.2 Data Filtering

To make the corpora (including ACL papers before 2022) useful, we first denoised the data and made it similar to ASR text outputs. The complete list of filters we applied to the data is:

• Removing any information past the References section.
• Removing links ("https..").
• Rejoining broken words, since the text was in a two-column format.
• Removing any information before the Abstract section.
• Removing any characters that are neither alphanumeric nor punctuation.
• Removing any lines that start with or that have too many numbers (to account for tables with data).
• Removing any lines with fewer than 10 characters (number obtained from averaging the minimum character length of each sentence in the dev data).
• Removing any lines longer than 297 characters (number obtained through a similar process as above).
• Reformatting the data such that it has one sentence per line.

These constraints were applied in order to mimic the text-normalization of the dev data so that these scraped ACL data could be incorporated into our model’s source language side.

4 Systems

In this section, we separately describe our unconstrained and constrained submissions. Since we built cascaded models, we describe the automatic speech recognition (ASR) and machine translation (MT) components of each system.

4.1 Unconstrained Subtrack

4.1.1 Automatic Speech Recognition

An important characteristic of ACL presentations is the wide array of accents represented, which reflects the diverse background of NLP researchers. Accent-robust speech recognition continues to present a challenge to the community (Tadimeti et al., 2022; Riviere et al., 2021; Radford et al., 2022).

One model that demonstrated a degree of robustness to accented speech is Whisper (Radford et al., 2022), an ASR model trained on 680,000 hours of web-crawled data. Its performance on the accented splits of VoxPopuli (Wang et al., 2021), while significantly worse than on non-accented English, was comparable (without an external language model) to methods designed for accent robustness (with a strong language model) (Riviere et al., 2021). This robustness to accented speech, as well as its overall strong performance on English ASR, makes it well-suited for the accent-diverse ACL presentations.

The domain specificity and technical terms of ACL presentations may still prove difficult for a strong ASR model like Whisper. We therefore condition the decoder towards key technical vocabulary and named entities by prompting Whisper with the corresponding abstracts when decoding each presentation.

Additionally, we test the effect of using the pre-segmented audio files (with oracle segmentation provided by the IWSLT 60-60 challenge organizers) versus using longer speech segments for Whisper decoding. We find that decoding the full talk at once results in a lower WER than decoding segment-by-segment. For Whisper-large, the best performing model, this difference is 0.6 WER. Longer form inputs more closely match the training segments of Whisper, which were in 30-second segments (Radford et al., 2022).
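
The filters listed in Section 3.2 could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' script: the order in which the filters run, the exact-match detection of the "Abstract" and "References" headings, the ASCII-only character whitelist, the 0.3 digit-ratio cutoff for "too many numbers", and the naive regex sentence splitter are all choices made here for the example. Only the 10- and 297-character bounds come from the text.

```python
import re
import string
from typing import List

MIN_CHARS = 10          # lower bound stated in the text
MAX_CHARS = 297         # upper bound stated in the text
MAX_DIGIT_RATIO = 0.3   # hypothetical cutoff for "too many numbers"

URL_RE = re.compile(r"https?://\S+")
# Anything that is not ASCII alphanumeric, whitespace, or ASCII punctuation.
DISALLOWED_RE = re.compile("[^A-Za-z0-9\\s" + re.escape(string.punctuation) + "]")


def body_only(lines: List[str]) -> List[str]:
    """Keep what lies between an 'Abstract' heading and a 'References' heading."""
    body, inside = [], False
    for line in lines:
        heading = line.strip().lower()
        if heading == "abstract":
            inside = True
        elif heading == "references":
            break
        elif inside:
            body.append(line.strip())
    return body


def looks_like_table_row(line: str) -> bool:
    """True for lines that start with a digit or consist mostly of digits."""
    if not line:
        return False
    digits = sum(ch.isdigit() for ch in line)
    return line[0].isdigit() or digits / len(line) > MAX_DIGIT_RATIO


def filter_paper(lines: List[str]) -> List[str]:
    """Return the cleaned text as a list with one sentence per item."""
    kept = [ln for ln in body_only(lines) if not looks_like_table_row(ln)]
    text = "\n".join(kept)
    # Rejoin words split at line ends (this also merges genuine hyphenated compounds).
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    text = URL_RE.sub(" ", text.replace("\n", " "))
    text = DISALLOWED_RE.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip()
    sentences = re.split(r"(?<=[.!?])\s+", text)  # naive splitter, not ersatz
    return [s for s in sentences if MIN_CHARS <= len(s) <= MAX_CHARS]
```

Applying the length bounds after sentence splitting, rather than on raw PDF lines, is a design choice of this sketch; it matches the stated goal of one sentence per line but may differ from the order the authors used.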
4.1.2 Audio Segmentation

Since we found that decoding using unsegmented audio outperformed decoding using the predefined segments, we segment our ASR text output in order to perform sentence-level machine translation. We choose to perform sentence-level machine translation rather than incorporating more document context because our final systems make use of many large pre-trained multilingual models that are trained at a sentence level rather than a document level.

Because we require sentence-level segments from our ASR outputs, we use the state-of-the-art ersatz neural sentence segmenter. ersatz has been shown to be more robust to technical terms including acronyms and irregular punctuation, which is particularly helpful in the ACL domain (Wicks and Post, 2021).

4.1.3 Machine Translation

We test several pre-trained MT systems on our data. Specifically, we test NLLB-200 (NLLB Team et al., 2022), mBART50 (Tang et al., 2020), and M2M100 (Fan et al., 2021). All 10 of our target languages are supported by these models.

The original NLLB-200 model is a 54 billion parameter Mixture-of-Experts model that translates to and from 200 languages. It is trained on a large amount of mined parallel, back-translated, and monolingual data. We use the 3.3B parameter version of NLLB-200, which is a dense Transformer model that is trained via online distillation of the original model, but still supports all of the original 200 languages.

mBART50 is the second iteration of the multilingual BART model, which is a dense transformer architecture trained on multilingual text using a denoising task. The authors of mBART50 also release a checkpoint of mBART50 that is fine-tuned on the one-to-many translation task, which we will refer to as mBART50-1toN. In this case, English is the source, and all 50 covered languages are the targets.

Finally, M2M100 is another transformer-based model that is trained directly on the MT task. It translates to and from 100 languages, and is a previous iteration of the initiative that produced NLLB-200. However, we still test both models because adding language pairs to a model can sometimes reduce the performance of some language pairs (Aharoni et al., 2019; Arivazhagan et al., 2019). We use the 1.2B parameter version of M2M100 in our experiments.

4.1.4 Domain-Specific Data

Using the 2021 ACL data described in Section 3, we attempted to perform sequence knowledge distillation (SeqKD) (Kim and Rush, 2016). Because we only had additional source-side monolingual data, SeqKD could give us pseudo-target labels in order to retrain our best model on these outputs. Although NLLB-200-3.3B is our best model for many of our language pairs, we fine-tune NLLB-200-1.3B instead due to computational constraints. When benchmarking these models, however, we found only a marginal improvement from using the larger model over the smaller (average +0.6 chrF). For en-ja, however, we continue to use mBART50-1toN.

Despite the large amount of in-domain source language data we made available, we did not see much benefit from it ourselves, specifically for data augmentation via SeqKD. We speculate that the data may be too noisy in spite of filtering, and that its best use may be as source context during inference, rather than for training data augmentation.

4.2 Constrained Subtrack

4.2.1 Automatic Speech Recognition

We leveraged the pre-trained wav2vec 2.0 model (Baevski et al., 2020) for the constrained ST task. Wav2vec 2.0 was trained in a self-supervised fashion and requires fine-tuning on an annotated corpus in order to be used for the ASR task, with the domain similarity between the fine-tuning corpus and the evaluation data being crucial for ASR performance. The most commonly used wav2vec 2.0 model is fine-tuned with a CTC objective on LibriSpeech, a corpus made of audiobooks that is considered to have a considerable domain mismatch compared to the ACL 60-60 data. Since the development split of the ACL 60-60 data alone is insufficient for wav2vec 2.0 fine-tuning, we instead performed a two-stage fine-tuning with TED-LIUM 3 (Hernandez et al., 2018) being used in the first stage and the ACL 60-60 development data used in the second.

Our approach to tackling the content domain mismatch between the training data and ACL presentations is to perform ASR decoding with the help of a content-domain matching language model. In practice, this means that we rescore the per-frame output trellis with a content-domain matching language model, which in turn was created by
interpolating a general language model (trained from all the available English corpora in the constrained challenge) and a domain-specific language model (trained with transcripts from the ACL 60-60 development data). In order to bias our model towards named entities mentioned in each specific presentation, we train a separate language model for each presentation by re-interpolating the above-mentioned language model with one trained with the corresponding paper abstract.
4.2.2 Machine Translation
In the constrained setting, we use mBART50-1toN and M2M100 as our base models. We additionally test fine-tuning these models on MuST-C data, which we hypothesized to be closely related to the ACL talk data, domain-wise (Di Gangi et al., 2019). This data consists of professionally translated English TED talks, which matches the presentation domain as well as some of the technical nature of the ACL talks, although to a lesser degree.
We fine-tune both mBART and M2M100 using the MuST-C transcripts and translations available in all 10 language pairs. We use data from both v1.2 (v1.0 is contained in v1.2) and v2.0 depending on language pair availability. A summary of this data is provided in Table 1. For mBART, we additionally test multilingual fine-tuning where we fine-tune on all the language pairs simultaneously, rather than fine-tuning on a single language pair bitext (Tang et al., 2020).
lang. pair    MuST-C release    # lines
en-ar         v1.2              212085
en-de         v1.0              229703
en-fa         v1.2              181772
en-fr         v1.0              275085
en-ja         v2.0              328639
en-nl         v1.0              248328
en-pt         v1.0              206155
en-ru         v1.0              265477
en-tr         v1.2              236338
en-zh         v1.2              184801
Table 1: Dataset statistics and source of MuST-C bitext across the 10 task language pairs.
5 Experimental Setup
In this section, we provide technical details of our experiments and our evaluation practices.
5.1 ASR Experiments
5.1.1 Prompting Whisper
In the unconstrained setting, we evaluate Whisper on both the segmented and unsegmented audio files. We simulate LM biasing by using the “prompt” interface provided by Whisper.
5.1.2 Decoding with an Interpolated Language Model
In the constrained setting, we build a domain-adapted language model as follows: first we combine transcripts from a number of ASR corpora that are available in the constrained challenge, namely Librispeech, VoxPopuli, Common Voice (Ardila et al., 2020), and TED-LIUM 3, to train a flexible 6-gram general BPE-level language model for English. We proceed to interpolate the general English language model with one trained on the development split transcripts from the ACL 60-60 challenge, allowing the model to gain exposure to technical terms within the NLP field. Finally, during decoding, we further interpolate the previously obtained language model with a low-order language model trained from the paper abstract corresponding to the current presentation, biasing our model towards technical terms and named entities that are likely to appear in the presentation.
We used KenLM (Heafield, 2011) to train and integrate our language models. The interpolation weights for each step were estimated using a leave-one-out strategy on the development split, minimising the perplexity on the held-out transcript and averaging the interpolation weights.
5.1.3 Decoding with a Language Model Trained on Additional ACL Anthology data
We use the text scraped from the proceedings and workshops of ACL 2021 to train a 6-gram domain-matching language model for decoding. Without interpolation or additional data, this gives a WER of 18.9 and a technical term recall of 0.47 using Wav2Vec2-TED-LIUM 3 as the acoustic model. We observe that using data from a similar domain improves performance even though the data are relatively noisy.
5.1.4 Evaluation
We compare ASR performance, as measured by Word Error Rate (WER), across the different systems that we built. Specifically, we compute WER on depunctuated lowercase transcripts. Since we
Acoustic Model Language Model WER Tech. Term Recall
Whisper-medium.en - 8.1 0.861
Whisper-medium.en abstract prompting 8.7 0.865
Whisper-large - 6.8 0.854
Whisper-large abstract prompting 6.9 0.852
Whisper-large abstract and conclusion prompting 6.7 0.863
Whisper-large abstract, conclusion and intro prompting 6.6 0.851
Whisper-large abstract, conclusion, intro & author name prompting 6.4 0.854
Wav2Vec2-960h librispeech librispeech-4gram 25.1 0.306
Wav2Vec2-960h librispeech interpolated LM 24.3 0.370
Wav2Vec2-960h librispeech inter. LM + dev transcripts 24.1 0.382
Wav2Vec2-960h librispeech inter. LM + dev + abstract 23.7 0.392
Wav2Vec2-960h librispeech inter. LM + dev + abstract + ACL anthology 20.7 0.462
HUBERT-960h librispeech librispeech-4gram 22.0 0.390
HUBERT-960h librispeech interpolated LM 21.7 0.386
HUBERT-960h librispeech inter. LM + dev transcripts 20.4 0.421
HUBERT-960h librispeech inter. LM + dev + abstract 20.4 0.498
HUBERT-960h librispeech inter. LM + dev + abstract + ACL anthology 18.5 0.473
Wav2Vec2-TED-LIUM 3 librispeech-4gram 20.9 0.383
Wav2Vec2-TED-LIUM 3 interpolated LM 19.5 0.422
Wav2Vec2-TED-LIUM 3 inter. LM + dev transcripts 18.9 0.436
Wav2Vec2-TED-LIUM 3 inter. LM + dev + abstract 14.2 0.626
Wav2Vec2-TED-LIUM 3 inter. LM + dev + abstract + ACL anthology 16.7 0.505
Wav2Vec2-TED-LIUM 3 ACL anthology only 18.9 0.470
Table 2: ASR results. WER is measured against depunctuated, all lower-case reference text.
either perform ASR on unsegmented talks (unconstrained), or on the SHAS-segmented audio (constrained), we use mwerSegmenter to align our outputs to the gold transcripts (Matusov et al., 2005).
Because we are interested in the effect of using domain-specific text to improve ASR on technical terms, we compute the recall of NLP-specific technical words in our output. We obtain these technical terms by asking domain experts to flag all technical terms in the development set reference transcript.
5.2 MT Experiments
5.2.1 MuST-C fine-tuning
For bilingual fine-tuning on mBART50 and M2M100, we train for 40K updates, and use loss to select the best checkpoint. For multilingual fine-tuning on mBART50-1toN, we train for 100K updates, and use temperature sampling of the mixed dataset using T = 1.5. We use loss to select the best checkpoint. For all experiments, we use an effective batch size of 2048 tokens.
5.2.2 Evaluation
For all experiments, we report BLEU and chrF scores as reported by sacrebleu (Post, 2018). For Japanese and Chinese, we use the appropriate tokenizers provided by sacrebleu (ja-mecab and zh, respectively).
For evaluating translations of ASR outputs, either segmented using ersatz or pre-segmented using the provided SHAS-segmented wav files, we use the mwerSegmenter to resegment the translations based on the references. For all languages except Japanese and Chinese, we use detokenized text as input to resegmentation. However, for Japanese and Chinese, we first use whitespace tokenization as input to mwerSegmenter, and then detokenize for scoring, after which the text is retokenized by the sacrebleu package.
6 Results
6.1 ASR Results
For the Whisper-based systems, we focus on the effects of prompting; for the constrained systems, we contrast different families of pre-trained ASR models fine-tuned on different ASR corpora; finally, we assess the efficacy of incorporating an in-domain language model during decoding. The full list of results is shown in Table 2.
Contrary to what we expected, prompting Whisper with the corresponding paper abstracts not only had little impact on the ASR WER, but also failed
mBART50-1toN M2M100 NLLB-200
language pair BLEU chrF BLEU chrF BLEU chrF
en-ar 22.6 52.9 16.2 46.3 37.6 65.4
en-de 37.4 66.0 39.7 66.8 42.9 69.6
en-fa 17.2 49.6 20.4 49.5 27.4 57.3
en-fr 46.4 70.4 54.5 74.6 55.9 76.2
en-ja 37.5 45.9 35.2 43.8 25.7 36.3
en-nl 41.0 69.0 50.9 75.3 51.5 76.1
en-pt 44.3 69.7 57.6 77.4 61.6 79.0
en-ru 22.2 52.0 24.3 54.3 27.4 57.2
en-tr 15.5 50.7 22.3 56.5 28.6 62.8
en-zh 43.8 38.8 45.7 40.7 42.2 38.5
Table 3: Unconstrained MT results on the development set using oracle transcripts as input. Both chrF and BLEU
scores are computed using the mWER Segmenter and sacrebleu. BLEU scores for ja and zh are computed using
the ja-mecab and zh tokenizers in sacrebleu, respectively. We bold our best chrF scores as it is the main metric of
the task.
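As a rough illustration of what the chrF column in Table 3 measures, the sketch below computes a simplified sentence-level character n-gram F-score in plain Python. It is not the sacrebleu implementation used for the reported numbers: the function names, the choice to drop whitespace, the per-order averaging of F-scores, and the defaults (n-gram orders 1 to 6, beta = 2) are assumptions made for this example, so its output should not be expected to match sacrebleu exactly.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n (assumption: whitespace is ignored)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_order: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average of per-order F-beta scores, scaled to 0-100."""
    scores = []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            # Skip orders longer than either string (a simplification).
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        denom = beta ** 2 * precision + recall
        f_score = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
        scores.append(f_score)
    return 100 * sum(scores) / len(scores) if scores else 0.0
```

With beta = 2, recall is weighted more heavily than precision, so a hypothesis that drops characters from the reference is penalised more than one that adds a few extra ones.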
Table 4: Constrained MT results on the development set using oracle transcripts as input. Both chrF and BLEU
scores are computed using the mWER Segmenter and sacrebleu. BLEU scores for ja and zh are computed using
the ja-mecab and zh tokenizers in sacrebleu, respectively. We bold our best chrF scores as it is the main metric of
the task.
to improve the recall of technical terms of the ASR system. Further increasing the length and relevance of the prompts provided to Whisper, such as adding the conclusion and part of the introduction section of each paper corresponding to the ACL presentation in question, had marginal impact on both of the above-mentioned metrics. A more detailed look at the mechanism and behaviour of Whisper prompting could help to understand this observation.
On the constrained side, the incorporation of the interpolated LM during ASR decoding had a significant impact on the performance of our ASR systems, regardless of the upstream acoustic model. As expected, increasing the quality of the out-of-domain language model (from Librispeech-4gram to Interpolated LM) resulted in WER improvements while not necessarily helping technical term recall; by contrast, while LMs that better fit the domain may not necessarily help WER, they bring substantial gains in technical term recall.
The language model that best fits our domain, namely the model that interpolates the LMs trained from every ASR corpus in addition to the development transcripts, from the current paper abstract, and from the crawled ACL anthology, provided substantial improvement on both WER and technical term recall for the weaker acoustic models (Wav2Vec2 fine-tuned on Librispeech) but not on
Constrained Unconstrained
language MT system BLEU chrF MT system BLEU chrF
en-ar mBART50-1toN+MuST-C 15.3 45.6 NLLB-200-3.3B 33.7 62.5
en-de M2M100 24.3 55.2 NLLB-200-3.3B 39.6 67.8
en-fa mBART50-1toN+MuST-C 14.8 42.0 NLLB-200-3.3B 24.5 54.3
en-fr M2M100 33.3 61.9 NLLB-200-3.3B 49.3 72.5
en-ja mBART50-1toN 21.9 29.9 mBART50-1toN 34.8 43.1
en-nl M2M100 30.6 62.5 NLLB-200-3.3B 45.7 72.4
en-pt M2M100 34.9 63.4 NLLB-200-3.3B 54.7 75.6
en-ru M2M100 15.0 45.1 NLLB-200-3.3B 24.8 54.4
en-tr M2M100 11.9 43.5 NLLB-200-3.3B 24.7 58.8
en-zh M2M100 32.2 26.6 M2M100 37.7 33.5
Table 5: Final speech translation results for both our constrained and unconstrained systems on the development set.
Both chrF and BLEU scores are computed using the mWER Segmenter and sacrebleu. BLEU scores for ja and zh
are computed using the ja-mecab and zh tokenizers in sacrebleu, respectively. We used output from our strongest
ASR system, Whisper-large with abstract prompting, as the input to our translation system.
the stronger acoustic models.
6.2 MT results
We detail the results of testing pre-trained MT models as described in Section 4 on the oracle transcripts in Table 3. This table reflects experiments we performed for the unconstrained setting. We find that for almost all language pairs, NLLB-200-3.3B has the best performance, except for en-ja and en-zh, which perform best with mBART and M2M100, respectively.
We summarize our fine-tuning results in Table 4. This table reflects experiments we performed for the constrained setting. We find that in general, the additional data can provide a boost over mBART50-1toN, but not for M2M100. Additionally, we find that despite positive results in Tang et al. (2020), multilingual fine-tuning does not outperform bilingual fine-tuning in this setting. For a majority of pairs, M2M100 without fine-tuning is the best system, but for en-ar and en-fa, mBART50-1toN with fine-tuning is the best system, and similar to the unconstrained system, mBART50-1toN without fine-tuning is the best system for en-ja.
6.3 ST Results
Final results for both our constrained and unconstrained systems are summarized in Table 5. We translate the transcripts from our best ASR systems using the best language-pair specific MT systems. In the unconstrained case, the average reduction in chrF from using ASR outputs versus oracle transcripts is -5.7 chrF. In the constrained case, this value is -12.8 chrF. The small reduction in the unconstrained system indicates that our cascaded approach of two strong components is a viable option for ST in this setting. However, our constrained system could likely benefit from techniques that help reduce the error propagation from ASR, like mixing ASR outputs with gold source sentences during MT training, or joint training of ASR and MT components.
7 Conclusion
We present a constrained and unconstrained system for the IWSLT 2023 Multilingual speech translation task. We address some of the major challenges of this dataset with our design choices: ASR robust to speaker accents, adaptation to match the domain specificity, and ASR prompting to incorporate context in this academic talk-level translation task. We additionally release a supplemental ACL audio and text corpus to encourage further work in high quality speech translation of ACL content.
References
Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutai Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.
Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874–3884, Minneapolis, Minnesota. Association for Computational Linguistics.
Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. 2020. Common voice: A massively-multilingual speech corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4218–4222, Marseille, France. European Language Resources Association.
Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019.
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems, volume 33, pages 12449–12460. Curran Associates, Inc.
Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2012–2017, Minneapolis, Minnesota. Association for Computational Linguistics.
Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, et al. 2021. Beyond english-centric multilingual machine translation. The Journal of Machine Learning Research, 22(1):4839–4886.
Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland. Association for Computational Linguistics.
François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia A. Tomashenko, and Yannick Estève. 2018. TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation. CoRR, abs/1805.04699.
Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, Austin, Texas. Association for Computational Linguistics.
Evgeny Matusov, Gregor Leusch, Oliver Bender, and Hermann Ney. 2005. Evaluating machine translation output with automatic sentence segmentation. In Proceedings of the Second International Workshop on Spoken Language Translation, Pittsburgh, Pennsylvania, USA.
NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia-Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2022. No language left behind: Scaling human-centered machine translation.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.
Morgane Riviere, Jade Copet, and Gabriel Synnaeve. 2021. Asr4real: An extended benchmark for speech models. arXiv preprint arXiv:2110.08583.
Elizabeth Salesky, Kareem Darwish, Mohamed Al-Badrashiny, Mona Diab, and Jan Niehues. 2023. Evaluating Multilingual Speech Translation Under Realistic Conditions with Resegmentation and Terminology. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.
Divya Tadimeti, Kallirroi Georgila, and David Traum.
2022. Evaluation of off-the-shelf speech recognizers
on different accents in a dialogue domain. In Pro-
ceedings of the Thirteenth Language Resources and
Evaluation Conference, pages 6001–6008, Marseille,
France. European Language Resources Association.
Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Na-
man Goyal, Vishrav Chaudhary, Jiatao Gu, and An-
gela Fan. 2020. Multilingual translation with exten-
sible multilingual pretraining and finetuning. arXiv
preprint arXiv:2008.00401.
Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonol-
losa, and Marta R. Costa-jussà. 2022. SHAS: Ap-
proaching optimal Segmentation for End-to-End
Speech Translation. In Proc. Interspeech 2022, pages
106–110.
Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu,
Chaitanya Talnikar, Daniel Haziza, Mary Williamson,
Juan Pino, and Emmanuel Dupoux. 2021. VoxPop-
uli: A large-scale multilingual speech corpus for rep-
resentation learning, semi-supervised learning and
interpretation. In Proceedings of the 59th Annual
Meeting of the Association for Computational Lin-
guistics and the 11th International Joint Conference
on Natural Language Processing (Volume 1: Long
Papers), pages 993–1003, Online. Association for
Computational Linguistics.
Rachel Wicks and Matt Post. 2021. A unified approach
to sentence segmentation of punctuated text in many
languages. In Proceedings of the 59th Annual Meet-
ing of the Association for Computational Linguistics
and the 11th International Joint Conference on Natu-
ral Language Processing (Volume 1: Long Papers),
pages 3995–4007, Online. Association for Computa-
tional Linguistics.
The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023
Speech-to-Speech Translation Task
Kun Song1, Yi Lei1, Peikun Chen1, Yiqing Cao2, Kun Wei1, Yongmao Zhang1, Lei Xie1∗, Ning Jiang3, Guoqing Zhao3
1 Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, China
2 Department of Computer Science and Technology, Nanjing University, China
3 MaShang Consumer Finance Co., Ltd, China
speech of various data distributions. To let the ASR model generalize better to the multi-source input, we adopt a model fusion strategy. Specifically, we train the Conformer and E-branchformer models introduced in Section 2.1 using the combination of the original and the augmented data. Each testing utterance is then transcribed by these different models, resulting in multiple outputs. Finally, ROVER (Fiscus, 1997) is adopted to align and vote with equal weights on the multiple outputs, resulting in the final ASR output.
2.4 ASR Output Post-processing
Given that the spontaneous speech in the test set contains frequent filler words such as "Uh" and "you know", it is necessary to address their impact on subsequent MT accuracy and TTS systems that rely on the ASR output. To mitigate this issue, we use a simple rule-based post-processing step to detect and eliminate these expressions from the ASR output. By doing so, we improve the accuracy of the downstream modules.
3 Machine Translation
For the MT module, we first use a pre-trained language model as a basis for initialization and then employ various methods to further enhance translation accuracy.
3.1 Pre-trained Language Model
As pre-trained language models are considered part of the training data in the offline track and can be used in the S2ST track, we use the pre-trained mBART50 model for initializing our MT module. mBART50 (Liu et al., 2020) is a multilingual BART (Lewis et al., 2020) model with 12 layers of encoder and decoder, which we believe will provide a solid basis for improving translation accuracy.
3.2 Three-stage Fine-tuning based on Curriculum Learning
We perform fine-tuning on the pre-trained model to match the English-to-Chinese (En2Zh) translation task. There are substantial differences between the ASR outputs and the texts of MT data. First, ASR prediction results inevitably contain errors. Second, ASR outputs are normalized text without punctuation. Therefore, directly fine-tuning the pre-trained model with the MT data will cause a mismatch problem with the ASR output during inference. On the other hand, fine-tuning the model with the ASR outputs will cause difficulty in model coverage because of the difference between the ASR outputs and the MT data. Therefore, based on Curriculum Learning (Bengio et al., 2009), we adopt a three-stage fine-tuning strategy to mitigate such a mismatch.
• Fine-tuning using the MT data: First, we use all the MT data to fine-tune the pre-trained model to improve the accuracy of the model in the En2Zh translation task.
• Fine-tuning using the MT data in ASR transcription format: Second, we convert the English text in the MT data into the ASR transcription format. Then, we fine-tune the MT model using the converted data, which is closer to the actual text than the ASR recognition output. This approach can enhance the stability of the fine-tuning process, minimize the impact of ASR recognition issues on the translation model, and improve the model’s ability to learn punctuation, thereby enhancing its robustness.
• Fine-tuning using the ASR outputs: Third, we leverage GigaSpeech (Chen et al., 2021) to address the mismatch problem between the ASR outputs and the MT data. Specifically, we use the ASR module to transcribe the GigaSpeech training set and replace the corresponding transcriptions in GigaST (Ye et al., 2022) with the ASR transcriptions for translation model fine-tuning. This enables the MT model to adapt to ASR errors.
3.3 Back Translation
Following Akhbardeh et al. (2021), we adopt the back translation method to enhance the data and improve the robustness and generalization of the model. First, we train a Zh2En MT model to translate Chinese to English, using the same method employed for the En2Zh MT module. Next, we generate the corresponding English translations for the Chinese text of the translation data. Finally, we combine the back translation parallel corpus pairs with the real parallel pairs and train the MT model.
3.4 Cross-validation
We use 5-fold cross-validation (Ojala and Garriga, 2010) to improve the robustness of translation and reduce over-fitting. First, we randomly divide the data into five equal parts and train five models on different datasets by using one of them as the validation set each time and combining the remaining four as the training set. After that, we integrate the predicted probability distributions from these five
models to obtain the final predicted probability distribution for the next word during token generation for predicting the translation results.
[Figure: block labels recovered from the diagram: MT Output Text, Conformer Decoder, BN, Posterior Encoder, Speaker Embedding, VISinger 2 Decoder, 16kHz Speech, Audio Super-resolution, 24kHz Speech]
4 Text-to-speech
4.1 Overview
Figure 1 (a) shows the pipeline of the text-to-speech module in the proposed S2ST system. The TTS module is built on a BN-based two-stage architecture, which consists of a text-to-BN and a BN-to-speech procedure. The text-to-BN stage generates BN features from the Chinese text translated by the MT module. The BN-to-speech stage produces 16kHz Chinese speech from the BN feature, conditioning on the speaker embedding of source speech. Given the translated Chinese speech which preserves the speaker timbre in the source English speech, an audio super-resolution model is further leveraged to convert the synthesized speech from 16kHz to 24kHz for higher speech fidelity.
Building on the two-stage framework AdaVITS (Song et al., 2022a), we employ bottleneck (BN) features as the intermediate representations in the two-stage TTS module. BN features, extracted from a multi-condition trained noise-robust ASR system, mainly represent the speaker-independent linguistic content. BN features can therefore effectively disentangle the speaker timbre and the linguistic content information. In the text-to-BN stage, high-quality TTS data is adopted in the training phase to model the speaker-independent BN features with prosody information. In the BN-to-speech stage, both high-quality TTS data and low-quality ASR data should be involved during training to sufficiently model the speech of various speaker identities. Extracted from speech, BN features contain the duration and prosody information, which eliminates the need for text transcripts and prosody modeling. Instead, the BN-to-speech stage focuses on time-invariant information modeling, such as speaker timbre.
As the goal of this work is to conduct zero-shot English-to-Chinese speech translation, we concentrate on the method to transfer the unseen speaker timbre of the source English speech to the synthesized Chinese speech through voice cloning (Chen et al., 2019). To capture new speaker timbre during inference, the TTS module needs to model a large number of diverse speakers during training, which relies on large-scale high-quality TTS data. Unfortunately, we are limited in the high-quality TTS data we can use in this task and must rely on additional data such as ASR to model the speaker timbre. However, this data is not suitable for TTS model training because the labels are inconsistent with TTS, and the prosody of the speakers is not as good as high-quality TTS data.
Furthermore, we incorporate ASR data into the BN-to-speech training procedure by re-sampling all the training speech to 16kHz, which cannot reach high audio quality. Therefore, we utilize audio super-resolution techniques to upsample the synthesized 16kHz audio and convert it into higher sampling rate audio.
4.2 Text-to-BN
Our text-to-BN stage network in TTS is based on DelightfulTTS (Liu et al., 2021), which employs a Conformer-based encoder, decoder, and a variance adapter for modeling duration and prosody. The model extends phoneme-level linguistic features to frame-level to guarantee the clarity and naturalness of speech in our system.
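The phoneme-to-frame extension mentioned above can be pictured as a length-regulation step: each phoneme-level feature is repeated for as many frames as its duration. The sketch below is illustrative only and is not taken from the system described here; the function name `expand_to_frames`, the use of plain Python lists, and the assumption that durations are non-negative integer frame counts are all hypothetical choices for this example.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def expand_to_frames(phoneme_features: Sequence[T], durations: Sequence[int]) -> List[T]:
    """Repeat each phoneme-level feature for its duration in frames.

    Assumption: durations are non-negative integer frame counts, one per phoneme.
    """
    if len(phoneme_features) != len(durations):
        raise ValueError("need exactly one duration per phoneme feature")
    frames: List[T] = []
    for feature, duration in zip(phoneme_features, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        # A phoneme lasting `duration` frames contributes that many copies.
        frames.extend([feature] * duration)
    return frames
```

For instance, two phoneme features with durations of 2 and 3 frames yield a 5-frame sequence, so the frame-level length always equals the sum of the durations.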
4.3 BN-to-speech
We build the BN-to-speech model based on VITS (Kim et al., 2021), which is a mainstream end-to-end TTS model. VITS generates speech waveforms directly from the input textual information, rather than using a conventional pipeline that combines an acoustic model and a neural vocoder.
The network of the BN-to-speech stage consists of a BN encoder, posterior encoder, decoder, flow, and speaker encoder. The monotonic alignment search (MAS) from the original VITS is removed since BN features contain the duration information. To achieve zero-shot voice cloning, an ECAPA-TDNN (Desplanques et al., 2020) speaker encoder is pre-trained to provide the speaker embedding as the condition of the synthesized speech. To avoid periodic signal prediction errors in the original HiFiGAN-based (Kong et al., 2020) decoder in VITS, which induce sound quality degradation, we follow VISinger2 (Zhang et al., 2022) and adopt a decoder with sine excitation signals. Since the VISinger2 decoder requires pitch information as input, we utilize a pitch predictor with a multi-layer Conv1D that predicts the speaker-dependent pitch from BN and speaker embedding. With the desired speaker embedding and corresponding BN features, the BN-to-speech module produces Chinese speech in the target timbre.
4.4 Audio Super-resolution
Following (Liu et al., 2021), we use an upsampling-network-based vocoder to achieve audio super-resolution (16kHz→24kHz). During training, the 16kHz mel-spectrogram is used as the condition to predict the 24kHz audio in the audio super-resolution model. Specifically, we adopt the AISHELL-3 (Shi et al., 2021) dataset, composing paired 16kHz and 24kHz speech data for model training. During inference, the high-quality 24kHz speech is produced from the mel-spectrogram of the 16kHz speech generated by the BN-to-speech model. Here DSPGAN (Song et al., 2022b) is adopted as our audio super-resolution model, which is a universal vocoder that ensures robustness and good sound quality without periodic signal errors.
5 Data Preparation
5.1 Datasets
Following the constraint of data usage, the training dataset for the S2ST system is illustrated in Table 1.
5 https://github.com/SpeechTranslation/GigaS2S
5.1.1 ASR Data
For the English ASR module in our proposed system, we use GigaSpeech, LibriSpeech, and TED-LIUM v2&v3 as training data. For the ASR system used to extract BN features in TTS, we use the text-to-speech data in AISHELL-3 and the Chinese speech in GigaS2S, along with the corresponding Chinese text in GigaST, as the training set. Since the test set's MT output text is a mix of Chinese and English, including names of people and places, the TTS module needs to support both languages. Therefore, we also add the aforementioned English data to the training set.
5.1.2 MT Data
We use text-parallel data, including News Commentary and OpenSubtitles2018, as the MT training set. Moreover, we also add the Chinese texts in GigaST and the corresponding English texts in GigaSpeech to the training set.
5.1.3 TTS Data
We use AISHELL-3 as training data for Text-to-BN and audio super-resolution. For the pre-trained speaker encoder, we adopt LibriSpeech, which contains 1166 speakers, as the training data. For the BN-to-speech model, in addition to AISHELL-3, which has 218 speakers, we also use LibriSpeech to meet the data amount and speaker number requirements of zero-shot TTS.
5.2 Data Pre-processing
5.2.1 ASR Data
To prepare the ASR data, we pre-process all transcripts to remove audio-related tags. Next, we map the text to the corresponding byte-pair encoding (BPE) units and count the number of BPE units in the ASR dictionary, which totals 5,000 units. For audio processing, we use a frame shift of 10ms and a frame length of 25ms and normalize all audio to 16kHz.
5.2.2 MT Data
For the MT data, we use the same tokenizer as mBART50 to perform sub-word segmentation for English and Chinese texts and to organize them into a format for neural network training. By doing so, we can maximize the benefits of initializing our translation model with mBART50 pre-trained model parameters. The mBART tokenizer mentioned above is a Unigram tokenizer. A Unigram model is a type of language model that considers each token to be independent of the tokens before it. What's more, the tokenizer has a total of 250,054 word segmentations, supports word segmentation processing for English, Chinese, and
other languages, and uses special tokens like <s>, </s>, and <unk>.

Table 1: Datasets used in our proposed system.

5.2.3 TTS Data
For AISHELL-3, we downsample it to 16kHz and 24kHz as the TTS modeling target and the audio super-resolution modeling target, respectively. All other data is downsampled to 16kHz. All data in TTS adopts a 12.5ms frame shift and a 50ms frame length.
Speech Enhancement. Given the presence of substantial background noise in the test set, the discriminative power of speaker embeddings is significantly reduced, thereby impeding the performance of the TTS module. Furthermore, the ASR data incorporated during the training of the BN-to-speech model is also subject to background noise. Therefore, we employ a single-channel Wiener filtering method (Lim and Oppenheim, 1979) to remove such noise from these data. Please note that we do not perform speech enhancement on the test set in the ASR module, because there is a mismatch between the denoised audio and the audio used in ASR training, and denoising would reduce the speech recognition accuracy.
5.2.4 Evaluation Data
For all evaluations, we use the English-Chinese (En-Zh) development data divided by the organizer from GigaSpeech, GigaST and GigaS2S, including 5,715 parallel En-Zh audio segments and their corresponding En-Zh texts. It is worth noting that the development data for evaluations has been removed from the training dataset.
6 Experiments
6.1 Experimental Setup
All the models in our system are trained on 8 A100 GPUs and optimized with Adam (Kingma and Ba, 2015).
ASR Module. All ASR models are implemented in ESPnet (https://github.com/espnet/espnet). Both Conformer and E-Branchformer models employ an encoder with 17 layers and a feature dimension of 512, with 8 heads in the self-attention mechanism and an intermediate hidden dimension of 2048 for the FFN. In addition, we employ a 6-layer Transformer decoder with the same feature hidden dimension as the encoder. The E-Branchformer model uses a cgMLP with an intermediate hidden dimension of 3072. The total number of parameters for the Conformer and E-Branchformer model in Section 2.1 is 147.8M and 148.9M respectively. We train the models with a batch size of 32 sentences per GPU for 40 epochs, and set the learning rate to 0.0015 and the warm-up steps to 25K.
For data augmentation, we conduct speed perturbation, pitch shifting, and audio codec on the original recordings. Spectrum augmentation and
noise augmentation are used for on-the-fly model training.
MT Module. All MT models are implemented in HuggingFace (https://github.com/huggingface/transformers). Using MT data, we fine-tune the mBART-50 large model, which has 611M parameters, with a batch size of 32 sentences per GPU for 20 epochs. The learning rate is set to 3e-5, warmed up for the first 10% of updates, and linearly decayed for the following updates. For fine-tuning using the MT data in ASR transcription format and the ASR outputs, we also fine-tune the model with a batch size of 32 sentences per GPU for 5 epochs and set the learning rate to 3e-5, which is warmed up for the first 5% of updates and linearly decayed for the following updates.
TTS Module. We complete our system based on the official VITS code (https://github.com/jaywalnut310/vits). The text-to-BN follows the configuration of DelightfulTTS and has about 64M parameters. To extract the duration required for text-to-BN, we train a Kaldi (https://github.com/kaldi-asr/kaldi) model using AISHELL-3. The ASR system used for extracting BN is the Chinese-English ASR model mentioned in Section 5.1.1. For BN-to-speech, we use a 6-layer FFT as the BN encoder and follow the other configuration in VISinger2, with about 45M parameters in total. The pitch predictor has 4 layers of Conv1D with 256 channels. The pitch used by the VISinger2 decoder and DSPGAN is extracted using Harvest (Morise, 2017) with Stonemask. To predict pitch in DSPGAN, we use the method described in Section 4.3. The up-sampling factors in DSPGAN are set to [5, 5, 4, 3] and the other configuration of DSPGAN-mm is preserved for audio super-resolution. The DSPGAN model has about 9M parameters in total. We train all the above models with a batch size of 64 sentences per GPU for 1M steps and set the learning rate to 2e-4. For the pre-trained speaker encoder, we follow the model configuration and training setup of ECAPA-TDNN (C=1024) with 14.7M parameters.
6.2 Evaluation Models
Baseline. To evaluate the effectiveness of the proposed cascaded S2ST system, we adopt the original cascaded S2ST system as a baseline, including an E-Branchformer ASR model, an mBART50 MT model fine-tuned using the MT data, and an end-to-end TTS model based on VITS trained with AISHELL-3.
Proposed system & Ablation Study. We further conduct ablation studies to evaluate each component in the proposed system. Specifically, the ablation studies are designed to verify the effectiveness of model fusion and data augmentation in ASR; three-stage fine-tuning, back translation, and cross-validation in MT; and two-stage training with BN, pre-trained speaker embedding, and audio super-resolution in TTS.
6.3 Results & Analysis
We conduct experiments on the effectiveness of each sub-module and the performance of our proposed cascaded S2ST system.
6.3.1 ASR Module
We calculate the word error rate (WER) of each ASR module to evaluate the English speech recognition accuracy. As shown in Table 2, the WER of the proposed system drops from 13.53% to 10.25% compared with the baseline, which indicates that the proposed system greatly improves the recognition accuracy. Moreover, the results of the ablation study demonstrate the effectiveness of both model fusion and data augmentation in improving speech recognition accuracy.

Table 2: The WER results of each ASR module.
Model                  WER (%)
Baseline               13.53
Proposed system        10.25
w/o model fusion       11.95
w/o data augmentation  12.40

6.3.2 MT Module
We evaluate our MT module in terms of the BLEU score, which measures the n-gram overlap between the predicted output and the reference sentence.

Table 3: The BLEU results of each MT module.
Model                        BLEU
Baseline                     28.1
Proposed system              33.4
w/o three-stage fine-tuning  28.7
w/o back translation         30.8
w/o cross-validation         31.0

As shown in Table 3, the proposed system with three-stage fine-tuning achieves a significantly better BLEU score than the baseline, demonstrating the effectiveness of curriculum learning in our scenario. Furthermore, by incorporating back translation and cross-validation, the translation performance can be further improved.
6.3.3 TTS Module
We calculate the character error rate (CER) to evaluate the clarity of speech for each TTS module. The ASR system used for calculating CER is the Chinese-English ASR model mentioned in Section 5.1.1. Additionally, we conduct mean opinion score (MOS) tests with ten listeners rating each sample on a scale of 1 (worst) to 5 (best) to evaluate naturalness, sound quality, and speaker similarity. In the ablation study without pre-trained speaker embedding, a speaker ID is used to control the speaker timbre of the synthesized speech. To eliminate the influence of ASR and MT results on TTS evaluation, the test set for TTS evaluation consists of the Chinese text in the evaluation data, with its corresponding English source speech serving as the speaker timbre reference.

Table 4: Experimental results of TTS in terms of MOS and CER. "BN" means using two-stage training with BN, and "pre-trained spkr. embd." means using the pre-trained speaker embedding.
Model                        Clarity in CER (%)  Naturalness (MOS)  Sound Quality (MOS)  Speaker Similarity (MOS)
Baseline                     7.14                3.38±0.05          3.81±0.04            2.12±0.06
Proposed system              6.12                3.70±0.06          3.86±0.06            3.72±0.06
w/o BN                       7.12                3.40±0.04          3.81±0.05            3.10±0.07
w/o Pre-trained spkr. embd.  -                   -                  4.05±0.05            2.22±0.06
w/o Audio super-resolution   -                   -                  3.64±0.04            -
Recording                    4.53                4.01±0.04          3.89±0.03            4.35±0.05

As shown in Table 4, our proposed system has achieved significant improvement in naturalness, sound quality, speaker similarity, and clarity of speech compared with the baseline. Interestingly, the system without pre-trained speaker embedding has better sound quality than both the proposed system and the recording. We conjecture the reason is that the pre-trained speaker embedding greatly influences the sound quality in the zero-shot TTS setup. Therefore, the quality of the synthesized 24kHz audio is superior to the 16kHz recording, which can be demonstrated by the 3.64 MOS score of the system without audio super-resolution. Meanwhile, the speaker similarity MOS score of the system without pre-trained speaker embedding is very low (2.22) due to the lack of generalization ability to unseen speakers. Without the BN-based two-stage model, performance decreases on all indicators, which shows the effectiveness of BN as an intermediate representation in our experimental scenario.
6.3.4 System Evaluation
Finally, we calculate the ASR-BLEU score for the baseline and the proposed system to evaluate the speech-to-speech translation performance. Specifically, we use the ASR system to transcribe the Chinese speech generated by TTS, and then compute the BLEU scores of the ASR-decoded text with respect to the reference Chinese translations. The ASR system for transcribing Chinese speech is the same as that in Section 6.3.3.

Table 5: The ASR-BLEU results of each system.
Model            ASR-BLEU
Baseline         27.5
Proposed system  32.2

As shown in Table 5, our proposed system achieves a higher ASR-BLEU score than the baseline, which indicates that our proposed system has good speech-to-speech translation accuracy.
7 Conclusion
This paper describes the NPU-MSXF speech-to-speech translation system, which we developed for the IWSLT 2023 speech-to-speech translation task. Our system is built as a cascaded system that includes ASR, MT, and TTS modules. To ensure good performance with multi-source data, we improved each module using various techniques such as model fusion and data augmentation in the ASR, three-stage fine-tuning, back translation, and cross-validation in the MT, and two-stage training, pre-trained speaker embedding, and audio super-resolution in the TTS. Through extensive experiments, we demonstrate that our system achieves high translation accuracy, naturalness, sound quality, and speaker similarity with multi-source input.
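
As a quick consistency check on the signal-processing settings reported in Sections 5.2 and 6.1, the following sketch converts the stated frame shifts and frame lengths into sample counts. It assumes, which the paper does not state explicitly, that the product of the DSPGAN up-sampling factors should equal the number of 24kHz output samples per mel frame. The helper name ms_to_samples is illustrative only.

```python
from math import prod


def ms_to_samples(duration_ms: float, sample_rate_hz: int) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return round(duration_ms * sample_rate_hz / 1000)


# ASR front end (Section 5.2.1): 10 ms shift and 25 ms window at 16 kHz.
asr_shift = ms_to_samples(10, 16_000)
asr_window = ms_to_samples(25, 16_000)

# TTS features (Section 5.2.3): 12.5 ms shift and 50 ms window.
tts_shift_16k = ms_to_samples(12.5, 16_000)
tts_window_16k = ms_to_samples(50, 16_000)
tts_shift_24k = ms_to_samples(12.5, 24_000)

# Up-sampling factors reported for DSPGAN (Section 6.1).
upsampling_factors = [5, 5, 4, 3]
total_upsampling = prod(upsampling_factors)

# Assumption: one 12.5 ms mel frame is expanded to this many 24 kHz samples.
frames_match = total_upsampling == tts_shift_24k
```

Under this assumption the numbers agree: one 12.5ms frame corresponds to 300 samples at 24kHz, which is also the product of the factors [5, 5, 4, 3].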
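
The WER in Section 6.3.1 and the CER in Section 6.3.3 are both edit-distance error rates, differing only in the unit that is compared. The sketch below is a minimal illustration of that computation, not the scoring tool used by the authors; the function names and the whitespace handling for character-level scoring are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str, unit: str = "word") -> float:
    """Return WER (unit='word') or CER (unit='char') as a percentage."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    else:
        # Assumption: spaces are ignored for character-level scoring.
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


wer_example = error_rate("the cat sat on the mat", "the cat sit on mat")
cer_example = error_rate("abc", "abd", unit="char")
```

In the first example one word is substituted and one is deleted out of six reference words, so the rate is two edits over six words.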
References
Farhad Akhbardeh, Arkady Arkhangorodsky, Magdalena Biesialska, Ondrej Bojar, Rajen Chatterjee, Vishrav Chaudhary, Marta R. Costa-jussà, Cristina España-Bonet, Angela Fan, Christian Federmann, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Leonie Harter, Kenneth Heafield, Christopher Homan, Matthias Huck, Kwabena Amponsah-Kaakyire, Jungo Kasai, Daniel Khashabi, Kevin Knight, Tom Kocmi, Philipp Koehn, Nicholas Lourie, Christof Monz, Makoto Morishita, Masaaki Nagata, Ajay Nagesh, Toshiaki Nakazawa, Matteo Negri, Santanu Pal, Allahsera Auguste Tapo, Marco Turchi, Valentin Vydrin, and Marcos Zampieri. 2021. Findings of the 2021 conference on machine translation (WMT21). In Proceedings of the Sixth Conference on Machine Translation, WMT@EMNLP 2021, Online Event, November 10-11, 2021, pages 1–88. Association for Computational Linguistics.
Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. 2020. Common voice: A massively-multilingual speech corpus. In Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, pages 4218–4222. European Language Resources Association.
Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, volume 382 of ACM International Conference Proceeding Series, pages 41–48. ACM.
Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Zhao You, and Zhiyong Yan. 2021. Gigaspeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. In Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021, pages 3670–3674. ISCA.
Yutian Chen, Yannis M. Assael, Brendan Shillingford, David Budden, Scott E. Reed, Heiga Zen, Quan Wang, Luis C. Cobo, Andrew Trask, Ben Laurie, Çaglar Gülçehre, Aäron van den Oord, Oriol Vinyals, and Nando de Freitas. 2019. Sample efficient adaptive text-to-speech. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck. 2020. ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification. In Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020, pages 3830–3834. ISCA.
Jonathan G Fiscus. 1997. A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER). In 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pages 347–354. IEEE.
Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented transformer for speech recognition. In Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020, pages 5036–5040. ISCA.
François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia A. Tomashenko, and Yannick Estève. 2018. TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation. In Speech and Computer - 20th International Conference, SPECOM 2018, Leipzig, Germany, September 18-22, 2018, Proceedings, volume 11096 of Lecture Notes in Computer Science, pages 198–208. Springer.
Ye Jia, Ron J. Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, and Yonghui Wu. 2019. Direct speech-to-speech translation with a sequence-to-sequence model. In Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, pages 1123–1127. ISCA.
Jaehyeon Kim, Jungil Kong, and Juhee Son. 2021. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 5530–5540. PMLR.
Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J Han, and Shinji Watanabe. 2023. E-branchformer: Branchformer with enhanced merging for speech recognition. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 84–91. IEEE.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Sravya Popuri, Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Tang, Juan Pino, and Wei-Ning Hsu. 2022. Direct speech-to-speech translation with discrete units. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, pages 3327–3339. Association for Computational Linguistics.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7871–7880. Association for Computational Linguistics.
Jae Soo Lim and Alan V Oppenheim. 1979. Enhancement and bandwidth compression of noisy speech. Proceedings of the IEEE, 67(12):1586–1604.
Pierre Lison, Jörg Tiedemann, and Milen Kouylekov. 2018. Opensubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA).
Yanqing Liu, Zhihang Xu, Gang Wang, Kuan Chen, Bohan Li, Xu Tan, Jinzhu Li, Lei He, and Sheng Zhao. 2021. DelightfulTTS: The microsoft speech synthesis system for blizzard challenge 2021. CoRR, abs/2110.12612.
Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Trans. Assoc. Comput. Linguistics, 8:726–742.
Masanori Morise. 2017. Harvest: A high-performance fundamental frequency estimator from speech signals. In Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, pages 2321–2325. ISCA.
Satoshi Nakamura, Konstantin Markov, Hiromi Nakaiwa, Gen-ichiro Kikui, Hisashi Kawai, Takatoshi Jitsuhiro, Jinsong Zhang, Hirofumi Yamamoto, Eiichiro Sumita, and Seiichi Yamamoto. 2006. The ATR multilingual speech-to-speech translation system. IEEE Trans. Speech Audio Process., 14(2):365–376.
Markus Ojala and Gemma C. Garriga. 2010. Permutation tests for studying classifier performance. J. Mach. Learn. Res., 11:1833–1863.
Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, pages 5206–5210. IEEE.
Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. Specaugment: A simple data augmentation method for automatic speech recognition. In Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, pages 2613–2617. ISCA.
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training.
Anthony Rousseau, Paul Deléglise, and Yannick Estève. 2012. TED-LIUM: an automatic speech recognition dedicated corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, May 23-25, 2012, pages 125–129. European Language Resources Association (ELRA).
Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, and Ming Li. 2021. AISHELL-3: A multi-speaker mandarin TTS corpus. In Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021, pages 2756–2760. ISCA.
Jongseo Sohn, Nam Soo Kim, and Wonyong Sung. 1999. A statistical model-based voice activity detection. IEEE Signal Process. Lett., 6(1):1–3.
Kun Song, Heyang Xue, Xinsheng Wang, Jian Cong, Yongmao Zhang, Lei Xie, Bing Yang, Xiong Zhang, and Dan Su. 2022a. AdaVITS: Tiny VITS for low computing resource speaker adaptation. In 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022, Singapore, December 11-14, 2022, pages 319–323. IEEE.
Kun Song, Yongmao Zhang, Yi Lei, Jian Cong, Hanzhao Li, Lei Xie, Gang He, and Jinfeng Bai. 2022b. DSPGAN: a gan-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP. CoRR, abs/2211.01087.
Rong Ye, Chengqi Zhao, Tom Ko, Chutong Meng, Tao Wang, Mingxuan Wang, and Jun Cao. 2022. GigaST: A 10,000-hour pseudo speech translation corpus. CoRR, abs/2204.03939.
Yongmao Zhang, Heyang Xue, Hanzhao Li, Lei Xie, Tingwei Guo, Ruixiong Zhang, and Caixia Gong. 2022. Visinger 2: High-fidelity end-to-end singing voice synthesis enhanced by digital signal processing synthesizer. CoRR, abs/2211.02903.
Low-Resource Formality Controlled NMT Using Pre-trained LM
has been tailored to individual languages and has labeled large amounts of data using word lists or morphological analyzers.

3 Approach
3.1 Overview
The task of formality-controlled generation can be viewed as a seq2seq machine translation task. More formally, given an input sequence x, we design a model that does the following:

$$\hat{y} = \arg\max_{y \in Y} \; p(y \mid x, l_s, l_t, f; \theta) \tag{1}$$

where
x is the input sequence,
l_s is the source language,
l_t is the target language,
f is the formality,
ŷ is the formality-controlled translation.

We propose a single model that produces an output given the input x and the formality setting f. Despite being part of the unconstrained task, our proposed approach does not mine or develop any formality-annotated data for training and just uses a pre-trained checkpoint of mBART.

3.2 Design
We looked at previous works incorporating contrasting styles (Rippeth et al., 2022; Schioppa et al., 2021b) as motivation for our approach. For controlling styles, the aforementioned works use an additive intervention approach. This approach entails adding a single style intervention vector V to the pre-trained encoder output Z. The same vector V is added to all the tokens of the encoder outputs, thereby changing the encoder outputs uniformly.
We modify the above approach to allow for more flexibility while learning. Instead of a single intervention vector V, we propose a unique vector V_i for every token i in the input space. In short, we repurpose an Embedding layer as a style intervening layer between the encoder and the decoder. This design resulted from our original question: will allowing more flexibility in the encoder enable it to identify which tokens require stylization, thus making it more interpretable? The hypothesis that originated from this question was: by giving each token its own intervention vector V_i, the model will learn each intervention vector V_i differently based on whether the token at that time step has a contrasting translation that is dependent on the formality setting. In short, we let the model learn different V_i's for each token. If true, this will provide some interpretability on which tokens the model recognizes as having a formality marker and translates differently in formal and informal settings. This approach is visualized in Figure 2. Since our approach uses an embedding layer for style intervention, we call our approach 'style embedding intervention.'
We learn the style embedding layer only in the formal setting and use a zero vector in the informal setting. In other words, the style embedding intervention is performed only in the formal setting, and encoder outputs are not perturbed in the informal setting. We do not have separate Embedding layers to learn each formality style, simply because it would be difficult to switch between layers during batched training. As shown by Schioppa et al. (2021b), the combination of a style vector and a zero vector for contrasting styles was sufficient to learn the style.

4 Experimental Apparatus
4.1 Dataset
The IWSLT formality shared task provided a formality-annotated dataset (Nadejde et al., 2022). This dataset comprises source segments paired with two contrastive reference translations, one for each
formality level (informal and formal) for two language pairs: EN-KO, VI in the supervised setting and two language pairs: EN-PT, RU in the zero-shot setting. The data statistics can be seen in Table 1. We use a random split of 0.2 to construct the validation dataset during model development.

Figure 2: Approach

4.2 Training Setup
For all our modeling experiments, we use mbart-large-50-one-to-many-mmt, a fine-tuned checkpoint of mBART-large-50 (Liu et al., 2020). This model, introduced by (Tang et al., 2020), is a fine-tuned mBART model which can translate English to 49 languages, including the languages we are interested in: KO, VI, PT, and RU.
For our baseline, we perform zero-shot inference on the mBART model for the four language pairs. The results are shown in Tables 3-6.
Based on the findings of (Nakkiran et al., 2019) and (Galke and Scherp, 2022), we fixed our loss function to be 'cross entropy with logits' and our optimizer to AdamW (Loshchilov and Hutter, 2017). We use the default learning rate of 10^-3, standard weight decay of 10^-2, and set β1, β2 and ϵ to 0.9, 0.998 and 10^-8 respectively.
To effectively train the transformer-based mBART model, we used a learning rate scheduler: a linear schedule with a warm-up, as introduced by (Vaswani et al., 2017). This creates a schedule with a learning rate that decreases linearly from the initial learning rate to 0 after a warm-up period. The warm-up period is set to 10% of the total training steps, during which the learning rate increases linearly from 0 to the initial learning rate set in the optimizer. All the other hyper-parameters are left at their defaults.
We trained our models using one NVIDIA A100 GPU with 80GB memory. To fit our model in this GPU, we used a batch size of 16 and a max sequence length of 128. We trained for 15 epochs with an early stopping callback set at 3.
We have implemented all the models in PyTorch (Paszke et al., 2019), leveraging the Huggingface (Wolf et al., 2019) transformers and evaluate libraries.

4.3 Evaluation
To assess the performance of the models, we use four metrics to evaluate the two main underlying tasks: translation quality and formality control.
For evaluating the translation quality, we use the following two metrics:
• Bilingual Evaluation Understudy (BLEU) score: BLEU score (Papineni et al., 2002) calculates the similarity between a machine translation output and a reference translation using n-gram precision. We use the SacreBLEU 2.0 (Post, 2018) implementation for reporting our scores.
• Cross-lingual Optimized Metric for Evaluation of Translation (COMET) score: COMET score (Rei et al., 2020) calculates the similarity between a machine translation output and a reference translation using token or sentence embeddings. We use the COMET wmt22-comet-da (Rei et al., 2022) model for reporting our scores.
For evaluating the formality control, we use the following two metrics:
• Matched-Accuracy (M-Acc): A reference-based corpus-level automatic metric that leverages phrase-level formality markers from the references to classify a system-generated translation as either formal or informal. This metric was provided by the IWSLT Formality shared task organizers.
• Reference-free Matched-Accuracy (RF-M-Acc): A reference-free variant of M-Acc that uses a multilingual formality classifier, based on xlm-roberta-base, fine-tuned on human-written formal and informal text, to label a system-generated hypothesis as formal or informal. This metric was provided by the IWSLT Formality shared task organizers.
In addition to this, we evaluate our generic translation quality on FLORES-200 (Goyal et al., 2022) for all language pairs under supervised and zero-shot settings. We use the devtest set of FLORES-200 and compute the BLEU and COMET scores.
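
The linear warm-up and decay schedule described in Section 4.2 can be written out as a small function. This is a minimal sketch under the stated settings (10% warm-up, initial learning rate 10^-3), not the authors' training code; the function name and arguments are illustrative. The Hugging Face transformers library provides get_linear_schedule_with_warmup for the same shape of schedule, though the paper does not say which implementation was used.

```python
def linear_warmup_decay(step: int, total_steps: int,
                        base_lr: float = 1e-3,
                        warmup_fraction: float = 0.1) -> float:
    """Learning rate at `step` for a linear warm-up followed by linear decay to 0."""
    warmup_steps = int(total_steps * warmup_fraction)
    if warmup_steps > 0 and step < warmup_steps:
        # Warm-up: rise linearly from 0 to base_lr.
        return base_lr * step / warmup_steps
    # Decay: fall linearly from base_lr to 0 at total_steps.
    decay_steps = max(total_steps - warmup_steps, 1)
    return base_lr * max(0.0, (total_steps - step) / decay_steps)


# Example with a hypothetical run of 1,000 optimizer steps.
schedule = [linear_warmup_decay(s, 1000) for s in (0, 50, 100, 550, 1000)]
```

With 1,000 total steps, the rate peaks at step 100 and returns to 0 at step 1,000.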
Language pair  Training data points  Testing data points
EN-KO          400                   600
EN-VI          400                   600
EN-PT          0                     600
EN-RU          0                     600

                              Formal              Informal
                              BLEU  Matched Acc   BLEU  Matched Acc
Rippeth et al., 2022          38.3  98.4          38.3  82.7
Style embedding intervention  38    99.2          37.4  98
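
The style embedding intervention from Section 3.2, compared against Rippeth et al., 2022 in the table above, adds a per-token vector V_i to each encoder output in the formal setting and a zero vector in the informal setting. The sketch below illustrates that idea in plain Python together with a normalized dot-product similarity between encoder outputs before and after the intervention. It is a sketch under assumptions rather than the authors' implementation: the randomly initialized table stands in for a learned embedding layer (in PyTorch this would typically be an nn.Embedding), the similarity is assumed to be normalized, and the first_token_only flag assumes the 'en_xx' token sits at position 0.

```python
import math
import random


def make_style_table(vocab_size, dim, seed=0):
    """Hypothetical stand-in for a learned table of intervention vectors V_i."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def apply_style_intervention(encoder_out, token_ids, style_table,
                             formal, first_token_only=False):
    """Add V_i to each encoder output when formal; add zeros when informal."""
    shifted = []
    for pos, (z, tok) in enumerate(zip(encoder_out, token_ids)):
        active = formal and (pos == 0 or not first_token_only)
        v = style_table[tok] if active else [0.0] * len(z)
        shifted.append([zi + vi for zi, vi in zip(z, v)])
    return shifted


def cosine_similarity(a, b):
    """Dot product of two vectors, normalized by their lengths (an assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy encoder outputs for a three-token input (values are arbitrary).
encoder_out = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0], [2.0, 2.0, 0.0]]
token_ids = [0, 3, 1]
table = make_style_table(vocab_size=5, dim=3)

formal_out = apply_style_intervention(encoder_out, token_ids, table, formal=True)
informal_out = apply_style_intervention(encoder_out, token_ids, table, formal=False)
first_only = apply_style_intervention(encoder_out, token_ids, table,
                                      formal=True, first_token_only=True)
similarities = [cosine_similarity(z, zf) for z, zf in zip(encoder_out, formal_out)]
```

A similarity close to 1 at a position means the intervention barely moved that token's representation, which is how the per-token scores in the analysis that follows are read.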
5.1 Style embedding layer analysis
In this section, we analyze the style embedding layer and compare the analysis with the original hypothesis: giving each token its own intervention vector V_i, the model will learn each vector differently based on whether the token at that time step has a contrasting translation that is dependent on the formality setting. Due to the unique nature of our training setup (learning a zero vector in the informal setting), for our hypothesis testing we compare the encoder vectors with and without the style embedding intervention. For this purpose, we use the dot product similarity. At each time step, we compute the dot product similarity between the encoder output before style intervention and the output after style intervention. This is equivalent to comparing the encoder outputs in the formal and informal settings.

Figure 3: Similarity scores for hypothesis analysis.

As seen from the token representation similarity scores, the model does not seem to learn new information in tokens that have a contrasting setting-dependent translation: the tokens' similarity scores are very near 1. Instead, it uses the </s> token's representation to store the style 'signal', by creating a style vector that makes the </s> token's representation ∼11% different between formality settings.
Another interesting observation is the extremely slight dissimilarity produced at the beginning of the sentence, at the 'en_xx' token. Did the model learn the same style information in ∼1% of information space in the 'en_xx' token compared to the ∼11% of information space in the '</s>' token? To an-
Models      EN-VI                          EN-KO
            BLEU  COMET   %M-Acc  %C-F     BLEU  COMET   %M-Acc  %C-F
Baseline 1  26.7  0.3629  96      0.95     4.9   0.2110  78      0.99
Baseline 2  26.1  0.829   3       0.006    3.9   0.8445  66.7    0.979
Model 1     44.8  0.8467  99      0.989    22.2  0.8246  74.1    0.9815
Model 2     44.2  0.8702  98.6    0.9782   22.5  0.831   82.9    0.9765
Model 3     44.6  0.874   99      0.9849   23.3  0.836   85.7    0.9832
Model 4     44.3  0.8462  99.2    0.9849   23.2  0.8287  75.3    0.9815

Baseline 1: UMD-baseline
Baseline 2: Zero-Shot mBart
Model 1: single vector intervention with train-dev split of 0.1
Model 2: style embedding intervention
Model 3: bos style intervention - Primary Submission
Model 4: single vector intervention with train-dev split of 0.2

Table 3: Results on the official test split in the formal supervised setting for language pairs EN-VI and EN-KO.

Table 4: Results on the official test split in the formal unsupervised setting for language pairs EN-PT and EN-RU.
swer this question, we added another modification mal setting, we obtain a BLEU score of 44.6 for
to our approach - we masked out the intervention EN-VI and 23.3 for EN-KO on the official test split.
vectors for all tokens except the ’en_xx’ token. In the informal setting, we obtain a BLEU score of
For naming purposes, we call this approach ’bos 43.5 for EN-VI and 22.8 for EN-KO. Tables 3 and
style intervention’ respectively. 5 have detailed results of all our models. Our pri-
mary model - ’bos style intervention’ - outperforms
6 Official Results the UMD baseline significantly for both languages
Along with the approach from Rippeth et al., 2022 with around 20 BLEU increase and more than dou-
taken as a baseline and an adapted version of it, ble the COMET score. This answers our hypothesis
we submit the results of our approach and of the that the model can learn the formality style in the
’bos style intervention’ approach. We analyse the small ∼1% information space at the beginning of
performance of our models under the supervised the sentence in ’en_xx’ token. Moreover, we ob-
setting and the zero-shot setting. We also generate tain higher scores on the metrics M-Acc% & C-F%
results on the FLORES-200 test split. that compute the degree of formality/informality
induced.
6.1 Supervised Setting Qualitative analysis of the translations, espe-
We trained our models multi-lingually on EN-VI cially for KO, revealed that code-switching was
and EN-KO for the supervised setting. In the for- a major issue. For example, some translations have
Models EN-VI EN-KO
BLEU COMET %M-Acc %C-F BLEU COMET %M-Acc %C-F
Baseline 1 25.3 0.3452 96 0.9816 4.9 0.1697 97.6 0.995
Baseline 2 31.9 0.8352 97 0.9933 3.2 0.8311 33.3 0.020
Model 1 43.3 0.8238 98.7 0.9949 22.1 0.8115 96.3 0.889
Model 2 43.6 0.8514 98.9 0.9949 23.0 0.8256 98.3 0.9514
Model 3 43.5 0.8504 98.9 1 22.8 0.8257 98.3 0.9581
Model 4 42.5 0.8232 98.3 0.9765 22.6 0.8162 96.4 0.9028
Baseline 1: UMD-baseline
Baseline 2: Zero-Shot mBart
Model 1: single vector intervention with train-dev split of 0.1
Model 2: style embedding intervention
Model 3: bos style intervention - Primary Submission
Model 4: single vector intervention with train-dev split of 0.2
Table 5: Results on the official test split in the informal supervised setting for language pairs EN-VI and EN-KO.
Table 6: Results on the official test split in the informal unsupervised setting for language pairs EN-PT and EN-RU.
Table 7: Results on Flores-200 test split for language pairs EN-VI & EN-KO in supervised setting and for language
pairs EN-PT & EN-RU in unsupervised setting.
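The per-token comparison described in Section 5.1 can be illustrated with a small sketch. It assumes the dot product is taken between L2-normalised vectors (cosine similarity), which the reported scores near 1 suggest but the text does not state. The function names and the toy vectors are hypothetical, not taken from the system.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v after L2-normalising both vectors (an assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def per_token_similarity(before, after):
    """Similarity at each time step between encoder outputs
    before and after the style intervention."""
    return [cosine_similarity(u, v) for u, v in zip(before, after)]

# Toy encoder outputs (hypothetical values): 3 tokens, hidden size 2.
tokens = ["en_xx", "hello", "</s>"]
before = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
after = [[1.0, 0.1], [0.0, 2.0], [1.0, 0.5]]

similarities = per_token_similarity(before, after)
for token, sim in zip(tokens, similarities):
    print(f"{token}: {sim:.3f}")
```

In this toy input the middle token is unchanged by the intervention, while the first and last tokens carry a small and a larger change, mirroring the pattern reported for 'en_xx' and '</s>'.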
Weston Feely, Eva Hasler, and Adrià de Gispert. 2019. Controlling japanese honorifics in english-to-japanese neural machine translation. In Proceedings of the 6th Workshop on Asian Translation, pages 45–53.
Lukas Galke and Ansgar Scherp. 2022. Bag-of-words vs. graph vs. sequence in text classification: Questioning the necessity of text-graphs and the surprising strength of a wide MLP. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4038–4051, Dublin, Ireland. Association for Computational Linguistics.
Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2022. The Flores-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics, 10:522–538.
Eduard Hovy. 1987. Generating natural language under pragmatic constraints. Journal of Pragmatics, 11(6):689–719.
Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation.
Ilya Loshchilov and Frank Hutter. 2017. Fixing weight decay regularization in adam. CoRR, abs/1711.05101.
Maria Nădejde, Anna Currey, Benjamin Hsu, Xing Niu, Marcello Federico, and Georgiana Dinu. 2022. Cocoa-mt: A dataset and benchmark for contrastive controlled mt with application to formality. arXiv preprint arXiv:2205.04022.
Maria Nadejde, Anna Currey, Benjamin Hsu, Xing Niu, Marcello Federico, and Georgiana Dinu. 2022. CoCoA-MT: A dataset and benchmark for contrastive controlled MT with application to formality. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 616–632, Seattle, United States. Association for Computational Linguistics.
Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. 2019. Deep double descent: Where bigger models and more data hurt. CoRR, abs/1912.02292.
Xing Niu, Marianna Martindale, and Marine Carpuat. 2017. A study of style in machine translation: Controlling the formality of machine translation output. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2814–2819.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. Pytorch: An imperative style, high-performance deep learning library. CoRR, abs/1912.01703.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
Ricardo Rei, José G. C. de Souza, Duarte Alves, Chrysoula Zerva, Ana C Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and André F. T. Martins. 2022. COMET-22: Unbabel-IST 2022 submission for the metrics shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578–585, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
Elijah Rippeth, Sweta Agrawal, and Marine Carpuat. 2022. Controlling translation formality using pre-trained multilingual language models. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 327–340, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Andrea Schioppa, David Vilar, Artem Sokolov, and Katja Filippova. 2021a. Controlling machine translation for multiple attributes with additive interventions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6676–6696.
Andrea Schioppa, David Vilar, Artem Sokolov, and Katja Filippova. 2021b. Controlling machine translation for multiple attributes with additive interventions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6676–6696, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 35–40.
Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2020. Multilingual translation with extensible multilingual pretraining and finetuning. CoRR, abs/2008.00401.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.
Aditi Viswanathan, Varden Wang, and Antonina Kononova. 2020. Controlling formality and style of machine translation output using automl. In Information Management and Big Data: 6th International Conference, SIMBig 2019, Lima, Peru, August 21–23, 2019, Proceedings 6, pages 306–313. Springer.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Huggingface's transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771.
NAIST Simultaneous Speech Translation System for IWSLT 2023
Ryo Fukuda† Yuta Nishikawa† Yasumasa Kano† Yuka Ko†
Tomoya Yanagita† Kosuke Doi† Mana Makinae†
Sakriani Sakti‡† Katsuhito Sudoh† Satoshi Nakamura†
†Nara Institute of Science and Technology, Japan
‡Japan Advanced Institute of Science and Technology, Japan
fukuda.ryo.fo3@is.naist.jp
3 System Setup
3.1 Data
We used MuST-C v2.0 (Di Gangi et al., 2019) and CoVoST-2 (Wang et al., 2020) for all language pairs: English-to-German (En-De), English-to-Japanese (En-Ja), and English-to-Chinese (En-Zh). We also used MuST-C v1.0, Europarl-ST (Iranzo-Sánchez et al., 2020), and TED-LIUM (Rousseau et al., 2012) for English-to-German. We included the development and test portions of CoVoST-2 and Europarl-ST in our training data. The overall statistics for these corpora are shown in Table 1. For evaluation, we used the tst-COMMON portion of MuST-C v2.0. All the text data in the corpora were tokenized using a multilingual SentencePiece tokenizer with a vocabulary of 250,000 subwords, distributed with the mBART50 pre-trained model.

3.2 Data Filtering
We filtered the prefix translation pairs obtained through Bilingual Prefix Alignment, following our IWSLT 2022 system (Fukuda et al., 2022). We compared three cut-off ratios of the number of samples in the input speech to the number of tokens in the output: 4,800, 4,000, and 3,200. Table 2 shows the percentage of data removed by each filter. We also applied the same filtering to the development data.

3.3 Simultaneous Speech-to-Text System
We developed an end-to-end speech-to-text model initialized with two pre-trained models for its speech encoder and text decoder. The speech encoder was initialized with HuBERT-Large, which consists of a feature extractor trained on 60K hours of unlabeled speech data from Libri-Light (Kahn et al., 2020) and Transformer encoder layers. The feature extractor has seven convolutional layers with a kernel size of (10, 3, 3, 3, 3, 2, 2), a stride of (5, 2, 2, 2, 2, 2, 2), and 512 channels. The number of Transformer encoder layers is 24. The text decoder was initialized with the decoder of mBART50 (Tang et al., 2020). The decoder consists of twelve Transformer layers, and the embedding layer and linear projection weights are shared, with a vocabulary size of 250,000. The size of each Transformer and feed-forward layer is 1,024 and 4,096, respectively, the number of attention heads is 16, the activation function is ReLU, and layer normalization is applied before the attention operations. The encoder and decoder are also connected via Inter-connection (Section 2.1) and a length adapter (Tsiamas et al., 2022). The length adapter is a 3-layer convolutional network with 1,024 channels, a stride of 2, and a Gated Linear Unit (GLU) activation function.
Speech input is given as waveforms with a 16-kHz sampling rate, normalized to zero mean and unit variance. During training, each source audio was augmented (Kharitonov et al., 2020) before normalization, with a probability of 0.8. We trained multilingual models on all the data listed in Table 1 with a maximum source length of 400,000 frames and a target length of 1,024 tokens. We applied gradient accumulation and data-parallel computations to achieve a batch size of approximately 32 million tokens. We used Adam with β1 = 0.99, β2 = 0.98, and a base learning rate of 2.5 × 10^-4. The learning rate was controlled by a tri-stage scheduler with phases of 0.15, 0.15, and 0.70 for warm-up, hold, and decay, respectively, while the initial and final learning rates had a scale of 0.01 compared to the base. We used sentence averaging and gradient clipping of 20. We applied a dropout probability of 0.1 and used time masking for 10-length spans with a probability of 0.2, and channel masking for 20-length spans with a probability of 0.1, in the encoder feature extractor's output. The loss was cross-entropy with label smoothing of 20% probability mass.
The offline SimulST model was fine-tuned, and then checkpoint averaging was performed. Checkpoints were saved every 1,000 training steps, and the parameters of the five checkpoints with the best loss on the development data were averaged to obtain the final model. Subsequently, one epoch of fine-tuning was performed on prefix alignment pairs from the MuST-C v2 training data only. We reduced the learning rate to 2.5 × 10^-5 during the fine-tuning on translation pairs obtained with Bilingual Prefix Alignment.
As a SimulST policy, the local agreement with n = 2 (LA-2) was used. The chunk size was varied from 200 ms to 1000 ms to adjust the quality-latency trade-off. A beam search with a beam size of five was used to generate hypotheses for input chunks.

LSTM in Tacotron2 and attention mechanism to the forward attention with the transit agent (Zhang et al., 2018) for incremental processing. Guided Attention Loss (Tachibana et al., 2018) was used as an additional loss function. The input size of Tacotron2 is 89, and the optimizer was Adam with a learning rate of 1e-3 and hyperparameters β1 = 0.9, β2 = 0.999, and ϵ = 1e-6. The batch size was 32 sentences. Experimental conditions for Parallel WaveGAN are the same as in the original paper, except for the parameters related to acoustic features and speech.
The pronunciation estimation used the wait-3 policy. The incremental TTS has a couple of look-ahead parameters, indicating the length to control the quality-latency trade-off. We tuned these parameters to keep the quality of synthesized speech within the latency threshold requirement (2.5 seconds).
Figure 2: BLEU and AL results of the offline model and the models fine-tuned with prefix alignment on En-De. The parentheses indicate the max ratio of prefix pair filtering. Circled dots indicate our submitted SimulS2T system.
Figure 3: BLEU and AL results of the offline model and the models fine-tuned with prefix alignment on En-Ja.
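The max-ratio filter on prefix pairs (Section 3.2, with thresholds 3,200, 4,000, and 4,800) can be sketched as below. The sketch assumes that a pair is removed when the number of input speech samples per output token exceeds the threshold, and that pairs with an empty target are dropped; both are assumptions, and the function names and toy pairs are hypothetical. At a 16-kHz sampling rate, a ratio of 4,800 corresponds to 0.3 seconds of speech per target token.

```python
def keep_prefix_pair(num_samples, num_tokens, max_ratio):
    """Return True if the samples-per-token ratio is within max_ratio."""
    if num_tokens == 0:
        # Assumption: pairs with an empty target prefix are discarded.
        return False
    return num_samples / num_tokens <= max_ratio

def filter_prefix_pairs(pairs, max_ratio):
    """Keep only (num_samples, num_tokens) pairs that pass the ratio filter."""
    return [pair for pair in pairs if keep_prefix_pair(pair[0], pair[1], max_ratio)]

# Toy prefix pairs: (speech samples, target tokens). Values are hypothetical.
pairs = [(16000, 5), (48000, 10), (80000, 4)]
kept_3200 = filter_prefix_pairs(pairs, 3200)
kept_4800 = filter_prefix_pairs(pairs, 4800)
print(kept_3200, kept_4800)
```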
Table 4: BLEU scores of the simple and Inter-connection models without and with checkpoint averaging, evaluated on MuST-C v2 tst-COMMON.

In the multilingual model, the weights of the weighted sum in Inter-connection are shared across language pairs, although the weights each pair requires differ. For En-Zh, the weights differed more than for En-De and En-Ja, so sharing them degraded performance.

4.4 Computation-aware Latency
We also evaluated models with computation-aware Average Lagging (AL_CA). AL_CA is a variant of AL that adds the actual elapsed time elapsed_i to the delay d_i of the i-th target token y_i:

$d_i = \sum_{k=1}^{j} (T_k + \mathrm{elapsed}_i)$ (1)

where T_k is the duration of the k-th input speech segment and j is the position of the input segment already read when generating y_i. The elapsed time elapsed_i is measured as the time from the start of the translation to the output of target token y_i.
The evaluation was conducted using an NVIDIA GeForce RTX 2080 Ti. Figure 5 shows the result. Unlike with the non-computation-aware latency metrics, fixed-size segmentation worked better than the local agreement in the quality-latency trade-off. The local agreement often discards the latter part of the prefix translation because it disagrees with the next prefix translation, while such a trackback does not happen in the fixed segmentation scenario. Therefore, the local agreement needs to predict more tokens every time, which increases the decoding time. This result suggests another trade-off between quality improvement with a sophisticated segmentation strategy and latency reduction with a fixed strategy.

ASR_BLEU StartOffset EndOffset ATD
9.873 2495.01 4134.752 3278.809
Table 5: Results of the submitted SimulS2S system on the MuST-C v2 tst-COMMON.

4.5 Submitted SimulS2S System
Table 5 shows the scores of the SimulS2S system. Compared to the BLEU results of the SimulS2T systems with similar chunk size settings, the SimulS2S system had an ASR_BLEU nearly five points lower, due to the quality of the synthesized speech and possible ASR errors. Figure 6 shows the quality-latency trade-offs of SimulS2S, with ASR_BLEU stagnating around 10.5 points. In addition, the output of the submitted SimulS2S system had a character error rate of 28.3% relative to the output of the SimulS2T system with the same chunk size. These results indicate that there is significant room for improvement in both the TTS and the ASR.

5 Conclusions
In this paper, we described our SimulST systems for the IWSLT 2023 Simultaneous Speech Translation task. Experimental results demonstrated the effectiveness of Inter-connection and Bilingual Prefix Alignment. The speech-to-speech system is still challenging but showed promising performance with a simple cascade of speech-to-text SimulST and incremental TTS.
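The computation-aware delay of Eq. (1) in Section 4.4 can be illustrated with a short sketch. It rests on one interpretive assumption that should be checked against the evaluation toolkit: the elapsed time elapsed_i is added once per target token, on top of the total duration of the j input segments already read, rather than once per segment. The function name and the toy numbers are hypothetical.

```python
def computation_aware_delays(segment_durations, read_positions, elapsed_times):
    """Delay d_i per target token: duration of the segments read so far
    plus the wall-clock time elapsed when the token was emitted.

    segment_durations: T_k for each input speech segment (ms).
    read_positions:    j for each target token (number of segments read).
    elapsed_times:     elapsed_i for each target token (ms).
    """
    delays = []
    for j, elapsed in zip(read_positions, elapsed_times):
        source_time = sum(segment_durations[:j])  # T_1 + ... + T_j
        delays.append(source_time + elapsed)      # assumption: elapsed added once
    return delays

# Toy example: three 500 ms chunks, one target token emitted per chunk.
delays = computation_aware_delays(
    segment_durations=[500, 500, 500],
    read_positions=[1, 2, 3],
    elapsed_times=[120, 300, 450],
)
print(delays)
```

Under this reading, a policy that decodes more tokens per chunk raises elapsed_i and therefore every subsequent delay, which is consistent with the observation that local agreement loses its advantage once computation time is counted.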
(a) BLEU and AL in En-De. (b) BLEU and AL in En-Ja. (c) BLEU and AL in En-Zh.
(d) BLEU and AL_CA in En-De. (e) BLEU and AL_CA in En-Ja. (f) BLEU and AL_CA in En-Zh.
Figure 5: Comparison of the local agreement with n = 2 and fixed-size segmentation policies.
Figure 6: WHISPER_ASR_BLEU and ATD results of the SimulS2S systems on En-Ja. The numbers above the marks indicate the chunk size. Circled dots indicate our submitted system.

Acknowledgements
Part of this work was supported by JSPS KAKENHI Grant Number JP21H05054.

References
Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutail Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.
Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 evaluation campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 98–157, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Colin Cherry and George Foster. 2019. Thinking slow about latency evaluation for simultaneous machine translation. arXiv preprint arXiv:1906.00048.
Kyunghyun Cho and Masha Esipova. 2016. Can neural machine translation do simultaneous translation? arXiv preprint arXiv:1606.02012.
Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2012–2017, Minneapolis, Minnesota. Association for Computational Linguistics.
Ryo Fukuda, Yuka Ko, Yasumasa Kano, Kosuke Doi, Hirotaka Tokuyama, Sakriani Sakti, Katsuhito Sudoh, and Satoshi Nakamura. 2022. NAIST simultaneous speech-to-text translation system for IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 286–292, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units.
Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchis, Jorge Civera, and Alfons Juan. 2020. Europarl-st: A multilingual corpus for speech translation of parliamentary debates. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8229–8233.
J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. 2020. Libri-light: A benchmark for asr with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7669–7673. https://github.com/facebookresearch/libri-light.
Yasumasa Kano, Katsuhito Sudoh, and Satoshi Nakamura. 2022. Simultaneous neural machine translation with prefix alignment. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 22–31, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Yasumasa Kano, Katsuhito Sudoh, and Satoshi Nakamura. 2023. Average token delay: A latency metric for simultaneous translation. In Proc. Interspeech 2023. To appear.
Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, and Emmanuel Dupoux. 2020. Data augmenting contrastive learning of speech representations in the time domain. arXiv preprint arXiv:2007.00991.
Danni Liu, Gerasimos Spanakis, and Jan Niehues. 2020. Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection. In Proc. Interspeech 2020, pages 3620–3624.
Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy. Association for Computational Linguistics.
Mingbo Ma, Baigong Zheng, Kaibo Liu, Renjie Zheng, Hairong Liu, Kainan Peng, Kenneth Church, and Liang Huang. 2020a. Incremental text-to-speech synthesis with prefix-to-prefix framework. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3886–3896, Online. Association for Computational Linguistics.
Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, and Juan Pino. 2020b. SIMULEVAL: An evaluation toolkit for simultaneous translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 144–150, Online. Association for Computational Linguistics.
Kikuo Maekawa. 2008. Balanced Corpus of Contemporary Written Japanese. In Proceedings of the 6th Workshop on Asian Language Resources.
Yuta Nishikawa and Satoshi Nakamura. 2023. Inter-connection: Effective connection between pre-trained encoder and decoder for speech translation. In Proc. Interspeech 2023. To appear.
Sara Papi, Marco Gaido, Matteo Negri, and Marco Turchi. 2022. Over-generation cannot be rewarded: Length-adaptive average lagging for simultaneous speech translation. In Proceedings of the Third Workshop on Automatic Simultaneous Translation, pages 12–17, Online. Association for Computational Linguistics.
Ankita Pasad, Ju-Chieh Chou, and Karen Livescu. 2021. Layer-wise analysis of a self-supervised speech representation model. 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 914–921.
Peter Polák, Ngoc-Quan Pham, Tuan Nam Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, Ondřej Bojar, and Alexander Waibel. 2022. CUNI-KIT system for simultaneous speech translation task at IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 277–285, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.
Anthony Rousseau, Paul Deléglise, and Y. Estève. 2012. Ted-lium: an automatic speech recognition dedicated corpus. In International Conference on Language Resources and Evaluation.
Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan, Rif A. Saurous, Yannis Agiomvrgiannakis, and Yonghui Wu. 2018. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783.
Ryosuke Sonobe, Shinnosuke Takamichi, and Hiroshi Saruwatari. 2017. Jsut corpus: free large-scale japanese speech corpus for end-to-end speech synthesis. arXiv preprint arXiv:1711.00354.
Hideyuki Tachibana, Katsuya Uenoyama, and Shunsuke Aihara. 2018. Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4784–4788.
Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2020. Multilingual translation with extensible multilingual pretraining and finetuning.
Ioannis Tsiamas, Gerard I. Gállego, Carlos Escolano, José Fonollosa, and Marta R. Costa-jussà. 2022. Pretrained speech encoders and efficient fine-tuning methods for speech translation: UPC at IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 265–276, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Changhan Wang, Juan Pino, Anne Wu, and Jiatao Gu. 2020. CoVoST: A diverse multilingual speech-to-text translation corpus. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4197–4203, Marseille, France. European Language Resources Association.
Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. 2020. Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6199–6203.
Jing-Xuan Zhang, Zhen-Hua Ling, and Li-Rong Dai. 2018. Forward attention in sequence-to-sequence acoustic modeling for speech synthesis. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4789–4793.
A Appendix
Tables 6, 7, and 8 show the results for all chunk
size settings for the En-De, En-Ja, and En-Zh
models used in the submitted system, respectively.
chunk size BLEU LAAL AL AP DAL ATD
300 24.217 947.509 495.162 0.732 1465.822 814.368
400 26.657 1189.696 829.689 0.753 1738.568 1180.684
500 27.986 1416.459 1071.682 0.774 1992.596 1375.404
600 28.739 1618.746 1318.715 0.791 2232.175 1367.612
700 29.298 1797.061 1515.356 0.811 2432.087 1608.334
800 29.809 1956.321 1714.173 0.826 2617.073 1720.705
820 29.78 2011.518 1772.404 0.827 2672.554 1765.76
840 29.792 2022.322 1790.452 0.832 2680.218 1741.386
860 29.746 2054.923 1825.194 0.834 2726.204 1740.656
900 29.805 2115.625 1895.961 0.841 2783.033 1711.2
950 29.975 2172.927 1964.329 0.846 2856.738 1893.749
1000 30.234 2255.583 2057.579 0.852 2938.408 1884.775
Table 6: Results of the Offline+PA (None) model on the MuST-C v2 tst-COMMON En-De.
Table 7: Results of the Offline+PA (4000) model on the MuST-C v2 tst-COMMON En-Ja.
Table 8: Results of the Offline+PA (None) model on the MuST-C v2 tst-COMMON En-Zh.
Language Model Based Target Token Importance Rescaling for
Simultaneous Neural Machine Translation
ing requirements of latency versus translation quality. In this paper, we use an auxiliary target-side language model to augment the training of the decoder model. Under this notion of target adaptive training, generating rare or difficult tokens is rewarded which improves the translation quality while reducing latency. The predictions made by a language model in the decoder are combined with the traditional cross entropy loss which frees up the focus on the source side context. Our experimental results over multiple language pairs show that compared to previous state of the art methods in simultaneous translation, we can use an augmented target side
[Figure 2 (plots omitted): BLEU versus Average Lagging (AL) for (a) Vi→En, (b) De→En, and (c) En→De, comparing Offline, Wait-k, Efficient Wait-k, MMA, Wait-Info, and MMA+TC (ours).]
[Figure 3 (plot omitted): BLEU versus Average Lagging (AL), for AL between 3 and 5.]
Figure 3: Performance of several methods on the En→Vi dataset in the low latency (AL<5) window.

Gaussian Multihead Attention (GMA; Zhang and Feng (2022a)) predicts the aligned source position for a target token and rescales attention with a Gaussian distribution centred at this position.

ITST (Zhang and Feng, 2022b) finds the optimal information transport between source and target.

Adaptive Wait-k (Zheng et al., 2020) dynamically chooses an optimal k in the wait-k policy at every step.

MoE Wait-k (Zhang and Feng, 2021b) uses attention heads as experts trained with different k with the wait-k policy.

MMA+TC (ours) is the proposed MMA model with a target-context-aware adaptive training objective. We use an auxiliary target-side LM decoder of the same configuration as the MT decoder. Note that the LM is only used during training and discarded at test time. We do not use extra data.

The implementation of our method is based on fairseq (Ott et al., 2019). Following MMA, we use a Transformer (Vaswani et al., 2017) with 6 encoder and decoder layers and 4 monotonic attention heads for the IWSLT datasets En↔Vi and De↔En. All baselines are trained with the same configuration and with a maximum of 16k tokens per batch. Our auxiliary language model follows the decoder settings in the model.

icy $g_i$, AL is:

$$\mathrm{AL} = \frac{1}{\tau} \sum_{i=1}^{\tau} \left( g_i - \frac{i-1}{|y|/|x|} \right) \qquad (11)$$

where $\tau = \operatorname{argmax}_i (g_i = |x|)$, and $|x|$ and $|y|$ are the source and target sentence lengths respectively.

5 Results

Figure 2 shows the comparison of BLEU vs. latency (in terms of Average Lagging) of our method against previous methods on the IWSLT’15 Vi → En and IWSLT’14 En ↔ De directions. For Vi → En, we observe a significant improvement in the BLEU scores at the same latencies, compared to the baselines. We also reach the offline translation quality at low AL on this dataset. In the En → De and De → En directions too, there is a boost in the translation quality, more noticeably for lower latencies. The plots show that our method boosts translation quality at the lower latencies, and the effect of reweighting is more pronounced in these regions, where the source context is more limited. In higher latency regions, when the source information window increases, the other baselines start to reach our BLEU score in the English-German directions.

In Figure 3, we compare against several state-of-the-art methods on En → Vi. Our method gets better translation quality compared to all others in the low-latency zone, matching the offline score at 3.86 AL. We show the BLEU vs. AL plot in a low latency range to compare performance in the more challenging area of this task, the low latency points.

6 Analysis

6.1 Token-level vs. Sentence-Level Weight

Ablation Study The two hyperparameters in our method are Sentence-Level Weight and Token-Level Weight, which determine the sentence and token-level effect of rescaling with LM. In Fig. 5
we report the BLEU scores with different hyperparameter settings on Vi-En. (AL values across the table are similar, as the experiments are done with the same λ.) We set the values of these hyperparameters to 0.2 in all our experiments.

Token Order (Descending)   Avg. Freq.   Ref (%)   MMA (%)   MMA+TC (%)
[0, 10%)                   1385         85.56     87.63     87.21
[10, 30%)                  56           6.89      6.48      6.34
[30, 50%)                  20           2.19      1.75      1.95
[50, 70%)                  11           1.30      0.70      0.86
[70, 100%]                 6            0.95      0.26      0.31

Table 1: Avg. frequency on the training set and the proportion of tokens of different frequencies in the test set and the translations generated by the baseline and our model.

POS     Ref    MMA (%)   +TC (%)   MSE (↓)
ADJ     1497   82.1      83.5      0.18 | 0.16
ADV     1323   83.5      87.6      0.20 | 0.12
INTJ    74     98.6      94.6      0.01 | 0.04
NOUN    4187   90.5      93.4      0.09 | 0.06
PROPN   1315   99.4      99.4      - | -
VERB    3226   94.0      95.7      0.06 | 0.04

Table 2: Our method generates more content words than the baseline MMA. Columns 2 and 3 show the percentage of the reference content words recovered in MMA and MMA+TC (in blue) respectively. The last column shows normalized mean squared error (MSE) of the recovered content words with respect to the reference. Lower MSE values are better.

[Figure 4 (plot omitted): F-measure (y-axis, 0.4 to 0.6) versus Word Frequency (x-axis buckets 2, 3, 4, [5,10), [10,100)) for MMA and MMA+TC (Ours).]
Figure 4: F-measure between model outputs and reference tokens for the low-frequency words, bucketed by frequency of the reference token.

Content word occurrences. Zhang et al. (2022a) show that focusing on the right content words in the target is crucial to getting the necessary target information in a simultaneous translation setting. Following Moradi et al. (2019), we inspect the content words generated by our model, using spaCy to get POS tags over the translations. As evident from Table 2, our model recovers more content words in the translations with respect to the reference.

6.3 Effect on Translation Length

Following the rationale of Lakew et al. (2019) in
[Figure 5 (plot omitted): BLEU under different weight settings; axis label: Token-level Scale; surviving values: 25.36, 26.03, 25.83.]

[Figure 6 (heatmaps omitted): attention heatmaps over three Vi→En examples for (a) MMA+TC (ours) and (b) MMA (baseline).]
Figure 6: Attention heatmap comparison on the Vi → En direction. The Read-Write policy is drawn with red and
green arrows respectively. The pink column at the start denotes the source tokens read to produce the target token
on the left (darker implies more source words read, and white denotes 0 reads between consecutive target tokens)
[Figure 7 (plot omitted): sufficiency $A^{Suf}$ (y-axis, 0.80 to 0.88) versus Sentence Length (x-axis, 20 to 80) for MMA, MMA+TC, and Wait-Info.]
Figure 7: Sufficiency as a function of target length. All models produce translations with an AL of 4.

tence are:

Src: Chúng tôi còn vt mu đây . Nó còn khá m .
MMA: RRR W RRR WWW RRR WW RRR W RR WWWWW
Ours: RRR W RRR WW RR W R WW RRR W R W R WWWW

In this example, MMA reads more than required for a write in certain places. It shows that at a similar lag, our model gets a higher probability of a WRITE action, compared to MMA, after having read the same number of source words.

Sufficiency of the READ actions. Zhang and Feng (2022c) introduce a metric of sufficiency $A^{Suf}$ in Read/Write paths with the notion that too

truth aligned source position of the $j$th target word is denoted by $a_j$, and the number of source words read when writing the $j$th target word is denoted by $r_j$:

$$A^{Suf} = \frac{1}{|y|} \sum_{j=1}^{|y|} \mathbb{1}_{a_j \le r_j} \qquad (12)$$

We compare our method against MMA and Wait-Info at AL=4 with the sufficiency metric. Using Equation (12) across sentences of varying lengths, we evaluate the read-write paths of each model against reference alignments from Eflomal (Östling and Tiedemann, 2016) (footnote 4: We use the Eflomal library to get alignment priors from the IWSLT’15 Vi-En train set, and use them to generate alignments for the test set. https://github.com/robertostling/eflomal). In Figure 7, we can see a clearly increasing and higher score on sufficiency as compared to the baselines, Wait-Info and MMA. This signifies that our target-context augmented training helps the model read sufficient source tokens required for producing a translation, while maintaining the same latency as others, showing that the model learns and correctly gauges the information it requires to translate a target token, and
makes READ actions accordingly.

Ratio of Aligned READ actions. We compare MMA and our Read-Write policy against the reference source-target alignments by computing the overlap between the hard alignments and the translation path for all output translations:

$$\mathrm{IoU}_{a,r} = \sum_{i=1} \frac{\mathrm{intersection}(a_i, r_i)}{\mathrm{union}(a_i, r_i)} \qquad (13)$$

where $a_i$ is the reference alignment matrix for the $i$th sentence, made by setting all aligned source positions to 1, and $r_i$ is the upper triangular matrix set to 1 using reads from the policy (footnote 5: We choose this metric to show the extent to which the policy follows the source-target alignments. In an ideal setting, IoU = 1). The IoU scores for our policy and for MMA are shown in Figure 8 (bottom) with varying sentence lengths. Our policy shows a stronger adherence to the source-target monotonic alignment path.

[Figure 8 (plots omitted). Top: Sentence Counts versus Len(Output)-Len(Reference) for MMA and MMA+TC (Ours). Bottom: BLEU (left axis) and IoU (right axis) versus Sentence Lengths, bucketed as [10,20), [20,30), [30,40), [40,50), [50,60), >=60.]
Figure 8: Top: Length difference compared to ref. Bottom: Sentence BLEU bucketed by target length (shown in bars), and the ratio of aligned READ actions for each bucket (IoU scores, Eqn. 13) shown with lines.

7 Related Work

Simultaneous Translation. Fixed Policy methods (Ma et al., 2019; Elbayad et al., 2020) follow the fixed rule of waiting for the first k source tokens before generating a target token, and alternate thereafter. Adaptive Wait-k (Zheng et al., 2020) dynamically chooses the best k at every step. Han et al. (2020) applied meta learning in wait-k. Zhang and Feng (2021b) use each attention head as an expert of wait-k policy whereas Zhang and Feng (2021a) introduce a character level wait-k policy. But fixed policy methods are not feasible for complex inputs and cannot adapt to them. Full-sentence MT has also been leveraged to augment the policy with future information (Zhang et al., 2020; Alinejad et al., 2021). But using such oracle or gold (Zheng et al., 2019; Arthur et al., 2021) READ/WRITE actions does not optimize policy with translation quality. Alinejad et al. (2018) propose providing future information on the source side using prediction. Grissom II et al. (2014) predict unseen verbs and use reinforcement learning to learn when to trust these predictions and when to wait for more input. In contrast, we leverage target side context to strengthen the simultaneous translations.

Zhang and Feng (2022c) train two models on both language directions and make their policies converge. Wilken et al. (2020) propose external ground-truth alignments to train the policy. Papi et al. (2023) use cross attention scores to guide policy. Infinite-lookback (Arivazhagan et al., 2019) and chunkwise (Chiu* and Raffel*, 2018) attention propose to use a soft monotonic attention over previous encoder states. We use a variant of the policy proposed by Ma et al. (2020) that adapts monotonic attention to the multihead architecture of the Transformer. GMA (Zhang and Feng, 2022a) predicts the aligned source position of the current target token and rescales attention based on it. But these methods treat all words equally during training whereas our method improves upon MMA via adaptive training.

Some recent work explores capturing and quantifying information from the source tokens and uses it to model READ/WRITE actions (Zhang et al., 2022a; Zhang and Feng, 2022b). But these works do not use the target context in their information. Unlike their quantization method, we present a simple scoring by using an auxiliary target-side LM.

Adaptive Training for MT. Target adaptive objectives have been explored by Lin et al. (2017), who use the probability of a class to scale, but actually only scale down high frequency classes, and by Jiang et al. (2019), who directly use the normalized frequency count but have high variance. Gu et al. (2020) use a chi-square and an exponential distribution function with frequency. However, these use only static word frequency. BMI (Xu et al., 2021) attempts to capture mutual information between each source and target token. CBMI (Zhang et al., 2022b) incorporates target context as well, in
mutual information. However, these adaptive methods are not directly transferable to the streaming nature of our task.

8 Conclusion

We have presented a simple technique for rescaling target-token importance in simultaneous translation using an information theoretic approach and an adaptive training paradigm. We differentiate the importance of various target tokens by their dependence on the source sentence. To guide our simultaneous translation model, we incorporate a target-side language model that provides an additional signal indicating the importance of each target token or sentence under the condition of the previous target context. Our model shows strong performance on several datasets and outperforms several state-of-the-art techniques in the low latency range (AL<5). Further analysis shows that our technique is better able to translate long sentences and those with rare words. We also showed that the translation path (read/write action sequence) has a stronger correlation to the source-target alignment.

that helped shape this paper and Dr. Angel Chang for lending us the GPU resources. The research was partially supported by the Natural Sciences and Engineering Research Council of Canada grants NSERC RGPIN-2018-06437 and RGPAS-2018-522574 and a Department of National Defence (DND) and NSERC grant DGDND-2018-00025 to the third author.

References

Ashkan Alinejad, Hassan S. Shavarani, and Anoop Sarkar. 2021. Translation-based supervision for policy generation in simultaneous neural machine translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1734–1744, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Ashkan Alinejad, Maryam Siahbani, and Anoop Sarkar. 2018. Prediction improves simultaneous neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3022–3027, Brussels, Belgium. Association for Computational Linguistics.
Kyunghyun Cho and Masha Esipova. 2016. Can neural machine translation do simultaneous translation?

Maha Elbayad, Laurent Besacier, and Jakob Verbeek. 2020. Efficient Wait-k Models for Simultaneous Machine Translation. In Proc. Interspeech 2020, pages 1461–1465.

Alvin Grissom II, He He, Jordan Boyd-Graber, John Morgan, and Hal Daumé III. 2014. Don’t until the final verb wait: Reinforcement learning for simultaneous machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1342–1352, Doha, Qatar. Association for Computational Linguistics.

Shuhao Gu, Jinchao Zhang, Fandong Meng, Yang Feng, Wanying Xie, Jie Zhou, and Dong Yu. 2020. Token-level adaptive training for neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1035–1046, Online. Association for Computational Linguistics.

Hou Jeung Han, Mohd Abbas Zaidi, Sathish Reddy Indurthi, Nikhil Kumar Lakumarapu, Beomseok Lee, and Sangha Kim. 2020. End-to-end simultaneous translation system for IWSLT2020 using modality agnostic meta-learning. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 62–68, Online. Association for Computational Linguistics.

Sathish Reddy Indurthi, Mohd Abbas Zaidi, Beomseok Lee, Nikhil Kumar Lakumarapu, and Sangha Kim. 2022. Infusing future information into monotonic attention through language models.

Shaojie Jiang, Pengjie Ren, Christof Monz, and Maarten de Rijke. 2019. Improving neural response diversity with frequency-aware cross-entropy loss. New York, NY, USA. Association for Computing Machinery.

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy. Association for Computational Linguistics.

Xutai Ma, Juan Miguel Pino, James Cross, Liezl Puzon, and Jiatao Gu. 2020. Monotonic multihead attention. In International Conference on Learning Representations.

Pooya Moradi, Nishant Kambhatla, and Anoop Sarkar. 2019. Interrogating the explanatory power of attention in neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 221–230, Hong Kong. Association for Computational Linguistics.

Robert Östling and Jörg Tiedemann. 2016. Efficient word alignment with Markov Chain Monte Carlo. Prague Bulletin of Mathematical Linguistics, 106:125–146.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Sara Papi, Marco Turchi, and Matteo Negri. 2023. Alignatt: Using attention-based audio-translation alignments as a guide for simultaneous speech translation.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, page 311–318, USA. Association for Computational Linguistics.
pages 511–516, Online. Association for Computational Linguistics.

Shaolei Zhang and Yang Feng. 2021a. ICT’s system for AutoSimTrans 2021: Robust char-level simultaneous translation. In Proceedings of the Second Workshop on Automatic Simultaneous Translation, pages 1–11, Online. Association for Computational Linguistics.

Shaolei Zhang and Yang Feng. 2021b. Universal simultaneous machine translation with mixture-of-experts wait-k policy. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7306–7317, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Shaolei Zhang and Yang Feng. 2022a. Gaussian multi-head attention for simultaneous machine translation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3019–3030, Dublin, Ireland. Association for Computational Linguistics.

Shaolei Zhang and Yang Feng. 2022b. Information-transport-based policy for simultaneous translation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 992–1013, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Shaolei Zhang and Yang Feng. 2022c. Modeling dual read/write paths for simultaneous machine translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2461–2477, Dublin, Ireland. Association for Computational Linguistics.

Shaolei Zhang and Yang Feng. 2022d. Reducing position bias in simultaneous machine translation with length-aware framework. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6775–6788, Dublin, Ireland. Association for Computational Linguistics.

Shaolei Zhang, Yang Feng, and Liangyou Li. 2020. Future-guided incremental transformer for simultaneous translation. In AAAI Conference on Artificial Intelligence.

Shaolei Zhang, Shoutao Guo, and Yang Feng. 2022a. Wait-info policy: Balancing source and target at information level for simultaneous machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2249–2263, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Songming Zhang, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, Jian Liu, and Jie Zhou. 2022b. Conditional bilingual mutual information based adaptive training for neural machine translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2377–2389, Dublin, Ireland. Association for Computational Linguistics.

Baigong Zheng, Kaibo Liu, Renjie Zheng, Mingbo Ma, Hairong Liu, and Liang Huang. 2020. Simultaneous translation policies: From fixed to adaptive. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2847–2853, Online. Association for Computational Linguistics.

Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. 2019. Simpler and faster learning of adaptive policies for simultaneous translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1349–1354, Hong Kong, China. Association for Computational Linguistics.
A Hyperparameters
Hyperparameter                 IWSLT’15 En ↔ Vi, IWSLT’14 De ↔ En
encoder layers 6
encoder attention heads 4
encoder embed dim 512
encoder ffn embed dim 1024
decoder layers 6
decoder attention heads 4
decoder embed dim 512
decoder ffn embed dim 1024
dropout 0.3
optimizer adam
adam-β (0.9,0.98)
clip-norm 0
lr 5e-4
lr scheduler inverse sqrt
warmup-updates 4000
warmup-init-lr 1e-7
weight decay 0.0001
label-smoothing 0.1
max tokens 16000
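
To make the learning-rate rows of this table concrete, the following is a minimal sketch of an inverse-square-root schedule with linear warmup, using the values lr 5e-4, warmup-updates 4000, and warmup-init-lr 1e-7 listed above. The function name `inverse_sqrt_lr` is illustrative, and the formula is an assumption about how such a schedule is commonly defined (linear warmup to the peak rate, then decay proportional to the inverse square root of the update number); it is not taken from the paper.

```python
import math


def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_updates=4000, warmup_init_lr=1e-7):
    """Learning rate at a given update number (assumed schedule shape)."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr towards peak_lr.
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # After warmup: decay proportional to 1 / sqrt(step).
    return peak_lr * math.sqrt(warmup_updates / step)


if __name__ == "__main__":
    for s in (0, 2000, 4000, 16000):
        print(s, inverse_sqrt_lr(s))
```

Under these assumptions the rate peaks at update 4000 and has halved by update 16000, since the decay factor is sqrt(4000/16000) = 0.5.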
B Detailed Results
IWSLT15 En-Vi Transformer-Small
AP AL DAL BLEU
Full-sentence MT
1.00 22.08 22.08 28.91
λ AP AL DAL BLEU
0.4 0.58 2.68 3.46 27.73
0.3 0.59 2.98 3.81 27.90
MMA 0.2 0.63 3.57 4.44 28.47
0.1 0.67 4.63 5.65 28.42
0.04 0.70 5.44 6.57 28.33
0.02 0.76 7.09 8.29 28.28
k AP AL DAL BLEU
1 0.63 3.03 3.54 25.21
3 0.71 4.80 5.42 27.65
Wait-K
5 0.78 6.46 7.06 28.34
7 0.83 8.21 8.79 28.60
9 0.88 9.92 10.51 28.69
k AP AL DAL BLEU
1 0.63 3.06 3.61 26.23
3 0.71 4.66 5.20 28.21
Efficient Wait-K
5 0.78 6.38 6.94 28.56
7 1.96 8.13 8.69 28.62
9 0.87 9.80 10.34 28.52
K AP AL DAL BLEU
1 0.67 3.76 4.33 28.37
2 0.69 4.10 4.71 28.45
3 0.71 4.60 5.28 28.54
Wait-Info 4 0.74 5.28 5.97 28.59
5 0.77 6.01 6.71 28.70
6 0.80 6.80 7.51 28.78
7 0.82 7.61 8.33 28.80
8 0.84 8.39 9.11 28.82
λ AP AL DAL BLEU
0.55 0.66 3.1 5.12 28.6
0.5 0.67 3.60 5.78 28.81
MMA+TC 0.3 0.68 3.86 6.12 28.9
0.2 0.71 4.58 7.22 28.74
0.1 0.74 5.34 8.18 28.65
0.01 0.89 9.89 14.37 28.67
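
The AL column in these tables follows Equation (11). As an illustration only, the sketch below computes AL for a wait-k policy. The helper names `wait_k_policy` and `average_lagging` are hypothetical, and we assume that τ is read as the first target step at which the whole source has been read; this is not the authors' evaluation code.

```python
def wait_k_policy(k, src_len, tgt_len):
    """g_i for wait-k: source tokens read before writing target token i (1-based)."""
    return [min(k + i - 1, src_len) for i in range(1, tgt_len + 1)]


def average_lagging(g, src_len, tgt_len):
    """Average Lagging as in Eq. (11) for a policy g (a list of g_1..g_|y|)."""
    gamma = tgt_len / src_len  # |y| / |x|
    # Assumption: tau is the first step at which the full source has been read;
    # fall back to the last step if that never happens.
    tau = next((i for i, gi in enumerate(g, start=1) if gi >= src_len), len(g))
    return sum(g[i - 1] - (i - 1) / gamma for i in range(1, tau + 1)) / tau


if __name__ == "__main__":
    g = wait_k_policy(3, src_len=10, tgt_len=10)
    print(g, average_lagging(g, 10, 10))
```

For equal source and target lengths, a wait-3 policy lags by exactly three tokens under this definition, and a full-sentence policy (every g_i equal to |x|) lags by the source length.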
IWSLT15 Vi - En Transformer-Small
Full-sentence MT AP AL DAL BLEU
(Offline) 1.00 27.56 27.56 26.11
λ AP AL DAL BLEU
0.4 0.63 3.60 6.96 25.36
0.3 0.64 3.95 7.59 24.75
MMA 0.2 0.67 4.54 9.09 25.33
0.1 0.75 7.14 11.60 25.84
0.05 0.77 7.61 15.70 25.31
0.01 0.88 13.63 23.95 26.11
k AP AL DAL BLEU
1 0.42 -2.89 1.62 7.57
3 0.53 -0.18 3.24 14.66
5 0.61 1.49 5.08 17.44
Wait-K
7 0.67 3.28 7.05 19.02
9 0.76 6.75 8.96 22.39
11 0.80 7.91 10.71 23.28
13 0.84 10.37 12.36 24.80
K AP AL DAL BLEU
4 0.62 2.58 5.06 22.45
5 0.67 4.08 6.27 23.75
6 0.72 5.61 7.72 25.19
Wait-Info
7 0.76 7.01 9.19 25.45
8 0.79 8.26 10.66 25.86
9 0.82 9.37 11.98 25.93
10 0.84 10.56 13.30 26.13
λ AP AL DAL BLEU
0.4 0.63 3.51 5.902 26.38
0.3 0.65 4.01 6.558 26.04
0.2 0.67 4.62 7.527 26.32
MMA+TC
0.1 0.71 5.67 9.212 26.63
0.05 0.76 7.23 10.579 26.52
0.04 0.77 7.55 11.76 26.85
0.01 0.89 13.31 18.627 26.67
IWSLT14 De-En Transformer-Small
Full-sentence MT AP AL DAL BLEU
(Offline) 1.00 22.97 22.97 33.64
λ AP AL DAL BLEU
0.4 0.67 3.91 6.36 30.8
MMA 0.3 0.69 4.27 6.84 31.12
0.2 0.72 4.97 7.82 31.34
0.1 0.77 6.08 9.47 31.95
K AP AL DAL BLEU
1 0.57 1.32 2.53 26.26
2 0.59 1.97 3.17 27.39
3 0.64 3.08 4.35 29.01
Wait-Info 4 0.69 4.27 5.61 30.36
5 0.739 5.30 6.84 30.92
6 0.77 6.26 8.03 31.45
7 0.80 7.17 9.09 31.82
8 0.82 8.06 9.94 32.05
k AL BLEU
3 1.8 26
Wait-K 5 4 28.6
7 6 29.7
9 8 31.5
k AL BLEU
3 2 26.4
Efficient Wait-K 5 4 27
7 6 30
9 8 31.7
λ AP AL DAL BLEU
0.5 0.66 3.68 5.92 30.97
0.4 0.68 4.06 6.51 31.33
MMA+TC
0.3 0.70 4.49 7.12 31.69
0.2 0.73 5.06 7.93 32.2
0.1 0.77 6.10 9.54 32.22
IWSLT14 En-De Transformer-Small
Full-sentence MT AP AL DAL BLEU
(Offline) 1.00 22.21 22.21 27.46
λ AP AL DAL BLEU
0.5 0.69 4.32 6.42 26.03
0.4 0.71 4.70 6.95 26.20
MMA 0.3 0.72 4.97 7.28 26.30
0.2 0.74 5.44 7.96 26.19
0.1 0.79 6.86 9.72 26.77
0.05 0.84 8.25 11.42 26.91
K AP AL DAL BLEU
1 0.61 2.62 3.09 21.75
2 0.63 3.15 3.89 22.42
3 0.68 4.24 5.30 24.48
Wait-Info 4 0.73 5.36 6.77 25.60
5 0.77 6.38 8.09 26.18
6 0.80 7.23 9.18 26.35
7 0.83 8.23 10.35 26.61
8 0.86 9.25 11.46 26.74
k AL BLEU
3 3.41 22.00
Wait-K 5 5.00 25.21
7 6.83 26.32
9 8.72 26.61
k AL BLEU
3 3.51 23.01
Efficient Wait-K 5 5.27 24.80
7 7.03 25.93
9 8.81 26.11
λ AP AL DAL BLEU
0.6 0.68 4.04 6.07 26.03
0.5 0.69 4.19 6.25 26.19
0.4 0.69 4.38 6.52 26.43
MMA+TC
0.3 0.71 4.87 7.14 26.56
0.2 0.74 5.51 8.09 26.71
0.1 0.79 6.74 9.80 26.76
0.06 0.82 7.75 10.94 27.01
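
Beyond the latency figures tabulated in this appendix, the analysis scores READ/WRITE paths with the sufficiency metric of Equation (12). The sketch below shows one way to compute it from an action string such as the ones printed in the analysis. The function names are hypothetical, and the alignment in the usage example is invented for illustration; it is not an Eflomal output.

```python
def reads_per_write(path):
    """Convert a READ/WRITE string such as 'RRR W RR WW' into r_j: the number
    of source tokens read so far each time a target token is written."""
    reads, r = 0, []
    for action in path.replace(" ", ""):
        if action == "R":
            reads += 1
        elif action == "W":
            r.append(reads)
        else:
            raise ValueError(f"unexpected action: {action!r}")
    return r


def sufficiency(aligned_pos, r):
    """Eq. (12): fraction of target tokens whose aligned source position a_j
    (1-based) has already been read when the token is written."""
    if len(aligned_pos) != len(r):
        raise ValueError("need one aligned position per written token")
    return sum(a <= rj for a, rj in zip(aligned_pos, r)) / len(r)


if __name__ == "__main__":
    # Path labelled "Ours" in the analysis example.
    ours = reads_per_write("RRR W RRR WW RR W R WW RRR W R W R WWWW")
    print(ours)
    # Toy path with a hypothetical alignment (assumption, for illustration).
    print(sufficiency([1, 3, 2], reads_per_write("R W R W W")))
```

In the toy case the second target token is written after two reads although its aligned source word is the third, so two of three writes are sufficient.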
Kyoto Speech-to-Speech Translation System for IWSLT 2023
Zhengdong Yang1 Shuichiro Shimizu1 Zhou Wangjin1 Sheng Li2 Chenhui Chu1
Kyoto University1 National Institute of Information and Communications Technology2
{zd-yang, sshimizu, chu}@nlp.ist.i.kyoto-u.ac.jp
zhou@sap.ist.i.kyoto-u.ac.jp
sheng.li@nict.go.jp
(2020). For the text-to-speech synthesis model, we took a cascade approach of an acoustic model and a vocoder. We used FastSpeech 2 (Ren et al., 2021) as the acoustic model and HiFi-GAN (Kong et al., 2020) as the vocoder.

2 System Description

The speech-to-speech translation system is a combination of speech-to-text translation and text-to-speech synthesis.

2.1 Speech-to-Text Translation

We adopt the end-to-end speech-to-text translation architecture. The speech-to-text translation model is based on the dual-decoder Transformer (Le et al., 2020).

As shown in Figure 1, the model is a Transformer-based model comprising two decoders, one for speech-to-text translation (ST) and the other for automatic speech recognition (ASR). The task of ASR and ST can be defined as follows:

• For ASR, the input sequence $s = [s_1, ..., s_{T_s}]$ is a sequence of speech features. The out-

The training objective is a weighted sum of cross-entropy losses for both tasks:

$$L_{asr\text{-}st} = \alpha L_{asr} + (1 - \alpha) L_{st} \qquad (2)$$

Different decoders can exchange information with each other through the interactive attention mechanism, which refers to replacing attention sub-layers in the standard Transformer decoder with interactive attention sub-layers (Liu et al., 2020). In our models, the replaced sub-layers are the encoder-decoder attention sub-layers.

As illustrated in the lower part of Figure 1, an interactive attention sub-layer consists of one main attention sub-layer and a cross-attention sub-layer. The main attention sub-layer is the same as the replaced attention sub-layer. The cross-attention sub-layer receives query Q from the same decoder A and receives key K and value V from another decoder B. We adopt the parallel variation of dual-decoder Transformers, where K and V are hidden states from the same layer in decoder B.

The final output is obtained by merging the output of the primary attention sub-layer $H_{main}$ with
357
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 357–362
July 13-14, 2023 c 2023 Association for Computational Linguistics
the output of the cross attention sub-layer $H_{cross}$. We adopt linear interpolation as the merging function. Therefore the output representations of the interactive attention sub-layers are

$$H_{dual} = H_{main} + \lambda H_{cross} \qquad (3)$$

where $\lambda$ is a learnable parameter.

3 Experiments

Dataset    Sentence Embedding Model Used for Filtering    Total Length (Hours)
MuST-C     None                                            600.2
GigaST     None                                            9873.2
GigaST     LASER                                           919.1
GigaST     Sentence Transformers                           601.1

Table 1: The size of the datasets and the filtered versions used for training the ST system.
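
As a rough illustration of how the filtered GigaST versions in Table 1 might be produced, the sketch below keeps sentence pairs whose source-target cosine similarity is at or above a percentile threshold. It is a sketch under stated assumptions: the function names are hypothetical, the embeddings are plain Python lists standing in for LASER or Sentence Transformers vectors, and keeping the pairs above the 90th percentile is our reading of the threshold, not a confirmed detail of the system.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, with q in [0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def filter_by_similarity(src_embs, tgt_embs, q=90):
    """Indices of pairs whose similarity is at or above the q-th percentile
    (assumption: high-similarity pairs are the ones kept)."""
    sims = [cosine_similarity(s, t) for s, t in zip(src_embs, tgt_embs)]
    threshold = percentile(sims, q)
    return [i for i, sim in enumerate(sims) if sim >= threshold]


if __name__ == "__main__":
    src = [[1.0, 0.0]] * 4
    tgt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    print(filter_by_similarity(src, tgt, q=75))
```

Because this selects by pair count, the retained audio duration depends on which utterances survive, which may differ between embedding models.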
[Figure 2 (histograms omitted): two panels, LASER and Sentence Transformers, each showing sentence-pair counts (y-axis) against Similarity (x-axis, 0.0 to 1.0).]
Figure 2: Histograms of cosine similarity between source and target sentence embedding based on LASER and Sentence Transformers. The red line marks the 90th percentile.

3.1.2 Training and Decoding

English sentences were normalized and tokenized using the Moses tokenizer (Koehn et al., 2007), and punctuation was stripped. Chinese sentences were tokenized using jieba (see footnote 3). English and Chinese tokens were further split into subwords using the BPE method (Sennrich et al., 2016) with a joint vocabulary of 16,000 subwords.

We used Kaldi (Ravanelli et al., 2019) to extract 83-dimensional features normalized by the mean and standard deviation computed on the training set. We removed utterances with more than 6,000 frames or more than 400 characters and used speed perturbation (Inaguma et al., 2020) with factors of 0.9, 1.0, and 1.1 for data augmentation.

Our implementation was based on the ESPnet-ST toolkit (Inaguma et al., 2020). We used the same architecture for all the ST models, with a 12-layer encoder and 8-layer decoders. The coefficient α in the loss function (Equation 2) was set to 0.3 in all the experiments. We used the Adam optimizer (Kingma and Ba, 2015) and the Noam learning rate schedule (Vaswani et al., 2017) with 25,000 warm-up steps and a maximum learning rate of 2.5e-3. We used a batch size of 48 per GPU and trained models on a single machine with 4 Tesla V100 GPUs. The models were trained for 25 epochs. We kept checkpoints after each epoch and averaged the five best models on the development set based on prediction accuracy. For decoding, the beam size was set to 5 for ST and 1 for ASR.

3.1.3 Results

We conducted experiments to investigate the impact of using different datasets for training the system. The results are presented in Table 2. Additionally, we evaluated the performance of the system when using different sentence embedding models for data filtering. Our findings reveal that LASER produces better results compared to Sentence Transformers. Notably, after filtering the data using LASER, the total number of hours of audio is higher compared to that obtained using Sentence Transformers. Given this observation, it might be more appropriate to perform filtering based on the length of the audio rather than the number of utterances.

Our experiments also revealed that training the model with GigaST alone yielded better results
compared to using only the MuST-C dataset. Fur-
3
https://github.com/fxsjy/jieba
Training Data BLEU In the future, we will try to perform multi-level pre-
MuST-C 9.71 training based on transforming SpeechUT (Zhang
GigaST (LASER) 13.96 et al., 2022b) with phonemes as unit. We will also
GigaST (Sentence Transformers) 11.57
MuST-C → GigaST (LASER) 13.52
try to use Encodec-based speech synthesis method
GigaST (LASER) → MuST-C 13.30 similar to VALL-EX (Zhang et al., 2023) to in-
crease the accurate representation of emotions and
Table 2: Experimental results on training with different vocal patterns.
datasets. “→” indicates training with the dataset on the
left and use the best checkpoint to initiate the training References
with the dataset on the right.
Milind Agarwal, Sweta Agrawal, Antonios Anas-
tasopoulos, Ondřej Bojar, Claudia Borg, Ma-
thermore, we evaluated an approach in which we rine Carpuat, Roldano Cattoni, Mauro Cettolo,
trained the model with one dataset and use the best Mingda Chen, William Chen, Khalid Choukri,
checkpoint to initiate the training with the other Alexandra Chronopoulou, Anna Currey, Thierry
dataset. However, we observed that this approach Declerck, Qianqian Dong, Yannick Estève,
did not yield any improvement compared to train- Kevin Duh, Marcello Federico, Souhir Gahbiche,
ing the model with GigaST alone. Barry Haddow, Benjamin Hsu, Phu Mon Htut,
Based on these findings, we adopted the transla- Hirofumi Inaguma, Dávid Javorský, John Judge,
tion generated by the ST system trained solely on Yasumasa Kano, Tom Ko, Rishu Kumar, Peng-
GigaST filtered based on LASER for our submis- wei Li, Xutail Ma, Prashant Mathur, Evgeny
sion. Matusov, Paul McNamee, John P. McCrae, Ken-
ton Murray, Maria Nadejde, Satoshi Nakamura,
3.2 Text-to-Speech Synthesis
Matteo Negri, Ha Nguyen, Jan Niehues, Xing
We used pretrained models provided by Zhang Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal,
et al. (2022a) trained on the AISHELL-3 dataset Juan Pino, Lonneke van der Plas, Peter Polák,
(Shi et al., 2021). The PaddleSpeech toolkit pro- Elijah Rippeth, Elizabeth Salesky, Jiatong Shi,
vides several models trained with the AISHELL-3 Matthias Sperber, Sebastian Stüker, Katsuhito
dataset, including FastSpeech 2 and HiFi-GAN. Sudoh, Yun Tang, Brian Thompson, Kevin Tran,
We used the best-performing model combination in Marco Turchi, Alex Waibel, Mingxuan Wang,
terms of MOS reported in (Zhang et al., 2022a). Shinji Watanabe, and Rodolfo Zevallos. 2023.
For other configurations, such as grapheme-to- Findings of the IWSLT 2023 Evaluation Cam-
phoneme conversion, we followed Zhang et al. paign. In Proceedings of the 20th International
(2022a). Conference on Spoken Language Translation
The generated audio files have one channel, a (IWSLT 2023). Association for Computational
sample width of 16 bit, and a frame rate of 24, 000. Linguistics.
Because the predictions of speech-to-text transla-
tion sometimes contained English words that were Mattia A. Di Gangi, Roldano Cattoni, Luisa Ben-
preprocessed to empty strings by the grapheme-to- tivogli, Matteo Negri, and Marco Turchi. 2019.
phoneme conversion, some (less than 1 % of the MuST-C: a Multilingual Speech Translation Cor-
test set) audio files could not be generated. pus. In Proceedings of the 2019 Conference
of the North American Chapter of the Associ-
4 Conclusion ation for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and
In this paper, we described our system, which is a Short Papers), pages 2012–2017, Minneapolis,
combination of speech-to-text translation and text- Minnesota. Association for Computational Lin-
to-speech synthesis. For speech-to-text translation, guistics.
we trained the Dual-decoder Transformer model
with the GigaST dataset filtered based on the simi- Hirofumi Inaguma, Shun Kiyono, Kevin Duh,
larity of multilingual sentence embeddings. For the Shigeki Karita, Nelson Yalta, Tomoki Hayashi,
text-to-speech synthesis model, we took a cascade and Shinji Watanabe. 2020. Espnet-st: All-in-
approach of an acoustic model and a vocoder and one speech translation toolkit. In Proceedings of
used a combination of FastSpeech 2 and HiFi-GAN. the 58th Annual Meeting of the Association for
Computational Linguistics: System Demonstrations, ACL 2020, pages 302–311. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, Conference Track Proceedings.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. The Association for Computational Linguistics.

Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA. Curran Associates Inc.

Hang Le, Juan Pino, Changhan Wang, Jiatao Gu, Didier Schwab, and Laurent Besacier. 2020. Dual-decoder transformer for joint automatic speech recognition and multilingual speech translation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3520–3533, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Yuchen Liu, Jiajun Zhang, Hao Xiong, Long Zhou, Zhongjun He, Hua Wu, Haifeng Wang, and Chengqing Zong. 2020. Synchronous speech recognition and speech-to-text translation with interactive decoding. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, pages 8417–8424. AAAI Press.

Mirco Ravanelli, Titouan Parcollet, and Yoshua Bengio. 2019. The pytorch-kaldi speech recognition toolkit. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, Brighton, United Kingdom, May 12-17, 2019, pages 6465–6469. IEEE.

Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2021. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. In International Conference on Learning Representations.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, Volume 1: Long Papers. The Association for Computer Linguistics.

Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, and Ming Li. 2021. AISHELL-3: A Multi-Speaker Mandarin TTS Corpus. In Proc. Interspeech 2021, pages 2756–2760.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pages 5998–6008.

Rong Ye, Chengqi Zhao, Tom Ko, Chutong Meng, Tao Wang, Mingxuan Wang, and Jun Cao. 2022. GigaST: A 10,000-hour Pseudo Speech Translation Corpus.

Hui Zhang, Tian Yuan, Junkun Chen, Xintong Li, Renjie Zheng, Yuxin Huang, Xiaojie Chen, Enlei Gong, Zeyu Chen, Xiaoguang Hu, Dianhai Yu, Yanjun Ma, and Liang Huang. 2022a. PaddleSpeech: An easy-to-use all-in-one speech toolkit. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations, pages 114–123, Hybrid: Seattle, Washington + Online. Association for Computational Linguistics.

Ziqiang Zhang, Long Zhou, Junyi Ao, Shujie Liu, Lirong Dai, Jinyu Li, and Furu Wei. 2022b. Speechut: Bridging speech and text with hidden-unit for encoder-decoder based speech-text pre-training. arXiv preprint arXiv:2210.03730.
Ziqiang Zhang, Long Zhou, Chengyi Wang,
Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen,
Yanqing Liu, Huaming Wang, Jinyu Li, et al.
2023. Speak foreign languages with your own
voice: Cross-lingual neural codec language mod-
eling. arXiv preprint arXiv:2303.03926.
Tagged End-to-End Simultaneous Speech Translation Training
using Simultaneous Interpretation Data
Source
And (1)I’m (2)not here to (3)say that (4)men are to (5)blame for the (6)crisis and what (7)happened in my (8)country.
SI Target
(4)男性の、(5)せいだけでは(2)ありません、私どもの(8)国の、金融(6)崩壊の、(5)責任は、
have been some attempts for the development of SI corpora (Toyama et al., 2004; Shimizu et al., 2013; Doi et al., 2021). However, the amount of such SI corpora is still very limited compared to offline translations. We tackle this problem by using a larger-scale offline translation corpus. This condition can be seen as domain adaptation from resource-rich offline translation to resource-poor simultaneous translation. In a typical domain adaptation scenario, an out-of-domain model is fine-tuned using in-domain data (Luong and Manning, 2015; Sennrich et al., 2016), but it tends to overfit to the small in-domain data (Chu et al., 2017). As another adaptation approach, tag-based NMT works to control the politeness of translations (Sennrich et al., 2016) and to enable zero-shot multilingual NMT (Johnson et al., 2017). This tag-based approach has been extended to multi-domain fine-tuning (Kobus et al., 2017) and mixed fine-tuning (Chu et al., 2017). These studies fine-tune NMT models using mixed data of in-domain and out-of-domain corpora. Tagged Back-Translation (Caswell et al., 2019) is an application of the tag-based approach to well-known back-translation-based data augmentation. It distinguishes source language sentences from parallel corpora and those obtained from back-translation to handle possible back-translation noise in the training of an NMT model. Our work is motivated by these tag-based methods and tackles the scarcity of SI data.

3 Differences between Offline Translation and Simultaneous Interpretation

There is a large style difference between SI and offline translation. Figure 1 shows an example of offline translation and SI transcript in Japanese for a given English source sentence. The solid lines in the figure represent word correspondences. In this figure, we can find:

• Most English content words are translated into Japanese in the offline translation, while some are missing in the SI transcript.

• The SI tries to translate the former half of the input earlier than the latter half with some unnaturalness, while the offline translation keeps naturalness in Japanese with long-distance reordering from the input English.

These points suggest important differences between offline translation and SI; SI focuses on the simultaneity of the interpretation to deliver the contents as early as possible and to maintain the interpreter’s working memory. The word order difference between English and Japanese poses a serious difficulty in SI, as mentioned in the literature (Mizuno, 2017). Thus, it is important to use SI data to train a SimulST model to improve its simultaneity.

4 Proposed Method

Although training a SimulST model using SI data is necessary, we suffer from data scarcity in practice. We propose a method to use a relatively large offline translation corpus to mitigate the SI data scarcity for training a SimulMT model. Following the tag-based NMT studies, we put a style tag at the beginning of the target string in training and predict a specified tag forcibly at the first step in inference. In this work, we use two tags: <si> for SI and <off> for offline translation.

Suppose we have an SI transcript: 私は、買った。 ペンを、 for an English input: I bought a
pen. as a training example. We put the SI-style tag at the beginning of the SI transcript as follows:

<si>私は、買った。ペンを、

This string is tokenized into subwords^1:

_< si > 私 は 、 買っ た 。 ペ ン を 、

Here, we assume we have a pre-trained sequence-to-sequence model such as mBART (Liu et al., 2020b; Tang et al., 2021) as a basis of the SimulST model, as described later in the next section. The aforementioned style tags may not be included in the subword vocabulary of the pre-trained model and are tokenized further like “_< si >”, but it works in practice.

5 Experimental Setup

5.1 Dataset

We used MuST-C (Di Gangi et al., 2019) v2 English-Japanese data as our offline speech translation corpus. We also prepared development and test sets from our in-house Japanese SI recordings on TED Talks that are not included in the training sets above. As for the SI data for training, we used NAIST-SIC-Aligned (Zhao et al., 2023). This SI data is constructed by applying heuristic sentence alignment to extract parallel sentence pairs using the latest version of NAIST-SIC^2 (Doi et al., 2021). From NAIST-SIC-Aligned, we selected INTRA, AUTO-DEV and AUTO-TEST as train, dev and test data, respectively. For all the SI sets, we aligned the English text segments with the corresponding audio tracks in MuST-C using an English forced-aligner Gentle^3. Here, we excluded segments not aligned with the source speech from the aligned dataset. Table 1 shows the size of the offline and SI data.

               Offline                    SI
         #segm.     #En words     #segm.    #En words
train    328,639    5,714,360     65,008    1,120,245
dev        1,369       23,059        165        2,804
test       2,841       46,144        511        8,104

Table 1: Data sizes of offline data and SI data in the number of aligned segments.

1 “_” is the meta-character representing white spaces in an original string by SentencePiece (Kudo and Richardson, 2018), and “ ” represents a white space in a tokenized string.
2 https://dsc-nlp.naist.jp/data/NAIST-SIC/2022
3 https://github.com/lowerquality/gentle

5.2 Simultaneous Speech Translation

We used our SimulST implementation based on fairseq (Ott et al., 2019). It followed the system architecture of the best-scored system in the IWSLT 2022 evaluation campaign (Polák et al., 2022), which used an offline ST model in the online simultaneous decoding based on Local Agreement (LA) (Liu et al., 2020a)^4.

4 We also tried wait-k (Ma et al., 2019), but LA worked better than wait-k in our pilot test.

5.2.1 Offline ST Model

We built the initial offline ST model by connecting two pre-trained models. Firstly, we used HuBERT Large as the encoder, which consists of a feature extractor trained on 60k hours of unlabeled speech data Libri-Light (Kahn et al., 2020) and a transformer encoder layer. The feature extractor is a 7-layer convolutional layer with a kernel size of (10,3,3,3,3,2,2), a stride of (5,2,2,2,2,2,2), and 512 channels, while the transformer encoder layer consists of 24 layers. Next, we used the decoder portion of mBART50, an encoder-decoder model pre-trained with 50 language pairs, as the decoder. The decoder consists of 12 layers of transformer decoders, and the embedding layer and linear projection weights are shared, with a size of 250,000. The dimension of each layer of the transformer encoder and decoder is 1024, the dimension of the feed forward network is 4096, the number of multi-heads is 16, the activation function is the ReLU function, and the normalization method is pre-layer normalization (Baevski and Auli, 2019). These two models are connected by an Inter-connection (Nishikawa and Nakamura, 2023) that weights each transformer layer of the encoder and integrates the output tensors of each layer in a weighted sum, and a length adapter (Tsiamas et al., 2022). The length adapter is a 3-layer convolutional network with 1024 channels, the stride of 2, and the activation function of GELU.

The inputs are waveforms with a 16-kHz sampling rate that are normalized to zero mean and unit variance. During training, each source audio is augmented (Kharitonov et al., 2020) with a probability of 0.8. We train the model on MuST-C (Di Gangi et al., 2019), CoVoST-2 (Wang et al., 2020), Europarl-ST (Iranzo-Sánchez et al., 2020), and TED-LIUM (Rousseau et al., 2012). We use gradient accumulation and data parallelism to achieve a batch size of approximately 32 million
tokens. We use Adam with β1 = 0.99, β2 = 0.98, and a base learning rate of 2.5 × 10^-4. The learning rate is controlled by a tri-stage scheduler with phases of 0.15, 0.15, and 0.70 for warm-up, hold, and decay, respectively, while the initial and final learning rates have a scale of 0.01 compared to base. We use sentence averaging and gradient clipping of 20. We apply a dropout of 0.1 before every non-frozen layer and use time masking for 10-length spans with a probability of 0.2, and channel masking for 20-length spans with a probability of 0.1 in the encoder feature extractor’s output. The loss is the cross-entropy loss with label smoothing of 0.2. We call this trained model the base model.

The base model was fine-tuned using the offline training and development sets (Table 1). During fine-tuning, we set the learning rate to 2.5 × 10^-5, saved models every 1,000 updates, and adopted checkpoint averaging over the five best checkpoints according to the loss on the development set. We call this fine-tuned model the base+O model. For the base and base+O models, we use the models from the NAIST IWSLT 2023 simultaneous speech-to-speech system for the Simultaneous Speech Translation task (Fukuda et al., 2023). We further fine-tune the base+O model using the SI data in the same manner to derive the base+O+S model. Here, following Tsiamas et al. (2022), to avoid overfitting the small SI data, the parameters of the following components were kept fixed: the feature extractor and feedforward layers of the encoder and the embedding, self-attention, and feedforward layers of the decoder.

5.2.2 Fine-tuning using Prefix Alignment

For further fine-tuning toward SimulST, we extracted prefix-to-prefix translation pairs from the available training sets using Prefix Alignment (PA) (Kano et al., 2022). PA uses an offline translation model to find prefix-to-prefix translation pairs that can be obtained as intermediate translation results using a given offline translation model. Finally, we fine-tuned the base+O model using the prefix pairs.

5.2.3 Compared Methods

We compared the following conditions on the final fine-tuning data:

Offline FT: Fine-tuned using the prefix pairs from the offline data (baseline in offline).

SI FT: Fine-tuned using the prefix pairs from the SI data (baseline in SI).

Mixed FT: Fine-tuned using prefix pairs from both of the offline and SI data (baseline in mixed).

Mixed FT + Style: Fine-tuned using prefix pairs from both of the offline and SI data with the style tags (proposed method).

Mixed FT + Style + Up: The SI portions were upsampled in Mixed FT + Style to balance the data size between the offline and SI data (proposed method).

Here, the prefix pairs from the offline data were obtained using the base+O model, and those from the SI data were obtained using the base+O+S model. The hyperparameter settings for the fine-tuning were the same as those for the base+O model.

(BLEURT)                   SI       Offline
Offline FT                 0.386    0.518
SI FT                      0.359    0.347
Mixed FT                   0.393    0.483
Mixed FT + Style           0.445    0.522
Mixed FT + Style + Up      0.443    0.516

Table 2: BLEURT in full-sentence offline ST on SI and offline test sets.

(BLEU)                     SI       Offline
Offline FT                  7.8     16.0
SI FT                      10.9      6.3
Mixed FT                    9.4     13.3
Mixed FT + Style           10.3     15.4
Mixed FT + Style + Up      12.2     14.2

Table 3: BLEU in full-sentence offline ST on SI and offline test sets.

5.3 Evaluation Metrics

We evaluated the SimulST systems using SimulEval^5 (Ma et al., 2020a). The unit length of speech segments was set to {200, 400, 600, 800, 1,000} milliseconds^6. For the SimulST systems, translation quality was evaluated in BLEURT (Sellam et al., 2020) and BLEU (Papineni et al., 2002)^7.

5 https://github.com/facebookresearch/SimulEval
6 We also evaluated SI FT on the SI test set with 120 and 160 ms speech segments to investigate its performance in low latency ranges.
7 BLEU was calculated using SacreBLEU (Post, 2018).
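The tri-stage learning-rate schedule described in Section 5.2.1 can be illustrated with a short sketch. The paper states only the phase ratios (0.15, 0.15, 0.70), the base learning rate (2.5 × 10^-4), and the 0.01 scale of the initial and final learning rates. The linear warm-up and exponential decay shapes are assumptions here, and the function name `tri_stage_lr` and the total number of updates used in the example are hypothetical.

```python
import math


def tri_stage_lr(step, total_steps, base_lr=2.5e-4,
                 phase_ratio=(0.15, 0.15, 0.70),
                 init_scale=0.01, final_scale=0.01):
    """Learning rate at update `step` under a three-phase schedule.

    Assumed shapes: linear warm-up from init_scale * base_lr to base_lr,
    constant hold at base_lr, exponential decay to final_scale * base_lr.
    """
    warmup_steps = round(total_steps * phase_ratio[0])
    hold_steps = round(total_steps * phase_ratio[1])
    decay_steps = round(total_steps * phase_ratio[2])
    init_lr = init_scale * base_lr
    final_lr = final_scale * base_lr

    if step < warmup_steps:
        # Phase 1: linear warm-up.
        return init_lr + (base_lr - init_lr) * step / warmup_steps
    step -= warmup_steps
    if step < hold_steps:
        # Phase 2: hold at the base learning rate.
        return base_lr
    step -= hold_steps
    if step <= decay_steps:
        # Phase 3: exponential decay that reaches final_lr at the last step.
        decay_factor = -math.log(final_scale) / decay_steps
        return base_lr * math.exp(-decay_factor * step)
    return final_lr


# Hypothetical run of 10,000 updates: 1,500 warm-up, 1,500 hold, 7,000 decay.
schedule = [tri_stage_lr(s, 10_000) for s in (0, 750, 1_500, 3_000, 10_000)]
```

Under these assumptions the rate starts at 1% of the base value, reaches the base value after the first 15% of updates, stays there for the next 15%, and returns to 1% of the base value at the end of training.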
[Figures 2 and 3 (plots not recoverable from the extraction): each has panels (a) BLEURT and (b) BLEU plotted against ATD for Offline FT, SI FT, Mixed FT, Mixed FT + Style, and Mixed FT + Style + Up. The first plot group is titled "SI test".]
The latency in SimulST was evaluated in Average Token Delay (ATD) (Kano et al., 2023) implemented in SimulEval. Even though Average Lagging (AL) (Ma et al., 2019) is the most popular latency metric, it sometimes resulted in negative values, as suggested by Kano et al. (2023). Thus, we present the results using ATD and include the AL results in Appendix A.

6 Results

6.1 Offline Translation Results

Tables 2 and 3 show the offline translation results in BLEURT and BLEU for the SI and offline test sets. These results show that our proposed Mixed FT + Style and Mixed FT + Style + Up surpassed the baselines in BLEURT for the SI test set. On the offline test set (MuST-C tst-COMMON), the performance of the proposed models was almost the same as Offline FT. This suggests that our proposed method leads to outputs semantically closer to the SI references than the baselines. In contrast, the SI FT baseline surpassed Mixed FT + Style in BLEU. The result shows that the upsampling worked for BLEU improvement for the SI test set in the offline translation condition.

6.2 Simultaneous Translation Results

Figure 2 shows SimulST results in BLEURT and BLEU for the SI test set. In Figure 2a, the proposed method with the style tags showed clearly better BLEURT results than the baselines. The upsampling did not bring clear differences, the same as the findings on the offline translation results shown in Table 2. In contrast, Figure 2b shows SI FT worked the best in almost all latency ranges, while the proposed method outperformed the other two baselines (Offline and Mixed).

Figure 3 shows SimulST results for the offline test set. They reflect the difference in reference translations between the SI and offline test sets. The Offline FT baseline worked well in BLEURT and outperformed the proposed method in BLEU. The other baselines resulted in worse BLEURT and BLEU scores than the proposed method.
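The data preparation behind the Mixed FT + Style and Mixed FT + Style + Up conditions compared above (Sections 4 and 5.2.3) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the helper names `add_style_tag` and `build_mixed_data` are hypothetical, the offline translation in the example is made up for illustration, and upsampling by repeating the SI pairs a whole number of times is an assumption, since the text does not state how the upsampling was done.

```python
SI_TAG = "<si>"        # tag for simultaneous-interpretation style targets
OFFLINE_TAG = "<off>"  # tag for offline-translation style targets


def add_style_tag(target, tag):
    """Put the style tag at the beginning of the target string."""
    return tag + target


def build_mixed_data(offline_pairs, si_pairs, upsample=False):
    """Mix offline and SI (source, target) pairs, tagging each target.

    With upsample=True the SI pairs are repeated so that their count
    roughly matches the offline count (assumed strategy).
    """
    tagged_offline = [(src, add_style_tag(tgt, OFFLINE_TAG))
                      for src, tgt in offline_pairs]
    tagged_si = [(src, add_style_tag(tgt, SI_TAG))
                 for src, tgt in si_pairs]
    if upsample and tagged_si:
        repeat = max(1, round(len(tagged_offline) / len(tagged_si)))
        tagged_si = tagged_si * repeat
    return tagged_offline + tagged_si


# Toy data: the SI transcript is the example from Section 4; the offline
# translation is a hypothetical stand-in.
offline = [("I bought a pen.", "私はペンを買った。")] * 5
si = [("I bought a pen.", "私は、買った。ペンを、")]
mixed = build_mixed_data(offline, si, upsample=True)
```

With the training sizes in Table 1 (328,639 offline and 65,008 SI segments), this repetition rule would use the SI data five times. At inference, the chosen tag is forced as the first target output; that step depends on the decoder implementation and is not shown here.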
[Plots not recoverable from the extraction: BERTScore F1, BERTScore Precision, and BERTScore Recall against ATD on the SI test set for Offline FT, SI FT, Mixed FT, Mixed FT + Style, and Mixed FT + Style + Up, followed by a Length Ratio plot.]
7 Discussions

[...] the best in BERTScore recall, and the recall curves look similar to BLEURT curves shown in Figure 2a. On the other hand, the SI FT baseline worked the best in BERTScore precision, and the precision curves look very similar to the BLEU curves shown in Figure 2b. We conducted further analyses below to investigate the mixed results in different quality metrics.

Figure 6: The length differences between hypotheses and references in SI FT and Mixed FT + Style (speech segment size is 600ms) on SI test set.
7.2 Length Differences

First, we focus on the length differences between translation outputs and references. Figure 5 shows the length ratios of translation results and their references. The proposed method resulted in longer outputs than the baselines, and the SI FT baseline preferred shorter output than the others and references. From the viewpoint of the precision of the translation results, outputs longer than their references are unfavorable. Figure 6 shows the histogram of length differences between SI FT and Mixed FT + Style. They showed different distributions; this suggests that SI FT suffered from under-translation, and the proposed method suffered from over-translation.

Table 4 shows the translation examples by SI FT and Mixed FT + Style. Here, SI FT generates very short outputs compared with Mixed FT + Style; BLEU is not always good due to the brevity penalty, but SI FT would have an advantage in BERTScore precision.

7.3 Non-speech Sound Events and Repetitions

Next, we investigated the over-translation suggested in the analyses above.

We observed serious repetitions by the proposed method, such as (拍手) (拍手) ..., which means (Applause). Non-speech sound events of this kind (applause and laughter) are found many times in
Source:
  TEMPT was one of the foremost graffiti artists in the 80s.
  There’s no hospital that can say “No.”
  Anybody who’s paralyzed now has access to actually draw or communicate using only their eyes.

SI FT (Baseline):
  テンプトは、グラフィティアーティストの (TEMPT was, graffiti artists’)
  病院は、 (a hospital)
  麻痺した人達は、 (paralyzed people)

Mixed FT + Style (Proposed):
  テンプトは、グラフィティアーティストの一人です。 (TEMPT is one of graffiti artists.)
  病院では「いいえ」は言えません。 (In a hospital, we cannot say “No.”)
  麻痺した人なら誰でも、絵を描いたり、会話をすることができます。 (Anybody who is paralyzed can draw a picture and have a talk.)

SI reference:
  八十年代の素晴らしいグラフィックアーティストでした。 ((He) was a great graphic artist in the 80s.)
  病院も、ノーとは言えない。 (There’s no hospital that can say “No.”)
  麻痺してる人達は、これを全員使うことが出来るようになっています。 (Everybody who is paralyzed can use this.)

Offline reference:
  80年代を代表するグラフィティ・アーティストでした
  病院もダメと言えません
  全身麻痺の人誰もが目だけで絵を描いたりコミュニケーションできます

Table 4: Example sentences in SI FT and Mixed FT + Style (speech segment size: 600ms) on SI test set.
TED Talks, but they are not translated by interpreters and are excluded from the SI data. Based on this assumption, we eliminated typical repetitions as follows and then conducted the evaluation.

• Removing tokens if they are surrounded by "()" and "<>". (If the tokens include parts of "(拍手)" like "拍手)" or "(", they were also excluded.)

• Stopping output generation when at least one kind of 3-gram has appeared at least 3 times in the steps before reaching the end of the sentence.

We applied this repetition removal to the results of Mixed FT + Style and SI FT; they are labeled as Mixed FT + Style + Rmrep and SI FT + Rmrep, respectively. Figure 7 shows BLEU and length ratio results before and after the repetition removal. BLEU increased consistently for the proposed method, while almost no changes were observed for the SI FT baseline except for one sample at ATD=200. This suggests the existence of many repetitions in the translation results by the proposed method. We also investigated BLEURT and BERTScore, as shown in Figure 8. The repetition removal made almost no changes in BLEURT, probably due to the semantic-oriented evaluation strategy of BLEURT. BERTScore Precision and F1 of the proposed method increased in the middle latency ranges, while they decreased almost consistently for the SI FT baseline. These findings suggest an over-translation problem with the proposed method, but it made little impact on semantic-oriented automatic evaluation results.

8 Conclusion

In this paper, we proposed an effective method to train a SimulST model using mixed data of SI- and offline-style translations with style tags to tell the model to generate outputs in either style, motivated by the tag-based approach to domain adaptation. Experiment results on English-to-Japanese SimulST demonstrated the advantage of the proposed method in BLEURT and BERTScore recall despite the inferior performance in BLEU and BERTScore precision due to over-translations and repetitions. Future work includes an extension to other language pairs and further verification via human evaluation.

9 Limitation

The scores reported in the SI test were lower than those in the offline test. Reporting results on other SI data would help confirm the effectiveness of our method. To our knowledge, this is the first work to use SI data as speech translation data. No SI data with aligned source speech and target text are available for language pairs other than English-Japanese.

Acknowledgement

Part of this work was supported by JSPS KAKENHI Grant Number JP21H05054 and JST SPRING Grant Number JPMJSP2140.
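The two repetition-removal rules described in Section 7.3 can be sketched on a token list as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `drop_bracketed_tokens` and `truncate_at_repetition` are hypothetical, brackets are assumed to be the ASCII characters shown in the text, and cutting the output just before the third occurrence of a repeated 3-gram is one possible reading of the stopping rule.

```python
from collections import Counter

OPEN_CHARS = "(<"
CLOSE_CHARS = ")>"


def drop_bracketed_tokens(tokens):
    """Drop tokens enclosed in "()" or "<>" and tokens that contain a
    bracket character themselves (e.g. "(", "拍手)")."""
    kept = []
    depth = 0  # number of currently open brackets
    for tok in tokens:
        opens = sum(tok.count(ch) for ch in OPEN_CHARS)
        closes = sum(tok.count(ch) for ch in CLOSE_CHARS)
        if depth == 0 and opens == 0 and closes == 0:
            kept.append(tok)
        # Unmatched closing brackets must not push the depth below zero.
        depth = max(0, depth + opens - closes)
    return kept


def truncate_at_repetition(tokens, n=3, max_count=3):
    """Return the tokens generated before some n-gram occurs for the
    `max_count`-th time (assumed cut point); otherwise return all tokens."""
    counts = Counter()
    for end in range(n, len(tokens) + 1):
        ngram = tuple(tokens[end - n:end])
        counts[ngram] += 1
        if counts[ngram] >= max_count:
            return tokens[:end - n]
    return tokens


cleaned = drop_bracketed_tokens(["(", "拍手", ")", "ありがとう", "拍手)"])
shortened = truncate_at_repetition(list("abcabcabcd"))
```

In this reading, the first rule runs on the tokenized hypothesis and the second rule emulates stopping the decoder early; applying them after decoding, as here, is a simplification.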
Figure 7: Results with repetition removal (Rmrep) in BLEU and length ratio against ATD on SI test set. Panels: (a) BLEU, (b) Length ratio.

[Figure 8 (plots not recoverable from the extraction): panels (a) BLEURT, (b) BERTScore-F1, (c) BERTScore-Precision, and a BERTScore Recall panel, each against ATD on the SI test set for SI FT, SI FT + Rmrep, Mixed FT + Style, and Mixed FT + Style + Rmrep.]

References

Alexei Baevski and Michael Auli. 2019. Adaptive Input Representations for Neural Language Modeling. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Isaac Caswell, Ciprian Chelba, and David Grangier. 2019. Tagged back-translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 53–63, Florence, Italy. Association for Computational Linguistics.
Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2012–2017, Minneapolis, Minnesota. Association for Computational Linguistics.

Kosuke Doi, Katsuhito Sudoh, and Satoshi Nakamura. 2021. Large-scale English-Japanese simultaneous interpretation corpus: Construction and analyses with sentence-aligned data. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 226–235, Bangkok, Thailand (online). Association for Computational Linguistics.

Christian Fügen, Alex Waibel, and Muntsin Kolss. 2007. Simultaneous translation of lectures and speeches. Machine translation, 21:209–252.

Ryo Fukuda, Yuta Nishikawa, Yasumasa Kano, Yuka Ko, Tomoya Yanagita, Kosuke Doi, Mana Makinae, Katsuhito Sudoh, Sakriani Sakti, and Satoshi Nakamura. 2023. NAIST Simultaneous Speech Translation System for IWSLT 2023. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT2023). To appear.

Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O.K. Li. 2017. Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1053–1062, Valencia, Spain. Association for Computational Linguistics.

Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchis, Jorge Civera, and Alfons Juan. 2020. Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8229–8233.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. 2020. Libri-Light: A Benchmark for ASR with Limited or No Supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7669–7673. https://github.com/facebookresearch/libri-light.

Yasumasa Kano, Katsuhito Sudoh, and Satoshi Nakamura. 2022. Simultaneous neural machine translation with prefix alignment. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 22–31, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Yasumasa Kano, Katsuhito Sudoh, and Satoshi Nakamura. 2023. Average Token Delay: A Latency Metric for Simultaneous Translation. In Proceedings of Interspeech 2023. To appear.

Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, and Emmanuel Dupoux. 2020. Data Augmenting Contrastive Learning of Speech Representations in the Time Domain. arXiv preprint arXiv:2007.00991.

Catherine Kobus, Josep Crego, and Jean Senellart. 2017. Domain control for neural machine translation. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 372–378, Varna, Bulgaria. INCOMA Ltd.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Danni Liu, Gerasimos Spanakis, and Jan Niehues. 2020a. Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection. In Proc. Interspeech 2020, pages 3620–3624.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020b. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.

Minh-Thang Luong and Christopher Manning. 2015. Stanford neural machine translation systems for spoken language domains. In Proceedings of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign, pages 76–79, Da Nang, Vietnam.

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy. Association for Computational Linguistics.
371
Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Kanishka Rao, Haşim Sak, and Rohit Prabhavalkar.
Jiatao Gu, and Juan Pino. 2020a. SIMULEVAL: An 2017. Exploring architectures, data and units for
evaluation toolkit for simultaneous translation. In streaming end-to-end speech recognition with rnn-
Proceedings of the 2020 Conference on Empirical transducer. In 2017 IEEE Automatic Speech Recog-
Methods in Natural Language Processing: System nition and Understanding Workshop (ASRU), pages
Demonstrations, pages 144–150, Online. Association 193–199. IEEE.
for Computational Linguistics.
Yi Ren, Jinglin Liu, Xu Tan, Chen Zhang, Tao Qin,
Xutai Ma, Juan Pino, and Philipp Koehn. 2020b. Zhou Zhao, and Tie-Yan Liu. 2020. SimulSpeech:
SimulMT to SimulST: Adapting simultaneous text End-to-end simultaneous speech to text translation.
translation to end-to-end simultaneous speech trans- In Proceedings of the 58th Annual Meeting of the As-
lation. In Proceedings of the 1st Conference of the sociation for Computational Linguistics, pages 3787–
Asia-Pacific Chapter of the Association for Compu- 3796, Online. Association for Computational Lin-
tational Linguistics and the 10th International Joint guistics.
Conference on Natural Language Processing, pages
582–587, Suzhou, China. Association for Computa- Anthony Rousseau, Paul Deléglise, and Y. Estève. 2012.
tional Linguistics. TED-LIUM: an Automatic Speech Recognition ded-
icated corpus. In International Conference on Lan-
Akira Mizuno. 2017. Simultaneous interpreting and guage Resources and Evaluation.
cognitive constraints. Bull. Coll. Lit, 58:1–28.
Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020.
Yuta Nishikawa and Satoshi Nakamura. 2023. Inter- BLEURT: Learning robust metrics for text genera-
connection: Effective Connection between Pre- tion. In Proceedings of the 58th Annual Meeting of
trained Encoder and Decoder for Speech Translation. the Association for Computational Linguistics, pages
In Proceedings of Interspeech 2023. To appear. 7881–7892, Online. Association for Computational
Linguistics.
Yusuke Oda, Graham Neubig, Sakriani Sakti, Tomoki
Toda, and Satoshi Nakamura. 2014. Optimizing seg- Rico Sennrich, Barry Haddow, and Alexandra Birch.
mentation strategies for simultaneous speech transla- 2016. Controlling politeness in neural machine trans-
tion. In Proceedings of the 52nd Annual Meeting of lation via side constraints. In Proceedings of the 2016
the Association for Computational Linguistics (Vol- Conference of the North American Chapter of the
ume 2: Short Papers), pages 551–556. Association for Computational Linguistics: Human
Language Technologies, pages 35–40, San Diego,
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, California. Association for Computational Linguis-
Sam Gross, Nathan Ng, David Grangier, and Michael tics.
Auli. 2019. fairseq: A fast, extensible toolkit for
Hiroaki Shimizu, Graham Neubig, Sakriani Sakti,
sequence modeling. In Proceedings of the 2019 Con-
Tomoki Toda, and Satoshi Nakamura. 2013. Con-
ference of the North American Chapter of the Associa-
structing a speech translation system using simulta-
tion for Computational Linguistics (Demonstrations),
neous interpretation data. In Proceedings of IWSLT.
pages 48–53, Minneapolis, Minnesota. Association
for Computational Linguistics. Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Na-
man Goyal, Vishrav Chaudhary, Jiatao Gu, and An-
Kishore Papineni, Salim Roukos, Todd Ward, and Wei- gela Fan. 2021. Multilingual translation from de-
Jing Zhu. 2002. Bleu: a method for automatic evalu- noising pre-training. In Findings of the Association
ation of machine translation. In Proceedings of the for Computational Linguistics: ACL-IJCNLP 2021,
40th Annual Meeting of the Association for Compu- pages 3450–3466.
tational Linguistics, pages 311–318, Philadelphia,
Pennsylvania, USA. Association for Computational Hitomi Toyama, Shigeki Matsubara, Koichiro Ryu,
Linguistics. Nobuo Kawaguchi, and Yasuyoshi Inagaki. 2004.
CIAIR Simultaneous Interpretation Corpus. In Pro-
Peter Polák, Ngoc-Quan Pham, Tuan Nam Nguyen, ceedings of Oriental COCOSDA.
Danni Liu, Carlos Mullov, Jan Niehues, Ondřej Bo-
jar, and Alexander Waibel. 2022. CUNI-KIT system Ioannis Tsiamas, Gerard I. Gállego, Carlos Escolano,
for simultaneous speech translation task at IWSLT José Fonollosa, and Marta R. Costa-jussà. 2022. Pre-
2022. In Proceedings of the 19th International Con- trained speech encoders and efficient fine-tuning
ference on Spoken Language Translation (IWSLT methods for speech translation: UPC at IWSLT 2022.
2022), pages 277–285, Dublin, Ireland (in-person In Proceedings of the 19th International Confer-
and online). Association for Computational Linguis- ence on Spoken Language Translation (IWSLT 2022),
tics. pages 265–276, Dublin, Ireland (in-person and on-
line). Association for Computational Linguistics.
Matt Post. 2018. A call for clarity in reporting BLEU
scores. In Proceedings of the Third Conference on Changhan Wang, Juan Pino, Anne Wu, and Jiatao Gu.
Machine Translation: Research Papers, pages 186– 2020. CoVoST: A diverse multilingual speech-to-text
191, Brussels, Belgium. Association for Computa- translation corpus. In Proceedings of the Twelfth Lan-
tional Linguistics. guage Resources and Evaluation Conference, pages
372
4197–4203, Marseille, France. European Language
Resources Association.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q.
Weinberger, and Yoav Artzi. 2020. BERTScore:
Evaluating Text Generation with BERT. In 8th Inter-
national Conference on Learning Representations,
ICLR 2020, Addis Ababa, Ethiopia, April 26-30,
2020. OpenReview.net.
Jinming Zhao, Yuka Ko, Ryo Fukuda, Katsuhito Su-
doh, Satoshi Nakamura, et al. 2023. NAIST-SIC-
Aligned: Automatically-Aligned English-Japanese
Simultaneous Interpretation Corpus. arXiv preprint
arXiv:2304.11766.
373
A Evaluation Results in AL.
Figure 9 shows the main BLEURT and BLEU results on the SI test set as a function of AL.
Figure 10 shows the corresponding results on the offline test set. The trends are almost
the same as those of the main results in Figures 2 and 3.
[Figure 9 (plots not recoverable): quality against latency (AL) on the SI test set; panels (a) BLEURT and (b) BLEU; systems: Offline FT, SI FT, Mixed FT, Mixed FT + Style, Mixed FT + Style + Up.]
[Figure 10 (plots not recoverable): panels (a) BLEURT and (b) BLEU against AL; systems: Offline FT, SI FT, Mixed FT, Mixed FT + Style, Mixed FT + Style + Up.]
Figure 10: SimulST latency (AL) – quality results on offline test set.
The HW-TSC’s Simultaneous Speech-to-Text Translation system for
IWSLT 2023 evaluation
Jiaxin GUO, Daimeng Wei, Zhanglin Wu, Zongyao Li, Zhiqiang Rao, Minghan Wang,
Hengchao Shang, Xiaoyu Chen, Zhengzhe Yu, Shaojun Li, Yuhao Xie, Lizhi Lei, Hao Yang
[Figure 1 (diagram not recoverable). Labels: for each growing input (chunk1; chunk1 chunk2; chunk1 chunk2 chunk3), three ASR outputs asr_output(c,1), asr_output(c,2), asr_output(c,3) are translated into mt_output(c,1), mt_output(c,2), mt_output(c,3); a "prefix" step over the translations yields output(c).]
end-to-end model, both of which can be (hybrid) in nature. While cascaded systems currently offer the highest quality in offline speech translation, end-to-end speech translation provides a better trade-off between quality and latency (Guo et al., 2022; Wang et al., 2022a,b).

End-to-end speech translation systems incorporate various techniques to enable simultaneous translation. For example, Ma et al. (2019) implement a wait-k model and utilize meta-learning to address data scarcity, while Zhang et al. (2022b) employ a wait-info model that incorporates information entropy from both the original text and the translation into the model. Additionally, Liu et al. (2020) utilize a unidirectional encoder with monotonic cross-attention to constrain dependence on future context.

In addition, some research has focused on detecting stable hypotheses. For instance, Liu et al. (2020) proposed the Hold-n strategy, which identifies the best hypothesis in the beam and removes the last n tokens from it. Similarly, Liu et al. (2020) introduced the LA-n strategy, which identifies the matching prefixes of two consecutive chunks. Additionally, like the LA-n strategy, Nguyen et al. (2021) developed the SP-n strategy, which identifies the longest common prefix among all items in the beam of a chunk. Our work directly addresses this issue.

3 Methods

Figure 1 illustrates our framework.

3.1 ASR

In our cascade system, we have incorporated U2 (Wu et al., 2021) as the ASR module. This framework has the flexibility to be implemented on standard Transformer or Conformer architectures and can perform both streaming and non-streaming ASR. One of the major advantages of U2 over other offline autoregressive ASR models is its ability to support streaming through dynamic chunk training and decoding with a CTC decoder on top of the encoder. Additionally, U2 includes a standard autoregressive attention decoder and can be jointly trained with the CTC decoder to improve training stability. The dynamic chunk training method involves applying a causal mask with varying chunk sizes at the self-attention layer within the encoder. This allows the hidden representation to condition on some look-ahead contexts within the chunk, similar to the self-attention of an autoregressive decoder.

U2 offers four different decoding strategies: "ctc_greedy_search", "ctc_beam_search", "attention_decoding", and "attention_rescoring". The CTC decoder, with argmax decoding, guarantees that the tokens decoded in previous chunks are unaltered, leading to a smooth streaming experience. The attention decoder generates output token by token and also has the ability to re-score CTC generated texts using prefix beam search in the event of multiple candidate proposals.

Building on our findings from last year, we have found that U2 offers stability and robustness in predicting audio without real utterances. This improvement is due to the model's training strategy, specifically the use of dynamic chunk training. In our current work, we have further improved the performance of the model by breaking the chunk-based attention approach and employing the "attention_rescoring" decoding strategy.
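As an illustration of the chunk-based causal mask used in dynamic chunk training, the following is a minimal sketch in plain Python, not the U2 implementation. The function name `chunk_attention_mask` and the toy sizes are assumptions made for this example; in training, the chunk size would vary from batch to batch.

```python
from typing import List


def chunk_attention_mask(num_frames: int, chunk_size: int) -> List[List[bool]]:
    """Return mask[i][j] == True when encoder frame i may attend to frame j.

    Each frame sees all earlier frames plus the remaining frames of its own
    chunk, which is the limited look-ahead within the chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    mask = []
    for i in range(num_frames):
        # Index one past the last frame of the chunk that contains frame i.
        visible_end = min((i // chunk_size + 1) * chunk_size, num_frames)
        mask.append([j < visible_end for j in range(num_frames)])
    return mask


if __name__ == "__main__":
    # Toy example: 5 frames, chunks of 2 frames.
    for row in chunk_attention_mask(5, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

In this sketch, a chunk size of 1 gives a strictly causal mask, while a chunk size of at least `num_frames` gives full attention, which is one way to see how a single encoder trained with varying chunk sizes can cover both streaming and non-streaming use.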
3.2 MT

Our cascade system includes the Transformer (Vaswani et al., 2017) as the MT module, which has become a prevalent method for machine translation (Guo et al., 2021) in recent years. The Transformer has achieved impressive results, even with a primitive architecture that requires minimal modification. To improve the offline MT model performance, we utilize multiple training strategies (Wei et al., 2021).

Multilingual Translation. Johnson et al. (2017) proposed a simple solution for translating multiple languages using a single neural machine translation model with no need to alter the model architecture. The proposed technique involves inserting an artificial token at the start of the input sentence to specify the target language. Furthermore, all languages use the same vocabulary, eliminating the need to add additional parameters. In this study, En-De/ZH/JA data was combined and jointly trained, demonstrating that a multilingual model can significantly enhance translation performance.

Data diversification. Data diversification (Nguyen et al., 2020) is an effective strategy to improve the performance of NMT. This technique involves utilizing predictions from multiple forward and backward models and then combining the results with raw data to train the final NMT model. Unlike other methods such as knowledge distillation and dual learning, data diversification does not require additional monolingual data and can be used with any type of NMT model. Additionally, this strategy is more efficient and exhibits a strong correlation with model integration.

Forward translation. Forward translation (Wu et al., 2019) refers to using monolingual data in the source language to generate synthetic data through beam search decoding. This synthetic data is then added to the training data in order to increase its size. While forward translation alone may not yield optimal results, when combined with a back translation strategy, it can enhance performance more effectively than back translation alone. In this work, we use only the forward model to create synthetic data and add the data to the original parallel cor-

hypothesize that there are domain-like distinctions between ASR-generated results and actual text. To further improve the performance, we use the generation from a well-trained ASR model to replace source-side text in the training corpus data. This fine-tuning approach enables us to achieve further improvements in the MT model.

3.3 Onlinization

Incremental Decoding. Translation tasks may require reordering or additional information that is not apparent until the end of the source utterance, depending on the language pair. In offline settings, processing the entire utterance at once produces the highest-quality results. However, this approach also leads to significant latency in online mode. One possible solution to reduce latency is to divide the source utterance into smaller parts and translate each one separately.

To perform incremental inference, we divide the input utterance into chunks of a fixed size and decode each chunk as it arrives. Once a chunk has been selected, its predictions are committed to and no longer modified, to avoid visual distractions from constantly changing hypotheses. The decoding of the next chunk is dependent on the predictions that have been committed to. In practice, decoding for new chunks can proceed from a previously buffered decoder state or begin after forced decoding with the tokens that have been committed to. In either case, the source-target attention can span all available chunks, as opposed to only the current chunk.

Stable Hypothesis Detection. Our approach is based on prior research (Polák et al., 2022), and we have implemented stable hypothesis detection to minimize the potential for errors resulting from incomplete input. Their methods, such as LA-n (Liu et al., 2020) and SP-n (Nguyen et al., 2021), are designed for use in end-to-end systems that search for a shared prefix among the hypotheses generated from different chunk inputs. In contrast, our approach operates within a cascaded system that processes the same chunk input.

We can denote the MT and ASR generating functions as G and F respectively. Let $F^{C}_{i,n}$ represent
Translation (IWSLT 2023). Association for Computational Linguistics.

Mattia Antonino Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. Must-c: a multilingual speech translation corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 2012–2017. Association for Computational Linguistics.

Jiaxin Guo, Yinglu Li, Minghan Wang, Xiaosong Qiao, Yuxia Wang, Hengchao Shang, Chang Su, Yimeng Chen, Min Zhang, Shimin Tao, Hao Yang, and Ying Qin. 2022. The hw-tsc's speech to speech translation system for IWSLT 2022 evaluation. In Proceedings of the 19th International Conference on Spoken Language Translation, IWSLT@ACL 2022, Dublin, Ireland (in-person and online), May 26-27, 2022, pages 293–297. Association for Computational Linguistics.

Jiaxin Guo, Minghan Wang, Daimeng Wei, Hengchao Shang, Yuxia Wang, Zongyao Li, Zhengzhe Yu, Zhanglin Wu, Yimeng Chen, Chang Su, Min Zhang, Lizhi Lei, Shimin Tao, and Hao Yang. 2021. Self-distillation mixup training for non-autoregressive neural machine translation. CoRR, abs/2112.11640.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Trans. Assoc. Comput. Linguistics, 5:339–351.

Danni Liu, Gerasimos Spanakis, and Jan Niehues. 2020. Low-latency sequence-to-sequence speech recognition and translation by partial hypothesis selection. In Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020, pages 3620–3624. ISCA.

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 3025–3036. Association for Computational Linguistics.

Thai-Son Nguyen, Sebastian Stüker, and Alex Waibel. 2021. Super-human performance in online low-latency recognition of conversational speech. In Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021, pages 1762–1766. ISCA.

Xuan-Phi Nguyen, Shafiq R. Joty, Kui Wu, and Ai Ti Aw. 2020. Data diversification: A simple strategy for neural machine translation. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. CoRR, abs/1904.01038.

Peter Polák, Ngoc-Quan Pham, Tuan-Nam Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, Ondrej Bojar, and Alexander Waibel. 2022. CUNI-KIT system for simultaneous speech translation task at IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation, IWSLT@ACL 2022, Dublin, Ireland (in-person and online), May 26-27, 2022, pages 277–285. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.

Minghan Wang, Jiaxin Guo, Yinglu Li, Xiaosong Qiao, Yuxia Wang, Zongyao Li, Chang Su, Yimeng Chen, Min Zhang, Shimin Tao, Hao Yang, and Ying Qin. 2022a. The hw-tsc's simultaneous speech translation system for IWSLT 2022 evaluation. In Proceedings of the 19th International Conference on Spoken Language Translation, IWSLT@ACL 2022, Dublin, Ireland (in-person and online), May 26-27, 2022, pages 247–254. Association for Computational Linguistics.

Minghan Wang, Jiaxin Guo, Xiaosong Qiao, Yuxia Wang, Daimeng Wei, Chang Su, Yimeng Chen, Min Zhang, Shimin Tao, Hao Yang, and Ying Qin. 2022b. The hw-tsc's offline speech translation system for IWSLT 2022 evaluation. In Proceedings of the 19th International Conference on Spoken Language Translation, IWSLT@ACL 2022, Dublin, Ireland (in-person and online), May 26-27, 2022, pages 239–246. Association for Computational Linguistics.

Daimeng Wei, Zongyao Li, Zhanglin Wu, Zhengzhe Yu, Xiaoyu Chen, Hengchao Shang, Jiaxin Guo, Minghan Wang, Lizhi Lei, Min Zhang, Hao Yang, and Ying Qin. 2021. Hw-tsc's participation in the WMT 2021 news translation shared task. In Proceedings of the Sixth Conference on Machine Translation, WMT@EMNLP 2021, Online Event, November 10-11, 2021, pages 225–231. Association for Computational Linguistics.

Di Wu, Binbin Zhang, Chao Yang, Zhendong Peng, Wenjing Xia, Xiaoyu Chen, and Xin Lei. 2021. U2++: unified two-pass bidirectional end-to-end model for speech recognition. CoRR, abs/2106.05642.

Lijun Wu, Yiren Wang, Yingce Xia, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2019. Exploiting monolingual data at scale for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 4205–4215. Association for Computational Linguistics.
The HW-TSC’s Simultaneous Speech-to-Speech Translation system for
IWSLT 2023 evaluation
Hengchao Shang, Zhiqiang Rao, Zongyao Li, Jiaxin GUO, Zhanglin Wu, Minghan Wang,
Daimeng Wei, Shaojun Li, Zhengzhe Yu, Xiaoyu Chen, Lizhi Lei, Hao Yang
[Figure (diagram not recoverable). Labels: for each growing input (chunk1; chunk1 chunk2; chunk1 chunk2 chunk3), three ASR outputs asr_output(c,1), asr_output(c,2), asr_output(c,3) are translated into mt_output(c,1), mt_output(c,2), mt_output(c,3); a "prefix" step yields txt_output(c), followed by wav_output(c) and output(c).]
3.2 Stable Hypothesis Detection

Our approach is based on prior research (Polák et al., 2022), and we have implemented stable hypothesis detection to minimize the potential for errors resulting from incomplete input. In previous research, some methods focused on detecting stable hypotheses using strategies such as the Hold-n strategy proposed by Liu et al. (2020), which identifies the best hypothesis in the beam and removes the last n tokens from it. Similarly, Liu et al. (2020) introduced the LA-n strategy, which identifies the matching prefixes of two consecutive chunks. In addition, Nguyen et al. (2021) developed the SP-n strategy, which identifies the longest common prefix among all items in the beam of a chunk.

However, these methods were designed for end-to-end systems that search for a shared prefix among the hypotheses generated from different chunk inputs. Our approach, on the other hand, operates within a cascaded system that processes the same chunk input. As such, we have adapted these strategies to better fit our context, resulting in a more effective approach for stable hypothesis detection. By using our approach, we are able to achieve higher accuracy and stability in our system, thereby improving its overall performance.

We can denote the MT and ASR generating functions as G and F respectively. Let $F^{c}_{i,n}$ represent the $i$-th output generated by the ASR function for a $c$-chunk input with a beam size of $n$. Then the final common prefix for the $c$-chunk input can be expressed as $\mathit{prefix}_c$, which is determined as follows:

$$\mathit{prefix}_c = \mathrm{LCP}\left(G(F^{c}_{1,n}), \ldots, G(F^{c}_{n,n})\right) \tag{1}$$

where $\mathrm{LCP}(\cdot)$ is the longest common prefix of the arguments.

3.3 Deblanking

Our team conducted a manual evaluation of the audio output generated by TTS and identified two

Unknown Filtering. In the Chinese and Japanese language directions, we initially remove tokens that are not included in the vocabulary, such as infrequent punctuation marks and words. For Chinese in particular, we must convert Arabic numerals into textual numerals.

Context-Aware Pause Detection. When analyzing the waveform generated by TTS, we evaluate whether or not the original text indicates a pause. If the text does not indicate a pause, we remove the prolonged silence at the end of the generated waveform. Additionally, to ensure speech coherence, we retain at least 160 frames of blank audio.

4 Experiments

4.1 Dataset

To train the ASR module, we utilized four datasets: LibriSpeech V12, MuST-C V2 (Gangi et al., 2019), TEDLIUM V3, and CoVoST V2. LibriSpeech consists of audio book recordings with case-insensitive text lacking punctuation. MuST-C, a multilingual dataset recorded from TED talks, was used solely for the English data in the ASR task. TEDLIUM is a large-scale speech recognition dataset containing TED talk audio recordings along with text transcriptions. CoVoST is also a multilingual speech translation dataset based on Common Voice, with open-domain content. Unlike LibriSpeech, both MuST-C and CoVoST have case-sensitive text and punctuation.

To train the MT model, we collected all available parallel corpora from the official websites and selected data that was similar to the MuST-C domain. We first trained a multilingual MT baseline model on all data from three language directions. Then, we incrementally trained the baseline model based on data from each language direction.

4.2 Model

ASR. We extract 80-dimensional Mel-filter bank features from audio files to create the ASR training corpus. For tokenization of ASR texts, we utilize Sentencepiece with a learned vocabulary of up to
Model            Language Pair   BLEU/Whisper_ASR_BLEU   StartOffset   EndOffset   ATD
Our S2T System   EN-DE           33.54
Our S2T System   EN-JA           17.89
Our S2T System   EN-ZH           27.23
Our System       EN-DE           10.45                   1.04          2.73        1.97
Our System       EN-JA           14.53                   1.59          2.96        2.76
Our System       EN-ZH           20.19                   1.77          2.98        2.93
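The stable hypothesis detection of Equation (1) in Section 3.2 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: `translate` stands in for the MT function G, `asr_nbest` for the n ASR outputs F of the current chunk input, and the toy lookup table in the usage example is invented.

```python
from typing import Callable, List, Sequence


def longest_common_prefix(sequences: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest token prefix shared by all sequences (the LCP)."""
    if not sequences:
        return []
    prefix: List[str] = []
    # zip stops at the shortest sequence, so the prefix never overruns one.
    for column in zip(*sequences):
        if all(token == column[0] for token in column):
            prefix.append(column[0])
        else:
            break
    return prefix


def stable_prefix(
    asr_nbest: Sequence[str],
    translate: Callable[[str], List[str]],
) -> List[str]:
    """Translate every ASR hypothesis of one chunk input and keep the LCP."""
    return longest_common_prefix([translate(hyp) for hyp in asr_nbest])


if __name__ == "__main__":
    # Hypothetical stand-in for the MT model: a fixed lookup table.
    toy_mt = {
        "i like tea": ["ich", "mag", "Tee"],
        "i like tee": ["ich", "mag", "Abschlag"],
        "i like t": ["ich", "mag", "t"],
    }
    nbest = ["i like tea", "i like tee", "i like t"]
    print(stable_prefix(nbest, lambda text: toy_mt[text]))
```

Only the tokens on which all translated ASR hypotheses agree are kept in this sketch; the disputed tail is left for a later chunk, when more audio is available.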
Lijun Wu, Yiren Wang, Yingce Xia, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2019. Exploiting monolingual data at scale for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 4205–4215. Association for Computational Linguistics.

Binbin Zhang, Di Wu, Zhendong Peng, Xingchen Song, Zhuoyuan Yao, Hang Lv, Lei Xie, Chao Yang, Fuping Pan, and Jianwei Niu. 2022. Wenet 2.0: More productive end-to-end speech recognition toolkit. In Interspeech 2022, 23rd Annual Conference of the International Speech Communication Association, Incheon, Korea, 18-22 September 2022, pages 1661–1665. ISCA.
Towards Efficient Simultaneous Speech Translation:
CUNI-KIT System for Simultaneous Track at IWSLT 2023
Peter Polák1 and Danni Liu2 and Ngoc-Quan Ngoc2
Another way to utilize the CTC is joint decoding (Watanabe et al., 2017; Deng et al., 2022). In the joint decoding setup, the model has two decoders: the non-autoregressive CTC (usually a single linear layer after the encoder) and the attentional autoregressive decoder. The joint decoding is typically guided by the attentional decoder, while the CTC output is used for re-scoring. Since the CTC predicts hard alignment, the rescoring is not straightforward. To this end, Watanabe et al. (2017) proposed to use the CTC prefix probability (Graves, 2008), defined as a cumulative probability of all label sequences that have the current hypothesis $h$ as their prefix:

$$p_{ctc}(h, \ldots) = \sum_{\nu \in \mathcal{V}^{+}} p_{ctc}(h \oplus \nu \mid X), \tag{1}$$

where $\mathcal{V}$ is the output vocabulary (including the <eos> symbol), $\oplus$ is string concatenation, and $X$ is the input speech. To calculate this probability effectively, Watanabe et al. (2017) introduce variables $\gamma^{(b)}_t(h)$ and $\gamma^{(n)}_t(h)$ that represent forward probabilities of $h$ at time $t$, where the superscript denotes whether the CTC paths end with a blank or non-blank CTC symbol. If the hypothesis $h$ is a complete hypothesis (i.e., ends with the <eos> token), then the CTC probability of $h = g \oplus \texttt{<eos>}$ is:

$$p_{ctc}(h \mid X) = \gamma^{(b)}_T(g) + \gamma^{(n)}_T(g), \tag{2}$$

where $T$ is the final time stamp.

If $h = g \oplus c$ is not final, i.e., $c \neq \texttt{<eos>}$, then the probability is:

$$p_{ctc}(h \mid X) = \sum_{t=1}^{T} \Phi_t(g) \cdot p(z_t = c \mid X), \tag{3}$$

where

$$\Phi_t(g) = \gamma^{(b)}_{t-1}(g) + \begin{cases} 0 & \text{last}(g) = c \\ \gamma^{(n)}_{t-1}(g) & \text{otherwise.} \end{cases}$$

2.3 CTC Online Policy

Based on the definition of $p_{ctc}(h \mid X)$ in Equations (2) and (3), we can define the odds of $g$ being at the end of context $T$:

$$\mathrm{Odds}_{end}(g) = \frac{p_{ctc}(g \oplus \texttt{<eos>} \mid X)}{\sum_{c \in \mathcal{V} / \{\texttt{<eos>}\}} p_{ctc}(g \oplus c \mid X)}. \tag{4}$$

The disadvantage of this definition is that $p_{ctc}(\ldots \mid X)$ must be computed for every vocabulary entry separately and one evaluation costs $O(T)$, i.e., $O(|\mathcal{V}| \cdot T)$ in total. Contemporary ST systems use vocabularies on the order of thousands of items, making this definition prohibitively expensive. Since the CTC is used together with the label-synchronous decoder, we can approximate the denominator with a single vocabulary entry $c_{att}$ predicted by the attentional decoder $p_{att}$:

$$\mathrm{Odds}_{end}(g) \approx \frac{p_{ctc}(g \oplus \texttt{<eos>} \mid X)}{p_{ctc}(g \oplus c_{att} \mid X)}, \tag{5}$$

where $c_{att} = \operatorname{argmax}_{c \in \mathcal{V} / \{\texttt{<eos>}\}} p_{att}(g \oplus c \mid X)$. Now the evaluation of $\mathrm{Odds}_{end}(g)$ is $O(T)$. If we consider that the baseline model already uses CTC rescoring, then evaluating $\mathrm{Odds}_{end}(g)$ amounts to a constant number of extra operations to evaluate $p_{ctc}(g \oplus \texttt{<eos>} \mid X)$.

Finally, to control the latency of the online decoding, we compare the logarithm of $\mathrm{Odds}_{end}(g)$ with a tunable constant $C_{end}$. If $\log \mathrm{Odds}_{end}(g) > C_{end}$, we stop the beam search and discard the last token from $g$. We found values of $C_{end}$ between -2 and 2 to work well across all models and language pairs.

3 Experiments and Results

3.1 Models

Our offline multilingual ST models are based on an attentional encoder-decoder architecture. Specifically, the encoder is based on WavLM (Chen et al., 2022), and the decoder is based on multilingual BART (Lewis et al., 2019), or mBART for short. The model is implemented in the NMTGMinor library (https://github.com/quanpn90/NMTGMinor). For details on the offline model, see the KIT submission to the IWSLT 2023 Multilingual track (Liu et al., 2023).

The small simultaneous speech translation models for the English-to-German and English-to-Chinese language pairs follow the blockwise streaming Transformer architecture (Tsunoo et al., 2021) implemented in ESPnet-ST-v2 (Yan et al., 2023). Specifically, the encoder is a blockwise Conformer (Gulati et al., 2020) with a block size of 40 and look-ahead of 16, with 18 layers, and a hidden dimension of 256. The decoder is a 6-layer Transformer decoder (Vaswani et al., 2017). To improve the training speed, we initialize the encoder with
weights pretrained on the ASR task. Further, we Lang Decoding AL↓ ALCA ↓ RTF↓ BLEU↑
employ ST CTC (Deng et al., 2022; Yan et al., BWBS 1922 3121 0.46 30.6
En-De
IBWBS 1977 3277 0.52 31.7
2022) after the encoder with weight 0.3 during the
BWBS 1992 3076 0.50 15.5
training. During the decoding, we use 0.3 for En- En-Ja
IBWBS 1935 3264 0.64 15.6
glish to German, and 0.4 for English to Chinese.
BWBS 1948 2855 0.41 26.5
We preprocess the audio with 80-dimensional fil- En-Zh
IBWBS 1945 3031 0.48 26.5
ter banks. As output vocabulary, we use unigram
models (Kudo, 2018) of size 4000 for English to Table 1: Incremental SST with the original BWBS and
German, and 8000 for English to Chinese. IBWBS. Better scores in bold.
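Returning to the CTC online policy of Section 2.3, the end-of-context odds in Equation (4) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the choice of index 0 for the CTC blank, and the treatment of V as all non-blank output symbols are assumptions made for illustration. The recursion for the forward variables γ(n) and γ(b) follows the standard CTC prefix-scoring scheme that Equations (2) and (3) build on. How the odds are turned into a read/write decision is not shown in this excerpt, so the sketch stops at computing them.

```python
from typing import List, Sequence, Tuple

Posteriors = Sequence[Sequence[float]]  # probs[t][k] = p(z_{t+1} = k | X)


def forward_vars(probs: Posteriors, prefix: Sequence[int],
                 blank: int = 0) -> Tuple[List[float], List[float]]:
    """Return (gamma_n, gamma_b) for `prefix`; index 0 means 'no frame consumed'."""
    T = len(probs)
    # Empty prefix: only all-blank paths are possible.
    gn = [0.0] * (T + 1)
    gb = [1.0] + [0.0] * T
    for t in range(1, T + 1):
        gb[t] = gb[t - 1] * probs[t - 1][blank]
    last = None
    for c in prefix:
        new_gn = [0.0] * (T + 1)
        new_gb = [0.0] * (T + 1)
        for t in range(1, T + 1):
            # Phi_t(g): ways to start emitting c at frame t after prefix g.
            phi = gb[t - 1] + (0.0 if last == c else gn[t - 1])
            new_gn[t] = (new_gn[t - 1] + phi) * probs[t - 1][c]
            new_gb[t] = (new_gb[t - 1] + new_gn[t - 1]) * probs[t - 1][blank]
        gn, gb, last = new_gn, new_gb, c
    return gn, gb


def odds_end(probs: Posteriors, prefix: Sequence[int], blank: int = 0) -> float:
    """Odds that `prefix` is complete for the current context, as in Eq. (4)."""
    T = len(probs)
    gn, gb = forward_vars(probs, prefix, blank)
    p_eos = gn[T] + gb[T]  # Eq. (2): prefix followed by <eos>
    last = prefix[-1] if prefix else None
    p_continue = 0.0
    for c in range(len(probs[0])):
        if c == blank:  # assumption: blank is not part of V
            continue
        for t in range(1, T + 1):
            phi = gb[t - 1] + (0.0 if last == c else gn[t - 1])
            p_continue += phi * probs[t - 1][c]  # Eq. (3)
    return float("inf") if p_continue == 0.0 else p_eos / p_continue


# Toy example: two frames, uniform posteriors over {blank, token 1, token 2}.
uniform = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]
odds_empty = odds_end(uniform, [])   # nothing emitted yet
odds_one = odds_end(uniform, [1])    # prefix consisting of token 1
```

In the toy example the odds are higher for the one-token prefix than for the empty one, because two frames leave little room for a further token.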
3.2 Evaluation
In all our experiments with the offline models, we use beam search of size 8 except for the CTC policy experiments where we use greedy search. For experiments with the blockwise models, we use beam search of size 6. For experiments with the improved blockwise beam search, we follow Polák et al. (2023) and remove the repetition detection in the underlying offline models, while we keep the repetition detection on for all experiments with the blockwise models.
For evaluation, we use the SimulEval (Ma et al., 2020) toolkit and the tst-COMMON test set of MuST-C (Cattoni et al., 2021). To estimate translation quality, we report detokenized case-sensitive BLEU (Post, 2018), and for latency, we report average lagging (Ma et al., 2019). To realistically assess the inference speed, we run all our experiments on a computer with an Intel i7-10700 CPU and an NVIDIA GeForce GTX 1080 with 8 GB of graphics memory.

3.3 Incremental Blockwise Beam Search with Controllable Quality-Latency Tradeoff
In Table 1, we compare the performance of the onlinized version of the baseline blockwise beam search (BWBS) with the improved blockwise beam search (IBWBS; Polák et al., 2023). As we can see in the table, the improved beam search achieves BLEU scores higher than or equal to those of the baseline beam search across all language pairs. We observe the highest improvement in English-to-German (1.1 BLEU), an advantage of 0.1 BLEU for English-to-Japanese, and no improvement in English-to-Chinese.
In Table 1, we also report the real-time factor (RTF), and the computation-aware average lagging (ALCA). Interestingly, we observe a higher computational footprint of the IBWBS compared to the baseline beam search by 13, 28, and 17 % on En→{De, Ja, Zh}, resp., when measured with RTF. This might be due to the fact that we recompute the decoder states after each source increment. Since the IBWBS sometimes waits for more source chunks to output more tokens, the unnecessary decoder state recomputations might increase the computational complexity.

3.4 CTC Online Policy
In Figure 1, we compare the improved blockwise beam search (IBWBS) with the proposed CTC policy using the blockwise streaming models. The tradeoff curves for English-to-German (see Figure 1a) and English-to-Chinese (see Figure 1b) show that the proposed CTC policy improves the quality (up to 1.1 BLEU for En→De, and 0.8 BLEU for En→Zh), while it is able to achieve the same latencies.

3.5 CTC Online Policy for Large Offline Models
We were also interested in whether the CTC policy can be applied to large offline models. Unfortunately, due to limited resources, we were not able to train a large offline model with the CTC output. Hence, we decided to utilize the CTC outputs of the online blockwise models and used them to guide the large offline model. Since the models have very different vocabularies,3 we decided to execute the CTC policy after a whole word is generated by the offline model (rather than after every sub-word token). For the very same reason, we do not use CTC for rescoring.

3 The blockwise models have a vocabulary size of 4000 for En→De and 8000 for En→Zh, and the offline model has 250k.

We report the results in Table 2. Unlike in the blockwise models (see Section 3.4), the CTC policy does not improve the quality in En→De, and has a slightly worse quality (by 0.7 BLEU) in En→Zh. This is most probably due to the delayed CTC-attention synchronization that is not present for the blockwise models (as both decoders there share the
same vocabulary and the models compute the CTC policy after each token rather than word). However, we still observe a significant reduction in computational latency, namely by 45 and 34 % relative RTF for En→De and En→Zh, respectively.

[Figure 1 plots: (a) English to German, (b) English to Chinese; y-axis BLEU↑, x-axis AL↓ (ms); curves for CTC and IBWBS.]
Figure 1: Comparison of the improved blockwise beam search (IBWBS) and the proposed CTC policy using blockwise streaming models.

Lang    Decoding   AL↓    ALCA↓   RTF↓   BLEU↑
En-De   BWBS       1922   3121    0.46   30.6
En-De   IBWBS      1977   3277    0.52   31.7
En-De   CTC        1946   2518    0.21   30.6
En-Zh   BWBS       1948   2855    0.41   26.5
En-Zh   IBWBS      1945   3031    0.48   26.5
En-Zh   CTC        1981   2515    0.28   25.8

Table 2: Comparison of onlinization of the large offline model using chunking with the local agreement policy (LA-2) and with the proposed CTC policy.

4 Submission
In this section, we summarize our submission to the Simultaneous track at IWSLT 2023. In total, we submit 10 systems for all three language pairs.

4.1 Onlinized Offline Models
Following our last year’s submission, we onlinize two large offline models (our models for IWSLT 2022 Offline ST track and IWSLT 2023 Multilingual track). This year, however, we utilize the improved blockwise beam search to yield higher BLEU scores. We submit systems for all language pairs based on the last year’s model, and our new model. We summarize the submitted models and their performance in Table 3. As we can observe in Table 3, the 2023 model appears to perform worse. However, we learned during the writing of this paper that there was some overlap between the training and test data for the 2022 model,4 making the BLEU scores for the 2022 model unreliable.

4 (Zhang and Ao, 2022) found an overlap between ST-TED training corpus and tst-COMMON set of MuST-C dataset.

Lang    Model   AL↓    ALCA↓   BLEU↑
En-De   2022    1991   3138    31.8
En-De   2023    1955   3072    31.4
En-Ja   2022    1906   3000    15.5
En-Ja   2023    1982   3489    15.3
En-Zh   2022    1984   3289    26.8
En-Zh   2023    1987   3508    26.6

Table 3: Submitted onlinized large offline models.

We also submit the system based on the large model onlinized using the CTC policy. The systems are summarized in Table 4. Unfortunately, we were not aware of the training and test data overlap during the evaluation period, so we decided to use our 2022 model also this year.

Lang    Model   AL↓    ALCA↓   BLEU↑
En-De   2022    1959   2721    31.4
En-Zh   2022    1990   2466    26.3

Table 4: Submitted large offline models onlinized using the proposed CTC policy.

4.2 Blockwise Online Models
Finally, we submit small blockwise models. Their advantage is that they are able to run on a CPU faster than real time (more than 5× faster). We report their performance in Table 5.

Lang    AL↓    ALCA↓   RTF↓   BLEU↑
En-De   1986   2425    0.19   25.4
En-Zh   1999   2386    0.19   23.8

Table 5: Submitted small blockwise models using the proposed CTC online policy.
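As a quick sanity check of the timing figures, the short Python sketch below recomputes two quantities from the reported real-time factors (RTF): the relative RTF overhead of IBWBS over BWBS in Table 1, and the speed relative to real time implied by the RTF values in Table 5. The helper names are illustrative only, and the sketch assumes that the "more than 5× faster" statement refers to the RTF column of Table 5.

```python
# RTF values copied from Table 1 as (BWBS, IBWBS) and from Table 5.
TABLE1_RTF = {"En-De": (0.46, 0.52), "En-Ja": (0.50, 0.64), "En-Zh": (0.41, 0.48)}
TABLE5_RTF = {"En-De": 0.19, "En-Zh": 0.19}


def relative_overhead_pct(baseline: float, improved: float) -> float:
    """Relative increase of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


def speedup_over_real_time(rtf: float) -> float:
    """An RTF of r means processing takes r seconds per second of audio."""
    return 1.0 / rtf


overheads = {lang: round(relative_overhead_pct(b, i))
             for lang, (b, i) in TABLE1_RTF.items()}
speedups = {lang: speedup_over_real_time(r) for lang, r in TABLE5_RTF.items()}

print(overheads)
print({lang: round(s, 2) for lang, s in speedups.items()})
```

The rounded overheads agree with the 13, 28, and 17 % quoted in Section 3.3, and an RTF of 0.19 corresponds to a little over five times real time.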
5 Conclusion and Future Work
In this paper, we present the CUNI-KIT submission to the Simultaneous track at IWSLT 2023. We experimented with the latest decoding methods and proposed a novel CTC online policy. We experimentally showed that the proposed CTC online policy significantly improves the translation quality of the blockwise streaming models. Additionally, the proposed CTC policy significantly lowers the computational footprint of the onlinized large offline models. Unaware of a data overlap issue in 2022, we eventually chose to use our last year’s models in the official evaluation also this year.

Acknowledgments
This work has received support from the project “Grant Schemes at CU” (reg. no. CZ.02.2.69/0.0/0.0/19_073/0016935), the grant 19-26934X (NEUREM3) of the Czech Science Foundation, and by Charles University, project GA UK No 244523.

References
Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutail Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Vĕra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nǎdejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 evaluation campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 98–157, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Roldano Cattoni, Mattia Antonino Di Gangi, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2021. Must-c: A multilingual corpus for end-to-end speech translation. Computer Speech & Language, 66:101155.

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. 2022. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518.

Shun-Po Chuang, Yung-Sung Chuang, Chih-Chiang Chang, and Hung-yi Lee. 2021. Investigating the reordering capability in CTC-based non-autoregressive end-to-end speech translation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1068–1077, Online. Association for Computational Linguistics.

Keqi Deng, Shinji Watanabe, Jiatong Shi, and Siddhant Arora. 2022. Blockwise Streaming Transformer for Spoken Language Understanding and Simultaneous Speech Translation. In Proc. Interspeech 2022, pages 1746–1750.

Linhao Dong, Cheng Yi, Jianzong Wang, Shiyu Zhou, Shuang Xu, Xueli Jia, and Bo Xu. 2020. A comparison of label-synchronous and frame-synchronous end-to-end models for speech recognition. arXiv preprint arXiv:2005.10113.

Alex Graves. 2008. Supervised sequence labelling with recurrent neural networks. Ph.D. thesis, Technical University Munich.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369–376.

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented Transformer for Speech Recognition. In Proc. Interspeech 2020, pages 5036–5040.

Awni Hannun. 2020. The label bias problem.
Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.

Katherine Lee, Orhan Firat, Ashish Agarwal, Clara Fannjiang, and David Sussillo. 2018. Hallucinations in neural machine translation.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Jindřich Libovický and Jindřich Helcl. 2018. End-to-end non-autoregressive neural machine translation with connectionist temporal classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3016–3021, Brussels, Belgium. Association for Computational Linguistics.

Danni Liu, Ngoc-Quan Pham, Tuan Nam Nguyen, Thai-Binh Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, and Alexander Waibel. 2023. KIT submission to multilingual track at IWSLT 2023. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Danni Liu, Gerasimos Spanakis, and Jan Niehues. 2020. Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection. In Proc. Interspeech 2020, pages 3620–3624.

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy. Association for Computational Linguistics.

Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, and Juan Pino. 2020. SIMULEVAL: An evaluation toolkit for simultaneous translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 144–150, Online. Association for Computational Linguistics.

Mathias Müller, Annette Rios, and Rico Sennrich. 2019. Domain robustness in neural machine translation. arXiv preprint arXiv:1911.03109.

Thai-Son Nguyen, Sebastian Stüker, and Alex Waibel. 2020. Super-human performance in online low-latency recognition of conversational speech. arXiv preprint arXiv:2010.03449.

Peter Polák, Ngoc-Quan Pham, Tuan Nam Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, Ondřej Bojar, and Alexander Waibel. 2022. CUNI-KIT system for simultaneous speech translation task at IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 277–285, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Peter Polák, Brian Yan, Shinji Watanabe, Alexander Waibel, and Ondrej Bojar. 2023. Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff. In Proc. Interspeech 2023.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27.

Emiru Tsunoo, Yosuke Kashiwagi, and Shinji Watanabe. 2021. Streaming transformer asr with blockwise synchronous beam search. In 2021 IEEE Spoken Language Technology Workshop (SLT), pages 22–29. IEEE.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.

Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R Hershey, and Tomoki Hayashi. 2017. Hybrid ctc/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8):1240–1253.

Sam Wiseman and Alexander M Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1296–1306.

Brian Yan, Siddharth Dalmia, Yosuke Higuchi, Graham Neubig, Florian Metze, Alan W Black, and Shinji Watanabe. 2022. Ctc alignments improve autoregressive translation. arXiv preprint arXiv:2210.05200.

Brian Yan, Jiatong Shi, Yun Tang, Hirofumi Inaguma, Yifan Peng, Siddharth Dalmia, Peter Polák, Patrick Fernandes, Dan Berrebbi, Tomoki Hayashi, et al. 2023. Espnet-st-v2: Multipurpose spoken language translation toolkit. arXiv preprint arXiv:2304.04596.

Ziqiang Zhang and Junyi Ao. 2022. The YiTrans speech translation system for IWSLT 2022 offline shared task. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 158–168, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Speech Translation with Foundation Models and Optimal Transport:
UPC at IWSLT23
the original ones, based on text similarity measures, using TF-IDF features from the translations. More concretely, for each talk id, we compute the similarity matrix of its original translations and the new candidates from SegAugment, find the most similar original example for each new candidate, and add it to the filtered data only if its similarity score is below 0.8. We apply this approach also between the different SegAugment versions (m, l, xl).

4 Experiments
Here we describe the experiments we carried out in this work. The implementation details are available in §A.1.

IWSLT ’22 System For the IWSLT 2022 offline task, our submission employed a HuBERT encoder (Hsu et al., 2021a) and an mBART50 (En-Xx) decoder, which were efficiently fine-tuned to ST with the LNA strategy (Li et al., 2021) and parallel adapters (He et al., 2022), using datasets such as MuST-C v2, Europarl-ST and CoVoST. The architecture included three 1D convolutional layers between the encoder and decoder, resulting in a subsampling of the encoder representation by a factor of 8. The final ensemble also comprised models utilizing Knowledge Distillation and a wav2vec 2.0 encoder (Tsiamas et al., 2022b).

Baseline Our baseline has four main differences compared to our last year’s best system. We did an initial exploratory analysis of various encoders (§A.3), including different versions of wav2vec 2.0, and HuBERT. Upon observing no significant differences, we opted to utilize wav2vec 2.0 fine-tuned with pseudo-labels (Xu et al., 2021b), a more prevalent choice within the research community. Despite the strong performance demonstrated by efficient fine-tuning with LNA and parallel adapters, we chose to switch to standard ST fine-tuning in order to optimize performance. Moreover, we employ a semantic encoder initialized from the MT model. Lastly, we also pre-train the foundation models, wav2vec 2.0 with CTC on the ASR data of MuST-C, and mBART50 on the parallel text of MuST-C. It is important to note that only MuST-C data was utilized for the baseline.

Siamese Pre-training Instead of pre-training the speech encoder with CTC only, we follow the Siamese pre-training method (§2.2), with the encoder architecture described in §2.1, to align the encoder representations with the MT model’s representation space. The system, instead of using three layers of 1D convolutions, now incorporates also CTC-based compression, a large adapter, and finally a single layer of 1D convolutions. Following the Siamese pre-training on MuST-C’s ASR data, we jointly fine-tune the model and the MT decoder on the MuST-C ST data. Similar to the baseline, the MT model is also fine-tuned on the parallel text of MuST-C beforehand.

More Data We extend the previously described process by incorporating additional data. Initially, we fine-tune mBART50 using all the MT data (Table 6). Subsequently, we perform the Siamese pre-training and ST fine-tuning employing all the available speech data (Table 1). By incorporating a larger dataset, we aim to enhance the system’s generalization capabilities and overall performance.

Data Augmentation We employ two data augmentation techniques to increase the performance of our system during ST fine-tuning (§3.2), while no modifications are made to the Siamese pre-training. First, we investigate the use of SegAugment (Tsiamas et al., 2022a), which we apply to MuST-C v3. Secondly, we generate synthetic data from Common Voice (Ardila et al., 2020), by leveraging the fine-tuned mBART50 (§A.2).

KD We use knowledge distillation with the fine-tuned mBART50 as the teacher (§A.2). The loss for training the ST model is the average of the standard cross entropy and the Kullback-Leibler (KL) divergence between the MT and ST output probability distributions. We utilize all available ST data in this experiment, including both real and synthetic data.

5 Audio Segmentation
To segment the audio of the IWSLT test sets, we use SHAS (Tsiamas et al., 2022c). The tst2023 test set, unlike previous years, contains another two domains apart from TED talks, which are ACL presentations and Press conferences. We tune the parameters of SHAS separately for each domain, but since no development set is available for the press conferences, we decided to treat it as the ACL domain. For fine-tuning the segmentation parameters, we used the ST model that was trained with synthetic data from CommonVoice and SegAugment and initialized from Siamese pre-training (Table 2, 2d). We evaluate the performance of the
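The KD objective described in Section 4 (the average of the standard cross entropy and the KL divergence between the MT and ST output distributions) can be sketched for a single target position as follows. This is a minimal illustration, not the authors' training code: the function names are hypothetical, and the direction KL(teacher ‖ student) is an assumption, since the text does not state it.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kd_loss(student_logits: Sequence[float],
            teacher_logits: Sequence[float],
            target: int) -> float:
    """Average of cross entropy (vs. the reference token) and KL(teacher || student)."""
    p_student = softmax(student_logits)   # ST model distribution
    p_teacher = softmax(teacher_logits)   # MT teacher distribution
    cross_entropy = -math.log(p_student[target])
    # Assumption: KL is taken from the teacher to the student distribution.
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    return 0.5 * (cross_entropy + kl)


# When student and teacher agree, the KL term vanishes and only half the
# cross entropy remains.
loss_same = kd_loss([0.0, 0.0], [0.0, 0.0], target=0)
loss_diff = kd_loss([0.0, 0.0], [2.0, 0.0], target=0)
```

In a real system the same quantity would be averaged over all target positions in a batch.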
Figure 4: BLEU scores on IWSLT.ACLdev2023 for different combinations of min and max segment length parameters of SHAS.

Table 2: BLEU scores for En-De MuST-C and IWSLT sets. In bold are the best scores by single models, and in underlined bold are the best scores overall.

Ensembling multiple models provided small increases in all sets. We believe that there is very little variation in our best models (2b-2e), since they are initialized from the same Siamese pre-training (2b), thus resulting in ineffective ensembles. In general, and in terms of single models, we improve our results from last year by 1.6 BLEU in tst2019 and 2.1 BLEU in tst2020, while the difference is larger in terms of single models.

7 Conclusions
We described the submission of the UPC Machine Translation group for the IWSLT 2023 Offline ST task. Our system leverages ASR and MT foundation models and a Siamese pretraining step to maximize the transfer learning from MT. We show that Siamese pretraining can bring significant improvements to our ST models, while fine-tuning with KD can also be helpful. We furthermore show that synthetic data are crucial at improving performance in the IWSLT test sets. In future work, we plan to investigate the zero-shot capabilities of optimal transport in the context of foundation models.

8 Submission Results
In Tables 3, 4 and 5, we present the official submission results for IWSLT 2023 with our best system, which is the Ensemble 3c of Table 2. Systems are evaluated on the three test sets (TED, ACL, Sub) with three metrics; BLEU (Papineni et al., 2002), chrF (Popović, 2017), and COMET (Rei et al., 2020). The TED test set also has two available references.

Metric      BLEU                 chrF          COMET
Reference   1      2      both   1      2      1        2
System 3c   25.5   29.8   36.6   0.56   0.58   0.7985   0.8098

Table 3: Official Results for the TED test set 2023.

Metric      BLEU   chrF   COMET
System 3c   32.1   0.6    0.7473

Table 4: Official Results for the ACL test set 2023.

Metric      BLEU   chrF   COMET
System 3c   15.6   0.47   0.3746

Table 5: Official Results for the Sub test set 2023.

Acknowledgements
The work done by Ioannis Tsiamas and Gerard I. Gállego was supported by the ADAVOICE project, PID2019-107579RB-I00 / AEI / 10.13039/501100011033
References
Ebrahim Ansari, Amittai Axelrod, Nguyen Bach, Ondrej Bojar, Roldano Cattoni, Fahim Dalvi, Nadir Durrani, Marcello Federico, Christian Federmann, Jiatao Gu, Fei Huang, Kevin Knight, Xutai Ma, Ajay Nagesh, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Xing Shi, Sebastian Stüker, Marco Turchi, Alexander H. Waibel, and Changhan Wang. 2020. FINDINGS OF THE IWSLT 2020 EVALUATION CAMPAIGN. In Proceedings of the 17th International Conference on Spoken Language Translation, IWSLT 2020, Online, July 9 - 10, 2020, pages 1–34. Association for Computational Linguistics.

R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber. 2020. Common voice: A massively-multilingual speech corpus. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4211–4215.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems, volume 33, pages 12449–12460. Curran Associates, Inc.

Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2019. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 58–68, Minneapolis, Minnesota. Association for Computational Linguistics.

Luisa Bentivogli, Mauro Cettolo, Marco Gaido, Alina Karakanta, Alberto Martinelli, Matteo Negri, and Marco Turchi. 2021. Cascade versus direct speech translation: Do the differences still make a difference? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2873–2887, Online. Association for Computational Linguistics.

Alexandre Berard, Laurent Besacier, Ali Can Kocabiyikoglu, and Olivier Pietquin. 2018. End-to-End Automatic Speech Translation of Audiobooks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6224–6228, Calgary, AB. IEEE.

Roldano Cattoni, Mattia Antonino Di Gangi, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2021. Must-c: A multilingual corpus for end-to-end speech translation. Computer Speech & Language, 66:101155.

Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. 2021. Unsupervised Cross-Lingual Representation Learning for Speech Recognition. In Proc. Interspeech 2021, pages 2426–2430.

Mattia A. Di Gangi, Matteo Negri, Viet Nhat Nguyen, Amirhossein Tebbifakhr, and Marco Turchi. 2019. Data Augmentation for End-to-End Speech Translation: FBK@IWSLT ’19. In Proceedings of the 16th International Workshop on Spoken Language Translation, Hong Kong. Publisher: Zenodo.

Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya-Polo, and Tomaso Poggio. 2015. Learning with a wasserstein loss. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, page 2053–2061, Cambridge, MA, USA. MIT Press.

Marco Gaido, Mauro Cettolo, Matteo Negri, and Marco Turchi. 2021. CTC-based compression for direct speech translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 690–696, Online. Association for Computational Linguistics.

Marco Gaido, Mattia A. Di Gangi, Matteo Negri, and Marco Turchi. 2020. End-to-end speech-translation with knowledge distillation: FBK@IWSLT2020. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 80–88, Online. Association for Computational Linguistics.

Marco Gaido, Sara Papi, Dennis Fucci, Giuseppe Fiameni, Matteo Negri, and Marco Turchi. 2022. Efficient yet competitive speech translation: FBK@IWSLT2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 177–189, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Gerard I. Gállego, Ioannis Tsiamas, Carlos Escolano, José A. R. Fonollosa, and Marta R. Costa-jussà. 2021. End-to-end speech translation with pre-trained models and adapters: UPC at IWSLT 2021. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 110–119, Bangkok, Thailand (online). Association for Computational Linguistics.

Mattia A. Di Gangi, Matteo Negri, and Marco Turchi. 2019. Adapting Transformer to End-to-End Spoken Language Translation. In Proc. Interspeech 2019, pages 1133–1137.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, page 369–376, New York, NY, USA. Association for Computing Machinery.

Chi Han, Mingxuan Wang, Heng Ji, and Lei Li. 2021. Learning shared semantic space for speech-to-text
translation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2214–2225, Online. Association for Computational Linguistics.

Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. 2022. Towards a unified view of parameter-efficient transfer learning. In International Conference on Learning Representations.

Dan Hendrycks and Kevin Gimpel. 2020. Gaussian error linear units (gelus).

Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. ArXiv, abs/1503.02531.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021a. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460.

Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, and Michael Auli. 2021b. Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training. In Proc. Interspeech 2021, pages 721–725.

Hirofumi Inaguma, Brian Yan, Siddharth Dalmia, Pengcheng Guo, Jiatong Shi, Kevin Duh, and Shinji Watanabe. 2021. ESPnet-ST IWSLT 2021 offline speech translation system. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 100–109, Bangkok, Thailand (online). Association for Computational Linguistics.

Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchis, Jorge Civera, and Alfons Juan. 2020. Europarl-st: A multilingual corpus for speech translation of parliamentary debates.

Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, and Emmanuel Dupoux. 2021. Data augmenting contrastive learning of speech representations in the time domain. In 2021 IEEE Spoken Language Technology Workshop (SLT), pages 215–222.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization.

Paul Knopp and Richard Sinkhorn. 1967. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343–348.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395, Barcelona, Spain. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79–86, Phuket, Thailand.

Phuong-Hang Le, Hongyu Gong, Changhan Wang, Juan Pino, Benjamin Lecouteux, and Didier Schwab. 2023. Pre-training for speech translation: Ctc meets optimal transport.

Xian Li, Changhan Wang, Yun Tang, Chau Tran, Yuqing Tang, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. 2021. Multilingual speech translation from efficient finetuning of pretrained models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 827–838.

Yuchen Liu, Hao Xiong, Jiajun Zhang, Zhongjun He, Hua Wu, Haifeng Wang, and Chengqing Zong. 2019. End-to-End Speech Translation with Knowledge Distillation. In Proc. Interspeech 2019, pages 1128–1132.

J. Niehues, R. Cattoni, S. Stüker, M. Negri, M. Turchi, Elizabeth Salesky, Ramon Sanabria, Loïc Barrault, Lucia Specia, and Marcello Federico. 2019. The iwslt 2019 evaluation campaign. In Proceedings of the 16th International Workshop on Spoken Language Translation.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan,
J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, Sam Gross, Nathan Ng, David Grangier, and Michael
P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Col- Auli. 2019. fairseq: A fast, extensible toolkit for
lobert, C. Fuegen, T. Likhomanenko, G. Syn- sequence modeling. In Proceedings of NAACL-HLT
naeve, A. Joulin, A. Mohamed, and E. Dupoux. 2019: Demonstrations.
2020. Libri-light: A benchmark for asr with
limited or no supervision. In ICASSP 2020 - Vassil Panayotov, Guoguo Chen, Daniel Povey, and San-
2020 IEEE International Conference on Acous- jeev Khudanpur. 2015. Librispeech: An asr corpus
tics, Speech and Signal Processing (ICASSP), based on public domain audio books. In 2015 IEEE
pages 7669–7673. https://github.com/ International Conference on Acoustics, Speech and
facebookresearch/libri-light. Signal Processing (ICASSP), pages 5206–5210.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Gabriel Peyré and Marco Cuturi. 2019. Computational optimal transport: With applications to data science.
Ngoc-Quan Pham, Tuan Nam Nguyen, Thai-Binh Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, and Alexander Waibel. 2022. Effective combination of pretrained models - KIT@IWSLT2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 190–197, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Juan Pino, Liezl Puzon, Jiatao Gu, Xutai Ma, Arya D. McCarthy, and Deepak Gopinath. 2019. Harnessing Indirect Training Data for End-to-End Automatic Speech Translation: Tricks of the Trade. In Proceedings of the 16th International Workshop on Spoken Language Translation, Hong Kong. Publisher: Zenodo.
Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels. Association for Computational Linguistics.
Tomasz Potapczyk and Pawel Przybysz. 2020. SRPOL’s System for the IWSLT 2020 End-to-End Speech Translation Task. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 89–94, Online. Association for Computational Linguistics.
Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
Matthias Sperber and Matthias Paulik. 2020. Speech translation and the end-to-end promise: Taking stock of where we are. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7409–7421, Online. Association for Computational Linguistics.
Xu Tan, Yi Ren, Di He, Tao Qin, and Tie-Yan Liu. 2019. Multilingual neural machine translation with knowledge distillation. In International Conference on Learning Representations.
Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2020. Multilingual translation with extensible multilingual pretraining and finetuning. arXiv preprint arXiv:2008.00401.
Ioannis Tsiamas, José A. R. Fonollosa, and Marta R. Costa-jussà. 2022a. SegAugment: Maximizing the Utility of Speech Translation Data with Segmentation-based Augmentations.
Ioannis Tsiamas, Gerard I. Gállego, Carlos Escolano, José Fonollosa, and Marta R. Costa-jussà. 2022b. Pretrained speech encoders and efficient fine-tuning methods for speech translation: UPC at IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 265–276, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, and Marta R. Costa-jussà. 2022c. Shas: Approaching optimal segmentation for end-to-end speech translation.
Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, and Juan Pino. 2020a. Fairseq S2T: Fast speech-to-text modeling with fairseq. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: System Demonstrations, pages 33–39, Suzhou, China. Association for Computational Linguistics.
Changhan Wang, Anne Wu, and Juan Pino. 2020b. Covost 2: A massively multilingual speech-to-text translation corpus. arXiv preprint arXiv:2007.10310.
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. 2020. On layer normalization in the transformer architecture. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org.
Chen Xu, Bojie Hu, Yanyang Li, Yuhao Zhang, Shen Huang, Qi Ju, Tong Xiao, and Jingbo Zhu. 2021a. Stacked acoustic-and-textual encoding: Integrating the pre-trained models into speech translation encoders. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2619–2630, Online. Association for Computational Linguistics.
Qiantong Xu, Alexei Baevski, Tatiana Likhomanenko, Paden Tomasello, Alexis Conneau, Ronan Collobert, Gabriel Synnaeve, and Michael Auli. 2021b. Self-training and pre-training are complementary for speech recognition. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3030–3034.
Biao Zhang, Ivan Titov, Barry Haddow, and Rico Sennrich. 2020. Adaptive feature selection for end-to-end speech translation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2533–2544, Online. Association for Computational Linguistics.
Ziqiang Zhang and Junyi Ao. 2022. The YiTrans speech translation system for IWSLT 2022 offline shared task. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 158–168, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu, Shuo Ren, Shujie Liu, Zhuoyuan Yao, Xun Gong, Lirong Dai, Jinyu Li, et al. 2022. Speechlm: Enhanced speech pre-training with unpaired textual data. arXiv preprint arXiv:2209.15329.
Jinming Zhao, Hao Yang, Gholamreza Haffari, and Ehsan Shareghi. 2022. M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation. In Proc. Interspeech 2022, pages 111–115.

A Appendix

A.1 Implementation Details
This section presents the implementation details of our proposed model architecture.
As an ASR model, we are using wav2vec 2.0² which is composed of a 7-layer convolutional feature extractor and 24-layer Transformer encoder. It is pretrained with 60k hours of non-transcribed speech from Libri-Light (Kahn et al., 2020), and fine-tuned for ASR with 960 hours of labeled data from Librispeech (Panayotov et al., 2015). The wav2vec 2.0 version we use was also fine-tuned with pseudo-labels (Xu et al., 2021b).
As an MT model, we are using mBART50 (Tang et al., 2020), which is already fine-tuned on En-Xx multilingual machine translation³. We further pretrain it for two reasons. Firstly, we are only interested in the En-De direction, and thus we would like a more specialized model on that direction. Secondly, due to the 2nd step of encoder matching, we would like the text encoder to have a very good representation of our data. For MT fine-tuning, we use the original parameters of mBART50 (Tang et al., 2020), and the datasets listed in Table 6.
² https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec2_vox_960h_new.pt
³ https://dl.fbaipublicfiles.com/fairseq/models/mbart50/mbart50.ft.1n.tar.gz
The acoustic encoder has 24 Transformer layers, while the semantic encoder and the decoder have 12 layers each. All layers have an embedding dimensionality of 1024, a feed-forward dimensionality of 4098, GELU activations (Hendrycks and Gimpel, 2020), 16 attention heads, and pre-layer normalization (Xiong et al., 2020). The vocabulary for the CTC has a size of 32 characters, while the one for the ST model has a size of 250,000.
The model takes waveforms with a 16kHz sampling rate as input, which are normalized to zero mean and unit variance. The models are trained using the data presented in Table 1, with maximum source length of 400,000 and target length of 1024 tokens. Gradient accumulation and data parallelism are employed to achieve an effective batch size of approximately 32 million tokens.
For the Siamese pre-training we use Adam (Kingma and Ba, 2014) with a base learning rate of $2 \cdot 10^{-4}$, a warm-up of 1,000 steps and an inverse square root scheduler. We follow a reduced regularization approach, as compared to the original configuration of wav2vec 2.0 and mBART50, which we found to work the best in our preliminary experiments. Thus, we use 0.1 activation dropout in the acoustic encoder, as well as time masking with probability of 0.2 and channel masking with probability of 0.1. For the context encoder, we use 0.1 dropout and 0.1 attention dropout. All other dropouts are inactive. All the weights in the loss function were set to 1.0 (Eq. 1). We train until the $L_{OT2}$ term of the loss does not improve for 5,000 steps, and then average the 10 best checkpoints according to the same loss term.
For ST fine-tuning, we use Adam with a base learning rate of $5 \cdot 10^{-5}$, fixed for the 20% of the training before decaying to $5 \cdot 10^{-7}$ for the rest. In the semantic encoder, we apply a dropout of 0.1 and an attention dropout of 0.1, while for the decoder we use a dropout of 0.3 and an attention dropout of 0.1. Neither dropout nor masking is applied in the frozen acoustic encoder. The loss is the cross-entropy with label smoothing of 0.2.
For the experiments incorporating Knowledge Distillation (KD) during ST fine-tuning, the loss is calculated as a weighted sum of the standard cross-entropy (no label smoothing) and the KL divergence between the teacher and student distributions, controlled by a hyperparameter λ, set to 0.5. The teacher distribution for each step is obtained offline using the fine-tuned mBART50, where we keep the top-8 indices, and both the teacher and student distributions are additionally modified with
temperature T = 1.3 (Gaido et al., 2020).
After ST fine-tuning, we pick the 10 best checkpoints according to the BLEU (Papineni et al., 2002) computed with sacreBLEU (Post, 2018) on the development set of MuST-C and average them. For generation, we use a beam search of 5. All models are implemented in FAIRSEQ (Ott et al., 2019), and experiments were run on a cluster of 8 NVIDIA GeForce RTX 3090. Our code is available at a public repository⁴.
⁴ https://github.com/mt-upc/iwslt-2023

A.2 MT fine-tuning
For the MT fine-tuning, we use the parallel text of the ST datasets, as well as Europarl v10 En-De (Koehn, 2005) (Table 6). We perform text normalization and remove pairs with extremely short text segments (fewer than 4 characters) or extreme source-to-target length ratio (less than 0.5 or larger than 2).

                   Original   Filtered
  ST datasets
    MuST-C v3          270        235
    Europarl-ST         33         26
    CoVoST 2           231        203
  MT datasets
    Europarl v10     1,829      1,566
  Total              2,363      2,030

Table 6: Filtered training data (thousands of sentences) for MT fine-tuning stage.

                   MuST-C v2   MuST-C v3   Europarl-ST   CoVoST2
  Off-the-shelf
    mBART50           31.4        30.9         35.0        33.6
  Fine-tuned
    MuST-C v2         35.3        34.4         34.6        35.3
    All (§3.1)        34.9        34.2         40.3        39.9

Table 7: BLEU scores on MT test sets.

A.3 Preliminary experiments
Before starting the primary experiments for the IWSLT evaluation campaign, we conducted an array of preliminary tests, building on top of previous years’ submissions (Gállego et al., 2021; Tsiamas et al., 2022b). These explorations were intended to examine the impact of system configuration variations on the performance metrics on the MuST-C v2 dev set, such as BLEU (Papineni et al., 2002), chrF2 (Popović, 2017), and COMET (Rei et al., 2020). To ensure the robustness of our findings, we estimated statistical significance using the bootstrap resampling method (Koehn, 2004).
In our initial experiment, we examined the impact of various fine-tuning strategies used in our last years’ participations, specifically LNA (Li et al., 2021) and LNA-Adapters (Tsiamas et al., 2022b), in comparison to full fine-tuning. The goal was to verify whether these approaches inadvertently hurt the system’s performance. As demonstrated in Table 8, these strategies indeed had a detrimental effect, leading to reductions of 1.9 BLEU points when applied to both the encoder and the decoder. Consequently, we opted to adopt a conventional full fine-tuning strategy for subsequent experiments.
Following this, we conducted a comparative analysis of various speech encoders, including different variations of wav2vec 2.0 (Baevski et al., 2020; Xu et al., 2021b; Hsu et al., 2021b; Conneau et al., 2021), HuBERT (Hsu et al., 2021a), and SpeechLM (Zhang et al., 2022) (Table 9). Our baseline was the wav2vec 2.0 fine-tuned with pseudo-labels (Xu et al., 2021b), and intriguingly, most encoders exhibited a comparable level of performance. A marginal decrease was observed with the wav2vec 2.0 pretrained on a large pool of datasets (LV-60 + CV + SWBD + FSH) (Hsu et al., 2021b), and the multilingual version of wav2vec 2.0, XLSR (Conneau et al., 2021). The SpeechLM results were noticeably below expectations, leading us to suspect a bug in our implementation.
Upon noting that the hyperparameters were optimized for a specific speech encoder, we hypothesized that a reduction in the learning rate might boost HuBERT’s performance. However, as demonstrated in Table 11, the performance was adversely affected, prompting us to retain the original wav2vec 2.0 as the primary speech encoder due to the lack of substantial improvements offered by other alternatives.
Our focus then shifted towards examining the influence of varying regularization and data augmentation strategies on system performance (Table 10). We explored a range, from our traditionally used setup (base), to the one employed in the original foundation model fine-tuning, and a reduced version. Implementing the original regularization within the speech encoder, as opposed to the base variant, significantly boosted performance, leading
  Encoder      Decoder      BLEU      chrF2     COMET
  -            -            29.0      54.7      0.8001
  LNA          -            28.0 ∗    54.1 ∗    0.7949 ∗
  -            LNA          27.9 ∗    54.0 ∗    0.7882 ∗
  LNA          LNA          27.1 ∗    53.2 ∗    0.7800 ∗
  LNA-Adapt    -            28.2 ∗    54.3 ∗    0.7960 ∗
  -            LNA-Adapt    27.6 ∗    53.6 ∗    0.7889 ∗
  LNA-Adapt    LNA-Adapt    27.1 ∗    53.5 ∗    0.7847 ∗
  Learning Rate         BLEU      chrF2     COMET
  $5 \cdot 10^{-4}$     30.3      56.1      0.8099
  $2 \cdot 10^{-4}$     30.3      56.0      0.8069
  $1 \cdot 10^{-4}$     30.2      55.9      0.8085
  $5 \cdot 10^{-5}$     29.5 ∗    55.3 ∗    0.8047
Table 9: Speech encoders exploration with MuST-C v2 dev set (en-de). ∗ indicates significance w.r.t. baseline (1st
row). † uses LNA-Adapters (Tsiamas et al., 2022b). ‡ indicates a possible bug in our implementation.
Table 10: Variations of the regularization and data augmentation strategies, with MuST-C v2 dev set (en-de). ∗
indicates significance w.r.t. baseline (1st row).
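
The knowledge-distillation loss described in Appendix A.1 can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names are hypothetical, and three details are assumptions that the text does not state, namely that λ weights the KL term and (1 − λ) the cross-entropy, that the teacher distribution is renormalised over its stored top-8 entries, and that the temperature is applied only inside the KL term.

```python
import numpy as np

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over the last axis, with temperature."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kd_loss(student_logits, target_ids, teacher_topk_ids, teacher_topk_logits,
            lam=0.5, temperature=1.3):
    """Weighted sum of cross-entropy and KL(teacher || student).

    student_logits:      (steps, vocab) student scores
    target_ids:          (steps,) reference token ids
    teacher_topk_ids:    (steps, k) vocabulary ids kept from the teacher
    teacher_topk_logits: (steps, k) teacher scores at those ids
    """
    steps = np.arange(student_logits.shape[0])
    # Standard cross-entropy, no label smoothing, temperature 1.
    ce = -log_softmax(student_logits)[steps, target_ids]
    # Teacher distribution over its top-k entries (assumption: renormalised).
    t_logp = log_softmax(teacher_topk_logits, temperature)
    t_p = np.exp(t_logp)
    # Student log-probabilities with the same temperature, at the teacher's ids.
    s_logp = np.take_along_axis(
        log_softmax(student_logits, temperature), teacher_topk_ids, axis=-1)
    kl = (t_p * (t_logp - s_logp)).sum(axis=-1)
    # Assumption: lam weights the KL term, (1 - lam) the cross-entropy.
    return ((1.0 - lam) * ce + lam * kl).mean()
```

With λ = 0 the function reduces to plain cross-entropy, which is a convenient sanity check when wiring such a loss into a training loop.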
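
The sentence-pair filter described in Appendix A.2 (drop pairs with fewer than 4 characters or a source-to-target length ratio below 0.5 or above 2) can be sketched as follows. The function name is hypothetical, and two points are assumptions: lengths are measured in characters for the ratio as well, and the minimum length applies to both sides of the pair.

```python
def keep_pair(src, tgt, min_chars=4, min_ratio=0.5, max_ratio=2.0):
    """Return True if a (source, target) sentence pair passes the filter."""
    src, tgt = src.strip(), tgt.strip()
    # Assumption: the minimum length is checked on both sides.
    if len(src) < min_chars or len(tgt) < min_chars:
        return False
    # Assumption: the length ratio is computed over characters.
    ratio = len(src) / len(tgt)
    return min_ratio <= ratio <= max_ratio


pairs = [("Hello world", "Hallo Welt"), ("Hi", "Hallo"), ("a" * 40, "b" * 10)]
filtered = [p for p in pairs if keep_pair(*p)]
```

Only the first pair survives here: the second is too short and the third has a length ratio of 4.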
The Xiaomi AI Lab’s Speech Translation Systems for IWSLT 2023
Offline Task, Simultaneous Task and Speech-to-Speech Task
Wuwei Huang¹∗†  Mengge Liu²∗‡  Xiang Li¹  Yanzhi Tian²‡  Fengyu Yang¹
Wen Zhang¹  Yuhang Guo²  Jinsong Su³  Jian Luan¹  Bin Wang¹
¹ Xiaomi AI Lab, Beijing, China
² Beijing Institute of Technology, Beijing, China
³ Xiamen University, Xiamen, Fujian, China
{huangwuwei,lixiang21,yangfengyu1,zhangwen17,luanjian,wangbin11}@xiaomi.com
{liumengge,tianyanzhi,guoyuhang}@bit.edu.cn jssu@xmu.edu.cn
in the training set similar to the rules used in (Guo et al., 2022), following these steps:
• A series of hand-crafted rules are adopted to filter out noisy sentences from the training set. In particular, we discard sentences that contain less than 50% linguistic words. For Chinese sentences, Chinese characters are considered linguistic words; for English sentences, words containing only alphabet characters are considered linguistic words;
• We utilize fast_align¹¹ open source tool to exclude sentence pairs with a score lower than −8. We also apply the language identification (LangID) tool¹² to filter out sentence pairs that are neither in Chinese nor English;
• Duplicate sentence pairs are discarded, and any pairs with a length ratio greater than 3.0 or sentences with a length exceeding 200 are also filtered out.
To filter out noise data in the ST training set, we apply the following steps:
• Pairs that have an audio duration exceeding 60 seconds or a text length exceeding 200 tokens are excluded;
• We calculate the ratio of the number of speech frames to tokens in each sample, and remove samples whose ratio exceeds three times the average ratio.

2.2.2 Data Augmentation
To effectively train an end-to-end speech translation model, it is impractical to rely solely on hand-annotated training data, due to the scarcity of hand-annotated data. To mitigate this issue, we utilize a well-trained MT model to translate the transcriptions from ASR data and synthesize a large amount of pseudo-data, which has been widely used in the previous years’ competitions (Ding and Tao, 2021; Zhang and Ao, 2022; Zhang et al., 2022b; Li et al., 2022; Zhu et al., 2022).
We initially gather all available English-Chinese bilingual parallel sentence pairs from ST and MT tasks, as listed in Table 1. We then filter the data using the method mentioned in Section 2.2.1, generating 9M sentence pairs. These 9M sentence pairs are used to fine-tune the pre-trained one-to-many mBART50 model for 30 epochs. We further fine-tune mBART50 for another 30 epochs using MuST-C datasets to improve the domain adaptability of the model. The results are shown in Table 2.

  Models                                 BLEU
  mBART50 (one-to-many)                  25.81
  + domain fine-tuning on 9M corpus      28.41
  + domain fine-tuning on MuST-C         29.50

Table 2: The BLEU scores of MT models obtained by fine-tuning one-to-many mBART50 model using various bilingual datasets on the tst-COMMON test set.

In the Librispeech and TED-LIUM datasets, English sentences do not have punctuation or case information. We fine-tune the mBART50 model to add punctuation and restore case information to English sentences. Furthermore, samples already included in the CoVoST corpus are removed from the CommonVoice dataset. The transcriptions of the ASR data are then translated using the best fine-tuned mBART50 model and filtered using the same rules as the ST data in Section 2.2.1, resulting in a total of 1.6 million synthesized speech-to-text translation pairs.
Finally, for constrained data, we combine the hand-annotated ST corpus with the synthesized ST corpus to produce the final training corpus for the Offline-ST and Simul-ST models, yielding a total of 2.9 million speech-to-text translation pairs. In the case of unconstrained training on the offline track, we augment our training corpus with the GigaST corpus, resulting in 9 million speech-to-text translation pairs.

2.3 Cascaded S2ST Corpus
In the En⇒Zh speech-to-speech translation track, we leverage all available constrained data from the offline speech translation track as well as the GigaST corpus¹³ to train our offline speech translation model. This model is then followed by a TTS model that is trained on the AISHELL-3 and GigaS2S datasets.

2.4 Speech Segmentation
Since the speech in the evaluation set is not pre-segmented, we apply SHAS (Tsiamas et al., 2022) to segment the full speech into shorter segments. However, we observe two issues. Firstly, some segments have incomplete final words, which could negatively impact the performance of the ST model. To alleviate this problem, we add a few extra frames at the end of each segment to ensure that the final word is fully pronounced. Secondly, the speaking rate varies among different speakers or types of speeches, resulting in different amounts of words being spoken within a given time period. Excessive words in a speech segment may result in missing translations. We choose different hyperparameters for different speakers or different types of speeches.
¹¹ https://github.com/clab/fast_align
¹² https://github.com/saffsd/langid.py
¹³ https://st-benchmark.github.io/resources/GigaST.html

3 Methods
We build our Offline-ST system in an end-to-end manner (End-to-End Offline-ST) based on the HuBERT and mBART pre-trained models. Our simultaneous speech translation system (End-to-End Simul-ST) utilizes the same model architecture as the Offline-ST system and adopts wait-k and ITST strategies. The cascaded S2ST system involves an end-to-end speech-to-text translation model followed by a TTS model.

3.1 End-to-End Offline-ST System
The speech translation corpus typically consists of triples (x, z, y) that contain speech, transcription, and translation data, where $x = (x_1, \cdots, x_{|x|})$ represents a sequence of acoustic features, while $z = (z_1, \cdots, z_{|z|})$ and $y = (y_1, \cdots, y_{|y|})$ denote the corresponding transcription in the source language and translation in the target language, respectively. Our end-to-end Offline-ST system is based on an encoder-decoder architecture from the pre-trained HuBERT and mBART models. Figure 1 illustrates the architecture of our model, which consists of a speech encoder, a text encoder, and a text decoder. More specifically, the speech encoder is composed of a feature extractor based on convolutional neural networks (CNN), named CNN feature extractor and a 24-layer Transformer encoder. The CNN feature extractor is used to extract speech features from waveform, with 7 layers each containing 512 channels and kernel widths of [10, 3, 3, 3, 3, 2, 2] and strides of [5, 2, 2, 2, 2, 2, 2]. The Transformer encoder is derived from the standard Transformer (Vaswani et al., 2017) encoder, except for using CNN as the position encoder. The text encoder is a 12-layer standard Transformer encoder, and the text decoder is a 12-layer standard Transformer decoder. The training objective of our speech translation model can be formulated as:

$$\mathcal{L}(x, y; \theta_e, \theta_d) = -\sum_{t=1}^{|y|} \log p\left(y_t \mid y_{<t}, x; \theta_e, \theta_d\right) \tag{1}$$

where $\theta_e$ and $\theta_d$ represent the parameters of the encoder and the decoder, respectively.

[Figure 1 diagram: input speech ("Welcome to Xiaomi") → Speech Encoder (CNN Feature Extractor, Transformer Encoder; initialized from HuBERT) → Text Encoder → Text Decoder (initialized from mBART) → output text ("欢迎来到小米")]
Figure 1: The architecture of our end-to-end offline speech translation model consists of three components: speech encoder, text encoder, and text decoder. The speech encoder is composed of a CNN feature extractor and a 24-layer Transformer encoder with a CNN positional encoder. Both the text encoder and the text decoder are 12-layer standard Transformer structures. Note that the speech encoder is initialized with the pre-trained HuBERT model, and both the text encoder and text decoder are initialized with the pre-trained mBART model.

3.2 Cascaded S2ST System
In the cascaded S2ST system, we reuse the offline speech translation model discussed in Section 3.1 as the ST model. For the TTS model, we first train a base TTS model and vocoder using the AISHELL-3 dataset with the Tacotron2 (Shen et al., 2018)
open source framework. The final TTS model is obtained by fine-tuning the base model on the GigaS2S dataset.

3.3 End-to-End Simul-ST System
In order to take full advantage of the powerful capabilities of large pre-trained models, we develop an end-to-end Simul-ST system based on the HuBERT and mBART models. Furthermore, we employ two strategies, namely wait-k and ITST.

3.3.1 Wait-k
Ma et al. (2020b) adapts methods originally proposed for simultaneous machine translation to develop an end-to-end Simul-ST system. To achieve this, they employ the wait-k (Ma et al., 2019) strategy and a fixed pre-decision module. Under this approach, the system first reads k speech segments, each of which contains a fixed number (q, a hyperparameter in the pre-decision module) of speech frames. When k speech segments have been read, the decoder generates one token in the target language. Similarly, we also apply the wait-k strategy in the decoding process of our end-to-end offline-ST system, as it strikes a good balance between translation quality and latency without requiring any streaming strategy during training (Papi et al., 2022; Polák et al., 2022). During inference, once a speech segment is accepted, the decoder takes the following action:

$$\text{Action} = \begin{cases} \text{continue to read} & |x| - |y| < k \\ \text{output } y_t & |x| - |y| \ge k \end{cases} \tag{2}$$

where $y_t$ denotes the t-th token of the target language, while |x| and |y| refer to the number of source speech segments and target tokens, respectively.

3.3.2 ITST
The Information-Transport-based Simultaneous Translation (ITST) architecture has achieved state-of-the-art performance in end-to-end simultaneous speech translation. To implement this strategy, we initialize the corresponding parameters by using the pre-trained HuBERT and mBART models, and randomly initialize additional parameters for computing the information transport matrix. We then optimize the quality and latency objectives using the ITST criterion, varying the δ value to control the latency in streaming inference.
Our end-to-end speech translation system is built based on the ITST architecture, equipped with a wait-k streaming decoding strategy, and finally evaluated using the SimulEval (Ma et al., 2020a) toolkit. To ensure accurate translations, we enforce a constraint that the model should not produce the final translation until it has fully processed the speech in the source language.

3.4 Self-Training
Self-training is a simple semi-supervised learning method that involves using unlabeled data to augment labeled data (Pino et al., 2020; Sun et al., 2021; Wang et al., 2021; Popuri et al., 2022). To leverage the large-scale unlabeled audio introduced in Section 2.1, we employ self-training in our approach. In particular, we first train the end-to-end speech translation model on both manually annotated data and augmentation data, as described in Section 2. Next, we use the model to generate Chinese translation text, which we merge with the original training data and unlabeled audio. We then continue training the end-to-end speech translation model on this merged dataset.

3.5 Contrastive Learning
The objective of contrastive learning (Chen et al., 2020; Gao et al., 2021; Ye et al., 2022; Zhang et al., 2023) is to learn an encoder that produces similar representations for similar instances, while producing dissimilar representations for dissimilar instances, as measured by their cosine similarity. In our approach, we assume that the same utterance, regardless of whether it is in speech or text modality, will have similar hidden representations. Therefore, we aim to minimize the cosine distance between the hidden representations of the two modalities for the same utterance, while increasing the cosine distance between the hidden representations of different utterances. Specifically, we minimize the cosine distance between the speech encoder output and the corresponding word embedding for the same utterance, while maximizing the distance between the representations of different utterances. The training objective is as follows:

$$\mathcal{L}_{CTR} = -\sum_{t=1}^{N} \log \frac{\exp(\mathrm{sim}(u, v)/T)}{\sum_{X} \exp(\mathrm{sim}(u, v(x_j))/T)} \tag{3}$$

where u is the average state of the speech encoder output along the sequence length, v is the average word embedding, and T is the temperature hyperparameter. More specifically, $\mathcal{L}_{CTR}$ quantifies the negative logarithm of the probability that the similarity between u and v is greater than the similarity
between u and other candidate word embeddings Models BLEU
v(xj ). The probabilities are normalized using a 0 wav2vec2.0 (small) 23.84
softmax function over all candidate embeddings. 1 HuBERT + mBART50 (one-to-many) 27.74
In addition to contrastive learning, we also con- 2 + fine-tuning on MuST-C 27.90
duct multitask learning using labeled ASR and MT 3 + Self-Training 27.69
4 + Contrastive Learning 28.11
training data, which results in the final optimization
5 + fine-tuning on MuST-C 27.94
objective: 6 data2vec + mBART50 (one-to-many) 27.66
7 + fine-tuning on MuST-C 27.59
L = LST + LASR + LM T + LCT R (4) 8 Ensemble (2, 5) 27.79
9 Ensemble (2, 7) 27.61
10 Ensemble (2, 5, 7) 27.94
where LST , LASR , LM T , and LCT R denote the
losses for speech-to-text translation, ASR, MT, and
Table 3: The BLEU scores of ST models on the tst-
contrastive learning, respectively. COMMON test set.
4 Experiments

4.1 Experiment Settings

The fairseq toolkit14 is used to train our speech-to-text models. During training, the models take
the original waveform sampled at 16kHz as the input. The Adam optimizer (Kingma and Ba, 2015)
with a fixed learning rate of 5e-5 is used to train the models. Each model is trained for 200k steps,
and we save the model every 2.5k steps using an early stopping mechanism. In detail, if the BLEU
score on the development set does not improve for 10 consecutive checkpoints, the training will be
terminated. During the fine-tuning stage, we set the maximum number of updates to 50k and the
learning rate to 2e-5. Our TTS model is implemented using the Tacotron2 toolkit15.

4.2 Evaluation

As the official automatic evaluation criterion, the BLEU score (Papineni et al., 2002) is used to
evaluate the translation quality of all our systems. For the Simul-ST system, we employ the average
lag (AL) (Ma et al., 2019, 2020b) metric to measure the translation latency, which is a standard
metric for simultaneous speech translation. The SimulEval open-source toolkit16 is utilized to
calculate both the BLEU and AL metrics for the Simul-ST system. All BLEU scores are calculated
with the SacreBLEU17 (Post, 2018) toolkit at the character level.

4.3 Main Results

Offline En⇒Zh Speech Translation

We evaluate our offline-ST models on the tst-COMMON test set by reporting the BLEU score in
accordance with the official evaluation criteria. To establish a baseline for comparison, we use
the widely-used standard wav2vec2.0 model for speech translation tasks. Table 3 shows the
comparison results among all models. Our end-to-end models exhibit a significant improvement of
approximately 4 BLEU points over the wav2vec2.0 baseline, which demonstrates the effectiveness of
our methods. Additionally, we also conduct experiments using data2vec (Baevski et al., 2022)
pre-trained model and obtain comparable results on the tst-COMMON test set.

By analyzing our experimental results, we observe that domain fine-tuning does not significantly
improve the performance of the model. Nevertheless, we believe domain fine-tuning will be
beneficial for final human evaluation on the TED18 test set. Our final submission is an ensemble
of the models listed in rows 2, 5, and 7 of Table 3.

It is worth mentioning that we encounter some challenges when training our model. When the
HuBERT model is used to initialize our model, instabilities are observed during training, with
sudden gradient explosions leading to training collapse. After careful analysis, we determine that
the problem is that the gradients of the CNN layers are relatively large during the entire training
process. We address this issue by scaling down the gradients of the CNN layers.

14 https://github.com/pytorch/fairseq
15 https://github.com/NVIDIA/tacotron2
16 https://github.com/facebookresearch/SimulEval
17 https://github.com/mjpost/sacrebleu
18 https://www.ted.com/
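
The early-stopping rule described in Section 4.1 can be illustrated with a minimal sketch. It assumes one development-set BLEU score per saved checkpoint (one checkpoint every 2.5k steps) and a patience of 10 checkpoints; the function name `checkpoints_until_stop` and the example scores are hypothetical and are not taken from the authors' training configuration.

```python
from typing import Sequence


def checkpoints_until_stop(dev_bleu: Sequence[float], patience: int = 10) -> int:
    """Return the 1-based index of the checkpoint at which training stops.

    Training stops once the dev BLEU has failed to improve on the best
    score seen so far for `patience` consecutive checkpoints. If that
    never happens, all checkpoints are used.
    """
    best = float("-inf")
    stale = 0  # consecutive checkpoints without improvement
    for index, score in enumerate(dev_bleu, start=1):
        if score > best:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return index
    return len(dev_bleu)


# Hypothetical BLEU curve: two improving checkpoints, then a plateau.
scores = [20.0, 21.0] + [20.5] * 10 + [25.0]
stop_at = checkpoints_until_stop(scores, patience=10)
steps_trained = stop_at * 2500  # one checkpoint every 2.5k steps
```

In this toy curve the late improvement at the last checkpoint is never reached, which is the usual cost of a patience-based criterion.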
   Models                 BLEU
1  Offline-ST             30.10
2  Offline-ST + GigaST    31.56
3  Ensemble (1, 2)        31.81

Table 4: BLEU scores of our ST models on the development set of the S2ST track in IWSLT 2023.
Offline-ST is trained on all manually annotated data and the augmented data described in
Section 2.2.2. In addition to the data used by the offline-ST model, the Offline-ST + GigaST model
incorporates additional GigaST data.

   Models                 ASR-BLEU
1  Offline-ST             28.88
2  Offline-ST + GigaST    30.10
3  Ensemble (1, 2)        30.18

Table 5: ASR-BLEU scores of our ST models on the development set of the S2ST track in IWSLT 2023.
The models are identical to those presented in Table 4.

   Strategies   Models          BLEU    AL
1  Wait-k       HuBERT+mBART    25.99   1980
2  Wait-k       + ST & CL       26.59   1966
3  ITST         HuBERT+mBART    26.25   1906

Table 6: The evaluation results of Simul-ST models on tst-COMMON. ST and CL denote self-training
and contrastive learning for the Offline-ST model.

6000ms, the model performs a WRITE action to predict the next target token.

We evaluate the wait-k strategy using models 1 and 4 in Table 3, and train the ITST model with
the same configuration as model 1 in Table 3. The results of the Simul-ST models are presented in
Table 6. Although ITST shows better performance than wait-k in the same setting, the wait-k
strategy combined with self-training and contrastive learning can achieve better results.
Therefore, we finally submit the system corresponding to the second row in Table 6.
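
For reference, the wait-k policy (Ma et al., 2019) compared above can be sketched at the level of abstract source and target units: the system first reads k source units, then alternates between writing one target token and reading one more source unit, and writes the remaining tokens once the source is exhausted. The sketch below is illustrative only. The function name `wait_k_actions` and the unit-level granularity are assumptions; a speech system would read audio chunks rather than text tokens.

```python
from typing import List


def wait_k_actions(k: int, src_len: int, tgt_len: int) -> List[str]:
    """Generate the READ/WRITE action sequence of a wait-k policy.

    Before writing target token number t (0-based), the policy requires
    min(k + t, src_len) source units to have been read.
    """
    actions: List[str] = []
    read, written = 0, 0
    while written < tgt_len:
        required = min(k + written, src_len)
        if read < required:
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions


# Toy example with k=2, four source units, and four target tokens.
example = wait_k_actions(k=2, src_len=4, tgt_len=4)
```

A larger k moves the policy toward offline translation: more context is read before each WRITE, which typically raises both quality and latency.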
References

Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Vĕra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nǎdejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 evaluation campaign. In Proc. of IWSLT.

Antonios Anastasopoulos, Ondřej Bojar, Jacob Bremerman, Roldano Cattoni, Maha Elbayad, Marcello Federico, Xutai Ma, Satoshi Nakamura, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Alexander Waibel, Changhan Wang, and Matthew Wiesner. 2021. Findings of the IWSLT 2021 evaluation campaign. In Proc. of IWSLT.

Ebrahim Ansari, Amittai Axelrod, Nguyen Bach, Ondřej Bojar, Roldano Cattoni, Fahim Dalvi, Nadir Durrani, Marcello Federico, Christian Federmann, Jiatao Gu, Fei Huang, Kevin Knight, Xutai Ma, Ajay Nagesh, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Xing Shi, Sebastian Stüker, Marco Turchi, Alexander Waibel, and Changhan Wang. 2020. Findings of the IWSLT 2020 evaluation campaign. In Proc. of IWSLT.

Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. 2022. data2vec: A general framework for self-supervised learning in speech, vision and language. In Proc. of ICML.

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proc. of NIPS.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In Proc. of ICML.

Liang Ding and Dacheng Tao. 2021. The USYD-JD speech translation system for IWSLT2021. In Proc. of IWSLT.

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In Proc. of EMNLP.

Bao Guo, Mengge Liu, Wen Zhang, Hexuan Chen, Chang Mu, Xiang Li, Jianwei Cui, Bin Wang, and Yuhang Guo. 2022. The Xiaomi text-to-text simultaneous speech translation system for IWSLT 2022. In Proc. of IWSLT.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. In Proc. of TASLP.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proc. of ICLR.

Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Sravya Popuri, Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Tang, Juan Pino, and Wei-Ning Hsu. 2022. Direct speech-to-speech translation with discrete units. In Proc. of ACL.

Yinglu Li, Minghan Wang, Jiaxin Guo, Xiaosong Qiao, Yuxia Wang, Daimeng Wei, Chang Su, Yimeng Chen, Min Zhang, Shimin Tao, Hao Yang, and Ying Qin. 2022. The HW-TSC’s offline speech translation system for IWSLT 2022 evaluation. In Proc. of IWSLT.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. In Proc. of TACL.

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proc. of ACL.

Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, and Juan Pino. 2020a. SimulEval: An evaluation toolkit for simultaneous translation. In Proc. of EMNLP.

Xutai Ma, Juan Pino, and Philipp Koehn. 2020b. SimulMT to SimulST: Adapting simultaneous text translation to end-to-end simultaneous speech translation. In Proc. of AACL/IJCNLP.

Sara Papi, Marco Gaido, Matteo Negri, and Marco Turchi. 2022. Does simultaneous speech translation need simultaneous models? In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 141–153. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proc. of ACL.

Juan Pino, Qiantong Xu, Xutai Ma, Mohammad Javad Dousti, and Yun Tang. 2020. Self-training for end-to-end speech translation. In Proc. of Interspeech.

Peter Polák, Ngoc-Quan Pham, Tuan-Nam Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, Ondrej Bojar, and Alexander Waibel. 2022. CUNI-KIT system for simultaneous speech translation task at IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation, IWSLT@ACL 2022, Dublin, Ireland (in-person and online), May 26-27, 2022, pages 277–285. Association for Computational Linguistics.

Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Pino, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, and Ann Lee. 2022. Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation. In Proc. of Interspeech.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proc. of WMT.

Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, R. J. Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu. 2018. Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions. In Proc. of ICASSP.

Matthias Sperber and Matthias Paulik. 2020. Speech translation and the end-to-end promise: Taking stock of where we are. In Proc. of ACL.

Hao Zhang, Nianwen Si, Yaqi Chen, Wenlin Zhang, Xukui Yang, Dan Qu, and Wei-Qiang Zhang. 2023. Improving speech translation by cross-modal multi-grained contrastive learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Shaolei Zhang and Yang Feng. 2022. Information-transport-based policy for simultaneous translation. In Proc. of EMNLP.

Weitai Zhang, Zhongyi Ye, Haitao Tang, Xiaoxi Li, Xinyuan Zhou, Jing Yang, Jianwei Cui, Pan Deng, Mohan Shi, Yifan Song, Dan Liu, Junhua Liu, and Lirong Dai. 2022b. The USTC-NELSLIP offline speech translation systems for IWSLT 2022. In Proc. of IWSLT.

Ziqiang Zhang and Junyi Ao. 2022. The YiTrans speech translation system for IWSLT 2022 offline shared task. In Proc. of IWSLT.

Qinpei Zhu, Renshou Wu, Guangfeng Liu, Xinyu Zhu, Xingyu Chen, Yang Zhou, Qingliang Miao, Rui Wang, and Kai Yu. 2022. The AISP-SJTU simultaneous translation system for IWSLT 2022. In Proc. of IWSLT.
Improving Formality-Sensitive Machine Translation using
Data-Centric Approaches and Prompt Engineering
Seungjun Lee1, Hyeonseok Moon1, Chanjun Park1,2, Heuiseok Lim1∗
1 Korea University, South Korea
2 Upstage, South Korea
{dzzy6505, glee889, bcj1210, limhseok}@korea.ac.kr
chanjun.park@upstage.ai
model, which has been specifically designed for Vietnamese language tasks. Notably, EnViT5
models outperformed existing multilingual models such as mBART and M2M-100 while maintaining a
significantly smaller parameter size, making them scalable and promising for both academic and
industry applications (Ngo et al., 2022).

EnViT5 was pre-trained with the CC100 Dataset (Wenzek et al., 2020) which comprises monolingual
data for over 100 languages. Subsequently, EnViT5 was fine-tuned on the MTet (Ngo et al., 2022)
and PhoMT (Doan et al., 2021) datasets. MTet is a multi-domain EN-VI machine translation dataset
encompassing a diverse range of domains, including educational videos, software user interfaces,
COVID-related news articles, religious texts, subtitles, Wikipedia, and TED Talks (Reimers and
Gurevych, 2020). Ultimately, when combined with PhoMT and IWSLT’15 (Cettolo et al., 2015), the
final MTet dataset expands the training set size to 6 million examples, covering previously
neglected areas such as law and biomedical data, which contains monolingual data for over
100 languages.

ples are sourced from the target language’s training set and include both informal and formal
levels. ChatGPT is then tasked with translating the input text into either an informal or formal
target language, depending on the specified prompt. For the input text, we use English source
sentences from the IWSLT 22 Formality Track’s other language pairs. After filtering the translated
examples using a formality classifier, we fine-tuned the respective PLMs for EN-KO and EN-VI by
incorporating synthetic examples into the training sets for each language pair. To verify the
effectiveness of data augmentation through prompt engineering, we conduct experiments comparing
the results with and without the augmented data.

Language   Size (Train)   Size (Test)
EN-KO      400            600
EN-VI      400            600
EN-PT      0              600
EN-RU      0              600

Table 2: Data statistics in train and test sets of Formality Dataset
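
The classifier-based filtering step mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `filter_by_formality` and `toy_classifier` are hypothetical names, and the toy rule merely stands in for a real formality classifier.

```python
from typing import Callable, Iterable, List, Tuple

# (source sentence, synthetic translation, requested formality label)
Example = Tuple[str, str, str]


def filter_by_formality(
    examples: Iterable[Example],
    classify: Callable[[str], str],
) -> List[Example]:
    """Keep only examples whose translation is classified with the requested label."""
    return [ex for ex in examples if classify(ex[1]) == ex[2]]


def toy_classifier(text: str) -> str:
    """Hypothetical stand-in for a formality classifier (assumption, for illustration)."""
    return "formal" if "please" in text.lower() else "informal"


synthetic = [
    ("Sit down.", "Please take a seat.", "formal"),
    ("Sit down.", "Sit.", "formal"),           # label mismatch: dropped
    ("Thanks.", "Thanks a lot!", "informal"),
]
kept = filter_by_formality(synthetic, toy_classifier)
```

Passing the classifier as a callable keeps the filter independent of any particular model, so the same function can be reused per language pair.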
with novel language pair and formality combinations.

3 Experiment Settings

3.1 Dataset Details

The IWSLT shared task provides the Formality Dataset, which contains English source segments, each
accompanied by two contrasting reference translations representing informal and formal formality
levels. This is available for two language pairs, EN-{KO, VI}, in the supervised setting and two
additional language pairs, EN-{PT, RU}, in the zero-shot setting. The statistics for the train and
test sets of the dataset are shown in Table 2.

For training and testing purposes, we randomly sampled 50 pairs of examples across each domain
from the train set of Formality Dataset, and set them aside as validation sets (TASK DEV) for each
supervised language. The remaining samples were utilized for training (TASK TRAIN).

Additionally, we utilized external datasets in conjunction with the data provided in the shared
task. For EN-KO, we employed a parallel corpus comprising Formal/Informal, Social Science,
Technology Science, and News domains from AI Hub for the pretraining of the PLM. For EN-VI, we
utilized EnViT5, which was fine-tuned using the MTet (Ngo et al., 2022) and PhoMT (Doan et al.,
2021) datasets.

In our research, we leverage ChatGPT for the augmentation of the EN-KO and EN-VI data and the
generation of synthetic examples for fine-tuning on EN-PT and EN-RU. This was done by using the
source data from all available English-other language pairs (EN-XX) in the IWSLT’22 Formality
Track (Anastasopoulos et al., 2022). To secure the quality and uniqueness of our training set, we
implemented a preprocessing step that excludes duplicate sentences. Furthermore, to determine the
optimal hyperparameters, we conducted a case study utilizing TASK DEV (details can be found in
Section 4.3). The hyperparameters that led to the highest Matched-Accuracy (M-Acc) were selected
for use. For all language pairs, we utilized a temperature of 0.9; specifically, we implemented
4-shot learning for EN-KO and 2-shot learning for EN-VI. For EN-PT and EN-RU, we proceeded with a
zero-shot setting. More detailed information regarding the datasets and the preprocessing steps
is presented in Table 3.

Language       Size    Source
EN-KO          6M      AI Hub (Formal/Informal + Tech/Sci + Social/Sci + News)
EN-VI          6.2M    MTet (Ngo et al., 2022) + PhoMT (Doan et al., 2021)
EN-{PT, RU}    1.6K    EN source from IWSLT’22 (Anastasopoulos et al., 2022)

Table 3: Additional external datasets used for the formality track in various language pairs.

3.2 Training Details

For the EN-KO language pair, we applied a morpheme-aware tokenization method to the translation
dataset. To achieve this, we followed the training methods proposed by Park et al. (2020) and
Gowda and May (2020), using MeCab-ko and Unigram to construct a vocabulary of 48K tokens. We then
pre-trained the Transformer model (Vaswani et al., 2017). We used the fairseq library with 12
encoder and 12 decoder layers, each having 16 attention heads. Both encoder and decoder had an
embedding dimension of 1024 and a feed-forward network (FFN) dimension of 4096. During
pre-training, we trained for 20 epochs with a learning rate of 5e-4 and 4000 warmup updates. For
fine-tuning, we trained for 200 epochs using a learning rate of 4e-5 and 100 warmup updates. We
fine-tuned using the TASK TRAIN for all language pairs.

For EN-{VI, PT, RU} pairs, we fine-tuned using the huggingface library. For EN-VI, we used the
VietAI/envit5-translation as the PLM. Fine-tuning was performed for 200 epochs with a learning
rate of 4e-5, 200 warmup steps, and a batch size of 64. For EN-{PT,RU} pairs, we used
facebook/mbart-large-50 and trained for 200 epochs with a learning rate of 3e-5, 100 warmup
steps, and a batch size of 16. All models were trained using four RTX A6000 GPUs. Detailed
hyperparameters and training information can be found in Appendix B.

3.3 Evaluation Details

In our experimental setting, we used the official test set from Formality Dataset (IWSLT’23) to
evaluate our translation model’s performance. The evaluation was conducted across two dimensions:
overall translation quality and formality control. To assess the overall translation quality, we
employed BLEU (Papineni et al., 2002) and COMET (Rei et al., 2020) (eamt22-cometinho-da) as
automatic evaluation metrics. We use the 13a tokenizer to report SacreBLEU (Post, 2018) scores for
all languages.

For formality control, we utilized Matched-Accuracy (M-Acc), a reference-based corpus-level
metric that leverages phrase-level formality markers from the references to classify
system-generated hypotheses as formal or informal. The corpus-level score is the percentage of
system outputs that match the desired formality level.

Additionally, we used a reference-free variant of M-Acc (C-F)4, which relies on a multilingual
formality classifier to label system-generated hypotheses as formal or informal, with the
corpus-level score representing the percentage of system outputs matching the desired formality
level.

3.4 Prompt Design

We conducted experiments using ChatGPT with GPT-4 engine with langchain5. For EN-KO and EN-VI
language pairs, we used a supervised setting, while for EN-PT and EN-RU pairs, we employed a
zero-shot setting. In the supervised setting, we extracted arbitrary n-shot samples using the
TASK TRAIN. We designed prompts by leveraging langchain’s prompt guide and prompt examples from
Hendy et al. (2023). Detailed examples and explanations of the prompts can be found in
Appendix A.

4 https://github.com/amazon-science/contrastive-controlled-mt/tree/main/IWSLT2023
5 https://python.langchain.com/

                             EN-KO                            EN-VI
        METHOD               BLEU   COMET  %M-ACC  %C-F      BLEU   COMET  %M-ACC  %C-F
Formal  Official Baseline    4.91   0.211  78.3    98.6      26.71  0.363  96.0    99.7

Table 4: Results on the test set of Formality Dataset for formal and informal supervised settings,
obtained via our language specialized data-centric approach.

                             EN-PT                            EN-RU
        METHOD               BLEU   COMET  %M-ACC  %C-F      BLEU   COMET  %M-ACC  %C-F
Formal  Official Baseline    27.29  0.448  96.3    97.7      21.96  0.349  96.2    92.0

Table 5: Results on the test set of Formality Dataset for formal and informal zero-shot settings,
achieved through our approach of synthetic data generation via prompt engineering.

4 Result & Findings

4.1 Results for Supervised Setting

Table 4 presents our experimental results in the supervised setting. As demonstrated by our
results, our model, trained with the high-quality human-annotated Formality Dataset, exhibited
outstanding performance. In particular, with respect to the C-F metric, our model shows almost
perfect formality control performance (100% accuracy) for most of the tasks, except for the EN-KO
informal task. Additionally, our model shows superior performance for the conventional NMT
metrics (i.e., BLEU and COMET), outperforming ChatGPT with a 21.50 BLEU score for the EN-KO
informal task. The EN-VI pair also exhibits high NMT metric
424
scores, M-Acc, and C-F scores compared to the baseline. These results suggest that our
language-specific data-centric approach is effective.

[Figure 1 panels: BLEU (top) and M-Acc (bottom) versus number of shots (1, 2, 4, 8, 16, 32) for
the Formal and Informal settings; left: EN-KO (temperature=0.5), right: EN-VI (temperature=0.9).]

Figure 1: BLEU and M-Acc scores for ChatGPT based on supervised setting, evaluated on TASK DEV.

Through our experiments, we observed a significant degradation in quality for the supervised
settings EN-{KO, VI} when ChatGPT-augmented data was added. This phenomenon can be attributed to
the limitations of synthetic data produced by ChatGPT. While the data generated through ChatGPT
exhibits considerable quality, it was not up to par with the sentences derived from our
data-centric approach. We found that the integration of ChatGPT-augmented data inadvertently
introduced noise into the system, leading to a decrease in overall performance. Despite the
exceptional capabilities of ChatGPT, it appears that in this context, the quality of data
augmented by conventional NMT methods is still superior. This observation further emphasizes the
critical role of data quality over quantity in supervised learning environments, and highlights
the potential benefits of more sophisticated prompting techniques that consider formality
control, such as stylistic or sentence endings, for improving overall performance.

4.2 Results for Zero-shot Setting

The experimental results for the zero-shot setting are shown in Table 5. As can be seen from the
experimental results, our model significantly outperforms the official baseline on all tasks
except the EN-PT informal task. Notably, our model demonstrates consistently higher performance
in terms of the C-F metric compared to ChatGPT, achieving 100% M-Acc and C-F in the majority of
tasks.

The EN-PT informal task is an exception: the performance of our model is markedly subpar, and
ChatGPT even fails to exceed the official baseline. We find this result highly noteworthy, as it
suggests that ChatGPT may generate semantically accurate and plausible data, while the formality
can hardly be controlled, especially for the EN-PT language pair. In our experiments, we utilized
the same prompt for both EN-PT and EN-RU language pairs, differing only in language
specification. The disparity in results between these two language pairs suggests that
specialized techniques for controlling formality are required for each language pair. This issue
can be partially attributed to a data bias in ChatGPT, indicating a potential training data bias
concerning formality.

4.3 Case Study

Impact of In-context Shots. In this section, we examine the changes in performance based on the
number of few-shot samples used for in-context learning, particularly when employing prompt
engineering for translation. Previous research suggests that increasing the number of shots
beyond 10 does not significantly impact translation performance when using large language models
(Zhang et al., 2023). However, we argue that applying the same perspective to formality control
tasks proves challenging. This complexity arises as formality introduces a unique element
required for these tasks. Additionally, previous research did not consider unintended
consequences arising from this factor.

In pursuit of this, we conducted experiments where the number of shots was incrementally
increased from 1 to 32, in powers of 2, using TASK DEV. The aim was to verify the differences in
performance resulting from these changes. This process involved translating data via ChatGPT with
an increasing number of shots and then evaluating the resulting translation data for its
appropriateness. The experimental results are depicted in Figure 1. For this particular
experiment, we selected one temperature (from the options of 0.2, 0.5, 0.7, 0.9) that
demonstrated the highest performance and evaluated the changes in performance based on the
number of shots.

As observed in our experimental results, increasing the number of shots for in-context learning
led to an improvement in the general translation performance metric, BLEU. However, for M-Acc and
C-F, we found that the best performance was achieved with a smaller number of shots. This
suggests that the nature of formality as a feature makes the “formality control” task distinct
from conventional NMT, and it may be challenging to directly apply perspectives from conventional
NMT to this task. We propose two hypotheses based on these results: (i) there exists a trade-off
between translation performance and formality control as the number of shots increases, and (ii)
increasing the number of shots while applying random sample selection may have caused confusion
in performing formality control. We leave the analysis and validation of these hypotheses for
future work.

[Figure: BLEU (top) and M-Acc (bottom) versus temperature (0.2, 0.5, 0.7, 0.9) for the Formal and
Informal settings, two panels each; caption and panel titles not recovered.]

Impact of Temperature. Temperature is an important parameter that makes ChatGPT generate varied
responses to human queries (Peng et al., 2023). Basically, higher temperatures lead to higher
linguistic variety, while lower ones generate grammatically correct and deterministic text
(Ippolito et al., 2019). Previous work suggested that for machine translation, a diverse
generation may
impede its translation quality with a high degree of certainty (i.e., high temperature) (Peng
et al., 2023). In this sense, we experiment with different temperature settings and find the
optimal temperature for the formality control data augmentation. In our experiments, we select
the most appropriate one among six shot-candidates (1, 2, 4, 8, 16, 32) for each language pair.

Experimental results reveal that varying temperature can lead to significant performance
fluctuations. It is particularly noteworthy that the performance disparity due to temperature
changes is exceptionally high for the informal tasks. For formal tasks, the impact of temperature
is relatively minor, with the variation in BLEU score at most 0.95 (EN-RU). However, for informal
tasks, the performance shift can reach up to 4.82 points (EN-RU) as temperature changes.
Additionally, we find that in the informal task, the performance variation depending on the
temperature shows a distinct trend for each language pair. This is evident from the fact that a
moderate temperature (0.7) yielded the highest BLEU performance in the EN-PT informal task, while
a similarly moderate temperature (0.5) resulted in the lowest performance. Our findings suggest
that handling ChatGPT in the informal task necessitates more elaborate control compared to
dealing with formal data.

5 Background

centric approaches in NMT, aiming to improve translation quality and overcome the limitations of
low-resource languages.

6 Conclusion

In this paper, we presented the KU x UpStage team’s submission for four languages, employing two
main strategies: 1) a language-specific data-driven approach, and 2) synthetic data generation
using large-scale language models and empirical prompt engineering. While our data-driven
approach excelled, particularly in EN-KO and EN-VI, the quality of synthetic data generation was
called into question. In light of this feedback, we propose to enhance the quality of synthetic
data by integrating Quality Estimation (QE) techniques as an additional filter in the generation
process. This step aims to further refine our synthetic examples, potentially improving the
overall system performance. We also plan to explore the use of translation models with larger
parameters and conduct a thorough analysis through more shot examples and linguistically-grounded
data augmentation techniques. Finally, we aim to extend our understanding of factors influencing
FSMT performance, such as the impact of formal register versus grammatical formality in training
data and a detailed examination of zero-shot transfer.
Maria Nădejde, Anna Currey, Benjamin Hsu, Xing Niu, Marcello Federico, and Georgiana Dinu. 2022. Cocoa-mt: A dataset and benchmark for contrastive controlled mt with application to formality. arXiv preprint arXiv:2205.04022.

Chinh Ngo, Trieu H Trinh, Long Phan, Hieu Tran, Tai Dang, Hieu Nguyen, Minh Nguyen, and Minh-Thang Luong. 2022. Mtet: Multi-domain translation for english and vietnamese. arXiv preprint arXiv:2210.05610.

Xing Niu, Marianna Martindale, and Marine Carpuat. 2017. A study of style in machine translation: Controlling the formality of machine translation output. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2814–2819, Copenhagen, Denmark. Association for Computational Linguistics.

OpenAI. 2023. Gpt-4 technical report.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Chanjun Park, Sugyeong Eo, Hyeonseok Moon, and Heui-Seok Lim. 2021. Should we find another model?: Improving neural machine translation performance with one-piece tokenization method without model modification. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers, pages 97–104.

Chanjun Park, Midan Shim, Sugyeong Eo, Seolhwa Lee, Jaehyung Seo, Hyeonseok Moon, and Heuiseok Lim. 2022. Empirical analysis of parallel corpora and in-depth analysis using liwc. Applied Sciences, 12(11):5545.

Kyubyong Park, Joohong Lee, Seongbo Jang, and Dawoon Jung. 2020. An empirical study of tokenization strategies for various korean nlp tasks. arXiv preprint arXiv:2010.02534.

Keqin Peng, Liang Ding, Qihuang Zhong, Li Shen, Xuebo Liu, Min Zhang, Yuanxin Ouyang, and Dacheng Tao. 2023. Towards making the most of chatgpt for machine translation. arXiv preprint arXiv:2303.13780.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. Comet: A neural framework for mt evaluation. arXiv preprint arXiv:2009.09025.

Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. arXiv preprint arXiv:2004.09813.

Elijah Rippeth, Sweta Agrawal, and Marine Carpuat. 2022. Controlling translation formality using pre-trained multilingual language models. arXiv preprint arXiv:2205.06644.

Elizabeth Salesky, Marcello Federico, and Marta Costa-jussà, editors. 2022. Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022). Association for Computational Linguistics, Dublin, Ireland (in-person and online).

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015a. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015b. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Felix Stahlberg. 2020. Neural machine translation: A review. Journal of Artificial Intelligence Research, 69:343–418.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2020. CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4003–4012, Marseille, France. European Language Resources Association.

Biao Zhang, Barry Haddow, and Alexandra Birch. 2023. Prompting large language model for machine translation: A case study. arXiv preprint arXiv:2301.07069.

Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. arXiv preprint arXiv:1604.02201.
A Prompt Template
A.1 Supervised Setting
####
[shot 1 source]
[shot 2 source]
[shot n source]
####
Translate this into only [1. Informal | 2. Formal] [target language]: [input]
Figure 3: Prompt template for supervised setting based on Hendy et al. (2023). We utilize n randomly selected
shots from the English training set of other language pairs in the IWSLT 23 Formality Track as input for our
model, with few-shot examples derived from the target language’s training set.
[shot n source]
Translate this into only [1. Informal | 2. Formal] [target language]: [input]
Figure 4: Prompt template for zero-shot setting, following the recommended instruction and format for the default
sentence-level translation task in OpenAI playground6 . This consistency enables us to maximize the benefits of the
instruction finetuning protocol. We use n random shots from the training set.
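As a minimal sketch of how the templates in Figures 3 and 4 could be filled programmatically, the Python function below assembles a prompt from an input sentence, a target language, a formality label and an optional list of few-shot examples. The function name, its arguments, and the reading of "[1. Informal | 2. Formal]" as a choice between the two words "Informal" and "Formal" are assumptions made for illustration; the source does not specify how the prompts were generated.

```python
from typing import Sequence

SHOT_DELIMITER = "####"  # delimiter shown around the few-shot block in Figure 3


def build_prompt(
    text: str,
    target_language: str,
    formality: str,
    shots: Sequence[str] = (),
) -> str:
    """Fill the prompt template (hypothetical helper, not the authors' code).

    With an empty `shots` sequence only the instruction line is returned.
    """
    if formality not in ("Informal", "Formal"):
        raise ValueError("formality must be 'Informal' or 'Formal'")
    instruction = f"Translate this into only {formality} {target_language}: {text}"
    if not shots:
        return instruction
    # Few-shot examples are wrapped in the delimiter, one example per line.
    block = [SHOT_DELIMITER, *shots, SHOT_DELIMITER]
    return "\n".join(block + [instruction])


if __name__ == "__main__":
    print(build_prompt("How are you?", "Korean", "Formal", shots=["example 1", "example 2"]))
```

Keeping the instruction line identical with and without shots means the same function can serve both settings.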
B Experimental Setup
B.1 EN-KO
In the experimental setup for the EN-KO language pair, we employed a Transformer architecture with
shared decoder input-output embeddings. The model’s parameters included 1024-dimensional embeddings
for both encoder and decoder, 16 attention heads for each, and 12 layers for both encoder and decoder.
We used the Adam optimizer with beta values (0.9, 0.98) and a learning rate of 5e-4 scheduled by an
inverse square root scheduler with a 4000-step warm-up. To prevent overfitting, we applied a dropout rate
of 0.3 and weight decay of 0.0001. Our translation task utilized a label-smoothed cross-entropy criterion
with a label smoothing factor of 0.1. The training process was performed with a maximum token limit
of 4096 per batch and an update frequency of 4. Model performance was evaluated using BLEU scores
with a beam size of 1 and detokenization using the Moses tokenizer. The training process was executed
for a maximum of 20 epochs with a log interval of 200 and without epoch checkpoints, while sharing all
embeddings.
Parameters for pre-training:
fairseq-train \
--fp16 \
--fp16-init-scale 4096 \
--arch transformer --share-decoder-input-output-embed \
--encoder-embed-dim 1024 --decoder-embed-dim 1024 \
--encoder-attention-heads 16 --decoder-attention-heads 16 \
--encoder-ffn-embed-dim 4096 --decoder-ffn-embed-dim 4096 \
--encoder-normalize-before --decoder-normalize-before \
--encoder-layers 12 --decoder-layers 12 \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--dropout 0.3 --weight-decay 0.0001 \
--task translation \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 4096 \
--update-freq 4 \
--eval-bleu \
--eval-bleu-args '{"beam": 1, "max_len_a": 1.2, "max_len_b": 10}' \
--eval-bleu-detok moses \
--eval-bleu-remove-bpe \
--best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
--log-interval 200 \
--max-epoch 20 \
--skip-invalid-size-inputs-valid-test \
--no-epoch-checkpoints \
--share-all-embeddings
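The schedule and batch settings in the command above can be illustrated with a short Python sketch. It assumes the common inverse square root formulation (linear warm-up to the peak rate, then decay proportional to the inverse square root of the update number) and a warm-up starting rate of 0, which the command does not specify. The function names are illustrative and are not part of fairseq.

```python
import math

PEAK_LR = 5e-4          # --lr
WARMUP_UPDATES = 4000   # --warmup-updates
WARMUP_INIT_LR = 0.0    # assumption: not given in the command above
MAX_TOKENS = 4096       # --max-tokens (upper bound per batch)
UPDATE_FREQ = 4         # --update-freq (gradient accumulation)


def inverse_sqrt_lr(step: int) -> float:
    """Learning rate at a given update under the assumed schedule."""
    if step < WARMUP_UPDATES:
        # Linear warm-up from WARMUP_INIT_LR to PEAK_LR.
        return WARMUP_INIT_LR + (PEAK_LR - WARMUP_INIT_LR) * step / WARMUP_UPDATES
    # After warm-up the rate decays with the inverse square root of the step.
    return PEAK_LR * math.sqrt(WARMUP_UPDATES / step)


def max_tokens_per_update(num_gpus: int = 1) -> int:
    """Upper bound on tokens contributing to one optimizer update."""
    return MAX_TOKENS * UPDATE_FREQ * num_gpus


if __name__ == "__main__":
    for step in (1000, 4000, 16000):
        print(step, inverse_sqrt_lr(step))
    print(max_tokens_per_update())
```

Under these assumptions the rate peaks at update 4000 and has halved by update 16000, and each optimizer update sees at most 4096 × 4 tokens per GPU.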
B.2 EN-VI
We fine-tuned our model using the Hugging Face library and the code available at their repository7 . The
fine-tuning was performed with a learning rate of 4e-5, Adam optimizer with beta1 and beta2 values set to
0.9 and 0.98, respectively, and a weight decay of 0.0001. We also used mixed precision training (fp16) to
accelerate the process. The learning rate scheduler was set to inverse square root with a warm-up of 200
steps. The training was conducted for 200 epochs with a maximum gradient norm of 0.0, label smoothing
factor of 0.1, and a batch size of 64 for both training and evaluation. The model was saved and evaluated
at the end of each epoch, and the logging was performed after each training step.
7 https://github.com/huggingface/transformers/tree/main/examples/pytorch/translation
Parameters for fine-tuning:
python train_mt_trainer.py \
--fp16 \
--model_name_or_path VietAI/envit5-translation \
--do_train \
--do_eval \
--do_predict \
--source_lang en \
--target_lang vi \
--source_prefix "translate English to Vietnamese: " \
--learning_rate 4e-5 \
--adam_beta1 0.9 \
--adam_beta2 0.98 \
--max_grad_norm 0.0 \
--num_train_epochs 200 \
--lr_scheduler_type inverse_sqrt \
--warmup_steps 200 \
--weight_decay 0.0001 \
--label_smoothing_factor 0.1 \
--save_strategy epoch \
--logging_steps 1 \
--evaluation_strategy epoch \
--per_device_train_batch_size=64 \
--per_device_eval_batch_size=64
python train_mt_trainer.py \
--fp16 \
--model_name_or_path facebook/mbart-large-50 \
--do_train \
--do_eval \
--do_predict \
--source_lang en_XX \
--target_lang pt_XX \
--learning_rate 3e-5 \
--adam_beta1 0.9 \
--adam_beta2 0.98 \
--max_grad_norm 0.0 \
--num_train_epochs 200 \
--lr_scheduler_type inverse_sqrt \
--warmup_steps 100 \
--weight_decay 0.0001 \
--label_smoothing_factor 0.1 \
--save_strategy epoch \
--logging_steps 1 \
--evaluation_strategy epoch \
--per_device_train_batch_size=16 \
--per_device_eval_batch_size=16
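Both commands above set a label smoothing factor of 0.1. As a rough illustration of what that setting does, the Python sketch below computes a label-smoothed loss for a single prediction by mixing the usual negative log-likelihood with the average negative log-probability over the vocabulary. This is the textbook uniform-smoothing form and is an assumption here; toolkits may normalise the smoothing term slightly differently, and the function name is hypothetical.

```python
import math
from typing import Sequence


def label_smoothed_nll(log_probs: Sequence[float], target: int, epsilon: float = 0.1) -> float:
    """Label-smoothed negative log-likelihood for one token.

    log_probs: log-probabilities over the vocabulary (should sum to 1 in prob. space).
    target:    index of the reference token.
    epsilon:   smoothing factor (0.1 in the commands above).
    """
    nll = -log_probs[target]                       # standard cross-entropy term
    uniform = -sum(log_probs) / len(log_probs)     # loss against a uniform target
    return (1.0 - epsilon) * nll + epsilon * uniform


if __name__ == "__main__":
    probs = [0.7, 0.1, 0.1, 0.1]
    log_probs = [math.log(p) for p in probs]
    print(label_smoothed_nll(log_probs, target=0, epsilon=0.0))
    print(label_smoothed_nll(log_probs, target=0, epsilon=0.1))
```

With smoothing enabled, a confident correct prediction is penalised slightly more than under plain cross-entropy, which discourages over-confident output distributions.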
UM-DFKI Maltese Speech Translation
Aiden Williams* Kurt Abela* Rishu Kumar⋄ Martin Bär*
Table 1: XLSR Wav2Vec 2.0 performance on low-resource settings when evaluated using WER. Assamese (AS), Tagalog (TL), Swahili (SW), and Georgian (KA) are the languages presented.
Language              AS     TL     SW     KA
Annotated Data (h)    55     76     30     46
XLSR-10               44.9   37.3   35.5   -
XLSR-53               44.1   33.2   36.5   31.1
XLS-R (0.3B)          42.9   33.2   24.3   28.0
XLS-R (1B)            40.4   30.6   21.2   25.1
XLS-R (2B)            39.0   29.3   21.0   24.3
2.3 mBART For Maltese to English Translation
According to Liu et al. (2020), using mBART-25 as the pre-trained model has been shown to improve translations over a randomly initialized baseline for low- and medium-resource languages. mBART-25 is a transformer model trained on the BART (Lewis et al., 2019) objective. It is trained on 25 different languages. mBART-25 was later extended to include 25 more languages and was called mBART-50 (Tang et al., 2020). However, neither model included Maltese; in fact, translation experiments on Maltese are very limited. In our experiments, in Section 3.2, we checked whether these performance gains extend to the Maltese language, and this claim appears to hold.
3 Methodology
For this task, we decided to use a cascade system where the ASR and MT components were trained separately but evaluated jointly. In this section, a detailed description of both components is given. First, the training data is described, followed by the pre-processing steps applied to said data. Next, the models are introduced, and lastly, the training procedure is outlined.
3.1 Automatic Speech Recognition
The ASR component in this submission continues the previous work done in (Williams, 2022), and so the same annotated dataset consisting of 50 hours of Maltese speech is used for this task. We opted not to use data released for this task for two reasons. The first was the additional annotation work that was required, mainly segmentation, which we were unable to complete in a timely manner. Secondly, this submission includes models fine-tuned with non-Maltese data. Making use of the dataset in (Williams, 2022) as a base has made comparisons with previous experiments possible.
As described in Table 2, the Maltese speech corpus is made up of several segments from two main Maltese speech corpora, MASRI (Hernandez Mena et al., 2020) and CommonVoice (CV) (Ardila et al., 2020), and an annotated set from publicly available parliamentary sittings. Previous research in ASR for Maltese has used English speech data with varying degrees of success (Mena et al., 2021). However, when applied in fine-tuning an XLS-R model, the effect was detrimental. To further observe the effect non-Maltese data would have on the translation task, we used three other subsets from the CommonVoice speech corpus, selecting 50 hours of validated data from each of the Italian, French and Arabic sets.
Individually, these speech corpora each amount to 50 hours, from which four models are trained: one with just the Maltese data, and the other three each trained on one extra language combined with the Maltese set. A fifth model is also trained with all the data included. Further combinations were not tried due to time concerns.
Table 2: Each corpus is listed along with its total length, sample count and average sample length.
Dataset         Length (h, m)   Samples   Average Length (s)
HEADSET         6, 40           4979      4.81
MEP             1, 20           656       7.11
Tube            13, 20          8954      5.34
MERLIN          19, 4           9720      6.14
Parlament       2, 30           1672      5.35
CV Validated    4, 57           3790      12.68
CV Other        5, 4            3833      4.71
CV French       50              -         -
CV Italian      50              -         -
CV Arabic       50              -         -
Validation      2, 32           1912      4.89
Test MASRI      1               668       5.39
Test CV         0, 54           670       4.74
The XLS-R model comes in three pre-trained variants: the small model with 300 million parameters, the medium model with a billion parameters and the large model with two billion parameters. Size on disk scales with parameter count, with the small model being roughly 1GB and the large model roughly 8GB. All three of them have been pre-trained on roughly 500 thousand hours of unlabelled, multilingual speech. Previous research (Williams, 2022) has shown that both the small and large models fare well when fine-tuned for the downstream Maltese ASR task. With this in mind, the small 300M XLS-R variant was chosen for this task. The main reason was that, owing to its smaller size, a larger batch size could be used, which expedited the fine-tuning process, while the performance loss was expected to be minimal.
This submission follows the same training procedure as outlined in (Williams, 2022). The procedure uses the Huggingface Trainer object with the following hyper-parameters. Each model is trained for 30 epochs, using the AdamW optimizer with a starting learning rate of 3e-4. To stabilise the training process, the first 500 training steps were used as warm-up steps. Gradient accumulation was also used to effectively quadruple the batch size. The batch size depended on the training set used: due to differences in sample lengths, different batch sizes had to be used. We fine-tune 5 XLS-R 300M models as presented in Table 3.
Table 3: ASR Models and the data used for fine-tuning.
Model      Corpora used
MT Only    All Maltese corpora
MT+All     All corpora presented
MT+AR      All Maltese corpora + Arabic subset
MT+FR      All Maltese corpora + French subset
MT+IT      All Maltese corpora + Italian subset
3.2 Machine Translation
The dataset used to train the machine translation systems comes from publicly available sources. The original data sources include datasets from Arab-Acquis (Habash et al., 2017), the European Vaccination Portal1, the Publications Office of the EU on the medical domain2, the European Medicines Agency3, the COVID-19 ANTIBIOTIC dataset4, the COVID-19 EC-EUROPA dataset5, the COVID-19 EU press corner V2 dataset6, the COVID-19 EUROPARL v2 dataset7, the Digital Corpus of the European Parliament (Hajlaoui et al., 2014), the DGT-Acquis (Steinberger et al., 2014), ELRC8, the Tatoeba corpus9, OPUS (Tiedemann, 2012), EUIPO - Trade mark Guidelines10, Malta Government Gazette11, MaCoCu (Bañón et al., 2022), as well as data extracted from the Laws of Malta12.
The different datasets were compiled into a single one. The total number of parallel sentences amounts to 3,671,287. The development and test sets were kept exactly the same as in the OPUS dataset (Tiedemann, 2012), amounting to 2,000 sentences each, and the rest of the data was placed in the training set, which amounts to 3,667,287 parallel sentences.
Before training the system, the data has to be further pre-processed. Firstly, a BPE tokenizer is trained on the training set only. The MosesDecoder13 package is used to pre-process the dataset, by normalising punctuation and by training a truecaser on the training set and applying it to the whole dataset. In the case of Maltese data, a tokenizer specifically designed for Maltese was used because the regular English tokenizer does not tokenize everything correctly. For this, the tokenizer from MLRS14 was used, which utilises regular expressions to tokenize linguistic expressions that are specific to Maltese, such as certain prefixes and articles. The dataset is then encoded using the previously trained BPE encoder.
1 https://bit.ly/3dLbGX9
2 https://bit.ly/3R2G5OH
3 https://bit.ly/3QWIjPM
4 https://bit.ly/3pBCg7u
5 https://bit.ly/3AcjIzR
6 https://bit.ly/3wmCyTD
7 https://bit.ly/3wl3brZ
8 https://www.lr-coordination.eu/node/2
9 https://bit.ly/3cejoIU
10 https://bit.ly/3AB01Tr
11 https://bit.ly/3QDXm1a
12 https://legislation.mt/
13 https://www.statmt.org/moses/
14 https://mlrs.research.um.edu.mt/
The machine translation model is built and trained using Fairseq (Ott et al., 2019). Fairseq is a library that allows for easy implementation of a machine translation system through CLI commands, meaning minimal code is needed to create a fully working machine translation system.
For this system, a pre-trained mBART-50 model (Tang et al., 2020) was used and fine-tuned on our
data. An mBART-25 (Liu et al., 2020) model, as well as a randomly initialised baseline Transformer model, were also experimented with; however, after training a system using a subset of the dataset, it was apparent that the mBART-50 model outperforms them both. Due to resource constraints, only one MT model was trained on the full dataset.
The maximum number of steps was set to 1,000,000, with validation performed every 10,000 steps and a patience value of 10. This means that if the BLEU score on the validation set does not improve over ten consecutive validation runs, the model stops training. After multiple experiments using a smaller subset of the dataset, it was seen that increasing max-tokens tended to result in higher overall performance. However, due to resource constraints, the maximum number of tokens per batch was set to 1024. The learning rate is set to 1e-3, but the initial learning rate is smaller at 1e-7 and rises linearly over the first 10,000 steps under an inverse square root learning rate scheduler. For inference, a beam size of five is used to generate predictions.
The total number of updates using mBART-50 was 990,000, with an early stop since the validation score did not improve in the last 10 validation runs. This amounts to exactly three full epochs on the whole training set.
3.3 Completed Pipeline
To create a speech-to-text translation system, a Huggingface pipeline is set up to accept an audio file that is passed to the ASR system. The test set provided for this task is a single file of over one hour. Due to its size, the file needs to be segmented for inference and evaluation. The XLS-R model automatically returns a timestamp for each output word. These timestamps are used to create segments that align with the segments file provided with the test set.
This means that the ASR component returns a list of text strings, where each segment is an item in the list. Each string is passed to the MT system. Before passing through the MT component, the resultant strings are pre-processed. The aforementioned MosesDecoder package is used to transform the strings using the same rules that have been applied to the MT training data. This means that the strings have their punctuation normalised, and are then truecased and finally tokenized. The processed strings are then encoded with the BPE model and passed to the mBART model for inference. The beam size is set to five. The resulting tokens are then detokenized and saved.
4 Evaluation and Results
Table 4 contains the official results for our submission for the Maltese → English spoken language translation track. While we observed better scores during training and validation, our models struggled with the official test set. In this section, we note our observations and a qualitative analysis of the results to highlight the errors.
Table 4: Official Results for our models for Maltese → English SLT task
Submission Name    BLEU Score
MT Only            0.6
MT+All             0.7
MT+AR              0.4
MT+FR              0.3
MT+IT              0.4
The test set proved to be difficult for both the ASR and MT systems to get right due to the type of language used as well as the speed of the speech in general. Table 5 shows the reference transcription of the beginning of the file, accompanied by the MT Only and MT+All ASR transcription, and lastly, the machine translation of the mt-50 model. The monolingually fine-tuned MT Only model was our primary submission from the five submitted ASR models, with a BLEU score of 0.6.
The mt-50 output is relatively similar to the reference sentence, except for a few minor errors, including the misspelling of the name “Mark”. However, this should still be a good sentence to input into the machine translation system. This is in stark contrast to the output of the MT+All system.
The main issue here is that this system does not output Maltese characters and completely omits them, which presents an issue for the downstream translation task since the meaning of the word is lost in these cases.
Machine translation also had similar issues. The training set contained data coming from legal texts, so the data is very formal, making evaluation very difficult since the input text is very informal and unlike the legal text data seen in training.
Unfortunately, most of this is unrelated to what
was actually said. Looking into the translations deeper, one can see the reasoning behind certain translations. For example, the dataset does not contain a lot of conversational data, so general greetings like “merh̄ba” may not be present. This case is represented by the translation of the token “merba”, which was translated to “four”. Here the token “merba” (welcome) was mistaken for “erba” (four). Other mistakes include output that is phonetically plausible but grammatically incorrect, such as “podcast”, which was transcribed as “pot kast”. Certain expressions like “din id-darba” were correctly translated to “this time”; however, rarer words such as “polemikuż” and “kontroversjali”, both of which have the same meaning as “controversial”, did not appear in the translation.
Table 5: Reference transcription sample from the IWSLT 2023 test set along with the MT Only and MT+All automatic transcription and the machine translation of the MT Only output.
Reference: merh̄ba’ gh̄al- podcast ieh̄or din id- darba ma bniedem kemxejn polemikuż mhux gh̄ax jien gh̄andi wisq xi ngh̄id però Mark Camilleri huwa il- mexxejj kemxejn kontroversjali tal- kunsill nazzjonali tal- ktieb
MT Only: merba’ l- pot kast ieh̄or din id- darba ma bniedem kemxejn polemikuż mhux gh̄ax jien gh̄andi wisq xi ngh̄id però mar Camilleri huwa il- mexxejj kemxejn kontroversjali tal- kunsill nazzjonali tal- ktieb
MT+All: meba l Pold cast ieor din id- darba ma bniedem kemmxejn polemiku mhux gax jien Gandi wisq xi ngid per mar kamileri huwai - mexxejk emxejh kontroversjali tal- kunsill nazzjonali tal- ktieb
Translation MT Only: four of the other potential this time does not work very slightly at all , but not at all , the same time , it is the slightly cross- sectoral leader of the national when the book is also of humane
Continuing the trend observed in (Williams, 2022), the use of additional languages when fine-tuning an XLS-R model proved to be detrimental towards the final output. As observed in Section 4, some models trained with additional data lost the ability to transcribe Maltese-specific alphabetic characters. So far, the character-to-sound pair was always made with the source language in mind. For example, the French ‘Ç’ is transformed into the ‘C’ character, which itself is only present in the Maltese alphabet when English words are loaned and used directly. It is important to note that code-switching to English is very common in Maltese speech. Future work should explore these character-to-sound pairs.
5 Conclusion and Future Work
This paper showcased the results of a speech-to-text translation system in the direction of Maltese to English. A cascade system is chosen, where ASR and MT models are pipelined together.
The automatic speech recognition system chosen is based on XLS-R and is fine-tuned on data from different languages. The best-performing model was the XLS-R 300M model fine-tuned on 50 hours of Maltese speech. The machine translation system chosen is based on mBART-50, and it was fine-tuned on parallel Maltese - English data. Aside from fine-tuning, no modifications were made to the pre-trained models.
For future work, we have various potential avenues for improvement. For machine translation, since mBART-50 was not pre-trained on Maltese data, extending the vocabulary to include Maltese-specific tokens would improve the representation and potentially the downstream performance as well. Moreover, our approach solely relied on parallel data and did not investigate techniques which leverage monolingual data, such as back-translation. Monolingual corpora, such as Korpus Malti v4 (Micallef et al., 2022), not only provide significantly more data but also have more diversity in terms of domains. Apart from this, it might be beneficial to perform more quality checks on the parallel dataset since some portions of the publicly available datasets are automatically crawled and, in some cases, contain noise.
Regarding ASR improvement, other systems, such as Whisper and, most recently, Meta’s Massively Multilingual Speech (MMS) project should be tried and evaluated. The research made in multilingual fine-tuning needs to be more focused. One idea we can explore is the transliteration of foreign alphabetic characters into Maltese characters, e.g. ‘h’ in English would be transliterated as ‘h̄’. It is also the case that no language model is used to correct the ASR output mistakes; this is currently our next milestone.
Acknowledgements
We acknowledge LT-Bridge Project (GA 952194). Rishu Kumar was supported financially by the EMLCT15 programme during this entire work.
15 https://mundus-web.coli.uni-saarland.de/
References
Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutail Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.
Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Vĕra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nǎdejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 Evaluation Campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 98–157, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. 2020. Common voice: A massively-multilingual speech corpus. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4218–4222, Marseille, France. European Language Resources Association.
Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. 2021. XLS-R: self-supervised cross-lingual speech representation learning at scale. CoRR, abs/2111.09296.
Parnia Bahar, Patrick Wilken, Mattia A. Di Gangi, and Evgeny Matusov. 2021. Without Further Ado: Direct and Simultaneous Speech Translation by AppTek in 2021. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 52–63, Bangkok, Thailand (online). Association for Computational Linguistics.
Marta Bañón, Miquel Esplà-Gomis, Mikel L. Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, and Jaume Zaragoza. 2022. MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 303–304, Ghent, Belgium. European Association for Machine Translation.
Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. 2021. Unsupervised Cross-Lingual Representation Learning for Speech Recognition. In Proc. Interspeech 2021, pages 2426–2430.
Pavel Denisov, Manuel Mager, and Ngoc Thang Vu. 2021. IMS’ Systems for the IWSLT 2021 Low-Resource Speech Translation Task. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 175–181, Bangkok, Thailand (online). Association for Computational Linguistics.
Liang Ding and Dacheng Tao. 2021. The USYD-JD Speech Translation System for IWSLT2021. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 182–191, Bangkok, Thailand (online). Association for Computational Linguistics.
Mark JF Gales, Kate M Knill, Anton Ragni, and Shakti P Rath. 2014. Speech recognition and keyword spotting for low-resource languages: Babel project research at cued. In Fourth International workshop on spoken language technologies for under-resourced languages (SLTU-2014), pages 16–23. International Speech Communication Association (ISCA).
Nizar Habash, Nasser Zalmout, Dima Taji, Hieu Hoang, and Maverick Alzate. 2017. A parallel corpus for evaluating machine translation between arabic and european languages. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 235–241.
Najeh Hajlaoui, David Kolovratnik, Jaakko Väyrynen, Ralf Steinberger, and Daniel Varga. 2014. Dcep-digital corpus of the european parliament. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14).
Michael A. Hedderich, Lukas Lange, Heike Adel, Jannik Strötgen, and Dietrich Klakow. 2021. A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2545–2568, Online. Association for Computational Linguistics.
Carlos Daniel Hernandez Mena, Albert Gatt, Andrea DeMarco, Claudia Borg, Lonneke van der Plas, Amanda Muscat, and Ian Padovani. 2020. MASRI-HEADSET: A Maltese corpus for speech recognition. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 6381–6388, Marseille, France. European Language Resources Association.
Diksha Khurana, Aditya Koli, Kiran Khatter, and Sukhdev Singh. 2023. Natural language processing: State of the art, current trends and challenges. Multimedia tools and applications, 82(3):3713–3744.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
Yinglu Li, Minghan Wang, Jiaxin Guo, Xiaosong Qiao, Yuxia Wang, Daimeng Wei, Chang Su, Yimeng Chen, Min Zhang, Shimin Tao, Hao Yang, and Ying Qin. 2022. The HW-TSC’s Offline Speech Translation System for IWSLT 2022 Evaluation. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 239–246, Dublin, Ireland (in-person and online). Association for Computational Linguistics.
Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation.
Alexandre Magueresse, Vincent Carles, and Evan Heetderks. 2020. Low-resource languages: A review of past work and future challenges. arXiv preprint arXiv:2006.07264.
Carlos Daniel Hernandez Mena, Andrea DeMarco, Claudia Borg, Lonneke van der Plas, and Albert Gatt. 2021. Data augmentation for speech recognition in maltese: A low-resource perspective. CoRR, abs/2111.07793.
Kurt Micallef, Albert Gatt, Marc Tanti, Lonneke van der Plas, and Claudia Borg. 2022. Pre-training data quality and quantity for a low-resource language: New corpus and BERT models for Maltese. In Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, pages 90–101, Hybrid. Association for Computational Linguistics.
Tuan Nam Nguyen, Thai Son Nguyen, Christian Huber, Ngoc-Quan Pham, Thanh-Le Ha, Felix Schneider, and Sebastian Stüker. 2021. KIT’s IWSLT 2021 Offline Speech Translation System. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 125–130, Bangkok, Thailand (online). Association for Computational Linguistics.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. 2020. MLS: A Large-Scale Multilingual Dataset for Speech Research. In Proc. Interspeech 2020, pages 2757–2761.
Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, 63(10):1872–1897.
Ralf Steinberger, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski, and Signe Gilbro. 2014. An overview of the european union’s highly multilingual parallel corpora. Language resources and evaluation, 48(4):679–707.
Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2020. Multilingual translation with extensible multilingual pretraining and finetuning. arXiv preprint arXiv:2008.00401.
Jörg Tiedemann. 2012. Parallel data, tools and interfaces in opus. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey. European Language Resources Association (ELRA).
Jörgen Valk and Tanel Alumäe. 2021. Voxlingua107: A dataset for spoken language recognition. In 2021 IEEE Spoken Language Technology Workshop (SLT), pages 652–658.
Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson,
440
Juan Pino, and Emmanuel Dupoux. 2021. VoxPop-
uli: A large-scale multilingual speech corpus for rep-
resentation learning, semi-supervised learning and
interpretation. In Proceedings of the 59th Annual
Meeting of the Association for Computational Lin-
guistics and the 11th International Joint Conference
on Natural Language Processing (Volume 1: Long
Papers), pages 993–1003, Online. Association for
Computational Linguistics.
Aiden Williams. 2022. The applicability of Wav2Vec
2.0 for low-resource Maltese ASR. B.S. thesis, Uni-
versity of Malta.
Marcely Zanon Boito, John Ortega, Hugo Riguidel, An-
toine Laurent, Loïc Barrault, Fethi Bougares, Firas
Chaabani, Ha Nguyen, Florentin Barbier, Souhir Gah-
biche, and Yannick Estève. 2022. ON-TRAC Con-
sortium Systems for the IWSLT 2022 Dialect and
Low-resource Speech Translation Tasks. In Proceed-
ings of the 19th International Conference on Spoken
Language Translation (IWSLT 2022), pages 308–318,
Dublin, Ireland (in-person and online). Association
for Computational Linguistics.
Weitai Zhang, Zhongyi Ye, Haitao Tang, Xiaoxi Li,
Xinyuan Zhou, Jing Yang, Jianwei Cui, Pan Deng,
Mohan Shi, Yifan Song, Dan Liu, Junhua Liu, and
Lirong Dai. 2022. The USTC-NELSLIP Offline
Speech Translation Systems for IWSLT 2022. In
Proceedings of the 19th International Conference on
Spoken Language Translation (IWSLT 2022), pages
198–207, Dublin, Ireland (in-person and online). As-
sociation for Computational Linguistics.
Ziqiang Zhang and Junyi Ao. 2022. The YiTrans Speech
Translation System for IWSLT 2022 Offline Shared
Task. In Proceedings of the 19th International Con-
ference on Spoken Language Translation (IWSLT
2022), pages 158–168, Dublin, Ireland (in-person
and online). Association for Computational Linguis-
tics.
441
NVIDIA NeMo Offline Speech Translation Systems for IWSLT 2023
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 442–448
July 13-14, 2023 © 2023 Association for Computational Linguistics
Table 1: Statistics of different datasets used for training our models in a constrained regime.

Model        Segments (millions)   Time (hours)
ASR          2.7                   4800
NMT En→De    11                    −
NMT En→Zh    7.5                   −
NMT En→Ja    21                    −
TTS          0.37                  611

Table 2: Statistics of TED talks dataset.

Model                 Segments (thousands)   Time (hours)
En audio → En text    370                    611
En audio → De text    280                    459
En audio → Zh text    350                    580
En audio → Ja text    321                    528
this dataset and its subsets with available translations to De/Zh/Ja as TED talks. See Table 2 for the detailed statistics of this dataset.

ASR For training our ASR model, we used LibriSpeech (Panayotov et al., 2015), Mozilla Common Voice v11.0 (Ardila et al., 2019), TED-LIUM v3 (Hernandez et al., 2018), VoxPopuli v2 (Wang et al., 2021), all available speech-to-English data from Must-C v1-v3 (Cattoni et al., 2021) En-De/Zh/Ja datasets, ST-TED (Jan et al., 2018), and Europarl-ST (Iranzo-Sánchez et al., 2020).

We converted all audio data to mono-channel 16kHz wav format. Of all the datasets allowed under the constrained submission, LibriSpeech and TED-LIUM v3 were the only datasets that provided transcripts with neither punctuation nor capitalization (P&C). For LibriSpeech, we managed to restore P&C from the dataset metadata available at their website (https://www.openslr.org/12). For TED-LIUM v3, we applied a P&C restoration model trained on the English portion of the allowed bitext. Finally, we discarded all samples shorter than 0.2s or longer than 22s and all samples with transcripts present in the evaluation dataset. As a result, our training dataset contained 2.7M audio segments with a total duration of 4.8k hours.

MT For training our NMT models, we used all available bitext allowed for the IWSLT 2023 constrained submission. After training, we additionally fine-tuned our models on bitexts from TED talks for each language.

We applied langid and bicleaner filtering following Subramanian et al. (2021) and discarded all sentences longer than 128 tokens and sentences with the length ratio between source and target exceeding 3. We also applied Moses tokenization for En/De, jieba tokenization for Zh, and ja-mecab tokenization for Ja.

TTS For training our TTS model, we used TED talks with English transcripts. The combination of Must-C v1-v3 and ST-TED contained 3696 speakers; however, some of them were not unique. Capitalizing on the huge overlap with TED-LIUM v3 and the speaker names from there, we managed to attribute several talks to a single speaker, reducing the number of unique speakers to 3361. We also removed capitalization from English transcripts in TED talks.

ST For training our end-to-end ST models, we used the combination of 1) ASR data with the ground truth transcripts replaced by synthetic translations; 2) NMT data with TTS-generated English audio on the source side (Table 1).

3 System

In this section, we describe the essential components of our end-to-end submission.

ASR We trained a 17-layer large Conformer-transducer (Gulati et al., 2020) with a FastConformer (Rekesh et al., 2023) encoder and RNN-T loss and decoder (Graves, 2012). The prediction network consisted of a single layer of LSTM (Hochreiter and Schmidhuber, 1997), and the joint network was an MLP. All the hidden sizes in the decoder were set to 640. Unigram SentencePiece (Kudo and Richardson, 2018) with 1024 tokens was used for tokenization.

The ASR models were trained for 45 epochs, starting with a checkpoint pre-trained on LibriSpeech. We used the AdamW (Loshchilov and Hutter, 2017) optimizer and Noam Annealing (Vaswani et al., 2017) with 10K warmup steps and a maximum learning rate of 1.15. Weight decay of 0.001 on all parameters was used for regularization. The
effective batch size was set to 1200, and we could fit larger batch sizes via batch splitting for the RNN-T loss. Time-Adaptive SpecAugment (Park et al., 2020) with 2 freq masks (F = 27) and 10 time masks (T = 5%) was used as the augmentation scheme. We also used dropout of 0.1 for both the attention scores and intermediate activations.

NMT We trained our NMT models (Transformer, 12 × 6 layers, d_model = 1024, d_inner = 4096, n_heads = 16) with the Adam optimizer (Kingma and Ba, 2014) and inverse square root annealing (Vaswani et al., 2017) with 7.5K warmup steps and a maximum learning rate of 10^−3. The models were trained for a maximum of 75K steps with a dropout of 0.1 on intermediate activations and label smoothing with α = 0.1. Our En→De models used a joint BPE vocabulary of 16384 tokens and En→Zh/Ja used separate vocabularies with the same number of tokens per language.

After training, we did checkpoint averaging and fine-tuned all our base NMT models on TED talks for 3 epochs with an initial learning rate of 2×10^−5, inverse square root annealing, and a warmup of 10% steps. Finally, we ensembled 2 models trained with different initializations for each language direction.

TTS Our TTS model was a multi-speaker FastPitch (Łańcucki, 2021) text-to-mel-spectrogram generator. Training a vocoder was not necessary for our setup as the parameters of spectrograms matched the ones for ST models following the approach described in (Bataev et al., 2023). TTS-generated spectrograms were fed directly into the FastConformer encoder when training the ST model. Our TTS model was trained for 200 epochs on TED talks with restored speakers from TED-LIUM v3 (Hernandez et al., 2018).

Table 3: Word error rate (WER) of the English ASR model evaluated on TED talks from Must-C v2 and past test sets from IWSLT. All predictions and ground truth transcripts were normalized for WER computation.

Model         tst-COM           IWSLT.tst
              De     Zh/Ja      2018   2019   2020
norm          5.9    5.8        9.8    5.6    8.0
punct         5.7    5.4        9.4    4.9    7.0
punct+capit   5.7    5.5        9.5    5.7    8.5

2048, n_heads = 8). We used the vocabulary of 16384 YouTokenToMe byte-pair-encodings, trained jointly for En→De and separately for En→Zh/Ja. All models were trained for 30k steps with an ASR-initialized encoder and a randomly initialized decoder.

To speed up training and improve GPU utilization, we bucketed our ASR and NMT datasets on sequence length so each batch contained a similar number of tokens. On each iteration, we pick one batch from ASR and one batch from NMT, which resulted in an approximately 3:2 ratio between segments from ASR and NMT for En→De. TTS mel spectrograms were generated on-the-fly for a randomly selected speaker for each sample.

After pretraining on the ASR task, we fused BatchNorm in FastConformer layers as proposed in (Bataev et al., 2023) to avoid a mismatch between statistics for natural and generated mel spectrograms. The batch normalization layer was replaced with a trainable projection initialized from the original parameters. We observed meaningful improvements when using such an approach compared to retaining the original batch normalization.

4 Experiments
Table 4: En→De BLEU scores calculated on IWSLT test sets from different years by using automatic re-
segmentation of the hypothesis based on the reference translation by mwerSegmenter implemented in
SLTev (Ansari et al., 2021). Avg ∆ computes the improvement over the cascade baseline averaged over 7 test sets.
Model description 2010 2013 2014 2015 2018 2019 2020 Avg
Text-to-text NMT models
Transformer 12 × 6 constrained 32.9 36.7 32.7 34.2 30.5 29.4 33.0 32.8
+ checkpoint averaging 33.1 37.4 32.8 35.1 30.3 29.8 33.5 33.1
+ TED talks fine-tuning 34.5 39.1 34.1 35.3 30.8 30.3 33.8 34.0
+ x2 ensembling 35.2 40.2 34.9 36.0 32.5 31.6 35.4 35.1
NeMo IWSLT’22 NMT model 35.7 41.2 36.2 38.1 34.7 31.7 35.0 36.1
End-to-end ST models
Conformer (17) + Transformer (6 × 6) 29.8 33.8 30.2 27.1 26.2 26.8 29.1 29.0
+ better WebRTC VAD parameters 31.2 35.4 31.8 28.6 27.3 27.6 29.7 30.2
+ SHAS segmentation 32.1 36.1 32.6 29.0 28.4 27.9 30.9 31.0
NeMo IWSLT 2023 constrained 31.0 34.9 30.7 28.6 27.4 27.7 30.3 29.5
NeMo IWSLT 2022 (end-to-end) 24.5 30.0 25.2 25.3 24.9 24.1 26.2 25.7
NeMo IWSLT 2022 (cascade) 26.6 32.2 26.8 28.3 28.1 27.3 29.7 28.4
KIT IWSLT 2022 − − − 27.9 − 27.6 30.0 −
USTC-NELSLIP IWSLT 2022 − − − − 29.9 28.2 30.6 −
YiTrans IWSLT 2022 − − − − − 31.6 34.1 −
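The Avg column of Table 4 can be checked with a few lines of arithmetic. The sketch below copies three rows from the table and recomputes the unweighted mean over the 7 test sets; the dictionary keys and helper names are illustrative and not part of the original submission. Under the assumption that the "+ SHAS segmentation" row is the system being compared, the gap to the IWSLT 2022 end-to-end row reproduces the 5.3 BLEU improvement quoted in the text.

```python
# BLEU scores copied from Table 4, in test-set order
# 2010, 2013, 2014, 2015, 2018, 2019, 2020.
# The dictionary keys are hypothetical labels chosen for this sketch.
SCORES = {
    "e2e_2023_shas": [32.1, 36.1, 32.6, 29.0, 28.4, 27.9, 30.9],
    "e2e_2022": [24.5, 30.0, 25.2, 25.3, 24.9, 24.1, 26.2],
    "cascade_2022": [26.6, 32.2, 26.8, 28.3, 28.1, 27.3, 29.7],
}


def average_bleu(scores):
    """Unweighted mean over the test sets, rounded to one decimal as in the table."""
    return round(sum(scores) / len(scores), 1)


def gain(system, baseline):
    """Difference between the rounded averages of two rows of SCORES."""
    return round(average_bleu(SCORES[system]) - average_bleu(SCORES[baseline]), 1)


if __name__ == "__main__":
    for name, scores in SCORES.items():
        print(name, average_bleu(scores))
    print("gain over 2022 end-to-end:", gain("e2e_2023_shas", "e2e_2022"))
    print("gain over 2022 cascade:", gain("e2e_2023_shas", "cascade_2022"))
```

The averages are unweighted, so each test set counts equally regardless of its size.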
coder, we did not notice a significant difference in the corresponding BLEU scores.

ST En→De Table 4 shows the performance of our baseline En→De system and its ablations on 7 different IWSLT test sets over the years. All ablation experiments used last year's constrained setup, which included more NMT data from WMT, to be comparable with last year's submissions. The systems we submitted were retrained on the allowed data to comply with the constrained restrictions.

We improve the average BLEU score by 5.3 over our end-to-end submission from last year. We believe that this gain is attributable to several factors, most importantly, switching to synthetic transcripts, including TTS-generated data, and a better segmentation model. On some of the evaluation datasets, we approached the BLEU scores of top contestants from last year.

Retraining our model in accordance with this year's constrained setup resulted in an average degradation of 1.5 BLEU. Most of this performance drop was attributed to worse NMT models trained on a limited amount of data which did not include large bitexts from WMT.

ST En→Zh/Ja To train English-Chinese and English-Japanese ST systems, we followed a similar recipe to the English-German system. Specifically, we re-trained NMT components and used them to generate synthetic translations of audio segments. With other auxiliary models intact, we replaced bitexts used for TTS augmentations and trained En→Zh (Table 5) and En→Ja (Table 6) ST end-to-end models in a constrained setup.

The only difference in our submission was that the English-Chinese model used punct+capit ASR, while the English-Japanese model used norm ASR. This choice was based on a slightly higher (less than 0.5) BLEU score on the Must-C v2 dev dataset.

4.2 Discarded alternatives

When designing our submission, we explored a number of alternatives that did not lead to a clear improvement in preliminary experiments and, thus, were not included in the final submission.

ASR We tried to replace BatchNorm with LayerNorm in the FastConformer backbone to mitigate the statistics mismatch between natural and TTS-generated mel-spectrograms. The resulting model
required more epochs to converge and resulted in slightly higher WER.

Table 5: En→Zh BLEU scores calculated on Must-C dev and tst-COMMON with official segmentation.

Table 6: En→Ja BLEU scores calculated on Must-C dev and tst-COMMON with official segmentation.

NMT We experimented with larger models of up to 12 × 8 layers, larger vocabularies of up to 32k tokens, and label smoothing of up to 0.2 but did not notice any improvements to BLEU scores. We also saw diminishing returns when using more than 2 models in the ensemble. Thus, we decided to stick to the ensemble of two 12 × 6 models with 16k vocab to speed up synthetic data generation.

TTS While debugging the code, we noticed that the TTS model generating mel-spectrograms used the same single speaker and had dropout enabled. Surprisingly, it did not lead to performance degradation. We hypothesize that this was caused by using a well converged pre-trained ASR encoder, which was not altered significantly by the low-quality signal. We also experimented with improving generated spectrograms with a GAN enhancer following Bataev et al. (2023), which led to similar results at the cost of significant computation overhead.

Segmentation We experimented with voice activity detection implemented in the WebRTC toolkit (https://github.com/wiseman/py-webrtcvad); however, the BLEU scores on IWSLT test sets were lower than with SHAS segmentation (Table 4), even after extensive hyperparameter search.

ST Given the effectiveness of ensembling in last year's competition, we evaluated the performance of an ensemble of up to 3 models with different ASR encoder initializations. Unlike NMT, we did not observe any improvement over using the best model from the ensemble.

We experimented with using RNN-T instead of the Transformer decoder. Despite its remarkable performance in ASR, RNN-T converged much slower and underperformed our Transformer decoder by more than 2 BLEU in our ST model.

5 Conclusion

We present NVIDIA NeMo group's offline speech translation systems for En→De, En→Zh, and En→Ja IWSLT 2023 Tasks.

Our primary end-to-end models, which translate English speech directly into German, Chinese, and Japanese texts, consist of a FastConformer encoder and a Transformer decoder. To alleviate the problem of direct ST data scarcity, we capitalized on a number of auxiliary ASR, TTS, and NMT models, and their ability to generate high-quality audio and translations. The resulting models achieve competitive performance without using any amount of direct ST data.

Although we participated in the constrained scenario, our pipeline can be easily scaled to arbitrarily large amounts of ASR and NMT data.

Acknowledgments

The authors would like to thank Somshubra Majumdar for many useful discussions over the course of this project and Nithin Koluguri for help with training ASR models.

References

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda
Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutail Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Ebrahim Ansari, Ondřej Bojar, Barry Haddow, and Mohammad Mahmoudi. 2021. SLTEV: Comprehensive evaluation of spoken language translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 71–79, Online. Association for Computational Linguistics.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. 2019. Common voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460.

Vladimir Bataev, Roman Korostik, Evgeny Shabalin, Vitaly Lavrukhin, and Boris Ginsburg. 2023. Text-only domain adaptation for end-to-end asr using integrated text-to-mel-spectrogram generator. ArXiv, abs/2302.14036.

Roldano Cattoni, Mattia Antonino Di Gangi, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2021. Must-c: A multilingual corpus for end-to-end speech translation. Computer Speech & Language, 66:101155.

Alex Graves. 2012. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented Transformer for speech recognition. In Proceedings of Interspeech, pages 5036–5040.

François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Esteve. 2018. Ted-lium 3: twice as much data and corpus repartition for experiments on speaker adaptation. In International conference on speech and computer, pages 198–208. Springer.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerda, Javier Jorge, Nahuel Roselló, Adria Giménez, Albert Sanchis, Jorge Civera, and Alfons Juan. 2020. Europarl-st: A multilingual corpus for speech translation of parliamentary debates. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8229–8233. IEEE.

Niehues Jan, Roldano Cattoni, Stüker Sebastian, Mauro Cettolo, Marco Turchi, and Marcello Federico. 2018. The iwslt 2018 evaluation campaign. In Proceedings of IWSLT, pages 2–6.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Oleksii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Beliaev, Vitaly Lavrukhin, Jack Cook, et al. 2019. Nemo: a toolkit for building ai applications using neural modules. arXiv preprint arXiv:1909.09577.

Taku Kudo and John Richardson. 2018. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

Adrian Łańcucki. 2021. FastPitch: Parallel text-to-speech with pitch prediction. In ICASSP.

Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: an ASR corpus based on public domain audio books. In Proceedings of ICASSP, pages 5206–5210. IEEE.

Daniel S Park, Yu Zhang, Chung-Cheng Chiu, Youzheng Chen, Bo Li, William Chan, Quoc V Le, and Yonghui Wu. 2020. Specaugment on large scale datasets. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6879–6883. IEEE.

Dima Rekesh, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, He Juang, Oleksii Hrinchuk, Ankur Kumar, and Boris Ginsburg. 2023. Fast conformer with linearly scalable attention for efficient speech recognition. arXiv preprint arXiv:2305.05084.
Sandeep Subramanian, Oleksii Hrinchuk, Virginia
Adams, and Oleksii Kuchaiev. 2021. Nvidia nemo
neural machine translation systems for english-
german and english-russian news and biomedical
tasks at wmt21. arXiv preprint arXiv:2111.08634.
Ioannis Tsiamas, Gerard I Gállego, José AR Fonollosa,
and Marta R Costa-jussà. 2022. Shas: Approaching
optimal segmentation for end-to-end speech transla-
tion. arXiv preprint arXiv:2202.04774.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In Proceedings of NeurIPS, pages 5998–
6008.
Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu,
Chaitanya Talnikar, Daniel Haziza, Mary Williamson,
Juan Pino, and Emmanuel Dupoux. 2021. Voxpop-
uli: A large-scale multilingual speech corpus for rep-
resentation learning, semi-supervised learning and
interpretation. arXiv preprint arXiv:2101.00390.
SRI-B’s systems for IWSLT 2023 Dialectal and Low-resource track:
Marathi-Hindi Speech Translation
0.1 during ST training. We pre-train on the ASR data for 6000 steps and then train on the ST data for 2250 steps. After ST training, we average the last 10 checkpoints to create the final model. We used a beam size of 10 for decoding.

4.1 Constrained condition

For the constrained condition, we are only permitted to use the data provided by the organizers. For the constrained models, wherever pre-training is involved, we only utilize the 3 constrained datasets from Table 2. For this condition, we train the following models:

• The Transformer model trained with only the train split from the ST data.

• The Conformer model trained with only the train split from the ST data.

the Transformer ones as can be gleaned from Table 3, we chose to use only Conformer models for the unconstrained condition. We train the following models for the unconstrained condition:

• The Conformer model encoder pre-trained with constrained and unconstrained ASR data mentioned in Table 2 and then trained with only the train split from the ST data. This served as our unconstrained contrastive model for the final submission.

• The Conformer model encoder pre-trained with constrained and unconstrained ASR data mentioned in Table 2 and then trained with both the train and the dev splits from the ST data. This served as our unconstrained primary model for the final submission.
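The checkpoint averaging step mentioned above (averaging the last 10 checkpoints after ST training) can be sketched as follows. This is a minimal, framework-free illustration and not the authors' actual tooling: checkpoints are modeled as plain dictionaries mapping a parameter name to a flat list of floats, and the function names are hypothetical. In practice the same element-wise mean would be applied to the tensors stored in each saved checkpoint.

```python
from typing import Dict, List, Sequence

# Assumption for this sketch: a checkpoint is a mapping from a
# parameter name to its flattened weights.
Checkpoint = Dict[str, List[float]]


def average_checkpoints(checkpoints: Sequence[Checkpoint]) -> Checkpoint:
    """Return the element-wise mean of all parameters across checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = set(checkpoints[0])
    if any(set(ckpt) != names for ckpt in checkpoints):
        raise ValueError("all checkpoints must contain the same parameters")
    averaged: Checkpoint = {}
    for name in checkpoints[0]:
        # zip(*...) walks the same weight position across every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged


def average_last_n(checkpoints: Sequence[Checkpoint], n: int = 10) -> Checkpoint:
    """Average only the n most recent checkpoints (ordered oldest to newest)."""
    return average_checkpoints(checkpoints[-n:])


if __name__ == "__main__":
    # Twelve toy checkpoints; only the last ten (i = 2..11) are averaged.
    history = [{"w": [float(i), 2.0 * i]} for i in range(12)]
    print(average_last_n(history, n=10))
```

Averaging weights from late checkpoints is typically used to smooth out step-to-step noise at the end of training without the inference cost of a full ensemble.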
Table 3: Results for all our trained models on dev & test splits. Here all indicates that both constrained and
unconstrained datasets were used for ASR pretraining.
Finally, since the dev and test splits come from a similar distribution, including the dev split in speech translation training boosted our BLEU scores on the test split by 5.5 and 2.6 points in the cases of constrained and unconstrained conditions respectively. Utilizing the dev split for speech translation training also narrowed down the gap in performance between the unconstrained and constrained models on the test split.

6 Conclusion

In this paper, we present our approaches to the IWSLT 2023 Evaluation Campaign Dialectal and Low-resource track: Marathi-Hindi Speech Translation, which secured the first and second places in the constrained and unconstrained conditions respectively. We start off with a simple end-to-end approach with Transformers and then apply a gamut of ideas like replacing the encoder blocks with Conformers, encoder pre-training, etc., to drastically improve our dev BLEU score from 1.02 to 20.22. Through our results, we also quantitatively demonstrate how much of an impact each of our ideas brings forth and sincerely hope that some of these ideas might be useful for researchers and practitioners alike working on low-resource speech translation problems.

References

Basil Abraham, Danish Goel, Divya Siddarth, Kalika Bali, Manu Chopra, Monojit Choudhury, Pratik Joshi, Preethi Jyothi, Sunayana Sitaram, and Vivek Seshadri. 2020. Crowdsourcing speech data for low-resource languages from low-income workers. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC), pages 2819–2826.

Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutail Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.

Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Vĕra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nǎdejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, and Shinji Watanabe. 2022. Findings of the IWSLT 2022 evaluation campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 98–157, Dublin, Ireland (in-person and online). Association for Computational Linguistics.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. 2019. Common voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670.

Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, et al.
2021. Xls-r: Self-supervised cross-lingual speech representation learning at scale. arXiv preprint arXiv:2111.09296.

Arun Baby, Anju Leela Thomas, N. L. Nishanthi, and TTS Consortium. 2016. Resources for Indian languages. In CBBLR – Community-Based Building of Language Resources, pages 37–43, Brno, Czech Republic. Tribun EU.

Parnia Bahar, Tobias Bieschke, and Hermann Ney. 2019. A comparative study on end-to-end speech to text translation. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 792–799. IEEE.

Luisa Bentivogli, Mauro Cettolo, Marco Gaido, Alina Karakanta, Alberto Martinelli, Matteo Negri, and Marco Turchi. 2021. Cascade versus direct speech translation: Do the differences still make a difference? arXiv preprint arXiv:2106.01045.

Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and translate: A proof of concept for end-to-end speech-to-text translation. arXiv preprint arXiv:1612.01744.

Francisco Casacuberta, Marcello Federico, Hermann Ney, and Enrique Vidal. 2008. Recent efforts in spoken language translation. IEEE Signal Processing Magazine, 25(3):80–88.

William Chan, Navdeep Jaitley, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. 2020. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100.

Fei He, Shan-Hui Cathy Chu, Oddur Kjartansson, Clara Rivera, Anna Katanova, Alexander Gutkin, Isin Demirsahin, Cibu Johny, Martin Jansche, Supheakmungkol Sarin, and Knot Pipatsrisawat. 2020. Open-source multi-speaker speech corpora for building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu speech synthesis systems. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6494–6503, Marseille, France. European Language Resources Association.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Gaurav Kumar, Matt Post, Daniel Povey, and Sanjeev Khudanpur. 2014. Some insights from translating conversational telephone speech. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3231–3235.

Hang Le, Florentin Barbier, Ha Nguyen, Natalia Tomashenko, Salima Mdhaffar, Souhir Gahbiche, Bougares Fethi, Benjamin Lecouteux, Didier Schwab, and Yannick Estève. 2021. On-trac’systems for the iwslt 2021 low-resource speech translation and multilingual speech translation shared tasks. In International Conference on Spoken Language Translation (IWSLT).

H. Ney. 1999. Speech translation: coupling of recognition and translation. In 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258), volume 1, pages 517–520 vol.1.

Jan Niehues, Rolando Cattoni, Sebastian Stüker, Matteo Negri, Marco Turchi, Thanh-Le Ha, Elizabeth Salesky, Ramon Sanabria, Loic Barrault, Lucia Specia, and Marcello Federico. 2019. The IWSLT 2019 evaluation campaign. In Proceedings of the 16th International Conference on Spoken Language Translation, Hong Kong. Association for Computational Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. 2019. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.

Matt Post, Gaurav Kumar, Adam Lopez, Damianos Karakos, Chris Callison-Burch, and Sanjeev Khudanpur. 2013. Improved speech-to-text translation with the fisher and callhome Spanish-English speech translation corpus. In Proceedings of the 10th International Workshop on Spoken Language Translation: Papers, Heidelberg, Germany.

Kishore Prahallad, E Naresh Kumar, Venkatesh Keri, S Rajendran, and Alan W Black. 2012. The iiit-h indic speech databases. In Thirteenth annual conference of the international speech communication association.

Kishore Prahallad, Anandaswarup Vadapalli, Naresh Elluru, Gautam Mantena, Bhargav Pulugundla, Peri Bhaskararao, Hema A Murthy, Simon King, Vasilis Karaiskos, and Alan W Black. 2013. The blizzard challenge 2013–indian language task. In Blizzard challenge workshop, volume 2013.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958.

Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2020. Multilingual translation with extensible multilingual pretraining and finetuning. arXiv preprint arXiv:2008.00401.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. Advances in neural information processing
systems, 30.
Ron J Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui
Wu, and Zhifeng Chen. 2017. Sequence-to-sequence
models can directly translate foreign speech. arXiv
preprint arXiv:1703.08581.
454
BIT’s System for Multilingual Track
Zhipeng Wang Yuhang Guo∗
Beijing Institute of Technology Beijing Institute of Technology
wzp3139725181@163.com guoyuhang@bit.edu.cn
Shuoying Chen
Beijing Institute of Technology
chensy@bit.edu.cn
data for the relevant languages from MUST-C v1.0, MUST-C v1.2, and MUST-C v2.0, merged them, and preprocessed them to obtain our training dataset. We used the Fairseq (Ott et al., 2019) toolkit to conduct our experiments, and after training was completed, we scored translation quality with the sacreBLEU metric. Our model achieved the expected results on 10 target languages.
2 Data Preparation
As shown in Table 1, we collected training data for the relevant languages from the MUST-C corpus. The table shows significant differences between languages, in particular in the number of source-language and target-language words. For example, the number of source-language words in the Arabic corpus is greater than the number of target-language words, while the number of source-language words in the Farsi corpus is less than the number of target-language words. This indicates that the amount of length conversion the model must perform varies to some extent across languages.
Since our task is one-to-many multilingual speech translation, the input received by the model is always English speech, which lets us apply the same preprocessing to all data. The original speech is in wav format, and most of it is long audio, so it must be segmented and converted into features before being fed to the model. We segment the speech data using the start time and duration of each segment given in MUST-C. The preprocessing stage includes extracting MFCC features, training the SentencePiece (Kudo and Richardson, 2018) model, generating a vocabulary, and finally generating a training set. The processed MFCC feature dimension is 80, and SpecAugment is applied for data augmentation. The SpecAugment configuration used in the experiment is shown in Table 2:
Table 2: Parameter settings for SpecAugment
Parameters     Values
freq_mask_F    27
freq_mask_N    2
time_mask_N    2
time_mask_T    100
time_mask_p    1.0
time_wrap_W    0
The SpecAugment method uses three different data augmentation methods: time warping, frequency masking, and time masking. Time warping selects an area along the time dimension for warping. Frequency masking selects an area along the frequency dimension for masking; in our configuration, the maximum length of the masked part is 27 (parameter freq_mask_F), and freq_mask_N is the number of masked areas. Time masking selects an area along the time dimension for masking; we set time_mask_T to 100, and the number of masked areas is 2. SpecAugment increases the diversity of the training data, making the trained model more robust.
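The frequency and time masking described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Fairseq implementation: the function name `spec_augment`, the zero fill value, and the way `time_mask_p` caps the mask width are assumptions, and time warping is omitted because `time_wrap_W` is 0 in Table 2.

```python
import random

def spec_augment(features, freq_mask_F=27, freq_mask_N=2,
                 time_mask_T=100, time_mask_N=2, time_mask_p=1.0, seed=None):
    """Mask random frequency bands and time spans of a (frames x bins) matrix.

    `features` is a list of frames, each a list of feature values.
    Defaults follow Table 2; masked cells are set to 0.0 (an assumption).
    """
    rng = random.Random(seed)
    out = [row[:] for row in features]          # do not modify the input
    n_frames, n_bins = len(out), len(out[0])

    # Frequency masking: freq_mask_N bands, each at most freq_mask_F bins wide.
    for _ in range(freq_mask_N):
        width = rng.randint(0, min(freq_mask_F, n_bins))
        start = rng.randint(0, n_bins - width)
        for row in out:
            row[start:start + width] = [0.0] * width

    # Time masking: time_mask_N spans, each at most time_mask_T frames wide
    # and at most a fraction time_mask_p of the utterance length.
    max_width = min(time_mask_T, int(time_mask_p * n_frames))
    for _ in range(time_mask_N):
        width = rng.randint(0, max_width)
        start = rng.randint(0, n_frames - width)
        for t in range(start, start + width):
            out[t] = [0.0] * n_bins
    return out

# Usage: a dummy utterance of 300 frames with 80-dimensional features.
features = [[1.0] * 80 for _ in range(300)]
augmented = spec_augment(features, seed=0)
```

Because the masks are drawn anew on every call, each training epoch sees a different corrupted view of the same utterance.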
3 Method
3.1 Speech Recognition
We use a speech recognition task to pre-train the encoder parameters. Our experiments verified that pre-training the parameters with speech recognition is much better than not using pre-training. Because the parameters of the speech translation model are initialized from the encoder of the speech recognition model, we use the same structure to train the speech recognition model. Although extracting MFCC features from the original audio reduces the sequence length, the processed MFCC features still have a long time dimension and require further downsampling. In speech translation related works, a common practice is to use CNN or Shrink modules (Liu et al., 2020) to compress feature sequences. We use convolutional neural networks to downsample the extracted MFCC feature sequence: the input MFCC features first pass through a two-layer convolutional neural network, which extracts shallow features and downsamples them, and are then fed into the Transformer model to complete the speech recognition task. The model structure is shown in Figure 1. The Transformer owes its strong modeling ability to its self-attention mechanism; the multi-head attention calculation in the Transformer is shown in Figure 2. Different linear projections of the input yield Q, K, and V, and the output matrix is computed as:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V    (1)
[...] their original dimensions; the calculation in the feed-forward module is as follows:
\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2    (2)
Positional Encoding. The Transformer uses position encoding to indicate the relative position between tokens, and the calculation method is as follows:
PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)    (3)
PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)    (4)
After the convolutional neural networks extract shallow features from speech, the Transformer combines the extracted information. Convolutional neural networks are good at extracting local features, while Transformers have a stronger ability to model global features. This structure enables the model to perform well in several speech processing tasks.
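Equations (1) to (4) can be checked with a small pure-Python sketch. It is only an illustration of the formulas for a single attention head; the function names and the toy inputs are assumptions, and `d_model` is assumed to be even.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (1); Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward layer, Eq. (2): max(0, x W1 + b1) W2 + b2."""
    hidden = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    return [sum(h * W2[j][k] for j, h in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]

def positional_encoding(pos, d_model):
    """Sinusoidal encoding of one position, Eqs. (3) and (4)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])   # dims 2i and 2i+1
    return pe

# Usage with toy 2-dimensional inputs.
I2 = [[1.0, 0.0], [0.0, 1.0]]
attn_out = attention(I2, I2, I2)
ffn_out = ffn([1.0, -1.0], I2, [0.0, 0.0], [[1.0], [1.0]], [0.5])
pe0 = positional_encoding(0, 4)
```

In the full model these operations run per head and per layer on the downsampled feature sequence; the sketch leaves out the learned projections that produce Q, K, and V.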
        ar      de      fa      fr      ja      nl      pt      ru      tr      zh
WER     16.01   10.64   11.65   10.74   8.79    10.43   10.76   10.71   11.10   8.80

        ar      de      fa      fr      ja      nl      pt      ru      tr      zh
BLEU    12.35   23.30   12.15   32.59   12.93   27.46   28.57   14.66   11.33   22.07
After training the model on the MUST-C training set, we used its tst-COMMON test set to verify the model's effectiveness. The experimental results are shown in Table 4.
Table 4 shows that our system can complete translations into these 10 target languages, and the BLEU score exceeds 20 in 5 of them. Although the same model is used for all translation tasks, the difficulty of translation varies among languages. As shown in the table, the BLEU scores of ar, fa, ja, ru, and tr are lower than those of the other languages, even though they use a similar amount of data. On the one hand, there are significant differences in grammar rules between these target languages and the source language, making it more difficult for the model to complete the language conversion; on the other hand, the differences between target languages make it difficult to share information between them consistently.
In current work on multilingual speech translation, many methods modify the model architecture and the optimization method, whereas our system uses a simple convolutional neural network combined with the Transformer structure to achieve a relatively good effect. Compared to complex systems that modify models, our system has the following advantages: on the one hand, its training method is relatively simple and requires fewer model parameters; on the other hand, this simple structure can still effectively complete multilingual speech translation tasks. Our system can be applied to devices with strict memory requirements, and can achieve relatively satisfactory results with a small number of parameters.
5 Conclusion
This paper introduces our system submitted to the IWSLT 2023 multilingual speech translation track. We used convolutional neural networks combined with Transformer models to complete the task of translating English speech into text in 10 target languages. Our system is characterized by its simplicity and efficiency, effectively modeling local and global features in speech, and completing modal and language transformations within the model. Our system has achieved satisfactory results on the test set of 10 languages in the MUST-C corpus.
References
Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutail Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). Association for Computational Linguistics.
Mattia A Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. Must-c: a multilingual speech translation corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2012–2017. Association for Computational Linguistics.
Taku Kudo and John Richardson. 2018. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.
Yuchen Liu, Junnan Zhu, Jiajun Zhang, and Chengqing Zong. 2020. Bridging the modality gap for speech-to-text translation. arXiv preprint arXiv:2010.14920.
Matesub: the Translated Subtitling Tool at the IWSLT2023 Subtitling task
Simone G. Perone
Translated srl
via Indonesia 23
00144 Rome - Italy
simone@translated.com
nique. The segmenter, implemented as proposed in (Karakanta et al., 2020; Papi et al., 2022), inserts markers of segment boundaries into an unsegmented input text, either in the source or in the target language. It is trained on pairs of unsegmented-segmented text, where segment boundaries are marked by means of two special symbols: <eob> to mark the end of block (caption or subtitle), and <eol> to mark the end of line. Figure 3 shows an example of a sentence after inserting the markers from the corresponding fragment of the SRT file.
164
00:08:57,020 --> 00:08:58,476
I wanted to challenge the idea

165
00:08:58,500 --> 00:09:02,060
that design is but a tool
to create function and beauty.

I wanted to challenge the idea <eob> that design is but a tool <eol> to create function and beauty. <eob>
Figure 3: Subtitle file (top) and the full sentence annotated with the subtitle breaks (bottom). Figure taken from (Karakanta et al., 2020).
The neural machine translation engine translates the text from the source language (English, in the IWSLT 2023 context) into the corresponding text in the target language (here German and Spanish). Other processing modules are in charge of (i) generating captions/subtitles in SRT format (starting from transcripts, word timestamps, translations and segmentations), and (ii) merging the SRTs of captions and subtitles into a single JSON file. The main processing steps are:
1. Segmentation of the transcription on the basis of acoustic cues (audio blocks)
2. Segmentation of audio blocks into caption blocks (and lines) by means of the source language segmenter
3. Automatic translation of each caption block into the target language(s) (subtitle blocks)
4. Segmentation of subtitle blocks into lines by means of the target language segmenter
5. Timing projection from the CTM to the caption/subtitle blocks
6. Packaging of SRT and JSON files.
Note that the translation of each block in step 3 is done without looking at the context, i.e. at the surrounding blocks. On the one hand, this worsens the quality of the translation a little, but, on the other, it facilitates the satisfaction of the reading speed requirement through the n-best mechanism, sketched in the next section.
1.1.3 Machine translation
Neural machine translation is provided by ModernMT (https://www.modernmt.com/; Bertoldi et al., 2021) through a REST API connection. ModernMT implements the Transformer (Vaswani et al., 2017) architecture; generic big models (about 200M parameters each), trained on both public and proprietary data, cover hundreds of languages (https://www.modernmt.com/api/#languages) in any direction, through a seamless integration of the pivot based approach, where the pivot language is English. Matesub requests ModernMT to provide the 16 best translations of
each block (step 3 mentioned in the previous section); among them, the hypothesis with the highest probability whose length satisfies the reading speed constraint (given the duration of the block) is selected. If no such hypothesis exists, the shortest is chosen.
1.2 The editor
Matesub provides a WYSIWYG editor, which allows the user to review and correct the subtitles automatically generated and synced in the chosen target language by the back-end subtitling pipeline. Figure 4 shows a screenshot of the Matesub editor. The editor permits the user to easily fix both translation and segmentation errors, thanks to the rich catalogue of functions and user-friendliness. Once the editing is over, subtitles can be embedded in the video or exported in production-ready SRT files or any other supported subtitles format.
2 Submission and Results
Translated participated in the Subtitling shared task at IWSLT 2023 with the back-end subtitling pipeline of Matesub. No adaptation of the general purpose pipeline was carried out, therefore the quality of subtitles generated for the audio-visual documents proposed in the shared task is that typically expected from the in-production system before the post-editing stage. Since the neural models of Matesub (ASR, text segmenter and MT) were trained on more resources than those allowed for the constrained condition, we labelled our submission as unconstrained; it was also our only submission, and as such it is the primary run.
Table 2 shows scores of our test set subtitles as computed by the organizers (Agarwal et al., 2023). They are in line with those we obtained on the dev sets.
Without knowing the results of the other submissions, it is hard to judge the results obtained. However, some considerations can be made:
• As expected, from the pure speech translation perspective, the TED domain is the easiest one by far
• Surprisingly, at least when German is the target language, the EPTV domain is as challenging as ITV and PELOTON, which we expected to be the most difficult ones
• Assuming that BLEURT and ChrF are more reliable than BLEU and TER (according to (Kocmi et al., 2021), for example), it seems that the quality of TED and of Spanish EPTV subtitles is high, while subtitles of ITV, PELOTON and German EPTV documents would need major post-editing
• Since SubER is based on TER and Sigma on BLEU, their values match the scores of those metrics rather than BLEURT, ChrF and the subtitle compliance as measured by CPS/CPL/LPB, possibly affecting the final ranking of Matesub
• The compliance of subtitles is language independent
• Despite the fact that Matesub does not implement any hard rule, relying only on machine learning methods, CPL and LPB are (almost) perfect
• The reading speed (CPS) is under the max threshold of 21 characters per second in about 85% of subtitles; in more detail, the average is about 18.5 and only in 5% of cases does it exceed 30 characters per second, values that we consider satisfactory.
Acknowledgements
Matesub received funding from the European Institute of Innovation and Technology (EIT), a body of the European Union, through the MateDub (2020) and MateDub++ (2021) projects. Within them, the Matesub subtitling chain was developed in collaboration with the FBK's MT research unit. We especially thank Mauro Cettolo for his invaluable contribution to the success of this product and the support he gave us in the participation to the Subtitling track at IWSLT 2023.
References
Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, William Chen, Khalid Choukri, Alexandra Chronopoulou, Thierry Declerck, Qianqian Dong, Yannick Estève, Kevin Duh, Marcello Federico, Souhir Gahbiche, Benjamin Hsu, John Judge, Tom Ko, Rishu Kumar, Xutail Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Matteo Negri, Jan Niehues, Xing Niu, Atul Ojha Kr., John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Elijah Rippeth, Elizabeth Salesky, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Brian Thompson, Marco Turchi, Alex Waibel, Mingxuan Wang, and Rodolfo Zevallos. 2023. Findings of the IWSLT 2023 Evaluation Campaign. In Proc. of IWSLT, Toronto, Canada.
                  Subtitle quality    Translation quality                  Subtitle compliance
en-   domain      SubER↓   Sigma↑     BLEU↑   ChrF↑   TER↓    BLEURT↑     CPS↑    CPL↑    LPB↑
-de   EPTV        87.04    57.73      12.08   43.59   85.53   .4705       88.59   99.20   100.00
      TED         67.70    62.01      20.37   50.05   65.55   .5500       90.55   98.61   100.00
      ITV         73.11    67.04      14.92   37.13   71.27   .4501       80.21   99.47   100.00
      PELOTON     79.72    68.27      10.06   34.46   78.25   .4264       89.17   99.29   100.00
      ALL         75.41    65.22      14.81   39.50   73.60   .4591       84.97   99.25   100.00
-es   EPTV        74.47    59.59      21.06   54.11   72.08   .5728       90.15   99.44   100.00
      TED         45.94    66.85      40.36   65.72   43.81   .7047       92.62   99.48   100.00
      ITV         71.25    71.06      18.50   41.07   69.57   .4592       81.93   99.51   100.00
      PELOTON     74.87    70.99      15.96   41.86   73.88   .4666       88.27   99.60   100.00
      ALL         68.11    68.37      22.34   47.38   66.66   .5059       86.07   99.52   100.00
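The n-best selection rule of Section 1.1.3 and the reading-speed (CPS) check behind the compliance column above can be sketched as follows. This is a hypothetical illustration, not Matesub code: the function names, the (text, score) representation of hypotheses, counting every character including spaces, and reusing the 21 characters-per-second threshold inside the selection step are all assumptions.

```python
def chars_per_second(text, duration_s):
    """Reading speed of a subtitle block (assumption: spaces are counted)."""
    return len(text) / duration_s

def select_translation(nbest, duration_s, max_cps=21.0):
    """Pick one hypothesis from an n-best list of (text, score) pairs.

    The most probable hypothesis that satisfies the reading-speed
    constraint is returned; if none does, the shortest one is returned.
    """
    fitting = [(text, score) for text, score in nbest
               if chars_per_second(text, duration_s) <= max_cps]
    if fitting:
        return max(fitting, key=lambda pair: pair[1])[0]
    return min(nbest, key=lambda pair: len(pair[0]))[0]

# Usage with a toy 3-best list (higher score means more probable).
nbest = [("a" * 50, -0.1), ("b" * 40, -0.5), ("c" * 30, -0.9)]
chosen_long_block = select_translation(nbest, duration_s=2.0)   # 25, 20, 15 CPS
chosen_short_block = select_translation(nbest, duration_s=1.0)  # 50, 40, 30 CPS
```

With a 2-second block the top hypothesis is too fast to read, so the second one is chosen; with a 1-second block nothing fits and the shortest hypothesis is the fallback.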
Nicola Bertoldi, Davide Caroselli, M. Amin Farajian, Marcello Federico, Matteo Negri, Marco Trombetti, and Marco Turchi. 2021. Translation system and method. US Patent 11036940.
Alina Karakanta, Matteo Negri, and Marco Turchi. 2020. MuST-cinema: a speech-to-subtitles corpus. In Proc. of LREC, pages 3727–3734, Marseille, France.
matic subtitling with automatically segmented st corpora. In Proc. of AACL-IJCNLP, pages 480–487.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. of NIPS, pages 5998–6008.
Augmentation Invariant Discrete Representation for
Generative Spoken Language Modeling
Itai Gat♢, Felix Kreuk♢, Tu Anh Nguyen♢, Ann Lee♢, Jade Copet♢,
Gabriel Synnaeve♢, Emmanuel Dupoux♠,♢, Yossi Adi♡,♢
♢ FAIR Team, Meta AI Research
♠ ENS, INRIA, INSERM, UPEC, PSL Research University
♡ The Hebrew University of Jerusalem
Figure 1: Generative Spoken Language Modeling is composed of three components: (i) Speech-to-unit, (ii) Unit
language model, and (iii) Unit-to-speech. Pre-trained ASR and language models are used for evaluation.
ations, namely time-stretch, pitch-shift, additive-noise, and reverberation. Our premise is that while these variations modify the signal, its underlying content remains the same, especially under the unit repetition removal process. Therefore, a robust representation should be affected by such variations to a minimal extent.
As a first step, we propose a set of metrics for evaluating the model's robustness. Then, we point to the lack of robustness of these models with respect to the aforementioned variations. Next, we design a simple and effective method for learning augmentation-invariant discrete representation on top of any speech SSL model. We demonstrate how such a method greatly improves robustness. Then, we empirically show that performance improves on several tasks for various SSL models. Specifically, we evaluate the newly proposed speech encoders on zero-shot evaluation tasks covering encoding and modeling, i.e., ABX, sWUGGY, and sBLIMP (Nguyen et al., 2020), together with a high-level downstream task in the form of speech-to-speech translation.
2 Background
The general Generative Spoken Language Modeling (GSLM) pipeline is comprised of three main modules: (i) Speech-to-unit, (ii) Unit language model, and (iii) Unit-to-speech, where each of these modules is trained separately. Speech resynthesis can be achieved while ignoring the language model and directly feeding the quantized units into the unit-to-speech module (Polyak et al., 2021) (see Figure 1 for a visual description). In the following paragraphs, we give detailed background for each of the three components mentioned above, including the standard evaluation methods.
Speech-to-unit module encodes the raw speech signal into a discrete representation. The common approach is first to encode the speech into a continuous representation and then quantize the representation to achieve a sequence of discrete units (Lakhotia et al., 2021; Polyak et al., 2021; Popuri et al., 2022; Lee et al., 2021; Kharitonov et al., 2021a; Kreuk et al., 2021; Kharitonov et al., 2022; Nguyen et al., 2022; Borsos et al., 2022; Tjandra et al., 2019, 2020).
Formally, denote the domain of audio samples by $X \subset \mathbb{R}$. The representation for a raw signal is therefore a sequence of samples $x = (x_1, \ldots, x_T)$, where $x_t \in X$ for all $1 \le t \le T$.
Consider an encoder network, $f$, that gets as input the speech utterance and outputs a sequence of spectral representations sampled at a low frequency as follows: $f(x) = (v_1, \ldots, v_{T'})$. Note that we do not assume anything about the structure of the encoder network $f$. Lakhotia et al. (2021) evaluated several speech encoders, namely, Mel-spectrogram, Contrastive Predictive Coding (Oord et al., 2018, CPC), wav2vec2 (Baevski et al., 2020), and HuBERT (Hsu et al., 2021).
Since the representations learned by such models are usually continuous, a k-means algorithm is applied over the models' outputs to generate discrete units, denoted as $z = (z_1, \ldots, z_{T'})$. Each element $z_i$ in $z$ is a positive integer, $z_i \in \{1, \ldots, K\}$ for $1 \le i \le T'$, where $K$ is the number of discrete units. We denote the quantization model with $E$.
Unit Language Model is trained on the extracted discrete units, $z$. Such a language model learns a probability distribution of the learned unit sequences, which enables direct modeling of speech data without textual supervision.
The language model can be used to generate speech conditionally or unconditionally, replicating what toddlers achieve before learning to read. Moreover, such a modeling framework allows for capturing and modeling prosodic features (Kharitonov et al., 2021a), as well as speaker identity (Borsos et al., 2022), or even natural dialogues (Nguyen et al., 2022). This is in contrast to using textual features, as they do not encode such information.
Unit-to-speech module converts the speech discrete units to a raw waveform. Lakhotia et al. (2021) used a Tacotron2.0 (Shen et al., 2018) based model followed by WaveGlow (Prenger et al., 2019) vocoder. Later, Polyak et al. (2021) proposed a unit-based vocoder based on the HiFi-GAN architecture to convert units to speech directly. Such a paradigm seems to provide high-quality generations with better efficiency as it uses only one model rather than two. Kreuk et al. (2021) and Lee et al. (2021) additionally improved the unit based vocoder to include emotional tokens for speech emotion conversion tasks, and duration modeling for direct speech-to-speech translation.
Zero-shot Evaluation. Evaluating such a complex pipeline comprised of several components is a challenging task. Lakhotia et al. (2021) proposed a set of zero-shot evaluation tasks aimed at each of the modules. Overall the proposed tasks can be divided into four main groups: (i) acoustic encoding using ABX and bitrate; (ii) language encoding using sWUGGY and sBLIMP (Nguyen et al., 2020; Lakhotia et al., 2021); (iii) resynthesis using Phoneme/Word Error Rate; (iv) speech generation using VERT (Lakhotia et al., 2021) and Meaningfulness Mean Opinion Score.
3 Robustness of Speech-to-Unit Models
The first step toward developing an effective spoken language model is to develop a robust representation. The focus of a robust representation should be on the spoken information rather than unrelated signals, such as prosodic features in the form of duration and F0, background noise, or reverberations. In the following section, we propose a metric for quantifying the degree to which augmentations change the resulting encoding.
3.1 Unit Edit Distance
A spoken language model is built on top of a discrete representation of a continuous encoder. We examine the robustness of the discrete space to augmentations that do not change the spoken content. Therefore, we are interested in a sequential distance metric between two discrete representations. It is essential to note that augmentations can alter the spatial dimension of the signal. For example, stretching a signal results in more frames, yielding a longer representation sequence. A similar phenomenon occurs when convolving with different room impulse responses to simulate reverberation. Hence, the metric should be able to measure the distance between two sequences of different lengths. Ideally, it will consider the number of deletions, insertions, and substitutions that occur due to augmenting the input data. For this purpose, we find the Levenshtein distance a good fit (Levenshtein, 1966). The Levenshtein distance measures the minimum number of changes one should make to turn one sequence into another. It has two essential properties: the first is that the score is non-negative, and when the sequences are equal, the metric equals zero. The second property is that the maximum value it can get equals the length of the longer of the two sequences. We provide a detailed explanation of the Levenshtein distance in the Appendix material.
We aggregate the distance values over the evaluation set while considering the sequence length. This is desirable since we want to normalize scores for sequences of different lengths, and the Levenshtein distance's maximum value is the original sequence's length. Another essential property of a spatial metric is repetitions. Consider time stretch as an example: it changes the number of input frames, but one would expect the deduplicated quantized signal to be the same as before the augmentation. Hypothetically, one can maximize the score by stretching the signal infinitely. To eliminate such dependencies, we compute the score on a deduplicated quantized representation. Formally, our final metric is:
Definition 3.1 (Unit Edit Distance). Given a continuous encoder $f : \mathbb{R}^{T} \to \mathbb{R}^{T'}$, a quantizer $E : \mathbb{R}^{T'} \to \{1, \ldots, K\}^{T'}$, and an input augmentation $g : \mathbb{R}^{T'} \to \mathbb{R}^{\hat{T}'}$. The deduplicated unit edit distance $\mathrm{UED}_{\mathcal{D}}(E, f, g)$ on the evaluation set $\mathcal{D}$ is:
\sum_{x \in \mathcal{D}} \frac{1}{T'_x} \mathrm{LEV}\big((E \circ f)(x), (E \circ f \circ g)(x)\big),    (1)
where $T'_x$ is the number of frames of a sample $x$. Ideally, a perfect spoken language quantizer obtains a zero distance after deduplication. Next, we study state-of-the-art spoken language representations using our proposed metric in different settings.
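Definition 3.1 can be made concrete with a short sketch. It assumes the unit sequences $(E \circ f)(x)$ and $(E \circ f \circ g)(x)$ have already been computed by some encoder and quantizer, takes $T'_x$ to be the number of frames of the clean sequence before deduplication, and sums over samples exactly as Eq. (1) is written; the function names are illustrative only.

```python
def deduplicate(units):
    """Collapse consecutive repetitions, e.g. [1, 1, 2, 2, 3] -> [1, 2, 3]."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions from a to b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]

def unit_edit_distance(clean_units, augmented_units):
    """Sum over samples of LEV(dedup(clean), dedup(augmented)) / T'_x."""
    total = 0.0
    for clean, aug in zip(clean_units, augmented_units):
        total += levenshtein(deduplicate(clean), deduplicate(aug)) / len(clean)
    return total

# Usage: the first pair differs only by repetitions (as after time stretch),
# the second pair has one substituted unit after deduplication.
clean = [[1, 1, 2, 2, 3], [1, 1, 2, 3]]
augmented = [[1, 1, 1, 2, 2, 2, 3, 3], [1, 4, 4, 3]]
ued = unit_edit_distance(clean, augmented)
```

The first pair contributes zero because deduplication removes the effect of stretching, which is exactly the dependency the definition is designed to eliminate.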
[Figure 2 plot: four panels, (a) Time stretch, (b) Pitch shift, (c) Reverberation, (d) Noise; x-axis: Number of units (K), from 0 to 500; y-axis: UED, from 10 to 60.]
Figure 2: UED scores for various augmentations and number of clusters. We note that the UED is relatively high (the distance is normalized). We also note that the UED monotonically increases with the number of units used. We multiply the scores by a hundred.
Figure 3: Illustration of our method: We forward a clean signal through an encoder followed by a pre-trained
quantizer (k-means). Next, we forward an augmented signal through the same encoder, followed by a new quantizer
(green). The CTC loss between the deduplicated output of the clean signal and the output of the augmented signal is
used to learn the parameters of the new quantizer. In the iterative approach, post the convergence of the learned
quantizer E0 , we freeze it and learn a new quantizer E1 that distills information from E0 .
Table 1: Unit edit distance study: Using our metric, we assess the robustness of various quantization methods on
top of a HuBERT representation. This study uses four different augmentations: time stretching, pitch shifting,
reverberation, and noise injection. The non-iterative (Section 4.1) and iterative (Section 4.2) methods significantly
and consistently improve the robustness of k-means. Pseudo-labeling accounts for most of the improvement. By
applying our method iteratively, we can improve it further. For readability, we multiply the scores by a hundred.
the converged E1 . We repeat this process K times. and WavLM are in Appendix C. To match the cur-
This process needs more careful training. We note rent k-means training set, we use the Librispeech-
that it is essential to replace the quantizers only 100h to learn our quantizer (Panayotov et al., 2015).
post-convergence. We analyze our metric using the ‘clean’ and ‘other’
development sets from Librispeech. A detailed
5 Experiments setup is provided in Appendix B.
In the following, we assess the efficacy of our 5.1 Analysis
method using state-of-the-art self-supervised rep-
In Section 3, we presented an evaluation metric
resentations and popular discriminative and gener-
that assesses the robustness of a quantized speech
ative evaluation tasks. It is important to note that
representation to augmentations. The metric is
a single metric cannot tell the whole story. For
insensitive to changes in the length of the signal.
example, similarly to perplexity, all representations
Using it, we investigated the current state-of-the-
can be assigned to the same cluster, which achieves
art representations. In the following, we study our
a perfect unit edit distance but a poor representa-
invariant quantization method.
tion. We first examine our proposed method using
the unit edit distance along with other discrimina- Table 1 presents the unit edit distance metric us-
tive and generative performance metrics. Then, we ing our robustness method with and without the
show that our method improves downstream tasks. iterative approach. Compared with the k-means
method, which is currently in use, our non-iterative
In Section 5.1 we use our proposed metric from
method consistently outperforms it by a large mar-
Section 3 to analyze the robustness of our method.
gin (relative improvement of at least 30%). We
In Section 5.2 we study the discriminative capabili-
also note that different augmentations affect the
ties of our method using the ABX test (Schatz et al.,
representation differently. Our iterative method
2013). Then, we evaluate our methods using gener-
provides a slight but consistent improvement over
ative zero-shot evaluation tasks such as sWUGGY
the non-iterative method. It is noticeable that the
and sBLIMP (Nguyen et al., 2020; Lakhotia et al.,
UED is increasing (i.e., worse performing) with the
2021). Finally, we demonstrate the effect of using
number of units used.
our invariant quantizer’s units in speech-to-speech
translation. 5.2 Zero-shot Evaluation
Experimental Setup. We study our method us- We evaluate the proposed method using the stan-
ing the base versions of HuBERT, wav2vec2, and dard GSLM setup, i.e., ABX, sWUGGY, sBLIMP.
WavLM. For readability, we report results for Hu- The ABX task examines the discriminative pho-
BERT in the main paper. The results for wav2vec2 netic abilities of the representation. Versteegh et al.
# units  Method            ABX (clean) ↓      ABX (other) ↓      sWUGGY ↑  sBLIMP ↑
                           Within   Across    Within   Across
50       k-means           7.52     8.90      9.84     13.5      66.12     54.91
50       Ours              6.76     7.72      9.03     11.78     67.59     55.76
50       Ours (Iterative)  6.63     7.55      9.53     12.14     67.42     57.04
100      k-means           6.37     7.72      8.4      12.29     67.70     56.16
100      Ours              5.50     6.21      7.24     10.11     67.79     57.01
100      Ours (Iterative)  5.39     6.22      7.46     10.20     68.20     56.99
200      k-means           5.99     7.14      8.23     11.51     66.51     54.64
200      Ours              5.29     6.01      7.22     9.78      70.51     56.19
200      Ours (Iterative)  5.19     6.00      7.18     9.70      70.68     56.26
500      k-means           5.98     6.98      7.89     11.43     66.92     55.97
500      Ours              5.16     6.03      7.06     9.76      70.13     55.19
500      Ours (Iterative)  4.96     5.73      6.93     9.63      69.33     56.93
Table 2: Zero-shot discriminative and generative evaluation tasks: We evaluate the ABX score on the ‘clean’ and
‘other’ development sets from Librispeech. Our method improves the scores in all setups.
(2015) show that the ABX result is a good proxy for signal content (i.e., Phoneme Error Rate). The
input to the ABX is a pair of words with a phoneme modification and a reference word containing the
same phoneme as one of the pair’s words. Next, the ABX measures the distance of the test phoneme
representation to both the correct and incorrect representations. Finally, the distance between the test
and the correct representation is expected to be lower than the distance to the incorrect
representation. The ABX task is conducted in two setups: ‘within’ and ‘across.’ ‘Within’ is evaluated
on input data from the same speaker, while ‘across’ is evaluated on input data from different speakers.
Table 2 shows the ABX results for both Librispeech ‘clean’ and ‘other’. In our experiments, we found
that the ABX score consistently and significantly improved on all the setups we tested. In this case,
the iterative approach improves more than the non-iterative one, but the improvement is inconsistent.
For a small number of units (50 and 100) on the ‘other’ split, the non-iterative model’s ABX score is
lower (i.e., better) than the iterative model’s score. Note that the ‘other’ split is challenging as it is
characterized by recordings that contain background noise and various accents.
The spot-the-word task (sWUGGY) requires detecting the real word from a pair of short utterances
such as ‘brick’ vs. ‘blick.’ The detection is done by comparing the probabilities given by a language
model for each word. This allows comparing representations by training language models on top of
them. Differently, the acceptability judgment test (sBLIMP) requires detecting the syntactically
correct sentence from a pair of sentences, one of which is syntactically correct and the other is wrong.
The detection is based on the perplexity of the language model. As presented in Table 2, our method
enables improvement in all the investigated setups for both the spot-the-word and acceptability
judgment tests. This is especially noticeable for a larger number of units. For instance, when
considering 200 or 500 units, the absolute improvement of the sWUGGY score is 4.17 and 3.21,
respectively.

5.3 Speech-to-speech Translation
Lastly, we evaluate the proposed method considering the speech-to-speech translation task. To
better assess the effectiveness of the proposed augmentation-invariant discrete representation we
follow the same setup as in Lee et al. (2022) while changing the discrete speech representation only.
Lee et al. (2022) propose a textless speech-to-speech translation method by forwarding a source
speech signal and predicting its target’s discrete representation. The authors use a k-means model
trained on top of a multilingual HuBERT (mHuBERT) for speech representation. Additionally,
the authors show that solving an auxiliary task enhances performance. We investigate the impact of
using our augmentation-invariant quantizer as an alternative to the k-means used by Lee et al. (2022).
Differently, we use HuBERT (instead of mHuBERT). Besides that, we follow the same setup
in terms of model, computation resources, and data. To evaluate the quality of the translation, the
sentence BLEU score (SacreBLEU) (Post, 2018) was used.
Table 3 presents the results for the Spanish-English and French-English setups on the Europarl-ST
development and test sets (Iranzo-Sánchez et al., 2020). It also shows the original result from Lee
et al. (2022). The proposed method improves over
Lee et al. (2022) under all the evaluated setups. Note, these results are especially interesting as the
proposed method was trained on significantly less data (ours was trained on 1k hours while Lee et al.
(2022) was trained on 100k hours).

       # units  Method     S-E    F-E
Dev    500      Invariant  17.3   16.4
Dev    1000     k-means    15.4   16.0
Dev    1000     Invariant  18.2   17.5
Test   500      Invariant  14.4   15.75
Test   1000     k-means    13.1   15.4
Test   1000     Invariant  15.9   17.1
Table 3: Speech-to-Speech Translation results: We report BLEU scores for the proposed method (Invariant)
and compare it against the k-means used in Lee et al. (2022). We report both development and test sets results
for Spanish(S)-English(E) and French(F)-English(E).

6 Related work
This work investigates the robustness of self-supervised representations for language modeling.
This is related to the advancements in speech self-supervised learning, their robustness, and modern
generative spoken language modeling. In the following, we review all three areas.

Self-supervised Learning. The field of deep learning research has significantly benefited from
self-supervised learning. Commonly, it involves encoding the input data and performing a task that
enforces the representation to learn contextual embeddings. Speech self-supervised learning can be
divided into two lines of research.
The first is discriminative. Oord et al. (2018) introduced Contrastive Predictive Coding (CPC),
which trains a convolutional encoder and a predictor for future embeddings of the encoder using
a contrastive loss. On top of it, Kharitonov et al. (2021b) propose to use time domain augmentations
to improve the CPC model further. wav2vec (Schneider et al., 2019) suggests using a contrastive loss
that requires distinguishing between true and false future audio samples. Later, wav2vec2 (Baevski
et al., 2020) learns quantized units using Gumbel softmax and predicts masked spans of the latent
speech representation. HuBERT (Hsu et al., 2021) employs a frame-based masked prediction task.
First, it quantizes input frames and then predicts masked frames.
The second line of work is generative. An early generative self-supervised work is Autoregressive
Predictive Coding (Chung et al., 2019), which predicts the spectrum of a future frame. Later, Liu
et al. (2020) introduced Mockingjay, which learns its representation by predicting non-causal context.
TERA (Liu et al., 2021) alters time, frequency, and magnitude. Then it is required to reconstruct
acoustic frames from altered versions.

Robustness. A desired property of a spoken language representation is robustness to augmentations
that do not change the spoken information. The spoken information should not differ significantly
when male and female speakers say the same content. There is an interesting trade-off between
training a robust representation and the quality of the input data. It is possible, for example, to use the
same speaker for all data points in the training set. The model would not be able to learn any speaker
bias, but this constraint prevents scaling.
Recently, the robustness of self-supervised speech representations has gained attention from
the community. WavLM (Chen et al., 2022) proposes adopting the well-known HuBERT
model (Hsu et al., 2021) and training it with an additional denoising process. The authors apply a
noising process to the training data and then predict the clean units from it. ContentVec (Qian et al.,
2022) is focused on the disentanglement of a speaker from self-supervised speech representation. The
authors propose to use three disentanglement components. First, the student network is disentangled
through two transformations. Then the representations are forwarded through a speaker condition
component. Finally, voice-converted input data points are used to generate teacher labels.

7 Conclusions
In this work, we first propose a metric for evaluating the robustness of self-supervised speech
representations applied for spoken language modeling tasks. Equipped with the aforementioned metric,
we point out the lack of robustness in current state-of-the-art speech encoders with respect to simple
signal variations that do not alter the spoken information. We then propose a simple and effective
method for augmentation-invariant discrete representation that boosts the robustness of the current
approaches and demonstrate it on three state-of-the-art self-supervised speech representation models.
We empirically show the efficacy of the proposed approach when considering encoding methods
together with textless speech-to-speech translation.
Broader Impact
As for broader impacts, this work is the first (to the best of our knowledge) which analyzes
self-supervised speech representation models, considering basic signal variations. We hope that with
the provided analysis and evaluation, researchers working on spoken language modeling and
self-supervised speech representation learning will consider reporting the proposed metric setup along
with evaluation of downstream tasks.

Limitations
The proposed method has several limitations that should be taken into consideration when employing
it. First, the method relies on an existing model, e.g., k-means, which creates a dependency between
the performance of the initial and the robust models. Second, the flow is not trained end-to-end,
which can also limit its performance as end-to-end training allows improvement of the robustness of
the whole representation. Lastly, to fully assess the effectiveness of the method, multiple metrics
need to be examined. This can be a limitation as interpreting the results from multiple metrics may
not be straightforward. However, it gives a more complete picture of the model’s performance.

References
Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. 2022. Audiolm: a language modeling approach to audio generation. arXiv preprint arXiv:2209.03143.
Shlomo E Chazan, Lior Wolf, Eliya Nachmani, and Yossi Adi. 2021. Single channel voice separation for unknown number of speakers under reverberant and noisy settings. In ICASSP.
Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. 2022. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing.
Yu-An Chung, Wei-Ning Hsu, Hao Tang, and James Glass. 2019. An unsupervised autoregressive model for speech representation learning. In INTERSPEECH.
Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio set: An ontology and human-labeled dataset for audio events. In ICASSP.
Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML.
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchis, Jorge Civera, and Alfons Juan. 2020. Europarl-st: A multilingual corpus for speech translation of parliamentary debates. In ICASSP.
Thorsten Karrer, Eric Lee, and Jan O Borchers. 2006. Phavorit: A phase vocoder for real-time interactive time-stretching. In ICMC.
Eugene Kharitonov, Jade Copet, Kushal Lakhotia, Tu Anh Nguyen, Paden Tomasello, Ann Lee, Ali Elkahky, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, et al. 2022. textless-lib: a library for textless spoken language processing. arXiv preprint arXiv:2202.07359.
Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, and Emmanuel Dupoux. 2021b. Data augmenting contrastive learning of speech representations in the time domain. In SLT.
Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu-Anh Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, and Yossi Adi. 2021. Textless speech emotion conversion using decomposed and discrete representations. arXiv preprint arXiv:2111.07402.
Kushal Lakhotia, Eugene Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, and Emmanuel Dupoux. 2021. On Generative Spoken Language Modeling from Raw Audio. TACL.
Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Tang, Juan Pino, et al. 2021. Direct speech-to-speech translation with discrete units. arXiv preprint arXiv:2107.05604.
Ann Lee, Hongyu Gong, Paul-Ambroise Duquenne, Holger Schwenk, Peng-Jen Chen, Changhan Wang, Sravya Popuri, Yossi Adi, Juan Pino, Jiatao Gu, and Wei-Ning Hsu. 2022. Textless speech-to-speech translation on real data. In NAACL.
Vladimir Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady.
Andy T Liu, Shang-Wen Li, and Hung-yi Lee. 2021. Tera: Self-supervised learning of transformer encoder representation for speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
Andy T Liu, Shu-wen Yang, Po-Han Chi, Po-chun Hsu, and Hung-yi Lee. 2020. Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP.
Tu Anh Nguyen, Maureen de Seyssel, Patricia Rozé, Morgane Rivière, Evgeny Kharitonov, Alexei Baevski, Ewan Dunbar, and Emmanuel Dupoux. 2020. The zero resource speech benchmark 2021: Metrics and baselines for unsupervised spoken language modeling. In NeurIPS – Self-Supervised Learning for Speech and Audio Processing Workshop.
Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, et al. 2022. Generative spoken dialogue language modeling. arXiv preprint arXiv:2203.16502.
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: an asr corpus based on public domain audio books. In ICASSP.
Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, and Emmanuel Dupoux. 2021. Speech resynthesis from discrete disentangled self-supervised representations. arXiv preprint arXiv:2104.00355.
Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Pino, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, and Ann Lee. 2022. Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation. arXiv preprint arXiv:2204.02967.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In EMNLP.
Ryan Prenger, Rafael Valle, and Bryan Catanzaro. 2019. Waveglow: A flow-based generative network for speech synthesis. In ICASSP.
Kaizhi Qian, Yang Zhang, Heting Gao, Junrui Ni, Cheng-I Lai, David Cox, Mark Hasegawa-Johnson, and Shiyu Chang. 2022. Contentvec: An improved self-supervised speech representation by disentangling speakers. In ICML.
Chandan KA Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, et al. 2020. The interspeech 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results. arXiv preprint arXiv:2005.13981.
Thomas Schatz, Vijayaditya Peddinti, Francis Bach, Aren Jansen, Hynek Hermansky, and Emmanuel Dupoux. 2013. Evaluating speech features with the minimal-pair abx task: Analysis of the classical mfc/plp pipeline. In INTERSPEECH.
Robin Scheibler, Eric Bezzam, and Ivan Dokmanić. 2018. Pyroomacoustics: A python package for audio room simulation and array processing algorithms. In ICASSP.
Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised pre-training for speech recognition. In INTERSPEECH.
Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj Skerry-Ryan, et al. 2018. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In ICASSP.
Joachim Thiemann, Nobutaka Ito, and Emmanuel Vincent. 2013. Demand: a collection of multi-channel recordings of acoustic noise in diverse environments. In Proc. Meetings Acoust.
Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. 2020. Transformer vq-vae for unsupervised unit discovery and speech synthesis: Zerospeech 2020 challenge. In Interspeech.
Andros Tjandra, Berrak Sisman, Mingyang Zhang, Sakriani Sakti, Haizhou Li, and Satoshi Nakamura. 2019. Vqvae unsupervised unit discovery and multi-scale code2spec inverter for zerospeech challenge 2019. In Interspeech.
Maarten Versteegh, Roland Thiolliere, Thomas Schatz, Xuan Nga Cao, Xavier Anguera, Aren Jansen, and Emmanuel Dupoux. 2015. The zero resource speech challenge 2015. In Sixteenth annual conference of the international speech communication association.
Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y Lin, Andy T Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, et al. 2021. Superb: Speech processing universal performance benchmark. arXiv preprint arXiv:2105.01051.
A Levenshtein Distance
Throughout the paper, we use a version of the Levenshtein distance. In this section, we detail the
Levenshtein distance between two sequences. Let $x \in \{1, \dots, K\}^{T_x}$ and
$y \in \{1, \dots, K\}^{T_y}$ be two discrete vectors, not necessarily of the same size. Let
us also denote the operator tail(x) to return a copy of the vector x without its first element. Then,
the Levenshtein distance is defined recursively by

\[
\mathrm{Lev}(x, y) =
\begin{cases}
|x|, & \text{if } |y| = 0 \\
|y|, & \text{if } |x| = 0 \\
1 + \min
\begin{cases}
\mathrm{Lev}(\mathrm{tail}(x), y) \\
\mathrm{Lev}(x, \mathrm{tail}(y)) \\
\mathrm{Lev}(\mathrm{tail}(x), \mathrm{tail}(y))
\end{cases}, & \text{otherwise}
\end{cases}
\]

where |x|, |y| are the lengths of the vectors x and y respectively. Note, in our implementation, we use
deduplicated sequences.

B Extended Experimental Setup
Models. We study our method using the base versions of HuBERT, wav2vec2, and WavLM. Similar
to prior work, we use the ninth layer for HuBERT and WavLM, and the sixth layer for wav2vec2. For
readability, we report results for HuBERT in the main paper. The results for wav2vec2 and WavLM are
presented in Appendix C. In our quantizer learning process, we use a learning rate of 0.0001, a batch
size of 32, and the Adam optimizer (Kingma and Ba, 2014). Our quantizer is composed of three fully
connected layers with LeakyReLU activation between them. The dimensions of those layers are
determined by the floor division of the difference between the upstream dimension and the number
of units. We train our quantizer using a single NVIDIA V100 GPU.
Datasets. To match the current k-means popular training set, we use the Librispeech-100h to learn
our quantizer (Panayotov et al., 2015). We analyze our metric using the ‘clean’ and ‘other’
development sets from Librispeech. The augmentations in all setups include time stretch, pitch shift,
reverberation, and noise injection (exact parameters are detailed in Section 3.2.1). For the sWUGGY
and sBLIMP evaluations, we use the ‘big’ transformer language model from Lakhotia et al. (2021).

This appendix begins with a detailed explanation of the Levenshtein distance (Section A). Then,
in Section C, we present additional results. We report results on two additional state-of-the-art
self-supervised speech representations. We show that our method is indeed effective for those
representations as well, consistent with the results shown in the main paper.

C Additional Results
In the following, we provide additional results on the state-of-the-art representations “wav2vec2” and
“WavLM” (Baevski et al., 2020; Chen et al., 2022).
Tables 4 and 5 present the UED scores for both the wav2vec2 and WavLM models. Using our
method, we observe robustness improvements for both of the models. However, it is notable that the
WavLM model is more robust than the wav2vec2 model. This is reasonable since WavLM was trained
to be a more robust model using noisy training samples.
Tables 6 and 7 present the discriminative and generative metrics for both wav2vec2 and WavLM.
We observe a consistent improvement using our robust quantizer as in the robustness metrics.
However, for the WavLM, the improvements are sometimes marginal (except for k = 50 where k-means
outperforms our method). The WavLM model is trained with a HuBERT architecture, with more
data and noisy samples. Interestingly, while WavLM presents better performance on various
downstream tasks than HuBERT, its ABX, sWUGGY, and sBLIMP scores are lower.
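The Levenshtein recursion from Appendix A, applied to deduplicated unit sequences, can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names `deduplicate` and `levenshtein` are hypothetical, the recursion is evaluated iteratively row by row (which gives the same value as the recursive form but avoids exponential recomputation), and it assumes the standard Levenshtein convention that matching first elements incur zero cost, a case the printed formula does not list explicitly.

```python
from itertools import groupby


def deduplicate(units):
    """Collapse runs of repeated units, e.g. [3, 3, 7, 7, 7, 3] -> [3, 7, 3]."""
    return [unit for unit, _ in groupby(units)]


def levenshtein(x, y):
    """Edit distance between two discrete sequences, computed row by row."""
    # previous[j] holds the distance between the processed prefix of x and y[:j].
    previous = list(range(len(y) + 1))
    for i, xi in enumerate(x, start=1):
        current = [i]
        for j, yj in enumerate(y, start=1):
            # Assumption: equal elements cost 0 (standard Levenshtein convention).
            substitution = previous[j - 1] + (0 if xi == yj else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Hypothetical unit sequences for a clean signal and an augmented version of it.
clean = deduplicate([5, 5, 5, 2, 2, 9, 9, 9, 9])
augmented = deduplicate([5, 5, 2, 2, 2, 4, 9, 9])
distance = levenshtein(clean, augmented)
```

Deduplicating first means that a change in how long a unit is held (for instance after time stretching) does not by itself contribute to the distance; only inserted, deleted, or substituted units do.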
# units  Method            Augmentation
                           Time stretch  Pitch shift  Reverberation  Noise
50       k-means           50.81±0.41    58.66±1.16   43.71±0.77     32.17±0.61
50       Ours              38.74±0.45    42.33±0.97   33.69±0.73     25.36±0.49
50       Ours (Iterative)  36.68±0.39    40.29±1.04   33.28±0.74     23.99±0.51
100      k-means           55.30±0.61    65.23±0.91   48.41±0.72     33.97±0.46
100      Ours              42.32±0.46    47.07±0.88   36.83±0.71     27.15±0.75
100      Ours (Iterative)  40.43±0.57    45.73±0.90   36.34±0.77     26.22±0.59
200      k-means           59.85±0.39    70.80±1.31   53.13±0.67     36.64±0.62
200      Ours              46.84±0.42    51.60±1.21   40.54±0.66     32.61±0.67
200      Ours (Iterative)  44.90±0.35    49.59±1.25   40.58±0.62     29.49±0.57
500      k-means           66.12±0.48    77.01±0.98   59.69±1.01     37.22±0.65
500      Ours              51.65±0.49    55.40±1.03   45.85±0.93     33.17±0.62
500      Ours (Iterative)  50.50±0.53    57.12±1.02   44.67±0.98     31.92±0.69

# units  Method            Augmentation
                           Time stretch  Pitch shift  Reverberation  Noise
50       k-means           47.66±0.49    52.93±1.02   33.45±0.62     28.46±0.61
50       Ours              39.12±0.43    44.25±1.06   31.58±0.62     25.32±0.67
50       Ours (Iterative)  36.79±0.46    40.16±1.05   25.73±0.64     25.01±0.66
100      k-means           52.61±0.51    58.44±0.72   36.27±0.45     29.44±0.64
100      Ours              43.55±0.53    49.03±0.75   30.54±0.44     25.93±0.67
100      Ours (Iterative)  42.11±0.50    46.08±0.74   28.88±0.47     25.47±0.59
200      k-means           58.50±0.42    64.75±1.02   41.05±0.54     30.93±0.62
200      Ours              49.57±0.41    53.48±1.09   34.29±0.53     26.66±0.65
200      Ours (Iterative)  47.82±0.46    52.47±1.01   32.88±0.55     26.09±0.62
500      k-means           64.25±0.67    70.55±0.75   45.63±0.83     33.17±0.71
500      Ours              55.41±0.64    59.79±0.87   42.85±0.78     28.46±0.79
500      Ours (Iterative)  52.92±0.69    57.84±0.81   40.46±0.81     27.09±0.72
# units  Method            ABX (clean) ↓      ABX (other) ↓      sWUGGY ↑  sBLIMP ↑
                           Within   Across    Within   Across
50       k-means           12.03    15.31     13.61    19.07     49.76     53.92
50       Ours              11.18    13.82     13.34    18.39     -         -
50       Ours (Iterative)  10.35    12.75     12.64    17.29     49.65     55.29
100      k-means           11.27    13.99     13.06    17.11     51.63     53.87
100      Ours              9.86     11.81     11.44    16.63     -         -
100      Ours (Iterative)  9.24     11.30     11.37    16.14     51.90     54.95
200      k-means           11.13    14.42     12.37    18.02     51.29     54.99
200      Ours              10.19    12.41     11.85    17.52     -         -
200      Ours (Iterative)  9.00     11.11     11.49    16.53     51.99     55.67
500      k-means           12.06    15.61     13.77    19.94     52.21     54.32
500      Ours              10.76    13.83     13.52    19.60     -         -
500      Ours (Iterative)  10.16    12.42     12.56    18.24     52.93     55.17
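
The ABX columns reported above rest on a simple triplet comparison: a test item should be closer to the item sharing its phoneme than to the item that differs. The sketch below illustrates that decision rule under loudly simplifying assumptions: each item is a single fixed-size vector, distance is Euclidean, and ties are counted as errors. Real ABX evaluations operate on frame sequences rather than single vectors, and the names `euclidean` and `abx_error_rate` are hypothetical, not part of any evaluation toolkit.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def abx_error_rate(triplets, distance=euclidean):
    """Percentage of (test, correct, incorrect) triplets that are misjudged.

    A triplet counts as an error when the test item is not strictly closer
    to the correct item than to the incorrect one (assumption: ties are errors).
    """
    errors = sum(
        1
        for test, correct, incorrect in triplets
        if distance(test, correct) >= distance(test, incorrect)
    )
    return 100.0 * errors / len(triplets)


# Hypothetical 2-D representations; the second triplet is deliberately wrong.
triplets = [
    ((0.0, 0.0), (0.1, 0.0), (1.0, 1.0)),
    ((1.0, 1.0), (0.0, 0.0), (0.9, 1.0)),
    ((0.5, 0.5), (0.5, 0.4), (0.0, 0.0)),
]
rate = abx_error_rate(triplets)
```

Lower values are better, which matches the downward arrows in the ABX column headers. In the ‘within’ setup all three items of a triplet would come from one speaker, while in the ‘across’ setup the test item would come from a different speaker.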
DePA: Improving Non-autoregressive Machine Translation with
Dependency-Aware Decoder
Jiaao Zhan1, Qian Chen, Boxing Chen, Wen Wang, Yu Bai1, Yang Gao1∗
1 School of Computer Science and Technology,
Beijing Institute of Technology, Beijing, China
jiaao_zhan@163.com
{lukechan1231,chenboxing,wwang.969803}@gmail.com
{yubai,gyang}@bit.edu.cn
Table 1: Case studies of our proposed FBD approach on the highly competitive fully NAT model GLAT (Qian et al., 2021) for
alleviating three types of multi-modality errors on the IWSLT16 DE-EN validation set. Repetitive tokens are in red. Source words
that are not semantically translated are in bold and underlined. Wrong lexical choices (for polysemous words) and redundant
words are in blue. F-NAT denotes only modeling forward dependencies while FB-NAT denotes modeling both forward and
backward dependencies, the same as the models in Table 5. Case studies of our proposed IT approach are in Appendix.
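
The difference between modeling only forward dependencies (F-NAT) and modeling both directions (FB-NAT) can be pictured with triangular attention masks, which is how this paper describes its FBD approach. The following is a minimal sketch rather than the authors' code: the helper names `forward_mask` and `backward_mask` are hypothetical, masks are plain nested lists of booleans, and allowing each position to attend to itself (the diagonal) is an assumption.

```python
def forward_mask(length):
    """mask[i][j] is True when position i may attend to position j <= i.

    Lower-triangular: each token sees itself and previous tokens only
    (assumption: the diagonal is included).
    """
    return [[j <= i for j in range(length)] for i in range(length)]


def backward_mask(length):
    """mask[i][j] is True when position i may attend to position j >= i.

    Upper-triangular: each token sees itself and future tokens only.
    """
    return [[j >= i for j in range(length)] for i in range(length)]


fwd = forward_mask(3)
bwd = backward_mask(3)
```

With the forward mask, a token can only condition on what lies to its left; with the backward mask, only on what lies to its right. Training phases that alternate between the two expose the decoder to both kinds of target-side dependency.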
The NAT model modeling only forward dependency (F-NAT) incorrectly translates “woher” into
“how” and outputs “How do I come from?”; whereas the model modeling both forward and backward
dependency (FB-NAT) translates it correctly into “Where do I come from?”. Therefore, instead of
dependency reduction, we propose a novel and general Dependency-Aware Decoder (DePA), which
enhances the learning capacity of fully NAT models and enables them to learn complete and complex
forward and backward target dependencies in order to alleviate the multi-modality issue.
Firstly, we enhance the NAT decoder to learn complete target dependencies by exploring decoder
self-attention. We believe that previous works (Guo et al., 2020a) incorporating only forward
dependency modeled by AT models into NAT models are inadequate to address multi-modality.
Therefore, we propose an effective forward-backward dependency modeling approach, denoted by
FBD, as an auto-regressive forward-backward pre-training phase before NAT training, using
curriculum learning. The FBD approach implements triangular attention masks and takes different
decoder inputs and targets in a unified framework to train the model to attend to previous or future
tokens and learn both forward and backward dependencies.
Secondly, we enhance target dependency modeling within the NAT decoder from the perspective
of the decoder input. Most prior NAT models (Gu et al., 2018; Wang et al., 2019; Wei et al., 2019)
use a copy of the source text embedding as the decoder input, which is independent from the target
representation space and hence makes target dependency modeling difficult. We transform the
initial decoder input from the source language representation space to the target language
representation space through a novel attentive transformation process, denoted by IT. Previous works
on transforming the decoder input cannot guarantee that the decoder input is in the exact target
representation space, resulting in differences from the true target-side distribution. Our proposed IT
ensures that the decoder input is in the exact target representation space and hence enables the model
to better capture target dependencies.
Our contributions can be summarized as follows: (1) We propose a novel and general
Dependency-Aware Decoder (DePA) for fully NAT models. For DePA, we propose a novel approach
FBD for learning both forward and backward dependencies in the NAT decoder, through which the
target dependencies can be better modeled. To the best of our knowledge, our work is the first to
successfully model both forward and backward target-side dependencies explicitly for fully NAT
models. We also propose a novel decoder input transformation approach (IT). IT could ease
target-side dependency modeling and enhance the effectiveness of FBD. DePA is model-agnostic and
can be applied to any fully NAT model. (2) Extensive experiments on WMT and IWSLT benchmarks
demonstrate that our DePA consistently improves the representative vanilla NAT model (Gu et al.,
2018), the highly competitive fully NAT model GLAT (Qian et al., 2021) and the current SOTA
of fully NAT models, CTC w/ DSLP & Mixed Training (denoted by CTC-DSLP-MT) (Huang
et al., 2021) (DSLP denotes Deep Supervision and Layer-wise Prediction), by up to +0.85 BLEU on
the SOTA CTC-DSLP-MT, +1.88 BLEU on GLAT, and +2.2 BLEU on vanilla NAT, while preserving
inference latency comparable to other fully NAT models, about 15× speed-up over AT models.
Experiments show that DePA achieves greater BLEU gains with less speed-up loss than DSLP when
applied to various fully NAT models.

2 Related Work
Forward and Backward Dependencies. Prior works explore bidirectional decoding to improve
modeling of both forward and backward dependencies in phrase-based statistical MT (Finch and
Sumita, 2009) and RNN-based MT (Zhang et al., 2018). For NAT, Guo et al. (2020a) and Wei et al.
(2019) use forward auto-regressive models to guide NAT training. Liu et al. (2020) introduces an
intermediate semi-autoregressive translation task to smooth the shift from AT training to NAT
training. However, backward dependencies are rarely investigated in NAT.

Decoder Input of Fully NAT Models. The decoder input of AT models consists of previously
generated tokens. However, selecting appropriate decoder input for fully NAT models could be
challenging. Most prior NAT models (Gu et al., 2018; Wang et al., 2019; Wei et al., 2019) use uniform
copy (Gu et al., 2018) or soft copy (Wei et al., 2019) of the source text embedding as the decoder
input, which is independent of the target representation space and hence hinders target dependency
modeling. Methods such as GLAT (Qian et al., 2021) and (Guo et al., 2020a,b) attempt to make
the NAT decoder input similar to the target representation space by substituting certain positions
in the decoder input with the corresponding target embedding. However, this creates a mismatch
between training and inference. Guo et al. (2019) uses phrase-table lookup and linear mapping to make
the decoder input closer to the target embedding, but this method still leaves a difference between the
decoder input and the real target-side distribution.

Fully NAT Models. To address multi-modality for fully NAT models, various approaches are
proposed. Gu et al. (2018) uses knowledge distillation (KD) (Kim and Rush, 2016) to reduce dataset
complexity. Libovickỳ and Helcl (2018) and Saharia et al. (2020) use connectionist temporal
classification (CTC) (Graves et al., 2006) for latent alignment. Sun et al. (2019) utilizes CRFs to model
target positional contexts. Kaiser et al. (2018), Ma et al. (2019) and Shu et al. (2020) incorporate
latent variables to guide generation, similar to VAEs (Kingma and Welling, 2013). Guo et al.
(2020c) initializes NAT decoders with pretrained language models. Huang et al. (2021) proposes
CTC with Deep Supervision and Layer-wise Prediction and Mixed Training (CTC-DSLP-MT),
setting new SOTA for fully NAT models on WMT benchmarks. DA-Transformer (Huang et al., 2022)
represents hidden states in a directed acyclic graph to capture dependencies between tokens and
generate multiple possible translations. In contrast, our DePA utilizes forward-backward pre-training
and a novel attentive transformation of decoder input to enhance target dependency modeling. Under
the same settings and with KD, DA-Transformer performs only comparably to CTC-DSLP-MT;
however, performance of DA-Transformer benefits notably from Transformer-big for KD while
CTC-DSLP-MT uses Transformer-base for KD. DDRS w/ NMLA (Shao and Feng, 2022) benefits greatly
from using diverse KD references while CTC-DSLP-MT uses only a single KD reference. Hence,
CTC-DSLP-MT is still the current SOTA for fully NAT models on WMT benchmarks.

Non-autoregressive Models. Besides fully NAT models, iterative NAT models are proposed such
as iterative refinement of target sentences (Lee et al., 2018), masking and repredicting words with
low probabilities (Ghazvininejad et al., 2019), edit-based methods to iteratively modify decoder
output (Stern et al., 2019; Gu et al., 2019), and parallel refinement of every token (Kasai et al., 2020).
Iterative NAT models improve translation accuracy at the cost of slower speed. Non-autoregressive
models are practically important due to high efficiency. Other than MT, they are applied to various
tasks such as image captioning (Gao et al., 2019), automatic speech recognition (Chen et al., 2019),
and text-to-speech synthesis (Oord et al., 2018).

3 Methodology

3.1 Problem Formulation
NMT can be formulated as a sequence-to-sequence generation problem. Given a sequence
$X = \{x_1, \dots, x_N\}$ in the source language, a sequence $Y = \{y_1, \dots, y_T\}$ in the target
language is generated following the conditional probability $P(Y|X)$. NAT models are proposed to
speed up generation by decoding all the target tokens in parallel, using conditional independent
factorization as:

\[
P_{NA}(Y|X) = P_L(T \mid x_{1:N}) \cdot \prod_{t=1}^{T} P(y_t \mid x_{1:N}; \theta) \tag{1}
\]

where the target sequence length $T$ is modeled by the conditional distribution $P_L$, and dependence
on previous target tokens is removed. Compared to AT models, NAT models speed up inference
significantly at the expense of translation quality, because the conditional independence assumption
in Eq. 1 enables parallel processing but lacks explicit modeling of dependency between target tokens. To enhance target dependency modeling, we propose two innovations: incorporating both forward and backward dependency modeling into the training process (Section 3.2) and transforming the decoder input into the target representation space (Section 3.3).

Figure 1: The proposed forward-backward dependency modeling (FBD) with triangular attention masks in a unified framework. The red dashed lines indicate the attention masks. We use different colors to highlight the difference of inputs and targets in each phase.

3.2 Target Dependency Modeling with Curriculum Learning (FBD)

Prior work (Guo et al., 2020a) utilizes forward dependency in AT models to initialize model parameters for NAT. However, as discussed in Section 1, for fully NAT models, only modeling forward dependency is inadequate for addressing the multi-modality problem (Finch and Sumita, 2009; Zhang et al., 2018) (the Row for F-NAT in Table 1). Our innovations include incorporating both forward and backward dependency modeling into NAT models, via triangular attention masks in a unified framework through curriculum learning (Figure 1), and investigating efficacy of different curricula. In Figure 1, the NAT decoder phase denotes standard NAT training of any NAT decoder Dec. The Forward Dependency and Backward Dependency phases serve as pre-training for NAT training, learning left-to-right and right-to-left dependencies to initialize NAT models with better dependencies. The Forward Dependency and Backward Dependency training phases apply the same upper triangle attention mask on Dec. We use KD data from AT models for each phase but the inputs and the targets are different. The Forward Dependency training phase uses $y_1$ to predict $y_2$ and so on. The Backward Dependency training phase reverses the target sequence and uses $y_2$ to predict $y_1$ and so on. The NAT Training phase uses features of each word to predict the word itself. We make the following hypotheses: (1) Considering the nature of languages, learning forward dependency in Phase 1 is easier for the model for language generation. (2) Modeling backward dependency relies on learned forward dependency knowledge, hence it should be in the second phase. In fact, we observe the interesting finding that the best curriculum remains forward-backward-forward-NAT (FBF-NAT) for both left-branching and right-branching languages, proving our hypotheses. We speculate that NAT training may benefit from another forward dependency modeling in Phase 3 because the order of left-to-right is more consistent with characteristics of natural languages, hence adding the second forward dependency modeling after FB (i.e., FBF) smooths the transition to the final NAT training. Detailed discussions are in Section 4.3.

3.3 Decoder Input Transformation (IT) for Target Dependency Modeling

Given the initial decoder input $z$ as a copy of source text embedding, we propose to directly select relevant representations from target embedding to form a new decoder input $z'$ (Figure 2). $z$ is used as the query and the selection is implemented as a learnable attention module. The learnable parameters bridge the gap between training and inference while the selection guarantees consistency between the decoder input matrix and the target representation space (i.e., the output embedding matrix of the decoder). This way, the decoder input is in the exact target-side embedding space and more conducive to modeling target dependencies for NAT models than previous approaches using source text embedding or transformed decoder input.

Decoder Input Transformation To transform $z$ into the target representation space, we apply an attention mechanism between $z$ and the output embedding matrix $\mathrm{Emb} \in \mathbb{R}^{d \times v}$, where $d$ and $v$ denote sizes of hidden states and the target vocabulary. Since NAT models usually have embedding matrix Emb including both source and target vocabularies, first, we conduct a filtering process to remove source vocabulary (mostly not used by the decoder) from the decoder output embedding matrix (the linear layer before decoder softmax). We build a dictionary that contains only target-side tokens in the training set. We then use this dictionary to filter Emb and obtain the new output embedding matrix
of the decoder $\mathrm{Emb}' \in \mathbb{R}^{d \times v'}$, where $v'$ denotes size of the filtered vocabulary. This filtering process guarantees that $\mathrm{Emb}'$ is strictly from the target representation space. The attention process starts with a linear transformation:

$$z^{l} = W_q \cdot z \qquad (2)$$

Figure 2: The proposed Decoder Input Transformation (IT) from $z$ to $z'$, where $z \in \mathbb{R}^{T \times d}$ denotes the initial decoder input copied from the source text embedding $x_{emb}$, $T$ and $d$ denote the length of the target text $y$ and the size of hidden states, respectively. $\mathrm{Emb} \in \mathbb{R}^{d \times v}$ denotes the output embedding matrix of the decoder (the target representation space), where $v$ denotes the size of the target vocabulary.

Since we can manually set $v^*$ as a relatively small number (e.g., 1000, 2000), the computational cost of the attention mechanism can be greatly reduced. We hypothesize that target-side embedding compression may also alleviate over-fitting on small datasets and confirm this hypothesis in Section 4.3.
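As a rough illustration of the decoder input transformation, the sketch below applies the linear map of Eq. 2 to each row of z, scores the result against every column of a filtered embedding matrix, and forms the new decoder input as a softmax-weighted mixture of those columns. The softmax scoring and the weighted sum are an assumption about how the attention-based selection could be realized, not the authors' exact formulation, and the names `input_transformation`, `W_q` and `emb_filtered` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def input_transformation(z, W_q, emb_filtered):
    """Map decoder input z (T x d) into the span of target embeddings.

    W_q is d x d (Eq. 2); emb_filtered is d x v', one column per target token.
    Assumption: the selection is softmax attention over the v' columns.
    """
    d, v = len(emb_filtered), len(emb_filtered[0])
    z_new, attn = [], []
    for z_t in z:
        # Eq. 2: linear transformation of the query.
        q = [sum(W_q[i][k] * z_t[k] for k in range(d)) for i in range(d)]
        # Similarity between the query and every target embedding column.
        scores = [sum(q[i] * emb_filtered[i][j] for i in range(d)) for j in range(v)]
        w = softmax(scores)
        # New decoder input: convex combination of target embeddings.
        z_new.append([sum(w[j] * emb_filtered[i][j] for j in range(v)) for i in range(d)])
        attn.append(w)
    return z_new, attn

# Toy example with d = 2, v' = 3, T = 2 (all values are made up).
emb_filtered = [[1.0, 0.0, -1.0],
                [0.0, 1.0, 0.0]]
W_q = [[10.0, 0.0],
       [0.0, 10.0]]
z = [[1.0, 0.0], [0.0, 1.0]]
z_new, attn = input_transformation(z, W_q, emb_filtered)
```

Because every row of the result is a convex combination of target embedding columns, it stays inside the target representation space by construction, which is the property the text emphasizes.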
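Returning to the curriculum of Section 3.2, a minimal sketch of how the per-phase inputs, targets and the upper-triangular mask could be built is given below. The helper names (`upper_triangle_mask`, `phase_pair`, `parse_curriculum`) are hypothetical, and treating the NAT phase as an unmasked, position-aligned prediction is a simplifying assumption rather than the authors' implementation.

```python
def upper_triangle_mask(T):
    """mask[i][j] is True where position i must NOT attend to position j (j > i)."""
    return [[j > i for j in range(T)] for i in range(T)]

def phase_pair(y, phase):
    """Return (inputs, targets, mask) for one training phase.

    'F':   forward dependency, y1 predicts y2, and so on.
    'B':   backward dependency, the target is reversed so y2 predicts y1.
    'NAT': each position predicts its own token; no triangular mask (assumption).
    """
    if phase == "F":
        seq = list(y)
    elif phase == "B":
        seq = list(reversed(y))
    elif phase == "NAT":
        return list(range(len(y))), list(y), None
    else:
        raise ValueError(f"unknown phase: {phase}")
    inputs, targets = seq[:-1], seq[1:]
    return inputs, targets, upper_triangle_mask(len(inputs))

def parse_curriculum(name):
    """'FBF-NAT' -> ['F', 'B', 'F', 'NAT']; 'NAT' -> ['NAT']."""
    parts = name.split("-")
    return (list(parts[0]) if len(parts) == 2 else []) + ["NAT"]

# Walk through the FBF-NAT curriculum for a toy target sequence.
y = ["y1", "y2", "y3", "y4"]
schedule = [(p, phase_pair(y, p)) for p in parse_curriculum("FBF-NAT")]
```

The forward and backward phases share the same mask and differ only in the order of the target sequence, which mirrors the description of a unified framework.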
Baselines and Training We implement the baseline models based on their released codebases. We implement the representative vanilla NAT (Gu et al., 2018; Qian et al., 2021; Huang et al., 2021)⁴, the highly competitive fully NAT model GLAT (Qian et al., 2021)⁵, and current fully NAT SOTA CTC w/ DSLP & Mixed Training (CTC-DSLP-MT) (Huang et al., 2021)⁶ and apply our methods to them. Following Qian et al. (2021), we use base-Transformer ($d_{model}$=512, $n_{head}$=8, $n_{layer}$=6) for WMT datasets and small-Transformer ($d_{model}$=256, $n_{head}$=4, $n_{layer}$=5) for IWSLT and SP EN-JA datasets. We use the same training setup for training the three models, Vanilla NAT, GLAT, and CTC-DSLP-MT as in their original papers cited above. We train models with batches of 64K tokens for WMT datasets, and 8K tokens for IWSLT and SP EN-JA datasets, using NVIDIA V100 GPUs. For GLAT, we use Adam optimizer (Kingma and Ba, 2015) with β = (0.9, 0.999) and set dropout rate to 0.1. For Vanilla NAT and CTC-DSLP-MT, we use Adam optimizer (Kingma and Ba, 2015) with β = (0.9, 0.98). For WMT datasets, the learning rate warms up to 5e-4 in 4K steps and gradually decays according to inverse square root schedule (Vaswani et al., 2017). As for IWSLT and SP EN-JA datasets, we adopt linear annealing (from 3e-4 to 1e-5) as in Lee et al. (2018). We choose the model with the best performance on the validation set as the final model and evaluate the final model on the test sets. For experiments using our method FBD (Section 3.2), we use the FBF-NAT configuration (as in Section 4.3) and train the same number of steps at each phase (including NAT training phase), with 300K steps for each phase for WMT datasets and 100K steps for each phase for IWSLT datasets and SP EN-JA. IT by default is without Target-side Embedding Compression (Section 3.3).

Evaluation To evaluate the translation accuracy, we use SacreBLEU (Post, 2018) for all experiments and ChrF (Popovic, 2015) (also using the SacreBLEU tool) additionally for ablation study on IWSLT benchmark. To evaluate the inference latency, following Gu and Kong (2021), we measure the wall-clock time for translating the entire WMT14 EN-DE test set with batch_size=1 on a single NVIDIA V100 GPU, then compute the average time per sentence. We report Speed-up based on the inference latency of Transformer-base AT (teacher) and fully NAT models.

4.2 Main Results

Table 2 shows the main results on the WMT benchmarks. For EN↔RO, we report the mean of BLEU from 3 runs with different random seeds for Row 12-13, all with quite small standard deviations (≤ 0.16)⁷. We apply our proposed DePA, which includes IT and FBD, to vanilla NAT, GLAT, and the current fully NAT SOTA CTC-DSLP-MT, on WMT, IWSLT, and EN-JA benchmarks. We use the same hyperparameters and random seeds to fairly compare two models. It is crucial to point out that accuracies of vanilla NAT, GLAT, and CTC-DSLP-MT models have plateaued out after 300K training steps on WMT datasets hence original papers of these three models set max training steps to 300K. We verify this observation in our own experiments as we also see no gains on these models after 300K training steps on the WMT datasets. Hence, although our DePA trains 300K × 4 = 1200K steps on WMT datasets due to FBF pre-training as in Section 4.3, all comparisons between baselines w/ DePA and w/o DePA are fair comparisons. Table 2 shows that DePA consistently improves the translation accuracy for both vanilla NAT and GLAT on each benchmark, achieving mean=+1.37 and max=+1.88 BLEU gain on GLAT and mean=+2.34 and max=+2.46 BLEU gain on vanilla NAT. DePA also improves the SOTA CTC-DSLP-MT by mean=+0.42 and max=+0.49 BLEU gain on the WMT test sets (Table 2), +0.85 BLEU gain on the IWSLT16 DE-EN validation set and +1.43 BLEU gain on the EN-JA test set (Table 3). All gains from DePA on vanilla NAT, GLAT, and CTC-DSLP-MT are statistically significant (p < 0.05) based on a paired bootstrap resampling test conducted using 1K resampling trials and the SacreBLEU tool.

Table 2 also shows that on each benchmark, the average improvement from DePA on three models (vanilla NAT, GLAT, and CTC-DSLP-MT) is within [0.90,1.56] (Row15), always larger than the average improvement from w/DSLP on

⁴ https://github.com/facebookresearch/fairseq/tree/main/examples/nonautoregressive_translation
⁵ https://github.com/FLC777/GLAT
⁶ https://github.com/chenyangh/DSLP
⁷ WMT14 EN↔DE is much larger than WMT16 EN↔RO. Since standard deviations of BLEU from multiple runs with different random seeds on WMT14 EN↔DE are very small, ≤ 0.08 (Huang et al., 2022), following prior works, we report single-run BLEU on WMT14 EN↔DE to save energy.
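The two learning-rate schedules named under Baselines and Training can be sketched as follows. This is only an illustration: the linear warm-up starting from zero and the `total_steps` horizon of the annealing schedule are assumptions, and the function names are hypothetical rather than taken from any of the cited codebases.

```python
import math

def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_steps=4000):
    """Warm up linearly to peak_lr, then decay proportionally to 1/sqrt(step)."""
    if step < warmup_steps:
        # Assumption: warm-up starts from a learning rate of zero.
        return peak_lr * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)

def linear_anneal_lr(step, total_steps, start_lr=3e-4, end_lr=1e-5):
    """Interpolate linearly from start_lr to end_lr over total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start_lr + frac * (end_lr - start_lr)

# Learning rate at a few checkpoints of a hypothetical 300K-step WMT phase.
wmt_checkpoints = {s: inverse_sqrt_lr(s) for s in (2000, 4000, 16000, 300000)}
# Learning rate halfway through a hypothetical 100K-step IWSLT phase.
iwslt_midpoint = linear_anneal_lr(50000, total_steps=100000)
```

Under these assumptions the rate peaks at step 4K and has halved again by step 16K, since the decay factor is sqrt(4000/16000) = 0.5.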
| Row# | Models | Speed-up ↑ | WMT'14 EN-DE | WMT'14 DE-EN | WMT'16 EN-RO | WMT'16 RO-EN |
|---|---|---|---|---|---|---|
| 1 | Transformer-base (teacher) | 1.0× | 27.48 | 31.39 | 33.70 | 34.05 |
| 2 | + KD | 2.5× | 27.34 | 30.95 | 33.52 | 34.01 |
| 3 | Vanilla NAT | 15.6× | 20.36 | 24.81 | 28.47 | 29.43 |
| 4 | w/ DSLP∗ | 14.8× | 22.72 | 25.83 | 30.48 | 31.46 |
| 5 | w/ DePA (Ours) | 15.4× | 23.15 | 26.59 | 30.78 | 31.89 |
| 6 | GLAT | 15.3× | 25.21 | 29.84 | 31.19 | 32.04 |
| 7 | w/ DSLP∗ | 14.9× | 25.69 | 29.90 | 32.36 | 33.06 |
| 8 | w/ DePA (Ours) | 15.1× | 26.43 | 30.42 | 33.07 | 33.82 |
| 10 | CTC∗ | 15.5× | 25.72 | 29.89 | 32.89 | 33.79 |
| 11 | w/ DSLP∗ | 14.8× | 26.85 | 31.16 | 33.85 | 34.24 |
| 12 | w/ DSLP & Mixed Training | 14.8× | 27.02 | 31.61 | 33.99 | 34.42 |
| 13 | w/ DSLP & Mixed Training & w/ DePA (Ours) | 14.7× | 27.51 | 31.96 | 34.48 | 34.77 |
| 14 | Average improvement from DSLP | - | 1.32 | 0.78 | 1.38 | 1.17 |
| 15 | Average improvement from DePA (Ours) | - | 1.50 | 0.90 | 1.56 | 1.53 |

Table 2: BLEU and Speed-up from our DePA and existing methods on WMT benchmark test sets. Speed-up is measured on WMT14 EN-DE test set. BLEUs without rescoring are reported, with the best BLEU scores in bold for each group. ∗ denotes the results are copied from previous work (Huang et al., 2021), other results are obtained by our implementation. Average improvements of DSLP are re-calculated using our results, which are slightly different from Table 1 in (Huang et al., 2021).
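The averages in Rows 14 and 15 can be reproduced from the other rows of Table 2. The short sketch below does this arithmetic; the pairing of rows (Row 4 over 3, 7 over 6, 11 over 10 for DSLP; Row 5 over 3, 8 over 6, 13 over 12 for DePA) is our reading of the table, and the variable names are hypothetical.

```python
# BLEU per column: EN-DE, DE-EN, EN-RO, RO-EN (values copied from Table 2).
rows = {
    3:  [20.36, 24.81, 28.47, 29.43],   # Vanilla NAT
    4:  [22.72, 25.83, 30.48, 31.46],   #   w/ DSLP
    5:  [23.15, 26.59, 30.78, 31.89],   #   w/ DePA
    6:  [25.21, 29.84, 31.19, 32.04],   # GLAT
    7:  [25.69, 29.90, 32.36, 33.06],   #   w/ DSLP
    8:  [26.43, 30.42, 33.07, 33.82],   #   w/ DePA
    10: [25.72, 29.89, 32.89, 33.79],   # CTC
    11: [26.85, 31.16, 33.85, 34.24],   #   w/ DSLP
    12: [27.02, 31.61, 33.99, 34.42],   #   w/ DSLP & Mixed Training
    13: [27.51, 31.96, 34.48, 34.77],   #   ... & w/ DePA
}

def average_gain(pairs):
    """Mean per-column BLEU gain over (baseline_row, improved_row) pairs."""
    n_cols = len(rows[pairs[0][0]])
    return [
        round(sum(rows[imp][c] - rows[base][c] for base, imp in pairs) / len(pairs), 2)
        for c in range(n_cols)
    ]

dslp_avg = average_gain([(3, 4), (6, 7), (10, 11)])   # compare with Row 14
depa_avg = average_gain([(3, 5), (6, 8), (12, 13)])   # compare with Row 15
```

With this pairing the computed averages agree with the reported Rows 14 and 15 to two decimal places.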
them, [0.78,1.38] (Row14). DePA brings consistent improvement over SOTA CTC-DSLP-MT on all benchmarks (Table 2 Row13-over-Row12, Table 3), hence we expect DePA to also improve DA-Transformer (Huang et al., 2022) and DDRS w/ NMLA (Shao and Feng, 2022) and will verify this w/ and w/o KD in future work. Applying DePA to fully NAT models retains the inference speed-up advantages of fully NAT models. Applying DePA to vanilla NAT, GLAT, and SOTA CTC-DSLP-MT obtains 15.4×, 15.1×, and 14.7× speed-up over the autoregressive Transformer-base (teacher) (Row1). Overall, Table 2 shows that DePA achieves greater BLEU gains with less speed-up loss than DSLP on all baselines. These results demonstrate superiority of DePA over DSLP on improving other fully NAT models.

4.3 Analysis

Ablation Study We analyze the respective efficacy of IT and FBD in DePA on the IWSLT16 DE-EN validation and the WMT and SP EN-JA test sets. Table 3 shows that FBD and IT improve GLAT by +1.26 BLEU/+1.5 ChrF and +0.34 BLEU/+1.0 ChrF on IWSLT16 DE-EN validation set, respectively. Considering that GLAT w/FBD has more training steps than GLAT, we also train GLAT (400K steps) which has the same training steps as GLAT w/FBD for fair comparison. Similar to findings on WMT datasets, we observe plateaus of accuracy on IWSLT and EN-JA datasets from more training steps than the original 100K. Just training more steps hardly improves the baseline (only +0.07 BLEU gain) on IWSLT16 DE-EN, whereas GLAT w/FBD brings +1.19 BLEU/+1.2 ChrF gains over GLAT (400K steps).

Table 4 shows our IT outperforms Linear Mapping (Guo et al., 2019) by +2.31 BLEU gain on IWSLT14 DE-EN test set. IT has the same number of extra parameters as Linear Mapping. Hence, the large gain proves that improvements from IT are not just from additional layers. The number of extra parameters of IT, as from $W_q$ in Eq. 2, is quite small: 512*512=262144 for Transformer-base on WMT datasets and 256*256=65536 for Transformer-small on IWSLT datasets. The large BLEU gain +3.18 from applying IT to vanilla NAT proves vanilla transformer decoder cannot achieve similar transformation effectiveness as IT. Table 3 shows that for language pairs with different levels of source-target vocabulary sharing, such as WMT EN-DE and DE-EN, IWSLT DE-EN, EN-RO, and EN-JA, our IT method can achieve consistent improvements over GLAT and CTC-DSLP-MT. Applying IT consistently improves GLAT and CTC-DSLP-MT although these gains are smaller than the gain on vanilla NAT. This is because decoder input of vanilla NAT only replicates source embedding, whereas GLAT and CTC-DSLP-MT already transform decoder input by replacing selected positions in decoder input with target embedding, hence reducing improvements of IT. Still, gains from w/IT+FBD over w/FBD confirm our hypothesis that IT can enhance effectiveness of FBD. On GLAT, IT+FBD yields +1.4 BLEU/+2.7 ChrF gains on IWSLT16 DE-EN and +1.43 BLEU
| Models | IWSLT16 DE-EN BLEU | IWSLT16 DE-EN ChrF | WMT'14 EN-DE BLEU | WMT'14 DE-EN BLEU | WMT'16 EN-RO BLEU | WMT'16 RO-EN BLEU |
|---|---|---|---|---|---|---|
| CTC-DSLP-MT | 31.04 | 56.7 | 27.02 | 31.61 | 34.17 | 34.60 |
| CTC-DSLP-MT w/ IT | 31.29 | 57.1 | 27.21 | 31.78 | 34.32 | 34.71 |
| CTC-DSLP-MT w/ FBD | 31.73 | 57.5 | 27.44 | 31.90 | 34.60 | 34.92 |
| CTC-DSLP-MT w/ IT+FBD | 31.89 | 57.8 | 27.51 | 31.96 | 34.68 | 34.98 |

| Models | IWSLT16 DE-EN BLEU | IWSLT16 DE-EN ChrF | EN-JA BLEU |
|---|---|---|---|
| GLAT | 29.61 | 51.8 | 27.67 |
| GLAT (400K step) | 29.68 | 52.1 | – |
| GLAT w/ IT | 29.95 | 52.8 | 27.95 |
| GLAT w/ FBD | 30.87 | 53.3 | 28.87 |
| GLAT w/ IT+FBD | 31.01 | 54.5 | 29.10 |

Table 3: Effect of IT and FBD and IT+FBD (i.e., DePA) on the IWSLT16 DE-EN validation set, the WMT and SP EN-JA test sets. We report mean of BLEU/ChrF from 3 runs with different random seeds. BLEU gains from DePA on SOTA CTC-DSLP-MT on each set, [0.85, 0.49, 0.51], are larger than std (≤ 0.17).
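Section 4.2 reports significance via paired bootstrap resampling with 1K trials. The sketch below shows the general shape of such a test; it is a generic illustration, not SacreBLEU's implementation. In particular, the per-sentence scores and the `mean` metric are stand-ins for corpus-level BLEU, which would be recomputed from n-gram statistics on each resample, and all names are hypothetical.

```python
import random

def mean(xs):
    """Stand-in corpus metric: average of per-sentence scores."""
    return sum(xs) / len(xs)

def paired_bootstrap(base_scores, sys_scores, metric=mean, trials=1000, seed=0):
    """Fraction of resampled test sets on which the system fails to beat the baseline.

    Both systems are evaluated on the SAME resampled sentence indices (paired).
    A small return value suggests the system's gain is unlikely to be due to chance.
    """
    if len(base_scores) != len(sys_scores):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(base_scores)
    not_better = 0
    for _ in range(trials):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        if metric([sys_scores[i] for i in idx]) <= metric([base_scores[i] for i in idx]):
            not_better += 1
    return not_better / trials

# Toy data: the system is better on every sentence, so it wins every resample.
base = [0.2, 0.3, 0.4, 0.5]
system = [0.3, 0.4, 0.5, 0.6]
p_value = paired_bootstrap(base, system)
```

Pairing the resampled indices matters: it removes the variance due to sentence difficulty, so only the difference between the two systems is tested.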
Table 5: BLEU from different dependency modeling curricula on GLAT. Best results for each set are in bold. NAT denotes
GLAT baseline. F and B denote forward dependency and backward dependency phase respectively (Figure 1). For example,
F-NAT denotes forward dependency training then NAT training.
| Compressed Dimension | w/o IT | 1000 | 1200 | 1400 | 1600 | 1800 | 2000 |
|---|---|---|---|---|---|---|---|
| BLEU | 29.61 | 29.45 | 29.56 | 29.77 | 29.85 | 30.39 | 29.14 |
6 Limitations

Apart from all the advantages that our work achieves, some limitations still exist. Firstly, in this work, we investigate the efficacy of applying our proposed DePA approach on the representative vanilla NAT, the highly competitive fully NAT model GLAT and current SOTA CTC-DSLP-MT for fully NAT models, but we have yet to apply DePA to iterative NAT models, such as Imputer (Saharia et al., 2020), CMLM (Ghazvininejad et al., 2019), and Levenshtein Transformer (Gu et al., 2019). Hence, the effectiveness of DePA on iterative NAT models still needs to be verified. Secondly, we have not yet incorporated re-ranking approaches such as Noisy Parallel Decoding (NPD) (Gu et al., 2018) into DePA. Thirdly, our proposed method FBD requires multiple additional training phases before NAT training, resulting in longer training time and using more GPU resources. Reducing the computational cost of FBD training is one future work that will be beneficial for energy saving. Last but not least, NAT models have limitations on handling long text. They suffer from worse translation quality when translating relatively long text. We plan to investigate all these topics in future work.

References

Yu Bao, Hao Zhou, Shujian Huang, Dongqi Wang, Lihua Qian, Xinyu Dai, Jiajun Chen, and Lei Li. 2022. GLAT: glancing at latent variables for parallel text generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 8398–8409. Association for Computational Linguistics.

Nanxin Chen, Shinji Watanabe, Jesús Villalba, and Najim Dehak. 2019. Listen and fill in the missing letters: Non-autoregressive transformer for speech recognition. arXiv preprint arXiv:1911.04908.

Katsuki Chousa, Katsuhito Sudoh, and Satoshi Nakamura. 2019. Simultaneous neural machine translation using connectionist temporal classification. arXiv preprint arXiv:1911.11933.

Andrew M. Finch and Eiichiro Sumita. 2009. Bidirectional phrase-based statistical machine translation. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP 2009, 6-7 August 2009, Singapore, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1124–1132. ACL.

Junlong Gao, Xi Meng, Shiqi Wang, Xia Li, Shanshe Wang, Siwei Ma, and Wen Gao. 2019. Masked non-autoregressive image captioning. arXiv preprint arXiv:1906.00717.

Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-predict: Parallel decoding of conditional masked language models. arXiv preprint arXiv:1904.09324.

Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25-29, 2006, volume 148 of ACM International Conference Proceeding Series, pages 369–376. ACM.

Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. 2018. Non-autoregressive neural machine translation. In International Conference on Learning Representations.

Jiatao Gu and Xiang Kong. 2021. Fully non-autoregressive neural machine translation: Tricks of the trade. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 120–133, Online. Association for Computational Linguistics.

Jiatao Gu, Changhan Wang, and Jake Zhao. 2019. Levenshtein transformer. arXiv preprint arXiv:1905.11006.

Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. 2019. Non-autoregressive neural machine translation with enhanced decoder input. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3723–3730.

Junliang Guo, Xu Tan, Linli Xu, Tao Qin, Enhong Chen, and Tie-Yan Liu. 2020a. Fine-tuning by curriculum learning for non-autoregressive neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7839–7846.

Junliang Guo, Linli Xu, and Enhong Chen. 2020b. Jointly masked sequence-to-sequence model for non-autoregressive neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 376–385.

Junliang Guo, Zhirui Zhang, Linli Xu, Hao-Ran Wei, Boxing Chen, and Enhong Chen. 2020c. Incorporating bert into parallel sequence decoding with adapters. In NeurIPS.

Chenyang Huang, Hao Zhou, Osmar R Zaïane, Lili Mou, and Lei Li. 2021. Non-autoregressive translation with layer-wise prediction and deep supervision. arXiv preprint arXiv:2110.07515.
Fei Huang, Hao Zhou, Yang Liu, Hang Li, and Minlie Huang. 2022. Directed acyclic transformer for non-autoregressive machine translation. CoRR, abs/2205.07459.

Lukasz Kaiser, Samy Bengio, Aurko Roy, Ashish Vaswani, Niki Parmar, Jakob Uszkoreit, and Noam Shazeer. 2018. Fast decoding in sequence models using discrete latent variables. In International Conference on Machine Learning, pages 2390–2399. PMLR.

Jungo Kasai, James Cross, Marjan Ghazvininejad, and Jiatao Gu. 2020. Non-autoregressive machine translation with disentangled context transformer. In International Conference on Machine Learning, pages 5144–5155. PMLR.

Yoon Kim and Alexander M Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327.

Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR (Poster).

Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1173–1182.

Jindřich Libovickỳ and Jindřich Helcl. 2018. End-to-end non-autoregressive neural machine translation with connectionist temporal classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3016–3021.

Jinglin Liu, Yi Ren, Xu Tan, Chen Zhang, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2020. Task-level curriculum learning for non-autoregressive neural machine translation. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, pages 3861–3867. ijcai.org.

Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard Hovy. 2019. Flowseq: Non-autoregressive conditional sequence generation with generative flow. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4282–4292.

Aaron Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, et al. 2018. Parallel wavenet: Fast high-fidelity speech synthesis. In International conference on machine learning, pages 3918–3926. PMLR.

Maja Popovic. 2015. chrf: character n-gram f-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, WMT@EMNLP 2015, 17-18 September 2015, Lisbon, Portugal, pages 392–395. The Association for Computer Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, pages 186–191. Association for Computational Linguistics.

Lihua Qian, Hao Zhou, Yu Bao, Mingxuan Wang, Lin Qiu, Weinan Zhang, Yong Yu, and Lei Li. 2021. Glancing transformer for non-autoregressive neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1993–2003, Online. Association for Computational Linguistics.

Qiu Ran, Yankai Lin, Peng Li, and Jie Zhou. 2020. Learning to recover from multi-modality errors for non-autoregressive neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3059–3069.

Chitwan Saharia, William Chan, Saurabh Saxena, and Mohammad Norouzi. 2020. Non-autoregressive machine translation with latent alignments. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1098–1108.

Chenze Shao and Yang Feng. 2022. Non-monotonic latent alignments for ctc-based non-autoregressive machine translation. arXiv preprint arXiv:2210.03953.

Raphael Shu, Jason Lee, Hideki Nakayama, and Kyunghyun Cho. 2020. Latent-variable non-autoregressive neural machine translation with deterministic inference using a delta posterior. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8846–8853.

Mitchell Stern, William Chan, Jamie Kiros, and Jakob Uszkoreit. 2019. Insertion transformer: Flexible sequence generation via insertion operations. In International Conference on Machine Learning, pages 5976–5985. PMLR.

Zhiqing Sun, Zhuohan Li, Haoqing Wang, Di He, Zi Lin, and Zhi-Hong Deng. 2019. Fast structured decoding for sequence models. In NeurIPS.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. 2019. Non-autoregressive machine translation with auxiliary regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5377–5384.

Bingzhen Wei, Mingxuan Wang, Hao Zhou, Junyang Lin, and Xu Sun. 2019. Imitation learning for non-autoregressive neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1304–1312.

Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, Rongrong Ji, and Hongji Wang. 2018. Asynchronous bidirectional decoding for neural machine translation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5698–5705. AAAI Press.
| | Case #1 | Case #2 |
|---|---|---|
| Source | obwohl sie erwischt wurden , wurden sie schließlich freigelassen aufgrund immensen internationalen Drucks . | das ist ein Bauplan für Länder wie China und den Iran . |
| Target Reference | even though they were caught , they were eventually released after heavy international pressure . | this is a blueprint for countries like China and Iran . |
| Vanilla NAT | although they were caught , they were released released because because of huge drug . | this is a blueprint plan for countries like China and and Iran . |
| Vanilla NAT w/ IT | although they were caught , they were finally released because huge international pressure . | this is a blueprint for countries like China and Iran . |
| GLAT | although they were caught , they finally were released because of of international printing . | this is a blueprint plan for countries like China and Iran . |
| GLAT w/ IT | although they were caught , they were finally released after huge international pressure . | this is a blueprint for countries like China and Iran . |

Table 7: Case studies of our method IT on the IWSLT16 DE-EN validation set by comparing the translations from the two baseline models Vanilla NAT and GLAT and from them after applying IT (models in bold). Repetitive tokens are in red. Source words that are not semantically translated are marked in bold and underlined (under-translation). Wrong lexical choice (incorrect translations caused by polysemy) and redundant words are in blue.
On the Copying Problem of Unsupervised NMT:
A Training Schedule with a Language Discriminator Loss
Yihong Liu*⋄, Alexandra Chronopoulou*⋄, Hinrich Schütze*⋄, and Alexander Fraser*⋄
* Center for Information and Language Processing, LMU Munich
⋄ Munich Center for Machine Learning (MCML)
{yihong, achron, fraser}@cis.lmu.de
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 491–502
July 13-14, 2023 ©2023 Association for Computational Linguistics
Figure 1: A view of the UNMT architecture. The weights of the final fully connected layer (block F) are tied with the weight of the embedding layer (block E).

Figure 2: The losses (left ordinate) and copying ratios (right ordinate) of Multi30K English-French pair over epochs. The normal_dae_loss (resp. normal_bt_loss) and normal_copying_ratio are DAE loss (resp. BT loss) and copying ratio from the vanilla UNMT. The ld_dae_loss (resp. ld_bt_loss) and ld_copying_ratio are DAE loss (resp. BT loss) and the copying ratio from the UNMT incorporated with the language discriminator.

UNMT model often specifically deals with two languages, therefore only two translation directions are considered. Although adding language tags (Wu et al., 2021) is effective in addressing the copying problem in multilingual NMT, it is not a standard process in UNMT. This is because a language embedding is often added to each token embedding (Conneau and Lample, 2019; Song et al., 2019; Liu et al., 2022). Language embeddings have similar functions to language tags: providing information about the language of each token. Unfortunately, language embeddings turn out to be not very effective in addressing the copying problem, especially for low-resource or distant language pairs (Kim et al., 2020). Thus, in this work, we explore why the copying problem occurs and how we can alleviate it in UNMT. We analyze the problem from two perspectives:

Architecture perspective. In UNMT, the weight of the final fully connected layer (for obtaining the logits of each word in the vocabulary) is often tied to the weight of a cross-lingual embedding layer, as shown in Figure 1. That is, the representations of tokens from two languages are shared in the same space. Although this setting is arguably a better starting point for most modern NMT models, it unfortunately also allows the models to generate a token in an unexpected language at any time step. Furthermore, because of an autoregressive decoder, errors can easily accumulate, as the tokens initially generated by the model highly influence the generation of the subsequent tokens. In contrast to this setting, using separate word look-up tables or separate decoders for involved languages can address the problem (Lample et al., 2018; Liu et al., 2022). However, such a setting can be harmful for learning cross-lingual knowledge and largely increase the number of parameters. In this view, it is desired to keep the structure simple (no language-specific architecture) while preventing the model from decoding in a copying way.

Objective perspective. Typically, a UNMT model is trained by denoising autoencoding (DAE) (Vincent et al., 2008) and online back-translation (BT) (Sennrich et al., 2016) objectives. In DAE objective, even though the model is trained to denoise on two languages simultaneously, there is no guarantee that the model can transfer the cross-lingual information that might improve translation between the two languages. In fact, Song et al. (2019) empirically find that a pretrained encoder-decoder model with DAE objective can even perform worse than the model without it because DAE encourages the model to perform the copying. In comparison with DAE, BT is arguably more important, as it tries to directly optimize the translation. However, we find that BT can also “fail” during training. That is, the model can take the shortcut, i.e., copy the input sentence as the intermediate translation and then copy it again for
the reconstruction. By taking such a shortcut, the loss of BT can quickly decrease while the copying ratio (Liu et al., 2021), a metric to measure the percentage of generated tokens that are copied from the input, keeps increasing and reaches a high-value plateau, as shown in Figure 2. This indicates that, because there are no constraints on the intermediate translation, the model can always choose the easiest shortcut for BT, which eventually corrupts the model's translation capability.

2.2 A Language Discriminator Loss

To avoid such an unexpected copying behavior in BT, our intuition suggests that forcing the intermediate generation to be in the correct language would be helpful. Instead of forcing all tokens, we could simply force the first token to be in the correct language, because the first generated token will influence the generation of all the subsequent tokens. Next, the problem is how to force the first generated token to be in the desired target language. An equivalent question would be: how can we force the output vector of the decoder at the first time step to be closer to the embedding of a token in the target language? The answer might be trivial. We could use a trained language discriminator (LD), which is a classifier, to classify the first-time-step output vectors of the decoder and then backpropagate the gradients to the main model (encoder and decoder). In this way, the model knows which intermediate language it should generate for the first-time-step token, therefore preventing the copying behavior.

For training LD, we could use the first-time-step outputs of the decoder in DAE steps. The LD is trained to predict the language of the first-time-step outputs by minimizing the cross entropy loss:

$$\mathcal{L}_{LD} = -\mathbb{E}_{x \sim D_l}\left[\log p\left(l \mid LD(O_l)\right)\right] \tag{1}$$

where LD is the language discriminator, $O_l$ are the first-time-step outputs generated by Dec(Enc(x, l), l) and $l$ denotes the language (either src or tgt). Notably, $\mathcal{L}_{LD}$ only backpropagates to the language discriminator in the DAE step. In this way, the discriminator is able to distinguish representations from different languages.

In the BT process, the language discriminator is fixed and the $\mathcal{L}_{LD}$ loss is only used to update the main model so it learns to differentiate representations from different languages. Taking src-tgt-src BT for example, the loss is as follows:

$$\mathcal{L}_{LD} = -\mathbb{E}_{x \sim D_{src}}\left[\log p\left(tgt \mid LD(O_{tgt})\right)\right] \tag{2}$$

where $O_{tgt}$ are the first-time-step outputs generated in the src-to-tgt step, i.e., Dec(Enc(x, src), tgt). The language discriminator does not have to be used for the next step in BT, i.e., tgt-to-src translation, because there are already ground-truth src-language sentences as supervision. All we need to do is to make sure the intermediate translation is in the correct language. We use a weight $\lambda_{LD}$ to control the contribution of the LD loss to the final loss that is used to update the parameters of the main model. The larger the weight, the more the model focuses on the task of distinguishing representations from different languages.

This training schedule is similar to the adversarial loss (Goodfellow et al., 2014) used by Lample et al. (2018), where they trained a discriminator to make the outputs of the encoder language-agnostic, aiming to improve the cross-linguality of a shared encoder. Our aim, however, is different: we want to enable the decoder to generate distinguishable outputs which correctly correspond to the language that the model is expected to generate in the BT process. Algorithm 1 presents the training schedule in detail.

Algorithm 1: Training Schedule
Input: pretrained encoder Enc and decoder Dec, language discriminator LD, source and target monolingual data $D_{src}$, $D_{tgt}$, maximum finetuning steps $T$ and coefficient $\lambda_{LD}$
Output: finetuned encoder Enc and decoder Dec
  $t \leftarrow 0$
  while not converged or $t < T$ do
    // for the src language, do DAE and BT:
    $B_{src} \leftarrow$ sample batch from $D_{src}$
    // DAE step
    $\tilde{B}_{src}, O_{src} \leftarrow$ generate reconstructions and first-time-step outputs from Dec(Enc(noise($B_{src}$), src), src)
    detach $O_{src}$ from the compute graph
    $\theta_{Enc}, \theta_{Dec} \leftarrow \arg\min \mathcal{L}_{DAE}(B_{src}, \tilde{B}_{src})$
    $\theta_{LD} \leftarrow \arg\min \mathcal{L}_{LD}(O_{src}, src)$
    // BT step
    freeze $\theta_{LD}$
    $\tilde{B}_{tgt}, O_{tgt} \leftarrow$ generate tgt-language translations and first-time-step outputs from Dec(Enc($B_{src}$, src), tgt)
    $\tilde{B}_{src} \leftarrow$ generate src-language back-translations from Dec(Enc($\tilde{B}_{tgt}$, tgt), src)
    $\theta_{Enc}, \theta_{Dec} \leftarrow \arg\min \mathcal{L}_{BT}(B_{src}, \tilde{B}_{src}) + \lambda_{LD} \mathcal{L}_{LD}(O_{tgt}, tgt)$
    // for the tgt language, do the same as above
    $t \leftarrow t + 1$
  end
  return Enc and Dec
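To make the role of the language discriminator loss in the BT step more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the discriminator emits two logits per first-time-step output (one per language); the class indices `SRC` and `TGT`, the function names, and the numeric logits are all hypothetical and chosen only for illustration.

```python
import math

SRC, TGT = 0, 1  # hypothetical class indices for the two languages


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ld_loss(ld_logits, lang_index):
    """Mean cross entropy -log p(lang | LD(O)) over a batch of logit pairs."""
    losses = [-math.log(softmax(z)[lang_index]) for z in ld_logits]
    return sum(losses) / len(losses)


def bt_objective(bt_loss, ld_logits_first_step, tgt_index, lambda_ld):
    """BT update objective of Algorithm 1: L_BT + lambda_LD * L_LD(O_tgt, tgt)."""
    return bt_loss + lambda_ld * ld_loss(ld_logits_first_step, tgt_index)


# Made-up discriminator logits for the first-time-step outputs of two
# intermediate translations (illustrative numbers, not measured values).
copying = [[3.0, -1.0], [2.5, 0.0]]      # classified as src: large LD penalty
translating = [[-1.0, 3.0], [0.0, 2.5]]  # classified as tgt: small LD penalty

loss_copy = bt_objective(0.2, copying, TGT, lambda_ld=1.0)
loss_translate = bt_objective(0.2, translating, TGT, lambda_ld=1.0)
```

With the same BT reconstruction loss, the batch whose first-time-step outputs look like source-language outputs receives the larger total loss, which is the pressure against the copying shortcut described above. Setting `lambda_ld=0` recovers the plain BT loss.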
Figure 3: The visualizations of the first-time-step output vectors of the decoder in UNMT trained with different
weights for the proposed language discriminator loss. The dimension of the outputs is originally 1024. Principal
component analysis (PCA) is leveraged to project those outputs into a 2-dimensional subspace for convenience
of visualization. src2src (resp. tgt2tgt) denotes the output in the English-to-English (resp. German-to-German)
autoencoding task. src2tgt (resp. tgt2src) denotes the output in the English-to-German (resp. German-to-English)
translation task. The sentences used for the visualizations are the same or the corresponding parallel translations.
Example 1
Source input:     a man in an orange hat starring at something.
Reference output: ein mann mit einem orangefarbenen hut, der etwas anstarrt.
Model                   Model output
$\lambda_{LD} = 0$      a man in an orange hat staring at something.
$\lambda_{LD} = 0.01$   ein mann in an orange hat starring at something.
$\lambda_{LD} = 0.1$    ein mann in an orange hat gerade etwas bei etwas.
$\lambda_{LD} = 1$      ein mann in einem orangefarbenen hut spielt bei etwas.
$\lambda_{LD} = 10$     ein mann in einem orangefarbenen hut spielt bei etwas.
$\lambda_{LD} = 100$    eine frau in einem orangefarbenen hut spielt bei etwas.

Example 2
Source input:     a boston terrier is running on lush green grass in front of a white fence.
Reference output: ein boston terrier läuft über saftig-grünes gras vor einem weißen zaun.
Model                   Model output
$\lambda_{LD} = 0$      a boston dog is running on leafy grass in front of a white fence.
$\lambda_{LD} = 0.01$   ein boston terrier läuft auf einem gepflasterten grünen grass in front of a white fence.
$\lambda_{LD} = 0.1$    ein boston terrier läuft auf einem grünen rasen vor einem weißen zaun.
$\lambda_{LD} = 1$      ein boston terrier läuft auf einem grünen rasen vor einem weißen zaun.
$\lambda_{LD} = 10$     ein boston terrier läuft auf einem grünen gras vor einem weißen zaun.
$\lambda_{LD} = 100$    eine boston terrier läuft auf grünen gras vor einem weißen zaun.

Table 1: Examples of translations from the model trained on Multi30K dataset (En-De pair) with different weights $\lambda_{LD}$ for language discriminator loss. We do not use beam search to generate these translations.
Models ($\lambda_{LD}$)   En→De         De→En         En→Fr         Fr→En
0                         0.22 (87%)    0.19 (84%)    0.14 (89%)    0.10 (83%)
0.01                      15.78 (42%)   22.04 (24%)   24.73 (24%)   22.15 (25%)
0.1                       25.91 (14%)   28.46 (15%)   39.72 (6%)    37.50 (7%)
1                         27.96 (12%)   30.05 (12%)   42.74 (5%)    39.02 (6%)
10                        24.35 (14%)   25.60 (13%)   41.26 (5%)    37.61 (6%)
100                       20.66 (12%)   26.74 (10%)   30.65 (5%)    32.10 (7%)

Table 2: BLEU scores and copying ratios (inside parentheses) of models trained with different weights $\lambda_{LD}$ on Multi30K dataset. When the weight $\lambda_{LD} = 0$, the model degenerates to the vanilla UNMT model.

scores decrease while copying ratios remain at the same level with the increase of the weight. This indicates that the model is over-emphasizing distinguishing the outputs when the weights are large. Therefore, moderate weights, e.g., 1, might be optimal if we want to alleviate the copying problem while achieving good translation performance.

When $\lambda_{LD} = 0$, poor BLEU scores are obtained because of the copying problem. In this setting, all copying ratios in Table 2 are very high: more than 80% for all directions. Example translations from the translation model for the En-De pair in Table 1 show that when $\lambda_{LD} = 0$, the MT system simply copies the input sentences. As the weight increases, it becomes less likely for the model to copy the words from the source input as the output translation. However, when the weight is too large, e.g., $\lambda_{LD} = 100$, there are obvious mistakes made by the translation model. For example, "man" in English is wrongly translated to "frau" (meaning "woman") in German, and "a" is wrongly translated into "eine" even though boston terrier is a masculine rather than a feminine noun. Moderate weights, e.g., $\lambda_{LD} = 1$, achieve the best performance while producing fewer errors.

To figure out how the LD loss influences the representations, i.e., the first-time-step output vectors generated by the decoder, we visualize these vectors in 2D by using principal component analysis (PCA), as shown in Figure 3. The visualization verifies the relationship between the output and the occurrence of the copying problem. src2tgt and tgt2tgt first-time-step outputs should be close to each other in the subspace as they are both used to directly generate target-language sentences. However, in Fig. 3 (a), when $\lambda_{LD} = 0$, src2tgt and src2src are located together while tgt2src and tgt2tgt are together. In contrast, when LD loss is imposed, e.g., $\lambda_{LD} = 1$ (Fig. 3 (d)), the outputs are distributed as we expect: src2tgt and tgt2tgt are located together and tgt2src and src2src together.
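The copying ratios reported in parentheses in Table 2 measure the share of generated tokens that are copied from the input. A minimal sketch of such a computation is given below. The whitespace tokenization and the matching rule (an output token counts as copied if it also occurs in the source, clipped by its source frequency) are assumptions made for illustration and may differ from the exact procedure of Liu et al. (2021); the function names are hypothetical.

```python
from collections import Counter


def copied_token_count(source_tokens, output_tokens):
    """Count output tokens that also occur in the source, clipped by source frequency."""
    src_counts = Counter(source_tokens)
    out_counts = Counter(output_tokens)
    return sum(min(n, src_counts[tok]) for tok, n in out_counts.items())


def copying_ratio(sources, outputs):
    """Corpus-level ratio: copied output tokens divided by all output tokens."""
    copied = 0
    total = 0
    for src, out in zip(sources, outputs):
        src_toks, out_toks = src.split(), out.split()
        copied += copied_token_count(src_toks, out_toks)
        total += len(out_toks)
    return copied / total if total else 0.0


# First example of Table 1 (source sentence and two model outputs).
src = "a man in an orange hat starring at something."
out_small_weight = "ein mann in an orange hat starring at something."
out_moderate_weight = "ein mann in einem orangefarbenen hut spielt bei etwas."

ratio_small = copying_ratio([src], [out_small_weight])        # 7 of 9 tokens copied
ratio_moderate = copying_ratio([src], [out_moderate_weight])  # 1 of 9 tokens copied
```

Note that in this sketch a token shared by both languages, such as "in", still counts as copied, so a correct translation does not necessarily reach a ratio of zero.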
Models          En→De   De→En   En→Fr   Fr→En   En→Ru   Ru→En   En→Zh   Zh→En
XLM baseline    20.51   25.99   22.87   25.88   14.10   16.92    6.36    4.28
XLM (+ LD)      20.40   25.85   21.22   26.92   13.49   16.12    6.80    4.69

Table 3: BLEU scores of the XLM baseline and the same model enhanced with the LD loss on high-resource language pairs. The scores of the baseline are obtained by reproducing the published code (Conneau and Lample, 2019).
Models      En-De   En-Fr   En-Ru   En-Zh   En-Kk   En-Gu
baseline    18%     23%     11%     29%     57%     68%
(+ LD)      19%     25%     11%     24%     42%     52%
∆           +1%     +2%     -0%     -5%     -15%    -14%

Table 4: The copying ratio for each language pair of XLM baselines and LD model. The average of the ratios of two directions for a language pair is reported. The translations used to compute the ratios are the same as translations for BLEU used in Table 3 and Table 5.

Models                 En→Kk   Kk→En   En→Gu   Gu→En
XLM baseline (512)     0.80    2.00    0.60    0.60
XLM baseline (1024)    1.80    1.59    2.12    0.54
XLM (+ LD)             2.03    1.70    3.55    0.64

Table 5: BLEU scores of the XLM baseline and the same model enhanced with the LD loss on low-resource language pairs. The scores of baseline (512) are copied from (Kim et al., 2020). Same as the setting for high-resource languages, we reproduced XLM with 1024-dim embeddings to obtain the scores for baseline (1024).
Table 6: Examples of translations from Kazakh to English by XLM baseline (1024) and XLM (+LD) in Table 5. The examples show that XLM (+LD) suffers less from the copying problem, but it can generate incorrect tokens that do not match the semantics of the input sentence.
ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395, Barcelona, Spain.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Unsupervised machine translation using monolingual corpora only. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.

Xuebo Liu, Longyue Wang, Derek F. Wong, Liang Ding, Lidia S. Chao, Shuming Shi, and Zhaopeng Tu. 2021. On the copying behaviors of pre-training for neural machine translation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4265–4275, Online.

Yihong Liu, Haris Jabbar, and Hinrich Schuetze. 2022. Flow-adapter architecture for unsupervised machine translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1253–1266, Dublin, Ireland.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.

Kelly Marchisio, Kevin Duh, and Philipp Koehn. 2020. When does unsupervised machine translation work? In Proceedings of the Fifth Conference on Machine Translation, pages 571–583, Online.

Graham Neubig and Junjie Hu. 2018. Rapid adaptation of neural machine translation to new languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 875–880, Brussels, Belgium.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana.

Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97, pages 5926–5936.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online.

Liwei Wu, Shanbo Cheng, Mingxuan Wang, and Lei Li. 2021. Language tags matter for zero-shot neural machine translation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3001–3007, Online.

Yilin Yang, Akiko Eriguchi, Alexandre Muzio, Prasad Tadepalli, Stefan Lee, and Hany Hassan. 2021. Improving multilingual translation by representation and gradient regularization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7266–7279, Online and Punta Cana, Dominican Republic.

Zhen Yang, Bojie Hu, Ambyera Han, Shen Huang, and Qi Ju. 2020. CSP: code-switching pre-training for neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2624–2636, Online.
Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich. 2020. Improving massively multilingual neural machine translation and zero-shot translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1628–1639, Online.

A Appendix

A.1 Scores of Other Metrics

In addition to BLEU scores, we also compute scores in other metrics, such as chrF (Popović, 2015) in Table 9 and Table 7, COMET (Rei et al., 2020) in Table 10 and Table 8, and confidence intervals of BLEU scores (Koehn, 2004) in Table 11, Table 12 and Table 13. The translations used for computing the scores are the same as the translations used to compute the BLEU scores in Table 3 and Table 5.

To quantify the copying problem, we use the copying ratio proposed by Liu et al. (2021), which is defined as follows:

$$\text{Ratio} = \frac{\sum_{i=1}^{I} \text{count}(\text{copying tokens})}{\sum_{i=1}^{I} \text{count}(\text{tokens})} \tag{3}$$

where $I$ denotes the total number of sentences in the test set, copying tokens are those tokens in the translation which are directly copied from the source language, and the denominator is the total number of tokens in the generated translations. This metric directly reflects the degree of the copying behavior of the translation model. The higher the copying ratio, the more the model tends to copy instead of translating. We report the average of the copying ratios of the two translation directions for each language pair in Table 4. We can see that the copying problem of the XLM baseline models is very obvious in low-resource language pairs, i.e., En-Kk and En-Gu. When the language discriminator loss is introduced, the copying ratios decrease by more than 10 percentage points. We also notice that XLM (+LD) has a less obvious copying problem than the baseline in the En-Zh pair, a distant language pair. For other language pairs, the copying problem is not that severe and therefore introducing the language discriminator loss does not change the ratios much.

A.2 Model Details

In Section 3.2, we use the pretrained XLM models from HuggingFace⁶ (Wolf et al., 2020) (xlm-mlm-enfr-1024, xlm-mlm-ende-1024) to initialize a shared encoder and randomly initialize a shared decoder. A single embedding layer (containing the words/subwords of both the source and target languages) from the pretrained encoder is used. The weight of the final fully connected layer is tied with the embedding layer. The parameters of the encoder are fixed except for this embedding layer, which is also used by the decoder. The embedding size is 1024 and the hidden size of the decoder is 512. The decoder has 8 heads and 3 layers. We follow the denoising autoencoding hyperparameter settings used by Lample et al. (2018) and the training schedule of Liu et al. (2022), i.e., first fine-tuning the models with only the DAE loss and the LD loss for the language discriminator for the first 2 epochs, then fine-tuning the models with all losses (including the BT) for the rest of the epochs. We set the batch size to 32 and use the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.0001. We stop the training when the model does not improve the BLEU scores on the validation set for 5 epochs. We do not use beam search to generate translations for Multi30K.

In Section 3.3, we pretrain our own cross-lingual language models for each language pair based on the XLM code base⁷ (Conneau and Lample, 2019). Then the encoder and decoder are both initialized with the same cross-lingual pretrained model. The recommended hyperparameters for the model architecture are used, i.e., 1024 for the embedding size, 4096 for the hidden size, 8 heads and 6 layers for the transformer blocks. We follow the recommended pretraining as well as UNMT fine-tuning hyperparameters from XLM. We only change the hyperparameter tokens_per_batch to 250 to adapt to GPUs with small or moderate memory. We generate the translations by using beam search of size 5. These translations are used to compute the scores in all the WMT-related experiments.

For the language discriminator, we simply use a feed-forward neural network (FFNN). The language discriminator has two hidden layers and each layer has the same dimension as the embedding, i.e., 1024, for both Multi30K and WMT-related experiments. The output dimension is two, which corresponds to the number of language domains we want to classify into, as we have two languages involved in the training for each model.

⁶ https://github.com/huggingface
⁷ https://github.com/facebookresearch/XLM
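The two-phase fine-tuning schedule and the early-stopping rule described in A.2 for the Multi30K experiments can be sketched as follows. This is an illustrative sketch only: the function names, the 0-based epoch convention, and the example BLEU histories are assumptions, and the actual training code may organize these checks differently.

```python
def active_losses(epoch, warmup_epochs=2):
    """Return the losses used in a given (0-based) epoch.

    During the first `warmup_epochs` epochs only DAE is applied to the main
    model, and the LD loss trains the discriminator; afterwards BT is added.
    """
    if epoch < warmup_epochs:
        return ("DAE", "LD")
    return ("DAE", "LD", "BT")


def should_stop(val_bleu_history, patience=5):
    """True once validation BLEU has not improved for `patience` epochs."""
    if len(val_bleu_history) <= patience:
        return False
    best_before = max(val_bleu_history[:-patience])
    return max(val_bleu_history[-patience:]) <= best_before


# Made-up validation BLEU curves, one value per epoch.
plateaued = [10.0, 12.0, 11.0, 11.5, 11.0, 11.2, 11.9]
still_improving = [10.0, 12.0, 11.0, 11.5, 11.0, 11.2, 13.0]

stop_plateaued = should_stop(plateaued)        # no gain over 12.0 in 5 epochs
stop_improving = should_stop(still_improving)  # 13.0 beats the earlier best
```

Keeping the schedule as a pure function of the epoch index makes it easy to see that back-translation only starts once the discriminator has had two epochs of DAE outputs to train on.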
Models          En→Kk   Kk→En   En→Gu   Gu→En
XLM baseline     8.85    7.61    7.95    4.76
XLM (+ LD)      11.78   10.09   11.71    7.12

Models          En→De   De→En   En→Fr   Fr→En   En→Ru   Ru→En   En→Zh   Zh→En
XLM baseline    45.09   48.20   44.99   49.93   34.75   38.56   16.11   19.08
XLM (+ LD)      44.42   48.20   42.94   50.50   34.39   36.56   16.74   20.45

Table 9: chrF scores (Popović, 2015) of the XLM UNMT baseline as well as the XLM model with the language discriminator on high-resource language pairs (the translations used are the same as used in Table 3 for BLEU scores).
Table 10: COMET scores (Rei et al., 2020) of the XLM UNMT baseline as well as the XLM model with the
language discriminator on high-resource language pairs (the translations used are the same as used in Table 3 for
BLEU scores). We use wmt20-comet-da model to evaluate the translations.
Table 11: 95% confidence interval for the BLEU scores of the XLM UNMT baseline as well as the XLM model
with the language discriminator on En-De and En-Fr pair (the translations used are the same as used in Table 3 for
BLEU scores). Differences between bold results are statistically significant under p = 0.05. For the statistical test,
we use paired bootstrap resampling (Koehn, 2004).
Table 12: 95% confidence interval for the BLEU scores of the XLM UNMT baseline as well as the XLM model
with the language discriminator on En-Ru and En-Zh pair (the translations used are the same as used in Table 3 for
BLEU scores). Differences between bold results are statistically significant under p = 0.05. For the statistical test,
we use paired bootstrap resampling (Koehn, 2004).
Table 13: 95% confidence interval for the BLEU scores of the XLM UNMT baseline as well as the XLM model
with the language discriminator on En-Kk and En-Gu pair (the translations used are the same as used in Table 3 for
BLEU scores). Differences between bold results are statistically significant under p = 0.05. For the statistical test,
we use paired bootstrap resampling (Koehn, 2004).
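The paired bootstrap resampling test (Koehn, 2004) mentioned in the captions of Tables 11 to 13 can be illustrated with the sketch below. It is deliberately simplified: it sums hypothetical per-sentence scores, whereas BLEU is a corpus-level metric that would be recomputed from n-gram statistics on every resampled test set. The function name, the number of resamples, the seed, and the toy scores are all assumptions made for illustration.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which system A beats system B.

    Both systems are scored on the same resampled sentence indices, which is
    what makes the test "paired".
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples


# Toy per-sentence scores: system A is better than B on every sentence.
system_a = [0.5 + 0.01 * i for i in range(50)]
system_b = [s - 0.05 for s in system_a]

win_rate = paired_bootstrap(system_a, system_b)
significant = win_rate >= 0.95  # corresponds to p = 0.05
```

A win rate of at least 0.95 over the resampled test sets is the usual reading of a statistically significant difference at p = 0.05.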
Author Index
Abela, Kurt, 433 Currey, Anna, 1
Adi, Yossi, 465
Agrawal, Saurabh, 449 Dabre, Raj, 169
Agrawal, Sweta, 1 Dai, Lirong, 102, 194
Al-Badrashiny, Mohamed, 62 Darwish, Kareem, 62
Anastasopoulos, Antonios, 1, 269 Declerck, Thierry, 1
Anderson, Tim, 130 DeMarco, Andrea, 433
Anh Dinh, Tu, 113 Deng, Pan, 102
Anh Nguyen, Tu, 465 Di Gangi, Mattia, 251
Arora, Siddhant, 235 Diab, Mona, 62
Doi, Kosuke, 330
Bahar, Parnia, 251 Dong, Qianqian, 1
Bai, Yu, 478 Du, Yichao, 79
Bakhturina, Evelina, 442 Duan, Richeng, 202
Bamfo Odoom, Bismarck, 302 Duh, Kevin, 1, 130
Basmatkar, Pranjali, 321 Dupoux, Emmanuel, 465
Bataev, Vladimir, 442
Beneš, Karel, 227 E. Ortega, John, 1, 261
Bentivogli, Luisa, 1 Elleuch, Haroun, 219
Berard, Alexandre, 144 Estève, Yannick, 1, 219
Billinghurst, Hannah, 433
Binh Nguyen, Thai, 113 Federico, Marcello, 1
Bojar, Ondřej, 1, 169, 389 Fonollosa, Jose, 397
Borg, Claudia, 1, 433 Fraser, Alexander, 491
Born, Logan, 291 Fukuda, Ryo, 330, 363
Bougares, Fethi, 219
Bär, Martin, 433 Gahbiche, Souhir, 1, 219
Gaido, Marco, 159
Calapodescu, Ioan, 144 Ganesan, Ashwinkumar, 241
Cao, Yiqing, 311 Gao, Yang, 478
Carpuat, Marine, 1 Gat, Itai, 465
Cattoni, Roldano, 1 Ginsburg, Boris, 442
Cettolo, Mauro, 1 Gow-Smith, Edward, 144
Chen, Boxing, 478 GUO, Jiaxin, 138, 277, 376, 383
Chen, Enhong, 79 Guo, Jiaxin, 180
Chen, Hao, 211 Guo, Yuhang, 411, 455
Chen, Mingda, 1 Gwinnup, Jeremy, 130
Chen, Peikun, 311
Chen, Qian, 478 Haddow, Barry, 1
Chen, Shihao, 102 Han, Yuchen, 211
Chen, Shuoying, 455 Hansen, Eric, 130
Chen, William, 1, 235, 261 Hrinchuk, Oleksii, 442
Chen, Xiaoyu, 180, 187, 376, 383 Hsu, Benjamin, 1
Choukri, Khalid, 1 Huang, Wuwei, 411
Chronopoulou, Alexandra, 1, 491 Hubert, Rebekka, 89
Chu, Chenhui, 357 Hussein, Amir, 283
Copet, Jade, 465 Huzaifah, Muhammad, 202
Cui, Jianwei, 194
503
I. Gállego, Gerard, 397 Mbuya, Jonathan, 269
Inaguma, Hirofumi, 1 McNamee, Paul, 1
Iranzo-Sánchez, Javier, 251 Mdhaffar, Salima, 219
Micallef, Kurt, 433
Jain, Aditi, 341 Min Tan, Kye, 202
Javorský, Dávid, 1 Minghan, Wang, 187
Jiang, Ning, 311 Mon Htut, Phu, 1
Jiang, Yanfei, 180 Moon, Hyeonseok, 420
JiaXin, Guo, 187 Mozib Samin, Ahnaf, 433
Judge, John, 1 Mullov, Carlos, 113
Murray, Kenton, 1
Kambhatla, Nishant, 291, 341
Kano, Yasumasa, 1, 330, 363 Nadejde, Maria, 1
Kesiraju, Santosh, 227 Nakamura, Satoshi, 1, 330, 363
Khudanpur, Sanjeev, 283, 302 Nam Nguyen, Tuan, 113
Khurana, Sameer, 219 Negri, Matteo, 1, 159
Ko, Tom, 1 Nguyen, Ha, 1, 219
Ko, Yuka, 330, 363 Niehues, Jan, 1, 62, 113, 389
Koneru, Sai, 113 Nishikawa, Yuta, 330, 363
Kr. Ojha, Atul, 1 Niu, Xing, 1
Kreuk, Felix, 465
Kumar, Rishu, 1, 433 Ore, Brian, 130
504
ShaoJun, Li, 187
Shi, Jiatong, 1, 235 Xiao, Cihan, 283
Shimizu, Shuichiro, 357 Xiao, Tong, 211
Sokolov, Artem, 89 Xie, Lei, 311
Song, Kun, 311 Xie, Yuhao, 138, 180, 376
Sperber, Matthias, 1 Xie, Zhihang, 123
Stüker, Sebastian, 1 Xinyuan, Henry Li, 302
Su, Jinsong, 411 Xu, Chen, 211
Sudoh, Katsuhito, 1, 330, 363 Xu, Luzhen, 194
Synnaeve, Gabriel, 465 Xu, Tong, 79
Xue, Ran, 241
Tang, Yun, 1
Tao, Shimin, 277 Yan, Brian, 235
Thebaud, Thomas, 283 Yanagita, Tomoya, 330
Thiol, Antoine, 219 Yang, Fengyu, 411
Thompson, Brian, 1 Yang, Hao, 138, 180, 187, 277, 376, 383
Tian, Jinchuan, 79 Yang, Jinlong, 138, 180
Tian, Yanzhi, 411 Yang, Zhengdong, 357
Tikhonov, Maksim, 227 Yavuz Ugan, Enes, 113
Tran, Kevin, 1 Ye, Zhongyi, 194
Tsiamas, Ioannis, 397 Yu, Jianwei, 79
Tu, Zhaopeng, 79 YU, Zhengzhe, 180, 187, 383
Turchi, Marco, 1 Yu, Zhengzhe, 138, 376
Tüske, Zoltán, 251 YuHao, Xie, 187
505