
Robust Speech Recognition via Large-Scale Weak Supervision

Alec Radford * 1 Jong Wook Kim * 1 Tao Xu 1 Greg Brockman 1 Christine McLeavey 1 Ilya Sutskever 1

* Equal contribution. 1 OpenAI, San Francisco, CA 94110, USA. Correspondence to: Alec Radford <alec@openai.com>, Jong Wook Kim <jongwook@openai.com>.

Abstract

We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.

1. Introduction

Progress in speech recognition has been energized by the development of unsupervised pre-training techniques exemplified by Wav2Vec 2.0 (Baevski et al., 2020). Since these methods learn directly from raw audio without the need for human labels, they can productively use large datasets of unlabeled speech and have been quickly scaled up to 1,000,000 hours of training data (Zhang et al., 2021), far more than the 1,000 or so hours typical of an academic supervised dataset. When fine-tuned on standard benchmarks, this approach has improved the state of the art, especially in a low-data setting.

These pre-trained audio encoders learn high-quality representations of speech, but because they are purely unsupervised they lack an equivalently performant decoder mapping those representations to usable outputs, necessitating a fine-tuning stage in order to actually perform a task such as speech recognition1. This unfortunately limits their usefulness and impact as fine-tuning can still be a complex process requiring a skilled practitioner. There is an additional risk with requiring fine-tuning. Machine learning methods are exceedingly adept at finding patterns within a training dataset which boost performance on held-out data from the same dataset. However, some of these patterns are brittle and spurious and don't generalize to other datasets and distributions. In a particularly disturbing example, Radford et al. (2021) documented a 9.2% increase in object classification accuracy when fine-tuning a computer vision model on the ImageNet dataset (Russakovsky et al., 2015) without observing any improvement in average accuracy when classifying the same objects on seven other natural image datasets. A model that achieves "superhuman" performance when trained on a dataset can still make many basic errors when evaluated on another, possibly precisely because it is exploiting those dataset-specific quirks that humans are oblivious to (Geirhos et al., 2020).

1 Baevski et al. (2021) is an exciting exception, having developed a fully unsupervised speech recognition system.

This suggests that while unsupervised pre-training has improved the quality of audio encoders dramatically, the lack of an equivalently high-quality pre-trained decoder, combined with a recommended protocol of dataset-specific fine-tuning, is a crucial weakness which limits their usefulness and robustness. The goal of a speech recognition system should be to work reliably "out of the box" in a broad range of environments without requiring supervised fine-tuning of a decoder for every deployment distribution.

As demonstrated by Narayanan et al. (2018), Likhomanenko et al. (2020), and Chan et al. (2021), speech recognition systems that are pre-trained in a supervised fashion across many datasets/domains exhibit higher robustness and generalize much more effectively to held-out datasets than models trained on a single source. These works achieve this by combining as many existing high-quality speech recognition datasets as possible. However, there is still only a moderate amount of this data easily available. SpeechStew (Chan et al., 2021) mixes together 7 pre-existing datasets totalling 5,140 hours of supervision. While not insignificant, this is still tiny compared to the previously mentioned 1,000,000 hours of unlabeled speech data utilized in Zhang et al. (2021).

Recognizing the limiting size of existing high-quality supervised datasets, recent efforts have created larger datasets for speech recognition. By relaxing the requirement of gold-standard human-validated transcripts, Chen et al. (2021) and Galvez et al. (2021) make use of sophisticated automated

pipelines to scale weakly supervised speech recognition to 10,000 and 30,000 hours of noisier training data. This trade-off between quality and quantity is often the right call. Although understudied so far for speech recognition, recent work in computer vision has demonstrated that moving beyond gold-standard crowdsourced datasets such as ImageNet (Russakovsky et al., 2015) to much larger but weakly supervised datasets significantly improves the robustness and generalization of models (Mahajan et al., 2018; Kolesnikov et al., 2020).

Yet these new datasets are only a few times larger than the sum of existing high-quality datasets and still much smaller than prior unsupervised work. In this work we close that gap, scaling weakly supervised speech recognition the next order of magnitude to 680,000 hours of labeled audio data. We call our approach Whisper2. We demonstrate models trained at this scale transfer well to existing datasets zero-shot, removing the need for any dataset-specific fine-tuning to achieve high-quality results.

2 If an acronym or basis for the name is desired, WSPSR, standing for Web-scale Supervised Pretraining for Speech Recognition, can be used.

In addition to scale, our work also focuses on broadening the scope of weakly supervised pre-training beyond English-only speech recognition to be both multilingual and multitask. Of those 680,000 hours of audio, 117,000 hours cover 96 other languages. The dataset also includes 125,000 hours of X→en translation data. We find that for sufficiently large models there is no drawback and even benefits to joint multilingual and multitask training.

Our work suggests that simple scaling of weakly supervised pre-training has been underappreciated so far for speech recognition. We achieve these results without the need for the self-supervision or self-training techniques that have been a mainstay of recent large-scale speech recognition work. To serve as a foundation for further research on robust speech recognition, we release inference code and models at the following URL: https://github.com/openai/whisper.

2. Approach

2.1. Data Processing

Following the trend of recent work leveraging web-scale text from the internet for training machine learning systems, we take a minimalist approach to data pre-processing. In contrast to a lot of work on speech recognition, we train Whisper models to predict the raw text of transcripts without any significant standardization, relying on the expressiveness of sequence-to-sequence models to learn to map between utterances and their transcribed form. This simplifies the speech recognition pipeline since it removes the need for a separate inverse text normalization step in order to produce naturalistic transcriptions.

We construct the dataset from audio that is paired with transcripts on the Internet. This results in a very diverse dataset covering a broad distribution of audio from many different environments, recording setups, speakers, and languages. While diversity in audio quality can help train a model to be robust, diversity in transcript quality is not similarly beneficial. Initial inspection showed a large amount of subpar transcripts in the raw dataset. To address this, we developed several automated filtering methods to improve transcript quality.

Many transcripts on the internet are not actually human-generated but the output of existing ASR systems. Recent research has shown that training on datasets of mixed human- and machine-generated data can significantly impair the performance of translation systems (Ghorbani et al., 2021). In order to avoid learning "transcript-ese", we developed many heuristics to detect and remove machine-generated transcripts from the training dataset. Many existing ASR systems output only a limited subset of written language which removes or normalizes away aspects that are difficult to predict from only audio signals such as complex punctuation (exclamation points, commas, and question marks), formatting whitespace such as paragraphs, or stylistic aspects such as capitalization. An all-uppercase or all-lowercase transcript is very unlikely to be human generated. While many ASR systems include some level of inverse text normalization, it is often simple or rule-based and still detectable from other unhandled aspects such as never including commas.

We also use an audio language detector, which was created by fine-tuning a prototype model trained on a prototype version of the dataset on VoxLingua107 (Valk & Alumäe, 2021), to ensure that the spoken language matches the language of the transcript according to CLD2. If the two do not match, we don't include the (audio, transcript) pair as a speech recognition training example in the dataset. We make an exception if the transcript language is English and add these pairs to the dataset as X→en speech translation training examples instead. We use fuzzy de-duping of transcript texts to reduce the amount of duplication and automatically generated content in the training dataset.

We break audio files into 30-second segments paired with the subset of the transcript that occurs within that time segment. We train on all audio, including segments where there is no speech (though with sub-sampled probability) and use these segments as training data for voice activity detection.
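The casing and punctuation cues mentioned above are easy to picture as code. The following is a minimal sketch of that kind of surface-level check, assuming illustrative thresholds and a hypothetical function name; it is not the filtering pipeline actually used to build the dataset.

import re

def looks_machine_generated(transcript: str) -> bool:
    """Heuristic sketch of the cues from Section 2.1: all-uppercase or
    all-lowercase text and long stretches with no punctuation are treated
    as likely ASR output. Thresholds here are illustrative assumptions."""
    stripped = transcript.strip()
    if not stripped:
        return True
    letters = re.sub(r"[^A-Za-z]", "", stripped)
    # All-uppercase or all-lowercase transcripts are rarely human-written.
    if letters and (letters.isupper() or letters.islower()):
        return True
    # Long transcripts that never use commas or sentence punctuation are suspect.
    if len(stripped.split()) > 50 and not re.search(r"[,.!?]", stripped):
        return True
    return False

print(looks_machine_generated("THIS IS AN AUTO CAPTION WITHOUT PUNCTUATION"))   # True
print(looks_machine_generated("Well, that's interesting. What happened next?")) # False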

For an additional filtering pass, after training an initial model we aggregated information about its error rate on training data sources and performed manual inspection of these data sources, sorting by a combination of both high error rate and data source size in order to identify and remove low-quality ones efficiently. This inspection showed a large amount of only partially transcribed or poorly aligned/misaligned transcripts as well as remaining low-quality machine-generated captions that filtering heuristics did not detect.

To avoid contamination, we perform de-duplication at a transcript level between the training dataset and the evaluation datasets we thought were at higher risk of overlap, namely TED-LIUM 3 (Hernandez et al., 2018).

2.2. Model

Since the focus of our work is on studying the capabilities of large-scale supervised pre-training for speech recognition, we use an off-the-shelf architecture to avoid confounding our findings with model improvements. We chose an encoder-decoder Transformer (Vaswani et al., 2017) as this architecture has been well validated to scale reliably. All audio is re-sampled to 16,000 Hz, and an 80-channel log-magnitude Mel spectrogram representation is computed on 25-millisecond windows with a stride of 10 milliseconds. For feature normalization, we globally scale the input to be between -1 and 1 with approximately zero mean across the pre-training dataset. The encoder processes this input representation with a small stem consisting of two convolution layers with a filter width of 3 and the GELU activation function (Hendrycks & Gimpel, 2016) where the second convolution layer has a stride of two. Sinusoidal position embeddings are then added to the output of the stem after which the encoder Transformer blocks are applied. The transformer uses pre-activation residual blocks (Child et al., 2019), and a final layer normalization is applied to the encoder output. The decoder uses learned position embeddings and tied input-output token representations (Press & Wolf, 2017). The encoder and decoder have the same width and number of transformer blocks. Figure 1 summarizes the model architecture.

We use the same byte-level BPE text tokenizer used in GPT-2 (Sennrich et al., 2015; Radford et al., 2019) for the English-only models and refit the vocabulary (but keep the same size) for the multilingual models to avoid excessive fragmentation on other languages since the GPT-2 BPE vocabulary is English only.

2.3. Multitask Format

Although predicting which words were spoken in a given audio snippet is a core part of the full speech recognition problem and extensively studied in research, it is not the only part. A fully featured speech recognition system can involve many additional components such as voice activity detection, speaker diarization, and inverse text normalization. These components are often handled separately, resulting in a relatively complex system around the core speech recognition model. To reduce this complexity, we would like to have a single model perform the entire speech processing pipeline, not just the core recognition part. An important consideration here is the interface for the model. There are many different tasks that can be performed on the same input audio signal: transcription, translation, voice activity detection, alignment, and language identification are some examples.

For this kind of one-to-many mapping to work with a single model, some form of task specification is necessary. We use a simple format to specify all tasks and conditioning information as a sequence of input tokens to the decoder. Since our decoder is an audio-conditional language model, we also train it to condition on the history of text of the transcript in the hope that it will learn to use longer-range text context to resolve ambiguous audio. Specifically, with some probability we add the transcript text preceding the current audio segment to the decoder's context. We indicate the beginning of prediction with a <|startoftranscript|> token. First, we predict the language being spoken, which is represented by a unique token for each language in our training set (99 total). These language targets are sourced from the aforementioned VoxLingua107 model. In the case where there is no speech in an audio segment, the model is trained to predict a <|nospeech|> token indicating this. The next token specifies the task (either transcription or translation) with an <|transcribe|> or <|translate|> token. After this, we specify whether to predict timestamps or not by including a <|notimestamps|> token for that case. At this point, the task and desired format is fully specified, and the output begins. For timestamp prediction, we predict time relative to the current audio segment, quantizing all times to the nearest 20 milliseconds which matches the native time resolution of Whisper models, and add additional tokens to our vocabulary for each of these. We interleave their prediction with the caption tokens: the start time token is predicted before each caption's text, and the end time token is predicted after. When a final transcript segment is only partially included in the current 30-second audio chunk, we predict only its start time token for the segment when in timestamp mode, to indicate that the subsequent decoding should be performed on an audio window aligned with that time; otherwise we truncate the audio to not include the segment. Lastly, we add a <|endoftranscript|> token. We only mask out the training loss over the previous context text, and train the model to predict all other tokens. Please see Figure 1 for an overview of our format and training setup.
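The audio front end described in Section 2.2 (16,000 Hz input, 80 Mel channels, 25 ms windows, 10 ms stride, features scaled toward the [-1, 1] range) can be approximated in a few lines. This is a sketch using librosa rather than the released preprocessing code, and the clipping and scaling constants are assumptions, not the exact values used for training.

import numpy as np
import librosa

SAMPLE_RATE = 16000
N_FFT = 400        # 25 ms window at 16 kHz
HOP_LENGTH = 160   # 10 ms stride at 16 kHz
N_MELS = 80

def log_mel_spectrogram(audio: np.ndarray) -> np.ndarray:
    """80-channel log-magnitude Mel spectrogram in the spirit of Section 2.2."""
    mel = librosa.feature.melspectrogram(
        y=audio, sr=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS
    )
    log_mel = np.log10(np.maximum(mel, 1e-10))
    log_mel = np.maximum(log_mel, log_mel.max() - 8.0)  # assumed dynamic-range clipping
    return (log_mel + 4.0) / 4.0                         # assumed rescaling toward [-1, 1]

features = log_mel_spectrogram(np.zeros(SAMPLE_RATE * 30, dtype=np.float32))
print(features.shape)  # (80, 3001) for a 30-second segment with these settings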

[Figure 1: schematic of the model and training setup. The left side lists the multitask training data (680k hours): English transcription, any-to-English speech translation, non-English transcription, and no-speech segments. The center shows the sequence-to-sequence Transformer: a log-Mel spectrogram passes through a 2 x Conv1D + GELU stem with sinusoidal positional encoding into the encoder blocks, whose outputs feed the decoder blocks (learned positional encoding) via cross-attention for next-token prediction. The bottom shows the multitask training format: optional previous-text tokens (PREV), START OF TRANSCRIPT, a LANGUAGE TAG or NO SPEECH, TRANSCRIBE or TRANSLATE, then either NO TIMESTAMPS or interleaved begin/end time tokens around the text tokens, followed by EOT.]
Figure 1. Overview of our approach. A sequence-to-sequence Transformer model is trained on many different speech processing tasks,
including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these
tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different
stages of a traditional speech processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or
classification targets, as further explained in Section 2.3.
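The special-token prefix that the caption refers to can be made concrete with a small helper. The token spellings below follow the names given in Section 2.3 and Figure 1; the function itself is an illustrative reconstruction, not the tokenizer code shipped with the released models.

def build_decoder_prompt(language: str, task: str, timestamps: bool, previous_text: str = ""):
    """Assemble the conditioning-token sequence described in Section 2.3.

    Tokens are shown as strings rather than vocabulary ids; the previous-text
    marker is labelled PREV as in Figure 1."""
    tokens = []
    if previous_text:
        tokens += ["<|PREV|>", previous_text]   # optional previous-text conditioning
    tokens.append("<|startoftranscript|>")
    tokens.append(f"<|{language}|>")            # one of the 99 language tags, e.g. <|en|>
    assert task in ("transcribe", "translate")
    tokens.append(f"<|{task}|>")
    if not timestamps:
        tokens.append("<|notimestamps|>")       # otherwise timestamp tokens are interleaved
    return tokens

print(build_decoder_prompt("en", "transcribe", timestamps=False))
# ['<|startoftranscript|>', '<|en|>', '<|transcribe|>', '<|notimestamps|>']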

2.4. Training Details

We train a suite of models of various sizes in order to study the scaling properties of Whisper. Please see Table 1 for an overview. We train with data parallelism across accelerators using FP16 with dynamic loss scaling and activation checkpointing (Griewank & Walther, 2000; Chen et al., 2016). Models were trained with AdamW (Loshchilov & Hutter, 2017) and gradient norm clipping (Pascanu et al., 2013) with a linear learning rate decay to zero after a warmup over the first 2048 updates. A batch size of 256 segments was used, and the models are trained for 2^20 updates, which is between two and three passes over the dataset. Due to only training for a few epochs, over-fitting is not a large concern, and we do not use any data augmentation or regularization and instead rely on the diversity contained within such a large dataset to encourage generalization and robustness. Please see Appendix F for full training hyperparameters.

During early development and evaluation we observed that Whisper models had a tendency to transcribe plausible but almost always incorrect guesses for the names of speakers. This happens because many transcripts in the pre-training dataset include the name of the person who is speaking, encouraging the model to try to predict them, but this information is only rarely inferable from only the most recent 30 seconds of audio context. To avoid this, we fine-tune Whisper models briefly on the subset of transcripts that do not include speaker annotations, which removes this behavior.
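The optimization setup above is straightforward to express as a schedule. The sketch below shows the linear warmup and decay in PyTorch; the peak learning rate and clipping norm are placeholders (the actual per-model values are in Appendix F), and the tiny model and loop are stand-ins.

import torch

TOTAL_UPDATES = 2 ** 20    # ~1.05M updates, two to three passes over the dataset
WARMUP_UPDATES = 2048

def lr_scale(step: int) -> float:
    """Linear warmup over the first 2048 updates, then linear decay to zero."""
    if step < WARMUP_UPDATES:
        return step / WARMUP_UPDATES
    return max(0.0, (TOTAL_UPDATES - step) / (TOTAL_UPDATES - WARMUP_UPDATES))

model = torch.nn.Linear(80, 512)                              # stand-in for a Whisper model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)    # peak LR is an assumption
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)

for step in range(10):                                        # illustrative loop only
    optimizer.zero_grad()
    loss = model(torch.randn(4, 80)).pow(2).mean()            # dummy loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip value assumed
    optimizer.step()
    scheduler.step()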

Model    Layers   Width   Heads   Parameters
Tiny     4        384     6       39M
Base     6        512     8       74M
Small    12       768     12      244M
Medium   24       1024    16      769M
Large    32       1280    20      1550M

Table 1. Architecture details of the Whisper model family.

3. Experiments

3.1. Zero-shot Evaluation

The goal of Whisper is to develop a single robust speech processing system that works reliably without the need for dataset-specific fine-tuning to achieve high-quality results on specific distributions. To study this capability, we re-use a wide set of existing speech processing datasets to check whether Whisper is able to generalize well across domains, tasks, and languages. Instead of using the standard evaluation protocol for these datasets, which includes both a train and test split, we evaluate Whisper in a zero-shot setting without using any of the training data for each of these datasets so that we are measuring broad generalization.

3.2. Evaluation Metrics

Speech recognition research typically evaluates and compares systems based on the word error rate (WER) metric. However, WER, which is based on string edit distance, penalizes all differences between the model's output and the reference transcript including innocuous differences in transcript style. As a result, systems that output transcripts that would be judged as correct by humans can still have a large WER due to minor formatting differences. While this poses a problem for all transcribers, it is particularly acute for zero-shot models like Whisper, which do not observe any examples of specific datasets' transcript formats.

This is not a novel observation; the development of evaluation metrics that better correlate with human judgement is an active area of research, and while there are some promising methods, none have seen widespread adoption for speech recognition yet. We opt to address this problem with extensive standardization of text before the WER calculation to minimize penalization of non-semantic differences. Our text normalizer was developed through iterative manual inspection to identify common patterns where naive WER penalized Whisper models for an innocuous difference. Appendix C includes full details. For several datasets, we observe WER drops of up to 50 percent, usually due to a quirk such as a dataset's reference transcripts separating contractions from words with whitespace. We caution that this development procedure comes at a risk of overfitting to the transcription style of Whisper models, which we investigate in Section 4.4. We are releasing the code for our text normalizer to allow for easy comparison and to help others study the performance of speech recognition systems in out-of-distribution settings.

3.3. English Speech Recognition

In 2015, Deep Speech 2 (Amodei et al., 2015) reported a speech recognition system that matched human-level performance when transcribing the LibriSpeech test-clean split. As part of their analysis they concluded: "Given this result, we suspect that there is little room for a generic speech system to further improve on clean read speech without further domain adaptation." Yet seven years later the SOTA WER on LibriSpeech test-clean has dropped another 73% from their 5.3% to 1.4% (Zhang et al., 2021), far below their reported human-level error rate of 5.8%. Despite this massive and unanticipated further improvement in performance on held-out but in-distribution data, speech recognition models trained on LibriSpeech remain far above human error rates when used in other settings. What explains this gap between reportedly superhuman performance in-distribution and subhuman performance out-of-distribution?

We suspect a large part of this gap between human and machine behavior is due to conflating different capabilities being measured by human and machine performance on a test set. This claim may seem confusing at first; if both humans and machines are taking the same test, how can it be that different skills are being tested? The difference arises not in the testing but in how they trained for it. Humans are often asked to perform a task given little to no supervision on the specific data distribution being studied. Thus human performance is a measure of out-of-distribution generalization. But machine learning models are usually evaluated after training on a large amount of supervision from the evaluation distribution, meaning that machine performance is instead a measure of in-distribution generalization. While both humans and machines are being evaluated on the same test data, two quite different abilities are being measured due to a difference in train data.

Whisper models, which are trained on a broad and diverse distribution of audio and evaluated in a zero-shot setting, could potentially match human behavior much better than existing systems. To study whether this is the case (or whether the difference between machine and human performance is due to yet-to-be-understood factors) we can compare Whisper models with both human performance and standard fine-tuned machine learning models and check which they more closely match.
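The normalize-then-score procedure from Section 3.2 is simple to demonstrate. Below is a minimal stand-in, assuming the third-party jiwer package for WER and a toy normalizer; the released normalizer described in Appendix C is far more thorough than this sketch.

import re
import jiwer  # third-party package providing a standard WER implementation

def simple_normalize(text: str) -> str:
    """Toy normalizer: lowercase, expand a couple of common contractions,
    strip punctuation, and collapse whitespace so that innocuous formatting
    differences are not counted as word errors."""
    text = text.lower()
    text = re.sub(r"\b(\w+)'re\b", r"\1 are", text)
    text = re.sub(r"\b(\w+)'ll\b", r"\1 will", text)
    text = re.sub(r"[^\w\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()

reference = "You're going to like it!"
hypothesis = "you are going to like it"
print(jiwer.wer(reference, hypothesis))                                      # > 0: penalized purely for style
print(jiwer.wer(simple_normalize(reference), simple_normalize(hypothesis)))  # 0.0 after normalization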

To quantify this difference, we examine both overall robustness, that is, average performance across many distributions/datasets, and effective robustness, introduced by Taori et al. (2020), which measures the difference in expected performance between a reference dataset, which is usually in-distribution, and one or more out-of-distribution datasets. A model with high effective robustness does better than expected on out-of-distribution datasets as a function of its performance on the reference dataset and approaches the ideal of equal performance on all datasets. For our analysis, we use LibriSpeech as the reference dataset due to its central role in modern speech recognition research and the availability of many released models trained on it, which allows for characterizing robustness behaviors. We use a suite of 12 other academic speech recognition datasets to study out-of-distribution behaviors. Full details about these datasets can be found in Appendix A.

Our main findings are summarized in Figure 2 and Table 2. Although the best zero-shot Whisper model has a relatively unremarkable LibriSpeech clean-test WER of 2.7, which is roughly the performance of a modern supervised baseline or the mid-2019 state of the art, zero-shot Whisper models have very different robustness properties than supervised LibriSpeech models and out-perform all benchmarked LibriSpeech models by large amounts on other datasets. Even the smallest zero-shot Whisper model, which has only 39 million parameters and a 6.7 WER on LibriSpeech test-clean, is roughly competitive with the best supervised LibriSpeech model when evaluated on other datasets. When compared to a human in Figure 2, the best zero-shot Whisper models roughly match their accuracy and robustness. For a detailed breakdown of this large improvement in robustness, Table 2 compares the performance of the best zero-shot Whisper model with a supervised LibriSpeech model that has the same performance as it on LibriSpeech test-clean. Despite this same performance on the reference distribution, the zero-shot Whisper model achieves an average relative error reduction of 55% when evaluated on other speech recognition datasets.

This finding suggests emphasizing zero-shot and out-of-distribution evaluations of models, particularly when attempting to compare to human performance, to avoid overstating the capabilities of machine learning systems due to misleading comparisons.

[Figure 2: scatter plot of average WER on Common Voice, CHiME-6, and TED-LIUM (%) against WER on LibriSpeech dev-clean (%) for supervised LibriSpeech models, zero-shot Whisper models, a zero-shot human (Alec), and the ideal robustness line y = x.]

Figure 2. Zero-shot Whisper models close the gap to human robustness. Despite matching or outperforming a human on LibriSpeech dev-clean, supervised LibriSpeech models make roughly twice as many errors as a human on other datasets, demonstrating their brittleness and lack of robustness. The estimated robustness frontier of zero-shot Whisper models, however, includes the 95% confidence interval for this particular human.

Dataset                  wav2vec 2.0 Large 960h   Whisper Large   RER (%)
LibriSpeech test-clean   2.7                      2.7             0.0
Artie                    24.5                     6.7             72.7
Fleurs (English)         14.6                     4.6             68.5
Common Voice             29.9                     9.5             68.2
Tedlium                  10.5                     4.0             61.9
CHiME6                   65.8                     25.6            61.1
WSJ                      7.7                      3.1             59.7
VoxPopuli (English)      17.9                     7.3             59.2
AMI-IHM                  37.0                     16.4            55.7
CallHome                 34.8                     15.8            54.6
Switchboard              28.3                     13.1            53.7
CORAAL                   38.3                     19.4            49.3
AMI-SDM1                 67.6                     36.9            45.4
LibriSpeech test-other   6.2                      5.6             9.7
Average                  29.5                     12.9            55.4

Table 2. Detailed comparison of robustness on various datasets. Although both models perform equally well on LibriSpeech, a zero-shot Whisper model performs much better on other datasets than expected for its LibriSpeech performance and makes 55% fewer errors on average. Results reported in word error rate (WER) for both models after applying our text normalizer.

3.4. Multi-lingual Speech Recognition

In order to compare to prior work on multilingual speech recognition, we report results on two low-data benchmarks, Multilingual LibriSpeech (MLS) (Pratap et al., 2020b) and VoxPopuli (Wang et al., 2021), in Table 3.

Our system performs well on Multilingual LibriSpeech, outperforming both XLS-R and mSLAM in a zero-shot setting. We caution that we do use a simple text standardizer for this result which prevents direct comparison or claims of SOTA performance. On VoxPopuli, however, our system significantly underperforms both XLS-R (Babu et al., 2021) and mSLAM (Bapna et al., 2022) and only matches the

VP-10K+FT baseline from the original paper. We suspect the underperformance of Whisper models on VoxPopuli could be due to XLS-R and mSLAM both including this distribution as a major source for their unsupervised pre-training data, as well as the dataset having significantly more supervised data, and therefore benefit, to fine-tuning. While MLS has 10 hours of training data per language, the average amount of training data per language is roughly 10× higher for VoxPopuli.

Model                 MLS     VoxPopuli
Supervised Baseline   -       37.5
VP-10K + FT           -       15.3
XLS-R (1B)            10.9    10.6
mSLAM-CTC (2B)        9.7     9.1
Zero-Shot Whisper     8.1     15.2

Table 3. Multilingual speech recognition performance. Zero-shot Whisper improves performance on Multilingual LibriSpeech (MLS) but is still significantly behind both XLS-R and mSLAM on VoxPopuli.

These two benchmarks are somewhat narrow since they only include 14 unique languages, almost all of which are in the Indo-European language family and many of which are high-resource languages. These benchmarks only provide limited coverage and room to study Whisper models' multilingual capabilities, which include training data for speech recognition in 75 languages. To study the performance of our systems more broadly, we also report performance on the Fleurs dataset (Conneau et al., 2022). In particular, we were interested in studying the relationship between the amount of training data we have for a given language and the resulting downstream zero-shot performance for that language. We visualize this relation in Figure 3. We find a strong squared correlation coefficient of 0.84 between the log of the word error rate and the log of the amount of training data per language. Checking the regression coefficient for a linear fit to these log-log values results in an estimate that WER halves for every 16× increase in training data. We also observed that many of the largest outliers in terms of worse than expected performance according to this trend are languages that have unique scripts and are more distantly related to the Indo-European languages making up the majority of the training dataset, such as Hebrew (HE), Telugu (TE), Chinese (ZH), and Korean (KO). These differences could be due to a lack of transfer due to linguistic distance, our byte-level BPE tokenizer being a poor match for these languages, or variations in data quality.

[Figure 3: log-log scatter of word error rate (WER) on Fleurs against hours of transcribed pre-training audio per language, with r² = 0.84.]

Figure 3. Correlation of pre-training supervision amount with downstream speech recognition performance. The amount of pre-training speech recognition data for a given language is very predictive of zero-shot performance on that language in Fleurs.

[Figure 4: log-log scatter of BLEU on Fleurs against hours of translated pre-training audio per language, with r² = 0.24.]

Figure 4. Correlation of pre-training supervision amount with downstream translation performance. The amount of pre-training translation data for a given language is only moderately predictive of Whisper's zero-shot performance on that language in Fleurs.
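The log-log regression behind the "halves for every 16×" estimate can be stated explicitly. The fitted intercept is not reported, so only the slope implied by the stated halving rate is worked out here, treating the fit as a power law in the hours of transcribed audio D:

\log \mathrm{WER} = \alpha + \beta \log D \quad\Longrightarrow\quad \mathrm{WER}(D) \propto D^{\beta}.

A halving of WER for every 16× increase in data means

\frac{\mathrm{WER}(16D)}{\mathrm{WER}(D)} = 16^{\beta} = \frac{1}{2} \quad\Longrightarrow\quad \beta = -\frac{\log 2}{\log 16} = -\frac{1}{4},

i.e. the trend in Figure 3 corresponds approximately to WER ∝ D^{-1/4} (with r² = 0.84 across languages).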

3.5. Translation

We study the translation capabilities of Whisper models by measuring their performance on the X→en subset of CoVoST2 (Wang et al., 2020b). We compare with mSLAM and XLS-R, the highest-performing prior work. We achieve a new state of the art of 27.3 BLEU zero-shot without using any of the CoVoST2 training data. We attribute this to the 68,000 hours of X→en translation data for these languages in our pre-training dataset which, although noisy, is vastly larger than the 861 hours of training data for X→en translation in CoVoST2. Since Whisper evaluation is zero-shot, it does particularly well on the lowest resource grouping of CoVoST2, improving over mSLAM by 4.6 BLEU. Conversely, the best Whisper model does not actually improve over mSLAM and XLS-R on average for the highest resource languages.

                     X → English BLEU
Model                High    Mid     Low     All
XMEF-X               34.2    20.2    5.9     14.7
XLS-R (2B)           36.1    27.7    15.1    22.1
mSLAM-CTC (2B)       37.8    29.6    18.5    24.8
Zero-Shot Whisper    35.0    31.1    23.1    27.3

Table 4. X→en speech translation performance. Zero-shot Whisper outperforms existing models on CoVoST2 in the overall, medium, and low resource settings but still moderately underperforms on high-resource languages compared to prior directly supervised work.

For an additional analysis on an even wider set of languages, we also re-purpose Fleurs, which is a speech recognition dataset, as a translation dataset. Since the same sentences are transcribed for every language, we use the English transcripts as reference translations. In Figure 4 we visualize the correlation between the amount of translation training data per language and the resulting zero-shot BLEU score on Fleurs. While there is a clear trend of improvement with increasing training data, the squared correlation coefficient is much lower than the 0.84 observed for speech recognition and only 0.24. We suspect this is partly caused by the noisier training data due to errors in audio language identification. As an example, Welsh (CY) is an outlier with much worse than expected performance at only 9 BLEU despite supposedly having 9,000 hours of translation data. This large amount of Welsh translation data is surprising, ranking 4th overall for translation data and ahead of some of the most spoken languages in the world like French, Spanish, and Russian. Inspection shows the majority of supposedly Welsh translation data is actually English audio with English captions where the English audio was mis-classified as Welsh by the language identification system, resulting in it being included as translation training data rather than transcription data according to our dataset creation rules.

3.6. Language Identification

To evaluate language identification, we use the Fleurs dataset (Conneau et al., 2022). The zero-shot performance of Whisper is not competitive with prior supervised work here and underperforms the supervised SOTA by 13.6%. However, Whisper is heavily disadvantaged for language identification on Fleurs, since the Whisper dataset contains no training data for 20 of the 102 languages in Fleurs, upper-bounding accuracy at 80.4%. On the 82 overlapping languages the best Whisper model achieves 79.7% accuracy.

Model                Language ID accuracy on Fleurs
w2v-bert-51 (0.6B)   71.4
mSLAM-CTC (2B)       77.7
Zero-shot Whisper    64.1

Table 5. Language identification performance. Zero-shot Whisper's accuracy at language identification is not competitive with prior supervised results on Fleurs. This is partially due to Whisper being heavily penalized for having no training data for 20 of the Fleurs languages.

[Figure 5: WER on LibriSpeech test-clean versus signal-to-noise ratio (40 dB down to -10 dB) under additive white noise (left) and pub noise (right), comparing Whisper against LibriSpeech-trained models (unispeech-sat-base-100h-libri-ft, wav2vec2-base-100h, wav2vec2-base-960h, wav2vec2-large-960h, wav2vec2-large-robust-ft-libri-960h, wav2vec2-large-960h-lv60-self, asr-crdnn-rnnlm-librispeech, asr-transformer-transformerlm-librispeech, hubert-large-ls960-ft, hubert-xlarge-ls960-ft, s2t-medium-librispeech-asr, s2t-large-librispeech-asr) and the NVIDIA STT models (stt_en_conformer_ctc_large, stt_en_conformer_transducer_xlarge).]

Figure 5. WER on LibriSpeech test-clean as a function of SNR under additive white noise (left) and pub noise (right). The accuracy of LibriSpeech-trained models degrades faster than the best Whisper model (⋆). NVIDIA STT models (•) perform best under low noise but are outperformed by Whisper under high noise (SNR < 10 dB). The second-best model under low noise (▼) is fine-tuned on LibriSpeech only and degrades even more quickly.

3.7. Robustness to Additive Noise

We tested the noise robustness of Whisper models and 14 LibriSpeech-trained models by measuring the WER when either white noise or pub noise from the Audio Degradation Toolbox (Mauch & Ewert, 2013) was added to the

audio. The pub noise represents a more natural noisy environment with ambient noise and indistinct chatter typical in a crowded restaurant or a pub. Among the 14 models, twelve are pre-trained and/or fine-tuned on LibriSpeech, and the other two are NVIDIA STT models trained on a mixture dataset similar to prior work like SpeechStew that includes LibriSpeech. The level of additive noise corresponding to a given signal-to-noise ratio (SNR) is calculated based on the signal power of individual examples. Figure 5 shows how the ASR performance degrades as the additive noise becomes more intensive. There are many models that outperform our zero-shot performance under low noise (40 dB SNR), which is unsurprising given those models are trained primarily on LibriSpeech, but all models quickly degrade as the noise becomes more intensive, performing worse than the Whisper model under additive pub noise of SNR below 10 dB. This showcases Whisper's robustness to noise, especially under more natural distribution shifts like the pub noise.

3.8. Long-form Transcription

Whisper models are trained on 30-second audio chunks and cannot consume longer audio inputs at once. This is not a problem with most academic datasets comprised of short utterances but presents challenges in real-world applications which often require transcribing minutes- or hours-long audio. We developed a strategy to perform buffered transcription of long audio by consecutively transcribing 30-second segments of audio and shifting the window according to the timestamps predicted by the model. We observed that it is crucial to have beam search and temperature scheduling based on the repetitiveness and the log probability of the model predictions in order to reliably transcribe long audio. The full procedure is described in Section 4.5.

We evaluate the long-form transcription performance on seven datasets consisting of speech recordings of various lengths and recording conditions, to cover as diverse a data distribution as possible. These include a long-form adaptation of TED-LIUM3 (Hernandez et al., 2018) concatenated so that each example is a full-length TED talk, a collection of jargon-laden segments taken from The Late Show with Stephen Colbert (Meanwhile), sets of videos/podcasts that have been used as ASR benchmarks in online blogs (Rev16 and Kincaid46), recordings of earnings calls (Del Rio et al., 2021), and the full-length interviews from the Corpus of Regional African American Language (CORAAL) (Gunter et al., 2021). Full details about the long-form datasets can be found in Appendix A.

We compare the performance with open-source models as well as 4 commercial ASR services. The results are summarized in Figure 6, showing the distribution of word error rates from Whisper and the 4 commercial ASR services, as well as the NVIDIA STT Conformer-CTC Large model from the NeMo toolkit (Kuchaiev et al., 2019), which performed the best among the open-source models. All commercial ASR services are queried using their default English transcription settings as of September 1st, 2022, and for the NVIDIA STT model we used their buffered inference implementation in the FrameBatchASR class to enable long-form transcription.

[Figure 6: box plots of word error rate (%) on TED-LIUM3, Meanwhile, Kincaid46, Rev16, Earnings-21, Earnings-22, and CORAAL for Whisper, Companies A-D, and NVIDIA STT (CTC large).]

Figure 6. Whisper is competitive with state-of-the-art commercial and open-source ASR systems in long-form transcription. The distributions of word error rates from six ASR systems on seven long-form datasets are compared, where the input lengths range from a few minutes to a few hours. The boxes show the quartiles of per-example WERs, and the per-dataset aggregate WERs are annotated on each box. Our model outperforms the best open source model (NVIDIA STT) on all datasets, and in most cases, commercial ASR systems as well.
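The SNR mixing described in Section 3.7, where the noise level is set from the signal power of each example, amounts to scaling the noise before adding it. The following sketch shows that calculation; it is an illustration, not the exact evaluation code or the Audio Degradation Toolbox itself.

import numpy as np

def add_noise_at_snr(signal: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested SNR relative to the
    per-example signal power, then add it to `signal`."""
    noise = noise[: len(signal)]
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    return signal + noise * np.sqrt(target_noise_power / noise_power)

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000).astype(np.float32)        # stand-in for 1 s of audio
white_noise = rng.standard_normal(16000).astype(np.float32)
noisy = add_noise_at_snr(speech, white_noise, snr_db=10.0)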
The results show that Whisper performs better than the compared models on most datasets, especially on the Meanwhile dataset which is heavy with uncommon words. Additionally, we note the possibility that some of the commercial ASR systems have been trained on some of these publicly available datasets, and therefore these results may not be accurately reflecting the relative robustness of the systems.

3.9. Comparison with Human Performance

Because of ambiguous or indistinct speech as well as labeling errors, there are different levels of irreducible error in each dataset, and with WER metrics from ASR systems alone it is difficult to make sense of how much room for improvement exists in each dataset. To quantify how close Whisper's performance is to human performance, we selected 25 recordings from the Kincaid46 dataset and used 5 services to obtain transcripts produced by professional transcribers, among which one provides computer-assisted transcription and the other four are entirely human-transcribed. The audio selection covers various recording conditions such as scripted and unscripted broadcast, telephone and VoIP calls, and meetings. Figure 7 shows the distribution of per-example WERs and aggregate WER across the 25 recordings, where the computer-assisted service has the lowest aggregate WER, 1.15 percentage points better than Whisper's, and the pure-human performance is only a fraction of a percentage point better than Whisper's. These results indicate that Whisper's English ASR performance is not perfect but very close to human-level accuracy.

[Figure 7: box plots of word error rate (%) on 25 Kincaid46 recordings for Whisper, ASR services A-D, a computer-assisted human transcription service E, and human transcription services F-I.]

Figure 7. Whisper's performance is close to that of professional human transcribers. This plot shows the WER distributions of 25 recordings from the Kincaid46 dataset transcribed by Whisper, the same 4 commercial ASR systems from Figure 6 (A-D), one computer-assisted human transcription service (E) and 4 human transcription services (F-I). The box plot is superimposed with dots indicating the WERs on individual recordings, and the aggregate WER over the 25 recordings is annotated on each box.

4. Analysis and Ablations

4.1. Model Scaling

A large amount of the promise in weakly supervised training approaches is their potential to use datasets much larger than those in traditional supervised learning. However, this comes with the cost of using data that is possibly much noisier and lower quality than gold-standard supervision. A concern with this approach is that although it may look promising to begin with, the performance of models trained on this kind of data may saturate at the inherent quality level of the dataset, which could be far below human level. A related concern is that as capacity and compute spent training on the dataset increases, models may learn to exploit the idiosyncrasies of the dataset, and their ability to generalize robustly to out-of-distribution data could even degrade.

To check whether this is the case, we study the zero-shot generalization of Whisper models as a function of the model size. Our analysis is summarized in Figure 8. With the exception of English speech recognition, performance continues to increase with model size across multilingual speech recognition, speech translation, and language identification. The diminishing returns for English speech recognition could be due to saturation effects from approaching human-level performance, as the analysis in Section 3.9 suggests.

4.2. Dataset Scaling

At 680,000 hours of labeled audio, the Whisper dataset is one of the largest ever created in supervised speech recognition. Exactly how important is the raw dataset size to Whisper's performance? To study this, we trained a series of medium-sized models on subsampled versions of the dataset which are 0.5%, 1%, 2%, 4%, and 8% of the full dataset size and compared their performance with the same medium-sized model trained on the whole dataset. Early stopping based on the validation loss was used to select model checkpoints for each dataset size. Evaluation was performed on an exponential moving average estimate of the parameters (Polyak & Juditsky, 1992) using a smoothing rate of 0.9999 to help reduce the effect of the learning rate not fully decaying to zero for the models trained on the subsampled datasets due to early stopping. Performance on English and multilingual speech recognition and X→en translation is reported in Table 6.

All increases in the dataset size result in improved performance on all tasks, although we see significant variability in improvement rates across tasks and sizes. Performance improves rapidly on English speech recognition from 3,000 to 13,000 hours and then slows down noticeably between 13,000 and 54,000 hours. Using the full dataset, which corresponds to another 12.5× increase in size, results in only a further 1 point drop in WER.
This mirrors the diminishing returns observed with model size scaling for English speech recognition and could similarly be explained by saturation effects when approaching human-level performance.

[Figure 8: four panels (English Speech Recognition, Multilingual Speech Recognition (Fleurs), X→En Translation (CoVoST2), Language Identification (Fleurs)) plotting average WER on 12 datasets, WER on 67 languages, BLEU on 21 languages, and accuracy on 102 languages against model parameters from 38M to 1549M.]

Figure 8. Zero-shot Whisper performance scales reliably across tasks and languages with increasing model size. Lightly shaded lines represent individual datasets or languages, showing that performance is more varied than the smooth trends in aggregate performance.

Dataset size (hours)   English WER (↓)   Multilingual WER (↓)   X→En BLEU (↑)
3405                   30.5              92.4                   0.2
6811                   19.6              72.7                   1.7
13621                  14.4              56.6                   7.9
27243                  12.3              45.0                   13.9
54486                  10.9              36.4                   19.2
681070                 9.9               29.2                   24.8

Table 6. Performance improves with increasing dataset size. English speech recognition performance refers to an average over 12 datasets, while the multilingual speech recognition column reports performance on the overlapping subset of languages in Fleurs, and X→en translation reports average BLEU on CoVoST2. Dataset size reported in hours.

Improvements in WER follow a power-law trend for multilingual speech recognition till 54,000 hours and then deviate from this trend, improving only a further 7 points when increasing to the full dataset size. For X→en translation, performance is practically zero when training on 7,000 hours of audio or less, and then follows a roughly log-linear improvement trend till 54,000 hours before also showing diminishing returns when further scaling to the full dataset size.

The general trend across tasks of diminishing returns when moving from 54,000 hours to our full dataset size of 680,000 hours could suggest that the current best Whisper models are under-trained relative to dataset size and performance could be further improved by a combination of longer training and larger models. It could also suggest that we are nearing the end of performance improvements from dataset size scaling for speech recognition. Further analysis is needed to characterize "scaling laws" for speech recognition in order to decide between these explanations.

4.3. Multitask and Multilingual Transfer

A potential concern with jointly training a single model on many tasks and languages is the possibility of negative transfer where interference between the learning of several tasks results in performance worse than would be achieved by training on only a single task or language. To investigate whether this is occurring, we compared the performance of models trained on just English speech recognition with our standard multitask and multilingual training setup and measured their average performance across our suite of zero-shot English speech recognition benchmarks. We adjust for the amount of FLOPs spent training on the task of English speech recognition as only 65% of compute is spent on this task in a joint training setup; analysis would otherwise be confounded by under-training on the task when compared to a same-sized English-only model.

Our results visualized in Figure 9 show that for small models trained with moderate amounts of compute, there is indeed negative transfer between tasks and languages: joint models underperform English-only models trained for the same amount of compute. However, multitask and multilingual models scale better and for our largest experiments outperform their English-only counterparts, demonstrating positive transfer from other tasks. For our largest experiments, joint models also slightly outperform English-only models even when not adjusting for compute spent per task.

4.4. Text Normalization

Since we developed our text normalization jointly with Whisper to discount innocuous word errors, there is a risk that our normalizer is overfitted to fixing Whisper's peculiarities rather than addressing general variation in transcription.

To check this, we compared the performance of Whisper using our normalizer versus an independently developed one from the FairSpeech project (Koenecke et al., 2020). In Figure 10, we visualize the differences. On most datasets the two normalizers perform similarly, without significant differences in WER reduction between Whisper and compared open-source models, while on some datasets, namely WSJ, CallHome, and Switchboard, our normalizer reduces the WER of Whisper models significantly more. The differences in reduction can be traced down to different formats used by the ground truth and how the two normalizers are penalizing them. For example, in CallHome and Switchboard, our standardizer did not penalize differences in common English contractions such as "you're" versus "you are", and in WSJ, our normalizer standardized the written and spoken forms of numerical and monetary expressions, such as "sixty-eight million dollars" versus "$68 million".

[Figure 9: average WER on 11 English speech recognition datasets against FLOPs spent training on English speech recognition (10e+19 to 10e+22), comparing English-only models with multilingual and multitask models.]

Figure 9. Multitask and multilingual transfer improves with scale. For small models, performance on English speech recognition degrades when trained jointly in a multitask and multilingual setup. However, multilingual and multitask models benefit more from scale and eventually outperform models trained on English data only. 95% bootstrap estimate confidence intervals are shown.

[Figure 10: per-dataset box plots of relative WER reduction compared to FairSpeech's normalizer (%) for open-source models and Whisper models on CORAAL, CommonVoice9.en, AMI-SDM1, CommonVoice5.1, Fleurs.en_us, AMI-IHM, Artie, LibriSpeech, TED-LIUM3, VoxPopuli.en, WSJ, CallHome, and Switchboard.]

Figure 10. On most datasets, our text normalizer has a similar effect on reducing WERs between Whisper models and other open-source models, compared to FairSpeech's normalizer. For each dataset, the boxplot shows the distribution of relative WER reduction across different models in our eval suite, showing that using our text normalizer generally results in lower WERs than FairSpeech's. On a few datasets our normalizer reduces WER significantly and more so for Whisper models, such as CallHome and Switchboard, which have many contractions in the ground truth, and WSJ, which contains many numerical expressions.

4.5. Strategies for Reliable Long-form Transcription

Transcribing long-form audio using Whisper relies on accurate prediction of the timestamp tokens to determine the amount to shift the model's 30-second audio context window by, and inaccurate transcription in one window may negatively impact transcription in the subsequent windows. We have developed a set of heuristics that help avoid failure cases of long-form transcription, which is applied in the results reported in Sections 3.8 and 3.9. First, we use beam search with 5 beams using the log probability as the score function, to reduce repetition looping which happens more frequently in greedy decoding. We start with temperature 0, i.e. always selecting the tokens with the highest probability, and increase the temperature by 0.2 up to 1.0 when either the average log probability over the generated tokens is lower than −1 or the generated text has a gzip compression rate higher than 2.4. Providing the transcribed text from the preceding window as previous-text conditioning when the applied temperature is below 0.5 further improves the performance. We found that the probability of the <|nospeech|> token alone is not sufficient to distinguish a segment with no speech, but combining the no-speech probability threshold of 0.6 and the average log-probability threshold of −1 makes the voice activity detection of Whisper more reliable. Finally, to avoid a failure mode where the model ignores the first few words in the input, we constrained the initial timestamp token to be between 0.0 and 1.0 second. Table 7 shows that adding each of the interventions above incrementally reduces the WER overall, but not evenly across the datasets. These heuristics serve as a workaround for the noisy predictions of the model, and more research would be needed to further improve the reliability of long-form decoding.
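The temperature-fallback rule in Section 4.5 is essentially a retry loop over decoding attempts. The sketch below spells out that control flow; decode_fn is a hypothetical callable standing in for Whisper decoding (beam search at temperature 0, sampling above that), and the thresholds are the ones quoted in the text.

import zlib

def gzip_compression_ratio(text: str) -> float:
    """Bytes of text divided by bytes after compression; highly repetitive
    output compresses well and therefore yields a high ratio."""
    data = text.encode("utf-8")
    return len(data) / max(1, len(zlib.compress(data)))

def transcribe_with_fallback(decode_fn, audio_segment):
    """Temperature fallback sketched from Section 4.5.

    `decode_fn(audio, temperature)` is assumed to return
    (text, avg_logprob, no_speech_prob) for one 30-second segment."""
    text = ""
    for temperature in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
        text, avg_logprob, no_speech_prob = decode_fn(audio_segment, temperature)
        # Treat the segment as silence when both thresholds agree.
        if no_speech_prob > 0.6 and avg_logprob < -1.0:
            return ""
        # Accept the result unless it looks repetitive or low-confidence.
        if gzip_compression_ratio(text) <= 2.4 and avg_logprob >= -1.0:
            return text
    return text  # fall back to the highest-temperature attempt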

strated for machine translation by Johnson et al. (2017),


removing the need for separate encoders and decoders. This

TED-LIUM3

Earnings-21

Earnings-22
Meanwhile

Kincaid46

CORAAL
approach was simplified further into the “text-to-text” frame-

Average
Rev16
work of McCann et al. (2018) and popularized by its success
with large transformer language models in the work of Rad-
Greedy decoding only 3.55 5.95 9.96 12.1 10.5 13.9 21.3 11.0 ford et al. (2019) and Raffel et al. (2020). Toshniwal et al.
+ Beam search 3.66 5.78 9.18 12.1 10.2 13.5 20.1 10.6 (2018) demonstrated jointly training a modern deep learn-
+ Temperature fallback 3.66 5.72 8.93 12.0 10.2 13.4 19.8 10.5 ing speech recognition system on several languages with a
+ Previous-text conditioning 3.68 5.35 9.19 11.3 9.78 12.8 19.8 10.3
+ Voice activity detection 3.57 5.28 9.16 10.8 9.75 12.9 19.9 10.2
single model, and Pratap et al. (2020a) scaled this line of
+ Initial timestamp constraint 3.56 5.33 8.87 10.6 9.68 13.0 19.9 10.1 work significantly to 50 languages with a billion-parameter
model. MUTE (Wang et al., 2020c) and mSLAM (Bapna
Table 7. Long-form transcription performance improves incremen- et al., 2022) studied joint training over both text and speech
tally as additional decoding heuristics are employed. Details on language tasks, demonstrating transfer between them.
each intervention are described in Section 4.5.
5. Related Work

Scaling Speech Recognition A consistent theme across speech recognition research has been documenting the benefits of scaling compute, models, and datasets. Early work applying deep learning to speech recognition found improved performance with model depth and size and leveraged GPU acceleration to make training these larger models tractable (Mohamed et al., 2009). Further research demonstrated that the benefit of deep learning approaches to speech recognition increased with dataset size, improving from being only competitive with prior GMM-HMM systems when using just 3 hours of TIMIT training data for phone recognition to achieving a 30% word error rate reduction when trained on the 2,000 hour Switchboard dataset (Seide et al., 2011). Liao et al. (2013) is an early example of leveraging weakly supervised learning to increase the size of a deep learning based speech recognition dataset by over 1,000 hours. These trends continued with Deep Speech 2 (Amodei et al., 2015), a notable system that developed high-throughput distributed training across 16 GPUs and scaled to 12,000 hours of training data while demonstrating continuing improvements at that scale. By leveraging semi-supervised pre-training, Narayanan et al. (2018) were able to grow dataset size much further and study training on 162,000 hours of labeled audio. More recent work has explored billion-parameter models (Zhang et al., 2020) and using up to 1,000,000 hours of training data (Zhang et al., 2021).

Multitask Learning Multitask learning (Caruana, 1997) has been studied for a long time. In speech recognition, multi-lingual models have been explored for well over a decade (Schultz & Kirchhoff, 2006). An inspirational and foundational work in NLP exploring multi-task learning with a single model is Collobert et al. (2011). Multitask learning in the sequence-to-sequence framework (Sutskever et al., 2014) using multiple encoders and decoders was investigated in Luong et al. (2015). The use of language codes with a shared encoder/decoder architecture was first demonstrated for machine translation by Johnson et al. (2017), removing the need for separate encoders and decoders. This approach was simplified further into the "text-to-text" framework of McCann et al. (2018) and popularized by its success with large transformer language models in the work of Radford et al. (2019) and Raffel et al. (2020). Toshniwal et al. (2018) demonstrated jointly training a modern deep learning speech recognition system on several languages with a single model, and Pratap et al. (2020a) scaled this line of work significantly to 50 languages with a billion-parameter model. MUTE (Wang et al., 2020c) and mSLAM (Bapna et al., 2022) studied joint training over both text and speech language tasks, demonstrating transfer between them.

Robustness The question of how effectively models transfer and how robust they are to distribution shift and other types of perturbations has long been studied and is actively being researched across many fields of machine learning. Torralba & Efros (2011) highlighted the lack of generalization of machine learning models between datasets over a decade ago. Many other works have shown and continually reiterated how, despite high performance on IID test sets, machine learning models can still make many mistakes when evaluated in even slightly different settings (Lake et al., 2017; Jia & Liang, 2017; Alcorn et al., 2019; Barbu et al., 2019; Recht et al., 2019). More recently, Taori et al. (2020) studied the robustness of image classification models, and Miller et al. (2020) investigated this for question-answering models. A key finding has been that multi-domain training increases robustness and generalization, as discussed in the Introduction. This finding has been replicated across many fields in addition to speech recognition, including NLP (Hendrycks et al., 2020) and computer vision (Radford et al., 2021).

6. Limitations and Future Work

From our experimental results, analyses, and ablations, we have noted several limitations and areas for future work.

Improved decoding strategies. As we have scaled Whisper, we have observed that larger models have made steady and reliable progress on reducing perception-related errors such as confusing similar-sounding words. Many remaining errors, particularly in long-form transcription, seem more stubborn in nature and decidedly non-human/perceptual. They are a combination of failure modes of seq2seq models, language models, and text-audio alignment and include problems such as getting stuck in repeat loops, not transcribing the first or last few words of an audio segment, or complete hallucination where the model will output a transcript entirely unrelated to the actual audio. Although the decoding details discussed in Section 4.5 help significantly, we suspect fine-tuning Whisper models on a high-quality supervised dataset and/or using reinforcement learning to more directly optimize for decoding performance could help further reduce these errors.

Increase Training Data For Lower-Resource Languages As Figure 3 shows, Whisper's speech recognition performance is still quite poor on many languages. The same analysis suggests a clear route for improvement since performance on a language is very well predicted by the amount of training data for the language. Since our pre-training dataset is currently very English-heavy due to biases of our data collection pipeline, which sourced primarily from English-centric parts of the internet, most languages have less than 1000 hours of training data. A targeted effort at increasing the amount of data for these rarer languages could result in a large improvement to average speech recognition performance even with only a small increase in our overall training dataset size.

Studying fine-tuning In this work, we have focused on the robustness properties of speech processing systems and as a result only studied the zero-shot transfer performance of Whisper. While this is a crucial setting to study due to it being representative of general reliability, for many domains where high-quality supervised speech data does exist, it is likely that results can be improved further by fine-tuning. An additional benefit of studying fine-tuning is that it allows for direct comparisons with prior work since it is a much more common evaluation setting.

Tuning Architecture, Regularization, and Augmentation As a study focusing primarily on the impact of dataset scaling on speech processing, our training setup is relatively basic and does not have many components of current state-of-the-art systems. Adding common regularization techniques such as dropout (Srivastava et al., 2014) or stochastic depth (Huang et al., 2016) as well as data augmentation methods such as SpecAugment (Park et al., 2019) could potentially combine well with fine-tuning, which is often data-limited.

Adding Auxiliary Training Objectives Whisper departs noticeably from most recent state-of-the-art speech recognition systems due to the lack of unsupervised pre-training or self-teaching methods. While we have not found them necessary to achieve good performance, it is possible that the results could be further improved by incorporating these methods.

7. Conclusion

Whisper suggests that scaling weakly supervised pre-training has been underappreciated so far in speech recognition research. We achieve our results without the need for the self-supervision and self-training techniques that have been a mainstay of recent large-scale speech recognition work, and demonstrate how simply training on a large and diverse supervised dataset and focusing on zero-shot transfer can significantly improve the robustness of a speech recognition system.

ACKNOWLEDGMENTS

We'd like to thank the millions of people who were involved in creating the data used by Whisper. We'd also like to thank Nick Ryder, Will Zhuk, and Andrew Carr for the conversation on the waterfall hike that inspired this project. We are also grateful to the Acceleration and Supercomputing teams at OpenAI for their critical work on the software and hardware infrastructure this project used. We'd also like to thank Pamela Mishkin for advising the project from a policy perspective. Finally, we are grateful to the developers of the many software packages used throughout this project including, but not limited to, NumPy (Harris et al., 2020), SciPy (Virtanen et al., 2020), ftfy (Speer, 2019), PyTorch (Paszke et al., 2019), pandas (pandas development team, 2020), and scikit-learn (Pedregosa et al., 2011).

References

Alcorn, M. A., Li, Q., Gong, Z., Wang, C., Mai, L., Ku, W.-S., and Nguyen, A. Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4845–4854, 2019.

Amodei, D., Anubhai, R., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Chen, J., Chrzanowski, M., Coates, A., Diamos, G., et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. arXiv preprint arXiv:1512.02595, 2015.

Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M., and Weber, G. Common Voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670, 2019.

Babu, A., Wang, C., Tjandra, A., Lakhotia, K., Xu, Q., Goyal, N., Singh, K., von Platen, P., Saraf, Y., Pino, J., et al. XLS-R: Self-supervised cross-lingual speech representation learning at scale. arXiv preprint arXiv:2111.09296, 2021.

Baevski, A., Zhou, H., Mohamed, A., and Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477, 2020.

Baevski, A., Hsu, W.-N., Conneau, A., and Auli, M. Unsupervised speech recognition. Advances in Neural Information Processing Systems, 34:27826–27839, 2021.
Bapna, A., Cherry, C., Zhang, Y., Jia, Y., Johnson, M., Cheng, Y., Khanuja, S., Riesa, J., and Conneau, A. mSLAM: Massively multilingual joint pre-training for speech and text. arXiv preprint arXiv:2202.01374, 2022.

Barbu, A., Mayo, D., Alverio, J., Luo, W., Wang, C., Gutfreund, D., Tenenbaum, J., and Katz, B. ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. Advances in Neural Information Processing Systems, 32, 2019.

Caruana, R. Multitask learning. Machine Learning, 28(1):41–75, 1997.

Chan, W., Park, D., Lee, C., Zhang, Y., Le, Q., and Norouzi, M. SpeechStew: Simply mix all available speech recognition data to train one large neural network. arXiv preprint arXiv:2104.02133, 2021.

Chen, G., Chai, S., Wang, G., Du, J., Zhang, W.-Q., Weng, C., Su, D., Povey, D., Trmal, J., Zhang, J., et al. GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. arXiv preprint arXiv:2106.06909, 2021.

Chen, S., Wu, Y., Wang, C., Chen, Z., Chen, Z., Liu, S., Wu, J., Qian, Y., Wei, F., Li, J., et al. UniSpeech-SAT: Universal speech representation learning with speaker aware pre-training. In ICASSP 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6152–6156. IEEE, 2022.

Chen, T., Xu, B., Zhang, C., and Guestrin, C. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.

Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, 2011.

Conneau, A., Ma, M., Khanuja, S., Zhang, Y., Axelrod, V., Dalmia, S., Riesa, J., Rivera, C., and Bapna, A. FLEURS: Few-shot learning evaluation of universal representations of speech. arXiv preprint arXiv:2205.12446, 2022.

Del Rio, M., Delworth, N., Westerman, R., Huang, M., Bhandari, N., Palakapilly, J., McNamara, Q., Dong, J., Zelasko, P., and Jetté, M. Earnings-21: A practical benchmark for ASR in the wild. arXiv preprint arXiv:2104.11348, 2021.

Galvez, D., Diamos, G., Torres, J. M. C., Achorn, K., Gopi, A., Kanter, D., Lam, M., Mazumder, M., and Reddi, V. J. The People's Speech: A large-scale diverse English speech recognition dataset for commercial usage. arXiv preprint arXiv:2111.09344, 2021.

Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., and Wichmann, F. A. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020.

Ghorbani, B., Firat, O., Freitag, M., Bapna, A., Krikun, M., Garcia, X., Chelba, C., and Cherry, C. Scaling laws for neural machine translation. arXiv preprint arXiv:2109.07740, 2021.

Griewank, A. and Walther, A. Algorithm 799: revolve: An implementation of checkpointing for the reverse or adjoint mode of computational differentiation. ACM Transactions on Mathematical Software (TOMS), 26(1):19–45, 2000.

Gunter, K., Vaughn, C., and Kendall, T. Contextualizing /s/ retraction: Sibilant variation and change in Washington DC African American Language. Language Variation and Change, 33(3):331–357, 2021.

Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., Fernández del Río, J., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., and Oliphant, T. E. Array programming with NumPy. Nature, 585:357–362, 2020. doi: 10.1038/s41586-020-2649-2.

Hendrycks, D. and Gimpel, K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.

Hendrycks, D., Liu, X., Wallace, E., Dziedzic, A., Krishnan, R., and Song, D. Pretrained transformers improve out-of-distribution robustness. arXiv preprint arXiv:2004.06100, 2020.

Hernandez, F., Nguyen, V., Ghannay, S., Tomashenko, N. A., and Estève, Y. TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation. In SPECOM, 2018.

Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., and Mohamed, A. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021a.
Hsu, W.-N., Sriram, A., Baevski, A., Likhomanenko, T., Xu, Q., Pratap, V., Kahn, J., Lee, A., Collobert, R., Synnaeve, G., et al. Robust wav2vec 2.0: Analyzing domain shift in self-supervised pre-training. arXiv preprint arXiv:2104.01027, 2021b.

Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, K. Q. Deep networks with stochastic depth. In European Conference on Computer Vision, pp. 646–661. Springer, 2016.

Jia, R. and Liang, P. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328, 2017.

Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., et al. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017.

Kendall, T. and Farrington, C. The Corpus of Regional African American Language. Version 2021.07. Eugene, OR: The Online Resources for African American Language Project. http://oraal.uoregon.edu/coraal, 2021. Accessed: 2022-09-01.

Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J. R., Jurafsky, D., and Goel, S. Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences, 117(14):7684–7689, 2020.

Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., and Houlsby, N. Big Transfer (BiT): General visual representation learning. In European Conference on Computer Vision, pp. 491–507. Springer, 2020.

Kuchaiev, O., Li, J., Nguyen, H., Hrinchuk, O., Leary, R., Ginsburg, B., Kriman, S., Beliaev, S., Lavrukhin, V., Cook, J., et al. NeMo: A toolkit for building AI applications using neural modules. arXiv preprint arXiv:1909.09577, 2019.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.

Liao, H., McDermott, E., and Senior, A. Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 368–373. IEEE, 2013.

Likhomanenko, T., Xu, Q., Pratap, V., Tomasello, P., Kahn, J., Avidov, G., Collobert, R., and Synnaeve, G. Rethinking evaluation in ASR: Are our models robust enough? arXiv preprint arXiv:2010.11745, 2020.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Luong, M.-T., Le, Q. V., Sutskever, I., Vinyals, O., and Kaiser, L. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114, 2015.

Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., and Van Der Maaten, L. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 181–196, 2018.

Mauch, M. and Ewert, S. The Audio Degradation Toolbox and its application to robustness evaluation. In Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR 2013), Curitiba, Brazil, 2013.

McCann, B., Keskar, N. S., Xiong, C., and Socher, R. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.

Meyer, J., Rauchenstein, L., Eisenberg, J. D., and Howell, N. Artie Bias Corpus: An open dataset for detecting demographic bias in speech applications. In Proceedings of the 12th Language Resources and Evaluation Conference, pp. 6462–6468, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL https://aclanthology.org/2020.lrec-1.796.

Miller, J., Krauth, K., Recht, B., and Schmidt, L. The effect of natural distribution shift on question answering models. In ICML, 2020.

Mohamed, A.-r., Dahl, G., Hinton, G., et al. Deep belief networks for phone recognition. In NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, volume 1, pp. 39, 2009.

Narayanan, A., Misra, A., Sim, K. C., Pundak, G., Tripathi, A., Elfeky, M., Haghani, P., Strohman, T., and Bacchiani, M. Toward domain-invariant speech recognition via large scale training. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 441–447. IEEE, 2018.

Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. IEEE, 2015.

pandas development team, T. pandas-dev/pandas: Pandas, February 2020. URL https://doi.org/10.5281/zenodo.3509134.
Park, D. S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E. D., and Le, Q. V. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.

Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pp. 1310–1318. PMLR, 2013.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035, 2019.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

Pratap, V., Sriram, A., Tomasello, P., Hannun, A. Y., Liptchinsky, V., Synnaeve, G., and Collobert, R. Massively multilingual ASR: 50 languages, 1 model, 1 billion parameters. ArXiv, abs/2007.03001, 2020a.

Pratap, V., Xu, Q., Sriram, A., Synnaeve, G., and Collobert, R. MLS: A large-scale multilingual dataset for speech research. arXiv preprint arXiv:2012.03411, 2020b.

Press, O. and Wolf, L. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 157–163, Valencia, Spain, April 2017. Association for Computational Linguistics. URL https://aclanthology.org/E17-2025.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P. J., et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.

Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L., Subakan, C., Dawalatabad, N., Heba, A., Zhong, J., Chou, J.-C., Yeh, S.-L., Fu, S.-W., Liao, C.-F., Rastorgueva, E., Grondin, F., Aris, W., Na, H., Gao, Y., Mori, R. D., and Bengio, Y. SpeechBrain: A general-purpose speech toolkit, 2021. arXiv:2106.04624.

Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do ImageNet classifiers generalize to ImageNet? In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5389–5400. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/recht19a.html.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Schultz, T. and Kirchhoff, K. Multilingual Speech Processing. Elsevier, 2006.

Seide, F., Li, G., Chen, X., and Yu, D. Feature engineering in context-dependent deep neural networks for conversational speech transcription. In 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, pp. 24–29. IEEE, 2011.

Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.

Speer, R. ftfy. Zenodo, 2019. URL https://doi.org/10.5281/zenodo.2591652. Version 5.5.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27, 2014.

Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., and Schmidt, L. Measuring robustness to natural distribution shifts in image classification. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing
Systems, volume 33, pp. 18583–18599. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/d8330f857a17c53d217014ee776bfd50-Paper.pdf.

Torralba, A. and Efros, A. A. Unbiased look at dataset bias. CVPR 2011, pp. 1521–1528, 2011.

Toshniwal, S., Sainath, T. N., Weiss, R. J., Li, B., Moreno, P. J., Weinstein, E., and Rao, K. Multilingual speech recognition with a single end-to-end model. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4904–4908, 2018.

Valk, J. and Alumäe, T. VoxLingua107: A dataset for spoken language recognition. In 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 652–658. IEEE, 2021.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C. J., Polat, İ., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, F., van Mulbregt, P., and SciPy 1.0 Contributors. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261–272, 2020. doi: 10.1038/s41592-019-0686-2.

Wang, C., Tang, Y., Ma, X., Wu, A., Okhonko, D., and Pino, J. fairseq S2T: Fast speech-to-text modeling with fairseq. arXiv preprint arXiv:2010.05171, 2020a.

Wang, C., Wu, A., and Pino, J. CoVoST 2 and massively multilingual speech-to-text translation. arXiv preprint arXiv:2007.10310, 2020b.

Wang, C., Riviere, M., Lee, A., Wu, A., Talnikar, C., Haziza, D., Williamson, M., Pino, J., and Dupoux, E. VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. arXiv preprint arXiv:2101.00390, 2021.

Wang, P., Sainath, T. N., and Weiss, R. J. Multitask training with text data for end-to-end speech recognition. arXiv preprint arXiv:2010.14318, 2020c.

Watanabe, S., Mandel, M., Barker, J., Vincent, E., Arora, A., Chang, X., Khudanpur, S., Manohar, V., Povey, D., Raj, D., et al. CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings. arXiv preprint arXiv:2004.09249, 2020.

Xu, Q., Baevski, A., Likhomanenko, T., Tomasello, P., Conneau, A., Collobert, R., Synnaeve, G., and Auli, M. Self-training and pre-training are complementary for speech recognition. In ICASSP 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3030–3034. IEEE, 2021.

Zhang, Y., Qin, J., Park, D. S., Han, W., Chiu, C.-C., Pang, R., Le, Q. V., and Wu, Y. Pushing the limits of semi-supervised learning for automatic speech recognition. arXiv preprint arXiv:2010.10504, 2020.

Zhang, Y., Park, D. S., Han, W., Qin, J., Gulati, A., Shor, J., Jansen, A., Xu, Y., Huang, Y., Wang, S., et al. BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition. arXiv preprint arXiv:2109.13226, 2021.

A. Evaluation Datasets
A.1. Short-form English-only datasets
• LibriSpeech (Panayotov et al., 2015): We used the test-clean and test-other splits from the LibriSpeech ASR corpus.

• TED-LIUM 3 (Hernandez et al., 2018): We used the test split of TED-LIUM Release 3, using the segmented manual
transcripts included in the release.

• Common Voice 5.1 (Ardila et al., 2019): We downloaded the English subset of Common Voice Corpus 5.1 from the
official website.

• Artie bias corpus (Meyer et al., 2020): We used the Artie bias corpus, which is a subset of the Common Voice dataset.

• CallHome and Switchboard: We used the two corpora from LDC2002S09 and LDC2002T43.

• WSJ: We used LDC93S6B and LDC94S13B and followed the s5 recipe to preprocess the dataset.

• CORAAL: We used the 231 interviews from CORAAL (Kendall & Farrington, 2021) and used the preprocessing
script from the FairSpeech project.

• CHiME-6: For CHiME-6 (Watanabe et al., 2020), we downloaded the CHiME-5 dataset and followed stage 0 of the s5_track1 recipe to create the CHiME-6 dataset, which fixes synchronization. We then used the binaural recordings (*_P??.wav) and the corresponding transcripts.

• AMI-IHM and AMI-SDM1: We preprocessed the AMI Corpus by following stages 0 and 2 of the s5b recipe.

A.2. Long-form English-only datasets


• TED-LIUM 3 (Hernandez et al., 2018): We used the 11 full-length TED talks from the test split of TED-LIUM
Release 3, slicing the source audio files between the beginning of the first labeled segment and the end of the last
labeled segment of each talk, and we used the concatenated text as the label.

• Meanwhile: This dataset consists of 64 segments from The Late Show with Stephen Colbert. The YouTube video ID
and the corresponding start and end timestamps are available as part of the code release. The labels are collected from
the closed-caption data for each video and corrected with manual inspection.

• Rev16: We use a subset of 16 files from the 30 podcast episodes in Rev.AI's Podcast Transcription Benchmark, after finding multiple cases where a significant portion of the audio and the labels did not match, mostly in the parts introducing the sponsors. We selected 16 episodes that do not have this error, whose "file numbers" are:

3 4 9 10 11 14 17 18 20 21 23 24 26 27 29 32

• Kincaid46: This dataset consists of 46 audio files and the corresponding transcripts compiled in the blog article "Which automatic transcription service is the most accurate - 2018" by Jason Kincaid. We used the 46 audio files and reference transcripts from the Airtable widget in the article. For the human transcription benchmark in the paper, we use a subset of 25 examples from this data, whose "Ref IDs" are:

2 4 5 8 9 10 12 13 14 16 19 21 23 25 26 28 29 30 33 35 36 37 42 43 45

• Earnings-21 (Del Rio et al., 2021) and Earnings-22: We used the files available in the speech-datasets repository, as
of their 202206 version.

• CORAAL: We used the 231 full-length interviews and transcripts from (Kendall & Farrington, 2021).

A.3. Multilingual datasets


• Multilingual LibriSpeech (Pratap et al., 2020b): We used the test splits from each language in the Multilingual
LibriSpeech (MLS) corpus.

• Fleurs (Conneau et al., 2022): We collected audio files and transcripts using the implementation available as Hug-
gingFace datasets. To use as a translation dataset, we matched the numerical utterance IDs to find the corresponding
transcript in English.
• VoxPopuli (Wang et al., 2021): We used the get_asr_data.py script from the official repository to collect the ASR data in 14 languages.

• Common Voice 9 (Ardila et al., 2019): We downloaded the Common Voice Corpus 9 from the official website.
• CoVoST 2 (Wang et al., 2020b): We collected the X into English data using the official repository.

B. Compared Models
For comparison, we use the following models from HuggingFace, downloaded as of September 2022 using version 4.21.0 of
the transformers library:

• facebook/wav2vec2-large-960h-lv60-self (Xu et al., 2021)


• facebook/wav2vec2-large-robust-ft-libri-960h (Hsu et al., 2021b)
• facebook/wav2vec2-base-100h (Baevski et al., 2020)

• facebook/wav2vec2-base-960h (Baevski et al., 2020)


• facebook/wav2vec2-large-960h (Baevski et al., 2020)
• facebook/hubert-large-ls960-ft (Hsu et al., 2021a)

• facebook/hubert-xlarge-ls960-ft (Hsu et al., 2021a)


• facebook/s2t-medium-librispeech-asr (Wang et al., 2020a)
• facebook/s2t-large-librispeech-asr (Wang et al., 2020a)
• microsoft/unispeech-sat-base-100h-libri-ft (Chen et al., 2022)

• nvidia/stt_en_conformer_ctc_large (Kuchaiev et al., 2019)


• nvidia/stt_en_conformer_transducer_xlarge (Kuchaiev et al., 2019)
• speechbrain/asr-crdnn-rnnlm-librispeech (Ravanelli et al., 2021)

• speechbrain/asr-transformer-transformerlm-librispeech (Ravanelli et al., 2021)

We note that all of the models above are entirely or partly trained on LibriSpeech.
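For reference, each of these checkpoints can be loaded through the transformers pipeline API. The snippet below is a minimal usage sketch rather than our evaluation harness, and "audio.wav" is a placeholder path for a 16 kHz recording.

```python
from transformers import pipeline

# Load one of the compared checkpoints (here wav2vec2-large-960h-lv60-self)
# and transcribe a local audio file.
asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-large-960h-lv60-self",
)
result = asr("audio.wav")
print(result["text"])
```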

C. Text Standardization
Since Whisper may output any UTF-8 string rather than a restricted set of graphemes, the rules for text standardization need
to be more intricate and comprehensive than those defined on e.g. ASCII characters. We perform the following steps to
normalize English texts in different styles into a standardized form, which is a best-effort attempt to penalize only when a
word error is caused by actually mistranscribing a word, and not by formatting or punctuation differences; a brief illustrative sketch of a few of these steps follows the list.

1. Remove any phrases between matching brackets ([, ]).


2. Remove any phrases between matching parentheses ((, )).

3. Remove any of the following words: hmm, mm, mhm, mmm, uh, um
4. Remove whitespace characters that come before an apostrophe ’
5. Convert standard or informal contracted forms of English into the original form.
6. Remove commas (,) between digits

7. Remove periods (.) not followed by numbers


8. Remove symbols as well as diacritics from the text, where symbols are the characters with the Unicode category
starting with M, S, or P, except period, percent, and currency symbols that may be detected in the next step.
9. Detect any numeric expressions of numbers and currencies and replace with a form using Arabic numbers, e.g. “Ten
thousand dollars” → “$10000”.
10. Convert British spellings into American spellings.
11. Remove remaining symbols that are not part of any numeric expressions.
12. Replace any successive whitespace characters with a space.
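A few of the purely mechanical steps above can be expressed as simple regular-expression substitutions. The sketch below covers only steps 1–4, 6, 7, and 12, and illustrates the intent of those steps rather than reproducing the released normalizer.

```python
import re

FILLER_WORDS = r"\b(hmm|mm|mhm|mmm|uh|um)\b"


def partial_normalize(text: str) -> str:
    text = re.sub(r"\[[^\]]*\]", "", text)                      # 1. bracketed phrases
    text = re.sub(r"\([^)]*\)", "", text)                       # 2. parenthesized phrases
    text = re.sub(FILLER_WORDS, "", text, flags=re.IGNORECASE)  # 3. filler words
    text = re.sub(r"\s+'", "'", text)                           # 4. whitespace before apostrophes
    text = re.sub(r"(\d),(\d)", r"\1\2", text)                  # 6. commas between digits
    text = re.sub(r"\.(?!\d)", "", text)                        # 7. periods not followed by numbers
    return re.sub(r"\s+", " ", text).strip()                    # 12. collapse whitespace
```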

A different, language-specific set of transformations would be needed to equivalently normalize non-English text, but due to
our lack of linguistic knowledge to build such normalizers for all languages, we resort to the following basic standardization
for non-English text:

1. Remove any phrases between matching brackets ([, ]).


2. Remove any phrases between matching parentheses ((, )).

3. Replace any markers, symbols, and punctuation characters with a space, i.e. when the Unicode category of each
character in the NFKC-normalized string starts with M, S, or P.
4. Make the text lowercase.
5. Replace any successive whitespace characters with a space.

Additionally, we put a space between every letter for the languages that do not use spaces to separate words, namely Chinese,
Japanese, Thai, Lao, and Burmese, effectively measuring the character error rate instead.
We note that the above is an imperfect solution, and it will sometimes produce unintended and unexpected outputs. We do
not claim that the text format resulting from the above is more “correct” in any measure. Rather, the procedures above are
designed to better distinguish between innocuous differences in wording and genuine mistranscriptions. Python code for
the standardization procedures above is available as part of our code and model release to facilitate future iterations and
improvements on text standardization.
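As an illustration of the basic non-English procedure above, the following sketch applies the Unicode-category-based stripping, lowercasing, and the per-character spacing used for languages written without spaces. It is an approximation for illustration only, and the specific language codes are assumptions rather than part of the released code.

```python
import re
import unicodedata

# Assumed ISO 639-1 codes for Chinese, Japanese, Thai, Lao, and Burmese.
NO_SPACE_LANGUAGES = {"zh", "ja", "th", "lo", "my"}


def basic_standardize(text: str, language: str) -> str:
    text = re.sub(r"\[[^\]]*\]", "", text)   # 1. remove bracketed phrases
    text = re.sub(r"\([^)]*\)", "", text)    # 2. remove parenthesized phrases
    # 3. Replace markers, symbols, and punctuation (Unicode categories M, S, P)
    #    in the NFKC-normalized string with a space.
    text = unicodedata.normalize("NFKC", text)
    text = "".join(" " if unicodedata.category(ch)[0] in "MSP" else ch for ch in text)
    text = text.lower()                      # 4. lowercase
    if language in NO_SPACE_LANGUAGES:
        # Space-separate every character so WER effectively measures CER.
        text = " ".join(text.replace(" ", ""))
    return re.sub(r"\s+", " ", text).strip() # 5. collapse whitespace
```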

D. Raw Performance Table


D.1. English Transcription
D.1.1. Greedy Decoding

Datasets evaluated: LibriSpeech.test-clean, LibriSpeech.test-other, CommonVoice5.1, TED-LIUM3, VoxPopuli.en, Switchboard, AMI-SDM1, Fleurs.en_us, AMI-IHM, CallHome, CORAAL, CHiME6, Artie, WSJ
Whisper tiny.en 5.6 14.6 6.0 5.0 24.1 17.8 26.3 20.0 23.9 41.3 23.7 50.3 11.7 11.6
Whisper tiny 7.6 16.9 7.0 6.7 30.0 22.8 29.6 23.9 31.0 49.6 27.6 58.1 12.7 13.7
Whisper base.en 4.2 10.2 4.9 4.6 20.9 15.2 19.0 13.4 22.6 36.4 20.5 46.7 10.0 7.6
Whisper base 5.0 12.4 5.5 5.1 23.0 16.8 21.6 16.9 26.0 40.2 22.0 49.9 10.0 10.1
Whisper small.en 3.1 7.4 4.0 3.3 18.2 15.7 13.1 9.7 20.2 27.6 17.5 38.0 8.1 6.0
Whisper small 3.4 7.6 4.3 4.0 17.5 14.5 13.5 10.3 18.1 29.3 19.0 39.6 8.3 6.6
Whisper medium.en 3.1 6.3 4.1 3.3 16.2 14.1 10.6 7.6 17.5 25.3 16.4 37.2 7.4 5.0
Whisper medium 2.9 5.9 3.8 2.9 16.4 14.0 10.3 7.2 16.6 26.4 16.6 36.0 7.4 5.4
Whisper large 2.7 5.6 4.0 3.1 15.8 13.1 9.5 6.7 19.4 25.6 16.4 36.9 7.3 4.6
wav2vec2-base-100h 6.0 13.4 17.8 13.9 46.9 40.2 47.4 40.8 47.0 79.9 48.1 81.2 28.9 23.1
wav2vec2-base-960h 3.3 8.5 12.8 8.9 40.6 32.9 36.4 30.9 39.9 68.5 40.2 71.9 21.4 17.4
wav2vec2-large-960h-lv60-self 1.8 3.8 7.4 4.4 29.1 22.2 19.9 15.8 29.2 56.3 30.8 57.0 13.0 10.2
wav2vec2-large-960h 2.7 6.2 10.5 7.7 34.8 28.3 29.9 24.5 35.6 65.8 37.0 67.6 17.9 14.6
wav2vec2-large-robust-ft-libri-960h 2.6 5.3 9.2 6.1 23.4 19.8 20.3 16.2 29.4 58.1 31.7 61.6 15.1 11.8
asr-crdnn-rnnlm-librispeech 3.0 9.7 17.7 10.7 59.7 56.1 43.7 33.3 83.8 81.0 57.2 85.8 30.6 32.4
asr-transformer-transformerlm-librispeech 2.1 5.4 11.9 7.4 38.9 33.0 30.6 23.5 44.9 79.5 44.5 75.4 17.8 17.0
hubert-large-ls960-ft 2.0 4.1 8.4 5.4 29.6 22.8 20.8 16.0 32.0 60.0 33.7 59.1 14.4 10.9
hubert-xlarge-ls960-ft 1.9 3.5 8.3 5.4 29.3 22.2 19.8 14.8 31.5 58.5 33.3 58.9 14.2 10.5
s2t-large-librispeech-asr 3.3 8.1 14.9 9.4 54.5 40.3 38.1 30.7 50.2 79.2 53.4 79.5 21.6 18.0
s2t-medium-librispeech-asr 3.6 8.2 15.7 9.7 58.1 42.4 39.3 31.3 52.6 79.8 60.3 85.3 22.9 19.7
stt en conformer ctc large 2.1 4.2 4.4 2.1 11.3 8.2 7.4 4.0 13.5 30.5 15.9 39.9 6.7 8.2
stt en conformer transducer xlarge 1.5 2.8 4.3 1.2 12.0 7.4 4.3 1.5 19.9 36.8 20.5 48.6 6.0 6.3
unispeech-sat-base-100h-libri-ft 5.7 13.8 17.7 13.6 46.5 40.0 45.3 38.6 44.7 74.8 47.8 77.7 29.8 22.4

Table 8. English transcription WER (%) with greedy decoding

D.1.2. Beam Search with Temperature Fallback


Datasets evaluated: LibriSpeech.test-clean, LibriSpeech.test-other, CommonVoice5.1, TED-LIUM3, VoxPopuli.en, Switchboard, AMI-SDM1, Fleurs.en_us, AMI-IHM, CallHome, CORAAL, CHiME6, Artie, WSJ
Whisper tiny.en 5.4 12.8 5.4 4.6 21.4 16.0 23.5 18.4 21.4 42.0 22.7 54.2 10.9 10.0
Whisper tiny 6.7 15.0 6.3 5.9 24.8 18.3 26.1 20.8 25.1 48.0 25.6 57.3 11.6 12.4
Whisper base.en 4.1 9.6 4.6 4.0 18.3 14.2 17.5 13.2 18.5 35.2 21.1 49.0 9.3 7.1
Whisper base 4.9 11.0 5.0 4.4 20.5 15.6 19.4 15.3 20.5 40.0 21.5 50.0 9.5 8.9
Whisper small.en 3.2 6.7 4.3 3.0 17.2 13.4 12.6 9.2 17.5 29.5 17.9 42.5 8.1 5.3
Whisper small 3.3 7.2 4.3 3.9 17.1 13.3 12.8 9.3 16.4 30.9 19.2 43.5 8.2 6.1
Whisper medium.en 3.0 5.7 4.3 2.8 14.7 12.4 10.3 7.4 15.3 27.0 17.1 39.4 7.8 4.5
Whisper medium 2.7 5.6 4.0 2.7 15.3 13.2 9.7 6.7 14.9 27.6 17.6 43.0 7.6 4.4
Whisper large 2.8 5.7 4.3 3.5 16.2 14.2 8.9 6.4 15.1 25.2 17.6 37.1 7.2 4.5

Table 9. English transcription WER (%) with beam search and temperature fallback

D.2. Multilingual Transcription


D.2.1. Multilingual LibriSpeech

Languages evaluated: Portuguese, German, Spanish, English, French, Italian, Polish, Dutch
Whisper tiny 39.4 15.7 36.8 24.9 41.7 34.2 31.3 19.2
Whisper base 28.4 11.7 26.6 17.7 31.1 22.8 21.9 12.8
Whisper small 17.2 8.3 16.2 10.5 21.4 11.2 13.0 7.8
Whisper medium 11.7 6.8 8.9 7.4 16.0 6.5 9.0 5.3
Whisper large 10.2 6.3 8.9 6.6 14.3 6.6 9.2 5.4

Table 10. WER (%) on MLS

D.2.2. Common Voice 9


Languages evaluated: Bulgarian, Estonian, German, Spanish, Catalan, Bengali, English, Persian, Danish, Arabic, Czech, Welsh, Greek
Whisper tiny 90.9 79.3 104.1 51.0 79.7 101.8 77.2 34.5 61.9 28.8 30.3 102.1 120.3
Whisper base 84.4 68.1 103.7 39.9 63.1 93.8 57.5 24.5 51.5 21.9 19.6 88.1 99.0
Whisper small 66.4 44.8 118.6 23.8 34.1 65.4 32.1 13.0 31.7 14.5 10.3 67.2 71.9
Whisper medium 60.3 26.7 124.7 16.4 18.8 43.6 19.3 8.5 20.0 11.2 6.9 45.6 49.9
Whisper large 56.0 24.1 106.0 15.3 17.1 40.3 18.3 7.7 18.3 10.1 6.4 41.4 44.8

Languages evaluated: Malayalam, Indonesian, Lithuanian, Mongolian, Hungarian, Japanese, Latvian, Finnish, French, Italian, Polish, Dutch, Hindi
Whisper tiny 68.5 49.7 108.3 87.0 49.6 44.5 36.1 103.5 87.8 102.7 123.0 43.6 45.3
Whisper base 52.9 37.3 106.5 71.9 36.1 30.5 24.2 91.3 78.0 122.9 137.0 29.5 32.8
Whisper small 30.5 22.7 43.6 44.4 18.4 16.0 14.0 72.8 54.6 104.8 225.8 14.2 16.9
Whisper medium 18.8 16.0 31.5 26.9 11.6 9.4 10.5 49.4 37.2 137.8 113.4 8.0 10.1
Whisper large 17.0 14.7 25.0 23.5 10.6 8.1 9.4 43.9 34.8 107.1 117.4 7.1 9.0

Languages evaluated: Vietnamese, Portuguese, Romanian, Slovenian, Swedish, Chinese, Russian, Turkish, Serbian, Slovak, Tamil, Urdu, Thai
Whisper tiny 35.2 68.2 40.6 104.0 82.0 106.1 58.2 105.7 55.9 53.6 74.7 69.3 52.4
Whisper base 23.7 55.9 28.8 87.2 70.3 103.0 42.4 49.5 32.1 38.6 58.6 51.6 44.9
Whisper small 12.5 33.2 15.0 60.4 45.5 101.3 22.1 28.7 18.1 23.7 39.1 33.3 29.4
Whisper medium 8.1 21.5 9.3 42.0 29.8 85.6 13.7 19.6 10.5 17.7 29.9 24.4 23.2
Whisper large 7.1 19.8 8.2 37.9 25.1 87.4 12.4 17.6 8.8 16.6 28.1 19.9 29.1

Table 11. WER (%) on Common Voice 9

D.2.3. VoxPopuli


Languages evaluated: en accented, Lithuanian, Hungarian, Romanian, Slovenian, Estonian, Croatian, German, Spanish, Finnish, English, French, Slovak, Italian, Polish, Czech, Dutch
Whisper tiny 73.5 27.4 11.6 18.8 19.7 99.2 54.1 32.9 72.4 74.5 40.5 93.1 41.9 31.4 65.9 78.7 81.9
Whisper base 54.7 20.6 9.5 17.5 14.4 83.0 39.7 24.9 53.6 52.6 30.8 82.1 29.4 22.1 49.3 63.7 70.5
Whisper small 28.8 14.8 8.2 19.2 11.1 59.2 24.9 15.7 33.7 31.3 22.9 60.1 18.8 13.3 28.6 37.3 50.8
Whisper medium 18.4 12.4 7.6 19.1 9.6 38.2 16.6 12.2 23.9 19.3 19.7 39.3 14.9 10.1 18.4 23.0 36.3
Whisper large 15.9 11.9 7.2 20.8 8.8 33.3 15.5 11.0 19.0 16.8 18.4 35.0 14.0 9.0 17.0 19.1 31.3

Table 12. WER (%) on VoxPopuli



D.2.4. Fleurs

Languages evaluated: Azerbaijani, Belarusian, Assamese, Afrikaans, Bulgarian, Amharic, Bosnian, Chinese, Catalan, Bengali, Danish, Arabic, Czech, Welsh
Whisper tiny 91.2 122.9 63.4 102.0 93.1 94.0 81.0 101.6 82.1 42.8 40.5 82.8 101.3 82.0
Whisper base 81.5 196.8 48.8 102.0 76.4 91.3 65.1 100.6 66.7 29.0 34.1 66.0 85.3 57.6
Whisper small 61.1 120.2 30.6 108.0 49.1 75.1 37.3 104.4 39.4 16.2 20.8 37.6 59.3 32.8
Whisper medium 44.9 229.3 20.4 102.3 33.1 60.4 21.4 100.6 23.9 9.6 12.1 21.3 40.8 19.5
Whisper large 42.6 129.3 18.1 105.6 28.7 56.6 18.4 104.9 20.7 8.0 19.6 17.4 36.6 16.8

Languages evaluated: Estonian, Galician, German, Gujarati, Hebrew, Tagalog, Spanish, Finnish, English, Persian, French, Hausa, Greek, Hindi
Whisper tiny 27.8 67.4 12.4 15.9 94.8 101.8 59.5 65.6 41.4 54.8 101.2 100.2 71.6 102.3
Whisper base 17.9 53.5 8.9 9.9 77.9 86.1 43.1 45.8 28.5 47.4 101.4 98.6 61.7 101.1
Whisper small 10.2 30.8 6.1 5.6 51.3 55.8 24.0 27.7 15.0 30.2 106.4 90.1 44.4 38.4
Whisper medium 6.5 19.0 4.4 3.6 29.8 41.0 13.9 19.1 8.7 21.2 104.8 106.6 33.1 26.8
Whisper large 5.5 18.7 4.5 3.5 25.5 36.1 12.2 15.8 7.7 19.0 103.9 87.0 30.2 26.9

Languages evaluated: Luxembourgish, Indonesian, Hungarian, Armenian, Icelandic, Georgian, Kannada, Javanese, Japanese, Croatian, Kazakh, Korean, Khmer, Italian
Whisper tiny 79.0 83.8 118.6 51.7 113.3 29.8 37.0 107.3 123.0 165.2 100.6 100.7 36.1 99.1
Whisper base 59.1 65.0 126.3 33.1 95.5 17.9 22.8 89.5 114.7 109.2 101.6 107.2 27.8 100.7
Whisper small 33.4 38.9 86.6 16.3 72.6 9.8 12.0 88.6 118.3 70.3 104.4 100.4 19.6 100.1
Whisper medium 19.3 24.3 60.1 10.2 49.9 5.2 7.1 67.9 117.3 48.8 98.9 77.7 16.4 90.0
Whisper large 16.7 21.0 53.7 8.5 43.0 4.2 6.4 87.0 100.5 43.8 96.0 69.8 15.2 86.5
Languages evaluated: Macedonian, Malayalam, Lithuanian, Norwegian, Mongolian, Myanmar, Marathi, Maltese, Latvian, Lingala, Nepali, Malay, Maori, Lao
Whisper tiny 105.4 115.1 98.5 91.6 94.5 73.3 101.5 113.7 100.3 51.2 100.8 124.8 62.0 101.8
Whisper base 96.7 105.1 87.3 79.8 77.5 59.9 107.4 125.7 100.3 35.1 97.6 122.6 44.0 102.4
Whisper small 91.3 102.2 65.6 53.2 59.5 36.9 100.9 144.2 60.2 18.9 92.2 110.1 24.2 69.5
Whisper medium 83.2 101.4 41.1 32.0 77.8 22.0 101.1 103.7 63.2 12.2 83.2 123.0 12.9 54.4
Whisper large 76.8 101.6 35.2 28.3 45.7 20.6 101.4 106.2 43.7 10.2 80.5 124.5 11.4 52.2
Languages evaluated: Portuguese, Romanian, Slovenian, Russian, Occitan, Serbian, Punjabi, Somali, Slovak, Pashto, Sindhi, Polish, Shona, Dutch
Whisper tiny 49.0 95.9 102.6 45.6 105.6 20.1 74.7 31.1 105.8 77.2 87.2 128.1 105.6 83.7
Whisper base 33.0 82.9 101.5 30.8 99.0 13.0 56.0 20.5 103.9 60.6 74.6 126.0 109.6 64.3
Whisper small 16.4 87.3 103.6 14.7 92.9 7.3 29.8 11.4 131.7 33.3 49.3 140.0 105.3 42.2
Whisper medium 9.9 79.5 102.0 8.0 119.4 5.0 20.0 7.2 147.0 17.3 31.9 143.9 104.0 44.9
Whisper large 8.3 75.9 102.8 7.2 92.7 4.8 15.4 6.4 177.9 15.7 27.8 130.0 103.5 29.2
Languages evaluated: Vietnamese, Ukrainian, Swedish, Turkish, Swahili, Yoruba, Telugu, Uzbek, Tamil, Urdu, Tajik, Thai
Whisper tiny 52.7 100.9 99.9 105.1 101.7 58.8 42.5 51.2 65.2 105.2 60.0 106.4
Whisper base 37.4 92.5 58.7 105.2 109.3 38.2 27.5 37.7 52.0 114.0 40.5 101.8
Whisper small 20.8 73.7 35.2 98.2 84.3 21.9 15.9 19.3 37.3 107.7 21.2 116.4
Whisper medium 11.2 52.8 23.1 82.8 74.0 15.4 10.4 11.6 28.2 109.6 12.7 105.1
Whisper large 10.5 47.9 20.6 100.6 74.5 13.2 9.4 10.3 25.0 93.3 10.7 111.7

Table 13. WER (%) on Fleurs



D.3. Speech Translation


D.3.1. Fleurs

Languages evaluated: Azerbaijani, Belarusian, Assamese, Afrikaans, Bulgarian, Amharic, Bosnian, Chinese, Catalan, Bengali, Danish, Arabic, Czech, Welsh
Whisper tiny 1.6 0.1 0.1 0.4 0.1 0.8 0.4 0.4 0.4 5.2 0.6 0.6 0.6 0.7
Whisper base 4.4 0.3 1.0 0.4 0.8 3.3 2.7 0.7 4.1 13.1 1.9 2.7 0.7 5.0
Whisper small 18.1 0.2 10.6 1.2 5.8 7.1 14.8 2.7 16.8 25.1 9.3 14.2 1.3 18.1
Whisper medium 29.5 0.9 19.9 3.5 11.7 9.8 23.9 10.6 26.0 31.9 15.1 23.6 8.4 28.6
Whisper large 31.6 1.1 23.8 3.9 13.1 11.0 26.2 12.0 28.0 33.7 16.8 25.6 11.2 31.6

Languages evaluated: Estonian, Galician, German, Gujarati, Hebrew, Tagalog, Spanish, Finnish, English, Persian, French, Hausa, Greek, Hindi
Whisper tiny 5.2 0.1 68.6 7.7 0.1 0.1 0.2 0.8 4.7 4.0 0.7 0.1 0.2 1.0
Whisper base 13.7 0.7 73.3 12.4 0.3 0.2 0.5 2.1 13.1 10.5 1.5 0.0 0.6 3.4
Whisper small 25.9 11.6 77.3 18.2 3.6 5.8 7.3 12.0 23.5 17.5 3.9 0.3 5.4 11.1
Whisper medium 31.4 19.9 79.2 21.4 13.5 15.0 18.5 20.5 28.6 24.7 12.8 0.5 15.9 19.4
Whisper large 34.3 21.7 77.8 22.8 15.9 17.6 20.6 22.7 31.6 26.0 14.8 0.5 19.6 20.7

Languages evaluated: Luxembourgish, Indonesian, Hungarian, Armenian, Icelandic, Georgian, Kannada, Javanese, Japanese, Croatian, Kazakh, Korean, Khmer, Italian
Whisper tiny 0.6 0.1 0.1 0.3 0.4 5.3 0.2 0.2 0.1 0.1 0.1 0.8 0.5 0.8
Whisper base 3.7 0.2 0.1 2.6 0.4 11.3 1.5 0.2 0.2 0.2 0.1 0.9 3.7 1.7
Whisper small 14.6 4.8 0.7 16.4 1.8 17.8 9.6 1.4 0.2 0.8 0.5 2.3 12.2 5.7
Whisper medium 23.0 15.5 10.4 24.1 6.8 21.6 14.9 5.0 1.3 4.3 3.3 8.5 19.2 13.6
Whisper large 25.4 18.3 13.2 27.2 6.6 23.5 17.0 5.1 2.7 6.3 5.2 9.9 20.0 15.4
Languages evaluated: Macedonian, Malayalam, Lithuanian, Norwegian, Mongolian, Myanmar, Marathi, Maltese, Latvian, Lingala, Nepali, Malay, Maori, Lao
Whisper tiny 0.1 0.2 0.1 0.2 0.3 1.0 0.8 0.1 0.2 0.3 0.6 0.1 1.4 0.1
Whisper base 0.1 0.3 0.3 0.4 1.0 5.4 1.4 0.1 0.9 2.1 1.4 0.1 8.4 0.3
Whisper small 0.5 2.0 1.9 1.5 3.9 15.3 5.7 0.1 3.8 14.1 4.9 0.0 22.0 2.9
Whisper medium 0.9 8.1 9.6 10.0 8.5 23.5 13.8 0.5 10.9 23.2 11.2 0.2 29.1 12.7
Whisper large 1.2 9.3 12.0 12.5 9.4 26.4 16.5 1.0 13.1 25.5 12.8 0.5 30.5 12.9
Languages evaluated: Portuguese, Romanian, Slovenian, Russian, Occitan, Serbian, Punjabi, Somali, Slovak, Pashto, Sindhi, Polish, Shona, Dutch
Whisper tiny 2.7 1.7 0.3 0.8 0.3 12.1 1.0 3.1 0.5 0.7 0.3 0.1 0.0 0.6
Whisper base 7.5 4.2 1.1 5.1 0.4 22.4 4.9 12.1 0.7 4.6 1.3 0.3 0.1 5.4
Whisper small 15.9 9.5 4.4 14.0 0.8 31.2 18.3 19.7 2.0 14.4 6.9 0.6 0.1 19.3
Whisper medium 21.6 15.9 12.8 19.0 2.1 35.9 26.6 24.8 5.5 22.7 14.0 1.4 0.4 27.7
Whisper large 22.8 16.8 14.6 21.4 3.7 37.4 29.1 26.7 5.9 25.1 16.9 1.8 0.5 30.5
Languages evaluated: Vietnamese, Ukrainian, Swedish, Turkish, Swahili, Yoruba, Telugu, Uzbek, Tamil, Urdu, Tajik, Thai
Whisper tiny 1.8 0.1 0.2 0.3 0.2 0.2 0.2 1.2 0.4 0.0 0.1 0.2
Whisper base 9.1 0.1 0.4 0.4 0.2 0.7 2.4 6.9 1.5 0.2 0.9 0.5
Whisper small 22.9 0.1 2.1 4.0 4.4 5.8 15.7 18.7 8.8 0.5 8.5 0.5
Whisper medium 32.1 3.1 7.0 10.8 11.4 12.8 22.9 25.8 14.9 3.8 16.6 0.9
Whisper large 33.1 5.3 8.5 10.9 13.0 15.2 25.7 28.0 16.3 5.8 19.5 1.2

Table 14. BLEU scores on Fleurs



D.3.2. CoVoST 2

Languages evaluated: Indonesian, Mongolian, Estonian, Japanese, German, Spanish, Catalan, Latvian, Persian, French, Arabic, Italian, Welsh
Whisper tiny 0.2 4.9 0.4 4.0 10.5 0.2 0.1 6.1 0.3 5.1 0.3 0.1 0.1
Whisper base 1.2 11.0 0.5 11.7 21.3 0.3 0.1 15.4 4.9 13.0 4.9 0.5 0.1
Whisper small 17.7 22.3 1.0 25.3 33.0 2.4 4.9 27.3 27.6 24.0 17.3 1.4 0.2
Whisper medium 30.6 29.2 12.1 33.2 38.4 11.4 15.5 33.6 42.3 29.5 24.6 9.7 0.2
Whisper large 35.5 30.3 16.1 34.3 38.0 13.4 17.5 34.4 45.4 29.1 24.2 10.5 0.3

Languages evaluated: Portuguese, Slovenian, Swedish, Chinese, Russian, Turkish, Dutch, Tamil
Whisper tiny 4.3 9.5 5.7 0.4 2.0 0.1 0.2 0.4
Whisper base 12.4 23.2 16.1 1.4 10.5 0.4 2.8 1.4
Whisper small 28.1 40.6 30.9 9.2 29.9 1.7 16.8 6.8
Whisper medium 38.1 48.7 39.4 17.7 39.5 2.9 27.0 14.0
Whisper large 39.3 48.6 41.6 23.9 40.3 3.7 26.7 17.1

Table 15. BLEU scores on CoVoST2

D.4. Long-form Transcription


Model                                  TED-LIUM3  Meanwhile  Kincaid46  Rev16  Earnings-21  Earnings-22  CORAAL
Whisper tiny.en 5.5 12.8 13.8 15.1 17.0 22.0 30.3
Whisper tiny 6.8 15.5 16.7 17.0 18.7 24.4 33.1
Whisper base.en 4.6 9.4 11.2 13.2 12.5 16.6 25.2
Whisper base 4.8 12.2 12.2 14.5 13.5 18.4 26.9
Whisper small.en 4.6 6.0 9.4 12.0 10.8 14.0 21.9
Whisper small 4.2 6.9 10.1 12.1 11.1 14.3 22.3
Whisper medium.en 3.6 5.2 8.9 11.9 10.2 13.3 20.6
Whisper medium 3.8 5.4 8.6 11.4 10.3 13.2 20.3
Whisper large 3.8 5.3 8.8 11.0 10.3 13.4 20.4
wav2vec2-base-100h 17.6 27.7 39.3 35.2 45.7 57.1 55.4
wav2vec2-base-960h 12.8 19.7 32.9 29.8 37.3 46.8 49.1
wav2vec2-large-960h-lv60-self 7.2 11.4 21.1 21.3 21.7 28.0 36.7
wav2vec2-large-960h 10.1 16.4 27.4 26.4 30.4 40.1 43.5
wav2vec2-large-robust-ft-libri-960h 8.8 15.2 22.9 23.4 23.0 31.0 36.8
hubert-large-ls960-ft 8.1 12.9 22.4 23.4 23.0 30.6 37.9
hubert-xlarge-ls960-ft 8.1 12.5 22.9 23.2 23.1 31.3 38.1
stt en conformer ctc large 4.0 9.8 13.1 14.5 12.6 17.6 25.1
stt en conformer transducer xlarge 5.3 10.6 17.1 19.8 16.2 19.7 38.9

Table 16. Long-form English transcription WER (%)



E. Training Dataset Statistics

[Figure 11 consists of two bar charts of hours of audio per language in the training dataset. English speech recognition accounts for 438,218 hours (65%) of the data, multilingual speech recognition for 117,113 hours (17%), and translation for 125,739 hours (18%). The left panel lists the hours of multilingual speech recognition audio per language (from roughly 23,000 hours for Chinese down to fractions of an hour for the rarest languages), and the right panel lists the hours of translation audio per language (from roughly 20,000 hours for Korean downward). Both axes show hours of audio on a logarithmic scale.]

Figure 11. Training dataset statistics



F. Hyperparameters

Hyperparameter Value
Updates 1048576
Batch Size 256
Warmup Updates 2048
Max grad norm 1.0
Optimizer AdamW
β1 0.9
β2 0.98
ϵ 10^−6
Weight Decay 0.1
Weight Init Gaussian Fan-In
Learning Rate Schedule Linear Decay
Speechless audio subsample factor 10×
Condition on prior text rate 50%

Table 17. Whisper training hyperparameters.

Model Max Learning Rate


Tiny 1.5 × 10^−3
Base 1 × 10^−3
Small 5 × 10^−4
Medium 2.5 × 10^−4
Large 1.75 × 10^−4

Table 18. Whisper model learning rates.
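As an illustration, the optimizer and schedule in Tables 17 and 18 could be configured in PyTorch roughly as follows. This is a sketch under stated assumptions (a placeholder model, the Base model's peak learning rate, and linear warmup, which Table 17 does not spell out), not the actual training code.

```python
import torch

model = torch.nn.Linear(128, 128)  # placeholder standing in for the real model
max_lr, warmup_updates, total_updates = 1e-3, 2048, 1048576  # Base model peak LR

optimizer = torch.optim.AdamW(
    model.parameters(), lr=max_lr, betas=(0.9, 0.98), eps=1e-6, weight_decay=0.1
)

def lr_scale(step: int) -> float:
    # Assumed linear warmup for the first 2048 updates, then linear decay to zero.
    if step < warmup_updates:
        return step / max(warmup_updates, 1)
    return max(0.0, (total_updates - step) / (total_updates - warmup_updates))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)

# In the training loop, gradients are clipped to a max norm of 1.0 before stepping:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```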
