Whisper Openai
Whisper Openai
Whisper Openai
Alec Radford * 1 Jong Wook Kim * 1 Tao Xu 1 Greg Brockman 1 Christine McLeavey 1 Ilya Sutskever 1
pipelines to scale weakly supervised speech recognition the speech recognition pipeline since it removes the need
to 10,000 and 30,000 hours of noisier training data. This for a separate inverse text normalization step in order to
trade-off between quality and quantity is often the right produce naturalistic transcriptions.
call. Although understudied so far for speech recognition,
We construct the dataset from audio that is paired with tran-
recent work in computer vision has demonstrated that mov-
scripts on the Internet. This results in a very diverse dataset
ing beyond gold-standard crowdsourced datasets such as
covering a broad distribution of audio from many different
ImageNet (Russakovsky et al., 2015) to much larger but
environments, recording setups, speakers, and languages.
weakly supervised datasets significantly improves the ro-
While diversity in audio quality can help train a model to be
bustness and generalization of models (Mahajan et al., 2018;
robust, diversity in transcript quality is not similarly bene-
Kolesnikov et al., 2020).
ficial. Initial inspection showed a large amount of subpar
Yet these new datasets are only a few times larger than the transcripts in the raw dataset. To address this, we developed
sum of existing high-quality datasets and still much smaller several automated filtering methods to improve transcript
than prior unsupervised work. In this work we close that quality.
gap, scaling weakly supervised speech recognition the next
Many transcripts on the internet are not actually human-
order of magnitude to 680,000 hours of labeled audio data.
generated but the output of existing ASR systems. Recent
We call our approach Whisper2 . We demonstrate models
research has shown that training on datasets of mixed human
trained at this scale transfer well to existing datasets zero-
and machine-generated data can significantly impair the per-
shot, removing the need for any dataset-specific fine-tuning
formance of translation systems (Ghorbani et al., 2021). In
to achieve high-quality results.
order to avoid learning “transcript-ese”, we developed many
In addition to scale, our work also focuses on broaden- heuristics to detect and remove machine-generated tran-
ing the scope of weakly supervised pre-training beyond scripts from the training dataset. Many existing ASR sys-
English-only speech recognition to be both multilingual and tems output only a limited subset of written language which
multitask. Of those 680,000 hours of audio, 117,000 hours removes or normalizes away aspects that are difficult to pre-
cover 96 other languages. The dataset also includes 125,000 dict from only audio signals such as complex punctuation
hours of X→en translation data. We find that for sufficiently (exclamation points, commas, and question marks), format-
large models there is no drawback and even benefits to joint ting whitespace such as paragraphs, or stylistic aspects such
multilingual and multitask training. as capitalization. An all-uppercase or all-lowercase tran-
script is very unlikely to be human generated. While many
Our work suggests that simple scaling of weakly supervised
ASR systems include some level of inverse text normaliza-
pre-training has been underappreciated so far for speech
tion, it is often simple or rule-based and still detectable from
recognition. We achieve these results without the need for
other unhandled aspects such as never including commas.
the self-supervision or self-training techniques that have
been a mainstay of recent large-scale speech recognition We also use an audio language detector, which was created
work. To serve as a foundation for further research on robust by fine-tuning a prototype model trained on a prototype ver-
speech recognition, we release inference code and models at sion of the dataset on VoxLingua107 (Valk & Alumäe, 2021)
the following URL: https://github.com/openai/ to ensure that the spoken language matches the language of
whisper. the transcript according to CLD2. If the two do not match,
we don’t include the (audio, transcript) pair as a speech
2. Approach recognition training example in the dataset. We make an
exception if the transcript language is English and add these
2.1. Data Processing pairs to the dataset as X→en speech translation training
examples instead. We use fuzzy de-duping of transcript
Following the trend of recent work leveraging web-scale
texts to reduce the amount of duplication and automatically
text from the internet for training machine learning systems,
generated content in the training dataset.
we take a minimalist approach to data pre-processing. In
contrast to a lot of work on speech recognition, we train We break audio files into 30-second segments paired with
Whisper models to predict the raw text of transcripts without the subset of the transcript that occurs within that time
any significant standardization, relying on the expressive- segment. We train on all audio, including segments where
ness of sequence-to-sequence models to learn to map be- there is no speech (though with sub-sampled probability)
tween utterances and their transcribed form. This simplifies and use these segments as training data for voice activity
2
detection.
If an acronym or basis for the name is desired, WSPSR stand-
ing for Web-scale Supervised Pretraining for Speech Recognition For an additional filtering pass, after training an initial model
can be used. we aggregated information about its error rate on training
Robust Speech Recognition via Large-Scale Weak Supervision 3
data sources and performed manual inspection of these data ization. These components are often handled separately,
sources sorting by a combination of both high error rate and resulting in a relatively complex system around the core
data source size in order to identify and remove low-quality speech recognition model. To reduce this complexity, we
ones efficiently. This inspection showed a large amount of would like to have a single model perform the entire speech
only partially transcribed or poorly aligned/misaligned tran- processing pipeline, not just the core recognition part. An
scripts as well as remaining low-quality machine-generated important consideration here is the interface for the model.
captions that filtering heuristics did not detect. There are many different tasks that can be performed on
the same input audio signal: transcription, translation, voice
To avoid contamination, we perform de-duplication at a tran-
activity detection, alignment, and language identification
script level between the training dataset and the evaluation
are some examples.
datasets we thought were at higher risk of overlap, namely
TED-LIUM 3 (Hernandez et al., 2018). For this kind of one-to-many mapping to work with a single
model, some form of task specification is necessary. We use
2.2. Model a simple format to specify all tasks and conditioning infor-
mation as a sequence of input tokens to the decoder. Since
Since the focus of our work is on studying the capabilities our decoder is an audio-conditional language model, we also
of large-scale supervised pre-training for speech recogni- train it to condition on the history of text of the transcript in
tion, we use an off-the-shelf architecture to avoid confound- the hope that it will learn to use longer-range text context
ing our findings with model improvements. We chose an to resolve ambiguous audio. Specifically, with some proba-
encoder-decoder Transformer (Vaswani et al., 2017) as this bility we add the transcript text preceding the current audio
architecture has been well validated to scale reliably. All segment to the decoder’s context. We indicate the beginning
audio is re-sampled to 16,000 Hz, and an 80-channel log- of prediction with a <|startoftranscript|> token.
magnitude Mel spectrogram representation is computed on First, we predict the language being spoken which is repre-
25-millisecond windows with a stride of 10 milliseconds. sented by a unique token for each language in our training
For feature normalization, we globally scale the input to set (99 total). These language targets are sourced from the
be between -1 and 1 with approximately zero mean across aforementioned VoxLingua107 model. In the case where
the pre-training dataset. The encoder processes this input there is no speech in an audio segment, the model is trained
representation with a small stem consisting of two convolu- to predict a <|nospeech|> token indicating this. The
tion layers with a filter width of 3 and the GELU activation next token specifies the task (either transcription or trans-
function (Hendrycks & Gimpel, 2016) where the second lation) with an <|transcribe|> or <|translate|>
convolution layer has a stride of two. Sinusoidal position token. After this, we specify whether to predict timestamps
embeddings are then added to the output of the stem after or not by including a <|notimestamps|> token for that
which the encoder Transformer blocks are applied. The case. At this point, the task and desired format is fully
transformer uses pre-activation residual blocks (Child et al., specified, and the output begins. For timestamp predic-
2019), and a final layer normalization is applied to the en- tion, we predict time relative to the current audio segment,
coder output. The decoder uses learned position embeddings quantizing all times to the nearest 20 milliseconds which
and tied input-output token representations (Press & Wolf, matches the native time resolution of Whisper models, and
2017). The encoder and decoder have the same width and add additional tokens to our vocabulary for each of these.
number of transformer blocks. Figure 1 summarizes the We interleave their prediction with the caption tokens: the
model architecture. start time token is predicted before each caption’s text, and
We use the same byte-level BPE text tokenizer used in GPT- the end time token is predicted after. When a final tran-
2 (Sennrich et al., 2015; Radford et al., 2019) for the English- script segment is only partially included in the current 30-
only models and refit the vocabulary (but keep the same size) second audio chunk, we predict only its start time token
for the multilingual models to avoid excessive fragmenta- for the segment when in timestamp mode, to indicate that
tion on other languages since the GPT-2 BPE vocabulary is the subsequent decoding should be performed on an au-
English only. dio window aligned with that time, otherwise we truncate
the audio to not include the segment. Lastly, we add a
2.3. Multitask Format <|endoftranscript|> token. We only mask out the
training loss over the previous context text, and train the
Although predicting which words were spoken in a given model to predict all other tokens. Please see Figure 1 for an
audio snippet is a core part of the full speech recognition overview of our format and training setup.
problem and extensively studied in research, it is not the
only part. A fully featured speech recognition system can
involve many additional components such as voice activ-
ity detection, speaker diarization, and inverse text normal-
Robust Speech Recognition via Large-Scale Weak Supervision 4
Sequence-to-sequence learning
Multitask training data (680k hours) EN
TRANS-
📝 Ask not what your country can do for ⋯ MLP cross attention
cross attention
⋮ ⋮ ⋮
🗣️ “El rápido zorro marrón salta sobre ⋯”
Transformer
Encoder Blocks
📝 The quick brown fox jumps over ⋯ self attention cross attention
Decoder Blocks
self attention
Non-English transcription MLP
~
cross attention
Positional
self attention
Encoding
No speech Learned
📝 ∅
TRANS-
Time-aligned transcription
identification Transcription
LANGUAGE
begin
end
begin
end
TAG
TRANSCRIBE
time
text tokens
time ⋯ time
text tokens
time
previous
START OF
PREV EOT
text tokens TRANSCRIPT
NO
NO
detection
Translation
special text
timestamp (allows dataset-specific fine-tuning)
(VAD)
tokens tokens tokens
Figure 1. Overview of our approach. A sequence-to-sequence Transformer model is trained on many different speech processing tasks,
including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these
tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different
stages of a traditional speech processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or
classification targets, as further explained in Section 2.3.
Model Layers Width Heads Parameters in Section 4.4. We are releasing the code for our text nor-
Tiny 4 384 6 39M malizer to allow for easy comparison and to help others
Base 6 512 8 74M study the performance of speech recognition systems in
Small 12 768 12 244M out-of-distribution settings.
Medium 24 1024 16 769M
Large 32 1280 20 1550M
3.3. English Speech Recognition
Table 1. Architecture details of the Whisper model family. In 2015, Deep Speech 2 (Amodei et al., 2015) reported
a speech recognition system matched human-level perfor-
mance when transcribing the LibriSpeech test-clean split.
3. Experiments As part of their analysis they concluded: “Given this result,
we suspect that there is little room for a generic speech sys-
3.1. Zero-shot Evaluation tem to further improve on clean read speech without further
domain adaptation.” Yet seven years later the SOTA WER
The goal of Whisper is to develop a single robust speech
on LibriSpeech test-clean has dropped another 73% from
processing system that works reliably without the need for
their 5.3% to 1.4% (Zhang et al., 2021), far below their re-
dataset specific fine-tuning to achieve high-quality results
ported human-level error rate of 5.8%. Despite this massive
on specific distributions. To study this capability, we re-
and unanticipated further improvement in performance on
use a wide set of existing speech processing datasets to
held-out but in-distribution data, speech recognition mod-
check whether Whisper is able to generalize well across
els trained on LibriSpeech remain far above human error
domains, tasks, and languages. Instead of using the standard
rates when used in other settings. What explains this gap
evaluation protocol for these datasets, which include both
between reportedly superhuman performance in-distribution
a train and test split, we evaluate Whisper in a zero-shot
and subhuman performance out-of-distribution?
setting without using any of the training data for each of
these datasets so that we are measuring broad generalization. We suspect a large part of this gap between human and
machine behavior is due to conflating different capabilities
3.2. Evaluation Metrics being measured by human and machine performance on
a test set. This claim may seem confusing at first; if both
Speech recognition research typically evaluates and com- humans and machines are taking the same test, how can it be
pares systems based on the word error rate (WER) metric. that different skills are being tested? The difference arises
However, WER, which is based on string edit distance, pe- not in the testing but in how they trained for it. Humans are
nalizes all differences between the model’s output and the often asked to perform a task given little to no supervision
reference transcript including innocuous differences in tran- on the specific data distribution being studied. Thus human
script style. As a result, systems that output transcripts that performance is a measure of out-of-distribution generaliza-
would be judged as correct by humans can still have a large tion. But machine learning models are usually evaluated
WER due to minor formatting differences. While this poses after training on a large amount of supervision from the
a problem for all transcribers, it is particularly acute for evaluation distribution, meaning that machine performance
zero-shot models like Whisper, which do not observe any is instead a measure of in-distribution generalization. While
examples of specific datasets transcript formats. both humans and machines are being evaluated on the same
This is not a novel observation; the development of evalua- test data, two quite different abilities are being measured
tion metrics that better correlate with human judgement is an due to a difference in train data.
active area of research, and while there are some promising Whisper models, which are trained on a broad and diverse
methods, none have seen widespread adoption for speech distribution of audio and evaluated in a zero-shot setting,
recognition yet. We opt to address this problem with ex- could potentially match human behavior much better than
tensive standardization of text before the WER calculation existing systems. To study whether this is the case (or
to minimize penalization of non-semantic differences. Our whether the difference between machine and human per-
text normalizer was developed through iterative manual in- formance is due to yet-to-be-understood factors) we can
spection to identify common patterns where naive WER compare Whisper models with both human performance
penalized Whisper models for an innocuous difference. Ap- and standard fine-tuned machine learning models and check
pendix C includes full details. For several datasets, we which they more closely match.
observe WER drops of up to 50 percent usually due to a
quirk such as a dataset’s reference transcripts seperating To quantify this difference, we examine both overall ro-
contractions from words with whitespace. We caution this bustness, that is average performance across many distribu-
development procedure comes at a risk of overfitting to the tions/datasets, and effective robustness, introduced by Taori
transcription style of Whisper models which we investigate et al. (2020), which measures the difference in expected
Robust Speech Recognition via Large-Scale Weak Supervision 6
50
Dataset wav2vec 2.0 Whisper RER
Average WER on [Common Voice, CHiME-6, TED-LIUM] (%)
160 MY
40
PT
LO GU KA KM TE
UZ ML PA BN
MT 35 CA SV
80 TG
KN DE FR
NE BE SW HY
MR IS 30 DA
SR NB AF
AF KK RU
RO
40 FA CY SK HR UK
Word Error Rate (WER)
SR LT BSGL CS ID
HI SL AZ HE 25 MK
BG
IT
ET LV UR MSTR ES
FIL EL NL
PL
MK TA HU AR
BLEU
BS
20 GL
BG
SK
HR CS DA ELAR ZH 20 HE
FI
RO KO SL HI KO
FIL TH HU VI
OC ZH
FI ET
NB VI
MS
SV
TR
15 PA FA
ML UR JA
10 UK
ID NL
LB
LV HY GU TH
CA PL FR TG MT KN LT BE MR
JA RU 10 AZ NE TE BN CY
MI
DE
5 PT EN LO SD
IS TA
IT
ES
5 UZ KK
AS
SW
KM
r2 = 0.84 KA
PS r2 = 0.24
LN SO AM MYMN SN YO
HA
2.5 0
0.1 1 10 100 1K 10K 100K 1M 1 10 100 1K 10K 100K
Hours of transcribed audio Hours of translated audio
Figure 3. Correlation of pre-training supervision amount with Figure 4. Correlation of pre-training supervision amount with
downstream speech recognition performance. The amount of downstream translation performance. The amount of pre-
pre-training speech recognition data for a given language is very training translation data for a given language is only moderately
predictive of zero-shot performance on that language in Fleurs. predictive of Whisper’s zero-shot performance on that language in
Fleurs.
Table 5. Language identification performance. Zero-shot Whis- Figure 5. WER on LibriSpeech test-clean as a function of SNR
per’s accuracy at language identification is not competitive with under additive white noise (left) and pub noise (right). The
prior supervised results on Fleurs. This is partially due to Whisper accuracy of LibriSpeech-trained models degrade faster than the
being heavily penalized for having no training data for 20 of Fleurs best Whisper model (⋆). NVIDIA STT models (•) perform best
languages. under low noise but are outperformed by Whisper under high noise
(SNR < 10 dB). The second-best model under low noise (▼) is
fine-tuned on LibriSpeech only and degrades even more quickly.
new state of the art of 27.3 BLEU zero-shot without using
any of the CoVoST2 training data. We attribute this to the
68,000 hours of X→en translation data for these languages
in our pre-training dataset which, although noisy, is vastly spoken languages in the world like French, Spanish, and
larger than the 861 hours of training data for X→en trans- Russian. Inspection shows the majority of supposedly Welsh
lation in CoVoST2. Since Whisper evaluation is zero-shot, translation data is actually English audio with English cap-
it does particularly well on the lowest resource grouping tions where the English audio was mis-classified as Welsh
of CoVoST2, improving over mSLAM by 4.6 BLEU. Con- by the language identification system, resulting in it being
versely, the best Whisper model does not actually improve included as translation training data rather transcription data
over mSLAM and XLS-R on average for the highest re- according to our dataset creation rules.
source languages.
For an additional analysis on an even wider set of languages, 3.6. Language Identification
we also re-purpose Fleurs, which is a speech recognition To evaluate language identification, we use the Fleurs
dataset, as a translation dataset. Since the same sentences dataset (Conneau et al., 2022). The zero-shot performance
are transcribed for every language we use the English tran- of Whisper is not competitive with prior supervised work
scripts as reference translations. In Figure 4 we visualize here and underperforms the supervised SOTA by 13.6%.
the correlation between the amount of translation training However, Whisper is heavily disadvantaged for language
data per language and the resulting zero-shot BLEU score identification on Fleurs, since the Whisper dataset contains
on Fleurs. While there is a clear trend of improvement with no training data for 20 of the 102 languages in Fleurs, upper-
increasing training data, the squared correlation coefficient bounding accuracy at 80.4%. On the 82 overlapping lan-
is much lower than the 0.84 observed for speech recognition guages the best Whisper model achieves 79.7% accuracy.
and only 0.24. We suspect this is partly caused by the noisier
training data due to errors in audio language identification.
3.7. Robustness to Additive Noise
As an example, Welsh (CY) is an outlier with much worse
than expected performance at only 9 BLEU despite sup- We tested the noise robustness of Whisper models and 14
posedly having 9,000 hours of translation data. This large LibriSpeech-trained models by measuring the WER when
amount of Welsh translation data is surprising, ranking 4th either white noise or pub noise from the Audio Degrada-
overall for translation data and ahead of some of the most tion Toolbox (Mauch & Ewert, 2013) was added to the
Robust Speech Recognition via Large-Scale Weak Supervision 9
audio. The pub noise represents a more natural noisy envi- timestamps predicted by the model. We observed that it
ronment with ambient noise and indistinct chatter typical is crucial to have beam search and temperature scheduling
in a crowded restaurant or a pub. Among the 14 models, based on the repetitiveness and the log probability of the
twelve are pre-trained and/or fine-tuned on LibriSpeech, and model predictions in order to reliably transcribe long audio.
the other two are NVIDIA STT models trained on a mixture The full procedure is described in Section 4.5.
dataset similar to prior work like SpeechStew that includes
We evaluate the long-form transcription performance on
LibriSpeech. The level of additive noise corresponding to
seven datasets consisting of speech recordings of various
a given signal-to-noise ratio (SNR) is calculated based on
lengths and recording conditions, to cover as diverse a data
the signal power of individual examples. Figure 5 shows
distribution as possible. These include a long-form adapta-
how the ASR performance degrades as the additive noise
tion of TED-LIUM3 (Hernandez et al., 2018) concatenated
becomes more intensive. There are many models that out-
so that each example is a full-length TED talk, a collection
perform our zero-shot performance under low noise (40 dB
of jargon-laden segments taken from The Late Show with
SNR), which is unsurprising given those models are trained
Stephen Colbert (Meanwhile), sets of videos/podcasts that
primarily on LibriSpeech, but all models quickly degrade as
has been used as ASR benchmarks in online blogs (Rev16
the noise becomes more intensive, performing worse than
and Kincaid46), recordings of earnings calls (Del Rio et al.,
the Whisper model under additive pub noise of SNR below
2021), and the full-length interviews from the Corpus of
10 dB. This showcases Whisper’s robustness to noise, es-
Regional African American Language (CORAAL) (Gunter
pecially under more natural distribution shifts like the pub
et al., 2021). Full details about the long-form datasets can
noise.
be found in Appendix A.
3.8. Long-form Transcription We compare the performance with open-source models as
well as 4 commercial ASR services. The results are sum-
Whisper models are trained on 30-second audio chunks and marized in Figure 6, showing the distribution of word error
cannot consume longer audio inputs at once. This is not a rates from Whisper and the 4 commercial ASR services,
problem with most academic datasets comprised of short as well as the NVIDIA STT Conformer-CTC Large model
utterances but presents challenges in real-world applications from the NeMo toolkit (Kuchaiev et al., 2019) which per-
which often require transcribing minutes- or hours-long au- formed the best among the open-source models. All com-
dio. We developed a strategy to perform buffered transcrip- mercial ASR services are queried using their default English
tion of long audio by consecutively transcribing 30-second transcription settings as of September 1st, 2022, and for
segments of audio and shifting the window according to the the NVIDIA STT model we used their buffered inference
40
35
30
Word Error Rate (%)
25
20
15
10
5
0 TED-LIUM3 Meanwhile Kincaid46 Rev16 Earnings-21 Earnings-22 CORAAL
Figure 6. Whisper is competitive with state-of-the-art commercial and open-source ASR systems in long-form transcription. The
distribution of word error rates from six ASR systems on seven long-form datasets are compared, where the input lengths range from a
few minutes to a few hours. The boxes show the quartiles of per-example WERs, and the per-dataset aggregate WERs are annotated
on each box. Our model outperforms the best open source model (NVIDIA STT) on all datasets, and in most cases, commercial ASR
systems as well.
Robust Speech Recognition via Large-Scale Weak Supervision 10
30 4. Analysis and Ablations
25
4.1. Model Scaling
Word Error Rate (%)
20
A large amount of the promise in weakly supervised train-
15 ing approaches is their potential to use datasets much larger
10 than those in traditional supervised learning. However, this
comes with the cost of using data that is possibly much
5 noisier and lower quality than gold-standard supervision.
0 A concern with this approach is that although it may look
Whisper A B C D E F G H I
promising to begin with, the performance of models trained
ASR human transcription
computer-assisted on this kind of data may saturate at the inherent quality level
of the dataset, which could be far below human level. A re-
lated concern is that as capacity and compute spent training
Figure 7. Whisper’s performance is close to that of professional
human transcribers. This plot shows the WER distributions of
on the dataset increases, models may learn to exploit the
25 recordings from the Kincaid46 dataset transcribed by Whisper, idiosyncrasies of the dataset, and their ability to generalize
the same 4 commercial ASR systems from Figure 6 (A-D), one robustly to out-of-distribution data could even degrade.
computer-assisted human transcription service (E) and 4 human To check whether this is the case, we study the zero-shot
transcription services (F-I). The box plot is superimposed with dots
generalization of Whisper models as a function of the model
indicating the WERs on individual recordings, and the aggregate
WER over the 25 recordings are annotated on each box.
size. Our analysis is summarized in Figure 8. With the
exception of English speech recognition, performance con-
tinues to increase with model size across multilingual speech
recognition, speech translation, and language identification.
implementation in the FrameBatchASR class to enable The diminishing returns for English speech recognition
long-form transcription. The results show that Whisper per- could be due to saturation effects from approaching human-
forms better than the compared models on most datasets, level performance as analysis in Section 3.9 suggests.
especially on the Meanwhile dataset which is heavy with
uncommon words. Additionally, we note the possibility that 4.2. Dataset Scaling
some of the commercial ASR systems have been trained
on some of these publicly available datasets, and therefore At 680,000 hours of labeled audio, the Whisper dataset is
these results may not be accurately reflecting the relative one of the largest ever created in supervised speech recog-
robustness of the systems. nition. Exactly how important is the raw dataset size to
Whisper’s performance? To study this, we trained a series
of medium-sized models on subsampled versions of the
3.9. Comparison with Human Performance
dataset which are 0.5%, 1%, 2%, 4%, and 8% of the full
Because of ambiguous or indistinct speech as well as la- dataset size and compared their performance with the same
beling errors, there are different levels of irreducible error medium-sized model trained on the whole dataset. Early
in each dataset, and with WER metrics from ASR systems stopping based on the validation loss was used to select
alone it is difficult to make sense of how much room for model checkpoints for each dataset size. Evaluation was
improvement exists in each dataset. To quantify how close performed on an exponential moving average estimate of
Whisper’s performance is to the human performance, we se- the parameters (Polyak & Juditsky, 1992) using a smooth-
lected 25 recordings from the Kincaid46 dataset and used 5 ing rate of 0.9999 to help reduce the effect of the learning
services to obtain transcripts produced by professional tran- rate not fully decaying to zero for the models trained on the
scribers, among which one provides computer-assisted tran- subsampled datasets due to early stopping. Performance
scription and the other four are entirely human-transcribed. on English and multilingual speech recognition and X→en
The audio selection covers various recording conditions translation is reported in Table 6.
such as scripted and unscripted broadcast, telephone and
All increases in the dataset size result in improved perfor-
VoIP calls, and meetings. Figure 7 shows the distribution
mance on all tasks, although we see significant variability
of per-example WERs and aggregate WER across the 25
in improvement rates across tasks and sizes. Performance
recordings, where the computer-assisted service has the
improves rapidly on English speech recognition from 3,000
lowest aggregate WER that is 1.15% point better than Whis-
to 13,000 hours and then slows down noticeably between
per’s, and the pure-human performance is only a fraction
13,000 and 54,000 hours. Using the full dataset, which cor-
of a percentage point better than Whisper’s. These results
responds to another 12.5× increase in size results in only a
indicate that Whisper’s English ASR performance is not
further 1 point drop in WER. This mirrors the diminishing
perfect but very close to human-level accuracy.
Robust Speech Recognition via Large-Scale Weak Supervision 11
English Speech Recognition Multilingual Speech Recognition (Fleurs) X->En Translation (CoVoST2) Language Identification (Fleurs)
20.0 100 50 80
Average Average Average Average
17.5
80 40 70
BLEU on 21 languages
12.5 60 30 60
10.0
7.5 40 20 50
5.0
20 10 40
2.5
0.0 0 0 30
38M 73M 244M 768M 1549M 38M 73M 244M 768M 1549M 38M 73M 244M 768M 1549M 38M 73M 244M 768M 1549M
Model parameters Model parameters Model parameters Model parameters
Figure 8. Zero-shot Whisper performance scales reliably across tasks and languages with increasing model size. Lightly shaded lines
represent individual datasets or languages, showing that performance is more varied than the smooth trends in aggregate performance.
20 CORAAL
English Only Open-source models
Multilingual and Multitask CommonVoice9.en Whisper models
Average WER on 11 english speech recognition datasets
18
AMI-SDM1
16 CommonVoice5.1
Fleurs.en_us
14 AMI-IHM
Artie
LibriSpeech
12
TED-LIUM3
VoxPopuli.en
WSJ
10
CallHome
Switchboard
50 40 30 20 10 0
8 Relative WER reduction compared to FairSpeech's normalizer (%)
10e+19 10e+20 10e+21 10e+22
FLOPs training on english speech recognition
Figure 10. On most datasets, our text normalizer has similar
effect on reducing WERs between Whisper models and other
Figure 9. Multitask and multilingual transfer improves with open-source models, compared to FairSpeech’s normalizer. For
scale. For small models, performance on English speech recogni- each dataset, the boxplot shows the distribution of relative WER
tion degrades when trained jointly in a multitask and multilingual reduction across different models in our eval suite, showing that
setup. However, multilingual and multitask models benefit more using our text normalizer generally results in lower WERs than
from scale and eventually outperform models trained on English FairSpeech’s. On a few datasets our normalizer reduces WER
data only. 95% bootstrap estimate confidence intervals are shown. significantly and more so for Whisper models, such as CallHome
and Switchboard which have many contractions in the ground truth
and WSJ which contains many numerical expressions.
To check this, we compared the performance of Whisper
using our normalizer versus an independently developed
one from the FairSpeech project (Koenecke et al., 2020). In the results reported in sections 3.8 and 3.9. First, we use
Figure 10, we visualize the differences. On most datasets beam search with 5 beams using the log probability as the
the two normalizers perform similarly, without significant score function, to reduce repetition looping which happens
differences in WER reduction between Whisper and com- more frequently in greedy decoding. We start with tem-
pared open-source models, while on some datasets, namely perature 0, i.e. always selecting the tokens with the high-
WSJ, CallHome, and Switchboard, our normalizer reduces est probability, and increase the temperature by 0.2 up to
the WER of Whisper models’ significantly more. The differ- 1.0 when either the average log probability over the gen-
ences in reduction can be traced down to different formats erated tokens is lower than −1 or the generated text has a
used by the ground truth and how the two normalizers are pe- gzip compression rate higher than 2.4. Providing the tran-
nalizing them. For example, in CallHome and Switchboard, scribed text from the preceding window as previous-text
our standardizer did not penalize differences in common conditioning when the applied temperature is below 0.5
English contractions such as “you’re” versus “you are”, and further improves the performance. We found that the proba-
in WSJ, our normalizer standardized the written and spo- bility of the <|nospeech|> token alone is not sufficient
ken forms of numerical and monetary expressions, such as to distinguish a segment with no speech, but combining
“sixty-eight million dollars” versus “$68 million”. the no-speech probability threshold of 0.6 and the average
log-probability threshold of −1 makes the voice activity
detection of Whisper more reliable. Finally, to avoid a fail-
4.5. Strategies for Reliable Long-form Transcription
ure mode where the model ignores the first few words in
Transcribing long-form audio using Whisper relies on ac- the input, we constrained the initial timestamp token to be
curate prediction of the timestamp tokens to determine the between 0.0 and 1.0 second. Table 7 shows that adding each
amount to shift the model’s 30-second audio context win- of the interventions above incrementally reduces the WER
dow by, and inaccurate transcription in one window may overall, but not evenly across the dataset. These heuristics
negatively impact transcription in the subsequent windows. serve as a workaround for the noisy predictions of the model,
We have developed a set of heuristics that help avoid fail- and more research would be needed to further improve the
ure cases of long-form transcription, which is applied in reliability of long-form decoding.
Robust Speech Recognition via Large-Scale Weak Supervision 13
TED-LIUM3
Earnings-21
Earnings-22
Meanwhile
Kincaid46
CORAAL
approach was simplified further into the “text-to-text” frame-
Average
Rev16
work of McCann et al. (2018) and popularized by its success
with large transformer language models in the work of Rad-
Greedy decoding only 3.55 5.95 9.96 12.1 10.5 13.9 21.3 11.0 ford et al. (2019) and Raffel et al. (2020). Toshniwal et al.
+ Beam search 3.66 5.78 9.18 12.1 10.2 13.5 20.1 10.6 (2018) demonstrated jointly training a modern deep learn-
+ Temperature fallback 3.66 5.72 8.93 12.0 10.2 13.4 19.8 10.5 ing speech recognition system on several languages with a
+ Previous-text conditioning 3.68 5.35 9.19 11.3 9.78 12.8 19.8 10.3
+ Voice activity detection 3.57 5.28 9.16 10.8 9.75 12.9 19.9 10.2
single model, and Pratap et al. (2020a) scaled this line of
+ Initial timestamp constraint 3.56 5.33 8.87 10.6 9.68 13.0 19.9 10.1 work significantly to 50 languages with a billion-parameter
model. MUTE (Wang et al., 2020c) and mSLAM (Bapna
Table 7. Long-form transcription performance improves incremen- et al., 2022) studied joint training over both text and speech
tally as additional decoding heuristics are employed. Details on language tasks, demonstrating transfer between them.
each intervention are described in Section 4.5.
Robustness The question of how effectively models trans-
5. Related Work fer and how robust they are to distribution shift and other
types of perturbations has long been studied and is actively
Scaling Speech Recognition A consistent theme across being researched across many fields of machine learning.
speech recognition research has been documenting the bene- Torralba & Efros (2011) highlighted the lack of generaliza-
fits of scaling compute, models, and datasets. Early work ap- tion of machine learning models between datasets over a
plying deep learning to speech recognition found improved decade ago. Many other works have shown and continu-
performance with model depth and size and leveraged GPU ally reiterated how despite high performance on IID test
acceleration to make training these larger models tractable sets, machine learning models can still make many mistakes
(Mohamed et al., 2009). Further research demonstrated that when evaluated in even slightly different settings (Lake et al.,
the benefit of deep learning approaches to speech recogni- 2017; Jia & Liang, 2017; Alcorn et al., 2019; Barbu et al.,
tion increased with dataset size, improving from being only 2019; Recht et al., 2019). More recently, Taori et al. (2020)
competitive with prior GMM-HMM systems when using studied the robustness of image classification models, and
just 3 hours of TIMIT training data for phone recognition Miller et al. (2020) investigated this for question-answering
to achieving a 30% word error rate reduction when trained models. A key finding has been that multi-domain train-
on the 2,000 hour Switchboard dataset (Seide et al., 2011). ing increases robustness and generalization as discussed in
Liao et al. (2013) is an early example of leveraging weakly the Introduction. This finding has been replicated across
supervised learning to increase the size of a deep learn- many fields in addition to speech recognition including NLP
ing based speech recognition dataset by over 1,000 hours. (Hendrycks et al., 2020) and computer vision (Radford et al.,
These trends continued with Deep Speech 2 (Amodei et al., 2021).
2015) being a notable system developing high-throughput
distributed training across 16 GPUs and scaling to 12,000
6. Limitations and Future Work
hours of training data while demonstrating continuing im-
provements at that scale. By leveraging semi-supervised From our experimental results, analyses, and ablations, we
pre-training, Narayanan et al. (2018) were able to grow have noted several limitations and areas for future work.
dataset size much further and study training on 162,000
hours of labeled audio. More recent work has explored Improved decoding strategies. As we have scaled Whis-
billion-parameter models (Zhang et al., 2020) and using up per, we have observed that larger models have made steady
to 1,000,000 hours of training data (Zhang et al., 2021). and reliable progress on reducing perception-related errors
such as confusing similar-sounding words. Many remaining
Multitask Learning Multitask learning (Caruana, 1997) errors, particularly in long-form transcription seem more
has been studied for a long time. In speech recognition, stubborn in nature and decidedly non-human/perceptual.
multi-lingual models have been explored for well over a They are a combination of failure modes of seq2seq mod-
decade (Schultz & Kirchhoff, 2006). An inspirational and els, language models, and text-audio alignment and include
foundational work in NLP exploring multi-task learning problems such as getting stuck in repeat loops, not tran-
with a single model is Collobert et al. (2011). Multitask scribing the first or last few words of an audio segment, or
learning in the sequence-to-sequence framework (Sutskever complete hallucination where the model will output a tran-
et al., 2014) using multiple encoders and decoders was in- script entirely unrelated to the actual audio. Although the
vestigated in Luong et al. (2015). The use of language codes decoding details discussed in Section 4.5 help significantly,
with a shared encoder/decoder architecture was first demon- we suspect fine-tuning Whisper models on a high-quality
Robust Speech Recognition via Large-Scale Weak Supervision 14
supervised dataset and/or using reinforcement learning to been a mainstay of recent large-scale speech recognition
more directly optimize for decoding performance could help work and demonstrate how simply training on a large and
further reduce these errors. diverse supervised dataset and focusing on zero-shot trans-
fer can significantly improve the robustness of a speech
Increase Training Data For Lower-Resource Languages recognition system.
As Figure 3 shows, Whisper’s speech recognition perfor-
mance is still quite poor on many languages. The same ACKNOWLEDGMENTS
analysis suggests a clear route for improvement since perfor-
We’d like to thank the millions of people who were involved
mance on a language is very well predicted by the amount
in creating the data used by Whisper. We’d also like to
of training data for the language. Since our pre-training
thank Nick Ryder, Will Zhuk, and Andrew Carr for the
dataset is currently very English-heavy due to biases of
conversation on the waterfall hike that inspired this project.
our data collection pipeline, which sourced primarily from
We are also grateful to the Acceleration and Supercomputing
English-centric parts of the internet, most languages have
teams at OpenAI for their critical work on software and
less than 1000 hours of training data. A targeted effort at in-
hardware infrastructure this project used. We’d also like to
creasing the amount of data for these rarer languages could
thank Pamela Mishkin for advising the project from a policy
result in a large improvement to average speech recognition
perspective. Finally, we are grateful to the developers of
performance even with only a small increase in our overall
the many software packages used throughout this project
training dataset size.
including, but not limited, to Numpy (Harris et al., 2020),
SciPy (Virtanen et al., 2020), ftfy (Speer, 2019), PyTorch
Studying fine-tuning In this work, we have focused on (Paszke et al., 2019), pandas (pandas development team,
the robustness properties of speech processing systems and 2020), and scikit-learn (Pedregosa et al., 2011).
as a result only studied the zero-shot transfer performance
of Whisper. While this is a crucial setting to study due to it
being representative of general reliability, for many domains References
where high-quality supervised speech data does exist, it is Alcorn, M. A., Li, Q., Gong, Z., Wang, C., Mai, L., Ku, W.-
likely that results can be improved further by fine-tuning. S., and Nguyen, A. Strike (with) a pose: Neural networks
An additional benefit of studying fine-tuning is that it allows are easily fooled by strange poses of familiar objects. In
for direct comparisons with prior work since it is a much Proceedings of the IEEE/CVF Conference on Computer
more common evaluation setting. Vision and Pattern Recognition, pp. 4845–4854, 2019.
Tuning Architecture, Regularization, and Augmentation Amodei, D., Anubhai, R., Battenberg, E., Case, C., Casper,
As a study focusing primarily on the impact of dataset scal- J., Catanzaro, B., Chen, J., Chrzanowski, M., Coates,
ing on speech processing, our training setup is relatively A., Diamos, G., et al. Deep speech 2: end-to-end speech
basic and does not have many components of current state- recognition in english and mandarin. arxiv. arXiv preprint
of-the-art systems. Adding common regularization tech- arXiv:1512.02595, 2015.
niques such as dropout (Srivastava et al., 2014) or stochastic
depth (Huang et al., 2016) as well as data augmentation Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler,
methods such as SpecAugment (Park et al., 2019) could M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M.,
potentially combine well with fine-tuning which is often and Weber, G. Common voice: A massively-multilingual
data-limited. speech corpus. arXiv preprint arXiv:1912.06670, 2019.
Adding Auxiliary Training Objectives Whisper departs Babu, A., Wang, C., Tjandra, A., Lakhotia, K., Xu,
noticeably from most recent state-of-the-art speech recog- Q., Goyal, N., Singh, K., von Platen, P., Saraf, Y.,
nition systems due to the lack of unsupervised pre-training Pino, J., et al. XLS-R: Self-supervised cross-lingual
or self-teaching methods. While we have not found them speech representation learning at scale. arXiv preprint
necessary to achieve good performance, it is possible that arXiv:2111.09296, 2021.
the results could be further improved by incorporating this.
Baevski, A., Zhou, H., Mohamed, A., and Auli, M. wav2vec
7. Conclusion 2.0: A framework for self-supervised learning of speech
representations. arXiv preprint arXiv:2006.11477, 2020.
Whisper suggests that scaling weakly supervised pre-
training has been underappreciated so far in speech recogni- Baevski, A., Hsu, W.-N., Conneau, A., and Auli, M. Unsu-
tion research. We achieve our results without the need for pervised speech recognition. Advances in Neural Infor-
the self-supervision and self-training techniques that have mation Processing Systems, 34:27826–27839, 2021.
Robust Speech Recognition via Large-Scale Weak Supervision 15
Bapna, A., Cherry, C., Zhang, Y., Jia, Y., Johnson, M., The people’s speech: A large-scale diverse english speech
Cheng, Y., Khanuja, S., Riesa, J., and Conneau, A. mslam: recognition dataset for commercial usage. arXiv preprint
Massively multilingual joint pre-training for speech and arXiv:2111.09344, 2021.
text. arXiv preprint arXiv:2202.01374, 2022.
Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Bren-
Barbu, A., Mayo, D., Alverio, J., Luo, W., Wang, C., Gut- del, W., Bethge, M., and Wichmann, F. A. Shortcut learn-
freund, D., Tenenbaum, J., and Katz, B. Objectnet: A ing in deep neural networks. Nature Machine Intelligence,
large-scale bias-controlled dataset for pushing the lim- 2(11):665–673, 2020.
its of object recognition models. Advances in neural
information processing systems, 32, 2019. Ghorbani, B., Firat, O., Freitag, M., Bapna, A., Krikun,
Caruana, R. Multitask learning. Machine learning, 28(1): M., Garcia, X., Chelba, C., and Cherry, C. Scaling
41–75, 1997. laws for neural machine translation. arXiv preprint
arXiv:2109.07740, 2021.
Chan, W., Park, D., Lee, C., Zhang, Y., Le, Q., and Norouzi,
M. SpeechStew: Simply mix all available speech recogni- Griewank, A. and Walther, A. Algorithm 799: revolve: an
tion data to train one large neural network. arXiv preprint implementation of checkpointing for the reverse or ad-
arXiv:2104.02133, 2021. joint mode of computational differentiation. ACM Trans-
actions on Mathematical Software (TOMS), 26(1):19–45,
Chen, G., Chai, S., Wang, G., Du, J., Zhang, W.-Q.,
2000.
Weng, C., Su, D., Povey, D., Trmal, J., Zhang, J.,
et al. Gigaspeech: An evolving, multi-domain asr corpus
Gunter, K., Vaughn, C., and Kendall, T. Contextualiz-
with 10,000 hours of transcribed audio. arXiv preprint
ing/s/retraction: Sibilant variation and change in wash-
arXiv:2106.06909, 2021.
ington dc african american language. Language Variation
Chen, S., Wu, Y., Wang, C., Chen, Z., Chen, Z., Liu, S., and Change, 33(3):331–357, 2021.
Wu, J., Qian, Y., Wei, F., Li, J., et al. Unispeech-sat: Uni-
versal speech representation learning with speaker aware Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers,
pre-training. In ICASSP 2022-2022 IEEE International R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J.,
Conference on Acoustics, Speech and Signal Processing Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van
(ICASSP), pp. 6152–6156. IEEE, 2022. Kerkwijk, M. H., Brett, M., Haldane, A., Fernández del
Rı́o, J., Wiebe, M., Peterson, P., Gérard-Marchant, P.,
Chen, T., Xu, B., Zhang, C., and Guestrin, C. Training Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H.,
deep nets with sublinear memory cost. arXiv preprint Gohlke, C., and Oliphant, T. E. Array programming
arXiv:1604.06174, 2016. with NumPy. Nature, 585:357–362, 2020. doi: 10.1038/
s41586-020-2649-2.
Child, R., Gray, S., Radford, A., and Sutskever, I. Gen-
erating long sequences with sparse transformers. arXiv
Hendrycks, D. and Gimpel, K. Gaussian error linear units
preprint arXiv:1904.10509, 2019.
(gelus). arXiv preprint arXiv:1606.08415, 2016.
Collobert, R., Weston, J., Bottou, L., Karlen, M.,
Kavukcuoglu, K., and Kuksa, P. Natural language pro- Hendrycks, D., Liu, X., Wallace, E., Dziedzic, A., Krishnan,
cessing (almost) from scratch. Journal of machine learn- R., and Song, D. Pretrained transformers improve out-of-
ing research, 12(ARTICLE):2493–2537, 2011. distribution robustness. arXiv preprint arXiv:2004.06100,
2020.
Conneau, A., Ma, M., Khanuja, S., Zhang, Y., Axelrod, V.,
Dalmia, S., Riesa, J., Rivera, C., and Bapna, A. Fleurs: Hernandez, F., Nguyen, V., Ghannay, S., Tomashenko, N. A.,
Few-shot learning evaluation of universal representations and Estève, Y. Ted-lium 3: twice as much data and corpus
of speech. arXiv preprint arXiv:2205.12446, 2022. repartition for experiments on speaker adaptation. In
Del Rio, M., Delworth, N., Westerman, R., Huang, M., SPECOM, 2018.
Bhandari, N., Palakapilly, J., McNamara, Q., Dong, J.,
Zelasko, P., and Jetté, M. Earnings-21: a practical bench- Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K.,
mark for asr in the wild. arXiv preprint arXiv:2104.11348, Salakhutdinov, R., and Mohamed, A. Hubert: Self-
2021. supervised speech representation learning by masked
prediction of hidden units. IEEE/ACM Transactions on
Galvez, D., Diamos, G., Torres, J. M. C., Achorn, K., Gopi, Audio, Speech, and Language Processing, 29:3451–3460,
A., Kanter, D., Lam, M., Mazumder, M., and Reddi, V. J. 2021a.
Robust Speech Recognition via Large-Scale Weak Supervision 16
Hsu, W.-N., Sriram, A., Baevski, A., Likhomanenko, T., Loshchilov, I. and Hutter, F. Decoupled weight decay regu-
Xu, Q., Pratap, V., Kahn, J., Lee, A., Collobert, R., Syn- larization. arXiv preprint arXiv:1711.05101, 2017.
naeve, G., et al. Robust wav2vec 2.0: Analyzing do-
main shift in self-supervised pre-training. arXiv preprint Luong, M.-T., Le, Q. V., Sutskever, I., Vinyals, O., and
arXiv:2104.01027, 2021b. Kaiser, L. Multi-task sequence to sequence learning.
arXiv preprint arXiv:1511.06114, 2015.
Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger,
K. Q. Deep networks with stochastic depth. In European Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri,
conference on computer vision, pp. 646–661. Springer, M., Li, Y., Bharambe, A., and Van Der Maaten, L. Ex-
2016. ploring the limits of weakly supervised pretraining. In
Jia, R. and Liang, P. Adversarial examples for evalu- Proceedings of the European conference on computer
ating reading comprehension systems. arXiv preprint vision (ECCV), pp. 181–196, 2018.
arXiv:1707.07328, 2017.
Mauch, M. and Ewert, S. The audio degradation toolbox and
Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., its application to robustness evaluation. In Proceedings of
Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, the 14th International Society for Music Information Re-
G., et al. Google’s multilingual neural machine translation trieval Conference (ISMIR 2013), Curitiba, Brazil, 2013.
system: Enabling zero-shot translation. Transactions of accepted.
the Association for Computational Linguistics, 5:339–
351, 2017. McCann, B., Keskar, N. S., Xiong, C., and Socher, R. The
natural language decathlon: Multitask learning as ques-
Kendall, T. and Farrington, C. The corpus of regional tion answering. arXiv preprint arXiv:1806.08730, 2018.
african american language. Version 2021.07. Eugene, OR:
The Online Resources for African American Language Meyer, J., Rauchenstein, L., Eisenberg, J. D., and Howell,
Project. http://oraal.uoregon.edu/coraal, N. Artie bias corpus: An open dataset for detecting de-
2021. Accessed: 2022-09-01. mographic bias in speech applications. In Proceedings of
the 12th Language Resources and Evaluation Conference,
Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M.,
pp. 6462–6468, Marseille, France, May 2020. European
Mengesha, Z., Toups, C., Rickford, J. R., Jurafsky, D.,
Language Resources Association. ISBN 979-10-95546-
and Goel, S. Racial disparities in automated speech recog-
34-4. URL https://aclanthology.org/2020.
nition. Proceedings of the National Academy of Sciences,
lrec-1.796.
117(14):7684–7689, 2020.
Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, Miller, J., Krauth, K., Recht, B., and Schmidt, L. The effect
J., Gelly, S., and Houlsby, N. Big transfer (bit): General of natural distribution shift on question answering models.
visual representation learning. In European conference In ICML, 2020.
on computer vision, pp. 491–507. Springer, 2020.
Mohamed, A.-r., Dahl, G., Hinton, G., et al. Deep belief net-
Kuchaiev, O., Li, J., Nguyen, H., Hrinchuk, O., Leary, R., works for phone recognition. In Nips workshop on deep
Ginsburg, B., Kriman, S., Beliaev, S., Lavrukhin, V., learning for speech recognition and related applications,
Cook, J., et al. Nemo: a toolkit for building ai applications volume 1, pp. 39, 2009.
using neural modules. arXiv preprint arXiv:1909.09577,
2019. Narayanan, A., Misra, A., Sim, K. C., Pundak, G., Tripathi,
A., Elfeky, M., Haghani, P., Strohman, T., and Bacchi-
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gersh- ani, M. Toward domain-invariant speech recognition via
man, S. J. Building machines that learn and think like large scale training. In 2018 IEEE Spoken Language
people. Behavioral and brain sciences, 40, 2017. Technology Workshop (SLT), pp. 441–447. IEEE, 2018.
Liao, H., McDermott, E., and Senior, A. Large scale deep
neural network acoustic modeling with semi-supervised Panayotov, V., Chen, G., Povey, D., and Khudanpur, S.
training data for youtube video transcription. In 2013 Librispeech: an asr corpus based on public domain au-
IEEE Workshop on Automatic Speech Recognition and dio books. In 2015 IEEE international conference on
Understanding, pp. 368–373. IEEE, 2013. acoustics, speech and signal processing (ICASSP), pp.
5206–5210. IEEE, 2015.
Likhomanenko, T., Xu, Q., Pratap, V., Tomasello, P., Kahn,
J., Avidov, G., Collobert, R., and Synnaeve, G. Rethink- pandas development team, T. pandas-dev/pandas: Pan-
ing evaluation in asr: Are our models robust enough? das, February 2020. URL https://doi.org/10.
arXiv preprint arXiv:2010.11745, 2020. 5281/zenodo.3509134.
Robust Speech Recognition via Large-Scale Weak Supervision 17
Park, D. S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S.,
Cubuk, E. D., and Le, Q. V. SpecAugment: A simple data Matena, M., Zhou, Y., Li, W., Liu, P. J., et al. Exploring
augmentation method for automatic speech recognition. the limits of transfer learning with a unified text-to-text
arXiv preprint arXiv:1904.08779, 2019. transformer. J. Mach. Learn. Res., 21(140):1–67, 2020.
Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cor-
of training recurrent neural networks. In International nell, S., Lugosch, L., Subakan, C., Dawalatabad, N.,
conference on machine learning, pp. 1310–1318. PMLR, Heba, A., Zhong, J., Chou, J.-C., Yeh, S.-L., Fu, S.-W.,
2013. Liao, C.-F., Rastorgueva, E., Grondin, F., Aris, W., Na,
H., Gao, Y., Mori, R. D., and Bengio, Y. SpeechBrain: A
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., general-purpose speech toolkit, 2021. arXiv:2106.04624.
Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga,
L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, Recht, B., Roelofs, R., Schmidt, L., and Shankar, V.
M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Do ImageNet classifiers generalize to ImageNet? In
Bai, J., and Chintala, S. Pytorch: An imperative style, Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceed-
high-performance deep learning library. In Advances ings of the 36th International Conference on Machine
in Neural Information Processing Systems 32, pp. 8024– Learning, volume 97 of Proceedings of Machine Learn-
8035, 2019. ing Research, pp. 5389–5400. PMLR, 09–15 Jun 2019.
URL https://proceedings.mlr.press/v97/
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., recht19a.html.
Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,
Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cour- Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
napeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein,
Scikit-learn: Machine learning in Python. Journal of M., et al. Imagenet large scale visual recognition chal-
Machine Learning Research, 12:2825–2830, 2011. lenge. International journal of computer vision, 115(3):
211–252, 2015.
Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic
approximation by averaging. SIAM journal on control Schultz, T. and Kirchhoff, K. Multilingual speech process-
and optimization, 30(4):838–855, 1992. ing. Elsevier, 2006.
Pratap, V., Sriram, A., Tomasello, P., Hannun, A. Y., Seide, F., Li, G., Chen, X., and Yu, D. Feature engineering
Liptchinsky, V., Synnaeve, G., and Collobert, R. Mas- in context-dependent deep neural networks for conver-
sively multilingual asr: 50 languages, 1 model, 1 billion sational speech transcription. In 2011 IEEE Workshop
parameters. ArXiv, abs/2007.03001, 2020a. on Automatic Speech Recognition & Understanding, pp.
24–29. IEEE, 2011.
Pratap, V., Xu, Q., Sriram, A., Synnaeve, G., and Collobert,
Sennrich, R., Haddow, B., and Birch, A. Neural machine
R. Mls: A large-scale multilingual dataset for speech
translation of rare words with subword units. arXiv
research. arXiv preprint arXiv:2012.03411, 2020b.
preprint arXiv:1508.07909, 2015.
Press, O. and Wolf, L. Using the output embedding to
Speer, R. ftfy. Zenodo, 2019. URL https://doi.org/
improve language models. In Proceedings of the 15th
10.5281/zenodo.2591652. Version 5.5.
Conference of the European Chapter of the Associa-
tion for Computational Linguistics: Volume 2, Short Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I.,
Papers, pp. 157–163, Valencia, Spain, April 2017. As- and Salakhutdinov, R. Dropout: a simple way to prevent
sociation for Computational Linguistics. URL https: neural networks from overfitting. The journal of machine
//aclanthology.org/E17-2025. learning research, 15(1):1929–1958, 2014.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to se-
Sutskever, I. Language models are unsupervised multitask quence learning with neural networks. Advances in neural
learners. 2019. information processing systems, 27, 2014.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B.,
Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, and Schmidt, L. Measuring robustness to natural
J., Krueger, G., and Sutskever, I. Learning transferable distribution shifts in image classification. In Larochelle,
visual models from natural language supervision. arXiv H., Ranzato, M., Hadsell, R., Balcan, M., and Lin,
preprint arXiv:2103.00020, 2021. H. (eds.), Advances in Neural Information Processing
Robust Speech Recognition via Large-Scale Weak Supervision 18
Systems, volume 33, pp. 18583–18599. Curran Asso- speech recognition for unsegmented recordings. arXiv
ciates, Inc., 2020. URL https://proceedings. preprint arXiv:2004.09249, 2020.
neurips.cc/paper/2020/file/
d8330f857a17c53d217014ee776bfd50-Paper. Xu, Q., Baevski, A., Likhomanenko, T., Tomasello, P., Con-
pdf. neau, A., Collobert, R., Synnaeve, G., and Auli, M. Self-
training and pre-training are complementary for speech
Torralba, A. and Efros, A. A. Unbiased look at dataset bias. recognition. In ICASSP 2021-2021 IEEE International
CVPR 2011, pp. 1521–1528, 2011. Conference on Acoustics, Speech and Signal Processing
(ICASSP), pp. 3030–3034. IEEE, 2021.
Toshniwal, S., Sainath, T. N., Weiss, R. J., Li, B., Moreno,
P. J., Weinstein, E., and Rao, K. Multilingual speech Zhang, Y., Qin, J., Park, D. S., Han, W., Chiu, C.-C., Pang,
recognition with a single end-to-end model. 2018 IEEE R., Le, Q. V., and Wu, Y. Pushing the limits of semi-
International Conference on Acoustics, Speech and Sig- supervised learning for automatic speech recognition.
nal Processing (ICASSP), pp. 4904–4908, 2018. arXiv preprint arXiv:2010.10504, 2020.
Valk, J. and Alumäe, T. Voxlingua107: a dataset for spoken Zhang, Y., Park, D. S., Han, W., Qin, J., Gulati, A., Shor, J.,
language recognition. In 2021 IEEE Spoken Language Jansen, A., Xu, Y., Huang, Y., Wang, S., et al. BigSSL:
Technology Workshop (SLT), pp. 652–658. IEEE, 2021. Exploring the frontier of large-scale semi-supervised
learning for automatic speech recognition. arXiv preprint
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
arXiv:2109.13226, 2021.
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Atten-
tion is all you need. In Advances in neural information
processing systems, pp. 5998–6008, 2017.
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M.,
Reddy, T., Cournapeau, D., Burovski, E., Peterson, P.,
Weckesser, W., Bright, J., van der Walt, S. J., Brett, M.,
Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J.,
Jones, E., Kern, R., Larson, E., Carey, C. J., Polat, İ.,
Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D.,
Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A.,
Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa,
F., van Mulbregt, P., and SciPy 1.0 Contributors. SciPy
1.0: Fundamental Algorithms for Scientific Computing
in Python. Nature Methods, 17:261–272, 2020. doi:
10.1038/s41592-019-0686-2.
Wang, C., Tang, Y., Ma, X., Wu, A., Okhonko, D., and Pino,
J. fairseq s2t: Fast speech-to-text modeling with fairseq.
arXiv preprint arXiv:2010.05171, 2020a.
Wang, C., Wu, A., and Pino, J. Covost 2 and massively
multilingual speech-to-text translation. arXiv preprint
arXiv:2007.10310, 2020b.
Wang, C., Riviere, M., Lee, A., Wu, A., Talnikar, C., Haziza,
D., Williamson, M., Pino, J., and Dupoux, E. Voxpopuli:
A large-scale multilingual speech corpus for representa-
tion learning, semi-supervised learning and interpretation.
arXiv preprint arXiv:2101.00390, 2021.
Wang, P., Sainath, T. N., and Weiss, R. J. Multitask training
with text data for end-to-end speech recognition. arXiv
preprint arXiv:2010.14318, 2020c.
Watanabe, S., Mandel, M., Barker, J., Vincent, E., Arora,
A., Chang, X., Khudanpur, S., Manohar, V., Povey, D.,
Raj, D., et al. Chime-6 challenge: Tackling multispeaker
Robust Speech Recognition via Large-Scale Weak Supervision 19
A. Evaluation Datasets.
A.1. Short-form English-only datasets
• LibriSpeech (Panayotov et al., 2015): We used the test-clean and test-other splits from the LibriSpeech ASR corpus.
• TED-LIUM 3 (Hernandez et al., 2018): We used the test split of TED-LIUM Release 3, using the segmented manual
transcripts included in the release.
• Common Voice 5.1 (Ardila et al., 2019): We downloaded the English subset of Common Voice Corpus 5.1 from the
official website.
• Artie bias corpus (Meyer et al., 2020): We used the Artie bias corpus. This is a subset of the Common Voice dataset.
• CallHome and Switchboard: We used the two corpora from LDC2002S09 and LDC2002T43.
• WSJ: We used LDC93S6B and LDC94S13B and followed the s5 recipe to preprocess the dataset.
• CORAAL: We used the 231 interviews from CORAAL (Kendall & Farrington, 2021) and used the preprocessing
script from the FairSpeech project.
• CHiME-6: For CHiME-6 (Watanabe et al., 2020), we downloaded the CHiME-5 dataset and followed the stage 0
of the s5 track1 recipe to create the CHiME-6 dataset which fixes synchronization. We then used the binaural
recordings (* P??.wav) and the corresponding transcripts.
• AMI-IHM and AMI-SDM1: We preprocessed the AMI Corpus by following the stage 0 ad 2 of the s5b recipe.
• Meanwhile: This dataset consists of 64 segments from The Late Show with Stephen Colbert. The YouTube video ID
and the corresponding start and end timestamps are available as part of the code release. The labels are collected from
the closed-caption data for each video and corrected with manual inspection.
• Rev16: We use a subset of 16 files from the 30 podcast episodes in Rev.AI’s Podcast Transcription Benchmark, after
finding that there are multiple cases where a significant portion of the audio and the labels did not match, mostly on the
parts introducing the sponsors. We selected 16 episodes that do not have this error, whose ”file number” are:
3 4 9 10 11 14 17 18 20 21 23 24 26 27 29 32
• Kincaid46: This dataset consists of 46 audio files and the corresponding transcripts compiled in the blog article ¡Which
automatic transcription service is the most accurate - 2018¿ by Jason Kincaid. We used the 46 audio files and reference
transcripts from the Airtable widget in the article. For the human transcription benchmark in the paper, we use a subset
of 25 examples from this data, whose ”Ref ID” are:
2 4 5 8 9 10 12 13 14 16 19 21 23 25 26 28 29 30 33 35 36 37 42 43 45
• Earnings-21 (Del Rio et al., 2021) and Earnings-22: We used the files available in the speech-datasets repository, as
of their 202206 version.
• CORAAL: We used the 231 full-length interviews and transcripts from (Kendall & Farrington, 2021).
Robust Speech Recognition via Large-Scale Weak Supervision 20
• Fleurs (Conneau et al., 2022): We collected audio files and transcripts using the implementation available as Hug-
gingFace datasets. To use as a translation dataset, we matched the numerical utterance IDs to find the corresponding
transcript in English.
• VoxPopuli (Wang et al., 2021): We used the get asr data.py script from the official repository to collect the ASR
data in 14 languages.
• Common Voice 9 (Ardila et al., 2019): We downloaded the Common Voice Corpus 9 from the official website.
• CoVOST 2 (Wang et al., 2020b): We collected the X into English data collected using the official repository.
B. Compared Models
For comparison, we use the following models from HuggingFace, downloaded as of September 2022 using version 4.21.0 of
the transformers library:
We note that all of the models above are entirely or partly trained on LibriSpeech.
Robust Speech Recognition via Large-Scale Weak Supervision 21
C. Text Standardization
Since Whisper may output any UTF-8 string rather than a restricted set of graphemes, the rules for text standardization need
to be more intricate and comprehensive than those defined on e.g. ASCII characters. We perform the following steps to
normalize English texts in different styles into a standardized form, which is a best-effort attempt to penalize only when a
word error is caused by actually mistranscribing a word, and not by formatting or punctuation differences.
3. Remove any of the following words: hmm, mm, mhm, mmm, uh, um
4. Remove whitespace characters that comes before an apostrophe ’
5. Convert standard or informal contracted forms of English into the original form.
6. Remove commas (,) between digits
A different, language-specific set of transformations would be needed to equivalently normalize non-English text, but due to
our lack of linguistic knowledge to build such normalizers for all languages, we resort to the following basic standardization
for non-English text:
3. Replace any markers, symbols, and punctuation characters with a space, i.e. when the Unicode category of each
character in the NFKC-normalized string starts with M, S, or P.
4. make the text lowercase.
5. replace any successive whitespace characters with a space.
Additionally, we put a space between every letter for the languages that do not use spaces to separate words, namely Chinese,
Japanese, Thai, Lao, and Burmese, effectively measuring the character error rate instead.
We note that the above is an imperfect solution, and it will sometimes produce unintended and unexpected outputs. We do
not claim that the text format resulting from the above is more “correct” in any measure. Rather, the procedures above are
designed to better distinguish between innocuous differences in wording and genuine mistranscriptions. Python code for
the standardization procedures above is available as part of our code and model release to facilitate future iterations and
improvements on text standardization.
Robust Speech Recognition via Large-Scale Weak Supervision 22
LibriSpeech.test-clean
LibriSpeech.test-other
CommonVoice5.1
TED-LIUM3
VoxPopuli.en
Switchboard
AMI-SDM1
Fleurs.en us
AMI-IHM
CallHome
CORAAL
CHiME6
Artie
WSJ
Model
Whisper tiny.en 5.6 14.6 6.0 5.0 24.1 17.8 26.3 20.0 23.9 41.3 23.7 50.3 11.7 11.6
Whisper tiny 7.6 16.9 7.0 6.7 30.0 22.8 29.6 23.9 31.0 49.6 27.6 58.1 12.7 13.7
Whisper base.en 4.2 10.2 4.9 4.6 20.9 15.2 19.0 13.4 22.6 36.4 20.5 46.7 10.0 7.6
Whisper base 5.0 12.4 5.5 5.1 23.0 16.8 21.6 16.9 26.0 40.2 22.0 49.9 10.0 10.1
Whisper small.en 3.1 7.4 4.0 3.3 18.2 15.7 13.1 9.7 20.2 27.6 17.5 38.0 8.1 6.0
Whisper small 3.4 7.6 4.3 4.0 17.5 14.5 13.5 10.3 18.1 29.3 19.0 39.6 8.3 6.6
Whisper medium.en 3.1 6.3 4.1 3.3 16.2 14.1 10.6 7.6 17.5 25.3 16.4 37.2 7.4 5.0
Whisper medium 2.9 5.9 3.8 2.9 16.4 14.0 10.3 7.2 16.6 26.4 16.6 36.0 7.4 5.4
Whisper large 2.7 5.6 4.0 3.1 15.8 13.1 9.5 6.7 19.4 25.6 16.4 36.9 7.3 4.6
wav2vec2-base-100h 6.0 13.4 17.8 13.9 46.9 40.2 47.4 40.8 47.0 79.9 48.1 81.2 28.9 23.1
wav2vec2-base-960h 3.3 8.5 12.8 8.9 40.6 32.9 36.4 30.9 39.9 68.5 40.2 71.9 21.4 17.4
wav2vec2-large-960h-lv60-self 1.8 3.8 7.4 4.4 29.1 22.2 19.9 15.8 29.2 56.3 30.8 57.0 13.0 10.2
wav2vec2-large-960h 2.7 6.2 10.5 7.7 34.8 28.3 29.9 24.5 35.6 65.8 37.0 67.6 17.9 14.6
wav2vec2-large-robust-ft-libri-960h 2.6 5.3 9.2 6.1 23.4 19.8 20.3 16.2 29.4 58.1 31.7 61.6 15.1 11.8
asr-crdnn-rnnlm-librispeech 3.0 9.7 17.7 10.7 59.7 56.1 43.7 33.3 83.8 81.0 57.2 85.8 30.6 32.4
asr-transformer-transformerlm-librispeech 2.1 5.4 11.9 7.4 38.9 33.0 30.6 23.5 44.9 79.5 44.5 75.4 17.8 17.0
hubert-large-ls960-ft 2.0 4.1 8.4 5.4 29.6 22.8 20.8 16.0 32.0 60.0 33.7 59.1 14.4 10.9
hubert-xlarge-ls960-ft 1.9 3.5 8.3 5.4 29.3 22.2 19.8 14.8 31.5 58.5 33.3 58.9 14.2 10.5
s2t-large-librispeech-asr 3.3 8.1 14.9 9.4 54.5 40.3 38.1 30.7 50.2 79.2 53.4 79.5 21.6 18.0
s2t-medium-librispeech-asr 3.6 8.2 15.7 9.7 58.1 42.4 39.3 31.3 52.6 79.8 60.3 85.3 22.9 19.7
stt en conformer ctc large 2.1 4.2 4.4 2.1 11.3 8.2 7.4 4.0 13.5 30.5 15.9 39.9 6.7 8.2
stt en conformer transducer xlarge 1.5 2.8 4.3 1.2 12.0 7.4 4.3 1.5 19.9 36.8 20.5 48.6 6.0 6.3
unispeech-sat-base-100h-libri-ft 5.7 13.8 17.7 13.6 46.5 40.0 45.3 38.6 44.7 74.8 47.8 77.7 29.8 22.4
LibriSpeech.test-other
CommonVoice5.1
TED-LIUM3
VoxPopuli.en
Switchboard
AMI-SDM1
Fleurs.en us
AMI-IHM
CallHome
CORAAL
CHiME6
Artie
WSJ
Model
Whisper tiny.en 5.4 12.8 5.4 4.6 21.4 16.0 23.5 18.4 21.4 42.0 22.7 54.2 10.9 10.0
Whisper tiny 6.7 15.0 6.3 5.9 24.8 18.3 26.1 20.8 25.1 48.0 25.6 57.3 11.6 12.4
Whisper base.en 4.1 9.6 4.6 4.0 18.3 14.2 17.5 13.2 18.5 35.2 21.1 49.0 9.3 7.1
Whisper base 4.9 11.0 5.0 4.4 20.5 15.6 19.4 15.3 20.5 40.0 21.5 50.0 9.5 8.9
Whisper small.en 3.2 6.7 4.3 3.0 17.2 13.4 12.6 9.2 17.5 29.5 17.9 42.5 8.1 5.3
Whisper small 3.3 7.2 4.3 3.9 17.1 13.3 12.8 9.3 16.4 30.9 19.2 43.5 8.2 6.1
Whisper medium.en 3.0 5.7 4.3 2.8 14.7 12.4 10.3 7.4 15.3 27.0 17.1 39.4 7.8 4.5
Whisper medium 2.7 5.6 4.0 2.7 15.3 13.2 9.7 6.7 14.9 27.6 17.6 43.0 7.6 4.4
Whisper large 2.8 5.7 4.3 3.5 16.2 14.2 8.9 6.4 15.1 25.2 17.6 37.1 7.2 4.5
Table 9. English transcription WER (%) with beam search and temperature fallback
Robust Speech Recognition via Large-Scale Weak Supervision 23
Portuguese
German
Spanish
English
French
Italian
Polish
Dutch
Model
Whisper tiny 39.4 15.7 36.8 24.9 41.7 34.2 31.3 19.2
Whisper base 28.4 11.7 26.6 17.7 31.1 22.8 21.9 12.8
Whisper small 17.2 8.3 16.2 10.5 21.4 11.2 13.0 7.8
Whisper medium 11.7 6.8 8.9 7.4 16.0 6.5 9.0 5.3
Whisper large 10.2 6.3 8.9 6.6 14.3 6.6 9.2 5.4
Estonian
German
Spanish
Catalan
Bengali
English
Persian
Danish
Arabic
Czech
Welsh
Greek
Model
Whisper tiny 90.9 79.3 104.1 51.0 79.7 101.8 77.2 34.5 61.9 28.8 30.3 102.1 120.3
Whisper base 84.4 68.1 103.7 39.9 63.1 93.8 57.5 24.5 51.5 21.9 19.6 88.1 99.0
Whisper small 66.4 44.8 118.6 23.8 34.1 65.4 32.1 13.0 31.7 14.5 10.3 67.2 71.9
Whisper medium 60.3 26.7 124.7 16.4 18.8 43.6 19.3 8.5 20.0 11.2 6.9 45.6 49.9
Whisper large 56.0 24.1 106.0 15.3 17.1 40.3 18.3 7.7 18.3 10.1 6.4 41.4 44.8
Malayalam
Indonesian
Lithuanian
Mongolian
Hungarian
Japanese
Latvian
Finnish
French
Italian
Polish
Dutch
Hindi
Model
Whisper tiny 68.5 49.7 108.3 87.0 49.6 44.5 36.1 103.5 87.8 102.7 123.0 43.6 45.3
Whisper base 52.9 37.3 106.5 71.9 36.1 30.5 24.2 91.3 78.0 122.9 137.0 29.5 32.8
Whisper small 30.5 22.7 43.6 44.4 18.4 16.0 14.0 72.8 54.6 104.8 225.8 14.2 16.9
Whisper medium 18.8 16.0 31.5 26.9 11.6 9.4 10.5 49.4 37.2 137.8 113.4 8.0 10.1
Whisper large 17.0 14.7 25.0 23.5 10.6 8.1 9.4 43.9 34.8 107.1 117.4 7.1 9.0
Vietnamese
Portuguese
Romanian
Slovenian
Swedish
Chinese
Russian
Turkish
Serbian
Slovak
Tamil
Urdu
Thai
Model
Whisper tiny 35.2 68.2 40.6 104.0 82.0 106.1 58.2 105.7 55.9 53.6 74.7 69.3 52.4
Whisper base 23.7 55.9 28.8 87.2 70.3 103.0 42.4 49.5 32.1 38.6 58.6 51.6 44.9
Whisper small 12.5 33.2 15.0 60.4 45.5 101.3 22.1 28.7 18.1 23.7 39.1 33.3 29.4
Whisper medium 8.1 21.5 9.3 42.0 29.8 85.6 13.7 19.6 10.5 17.7 29.9 24.4 23.2
Whisper large 7.1 19.8 8.2 37.9 25.1 87.4 12.4 17.6 8.8 16.6 28.1 19.9 29.1
Lithuanian
Hungarian
Romanian
Slovenian
Estonian
Croatian
German
Spanish
Finnish
English
French
Slovak
Italian
Polish
Czech
Dutch
Model
Whisper tiny 73.5 27.4 11.6 18.8 19.7 99.2 54.1 32.9 72.4 74.5 40.5 93.1 41.9 31.4 65.9 78.7 81.9
Whisper base 54.7 20.6 9.5 17.5 14.4 83.0 39.7 24.9 53.6 52.6 30.8 82.1 29.4 22.1 49.3 63.7 70.5
Whisper small 28.8 14.8 8.2 19.2 11.1 59.2 24.9 15.7 33.7 31.3 22.9 60.1 18.8 13.3 28.6 37.3 50.8
Whisper medium 18.4 12.4 7.6 19.1 9.6 38.2 16.6 12.2 23.9 19.3 19.7 39.3 14.9 10.1 18.4 23.0 36.3
Whisper large 15.9 11.9 7.2 20.8 8.8 33.3 15.5 11.0 19.0 16.8 18.4 35.0 14.0 9.0 17.0 19.1 31.3
D.2.4. F LEURS
Azerbaijani
Belarusian
Assamese
Afrikaans
Bulgarian
Amharic
Bosnian
Chinese
Catalan
Bengali
Danish
Arabic
Czech
Welsh
Model
Whisper tiny 91.2 122.9 63.4 102.0 93.1 94.0 81.0 101.6 82.1 42.8 40.5 82.8 101.3 82.0
Whisper base 81.5 196.8 48.8 102.0 76.4 91.3 65.1 100.6 66.7 29.0 34.1 66.0 85.3 57.6
Whisper small 61.1 120.2 30.6 108.0 49.1 75.1 37.3 104.4 39.4 16.2 20.8 37.6 59.3 32.8
Whisper medium 44.9 229.3 20.4 102.3 33.1 60.4 21.4 100.6 23.9 9.6 12.1 21.3 40.8 19.5
Whisper large 42.6 129.3 18.1 105.6 28.7 56.6 18.4 104.9 20.7 8.0 19.6 17.4 36.6 16.8
Estonian
Galician
German
Gujarati
Hebrew
Tagalog
Spanish
Finnish
English
Persian
French
Hausa
Greek
Hindi
Model
Whisper tiny 27.8 67.4 12.4 15.9 94.8 101.8 59.5 65.6 41.4 54.8 101.2 100.2 71.6 102.3
Whisper base 17.9 53.5 8.9 9.9 77.9 86.1 43.1 45.8 28.5 47.4 101.4 98.6 61.7 101.1
Whisper small 10.2 30.8 6.1 5.6 51.3 55.8 24.0 27.7 15.0 30.2 106.4 90.1 44.4 38.4
Whisper medium 6.5 19.0 4.4 3.6 29.8 41.0 13.9 19.1 8.7 21.2 104.8 106.6 33.1 26.8
Whisper large 5.5 18.7 4.5 3.5 25.5 36.1 12.2 15.8 7.7 19.0 103.9 87.0 30.2 26.9
Luxembourgish
Indonesian
Hungarian
Armenian
Icelandic
Georgian
Kannada
Javanese
Japanese
Croatian
Kazakh
Korean
Khmer
Italian
Model
Whisper tiny 79.0 83.8 118.6 51.7 113.3 29.8 37.0 107.3 123.0 165.2 100.6 100.7 36.1 99.1
Whisper base 59.1 65.0 126.3 33.1 95.5 17.9 22.8 89.5 114.7 109.2 101.6 107.2 27.8 100.7
Whisper small 33.4 38.9 86.6 16.3 72.6 9.8 12.0 88.6 118.3 70.3 104.4 100.4 19.6 100.1
Whisper medium 19.3 24.3 60.1 10.2 49.9 5.2 7.1 67.9 117.3 48.8 98.9 77.7 16.4 90.0
Whisper large 16.7 21.0 53.7 8.5 43.0 4.2 6.4 87.0 100.5 43.8 96.0 69.8 15.2 86.5
Macedonian
Malayalam
Lithuanian
Norwegian
Mongolian
Myanmar
Marathi
Maltese
Latvian
Lingala
Nepali
Malay
Maori
Lao
Model
Whisper tiny 105.4 115.1 98.5 91.6 94.5 73.3 101.5 113.7 100.3 51.2 100.8 124.8 62.0 101.8
Whisper base 96.7 105.1 87.3 79.8 77.5 59.9 107.4 125.7 100.3 35.1 97.6 122.6 44.0 102.4
Whisper small 91.3 102.2 65.6 53.2 59.5 36.9 100.9 144.2 60.2 18.9 92.2 110.1 24.2 69.5
Whisper medium 83.2 101.4 41.1 32.0 77.8 22.0 101.1 103.7 63.2 12.2 83.2 123.0 12.9 54.4
Whisper large 76.8 101.6 35.2 28.3 45.7 20.6 101.4 106.2 43.7 10.2 80.5 124.5 11.4 52.2
Portuguese
Romanian
Slovenian
Russian
Occitan
Serbian
Punjabi
Somali
Slovak
Pashto
Sindhi
Polish
Shona
Dutch
Model
Whisper tiny 49.0 95.9 102.6 45.6 105.6 20.1 74.7 31.1 105.8 77.2 87.2 128.1 105.6 83.7
Whisper base 33.0 82.9 101.5 30.8 99.0 13.0 56.0 20.5 103.9 60.6 74.6 126.0 109.6 64.3
Whisper small 16.4 87.3 103.6 14.7 92.9 7.3 29.8 11.4 131.7 33.3 49.3 140.0 105.3 42.2
Whisper medium 9.9 79.5 102.0 8.0 119.4 5.0 20.0 7.2 147.0 17.3 31.9 143.9 104.0 44.9
Whisper large 8.3 75.9 102.8 7.2 92.7 4.8 15.4 6.4 177.9 15.7 27.8 130.0 103.5 29.2
Vietnamese
Ukrainian
Swedish
Turkish
Swahili
Yoruba
Telugu
Uzbek
Tamil
Urdu
Tajik
Thai
Model
Whisper tiny 52.7 100.9 99.9 105.1 101.7 58.8 42.5 51.2 65.2 105.2 60.0 106.4
Whisper base 37.4 92.5 58.7 105.2 109.3 38.2 27.5 37.7 52.0 114.0 40.5 101.8
Whisper small 20.8 73.7 35.2 98.2 84.3 21.9 15.9 19.3 37.3 107.7 21.2 116.4
Whisper medium 11.2 52.8 23.1 82.8 74.0 15.4 10.4 11.6 28.2 109.6 12.7 105.1
Whisper large 10.5 47.9 20.6 100.6 74.5 13.2 9.4 10.3 25.0 93.3 10.7 111.7
Azerbaijani
Belarusian
Assamese
Afrikaans
Bulgarian
Amharic
Bosnian
Chinese
Catalan
Bengali
Danish
Arabic
Czech
Welsh
Model
Whisper tiny 1.6 0.1 0.1 0.4 0.1 0.8 0.4 0.4 0.4 5.2 0.6 0.6 0.6 0.7
Whisper base 4.4 0.3 1.0 0.4 0.8 3.3 2.7 0.7 4.1 13.1 1.9 2.7 0.7 5.0
Whisper small 18.1 0.2 10.6 1.2 5.8 7.1 14.8 2.7 16.8 25.1 9.3 14.2 1.3 18.1
Whisper medium 29.5 0.9 19.9 3.5 11.7 9.8 23.9 10.6 26.0 31.9 15.1 23.6 8.4 28.6
Whisper large 31.6 1.1 23.8 3.9 13.1 11.0 26.2 12.0 28.0 33.7 16.8 25.6 11.2 31.6
Estonian
Galician
German
Gujarati
Hebrew
Tagalog
Spanish
Finnish
English
Persian
French
Hausa
Greek
Hindi
Model
Whisper tiny 5.2 0.1 68.6 7.7 0.1 0.1 0.2 0.8 4.7 4.0 0.7 0.1 0.2 1.0
Whisper base 13.7 0.7 73.3 12.4 0.3 0.2 0.5 2.1 13.1 10.5 1.5 0.0 0.6 3.4
Whisper small 25.9 11.6 77.3 18.2 3.6 5.8 7.3 12.0 23.5 17.5 3.9 0.3 5.4 11.1
Whisper medium 31.4 19.9 79.2 21.4 13.5 15.0 18.5 20.5 28.6 24.7 12.8 0.5 15.9 19.4
Whisper large 34.3 21.7 77.8 22.8 15.9 17.6 20.6 22.7 31.6 26.0 14.8 0.5 19.6 20.7
Luxembourgish
Indonesian
Hungarian
Armenian
Icelandic
Georgian
Kannada
Javanese
Japanese
Croatian
Kazakh
Korean
Khmer
Italian
Model
Whisper tiny 0.6 0.1 0.1 0.3 0.4 5.3 0.2 0.2 0.1 0.1 0.1 0.8 0.5 0.8
Whisper base 3.7 0.2 0.1 2.6 0.4 11.3 1.5 0.2 0.2 0.2 0.1 0.9 3.7 1.7
Whisper small 14.6 4.8 0.7 16.4 1.8 17.8 9.6 1.4 0.2 0.8 0.5 2.3 12.2 5.7
Whisper medium 23.0 15.5 10.4 24.1 6.8 21.6 14.9 5.0 1.3 4.3 3.3 8.5 19.2 13.6
Whisper large 25.4 18.3 13.2 27.2 6.6 23.5 17.0 5.1 2.7 6.3 5.2 9.9 20.0 15.4
Macedonian
Malayalam
Lithuanian
Norwegian
Mongolian
Myanmar
Marathi
Maltese
Latvian
Lingala
Nepali
Malay
Maori
Lao
Model
Whisper tiny 0.1 0.2 0.1 0.2 0.3 1.0 0.8 0.1 0.2 0.3 0.6 0.1 1.4 0.1
Whisper base 0.1 0.3 0.3 0.4 1.0 5.4 1.4 0.1 0.9 2.1 1.4 0.1 8.4 0.3
Whisper small 0.5 2.0 1.9 1.5 3.9 15.3 5.7 0.1 3.8 14.1 4.9 0.0 22.0 2.9
Whisper medium 0.9 8.1 9.6 10.0 8.5 23.5 13.8 0.5 10.9 23.2 11.2 0.2 29.1 12.7
Whisper large 1.2 9.3 12.0 12.5 9.4 26.4 16.5 1.0 13.1 25.5 12.8 0.5 30.5 12.9
Portuguese
Romanian
Slovenian
Russian
Occitan
Serbian
Punjabi
Somali
Slovak
Pashto
Sindhi
Polish
Shona
Dutch
Model
Whisper tiny 2.7 1.7 0.3 0.8 0.3 12.1 1.0 3.1 0.5 0.7 0.3 0.1 0.0 0.6
Whisper base 7.5 4.2 1.1 5.1 0.4 22.4 4.9 12.1 0.7 4.6 1.3 0.3 0.1 5.4
Whisper small 15.9 9.5 4.4 14.0 0.8 31.2 18.3 19.7 2.0 14.4 6.9 0.6 0.1 19.3
Whisper medium 21.6 15.9 12.8 19.0 2.1 35.9 26.6 24.8 5.5 22.7 14.0 1.4 0.4 27.7
Whisper large 22.8 16.8 14.6 21.4 3.7 37.4 29.1 26.7 5.9 25.1 16.9 1.8 0.5 30.5
Vietnamese
Ukrainian
Swedish
Turkish
Swahili
Yoruba
Telugu
Uzbek
Tamil
Urdu
Tajik
Thai
Model
Whisper tiny 1.8 0.1 0.2 0.3 0.2 0.2 0.2 1.2 0.4 0.0 0.1 0.2
Whisper base 9.1 0.1 0.4 0.4 0.2 0.7 2.4 6.9 1.5 0.2 0.9 0.5
Whisper small 22.9 0.1 2.1 4.0 4.4 5.8 15.7 18.7 8.8 0.5 8.5 0.5
Whisper medium 32.1 3.1 7.0 10.8 11.4 12.8 22.9 25.8 14.9 3.8 16.6 0.9
Whisper large 33.1 5.3 8.5 10.9 13.0 15.2 25.7 28.0 16.3 5.8 19.5 1.2
D.3.2. C OVOST 2
Indonesian
Mongolian
Estonian
Japanese
German
Spanish
Catalan
Latvian
Persian
French
Arabic
Italian
Welsh
Model
Whisper tiny 0.2 4.9 0.4 4.0 10.5 0.2 0.1 6.1 0.3 5.1 0.3 0.1 0.1
Whisper base 1.2 11.0 0.5 11.7 21.3 0.3 0.1 15.4 4.9 13.0 4.9 0.5 0.1
Whisper small 17.7 22.3 1.0 25.3 33.0 2.4 4.9 27.3 27.6 24.0 17.3 1.4 0.2
Whisper medium 30.6 29.2 12.1 33.2 38.4 11.4 15.5 33.6 42.3 29.5 24.6 9.7 0.2
Whisper large 35.5 30.3 16.1 34.3 38.0 13.4 17.5 34.4 45.4 29.1 24.2 10.5 0.3
Portuguese
Slovenian
Swedish
Chinese
Russian
Turkish
Dutch
Tamil
Model
Whisper tiny 4.3 9.5 5.7 0.4 2.0 0.1 0.2 0.4
Whisper base 12.4 23.2 16.1 1.4 10.5 0.4 2.8 1.4
Whisper small 28.1 40.6 30.9 9.2 29.9 1.7 16.8 6.8
Whisper medium 38.1 48.7 39.4 17.7 39.5 2.9 27.0 14.0
Whisper large 39.3 48.6 41.6 23.9 40.3 3.7 26.7 17.1
Earnings-21
Earnings-22
Meanwhile
Kincaid46
CORAAL
Rev16
Model
Whisper tiny.en 5.5 12.8 13.8 15.1 17.0 22.0 30.3
Whisper tiny 6.8 15.5 16.7 17.0 18.7 24.4 33.1
Whisper base.en 4.6 9.4 11.2 13.2 12.5 16.6 25.2
Whisper base 4.8 12.2 12.2 14.5 13.5 18.4 26.9
Whisper small.en 4.6 6.0 9.4 12.0 10.8 14.0 21.9
Whisper small 4.2 6.9 10.1 12.1 11.1 14.3 22.3
Whisper medium.en 3.6 5.2 8.9 11.9 10.2 13.3 20.6
Whisper medium 3.8 5.4 8.6 11.4 10.3 13.2 20.3
Whisper large 3.8 5.3 8.8 11.0 10.3 13.4 20.4
wav2vec2-base-100h 17.6 27.7 39.3 35.2 45.7 57.1 55.4
wav2vec2-base-960h 12.8 19.7 32.9 29.8 37.3 46.8 49.1
wav2vec2-large-960h-lv60-self 7.2 11.4 21.1 21.3 21.7 28.0 36.7
wav2vec2-large-960h 10.1 16.4 27.4 26.4 30.4 40.1 43.5
wav2vec2-large-robust-ft-libri-960h 8.8 15.2 22.9 23.4 23.0 31.0 36.8
hubert-large-ls960-ft 8.1 12.9 22.4 23.4 23.0 30.6 37.9
hubert-xlarge-ls960-ft 8.1 12.5 22.9 23.2 23.1 31.3 38.1
stt en conformer ctc large 4.0 9.8 13.1 14.5 12.6 17.6 25.1
stt en conformer transducer xlarge 5.3 10.6 17.1 19.8 16.2 19.7 38.9
F. Hyperparameters
Hyperparameter Value
Updates 1048576
Batch Size 256
Warmup Updates 2048
Max grad norm 1.0
Optimizer AdamW
β1 0.9
β2 0.98
ϵ 10−6
Weight Decay 0.1
Weight Init Gaussian Fan-In
Learning Rate Schedule Linear Decay
Speechless audio subsample factor 10×
Condition on prior text rate 50%