BLEURT: Learning Robust Metrics For Text Generation

Thibault Sellam    Dipanjan Das    Ankur P. Parikh
Google Research
New York, NY
{tsellam, dipanjand, aparikh}@google.com
sentences given French sentences. If $|\tilde{z}|$ is the number of tokens in $\tilde{z}$, we define our score as $\tau_{\text{en-fr},\tilde{z}|z} = \frac{\log P(\tilde{z}|z)}{|\tilde{z}|}$, with:

$$P(\tilde{z}|z) = \sum_{z_{\text{fr}}} P_{\text{fr}\rightarrow\text{en}}(\tilde{z}|z_{\text{fr}})\, P_{\text{en}\rightarrow\text{fr}}(z_{\text{fr}}|z)$$

Because computing the summation over all possible French sentences is intractable, we approximate the sum using $z^*_{\text{fr}} = \arg\max_{z_{\text{fr}}} P_{\text{en}\rightarrow\text{fr}}(z_{\text{fr}}|z)$, and we assume that $P_{\text{en}\rightarrow\text{fr}}(z^*_{\text{fr}}|z) \approx 1$:

$$P(\tilde{z}|z) \approx P_{\text{fr}\rightarrow\text{en}}(\tilde{z}|z^*_{\text{fr}})$$

For a regression task $\tau_k$, the prediction is $\hat{\tau}_k = W_{\tau_k} \tilde{v}_{[\text{CLS}]} + b_{\tau_k}$. If $\tau_k$ is a classification task, we use a separate linear layer to predict a logit for each class $c$: $\hat{\tau}_{kc} = W_{\tau_{kc}} \tilde{v}_{[\text{CLS}]} + b_{\tau_{kc}}$, and we use the multiclass cross-entropy loss. We define our aggregate pre-training loss function as follows:

$$\ell_{\text{pre-training}} = \frac{1}{M} \sum_{m=1}^{M} \sum_{k=1}^{K} \gamma_k\, \ell_k(\tau_k^m, \hat{\tau}_k^m) \qquad (1)$$

where $\tau_k^m$ is the target vector for example $m$, $M$ is the number of synthetic examples, and $\gamma_k$ are hyperparameter weights obtained with grid search (more details in the Appendix).
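To make the aggregation in Equation (1) concrete, the following is a minimal sketch of the weighted multi-task loss; the task names, example values, and loss functions are placeholders chosen for illustration, not the paper's actual configuration.

```python
# Illustrative sketch (not the authors' code): aggregating per-task losses as
# in Eq. (1). Task names, weights, and loss functions are placeholders.
import numpy as np

def squared_error(target, pred):
    # Simple regression loss for scalar pre-training signals.
    return float(np.mean((np.asarray(target) - np.asarray(pred)) ** 2))

def cross_entropy(target_class, logits):
    # Multiclass cross-entropy from raw logits (log-softmax of the target class).
    logits = np.asarray(logits, dtype=float)
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return float(-log_probs[target_class])

def pretraining_loss(examples, task_losses, gammas):
    """examples: list of dicts mapping task name -> (target, prediction).
    task_losses: task name -> loss function (regression or classification).
    gammas: task name -> weight (obtained by grid search in the paper)."""
    total = 0.0
    for ex in examples:                          # sum over m = 1..M
        for task, (target, pred) in ex.items():  # sum over k = 1..K
            total += gammas[task] * task_losses[task](target, pred)
    return total / len(examples)                 # divide by M

# Toy usage with two tasks: a regression signal and a 2-class signal.
examples = [
    {"bleu": (0.31, 0.28), "entail": (1, [0.2, 1.4])},
    {"bleu": (0.75, 0.70), "entail": (0, [0.9, -0.3])},
]
losses = {"bleu": squared_error, "entail": cross_entropy}
gammas = {"bleu": 1.0, "entail": 0.5}
print(pretraining_loss(examples, losses, gammas))
```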
Table 2: Agreement with human ratings on the WMT17 Metrics Shared Task. The metrics are Kendall Tau (τ) and the Pearson correlation (r, the official metric of the shared task), divided by 100.

Table 3: Agreement with human ratings on the WMT18 Metrics Shared Task. The metrics are Kendall Tau (τ) and WMT's Direct Assessment metrics divided by 100. The star * indicates results that are more than 0.2 percentage points away from the official WMT results (up to 0.4 percentage points away).
5.1 WMT Metrics Shared Task

Datasets and Metrics: We use years 2017 to 2019 of the WMT Metrics Shared Task, to-English language pairs. For each year, we used the official WMT test set, which includes several thousand pairs of sentences with human ratings from the news domain. The training sets contain 5,360, 9,492, and 147,691 records for each year. The test sets for years 2018 and 2019 are noisier, as reported by the organizers and shown by the overall lower correlations.

We evaluate the agreement between the automatic metrics and the human ratings. For each year, we report two metrics: Kendall's Tau τ (for consistency across experiments), and the official WMT metric for that year (for completeness). The official WMT metric is either Pearson's correlation or a robust variant of Kendall's Tau called DARR, described in the Appendix. All the numbers come from our own implementation of the benchmark.⁴ Our results are globally consistent with the official results but we report small differences in 2018 and 2019, marked in the tables.

⁴ The official scripts are public but they suffer from documentation and dependency issues, as shown by a README file in the 2019 edition which explicitly discourages using them.

Models: We experiment with four versions of BLEURT: BLEURT, BLEURTbase, BLEURT -pre and BLEURTbase -pre. The first two models are based on BERT-large and BERT-base. In the latter two versions, we skip the pre-training phase and fine-tune directly on the WMT ratings. For each year of the WMT shared task, we use the test set from the previous years for training and validation. We describe our setup in further detail in the Appendix. We compare BLEURT to participant data from the shared task and automatic metrics that we ran ourselves. In the former case, we use the best-performing contestants for each year, that is, chrF++, BEER, Meteor++, RUSE, Yisi1, ESIM and Yisi1-SRL (Mathur et al., 2019). All the contestants use the same WMT training data, in addition to existing sentence or token embeddings. In the latter case, we use Moses sentenceBLEU, BERTscore (Zhang et al., 2020), and MoverScore (Zhao et al., 2019). For BERTscore, we use BERT-large uncased for fairness, and roBERTa (the recommended version) for completeness (Liu et al., 2019). We run MoverScore on WMT 2017 using the scripts published by the authors.
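The agreement statistics reported in the tables (Kendall's Tau and Pearson's correlation) can be illustrated with off-the-shelf functions; this is a rough sketch using scipy, not the reimplemented WMT benchmark used for the reported numbers, and the values below are toy data.

```python
# Illustrative sketch: segment-level agreement between a candidate metric and
# human ratings, using scipy instead of the paper's own WMT reimplementation.
from scipy.stats import kendalltau, pearsonr

human_scores  = [0.1, 0.4, 0.35, 0.8, 0.9]   # human ratings per segment (toy values)
metric_scores = [0.2, 0.5, 0.30, 0.7, 0.95]  # candidate metric scores per segment

tau, _ = kendalltau(human_scores, metric_scores)
r, _ = pearsonr(human_scores, metric_scores)
print(f"Kendall's Tau: {tau:.3f}, Pearson's r: {r:.3f}")
```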
model                   de-en        fi-en        gu-en        kk-en        lt-en        ru-en        zh-en        avg
                        τ / DA       τ / DA       τ / DA       τ / DA       τ / DA       τ / DA       τ / DA       τ / DA
sentBLEU                19.4 /  5.4  20.6 / 23.3  17.3 / 18.9  30.0 / 37.6  23.8 / 26.2  19.4 / 12.4  28.7 / 32.2  22.7 / 22.3
BERTscore w/ BERT       26.2 / 17.3  27.6 / 34.7  25.8 / 29.3  36.9 / 44.0  30.8 / 37.4  25.2 / 20.6  37.5 / 41.4  30.0 / 32.1
BERTscore w/ roBERTa    29.1 / 19.3  29.7 / 35.3  27.7 / 32.4  37.1 / 43.1  32.6 / 38.2  26.3 / 22.7  41.4 / 43.8  32.0 / 33.6
ESIM                    28.4 / 16.6  28.9 / 33.7  27.1 / 30.4  38.4 / 43.3  33.2 / 35.9  26.6 / 19.9  38.7 / 39.6  31.6 / 31.3
YiSi1 SRL 19            26.3 / 19.8  27.8 / 34.6  26.6 / 30.6  36.9 / 44.1  30.9 / 38.0  25.3 / 22.0  38.9 / 43.1  30.4 / 33.2
BLEURTbase -pre         30.1 / 15.8  30.4 / 35.4  26.8 / 29.7  37.8 / 41.8  34.2 / 39.0  27.0 / 20.7  40.1 / 39.8  32.3 / 31.7
BLEURTbase              31.0 / 16.6  31.3 / 36.2  27.9 / 30.6  39.5 / 44.6  35.2 / 39.4  28.5 / 21.5  41.7 / 41.6  33.6 / 32.9
BLEURT -pre             31.1 / 16.9  31.3 / 36.5  27.6 / 31.3  38.4 / 42.8  35.0 / 40.0  27.5 / 21.4  41.6 / 41.4  33.2 / 32.9
BLEURT                  31.2 / 16.9  31.7 / 36.3  28.3 / 31.9  39.5 / 44.6  35.2 / 40.6  28.3 / 22.3  42.7 / 42.4  33.8 / 33.6
Table 4: Agreement with human ratings on the WMT19 Metrics Shared Task. The metrics are Kendall Tau (τ) and WMT's Direct Assessment metrics divided by 100. All the values reported for Yisi1 SRL and ESIM fall within 0.2 percentage points of the official WMT results.
Figure 1: Distribution of the human ratings in the train/validation and test datasets for different skew factors (x-axis: ratings; skew factors 0, 0.5, 1.0, 1.5, and 3.0).

Figure 2: Agreement between BLEURT and human ratings for different skew factors in train and test (x-axis: test set skew; curves for BLEURT trained with skews 0, 0.5, 1.0, 1.5, and 3.0, plus BERTscore and BLEU as baselines).
Results: Tables 2, 3, and 4 show the results. For years 2017 and 2018, a BLEURT-based metric dominates the benchmark for each language pair (Tables 2 and 3). BLEURT and BLEURTbase are also competitive for year 2019: they yield the best results for all language pairs on Kendall's Tau, and they come first for 3 out of 7 pairs on DARR. As expected, BLEURT dominates BLEURTbase in the majority of cases. Pre-training consistently improves the results of BLEURT and BLEURTbase. We observe the largest effect on year 2017, where it adds up to 7.4 Kendall Tau points for BLEURTbase (zh-en). The effect is milder on years 2018 and 2019, up to 2.1 points (tr-en, 2018). We explain the difference by the fact that the training data used for 2017 is smaller than the datasets used for the following years, so pre-training is likelier to help. In general pre-training yields higher returns for BERT-base than for BERT-large; in fact, BLEURTbase with pre-training is often better than BLEURT without.

Takeaways: Pre-training delivers consistent improvements, especially for BLEURTbase. BLEURT yields state-of-the-art performance for all years of the WMT Metrics Shared Task.

5.2 Robustness to Quality Drift

We assess our claim that pre-training makes BLEURT robust to quality drifts, by constructing a series of tasks for which it is increasingly pressured to extrapolate. All the experiments that follow are based on the WMT Metrics Shared Task 2017, because the ratings for this edition are particularly reliable.⁵

⁵ The organizers managed to collect 15 adequacy scores for each translation, and thus the ratings are almost perfectly repeatable (Bojar et al., 2017).

Methodology: We create increasingly challenging datasets by sub-sampling the records from the WMT Metrics Shared Task, keeping low-rated translations for training and high-rated translations for test. The key parameter is the skew factor α, which measures how much the training data is left-skewed and the test data is right-skewed. Figure 1 shows the ratings distributions that we used in our experiments. The training data shrinks as α increases: in the most extreme case (α = 3.0), we use only 11.9% of the original 5,344 training records. We give the full detail of our sampling methodology in the Appendix.

We use BLEURT with and without pre-training and we compare to Moses sentBLEU and BERTscore. We use BERT-large uncased for both BLEURT and BERTscore.
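The paper's appendix gives the exact sub-sampling procedure; the sketch below only illustrates the general idea of a skewed train/test split, and the power-law weighting function is an assumption made for the example, not the authors' scheme.

```python
# Illustrative sketch of skewed sub-sampling: low-rated translations are
# preferred for training, high-rated ones for test. The weighting function
# is an assumption for illustration, not the paper's exact procedure.
import random

def skewed_sample(records, alpha, n, for_training=True, seed=0):
    """records: list of (segment, human_rating) with ratings in [0, 1].
    alpha: skew factor; larger values produce a more extreme split."""
    rng = random.Random(seed)
    if for_training:
        weights = [(1.0 - rating) ** alpha for _, rating in records]  # favor low ratings
    else:
        weights = [rating ** alpha for _, rating in records]          # favor high ratings
    return rng.choices(records, weights=weights, k=n)

# Toy usage on fake ratings: the mean rating of the train sample should be
# noticeably lower than that of the test sample.
records = [(f"seg{i}", i / 99.0) for i in range(100)]
train = skewed_sample(records, alpha=1.5, n=20, for_training=True)
test = skewed_sample(records, alpha=1.5, n=20, for_training=False)
print(sum(r for _, r in train) / len(train), sum(r for _, r in test) / len(test))
```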
Figure 3: Absolute Kendall Tau of BLEU, Meteor, and BLEURT with human judgements on the WebNLG dataset, varying the size of the data used for training and validation (rows: fluency, grammar, semantics; columns: split by system, split by input; y-axis: Kendall Tau with human ratings; x-axis: number of systems/inputs and records used for training and validation; metrics shown: BLEU, TER, Meteor, BERTscore, BLEURT -pre -wmt, BLEURT -wmt, BLEURT).
Results: Figure 2 presents BLEURT's performance as we vary the train and test skew independently. Our first observation is that the agreements fall for all metrics as we increase the test skew. This effect was already described in the 2019 WMT Metrics report (Ma et al., 2019). A common explanation is that the task gets more difficult as the ratings get closer: it is easier to discriminate between "good" and "bad" systems than to rank "good" systems.

Training skew has a disastrous effect on BLEURT without pre-training: it is below BERTscore for α = 1.0, and it falls under sentBLEU for α ≥ 1.5. Pre-trained BLEURT is much more robust: the only case in which it falls under the baselines is α = 3.0, the most extreme drift, for which incorrect translations are used for training while excellent ones are used for test.

Takeaways: Pre-training makes BLEURT significantly more robust to quality drifts.

5.3 WebNLG Experiments

In this section, we evaluate BLEURT's performance on three tasks from a data-to-text dataset, the WebNLG Challenge 2017 (Shimorina et al., 2019). The aim is to assess BLEURT's capacity to adapt to new tasks with limited training data.

Dataset and Evaluation Tasks: The WebNLG challenge benchmarks systems that produce natural language descriptions of entities (e.g., buildings, cities, artists) from sets of 1 to 5 RDF triples. The organizers released the human assessments for 9 systems over 223 inputs, that is, 4,677 sentence pairs in total (we removed null values). Each input comes with 1 to 3 reference descriptions. The submissions are evaluated on 3 aspects: semantics, grammar, and fluency. We treat each type of rating as a separate modeling task. The data has no natural split between train and test, therefore we experiment with several schemes. We allocate 0% to about 50% of the data to training, and we split either on the evaluated systems or on the RDF inputs in order to test different generalization regimes.

Systems and Baselines: BLEURT -pre -wmt is a public BERT-large uncased checkpoint directly trained on the WebNLG ratings. BLEURT -wmt was first pre-trained on synthetic data, then fine-tuned on WebNLG data. BLEURT was trained in three steps: first on synthetic data, then on WMT data (2016-2018), and finally on WebNLG data. When a record comes with several references, we run BLEURT on each reference and report the highest value (Zhang et al., 2020). We report four baselines: BLEU, TER, Meteor, and BERTscore. The first three were computed by the WebNLG competition organizers. We ran the latter one ourselves, using BERT-large uncased for a fair comparison.
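The multi-reference handling described above is a simple maximum over per-reference scores; a minimal sketch follows, where `bleurt_score` is a hypothetical stand-in for the trained model rather than the released BLEURT API.

```python
# Illustrative sketch: when an input has several references, score the
# candidate against each one and keep the highest value (Zhang et al., 2020).
# `bleurt_score` is a placeholder for the trained BLEURT model.

def bleurt_score(candidate, reference):
    # Stand-in scorer so the example runs; the real model returns a learned score.
    overlap = len(set(candidate.split()) & set(reference.split()))
    return overlap / max(len(reference.split()), 1)

def multi_reference_score(candidate, references):
    return max(bleurt_score(candidate, ref) for ref in references)

print(multi_reference_score(
    "the eiffel tower is located in paris",
    ["the eiffel tower is in paris", "paris is home to the eiffel tower"],
))
```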
Results: Figure 3 presents the correlation of the metrics with human assessments as we vary the share of data allocated to training. The more pre-trained BLEURT is, the quicker it adapts. The vanilla BERT approach BLEURT -pre -wmt requires about a third of the WebNLG data to dominate the baselines on the majority of tasks, and it still lags behind on semantics (split by system). In contrast, BLEURT -wmt is competitive with as little as 836 records, and BLEURT is comparable with BERTscore with zero fine-tuning.

Takeaways: Thanks to pre-training, BLEURT can quickly adapt to the new tasks. BLEURT fine-tuned twice (first on synthetic data, then on WMT data) provides acceptable results on all tasks without training data.

5.4 Ablation Experiments

Figure 4: Improvement in Kendall Tau on WMT 17 varying the pre-training tasks (panels: "1 task", where 0% corresponds to no pre-training, and "N−1 tasks", where 0% corresponds to all pre-training tasks; y-axis: relative improvement/degradation in %; bars for BLEURT and BLEURTbase).

Figure 4 presents our ablation experiments on WMT 2017, which highlight the relative importance of each pre-training task. On the left side, we compare BLEURT pre-trained on a single task to BLEURT without pre-training. On the right side, we compare full BLEURT to BLEURT pre-trained on all tasks except one. Pre-training on BERTscore, entailment, and the backtranslation scores yields improvements (symmetrically, ablating them degrades BLEURT). Conversely, BLEU and ROUGE have a negative impact. We conclude that pre-training on high quality signals helps BLEURT, but that metrics that correlate less well with human judgment may in fact harm the model.⁶

6 Related Work

the creation of many learned metrics, some of which use regression or deep learning (Stanojevic and Simaan, 2014; Ma et al., 2017; Shimanaka et al., 2018; Chen et al., 2017; Mathur et al., 2019). Other metrics have been introduced, such as the recent MoverScore (Zhao et al., 2019), which combines contextual embeddings and Earth Mover's Distance. We provide a head-to-head comparison with the best performing of those in our experiments. Other approaches do not attempt to estimate quality directly, but use information extraction or question answering as a proxy (Wiseman et al., 2017; Goodrich et al., 2019; Eyal et al., 2019). Those are complementary to our work.

There has been recent work that uses BERT for evaluation. BERTScore (Zhang et al., 2020) proposes replacing the hard n-gram overlap of BLEU with a soft overlap using BERT embeddings. We use it in all our experiments. Bertr (Mathur et al., 2019) and YiSi (Mathur et al., 2019) also make use of BERT embeddings to capture similarity. Sum-QE (Xenouleas et al., 2019) fine-tunes BERT for quality estimation as we describe in Section 3. Our focus is different: we train metrics that are not only state-of-the-art in conventional IID experimental setups, but also robust in the presence of scarce and out-of-distribution training data. To our knowledge no existing work has explored pre-training and extrapolation in the context of NLG.

Previous studies have used noising for referenceless evaluation (Dušek et al., 2019). Noisy pre-training has also been proposed before for other tasks such as paraphrasing (Wieting et al., 2016; Tomar et al., 2017) but generally not with synthetic data. Generating synthetic data via paraphrases and perturbations has been commonly used for generating adversarial examples (Jia and Liang, 2017; Iyyer et al., 2018; Belinkov and Bisk, 2018; Ribeiro et al., 2018), an orthogonal line of research.

7 Conclusion
References

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT.

Ondřej Dušek, Karin Sevegnani, Ioannis Konstas, and Verena Rieser. 2019. Automatic quality estimation for natural language generation: Ranting (jointly rating and ranking). In Proceedings of INLG.

Matan Eyal, Tal Baumel, and Michael Elhadad. 2019. Question answering as an automatic evaluation metric for news article summarization. In Proceedings of NAACL-HLT.

Qingsong Ma, Ondřej Bojar, and Yvette Graham. 2018. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In Proceedings of WMT.

Qingsong Ma, Yvette Graham, Shugen Wang, and Qun Liu. 2017. Blend: a novel combined MT metric based on direct assessment - CASICT-DCU submission to WMT17 metrics task. In Proceedings of WMT.

Qingsong Ma, Johnny Wei, Ondřej Bojar, and Yvette Graham. 2019. Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In Proceedings of WMT.

Inderjeet Mani. 1999. Advances in Automatic Text Summarization. MIT Press.

Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2019. Putting evaluation in context: Contextual embeddings improve machine translation evaluation. In Proceedings of ACL.

Kathleen McKeown. 1992. Text Generation. Cambridge University Press.

Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of EMNLP.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of ACL.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of ACL.

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2018. RUSE: Regressor using sentence embeddings for automatic machine translation evaluation. In Proceedings of WMT.

Anastasia Shimorina, Claire Gardent, Shashi Narayan, and Laura Perez-Beltrachini. 2019. WebNLG challenge: Human evaluation results. Technical report.
Ronnie W. Smith and D. Richard Hipp. 1994. Spoken Natural Language Dialog Systems: A Practical Approach. Oxford University Press.

Milos Stanojevic and Khalil Simaan. 2014. BEER: Better evaluation as ranking. In Proceedings of WMT.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NIPS.

Ran Tian, Shashi Narayan, Thibault Sellam, and Ankur P. Parikh. 2019. Sticking to the facts: Confident decoding for faithful data-to-text generation. arXiv:1910.08684.

Gaurav Singh Tomar, Thyago Duque, Oscar Täckström, Jakob Uszkoreit, and Dipanjan Das. 2017. Neural paraphrase identification of questions with noisy pretraining. In Proceedings of the First Workshop on Subword and Character Level Models in NLP.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. In Proceedings of ICML.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Towards universal paraphrastic sentence embeddings. In Proceedings of ICLR.

Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT.

Sam Wiseman, Stuart M. Shieber, and Alexander M. Rush. 2017. Challenges in data-to-document generation. In Proceedings of EMNLP.

Stratos Xenouleas, Prodromos Malakasiotis, Marianna Apidianaki, and Ion Androutsopoulos. 2019. SUM-QE: a BERT-based summary quality estimation model. In Proceedings of EMNLP.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In Proceedings of ICLR.

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of EMNLP.

A Implementation Details of the Pre-Training Phase

This section provides implementation details for some of the pre-training techniques described in the main paper.

A.1 Data Generation

Random Masking: We use two masking strategies. The first strategy samples random words in the sentence and replaces them with masks (one for each token). Thus, the masks are scattered across the sentence. The second strategy creates contiguous sequences: it samples a start position s and a length l (uniformly distributed), and it masks all the tokens spanned by words between positions s and s + l. In both cases, we use up to 15 masks per sentence. Instead of running the language model once and picking the most likely token at each position, we use beam search (beam size 8 by default). This enforces consistency and avoids repeated sequences, e.g., ",,,".
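A minimal sketch of the two masking strategies (scattered vs. contiguous spans) follows; the whitespace tokenization, the "[MASK]" token name, and the random seeds are assumptions made for the example, while the 15-mask cap is taken from the description above.

```python
# Illustrative sketch of the two masking strategies described above.
# "[MASK]" is assumed as the mask token; the sentence is split on whitespace
# for simplicity (the paper masks at the token level).
import random

MAX_MASKS = 15

def mask_scattered(tokens, n_masks, rng):
    # Strategy 1: replace n_masks randomly chosen words, scattered over the sentence.
    n_masks = min(n_masks, MAX_MASKS, len(tokens))
    positions = set(rng.sample(range(len(tokens)), n_masks))
    return ["[MASK]" if i in positions else tok for i, tok in enumerate(tokens)]

def mask_contiguous(tokens, rng):
    # Strategy 2: sample a start position s and a length l, mask the whole span.
    length = rng.randint(1, min(MAX_MASKS, len(tokens)))
    start = rng.randint(0, len(tokens) - length)
    return [
        "[MASK]" if start <= i < start + length else tok
        for i, tok in enumerate(tokens)
    ]

rng = random.Random(0)
sentence = "the eiffel tower is one of the most visited monuments in the world".split()
print(" ".join(mask_scattered(sentence, 3, rng)))
print(" ".join(mask_contiguous(sentence, rng)))
```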
Backtranslation: Consider English and French. Given a forward translation model $P_{\text{en}\rightarrow\text{fr}}(z_{\text{fr}}|z_{\text{en}})$ and a backward translation model $P_{\text{fr}\rightarrow\text{en}}(z_{\text{en}}|z_{\text{fr}})$, we generate $\tilde{z}$ as follows:

$$\tilde{z} = \arg\max_{z_{\text{en}}} P_{\text{fr}\rightarrow\text{en}}(z_{\text{en}}|z^*_{\text{fr}})$$

where $z^*_{\text{fr}} = \arg\max_{z_{\text{fr}}} P_{\text{en}\rightarrow\text{fr}}(z_{\text{fr}}|z)$. For the translations, we use a Transformer model (Vaswani et al., 2017), trained on English-German with the tensor2tensor framework.⁷

⁷ https://github.com/tensorflow/tensor2tensor

Word dropping: Given a synthetic example $(z, \tilde{z})$ we generate a pair $(z, \tilde{z}')$ by randomly dropping words from $\tilde{z}$. We draw the number of words to drop uniformly, up to the length of the sentence. We apply this transformation to about 30% of the data generated with the previous method.
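A minimal sketch of the word-dropping perturbation follows; the 30% application rate and the uniform draw come from the description above, while the whitespace tokenization and random seed are assumptions made for the example.

```python
# Illustrative sketch of the word-dropping perturbation: for roughly 30% of
# the synthetic pairs (z, z~), drop a uniformly drawn number of words from z~.
import random

def drop_words(sentence, rng):
    tokens = sentence.split()
    n_drop = rng.randint(0, len(tokens))                   # uniform, up to sentence length
    drop_positions = set(rng.sample(range(len(tokens)), n_drop))
    return " ".join(tok for i, tok in enumerate(tokens) if i not in drop_positions)

def perturb_dataset(pairs, rng, rate=0.3):
    """pairs: list of (z, z_tilde) synthetic examples."""
    out = []
    for z, z_tilde in pairs:
        if rng.random() < rate:                            # apply to ~30% of the data
            out.append((z, drop_words(z_tilde, rng)))
        else:
            out.append((z, z_tilde))
    return out

rng = random.Random(0)
pairs = [("the cat sat on the mat", "a cat was sitting on the mat")]
print(perturb_dataset(pairs, rng))
```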
A.2 Pre-Training Tasks

We now provide additional details on the signals we used for pre-training.

Automatic Metrics: As shown in the table, we use three types of signals: BLEU, ROUGE, and BERTscore. For BLEU, we used the original Moses sentenceBLEU implementation,⁸ using the Moses tokenizer and the default parameters. For ROUGE, we used the seq2seq implementation of ROUGE-N.⁹ We used a custom implementation of BERTscore, based on BERT-large uncased. ROUGE and BERTscore return three scores: precision, recall, and F-score. We use all three quantities.

⁸ https://github.com/moses-smt/mosesdecoder/blob/master/mert/sentence-bleu.cpp
⁹ https://github.com/google/seq2seq/blob/master/seq2seq/metrics/rouge.py
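As an illustration only, similar signals can be computed with common open-source packages (sacrebleu, rouge_score, and bert_score); these packages are stand-ins for the Moses, seq2seq, and custom implementations actually used in the paper, and the sentences are toy data.

```python
# Illustration only: computing BLEU-, ROUGE-, and BERTscore-style signals for
# one (reference, candidate) pair with open-source packages. These packages are
# stand-ins for the Moses, seq2seq, and custom implementations used in the paper.
# Requires: pip install sacrebleu rouge-score bert-score
import sacrebleu
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "the cat sat quietly on the mat"
candidate = "a cat was sitting on the mat"

# Sentence-level BLEU.
bleu = sacrebleu.sentence_bleu(candidate, [reference]).score

# ROUGE-1: precision, recall, and F-measure are all usable as signals.
rouge = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True).score(reference, candidate)["rouge1"]

# BERTscore: returns precision, recall, and F tensors.
P, R, F1 = bert_score([candidate], [reference], lang="en")

print(bleu, rouge.fmeasure, float(F1[0]))
```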
Backtranslation Likelihood: We compute all the losses using a custom Transformer model (Vaswani et al., 2017), trained on two language pairs (English-French and English-German) with the tensor2tensor framework.

Normalization: All the regression labels are normalized before training.

A.3 Modeling

Setting the weights of the pre-training tasks: We set the weights $\gamma_k$ with grid search, optimizing BLEURT's performance on WMT 17's validation set. To reduce the size of the grid, we make groups of pre-training tasks that share the same weights: $(\tau_{\text{BLEU}}, \tau_{\text{ROUGE}}, \tau_{\text{BERTscore}})$, $(\tau_{\text{en-fr},z|\tilde{z}}, \tau_{\text{en-fr},\tilde{z}|z}, \tau_{\text{en-de},z|\tilde{z}}, \tau_{\text{en-de},\tilde{z}|z})$, and $(\tau_{\text{entail}}, \tau_{\text{backtran\_flag}})$.

B Experiments – Supplementary Material

B.1 Training Setup for All Experiments

We use BERT's public checkpoints¹⁰ with Adam (the default optimizer), learning rate 1e-5, and batch size 32. Unless specified otherwise, we use 800,000 training steps for pre-training and 40,000 steps for fine-tuning. We run training and evaluation in parallel: we run the evaluation every 1,500 steps and store the checkpoint that performs best on a held-out validation set (more details on the data splits and our choice of metrics in the following sections). We use Google Cloud TPUs v2 for learning, and Nvidia Tesla V100 accelerators for evaluation and test. Our code uses TensorFlow 1.15 and Python 2.7.

¹⁰ https://github.com/google-research/bert

B.2 WMT Metric Shared Task

Metrics: The metrics used to compare the evaluation systems vary across the years. The organizers use Pearson's correlation on standardized human judgments across all segments in 2017, and a custom variant of Kendall's Tau named "DARR" on raw human judgments in 2018 and 2019. The latter metric operates as follows. The organizers gather all the translations for the same reference segment, they enumerate all the possible pairs (translation₁, translation₂), and they discard all the pairs which have a "similar" score (less than 25 points away on a 100-point scale). For each remaining pair, they then determine which translation is the best according to both human judgment and the candidate metric. Let |Concordant| be the number of pairs on which the NLG metrics agree and |Discordant| be those on which they disagree; then the score is computed as follows:

$$\text{DARR} = \frac{|\text{Concordant}| - |\text{Discordant}|}{|\text{Concordant}| + |\text{Discordant}|}$$

The idea behind the 25-point filter is to make the evaluation more robust, since the judgments collected for WMT 2018 and 2019 are noisy. Kendall's Tau is identical, but it does not use the filter.
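A minimal sketch of the DARR computation as described above follows; the input data layout is an assumption for the example, and the official WMT scripts differ in the details.

```python
# Illustrative sketch of DARR as described above: enumerate translation pairs
# for the same reference segment, drop pairs whose human scores are less than
# 25 points apart (on a 100-point scale), and count concordant vs. discordant
# pairs. The input format is an assumption, not the official WMT layout.
from itertools import combinations

def darr(segments, threshold=25.0):
    """segments: list of lists; each inner list holds (human_score, metric_score)
    tuples for all candidate translations of one reference segment."""
    concordant = discordant = 0
    for translations in segments:
        for (h1, m1), (h2, m2) in combinations(translations, 2):
            if abs(h1 - h2) < threshold:      # "similar" pairs are discarded
                continue
            if (h1 - h2) * (m1 - m2) > 0:     # metric and humans pick the same winner
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Toy usage: two reference segments, three candidate translations each.
segments = [
    [(90.0, 0.8), (40.0, 0.5), (60.0, 0.7)],
    [(30.0, 0.2), (85.0, 0.9), (70.0, 0.6)],
]
print(darr(segments))
```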
The sizes of the datasets decrease as α increases: we use 50.7%, 30.3%, 20.4%, and 11.9% of the original 5,344 training records for α = 0.5, 1.0, 1.5, and 3.0 respectively.

B.4 Ablation Experiment – How Much Pre-Training Time is Necessary?

To understand the relationship between pre-training time and downstream accuracy, we pre-train several versions of BLEURT and we fine-tune them on WMT17 data, varying the number of pre-training steps. Figure 5 presents the results. Most gains are obtained during the first 400,000 steps, that is, after about 2 epochs over our synthetic dataset.

Figure 5: Improvement in Kendall Tau accuracy on all language pairs of the WMT Metrics Shared Task 2017, varying the number of pre-training steps (x-axis: number of pre-training steps in thousands; y-axis: relative Kendall Tau improvement in %, for BLEURT and BLEURTbase). 0 steps corresponds to 0.555 Kendall Tau for BLEURTbase and 0.580 for BLEURT.