BLEURT: Learning Robust Metrics for Text Generation

Thibault Sellam, Dipanjan Das, Ankur P. Parikh
Google Research
New York, NY
{tsellam, dipanjand, aparikh}@google.com

arXiv:2004.04696v5 [cs.CL] 21 May 2020

Abstract

Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate poorly with human judgments. We propose BLEURT, a learned evaluation metric based on BERT that can model human judgments with a few thousand possibly biased training examples. A key aspect of our approach is a novel pre-training scheme that uses millions of synthetic examples to help the model generalize. BLEURT provides state-of-the-art results on the last three years of the WMT Metrics shared task and the WebNLG Competition dataset. In contrast to a vanilla BERT-based approach, it yields superior results even when the training data is scarce and out-of-distribution.

1 Introduction

In the last few years, research in natural text generation (NLG) has made significant progress, driven largely by the neural encoder-decoder paradigm (Sutskever et al., 2014; Bahdanau et al., 2015) which can tackle a wide array of tasks including translation (Koehn, 2009), summarization (Mani, 1999; Chopra et al., 2016), structured-data-to-text generation (McKeown, 1992; Kukich, 1983; Wiseman et al., 2017), dialog (Smith and Hipp, 1994; Vinyals and Le, 2015) and image captioning (Fang et al., 2015). However, progress is increasingly impeded by the shortcomings of existing metrics (Wiseman et al., 2017; Ma et al., 2019; Tian et al., 2019).

Human evaluation is often the best indicator of the quality of a system. However, designing crowdsourcing experiments is an expensive and high-latency process, which does not easily fit in a daily model development pipeline. Therefore, NLG researchers commonly use automatic evaluation metrics, which provide an acceptable proxy for quality and are very cheap to compute. This paper investigates sentence-level, reference-based metrics, which describe the extent to which a candidate sentence is similar to a reference one. The exact definition of similarity may range from string overlap to logical entailment.

The first generation of metrics relied on handcrafted rules that measure the surface similarity between the sentences. To illustrate, BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), two popular metrics, rely on N-gram overlap. Because those metrics are only sensitive to lexical variation, they cannot appropriately reward semantic or syntactic variations of a given reference. Thus, they have been repeatedly shown to correlate poorly with human judgment, in particular when all the systems to compare have a similar level of accuracy (Liu et al., 2016; Novikova et al., 2017; Chaganty et al., 2018).

Increasingly, NLG researchers have addressed those problems by injecting learned components in their metrics. To illustrate, consider the WMT Metrics Shared Task, an annual benchmark in which translation metrics are compared on their ability to imitate human assessments. The last two years of the competition were largely dominated by neural net-based approaches, RUSE, YiSi and ESIM (Ma et al., 2018, 2019). Current approaches largely fall into two categories. Fully learned metrics, such as BEER, RUSE, and ESIM, are trained end-to-end, and they typically rely on handcrafted features and/or learned embeddings. Conversely, hybrid metrics, such as YiSi and BERTscore, combine trained elements, e.g., contextual embeddings, with handwritten logic, e.g., token alignment rules. The first category typically offers great expressivity: if a training set of human ratings data is available, the metrics may take full advantage of it and fit the ratings distribution tightly. Furthermore, learned metrics can be tuned to measure task-specific properties, such as fluency, faithfulness, grammar, or style. On the other hand, hybrid metrics offer robustness. They may provide better results when there is little to no training data, and they do not rely on the assumption that training and test data are identically distributed.

And indeed, the IID assumption is particularly problematic in NLG evaluation because of domain drifts, which have been the main target of the metrics literature, but also because of quality drifts: NLG systems tend to get better over time, and therefore a model trained on ratings data from 2015 may fail to distinguish top-performing systems in 2019, especially for newer research tasks. An ideal learned metric would be able to both take full advantage of available ratings data for training, and be robust to distribution drifts, i.e., it should be able to extrapolate.

Our insight is that it is possible to combine expressivity and robustness by pre-training a fully learned metric on large amounts of synthetic data, before fine-tuning it on human ratings. To this end, we introduce BLEURT,¹ a text generation metric based on BERT (Devlin et al., 2019). A key ingredient of BLEURT is a novel pre-training scheme, which uses random perturbations of Wikipedia sentences augmented with a diverse set of lexical and semantic-level supervision signals.

To demonstrate our approach, we train BLEURT for English and evaluate it under different generalization regimes. We first verify that it provides state-of-the-art results on all recent years of the WMT Metrics Shared Task (2017 to 2019, to-English language pairs). We then stress-test its ability to cope with quality drifts with a synthetic benchmark based on WMT 2017. Finally, we show that it can easily adapt to a different domain with three tasks from a data-to-text dataset, WebNLG 2017 (Gardent et al., 2017). Ablations show that our synthetic pre-training scheme increases performance in the IID setting, and is critical to ensure robustness when the training data is scarce, skewed, or out-of-domain.

The code and pre-trained models are available online.²

¹ Bilingual Evaluation Understudy with Representations from Transformers. We refer the intrigued reader to Papineni et al. (2002) for a justification of the term understudy.
² http://github.com/google-research/bleurt

2 Preliminaries

Define x = (x1, ..., xr) to be the reference sentence of length r, where each xi is a token, and let x̃ = (x̃1, ..., x̃p) be a prediction sentence of length p. Let {(xi, x̃i, yi)}_{i=1}^N be a training dataset of size N, where yi ∈ R is the human rating that indicates how good x̃i is with respect to xi. Given the training data, our goal is to learn a function f : (x, x̃) → y that predicts the human rating.

3 Fine-Tuning BERT for Quality Evaluation

Given the small amounts of rating data available, it is natural to leverage unsupervised representations for this task. In our model, we use BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019), which is an unsupervised technique that learns contextualized representations of sequences of text. Given x and x̃, BERT is a Transformer (Vaswani et al., 2017) that returns a sequence of contextualized vectors:

v[CLS], vx1, ..., vxr, vx̃1, ..., vx̃p = BERT(x, x̃)

where v[CLS] is the representation for the special [CLS] token. As described by Devlin et al. (2019), we add a linear layer on top of the [CLS] vector to predict the rating:

ŷ = f(x, x̃) = W ṽ[CLS] + b

where W and b are the weight matrix and bias vector respectively. Both the above linear layer and the BERT parameters are trained (i.e., fine-tuned) on the supervised data, which typically numbers a few thousand examples. We use the regression loss ℓ_supervised = (1/N) Σ_{i=1}^{N} ‖yi − ŷi‖².

Although this approach is quite straightforward, we will show in Section 5 that it gives state-of-the-art results on the WMT Metrics Shared Task 17-19, which makes it a high-performing evaluation metric. However, fine-tuning BERT requires a sizable amount of IID data, which is less than ideal for a metric that should generalize to a variety of tasks and model drift.
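To make the regression setup concrete, here is a minimal numpy sketch of the prediction head and loss described above. The vectors and weights are random stand-ins for the BERT [CLS] embedding and the learned parameters; the actual model fine-tunes all BERT weights end-to-end rather than only this head.

```python
import numpy as np

def predict_rating(v_cls, W, b):
    """Linear head on the [CLS] vector: y_hat = W v_[CLS] + b."""
    return float(W @ v_cls + b)

def supervised_loss(y, y_hat):
    """Mean squared regression loss over N rated (reference, candidate) pairs."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))

# Toy usage with random numbers standing in for BERT outputs and human ratings.
rng = np.random.default_rng(0)
v_cls = rng.normal(size=768)       # [CLS] embedding of the pair (x, x_tilde)
W, b = rng.normal(size=768), 0.0   # parameters of the regression head
print(predict_rating(v_cls, W, b))
print(supervised_loss([0.2, -0.5], [0.1, -0.4]))
```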
4 Pre-Training on Synthetic Data

The key aspect of our approach is a pre-training technique that we use to "warm up" BERT before fine-tuning on rating data.³ We generate a large number of synthetic reference-candidate pairs (z, z̃), and we train BERT on several lexical- and semantic-level supervision signals with a multi-task loss. As our experiments will show, BLEURT generalizes much better after this phase, especially with incomplete training data.

³ To clarify, our pre-training scheme is an addition, not a replacement to BERT's initial training (Devlin et al., 2019), and happens after it.
Any pre-training approach requires a dataset and a set of pre-training tasks. Ideally, the setup should resemble the final NLG evaluation task, i.e., the sentence pairs should be distributed similarly and the pre-training signals should correlate with human ratings. Unfortunately, we cannot have access to the NLG models that we will evaluate in the future. Therefore, we optimized our scheme for generality, with three requirements. (1) The set of reference sentences should be large and diverse, so that BLEURT can cope with a wide range of NLG domains and tasks. (2) The sentence pairs should contain a wide variety of lexical, syntactic, and semantic dissimilarities. The aim here is to anticipate all variations that an NLG system may produce, e.g., phrase substitution, paraphrases, noise, or omissions. (3) The pre-training objectives should effectively capture those phenomena, so that BLEURT can learn to identify them. The following sections present our approach.

4.1 Generating Sentence Pairs

One way to expose BLEURT to a wide variety of sentence differences is to use existing sentence pairs datasets (Bowman et al., 2015; Williams et al., 2018; Wang et al., 2019). These sets are a rich source of related sentences, but they may fail to capture the errors and alterations that NLG systems produce (e.g., omissions, repetitions, nonsensical substitutions). We opted for an automatic approach instead, which can be scaled arbitrarily and at little cost: we generate synthetic sentence pairs (z, z̃) by randomly perturbing 1.8 million segments z from Wikipedia. We use three techniques: mask-filling with BERT, backtranslation, and randomly dropping out words. We obtain about 6.5 million perturbations z̃. Let us describe those techniques; a small sketch follows the descriptions.

Mask-filling with BERT: BERT's initial training task is to fill gaps (i.e., masked tokens) in tokenized sentences. We leverage this functionality by inserting masks at random positions in the Wikipedia sentences, and fill them with the language model. Thus, we introduce lexical alterations while maintaining the fluency of the sentence. We use two masking strategies: we either introduce the masks at random positions in the sentences, or we create contiguous sequences of masked tokens. More details are provided in the Appendix.

Backtranslation: We generate paraphrases and perturbations with backtranslation, that is, round trips from English to another language and then back to English with a translation model (Bannard and Callison-Burch, 2005; Ganitkevitch et al., 2013; Sennrich et al., 2016). Our primary aim is to create variants of the reference sentence that preserve semantics. Additionally, we use the mispredictions of the backtranslation models as a source of realistic alterations.

Dropping words: We found it useful in our experiments to randomly drop words from the synthetic examples above to create other examples. This method prepares BLEURT for "pathological" behaviors of NLG systems, e.g., void predictions, or sentence truncation.
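As an illustration of the perturbation step, the sketch below implements the simplest of the three techniques, word dropping. Mask-filling and backtranslation additionally require a masked language model and translation models, so they are omitted here; the example sentence is made up.

```python
import random

def drop_words(tokens, rng=random):
    """Randomly delete between 1 and len(tokens) words to simulate truncated
    or void predictions (pathological NLG outputs)."""
    n_drop = rng.randint(1, len(tokens))                   # how many words to remove
    drop_idx = set(rng.sample(range(len(tokens)), n_drop)) # which positions to remove
    return [t for i, t in enumerate(tokens) if i not in drop_idx]

z = "the 1889 exhibition was held in paris".split()
z_tilde = drop_words(z)   # e.g. ['the', 'exhibition', 'held', 'paris']
print(" ".join(z_tilde))
```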
4.2 Pre-Training Signals

The next step is to augment each sentence pair (z, z̃) with a set of pre-training signals {τk}, where τk is the target vector of pre-training task k. Good pre-training signals should capture a wide variety of lexical and semantic differences. They should also be cheap to obtain, so that the approach can scale to large amounts of synthetic data. The following section presents our 9 pre-training tasks, summarized in Table 1. Additional implementation details are in the Appendix.

Automatic Metrics: We create three signals τBLEU, τROUGE, and τBERTscore with sentence BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and BERTscore (Zhang et al., 2020) respectively (we use precision, recall and F-score for the latter two).
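The sketch below shows how a synthetic pair (z, z̃) is turned into a regression target such as τBLEU. It uses NLTK's smoothed sentence-level BLEU as a convenient stand-in; per Appendix A.2, the paper itself relies on the Moses sentenceBLEU implementation, with analogous code for ROUGE and BERTscore.

```python
# Rough stand-in: the paper uses Moses sentenceBLEU; NLTK's smoothed
# sentence BLEU is used here only to illustrate the idea.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_signal(z, z_tilde):
    """Pre-training target tau_BLEU for a synthetic pair (z, z_tilde)."""
    smooth = SmoothingFunction().method1
    return sentence_bleu([z.split()], z_tilde.split(), smoothing_function=smooth)

z = "the cat sat on the mat"
z_tilde = "a cat was sitting on the mat"
print(bleu_signal(z, z_tilde))   # a value in [0, 1], used as a regression label
```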
Task Type Pre-training Signals Loss Type
BLEU τBLEU Regression
ROUGE τROUGE = (τROUGE-P , τROUGE-R , τROUGE-F ) Regression
BERTscore τBERTscore = (τBERTscore-P , τBERTscore-R , τBERTscore-F ) Regression
Backtrans. likelihood τen-fr,z|z̃ , τen-fr,z̃|z , τen-de,z|z̃ , τen-de,z̃|z Regression
Entailment τentail = (τEntail , τContradict , τNeutral ) Multiclass
Backtrans. flag τbacktran flag Multiclass

Table 1: Our pre-training signals.

Backtranslation Likelihood: The idea behind this signal is to leverage existing translation models to measure semantic equivalence. Given a pair (z, z̃), this training signal measures the probability that z̃ is a backtranslation of z, P(z̃|z), normalized by the length of z̃. Let P_en→fr(z_fr|z) be a translation model that assigns probabilities to French sentences z_fr conditioned on English sentences z, and let P_fr→en(z|z_fr) be a translation model that assigns probabilities to English sentences given French sentences. If |z̃| is the number of tokens in z̃, we define our score as τ_en-fr,z̃|z = log P(z̃|z) / |z̃|, with:

P(z̃|z) = Σ_{z_fr} P_fr→en(z̃|z_fr) · P_en→fr(z_fr|z)

Because computing the summation over all possible French sentences is intractable, we approximate the sum using z*_fr = arg max_{z_fr} P_en→fr(z_fr|z), and we assume that P_en→fr(z*_fr|z) ≈ 1:

P(z̃|z) ≈ P_fr→en(z̃|z*_fr)

We can trivially reverse the procedure to compute P(z|z̃); thus we create 4 pre-training signals τ_en-fr,z|z̃, τ_en-fr,z̃|z, τ_en-de,z|z̃, τ_en-de,z̃|z with two pairs of languages (en ↔ de and en ↔ fr) in both directions.
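The following sketch spells out the approximation above. The two model-scoring functions are hypothetical placeholders: the paper trains its own Transformer translation models with tensor2tensor, which are not reproduced here.

```python
def backtranslation_score(z, z_tilde, translate_en_fr, score_fr_en):
    """Approximate tau_{en-fr, z_tilde|z} = log P(z_tilde | z) / |z_tilde|.

    translate_en_fr(z)         -> z_fr* = argmax_z_fr P_en->fr(z_fr | z)  (hypothetical stub)
    score_fr_en(z_tilde, z_fr) -> log P_fr->en(z_tilde | z_fr)            (hypothetical stub)
    The sum over all French sentences is replaced by the single best
    translation, assuming P_en->fr(z_fr* | z) is close to 1.
    """
    z_fr_star = translate_en_fr(z)
    log_p = score_fr_en(z_tilde, z_fr_star)
    return log_p / max(1, len(z_tilde.split()))

# Example with dummy stubs standing in for real translation models.
score = backtranslation_score(
    "the cat sat on the mat", "a cat was sitting on the mat",
    translate_en_fr=lambda s: "le chat etait assis sur le tapis",  # placeholder output
    score_fr_en=lambda s, f: -4.2,                                 # placeholder log-probability
)
print(score)
```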
Textual Entailment: The signal τ_entail expresses whether z entails or contradicts z̃ using a classifier. We report the probability of three labels, Entail, Contradict, and Neutral, using BERT fine-tuned on an entailment dataset, MNLI (Devlin et al., 2019; Williams et al., 2018).

Backtranslation flag: The signal τ_backtran_flag is a Boolean that indicates whether the perturbation was generated with backtranslation or with mask-filling.
4.3 Modeling

For each pre-training task, our model uses either a regression or a classification loss. We then aggregate the task-level losses with a weighted sum.

Let τk describe the target vector for each task, e.g., the probabilities for the classes Entail, Contradict, Neutral, or the precision, recall, and F-score for ROUGE. If τk is a regression task, then the loss used is the ℓ2 loss, i.e., ℓk = ‖τk − τ̂k‖² / |τk|, where |τk| is the dimension of τk and τ̂k is computed by using a task-specific linear layer on top of the [CLS] embedding: τ̂k = W_τk ṽ[CLS] + b_τk. If τk is a classification task, we use a separate linear layer to predict a logit for each class c, τ̂kc = W_τkc ṽ[CLS] + b_τkc, and we use the multiclass cross-entropy loss. We define our aggregate pre-training loss function as follows:

ℓ_pre-training = (1/M) Σ_{m=1}^{M} Σ_{k=1}^{K} γk ℓk(τk^m, τ̂k^m)    (1)

where τk^m is the target vector for example m, M is the number of synthetic examples, and γk are hyperparameter weights obtained with grid search (more details in the Appendix).
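A minimal numpy sketch of Equation (1): per-task losses (ℓ2 for regression targets, cross-entropy for classification targets) are combined with the grid-searched weights γk. The dictionary-based bookkeeping is an illustrative choice, not the paper's implementation.

```python
import numpy as np

def l2_task_loss(tau, tau_hat):
    """Regression tasks: ||tau - tau_hat||^2 / dim(tau)."""
    tau, tau_hat = np.asarray(tau, float), np.asarray(tau_hat, float)
    return float(np.sum((tau - tau_hat) ** 2) / tau.size)

def cross_entropy_loss(tau, logits):
    """Classification tasks: multiclass cross-entropy against target distribution tau."""
    logits = np.asarray(logits, float)
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return float(-np.sum(np.asarray(tau, float) * log_probs))

def pretraining_loss(per_example_task_losses, gammas):
    """Equation (1): average over examples of the gamma-weighted sum of task losses.

    per_example_task_losses: list of dicts {task_name: loss_value}, one per example.
    gammas: dict {task_name: weight} found by grid search.
    """
    total = sum(gammas[k] * losses[k]
                for losses in per_example_task_losses for k in losses)
    return total / len(per_example_task_losses)

# Toy usage with two synthetic examples and two tasks.
gammas = {"bleu": 1.0, "entail": 1.0}
examples = [{"bleu": 0.2, "entail": 0.7}, {"bleu": 0.1, "entail": 0.9}]
print(pretraining_loss(examples, gammas))
```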
5 Experiments

In this section, we report our experimental results for two tasks, translation and data-to-text. First, we benchmark BLEURT against existing text generation metrics on the last 3 years of the WMT Metrics Shared Task (Bojar et al., 2017). We then evaluate its robustness to quality drifts with a series of synthetic datasets based on WMT17. We test BLEURT's ability to adapt to different tasks with the WebNLG 2017 Challenge Dataset (Gardent et al., 2017). Finally, we measure the contribution of each pre-training task with ablation experiments.

Our Models: Unless specified otherwise, all BLEURT models are trained in three steps: regular BERT pre-training (Devlin et al., 2019), pre-training on synthetic data (as explained in Section 4), and fine-tuning on task-specific ratings (translation and/or data-to-text). We experiment with two versions of BLEURT, BLEURT and BLEURTbase, respectively based on BERT-Large (24 layers, 1024 hidden units, 16 heads) and BERT-Base (12 layers, 768 hidden units, 12 heads) (Devlin et al., 2019), both uncased. We use batch size 32, learning rate 1e-5, and 800,000 steps for pre-training and 40,000 steps for fine-tuning. We provide the full detail of our training setup in the Appendix.

model cs-en de-en fi-en lv-en ru-en tr-en zh-en avg
τ /r τ /r τ /r τ /r τ /r τ /r τ /r τ /r
sentBLEU 29.6 / 43.2 28.9 / 42.2 38.6 / 56.0 23.9 / 38.2 34.3 / 47.7 34.3 / 54.0 37.4 / 51.3 32.4 / 47.5
MoverScore 47.6 / 67.0 51.2 / 70.8 NA NA 53.4 / 73.8 56.1 / 76.2 53.1 / 74.4 52.3 / 72.4
BERTscore w/ BERT 48.0 / 66.6 50.3 / 70.1 61.4 / 81.4 51.6 / 72.3 53.7 / 73.0 55.6 / 76.0 52.2 / 73.1 53.3 / 73.2
BERTscore w/ roBERTa 54.2 / 72.6 56.9 / 76.0 64.8 / 83.2 56.2 / 75.7 57.2 / 75.2 57.9 / 76.1 58.8 / 78.9 58.0 / 76.8
chrF++ 35.0 / 52.3 36.5 / 53.4 47.5 / 67.8 33.3 / 52.0 41.5 / 58.8 43.2 / 61.4 40.5 / 59.3 39.6 / 57.9
BEER 34.0 / 51.1 36.1 / 53.0 48.3 / 68.1 32.8 / 51.5 40.2 / 57.7 42.8 / 60.0 39.5 / 58.2 39.1 / 57.1
BLEURTbase -pre 51.5 / 68.2 52.0 / 70.7 66.6 / 85.1 60.8 / 80.5 57.5 / 77.7 56.9 / 76.0 52.1 / 72.1 56.8 / 75.8
BLEURTbase 55.7 / 73.4 56.3 / 75.7 68.0 / 86.8 64.7 / 83.3 60.1 / 80.1 62.4 / 81.7 59.5 / 80.5 61.0 / 80.2
BLEURT -pre 56.0 / 74.7 57.1 / 75.7 67.2 / 86.1 62.3 / 81.7 58.4 / 78.3 61.6 / 81.4 55.9 / 76.5 59.8 / 79.2
BLEURT 59.3 / 77.3 59.9 / 79.2 69.5 / 87.8 64.4 / 83.5 61.3 / 81.1 62.9 / 82.4 60.2 / 81.4 62.5 / 81.8

Table 2: Agreement with human ratings on the WMT17 Metrics Shared Task. The metrics are Kendall Tau (τ ) and
the Pearson correlation (r, the official metric of the shared task), divided by 100.

model cs-en de-en et-en fi-en ru-en tr-en zh-en avg


τ / DA τ / DA τ / DA τ / DA τ / DA τ / DA τ / DA τ / DA
sentBLEU 20.0 / 22.5 31.6 / 41.5 26.0 / 28.2 17.1 / 15.6 20.5 / 22.4 22.9 / 13.6 21.6 / 17.6 22.8 / 23.2
BERTscore w/ BERT 29.5 / 40.0 39.9 / 53.8 34.7 / 39.0 26.0 / 29.7 27.8 / 34.7 31.7 / 27.5 27.5 / 25.2 31.0 / 35.7
BERTscore w/ roBERTa 31.2 / 41.1 42.2 / 55.5 37.0 / 40.3 27.8 / 30.8 30.2 / 35.4 32.8 / 30.2 29.2 / 26.3 32.9 / 37.1
Meteor++ 22.4 / 26.8 34.7 / 45.7 29.7 / 32.9 21.6 / 20.6 22.8 / 25.3 27.3 / 20.4 23.6 / 17.5* 26.0 / 27.0
RUSE 27.0 / 34.5 36.1 / 49.8 32.9 / 36.8 25.5 / 27.5 25.0 / 31.1 29.1 / 25.9 24.6 / 21.5* 28.6 / 32.4
YiSi1 23.5 / 31.7 35.5 / 48.8 30.2 / 35.1 21.5 / 23.1 23.3 / 30.0 26.8 / 23.4 23.1 / 20.9 26.3 / 30.4
YiSi1 SRL 18 23.3 / 31.5 34.3 / 48.3 29.8 / 34.5 21.2 / 23.7 22.6 / 30.6 26.1 / 23.3 22.9 / 20.7 25.7 / 30.4
BLEURTbase -pre 33.0 / 39.0 41.5 / 54.6 38.2 / 39.6 30.7 / 31.1 30.7 / 34.9 32.9 / 29.8 28.3 / 25.6 33.6 / 36.4
BLEURTbase 34.5 / 42.9 43.5 / 55.6 39.2 / 40.5 31.5 / 30.9 31.0 / 35.7 35.0 / 29.4 29.6 / 26.9 34.9 / 37.4
BLEURT -pre 34.5 / 42.1 42.7 / 55.4 39.2 / 40.6 31.4 / 31.6 31.4 / 34.2 33.4 / 29.3 28.9 / 25.6 34.5 / 37.0
BLEURT 35.6 / 42.3 44.2 / 56.7 40.0 / 41.4 32.1 / 32.5 31.9 / 36.0 35.5 / 31.5 29.7 / 26.0 35.6 / 38.1

Table 3: Agreement with human ratings on the WMT18 Metrics Shared Task. The metrics are Kendall Tau (τ ) and
WMT’s Direct Assessment metrics divided by 100. The star * indicates results that are more than 0.2 percentage
points away from the official WMT results (up to 0.4 percentage points away).

5.1 WMT Metrics Shared Task

Datasets and Metrics: We use years 2017 to 2019 of the WMT Metrics Shared Task, to-English language pairs. For each year, we used the official WMT test set, which includes several thousand pairs of sentences with human ratings from the news domain. The training sets contain 5,360, 9,492, and 147,691 records for each year. The test sets for years 2018 and 2019 are noisier, as reported by the organizers and shown by the overall lower correlations.

We evaluate the agreement between the automatic metrics and the human ratings. For each year, we report two metrics: Kendall's Tau τ (for consistency across experiments), and the official WMT metric for that year (for completeness). The official WMT metric is either Pearson's correlation or a robust variant of Kendall's Tau called DARR, described in the Appendix. All the numbers come from our own implementation of the benchmark.⁴ Our results are globally consistent with the official results but we report small differences in 2018 and 2019, marked in the tables.

⁴ The official scripts are public but they suffer from documentation and dependency issues, as shown by a README file in the 2019 edition which explicitly discourages using them.

Models: We experiment with four versions of BLEURT: BLEURT, BLEURTbase, BLEURT -pre and BLEURTbase -pre. The first two models are based on BERT-large and BERT-base. In the latter two versions, we skip the pre-training phase and fine-tune directly on the WMT ratings. For each year of the WMT shared task, we use the test set from the previous years for training and validation. We describe our setup in further detail in the Appendix. We compare BLEURT to participant data from the shared task and automatic metrics that we ran ourselves. In the former case, we use the best-performing contestants for each year, that is, chrF++, BEER, Meteor++, RUSE, Yisi1, ESIM and Yisi1-SRL (Mathur et al., 2019). All the contestants use the same WMT training data, in addition to existing sentence or token embeddings. In the latter case, we use Moses sentenceBLEU, BERTscore (Zhang et al., 2020), and MoverScore (Zhao et al., 2019). For BERTscore, we use BERT-large uncased for fairness, and roBERTa (the recommended version) for completeness (Liu et al., 2019). We run MoverScore on WMT 2017 using the scripts published by the authors.
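For reference, segment-level agreement of the kind reported in Tables 2-4 can be computed with scipy as sketched below; the DARR variant used in 2018-2019 adds the filtering step described in the Appendix. The toy scores are invented.

```python
from scipy.stats import kendalltau, pearsonr

def agreement(metric_scores, human_scores):
    """Segment-level agreement between an automatic metric and human ratings."""
    tau, _ = kendalltau(metric_scores, human_scores)
    r, _ = pearsonr(metric_scores, human_scores)
    return tau, r

# Toy example: five candidate translations scored by a metric and by humans.
print(agreement([0.1, 0.4, 0.35, 0.8, 0.7], [-1.0, 0.2, 0.1, 1.3, 0.9]))
```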
model de-en fi-en gu-en kk-en lt-en ru-en zh-en avg
τ / DA τ / DA τ / DA τ / DA τ / DA τ / DA τ / DA τ / DA
sentBLEU 19.4 / 5.4 20.6 / 23.3 17.3 / 18.9 30.0 / 37.6 23.8 / 26.2 19.4 / 12.4 28.7 / 32.2 22.7 / 22.3
BERTscore w/ BERT 26.2 / 17.3 27.6 / 34.7 25.8 / 29.3 36.9 / 44.0 30.8 / 37.4 25.2 / 20.6 37.5 / 41.4 30.0 / 32.1
BERTscore w/ roBERTa 29.1 / 19.3 29.7 / 35.3 27.7 / 32.4 37.1 / 43.1 32.6 / 38.2 26.3 / 22.7 41.4 / 43.8 32.0 / 33.6
ESIM 28.4 / 16.6 28.9 / 33.7 27.1 / 30.4 38.4 / 43.3 33.2 / 35.9 26.6 / 19.9 38.7 / 39.6 31.6 / 31.3
YiSi1 SRL 19 26.3 / 19.8 27.8 / 34.6 26.6 / 30.6 36.9 / 44.1 30.9 / 38.0 25.3 / 22.0 38.9 / 43.1 30.4 / 33.2
BLEURTbase -pre 30.1 / 15.8 30.4 / 35.4 26.8 / 29.7 37.8 / 41.8 34.2 / 39.0 27.0 / 20.7 40.1 / 39.8 32.3 / 31.7
BLEURTbase 31.0 / 16.6 31.3 / 36.2 27.9 / 30.6 39.5 / 44.6 35.2 / 39.4 28.5 / 21.5 41.7 / 41.6 33.6 / 32.9
BLEURT -pre 31.1 / 16.9 31.3 / 36.5 27.6 / 31.3 38.4 / 42.8 35.0 / 40.0 27.5 / 21.4 41.6 / 41.4 33.2 / 32.9
BLEURT 31.2 / 16.9 31.7 / 36.3 28.3 / 31.9 39.5 / 44.6 35.2 / 40.6 28.3 / 22.3 42.7 / 42.4 33.8 / 33.6

Table 4: Agreement with human ratings on the WMT19 Metrics Shared Task. The metrics are Kendall Tau (τ ) and
WMT’s Direct Assessment metrics divided by 100. All the values reported for Yisi1 SRL and ESIM fall within
0.2 percentage of the official WMT results.

Figure 1: Distribution of the human ratings in the train/validation and test datasets for different skew factors (0, 0.5, 1.0, 1.5, 3.0).

Figure 2: Agreement between BLEURT and human ratings for different skew factors in train and test (baselines: BLEU and BERTscore).
Results: Tables 2, 3, and 4 show the results. For years 2017 and 2018, a BLEURT-based metric dominates the benchmark for each language pair (Tables 2 and 3). BLEURT and BLEURTbase are also competitive for year 2019: they yield the best results for all language pairs on Kendall's Tau, and they come first for 3 out of 7 pairs on DARR. As expected, BLEURT dominates BLEURTbase in the majority of cases. Pre-training consistently improves the results of BLEURT and BLEURTbase. We observe the largest effect on year 2017, where it adds up to 7.4 Kendall Tau points for BLEURTbase (zh-en). The effect is milder on years 2018 and 2019, up to 2.1 points (tr-en, 2018). We explain the difference by the fact that the training data used for 2017 is smaller than the datasets used for the following years, so pre-training is likelier to help. In general, pre-training yields higher returns for BERT-base than for BERT-large; in fact, BLEURTbase with pre-training is often better than BLEURT without.

Takeaways: Pre-training delivers consistent improvements, especially for BLEURT-base. BLEURT yields state-of-the-art performance for all years of the WMT Metrics Shared Task.

5.2 Robustness to Quality Drift

We assess our claim that pre-training makes BLEURT robust to quality drifts by constructing a series of tasks for which it is increasingly pressured to extrapolate. All the experiments that follow are based on the WMT Metrics Shared Task 2017, because the ratings for this edition are particularly reliable.⁵

⁵ The organizers managed to collect 15 adequacy scores for each translation, and thus the ratings are almost perfectly repeatable (Bojar et al., 2017).

Methodology: We create increasingly challenging datasets by sub-sampling the records from the WMT Metrics shared task, keeping low-rated translations for training and high-rated translations for test. The key parameter is the skew factor α, which measures how much the training data is left-skewed and the test data is right-skewed. Figure 1 demonstrates the ratings distribution that we used in our experiments. The training data shrinks as α increases: in the most extreme case (α = 3.0), we use only 11.9% of the original 5,344 training records. We give the full detail of our sampling methodology in the Appendix.

We use BLEURT with and without pre-training and we compare to Moses sentBLEU and BERTscore. We use BERT-large uncased for both BLEURT and BERTscore.
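The sketch below illustrates the sub-sampling scheme, following the binned formulation given in Appendix B.3 (train probability 1/B^α and test probability 1/(11−B)^α for a record in rating bin B). The binning and bookkeeping details are illustrative assumptions, not the released code.

```python
import random

def skewed_split(records, alpha, rng=random):
    """Sample drifted train/test sets from (candidate, reference, rating) records.

    Records are bucketed into 10 equal-size bins by rating; a record in bin B
    (1 = lowest-rated, 10 = highest-rated) is kept for training with probability
    1 / B**alpha and for test with probability 1 / (11 - B)**alpha, so training
    skews toward low-rated translations and test toward high-rated ones.
    Records are drawn independently here; the exact protocol is in Appendix B.3.
    """
    ranked = sorted(records, key=lambda r: r["rating"])
    bin_size = max(1, len(ranked) // 10)
    train, test = [], []
    for i, rec in enumerate(ranked):
        b = min(10, i // bin_size + 1)               # bin index in 1..10
        if rng.random() < 1.0 / (b ** alpha):
            train.append(rec)
        if rng.random() < 1.0 / ((11 - b) ** alpha):
            test.append(rec)
    return train, test
```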
Figure 3: Absolute Kendall Tau of BLEU, Meteor, and BLEURT with human judgements on the WebNLG dataset, varying the size of the data used for training and validation (panels: split by system and split by input; rows: fluency, grammar, semantics).

Results: Figure 2 presents BLEURT's performance as we vary the train and test skew independently. Our first observation is that the agreements fall for all metrics as we increase the test skew. This effect was already described in the 2019 WMT Metrics report (Ma et al., 2019). A common explanation is that the task gets more difficult as the ratings get closer: it is easier to discriminate between "good" and "bad" systems than to rank "good" systems.

Training skew has a disastrous effect on BLEURT without pre-training: it is below BERTscore for α = 1.0, and it falls under sentBLEU for α ≥ 1.5. Pre-trained BLEURT is much more robust: the only case in which it falls under the baselines is α = 3.0, the most extreme drift, for which incorrect translations are used for training while excellent ones are used for test.

Takeaways: Pre-training makes BLEURT significantly more robust to quality drifts.

5.3 WebNLG Experiments

In this section, we evaluate BLEURT's performance on three tasks from a data-to-text dataset, the WebNLG Challenge 2017 (Shimorina et al., 2019). The aim is to assess BLEURT's capacity to adapt to new tasks with limited training data.

Dataset and Evaluation Tasks: The WebNLG challenge benchmarks systems that produce natural language descriptions of entities (e.g., buildings, cities, artists) from sets of 1 to 5 RDF triples. The organizers released the human assessments for 9 systems over 223 inputs, that is, 4,677 sentence pairs in total (we removed null values). Each input comes with 1 to 3 reference descriptions. The submissions are evaluated on 3 aspects: semantics, grammar, and fluency. We treat each type of rating as a separate modeling task. The data has no natural split between train and test, therefore we experiment with several schemes. We allocate 0% to about 50% of the data to training, and we split on either the evaluated systems or the RDF inputs in order to test different generalization regimes.

Systems and Baselines: BLEURT -pre -wmt is a public BERT-large uncased checkpoint directly trained on the WebNLG ratings. BLEURT -wmt was first pre-trained on synthetic data, then fine-tuned on WebNLG data. BLEURT was trained in three steps: first on synthetic data, then on WMT data (16-18), and finally on WebNLG data. When a record comes with several references, we run BLEURT on each reference and report the highest value (Zhang et al., 2020).

We report four baselines: BLEU, TER, Meteor, and BERTscore. The first three were computed by the WebNLG competition organizers. We ran the latter one ourselves, using BERT-large uncased for a fair comparison.

Results: Figure 3 presents the correlation of the metrics with human assessments as we vary the share of data allocated to training. The more pre-trained BLEURT is, the quicker it adapts. The vanilla BERT approach BLEURT -pre -wmt requires about a third of the WebNLG data to dominate the baselines on the majority of tasks, and it still lags behind on semantics (split by system). In contrast, BLEURT -wmt is competitive with as little as 836 records, and BLEURT is comparable with BERTscore with zero fine-tuning.

Takeaways: Thanks to pre-training, BLEURT can quickly adapt to the new tasks. BLEURT fine-tuned twice (first on synthetic data, then on WMT data) provides acceptable results on all tasks without training data.

5.4 Ablation Experiments

Figure 4 presents our ablation experiments on WMT 2017, which highlight the relative importance of each pre-training task. On the left side, we compare BLEURT pre-trained on a single task to BLEURT without pre-training. On the right side, we compare full BLEURT to BLEURT pre-trained on all tasks except one. Pre-training on BERTscore, entailment, and the backtranslation scores yields improvements (symmetrically, ablating them degrades BLEURT). Oppositely, BLEU and ROUGE have a negative impact. We conclude that pre-training on high quality signals helps BLEURT, but that metrics that correlate less well with human judgment may in fact harm the model.⁶

⁶ Do those results imply that BLEU and ROUGE should be removed from future versions of BLEURT? Doing so may indeed yield slight improvements on the WMT Metrics 2017 shared task. On the other hand, the removal may hurt future tasks in which BLEU or ROUGE actually correlate with human assessments. We therefore leave the question open.

Figure 4: Improvement in Kendall Tau on WMT 17 varying the pre-training tasks (left: one pre-training task vs. no pre-training; right: all tasks vs. all tasks minus one).

6 Related Work

The WMT shared metrics competition (Bojar et al., 2016; Ma et al., 2018, 2019) has inspired the creation of many learned metrics, some of which use regression or deep learning (Stanojevic and Simaan, 2014; Ma et al., 2017; Shimanaka et al., 2018; Chen et al., 2017; Mathur et al., 2019). Other metrics have been introduced, such as the recent MoverScore (Zhao et al., 2019), which combines contextual embeddings and Earth Mover's Distance. We provide a head-to-head comparison with the best performing of those in our experiments. Other approaches do not attempt to estimate quality directly, but use information extraction or question answering as a proxy (Wiseman et al., 2017; Goodrich et al., 2019; Eyal et al., 2019). Those are complementary to our work.

There has been recent work that uses BERT for evaluation. BERTScore (Zhang et al., 2020) proposes replacing the hard n-gram overlap of BLEU with a soft overlap using BERT embeddings. We use it in all our experiments. Bertr (Mathur et al., 2019) and YiSi (Mathur et al., 2019) also make use of BERT embeddings to capture similarity. Sum-QE (Xenouleas et al., 2019) fine-tunes BERT for quality estimation as we describe in Section 3. Our focus is different: we train metrics that are not only state-of-the-art in conventional IID experimental setups, but also robust in the presence of scarce and out-of-distribution training data. To our knowledge, no existing work has explored pre-training and extrapolation in the context of NLG.

Previous studies have used noising for referenceless evaluation (Dušek et al., 2019). Noisy pre-training has also been proposed before for other tasks such as paraphrasing (Wieting et al., 2016; Tomar et al., 2017) but generally not with synthetic data. Generating synthetic data via paraphrases and perturbations has been commonly used for generating adversarial examples (Jia and Liang, 2017; Iyyer et al., 2018; Belinkov and Bisk, 2018; Ribeiro et al., 2018), an orthogonal line of research.

7 Conclusion

We presented BLEURT, a reference-based text generation metric for English. Because the metric is trained end-to-end, BLEURT can model human assessment with superior accuracy. Furthermore, pre-training makes the metric particularly robust to both domain and quality drifts. Future research directions include multilingual NLG evaluation, and hybrid methods involving both humans and classifiers.
Acknowledgments

Thanks to Eunsol Choi, Nicholas FitzGerald, Jacob Devlin, and to the members of the Google AI Language team for the proof-reading, feedback, and suggestions. We also thank Madhavan Kidambi and Ming-Wei Chang, who implemented blank-filling with BERT.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.
Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of ACL.
Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In Proceedings of ICLR.
Ondřej Bojar, Yvette Graham, and Amir Kamran. 2017. Results of the WMT17 metrics shared task. In Proceedings of WMT.
Ondřej Bojar, Yvette Graham, Amir Kamran, and Miloš Stanojević. 2016. Results of the WMT16 metrics shared task. In Proceedings of WMT.
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP.
Arun Tejasvi Chaganty, Stephen Mussman, and Percy Liang. 2018. The price of debiasing automatic metrics in natural language evaluation. In Proceedings of ACL.
Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of ACL.
Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of NAACL HLT.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL HLT.
Ondřej Dušek, Karin Sevegnani, Ioannis Konstas, and Verena Rieser. 2019. Automatic quality estimation for natural language generation: Ranting (jointly rating and ranking). In Proceedings of INLG.
Matan Eyal, Tal Baumel, and Michael Elhadad. 2019. Question answering as an automatic evaluation metric for news article summarization. In Proceedings of NAACL HLT.
Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh K. Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, et al. 2015. From captions to visual concepts and back. In Proceedings of CVPR.
Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Proceedings of NAACL HLT.
Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. The WebNLG challenge: Generating text from RDF data. In Proceedings of INLG.
Ben Goodrich, Mohammad Ahmad Saleh, Peter Liu, and Vinay Rao. 2019. Assessing the factual accuracy of text generation. In Proceedings of ACM SIGKDD.
Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of NAACL HLT.
Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of EMNLP.
Philipp Koehn. 2009. Statistical machine translation. Cambridge University Press.
Karen Kukich. 1983. Design of a knowledge-based report generator. In Proceedings of ACL.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Workshop on Text Summarization Branches Out.
Chia-Wei Liu, Ryan Lowe, Iulian V. Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of EMNLP.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.
Qingsong Ma, Ondřej Bojar, and Yvette Graham. 2018. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In Proceedings of WMT.
Qingsong Ma, Yvette Graham, Shugen Wang, and Qun Liu. 2017. Blend: A novel combined MT metric based on direct assessment – CASICT-DCU submission to WMT17 metrics task. In Proceedings of WMT.
Qingsong Ma, Johnny Wei, Ondřej Bojar, and Yvette Graham. 2019. Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In Proceedings of WMT.
Inderjeet Mani. 1999. Advances in automatic text summarization. MIT Press.
Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2019. Putting evaluation in context: Contextual embeddings improve machine translation evaluation. In Proceedings of ACL.
Kathleen McKeown. 1992. Text generation. Cambridge University Press.
Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of EMNLP.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of ACL.
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of ACL.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of ACL.
Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2018. RUSE: Regressor using sentence embeddings for automatic machine translation evaluation. In Proceedings of WMT.
Anastasia Shimorina, Claire Gardent, Shashi Narayan, and Laura Perez-Beltrachini. 2019. WebNLG challenge: Human evaluation results. Technical report.
Ronnie W. Smith and D. Richard Hipp. 1994. Spoken natural language dialog systems: A practical approach. Oxford University Press.
Milos Stanojevic and Khalil Simaan. 2014. BEER: Better evaluation as ranking. In Proceedings of WMT.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NIPS.
Ran Tian, Shashi Narayan, Thibault Sellam, and Ankur P. Parikh. 2019. Sticking to the facts: Confident decoding for faithful data-to-text generation. arXiv:1910.08684.
Gaurav Singh Tomar, Thyago Duque, Oscar Täckström, Jakob Uszkoreit, and Dipanjan Das. 2017. Neural paraphrase identification of questions with noisy pretraining. In Proceedings of the First Workshop on Subword and Character Level Models in NLP.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS.
Oriol Vinyals and Quoc Le. 2015. A neural conversational model. In Proceedings of ICML.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR.
John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Towards universal paraphrastic sentence embeddings. In Proceedings of ICLR.
Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL HLT.
Sam Wiseman, Stuart M. Shieber, and Alexander M. Rush. 2017. Challenges in data-to-document generation. In Proceedings of EMNLP.
Stratos Xenouleas, Prodromos Malakasiotis, Marianna Apidianaki, and Ion Androutsopoulos. 2019. SUM-QE: A BERT-based summary quality estimation model. In Proceedings of EMNLP.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In Proceedings of ICLR.
Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of EMNLP.

A Implementation Details of the Pre-Training Phase

This section provides implementation details for some of the pre-training techniques described in the main paper.

A.1 Data Generation

Random Masking: We use two masking strategies. The first strategy samples random words in the sentence and replaces them with masks (one for each token). Thus, the masks are scattered across the sentence. The second strategy creates contiguous sequences: it samples a start position s, a length l (uniformly distributed), and it masks all the tokens spanned by words between positions s and s + l. In both cases, we use up to 15 masks per sentence. Instead of running the language model once and picking the most likely token at each position, we use beam search (the beam size is 8 by default). This enforces consistency and avoids repeated sequences, e.g., ",,,".
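A minimal sketch of the two masking strategies described above; the decoding step that fills the masks with BERT (beam search of size 8 in the paper) is left as a stub, and the example sentence is made up.

```python
import random

MASK = "[MASK]"

def scatter_masks(tokens, max_masks=15, rng=random):
    """Strategy 1: replace up to 15 words at random positions with [MASK]."""
    n = rng.randint(1, min(max_masks, len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n))
    return [MASK if i in positions else t for i, t in enumerate(tokens)]

def contiguous_masks(tokens, max_masks=15, rng=random):
    """Strategy 2: mask a contiguous span starting at a random position."""
    length = rng.randint(1, min(max_masks, len(tokens)))
    start = rng.randint(0, len(tokens) - length)
    return [MASK if start <= i < start + length else t for i, t in enumerate(tokens)]

tokens = "the 1889 exhibition was held in paris".split()
masked = contiguous_masks(tokens)
# fill_masks(masked) would be the BERT masked-LM decoding step (beam search in the paper);
# it is left as a hypothetical stub here.
print(" ".join(masked))
```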
Backtranslation: Consider English and French. Given a forward translation model P_en→fr(z_fr|z_en) and a backward translation model P_fr→en(z_en|z_fr), we generate z̃ as follows:

z̃ = arg max_{z_en} P_fr→en(z_en|z*_fr), where z*_fr = arg max_{z_fr} P_en→fr(z_fr|z)

For the translations, we use a Transformer model (Vaswani et al., 2017), trained on English-German with the tensor2tensor framework.⁷

⁷ https://github.com/tensorflow/tensor2tensor

Word dropping: Given a synthetic example (z, z̃), we generate a pair (z, z̃′) by randomly dropping words from z̃. We draw the number of words to drop uniformly, up to the length of the sentence. We apply this transformation on about 30% of the data generated with the previous method.

A.2 Pre-Training Tasks

We now provide additional details on the signals we used for pre-training.

Automatic Metrics: As shown in the table, we use three types of signals: BLEU, ROUGE, and BERTscore. For BLEU, we used the original Moses sentenceBLEU⁸ implementation, using the Moses tokenizer and the default parameters. For ROUGE, we used the seq2seq implementation of ROUGE-N.⁹ We used a custom implementation of BERTscore, based on BERT-large uncased. ROUGE and BERTscore return three scores: precision, recall, and F-score. We use all three quantities.

⁸ https://github.com/moses-smt/mosesdecoder/blob/master/mert/sentence-bleu.cpp
⁹ https://github.com/google/seq2seq/blob/master/seq2seq/metrics/rouge.py

Backtranslation Likelihood: We compute all the losses using custom Transformer models (Vaswani et al., 2017), trained on two language pairs (English-French and English-German) with the tensor2tensor framework.

Normalization: All the regression labels are normalized before training.

A.3 Modeling

Setting the weights of the pre-training tasks: We set the weights γk with grid search, optimizing BLEURT's performance on WMT 17's validation set. To reduce the size of the grid, we make groups of pre-training tasks that share the same weights: (τBLEU, τROUGE, τBERTscore), (τ_en-fr,z|z̃, τ_en-fr,z̃|z, τ_en-de,z|z̃, τ_en-de,z̃|z), and (τ_entail, τ_backtran_flag).

B Experiments–Supplementary Material

B.1 Training Setup for All Experiments

We use BERT's public checkpoints¹⁰ with Adam (the default optimizer), learning rate 1e-5, and batch size 32. Unless specified otherwise, we use 800,000 training steps for pre-training and 40,000 steps for fine-tuning. We run training and evaluation in parallel: we run the evaluation every 1,500 steps and store the checkpoint that performs best on a held-out validation set (more details on the data splits and our choice of metrics in the following sections). We use Google Cloud TPUs v2 for learning, and Nvidia Tesla V100 accelerators for evaluation and test. Our code uses Tensorflow 1.15 and Python 2.7.

¹⁰ https://github.com/google-research/bert

B.2 WMT Metric Shared Task

Metrics. The metrics used to compare the evaluation systems vary across the years. The organizers use Pearson's correlation on standardized human judgments across all segments in 2017, and a custom variant of Kendall's Tau named "DARR" on raw human judgments in 2018 and 2019. The latter metric operates as follows. The organizers gather all the translations for the same reference segment, they enumerate all the possible pairs (translation1, translation2), and they discard all the pairs which have a "similar" score (less than 25 points away on a 100-point scale). For each remaining pair, they then determine which translation is the best according to both human judgment and the candidate metric. Let |Concordant| be the number of pairs on which the NLG metrics agree and |Discordant| be those on which they disagree; then the score is computed as follows:

(|Concordant| − |Discordant|) / (|Concordant| + |Discordant|)

The idea behind the 25 points filter is to make the evaluation more robust, since the judgments collected for WMT 2018 and 2019 are noisy. Kendall's Tau is identical, but it does not use the filter.

Training setup. To separate training and validation data, we set aside a fixed ratio of records in such a way that there is no "leak" between the datasets (i.e., train and validation records that share the same source). We use 10% of the data for validation for years 2017 and 2018, and 5% for year 2019. We report results for the models that yield the highest Kendall Tau across all records on validation data. The weights associated to each pre-training task (see our Modeling section) are set with grid search, using the train/validation setup of WMT 2017.

Baselines. We use three metrics: the Moses implementation of sentenceBLEU,¹¹ BERTscore,¹² and MoverScore,¹³ which are all available online. We run the Moses tokenizer on the reference and candidate segments before computing sentenceBLEU.

¹¹ https://github.com/moses-smt/mosesdecoder/blob/master/mert/sentence-bleu.cpp
¹² https://github.com/Tiiiger/bert_score
¹³ https://github.com/AIPHES/emnlp19-moverscore

B.3 Robustness to Quality Drift

Data Re-sampling Methodology: We sample the training and test data separately, as follows. We split the data in 10 bins of equal size. We then sample each record in the dataset with probabilities 1/B^α and 1/(11−B)^α for train and test respectively, where B is the bin index of the record between 1 and 10, and α is a predefined skew factor. The skew factor α controls the drift: a value of 0 has no effect (the ratings are centered around 0), and a value of 3.0 yields extreme differences. Note that the sizes of the datasets decrease as α increases: we use 50.7%, 30.3%, 20.4%, and 11.9% of the original 5,344 training records for α = 0.5, 1.0, 1.5, and 3.0 respectively.

B.4 Ablation Experiment–How Much Pre-Training Time is Necessary?

To understand the relationship between pre-training time and downstream accuracy, we pre-train several versions of BLEURT and we fine-tune them on WMT17 data, varying the number of pre-training steps. Figure 5 presents the results. Most gains are obtained during the first 400,000 steps, that is, after about 2 epochs over our synthetic dataset.

Figure 5: Improvement in Kendall Tau accuracy on all language pairs of the WMT Metrics Shared Task 2017, varying the number of pre-training steps. 0 steps corresponds to 0.555 Kendall Tau for BLEURTbase and 0.580 for BLEURT.
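To make the DARR definition of Section B.2 concrete, here is a minimal sketch for the candidate translations of a single reference segment, assuming human judgments on a 100-point scale; setting the threshold to zero gives the unfiltered Kendall-Tau-like variant mentioned above. The example scores are invented.

```python
from itertools import combinations

def darr(metric_scores, human_scores, threshold=25.0):
    """DARR over the candidate translations of one reference segment.

    Pairs whose human scores differ by less than `threshold` (25 points on a
    100-point scale) are discarded; the remaining pairs count as concordant or
    discordant depending on whether the metric ranks them like the human raters.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(human_scores)), 2):
        if abs(human_scores[i] - human_scores[j]) < threshold:
            continue  # "similar" pair, filtered out
        human_prefers_i = human_scores[i] > human_scores[j]
        metric_prefers_i = metric_scores[i] > metric_scores[j]
        if human_prefers_i == metric_prefers_i:
            concordant += 1
        else:
            discordant += 1
    if concordant + discordant == 0:
        return 0.0
    return (concordant - discordant) / (concordant + discordant)

print(darr([0.3, 0.7, 0.5], [10.0, 80.0, 40.0]))
```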
