News Summarization and Evaluation in the Era of GPT-3

Tanya Goyal1 Junyi Jessy Li2 Greg Durrett1


1Department of Computer Science   2Department of Linguistics
The University of Texas at Austin
tanyagoyal@utexas.edu

Abstract

The recent success of prompting large language models like GPT-3 has led to a paradigm shift in NLP research. In this paper, we study its impact on text summarization, focusing on the classic benchmark domain of news summarization. First, we investigate how GPT-3 compares against fine-tuned models trained on large summarization datasets. We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality. Next, we study what this means for evaluation, particularly the role of gold standard test sets. Our experiments show that both reference-based and reference-free automatic metrics cannot reliably evaluate GPT-3 summaries. Finally, we evaluate models on a setting beyond generic summarization, specifically keyword-based summarization, and show how dominant fine-tuning approaches compare to prompting. To support further research, we release: (a) a corpus of 10K generated summaries from fine-tuned and prompt-based models across 4 standard summarization benchmarks, (b) 1K human preference judgments comparing different systems for generic- and keyword-based summarization.1

1 All data available at: https://tagoyal.github.io/zeroshot-news-annotations.html.

Figure 1: Examples of GPT-3 summaries for a CNN article (https://www.cnn.com/2022/09/09/politics/judge-throws-out-trumps-rico-lawsuit-against-hillary-clinton-and-democrats/index.html). A length-constrained prompt ("Summarize the above article in 2 sentences.") and a keyword-constrained prompt ("Summarize the above article briefly focusing on Alina Habba.") each produce a fluent, on-target summary. We can generate summaries following style constraints or queries included in the prompts, allowing us to emulate a range of existing fine-tuned systems.

1 Introduction

Fine-tuning pre-trained models on domain-specific datasets has been the leading paradigm in text summarization research in recent years (Lewis et al., 2020; Zhang et al., 2020; Raffel et al., 2020). These models generate high-quality summaries on standard benchmarks, but still require sizeable training datasets to adapt to new settings, e.g., summarizing data from a new source domain or producing a summary in a different style. The success of prompting large language models (GPT-3 (Brown et al., 2020), T0 (Sanh et al., 2022), PaLM (Chowdhery et al., 2022), etc.) provides an alternative approach, namely learning from natural language task instructions and/or a few demonstrative examples in the context without updating model parameters. While recent work (Zhao et al., 2021; Min et al., 2022; Ye and Durrett, 2022) has evaluated this paradigm across a number of tasks, it has only been studied for text summarization with unreliable automatic metrics (He et al., 2022b; Chowdhery et al., 2022; Ouyang et al., 2022) or in non-standard settings (Saunders et al., 2022).

In this paper, we conduct the first systematic study of the impact of prompt-based models on the text summarization research space, using an Instruct-tuned 175B GPT-3 model (text-davinci-002) (Brown et al., 2020; Ouyang et al., 2022) as a case study. Figure 1 shows that GPT-3 summaries are extremely high-quality and adaptable to different summarization settings. Starting from these observations, we aim to answer three main questions.
First, how do prompt-based GPT-3 summaries compare to those obtained from state-of-the-art fine-tuned summarization models (Zhang et al., 2020; Liu et al., 2022)? We compare these approaches using A/B testing on a new corpus of recent news articles, and find that our study participants overwhelmingly prefer GPT-3 summaries across two different "styles" with different prompts (three-sentence and single-sentence). Moreover, these summaries do not suffer from the limitations due to low-quality training data that plague fine-tuned generic summarization models (Maynez et al., 2020; Goyal et al., 2022).

Second, are existing automatic metrics well-suited to evaluating prompt-based summaries? Recent work has shown that classic reference-based metrics such as ROUGE (Lin, 2004) and BERTScore (Zhang* et al., 2020) are unreliable when small improvements are reported (Peyrard, 2019; Fabbri et al., 2021); however, large differences, on the order of say 5 ROUGE points or greater, are considered to be correlated with human preferences (Bhandari et al., 2020; Deutsch et al., 2022). We find that the same is no longer true when evaluating GPT-3 summaries. These summaries score much lower on automatic metrics (7 ROUGE-L points on average) than all prior state-of-the-art models while comfortably outperforming them on human evaluation. Furthermore, we show that recent reference-free metrics, e.g. QA-based metrics (Fabbri et al., 2022; Durmus et al., 2020) and trained factuality models (Kryscinski et al., 2020; Goyal and Durrett, 2020), similarly fail to adapt to this shift from fine-tuning to prompting, and need to be revisited.

Finally, how can prompting be used beyond generic summarization? We focus on keyword-based and aspect-based summarization. For keyword-based summarization, we find that GPT-3 consistently generates more coherent and keyword-relevant summaries compared to current fine-tuned alternatives: crowd annotators prefer GPT-3 summaries over a baseline model (He et al., 2022a) 70% of the time. We observe mixed results for the aspect-based setting, where GPT-3 summaries show frequent failure cases with simple prompts.

Taken together, this evidence suggests that GPT-3 represents a fundamental paradigm shift in summarization, changing what data we need (or don't need) and what approaches we can now explore. Evaluating these systems will require a new framework distinct from the automatic metrics that have dominated the last decade of summarization research.

Dataset       Avg. words (article)   Avg. words (summary)   % novel 1-grams   % novel 2-grams
CNN           760.5                  45.7                    16.7              54.3
DailyMail     653.3                  54.6                    17.0              53.8
XSum (BBC)    431.1                  23.2                    35.7              82.4
Newsroom      658.6                  26.7                    18.9              47.5

Table 1: Basic statistics of standard summarization datasets: CNN/DM (Hermann et al., 2015; Nallapati et al., 2016), XSum (Narayan et al., 2018), Newsroom (Grusky et al., 2018). These show large variance in their summary properties and fundamentally differ in their definition of the "gold" standard.

2 Models and Setup

2.1 Current Paradigms for Summarization

Recent zero- and few-shot prompting-based models (Brown et al., 2020; Sanh et al., 2022) have shown impressive generalization capabilities on unseen tasks specified using prompts alone, without performing any gradient updates (Mishra et al., 2022). In this work, we want to compare their text summarization performance against the current state-of-the-art models.

Figure 2: Broad categorization of available summarization systems; those compared in this work are highlighted in red. Starting from a pre-trained LM, one branch is fine-tuned on summarization datasets to yield task-specific models (e.g. BART, PEGASUS, CTRLSum, BRIO); another is instruction-tuned on multiple tasks (e.g. Instruct-GPT, T5, T0, FLAN, text-davinci-002); pure zero-shot prompting of models not trained on standard summarization datasets (e.g. GPT-3, PaLM, Turing-NLG) is either unavailable or less effective than instruction-tuned counterparts.

Figure 2 shows the broad categories of all available summarization approaches, including current SOTA models and prompting-based models. The former set consists of fine-tuned language models, trained on a large number of article-summary pairs (e.g. BART (Lewis et al., 2020), PEGASUS (Zhang et al., 2020), BRIO (Liu et al., 2022)) to obtain dataset-specific systems. This category also includes models aimed at tasks beyond generic summarization, such as keyword- or query-based summarization, that still rely on standard datasets for training (He et al., 2022a). On the other extreme are zero- or few-shot models (e.g. GPT-3 (Brown et al., 2020), PaLM (Chowdhery et al., 2022)) that are not explicitly trained for any particular task, as discussed above.
Recent work (Ouyang et al., 2022; Wei et al., 2022; Sanh et al., 2022) has improved on these models by introducing instruction-tuned models. Here, pre-trained language models are fine-tuned on multiple tasks (which may include summarization) using instruction templates in order to align their training with inference-time usage.

In this work, we compare the summarization performance of three models that are representative of this space of options:

1. OpenAI's text-davinci-002, a GPT-3 model (Brown et al., 2020) from the Instruct series (Ouyang et al., 2022). While we do not know the exact training details for this release of the model, the previous model in the series (text-davinci-001) was fine-tuned on a combination of prompts submitted to their API and labeler prompts spanning multiple tasks. These tasks include summarization, but not (to our knowledge) standard summarization datasets like CNN/DM (Hermann et al., 2015; Nallapati et al., 2016) or XSum (Narayan et al., 2018). We choose the text-davinci-002 version for our experiments in order to benchmark the best available prompt-based model.2 We refer to this approach as GPT3-D2.

2. BRIO (Liu et al., 2022), a fine-tuned summarization model that reports state-of-the-art results on both CNN/DM and XSum. We use versions of this model fine-tuned on each of these two datasets.

3. T0 (Sanh et al., 2022), a prompt-based model fine-tuned on multiple tasks including standard summarization datasets. This provides a useful point of comparison between task-specific fine-tuned models (BRIO) and bigger instruction-tuned models (GPT3-D2).

2 We did not observe obvious quality differences in generated summaries between text-davinci-001 and text-davinci-002. Examples are included in Appendix C.

2.2 Using GPT3-D2 for summarization

Fine-tuned models largely follow the "style" of reference summaries in their training data, and hence, generated summaries show large variance between datasets (see Table 1 for basic summary statistics of standard summarization datasets). To ensure a fair comparison between these and GPT3-D2, we adapt the latter's prompt to align with dataset-specific styles.

Figure 3: Illustration of length control using the task description / prompt for GPT3-D2 (prompt: "Summarize the article in N sentences."), showing N = 1, 2, and 3 sentence summaries of the same CNN article about African nations condemning reports of discrimination at the Ukrainian border (https://www.cnn.com/2022/03/01/africa/africa-condemns-racism-ukraine-intl/index.html). We found that the generated summaries followed the given sentence length constraint 98% of the time, allowing us to generate different length summaries emulating different datasets.

Specifically, we follow prior work (Sanh et al., 2022) and use sentence-count length prompts to adapt to each dataset. Although these datasets also differ along other attributes, e.g. CNN/DM is lead-biased whereas XSum requires drawing inferences from a whole article, we do not attempt to control any other attributes of the summary. Figure 3 shows an example of different length GPT3-D2 summaries for the same news article, using the following prompt format:

Article: {{article}}
Summarize the above article in N sentences.

We found that GPT3-D2 summaries faithfully follow the given length constraint in 98% of the test instances used in our human study data in Section 3.
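To make the setup concrete, the following is a minimal sketch of how this prompt format can be sent to text-davinci-002 through the legacy OpenAI completions endpoint, together with a sentence-count check of the kind behind the 98% adherence statistic. The decoding parameters and the NLTK sentence tokenizer are illustrative assumptions, not the exact configuration used for the experiments in this paper.

# Minimal sketch (not the paper's exact pipeline): sentence-count prompting
# of text-davinci-002 via the legacy OpenAI completions API (openai < 1.0),
# plus a rough check of whether the output respects the length constraint.
import os

import nltk    # requires: nltk.download("punkt")
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

PROMPT_TEMPLATE = "Article: {article}\n\nSummarize the above article in {n} sentences."

def gpt3_d2_summary(article: str, n_sentences: int) -> str:
    """Generate a length-controlled summary using the prompt format above."""
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=PROMPT_TEMPLATE.format(article=article, n=n_sentences),
        max_tokens=256,    # assumed output budget
        temperature=0.0,   # assumed; deterministic decoding for evaluation
    )
    return response["choices"][0]["text"].strip()

def follows_length_constraint(summary: str, n_sentences: int) -> bool:
    """Sentence-count check of constraint adherence."""
    return len(nltk.sent_tokenize(summary)) == n_sentences

Setting n_sentences = 1 emulates XSum-style single-sentence summaries and n_sentences = 3 emulates CNN-style summaries, as used in the human study in Section 3.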
Given this setup, we first compare the summary quality of the three summarization models through a human annotation study (Section 3). Then, we evaluate the current suite of summarization metrics for prompt-based summarization (Section 4). Finally, in Section 5, we briefly discuss GPT3-D2 performance on summarization tasks beyond generic summarization, along with the new challenges these pose.
Figure 4: Examples of CNN-style and BBC/XSum-style summaries for the three systems (BRIO, T0, and GPT3-D2), shown for a CNN article about mortgage rates and a BBC article about the Strule Shared Education Campus in Omagh. For CNN, we observe that models fine-tuned on the CNN/DM training set reflect its dataset biases; summaries are highly extractive, specific and lead-biased. On the other hand, GPT3-D2 summaries contain fewer specific details but cover more content.
3 Human evaluation of GPT3-D2 summaries

Generated summaries of fine-tuned models (Lewis et al., 2020; Zhang et al., 2020; Liu et al., 2022) emulate gold-standard summaries in their training datasets. In contrast, prompt-based GPT3-D2 models generate summaries based on how the given task description surfaces behavior learned during pre-training or instruction-tuning. In this section, we ask: how do these paradigms compare? Does learning from gold summaries lead to a better summarization model? To answer this, we conduct a human study to compare outputs of our 3 representative models and collect human preferences of quality.

3.1 Experimental Setup

Datasets for fine-tuning We choose two standard fine-tuning datasets whose summaries differ along multiple dimensions such as length and abstractiveness:

1. CNN/DM (Hermann et al., 2015; Nallapati et al., 2016) contains reference summaries that are approximately 3-4 sentences long. Summaries in this dataset are highly extractive and lead-biased.

2. XSum (Narayan et al., 2018) contains 1-sentence summaries of BBC news articles. In this dataset, reference summaries, and consequently generated summaries from fine-tuned models, are highly abstractive.

Datasets for evaluation Because GPT3-D2's pre-training and instruction-tuning datasets are unknown, it may have been trained on existing articles and summaries in the test splits of these standard benchmarks. We therefore run our human study on 100 recent articles from CNN3 and BBC, collected between March 1, 2022 and June 31, 2022. We call these CNN-2022 and BBC-2022 respectively.

3 Although BRIO's CNN/DM model also includes DailyMail data in its training, we do not use this news source in our study as it is now widely considered to be unreliable. E.g. according to the Media Bias / Fact Check site, DM's factual reporting is rated 'low': https://mediabiasfactcheck.com/daily-mail/.

Model details We use the publicly released BRIO-XSum and BRIO-CNN/DM models to generate summaries.4 For T0, we use a prompt we selected from its prompt repository for the CNN/DM and XSum datasets.5 Finally, to generate GPT3-D2 summaries, we set N = 3 for CNN and N = 1 for BBC in our standard sentence-count prompt template from Section 2.

4 Models at: https://github.com/yixinL7/BRIO
5 Repository with T0 prompts: https://github.com/bigscience-workshop/promptsource

For a maximally fair comparison in this "realistic" setting, we take some additional steps to improve the output of BRIO-XSum. In order to automate dataset creation, XSum removes the first sentence from news articles to use as the gold summary for training, then treats the rest of the sentences as the article to summarize. This setup differs from the real-world usage of summarization systems, where the complete article is summarized. Due to this mismatch, BRIO-XSum often generates very low quality outputs (e.g. "All images: Strule Shared Education Campus"; cf. Figure 4) for around 30% of the articles.
We manually identify these examples and first attempt to fix them by selecting a summary without such obvious failures from further down the beam (we use beam size = 10). However, if we cannot find a "better" summary, we remove the first sentence of the article and re-sample a new summary to align with its noisy training. This latter strategy often results in factually incorrect summary generations, as is well documented in prior research (Maynez et al., 2020; Goyal and Durrett, 2021).

Design of the human study We design an A/B test to collect preference annotations. For each given article, annotators are shown summaries from all three summarization systems (BRIO, T0 and GPT3-D2). They are then asked to select their most and least preferred summary or summaries. In addition to these multiple-choice questions, we also ask for a free-text justification of both choices.

We make two design decisions for our human study: first, we do not provide annotators with specific definitions of summary quality to avoid introducing our own biases. It is also quite challenging to produce a unified definition of quality for the very different "styles" of summaries evaluated in this study. Instead, we ask them to rely on their own preferences based on summaries they would like to see if they were browsing the web, which we believe to be a representative scenario for non-expert consumers of news summaries. Detailed task instructions are included in Appendix F. Second, we allow multiple selections for both the best and worst summary questions to cater to scenarios in which different summarization systems output similar quality summaries without meaningful differences.

We hire crowd annotators through Prolific. For both CNN and BBC, we recruit 60 unique participants to annotate the 100 summaries in each dataset. Each annotator was asked to annotate 5 articles and each article was annotated by 3 annotators. Additionally, we use Prolific's demographic filters to restrict participation to USA (or UK) residents for CNN (or BBC). We anticipate that residents of these respective countries are better positioned to understand country-specific news events and evaluate their summaries. Participants were paid approximately $11/hr for their work.

             Length statistics        % novel n-grams      #NEs per
Model        #sent   #words/sent      n=1       n=2        100 words
CNN
BRIO         3.7     15.8             12.1      36.2       12.9
T0           2.7     14.9             16.4      45.2       12.8
GPT3-D2      2.9     23.4             16.3      40.7       10.5
BBC
BRIO         1.0     20.2             24.6      61.2       9.1
T0           1.0     20.0             26.3      66.7       9.8
GPT3-D2      1.0     27.7             16.4      42.3       8.5

Table 2: Statistics for generated summaries evaluated in the human study across all datasets and summarization systems. We observe that GPT3-D2 generated summaries nearly always follow the sentence length constraints in their prompts.

3.2 Results

Differences between summarization systems Figure 4 shows examples of generated summaries from all three summarization systems for both CNN and BBC articles. For CNN, we observe that fine-tuned BRIO summaries tend to be highly extractive and generally include a high number of named entities (dates, percentages, names), reflecting the data the model was trained on. In contrast, GPT3-D2 summaries are more abstractive and less specific, but provide a more exhaustive overview of the article content. Table 2 provides quantitative evidence of this; we use the percentage of novel n-grams to measure abstractiveness, and the number of named entities per 100 words to measure specificity.

For BBC, we observe the inverse trend: BRIO and T0 are more abstractive compared to GPT3-D2. Again, this can be attributed to the XSum training data used to train both of these prior models. For GPT3-D2 summaries, on the other hand, the level of abstractiveness does not differ between datasets. Finally, Table 2 shows that GPT3-D2 summaries tend to have longer sentences, and therefore a similar number of summary sentences often results in a longer summary for both datasets. We study the effect of this length difference on human preference judgments in Appendix B.
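The two statistics in Table 2 are straightforward to compute; the sketch below shows one possible implementation, assuming a spaCy pipeline (en_core_web_sm) for tokenization and named entity recognition. The paper does not specify its exact tooling, so the details here are illustrative.

# Sketch of the Table 2 statistics: % novel n-grams (abstractiveness) and
# named entities per 100 words (specificity). The spaCy model is an assumption.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires the model to be downloaded

def _ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pct_novel_ngrams(article: str, summary: str, n: int) -> float:
    """Percentage of (unique) summary n-grams that never occur in the article."""
    art = [t.text.lower() for t in nlp(article) if not t.is_space]
    summ = [t.text.lower() for t in nlp(summary) if not t.is_space]
    summ_ngrams = _ngrams(summ, n)
    if not summ_ngrams:
        return 0.0
    return 100.0 * len(summ_ngrams - _ngrams(art, n)) / len(summ_ngrams)

def entities_per_100_words(summary: str) -> float:
    """Named-entity density, a rough proxy for how specific a summary is."""
    doc = nlp(summary)
    n_words = sum(1 for t in doc if not t.is_punct and not t.is_space)
    return 100.0 * len(doc.ents) / max(n_words, 1)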
Which systems do humans prefer? Results of our human study are summarized in Table 3. We report the percentage of times a particular system is the most/least preferred model according to majority vote combining all three annotators' choices.6

6 As we allow multiple system selections, note that more than one system could be the majority. However, this is rare after majority vote: only 2% of the articles in CNN and 7% in BBC have multiple best summaries.
            BRIO               T0                 GPT3-D2
Dataset     Best ↑   Worst ↓   Best ↑   Worst ↓   Best ↑   Worst ↓
CNN         36       24        8        67        58       9
BBC         20       56        30       29        57       15

Table 3: Percentage of times a summarization system is selected as the best or worst according to majority vote (may be tied). Human annotators have a clear preference for GPT3-D2 for both CNN and BBC style summaries.

Across both datasets and styles, we observe a clear preference for GPT3-D2 summaries compared to the other two models. In fact, in both scenarios, GPT3-D2 outperforms the next best model by at least 20 percentage points. This improvement is statistically significant according to a paired bootstrap test (CNN p-value = 2 × 10−3, BBC p-value = 6 × 10−4).
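For reference, a paired bootstrap test of this kind can be sketched as follows, resampling articles with replacement and comparing per-article majority-vote wins for the two systems. The number of resamples and the 0/1 win encoding are illustrative assumptions, not the exact procedure used for the numbers above.

# Sketch of a paired bootstrap significance test over per-article outcomes.
import numpy as np

def paired_bootstrap_pvalue(wins_a, wins_b, n_resamples=10_000, seed=0):
    """wins_a, wins_b: per-article 0/1 indicators (majority-vote wins)."""
    rng = np.random.default_rng(seed)
    wins_a, wins_b = np.asarray(wins_a), np.asarray(wins_b)
    n = len(wins_a)
    count = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)          # resample articles
        if wins_a[idx].mean() - wins_b[idx].mean() <= 0:
            count += 1                            # system A failed to beat B
    return count / n_resamples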
strong models.
Note that the next best model differs between the
two datasets. For BBC, annotators prefer T0 sum-
maries over BRIO. Annotator rationales often men- Conversely, although BRIO (or T0) summaries are
tioned misleading or incorrect information as the less preferred than GPT3-D2 for the CNN (or BBC)
primarily reason for selecting BRIO as the worst dataset on aggregate, they were voted as the best
summary, confirming the issues that have been ob- summary by at least one annotator for more than
served with XSum-trained models (Maynez et al., 60% of the articles. This demonstrate two things:
2020; Pagnoni et al., 2021; Goyal and Durrett, first, when comparing summaries from two strong
2021). Although T0 also includes XSum training models, the choice is inherently ambiguous (similar
data, we hypothesize that its multi-task framework observations in Clark et al. (2021)). Second, these
helps offset the noisy signal from XSum. results and the diversity in the written rationales,
In contrast, annotators rate T0 as the worst sum- show that there does not exist a universal definition
marization system for CNN. The most common of a “good” summary and that different summary
rationales for these were shorter length and inclu- properties appeal to different annotators. Regard-
sion of irrelevant details, e.g. long quotes, while less, the aggregate preference for GPT3-D2 is high
missing key points. Some annotators also com- enough across the board to give us confidence in
mented that these T0 summaries were less coherent its strength.
compared to the other models. Interestingly, we
did not observe similar complaints for the single- How do these results impact the field? Progress
sentence T0 summaries for BBC. in text summarization research in the last five years
has been enabled by the construction of large-scale
Do annotators agree with each other? To study text summarization datasets that involved scrap-
this, we plot the distribution of annotator votes for ing news articles and pairing them with any avail-
each summarization system and dataset in Figure 5. able summary-like data (Hermann et al., 2015;
Additionally, we report the inter-annotator agree- Narayan et al., 2018; Grusky et al., 2018). The
ment, measured using Krippendorff’s alpha with CNN/DM dataset considers bullet points accompa-
MASI distance (Passonneau, 2006), to account for nying news articles as its summary. These “gold”
multiple selections of best or worst summary al- standard summaries provided useful training sig-
lowed in our study design. nal to train impressive supervised models (Lewis
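Since each label here is a set of systems rather than a single choice, this agreement computation can be reproduced with NLTK's AnnotationTask and MASI distance, as sketched below on toy data (the vote triples are illustrative, not our actual annotations).

# Sketch: Krippendorff's alpha with MASI distance over set-valued labels.
from nltk.metrics.agreement import AnnotationTask
from nltk.metrics.distance import masi_distance

# (annotator, article_id, frozenset of systems voted "best") triples.
votes = [
    ("a1", "article_1", frozenset({"GPT3-D2"})),
    ("a2", "article_1", frozenset({"GPT3-D2", "BRIO"})),
    ("a3", "article_1", frozenset({"T0"})),
    ("a1", "article_2", frozenset({"GPT3-D2"})),
    ("a2", "article_2", frozenset({"GPT3-D2"})),
    ("a3", "article_2", frozenset({"GPT3-D2"})),
]

task = AnnotationTask(data=votes, distance=masi_distance)
print(f"Krippendorff's alpha (MASI): {task.alpha():.3f}")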
Figure 5: Annotator vote distribution for best and worst summaries across all datasets and models (inter-annotator agreement: CNN best 0.05, worst 0.11; BBC best 0.18, worst 0.15). Although GPT3-D2 is the clear winner according to majority vote, this choice is unanimous for less than 30% of the articles. This demonstrates the inherent variance in different annotators' definitions of "best summary", especially when comparing high-quality summaries from strong models.

The vote distribution shows that although more annotators prefer GPT3-D2 summaries, this choice is only unanimous, i.e. supported by all three annotators, for less than 30% of the annotated articles. Conversely, although BRIO (or T0) summaries are less preferred than GPT3-D2 for the CNN (or BBC) dataset on aggregate, they were voted as the best summary by at least one annotator for more than 60% of the articles. This demonstrates two things: first, when comparing summaries from two strong models, the choice is inherently ambiguous (similar observations in Clark et al. (2021)). Second, these results and the diversity in the written rationales show that there does not exist a universal definition of a "good" summary and that different summary properties appeal to different annotators. Regardless, the aggregate preference for GPT3-D2 is high enough across the board to give us confidence in its strength.

How do these results impact the field? Progress in text summarization research in the last five years has been enabled by the construction of large-scale text summarization datasets that involved scraping news articles and pairing them with any available summary-like data (Hermann et al., 2015; Narayan et al., 2018; Grusky et al., 2018). The CNN/DM dataset considers bullet points accompanying news articles as its summary. These "gold" standard summaries provided useful training signal to train impressive supervised models (Lewis et al., 2020; Zhang et al., 2020; Liu et al., 2022) and hence, their quality or alignment with human preferences was largely ignored.

We found that, despite its popularity, XSum is largely unsuitable for fine-tuning models like BRIO for realistic summarization settings.
Even though a CNN/DM-trained BRIO model performed better, the results of our human study question the continued utility of hill-climbing on this dataset, as it seems users may simply prefer a different style of summary altogether. In fact, this preference for GPT3-D2 is much larger than the incremental improvements reported in other human evaluation settings, e.g. improvements on XSum on the GENIE leaderboard (Khashabi et al., 2022). Furthermore, as we will see in Section 5, the greater flexibility of GPT3-D2 compared to these systems makes it more suitable for news summarization tasks beyond generic summarization.

If a system designer collects a large-scale dataset of high-quality summaries that they wish to emulate, we believe a fine-tuned system may outperform GPT3-D2. However, better-trained models on datasets collected via "incidental" supervision are less likely to help.

                                                                               QAEval
Dataset     Model     ROUGE(1/2/L)        METEOR   BLEU   BERTScore   MoverScore   EM      F1
CNN         PEGASUS   34.85/14.62/28.23   .24      7.1    .858        .229         .105    .160
            BRIO      38.49/17.08/31.44   .31      6.6    .864        .261         .137    .211
            T0        35.06/13.84/28.46   .25      5.9    .859        .238         .099    .163
            GPT3-D2   31.86/11.31/24.71   .25      3.8    .858        .216         .098    .159
DailyMail   PEGASUS   45.77/23.00/36.65   .33      12.2   .865        .308         .159    .229
            BRIO      49.27/24.76/39.21   .37      11.7   .871        .331         .175    .259
            T0        42.97/19.04/33.95   .28      8.9    .863        .290         .121    .184
            GPT3-D2   38.68/14.24/28.08   .26      6.6    .859        .248         .101    .159
XSum        PEGASUS   47.97/24.82/39.63   .36      9.8    .901        .362         .145    .221
            BRIO      49.66/25.97/41.04   .39      10.6   .901        .372         .139    .224
            T0        44.20/20.72/35.84   .34      8.0    .896        .340         .125    .208
            GPT3-D2   28.78/7.64/20.60    .19      2.2    .869        .197         .066    .119
Newsroom    PEGASUS   39.21/27.73/35.68   .39      .14    .873        .272         .182    .253
            BRIO      -                   -        -      -           -            -       -
            T0        25.64/9.49/21.41    .20      .04    .849        .145         .080    .125
            GPT3-D2   27.44/10.67/22.18   .22      .05    .859        .159         .089    .142

Table 4: Performance of different summarization systems measured using reference-based automatic metrics. Across all datasets, we observe that automatic metrics report substantially worse results for GPT3-D2 summaries compared to fine-tuned models. This directly contradicts the human preference results from Section 3, demonstrating that these reference-based metrics cannot reliably compare the quality of prompt-based summaries against fine-tuned summaries.

4 Can current automatic metrics evaluate GPT3-D2 summaries?

Automatic metrics proposed for summarization evaluation can be broadly divided into two categories: (1) reference-based metrics, which compare generated summaries against available gold summaries, and (2) reference-free metrics, which only rely on the input document. Here, we compare their performance at evaluating prompt-based GPT3-D2 summaries.

Experimental Setup We evaluate automatic metrics using summaries from 4 different summarization datasets, listed in Table 1. For each dataset, we construct our evaluation sets by randomly sampling 500 articles7 from the standard test split.8 We compare the same 3 summarization systems from Section 3 in our analysis. Additionally, we also report results using the fine-tuned PEGASUS model (Zhang et al., 2020), as BRIO fine-tuned models are not available for all datasets.

7 This size is chosen to give sufficient statistical power (Card et al., 2020) while keeping costs for GPT3-D2 evaluation low to enable others to compare on this subset. We outline costs in Appendix D.

8 Note that these standard datasets were released before 2020. Therefore, it is possible that some article-summary pairs in our test set overlap with GPT3-D2's training data. However, we do not observe a qualitative difference in GPT3-D2's performance on these older articles.

We publicly release this corpus of summarization outputs to standardize the test sets and support future research into GPT3-D2-based summarization. Link: https://tagoyal.github.io/zeroshot-news-annotations.html.
                      Overall Quality      Factuality (QA-based)      Factuality (NLI-based)
Dataset     Model     SUPERT    BLANC      QuestEval   QAFactEval     FactCC   DAE      SummaC
CNN         PEGASUS   .5466     .0605      .7373       4.4071         .3743    .8223    .1138
            BRIO      .5586     .0802      .7334       3.8332         .1817    .7577    -.0532
            T0        .5330     .0558      .7799       3.7517         .2012    .7556    -.0605
            GPT3-D2   .5560     .0749      .7249       3.6399         .2428    .6671    -.0729
DailyMail   PEGASUS   .6433     .1137      .7536       4.4677         .5152    .8497    .2402
            BRIO      .6360     .1217      .7415       4.1362         .3699    .8118    .0153
            T0        .5995     .0889      .7803       3.9827         .2431    .8043    .0478
            GPT3-D2   .6118     .0983      .7461       3.8279         .2697    .6990    .0365
XSum        PEGASUS   .4439     .0249      .8233       2.0089         .2465    .3598    -.2993
            BRIO      .4459     .0230      .8305       1.8626         .2031    .3040    -.3292
            T0        .4538     .0238      .7957       2.0330         .2219    .3392    -.3037
            GPT3-D2   .5060     .0594      .8064       2.9492         .3977    .6372    -.2626
Newsroom    PEGASUS   .6286     .1131      .7118       4.2120         .7218    .7956    .2418
            BRIO      -         -          -           -              -        -        -
            T0        .5433     .0640      .7511       3.5799         .2828    .7376    .0261
            GPT3-D2   .5408     .0599      .7160       3.2336         .3988    .6564    -.0729

Table 5: Performance of different summarization systems, as scored by automatic reference-free evaluation metrics from the summarization literature. Similar to reference-based metrics, these also generally fail to produce the same system rankings as human preferences reliably across datasets.

4.1 Reference-based metrics

Here, we study if the gold summaries of the standard datasets are useful for evaluation, especially when evaluating prompt-based summaries that are not trained to emulate the gold. We benchmark the performance of 3 different kinds of summarization metrics: (1) overlap-based metrics, specifically ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005) and BLEU (Papineni et al., 2002); (2) similarity-based metrics, which compute similarity between embedding representations of generated and reference summaries; specifically, we report BERTScore (Zhang* et al., 2020) and MoverScore (Zhao et al., 2019); and (3) a QA-based metric, specifically QAEval (Deutsch et al., 2021). Although most QA-based metrics are reference-free (discussed in Section 4.2), QAEval uses the reference summaries to indicate saliency. We report both the exact match (EM) and F1 components of QAEval.
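As a concrete reference point, the sketch below computes two of these reference-based scores for a single (reference, candidate) pair using the rouge-score and bert-score packages; the exact package versions and settings behind Table 4 are not specified here, so these calls are illustrative.

# Sketch: reference-based scoring of one generated summary.
from rouge_score import rouge_scorer
from bert_score import score as bertscore

def reference_based_scores(reference: str, candidate: str) -> dict:
    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeL"], use_stemmer=True
    )
    rouge = scorer.score(reference, candidate)
    # BERTScore operates on parallel lists of candidates and references.
    _, _, f1 = bertscore([candidate], [reference], lang="en")
    return {
        "rouge1": rouge["rouge1"].fmeasure,
        "rouge2": rouge["rouge2"].fmeasure,
        "rougeL": rouge["rougeL"].fmeasure,
        "bertscore_f1": f1.item(),
    }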
Results Table 4 outlines the results. It shows that the BRIO and PEGASUS models, fine-tuned to emulate the reference summaries, outperform GPT3-D2 summaries according to all reference-based automatic metrics. The difference in their assigned scores is very high, e.g. >7 ROUGE-L points between GPT3-D2 and BRIO. For comparison, these reported scores for GPT3-D2 are even lower than the trivial Lead-3 baseline reported in prior work (Fabbri et al., 2021; Grusky et al., 2018). This clearly demonstrates that current automatic reference-based metrics cannot be used to reliably measure summary quality under the prompting paradigm.

Amongst prompting-based models, we observe that T0 summaries report better metric scores than GPT3-D2 for all datasets except Newsroom. Interestingly, out of the four datasets evaluated here, Newsroom is the only one not used to train the T0 model. This further shows that access to dataset-specific reference summaries during training improves performance according to these metrics, rendering them unsuitable for evaluating prompt-based models.

4.2 Reference-free metrics

Next, we investigate whether current reference-free evaluation metrics reflect the human preference rankings between summarization systems, as observed in Section 3. Here, we study 2 categories of metrics: (1) quality metrics, specifically SUPERT (Gao et al., 2020), which evaluates generated summaries against automatically identified salient sentences in the input, and BLANC (Vasilyev et al., 2020), which evaluates summaries via language understanding tasks; we refer readers to the original papers for a detailed explanation of these. (2) factuality metrics, which evaluate whether generated summaries contain incorrect information with respect to the source article. We report the performance of summarization systems using two QA-based metrics: QuestEval (Scialom et al., 2021) and QAFactEval (Fabbri et al., 2022). Additionally, we also benchmark entailment-based metrics: FactCC (Kryscinski et al., 2020), DAE (Goyal and Durrett, 2020, 2021) and SummaC (Laban et al., 2022).9 These entailment-based models are designed for classification into factual or non-factual; therefore, we use P(factual | article, summary) to score generated summaries.

9 Exact model versions and configurations used for these are outlined in Appendix A.
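FactCC, DAE and SummaC each ship their own trained checkpoints and input formats, so the sketch below only illustrates the general scoring recipe, P(factual | article, summary), using an off-the-shelf MNLI model (roberta-large-mnli) as a stand-in entailment classifier; it is not the metric implementation evaluated in Table 5.

# Illustration of entailment-style factuality scoring with a generic NLI
# model standing in for the trained FactCC/DAE/SummaC classifiers.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # stand-in model, not the paper's metric
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

def p_factual(article: str, summary: str) -> float:
    """Probability that the summary is entailed by the (truncated) article."""
    inputs = tokenizer(article, summary, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1).squeeze(0)
    # Read the entailment class index from the model config, not a hardcoded id.
    ent_idx = {v.lower(): k for k, v in model.config.id2label.items()}["entailment"]
    return probs[ent_idx].item()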
Results Table 5 outlines the scores for each summarization system according to the above reference-free metrics. Ideally, we want the relative rankings of different systems according to these metrics to correspond to human preferences, i.e. GPT3-D2 > BRIO > T0 for CNN/DM10 and GPT3-D2 > T0 > BRIO for XSum.11

10 Although the human study in Section 3 is only run on CNN articles, the underlying fine-tuned model is the same for both CNN and DM. Therefore, we can reasonably expect it to display similar quality differences with respect to GPT3-D2.

11 Note that while annotators were not explicitly asked to rate factuality, we instructed them to carefully check factuality and appropriately downvote non-factual summaries.

Overall, we observe that none of the reference-free metrics we evaluate follow these trends for both the CNN/DM and XSum datasets. In particular, we observe that GPT3-D2 summaries report low factuality scores (except on XSum) even though we rarely found any factual errors in our qualitative analysis of its generated summaries.

Interestingly, we noticed a roughly inverse relation to abstractiveness; summarization systems that generated more abstractive summaries (see Table 2) were generally scored lower by these reference-free metrics. For instance, GPT3-D2 is scored lower than BRIO by both quality metrics for all datasets except XSum; the latter is the only dataset for which GPT3-D2 summaries are less abstractive. Such shortcomings of reference-free evaluation metrics due to spurious correlations have also been studied in prior work (Durmus et al., 2022). These issues become more exaggerated when the summarization systems being compared exhibit very different properties.

Discussion On the surface, the failure of reference-free metrics at evaluating GPT3-D2 summaries is more surprising than that of reference-based metrics, as the latter explicitly compare generated summaries with references that GPT3-D2 is not trained to imitate. Therefore, GPT3-D2 understandably scores lower than fine-tuned systems on those.

However, we note two different issues with reference-free metrics: (1) Some of these, e.g. FactCC and DAE, use reference summaries as positive examples to train the metric. Therefore, although "reference-free" at test time, they are still trained to reward the summary properties seen in the standard summarization benchmarks. (2) Even completely reference-free metrics, e.g. QuestEval and QAFactEval, have only been evaluated on reference-based benchmarks and fine-tuned models. Therefore, the choice of different components, such as which question answering or question generation models to use, has been dictated by the error space of prior fine-tuned models (Tang et al., 2023). These decisions also now need to be revisited to incorporate GPT3-D2 evaluation; we leave this for future work.

5 Beyond Generic Summarization

Previously, we observed that GPT3-D2 models faithfully follow simple "style" instructions in the given prompts. This provides a promising direction to tackle other use cases in news summarization beyond the generic summarization task from Section 3.

Different users can have very different information needs from the same article, all of which cannot be satisfied with a single generic summary. Prior work has introduced several task formulations to address this gap, including keyword-focused (He et al., 2022a), query-focused (Baumel et al., 2014; He et al., 2022a), or aspect-focused summarization (Krishna and Srinivasan, 2018; Ahuja et al., 2022), amongst others. Here, we evaluate GPT3-D2 performance at two of these use cases.

In keyword-based summarization, the output summaries must succinctly summarize the input document focusing on a given keyword; keywords generally correspond to specific entities or events directly mentioned in the document. In contrast, the control units in aspect-based summarization are high-level topics that can be common across multiple similar types of documents. For example, for the input article in Figure 1, Donald Trump or Russian interference in the 2016 elections are keyword controls, whereas charges against the defendants is a higher-level aspect that can serve as the query for any news article discussing a lawsuit or investigation.
Figure 6: Comparison of keyword- and aspect-based summaries using the GPT3-D2 and CTRLSum models for a CNN article about Republicans defending Donald Trump over classified documents found at Mar-a-Lago (https://www.cnn.com/2022/09/08/politics/republicans-trump-national-security-implications-classified-docs). The GPT3-D2 prompt is shown on the left with the corresponding keyword (William Barr) or aspect bolded. For keyword-based summarization, the GPT3-D2 summary presents appropriate context before the keyword-specific information, while the CTRLSum summary misses context and shows poor discourse. For aspect-based summarization (aspects: "who is a defendant or under investigation?" and "what is the defendant's reaction to charges?"), GPT3-D2 does not always generate factually correct summaries, as shown in the first aspect example, though it captures aspect-relevant content for the second; CTRLSum performs poorly for both these settings.

5.1 Qualitative Analysis

Baseline Model for comparison We use the recently proposed CTRLSum (He et al., 2022a), a fine-tuned BART model, as our baseline. It can be flexibly adapted for both keyword- and aspect-based settings by including a prompt as additional input to the encoder. We use the prompt template recommended in the original paper.12

12 Trained model publicly released at: https://github.com/salesforce/ctrl-sum.

Control Units For the keyword-focused setting, we use named entities extracted from the input article as the control units. For aspect-focused summarization, we directly use the aspects introduced in the guided summarization task from TAC 2011.13 It defined 5 broad categories of newswire articles, such as accidents and natural disasters, investigations and trials, etc., and multiple aspects for each category. For example, the "investigations and trials" category includes aspects such as "who is the defendant or under trial?", "who is investigating, prosecuting, judging?", and so on.

13 https://tac.nist.gov/2011/Summarization/Guided-Summ.2011.guidelines.html
Qualitative Analysis Figure 6 shows examples of keyword- and aspect-focused summaries using GPT3-D2 and the baseline CTRLSum model. The keywords or aspects are highlighted in bold within the GPT3-D2 prompt displayed on the left.

In this example, representative of average GPT3-D2 quality, the keyword-focused GPT3-D2 summary first gives a brief overview of the article setting before providing keyword-relevant information. In contrast, the CTRLSum summary exhibits poor discourse structure and reads like a list of facts stapled together.

The figure also shows aspect-focused summaries for two aspects associated with the "investigations and trials" category, the category most appropriate for the chosen article. We see mixed results here for GPT3-D2; it generates a factually incorrect summary for the first aspect, listing multiple people from the input article as defendants instead of only "Donald Trump". For the second aspect, it correctly maps the high-level concept "defendant" to "Donald Trump" in the input article and generates the correct answer to the input query: "The defendant's reaction to charges in the above article is denial of charges".

On the other hand, CTRLSum fails to generate aspect-focused summaries for both cases. We believe that it struggles to align high-level concepts with explicit entities in the article due to a lack of such aspect-specific examples in its training data. Instead, it generates summaries focusing on lexically similar words, i.e. "defenders" for both cases.
Based on GPT3-D2's promising keyword-focused summarization capabilities observed above, we next conduct a human study to systematically compare it against the CTRLSum baseline. We leave further explorations of aspect-based summarization to future work, given the mixed to poor results for both models at this task.

5.2 Human Study: Keyword-focused summarization

Task Setup Similar to Section 3, we design an A/B test to compare the two models. We use the same set of 100 CNN14 articles as Section 3. We randomly extract 2 distinct named entities from each article. In the study interface, the annotator is shown the article-keyword pair and the GPT3-D2 and CTRLSum summaries corresponding to it. They are asked to select the summary that best summarizes the input article while focusing on the given keyword. Exact task instructions are included in Appendix F.

14 We run this study using only CNN articles as the baseline CTRLSum model is trained on CNN/DM.

Again, we run this study using the Prolific platform. We recruit 60 participants to annotate the 100 articles; each article is annotated by 3 annotators, which includes annotations for the 2 separate keywords. Each annotator evaluates 5 articles.

Figure 7: Distribution of annotator votes for the keyword-focused summarization task (win % according to majority vote: GPT3-D2 69.8%, CTRLSum 30.2%). Annotators prefer GPT3-D2 summaries over CTRLSum for approximately 70% of all article-keyword pairs, showing unanimous preference more than half the time.

Results Figure 7 shows the distribution of annotator votes between the GPT3-D2 and CTRLSum models. Annotators show a clear preference for GPT3-D2. In fact, for nearly 70% of all article-keyword pairs, GPT3-D2 is preferred over CTRLSum by a majority of the annotators. The main rationales given for this choice were better contextualization of keyword-related information and better coherence in GPT3-D2 summaries.

Impact These results show that prompting GPT-3 models presents a promising alternative to fine-tuned models for such specialized summarization tasks that can be easily described using textual prompts. One of the major drawbacks of fine-tuned models is that they are constrained by what data is available and how it can be transformed to create new task-specific training data. CTRLSum relied on the SQuAD question answering dataset (Rajpurkar et al., 2016) because the required "queries" or "questions" were unavailable at scale for summaries in standard summarization datasets. In contrast, prompt-based models are not constrained by the availability of task-specific data and can flexibly adapt to new tasks. Future research should focus on further exploring these capabilities and possible improvements on currently "unsolved" tasks such as aspect-based or plan-based summarization.

6 Discussion and Related Work

In recent years, research in text summarization (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017; Lewis et al., 2020; Zhang et al., 2020; Liu et al., 2022) has typically relied on comparisons with gold test sets for evaluation, possibly augmented with reference-free metrics for dimensions like factuality. This paper shows that all these metrics are completely ineffective at evaluating GPT-3 summaries. Although issues with these metrics, particularly low correlation with human judgments, have also been studied earlier (Fabbri et al., 2021; Deutsch and Roth, 2021), they are considered reliable when comparing systems in different score ranges (Peyrard, 2019; Deutsch et al., 2022). However, GPT-3 challenges these established practices and evaluation protocols, and poses an urgent need for better evaluation.

This brings us to manual evaluation, generally considered to be the gold standard for generation evaluation. The majority of summarization research now reports results from a human study in addition to automatic metrics, but there is a general lack of consensus on what dimensions to evaluate, task design, and other factors (Hardy et al., 2019). This presents difficulties in conducting reliable and reproducible comparisons between systems (Karpinska et al., 2021), another factor contributing to the popularity of automatic metrics.
Although recent efforts like GENIE (Khashabi et al., 2022) have taken steps to standardize manual evaluation protocols across systems, its annotation is not universally affordable and the quality is not strictly monitored. We hope that future work addresses these challenges and democratizes human evaluation.

The ultimate test of summarization systems is with actual users using the systems in practice. Jones (2007) discusses the need to align task formulations with actual application scenarios ("purpose factors"). However, research in text summarization until now has been constrained to certain problems or domains by the heavy dependence on large-scale training data: for example, producing a bullet-point summary of a news article has emerged as standard due to the availability of data from CNN, not because it is shown to be the best way to present information.

Now, the success of prompt-based models can allow realistic use cases to drive research in a more top-down way. We already show that GPT3-D2 improves upon prior keyword-focused summarization systems that were trained on artificially adapted training data. In future research, we are interested in tackling other real-world use cases, such as update summarization and plan- or aspect-based summarization. Additionally, adapting GPT3-D2 to documents longer than the allowed context, or to structured inputs such as tables, presents research challenges beyond the current capabilities of GPT-3 and would be interesting to study.15

15 We very briefly discuss long document summarization with GPT-3 in Appendix E.

7 Conclusion

In this work, we performed the first systematic study comparing prompt-based GPT-3 and fine-tuned models at the news summarization task. We analyzed the impact of prompting on the summarization field, including training paradigms and evaluation practices. Finally, to support further research in this direction, we release a large corpus of generated summaries for multiple prompt-based and fine-tuned models, as well as human preference judgments comparing these systems.

8 Limitations

In the text generation evaluation literature, there does not exist a standardized task design for comparing different system generations. In our work, we chose a human evaluation workflow that directly asks annotators to compare systems, while other prior work has opted for Likert-scale judgments and/or evaluation along multiple quality dimensions (Gehrmann et al., 2022). The latter strategy of evaluating different dimensions could surface more insights into which "style" properties of GPT-3 summaries give them an edge over fine-tuned models; however, such analysis is outside the scope of this paper. Our experiments comparing overall quality reveal that current summarization datasets are not well-aligned with user preferences. We leave more fine-grained analysis of these preference judgments for future work.

The experiments in this paper are run on English-language news summarization datasets as these serve as common benchmarks in the summarization literature. However, user rankings of system outputs might be different when evaluating other domains, e.g., summaries of scientific text. While we believe that automatic metrics would fail to evaluate GPT-3 summaries on these domains also (generated summaries would still look different from the reference summaries), users may prefer models that are specifically fine-tuned on domain-specific data for niche domains.

Finally, we do not know the exact datasets or tasks used to train GPT3-D2. It is possible that its RLHF training (Ouyang et al., 2022) included summarization examples, and therefore preference judgments from human annotators for its different outputs. However, our arguments in this paper do not rely on the specifics of the GPT3-D2 system, merely that such a system exists. If anything, the existence of potentially better data underscores that further work should collect new data for summarization model tuning, and our claims about metrics still hold regardless of the details of how the GPT3-D2 summaries were produced.
research in this direction, we release a large corpus References
of generated summaries for multiple prompt-based
Ojas Ahuja, Jiacheng Xu, Akshay Gupta, Kevin
and fine-tuned models, as well as human preference
Horecka, and Greg Durrett. 2022. ASPECTNEWS:
judgments comparing these systems. Aspect-oriented summarization of news documents.
In Proceedings of the 60th Annual Meeting of the
8 Limitations Association for Computational Linguistics (Volume
1: Long Papers), pages 6494–6506.
In the text generation evaluation literature, there
does not exist a standardized task design for com- Satanjeev Banerjee and Alon Lavie. 2005. METEOR:
An automatic metric for mt evaluation with improved
15 correlation with human judgments. In Proceedings of
We very briefly discuss long document summarization
with GPT-3 in Appendix E. the acl workshop on intrinsic and extrinsic evaluation
Tal Baumel, Raphael Cohen, and Michael Elhadad. 2014. Query-chain focused summarization. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 913–922.

Manik Bhandari, Pranav Narayan Gour, Atabak Ashfaq, and Pengfei Liu. 2020. Metrics also disagree in the low scoring range: Revisiting summarization evaluation metrics. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5702–5711.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, and Dan Jurafsky. 2020. With little power comes great responsibility. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9263–9274.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.

Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A Smith. 2021. All that's 'human' is not gold: Evaluating human evaluation of generated text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7282–7296.

Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 615–621, New Orleans, Louisiana. Association for Computational Linguistics.

Daniel Deutsch, Tania Bedrax-Weiss, and Dan Roth. 2021. Towards question-answering as an automatic metric for evaluating the content quality of a summary. Transactions of the Association for Computational Linguistics, 9:774–789.

Daniel Deutsch, Rotem Dror, and Dan Roth. 2022. Re-examining system-level correlations of automatic summarization evaluation metrics. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 6038–6052, Seattle, United States. Association for Computational Linguistics.

Daniel Deutsch and Dan Roth. 2021. Understanding the extent to which content quality metrics measure the information quality of summaries. In Proceedings of the 25th Conference on Computational Natural Language Learning, pages 300–309.

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070.

Esin Durmus, Faisal Ladhak, and Tatsunori B Hashimoto. 2022. Spurious correlations in reference-free evaluation of text generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1443–1454.

Alexander Fabbri, Chien-Sheng Wu, Wenhao Liu, and Caiming Xiong. 2022. QAFactEval: Improved QA-based factual consistency evaluation for summarization. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2587–2601, Seattle, United States. Association for Computational Linguistics.

Alexander R Fabbri, Wojciech Kryscinski, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409.

Yang Gao, Wei Zhao, and Steffen Eger. 2020. SUPERT: Towards new frontiers in unsupervised evaluation metrics for multi-document summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1347–1354.

Sebastian Gehrmann, Elizabeth Clark, and Thibault Sellam. 2022. Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text. arXiv preprint arXiv:2202.06935.

Tanya Goyal and Greg Durrett. 2020. Evaluating factuality in generation with dependency-level entailment. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3592–3603.

Tanya Goyal and Greg Durrett. 2021. Annotating and modeling fine-grained factuality in summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1449–1462.

Tanya Goyal, Jiacheng Xu, Junyi Jessy Li, and Greg Durrett. 2022. Training dynamics for text summarization models. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2061–2073.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708–719.

Hardy Hardy, Shashi Narayan, and Andreas Vlachos. 2019. HighRES: Highlight-based reference-less evaluation of summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3381–3392.

Junxian He, Wojciech Kryscinski, Bryan McCann, Nazneen Rajani, and Caiming Xiong. 2022a. CTRLsum: Towards generic controllable text summarization. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5879–5915, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Pengcheng He, Baolin Peng, Liyang Lu, Song Wang, Jie Mei, Yang Liu, Ruochen Xu, Hany Hassan Awadalla, Yu Shi, Chenguang Zhu, et al. 2022b. Z-Code++: A pre-trained language model optimized for abstractive summarization. arXiv preprint arXiv:2208.09770.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. Advances in Neural Information Processing Systems, 28.

Karen Spärck Jones. 2007. Automatic summarising: The state of the art. Information Processing & Management, 43(6):1449–1481.

Marzena Karpinska, Nader Akoury, and Mohit Iyyer. 2021. The perils of using Mechanical Turk to evaluate open-ended text generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1265–1285.

Daniel Khashabi, Gabriel Stanovsky, Jonathan Bragg, Nicholas Lourie, Jungo Kasai, Yejin Choi, Noah A. Smith, and Daniel Weld. 2022. GENIE: Toward reproducible and standardized human evaluation for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11444–11458, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Kundan Krishna and Balaji Vasan Srinivasan. 2018. Generating topic-oriented summaries using neural attention. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1697–1705.

Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332–9346.

Wojciech Kryscinski, Nazneen Fatema Rajani, Divyansh Agarwal, Caiming Xiong, and Dragomir R Radev. 2021. BookSum: A collection of datasets for long-form narrative summarization.

Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. 2022. SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Yixin Liu, Pengfei Liu, Dragomir Radev, and Graham Neubig. 2022. BRIO: Bringing order to abstractive summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2890–2903.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919.

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2022. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290.

Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.

Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. 2021. Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4812–4829.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Rebecca J Passonneau. 2006. Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06).

Maxime Peyrard. 2019. Studying summarization evaluation metrics in the appropriate scoring range. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5093–5100.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389.

Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2022. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations.

William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. 2022. Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802.

Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang, and Patrick Gallinari. 2021. QuestEval: Summarization asks for fact-based evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6594–6604.

Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083.

Liyan Tang, Tanya Goyal, Alexander R Fabbri, Philippe Laban, Jiacheng Xu, Semih Yavuz, Wojciech Kryściński, Justin F Rousseau, and Greg Durrett. 2023. Understanding factual errors in summarization: Errors, summarizers, datasets, error detectors. Association for Computational Linguistics.

Oleg Vasilyev, Vedant Dharnidharka, and John Bohannon. 2020. Fill in the BLANC: Human-free quality estimation of document summaries. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 11–20.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. 2022. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.

Xi Ye and Greg Durrett. 2022. The unreliability of explanations in few-shot prompting for textual reasoning. In Advances in Neural Information Processing Systems.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pages 11328–11339. PMLR.

Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Yusen Zhang, Ansong Ni, Ziming Mao, Chen Henry Wu, Chenguang Zhu, Budhaditya Deb, Ahmed Awadallah, Dragomir Radev, and Rui Zhang. 2022. SummN: A multi-stage summarization framework for long input dialogues and documents. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1592–1604.
Yusen Zhang, Ansong Ni, Tao Yu, Rui Zhang, Chenguang Zhu, Budhaditya Deb, Asli Celikyilmaz, Ahmed Hassan, and Dragomir Radev. 2021. An exploratory study on long dialogue summarization: What works and what's next. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4426–4433.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In Proceedings of the International Conference on Machine Learning (ICML).

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578.

Yao Zhao, Mohammad Saleh, and Peter J Liu. 2020. SEAL: Segment-wise extractive-abstractive long-form text summarization. arXiv preprint arXiv:2006.10213.

A Implementation Details

Prompts Used To generate GPT3-D2 summaries for all experiments in this paper, we use the standard prompt format outlined in Section 2. We set N = 3 for CNN and DailyMail, N = 2 for Newsroom, and N = 1 for XSum/BBC. For the latter, the prompt is slightly modified to "Summarize the above article briefly in 1 sentence."

For T0, we use the following prompts: a) CNN/DM: "Summarize the article below in 3 to 4 sentences?", b) Newsroom: "Summarize the article below in 2 to 3 sentences?", and c) XSum/BBC: "Summarize the article below in 1 sentence?"
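To make the prompt format concrete, the snippet below is a minimal sketch of how such length-constrained prompts can be assembled and sent to the Completions endpoint. It assumes the pre-1.0 openai Python package with an API key set in the environment; the decoding parameters (max_tokens, temperature) are illustrative assumptions and are not specified in this appendix.

```python
import openai  # assumes the pre-1.0 openai package and OPENAI_API_KEY in the environment

# Sentence budgets per dataset, following the N values given above.
SENTENCE_BUDGET = {"cnn_dm": 3, "newsroom": 2, "xsum_bbc": 1}

def build_prompt(article: str, dataset: str) -> str:
    """Append the length-constrained instruction (Section 2 format) to the article text."""
    n = SENTENCE_BUDGET[dataset]
    if n == 1:
        # XSum/BBC uses the slightly modified single-sentence instruction quoted above.
        instruction = "Summarize the above article briefly in 1 sentence."
    else:
        instruction = f"Summarize the above article in {n} sentences."
    return f"{article}\n\n{instruction}"

def gpt3_summarize(article: str, dataset: str) -> str:
    # Decoding settings are illustrative, not taken from the paper.
    response = openai.Completion.create(
        model="text-davinci-002",  # GPT3-D2
        prompt=build_prompt(article, dataset),
        max_tokens=256,
        temperature=0.0,
    )
    return response["choices"][0]["text"].strip()
```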
Factuality Metrics In Section 4.2, we evaluated several recently proposed factuality metrics. We note that multiple versions have been released for some of these models in recent years. Here, we specify the versions used in our experiments to ensure reproducibility of results:

1. QuestEval: We use version 0.2.4 of the questeval python package and report numbers using the precision-only setting.

2. DAE: We use the updated version of the DAE model trained for document-level factuality. Latest code and model released at https://github.com/tagoyal/factuality-datasets.

3. SummaC: We use the SummaC-Conv model (model_name = 'vitc') and sentence-level granularity in our experiments.
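As one example of how these metrics are invoked, the sketch below scores a (source, summary) pair with SummaC-Conv as configured above. The constructor arguments mirror the summac package's published usage example and are assumptions here, not details from the paper; argument names may differ across package versions.

```python
from summac.model_summac import SummaCConv  # assumes: pip install summac

# SummaC-Conv with the 'vitc' NLI backbone and sentence-level granularity,
# matching the configuration described above. Remaining arguments follow the
# package's example usage and may need adjusting for your installed version.
model = SummaCConv(models=["vitc"], granularity="sentence",
                   bins="percentile", nli_labels="e",
                   start_file="default", agg="mean", device="cpu")

source = "Full text of the news article..."   # placeholder
summary = "Model-generated summary..."        # placeholder

result = model.score([source], [summary])
print("SummaC-Conv consistency score:", result["scores"][0])
```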
Keyword-based data For our keyword-based human study, we extracted two named entities per article, as discussed in Section 5. In practice, we constrained the first keyword to be lead-biased, i.e. it was extracted from the first three sentences of the article, and the second keyword was extracted from the remaining article. As CNN-based summarization models are generally lead-biased, this allowed us to benchmark models under both settings.
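The appendix does not name the NER system used to pick these keywords; the sketch below shows one way the lead-biased selection heuristic could be implemented, with spaCy assumed purely for illustration.

```python
import spacy  # assumes: pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def extract_keywords(article: str):
    """Return one entity from the first three sentences and one entity from the rest."""
    doc = nlp(article)
    sents = list(doc.sents)
    lead_text = " ".join(s.text for s in sents[:3])
    rest_text = " ".join(s.text for s in sents[3:])

    lead_ents = [e.text for e in nlp(lead_text).ents]
    rest_ents = [e.text for e in nlp(rest_text).ents if e.text not in lead_ents]

    # Returns None for a slot if that part of the article has no named entity.
    return (lead_ents[0] if lead_ents else None,
            rest_ents[0] if rest_ents else None)
```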
B Are annotator judgments of quality correlated with length?

In Section 3, results of the human study showed that annotators provide shorter length as one of the main reasons for selecting T0 summaries as the worst for the CNN dataset. Here, we investigate if the choice between GPT3-D2 and BRIO is similarly influenced by their length differences; GPT3-D2 summaries are on average 9 words longer.

To study this, we plot the difference in summary length against the difference in annotator score (measured as the no. of votes for a summarization system) between the best summarization system (GPT3-D2) and the next best system (BRIO for CNN and T0 for BBC). The resulting plot is shown in Figure 8.

Figure 8: Correlation between summary length and annotator score (computed as the no. of "best summary" votes). For each example, we plot the difference in length (x-axis: Len(s*) − Len(s)) and annotator score (y-axis: Score(s*) − Score(s)) between the GPT3-D2 summary and the next best system's summary.

In general, we observe low correlation between these; Pearson's ρ is 0.17 for CNN and 0.02 for the BBC dataset. These correlation values cannot solely explain the large differences in annotator judgments reported in the human study results of Section 3; additional quality factors must have influenced this choice. Anecdotally, we observe that the GPT-3 summaries are slightly less information dense; our impression is that they contain a similar level of information content, but are easier to read and understand despite being a bit more verbose.
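For reference, the correlation reported above can be computed as in the following minimal sketch; the arrays are placeholders for the per-example differences, not the actual study data.

```python
from scipy.stats import pearsonr

# Per-article differences between the GPT3-D2 summary and the next best
# system's summary (BRIO for CNN, T0 for BBC): length in words, and
# annotator score measured as the number of "best summary" votes.
length_diff = [12, -3, 9, 20, 5]   # placeholder values
score_diff = [1, 0, -1, 2, 0]      # placeholder values

rho, p_value = pearsonr(length_diff, score_diff)
print(f"Pearson's rho = {rho:.2f} (p = {p_value:.3f})")
```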
Figure 9: Examples of generated summaries using the text-davinci-001 (GPT3-D1) and text-davinci-002 (GPT3-D2) versions. The figure shows both BBC and CNN-style summaries.

C Qualitative differences between GPT-3 versions

Figure 9 shows examples comparing summaries from text-davinci-001 (GPT3-D1) to those from GPT3-D2. For BBC-style single sentence summaries, we observed that the two models generated very similar summaries with high content and lexical overlap. More variance is observed for CNN-style summaries. In our anecdotal assessment, GPT3-D1 generated more detailed summaries while those from GPT3-D2 are less information dense.

D Human study and API costs

At the time of running our experiments, the GPT-3 API's text-davinci-002 version was priced at $0.06 per 1K tokens. New pricing information is available at: https://openai.com/api/pricing/. In our experiments, we generated around 2600 GPT3-D2 summaries across all experiments in Section 3 (human study), Section 4 (evaluation of metrics) and Section 5 (keyword-based human study). We spent a total of approximately $150 on API requests.

For the human study, we paid participants $4 per task (each task involved annotation for 5 articles). On average, this translated to $11/hr of work. The combined cost for the generic summarization (Section 3) and the keyword-based summarization (Section 5) studies was $1020, including platform costs and bonus payments.
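A back-of-the-envelope check of these figures; the per-request token count below is an inference from the numbers above, not a value reported in the paper.

```python
# Reported figures: ~$150 total, ~2600 GPT3-D2 requests, $0.06 per 1K tokens.
price_per_1k_tokens = 0.06
total_cost = 150.0
num_requests = 2600

total_tokens = total_cost / price_per_1k_tokens * 1000   # ~2.5M tokens overall
tokens_per_request = total_tokens / num_requests          # ~960 tokens (prompt + completion)
print(f"~{total_tokens / 1e6:.1f}M tokens total, ~{tokens_per_request:.0f} tokens per request")
```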
E Long document summarization using GPT3-D2

Summarization of long documents has attracted significant interest in recent years (Cohan et al., 2018; Kryscinski et al., 2021). Here, we study how naive prompting of GPT-3 performs at long-document summarization.

First, we extract text from a long input article from the CNN website (article link: https://www.cnn.com/2021/09/07/opinions/covid-19-good-and-bad-news-ranney/index.html). Next, we follow the commonly used segment-then-summarize procedure from prior work (Zhao et al., 2020; Zhang et al., 2022). We divide the input article into 3 disjoint segments, summarize each segment separately, and concatenate these outputs to form the final summary.

Figure 10: Illustrative example of GPT3-D2 summary of a long source article generated using the segment-then-summarize pipeline. The common prompt for all segments was "Summarize the above article briefly in 2-3 sentences."

Figure 10 shows the prompt used and the generated summaries for each segment. While individual segment summaries are high quality, we can see that the concatenated summary is not coherent and includes repeated "introductory" sentences outlining similar content. Related to this, it also does
not cover all important aspects of the input arti-
cle as a majority of its ‘length budget’ is spent on
a high-level overview. We also observed that the
generated summaries for long documents often focus
on less important parts of the document, e.g.
“...everyone should take the precaution of ... opening
windows to let the fresh air in” in the illustrated ex-
ample. This is, in part, due to the segmentation of
the input article: GPT3-D2 still exhibits some lead
bias and treats the beginning of the input segment
as more salient. Therefore, the exact segmentation
of the article also dictates the quality of the final
summary, and cannot be readily fixed by altering
the prompt.
These observations show that while GPT3-D2
produces superior segment-level summaries, it is
more difficult to adapt it to “non-natural” text in-
puts without fine-tuning. Therefore, techniques that
have shown promising results for fine-tuned mod-
els, e.g. segment-then-summarize or extract-then-
abstract (Zhang et al., 2021) approaches, are not
as effective when directly applied with prompting-
based models.
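For reference, the naive segment-then-summarize pipeline examined in this appendix can be sketched as below. The equal-size word-based segmentation and the decoding parameters are illustrative assumptions, and the pre-1.0 openai package is assumed.

```python
import openai  # assumes the pre-1.0 openai package and OPENAI_API_KEY in the environment

PROMPT = "Summarize the above article briefly in 2-3 sentences."  # common prompt from Figure 10

def segment_then_summarize(article: str, n_segments: int = 3) -> str:
    """Split the article into disjoint segments, summarize each, then concatenate."""
    words = article.split()
    size = (len(words) + n_segments - 1) // n_segments
    segments = [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

    summaries = []
    for segment in segments:
        response = openai.Completion.create(
            model="text-davinci-002",
            prompt=f"{segment}\n\n{PROMPT}",
            max_tokens=200,      # illustrative
            temperature=0.0,     # illustrative
        )
        summaries.append(response["choices"][0]["text"].strip())
    # Plain concatenation is exactly what produces the incoherence discussed above.
    return " ".join(summaries)
```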

F Task Instructions
Task instructions provided to crowd annotators for
the generic summarization task setting are shown in
Figure 14 and those for the keyword-based setting
are shown in Figure 15.

G Examples of generated summaries


We show examples of generated summaries for generic summarization on CNN-2022 and BBC-2022 articles in Figures 11 and 12. These include summaries from the 3 different summarization models evaluated in the human study in Section 3.
Examples of keyword-focused summaries are
shown in Figure 13 for CNN. It includes summaries
generated by GPT3-D2 and CTRLSum models.
Figure 11: Examples of generated summaries for the CNN-2022 dataset using 3 different summarization systems.

Figure 12: Examples of generated summaries for the BBC-2022 dataset using 3 different summarization systems.
Figure 13: Examples of keyword-focused summaries for CNN articles from 2022.

Figure 14: Screenshot of the task instructions for the generic summarization setting.

Figure 15: Screenshot of the task instructions for the keyword-based setting.
