News Summarization and Evaluation in The Era of GPT-3
[Figure: GPT3-D2 summaries of the same CNN article under sentence-count prompts with N = 1, 2, 3.

N=1: […] of discrimination against African citizens at the Ukrainian border during a meeting at the UN HQ in New York City Monday.

N=2: The United Nations Security Council condemned the reports of discrimination against African citizens at the Ukrainian border. The African Union has said it is "disturbed" by the reports of segregation against Africans in Ukraine, which it described as "shockingly racist."

N=3: The article discusses the reports of discrimination against African citizens at the Ukrainian border. The representatives from the three African nations on the UN Security Council condemned the reports and called for the mistreatment of African peoples on Europe's borders to cease immediately. Foreign students attempting to flee Ukraine after Russia invaded the country told CNN that they experienced racial discrimination at the Ukrainian border.]

pre-trained language models are fine-tuned on multiple tasks (which may include summarization) using instruction templates in order to align their training with inference time usage.

In this work, we compare the summarization performance of three models that are representative of this space of options:

1. OpenAI's text-davinci-002, a GPT-3 model
BBC Article: The full opening of a landmark shared education campus in Omagh is facing another delay. Education Minister Michelle McIlveen has now said the Strule Shared Education Campus is planned to open in September 2026. The minister clarified the new date in response to an assembly question from the SDLP MLA Daniel McCrossan. The campus had originally been due to open in 2020, but has been delayed a number of times. […]

BRIO: The full opening of the Strule shared education campus in Omagh, County Tyrone, has been delayed to September 2026.

T0: Strule, Northern Ireland's biggest ever school-building project, is now scheduled to open in 2026, the Education Minister has said.

GPT3-D2: The Strule Shared Education Campus is facing another delay, and is now planned to open in September 2026.
Figure 4: Examples of CNN-style and BBC/XSum-style summaries for the three systems. For CNN, we observe
that models fine-tuned on the CNN/DM training set reflect its dataset biases; summaries are highly extractive, specific
and lead-biased. On the other hand, GPT3-D2 summaries contain fewer specific details but cover more content.
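The sentence-count prompting setup behind the N = 1, 2, 3 examples above can be sketched as follows. The exact template wording here is an assumption (the paper defines its actual template in its Section 2), and `sentence_count_prompt` is a hypothetical helper name, not code from the paper.

```python
def sentence_count_prompt(article: str, n: int) -> str:
    """Build a sentence-count prompt of the kind used for GPT3-D2.

    The wording is an illustrative assumption based on the N=1/2/3
    examples above, not the paper's exact template.
    """
    return f"Article: {article}\n\nSummarize the above article in {n} sentences."


prompt = sentence_count_prompt("Some article text.", 3)
assert prompt.startswith("Article: Some article text.")
assert prompt.endswith("Summarize the above article in 3 sentences.")
```

Varying only `n` is what produces the increasingly detailed summaries shown in the first figure.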
task description surfaces behavior learned during pre-training or instruction-tuning. In this section, we ask: how do these paradigms compare? Does learning from gold summaries lead to a better summarization model? To answer this, we conduct a human study to compare outputs of our 3 representative models and collect human preferences of quality.

3.1 Experimental Setup

Datasets for fine-tuning  We choose two standard fine-tuning datasets whose summaries differ along multiple dimensions such as length and abstractiveness:

1. CNN/DM (Hermann et al., 2015; Nallapati et al., 2016) contains reference summaries that are approximately 3-4 sentences long. Summaries in this dataset are highly extractive and lead-biased.

2. XSum (Narayan et al., 2018) contains 1-sentence summaries of BBC news articles. In this dataset, reference summaries, and consequently generated summaries from fine-tuned models, are highly abstractive.

Datasets for evaluation  Because GPT3-D2's pre-training and instruction-tuning datasets are unknown, it may have been trained on existing articles and summaries in the test splits of these standard benchmarks. We therefore run our human study on 100 recent articles from CNN³ and BBC, collected between March 1, 2022 and June 31, 2022. We call these CNN-2022 and BBC-2022 respectively.

Model details  We use the publicly released BRIO-XSum and BRIO-CNN/DM models to generate summaries.⁴ For T0, we use a prompt we selected from its prompt repository for the CNN/DM and XSum datasets.⁵ Finally, to generate GPT3-D2 summaries, we set N = 3 for CNN and N = 1 for BBC in our standard sentence-count prompt template from Section 2.

³ Although BRIO's CNN/DM model also includes DailyMail data in its training, we do not use this news source in our study as it is now widely considered to be unreliable. E.g., according to the Media Bias / Fact Check site, DM's factual reporting is rated 'low': https://mediabiasfactcheck.com/daily-mail/.
⁴ Models at: https://github.com/yixinL7/BRIO
⁵ Repository with T0 prompts: https://github.com/bigscience-workshop/promptsource

For a maximally fair comparison in this "realistic" setting, we take some additional steps to improve the output of BRIO-XSum. In order to automate dataset creation, XSum removes the first sentence from news articles to use as the gold summary for training, then treats the rest of the sentences as the article to summarize. This setup differs from the real-world usage of summarization systems, where the complete article is summarized. Due to this mismatch, BRIO-XSum often generates very low quality outputs, e.g. All images: Strule Shared
Education Campus in Figure 4, for around 30% of the articles. We manually identify these examples and first attempt to fix them by selecting a summary without such obvious failures from further down the beam (we use beam size = 10). However, if we cannot find a "better" summary, we remove the first sentence of the article and re-sample a new summary to align with its noisy training. This latter strategy often results in factually incorrect summary generations, as is well documented in prior research (Maynez et al., 2020; Goyal and Durrett, 2021).

Model      #sent   #words/sent   %novel 1-grams   %novel 2-grams   #NEs per 100 words
CNN
  BRIO      3.7     15.8          12.1             36.2             12.9
  T0        2.7     14.9          16.4             45.2             12.8
  GPT3-D2   2.9     23.4          16.3             40.7             10.5
BBC
  BRIO      1.0     20.2          24.6             61.2              9.1
  T0        1.0     20.0          26.3             66.7              9.8
  GPT3-D2   1.0     27.7          16.4             42.3              8.5

Table 2: Statistics for generated summaries evaluated in the human study across all datasets and summarization systems. We observe that GPT3-D2 generated summaries nearly always follow the sentence length constraints in their prompts.

Design of the human study  We design an A/B
test to collect preference annotations. For each given article, annotators are shown summaries from all three summarization systems (BRIO, T0 and GPT3-D2). They are then asked to select their most and least preferred summary or summaries. In addition to these multiple choice questions, we also ask for a free-text justification of both choices.

We make two design decisions for our human study. First, we do not provide annotators with specific definitions of summary quality, to avoid introducing our own biases; it is also quite challenging to produce a unified definition of quality for the very different "styles" of summaries evaluated in this study. Instead, we ask them to rely on their own preferences, based on the summaries they would like to see if they were browsing the web, which we believe to be a representative scenario for non-expert consumers of news summaries. Detailed task instructions are included in Appendix F. Second, we allow multiple selections for both the best and worst summary questions, to cater to scenarios in which different summarization systems output similar quality summaries without meaningful differences.

We hire crowd annotators through Prolific. For both CNN and BBC, we recruit 60 unique participants to annotate the 100 summaries in each dataset. Each annotator was asked to annotate 5 articles, and each article was annotated by 3 annotators. Additionally, we use Prolific's demographic filters to restrict participation to USA (or UK) residents for CNN (or BBC); we anticipate that residents of these countries are better positioned to understand country-specific news events and evaluate their summaries. Participants were paid approximately $11/hr for their work.

3.2 Results

Differences between summarization systems  Figure 4 shows examples of generated summaries from all three summarization systems for both CNN and BBC articles. For CNN, we observe that fine-tuned BRIO summaries tend to be highly extractive and generally include a high number of named entities (dates, percentages, names), reflecting the data the model was trained on. In contrast, GPT3-D2 summaries are more abstractive and less specific, but provide a more exhaustive overview of the article content. Table 2 provides quantitative evidence of this; we use the percentage of novel n-grams to measure abstractiveness, and the number of named entities per 100 words to measure specificity.

For BBC, we observe the inverse trend: BRIO and T0 are more abstractive compared to GPT3-D2. Again, this can be attributed to the XSum training data used to train both these prior models. For GPT3-D2 summaries, on the other hand, the level of abstractiveness does not differ between datasets. Finally, Table 2 shows that GPT3-D2 summaries tend to have longer sentences, so a similar number of summary sentences often results in a longer summary for both datasets. We study the effect of this length difference on human preference judgments in Appendix B.

Which systems do humans prefer?  Results of our human study are summarized in Table 3. We report the percentage of times a particular system is the most/least preferred model according to a majority vote combining all three annotators' choices.⁶

⁶ As we allow multiple system selections, note that more than one system could be the majority. However, this is rare after majority vote: only 2% of the articles in CNN and 7% in
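The majority-vote aggregation described above, including the possibility that multiple selections produce more than one majority system, can be sketched as follows. `majority_picks` and the example vote sets are hypothetical illustrations, not the paper's actual analysis code.

```python
from collections import Counter


def majority_picks(selections):
    """Systems selected as 'best' (or 'worst') by a strict majority of
    annotators. Each annotator may select more than one system, so more
    than one system can end up with a majority (a tie)."""
    counts = Counter(sys for chosen in selections for sys in set(chosen))
    n = len(selections)
    return {sys for sys, c in counts.items() if c > n / 2}


# Hypothetical votes from three annotators for one article; the second
# annotator marked two systems as equally good, producing a tie.
votes = [{"GPT3-D2"}, {"GPT3-D2", "BRIO"}, {"BRIO"}]
assert majority_picks(votes) == {"GPT3-D2", "BRIO"}
```

When each annotator picks exactly one system, at most one system can clear the majority threshold, matching the common case reported in the study.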
           BRIO            T0              GPT3-D2
Dataset    Best ↑  Worst ↓  Best ↑  Worst ↓  Best ↑  Worst ↓
CNN        36      24       8       67       58      9
BBC        20      56       30      29       57      15

Table 3: Percentage of times a summarization system is selected as the best or worst according to majority vote (may be tied). Human annotators have a clear preference for GPT3-D2 for both CNN and BBC style summaries.

[Figure: distribution of annotator votes (0-3) for "Which summary is the most preferred?" and "Which summary is the least preferred?" across GPT3-D2, BRIO, and T0. Annotator agreement: 0.05 (best) and 0.11 (worst) on CNN; 0.18 (best) and 0.15 (worst) on BBC.]
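The abstractiveness statistic reported in Table 2 (percentage of novel n-grams) can be approximated with a short sketch like the one below. The whitespace tokenization is a simplifying assumption, and the specificity statistic (named entities per 100 words) would additionally require an NER model, which is omitted here.

```python
def novel_ngram_pct(article: str, summary: str, n: int = 2) -> float:
    """Percentage of summary n-grams absent from the article: a simple
    whitespace-token proxy for the abstractiveness statistic in Table 2
    (the paper's exact tokenization is not reproduced here)."""
    def ngrams(text: str) -> set:
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    summ = ngrams(summary)
    if not summ:
        return 0.0
    return 100.0 * len(summ - ngrams(article)) / len(summ)


article = "the campus had been due to open in 2020 but was delayed again"
# A fully extractive summary has no novel bigrams; a heavy paraphrase has many.
assert novel_ngram_pct(article, "the campus had been due to open in 2020") == 0.0
assert novel_ngram_pct(article, "opening of the school is postponed once more") > 50.0
```

Higher values indicate more abstractive output, matching the contrast Table 2 draws between BRIO and GPT3-D2.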
Across both datasets and styles, we observe a clear “best summary” “worst summary”
Table 4: Performance of different summarization systems measured using reference-based automatic metrics. Across
all datasets, we observe that automatic metrics report substantially worse results for GPT3-D2 summaries compared
to fine-tuned models. This directly contradicts the human preference results from Section 3, demonstrating that
these reference-based metrics cannot reliably compare the quality of prompt-based summaries against fine-tuned
summaries.
for realistic summarization settings. Even though a CNN/DM-trained BRIO model performed better, the results of our human study question the continued utility of hill-climbing on this dataset, as it seems users may simply prefer a different style of summary altogether. In fact, this preference for GPT3-D2 is much larger than the incremental improvements reported in other human evaluation settings, e.g. improvements on XSum on the GENIE leaderboard (Khashabi et al., 2022). Furthermore, as we will see in Section 5, the greater flexibility of GPT3-D2 compared to these systems makes it more suitable for news summarization tasks beyond generic summarization.

If a system designer collects a large-scale dataset of high-quality summaries that they wish to emulate, we believe a fine-tuned system may outperform GPT3-D2. However, better-trained models on datasets collected via "incidental" supervision are less likely to help.

4 Can current automatic metrics evaluate GPT3-D2 summaries?

Automatic metrics proposed for summarization evaluation can be broadly divided into two categories: (1) reference-based metrics, which compare generated summaries against available gold summaries, and (2) reference-free metrics, which only rely on the input document. Here, we compare their performance at evaluating prompt-based GPT3-D2 summaries.

Experimental Setup  We evaluate automatic metrics using summaries from 4 different summarization datasets, listed in Table 1. For each dataset, we construct our evaluation sets by randomly sampling 500 articles⁷ from the standard test split.⁸ We compare the same 3 summarization systems from Section 3 in our analysis. Additionally, we also report results using the fine-tuned PEGASUS model (Zhang et al., 2020), as BRIO fine-tuned models are not available for all datasets.

We publicly release this corpus of summarization outputs to standardize the test sets and support future research into GPT3-D2-based summarization. Link: https://tagoyal.github.io/zeroshot-news-annotations.html.

⁷ This size is chosen to give sufficient statistical power (Card et al., 2020) while keeping costs for GPT3-D2 evaluation low to enable others to compare on this subset. We outline costs in Appendix D.
⁸ Note that these standard datasets were released before 2020. Therefore, it is possible that some article-summary pairs in our test set overlap with GPT3-D2's training data. However, we do not observe a qualitative difference in GPT3-D2's performance on these older articles.

4.1 Reference-based metrics

Here, we study if the gold summaries of the standard datasets are useful for evaluation, especially when evaluating prompt-based summaries that are not trained to emulate the gold. We benchmark
(SUPERT and BLANC measure overall quality; QuestEval and QAFactEval are QA-based factuality metrics; FactCC, DAE, and SummaC are NLI-based factuality metrics.)

Dataset     Model      SUPERT   BLANC    QuestEval   QAFactEval   FactCC   DAE      SummaC
CNN         PEGASUS    .5466    .0605    .7373       4.4071       .3743    .8223     .1138
            BRIO       .5586    .0802    .7334       3.8332       .1817    .7577    -.0532
            T0         .5330    .0558    .7799       3.7517       .2012    .7556    -.0605
            GPT3-D2    .5560    .0749    .7249       3.6399       .2428    .6671    -.0729
DailyMail   PEGASUS    .6433    .1137    .7536       4.4677       .5152    .8497     .2402
            BRIO       .6360    .1217    .7415       4.1362       .3699    .8118     .0153
            T0         .5995    .0889    .7803       3.9827       .2431    .8043     .0478
            GPT3-D2    .6118    .0983    .7461       3.8279       .2697    .6990     .0365
XSum        PEGASUS    .4439    .0249    .8233       2.0089       .2465    .3598    -.2993
            BRIO       .4459    .0230    .8305       1.8626       .2031    .3040    -.3292
            T0         .4538    .0238    .7957       2.0330       .2219    .3392    -.3037
            GPT3-D2    .5060    .0594    .8064       2.9492       .3977    .6372    -.2626
Newsroom    PEGASUS    .6286    .1131    .7118       4.2120       .7218    .7956     .2418
            BRIO       -        -        -           -            -        -         -
            T0         .5433    .0640    .7511       3.5799       .2828    .7376     .0261
            GPT3-D2    .5408    .0599    .7160       3.2336       .3988    .6564    -.0729

Table 5: Performance of different summarization systems, as scored by automatic reference-free evaluation metrics from the summarization literature. Similar to reference-based metrics, these also generally fail to reliably produce the same system rankings as human preferences across datasets.
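For intuition about the reference-based overlap metrics discussed in Section 4.1, a simplified ROUGE-1-recall-style score can be computed as below. This is a bare-bones stand-in, not the official ROUGE implementation, which adds stemming and other preprocessing.

```python
from collections import Counter


def rouge1_recall(reference: str, generated: str) -> float:
    """Unigram recall: the fraction of reference tokens that are covered
    by the generated summary (a simplified stand-in for ROUGE-1 recall)."""
    ref = Counter(reference.lower().split())
    gen = Counter(generated.lower().split())
    overlap = sum(min(count, gen[tok]) for tok, count in ref.items())
    return overlap / max(1, sum(ref.values()))


assert rouge1_recall("the cat sat", "the cat sat") == 1.0
assert rouge1_recall("the cat sat", "a dog ran") == 0.0
```

Because such scores reward copying the reference's surface form, a fluent but differently worded GPT3-D2 summary can score far below an extractive fine-tuned one, which is exactly the failure mode the surrounding discussion documents.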
the performance of three classes of summarization metrics: (1) overlap-based metrics, specifically ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005), and BLEU (Papineni et al., 2002); (2) similarity-based metrics, which compute similarity between embedding representations of the generated and reference summaries, specifically BERTScore (Zhang* et al., 2020) and MoverScore (Zhao et al., 2019); and (3) a QA-based metric, specifically QAEval (Deutsch et al., 2021). Although most QA-based metrics are reference-free (discussed in Section 4.2), QAEval uses the reference summaries to indicate saliency. We report both the exact match (EM) and F1 components of QAEval.

Results  Table 4 outlines the results. It shows that the BRIO and PEGASUS models, fine-tuned to emulate the reference summaries, outperform GPT3-D2 summaries according to all reference-based automatic metrics. The difference in their assigned scores is large, e.g. >7 ROUGE-L points between GPT3-D2 and BRIO. For comparison, these reported scores for GPT3-D2 are even lower than those of the trivial Lead-3 baseline reported in prior work (Fabbri et al., 2021; Grusky et al., 2018). This clearly demonstrates that current automatic reference-based metrics cannot be used to reliably measure summary quality under the prompting paradigm.

Amongst prompting-based models, we observe that T0 summaries report better metric scores than GPT3-D2 for all datasets except Newsroom. Interestingly, out of the four datasets evaluated here, Newsroom is the only one not used to train the T0 model. This further shows that access to dataset-specific reference summaries during training improves performance according to these metrics, rendering them unsuitable for evaluating prompt-based models.

4.2 Reference-free metrics

Next, we investigate whether current reference-free evaluation metrics reflect the human preference rankings between summarization systems observed in Section 3. Here, we study two categories of metrics: (1) quality metrics, specifically SUPERT (Gao et al., 2020), which evaluates generated summaries against automatically identified salient sentences in the input, and BLANC (Vasilyev et al., 2020), which evaluates summaries via language understanding tasks; we refer readers to the original papers for detailed explanations. (2) factuality metrics, which evaluate whether generated summaries contain incorrect information with respect to the source article. We report the performance of summarization systems using two QA-based metrics: QuestEval (Scialom et al., 2021) and QAFactEval (Fabbri et al., 2022). Additionally, we also benchmark entailment-based metrics: FactCC (Kryscinski et al., 2020), DAE (Goyal and Durrett, 2020, 2021) and SummaC (Laban et al.,
2022).⁹ These entailment-based models are designed for classification into factual or non-factual; therefore, we use P(factual | article, summary) to score generated summaries.

Results  Table 5 outlines the scores for each summarization system according to the above reference-free metrics. Ideally, we want the relative rankings of different systems according to these metrics to correspond to human preferences, i.e. GPT3-D2 > BRIO > T0 for CNN/DM¹⁰ and GPT3-D2 > T0 > BRIO for XSum.¹¹

Overall, we observe that none of the reference-free metrics we evaluate follow these trends for both the CNN/DM and XSum datasets. In particular, we observe that GPT3-D2 summaries report low factuality scores (except on XSum) even though we rarely found any factual errors in our qualitative analysis of its generated summaries.

Interestingly, we noticed a roughly inverse relation to abstractiveness: summarization systems that generated more abstractive summaries (see Table 2) were generally scored lower by all automatic reference-free metrics. For instance, GPT3-D2 is scored lower than BRIO by both quality metrics for all datasets except XSum; the latter is the only dataset for which GPT3-D2 summaries are less abstractive. Such shortcomings of reference-free evaluation metrics due to spurious correlations have also been studied in prior work (Durmus et al., 2022). These issues become more exaggerated when the summarization systems being compared exhibit very different properties.

Discussion  On the surface, the failure of reference-free metrics at evaluating GPT3-D2 summaries is more surprising than that of reference-based metrics, as the latter explicitly compare generated summaries with references that GPT3-D2 is not trained to imitate; GPT3-D2 therefore understandably scores lower than fine-tuned systems.

However, we note two distinct issues with reference-free metrics: (1) Some of these, e.g. FactCC and DAE, use reference summaries as positive examples to train the metric. Therefore, although "reference-free" at test time, they are still trained to reward the summary properties seen in the standard summarization benchmarks. (2) Even completely reference-free metrics, e.g. QuestEval and QAFactEval, have only been evaluated on reference-based benchmarks and fine-tuned models. Therefore, the choice of components, such as which question answering or question generation models to use, has been dictated by the error space of prior fine-tuned models (Tang et al., 2023). These decisions now need to be revisited to incorporate GPT3-D2 evaluation; we leave this for future work.

⁹ Exact model versions and configurations used for these are outlined in Appendix A.
¹⁰ Although the human study in Section 3 is only run on CNN articles, the underlying fine-tuned model is the same for both CNN and DM. Therefore, we can reasonably expect it to display similar quality differences with respect to GPT3-D2.
¹¹ Note that while annotators were not explicitly asked to rate factuality, we instructed them to carefully check factuality and appropriately downvote non-factual summaries.

5 Beyond Generic Summarization

Previously, we observed that GPT3-D2 models faithfully follow simple "style" instructions in the given prompts. This provides a promising direction to tackle other use cases in news summarization beyond the generic summarization task from Section 3.

Different users can have very different information needs from the same article, all of which cannot be satisfied with a single generic summary. Prior work has introduced several task formulations to address this gap, including keyword-focused (He et al., 2022a), query-focused (Baumel et al., 2014; He et al., 2022a), or aspect-focused summarization (Krishna and Srinivasan, 2018; Ahuja et al., 2022), amongst others. Here, we evaluate GPT3-D2's performance at two of these use cases.

In keyword-based summarization, the output summaries must succinctly summarize the input document focusing on a given keyword; these generally correspond to specific entities or events directly mentioned in the document. In contrast, the control units in aspect-based summarization are high-level topics that can be common across multiple similar types of documents. For example, for the input article in Figure 1, Donald Trump or Russian interference in 2016 elections are keyword controls, whereas charges against the defendants is a higher-level aspect that can serve as the query for any news […]

5.1 Qualitative Analysis

Baseline Model for comparison  We use the recently proposed CTRLSum (He et al., 2022a), a fine-tuned BART model, as our baseline. It can be flexibly adapted for both keyword- and aspect-based settings by including a prompt as additional input
Article: Republican defenders of Donald Trump won't or can't answer the questions that are at the root of the intrigue over why classified documents were at Mar-a-Lago and the
troubling question of whether national security was put at risk. […] At one end of the scale is Missouri Sen. Josh Hawley, a firm Trump backer who is often mentioned as a future
presidential candidate once the ex-President finally exits the political stage. Hawley promoted Trump's line that the key issue was the FBI search of an ex-President's home, not
what he might have kept there. "I don't know what he has or doesn't have, but what I have a lot of concerns about is the FBI raiding the home of the likely presidential nominee in
2024, the former President of the United States, and particularly as we learn more and more about what's going on with the FBI," Hawley told CNN's Manu Raju on Wednesday.
[…] Ex-Attorney General William Barr has repeatedly questioned publicly what his former boss was doing with classified material. There has been speculation that Trump might
want such material as keepsakes or as some kind of leverage in a future business deal. But none of this is informed given the secrecy sounding the DOJ's investigation. […]
Link: https://www.cnn.com/2022/09/08/politics/republicans-trump-national-security-implications-classified-docs
Keyword-based prompt: "Summarize the above article in 2-3 sentences focusing on William Barr."

CTRLSum: Sen. Marco Rubio downplays the seriousness of the Department of Justice's investigation. Ex-Attorney General William Barr has repeatedly questioned what his former boss was doing with classified material. Trump and his allies have filled the vacuum for answers with a smokescreen of unproven claims, including that Trump declassified the material. [Annotator notes: missing context; poor discourse]

GPT3-D2: The article discusses how various Republicans have been defending Donald Trump in light of the recent revelations that he had classified documents at his Mar-a-Lago resort. While some have downplayed the significance of the situation, others have outright denied that there is any cause for concern. However, former Attorney General William Barr has repeatedly questioned Trump's actions, speculating that he may have been trying to use the classified information for personal gain. [Annotator note: includes article context before keyword-specific information]

Aspect-based prompt: "In the above article, what is the defendant's reaction to charges, including admission of guilt, denial of charges, or explanations?"

CTRLSum: Republican defenders of Donald Trump won't or can't answer the questions that are at the root of the intrigue over why classified documents were at Mar-a-Lago, says Julian Zelizer. Zelizer: Trump and his allies have filled the vacuum for answers with a smokescreen of unproven claims, including that Trump declassified the material or it might have been planted by the FBI. [Annotator note: does not directly answer the query]

GPT3-D2: The defendant's reaction to charges in the above article is denial of charges. Trump and his allies have filled the vacuum for answers with a smokescreen of unproven claims, including that Trump declassified the material or that it might have been planted by the FBI in order to persecute him. [Annotator note: summary captures aspect-relevant content]
Figure 6: Comparison of keyword- and aspect-based summaries using GPT3-D2 and CTRLSum models. The
GPT3-D2 prompt is shown on the left with the corresponding keyword or aspect bolded. For keyword-based
summarization, the GPT3-D2 summary presents appropriate context before the keyword-specific information. How-
ever, for aspect-based summarization, it does not always generate factually correct summaries, as shown in the first
aspect example. We observe that CTRLSum performs poorly for both these settings.
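The keyword- and aspect-focused prompts shown in Figure 6 can be assembled with a sketch like the following. The wordings mirror the two prompts in the figure; the helper itself (`control_prompt`) and any behavior beyond those two examples are our assumptions, not the paper's code.

```python
from typing import Optional


def control_prompt(article: str, n_sentences: int = 3,
                   keyword: Optional[str] = None,
                   aspect: Optional[str] = None) -> str:
    """Build a keyword- or aspect-focused prompt in the style of Figure 6.

    Keyword control appends a 'focusing on X' clause to the sentence-count
    instruction; aspect control poses the aspect as a question about the
    article. Both wordings are taken from the figure above.
    """
    if aspect is not None:
        return f"Article: {article}\n\nIn the above article, {aspect}"
    instruction = f"Summarize the above article in {n_sentences} sentences"
    if keyword is not None:
        instruction += f" focusing on {keyword}"
    return f"Article: {article}\n\n{instruction}."


p = control_prompt("Some article text.", n_sentences=2, keyword="William Barr")
assert "focusing on William Barr" in p
q = control_prompt("Some article text.",
                   aspect="what is the defendant's reaction to charges?")
assert q.endswith("what is the defendant's reaction to charges?")
```

The contrast between the two control types is visible in the helper: keywords modify the generic summarization instruction, while aspects replace it with a targeted query.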
to the encoder. We use the prompt template recommended in the original paper.¹²

Control Units  For the keyword-focused setting, we use named entities extracted from the input article as the control units. For aspect-focused summarization, we directly use the aspects introduced in the guided summarization task from TAC 2011.¹³ It defined 5 broad categories of newswire articles, such as accidents and natural disasters, investigations and trials, etc., and multiple aspects for each category. For example, the "investigations and trials" category includes aspects such as "who is the defendant or under trial?", "who is investigating, prosecuting, judging?", and so on.

Qualitative Analysis  Figure 6 shows examples of keyword- and aspect-focused summaries using GPT3-D2 and the baseline CTRLSum model. The keywords or aspects are highlighted in bold within the GPT3-D2 prompt displayed on the left.

¹² Trained model publicly released at: https://github.com/salesforce/ctrl-sum.
¹³ https://tac.nist.gov/2011/Summarization/Guided-Summ.2011.guidelines.html

In this example, representative of average GPT3-D2 quality, the keyword-focused GPT3-D2 summary first gives a brief overview of the article setting before providing keyword-relevant information. In contrast, the CTRLSum summary exhibits poor discourse structure and reads like a list of facts stapled together.

The figure also shows aspect-focused summaries for two aspects associated with the "investigations and trials" category most appropriate for the chosen article. We see mixed results here for GPT3-D2; it generates a factually incorrect summary for the first aspect, listing multiple people from the input article as defendants instead of only "Donald Trump". For the second aspect, it correctly maps the high-level concept "defendant" to "Donald Trump" in the input article and generates the correct answer to the input query: "The defendant's reaction to charges in the above article is denial of charges".

On the other hand, CTRLSum fails to generate aspect-focused summaries for both cases. We believe that it struggles to align high-level concepts and explicit entities in the article due to a lack of
such aspect-specific examples in its training data. Instead, it generates summaries focusing on lexically similar words, i.e. "defenders" for both cases.

Based on GPT3-D2's promising keyword-focused summarization capabilities observed above, we next conduct a human study to systematically compare it against the CTRLSum baseline. We leave further explorations of aspect-based summarization to future work, given the mixed to poor results for both models at this task.

[Figure 7 content: "Which keyword-focused summary is better?" Win % according to majority vote: GPT3-D2 69.8%, CTRLSum 30.2%, with a bar chart of the number of annotator votes (0-3) for "best summary".]

Figure 7: Distribution of annotator votes for the keyword-focused summarization task. Annotators prefer GPT3-D2 summaries over CTRLSum for approximately 70% of all article-keyword pairs, showing unanimous preference more than half the time.

[…] by a majority of the annotators. The main rationales given for this choice were better contextualization of keyword-related information and better coherence in GPT3-D2 summaries.

Impact  These results show that prompting GPT-3 models presents a promising alternative to fine-tuned models for such specialized summarization tasks that can be easily described using textual prompts. One of the major drawbacks of fine-tuned models is that they are constrained by what data is available and how it can be transformed to create new task-specific training data. CTRLSum relied on the SQuAD question answering dataset (Rajpurkar et al., 2016) because the required "queries" or "questions" were unavailable at scale for summaries in standard summarization datasets. In contrast, prompt-based models are not constrained by the availability of task-specific data and can flexibly adapt to new tasks. Future research should focus on further exploring these capabilities and possible improvements on currently "unsolved" tasks such as aspect-based or plan-based summarization.

6 Discussion and Related Work
3. SummaC: We use the SummaC-Conv model (model_name = ‘vitc’) and sentence-level granularity in our experiments.
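The sentence-level scoring above can be illustrated with a toy sketch. This is not the pipeline used in our experiments — those use the trained SummaC-Conv model from the summac package — it only shows the shape of the computation: an NLI entailment matrix over (document sentence, summary sentence) pairs is reduced to a single consistency score. The aggregation below (max over document sentences, then mean over summary sentences) is the simpler zero-shot variant, and the entailment probabilities are made up.

```python
# Sketch of SummaC-style sentence-level consistency scoring.
# A real NLI model would produce the entailment matrix; here it is hand-written.

def summac_zs_score(entail_matrix):
    """entail_matrix[i][j]: entailment probability of summary sentence j
    given document sentence i. Aggregate: max over document sentences
    (best supporting sentence per summary sentence), then mean over
    summary sentences."""
    n_doc = len(entail_matrix)
    n_sum = len(entail_matrix[0])
    per_sentence = [max(entail_matrix[i][j] for i in range(n_doc))
                    for j in range(n_sum)]
    return sum(per_sentence) / n_sum

# Dummy matrix: 3 document sentences x 2 summary sentences.
m = [[0.75, 0.25],
     [0.25, 0.5],
     [0.0, 0.25]]
print(summac_zs_score(m))  # (0.75 + 0.5) / 2 = 0.625
```

SummaC-Conv replaces the max with a learned 1-D convolution over the histogram of entailment scores per summary sentence, but the sentence-level granularity is the same.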
F Task Instructions
Task instructions provided to crowd annotators for
the generic summarization task setting are shown in
Figure 14 and those for the keyword-based setting
are shown in Figure 15.
Article: (CNN) Global leaders and defense officials had spent weeks speculating what Russian President Vladimir Putin might reveal about his Ukraine plans in a speech at Russia's Victory Day commemorations Monday. They'll have to keep guessing -- the leader offered few clues on the direction of the conflict. UK defense chief Ben Wallace had suggested that Putin may use this historic day to escalate his so-called "special military operation" in Ukraine and declare an outright war. Even if that had been Putin's plan, he was unlikely to follow through after Wallace's comments, not wanting to appear to his Western foes as such an easy nut to crack. Instead, the Russian president used his speech to blend history with the present, banking on Russian nationalism on its most patriotic of holidays to justify his war. In his reverence for Soviet war heroes who helped defeat Nazi Germany in World War II -- the reason Russia celebrates Victory Day -- Putin referred to new Nazi threats in Ukraine, repeating his baseless justification for the invasion as an operation to "denazify" the nation. In reference to the threat of NATO troops in Europe, Putin said: "Everything indicated that a clash with neo-Nazis, Banderites [Ukrainian nationalists], on whom the United States and their younger partners counted on, would be inevitable." "Danger was increasing every day. Russia repelled this aggression in a preventative way. This was the only correct decision, and it was a timely decision. The decision of an independent, sovereign and powerful nation," he said. Putin had few other options than to use his speech to keep selling his war to his own people. He has so few successes in Ukraine to brag of, after all. All he can do now is to keep Russians on his side as they suffer the economic hardship of crippling sanctions and international isolationism. The question now is whether Putin will use this day -- or this week even -- to escalate the war in other ways. There are growing concerns that Russian forces will turn again to standoff weapons -- aerial strikes and long-range missiles, for example -- that can be fired from afar, as they so often do when they are on the backfoot. That's worrying, as those attacks are indiscriminate and tend to cause huge civilian tolls. A bombing of a school in Luhansk, eastern Ukraine, that is feared to have killed at least 60 sheltering people over the weekend is just one example. After Russia's failure to take territory in Ukraine's north and around the capital, Kyiv, it's struggling even in the east and south, where it has had a presence through pro-Russian rebels for years. The possibility that Russia may win nothing, or very little, in Ukraine is real. Whether anything changes this Victory Day or not, a new chapter in the war will inevitably need writing soon.

BRIO: Russian President Vladimir Putin gave a speech at Russia's Victory Day commemorations on Monday. Peter Bergen says Putin gave few clues about his Ukraine plans in his speech. He says Putin used his speech to blend history with the present to justify his war in Ukraine. The question now is whether Putin will escalate the war in other ways, he says.

T0: Vladimir Putin offered few clues about the direction of the conflict in Ukraine. Putin used his speech to blend history and the present to justify his war.

GPT3-D2: The article discusses Russian president Vladimir Putin's speech on Victory Day, in which he blended history with the present and justified his war. It is speculated that Putin may use this day to escalate the conflict in Ukraine, but it is also noted that he has few successes in Ukraine to brag of. The question now is whether Putin will use this day to escalate the war in other ways.
Figure 11: Examples of generated summaries for the CNN-2022 dataset using 3 different summarization systems.
Article 1: A man standing trial for the murder of his partner's three-year-old son had convictions for common assault, burglary and possession of heroin with intent to supply, a jury has heard. Kemarni Watson Darby suffered fatal abdominal injuries at his mother's flat where he lived with Nathaniel Pope. Alicia Watson, 30, from West Bromwich, is also on trial for murder. Mr Pope, 32, from Wolverhampton, and Miss Watson both deny that charge and another of child cruelty. During his cross-examination, Miss Watson's QC, Charles Sherrard, told the court Mr Pope had been sentenced to four months' imprisonment for common assault in May 2011, 12 months for burglary in September 2011, and 20 months for possessing heroin with intent to supply in July 2013. Mr Pope conceded he had "just told some of the truth" about himself in evidence heard last week. Mr Sherrard referred to another arrest in March 2021. He said: "On the journey to Oldbury custody suite and on arrival at Oldbury custody suite you were asked by police officers several times if you had anything secreted on your person that you needed to present to them. "Each time you said to them 'no'. The custody suite police officers were authorised to conduct a strip search of you. "Two police officers then took you into a room and asked again if you had any drugs secreted on your person. Once again, you said 'no'. "A rock of 9.16 grams of crack cocaine was found in a white tissue secreted between your buttocks." Mr Sherrard also referred to Mr Pope's evidence last week that he was "merely a recreational user of cannabis". He asked: "(In) 2013 you're a dealer of heroin and in 2021 it's crack cocaine. Are you saying that in between we should read into it there would have been nothing of the sort?" Mr Pope answered: "There was nothing of the sort." The trial continues.

BRIO: A man accused of murdering a toddler had a string of convictions, Wolverhampton Crown Court has heard.

T0: A man accused of murdering a toddler in Wolverhampton had convictions for assault and drugs offences, a court has heard.

GPT3-D2: The man on trial for the murder of his partner's 3-year-old son had a history of convictions for assault, burglary, and drug dealing.

Article 2: A yellow warning for snow and ice has been issued for much of the North East. The Met Office warning is due to come into force at 20:00 BST on Wednesday and will run until 10:00 on Thursday. There will probably be icy patches on untreated roads and paths with higher elevation roads will be "affected by snowfall", the Met Office said. Motorists are being urged to take care. The cold snap comes days after the region basked in warm sun and highs of 20C (68F). The weather warning will cover Northumberland, County Durham, Tyneside, Darlington and Teesside. The wintry weather is expected to last until the weekend when slightly warmer temperatures will come in from the west, bringing unsettled conditions.

BRIO: Parts of the UK are set to be hit by icy conditions, forecasters have warned.

T0: The wintry weather is set to return to the North East of England on Wednesday night into Thursday morning.
Figure 12: Examples of generated summaries for the BBC-2022 dataset using 3 different summarization systems.
Article 1: A coalition of thousands of Etsy sellers signaled support for a one-week strike starting on Monday — the same day the online marketplace known for its unique handicrafts will start hiking the fees it charges those who use its platform to earn a living. An online petition started by Etsy (ETSY) shop owner Kristi Cassidy urging the company to cancel the fee increases — which tick up from 5% to 6.5% starting Monday — has garnered nearly 50,000 signatures. Of those signatories, some 18,500 come from people who have identified as Etsy sellers who support the strike, according to Etsy shop operator and strike participant Mattie Boyd. "We feel like we deserve a seat at the table," Boyd told CNN Business. "And we hope these demands are met, that's our immediate goal. But, generally, there's got to be some kind of change, where there's some kind of dialogue, or Etsy sellers have some kind of representation where these decisions are being made.” Sellers participating in the strike are putting their shops on "vacation mode" for a week starting Monday, according to Cassidy's petition, a temporary setting that lets users essentially put their Etsy shop on hold for a designated period of time. Etsy CEO Josh Silverman announced the fee increases in a memo to sellers in late February. The letter touted Etsy's massive growth over the past two years, boasting how active sellers last year increased their sales by "23% on average compared to 2019, and in 2021 alone, we showed more than 90 million active buyers worldwide that there's an alternative to big-box, automated shopping.” Silverman then announced plans to "make significant investments in marketing, seller tools, and creating a world-class customer experience so we can continue this tremendous growth.” "To support this goal, on April 11 we will increase our current 5% transaction fee to 6.5%," Silverman wrote. Etsy is the main source of income for Boyd, who operates a shop via the online retailer featuring homemade graphic T-shirts and other "niche" items that Boyd says are "geared towards people who are members of the queer and trans community, and who are also into punk rock and metal.” Demands listed in the petition include canceling the fee increases passed onto sellers; creating a comprehensive plan to crack down on "reseller" shops (people selling mass-produced goods that they have not designed themselves); improve and expedite the support systems for sellers who have had their business disrupted by Etsy's automated tools; end the "Star Seller" program that Etsy uses to rate sellers; and to let sellers opt out of offsite ads for their products. Boyd said many sellers felt like they weren't given a fair way to give feedback on the sudden fee-hike announcement, which marks the first increase since 2018. While the other demands in the petition are issues Boyd said have been brewing for a while, "I think for a lot of us it was that 30% increase [to fees] that really lit the match," Boyd added. "The strike officially is meant to go from April 11, today, through April 18," Boyd told CNN Business. "But people are being encouraged to participate for as much time as they feel like they can, and no one's being shamed if they can't do the whole week.” With sellers spread out across the country, Boyd said organizers are using a slew of tech tools to rally together and support each other — including Reddit channels, Discord chats, and Instagram. Raina Moskowitz, the chief operating officer for Etsy, told CNN Business in a statement via email on Monday that "sellers' success is a top priority for Etsy.” "We are always receptive to seller feedback and, in fact, the new fee structure will enable us to increase our investments in areas outlined in the petition, including marketing, customer support, and removing listings that don't meet our policies," the statement added. "We are committed to providing great value for our 5.3 million sellers so they are able to grow their businesses while keeping Etsy a beloved, trusted, and thriving marketplace.” The seller strike on Etsy notably comes amid a wave of workplace activism seen at a slew of major companies over the past year -- from Starbucks (SBUX) to Amazon. Earlier this month, Amazon (AMZN) workers in Staten Island, New York, voted to form the e-commerce giant's first-ever US labor union in a landmark election. Amazon has since filed an appeal, calling for a do-over of the entire vote.

Keyword: Kristi Cassidy

CTRLSum: Etsy (ETSY) shop owner Kristi Cassidy started a petition urging the company to cancel the fee increases. Sellers participating in the strike are putting their shops on "vacation mode" for a week starting Monday. The strike comes amid a wave of workplace activism seen at a slew of major companies.

GPT3-D2: Kristi Cassidy is an Etsy shop owner who is organizing a one-week strike against the company starting on Monday. The strike is in response to Etsy's decision to raise the fees it charges sellers, which will increase from 5% to 6.5%. Cassidy's petition urging the company to cancel the fee increases has garnered nearly 50,000 signatures.

Keyword: Silverman

CTRLSum: Etsy CEO Josh Silverman announced the fee increases in a memo to sellers in late February. "To support this goal, on April 11 we will increase our current 5% transaction fee to 6.5%," Silverman wrote. "We feel like we deserve a seat at the table," Etsy shop operator Mattie Boyd told CNN Business.

GPT3-D2: The article discusses a one-week strike being led by Etsy sellers in response to fee increases that will go into effect on Monday. The strike was organized in response to a memo from Etsy CEO Josh Silverman announcing the fee increases. Etsy sellers are demanding that the fee increases be cancelled, among other things.

Article 2: New York (CNN Business) As Russia's assault on Ukraine continues, American bar and restaurant owners are hoping a small word change will help show their solidarity with the Ukrainian people. In a move reminiscent of the "freedom fries" fad of the early aughts, they're taking Moscow Mules off the menu and replacing them with Kyiv Mules. Small American businesses, such as independent bar or restaurant owners, may not have any direct business ties to Russia, but many feel strongly about the violent attack on Ukrainian cities and citizens. Replacing "Moscow" with "Kyiv" in their vodka-ginger-lime cocktails is one way to show support for Ukraine. Bond Bar, in San Francisco, has renamed its Moscow Mule the Kyiv Mule. "It's just a little token of acknowledgment to the Ukrainian people," said owner Andrea Minoo. "We're just trying to raise awareness, and to let people know, we're in support [of Ukraine]." She wants Ukrainians to know that "we see what's happening, we wish we could do more." Bond Bar doesn't serve Russian vodka, Minoo noted, so it's not replacing any ingredients in its Kyiv Mule. Madrone Art Bar, also in San Francisco, did serve Russian vodka until this past weekend, when owner Michael Krouse decided to take it off the menu. First, he had to figure out which of the roughly 10 vodkas he carries were actually Russian. Many top-selling vodka brands that trace their origins to Russia are now distilled in multiple countries, including the United States. Stoli Vodka, for example, is actually made in Latvia, and the company's headquarters are in Luxembourg. After some research, Krouse removed Russian Standard, one of the few vodka brands that actually is Russian-made, from his bar. Then he decided to rename Madrone's Moscow Mule the Kyiv Mule and looked for a Ukrainian vodka to make it with. The bar unveiled the reconstituted cocktail on Instagram this week. "Introducing the 'Kyiv Mule' made with Prime Ukrainian vodka!" a Wednesday post reads, adding that "$2 of each Kyiv Mule sale will be donated to the Ukraine Crisis Fund." The Kyiv Mule costs $12. Krouse said he was feeling sad and helpless about the situation in Ukraine when he decided to make those moves. Those changes were "at least something that we could do," he said. Making a gesture: Em Chamas Brazilian Grill in Kansas City, Missouri, said in a Facebook post last week that its Moscow Mule will be replaced by a "Snake Island Mule," in "support of the Ukrainian resistance and in honor of the brave soldiers of Snake Island." Snake Island, also known as Zmiinyi Island, sits about 30 miles off the southern tip of the Ukrainian mainland in the northwestern Black Sea. Last week, a handful of Ukrainian fighters on the island stood up to Russian warships. They were feared dead, but the Ukrainian Navy released a statement Monday saying the troops were "alive and well" after being forced to surrender "due to the lack of ammunition." Back in Kansas City, the Em Chamas post also urged customers against "directing animosity towards local restaurants or businesses that are Russian-themed and/or are owned/operated by those of Russian [descent]." The eatery, which charges about $11 for cocktails, plans to donate profits from the Snake Island Mule to a Ukrainian charity, according to the Facebook post. Ronnie Heckman, owner of Caddies on Cordell, a bar and grill in Bethesda, Maryland, has not only swapped out Moscow Mules for Kyiv Mules, he's gone so far to replace Black Russians and White Russians with Black Ukrainians (vodka and Kahlua on the rocks) and White Ukrainians (made with vodka, Kahlua and cream). "It's a gesture," he said. Caddies is also donating part of the revenue from those drinks to Ukrainian aid, he said. Heckman hopes that if enough restaurant and bar owners take references to Russia off their menus, they can send a message to Russian leadership. The attack on Ukraine "makes no sense," he said. "It's wrong." CNN's Jordan Valinsky, Brad Lendon, Tim Lister and Josh Pennington contributed to this report.

Keyword: Ukraine

CTRLSum: Bar owners in the U.S. are replacing "Moscow" with "Kyiv" in their vodka-ginger-lime cocktails. They're doing it to show their solidarity with the Ukrainian people. Russia's assault on Ukraine is "wrong," says Ronnie Heckman, owner of a Maryland bar.

GPT3-D2: The article discusses how American bar and restaurant owners are changing the name of the Moscow Mule cocktail to the Kyiv Mule to show support for Ukraine. The Kyiv Mule is made with Ukrainian vodka and ginger-lime, and some businesses are donating part of the proceeds to Ukrainian aid. The change is meant to be a small gesture to show support for the Ukrainian people as Russia continues its assault.

Keyword: Brad Lendon

CTRLSum: Bar owners are replacing "Moscow" with "Kyiv" in their vodka-ginger-lime cocktails. CNN's Jordan Valinsky, Brad Lendon, Tim Lister and Josh Pennington contributed to this report. The attack on Ukraine "makes no sense," says Ronnie Heckman, owner of Caddies on Cordell.

GPT3-D2: The article discusses how American bar and restaurant owners are swapping out Moscow Mules for Kyiv Mules in order to show support for Ukraine. This is in response to Russia's recent assault on the country. CNN's Brad Lendon contributed to the report.
Figure 13: Examples of keyword-focused summaries for CNN articles from 2022.
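The keyword-focused outputs above are produced by conditioning the model on both the article and a focus keyword. The exact prompt wording and decoding settings behind the GPT3-D2 outputs are not reproduced here; the template below is only a hypothetical illustration of the article-plus-keyword pattern.

```python
# Hypothetical prompt template for keyword-focused summarization.
# The actual prompt used to obtain the GPT3-D2 outputs may differ.

def build_keyword_prompt(article: str, keyword: str) -> str:
    """Wrap an article and a focus keyword into a single prompt string."""
    return f"Article: {article}\n\nSummarize the above article focusing on {keyword}."

prompt = build_keyword_prompt(
    "A coalition of thousands of Etsy sellers ...", "Kristi Cassidy")
print(prompt.endswith("focusing on Kristi Cassidy."))  # True
```

The same template serves generic summarization by dropping the keyword clause, which is why keyword control adds essentially no engineering cost for prompted models.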
Figure 14: Screenshot of the task instructions for the generic summarization setting.

Figure 15: Screenshot of the task instructions for the keyword-based setting.