
A Field Guide to Automatic Evaluation of LLM-Generated Summaries

Tempest A. van Schaik, Microsoft, Redmond, Washington, USA, tempest.van@microsoft.com
Brittany Pugh, Microsoft, Redmond, Washington, USA, brittanypugh@microsoft.com

ABSTRACT
Large Language Models (LLMs) are rapidly being adopted for tasks such as text summarization, in a wide range of industries. This has driven the need for scalable, automatic, reliable, and cost-effective methods to evaluate the quality of LLM-generated text. What is meant by evaluating an LLM is not yet well defined, and there are widely different expectations about what kind of information evaluation will produce. Evaluation methods that were developed for traditional Natural Language Processing (NLP) tasks (before the rise of LLMs) remain applicable but are not sufficient for capturing high-level semantic qualities of summaries. Emerging evaluation methods that use LLMs to evaluate LLM output appear to be powerful but lacking in reliability. New elements of LLM-generated text that were not an element of previous NLP tasks, such as the artifacts of hallucination, need to be considered. We outline the different types of LLM evaluation currently used in the literature but focus on offline, system-level evaluation of the text generated by LLMs. Evaluating LLM-generated summaries is a complex and fast-evolving area, and we propose strategies for applying evaluation methods to avoid common pitfalls. Despite having promising strategies for evaluating LLM summaries, we highlight some open challenges that remain.

CCS CONCEPTS
• Information Systems → Information retrieval → Evaluation of retrieval results

KEYWORDS
Evaluation metrics; LLMs; summarization; offline evaluation

ACM Reference format:
Tempest A. van Schaik and Brittany Pugh. 2024. A Field Guide to Automatic Evaluation of LLM-Generated Summaries. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '24), July 14-18, 2024, Washington, D.C., USA. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3626772.3661346

This work is licensed under a Creative Commons Attribution International 4.0 License.
SIGIR '24, July 14-18, 2024, Washington, DC, USA
© 2024 Copyright is held by the owner/author(s).
ACM ISBN 979-8-4007-0431-4/24/07. https://doi.org/10.1145/3626772.3661346

1 INTRODUCTION
Large Language Models (LLMs), such as GPT-3, show surprising, emergent abilities [1] compared to smaller pre-trained models (PTMs) like BERT [2]. ChatGPT is an application of LLMs that caused much excitement in the AI community and beyond [1]. The power and ease of use of LLMs has led to a surge in LLM-powered applications, with developers harnessing Artificial Intelligence (AI) to address real-world challenges. An emerging application of LLMs in industry is to summarize text, such as producing summaries of product reviews or technical manuals.
LLMs have known issues of hallucination, knowledge recency, reasoning inconsistency, difficulty in computational reasoning and more [1], which result in unpredictable errors. It is important to evaluate how well LLM applications perform before they are released [3], and when running in production [3]. Lack of proper evaluation can lead to undetected errors that can have a range of harms, from brand damage and revenue loss to consequential impact on life opportunities like access to healthcare or employment.
For software developers who are new to AI, the landscape of evaluation can be difficult to get started in. For data scientists experienced in NLP and evaluation, it can feel challenging to keep up with advancements in LLMs, let alone how best to evaluate them. We present a guide to LLM evaluation that is geared towards the common situations that AI practitioners (software developers and data scientists) face when building LLM applications in industry.
We begin by surveying what is meant by LLM evaluation in section 2, and what evaluation methods exist in section 3. We present recommendations for best practices in section 4 and conclude with open issues in section 5. Although we focus on summarization, much of what we discuss is relevant to additional language tasks.

2 WHAT IT MEANS TO EVALUATE AN LLM
What is meant by evaluation of an LLM varies between different AI communities. We present these different definitions to help practitioners navigate the landscape of LLM evaluation and find the most appropriate definition for them.
2.1 Security and Responsible AI. It is important to evaluate how well LLM systems align with social norms, values, and regulations such as fairness, privacy and copyright [4]. They should also be evaluated for robustness against producing harmful content and jailbreaking [5].
2.2 Computing Performance. LLMs can be costly and have high latency [6], so they should be evaluated in terms of cost, CPU and GPU usage, latency and memory.
2.3 Retrieval vs Generator Evaluation. Retrieval-Augmented Generation (RAG) is the process of retrieving relevant data from outside the pre-trained model to enhance the input and improve the generated output [7]. We recommend breaking down the evaluation of RAGs into three parts: (i) the information retrieval part of such a system (e.g. with well-established search metrics like precision, recall, and Discounted Cumulative Gain [8]), since the generator can only perform as well as the context that it is given; (ii) the generative AI component; and (iii) the entire RAG system end-to-end, to see how well it meets end-user needs.
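As an illustration of the retrieval-side metrics mentioned in 2.3, the following is a minimal sketch (not from the paper) of Discounted Cumulative Gain and its normalised variant, computed from graded relevance judgements for a ranked list of retrieved passages; the relevance values are hypothetical.

import math

def dcg(relevances):
    """Discounted Cumulative Gain for a ranked list of graded relevance scores."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances, k=None):
    """Normalised DCG: DCG of the ranking divided by DCG of the ideal ranking."""
    ranked = relevances[:k] if k else relevances
    ideal = sorted(relevances, reverse=True)[:k] if k else sorted(relevances, reverse=True)
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical relevance judgements (0-3) for the passages a RAG retriever returned, in rank order.
retrieved_relevance = [3, 0, 2, 1, 0]
print(f"nDCG@5 = {ndcg(retrieved_relevance, k=5):.3f}")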


2.4 Offline vs Online Evaluation. Offline evaluation involves developing the system with a batch of test data that is often human-annotated ground truth. It guides practitioners while they build the system and helps them decide when the system performs well enough to release. Online evaluation monitors performance with live user data. In some applications like e-commerce, it provides an opportunity to evaluate the system based on user-interaction metrics like click-through rate during A/B testing.
2.5 System Evaluation vs Model Evaluation. Model evaluation is often performed with benchmarks like HellaSwag, TruthfulQA and MMLU [9]. Benchmarks are used to compare competing LLMs, using a fixed, standard dataset and fixed, standard language tasks (like summarization). However, in many industry scenarios, the model is fixed. Practitioners need to evaluate how well their model performs on specific industry data and tasks [10]. As an example, with prompt engineering, a system evaluation would keep the LLM constant but change the prompts.
With a focus on offline, system-level evaluation of generative AI text, we outline methods for evaluating the quality of summaries.

3 EVALUATION METHODS
Evaluation methods measure how well our system is performing. Manual evaluation (human review) of each summary would be time-consuming, costly and not scalable, so it is usually complemented by automatic evaluation. Many automatic evaluation methods attempt to measure the same qualities of a summary that human evaluators would consider. Those qualities include fluency [11], coherence [10], [11], [12], relevance [11], factual consistency [11], and fairness [13]. Similarity in content or style to a reference text can also be an important quality of generated text. In the next sections, we examine reference-based, reference-free (context-based), and LLM-based metrics.

3.1 Reference-based Metrics
Reference-based metrics compare generated text to a reference: the human-annotated ground truth text. Many of these metrics were developed for traditional NLP tasks before the rise of LLMs but remain applicable to LLM summarization.
3.1.1 N-gram based metrics. BLEU (Bilingual Evaluation Understudy) [14], ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [15], and JS divergence (JS2) [16] are overlap-based metrics that measure the similarity of the output text and the reference text using n-grams.
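To make the overlap idea concrete, here is a minimal sketch (not from the paper) of a ROUGE-1-style unigram precision, recall and F1 between a generated summary and a reference; production work would normally use an established package such as rouge-score rather than this simplified version.

from collections import Counter

def rouge1_scores(candidate: str, reference: str) -> dict:
    """Simplified ROUGE-1: unigram overlap between candidate and reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each reference word can only be matched as often as it appears.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge1_scores("the battery lasts 16 hours", "battery life is about 16 hours"))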
3.1.2 Embedding-based metrics. BERTScore [17], MoverScore [18], and Sentence Mover Similarity (SMS) [19] all rely on contextualized embeddings to measure the similarity between two texts.
While these metrics are relatively simple, fast, and inexpensive compared to LLM-based metrics, studies have shown that they can have poor correlation with human evaluators, lack of interpretability, inherent bias, poor adaptability to a wider variety of tasks and inability to capture subtle nuances in language [19].
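As an illustration of an embedding-based metric in practice, the sketch below scores a candidate against a reference with the open-source bert-score package (the implementation behind [17]); the call signature and default model may differ between versions, so treat the details as assumptions to verify against the package documentation.

# A minimal sketch, assuming the bert-score package (https://github.com/Tiiiger/bert_score) is installed.
from bert_score import score

candidates = ["The laptop ships with 16 GB of memory and a 1 TB drive."]
references = ["The product specification lists 16 GB of RAM and 1 TB of storage."]

# Returns precision, recall and F1 tensors, one value per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1 = {F1[0].item():.3f}")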
3.2 Reference-free Metrics
Reference-free (context-based) metrics produce a score for the generated text and do not rely on ground truth. Evaluation is based on the context or source document. Many of these metrics were developed in response to the challenge of creating ground truth data. These methods tend to be newer than reference-based techniques, reflecting the growing demand for scalable text evaluation as PTMs became increasingly powerful.
3.2.1 Quality-based metrics. SUPERT [21] and BLANC [22] focus on the quality of the content of the generated summary and use BERT to generate a pseudo-reference. ROUGE-C is a modification of ROUGE without the need for references; it uses the source text as the context for comparison [23].
3.2.2 Entailment-based metrics. Entailment-based metrics are based on the Natural Language Inference (NLI) task, which determines whether, for a given text (the premise), an output text (the hypothesis) entails, contradicts or is neutral with respect to the premise [24]. The SummaC (Summary Consistency) benchmark [24], FactCC [25], and DAE (Dependency Arc Entailment) [26] metrics serve as approaches to detect factual inconsistencies with the source text. Entailment-based metrics are designed as a classification task with labels "consistent" or "inconsistent".
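To show the entailment framing in practice, here is a minimal sketch (not from the paper) that scores a summary claim against its source with an off-the-shelf NLI model via Hugging Face transformers; the checkpoint name is an assumption, and any NLI model could be substituted.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"  # assumed public NLI checkpoint; swap in any NLI model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "The device has 16 GB of memory."            # source text
hypothesis = "The device ships with 8 GB of memory."   # claim taken from the generated summary

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

# Read the label order from the model config rather than hard-coding it;
# the entailment probability can be used as a consistency signal.
for idx, label in model.config.id2label.items():
    print(f"{label}: {probs[idx]:.3f}")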
3.2.3 Factuality, QA and QG-based metrics. Factuality-based metrics like SRLScore (Semantic Role Labeling) [27] and QAFactEval [28] evaluate whether generated text contains incorrect information that does not hold true to the source text [29]. QA-based metrics, like QuestEval [30], and QG-based metrics are used as another approach to measure factual consistency and relevance [30], [31].
Reference-free metrics have shown improved correlations with human evaluators compared to reference-based metrics, but there are limitations to using them as the single measure of progress on a task [32], including bias towards their underlying models' outputs and bias against higher-quality text [32].


3.3 LLM-based Evaluators
LLMs' remarkable abilities have led to their emerging use not only as summarizers of text, but also as evaluators of summarized text [33], [34], [35]. LLM-based evaluators offer scalability and explainability.
3.3.1 Prompt-based evaluators. LLM-based evaluators prompt an LLM to be the judge of some text. The judgement can be based on (i) the summary alone (reference-free), where the LLM is judging qualities like fluency and coherence; (ii) the summary, the original text, and potentially a topic for summarization (reference-free), where the LLM is judging qualities like consistency and relevancy; or (iii) a comparison between the summary and the ground truth summary (reference-based), where the LLM is judging quality and similarity. Some frameworks for these evaluation prompts include Reason-then-Score (RTS) [36], Multiple Choice Question Scoring (MCQ) [36], Head-to-head scoring (H2H) [36], and G-Eval [37].
LLM evaluation is an emerging area of research and has not yet been systematically studied. Already, researchers have identified issues with reliability in LLM evaluators [3] such as positional bias [38], [39], verbosity bias [40], self-enhancement bias [40], and limited mathematical and reasoning capabilities [40]. Strategies that have been proposed to mitigate positional bias include Multiple Evidence Calibration (MEC), Balanced Position Calibration (BPC), and Human In The Loop Calibration (HITLC) [38].
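As an illustration of the reference-free judging setup in (ii), here is a minimal sketch of a prompt-based evaluator; call_llm is a placeholder for whatever LLM client is in use, and the prompt wording and 1-5 scale are illustrative assumptions rather than the RTS, MCQ, H2H or G-Eval templates themselves.

JUDGE_PROMPT = """You are evaluating a summary of a source document.

Source document:
{source}

Summary:
{summary}

Rate the summary's factual consistency with the source on a scale of 1 (many
unsupported claims) to 5 (fully supported). First explain your reasoning in one
or two sentences, then give the score on a final line as "Score: <1-5>".
"""

def judge_consistency(source: str, summary: str, call_llm) -> int:
    """Ask an LLM to rate consistency; call_llm(prompt) -> str is supplied by the caller."""
    response = call_llm(JUDGE_PROMPT.format(source=source, summary=summary))
    # Parse the final "Score: N" line; fall back to 0 if the judge did not comply.
    for line in reversed(response.strip().splitlines()):
        if line.lower().startswith("score:"):
            digits = "".join(ch for ch in line if ch.isdigit())
            return int(digits) if digits else 0
    return 0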
3.3.2 LLM embedding-based metrics. Recently, embedding models from LLMs, such as GPT-3's text-embedding-ada-002 [1], have also been used for embedding-based metrics (see section 3.1.2).

4 BEST PRACTICES TO AVOID EVALUATION PITFALLS
Given the extensive menu of evaluation metrics that are available for LLM-generated summaries, we provide some best practices for evaluation that could help to avoid common pitfalls.
4.1 Suite of metrics. Rather than searching for "one metric to rule them all", we recommend developing a suite of metrics. We make the analogy with software testing, where unit tests, smoke tests and functional tests all work together to test the software at different levels. The suite approach acknowledges that each metric has its strengths and weaknesses. For example, ROUGE benefits from being simple and deterministic but does not capture semantic meaning. An LLM-based evaluator benefits from capturing nuanced qualities of a summary but is not always reliable.
4.2 Standard and custom metrics. Several metrics have emerged as the standard for evaluating LLM text (such as factual consistency). While these are important basic qualities of text, there are often additional, unique qualities of text for a specific use-case that will require developing custom metrics (see section 4.6, point 3). An example would be in summarizing legal documents, which have a distinctive writing style.
Even a standard measure like consistency may need custom implementation. For example, a summary of an electronic product specification contains many numerical values, such as "16 GB". To detect a factual inconsistency such as "8 GB", we used a simple custom consistency metric to extract these facts from the summary and ground truth using a regular expression search, and measured their overlap with F1 score. Consistency metrics based on PTMs failed to distinguish the semantic difference between "8 GB" and "16 GB". This illustrates why custom metrics are important for each application, since each has unique data qualities.
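The following is a minimal sketch of such a metric, assuming the facts of interest are numeric quantities with units (e.g. "16 GB") and that a simple regular expression is enough to extract them; the regex and F1 computation here are illustrative, not the authors' exact implementation.

import re
from collections import Counter

FACT_PATTERN = re.compile(r"\b(\d+(?:\.\d+)?)\s*(GB|TB|MB|GHz|MHz|mAh|W)\b", re.IGNORECASE)

def extract_facts(text: str) -> Counter:
    """Extract numeric facts like '16 GB' as normalized (value, unit) pairs."""
    return Counter((value, unit.upper()) for value, unit in FACT_PATTERN.findall(text))

def fact_f1(summary: str, ground_truth: str) -> float:
    """F1 overlap between the numeric facts in the summary and in the ground truth."""
    summary_facts = extract_facts(summary)
    truth_facts = extract_facts(ground_truth)
    if not summary_facts or not truth_facts:
        return 0.0
    overlap = sum((summary_facts & truth_facts).values())
    precision = overlap / sum(summary_facts.values())
    recall = overlap / sum(truth_facts.values())
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(fact_f1("The laptop has 8 GB of RAM and a 2.4 GHz CPU.",
              "The laptop has 16 GB of RAM and a 2.4 GHz CPU."))  # penalises the "8 GB" error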
4.3 LLM and non-LLM metrics. Given the limitations of LLM evaluators, we recommend using them along with evaluators based on smaller PTMs, as well as non-model metrics (like ROUGE). These different metrics can corroborate each other. Discrepancies between these metrics can detect changes in performance that may not be apparent from each metric on its own. For example, we have observed that even when LLM-based metrics saturate (scoring almost every summary as a perfect 5/5), non-LLM based metrics can still detect differences in the summaries' qualities. Cost is also a consideration, and it may not be economical to use the most powerful LLM to routinely evaluate a large dataset.
4.4 Validate the evaluators. Non-standard (custom) metrics, especially custom LLM-based evaluators, should be validated to see how they perform. One way to do this is to calibrate them against human-evaluated summaries that are balanced and range from "good" to "bad". Another best practice is to determine the stability of each metric by reporting its average over several runs of the same inputs.
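One minimal way to operationalise this calibration and stability check, assuming a small set of human Likert ratings is available, is sketched below; the use of Spearman rank correlation (via scipy) and five repeat runs are illustrative choices, not the authors' procedure.

from statistics import mean, stdev
from scipy.stats import spearmanr

def calibrate_metric(metric_scores, human_scores):
    """Rank correlation between an automatic metric and human judgements (e.g. 1-5 Likert)."""
    rho, p_value = spearmanr(metric_scores, human_scores)
    return rho, p_value

def metric_stability(score_metric, summaries, n_runs=5):
    """Mean and standard deviation of a (possibly non-deterministic) metric over repeated runs."""
    per_run_means = [mean(score_metric(s) for s in summaries) for _ in range(n_runs)]
    return mean(per_run_means), stdev(per_run_means)

# Hypothetical example: an LLM judge's 1-5 scores vs. expert 1-5 scores for the same summaries.
llm_judge_scores = [5, 4, 5, 2, 3, 5]
expert_scores = [4, 4, 5, 1, 2, 4]
print(calibrate_metric(llm_judge_scores, expert_scores))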
4.5 Visualize and analyze metrics. Simply calculating metrics is not sufficient; they need to be interpreted to make them actionable. Data visualization is far more effective than simply viewing data in tabular format. For example, boxplots can be used to visualize a metric's distribution, skewness, and outliers, for the whole dataset and within certain sub-categories (such as the search query in a RAG system).
4.6 Involve the experts. Sufficient time and resources need to be budgeted to develop a high quality LLM solution which involves domain experts (who are often the end-users) in three areas:
1) Annotation: Even when relying on reference-free evaluation, some annotated ground truth data is always necessary to calibrate the metrics and to develop the LLM system with.
2) Evaluation: The most important judge of the quality of a summary is the domain expert. This is especially important when the text is from a technical field (such as summarizing clinical research or airplane manuals), where the AI practitioner may not have the domain expertise to judge or even understand the summary. Domain experts may provide feedback that is qualitative (describing what they like about the summary), or quantitative (a Likert score for each summary, or thumbs up/down).


3) Metric Design: Domain experts must define what unique qualities are important to them and inform the design of metrics. In one scenario, we observed that domain experts evaluate an AI summary by seeing if it contains the information in each sentence from the ground truth summary. In response, we designed a custom metric to perform sentence-by-sentence matching. We considered a sentence in the ground truth summary to be "found" in the AI summary if the cosine similarity between domain-relevant PTM (PubMedBERT [41]) embeddings of those sentences was above a certain threshold. The experts highlighted that editing down excessive information saved more time than writing missing content, so the metric (shown in figure 1) measured sentence recall.
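A minimal sketch of such a sentence-recall metric is shown below, using sentence-transformers for the embeddings; the model name, similarity threshold and naive sentence splitting are illustrative assumptions, and the authors' metric used PubMedBERT embeddings rather than this general-purpose encoder.

from sentence_transformers import SentenceTransformer, util

# Assumed general-purpose encoder; a domain-specific model (e.g. PubMedBERT) could be swapped in.
model = SentenceTransformer("all-MiniLM-L6-v2")

def sentence_recall(ai_summary: str, ground_truth: str, threshold: float = 0.75) -> float:
    """Fraction of ground-truth sentences whose best match in the AI summary
    exceeds a cosine-similarity threshold."""
    truth_sents = [s.strip() for s in ground_truth.split(".") if s.strip()]
    ai_sents = [s.strip() for s in ai_summary.split(".") if s.strip()]
    if not truth_sents or not ai_sents:
        return 0.0
    truth_emb = model.encode(truth_sents, convert_to_tensor=True)
    ai_emb = model.encode(ai_sents, convert_to_tensor=True)
    # Cosine similarity of every ground-truth sentence against every AI-summary sentence.
    sims = util.cos_sim(truth_emb, ai_emb)
    found = sum(1 for row in sims if row.max().item() >= threshold)
    return found / len(truth_sents)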
4.7 Data-driven prompt engineering. Having metrics allows the practitioner to form a hypothesis, test it, and observe the metrics to see if the experiment was successful. For example, a practitioner might write a new LLM summarization prompt and observe the effect on the summaries. They may solicit feedback from experts, and/or get immediate feedback from automatically calculated metrics, and iterate on the prompt. Note that it is likely that the metrics that are chosen at the start of the project will evolve as practitioners gain a deeper understanding of the goals of summarization.
4.8 Tracking metrics over time. If metrics are introduced early in development, the metric values can be tracked for the corpus of summaries over time. Having metrics from the start enables the practitioner to establish baseline performance for a simple, single prompt, before progressing to more complex prompting strategies and measuring what value this complexity adds. Figure 1 shows the average values of various metrics over multiple iterations of prompt engineering.
4.9 Appropriate metric interpretation. Even standard, established metrics like F1 score need careful custom interpretation for each application. A heuristic such as "F1 score should be high, or close to 1, for a good summary" can be misleading. For example, when ROUGE F1 compares an abstractive summary against the original text, the summary may be good and yet have few overlapping n-grams (a low F1).

Figure 1: Six metrics that track performance of an LLM summarizer during prompt engineering: Quality, Similarity and Factualness (GPT-4 based); sentence recall (BERT-based); missing summaries; and F1 score for n-gram overlap. The Y-axis is the average value for that metric across the corpus of 100 summaries. The X-axis is the prompt variant number, with prompt 1 being the first (leftmost) and prompt 13 being the last (rightmost). The dashed line shows an unsuccessful prompt experiment (prompt 4), after which summaries began to improve.

5 OPEN CHALLENGES
5.1 Cold start problem. LLMs are so powerful that they enable entirely new services to be built, rather than simply enhancing existing solutions. This leads to the cold-start problem, where little or no data exists to evaluate the new solution. One approach is to synthesize data, though this should be used cautiously, to enhance the productivity of human annotators rather than replace them [42]. Naively synthesized data may not represent the real-world distribution of the underlying data [43]. Another approach is to explore re-purposing an existing dataset. For example, historic search logs can tell us what information an end-user found valuable.
5.2 Subjectivity in evaluating and annotating text. What makes a summary good is subjective, and experts sometimes disagree. It is challenging for experts to consistently quantify how good a summary is. Annotators may also disagree on what information should be annotated, for example sufficient or exhaustive information [44]. Inter-annotator agreement varies by task and industry but was recently found in LLM studies [40] to be 80%. Therefore, we cannot expect perfect agreement between automatic metrics and human evaluators. Subjectivity remains a challenge when evaluating text.
5.3 Challenge of good vs excellent. While we are confident that the approaches that we have set forth in this paper can discern good summaries from bad summaries, the challenge remains in discerning good summaries from excellent summaries. LLMs' remarkable performance in text summarization has highlighted this challenge. This may be unsurprising, because even if automatic metrics can match human evaluators, the expertise required to evaluate summaries in some fields is very rare. Two domain-specific summaries with subtle quality differences (good and excellent) would have almost identical LLM embedding vectors. This can potentially be improved by fine-tuning with domain-specific text, though this is not always practical.

6 CONCLUSION
There is a clear need to evaluate how well LLMs perform, and a wide variety of ways to approach this. It can be challenging to decide what element of the LLM to evaluate, which established techniques are still applicable in the era of LLMs, and which emerging evaluation techniques to adopt. We outline what is meant by LLM evaluation in different AI communities, and we focus on system-level, offline evaluation of LLM-generated summaries. We survey established NLP metrics which are reference-based, more recent PTM-based reference-free metrics, and emerging LLM prompt-based metrics. We explore some best practices for selecting, combining, interpreting, and acting on metrics, and for involving human experts in evaluation. Even when using these best practices, we find that several open challenges still remain. Evaluating LLMs is far from being solved. We should not underestimate the complexity and investment needed to properly evaluate LLMs, which is essential for building LLM applications.


AUTHOR BIOGRAPHIES
Tempest van Schaik is a Principal Data Scientist in Microsoft's Industry Solutions Engineering team. She helps to build new products and services that use AI for Microsoft's health and life science enterprise customers. She received her PhD in Bioengineering from Imperial College London in 2014. Her research publications have been in the areas of medical devices, clinical research, and responsible AI.

Brittany Pugh is a Senior Data Scientist in Microsoft's Industry Solutions Engineering team. She helps customers solve their most challenging problems utilizing AI products and services for Microsoft's health and life science enterprise customers. Brittany received a master's degree in systems engineering from George Washington University after receiving a bachelor's in mathematics from the University of Virginia.

REFERENCES
[1] W. X. Zhao et al., "A Survey of Large Language Models," Mar. 2023. [Online]. Available: http://arxiv.org/abs/2303.18223
[2] J. Wei et al., "Emergent Abilities of Large Language Models." Accessed: Feb. 14, 2024. [Online]. Available: https://openreview.net/forum?id=yzkSU5zdwD
[3] X. Wang, R. Gao, A. Jain, G. Edge, and S. Ahuja, "How Well do Offline Metrics Predict Online Performance of Product Ranking Models," no. 23, 2023, doi: 10.1145/3539618.3591865.
[4] Y. Liu et al., "Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment," Aug. 2023. [Online]. Available: http://arxiv.org/abs/2308.05374
[5] Y. Dong et al., "Building Guardrails for Large Language Models," Feb. 2024. [Online]. Available: http://arxiv.org/abs/2402.01822
[6] S. Minaee et al., "Large Language Models: A Survey."
[7] P. Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." Accessed: Feb. 11, 2024. [Online]. Available: https://github.com/huggingface/transformers/blob/master/
[8] O. Jeunen, I. Potapov, and A. Ustimenko, "On (Normalised) Discounted Cumulative Gain as an Off-Policy Evaluation Metric for Top-n Recommendation," Proceedings of ACM Conference (Conference'17), vol. 1.
[9] P. Liang et al., "Holistic Evaluation of Language Models," Nov. 2022. [Online]. Available: http://arxiv.org/abs/2211.09110
[10] "LLM Evaluation: Everything You Need To Run, Benchmark Evals." Accessed: Feb. 11, 2024. [Online]. Available: https://arize.com/blog-course/llm-evaluation-the-definitive-guide/#large-language-model-model-eval
[11] W. Kryściński, N. S. Keskar, B. McCann, C. Xiong, and R. Socher, "Neural Text Summarization: A Critical Evaluation," Aug. 2019. [Online]. Available: http://arxiv.org/abs/1908.08960
[12] R. Fabbri et al., "SummEval: Re-evaluating Summarization Evaluation," doi: 10.1162/tacl.
[13] O. Gallegos et al., "Bias and Fairness in Large Language Models: A Survey," Sep. 2023. [Online]. Available: http://arxiv.org/abs/2309.00770
[14] Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a Method for Automatic Evaluation of Machine Translation."
[15] C.-Y. Lin, "ROUGE: A Package for Automatic Evaluation of Summaries."
[16] M. Bhandari, P. Gour, A. Ashfaq, P. Liu, and G. Neubig, "Re-evaluating Evaluation in Text Summarization," Oct. 2020. [Online]. Available: http://arxiv.org/abs/2010.07100
[17] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, "BERTScore: Evaluating Text Generation with BERT." Accessed: Feb. 11, 2024. [Online]. Available: https://github.com/Tiiiger/bert_score
[18] W. Zhao, M. Peyrard, F. Liu, Y. Gao, C. M. Meyer, and S. Eger, "MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance." Accessed: Feb. 11, 2024. [Online]. Available: http://tiny.cc/vsqtbz
[19] E. Clark, A. Celikyilmaz, N. A. Smith, and P. G. Allen, "Sentence Mover's Similarity: Automatic Evaluation for Multi-Sentence Texts." [Online]. Available: https://github.com/src-d/wmd-relax
[20] B. Sai and M. M. Khapra, "A Survey of Evaluation Metrics Used for NLG Systems," 2020, doi: 10.1145/0000001.0000001.
[21] Y. Gao, W. Zhao, and S. Eger, "SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization." [Online]. Available: https://tac.nist.gov/
[22] O. Vasilyev, V. Dharnidharka, and J. Bohannon, "Fill in the BLANC: Human-free quality estimation of document summaries."
[23] T. He et al., "ROUGE-C: A Fully Automated Evaluation Method for Multi-document Summarization." [Online]. Available: http://www.isi.edu/-cyl/SEE
[24] P. Laban, T. Schnabel, P. N. Bennett, and M. A. Hearst, "SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization," Nov. 2021. [Online]. Available: http://arxiv.org/abs/2111.09525
[25] W. Kryściński, B. McCann, C. Xiong, and R. Socher, "Evaluating the Factual Consistency of Abstractive Text Summarization," Oct. 2019. [Online]. Available: http://arxiv.org/abs/1910.12840
[26] T. Goyal and G. Durrett, "Evaluating Factuality in Generation with Dependency-level Entailment," Findings of the Association for Computational Linguistics. [Online]. Available: https://github.com/
[27] Fan, D. Aumiller, and M. Gertz, "Evaluating Factual Consistency of Texts with Semantic Role Labeling," May 2023. [Online]. Available: http://arxiv.org/abs/2305.13309
[28] R. Fabbri, C.-S. Wu, W. Liu, and C. Xiong, "QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization," Dec. 2021. [Online]. Available: http://arxiv.org/abs/2112.08542
[29] Pagnoni, V. Balachandran, and Y. Tsvetkov, "Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics," Apr. 2021. [Online]. Available: http://arxiv.org/abs/2104.13346
[30] T. Scialom et al., "QuestEval: Summarization Asks for Fact-based Evaluation," Mar. 2021. [Online]. Available: http://arxiv.org/abs/2103.12693
[31] T. Scialom, S. Lamprier, B. Piwowarski, and J. Staiano, "Answers Unite! Unsupervised Metrics for Reinforced Summarization Models," Sep. 2019. [Online]. Available: http://arxiv.org/abs/1909.01610
[32] D. Deutsch, R. Dror, and D. Roth, "On the Limitations of Reference-Free Evaluations of Generated Text," Oct. 2022. [Online]. Available: http://arxiv.org/abs/2210.12563
[33] Peng, C. Li, P. He, M. Galley, and J. Gao, "Instruction Tuning with GPT-4." Accessed: Feb. 19, 2024. [Online]. Available: https://instruction-tuning-with-gpt-4.github.io/
[34] Z. Sun et al., "Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision." Accessed: Feb. 19, 2024. [Online]. Available: https://github.com/IBM/Dromedary
[35] Zhou et al., "LIMA: Less Is More for Alignment."
[36] Shen, L. Cheng, X.-P. Nguyen, Y. You, and L. Bing, "Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization," May 2023. [Online]. Available: http://arxiv.org/abs/2305.13091
[37] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu, "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment," Mar. 2023. [Online]. Available: http://arxiv.org/abs/2303.16634
[38] P. Wang et al., "Large Language Models are not Fair Evaluators." Accessed: Feb. 11, 2024. [Online]. Available: https://github.com/i-Eval/
[39] C.-H. Chiang and H.-Y. Lee, "Can Large Language Models Be an Alternative to Human Evaluation?," Long Papers.
[40] Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena."
[41] Y. U. Gu et al., "Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing," no. 1, p. 24, 2021, doi: 10.1145/3458754.
[42] V. Veselovsky, M. H. Ribeiro, A. Arora, M. Josifoski, A. Anderson, and R. West, "Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science," May 2023. [Online]. Available: http://arxiv.org/abs/2305.15041
[43] Josifoski, M. Sakota, M. Peyrard, and R. West, "Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and the Case of Information Extraction," Mar. 2023. [Online]. Available: http://arxiv.org/abs/2303.04132
[44] H. Cheng et al., "MDACE: MIMIC Documents Annotated with Code Evidence." Accessed: Feb. 23, 2024. [Online]. Available: https://github.com/3mcloud/MDACE/
