
Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models


Qingyu Lu♢,ℜ, Baopu Qiu♭,ℜ, Liang Dingℜ, Kanjian Zhang♢, Tom Kocmi♡, Dacheng Taoℜ
♢ Southeast University  ℜ JD Explore Academy, JD.com Inc.  ♭ Nanjing University  ♡ Microsoft
luqingyu@seu.edu.cn, qiubaopu@smail.nju.edu.cn,
liangding.liam@gmail.com, tomkocmi@microsoft.com
https://github.com/Coldmist-Lu/ErrorAnalysis_Prompt

arXiv:2303.13809v3 [cs.CL] 21 Feb 2024

Abstract

Generative large language models (LLMs), e.g., ChatGPT, have demonstrated remarkable proficiency across several NLP tasks, such as machine translation and text summarization. Recent research (Kocmi and Federmann, 2023b) has shown that utilizing LLMs for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level but performs poorly at the segment level. To further improve the performance of LLMs on MT quality assessment, we conduct an investigation into several prompting designs, and propose a new prompting method called Error Analysis Prompting (EAPrompt) by combining Chain-of-Thoughts (Wei et al., 2022) and Error Analysis (Lu et al., 2023). This technique emulates the commonly accepted human evaluation framework - Multidimensional Quality Metrics (MQM, Freitag et al. (2021)) - and produces explainable and reliable MT evaluations at both the system and segment level. Experimental results from the WMT22 metrics shared task validate the effectiveness of EAPrompt on various LLMs with different structures. Further analysis confirms that EAPrompt effectively distinguishes major errors from minor ones, while also sharing a similar distribution of the number of errors with MQM. These findings highlight the potential of EAPrompt as a human-like evaluator prompting technique for MT evaluation.

1 Introduction

Large language models (LLMs), especially Generative Pre-trained Transformer (GPT) models (Radford et al., 2019; Brown et al., 2020) such as ChatGPT (Ouyang et al., 2022; Achiam et al., 2023), have shown remarkable performance in various natural language processing (NLP) tasks (Qin et al., 2023; Zhong et al., 2023). LLMs are capable of integrating multiple NLP tasks and can generate detailed and comprehensive responses to human inquiries. Additionally, they can respond appropriately to follow-up questions and maintain sensitivity throughout several turns of conversation.

Previous research has demonstrated that LLMs can perform as well as or even better than other LLMs in the machine translation task (Hendy et al., 2023; Jiao et al., 2023; Peng et al., 2023). Given the high cost and time-intensive nature of human evaluation, there is a growing demand for MT metrics that offer both explainability and reliability. Therefore, LLMs hold promise in serving as ideal evaluators, capable of generating both judgments and explanations for the translations.

Concurrent to our research, GEMBA (Kocmi and Federmann, 2023b) presents an encouraging finding that GPT models can surpass the current best MT metrics at system-level quality assessment using straightforward zero-shot standard prompting, confirming the reliability and potential of this technique. However, such prompts perform poorly at the segment level and cannot offer additional interpretable information regarding translation errors, thus detracting from the goal of achieving a "human-like" evaluation.

To this end, we take a further step by carefully investigating advanced prompting strategies upon various LLMs for MT quality assessment and propose a novel prompting strategy - Error Analysis Prompting (EAPrompt) - combining Chain-of-Thought (CoT, Wei et al. (2022)) and Error Analysis (EA, Lu et al. (2023)). We give an example of EAPrompt in Figure 1. The idea is to prompt LLMs to emulate the human evaluation framework MQM (Freitag et al., 2021) by ❶ identifying major and minor errors, and ❷ scoring the translations according to the severity of these errors.

We conduct experiments using the test set from the WMT22 metrics shared task, comprising 106,758 segments from 54 MT systems across diverse domains, to verify the effectiveness of our approach. Our findings reveal that:
[Figure 1 shows two side-by-side prompting examples. Left, GEMBA Prompting: given the source, reference, and translation, the LLM is asked to score the translation on a continuous scale from 0 to 100 (zero meaning "no meaning preserved", one hundred meaning "perfect meaning and grammar") and answers with a single score. Right, Error Analysis Prompting: an in-context example first demonstrates an itemized list of major and minor errors ("Error Demo"); for the test sample, the LLM is instructed to identify major and minor errors based on the source and reference ("Instruction: Identify Errors"), and is then asked to output only two numbers, the counts of major and minor errors ("Instruction: Count Errors").]

Figure 1: A comparative overview between GEMBA Prompting and our proposed Error Analysis Prompting in assessing MT quality with LLMs.

• EAPrompt significantly enhances the performance of LLMs at the system level. Notably, prompting GPT-3.5-Turbo with EAPrompt outperforms all other metrics and prompting strategies, establishing a new state-of-the-art.

• EAPrompt surpasses GEMBA in 8 out of 9 test scenarios across various language models and language pairs, demonstrating superior performance at the segment level.

• The findings regarding EAPrompt's strong performance remain consistent even in reference-less settings, highlighting its suitability for quality estimation tasks.

• When designing prompts, we recommend the EAPrompt variant featuring a 2-step separated prompting approach and itemized error demonstrations.

• Further analysis confirms that EAPrompt adeptly distinguishes major errors from minor ones, closely aligning its error distribution with MQM.

• Optimizing the inference costs of EAPrompt can be achieved by leveraging regular expressions instead of counting queries.

This study provides an initial exploration of utilizing error analysis to prompt LLMs as evaluators. EAPrompt can also be extended to benefit other evaluation scenarios within language generation, including summarization and data-to-text tasks.

2 Prompt LLMs with Error Analysis

2.1 Translation Evaluation Metric

Translation evaluation metrics are used to assess the performance of machine translation systems on specific test sets (Freitag et al., 2022; Mathur et al., 2020b). These metrics typically take inputs from three sources: the sentence from the source language ("Source"), the reference translation provided by human translators ("Reference"), and the hypothesis being evaluated ("Translation"). In scenarios where reference signals are not provided, such a "reference-less" metric can also be utilized for quality estimation purposes (Zerva et al., 2022; Specia et al., 2010; Qiu et al., 2022). The output of the metric is a score or rank indicating the translation quality of each hypothesis.

To verify the reliability of MT metrics, the Multidimensional Quality Metric (MQM) has been adopted recently in WMT as a high-quality human evaluation strategy (Freitag et al., 2021). It asks human experts to annotate the errors in the hypothesis and categorize them as "Major" or "Minor", indicating their severity. A detailed example of MQM annotation is presented in Appendix A.
2.2 Prompt LLMs as Evaluation Metrics

When prompting LLMs as evaluation metrics, it is crucial to design appropriate instructions that describe the evaluation task. In this paper, we mainly adopt two prompting strategies: "GEMBA Prompting" and "Error Analysis Prompting".

GEMBA (Kocmi and Federmann, 2023b) is a zero-shot prompting approach that directly asks LLMs to generate a score that reflects the quality of the translation, which shows state-of-the-art performance on GPT models when compared to other model-based metrics. However, they also observe that the performance at the segment level is relatively poorer. This highlights the importance of combining Chain-of-Thought with the Error Analysis strategy to prompt LLMs in a manner that more closely resembles human evaluation.

2.3 Error Analysis Prompting

Motivated by the MQM framework in human evaluation, the idea of the Error Analysis (EA) paradigm, as introduced by Lu et al. (2023), is to enhance the automatic scoring process by explicitly incorporating error identification, thus providing a more human-like evaluation.

The Chain-of-Thought (CoT) prompting strategy was first proposed by Wei et al. (2022). Instead of directly generating the answer, CoT prompts LLMs to think step by step. This approach has shown significant performance improvements on reasoning tasks, such as GSM8K (Cobbe et al., 2021). CoT is an emergent ability of LLMs and has been incorporated in instruction fine-tuning of LLMs (Chung et al., 2022) as well as in benchmarks designed to evaluate LLM capabilities (Suzgun et al., 2022).

In this work, we combine the CoT and EA paradigms, introducing a novel prompting strategy called Error Analysis Prompting (EAPrompt). As shown in Figure 1, EAPrompt divides the scoring process into two stages: first, the LLM is instructed to identify major and minor errors in the translation ("Instruction: Identify Errors"); subsequently, the number of these two types of errors is counted ("Instruction: Count Errors"). Distinguished from GEMBA prompting, EAPrompt emulates the evaluation process of MQM and produces more explainable and reliable automatic evaluations.

After exploring several prompt contexts in initial experiments, we made the following modifications to EAPrompt:

• we adopt the one-shot learning format (Brown et al., 2020) to enhance the LLMs' understanding of the task (§3.4); different in-context examples are used for different language pairs;

• we employ itemized error demonstration in the template response, enabling clearer identification and quantification of errors (§3.5);

• we partition the evaluation process into two stages to enhance the reliability of metric performance. Additionally, we present a simplified alternative to optimize inference costs by counting errors automatically (§4.3).

2.4 Post-processing of LLM responses

After obtaining the number of major and minor errors, we compute the final score of the translation using the following equation:

score = −w_major · n_major − w_minor · n_minor,  (1)

where n_major and n_minor denote the number of major and minor errors respectively, while w_major and w_minor represent the severity weights assigned to major and minor errors. Since different LLMs may apply distinct criteria for major and minor errors, we follow Lu et al. (2023) and adopt a flexible scoring approach by fixing w_minor = 1 while treating w_major as a latent variable within EAPrompt. We present an analysis of the influence of this variable in §4.2; the detailed implementation in our experiments is described in Appendix B.
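To make the two-step querying and Eq. (1) concrete, here is a minimal sketch (our illustration, not the authors' released code). The prompt wording is copied from Figure 1 and Appendix C, but the one-shot demonstration is omitted for brevity; `query_llm` is a placeholder for whatever chat-completion API is used, and the default w_major = 6 is the value reported for GPT-3.5-Turbo in Appendix B.

```python
import re

IDENTIFY_TMPL = (
    "Q: Source: {src}\nReference: {ref}\nTranslation: {hyp}\n"
    "Based on the given source and reference, identify the major and minor errors "
    "in this translation. Note that Major errors refer to actual translation or "
    "grammatical errors, and Minor errors refer to smaller imperfections, and "
    "purely subjective opinions about the translation."
)
COUNT_TMPL = (
    'Q: Based on the above error information, Output 2 numbers ONLY with the '
    'format: "x, x", indicating the number of major and minor errors. '
    'DO NOT ADD other information!'
)

def query_llm(messages):
    """Placeholder for any chat-completion call (e.g., GPT-3.5-Turbo at temperature 0,
    as in Appendix E.1). Takes a list of {"role", "content"} dicts, returns reply text."""
    raise NotImplementedError

def eaprompt_score(src, ref, hyp, w_major=6.0, w_minor=1.0):
    # Step 1: ask the LLM to identify major/minor errors ("Instruction: Identify Errors").
    history = [{"role": "user", "content": IDENTIFY_TMPL.format(src=src, ref=ref, hyp=hyp)}]
    error_report = query_llm(history)
    # Step 2: ask the LLM to count the errors it just listed ("Instruction: Count Errors").
    history += [{"role": "assistant", "content": error_report},
                {"role": "user", "content": COUNT_TMPL}]
    counts = query_llm(history)
    numbers = re.findall(r"\d+", counts)
    n_major, n_minor = (int(numbers[0]), int(numbers[1])) if len(numbers) >= 2 else (0, 0)
    # Eq. (1): score = -w_major * n_major - w_minor * n_minor, with w_minor fixed to 1.
    return -w_major * n_major - w_minor * n_minor
```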
Dataset | Language Pair | Segments | Systems | Domains
WMT22 | En-De | 2037 | 17 | conversational, e-commerce, news, social
WMT22 | En-Ru | 2037 | 17 | conversational, e-commerce, news, social
WMT22 | Zh-En | 1875 | 20 | conversational, e-commerce, news, social

Table 1: Statistics of the test set. Source, reference texts, and translations are from the WMT22 metrics shared task.

3 Experimental Results

3.1 Experiment Setup

Dataset We utilize the test set from the WMT22 shared tasks (Freitag et al., 2022) in English-German (En-De), English-Russian (En-Ru), and Chinese-English (Zh-En) across 4 different domains - conversational, e-commerce, news, and social. Table 1 provides statistics about our test set.

Human Evaluation We utilize MQM (Freitag et al., 2021) as human judgments, which is annotated by human experts and has been widely adopted in recent WMT metrics shared tasks (Freitag et al., 2022) and quality estimation tasks (Zerva et al., 2022).

Meta Evaluation We follow the standard meta-evaluation approach to measure the performance of MT evaluation metrics (Freitag et al., 2023). At the system level, we use pairwise accuracy across all three language pairs, which calculates the proportion of all possible pairs of MT systems that are ranked the same by the metric and by the human scores (Kocmi et al., 2021). At the segment level, we adopt the group-by-item pairwise accuracy with tie calibration as described by Deutsch et al. (2023): we use the acc*_eq variant to compare vectors of metric and gold scores for each segment, then average the results over segments. All meta-evaluations are calculated with MTME (https://github.com/google-research/mt-metrics-eval), a metric evaluation tool recommended by WMT (Freitag et al., 2022), to maintain comparability with other metrics.
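As a rough illustration of the system-level meta-evaluation described above, the sketch below computes pairwise accuracy over per-system scores. It is our simplification: it does not handle ties or other details of the official protocol, so the MTME tool should still be used for numbers comparable to Table 2; the segment-level acc*_eq with tie calibration (Deutsch et al., 2023) is more involved and is not sketched here.

```python
from itertools import combinations

def system_pairwise_accuracy(metric_scores, gold_scores):
    """Fraction of MT-system pairs whose ordering under the metric agrees with their
    ordering under the gold (MQM) scores. Both arguments map system name -> score."""
    systems = sorted(metric_scores)
    agree, total = 0, 0
    for a, b in combinations(systems, 2):
        metric_diff = metric_scores[a] - metric_scores[b]
        gold_diff = gold_scores[a] - gold_scores[b]
        total += 1
        if metric_diff * gold_diff > 0:   # same strict ordering of the pair
            agree += 1
    return agree / total if total else 0.0

# Toy usage with three hypothetical systems; both sides rank B > A > C, so accuracy is 1.0.
metric = {"sysA": -3.2, "sysB": -1.0, "sysC": -7.5}
gold = {"sysA": -2.0, "sysB": -0.5, "sysC": -9.0}
print(system_pairwise_accuracy(metric, gold))
```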
For Error Analysis Prompting (EAPrompt), we
3.2 Baselines and Large Language Models conduct a comparison of various prompting strate-
Baseline Metrics Given the reported unreliabil- gies of EAPrompt in §3.5, and use the best-
ity of BLEU (Papineni et al., 2002), we com- performing variant for other experiments. We show
pare our method with several model-based met- the detailed prompt contexts in Appendix C.
rics for MT evaluation. BLEURT (Sellam et al., 3.4 Experimental Results
2020) and COMET (Rei et al., 2020) are super-
vised neural metrics fine-tuned on human evalua- We compute system&segment level performance
tion. We employ the BLEURT20 and COMET- of EAPrompt with LLMs in Table 2. We see that:
22 for reference-based metrics, and COMET-QE (i) At the system level, EAPrompt empowers
for the reference-less metric. UniTE (Wan et al., GPT-3.5-Turbo to surpass all other metrics and
2022) is also a learnt metric that evaluates MT out- achieves state-of-the-art performance. Consis-
puts combining three different evaluation scenarios. tent with the findings of Kocmi and Federmann
We also adopt UniTE-src for comparing reference- (2023b), LLMs achieve state-of-the-art perfor-
less metrics. MetricX-XXL (Juraska et al., 2023) mance across all three language pairs at the sys-
is a large-scale multi-task metric that fine-tunes tem level, significantly outperforming traditional
LLM checkpoints using diverse human feedback metrics ("Baselines") by a large margin.
data. For reference-less metrics, we also reproduce Remarkably, when prompting all LLMs with
MaTESe-QE (Perrella et al., 2022), a metric lever- EAPrompt, the performance notably surpasses
aging transformer-based multilingual encoders to GEMBA at the system level, achieving the highest
identify error spans in translations. pairwise accuracy of 91.2% on GPT-3.5-Turbo,
thus establishing a new SOTA.
Large Language Models For prioprietary mod-
els, we use the OpenAI API to experiment with (ii) At the segment level, EAPrompt outperforms
GPT-3.5-Turbo 2 . We also experiment with a GEMBA in 8 out of 9 tested scenarios. At the
human-aligned Llama2-70B series model (Touvron segment level, despite previous findings by Kocmi
et al., 2023b) fine-tuned with multilingual trans- and Federmann (2023b) regarding the weak cor-
lation data, noted as "Llama2-70b-Chat" in ex- relation between LLMs as evaluators and human
1 judgments, prompting with EAPrompt addresses
https://github.com/google-research/
mt-metrics-eval this drawback of LLM evaluators, outperforming
2
We use the 0613 OpenAI model versions. GEMBA’s performance on nearly all tested LLMs
Models | Metrics / Prompts | Ref? | Sys. Acc. (All 3 LPs) | Seg. Acc* En-De | Seg. Acc* En-Ru | Seg. Acc* Zh-En
Baselines | MetricX-XXL | ✓ | 85.0 | 60.4 | 60.6 | 54.4
Baselines | BLEURT20 | ✓ | 84.7 | 56.8 | 54.0 | 48.9
Baselines | COMET22 | ✓ | 83.9 | 59.4 | 57.7 | 53.6
Baselines | UniTE | ✓ | 82.8 | 59.8 | 57.7 | 51.7
Baselines | COMET-QE | ✗ | 78.1 | 55.5 | 53.4 | 48.3
Baselines | UniTE-src | ✗ | 75.9 | 58.2 | 55.4 | 50.8
Baselines | MaTESe-QE | ✗ | 74.8 | 57.2 | 49.9 | 49.4
Llama2-70b-Chat | GEMBA | ✓ | 74.1 | 53.7 | 48.8 | 45.4
Llama2-70b-Chat | EAPrompt | ✓ | 85.4 (+11.3) | 55.2 (+1.5) | 51.4 (+2.6) | 50.2 (+4.8)
Llama2-70b-Chat | GEMBA | ✗ | 72.6 | 54.1 | 47.8 | 45.0
Llama2-70b-Chat | EAPrompt | ✗ | 85.8 (+13.2) | 55.0 (+0.9) | 51.6 (+3.8) | 49.3 (+4.3)
Mixtral-8x7b-Instruct | GEMBA | ✓ | 69.7 | 54.8 | 48.3 | 46.7
Mixtral-8x7b-Instruct | EAPrompt | ✓ | 84.0 (+14.3) | 53.8 (-1.0) | 50.6 (+2.3) | 48.2 (+1.5)
Mixtral-8x7b-Instruct | GEMBA | ✗ | 74.1 | 54.8 | 47.5 | 46.2
Mixtral-8x7b-Instruct | EAPrompt | ✗ | 82.5 (+8.4) | 54.1 (-0.7) | 49.9 (+2.4) | 48.3 (+1.1)
GPT-3.5-Turbo | GEMBA | ✓ | 86.5 | 55.2 | 49.5 | 48.2
GPT-3.5-Turbo | EAPrompt | ✓ | 91.2 (+4.7) | 56.7 (+1.5) | 53.3 (+3.8) | 50.0 (+1.8)
GPT-3.5-Turbo | GEMBA | ✗ | 86.9 | 54.7 | 50.0 | 47.6
GPT-3.5-Turbo | EAPrompt | ✗ | 89.4 (+2.5) | 55.7 (+1.0) | 53.4 (+3.4) | 48.8 (+1.2)

Table 2: The performance of metrics using pairwise accuracy (%) at the system level ("Sys. Acc.", over all 3 language pairs) and pairwise accuracy with tie calibration (%) at the segment level ("Seg. Acc*"). All results are compared with human-annotated MQM scores. Values in parentheses give the change of EAPrompt relative to the corresponding GEMBA row. The best results among the same model are highlighted in bold; the best results among all metrics are underlined.

(ii) At the segment level, EAPrompt outperforms GEMBA in 8 out of 9 tested scenarios. Despite previous findings by Kocmi and Federmann (2023b) regarding the weak correlation between LLMs as evaluators and human judgments at the segment level, prompting with EAPrompt addresses this drawback of LLM evaluators, outperforming GEMBA on nearly all tested LLMs and language pairs by a significant margin. The best segment-level results are achieved by GPT-3.5-Turbo for En-De (56.7) and En-Ru (53.4), and by Llama2-70b-Chat for Zh-En (50.2). This validates the effectiveness of our EAPrompt.

The only exception is observed for Mixtral-8x7b-Instruct on En-De, where the segment-level accuracy is lower than GEMBA by 1.0. This discrepancy might be attributed to a limited capability of identifying translation errors in the En-De language pair. Another notable finding is that prompting LLMs, with either GEMBA or EAPrompt, fails to surpass the current best metrics ("Baselines") at the segment level. This could be because these baseline metrics have been fine-tuned using extensive translation and human evaluation datasets, while the LLMs employed in our experiment are versatile models guided by few-shot prompts.

(iii) EAPrompt enhances the performance of LLMs as translation evaluators in reference-less scenarios. Our main findings remain consistent across both reference-based and reference-less settings (indicated by "✓" and "✗" in the Ref? column, respectively): EAPrompt continues to outperform GEMBA across all three tested LLMs at the system level, and in 8 out of 9 scenarios at the segment level. The improvement is slightly lower compared to scenarios with reference signals.

These results underscore the impressive cross-lingual capabilities of LLMs and their suitability for quality estimation under EAPrompt, even in the absence of reference translations, which poses a significant challenge for MT evaluation.

3.5 Ablation Study of Prompt Variants

Given the crucial significance of prompt design, we investigate several versions of in-context prompt contexts and present an analysis in Table 3. The prompt contexts used in our experiment are detailed in Appendix C. Due to budget constraints, we utilize two LLMs, Mixtral-8x7b-Instruct and Llama2-70b-Chat, as the test bed for this ablation study. Our findings indicate that:
Prompt | Demo of Errors | Type of Queries | Mixtral-8x7b-Instruct: All (3 LPs) / En-De / En-Ru / Zh-En | Llama2-70b-Chat: All (3 LPs) / En-De / En-Ru / Zh-En
GEMBA | - | - | 69.7 / 54.8 / 48.3 / 46.7 | 74.1 / 53.7 / 48.8 / 45.4
EAPrompt | Detailed | 1-step | 75.2 / 53.4 / 50.0 / 45.0 | 62.0 / 53.7 / 47.0 / 47.8
EAPrompt | Detailed | 2-step | 75.5 / 53.4 / 47.9 / 45.5 | 84.7 / 53.5 / 46.9 / 47.5
EAPrompt | Itemized | 1-step | 60.2 / 53.4 / 45.1 / 45.6 | 56.9 / 53.7 / 48.4 / 50.2
EAPrompt | Itemized | 2-step | 84.0 / 53.7 / 50.6 / 48.2 | 85.4 / 55.2 / 51.4 / 50.2

Table 3: Comparison of the system-level ("All (3 LPs)") and segment-level ("En-De", "En-Ru", "Zh-En") performance of LLMs with different variants of prompts for EAPrompt. We compare itemized and detailed responses for demonstrating identified errors, and instructions either separated into two queries ("2-step", one for identifying errors and another for scoring) or combined into a single query ("1-step"). The best results among all prompt variants are highlighted in bold.

(i) Itemized error demonstration is superior to detailed illustration. We assume that when identifying translation errors, providing detailed descriptions may impede the LLM's capability to accurately identify errors and count them. As illustrated in the "Demo of Errors" column, employing itemized error demonstrations instead of detailed paragraphs yields improved performance at both the system and segment levels for both tested LLMs.

In our initial study, we observed that generating excessively detailed responses could lead to incorrect error counting or misclassification of error severity. Therefore, it is recommended to employ clear and concise error descriptions in a format that is easily processed and comprehended by LLMs.

(ii) Separating the scoring process from error identification with two queries enhances the performance of LLMs as translation evaluators. Another consideration in prompt design is the division of the evaluation process into error identification and error counting. As depicted in the "Type of Queries" column, it is evident that the performance of a single prompting step is considerably lower than that of a 2-step prompting approach. This may be because separating the scoring process allows LLMs to concentrate on a single task in each query, thereby facilitating more accurate judgments and reducing the likelihood of incorrectly counting the number of errors.

(iii) Among the prompting strategies, EAPrompt appears to be more suitable for LLMs as translation evaluators. When compared with GEMBA prompting strategies, the EAPrompt variant featuring a 2-step separated prompting approach and itemized responses achieves superior performance in enhancing LLMs' effectiveness as translation evaluators. Consequently, we recommend employing this particular variant for LLMs as translation evaluators.

4 Analysis

4.1 EAPrompt aligns with human judgment through a similar distribution of major and minor errors across most LLMs

To investigate whether LLMs align with the gold human judgment MQM through similar distributions of major and minor errors, we present the error distribution across various test scenarios in Figure 2.

We can see that, for major errors, all tested LLMs exhibit distributions that closely resemble MQM. Regarding minor errors, Mixtral-8x7b-Instruct appears to produce a slightly higher frequency of such errors compared to the other LLMs, while the distribution of the other LLMs remains consistent with MQM. This observation further validates the efficacy of EAPrompt.

This finding provides valuable insights into enhancing the reliability of LLMs as translation evaluators. It suggests a potential focus on guiding LLMs to more accurately identify minor errors, such as clarifying the specific categories and severity of minor errors.

4.2 EAPrompt empowers LLMs to distinguish major errors from minor ones

A potential concern with EAPrompt is whether this technique can prompt LLMs to distinguish major errors from minor ones. To address this concern, we adjust the weight assigned to major errors (w_major) in the score computation process outlined in §2.4, and visualize the impact of this adjustment on both the system- and segment-level performance in Figure 3. If the metric effectively distinguishes major errors from minor ones, we anticipate a noticeable performance decrease when the weight of major errors w_major approaches that of minor errors (w_minor = 1 in this study).

Our findings reveal that for all three LLMs tested, setting w_major < 3 results in a substantial performance decline, indicating that error analysis prompting with all tested LLMs possesses the ability to discriminate major errors from minor ones.

Another noteworthy observation from this analysis is that when w_major ≥ 5, both the system-level and segment-level accuracies exhibit minimal fluctuation, suggesting that the performance of EAPrompt remains nearly unaffected by this latent variable during score computation.
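The weight-sensitivity analysis behind Figure 3 can be reproduced schematically as below: re-score each system for a range of w_major values (with w_minor fixed to 1) and re-measure agreement with MQM. This is only our sketch under assumed data structures (per-segment error counts per system); `accuracy_fn` can be, for instance, the pairwise-accuracy helper sketched in §3.1.

```python
def sweep_major_weight(error_counts, gold_scores, accuracy_fn, weights=range(1, 11)):
    """For each candidate w_major, re-score every system from its per-segment
    (n_major, n_minor) counts via Eq. (1) and measure agreement with gold scores.
    `error_counts` maps system -> list of (n_major, n_minor); `gold_scores` maps
    system -> gold (MQM) system score. All names are illustrative."""
    results = {}
    for w_major in weights:
        metric_scores = {
            system: sum(-w_major * n_maj - 1.0 * n_min for n_maj, n_min in counts) / len(counts)
            for system, counts in error_counts.items()
        }
        results[w_major] = accuracy_fn(metric_scores, gold_scores)
    return results
```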
[Figure 2 consists of six log-scale histograms (three language pairs x {major, minor} errors): for En-De, En-Ru, and Zh-En, the number of translations with 0-10 identified major errors (top row) and minor errors (bottom row) is shown for Mixtral-8x7b-Instruct, Llama2-70b-Chat, GPT-3.5-Turbo, and the gold MQM annotations.]

Figure 2: Distribution of identified error counts across various LLMs and human evaluation (MQM), for the language pairs En-De, En-Ru and Zh-En, respectively.

Models | Repr? | Sys. Acc. (All 3 LPs) | Seg. Acc* En-De | Seg. Acc* En-Ru | Seg. Acc* Zh-En
Llama2-70b-Chat | ✓ | 85.0 | 55.6 | 51.5 | 50.4
Llama2-70b-Chat | ✗ | 85.4 | 55.2 | 51.4 | 50.2
Mixtral-8x7b-Instruct | ✓ | 82.8 | 53.7 | 50.9 | 47.6
Mixtral-8x7b-Instruct | ✗ | 84.0 | 53.7 | 50.6 | 48.2
GPT-3.5-Turbo | ✓ | 90.1 | 56.8 | 53.9 | 50.0
GPT-3.5-Turbo | ✗ | 91.2 | 56.7 | 53.3 | 50.0

Table 4: Performance comparison of EAPrompt between utilizing the Regular Expression Matching strategy ("✓" in Repr?) and the counting query strategy ("✗" in Repr?) across various LLMs.

4.3 EAPrompt optimizes inference costs by utilizing regular expressions instead of counting queries

Since EAPrompt adopts a two-step prompting strategy, one related question is: can we simplify the query process to reduce inference costs? One potential approach involves substituting the scoring query step with an algorithm that identifies major and minor errors using regular expressions ("Repr") to detect bullet points or initial numbers. A detailed description of the Repr matching strategy is provided in Appendix D. The analysis, as depicted in Table 4, indicates that employing the Repr matching strategy, as opposed to the original query for counting errors (indicated by "✗" in Repr?), yields minimal performance variation at both the system and segment levels. Thus, if inference costs are a concern for this metric, substituting the second query step of EAPrompt with regular expressions could be a viable option. Note that for different LLMs, a tailored regular expression pattern may be necessary to encompass various response structures.

4.4 Case Study

We discuss potential issues encountered by LLMs and their corresponding solutions in Appendix E, including invalid responses, input order bias, etc. We aim to provide insights that should be considered when utilizing LLMs as translation evaluators.

[Figure 3 plots system-level accuracy (all 3 LPs) and segment-level accuracy (En-De, En-Ru, Zh-En) against the major error weight (1-10) for Mixtral-8x7b-Instruct, GPT-3.5-Turbo, and Llama2-70b-Chat.]

Figure 3: Effect of varying the major error weight (w_major) on EAPrompt across different LLMs at both the system and segment levels.

5 Related Work

Translation Evaluation Metrics MT evaluation metrics are of crucial importance to the development of MT systems (Freitag et al., 2022). Studies have shown that traditional surface-based metrics such as BLEU (Papineni et al., 2002) are no longer suitable for evaluating high-quality MT systems (Mathur et al., 2020a). Modern metrics like COMET (Rei et al., 2020), MetricX-XXL (Juraska et al., 2023), BLEURT (Sellam et al., 2020), and UniTE (Wan et al., 2022) leverage human evaluations and high-quality translations for training. While these metrics achieve strong correlation with human judgements such as MQM (Freitag et al., 2021), there is a growing demand for explainability in their evaluation. Despite progress, recent research struggles to strike a balance between the reliability and explainability of these metrics (Lu et al., 2023; Xu et al., 2022; Perrella et al., 2022). In this work, we delve into the potential of LLMs for "human-like" translation evaluation, as they possess the capability to explicitly identify translation errors without further fine-tuning, which resembles the evaluation process of humans.

LLMs as Evaluators LLMs refer to language models with hundreds of billions of parameters which are trained on massive textual data (Chang et al., 2024; Zhao et al., 2023). Since the emergence of ChatGPT, LLMs have shown remarkable proficiency across various NLP tasks (Achiam et al., 2023; Touvron et al., 2023b). A prevalent application of LLMs is harnessing them as evaluators for assessing the performance of chatbots (Zheng et al., 2023). Recent studies also show LLMs' efficacy in evaluating NLG tasks like summarization and dialog generation through multi-step prompting (Liu et al., 2023). GEMBA (Kocmi and Federmann, 2023b) is the pioneering effort in utilizing LLMs as translation evaluators via a zero-shot prompting approach with GPT models. In this work, EAPrompt innovatively combines error analysis (Lu et al., 2023) and chain-of-thought (Wei et al., 2022) to prompt LLMs for achieving human-like translation evaluation.

Subsequent work follows ours to further explore the potential of LLMs as translation evaluators. AutoMQM (Fernandes et al., 2023) parallels our approach, utilizing the PaLM-2 model (Anil et al., 2023) as the testbed. GEMBA-MQM (Kocmi and Federmann, 2023a) further improves EAPrompt by employing a few-shot prompting technique using GPT-4, making this approach universally applicable across languages. Another line of research focuses on fine-tuning LLMs to accurately predict error spans in translations. For instance, InstructScore (Xu et al., 2023) fine-tunes a Llama model (Touvron et al., 2023a), while XCOMET (Guerreiro et al., 2023) scales up from COMETKiwi (Rei et al., 2023) to achieve this goal.

6 Conclusion

In this paper, we explore the potential of LLMs as a metric for evaluating translations. We design a novel one-shot prompting strategy, EAPrompt, based on chain-of-thought and error analysis, and show that this strategy significantly improves evaluation performance at both the system and segment levels. We compare different EAPrompt variants and ultimately opt for a 2-step prompting approach with itemized error demonstrations. Further analysis confirms EAPrompt's proficiency in error identification and its alignment with the commonly accepted human evaluation framework MQM.

In future work, we would like to experiment with a broader range of LLMs (Barrault et al., 2019; Anastasopoulos et al., 2021; Kocmi et al., 2022; Zan et al., 2022) to make our conclusions more convincing. Lastly, it will be interesting to test the capabilities of LLMs on other MT-related tasks, such as grammatical error correction and automatic post-editing (Wu et al., 2023; Vidal et al., 2022).
Limitations

The limitations of this work are three-fold:

• Potential Test Data Contamination: Although we utilized WMT22 to minimize the risk of test set leakage into the training data of LLMs, it is still possible that some contamination from the test data remains. Therefore, future researchers utilizing these datasets should be cautious and carefully address this issue, as it may affect the availability of the test set for comparison purposes.

• Budget Constraints: Due to limited resources, we were unable to explore more prompt choices comprehensively in our research. The findings presented in this study only reflect our initial experiments. We leave the impact of different prompt choices for further investigation.

• Limited Range of LLMs Tested: In this study, we focused on evaluating a limited number of LLMs that we believed possessed potential and capability as translation evaluators. However, it is important to note that not all existing LLMs can necessarily serve as reliable evaluators under the EAPrompt approach. Future research could explore and experiment with a broader range of LLMs, examining their effectiveness and assessing their suitability as evaluators.

Ethics Statement

We take ethical considerations very seriously, and strictly adhere to the Code of Ethics. All procedures performed in this study are in accordance with the ethical standards. This paper focuses on evaluating the capabilities of LLMs as translation evaluators. Our proposed approach, EAPrompt, does not include statements that induce the model to generate harmful information. Additionally, this method solely extracts and processes the numerical scores from the model's response, thereby further mitigating potential risks. Both the datasets and models used in this paper are publicly available and have been widely adopted by researchers. Our model will not learn from user inputs or cause potential risks to the NLP community. We ensure that the findings and conclusions of this paper are reported accurately and objectively. Informed consent was obtained from all individual participants included in this study.
References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint.

Antonios Anastasopoulos, Ondřej Bojar, Jacob Bremerman, Roldano Cattoni, Maha Elbayad, Marcello Federico, et al. 2021. Findings of the IWSLT 2021 evaluation campaign. In IWSLT.

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. PaLM 2 technical report. arXiv preprint.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, et al. 2019. Findings of the 2019 conference on machine translation (WMT19). In WMT.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, et al. 2020. Language models are few-shot learners. NeurIPS.

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. 2024. A survey on evaluation of large language models. ACM.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint.

Daniel Deutsch, George Foster, and Markus Freitag. 2023. Ties matter: Meta-evaluating modern metrics with pairwise accuracy and tie calibration. In EMNLP.

Patrick Fernandes, Daniel Deutsch, Mara Finkelstein, Parker Riley, André Martins, Graham Neubig, Ankush Garg, Jonathan Clark, Markus Freitag, and Orhan Firat. 2023. The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation. In WMT.

Markus Freitag, George Foster, David Grangier, et al. 2021. Experts, errors, and context: A large-scale study of human evaluation for machine translation. TACL.

Markus Freitag, Nitika Mathur, Chi-kiu Lo, Eleftherios Avramidis, Ricardo Rei, Brian Thompson, Tom Kocmi, Frederic Blain, Daniel Deutsch, Craig Stewart, Chrysoula Zerva, Sheila Castilho, Alon Lavie, and George Foster. 2023. Results of WMT23 metrics shared task: Metrics might be guilty but references are not innocent. In WMT.

Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, Eleftherios Avramidis, Tom Kocmi, et al. 2022. Results of WMT22 metrics shared task: Stop using BLEU – neural metrics are better and more robust. In WMT.

Nuno M Guerreiro, Ricardo Rei, Daan van Stigt, Luisa Coheur, Pierre Colombo, and André FT Martins. 2023. xCOMET: Transparent machine translation evaluation through fine-grained error detection. arXiv preprint.

Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, et al. 2023. How good are GPT models at machine translation? A comprehensive evaluation. arXiv preprint.

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv preprint.

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is ChatGPT a good translator? A preliminary study. arXiv preprint.

Juraj Juraska, Mara Finkelstein, Daniel Deutsch, Aditya Siddhant, Mehdi Mirzazadeh, and Markus Freitag. 2023. MetricX-23: The Google submission to the WMT 2023 metrics shared task. In WMT.

Tom Kocmi, Rachel Bawden, Ondřej Bojar, et al. 2022. Findings of the 2022 conference on machine translation (WMT22). In WMT.

Tom Kocmi and Christian Federmann. 2023a. GEMBA-MQM: Detecting translation quality error spans with GPT-4. In WMT.

Tom Kocmi and Christian Federmann. 2023b. Large language models are state-of-the-art evaluators of translation quality. In EAMT.

Tom Kocmi, Christian Federmann, Roman Grundkiewicz, Marcin Junczys-Dowmunt, Hitokazu Matsushita, and Arul Menezes. 2021. To ship or not to ship: An extensive evaluation of automatic metrics for machine translation. In WMT.

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG evaluation using GPT-4 with better human alignment. In EMNLP.

Qingyu Lu, Liang Ding, Liping Xie, Kanjian Zhang, Derek F. Wong, and Dacheng Tao. 2023. Toward human-like evaluation for natural language generation with error analysis. In ACL.

Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2020a. Tangled up in BLEU: Reevaluating the evaluation of automatic machine translation evaluation metrics. In ACL.

Nitika Mathur, Johnny Wei, Markus Freitag, Qingsong Ma, and Ondřej Bojar. 2020b. Results of the WMT20 metrics shared task. In WMT.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.

Keqin Peng, Liang Ding, Qihuang Zhong, Li Shen, Xuebo Liu, Min Zhang, Yuanxin Ouyang, and Dacheng Tao. 2023. Towards making the most of ChatGPT for machine translation. arXiv preprint.

Stefano Perrella, Lorenzo Proietti, Alessandro Scirè, Niccolò Campolungo, and Roberto Navigli. 2022. MaTESe: Machine translation evaluation as a sequence tagging problem. In WMT.

Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. Is ChatGPT a general-purpose natural language processing task solver? arXiv preprint.

Baopu Qiu, Liang Ding, Di Wu, Lin Shang, Yibing Zhan, and Dacheng Tao. 2022. Original or translated? On the use of parallel data for translation quality estimation. arXiv preprint.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog.

Ricardo Rei, Nuno M. Guerreiro, José Pombal, Daan van Stigt, Marcos Treviso, Luisa Coheur, José G. C. de Souza, and André Martins. 2023. Scaling up CometKiwi: Unbabel-IST 2023 submission for the quality estimation shared task. In WMT.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In EMNLP.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In ACL.

Lucia Specia, Dhwaj Raj, and Marco Turchi. 2010. Machine translation evaluation versus quality estimation. Machine Translation.

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. 2022. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. LLaMA: Open and efficient foundation language models. arXiv preprint.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint.

Blanca Vidal, Albert Llorens, and Juan Alonso. 2022. Automatic post-editing of MT output using large language models. In AMTA.

Yu Wan, Dayiheng Liu, Baosong Yang, Haibo Zhang, Boxing Chen, Derek Wong, and Lidia Chao. 2022. UniTE: Unified translation evaluation. In ACL.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv preprint.

Haoran Wu, Wenxuan Wang, Yuxuan Wan, Wenxiang Jiao, and Michael Lyu. 2023. ChatGPT or Grammarly? Evaluating ChatGPT on grammatical error correction benchmark. arXiv preprint.

Wenda Xu, Yi-Lin Tuan, Yujie Lu, Michael Saxon, Lei Li, and William Yang Wang. 2022. Not all errors are equal: Learning text generation metrics using stratified error synthesis. In EMNLP.

Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao Song, Markus Freitag, William Wang, and Lei Li. 2023. INSTRUCTSCORE: Towards explainable text generation evaluation with automatic feedback. In EMNLP.

Changtong Zan, Keqin Peng, Liang Ding, Baopu Qiu, et al. 2022. Vega-MT: The JD Explore Academy machine translation system for WMT22. In WMT.

Chrysoula Zerva, Frédéric Blain, Ricardo Rei, et al. 2022. Findings of the WMT 2022 shared task on quality estimation. In WMT.

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint.

Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. 2023. Can ChatGPT understand too? A comparative study on ChatGPT and fine-tuned BERT. arXiv preprint.
A Description of MQM

Multidimensional Quality Metric (MQM) is a human evaluation framework commonly used in WMT metrics shared tasks as the gold standard (Freitag et al., 2021, 2023). In this paper, EAPrompt emulates MQM to identify major and minor errors, providing insightful explanations for the translation. Table 6 illustrates an example annotated in detail through the MQM framework.

B Post-processing of EAPrompt

As described in §2.4, we treat w_major as a latent variable within EAPrompt. In our experiments, we select for each LLM the value of this latent variable with the best average performance, denoted as w*_major. The values are reported in Table 5.

Model | w*_major
GPT-3.5-Turbo | 6
Llama2-70b-Chat | 10
Mixtral-8x7b-Instruct | 10

Table 5: Optimal values of w*_major for each LLM. To ensure a fair comparison, we keep this variable constant across all tested scenarios for every LLM.

C Prompt Contexts of EAPrompt

Figure 4 provides the prompt contexts implemented in EAPrompt, along with the detailed error demonstration and the combined query instruction discussed in §3.5, for reproduction of our experiments.

D Counting Errors using Regular Expression Matching

In Figure 5, we present an overview of the error-matching strategy utilized in §4.3 to automatically identify the number of major and minor errors. The procedure is as follows (a code sketch is given after the list):

1. Locate "major error" and "minor error" within the response, then segment the response accordingly.

2. Utilize regular expression matching to identify the initial numbers of the major and minor error items. For implementation, we include three different initial number formats: "1.", "1)" and "(1)" (using "1" as an example).

3. Record the number of major and minor errors.
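A minimal Python rendering of this recipe is shown below. It is our sketch, not the authors' implementation, and, as noted in §4.3, the pattern may need to be adapted to each LLM's response format.

```python
import re

# Matches an itemized bullet at the start of a line: "(1)", "1.", or "1)".
BULLET = re.compile(r"^\s*(?:\(\d+\)|\d+[.)])", re.MULTILINE)

def count_errors(response: str):
    """Count major/minor errors in an itemized EAPrompt error report:
    split the response at the "Major errors"/"Minor errors" headers (step 1),
    then count the itemized bullets in each part (steps 2-3)."""
    lowered = response.lower()
    major_pos = lowered.find("major error")
    minor_pos = lowered.find("minor error")
    if major_pos == -1 or minor_pos == -1:
        return 0, 0  # fall back when the response does not follow the template
    major_part = response[major_pos:minor_pos]
    minor_part = response[minor_pos:]
    return len(BULLET.findall(major_part)), len(BULLET.findall(minor_part))

# The itemized example response from Figure 4 yields (2, 5), matching Figure 5.
report = """Major errors:
(1) "BEIJING" - Omission
(2) "subjects" - Mistranslation
Minor errors:
(1) "households of various" - Mistranslation
(2) "festival" - Mistranslation
(3) "supervision" - Mistranslation
(4) "Beijing Municipal Market Supervision Bureau" - Inappropriate for context
(5) "BEIJING" - Spelling"""
print(count_errors(report))
```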
E Case Study

In Figure 6, we list several typical issues that one should be aware of when using LLMs such as ChatGPT as translation evaluators.

E.1 Potential instability in the responses without temperature control

Issue: When evaluating translations using LLMs, the generated responses may vary significantly. As shown in Case 1, we regenerate several responses with the same input and obtain three different scores (98, 95, 100) for the translation.

Solution: We control the temperature parameter to mitigate the variability in LLM judgments. Accordingly, for all experiments detailed in this paper, we set the temperature to 0 for GPT-3.5-Turbo. For the other two models, Llama2-70b-Chat and Mixtral-8x7b-Instruct, we opted for a temperature setting of 0.05, since the inference parameter for these two models must be above zero.

E.2 Input order bias when evaluating multiple translations simultaneously

Issue: An alternative prompting strategy is to present multiple translations together as a single input to the LLM for evaluation, reducing the number of queries and potentially saving budget. However, we observe a bias where translations presented earlier tend to get higher scores compared to those presented later. As shown in Case 2, we provide 8 translations along with their corresponding source and reference sentences. First, we present the translations sequentially and ask the LLM to rank them according to their translation quality; we then reverse the order of the translations and obtain an entirely different sequence of ranks.

Solution: The contradictory results may be attributed to the auto-regressive nature of the decoder model, which gives more attention to the later input, potentially leading to greater identification of errors for translations presented later. Therefore, we recommend that researchers input one translation at a time instead of providing multiple translations.

E.3 LLMs may generate invalid answers for all prompting strategies

Issue: We observe that in certain cases, LLMs may not function as translation evaluators and may produce invalid answers with textual explanations. A typical case is illustrated in Case 3, where ChatGPT tends to prioritize the BLEU score instead of offering judgments based on its inherent capabilities.

Solution: We follow the method mentioned in Kocmi and Federmann (2023b) for handling invalid answers, where we introduce randomness by iteratively increasing the temperature. Subsequently, we take the first response that falls within the expected score range.
System: Online-A.en
Domain: conversational
Doc_id: 1
Seg_id: 6
Text - Source: 请问,订单情况现在是什么样?
Text - Reference: May I ask what the status of the order is now?
Text - Translation: Please ask, what is the order situation now?
Human Evaluation - Major Error: "Please ask" - Accuracy/Mistranslation
Human Evaluation - Minor Error: "situation" - Style/Awkward

Table 6: An example of MQM, comprising information of the test sample along with human-annotated major and minor errors.

In-context Example:
Q: Source: 中新网北京9月27日电 (记者 杜燕)为加强节前市场监管执法,北京市市场监管局在国庆节前夕检查各类经营主体2000余户。
Reference: Chinanews.com Report on September 27 in Beijing (Journalist Du Yan) The Beijing Administration for Market Regulation inspected more than 2,000 operating entities of different types before the National Day holiday to strengthen pre-holiday market regulation and law enforcement.
Translation: BEIJING, Sept. 27 (Reporter Du Yan) In order to strengthen market supervision and law enforcement before the festival, the Beijing Municipal Market Supervision Bureau inspected more than 2,000 households of various business subjects on the eve of the National Day.
Based on the given source and reference, identify the major and minor errors in this translation. Note that Major errors refer to actual translation or grammatical errors, and Minor errors refer to smaller imperfections, and purely subjective opinions about the translation.

Itemized Error Demo:
A: Major errors:
(1) "BEIJING" – Omission
(2) "subjects" – Mistranslation
Minor errors:
(1) "households of various" – Mistranslation
(2) "festival" – Mistranslation
(3) "supervision" – Mistranslation
(4) "Beijing Municipal Market Supervision Bureau" – Inappropriate for context
(5) "BEIJING" – Spelling

Detailed Error Demo:
A: I think the mistranslation of "subjects" should be categorized into a major error, and the omission in "BEIJING" should also be considered as a major error. "households of various", "festival" and "supervision" are three mistranslation errors; they should be categorized into minor errors. The terminology "Beijing Municipal Market Supervision Bureau" is inappropriate for context, and should also be categorized into a minor error. "BEIJING" also has a spelling error, which is considered as a minor error.

Test Question:
Q: Source: 当日,以“讲好中国故事,传播好中国声音”为宗旨的外宣工作创新性、基础性平台——“中国好故事”数据库正式上线,向世界展现真实、立体、全面的中国。
Reference: On that day, the externally publicized innovative and basic platform-"The story of China", for the purpose of telling the story of China well and spreading the voice of China well", was officially on line today, to show the world a true, three-dimensional and comprehensive China.
Translation: On that day, the "China Good Story" database, an innovative and basic platform for outreach work with the aim of "telling a good Chinese story and disseminating a good Chinese voice", was officially launched to show the world a real, three-dimensional and comprehensive China.

Separated queries:
Based on the given source and reference, identify the major and minor errors in this translation. Note that Major errors refer to actual translation or grammatical errors, and Minor errors refer to smaller imperfections, and purely subjective opinions about the translation.
A:
Q: Based on the above error information, Output 2 numbers ONLY with the format: "x, x", indicating the number of major and minor errors. DO NOT ADD other information!
A:

Combined query:
Based on the given source and reference, identify the major and minor errors in this translation. Note that Major errors refer to actual translation or grammatical errors, and Minor errors refer to smaller imperfections, and purely subjective opinions about the translation. Based on the above error information, Output 2 numbers ONLY with the format: "x, x", indicating the number of major and minor errors. DO NOT ADD other information!
A:

Figure 4: The prompt contexts employed in EAPrompt. We present itemized/detailed responses for error demonstrations and separated/combined instructions for different types of queries.
[Figure 5 illustrates the matching procedure on the itemized example response from Figure 4: the keywords "Major errors"/"Minor errors" are located and the response is split there; initial numbers "(1) (2)" are matched in the major-error part and "(1) (2) (3) (4) (5)" in the minor-error part; the recorded output is 2 major errors and 5 minor errors.]

Figure 5: The regular expression matching strategy utilized in §4.3 to automatically count the number of major and minor errors in the LLM response.

[Figure 6 shows three annotated prompt-response pairs with GPT-3.5-Turbo. Case 1: the same scoring prompt (the Ctrip National Day tourism report example) is regenerated several times and yields scores of 100, 98, and 95. Case 2: eight system translations of the same source are ranked in a single query; presenting them in the original order yields the ranking SYS1>SYS2>SYS4>SYS5>SYS3>SYS6>SYS7>SYS8, while presenting them in reversed order yields SYS8>SYS7>SYS6>SYS5>SYS4>SYS3>SYS2>SYS1. Case 3: asked to act as a translation evaluation metric, the model instead proposes to evaluate the translation with the BLEU score.]

Figure 6: Case study of potential issues in LLMs. All three cases are from the GPT-3.5-Turbo model ("ChatGPT"). Top: the LLM exhibits variations in its responses upon multiple regenerations; Middle: a different input order of samples may affect the judgment of the LLM; Bottom: the LLM sometimes relies on existing metrics during translation evaluation.
