Figure 1: A comparative overview between GEMBA Prompting and our proposed Error Analysis Prompting
in assessing the MT quality with LLMs.
• EAPrompt significantly enhances the performance of LLMs at the system level. Notably, prompting GPT-3.5-Turbo with EAPrompt outperforms all other metrics and prompting strategies, establishing a new state-of-the-art.

• EAPrompt surpasses GEMBA in 8 out of 9 test scenarios across various language models and language pairs, demonstrating superior performance at the segment level.

• The findings regarding EAPrompt's strong performance remain consistent even in reference-less settings, highlighting its suitability for quality estimation tasks.

• When designing prompts, we recommend the EAPrompt variant featuring a 2-step separated prompting approach and itemized error demonstrations.

• Further analysis confirms that EAPrompt adeptly distinguishes major errors from minor ones, closely aligning its error distribution with MQM.

• Optimizing the inference costs of EAPrompt can be achieved by leveraging Regular Expressions instead of counting queries.

This study provides an initial exploration of utilizing error analysis to prompt LLMs as evaluators. EAPrompt can also be extended to benefit other evaluation scenarios within language generation, including summarization and data-to-text tasks.

2 Prompt LLMs with Error Analysis

2.1 Translation Evaluation Metric

Translation evaluation metrics are used to assess the performance of machine translation systems on specific test sets (Freitag et al., 2022; Mathur et al., 2020b). These metrics typically take inputs from three sources: the sentence in the source language ("Source"), the reference translation provided by human translators ("Reference"), and the hypothesis being evaluated ("Translation"). In scenarios where reference signals are not provided, such "reference-less" metrics can also be utilized for quality estimation purposes (Zerva et al., 2022; Specia et al., 2010; Qiu et al., 2022). The output of the metric is a score or rank indicating the translation quality of each hypothesis.

To verify the reliability of MT metrics, the Multidimensional Quality Metric (MQM) has recently been adopted in WMT as a high-quality human evaluation strategy (Freitag et al., 2021). It asks human experts to annotate the errors in the hypothesis and categorize them into "Major" and "Minor", indicating their severity. A detailed example of MQM annotation is presented in Appendix A.
2.2 Prompt LLMs as Evaluation Metrics

When prompting LLMs as evaluation metrics, it is crucial to design appropriate instructions that describe the evaluation task. In this paper, we mainly adopt two prompting strategies: "GEMBA Prompting" and "Error Analysis Prompting".

GEMBA (Kocmi and Federmann, 2023b) is a zero-shot prompting approach that directly asks LLMs to generate a score reflecting the quality of the translation, and it shows state-of-the-art performance on GPT models when compared to other model-based metrics. However, the authors also observe that performance at the segment level is relatively poor. This highlights the importance of combining Chain-of-Thought with the Error Analysis strategy to prompt LLMs in a manner that more closely resembles human evaluation.

2.3 Error Analysis Prompting

Motivated by the MQM framework in human evaluation, the idea of the Error Analysis (EA) paradigm, as introduced by Lu et al. (2023), is to enhance the automatic scoring process by explicitly incorporating error identification, thus providing a more human-like evaluation.

The Chain-of-Thought (CoT) prompting strategy was first proposed by Wei et al. (2022). Instead of directly generating the answer, CoT prompts LLMs to think step-by-step. This approach has shown significant performance improvements on reasoning tasks such as GSM8K (Cobbe et al., 2021). CoT is an emergent ability of LLMs and has been incorporated both in instruction fine-tuning of LLMs (Chung et al., 2022) and in benchmarks designed to evaluate LLM capabilities (Suzgun et al., 2022).

In this work, we combine the CoT and EA paradigms, introducing a novel prompting strategy called Error Analysis Prompting (EAPrompt). As shown in Figure 1, EAPrompt divides the scoring process into two stages: first, the LLM is instructed to identify major and minor errors in the translation ("Instruction: Identify Errors"); subsequently, the number of these two types of errors is counted ("Instruction: Count Errors"). Distinguished from GEMBA prompting, EAPrompt emulates the evaluation process of MQM and produces more explainable and reliable automatic evaluations.

After exploring several prompt contexts in initial experiments, we made the following modifications to EAPrompt:

• we adopt the one-shot learning format (Brown et al., 2020) to enhance the LLMs' understanding of the task (§3.4); different in-context examples are used for different language pairs;

• we employ itemized error demonstration in the template response, enabling clearer identification and quantification of errors (§3.5);

• we partition the evaluation process into two stages to enhance the reliability of metric performance. Additionally, we present a simplified alternative that optimizes inference costs by counting errors automatically (§4.3).

2.4 Post-processing of LLM responses

After obtaining the number of major and minor errors, we compute the final score of the translation using the following equation:

score = −w_major · n_major − w_minor · n_minor,    (1)

where n_major and n_minor denote the number of major and minor errors respectively, while w_major and w_minor represent the severity weights assigned to major and minor errors. Since different LLMs may apply distinct criteria for major and minor errors, we follow Lu et al. (2023) and adopt a flexible scoring approach by fixing w_minor = 1 while treating w_major as a latent variable within EAPrompt. We present an analysis of the influence of this variable in §4.2; the detailed implementation in our experiments is described in Appendix B.
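The post-processing step can be sketched in a few lines of Python. This is only an illustration of Eq. (1): the value w_major = 5 and the "2, 5" counting reply are assumptions for the example, since w_major is treated as a latent variable (w_minor is fixed to 1).

```python
# A minimal sketch of Eq. (1), not the exact implementation used in the experiments.
def ea_score(n_major: int, n_minor: int, w_major: float = 5.0, w_minor: float = 1.0) -> float:
    # w_major = 5 is an illustrative choice; the paper treats it as a latent variable.
    return -w_major * n_major - w_minor * n_minor

# Example: the counting query returns the string "2, 5" (two major, five minor errors).
n_major, n_minor = (int(x) for x in "2, 5".split(","))
print(ea_score(n_major, n_minor))  # -15.0
```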
3 Experimental Results

3.1 Experiment Setup

Dataset  We utilize the test set from the WMT22 shared tasks (Freitag et al., 2022) in English-German (En-De), English-Russian (En-Ru), and Chinese-English (Zh-En), across four different domains: conversational, e-commerce, news, and social. Table 1 provides statistics about our test set.

Human Evaluation  We utilize MQM (Freitag et al., 2021) as human judgments; the annotations are produced by human experts and have been widely adopted in recent WMT metrics shared tasks (Freitag et al., 2022) and quality estimation tasks (Zerva et al., 2022).
Dataset   Language Pair   Segments   Systems   Domains
WMT22     En-De           2037       17        conversational, e-commerce, news, social
WMT22     En-Ru           2037       17        conversational, e-commerce, news, social
WMT22     Zh-En           1875       20        conversational, e-commerce, news, social

Table 1: Statistics of the test set. Source texts, reference texts, and translations are from the WMT22 metrics shared task.
Meta Evaluation  We follow the standard meta-evaluation approach to measure the performance of MT evaluation metrics (Freitag et al., 2023). At the system level, we use pairwise accuracy across all three language pairs, which calculates the proportion of all possible pairs of MT systems that are ranked the same by the metric and by the human scores (Kocmi et al., 2021). At the segment level, we adopt the group-by-item pairwise accuracy with tie calibration described by Deutsch et al. (2023): we use the acc*_eq variant to compare vectors of metric and gold scores for each segment, then average the results over segments. All meta-evaluations are calculated with MTME (https://github.com/google-research/mt-metrics-eval), a metric evaluation tool recommended by WMT (Freitag et al., 2022), to maintain comparability with other metrics.
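For intuition, the system-level criterion can be sketched as below. This is an illustration of pairwise accuracy rather than the MTME implementation that produces the reported numbers, and the score lists are hypothetical.

```python
from itertools import combinations

# Sketch of system-level pairwise accuracy (Kocmi et al., 2021): the fraction of
# MT-system pairs that the metric and the human (MQM) scores rank the same way.
# Ties are ignored here for simplicity; MTME handles them explicitly.
def pairwise_accuracy(metric_scores: list[float], human_scores: list[float]) -> float:
    pairs = list(combinations(range(len(metric_scores)), 2))
    agreements = sum(
        1 for i, j in pairs
        if (metric_scores[i] - metric_scores[j]) * (human_scores[i] - human_scores[j]) > 0
    )
    return agreements / len(pairs)

# Hypothetical per-system scores for four MT systems:
print(pairwise_accuracy([-3.0, -7.5, -1.0, -9.0], [-2.0, -8.0, -1.5, -6.0]))  # ~0.83 (5 of 6 pairs agree)
```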
3.2 Baselines and Large Language Models

Baseline Metrics  Given the reported unreliability of BLEU (Papineni et al., 2002), we compare our method with several model-based metrics for MT evaluation. BLEURT (Sellam et al., 2020) and COMET (Rei et al., 2020) are supervised neural metrics fine-tuned on human evaluation data; we employ BLEURT20 and COMET-22 as reference-based metrics and COMET-QE as a reference-less metric. UniTE (Wan et al., 2022) is also a learnt metric that evaluates MT outputs by combining three different evaluation scenarios; we additionally adopt UniTE-src for the reference-less comparison. MetricX-XXL (Juraska et al., 2023) is a large-scale multi-task metric that fine-tunes LLM checkpoints using diverse human feedback data. For reference-less metrics, we also reproduce MaTESe-QE (Perrella et al., 2022), a metric leveraging transformer-based multilingual encoders to identify error spans in translations.

Large Language Models  For proprietary models, we use the OpenAI API to experiment with GPT-3.5-Turbo (the 0613 OpenAI model version). We also experiment with a human-aligned Llama2-70B series model (Touvron et al., 2023b) fine-tuned with multilingual translation data, denoted as "Llama2-70b-Chat" in the experimental results. In addition, we use a high-quality sparse mixture-of-experts model, Mixtral-8x7b (Jiang et al., 2024), with the state-of-the-art checkpoint Mixtral-8x7b-Instruct, which has been optimised through supervised fine-tuning and direct preference optimisation to follow instructions.

3.3 Prompts for LLM evaluators

For GEMBA Prompting, we adopt the GEMBA-DA variant as suggested by Kocmi and Federmann (2023b), given its widespread usage and superior performance across the three language pairs (Kocmi and Federmann, 2023a).

For Error Analysis Prompting (EAPrompt), we compare several prompting strategies of EAPrompt in §3.5 and use the best-performing variant for all other experiments. The detailed prompt contexts are shown in Appendix C.

3.4 Experimental Results

We report the system- and segment-level performance of EAPrompt with LLMs in Table 2. We see that:

(i) At the system level, EAPrompt empowers GPT-3.5-Turbo to surpass all other metrics and achieves state-of-the-art performance. Consistent with the findings of Kocmi and Federmann (2023b), LLMs achieve state-of-the-art performance across all three language pairs at the system level, significantly outperforming traditional metrics ("Baselines") by a large margin. Remarkably, when prompting all LLMs with EAPrompt, the performance notably surpasses GEMBA at the system level, achieving the highest pairwise accuracy of 91.2% on GPT-3.5-Turbo and thus establishing a new SOTA.
Models                  Metrics / Prompts   Ref?   Sys-Level Acc.   Segment-Level Acc*
                                                   All (3 LPs)      En-De         En-Ru         Zh-En

Baselines               MetricX-XXL         ✓      85.0             60.4          60.6          54.4
                        BLEURT20            ✓      84.7             56.8          54.0          48.9
                        COMET22             ✓      83.9             59.4          57.7          53.6
                        UniTE               ✓      82.8             59.8          57.7          51.7
                        COMET-QE            ✗      78.1             55.5          53.4          48.3
                        UniTE-src           ✗      75.9             58.2          55.4          50.8
                        MaTESe-QE           ✗      74.8             57.2          49.9          49.4

Llama2-70b-Chat         GEMBA               ✓      74.1             53.7          48.8          45.4
                        EAPrompt            ✓      85.4 (+11.3)     55.2 (+1.5)   51.4 (+2.6)   50.2 (+4.8)
                        GEMBA               ✗      72.6             54.1          47.8          45.0
                        EAPrompt            ✗      85.8 (+13.2)     55.0 (+0.9)   51.6 (+3.8)   49.3 (+4.3)

Mixtral-8x7b-Instruct   GEMBA               ✓      69.7             54.8          48.3          46.7
                        EAPrompt            ✓      84.0 (+14.3)     53.8 (-1.0)   50.6 (+2.3)   48.2 (+1.5)
                        GEMBA               ✗      74.1             54.8          47.5          46.2
                        EAPrompt            ✗      82.5 (+8.4)      54.1 (-0.7)   49.9 (+2.4)   48.3 (+1.1)

GPT-3.5-Turbo           GEMBA               ✓      86.5             55.2          49.5          48.2
                        EAPrompt            ✓      91.2 (+4.7)      56.7 (+1.5)   53.3 (+3.8)   50.0 (+1.8)
                        GEMBA               ✗      86.9             54.7          50.0          47.6
                        EAPrompt            ✗      89.4 (+2.5)      55.7 (+1.0)   53.4 (+3.4)   48.8 (+1.2)

Table 2: The performance of metrics using pairwise accuracy (%) at the system level and pairwise accuracy with tie calibration (%) at the segment level. All results are compared with human-annotated MQM scores. The best results among the same model are highlighted in bold. The best results among all metrics are underlined.
(ii) At the segment level, EAPrompt outperforms GEMBA in 8 out of 9 tested scenarios. Despite previous findings by Kocmi and Federmann (2023b) regarding the weak correlation between LLMs as evaluators and human judgments, prompting with EAPrompt addresses this drawback of LLM evaluators, outperforming GEMBA on nearly all tested LLMs and language pairs by a significant margin. The best segment-level results are achieved by GPT-3.5-Turbo for En-De (56.7) and En-Ru (53.4), and by Llama2-70b-Chat for Zh-En (50.2). This validates the effectiveness of EAPrompt.

The only exception is observed for Mixtral-8x7b-Instruct on En-De, where the segment-level accuracy is lower than GEMBA by 1.0. This discrepancy might be attributed to a limited capability of identifying translation errors in the En-De language pair. Another notable finding is that prompting LLMs, with either GEMBA or EAPrompt, fails to surpass the current best metrics ("Baselines") at the segment level. This could be because those baseline metrics have been fine-tuned on extensive translation and human evaluation datasets, while the LLMs employed in our experiments are versatile models guided by few-shot prompts.

(iii) EAPrompt enhances the performance of LLMs as translation evaluators in reference-less scenarios. Our main findings remain consistent across both reference-based and reference-less settings (indicated by "✓" and "✗" in the Ref? column, respectively): EAPrompt continues to outperform GEMBA across all three tested LLMs at the system level, and in 8 out of 9 scenarios at the segment level. The improvement is slightly smaller than in scenarios with reference signals. These results underscore the impressive cross-lingual capabilities of LLMs and their suitability for quality estimation under EAPrompt, even in the absence of reference translations, which poses a significant challenge for MT evaluation.

3.5 Ablation Study of Prompt Variants

Given the crucial significance of prompt design, we investigate several versions of in-context prompt contexts and present an analysis in Table 3. The prompt contexts used in our experiment are detailed in Appendix C. Due to budget constraints, we utilize two LLMs, Mixtral-8x7b-Instruct and Llama2-70b-Chat, as the test bed for this ablation study. Our findings indicate that:

(i) Itemized error demonstration is superior to detailed illustration. We assume that, when identifying translation errors, providing detailed descriptions may impede the LLM's capability to accurately identify errors and count their number. As illustrated in the "Demo of Errors" column, employing itemized error demonstrations instead of detailed paragraphs yields improved performance at both the system and segment levels for both tested LLMs.
Prompt      Demo of Errors   Type of Queries   Mixtral-8x7b-Instruct                    Llama2-70b-Chat
                                               All (3 LPs)  En-De  En-Ru  Zh-En         All (3 LPs)  En-De  En-Ru  Zh-En

GEMBA       -                -                 69.7         54.8   48.3   46.7          74.1         53.7   48.8   45.4
EAPrompt    Detailed         1-step            75.2         53.4   50.0   45.0          62.0         53.7   47.0   47.8
EAPrompt    Detailed         2-step            75.5         53.4   47.9   45.5          84.7         53.5   46.9   47.5
EAPrompt    Itemized         1-step            60.2         53.4   45.1   45.6          56.9         53.7   48.4   50.2
EAPrompt    Itemized         2-step            84.0         53.7   50.6   48.2          85.4         55.2   51.4   50.2

Table 3: Comparison of the system-level ("All (3 LPs)") and segment-level ("En-De", "En-Ru", "Zh-En") performance of LLMs with different prompt variants for EAPrompt. We compare itemized and detailed responses for demonstrating identified errors. We also compare instructions that are either separated into two queries ("2-step", one for identifying errors and another for scoring) or combined into a single query ("1-step"). The best results among all prompt variants are highlighted in bold.
In our initial study, we observed that generating excessively detailed responses could lead to incorrect error counting or misclassification of error severity. We therefore recommend employing clear and concise error descriptions in a format that is easily processed and comprehended by LLMs.

(ii) Separating the scoring process from error identification with two queries enhances the performance of LLMs as translation evaluators. Another consideration in prompt design is the division of the evaluation process into error identification and error counting. As depicted in the "Type of Queries" column, the performance of a single prompting step is considerably lower than that of the 2-step prompting approach. This may be because separating the scoring process allows LLMs to concentrate on a single task in each query, thereby facilitating more accurate judgments and reducing the likelihood of incorrectly counting the number of errors.

(iii) Among the prompting strategies, EAPrompt appears to be the most suitable for LLMs as translation evaluators. Compared with GEMBA prompting, the EAPrompt variant featuring a 2-step separated prompting approach and itemized responses achieves superior performance in enhancing LLMs' effectiveness as translation evaluators. Consequently, we recommend employing this particular variant when using LLMs as translation evaluators.

4 Analysis

4.1 EAPrompt aligns with human judgment through a similar distribution of major and minor errors across most LLMs

To investigate whether LLMs align with the gold human judgment (MQM) through similar distributions of major and minor errors, we present the error distribution across various test scenarios in Figure 2. We can see that, for major errors, all tested LLMs exhibit distributions that closely resemble MQM. Regarding minor errors, Mixtral-8x7b-Instruct appears to produce a slightly higher frequency of such errors compared to the other LLMs, while the distribution of the other LLMs remains consistent with MQM. This observation further validates the efficacy of EAPrompt.

This finding provides valuable insights into enhancing the reliability of LLMs as translation evaluators. It suggests a potential focus on guiding LLMs to more accurately identify minor errors, for example by clarifying the specific categories and severity of minor errors.

4.2 EAPrompt empowers LLMs to distinguish major errors from minor ones

A potential concern with EAPrompt is whether this technique can prompt LLMs to distinguish major errors from minor ones. To address this concern, we adjust the weight assigned to major errors (w_major) in the score computation process outlined in §2.4. We visualize the impact of this adjustment on both the system- and segment-level performance in Figure 3. If the metric effectively distinguishes major errors from minor ones, we anticipate a noticeable performance decrease as the weight of major errors w_major approaches that of minor errors (w_minor = 1 in this study).

Our findings reveal that, for all three LLMs tested, setting w_major < 3 results in a substantial performance decline, indicating that error analysis prompting enables all tested LLMs to discriminate major errors from minor ones.

Another noteworthy observation from this analysis is that when w_major ≥ 5, both the system-level and segment-level accuracies exhibit minimal fluctuation, suggesting that the performance of EAPrompt remains nearly unaffected by this latent variable.
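This sensitivity analysis amounts to a simple sweep over w_major, sketched below from Eq. (1); error_counts and evaluate_against_mqm are hypothetical stand-ins for the per-translation error counts and the accuracy computation against MQM, not the paper's actual implementation.

```python
# Sketch of the w_major sweep in Section 4.2. `error_counts` holds (n_major, n_minor)
# per translation; `evaluate_against_mqm` is a hypothetical stand-in for the
# system- or segment-level accuracy computation against MQM (e.g. via MTME).
def sweep_w_major(error_counts, evaluate_against_mqm, weights=(1, 2, 3, 5, 10, 20)):
    results = {}
    for w_major in weights:
        scores = [-(w_major * n_maj + 1.0 * n_min) for n_maj, n_min in error_counts]
        results[w_major] = evaluate_against_mqm(scores)
    return results
```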
Figure 2: Distribution of identified error counts (y-axis: No. of Translations) across various LLMs (Mixtral-8x7b-Instruct, Llama2-70b-Chat, GPT-3.5-Turbo) and human evaluation (Gold MQM), for the language pairs En-De, En-Ru, and Zh-En, respectively.
Table 4: Performance comparison of EAPrompt between utilizing the Regular Expression Matching strategy
("✓" in Repr?) and the counting query strategy ("✗" in Repr?) across various LLMs.
4.3 EAPrompt optimizes inference costs by utilizing regular expressions instead of counting queries

Since EAPrompt adopts a two-step prompting strategy, one related question is: can we simplify the query process to reduce inference costs? One potential approach involves substituting the scoring query step with an algorithm that identifies major and minor errors using regular expressions (Repr) to detect bullet points or initial numbers. A detailed description of the Repr matching strategy is provided in Appendix D. The analysis, as depicted in Table 4, indicates that employing the Repr matching strategy, as opposed to the original query for counting errors (indicated by "✗" in Repr?), yields minimal performance variation at both the system and segment levels. Thus, if inference costs are a concern for this metric, substituting the second query step of EAPrompt with regular expressions could be a viable option. Note that for different LLMs, a tailored regular expression pattern may be necessary to encompass various response structures.

4.4 Case Study

We discuss potential issues encountered by LLMs and their corresponding solutions in Appendix E, including invalid responses, input order bias, etc. We aim to provide insights that should be considered when utilizing LLMs as translation evaluators.

5 Related Work

Translation Evaluation Metrics  MT evaluation metrics are of crucial importance to the development of MT systems (Freitag et al., 2022). Studies have shown that traditional surface-based metrics such as BLEU (Papineni et al., 2002) are
[Figure 3: system-level (SYS All (3 LPs)) and segment-level (SEG En-De, SEG En-Ru, SEG Zh-En) accuracy of EAPrompt as w_major varies, with panels per LLM (e.g. Mixtral-8x7b-Instruct); axes: SYS-level Acc. and SEG-level Acc*.]

... application of LLMs is harnessing them as evaluators for assessing the performance of Chatbots (Zheng et al., 2023). Recent studies also show LLMs' efficacy in evaluating NLG tasks like summarization ...
Changtong Zan, Keqin Peng, Liang Ding, Baopu Qiu, et al. 2022. Vega-MT: The JD explore academy machine translation system for WMT22. In WMT.

Chrysoula Zerva, Frédéric Blain, Ricardo Rei, et al. 2022. Findings of the WMT 2022 shared task on quality estimation. In WMT.

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint.

Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. 2023. Can ChatGPT understand too? A comparative study on ChatGPT and fine-tuned BERT. arXiv preprint.

D Counting Errors using Regular Expressions Matching

In Figure 5, we present an overview of our error-matching strategy utilized in §4.3 to automatically identify the number of major and minor errors. The procedure can be listed as follows:

1. Locate "major error" and "minor error" within the response, then segment the response accordingly.

2. Utilize Regular Expression matching to identify the initial numbers of major and minor errors. For implementation, we include three different initial number formats: "1.", "1)" and "(1)" (using "1" as an example);

3. Record the number of major and minor errors.
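The three steps above can be approximated with the short sketch below. The keyword splitting and the regular expression follow the listed procedure, but the exact patterns used in our experiments may need to be adapted per LLM, and this is an illustration rather than the released implementation.

```python
import re

# Sketch of the Appendix D matching strategy: split the response at the
# "Major errors" / "Minor errors" keywords, then count itemized entries that
# begin with a "1.", "1)" or "(1)"-style initial number.
ITEM_PATTERN = re.compile(r"^\s*(?:\(\d+\)|\d+[.)])", re.MULTILINE)

def count_errors(response: str) -> tuple[int, int]:
    lower = response.lower()
    major_start = lower.find("major error")
    minor_start = lower.find("minor error")
    if major_start == -1 or minor_start == -1:
        return 0, 0  # keywords missing; fall back to the counting query instead
    major_part = response[major_start:minor_start]
    minor_part = response[minor_start:]
    return len(ITEM_PATTERN.findall(major_part)), len(ITEM_PATTERN.findall(minor_part))

example = """Major errors:
(1) "BEIJING" - Omission
(2) "subjects" - Mistranslation
Minor errors:
1. "festival" - Mistranslation
2. "supervision" - Mistranslation"""
print(count_errors(example))  # (2, 2)
```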
E Case Study

In Figure 6, we list several typical issues that one should be aware of when using LLMs such as ChatGPT as translation evaluators.

E.1 Potential instability in the responses without temperature control

Issue: When evaluating translations using LLMs, the generated responses may vary significantly. As shown in Case 1, we regenerate several responses with the same input and obtain 3 different scores (98, 95, 100) for the translation.

Solution: We control the temperature parameter to mitigate the variability in LLM judgments. Accordingly, for all experiments detailed in this paper, we set the temperature to 0 for GPT-3.5-Turbo. For the other two models, Llama2-70b-Chat and Mixtral-8x7b-Instruct, we opted for a temperature setting of 0.05, since the temperature parameter for these two models must be above zero.

... offering judgments based on its inherent capabilities.

Solution: We follow the method described in Kocmi and Federmann (2023b) for handling invalid answers: we introduce randomness by iteratively increasing the temperature, and take the first response that falls within the expected score range.
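A minimal sketch of this invalid-answer handling is given below, assuming hypothetical query_llm and parse_score helpers (parse_score returns None for answers outside the expected range); the 0.2 temperature increment and the retry limit are illustrative assumptions, not the values used in the paper.

```python
# Sketch of the retry strategy described above: regenerate with gradually increasing
# temperature and keep the first response that parses to a valid score.
# `query_llm` and `parse_score` are hypothetical helpers, not part of any real API.
def query_with_retries(prompt: str, query_llm, parse_score, max_retries: int = 5):
    temperature = 0.0
    for _ in range(max_retries):
        response = query_llm(prompt, temperature=temperature)
        score = parse_score(response)
        if score is not None:                       # first response in the expected range wins
            return score
        temperature = min(temperature + 0.2, 1.0)   # inject more randomness before retrying
    return None                                     # give up after max_retries attempts
```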
Table 6: An example of MQM annotation, comprising the test sample information along with the human-annotated major and minor errors.
In-Context Example
Q: Source: 中新网北京9月27日电 (记者 杜燕)为加强节前市场监管执法,北京市市场监管
局在国庆节前夕检查各类经营主体2000余户。
Reference: Chinanews.com Report on September 27 in Beijing (Journalist Du Yan) The
Beijing Administration for Market Regulation inspected more than 2,000 operating
entities of different types before the National Day holiday to strengthen pre-holiday
market regulation and law enforcement.
Translation: BEIJING, Sept. 27 (Reporter Du Yan) In order to strengthen market
supervision and law enforcement before the festival, the Beijing Municipal Market
Supervision Bureau inspected more than 2,000 households of various business subjects
on the eve of the National Day.
Based on the given source and reference, identify the major and minor errors in this
translation. Note that Major errors refer to actual translation or grammatical errors, and
Minor errors refer to smaller imperfections, and purely subjective opinions about the
translation.
Itemized Error Demo

A: Major errors:
(1) "BEIJING" – Omission
(2) "subjects" – Mistranslation
Minor errors:
(1) "households of various" – Mistranslation
(2) "festival" – Mistranslation
(3) "supervision" – Mistranslation
(4) "Beijing Municipal Market Supervision Bureau" – Inappropriate for context
(5) "BEIJING" – Spelling

Detailed Error Demo

I think the mistranslation of "subjects" should be categorized into a major error, and the omission in "BEIJING" should also be considered as a major error. "households of various", "festival" and "supervision" are three mistranslation errors; they should be categorized into minor errors. The terminology "Beijing Municipal Market Supervision Bureau" is inappropriate for the context, and should also be categorized into a minor error. "BEIJING" also has a spelling error, which is considered as a minor error.

Test Question
Q: Source: 当日,以“讲好中国故事,传播好中国声音”为宗旨的外宣工作创新性、基础
性平台——“中国好故事”数据库正式上线,向世界展现真实、立体、全面的中国。
Reference: On that day, the externally publicized innovative and basic platform-“The
story of China”, for the purpose of telling the story of China well and spreading the voice
of China well”, was officially on line today, to show the world a true, three-dimensional
and comprehensive China.
Translation: On that day, the "China Good Story" database, an innovative and basic
platform for outreach work with the aim of "telling a good Chinese story and
disseminating a good Chinese voice", was officially launched to show the world a real,
three-dimensional and comprehensive China.
Separated queries

Based on the given source and reference, identify the major and minor errors in this translation. Note that Major errors refer to actual translation or grammatical errors, and Minor errors refer to smaller imperfections, and purely subjective opinions about the translation.

A:

Q: Based on the above error information, Output 2 numbers ONLY with the format: "x, x", indicating the number of major and minor errors. DO NOT ADD other information!

A:

Combined query

Based on the given source and reference, identify the major and minor errors in this translation. Note that Major errors refer to actual translation or grammatical errors, and Minor errors refer to smaller imperfections, and purely subjective opinions about the translation. Based on the above error information, Output 2 numbers ONLY with the format: "x, x", indicating the number of major and minor errors. DO NOT ADD other information!

A:
Figure 4: The prompt contexts employed in EAPrompt. We present itemized/detailed responses for error
demonstrations and separated/combined instructions for different types of queries.
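As an illustration of how the separated queries above can be issued in practice, the sketch below assumes an OpenAI-style chat-completions client; client, example_turns, the model name, and the strict "x, x" reply parsing are assumptions for the example rather than the exact implementation used in the paper.

```python
# Sketch of the 2-step separated prompting flow from Figure 4, assuming an
# OpenAI-style chat client. `client`, `example_turns` (the one-shot in-context
# example as chat messages), and the model name are placeholders.
def eaprompt_two_step(client, example_turns, test_question, model="gpt-3.5-turbo"):
    # Step 1: identify major and minor errors (one-shot example + test question).
    messages = list(example_turns) + [{"role": "user", "content": test_question}]
    error_analysis = client.chat.completions.create(
        model=model, messages=messages, temperature=0
    ).choices[0].message.content

    # Step 2: ask only for the two counts, keeping the step-1 analysis in context.
    messages += [
        {"role": "assistant", "content": error_analysis},
        {"role": "user", "content": 'Based on the above error information, Output 2 '
                                    'numbers ONLY with the format: "x, x", indicating the '
                                    'number of major and minor errors. DO NOT ADD other information!'},
    ]
    counts = client.chat.completions.create(
        model=model, messages=messages, temperature=0
    ).choices[0].message.content

    # Assumes the model follows the "x, x" format; otherwise a retry or regex
    # fallback (Appendix D) would be needed.
    n_major, n_minor = (int(x) for x in counts.split(","))
    return n_major, n_minor
```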
[Figure 5 shows the example itemized response annotated with three steps: find the keywords "Major error" / "Minor error" and split the response; find the initial numbers (1)(2) for major errors; find the initial numbers (1)(2)(3)(4)(5) for minor errors.]

Figure 5: The regular expression matching strategy utilized in §4.3 to automatically count the number of major and minor errors in the LLM response.
Case 2

Prompt: Based on given source and reference, please rank the following translation from best to worst according to their translation quality, and do not use existing metrics.
Source: 当日,以"讲好中国故事,传播好中国声音"为宗旨的外宣工作...
Reference: On that day, the externally publicized innovative ...
SYS1 translation: On the same day, the "China Good Stories" ....
SYS2 translation: ....
....
SYS8 translation: On the same day, the innovative and basic platform...

ChatGPT: Based on my analysis of the given translations, here is my ranking from best to worst based on translation quality: SYS1>SYS2>SYS4>SYS5>SYS3>SYS6>SYS7>SYS8

Prompt: Based on given source and reference, please rank the following translation from best to worst according to their translation quality, and do not use existing metrics.
Source: 当日,以"讲好中国故事,传播好中国声音"为宗旨的外宣工作...
Reference: On that day, the externally publicized innovative ...
SYS8 translation: On the same day, the innovative and basic platform...
SYS7 translation: ....
....
SYS1 translation: On the same day, the "China Good Stories" ....

ChatGPT: Based on my analysis of the given translations, here is my ranking from best to worst based on translation quality: SYS8>SYS7>SYS6>SYS5>SYS4>SYS3>SYS2>SYS1

Input order affects evaluation results.

Case 3

Prompt: Source: 兴安盟属于大兴安岭南麓集中连片贫困地区
Reference: Xing'an Prefecture is within the concentrated poverty-stricken areas...
Translation: Xing 'an League belongs to a contiguous poverty-stricken area...
Please act as a translation evaluation metric that scores a translation between 0 to 100 based on the source and reference.

ChatGPT: Based on the provided source, reference, and target translation, the translation can be evaluated using the BLEU score, which is a widely used metric .......

Uses existing metrics.

Figure 6: Case study of potential issues in LLMs. All three cases are from the GPT-3.5-Turbo model ("ChatGPT"). Top: the LLM exhibits variations in its responses upon multiple regenerations; Middle: different input order of samples may affect the judgment of the LLM; Bottom: the LLM sometimes relies on existing metrics during translation evaluation.