
Prompting Large Language Model for Machine Translation: A Case Study

Biao Zhang*, Barry Haddow, Alexandra Birch1

*Now at Google DeepMind; work done prior to joining Google. 1School of Informatics, University of Edinburgh. Correspondence to: Biao Zhang <biaojiaxing@google.com>, Barry Haddow <bhaddow@inf.ed.ac.uk>, Alexandra Birch <a.birch@ed.ac.uk>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

Research on prompting has shown it to have excellent performance with little or even no supervised training across many tasks. However, prompting for machine translation is still under-explored in the literature. We fill this gap by offering a systematic study on prompting strategies for translation, examining various factors for prompt template and demonstration example selection. We further explore the use of monolingual data and the feasibility of cross-lingual, cross-domain, and sentence-to-document transfer learning in prompting. Extensive experiments with GLM-130B (Zeng et al., 2022) as the testbed show that 1) the number and the quality of prompt examples matter, where using suboptimal examples degenerates translation; 2) several features of prompt examples, such as semantic similarity, show significant Spearman correlation with their prompting performance, yet none of the correlations are strong enough; 3) using pseudo parallel prompt examples constructed from monolingual data via zero-shot prompting could improve translation; and 4) improved performance is achievable by transferring knowledge from prompt examples selected in other settings. We finally provide an analysis of the model outputs and discuss several problems that prompting still suffers from.

1. Introduction

Large language models (LLMs) pretrained on massive unlabeled corpora have shown impressive emergent abilities under model scaling which enable prompting for downstream applications (Brown et al., 2020; Kaplan et al., 2020; Wei et al., 2022b; Zhang et al., 2022a; Chowdhery et al., 2022). Different from task-specific finetuning, prompting constructs task-specific prompts by rephrasing test examples with descriptive task instructions and executes the task by feeding prompts to LLMs directly. It can be further enhanced through in-context learning by providing a few labeled examples (or prompt examples) as a demonstration (Brown et al., 2020). As a new paradigm, prompting LLMs has achieved state-of-the-art performance over a range of natural language processing (NLP) tasks (Chung et al., 2022; Goyal et al., 2022; Wei et al., 2022c;a; Chowdhery et al., 2022).

In this paper, we focus on prompting LLMs for machine translation (MT). MT represents a complex task requiring transforming a source input into its semantically equivalent target output in a different language, which combines sequence understanding and generation. It offers a unique platform to assess the cross-lingual generation capability of LLMs, and the assessment may shed light on pretraining/finetuning algorithm design for achieving universal LLMs (Chowdhery et al., 2022). While a few studies have reported translation results (Brown et al., 2020; Reynolds & McDonell, 2021; Chowdhery et al., 2022), a systematic study on how prompting works for MT is still missing in the literature.

We aim at filling this gap by thoroughly examining different prompting setups using the recently released GLM-130B (Zeng et al., 2022), particularly concerning three aspects: the prompting strategy, the use of unlabeled/monolingual data, and the feasibility of transfer learning. Prompting has shown varying sensitivity to the choice of prompt templates and examples (Zhao et al., 2021). For MT, prior studies adopted different templates (Brown et al., 2020; Wei et al., 2022a; Chowdhery et al., 2022), and we reevaluate them to figure out the optimal one. We further design a set of features for prompt examples and explore which one(s) could explain the prompting performance, according to which we develop the example selection strategy.

Since leveraging monolingual data to improve MT has long been of interest, we would like to determine whether and how such data can be used in prompt example construction. We make a step in this direction by studying the effect of data augmentation using back-/forward-translation (Sennrich et al., 2016b; Zhang & Zong, 2016) via zero-shot prompting. In addition, neural MT and pretrained LLMs
have shown encouraging transfer abilities (Devlin et al., 2019; Arivazhagan et al., 2019; Zhang et al., 2020; Xue et al., 2021), but transfer learning for prompting has received little attention. Whether prompt examples are transferable across different settings, such as from one domain/language pair to another and from sentence-level examples to document-level translation, is yet to be addressed.

We address the above concerns with GLM-130B as the testbed and conduct extensive experiments on FLORES and WMT evaluation sets. We mainly study translation for three languages: English, German and Chinese. We also provide a quantitative and qualitative analysis to disclose problems when prompting for MT, which might offer insights for future study. Our main findings are listed below:

• Prompting performance varies greatly across templates, and language-specific templates mainly work when translating into languages LLMs are pretrained on. An English template in a simple form works best for MT.

• Several features of prompt examples, such as sequence length, language model score, and semantic similarity, correlate significantly with prompting performance, while the correlation strength is weak in general. Selecting examples based on these features can outperform the random strategy, but not consistently.

• Using monolingual examples for prompting hurts translation. By contrast, constructing pseudo parallel examples via back-/forward-translation is a good option. Back-translation performs better and is more robust.

• Prompting shows some degree of transferability. Using demonstrations from other settings can improve translation over the zero-shot counterpart, while the superiority of a demonstration in one setting can barely generalize to another.

• Prompting for MT still suffers from copying, mistranslation of entities, hallucination, inferior direct non-English translation, and the prompt trap, where translating the prompt itself via prompting becomes non-trivial.

2. Setup

Prompting for MT  Given a pretrained and fixed LLM L, MT prompting first converts each test input X to a prompt according to a template T and then generates the translation Y by feeding the prompt to L. In this study, we consider zero-shot and few-shot prompting for translation.

Zero-shot prompting only has access to the test input X, while few-shot prompting assumes that a few extra labeled examples (or prompt/demonstration examples) D_P = {(X'_i, Y'_i)}_{i=1}^K are available and can be used as a demonstration. Particularly, we adopt the following template for zero-shot prompting based on the results in Section 3:

    [src]: X [tgt]:    (1)

where [src] and [tgt] denote test language(s), i.e., the source and target language name of the test language pair, respectively. For few-shot prompting, we concatenate the given prompt examples:

    [psrc]: X'_1 [ptgt]: Y'_1 ... [psrc]: X'_K [ptgt]: Y'_K [src]: X [tgt]:    (2)

where [psrc] and [ptgt] denote prompt language(s), i.e., the source and target language name of the prompt example, respectively. By default, prompt examples and test data are in the same language pair. However, when considering cross-lingual transfer for prompting, prompt examples might be in a different language pair.

We also explore the template language, which denotes the language in which the template is expressed. For example, the Chinese template "中文:X 英文:" represents the Chinese counterpart of the English template "Chinese: X English: ".
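To make formats (1) and (2) concrete, the sketch below builds such a prompt string in Python. It is only a minimal illustration of the template described above, not the authors' code; the exact whitespace and whether line breaks are used are template choices examined in Section 3, and the helper name is hypothetical.

```python
def build_prompt(src_lang, tgt_lang, test_input, examples=(),
                 psrc_lang=None, ptgt_lang=None):
    """Render the zero-shot (1) or few-shot (2) MT prompt.

    `examples` is a sequence of (source, target) demonstration pairs; when it
    is empty, the zero-shot form "[src]: X [tgt]:" is produced.  Prompt
    examples may use a different language pair ([psrc]/[ptgt]) than the test
    pair, which is how the cross-lingual transfer setting is expressed.
    """
    psrc = psrc_lang or src_lang
    ptgt = ptgt_lang or tgt_lang
    parts = [f"{psrc}: {x} {ptgt}: {y}" for x, y in examples]
    parts.append(f"{src_lang}: {test_input} {tgt_lang}:")
    return " ".join(parts)


# Zero-shot prompt: "German: Guten Morgen. English:"
print(build_prompt("German", "English", "Guten Morgen."))

# 1-shot prompt: one demonstration pair is prepended before the test input.
demo = [("Ich habe Hunger.", "I am hungry.")]
print(build_prompt("German", "English", "Guten Morgen.", demo))
```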
Setting  We experiment with GLM-130B, an LLM with 130B parameters pretrained on Chinese and English "monolingual" corpora, which was reported to outperform GPT-3 and OPT-175B on several NLP tasks (Zeng et al., 2022). Note that GLM-130B is a raw LLM without any further finetuning. We use its INT4-quantized version, which is more affordable and suffers little performance degradation. We adopt beam search for MT with a beam size of 2, and perform experiments with 4 RTX 3090 or A100-40G GPUs.

We work on three languages: English (En), German (De), and Chinese (Zh). We perform major analysis on FLORES (Wiki domain, En-De-Zh, NLLB Team et al., 2022) and WMT21 (News domain, En-De, En-Zh, Akhbardeh et al., 2021), and also report results on Multi-Domain (IT, Law and Medical domains, De-En, Aharoni & Goldberg, 2020) to examine domain robustness and transfer ability, and on PDC (News domain, Zh→En, Sun et al., 2022) for document-level translation. To understand the relation between prompt examples and their prompting performance, we construct an Ablation set for Wiki, WMT and Multi-Domain (IT and Medical) based on the dev sets of FLORES, WMT21 and Multi-Domain, separately, where we randomly sample 100 instances as the ablation test set and use the rest as the default example selection pool. To distinguish, we will refer to the official dev and test sets as Full sets. Detailed statistics are listed in Table 1.
(a) Ablation Sets

| Dataset | Language(s) | Test Set | Selection Pool (Default) | Source (#samples) |
|---|---|---|---|---|
| Wiki | English | 100 | 897 | FLORES eng_Latn.dev (997) |
| Wiki | German | 100 | 897 | FLORES deu_Latn.dev (997) |
| Wiki | Chinese | 100 | 897 | FLORES zho_Hans.dev (997) |
| WMT | English-German | 100 | 2900 | newstest2013 (3000) |
| IT | German-English | 100 | 1900 | Multi-Domain Dev Set (2000) |
| Medical | German-English | 100 | 1900 | Multi-Domain Dev Set (2000) |

(b) Full Sets

| Dataset | Languages | Source | Test Set | High-quality Pool (Default) | Low-quality Pool |
|---|---|---|---|---|---|
| Wiki | English | FLORES | eng_Latn.devtest (1012) | eng_Latn.dev (997) | En-Zh⋆ (0.79M) |
| Wiki | German | FLORES | deu_Latn.devtest (1012) | deu_Latn.dev (997) | De-En⋆ (1.57M) |
| Wiki | Chinese | FLORES | zho_Hans.devtest (1012) | zho_Hans.dev (997) | De-Zh⋆ (0.13M) |
| WMT | English-German | WMT | newstest2021 (1002/1000) | newstest2020 (1418) | - |
| WMT | English-Chinese | WMT | newstest2021 (1002/1948) | newstest2020 (1418) | - |
| IT | German-English | Multi-Domain | Test Set (2000) | - | Train Set (0.22M) |
| Law | German-English | Multi-Domain | Test Set (2000) | - | Train Set (0.47M) |
| Medical | German-English | Multi-Domain | Test Set (2000) | - | Train Set (0.25M) |
| PDC | Chinese-English | News | Test Set (4858/148 Docs) | Dev Set (2881) | - |

Table 1: Statistics of Ablation sets and Full sets. Numbers in brackets denote the number of instances. ⋆: data from WikiMatrix.v1 (Schwenk et al., 2021).

We evaluate translation performance using both a surface-based metric, detokenized case-sensitive BLEU↑ from SacreBLEU (Post, 2018) (with the option -tok zh for Chinese), and a model-based metric, COMET↑ from unbabel-comet with the model wmt20-comet-da (Rei et al., 2020).
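As a small illustration, the snippet below shows how these two metrics can be computed with the sacrebleu and unbabel-comet Python packages. The example sentences are invented, and the exact model identifier and the fields returned by `predict` vary slightly across comet versions, so treat it as a sketch rather than a definitive recipe.

```python
from sacrebleu.metrics import BLEU
from comet import download_model, load_from_checkpoint

srcs = ["Die Katze saß auf der Matte."]   # source sentences
hyps = ["The cat sat on the mat."]        # system outputs
refs = ["The cat sat on the mat."]        # references

# Detokenised, case-sensitive BLEU; use BLEU(tokenize="zh") when the target is Chinese.
bleu = BLEU()
print(bleu.corpus_score(hyps, [refs]).score)

# COMET with the wmt20-comet-da checkpoint.
model = load_from_checkpoint(download_model("wmt20-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
output = model.predict(data, batch_size=8, gpus=0)
# Depending on the installed comet version, `output` exposes segment-level and
# system-level scores (e.g. output.scores / output.system_score).
```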
3. Prompting Strategy for MT

To perform MT, prompting needs to cast the translation problem into a language modeling problem via the prompt. Thus, the format of the prompt, including its wording, directly affects how the LLM understands the task and behaves. For MT, we are interested in the following research questions:

• Which template should we use for MT prompting? And in what language should the template be written?

• Does the demonstration matter for MT prompting? How can we select optimal prompt examples?

We address them through extensive experiments on the Wiki Ablation sets.

Zero-shot prompting performance varies greatly across templates.  We start with zero-shot prompting and explore the effect of different templates. Depending on how to describe MT, and partially inspired by prior studies (Brown et al., 2020; Chowdhery et al., 2022; Wei et al., 2022a), we compare 6 templates and evaluate them on the Wiki Ablation sets covering 6 language pairs (En↔De, En↔Zh, De↔Zh). Table 2 shows the results (we list detailed results in Table 10, Appendix). The template affects zero-shot quality substantially, and the simple template A in English, specifying just the source and target language names, achieves the best overall results. In follow-up experiments, we thus focus on template A.

Language-specific templates deliver mixed results.  Table 2 also shows the prompting results of German and Chinese templates, which often largely underperform their English counterparts. Since German is not a major pretraining language in GLM-130B, a German template degenerates the translation substantially. By contrast, a Chinese template yields improved quality when translating into Chinese (see Table 10). Still, an English template works best on average. The preference of GLM-130B for English templates also shows that the level of language understanding and cross-lingual ability in GLM-130B varies across languages, even though it is pretrained on the same amount of monolingual Chinese and English tokens. This might be caused by the fact that more cross-lingual code-switched data is mixed into the English pretraining data (note that English is used more globally than Chinese), but it might also suggest that improving the language understanding of LLMs requires more advanced training algorithms beyond scaling training data.
| ID | Template (in English) | English w/o | English w/ | German w/o | German w/ | Chinese w/o | Chinese w/ |
|---|---|---|---|---|---|---|---|
| A | [src]: [input] ⋄ [tgt]: | 38.78 | 31.17 | -26.15 | -16.48 | 14.82 | -1.08 |
| B | [input] ⋄ [tgt]: | -88.62 | -85.35 | -135.97 | -99.65 | -66.55 | -85.84 |
| C | [input] ⋄ Translate to [tgt]: | -87.63 | -68.75 | -106.30 | -73.23 | -63.38 | -70.91 |
| D | [input] ⋄ Translate from [src] to [tgt]: | -113.80 | -89.16 | -153.80 | -130.65 | -76.79 | -67.71 |
| E | [src]: [input] ⋄ Translate to [tgt]: | 20.81 | 16.69 | -24.33 | -5.68 | -8.61 | -30.38 |
| F | [src]: [input] ⋄ Translate from [src] to [tgt]: | -27.14 | -6.88 | -34.36 | -9.22 | -32.22 | -44.95 |

Table 2: COMET scores averaged over 6 language pairs for zero-shot prompting with different templates and different template languages on Wiki Ablation sets. w/ and w/o denote whether line breaks are added to the template or not; ⋄ indicates the position of the line break. [src] and [tgt] denote the source and target test language names, respectively, and [input] denotes the test input; all of them are placeholders. English, German and Chinese indicate template languages. Best results are shown in bold.
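As a concrete reading of this table, the sketch below writes the six templates as Python format strings; the `sep` slot marks the ⋄ position, so the "w/" variants simply insert a newline there. This is only an illustrative rendering of the table, not released code, and the exact spacing is an assumption.

```python
# The six templates of Table 2; "{sep}" marks the ⋄ position where the
# "w/" variants insert a line break.
TEMPLATES = {
    "A": "{src}: {input} {sep}{tgt}:",
    "B": "{input} {sep}{tgt}:",
    "C": "{input} {sep}Translate to {tgt}:",
    "D": "{input} {sep}Translate from {src} to {tgt}:",
    "E": "{src}: {input} {sep}Translate to {tgt}:",
    "F": "{src}: {input} {sep}Translate from {src} to {tgt}:",
}

def render(template_id, src, tgt, text, line_break=False):
    """Instantiate one of the templates for a given test input."""
    sep = "\n" if line_break else ""
    return TEMPLATES[template_id].format(src=src, tgt=tgt, input=text, sep=sep)

# Template A without a line break, the configuration used in later sections:
print(render("A", "German", "English", "Guten Morgen."))
```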

Figure 1 (plots omitted): COMET scores for few-shot prompting as a function of the number of prompt examples (K = 1, 5, 10, 20) on Wiki Ablation sets (De→En, En→De, En→Zh, Zh→En). For each setup, we randomly sample 100 times from the example pool and show the performance distribution via box plots. The dashed red line denotes the zero-shot baseline; the blue curve and shaded area denote the mean and standard deviation.

Figure 2 (plot omitted): Inference time per token in seconds for zero-/few-shot prompting on Wiki En↔De Ablation sets. Numbers are averaged over 3 runs with 3 distinct demonstrations on 4 A100-40G GPUs.

Using more prompt examples for demonstration improves translation significantly on average.  We next study few-shot prompting following template A but in format (2), with K varying from 1 to 20. We evaluate multiple demonstrations for each K via random sampling to reduce data biases. Figure 1 shows that the more examples used, the better the average performance (more results are shown in Figure 5, Appendix), albeit at the cost of using more GPU memory and increasing the inference time per token, as shown in Figure 2.

The performance of a demonstration is not stable.  However, we also see high performance variance under the same K. It is possible for a demonstration with 5 examples to outperform its 10- or 20-example counterpart. Figure 1 also shows that 1-shot prompting underperforms zero-shot prompting in many cases, even on average. This echoes previous findings on other NLP tasks (Zhao et al., 2021; Liu et al., 2022) and also highlights the significance of developing effective example selection strategies.

Note that few-shot prompting greatly improves translation into Chinese. The reason, based on our manual analysis, is that the zero-shot baseline tends to translate into traditional Chinese with garbled characters, whereas prompt examples help (the reference text is always simplified Chinese).

Several features correlate with prompting performance significantly yet weakly.  We thus turn to explore example selection for prompting. Our idea is to extract a set of diverse features from demonstrations and examine whether any of them are informative enough to be used as an indicator for the selection. In this study, we simplify our analysis by focusing on 1-shot prompting, which ignores the ordering of prompt examples (we return to few-shot prompting later). Particularly, we extract and analyze 7 features of a demonstration:
S(T)Length: the number of source (target) tokens;

LMScore: GLM-130B-based, length-normalized log-likelihood of the demonstration;

MTScore: translation quality of the prompt example from COMET QE wmt20-comet-qe-da (Rei et al., 2020);

SemScore: semantic score based on the cosine similarity of the demonstration's source and target sentence embeddings from LASER2 (Heffernan et al., 2022);

CaseSemScore-Src: similarity to the input that averages SemScores between the test input and the demonstration's source;

CaseSemScore-Tgt: similar to CaseSemScore-Src but compares to the demonstration's target.

We sample multiple demonstrations randomly and inspect the Spearman's correlation between feature values and prompting performance. We consider a high-quality and a low-quality pool for sampling.

| Feature | BLEU (HQ) | BLEU (+LQ) | COMET (HQ) | COMET (+LQ) |
|---|---|---|---|---|
| SLength | 0.21 | 0.31 | 0.14 | 0.26 |
| TLength | 0.23 | 0.32 | 0.17 | 0.29 |
| LMScore | 0.20 | 0.33 | 0.14 | 0.31 |
| MTScore | 0.04 | 0.14 | 0.11 | 0.19 |
| SemScore | 0.19 | 0.30 | 0.16 | 0.30 |
| CaseSemScore-Src | 0.14 | 0.29 | 0.11 | 0.28 |
| CaseSemScore-Tgt | 0.14 | 0.30 | 0.14 | 0.31 |

Table 3: Spearman's ρ between demonstration features and their prompting performance for 1-shot prompting on Wiki Ablation sets. We randomly sample 600 demonstrations from each pool to calculate the correlation. HQ: examples are from the default high-quality pool; +LQ: examples are from the low-quality pool based on WikiMatrix.v1.

| Method | Wiki BLEU | Wiki COMET | WMT BLEU | WMT COMET |
|---|---|---|---|---|
| Supervised SOTA | 32.50 | 54.14 | 34.18 | 53.65 |
| Zero-Shot | 24.08 | 33.92 | 20.38 | 17.97 |
| 1-Shot Translation (high-quality pool) | | | | |
| Random | 26.31 | 48.29 | 21.27 | 30.70 |
| SemScore | 26.73 | 49.34 | 21.82 | 31.28 |
| LMScore | 26.48 | 47.92 | 21.59 | 30.81 |
| TLength | 26.54 | 48.73 | 21.29 | 30.68 |
| 5-Shot Translation (high-quality pool) | | | | |
| Random | 27.46 | 51.11 | 21.82 | 33.87 |
| SemScore | 27.36 | 51.66 | 22.37 | 34.30 |
| LMScore | 27.17 | 50.65 | 22.04 | 35.19 |
| TLength | 27.08 | 50.50 | 21.75 | 34.29 |
| 1-Shot Translation (low-quality pool) | | | | |
| Random | 24.75 | 38.86 | 22.06 | 30.70 |
| Ours | 24.94 | 39.88 | 22.23 | 30.87 |

Table 4: BLEU and COMET scores for zero-shot and few-shot prompting on Wiki and WMT Full sets with different selection strategies. Ours: the proposed combined strategy; Random: random sampling; SemScore, LMScore and TLength denote selecting top-ranked examples based on the corresponding feature values. We select 3 demonstrations for each translation direction and report average performance; the final score is further averaged over different language pairs. Underlined results denote the best in each section, while bold results are the overall best.

Figure 3 (plots omitted): Visualization of COMET against LMScore for 1-shot prompting on Wiki Ablation sets (De→En, En→De, En→Zh, Zh→En; high- vs. low-quality pools). While the correlations are significant, the data points are scattered like clouds.

Table 3 summarizes the results and Figure 3 illustrates the relation between COMET and LMScore (more results are given in Table 11 and Figures 6 and 7, Appendix). With the high-quality pool, different demonstrations yield similar translation results (see the blue points) despite their feature values varying greatly. Several features show insignificant and inconsistent correlation, particularly for De→En and Zh→En. This suggests that developing a selection policy for a high-quality example pool is non-trivial.

After mixing in demonstrations from the low-quality pool, the significance is strengthened. LMScore and CaseSemScore-Tgt show the highest correlation on average, followed by TLength and SemScore. MTScore behaves much worse, which might be caused by its instability in sentence-level evaluation (Moghe et al., 2022). However, we did not see a significant difference in terms of Spearman's ρ between input-relevant and input-agnostic features (Agrawal et al., 2022), nor among surface-based, LLM-based or semantic-based features. Surprisingly, the simple feature, S/TLength, yields reasonably high correlation. We argue that long examples could offer the LLM more signals about the task's input and output space. This finding suggests that researchers should select long unlabeled sentences for annotation to improve prompting. Yet, most Spearman's ρ values are much smaller than 0.5, indicating a weak/fragile relation.
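The sketch below shows one way such features could be computed for candidate demonstrations and then correlated with observed prompting quality. It uses a small Hugging Face causal LM and a generic embedding function as stand-ins for GLM-130B and LASER2, so it is only an illustration of the feature definitions above, not the authors' implementation.

```python
import numpy as np
import torch
from scipy.stats import spearmanr
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in LM: the paper scores demonstrations with GLM-130B itself; any
# causal LM exposes the same interface, so GPT-2 is used here for illustration.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def lm_score(text):
    """Length-normalised log-likelihood (negative per-token cross-entropy)."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm(ids, labels=ids)
    return -out.loss.item()

def sem_score(src_emb, tgt_emb):
    """Cosine similarity between source and target sentence embeddings
    (LASER2 in the paper; any sentence encoder could stand in)."""
    a, b = np.asarray(src_emb), np.asarray(tgt_emb)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def feature_correlation(feature_values, comet_scores):
    """Spearman's rho between a demonstration feature and the COMET score
    obtained when prompting with that demonstration (as in Table 3)."""
    rho, pvalue = spearmanr(feature_values, comet_scores)
    return rho, pvalue
```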


Figure 4 (plots omitted): COMET scores for few-shot prompting with monolingual data on Wiki Ablation sets (De→En, En→De, En→Zh, Zh→En). Top: Random Example (random sentence pairs) and Source/Target Example Only (only source or target data used for prompting), against the zero-shot baseline. Bottom: Parallel Example and Source/Target Example Aug (pseudo-parallel data constructed via zero-shot prompting). For each setup, we randomly sample 50 demonstrations and report average performance.

In general, selecting prompt examples with high translation quality, high semantic similarity, high LLM likelihood, long sequence length, and high similarity to the test inputs are all preferable strategies. Unfortunately, none of them can guarantee optimal translation performance.

Using prompt examples selected based on the proposed features yields improved performance.  We next verify the above findings on the Full sets. We explore selection strategies based on SemScore, LMScore and TLength (i.e., using top-ranked examples), as they show high average correlation. We did not analyze CaseSemScore-Tgt as it is more complicated and does not make a significant difference. Note that we excluded examples that are too long (more than 100 tokens, to reduce the inference cost) or too short (fewer than 10 tokens, to ensure informativeness) during the selection. We also consider 5-shot prompting, where we concatenate the top-ranked 5 examples in ascending order (Liu et al., 2022).

Table 4 shows that, with a high-quality pool, adopting a feature-based strategy is likely to outperform the random baseline, and the SemScore-based strategy performs well across different settings (detailed results are available in Tables 13 and 14, Appendix). These strategies also generalize to 5-shot prompting to some extent. For selection from the low-quality pool, we propose a combined strategy, sketched below: we first choose the top-11K examples according to SemScore to filter out poor examples, the top-1K of which are also dropped as they tend to be uninformative (see Table 12 in Appendix); we then re-rank the rest with LMScore and retain the top-1K examples, upon which we further apply the TLength-based strategy. The ordering of SemScore, LMScore, and TLength roughly follows their relative 1-shot translation performance (SemScore > LMScore > TLength on average). In Table 4, this combined strategy outperforms the random one by varying degrees.
4. Monolingual Data for Prompting

A longstanding concern in MT is how to utilize unlabeled data to improve translation. While prompting enables few-shot learning, reducing the data requirement, exploring whether the demonstration could benefit from monolingual examples is still valuable, both for MT research and for understanding the role of the demonstration in prompting.

Min et al. (2022) argue that the key role of the demonstration lies in its support of the input space, the label space and the prompt format, rather than the genuineness of the examples. They found that randomly replacing labels in the demonstration barely hurts performance on classification tasks. We reexamine this argument in the context of MT by studying the following prompting settings: 1) random examples, constructing sentence pairs from monolingual sources and targets randomly; 2) source/target example only, using monolingual source/target data alone for prompting.

Directly using monolingual data for demonstration does not work.  Figure 4 (top) shows a totally different story (see Figures 8 and 9 in Appendix for more results): monolingual example-based demonstration almost always hurts translation, and the more examples used, the more degeneration results. Using random examples misleads the prompting and performs the worst in general; compared to target-only examples, using source-only examples yields slightly better results except when translating into Chinese. This indicates that the genuine source-target mapping should be retained in the demonstration, and also indicates that MT features unique challenges which deserve more attention when studying prompting.

Pseudo parallel examples built by forward-/back-translation benefit prompting.  Inspired by data augmentation in MT (Sennrich et al., 2016b; Zhang & Zong, 2016), we next resort to constructing pseudo parallel data. We first adopt GLM-130B to translate the source or target examples via zero-shot prompting, and then use the generated parallel examples as the demonstration. Despite their low quality, Figure 4 (bottom) shows that this is an effective way to improve prompting, and using more examples often produces better results, partially echoing the findings on prompting-based unsupervised MT (Han et al., 2021; Patel et al., 2022). We also observe that back-translation (i.e., translating target monolingual examples) performs better and behaves more robustly than forward-translation (i.e., translating source examples instead), and even approaches prompting with genuine parallel examples.
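The back-translation variant of this idea is easy to state in code. In the sketch below, `translate` stands in for a zero-shot prompting call to the LLM (e.g. GLM-130B with template A); it is a hypothetical helper, not a real API.

```python
def make_pseudo_parallel(mono_targets, translate, src_lang, tgt_lang):
    """Back-translation for prompting: turn target-side monolingual sentences
    into pseudo source-target demonstration pairs by translating them back
    into the source language via zero-shot prompting."""
    pairs = []
    for y in mono_targets:
        x_hat = translate(y, from_lang=tgt_lang, to_lang=src_lang)
        pairs.append((x_hat, y))   # pseudo source, genuine target
    return pairs

# Forward-translation is the mirror image: translate monolingual *source*
# sentences into the target language and pair (x, y_hat) instead.
```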
5. Transfer Learning for Prompting

After obtaining a performant demonstration, we are interested in to what extent its capability could be transferred across different settings, especially from one domain/language pair to another and from sentence-level to document-level translation. While previous studies demonstrate the feasibility with continuous prompts on classification tasks (Wang et al., 2021a), transfer for hard prompting on MT has never been investigated.

Assume that demonstrations D1 and D2 are selected in setting S1 and that D1 performs better (i.e., D1 > D2). We have the following research questions:

• Could we also expect D1 > D2 in setting S2?

• Could using demonstrations from S1 outperform zero-shot prompting in S2?

We next study them via experiments with 1-shot prompting.

| Setting | Correlation BLEU | Correlation COMET | ∆ Quality BLEU | ∆ Quality COMET |
|---|---|---|---|---|
| Source Shared | 0.08 | 0.10 | +0.59 | +7.03 |
| Target Shared | 0.20 | 0.24 | +1.32 | +9.67 |
| Reversed | 0.15 | 0.06 | +1.41 | +11.56 |

Table 5: Spearman's ρ and relative performance for cross-lingual transfer under 1-shot prompting on Wiki Ablation sets (among En, De and Zh). When studying transfer from language pair S1 to S2, we randomly sample 300 demonstrations from the default pool of S1, and then evaluate them on the Ablation test sets for S1 and S2 respectively, based on which we compute the correlation. The performance is also averaged. ∆ Quality: relative quality against the zero-shot baseline for S2. Source/Target Shared: average result for transfer settings where the source/target language is shared; Reversed: average result for the same language pair but in different directions.

| Transfer from Wiki to ⇒ | WMT | IT | Medical |
|---|---|---|---|
| Correlation, En→De | 0.09 | 0.14 | 0.27‡ |
| Correlation, De→En | 0.23‡ | 0.20‡ | 0.13 |
| ∆ Quality, En→De | +4.00 | +19.52 | +7.80 |
| ∆ Quality, De→En | +0.10 | +19.46 | +1.24 |

Table 6: Spearman's ρ and relative performance (in COMET) for cross-domain transfer under 1-shot prompting. We explore transfer from Wiki to Multi-Domain using the Ablation sets. Correlation and performance are calculated in the same way as in cross-lingual transfer, except that we sample 200 demonstrations. ‡: statistically significant at p < 0.01.

| Method | d-BLEU | TC | CP | PT | TCP |
|---|---|---|---|---|---|
| Zero-Shot | 30.2 | 47.5 | 38.7 | 41.6 | 42.4 |
| SemScore | 30.5 | 53.0 | 34.4 | 43.2 | 42.9 |
| LMScore | 30.5 | 53.0 | 36.8 | 42.9 | 43.7 |

Table 7: Results for transfer learning from sentence-level demonstration to document-level translation under 1-shot prompting on PDC Zh→En Full sets. We split each test document in PDC into non-overlapping chunks, each of which contains about 4 sentences. SemScore/LMScore: prompt example selection strategy; we apply them to PDC's default pool. We select 3 demonstrations and report average performance. d-BLEU: document-level BLEU; TC/CP/PT/TCP(↑): document-specific metrics (Sun et al., 2022).

The superiority of a demonstration does not generalize across settings.  If the ranking D1 > D2 held across settings, the results of the same set of demonstrations in different settings should show a high and significant Spearman's correlation. However, the correlations in Tables 5 and 6 are weak and often insignificant (more results are given in Tables 15, 16, and 17), even for the same language pairs in different directions (Reversed) and for similar domains (Wiki⇒WMT). This suggests that we need setting-specific demonstrations to obtain optimal translation quality.

Using out-of-setting demonstrations can benefit translation.  However, we can still gain from using out-of-setting demonstrations, as shown by the positive gains in Tables 5 and 6, where we find that transfer in target-shared and reversed settings is relatively easier, and that transfer across distant domains can be successful, particularly when the in-setting example pool is of low quality. This is also supported by the transfer to document-level translation, where both BLEU and document-specific evaluation improve, as shown in Table 7.
Source: 根据三江源国家公园管理局长江源园区可可西里管理处统计,藏羚羊回迁数量总体呈逐年上升态势,2019年藏羚羊回迁数量为4860只,比2018年增加338只。

Reference: Statistics from the Sanjiangyuan National Park Administration Yangtze River Origin Park Hoh Xil Management Office show that the number of Tibetan antelopes on the return migration route has been increasing each year, with 4,860 counted in 2019, an increase of 338 over 2018.

GLM-130B (1-shot): According to the三江源国家公园管理局长江源园区可可西里管理处, the total number of re-migration of the Tibetan antelope has been on the rise since 2018, with 4,860 re-migrating in 2109, an increase of 338 compared to 2808.

Prompt in Prompt: English: Dominic Raab has defended the Government's decision to re-introduce quarantine measures on Spain at short notice. Translate from English to Chinese: Chinese:

Reference: 针对政府突然做出重新对西班牙实施隔离措施的决定,Dominic Raab 做出了辩解。从英文翻译成中文:

GLM-130B (zero-shot): 多米尼克·拉布(Dominic Raab)对政府决定重新引入西班牙的检疫措施表示支持。Translate from English to Chinese:

Table 8: Case study of translation errors by prompting. Top: copying of source phrases, mistranslation of dates, and misunderstanding of the source. Bottom: prompt trap, where the model fails to translate the prompt phrase.

Results in Table 19 show that the transfer is unstable and can deliver negative results, i.e., worse than zero-shot prompting, partially resonating with previous findings (Lin et al., 2021). We leave the study of how to select prompt examples in transfer-learning setups to future work.

6. Discussion

Although prompting enables translation with decent performance, it still suffers from many (well-known) problems. Here, we briefly describe the problems we observed.

Prompting sometimes refuses to translate the input. Instead, it emits either empty or off-target outputs, i.e., translating into a wrong target language. This occurs frequently when translating into Chinese, where the model often translates into traditional Chinese with garbled characters, causing unstable performance. Besides relying overly on its language model, prompting tends to under-translate the input, copy source phrases, produce code-switched output, mistranslate entities (e.g. dates), and generate hallucinations, as shown in Table 8.

We also observe a phenomenon specific to prompting: the prompt trap, where prompting behaves unpredictably when its input is mixed with prompt template phrases. In the second case in Table 8, the model copies the template phrases rather than translating them into Chinese. This means that translating the prompt itself (not just the input) becomes non-trivial, and that users may attack prompting-based translation systems by manipulating the input format.

| Setting | 0-shot De→Zh | 0-shot Zh→De | 1-shot De→Zh | 1-shot Zh→De |
|---|---|---|---|---|
| Direct | 2.80 | 10.05 | 47.23 | 11.75 |
| Pivoting | 19.23 | 19.53 | 48.25 | 25.31 |

Table 9: COMET scores for direct vs. pivoting translation for De↔Zh on Wiki Full sets. In 1-shot prompting, we randomly sample 3 demonstrations and report average performance. Pivoting: source → English → target.

We find that the translation quality between German and Chinese is very poor (see Table 13). We argue that the cross-lingual ability of GLM-130B mainly centers around English (although GLM-130B was pretrained on Chinese as well), and we thus explore pivoting translation instead. Table 9 shows that pivoting through English greatly improves non-English translation. It is still unclear whether the current LLM pretraining recipe can achieve promising non-English-centric cross-lingual ability. We might need to consider adding parallel data into LLM pretraining or finetuning.
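Pivoting amounts to two chained prompting calls, as the short sketch below illustrates; `translate` is again a hypothetical stand-in for a zero- or few-shot prompting call, not a real API.

```python
def pivot_translate(text, src_lang, tgt_lang, translate, pivot="English"):
    """Pivoting for non-English pairs (e.g. De<->Zh): translate the input
    into English first, then from English into the target language."""
    intermediate = translate(text, from_lang=src_lang, to_lang=pivot)
    return translate(intermediate, from_lang=pivot, to_lang=tgt_lang)
```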
7. Related Work

The capability of prompting heavily depends on its surface representation, where small modifications to the prompt can cause high variance in performance. This inspires researchers to develop advanced prompting strategies to get the most from LLMs. Gao et al. (2021) proposed to generate prompt templates automatically using T5 (Xue et al., 2021) rather than adopting manual templates. Liu et al. (2022) reported selecting prompt examples close to the test input via a kNN-based retriever, Sorensen et al. (2022) resorted to an information-theoretic approach based on mutual information, while Zhang et al. (2022b) formulated example selection as a sequential decision problem and solved it by reinforcement learning. For reasoning tasks, Wei et al. (2022c) developed chain-of-thought (CoT) prompting, letting the model output intermediate reasoning steps, which inspired researchers to further explore CoT selection (Fu et al., 2022) and decomposition (Zhou et al., 2022). In contrast to the studies just mentioned, which focus on NLP tasks other than MT, we explore prompting strategies exclusively for translation.

Prompting uses instructions to guide LLMs, which is closely
related to neural MT with special prefixes. In multilingual NMT, a target language tag is often appended to the source input to indicate the translation direction (Johnson et al., 2017; Arivazhagan et al., 2019; Zhang et al., 2020). Special attribute tags can also be used to control properties of the model output, such as politeness (Sennrich et al., 2016a), diversity (Shu et al., 2019), and quality (Caswell et al., 2019). Besides, retrieved phrases and sentences can be appended to the input to improve translation quality (Zhang et al., 2018; Gu et al., 2018). With the popularity of prompting LLMs, researchers see value in incorporating prompts into neural MT (Li et al., 2022; Tan et al., 2021; Garcia & Firat, 2022). Still, these methods rely on pretraining or finetuning the model rather than prompting frozen LLMs.

Very recently, concurrent to our work, Vilar et al. (2022) examined the capability of prompting PaLM for translation and discovered that prompting with high-quality examples, even chosen randomly, performs on par with or better than using input-relevant examples. By contrast, Agrawal et al. (2022) explored strategies to select input-specific examples, and observed that input-relevant examples based on n-gram overlap significantly improve the capability of prompts. Our study resonates with both of their findings and also explains their conflict: while quality and input-based semantic similarity correlate significantly with prompting performance, the correlation strength is unfortunately not strong enough, so using them as indicators to select examples may produce mixed results. Note that apart from example selection, we also studied using monolingual data and transfer learning for MT prompting.

8. Conclusion and Future Work

In this paper, we presented a systematic study on prompting for MT, exploring topics ranging from the prompting strategy and the use of unlabelled monolingual data to transfer learning. We found that prompt template and demonstration example selection both have a substantial impact on translation. Some prompt example features correlate significantly with prompting performance; treating them as criteria for example selection benefits translation to some extent, but not consistently, as the correlations are not strong enough.

Prompting for MT requires retaining the source-target mapping signals in the demonstration. Directly applying monolingual data for prompting sounds interesting but does not work. Constructing pseudo parallel prompt examples by back-/forward-translation via zero-shot prompting is a simple yet effective solution. Regarding transfer learning, we saw positive results when applying a (sentence-level) demonstration to other domains, other language pairs or document-level translation. Unfortunately, the optimality of a demonstration does not generalize across settings, and the transfer performance is also unstable. We argue that MT provides a set of unique challenges and call for more effort on evaluating prompting LLMs for MT.

Prompting also faces a number of other issues, like off-target generation and prompt traps, which we plan to address in the future. We acknowledge that our study heavily depends on the INT4-quantized GLM-130B, which, unlike GPT and PaLM, was pretrained with both bidirectional and unidirectional training objectives. The quantization might weaken the model's capability and deteriorate some unknown aspects. We are thus interested in examining whether our findings generalize to other LLMs, like GPT-3, OPT and PaLM. We would also like to explore further how to improve the cross-lingual ability of LLMs. Finally, while our study focuses on prompting, how to finetune LLMs for MT and when/whether finetuning is preferred over prompting are yet to be investigated.

Acknowledgments

We thank the reviewers for their insightful comments. This work was funded by UK Research and Innovation (UKRI) under the UK government's Horizon Europe funding guarantee [grant number 10039436 – UTTER]. The computations described in this research were performed using the Baskerville Tier 2 HPC service (https://www.baskerville.ac.uk/). Baskerville was funded by the EPSRC and UKRI through the World Class Labs scheme (EP/T022221/1) and the Digital Research Infrastructure programme (EP/W032244/1) and is operated by Advanced Research Computing at the University of Birmingham.

References

Agrawal, S., Zhou, C., Lewis, M., Zettlemoyer, L., and Ghazvininejad, M. In-context examples selection for machine translation. arXiv preprint arXiv:2212.02437, 2022.

Aharoni, R. and Goldberg, Y. Unsupervised domain clusters in pretrained language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7747–7763, Online, July 2020. URL https://aclanthology.org/2020.acl-main.692.
Akhbardeh, F., Arkhangorodsky, A., Biesialska, M., Bojar, O., Chatterjee, R., Chaudhary, V., Costa-jussa, M. R., España-Bonet, C., Fan, A., Federmann, C., Freitag, M., Graham, Y., Grundkiewicz, R., Haddow, B., Harter, L., Heafield, K., Homan, C., Huck, M., Amponsah-Kaakyire, K., Kasai, J., Khashabi, D., Knight, K., Kocmi, T., Koehn, P., Lourie, N., Monz, C., Morishita, M., Nagata, M., Nagesh, A., Nakazawa, T., Negri, M., Pal, S., Tapo, A. A., Turchi, M., Vydrin, V., and Zampieri, M. Findings of the 2021 conference on machine translation (WMT21). In Proceedings of the Sixth Conference on Machine Translation, pp. 1–88, Online, November 2021. URL https://aclanthology.org/2021.wmt-1.1.

Arivazhagan, N., Bapna, A., Firat, O., Lepikhin, D., Johnson, M., Krikun, M., Chen, M. X., Cao, Y., Foster, G., Cherry, C., Macherey, W., Chen, Z., and Wu, Y. Massively multilingual neural machine translation in the wild: Findings and challenges, 2019.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

Caswell, I., Chelba, C., and Grangier, D. Tagged back-translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pp. 53–63, Florence, Italy, August 2019. URL https://aclanthology.org/W19-5206.

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. URL https://aclanthology.org/N19-1423.

Fu, Y., Peng, H., Sabharwal, A., Clark, P., and Khot, T. Complexity-based prompting for multi-step reasoning. arXiv preprint arXiv:2210.00720, 2022.

Gao, T., Fisch, A., and Chen, D. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 3816–3830, Online, August 2021. URL https://aclanthology.org/2021.acl-long.295.

Garcia, X. and Firat, O. Using natural language prompts for machine translation. arXiv preprint arXiv:2202.11822, 2022.

Goyal, T., Li, J. J., and Durrett, G. News summarization and evaluation in the era of GPT-3. arXiv preprint arXiv:2209.12356, 2022.

Gu, J., Wang, Y., Cho, K., and Li, V. O. Search engine guided neural machine translation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, AAAI'18/IAAI'18/EAAI'18. AAAI Press, 2018.

Han, J. M., Babuschkin, I., Edwards, H., Neelakantan, A., Xu, T., Polu, S., Ray, A., Shyam, P., Ramesh, A., Radford, A., et al. Unsupervised neural machine translation with generative language models only. arXiv preprint arXiv:2110.05448, 2021.

Heffernan, K., Çelebi, O., and Schwenk, H. Bitext mining using distilled sentence representations for low-resource languages. arXiv preprint arXiv:2205.12654, 2022.

Johnson, M., Schuster, M., Le, Q., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M., and Dean, J. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017. URL https://transacl.org/index.php/tacl/article/view/1081.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
Li, Y., Yin, Y., Li, J., and Zhang, Y. Prompt-driven neural machine translation. In Findings of the Association for Computational Linguistics: ACL 2022, pp. 2579–2590, Dublin, Ireland, May 2022. URL https://aclanthology.org/2022.findings-acl.203.

Lin, X. V., Mihaylov, T., Artetxe, M., Wang, T., Chen, S., Simig, D., Ott, M., Goyal, N., Bhosale, S., Du, J., et al. Few-shot learning with multilingual language models. arXiv preprint arXiv:2112.10668, 2021.

Liu, J., Shen, D., Zhang, Y., Dolan, B., Carin, L., and Chen, W. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pp. 100–114, Dublin, Ireland and Online, May 2022. URL https://aclanthology.org/2022.deelio-1.10.

Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837, 2022.

Moghe, N., Sherborne, T., Steedman, M., and Birch, A. Extrinsic evaluation of machine translation metrics. arXiv preprint arXiv:2212.10297, 2022.

NLLB Team, Costa-jussà, M. R., Cross, J., Çelebi, O., Elbayad, M., Heafield, K., Heffernan, K., Kalbassi, E., Lam, J., Licht, D., Maillard, J., Sun, A., Wang, S., Wenzek, G., Youngblood, A., Akula, B., Barrault, L., Gonzalez, G. M., Hansanti, P., Hoffman, J., Jarrett, S., Sadagopan, K. R., Rowe, D., Spruit, S., Tran, C., Andrews, P., Ayan, N. F., Bhosale, S., Edunov, S., Fan, A., Gao, C., Goswami, V., Guzmán, F., Koehn, P., Mourachko, A., Ropers, C., Saleem, S., Schwenk, H., and Wang, J. No language left behind: Scaling human-centered machine translation. 2022.

Patel, A., Li, B., Rasooli, M. S., Constant, N., Raffel, C., and Callison-Burch, C. Bidirectional language models are also few-shot learners. arXiv preprint arXiv:2209.14500, 2022.

Post, M. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186–191, Brussels, Belgium, October 2018. URL https://www.aclweb.org/anthology/W18-6319.

Qian, L., Zhou, Y., Zheng, Z., Zhu, Y., Lin, Z., Feng, J., Cheng, S., Li, L., Wang, M., and Zhou, H. The VolcTrans GLAT system: Non-autoregressive translation meets WMT21. In Proceedings of the Sixth Conference on Machine Translation, pp. 187–196, Online, November 2021. URL https://aclanthology.org/2021.wmt-1.17.

Rei, R., Stewart, C., Farinha, A. C., and Lavie, A. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2685–2702, Online, November 2020. URL https://aclanthology.org/2020.emnlp-main.213.

Reynolds, L. and McDonell, K. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, CHI EA '21, New York, NY, USA, 2021. Association for Computing Machinery. URL https://doi.org/10.1145/3411763.3451760.

Schwenk, H., Chaudhary, V., Sun, S., Gong, H., and Guzmán, F. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 1351–1361, Online, April 2021. URL https://aclanthology.org/2021.eacl-main.115.

Sennrich, R., Haddow, B., and Birch, A. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 35–40, San Diego, California, June 2016a. URL https://aclanthology.org/N16-1005.

Sennrich, R., Haddow, B., and Birch, A. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 86–96, Berlin, Germany, August 2016b. URL https://aclanthology.org/P16-1009.

Shu, R., Nakayama, H., and Cho, K. Generating diverse translations with sentence codes. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1823–1827, Florence, Italy, July 2019. URL https://aclanthology.org/P19-1177.
Sorensen, T., Robinson, J., Rytting, C., Shaw, A., Rogers, K., Delorey, A., Khalil, M., Fulda, N., and Wingate, D. An information-theoretic approach to prompt engineering without ground truth labels. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 819–862, Dublin, Ireland, May 2022. URL https://aclanthology.org/2022.acl-long.60.

Sun, Z., Wang, M., Zhou, H., Zhao, C., Huang, S., Chen, J., and Li, L. Rethinking document-level neural machine translation. In Findings of the Association for Computational Linguistics: ACL 2022, pp. 3537–3548, Dublin, Ireland, May 2022. URL https://aclanthology.org/2022.findings-acl.279.

Tan, Z., Zhang, X., Wang, S., and Liu, Y. MSP: Multi-stage prompting for making pre-trained language models better translators. arXiv preprint arXiv:2110.06609, 2021.

Vilar, D., Freitag, M., Cherry, C., Luo, J., Ratnakar, V., and Foster, G. Prompting PaLM for translation: Assessing strategies and performance. arXiv preprint arXiv:2211.09102, 2022.

Wang, C., Wang, J., Qiu, M., Huang, J., and Gao, M. TransPrompt: Towards an automatic transferable prompting framework for few-shot text classification. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 2792–2802, Online and Punta Cana, Dominican Republic, November 2021a. URL https://aclanthology.org/2021.emnlp-main.221.

Wang, L., Li, M., Liu, F., Shi, S., Tu, Z., Wang, X., Wu, S., Zeng, J., and Zhang, W. Tencent translation system for the WMT21 news translation task. In Proceedings of the Sixth Conference on Machine Translation, pp. 216–224, Online, November 2021b. URL https://aclanthology.org/2021.wmt-1.20.

Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022a. URL https://openreview.net/forum?id=gEZrGCozdqR.

Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022b.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022c.

Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A., and Raffel, C. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 483–498, Online, June 2021. URL https://aclanthology.org/2021.naacl-main.41.

Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X., Tam, W. L., Ma, Z., Xue, Y., Zhai, J., Chen, W., Zhang, P., Dong, Y., and Tang, J. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.

Zeng, X., Liu, Y., Li, E., Ran, Q., Meng, F., Li, P., Xu, J., and Zhou, J. WeChat neural machine translation systems for WMT21. In Proceedings of the Sixth Conference on Machine Translation, pp. 243–254, Online, November 2021. URL https://aclanthology.org/2021.wmt-1.23.

Zhang, B., Williams, P., Titov, I., and Sennrich, R. Improving massively multilingual neural machine translation and zero-shot translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1628–1639, Online, July 2020. URL https://www.aclweb.org/anthology/2020.acl-main.148.

Zhang, J. and Zong, C. Exploiting source-side monolingual data in neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1535–1545, 2016.

Zhang, J., Utiyama, M., Sumita, E., Neubig, G., and Nakamura, S. Guiding neural machine translation with retrieved translation pieces. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1325–1335, New Orleans, Louisiana, June 2018. URL https://aclanthology.org/N18-1120.
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022a.

Zhang, Y., Feng, S., and Tan, C. Active example selection for in-context learning. arXiv preprint arXiv:2211.04486, 2022b.

Zhao, Z., Wallace, E., Feng, S., Klein, D., and Singh, S. Calibrate before use: Improving few-shot performance of language models. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 12697–12706. PMLR, 2021. URL https://proceedings.mlr.press/v139/zhao21c.html.

Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Bousquet, O., Le, Q., and Chi, E. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.

A. Appendix

[Figure 5 panels: Wiki De→En, En→De, En→Zh, Zh→En, De→Zh, Zh→De; y-axes COMET(↑) and BLEU(↑); x-axis #examples (1, 5, 10, 20); dashed line marks the Zero-shot Baseline.]
Figure 5: COMET (top) and BLEU (bottom) scores for few-shot prompting as a function of the number of prompt examples (K =
1, 5, 10, 20) on Wiki Ablation sets. For each setup, we randomly sample 100 times from the example pool and show the performance
distribution via box plots. Dashed red line denotes the zero-shot baseline; blue curve and shaded area denote the mean and standard deviation.
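
The sampling protocol behind these box plots can be written down compactly. The sketch below is illustrative rather than the paper's released code: translate_with_prompt is a hypothetical wrapper around GLM-130B decoding with the chosen template, and scoring uses sacreBLEU.

```python
import random

import sacrebleu


def few_shot_bleu_distribution(pool, test_set, k, n_trials=100, seed=0):
    """Draw k prompt examples n_trials times and score each draw with BLEU.

    pool: list of (source, target) candidate prompt examples.
    test_set: list of (source, reference) test pairs.
    translate_with_prompt(demos, src) is a hypothetical helper that prepends
    the sampled demonstrations to the test source and decodes with the LLM.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        demos = rng.sample(pool, k)
        hyps = [translate_with_prompt(demos, src) for src, _ in test_set]
        refs = [[ref for _, ref in test_set]]
        scores.append(sacrebleu.corpus_bleu(hyps, refs).score)
    return scores  # one box per (language pair, k) in the plot
```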

[Figure 6 panels: Wiki De→En, En→De, En→Zh, Zh→En; y-axis BLEU(↑); x-axis LMScore; points distinguish the Low Quality and High Quality example pools.]
Figure 6: Scatter plots of BLEU against LMScore for 1-shot prompting on Wiki De↔En and En↔Zh Ablation sets.
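
LMScore is defined in the main text; for illustration only, the sketch below assumes it is the average token log-probability of a prompt example under a causal language model, which is consistent with the x-axis range of roughly [-6, -1] above. The Hugging Face causal-LM interface stands in for GLM-130B.

```python
import torch
import torch.nn.functional as F


def lm_score(model, tokenizer, example_text):
    """Average token log-probability of a prompt example under a causal LM.

    This is an assumed reading of LMScore, not necessarily the paper's exact
    formula; model and tokenizer follow the Hugging Face causal-LM interface.
    """
    ids = tokenizer(example_text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Positions <= t predict token t+1; average the log-probs of the gold tokens.
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    gold = ids[:, 1:].unsqueeze(-1)
    return log_probs.gather(-1, gold).squeeze(-1).mean().item()
```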

ID | BLEU: De→En  En→De  De→Zh  Zh→De  En→Zh  Zh→En  Avg | COMET: De→En  En→De  De→Zh  Zh→De  En→Zh  Zh→En  Avg
English Template Without Line Break
A 38.00 23.10 23.30 12.10 31.50 27.90 25.98 70.83 41.95 4.34 15.92 35.68 63.98 38.78
B 8.30 9.00 2.80 2.40 6.60 8.20 6.22 -45.75 -70.27 -140.43 -119.82 -112.38 -43.10 -88.62
C 30.60 2.10 5.50 1.10 1.10 8.30 8.12 29.78 -142.36 -117.20 -117.14 -120.57 -58.32 -87.63
D 26.10 0.00 5.10 0.00 0.20 0.60 5.33 -1.20 -160.59 -124.15 -157.62 -130.51 -108.71 -113.80
E 35.90 18.20 26.10 9.60 16.00 22.30 21.35 68.06 5.41 27.53 -6.46 -5.58 35.93 20.81
F 33.50 5.60 25.10 0.80 0.20 9.10 12.38 61.09 -62.31 22.71 -112.79 -50.84 -20.71 -27.14
English Template With Line Break
A 36.60 21.80 25.10 11.40 26.90 26.90 24.78 67.97 37.41 7.24 9.46 4.89 60.08 31.17
B 7.70 7.70 5.00 2.70 13.20 10.00 7.72 -85.97 -81.79 -126.58 -113.27 -55.64 -48.82 -85.35
C 28.00 4.40 7.70 0.70 13.30 14.50 11.43 36.10 -99.01 -118.99 -133.39 -74.19 -23.00 -68.75
D 25.20 1.60 4.20 0.10 4.90 5.40 6.90 13.96 -121.58 -125.36 -148.29 -78.78 -74.91 -89.16
E 35.70 20.00 24.40 3.90 28.30 20.30 22.10 66.08 22.21 15.62 -55.41 13.36 38.30 16.69
F 33.60 9.30 23.60 3.00 6.70 17.90 15.68 57.46 -45.84 14.73 -69.69 -30.63 32.68 -6.88
German Template Without Line Break
A 20.00 15.70 1.60 3.10 0.70 7.10 8.03 23.09 4.61 -70.84 -47.51 -65.61 -0.66 -26.15
B 5.60 2.10 0.10 1.60 0.20 1.10 1.78 -82.99 -152.26 -174.72 -132.06 -162.79 -110.99 -135.97
C 4.60 5.40 0.30 3.70 0.00 4.10 3.02 -57.63 -108.36 -120.99 -125.18 -135.21 -90.42 -106.30
D 3.50 0.10 0.00 0.00 0.00 0.10 0.62 -115.55 -168.13 -166.07 -169.21 -161.27 -142.57 -153.80
E 17.30 19.00 0.20 8.50 2.30 19.60 11.15 14.19 6.47 -100.92 -25.14 -50.42 9.85 -24.33
F 6.30 4.80 0.20 7.30 0.10 11.70 5.07 3.88 -65.86 -44.76 -27.91 -60.31 -11.22 -34.36
German Template With Line Break
A 25.40 20.20 6.40 3.50 8.00 9.20 12.12 38.47 31.45 -80.14 -47.22 -50.26 8.84 -16.48
B 15.60 7.80 2.60 1.00 0.50 0.80 4.72 -20.65 -81.28 -125.21 -137.02 -125.31 -108.45 -99.65
C 15.40 5.70 5.70 3.00 6.00 6.70 7.08 -23.46 -80.15 -86.27 -104.10 -87.18 -58.23 -73.23
D 2.80 0.50 0.00 0.00 0.10 1.10 0.75 -95.30 -154.76 -140.51 -155.91 -137.36 -100.08 -130.65
E 24.70 19.50 10.40 8.50 11.10 17.20 15.23 35.12 3.95 -62.48 -18.32 -27.61 35.26 -5.68
F 7.60 17.20 0.50 8.60 3.90 11.30 8.18 13.01 9.10 -43.63 -10.88 -46.46 23.54 -9.22
Chinese Template Without Line Break
A 37.60 15.50 28.30 2.10 33.40 15.10 22.00 67.41 -5.40 45.24 -74.78 53.71 2.72 14.82
B 23.60 6.30 14.50 0.50 19.30 1.90 11.02 -6.41 -90.63 -12.10 -159.66 -9.24 -121.29 -66.55
C 11.40 3.20 14.30 0.40 20.80 5.00 9.18 -32.55 -114.57 -9.91 -140.54 2.89 -85.58 -63.38
D 17.10 6.40 15.90 0.20 19.60 1.90 10.18 -34.15 -101.69 -24.36 -166.15 -9.20 -125.20 -76.79
E 29.00 8.00 27.00 0.40 34.90 16.10 19.23 35.55 -63.09 37.06 -119.13 54.14 3.80 -8.61
F 31.70 3.70 24.80 0.10 27.20 11.80 16.55 35.65 -105.74 22.97 -129.71 5.61 -34.09 -34.22
Chinese Template With Line Break
A 26.80 14.70 24.70 3.30 33.80 22.90 21.03 24.46 -84.74 24.76 -64.07 52.65 40.45 -1.08
B 23.70 6.30 11.90 0.10 14.40 0.60 9.50 -11.65 -102.50 -63.95 -161.96 -46.84 -128.12 -85.84
C 12.10 3.00 13.80 0.80 21.20 9.90 10.13 -36.39 -105.55 -42.16 -151.06 -15.41 -74.90 -70.91
D 14.10 3.20 15.10 0.20 20.00 2.50 9.18 -19.15 -106.69 -19.34 -154.73 -11.51 -94.82 -67.71
E 28.60 8.00 26.50 0.90 32.30 21.40 19.62 8.71 -118.14 15.34 -124.30 21.18 14.91 -30.38
F 26.90 3.40 26.10 0.20 25.80 16.00 16.40 11.58 -120.31 10.33 -129.61 -21.19 -20.52 -44.95

Table 10: Detailed zero-shot results for prompting with different templates and different template languages on Wiki Ablation sets.
Template A in English achieves the overall best performance measured by BLEU and COMET. Avg: average result over different
language pairs. Best results in each section are underlined; best results in each column are in bold.
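
The exact wording of templates A–F is given in the main text and not repeated here; the snippet below only illustrates, with placeholder wording, how the with/without line break variants differ once a template is instantiated.

```python
def build_zero_shot_prompt(src, src_lang, tgt_lang, line_break=False):
    """Instantiate a placeholder zero-shot translation template.

    The separator between the source segment and the target-language cue is a
    space or a newline, mirroring the two variants compared in Table 10. The
    wording is illustrative and does not reproduce templates A-F.
    """
    sep = "\n" if line_break else " "
    return f"{src_lang}: {src}{sep}{tgt_lang}:"


# build_zero_shot_prompt("Guten Morgen!", "German", "English", line_break=True)
# -> "German: Guten Morgen!\nEnglish:"
```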

15
Prompting Large Language Model for Machine Translation: A Case Study

Method | High-quality Examples: De→En  En→De  De→Zh  Zh→De  En→Zh  Zh→En  Avg | Low-quality Examples: De→En  En→De  De→Zh  Zh→De  En→Zh  Zh→En  Avg
Correlation with COMET
SLength 0.02 0.18‡ 0.24‡ 0.12‡ 0.26‡ 0.01 0.14 0.09‡ 0.20‡ 0.52 ‡ 0.44‡ 0.24‡ 0.10‡ 0.26
TLength -0.01 0.23‡ 0.19‡ 0.27‡ 0.29‡ 0.06 0.17 0.06† 0.35‡ 0.41‡ 0.57 ‡ 0.25‡ 0.13‡ 0.29
LMScore 0.06 0.23‡ 0.01 0.20‡ 0.12‡ 0.21‡ 0.14 0.19‡ 0.38‡ 0.35‡ 0.51 ‡ 0.16‡ 0.27‡ 0.31
MTScore 0.01 0.05 0.11‡ 0.12‡ 0.06 0.28‡ 0.11 0.13‡ 0.04 0.30‡ 0.23‡ 0.18‡ 0.28‡ 0.19
SemScore 0.11‡ 0.17‡ 0.11‡ 0.15‡ 0.10‡ 0.31‡ 0.16 0.12‡ 0.24‡ 0.42‡ 0.50‡ 0.17‡ 0.33‡ 0.30
CaseSemScore-Src -0.01 0.20‡ 0.22‡ 0.08† 0.18‡ -0.03 0.11 0.08‡ 0.29‡ 0.53 ‡ 0.49‡ 0.26‡ 0.05 0.28
CaseSemScore-Tgt -0.01 0.22‡ 0.25‡ 0.14‡ 0.21‡ 0.05 0.14 0.09‡ 0.32‡ 0.53 ‡ 0.53 ‡ 0.27‡ 0.11‡ 0.31
Correlation with BLEU
SLength 0.20‡ 0.27‡ 0.21‡ 0.11‡ 0.33‡ 0.12‡ 0.21 0.23‡ 0.30‡ 0.51 ‡ 0.35‡ 0.29‡ 0.18‡ 0.31
TLength 0.15‡ 0.32‡ 0.16‡ 0.22‡ 0.40‡ 0.12‡ 0.23 0.15‡ 0.38‡ 0.41‡ 0.47‡ 0.33‡ 0.19‡ 0.32
LMScore 0.14‡ 0.17‡ 0.10‡ 0.24‡ 0.27‡ 0.26‡ 0.20 0.23‡ 0.30‡ 0.39‡ 0.46‡ 0.27‡ 0.32‡ 0.33
MTScore 0.03 -0.05 0.04 0.09† 0.03 0.12‡ 0.04 0.11‡ -0.04 0.26‡ 0.19‡ 0.17‡ 0.14‡ 0.14
SemScore 0.13‡ 0.11‡ 0.15‡ 0.20‡ 0.25‡ 0.29‡ 0.19 0.13‡ 0.20‡ 0.45‡ 0.45‡ 0.28‡ 0.31‡ 0.30
CaseSemScore-Src 0.16‡ 0.15‡ 0.18‡ 0.03 0.28‡ 0.03 0.14 0.20‡ 0.29‡ 0.51 ‡ 0.36‡ 0.31‡ 0.07‡ 0.29
CaseSemScore-Tgt 0.14‡ 0.17‡ 0.16‡ 0.05 0.24‡ 0.09† 0.14 0.18‡ 0.30‡ 0.49‡ 0.39‡ 0.29‡ 0.13‡ 0.30

Table 11: Detailed Spearman’s ρ between demonstration features and their prompting performance (COMET and BLEU) for 1-shot
prompting on Wiki Ablation sets. We randomly sample 600 demonstrations from each pool to calculate the correlation. High-quality
examples are from the default selection pool while Low-quality examples are from WikiMatrix.v1. † /‡ : statistically significant at
p < 0.05/0.01. Gray cells indicate insignificance; Red cells indicate ρ > 0.5.
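
Once the per-demonstration feature values and the corresponding test scores are collected, the correlations in Table 11 reduce to a single call; a minimal sketch with SciPy, assuming both lists are aligned over the 600 sampled demonstrations:

```python
from scipy.stats import spearmanr


def feature_metric_correlation(feature_values, metric_values):
    """Spearman's rho between a demonstration feature (e.g., SLength, LMScore,
    SemScore of each sampled 1-shot example) and the metric (BLEU or COMET)
    obtained when prompting with that example."""
    rho, p_value = spearmanr(feature_values, metric_values)
    return rho, p_value  # daggers in the table mark p < 0.05 and p < 0.01
```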

[Figure 7 panels: (a) COMET vs. LMScore and (b) BLEU vs. LMScore for Wiki De→Zh and Zh→De; points distinguish the Low Quality and High Quality example pools.]

Figure 7: Scatter plots of COMET/BLEU against LMScore for 1-shot prompting on Wiki De↔Zh Ablation sets.

En→Zh  Source: Coordinates: 19°43′10″S 63°18′00″E / 19.71944°S 63.30000°E / -19.71944; 63.30000
       Target: 坐标:19°43′10″S 63°18′00″E / 19.71944°S 63.30000°E / -19.71944; 63.30000
       Source: SAO 40012 is HD 277559.
       Target: SAO 40012是HD 277559。
En→De  Source: 2002 and 2004.
       Target: 2002 und 2004.
       Source: Brinton, Lauren and Leslie Arnovick.
       Target: Brinton, Lauren und Leslie Arnovick.

Table 12: Top-ranked parallel examples according to SemScore on WikiMatrix.v1 En-De and En-Zh. Despite showing high semantic
similarity, these examples are not very informative. We thus dropped them at selection.

Method | BLEU: De→En  En→De  De→Zh  Zh→De  En→Zh  Zh→En  Avg | COMET: De→En  En→De  De→Zh  Zh→De  En→Zh  Zh→En  Avg
NLLB-200 (54.5B)⋆ 45.80 39.60 25.90 20.60 31.20 31.90 32.50 75.43 66.37 26.22 54.01 32.97 69.82 54.14
Zero-Shot 37.80 20.50 21.70 9.60 28.60 26.30 24.08 68.30 29.96 2.80 10.05 29.17 63.25 33.92
1-Shot Translation (high-quality pool)
Random 37.67 21.23 28.70 9.07 34.87 26.30 26.31 68.77 35.56 47.23 11.75 60.69 65.75 48.29
SemScore 38.40 21.37 29.17 9.47 35.50 26.50 26.73 69.04 36.06 48.79 14.63 60.54 66.98 49.34
LMScore 37.80 21.43 28.13 9.40 35.40 26.73 26.48 68.55 35.49 43.54 13.14 59.84 66.98 47.92
TLength 37.00 21.80 28.57 9.47 35.90 26.53 26.54 67.79 37.00 45.66 13.63 61.87 66.45 48.73
5-Shot Translation (high-quality pool)
Random 39.03 22.00 29.37 10.07 37.07 27.20 27.46 70.30 36.46 51.77 16.74 63.77 67.62 51.11
SemScore 38.13 21.93 30.50 10.20 36.87 26.50 27.36 70.12 38.40 52.29 16.88 64.40 67.85 51.66
LMScore 38.87 22.03 30.20 9.97 35.83 26.13 27.17 69.74 37.01 51.01 16.63 61.74 67.74 50.65
TLength 38.57 22.00 29.50 10.00 35.90 26.53 27.08 68.94 37.16 50.80 15.80 63.01 67.29 50.50
1-Shot Translation (low-quality pool)
Random 36.73 20.53 22.23 8.23 34.63 26.13 24.75 66.82 34.15 10.11 -1.94 57.97 66.08 38.86
Ours 37.90 21.27 20.50 9.37 34.47 26.17 24.94 68.46 33.78 0.19 12.07 58.05 66.75 39.88

Table 13: Detailed test results for zero-shot and few-shot prompting on Wiki Full sets with different selection strategies. ⋆ : results from
NLLB Team et al. (2022); Ours: the proposed combined strategy; Random: random sampling; SemScore, LMScore and TLength denote
selecting top-ranked examples based on the corresponding feature values. We select 3 demonstrations for each setup and report the
average. Avg: average result over language pairs. Underlined results denote the best in each section, while Bold results are the overall
best.
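
The non-random strategies in Table 13 all amount to ranking the candidate pool by a single feature and keeping the top examples; a minimal sketch, where feature_fn stands in for SemScore (similarity to the test input), LMScore, or target-side length:

```python
def select_top_ranked(pool, feature_fn, k):
    """Return the k highest-scoring prompt examples from the candidate pool.

    pool: list of (source, target) pairs.
    feature_fn: maps a pair to a scalar, e.g. SemScore against the current test
    sentence, LMScore under the LLM, or the target length (TLength).
    """
    return sorted(pool, key=feature_fn, reverse=True)[:k]
```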

Method | BLEU: De→En  En→De  En→Zh  Zh→En  Avg | COMET: De→En  En→De  En→Zh  Zh→En  Avg
WMT SOTA System 35.05b 31.32a 36.92a 33.41c 34.18 61.28 54.87 50.11 48.35 53.65
Zero-Shot 28.30 15.70 20.70 16.80 20.38 46.01 13.32 4.63 7.92 17.97
1-Shot Translation (high-quality pool)
Random 25.63 16.37 26.03 17.03 21.27 45.90 16.89 40.88 19.14 30.70
SemScore 26.90 16.03 26.30 18.07 21.82 46.39 15.13 41.13 22.49 31.28
LMScore 27.53 15.70 25.43 17.70 21.59 47.47 17.53 38.95 19.29 30.81
TLength 25.60 16.33 25.80 17.43 21.29 43.47 18.24 42.17 18.82 30.68
5-Shot Translation (high-quality pool)
Random 26.40 17.10 26.23 17.53 21.82 48.36 20.19 43.97 22.95 33.87
SemScore 27.30 16.57 26.93 18.67 22.37 49.33 18.83 43.49 25.54 34.30
LMScore 25.90 16.87 26.47 18.93 22.04 47.77 20.83 44.76 27.41 35.19
TLength 25.80 17.03 26.55 17.63 21.75 47.34 20.78 45.17 23.85 34.29
1-Shot Translation (low-quality pool)
Random 27.33 15.53 25.30 20.07 22.06 45.29 14.21 36.83 26.49 30.70
Ours 27.63 15.97 25.23 20.10 22.23 47.16 15.01 34.48 26.82 30.87

Table 14: Detailed test results on WMT Full sets. a, b, c: results from Zeng et al. (2021), Qian et al. (2021), and Wang et al. (2021b), respectively.

[Figure 8 panels: (a) COMET and (b) BLEU for Wiki De→Zh and Zh→De; top row compares the Zero-shot Baseline with Random Example, Source Example Only, and Target Example Only; bottom row compares the Zero-shot Baseline with Parallel Example, Source Example Aug, and Target Example Aug; x-axis #examples (1, 5, 10, 20).]

Figure 8: Results for few-shot prompting with monolingual data on Wiki Ablation sets for De↔Zh.

[Figure 9 panels: BLEU(↑) for Wiki De→En, En→De, En→Zh, Zh→En; top row compares the Zero-shot Baseline with Random Example, Source Example Only, and Target Example Only; bottom row compares the Zero-shot Baseline with Parallel Example, Source Example Aug, and Target Example Aug; x-axis #examples (1, 5, 10, 20).]
Figure 9: BLEU scores for few-shot prompting with monolingual data on Wiki Ablation sets.
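
The Source/Target Example Aug settings in Figures 8 and 9 build pseudo parallel prompt examples from monolingual sentences by generating the missing side with zero-shot prompting; a minimal sketch, where zero_shot_translate is a hypothetical wrapper around the zero-shot prompt:

```python
def build_pseudo_parallel(mono_sents, side, zero_shot_translate):
    """Construct pseudo parallel prompt examples from monolingual data.

    side="source": mono_sents are source-language sentences whose translations
    are generated by zero-shot prompting (Source Example Aug).
    side="target": mono_sents are target-language sentences whose sources are
    generated in the reverse direction (Target Example Aug).
    """
    pairs = []
    for sent in mono_sents:
        if side == "source":
            pairs.append((sent, zero_shot_translate(sent, direction="src2tgt")))
        else:
            pairs.append((zero_shot_translate(sent, direction="tgt2src"), sent))
    return pairs
```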

Prompt Language | BLEU: De→En  En→De  De→Zh  Zh→De  En→Zh  Zh→En | COMET: De→En  En→De  De→Zh  Zh→De  En→Zh  Zh→En
† † † †  0.21‡

De→En - 0.06 0.08 0.12 0.13 0.13 - -0.02 0.09 0.12 -0.01
En→De 0.07 - 0.14‡ 0.19‡ 0.17‡ 0.11† 0.01 - 0.07 0.21‡ 0.14‡ 0.17‡
De→Zh -0.08 0.06 - 0.14‡ 0.24‡ -0.05 0.02 0.15‡ - 0.08 0.40‡ 0.02
Zh→De 0.00 0.26‡ 0.26‡ - 0.05 0.01 -0.03 0.21‡ 0.22‡ - 0.13† 0.15‡
En→Zh 0.01 -0.01 0.24‡ 0.25‡ - 0.19‡ 0.04 -0.01 0.22‡ 0.21‡ - 0.03
Zh→En 0.15‡ -0.16‡ 0.14‡ 0.34‡ 0.15‡ - 0.25‡ 0.09 0.14‡ 0.21‡ 0.03 -

Table 15: Detailed Spearman’s ρ for cross-lingual transfer under 1-shot prompting on Wiki Ablation sets. Gray cells indicate
insignificance.

Prompt Language | BLEU: De→En  En→De  De→Zh  Zh→De  En→Zh  Zh→En | COMET: De→En  En→De  De→Zh  Zh→De  En→Zh  Zh→En
De→En - -0.32 5.02 -0.86 1.29 0.00 - -1.08 35.04 2.71 7.00 -0.01
En→De -0.69 - 3.88 -0.69 1.21 -0.41 -0.46 - 26.01 1.56 6.31 -2.40
De→Zh -0.63 -0.48 - -0.65 4.38 0.04 0.92 -3.68 - 4.16 23.51 -0.34
Zh→De -0.66 -0.86 6.84 - 3.23 0.19 0.71 -6.15 43.67 - 17.54 0.51
En→Zh -1.54 -1.17 6.23 -1.44 - -1.50 -6.00 -4.47 41.77 -1.79 - -2.20
Zh→En -1.12 -1.00 1.78 -1.11 4.81 - -2.63 -3.85 15.25 3.90 25.29 -

Table 16: Detailed translation results (relative against the zero-shot baseline) for cross-lingual transfer under 1-shot prompting on Wiki
Ablation sets. Blue cells indicate positive gains.

Transfer from Wiki to ⇒          WMT       IT       Medical
Correlation    En→De             0.05      0.11     0.15†
               De→En            -0.25‡     0.19‡    0.07
∆ Quality      En→De            -0.45     +0.88    -0.21
               De→En            -0.43     +1.00    +0.77

Table 17: Spearman’s ρ and relative performance (in BLEU) for cross-domain transfer under 1-shot prompting.

Setting | 0-shot: De→Zh  Zh→De | 1-shot: De→Zh  Zh→De
Direct 21.70 9.60 28.70 9.07
Pivoting 24.4 11.5 29.47 11.47

Table 18: BLEU scores for direct vs. pivoting translation for De↔Zh on Wiki Full sets.
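
Pivoting chains two prompted translation steps instead of translating De↔Zh directly. The sketch below assumes English as the pivot language (the table itself does not name the pivot) and a hypothetical prompt_translate wrapper shared with the direct setup:

```python
def pivot_translate(sentence, src, tgt, prompt_translate, pivot="en"):
    """Translate via a pivot language with two prompted steps: src->pivot, pivot->tgt.

    For De->Zh this would be German->English followed by English->Chinese.
    prompt_translate(text, src_lang, tgt_lang) is a hypothetical wrapper around
    the same zero-/few-shot prompting setup used for direct translation.
    """
    intermediate = prompt_translate(sentence, src, pivot)
    return prompt_translate(intermediate, pivot, tgt)
```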

Method | BLEU: IT  Law  Medical  Avg | COMET: IT  Law  Medical  Avg
Zero-Shot 32.4 28.5 31.3 30.7 12.39 32.85 33.99 26.41
1-Shot Translation (low-quality pool)
Random 33.70 27.33 30.80 30.61 29.12 30.22 34.08 31.14
Ours 32.93 27.60 33.23 31.26 29.95 29.60 41.37 33.64
Cross-domain Transfer
Wiki⇒Multi-Domain 32.90 26.73 31.87 30.50 25.08 33.27 37.85 32.07
WMT⇒Multi-Domain 30.87 25.37 31.43 29.22 12.98 30.34 34.80 26.04
Cross-lingual Transfer
De→Fr ⇒ De→En 33.45 28.67 32.90 31.67 29.43 34.76 39.31 34.50
Fr→De ⇒ De→En 32.77 28.53 31.73 31.01 27.68 34.90 33.75 32.11
Zh→Fr ⇒ De→En 15.80 25.53 19.70 20.34 -37.03 7.38 -27.38 -19.01
Fr→Zh ⇒ De→En 19.17 26.95 26.35 24.16 -17.62 22.42 4.37 3.06

Table 19: Cross-lingual and cross-domain transfer results on Multi-Domain Full sets under 1-shot prompting. For cross-domain transfer,
we adopt the SemScore-based strategy for example selection using the default Wiki/WMT Full candidate pool; for cross-lingual transfer,
we extend the selected examples in Multi-Domain 1-shot translation (low-quality pool) by translating the English sentences to French and
Chinese using Google Translate. Results are averaged over 3 different demonstrations.
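
For the cross-lingual transfer rows, the selected examples keep their German side while the English side is machine-translated into French or Chinese; a minimal sketch, where external_mt stands in for the external MT system (Google Translate in the paper):

```python
def make_cross_lingual_examples(examples_de_en, external_mt, new_lang="fr"):
    """Turn selected De-En prompt examples into De-Fr (or De-Zh) examples.

    examples_de_en: list of (german, english) pairs selected for 1-shot prompting.
    external_mt(text, src_lang, tgt_lang) is a hypothetical client for the
    external system; the resulting pairs then prompt De->En test translation.
    """
    return [(de, external_mt(en, "en", new_lang)) for de, en in examples_de_en]
```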
