Prompting Large Language Model for Machine Translation: A Case Study
Dataset  Languages        Source        Test Set                   High-quality Pool (Default)  Low-quality Pool
Wiki     English          FLORES        eng_Latn.devtest (1012)    eng_Latn.dev (997)           En-Zh⋆ (0.79M)
Wiki     German           FLORES        deu_Latn.devtest (1012)    deu_Latn.dev (997)           De-En⋆ (1.57M)
Wiki     Chinese          FLORES        zho_Hans.devtest (1012)    zho_Hans.dev (997)           De-Zh⋆ (0.13M)
WMT      English-German   WMT           newstest2021 (1002/1000)   newstest2020 (1418)          -
WMT      English-Chinese  WMT           newstest2021 (1002/1948)   newstest2020 (1418)          -
IT       German-English   Multi-Domain  Test Set (2000)            -                            Train Set (0.22M)
Law      German-English   Multi-Domain  Test Set (2000)            -                            Train Set (0.47M)
Medical  German-English   Multi-Domain  Test Set (2000)            -                            Train Set (0.25M)
PDC      Chinese-English  News          Test Set (4858/148 Docs)   Dev Set (2881)               -
SacreBLEU (Post, 2018) (with the option -tok zh for Chinese), and a model-based metric, COMET↑ from unbabel-comet with the model wmt20-comet-da (Rei et al., 2020).

3. Prompting Strategy for MT

To perform MT, prompting needs to cast the translation problem into a language modeling problem via the prompt. Thus, the format of the prompt, including its wording, directly affects how the LLM understands the task and how it behaves. For MT, we are interested in the following research questions:

• Which template should we use for MT prompting? And what language for the template?

• Does demonstration matter for MT prompting? How to select optimal prompt examples?

We address them through extensive experiments on the Wiki Ablation sets.

Zero-shot prompting performance varies greatly across templates. We start with zero-shot prompting and explore the effect of different templates. Depending on how MT is described, and partially inspired by prior studies (Brown et al., 2020; Chowdhery et al., 2022; Wei et al., 2022a), we compare 6 templates and evaluate them on the Wiki Ablation sets covering 6 language pairs (En↔De, En↔Zh, De↔Zh). Table 2 shows the results (detailed results are listed in Table 10, Appendix). The template affects zero-shot quality substantially, and the simple template Ⓐ in English, which specifies just the source and target language names, achieves the best overall results. In follow-up experiments, we therefore focus on template Ⓐ.

A language-specific template delivers mixed results. Table 2 also shows the prompting results of German and Chinese templates, which often substantially underperform their English counterparts. Since German is not a major pretraining language in GLM-130B, a German template degrades translation substantially. By contrast, a Chinese template yields improved quality when translating into Chinese (see Table 10). Still, an English template works best on average. The preference of GLM-130B for an English template also shows that the level of language understanding and cross-lingual ability in GLM-130B varies across languages, even though it is pretrained on the same amount of monolingual Chinese and English tokens. This might be caused by the fact that more cross-lingual code-switched data is mixed into the English pretraining data (note English is used more globally than Chinese), but it might also suggest that improving the language understanding of LLMs requires more advanced training algorithms beyond scaling training data.
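To make the prompt formats concrete, the sketch below shows how a zero-shot prompt and a K-shot prompt could be assembled. The exact wording of template Ⓐ and of format (2) is specified in Table 2 and Section 2 of the paper and is not fully reproduced in this extract, so the template strings, function names and example sentences below are illustrative assumptions rather than the verbatim setup.

```python
# Minimal sketch of zero-shot and few-shot prompt construction for MT.
# The template wording is an assumption; the paper's template (A) only
# names the source and target languages around the input.

def zero_shot_prompt(src_lang: str, tgt_lang: str, text: str) -> str:
    # Template (A)-style prompt: name source and target language, then
    # leave the target side open for the model to complete.
    return f"{src_lang}: {text}\n{tgt_lang}:"

def few_shot_prompt(src_lang: str, tgt_lang: str,
                    demonstrations: list[tuple[str, str]], text: str) -> str:
    # K-shot prompting: concatenate K (source, target) demonstrations rendered
    # with the same template, then append the test input with an empty target.
    blocks = [f"{src_lang}: {s}\n{tgt_lang}: {t}" for s, t in demonstrations]
    blocks.append(f"{src_lang}: {text}\n{tgt_lang}:")
    return "\n\n".join(blocks)

# Example usage (hypothetical sentences):
demos = [("Guten Morgen.", "Good morning.")]
prompt = few_shot_prompt("German", "English", demos, "Wie geht es dir?")
```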
Table 2: COMET scores averaged over 6 language pairs for zero-shot prompting with different templates and different template languages
on Wiki Ablation sets. w/ and w/o denote whether line breaks are added to the template; ⋄ indicates the position of the line break.
[src] and [tgt] denote source and target test language name, respectively, and [input] denotes the test input; all of them are
placeholders. English, German and Chinese indicate template languages. Best results are shown in bold.
Figure 1: COMET scores for few-shot prompting as a function of the number of prompt examples (K = 1, 5, 10, 20) on Wiki Ablation
sets. For each setup, we randomly sample 100 times from the example pool and show the performance distribution via box plots. Dashed
red line denotes the zero-shot baseline; blue curve and shadow area denote the mean and standard deviation.
Using more prompt examples for demonstration improves translation significantly on average. We next study few-shot prompting following template Ⓐ but in format (2), with K varying from 1 to 20. We evaluate multiple demonstrations for each K via random sampling to reduce data biases. Figure 1 shows that the more examples are used, the better the average performance (more results are shown in Figure 5, Appendix), albeit at the cost of using more GPU memory and increasing the inference time per token, as in Figure 2.

Several features correlate with prompting performance significantly yet weakly. We thus turn to explore example selection for prompting. Our idea is to extract a couple of diverse features from demonstrations and examine whether any of them are informative enough to be used as an indicator for the selection. In this study, we simplify our analysis by focusing on 1-shot prompting, which ignores the ordering of prompt examples (we return to few-shot prompting later). In particular, we extract and analyze 7 features of a demonstration:
S(T)Length: the number of source (target) tokens;

LMScore: GLM-130B-based, length-normalized log-likelihood of the demonstration;

MTScore: translation quality of the prompt example from the COMET QE model wmt20-comet-qe-da (Rei et al., 2020);

SemScore: semantic score based on the cosine similarity of the demonstration's source and target sentence embeddings from LASER2 (Heffernan et al., 2022);

CaseSemScore-Src: similarity to the input, averaging SemScores between the test input and the demonstration's source;

CaseSemScore-Tgt: similar to CaseSemScore-Src but compares to the demonstration's target.

We sample multiple demonstrations randomly and inspect the Spearman's correlation between feature values and prompting performance. We consider a high-quality and a low-quality pool for sampling.

Figure 3: Visualization between COMET and LMScore for 1-shot prompting on Wiki Ablation sets. While correlations are significant, data points are scattered like clouds.

Table 4: BLEU and COMET scores for zero-shot and few-shot prompting on Wiki and WMT Full sets with different selection strategies. Ours: the proposed combined strategy; Random: random sampling; SemScore, LMScore and TLength denote selecting top-ranked examples based on the corresponding feature values. We select 3 demonstrations for each translation direction and report average performance; the final score is further averaged over different language pairs. Underlined results denote the best in each section, while bold results are the overall best.

Table 3 summarizes the results and Figure 3 illustrates the relation between COMET and LMScore (more results are given in Table 11 and Figures 6, 7, Appendix). With the high-quality pool, different demonstrations yield similar translation results (see blue points) despite their feature values varying greatly. Several features show insignificant and inconsistent correlation, particularly for De→En and Zh→En. This suggests that developing a selection policy for a high-quality example pool is non-trivial.

After mixing in demonstrations from the low-quality pool, the significance is strengthened. LMScore and CaseSemScore-Tgt show the highest correlation on average, followed by TLength and SemScore. MTScore behaves much worse, which might be caused by its instability in sentence-level evaluation (Moghe et al., 2022). However, we did not see a significant difference in Spearman's ρ between input-relevant and input-agnostic features (Agrawal et al., 2022), nor among surface-based, LLM-based and semantic-based features. Surprisingly, the simple feature S/TLength yields a reasonably high correlation. We argue that long examples could offer the LLM more signals about the task's input and output space. This finding suggests that researchers should select long unlabeled sentences for annotation to improve prompting. Yet, most Spearman's ρ values are well below 0.5, indicating that the relation is weak.
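The sketch below illustrates how the demonstration features and their correlation with prompting performance could be computed. The LASER2 sentence embeddings and per-token log-probabilities under GLM-130B are assumed to be produced elsewhere; only scipy.stats.spearmanr is a real library call, and all function names are ours.

```python
# Sketch of the demonstration features and the correlation analysis.
import numpy as np
from scipy.stats import spearmanr

def sem_score(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    # SemScore: cosine similarity of source and target sentence embeddings.
    return float(np.dot(src_emb, tgt_emb) /
                 (np.linalg.norm(src_emb) * np.linalg.norm(tgt_emb)))

def lm_score(token_logprobs: list[float]) -> float:
    # LMScore: length-normalized log-likelihood of the demonstration.
    return float(np.mean(token_logprobs))

def t_length(target_tokens: list[str]) -> int:
    # TLength: number of target-side tokens.
    return len(target_tokens)

def feature_correlation(feature_values: list[float], comet_scores: list[float]):
    # Spearman's rho between a feature and 1-shot COMET over sampled demos.
    rho, p_value = spearmanr(feature_values, comet_scores)
    return rho, p_value
```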
Figure 4: COMET scores for few-shot prompting with monolingual data on Wiki Ablation sets. Random Example: random sentence pairs; Source/Target Example Only: only use source or target data for prompting; Source/Target Example Aug: use pseudo-parallel data constructed via zero-shot prompting instead. For each setup, we randomly sample 50 demonstrations and report average performance.
In general, selecting prompt examples of high translation quality, high semantic similarity, high LLM likelihood, long sequence length and high similarity to test inputs are all preferable strategies. Unfortunately, none of them can guarantee optimal translation performance.

Using prompt examples selected based on the proposed features yields improved performance. We next verify the above findings on the Full sets. We explore selection strategies based on SemScore, LMScore and TLength (i.e., using top-ranked examples) as they show high average correlation. We did not analyze CaseSemScore-Tgt as it is more complicated and does not make a significant difference. Note that we excluded examples that are too long (more than 100 tokens, to reduce the inference cost) or too short (less than 10 tokens, to ensure informativeness) during the selection. We also consider 5-shot prompting, where we concatenate the top-ranked 5 examples in ascending order (Liu et al., 2022).

Table 4 shows that, with a high-quality pool, adopting the feature-based strategy is likely to outperform the random baseline, and the SemScore-based strategy performs well across different settings (detailed results are available in Tables 13 and 14, Appendix). These strategies also generalize to 5-shot prompting to some extent. For selection from the low-quality pool, we propose a combined strategy: we first choose the top-11K examples according to SemScore to filter out poor examples, the top-1K of which are also dropped as they tend to be uninformative (see Table 12 in Appendix); we then re-rank the rest with LMScore and retain the top-1K examples, upon which we further apply the TLength-based strategy. The ordering of SemScore, LMScore, and TLength roughly follows their relative 1-shot translation performance (SemScore > LMScore > TLength on average). In Table 4, this combined strategy outperforms the random one by varying degrees.
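A minimal sketch of this combined strategy is given below, assuming each candidate example carries precomputed sem_score, lm_score and t_length fields; the 11K/1K cut-offs follow the text, while the data layout and function name are assumptions.

```python
# Sketch of the combined selection strategy for a low-quality example pool.

def combined_selection(pool: list[dict], k: int = 3) -> list[dict]:
    # 1) Keep the top-11K examples by SemScore to filter out poor pairs,
    #    then drop the very top-1K, which tend to be uninformative
    #    near-duplicates (cf. Table 12).
    by_sem = sorted(pool, key=lambda ex: ex["sem_score"], reverse=True)
    candidates = by_sem[1000:11000]

    # 2) Re-rank the remaining examples by LMScore and retain the top-1K.
    by_lm = sorted(candidates, key=lambda ex: ex["lm_score"], reverse=True)
    candidates = by_lm[:1000]

    # 3) Finally apply the TLength-based strategy: pick the k longest targets.
    by_len = sorted(candidates, key=lambda ex: ex["t_length"], reverse=True)
    return by_len[:k]
```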
4. Monolingual Data for Prompting

A longstanding concern in MT is how to utilize unlabeled data to improve translation. While prompting enables few-shot learning, reducing the data requirement, exploring whether demonstration can benefit from monolingual examples is still valuable, both for MT research and for understanding the role of demonstration in prompting.

Min et al. (2022) argue that the key role of demonstration lies in its support of the input space, the label space and the prompt format, rather than the genuineness of the examples. They found that randomly replacing labels in demonstrations barely hurts performance on classification tasks. We reexamine this argument in the context of MT by studying the following three prompting settings: 1) random examples, which construct sentence pairs by randomly pairing monolingual sources and targets; 2) source or target examples only, which use monolingual source or target data alone for prompting.

Directly using monolingual data for demonstration doesn't work. Figure 4 (top) shows a totally different story (see Figures 8 and 9 in Appendix for more results): monolingual example-based demonstration almost always hurts translation, and the more examples are used, the more degeneration is yielded. Using random examples misleads the prompting and performs the worst in general; compared to target-only examples, using source examples yields slightly
better results, except when translating into Chinese. This indicates that the genuine source-target mapping should be retained in the demonstration, and also indicates that MT poses unique challenges which deserve more attention when studying prompting.

Pseudo-parallel examples obtained by forward-/back-translation benefit prompting. Inspired by data augmentation in MT (Sennrich et al., 2016b; Zhang & Zong, 2016), we next resort to constructing pseudo-parallel data. We first adopt GLM-130B to translate the source or target examples via zero-shot prompting, and then use the generated parallel examples as demonstration. Despite their low quality, Figure 4 (bottom) shows that this is an effective way to improve prompting, and using more examples often produces better results, partially echoing the findings on prompting-based unsupervised MT (Han et al., 2021; Patel et al., 2022). We also observe that back-translation (i.e., translating target monolingual examples) performs better and behaves more robustly than forward-translation (i.e., translating source examples instead), and even approaches prompting with genuine parallel examples.
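The sketch below shows how such pseudo-parallel demonstrations could be built from target-side monolingual data via zero-shot prompting (back-translation). Here, generate stands for a decoding call to GLM-130B and, like the template string, is an assumption of this sketch rather than the paper's implementation.

```python
# Sketch of building pseudo-parallel demonstrations via zero-shot prompting.

def back_translate_demos(tgt_sentences: list[str], src_lang: str,
                         tgt_lang: str, generate) -> list[tuple[str, str]]:
    # Back-translation: translate target-side monolingual sentences into the
    # source language, then pair (pseudo-source, target) as a demonstration.
    demos = []
    for tgt in tgt_sentences:
        prompt = f"{tgt_lang}: {tgt}\n{src_lang}:"
        pseudo_src = generate(prompt).strip()
        demos.append((pseudo_src, tgt))
    return demos

# Forward-translation works symmetrically: translate source-side monolingual
# sentences into the target language and pair (source, pseudo-target).
```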
5. Transfer Learning for Prompting

After obtaining a performant demonstration, we are interested in the extent to which its capability can be transferred across different settings, especially from one domain/language pair to another and from sentence-level to document-level translation. While previous studies demonstrate the feasibility with continuous prompts on classification tasks (Wang et al., 2021a), transfer for hard prompting on MT has never been investigated.

Assume that demonstrations D1 and D2 are selected in setting S1 and that D1 performs better (i.e., D1 > D2). We have the following research questions:

• Could we also expect D1 > D2 in setting S2?

• Could using demonstrations from S1 outperform zero-shot prompting in S2?

We next study them via experiments with 1-shot prompting.

Table 5: Spearman's ρ and relative performance for cross-lingual transfer under 1-shot prompting on Wiki Ablation sets (among En, De and Zh). When studying transfer from language pair S1 to S2, we randomly sample 300 demonstrations from the default pool of S1, and then evaluate them on the Ablation test sets for S1 and S2 respectively, based on which we compute the correlation. The performance is also averaged. ∆ Quality: relative quality against the zero-shot baseline for S2. Blue cells indicate positive gains. Source/Target Shared: average result for transfer settings where the source/target language is shared; Reversed: average result for the same language pair but in different directions.

Table 6: Spearman's ρ and relative performance (in COMET) for cross-domain transfer under 1-shot prompting. We explore transfer from Wiki to Multi-Domain using the Ablation sets. Correlation and performance are calculated in the same way as in cross-lingual transfer, except that we sample 200 demonstrations. ‡: statistically significant at p < 0.01; gray cells indicate insignificance.

Table 7: Results for transfer learning from sentence-level demonstration to document-level translation under 1-shot prompting on PDC Zh→En Full sets. We split each test document in PDC into non-overlapping chunks, each of which contains about 4 sentences. SemScore/LMScore: prompt example selection strategy; we apply them to PDC's default pool. We select 3 demonstrations and report average performance. d-BLEU: document-level BLEU; TC/CP/PT/TCP(↑): document-specific metrics (Sun et al., 2022).

Method     d-BLEU  TC    CP    PT    TCP
Zero-Shot  30.2    47.5  38.7  41.6  42.4
SemScore   30.5    53.0  34.4  43.2  42.9
LMScore    30.5    53.0  36.8  42.9  43.7

The superiority of a demonstration doesn't generalize across settings. If the ranking D1 > D2 holds across settings, the results of the same set of demonstrations in different settings should show a high and significant Spearman's correlation. However, the correlations in Tables 5 and 6 are weak and often insignificant (more results are given in Tables 15, 16, and 17), even for the same language pair in different directions (Reversed) and for similar domains (Wiki⇒WMT). This suggests that we need setting-specific demonstrations to get optimal translation quality.

Using out-of-setting demonstrations can benefit translation. However, we can still gain from using out-of-setting demonstrations, as shown by the positive gains in Tables 5 and 6, where we find that transfer in target-shared and reversed settings is relatively easier, and that transfer across distant domains can be successful, particularly when the in-setting example pool is of low quality. This is also supported by the transfer to document-level translation, where both BLEU and document-specific evaluation improve, as shown in Table 7. Results in Table 19 show that the transfer is unstable and can deliver negative results, i.e., worse than zero-shot prompting, partially resonating with previous findings (Lin et al., 2021). We leave the study of how to select prompt examples in transfer learning setups to future work.

Table 8: Case study of translation errors by prompting. Top: copying (in red), mistranslation of dates (in blue), misunderstanding of the source (wave lines); Bottom: prompt trap where the model fails to translate the prompt phrase (in bold).

Source: 根据三江源国家公园管理局长江源园区可可西里管理处统计，藏羚羊回迁数量总体呈逐年上升态势，2019年藏羚羊回迁数量为4860只，比2018年增加338只。
Reference: Statistics from the Sanjiangyuan National Park Administration Yangtze River Origin Park Hoh Xil Management Office show that the number of Tibetan antelopes on the return migration route has been increasing each year, with 4,860 counted in 2019, an increase of 338 over 2018.
GLM-130B (1-shot): According to the三江源国家公园管理局长江源园区可可西里管理处, the total number of re-migration of the Tibetan antelope has been on the rise since 2018, with 4,860 re-migrating in 2109, an increase of 338 compared to 2808.

Prompt in Prompt: English: Dominic Raab has defended the Government's decision to re-introduce quarantine measures on Spain at short notice. Translate from English to Chinese: Chinese:
Reference: 针对政府突然做出重新对西班牙实施隔离措施的决定，Dominic Raab 做出了辩解。从英文翻译成中文:
GLM-130B (zero-shot): 多米尼克·拉布(Dominic Raab)对政府决定重新引入西班牙的检疫措施表示支持。Translate from English to Chinese:

Table 9: COMET scores for direct vs. pivoting translation for De↔Zh on Wiki Full sets. In 1-shot prompting, we randomly sample 3 demonstrations and report average performance. Pivoting: source → English → target.

Setting    0-shot De→Zh  0-shot Zh→De  1-shot De→Zh  1-shot Zh→De
Direct     2.80          10.05         47.23         11.75
Pivoting   19.23         19.53         48.25         25.31

6. Discussion

Although prompting enables translation with decent performance, it still suffers from many (well-known) problems. Here, we briefly describe the problems we observed.

Prompting sometimes refuses to translate the input. Instead, it emits either empty or off-target outputs, i.e., it translates into a wrong target language. This occurs frequently when translating into Chinese, where the model often translates into traditional Chinese with garbled characters, causing unstable performance. Besides relying overly on the language model, prompting tends to under-translate the input, copy source phrases, produce code-switched output, mistranslate entities (e.g., dates) and generate hallucinations, as shown in Table 8.

We also observe a phenomenon specific to prompting: the prompt trap, where prompting behaves unpredictably when its input is mixed with prompt template phrases. In the second case in Table 8, the model copies the template phrases rather than translating them into Chinese. This means that translating the prompt itself (not just the input) becomes non-trivial, and that users may attack prompting-based translation systems by manipulating the input format.

We find that the translation quality between German and Chinese is very poor (see Table 13). We argue that the cross-lingual ability of GLM-130B mainly centers around English (although GLM-130B was pretrained on Chinese as well), and thus explore pivoting translation instead. Table 9 shows that pivoting through English greatly improves non-English translation. It is still unclear whether the current LLM pretraining recipe can achieve promising non-English-centric cross-lingual ability. We might need to consider adding parallel data into LLM pretraining or finetuning.
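A minimal sketch of the pivoting setup (source → English → target) with two zero-shot prompting calls is shown below; generate again denotes an assumed LLM decoding function, and the template strings are illustrative.

```python
# Sketch of pivoting translation through English with two prompting calls.

def pivot_translate(text: str, src_lang: str, tgt_lang: str, generate,
                    pivot_lang: str = "English") -> str:
    # Step 1: translate the source sentence into the pivot language (English).
    pivot_text = generate(f"{src_lang}: {text}\n{pivot_lang}:").strip()
    # Step 2: translate the pivot sentence into the target language.
    return generate(f"{pivot_lang}: {pivot_text}\n{tgt_lang}:").strip()
```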
7. Related Work

The capability of prompting heavily depends on its surface representation, where small modifications to the prompt can cause high variance in its performance. This inspires researchers to develop advanced prompting strategies to get the most out of LLMs. Gao et al. (2021) proposed to generate prompt templates automatically using T5 (Xue et al., 2021) rather than adopting manual templates. Liu et al. (2022) reported selecting prompt examples close to the test input via a kNN-based retriever, Sorensen et al. (2022) resorted to an information-theoretic approach based on mutual information, while Zhang et al. (2022b) formulated example selection as a sequential decision problem and solved it by reinforcement learning. For reasoning tasks, Wei et al. (2022c) developed chain-of-thought (CoT) prompting, letting the model output intermediate reasoning steps, which inspired researchers to further explore CoT selection (Fu et al., 2022) and decomposition (Zhou et al., 2022). In contrast to the studies just mentioned, which focus on NLP tasks other than MT, we explore prompting strategies exclusively for translation.

Prompting uses instructions to guide LLMs, which is closely
related to neural MT with special prefixes. In multilingual NMT, a target language tag is often appended to the source input to indicate the translation direction (Johnson et al., 2017; Arivazhagan et al., 2019; Zhang et al., 2020). Special attribute tags can also be used to control properties of the model output, such as politeness (Sennrich et al., 2016a), diversity (Shu et al., 2019), and quality (Caswell et al., 2019). Besides, retrieved phrases and sentences can be added to the input to improve translation quality (Zhang et al., 2018; Gu et al., 2018). With the popularity of prompting LLMs, researchers see value in incorporating prompts into neural MT (Li et al., 2022; Tan et al., 2021; Garcia & Firat, 2022). Still, these methods rely on pretraining or finetuning the model rather than prompting frozen LLMs.

Very recently, concurrent to our work, Vilar et al. (2022) examined the capability of prompting PaLM for translation and discovered that prompting with high-quality examples, even chosen randomly, performs on par with or better than using input-relevant examples. By contrast, Agrawal et al. (2022) explored strategies to select input-specific examples, and observed that input-relevant examples based on n-gram overlap significantly improve the capability of prompts. Our study resonates with both their findings and also explains their conflict: while quality and input-based semantic similarity correlate with prompting performance significantly, the correlation strength is unfortunately not strong enough, so using them as indicators to select examples may produce mixed results. Note that apart from example selection, we also studied using monolingual data and transfer learning for MT prompting.

8. Conclusion and Future Work

In this paper, we presented a systematic study on prompting for MT, exploring topics ranging from prompting strategy and the use of unlabelled monolingual data to transfer learning. We found that the prompt template and demonstration example selection both have a substantial impact on translation. Some prompt example features correlate significantly with prompting performance; treating them as criteria for example selection benefits translation to some extent, but not consistently, as the correlations are not strong enough.

Prompting for MT requires retaining the source-target mapping signals in the demonstration. Directly applying monolingual data for prompting sounds interesting but doesn't work. Constructing pseudo-parallel prompt examples by back-/forward-translation via zero-shot prompting is a simple yet effective solution. Regarding transfer learning, we saw positive results when applying a (sentence-level) demonstration to other domains, other language pairs or document-level translation. Unfortunately, the optimality of the demonstration doesn't generalize across settings and the transfer performance is also unstable. We argue that MT provides a set of unique challenges and call for more efforts on evaluating prompting LLMs for MT.

Prompting also faces a number of other issues, like off-target generation and prompt traps, which we plan to address in the future. We acknowledge that our study heavily depends on the INT-4 quantized GLM-130B, which, unlike GPT and PaLM, was pretrained with both bidirectional and unidirectional training objectives. The quantization might weaken the model's capability and deteriorate some unknown aspects. We are thus interested in examining whether our findings generalize to other LLMs, like GPT-3, OPT and PaLM. We would also like to explore further how to improve the cross-lingual ability of LLMs. Finally, while our study focuses on prompting, how to finetune LLMs for MT and when/whether finetuning is preferred over prompting are yet to be investigated.

Acknowledgments

We thank the reviewers for their insightful comments. This work was funded by UK Research and Innovation (UKRI) under the UK government's Horizon Europe funding guarantee [grant number 10039436 – UTTER]. The computations described in this research were performed using the Baskerville Tier 2 HPC service (https://www.baskerville.ac.uk/). Baskerville was funded by the EPSRC and UKRI through the World Class Labs scheme (EP/T022221/1) and the Digital Research Infrastructure programme (EP/W032244/1) and is operated by Advanced Research Computing at the University of Birmingham.

References

Agrawal, S., Zhou, C., Lewis, M., Zettlemoyer, L., and Ghazvininejad, M. In-context examples selection for machine translation. arXiv preprint arXiv:2212.02437, 2022.

Aharoni, R. and Goldberg, Y. Unsupervised domain clusters in pretrained language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7747–7763, Online, 2020.

Akhbardeh, F., Arkhangorodsky, A., Biesialska, M., Bojar, O., Chatterjee, R., Chaudhary, V., Costa-jussa, M. R., España-Bonet, C., Fan, A., Federmann, C., Freitag, M., Graham, Y., Grundkiewicz, R., Haddow, B., Harter, L., Heafield, K., Homan, C., Huck, M., Amponsah-Kaakyire, K., Kasai, J., Khashabi, D., Knight, K., Kocmi, T., Koehn, P., Lourie, N., Monz, C., Morishita, M.,
Nagata, M., Nagesh, A., Nakazawa, T., Negri, M., Pal, S., Tapo, A. A., Turchi, M., Vydrin, V., and Zampieri, M. Findings of the 2021 conference on machine translation (WMT21). In Proceedings of the Sixth Conference on Machine Translation, pp. 1–88, Online, 2021.

Arivazhagan, N., Bapna, A., Firat, O., Lepikhin, D., Johnson, M., Krikun, M., Chen, M. X., Cao, Y., Foster, G., Cherry, C., Macherey, W., Chen, Z., and Wu, Y. Massively multilingual neural machine translation in the wild: Findings and challenges, 2019.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901, 2020.

Caswell, I., Chelba, C., and Grangier, D. Tagged back-translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pp. 53–63, Florence, Italy, 2019.

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, 2019.

Fu, Y., Peng, H., Sabharwal, A., Clark, P., and Khot, T. Complexity-based prompting for multi-step reasoning. arXiv preprint arXiv:2210.00720, 2022.

Gao, T., Fisch, A., and Chen, D. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 3816–3830, Online, 2021.

Garcia, X. and Firat, O. Using natural language prompts for machine translation. arXiv preprint arXiv:2202.11822, 2022.

Goyal, T., Li, J. J., and Durrett, G. News summarization and evaluation in the era of GPT-3. arXiv preprint arXiv:2209.12356, 2022.

Gu, J., Wang, Y., Cho, K., and Li, V. O. Search engine guided neural machine translation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, AAAI'18/IAAI'18/EAAI'18. AAAI Press, 2018.

Han, J. M., Babuschkin, I., Edwards, H., Neelakantan, A., Xu, T., Polu, S., Ray, A., Shyam, P., Ramesh, A., Radford, A., et al. Unsupervised neural machine translation with generative language models only. arXiv preprint arXiv:2110.05448, 2021.

Heffernan, K., Çelebi, O., and Schwenk, H. Bitext mining using distilled sentence representations for low-resource languages. arXiv preprint arXiv:2205.12654, 2022.

Johnson, M., Schuster, M., Le, Q., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M., and Dean, J. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Li, Y., Yin, Y., Li, J., and Zhang, Y. Prompt-driven neural machine translation. In Findings of the Association for Computational Linguistics: ACL 2022, pp. 2579–2590, Dublin, Ireland, 2022.

Lin, X. V., Mihaylov, T., Artetxe, M., Wang, T., Chen, S., Simig, D., Ott, M., Goyal, N., Bhosale, S., Du, J., et al. Few-shot learning with multilingual language models. arXiv preprint arXiv:2112.10668, 2021.

Liu, J., Shen, D., Zhang, Y., Dolan, B., Carin, L., and Chen, W. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pp. 100–114, Dublin, Ireland and Online, 2022.

Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837, 2022.

Moghe, N., Sherborne, T., Steedman, M., and Birch, A. Extrinsic evaluation of machine translation metrics. arXiv preprint arXiv:2212.10297, 2022.

NLLB Team, Costa-jussà, M. R., Cross, J., Çelebi, O., Elbayad, M., Heafield, K., Heffernan, K., Kalbassi, E., Lam, J., Licht, D., Maillard, J., Sun, A., Wang, S., Wenzek, G., Youngblood, A., Akula, B., Barrault, L., Gonzalez, G. M., Hansanti, P., Hoffman, J., Jarrett, S., Sadagopan, K. R., Rowe, D., Spruit, S., Tran, C., Andrews, P., Ayan, N. F., Bhosale, S., Edunov, S., Fan, A., Gao, C., Goswami, V., Guzmán, F., Koehn, P., Mourachko, A., Ropers, C., Saleem, S., Schwenk, H., and Wang, J. No language left behind: Scaling human-centered machine translation. 2022.

Patel, A., Li, B., Rasooli, M. S., Constant, N., Raffel, C., and Callison-Burch, C. Bidirectional language models are also few-shot learners. arXiv preprint arXiv:2209.14500, 2022.

Post, M. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186–191, Brussels, Belgium, 2018.

Qian, L., Zhou, Y., Zheng, Z., Zhu, Y., Lin, Z., Feng, J., Cheng, S., Li, L., Wang, M., and Zhou, H. The VolcTrans GLAT system: Non-autoregressive translation meets WMT21. In Proceedings of the Sixth Conference on Machine Translation, pp. 187–196, Online, 2021.

Rei, R., Stewart, C., Farinha, A. C., and Lavie, A. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2685–2702, Online, 2020.

Reynolds, L. and McDonell, K. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, CHI EA '21, New York, NY, USA, 2021.

Schwenk, H., Chaudhary, V., Sun, S., Gong, H., and Guzmán, F. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 1351–1361, Online, 2021.

Sennrich, R., Haddow, B., and Birch, A. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 35–40, San Diego, California, 2016a.

Sennrich, R., Haddow, B., and Birch, A. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 86–96, Berlin, Germany, 2016b.

Shu, R., Nakayama, H., and Cho, K. Generating diverse translations with sentence codes. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1823–1827, Florence, Italy, 2019.

Sorensen, T., Robinson, J., Rytting, C., Shaw, A., Rogers, K., Delorey, A., Khalil, M., Fulda, N., and Wingate, D. An information-theoretic approach to prompt engineering without ground truth labels. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 819–862, Dublin, Ireland, 2022.

Sun, Z., Wang, M., Zhou, H., Zhao, C., Huang, S., Chen, J., and Li, L. Rethinking document-level neural machine translation. In Findings of the Association for Computational Linguistics: ACL 2022, pp. 3537–3548, Dublin, Ireland, 2022.

Tan, Z., Zhang, X., Wang, S., and Liu, Y. MSP: Multi-stage prompting for making pre-trained language models better translators. arXiv preprint arXiv:2110.06609, 2021.

Vilar, D., Freitag, M., Cherry, C., Luo, J., Ratnakar, V., and Foster, G. Prompting PaLM for translation: Assessing strategies and performance. arXiv preprint arXiv:2211.09102, 2022.

Wang, C., Wang, J., Qiu, M., Huang, J., and Gao, M. TransPrompt: Towards an automatic transferable prompting framework for few-shot text classification. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 2792–2802, Online and Punta Cana, Dominican Republic, 2021a.

Wang, L., Li, M., Liu, F., Shi, S., Tu, Z., Wang, X., Wu, S., Zeng, J., and Zhang, W. Tencent translation system for the WMT21 news translation task. In Proceedings of the Sixth Conference on Machine Translation, pp. 216–224, Online, 2021b.

Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022a.

Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022b.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022c.

Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A., and Raffel, C. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 483–498, Online, 2021.

Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X., Tam, W. L., Ma, Z., Xue, Y., Zhai, J., Chen, W., Zhang, P., Dong, Y., and Tang, J. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.

Zeng, X., Liu, Y., Li, E., Ran, Q., Meng, F., Li, P., Xu, J., and Zhou, J. WeChat neural machine translation systems for WMT21. In Proceedings of the Sixth Conference on Machine Translation, pp. 243–254, Online, 2021.

Zhang, B., Williams, P., Titov, I., and Sennrich, R. Improving massively multilingual neural machine translation and zero-shot translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1628–1639, Online, 2020.

Zhang, J. and Zong, C. Exploiting source-side monolingual data in neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1535–1545, 2016.

Zhang, J., Utiyama, M., Sumita, E., Neubig, G., and Nakamura, S. Guiding neural machine translation with retrieved translation pieces. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018.

Zhao, Z., Wallace, E., Feng, S., Klein, D., and Singh, S. Calibrate before use: Improving few-shot performance of language models. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 12697–12706. PMLR, 2021.

Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Bousquet, O., Le, Q., and Chi, E. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.
A. Appendix
Figure 5: COMET (top) and BLEU (bottom) scores for few-shot prompting as a function of the number of prompt examples (K =
1, 5, 10, 20) on Wiki Ablation sets. For each setup, we randomly sample 100 times from the example pool and show the performance
distribution via box plots. Dashed red line denotes the zero-shot baseline; blue curve and shadow area denote the mean and standard
deviation.
Figure 6: Scatter plotting between BLEU and LMScore for 1-shot prompting on Wiki De↔En, En↔Zh Ablation sets.
ID | BLEU: De→En  En→De  De→Zh  Zh→De  En→Zh  Zh→En  Avg | COMET: De→En  En→De  De→Zh  Zh→De  En→Zh  Zh→En  Avg
English Template Without Line Break
A 38.00 23.10 23.30 12.10 31.50 27.90 25.98 70.83 41.95 4.34 15.92 35.68 63.98 38.78
B 8.30 9.00 2.80 2.40 6.60 8.20 6.22 -45.75 -70.27 -140.43 -119.82 -112.38 -43.10 -88.62
C 30.60 2.10 5.50 1.10 1.10 8.30 8.12 29.78 -142.36 -117.20 -117.14 -120.57 -58.32 -87.63
D 26.10 0.00 5.10 0.00 0.20 0.60 5.33 -1.20 -160.59 -124.15 -157.62 -130.51 -108.71 -113.80
E 35.90 18.20 26.10 9.60 16.00 22.30 21.35 68.06 5.41 27.53 -6.46 -5.58 35.93 20.81
F 33.50 5.60 25.10 0.80 0.20 9.10 12.38 61.09 -62.31 22.71 -112.79 -50.84 -20.71 -27.14
English Template With Line Break
A 36.60 21.80 25.10 11.40 26.90 26.90 24.78 67.97 37.41 7.24 9.46 4.89 60.08 31.17
B 7.70 7.70 5.00 2.70 13.20 10.00 7.72 -85.97 -81.79 -126.58 -113.27 -55.64 -48.82 -85.35
C 28.00 4.40 7.70 0.70 13.30 14.50 11.43 36.10 -99.01 -118.99 -133.39 -74.19 -23.00 -68.75
D 25.20 1.60 4.20 0.10 4.90 5.40 6.90 13.96 -121.58 -125.36 -148.29 -78.78 -74.91 -89.16
E 35.70 20.00 24.40 3.90 28.30 20.30 22.10 66.08 22.21 15.62 -55.41 13.36 38.30 16.69
F 33.60 9.30 23.60 3.00 6.70 17.90 15.68 57.46 -45.84 14.73 -69.69 -30.63 32.68 -6.88
German Template Without Line Break
A 20.00 15.70 1.60 3.10 0.70 7.10 8.03 23.09 4.61 -70.84 -47.51 -65.61 -0.66 -26.15
B 5.60 2.10 0.10 1.60 0.20 1.10 1.78 -82.99 -152.26 -174.72 -132.06 -162.79 -110.99 -135.97
C 4.60 5.40 0.30 3.70 0.00 4.10 3.02 -57.63 -108.36 -120.99 -125.18 -135.21 -90.42 -106.30
D 3.50 0.10 0.00 0.00 0.00 0.10 0.62 -115.55 -168.13 -166.07 -169.21 -161.27 -142.57 -153.80
E 17.30 19.00 0.20 8.50 2.30 19.60 11.15 14.19 6.47 -100.92 -25.14 -50.42 9.85 -24.33
F 6.30 4.80 0.20 7.30 0.10 11.70 5.07 3.88 -65.86 -44.76 -27.91 -60.31 -11.22 -34.36
German Template With Line Break
A 25.40 20.20 6.40 3.50 8.00 9.20 12.12 38.47 31.45 -80.14 -47.22 -50.26 8.84 -16.48
B 15.60 7.80 2.60 1.00 0.50 0.80 4.72 -20.65 -81.28 -125.21 -137.02 -125.31 -108.45 -99.65
C 15.40 5.70 5.70 3.00 6.00 6.70 7.08 -23.46 -80.15 -86.27 -104.10 -87.18 -58.23 -73.23
D 2.80 0.50 0.00 0.00 0.10 1.10 0.75 -95.30 -154.76 -140.51 -155.91 -137.36 -100.08 -130.65
E 24.70 19.50 10.40 8.50 11.10 17.20 15.23 35.12 3.95 -62.48 -18.32 -27.61 35.26 -5.68
F 7.60 17.20 0.50 8.60 3.90 11.30 8.18 13.01 9.10 -43.63 -10.88 -46.46 23.54 -9.22
Chinese Template Without Line Break
A 37.60 15.50 28.30 2.10 33.40 15.10 22.00 67.41 -5.40 45.24 -74.78 53.71 2.72 14.82
B 23.60 6.30 14.50 0.50 19.30 1.90 11.02 -6.41 -90.63 -12.10 -159.66 -9.24 -121.29 -66.55
C 11.40 3.20 14.30 0.40 20.80 5.00 9.18 -32.55 -114.57 -9.91 -140.54 2.89 -85.58 -63.38
D 17.10 6.40 15.90 0.20 19.60 1.90 10.18 -34.15 -101.69 -24.36 -166.15 -9.20 -125.20 -76.79
E 29.00 8.00 27.00 0.40 34.90 16.10 19.23 35.55 -63.09 37.06 -119.13 54.14 3.80 -8.61
F 31.70 3.70 24.80 0.10 27.20 11.80 16.55 35.65 -105.74 22.97 -129.71 5.61 -34.09 -34.22
Chinese Template With Line Break
A 26.80 14.70 24.70 3.30 33.80 22.90 21.03 24.46 -84.74 24.76 -64.07 52.65 40.45 -1.08
B 23.70 6.30 11.90 0.10 14.40 0.60 9.50 -11.65 -102.50 -63.95 -161.96 -46.84 -128.12 -85.84
C 12.10 3.00 13.80 0.80 21.20 9.90 10.13 -36.39 -105.55 -42.16 -151.06 -15.41 -74.90 -70.91
D 14.10 3.20 15.10 0.20 20.00 2.50 9.18 -19.15 -106.69 -19.34 -154.73 -11.51 -94.82 -67.71
E 28.60 8.00 26.50 0.90 32.30 21.40 19.62 8.71 -118.14 15.34 -124.30 21.18 14.91 -30.38
F 26.90 3.40 26.10 0.20 25.80 16.00 16.40 11.58 -120.31 10.33 -129.61 -21.19 -20.52 -44.95
Table 10: Detailed zero-shot results for prompting with different templates and different template languages on Wiki Ablation sets.
Template ⃝A in English achieves the overall best performance measured by BLEU and COMET. Avg: average result over different
language pairs. Best results in each section are underlined; best results in each column are in bold.
Table 11: Detailed Spearman’s ρ between demonstration features and their prompting performance (COMET and BLEU) for 1-shot
prompting on Wiki Ablation sets. We randomly sample 600 demonstrations from each pool to calculate the correlation. High-quality
examples are from the default selection pool while Low-quality examples are from WikiMatrix.v1. † /‡ : statistically significant at
p < 0.05/0.01. Gray cells indicate insignificance; Red cells indicate ρ > 0.5.
Figure 7: Scatter plotting between COMET/BLEU and LMScore for 1-shot prompting on Wiki De↔Zh Ablation sets.
Table 12: Top-ranked parallel examples according to SemScore on WikiMatrix.v1 En-De and En-Zh. Despite showing high semantic
similarity, these examples are not very informative. We thus dropped them at selection.
Method | BLEU: De→En  En→De  De→Zh  Zh→De  En→Zh  Zh→En  Avg | COMET: De→En  En→De  De→Zh  Zh→De  En→Zh  Zh→En  Avg
NLLB-200 (54.5B)⋆ 45.80 39.60 25.90 20.60 31.20 31.90 32.50 75.43 66.37 26.22 54.01 32.97 69.82 54.14
Zero-Shot 37.80 20.50 21.70 9.60 28.60 26.30 24.08 68.30 29.96 2.80 10.05 29.17 63.25 33.92
1-Shot Translation (high-quality pool)
Random 37.67 21.23 28.70 9.07 34.87 26.30 26.31 68.77 35.56 47.23 11.75 60.69 65.75 48.29
SemScore 38.40 21.37 29.17 9.47 35.50 26.50 26.73 69.04 36.06 48.79 14.63 60.54 66.98 49.34
LMScore 37.80 21.43 28.13 9.40 35.40 26.73 26.48 68.55 35.49 43.54 13.14 59.84 66.98 47.92
TLength 37.00 21.80 28.57 9.47 35.90 26.53 26.54 67.79 37.00 45.66 13.63 61.87 66.45 48.73
5-Shot Translation (high-quality pool)
Random 39.03 22.00 29.37 10.07 37.07 27.20 27.46 70.30 36.46 51.77 16.74 63.77 67.62 51.11
SemScore 38.13 21.93 30.50 10.20 36.87 26.50 27.36 70.12 38.40 52.29 16.88 64.40 67.85 51.66
LMScore 38.87 22.03 30.20 9.97 35.83 26.13 27.17 69.74 37.01 51.01 16.63 61.74 67.74 50.65
TLength 38.57 22.00 29.50 10.00 35.90 26.53 27.08 68.94 37.16 50.80 15.80 63.01 67.29 50.50
1-shot Translation (Low-quality Pool)
Random 36.73 20.53 22.23 8.23 34.63 26.13 24.75 66.82 34.15 10.11 -1.94 57.97 66.08 38.86
Ours 37.90 21.27 20.50 9.37 34.47 26.17 24.94 68.46 33.78 0.19 12.07 58.05 66.75 39.88
Table 13: Detailed test results for zero-shot and few-shot prompting on Wiki Full sets with different selection strategies. ⋆ : results from
NLLB Team et al. (2022); Ours: the proposed combined strategy; Random: random sampling; SemScore, LMScore and TLength denote
selecting top-ranked examples based on the corresponding feature values. We select 3 demonstrations for each setup and report the
average. Avg: average result over language pairs. Underlined results denote the best in each section, while Bold results are the overall
best.
Method | BLEU: De→En  En→De  En→Zh  Zh→En  Avg | COMET: De→En  En→De  En→Zh  Zh→En  Avg
WMT SOTA System 35.05b 31.32a 36.92a 33.41c 34.18 61.28 54.87 50.11 48.35 53.65
Zero-Shot 28.30 15.70 20.70 16.80 20.38 46.01 13.32 4.63 7.92 17.97
1-Shot Translation (high-quality pool)
Random 25.63 16.37 26.03 17.03 21.27 45.90 16.89 40.88 19.14 30.70
SemScore 26.90 16.03 26.30 18.07 21.82 46.39 15.13 41.13 22.49 31.28
LMScore 27.53 15.70 25.43 17.70 21.59 47.47 17.53 38.95 19.29 30.81
TLength 25.60 16.33 25.80 17.43 21.29 43.47 18.24 42.17 18.82 30.68
5-Shot Translation (high-quality pool)
Random 26.40 17.10 26.23 17.53 21.82 48.36 20.19 43.97 22.95 33.87
SemScore 27.30 16.57 26.93 18.67 22.37 49.33 18.83 43.49 25.54 34.30
LMScore 25.90 16.87 26.47 18.93 22.04 47.77 20.83 44.76 27.41 35.19
TLength 25.80 17.03 26.55 17.63 21.75 47.34 20.78 45.17 23.85 34.29
1-shot Translation (Low-quality Pool)
Random 27.33 15.53 25.30 20.07 22.06 45.29 14.21 36.83 26.49 30.70
Ours 27.63 15.97 25.23 20.10 22.23 47.16 15.01 34.48 26.82 30.87
Table 14: Detailed test results on WMT Full sets. a ,b ,c : results from Zeng et al. (2021), Qian et al. (2021), and Wang et al. (2021b),
respectively.
Figure 8: Results for few-shot prompting with monolingual data on Wiki Ablation sets for De↔Zh.
Figure 9: BLEU scores for few-shot prompting with monolingual data on Wiki Ablation sets.
Prompt Language | BLEU: De→En  En→De  De→Zh  Zh→De  En→Zh  Zh→En | COMET: De→En  En→De  De→Zh  Zh→De  En→Zh  Zh→En
De→En - 0.06 0.08 0.12 0.13 0.13 - -0.02 0.09 0.12 -0.01
En→De 0.07 - 0.14‡ 0.19‡ 0.17‡ 0.11† 0.01 - 0.07 0.21‡ 0.14‡ 0.17‡
De→Zh -0.08 0.06 - 0.14‡ 0.24‡ -0.05 0.02 0.15‡ - 0.08 0.40‡ 0.02
Zh→De 0.00 0.26‡ 0.26‡ - 0.05 0.01 -0.03 0.21‡ 0.22‡ - 0.13† 0.15‡
En→Zh 0.01 -0.01 0.24‡ 0.25‡ - 0.19‡ 0.04 -0.01 0.22‡ 0.21‡ - 0.03
Zh→En 0.15‡ -0.16‡ 0.14‡ 0.34‡ 0.15‡ - 0.25‡ 0.09 0.14‡ 0.21‡ 0.03 -
Table 15: Detailed Spearman’s ρ for cross-lingual transfer under 1-shot prompting on Wiki Ablation sets. Gray cells indicate
insignificance.
Prompt Language | BLEU: De→En  En→De  De→Zh  Zh→De  En→Zh  Zh→En | COMET: De→En  En→De  De→Zh  Zh→De  En→Zh  Zh→En
De→En - -0.32 5.02 -0.86 1.29 0.00 - -1.08 35.04 2.71 7.00 -0.01
En→De -0.69 - 3.88 -0.69 1.21 -0.41 -0.46 - 26.01 1.56 6.31 -2.40
De→Zh -0.63 -0.48 - -0.65 4.38 0.04 0.92 -3.68 - 4.16 23.51 -0.34
Zh→De -0.66 -0.86 6.84 - 3.23 0.19 0.71 -6.15 43.67 - 17.54 0.51
En→Zh -1.54 -1.17 6.23 -1.44 - -1.50 -6.00 -4.47 41.77 -1.79 - -2.20
Zh→En -1.12 -1.00 1.78 -1.11 4.81 - -2.63 -3.85 15.25 3.90 25.29 -
Table 16: Detailed translation results (relative against the zero-shot baseline) for cross-lingual transfer under 1-shot prompting on Wiki
Ablation sets. Blue cells indicate positive gains.
Table 17: Spearman’s ρ and relative performance (in BLEU) for cross-domain transfer under 1-shot prompting.
Setting | 0-shot: De→Zh  Zh→De | 1-shot: De→Zh  Zh→De
Direct 21.70 9.60 28.70 9.07
Pivoting 24.4 11.5 29.47 11.47
Table 18: BLEU scores for direct vs. pivoting translation for De↔Zh on Wiki Full sets.
Method | BLEU: IT  Law  Medical  Avg | COMET: IT  Law  Medical  Avg
Zero-Shot 32.4 28.5 31.3 30.7 12.39 32.85 33.99 26.41
1-shot Translation (Low-quality Pool)
Random 33.70 27.33 30.80 30.61 29.12 30.22 34.08 31.14
Ours 32.93 27.60 33.23 31.26 29.95 29.60 41.37 33.64
Cross-domain Transfer
Wiki⇒Multi-Domain 32.90 26.73 31.87 30.50 25.08 33.27 37.85 32.07
WMT⇒Multi-Domain 30.87 25.37 31.43 29.22 12.98 30.34 34.80 26.04
Cross-lingual Transfer
De→Fr ⇒ De→En 33.45 28.67 32.90 31.67 29.43 34.76 39.31 34.50
Fr→De ⇒ De→En 32.77 28.53 31.73 31.01 27.68 34.90 33.75 32.11
Zh→Fr ⇒ De→En 15.80 25.53 19.70 20.34 -37.03 7.38 -27.38 -19.01
Fr→Zh ⇒ De→En 19.17 26.95 26.35 24.16 -17.62 22.42 4.37 3.06
Table 19: Cross-lingual and cross-domain transfer results on Multi-Domain Full sets under 1-shot prompting. For cross-domain transfer,
we adopt the SemScore-based strategy for example selection using the default Wiki/WMT Full candidate pool; for cross-lingual transfer,
we extend the selected examples in Multi-Domain 1-shot translation (low-quality pool) by translating the English sentences to French and
Chinese using Google Translate. Results are averaged over 3 different demonstrations.