grulkar et al., 2022). Early work in this area demonstrated some potential for smaller models (Üstün and Cooper Stickland, 2022), and the accessibility that PEFT provides designers in terms of fine-tuning on low-to-mid performance hardware setups renders it desirable. One of the most popular forms of PEFT freezes a LLM's weights and adds Low-Rank Adaptation (LoRA) (Hu et al., 2022) adapters between layers (other forms of PEFT exist, but adapter-based PEFT is extremely common so we refer to adapter-based PEFT simply as PEFT hereafter). While fully fine-tuned NMT LLMs tend to suffer from some level of catastrophic forgetting (Kirkpatrick et al., 2017), intuitively, PEFT-based NMT LLMs should not suffer from any loss of off-task performance, as adapters can be loaded or detached depending on whether or not a given user is prompting for a translation. Given all of these factors, PEFT is an attractive option for NMT LLMs.

steps before the necessary context is available in the incremental source.

There are two classical, high-level approaches to scheduling the write and read decisions of SimulMT, those being static schedules like wait-k (Ma et al., 2019) or adaptive schedules which are flexible and learned, such as variants of monotonic multi-head attention, adaptive wait-k compositions, wait-if-worse, decision state assisted SMT, and others (Grissom II et al., 2014; Gu et al., 2017; Arivazhagan et al., 2019; Zheng et al., 2020). Wait-k remains a particularly popular baseline strategy given its ease of application during training and during inference. It functions by retaining a k-lagging factor between the source context (either in tokens or in words) x and the translation hypothesis y. We can model a typical wait-k schedule's probability of generating a given output sequence, provided some source sequence, with Equation 1:
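For reference, a standard prefix-to-prefix formulation of the wait-k generation probability (in the style of Ma et al., 2019), which we take to be the relationship Equation 1 captures, is:

\[ p_{\text{wait-}k}(y \mid x) \;=\; \prod_{t=1}^{|y|} p\big(y_t \mid x_{\le \min(t+k-1,\,|x|)},\ y_{<t}\big) \tag{1} \]

Each target word y_t is conditioned on at most the first t + k - 1 source words, encoding the k-lagging factor described above.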
then backpropagated. The fine-tuning wrapper of Simul-LLM ensures that the model only learns from data it is intended to generate post-prompt via a DataCollator object and a specified response template.

Supervised Fine-tuning Agent: While custom fine-tuning and optimization loops can be useful, existing and popular options are quite capable. Given that, Simul-LLM's fine-tuning wrapper surrounds the transformers supervised fine-tuning trainer and interfaces with it directly. Hyperparameters are fed directly to it, as well as PEFT configs, quantization configs, and a specified prompt structure. The trainer itself handles saving model (or LoRA adapter) checkpoints, engaging in validation runs, and optimizing as defined by the provided hyperparameters.
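For illustration, a minimal sketch of such a wrapper is shown below, using trl's SFTTrainer (the common Hugging Face supervised fine-tuning trainer) together with a completion-only data collator and the PEFT/quantization settings reported in Section 6.2; the model name, dataset path, response template string, and exact argument names are assumptions for the example rather than Simul-LLM's actual code.

```python
# Hypothetical sketch of a Simul-LLM-style fine-tuning wrapper built on trl's SFTTrainer.
# Model name, dataset file, and the "<a>:" response template are illustrative assumptions.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

model_name = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# nf4 quantization during fine-tuning, as described in Section 6.2.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)

# LoRA adapter config matching the reported hyperparameters (r=64, alpha=16, dropout=0.1).
peft_config = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.1, task_type="CAUSAL_LM")

# Only tokens after the response template contribute to the loss.
collator = DataCollatorForCompletionOnlyLM(response_template="<a>:", tokenizer=tokenizer)

dataset = load_dataset("json", data_files="expanded_waitk_train.json", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
    data_collator=collator,
    dataset_text_field="text",  # each example is one prompt + single-word target string
    args=TrainingArguments(output_dir="ckpts", learning_rate=3e-4,
                           warmup_steps=4000, per_device_train_batch_size=40,
                           num_train_epochs=1, bf16=True),
)
trainer.train()  # checkpointing, validation runs, and optimization handled by the trainer
```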
3.2 Evaluation Agent and Features

Evaluation agents for Simul-LLM are similarly built for ease of use and extensibility, in addition to customizability towards complex translation schedules and decoding strategies, while seamlessly interfacing with the preeminent SimulMT evaluation framework, SimulEval (Ma et al., 2020). This is depicted in Figure 3 and includes the following features.

Classical SimulMT Translation Scheduler: In the interest of baseline accessibility, Simul-LLM evaluation agents support wait-k translation schedules for SimulMT, given their ease of application. Further adaptive or otherwise more involved translation schedulers can be quickly constructed and applied, assuming no reliance on fine-tuning.

Support for Multiple Decoding Strategies: Variable latency constraints for possible inference environments demand flexibility in decoding strategies. As such, evaluation agents for Simul-LLM support several decoding strategies, including greedy and naive decoding, subword-based beam search for single-word decoding, and variations on Speculative Beam Search (SBS) (Zheng et al., 2019), including single word-based SBS and chunk-wise SBS.

Efficient Inference via Custom Generation Stopping Criteria: To avoid excessive generation, Simul-LLM evaluation agents employ custom generation stopping criteria where applicable. This is most relevant in the case of greedy decoding based on singular words, where generation can halt on the first white space delimiter being detected, but constructing more complex criteria is simple given provided examples.
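As a rough illustration of what such a criterion can look like with the transformers generation API (the class below is a hypothetical example, not the framework's provided one):

```python
# Hypothetical whitespace-based stopping criterion for single-word greedy decoding:
# generation halts once the newly generated text contains a whitespace delimiter,
# i.e., one complete word has been produced.
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnWhitespace(StoppingCriteria):
    def __init__(self, tokenizer, prompt_length):
        self.tokenizer = tokenizer
        self.prompt_length = prompt_length  # number of prompt tokens to skip

    def __call__(self, input_ids, scores, **kwargs) -> bool:
        new_text = self.tokenizer.decode(
            input_ids[0, self.prompt_length:], skip_special_tokens=True)
        # Stop as soon as a space or newline follows some generated word content.
        return any(ch.isspace() for ch in new_text.lstrip())

# Usage sketch (model, tokenizer, and tokenized inputs assumed to exist):
# criteria = StoppingCriteriaList([StopOnWhitespace(tokenizer, inputs["input_ids"].shape[1])])
# output_ids = model.generate(**inputs, stopping_criteria=criteria, max_new_tokens=32)
```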
Scoring and Latency via SimulEval: SimulEval (Ma et al., 2020) is the premier SimulMT evaluation framework. Simul-LLM evaluation agents interface seamlessly with SimulEval, which handles the incremental source context and manages translation scoring and latency tracking regarding the translation hypothesis.

4 Adapting NMT LLMs to SimulMT

Existing LLMs that have been fine-tuned for classical NMT may have the potential to be employed directly for SimulMT inference. This can be desirable under circumstances where a single deployed model is preferable to multiple, specialized models (e.g., avoiding fine-tuning costs for multiple models). However, exactly how to adapt such models is unclear in practice given the differences between prompts during NMT fine-tuning (full-sentence availability) and SimulMT inference (incremental
source availability). This is especially problematic when engaging with short wait-k schemes and similar low-lagging schedules, where minimal source context is available during early translation steps. To intuit why this is an issue, suppose that an NMT LLM is accustomed to receiving the entire source sequence x before outputting a word or token y_1. If engaging in a low-lagging wait-k where |x| >> k, then the output y_1 can now only be based on source context up to x_k. Assuming the NMT LLM would typically rely on some source context x_i where i > k, then it is missing what should be critical information in its translation decision. We have conducted preliminary quantitative studies on this front by employing the proposed Simul-LLM framework, and these results are presented in Section 7.

Figure 4: Example of English to Spanish translation prompt construction with an incremental source x and an incremental output y applied via our proposed expanded dataset. Without more complex loss filtering than is typical, the entire output sequence for the split source-target prompt structure would be scored and the model would learn for wait-k schedules ranging from wait-i to wait-k as opposed to just wait-k.

5 Prompt Structure for SimulMT LLMs

Alternative to adapting NMT LLMs to SimulMT tasks, the SimulMT LLM approach aims to fine-tune LLMs directly for SimulMT, which requires new prompt structuring. Unlike classical encoder-decoder models for translation, source context for an LLM must be packaged within a prompt and structured appropriately. Some existing work has studied prompt structures for typical NMT (Zhang et al., 2023; Xu et al., 2023), but it is unclear whether such prompts are still optimal for SimulMT. This is because significant differences exist in source context availability between fine-tuning and inference, as discussed in Section 4.

To address this need, we propose a new prompting approach to construct the dataset for fine-tuning and evaluation of SimulMT LLMs. Instead of structuring each source-target sentence pair as a single example as typically done for NMT, we propose decomposing and expanding every sentence pair into a number of examples, where each example incrementally provides additional source and target tokens. When employed, the expanded examples directly mimic the behavior of simultaneous translation. We investigate two specific structures of materializing this prompt structuring approach, as analyzed and compared below.

5.1 Split Source-Target Prompt Structure

The first prompt structure follows the classical NMT prompts where the source sentence is included in the prompt and the target sentence is the output of the LLM. As illustrated in the first prompt structure in Figure 4, with the expanded examples, each example contains partial source along with some instruction in the prompt (starting with "<h>") and partial target that is k words behind the source in the model output (starting with "<a>"), where k is the intended inference wait-k value. While this split source-target prompt structure seems to be a natural and plausible way for model fine-tuning, it has exhibited difficulties when learning for simultaneous translation. The root of this stems from how fine-tuning with LLMs often works: when filtering the prompt from loss calculations during fine-tuning, a response template of some kind (e.g., "<a>:" or "»ANSWER«") is employed to ensure only the target word that is generated at the current step (i.e., y_i) is scored. Unfortunately, with the split source-target prompt, the template allows for all the target words that have been generated from previous translation steps (i.e., y_1 to y_i) to be scored. Without employing a more complex loss filtering, this leads to an inappropriate level of context.

This lack of loss filtering can be especially problematic near the end of a given sequence's translation. As a simple example, suppose that a given source sequence is of length |x| and |x| >> k. In the first prompt structure where up to i source
words have been supplied and where |x| > i >> k, the LLM is effectively being fine-tuned for varying wait-k values ranging from wait-i to wait-k in a single example (as y_i is predicted from x_1 to x_{i+k}, which is wait-k, while y_{i-1} is predicted from the same x_1 to x_{i+k}, which is wait-(k+1), and so on). If i >> k to the point where it is close to normal NMT levels of context, where i ≈ |x|, then the LLM is no longer being effectively fine-tuned with an appropriate amount of source context for a given translation (write) decision schedule.

5.2 Single Output Word Prompt Structure

The above problem can be entirely side-stepped by our proposed second approach, the single output word prompt structure, which embeds only the current target translation hypothesis within the model output.

As illustrated in the second prompt structure in Figure 4, instead of allowing target translation hypotheses from previous time-steps to be incorporated into the loss, the proposed prompt structure shifts those previous translation hypotheses into the prompt. Combined with the expanded examples that form a rigorous wait-k curriculum in terms of the fine-tuning dataset (rigorous meaning a complete curriculum as opposed to a random subset), inference behavior can be copied exactly for every fine-tuning example, thus completely closing the context mismatch between fine-tuning and inference.

Between the two prompt structures, it is clear that the single output word prompt structure more closely replicates the relationship observed in Equation 1 between the source and target sequence during fine-tuning. For both structures, a source-target sentence pair is expanded into up to max(|x| − (k − 1), |y|) examples.
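To make the expansion concrete, the following sketch builds single output word examples from one sentence pair in the spirit of Figure 4; the instruction wording, the "<h>"/"<a>" delimiters, and the helper name are illustrative assumptions rather than the exact strings used in our dataset.

```python
# Hypothetical sketch of expanding one source-target pair into wait-k
# single-output-word examples; prompt wording and delimiters are illustrative.
def expand_pair(source_words, target_words, k):
    examples = []
    for i, target_word in enumerate(target_words):
        # Wait-k: the (i+1)-th target word may see at most the first i+k source words.
        visible_source = source_words[:min(i + k, len(source_words))]
        prompt = ("<h>: Given the English source, continue the Spanish "
                  "translation by one word.\n"
                  f"Source: {' '.join(visible_source)}\n"
                  f"Translation so far: {' '.join(target_words[:i])}\n<a>:")
        examples.append({"prompt": prompt, "completion": " " + target_word})
    return examples

# Example usage: each resulting item asks for exactly one new target word,
# matching wait-k inference behavior.
pairs = expand_pair("I like green tea .".split(), "Me gusta el té verde .".split(), k=3)
```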
6 Evaluation Methodology

To validate our proposed solutions to the aforementioned challenges and to test the capabilities of Simul-LLM as the first SimulMT LLM open-source framework, we engaged in several experiments allowing for comparisons among classical non-LLM-based NMT (Vaswani et al., 2017) / SimulMT architectures (Ma et al., 2019), NMT LLMs adapted for SimulMT, and SimulMT LLMs. All mentioned LLMs are fine-tuned Falcon-7B models, but Simul-LLM features an easy to extend framework and we intend to add results for other popular LLMs such as LLaMa-7B and Mistral-7B soon.

6.1 Dataset Selection and Preprocessing

No standardized dataset exists for SimulMT with LLMs. Due to its popularity in the speech-to-text simultaneous translation (SimulST) track, we employ MuST-C for our experiments.² For the purposes of adapting MuST-C for text-to-text usage, we preprocess the dataset and filter out certain acoustic indicators (e.g., floating "-" characters representing pauses). In some cases, this resulted in significant changes to some samples of the test set, such as the removal of (Laughter) and (Applause) acoustic indicators that may take up a good portion of the samples.

² This allows for future work that explores multi-modal simultaneous LLMs, engaging in SimulST via a cascaded model structure with a transcription model for the source speech or a joint speech/text-to-text framework.

We employ MuST-C across two language pairs, those being English-to-German (en-de) and English-to-Spanish (en-es). Some additional experiments are provided for the en-es language pair that validate fundamental SimulMT concepts and display BLEU scores, gathered via sacreBLEU (Post, 2018), with respect to samples observed during fine-tuning. The original dataset contains roughly 270K training set samples and approximately 2.5-3K test set samples (tst-COMMON split) per language pair. The expanded version of this dataset for single output word fine-tuning contains approximately 5M training set samples per language pair.
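The exact filtering rules are not enumerated beyond the examples above, but a cleaning pass of this kind could look roughly as follows (the regexes are assumptions for illustration, not the preprocessing script itself):

```python
# Hypothetical sketch of the kind of MuST-C text cleaning described above.
import re

def clean_mustc_line(line: str) -> str:
    line = re.sub(r"\((?:Laughter|Applause)\)", " ", line)  # drop acoustic event tags
    line = re.sub(r"(?:^|\s)-(?=\s|$)", " ", line)           # drop floating "-" pause markers
    return re.sub(r"\s+", " ", line).strip()                 # normalize whitespace

print(clean_mustc_line("So - (Laughter) that was the idea ."))
# -> "So that was the idea ."
```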
6.2 Training, Fine-Tuning, and Evaluation Hyperparameters and Hardware Details

All classical models were trained on two NVIDIA 32GB V100s and validated on a single V100. All LLMs were fine-tuned via PEFT (Mangrulkar et al., 2022) on a single NVIDIA 40GB A40 in bfloat16 and evaluated on a single V100 in float32. Simul-LLM seamlessly integrates with SimulEval (Ma et al., 2020) for the purpose of these evaluations. Classical transformer baselines were trained via Fairseq (Ott et al., 2019), an easily extensible sequence to sequence toolkit.

Grouped Explorations               | Model and Decoding Scheme                     | en-de        | en-es
Baselines                          | Classical NMT Transformer (non-simultaneous)  | 26.96 (22.6) | 32.64 (23.1)
                                   | Monotonic Transformer Wait-5 (SimulMT)        | 22.01 (3.32) | 24.90 (2.58)
NMT LLMs Adapted for SimulMT       | NMT LLM                                       | 25.83 (3.65) | 30.06 (3.95)
                                   | NMT LLM Single SBS (k=3, b=5, c=1, w=6)       | 25.98 (4.12) | 29.48 (4.64)
                                   | NMT LLM Single SBS (k=3, b=5, c=1, w=10)      | 25.95 (4.25) | 27.67 (4.82)
                                   | NMT LLM Chunk SBS (k=3, b=5, c=2, w=10)       | 23.61 (4.63) | 26.33 (5.26)
                                   | NMT LLM Chunk SBS (k=5, b=5, c=3, w=15)       | 25.80 (5.61) | 27.00 (5.90)
                                   | NMT LLM Chunk SBS (k=7, b=5, c=4, w=20)       | 27.32 (6.97) | 28.66 (7.09)
SimulMT LLMs with Proposed Prompt  | Wait-3 Fine-tuning LLM                        | 19.99 (3.41) | 23.68 (3.64)
                                   | Wait-7 Fine-tuning LLM                        | 20.82 (3.44) | 25.18 (3.61)
                                   | Wait-7 Fine-tuning LLM (k=7)                  | 23.09 (6.71) | 28.92 (6.87)

Table 1: Comparisons of peak performance for various models and decoding schemes during primarily wait-3 evaluation (non-wait-3 is specified via k) via detokenized BLEU. Non-LLM baselines are subword-based wait-k (standard) while LLMs are word-based wait-k. Best SimulMT quality results are bolded, second best results are underlined, and latency is provided in parentheses as LAAL (Papi et al., 2022). Speculative Beam Search (SBS) during inference is experimented with for NMT LLMs, which lend themselves towards SBS (k=wait-k value, b=beams, c=chunks/words, w=window size).

Classical models were trained with typical hyperparameters provided in Fairseq examples. All LLMs were fine-tuned with identical hyperparameters, employing a constant learning rate of 3e-4 and were optimized via Adam with around 4K warmup updates and batch sizes of 40 samples. LoRA
adapter parameters were an α of 16 and an r value of 64, resulting in a total of around 40M added parameters during fine-tuning, with a dropout value of 0.1. For fair comparison, classical models were of a similar, although not quite identical, size in terms of parameter count.

Additionally, all LLMs were fine-tuned while quantized with NormalFloat4 (nf4) quantization. A small performance boost was observed when removing this quantization during inference, so no models engaged with nf4 quantization during inference. NMT LLMs were fine-tuned for one epoch, as overfitting was observed beyond that point, which we intuit to be possibly due to the well-documented ability of LLMs to quickly memorize training sets (Biderman et al., 2023). In contrast, SimulMT LLMs were fine-tuned on 2M random examples out of the 5M examples in the expanded dataset due to computational constraints.

6.3 Word or Token-Based Wait-k for LLMs

While classical encoder-decoder SimulMT systems usually engage in either word or token-based wait-k, they most typically engage with whichever is more suitable for their vocabulary (i.e., word versus sub-word vocabularies). In spite of the fact that LLMs function via sub-word vocabularies, we recommend, and employ for this work, word-based wait-k for SimulMT LLMs, as it more closely resembles the flow of engaging with a natural language interface. Moreover, supposing that the LLM is receiving a given sequence actively from a transcription system or something similar, it makes intuitive sense to wait for a word to be emitted from the system as opposed to a fragment.
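For reference, a word-based wait-k read/write policy in the style of SimulEval's documented text-to-text agent interface might look like the sketch below; the class, attribute, and helper names follow SimulEval's published examples but should be treated as assumptions rather than Simul-LLM's actual evaluation agent.

```python
# Hypothetical word-based wait-k policy in the style of a SimulEval text-to-text agent;
# translate_one_word() stands in for the LLM prompting/decoding step.
from simuleval.agents import TextToTextAgent
from simuleval.agents.actions import ReadAction, WriteAction

class WaitkWordAgent(TextToTextAgent):
    waitk = 3  # lagging factor k, measured in words

    def policy(self):
        lagging = len(self.states.source) - len(self.states.target)
        if lagging < self.waitk and not self.states.source_finished:
            return ReadAction()  # wait for another source word
        word = self.translate_one_word(self.states.source, self.states.target)
        finished = self.states.source_finished and lagging <= 1
        return WriteAction(word, finished=finished)

    def translate_one_word(self, source_words, target_words):
        # Placeholder: build the single output word prompt from the visible source
        # and running hypothesis, then greedily decode one word with the LLM.
        raise NotImplementedError
```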
7 Results and Analysis

7.1 Exploration of Adapting NMT LLMs to SimulMT

In Table 1, we provide a breakdown of the performance of several different models, decoding strategies, and wait-k schedules. Regarding our exploration related to adapting NMT LLMs to SimulMT, we also include results related to our implementation of Speculative Beam Search (SBS) (Zheng et al., 2019). As demonstrated by these results, compared with classical models, LLMs fine-tuned for NMT are very capable of SimulMT upon being adapted during inference (even exceeding the score of the classical NMT transformer on en-de that performs non-simultaneous translation). It is worth reiterating that classical architectures typically engage in subword-based wait-k whereas we employ word-based wait-k for LLMs, but the comparisons still serve as a useful reference.

SBS-based decoding strategies helped NMT LLMs in the en-de language pair, but lacked improvement for the en-es language pair. We noted that our implementation (and seemingly also the original implementation) was sensitive to both the window size and the number of committed chunks,
with large values for either resulting in the speculative target translation getting too close to the size of the source context. In our tests, when reaching the same length as the source context (akin to context levels of wait-1), degenerate output began to appear that resulted in the output trailing off (e.g., a final output of "que..." instead of the correct output of "que"). Notably, too many committed chunks only explains performance gaps for chunk-wise SBS, not single SBS, which is normally a flat improvement upon greedy decoding (single SBS is still sensitive to window size). Future experiments can be conducted to utilize the proposed Simul-LLM framework to quantify these factors.

7.2 Exploration of SimulMT LLMs with Proposed Prompt Structure

In Table 1, we also provide a breakdown of the performance of our exploration of SimulMT LLMs with our proposed prompt structure in Section 5.2 that carefully manages source context availability for target translation generation. Two models are employed for this exploration, one fine-tuned for wait-3 inference, and another fine-tuned for wait-7 inference that produces noticeably better quality translations than the first.

While SimulMT LLMs are a more promising approach in achieving higher translation quality than NMT LLMs due to more direct task-specific fine-tuning and better context alignment, our experiment results suggest that, for the time being, the performance of SimulMT LLMs is not yet advantageous relative to NMT LLMs adapted to SimulMT (and varies from slightly worse to barely better translation quality compared with the classical non-LLM SimulMT baseline). This is not completely unexpected, as NMT LLMs have been optimized heavily in recent years whereas the exploration of SimulMT LLMs has just started. We provide some analysis below that points out several possible reasons for this observed performance gap and call for additional community efforts to investigate further.

First, due to computational constraints we were unable to fine-tune for an entire epoch of the training dataset (only 2M random samples out of 5M), which represents a major loss of lessons and a lack of a rigorous wait-k curriculum for the fine-tuned model. Second, it is possible that the fine-tuning hyperparameters are ill-suited for this particular prompt. We consider this likely to be the most influential issue on our observed results, given the drastic differences between the original and expanded datasets. Third, at least one other work related to NMT LLMs (Chen et al., 2023) has demonstrated that relative positional embeddings can cause issues via attention dilution that ends up being unhelpful, suggesting that distancing the source context, running target hypothesis, and the current translation step hypothesis can be unexpectedly problematic. We posit that our proposed Simul-LLM can be leveraged to verify the above reasons.

7.3 Higher Wait-k Generalizability Comparisons

It is well documented that in typical SimulMT systems, training or fine-tuning with a slightly higher wait-k than is intended during inference can boost translation quality and generalizability across slightly lower wait-k (Ma et al., 2019). While this likely applies to SimulMT LLMs, no existing work has validated that this behavior persists. We provide a brief comparison of two SimulMT LLMs fine-tuned via wait-3 and wait-7 context levels in Table 2. The results in the table demonstrate that, generally, the expected behavior does hold, with all LLMs fine-tuned in wait-7 outperforming their corresponding wait-3 models for the same inference wait-k. We leave validating additional, previously understood SimulMT principles in SimulMT LLMs to future work.

Fine-tuning Wait-k                 | Inference Wait-k | BLEU
SimulMT LLMs Fine-tuned in Wait-3  | Wait-3           | 23.68
                                   | Wait-5           | 25.59
                                   | Wait-7           | 26.31
SimulMT LLMs Fine-tuned in Wait-7  | Wait-3           | 25.18
                                   | Wait-5           | 28.19
                                   | Wait-7           | 28.92

Table 2: Peak BLEU scores for various SimulMT LLMs fine-tuned with different wait-k values. Across all inference wait-k values, the SimulMT LLMs fine-tuned in wait-7 outperform the SimulMT LLMs fine-tuned in wait-3 by up to 2.6 BLEU.

8 Conclusion

In this work, we introduce Simul-LLM, the first open-source framework that enables rapid development of LLM fine-tuning and evaluation pipelines for simultaneous machine translation (SimulMT). Simul-LLM seamlessly integrates with the fine-tuning and generation tools of the popular transformers library as well as with SimulEval,
the preeminent SimulMT evaluation framework. In addition to introducing Simul-LLM, we employ this framework to explore adapting existing NMT LLMs to SimulMT and propose an expanded fine-tuning dataset and alternative prompt structure to those employed in typical NMT fine-tuning that we believe better replicates inference-time behavior. Moreover, we validate some classically understood SimulMT concepts concerning wait-k scheduling and examine the behavior of SimulMT LLMs during fine-tuning. Our proposed Simul-LLM framework enables multiple lines of future work that can be carried out to understand and optimize LLMs for SimulMT, and it will likely be a useful tool for the research community.

References

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. 2023. Falcon-40B: An open large language model with state-of-the-art performance.

Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, Chung-Cheng Chiu, Semih Yavuz, Ruoming Pang, Wei Li, and Colin Raffel. 2019. Monotonic infinite lookback attention for simultaneous machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1313–1323, Florence, Italy. Association for Computational Linguistics.

Stella Biderman, USVSN Sai Prashanth, Lintang Sutawika, Hailey Schoelkopf, Quentin Anthony, Shivanshu Purohit, and Edward Raff. 2023. Emergent and predictable memorization in large language models.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Yijie Chen, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, and Jie Zhou. 2023. Improving translation faithfulness of large language models via augmenting instructions.

J. Devlin, M. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Alvin Grissom II, He He, Jordan Boyd-Graber, John Morgan, and Hal Daumé III. 2014. Don't until the final verb wait: Reinforcement learning for simultaneous machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1342–1352, Doha, Qatar. Association for Computational Linguistics.

Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O. K. Li. 2017. Learning to translate in real-time with neural machine translation.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.

Kazuki Irie, Albert Zeyer, Ralf Schlüter, and Hermann Ney. 2019. Language modeling with deep transformers. In Interspeech 2019. ISCA.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models.

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.

M. Ma, L. Huang, H. Xiong, R. Zheng, K. Liu, B. Zheng, C. Zhang, Z. He, H. Liu, X. Li, H. Wu, and H. Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy. Association for Computational Linguistics (ACL).

Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, and Juan Pino. 2020. SimulEval: An evaluation toolkit for simultaneous translation.

Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin
Bossan. 2022. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling.

Haoran Xu, Young Jin Kim, Amr Sharaf, and Hany Hassan Awadalla. 2023. A paradigm shift in machine translation: Boosting translation performance of large language models.

Renjie Zheng, Mingbo Ma, Baigong Zheng, and Liang Huang. 2019. Speculative beam search for simultaneous translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1395–1402, Hong Kong, China. Association for Computational Linguistics.