
Simul-LLM: A Framework for Exploring High-Quality Simultaneous Translation with Large Language Models

Victor Agostinelli, Max Wild, Matthew Raffel, Kazi Ahmed Asif Fuad, Lizhong Chen
Oregon State University
{agostinv, wildma, raffelm, fuadk, chenliz}@oregonstate.edu

arXiv:2312.04691v2 [cs.CL] 12 Dec 2023

Abstract

Large language models (LLMs) with billions of parameters and pretrained on massive amounts of data are now capable of near or better than state-of-the-art performance in a variety of downstream natural language processing tasks. Neural machine translation (NMT) is one such task that LLMs have been applied to with great success. However, little research has focused on applying LLMs to the more difficult subset of NMT called simultaneous translation (SimulMT), where translation begins before the entire source context is available to the model. In this paper, we address key challenges facing LLMs fine-tuned for SimulMT, validate classical SimulMT concepts and practices in the context of LLMs, explore adapting LLMs that are fine-tuned for NMT to the task of SimulMT, and introduce Simul-LLM (https://github.com/OSU-STARLAB/Simul-LLM), the first open-source fine-tuning and evaluation pipeline development framework for LLMs focused on SimulMT.

1 Introduction

Modern large language models (LLMs) contain at least several billion and up to trillions of parameters and are remarkably capable across a wide range of tasks. Pretrained on humongous amounts of unlabeled data, they have demonstrated incredible emergent capabilities. With minor prompt adjustments, such as including instructions and examples, LLMs are often capable of near state-of-the-art performance set by highly customized solutions. The performance of these models is further enhanced when fine-tuned for specialized downstream tasks, often exceeding the performance of previously cutting-edge solutions. Given their rapidly evolving capabilities, LLMs and their application have become a focused topic of research within NLP academia.

One popular downstream task for LLMs is text-to-text neural machine translation (NMT), which focuses on taking an input sequence in a given language and outputting a translation in another language. Typically, the entire source context is available at the start of translation for NMT. A particularly challenging subset of NMT is known as simultaneous translation (SimulMT), where the model begins translation without having access to the entire source sequence, and the translation progresses as the remaining source sequence is incrementally provided. For languages that are syntactically and structurally similar, near-NMT performance is fairly achievable, but for language pairs that differ significantly in structure, traditional models struggle to balance high-quality translations against the delay incurred waiting for additional source context. This balance is typically achieved via a fixed or adaptive read-write schedule, with one of the most popular and longstanding fixed schedules being the wait-k policy (Ma et al., 2019), where the target translation hypothesis lags behind the incrementally available source sequence by k words or subwords.

While LLMs have been applied to and studied actively in NMT, their application to simultaneous translation has been lagging. This is in part due to a few challenges LLMs face when applied to SimulMT that are non-trivial to address. First and foremost, it is unclear how well LLMs, which are pretrained and usually fine-tuned under the assumption that the prompt is completely provided and static before generation, will adapt to an application space where the prompt dynamically changes as the simultaneous scheduler elects to read from the source sequence. Second, multiple approaches exist to enable LLMs for SimulMT, and it is challenging to intuit which approach will perform best. For example, one could adapt LLMs fine-tuned for NMT (hereafter referred to as NMT LLMs) to SimulMT during inference, although how well such models will deal with the source context availability mismatch between fine-tuning (full sentence) and inference (partial sentence) is nebulous. Alternatively, one could fine-tune LLMs directly for SimulMT (hereafter referred to as SimulMT LLMs), but new prompt structuring is likely needed to match inference SimulMT behavior exactly during fine-tuning. Finally, it is also unclear how well previously understood concepts in existing SimulMT work, such as higher fine-tuning wait-k values increasing generalizability, will apply to SimulMT LLMs.

This paper seeks to address the above problems and contributes to the process of applying LLMs to SimulMT in the following major ways:

• We develop Simul-LLM, the first open-source fine-tuning and evaluation pipeline development framework for SimulMT LLMs, which seamlessly wraps around and interfaces with popular libraries for LLMs and SimulMT. This framework serves as a foundation for research on SimulMT LLMs that the community can employ and extend for a wide range of future work on LLM-based simultaneous translation.

• With the aforementioned framework, we explore the feasibility of adapting LLMs fine-tuned for NMT to SimulMT under a few decoding strategies and the classical wait-k fixed translation scheduler. Generally, we find that NMT LLMs demonstrate good performance during SimulMT inference, which can be somewhat boosted by more complex decoding strategies.

• We propose an alternative prompt structuring approach to commonly employed NMT prompts that bridges the gap between the fine-tuning and inference environments, assuming a wait-k schedule, and we validate this via the Simul-LLM framework. We elaborate on counter-intuitive results that we observe and provide a base of exploration for future research to employ. Along these lines, we also validate that higher wait-k values employed during SimulMT fine-tuning do increase wait-k generalizability and boost translation quality across the board during SimulMT inference.

2 Background

While a range of work is relevant to this paper, we will only provide a focused and high-level review of large language models, LLMs applied towards machine translation, and simultaneous translation as an application space. Readers interested in additional details should engage further with the works cited in these areas.

2.1 Large Language Models

Language modeling via transformers, autoregressive (Irie et al., 2019) or bi-directional (Devlin et al., 2019), has been popular nearly since the architecture's inception, with bi-directional language modeling via BERT demonstrating particularly potent results for its time. In recent years, however, research on other approaches to this task has largely slowed down in favor of rapidly scaling model parameters and employing massive amounts of high-quality training data (Kaplan et al., 2020), resulting in the advent of purely generative Large Language Models (LLMs) like GPT-3 (Brown et al., 2020), LLaMa (Touvron et al., 2023), and others.

At their core, LLMs generate the most likely subsequent token (usually at the sub-word level) given all previous tokens in a sequence. Despite this simple core concept, their size and complexity allow for potent pattern recognition and numerous emergent capabilities (e.g., simple math, summarization). Additionally, when fine-tuned for specific tasks, this class of models is capable of near-equivalent performance to specialized solutions while, in many cases, maintaining their generalizability to other tasks. For example, this can allow a single end-to-end model to serve as both a high-quality multilingual translation agent and a chat assistant, opening up interesting, practical deployment opportunities.

2.2 Large Language Models for Neural Machine Translation

While LLMs are capable of effectively zero-shot sentence-to-sentence neural machine translation (NMT) (Vilar et al., 2023), their performance can still be improved via simple techniques. Prompt construction has been demonstrated to be critical to LLM performance, both before and after fine-tuning (Zhang et al., 2023). One-shot or few-shot performance via In-Context Learning (ICL) can produce near-competitive results with fine-tuned LLMs for translation and can even be employed to enhance fine-tuned model performance (Vilar et al., 2023; Xu et al., 2023).

One particularly interesting area of study related to LLMs applied towards NMT remains whether to fully fine-tune a given model or engage in Parameter-Efficient Fine-Tuning (PEFT) (Mangrulkar et al., 2022). Early work in this area demonstrated some potential for smaller models (Üstün and Cooper Stickland, 2022), and the accessibility that PEFT provides designers in terms of fine-tuning on low-to-mid performance hardware setups renders it desirable. One of the most popular forms of PEFT freezes an LLM's weights and adds Low-Rank Adaptation (LoRA) (Hu et al., 2022) adapters between layers (other forms of PEFT exist, but adapter-based PEFT is extremely common, so we refer to adapter-based PEFT simply as PEFT hereafter). While fully fine-tuned NMT LLMs tend to suffer from some level of catastrophic forgetting (Kirkpatrick et al., 2017), intuitively, PEFT-based NMT LLMs should not suffer from any loss of off-task performance, as adapters can be loaded or detached depending on whether or not a given user is prompting for a translation. Given all of these factors, PEFT is an attractive option for NMT LLMs.

2.3 Simultaneous Translation

Figure 1: An example of a misalignment obstacle in wait-3 simultaneous translation from German to English. Under certain circumstances a model needs to infer information earlier than the corresponding context in the source.

As a subset of typical NMT, simultaneous translation (SimulMT) focuses on engaging in translation (write decisions) while balancing the amount of available source context (read decisions) to reduce translation latency. This necessarily increases the difficulty of translating a sequence in one language to a sequence in another language, especially when structural and/or syntactical differences exist in the language pair. As a brief example, we can consider translating from a subject-verb-object (SVO) language, such as English, to a subject-object-verb (SOV) language like German. Based on the available source context in English, we may have to guess at the translation in German without access to the necessary context to effectively make that prediction. This is exemplified in Figure 1, where the verb "repaired" must be predicted multiple time-steps before the necessary context is available in the incremental source.

There are two classical, high-level approaches to scheduling the write and read decisions of SimulMT: static schedules like wait-k (Ma et al., 2019), or adaptive schedules which are flexible and learned, such as variants of monotonic multi-head attention, adaptive wait-k compositions, wait-if-worse, decision state assisted SMT, and others (Grissom II et al., 2014; Gu et al., 2017; Arivazhagan et al., 2019; Zheng et al., 2020). Wait-k remains a particularly popular baseline strategy given its ease of application during training and during inference. It functions by retaining a k-lagging factor between the source context (either in tokens or in words) x and the translation hypothesis y. We can model a typical wait-k schedule's probability of generating a given output sequence, provided some source sequence, with Equation 1:

p(y, x) = \prod_{i=1}^{|y|} p(y_i \mid y_{<i}, x_{<\min(i+k, |x|)})    (1)
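The wait-k factorization in Equation 1 maps directly onto a simple read/write loop at inference time. The sketch below is a minimal illustration of that loop, not the Simul-LLM implementation itself; generate_next_word is a hypothetical single-step decoder (e.g., an LLM queried with the current prompt) supplied by the caller.

```python
# Minimal sketch of a word-level wait-k schedule (Equation 1).
# Assumption: generate_next_word(visible_source, hypothesis) returns the next
# target word, or None once the model decides the translation is complete.

def waitk_translate(source_words, k, generate_next_word, max_len=200):
    hypothesis = []
    for i in range(max_len):
        # Read: only the first min(i + k, |x|) source words are visible.
        visible_source = source_words[: min(i + k, len(source_words))]
        # Write: predict y_i conditioned on the visible source and y_<i.
        next_word = generate_next_word(visible_source, hypothesis)
        if next_word is None:
            break
        hypothesis.append(next_word)
    return " ".join(hypothesis)
```

In a streaming deployment the source would arrive incrementally; here the full source_words list stands in for that stream for brevity.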
Under circumstances or in environments where additional computational latency is acceptable, variations of beam search have been applied in simultaneous scenarios. Speculative Beam Search (SBS) (Zheng et al., 2019) is one such example where, at a high level, each translation step attempts to speculatively translate future steps for some number of beams, eventually selecting a single token or word (the first one) from the most likely beam for some beam length. When applied to wait-k, single token or word SBS can be modeled via Equation 2, where w is the length of the beam and \hat{y}_{i+1:w} represents the speculative beam of maximum joint probability:

p(y, x) = \prod_{i=1}^{|y|} p(y_i \mid \hat{y}_{i+1:w}, y_{<i}, x_{<\min(i+k, |x|)})    (2)

Chunk-wise variations are also possible, where for multiple consecutive translation steps, or write decisions, words or subwords from the last determined most-likely beam are employed to cut down on latency.
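As a concrete illustration of a single SBS write step, the sketch below speculates a short window ahead with beam search but commits only the leading word of the best beam. Here beam_search is a hypothetical helper (for example, a thin wrapper around a model's beam-search generation) and is not part of any specific library API.

```python
# Minimal sketch of single-word Speculative Beam Search (Equation 2).
# Assumption: beam_search(prompt, num_beams, max_new_tokens) returns the text of
# the highest joint-probability continuation of the prompt.

def sbs_write_step(prompt, beam_search, beams=5, window=10):
    speculative = beam_search(prompt, num_beams=beams, max_new_tokens=window)
    # Only the first word of the best beam is committed; the rest is discarded
    # and re-predicted at the next write step, once more source is available.
    return speculative.strip().split()[0]
```

A chunk-wise variant would instead commit the first c words of the best beam across c consecutive write decisions, trading some quality for lower computational latency.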

3 Simul-LLM: an Open-Source SimulMT LLM Fine-tuning Framework

SimulMT is an underexplored application space for LLMs at the moment, and there is plenty of room to improve performance. To facilitate the rapid development of solutions for SimulMT LLMs (fine-tuned for SimulMT) or NMT LLMs (fine-tuned for NMT) adapted for SimulMT, we choose to develop and provide an open-source framework written in PyTorch for researchers to actively employ for future experiments. We call this framework Simul-LLM, and it bridges the gap between the development of fine-tuning agents via popular libraries and proper SimulMT evaluation. Simul-LLM is poised to support both selective classical SimulMT systems as well as fine-tuning and evaluation for a variety of LLM systems such as Falcon (Almazrouei et al., 2023), LLaMa (Touvron et al., 2023), and Mistral-based (Jiang et al., 2023) models. The high-level components of Simul-LLM include a fine-tuning wrapper and a SimulMT evaluation agent for every supported LLM; in the case of classical models, only the evaluation agent is supported in-framework.

Figure 2: Depiction of the Simul-LLM fine-tuning wrapper framework. High-level specifications and hyperparameters are passed to the wrapper on instantiation, which employs a specified prompt constructor, instantiates a specified LLM foundational model, optionally constructs a PEFT config, and optionally constructs a quantization config via BitsAndBytes.

3.1 Fine-tuning Wrapper and Features

The fine-tuning wrapper of Simul-LLM is constructed with a focus on simplicity and extensibility. It is depicted in Figure 2 and supports the following set of user-friendly features.

LLM Support and Extensibility: The proposed Simul-LLM currently supports Falcon and will imminently support Mistral and LLaMa for SimulMT fine-tuning. The fine-tuning wrapper is constructed with a focus on extensibility, allowing for rapid expansion to other LLMs assuming users hold a minimal level of LLM-specific knowledge.

Multiple Prompt Structures: Translation quality, as demonstrated by numerous prior works, varies significantly with prompt structure. As such, the fine-tuning wrapper of Simul-LLM supports multiple kinds of prompt structures, all of which we will validate later in this paper. In the interest of supporting the adaptation of NMT LLMs to SimulMT, Simul-LLM supports NMT fine-tuning in addition to supporting prompt structures that allow for strict wait-k fine-tuning.

PEFT and Full Model Fine-tuning: While PEFT (Mangrulkar et al., 2022) has recently become popular, full model fine-tuning is still a valuable option. Simul-LLM supports both PEFT and full model fine-tuning, although PEFT is recommended for most low-to-mid resource hardware setups. We provide recommended PEFT configurations (as seen in Figure 2) for all supported models, and integrate an interface for new PEFT configurations when extending to new LLMs.

Flexible Quantization: Effectively fine-tuning LLMs relies on careful memory management for low-to-mid hardware setups. Given that, Simul-LLM quantizes LLMs via the BitsAndBytes library (seen in the Quant Config in Figure 2), which enables flexible fixed-point quantization. For most low-to-mid hardware setups, we recommend quantizing in 4-bit floating point via 4-bit NormalFloat (nf4).
Prompt Loss Filtering: Extremely basic supervised fine-tuning may inappropriately include portions of the prompt in loss calculations that are then backpropagated. The fine-tuning wrapper of Simul-LLM ensures that the model only learns from data it is intended to generate post-prompt via a DataCollator object and a specified response template.

Supervised Fine-tuning Agent: While custom fine-tuning and optimization loops can be useful, existing and popular options are quite capable. Given that, Simul-LLM's fine-tuning wrapper surrounds the transformers supervised fine-tuning trainer and interfaces with it directly. Hyperparameters are fed directly to it, as well as PEFT configs, quantization configs, and a specified prompt structure. The trainer itself handles saving model (or LoRA adapter) checkpoints, engaging in validation runs, and optimizing as defined by the provided hyperparameters.
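One common way to realize both of these features is with the trl library's supervised fine-tuning trainer and its completion-only collator. The sketch below is illustrative rather than the exact Simul-LLM wiring; the "<a>:" response template, the toy dataset, and some argument names are assumptions (trl's trainer arguments have shifted across versions).

```python
# Sketch: prompt-loss filtering via a response template plus the trl SFT trainer.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

model_name = "tiiuae/falcon-7b"  # any supported causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy stand-in for the expanded wait-k dataset; each example is one prompt string.
train_dataset = Dataset.from_dict(
    {"text": ["<h>: Translate to Spanish: I am going <a>: Yo"]}
)

# Only tokens after the response template contribute to the loss, masking the prompt.
collator = DataCollatorForCompletionOnlyLM(response_template="<a>:", tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    data_collator=collator,
    peft_config=LoraConfig(task_type="CAUSAL_LM"),
    args=TrainingArguments(output_dir="./ckpts", learning_rate=3e-4),
)
trainer.train()  # the trainer handles checkpointing, validation runs, and optimization
```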

Figure 3: Depiction of the Simul-LLM evaluation agent framework. The SimulMT agent receives the incremental source from SimulEval (left of the figure) and sends the finalized translation step hypothesis to SimulEval (right of the figure), which manages latency calculation and translation quality scoring.

3.2 Evaluation Agent and Features

Evaluation agents for Simul-LLM are similarly built for ease of use and extensibility, in addition to customizability towards complex translation schedules and decoding strategies, while seamlessly interfacing with the preeminent SimulMT evaluation framework, SimulEval (Ma et al., 2020). This is depicted in Figure 3 and includes the following features.

Classical SimulMT Translation Scheduler: In the interest of baseline accessibility, Simul-LLM evaluation agents support wait-k translation schedules for SimulMT, given their ease of application. Further adaptive or otherwise more involved translation schedulers can be quickly constructed and applied, assuming no reliance on fine-tuning.

Support for Multiple Decoding Strategies: Variable latency constraints in possible inference environments demand flexibility in decoding strategies. As such, evaluation agents for Simul-LLM support several decoding strategies, including greedy and naive decoding, subword-based beam search for single-word decoding, and variations on Speculative Beam Search (SBS) (Zheng et al., 2019), including single word-based SBS and chunk-wise SBS.

Efficient Inference via Custom Generation Stopping Criteria: To avoid excessive generation, Simul-LLM evaluation agents employ custom generation stopping criteria where applicable. This is most relevant in the case of greedy decoding based on singular words, where generation can halt on the first white space delimiter being detected, but constructing more complex criteria is simple given the provided examples.
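A whitespace-based criterion of this kind can be expressed through the transformers StoppingCriteria interface. The sketch below is a simplified stand-in for the criteria shipped with Simul-LLM, and the prompt-length bookkeeping and usage comment are assumptions of this example.

```python
# Sketch: halt greedy generation once a newly generated word is complete.
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnWhitespace(StoppingCriteria):
    def __init__(self, tokenizer, prompt_length):
        self.tokenizer = tokenizer
        self.prompt_length = prompt_length  # number of prompt tokens to skip

    def __call__(self, input_ids, scores, **kwargs) -> bool:
        # Decode only the freshly generated suffix; stop once it contains an
        # internal whitespace delimiter, i.e., a full word has been emitted.
        new_text = self.tokenizer.decode(input_ids[0, self.prompt_length:])
        return " " in new_text.strip() or "\n" in new_text

# Usage (assumed names):
# output = model.generate(**inputs, do_sample=False, max_new_tokens=16,
#                         stopping_criteria=StoppingCriteriaList(
#                             [StopOnWhitespace(tokenizer, inputs["input_ids"].shape[1])]))
```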
Scoring and Latency via SimulEval: SimulEval (Ma et al., 2020) is the premier SimulMT evaluation framework. Simul-LLM evaluation agents interface seamlessly with SimulEval, which handles the incremental source context and manages translation scoring and latency tracking with regard to the translation hypothesis.

4 Adapting NMT LLMs to SimulMT

Existing LLMs that have been fine-tuned for classical NMT may have the potential to be employed directly for SimulMT inference. This can be desirable under circumstances where a single deployed model is preferable to multiple, specialized models (e.g., avoiding fine-tuning costs for multiple models). However, exactly how to adapt such models is unclear in practice, given the differences between prompts during NMT fine-tuning (full-sentence availability) and SimulMT inference (incremental source availability). This is especially problematic when engaging with short wait-k schemes and similar low-lagging schedules, where minimal source context is available during early translation steps.

To intuit why this is an issue, suppose that an NMT LLM is accustomed to receiving the entire source sequence x before outputting a word or token y_1. If engaging in a low-lagging wait-k where |x| >> k, then the output y_1 can now only be based on source context up to x_k. Assuming the NMT LLM would typically rely on some source context x_i where i > k, then it is missing what should be critical information in its translation decision. We have conducted preliminary quantitative studies on this front by employing the proposed Simul-LLM framework, and these results are presented in Section 7.

5 Prompt Structure for SimulMT LLMs

As an alternative to adapting NMT LLMs to SimulMT tasks, the SimulMT LLM approach aims to fine-tune LLMs directly for SimulMT, which requires new prompt structuring. Unlike classical encoder-decoder models for translation, source context for an LLM must be packaged within a prompt and structured appropriately. Some existing work has studied prompt structures for typical NMT (Zhang et al., 2023; Xu et al., 2023), but it is unclear whether such prompts are still optimal for SimulMT. This is because significant differences exist in source context availability between fine-tuning and inference, as discussed in Section 4.

To address this need, we propose a new prompting approach to construct the dataset for fine-tuning and evaluation of SimulMT LLMs. Instead of structuring each source-target sentence pair as a single example, as typically done for NMT, we propose decomposing and expanding every sentence pair into a number of examples, where each example incrementally provides additional source and target tokens. When employed, the expanded examples directly mimic the behavior of simultaneous translation. We investigate two specific structures for materializing this prompt structuring approach, as analyzed and compared below.

Figure 4: Example of English to Spanish translation prompt construction with an incremental source x and an incremental output y applied via our proposed expanded dataset. Without more complex loss filtering than is typical, the entire output sequence for the split source-target prompt structure would be scored and the model would learn for wait-k schedules ranging from wait-i to wait-k as opposed to just wait-k.

5.1 Split Source-Target Prompt Structure

The first prompt structure follows the classical NMT prompts, where the source sentence is included in the prompt and the target sentence is the output of the LLM. As illustrated in the first prompt structure in Figure 4, with the expanded examples, each example contains a partial source along with some instruction in the prompt (starting with "<h>") and a partial target that is k words behind the source in the model output (starting with "<a>"), where k is the intended inference wait-k value. While this split source-target prompt structure seems to be a natural and plausible way to fine-tune a model, it has exhibited difficulties when learning for simultaneous translation. The root of this stems from how fine-tuning with LLMs often works: when filtering the prompt from loss calculations during fine-tuning, a response template of some kind (e.g., "<a>:" or "»ANSWER«") is employed to ensure only the target word that is generated at the current step (i.e., y_i) is scored. Unfortunately, with the split source-target prompt, the template allows all the target words that have been generated from previous translation steps (i.e., y_1 to y_i) to be scored. Without employing more complex loss filtering, this leads to an inappropriate level of context.

This lack of loss filtering can be especially problematic near the end of a given sequence's translation. As a simple example, suppose that a given source sequence is of length |x| and |x| >> k. In the first prompt structure where up to i source words have been supplied and where |x| > i >> k, the LLM is effectively being fine-tuned for varying wait-k values ranging from wait-i to wait-k in a single example (as y_i is predicted from x_1 to x_{i+k}, which is wait-k, y_{i-1} is predicted from x_1 to x_{i+k}, which is wait-(k+1), and so on). If i >> k to the point where it is close to normal NMT levels of context, where i ≈ |x|, then the LLM is no longer being effectively fine-tuned with an appropriate amount of source context for a given translation (write) decision schedule.

5.2 Single Output Word Prompt Structure

The above problem can be entirely side-stepped by our proposed second approach, the single output word prompt structure, which embeds only the current target translation hypothesis within the model output.

As illustrated in the second prompt structure in Figure 4, instead of allowing target translation hypotheses from previous time-steps to be incorporated into the loss, the proposed prompt structure shifts those previous translation hypotheses into the prompt. Combined with the expanded examples that form a rigorous wait-k curriculum in terms of the fine-tuning dataset (rigorous meaning a complete curriculum as opposed to a random subset), inference behavior can be copied exactly for every fine-tuning example, thus completely closing the context mismatch between fine-tuning and inference.

Between the two prompt structures, it is clear that the single output word prompt structure more closely replicates the relationship observed in Equation 1 between the source and target sequence during fine-tuning. For both structures, a source-target sentence pair is expanded into up to max(|x| - (k - 1), |y|) examples.
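The expansion itself is mechanical. The sketch below builds single-output-word examples for one sentence pair under a wait-k schedule; the "<h>"/"<a>" tags mirror Figure 4, but the exact instruction wording and the fixed number of generated examples (one per target word) are simplifying assumptions of this illustration rather than the framework's canonical prompt construction.

```python
# Sketch: expand one source-target pair into single-output-word wait-k examples.

def expand_pair(source_words, target_words, k, src_lang="English", tgt_lang="Spanish"):
    examples = []
    for i, next_word in enumerate(target_words):
        # Source visible at step i under wait-k, plus the running target hypothesis.
        visible_source = " ".join(source_words[: min(i + k, len(source_words))])
        running_target = " ".join(target_words[:i])
        prompt = (f"<h>: Translate from {src_lang} to {tgt_lang}: {visible_source} "
                  f"{running_target} <a>:")
        # Only `next_word` sits after the response template, so only it is scored.
        examples.append({"text": f"{prompt} {next_word}"})
    return examples

# Example: a wait-3 expansion produces one example per target word.
pairs = expand_pair("I am going to the store".split(), "Voy a la tienda".split(), k=3)
```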
6 Evaluation Methodology

To validate our proposed solutions to the aforementioned challenges and to test the capabilities of Simul-LLM as the first SimulMT LLM open-source framework, we engaged in several experiments allowing for comparisons among classical non-LLM-based NMT (Vaswani et al., 2017) / SimulMT architectures (Ma et al., 2019), NMT LLMs adapted for SimulMT, and SimulMT LLMs. All mentioned LLMs are fine-tuned Falcon-7B models, but Simul-LLM features an easy-to-extend framework and we intend to add results for other popular LLMs such as LLaMa-7B and Mistral-7B soon.

6.1 Dataset Selection and Preprocessing

No standardized dataset exists for SimulMT with LLMs. Due to its popularity in the speech-to-text simultaneous translation (SimulST) track, we employ MuST-C for our experiments. (This also allows for future work that explores multi-modal simultaneous LLMs, engaging in SimulST via a cascaded model structure with a transcription model for the source speech or a joint speech/text-to-text framework.) For the purposes of adapting MuST-C for text-to-text usage, we preprocess the dataset and filter out certain acoustic indicators (e.g., floating "-" characters representing pauses). In some cases, this resulted in significant changes to some samples of the test set, such as the removal of (Laughter) and (Applause) acoustic indicators that may take up a good portion of the samples.

We employ MuST-C across two language pairs, English-to-German (en-de) and English-to-Spanish (en-es). Some additional experiments are provided for the en-es language pair that validate fundamental SimulMT concepts and display BLEU scores, gathered via sacreBLEU (Post, 2018), with respect to samples observed during fine-tuning. The original dataset contains roughly 270K training set samples and approximately 2.5-3K test set samples (tst-COMMON split) per language pair. The expanded version of this dataset for single output word fine-tuning contains approximately 5M training set samples per language pair.
Grouped Explorations              | Model and Decoding Scheme                  | en-de        | en-es
Classical Baselines               | NMT Transformer (non-simultaneous)         | 26.96 (22.6) | 32.64 (23.1)
Classical Baselines               | Monotonic Transformer Wait-5 (SimulMT)     | 22.01 (3.32) | 24.90 (2.58)
NMT LLMs Adapted for SimulMT      | NMT LLM                                    | 25.83 (3.65) | 30.06 (3.95)
NMT LLMs Adapted for SimulMT      | NMT LLM Single SBS (k=3, b=5, c=1, w=6)    | 25.98 (4.12) | 29.48 (4.64)
NMT LLMs Adapted for SimulMT      | NMT LLM Single SBS (k=3, b=5, c=1, w=10)   | 25.95 (4.25) | 27.67 (4.82)
NMT LLMs Adapted for SimulMT      | NMT LLM Chunk SBS (k=3, b=5, c=2, w=10)    | 23.61 (4.63) | 26.33 (5.26)
NMT LLMs Adapted for SimulMT      | NMT LLM Chunk SBS (k=5, b=5, c=3, w=15)    | 25.80 (5.61) | 27.00 (5.90)
NMT LLMs Adapted for SimulMT      | NMT LLM Chunk SBS (k=7, b=5, c=4, w=20)    | 27.32 (6.97) | 28.66 (7.09)
SimulMT LLMs with Proposed Prompt | Wait-3 Fine-tuning LLM                     | 19.99 (3.41) | 23.68 (3.64)
SimulMT LLMs with Proposed Prompt | Wait-7 Fine-tuning LLM                     | 20.82 (3.44) | 25.18 (3.61)
SimulMT LLMs with Proposed Prompt | Wait-7 Fine-tuning LLM (k=7)               | 23.09 (6.71) | 28.92 (6.87)

Table 1: Comparisons of peak performance for various models and decoding schemes during primarily wait-3 evaluation (non-wait-3 is specified via k) via detokenized BLEU. Non-LLM baselines are subword-based wait-k (standard) while LLMs are word-based wait-k. Latency is provided in parentheses as LAAL (Papi et al., 2022). Speculative Beam Search (SBS) during inference is experimented with for NMT LLMs, which lend themselves towards SBS (k = wait-k value, b = beams, c = chunks/words, w = window size).

6.2 Training, Fine-Tuning, and Evaluation Hyperparameters and Hardware Details

All classical models were trained on two NVIDIA 32GB V100s and validated on a single V100. All LLMs were fine-tuned via PEFT (Mangrulkar et al., 2022) on a single NVIDIA 40GB A40 in bfloat16 and evaluated on a single V100 in float32. Simul-LLM seamlessly integrates with SimulEval (Ma et al., 2020) for the purpose of these evaluations. Classical transformer baselines were trained via Fairseq (Ott et al., 2019), an easily extensible sequence-to-sequence toolkit.

Classical models were trained with typical hyperparameters provided in Fairseq examples. All LLMs were fine-tuned with identical hyperparameters, employing a constant learning rate of 3e-4, and were optimized via Adam with around 4K warmup updates and batch sizes of 40 samples. LoRA adapter parameters were an α of 16 and an r value of 64, resulting in a total of around 40M added parameters during fine-tuning, with a dropout value of 0.1. For fair comparison, classical models were of a similar, although not quite identical, size in terms of parameter count.

Additionally, all LLMs were fine-tuned while quantized with NormalFloat4 (nf4) quantization. A small performance boost was observed when removing this quantization during inference, so all models did not engage with nf4 quantization during inference. NMT LLMs were fine-tuned for one epoch, as overfitting was observed beyond that point, which we intuit to be possibly due to the well-documented ability of LLMs to quickly memorize training sets (Biderman et al., 2023). In contrast, SimulMT LLMs were fine-tuned on 2M random examples out of the 5M examples in the expanded dataset due to computational constraints.
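The PEFT and quantization settings above translate directly into peft and bitsandbytes configuration objects. The sketch below mirrors the reported values (r=64, α=16, dropout 0.1, nf4); the target module names are model-specific and shown here as an assumption for Falcon-style architectures.

```python
# Sketch: LoRA + 4-bit NormalFloat configs matching the reported fine-tuning setup.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # fine-tuning was run in bfloat16
)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],     # assumption: Falcon-style attention projection
)
```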
6.3 Word or Token-Based Wait-k for LLMs

While classical encoder-decoder SimulMT systems usually engage in either word or token-based wait-k, they most typically engage with whichever is more suitable for their vocabulary (i.e., word versus sub-word vocabularies). In spite of the fact that LLMs function via sub-word vocabularies, we recommend, and employ for this work, word-based wait-k for SimulMT LLMs, as it more closely resembles the flow of engaging with a natural language interface. Moreover, supposing that the LLM is receiving a given sequence actively from a transcription system or something similar, it makes intuitive sense to wait for a word to be emitted from the system as opposed to a fragment.

7 Results and Analysis

7.1 Exploration of Adapting NMT LLMs to SimulMT

In Table 1, we provide a breakdown of the performance of several different models, decoding strategies, and wait-k schedules. Regarding our exploration related to adapting NMT LLMs to SimulMT, we also include results related to our implementation of Speculative Beam Search (SBS) (Zheng et al., 2019). As demonstrated by these results, compared with classical models, LLMs fine-tuned for NMT are very capable of SimulMT upon being adapted during inference (even exceeding the score of the classical NMT transformer on en-de, which performs non-simultaneous translation). It is worth reiterating that classical architectures typically engage in subword-based wait-k whereas we employ word-based wait-k for LLMs, but the comparisons still serve as a useful reference.
SBS-based decoding strategies helped NMT LLMs in the en-de language pair, but lacked improvement for the en-es language pair. We noted that our implementation (and seemingly also the original implementation) was sensitive to both the window size and the number of committed chunks, with large values for either resulting in the speculative target translation getting too close to the size of the source context. In our tests, when reaching the same length as the source context (akin to context levels of wait-1), degenerate output began to appear that resulted in the output trailing off (e.g., a final output of "que..." instead of the correct output of "que"). Notably, too many committed chunks only explains performance gaps for chunk-wise SBS, not single SBS, which is normally a flat improvement upon greedy decoding (single SBS is still sensitive to window size). Future experiments can be conducted with the proposed Simul-LLM framework to quantify these factors.

Fine-tuning Wait-k               | Inference Wait-k | BLEU
SimulMT LLM Fine-tuned in Wait-3 | Wait-3           | 23.68
SimulMT LLM Fine-tuned in Wait-3 | Wait-5           | 25.59
SimulMT LLM Fine-tuned in Wait-3 | Wait-7           | 26.31
SimulMT LLM Fine-tuned in Wait-7 | Wait-3           | 25.18
SimulMT LLM Fine-tuned in Wait-7 | Wait-5           | 28.19
SimulMT LLM Fine-tuned in Wait-7 | Wait-7           | 28.92

Table 2: Peak BLEU scores for various SimulMT LLMs fine-tuned with different wait-k values. Across all inference wait-k values, the SimulMT LLMs fine-tuned in wait-7 outperform the SimulMT LLMs fine-tuned in wait-3 by up to 2.6 BLEU.
framework to quantify these factors.

7.2 Exploration of SimulMT LLMs with Proposed Prompt Structure

In Table 1, we also provide a breakdown of the performance of our exploration of SimulMT LLMs with our proposed prompt structure from Section 5.2, which carefully manages source context availability for target translation generation. Two models are employed for this exploration: one fine-tuned for wait-3 inference, and another fine-tuned for wait-7 inference that produces noticeably better quality translations than the first.

While SimulMT LLMs are a more promising approach for achieving higher translation quality than NMT LLMs, due to more direct task-specific fine-tuning and better context alignment, our experimental results suggest that, for the time being, the performance of SimulMT LLMs is not advantageous over NMT LLMs adapted to SimulMT (and varies from slightly worse to barely better translation quality compared with the classical non-LLM SimulMT baseline). This is not completely unexpected, as NMT LLMs have been optimized heavily in recent years whereas the exploration of SimulMT LLMs has just started. We provide some analysis below that points out several possible reasons for this observed performance gap, and we call for additional community efforts to investigate further.

First, due to computational constraints we were unable to fine-tune for an entire epoch of the training dataset (only 2M random samples out of 5M), which represents a major loss of training signal and a lack of a rigorous wait-k curriculum for the fine-tuned model. Second, it is possible that the fine-tuning hyperparameters are ill-suited for this particular prompt. We consider this likely to be the most influential issue on our observed results, given the drastic differences between the original and expanded datasets. Third, at least one other work related to NMT LLMs (Chen et al., 2023) has demonstrated that relative positional embeddings can cause issues via attention dilution that ends up being unhelpful, suggesting that distancing the source context, the running target hypothesis, and the current translation step hypothesis can be unexpectedly problematic. We posit that our proposed Simul-LLM framework can be leveraged to verify the above reasons.

7.3 Higher Wait-k Generalizability Comparisons

It is well documented that in typical SimulMT systems, training or fine-tuning with a slightly higher wait-k than is intended during inference can boost translation quality and generalizability across slightly lower wait-k values (Ma et al., 2019). While this likely applies to SimulMT LLMs, no existing work has validated that this behavior persists. We provide a brief comparison of two SimulMT LLMs fine-tuned via wait-3 and wait-7 context levels in Table 2. The results in the table demonstrate that, generally, the expected behavior does hold, with all LLMs fine-tuned in wait-7 outperforming their corresponding wait-3 models for the same inference wait-k. We leave validating additional, previously understood SimulMT principles in SimulMT LLMs to future work.

8 Conclusion

In this work, we introduce Simul-LLM, the first open-source framework that enables rapid development of LLM fine-tuning and evaluation pipelines for simultaneous machine translation (SimulMT). Simul-LLM seamlessly integrates with the fine-tuning and generation tools of the popular transformers library as well as with SimulEval, the preeminent SimulMT evaluation framework. In addition to introducing Simul-LLM, we employ this framework to explore adapting existing NMT LLMs to SimulMT, and we propose an expanded fine-tuning dataset and an alternative prompt structure to those employed in typical NMT fine-tuning that we believe better replicates inference-time behavior. Moreover, we validate some classically understood SimulMT concepts concerning wait-k scheduling and examine the behavior of SimulMT LLMs during fine-tuning. Our proposed Simul-LLM framework enables multiple lines of future work that can be carried out to understand and optimize LLMs for SimulMT, and it will likely be a useful tool for the research community.

References

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. 2023. Falcon-40B: An open large language model with state-of-the-art performance.

Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, Chung-Cheng Chiu, Semih Yavuz, Ruoming Pang, Wei Li, and Colin Raffel. 2019. Monotonic infinite lookback attention for simultaneous machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1313–1323, Florence, Italy. Association for Computational Linguistics.

Stella Biderman, USVSN Sai Prashanth, Lintang Sutawika, Hailey Schoelkopf, Quentin Anthony, Shivanshu Purohit, and Edward Raff. 2023. Emergent and predictable memorization in large language models.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Yijie Chen, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, and Jie Zhou. 2023. Improving translation faithfulness of large language models via augmenting instructions.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Alvin Grissom II, He He, Jordan Boyd-Graber, John Morgan, and Hal Daumé III. 2014. Don't until the final verb wait: Reinforcement learning for simultaneous machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1342–1352, Doha, Qatar. Association for Computational Linguistics.

Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O. K. Li. 2017. Learning to translate in real-time with neural machine translation.

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.

Kazuki Irie, Albert Zeyer, Ralf Schlüter, and Hermann Ney. 2019. Language modeling with deep transformers. In Interspeech 2019. ISCA.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models.

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.

M. Ma, L. Huang, H. Xiong, R. Zheng, K. Liu, B. Zheng, C. Zhang, Z. He, H. Liu, X. Li, H. Wu, and H. Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy. Association for Computational Linguistics.

Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, and Juan Pino. 2020. SimulEval: An evaluation toolkit for simultaneous translation.

Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. 2022. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling.

Sara Papi, Marco Gaido, Matteo Negri, and Marco Turchi. 2022. Over-generation cannot be rewarded: Length-adaptive average lagging for simultaneous speech translation. In Proceedings of the Third Workshop on Automatic Simultaneous Translation. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels. Association for Computational Linguistics.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models.

Ahmet Üstün and Asa Cooper Stickland. 2022. When does parameter-efficient transfer learning work for machine translation? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7919–7933, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.

David Vilar, Markus Freitag, Colin Cherry, Jiaming Luo, Viresh Ratnakar, and George Foster. 2023. Prompting PaLM for translation: Assessing strategies and performance. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15406–15427, Toronto, Canada. Association for Computational Linguistics.

Haoran Xu, Young Jin Kim, Amr Sharaf, and Hany Hassan Awadalla. 2023. A paradigm shift in machine translation: Boosting translation performance of large language models.

Biao Zhang, Barry Haddow, and Alexandra Birch. 2023. Prompting large language model for machine translation: A case study.

Baigong Zheng, Kaibo Liu, Renjie Zheng, Mingbo Ma, Hairong Liu, and Liang Huang. 2020. Simultaneous translation policies: From fixed to adaptive.

Renjie Zheng, Mingbo Ma, Baigong Zheng, and Liang Huang. 2019. Speculative beam search for simultaneous translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1395–1402, Hong Kong, China. Association for Computational Linguistics.
