
CLLMs: Consistency Large Language Models

Siqi Kou*¹, Lanxiang Hu*², Zhezhi He¹, Zhijie Deng¹, Hao Zhang²

arXiv:2403.00835v3 [cs.CL] 8 Mar 2024

*Equal contribution. ¹Shanghai Jiao Tong University. ²University of California, San Diego. Correspondence to: Zhijie Deng <zhijied@sjtu.edu.cn>.

Abstract

Parallel decoding methods such as Jacobi decoding show promise for more efficient LLM inference, as they break the sequential nature of the LLM decoding process and transform it into parallelizable computation. In practice, however, Jacobi decoding achieves little speedup over traditional autoregressive (AR) decoding, primarily because it seldom predicts more than one token accurately in a single fixed-point iteration step. To address this, we develop a new approach aimed at realizing fast convergence from any state to the fixed point on a Jacobi trajectory. This is accomplished by refining the target LLM to consistently predict the fixed point given any state as input. Extensive experiments demonstrate the effectiveness of our method, showing 2.4× to 3.4× improvements in generation speed while preserving generation quality across both domain-specific and open-domain benchmarks. Our code is available at https://github.com/hao-ai-lab/Consistency_LLM.

1. Introduction

Large language models (LLMs), including GPT-4 (Achiam et al., 2023), LLaMA (Touvron et al., 2023a;b), and PaLM (Anil et al., 2023), are pushing the limits of artificial intelligence. As LLMs are integrated into more applications (Zheng et al., 2023; Wu et al., 2023), their inference latency plays a crucial role in ensuring a positive user experience and high service quality. However, LLM serving operates in an AR paradigm, generating one token at a time because the attention mechanism needs the states of all previous tokens to produce the next one. Producing a lengthy response therefore requires as many forward passes through the LLM as there are tokens generated, resulting in high latency.

Existing methods address this issue from various perspectives. For example, speculative decoding (Leviathan et al., 2023; Chen et al., 2023) introduces a small draft LLM to guess tokens and lets the target LLM verify them in parallel. Although such methods can opportunistically generate multiple tokens in a single evaluation of the target LLM, obtaining a small yet effective draft model is non-trivial, and managing multiple models within a single system remains a challenging engineering task. Medusa (Cai et al., 2024) alternatively augments the target LLM with extra guess heads to enable self-speculation with as much as 3× speedup on various tasks. Yet the number of added parameters can be significant (e.g., Medusa2 with 5 extra heads adds 1.6B parameters to a 6.7B target LLM). The increased memory consumption can limit generation length and negatively affect inference latency by reducing the memory available for the key-value (KV) cache (Pope et al., 2023).

On the other hand, originating from the Jacobi and Gauss-Seidel fixed-point iterations for solving nonlinear equations (Ortega & Rheinboldt, 2000; Song et al., 2021a), the Jacobi decoding method (Santilli et al., 2023) first randomly guesses the next n tokens in a sequence (referred to as the n-token sequence hereinafter) from an input prompt. The n-token sequence, along with the prompt, is then fed to the LLM to iteratively update itself. Eventually, the n-token sequence converges to the same output generated by AR decoding under a greedy strategy (see Figure 1). The evolution of the n-token sequence forms a Jacobi trajectory between a randomly initialized sequence and the n-token sequence generated by AR decoding (i.e., the fixed point).

However, vanilla Jacobi decoding for LLMs shows only marginal speedup over AR decoding in practice, e.g., an average 1.05× speedup in Santilli et al. (2023). This is because an LLM can rarely yield a correct token when there are incorrect tokens¹ preceding it, due to the attention mechanism, resulting in a long trajectory as illustrated on the left side of Figure 2. Lookahead decoding (Fu et al., 2024) improves efficiency by leveraging n-grams generated from previous Jacobi iterations and verifying them in parallel during the decoding process. However, neither approach achieves the same level of speedup as Medusa.

¹By correctness, we mean alignment with the AR decoding result under a greedy sampling strategy.

Figure 1. An instance of a Jacobi trajectory. "n-token seq" refers to the n-token sequence that is iteratively updated in Jacobi iterations. Starting from a randomly initialized n-token sequence appended to the frozen prefix, the sequence is refined over k iterations until it converges to the fixed point, i.e., the same result as greedy AR decoding.

This work aims to achieve all three goals by refining the target LLM. Specifically, we propose to fine-tune the LLM so that it can yield multiple, instead of one, subsequent tokens of a prefix at once. In the ideal case, with the prompt and a randomly initialized n-token sequence as input, our goal is to train an LLM that can generate the same n-token sequence as AR decoding (the fixed point) in only one step. Our preliminary experiments show that this single-step learning task is difficult when n is large and leads to slow model convergence. We therefore ease the learning process by also taking intermediate points on the Jacobi trajectory, which contain more correct tokens¹, into account. In particular, for the second-to-last point on the trajectory, the learning is identical to AR modeling, at which the target LLM without adaptation has already excelled.

We argue that such a learning strategy, in which a single model is tuned to solve a series of learning problems of mapping any arbitrary point on the trajectory to the fixed point, is beneficial to model convergence (see Figure 4 and Figure 5). Imagining the evolution of the n-token sequence as the denoising process of a natural image (Ho et al., 2020; Song et al., 2021b), we surprisingly find that the above learning procedure draws a sharp analogy to the acceleration technique for diffusion models named consistency models (CMs) (Song et al., 2023; Song & Dhariwal, 2023). CMs aim to achieve single-step image generation using the denoising objective by minimizing distances between consecutive denoising steps along the probability flow ordinary differential equation (ODE) trajectory during training. Our method and CMs share the notion of directly mapping intermediate states of a solving process (of nonlinear systems or ODEs) to its final solution for inference acceleration. Based on this, we refer to our trained models as Consistency Large Language Models (CLLMs). In comparison with previous methods like speculative decoding and Medusa, CLLMs introduce no extra memory cost to accommodate auxiliary model components while delivering significant speedup with minimal performance degradation.

Implementing this learning strategy only requires model training with two loss terms. Following CMs, we convert the aforementioned learning objective into a consistency loss, in which the model is demanded to map an arbitrary point on the Jacobi trajectory to the fixed point. CLLMs also include an AR loss to avoid deviating from the distribution of the target LLM and hence ensure generation quality.

The fine-tuning cost of CLLMs is moderate, e.g., training on only ∼1M tokens for LLaMA-7B to achieve a 3.4× speedup on the Spider dataset. We further empirically identify that such acceleration likely stems from the existence of 1) fast forwarding, where multiple consecutive tokens are correctly predicted in a single forward pass, and 2) stationary tokens, which are correctly predicted and remain unaltered through subsequent iterations, despite being preceded by inaccurate tokens. Examples of both are illustrated in Figure 2.

To summarize, our key contributions are as follows:

• We propose Consistency Large Language Models (CLLMs), a new family of LLMs specialized for the Jacobi decoding method for latency reduction.

• We empirically observe the fast forwarding and stationary tokens phenomena in Jacobi decoding of CLLMs. CLLMs can lead to a 2.0× to 6.8× improvement in the count of fast-forwarded tokens and stationary tokens compared to the original LLM.

• We demonstrate the efficacy of CLLMs on a variety of benchmarks. On domain-specific benchmarks including GSM8K, CodeSearchNet Python, and Spider, CLLMs achieve 2.4× to 3.4× speedup using Jacobi decoding with nearly no loss in accuracy. On the open-domain benchmark MT-bench, CLLMs achieve 2.4× speedup on ShareGPT with state-of-the-art performance, scoring 6.4.

2. Related Work

Efficient LLM Inference. This body of work can be broadly categorized into two streams: methods that necessitate additional training and those that do not. The high AR inference cost of LLMs has sparked a surge of research aimed at efficient LLM inference, primarily focused on accelerating the AR decoding process.

The methods that do not require additional training include speculative decoding, as introduced by Leviathan et al. (2023) and Chen et al. (2023). These techniques enhance LLM decoding speed by leveraging a smaller draft model to predict the outputs of a larger target model, which subsequently verifies these predictions. Another category of training-free approaches involves system- or hardware-oriented optimizations. Notable examples include PagedAttention (Kwon et al., 2023), which optimizes KV cache management for throughput using memory paging, and FlashAttention (Dao et al., 2022; Dao, 2023), which accelerates attention computation by reducing HBM access via softmax tiling. Other strategies enhance LLM inference speed by optimizing model designs, reducing weight/activation precision, and exploiting sparsity, including multi-query and grouped-query attention mechanisms with fused heads (Shazeer, 2019; Ainslie et al., 2023), post-training quantization (Dettmers et al., 2022; Xiao et al., 2023; Frantar et al., 2022; Lin et al., 2023), and various pruning techniques (Sun et al., 2023; Frantar & Alistarh, 2023; Ashkboos et al., 2024).

Methods that necessitate training often require the integration of auxiliary components, such as additional LM or AR heads, to facilitate faster AR generation (Cai et al., 2024; Li et al., 2024). They may also involve significant modifications to the model weights or architecture, as seen in various pruning approaches (Ma et al., 2023; Xia et al., 2022; 2023). Moreover, training can enhance certain training-free techniques, like speculative decoding, by capturing the behavior of the original, larger model in a smaller student model through distillation, thereby retaining performance at reduced size (Zhou et al., 2023b; Liu et al., 2023). A detailed analysis comparing CLLMs with different SOTA baseline methods is provided in Section B and Table 7. It is worth noting that CLLMs require neither modifications to pre-trained models nor any auxiliary components. This brings higher memory efficiency and adaptability to users at inference time.

LLM Distillation. Knowledge distillation (KD) serves as a technique for creating smaller models that replicate the functionality of larger ones. While traditional KD approaches often fall short for LLMs, Gu et al. (2023) adapt KD for autoregressive LLMs, focusing on minimizing the reverse KL divergence between student and teacher models through student-driven decoding. In another advancement, Agarwal et al. (2023) introduce generalized knowledge distillation (GKD), which balances forward and reverse KL divergences by employing a mix of data sampled from both teacher and student models.

CLLMs are distinct from these works, as our proposed method can be regarded as a self-distillation approach with a Jacobi trajectory training dataset that matches the target LLM's output distribution.

Consistency Models. Diffusion models (Ho et al., 2020; Song et al., 2021b) suffer from a slow iterative sampling process. Consistency models overcome this limitation by mapping any point along the probability flow ODE of the diffusion process back to the original point, corresponding to the initial image, in a single step (Song et al., 2023). In this work, we highlight that a parallel can be drawn between the few-step generation capability of CLLMs and that of consistency models.

3. Methodology

This section begins with a review of the Jacobi decoding method (Santilli et al., 2023) for accelerating LLM inference, then elaborates on CLLMs, a refinement of pre-trained LLMs that enjoys higher speedup from Jacobi decoding. In this paper, we only consider greedy sampling and leave other sampling strategies to future work. We also empirically identify the fast-forwarding phenomenon and the emergence of stationary tokens in CLLMs, which serve as the source of such acceleration.

3.1. Preliminary: Jacobi Decoding

Given a prompt x and a pre-trained LLM p(·|x), we typically obtain the model response with the standard AR decoding method under the greedy strategy, i.e.,

$$
y_i = \arg\max_y \, p(y \mid y_{<i}, x), \quad i = 1, \ldots, n,
\tag{1}
$$

where y_{<i} denotes {y_1, ..., y_{i-1}}. As shown, n forward passes of the LLM are required to obtain n tokens y_{≤n}. The sequential nature of AR decoding hinders the fast generation of a lengthy response in practice. Speculative decoding (Leviathan et al., 2023; Zhou et al., 2023b; Liu et al., 2023) and Medusa (Cai et al., 2024) are existing remediations to this issue, but the former suffers from the difficulty of finding a suitable draft model and managing both models in a single system, and the latter causes significant increases in model size and architectural complexity.

In comparison, Jacobi decoding has shown the capacity to reduce the inference cost of LLMs without extra model components (Santilli et al., 2023) and is therefore more broadly applicable. Concretely, supposing f(y_i, y_{<i}, x) := y_i − arg max_y p(y | y_{<i}, x), Jacobi decoding re-frames the LLM inference process in Equation (1) as solving a system of nonlinear equations w.r.t. y_i:

$$
f(y_i, y_{<i}, x) = 0, \quad i = 1, \ldots, n.
\tag{2}
$$

It can be solved in parallel using the Jacobi fixed-point iteration method (Ortega & Rheinboldt, 2000), starting from a randomly initialized n-token sequence y^(0) = {y_1^(0), ..., y_n^(0)} and iteratively updating it by the following rule:

$$
\begin{cases}
y_1^{(j+1)} = \arg\max_y \, p(y \mid x) \\
y_2^{(j+1)} = \arg\max_y \, p(y \mid y_1^{(j)}, x) \\
\quad \vdots \\
y_n^{(j+1)} = \arg\max_y \, p(y \mid y_{<n}^{(j)}, x).
\end{cases}
\tag{3}
$$

Notably, for an LLM, the above n maximization problems can be solved in parallel by using a causal attention mask, i.e., only one forward pass of the LLM is required to obtain y^(j+1) from y^(j). The iteration exits at some step k such that y^(k) = y^(k−1), and we define y* := y^(k) as the fixed point. Let J := {y^(1), ..., y^(k)} denote the Jacobi trajectory. It can be proven that y* is identical to the AR decoding result under a greedy strategy (Song et al., 2021a). The acceleration effect of Jacobi decoding primarily stems from the fact that each forward pass of the LLM can potentially fix more than one token within the n-token sequence, so the number of queries to the LLM can be smaller than in AR decoding, i.e., k ≤ n.
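
To make the iteration concrete, the following is a minimal sketch of greedy Jacobi decoding for a single n-token block, assuming a Hugging Face-style causal LM whose forward pass returns logits; the function name and the random initialization scheme are our own illustration, and the KV-cache optimization discussed next is deliberately omitted.

```python
import torch

@torch.no_grad()
def jacobi_decode_block(model, prompt_ids, n, max_iters=None):
    """Greedy Jacobi decoding of one n-token block (vanilla, no KV cache reuse)."""
    vocab_size = model.get_input_embeddings().num_embeddings
    # y^(0): random initialization of the n-token sequence.
    y = torch.randint(0, vocab_size, (1, n), device=prompt_ids.device)
    trajectory = [y.clone()]

    max_iters = max_iters or n  # convergence is guaranteed within n iterations
    for _ in range(max_iters):
        # One forward pass scores all n positions in parallel (causal mask).
        logits = model(torch.cat([prompt_ids, y], dim=1)).logits
        # The logit at position (prompt_len - 1 + i) predicts block token i.
        y_next = logits[:, prompt_ids.shape[1] - 1 : -1, :].argmax(dim=-1)
        trajectory.append(y_next.clone())
        if torch.equal(y_next, y):   # fixed point y* reached (= greedy AR output)
            break
        y = y_next
    return y, trajectory
```

In the worst case this costs as many forward passes as AR decoding, but every pass that fixes several tokens at once reduces the total count, which is exactly the effect CLLM training amplifies.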

Generally, for a prefix x of length n_x, each forward pass in Jacobi decoding deals with a longer sequence of length n_x + n, demanding more FLOPs than AR decoding, which deals with a shorter sequence of length n_x + i, 1 ≤ i ≤ n. Yet the added overhead can be minimal when n_x is large or n is small. Besides, we can integrate the KV cache mechanism (Pope et al., 2023) into Jacobi decoding to further reduce the additional overhead, as detailed below.

Jacobi Decoding with KV Cache. The sequential nature of LLMs ensures that each token generation depends only on preceding tokens. Namely, we have an increasing number of fixed tokens that are correctly aligned with the AR generations. Thanks to the KV cache technique, we do not need to iteratively update them or recompute their keys and values for attention in subsequent iterations. So we 1) progressively reduce the length of the iteration state by at least one token per iteration and 2) save the KV cache of fixed tokens along the decoding procedure. We elaborate on this in Algorithm 3.

Figure 2. Comparison of Jacobi trajectories between a target LLM (left) and a CLLM (right) on Spider. Each point along the Jacobi trajectory is a color-coded sequence: blue for correct tokens matching the AR results, and red for inaccurate ones. The CLLM demonstrates enhanced efficiency, converging to the fixed point 2× faster than the target LLM. This increased efficiency can be attributed to the consistency loss, which facilitates the learning of the structure of each n-token sequence (e.g., collocations such as "GROUP BY ... HAVING count(*)") given a prefix.

3.2. Consistency Large Language Models (CLLMs)

Despite the promise, the speedup effect of Jacobi decoding for vanilla LLMs is minimal in practice (Santilli et al., 2023; Fu et al., 2024). The reason is that AR-trained LLMs can usually generate only one correct token in each Jacobi iteration, since such models can rarely yield a correct token when there are incorrect preceding tokens. To address this, we propose to adapt pre-trained LLMs to consistently map any point y on the Jacobi trajectory J to the fixed point y*. Surprisingly, such an objective is analogous to that of consistency models (Song et al., 2023; Song & Dhariwal, 2023), a leading acceleration approach for diffusion models (Ho et al., 2020; Song et al., 2021b).

This section first delineates our data preparation procedure for tuning CLLMs and then elaborates on the training procedure. Lastly, we discuss possible sources of CLLMs' acceleration.

3.2.1. Jacobi Trajectory Collection

Let p denote the target LLM we aim to adapt, and let q_θ(·|x) denote the CLLM with parameters θ initialized with those of p. To realize the aforementioned adaptation, we collect a set of Jacobi trajectories by running the Jacobi decoding algorithm with the target LLM p on prompts from a certain domain of interest, forming an original training set D. We summarize the algorithm for dataset generation in Algorithm 1. Note that to generate a lengthy response l of N (N ≫ n) tokens, we can sequentially perform Jacobi decoding for every truncation of n tokens to avoid slow model evaluation on lengthy inputs. Consequently, l amounts to the concatenation of a set of consecutive fixed points.

Algorithm 1  Generate dataset to train a CLLM
  Input: prompt set O, n-gram size n, max new tokens N, target LLM p
  repeat
    Sample prompt x from origin dataset O.
    while <EOS> is not generated and length generated < N do
      J = {y^(0), ..., y*} ← JacobiDecoding(p, x)
      x ← cat(x, y*)
      if use data augmentation then
        for all y ∈ J do
          Augment y by randomly correcting false tokens
        end for
      end if
      Append x and J to training dataset D
    end while
  until all prompts in origin dataset O are used

Algorithm 2  Training algorithm for a CLLM
  Input: Jacobi trajectory dataset D, n-gram size n, weight factor ω, CLLM q_θ(·|x)
  repeat
    Sample prompt x, Jacobi trajectory J, and full response l from D
    Calculate L_AR using Equation (6)
    Sample y from J
    Calculate L_consistency using Equation (4) or Equation (5)
    Calculate L(θ) and update the parameters θ
  until convergence

Data augmentation. In a typical Jacobi iteration process, the correct tokens often appear one after another, and n-token sequences usually exhibit a "correct, correct, wrong, wrong, wrong" pattern. In comparison, patterns like "correct, correct, wrong, correct, wrong" can be rare. To enhance the learning and generalization capabilities of CLLMs, we augment the dataset D by randomly correcting erroneously predicted tokens within the samples.
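
As a rough illustration of Algorithm 1, the sketch below collects trajectories with the `jacobi_decode_block` helper from Section 3.1 and applies the random-correction augmentation; the function names, the per-token correction probability, and the dataset layout are our own assumptions rather than the released implementation.

```python
import random
import torch

@torch.no_grad()
def collect_trajectories(model, tokenizer, prompts, n=16, max_new_tokens=256,
                         augment=True, p_fix=0.3):
    """Collect Jacobi trajectories with the target LLM for CLLM training."""
    dataset = []
    for prompt in prompts:
        x = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
        generated = 0
        while generated < max_new_tokens:
            y_star, trajectory = jacobi_decode_block(model, x, n)
            for y in trajectory:
                if augment:
                    # Randomly replace some wrong tokens with their fixed-point
                    # values to create rarer "correct, wrong, correct" patterns.
                    wrong = (y != y_star).nonzero(as_tuple=True)[1]
                    for i in wrong.tolist():
                        if random.random() < p_fix:
                            y[0, i] = y_star[0, i]
                dataset.append({"prompt_ids": x.clone(), "state": y.clone(),
                                "fixed_point": y_star.clone()})
            x = torch.cat([x, y_star], dim=1)  # extend the prefix with the fixed point
            generated += n
            if tokenizer.eos_token_id in y_star[0].tolist():
                break
    return dataset
```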

Data post-processing. Since the target LLM itself can make errors on some prompts, it often produces low-quality generations in the Jacobi trajectories. We find that training a CLLM on n-token sequences with token-level (Holtzman et al., 2019) or sentence-level repetitions (Polišenská et al., 2015) often results in repetitive content generation and noticeably degrades performance. Recognizing the significance of high-quality datasets for training LLMs (Zhou et al., 2023a), we perform post-processing to eliminate low-quality samples from our training dataset D based on a rule-based detector.
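
The paper does not specify the rule-based detector, so the filter below is only an assumed example of the kind of check that could flag token-level or sentence-level repetition in a fixed point before it enters D; the thresholds are arbitrary.

```python
def looks_repetitive(token_ids, max_token_run=8, max_ngram_repeats=4, ngram=8):
    """Heuristic check for degenerate repetition in a generated token sequence."""
    # Token-level repetition: the same token id repeated many times in a row.
    run = 1
    for prev, cur in zip(token_ids, token_ids[1:]):
        run = run + 1 if cur == prev else 1
        if run >= max_token_run:
            return True
    # Sentence/phrase-level repetition: one n-gram occurring too many times.
    counts = {}
    for i in range(len(token_ids) - ngram + 1):
        key = tuple(token_ids[i:i + ngram])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] >= max_ngram_repeats:
            return True
    return False

# Keep only samples whose fixed point passes the filter, e.g.:
# clean_dataset = [s for s in dataset
#                  if not looks_repetitive(s["fixed_point"][0].tolist())]
```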

3.2.2. Training

We jointly optimize two losses for tuning CLLMs: one guarantees the prediction of multiple tokens at once, and the other keeps the CLLM from deviating from the target LLM so as to maintain generation quality.

Consistency Loss. For a prompt x with Jacobi trajectory J, let y and y* denote a random state on the trajectory and the fixed point, respectively. We can directly push the CLLM to output y* with y as the input by minimizing the following global consistency (GC) loss:

$$
\mathcal{L}_{\mathrm{GC}} = \mathbb{E}_{(x,J)\sim D,\, y\sim J}
\left[ \sum_{i=1}^{n} D\!\left( q_{\theta^{-}}(\cdot \mid y^{*}_{<i}, x) \,\|\, q_{\theta}(\cdot \mid y_{<i}, x) \right) \right],
\tag{4}
$$

where θ− = stopgrad(θ) and we abuse notation to represent uniform sampling from the dataset. D(·‖·) denotes the distance between two distributions, with forward KL, reverse KL, and their mixture (i.e., the Jensen-Shannon divergence) as popular examples (Agarwal et al., 2023). We primarily experiment with the forward KL.

Alternatively, we can also achieve the goal that the CLLM consistently maps all intermediate states to the fixed point with a local consistency (LC) loss following CMs (Song et al., 2023), where the adjacent states (y^(j), y^(j+1)) in the Jacobi trajectory J are demanded to yield the same outputs:

$$
\mathcal{L}_{\mathrm{LC}} = \mathbb{E}_{(x,J)\sim D,\, (y^{(j)}, y^{(j+1)})\sim J}
\left[ \sum_{i=1}^{n} D\!\left( q_{\theta^{-}}(\cdot \mid y^{(j+1)}_{<i}, x) \,\|\, q_{\theta}(\cdot \mid y^{(j)}_{<i}, x) \right) \right].
\tag{5}
$$

We compare L_GC and L_LC empirically in Table 6, where the results show that the global consistency loss is more efficacious for training CLLMs. This is probably because L_LC only implicitly aims at mapping any point consistently to the fixed point by minimizing the distance between consecutive points. Moreover, there is still a gap between L_LC and the goal of predicting multiple tokens at once, because there is typically only one more correct token in y^(j+1) than in y^(j) in the collected Jacobi trajectory.

AR Loss. To avoid deviating from the distribution of the target LLM, we incorporate the traditional AR loss based on the generation l of the target LLM p:

$$
\mathcal{L}_{\mathrm{AR}} = \mathbb{E}_{(x,l)\sim D} \left[ -\sum_{i=1}^{N} \log q_{\theta}(l_i \mid l_{<i}, x) \right].
\tag{6}
$$

This term contributes substantially to maintaining generation quality (see Table 6).

Consequently, the total loss for training a CLLM is

$$
\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{consistency}} + \omega \mathcal{L}_{\mathrm{AR}},
\tag{7}
$$

where ω is a weighting coefficient and L_consistency can be either L_GC or L_LC; we adopt L_GC in our experiments. The training procedure is detailed in Algorithm 2.
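
Below is a minimal sketch of one training step combining the global consistency loss (forward KL, Equation (4)) with the AR loss (Equation (6)); it assumes a Hugging Face-style model `q_theta` returning logits and the batch fields produced by the collection sketch above, and it glosses over padding, prompt masking, and the optimizer step.

```python
import torch
import torch.nn.functional as F

def cllm_training_step(q_theta, batch, omega=10.0):
    """One step combining L_GC (forward KL) and L_AR, cf. Equation (7)."""
    x, y, y_star, l = (batch["prompt_ids"], batch["state"],
                       batch["fixed_point"], batch["response_ids"])

    # --- Global consistency loss (Eq. 4) -------------------------------------
    # Teacher: stop-gradient copy of q_theta conditioned on the fixed-point prefix.
    with torch.no_grad():
        teacher_logits = q_theta(torch.cat([x, y_star], dim=1)).logits
        teacher_logits = teacher_logits[:, x.shape[1] - 1 : -1, :]
    # Student: q_theta conditioned on the intermediate-state prefix.
    student_logits = q_theta(torch.cat([x, y], dim=1)).logits
    student_logits = student_logits[:, x.shape[1] - 1 : -1, :]
    # Forward KL D(teacher || student), summed over the n block positions.
    loss_gc = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True, reduction="batchmean",
    )

    # --- AR loss on the full target-LLM response l (Eq. 6) -------------------
    ar_logits = q_theta(torch.cat([x, l], dim=1)).logits[:, x.shape[1] - 1 : -1, :]
    loss_ar = F.cross_entropy(
        ar_logits.reshape(-1, ar_logits.size(-1)), l.reshape(-1)
    )

    loss = loss_gc + omega * loss_ar
    loss.backward()
    return loss.detach()
```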

3.3. Acceleration Mechanisms in CLLMs

Next, we compare the Jacobi trajectories of the target LLM and the CLLM in Figure 2 to reach an in-depth understanding of the acceleration mechanisms in CLLMs.

As shown on the left side of Figure 2, target LLMs typically generate only one correct token per iteration. In contrast, we identify a fast forwarding phenomenon in CLLMs, where multiple consecutive tokens are correctly predicted in a single forward pass. The average fast-forward count per forward pass in CLLMs ranges from 2 to 6 tokens, as evaluated in Table 3. Moreover, tokens correctly generated in advance (e.g., "country" and "H" at points 5 and 6 on the left side of Figure 2) are often replaced inaccurately in subsequent iterations by target LLMs. Unlike the pre-trained models, CLLMs exhibit the capability of predicting correct tokens preemptively, even with preceding incorrect tokens, while ensuring those tokens remain unchanged. We term such tokens stationary tokens; their existence allows the simultaneous extension of discontinuous correct tokens within the n-token sequence. Both phenomena contribute to the fast convergence of Jacobi decoding in CLLMs, thereby leading to a considerable generation speedup.
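
The per-iteration counts reported in Table 3 can be approximated with a helper like the one below, which walks a collected trajectory and tallies newly fixed prefix tokens (fast-forwarded) and later tokens that already match the fixed point and never change again (stationary); this reflects our reading of the definitions above, not the authors' exact profiling code.

```python
def profile_trajectory(trajectory, y_star):
    """Count fast-forwarded and stationary tokens for each Jacobi iteration."""
    y_star = y_star[0].tolist()
    states = [state[0].tolist() for state in trajectory]
    n = len(y_star)
    fast_forwarded, stationary = [], []
    fixed = 0  # length of the committed (left-aligned correct) prefix so far
    for j in range(1, len(states)):
        cur = states[j]
        # Fast-forwarded tokens: newly correct tokens extending the prefix.
        new_fixed = fixed
        while new_fixed < n and cur[new_fixed] == y_star[new_fixed]:
            new_fixed += 1
        fast_forwarded.append(new_fixed - fixed)
        # Stationary tokens: positions beyond the prefix that are already correct
        # now and never change again in later iterations.
        count = 0
        for i in range(new_fixed, n):
            if cur[i] == y_star[i] and all(s[i] == y_star[i] for s in states[j:]):
                count += 1
        stationary.append(count)
        fixed = new_fixed
    return fast_forwarded, stationary
```

Averaging these per-iteration counts over many prompts would yield Table 3-style statistics.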

We observe that CLLMs acquire a crucial linguistic concept through training: collocations, i.e., series of words or terms that co-occur more frequently than one would expect by random chance (Smadja, 1991). Language is not solely composed of isolated words but also relies heavily on specific word pairings. Examples of collocations are abundant in both natural and coding languages. They include verb + preposition combinations (e.g., "talk to", "remind ... of ..."), verb + noun structures (e.g., "make a decision", "catch a cold"), and many more domain-specific syntactical structures (e.g., "SELECT ... FROM ..." or "if ... else" for programming). The consistency generation objective allows CLLMs to infer such structures from any point in the Jacobi trajectory, encouraging CLLMs to acquire proficiency in numerous collocations and thereby predict multiple words simultaneously, minimizing iteration steps.

Notably, lookahead decoding (Fu et al., 2024) collects n-grams generated from previous Jacobi iterations as candidate tokens and verifies them in the next iteration to accelerate decoding. CLLMs can also be combined with lookahead decoding to achieve extra speedup (see Table 1 and Table 2), because the collocations learned by CLLMs improve the quality of the n-grams and thus increase the acceptance rate.

Table 1. Comparison of CLLMs with other baselines, including speculative decoding using a distilled draft model, Medusa, and a fine-tuned model, all using LLaMA2-7B as the backbone model. Performance and inference speed are evaluated with the applicable generation techniques. To quantify speed improvements, we measure speedup as the ratio of the wall-clock speed to the baseline AR decoding speed for each model. Results are measured with a batch size of 1.

Methods                                        Speed (tokens/s)   Speedup   Metric   Size
GSM8K
Fine-tuned LLaMA2-7B (Chern et al.)
  + AR                                         43.5               1.0×      59.1
  + Jacobi                                     45.7               1.1×      59.1     6.7B
  + lookahead                                  74.8               1.7×      59.1
CLLM-LLaMA2-7B
  + AR                                         43.5               1.0×      56.4
  + Jacobi                                     132.4              3.0×      56.4     6.7B
  + lookahead                                  125.2              2.9×      56.4
Medusa-2 + LLaMA2-7B
  + typical                                    70.2               1.6×      51.3     8.3B
Fine-tuned LLaMA2-7B + distilled LLaMA-160m
  + speculative                                73.8               1.7×      59.1     6.8B
ShareGPT (MT-Bench)
Fine-tuned LLaMA2-7B
  + AR                                         37.6               1.0×      6.5
  + Jacobi                                     39.9               1.1×      6.5      6.7B
  + lookahead                                  60.8               1.6×      6.5
CLLM-LLaMA2-7B
  + AR                                         36.7               1.0×      6.4
  + Jacobi                                     88.4               2.4×      6.4      6.7B
  + lookahead                                  95.0               2.5×      6.4
Medusa-2 + LLaMA2-7B
  + typical                                    102.5              2.7×      6.4      8.3B
Fine-tuned LLaMA2-7B + distilled LLaMA-160m
  + speculative                                51.3               1.4×      6.5      6.8B

4. Experiments

4.1. Evaluations

Benchmarks and Setup. We evaluate performance across three domain-specific tasks: text-to-SQL (Spider) (Yu et al., 2018), Python code generation (Code-Search-Net Python) (Husain et al., 2019), and grade-school math (GSM8K) (Cobbe et al., 2021). To test CLLMs' generalizability on open-domain conversational interactions and instruction-following scenarios, we also train CLLMs on ShareGPT² data and perform evaluation on MT-bench (Zheng et al., 2023). The performance metrics are the greedy answers' problem solve rate (test@1) on GSM8K, the MT-bench score, execution accuracy on Spider, and strict accuracy (pass@1) on Human-Eval. Additionally, we run evaluations of CLLMs' language modeling capability on raw-WikiText2 (Merity et al., 2016) and PTB (Pan et al., 2020).

Reported experiments were conducted using either the pre-trained coder LLM Deepseek-coder-7B-instruct (Bi et al., 2024) or LLaMA-2-7B (Touvron et al., 2023a;b), depending on the task. Both training and evaluation are carried out on servers equipped with 8 NVIDIA A100 40GB GPUs and 128 AMD EPYC 7742 64-core processors.

²http://www.sharegpt.com


Table 2. Comparison of CLLMs with other baselines using Deepseek-Coder-7B-Instruct as the backbone model.

Methods                                        Speed (tokens/s)   Speedup   Metric   Size
Spider
Fine-tuned Deepseek-7B
  + AR                                         38.0               1.0×      70.0
  + Jacobi                                     39.5               1.0×      70.0     6.7B
  + lookahead                                  55.3               1.5×      70.0
CLLM-Deepseek-7B
  + AR                                         38.0               1.0×      69.3
  + Jacobi                                     127.4              3.4×      69.3     6.7B
  + lookahead                                  135.2              3.6×      69.3
Medusa-2 + Deepseek-7B
  + typical                                    104.2              2.7×      66.4     8.3B
Fine-tuned Deepseek-7B + distilled LLaMA-160m
  + speculative                                66.8               1.8×      70.0     6.8B
Code-Search-Net Python
Fine-tuned Deepseek-7B
  + AR                                         40.1               1.0×      60.4
  + Jacobi                                     43.2               1.1×      60.4     6.7B
  + lookahead                                  68.0               1.7×      60.0
CLLM-Deepseek-7B
  + AR                                         38.5               1.0×      59.2
  + Jacobi                                     102.1              2.5×      59.2     6.7B
  + lookahead                                  115.7              2.9×      59.2
Medusa-2 + Deepseek-7B
  + typical                                    128.0              3.2×      48.3     8.3B
Fine-tuned Deepseek-7B + distilled LLaMA-160m
  + speculative                                59.3               1.5×      60.4     6.8B

Baselines. In this section, we compare CLLMs with a range of alternative models that employ various strategies to speed up the inference process. These include Medusa (Cai et al., 2024), which modifies the underlying architecture, and approaches utilizing distilled draft models for speculative decoding (Zhou et al., 2023b; Liu et al., 2023). Alongside these, we also consider fine-tuned baseline models for a comprehensive comparison. Our evaluation tests each model under the decoding paradigms it is compatible with to thoroughly assess inference quality and speed. The decoding algorithms include vanilla AR decoding, Jacobi decoding (Song et al., 2021a), speculative decoding (Leviathan et al., 2023), and lookahead decoding (Fu et al., 2024).

Results. To evaluate the performance and inference speedup of CLLMs across various tasks, we conduct an extensive comparison with the SOTA baselines on the three domain-specific tasks and the open-domain MT-bench. Table 1 and Table 2 compare CLLMs against fine-tuned baseline models across three generation modes (AR decoding, Jacobi decoding, and lookahead decoding), as well as against a stronger speculative decoding baseline using a distilled draft model. In both Jacobi and lookahead decoding, CLLMs consistently surpass the baselines. Notably, on the Spider dataset, CLLMs achieve a 3.4× speedup with negligible performance loss using Jacobi decoding. When benchmarked against other SOTA methods for efficient LLM inference, particularly those necessitating training, CLLMs exhibit fast consistency generation while maintaining lower memory and computational demands, with the lowest memory consumption compared to Medusa and speculative decoding. In these cases, CLLMs still consistently outperform speculative decoding with a distilled draft model and achieve better accuracy with comparable or even better inference speedup on datasets like Spider and GSM8K, where collocations are more common. CLLMs can also be seamlessly integrated with lookahead decoding, gaining more speedup than lookahead decoding applied to fine-tuned LLMs.

We highlight that CLLMs' advantage over speculative decoding with distilled draft models and over Medusa is their high adaptability. This is because CLLMs are models tailored for Jacobi decoding, and Jacobi decoding requires no modification to the original models. In contrast, both speculative decoding and Medusa require auxiliary components, such as extra LM heads, a tree-based attention mask, or a draft model, which usually come with the cost of searching for the optimal configuration. This is further summarized in Table 7.

Moreover, the language modeling results in Table 5 show that CLLMs are able to maintain a low perplexity while delivering at least 2× speedup, suggesting CLLMs' potential to be trained as pre-trained LLMs with higher inference efficiency.

4.2. Acceleration Mechanisms in CLLMs

With the insights provided in Section 3.3, we investigate the fast-forwarding phenomenon and the emergence of stationary tokens in Jacobi decoding to provide further empirical evidence for our hypothesis. We compare fast-forwarded and stationary token counts in target LLMs and CLLMs across the four datasets in Table 3.

From the table, there is a consistent 2.0× to 6.8× improvement in both fast-forwarded token and stationary token counts across all four datasets. In particular, for domain-specific datasets this improvement is much more significant than for the open-domain dataset profiled on MT-bench. The results align with the observations from Section 3.3: specialized domains like coding contain more distinctive collocations and easy syntactical structures, such as blank spaces, newline tokens, and repetitive special characters, as demonstrated in Figure 2, whereas open-domain conversations in ShareGPT and MT-bench involve a significantly more diverse set of collocations.


Table 3. Profiling results for fast-forwarded and stationary token counts in fine-tuned models and CLLMs. The numbers are reported per n-token sequence, with the best-performing model and the accompanying n-gram size. The fast-forwarded token count reported in the table includes the one token that would be predicted correctly even without fast-forwarding.

Models                                        n-token sequence length   Fast-forward token count   Stationary token count
Spider
Fine-tuned Deepseek-coder-7B-instruct         16                        1.1                        0.4
CLLM-Deepseek-coder-7B-instruct (size 16)     16                        5.7                        1.6
Code-Search-Net Python
Fine-tuned Deepseek-coder-7B-instruct         32                        1.1                        0.4
CLLM-Deepseek-coder-7B-instruct (size 32)     32                        4.0                        6.8
GSM8K
Fine-tuned LLaMA-2-7B                         16                        1.1                        0.1
CLLM-LLaMA-2-7B (size 16)                     16                        2.8                        2.0
ShareGPT
Fine-tuned LLaMA-2-7B                         32                        1.1                        0.3
CLLM-LLaMA-2-7B (size 32)                     32                        2.2                        4.8

4.3. Ablation Studies

Dataset sizes and generalizability. In Section 3.2.1, Jacobi trajectory datasets are collected to conduct training for efficient Jacobi decoding. Table 4 demonstrates that larger Jacobi trajectory datasets bring more significant speedup, and the speedup gradually saturates as the dataset size scales. Moreover, CLLMs trained with more data perform well even at n-token sequence lengths they were not trained on, introducing more deployment-time robustness.

Different lengths of n-token sequence. We investigate how different n-token sequence lengths in the Jacobi trajectory dataset affect CLLMs' performance on GSM8K. We employ varying lengths to generate the Jacobi dataset and train the CLLMs accordingly. Figure 3 illustrates that CLLMs consistently maintain generation quality when trained with different lengths. In practice, longer sequence lengths come at the cost of increased computational overhead during inference. In Figure 3, a significant degradation in inference speed can thus be observed when the n-token sequence length exceeds 64.

Figure 3. Accuracy and speedup of models trained with different n-token sequence lengths (16 to 256) on the GSM8K dataset. The sequence length for generation matches the training setting. Speedup is measured as the ratio of the wall-clock generation throughput when employing Jacobi decoding to that of the baseline AR decoding.

Loss design. We adjust the ratio of the consistency loss to the autoregressive loss described in Section 3.2.2 and evaluate different loss ratios' performance on GSM8K. As illustrated in Table 6, increasing the emphasis on the autoregressive loss does indeed enhance accuracy, though it slightly compromises the speedup gains. Additionally, we compare the efficacy of CLLMs trained with the global consistency loss and the local consistency loss. Table 6 demonstrates that the global loss is more efficacious for training CLLMs.

4.4. Limitations and Discussion

In our experiments, we observe that achieving significant speedup while maintaining good generation quality with a CLLM relies strongly on having a high-quality Jacobi trajectory dataset. Therefore, data cleaning is crucial, as discussed in Section 3.2.1. Dataset size also plays a role, as described in Section 4.3 and shown in Table 4, although to a lesser extent. For instance, Jacobi trajectories generated with only 10% of the Code-Search-Net Python dataset are able to yield a 2.9× speedup, as demonstrated in Table 2. However, for open-domain datasets like ShareGPT, more data is necessary for improved efficiency.


In our proposed method and experiments, we primarily use output sequences from the teacher (Kim & Rush, 2016) to collect Jacobi trajectories and train a CLLM. This introduces some additional overhead in comparison with conventional model training. On-policy GKD (Agarwal et al., 2023) suggests that LLM distillation using a mixture of teacher and student samples, or even student samples alone, can yield high-performing models. One mitigation is therefore to use n-token sequences generated by the trained model itself as the training samples. This would remove the Jacobi trajectory collection overhead, making our proposed method potentially feasible for pre-training. Results from our language modeling experiments, detailed in Table 5, demonstrate the robustness of the CLLM when trained on pre-training jobs, with a notable speedup. By incorporating on-policy GKD, it is conceivable that a modified version of our proposed method could be employed for LLM pre-training. This modification would equip the pre-trained model with both a strong language modeling capability, as existing models possess, and a high generation speed when employing Jacobi decoding for inference. We leave the opportunity of adapting CLLMs to pre-training jobs for future work.

Table 4. Comparison of the performance of CLLMs trained with different sizes of Jacobi trajectory datasets on ShareGPT.

Trajectory count   MT-bench   Inference speedup (varying n-token sequence lengths)
                              16     32     64     128    256
20K                6.1        1.7×   1.8×   1.4×   1.2×   1.1×
100K               6.4        2.5×   2.4×   2.1×   2.0×   1.5×
500K               6.4        2.7×   2.7×   2.2×   2.1×   1.8×

Table 5. CLLMs' performance versus the fine-tuned baseline on language modeling tasks.

Methods                  Speed (tokens/s)   Speedup   PPL (↓)
raw-WikiText2
fine-tuned LLaMA2-7B
  + AR                   41.2               1.0×      8.0
  + Jacobi               36.9               1.0×      8.0
  + lookahead            58.1               1.6×      8.0
CLLM-LLaMA2-7B
  + AR                   40.1               1.0×      9.5
  + Jacobi               83.2               2.1×      9.5
  + lookahead            89.5               2.2×      9.5
PTB
fine-tuned LLaMA2-7B
  + AR                   43.8               1.0×      15.6
  + Jacobi               41.8               1.0×      15.6
  + lookahead            62.0               1.5×      15.6
CLLM-LLaMA2-7B
  + AR                   43.6               1.0×      15.3
  + Jacobi               98.1               2.3×      15.3
  + lookahead            101.5              2.3×      15.3

Table 6. Comparison of the performance of CLLMs trained with different loss designs. All models are trained on GSM8K.

Loss                 Speedup   Accuracy
L_GC + L_AR          3.2×      51.3
L_GC + 10 · L_AR     3.0×      56.4
L_LC + L_AR          2.8×      55.2
L_LC + 10 · L_AR     2.4×      56.0

5. Conclusion

In this work, we introduce CLLMs, a new family of LLMs that excel in efficient parallel decoding, designed to significantly enhance the efficiency of Jacobi decoding. Unlike other existing techniques for efficient LLM inference, which often require either additional architectural components (Cai et al., 2024; Li et al., 2024) or draft models (Leviathan et al., 2023; Zhou et al., 2023b; Liu et al., 2023), CLLMs are directly adapted from a target pre-trained LLM. This reduces the complexity associated with additional architecture designs or managing two different models in a single system. In addition, CLLMs can be integrated seamlessly with other techniques for efficient LLM inference (Dao, 2023; Fu et al., 2024; Ainslie et al., 2023) to achieve greater speedup. We have demonstrated the efficacy of CLLMs on both specific and open domains, revealing a significant improvement in generation speed while preserving generation quality.

Impact Statement

This work addresses a challenge in machine learning and proposes a solution; potential negative consequences are not apparent. While it is theoretically possible for any technique to be misused, the likelihood of such misuse occurring at the current stage is low.

References

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Agarwal, R., Vieillard, N., Stanczyk, P., Ramos, S., Geist, M., and Bachem, O. GKD: Generalized knowledge distillation for auto-regressive sequence models. arXiv preprint arXiv:2306.13649, 2023.


Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., and Sanghai, S. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.

Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.

Ashkboos, S., Croci, M. L., do Nascimento, M. G., Hoefler, T., and Hensman, J. SliceGPT: Compress large language models by deleting rows and columns, 2024.

Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., et al. DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.

Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J. D., Chen, D., and Dao, T. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024.

Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023.

Chern, E., Zou, H., Li, X., Hu, J., Feng, K., Li, J., and Liu, P. Generative AI for math: Abel. URL https://github.com/GAIR-NLP/abel.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Dao, T. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.

Dao, T., Fu, D., Ermon, S., Rudra, A., and Ré, C. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.

Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.

Frantar, E. and Alistarh, D. SparseGPT: Massive language models can be accurately pruned in one-shot. 2023.

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.

Fu, Y., Bailis, P., Stoica, I., and Zhang, H. Break the sequential dependency of LLM inference using lookahead decoding. arXiv preprint arXiv:2402.02057, 2024.

Gu, Y., Dong, L., Wei, F., and Huang, M. Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543, 2023.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.

Husain, H., Wu, H.-H., Gazit, T., Allamanis, M., and Brockschmidt, M. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019.

Kim, Y. and Rush, A. M. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947, 2016.

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pp. 611–626, 2023.

Leviathan, Y., Kalman, M., and Matias, Y. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp. 19274–19286. PMLR, 2023.

Li, Y., Wei, F., Zhang, C., and Zhang, H. EAGLE: Speculative sampling requires rethinking feature uncertainty, 2024.

Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., and Han, S. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.

Liu, X., Hu, L., Bailis, P., Stoica, I., Deng, Z., Cheung, A., and Zhang, H. Online speculative decoding, 2023.

Ma, X., Fang, G., and Wang, X. LLM-Pruner: On the structural pruning of large language models. arXiv preprint arXiv:2305.11627, 2023.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

Ortega, J. M. and Rheinboldt, W. C. Iterative Solution of Nonlinear Equations in Several Variables. SIAM, 2000.

Pan, H., Wang, C., Qiu, M., Zhang, Y., Li, Y., and Huang, J. Meta-KD: A meta knowledge distillation framework for language model compression across domains. arXiv preprint arXiv:2012.01266, 2020.

Polišenská, K., Chiat, S., and Roy, P. Sentence repetition: What does the task measure? International Journal of Language & Communication Disorders, 50(1):106–118, 2015.

Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Heek, J., Xiao, K., Agrawal, S., and Dean, J. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5, 2023.

Santilli, A., Severino, S., Postolache, E., Maiorca, V., Mancusi, M., Marin, R., and Rodolà, E. Accelerating transformer inference for translation via parallel decoding. arXiv preprint arXiv:2305.10427, 2023.

Shazeer, N. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019.

Smadja, F. From n-grams to collocations: An evaluation of Xtract. In 29th Annual Meeting of the Association for Computational Linguistics, pp. 279–284, 1991.

Song, Y. and Dhariwal, P. Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189, 2023.

Song, Y., Meng, C., Liao, R., and Ermon, S. Accelerating feedforward computation via parallel nonlinear equation solving. In International Conference on Machine Learning, pp. 9791–9800. PMLR, 2021a.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021b. URL https://openreview.net/forum?id=PxTIG12RRHS.

Song, Y., Dhariwal, P., Chen, M., and Sutskever, I. Consistency models. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023.

Sun, M., Liu, Z., Bair, A., and Kolter, J. Z. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.

Wu, S., Fei, H., Qu, L., Ji, W., and Chua, T.-S. NExT-GPT: Any-to-any multimodal LLM. arXiv preprint arXiv:2309.05519, 2023.

Xia, M., Zhong, Z., and Chen, D. Structured pruning learns compact and accurate models. arXiv preprint arXiv:2204.00408, 2022.

Xia, M., Gao, T., Zeng, Z., and Chen, D. Sheared LLaMA: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023.

Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087–38099. PMLR, 2023.

Yu, T., Zhang, R., Yang, K., Yasunaga, M., Wang, D., Li, Z., Ma, J., Li, I., Yao, Q., Roman, S., et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. arXiv preprint arXiv:1809.08887, 2018.

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023.

Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., Zhang, S., Ghosh, G., Lewis, M., Zettlemoyer, L., and Levy, O. LIMA: Less is more for alignment. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a.

Zhou, Y., Lyu, K., Rawat, A. S., Menon, A. K., Rostamizadeh, A., Kumar, S., Kagy, J.-F., and Agarwal, R. DistillSpec: Improving speculative decoding via knowledge distillation. arXiv preprint arXiv:2310.08461, 2023b.


A. Illustration of Consistency Loss Learning Objectives

In our proposed method described in Section 3.2, we use Jacobi trajectories collected from a target model to train the model with a loss that encourages single-step convergence during Jacobi iterations. This is achieved with either of the two consistency losses:

• Global consistency loss: directly minimize the distance D between any arbitrary point y on a Jacobi trajectory and the fixed point y*, as in Equation (4).

• Local consistency loss: minimize the distance D between any arbitrary point y^(j) on a Jacobi trajectory and its adjacent state y^(j+1), as in Equation (5), which thereby also implicitly minimizes the distance between y^(j) and the fixed point y*.

Figure 4 and Figure 5 further depict the global consistency loss and the local consistency loss, respectively, using a toy Jacobi trajectory that starts from a random initialization at k = 0 and converges at k = 4 to the same result as greedy AR decoding.

Figure 4. Illustration of the global consistency loss, where we aim to directly learn a model q_θ that maps an arbitrary n-token sequence (y^(0), y^(1), etc.) to the fixed point y*.

Figure 5. Illustration of the local consistency loss, where we aim to learn a model q_θ that maps an arbitrary n-token sequence y^(j) to its next adjacent state, thereby implicitly mapping the point to the fixed point y*.


B. Comparison with Baseline Algorithms

In this section, we present a comparative analysis of baseline algorithms for efficient LLM inference. The key features considered are listed below. Table 7 underlines that CLLMs, our proposed method, stand out for memory efficiency and adaptability, requiring no modifications to the existing model architecture while achieving up to 3.4× inference speedup.

• Lossless: whether the method generates exactly the same output distribution as AR decoding does with the backbone model.

• Training-free: whether the method works without any additional training.

• Architecture-design-free: whether the method avoids modifying or adding auxiliary components to pre-trained LLMs (such as extra MLP layers, LM heads (Cai et al., 2024), autoregressive heads (Li et al., 2024), etc.).

• Attention-modification-free: whether the method avoids modifications to the existing attention mechanism in transformers, for example the tree-based token verification that appears in Cai et al. (2024).

• Extra-memory-free: whether the method avoids extra memory consumption in the system to accommodate a speculative model or extra parameters.

• Speedup: whether the method can effectively deliver inference speedup in practical use cases.

Table 7. All speedups are relative to vanilla AR decoding. CLLMs have the best memory efficiency and adaptability, as they require no modifications to the model. "yes*" refers to the capability of achieving more than 3× speedup on at least one of our benchmarks. Jacobi decoding does not always lead to a speedup, as discussed in Section 3.1, so we denote it with "(yes)".

Methods                     Lossless   Training-free   Arch-design-free   Attention-mod-free   Extra-memory-free   Speedup
Vanilla AR                  yes        yes             yes                yes                  yes                 no
Jacobi Decoding             yes        yes             yes                yes                  yes                 (yes)
Speculative Decoding        yes        yes             yes                yes                  no                  yes
Lookahead Decoding          yes        yes             yes                yes                  no                  yes
SD with Distilled Student   yes        no              yes                yes                  no                  yes
EAGLE                       yes        no              no                 no                   no                  yes*
Medusa                      no         no              no                 no                   no                  yes*
CLLMs (Ours)                no         no              yes                yes                  yes                 yes*

C. Pseudocode for Jacobi Decoding with KV Cache

Algorithm 3  Jacobi Decoding with KV Cache
  1: Input: prompt x, n-gram size n, past KV cache K, LLM, Jacobi trajectory J
  2: y ← random tokens from x
  3: n_t ← 0                                         {initialization of the accurate length}
  4: y_0, K ← LLM(x)                                 {prefill phase: generate the first token}
  5: z^next ← cat(y_0, y_{≥1})
  6: repeat
  7:   z^current ← z^next
  8:   z^next, K ← LLM(z^current, K)
  9:   i* ← max{i | z^current_{<i} = z^next_{<i}, i ∈ {0, ..., len(z^current) − 1}}   {fast-forwarded token count}
 10:   y_{n_t ≤ i' < n_t + i*} ← z^next_{<i*}        {i' denotes a dummy variable}
 11:   n_t ← n_t + i*
 12:   Append cat(y_{<n_t}, z^next_{≥i*}) to J
 13:   Remove the KV cache of false tokens from K
 14:   z^next ← z^next_{≥i*}
 15: until n_t = n
 16: Output: J and y
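
For readers who prefer runnable code to pseudocode, the sketch below is our rough Python rendering of the same idea, assuming a Hugging Face-style causal LM and the legacy tuple `past_key_values` format; cache handling differs across library versions, and the acceptance and crop bookkeeping here is our own interpretation of Algorithm 3 rather than the reference implementation.

```python
import torch

def _to_legacy(past):
    # Newer transformers versions return Cache objects instead of tuples.
    return past.to_legacy_cache() if hasattr(past, "to_legacy_cache") else past

def _crop_cache(past, keep_len):
    # Keep only the first `keep_len` positions of the legacy tuple KV cache,
    # whose tensors have shape (batch, heads, seq_len, head_dim).
    return tuple((k[:, :, :keep_len, :], v[:, :, :keep_len, :]) for k, v in past)

@torch.no_grad()
def jacobi_decode_block_kv(model, prompt_ids, n):
    """Jacobi decoding of one n-token block with KV cache reuse (cf. Algorithm 3)."""
    vocab = model.get_input_embeddings().num_embeddings
    y = torch.randint(0, vocab, (1, n), device=prompt_ids.device)
    trajectory = []

    # Prefill: cache the prompt and commit the first block token greedily.
    out = model(prompt_ids, use_cache=True)
    past = _to_legacy(out.past_key_values)
    y[:, 0] = out.logits[:, -1, :].argmax(dim=-1)
    fixed = 1  # number of committed tokens (guaranteed to match greedy AR)

    while fixed < n:
        # Re-feed the last committed token plus the still-unfixed tail;
        # everything before it is served from the cache.
        inputs = y[:, fixed - 1 : n - 1]
        out = model(inputs, past_key_values=past, use_cache=True)
        preds = out.logits.argmax(dim=-1)   # greedy predictions for positions fixed..n-1

        # preds[:, 0] is always correct because its prefix is fully committed.
        # Further tokens are verified while the old guess already matched the
        # newly verified token, which is the fast-forwarding effect.
        old_tail = y[:, fixed:]
        accept = 1
        while (accept < preds.shape[1]
               and old_tail[0, accept - 1].item() == preds[0, accept - 1].item()):
            accept += 1

        y[:, fixed:] = preds                # Jacobi update of the whole tail
        trajectory.append(y.clone())
        fixed += accept

        # Keep cached keys/values only for tokens that are final and were computed
        # from a fully correct prefix; the last committed token is re-fed next round.
        past = _crop_cache(_to_legacy(out.past_key_values),
                           prompt_ids.shape[1] + fixed - 1)
    return y, trajectory
```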
