CLLMs: Consistency Large Language Models
Figure 1. An instance of Jacobi trajectory. “n-token seq” refers to the n-token sequence that is iteratively updated in Jacobi iterations.

so that it can yield multiple, instead of one, subsequent tokens of a prefix at once. In the ideal case, with the prompt and a randomly initialized n-token sequence as input, our goal is to train an LLM that can generate the same n-token sequence as AR decoding (the fixed point) using only one step. Our preliminary experiments show that the single-step learning task is difficult when n is large and leads to slow model convergence. We therefore ease the learning process by also taking intermediate points on the Jacobi trajectory with more correct tokens into account. In particular, for the second-to-last point on the trajectory, the learning is identical to AR modeling, at which the target LLM without adaptation has already excelled.

We argue that such a learning strategy, in which a single model is tuned to solve a series of learning problems of mapping any arbitrary point on the trajectory to the fixed point, is beneficial to model convergence (see Figure 4 and Figure 5). Imagining the evolution of the n-token sequence as the denoising process of a natural image (Ho et al., 2020; Song et al., 2021b), we surprisingly find that the above learning procedure draws a sharp analogy to the acceleration technique for diffusion models named consistency models (CMs) (Song et al., 2023; Song & Dhariwal, 2023). CMs aim to achieve single-step image generation using the denoising objective by minimizing distances between consecutive denoising steps along the probability flow ordinary differential equation (ODE) trajectory during training. Our method and CMs share the notion of directly mapping intermediate states of a solving process (of non-linear systems or ODEs) to its final solution for inference acceleration. Based on these, we refer to our trained models as Consistency Large Language Models (CLLMs). In comparison with previous methods like speculative decoding and Medusa, CLLMs do not introduce extra memory cost to accommodate auxiliary model components while delivering significant speedup with minimal performance degradation.

Implementing this learning strategy requires only model training with two loss terms. Following CMs, we convert the aforementioned learning objective into a consistency loss, where the model is demanded to map an arbitrary point on the Jacobi trajectory to the fixed point. CLLMs also include an AR loss to avoid deviating from the distribution of the target LLM and hence ensure the generation quality.

The fine-tuning cost of CLLMs is moderate, e.g., training on only ∼1M tokens for LLaMA-7B to achieve a 3.4× speedup on the Spider dataset. We further empirically identify that such acceleration likely stems from the existence of 1) fast forwarding, where multiple consecutive tokens are correctly predicted in a single forward pass, and 2) stationary tokens, which are correctly predicted and remain unaltered through subsequent iterations, despite being preceded by inaccurate tokens. An illustration of the examples is shown in Figure 2.

To summarize, our key contributions are as follows:

• We propose Consistency Large Language Models (CLLMs), a new family of LLMs specialized for the Jacobi decoding method for latency reduction.

• We empirically observe the existence of the fast-forwarding and stationary-token phenomena in Jacobi decoding of CLLMs. Empirically, CLLMs can lead to a 2.0× to 6.8× improvement in the count of fast-forwarded tokens and stationary tokens compared to the original LLM.

• We demonstrate the efficacy of CLLMs on a variety of benchmarks. On domain-specific benchmarks including GSM8K, CodeSearchNet Python, and Spider, CLLMs can achieve 2.4× to 3.4× speedup using Jacobi decoding with nearly no loss in accuracy. On the open-domain benchmark MT-bench, CLLMs can achieve 2.4× speedup on ShareGPT with state-of-the-art performance, scoring 6.4.

2. Related Work

Efficient LLM Inference. This body of work can be broadly categorized into two streams: methods that necessitate additional training and those that do not. The high AR inference cost in LLMs has sparked a surge in research aimed at efficient LLM inference, primarily focused on accelerating the AR decoding process.

The methods that do not require additional training include speculative decoding, as introduced in studies by Leviathan et al. (2023) and Chen et al. (2023). These techniques
enhance LLM decoding speed by leveraging a smaller draft model to predict the outputs of a larger target model, which subsequently verifies these predictions. Another category of training-free approaches involves system- or hardware-oriented optimizations. Notable examples include PagedAttention (Kwon et al., 2023), which optimizes KV cache management for throughput using memory paging, and FlashAttention (Dao et al., 2022; Dao, 2023), which accelerates attention module computations by reducing HBM access via softmax tiling. Other strategies enhance LLM inference speed by optimizing model designs, reducing weight/activation precision, and utilizing sparsity, including multi-query and grouped-query attention mechanisms with fused heads (Shazeer, 2019; Ainslie et al., 2023), post-training quantization (Dettmers et al., 2022; Xiao et al., 2023; Frantar et al., 2022; Lin et al., 2023), and various pruning techniques (Sun et al., 2023; Frantar & Alistarh, 2023; Ashkboos et al., 2024).

The methods that necessitate training often require the integration of auxiliary components, such as additional LM or AR heads, to facilitate faster AR generation (Cai et al., 2024; Li et al., 2024). They may also involve significant modifications to the model weights or architecture, as seen in various pruning approaches (Ma et al., 2023; Xia et al., 2022; 2023). Moreover, training can enhance certain training-free techniques, like speculative decoding, by capturing the behavior of the original, larger model in a smaller student model through distillation, thereby retaining performance with reduced size (Zhou et al., 2023b; Liu et al., 2023). A detailed analysis comparing CLLMs with different SOTA baseline methods is provided in Section B and Table 7. It is worth noting that CLLMs require neither modifications to pre-trained models nor any auxiliary components. This brings higher memory efficiency and adaptability to users at inference time.

LLM Distillation. Knowledge distillation (KD) serves as a technique for creating smaller models that replicate the functionality of larger ones. While traditional KD approaches often fall short for LLMs, Gu et al. (2023) have adapted KD for autoregressive LLMs, focusing on minimizing the reverse KL divergence between student and teacher models through student-driven decoding. In another advancement, Agarwal et al. (2023) introduce generalized knowledge distillation (GKD), which balances forward and reverse KL divergences by employing a mix of data sampled from both teacher and student models.

CLLMs are distinct from these works, as our proposed method can be regarded as a self-distillation approach with a Jacobi trajectory training dataset that matches the target LLM's output distribution.

Consistency Models. Diffusion models (Ho et al., 2020; Song et al., 2021b) suffer from a slow iterative sampling process. Consistency models overcome this limitation by mapping any point along the probability flow ODE of the diffusion process back to the original point, corresponding to the initial image, in a single step (Song et al., 2023). In this work, we highlight that a parallelism can be drawn between the few-step generation capability of CLLMs and that of consistency models.

3. Methodology

This section begins with a review of the Jacobi decoding method (Santilli et al., 2023) for accelerating LLM inference, then elaborates on CLLMs, a refinement of pre-trained LLMs to enjoy higher speedup from Jacobi decoding. In this paper, we only consider greedy sampling and leave other sampling strategies to future work. We also empirically identify the fast-forwarding phenomenon and the emergence of stationary tokens from CLLMs, which serve as the source of such acceleration.

3.1. Preliminary: Jacobi Decoding

Given a prompt x and a pre-trained LLM p(·|x), we typically obtain the model response with the standard AR decoding method under the greedy strategy, i.e.,

    y_i = arg max_y p(y | y_{<i}, x),   i = 1, ..., n,        (1)

where y_{<i} denotes {y_1, ..., y_{i-1}}. As shown, n forward passes of the LLM are required to obtain n tokens y_{≤n}. The sequential nature of AR decoding hinders the fast generation of a lengthy response in practice. Speculative decoding (Leviathan et al., 2023; Zhou et al., 2023b; Liu et al., 2023) and Medusa (Cai et al., 2024) are existing remediations to this issue, but the former suffers from the difficulties of finding a suitable draft model and managing both models in a single system, while the latter causes significant increases in model size and architecture.

In comparison, Jacobi decoding has shown the capacity to reduce the inference cost of LLMs without extra model components (Santilli et al., 2023) and is therefore more applicable. Concretely, supposing f(y_i, y_{<i}, x) := y_i − arg max_y p(y | y_{<i}, x), Jacobi decoding re-frames the LLM inference process in Equation (1) as solving a system of nonlinear equations w.r.t. y_i:

    f(y_i, y_{<i}, x) = 0,   i = 1, ..., n.        (2)

It can be solved in parallel using the Jacobi fixed-point iteration method (Ortega & Rheinboldt, 2000), starting from a randomly initialized n-token sequence y^(0) = {y_1^(0), ..., y_n^(0)} and iteratively updating it by the greedy rule y_i^(j+1) = arg max_y p(y | y^(j)_{<i}, x) for all positions i in parallel.
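To make the procedure concrete, the sketch below implements greedy Jacobi decoding of a single n-token block on top of a Hugging Face-style causal LM. The function name, the pad-token initialization, and the convergence check are illustrative assumptions rather than the paper's reference implementation; the only ingredients taken from the text are Equation (2) and the parallel greedy update of all n tokens per forward pass.

import torch

@torch.no_grad()
def jacobi_decode_block(model, prompt_ids, n, pad_id=0):
    """Greedy Jacobi decoding of one n-token block (a sketch, not the paper's code).

    model:      a causal LM exposing `model(input_ids).logits` (Hugging Face style).
    prompt_ids: LongTensor of shape (1, prompt_len) holding the prefix x.
    Returns the converged block y* and the Jacobi trajectory [y^(0), ..., y*].
    """
    prompt_len = prompt_ids.shape[-1]
    # y^(0): an arbitrarily initialized n-token sequence (the paper initializes randomly).
    y = torch.full((1, n), pad_id, dtype=torch.long, device=prompt_ids.device)
    trajectory = [y.clone()]
    for _ in range(n):  # under greedy decoding, at most n iterations are needed
        logits = model(torch.cat([prompt_ids, y], dim=-1)).logits
        # Parallel greedy update of every position i, conditioned on the previous iterate:
        # y_i <- argmax_y p(y | x, y_{<i}), which drives f(y_i, y_{<i}, x) in Eq. (2) to 0.
        y_next = logits[:, prompt_len - 1:-1, :].argmax(dim=-1)
        if torch.equal(y_next, y):  # fixed point reached: y equals the greedy AR output
            break
        y = y_next
        trajectory.append(y.clone())
    return y, trajectory

Once a block converges, it can be appended to the prefix and decoding proceeds block by block until <EOS>, which mirrors how Algorithm 1 below collects Jacobi trajectories as training data.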
Figure 2. Comparison of Jacobi trajectories between a target LLM and a CLLM on Spider. Each point along the Jacobi trajectory is a color-coded sequence: blue for correct tokens matching the AR results, and red for inaccurate ones. The CLLM demonstrates enhanced efficiency, converging to the fixed point 2× faster than the target LLM. This increased efficiency in the CLLM can be attributed to the consistency loss, which facilitates the learning of the structure of each n-token sequence given a prefix.
Algorithm 1 Generate dataset to train a CLLM
Input: prompt set O, n-gram size n, max new tokens N, target LLM p
repeat
  Sample prompt x from the origin dataset O.
  while <EOS> is not generated and length generated < N do
    J = {y^(0), ..., y^*} ← Jacobi Decoding(p, x)
    x ← cat(x, y^*)
    if use data augmentation then
      for all y ∈ J do
        Augment y with false tokens corrected randomly
      end for
    end if
    Append x and J to the training dataset D
  end while
until all prompts in the origin dataset O are used

Algorithm 2 Training algorithm for a CLLM
Input: Jacobi trajectory dataset D, n-gram size n, the weight factor ω, CLLM q_θ(·|x)
repeat
  Sample prompt x, Jacobi trajectory J, and full response l from D
  Calculate L_AR using Equation (6)
  Sample y from J
  Calculate L_consistency using Equation (4) or Equation (5)
  Calculate L(θ) and update the parameters θ
until convergence

wrong, wrong” pattern. In comparison, patterns like “correct, correct, wrong, correct, wrong” can be rare. To enhance the learning and generalization capabilities of CLLMs, we augment the dataset D by randomly correcting erroneously predicted tokens within the samples, as sketched below.
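A minimal sketch of this augmentation step (the "Augment y with false tokens corrected randomly" line of Algorithm 1), assuming token sequences are plain Python lists; the correction ratio and the helper name are illustrative choices, not values taken from the paper.

import random

def augment_state(y_state, y_fixed, correct_ratio=0.5, rng=random):
    """Randomly correct erroneous tokens in an intermediate Jacobi state.

    y_state: list[int], an intermediate n-token sequence y^(j) from trajectory J.
    y_fixed: list[int], the fixed point y* (greedy AR output) of the same block.
    Every token that still disagrees with the fixed point is flipped to its
    correct value with probability `correct_ratio`, producing extra partially
    corrected states for the training dataset D.
    """
    augmented = list(y_state)
    for i, (tok, gold) in enumerate(zip(y_state, y_fixed)):
        if tok != gold and rng.random() < correct_ratio:
            augmented[i] = gold
    return augmented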
Data post-processing. Since the target LLM itself can make errors for some prompts, it often leads to low-quality generations in the Jacobi trajectories. We find that training a CLLM on n-token sequences with token-level (Holtzman et al., 2019) or sentence-level repetitions (Polišenská et al., 2015) often results in repetitive content generation and noticeably degrades performance. Recognizing the significance of high-quality datasets for training LLMs (Zhou et al., 2023a), we perform post-processing to eliminate the low-quality samples from our training dataset D based on a rule-based detector.

3.2.2. Training

We jointly optimize two losses for tuning CLLMs, one guaranteeing the prediction of multiple tokens at once and the other preventing the CLLM from deviating from the target LLM so as to maintain generation quality.

Consistency Loss. For a prompt x with the Jacobi trajectory J, let y and y^* denote a random state on the trajectory and the fixed point, respectively. We can directly push the CLLM to output y^* with y as the input by minimizing the following loss:

    L_GC = E_{(x,J)~D, y~J} [ Σ_{i=1}^{n} D( q_{θ^-}(· | y^*_{<i}, x) || q_θ(· | y_{<i}, x) ) ],        (4)

where θ^- = stopgrad(θ) and we abuse notations to represent uniform sampling from the dataset. D(·||·) denotes the distance between two distributions, with forward KL, reverse KL, and their mixture (i.e., the Jensen-Shannon divergence) as popular examples (Agarwal et al., 2023). We primarily experiment with the forward KL.

Alternatively, we can also achieve the goal that the CLLM consistently maps all intermediate states to the fixed point with a local consistency (LC) loss following CMs (Song et al., 2023), where the adjacent states (y^(j), y^(j+1)) in the Jacobi trajectory J are demanded to yield the same outputs:

    L_LC = E_{(x,J)~D, (y^(j), y^(j+1))~J} [ Σ_{i=1}^{n} D( q_{θ^-}(· | y^(j+1)_{<i}, x) || q_θ(· | y^(j)_{<i}, x) ) ].        (5)

We compare L_GC and L_LC empirically in Table 6, where the results show that the global consistency loss is more efficacious for training CLLMs. This is probably attributed to the fact that L_LC only implicitly aims at mapping any point consistently to the fixed point by minimizing the distance between consecutive points. However, there is still a gap between L_LC and the goal of predicting multiple tokens at once, because there is typically only one more correct token in y^(j+1) than in y^(j) in the collected Jacobi trajectory.

AR Loss. To avoid deviating from the distribution of the target LLM, we incorporate the traditional AR loss based on the generation l of the target LLM p:

    L_AR = E_{(x,l)~D} [ − Σ_{i=1}^{N} log q_θ(l_i | l_{<i}, x) ].        (6)

This term contributes to maintaining generation quality substantially (see Table 6).

Consequently, the total loss for training a CLLM is:

    L(θ) = L_consistency + ω L_AR,        (7)

where ω represents a weighting coefficient, and L_consistency can be either L_GC or L_LC; we adopt L_GC in our experiments. The training procedure is detailed in Algorithm 2.
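For concreteness, here is a minimal PyTorch-style sketch of the objective in Equation (7), instantiated with the global consistency loss of Equation (4) under forward KL and the AR loss of Equation (6). The single-example treatment, the detached teacher pass for q_{θ^-}, and the helper name are assumptions made for illustration; this is not the paper's released training code, and batching, padding, and masking are omitted.

import torch
import torch.nn.functional as F

def cllm_loss(model, x_ids, y_state, y_fixed, l_ids, omega=1.0):
    """L(θ) = L_GC + ω · L_AR for a single example (cf. Equations (4), (6), (7)).

    x_ids:   (1, P) prompt tokens x
    y_state: (1, n) a state y sampled from the Jacobi trajectory J
    y_fixed: (1, n) the fixed point y* of that trajectory
    l_ids:   (1, N) the target LLM's full AR generation l for the prompt
    """
    P = x_ids.shape[-1]

    # Student distributions q_θ(· | y_{<i}, x), conditioned on the intermediate state y.
    student_logits = model(torch.cat([x_ids, y_state], dim=-1)).logits
    student_logp = F.log_softmax(student_logits[:, P - 1:-1, :], dim=-1)

    # Teacher distributions q_{θ^-}(· | y*_{<i}, x): same network, gradients stopped.
    with torch.no_grad():
        teacher_logits = model(torch.cat([x_ids, y_fixed], dim=-1)).logits
        teacher_p = F.softmax(teacher_logits[:, P - 1:-1, :], dim=-1)

    # Global consistency loss: forward KL D(q_{θ^-}(·|y*_{<i},x) || q_θ(·|y_{<i},x)),
    # summed over the n block positions ("batchmean" divides only by the batch dim of 1).
    loss_gc = F.kl_div(student_logp, teacher_p, reduction="batchmean")

    # AR loss: next-token negative log-likelihood on the target LLM's generation l (Eq. (6)).
    ar_logits = model(torch.cat([x_ids, l_ids], dim=-1)).logits[:, P - 1:-1, :]
    loss_ar = F.cross_entropy(
        ar_logits.reshape(-1, ar_logits.shape[-1]), l_ids.reshape(-1), reduction="sum"
    )

    return loss_gc + omega * loss_ar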
3.3. Acceleration Mechanisms in CLLMs

Next, we compare the Jacobi trajectories of the target LLM and the CLLM in Figure 2 to develop an in-depth understanding
Table 3. Profiling results for fast-forwarded and stationary token counts in fine-tuned models and CLLMs. The numbers are reported for each n-token sequence, with the best-performing model and an accompanying n-gram size. The fast-forwarded token count reported in the table includes the one token that would be predicted correctly even without fast-forwarding.

Models                                        | n-token sequence length | Fast-forwarded token count | Stationary token count
Spider
Fine-tuned Deepseek-coder-7B-instruct         | 16 | 1.1 | 0.4
CLLM-Deepseek-coder-7B-instruct (size 16)     | 16 | 5.7 | 1.6
Code-Search-Net Python
Fine-tuned Deepseek-coder-7B-instruct         | 32 | 1.1 | 0.4
CLLM-Deepseek-coder-7B-instruct (size 32)     | 32 | 4.0 | 6.8
GSM8K
Fine-tuned LLaMA-2-7B                         | 16 | 1.1 | 0.1
CLLM-LLaMA-2-7B (size 16)                     | 16 | 2.8 | 2.0
ShareGPT
Fine-tuned LLaMA-2-7B                         | 32 | 1.1 | 0.3
CLLM-LLaMA-2-7B (size 32)                     | 32 | 2.2 | 4.8
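As a hedged illustration of how such statistics can be profiled from a collected Jacobi trajectory, the snippet below counts, per iteration, (i) the growth of the leading prefix that already matches the fixed point (fast-forwarded tokens, including the one guaranteed token) and (ii) tokens beyond that prefix that already equal their fixed-point value and are never altered later (stationary tokens). The function name and the exact bookkeeping are assumptions following the definitions in the text, not the paper's measurement script.

def profile_trajectory(trajectory, y_fixed):
    """Per-iteration fast-forwarded and stationary token counts (illustrative).

    trajectory: list of n-token states [y^(0), ..., y^(K)], each a list[int],
                with y^(K) equal to the fixed point.
    y_fixed:    list[int], the fixed point y* (greedy AR output) of the block.
    """
    def correct_prefix_len(state):
        k = 0
        while k < len(state) and state[k] == y_fixed[k]:
            k += 1
        return k

    fast_forwarded, stationary = [], []
    for j in range(1, len(trajectory)):
        prev, curr = trajectory[j - 1], trajectory[j]
        # Fast-forwarded tokens: how many tokens became final in this single forward pass.
        boundary = correct_prefix_len(curr)
        fast_forwarded.append(boundary - correct_prefix_len(prev))
        # Stationary tokens: already correct beyond the boundary and unchanged afterwards,
        # despite being preceded by still-inaccurate tokens.
        stationary.append(sum(
            1 for i in range(boundary, len(curr))
            if curr[i] == y_fixed[i] and all(t[i] == y_fixed[i] for t in trajectory[j:])
        ))
    return fast_forwarded, stationary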
4.3. Ablation Studies

Dataset sizes and generalizability. In Section 3.2.1, Jacobi trajectory datasets are collected to conduct training for efficient Jacobi decoding. Table 4 demonstrates larger Jacobi
Table 4. Comparison of the performance of CLLMs trained with different sizes of Jacobi trajectory datasets on ShareGPT.

the trained model itself as the training samples. This can remove the Jacobi trajectory collection overhead, making our proposed method potentially feasible for pre-training. Results from our language modeling experiments, as detailed in Table 5, demonstrate the robustness of the CLLM when trained on pre-training jobs with a notable speedup. By incorporating on-policy GKD, it is conceivable that a modified version of our proposed method could be employed for LLM pre-training. This modification would equip the pre-trained model with both a strong language modeling capability, as existing models possess, and a high generation speed when employing Jacobi decoding for inference. We leave the opportunities of adapting CLLMs to pre-training jobs for future work.

5. Conclusion

In this work, we introduce CLLMs, a new family of LLMs that excel in efficient parallel decoding, designed to significantly enhance the efficiency of Jacobi decoding. Unlike

Impact Statement

This work presents a challenge in machine learning and proposes a solution; the potential negative consequences are not apparent. While it is theoretically possible for any technique to be misused, the likelihood of such misuse occurring at the current stage is low.

References

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Agarwal, R., Vieillard, N., Stanczyk, P., Ramos, S., Geist, M., and Bachem, O. Gkd: Generalized knowledge distillation for auto-regressive sequence models. arXiv preprint arXiv:2306.13649, 2023.
Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., and Sanghai, S. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.

Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.

Ashkboos, S., Croci, M. L., do Nascimento, M. G., Hoefler, T., and Hensman, J. Slicegpt: Compress large language models by deleting rows and columns, 2024.

Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.

Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J. D., Chen, D., and Dao, T. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024.

Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023.

Chern, E., Zou, H., Li, X., Hu, J., Feng, K., Li, J., and Liu, P. Generative ai for math: Abel. URL https://github.com/GAIR-NLP/abel.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.

Dao, T., Fu, D., Ermon, S., Rudra, A., and Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.

Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. Llm.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.

Frantar, E. and Alistarh, D. Sparsegpt: Massive language models can be accurately pruned in one-shot. 2023.

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.

Fu, Y., Bailis, P., Stoica, I., and Zhang, H. Break the sequential dependency of llm inference using lookahead decoding. arXiv preprint arXiv:2402.02057, 2024.

Gu, Y., Dong, L., Wei, F., and Huang, M. Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543, 2023.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.

Husain, H., Wu, H.-H., Gazit, T., Allamanis, M., and Brockschmidt, M. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019.

Kim, Y. and Rush, A. M. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947, 2016.

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pp. 611–626, 2023.

Leviathan, Y., Kalman, M., and Matias, Y. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp. 19274–19286. PMLR, 2023.

Li, Y., Wei, F., Zhang, C., and Zhang, H. Eagle: Speculative sampling requires rethinking feature uncertainty, 2024.

Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., and Han, S. Awq: Activation-aware weight quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.

Liu, X., Hu, L., Bailis, P., Stoica, I., Deng, Z., Cheung, A., and Zhang, H. Online speculative decoding, 2023.

Ma, X., Fang, G., and Wang, X. Llm-pruner: On the structural pruning of large language models. arXiv preprint arXiv:2305.11627, 2023.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

Ortega, J. M. and Rheinboldt, W. C. Iterative solution of nonlinear equations in several variables. SIAM, 2000.

Pan, H., Wang, C., Qiu, M., Zhang, Y., Li, Y., and Huang, J. Meta-kd: A meta knowledge distillation framework for language model compression across domains. arXiv preprint arXiv:2012.01266, 2020.

Polišenská, K., Chiat, S., and Roy, P. Sentence repetition: What does the task measure? International Journal of Language & Communication Disorders, 50(1):106–118, 2015.

Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Heek, J., Xiao, K., Agrawal, S., and Dean, J. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5, 2023.

Santilli, A., Severino, S., Postolache, E., Maiorca, V., Mancusi, M., Marin, R., and Rodolà, E. Accelerating transformer inference for translation via parallel decoding. arXiv preprint arXiv:2305.10427, 2023.

Shazeer, N. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019.

Smadja, F. From n-grams to collocations: An evaluation of xtract. In 29th Annual Meeting of the Association for Computational Linguistics, pp. 279–284, 1991.

Song, Y. and Dhariwal, P. Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189, 2023.

Song, Y., Meng, C., Liao, R., and Ermon, S. Accelerating feedforward computation via parallel nonlinear equation solving. In International Conference on Machine Learning, pp. 9791–9800. PMLR, 2021a.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021b. URL https://openreview.net/forum?id=PxTIG12RRHS.

Song, Y., Dhariwal, P., Chen, M., and Sutskever, I. Consistency models. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023.

Sun, M., Liu, Z., Bair, A., and Kolter, J. Z. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.

Wu, S., Fei, H., Qu, L., Ji, W., and Chua, T.-S. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023.

Xia, M., Zhong, Z., and Chen, D. Structured pruning learns compact and accurate models. arXiv preprint arXiv:2204.00408, 2022.

Xia, M., Gao, T., Zeng, Z., and Chen, D. Sheared llama: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023.

Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087–38099. PMLR, 2023.

Yu, T., Zhang, R., Yang, K., Yasunaga, M., Wang, D., Li, Z., Ma, J., Li, I., Yao, Q., Roman, S., et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887, 2018.

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.

Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., Zhang, S., Ghosh, G., Lewis, M., Zettlemoyer, L., and Levy, O. LIMA: Less is more for alignment. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a.

Zhou, Y., Lyu, K., Rawat, A. S., Menon, A. K., Rostamizadeh, A., Kumar, S., Kagy, J.-F., and Agarwal, R. Distillspec: Improving speculative decoding via knowledge distillation. arXiv preprint arXiv:2310.08461, 2023b.
• Global consistency loss: directly minimize the distance D between any arbitrary point y on a Jacobi trajectory and the fixed point y* in Equation (4).
• Local consistency loss: minimize the distance D between any arbitrary point y^(j) on a Jacobi trajectory and its adjacent state y^(j+1) in Equation (5), which thereby also implicitly minimizes the distance between y^(j+1) and the fixed point y*.
Figure 4 and Figure 5 further illustrate the global consistency loss and the local consistency loss; both objectives are restated below for reference.
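With D(·||·) the chosen distribution distance and θ^- = stopgrad(θ), the two objectives (Equation (4) and Equation (5) in Section 3.2.2) read:

\mathcal{L}_{\mathrm{GC}} = \mathbb{E}_{(x,\mathcal{J})\sim\mathcal{D},\, y\sim\mathcal{J}} \Big[ \textstyle\sum_{i=1}^{n} D\big( q_{\theta^{-}}(\cdot \mid y^{*}_{<i}, x) \,\big\|\, q_{\theta}(\cdot \mid y_{<i}, x) \big) \Big]

\mathcal{L}_{\mathrm{LC}} = \mathbb{E}_{(x,\mathcal{J})\sim\mathcal{D},\, (y^{(j)}, y^{(j+1)})\sim\mathcal{J}} \Big[ \textstyle\sum_{i=1}^{n} D\big( q_{\theta^{-}}(\cdot \mid y^{(j+1)}_{<i}, x) \,\big\|\, q_{\theta}(\cdot \mid y^{(j)}_{<i}, x) \big) \Big]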
Figure 4. The image illustrates the global consistency loss, where we aim to directly learn a model q_θ that maps an arbitrary n-token sequence (y^(0), y^(1), etc.) to the fixed point y*.

Figure 5. The image illustrates the local consistency loss, where we aim to learn a model q_θ that maps an arbitrary n-token sequence y^(j) to its next adjacent state, implicitly mapping the point to the fixed point y*.
• Lossless: whether the method generates exactly the same output distribution as AR decoding does in the backbone model.
• Training-free: whether the method requires training.
• Architecture-design-free: whether the method requires modifications or auxiliary components added to pre-trained LLMs (like extra MLP layers, LM heads (Cai et al., 2024), autoregressive heads (Li et al., 2024), etc.).
• Attention-modification-free: whether the method requires modifications to the existing attention mechanism in transformers. For example, this includes the tree token verification that appears in Cai et al. (2024).
• Extra-memory-free: whether the method requires extra memory consumption in the system to accommodate a speculative model or extra parameters.
• Speedup: whether the method can effectively deliver inference speedup in practical use cases.
Table 7. All speedups are relative to vanilla AR decoding. CLLMs have the best memory efficiency and adaptability as they require no modifications to the model. yes∗ refers to the capability of achieving more than 3× speedup on at least one of our benchmarks. Jacobi decoding doesn't always lead to a speedup, as discussed in Section 3.1, so we denote it with yes.
11: n_t ← n_t + i*
12: Append cat(y_{<n_t}, z^{next}_{≥i*}) to J
13: Remove KV cache of false tokens from K
14: z^{next} ← z^{next}_{≥i*}
15: until n_t = n
16: Output: J and y