[Model] vLLM v1 supports Medusa #17956
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
@WoosukKwon, could you please take a look?
LGTM. Thanks for the PR!
@skylee-01 Can you please add a test?
OK.
buildkite/ci/pr
Which test is that? I don't see the CI failing because of OOM.
The pytest output at the end indicates that the only failing test is the audio test, which is also failing on main.
Thank you very much for your reply.
Are you addressing this?
Yes, I will push an updated version with the test today.
This test uses the base model qwen2-4b, with the input set to 100 tokens and the output set to 10 tokens, driven by an offline script. Both the base model and the Medusa heads are fine-tuned on business data. During the test, batch_size is set to 10 and the loop runs 100 times. The following results were obtained:
Attached is the test script:

# -*- coding: utf-8 -*-
import os
import sys
import time

# Pin to a single GPU and force the V1 engine.
os.environ.update({
    "CUDA_VISIBLE_DEVICES": "0",
    "VLLM_USE_V1": "1"
})

# Make the locally built vLLM importable ahead of any installed copy.
custom_vllm_path = '/path/to/your/vllm'
if custom_vllm_path not in sys.path:
    sys.path.insert(0, custom_vllm_path)

from vllm import LLM, SamplingParams

prompt = "******************************"
prompts = [prompt] * 10  # batch size of 10

llm = LLM(
    model="/path/to/your/model",
    speculative_config={
        "method": "medusa",
        "model": "/path/to/your/medusa/model",
        "num_speculative_tokens": 3,
    },
    max_model_len=2000,
)

start_time = time.time()
for _ in range(100):  # run the batched request 100 times
    llm.generate(prompts, SamplingParams(temperature=0))
print(f"total time: {time.time() - start_time:.2f} s")
The test has been updated. What else do I need to do?
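For context, here is a rough sketch of the kind of end-to-end check involved, assuming hypothetical model paths and that greedy decoding with Medusa should closely reproduce the plain target-model output (this is not necessarily the exact test added in this PR):

```python
# Sketch only: hypothetical paths, simplified engine teardown.
from vllm import LLM, SamplingParams

TARGET_MODEL = "/path/to/your/model"         # hypothetical
MEDUSA_MODEL = "/path/to/your/medusa/model"  # hypothetical


def run(llm, prompts, params):
    return [o.outputs[0].text for o in llm.generate(prompts, params)]


def test_medusa_matches_baseline():
    prompts = ["The capital of France is", "Explain speculative decoding in one sentence:"]
    params = SamplingParams(temperature=0, max_tokens=32)

    # Baseline: target model alone, greedy decoding.
    ref = LLM(model=TARGET_MODEL, max_model_len=2000)
    ref_out = run(ref, prompts, params)
    del ref  # simplification: a real test would tear the engine down properly

    # Same prompts with Medusa speculative decoding enabled.
    spec = LLM(
        model=TARGET_MODEL,
        speculative_config={
            "method": "medusa",
            "model": MEDUSA_MODEL,
            "num_speculative_tokens": 3,
        },
        max_model_len=2000,
    )
    spec_out = run(spec, prompts, params)

    # Allow a little numerical drift rather than demanding exact equality.
    matches = sum(r == s for r, s in zip(ref_out, spec_out))
    assert matches >= len(prompts) - 1
```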
It should be good to go then; I'll try force-merging when I am back at my computer.
Thank you for your help |
EAGLE support in V1 is gated, i.e. you have to explicitly opt in. I'd imagine we're nearly ready to remove this gate for EAGLE, but I presume we don't intend Medusa to be flagged as "more supported" in V1 than EAGLE?
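A minimal sketch of what that explicit opt-in looks like from the user side, assuming the gate means setting VLLM_USE_V1=1 yourself before enabling an EAGLE speculative config (paths are hypothetical):

```python
import os

# Assumption: V1 must be opted into explicitly rather than auto-selected
# when an EAGLE speculative config is present.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/your/target/model",        # hypothetical
    speculative_config={
        "method": "eagle",
        "model": "/path/to/your/eagle/draft",  # hypothetical
        "num_speculative_tokens": 3,
    },
)
print(llm.generate(["Hello"], SamplingParams(temperature=0))[0].outputs[0].text)
```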
For reference, here's a target/draft model combo for testing:
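As an illustration only (these identifiers are my assumption, not necessarily the combo being referred to), the publicly released Vicuna-7B Medusa heads could serve as such a pair, provided the heads are in a format vLLM can load:

```python
# Assumed combo for illustration: lmsys/vicuna-7b-v1.3 as the target model with
# the FasterDecoding/medusa-vicuna-7b-v1.3 Medusa heads as the draft.
# The heads may need conversion into a vLLM-loadable Medusa checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="lmsys/vicuna-7b-v1.3",
    speculative_config={
        "method": "medusa",
        "model": "FasterDecoding/medusa-vicuna-7b-v1.3",
        "num_speculative_tokens": 3,
    },
    max_model_len=2000,
)
out = llm.generate(["Tell me a short story."], SamplingParams(temperature=0, max_tokens=64))
print(out[0].outputs[0].text)
```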
@skylee-01 - do we have any analysis of the speedup and acceptance length of the Medusa implementation on MT-Bench? We have benchmarks for EAGLE, and it would be good to compare against them: #17812
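For reference, acceptance length is just arithmetic over per-step accept counts; a tiny helper (not vLLM's metrics API, just the formula) might look like this:

```python
def mean_acceptance_length(accepted_draft_tokens_per_step: list[int]) -> float:
    """Average tokens emitted per verification step: each step emits one token
    from the target model plus however many draft tokens were accepted."""
    steps = len(accepted_draft_tokens_per_step)
    return 1.0 + sum(accepted_draft_tokens_per_step) / steps


# Example with num_speculative_tokens=3: per-step accepted draft counts.
print(mean_acceptance_length([3, 1, 0, 2]))  # -> 2.5
```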
I fully agree with your point of view. However, I ran into fairly serious problems while training the EAGLE model. Specifically, EAGLE3 has not been open-sourced, and the models supported by the EAGLE training code are limited; for example, it does not support the Qwen models. In addition, the EAGLE code is written in a rather casual style, which makes commercial adoption quite difficult.
Thank you for your work |
Signed-off-by: lisiqi23 <lisiqi23@xiaomi.com> Signed-off-by: skylee-01 <497627264@qq.com> Co-authored-by: lisiqi23 <lisiqi23@xiaomi.com> Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
Signed-off-by: lisiqi23 <lisiqi23@xiaomi.com> Signed-off-by: skylee-01 <497627264@qq.com> Co-authored-by: lisiqi23 <lisiqi23@xiaomi.com> Signed-off-by: Yuqi Zhang <yuqizhang@google.com>
vLLM v1 supports Medusa. This is a basic version; I will add CUDA graph support in the future.