[Model] vLLM v1 supports Medusa #17956

Merged
merged 3 commits into vllm-project:main on May 16, 2025

Conversation

skylee-01
Contributor

@skylee-01 skylee-01 commented May 11, 2025

vLLM v1 supports Medusa. This is a basic version; I will add CUDA graph support in the future.
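
A minimal usage sketch of what this enables, with placeholder model paths (the same speculative_config pattern appears in the test script further down in this thread):

import os
from vllm import LLM, SamplingParams

# Force the v1 engine; Medusa support in v1 is what this PR adds.
os.environ["VLLM_USE_V1"] = "1"

llm = LLM(
    model="/path/to/target/model",           # placeholder target model
    speculative_config={
        "method": "medusa",                  # use the Medusa heads as the draft
        "model": "/path/to/medusa/heads",    # placeholder Medusa head checkpoint
        "num_speculative_tokens": 3,         # propose 3 draft tokens per step
    },
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0, max_tokens=32))
print(outputs[0].outputs[0].text)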

lisiqi23 added 2 commits May 11, 2025 14:04
Signed-off-by: lisiqi23 <lisiqi23@xiaomi.com>
Signed-off-by: lisiqi23 <lisiqi23@xiaomi.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label May 11, 2025
@skylee-01
Contributor Author

@WoosukKwon, please take a look here.

Signed-off-by: skylee-01 <497627264@qq.com>
@simon-mo simon-mo requested a review from LiuXiaoxuanPKU May 12, 2025 02:54
Collaborator

@WoosukKwon WoosukKwon left a comment

LGTM. Thanks for the PR!

@WoosukKwon WoosukKwon added the ready label May 13, 2025
Collaborator

@WoosukKwon WoosukKwon left a comment

@skylee-01 Can you please add a test?

@skylee-01
Contributor Author

@skylee-01 Can you please add a test?

ok

@skylee-01
Contributor Author

skylee-01 commented May 14, 2025

[screenshot of failing CI checks]

The buildkite/ci/pr check encountered an "out of memory" error. Do we need to continue troubleshooting this issue?

@DarkLight1337
Member

Which test is that? I don't see the CI failing because of OOM

@skylee-01
Contributor Author

Which test is that? I don't see the CI failing because of OOM

[screenshot of the two failing tests]
These two tests showed OOM errors.

@DarkLight1337
Member

The pytest output at the end indicates that the only failing test is the audio test, which is also failing on main.

@skylee-01
Contributor Author

The pytest output at the end indicates that the only failing test is the audio test, which is also failing on main.

Thank you very much for your reply

@DarkLight1337
Member

@skylee-01 Can you please add a test?

Are you addressing this?

@skylee-01
Contributor Author

@skylee-01 Can you please add a test?

Are you addressing this?

Yes, I will synchronize a version for testing today.

@skylee-01
Contributor Author

skylee-01 commented May 16, 2025

This test uses the base model qwen2-4b, with the input set to 100 tokens and the output set to 10 tokens, and is run through an offline script. Both the base model and the Medusa heads are fine-tuned on business data.

During the test, the batch_size is set to 10 and the generation loop is run 100 times, with the following results:

  • the base model takes 19.19 seconds;
  • with Medusa it takes 9.9 seconds (roughly a 1.9x speedup).

Attached is the test script:

# -*- coding: utf-8 -*-
import os
import sys
import time


# Use GPU 0 and force the v1 engine.
os.environ.update({
    "CUDA_VISIBLE_DEVICES": "0",
    "VLLM_USE_V1": "1"
})

# Make the locally built vLLM importable ahead of any installed version.
custom_vllm_path = '/path/to/your/vllm'
if custom_vllm_path not in sys.path:
    sys.path.insert(0, custom_vllm_path)

from vllm import LLM, SamplingParams

# Batch of 10 identical prompts.
prompt = "******************************"
prompts = [prompt] * 10

llm = LLM(
    model="/path/to/your/model",
    speculative_config={
        "method": "medusa",
        "model": "/path/to/your/medusa/model",
        "num_speculative_tokens": 3,
    },
    max_model_len=2000,
)

# Run 100 rounds of greedy generation and report the total wall-clock time.
start_time = time.time()
for _ in range(100):
    llm.generate(prompts, SamplingParams(temperature=0))
print(f"total time: {time.time() - start_time:.2f} s")

@skylee-01
Contributor Author

skylee-01 commented May 16, 2025

@skylee-01 Can you please add a test?

Are you addressing this?

The test has been updated. What else do I need to do?

@DarkLight1337 DarkLight1337 added this to the v0.9.0 milestone May 16, 2025
@DarkLight1337
Member

It should be good to go then; I'll try force-merging when I'm back at my computer.

@skylee-01
Contributor Author

It should be good to go then; I'll try force-merging when I'm back at my computer.

Thank you for your help

@vllm-bot vllm-bot merged commit f4937a5 into vllm-project:main May 16, 2025
86 of 89 checks passed
@markmc
Member

markmc commented May 16, 2025

EAGLE support is gated by this:

        # Eagle is under development, so we don't support it yet.
        if is_eagle_enabled and _warn_or_fallback("Eagle"):
            return False

i.e. you have to explicitly set VLLM_USE_V1=1 to use V1 eagle

I'd imagine we're nearly ready to remove this gate for eagle ... but I presume we don't intend medusa to be flagged as "more supported" in V1 than eagle?

@markmc
Member

markmc commented May 16, 2025

For reference, here's a target/draft model combo for testing:

$ vllm serve lmsys/vicuna-7b-v1.3 --speculative-config '{"method": "medusa", "model": "abhigoyal/vllm-medusa-vicuna-7b-v1.3", "num_speculative_tokens": 3}'

$ python3 ./benchmarks/benchmark_serving.py --model lmsys/vicuna-7b-v1.3 --tokenizer lmsys/vicuna-7b-v1.3 --endpoint /v1/completions --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 3.0 --num-prompts 200

INFO 05-16 11:51:05 [metrics.py:86] SpecDecoding metrics: Draft acceptance rate: 24.7%, Mean acceptance length: 1.74, Accepted: 569 tokens, Drafted: 2301 tokens, Per-position acceptance rate: 0.510, 0.186, 0.046
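
As a quick sanity check (plain arithmetic on the logged values, assuming num_speculative_tokens=3 as in the serve command above), the reported metrics are internally consistent:

# Sanity check on the SpecDecoding metrics above; numbers taken from the log line.
accepted, drafted, k = 569, 2301, 3      # accepted tokens, drafted tokens, num_speculative_tokens
steps = drafted // k                     # 767 decode steps that proposed drafts
print(f"draft acceptance rate:  {accepted / drafted:.1%}")      # -> 24.7%
print(f"mean acceptance length: {1 + accepted / steps:.2f}")    # -> 1.74 (1 verified token + accepted drafts per step)
print(f"per-position rates sum: {0.510 + 0.186 + 0.046:.3f}")   # -> 0.742, matches accepted / steps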

@ekagra-ranjan
Contributor

@skylee-01 - do we have any analysis on the speedup and the acceptance length of the Medusa implementation on MTBench? We have benchmarks for EAGLE and it would be good to compare against them: #17812

@skylee-01
Contributor Author

EAGLE support is gated by this:

        # Eagle is under development, so we don't support it yet.
        if is_eagle_enabled and _warn_or_fallback("Eagle"):
            return False

i.e. you have to explicitly set VLLM_USE_V1=1 to use V1 eagle

I'd imagine we're nearly ready to remove this gate for eagle ... but I presume we don't intend medusa to be flagged as "more supported" in V1 than eagle?

I fully agree with your point. However, I ran into some serious problems while training an Eagle model. Specifically, Eagle3 has not been open-sourced, and the models supported by the Eagle training code are limited; for example, it does not support Qwen models. In addition, the Eagle code style is quite casual, which makes commercial adoption difficult.

@skylee-01
Contributor Author

For reference, here's a target/draft model combo for testing:

$ vllm serve lmsys/vicuna-7b-v1.3 --speculative-config '{"method": "medusa", "model": "abhigoyal/vllm-medusa-vicuna-7b-v1.3", "num_speculative_tokens": 3}'

$ python3 ./benchmarks/benchmark_serving.py --model lmsys/vicuna-7b-v1.3 --tokenizer lmsys/vicuna-7b-v1.3 --endpoint /v1/completions --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 3.0 --num-prompts 200

INFO 05-16 11:51:05 [metrics.py:86] SpecDecoding metrics: Draft acceptance rate: 24.7%, Mean acceptance length: 1.74, Accepted: 569 tokens, Drafted: 2301 tokens, Per-position acceptance rate: 0.510, 0.186, 0.046

Thank you for your work

huachenheli pushed a commit to huachenheli/vllm that referenced this pull request May 22, 2025
Signed-off-by: lisiqi23 <lisiqi23@xiaomi.com>
Signed-off-by: skylee-01 <497627264@qq.com>
Co-authored-by: lisiqi23 <lisiqi23@xiaomi.com>
Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
Signed-off-by: lisiqi23 <lisiqi23@xiaomi.com>
Signed-off-by: skylee-01 <497627264@qq.com>
Co-authored-by: lisiqi23 <lisiqi23@xiaomi.com>
Signed-off-by: Yuqi Zhang <yuqizhang@google.com>
Labels
ready, speculative-decoding, v1
6 participants