[Model] vLLM v1 supports Medusa #17956

Merged
merged 3 commits into vllm-project:main on May 16, 2025

Conversation

skylee-01
Contributor

@skylee-01 skylee-01 commented May 11, 2025

vLLM v1 supports Medusa. This is a basic version; I will add CUDA graph support in the future.
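
A minimal usage sketch of what this enables, with placeholder model paths (the same speculative_config pattern appears in the test script further down in this thread):

import os
from vllm import LLM, SamplingParams

# Force the v1 engine; Medusa support in v1 is what this PR adds.
os.environ["VLLM_USE_V1"] = "1"

llm = LLM(
    model="/path/to/target/model",           # placeholder target model
    speculative_config={
        "method": "medusa",                  # use the Medusa heads as the draft
        "model": "/path/to/medusa/heads",    # placeholder Medusa head checkpoint
        "num_speculative_tokens": 3,         # propose 3 draft tokens per step
    },
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0, max_tokens=32))
print(outputs[0].outputs[0].text)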

lisiqi23 added 2 commits May 11, 2025 14:04
Signed-off-by: lisiqi23 <lisiqi23@xiaomi.com>
Signed-off-by: lisiqi23 <lisiqi23@xiaomi.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label May 11, 2025
@skylee-01
Contributor Author

@WoosukKwon, please take a look here.

Signed-off-by: skylee-01 <497627264@qq.com>
@simon-mo simon-mo requested a review from LiuXiaoxuanPKU May 12, 2025 02:54
Collaborator

@WoosukKwon WoosukKwon left a comment

LGTM. Thanks for the PR!

@WoosukKwon WoosukKwon added the ready label May 13, 2025
Collaborator

@WoosukKwon WoosukKwon left a comment

@skylee-01 Can you please add a test?

@skylee-01
Contributor Author

@skylee-01 Can you please add a test?

ok

@skylee-01
Contributor Author

skylee-01 commented May 14, 2025

[screenshot of failing CI checks]

The buildkite/ci/pr check encountered an "out of memory" error. Do we need to continue troubleshooting this issue?

@DarkLight1337
Member

Which test is that? I don't see the CI failing because of OOM

@skylee-01
Contributor Author

Which test is that? I don't see the CI failing because of OOM

[screenshot of the two failing tests]
These two tests showed OOM errors.

@DarkLight1337
Member

The pytest output at the end indicates that the only failing test is the audio test, which is also failing on main.

@skylee-01
Contributor Author

The pytest output at the end indicates that the only failing test is the audio test, which is also failing on main.

Thank you very much for your reply

@DarkLight1337
Member

@skylee-01 Can you please add a test?

Are you addressing this?

@skylee-01
Contributor Author

@skylee-01 Can you please add a test?

Are you addressing this?

Yes, I will synchronize a version for testing today.

@skylee-01
Contributor Author

skylee-01 commented May 16, 2025

This test uses the base model qwen2-4b, with the input set to 100 tokens and the output set to 10 tokens, and is run through an offline script. Both the base model and the Medusa heads are fine-tuned on business data.

During the test, the batch_size is set to 10 and the generation loop is run 100 times, with the following results:

  • the base model takes 19.19 seconds;
  • with Medusa it takes 9.9 seconds (roughly a 1.9x speedup).

Attached is the test script:

# -*- coding: utf-8 -*-
import os
import sys
import time


# Use GPU 0 and force the v1 engine.
os.environ.update({
    "CUDA_VISIBLE_DEVICES": "0",
    "VLLM_USE_V1": "1"
})

# Make the locally built vLLM importable ahead of any installed version.
custom_vllm_path = '/path/to/your/vllm'
if custom_vllm_path not in sys.path:
    sys.path.insert(0, custom_vllm_path)

from vllm import LLM, SamplingParams

# Batch of 10 identical prompts.
prompt = "******************************"
prompts = [prompt] * 10

llm = LLM(
    model="/path/to/your/model",
    speculative_config={
        "method": "medusa",
        "model": "/path/to/your/medusa/model",
        "num_speculative_tokens": 3,
    },
    max_model_len=2000,
)

# Run 100 rounds of greedy generation and report the total wall-clock time.
start_time = time.time()
for _ in range(100):
    llm.generate(prompts, SamplingParams(temperature=0))
print(f"total time: {time.time() - start_time:.2f} s")

@skylee-01
Contributor Author

skylee-01 commented May 16, 2025

@skylee-01 Can you please add a test?

Are you addressing this?

The test has been updated. What else do I need to do?

@DarkLight1337 DarkLight1337 added this to the v0.9.0 milestone May 16, 2025
@DarkLight1337
Member

It should be good to go then; I'll try force-merging when I'm back at my computer.

@skylee-01
Contributor Author

It should be good to go then; I'll try force-merging when I'm back at my computer.

Thank you for your help

@vllm-bot vllm-bot merged commit f4937a5 into vllm-project:main May 16, 2025
86 of 89 checks passed
@markmc
Member

markmc commented May 16, 2025

EAGLE support is gated by this:

        # Eagle is under development, so we don't support it yet.
        if is_eagle_enabled and _warn_or_fallback("Eagle"):
            return False

i.e. you have to explicitly set VLLM_USE_V1=1 to use V1 eagle

I'd imagine we're nearly ready to remove this gate for eagle ... but I presume we don't intend medusa to be flagged as "more supported" in V1 than eagle?

@markmc
Member

markmc commented May 16, 2025

For reference, here's a target/draft model combo for testing:

$ vllm serve lmsys/vicuna-7b-v1.3 --speculative-config '{"method": "medusa", "model": "abhigoyal/vllm-medusa-vicuna-7b-v1.3", "num_speculative_tokens": 3}'

$ python3 ./benchmarks/benchmark_serving.py --model lmsys/vicuna-7b-v1.3 --tokenizer lmsys/vicuna-7b-v1.3 --endpoint /v1/completions --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 3.0 --num-prompts 200

INFO 05-16 11:51:05 [metrics.py:86] SpecDecoding metrics: Draft acceptance rate: 24.7%, Mean acceptance length: 1.74, Accepted: 569 tokens, Drafted: 2301 tokens, Per-position acceptance rate: 0.510, 0.186, 0.046
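
As a quick sanity check (plain arithmetic on the logged values, assuming num_speculative_tokens=3 as in the serve command above), the reported metrics are internally consistent:

# Sanity check on the SpecDecoding metrics above; numbers taken from the log line.
accepted, drafted, k = 569, 2301, 3      # accepted tokens, drafted tokens, num_speculative_tokens
steps = drafted // k                     # 767 decode steps that proposed drafts
print(f"draft acceptance rate:  {accepted / drafted:.1%}")      # -> 24.7%
print(f"mean acceptance length: {1 + accepted / steps:.2f}")    # -> 1.74 (1 verified token + accepted drafts per step)
print(f"per-position rates sum: {0.510 + 0.186 + 0.046:.3f}")   # -> 0.742, matches accepted / steps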

@ekagra-ranjan
Contributor

@skylee-01 - do we have any analysis on the speedup and the acceptance length of the Medusa implementation on MTBench? We have benchmarks for EAGLE and it would be good to compare against them: #17812

@skylee-01
Contributor Author

EAGLE support is gated by this:

        # Eagle is under development, so we don't support it yet.
        if is_eagle_enabled and _warn_or_fallback("Eagle"):
            return False

i.e. you have to explicitly set VLLM_USE_V1=1 to use V1 eagle

I'd imagine we're nearly ready to remove this gate for eagle ... but I presume we don't intend medusa to be flagged as "more supported" in V1 than eagle?

I fully agree with your point. However, I ran into some serious problems while training an Eagle model. Specifically, Eagle3 has not been open-sourced, and the models supported by the Eagle training code are limited; for example, it does not support Qwen models. In addition, the Eagle code style is quite casual, which makes commercial adoption difficult.

@skylee-01
Contributor Author

For reference, here's a target/draft model combo for testing:

$ vllm serve lmsys/vicuna-7b-v1.3 --speculative-config '{"method": "medusa", "model": "abhigoyal/vllm-medusa-vicuna-7b-v1.3", "num_speculative_tokens": 3}'

$ python3 ./benchmarks/benchmark_serving.py --model lmsys/vicuna-7b-v1.3 --tokenizer lmsys/vicuna-7b-v1.3 --endpoint /v1/completions --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 3.0 --num-prompts 200

INFO 05-16 11:51:05 [metrics.py:86] SpecDecoding metrics: Draft acceptance rate: 24.7%, Mean acceptance length: 1.74, Accepted: 569 tokens, Drafted: 2301 tokens, Per-position acceptance rate: 0.510, 0.186, 0.046

Thank you for your work

huachenheli pushed a commit to huachenheli/vllm that referenced this pull request May 22, 2025
Signed-off-by: lisiqi23 <lisiqi23@xiaomi.com>
Signed-off-by: skylee-01 <497627264@qq.com>
Co-authored-by: lisiqi23 <lisiqi23@xiaomi.com>
Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
Signed-off-by: lisiqi23 <lisiqi23@xiaomi.com>
Signed-off-by: skylee-01 <497627264@qq.com>
Co-authored-by: lisiqi23 <lisiqi23@xiaomi.com>
Signed-off-by: Yuqi Zhang <yuqizhang@google.com>
Labels
ready, speculative-decoding, v1
6 participants