
[Misc] Add Ray Prometheus logger to V1 #17925


Merged
merged 10 commits into vllm-project:main on May 16, 2025

Conversation

eicherseiji
Contributor

@eicherseiji eicherseiji commented May 9, 2025

Implements the following item from #10582:

  • [P1] Allow users to define their own prometheus client and other arbitrary loggers.

Tested with V1 as follows (with ray-project/ray#52719):

VLLM_USE_V1=1 python serve.py

# serve.py
import ray
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=1,
        )
    ),
    log_engine_metrics=True
)

ray.init(num_cpus=3, num_gpus=1)
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)

Ray dashboard output:
(screenshot: Ray dashboard showing the engine metrics)
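As background, the feature above boils down to a pluggable stat-logger interface. The sketch below illustrates the general shape of such a plugin; all names here (StatLoggerBase, DictStatLogger, the metric names) are illustrative stand-ins, not vLLM's actual API, and a real Ray-backed logger would forward to ray.util.metrics instead of a dict:

```python
# Illustrative sketch of a pluggable stat-logger pattern (not vLLM's API).
from abc import ABC, abstractmethod


class StatLoggerBase(ABC):
    """Minimal stand-in for an engine stat-logger interface."""

    @abstractmethod
    def record(self, name: str, value: float) -> None: ...


class DictStatLogger(StatLoggerBase):
    """Toy logger that accumulates metrics in a dict; a Ray-backed
    logger would forward these values to ray.util.metrics instead."""

    def __init__(self) -> None:
        self.metrics: dict[str, float] = {}

    def record(self, name: str, value: float) -> None:
        # Accumulate counter-style metrics by name.
        self.metrics[name] = self.metrics.get(name, 0.0) + value


logger = DictStatLogger()
logger.record("vllm:num_requests_running", 1)
logger.record("vllm:num_requests_running", 2)
print(logger.metrics["vllm:num_requests_running"])  # → 3.0
```

The point of the interface is that the engine only sees StatLoggerBase, so swapping the Prometheus backend for a Ray one requires no engine changes.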


github-actions bot commented May 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests from your fastcheck build on the Buildkite UI (linked in the PR checks section) by unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label May 9, 2025
@eicherseiji eicherseiji marked this pull request as ready for review May 9, 2025 23:35
@eicherseiji eicherseiji marked this pull request as draft May 9, 2025 23:36
@eicherseiji eicherseiji changed the title [Misc] Add Ray Prometheus logger to V1 [WIP][Misc] Add Ray Prometheus logger to V1 May 9, 2025
@eicherseiji eicherseiji marked this pull request as ready for review May 11, 2025 02:03
@eicherseiji eicherseiji changed the title [WIP][Misc] Add Ray Prometheus logger to V1 [Misc] Add Ray Prometheus logger to V1 May 12, 2025
Collaborator

@comaniac comaniac left a comment

Overall LGTM. Mainly coding style and structure discussions.

Also cc @markmc


mergify bot commented May 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @eicherseiji.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 12, 2025
@eicherseiji eicherseiji force-pushed the ray_prometheus_logger branch from 05b63be to 1c773e2 Compare May 12, 2025 21:45
@mergify mergify bot removed the needs-rebase label May 12, 2025
@eicherseiji eicherseiji force-pushed the ray_prometheus_logger branch 2 times, most recently from fc1838a to b793da4 Compare May 12, 2025 22:13
Member

@markmc markmc left a comment

No fundamental objection to adding a vllm.v1.metrics.ray.RayPrometheusStatLogger custom logger

I'd prefer not to add the extra PrometheusMetrics indirection

And ~90% of the diff is unnecessary reformatting of code which makes it more difficult to review

@eicherseiji
Contributor Author

Thanks for the review @markmc! Apologies for reformatting changes. My pre-commit hook may have been configured incorrectly.

@eicherseiji eicherseiji force-pushed the ray_prometheus_logger branch 2 times, most recently from a923f87 to a41f879 Compare May 13, 2025 17:26
@kouroshHakha
Contributor

Hey @eicherseiji, I made some changes to the interface of PrometheusStatLogger that I think will conflict with your work, but incorporating them should make this PR easier to write and review. For easier review, see https://github.com/njhill/vllm/pull/6/files; #18053 is the same change against master (which includes other things).

@eicherseiji eicherseiji marked this pull request as draft May 13, 2025 22:03
@eicherseiji eicherseiji marked this pull request as ready for review May 13, 2025 22:22
@eicherseiji
Contributor Author

Done. Thanks @kouroshHakha

@eicherseiji eicherseiji force-pushed the ray_prometheus_logger branch from ece396d to 3079a87 Compare May 14, 2025 14:15
@eicherseiji
Contributor Author

Starting with a rebase due to test failure:

[2025-05-14T01:44:26Z] =========================== short test summary info ============================
[2025-05-14T01:44:26Z] FAILED entrypoints/openai/test_audio.py::test_chat_streaming_audio[https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/mary_had_lamb.ogg-fixie-ai/ultravox-v0_5-llama-3_2-1b] - AssertionError: assert 'This audio a...t from a poem' == 'This audio a...a traditional'
[2025-05-14T01:44:26Z]
[2025-05-14T01:44:26Z]   - This audio appears to be a snippet from a traditional
[2025-05-14T01:44:26Z]   + This audio appears to be a snippet from a poem
[2025-05-14T01:44:26Z] ====== 1 failed, 426 passed, 2 skipped, 36 warnings in 3825.08s (1:03:45) ======
[2025-05-14T01:44:29Z] 🚨 Error: The command exited with status 1

Member

@markmc markmc left a comment

Pretty happy to merge the ray stuff, but there's still a lot of unrelated changes I'd like to see removed

Contributor Author

@eicherseiji eicherseiji left a comment

Thanks again @markmc for the review! Going to remove mentions of multiprocess_mode and the unnecessary type-hint changes for now.

Let me know what the recommendation is for Ray wrapper class injection (should we prefer static class variables for now?)

@markmc
Member

markmc commented May 14, 2025

Let me know what the recommendation is for Ray wrapper class injection (should we prefer static class variables for now?)

What you had the first time around with the static class variables was fine - I just asked to remove the extra Metrics class
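The class-variable injection pattern discussed here can be sketched generically. Everything below is illustrative: a toy Counter stands in for the prometheus_client / ray.util.metrics classes, and the attribute name _counter_cls is an assumption, not the PR's actual code:

```python
# Sketch of "static class variable" backend injection (illustrative only).


class Counter:
    """Toy stand-in for prometheus_client.Counter."""

    def __init__(self, name: str) -> None:
        self.name, self.value = name, 0.0

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount


class RayCounter(Counter):
    """Stand-in for a Ray wrapper that would forward to ray.util.metrics."""


class PrometheusStatLogger:
    # Subclasses override this class variable to swap the metric backend.
    _counter_cls = Counter

    def __init__(self) -> None:
        self.num_requests = self._counter_cls("vllm:num_requests")


class RayPrometheusStatLogger(PrometheusStatLogger):
    # Swap the backend class; all logging logic is inherited unchanged.
    _counter_cls = RayCounter


logger = RayPrometheusStatLogger()
print(type(logger.num_requests).__name__)  # → RayCounter
```

Because the base class instantiates metrics through the class variable, the Ray subclass needs no extra indirection layer, which matches the reviewer's preference above.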

@eicherseiji eicherseiji force-pushed the ray_prometheus_logger branch from 858f20b to a0f1b00 Compare May 14, 2025 21:25
Member

@markmc markmc left a comment

Looks great, thanks for your patience. Just some questions on the test you added


# Create the actor and call the async method
actor = EngineTestActor.remote() # type: ignore[attr-defined]
ray.get(actor.run.remote())
Member

You're not actually verifying metrics behavior?

Contributor Author

Just a smoke test for now; added a comment.

Contributor Author

In the future it would be helpful to verify the metrics output to avoid regressions, but setting up a Ray cluster in the test environment could be non-trivial.

So I think the ROI is acceptable to start with, given the current design (class injection with a wrapper that's verified in ray-project).
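For illustration, a follow-up test could assert on recorded values via a fake metrics registry instead of only smoke-testing. Everything below is hypothetical — it is not the PR's actual test code, and the registry and metric names are invented for the example:

```python
# Illustrative fake registry for asserting on emitted metrics in a test.


class FakeMetricsRegistry:
    """Records the last value set per metric name, for assertions."""

    def __init__(self) -> None:
        self._values: dict[str, float] = {}

    def set(self, name: str, value: float) -> None:
        self._values[name] = value

    def get(self, name: str) -> float:
        return self._values[name]


registry = FakeMetricsRegistry()

# A logger under test would write into the registry during engine steps;
# here we simulate that with direct calls.
registry.set("vllm:num_requests_running", 0.0)
registry.set("vllm:prompt_tokens_total", 12.0)

# The regression check a future test could add:
assert registry.get("vllm:prompt_tokens_total") == 12.0
print("verified")  # → verified
```

A fake backend like this avoids standing up a real Ray cluster while still pinning down what the logger emits.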

Contributor Author

@eicherseiji eicherseiji left a comment

Responded to comments @markmc! 🙏



Signed-off-by: Seiji Eicher <seiji@anyscale.com>
@eicherseiji eicherseiji force-pushed the ray_prometheus_logger branch from 4af2dd1 to ffc3331 Compare May 15, 2025 14:32
Member

@markmc markmc left a comment

Thanks again for your patience!

Collaborator

@simon-mo simon-mo left a comment

stamp

@simon-mo simon-mo enabled auto-merge (squash) May 15, 2025 22:46
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label May 15, 2025
@vllm-bot vllm-bot merged commit 5418176 into vllm-project:main May 16, 2025
65 of 67 checks passed
huachenheli pushed a commit to huachenheli/vllm that referenced this pull request May 22, 2025
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Yuqi Zhang <yuqizhang@google.com>
Labels
ready ONLY add when PR is ready to merge/full CI is needed v1
6 participants