Insights: vllm-project/vllm
Overview
1 Release published by 1 person
-
v0.9.0.1 published
May 30, 2025
160 Pull requests merged by 71 people
-
[BugFix] Fix incorrect metrics shutdown error log message
#18992 merged
Jun 1, 2025 -
[BugFix] fix data parallel construct ipv6 url address
#18991 merged
Jun 1, 2025 -
Let max_num_batched_tokens use human_readable_int for large numbers
#18968 merged
Jun 1, 2025 -
[doc] small fix - mkdocs
#18996 merged
Jun 1, 2025 -
[LoRA] Support dynamically initialize packed_modules_mapping for VLM with arbitrary components
#18987 merged
Jun 1, 2025 -
[Core] Rework dtype resolution
#18751 merged
Jun 1, 2025 -
[Bugfix] Fix EAGLE3 broken logits
#18909 merged
Jun 1, 2025 -
[Misc][Benchmark] Add support for CustomDataset
#18511 merged
May 31, 2025 -
[Misc] add return token strs for tokenize
#18941 merged
May 31, 2025 -
[BugFix] Fix multi-node offline data-parallel
#18981 merged
May 31, 2025 -
[P/D] NixlConnector use cache device index for memory registration
#18969 merged
May 31, 2025 -
[ROCm][Kernel] Add gfx950 support for skinny gemms
#18010 merged
May 31, 2025 -
[Bugfix] Fix for issue 17396
#18773 merged
May 31, 2025 -
[FEAT][ROCm] Add AITER grouped topk for DeepSeekV2
#18825 merged
May 31, 2025 -
[BugFix] Pydantic part 2
#18911 merged
May 31, 2025 -
[doc] fix the list rendering issue - security.md
#18982 merged
May 31, 2025 -
[Neuron] Add Multi-Modal model support for Neuron
#18921 merged
May 31, 2025 -
fix security issue of logging llm output
#18980 merged
May 31, 2025 -
[Bugfix]: Fix the incompatibility issue with Structured Outputs when Thinking is disabled
#18879 merged
May 31, 2025 -
[Misc] Fix estimated max model len msg
#18966 merged
May 31, 2025 -
[Frontend] Add rerank support to run_batch endpoint
#16278 merged
May 31, 2025 -
create util function for batched arange
#18937 merged
May 31, 2025 -
[Docs] Correct multiprocessing design doc
#18964 merged
May 31, 2025 -
Tool parser regex timeout handling
#18960 merged
May 30, 2025 -
[Misc] add group_size is -1 in awq quantization
#18910 merged
May 30, 2025 -
[VLM] Add PP support and fix GPTQ inference for Ovis models
#18958 merged
May 30, 2025 -
Benchmark script for fp8 vs bf16 gemm
#17126 merged
May 30, 2025 -
[Perf] API-server scaleout with many-to-many server-engine comms
#17546 merged
May 30, 2025 -
Improve "failed to get the hash of the compiled graph" error
#18956 merged
May 30, 2025 -
[Docs] Update SECURITY.md with link to our security guide
#18961 merged
May 30, 2025 -
[doc] show the count for fork and watch
#18950 merged
May 30, 2025 -
[Feature] minicpm eagle support
#18943 merged
May 30, 2025 -
[CI/Build] remove regex from build dependencies
#18945 merged
May 30, 2025 -
[Bugfix][TPU] Fix tpu model runner testcase failure
#18810 merged
May 30, 2025 -
[Misc]Fix typo
#18947 merged
May 30, 2025 -
[Bugfix][Failing Test] Fix test_vllm_port.py
#18618 merged
May 30, 2025 -
[Model] Use in-place adds in SigLIP
#18922 merged
May 30, 2025 -
[doc] add mkdocs doc
#18930 merged
May 30, 2025 -
[Misc]Fix benchmarks/README.md for speculative decoding
#18897 merged
May 30, 2025 -
[Deprecation] Remove mean pooling default for Qwen2EmbeddingModel
#18913 merged
May 30, 2025 -
[Bugfix] Remove NVFP4 scales assertions to fix load_format=dummy
#18861 merged
May 30, 2025 -
[ROCm] Remove unnecessary assertion of max_model_len in ROCM_AITER_MLA attention backend.
#18938 merged
May 30, 2025 -
[docs] fix: fix markdown syntax
#18927 merged
May 30, 2025 -
[Model] Use AutoWeightsLoader for mamba2
#18918 merged
May 30, 2025 -
[Bugfix] Consistent ascii handling in tool parsers
#18883 merged
May 30, 2025 -
improve the robustness of parsing vlms config in AutoRound
#18894 merged
May 30, 2025 -
[TPU][CI/CD] Clean up docker for TPU tests.
#18926 merged
May 30, 2025 -
[Misc] Update type annotation for rotary embedding base
#18914 merged
May 30, 2025 -
[Bugfix] Fix PP default fallback behavior for V1
#18915 merged
May 30, 2025 -
[TPU] remove transpose ops in moe kernel
#18923 merged
May 29, 2025 -
Use standalone_compile by default in torch >= 2.8.0
#18846 merged
May 29, 2025 -
[P/D] NixlConnector DP fixes
#18903 merged
May 29, 2025 -
[BugFix] Make DP work with connector-delayed new requests
#18559 merged
May 29, 2025 -
[V1] Allocate kv_cache with stride order for V1
#18775 merged
May 29, 2025 -
[Misc] Remove duplicate init for self.vllm_config
#18896 merged
May 29, 2025 -
[Deprecation] Disallow pos-args other than model when initializing LLM
#18802 merged
May 29, 2025 -
[ROCm][V0][Attention] Revert to the previous FA triton kernel
#18226 merged
May 29, 2025 -
[Attention][V1] Toggle for v1 attention backend
#18275 merged
May 29, 2025 -
[Bugfix] Fix the failing gte embedding test
#18720 merged
May 29, 2025 -
[Doc] Fix codeblocks formatting in LoRA adapters documentation
#18907 merged
May 29, 2025 -
[Misc][Tools][Benchmark] Add benchmark_serving supports for llama.cpp.
#18692 merged
May 29, 2025 -
Fix an error in dummy weight loading for quantization models
#18855 merged
May 29, 2025 -
[BugFix] Update pydantic to fix error on python 3.10
#18852 merged
May 29, 2025 -
[Bugfix] Ensure tensors are contiguous during serialisation
#18860 merged
May 29, 2025 -
[Misc] Replace TODO in serving transcription
#18895 merged
May 29, 2025 -
[Bugfix] Fix misleading information in the documentation
#18845 merged
May 29, 2025 -
[doc] add CLI doc
#18871 merged
May 29, 2025 -
[Doc] Remove redundant spaces from compatibility_matrix.md
#18891 merged
May 29, 2025 -
[LoRA] Add LoRA support for InternVL
#18842 merged
May 29, 2025 -
[Neuron] Add multi-LoRA support for Neuron.
#18284 merged
May 29, 2025 -
Fixes a dead link in nightly benchmark readme
#18856 merged
May 29, 2025 -
Skip device and quant Pydantic validation to make plugin device work
#18843 merged
May 29, 2025 -
[Doc][Neuron] Update documentation for Neuron
#18868 merged
May 29, 2025 -
[Bugfix][TPU] fix moe custom kernel import
#18853 merged
May 29, 2025 -
Add ability to use CUDAGraphs with use_inductor=False
#17345 merged
May 29, 2025 -
Prevent the cross-encoder logic from being applied to classification tasks
#18838 merged
May 29, 2025 -
[Core] Enable CUDA graphs for DP + All2All kernels
#18724 merged
May 28, 2025 -
Remove checks for None for fields which should never be None
#17985 merged
May 28, 2025 -
[Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend
#15655 merged
May 28, 2025 -
[Chore][Spec Decode] Update check NoneType instead of assigning variables
#18836 merged
May 28, 2025 -
[V1][Metrics] Remove metrics that were deprecated in 0.8
#18837 merged
May 28, 2025 -
[Misc] fix olmoe model layer for TP > 1
#18828 merged
May 28, 2025 -
[Chore] update ty configuration
#18839 merged
May 28, 2025 -
[Core] Add Lora Support to Beam Search
#18346 merged
May 28, 2025 -
decrement server_load on listen for disconnect
#18784 merged
May 28, 2025 -
[Frontend] add run batch to CLI
#18804 merged
May 28, 2025 -
Enable Pydantic mypy checks and convert configs to Pydantic dataclasses
#17599 merged
May 28, 2025 -
[Platform][Dist] Make torch distributed process group extendable
#18763 merged
May 28, 2025 -
[BugFix] FA2 MLA Accuracy Issue
#18807 merged
May 28, 2025 -
Fix PiecewiseCompileInterpreter
#17338 merged
May 28, 2025 -
[CI] improve embed testing
#18747 merged
May 28, 2025 -
[Deprecation] Remove fallbacks for Embeddings API
#18795 merged
May 28, 2025 -
[Deprecation] Remove unused sync methods in async_timeout
#18792 merged
May 28, 2025 -
[Deprecation] Require overriding get_dummy_text and get_dummy_mm_data
#18796 merged
May 28, 2025 -
[Bugfix][FailingTest]Fix test_model_load_with_params.py
#18758 merged
May 28, 2025 -
[V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (2)
#18781 merged
May 28, 2025 -
[V1] fix torch profiling for V1 offline scenarios
#18445 merged
May 28, 2025 -
[Bugfix]: correctly propagate errors message caught at the chat_templating step to the client
#18769 merged
May 28, 2025 -
[Bugfix] Fix nomic max_model_len
#18755 merged
May 28, 2025 -
[rocm] Fix wrong attention log
#18764 merged
May 28, 2025 -
[Core] Improve Tensor serialisation
#18774 merged
May 28, 2025 -
[Build] Fixes for CMake install
#18570 merged
May 28, 2025 -
[Bugfix] Disable prefix caching by default for benchmark
#18771 merged
May 28, 2025 -
Support datasets in vllm bench serve and sync with benchmark_[serving,datasets].py
#18566 merged
May 27, 2025 -
[Neuron] Support quantization on neuron
#18283 merged
May 27, 2025 -
[CI/Build] [TPU] Fix TPU CI exit code
#18282 merged
May 27, 2025 -
[Bugfix] Mistral tool calling when content is list
#18729 merged
May 27, 2025 -
[Core] Automatically cast multi-modal input dtype
#18756 merged
May 27, 2025 -
optimize get_kv_cache_torch_dtype
#18531 merged
May 27, 2025 -
Disable prefix cache by default for benchmark
#18639 merged
May 27, 2025 -
[V1][Metrics] Add API for accessing in-memory Prometheus metrics
#17010 merged
May 27, 2025 -
[CI/Build] Remove imports of built-in re
#18750 merged
May 27, 2025 -
[BUG FIX] minicpm
#18739 merged
May 27, 2025 -
[Build] fix cpu build missing libtbbmalloc.so
#18744 merged
May 27, 2025 -
Minor fix about MooncakeStoreConnector
#18721 merged
May 27, 2025 -
[Doc] cleanup deprecated flag for doc
#18715 merged
May 27, 2025 -
[Hardware][Intel-Gaudi] [CI/Build] Fix multiple containers using the same name in run-hpu-test.sh
#18752 merged
May 27, 2025 -
feat(rocm-support): support mamba2 on rocm
#18565 merged
May 27, 2025 -
[Misc] improve docs
#18734 merged
May 27, 2025 -
[Doc] Update reproducibility doc and example
#18741 merged
May 27, 2025 -
[Doc] Update OOT model docs
#18742 merged
May 27, 2025 -
[FEAT] [ROCm] Upgrade AITER Fused MoE kernels.
#18271 merged
May 27, 2025 -
[Model][Gemma3] Cast image pixel values already on CPU
#18732 merged
May 27, 2025 -
[V1][Quantization] Add CUDA graph compatible v1 GGUF support
#18646 merged
May 27, 2025 -
[Misc] improve web section group title display
#18684 merged
May 27, 2025 -
[Model][Gemma3] Simplify image input validation
#18710 merged
May 27, 2025 -
Convert examples to ruff-format
#18400 merged
May 26, 2025 -
[V1][Sampler] Improve performance of FlashInfer sampling by sampling logits instead of probs
#18608 merged
May 26, 2025 -
[Bugfix] Fix Llama GGUF initialization
#18717 merged
May 26, 2025 -
[Doc] Move examples and further reorganize user guide
#18666 merged
May 26, 2025 -
[Doc] Improve API docs
#18713 merged
May 26, 2025 -
[Bugfix]: handle hf-xet CAS error when loading Qwen3 weights in vLLM
#18701 merged
May 26, 2025 -
[Misc] add AutoGen integration
#18712 merged
May 26, 2025 -
[Hardware][Intel-Gaudi] [CI/Build] Add tensor parallel size = 2 test to HPU CI
#18709 merged
May 26, 2025 -
[CI/Build] Split pooling and generation extended language models tests in CI
#18705 merged
May 26, 2025 -
[Model] Add support for YARN in NemotronNAS models
#18427 merged
May 26, 2025 -
[CI] fix dump_input for str type
#18697 merged
May 26, 2025 -
[CI/Build] Replace math.isclose with pytest.approx
#18703 merged
May 26, 2025 -
[Bugfix] Fix Mistral-format models with sliding window
#18693 merged
May 26, 2025 -
[Doc] Fix issue template format
#18699 merged
May 26, 2025 -
[GH] Add issue template for reporting CI failures
#18696 merged
May 26, 2025 -
[CI] add missing argument
#18694 merged
May 26, 2025 -
[Bugfix] Fix the lm_head in gpt_bigcode in lora mode
#6357 merged
May 26, 2025 -
refactor: simplify request handler, use positive condition check for handler assignment
#18690 merged
May 26, 2025 -
[Misc] Fixed the abnormally high TTFT issue in the PD disaggregation example
#18644 merged
May 26, 2025 -
[CI/Build][Doc] Update gte-Qwen2-1.5B-instruct usage
#18683 merged
May 26, 2025 -
[Core][Multimodal] Convert PIL Image to array without data copy when hashing
#18682 merged
May 25, 2025 -
[Bugfix] Fix profiling dummy data for Pixtral
#18677 merged
May 25, 2025 -
[Misc] small improve
#18680 merged
May 25, 2025 -
[CI/build] fix no regex
#18676 merged
May 25, 2025 -
[Bugfix] Fix cpu usage and cache hit stats reporting on cpu environment
#18674 merged
May 25, 2025 -
[doc] improve readability
#18675 merged
May 25, 2025 -
[doc] fix broken links
#18671 merged
May 25, 2025 -
[Misc] Reduce logs on startup
#18649 merged
May 25, 2025 -
[BUGFIX] catch subclass first for try...except
#18672 merged
May 25, 2025 -
Speed up the kernels/quantization/ tests
#18669 merged
May 25, 2025 -
[VLM] Initialize video input support for InternVL models
#18499 merged
May 25, 2025 -
[Misc][ModelScope] Change to use runtime VLLM_USE_MODELSCOPE
#18655 merged
May 25, 2025
70 Pull requests opened by 57 people
-
[CI/Build][Bugfix] Ensure compatibility with transformers 4.52
#18678 opened
May 25, 2025 -
[Bugfix][Benchmarks]Fixed async_request_deepspeed_mii() to get ttft
#18689 opened
May 26, 2025 -
[Doc] Readme standardization
#18695 opened
May 26, 2025 -
[Doc] Clarify cudagraph capture size logic and default behavior in scheduler
#18698 opened
May 26, 2025 -
[Core] feat: Implement Priority Scheduling in V1 Engine
#18700 opened
May 26, 2025 -
[CI] change spell checker from codespell to typos
#18711 opened
May 26, 2025 -
[V1][Spec Decode] MLP speculator support
#18719 opened
May 26, 2025 -
[CI]: Fix test_kv_cache_events
#18722 opened
May 26, 2025 -
[Misc][Benchmark] Fix error on benchmark_moe.py
#18723 opened
May 26, 2025 -
Add cuda 12.8 wheel nightly build
#18726 opened
May 26, 2025 -
[Core] Support inplace model weights loading
#18745 opened
May 27, 2025 -
[Bugfix]: Fix moe_unpermute compatibility by aligning function signatures under CUDA < 12.0
#18749 opened
May 27, 2025 -
[Kernel] GGUF MMVQ kernel for multiple input vectors
#18754 opened
May 27, 2025 -
[Kernel] Integrate CUTLASS MoE kernel with PPLX
#18762 opened
May 27, 2025 -
[WIP] Add a metric to track request failures
#18765 opened
May 27, 2025 -
[Feature] A calibration-free RTN-based quantization for accurate and accelerated INT4/INT8 inference
#18768 opened
May 27, 2025 -
[Torch Nightly]add missing dependency
#18770 opened
May 27, 2025 -
Export NaNs in logits to scheduler_stats if output is corrupted
#18777 opened
May 27, 2025 -
[Perf] Tunings for SM100 FP8 CUTLASS kernel
#18778 opened
May 27, 2025 -
[V1] Support DP with Ray
#18779 opened
May 27, 2025 -
Fail request if FSM fails to advance
#18780 opened
May 27, 2025 -
[Docs] Add developer doc about CI failures
#18782 opened
May 27, 2025 -
[Bugfix] handle `attn_metadata=None` in `calculate_kv_scales` branch of attn forward
#18788 opened
May 28, 2025 -
[Model] Add support for normalized Transformer (nGPT) from NVIDIA
#18798 opened
May 28, 2025 -
[Deprecation] Remove `inputs` arg fallback in Engine classes
#18799 opened
May 28, 2025 -
[Deprecation] Remove `prompt_token_ids` arg fallback in `LLM.generate` and `LLM.embed`
#18800 opened
May 28, 2025 -
[Misc] add more info make_client() func
#18803 opened
May 28, 2025 -
Respect passed in device overrides in engine args
#18808 opened
May 28, 2025 -
[Bug] fix the structure of decoder_prompt
#18809 opened
May 28, 2025 -
[Misc] refactor: simplify EngineCoreClient.make_async_mp_client in AsyncLLM
#18817 opened
May 28, 2025 -
[P/D][Misc] Enable profiling in disagg setup
#18827 opened
May 28, 2025 -
[Misc] Split monolithic config.py into domain-specific modules
#18830 opened
May 28, 2025 -
[P/D] Heterogeneous TP
#18833 opened
May 28, 2025 -
Updating the incremental de-tokenizer
#18840 opened
May 28, 2025 -
[Perf] Tune `scaled_fp8_quant` by increasing vectorization
#18844 opened
May 28, 2025 -
[Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets
#18847 opened
May 28, 2025 -
Account for memory usage of other processes
#18858 opened
May 28, 2025 -
[Core] Cast multimodal input in hf processor
#18862 opened
May 28, 2025 -
[Model] NemotronH support
#18863 opened
May 28, 2025 -
[Kernel] Fix fp8 support for pplx and BatchedTritonExperts.
#18864 opened
May 28, 2025 -
Fix AOPerModuleConfig name changes
#18869 opened
May 29, 2025 -
Add DeepSeek-R1-0528 function call chat template
#18874 opened
May 29, 2025 -
[Chore] remove unused jinaai_serving_reranking
#18878 opened
May 29, 2025 -
[bugfix][v1]fixed the missing prompt value in RequestOutputs
#18880 opened
May 29, 2025 -
[BugFix] v0 cache evictor:priority_queue and free_table desynchronization
#18882 opened
May 29, 2025 -
[V1][Metrics] Add max_token_capacity_per_batch
#18900 opened
May 29, 2025 -
[V1][Metrics] Add time_per_prefill_token
#18901 opened
May 29, 2025 -
[V1][Metrics] Add total_tokens_in_queue (prefill + decode)
#18904 opened
May 29, 2025 -
[V1][Metrics] Add num_tokens_preempted
#18905 opened
May 29, 2025 -
update the arch list for Blackwell support on nightly dockerfile
#18912 opened
May 29, 2025 -
[Misc] Fix path and python alias errors in disagg_prefill examples
#18919 opened
May 29, 2025 -
[Core] Remove int32->int64->int32 overhead in FlashInfer sampling
#18920 opened
May 29, 2025 -
feat: add data parallel rank to KVEventBatch
#18925 opened
May 29, 2025 -
Adding "LoRA Test %N" to AMD production tests
#18929 opened
May 29, 2025 -
[Bugfix][Config] Fix config dtype get error
#18934 opened
May 30, 2025 -
Abstract mooncake store connector to kv store connector
#18936 opened
May 30, 2025 -
[Bugfix][core] Prefix caching enabled causes incorrect outputs
#18957 opened
May 30, 2025 -
[Performance] Replace per-tensor/token FP8 quant CUDA kernels with torch.compile
#18965 opened
May 30, 2025 -
Reduce logs in CLI scripts and plugin loader
#18970 opened
May 30, 2025 -
[V1][Spec Decode][Ngram] 1.35x gain -> 1.95x gain on InstructCoder with prompt fix
#18971 opened
May 30, 2025 -
[Core] Remove unnecessary copy of multi modal input embeddings
#18973 opened
May 30, 2025 -
[BugFix][V1] Fix memory profiling bug
#18974 opened
May 30, 2025 -
[Bugfix][Model] Attempt to fix eagle in V0.
#18978 opened
May 30, 2025 -
Fix FlashMLA detection in ray environment
#18979 opened
May 30, 2025 -
Add tarsier model support
#18985 opened
May 31, 2025 -
[Benchmark] Add hf_stream arg to enable or disable datasets streaming loading
#18989 opened
May 31, 2025 -
[ROCm] [AITER] [Bugfix] Patch for AITER commit `648764942e552a8bb5fe16026703716a81f05374`
#18990 opened
May 31, 2025 -
[DRAFT] Self-Speculative Decoding using LayerSkip
#18994 opened
May 31, 2025
87 Issues closed by 39 people
-
[Bug]: Multi-node data parallel cannot recognize IPv6 host address
#18986 closed
Jun 1, 2025 -
[Bug]: Eagle3 in vLLM v0.9.0 has no acceleration effect.
#18946 closed
Jun 1, 2025 -
[Bug]: Unable to use --enable-lora on latest vllm docker container (v0.6.2)
#9133 closed
Jun 1, 2025 -
[Feature]: Janus-Series: Unified Multimodal Understanding and Generation Models
#12479 closed
Jun 1, 2025 -
[Bug]: Asyncengine is dead after sending request!
#12510 closed
Jun 1, 2025 -
[Bug]: vllm container does not set LD_LIBRARY_PATH correctly
#12559 closed
Jun 1, 2025 -
[Performance]: Weird Sliding Window Attention Profiling Results
#12616 closed
Jun 1, 2025 -
[Bug]: shape is invalid for input of size
#12633 closed
Jun 1, 2025 -
[Feature]: Return token strings in addition to token ids for /tokenize
#18928 closed
May 31, 2025 -
[Bug]: data_parallel.py not working in multi-node case
#18553 closed
May 31, 2025 -
[New Model]: deepseek-ai/DeepSeek-R1-0528
#18849 closed
May 31, 2025 -
[Bug]: The size of tensor a (49472) must match the size of tensor b (49664) at non-singleton dimension 1
#17396 closed
May 31, 2025 -
[Bug]: In Version V0.9.0, Qwen3-32B-AWQ Error when turn off thinking and use guided_json simultaneously.
#18821 closed
May 31, 2025 -
[Bug]: Failed to load the fine-tuned Qwen2.5-VL-7B-Instruct model.
#18983 closed
May 31, 2025 -
[Bug]: AWQ INT4 Model with group_size=-1 throws exception while gptq format is fine
#18885 closed
May 30, 2025 -
[Bug]: Non-coherent output from DeepSeek-R1 671B on H200 SXM
#12892 closed
May 30, 2025 -
[Performance]: Qwen2.5VL preprocessing extremely slow with large image, leading low gpu usage
#15869 closed
May 30, 2025 -
[Bug]: test_vllm_port.py::test_get_vllm_port_uri fails with AssertionError: Regex pattern did not match
#18617 closed
May 30, 2025 -
[Bug]: vllm0.9.0 2 A40 GPUs Error running Qwen3-32B
#18870 closed
May 30, 2025 -
[Bug]: Tools results encoding in ascii when using "required" tool call with server.
#18881 closed
May 30, 2025 -
[Bug]: Index out of range error related to speculative decoding and `-O3`
#12507 closed
May 30, 2025 -
[Usage]: Guidance on Building a v0.9.0 Docker Image with Volta GPU Support
#18818 closed
May 30, 2025 -
[Bug]: Critical distributed executor bug
#7791 closed
May 29, 2025 -
[Usage]: Running vLLM with B200 Blackwell
#17901 closed
May 29, 2025 -
[Bug]: v0.8.5 causes gemma-3 models to output whitespace or incoherent output
#17390 closed
May 29, 2025 -
[Bug]:some vllm routes can be reached without authorization
#18893 closed
May 29, 2025 -
[Doc]: The description about InternVL's support for LoRA in the document does not conform to the reality
#18820 closed
May 29, 2025 -
[Bug]:
#18889 closed
May 29, 2025 -
[Bug]:
#18888 closed
May 29, 2025 -
[Usage]: Run multi images, videos inference with MiniCPM-o 2.6
#18685 closed
May 29, 2025 -
[Bug]: VLLM crashes when prefix caching is enabled
#7003 closed
May 29, 2025 -
[Bug]: "gettid" was not declared error when build from source for cpu with version after v0.6.1
#9683 closed
May 29, 2025 -
[Feature]: Beam search: top_p, min_p and logit processors
#10754 closed
May 29, 2025 -
[Usage]: Why does it consume so much memory?
#12346 closed
May 29, 2025 -
[Bug]: Slower inference time on less input tokens
#12406 closed
May 29, 2025 -
[Bug]: Computation cache has already been initialized error on TPUs
#12476 closed
May 29, 2025 -
[Doc]: Example launch command for deepseek v3/R1 for 8-way H100/H200 and MI300X?
#12493 closed
May 29, 2025 -
[Bug]: AttributeError: 'TokenizeChatRequest' object has no attribute 'mm_processor_kwargs'
#13951 closed
May 29, 2025 -
[Feature]: Enable CUDA Graph without turn on torch.compile / Inductor for V1
#15896 closed
May 29, 2025 -
[Bug]: OpenAI Classification Client returning logits instead of softmax values
#18727 closed
May 29, 2025 -
[Bug]: Cannot use OLMoE with tensor parallel higher than 1
#18706 closed
May 28, 2025 -
[Bug]:ModuleNotFoundError: No module named 'vllm._C'
#15592 closed
May 28, 2025 -
[Feature]: Support Lora for Beam Search
#17205 closed
May 28, 2025 -
[CI Failure]: LM Eval Large Models - test_lm_eval_correctness.py
#18766 closed
May 28, 2025 -
[Bug]: MLA correctness issues when using FA2
#18561 closed
May 28, 2025 -
[Bug]: The vllm-0.8.5 image fails on CUDA 12.8
#18790 closed
May 28, 2025 -
[Bug]: Problem with multiple enum labels when using tools via the OpenAI API
#18585 closed
May 28, 2025 -
[Bug]: model_executor/test_model_load_with_params.py fails with AttributeError
#18757 closed
May 28, 2025 -
[Usage]: sequence parallelism or async tp integration seems takes no effect on Qwen3-MoE
#18753 closed
May 28, 2025 -
[Feature]: APC introspection interface
#8523 closed
May 28, 2025 -
[Bug]: Function calling with Qwen & Streaming ('NoneType' object has no attribute 'get')
#9874 closed
May 28, 2025 -
[Performance]: Throughput and Latency degradation with a single LoRA adapter on A100 40 GB
#10062 closed
May 28, 2025 -
[Feature]: Manually inject Prefix KV Cache
#10515 closed
May 28, 2025 -
[Bug]: Loading model from S3 using RunAI Model Streamer excludes too many files
#11929 closed
May 28, 2025 -
[Performance]: Unexpected performance of vLLM Cascade Attention
#12395 closed
May 28, 2025 -
Does vLLM support Sliding Window for chat type use case?
#12488 closed
May 28, 2025 -
[Usage]: When will pip support version 0.9.0?
#18539 closed
May 28, 2025 -
[Bug]: Could not import module 'ProcessorMixin
#18776 closed
May 27, 2025 -
[Bug]: Tools usage with `mistralai/Devstral-Small-2505` fails
#18628 closed
May 27, 2025 -
[Usage]: can data_parallel_size or --data-parallel-size be used with v0 engine
#18702 closed
May 27, 2025 -
[Bug]: Dockerfile.cpu missing libtbb-dev
#18743 closed
May 27, 2025 -
[Bug]: register model can't be found in v1 mode
#18740 closed
May 27, 2025 -
[New Model]: IDEA-Research/ChatRex-7B
#12444 closed
May 27, 2025 -
[Usage]:RuntimeError: Triton Error [CUDA]: device kernel image is invalid
#18580 closed
May 27, 2025 -
[Usage]: about flash attention
#18658 closed
May 27, 2025 -
[Bug]: lora_filesystem_resolver wont work
#18630 closed
May 26, 2025 -
[Bug]: Llama4 MoE weight_loaders Removed from Parameters After Initial Load, Causing Errors During Refitting
#17915 closed
May 26, 2025 -
[Usage]: why max-num-batched-tokens can smaller than max-model-len
#18681 closed
May 26, 2025 -
[Installation]: When to expect the release of v0.9.0
#18704 closed
May 26, 2025 -
[Bug]: --enable-prompt-tokens-details not working in V1
#16162 closed
May 26, 2025 -
[Performance]: Running llama-70b on 4 A100 40Gb
#13234 closed
May 26, 2025 -
[Bug]: GPU and CPU KV cach usage reported opposite by benchmark_throughput
#15201 closed
May 26, 2025 -
[Feature]: Is there any plans for multi loras with Qwen2.5vl ?
#18688 closed
May 26, 2025 -
[Usage]: unrecognized arguments: --enable-reasoning --reasoning-parser deepseek_r1
#18538 closed
May 26, 2025 -
[Usage]: enable_sequence_parallelism=True
#18648 closed
May 26, 2025 -
[Bug]: chunked_prefill cannot be turn off in V1 engine
#18547 closed
May 26, 2025 -
[Bug]: PixtralForConditionalGeneration is broken due to bad placeholder tokens
#18556 closed
May 25, 2025 -
[Bug]: prefix caching doesn't work on CPU vLLM
#17954 closed
May 25, 2025 -
[Bug]: Does not support video input for InternVL series models
#18381 closed
May 25, 2025
86 Issues opened by 83 people
-
[Usage]: meets gpu p2p check err when use tp=4 to launch vllm by torhcrun
#18998 opened
Jun 1, 2025 -
[Bug]: Unable to get vLLM working with RTX 5090
#18995 opened
May 31, 2025 -
[Bug]: Qwen3-4B starts on a V100 but cannot be called; as soon as a request is sent, the vLLM service shuts down
#18993 opened
May 31, 2025 -
[Usage]: Controll Deepseek R1 think or not
#18988 opened
May 31, 2025 -
[Feature]: Colocating multiple LLM engines in the same process with sleep mode.
#18975 opened
May 30, 2025 -
[Feature]: Native packages
#18963 opened
May 30, 2025 -
[Bug]: Deserialisation of the model is taking several minutes.
#18962 opened
May 30, 2025 -
[Feature]: Add Lora for ModernBERT models
#18959 opened
May 30, 2025 -
[CI Failure]: Spec Decoding - spec_decode/e2e/test_eagle_correctness.py
#18954 opened
May 30, 2025 -
[RFC]: vLLM configuration refactoring and modularization
#18953 opened
May 30, 2025 -
[Usage]: How to log stat when using AsyncLLM locally (do not based on openAI api)
#18948 opened
May 30, 2025 -
[Bug]: AsyncLLM when DP > 1, device allocation bug
#18942 opened
May 30, 2025 -
[Feature]: Can vllm support Megakernel?
#18939 opened
May 30, 2025 -
[Bug]:
#18933 opened
May 30, 2025 -
[Bug]: offline dp will stack when one dp group finish work and exit
#18932 opened
May 30, 2025 -
[Usage]: How to use Deepseek-R1-0528 with function call
#18931 opened
May 30, 2025 -
[Feature]: Include data parallel rank in Kv Events
#18924 opened
May 29, 2025 -
[Bug]: video input to vllm server encounter hanging out
#18917 opened
May 29, 2025 -
[Bug]: VLLM Docker v0.9.0 produces Runtime Error: Cuda Error on Blackwell using Qwen0.6B
#18916 opened
May 29, 2025 -
[Usage]: prompt logprobs + APC in speculative decoding
#18908 opened
May 29, 2025 -
[Bug]: vllm0.9.0 cannot load eagle30-llama3.3-70b-inst model
#18906 opened
May 29, 2025 -
[Bug]: The frequency penalty does not work when spec decoding is enabled in V1, with no warning or error
#18902 opened
May 29, 2025 -
[Bug]: Do we really need to implement additional functions for custom_allreduce to serve graph capture?
#18899 opened
May 29, 2025 -
[Feature]: Enable setting `leave=False` in `tqdm` progress bars
#18898 opened
May 29, 2025 -
[Bug]: some vllm routes can be reached without authorization
#18892 opened
May 29, 2025 -
[Bug]: After starting a model with vLLM, passing base64 values in OpenAI-format requests fails
#18890 opened
May 29, 2025 -
[Bug]: FlashMLA V1 with FP8 KV cache not yet supported!
#18887 opened
May 29, 2025 -
[Performance]: The Unstable Performance Difference between CUDA and PyTorch
#18884 opened
May 29, 2025 -
[Questions]: The problem of repeated capture of cudagraph during weight update phase?
#18877 opened
May 29, 2025 -
[Usage][V1]: How to determine the V1 engine is busy and has pending requests?
#18876 opened
May 29, 2025 -
[Usage]: Does vllm support inference or service startup of CPU small model?
#18875 opened
May 29, 2025 -
[Performance]: why the batch-embeddings inputs are separated to small single one?
#18867 opened
May 29, 2025 -
[Feature]: Vectorize `scaled_int8_quant`
#18866 opened
May 28, 2025 -
[Bug]: Image v0.9.0 Fails to Initialize on GCP instance Due to Undetected Platform
#18859 opened
May 28, 2025 -
[Bug]: Non-torch memory tracking fails to account for gpu usage of other processes
#18854 opened
May 28, 2025 -
[New Model]: ByteDance/Dolphin
#18850 opened
May 28, 2025 -
[Bug]: Help, RuntimeError: CUDA error: no kernel image is available for execution on the device
#18835 opened
May 28, 2025 -
[Bug][Regression]: Dimension out of range when using MooncakeStoreConnector
#18834 opened
May 28, 2025 -
[Bug]: Error during serialization of the model.
#18832 opened
May 28, 2025 -
[RFC]: Controlling the maximum length of the waiting queue
#18826 opened
May 28, 2025 -
[Bug]: Broken Structured Output (Guided Decoding) with Qwen3 models when `enable_thinking=False`
#18819 opened
May 28, 2025 -
[Bug]: Model fails to load in background thread in versions >0.8.5
#18816 opened
May 28, 2025 -
[Feature]: vllm/v1/attention/backends/blocksparse_attn.py
#18815 opened
May 28, 2025 -
[Bug] TP=2 fails on dual RTX 5090: TorchInductor compile error or CUDA illegal memory access (TP=1 works)
#18814 opened
May 28, 2025 -
[Usage]: Is the v0.9.0 container restricted to running only on CUDA 12.8 and above?
#18813 opened
May 28, 2025 -
[Bug]: python sampler is faster than flashinfer sampler
#18811 opened
May 28, 2025 -
[Usage]: How to release GPU resource of a reproducible LLM instance
#18806 opened
May 28, 2025 -
[Usage]: NCCL error when using tow AMD GPUs ( gfx1100 )
#18805 opened
May 28, 2025 -
[Performance]: How can i improve performance further in vllm lmcache PD Disaggregate?Plz Help Me
#18801 opened
May 28, 2025 -
[New Model]: NVIDIA-Normalized-GPT (nGPT)
#18797 opened
May 28, 2025 -
[New Model]: ByteDance-Seed/BAGEL-7B-MoT
#18793 opened
May 28, 2025 -
[Bug]: something wrong with hermes tool parser
#18791 opened
May 28, 2025 -
pip install -e . failed
#18789 opened
May 28, 2025 -
[Performance]: Falcon H1 7B seems to be significantly slower than Qwen 7B
#18785 opened
May 28, 2025 -
[Bug][V1] Structured output FSM failures should be handled gracefully without aborting requests
#18783 opened
May 27, 2025 -
[Feature]: vllm torch nightly package not in sync issues
#18772 opened
May 27, 2025 -
[Bug]: Low GPU Underutilization and Badwords Failure When Rollout n > 1
#18767 opened
May 27, 2025 -
[Bug]: [0.9.0] llama-3-8b-instruct-awq returns `name 'FusedMoEPermuteExpertsUnpermute' is not defined` error
#18761 opened
May 27, 2025 -
[Performance]: The CPU overhead gradually increases with multiple batches.
#18760 opened
May 27, 2025 -
[Usage]: How can I use spec-decoding features with multimodal model like qwen2.5vl
#18759 opened
May 27, 2025 -
[Bug]: Schema inconsistency in moe_unpermute causes runtime crash under CUDA 11.8
#18746 opened
May 27, 2025 -
[Usage]: GPU/CPU communication sanity check failed on K8S env
#18731 opened
May 27, 2025 -
[Bug]:RuntimeError: Engine core initialization failed.
#18730 opened
May 27, 2025 -
[Performance]: yarn degrades the performance of qwen3
#18728 opened
May 26, 2025 -
[Usage]: Can the continuous batching function be disabled in vllm now?
#18716 opened
May 26, 2025 -
[Feature]: Support Prometheus Metrics with P/D disagg on multi-machines
#18714 opened
May 26, 2025 -
[Bug][CI Failure] - VI Test - test_engine_core_client.py::test_kv_cache_events[True-tcp]
#18708 opened
May 26, 2025 -
[Doc]: Newest documentation for engine arguments is significantly worse than v0.8.5 and prior
#18707 opened
May 26, 2025 -
[Bug]: build source errors
#18691 opened
May 26, 2025 -
[Bug]: benchmark_serving.py cannot reach the specified generated tokens even with the flag --ignore-eos
#18687 opened
May 26, 2025 -
[Usage]:
#18679 opened
May 25, 2025
321 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
[Kernel] DeepEP dispatch-combine kernel integration
#18434 commented on
May 31, 2025 • 19 new comments -
feat: engine v1 post process sampled logprobs
#17724 commented on
May 28, 2025 • 18 new comments -
Support embedding models in V1 with a dedicated model_runner
#18015 commented on
May 30, 2025 • 15 new comments -
[v1] Hybrid Memory Allocator
#17996 commented on
Jun 1, 2025 • 14 new comments -
[Hardware][AMD] integrate aiter chunked prefill into vllm
#18596 commented on
May 31, 2025 • 13 new comments -
[V1] Support cross-layer KV sharing
#18212 commented on
May 30, 2025 • 12 new comments -
add causal-conv1d in Triton and integrate into vLLM with test code
#18218 commented on
May 30, 2025 • 12 new comments -
Add FlexAttention to V1
#16078 commented on
May 31, 2025 • 11 new comments -
[V1][P/D] XpYd based on p2p communication without cache store
#18242 commented on
May 31, 2025 • 11 new comments -
[Frontend] speed up import time of vllm.config
#18036 commented on
Jun 1, 2025 • 10 new comments -
[V1] LogitsProcessor programming model
#16728 commented on
May 27, 2025 • 9 new comments -
[KERNEL] Sampler. CUDA kernel for applying repetition penalty
#18437 commented on
May 30, 2025 • 9 new comments -
[Misc] Add fully interleaved support for multimodal 'string' content format
#14047 commented on
May 29, 2025 • 8 new comments -
[Model] enable data parallel for Llama4 vision encoder
#18368 commented on
May 31, 2025 • 8 new comments -
Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor)
#12010 commented on
May 29, 2025 • 6 new comments -
[Doc] update Contributing page's testing section
#18272 commented on
May 29, 2025 • 6 new comments -
Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node.
#17930 commented on
May 30, 2025 • 6 new comments -
[Bugfix] Fix spec decode on non-cuda platforms
#18501 commented on
May 28, 2025 • 6 new comments -
[kernel] integrate permute/unpermute kernel into deepgemm moe
#17934 commented on
May 29, 2025 • 6 new comments -
[Frontend] speed up import time of vllm.reasoning
#18236 commented on
May 28, 2025 • 5 new comments -
[Security] Prevent new imports of (cloud)pickle
#18018 commented on
May 28, 2025 • 5 new comments -
[Hardware][TPU] Initial support of model parallelism with single worker using SPMD
#18011 commented on
Jun 1, 2025 • 4 new comments -
[P/D][Core] Fix abrupt request abort
#18485 commented on
May 30, 2025 • 4 new comments -
[Misc] fix: add miss best_of param validation
#18555 commented on
May 31, 2025 • 4 new comments -
[MISC][Bugfix] Use less CPU when message queue has been empty for some time
#16226 commented on
May 29, 2025 • 4 new comments -
[Feature] support torchrun PP with concurrent requests
#18191 commented on
Jun 1, 2025 • 4 new comments -
[ROCm] Get rid of RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES
#15246 commented on
Jun 1, 2025 • 3 new comments -
[Doc] Unify structured outputs examples
#18196 commented on
May 30, 2025 • 3 new comments -
[VLM] Support HF format Phi-4-MM model
#17121 commented on
May 30, 2025 • 3 new comments -
Update common.txt
#18442 commented on
May 28, 2025 • 3 new comments -
[V1][Metrics] Add model_load_time as a log for CUDA devices
#14148 commented on
May 30, 2025 • 2 new comments -
[Feature] Expert Parallelism Load Balancer (EPLB)
#18343 commented on
May 30, 2025 • 2 new comments -
[V0][V1][Core] Add outlines integration for V1, and update V0 integration.
#15975 commented on
Jun 1, 2025 • 2 new comments -
[Misc][benchmark] add warmup; add e2el_per_concurrency and throughput; add random_output_ratio
#18475 commented on
May 29, 2025 • 2 new comments -
[CUDA] Enable full cudagraph for FlashMLA
#18581 commented on
May 30, 2025 • 2 new comments -
Add custom default max tokens for different plataforms
#18557 commented on
May 30, 2025 • 2 new comments -
[AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger
#17331 commented on
May 30, 2025 • 1 new comment -
[Bugfix][Nixl] Fix full prefix cache hit bug
#18632 commented on
May 27, 2025 • 1 new comment -
Fix links in multi-modal model contributing page
#18615 commented on
May 28, 2025 • 1 new comment -
[V1][Spec Decode][Perf] Add fused Triton kernel to reduce overhead in EAGLE spec decoding
#18221 commented on
May 30, 2025 • 1 new comment -
[WIP] [Core][P/D] CPU connector for PD disagg
#18332 commented on
May 31, 2025 • 1 new comment -
[V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics.
#18354 commented on
May 30, 2025 • 1 new comment -
[Hardware][Intel GPU] Add V1 engine support and `chunked_prefill` kernel
#14612 commented on
May 31, 2025 • 1 new comment -
[Core] Support all head sizes up to 256 with FlashAttention backend
#8910 commented on
May 28, 2025 • 0 new comments -
[Doc] update the debugging document to add more explanation on `gpu_memory_utilization` and CUDA OOM issues
#8541 commented on
May 27, 2025 • 0 new comments -
[Model] MLPSpeculator quantization support
#8476 commented on
May 27, 2025 • 0 new comments -
[Bugfix] Update grafana dashboard
#9311 commented on
May 28, 2025 • 0 new comments -
[Misc] add non cuda hf benchmark_througput
#8653 commented on
May 27, 2025 • 0 new comments -
[Misc] Add conftest plugin for applying forking decorator
#8727 commented on
May 27, 2025 • 0 new comments -
[CI/Build] Add audio+video docker tag to Dockerfile
#17974 commented on
May 31, 2025 • 0 new comments -
[Misc] Optimizing mrope calc
#13793 commented on
May 27, 2025 • 0 new comments -
[WIP][Whisper] beam search for whisper
#13758 commented on
May 28, 2025 • 0 new comments -
[Model] Support VLMs with transformers backend
#13754 commented on
May 30, 2025 • 0 new comments -
[UT][spec_decode]remove hard-dependencies of spec decode to CUDA
#13746 commented on
May 26, 2025 • 0 new comments -
Truncation support for recent Mistrals to prevent AsyncEngineDeadError on input exceeding max_model_len w/ chunked prefill
#13741 commented on
May 27, 2025 • 0 new comments -
[V1][BugFix] Raise error when selected attn backend is not supported
#13730 commented on
May 31, 2025 • 0 new comments -
[Model] GPTBigCodeForEmbedding supporting token span classification
#13684 commented on
May 28, 2025 • 0 new comments -
[Core] Add Additional Metrics to vLLM Server
#12726 commented on
May 29, 2025 • 0 new comments -
[Bugfix] Update Prometheus datasource configuration to use variable UID
#12659 commented on
May 30, 2025 • 0 new comments -
[CI/Build] Better default num jobs heuristic
#12477 commented on
May 30, 2025 • 0 new comments -
Update run_cluster.sh
#11796 commented on
May 30, 2025 • 0 new comments -
Add TTFT to offline_inference_with_prefix.py
#11428 commented on
May 30, 2025 • 0 new comments -
[Core] Support global prefix caching
#11385 commented on
May 30, 2025 • 0 new comments -
[Frontend] Add Command-R and Llama-3 chat template
#10496 commented on
May 30, 2025 • 0 new comments -
[Core/Bugfix] Per FlashInfer API changing data_type to kv_data_type for kv_cache
#10103 commented on
May 27, 2025 • 0 new comments -
[Bugfix] Generate multiple different prompts in benchmark_prefix_caching.py based on --num-prompts
#9687 commented on
May 27, 2025 • 0 new comments -
[Frontend] Add sampler_priority and repetition_penalty_range
#9485 commented on
May 27, 2025 • 0 new comments -
[Kernel] Factor registrations
#8424 commented on
May 27, 2025 • 0 new comments -
[New Model]: support Qwen3-235B-A22B-GPTQ-Int4
#18041 commented on
Jun 1, 2025 • 0 new comments -
[Feature]: Support Gemma 3 QAT series
#16856 commented on
May 31, 2025 • 0 new comments -
[Feature]: Support openai responses API interface
#14721 commented on
May 31, 2025 • 0 new comments -
[Bug]: Can't serve Qwen3-AWQ
#18156 commented on
May 31, 2025 • 0 new comments -
[Bug]: vLLM v0.8.5.post1 hanging with Llama 3.3 70b
#18260 commented on
May 31, 2025 • 0 new comments -
[Feature]: Ensure benchmark serving do not import vLLM
#14923 commented on
May 31, 2025 • 0 new comments -
[Bug]: ValueError when using Multi-Instance GPU
#17047 commented on
May 31, 2025 • 0 new comments -
[V1] Feedback Thread
#12568 commented on
May 31, 2025 • 0 new comments -
[Performance]: Performance comparison for v1 engine and v0 engine
#17540 commented on
May 31, 2025 • 0 new comments -
[Bug]: decoding output parsing error
#18376 commented on
May 31, 2025 • 0 new comments -
[Bug]: LLVM ERROR: Failed to compute parent layout for slice layout. when using fp16
#17152 commented on
May 31, 2025 • 0 new comments -
[Bug][PERF]: Qwen2.5 performance degradation 0.8.4 -> 0.8.5
#18619 commented on
May 31, 2025 • 0 new comments -
[Performance]: Stability Concerns with LLaMA-4 Models After Extended Uptime (llama-4 models stability on h100 gpus)
#16473 commented on
May 31, 2025 • 0 new comments -
[Bug]: regression from vllm==0.8.4 - Llama 4 Maverick FP8 + xgrammar crash server
#18085 commented on
May 31, 2025 • 0 new comments -
[Bug]: CPU Memory oom on 8*L40s when deploy meta-llama/Llama-4-Scout-17B-16E-Instruct
#16916 commented on
May 31, 2025 • 0 new comments -
[Bug]: vLLM does not serve text-only version of Llama4
#18022 commented on
May 31, 2025 • 0 new comments -
[Usage]: Llama4 tool parser
#16214 commented on
May 31, 2025 • 0 new comments -
[torch.compile] A simple solution to recursively compile loaded model: using phi3-small as an example
#8398 commented on
May 27, 2025 • 0 new comments -
[Benchmark] Add block_size option to benchmark_throughput.py
#8175 commented on
May 27, 2025 • 0 new comments -
[Core][Kernel][Misc] Support external swapper for vllm
#8018 commented on
May 28, 2025 • 0 new comments -
Print request metrics to stdout
#8014 commented on
May 27, 2025 • 0 new comments -
[misc] Optimize speculative decoding
#7875 commented on
May 27, 2025 • 0 new comments -
[Misc] Allow for unsigned zero NAN representation in ScalarType
#7661 commented on
May 27, 2025 • 0 new comments -
`[Core]` Added streaming support to `LLM` Class
#7648 commented on
May 27, 2025 • 0 new comments -
[WIP][SPMD] Support spec decoding
#7643 commented on
May 27, 2025 • 0 new comments -
[Model] Teleflm Support
#6822 commented on
May 28, 2025 • 0 new comments -
Prefetch all
#6817 commented on
May 27, 2025 • 0 new comments -
[New Model]: moonshotai/Kimi-Audio-7B-Instruct
#17234 commented on
Jun 1, 2025 • 0 new comments -
[Bug]:ValueError: vllm serve Qwen/Qwen2.5-VL-72B-Instruct-AWQ ERROR:The input size is not aligned with the quantized weight shape.
#13980 commented on
Jun 1, 2025 • 0 new comments -
[Feature]: Speculative decoding and Pipeline Paralelism
#14044 commented on
Jun 1, 2025 • 0 new comments -
[Bug]: [Bug]: Run vllm serve raise Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks
#14095 commented on
Jun 1, 2025 • 0 new comments -
[Usage]: How to load DeepSeek-R1-Distill-Qwen-32B model which runs as offline batch inference
#14087 commented on
Jun 1, 2025 • 0 new comments -
[Bug]: max_model_len setting fail
#14102 commented on
Jun 1, 2025 • 0 new comments -
[Bug]: cannot launch deepseek-vl2 on A100
#14103 commented on
Jun 1, 2025 • 0 new comments -
[Benchmark] fixing profling for benchmark latency
#18035 commented on
Jun 1, 2025 • 0 new comments -
[P/D] Heterogeneous TP
#18079 commented on
May 25, 2025 • 0 new comments -
[Hardware][Intel-Gaudi] t.compile optimizations
#18137 commented on
May 29, 2025 • 0 new comments -
[Bugfix] Fix Hermes tool call parser with streaming
#18220 commented on
May 25, 2025 • 0 new comments -
(deprecated) [V1] Support DP with Ray
#18233 commented on
May 30, 2025 • 0 new comments -
[Model] support dots1
#18254 commented on
May 30, 2025 • 0 new comments -
add an absolute path for run.sh
#18258 commented on
May 26, 2025 • 0 new comments -
[Kernel] Add EP support for cutlass_moe_fp4
#18281 commented on
May 31, 2025 • 0 new comments -
[P/D] Support CPU Transfer in NixlConnector
#18293 commented on
May 31, 2025 • 0 new comments -
[V1] Optimized the `determine_available_memory` method for v1
#18296 commented on
May 31, 2025 • 0 new comments -
[Frontend] Speedup frontend test
#18310 commented on
May 30, 2025 • 0 new comments -
[Model]: Fused MoE for nomic-embed-text-v2-moe
#18321 commented on
May 29, 2025 • 0 new comments -
[V1] Support MultiNode DP with Ray
#18366 commented on
May 28, 2025 • 0 new comments -
[V0] Support multiple kv connectors
#18395 commented on
May 30, 2025 • 0 new comments -
[Bugfix][P/D] Fix Preemption + Prefix Cache Bug (#92)
#18411 commented on
May 28, 2025 • 0 new comments -
[WIP] Two batch overlap
#18415 commented on
May 31, 2025 • 0 new comments -
Fixed ppc build when it runs on non-RHEL based linux distros
#18422 commented on
May 30, 2025 • 0 new comments -
[Core] Add support for sampling penalties to v1 ngram speculative decoding
#18441 commented on
May 30, 2025 • 0 new comments -
Enable CPU nightly performance benchmark and its Markdown report
#18444 commented on
May 28, 2025 • 0 new comments -
[V1] Support `LLM.apply_model`
#18465 commented on
May 27, 2025 • 0 new comments -
make TIMEOUT_KEEP_ALIVE configurable through env var
#18472 commented on
May 30, 2025 • 0 new comments -
Integrate quick allreduce and select the best allreduce implementation
#18473 commented on
May 30, 2025 • 0 new comments -
Remove Vision FA warning
#18522 commented on
May 27, 2025 • 0 new comments -
[BUGFIX] fix layout shape of moe 2stage
#18523 commented on
May 27, 2025 • 0 new comments -
[Model][Speculative Decoding] Integrate PARD into vLLM
#18541 commented on
May 28, 2025 • 0 new comments -
Sm100 blockwise fp8 swap ab
#18564 commented on
May 30, 2025 • 0 new comments -
Porting triton_kernels for FusedMoE
#18595 commented on
May 30, 2025 • 0 new comments -
[v1][KVCacheManager] Add a special KVCacheNullBlock class
#18652 commented on
May 29, 2025 • 0 new comments -
[v1] Re-init input batch for multiple kv cache groups
#18654 commented on
May 30, 2025 • 0 new comments -
[Model] Google SigLip 2
#13808 commented on
May 29, 2025 • 0 new comments -
Support w8a8 block_fp8_matmul from generated kernel
#13835 commented on
May 29, 2025 • 0 new comments -
[WIP] Fix weight loading tests
#13842 commented on
May 27, 2025 • 0 new comments -
[NVIDIA] Unify CUTLASS version in CMakelist.txt
#13846 commented on
May 28, 2025 • 0 new comments -
[Misc][V1] Enhance performance of KVCacheManager._get_cached_block
#13878 commented on
May 29, 2025 • 0 new comments -
[BugFix] Fix an Overflow Problem for Some Triton Fused MoE Configurations with large BLOCK_SIZE
#13901 commented on
May 31, 2025 • 0 new comments -
[Misc] Add JSON format logging support with `loguru`
#13920 commented on
May 30, 2025 • 0 new comments -
Support non-attention path operators in Triton
#13963 commented on
May 29, 2025 • 0 new comments -
[Distributed] Add reduce_scatter to DeviceCommunicatorBase
#14057 commented on
May 30, 2025 • 0 new comments -
Tune release tag to support release candidates
#14064 commented on
May 31, 2025 • 0 new comments -
Add CUDA kernel for per_token_group_quant_fp8
#14175 commented on
May 29, 2025 • 0 new comments -
[Model] add colqwen2_vl code & inference
#14291 commented on
May 25, 2025 • 0 new comments -
[Core] Add a level 3 sleep/wake_up that offloads tensors to disk
#14678 commented on
May 29, 2025 • 0 new comments -
[V0][Bugfix] Fix Mamba cache crashing
#15296 commented on
May 29, 2025 • 0 new comments -
fix: can not install torch+cpu for no index url
#15822 commented on
May 31, 2025 • 0 new comments -
[ROCm][V1] Changes needed for making vllm run on Fedora 41 with gtx1100
#16062 commented on
May 30, 2025 • 0 new comments -
[Draft] SnapKV
#16160 commented on
May 30, 2025 • 0 new comments -
[Bugfix][Frontend] Add missing "type":"function" in tool call streaming responses
#16346 commented on
May 28, 2025 • 0 new comments -
[Model][VLM] Add Qwen2.5-Omni model support (end-to-end full support)
#16347 commented on
Jun 1, 2025 • 0 new comments -
[CPU] V1 support for the CPU backend
#16441 commented on
May 29, 2025 • 0 new comments -
[Bugfix][Model] fix Phi3Small model only support v0
#16493 commented on
May 28, 2025 • 0 new comments -
[Bugfix] fix: close issue #16554 to make it real async
#16557 commented on
May 31, 2025 • 0 new comments -
[Kernel] Add Split-KV Attention Kernel to the triton_attn Backend
#16794 commented on
May 30, 2025 • 0 new comments -
[Model][Frontend] Adding timeseries modality support and Qwen2.5-ChatTS model support
#16852 commented on
May 29, 2025 • 0 new comments -
Enable FlashInfer V1 FP8 kv cache
#17005 commented on
May 28, 2025 • 0 new comments -
[RFC][core][V1] generalize structured output manager and backends
#17503 commented on
May 26, 2025 • 0 new comments -
[WIP][V1][Spec Decode] EAGLE tree-attention
#17560 commented on
May 27, 2025 • 0 new comments -
[NVIDIA] Add Cutlass MLA backend
#17625 commented on
May 31, 2025 • 0 new comments -
[V1] Fast decode prepare path for prepare_inputs logic
#17866 commented on
May 28, 2025 • 0 new comments -
[Feature][Quantization] MXFP4 support for MOE models
#17888 commented on
May 29, 2025 • 0 new comments -
[Bug]: always finish_reason='length' using google/gemma-2-27b-it
#13924 commented on
May 28, 2025 • 0 new comments -
[Bug]: Speculative Model from Hugging Face Repository Fails to Load (is not a file error)
#13937 commented on
May 28, 2025 • 0 new comments -
[Bug]: NixlConnector should not skip short do_remote_prefill requests in connector metadata
#18591 commented on
May 27, 2025 • 0 new comments -
[Usage]: Deciding max-num-seqs and max-num-batched-tokens for desired throughput
#16886 commented on
May 27, 2025 • 0 new comments -
[RFC]: vLLM x torch.compile caching should be opt-out by default
#16501 commented on
May 27, 2025 • 0 new comments -
[Bug]:Question about logprobs output being 0.0 when using `vllm` sampling params
#17286 commented on
May 27, 2025 • 0 new comments -
[Feature]: Data parallel inference in offline mode(based on Ray)
#14683 commented on
May 27, 2025 • 0 new comments -
[Bug]: Cannot obtain logits
#16619 commented on
May 27, 2025 • 0 new comments -
[Performance]: lmcache cannot work!
#18135 commented on
May 27, 2025 • 0 new comments -
[Feature]: Simple Data Parallelism in vLLM
#9206 commented on
May 27, 2025 • 0 new comments -
[Bug]: Quantized models - NotImplementedError: Could not run '_C::machete_prepack_B'
#16131 commented on
May 27, 2025 • 0 new comments -
[Usage] Qwen3 Usage Guide
#17327 commented on
May 27, 2025 • 0 new comments -
Migrating from `yapf` to `ruff format`
#17657 commented on
May 27, 2025 • 0 new comments -
[Bug]: ImportError: /workspace/vllm-abo/vllm/_C.abi3.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKSsb
#13608 commented on
May 27, 2025 • 0 new comments -
[Bug]: `size_k must divisible by BLOCK_SIZE_K` error when using tensor parallelism with AWQ-quantized MoE models
#17604 commented on
May 27, 2025 • 0 new comments -
[Bug]: Engine Core initialization failed. See root cause above
#17618 commented on
May 27, 2025 • 0 new comments -
[Bug]: Engine stuck with requests are blocked, running/waiting request count and KV cache usage remain constant.
#18431 commented on
May 27, 2025 • 0 new comments -
[Usage]: Is there an option to obtain attention matrices during inference, similar to the output_attentions=True parameter in the transformers package?
#7736 commented on
May 27, 2025 • 0 new comments -
[Bug]: meta-llama/Llama-3.2-90B-Vision-Instruct and Qwen/Qwen2-VL-72B-Instruct models fails with asyncio.exceptions.CancelledError when using wiki image URLs
#10904 commented on
May 27, 2025 • 0 new comments -
[Bug]: ValueError: Ray does not allocate any GPUs on the driver node. Consider adjusting the Ray placement group or running the driver on a GPU node.
#12983 commented on
May 27, 2025 • 0 new comments -
[Installation]:Error while deploying Deepseek-R1 671B with AMD 8xMi300x
#13659 commented on
May 28, 2025 • 0 new comments -
[Bug]: how to use cuda graph on vllm
#13661 commented on
May 28, 2025 • 0 new comments -
[Performance]: Regarding PD separation performance
#13816 commented on
May 28, 2025 • 0 new comments -
[Feature]: proxy load balance function use like: vllm proxy --server-address ... --server-port ...
#13861 commented on
May 28, 2025 • 0 new comments -
[New Model]: add GOT-OCR2
#13862 commented on
May 28, 2025 • 0 new comments -
[Bug]: Failed to infer device type how to solve this problem
#13865 commented on
May 28, 2025 • 0 new comments -
[New Model]: silero-vad
#13866 commented on
May 28, 2025 • 0 new comments -
[New Model]: RapidOCR
#13868 commented on
May 28, 2025 • 0 new comments -
[Bug]: V1 does not support torch compile
#13872 commented on
May 28, 2025 • 0 new comments -
[Usage]: when i deploy a model ,how to set the max input str length and the number of max input token. and the max output length??
#13874 commented on
May 28, 2025 • 0 new comments -
[Usage]: `max_num_batched_tokens` and `max_model_len`
#13875 commented on
May 28, 2025 • 0 new comments -
[New Model]: dunsloth/DeepSeek-R1-GGUF
#13877 commented on
May 28, 2025 • 0 new comments -
[Feature]: Support Deepseek's DeepGemm MoE
#13879 commented on
May 28, 2025 • 0 new comments -
[Bug]: Incorrect first_token_time and first_scheduled_time metrics results
#13883 commented on
May 28, 2025 • 0 new comments -
[Bug]: vllm 0.7.3, system gets stuck during the reasoning process
#13884 commented on
May 28, 2025 • 0 new comments -
[Feature]: Upstream flash attention to support cutlass 3.8
#13893 commented on
May 28, 2025 • 0 new comments -
[Bug]: [Qwen2.5-VL-72B-Instruct-AWQ] ERROR 02-26 05:28:06 engine.py:400] Error while deserializing header: InvalidHeaderDeserialization
#13899 commented on
May 28, 2025 • 0 new comments -
[Feature]: T5Model has no vLLM implementation
#13903 commented on
May 28, 2025 • 0 new comments -
[Bug]: Speculative Decoding Tokens not being included in Prometheus metrics
#13916 commented on
May 28, 2025 • 0 new comments -
[Bug]: The output size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
#11671 commented on
May 26, 2025 • 0 new comments -
[Bug]: v0.8.5.post1 Eagle3 broken with llama3-70b
#18452 commented on
May 26, 2025 • 0 new comments -
[Bug]: Pipeline parallelism is only supported through AsyncLLMEngine as performance will be severely degraded otherwise.
#7413 commented on
May 26, 2025 • 0 new comments -
[Bug]: Logprob values are affected by sampling parameters and are incompatible with OpenAI API
#9453 commented on
May 26, 2025 • 0 new comments -
[Usage]: deepseek v3 can not set tensor_parallel_size=32
#12256 commented on
May 26, 2025 • 0 new comments -
[Usage]: tensor-parallel-size=2,The program just kept hanging
#13273 commented on
May 26, 2025 • 0 new comments -
[V1]: Unable to serve Qwen model on V1 alpha.
#13284 commented on
May 26, 2025 • 0 new comments -
[Bug]: Chunk Prefill feature fails for ppc64le (IBM POWER)
#13387 commented on
May 26, 2025 • 0 new comments -
[Bug]: when nsight cature nvtx with PP>1, vllmWorkerProcess will unexpectedly terminate
#13482 commented on
May 26, 2025 • 0 new comments -
[Usage]: Does vllm support mix deploy on GPU+CPU?
#13517 commented on
May 26, 2025 • 0 new comments -
[Bug]: When deploying the Qwen2.5-VL-3B service, some image requests return errors.
#13657 commented on
May 26, 2025 • 0 new comments -
[Bug]: Error while running Deepseek-R1: vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly.
#13676 commented on
May 26, 2025 • 0 new comments -
[Bug]: Errors Encountered While Running Qwen/Qwen2.5-VL-72B-AWQ Inference on 8x 24G 4090 GPUs
#13677 commented on
May 26, 2025 • 0 new comments -
[Bug]: vLLM Local path model loading error
#13707 commented on
May 26, 2025 • 0 new comments -
[Bug]: V100 may can not support enable-prefix-caching
#13738 commented on
May 26, 2025 • 0 new comments -
[Bug]: Speculative Decoding Model Load Error (Qwen 14b + 0.5b)
#13759 commented on
May 26, 2025 • 0 new comments -
[Feature]: Support Python 3.13
#12083 commented on
May 25, 2025 • 0 new comments -
[Bug]: GGUF model with architecture qwen3moe is not supported yet.
#18382 commented on
May 25, 2025 • 0 new comments -
[Installation]: Hard to find right wheel files to build the release version
#18673 commented on
May 25, 2025 • 0 new comments -
[Bug]: `vllm serve` doesn't display detailed error logs when async_llm.generate raises an exception
#18393 commented on
May 25, 2025 • 0 new comments -
[Feature]: S1-32B Reasoning Parser support
#13342 commented on
May 27, 2025 • 0 new comments -
[Bug]: Mamba2 models (Bamba and Codestral Mamba) fail on ROCm
#13678 commented on
May 27, 2025 • 0 new comments -
[Bug]: v0.7.3 upgrade issue
#13712 commented on
May 27, 2025 • 0 new comments -
[Usage]: Does vLLM support deploying one model on multiple GPU types (e.g. one A100 and one H20)?
#13760 commented on
May 27, 2025 • 0 new comments -
[Bug]: Structured generation with JSON schema does not produce empty array
#13821 commented on
May 27, 2025 • 0 new comments -
[Bug]: database disk image is malformed
#13838 commented on
May 27, 2025 • 0 new comments -
[Usage]: Speculative Decoding KV Cache Generate
#13845 commented on
May 27, 2025 • 0 new comments -
[Doc]: guided grammar example lack parameter guided_decoding_backend
#13847 commented on
May 27, 2025 • 0 new comments -
[Doc]: vLLM TPU missing git clone instruction
#13854 commented on
May 27, 2025 • 0 new comments -
[Feature]: Support LoRA adapters to vision/merge modules
#17660 commented on
May 26, 2025 • 0 new comments -
[Bug]: v0.8.2, enable calculate_kv_scales, caught exception
#15973 commented on
May 26, 2025 • 0 new comments -
[RFC]: [Spec Decode] Combine Ngram and EAGLE
#18633 commented on
May 26, 2025 • 0 new comments -
[Bug]: Single-Node data parallel (--data-parallel-size=4) leads to vLLM crash
#18567 commented on
May 26, 2025 • 0 new comments -
[Bug]: crash during debugging, works OK when run from the CLI
#16006 commented on
May 26, 2025 • 0 new comments -
[Bug]: Failed to run model Qwen3-30B-A3B on DGX V100x4
#17392 commented on
May 26, 2025 • 0 new comments -
[Installation]: VLLM on ARM machine with GH200
#10459 commented on
May 26, 2025 • 0 new comments -
[Bug]: load_adapter crashes server if called when generations are in progress
#13698 commented on
May 26, 2025 • 0 new comments -
[Feature]: Implement Priority Scheduling In V1 Engine
#14002 commented on
May 26, 2025 • 0 new comments -
[Feature]: Slim Attention (lossless 2x reduction in KV cache size)
#14937 commented on
May 26, 2025 • 0 new comments -
[Bug]: Reward model usage
#12791 commented on
May 26, 2025 • 0 new comments -
[Usage]: RTX 5090 with vllm/vllm-openai docker image
#16652 commented on
May 30, 2025 • 0 new comments -
[Bug]: wake up OOM (72B model in 8*A800(40G))
#13941 commented on
May 30, 2025 • 0 new comments -
[Installation]: build on arm64 hits an error
#9964 commented on
May 30, 2025 • 0 new comments -
[Feature]: Composite model loading using `AutoWeightsLoader` for all models
#15697 commented on
May 30, 2025 • 0 new comments -
[Bug]: error while attempting to bind on address ('0.0.0.0', 8000): address already in use
#7514 commented on
May 30, 2025 • 0 new comments -
[Doc]: guided decoding is not compatible with speculative decoding, but "Compatibility Matrix" shows compatible
#12148 commented on
May 30, 2025 • 0 new comments -
[Bug]: Deepseek R1 MI300A Memory access fault
#12773 commented on
May 30, 2025 • 0 new comments -
[Bug]: vllm version 0.7.2 downloading and loading issues.
#12889 commented on
May 30, 2025 • 0 new comments -
[RFC]: Introduce a Triton-only Transformer Execution Path in vLLM
#13319 commented on
May 30, 2025 • 0 new comments -
[Bug]: ValueError: Unsupported FA version: None on V100 and V1 engine
#13788 commented on
May 30, 2025 • 0 new comments -
[Feature]: Pin vLLM process to the right NUMA Region
#13855 commented on
May 30, 2025 • 0 new comments -
[Bug]: Architectures DeepseekV3ForCausalLM can't deploy on 2080ti
#14016 commented on
May 30, 2025 • 0 new comments -
[Bug]: Does the current version 0.7.3 support installing vllm-flash-attn or flash-attn? After installation, an error occurred when starting the container.
#14018 commented on
May 30, 2025 • 0 new comments -
vLLM inference performance on a 4090 machine is weaker than SGLang performance!
#14021 commented on
May 30, 2025 • 0 new comments -
[Bug]: EP enablement and chunked-prefill-enablement
#14032 commented on
May 30, 2025 • 0 new comments -
[Feature]: Consolidate AITER env flags
#18367 commented on
May 30, 2025 • 0 new comments -
[Bug]: CPU core at 100%
#16968 commented on
May 29, 2025 • 0 new comments -
[Bug]: 100% CPU usage when idle
#16660 commented on
May 29, 2025 • 0 new comments -
[Bug]: vLLM hangs forever on waiting engine process to start
#17676 commented on
May 29, 2025 • 0 new comments -
[Bug]: vLLM server hangs and timeouts after initial requests
#17972 commented on
May 29, 2025 • 0 new comments -
[Usage]: How to maintain per-sequence state in custom LogitsProcessor?
#14078 commented on
May 31, 2025 • 0 new comments -
[Performance]: Low GPU Utilization (70%) for ViT+Qwen2 VLM Model.
#18392 commented on
May 31, 2025 • 0 new comments -
[Bug]: AttributeError: 'MultiprocExecutor' object has no attribute 'workers' when VLLM_USE_V1=1 on the ROCm platform serving deepseek-r1 671B
#17533 commented on
May 30, 2025 • 0 new comments -
[Bug]: SamplingParams() use_beam_search error
#18231 commented on
May 30, 2025 • 0 new comments -
[Misc]: CMake Clean-up / Refactor Tasks
#9129 commented on
May 30, 2025 • 0 new comments -
[Feature]: Inflight BNB quantization for Mixtral models
#17199 commented on
May 30, 2025 • 0 new comments -
[Bug]: Mistral streaming tool parser fails to parse integer tool argument
#13622 commented on
May 30, 2025 • 0 new comments -
[Roadmap] vLLM Roadmap Q2 2025
#15735 commented on
May 30, 2025 • 0 new comments -
[Bug]: RuntimeError: ('Quantization scheme is not supported for ', 'the current GPU. Min capability: 80. ', 'Current capability: 75.')
#14040 commented on
May 30, 2025 • 0 new comments -
[RFC]: Enabling Suffix Decoding, LSTM Speculator, Sequence Parallelism from Arctic Inference
#18037 commented on
May 30, 2025 • 0 new comments -
[RFC]: Blackwell Enablement for vLLM (SM100)
#18153 commented on
May 30, 2025 • 0 new comments -
[Bug]: vLLM 0.8.4 started with Ray, and Ray's dashboard fails to start
#16779 commented on
May 30, 2025 • 0 new comments -
[Usage]: Regex Structured Output Became Very Slow
#18546 commented on
May 30, 2025 • 0 new comments -
[Bug]: Qwen2.5-VL Series Randomly Crashes with Pipeline Parallel
#17351 commented on
May 30, 2025 • 0 new comments -
[Bug]: Clarification regarding bug inside vllm-flash-attn vision module
#18324 commented on
May 30, 2025 • 0 new comments -
[RFC]: hybrid dtype: float32 for weights and activation, float16 or bfloat16 for attention.
#18342 commented on
May 30, 2025 • 0 new comments -
[Usage]: How to use the appropriate --gpu-memory-utilization
#18582 commented on
May 30, 2025 • 0 new comments -
[RFC]: Data Parallel Attention and Expert Parallel MoEs
#16037 commented on
May 30, 2025 • 0 new comments -
[Bug]: Qwen3 FP8 on 0.8.5: type fp8e4nv not supported in this architecture.
#17581 commented on
May 30, 2025 • 0 new comments -
Failed to find C compiler. Please specify via CC environment variable
#2997 commented on
May 30, 2025 • 0 new comments -
[Bug]: After converting InternVL3-8B to the Hugging Face (HF) format, vLLM fails to launch and throws the error: ValueError: 'limit_mm_per_prompt' is only supported for multimodal models.
#17801 commented on
May 28, 2025 • 0 new comments -
[Bug]: Difference in Logprobs between vllm and transformers
#18352 commented on
May 28, 2025 • 0 new comments -
[Bug]: Killing local vLLM worker processes in multiproc_worker_utils.py
#18577 commented on
May 28, 2025 • 0 new comments -
[Feature]: obtain logits
#11397 commented on
May 28, 2025 • 0 new comments -
[Bug]: Large-scale vLLM offline inference fails to start due to port conflicts.
#14919 commented on
May 28, 2025 • 0 new comments -
[Bug]: vLLM lacks eviction policy for MooncakeStore
#18348 commented on
May 28, 2025 • 0 new comments -
[Bug]: Host CPU Docker image on Docker Hub
#18468 commented on
May 28, 2025 • 0 new comments -
[Installation]: deployment failure on Kubernetes with CPU device (testing).
#17187 commented on
May 28, 2025 • 0 new comments -
[RFC]: Deprecating vLLM V0
#18571 commented on
May 28, 2025 • 0 new comments -
[Bug]: RuntimeError: CUDA error: no kernel image is available for execution on the device
#5547 commented on
May 28, 2025 • 0 new comments -
[Bug]: Qwen2.5-VL AWQ/GPTQ RuntimeError: CUDA error: an illegal memory access was encountered 0.8.5+
#17663 commented on
May 28, 2025 • 0 new comments -
[Bug]: Inference fails on Apple silicon due to (distributed) networking error?
#18362 commented on
May 28, 2025 • 0 new comments -
invalid conversion from ‘int’ to ‘CUresult’ {aka ‘cudaError_enum’}
#17931 commented on
May 28, 2025 • 0 new comments -
[Bug]: RuntimeError on RTX 5090: "no kernel image is available for execution on the device
#16901 commented on
May 28, 2025 • 0 new comments -
[Bug]: 0.8.5.post1 CUDA error
#17813 commented on
May 28, 2025 • 0 new comments -
[Bug]: When I use llmcompressor to quantize the llama3 70b model to int8-a8w8, it shows ValueError: Failed to invert hessian due to numerical instability.
#11064 commented on
May 28, 2025 • 0 new comments -
[New Model]: NV-Embed-v2
#12137 commented on
May 28, 2025 • 0 new comments -
[Bug]: There is no module or parameter named '_orig_mod' in Qwen2ForCausalLM
#12783 commented on
May 28, 2025 • 0 new comments -
[Bug]: Deepseek-R1 performance issue on 2*8*H100
#13066 commented on
May 28, 2025 • 0 new comments -
[Installation]: Two CPU-only hosts installed.
#13654 commented on
May 28, 2025 • 0 new comments -
[Bug]: [v0.8.4][Critical] Tools calling broken: xgrammar rejects minItems in JSON Schema, blocking agent functionality
#16880 commented on
May 29, 2025 • 0 new comments -
[RFC]: AWS Neuron 2.23 NxD Inference with vLLM V0
#15970 commented on
May 29, 2025 • 0 new comments -
[Bug]: tool_calls.id is Missing in Streaming Responses (stream=true) but Present in Non-Streaming Responses
#18412 commented on
May 29, 2025 • 0 new comments -
[Bug]: Engine V1: when loading two models onto the same GPU, the second model requires more memory allocation than the first
#14376 commented on
May 29, 2025 • 0 new comments -
[Bug]: available VRAM calculation bug in V1
#17979 commented on
May 29, 2025 • 0 new comments -
[Feature]: Add OpenTelemetry API to v1
#17794 commented on
May 29, 2025 • 0 new comments -
[New Model]: Multimodal Embedding Model GME.
#16406 commented on
May 29, 2025 • 0 new comments -
[Bug]: vLLM serve `google/gemma-3-1b-it` with version `0.8.5` interrupted `SIGTERM`
#17386 commented on
May 29, 2025 • 0 new comments -
[Feature]: Qwen 3 MoE Lora adapter support.
#18120 commented on
May 29, 2025 • 0 new comments -
[Bug]: I'm trying to run Pixtral-Large-Instruct-2411 using vllm, following the documentation at https://huggingface.co/mistralai/Pixtral-Large-Instruct-2411, but I encountered an error.
#10512 commented on
May 29, 2025 • 0 new comments -
[Bug]: Problems with releasing memory after starting the vllm container
#11902 commented on
May 29, 2025 • 0 new comments -
[Bug]: AssertionError in Sampler with Prefix Caching and Prompt Logprobs Enabled.
#13105 commented on
May 29, 2025 • 0 new comments -
[Bug]: Speculative Decoding doesn't work with Ray compiled DAG and SPMD
#13682 commented on
May 29, 2025 • 0 new comments -
[Bug]: using v1 AsyncLLMEngine, signal only works in the main thread of the main interpreter; v0 does not have this problem.
#13806 commented on
May 29, 2025 • 0 new comments -
[Usage]: I want to be able to run Qwen2.5-7B on an RTX 4060
#13882 commented on
May 29, 2025 • 0 new comments -
[Bug]: When deploying LLM with Docker, the following error occurs: RuntimeError: Failed to infer device type
#13946 commented on
May 29, 2025 • 0 new comments -
[Bug]: vLLM crashes when prefix caching is enabled
#13954 commented on
May 29, 2025 • 0 new comments -
[Bug]: Llama chat template cannot process tool_calls=[] in previous messages
#13978 commented on
May 29, 2025 • 0 new comments -
[Feature]: The PyTorch version bundled with vLLM 0.7.3 is too old and does not support the NVIDIA 5080
#13999 commented on
May 29, 2025 • 0 new comments -
[Bug]: Add TPU support for gemma-3-4b-it and gemma-3-27b-it
#16521 commented on
May 29, 2025 • 0 new comments