Insights: vllm-project/vllm
Overview
1 Release published by 1 person
-
v0.9.0.1 published
May 30, 2025
160 Pull requests merged by 71 people
-
[BugFix] Fix incorrect metrics shutdown error log message
#18992 merged
Jun 1, 2025 -
[BugFix] fix data parallel construct ipv6 url address
#18991 merged
Jun 1, 2025 -
Let max_num_batched_tokens use human_readable_int for large numbers
#18968 merged
Jun 1, 2025 -
[doc] small fix - mkdocs
#18996 merged
Jun 1, 2025 -
[LoRA] Support dynamically initialize packed_modules_mapping for VLM with arbitrary components
#18987 merged
Jun 1, 2025 -
[Core] Rework dtype resolution
#18751 merged
Jun 1, 2025 -
[Bugfix] Fix EAGLE3 broken logits
#18909 merged
Jun 1, 2025 -
[Misc][Benchmark] Add support for CustomDataset
#18511 merged
May 31, 2025 -
[Misc] add return token strs for tokenize
#18941 merged
May 31, 2025 -
[BugFix] Fix multi-node offline data-parallel
#18981 merged
May 31, 2025 -
[P/D] NixlConnector use cache device index for memory registration
#18969 merged
May 31, 2025 -
[ROCm][Kernel] Add gfx950 support for skinny gemms
#18010 merged
May 31, 2025 -
[Bugfix] Fix for issue 17396
#18773 merged
May 31, 2025 -
[FEAT][ROCm] Add AITER grouped topk for DeepSeekV2
#18825 merged
May 31, 2025 -
[BugFix] Pydantic part 2
#18911 merged
May 31, 2025 -
[doc] fix the list rendering issue - security.md
#18982 merged
May 31, 2025 -
[Neuron] Add Multi-Modal model support for Neuron
#18921 merged
May 31, 2025 -
fix security issue of logging llm output
#18980 merged
May 31, 2025 -
[Bugfix]: Fix the incompatibility issue with Structured Outputs when Thinking is disabled
#18879 merged
May 31, 2025 -
[Misc] Fix estimated max model len msg
#18966 merged
May 31, 2025 -
[Frontend] Add rerank support to run_batch endpoint
#16278 merged
May 31, 2025 -
create util function for batched arange
#18937 merged
May 31, 2025 -
[Docs] Correct multiprocessing design doc
#18964 merged
May 31, 2025 -
Tool parser regex timeout handling
#18960 merged
May 30, 2025 -
[Misc] add group_size is -1 in awq quantization
#18910 merged
May 30, 2025 -
[VLM] Add PP support and fix GPTQ inference for Ovis models
#18958 merged
May 30, 2025 -
Benchmark script for fp8 vs bf16 gemm
#17126 merged
May 30, 2025 -
[Perf] API-server scaleout with many-to-many server-engine comms
#17546 merged
May 30, 2025 -
Improve "failed to get the hash of the compiled graph" error
#18956 merged
May 30, 2025 -
[Docs] Update SECURITY.md with link to our security guide
#18961 merged
May 30, 2025 -
[doc] show the count for fork and watch
#18950 merged
May 30, 2025 -
[Feature] minicpm eagle support
#18943 merged
May 30, 2025 -
[CI/Build] remove regex from build dependencies
#18945 merged
May 30, 2025 -
[Bugfix][TPU] Fix tpu model runner testcase failure
#18810 merged
May 30, 2025 -
[Misc]Fix typo
#18947 merged
May 30, 2025 -
[Bugfix][Failing Test] Fix test_vllm_port.py
#18618 merged
May 30, 2025 -
[Model] Use in-place adds in SigLIP
#18922 merged
May 30, 2025 -
[doc] add mkdocs doc
#18930 merged
May 30, 2025 -
[Misc]Fix benchmarks/README.md for speculative decoding
#18897 merged
May 30, 2025 -
[Deprecation] Remove mean pooling default for Qwen2EmbeddingModel
#18913 merged
May 30, 2025 -
[Bugfix] Remove NVFP4 scales assertions to fix load_format=dummy
#18861 merged
May 30, 2025 -
[ROCm] Remove unnecessary assertion of max_model_len in ROCM_AITER_MLA attention backend.
#18938 merged
May 30, 2025 -
[docs] fix: fix markdown syntax
#18927 merged
May 30, 2025 -
[Model] Use AutoWeightsLoader for mamba2
#18918 merged
May 30, 2025 -
[Bugfix] Consistent ascii handling in tool parsers
#18883 merged
May 30, 2025 -
improve the robustness of parsing vlms config in AutoRound
#18894 merged
May 30, 2025 -
[TPU][CI/CD] Clean up docker for TPU tests.
#18926 merged
May 30, 2025 -
[Misc] Update type annotation for rotary embedding base
#18914 merged
May 30, 2025 -
[Bugfix] Fix PP default fallback behavior for V1
#18915 merged
May 30, 2025 -
[TPU] remove transpose ops in moe kernel
#18923 merged
May 29, 2025 -
Use standalone_compile by default in torch >= 2.8.0
#18846 merged
May 29, 2025 -
[P/D] NixlConnector DP fixes
#18903 merged
May 29, 2025 -
[BugFix] Make DP work with connector-delayed new requests
#18559 merged
May 29, 2025 -
[V1] Allocate kv_cache with stride order for V1
#18775 merged
May 29, 2025 -
[Misc] Remove duplicate init for self.vllm_config
#18896 merged
May 29, 2025 -
[Deprecation] Disallow pos-args other than model when initializing LLM
#18802 merged
May 29, 2025 -
[ROCm][V0][Attention] Revert to the previous FA triton kernel
#18226 merged
May 29, 2025 -
[Attention][V1] Toggle for v1 attention backend
#18275 merged
May 29, 2025 -
[Bugfix] Fix the failing gte embedding test
#18720 merged
May 29, 2025 -
[Doc] Fix codeblocks formatting in LoRA adapters documentation
#18907 merged
May 29, 2025 -
[Misc][Tools][Benchmark] Add benchmark_serving supports for llama.cpp.
#18692 merged
May 29, 2025 -
Fix an error in dummy weight loading for quantization models
#18855 merged
May 29, 2025 -
[BugFix] Update pydantic to fix error on python 3.10
#18852 merged
May 29, 2025 -
[Bugfix] Ensure tensors are contiguous during serialisation
#18860 merged
May 29, 2025 -
[Misc] Replace TODO in serving transcription
#18895 merged
May 29, 2025 -
[Bugfix] Fix misleading information in the documentation
#18845 merged
May 29, 2025 -
[doc] add CLI doc
#18871 merged
May 29, 2025 -
[Doc] Remove redundant spaces from compatibility_matrix.md
#18891 merged
May 29, 2025 -
[LoRA] Add LoRA support for InternVL
#18842 merged
May 29, 2025 -
[Neuron] Add multi-LoRA support for Neuron.
#18284 merged
May 29, 2025 -
Fixes a dead link in nightly benchmark readme
#18856 merged
May 29, 2025 -
Skip device and quant Pydantic validation to make plugin device work
#18843 merged
May 29, 2025 -
[Doc][Neuron] Update documentation for Neuron
#18868 merged
May 29, 2025 -
[Bugfix][TPU] fix moe custom kernel import
#18853 merged
May 29, 2025 -
Add ability to use CUDAGraphs with use_inductor=False
#17345 merged
May 29, 2025 -
Prevent the cross-encoder logic from being applied to classification tasks
#18838 merged
May 29, 2025 -
[Core] Enable CUDA graphs for DP + All2All kernels
#18724 merged
May 28, 2025 -
Remove checks for None for fields which should never be None
#17985 merged
May 28, 2025 -
[Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend
#15655 merged
May 28, 2025 -
[Chore][Spec Decode] Update check NoneType instead of assigning variables
#18836 merged
May 28, 2025 -
[V1][Metrics] Remove metrics that were deprecated in 0.8
#18837 merged
May 28, 2025 -
[Misc] fix olmoe model layer for TP > 1
#18828 merged
May 28, 2025 -
[Chore] update ty configuration
#18839 merged
May 28, 2025 -
[Core] Add Lora Support to Beam Search
#18346 merged
May 28, 2025 -
decrement server_load on listen for disconnect
#18784 merged
May 28, 2025 -
[Frontend] add run batch to CLI
#18804 merged
May 28, 2025 -
Enable Pydantic mypy checks and convert configs to Pydantic dataclasses
#17599 merged
May 28, 2025 -
[Platform][Dist] Make torch distributed process group extendable
#18763 merged
May 28, 2025 -
[BugFix] FA2 MLA Accuracy Issue
#18807 merged
May 28, 2025 -
Fix PiecewiseCompileInterpreter
#17338 merged
May 28, 2025 -
[CI] improve embed testing
#18747 merged
May 28, 2025 -
[Deprecation] Remove fallbacks for Embeddings API
#18795 merged
May 28, 2025 -
[Deprecation] Remove unused sync methods in async_timeout
#18792 merged
May 28, 2025 -
[Deprecation] Require overriding get_dummy_text and get_dummy_mm_data
#18796 merged
May 28, 2025 -
[Bugfix][FailingTest]Fix test_model_load_with_params.py
#18758 merged
May 28, 2025 -
[V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (2)
#18781 merged
May 28, 2025 -
[V1] fix torch profiling for V1 offline scenarios
#18445 merged
May 28, 2025 -
[Bugfix]: correctly propagate errors message caught at the chat_templating step to the client
#18769 merged
May 28, 2025 -
[Bugfix] Fix nomic max_model_len
#18755 merged
May 28, 2025 -
[rocm] Fix wrong attention log
#18764 merged
May 28, 2025 -
[Core] Improve Tensor serialisation
#18774 merged
May 28, 2025 -
[Build] Fixes for CMake install
#18570 merged
May 28, 2025 -
[Bugfix] Disable prefix caching by default for benchmark
#18771 merged
May 28, 2025 -
Support datasets in vllm bench serve and sync with benchmark_[serving,datasets].py
#18566 merged
May 27, 2025 -
[Neuron] Support quantization on neuron
#18283 merged
May 27, 2025 -
[CI/Build] [TPU] Fix TPU CI exit code
#18282 merged
May 27, 2025 -
[Bugfix] Mistral tool calling when content is list
#18729 merged
May 27, 2025 -
[Core] Automatically cast multi-modal input dtype
#18756 merged
May 27, 2025 -
optimize get_kv_cache_torch_dtype
#18531 merged
May 27, 2025 -
Disable prefix cache by default for benchmark
#18639 merged
May 27, 2025 -
[V1][Metrics] Add API for accessing in-memory Prometheus metrics
#17010 merged
May 27, 2025 -
[CI/Build] Remove imports of built-in re
#18750 merged
May 27, 2025 -
[BUG FIX] minicpm
#18739 merged
May 27, 2025 -
[Build] fix cpu build missing libtbbmalloc.so
#18744 merged
May 27, 2025 -
Minor fix about MooncakeStoreConnector
#18721 merged
May 27, 2025 -
[Doc] cleanup deprecated flag for doc
#18715 merged
May 27, 2025 -
[Hardware][Intel-Gaudi] [CI/Build] Fix multiple containers using the same name in run-hpu-test.sh
#18752 merged
May 27, 2025 -
feat(rocm-support): support mamba2 on rocm
#18565 merged
May 27, 2025 -
[Misc] improve docs
#18734 merged
May 27, 2025 -
[Doc] Update reproducibility doc and example
#18741 merged
May 27, 2025 -
[Doc] Update OOT model docs
#18742 merged
May 27, 2025 -
[FEAT] [ROCm] Upgrade AITER Fused MoE kernels.
#18271 merged
May 27, 2025 -
[Model][Gemma3] Cast image pixel values already on CPU
#18732 merged
May 27, 2025 -
[V1][Quantization] Add CUDA graph compatible v1 GGUF support
#18646 merged
May 27, 2025 -
[Misc] improve web section group title display
#18684 merged
May 27, 2025 -
[Model][Gemma3] Simplify image input validation
#18710 merged
May 27, 2025 -
Convert examples to ruff-format
#18400 merged
May 26, 2025 -
[V1][Sampler] Improve performance of FlashInfer sampling by sampling logits instead of probs
#18608 merged
May 26, 2025 -
[Bugfix] Fix Llama GGUF initialization
#18717 merged
May 26, 2025 -
[Doc] Move examples and further reorganize user guide
#18666 merged
May 26, 2025 -
[Doc] Improve API docs
#18713 merged
May 26, 2025 -
[Bugfix]: handle hf-xet CAS error when loading Qwen3 weights in vLLM
#18701 merged
May 26, 2025 -
[Misc] add AutoGen integration
#18712 merged
May 26, 2025 -
[Hardware][Intel-Gaudi] [CI/Build] Add tensor parallel size = 2 test to HPU CI
#18709 merged
May 26, 2025 -
[CI/Build] Split pooling and generation extended language models tests in CI
#18705 merged
May 26, 2025 -
[Model] Add support for YARN in NemotronNAS models
#18427 merged
May 26, 2025 -
[CI] fix dump_input for str type
#18697 merged
May 26, 2025 -
[CI/Build] Replace math.isclose with pytest.approx
#18703 merged
May 26, 2025 -
[Bugfix] Fix Mistral-format models with sliding window
#18693 merged
May 26, 2025 -
[Doc] Fix issue template format
#18699 merged
May 26, 2025 -
[GH] Add issue template for reporting CI failures
#18696 merged
May 26, 2025 -
[CI] add missing argument
#18694 merged
May 26, 2025 -
[Bugfix] Fix the lm_head in gpt_bigcode in lora mode
#6357 merged
May 26, 2025 -
refactor: simplify request handler, use positive condition check for handler assignment
#18690 merged
May 26, 2025 -
[Misc] Fixed the abnormally high TTFT issue in the PD disaggregation example
#18644 merged
May 26, 2025 -
[CI/Build][Doc] Update gte-Qwen2-1.5B-instruct usage
#18683 merged
May 26, 2025 -
[Core][Multimodal] Convert PIL Image to array without data copy when hashing
#18682 merged
May 25, 2025 -
[Bugfix] Fix profiling dummy data for Pixtral
#18677 merged
May 25, 2025 -
[Misc] small improve
#18680 merged
May 25, 2025 -
[CI/build] fix no regex
#18676 merged
May 25, 2025 -
[Bugfix] Fix cpu usage and cache hit stats reporting on cpu environment
#18674 merged
May 25, 2025 -
[doc] improve readability
#18675 merged
May 25, 2025 -
[doc] fix broken links
#18671 merged
May 25, 2025 -
[Misc] Reduce logs on startup
#18649 merged
May 25, 2025 -
[BUGFIX] catch subclass first for try...except
#18672 merged
May 25, 2025 -
Speed up the kernels/quantization/ tests
#18669 merged
May 25, 2025 -
[VLM] Initialize video input support for InternVL models
#18499 merged
May 25, 2025 -
[Misc][ModelScope] Change to use runtime VLLM_USE_MODELSCOPE
#18655 merged
May 25, 2025
70 Pull requests opened by 57 people
-
[CI/Build][Bugfix] Ensure compatibility with transformers 4.52
#18678 opened
May 25, 2025 -
[Bugfix][Benchmarks]Fixed async_request_deepspeed_mii() to get ttft
#18689 opened
May 26, 2025 -
[Doc] Readme standardization
#18695 opened
May 26, 2025 -
[Doc] Clarify cudagraph capture size logic and default behavior in scheduler
#18698 opened
May 26, 2025 -
[Core] feat: Implement Priority Scheduling in V1 Engine
#18700 opened
May 26, 2025 -
[CI] change spell checker from codespell to typos
#18711 opened
May 26, 2025 -
[V1][Spec Decode] MLP speculator support
#18719 opened
May 26, 2025 -
[CI]: Fix test_kv_cache_events
#18722 opened
May 26, 2025 -
[Misc][Benchmark] Fix error on benchmark_moe.py
#18723 opened
May 26, 2025 -
Add cuda 12.8 wheel nightly build
#18726 opened
May 26, 2025 -
[Core] Support inplace model weights loading
#18745 opened
May 27, 2025 -
[Bugfix]: Fix moe_unpermute compatibility by aligning function signatures under CUDA < 12.0
#18749 opened
May 27, 2025 -
[Kernel] GGUF MMVQ kernel for multiple input vectors
#18754 opened
May 27, 2025 -
[Kernel] Integrate CUTLASS MoE kernel with PPLX
#18762 opened
May 27, 2025 -
[WIP] Add a metric to track request failures
#18765 opened
May 27, 2025 -
[Feature] A calibration-free RTN-based quantization for accurate and accelerated INT4/INT8 inference
#18768 opened
May 27, 2025 -
[Torch Nightly]add missing dependency
#18770 opened
May 27, 2025 -
Export NaNs in logits to scheduler_stats if output is corrupted
#18777 opened
May 27, 2025 -
[Perf] Tunings for SM100 FP8 CUTLASS kernel
#18778 opened
May 27, 2025 -
[V1] Support DP with Ray
#18779 opened
May 27, 2025 -
Fail request if FSM fails to advance
#18780 opened
May 27, 2025 -
[Docs] Add developer doc about CI failures
#18782 opened
May 27, 2025 -
[Bugfix] handle `attn_metadata=None` in `calculate_kv_scales` branch of attn forward
#18788 opened
May 28, 2025 -
[Model] Add support for normalized Transformer (nGPT) from NVIDIA
#18798 opened
May 28, 2025 -
[Deprecation] Remove `inputs` arg fallback in Engine classes
#18799 opened
May 28, 2025 -
[Deprecation] Remove `prompt_token_ids` arg fallback in `LLM.generate` and `LLM.embed`
#18800 opened
May 28, 2025 -
[Misc] add more info make_client() func
#18803 opened
May 28, 2025 -
Respect passed in device overrides in engine args
#18808 opened
May 28, 2025 -
[Bug] fix the structure of decoder_prompt
#18809 opened
May 28, 2025 -
[Misc] refactor: simplify EngineCoreClient.make_async_mp_client in AsyncLLM
#18817 opened
May 28, 2025 -
[P/D][Misc] Enable profiling in disagg setup
#18827 opened
May 28, 2025 -
[Misc] Split monolithic config.py into domain-specific modules
#18830 opened
May 28, 2025 -
[P/D] Heterogeneous TP
#18833 opened
May 28, 2025 -
Updating the incremental de-tokenizer
#18840 opened
May 28, 2025 -
[Perf] Tune `scaled_fp8_quant` by increasing vectorization
#18844 opened
May 28, 2025 -
[Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets
#18847 opened
May 28, 2025 -
Account for memory usage of other processes
#18858 opened
May 28, 2025 -
[Core] Cast multimodal input in hf processor
#18862 opened
May 28, 2025 -
[Model] NemotronH support
#18863 opened
May 28, 2025 -
[Kernel] Fix fp8 support for pplx and BatchedTritonExperts.
#18864 opened
May 28, 2025 -
Fix AOPerModuleConfig name changes
#18869 opened
May 29, 2025 -
Add DeepSeek-R1-0528 function call chat template
#18874 opened
May 29, 2025 -
[Chore] remove unused jinaai_serving_reranking
#18878 opened
May 29, 2025 -
[bugfix][v1]fixed the missing prompt value in RequestOutputs
#18880 opened
May 29, 2025 -
[BugFix] v0 cache evictor:priority_queue and free_table desynchronization
#18882 opened
May 29, 2025 -
[V1][Metrics] Add max_token_capacity_per_batch
#18900 opened
May 29, 2025 -
[V1][Metrics] Add time_per_prefill_token
#18901 opened
May 29, 2025 -
[V1][Metrics] Add total_tokens_in_queue (prefill + decode)
#18904 opened
May 29, 2025 -
[V1][Metrics] Add num_tokens_preempted
#18905 opened
May 29, 2025 -
update the arch list for Blackwell support on nightly dockerfile
#18912 opened
May 29, 2025 -
[Misc] Fix path and python alias errors in disagg_prefill examples
#18919 opened
May 29, 2025 -
[Core] Remove int32->int64->int32 overhead in FlashInfer sampling
#18920 opened
May 29, 2025 -
feat: add data parallel rank to KVEventBatch
#18925 opened
May 29, 2025 -
Adding "LoRA Test %N" to AMD production tests
#18929 opened
May 29, 2025 -
[Bugfix][Config] Fix config dtype get error
#18934 opened
May 30, 2025 -
Abstract mooncake store connector to kv store connector
#18936 opened
May 30, 2025 -
[Bugfix][core] Prefix caching enabled causes incorrect outputs
#18957 opened
May 30, 2025 -
[Performance] Replace per-tensor/token FP8 quant CUDA kernels with torch.compile
#18965 opened
May 30, 2025 -
Reduce logs in CLI scripts and plugin loader
#18970 opened
May 30, 2025 -
[V1][Spec Decode][Ngram] 1.35x gain -> 1.95x gain on InstructCoder with prompt fix
#18971 opened
May 30, 2025 -
[Core] Remove unnecessary copy of multi modal input embeddings
#18973 opened
May 30, 2025 -
[BugFix][V1] Fix memory profiling bug
#18974 opened
May 30, 2025 -
[Bugfix][Model] Attempt to fix eagle in V0.
#18978 opened
May 30, 2025 -
Fix FlashMLA detection in ray environment
#18979 opened
May 30, 2025 -
Add tarsier model support
#18985 opened
May 31, 2025 -
[Benchmark] Add hf_stream arg to enable or disable datasets streaming loading
#18989 opened
May 31, 2025 -
[ROCm] [AITER] [Bugfix] Patch for AITER commit `648764942e552a8bb5fe16026703716a81f05374`
#18990 opened
May 31, 2025 -
[DRAFT] Self-Speculative Decoding using LayerSkip
#18994 opened
May 31, 2025
87 Issues closed by 39 people
-
[Bug]: Multi-node data parallel cannot recognize IPv6 host address
#18986 closed
Jun 1, 2025 -
[Bug]: Eagle3 in vLLM v0.9.0 has no acceleration effect.
#18946 closed
Jun 1, 2025 -
[Bug]: Unable to use --enable-lora on latest vllm docker container (v0.6.2)
#9133 closed
Jun 1, 2025 -
[Feature]: Janus-Series: Unified Multimodal Understanding and Generation Models
#12479 closed
Jun 1, 2025 -
[Bug]: Asyncengine is dead after sending request!
#12510 closed
Jun 1, 2025 -
[Bug]: vllm container does not set LD_LIBRARY_PATH correctly
#12559 closed
Jun 1, 2025 -
[Performance]: Weird Sliding Window Attention Profiling Results
#12616 closed
Jun 1, 2025 -
[Bug]: shape is invalid for input of size
#12633 closed
Jun 1, 2025 -
[Feature]: Return token strings in addition to token ids for /tokenize
#18928 closed
May 31, 2025 -
[Bug]: data_parallel.py not working in multi-node case
#18553 closed
May 31, 2025 -
[New Model]: deepseek-ai/DeepSeek-R1-0528
#18849 closed
May 31, 2025 -
[Bug]: The size of tensor a (49472) must match the size of tensor b (49664) at non-singleton dimension 1
#17396 closed
May 31, 2025 -
[Bug]: In Version V0.9.0, Qwen3-32B-AWQ Error when turn off thinking and use guided_json simultaneously.
#18821 closed
May 31, 2025 -
[Bug]: Failed to load the fine-tuned Qwen2.5-VL-7B-Instruct model.
#18983 closed
May 31, 2025 -
[Bug]: AWQ INT4 Model with group_size=-1 throws exception while gptq format is fine
#18885 closed
May 30, 2025 -
[Bug]: Non-coherent output from DeepSeek-R1 671B on H200 SXM
#12892 closed
May 30, 2025 -
[Performance]: Qwen2.5VL preprocessing extremely slow with large image, leading low gpu usage
#15869 closed
May 30, 2025 -
[Bug]: test_vllm_port.py::test_get_vllm_port_uri fails with AssertionError: Regex pattern did not match
#18617 closed
May 30, 2025 -
[Bug]: vllm0.9.0 2 A40 GPUs Error running Qwen3-32B
#18870 closed
May 30, 2025 -
[Bug]: Tools results encoding in ascii when using "required" tool call with server.
#18881 closed
May 30, 2025 -
[Bug]: Index out of range error related to speculative decoding and `-O3`
#12507 closed
May 30, 2025 -
[Usage]: Guidance on Building a v0.9.0 Docker Image with Volta GPU Support
#18818 closed
May 30, 2025 -
[Bug]: Critical distributed executor bug
#7791 closed
May 29, 2025 -
[Usage]: Running vLLM with B200 Blackwell
#17901 closed
May 29, 2025 -
[Bug]: v0.8.5 causes gemma-3 models to output whitespace or incoherent output
#17390 closed
May 29, 2025 -
[Bug]:some vllm routes can be reached without authorization
#18893 closed
May 29, 2025 -
[Doc]: The description about InternVL's support for LoRA in the document does not conform to the reality
#18820 closed
May 29, 2025 -
[Bug]:
#18889 closed
May 29, 2025 -
[Bug]:
#18888 closed
May 29, 2025 -
[Usage]: Run multi images, videos inference with MiniCPM-o 2.6
#18685 closed
May 29, 2025 -
[Bug]: VLLM crashes when prefix caching is enabled
#7003 closed
May 29, 2025 -
[Bug]: "gettid" was not declared error when build from source for cpu with version after v0.6.1
#9683 closed
May 29, 2025 -
[Feature]: Beam search: top_p, min_p and logit processors
#10754 closed
May 29, 2025 -
[Usage]: Why does it consume so much memory?
#12346 closed
May 29, 2025 -
[Bug]: Slower inference time on less input tokens
#12406 closed
May 29, 2025 -
[Bug]: Computation cache has already been initialized error on TPUs
#12476 closed
May 29, 2025 -
[Doc]: Example launch command for deepseek v3/R1 for 8-way H100/H200 and MI300X?
#12493 closed
May 29, 2025 -
[Bug]: AttributeError: 'TokenizeChatRequest' object has no attribute 'mm_processor_kwargs'
#13951 closed
May 29, 2025 -
[Feature]: Enable CUDA Graph without turn on torch.compile / Inductor for V1
#15896 closed
May 29, 2025 -
[Bug]: OpenAI Classification Client returning logits instead of softmax values
#18727 closed
May 29, 2025 -
[Bug]: Cannot use OLMoE with tensor parallel higher than 1
#18706 closed
May 28, 2025 -
[Bug]:ModuleNotFoundError: No module named 'vllm._C'
#15592 closed
May 28, 2025 -
[Feature]: Support Lora for Beam Search
#17205 closed
May 28, 2025 -
[CI Failure]: LM Eval Large Models - test_lm_eval_correctness.py
#18766 closed
May 28, 2025 -
[Bug]: MLA correctness issues when using FA2
#18561 closed
May 28, 2025 -
[Bug]: The vllm-0.8.5 image fails on CUDA 12.8
#18790 closed
May 28, 2025 -
[Bug]: Problem with multiple enum labels when using tools via the OpenAI API
#18585 closed
May 28, 2025 -
[Bug]: model_executor/test_model_load_with_params.py fails with AttributeError
#18757 closed
May 28, 2025 -
[Usage]: sequence parallelism or async tp integration seems takes no effect on Qwen3-MoE
#18753 closed
May 28, 2025 -
[Feature]: APC introspection interface
#8523 closed
May 28, 2025 -
[Bug]: Function calling with Qwen & Streaming ('NoneType' object has no attribute 'get')
#9874 closed
May 28, 2025 -
[Performance]: Throughput and Latency degradation with a single LoRA adapter on A100 40 GB
#10062 closed
May 28, 2025 -
[Feature]: Manually inject Prefix KV Cache
#10515 closed
May 28, 2025 -
[Bug]: Loading model from S3 using RunAI Model Streamer excludes too many files
#11929 closed
May 28, 2025 -
[Performance]: Unexpected performance of vLLM Cascade Attention
#12395 closed
May 28, 2025 -
Does vLLM support Sliding Window for chat type use case?
#12488 closed
May 28, 2025 -
[Usage]: When will pip support version 0.9.0?
#18539 closed
May 28, 2025 -
[Bug]: Could not import module 'ProcessorMixin
#18776 closed
May 27, 2025 -
[Bug]: Tools usage with `mistralai/Devstral-Small-2505` fails
#18628 closed
May 27, 2025 -
[Usage]: can data_parallel_size or --data-parallel-size be used with v0 engine
#18702 closed
May 27, 2025 -
[Bug]: Dockerfile.cpu missing libtbb-dev
#18743 closed
May 27, 2025 -
[Bug]: register model can't be found in v1 mode
#18740 closed
May 27, 2025 -
[New Model]: IDEA-Research/ChatRex-7B
#12444 closed
May 27, 2025 -
[Usage]:RuntimeError: Triton Error [CUDA]: device kernel image is invalid
#18580 closed
May 27, 2025 -
[Usage]: about flash attention
#18658 closed
May 27, 2025 -
[Bug]: lora_filesystem_resolver wont work
#18630 closed
May 26, 2025 -
[Bug]: Llama4 MoE weight_loaders Removed from Parameters After Initial Load, Causing Errors During Refitting
#17915 closed
May 26, 2025 -
[Usage]: why max-num-batched-tokens can smaller than max-model-len
#18681 closed
May 26, 2025 -
[Installation]: When to expect the release of v0.9.0
#18704 closed
May 26, 2025 -
[Bug]: --enable-prompt-tokens-details not working in V1
#16162 closed
May 26, 2025 -
[Performance]: Running llama-70b on 4 A100 40Gb
#13234 closed
May 26, 2025 -
[Bug]: GPU and CPU KV cach usage reported opposite by benchmark_throughput
#15201 closed
May 26, 2025 -
[Feature]: Is there any plans for multi loras with Qwen2.5vl ?
#18688 closed
May 26, 2025 -
[Usage]: unrecognized arguments: --enable-reasoning --reasoning-parser deepseek_r1
#18538 closed
May 26, 2025 -
[Usage]: enable_sequence_parallelism=True
#18648 closed
May 26, 2025 -
[Bug]: chunked_prefill cannot be turn off in V1 engine
#18547 closed
May 26, 2025 -
[Bug]: PixtralForConditionalGeneration is broken due to bad placeholder tokens
#18556 closed
May 25, 2025 -
[Bug]: prefix caching doesn't work on CPU vLLM
#17954 closed
May 25, 2025 -
[Bug]: Does not support video input for InternVL series models
#18381 closed
May 25, 2025
86 Issues opened by 83 people
-
[Usage]: meets gpu p2p check err when use tp=4 to launch vllm by torhcrun
#18998 opened
Jun 1, 2025 -
[Bug]: Unable to get vLLM working with RTX 5090
#18995 opened
May 31, 2025 -
[Bug]: Qwen3-4B starts on a V100 but cannot be called; as soon as a request is sent, the vLLM service shuts down
#18993 opened
May 31, 2025 -
[Usage]: Controll Deepseek R1 think or not
#18988 opened
May 31, 2025 -
[Feature]: Colocating multiple LLM engines in the same process with sleep mode.
#18975 opened
May 30, 2025 -
[Feature]: Native packages
#18963 opened
May 30, 2025 -
[Bug]: Deserialisation of the model is taking several minutes.
#18962 opened
May 30, 2025 -
[Feature]: Add Lora for ModernBERT models
#18959 opened
May 30, 2025 -
[CI Failure]: Spec Decoding - spec_decode/e2e/test_eagle_correctness.py
#18954 opened
May 30, 2025 -
[RFC]: vLLM configuration refactoring and modularization
#18953 opened
May 30, 2025 -
[Usage]: How to log stat when using AsyncLLM locally (do not based on openAI api)
#18948 opened
May 30, 2025 -
[Bug]: AsyncLLM when DP > 1, device allocation bug
#18942 opened
May 30, 2025 -
[Feature]: Can vllm support Megakernel?
#18939 opened
May 30, 2025 -
[Bug]:
#18933 opened
May 30, 2025 -
[Bug]: offline dp will stack when one dp group finish work and exit
#18932 opened
May 30, 2025 -
[Usage]: How to use Deepseek-R1-0528 with function call
#18931 opened
May 30, 2025 -
[Feature]: Include data parallel rank in Kv Events
#18924 opened
May 29, 2025 -
[Bug]: video input to vllm server encounter hanging out
#18917 opened
May 29, 2025 -
[Bug]: VLLM Docker v0.9.0 produces Runtime Error: Cuda Error on Blackwell using Qwen0.6B
#18916 opened
May 29, 2025 -
[Usage]: prompt logprobs + APC in speculative decoding
#18908 opened
May 29, 2025 -
[Bug]: vllm0.9.0 cannot load eagle30-llama3.3-70b-inst model
#18906 opened
May 29, 2025 -
[Bug]: The frequency penalty does not work when spec decoding is enabled in V1, with no warning or error
#18902 opened
May 29, 2025 -
[Bug]: Do we really need to implement additional functions for custom_allreduce to serve graph capture?
#18899 opened
May 29, 2025 -
[Feature]: Enable setting `leave=False` in `tqdm` progress bars
#18898 opened
May 29, 2025 -
[Bug]: some vllm routes can be reached without authorization
#18892 opened
May 29, 2025 -
[Bug]: After starting a model with vLLM, passing base64 values in OpenAI-format requests fails
#18890 opened
May 29, 2025 -
[Bug]: FlashMLA V1 with FP8 KV cache not yet supported!
#18887 opened
May 29, 2025 -
[Performance]: The Unstable Performance Difference between CUDA and PyTorch
#18884 opened
May 29, 2025 -
[Questions]: The problem of repeated capture of cudagraph during weight update phase?
#18877 opened
May 29, 2025 -
[Usage][V1]: How to determine the V1 engine is busy and has pending requests?
#18876 opened
May 29, 2025 -
[Usage]: Does vllm support inference or service startup of CPU small model?
#18875 opened
May 29, 2025 -
[Performance]: why the batch-embeddings inputs are separated to small single one?
#18867 opened
May 29, 2025 -
[Feature]: Vectorize `scaled_int8_quant`
#18866 opened
May 28, 2025 -
[Bug]: Image v0.9.0 Fails to Initialize on GCP instance Due to Undetected Platform
#18859 opened
May 28, 2025 -
[Bug]: Non-torch memory tracking fails to account for gpu usage of other processes
#18854 opened
May 28, 2025 -
[New Model]: ByteDance/Dolphin
#18850 opened
May 28, 2025 -
[Bug]: Help, RuntimeError: CUDA error: no kernel image is available for execution on the device
#18835 opened
May 28, 2025 -
[Bug][Regression]: Dimension out of range when using MooncakeStoreConnector
#18834 opened
May 28, 2025 -
[Bug]: Error during serialization of the model.
#18832 opened
May 28, 2025 -
[RFC]: Controlling the maximum length of the waiting queue
#18826 opened
May 28, 2025 -
[Bug]: Broken Structured Output (Guided Decoding) with Qwen3 models when `enable_thinking=False`
#18819 opened
May 28, 2025 -
[Bug]: Model fails to load in background thread in versions >0.8.5
#18816 opened
May 28, 2025 -
[Feature]: vllm/v1/attention/backends/blocksparse_attn.py
#18815 opened
May 28, 2025 -
[Bug] TP=2 fails on dual RTX 5090: TorchInductor compile error or CUDA illegal memory access (TP=1 works)
#18814 opened
May 28, 2025 -
[Usage]: Is the v0.9.0 container restricted to running only on CUDA 12.8 and above?
#18813 opened
May 28, 2025 -
[Bug]: python sampler is faster than flashinfer sampler
#18811 opened
May 28, 2025 -
[Usage]: How to release GPU resource of a reproducible LLM instance
#18806 opened
May 28, 2025 -
[Usage]: NCCL error when using tow AMD GPUs ( gfx1100 )
#18805 opened
May 28, 2025 -
[Performance]: How can i improve performance further in vllm lmcache PD Disaggregate?Plz Help Me
#18801 opened
May 28, 2025 -
[New Model]: NVIDIA-Normalized-GPT (nGPT)
#18797 opened
May 28, 2025 -
[New Model]: ByteDance-Seed/BAGEL-7B-MoT
#18793 opened
May 28, 2025 -
[Bug]: something wrong with hermes tool parser
#18791 opened
May 28, 2025 -
pip install -e . failed
#18789 opened
May 28, 2025 -
[Performance]: Falcon H1 7B seems to be significantly slower than Qwen 7B
#18785 opened
May 28, 2025 -
[Bug][V1] Structured output FSM failures should be handled gracefully without aborting requests
#18783 opened
May 27, 2025 -
[Feature]: vllm torch nightly package not in sync issues
#18772 opened
May 27, 2025 -
[Bug]: Low GPU Underutilization and Badwords Failure When Rollout n > 1
#18767 opened
May 27, 2025 -
[Bug]: [0.9.0] llama-3-8b-instruct-awq returns `name 'FusedMoEPermuteExpertsUnpermute' is not defined` error
#18761 opened
May 27, 2025 -
[Performance]: The CPU overhead gradually increases with multiple batches.
#18760 opened
May 27, 2025 -
[Usage]: How can I use spec-decoding features with multimodal model like qwen2.5vl
#18759 opened
May 27, 2025 -
[Bug]: Schema inconsistency in moe_unpermute causes runtime crash under CUDA 11.8
#18746 opened
May 27, 2025 -
[Usage]: GPU/CPU communication sanity check failed on K8S env
#18731 opened
May 27, 2025 -
[Bug]:RuntimeError: Engine core initialization failed.
#18730 opened
May 27, 2025 -
[Performance]: yarn degrades the performance of qwen3
#18728 opened
May 26, 2025 -
[Usage]: Can the continuous batching function be disabled in vllm now?
#18716 opened
May 26, 2025 -
[Feature]: Support Prometheus Metrics with P/D disagg on multi-machines
#18714 opened
May 26, 2025 -
[Bug][CI Failure] - VI Test - test_engine_core_client.py::test_kv_cache_events[True-tcp]
#18708 opened
May 26, 2025 -
[Doc]: Newest documentation for engine arguments is significantly worse than v0.8.5 and prior
#18707 opened
May 26, 2025 -
[Bug]: build source errors
#18691 opened
May 26, 2025 -
[Bug]: benchmark_serving.py cannot reach the specified generated tokens even with the flag --ignore-eos
#18687 opened
May 26, 2025 -
[Usage]:
#18679 opened
May 25, 2025
321 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
[Kernel] DeepEP dispatch-combine kernel integration
#18434 commented on
May 31, 2025 • 19 new comments -
feat: engine v1 post process sampled logprobs
#17724 commented on
May 28, 2025 • 18 new comments -
Support embedding models in V1 with a dedicated model_runner
#18015 commented on
May 30, 2025 • 15 new comments -
[v1] Hybrid Memory Allocator
#17996 commented on
Jun 1, 2025 • 14 new comments -
[Hardware][AMD] integrate aiter chunked prefill into vllm
#18596 commented on
May 31, 2025 • 13 new comments -
[V1] Support cross-layer KV sharing
#18212 commented on
May 30, 2025 • 12 new comments -
add causal-conv1d in Triton and integrate into vLLM with test code
#18218 commented on
May 30, 2025 • 12 new comments -
Add FlexAttention to V1
#16078 commented on
May 31, 2025 • 11 new comments -
[V1][P/D] XpYd based on p2p communication without cache store
#18242 commented on
May 31, 2025 • 11 new comments -
[Frontend] speed up import time of vllm.config
#18036 commented on
Jun 1, 2025 • 10 new comments -
[V1] LogitsProcessor programming model
#16728 commented on
May 27, 2025 • 9 new comments -
[KERNEL] Sampler. CUDA kernel for applying repetition penalty
#18437 commented on
May 30, 2025 • 9 new comments -
[Misc] Add fully interleaved support for multimodal 'string' content format
#14047 commented on
May 29, 2025 • 8 new comments -
[Model] enable data parallel for Llama4 vision encoder
#18368 commented on
May 31, 2025 • 8 new comments -
Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor)
#12010 commented on
May 29, 2025 • 6 new comments -
[Doc] update Contributing page's testing section
#18272 commented on
May 29, 2025 • 6 new comments -
Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node.
#17930 commented on
May 30, 2025 • 6 new comments -
[Bugfix] Fix spec decode on non-cuda platforms
#18501 commented on
May 28, 2025 • 6 new comments -
[kernel] integrate permute/unpermute kernel into deepgemm moe
#17934 commented on
May 29, 2025 • 6 new comments -
[Frontend] speed up import time of vllm.reasoning
#18236 commented on
May 28, 2025 • 5 new comments -
[Security] Prevent new imports of (cloud)pickle
#18018 commented on
May 28, 2025 • 5 new comments -
[Hardware][TPU] Initial support of model parallelism with single worker using SPMD
#18011 commented on
Jun 1, 2025 • 4 new comments -
[P/D][Core] Fix abrupt request abort
#18485 commented on
May 30, 2025 • 4 new comments -
[Misc] fix: add miss best_of param validation
#18555 commented on
May 31, 2025 • 4 new comments -
[MISC][Bugfix] Use less CPU when message queue has been empty for some time
#16226 commented on
May 29, 2025 • 4 new comments -
[Feature] support torchrun PP with concurrent requests
#18191 commented on
Jun 1, 2025 • 4 new comments -
[ROCm] Get rid of RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES
#15246 commented on
Jun 1, 2025 • 3 new comments -
[Doc] Unify structured outputs examples
#18196 commented on
May 30, 2025 • 3 new comments -
[VLM] Support HF format Phi-4-MM model
#17121 commented on
May 30, 2025 • 3 new comments -
Update common.txt
#18442 commented on
May 28, 2025 • 3 new comments -
[V1][Metrics] Add model_load_time as a log for CUDA devices
#14148 commented on
May 30, 2025 • 2 new comments -
[Feature] Expert Parallelism Load Balancer (EPLB)
#18343 commented on
May 30, 2025 • 2 new comments -
[V0][V1][Core] Add outlines integration for V1, and update V0 integration.
#15975 commented on
Jun 1, 2025 • 2 new comments -
[Misc][benchmark] add warmup; add e2el_per_concurrency and throughput; add random_output_ratio
#18475 commented on
May 29, 2025 • 2 new comments -
[CUDA] Enable full cudagraph for FlashMLA
#18581 commented on
May 30, 2025 • 2 new comments -
Add custom default max tokens for different plataforms
#18557 commented on
May 30, 2025 • 2 new comments -
[AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger
#17331 commented on
May 30, 2025 • 1 new comment -
[Bugfix][Nixl] Fix full prefix cache hit bug
#18632 commented on
May 27, 2025 • 1 new comment -
Fix links in multi-modal model contributing page
#18615 commented on
May 28, 2025 • 1 new comment -
[V1][Spec Decode][Perf] Add fused Triton kernel to reduce overhead in EAGLE spec decoding
#18221 commented on
May 30, 2025 • 1 new comment -
[WIP] [Core][P/D] CPU connector for PD disagg
#18332 commented on
May 31, 2025 • 1 new comment -
[V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics.
#18354 commented on
May 30, 2025 • 1 new comment -
[Hardware][Intel GPU] Add V1 engine support and `chunked_prefill` kernel
#14612 commented on
May 31, 2025 • 1 new comment -
[Core] Support all head sizes up to 256 with FlashAttention backend
#8910 commented on
May 28, 2025 • 0 new comments -
[Doc] update the debugging document to add more explanation on `gpu_memory_utilization` and CUDA OOM issues
#8541 commented on
May 27, 2025 • 0 new comments -
[Model] MLPSpeculator quantization support
#8476 commented on
May 27, 2025 • 0 new comments -
[Bugfix] Update grafana dashboard
#9311 commented on
May 28, 2025 • 0 new comments -
[Misc] add non cuda hf benchmark_througput
#8653 commented on
May 27, 2025 • 0 new comments -
[Misc] Add conftest plugin for applying forking decorator
#8727 commented on
May 27, 2025 • 0 new comments -
[CI/Build] Add audio+video docker tag to Dockerfile
#17974 commented on
May 31, 2025 • 0 new comments -
[Misc] Optimizing mrope calc
#13793 commented on
May 27, 2025 • 0 new comments -
[WIP][Whisper] beam search for whisper
#13758 commented on
May 28, 2025 • 0 new comments -
[Model] Support VLMs with transformers backend
#13754 commented on
May 30, 2025 • 0 new comments -
[UT][spec_decode]remove hard-dependencies of spec decode to CUDA
#13746 commented on
May 26, 2025 • 0 new comments -
Truncation support for recent Mistrals to prevent AsyncEngineDeadError on input exceeding max_model_len w/ chunked prefill
#13741 commented on
May 27, 2025 • 0 new comments -
[V1][BugFix] Raise error when selected attn backend is not supported
#13730 commented on
May 31, 2025 • 0 new comments -
[Model] GPTBigCodeForEmbedding supporting token span classification
#13684 commented on
May 28, 2025 • 0 new comments -
[Core] Add Additional Metrics to vLLM Server
#12726 commented on
May 29, 2025 • 0 new comments -
[Bugfix] Update Prometheus datasource configuration to use variable UID
#12659 commented on
May 30, 2025 • 0 new comments -
[CI/Build] Better default num jobs heuristic
#12477 commented on
May 30, 2025 • 0 new comments -
Update run_cluster.sh
#11796 commented on
May 30, 2025 • 0 new comments -
Add TTFT to offline_inference_with_prefix.py
#11428 commented on
May 30, 2025 • 0 new comments -
[Core] Support global prefix caching
#11385 commented on
May 30, 2025 • 0 new comments -
[Frontend] Add Command-R and Llama-3 chat template
#10496 commented on
May 30, 2025 • 0 new comments -
[Core/Bugfix] Per FlashInfer API changing data_type to kv_data_type for kv_cache
#10103 commented on
May 27, 2025 • 0 new comments -
[Bugfix] Generate multiple different prompts in benchmark_prefix_caching.py based on --num-prompts
#9687 commented on
May 27, 2025 • 0 new comments -
[Frontend] Add sampler_priority and repetition_penalty_range
#9485 commented on
May 27, 2025 • 0 new comments -
[Kernel] Factor registrations
#8424 commented on
May 27, 2025 • 0 new comments -
[New Model]: support Qwen3-235B-A22B-GPTQ-Int4
#18041 commented on
Jun 1, 2025 • 0 new comments -
[Feature]: Support Gemma 3 QAT series
#16856 commented on
May 31, 2025 • 0 new comments -
[Feature]: Support openai responses API interface
#14721 commented on
May 31, 2025 • 0 new comments -
[Bug]: Can't serve Qwen3-AWQ
#18156 commented on
May 31, 2025 • 0 new comments -
[Bug]: vLLM v0.8.5.post1 hanging with Llama 3.3 70b
#18260 commented on
May 31, 2025 • 0 new comments -
[Feature]: Ensure benchmark serving do not import vLLM
#14923 commented on
May 31, 2025 • 0 new comments -
[Bug]: ValueError when using Multi-Instance GPU
#17047 commented on
May 31, 2025 • 0 new comments -
[V1] Feedback Thread
#12568 commented on
May 31, 2025 • 0 new comments -
[Performance]: Performance comparison for v1 engine and v0 engine
#17540 commented on
May 31, 2025 • 0 new comments -
[Bug]: decoding output parsing error
#18376 commented on
May 31, 2025 • 0 new comments -
[Bug]: LLVM ERROR: Failed to compute parent layout for slice layout. when using fp16
#17152 commented on
May 31, 2025 • 0 new comments -
[Bug][PERF]: Qwen2.5 performance degradation 0.8.4 -> 0.8.5
#18619 commented on
May 31, 2025 • 0 new comments -
[Performance]: Stability Concerns with LLaMA-4 Models After Extended Uptime (llama-4 models stability on h100 gpus)
#16473 commented on
May 31, 2025 • 0 new comments -
[Bug]: regression from vllm==0.8.4 - Llama 4 Maverick FP8 + xgrammar crash server
#18085 commented on
May 31, 2025 • 0 new comments -
[Bug]: CPU Memory oom on 8*L40s when deploy meta-llama/Llama-4-Scout-17B-16E-Instruct
#16916 commented on
May 31, 2025 • 0 new comments -
[Bug]: vLLM does not serve text-only version of Llama4
#18022 commented on
May 31, 2025 • 0 new comments -
[Usage]: Llama4 tool parser
#16214 commented on
May 31, 2025 • 0 new comments -
[torch.compile] A simple solution to recursively compile loaded model: using phi3-small as an example
#8398 commented on
May 27, 2025 • 0 new comments -
[Benchmark] Add block_size option to benchmark_throughput.py
#8175 commented on
May 27, 2025 • 0 new comments -
[Core][Kernel][Misc] Support external swapper for vllm
#8018 commented on
May 28, 2025 • 0 new comments -
Print request metrics to stdout
#8014 commented on
May 27, 2025 • 0 new comments -
[misc] Optimize speculative decoding
#7875 commented on
May 27, 2025 • 0 new comments -
[Misc] Allow for unsigned zero NAN representation in ScalarType
#7661 commented on
May 27, 2025 • 0 new comments -
`[Core]` Added streaming support to `LLM` Class
#7648 commented on
May 27, 2025 • 0 new comments -
[WIP][SPMD] Support spec decoding
#7643 commented on
May 27, 2025 • 0 new comments -
[Model] Teleflm Support
#6822 commented on
May 28, 2025 • 0 new comments -
Prefetch all
#6817 commented on
May 27, 2025 • 0 new comments -
[New Model]: moonshotai/Kimi-Audio-7B-Instruct
#17234 commented on
Jun 1, 2025 • 0 new comments -
[Bug]:ValueError: vllm serve Qwen/Qwen2.5-VL-72B-Instruct-AWQ ERROR:The input size is not aligned with the quantized weight shape.
#13980 commented on
Jun 1, 2025 • 0 new comments -
[Feature]: Speculative decoding and Pipeline Paralelism
#14044 commented on
Jun 1, 2025 • 0 new comments -
[Bug]: [Bug]: Run vllm serve raise Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks
#14095 commented on
Jun 1, 2025 • 0 new comments -
[Usage]: How to load DeepSeek-R1-Distill-Qwen-32B model which runs as offline batch inference
#14087 commented on
Jun 1, 2025 • 0 new comments -
[Bug]: max_model_len setting fail
#14102 commented on
Jun 1, 2025 • 0 new comments -
[Bug]: cannot launch deepseek-vl2 on A100
#14103 commented on
Jun 1, 2025 • 0 new comments -
[Benchmark] fixing profling for benchmark latency
#18035 commented on
Jun 1, 2025 • 0 new comments -
[P/D] Heterogeneous TP
#18079 commented on
May 25, 2025 • 0 new comments -
[Hardware][Intel-Gaudi] t.compile optimizations
#18137 commented on
May 29, 2025 • 0 new comments -
[Bugfix] Fix Hermes tool call parser with streaming
#18220 commented on
May 25, 2025 • 0 new comments -
(deprecated) [V1] Support DP with Ray
#18233 commented on
May 30, 2025 • 0 new comments -
[Model] support dots1
#18254 commented on
May 30, 2025 • 0 new comments -
add an absolute path for run.sh
#18258 commented on
May 26, 2025 • 0 new comments -
[Kernel] Add EP support for cutlass_moe_fp4
#18281 commented on
May 31, 2025 • 0 new comments -
[P/D] Support CPU Transfer in NixlConnector
#18293 commented on
May 31, 2025 • 0 new comments -
[V1] Optimized the `determine_available_memory` method for v1
#18296 commented on
May 31, 2025 • 0 new comments -
[Frontend] Speedup frontend test
#18310 commented on
May 30, 2025 • 0 new comments -
[Model]: Fused MoE for nomic-embed-text-v2-moe
#18321 commented on
May 29, 2025 • 0 new comments -
[V1] Support MultiNode DP with Ray
#18366 commented on
May 28, 2025 • 0 new comments -
[V0] Support multiple kv connectors
#18395 commented on
May 30, 2025 • 0 new comments -
[Bugfix][P/D] Fix Preemption + Prefix Cache Bug (#92)
#18411 commented on
May 28, 2025 • 0 new comments -
[WIP] Two batch overlap
#18415 commented on
May 31, 2025 • 0 new comments -
Fixed ppc build when it runs on non-RHEL based linux distros
#18422 commented on
May 30, 2025 • 0 new comments -
[Core] Add support for sampling penalties to v1 ngram speculative decoding
#18441 commented on
May 30, 2025 • 0 new comments -
Enable CPU nightly performance benchmark and its Markdown report
#18444 commented on
May 28, 2025 • 0 new comments -
[V1] Support `LLM.apply_model`
#18465 commented on
May 27, 2025 • 0 new comments -
make TIMEOUT_KEEP_ALIVE configurable through env var
#18472 commented on
May 30, 2025 • 0 new comments -
Integrate quick allreduce and select the best allreduce implementation
#18473 commented on
May 30, 2025 • 0 new comments -
Remove Vision FA warning
#18522 commented on
May 27, 2025 • 0 new comments -
[BUGFIX] fix layout shape of moe 2stage
#18523 commented on
May 27, 2025 • 0 new comments -
[Model][Speculative Decoding] Integrate PARD into vLLM
#18541 commented on
May 28, 2025 • 0 new comments -
Sm100 blockwise fp8 swap ab
#18564 commented on
May 30, 2025 • 0 new comments -
Porting triton_kernels for FusedMoE
#18595 commented on
May 30, 2025 • 0 new comments -
[v1][KVCacheManager] Add a special KVCacheNullBlock class
#18652 commented on
May 29, 2025 • 0 new comments -
[v1] Re-init input batch for multiple kv cache groups
#18654 commented on
May 30, 2025 • 0 new comments -
[Model] Google SigLip 2
#13808 commented on
May 29, 2025 • 0 new comments -
Support w8a8 block_fp8_matmul from generated kernel
#13835 commented on
May 29, 2025 • 0 new comments -
[WIP] Fix weight loading tests
#13842 commented on
May 27, 2025 • 0 new comments -
[NVIDIA] Unify CUTLASS version in CMakelist.txt
#13846 commented on
May 28, 2025 • 0 new comments -
[Misc][V1] Enhance performance of KVCacheManager._get_cached_block
#13878 commented on
May 29, 2025 • 0 new comments -
[BugFix] Fix an Overflow Problem for Some Triton Fused MoE Configurations with large BLOCK_SIZE
#13901 commented on
May 31, 2025 • 0 new comments -
[Misc] Add JSON format logging support with `loguru`
#13920 commented on
May 30, 2025 • 0 new comments -
Support non-attention path operators in Triton
#13963 commented on
May 29, 2025 • 0 new comments -
[Distributed] Add reduce_scatter to DeviceCommunicatorBase
#14057 commented on
May 30, 2025 • 0 new comments -
Tune release tag to support release candidates
#14064 commented on
May 31, 2025 • 0 new comments -
Add CUDA kernel for per_token_group_quant_fp8
#14175 commented on
May 29, 2025 • 0 new comments -
[Model] add colqwen2_vl code & inference
#14291 commented on
May 25, 2025 • 0 new comments -
[Core] Add a level 3 sleep/wake_up that offloads tensors to disk
#14678 commented on
May 29, 2025 • 0 new comments -
[V0][Bugfix] Fix Mamba cache crashing
#15296 commented on
May 29, 2025 • 0 new comments -
fix: can not install torch+cpu for no index url
#15822 commented on
May 31, 2025 • 0 new comments -
[ROCm][V1] Changes needed for making vllm run on Fedora 41 with gtx1100
#16062 commented on
May 30, 2025 • 0 new comments -
[Draft] SnapKV
#16160 commented on
May 30, 2025 • 0 new comments -
[Bugfix][Frontend] Add missing "type":"function" in tool call streaming responses
#16346 commented on
May 28, 2025 • 0 new comments -
[Model][VLM] Add Qwen2.5-Omni model support (end-to-end full support)
#16347 commented on
Jun 1, 2025 • 0 new comments -
[CPU] V1 support for the CPU backend
#16441 commented on
May 29, 2025 • 0 new comments -
[Bugfix][Model] fix Phi3Small model only support v0
#16493 commented on
May 28, 2025 • 0 new comments -
[Bugfix] fix: close issue #16554 to make it real async
#16557 commented on
May 31, 2025 • 0 new comments -
[Kernel] Add Split-KV Attention Kernel to the triton_attn Backend
#16794 commented on
May 30, 2025 • 0 new comments -
[Model][Frontend] Adding timeseries modality support and Qwen2.5-ChatTS model support
#16852 commented on
May 29, 2025 • 0 new comments -
Enable FlashInfer V1 FP8 kv cache
#17005 commented on
May 28, 2025 • 0 new comments -
[RFC][core][V1] generalize structured output manager and backends
#17503 commented on
May 26, 2025 • 0 new comments -
[WIP][V1][Spec Decode] EAGLE tree-attention
#17560 commented on
May 27, 2025 • 0 new comments -
[NVIDIA] Add Cutlass MLA backend
#17625 commented on
May 31, 2025 • 0 new comments -
[V1] Fast decode prepare path for prepare_inputs logic
#17866 commented on
May 28, 2025 • 0 new comments -
[Feature][Quantization] MXFP4 support for MOE models
#17888 commented on
May 29, 2025 • 0 new comments -
[Bug]: always finish_reason='length' using google/gemma-2-27b-it
#13924 commented on
May 28, 2025 • 0 new comments -
[Bug]: Speculative Model from Hugging Face Repository Fails to Load (is not a file error)
#13937 commented on
May 28, 2025 • 0 new comments -
[Bug]: NixlConnector should not skip short do_remote_prefill requests in connector metadata
#18591 commented on
May 27, 2025 • 0 new comments -
[Usage]: Deciding max-num-seqs and max-num-batched-tokens for desired throughput
#16886 commented on
May 27, 2025 • 0 new comments -
[RFC]: vLLM x torch.compile caching should be opt-out by default
#16501 commented on
May 27, 2025 • 0 new comments -
[Bug]:Question about logprobs output being 0.0 when using `vllm` sampling params
#17286 commented on
May 27, 2025 • 0 new comments -
[Feature]: Data parallel inference in offline mode(based on Ray)
#14683 commented on
May 27, 2025 • 0 new comments -
[Bug]: Cannot obtain logits
#16619 commented on
May 27, 2025 • 0 new comments -
[Performance]: lmcache cannot work!
#18135 commented on
May 27, 2025 • 0 new comments -
[Feature]: Simple Data Parallelism in vLLM
#9206 commented on
May 27, 2025 • 0 new comments -
[Bug]: Quantized models - NotImplementedError: Could not run '_C::machete_prepack_B'
#16131 commented on
May 27, 2025 • 0 new comments -
[Usage] Qwen3 Usage Guide
#17327 commented on
May 27, 2025 • 0 new comments -
Migrating from `yapf` to `ruff format`
#17657 commented on
May 27, 2025 • 0 new comments -
[Bug]: ImportError: /workspace/vllm-abo/vllm/_C.abi3.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKSsb
#13608 commented on
May 27, 2025 • 0 new comments -
[Bug]: `size_k must divisible by BLOCK_SIZE_K` error when using tensor parallelism with AWQ-quantized MoE models
#17604 commented on
May 27, 2025 • 0 new comments -
[Bug]: Engine Core initialization failed. See root cause above
#17618 commented on
May 27, 2025 • 0 new comments -
[Bug]: Engine stuck with requests are blocked, running/waiting request count and KV cache usage remain constant.
#18431 commented on
May 27, 2025 • 0 new comments -
[Usage]: Is there an option to obtain attention matrices during inference, similar to the output_attentions=True parameter in the transformers package?
#7736 commented on
May 27, 2025 • 0 new comments -
[Bug]: meta-llama/Llama-3.2-90B-Vision-Instruct and Qwen/Qwen2-VL-72B-Instruct models fails with asyncio.exceptions.CancelledError when using wiki image URLs
#10904 commented on
May 27, 2025 • 0 new comments -
[Bug]: ValueError: Ray does not allocate any GPUs on the driver node. Consider adjusting the Ray placement group or running the driver on a GPU node.
#12983 commented on
May 27, 2025 • 0 new comments -
[Installation]:Error while deploying Deepseek-R1 671B with AMD 8xMi300x
#13659 commented on
May 28, 2025 • 0 new comments -
[Bug]: how to use cuda graph on vllm
#13661 commented on
May 28, 2025 • 0 new comments -
[Performance]: Regarding PD separation performance
#13816 commented on
May 28, 2025 • 0 new comments -
[Feature]: proxy load balance function use like: vllm proxy --server-address ... --server-port ...
#13861 commented on
May 28, 2025 • 0 new comments -
[New Model]: add GOT-OCR2
#13862 commented on
May 28, 2025 • 0 new comments -
[Bug]: Failed to infer device type how to solve this problem
#13865 commented on
May 28, 2025 • 0 new comments -
[New Model]: silero-vad
#13866 commented on
May 28, 2025 • 0 new comments -
[New Model]: RapidOCR
#13868 commented on
May 28, 2025 • 0 new comments -
[Bug]: V1 does not support torch compile
#13872 commented on
May 28, 2025 • 0 new comments -
[Usage]: when i deploy a model ,how to set the max input str length and the number of max input token. and the max output length??
#13874 commented on
May 28, 2025 • 0 new comments -
[Usage]: `max_num_batched_tokens` and `max_model_len`
#13875 commented on
May 28, 2025 • 0 new comments -
[New Model]: dunsloth/DeepSeek-R1-GGUF
#13877 commented on
May 28, 2025 • 0 new comments -
[Feature]: Support Deepseek's DeepGemm MoE
#13879 commented on
May 28, 2025 • 0 new comments -
[Bug]: Incorrect first_token_time and first_scheduled_time metrics results
#13883 commented on
May 28, 2025 • 0 new comments -
[Bug]: vllm 0.7.3, system gets stuck during the reasoning process
#13884 commented on
May 28, 2025 • 0 new comments -
[Feature]: Upstream flash attention to support cutlass 3.8
#13893 commented on
May 28, 2025 • 0 new comments -
[Bug]: [Qwen2.5-VL-72B-Instruct-AWQ] ERROR 02-26 05:28:06 engine.py:400] Error while deserializing header: InvalidHeaderDeserialization
#13899 commented on
May 28, 2025 • 0 new comments -
[Feature]: T5Model has no vLLM implementation
#13903 commented on
May 28, 2025 • 0 new comments -
[Bug]: Speculative Decoding Tokens not being included in Prometheus metrics
#13916 commented on
May 28, 2025 • 0 new comments -
[Bug]: The output size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
#11671 commented on
May 26, 2025 • 0 new comments -
[Bug]: v0.8.5.post1 Eagle3 broken with llama3-70b
#18452 commented on
May 26, 2025 • 0 new comments -
[Bug]: Pipeline parallelism is only supported through AsyncLLMEngine as performance will be severely degraded otherwise.
#7413 commented on
May 26, 2025 • 0 new comments -
[Bug]: Logprob values are affected by sampling parameters and are incompatible with OpenAI API
#9453 commented on
May 26, 2025 • 0 new comments -
[Usage]: deepseek v3 can not set tensor_parallel_size=32
#12256 commented on
May 26, 2025 • 0 new comments -
[Usage]: tensor-parallel-size=2,The program just kept hanging
#13273 commented on
May 26, 2025 • 0 new comments -
[V1]: Unable to serve Qwen model on V1 alpha.
#13284 commented on
May 26, 2025 • 0 new comments -
[Bug]: Chunk Prefill feature fails for ppc64le (IBM POWER)
#13387 commented on
May 26, 2025 • 0 new comments -
[Bug]: when nsight cature nvtx with PP>1, vllmWorkerProcess will unexpectedly terminate
#13482 commented on
May 26, 2025 • 0 new comments -
[Usage]: Does vllm support mix deploy on GPU+CPU?
#13517 commented on
May 26, 2025 • 0 new comments -
[Bug]: When deploying the Qwen2.5-VL-3B service, some image requests return errors.
#13657 commented on
May 26, 2025 • 0 new comments -
[Bug]: Error while running Deepseek-R1: vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly.
#13676 commented on
May 26, 2025 • 0 new comments -
[Bug]: Errors Encountered While Running Qwen/Qwen2.5-VL-72B-AWQ Inference on 8x 24G 4090 GPUs
#13677 commented on
May 26, 2025 • 0 new comments -
[Bug]: vLLM Local path model loading error
#13707 commented on
May 26, 2025 • 0 new comments -
[Bug]: V100 may can not support enable-prefix-caching
#13738 commented on
May 26, 2025 • 0 new comments -
[Bug]: Speculative Decoding Model Load Error (Qwen 14b + 0.5b)
#13759 commented on
May 26, 2025 • 0 new comments -
[Feature]: Support Python 3.13
#12083 commented on
May 25, 2025 • 0 new comments -
[Bug]: GGUF model with architecture qwen3moe is not supported yet.
#18382 commented on
May 25, 2025 • 0 new comments -
[Installation]: Hard to find right wheel files to build the release version
#18673 commented on
May 25, 2025 • 0 new comments -
[Bug]: `vllm serve` doesn't display detailed error logs when async_llm.generate raises an exception
#18393 commented on
May 25, 2025 • 0 new comments -
[Feature]: S1-32B Reasoning Parser support
#13342 commented on
May 27, 2025 • 0 new comments -
[Bug]: Mamba2 models (Bamba and Codestral Mamba) fail on ROCm
#13678 commented on
May 27, 2025 • 0 new comments -
[Bug]: v0.7.3 upgrade issue
#13712 commented on
May 27, 2025 • 0 new comments -
[Usage]: Does vLLM support deploying one model on multiple GPU types (e.g. one A100 and one H20)?
#13760 commented on
May 27, 2025 • 0 new comments -
[Bug]: Structured generation with JSON schema does not produce empty array
#13821 commented on
May 27, 2025 • 0 new comments -
[Bug]: database disk image is malformed
#13838 commented on
May 27, 2025 • 0 new comments -
[Usage]: Speculative Decoding KV Cache Generate
#13845 commented on
May 27, 2025 • 0 new comments -
[Doc]: guided grammar example lack parameter guided_decoding_backend
#13847 commented on
May 27, 2025 • 0 new comments -
[Doc]: vLLM TPU missing git clone instruction
#13854 commented on
May 27, 2025 • 0 new comments -
[Feature]: Support LoRA adapters to vision/merge modules
#17660 commented on
May 26, 2025 • 0 new comments -
[Bug]: v0.8.2, enable calculate_kv_scales, caught exception
#15973 commented on
May 26, 2025 • 0 new comments -
[RFC]: [Spec Decode] Combine Ngram and EAGLE
#18633 commented on
May 26, 2025 • 0 new comments -
[Bug]: Single-Node data parallel (--data-parallel-size=4) leads to vLLM crash
#18567 commented on
May 26, 2025 • 0 new comments -
[Bug]: crash during debugging, works OK when run from the CLI
#16006 commented on
May 26, 2025 • 0 new comments -
[Bug]: Failed to run model Qwen3-30B-A3B on DGX V100x4
#17392 commented on
May 26, 2025 • 0 new comments -
[Installation]: VLLM on ARM machine with GH200
#10459 commented on
May 26, 2025 • 0 new comments -
[Bug]: load_adapter crashes server if called when generations are in progress
#13698 commented on
May 26, 2025 • 0 new comments -
[Feature]: Implement Priority Scheduling In V1 Engine
#14002 commented on
May 26, 2025 • 0 new comments -
[Feature]: Slim Attention (lossless 2x reduction in KV cache size)
#14937 commented on
May 26, 2025 • 0 new comments -
[Bug]: Reward model usage
#12791 commented on
May 26, 2025 • 0 new comments -
[Usage]: RTX 5090 with vllm/vllm-openai docker image
#16652 commented on
May 30, 2025 • 0 new comments -
[Bug]: wake up OOM (72B model in 8*A800(40G))
#13941 commented on
May 30, 2025 • 0 new comments -
[Installation]: build on arm64 hits an error
#9964 commented on
May 30, 2025 • 0 new comments -
[Feature]: Composite model loading using `AutoWeightsLoader` for all models
#15697 commented on
May 30, 2025 • 0 new comments -
[Bug]: error while attempting to bind on address ('0.0.0.0', 8000): address already in use
#7514 commented on
May 30, 2025 • 0 new comments -
[Doc]: guided decoding is not compatible with speculative decoding, but "Compatibility Matrix" shows compatible
#12148 commented on
May 30, 2025 • 0 new comments -
[Bug]: Deepseek R1 MI300A Memory access fault
#12773 commented on
May 30, 2025 • 0 new comments -
[Bug]: vllm version 0.7.2 downloading and loading issues.
#12889 commented on
May 30, 2025 • 0 new comments -
[RFC]: Introduce a Triton-only Transformer Execution Path in vLLM
#13319 commented on
May 30, 2025 • 0 new comments -
[Bug]: ValueError: Unsupported FA version: None on V100 and V1 engine
#13788 commented on
May 30, 2025 • 0 new comments -
[Feature]: Pin vLLM process to the right NUMA Region
#13855 commented on
May 30, 2025 • 0 new comments -
[Bug]: Architectures DeepseekV3ForCausalLM can't deploy on 2080ti
#14016 commented on
May 30, 2025 • 0 new comments -
[Bug]: Does the current version 0.7.3 support installing vllm-flash-attn or flash-attn? After installation, an error occurred when starting the container.
#14018 commented on
May 30, 2025 • 0 new comments -
vLLM inference performance on a 4090 machine is weaker than SGLang performance!
#14021 commented on
May 30, 2025 • 0 new comments -
[Bug]: EP enablement and chunked-prefill-enablement
#14032 commented on
May 30, 2025 • 0 new comments -
[Feature]: Consolidate AITER env flags
#18367 commented on
May 30, 2025 • 0 new comments -
[Bug]: CPU core at 100%
#16968 commented on
May 29, 2025 • 0 new comments -
[Bug]: 100% CPU usage when idle
#16660 commented on
May 29, 2025 • 0 new comments -
[Bug]: vLLM hangs forever on waiting engine process to start
#17676 commented on
May 29, 2025 • 0 new comments -
[Bug]: vLLM server hangs and timeouts after initial requests
#17972 commented on
May 29, 2025 • 0 new comments -
[Usage]: How to maintain per-sequence state in custom LogitsProcessor?
#14078 commented on
May 31, 2025 • 0 new comments -
[Performance]: Low GPU Utilization (70%) for ViT+Qwen2 VLM Model.
#18392 commented on
May 31, 2025 • 0 new comments -
[Bug]: AttributeError: 'MultiprocExecutor' object has no attribute 'workers' when VLLM_USE_V1=1 on the ROCm platform serving deepseek-r1 671B
#17533 commented on
May 30, 2025 • 0 new comments -
[Bug]: SamplingParams() use_beam_search error
#18231 commented on
May 30, 2025 • 0 new comments -
[Misc]: CMake Clean-up / Refactor Tasks
#9129 commented on
May 30, 2025 • 0 new comments -
[Feature]: Inflight BNB quantization for Mixtral models
#17199 commented on
May 30, 2025 • 0 new comments -
[Bug]: Mistral streaming tool parser fails to parse integer tool argument
#13622 commented on
May 30, 2025 • 0 new comments -
[Roadmap] vLLM Roadmap Q2 2025
#15735 commented on
May 30, 2025 • 0 new comments -
[Bug]: RuntimeError: ('Quantization scheme is not supported for ', 'the current GPU. Min capability: 80. ', 'Current capability: 75.')
#14040 commented on
May 30, 2025 • 0 new comments -
[RFC]: Enabling Suffix Decoding, LSTM Speculator, Sequence Parallelism from Arctic Inference
#18037 commented on
May 30, 2025 • 0 new comments -
[RFC]: Blackwell Enablement for vLLM (SM100)
#18153 commented on
May 30, 2025 • 0 new comments -
[Bug]: vLLM 0.8.4 started with Ray, and Ray's dashboard fails to start
#16779 commented on
May 30, 2025 • 0 new comments -
[Usage]: Regex Structured Output Became Very Slow
#18546 commented on
May 30, 2025 • 0 new comments -
[Bug]: Qwen2.5-VL Series Randomly Crashes with Pipeline Parallel
#17351 commented on
May 30, 2025 • 0 new comments -
[Bug]: Clarification regarding bug inside vllm-flash-attn vision module
#18324 commented on
May 30, 2025 • 0 new comments -
[RFC]: hybrid dtype: float32 for weights and activation, float16 or bfloat16 for attention.
#18342 commented on
May 30, 2025 • 0 new comments -
[Usage]: How to use the appropriate --gpu-memory-utilization
#18582 commented on
May 30, 2025 • 0 new comments -
[RFC]: Data Parallel Attention and Expert Parallel MoEs
#16037 commented on
May 30, 2025 • 0 new comments -
[Bug]: Qwen3 FP8 on 0.8.5: type fp8e4nv not supported in this architecture.
#17581 commented on
May 30, 2025 • 0 new comments -
Failed to find C compiler. Please specify via CC environment variable
#2997 commented on
May 30, 2025 • 0 new comments -
[Bug]: After converting InternVL3-8B to the Hugging Face (HF) format, vLLM fails to launch and throws the error: ValueError: 'limit_mm_per_prompt' is only supported for multimodal models.
#17801 commented on
May 28, 2025 • 0 new comments -
[Bug]: Difference in Logprobs between vllm and transformers
#18352 commented on
May 28, 2025 • 0 new comments -
[Bug]: Killing local vLLM worker processes in multiproc_worker_utils.py
#18577 commented on
May 28, 2025 • 0 new comments -
[Feature]: obtain logits
#11397 commented on
May 28, 2025 • 0 new comments -
[Bug]: Large-scale vLLM offline inference fails to start due to port conflicts.
#14919 commented on
May 28, 2025 • 0 new comments -
[Bug]: vLLM lacks eviction policy for MooncakeStore
#18348 commented on
May 28, 2025 • 0 new comments -
[Bug]: Host CPU Docker image on Docker Hub
#18468 commented on
May 28, 2025 • 0 new comments -
[Installation]: deployment failure on Kubernetes with CPU device (testing).
#17187 commented on
May 28, 2025 • 0 new comments -
[RFC]: Deprecating vLLM V0
#18571 commented on
May 28, 2025 • 0 new comments -
[Bug]: RuntimeError: CUDA error: no kernel image is available for execution on the device
#5547 commented on
May 28, 2025 • 0 new comments -
[Bug]: Qwen2.5-VL AWQ/GPTQ RuntimeError: CUDA error: an illegal memory access was encountered 0.8.5+
#17663 commented on
May 28, 2025 • 0 new comments -
[Bug]: Inference fails on Apple silicon due to (distributed) networking error?
#18362 commented on
May 28, 2025 • 0 new comments -
invalid conversion from ‘int’ to ‘CUresult’ {aka ‘cudaError_enum’}
#17931 commented on
May 28, 2025 • 0 new comments -
[Bug]: RuntimeError on RTX 5090: "no kernel image is available for execution on the device
#16901 commented on
May 28, 2025 • 0 new comments -
[Bug]: 0.8.5.post1 CUDA error
#17813 commented on
May 28, 2025 • 0 new comments -
[Bug]: When I use llmcompressor to quantize the llama3 70b model to int8-a8w8, it shows ValueError: Failed to invert hessian due to numerical instability.
#11064 commented on
May 28, 2025 • 0 new comments -
[New Model]: NV-Embed-v2
#12137 commented on
May 28, 2025 • 0 new comments -
[Bug]: There is no module or parameter named '_orig_mod' in Qwen2ForCausalLM
#12783 commented on
May 28, 2025 • 0 new comments -
[Bug]: Deepseek-R1 performance issue on 2*8*H100
#13066 commented on
May 28, 2025 • 0 new comments -
[Installation]: Two CPU-only hosts installed.
#13654 commented on
May 28, 2025 • 0 new comments -
[Bug]: [v0.8.4][Critical] Tools calling broken: xgrammar rejects minItems in JSON Schema, blocking agent functionality
#16880 commented on
May 29, 2025 • 0 new comments -
[RFC]: AWS Neuron 2.23 NxD Inference with vLLM V0
#15970 commented on
May 29, 2025 • 0 new comments -
[Bug]: tool_calls.id is Missing in Streaming Responses (stream=true) but Present in Non-Streaming Responses
#18412 commented on
May 29, 2025 • 0 new comments -
[Bug]: Engine V1: when loading two models onto the same GPU, the second model requires more memory allocation than the first
#14376 commented on
May 29, 2025 • 0 new comments -
[Bug]: available VRAM calculation bug in V1
#17979 commented on
May 29, 2025 • 0 new comments -
[Feature]: Add OpenTelemetry API to v1
#17794 commented on
May 29, 2025 • 0 new comments -
[New Model]: Multimodal Embedding Model GME.
#16406 commented on
May 29, 2025 • 0 new comments -
[Bug]: vLLM serve `google/gemma-3-1b-it` with version `0.8.5` interrupted `SIGTERM`
#17386 commented on
May 29, 2025 • 0 new comments -
[Feature]: Qwen 3 MoE Lora adapter support.
#18120 commented on
May 29, 2025 • 0 new comments -
[Bug]: I'm trying to run Pixtral-Large-Instruct-2411 using vllm, following the documentation at https://huggingface.co/mistralai/Pixtral-Large-Instruct-2411, but I encountered an error.
#10512 commented on
May 29, 2025 • 0 new comments -
[Bug]: Problems with releasing memory after starting the vllm container
#11902 commented on
May 29, 2025 • 0 new comments -
[Bug]: AssertionError in Sampler with Prefix Caching and Prompt Logprobs Enabled.
#13105 commented on
May 29, 2025 • 0 new comments -
[Bug]: Speculative Decoding doesn't work with Ray compiled DAG and SPMD
#13682 commented on
May 29, 2025 • 0 new comments -
[Bug]: using v1 AsyncLLMEngine, signal only works in the main thread of the main interpreter; v0 does not have this problem.
#13806 commented on
May 29, 2025 • 0 new comments -
[Usage]: I want to be able to run Qwen2.5-7B on an RTX 4060
#13882 commented on
May 29, 2025 • 0 new comments -
[Bug]: When deploying LLM with Docker, the following error occurs: RuntimeError: Failed to infer device type
#13946 commented on
May 29, 2025 • 0 new comments -
[Bug]: vLLM crashes when prefix caching is enabled
#13954 commented on
May 29, 2025 • 0 new comments -
[Bug]: Llama chat template cannot process tool_calls=[] in previous messages
#13978 commented on
May 29, 2025 • 0 new comments -
[Feature]: The PyTorch version bundled with vLLM 0.7.3 is too old and does not support the NVIDIA 5080
#13999 commented on
May 29, 2025 • 0 new comments -
[Bug]: Add TPU support for gemma-3-4b-it and gemma-3-27b-it
#16521 commented on
May 29, 2025 • 0 new comments