Insights: vllm-project/vllm
Overview
163 Pull requests merged by 77 people
-
[Core] Remove prompt string from engine core data structures
#17214 merged
Apr 26, 2025 -
[CI/test] Fix Eagle Correctness Test
#17209 merged
Apr 26, 2025 -
[BugFix] Avoid race conditions in zero-copy tensor transmission
#17203 merged
Apr 26, 2025 -
[V1][Metrics] Allow V1 AsyncLLM to use custom logger
#14661 merged
Apr 26, 2025 -
[ROCm][Misc] Follow-ups for Skinny Gemms on ROCm.
#17011 merged
Apr 26, 2025 -
Allocate kv_cache with stride order
#16605 merged
Apr 26, 2025 -
[Minor][Models] Fix Return Types of Llama & Eagle
#17220 merged
Apr 26, 2025 -
[Doc] Minor fix for the vLLM TPU setup page
#17206 merged
Apr 26, 2025 -
[Minor][Spec Decode] Add use_eagle to SpeculativeConfig
#17213 merged
Apr 26, 2025 -
[doc] add Anything LLM integration
#17216 merged
Apr 26, 2025 -
[MISC][AMD] Add unused annotation to rocm kernel file
#17097 merged
Apr 26, 2025 -
[Bugfix] [pytorch] Patch AOTAutogradCache._get_shape_env
#17142 merged
Apr 26, 2025 -
[v1] [P/D] Adding LMCache KV connector for v1
#16625 merged
Apr 26, 2025 -
[AMD][FP8][BugFix] Remove V1 check in arg_utils.py for FP8 since it is not necessary
#17215 merged
Apr 26, 2025 -
[Bugfix] gemma[2,3] interleaved attention when sliding window is disabled
#17180 merged
Apr 26, 2025 -
[Misc] Refine ray_serve_deepseek example
#17204 merged
Apr 25, 2025 -
[V1][Spec Decode] EAGLE-3 Support
#16937 merged
Apr 25, 2025 -
[BugFix][Frontend] Fix LLM.chat() tokenization
#16081 merged
Apr 25, 2025 -
Fix Python packaging edge cases
#17159 merged
Apr 25, 2025 -
[Bugfix] Fix hybrid model tests
#17182 merged
Apr 25, 2025 -
[V1] Move usage stats to worker and start logging TPU hardware
#16211 merged
Apr 25, 2025 -
[Security] Use safe serialization and fix zmq setup for mooncake pipe
#17192 merged
Apr 25, 2025 -
[Misc] Inline Molmo requirements
#17190 merged
Apr 25, 2025 -
[doc] update wrong hf model links
#17184 merged
Apr 25, 2025 -
Use Transformers helper get_text_config() instead of checking for text_config
#17105 merged
Apr 25, 2025 -
Bump Transformers to 4.51.3
#17116 merged
Apr 25, 2025 -
[Bugfix] Fix Mistral ChatCompletionRequest Body Exception
#16769 merged
Apr 25, 2025 -
[Bugfix] Fix mistral model tests
#17181 merged
Apr 25, 2025 -
[Doc] Move todo out of beam search docstring
#17183 merged
Apr 25, 2025 -
[Doc] Add two links to disagg_prefill.md
#17168 merged
Apr 25, 2025 -
Only turn on FastIncrementalDetokenizer when tokenizers >= 0.21.1
#17158 merged
Apr 25, 2025 -
[Doc] Add headings to improve gptqmodel.md
#17164 merged
Apr 25, 2025 -
[Quantization][FP8] Add support for FP8 models with input_scale for output projection and QK quantization
#15734 merged
Apr 25, 2025 -
[Bugfix] remove fallback in guided_json (int range, patterns)
#16725 merged
Apr 25, 2025 -
[Perf] Optimize rotary_emb implementation to use Triton operator for improved inference performance
#16457 merged
Apr 25, 2025 -
[Misc] Benchmark Serving Script Support Appending Results
#17028 merged
Apr 25, 2025 -
[Bugfix][Misc] Use TritonPlaceholderModule to defensively import triton
#15099 merged
Apr 25, 2025 -
[Misc] Clean up redundant code in uniproc_executor.py
#16762 merged
Apr 25, 2025 -
Move missed SchedulerConfig args into scheduler config group in EngineArgs
#17131 merged
Apr 25, 2025 -
[Docs] Fix True->true in supported_models.md
#17141 merged
Apr 25, 2025 -
[Doc] V1 : Update LoRA status
#17133 merged
Apr 25, 2025 -
fix float16 support for kimi-vl
#17156 merged
Apr 25, 2025 -
[Attention] FA3 decode perf improvement - single mma warp group support for head dim 128
#16864 merged
Apr 25, 2025 -
[FEAT] [ROCm]: AITER Fused MOE V1 Support
#16752 merged
Apr 25, 2025 -
Use custom address for listening socket
#15988 merged
Apr 25, 2025 -
Better error message for missing mistral params.json
#17132 merged
Apr 24, 2025 -
[Misc] Add example to run DeepSeek with Ray Serve LLM
#17134 merged
Apr 24, 2025 -
Add chat template for Llama 4 models
#16428 merged
Apr 24, 2025 -
Add collective_rpc to llm engine
#16999 merged
Apr 24, 2025 -
[Docs] Generate correct github links for decorated functions
#17125 merged
Apr 24, 2025 -
Improve configs - LoRAConfig + PromptAdapterConfig
#16980 merged
Apr 24, 2025 -
Add :markdownhelp: to EngineArgs docs so markdown docstrings render properly
#17124 merged
Apr 24, 2025 -
Molmo Requirements
#17026 merged
Apr 24, 2025 -
Fix pip command for existing torch installation in docs
#17059 merged
Apr 24, 2025 -
Updating Buildkite job for IBM Power
#17111 merged
Apr 24, 2025 -
[CI] Add automation for the tool-calling github label
#17118 merged
Apr 24, 2025 -
[V1][Spec Decoding] Add num_drafts and num_accepted_tokens_per_position metrics
#16665 merged
Apr 24, 2025 -
[Misc] refactor example series - structured outputs
#17040 merged
Apr 24, 2025 -
Add missing rocm_skinny_gemms kernel test to CI
#17060 merged
Apr 24, 2025 -
[Frontend] Use matryoshka_dimensions to control the allowed output dimensions.
#16970 merged
Apr 24, 2025 -
Improve static type checking in LoRAModelRunnerMixin
#17104 merged
Apr 24, 2025 -
[Misc] Remove OLMo2 config copy
#17066 merged
Apr 24, 2025 -
[V1][PP] Optimization: continue scheduling prefill chunks
#17080 merged
Apr 24, 2025 -
Fix OOT registration test
#17099 merged
Apr 24, 2025 -
Simplify TokenizerGroup
#16790 merged
Apr 24, 2025 -
Disable enforce_eager for V1 TPU sampler and structured output tests
#17016 merged
Apr 24, 2025 -
[Chore] Remove Sampler from Model Code
#17084 merged
Apr 24, 2025 -
Add docs for runai_streamer_sharded
#17093 merged
Apr 24, 2025 -
[doc] update to hyperlink
#17096 merged
Apr 24, 2025 -
[V1] Update structured output
#16812 merged
Apr 24, 2025 -
[Bugfix][Core] add seq_id_to_seq_group clearing to avoid memory leak when s…
#16472 merged
Apr 24, 2025 -
Addendum Fix to support FIPS enabled machines with MD5 hashing
#17043 merged
Apr 24, 2025 -
More informative error when using Transformers backend
#16988 merged
Apr 24, 2025 -
[Bugfix] Enable V1 usage stats
#16986 merged
Apr 24, 2025 -
[Minor] Use larger batch sizes for A100/B100/B200/MI300x
#17073 merged
Apr 24, 2025 -
[Quantization] Add prefix for CommandA quantized model
#17017 merged
Apr 24, 2025 -
[CI/Build] workaround for CI build failure
#17070 merged
Apr 23, 2025 -
[V1][Spec Decode] Always use argmax for sampling draft tokens
#16899 merged
Apr 23, 2025 -
[BugFix][V1] Fix int32 token index overflow when preparing input ids
#16806 merged
Apr 23, 2025 -
[Frontend] Support guidance:no-additional-properties for compatibility with xgrammar
#15949 merged
Apr 23, 2025 -
Use @property and private field for data_parallel_rank_local
#17053 merged
Apr 23, 2025 -
CacheConfig.block_size should always be int when used
#17052 merged
Apr 23, 2025 -
Improve Transformers backend model loading QoL
#17039 merged
Apr 23, 2025 -
[CI] Update structured-output label automation
#17055 merged
Apr 23, 2025 -
Ensure that pid passed to kill_process_tree is int for mypy
#17051 merged
Apr 23, 2025 -
[Doc] Add top anchor and a note to quantization/bitblas.md
#17042 merged
Apr 23, 2025 -
Categorize tests/kernels/ based on kernel type
#16799 merged
Apr 23, 2025 -
Mistral-format support for compressed-tensors
#16803 merged
Apr 23, 2025 -
[CI] Run v1/test_serial_utils.py in CI
#16996 merged
Apr 23, 2025 -
[Bugfix] Fix AssertionError: skip_special_tokens=False is not supported for Mistral tokenizers
#16964 merged
Apr 23, 2025 -
[Misc] Improve readability of get_open_port function.
#17024 merged
Apr 23, 2025 -
[BugFix] llama4 fa3 fix - RuntimeError: scheduler_metadata must have shape (metadata_size)
#16998 merged
Apr 23, 2025 -
[V1] Avoid socket errors during shutdown when requests are in flight
#16807 merged
Apr 23, 2025 -
[Bugfix] Triton FA function takes no keyword arguments
#16902 merged
Apr 23, 2025 -
[doc] add download path tips
#17013 merged
Apr 23, 2025 -
[INTEL-HPU][v0] Port delayed sampling to upstream
#16949 merged
Apr 23, 2025 -
[misc] tune some env vars for GB200
#16992 merged
Apr 23, 2025 -
Revert "[Misc] Add S3 environment variables for better support of MinIO."
#17021 merged
Apr 23, 2025 -
[BugFix] Revert ROCm Custom Paged Attention Env Flag Check
#17022 merged
Apr 23, 2025 -
[V1][DP] More robust DP/EP dummy request coordination
#16277 merged
Apr 23, 2025 -
[Kernel][ROCM] Upstream prefix prefill speed up for vLLM V1
#13305 merged
Apr 23, 2025 -
Add Dockerfile to build vLLM against torch nightly
#16936 merged
Apr 23, 2025 -
[Bugfix] validate urls object for multimodal content parts
#16990 merged
Apr 23, 2025 -
[Core][V1][TPU] Enable structured decoding on TPU V1
#16499 merged
Apr 23, 2025 -
[BugFix] Remove default multiproc executor collective_rpc timeout
#17000 merged
Apr 22, 2025 -
Fencing Kernels Tests for enabling on AMD
#16929 merged
Apr 22, 2025 -
Add assertion for no objects while hashing hf_config
#16930 merged
Apr 22, 2025 -
[FEAT][ROCm]: Support AITER MLA
#15893 merged
Apr 22, 2025 -
[frontend] enhance tool_calls type check
#16882 merged
Apr 22, 2025 -
[Misc] Add S3 environment variables for better support of MinIO.
#16977 merged
Apr 22, 2025 -
[BugFix] Pass in correct VLLM config in FlashInfer backend (#13207)
#16973 merged
Apr 22, 2025 -
Improve configs - SpeculativeConfig
#16971 merged
Apr 22, 2025 -
[Bugfix] Fix distributed bug again in Qwen2.5-VL & Qwen2.5-Omni
#16974 merged
Apr 22, 2025 -
[Misc] refactor example series
#16972 merged
Apr 22, 2025 -
[FEAT][ROCm] Integrate Paged Attention Kernel from AITER
#15001 merged
Apr 22, 2025 -
[Doc] Improve documentation for multimodal CLI args
#16960 merged
Apr 22, 2025 -
[BugFix] Fix incremental detokenization perf issue
#16963 merged
Apr 22, 2025 -
[Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS
#6036 merged
Apr 22, 2025 -
[V1] Remove pre-allocation for KV cache
#16941 merged
Apr 22, 2025 -
[Model] Use autoweightloader for mamba
#16950 merged
Apr 22, 2025 -
[Bugfix] Fix the issue where llm.generate cannot be called repeatedly after setting GuidedDecodingParams
#16767 merged
Apr 22, 2025 -
[Perf] Optimize _update_states for GPU model runner
#16910 merged
Apr 22, 2025 -
[Doc] Update ai_accelerator/hpu-gaudi.inc.md
#16956 merged
Apr 22, 2025 -
[Bugfix] Fix f-string for Python 3.9-3.11
#16962 merged
Apr 22, 2025 -
Support S3 Sharded loading with RunAI Model Streamer
#16317 merged
Apr 22, 2025 -
[Performance][ROCm] Add skinny gemms for unquantized linear on ROCm
#15830 merged
Apr 22, 2025 -
[V1] Remove additional_config check
#16710 merged
Apr 22, 2025 -
[Kernel] Add expert_map support to Cutlass FP8 MOE
#16861 merged
Apr 22, 2025 -
[Misc] Remove the chunked prefill warning for LoRA
#16925 merged
Apr 22, 2025 -
[ROCm] Add aiter tkw1 kernel for Llama4 fp8
#16727 merged
Apr 22, 2025 -
[Bugfix]: fix issue with n>1 sampling on v1 requests overriding each other
#16863 merged
Apr 22, 2025 -
[BugFix][Spec Decode] No in-place update to draft probs
#16952 merged
Apr 22, 2025 -
[Doc] Remove unnecessary V1 flag
#16924 merged
Apr 22, 2025 -
[TPU][V1] Enable Top-P
#16843 merged
Apr 22, 2025 -
[V1] V1 FlashInfer Attention
#16684 merged
Apr 22, 2025 -
[TPU][V1] Capture multimodal encoder during model compilation
#15051 merged
Apr 22, 2025 -
Update Qwen1.5-MoE-W4A16-compressed-tensors.yaml
#16946 merged
Apr 22, 2025 -
[TPU][V1] Implicitly adjust page size when there's SMEM OOM
#16871 merged
Apr 21, 2025 -
[V1][Spec Decode] Handle draft tokens beyond max_model_len
#16087 merged
Apr 21, 2025 -
[Core] Speed up decode by removing synchronizing operation in sampler
#16436 merged
Apr 21, 2025 -
[Doc] mention how to install in CPU editable mode
#16923 merged
Apr 21, 2025 -
[doc] install required python3-dev apt package
#16888 merged
Apr 21, 2025 -
[XPU][Bugfix] minor fix for XPU
#15591 merged
Apr 21, 2025 -
Raise error for data-parallel with benchmark_throughput
#16737 merged
Apr 21, 2025 -
[Bugfix] Fix GLM rotary_dim issue and support v1
#16912 merged
Apr 21, 2025 -
[Misc] Refactor platform to get device specific stream and event
#14411 merged
Apr 21, 2025 -
[Misc] fix collect_env version parse
#15267 merged
Apr 21, 2025 -
Restore buffers when wake up from level 2 sleep (#16564)
#16889 merged
Apr 21, 2025 -
[Doc] Split dummy_processor_inputs() in Multimodal Docs
#16915 merged
Apr 21, 2025 -
[Bugfix] Fix distributed bug in Qwen2.5-VL & Qwen2.5-Omni
#16907 merged
Apr 21, 2025 -
[CI/CD][V1] Add spec decode tests to CI
#16900 merged
Apr 21, 2025 -
[Bugfix] Fix v1/spec_decode/test_ngram.py
#16895 merged
Apr 21, 2025 -
[easy] Pass compile_fx only the config patches
#16845 merged
Apr 20, 2025 -
Improve configs - CacheConfig
#16835 merged
Apr 20, 2025 -
Serialize tensors using int8 views
#16866 merged
Apr 19, 2025 -
Log how much time loading a compiled artifact takes
#16848 merged
Apr 19, 2025 -
[doc] update hyperlink
#16877 merged
Apr 19, 2025 -
[VLM] Clean up models
#16873 merged
Apr 19, 2025 -
[Model] Qwen2.5-Omni Cleanup
#16872 merged
Apr 19, 2025 -
[Model] Refactor Phi-4-multimodal to use merged processor and support V1
#15477 merged
Apr 19, 2025 -
[V1][Misc] Stop updating prefix cache stats when logs_stats is disabled
#16460 merged
Apr 19, 2025 -
[Misc] Benchmarks for audio models
#16505 merged
Apr 19, 2025
84 Pull requests opened by 64 people
-
[Bugfix] Fix Qwen2.5-Omni M-RoPE position ids generation
#16878 opened
Apr 19, 2025 -
[Perf] Optimize MRotaryEmbedding::get_input_positions performance by numba
#16881 opened
Apr 19, 2025 -
Added support for HermesToolParser for models without special tokens
#16890 opened
Apr 20, 2025 -
Profiling to find the bottleneck of running `vllm --version`
#16891 opened
Apr 20, 2025 -
[Model] Include extra module from sentence transformer
#16898 opened
Apr 21, 2025 -
[Bugfix] Fix the missing '}' issue for nested object parameters in stream function call.
#16919 opened
Apr 21, 2025 -
[Bugfix] Fix layer KV cache API not triggered with direct call enabled
#16921 opened
Apr 21, 2025 -
Add docker to build vllm against torch nightly
#16935 opened
Apr 21, 2025 -
[Misc] Add DeepSeek deployment example
#16938 opened
Apr 21, 2025 -
[Model] Refactor Mamba2 SSD to improve chunked prefill performance
#16942 opened
Apr 21, 2025 -
[Quantization] Quark MXFP4 format loading
#16943 opened
Apr 21, 2025 -
[Misc] Replace `cuda` hard code with `current_platform`
#16983 opened
Apr 22, 2025 -
[Hardware][TPU][V1] Better tpu multilora compilation
#16989 opened
Apr 22, 2025 -
Add squash option to container image build commands
#16991 opened
Apr 22, 2025 -
[RFC] per module sharded weight tagging
#17001 opened
Apr 22, 2025 -
[ROCm][Kernel][V1] Enable AMD Radeon GPU Custom Paged Attention on v1
#17004 opened
Apr 22, 2025 -
Enable FlashInfer V1 FP8 kv cache
#17005 opened
Apr 22, 2025 -
Simplify (and fix) passing of guided decoding backend options
#17008 opened
Apr 22, 2025 -
[V1][Metrics] Add API for accessing in-memory Prometheus metrics
#17010 opened
Apr 22, 2025 -
[CI] Prune down lm-eval small tests
#17012 opened
Apr 22, 2025 -
[INTEL_HPU][v0] Enable spec decode on HPU
#17014 opened
Apr 23, 2025 -
[WIP][Attention] Update FlashMLA
#17027 opened
Apr 23, 2025 -
[Frontend] Add /classify endpoint
#17032 opened
Apr 23, 2025 -
Move V1 into regular `mypy` call
#17044 opened
Apr 23, 2025 -
[Core] Prevent side-channel attacks via cache salting
#17045 opened
Apr 23, 2025 -
[ROCm] default v1 args for mi300x
#17046 opened
Apr 23, 2025 -
[Misc] Make cached tokenizer pickle-compatible
#17048 opened
Apr 23, 2025 -
Fix: Python package installation for opentelemetry
#17049 opened
Apr 23, 2025 -
Fix setuptools-scm being unable to detect version for workspace
#17050 opened
Apr 23, 2025 -
Add option to use torch._inductor.standalone_compile
#17057 opened
Apr 23, 2025 -
[Docs] Propose a deprecation policy for the project
#17063 opened
Apr 23, 2025 -
[TPU][V1][CI] Set `VLLM_XLA_CACHE_PATH=` to avoid disk-full error
#17064 opened
Apr 23, 2025 -
[Kernel][Hardware][AMD] Bf16 mfma opt for ROCm skinny GEMMs
#17071 opened
Apr 23, 2025 -
[TPU][V1] Add support for top-logprobs
#17072 opened
Apr 23, 2025 -
[Bugfix] Fix Gemma3 multimodal placeholder replacement
#17074 opened
Apr 23, 2025 -
Introduce PaddingConfig to combine GPU cudagraph_capture_sizes and TPU num_tokens_paddings
#17081 opened
Apr 23, 2025 -
Fix `numel()` downcast in vllm/csrc/moe/moe_align_sum_kernels.cu +2
#17082 opened
Apr 23, 2025 -
Fix `numel()` downcast in vllm/csrc/quantization/fused_kernels/fused_layernorm_dynamic_per_token_quant.cu +2
#17083 opened
Apr 23, 2025 -
[V1] Add `structural_tag` support using xgrammar
#17085 opened
Apr 24, 2025 -
[BUGFIX] use random for NONE_HASH only when PYTHONHASHSEED not set
#17088 opened
Apr 24, 2025 -
[Bugfix] Add contiguous call inside rope kernel wrapper
#17091 opened
Apr 24, 2025 -
[Bugfix] fix phi4-mini tool call parse in streaming mode
#17094 opened
Apr 24, 2025 -
[CI][UT] Compat with cuda and npu
#17100 opened
Apr 24, 2025 -
Update test_flash_attn.py
#17102 opened
Apr 24, 2025 -
[CI/Build] Add retry mechanism for add-apt-repository
#17107 opened
Apr 24, 2025 -
[FEAT] [ROCm]: Add AITER CK 2 Stages MoE support
#17110 opened
Apr 24, 2025 -
[Feat]: Add support for Dynamic Quant 4 bit CPU kleidiai kernels
#17112 opened
Apr 24, 2025 -
Fix static typing issues in `v1/attention`
#17113 opened
Apr 24, 2025 -
Enabling multi-group kernel tests.
#17115 opened
Apr 24, 2025 -
[VLM] Support HF format Phi-4-MM model
#17121 opened
Apr 24, 2025 -
[Misc]: Enable memory usage logging for vLLM GPU worker
#17122 opened
Apr 24, 2025 -
Benchmark script for fp8 vs bf16 gemm
#17126 opened
Apr 24, 2025 -
Improve configs - `ModelConfig`
#17130 opened
Apr 24, 2025 -
[Docs] Update structured output doc for V1
#17135 opened
Apr 24, 2025 -
[Misc] Only import amdsmi and _rocm_C on rocm platform
#17136 opened
Apr 24, 2025 -
[V1][Spec Decode] Make eagle compatible with prefix caching.
#17137 opened
Apr 24, 2025 -
[easy] Fix logspam on PiecewiseBackend errors
#17138 opened
Apr 24, 2025 -
[ROCm][FP8][Kernel] FP8 quantization fused into Custom Paged Attention
#17139 opened
Apr 24, 2025 -
[Kernel] FP8 quantization fused into V1 Triton Attention
#17143 opened
Apr 24, 2025 -
[Frontend][TPU] Enforce user input key args to reduce chance of large performance degradation
#17145 opened
Apr 24, 2025 -
[Model] Mamba2 causal conv1d Refactor to Split Prefill and Decode Requests for Corresponding Kernels
#17146 opened
Apr 25, 2025 -
Add xLAM tool parser support
#17148 opened
Apr 25, 2025 -
[Misc] Add gemma3 chat template with pythonic-style function calling
#17149 opened
Apr 25, 2025 -
[Hardware][Power] Enable compressed tensor W8A8 INT8 quantization for POWER
#17153 opened
Apr 25, 2025 -
[Bugfix] Modifications to error handling of multiple vllm api endpoints
#17165 opened
Apr 25, 2025 -
[CI] Add mteb testing to test the accuracy of the embedding model
#17175 opened
Apr 25, 2025 -
Add option "--expand-tools-even-if-tool-choice-none"
#17177 opened
Apr 25, 2025 -
[Bugfix] support local dataset path in benchmark_serving
#17179 opened
Apr 25, 2025 -
[Hardware][Intel-Gaudi] Update hpu-extension and update bucketing system for HPU device
#17186 opened
Apr 25, 2025 -
[V1] Remove num_input_tokens from attn_metadata
#17193 opened
Apr 25, 2025 -
[WIP][Bugfix] Fix 'MistralTokenizer' object has no attribute 'init_kwargs'
#17195 opened
Apr 25, 2025 -
[Bugfix] Fix Lora Name Parsing
#17196 opened
Apr 25, 2025 -
[Security] Don't bind tcp zmq socket to all interfaces
#17197 opened
Apr 25, 2025 -
[WIP] Support vLLM in transformers hybrid attention implementation
#17198 opened
Apr 25, 2025 -
[Hardware][Apple] Allows VLLM_TARGET_DEVICE=empty on MacOs
#17200 opened
Apr 25, 2025 -
[Misc] Add configurable cuda graph size
#17201 opened
Apr 25, 2025 -
[Benchmark] Add single turn MTBench to Serving Bench
#17202 opened
Apr 25, 2025 -
[Misc][Tools][Benchmark] Publish script to auto tune server parameters
#17207 opened
Apr 25, 2025 -
[V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE
#17211 opened
Apr 26, 2025 -
[Bugfix] Fix standard models tests
#17217 opened
Apr 26, 2025 -
[Doc] Clarify note for H2O-VL
#17219 opened
Apr 26, 2025 -
[Bugfix] Get a specific type of layer from forward context
#17222 opened
Apr 26, 2025 -
[Bugfix] Fix missing int type for `-n` in multi-image example
#17223 opened
Apr 26, 2025
115 Issues closed by 38 people
-
[Bug]: msgspec.DecodeError: MessagePack data is malformed: trailing characters (byte 13)
#15207 closed
Apr 26, 2025 -
[Misc]: qwen2 inference results from vLLM and transformers are not aligned
#11478 closed
Apr 26, 2025 -
[Bug]: Prefix caching doesn't work for LlavaOneVision
#11371 closed
Apr 26, 2025 -
[Bug]: error when start in multiple GPU
#11467 closed
Apr 26, 2025 -
[Bug]: 'int' object has no attribute 'parser_state'
#11498 closed
Apr 26, 2025 -
[Bug]: Qwen2.5-72B-Instruct inference fails on A800
#11506 closed
Apr 26, 2025 -
[Bug]: The CPU usage is very low when inference is performed on the ARM CPU
#11511 closed
Apr 26, 2025 -
[Bug]: vllm 0.6.5 running Qwen2-VL-7B-Instruct: LoRA loads successfully but has no effect
#11525 closed
Apr 26, 2025 -
[Bug]: Two beginning of sequence tokens for Llama-3.2-3B-Instruct
#16028 closed
Apr 25, 2025 -
[Bug]: Two BOS when using chat
#16853 closed
Apr 25, 2025 -
[Bug]: run on cpu: ModuleNotFoundError: No module named 'vllm.benchmarks'
#15812 closed
Apr 25, 2025 -
[Bug]: KeyError in mm_input_cache when processing multimodal requests with Qwen2.5-VL-72B
#16875 closed
Apr 25, 2025 -
[Tracker] Merge security fixes for v0.8.5
#17128 closed
Apr 25, 2025 -
[Bug]: Invalid Mistral ChatCompletionRequest Body Exception
#16774 closed
Apr 25, 2025 -
[Bug]: API Returns Only Single Result Despite n=8 Parameter Setting
#17173 closed
Apr 25, 2025 -
[Usage]: Does vLLM support QwQ 32B + tool calling?
#17061 closed
Apr 25, 2025 -
[Bug]: [Feature]: I want to extend the vLLM MoE functionality to support a variable number of experts.
#17150 closed
Apr 25, 2025 -
[Bug]: Remove fallback to outlines for int/number range and pattern constraints in guided_json
#16723 closed
Apr 25, 2025 -
[Bug]: GLM-4-32B-0414-FP8 output !!!!! error (tensor is nan)
#17154 closed
Apr 25, 2025 -
[Bug]: MiniCPM3 failed on ascend npu because of ModuleNotFoundError: No module named 'triton'
#16955 closed
Apr 25, 2025 -
[Bug]: Cannot run MiniCPMV on OpenVINO
#12384 closed
Apr 25, 2025 -
[Feature]: expose the tqdm progress bar to enable logging the progress
#6154 closed
Apr 25, 2025 -
[Bug]: KeyError: 'layers.0.self_attn.qkv_proj.weight'
#9595 closed
Apr 25, 2025 -
[Bug]: Qwen2-VL-7B with sglang (vLLM-back) Performance Degradation on MME benchmark
#10588 closed
Apr 25, 2025 -
[Usage]: Client-Side Error Handling for VLLM in a Client-Server Architecture
#11487 closed
Apr 25, 2025 -
[Bug]: EADDRINUSE (-98) error when setting up NCCL communicator
#15987 closed
Apr 25, 2025 -
[Usage]: How to get log probabilities for existing tokens in assistant message?
#16686 closed
Apr 24, 2025 -
[Feature]: vLLM DP=2 didn't speed up training at low batch size.
#17129 closed
Apr 24, 2025 -
[Bug]: xgrammar missing file crashes the server
#16030 closed
Apr 24, 2025 -
[Doc]: Documentation source code hyperlinks do not always point to the correct source code
#17120 closed
Apr 24, 2025 -
[Bug]: guided_json request errors in v0.7.2
#15073 closed
Apr 24, 2025 -
[Bug]:
#15329 closed
Apr 24, 2025 -
[Usage]: OpenAI Server API
#17075 closed
Apr 24, 2025 -
[Bug]: ValueError: Model architectures ['OPTForCausalLM'] failed to be inspected.
#17031 closed
Apr 24, 2025 -
[Usage]: How to log incoming requests (inputs and outputs) in vllm serve ?
#12336 closed
Apr 24, 2025 -
[Bug]: When use `guided choice` feature, vllm.engine.async_llm_engine.AsyncEngineDeadError
#8100 closed
Apr 24, 2025 -
[Usage]: RuntimeError: Failed to infer device type (Intel Iris Xe Graphics)
#8863 closed
Apr 24, 2025 -
[Bug]: AsyncLLMEngine CUDA runtime error 'device-side assert triggered'
#8948 closed
Apr 24, 2025 -
[Installation]: Segmentation fault when building Docker container on WSL
#10575 closed
Apr 24, 2025 -
[Bug]: Crash with Qwen2-Audio Model in vLLM During Audio Processing
#10627 closed
Apr 24, 2025 -
[Bug]: Prefill/decode separation leads to blocking and crashing in multi concurrent scenarios
#11445 closed
Apr 24, 2025 -
[Bug]: InternVL2-40B Inference Precision Problem
#11454 closed
Apr 24, 2025 -
[Misc]: Molmo inference multi-GPU
#11468 closed
Apr 24, 2025 -
[Usage]: How to figure out why vllm returns nothing but trt-llm returns a meaningful result
#11473 closed
Apr 24, 2025 -
[Bug]: CI Build image failure due to mamba-ssm==2.2.4 installation error
#17068 closed
Apr 23, 2025 -
[Bug]: Llama4 Scout fails on H200
#16414 closed
Apr 23, 2025 -
[Feature]: guided decoding on TPU
#11104 closed
Apr 23, 2025 -
[Usage]: Qwen/QwQ-32B
#16931 closed
Apr 23, 2025 -
[Bug]: Qwen/Qwen2.5-VL-3B-Instruct doesn't identify tools
#16797 closed
Apr 23, 2025 -
[Bug]: Error when running Llama-4-Maverick-17B-128E-Instruct-FP8 on mi300x
#16474 closed
Apr 23, 2025 -
[Usage]: Customized model parameters on different devices
#16981 closed
Apr 23, 2025 -
[Bug]: vllm stopped at vLLM is using nccl==2.21.5
#16772 closed
Apr 23, 2025 -
[Bug]: AssertionError: skip_special_tokens=False is not supported for Mistral tokenizers
#16958 closed
Apr 23, 2025 -
[Bug]: Qwen2.5-VL-72B Inference
#16997 closed
Apr 23, 2025 -
[Bug]: Running Llama4 Scout 16E with 10000 input length triggers a vllm crash, but runs fine with FA2.
#16948 closed
Apr 23, 2025 -
[test] llama4 testing issue
#17025 closed
Apr 23, 2025 -
[Usage]: how to change the path of downloaded models
#16975 closed
Apr 23, 2025 -
[Bug]: /usr/bin/ld: cannot find -lcuda: No such file or directory, when running inference
#16984 closed
Apr 23, 2025 -
Error in running 'python -m vllm.entrypoints.openai.api_server '
#11411 closed
Apr 23, 2025 -
[Bug]: Extra body don't work when response_format is also sent for serving.
#7337 closed
Apr 23, 2025 -
[Bug]: vLLM 0.5.5 and FlashInfer0.1.6
#8091 closed
Apr 23, 2025 -
[Bug]: GPU memory usage keeps growing after the server has been running for a while
#8413 closed
Apr 23, 2025 -
[Bug]: Lora refuses to load from disk without extremely weird manipulations with file paths
#9063 closed
Apr 23, 2025 -
[Bug]: vLLM multi-step scheduling crashes when input prompt is long
#10009 closed
Apr 23, 2025 -
[Usage]: Removal of vllm.openai.rpc folder in vLLM 0.6.2 release
#10766 closed
Apr 23, 2025 -
[Performance]: Performance degradation due to CPU bottleneck when serving embedding models to GPUs
#11320 closed
Apr 23, 2025 -
[Bug]: no output of profile when VLLM_TORCH_PROFILER_DIR is enabled for vllm serve
#11346 closed
Apr 23, 2025 -
[Feature]: c4ai-command-r-plus-08-2024 tool choice support
#11405 closed
Apr 23, 2025 -
[RFC]: The two features i wish vllm has
#11410 closed
Apr 23, 2025 -
[Misc]: How to Profile Both EngineCoreClient and EngineCoreProc Activities in V1 Using Profiler
#11413 closed
Apr 23, 2025 -
[Bug]: 0.6.5 randomly closes connection/drops requests
#11421 closed
Apr 23, 2025 -
[Bug]: top k isn't deterministic
#16945 closed
Apr 22, 2025 -
[RFC]: tool_calls and None types.
#16678 closed
Apr 22, 2025 -
[Feature]: suggest passing a split tensor to RLHF vllm's load_weights when tp>1
#16820 closed
Apr 22, 2025 -
[Bug]: VLLM config not set when using Flash Infer backend.
#13207 closed
Apr 22, 2025 -
[Bug]: vllm 0.8.x unable to load model from S3 using runai_streamer but works in 0.7.3
#16926 closed
Apr 22, 2025 -
[Doc]: Add documents on multimodal args
#16922 closed
Apr 22, 2025 -
[Usage]: what is the most efficient way to run a 72b model on 8 * A100?
#12205 closed
Apr 22, 2025 -
[Bug]: GuidedDecodingParams choice - Request-level structured output backend must match engine-level backend
#16738 closed
Apr 22, 2025 -
[Bug]: [V1] New v1 engine does not support n>1?
#12584 closed
Apr 22, 2025 -
[Bug]: leaked instance 0xfffc8c22b108 of type "xgrammar.xgrammar_bindings.GrammarCompiler"
#16951 closed
Apr 22, 2025 -
[Installation]: Fail to build vllm from the latest source code
#16897 closed
Apr 22, 2025 -
[Bug]: [V1] Random infinite response generation followed by silent crash
#16151 closed
Apr 21, 2025 -
[Bug]: V1 engine Index Error When Single Request Near Max Context Length LLaMA 4
#16157 closed
Apr 21, 2025 -
[Bug]: [RLHF] Weights update broken with V1 multiprocessing
#16434 closed
Apr 21, 2025 -
[Bug]: Multi-GPU (TP > 1) vLLM serve docker timeout during startup
#16514 closed
Apr 21, 2025 -
[Bug]: glm.py rotary_dim bug
#16904 closed
Apr 21, 2025 -
[Bug]: vllm 0.8.3 abnormal TTFT (too long) in the first serving
#16858 closed
Apr 21, 2025 -
[Bug]: Pooling last token differences with Sentence Transformers for embedding models
#16892 closed
Apr 21, 2025 -
[Feature]: Add CLI Commands for Benchmarking
#13840 closed
Apr 21, 2025 -
[New Model]: nvidia/Hymba-1.5B-Base
#10783 closed
Apr 21, 2025 -
[Usage]: Is pipeline parallelism supported on machines that are not in the same local network?
#11285 closed
Apr 21, 2025 -
[Misc]: What is 'residual' used for in the IntermediateTensor class?
#11364 closed
Apr 21, 2025 -
Where does the default KV cache size of 43328 come from, and how can I change it?
#11391 closed
Apr 21, 2025 -
[Bug]: After wake up from level 2 sleep, model cannot load weights properly
#16564 closed
Apr 20, 2025 -
[New Model]: Qwen/QwQ-32B-Preview
#10737 closed
Apr 20, 2025 -
[Bug]: Incomplete tool calling response for pipeline-parallel vllm with ray
#7194 closed
Apr 20, 2025 -
[Bug]: AttributeError: 'RayGPUExecutorAsync' object has no attribute 'forward_dag'
#7871 closed
Apr 20, 2025 -
[Doc]: Offline Inference Distributed
#8966 closed
Apr 20, 2025 -
[Usage]: how to use EAGLE on vLLM?
#11126 closed
Apr 20, 2025 -
[Bug]: Paligemma 2 model loading error
#11343 closed
Apr 20, 2025 -
[Feature]: meta-llama/Prompt-Guard-86M Usage Value Error.
#11360 closed
Apr 20, 2025 -
[Bug]: vLLM crashes on tokenized embedding input
#11375 closed
Apr 20, 2025 -
[Usage]: How do I run offline batch inference with Llama 405B BF16 across multinode (via SLURM)
#11379 closed
Apr 20, 2025 -
[Feature]: Benchmarks for audio models
#16354 closed
Apr 19, 2025
108 Issues opened by 101 people
-
[Usage]: EOFError when loading Qwen/Qwen2.5-32B-Instruct
#17218 opened
Apr 26, 2025 -
[Installation]: torch 2.6.0 unavailable for intel mac
#17212 opened
Apr 26, 2025 -
[Installation]: Can't get Mistral-Small-3.1-24B-Instruct-2503-Q6_K to load on Docker (local or HF)
#17210 opened
Apr 25, 2025 -
[Feature]: Support Lora for Beam Search
#17205 opened
Apr 25, 2025 -
[Feature]: Inflight BNB quantization for Mixtral models
#17199 opened
Apr 25, 2025 -
[Bug]: DP with sampling hangs after completing generation
#17194 opened
Apr 25, 2025 -
[RFC]: Custom sampling params support in REST API
#17191 opened
Apr 25, 2025 -
[Bug]: vllm LLM utils.py resolve_obj_by_qualname ValueError: not enough values to unpack (expected 2, got 1)
#17188 opened
Apr 25, 2025 -
[Installation]: deployment failure on Kubernetes with CPU device (testing).
#17187 opened
Apr 25, 2025 -
[Usage]: How to deploy tensorized vllm model (deserialize) as api_server?
#17178 opened
Apr 25, 2025 -
[Bug]: `uv run vllm serve` with DP results in NCCL error: two ranks use the same device
#17176 opened
Apr 25, 2025 -
[Installation]: Pinned version of OpenTelemetry in requirements
#17174 opened
Apr 25, 2025 -
[Usage]: I want to create custom docker image by adding my code
#17172 opened
Apr 25, 2025 -
[Bug]: Qwen2VL-2b / Qwen2.5-7b has AssertionError and Cuda error when qps goes higher
#17171 opened
Apr 25, 2025 -
[Bug]: HIP error: invalid device function
#17170 opened
Apr 25, 2025 -
[Installation]: Bloated docker image size causes problems on k8s
#17163 opened
Apr 25, 2025 -
[Bug]: failed to run distributed inference with vllm 0.8.2
#17160 opened
Apr 25, 2025 -
[Bug]: GLM-Z1 outputs garbled text with vllm batch inference
#17157 opened
Apr 25, 2025 -
[Bug]: DeepSeek Lora inference has no effect.
#17155 opened
Apr 25, 2025 -
[Bug]: LLVM ERROR: Failed to compute parent layout for slice layout. when using fp16
#17152 opened
Apr 25, 2025 -
[Bug]: Why does the deployment hang when deploying qwen2.5-vl-32b-instruct?
#17151 opened
Apr 25, 2025 -
[Bug]: waiting reqs vanish!
#17147 opened
Apr 25, 2025 -
Missing Opening <think> for Qwen32B
#17144 opened
Apr 24, 2025 -
[RFC]: Native support for Mamba, SSM, and hybrid transformer models in vLLM V1
#17140 opened
Apr 24, 2025 -
[Feature]: Support for image linebreak tokens for vision model
#17127 opened
Apr 24, 2025 -
[Feature]: Automatically detect numerical issues
#17123 opened
Apr 24, 2025 -
[Bug]: jinja2 TemplateError should return 422 instead of 500 error code
#17119 opened
Apr 24, 2025 -
[Bug]: Why does torch.cuda.memory_allocated() remain unchanged after calling sleep()?
#17117 opened
Apr 24, 2025 -
[Installation]: vllm/vllm-tpu image doesn't have :latest tag
#17114 opened
Apr 24, 2025 -
[Bug]: Tool calls data comes in content field after text chunks
#17109 opened
Apr 24, 2025 -
[Feature]: Add Support to Video Generation Models
#17106 opened
Apr 24, 2025 -
[Bug]: AsyncLLM sleep then wake_up produces meaningless outputs
#17103 opened
Apr 24, 2025 -
[Bug]: Shutdown during Qwen2.5-VL-72B inference on 4 A800s
#17101 opened
Apr 24, 2025 -
[Bug]: Failed to run dp+tp in 2 GPU Nodes
#17095 opened
Apr 24, 2025 -
Tool call argument parsing failed
#17089 opened
Apr 24, 2025 -
[Bug]: raise NotImplementedError
#17086 opened
Apr 24, 2025 -
[Bug]: Importing DeepSpeed causes crash in vLLM when running with data parallelism and TP=1
#17079 opened
Apr 23, 2025 -
[Bug]: noop elimination for slice errors when end = -1
#17078 opened
Apr 23, 2025 -
[Bug]: Aria model error due to version mismatch with transformers
#17077 opened
Apr 23, 2025 -
[RFC]: Implement structural_tag support in structured output
#17076 opened
Apr 23, 2025 -
[Feature]: GGUF support for GLM4
#17069 opened
Apr 23, 2025 -
[RFC]: All Ops should be determined during init and wrapped in a Layer Module to avoid envs.ENVIRON overhead
#17067 opened
Apr 23, 2025 -
[Performance]: UVA vs UVM for CPU offloading on v0.8.4+
#17062 opened
Apr 23, 2025 -
[Bug]: Issue with SpecDecode when using data parallel
#17056 opened
Apr 23, 2025 -
[Bug]: ValueError when using Multi-Instance GPU
#17047 opened
Apr 23, 2025 -
[Usage]: I have 2 nodes 16 GPUs, how can i use 16 dp+16 ep to run deepseek v3?
#17041 opened
Apr 23, 2025 -
[Bug]: Many endpoints are returning 500 Internal Server Error
#17038 opened
Apr 23, 2025 -
[Bug]: Undocumented HTTP Status Codes for vllm endpoints
#17037 opened
Apr 23, 2025 -
[Bug]: Multiple openai endpoint Missing Content-Type Header
#17036 opened
Apr 23, 2025 -
[Usage]: DeepSeek R1 on a 8xH200 node is too slow
#17035 opened
Apr 23, 2025 -
[Feature]: add hostname in metrics for clustering deployment
#17029 opened
Apr 23, 2025 -
[Bug]: ```image_grid_thw``` not set in ```CachedRequestState``` - ```Qwen2.5 VL 3B```
#17007 opened
Apr 22, 2025 -
[Performance]: Distributed Inference w/ & w/o RDMA over Infiniband
#17006 opened
Apr 22, 2025 -
[Usage]: multilora_inference with max_loras>1
#17003 opened
Apr 22, 2025 -
[Bug]: Guided Decoding Backend options with the OpenAI server recently broken
#17002 opened
Apr 22, 2025 -
[Feature]: Automatically Enable Modality Specific Loras
#16994 opened
Apr 22, 2025 -
[Bug]: vLLM sleep experiences segmentation fault when used in TRL
#16993 opened
Apr 22, 2025 -
[Bug]: `original_load_name` undefined with certain torch versions
#16987 opened
Apr 22, 2025 -
[Bug]: Performance degradation with increasing number of requests in long-running vLLM inference sessions
#16985 opened
Apr 22, 2025 -
[Bug]: Is the logic order correct during the scheduler procedure?
#16982 opened
Apr 22, 2025 -
[Feature]: Enable Partial Guided Decoding / Structured Output Support
#16979 opened
Apr 22, 2025 -
[Bug]: unable to automatically set CUDA_VISIBLE_DEVICES correctly for v0 engine data parallel
#16978 opened
Apr 22, 2025 -
[RFC]: scheduling policy optimization in vLLM
#16969 opened
Apr 22, 2025 -
[Bug]: cpu core 100%
#16968 opened
Apr 22, 2025 -
[Bug]: The output of MathResponse is empty when running THUDM/GLM-Z1-32B-0414 with vLLM-0.8.4
#16967 opened
Apr 22, 2025 -
[Bug]: vllm 0.8.4 whisper possible memory leak?
#16966 opened
Apr 22, 2025 -
[Usage]: How can vllm process multiple prompts within single request on server
#16965 opened
Apr 22, 2025 -
[Bug]: vllm 0.8.3 v1 startup time is too long when using multi lora
#16961 opened
Apr 22, 2025 -
[Bug]: DataParallel on multinode unable to start GPU
#16957 opened
Apr 22, 2025 -
[Bug]: Fail to use deepseek vl2 with images, maybe need a new chat template?
#16953 opened
Apr 22, 2025 -
[Performance]: Why/How vLLM uses CPU memory?
#16947 opened
Apr 21, 2025 -
[New Model]: nemotron Super GGUF
#16944 opened
Apr 21, 2025 -
[Doc]: update contributing guide for macOS Apple silicon
#16940 opened
Apr 21, 2025 -
[Bug]: Phi-4-MM generates gibberish for large image input with v1 chunked prefill
#16934 opened
Apr 21, 2025 -
[Bug]: Pooling model adapter removes the attributes expected by model init
#16932 opened
Apr 21, 2025 -
[Bug]: SharedStorageConnector only see first batch of tokens
#16928 opened
Apr 21, 2025 -
[Doc]: state requirements for testing or update to work for CPU-only
#16920 opened
Apr 21, 2025 -
Qwen2.5 VL and gemma-3-12b error on VLLM 8.4
#16918 opened
Apr 21, 2025 -
[UI_Bug]: Content_Menu_and_Icon_Spacing_Issue_in_UI
#16917 opened
Apr 21, 2025 -
[Bug]: CPU Memory oom on 8*L40s when deploy meta-llama/Llama-4-Scout-17B-16E-Instruct
#16916 opened
Apr 21, 2025 -
[Bug]: vllm can't serve multi-audio input inference
#16914 opened
Apr 21, 2025 -
[Bug]: guided_grammar example syntax does not work
#16911 opened
Apr 21, 2025 -
[Bug]: Kimi-VL-A3B-Thinking Error
#16908 opened
Apr 21, 2025 -
[Bug]: architecture of models not correctly recognized
#16905 opened
Apr 21, 2025 -
[Bug]: mm_cache keyerror
#16903 opened
Apr 21, 2025 -
[Bug]: RuntimeError on RTX 5090: "no kernel image is available for execution on the device
#16901 opened
Apr 21, 2025 -
[Usage]: When deploying the GLM-4-32B BF16 model with vLLM 0.8.4, I encountered a GPU memory overflow
#16896 opened
Apr 21, 2025 -
[Feature]: Llama4 LoRA support
#16894 opened
Apr 20, 2025 -
[Bug]: tool_choice: "required" does not work for mistral
#16887 opened
Apr 20, 2025 -
[Usage]: Deciding max-num-seqs and max-num-batched-tokens for desired throughput
#16886 opened
Apr 20, 2025 -
[Usage]: Is it true that vllm doesn't support deepseek r1 yet with the v1 engine?
#16885 opened
Apr 20, 2025 -
[Bug]: internvl3-78B-AWQ
#16884 opened
Apr 20, 2025 -
[Bug]: Ngram speculative decoding doesn't work in vLLM 0.8.3/0.8.4 with VLLM_USE_V1 enabled.
#16883 opened
Apr 20, 2025 -
[Usage]: Request scheduling when using LoRA
#16876 opened
Apr 19, 2025 -
[New Model]: jinaai/jina-embeddings-v2-base-code
#16874 opened
Apr 19, 2025
358 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
[Feature] support sequence parallelism using compilation pass
#16155 commented on
Apr 26, 2025 • 41 new comments -
[V1][Metrics] add support for kv event publishing
#16750 commented on
Apr 26, 2025 • 38 new comments -
[Kernel] some optimizations for dense marlin and moe marlin
#16850 commented on
Apr 24, 2025 • 35 new comments -
[Model] support MiniMax-VL-01 model
#16328 commented on
Apr 25, 2025 • 32 new comments -
[Kernel] Adding basic Triton JitCache for triton_attn
#16606 commented on
Apr 24, 2025 • 24 new comments -
[V1][Feature] Enable Speculative Decoding with Structured Outputs
#14702 commented on
Apr 25, 2025 • 22 new comments -
[Hardware/NVIDIA/Kernel] Enable nvidia/DeepSeek-R1-FP4 Model
#16362 commented on
Apr 25, 2025 • 17 new comments -
[Model][Frontend] Adding timeseries modality support and Qwen2.5-ChatTS model support
#16852 commented on
Apr 21, 2025 • 15 new comments -
[Core] Support full cuda graph in v1
#16072 commented on
Apr 25, 2025 • 14 new comments -
Add default local directory LoRA resolver plugin.
#16855 commented on
Apr 24, 2025 • 12 new comments -
[torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass
#16756 commented on
Apr 25, 2025 • 11 new comments -
[Kernel][Triton][FP8] Adding fp8 and variable length sequence support to Triton FAv2 kernel
#12591 commented on
Apr 26, 2025 • 10 new comments -
Update PyTorch to 2.7.0
#16859 commented on
Apr 26, 2025 • 10 new comments -
[Core] [Bugfix] Add Input Embeddings
#15428 commented on
Apr 24, 2025 • 9 new comments -
[MODEL ADDITION] Ovis2 Model Addition
#15826 commented on
Apr 25, 2025 • 9 new comments -
[Hardware][TPU][V1] Multi-LoRA implementation for the V1 TPU backend
#14238 commented on
Apr 25, 2025 • 8 new comments -
[Model] Add Granite Speech Support
#16246 commented on
Apr 26, 2025 • 8 new comments -
[V0][V1][Core] Add outlines integration for V1, and update V0 integration.
#15975 commented on
Apr 24, 2025 • 8 new comments -
[V1] vLLM OpenAI API custom args
#16862 commented on
Apr 23, 2025 • 8 new comments -
[TPU] Increase block size and reset block shapes
#16458 commented on
Apr 26, 2025 • 7 new comments -
[NVIDIA] Support Cutlass MLA for Blackwell GPUs
#16032 commented on
Apr 26, 2025 • 7 new comments -
[CPU] Support torch compile in CPU backend
#15020 commented on
Apr 22, 2025 • 7 new comments -
[FEAT] [ROCm]: Support AITER Linear
#14916 commented on
Apr 24, 2025 • 6 new comments -
[Misc] Add fully interleaved support for multimodal 'string' content format
#14047 commented on
Apr 22, 2025 • 6 new comments -
Add `pt_load_map_location` to allow loading to cuda
#16869 commented on
Apr 25, 2025 • 6 new comments -
[WIP] Add Flex to V1
#16078 commented on
Apr 25, 2025 • 5 new comments -
[Frontend] Reduce vLLM's import time
#15128 commented on
Apr 25, 2025 • 5 new comments -
[Misc] support multi-node data parallel
#15863 commented on
Apr 25, 2025 • 4 new comments -
Disable the torch.compile cache checks when VLLM_DISABLE_COMPILE_CACHE=1
#16573 commented on
Apr 25, 2025 • 3 new comments -
[FEAT] [ROCm]: Add AITER Block-Scaled GEMM Feature
#14968 commented on
Apr 23, 2025 • 2 new comments -
[Bugfix][V0] Another multi-sequence logprobs streaming edge case
#16805 commented on
Apr 23, 2025 • 2 new comments -
[Misc] Add Next Edit Prediction (NEP) datasets support in `benchmark_serving.py`
#16839 commented on
Apr 24, 2025 • 2 new comments -
[Misc] improve chat_with_tools example
#16044 commented on
Apr 25, 2025 • 2 new comments -
Add cutlass support for blackwell fp8 blockwise gemm
#14383 commented on
Apr 25, 2025 • 2 new comments -
Online Rotations to vLLM
#16443 commented on
Apr 25, 2025 • 2 new comments -
[Kernel] GGUF MoeVec kernel
#16780 commented on
Apr 25, 2025 • 2 new comments -
Adding Share Expert Fusion for DeepSeek
#15502 commented on
Apr 23, 2025 • 1 new comment -
[Bugfix] set correct lora mapping when compute prompt logprobs
#16694 commented on
Apr 26, 2025 • 1 new comment -
Support loading transformers models with named parameters
#16868 commented on
Apr 25, 2025 • 1 new comment -
[Distributed] Tensor Parallel RMSNorm
#10542 commented on
Apr 24, 2025 • 0 new comments -
[Core][Bugfix] Use correct device to initialize GPU data during CUDA-graph-capture
#10608 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: LLVM ERROR: Failed to compute parent layout for slice layout.
#15235 commented on
Apr 24, 2025 • 0 new comments -
Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support
#11844 commented on
Apr 21, 2025 • 0 new comments -
[Frontend] Disaggregate prefill decode with zmq
#11791 commented on
Apr 22, 2025 • 0 new comments -
[Misc] Allow LoRA to adaptively increase rank and remove possible_max_ranks
#10623 commented on
Apr 22, 2025 • 0 new comments -
[Feature]: Enable CUDA Graph without turn on torch.compile / Inductor for V1
#15896 commented on
Apr 24, 2025 • 0 new comments -
[Frontend] [Bugfix] Refactor tool parsers and simplify the tool parsing interface.
#11554 commented on
Apr 25, 2025 • 0 new comments -
[Frontend] improve hermes_tool_parser.py
#11453 commented on
Apr 25, 2025 • 0 new comments -
fix: add missing bos_token to example templates
#11432 commented on
Apr 25, 2025 • 0 new comments -
[Hardware][CPU] Refactor CPU vector types for ISAs
#10787 commented on
Apr 22, 2025 • 0 new comments -
[Model] Working BNB for InternVL.
#11095 commented on
Apr 24, 2025 • 0 new comments -
[torch.compile] Add torch inductor pass for fusing silu_and_mul with subsequent scaled_fp8_quant operations
#10867 commented on
Apr 24, 2025 • 0 new comments -
[Misc][Benchmark]feat(benchmarks): Add async_request_generate function to support generate endpoint
#16421 commented on
Apr 24, 2025 • 0 new comments -
[CI/Build] Add support for Python 3.13
#13164 commented on
Apr 23, 2025 • 0 new comments -
[Bugfix] Adjust tool call handling in llama template to support single tool calls only
#12938 commented on
Apr 25, 2025 • 0 new comments -
[Bugfix] Update chat_utils.py to avoid issues when tool call is present but None
#12788 commented on
Apr 25, 2025 • 0 new comments -
[Frontend] Adding the "User Defined Custom Tool Calling" parser for the Llama models
#12752 commented on
Apr 25, 2025 • 0 new comments -
[Core] Add Additional Metrics to vLLM Server
#12726 commented on
Apr 25, 2025 • 0 new comments -
[Bug]: xgrammar==0.17 not work when guided
#15790 commented on
Apr 24, 2025 • 0 new comments -
[Core][AMD] Migrate fully transparent sleep mode to ROCm platform
#12695 commented on
Apr 23, 2025 • 0 new comments -
[Bugfix] Fix quark fp8 format loading on AMD GPUs
#12612 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: gemma 3 structured output api occurs assertion error
#15766 commented on
Apr 24, 2025 • 0 new comments -
[Bugfix][Spec Decode][V0] fix: update logits processor for MQA scoring
#12537 commented on
Apr 21, 2025 • 0 new comments -
add support for AMD MI25/50/60
#12431 commented on
Apr 25, 2025 • 0 new comments -
[Core] Make disaggregated prefill compatible with pipeline parallelism
#12301 commented on
Apr 24, 2025 • 0 new comments -
[Core] Optimize topp/topk calculation in sampler
#12156 commented on
Apr 24, 2025 • 0 new comments -
[Doc] update docs for nightly benchmarks
#12022 commented on
Apr 22, 2025 • 0 new comments -
Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor)
#12010 commented on
Apr 22, 2025 • 0 new comments -
[Spec Decode][V0] feat: support LoRA with speculative decoding
#11966 commented on
Apr 21, 2025 • 0 new comments -
[Spec Decode] Add Script for converting HF Eagle checkpoint to vLLM compatible checkpoint
#11866 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: Mistral 3.1 Small Image inference is broken on 0.8.4
#16675 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Support Gemma 3 QAT series
#16856 commented on
Apr 25, 2025 • 0 new comments -
[Performance]: vllm Eagle performance is worse than expected
#9565 commented on
Apr 25, 2025 • 0 new comments -
[Bug]: After deploying qwen2.5_vl_72b with vllm, requests are initially normal (3-5 s each) but gradually slow down to about 60 s each after some time of use. Has anyone else seen this?
#13886 commented on
Apr 25, 2025 • 0 new comments -
[Usage]: Is it possible to use `meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8` with vLLM?
#12411 commented on
Apr 25, 2025 • 0 new comments -
[Feature] [ROCm]: AITER Kernel Integration
#14964 commented on
Apr 25, 2025 • 0 new comments -
[Bug]: AssertionError when using automatic prefix caching and prompt_logprobs
#8268 commented on
Apr 25, 2025 • 0 new comments -
[Bug]: [v0.6.5] Streaming tool call responses with the hermes template is inconsistent with the non-stream version.
#11392 commented on
Apr 25, 2025 • 0 new comments -
[Bug]: InternVL2-26B-AWQ Service startup failure
#12404 commented on
Apr 25, 2025 • 0 new comments -
[Feature]: The tool_choice option required is not yet supported but on the roadmap.
#11700 commented on
Apr 25, 2025 • 0 new comments -
[Feature]: Llama3.3 Tool calling support or a Geneneric and extensible llama tool calling support
#11799 commented on
Apr 25, 2025 • 0 new comments -
[New Model]: Support Efficient-Large-Model/NVILA
#11887 commented on
Apr 25, 2025 • 0 new comments -
[Usage]: Automated Tool Calling for OLMoForCausalLM
#12263 commented on
Apr 25, 2025 • 0 new comments -
[Usage]: Is it possible to speed up the generation speed by adding another video card?
#12322 commented on
Apr 25, 2025 • 0 new comments -
[Usage]: how to use tool calling with auto option, setting the tool works
#12349 commented on
Apr 25, 2025 • 0 new comments -
[Bug]: Inference with gguf returns garbage
#12364 commented on
Apr 25, 2025 • 0 new comments -
[Usage]: How to run vllm with regression task, just like classify task
#12379 commented on
Apr 25, 2025 • 0 new comments -
[Usage]: mistralai/Ministral-8B-Instruct-2410 scale to 128k context length.
#12385 commented on
Apr 25, 2025 • 0 new comments -
[Feature]: Consider integrating SVDquant (W4A4 quantization) from Nunchaku project
#12399 commented on
Apr 25, 2025 • 0 new comments -
[Usage]: Overwhelmed trying to find out information about how to serve Llama-3 70b to multiple users with 128k context
#12400 commented on
Apr 25, 2025 • 0 new comments -
Reshape cache flash kernel to support HND layout
#8200 commented on
Apr 23, 2025 • 0 new comments -
[BugFix] Fix the lm_head in gpt_bigcode in lora mode
#6357 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Support for jina-embeddings-v2-small-en
#16639 commented on
Apr 26, 2025 • 0 new comments -
[SpecDecode] Support EAGLE in V1
#15901 commented on
Apr 26, 2025 • 0 new comments -
[Feature]: Audit and Update Examples To Use `VLLM_USE_V1=1`
#14530 commented on
Apr 26, 2025 • 0 new comments -
[Usage]: How to increase the generation throughput of Qwen-0.5B
#14023 commented on
Apr 26, 2025 • 0 new comments -
[Bug]: v0.8.2 vLLM engine crashes when starting after V1 environment variable is enabled with deepseek-r1
#15769 commented on
Apr 26, 2025 • 0 new comments -
[Feature]: Implement Priority Scheduling In V1 Engine
#14002 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Can't deserialize object: ObjectRef,DeepSeek R1, H20*16, pp2, tp8, v1 engine
#15333 commented on
Apr 26, 2025 • 0 new comments -
[Feature]: Improve Logging for Error Messages
#14083 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Support Inflight quantization: load as 8bit quantization.
#11655 commented on
Apr 26, 2025 • 0 new comments -
[Bug]: FP8 Quantization with enforce_eager=False Causes Gibberish Output on Llama-4-Scout Model (VLLM_USE_V1=1)
#16337 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Return hidden states (in progress?)
#6165 commented on
Apr 25, 2025 • 0 new comments -
[Bug]: Guided generation throws 500 error or endless generation in vllm serve for mistral small 2501
#13260 commented on
Apr 25, 2025 • 0 new comments -
[Bug]: Bug in LRUEvictor: priority_queue and free_table desynchronization cause error
#16825 commented on
Apr 25, 2025 • 0 new comments -
unload the model
#3281 commented on
Apr 25, 2025 • 0 new comments -
[Feature]: Allow head_size smaller than 128 on TPU with Pallas backend
#10343 commented on
Apr 25, 2025 • 0 new comments -
[RFC]: Data Parallel Attention and Expert Parallel MoEs
#16037 commented on
Apr 25, 2025 • 0 new comments -
[Bug]: Vllm 0.8.2 + Ray 2.44 (Ray serve deployment) fallbacks to V0 Engine
#15569 commented on
Apr 25, 2025 • 0 new comments -
[ROCm] (Deprecated) Enable AITER Tkw1 kernel
#16418 commented on
Apr 19, 2025 • 0 new comments -
Fix cuda_version_str reset logic.
#16400 commented on
Apr 24, 2025 • 0 new comments -
[WIP]Docker Release
#16396 commented on
Apr 22, 2025 • 0 new comments -
[V1] Add request-level, per-step acceptance counts tracking for spec dec.
#16367 commented on
Apr 25, 2025 • 0 new comments -
Add NeuronxDistributedInference support, Speculative Decoding, Dynamic on-device sampling
#16357 commented on
Apr 22, 2025 • 0 new comments -
[Model][VLM] Add Qwen2.5-Omni model support (end-to-end full support)
#16347 commented on
Apr 25, 2025 • 0 new comments -
[Bugfix][Frontend] Add missing "type":"function" in tool call streaming responses
#16346 commented on
Apr 25, 2025 • 0 new comments -
[V1][Spec Decode] Add random seed for EAGLE and its test script
#16235 commented on
Apr 23, 2025 • 0 new comments -
[MISC][Bugfix] Use less CPU when message queue has been empty for some time
#16226 commented on
Apr 21, 2025 • 0 new comments -
[Model] set default attn tmp scaling to True for llama4
#16216 commented on
Apr 26, 2025 • 0 new comments -
Support embedding models in V1
#16188 commented on
Apr 24, 2025 • 0 new comments -
[WIP] Hybrid Memory Allocator
#16178 commented on
Apr 25, 2025 • 0 new comments -
[v1] Implement HybridKVCacheManager to support hybrid models with different KV cache type
#16101 commented on
Apr 26, 2025 • 0 new comments -
[Frontend] [Bugfix] Refactor tool parsers and simplify the tool parsing interface.
#16096 commented on
Apr 25, 2025 • 0 new comments -
[V1][Spec Decode] Non greedy sample with EAGLE / Reduce memory allocation for Rejection Sampler
#16077 commented on
Apr 25, 2025 • 0 new comments -
[ROCM] Add gfx950 to the custom attention archs
#16034 commented on
Apr 24, 2025 • 0 new comments -
[WIP][Feature] Support chunked prefill when using Deepseek MTP model as draft model
#15153 commented on
Apr 21, 2025 • 0 new comments -
[CORE] Eliminate Occasional Scheduling Delay for Parallel Sampling
#16849 commented on
Apr 22, 2025 • 0 new comments -
[V1] Async DP shutdown test
#16846 commented on
Apr 21, 2025 • 0 new comments -
[Misc] Raise ValueError for V1 during profiling when max_num_batched_tokens is too short
#16834 commented on
Apr 19, 2025 • 0 new comments -
Add quickreduce as alternative to custom allreduce
#16804 commented on
Apr 23, 2025 • 0 new comments -
[Kernel] Add Split-KV Attention Kernel to the triton_attn Backend
#16794 commented on
Apr 21, 2025 • 0 new comments -
[Bugfix] Fix cutlass dispatch for fp8/int8 to properly invoke M<=16 c…
#16751 commented on
Apr 21, 2025 • 0 new comments -
[CI] Enable test_initialization to run on V1
#16736 commented on
Apr 23, 2025 • 0 new comments -
[V1] LogitsProcessor interface
#16728 commented on
Apr 23, 2025 • 0 new comments -
[NIXL] vllm v0 nixl integration
#16677 commented on
Apr 21, 2025 • 0 new comments -
[V1][Spec Decode][Bugfix] Allocate lookahead token kvc in WAITING queue
#16613 commented on
Apr 23, 2025 • 0 new comments -
[Misc] Fix demo function call JSONDecodeError
#16595 commented on
Apr 25, 2025 • 0 new comments -
[V1] Structured Outputs + Thinking parser compatibility
#16577 commented on
Apr 26, 2025 • 0 new comments -
Remove scipy dep by implementing `resample_poly`
#16542 commented on
Apr 24, 2025 • 0 new comments -
Fix #15483 : Add error handling for model-dependent endpoints during sleep mode
#16536 commented on
Apr 22, 2025 • 0 new comments -
[Core] Enable IPv6 with vllm.utils.make_zmq_socket()
#16506 commented on
Apr 26, 2025 • 0 new comments -
Adding "amd_experimental: CI functionality to test all available test groups.
#16497 commented on
Apr 24, 2025 • 0 new comments -
[Bugfix][Model] fix Phi3Small model only support v0
#16493 commented on
Apr 22, 2025 • 0 new comments -
[Metrics] Log multi-modal cache stats
#16478 commented on
Apr 26, 2025 • 0 new comments -
Truncation control for embedding models
#14776 commented on
Apr 24, 2025 • 0 new comments -
[Quantization] Add Gemma2 and Gemma3 text model GGUF support
#14766 commented on
Apr 23, 2025 • 0 new comments -
[Core] Add a level 3 sleep/wake_up that offloads tensors to disk
#14678 commented on
Apr 25, 2025 • 0 new comments -
[Neuron][V1] Experimental support for neuron backend with V1 architecture
#14648 commented on
Apr 25, 2025 • 0 new comments -
[Hardware][Intel GPU] Add V1 engine support and `chunked_prefill` kernel
#14612 commented on
Apr 25, 2025 • 0 new comments -
[Misc] Using ruff-format for smaller sets of directories
#14485 commented on
Apr 22, 2025 • 0 new comments -
[Frontend] Pythonic tool names flexibility (#14470)
#14474 commented on
Apr 25, 2025 • 0 new comments -
[Kernel] Update cutlass FP8 blockwise to use upstream CUTLASS
#14395 commented on
Apr 24, 2025 • 0 new comments -
[Core] Add DoRA Support
#14389 commented on
Apr 22, 2025 • 0 new comments -
[Doc] Create tool_chat_template_llama3.3_json.jinja
#14269 commented on
Apr 25, 2025 • 0 new comments -
[WIP][Attention] FlashAttn MLA
#14258 commented on
Apr 24, 2025 • 0 new comments -
Add CUDA kernel for per_token_group_quant_fp8
#14175 commented on
Apr 23, 2025 • 0 new comments -
[V1][Metrics] Add additional metrics to V1
#14148 commented on
Apr 22, 2025 • 0 new comments -
[Hardware][CPU] Vllm int8 quantization enablement for ARM CPU
#14129 commented on
Apr 22, 2025 • 0 new comments -
[Bugfix][Frontend] Strip empty tool calls from incoming chat conversations
#14054 commented on
Apr 25, 2025 • 0 new comments -
[Bugfix] Ensure JSON encoding preserves non-ASCII characters in Llama3JsonToolParser
#13826 commented on
Apr 25, 2025 • 0 new comments -
Minor fix in documentation for tool_calling.md
#13291 commented on
Apr 25, 2025 • 0 new comments -
[V1] DP scale-out (2/N): Decouple engine process management and comms
#15977 commented on
Apr 26, 2025 • 0 new comments -
Fixed stream set to True: client stream received arguments as a concatenated JSON string with missing closing curly braces
#15930 commented on
Apr 25, 2025 • 0 new comments -
[Misc] Disable pin_memory in AsyncMetricsCollector for spec decode tensor allocation
#15886 commented on
Apr 23, 2025 • 0 new comments -
[Bugfix] fix client socket timeout when serving a multi-node model in Ray
#15850 commented on
Apr 24, 2025 • 0 new comments -
[WIP][V1/0][P/D] XpYd based on p2p communication without cache store
#15806 commented on
Apr 26, 2025 • 0 new comments -
[Sampler] Adapt to FlashInfer 0.2.3 sampler API
#15777 commented on
Apr 23, 2025 • 0 new comments -
Use pip wheel to build wheels
#15749 commented on
Apr 24, 2025 • 0 new comments -
Try Python 3.13
#15743 commented on
Apr 22, 2025 • 0 new comments -
[Core] Remove legacy input mapper/processor from V0
#15686 commented on
Apr 25, 2025 • 0 new comments -
[Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend
#15655 commented on
Apr 22, 2025 • 0 new comments -
Enable Outlines with JSON Sub-Schema References
#15627 commented on
Apr 24, 2025 • 0 new comments -
[Frontend] Fix streaming tool output losing 2 tokens bug #15545
#15546 commented on
Apr 25, 2025 • 0 new comments -
[Minor] QoL for Benchmarking
#15512 commented on
Apr 25, 2025 • 0 new comments -
[BugFix] fix speculative decoding memory leak when speculation is disabled
#15506 commented on
Apr 25, 2025 • 0 new comments -
[V1][Draft] Jump-forward decoding
#15490 commented on
Apr 24, 2025 • 0 new comments -
[Bugfix][Frontend] Fix pythonic tool parser failure with negative numbers
#15462 commented on
Apr 24, 2025 • 0 new comments -
[Misc] Improve CLI help display
#15455 commented on
Apr 21, 2025 • 0 new comments -
[Spec Decode] Make speculative decoding compatible with pipeline parallelism
#15173 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: xgrammar doesn't support enums, but vllm isn't falling back to outlines
#15762 commented on
Apr 24, 2025 • 0 new comments -
[Installation]:
#14398 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: The Transformers implementation of My Model is not compatible with vLLM.
#16826 commented on
Apr 22, 2025 • 0 new comments -
[Feature]: Support Gemma3 GGUF
#14753 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: Mistral tool parser failed to parse function calling
#16190 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: InternVL3-9B call is hanging
#16782 commented on
Apr 22, 2025 • 0 new comments -
[Usage]: Guided choice not working as expected
#12225 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: Reasoning code on the main branch reports an error during H100 inference
#16656 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: An error occurred when deploying DeepSeek-R1-Channel-INT8 on two A100 machines using lws
#16827 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: Can't use yarn rope config for long context in Qwen2 model
#10293 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: Out of Memory (OOM) Issues During MMLU Evaluation with lm_eval
#10325 commented on
Apr 22, 2025 • 0 new comments -
[RFC]: Improve Ray Support in vLLM for Enhanced Elasticity and Performance
#11137 commented on
Apr 22, 2025 • 0 new comments -
Getting started as a beginner, please advise
#11223 commented on
Apr 22, 2025 • 0 new comments -
[Feature]: Add support for attention score output
#11365 commented on
Apr 22, 2025 • 0 new comments -
[Performance]: Prefill is not using CUDA graph and becomes very slow when LoRA is enabled
#11436 commented on
Apr 22, 2025 • 0 new comments -
[Usage]: Does vLLM support deploying the speculative model on a second device?
#12200 commented on
Apr 22, 2025 • 0 new comments -
[Usage]: Does vLLM support running the DeepSeek-V3 model with CUDA 11.8?
#12247 commented on
Apr 22, 2025 • 0 new comments -
[Feature]: loading model from remote KV store such as Redis
#12250 commented on
Apr 22, 2025 • 0 new comments -
[Feature]: PD separation supports prefix caching
#12257 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: AttributeError: '_OpNamespace' '_C' object has no attribute 'gptq_marlin_repack'
#12267 commented on
Apr 22, 2025 • 0 new comments -
[Usage]: Does vLLM support speculative decoding for MoE models?
#12278 commented on
Apr 22, 2025 • 0 new comments -
[Usage]: How to implement concurrency
#12289 commented on
Apr 22, 2025 • 0 new comments -
[Usage]: Why is there no ray command in my Docker image?
#15284 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: Gemma3-27B fails in the forward pass
#16590 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Llama-3.1-405B-Instruct-FP8 only generates exclamation marks
#13035 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Qwen2.5-VL-32B, Following weights were not initialized from checkpoint
#15536 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12
#10300 commented on
Apr 23, 2025 • 0 new comments -
[Performance]: Update Cascade Attention Heuristics for FA3
#15647 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Error downloading the model when using Sonatype Nexus Repository.
#14993 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: VLLM 0.8.3 LLM initialization hangs when EngineArgs data parallel size > 1
#16588 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: vLLM 0.8.4 started with Ray, and Ray's dashboard fails to start
#16779 commented on
Apr 23, 2025 • 0 new comments -
[Installation]: XPU dependencies not built against most recent oneAPI
#11734 commented on
Apr 23, 2025 • 0 new comments -
[Feature]: SwiftKV cache compression
#12220 commented on
Apr 23, 2025 • 0 new comments -
[Feature]: Support pass in user-specified backend to torch dynamo piecewise compilation
#12261 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Fail to load W4A16-G128 (llmcompressor) quantized model on CPU
#12268 commented on
Apr 23, 2025 • 0 new comments -
[Performance]: Why is vllm-0.6.1.post2 faster than the latest vllm-0.6.6.post1?
#12274 commented on
Apr 23, 2025 • 0 new comments -
[Feature]: DeepSeek-R1 tool choice && Function Call
#12297 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Docker build error
#12300 commented on
Apr 23, 2025 • 0 new comments -
[Performance]: Unable to reproduce the throughput & latency results claimed on the vLLM dashboard v0
#12315 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: v0.8.2, enabling calculate_kv_scales caught an exception
#15973 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Cast error details: Unable to cast 1024 to Tensor
#12771 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: `undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE` when running `0.7.3.dev57+g2ae88905.precompiled` on A100
#13047 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: Can't serve on a Ray cluster despite passing VLLM_HOST_IP
#13521 commented on
Apr 22, 2025 • 0 new comments -
[Feature]: Composite model loading using `AutoWeightsLoader` for all models
#15697 commented on
Apr 22, 2025 • 0 new comments -
[Usage]: LLM.beam_search is much slower in vLLM 0.7.3 compared to 0.5.4
#14426 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: Enabling LoRA not working with vLLM
#16676 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: Quantization example outdated (Ammo -> ModelOpt)
#9288 commented on
Apr 21, 2025 • 0 new comments -
[Usage]: Dynamically loaded LoRAs do not appear on the /models endpoint
#10784 commented on
Apr 21, 2025 • 0 new comments -
[Misc]: Fine-tuned llama3.2 vision instruct model fails during vLLM weight_loader
#11765 commented on
Apr 21, 2025 • 0 new comments -
[Misc]: For disaggregated prefill with multiple decode instances, drop_select might not be enough
#12039 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: Inconsistent data received and sent using PyNcclPipe
#12197 commented on
Apr 21, 2025 • 0 new comments -
[Usage]: How to generate results and get the embeddings of the results
#12213 commented on
Apr 21, 2025 • 0 new comments -
[Usage]: Context window crashes web window when full
#12221 commented on
Apr 21, 2025 • 0 new comments -
[Performance]: Added requests take too much time, and the model will not run until all the requests are added into the cache
#13259 commented on
Apr 21, 2025 • 0 new comments -
[Installation]: VLLM on ARM machine with GH200
#10459 commented on
Apr 20, 2025 • 0 new comments -
[Feature]: Gemma3 raises an error
#14723 commented on
Apr 20, 2025 • 0 new comments -
[Usage]: [V1] Misleading Error Messages
#13510 commented on
Apr 20, 2025 • 0 new comments -
[Usage]: How can I get the sparse embedding from OpenAI Embedding Client?
#13609 commented on
Apr 20, 2025 • 0 new comments -
[Usage]: Benchmarking Issues: Low Success Rate and Tensor Parallel Size Constraints on 8x AMD MI300x GPUs
#9070 commented on
Apr 20, 2025 • 0 new comments -
[Bug]: Speculative decoding inconsistency for Qwen-Coder-32B
#10913 commented on
Apr 20, 2025 • 0 new comments -
[Bug]: v0.7.3 doesn't work on a WSL Ubuntu mirrored network
#13656 commented on
Apr 20, 2025 • 0 new comments -
[Bug]: InternVL3-78B OOM on 4 A100 40G in 0.8.4
#16749 commented on
Apr 20, 2025 • 0 new comments -
Flash Attention 3 (FA3) Support
#12429 commented on
Apr 19, 2025 • 0 new comments -
[Usage]: Does model streamer support loading models from a GCS bucket?
#12290 commented on
Apr 19, 2025 • 0 new comments -
[Feature]: Support Python 3.13
#12083 commented on
Apr 19, 2025 • 0 new comments -
[Bug]: ROCm Memory Access Fault.
#16840 commented on
Apr 19, 2025 • 0 new comments -
First tpot/itl is too long?
#15106 commented on
Apr 19, 2025 • 0 new comments -
[Bug]: V1 engine error when using gemma-3 (V0 engine is okay)
#16643 commented on
Apr 19, 2025 • 0 new comments -
[Bug]: Not able to deploy Llama-4-Scout-17B-16E-Instruct on vllm-openai v0.8.3
#16197 commented on
Apr 21, 2025 • 0 new comments -
[Feature]: Support custom args in OpenAI (chat) completion requests
#16802 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: Calling the load_weights method of the MOE model failed
#16842 commented on
Apr 21, 2025 • 0 new comments -
[RFC]: KVBlocks and Metrics Publishing In Inference Frameworks
#16669 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: Using TP = 16 to serve deepseek-v3 on a 2*H20 Ray cluster raises an EngineCore exception
#16646 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: The official pre-built CPU image prints a simple error: RuntimeError: Engine process failed to start. See stack trace for the root cause
#16446 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: When configuring Ray with a custom temporary directory using the --temp-dir parameter, the distributed multi-node inference cluster fails to deploy successfully.
#16819 commented on
Apr 21, 2025 • 0 new comments -
[Usage]: How to add a hook function
#16585 commented on
Apr 21, 2025 • 0 new comments -
[Feature]: Reduce vLLM's import time
#14924 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: With sampling_params.n > 1, after reset_state_for_recompute() hits 'AssertionError: seq_len: 2701, context_len: 0, query_len: 2701'
#14759 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: 100% CPU usage when idle
#16660 commented on
Apr 21, 2025 • 0 new comments -
[RFC]: Merge input processor and input mapper for multi-modal models
#10114 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: TypeError: Unknown image model type: qwen2_5_omni for branch: qwen2_omni_public_v1
#15754 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: Request gets stuck when serving a model with the V1 engine
#16580 commented on
Apr 21, 2025 • 0 new comments -
[New Model]: support Ovis VLM series
#13441 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: Cannot load Qwen2.5-VL
#16429 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: Error using vllm-0.7.4 on NVIDIA Jetson AGX Orin
#16465 commented on
Apr 21, 2025 • 0 new comments -
[Feature]: Support for Running Classification Task in Online Server
#13567 commented on
Apr 21, 2025 • 0 new comments -
[Feature]: Integrate Triton MoE Kernel
#16294 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: Engine iteration timed out. This should never happen!
#9839 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: vLLM 0.6.2 UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
#8933 commented on
Apr 21, 2025 • 0 new comments -
Using the VLLM engine framework for inference, why is the first character generated always a space?
#3683 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: deploy deepseek-r1-awq on 16 x 4090 48G, layer_kv_cache = torch.zeros(kv_cache_shape, [rank0]: RuntimeError: CUDA error: invalid argument
#15014 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Grammar error: Pointer '/$defs/xxxxx' does not exist
#16467 commented on
Apr 24, 2025 • 0 new comments -
[Usage]: How to configure the server parameters for THUDM/GLM-4-32B-0414 to support Function call using vllm-0.8.4?
#16771 commented on
Apr 24, 2025 • 0 new comments -
[Usage]: How to use prefill-decode disaggregation?
#11490 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: MistralTokenizer not working when using Mistral Small 3.1 in HF format
#16292 commented on
Apr 24, 2025 • 0 new comments -
[Doc]: Steps to run vLLM on your RTX5080 or 5090!
#14452 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: v1 flash_attn and triton_attn backends don't have `get_state_cls`
#15630 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Debugging vLLM script results in torch error
#15722 commented on
Apr 24, 2025 • 0 new comments -
[Usage]: There is no module or parameter named 'language_model' in Gemma3ForCausalLM
#15031 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Simple Data Parallelism in vLLM
#9206 commented on
Apr 24, 2025 • 0 new comments -
vLLM's V1 Engine Architecture
#8779 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Support Multiple Tasks Per Model
#11905 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: V0 engine gives incorrect output for Moonlight model
#16658 commented on
Apr 24, 2025 • 0 new comments -
[Roadmap] vLLM Roadmap Q2 2025
#15735 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Running llama2-7b on H20, a floating point exception (core dumped) occurs with float16
#4392 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: load_adapter crashes server if called when generations are in progress
#13698 commented on
Apr 24, 2025 • 0 new comments -
[RFC]: Hidden states processor
#12249 commented on
Apr 24, 2025 • 0 new comments -
vllm keeps hanging when using djl-deepspeed
#2912 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Align the API with OAI's structured output
#7220 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Guided decoding is broken because tokenizers can't be pickled
#7557 commented on
Apr 24, 2025 • 0 new comments -
[Performance]: guided generation is very slow in offline mode
#8313 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: vLLM API server returns escaped unicode strings with the guided backend 'outlines'
#8805 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Outlines broken on vLLM 0.8+
#15636 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Disable unicode characters in structured decoding
#16363 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: issues with guided generation for tool calls (xgrammar)
#16321 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Multiple tool calls for llama3.2-11b-vision-instruct
#11786 commented on
Apr 24, 2025 • 0 new comments -
[RFC]: Refactor tool parsers to eliminate coding errors and allow more efficient implementations.
#11522 commented on
Apr 24, 2025 • 0 new comments -
[Usage]: Confirm tool calling is not supported and this is the closest thing that can be done
#7912 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Does vLLM support function call mode?
#6631 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Consider parallel_tool_calls parameter at the API level
#9451 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Native Tool Call for Gemma 3
#16482 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Qwen2.5 assistant output on tool call is empty
#16430 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Models converted to GGUF don't seem to be able to do tool calling
#16195 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: JSON based tool calling for Gemma 3
#15403 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Add support for reusable subschemas in tool requests (PydanticAI)
#15035 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: vLLM response on tool_calls does not align with OpenAI standard
#14951 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Support tool calls for DeepSeek.
#14745 commented on
Apr 24, 2025 • 0 new comments -
[New Model]: Command A with tool support
#14866 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Ultravox audio doesn't work with auto tool choice
#14209 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: pythonic tool parser only accepts alphabetical tool names
#14470 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: add tool calling support for DeepSeek-R1-Distill-Qwen-32B
#13700 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: vLLM 0.6.3 generate_sequences Randomly Hangs After 1-2 Steps When trying to Implement Tool Calling with Logits Processors
#13671 commented on
Apr 24, 2025 • 0 new comments -
[Usage]: vLLM and on-the-fly tool calling
#13497 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: CPU offload not working for vllm serve
#15877 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Guided Decoding Schema Cache Store
#8902 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Support guided decoding with multistep decoding
#9893 commented on
Apr 23, 2025 • 0 new comments -
[Performance]: Transformers 4.45.1 slows down `outlines` guided decoding
#9032 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Distilled DeepSeek Models do not work with guided_json
#12548 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Guided Decoding (structured json outputs) not generating proper outputs.
#13683 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Very slow guided decoding with Outlines backend since v0.6.5
#12005 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: XGrammar-based CFG decoding degraded after 0.6.5
#12122 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Close feature gaps when using xgrammar for structured output
#12131 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: xgrammar crashes with speculative decoding
#11484 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Using "response_format": { "type": "json_object" } with /v1/chat/completions is terminating the engine
#11828 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Engine crashes with Pixtral-HF and xgrammar decoding
#11044 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Speculative decoding + guided decoding not working
#10442 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Speculative decoding breaks guided decoding.
#9423 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Compiling FSM index high memory && subprocess OOM
#7332 commented on
Apr 23, 2025 • 0 new comments -
[RFC]: TPU V1 Sampler planning
#16268 commented on
Apr 23, 2025 • 0 new comments -
[Installation]: how to run swiftkv with vllm
#16109 commented on
Apr 23, 2025 • 0 new comments -
[Usage]: Transcription "Maximum clip duration (30s) exceeded"
#15012 commented on
Apr 23, 2025 • 0 new comments -
[New Model]: Multimodal Embedding Model GME.
#16406 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Llama 4 EOFError
#16127 commented on
Apr 23, 2025 • 0 new comments -
[Feature]: Support tool calling and reasoning together
#14429 commented on
Apr 23, 2025 • 0 new comments -
[Feature]: Please add an ARM Docker image to hub.docker.com
#14656 commented on
Apr 23, 2025 • 0 new comments -
[Installation]: Can't build arm container image with podman without a SELinux relabel of bind mounts
#12734 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: CPU memory not released when waking up the vLLM instance
#16663 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Exception in worker VllmWorkerProcess while processing method init_device: NCCL error: unhandled cuda error
#9329 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Error with structured output inference after upgrade 0.6.2->0.6.3
#9462 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Structured outputs inference often takes a very long time, eventually causing a timeout and the vLLM engine crashing.
#10081 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Guided Decoding Broken in Streaming mode
#10376 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: CPU Offloading errors (Worker.__init__() got an unexpected keyword argument 'kv_cache_dtype')
#11986 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: PaliGemma2 not working with OpenAI Docker serve
#12052 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Fails to use beam search with llm.chat
#12183 commented on
Apr 24, 2025 • 0 new comments -
[Usage]: How can I use LLMEngine to perform distributed inference for multimodal large models, such as Qwen-VL?
#12305 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Speculative decoding does not work
#12323 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Running inference on multiple LLMs one by one with multiple TP always hangs on the second one in the model list
#12337 commented on
Apr 24, 2025 • 0 new comments -
[Usage]: When running models on multiple GPUs, workload does not get split
#12354 commented on
Apr 24, 2025 • 0 new comments -
[RFC]: Refactor `config-format` and `load-format` as plugins
#12363 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Support LoRA adapter for whisper
#15370 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Qwen2.5 tool call failed
#16393 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Out of Memory error for Qwen2.5 in 0.8.0 and 0.8.1. Worked fine in the previous versions
#15228 commented on
Apr 24, 2025 • 0 new comments -
[V1] Feedback Thread
#12568 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Persistent OutOfMemoryError error when using speculative decoding
#8073 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Bug while using deepspeed with TRL with vLLM
#16867 commented on
Apr 23, 2025 • 0 new comments -
[Feature]: Specific Docker Image for vllm["audio,video"]
#13940 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: examples/offline_inference/chat_with_tools.py JSONDecodeError
#16594 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: ImportError: /workspace/vllm-abo/vllm/_C.abi3.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKSsb
#13608 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: guided_json not working correctly with (quantized) mistral-small model
#15577 commented on
Apr 23, 2025 • 0 new comments