Insights: vllm-project/vllm
Overview
163 Pull requests merged by 77 people
-
[Core] Remove prompt string from engine core data structures
#17214 merged
Apr 26, 2025 -
[CI/test] Fix Eagle Correctness Test
#17209 merged
Apr 26, 2025 -
[BugFix] Avoid race conditions in zero-copy tensor transmission
#17203 merged
Apr 26, 2025 -
[V1][Metrics] Allow V1 AsyncLLM to use custom logger
#14661 merged
Apr 26, 2025 -
[ROCm][Misc] Follow-ups for Skinny Gemms on ROCm.
#17011 merged
Apr 26, 2025 -
Allocate kv_cache with stride order
#16605 merged
Apr 26, 2025 -
[Minor][Models] Fix Return Types of Llama & Eagle
#17220 merged
Apr 26, 2025 -
[Doc] Minor fix for the vLLM TPU setup page
#17206 merged
Apr 26, 2025 -
[Minor][Spec Decode] Add use_eagle to SpeculativeConfig
#17213 merged
Apr 26, 2025 -
[doc] add Anything LLM integration
#17216 merged
Apr 26, 2025 -
[MISC][AMD] Add unused annotation to rocm kernel file
#17097 merged
Apr 26, 2025 -
[Bugfix] [pytorch] Patch AOTAutogradCache._get_shape_env
#17142 merged
Apr 26, 2025 -
[v1] [P/D] Adding LMCache KV connector for v1
#16625 merged
Apr 26, 2025 -
[AMD][FP8][BugFix] Remove V1 check in arg_utils.py for FP8 since it is not necessary
#17215 merged
Apr 26, 2025 -
[Bugfix] gemma[2,3] interleaved attention when sliding window is disabled
#17180 merged
Apr 26, 2025 -
[Misc] Refine ray_serve_deepseek example
#17204 merged
Apr 25, 2025 -
[V1][Spec Decode] EAGLE-3 Support
#16937 merged
Apr 25, 2025 -
[BugFix][Frontend] Fix LLM.chat() tokenization
#16081 merged
Apr 25, 2025 -
Fix Python packaging edge cases
#17159 merged
Apr 25, 2025 -
[Bugfix] Fix hybrid model tests
#17182 merged
Apr 25, 2025 -
[V1] Move usage stats to worker and start logging TPU hardware
#16211 merged
Apr 25, 2025 -
[Security] Use safe serialization and fix zmq setup for mooncake pipe
#17192 merged
Apr 25, 2025 -
[Misc] Inline Molmo requirements
#17190 merged
Apr 25, 2025 -
[doc] update wrong hf model links
#17184 merged
Apr 25, 2025 -
Use Transformers helper get_text_config() instead of checking for text_config
#17105 merged
Apr 25, 2025 -
Bump Transformers to 4.51.3
#17116 merged
Apr 25, 2025 -
[Bugfix] Fix Mistral ChatCompletionRequest Body Exception
#16769 merged
Apr 25, 2025 -
[Bugfix] Fix mistral model tests
#17181 merged
Apr 25, 2025 -
[Doc] Move todo out of beam search docstring
#17183 merged
Apr 25, 2025 -
[Doc] Add two links to disagg_prefill.md
#17168 merged
Apr 25, 2025 -
Only turn on FastIncrementalDetokenizer when tokenizers >= 0.21.1
#17158 merged
Apr 25, 2025 -
[Doc] Add headings to improve gptqmodel.md
#17164 merged
Apr 25, 2025 -
[Quantization][FP8] Add support for FP8 models with input_scale for output projection and QK quantization
#15734 merged
Apr 25, 2025 -
[Bugfix] remove fallback in guided_json (int range, patterns)
#16725 merged
Apr 25, 2025 -
[Perf] Optimize rotary_emb implementation to use Triton operator for improved inference performance
#16457 merged
Apr 25, 2025 -
[Misc] Benchmark Serving Script Support Appending Results
#17028 merged
Apr 25, 2025 -
[Bugfix][Misc] Use TritonPlaceholderModule to defensively import triton
#15099 merged
Apr 25, 2025 -
[Misc] Clean up redundant code in uniproc_executor.py
#16762 merged
Apr 25, 2025 -
Move missed SchedulerConfig args into scheduler config group in EngineArgs
#17131 merged
Apr 25, 2025 -
[Docs] Fix True->true in supported_models.md
#17141 merged
Apr 25, 2025 -
[Doc] V1 : Update LoRA status
#17133 merged
Apr 25, 2025 -
fix float16 support for kimi-vl
#17156 merged
Apr 25, 2025 -
[Attention] FA3 decode perf improvement - single mma warp group support for head dim 128
#16864 merged
Apr 25, 2025 -
[FEAT] [ROCm]: AITER Fused MOE V1 Support
#16752 merged
Apr 25, 2025 -
Use custom address for listening socket
#15988 merged
Apr 25, 2025 -
Better error message for missing mistral params.json
#17132 merged
Apr 24, 2025 -
[Misc] Add example to run DeepSeek with Ray Serve LLM
#17134 merged
Apr 24, 2025 -
Add chat template for Llama 4 models
#16428 merged
Apr 24, 2025 -
Add collective_rpc to llm engine
#16999 merged
Apr 24, 2025 -
[Docs] Generate correct github links for decorated functions
#17125 merged
Apr 24, 2025 -
Improve configs - LoRAConfig + PromptAdapterConfig
#16980 merged
Apr 24, 2025 -
Add :markdownhelp: to EngineArgs docs so markdown docstrings render properly
#17124 merged
Apr 24, 2025 -
Molmo Requirements
#17026 merged
Apr 24, 2025 -
Fix pip command for existing torch installation in docs
#17059 merged
Apr 24, 2025 -
Updating Buildkite job for IBM Power
#17111 merged
Apr 24, 2025 -
[CI] Add automation for the tool-calling github label
#17118 merged
Apr 24, 2025 -
[V1][Spec Decoding] Add num_drafts and num_accepted_tokens_per_position metrics
#16665 merged
Apr 24, 2025 -
[Misc] refactor example series - structured outputs
#17040 merged
Apr 24, 2025 -
Add missing rocm_skinny_gemms kernel test to CI
#17060 merged
Apr 24, 2025 -
[Frontend] Use matryoshka_dimensions to control the allowed output dimensions.
#16970 merged
Apr 24, 2025 -
Improve static type checking in LoRAModelRunnerMixin
#17104 merged
Apr 24, 2025 -
[Misc] Remove OLMo2 config copy
#17066 merged
Apr 24, 2025 -
[V1][PP] Optimization: continue scheduling prefill chunks
#17080 merged
Apr 24, 2025 -
Fix OOT registration test
#17099 merged
Apr 24, 2025 -
Simplify TokenizerGroup
#16790 merged
Apr 24, 2025 -
Disable enforce_eager for V1 TPU sampler and structured output tests
#17016 merged
Apr 24, 2025 -
[Chore] Remove Sampler from Model Code
#17084 merged
Apr 24, 2025 -
Add docs for runai_streamer_sharded
#17093 merged
Apr 24, 2025 -
[doc] update to hyperlink
#17096 merged
Apr 24, 2025 -
[V1] Update structured output
#16812 merged
Apr 24, 2025 -
[Bugfix][Core] add seq_id_to_seq_group clearing to avoid memory leak when s…
#16472 merged
Apr 24, 2025 -
Addendum Fix to support FIPS enabled machines with MD5 hashing
#17043 merged
Apr 24, 2025 -
More informative error when using Transformers backend
#16988 merged
Apr 24, 2025 -
[Bugfix] Enable V1 usage stats
#16986 merged
Apr 24, 2025 -
[Minor] Use larger batch sizes for A100/B100/B200/MI300x
#17073 merged
Apr 24, 2025 -
[Quantization] Add prefix for CommandA quantized model
#17017 merged
Apr 24, 2025 -
[CI/Build] workaround for CI build failure
#17070 merged
Apr 23, 2025 -
[V1][Spec Decode] Always use argmax for sampling draft tokens
#16899 merged
Apr 23, 2025 -
[BugFix][V1] Fix int32 token index overflow when preparing input ids
#16806 merged
Apr 23, 2025 -
[Frontend] Support guidance:no-additional-properties for compatibility with xgrammar
#15949 merged
Apr 23, 2025 -
Use @property and private field for data_parallel_rank_local
#17053 merged
Apr 23, 2025 -
CacheConfig.block_size should always be int when used
#17052 merged
Apr 23, 2025 -
Improve Transformers backend model loading QoL
#17039 merged
Apr 23, 2025 -
[CI] Update structured-output label automation
#17055 merged
Apr 23, 2025 -
Ensure that pid passed to kill_process_tree is int for mypy
#17051 merged
Apr 23, 2025 -
[Doc] Add top anchor and a note to quantization/bitblas.md
#17042 merged
Apr 23, 2025 -
Categorize tests/kernels/ based on kernel type
#16799 merged
Apr 23, 2025 -
Mistral-format support for compressed-tensors
#16803 merged
Apr 23, 2025 -
[CI] Run v1/test_serial_utils.py in CI
#16996 merged
Apr 23, 2025 -
[Bugfix] Fix AssertionError: skip_special_tokens=False is not supported for Mistral tokenizers
#16964 merged
Apr 23, 2025 -
[Misc] Improve readability of get_open_port function.
#17024 merged
Apr 23, 2025 -
[BugFix] llama4 fa3 fix - RuntimeError: scheduler_metadata must have shape (metadata_size)
#16998 merged
Apr 23, 2025 -
[V1] Avoid socket errors during shutdown when requests are in flight
#16807 merged
Apr 23, 2025 -
[Bugfix] Triton FA function takes no keyword arguments
#16902 merged
Apr 23, 2025 -
[doc] add download path tips
#17013 merged
Apr 23, 2025 -
[INTEL-HPU][v0] Port delayed sampling to upstream
#16949 merged
Apr 23, 2025 -
[misc] tune some env vars for GB200
#16992 merged
Apr 23, 2025 -
Revert "[Misc] Add S3 environment variables for better support of MinIO."
#17021 merged
Apr 23, 2025 -
[BugFix] Revert ROCm Custom Paged Attention Env Flag Check
#17022 merged
Apr 23, 2025 -
[V1][DP] More robust DP/EP dummy request coordination
#16277 merged
Apr 23, 2025 -
[Kernel][ROCM] Upstream prefix prefill speed up for vLLM V1
#13305 merged
Apr 23, 2025 -
Add Dockerfile to build vLLM against torch nightly
#16936 merged
Apr 23, 2025 -
[Bugfix] validate urls object for multimodal content parts
#16990 merged
Apr 23, 2025 -
[Core][V1][TPU] Enable structured decoding on TPU V1
#16499 merged
Apr 23, 2025 -
[BugFix] Remove default multiproc executor collective_rpc timeout
#17000 merged
Apr 22, 2025 -
Fencing Kernels Tests for enabling on AMD
#16929 merged
Apr 22, 2025 -
Add assertion for no objects while hashing hf_config
#16930 merged
Apr 22, 2025 -
[FEAT][ROCm]: Support AITER MLA
#15893 merged
Apr 22, 2025 -
[frontend] enhance tool_calls type check
#16882 merged
Apr 22, 2025 -
[Misc] Add S3 environment variables for better support of MinIO.
#16977 merged
Apr 22, 2025 -
[BugFix] Pass in correct VLLM config in FlashInfer backend (#13207)
#16973 merged
Apr 22, 2025 -
Improve configs - SpeculativeConfig
#16971 merged
Apr 22, 2025 -
[Bugfix] Fix distributed bug again in Qwen2.5-VL & Qwen2.5-Omni
#16974 merged
Apr 22, 2025 -
[Misc] refactor example series
#16972 merged
Apr 22, 2025 -
[FEAT][ROCm] Integrate Paged Attention Kernel from AITER
#15001 merged
Apr 22, 2025 -
[Doc] Improve documentation for multimodal CLI args
#16960 merged
Apr 22, 2025 -
[BugFix] Fix incremental detokenization perf issue
#16963 merged
Apr 22, 2025 -
[Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS
#6036 merged
Apr 22, 2025 -
[V1] Remove pre-allocation for KV cache
#16941 merged
Apr 22, 2025 -
[Model] Use autoweightloader for mamba
#16950 merged
Apr 22, 2025 -
[Bugfix] Fix the issue where llm.generate cannot be called repeatedly after setting GuidedDecodingParams
#16767 merged
Apr 22, 2025 -
[Perf] Optimize _update_states for GPU model runner
#16910 merged
Apr 22, 2025 -
[Doc] Update ai_accelerator/hpu-gaudi.inc.md
#16956 merged
Apr 22, 2025 -
[Bugfix] Fix f-string for Python 3.9-3.11
#16962 merged
Apr 22, 2025 -
Support S3 Sharded loading with RunAI Model Streamer
#16317 merged
Apr 22, 2025 -
[Performance][ROCm] Add skinny gemms for unquantized linear on ROCm
#15830 merged
Apr 22, 2025 -
[V1] Remove additional_config check
#16710 merged
Apr 22, 2025 -
[Kernel] Add expert_map support to Cutlass FP8 MOE
#16861 merged
Apr 22, 2025 -
[Misc] Remove the chunked prefill warning for LoRA
#16925 merged
Apr 22, 2025 -
[ROCm] Add aiter tkw1 kernel for Llama4 fp8
#16727 merged
Apr 22, 2025 -
[Bugfix]: fix issue with n>1 sampling on v1 requests overriding each other
#16863 merged
Apr 22, 2025 -
[BugFix][Spec Decode] No in-place update to draft probs
#16952 merged
Apr 22, 2025 -
[Doc] Remove unnecessary V1 flag
#16924 merged
Apr 22, 2025 -
[TPU][V1] Enable Top-P
#16843 merged
Apr 22, 2025 -
[V1] V1 FlashInfer Attention
#16684 merged
Apr 22, 2025 -
[TPU][V1] Capture multimodal encoder during model compilation
#15051 merged
Apr 22, 2025 -
Update Qwen1.5-MoE-W4A16-compressed-tensors.yaml
#16946 merged
Apr 22, 2025 -
[TPU][V1] Implicitly adjust page size when there's SMEM OOM
#16871 merged
Apr 21, 2025 -
[V1][Spec Decode] Handle draft tokens beyond max_model_len
#16087 merged
Apr 21, 2025 -
[Core] Speed up decode by removing synchronizing operation in sampler
#16436 merged
Apr 21, 2025 -
[Doc] mention how to install in CPU editable mode
#16923 merged
Apr 21, 2025 -
[doc] install required python3-dev apt package
#16888 merged
Apr 21, 2025 -
[XPU][Bugfix] minor fix for XPU
#15591 merged
Apr 21, 2025 -
Raise error for data-parallel with benchmark_throughput
#16737 merged
Apr 21, 2025 -
[Bugfix] Fix GLM rotary_dim issue and support v1
#16912 merged
Apr 21, 2025 -
[Misc] Refactor platform to get device specific stream and event
#14411 merged
Apr 21, 2025 -
[Misc] fix collect_env version parse
#15267 merged
Apr 21, 2025 -
Restore buffers when wake up from level 2 sleep (#16564)
#16889 merged
Apr 21, 2025 -
[Doc] Split dummy_processor_inputs() in Multimodal Docs
#16915 merged
Apr 21, 2025 -
[Bugfix] Fix distributed bug in Qwen2.5-VL & Qwen2.5-Omni
#16907 merged
Apr 21, 2025 -
[CI/CD][V1] Add spec decode tests to CI
#16900 merged
Apr 21, 2025 -
[Bugfix] Fix v1/spec_decode/test_ngram.py
#16895 merged
Apr 21, 2025 -
[easy] Pass compile_fx only the config patches
#16845 merged
Apr 20, 2025 -
Improve configs - CacheConfig
#16835 merged
Apr 20, 2025 -
Serialize tensors using int8 views
#16866 merged
Apr 19, 2025 -
Log how much time loading a compiled artifact takes
#16848 merged
Apr 19, 2025 -
[doc] update hyperlink
#16877 merged
Apr 19, 2025 -
[VLM] Clean up models
#16873 merged
Apr 19, 2025 -
[Model] Qwen2.5-Omni Cleanup
#16872 merged
Apr 19, 2025 -
[Model] Refactor Phi-4-multimodal to use merged processor and support V1
#15477 merged
Apr 19, 2025 -
[V1][Misc] Stop updating prefix cache stats when logs_stats is disabled
#16460 merged
Apr 19, 2025 -
[Misc] Benchmarks for audio models
#16505 merged
Apr 19, 2025
84 Pull requests opened by 64 people
-
[Bugfix] Fix Qwen2.5-Omni M-RoPE position ids generation
#16878 opened
Apr 19, 2025 -
[Perf] Optimize MRotaryEmbedding::get_input_positions performance by numba
#16881 opened
Apr 19, 2025 -
Added support for HermesToolParser for models without special tokens
#16890 opened
Apr 20, 2025 -
Profiling to find the bottleneck of running `vllm --version`
#16891 opened
Apr 20, 2025 -
[Model] Include extra module from sentence transformer
#16898 opened
Apr 21, 2025 -
[Bugfix] Fix the missing '}' issue for nested object parameters in stream function call.
#16919 opened
Apr 21, 2025 -
[Bugfix] Fix layer KV cache API not triggered with direct call enabled
#16921 opened
Apr 21, 2025 -
Add docker to build vllm against torch nightly
#16935 opened
Apr 21, 2025 -
[Misc] Add DeepSeek deployment example
#16938 opened
Apr 21, 2025 -
[Model] Refactor Mamba2 SSD to improve chunked prefill performance
#16942 opened
Apr 21, 2025 -
[Quantization] Quark MXFP4 format loading
#16943 opened
Apr 21, 2025 -
[Misc] Replace `cuda` hard code with `current_platform`
#16983 opened
Apr 22, 2025 -
[Hardware][TPU][V1] Better tpu multilora compilation
#16989 opened
Apr 22, 2025 -
Add squash option to container image build commands
#16991 opened
Apr 22, 2025 -
[RFC] per module sharded weight tagging
#17001 opened
Apr 22, 2025 -
[ROCm][Kernel][V1] Enable AMD Radeon GPU Custom Paged Attention on v1
#17004 opened
Apr 22, 2025 -
Enable FlashInfer V1 FP8 kv cache
#17005 opened
Apr 22, 2025 -
Simplify (and fix) passing of guided decoding backend options
#17008 opened
Apr 22, 2025 -
[V1][Metrics] Add API for accessing in-memory Prometheus metrics
#17010 opened
Apr 22, 2025 -
[CI] Prune down lm-eval small tests
#17012 opened
Apr 22, 2025 -
[INTEL_HPU][v0] Enable spec decode on HPU
#17014 opened
Apr 23, 2025 -
[WIP][Attention] Update FlashMLA
#17027 opened
Apr 23, 2025 -
[Frontend] Add /classify endpoint
#17032 opened
Apr 23, 2025 -
Move V1 into regular `mypy` call
#17044 opened
Apr 23, 2025 -
[Core] Prevent side-channel attacks via cache salting
#17045 opened
Apr 23, 2025 -
[ROCm] default v1 args for mi300x
#17046 opened
Apr 23, 2025 -
[Misc] Make cached tokenizer pickle-compatible
#17048 opened
Apr 23, 2025 -
Fix: Python package installation for opentelemetry
#17049 opened
Apr 23, 2025 -
Fix setuptools-scm being unable to detect version for workspace
#17050 opened
Apr 23, 2025 -
Add option to use torch._inductor.standalone_compile
#17057 opened
Apr 23, 2025 -
[Docs] Propose a deprecation policy for the project
#17063 opened
Apr 23, 2025 -
[TPU][V1][CI] Set `VLLM_XLA_CACHE_PATH=` to avoid disk-full error
#17064 opened
Apr 23, 2025 -
[Kernel][Hardware][AMD] Bf16 mfma opt for ROCm skinny GEMMs
#17071 opened
Apr 23, 2025 -
[TPU][V1] Add support for top-logprobs
#17072 opened
Apr 23, 2025 -
[Bugfix] Fix Gemma3 multimodal placeholder replacement
#17074 opened
Apr 23, 2025 -
Introduce PaddingConfig to combine GPU cudagraph_capture_sizes and TPU num_tokens_paddings
#17081 opened
Apr 23, 2025 -
Fix `numel()` downcast in vllm/csrc/moe/moe_align_sum_kernels.cu +2
#17082 opened
Apr 23, 2025 -
Fix `numel()` downcast in vllm/csrc/quantization/fused_kernels/fused_layernorm_dynamic_per_token_quant.cu +2
#17083 opened
Apr 23, 2025 -
[V1] Add `structural_tag` support using xgrammar
#17085 opened
Apr 24, 2025 -
[BUGFIX] use random for NONE_HASH only when PYTHONHASHSEED not set
#17088 opened
Apr 24, 2025 -
[Bugfix] Add contiguous call inside rope kernel wrapper
#17091 opened
Apr 24, 2025 -
[Bugfix] fix phi4-mini tool call parse in streaming mode
#17094 opened
Apr 24, 2025 -
[CI][UT] Compat with cuda and npu
#17100 opened
Apr 24, 2025 -
Update test_flash_attn.py
#17102 opened
Apr 24, 2025 -
[CI/Build] Add retry mechanism for add-apt-repository
#17107 opened
Apr 24, 2025 -
[FEAT] [ROCm]: Add AITER CK 2 Stages MoE support
#17110 opened
Apr 24, 2025 -
[Feat]: Add support for Dynamic Quant 4 bit CPU kleidiai kernels
#17112 opened
Apr 24, 2025 -
Fix static typing issues in `v1/attention`
#17113 opened
Apr 24, 2025 -
Enabling multi-group kernel tests.
#17115 opened
Apr 24, 2025 -
[VLM] Support HF format Phi-4-MM model
#17121 opened
Apr 24, 2025 -
[Misc]: Enable memory usage logging for vLLM GPU worker
#17122 opened
Apr 24, 2025 -
Benchmark script for fp8 vs bf16 gemm
#17126 opened
Apr 24, 2025 -
Improve configs - `ModelConfig`
#17130 opened
Apr 24, 2025 -
[Docs] Update structured output doc for V1
#17135 opened
Apr 24, 2025 -
[Misc] Only import amdsmi and _rocm_C on rocm platform
#17136 opened
Apr 24, 2025 -
[V1][Spec Decode] Make eagle compatible with prefix caching.
#17137 opened
Apr 24, 2025 -
[easy] Fix logspam on PiecewiseBackend errors
#17138 opened
Apr 24, 2025 -
[ROCm][FP8][Kernel] FP8 quantization fused into Custom Paged Attention
#17139 opened
Apr 24, 2025 -
[Kernel] FP8 quantization fused into V1 Triton Attention
#17143 opened
Apr 24, 2025 -
[Frontend][TPU] Enforce user input key args to reduce chance of large performance degradation
#17145 opened
Apr 24, 2025 -
[Model] Mamba2 causal conv1d Refactor to Split Prefill and Decode Requests for Corresponding Kernels
#17146 opened
Apr 25, 2025 -
Add xLAM tool parser support
#17148 opened
Apr 25, 2025 -
[Misc] Add gemma3 chat template with pythonic-style function calling
#17149 opened
Apr 25, 2025 -
[Hardware][Power] Enable compressed tensor W8A8 INT8 quantization for POWER
#17153 opened
Apr 25, 2025 -
[Bugfix] Modifications to error handling of multiple vllm api endpoints
#17165 opened
Apr 25, 2025 -
[CI] Add mteb testing to test the accuracy of the embedding model
#17175 opened
Apr 25, 2025 -
Add option "--expand-tools-even-if-tool-choice-none"
#17177 opened
Apr 25, 2025 -
[Bugfix] support local dataset path in benchmark_serving
#17179 opened
Apr 25, 2025 -
[Hardware][Intel-Gaudi] Update hpu-extension and update bucketing system for HPU device
#17186 opened
Apr 25, 2025 -
[V1] Remove num_input_tokens from attn_metadata
#17193 opened
Apr 25, 2025 -
[WIP][Bugfix] Fix 'MistralTokenizer' object has no attribute 'init_kwargs'
#17195 opened
Apr 25, 2025 -
[Bugfix] Fix Lora Name Parsing
#17196 opened
Apr 25, 2025 -
[Security] Don't bind tcp zmq socket to all interfaces
#17197 opened
Apr 25, 2025 -
[WIP] Support vLLM in transformers hybrid attention implementation
#17198 opened
Apr 25, 2025 -
[Hardware][Apple] Allows VLLM_TARGET_DEVICE=empty on MacOs
#17200 opened
Apr 25, 2025 -
[Misc] Add configurable cuda graph size
#17201 opened
Apr 25, 2025 -
[Benchmark] Add single turn MTBench to Serving Bench
#17202 opened
Apr 25, 2025 -
[Misc][Tools][Benchmark] Publish script to auto tune server parameters
#17207 opened
Apr 25, 2025 -
[V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE
#17211 opened
Apr 26, 2025 -
[Bugfix] Fix standard models tests
#17217 opened
Apr 26, 2025 -
[Doc] Clarify note for H2O-VL
#17219 opened
Apr 26, 2025 -
[Bugfix] Get a specific type of layer from forward context
#17222 opened
Apr 26, 2025 -
[Bugfix] Fix missing int type for `-n` in multi-image example
#17223 opened
Apr 26, 2025
115 Issues closed by 38 people
-
[Bug]: msgspec.DecodeError: MessagePack data is malformed: trailing characters (byte 13)
#15207 closed
Apr 26, 2025 -
[Misc]: qwen2 inference results from vLLM and transformers are not aligned
#11478 closed
Apr 26, 2025 -
[Bug]: Prefix caching doesn't work for LlavaOneVision
#11371 closed
Apr 26, 2025 -
[Bug]: error when start in multiple GPU
#11467 closed
Apr 26, 2025 -
[Bug]: 'int' object has no attribute 'parser_state'
#11498 closed
Apr 26, 2025 -
[Bug]: Qwen2.5-72B-Instruct inference fails on A800
#11506 closed
Apr 26, 2025 -
[Bug]: The CPU usage is very low when inference is performed on the ARM CPU
#11511 closed
Apr 26, 2025 -
[Bug]: vllm 0.6.5 running Qwen2-VL-7B-Instruct: LoRA loads successfully but has no effect
#11525 closed
Apr 26, 2025 -
[Bug]: Two beginning of sequence tokens for Llama-3.2-3B-Instruct
#16028 closed
Apr 25, 2025 -
[Bug]: Two BOS when using chat
#16853 closed
Apr 25, 2025 -
[Bug]: run on cpu: ModuleNotFoundError: No module named 'vllm.benchmarks'
#15812 closed
Apr 25, 2025 -
[Bug]: KeyError in mm_input_cache when processing multimodal requests with Qwen2.5-VL-72B
#16875 closed
Apr 25, 2025 -
[Tracker] Merge security fixes for v0.8.5
#17128 closed
Apr 25, 2025 -
[Bug]: Invalid Mistral ChatCompletionRequest Body Exception
#16774 closed
Apr 25, 2025 -
[Bug]: API Returns Only Single Result Despite n=8 Parameter Setting
#17173 closed
Apr 25, 2025 -
[Usage]: Does vLLM support QwQ 32B + tool calling?
#17061 closed
Apr 25, 2025 -
[Bug]: [Feature]: I want to extend the vLLM MoE functionality to support a variable number of experts.
#17150 closed
Apr 25, 2025 -
[Bug]: Remove fallback to outlines for int/number range and pattern constraints in guided_json
#16723 closed
Apr 25, 2025 -
[Bug]: GLM-4-32B-0414-FP8 output !!!!! error (tensor is nan)
#17154 closed
Apr 25, 2025 -
[Bug]: MiniCPM3 failed on ascend npu because of ModuleNotFoundError: No module named 'triton'
#16955 closed
Apr 25, 2025 -
[Bug]: Cannot run MiniCPMV on OpenVINO
#12384 closed
Apr 25, 2025 -
[Feature]: expose the tqdm progress bar to enable logging the progress
#6154 closed
Apr 25, 2025 -
[Bug]: KeyError: 'layers.0.self_attn.qkv_proj.weight'
#9595 closed
Apr 25, 2025 -
[Bug]: Qwen2-VL-7B with sglang (vLLM-back) Performance Degradation on MME benchmark
#10588 closed
Apr 25, 2025 -
[Usage]: Client-Side Error Handling for VLLM in a Client-Server Architecture
#11487 closed
Apr 25, 2025 -
[Bug]: EADDRINUSE (-98) error when setting up NCCL communicator
#15987 closed
Apr 25, 2025 -
[Usage]: How to get log probabilities for existing tokens in assistant message?
#16686 closed
Apr 24, 2025 -
[Feature]: vLLM DP=2 didn't speed up training at low batch size.
#17129 closed
Apr 24, 2025 -
[Bug]: xgrammar missing file crashes the server
#16030 closed
Apr 24, 2025 -
[Doc]: Documentation source code hyperlinks do not always point to the correct source code
#17120 closed
Apr 24, 2025 -
[Bug]: guided_json request errors in v0.7.2
#15073 closed
Apr 24, 2025 -
[Bug]:
#15329 closed
Apr 24, 2025 -
[Usage]: OpenAI Server API
#17075 closed
Apr 24, 2025 -
[Bug]: ValueError: Model architectures ['OPTForCausalLM'] failed to be inspected.
#17031 closed
Apr 24, 2025 -
[Usage]: How to log incoming requests (inputs and outputs) in vllm serve ?
#12336 closed
Apr 24, 2025 -
[Bug]: When use `guided choice` feature, vllm.engine.async_llm_engine.AsyncEngineDeadError
#8100 closed
Apr 24, 2025 -
[Usage]: RuntimeError: Failed to infer device type (Intel Iris Xe Graphics)
#8863 closed
Apr 24, 2025 -
[Bug]: AsyncLLMEngine CUDA runtime error 'device-side assert triggered'
#8948 closed
Apr 24, 2025 -
[Installation]: Segmentation fault when building Docker container on WSL
#10575 closed
Apr 24, 2025 -
[Bug]: Crash with Qwen2-Audio Model in vLLM During Audio Processing
#10627 closed
Apr 24, 2025 -
[Bug]: Prefill/decode separation leads to blocking and crashing in multi concurrent scenarios
#11445 closed
Apr 24, 2025 -
[Bug]: InternVL2-40B Inference Precision Problem
#11454 closed
Apr 24, 2025 -
[Misc]: Molmo inference multi-GPU
#11468 closed
Apr 24, 2025 -
[Usage]: How to figure out why vllm returns nothing but trt-llm returns a meaningful result
#11473 closed
Apr 24, 2025 -
[Bug]: CI Build image failure due to mamba-ssm==2.2.4 installation error
#17068 closed
Apr 23, 2025 -
[Bug]: Llama4 Scout fails on H200
#16414 closed
Apr 23, 2025 -
[Feature]: guided decoding on TPU
#11104 closed
Apr 23, 2025 -
[Usage]: Qwen/QwQ-32B
#16931 closed
Apr 23, 2025 -
[Bug]: Qwen/Qwen2.5-VL-3B-Instruct doesn't identify tools
#16797 closed
Apr 23, 2025 -
[Bug]: Error when running Llama-4-Maverick-17B-128E-Instruct-FP8 on mi300x
#16474 closed
Apr 23, 2025 -
[Usage]: Customized model parameters on different devices
#16981 closed
Apr 23, 2025 -
[Bug]: vllm stopped at vLLM is using nccl==2.21.5
#16772 closed
Apr 23, 2025 -
[Bug]: AssertionError: skip_special_tokens=False is not supported for Mistral tokenizers
#16958 closed
Apr 23, 2025 -
[Bug]: Qwen2.5-VL-72B Inference
#16997 closed
Apr 23, 2025 -
[Bug]: Running Llama4 Scout 16E with 10000 input length triggers a vllm crash, but runs fine with FA2.
#16948 closed
Apr 23, 2025 -
[test] llama4 testing issue
#17025 closed
Apr 23, 2025 -
[Usage]: how to change the path of downloaded models
#16975 closed
Apr 23, 2025 -
[Bug]: /usr/bin/ld: cannot find -lcuda: No such file or directory, when running inference
#16984 closed
Apr 23, 2025 -
Error in running 'python -m vllm.entrypoints.openai.api_server '
#11411 closed
Apr 23, 2025 -
[Bug]: Extra body don't work when response_format is also sent for serving.
#7337 closed
Apr 23, 2025 -
[Bug]: vLLM 0.5.5 and FlashInfer0.1.6
#8091 closed
Apr 23, 2025 -
[Bug]: GPU memory usage keeps growing after the server has been running for a while
#8413 closed
Apr 23, 2025 -
[Bug]: Lora refuses to load from disk without extremely weird manipulations with file paths
#9063 closed
Apr 23, 2025 -
[Bug]: vLLM multi-step scheduling crashes when input prompt is long
#10009 closed
Apr 23, 2025 -
[Usage]: Removal of vllm.openai.rpc folder in vLLM 0.6.2 release
#10766 closed
Apr 23, 2025 -
[Performance]: Performance degradation due to CPU bottleneck when serving embedding models to GPUs
#11320 closed
Apr 23, 2025 -
[Bug]: no output of profile when VLLM_TORCH_PROFILER_DIR is enabled for vllm serve
#11346 closed
Apr 23, 2025 -
[Feature]: c4ai-command-r-plus-08-2024 tool choice support
#11405 closed
Apr 23, 2025 -
[RFC]: The two features i wish vllm has
#11410 closed
Apr 23, 2025 -
[Misc]: How to Profile Both EngineCoreClient and EngineCoreProc Activities in V1 Using Profiler
#11413 closed
Apr 23, 2025 -
[Bug]: 0.6.5 randomly closes connection/drops requests
#11421 closed
Apr 23, 2025 -
[Bug]: top k isn't deterministic
#16945 closed
Apr 22, 2025 -
[RFC]: tool_calls and None types.
#16678 closed
Apr 22, 2025 -
[Feature]: suggest passing a split tensor to RLHF vllm's load_weights when tp>1
#16820 closed
Apr 22, 2025 -
[Bug]: VLLM config not set when using Flash Infer backend.
#13207 closed
Apr 22, 2025 -
[Bug]: vllm 0.8.x unable to load model from S3 using runai_streamer but works in 0.7.3
#16926 closed
Apr 22, 2025 -
[Doc]: Add documents on multimodal args
#16922 closed
Apr 22, 2025 -
[Usage]: what is the most efficient way to run a 72b model on 8 * A100?
#12205 closed
Apr 22, 2025 -
[Bug]: GuidedDecodingParams choice - Request-level structured output backend must match engine-level backend
#16738 closed
Apr 22, 2025 -
[Bug]: [V1] New v1 engine does not support n>1?
#12584 closed
Apr 22, 2025 -
[Bug]: leaked instance 0xfffc8c22b108 of type "xgrammar.xgrammar_bindings.GrammarCompiler"
#16951 closed
Apr 22, 2025 -
[Installation]: Fail to build vllm from the latest source code
#16897 closed
Apr 22, 2025 -
[Bug]: [V1] Random infinite response generation followed by silent crash
#16151 closed
Apr 21, 2025 -
[Bug]: V1 engine Index Error When Single Request Near Max Context Length LLaMA 4
#16157 closed
Apr 21, 2025 -
[Bug]: [RLHF] Weights update broken with V1 multiprocessing
#16434 closed
Apr 21, 2025 -
[Bug]: Multi-GPU (TP > 1) vLLM serve docker timeout during startup
#16514 closed
Apr 21, 2025 -
[Bug]: glm.py rotary_dim bug
#16904 closed
Apr 21, 2025 -
[Bug]: vllm 0.8.3 abnormal TTFT (too long) in the first serving
#16858 closed
Apr 21, 2025 -
[Bug]: Pooling last token differences with Sentence Transformers for embedding models
#16892 closed
Apr 21, 2025 -
[Feature]: Add CLI Commands for Benchmarking
#13840 closed
Apr 21, 2025 -
[New Model]: nvidia/Hymba-1.5B-Base
#10783 closed
Apr 21, 2025 -
[Usage]: Is pipeline parallelism supported on machines that are not in the same local network?
#11285 closed
Apr 21, 2025 -
[Misc]: What is 'residual' used for in the IntermediateTensor class?
#11364 closed
Apr 21, 2025 -
Where does the default KV cache size of 43328 come from, and how can I change it?
#11391 closed
Apr 21, 2025 -
[Bug]: After wake up from level 2 sleep, model cannot load weights properly
#16564 closed
Apr 20, 2025 -
[New Model]: Qwen/QwQ-32B-Preview
#10737 closed
Apr 20, 2025 -
[Bug]: Incomplete tool calling response for pipeline-parallel vllm with ray
#7194 closed
Apr 20, 2025 -
[Bug]: AttributeError: 'RayGPUExecutorAsync' object has no attribute 'forward_dag'
#7871 closed
Apr 20, 2025 -
[Doc]: Offline Inference Distributed
#8966 closed
Apr 20, 2025 -
[Usage]: how to use EAGLE on vLLM?
#11126 closed
Apr 20, 2025 -
[Bug]: Paligemma 2 model loading error
#11343 closed
Apr 20, 2025 -
[Feature]: meta-llama/Prompt-Guard-86M Usage Value Error.
#11360 closed
Apr 20, 2025 -
[Bug]: vLLM crashes on tokenized embedding input
#11375 closed
Apr 20, 2025 -
[Usage]: How do I run offline batch inference with Llama 405B BF16 across multinode (via SLURM)
#11379 closed
Apr 20, 2025 -
[Feature]: Benchmarks for audio models
#16354 closed
Apr 19, 2025
108 Issues opened by 101 people
-
[Usage]: EOFError when loading Qwen/Qwen2.5-32B-Instruct
#17218 opened
Apr 26, 2025 -
[Installation]: torch 2.6.0 unavailable for intel mac
#17212 opened
Apr 26, 2025 -
[Installation]: Can't get Mistral-Small-3.1-24B-Instruct-2503-Q6_K to load on Docker (local or HF)
#17210 opened
Apr 25, 2025 -
[Feature]: Support Lora for Beam Search
#17205 opened
Apr 25, 2025 -
[Feature]: Inflight BNB quantization for Mixtral models
#17199 opened
Apr 25, 2025 -
[Bug]: DP with sampling hangs after completing generation
#17194 opened
Apr 25, 2025 -
[RFC]: Custom sampling params support in REST API
#17191 opened
Apr 25, 2025 -
[Bug]: vllm LLM utils.py resolve_obj_by_qualname ValueError: not enough values to unpack (expected 2, got 1)
#17188 opened
Apr 25, 2025 -
[Installation]: deployment failure on Kubernetes with CPU device (testing).
#17187 opened
Apr 25, 2025 -
[Usage]: How to deploy tensorized vllm model (deserialize) as api_server?
#17178 opened
Apr 25, 2025 -
[Bug]: `uv run vllm serve` with DP results in NCCL error: two ranks use the same device
#17176 opened
Apr 25, 2025 -
[Installation]: Pinned version of OpenTelemetry in requirements
#17174 opened
Apr 25, 2025 -
[Usage]: I want to create custom docker image by adding my code
#17172 opened
Apr 25, 2025 -
[Bug]: Qwen2VL-2b / Qwen2.5-7b has AssertionError and Cuda error when qps goes higher
#17171 opened
Apr 25, 2025 -
[Bug]: HIP error: invalid device function
#17170 opened
Apr 25, 2025 -
[Installation]: Bloated docker image size causes problems on k8s
#17163 opened
Apr 25, 2025 -
[Bug]: failed to run distributed inference with vllm 0.8.2
#17160 opened
Apr 25, 2025 -
[Bug]: GLM-Z1 outputs garbled text with vllm batch inference
#17157 opened
Apr 25, 2025 -
[Bug]: DeepSeek Lora inference has no effect.
#17155 opened
Apr 25, 2025 -
[Bug]: LLVM ERROR: Failed to compute parent layout for slice layout. when using fp16
#17152 opened
Apr 25, 2025 -
[Bug]: Why does the deployment hang when deploying qwen2.5-vl-32b-instruct?
#17151 opened
Apr 25, 2025 -
[Bug]: waiting reqs vanish!
#17147 opened
Apr 25, 2025 -
Missing Opening <think> for Qwen32B
#17144 opened
Apr 24, 2025 -
[RFC]: Native support for Mamba, SSM, and hybrid transformer models in vLLM V1
#17140 opened
Apr 24, 2025 -
[Feature]: Support for image linebreak tokens for vision model
#17127 opened
Apr 24, 2025 -
[Feature]: Automatically detect numerical issues
#17123 opened
Apr 24, 2025 -
[Bug]: jinja2 TemplateError should return 422 instead of 500 error code
#17119 opened
Apr 24, 2025 -
[Bug]: Why does torch.cuda.memory_allocated() remain unchanged after calling sleep()?
#17117 opened
Apr 24, 2025 -
[Installation]: vllm/vllm-tpu image doesn't have :latest tag
#17114 opened
Apr 24, 2025 -
[Bug]: Tool calls data comes in content field after text chunks
#17109 opened
Apr 24, 2025 -
[Feature]: Add Support to Video Generation Models
#17106 opened
Apr 24, 2025 -
[Bug]: AsyncLLM sleep then wake_up produces meaningless outputs
#17103 opened
Apr 24, 2025 -
[Bug]: Shutdown during Qwen2.5-VL-72B inference on 4 A800s
#17101 opened
Apr 24, 2025 -
[Bug]: Failed to run dp+tp in 2 GPU Nodes
#17095 opened
Apr 24, 2025 -
Tool call argument parsing failed
#17089 opened
Apr 24, 2025 -
[Bug]: raise NotImplementedError
#17086 opened
Apr 24, 2025 -
[Bug]: Importing DeepSpeed causes crash in vLLM when running with data parallelism and TP=1
#17079 opened
Apr 23, 2025 -
[Bug]: noop elimination for slice errors when end = -1
#17078 opened
Apr 23, 2025 -
[Bug]: Aria model error due to version mismatch with transformers
#17077 opened
Apr 23, 2025 -
[RFC]: Implement structural_tag support in structured output
#17076 opened
Apr 23, 2025 -
[Feature]: GGUF support for GLM4
#17069 opened
Apr 23, 2025 -
[RFC]: All Ops should be determined during init and wrapped in a Layer Module to avoid envs.ENVIRON overhead
#17067 opened
Apr 23, 2025 -
[Performance]: UVA vs UVM for CPU offloading on v0.8.4+
#17062 opened
Apr 23, 2025 -
[Bug]: Issue with SpecDecode when using data parallel
#17056 opened
Apr 23, 2025 -
[Bug]: ValueError when using Multi-Instance GPU
#17047 opened
Apr 23, 2025 -
[Usage]: I have 2 nodes 16 GPUs, how can i use 16 dp+16 ep to run deepseek v3?
#17041 opened
Apr 23, 2025 -
[Bug]: Many endpoints are returning 500 Internal Server Error
#17038 opened
Apr 23, 2025 -
[Bug]: Undocumented HTTP Status Codes for vllm endpoints
#17037 opened
Apr 23, 2025 -
[Bug]: Multiple openai endpoint Missing Content-Type Header
#17036 opened
Apr 23, 2025 -
[Usage]: DeepSeek R1 on a 8xH200 node is too slow
#17035 opened
Apr 23, 2025 -
[Feature]: add hostname in metrics for clustering deployment
#17029 opened
Apr 23, 2025 -
[Bug]: ```image_grid_thw``` not set in ```CachedRequestState``` - ```Qwen2.5 VL 3B```
#17007 opened
Apr 22, 2025 -
[Performance]: Distributed Inference w/ & w/o RDMA over Infiniband
#17006 opened
Apr 22, 2025 -
[Usage]: multilora_inference with max_loras>1
#17003 opened
Apr 22, 2025 -
[Bug]: Guided Decoding Backend options with the OpenAI server recently broken
#17002 opened
Apr 22, 2025 -
[Feature]: Automatically Enable Modality Specific Loras
#16994 opened
Apr 22, 2025 -
[Bug]: vLLM sleep experiences segmentation fault when used in TRL
#16993 opened
Apr 22, 2025 -
[Bug]: `original_load_name` undefined with certain torch versions
#16987 opened
Apr 22, 2025 -
[Bug]: Performance degradation with increasing number of requests in long-running vLLM inference sessions
#16985 opened
Apr 22, 2025 -
[Bug]: Is the logic order correct during the scheduler procedure?
#16982 opened
Apr 22, 2025 -
[Feature]: Enable Partial Guided Decoding / Structured Output Support
#16979 opened
Apr 22, 2025 -
[Bug]: unable to automatically set CUDA_VISIBLE_DEVICES correctly for v0 engine data parallel
#16978 opened
Apr 22, 2025 -
[RFC]: scheduling policy optimization in vLLM
#16969 opened
Apr 22, 2025 -
[Bug]: cpu core 100%
#16968 opened
Apr 22, 2025 -
[Bug]: The output of MathResponse is empty when running THUDM/GLM-Z1-32B-0414 with vLLM-0.8.4
#16967 opened
Apr 22, 2025 -
[Bug]: vllm 0.8.4 whisper possible memory leak?
#16966 opened
Apr 22, 2025 -
[Usage]: How can vllm process multiple prompts within single request on server
#16965 opened
Apr 22, 2025 -
[Bug]: vllm 0.8.3 v1 startup time is too long when using multi lora
#16961 opened
Apr 22, 2025 -
[Bug]: DataParallel on multinode unable to start GPU
#16957 opened
Apr 22, 2025 -
[Bug]: Fail to use deepseek vl2 with images, maybe need a new chat template?
#16953 opened
Apr 22, 2025 -
[Performance]: Why/How vLLM uses CPU memory?
#16947 opened
Apr 21, 2025 -
[New Model]: nemotron Super GGUF
#16944 opened
Apr 21, 2025 -
[Doc]: update contributing guide for macOS Apple silicon
#16940 opened
Apr 21, 2025 -
[Bug]: Phi-4-MM generates gibberish for large image input with v1 chunked prefill
#16934 opened
Apr 21, 2025 -
[Bug]: Pooling model adapter removes the attributes expected by model init
#16932 opened
Apr 21, 2025 -
[Bug]: SharedStorageConnector only see first batch of tokens
#16928 opened
Apr 21, 2025 -
[Doc]: state requirements for testing or update to work for CPU-only
#16920 opened
Apr 21, 2025 -
Qwen2.5 VL and gemma-3-12b error on VLLM 8.4
#16918 opened
Apr 21, 2025 -
[UI_Bug]: Content_Menu_and_Icon_Spacing_Issue_in_UI
#16917 opened
Apr 21, 2025 -
[Bug]: CPU Memory oom on 8*L40s when deploy meta-llama/Llama-4-Scout-17B-16E-Instruct
#16916 opened
Apr 21, 2025 -
[Bug]: vllm can't serve multi-audio input inference
#16914 opened
Apr 21, 2025 -
[Bug]: guided_grammar example syntax does not work
#16911 opened
Apr 21, 2025 -
[Bug]: Kimi-VL-A3B-Thinking Error
#16908 opened
Apr 21, 2025 -
[Bug]: architecture of models not correctly recognized
#16905 opened
Apr 21, 2025 -
[Bug]: mm_cache keyerror
#16903 opened
Apr 21, 2025 -
[Bug]: RuntimeError on RTX 5090: "no kernel image is available for execution on the device
#16901 opened
Apr 21, 2025 -
[Usage]: When deploying the GLM-4-32B BF16 model with vLLM 0.8.4, I encountered a GPU memory overflow
#16896 opened
Apr 21, 2025 -
[Feature]: Llama4 LoRA support
#16894 opened
Apr 20, 2025 -
[Bug]: tool_choice: "required" does not work for mistral
#16887 opened
Apr 20, 2025 -
[Usage]: Deciding max-num-seqs and max-num-batched-tokens for desired throughput
#16886 opened
Apr 20, 2025 -
[Usage]: Is it true that vllm doesn't support deepseek r1 yet with the v1 engine?
#16885 opened
Apr 20, 2025 -
[Bug]: internvl3-78B-AWQ
#16884 opened
Apr 20, 2025 -
[Bug]: Ngram speculative decoding doesn't work in vLLM 0.8.3/0.8.4 with VLLM_USE_V1 enabled.
#16883 opened
Apr 20, 2025 -
[Usage]: Request scheduling when using LoRA
#16876 opened
Apr 19, 2025 -
[New Model]: jinaai/jina-embeddings-v2-base-code
#16874 opened
Apr 19, 2025
358 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
[Feature] support sequence parallelism using compilation pass
#16155 commented on
Apr 26, 2025 • 41 new comments -
[V1][Metrics] add support for kv event publishing
#16750 commented on
Apr 26, 2025 • 38 new comments -
[Kernel] some optimizations for dense marlin and moe marlin
#16850 commented on
Apr 24, 2025 • 35 new comments -
[Model] support MiniMax-VL-01 model
#16328 commented on
Apr 25, 2025 • 32 new comments -
[Kernel] Adding basic Triton JitCache for triton_attn
#16606 commented on
Apr 24, 2025 • 24 new comments -
[V1][Feature] Enable Speculative Decoding with Structured Outputs
#14702 commented on
Apr 25, 2025 • 22 new comments -
[Hardware/NVIDIA/Kernel] Enable nvidia/DeepSeek-R1-FP4 Model
#16362 commented on
Apr 25, 2025 • 17 new comments -
[Model][Frontend] Adding timeseries modality support and Qwen2.5-ChatTS model support
#16852 commented on
Apr 21, 2025 • 15 new comments -
[Core] Support full cuda graph in v1
#16072 commented on
Apr 25, 2025 • 14 new comments -
Add default local directory LoRA resolver plugin.
#16855 commented on
Apr 24, 2025 • 12 new comments -
[torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass
#16756 commented on
Apr 25, 2025 • 11 new comments -
[Kernel][Triton][FP8] Adding fp8 and variable length sequence support to Triton FAv2 kernel
#12591 commented on
Apr 26, 2025 • 10 new comments -
Update PyTorch to 2.7.0
#16859 commented on
Apr 26, 2025 • 10 new comments -
[Core] [Bugfix] Add Input Embeddings
#15428 commented on
Apr 24, 2025 • 9 new comments -
[MODEL ADDITION] Ovis2 Model Addition
#15826 commented on
Apr 25, 2025 • 9 new comments -
[Hardware][TPU][V1] Multi-LoRA implementation for the V1 TPU backend
#14238 commented on
Apr 25, 2025 • 8 new comments -
[Model] Add Granite Speech Support
#16246 commented on
Apr 26, 2025 • 8 new comments -
[V0][V1][Core] Add outlines integration for V1, and update V0 integration.
#15975 commented on
Apr 24, 2025 • 8 new comments -
[V1] vLLM OpenAI API custom args
#16862 commented on
Apr 23, 2025 • 8 new comments -
[TPU] Increase block size and reset block shapes
#16458 commented on
Apr 26, 2025 • 7 new comments -
[NVIDIA] Support Cutlass MLA for Blackwell GPUs
#16032 commented on
Apr 26, 2025 • 7 new comments -
[CPU] Support torch compile in CPU backend
#15020 commented on
Apr 22, 2025 • 7 new comments -
[FEAT] [ROCm]: Support AITER Linear
#14916 commented on
Apr 24, 2025 • 6 new comments -
[Misc] Add fully interleaved support for multimodal 'string' content format
#14047 commented on
Apr 22, 2025 • 6 new comments -
Add `pt_load_map_location` to allow loading to cuda
#16869 commented on
Apr 25, 2025 • 6 new comments -
[WIP] Add Flex to V1
#16078 commented on
Apr 25, 2025 • 5 new comments -
[Frontend] Reduce vLLM's import time
#15128 commented on
Apr 25, 2025 • 5 new comments -
[Misc] support multi-node data parallel
#15863 commented on
Apr 25, 2025 • 4 new comments -
Disable the torch.compile cache checks when VLLM_DISABLE_COMPILE_CACHE=1
#16573 commented on
Apr 25, 2025 • 3 new comments -
[FEAT] [ROCm]: Add AITER Block-Scaled GEMM Feature
#14968 commented on
Apr 23, 2025 • 2 new comments -
[Bugfix][V0] Another multi-sequence logprobs streaming edge case
#16805 commented on
Apr 23, 2025 • 2 new comments -
[Misc] Add Next Edit Prediction (NEP) datasets support in `benchmark_serving.py`
#16839 commented on
Apr 24, 2025 • 2 new comments -
[Misc] improve chat_with_tools example
#16044 commented on
Apr 25, 2025 • 2 new comments -
Add cutlass support for blackwell fp8 blockwise gemm
#14383 commented on
Apr 25, 2025 • 2 new comments -
Online Rotations to vLLM
#16443 commented on
Apr 25, 2025 • 2 new comments -
[Kernel] GGUF MoeVec kernel
#16780 commented on
Apr 25, 2025 • 2 new comments -
Adding Share Expert Fusion for DeepSeek
#15502 commented on
Apr 23, 2025 • 1 new comment -
[Bugfix] set correct lora mapping when compute prompt logprobs
#16694 commented on
Apr 26, 2025 • 1 new comment -
Support loading transformers models with named parameters
#16868 commented on
Apr 25, 2025 • 1 new comment -
[Distributed] Tensor Parallel RMSNorm
#10542 commented on
Apr 24, 2025 • 0 new comments -
[Core][Bugfix] Use correct device to initialize GPU data during CUDA-graph-capture
#10608 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: LLVM ERROR: Failed to compute parent layout for slice layout.
#15235 commented on
Apr 24, 2025 • 0 new comments -
Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support
#11844 commented on
Apr 21, 2025 • 0 new comments -
[Frontend] Disaggregate prefill decode with zmq
#11791 commented on
Apr 22, 2025 • 0 new comments -
[Misc] Allow LoRA to adaptively increase rank and remove possible_max_ranks
#10623 commented on
Apr 22, 2025 • 0 new comments -
[Feature]: Enable CUDA Graph without turn on torch.compile / Inductor for V1
#15896 commented on
Apr 24, 2025 • 0 new comments -
[Frontend] [Bugfix] Refactor tool parsers and simplify the tool parsing interface.
#11554 commented on
Apr 25, 2025 • 0 new comments -
[Frontend] improve hermes_tool_parser.py
#11453 commented on
Apr 25, 2025 • 0 new comments -
fix: add missing bos_token to example templates
#11432 commented on
Apr 25, 2025 • 0 new comments -
[Hardware][CPU] Refactor CPU vector types for ISAs
#10787 commented on
Apr 22, 2025 • 0 new comments -
[Model] Working BNB for InternVL.
#11095 commented on
Apr 24, 2025 • 0 new comments -
[torch.compile] Add torch inductor pass for fusing silu_and_mul with subsequent scaled_fp8_quant operations
#10867 commented on
Apr 24, 2025 • 0 new comments -
[Misc][Benchmark]feat(benchmarks): Add async_request_generate function to support generate endpoint
#16421 commented on
Apr 24, 2025 • 0 new comments -
[CI/Build] Add support for Python 3.13
#13164 commented on
Apr 23, 2025 • 0 new comments -
[Bugfix] Adjust tool call handling in llama template to support single tool calls only
#12938 commented on
Apr 25, 2025 • 0 new comments -
[Bugfix] Update chat_utils.py to avoid issues when tool call is present but None
#12788 commented on
Apr 25, 2025 • 0 new comments -
[Frontend] Adding the "User Defined Custom Tool Calling" parser for the Llama models
#12752 commented on
Apr 25, 2025 • 0 new comments -
[Core] Add Additional Metrics to vLLM Server
#12726 commented on
Apr 25, 2025 • 0 new comments -
[Bug]: xgrammar==0.17 not work when guided
#15790 commented on
Apr 24, 2025 • 0 new comments -
[Core][AMD] Migrate fully transparent sleep mode to ROCm platform
#12695 commented on
Apr 23, 2025 • 0 new comments -
[Bugfix] Fix quark fp8 format loading on AMD GPUs
#12612 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: gemma 3 structured output api occurs assertion error
#15766 commented on
Apr 24, 2025 • 0 new comments -
[Bugfix][Spec Decode][V0] fix: update logits processor for MQA scoring
#12537 commented on
Apr 21, 2025 • 0 new comments -
add support for AMD MI25/50/60
#12431 commented on
Apr 25, 2025 • 0 new comments -
[Core] Make disaggregated prefill compatible with pipeline parallelism
#12301 commented on
Apr 24, 2025 • 0 new comments -
[Core] Optimize topp/topk calculation in sampler
#12156 commented on
Apr 24, 2025 • 0 new comments -
[Doc] update docs for nightly benchmarks
#12022 commented on
Apr 22, 2025 • 0 new comments -
Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor)
#12010 commented on
Apr 22, 2025 • 0 new comments -
[Spec Decode][V0] feat: support LoRA with speculative decoding
#11966 commented on
Apr 21, 2025 • 0 new comments -
[Spec Decode] Add Script for converting HF Eagle checkpoint to vLLM compatible checkpoint
#11866 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: Mistral 3.1 Small Image inference is broken on 0.8.4
#16675 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Support Gemma 3 QAT series
#16856 commented on
Apr 25, 2025 • 0 new comments -
[Performance]: vllm Eagle performance is worse than expected
#9565 commented on
Apr 25, 2025 • 0 new comments -
[Bug]: After deploying qwen2.5_vl_72b with vllm, requests are initially normal (3-5 s each) but gradually slow down to about 60 s each after some time of use. Has anyone else seen this?
#13886 commented on
Apr 25, 2025 • 0 new comments -
[Usage]: Is it possible to use `meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8` with vLLM?
#12411 commented on
Apr 25, 2025 • 0 new comments -
[Feature] [ROCm]: AITER Kernel Integration
#14964 commented on
Apr 25, 2025 • 0 new comments -
[Bug]: AssertionError when using automatic prefix caching and prompt_logprobs
#8268 commented on
Apr 25, 2025 • 0 new comments -
[Bug]: [v0.6.5] Streaming tool call responses with the hermes template is inconsistent with the non-stream version.
#11392 commented on
Apr 25, 2025 • 0 new comments -
[Bug]: InternVL2-26B-AWQ Service startup failure
#12404 commented on
Apr 25, 2025 • 0 new comments -
[Feature]: The tool_choice option required is not yet supported but on the roadmap.
#11700 commented on
Apr 25, 2025 • 0 new comments -
[Feature]: Llama3.3 Tool calling support or a Geneneric and extensible llama tool calling support
#11799 commented on
Apr 25, 2025 • 0 new comments -
[New Model]: Support Efficient-Large-Model/NVILA
#11887 commented on
Apr 25, 2025 • 0 new comments -
[Usage]: Automated Tool Calling for OLMoForCausalLM
#12263 commented on
Apr 25, 2025 • 0 new comments -
[Usage]: Is it possible to speed up the generation speed by adding another video card?
#12322 commented on
Apr 25, 2025 • 0 new comments -
[Usage]: how to use tool calling with auto option, setting the tool works
#12349 commented on
Apr 25, 2025 • 0 new comments -
[Bug]: Inference with gguf returns garbage
#12364 commented on
Apr 25, 2025 • 0 new comments -
[Usage]: How to run vllm with regression task, just like classify task
#12379 commented on
Apr 25, 2025 • 0 new comments -
[Usage]: mistralai/Ministral-8B-Instruct-2410 scale to 128k context length.
#12385 commented on
Apr 25, 2025 • 0 new comments -
[Feature]: Consider integrating SVDquant (W4A4 quantization) from Nunchaku project
#12399 commented on
Apr 25, 2025 • 0 new comments -
[Usage]: Overwhelmed trying to find out information about how to serve Llama-3 70b to multiple users with 128k context
#12400 commented on
Apr 25, 2025 • 0 new comments -
Reshape cache flash kernel to support HND layout
#8200 commented on
Apr 23, 2025 • 0 new comments -
[BugFix] Fix the lm_head in gpt_bigcode in lora mode
#6357 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Support for jina-embeddings-v2-small-en
#16639 commented on
Apr 26, 2025 • 0 new comments -
[SpecDecode] Support EAGLE in V1
#15901 commented on
Apr 26, 2025 • 0 new comments -
[Feature]: Audit and Update Examples To Use `VLLM_USE_V1=1`
#14530 commented on
Apr 26, 2025 • 0 new comments -
[Usage]: How to increase the generation throughput of Qwen-0.5B
#14023 commented on
Apr 26, 2025 • 0 new comments -
[Bug]: v0.8.2 vLLM engine crashes when starting after V1 environment variable is enabled with deepseek-r1
#15769 commented on
Apr 26, 2025 • 0 new comments -
[Feature]: Implement Priority Scheduling In V1 Engine
#14002 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Can't deserialize object: ObjectRef,DeepSeek R1, H20*16, pp2, tp8, v1 engine
#15333 commented on
Apr 26, 2025 • 0 new comments -
[Feature]: Improve Logging for Error Messages
#14083 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Support Inflight quantization: load as 8bit quantization.
#11655 commented on
Apr 26, 2025 • 0 new comments -
[Bug]: FP8 Quantization with enforce_eager=False Causes Gibberish Output on Llama-4-Scout Model (VLLM_USE_V1=1)
#16337 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Return hidden states (in progress?)
#6165 commented on
Apr 25, 2025 • 0 new comments -
[Bug]: Guided generation throws 500 error or endless generation in vllm serve for mistral small 2501
#13260 commented on
Apr 25, 2025 • 0 new comments -
[Bug]: Bug in LRUEvictor: priority_queue and free_table desynchronization cause error
#16825 commented on
Apr 25, 2025 • 0 new comments -
unload the model
#3281 commented on
Apr 25, 2025 • 0 new comments -
[Feature]: Allow head_size smaller than 128 on TPU with Pallas backend
#10343 commented on
Apr 25, 2025 • 0 new comments -
[RFC]: Data Parallel Attention and Expert Parallel MoEs
#16037 commented on
Apr 25, 2025 • 0 new comments -
[Bug]: Vllm 0.8.2 + Ray 2.44 (Ray serve deployment) fallbacks to V0 Engine
#15569 commented on
Apr 25, 2025 • 0 new comments -
[ROCm] (Deprecated) Enable AITER Tkw1 kernel
#16418 commented on
Apr 19, 2025 • 0 new comments -
Fix cuda_version_str reset logic.
#16400 commented on
Apr 24, 2025 • 0 new comments -
[WIP]Docker Release
#16396 commented on
Apr 22, 2025 • 0 new comments -
[V1] Add request-level, per-step acceptance counts tracking for spec dec.
#16367 commented on
Apr 25, 2025 • 0 new comments -
Add NeuronxDistributedInference support, Speculative Decoding, Dynamic on-device sampling
#16357 commented on
Apr 22, 2025 • 0 new comments -
[Model][VLM] Add Qwen2.5-Omni model support (end-to-end full support)
#16347 commented on
Apr 25, 2025 • 0 new comments -
[Bugfix][Frontend] Add missing "type":"function" in tool call streaming responses
#16346 commented on
Apr 25, 2025 • 0 new comments -
[V1][Spec Decode] Add random seed for EAGLE and its test script
#16235 commented on
Apr 23, 2025 • 0 new comments -
[MISC][Bugfix] Use less CPU when message queue has been empty for some time
#16226 commented on
Apr 21, 2025 • 0 new comments -
[Model] set default attn tmp scaling to True for llama4
#16216 commented on
Apr 26, 2025 • 0 new comments -
Support embedding models in V1
#16188 commented on
Apr 24, 2025 • 0 new comments -
[WIP] Hybrid Memory Allocator
#16178 commented on
Apr 25, 2025 • 0 new comments -
[v1] Implement HybridKVCacheManager to support hybrid models with different KV cache type
#16101 commented on
Apr 26, 2025 • 0 new comments -
[Frontend] [Bugfix] Refactor tool parsers and simplify the tool parsing interface.
#16096 commented on
Apr 25, 2025 • 0 new comments -
[V1][Spec Decode] Non greedy sample with EAGLE / Reduce memory allocation for Rejection Sampler
#16077 commented on
Apr 25, 2025 • 0 new comments -
[ROCM] Add gfx950 to the custom attention archs
#16034 commented on
Apr 24, 2025 • 0 new comments -
[WIP][Feature] Support chunked prefill when using Deepseek MTP model as draft model
#15153 commented on
Apr 21, 2025 • 0 new comments -
[CORE] Eliminate Occasional Scheduling Delay for Parallel Sampling
#16849 commented on
Apr 22, 2025 • 0 new comments -
[V1] Async DP shutdown test
#16846 commented on
Apr 21, 2025 • 0 new comments -
[Misc] Raise ValueError for V1 during profiling when max_num_batched_tokens is too short
#16834 commented on
Apr 19, 2025 • 0 new comments -
Add quickreduce as alternative to custom allreduce
#16804 commented on
Apr 23, 2025 • 0 new comments -
[Kernel] Add Split-KV Attention Kernel to the triton_attn Backend
#16794 commented on
Apr 21, 2025 • 0 new comments -
[Bugfix] Fix cutlass dispatch for fp8/int8 to properly invoke M<=16 c…
#16751 commented on
Apr 21, 2025 • 0 new comments -
[CI] Enable test_initialization to run on V1
#16736 commented on
Apr 23, 2025 • 0 new comments -
[V1] LogitsProcessor interface
#16728 commented on
Apr 23, 2025 • 0 new comments -
[NIXL] vllm v0 nixl integration
#16677 commented on
Apr 21, 2025 • 0 new comments -
[V1][Spec Decode][Bugfix] Allocate lookahead token kvc in WAITING queue
#16613 commented on
Apr 23, 2025 • 0 new comments -
[Misc] Fix demo function call JSONDecodeError
#16595 commented on
Apr 25, 2025 • 0 new comments -
[V1] Structured Outputs + Thinking parser compatibility
#16577 commented on
Apr 26, 2025 • 0 new comments -
Remove scipy dep by implementing `resample_poly`
#16542 commented on
Apr 24, 2025 • 0 new comments -
Fix #15483 : Add error handling for model-dependent endpoints during sleep mode
#16536 commented on
Apr 22, 2025 • 0 new comments -
[Core] Enable IPv6 with vllm.utils.make_zmq_socket()
#16506 commented on
Apr 26, 2025 • 0 new comments -
Adding "amd_experimental: CI functionality to test all available test groups.
#16497 commented on
Apr 24, 2025 • 0 new comments -
[Bugfix][Model] fix Phi3Small model only support v0
#16493 commented on
Apr 22, 2025 • 0 new comments -
[Metrics] Log multi-modal cache stats
#16478 commented on
Apr 26, 2025 • 0 new comments -
Truncation control for embedding models
#14776 commented on
Apr 24, 2025 • 0 new comments -
[Quantization] Add Gemma2 and Gemma3 text model GGUF support
#14766 commented on
Apr 23, 2025 • 0 new comments -
[Core] Add a level 3 sleep/wake_up that offloads tensors to disk
#14678 commented on
Apr 25, 2025 • 0 new comments -
[Neuron][V1] Experimental support for neuron backend with V1 architecture
#14648 commented on
Apr 25, 2025 • 0 new comments -
[Hardware][Intel GPU] Add V1 engine support and `chunked_prefill` kernel
#14612 commented on
Apr 25, 2025 • 0 new comments -
[Misc] Using ruff-format for smaller sets of directories
#14485 commented on
Apr 22, 2025 • 0 new comments -
[Frontend] Pythonic tool names flexibility (#14470)
#14474 commented on
Apr 25, 2025 • 0 new comments -
[Kernel] Update cutlass FP8 blockwise to use upstream CUTLASS
#14395 commented on
Apr 24, 2025 • 0 new comments -
[Core] Add DoRA Support
#14389 commented on
Apr 22, 2025 • 0 new comments -
[Doc] Create tool_chat_template_llama3.3_json.jinja
#14269 commented on
Apr 25, 2025 • 0 new comments -
[WIP][Attention] FlashAttn MLA
#14258 commented on
Apr 24, 2025 • 0 new comments -
Add CUDA kernel for per_token_group_quant_fp8
#14175 commented on
Apr 23, 2025 • 0 new comments -
[V1][Metrics] Add additional metrics to V1
#14148 commented on
Apr 22, 2025 • 0 new comments -
[Hardware][CPU] Vllm int8 quantization enablement for ARM CPU
#14129 commented on
Apr 22, 2025 • 0 new comments -
[Bugfix][Frontend] Strip empty tool calls from incoming chat conversations
#14054 commented on
Apr 25, 2025 • 0 new comments -
[Bugfix] Ensure JSON encoding preserves non-ASCII characters in Llama3JsonToolParser
#13826 commented on
Apr 25, 2025 • 0 new comments -
Minor fix in documentation for tool_calling.md
#13291 commented on
Apr 25, 2025 • 0 new comments -
[V1] DP scale-out (2/N): Decouple engine process management and comms
#15977 commented on
Apr 26, 2025 • 0 new comments -
Fixed stream set to True: client stream received arguments as a concatenated JSON string with missing closing curly braces
#15930 commented on
Apr 25, 2025 • 0 new comments -
[Misc] Disable pin_memory in AsyncMetricsCollector for spec decode tensor allocation
#15886 commented on
Apr 23, 2025 • 0 new comments -
[Bugfix] fix client socket timeout when serving a multi-node model in Ray
#15850 commented on
Apr 24, 2025 • 0 new comments -
[WIP][V1/0][P/D] XpYd based on p2p communication without cache store
#15806 commented on
Apr 26, 2025 • 0 new comments -
[Sampler] Adapt to FlashInfer 0.2.3 sampler API
#15777 commented on
Apr 23, 2025 • 0 new comments -
Use pip wheel to build wheels
#15749 commented on
Apr 24, 2025 • 0 new comments -
Try Python 3.13
#15743 commented on
Apr 22, 2025 • 0 new comments -
[Core] Remove legacy input mapper/processor from V0
#15686 commented on
Apr 25, 2025 • 0 new comments -
[Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend
#15655 commented on
Apr 22, 2025 • 0 new comments -
Enable Outlines with JSON Sub-Schema References
#15627 commented on
Apr 24, 2025 • 0 new comments -
[Frontend] Fix streaming tool output losing 2 tokens bug #15545
#15546 commented on
Apr 25, 2025 • 0 new comments -
[Minor] QoL for Benchmarking
#15512 commented on
Apr 25, 2025 • 0 new comments -
[BugFix] fix speculative decoding memory leak when speculation is disabled
#15506 commented on
Apr 25, 2025 • 0 new comments -
[V1][Draft] Jump-forward decoding
#15490 commented on
Apr 24, 2025 • 0 new comments -
[Bugfix][Frontend] Fix pythonic tool parser failure with negative numbers
#15462 commented on
Apr 24, 2025 • 0 new comments -
[Misc] Improve CLI help display
#15455 commented on
Apr 21, 2025 • 0 new comments -
[Spec Decode] Make speculative decoding compatible with pipeline parallelism
#15173 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: xgrammar doesn't support enums, but vllm isn't falling back to outlines
#15762 commented on
Apr 24, 2025 • 0 new comments -
[Installation]:
#14398 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: The Transformers implementation of My Model is not compatible with vLLM.
#16826 commented on
Apr 22, 2025 • 0 new comments -
[Feature]: Support Gemma3 GGUF
#14753 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: Mistral tool parser failed to parse function calling
#16190 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: InternVL3-9B call is hanging
#16782 commented on
Apr 22, 2025 • 0 new comments -
[Usage]: Guided choice not working as expected
#12225 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: Reasoning code on the main branch reports an error during H100 inference
#16656 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: An error occurred when deploying DeepSeek-R1-Channel-INT8 on two A100 machines using lws
#16827 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: Can't use yarn rope config for long context in Qwen2 model
#10293 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: Out of Memory (OOM) Issues During MMLU Evaluation with lm_eval
#10325 commented on
Apr 22, 2025 • 0 new comments -
[RFC]: Improve Ray Support in vLLM for Enhanced Elasticity and Performance
#11137 commented on
Apr 22, 2025 • 0 new comments -
Getting started as a beginner, please advise
#11223 commented on
Apr 22, 2025 • 0 new comments -
[Feature]: Add support for attention score output
#11365 commented on
Apr 22, 2025 • 0 new comments -
[Performance]: Prefill is not using CUDA graph and becomes very slow when LoRA is enabled
#11436 commented on
Apr 22, 2025 • 0 new comments -
[Usage]: Does vLLM support deploying the speculative model on a second device?
#12200 commented on
Apr 22, 2025 • 0 new comments -
[Usage]: Does vLLM support running the DeepSeek-V3 model with CUDA 11.8?
#12247 commented on
Apr 22, 2025 • 0 new comments -
[Feature]: loading model from remote KV store such as Redis
#12250 commented on
Apr 22, 2025 • 0 new comments -
[Feature]: PD separation supports prefix caching
#12257 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: AttributeError: '_OpNamespace' '_C' object has no attribute 'gptq_marlin_repack'
#12267 commented on
Apr 22, 2025 • 0 new comments -
[Usage]: Does vLLM support speculative decoding for MoE models?
#12278 commented on
Apr 22, 2025 • 0 new comments -
[Usage]: How to implement concurrency
#12289 commented on
Apr 22, 2025 • 0 new comments -
[Usage]: Why is there no ray command in my Docker image?
#15284 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: Gemma3-27B fails in the forward pass
#16590 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Llama-3.1-405B-Instruct-FP8 only generates exclamation marks
#13035 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Qwen2.5-VL-32B, Following weights were not initialized from checkpoint
#15536 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12
#10300 commented on
Apr 23, 2025 • 0 new comments -
[Performance]: Update Cascade Attention Heuristics for FA3
#15647 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Error downloading the model when using Sonatype Nexus Repository.
#14993 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: VLLM 0.8.3 LLM initialization hangs when EngineArgs data parallel size > 1
#16588 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: vLLM 0.8.4 started with Ray, and Ray's dashboard fails to start
#16779 commented on
Apr 23, 2025 • 0 new comments -
[Installation]: XPU dependencies not built against most recent oneAPI
#11734 commented on
Apr 23, 2025 • 0 new comments -
[Feature]: SwiftKV cache compression
#12220 commented on
Apr 23, 2025 • 0 new comments -
[Feature]: Support pass in user-specified backend to torch dynamo piecewise compilation
#12261 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Fail to load W4A16-G128 (llmcompressor) quantized model on CPU
#12268 commented on
Apr 23, 2025 • 0 new comments -
[Performance]: Why is vllm-0.6.1.post2 faster than the latest vllm-0.6.6.post1?
#12274 commented on
Apr 23, 2025 • 0 new comments -
[Feature]: DeepSeek-R1 tool choice && Function Call
#12297 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Docker build error
#12300 commented on
Apr 23, 2025 • 0 new comments -
[Performance]: Unable to reproduce the throughput & latency results claimed on the vLLM dashboard v0
#12315 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: v0.8.2, enabling calculate_kv_scales caught an exception
#15973 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Cast error details: Unable to cast 1024 to Tensor
#12771 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: `undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE` when running `0.7.3.dev57+g2ae88905.precompiled` on A100
#13047 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: Can't serve on a Ray cluster despite passing VLLM_HOST_IP
#13521 commented on
Apr 22, 2025 • 0 new comments -
[Feature]: Composite model loading using `AutoWeightsLoader` for all models
#15697 commented on
Apr 22, 2025 • 0 new comments -
[Usage]: LLM.beam_search is much slower in vLLM 0.7.3 compared to 0.5.4
#14426 commented on
Apr 22, 2025 • 0 new comments -
[Bug]: Enabling LoRA not working with vLLM
#16676 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: Quantization example outdated (Ammo -> ModelOpt)
#9288 commented on
Apr 21, 2025 • 0 new comments -
[Usage]: Dynamically loaded LoRAs do not appear on the /models endpoint
#10784 commented on
Apr 21, 2025 • 0 new comments -
[Misc]: Fine-tuned llama3.2 vision instruct model fails during vLLM weight_loader
#11765 commented on
Apr 21, 2025 • 0 new comments -
[Misc]: For disaggregated prefill with multiple decode instances, drop_select might not be enough
#12039 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: Inconsistent data received and sent using PyNcclPipe
#12197 commented on
Apr 21, 2025 • 0 new comments -
[Usage]: How to generate results and get the embeddings of the results
#12213 commented on
Apr 21, 2025 • 0 new comments -
[Usage]: Context window crashes web window when full
#12221 commented on
Apr 21, 2025 • 0 new comments -
[Performance]: Added requests take too much time, and the model will not run until all the requests are added into the cache
#13259 commented on
Apr 21, 2025 • 0 new comments -
[Installation]: VLLM on ARM machine with GH200
#10459 commented on
Apr 20, 2025 • 0 new comments -
[Feature]: Gemma3 raises an error
#14723 commented on
Apr 20, 2025 • 0 new comments -
[Usage]: [V1] Misleading Error Messages
#13510 commented on
Apr 20, 2025 • 0 new comments -
[Usage]: How can I get the sparse embedding from OpenAI Embedding Client?
#13609 commented on
Apr 20, 2025 • 0 new comments -
[Usage]: Benchmarking Issues: Low Success Rate and Tensor Parallel Size Constraints on 8x AMD MI300x GPUs
#9070 commented on
Apr 20, 2025 • 0 new comments -
[Bug]: Speculative decoding inconsistency for Qwen-Coder-32B
#10913 commented on
Apr 20, 2025 • 0 new comments -
[Bug]: v0.7.3 doesn't work on a WSL Ubuntu mirrored network
#13656 commented on
Apr 20, 2025 • 0 new comments -
[Bug]: InternVL3-78B OOM on 4 A100 40G in 0.8.4
#16749 commented on
Apr 20, 2025 • 0 new comments -
Flash Attention 3 (FA3) Support
#12429 commented on
Apr 19, 2025 • 0 new comments -
[Usage]: Does model streamer support loading models from a GCS bucket?
#12290 commented on
Apr 19, 2025 • 0 new comments -
[Feature]: Support Python 3.13
#12083 commented on
Apr 19, 2025 • 0 new comments -
[Bug]: ROCm Memory Access Fault.
#16840 commented on
Apr 19, 2025 • 0 new comments -
First tpot/itl is too long?
#15106 commented on
Apr 19, 2025 • 0 new comments -
[Bug]: V1 engine error when using gemma-3 (V0 engine is okay)
#16643 commented on
Apr 19, 2025 • 0 new comments -
[Bug]: Not able to deploy Llama-4-Scout-17B-16E-Instruct on vllm-openai v0.8.3
#16197 commented on
Apr 21, 2025 • 0 new comments -
[Feature]: Support custom args in OpenAI (chat) completion requests
#16802 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: Calling the load_weights method of the MOE model failed
#16842 commented on
Apr 21, 2025 • 0 new comments -
[RFC]: KVBlocks and Metrics Publishing In Inference Frameworks
#16669 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: Using TP = 16 to serve deepseek-v3 on a 2*H20 Ray cluster raises an EngineCore exception
#16646 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: The official pre-built CPU image prints a simple error: RuntimeError: Engine process failed to start. See stack trace for the root cause
#16446 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: When configuring Ray with a custom temporary directory using the --temp-dir parameter, the distributed multi-node inference cluster fails to deploy successfully.
#16819 commented on
Apr 21, 2025 • 0 new comments -
[Usage]: How to add a hook function
#16585 commented on
Apr 21, 2025 • 0 new comments -
[Feature]: Reduce vLLM's import time
#14924 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: With sampling_params.n > 1, after reset_state_for_recompute() hits 'AssertionError: seq_len: 2701, context_len: 0, query_len: 2701'
#14759 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: 100% CPU usage when idle
#16660 commented on
Apr 21, 2025 • 0 new comments -
[RFC]: Merge input processor and input mapper for multi-modal models
#10114 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: TypeError: Unknown image model type: qwen2_5_omni for branch: qwen2_omni_public_v1
#15754 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: Request gets stuck when serving a model with the V1 engine
#16580 commented on
Apr 21, 2025 • 0 new comments -
[New Model]: support Ovis VLM series
#13441 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: Cannot load Qwen2.5-VL
#16429 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: Error using vllm-0.7.4 on NVIDIA Jetson AGX Orin
#16465 commented on
Apr 21, 2025 • 0 new comments -
[Feature]: Support for Running Classification Task in Online Server
#13567 commented on
Apr 21, 2025 • 0 new comments -
[Feature]: Integrate Triton MoE Kernel
#16294 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: Engine iteration timed out. This should never happen!
#9839 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: vLLM 0.6.2 UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
#8933 commented on
Apr 21, 2025 • 0 new comments -
Using the VLLM engine framework for inference, why is the first character generated always a space?
#3683 commented on
Apr 21, 2025 • 0 new comments -
[Bug]: deploy deepseek-r1-awq on 16 x 4090 48G, layer_kv_cache = torch.zeros(kv_cache_shape, [rank0]: RuntimeError: CUDA error: invalid argument
#15014 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Grammar error: Pointer '/$defs/xxxxx' does not exist
#16467 commented on
Apr 24, 2025 • 0 new comments -
[Usage]: How to configure the server parameters for THUDM/GLM-4-32B-0414 to support Function call using vllm-0.8.4?
#16771 commented on
Apr 24, 2025 • 0 new comments -
[Usage]: How to use prefill-decode disaggregation?
#11490 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: MistralTokenizer not working when using Mistral Small 3.1 in HF format
#16292 commented on
Apr 24, 2025 • 0 new comments -
[Doc]: Steps to run vLLM on your RTX5080 or 5090!
#14452 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: v1 flash_attn and triton_attn backends don't have `get_state_cls`
#15630 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Debugging vLLM script results in torch error
#15722 commented on
Apr 24, 2025 • 0 new comments -
[Usage]: There is no module or parameter named 'language_model' in Gemma3ForCausalLM
#15031 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Simple Data Parallelism in vLLM
#9206 commented on
Apr 24, 2025 • 0 new comments -
vLLM's V1 Engine Architecture
#8779 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Support Multiple Tasks Per Model
#11905 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: V0 engine gives incorrect output for Moonlight model
#16658 commented on
Apr 24, 2025 • 0 new comments -
[Roadmap] vLLM Roadmap Q2 2025
#15735 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Running llama2-7b on H20, a floating point exception (core dumped) occurs with float16
#4392 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: load_adapter crashes server if called when generations are in progress
#13698 commented on
Apr 24, 2025 • 0 new comments -
[RFC]: Hidden states processor
#12249 commented on
Apr 24, 2025 • 0 new comments -
vllm keeps hanging when using djl-deepspeed
#2912 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Align the API with OAI's structured output
#7220 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Guided decoding is broken because tokenizers can't be pickled
#7557 commented on
Apr 24, 2025 • 0 new comments -
[Performance]: guided generation is very slow in offline mode
#8313 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: vLLM API server returns escaped unicode strings with the guided backend 'outlines'
#8805 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Outlines broken on vLLM 0.8+
#15636 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Disable unicode characters in structured decoding
#16363 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: issues with guided generation for tool calls (xgrammar)
#16321 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Multiple tool calls for llama3.2-11b-vision-instruct
#11786 commented on
Apr 24, 2025 • 0 new comments -
[RFC]: Refactor tool parsers to eliminate coding errors and allow more efficient implementations.
#11522 commented on
Apr 24, 2025 • 0 new comments -
[Usage]: Confirm tool calling is not supported and this is the closest thing that can be done
#7912 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Does vLLM support function call mode?
#6631 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Consider parallel_tool_calls parameter at the API level
#9451 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Native Tool Call for Gemma 3
#16482 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Qwen2.5 assistant output on tool call is empty
#16430 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Models converted to GGUF don't seem to be able to do tool calling
#16195 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: JSON based tool calling for Gemma 3
#15403 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Add support for reusable subschemas in tool requests (PydanticAI)
#15035 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: vLLM response on tool_calls does not align with OpenAI standard
#14951 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Support tool calls for DeepSeek.
#14745 commented on
Apr 24, 2025 • 0 new comments -
[New Model]: Command A with tool support
#14866 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Ultravox audio doesn't work with auto tool choice
#14209 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: pythonic tool parser only accepts alphabetical tool names
#14470 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: add tool calling support for DeepSeek-R1-Distill-Qwen-32B
#13700 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: vLLM 0.6.3 generate_sequences Randomly Hangs After 1-2 Steps When trying to Implement Tool Calling with Logits Processors
#13671 commented on
Apr 24, 2025 • 0 new comments -
[Usage]: vLLM and on-the-fly tool calling
#13497 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: CPU offload not working for vllm serve
#15877 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Guided Decoding Schema Cache Store
#8902 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Support guided decoding with multistep decoding
#9893 commented on
Apr 23, 2025 • 0 new comments -
[Performance]: Transformers 4.45.1 slows down `outlines` guided decoding
#9032 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Distilled DeepSeek Models do not work with guided_json
#12548 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Guided Decoding (structured json outputs) not generating proper outputs.
#13683 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Very slow guided decoding with Outlines backend since v0.6.5
#12005 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: XGrammar-based CFG decoding degraded after 0.6.5
#12122 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Close feature gaps when using xgrammar for structured output
#12131 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: xgrammar crashes with speculative decoding
#11484 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Using "response_format": { "type": "json_object" } with /v1/chat/completions is terminating the engine
#11828 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Engine crashes with Pixtral-HF and xgrammar decoding
#11044 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Speculative decoding + guided decoding not working
#10442 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Speculative decoding breaks guided decoding.
#9423 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Compiling FSM index high memory && subprocess OOM
#7332 commented on
Apr 23, 2025 • 0 new comments -
[RFC]: TPU V1 Sampler planning
#16268 commented on
Apr 23, 2025 • 0 new comments -
[Installation]: how to run swiftkv with vllm
#16109 commented on
Apr 23, 2025 • 0 new comments -
[Usage]: Transcription "Maximum clip duration (30s) exceeded"
#15012 commented on
Apr 23, 2025 • 0 new comments -
[New Model]: Multimodal Embedding Model GME.
#16406 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Llama 4 EOFError
#16127 commented on
Apr 23, 2025 • 0 new comments -
[Feature]: Support tool calling and reasoning together
#14429 commented on
Apr 23, 2025 • 0 new comments -
[Feature]: Please add an ARM Docker image to hub.docker.com
#14656 commented on
Apr 23, 2025 • 0 new comments -
[Installation]: Can't build arm container image with podman without a SELinux relabel of bind mounts
#12734 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: CPU memory not released when waking up the vLLM instance
#16663 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Exception in worker VllmWorkerProcess while processing method init_device: NCCL error: unhandled cuda error
#9329 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Error with structured output inference after upgrade 0.6.2->0.6.3
#9462 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Structured outputs inference often takes a very long time, eventually causing a timeout and the vLLM engine crashing.
#10081 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Guided Decoding Broken in Streaming mode
#10376 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: CPU Offloading errors (Worker.__init__() got an unexpected keyword argument 'kv_cache_dtype')
#11986 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: PaliGemma2 not working with OpenAI Docker serve
#12052 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Fails to use beam search with llm.chat
#12183 commented on
Apr 24, 2025 • 0 new comments -
[Usage]: How can I use LLMEngine to perform distributed inference for multimodal large models, such as Qwen-VL?
#12305 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Speculative decoding does not work
#12323 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Running inference on multiple LLMs one by one with multiple TP always hangs on the second one in the model list
#12337 commented on
Apr 24, 2025 • 0 new comments -
[Usage]: When running models on multiple GPUs, workload does not get split
#12354 commented on
Apr 24, 2025 • 0 new comments -
[RFC]: Refactor `config-format` and `load-format` as plugins
#12363 commented on
Apr 24, 2025 • 0 new comments -
[Feature]: Support LoRA adapter for whisper
#15370 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Qwen2.5 tool call failed
#16393 commented on
Apr 24, 2025 • 0 new comments -
[Bug]: Out of Memory error for Qwen2.5 in 0.8.0 and 0.8.1. Worked fine in the previous versions
#15228 commented on
Apr 24, 2025 • 0 new comments -
[V1] Feedback Thread
#12568 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Persistent OutOfMemoryError error when using speculative decoding
#8073 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: Bug while using deepspeed with TRL with vLLM
#16867 commented on
Apr 23, 2025 • 0 new comments -
[Feature]: Specific Docker Image for vllm["audio,video"]
#13940 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: examples/offline_inference/chat_with_tools.py JSONDecodeError
#16594 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: ImportError: /workspace/vllm-abo/vllm/_C.abi3.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKSsb
#13608 commented on
Apr 23, 2025 • 0 new comments -
[Bug]: guided_json not working correctly with (quantized) mistral-small model
#15577 commented on
Apr 23, 2025 • 0 new comments