fix: reduce max_new_tokens default from 32768 to 4096 by Bortlesboat · Pull Request #279 · microsoft/VibeVoice

Bortlesboat · 2026-03-30T17:44:20Z

The default max_new_tokens=32768 in the ASR demo scripts forces PyTorch to pre-allocate the KV cache for 32K output tokens regardless of the actual input length. This causes out-of-memory (OOM) errors on 24GB GPUs even for short audio clips (see #210).

4096 tokens is sufficient for transcribing ~1 hour of speech and matches the default already used in gradio_asr_demo_api_video.py (line 779). Users who need more can still pass --max_new_tokens=32768 explicitly.

Changed files:

demo/vibevoice_asr_inference_from_file.py — default 32768 → 4096
demo/vibevoice_asr_gradio_demo.py — default 32768 → 4096
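
For illustration, the default change and the explicit override might look like the following. This is a minimal sketch, assuming the demo scripts expose the setting through `argparse`; only the `--max_new_tokens` flag name and the two values come from this PR, while `build_parser`, the description, and the help text are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real demo scripts may define more options.
    parser = argparse.ArgumentParser(description="ASR demo (illustrative sketch)")
    parser.add_argument(
        "--max_new_tokens",
        type=int,
        default=4096,  # previously 32768
        help="Upper bound on the number of generated tokens.",
    )
    return parser


# No flag given: the new, smaller default applies.
default_args = build_parser().parse_args([])

# Very long audio: the old value can still be requested explicitly.
override_args = build_parser().parse_args(["--max_new_tokens=32768"])

print(default_args.max_new_tokens)   # 4096
print(override_args.max_new_tokens)  # 32768
```

Keeping the flag rather than hard-coding the value means existing invocations that already pass `--max_new_tokens` behave exactly as before.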

The 32768 default forces PyTorch to pre-allocate KV-cache for 32K output tokens regardless of input length, causing OOM on consumer GPUs (24GB) even for short audio. 4096 tokens is sufficient for ~1 hour of ASR output and matches the default already used in the vLLM API client. Users processing very long audio can still pass --max_new_tokens=32768.
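
To make the memory argument concrete, here is a rough back-of-the-envelope estimate of KV-cache size as a function of the token budget. It is a sketch only: the layer count, KV-head count, head dimension, and 2-byte element size below are assumed placeholder values, not the actual VibeVoice model configuration, and it only applies when the cache is sized for the full budget up front as described above.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    num_tokens: int,
    bytes_per_elem: int = 2,  # assumed fp16/bf16 storage
    batch_size: int = 1,
) -> int:
    """Bytes needed to hold keys and values for num_tokens positions."""
    # Factor 2 covers the separate key and value tensors in every layer.
    return (
        2 * num_layers * num_kv_heads * head_dim
        * num_tokens * bytes_per_elem * batch_size
    )


# Hypothetical model dimensions, chosen only to illustrate the scaling.
LAYERS, KV_HEADS, HEAD_DIM = 28, 4, 128

old_budget = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, num_tokens=32768)
new_budget = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, num_tokens=4096)

GIB = 1024 ** 3
print(f"32768 tokens: {old_budget / GIB:.2f} GiB")  # 1.75 GiB
print(f" 4096 tokens: {new_budget / GIB:.2f} GiB")  # 0.22 GiB
print(f"ratio: {old_budget // new_budget}x")        # 8x
```

Whatever the real dimensions are, the reserved cache scales linearly with `max_new_tokens`, so dropping the default from 32768 to 4096 cuts that reservation by a factor of 8. The absolute numbers depend on the actual model configuration, which this PR does not state.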

codeCraft-Ritik

Great fix! Reducing the default max_new_tokens significantly improves usability on limited GPU memory.

LONEWOLF3399 approved these changes Mar 30, 2026

codeCraft-Ritik reviewed Mar 31, 2026

Bortlesboat wants to merge 1 commit into microsoft:main from Bortlesboat:fix/reduce-max-new-tokens-default

3 participants
