
fix: reduce max_new_tokens default from 32768 to 4096 #279

Open
Bortlesboat wants to merge 1 commit into microsoft:main from Bortlesboat:fix/reduce-max-new-tokens-default

Conversation

@Bortlesboat

The default max_new_tokens=32768 in the ASR demo scripts forces PyTorch to pre-allocate KV-cache for 32K output tokens regardless of actual input length. This causes OOM on 24GB GPUs even for short audio clips (see #210).

4096 tokens is sufficient for transcribing ~1 hour of speech and matches the default already used in gradio_asr_demo_api_video.py (line 779). Users who need more can still pass --max_new_tokens=32768 explicitly.
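A back-of-envelope calculation shows why a pre-allocated 32K-token budget can exhaust a 24GB card once model weights are loaded. The sketch below is illustrative only: the layer count, KV-head count, and head dimension are assumed placeholder values, not VibeVoice's actual configuration.

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=8, head_dim=128,
                   dtype_bytes=2, batch_size=1):
    # Two tensors (K and V) per layer, each shaped
    # [batch, kv_heads, seq_len, head_dim], stored in fp16/bf16 (2 bytes).
    # All model dimensions here are illustrative assumptions.
    return 2 * num_layers * batch_size * num_kv_heads * seq_len * head_dim * dtype_bytes

gb = 1024 ** 3
print(f"32768-token budget: {kv_cache_bytes(32768) / gb:.1f} GiB")  # 4.0 GiB
print(f" 4096-token budget: {kv_cache_bytes(4096) / gb:.1f} GiB")   # 0.5 GiB
```

Under these assumptions the cache scales linearly with the budget, so dropping the default to 4096 reclaims 7/8 of that allocation; the real savings depend on the model's actual dimensions.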

Changed files:

  • demo/vibevoice_asr_inference_from_file.py — default 32768 → 4096
  • demo/vibevoice_asr_gradio_demo.py — default 32768 → 4096

The 32768 default forces PyTorch to pre-allocate KV-cache for 32K output
tokens regardless of input length, causing OOM on consumer GPUs (24GB)
even for short audio. 4096 tokens is sufficient for ~1 hour of ASR output
and matches the default already used in the vLLM API client.

Users processing very long audio can still pass --max_new_tokens=32768.
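The change itself is a one-line default swap in each script's argument parser. The sketch below assumes a plain argparse CLI; only the `--max_new_tokens` flag name and the 32768 → 4096 values come from the PR, the rest is illustrative.

```python
import argparse

def build_parser():
    # Hypothetical reconstruction of the demo scripts' CLI; the actual
    # parser in demo/vibevoice_asr_inference_from_file.py may differ.
    parser = argparse.ArgumentParser(description="VibeVoice ASR demo (sketch)")
    parser.add_argument(
        "--max_new_tokens",
        type=int,
        default=4096,  # was 32768; 4096 covers roughly an hour of transcription
        help="Maximum number of tokens to generate.",
    )
    return parser

# Default behavior after the change:
args = build_parser().parse_args([])
print(args.max_new_tokens)  # 4096

# Opting back in for very long audio, as the PR description notes:
long_args = build_parser().parse_args(["--max_new_tokens", "32768"])
print(long_args.max_new_tokens)  # 32768
```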

@codeCraft-Ritik codeCraft-Ritik left a comment


Great fix! Reducing the default max_new_tokens significantly improves usability on limited GPU memory.

