Skip to content

[OMNIML-4788] specdec_bench/Qwen3.5-4B: throughput_32k benchmark + S3 upload step#1564

Merged
ChenhanYu merged 20 commits into
mainfrom
chenhany/specdec-bench-throughput-32k-and-s3-upload
Jun 7, 2026
Merged

[OMNIML-4788] specdec_bench/Qwen3.5-4B: throughput_32k benchmark + S3 upload step#1564
ChenhanYu merged 20 commits into
mainfrom
chenhany/specdec-bench-throughput-32k-and-s3-upload

Conversation

@ChenhanYu

@ChenhanYu ChenhanYu commented May 28, 2026

Copy link
Copy Markdown
Collaborator

What does this PR do?

Type of change: enhancement (follow-up to #1531).

Extends the merged Qwen3.5-4B SPEED-Bench launcher YAMLs from a single-task qualitative-only smoke into a 3-task pipeline that also covers long-context throughput and verifies the S3-upload path end-to-end. Two commits, cleanly cherry-picked from #1531's late branch state — they were authored after the merge-commit was resolved against an earlier rebased head and so didn't ride along with that merge.

Pipeline shape (both YAMLs)

Task Split Save dir
task_0 qualitative (existing quality / acceptance-rate signal) /scratchspace/specdec_bench{,_mtp}/qualitative
task_1 throughput_32k (new — long-context throughput) /scratchspace/specdec_bench{,_mtp}/throughput_32k
task_2 upload to S3 in sweep layout s3://team-specdec-workgroup/results/specdec_bench{,_mtp}/<split>/

New artifacts

  • tools/launcher/common/specdec_bench/upload_to_s3.sh — thin wrapper around examples/specdec_bench/upload_to_s3.py so it can be invoked as a launcher task. Installs boto3 from requirements.txt on cold containers; warm pipelines pick it up from the prior run.sh.
  • tools/launcher/common/specdec_bench/runtime_params_throughput_32k.yaml — pins engine_args.max_model_len = 40,960 (32K input + 4K output + 4K headroom) so vLLM doesn't silently auto-cap max_model_len below the 36K minimum needed for throughput_32k prompts on single-GPU runs.

Why max_model_len matters

Without an explicit max_model_len, vLLM auto-derives it from the model config (Qwen3.5-4B = 128K) and from the GPU-memory budget. On a single GPU the second factor can cap effective max_model_len well below 36K, silently truncating 32K-token prompts and producing wrong throughput numbers. The qualitative split is not affected (its prompts top out around 8K, well under any auto-derivation floor) so only task_1 carries the override.

S3 credentials

upload_to_s3.sh reads S3_ENDPOINT / S3_KEY_ID / S3_SECRET from the runtime environment (not hardcoded). --skip-existing + --allow-incomplete-provenance are passed by default so re-runs land alongside the prior upload, and runs lacking CONTAINER_IMAGE (Phase-2 harness work in OMNIML-4788 will populate it) still upload.

Testing

Cluster smoke on cw_dfw via:

uv run slurm.py --yaml modules/Model-Optimizer/tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml --yes

is currently in-flight (jobs 12257378/79/80, PD). Will update this PR with timing/AR numbers + S3 upload confirmation once it lands.

Before your PR is "Ready for review"

  • Backward compatible: ✅ (additive — task_0 keeps the prior qualitative behavior, just with /qualitative suffix in save_dir)
  • New PIP dep: ✅ no (boto3 already in examples/specdec_bench/requirements.txt from [OMNIML-4788] specdec_bench: configuration.json provenance + upload_to_s3 #1531)
  • New tests: N/A (launcher YAML + shell wrapper; covered by cluster smoke)
  • Changelog: N/A (internal-facing tooling)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added a 32K-context runtime configuration (higher max model length) to enable long-context throughput benchmarking and avoid silent prompt truncation.
    • Added a launcher helper to upload benchmark results to S3 with incremental/retry-friendly options and pass/fail reporting.
  • Chores

    • Split Qwen3.5-4B benchmark into separate qualitative and 32K throughput tasks and added coordinated S3 upload.
    • Applied the same multi-task pipeline layout and clearer output organization to the MTP speculative-decoding benchmark.

Review Change Stack

@copy-pr-bot

copy-pr-bot Bot commented May 28, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ChenhanYu ChenhanYu requested review from a team and kevalmorabia97 May 28, 2026 23:41
@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a 32K-context runtime-params YAML, an S3 upload Bash wrapper with dependency bootstrapping, and converts Qwen3.5-4B example pipelines (standard and MTP) into three-task workflows: qualitative, throughput_32k (using the new runtime params), and S3 upload.

Changes

SPEED-Bench Throughput and S3 Upload Pipeline

Layer / File(s) Summary
Runtime parameters for 32K throughput context
tools/launcher/common/specdec_bench/runtime_params_throughput_32k.yaml
New YAML sets engine_args.max_model_len: 40960 to prevent vLLM prompt truncation for long-context throughput runs.
S3 upload wrapper and dependency management
tools/launcher/common/specdec_bench/upload_to_s3.sh
Bash wrapper that sources service_utils.sh, wires ERR/EXIT traps, ensures boto3 via requirements.txt if needed, runs the Python uploader, and reports PASS/FAIL.
Qwen3.5-4B standard benchmark pipeline
tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml
Restructured into a three-task pipeline: qualitative split (outputs to /scratchspace/specdec_bench/qualitative), throughput_32k task using the new runtime params, and S3 upload of /scratchspace/specdec_bench.
Qwen3.5-4B MTP benchmark pipeline
tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench_mtp.yaml
Adds a three-task MTP pipeline (qualitative draft=3, throughput_32k draft=3 using runtime params, and S3 upload of /scratchspace/specdec_bench_mtp).

Possibly related PRs

Suggested reviewers

  • kevalmorabia97
  • h-guo18

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main changes: adding a throughput_32k benchmark configuration and S3 upload functionality for Qwen3.5-4B SPEED-Bench.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed No Python code changes in modelopt. No security anti-patterns found: torch.load, numpy.load, trust_remote_code=True hardcoded, eval/exec, nosec comments, or unsafe dependencies all absent.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch chenhany/specdec-bench-throughput-32k-and-s3-upload

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov Bot commented May 28, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.91%. Comparing base (52f1ccb) to head (4322ac0).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1564      +/-   ##
==========================================
- Coverage   77.49%   76.91%   -0.58%     
==========================================
  Files         489      489              
  Lines       54415    54415              
==========================================
- Hits        42169    41856     -313     
- Misses      12246    12559     +313     
Flag Coverage Δ
examples 42.78% <ø> (-0.14%) ⬇️
regression 14.88% <ø> (+0.07%) ⬆️
unit 54.00% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
@ChenhanYu

Copy link
Copy Markdown
Collaborator Author

/claude review

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review summary

Findings: 1 CRITICAL, 0 IMPORTANT, 1 SUGGESTION

Most impactful finding

[CRITICAL Algorithm] The whole point of runtime_params_throughput_32k.yaml — pinning engine_args.max_model_len = 40960 so vLLM can't silently truncate 32K prompts — does not actually take effect. VLLMModel.__init__ in examples/specdec_bench/specdec_bench/models/vllm.py:79-91 builds AsyncEngineArgs(...) with a hardcoded subset of kwargs and never reads max_model_len out of **kwargs, so the override from runtime_params.engine_args is silently dropped. vLLM continues to auto-derive max_model_len from the model config + GPU memory budget — exactly the failure mode the PR description says this file prevents. task_1 (throughput_32k) will still measure throughput over truncated inputs on memory-tight single-GPU runs.

The fix lives in vllm.py, not in this PR's diff: forward max_model_len (and ideally any engine_args overrides whose key matches an AsyncEngineArgs field) into AsyncEngineArgs(...). Without that fix the new YAML override and the --runtime_params line in both task_1s are dead weight. Inline comment on runtime_params_throughput_32k.yaml:14 has a concrete patch sketch.

The cluster smoke (12257378/79/80) called out in the PR description should catch this once it lands — recommend confirming the dumped serving_config.vllm_config.max_model_len actually reads 40960 in the throughput_32k run dir before merging.

Other note

[SUGGESTION] task_2 (S3 upload) is a CPU-only boto3 file copy but allocates a GPU and uses the heavy vllm/vllm-openai container — non-blocking, just operational efficiency.

Risk assessment

Medium-high. The PR is small and the launcher plumbing is fine, but the headline feature (correct 32K throughput numbers) is broken at the engine layer — anyone consuming the throughput_32k results uploaded to S3 would be reading from silently-truncated runs. Hold on the vLLMModel kwarg-forwarding fix before landing.

# needed, since its prompts top out at ~8K).

engine_args:
max_model_len: 40960

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[CRITICAL Algorithm] This max_model_len: 40960 override is silently dropped — vLLM never sees it.

VLLMModel.__init__ in examples/specdec_bench/specdec_bench/models/vllm.py:79-91 constructs AsyncEngineArgs(...) with a hardcoded subset of kwargs (tokenizer, trust_remote_code, tensor_parallel_size, enable_expert_parallel, enable_prefix_caching, speculative_config, max_num_seqs, skip_tokenizer_init, async_scheduling, enforce_eager). The **engine_args from runtime_params is unpacked into VLLMModel.__init__'s **kwargs, but max_model_len is not one of the whitelisted keys read out of that dict — so it's discarded. The comment at vllm.py:164 even acknowledges this ("max_model_len / kv cache / dtype defaults that don't appear in AsyncEngineArgs").

Impact: vLLM keeps auto-deriving max_model_len from the model config + GPU memory budget — exactly the failure mode the PR description ("vLLM's default auto-derivation … can cap max_model_len lower than 36K on a single GPU") claims this file prevents. 32K-token prompts in the throughput_32k split will still be silently truncated on memory-tight single-GPU runs, and the throughput numbers task_1 reports will be measured over truncated inputs. Adding the YAML without the forwarding fix gives a false sense of correctness.

Fix: forward unknown engine_args kwargs (or at minimum max_model_len) to AsyncEngineArgs in VLLMModel.__init__. e.g.

import dataclasses
extra_engine_kwargs = {
    k: v for k, v in kwargs.items()
    if k in {f.name for f in dataclasses.fields(AsyncEngineArgs)}
    and k not in {"model", "tokenizer", "trust_remote_code",
                  "tensor_parallel_size", "enable_expert_parallel",
                  "enable_prefix_caching", "speculative_config",
                  "max_num_seqs", "skip_tokenizer_init",
                  "async_scheduling", "enforce_eager"}
}
engine_args = AsyncEngineArgs(
    model=model_dir,
    ...,  # existing explicit kwargs
    **extra_engine_kwargs,
)

Without this (or an equivalent explicit max_model_len=kwargs.get("max_model_len")), this YAML and the --runtime_params line in both specdec_bench{,_mtp}.yaml::task_1 are dead weight.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 80309cae01 (this PR). examples/specdec_bench/specdec_bench/models/vllm.py:VLLMModel.__init__ now forwards any kwarg whose name matches an AsyncEngineArgs dataclass field — except the dozen kwargs VLLMModel consumes itself (collected into _VLLM_CONSUMED_KWARGS to prevent got multiple values for keyword argument crashes). So engine_args.max_model_len: 40960 from runtime_params_throughput_32k.yaml now actually reaches AsyncEngineArgs.

Unit tests in tests/examples/specdec_bench/test_vllm_kwargs_forwarding.py:

  • max_model_len is forwarded
  • multiple engine_args fields (max_model_len + dtype + gpu_memory_utilization) all pass through
  • consumed kwargs (e.g. tensor_parallel_size) are NOT double-forwarded (would crash AsyncEngineArgs)
  • unknown kwargs are silently dropped (matches prior behaviour)
  • a pin test asserts the test's copy of _VLLM_CONSUMED_KWARGS matches the module's definition

Cluster re-validation: planning to capture dump_env's serving_config.vllm_config.max_model_len from the throughput_32k run dir on the next smoke to confirm it reads 40960.

nodes: 1
ntasks_per_node: 1
gpus_per_node: 1
container: vllm/vllm-openai:qwen3_5-cu130

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] task_2 is a CPU-only boto3 file copy but allocates gpus_per_node: 1 and the heavyweight vllm/vllm-openai:qwen3_5-cu130 container.

On a busy GPU partition this can add hours of queue time for a job that does no compute, and it pulls a multi-GB image just to run boto3.upload_file(). If the launcher / cluster supports it, dropping gpus_per_node to 0 (or routing to a CPU partition) and using a small python+boto3 image would tighten the pipeline. Same applies to specdec_bench_mtp.yaml::task_2. Non-blocking — operationally simpler to reuse one container and one slurm template, just calling out the trade-off.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in b90e30afc8 (this PR). Both specdec_bench.yaml and specdec_bench_mtp.yaml task_2 now use gpus_per_node: 0. Kept the same vllm/vllm-openai:qwen3_5-cu130 image so the already-pulled cached image is reused across tasks — Slurm will still schedule the 0-GPU job on the default batch partition (no separate cpu partition needed; GPU nodes accept 0-GPU jobs and have spare CPU slots). Switching to a dedicated lightweight python+boto3 image would shave the image-pull cost further but adds a maintenance surface, deferring that to a follow-up if image-pull becomes the bottleneck.

ChenhanYu added 2 commits May 29, 2026 08:42
…ith throughput_32k + S3 upload

3-task pipelines now exercise both:

* `qualitative` split (existing quality / acceptance-rate signal)
* `throughput_32k` split (long-context throughput signal)
* Final task uploads `/scratchspace/specdec_bench{,_mtp}` to S3 in sweep
  layout so qualitative/ and throughput_32k/ both land under
  `s3://team-specdec-workgroup/results/specdec_bench{,_mtp}/<split>/`.

Adds tools/launcher/common/specdec_bench/upload_to_s3.sh — thin wrapper
around examples/specdec_bench/upload_to_s3.py so it can be invoked as a
launcher task. Installs boto3 from examples/specdec_bench/requirements.txt
on cold containers (warm pipelines pick it up from the prior run.sh).
S3 credentials are read from S3_ENDPOINT / S3_KEY_ID / S3_SECRET env
vars; both --skip-existing and --allow-incomplete-provenance are passed
by default so reruns don't fail and runs lacking CONTAINER_IMAGE still
land (the harness Phase-2 work in OMNIML-4788 will populate it).

Save dirs split per benchmark (/scratchspace/<base>/<split>/) so the
upload step sees a proper sweep layout and doesn't overwrite earlier
results.

Signed-off-by: chenhany <chenhany@nvidia.com>
…Bench throughput_32k

Without an explicit max_model_len, vLLM auto-derives it from the model
config and gpu_memory_utilization. On a single-GPU job with Qwen3.5-4B
(declared max_position_embeddings = 128K), the auto-derivation can cap
max_model_len well below 36K to fit the KV cache budget, silently
truncating 32K-token prompts from SPEED-Bench-Internal/throughput_32k
and producing wrong throughput numbers.

Add tools/launcher/common/specdec_bench/runtime_params_throughput_32k.yaml
that pins engine_args.max_model_len = 40,960 (32K input + 4K output +
4K headroom). Wire it via --runtime_params on task_1 in both Qwen3.5-4B
specdec_bench YAMLs (NONE/baseline + MTP variant).

The qualitative task uses the engine default (no override) because its
prompts top out around 8K and vLLM's auto-derivation handles that fine.

Signed-off-by: chenhany <chenhany@nvidia.com>
@ChenhanYu ChenhanYu force-pushed the chenhany/specdec-bench-throughput-32k-and-s3-upload branch from 09ea467 to 8004bd6 Compare May 29, 2026 15:42

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml (1)

93-98: ⚡ Quick win

Avoid reserving a GPU for the S3 upload-only task.

task_2 runs the uploader wrapper and doesn’t need GPU compute; reserving one GPU can unnecessarily delay scheduling and consume scarce resources.

💡 Proposed tweak
   task_2:
@@
     slurm_config:
       _factory_: "slurm_factory"
       nodes: 1
       ntasks_per_node: 1
-      gpus_per_node: 1
+      gpus_per_node: 0
       container: vllm/vllm-openai:qwen3_5-cu130
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml` around lines 93 -
98, The slurm_config currently reserves a GPU for all tasks which is unnecessary
for the uploader job (task_2); update the configuration so the uploader task
does not request GPU resources — either set gpus_per_node: 0 in a dedicated
slurm_config for task_2 or remove/override the GPU request in task_2's job spec
so task_2 runs without a GPU (leave the container image unchanged); locate the
slurm_config block and the task_2 job definition and adjust the GPU request
accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml`:
- Around line 93-98: The slurm_config currently reserves a GPU for all tasks
which is unnecessary for the uploader job (task_2); update the configuration so
the uploader task does not request GPU resources — either set gpus_per_node: 0
in a dedicated slurm_config for task_2 or remove/override the GPU request in
task_2's job spec so task_2 runs without a GPU (leave the container image
unchanged); locate the slurm_config block and the task_2 job definition and
adjust the GPU request accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 81cec96c-e9c8-420f-928f-a4d7f179cd58

📥 Commits

Reviewing files that changed from the base of the PR and between 09ea467 and 8004bd6.

📒 Files selected for processing (4)
  • tools/launcher/common/specdec_bench/runtime_params_throughput_32k.yaml
  • tools/launcher/common/specdec_bench/upload_to_s3.sh
  • tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml
  • tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench_mtp.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • tools/launcher/common/specdec_bench/runtime_params_throughput_32k.yaml
  • tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench_mtp.yaml
ChenhanYu added 2 commits May 29, 2026 13:53
…SPEED-bench

The single-GPU concurrency=1 config takes ~4.4h on the 880-sample qualitative
split, which doesn't fit in the cw_dfw batch partition's 4h MaxTime. Acceptance
length (AL) — the primary metric — is concurrency-independent; only aa_timing
fidelity is sacrificed by increasing concurrency.

  task_0 (qualitative):     tp=2, concurrency=8, gpus=2  -> ~30-45 min
  task_1 (throughput_32k):  tp=2, concurrency=4, gpus=2  -> stays capped at
                                                            --num_requests 20
                                                            for KV-cache safety
                                                            at 32K-token prompts

tp_size=2 also doubles the KV-cache budget for task_1, making concurrency>1
feasible on 32K prompts that would OOM a single H100.

Signed-off-by: chenhany <chenhany@nvidia.com>
…oughput_32k to 80 samples @ concurrency=8

Prior config (concurrency=8 on qualitative, --num_requests 20 + concurrency=4 on
throughput_32k) was conservative-tuned for time-budget headroom. With tp_size=2
in place the KV budget is doubled, so we can push concurrency further:

  task_0 (qualitative):     concurrency 8 -> 32        (still tp_size=2)
  task_1 (throughput_32k):  concurrency 4 -> 8,
                            --num_requests 20 -> 80    (still tp_size=2)

AL is concurrency-independent; the bump only sacrifices aa_timing fidelity.
8 * 32K = 256K tokens of in-flight KV stays within the doubled KV budget on
tp_size=2.

Signed-off-by: chenhany <chenhany@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🧹 Nitpick comments (1)
tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml (1)

101-106: 💤 Low value

Upload step requests a GPU it doesn't use.

task_2 only runs upload_to_s3.sh yet reserves gpus_per_node: 1 on a GPU container. On a busy cluster this can delay scheduling and waste an allocated GPU. If slurm_factory supports it, consider gpus_per_node: 0 (and a lighter image) for the upload step.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml` around lines 101
- 106, task_2's slurm_config is requesting a GPU it doesn't use; locate the
slurm_config block for task_2 (look for "task_2", "slurm_config", and the
upload_to_s3.sh step) and set gpus_per_node: 0 and switch the container to a
lightweight non-GPU image (or remove the container override) so the upload step
doesn't reserve GPU resources; keep the _factory_: "slurm_factory" and other
scheduling fields unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml`:
- Line 79: The --runtime_params value in
tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml points to the wrong
location; since examples/specdec_bench/run.py opens args.runtime_params directly
(open(args.runtime_params)), update the CLI entry for --runtime_params to the
repository path where the file actually lives
(tools/launcher/common/specdec_bench/runtime_params_throughput_32k.yaml) so the
32k override can be loaded correctly.

---

Nitpick comments:
In `@tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml`:
- Around line 101-106: task_2's slurm_config is requesting a GPU it doesn't use;
locate the slurm_config block for task_2 (look for "task_2", "slurm_config", and
the upload_to_s3.sh step) and set gpus_per_node: 0 and switch the container to a
lightweight non-GPU image (or remove the container override) so the upload step
doesn't reserve GPU resources; keep the _factory_: "slurm_factory" and other
scheduling fields unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5ed32921-0e71-4818-a43c-24bac63e217c

📥 Commits

Reviewing files that changed from the base of the PR and between 5c24516 and ae59ce9.

📒 Files selected for processing (2)
  • tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml
  • tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench_mtp.yaml
Comment thread tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml Outdated
ChenhanYu added 6 commits May 29, 2026 14:47
… throughput_32k

slurm.py's PatternPackager flattens modules/Model-Optimizer/tools/launcher/common/*
to /nemo_run/code/common/* (matches the convention already used by the script: field,
e.g. `script: common/specdec_bench/run.sh`). The runtime_params YAML path needs to
follow the same convention; otherwise the container working dir misses the file:

  FileNotFoundError: 'modules/Model-Optimizer/tools/launcher/common/specdec_bench/runtime_params_throughput_32k.yaml'

This also makes the path consistent with the public launcher (launch.py) which
runs from tools/launcher/ as CWD, so the same `common/...` prefix resolves there
too.

Signed-off-by: chenhany <chenhany@nvidia.com>
…ECDEC_BENCH_S3_*

The earlier convention used unprefixed S3_ENDPOINT / S3_KEY_ID / S3_SECRET,
which would collide with any other S3 credentials a CI runner or developer
shell happens to carry. Rename to SPECDEC_BENCH_S3_* so the upload step is
explicit about which workflow it serves and avoids cross-tool credential
leakage.

Two changes:

* examples/specdec_bench/upload_to_s3.py — read SPECDEC_BENCH_S3_{ENDPOINT,
  KEY_ID,SECRET} as the defaults for --endpoint / --key-id / --secret.

* tools/launcher/core.py — get_default_env() now forwards SPECDEC_BENCH_S3_*
  from the launching shell into both slurm_env and local_env so the cluster
  upload task (common/specdec_bench/upload_to_s3.sh) sees them inside the
  container without committing secrets to any YAML.

The previous task_2 (12308020) on cw_dfw failed because the unprefixed S3_*
vars never reached the sbatch — even when the launching shell had them
exported. With this fix, exporting SPECDEC_BENCH_S3_* before `uv run
slurm.py` / launch.py is enough; the harness propagates them.

Signed-off-by: chenhany <chenhany@nvidia.com>
… runtime_params

Without this fix, kwargs unpacked from runtime_params.engine_args into
VLLMModel.__init__ are silently dropped — the AsyncEngineArgs(...) call only
reads a hardcoded subset. The clearest casualty was
runtime_params_throughput_32k.yaml::max_model_len: 40960 (added in
e3c4d4f to prevent vLLM from auto-deriving a short max_model_len and
truncating 32K prompts), which never took effect. Throughput numbers
measured on memory-tight single-GPU runs were over silently-truncated
inputs.

Caught by claude[bot] review on PR #1564 (CRITICAL).

Forward any kwarg whose name matches an AsyncEngineArgs dataclass field,
except for the dozen kwargs VLLMModel reads itself (collected into
`_VLLM_CONSUMED_KWARGS` so adding a future consumed kwarg fails the unit
test loudly). This lets users override max_model_len / dtype /
gpu_memory_utilization / any other AsyncEngineArgs field via
runtime_params.engine_args.

Tests: tests/examples/specdec_bench/test_vllm_kwargs_forwarding.py
covers the four cases:
  * max_model_len is forwarded
  * multiple engine_args fields all pass through
  * consumed kwargs (tensor_parallel_size, trust_remote_code) are NOT
    double-forwarded (would crash AsyncEngineArgs)
  * unknown/typo kwargs are silently dropped (matches prior behaviour)

Signed-off-by: chenhany <chenhany@nvidia.com>
…pload

task_2 is a boto3 file copy — pure CPU work that finishes in ~3s. The
prior gpus_per_node=1 reserved an entire GPU for the duration, sitting
in PD-Priority on busy GPU partitions for the few seconds of upload work.
Setting gpus_per_node=0 keeps the job on the default `batch` partition
(GPU node, but no GPU reserved) so it schedules instantly alongside
existing GPU jobs. No separate cpu partition needed — Slurm allows
0-GPU jobs on GPU partitions.

Suggested by claude[bot] review on PR #1564 (non-blocking).

Signed-off-by: chenhany <chenhany@nvidia.com>
…engine>

The prior names `specdec_bench` (NONE/AR baseline) and `specdec_bench_mtp`
encoded only the workflow, not the model. A future run with a different
base model (e.g., Qwen3-8B) under the same workflow would collide in
S3 against the existing Qwen3.5-4B run — `--skip-existing` would reject
it, masking the new data.

Rename sweep dirs to embed model + algorithm + engine so they stay
unique across multi-model / multi-engine benchmarking:

  specdec_bench       -> qwen35_4b_none_vllm
  specdec_bench_mtp   -> qwen35_4b_mtp_vllm

The S3 prefix `results/specdec_bench_mtp/` (14 objects from the prior
cluster smoke) has already been moved to `results/qwen35_4b_mtp_vllm/`
via boto3 copy+delete. No NONE-baseline data was in S3 (prior task_2
failures), so nothing else to migrate.

This matches the loose convention already used by other team sweeps in
the same bucket (e.g., `qwen35_35_sglang_mtp7`, `dsr1-0528-fp4_trtllm_mtp7`,
`kimi25_vllm_nv_eagle_d7`).

Signed-off-by: chenhany <chenhany@nvidia.com>
…h YAMLs

The bench-pipeline YAML now stops at task_1 (throughput_32k). S3 upload
moves out of the YAML and into the pensieve-intern specdec_bench
workflow's wrap_up stage, which owns harvesting /scratchspace/<sweep>/
contents and publishing them with provenance stamps (jira_ticket +
huggingface_model_id) so the visualizer can flag them as
`is_official=TRUE`.

Rationale (Phase 3 of OMNIML-4788):

* The launcher YAML lives in NVIDIA/Model-Optimizer/tools/launcher/
  examples/ where any external user can run it as a self-contained
  benchmark recipe. Baking S3 credentials + a team-specific bucket
  into that example is the wrong scoping — anyone running it locally
  hits an auth error or worse, exfiltrates to whatever S3 endpoint
  their env happens to have set.

* The wrap_up stage runs inside nmm-sandbox CI with the team's
  SPECDEC_BENCH_S3_* CI vars already configured (per the env-prefix
  rename in commit b928954). It can stamp jira_ticket from the
  Epic key + huggingface_model_id from the Epic SPEC, which is the
  data the visualizer needs to mark a run as an official record
  (csv_from_s3.py:255 `is_official = jira_ticket AND huggingface_model_id`).

The kwarg-forwarding fix in vllm.py (80309ca), runtime_params path
fix (2c69f89), and SPECDEC_BENCH_S3_* env-prefix in core.py
(b928954) all stay — they're useful infrastructure regardless of
where the upload step lives.

Signed-off-by: chenhany <chenhany@nvidia.com>
_engine_arg_fields = {f.name for f in dataclasses.fields(AsyncEngineArgs)}
forwarded_engine_kwargs = {
k: v for k, v in kwargs.items()
if k in _engine_arg_fields and k not in _VLLM_CONSUMED_KWARGS

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a issue here. If some args in _VLLM_CONSUMED_KWARGS are also passed in engine args in its original keyword, then there will still be a error of got multiple values for keyword argument?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. if prefix_cache or moe_expert_parallel_size is in engine args, it will not be de-duplicated here, therefore causing the error above.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a simpler way here is to fully respect the style of existing code: Only pass in what we need to AsyncEngineArgs. I think here we only need to append max_model_len in the arg list.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — you are right on all three points. Replaced the entire _VLLM_CONSUMED_KWARGS / forwarded_engine_kwargs machinery with a single explicit max_model_len=kwargs.get("max_model_len") in the AsyncEngineArgs call (commit c121ab7). Tests updated to match: removed the forwarding-logic tests, added three targeted tests covering the explicit path, the None-default case, and the absence of duplicate-kwarg errors.

Replace the generic `_VLLM_CONSUMED_KWARGS` / `forwarded_engine_kwargs`
machinery with a single explicit `max_model_len=kwargs.get("max_model_len")`
in the AsyncEngineArgs constructor call, following the existing code style.

The generic approach had a latent bug: any kwarg whose yaml name differs
from its vllm name (prefix_cache → enable_prefix_caching,
moe_expert_parallel_size → enable_expert_parallel) wouldn't be in
_VLLM_CONSUMED_KWARGS under its vllm name, so it could be forwarded a
second time causing "got multiple values for keyword argument". Since
max_model_len is the only engine_arg we actually need from
runtime_params_throughput_32k.yaml, the explicit approach is both simpler
and correct.

Tests updated to match: drop the pure-Python forwarding-logic tests,
add three targeted tests that verify the explicit max_model_len path,
None-default behavior, and absence of duplicate-kwarg errors.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
@ChenhanYu

Copy link
Copy Markdown
Collaborator Author

/ok to test c121ab7

DFlash ignores --draft_length: in vLLM, --draft_length maps to
speculative_num_steps but DFlash's accept path reads
speculative_num_draft_tokens instead. Without --block_size the cells
silently fall back to the default (and the agent's first
OMNIML-4962 dispatch hit this — saw the cell submit but the
acceptance length ran at AR≈1 instead of the expected ~5).

Convention: block_size = draft_length + 1 (cell t0_d3 → block_size=4,
t0_d7 → block_size=8).

Surfaced on OMNIML-4962. Was previously living on the cell PR
(#1638 / pensieve-intern/OMNIML-4961/t0_d3) but it's
specdec_bench infra, not per-cell, so it belongs here alongside
the rest of the OMNIML-4788 specdec_bench launcher work.

Signed-off-by: chenhany <chenhany@nvidia.com>
vLLM v0.22+ serving_config can contain int / float / None dict keys
(e.g. quantization config indexed by layer number). The provenance
dumper walks the config and calls _is_sensitive_key on every key —
which until now did `key.lower()` unconditionally and crashed:

  AttributeError: 'int' object has no attribute 'lower'
    at examples/specdec_bench/specdec_bench/utils.py:_is_sensitive_key

This was the root cause of OMNIML-4962 pipeline 53854022's experiment
failure (cicd_1780704675) — task_0 crashed before any tokens were
generated. Non-string keys are never sensitive by construction (the
allow-list and substring checks both operate on lowercased strings),
so the guard is unconditional: not a string → not sensitive.

Was previously living on the cell PR (#1638) but it's specdec_bench
infra, not per-cell, so it belongs here alongside the rest of the
OMNIML-4788 specdec_bench launcher work.

Signed-off-by: chenhany <chenhany@nvidia.com>
ChenhanYu added a commit that referenced this pull request Jun 6, 2026
The --block_size argparse arg + speculative_num_draft_tokens
forwarding (examples/specdec_bench/run.py) and the _is_sensitive_key
non-string-key guard (examples/specdec_bench/specdec_bench/utils.py)
were specdec_bench infrastructure — they got carried on this cell PR
because they were needed to run cell t0_d3 successfully, but they
apply to every algorithm/cell, not just DFlash/t0_d3.

Moving them to #1564 (chenhany/specdec-bench-throughput-32k-and-s3-upload,
OMNIML-4788) where they sit alongside the rest of the specdec_bench
launcher infra. That makes the dependency explicit: this cell PR
needs #1564 to land first.

This PR is now purely the two cell YAMLs.

Signed-off-by: chenhany <chenhany@nvidia.com>
ChenhanYu added a commit that referenced this pull request Jun 6, 2026
The per-cell file at common/specdec_bench/_cells/qwen35_4b_dflash_vllm_t0_d3.yaml
contained:
  sampling_kwargs: {temperature: 0}
  engine_args: {max_model_len: 40960}

Both are redundant for this cell:
- run.py's default sampling_kwargs is {"temperature": 0} when no
  --runtime_params is supplied (see run.py:149) — t0_* cells just don't
  pass --runtime_params for task_0.
- max_model_len=40960 already lives in the shared
  common/specdec_bench/runtime_params_throughput_32k.yaml (#1564), which
  is exactly what specdec_bench_mtp.yaml task_1 uses. Point task_1 at
  the shared file instead of duplicating it per cell.

Per-cell runtime_params become necessary only when temperature != 0
(future t1_d3 / t1_d7 cells); at that point we can either spawn a
per-cell file or add a --temperature CLI arg to run.py. Cross that
bridge when those cells land.

PR is now just the single example YAML (no _cells/ file).

Signed-off-by: chenhany <chenhany@nvidia.com>
ChenhanYu added 2 commits June 5, 2026 21:10
Replaces the runtime_params_throughput_32k.yaml indirection with
direct CLI flags. Lets cell / variant YAMLs parameterize the two
values agents typically need (sampling temperature, throughput-split
sequence-length cap) without spawning a per-cell YAML under
common/specdec_bench/.

CLI override precedence: --temperature / --max_seq_len win over the
corresponding key in --runtime_params engine_args / sampling_kwargs.
When neither is set, behavior is unchanged from before this PR
(default sampling_kwargs={"temperature": 0}, engine auto-derives
sequence-length cap).

Per-engine translation of --max_seq_len (since the same concept
has three different names across engines):

  VLLM   → max_model_len   (AsyncEngineArgs)
  TRTLLM → max_seq_len     (LLM(...))
  SGLANG → context_length  (sgl.Engine)

The mapping lives in run.py's _MAX_SEQ_LEN_KEY at module scope — one
table, one grep if a new engine is added. Per-engine wrappers carry
a comment pointing back at this seam; base.py's Model docstring
documents the cross-engine kwarg convention.

Wires SGLang's context_length through SGLANGModel for the first time
(was previously dropped on the floor) so --max_seq_len works there
too.

Removed:
- tools/launcher/common/specdec_bench/runtime_params_throughput_32k.yaml
  — its sole payload was engine_args.max_model_len: 40960 which is
  now --max_seq_len 40960 on the task's args list.

Updated:
- tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml task_1
- tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench_mtp.yaml task_1
  Both switch from --runtime_params ...runtime_params_throughput_32k.yaml
  to --max_seq_len 40960.

Forward-looks-at: the cell PR series (e.g. #1638 for OMNIML-4962
t0_d3) had been planning per-cell _cells/<sweep>.yaml for temperature
overrides on t1_* cells. With --temperature exposed, those cells can
stay as one example YAML each with --temperature 1 on the args list —
no _cells/ directory needed at all.

Signed-off-by: chenhany <chenhany@nvidia.com>
@h-guo18: "These tests seems not necessary to keep since we do not have
kwargs filters now. I would suggest remove this file for conciseness,
but unharmful to keep."

The tests were added mid-review of this PR to lock in a specific
max_model_len-forwarding fix; with the simplified surface (and now
the --max_seq_len CLI flag mapping at run.py's seam), they no longer
guard a regression risk worth carrying.

Signed-off-by: chenhany <chenhany@nvidia.com>
ChenhanYu added a commit that referenced this pull request Jun 6, 2026
…ed in #1564)

#1564 replaced common/specdec_bench/runtime_params_throughput_32k.yaml
with a generic --max_seq_len CLI flag that maps to the engine-specific
kwarg at run.py's seam (VLLM → max_model_len). Following suit on this
cell so the runtime_params indirection disappears entirely.

Also reflows the inline comments to match the new convention.

Signed-off-by: chenhany <chenhany@nvidia.com>
@ChenhanYu

Copy link
Copy Markdown
Collaborator Author

/ok to test 56f46c8

@copy-pr-bot

copy-pr-bot Bot commented Jun 6, 2026

Copy link
Copy Markdown

/ok to test 56f46c8

@ChenhanYu, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@ChenhanYu

Copy link
Copy Markdown
Collaborator Author

/ok to test 56f46c8

@copy-pr-bot

copy-pr-bot Bot commented Jun 6, 2026

Copy link
Copy Markdown

/ok to test 56f46c8

@ChenhanYu, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@ChenhanYu

Copy link
Copy Markdown
Collaborator Author

/ok to test 183b2dd

@ChenhanYu ChenhanYu enabled auto-merge (squash) June 6, 2026 04:45
run(), generate(), get_serving_config(), stop() all require a live vllm
engine and cannot be exercised by unit tests that stub the import. Mark
them with # pragma: no cover so codecov doesn't count them against the
project threshold, which dropped -5.42% after the PR added these lines.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
@ChenhanYu

Copy link
Copy Markdown
Collaborator Author

/ok to test 4322ac0

2 similar comments
@ChenhanYu

Copy link
Copy Markdown
Collaborator Author

/ok to test 4322ac0

@ChenhanYu

Copy link
Copy Markdown
Collaborator Author

/ok to test 4322ac0

@ChenhanYu ChenhanYu merged commit 2c52e7b into main Jun 7, 2026
63 of 65 checks passed
@ChenhanYu ChenhanYu deleted the chenhany/specdec-bench-throughput-32k-and-s3-upload branch June 7, 2026 03:04
@github-actions

github-actions Bot commented Jun 7, 2026

Copy link
Copy Markdown
Contributor
PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-07 03:04 UTC
kevalmorabia97 pushed a commit that referenced this pull request Jun 8, 2026
… upload step (#1564)

### What does this PR do?

Type of change: enhancement (follow-up to
[#1531](#1531)).

Extends the merged Qwen3.5-4B SPEED-Bench launcher YAMLs from a
single-task qualitative-only smoke into a **3-task pipeline** that also
covers long-context throughput and verifies the S3-upload path
end-to-end. Two commits, cleanly cherry-picked from #1531's late branch
state — they were authored after the merge-commit was resolved against
an earlier rebased head and so didn't ride along with that merge.

### Pipeline shape (both YAMLs)

| Task | Split | Save dir |
|---|---|---|
| `task_0` | qualitative (existing quality / acceptance-rate signal) |
`/scratchspace/specdec_bench{,_mtp}/qualitative` |
| `task_1` | **throughput_32k** (new — long-context throughput) |
`/scratchspace/specdec_bench{,_mtp}/throughput_32k` |
| `task_2` | **upload to S3 in sweep layout** |
`s3://team-specdec-workgroup/results/specdec_bench{,_mtp}/<split>/` |

### New artifacts

* `tools/launcher/common/specdec_bench/upload_to_s3.sh` — thin wrapper
around `examples/specdec_bench/upload_to_s3.py` so it can be invoked as
a launcher task. Installs `boto3` from `requirements.txt` on cold
containers; warm pipelines pick it up from the prior `run.sh`.
*
`tools/launcher/common/specdec_bench/runtime_params_throughput_32k.yaml`
— pins `engine_args.max_model_len = 40,960` (32K input + 4K output + 4K
headroom) so vLLM doesn't silently auto-cap `max_model_len` below the
36K minimum needed for `throughput_32k` prompts on single-GPU runs.

### Why max_model_len matters

Without an explicit `max_model_len`, vLLM auto-derives it from the model
config (Qwen3.5-4B = 128K) **and from the GPU-memory budget**. On a
single GPU the second factor can cap effective `max_model_len` well
below 36K, silently truncating 32K-token prompts and producing wrong
throughput numbers. The qualitative split is not affected (its prompts
top out around 8K, well under any auto-derivation floor) so only
`task_1` carries the override.

### S3 credentials

`upload_to_s3.sh` reads `S3_ENDPOINT` / `S3_KEY_ID` / `S3_SECRET` from
the runtime environment (not hardcoded). `--skip-existing` +
`--allow-incomplete-provenance` are passed by default so re-runs land
alongside the prior upload, and runs lacking `CONTAINER_IMAGE` (Phase-2
harness work in OMNIML-4788 will populate it) still upload.

### Testing

Cluster smoke on cw_dfw via:

```
uv run slurm.py --yaml modules/Model-Optimizer/tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml --yes
```

is currently in-flight (jobs `12257378/79/80`, PD). Will update this PR
with timing/AR numbers + S3 upload confirmation once it lands.

### Before your PR is "Ready for review"

- Backward compatible: ✅ (additive — task_0 keeps the prior qualitative
behavior, just with `/qualitative` suffix in `save_dir`)
- New PIP dep: ✅ no (boto3 already in
`examples/specdec_bench/requirements.txt` from #1531)
- New tests: N/A (launcher YAML + shell wrapper; covered by cluster
smoke)
- Changelog: N/A (internal-facing tooling)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added a 32K-context runtime configuration (higher max model length) to
enable long-context throughput benchmarking and avoid silent prompt
truncation.
* Added a launcher helper to upload benchmark results to S3 with
incremental/retry-friendly options and pass/fail reporting.

* **Chores**
* Split Qwen3.5-4B benchmark into separate qualitative and 32K
throughput tasks and added coordinated S3 upload.
* Applied the same multi-task pipeline layout and clearer output
organization to the MTP speculative-decoding benchmark.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/Model-Optimizer/pull/1564?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

<!-- review_stack_entry_end -->
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: chenhany <chenhany@nvidia.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
ChenhanYu added a commit that referenced this pull request Jun 10, 2026
Post-#1564, each ``(model, algorithm, engine)`` Epic owns exactly one
committed YAML: the parent at
``tools/launcher/examples/<family>/<model>/specdec_bench_<algo>_<engine>.yaml``.
Per-cell knobs (temperature, max_seq_len, save_dir, draft_length /
block_size) come from CLI overrides at slurm-invoke time via
``pipeline.task_N.args+=[...]``. No per-cell file is committed.

This branch had originally created
``tools/launcher/common/specdec_bench/_cells/gemma-4-E4B-it_mtp_vllm_t0_d3.yaml``
+ a ``--runtime_params common/specdec_bench/_cells/...`` reference in
the parent, both pre-#1564 shapes that duplicated the parent's knobs.
The cell-stage workflow SPEC template had stale Step 3 guidance still
mentioning the ``_cells/<sweep_name>.yaml`` shape, which is what the
agent followed; that contradiction is cleaned up in pensieve-intern !91.

Drop:
- ``tools/launcher/common/specdec_bench/_cells/gemma-4-E4B-it_mtp_vllm_t0_d3.yaml``
- the ``--runtime_params common/specdec_bench/_cells/...`` line on both
  task_0 and task_1 in the parent YAML

Update the header comment to document the canonical CLI-override
invocation pattern (the same pattern used by the
NVIDIA-Nemotron-3-Super-120B-A12B-BF16 parent on main).

The cell-side overrides for OMNIML-5024 (t0_d3) become:
    pipeline.task_0.args+=["--temperature 0",
                            "--max_seq_len 65536",
                            "--save_dir /scratchspace/<sweep>/qualitative",
                            "--draft_length 3"]
    pipeline.task_1.args+=["--temperature 0",
                            "--max_seq_len 65536",
                            "--save_dir /scratchspace/<sweep>/throughput_32k",
                            "--num_requests 80",
                            "--draft_length 3"]

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
ChenhanYu added a commit that referenced this pull request Jun 15, 2026
Switch container in tools/launcher/examples/google/gemma-4-E4B-it/
specdec_bench_mtp_vllm.yaml from vllm/vllm-openai:v0.22.1 →
vllm/vllm-openai:gemma. The generic v0.22.1 image fails Gemma 4 MTP
engine startup; the Gemma-specific image is what the sister cell
(OMNIML-5024 cell_t0_d3) ran successfully on.

Per PR #1564, this benchmark family does NOT use per-cell
_cells/<sweep>.yaml files — cell-specific params are passed as
pipeline.task_N.args+=[...] overrides at submit time, not committed
to the repo. The parent yaml + the args-override CLI is the
complete contract.

Real run with this container: experiment cicd_1781548540,
AL_qualitative_overall=3.2945, AL_throughput_32k_overall=3.3803.

Surfaced 2026-06-15 (intern-agent job 341631795). Supersedes #1736.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Co-Authored-By: pensieve-intern agent <noreply@nvidia.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
ChenhanYu added a commit that referenced this pull request Jun 15, 2026
…(post-PR-#1564) (#1741)

## What

Comment-only change to the Nemotron-3-Super-120B-A12B-BF16 DFlash parent
YAML doc-comment header. Rewrites the example invocation from the
pre-PR-#1564 pattern (with `--runtime_params
common/specdec_bench/_cells/<sweep_name>.yaml`) to the current
post-#1564 pattern (CLI-flag-only overrides, no per-cell file).

No yaml content / config change — only the prose example in the header.

## Why

PR #1564 removed the `tools/launcher/common/specdec_bench/_cells/` and
`_runtime_params/` directories. Cell-specific knobs are now CLI
overrides at slurm-invoke time, not committed files. The cell SPEC in
pensieve-intern (`specdec_bench/specs/cell.md`) is explicit about this —
five times in prose.

But this parent YAML's doc-comment kept advertising the OLD pattern.
When pensieve-intern's agent runner scans `tools/launcher/examples/` for
a reference invocation (good practice — agents should imitate working
examples), it lands on this Nemotron parent and copies the stale
pattern. **Five recent agent dispatches on the gemma-4 Epic OMNIML-5022
(cells OMNIML-5024 / 5025 / 5026 / 5027) authored new
`_cells/<sweep>.yaml` files for this reason, despite the SPEC telling
them not to.**

Prose loses to a concrete checked-in counter-example. The Qwen3.5-4B
reference template that cell.md officially points at
(`tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench_mtp_vllm.yaml`)
is clean and shows only the bare `uv run slurm.py --yaml ...` form. This
PR makes Nemotron-120B consistent with that template.

## How surfaced

Diagnosed 2026-06-15 on OMNIML-5025 cell_t0_d7 ([intern-agent job
341631795](https://gitlab-master.nvidia.com/omniml/integration/nmm-sandbox/-/jobs/341631795)).
The cell agent authored `_cells/gemma-4-E4B-it_mtp_vllm_t0_d7.yaml` —
the engine's diff-shape classifier should have rejected it, but didn't
(tracked separately as OMNIML-5170). Root-causing the agent's behavior
surfaced this stale doc-comment as the source of the pattern.

## Verification

`grep -rn '_cells\|runtime_params common'` across the entire launcher
tree returned only this file. After this PR, the launcher tree carries
zero stale references.

Pairs with #1738 (gemma-4-E4B-it container fix)
and OMNIML-5170 (engine-side classifier hard-reject for `_cells/` paths
— defense in depth).

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Updated example documentation to clarify how to override per-cell
parameters using CLI flags in Slurm configuration runs.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

3 participants