Tags: NVIDIA/Model-Optimizer
chore: stop tracking .claude/scheduled_tasks.lock (#1758)

## What

Remove `.claude/scheduled_tasks.lock` from version control and add a `.gitignore` rule so it is never committed again.

## Why

This file is an **ephemeral Claude Code scheduler lock**: its contents are runtime process state (`sessionId`, `pid`, `procStart`, `acquiredAt`), not source. It was accidentally committed in #1623 and is currently tracked on `main`. Reported by @sychen52 in the review of #1623.

## Changes

- `git rm --cached .claude/scheduled_tasks.lock`
- Add `.claude/scheduled_tasks.lock` to `.gitignore`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Chores**
  * Updated repository configuration to exclude internal runtime lock files from version control.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Ye Yu <yeyu@nvidia.com>
Adds AutoQuant support for VLM / Qwen3.5-Qwen3.6 style models (#1381)

### What does this PR do?

Type of change: new feature, bug fix, new tests

### Details

- Enables AutoQuant search over fused MoE expert containers by snapshotting/restoring their per-expert quantizers.
- Adds Qwen3.5/3.6 linear-attention grouping rules so fused deployment layers keep compatible quant formats.
- Supports `w4a16_nvfp4` as an AutoQuant search format.
- Preserves disabled AutoQuant layer patterns in generated configs while allowing selected modules like `lm_head` to override default disables.
- Keeps recipe-mode and AutoQuantize VLM paths on the outer CausalLM so Qwen3.5/3.6-MoE `lm_head` remains visible.
- Skips `parent_class`-scoped quant config entries during AutoQuant bare quantizer matching, preventing class-scoped global entries from last-match overriding every selected module.
- Adds temporary hardcoded Qwen/VLM AutoQuant disabled-layer patterns in `hf_ptq.py` with a TODO to refactor into the config system.
### Usage

```bash
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path <model_path> \
    --qformat fp8,w4a16_nvfp4 \
    --auto_quantize_bits 5.0 \
    --auto_quantize_cost_model active_moe \
    --auto_quantize_checkpoint <autoquant_state.pt> \
    --export_path <output_dir>
```

### Testing

- `/Users/weimingc/miniconda3/envs/modelopt/bin/python -m pytest tests/unit/torch/quantization/test_autoquant.py::test_get_auto_quantize_config_keeps_selected_lm_head_enabled tests/unit/torch/quantization/test_config_validation.py::TestMatchQuantizerCfg::test_parent_class_scoped_entries_are_ignored_for_bare_autoquant_lookup`
- `/Users/weimingc/miniconda3/envs/modelopt/bin/python -m pytest tests/unit/torch/quantization/test_autoquant.py tests/unit/torch/quantization/test_config_validation.py -k "not data_parallel"` (`120 passed, 1 deselected`)
- `/Users/weimingc/miniconda3/envs/modelopt/bin/python -m py_compile examples/llm_ptq/hf_ptq.py modelopt/torch/quantization/algorithms.py modelopt/torch/quantization/_auto_quantize_cost.py tests/unit/torch/quantization/test_autoquant.py tests/unit/torch/quantization/test_config_validation.py`
- Full local affected-file pytest without `-k "not data_parallel"` only failed `test_data_parallel_auto_quantize` because this local sandbox cannot bind a free socket (`PermissionError: Operation not permitted`).
- Ran Qwen3.6 35B AutoQuant e2e with `fp8,w4a16_nvfp4` and exported a checkpoint.
- Verified exported checkpoint loads in vLLM nightly without local patches.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A

### Additional Information

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Added w4a16_nvfp4 quantization format and optional cost-exclusion patterns for AutoQuantize.
* **Improvements**
  * Safer multimodal/VLM handling and AutoQuantize now runs on the full outer model when applicable.
  * Better fused-MoE support, more accurate weight accounting, and refined attention-grouping for improved quantization choices.
  * Dynamic layer-disabling support for targeted disables.
* **Tests**
  * New unit tests covering cost-model exclusions, fused-MoE accounting, and config selection.
* **Documentation**
  * Updated cost-constraint example to show exclusion-pattern usage.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
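The `parent_class` bullet in the Details list above is easier to follow with a toy lookup. The sketch below is an illustration only and does not reproduce ModelOpt's matching code: the entry layout (`pattern`, `cfg`, `parent_class` keys), the function name `lookup_bare_quantizer_cfg`, and the sample patterns are all hypothetical. It shows why a class-scoped catch-all entry, if not skipped, would win a last-match lookup for every module.

```python
from fnmatch import fnmatch


def lookup_bare_quantizer_cfg(entries, quantizer_name):
    """Last-match-wins lookup by name, ignoring class-scoped entries.

    Hypothetical entry layout: {"pattern": str, "cfg": dict, "parent_class": str | None}.
    """
    matched = None
    for entry in entries:
        if entry.get("parent_class") is not None:
            # Class-scoped entry: it needs a module class to apply,
            # so a bare-name lookup must not let it override earlier matches.
            continue
        if fnmatch(quantizer_name, entry["pattern"]):
            matched = entry["cfg"]  # later matches overwrite earlier ones
    return matched


# Hypothetical config: two name-based entries and one class-scoped catch-all.
entries = [
    {"pattern": "*weight_quantizer", "cfg": {"num_bits": 8}},
    {"pattern": "*lm_head*", "cfg": {"enable": False}},
    {"pattern": "*", "parent_class": "nn.LayerNorm", "cfg": {"enable": False}},
]

mlp_cfg = lookup_bare_quantizer_cfg(entries, "model.layers.0.mlp.weight_quantizer")
head_cfg = lookup_bare_quantizer_cfg(entries, "lm_head.weight_quantizer")
```

Without the `continue`, the final `"*"` entry would match every name and disable every selected module, which is the failure mode the bullet describes.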
[OMNIML-4788] specdec_bench/Qwen3.5-4B: throughput_32k benchmark + S3 upload step (#1564)

### What does this PR do?

Type of change: enhancement (follow-up to #1531).

Extends the merged Qwen3.5-4B SPEED-Bench launcher YAMLs from a single-task qualitative-only smoke into a **3-task pipeline** that also covers long-context throughput and verifies the S3-upload path end-to-end.

Two commits, cleanly cherry-picked from #1531's late branch state; they were authored after the merge-commit was resolved against an earlier rebased head and so didn't ride along with that merge.

### Pipeline shape (both YAMLs)

| Task | Split | Save dir |
|---|---|---|
| `task_0` | qualitative (existing quality / acceptance-rate signal) | `/scratchspace/specdec_bench{,_mtp}/qualitative` |
| `task_1` | **throughput_32k** (new: long-context throughput) | `/scratchspace/specdec_bench{,_mtp}/throughput_32k` |
| `task_2` | **upload to S3 in sweep layout** | `s3://team-specdec-workgroup/results/specdec_bench{,_mtp}/<split>/` |

### New artifacts

* `tools/launcher/common/specdec_bench/upload_to_s3.sh`: thin wrapper around `examples/specdec_bench/upload_to_s3.py` so it can be invoked as a launcher task. Installs `boto3` from `requirements.txt` on cold containers; warm pipelines pick it up from the prior `run.sh`.
* `tools/launcher/common/specdec_bench/runtime_params_throughput_32k.yaml`: pins `engine_args.max_model_len = 40,960` (32K input + 4K output + 4K headroom) so vLLM doesn't silently auto-cap `max_model_len` below the 36K minimum needed for `throughput_32k` prompts on single-GPU runs.

### Why max_model_len matters

Without an explicit `max_model_len`, vLLM auto-derives it from the model config (Qwen3.5-4B = 128K) **and from the GPU-memory budget**. On a single GPU the second factor can cap effective `max_model_len` well below 36K, silently truncating 32K-token prompts and producing wrong throughput numbers.
The qualitative split is not affected (its prompts top out around 8K, well under any auto-derivation floor) so only `task_1` carries the override.

### S3 credentials

`upload_to_s3.sh` reads `S3_ENDPOINT` / `S3_KEY_ID` / `S3_SECRET` from the runtime environment (not hardcoded). `--skip-existing` + `--allow-incomplete-provenance` are passed by default so re-runs land alongside the prior upload, and runs lacking `CONTAINER_IMAGE` (Phase-2 harness work in OMNIML-4788 will populate it) still upload.

### Testing

Cluster smoke on cw_dfw via:

```
uv run slurm.py --yaml modules/Model-Optimizer/tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml --yes
```

is currently in-flight (jobs `12257378/79/80`, PD). Will update this PR with timing/AR numbers + S3 upload confirmation once it lands.

### Before your PR is "Ready for review"

- Backward compatible: ✅ (additive: task_0 keeps the prior qualitative behavior, just with `/qualitative` suffix in `save_dir`)
- New PIP dep: ✅ no (boto3 already in `examples/specdec_bench/requirements.txt` from #1531)
- New tests: N/A (launcher YAML + shell wrapper; covered by cluster smoke)
- Changelog: N/A (internal-facing tooling)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Added a 32K-context runtime configuration (higher max model length) to enable long-context throughput benchmarking and avoid silent prompt truncation.
  * Added a launcher helper to upload benchmark results to S3 with incremental/retry-friendly options and pass/fail reporting.
* **Chores**
  * Split Qwen3.5-4B benchmark into separate qualitative and 32K throughput tasks and added coordinated S3 upload.
  * Applied the same multi-task pipeline layout and clearer output organization to the MTP speculative-decoding benchmark.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: chenhany <chenhany@nvidia.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
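The `max_model_len` budget quoted in this PR (32K input + 4K output + 4K headroom = 40,960, with a 36K minimum) can be checked with a few lines of arithmetic. This is a worked example only; it assumes "K" means 1024 tokens, and the helper `fits` is hypothetical rather than part of vLLM or the launcher.

```python
K = 1024  # assumption: "K" in the PR text means 1024 tokens

input_tokens = 32 * K
output_tokens = 4 * K
headroom = 4 * K

# Minimum context needed to hold a full prompt plus its generation.
min_required = input_tokens + output_tokens
# Value pinned in runtime_params_throughput_32k.yaml per the PR text.
max_model_len = min_required + headroom


def fits(prompt_len, gen_len, limit):
    """Return True if prompt plus generated tokens fit within the limit."""
    return prompt_len + gen_len <= limit


ok_pinned = fits(input_tokens, output_tokens, max_model_len)
ok_capped = fits(input_tokens, output_tokens, 32 * K)  # an auto-capped limit
```

With the pinned value the request fits; with a limit capped at 32K it does not, which is the silent-truncation case the PR guards against.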
fix(te-plugin): handle TE 2.15+ tuple return from `_Linear` / `_GroupedLinear`

TE 2.15+ changed `_Linear.forward` and `_GroupedLinear.forward` to return `(out, new_workspace)` tuples instead of a single tensor. ModelOpt's patched `te_quantized_linear_fn` / `te_grouped_quantized_linear_fn` still passed the whole tuple into `self.output_quantizer`, crashing inside `TensorQuantizer.forward` on `tuple.numel()`:

    AttributeError: 'tuple' object has no attribute 'numel'

Mirror the existing pattern from `_QuantTELayerNormLinear.forward`: quantize only `output[0]` (activation) and pass auxiliary workspace metadata through verbatim. TE <= 2.14 returns a single tensor and falls through the isinstance branch unchanged.

This unblocks Megatron-Bridge's TE 2.15 path; the local `patch_modelopt_te_linear_tuple_output` shim can be removed once this ships in a tagged release.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
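The tuple handling described in this commit can be sketched in isolation. This is a minimal illustration, not ModelOpt's code: `quantize_output` and `fake_quantizer` are hypothetical stand-ins for the patched function and `self.output_quantizer`, and plain floats stand in for tensors.

```python
def quantize_output(output, quantizer):
    """Quantize only the activation when the wrapped forward returns a tuple."""
    if isinstance(output, tuple):
        # TE 2.15+ style return: (out, new_workspace, ...).
        # Quantize the first element and pass the rest through untouched.
        return (quantizer(output[0]), *output[1:])
    # TE <= 2.14 style return: a single tensor.
    return quantizer(output)


def fake_quantizer(x):
    # Hypothetical stand-in for a quantizer: round to the nearest 0.5.
    return round(x * 2) / 2


single = quantize_output(1.26, fake_quantizer)
paired = quantize_output((1.26, "workspace"), fake_quantizer)
```

Keeping the non-tuple branch means older TE versions take the same code path as before.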
fix(te-plugin): make _Linear arg indexing robust to TE signature changes (#1473)

### What does this PR do?

Type of change: Bug fix

ModelOpt's `te_quantized_linear_fn` and `te_grouped_quantized_linear_fn` read `weight` / `inp` from hard-coded positions in `args`. Two TE signature changes broke this scheme:

- **TE 1.x → 2.0:** dropped the legacy `weight_fp8` slot between `weight` and `inp`. ModelOpt handled this with an `if Version("2.0") <= _TE_VERSION:` branch + a duplicate else branch.
- **TE 2.14 → 2.15:** inserted `weight_workspace` between `weight` and `inp` at the `_Linear.forward` call site ([TE 2.15 linear.py L1663](https://github.com/NVIDIA/TransformerEngine/blob/release_v2.15/transformer_engine/pytorch/module/linear.py#L1663)). Unhandled by ModelOpt: `args[idx + 1]` resolved to `None` (workspace is None outside FP8), which then crashed `TensorQuantizer.forward` on `inputs.numel()` with `AttributeError: 'NoneType' object has no attribute 'numel'`. Surfaced as a regression in Megatron-Bridge after the TE 2.15 bump alongside ModelOpt 0.44.0rc3.
- **TE 2.10:** `_GroupedLinear.forward`'s second positional slot was renamed `m_splits` → `non_tensor_args` (tuple wrapping). ModelOpt had a separate `Version("2.10")` gate for this.

Replace all three version gates with **parameter-name introspection** of the live `_Linear.forward` / `_GroupedLinear.forward` signature. The parameter names (`weight`, `inp`, `m_splits`, `non_tensor_args`) have been stable across TE 1.x, 2.x, and 2.15+; only their relative positions shift. The new code reads the live signature via `inspect.signature(...).parameters`, locates `weight`/`inp` by name, and mutates only those positions in a list copy of `args`; everything between (e.g. TE 2.15's `weight_workspace`) and after passes through verbatim. The dual-branch code in `te_quantized_linear_fn` collapses to a single path.

### Usage

No public API change.
PTQ continues to work transparently across all supported TE versions:

```python
import modelopt.torch.quantization as mtq

# Works on TE 1.x, 2.0-2.14, 2.15.x, and 2.16+; no version flag needed.
mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```

### Testing

<!-- Mention how have you tested your change if applicable. -->

Existing TE plugin tests (`tests/gpu_megatron/torch/quantization/plugins/test_transformer_engine.py`) exercise both the `_forward` (no-grad calibration) and `_apply` (grad-enabled training) paths of `te_quantized_linear_fn` for `te.pytorch.Linear`; they would have caught the original TE 2.15 regression on a CI matrix entry pinned to TE 2.15.

Verified trace correctness across:

| TE version | `_Linear.forward` signature | `_te_linear` weight→inp gap | `_GroupedLinear.forward` second slot |
|---|---|---|---|
| 1.x | `(ctx, weight, weight_fp8, inp, …)` | 1 | n/a |
| 2.0–2.14 | `(ctx, weight, inp, bias, …)` | 0 | `m_splits` |
| 2.15.x | `(ctx, weight, weight_workspace, inp, …)` | 1 | `non_tensor_args` |
| 2.16+ (main) | `(ctx, weight, inp, bias, fwd_args)` | 0 | `non_tensor_args` |

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ <!--- Public API unchanged; broadens the range of TE versions that work (TE 2.15.x now supported, TE 1.x still supported via the same introspection path). -->
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A <!--- Only adds a stdlib `inspect` import. -->
- Did you write any new necessary tests?: Existing tests sufficient <!--- Bug fix is covered by existing `test_transformer_engine.py` for whatever single TE version CI exercises. A multi-version TE matrix is the right next step but is out of scope for this PR. -->

### Additional Information

<!-- E.g. related issue. -->

Triggered by Megatron-Bridge NVIDIA-NeMo/Megatron-Bridge#3783 failing tests after bumping ModelOpt 0.44.0rc2 → 0.44.0rc3 together with a Megatron-LM bump that pulls TE 2.15. ModelOpt rc2 had the same latent bug; it just wasn't exercised until TE 2.15 became the runtime version.

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Refactor**
  * Improved Transformer Engine quantization plugin robustness by using runtime parameter inspection instead of version-based branching, ensuring compatibility across TE versions without requiring manual updates.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
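The parameter-name introspection this PR describes can be illustrated with toy signatures. The sketch below is not the ModelOpt implementation: `locate_params`, `quantize_named_args`, and the two `forward_*` functions are hypothetical, modelled on the signature table in the PR text, and it assumes the positional `args` exclude the leading `ctx` parameter.

```python
import inspect


def locate_params(forward_fn, names=("weight", "inp")):
    """Map each parameter name to its index in the positional args.

    Assumption: args are passed without the leading `ctx`, so indices
    are shifted down by one relative to the signature.
    """
    params = list(inspect.signature(forward_fn).parameters)
    offset = 1 if params and params[0] == "ctx" else 0
    return {name: params.index(name) - offset for name in names}


# Toy signatures modelled on the TE 2.0-2.14 and TE 2.15.x table rows.
def forward_v2_0(ctx, weight, inp, bias): ...
def forward_v2_15(ctx, weight, weight_workspace, inp, bias): ...


def quantize_named_args(forward_fn, args, quantize):
    """Apply `quantize` to weight/inp only; everything else passes through."""
    new_args = list(args)  # list copy, so the caller's tuple is untouched
    for idx in locate_params(forward_fn).values():
        new_args[idx] = quantize(new_args[idx])
    return tuple(new_args)


# str.lower stands in for a quantizer so the effect is visible.
old = quantize_named_args(forward_v2_0, ("W", "X", "b"), str.lower)
new = quantize_named_args(forward_v2_15, ("W", None, "X", "b"), str.lower)
```

Because positions are looked up by name, the extra `weight_workspace` slot (here `None`) is never handed to the quantizer, which is the crash the PR fixes.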
[Cherry-pick] PRs #1352 #1351 #1330 #1354 #1355 #1360 #1342 #1324 #1340 #1368 #1373 #1359 #1361 #1325 #1369 #1370 #1371 #1375 #1386 #1353 #1356 #1390 (#1385)

## Cherry-picked PRs

- #1352
- #1351
- #1330
- #1354
- #1355
- #1360
- #1342
- #1324
- #1340
- #1368
- #1373
- #1359
- #1361
- #1325
- #1369
- #1370
- #1371
- #1375
- #1386
- #1353
- #1356
- #1390

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Added Python 3.14 support (basic unit tests verified; production defaults on Python 3.12)
  * Added Windows CUDA 13.x installation guidance
  * Introduced LLM ONNX export utilities with quantization support
  * Extended Medusa mode support in speculative decoding pipeline
* **Bug Fixes**
  * Fixed FP8 quantization for vision transformer multi-head attention
  * Improved MoE expert handling in quantization calibration and inference
  * Enhanced ONNX graph utilities for FP8 weight transformation
* **Documentation**
  * Comprehensive Minitron pruning + distillation + quantization + vLLM tutorials with ablation studies
  * Megatron data preparation guide for tokenization workflows
  * Puzzletron distillation results and cross-reference updates

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Grzegorz K. Karch <grzegorz-k-karch@users.noreply.github.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: ynankani <ynankani@nvidia.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: vipandya <vipandya@nvidia.com>
Signed-off-by: dmoodie <dmoodie@nvidia.com>
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Ajinkya Rasane <131806219+ajrasane@users.noreply.github.com>
Co-authored-by: Grzegorz K. Karch <grzegorz-k-karch@users.noreply.github.com>
Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
Co-authored-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
Co-authored-by: Asha Anoosheh <aanoosheh@nvidia.com>
Co-authored-by: Jenny Chen <jennifchen@nvidia.com>
Co-authored-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com>
Co-authored-by: ynankani <ynankani@nvidia.com>
Co-authored-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Co-authored-by: vishalpandya1990 <vishalpandya1990@gmail.com>
Co-authored-by: dthienan-nv <dmoodie@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Hrishith Thadicherla <99313418+hthadicherla@users.noreply.github.com>
Co-authored-by: yeyu-nvidia <yeyu@nvidia.com>
Co-authored-by: kaix-nv <kaix@nvidia.com>
Co-authored-by: sugunav14 <178320438+sugunav14@users.noreply.github.com>
fix: PTQ 1GPU, export PP divisibility, hidden states conversations key (#1293)

## Summary

- **megatron_lm_ptq.yaml**: Qwen3-8B PTQ to single GPU for L40 clusters (TP=1, all tasks)
- **quantize.sh**: Auto-find the largest PP that divides the model's `num_hidden_layers` for the export step. Qwen3-8B has 36 layers, which is not divisible by 8 and caused an `AssertionError` on 8-GPU nodes
- **compute_hidden_states_trtllm.py**: Use `messages` with `conversations` fallback, matching the HF version. Fixes `KeyError: 'conversations'` when data uses the OpenAI `messages` format

## Test plan

- [x] Qwen3-8B PTQ runs on single L40 GPU
- [x] Export PP auto-selects valid divisor (36 layers → PP=6 on 8 GPUs, PP=4 on 4 GPUs, PP=1 on 1 GPU)
- [x] EAGLE3 offline pipeline reads data with `messages` field

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Dataset input handling now supports multiple field formats for enhanced compatibility.
* **Bug Fixes**
  * Optimized GPU resource allocation during model quantization with improved pipeline parallelism computation.
  * Updated quantization configuration for more efficient resource utilization.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
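The PP selection rule from the summary and test plan above can be restated in a few lines. The actual change lives in the shell script `quantize.sh`; the Python below is only a restatement of the rule as described (largest PP not exceeding the GPU count that divides `num_hidden_layers`), and the function name `largest_valid_pp` is hypothetical.

```python
def largest_valid_pp(num_hidden_layers, num_gpus):
    """Return the largest pipeline-parallel size <= num_gpus that divides the layer count."""
    for pp in range(num_gpus, 0, -1):
        if num_hidden_layers % pp == 0:
            return pp
    return 1  # only reached when num_gpus < 1


# Qwen3-8B has 36 layers, per the commit message.
pp_on_8 = largest_valid_pp(36, 8)
pp_on_4 = largest_valid_pp(36, 4)
pp_on_1 = largest_valid_pp(36, 1)
```

These three calls reproduce the values listed in the test plan: PP=6 on 8 GPUs, PP=4 on 4 GPUs, and PP=1 on 1 GPU.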