Tags: NVIDIA/Model-Optimizer

0.45.0rc1

Verified

This commit was signed with the committer’s verified signature.
kevalmorabia97 Keval Morabia
chore: stop tracking .claude/scheduled_tasks.lock (#1758)

## What
Remove `.claude/scheduled_tasks.lock` from version control and add a
`.gitignore` rule so it is never committed again.

## Why
This file is an **ephemeral Claude Code scheduler lock** — its contents
are runtime process state (`sessionId`, `pid`, `procStart`,
`acquiredAt`), not source. It was accidentally committed in #1623 and is
currently tracked on `main`.

Reported by @sychen52 in the review of #1623.

## Changes
- `git rm --cached .claude/scheduled_tasks.lock`
- Add `.claude/scheduled_tasks.lock` to `.gitignore`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
  * Updated repository configuration to exclude internal runtime lock
    files from version control.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Ye Yu <yeyu@nvidia.com>

0.46.0dev

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Adds AutoQuant support for VLM / Qwen3.5-Qwen3.6 style models (#1381)

### What does this PR do?

Type of change: new feature, bug fix, new tests

### Details

- Enables AutoQuant search over fused MoE expert containers by
snapshotting/restoring their per-expert quantizers.
- Adds Qwen3.5/3.6 linear-attention grouping rules so fused deployment
layers keep compatible quant formats.
- Supports `w4a16_nvfp4` as an AutoQuant search format.
- Preserves disabled AutoQuant layer patterns in generated configs while
allowing selected modules like `lm_head` to override default disables.
- Keeps recipe-mode and AutoQuantize VLM paths on the outer CausalLM so
Qwen3.5/3.6-MoE `lm_head` remains visible.
- Skips `parent_class`-scoped quant config entries during AutoQuant bare
quantizer matching, preventing class-scoped global entries from
last-match overriding every selected module.
- Adds temporary hardcoded Qwen/VLM AutoQuant disabled-layer patterns in
`hf_ptq.py` with a TODO to refactor into the config system.
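
The `parent_class` item above can be illustrated with a minimal sketch. It assumes a simplified, hypothetical config layout (a list of dicts with `pattern`, optional `parent_class`, and `cfg` keys) and last-match-wins semantics. The function name and schema are illustrative only and may not match the actual ModelOpt matching code.

```python
from fnmatch import fnmatch


def match_bare_quantizer_cfg(entries, quantizer_name, skip_parent_class_scoped=True):
    """Return the cfg of the last entry whose pattern matches quantizer_name.

    Hypothetical schema: each entry is a dict with "pattern", "cfg", and an
    optional "parent_class" key that scopes the entry to one module class.
    """
    matched = None
    for entry in entries:
        # A bare lookup has no module class to compare against, so
        # class-scoped entries are skipped instead of matching everything.
        if skip_parent_class_scoped and entry.get("parent_class") is not None:
            continue
        if fnmatch(quantizer_name, entry["pattern"]):
            matched = entry["cfg"]  # later matches override earlier ones
    return matched


entries = [
    {"pattern": "*weight_quantizer", "cfg": {"num_bits": 4}},
    # Class-scoped global entry listed last: without the skip it would win.
    {"pattern": "*", "parent_class": "nn.Linear", "cfg": {"enable": False}},
]
name = "model.layers.0.mlp.weight_quantizer"
selected = match_bare_quantizer_cfg(entries, name)
overridden = match_bare_quantizer_cfg(entries, name, skip_parent_class_scoped=False)
```

With the skip enabled, `selected` keeps the format chosen for the module. With it disabled, the class-scoped entry overrides the selection because it matches last.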

### Usage

```bash
python examples/llm_ptq/hf_ptq.py \
  --pyt_ckpt_path <model_path> \
  --qformat fp8,w4a16_nvfp4 \
  --auto_quantize_bits 5.0 \
  --auto_quantize_cost_model active_moe \
  --auto_quantize_checkpoint <autoquant_state.pt> \
  --export_path <output_dir>
```

### Testing

- `/Users/weimingc/miniconda3/envs/modelopt/bin/python -m pytest
tests/unit/torch/quantization/test_autoquant.py::test_get_auto_quantize_config_keeps_selected_lm_head_enabled
tests/unit/torch/quantization/test_config_validation.py::TestMatchQuantizerCfg::test_parent_class_scoped_entries_are_ignored_for_bare_autoquant_lookup`
- `/Users/weimingc/miniconda3/envs/modelopt/bin/python -m pytest
tests/unit/torch/quantization/test_autoquant.py
tests/unit/torch/quantization/test_config_validation.py -k "not
data_parallel"` (`120 passed, 1 deselected`)
- `/Users/weimingc/miniconda3/envs/modelopt/bin/python -m py_compile
examples/llm_ptq/hf_ptq.py modelopt/torch/quantization/algorithms.py
modelopt/torch/quantization/_auto_quantize_cost.py
tests/unit/torch/quantization/test_autoquant.py
tests/unit/torch/quantization/test_config_validation.py`
- Full local affected-file pytest without `-k "not data_parallel"` only
failed `test_data_parallel_auto_quantize` because this local sandbox
cannot bind a free socket (`PermissionError: Operation not permitted`).
- Ran Qwen3.6 35B AutoQuant e2e with `fp8,w4a16_nvfp4` and exported a
checkpoint.
- Verified exported checkpoint loads in vLLM nightly without local
patches.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A

### Additional Information

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * Added w4a16_nvfp4 quantization format and optional cost-exclusion
    patterns for AutoQuantize.

* **Improvements**
  * Safer multimodal/VLM handling; AutoQuantize now runs on the full
    outer model when applicable.
  * Better fused-MoE support, more accurate weight accounting, and refined
    attention-grouping for improved quantization choices.
  * Dynamic layer-disabling support for targeted disables.

* **Tests**
  * New unit tests covering cost-model exclusions, fused-MoE accounting,
    and config selection.

* **Documentation**
  * Updated cost-constraint example to show exclusion-pattern usage.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>

0.45.0rc0

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
[OMNIML-4788] specdec_bench/Qwen3.5-4B: throughput_32k benchmark + S3 upload step (#1564)

### What does this PR do?

Type of change: enhancement (follow-up to #1531).

Extends the merged Qwen3.5-4B SPEED-Bench launcher YAMLs from a
single-task qualitative-only smoke into a **3-task pipeline** that also
covers long-context throughput and verifies the S3-upload path
end-to-end. Two commits, cleanly cherry-picked from #1531's late branch
state — they were authored after the merge-commit was resolved against
an earlier rebased head and so didn't ride along with that merge.

### Pipeline shape (both YAMLs)

| Task | Split | Save dir |
|---|---|---|
| `task_0` | qualitative (existing quality / acceptance-rate signal) | `/scratchspace/specdec_bench{,_mtp}/qualitative` |
| `task_1` | **throughput_32k** (new, long-context throughput) | `/scratchspace/specdec_bench{,_mtp}/throughput_32k` |
| `task_2` | **upload to S3 in sweep layout** | `s3://team-specdec-workgroup/results/specdec_bench{,_mtp}/<split>/` |

### New artifacts

* `tools/launcher/common/specdec_bench/upload_to_s3.sh`: thin wrapper
around `examples/specdec_bench/upload_to_s3.py` so it can be invoked as
a launcher task. Installs `boto3` from `requirements.txt` on cold
containers; warm pipelines pick it up from the prior `run.sh`.
* `tools/launcher/common/specdec_bench/runtime_params_throughput_32k.yaml`:
pins `engine_args.max_model_len = 40960` (32K input + 4K output + 4K
headroom) so vLLM doesn't silently auto-cap `max_model_len` below the
36K minimum needed for `throughput_32k` prompts on single-GPU runs.

### Why max_model_len matters

Without an explicit `max_model_len`, vLLM auto-derives it from the model
config (Qwen3.5-4B = 128K) **and from the GPU-memory budget**. On a
single GPU the second factor can cap effective `max_model_len` well
below 36K, silently truncating 32K-token prompts and producing wrong
throughput numbers. The qualitative split is not affected (its prompts
top out around 8K, well under any auto-derivation floor) so only
`task_1` carries the override.
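
The token budget behind the pinned value can be checked with a few lines of arithmetic. This is a sketch that assumes "K" means 1024 tokens, which is consistent with the 40,960 figure in this PR. The constant and function names are illustrative, not taken from the launcher YAMLs.

```python
K = 1024  # assumption: "K" denotes 1024 tokens

INPUT_TOKENS = 32 * K     # throughput_32k prompt length
OUTPUT_TOKENS = 4 * K     # generated tokens
HEADROOM_TOKENS = 4 * K   # safety margin

# Minimum context needed to hold a full prompt plus its output (36K).
min_required_len = INPUT_TOKENS + OUTPUT_TOKENS

# Value pinned in the runtime params (36K minimum + 4K headroom).
pinned_max_model_len = min_required_len + HEADROOM_TOKENS


def fits(effective_max_model_len: int) -> bool:
    """True if a context window can hold a 32K prompt plus 4K of output."""
    return effective_max_model_len >= min_required_len
```

Any auto-derived `max_model_len` for which `fits(...)` is false would truncate `throughput_32k` prompts, which is the failure mode the explicit override avoids.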

### S3 credentials

`upload_to_s3.sh` reads `S3_ENDPOINT` / `S3_KEY_ID` / `S3_SECRET` from
the runtime environment (not hardcoded). `--skip-existing` +
`--allow-incomplete-provenance` are passed by default so re-runs land
alongside the prior upload, and runs lacking `CONTAINER_IMAGE` (Phase-2
harness work in OMNIML-4788 will populate it) still upload.
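
Reading the credentials from the environment rather than hardcoding them can be sketched as follows. The wrapper in this PR is a shell script; this Python version is only an illustration, and `load_s3_settings` is a hypothetical helper, not a function from `upload_to_s3.py`.

```python
import os

# Variable names as listed in this PR description.
REQUIRED_VARS = ("S3_ENDPOINT", "S3_KEY_ID", "S3_SECRET")


def load_s3_settings(env=None):
    """Collect S3 settings from the environment and fail fast if any is unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing S3 environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing before any upload starts keeps a misconfigured run from producing a partial result set.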

### Testing

Cluster smoke on cw_dfw via:

```
uv run slurm.py --yaml modules/Model-Optimizer/tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml --yes
```

is currently in-flight (jobs `12257378/79/80`, PD). Will update this PR
with timing/AR numbers + S3 upload confirmation once it lands.

### Before your PR is "Ready for review"

- Backward compatible: ✅ (additive — task_0 keeps the prior qualitative
behavior, just with `/qualitative` suffix in `save_dir`)
- New PIP dep: ✅ no (boto3 already in
`examples/specdec_bench/requirements.txt` from #1531)
- New tests: N/A (launcher YAML + shell wrapper; covered by cluster
smoke)
- Changelog: N/A (internal-facing tooling)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * Added a 32K-context runtime configuration (higher max model length) to
    enable long-context throughput benchmarking and avoid silent prompt
    truncation.
  * Added a launcher helper to upload benchmark results to S3 with
    incremental/retry-friendly options and pass/fail reporting.

* **Chores**
  * Split Qwen3.5-4B benchmark into separate qualitative and 32K
    throughput tasks and added coordinated S3 upload.
  * Applied the same multi-task pipeline layout and clearer output
    organization to the MTP speculative-decoding benchmark.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: chenhany <chenhany@nvidia.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

0.44.0

Verified

This commit was signed with the committer’s verified signature.
kevalmorabia97 Keval Morabia
fix(te-plugin): handle TE 2.15+ tuple return from `_Linear` / `_GroupedLinear`

TE 2.15+ changed `_Linear.forward` and `_GroupedLinear.forward` to return
`(out, new_workspace)` tuples instead of a single tensor. ModelOpt's
patched `te_quantized_linear_fn` / `te_grouped_quantized_linear_fn` still
passed the whole tuple into `self.output_quantizer`, crashing inside
`TensorQuantizer.forward` on `tuple.numel()`:

  AttributeError: 'tuple' object has no attribute 'numel'

Mirror the existing pattern from `_QuantTELayerNormLinear.forward`:
quantize only `output[0]` (activation) and pass auxiliary workspace
metadata through verbatim. TE <= 2.14 returns a single tensor and falls
through the isinstance branch unchanged.
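
The isinstance branch described above can be sketched in isolation. This is a simplified stand-in, not the actual ModelOpt patch: `quantize_linear_output` is a hypothetical helper, and `output_quantizer` is any callable applied to the activation.

```python
def quantize_linear_output(output, output_quantizer):
    """Quantize only the activation and pass auxiliary values through untouched."""
    if isinstance(output, tuple):
        # Tuple return, e.g. (out, new_workspace): quantize element 0 only.
        return (output_quantizer(output[0]), *output[1:])
    # Single-tensor return: quantize it directly.
    return output_quantizer(output)
```

Both return shapes go through the same function, so no version check is needed at the call site.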

This unblocks Megatron-Bridge's TE 2.15 path; the local
`patch_modelopt_te_linear_tuple_output` shim can be removed once this
ships in a tagged release.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

0.44.0rc5

Verified

This commit was signed with the committer’s verified signature.
kevalmorabia97 Keval Morabia
fix(te-plugin): handle TE 2.15+ tuple return from `_Linear` / `_GroupedLinear`

TE 2.15+ changed `_Linear.forward` and `_GroupedLinear.forward` to return
`(out, new_workspace)` tuples instead of a single tensor. ModelOpt's
patched `te_quantized_linear_fn` / `te_grouped_quantized_linear_fn` still
passed the whole tuple into `self.output_quantizer`, crashing inside
`TensorQuantizer.forward` on `tuple.numel()`:

  AttributeError: 'tuple' object has no attribute 'numel'

Mirror the existing pattern from `_QuantTELayerNormLinear.forward`:
quantize only `output[0]` (activation) and pass auxiliary workspace
metadata through verbatim. TE <= 2.14 returns a single tensor and falls
through the isinstance branch unchanged.

This unblocks Megatron-Bridge's TE 2.15 path; the local
`patch_modelopt_te_linear_tuple_output` shim can be removed once this
ships in a tagged release.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

0.44.0rc4

Verified

This commit was signed with the committer’s verified signature.
kevalmorabia97 Keval Morabia
fix(te-plugin): make _Linear arg indexing robust to TE signature changes (#1473)

### What does this PR do?

Type of change: Bug fix

ModelOpt's `te_quantized_linear_fn` and `te_grouped_quantized_linear_fn`
read `weight` / `inp` from hard-coded positions in `args`. Three TE
signature changes have affected this scheme:

- **TE 1.x → 2.0:** dropped the legacy `weight_fp8` slot between
`weight` and `inp`. ModelOpt handled this with an `if Version("2.0") <=
_TE_VERSION:` branch + a duplicate else branch.
- **TE 2.14 → 2.15:** inserted `weight_workspace` between `weight` and
`inp` at the `_Linear.forward` call site ([TE 2.15 linear.py
L1663](https://github.com/NVIDIA/TransformerEngine/blob/release_v2.15/transformer_engine/pytorch/module/linear.py#L1663)).
Unhandled by ModelOpt — `args[idx + 1]` resolved to `None` (workspace is
None outside FP8), which then crashed `TensorQuantizer.forward` on
`inputs.numel()` with `AttributeError: 'NoneType' object has no
attribute 'numel'`. Surfaced as a regression in Megatron-Bridge after
the TE 2.15 bump alongside ModelOpt 0.44.0rc3.
- **TE 2.10:** `_GroupedLinear.forward`'s second positional slot was
renamed `m_splits` → `non_tensor_args` (tuple wrapping). ModelOpt had a
separate `Version("2.10")` gate for this.

Replace all three version gates with **parameter-name introspection** of
the live `_Linear.forward` / `_GroupedLinear.forward` signature. The
parameter names (`weight`, `inp`, `m_splits`, `non_tensor_args`) have
been stable across TE 1.x, 2.x, and 2.15+; only their relative positions
shift. The new code reads the live signature via
`inspect.signature(...).parameters`, locates `weight`/`inp` by name, and
mutates only those positions in a list copy of `args` — everything
between (e.g. TE 2.15's `weight_workspace`) and after passes through
verbatim. The dual-branch code in `te_quantized_linear_fn` collapses to
a single path.
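
The name-based lookup described above can be sketched with plain functions standing in for the TE `forward` signatures. The two `forward_*` stubs, `arg_positions`, and `quantize_args` are hypothetical illustrations rather than TE or ModelOpt code. The sketch assumes `ctx` is supplied by autograd and is therefore absent from the positional `args` being rewritten.

```python
import inspect


# Stand-ins mimicking the parameter order of two TE versions (illustrative only).
def forward_te_2_14(ctx, weight, inp, bias, *rest): ...
def forward_te_2_15(ctx, weight, weight_workspace, inp, bias, *rest): ...


def arg_positions(forward_fn, names=("weight", "inp")):
    """Map parameter names to their index in args (ctx is not part of args)."""
    params = [p for p in inspect.signature(forward_fn).parameters if p != "ctx"]
    return {name: params.index(name) for name in names}


def quantize_args(forward_fn, args, quantize_weight, quantize_input):
    """Rewrite only the weight and inp slots in a list copy of args."""
    pos = arg_positions(forward_fn)
    new_args = list(args)
    new_args[pos["weight"]] = quantize_weight(args[pos["weight"]])
    new_args[pos["inp"]] = quantize_input(args[pos["inp"]])
    return tuple(new_args)


# The extra weight_workspace slot passes through untouched.
result = quantize_args(
    forward_te_2_15, ("W", None, "X", "b"), lambda w: "q" + w, lambda x: "q" + x
)
```

Because positions are resolved from the live signature, inserting or removing a slot between `weight` and `inp` needs no code change as long as the parameter names stay stable.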

### Usage

No public API change. PTQ continues to work transparently across all
supported TE versions:

```python
import modelopt.torch.quantization as mtq
# Works on TE 1.x, 2.0-2.14, 2.15.x, and 2.16+ — no version flag needed.
mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```

### Testing

Existing TE plugin tests
(`tests/gpu_megatron/torch/quantization/plugins/test_transformer_engine.py`)
exercise both the `_forward` (no-grad calibration) and `_apply`
(grad-enabled training) paths of `te_quantized_linear_fn` for
`te.pytorch.Linear` — they would have caught the original TE 2.15
regression on a CI matrix entry pinned to TE 2.15. Verified trace
correctness across:

| TE version | `_Linear.forward` signature | `_te_linear` weight→inp gap | `_GroupedLinear.forward` second slot |
|---|---|---|---|
| 1.x | `(ctx, weight, weight_fp8, inp, …)` | 1 | n/a |
| 2.0–2.14 | `(ctx, weight, inp, bias, …)` | 0 | `m_splits` |
| 2.15.x | `(ctx, weight, weight_workspace, inp, …)` | 1 | `non_tensor_args` |
| 2.16+ (main) | `(ctx, weight, inp, bias, fwd_args)` | 0 | `non_tensor_args` |
### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ <!--- Public API unchanged;
broadens the range of TE versions that work (TE 2.15.x now supported, TE
1.x still supported via the same introspection path). -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A <!--- Only
adds a stdlib `inspect` import. -->
- Did you write any new necessary tests?: Existing tests sufficient
<!--- Bug fix is covered by existing `test_transformer_engine.py` for
whatever single TE version CI exercises. A multi-version TE matrix is
the right next step but is out of scope for this PR. -->

### Additional Information
Triggered by Megatron-Bridge
NVIDIA-NeMo/Megatron-Bridge#3783 failing tests
after bumping ModelOpt 0.44.0rc2 → 0.44.0rc3 together with a Megatron-LM
bump that pulls TE 2.15. ModelOpt rc2 had the same latent bug — it just
wasn't exercised until TE 2.15 became the runtime version.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
  * Improved Transformer Engine quantization plugin robustness by using
    runtime parameter inspection instead of version-based branching,
    ensuring compatibility across TE versions without requiring manual
    updates.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

0.44.0rc3

Verified

This commit was signed with the committer’s verified signature.
kevalmorabia97 Keval Morabia
Add Deprecation warning for GradNAS

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

0.44.0rc2

Partially verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
We cannot verify signatures from co-authors, and some of the co-authors attributed to this commit require their commits to be signed.
[Cherry-pick] PRs #1352 #1351 #1330 #1354 #1355 #1360 #1342 #1324 #1340 #1368 #1373 #1359 #1361 #1325 #1369 #1370 #1371 #1375 #1386 #1353 #1356 #1390 (#1385)

## Cherry-picked PRs

- #1352
- #1351
- #1330
- #1354
- #1355
- #1360
- #1342
- #1324
- #1340
- #1368
- #1373
- #1359
- #1361
- #1325
- #1369
- #1370
- #1371
- #1375
- #1386
- #1353
- #1356
- #1390

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
  * Added Python 3.14 support (basic unit tests verified; production
    defaults on Python 3.12)
  * Added Windows CUDA 13.x installation guidance
  * Introduced LLM ONNX export utilities with quantization support
  * Extended Medusa mode support in speculative decoding pipeline

* **Bug Fixes**
  * Fixed FP8 quantization for vision transformer multi-head attention
  * Improved MoE expert handling in quantization calibration and inference
  * Enhanced ONNX graph utilities for FP8 weight transformation

* **Documentation**
  * Comprehensive Minitron pruning + distillation + quantization + vLLM
    tutorials with ablation studies
  * Megatron data preparation guide for tokenization workflows
  * Puzzletron distillation results and cross-reference updates

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Grzegorz K. Karch <grzegorz-k-karch@users.noreply.github.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: ynankani <ynankani@nvidia.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: vipandya <vipandya@nvidia.com>
Signed-off-by: dmoodie <dmoodie@nvidia.com>
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Ajinkya Rasane <131806219+ajrasane@users.noreply.github.com>
Co-authored-by: Grzegorz K. Karch <grzegorz-k-karch@users.noreply.github.com>
Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
Co-authored-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
Co-authored-by: Asha Anoosheh <aanoosheh@nvidia.com>
Co-authored-by: Jenny Chen <jennifchen@nvidia.com>
Co-authored-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com>
Co-authored-by: ynankani <ynankani@nvidia.com>
Co-authored-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Co-authored-by: vishalpandya1990 <vishalpandya1990@gmail.com>
Co-authored-by: dthienan-nv <dmoodie@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Hrishith Thadicherla <99313418+hthadicherla@users.noreply.github.com>
Co-authored-by: yeyu-nvidia <yeyu@nvidia.com>
Co-authored-by: kaix-nv <kaix@nvidia.com>
Co-authored-by: sugunav14 <178320438+sugunav14@users.noreply.github.com>

0.45.0dev

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
fix: PTQ 1GPU, export PP divisibility, hidden states conversations key (#1293)

## Summary
- **megatron_lm_ptq.yaml**: Run Qwen3-8B PTQ on a single GPU for L40
clusters (TP=1, all tasks)
- **quantize.sh**: Auto-find the largest PP that divides the model's
`num_hidden_layers` for the export step. Qwen3-8B has 36 layers, which is
not divisible by 8, causing an `AssertionError` on 8-GPU nodes
- **compute_hidden_states_trtllm.py**: Use `messages` with a
`conversations` fallback, matching the HF version. Fixes `KeyError:
'conversations'` when data uses the OpenAI `messages` format

## Test plan
- [x] Qwen3-8B PTQ runs on single L40 GPU
- [x] Export PP auto-selects valid divisor (36 layers → PP=6 on 8 GPUs,
PP=4 on 4 GPUs, PP=1 on 1 GPU)
- [x] EAGLE3 offline pipeline reads data with `messages` field
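
The PP selection exercised in the test plan can be sketched as a search for the largest divisor of the layer count that does not exceed the GPU count. The actual change lives in `quantize.sh`; this Python version and the name `largest_valid_pp` are illustrative assumptions.

```python
def largest_valid_pp(num_hidden_layers: int, num_gpus: int) -> int:
    """Largest pipeline-parallel size <= num_gpus that divides the layer count."""
    for pp in range(min(num_gpus, num_hidden_layers), 0, -1):
        if num_hidden_layers % pp == 0:
            return pp
    return 1  # fallback when num_gpus < 1


# Qwen3-8B has 36 layers (per the summary above).
pp_by_gpus = {gpus: largest_valid_pp(36, gpus) for gpus in (8, 4, 1)}
```

For 36 layers this gives PP=6 on 8 GPUs, PP=4 on 4 GPUs, and PP=1 on 1 GPU, matching the values listed in the test plan.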

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
  * Dataset input handling now supports multiple field formats for
    enhanced compatibility.

* **Bug Fixes**
  * Optimized GPU resource allocation during model quantization with
    improved pipeline parallelism computation.
  * Updated quantization configuration for more efficient resource
    utilization.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

0.44.0rc1

Verified

This commit was signed with the committer’s verified signature.
kevalmorabia97 Keval Morabia
[Release-fix] Pin transformers<5.6 in release branch

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>