Skip to content

Puzzletron tutorial fixes for runtime optimization#1803

Open
grzegorz-k-karch wants to merge 4 commits into
mainfrom
gkarch/puzzletron-tutorial-fixes
Open

Puzzletron tutorial fixes for runtime optimization#1803
grzegorz-k-karch wants to merge 4 commits into
mainfrom
gkarch/puzzletron-tutorial-fixes

Conversation

@grzegorz-k-karch

@grzegorz-k-karch grzegorz-k-karch commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Type of change: Bug fix

Fixes some issues related to runtime optimization

  • Solved OOM - fix: reduced GPU memory utilization
  • Correctly export AnyModel config for vLLM - use namespace instead of dict to correctly read config
  • Fixed validate_model_defaults not found error - runtime optimization has now its own separate config files instead of reusing memory optimization files

Usage

(does not apply)

Testing

Tested by running the whole pipeline as described in the tutorial

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A
  • Did you update Changelog?: ❌
  • Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Release Notes

  • New Features

    • Added comprehensive model pruning configuration suite supporting attention, FFN, and hidden dimension optimization strategies
    • Introduced GPU memory utilization controls for efficient runtime benchmarking
    • Added vLLM model configuration conversion utilities for model validation
  • Configuration

    • Updated validation defaults for improved evaluation consistency
@grzegorz-k-karch grzegorz-k-karch self-assigned this Jun 23, 2026
@copy-pr-bot

copy-pr-bot Bot commented Jun 23, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds gpu_memory_utilization support to RuntimeConfig and the vLLM benchmark command, introduces a convert_config_to_vllm_anymodel helper in runtime_utils.py, and introduces a full set of runtime pruning and validation YAML configs for the Llama-3.1-8B pruneffn experiment, including attention, FFN, and hidden-dimension pruning strategies.

Changes

vLLM GPU Memory Utilization Support

Layer / File(s) Summary
RuntimeConfig field and calc_runtime_stats wiring
modelopt/torch/puzzletron/subblock_stats/runtime_utils.py, modelopt/torch/puzzletron/subblock_stats/calc_runtime_stats.py
Adds gpu_memory_utilization: float = 0.5 to RuntimeConfig with co-resident benchmarking documentation, adds imports for SimpleNamespace, mprint, and convert_block_configs_to_per_layer_config, and wires the field from runtime_stats_config into the RuntimeConfig constructor call.
convert_config_to_vllm_anymodel helper and runtime_vllm.py update
modelopt/torch/puzzletron/subblock_stats/runtime_utils.py, modelopt/torch/puzzletron/subblock_stats/runtime_vllm.py
Introduces convert_config_to_vllm_anymodel which loads config.json, wraps it in SimpleNamespace, sets architectures/base_architecture, runs convert_block_configs_to_per_layer_config with mprint logging, and writes the result. Updates runtime_vllm.py to use SimpleNamespace(**config) + vars() serialization and appends --gpu-memory-utilization to the vLLM subprocess command.

Llama-3.1-8B pruneffn Runtime YAML Configs

Layer / File(s) Summary
Validation model and solution defaults
examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/validate_model_defaults.yaml, examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/validate_solutions_defaults.yaml, examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/validate_model_defaults.yaml
Adds validate_model_defaults.yaml (bf16, block sizing, dataset column/name, disabled optional behaviors, load_dataset_fn reference) and validate_solutions_defaults.yaml (Hydra defaults composition, skip_validation, save_models, bigger_is_better, ablation sort flag). Fixes val_dataset_name from validation to valid in the memory config.
Pruning defaults base config
examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/pruning_defaults.yaml
Adds pruning_defaults.yaml with validate_model_defaults include, model/experiment placeholders, eval/dataset settings, pruned checkpoint output path, and three parameterized pruning mode sections (FFN, KV-heads, hidden-dimension) with explicit initialization mode and activation-log wiring.
Attention, FFN, and hidden-dim pruning strategy configs
examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/attn_pruning.yaml, .../ffn_pruning.yaml, .../hidden_dim_pruning.yaml
Adds attn_pruning.yaml (IndependentKvHeadContributionHook, KVHeadsPruningMixIn, PruneKVHeads gqa init, n_heads_in_group_list), ffn_pruning.yaml (IterativeChannelContributionHook, FFNIntermediatePruningMixIn, intermediate_size_list, PruneByActivationsLog), and hidden_dim_pruning.yaml (layer_norm_contribution hook, hidden_size_list, multi-mode init settings).
Top-level Llama-3_1-8B.yaml defaults update
examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/Llama-3_1-8B.yaml
Rewires Hydra defaults for pruning, scoring, and realize_model to reference ../validate_solutions_defaults instead of ../llama-3_1-8B_pruneffn_memory entries, and reduces scoring.eval_samples from 128 to 16.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • meenchen
  • kevalmorabia97
  • ChenhanYu

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (1 error)

Check name Status Explanation Resolution
Security Anti-Patterns ❌ Error The PR adds two # nosec comments in runtime_vllm.py (lines 30 and 105) to bypass Bandit security checks (B404 and B603). Per SECURITY.md line 164, "# nosec comments are not allowed" as a bypass f... Remove the # nosec B404 and # nosec B603 comments from runtime_vllm.py. If subprocess usage is genuinely required, add inline comments explaining why it is safe and request review from @NVIDIA/modelopt-setup-codeowners per SECURITY.m...
✅ Passed checks (5 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: fixing Puzzletron tutorial issues related to runtime optimization by adding new config files and addressing OOM/config export issues.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch gkarch/puzzletron-tutorial-fixes

Comment @coderabbitai help to get the list of available commands.

@grzegorz-k-karch grzegorz-k-karch marked this pull request as ready for review June 23, 2026 09:33
@grzegorz-k-karch grzegorz-k-karch requested a review from a team as a code owner June 23, 2026 09:33
@github-actions

github-actions Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor
PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1803/

Built to branch gh-pages at 2026-06-25 20:13 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@codecov

codecov Bot commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.55%. Comparing base (c3b913b) to head (b86f13c).
���️ Report is 17 commits behind head on main.

Files with missing lines Patch % Lines
...t/torch/puzzletron/subblock_stats/runtime_utils.py 0.00% 20 Missing ⚠️
...pt/torch/puzzletron/subblock_stats/runtime_vllm.py 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1803      +/-   ##
==========================================
- Coverage   62.88%   61.55%   -1.34%     
==========================================
  Files         511      514       +3     
  Lines       56634    60573    +3939     
==========================================
+ Hits        35615    37284    +1669     
- Misses      21019    23289    +2270     
Flag Coverage Δ
unit 54.60% <0.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
modelopt/torch/puzzletron/subblock_stats/runtime_utils.py (1)

91-114: 📐 Maintainability & Code Quality | 🔵 Trivial

Add return type annotation and document (or parameterize) hardcoded Llama architecture assumption.

  1. Return type hint: Add -> None to the function signature (line 93). The function has no explicit return statement.

  2. Hardcoded base_architecture: Line 107 unconditionally sets base_architecture = "LlamaForCausalLM". This module is Llama-specific (imports LlamaForCausalLM and LlamaModelDescriptor), so the hardcoding appears intentional. Either add a docstring note clarifying this function is Llama-specific, or accept base_architecture as a parameter if broader model support is planned.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/puzzletron/subblock_stats/runtime_utils.py` around lines 91 -
114, The function convert_config_to_vllm_anymodel is missing a return type
annotation and has a hardcoded assumption about the model architecture. First,
add the return type hint -> None to the function signature since the function
does not explicitly return any value. Second, address the hardcoded
base_architecture assignment that unconditionally sets it to "LlamaForCausalLM".
Either add documentation in the function's docstring to clarify that this
function is Llama-specific and explain why the architecture is hardcoded, or
alternatively, parameterize the base_architecture by accepting it as an optional
function parameter with a default value to allow for broader model support in
the future.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@modelopt/torch/puzzletron/subblock_stats/runtime_utils.py`:
- Around line 91-114: The function convert_config_to_vllm_anymodel is missing a
return type annotation and has a hardcoded assumption about the model
architecture. First, add the return type hint -> None to the function signature
since the function does not explicitly return any value. Second, address the
hardcoded base_architecture assignment that unconditionally sets it to
"LlamaForCausalLM". Either add documentation in the function's docstring to
clarify that this function is Llama-specific and explain why the architecture is
hardcoded, or alternatively, parameterize the base_architecture by accepting it
as an optional function parameter with a default value to allow for broader
model support in the future.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c26ffda4-6f74-455d-a721-7a5ed0be45e2

📥 Commits

Reviewing files that changed from the base of the PR and between c3b913b and 8b02d7a.

📒 Files selected for processing (10)
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/Llama-3_1-8B.yaml
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/attn_pruning.yaml
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/ffn_pruning.yaml
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/hidden_dim_pruning.yaml
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/pruning_defaults.yaml
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/validate_model_defaults.yaml
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/validate_solutions_defaults.yaml
  • modelopt/torch/puzzletron/subblock_stats/calc_runtime_stats.py
  • modelopt/torch/puzzletron/subblock_stats/runtime_utils.py
  • modelopt/torch/puzzletron/subblock_stats/runtime_vllm.py
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>

config = SimpleNamespace(**config_data)
config.architectures = ["AnyModel"]
config.base_architecture = "LlamaForCausalLM"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why hard-coding to llama and no other model?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that will be the next step to extent support to other models - working on now;
for this iteration only llama-based models are supported

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants