[AutoScheme] fix lm_head no_grad issue and support MOE model with shared mix_score #1971
Open
xin3he wants to merge 8 commits into
Signed-off-by: Xin He <xin3.he@intel.com>
…ogging of batch average loss Signed-off-by: Xin He <xin3.he@intel.com>
Contributor
/azp run Unit-Test-CUDA-AutoRound
Azure Pipelines successfully started running 1 pipeline(s).
wenhuach21
reviewed
Jul 1, 2026
    scheme_tag: Optional[str] = None,
):
    scores_dict = {}  # Key=name,Val=[quant_total_bits, loss]
    block_names = get_block_names(model)[0]
Contributor
A VLM with quant_nontext_module also needs to handle the visual blocks.
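The concern can be illustrated with a minimal sketch. It assumes, as the `[0]` index in the diff suggests, that the block names arrive as a list of groups; the group layout and the module names below are hypothetical, and `flatten_block_groups` is an illustrative helper, not a function from the repository.

```python
def flatten_block_groups(block_groups):
    """Merge every group of block names into one list, preserving order."""
    return [name for group in block_groups for name in group]


# Hypothetical shape: one group for the language model, one for the vision tower.
block_groups = [
    ["model.language_model.layers.0", "model.language_model.layers.1"],
    ["model.visual.blocks.0", "model.visual.blocks.1"],
]

# Indexing with [0] keeps only the first group, so the visual blocks are skipped.
text_only = block_groups[0]

# Flattening keeps every group, which is what a non-text quantization path would need.
all_blocks = flatten_block_groups(block_groups)

print(text_only)
print(all_blocks)
```

Whether the visual blocks should share the same scoring loop or get their own pass is a design decision for the PR; the sketch only shows why indexing the first group leaves them out.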
wenhuach21
reviewed
Jul 1, 2026
        enable_torch_compile=enable_torch_compile,
    )
    set_module(model, name, new_m)
    if offload_context is not None:
Contributor
Why was this deleted?
@lvliang-intel, please review this change.
wenhuach21
reviewed
Jul 1, 2026
    head_name = "lm_head"

    # Sort by length to avoid prefix ambiguity and match faster in practice.
    block_prefixes = [(name, name + ".") for name in sorted(block_names, key=len, reverse=True)]
Contributor
This file is too long; better to move this function to utils.
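A minimal sketch of how the `block_prefixes` list from the diff could be used to map a layer name to its enclosing block. The helper names `build_block_prefixes` and `find_block` and the example module names are hypothetical; only the sorted list comprehension is taken from the snippet above.

```python
def build_block_prefixes(block_names):
    """Return (name, name + ".") pairs, longest block names first."""
    return [(name, name + ".") for name in sorted(block_names, key=len, reverse=True)]


def find_block(module_name, block_prefixes):
    """Return the block containing module_name, or None if it lies outside every block."""
    for name, prefix in block_prefixes:
        # The trailing "." keeps "model.layers.1" from matching "model.layers.10.*".
        if module_name == name or module_name.startswith(prefix):
            return name
    return None


# Hypothetical block names, including a nested one.
block_names = ["model.layers.1", "model.layers.10", "model.layers.1.mlp.experts.0"]
prefixes = build_block_prefixes(block_names)

print(find_block("model.layers.10.self_attn.q_proj", prefixes))
print(find_block("model.layers.1.mlp.experts.0.up_proj", prefixes))
print(find_block("lm_head", prefixes))
```

Checking longer names first means a nested block wins over its parent, and a layer such as `lm_head` that sits outside every block falls through to `None`, so it can be handled separately.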
wenhuach21
reviewed
Jul 1, 2026
-def model_forward_low_gpu(model, dataloader, major_device="cuda", pbar=None):
+def model_forward_low_gpu(model, dataloader, major_device="cuda", pbar=None, scheme_tag=None):
     block_inputs = {}
     total_batches = len(dataloader) if hasattr(dataloader, "__len__") else None
Contributor
Please attach the cost of AutoScheme for Qwen3-8B to avoid any regression.
wenhuach21
reviewed
Jul 1, 2026
def _fill_inactive_expert_scores(scores_dict: dict[str, list[float]], block_names: list[str]):
    """Fill inactive experts with the min loss of active experts in each block.
Contributor
Please demonstrate the advantage of this choice over avg/max.
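To make the min/avg/max comparison concrete, here is a rough sketch of the fill step with a pluggable reducer. It is not the PR's implementation: it assumes an expert counts as inactive when its recorded loss is exactly 0.0, that expert layers contain `.experts.` in their names, and that each entry follows the `[quant_total_bits, loss]` layout from the diff. The layer names and numbers are made up.

```python
from statistics import mean


def fill_inactive_expert_scores(scores_dict, block_names, reduce=min):
    """Assign inactive experts the reduced loss of active experts in the same block.

    Assumption: an expert is inactive when its loss (entry[1]) is exactly 0.0.
    """
    for block in block_names:
        prefix = block + "."
        expert_keys = [k for k in scores_dict if k.startswith(prefix) and ".experts." in k]
        active = [scores_dict[k][1] for k in expert_keys if scores_dict[k][1] > 0.0]
        if not active:
            continue  # nothing to borrow from in this block
        fill = reduce(active)
        for k in expert_keys:
            if scores_dict[k][1] == 0.0:
                scores_dict[k][1] = fill
    return scores_dict


def make_scores():
    """Hypothetical scores: Key=name, Val=[quant_total_bits, loss]."""
    return {
        "model.layers.0.mlp.experts.0.up_proj": [4096.0, 0.25],
        "model.layers.0.mlp.experts.1.up_proj": [4096.0, 0.75],
        "model.layers.0.mlp.experts.2.up_proj": [4096.0, 0.0],  # never routed to
    }


inactive = "model.layers.0.mlp.experts.2.up_proj"
results = {}
for label, reducer in (("min", min), ("avg", mean), ("max", max)):
    filled = fill_inactive_expert_scores(make_scores(), ["model.layers.0"], reduce=reducer)
    results[label] = filled[inactive][1]

print(results)
```

With the reducer as a parameter, the three options can be run against the same calibration scores and compared on the resulting bit allocation, which is the evidence the review asks for.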
…ate documentation and tests Signed-off-by: Xin He <xin3.he@intel.com>
… for MoE models Signed-off-by: Xin He <xin3.he@intel.com>
Contributor
/azp run Unit-Test-CUDA-AutoRound
Azure Pipelines successfully started running 1 pipeline(s).
Description
This pull request adds a new test to ensure that lm_head receives a non-zero mix_score when using the low_gpu_mem_usage path in the quantization process. The test captures and verifies the computed scores for different quantization schemes, improving test coverage for this specific scenario.
Testing improvements:
Adds test_lm_head_mix_score_nonzero to verify that lm_head receives a non-zero mix_score across quantization schemes in the low_gpu_mem_usage path, ensuring correct score computation and increasing test coverage.
BTW, debug-mode logging is also enhanced.
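The kind of check described above can be sketched as follows. This is not the test added in the PR: the helper `zero_score_schemes`, the per-scheme dictionary layout, and the numbers are all hypothetical, and the second element of each entry is assumed to be the score under test.

```python
def zero_score_schemes(scores_by_scheme, layer_name="lm_head"):
    """Return the scheme tags for which layer_name is missing or has a zero score."""
    bad = []
    for tag, scores in scores_by_scheme.items():
        entry = scores.get(layer_name)
        if entry is None or entry[1] == 0.0:
            bad.append(tag)
    return bad


# Hypothetical captured scores per scheme: Key=name, Val=[quant_total_bits, score].
captured = {
    "mxfp4": {"lm_head": [1024.0, 0.125], "model.layers.0.mlp.up_proj": [512.0, 0.5]},
    "mxfp8": {"lm_head": [2048.0, 0.0], "model.layers.0.mlp.up_proj": [1024.0, 0.25]},
}

failing = zero_score_schemes(captured)
print(failing)
```

A test in this style would assert that the returned list is empty; in the made-up data above, the `mxfp8` entry is deliberately zero to show what a failure looks like.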
meta-llama/Llama-3.2-1B-Instruct
Details
command:
AR_LOG_LEVEL=debug auto_round --model_name /models/Llama-3.2-1B-Instruct/ --avg_bits 6 --options "mxfp4,mxfp8" --quant_lm_head
Qwen/Qwen3.6-35B-A3B
Details
Type of Change
Bug fix
Related Issues
Fixes or relates to #910 #1347
Checklist Before Submitting
/azp run Unit-Test-CUDA-AutoRound.