
[AutoScheme] fix lm_head no_grad issue and support MOE model with shared mix_score#1971

Open
xin3he wants to merge 8 commits into
mainfrom
xinhe/6-29

@xin3he xin3he commented Jul 1, 2026

Contributor

Description

This pull request adds a new test to ensure that the lm_head receives a non-zero mix_score when using the low_gpu_mem_usage path in the quantization process. The test captures and verifies the computed scores for different quantization schemes, improving test coverage for this specific scenario.

Testing improvements:

  • Added test_lm_head_mix_score_nonzero to verify that the lm_head receives a non-zero mix_score across quantization schemes in the low_gpu_mem_usage path, ensuring correct score computation and increasing test coverage.
  • Verified with Qwen3.6-35B-A3B.
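A minimal sketch of the kind of check such a test could perform; it is not the actual test_lm_head_mix_score_nonzero. It assumes the captured scores are stored per scheme as `{name: [quant_total_bits, loss]}` and that the second entry is the value being checked. The helper names `lm_head_scores` and `all_nonzero` and the sample numbers are hypothetical.

```python
def lm_head_scores(scores_by_scheme, head_name="lm_head"):
    """Return {scheme: score} for the head layer; raise if a scheme lacks it."""
    result = {}
    for scheme, scores in scores_by_scheme.items():
        if head_name not in scores:
            raise KeyError(f"{head_name} missing from scores for scheme {scheme}")
        _bits, score = scores[head_name]
        result[scheme] = score
    return result


def all_nonzero(scores, tol=0.0):
    """True when every score has magnitude strictly above tol."""
    return all(abs(value) > tol for value in scores.values())


# Hypothetical captured values, for illustration only.
captured = {
    "MXFP4": {"lm_head": [1000.0, 3.5], "layers.0.q_proj": [500.0, 0.2]},
    "MXFP8": {"lm_head": [2000.0, 0.4], "layers.0.q_proj": [1000.0, 0.05]},
}
head = lm_head_scores(captured)
assert all_nonzero(head), f"lm_head received a zero score: {head}"
```

A zero score for the head under any scheme would fail the final assertion, which is the regression the description says the new test guards against.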

Debug-mode logging is also enhanced; sample output follows.

meta-llama/Llama-3.2-1B-Instruct

Details

command: AR_LOG_LEVEL=debug auto_round --model_name /models/Llama-3.2-1B-Instruct/ --avg_bits 6 --options "mxfp4,mxfp8" --quant_lm_head

2026-07-01 11:44:08 INFO delta_loss.py L1791: AutoScheme steps(total)=64
2026-07-01 11:44:08 INFO delta_loss.py L1792: AutoScheme steps variables: scheme_num=2, block_num=16, nsamples=16, batch_size=8
2026-07-01 11:44:08 INFO delta_loss.py L1799: AutoScheme steps expanded(low_gpu): total_steps = scheme_num * block_num * 2(forward+backward) * n_batches = 2 * 16 * 2 * 2 = 128
Generating AutoScheme:   0%|                                                           | 0/128 [00:00<?, ?it/s]2026-07-01 11:44:08 INFO delta_loss.py L1836: AutoScheme transition: switch to scheme 1/2 (MXFP4)
2026-07-01 11:44:08 INFO calib_dataset.py L977: Preprocessing calibration dataset in a subprocess to avoid memory leaks...
Generating AutoScheme:  12%|??????????????                                           | 16/128 [00:16<00:36,  3.07it/s]2026-07-01 11:44:25 INFO device.py L1450: 'peak_ram': 5.63GB, 'peak_vram': 3.92GB
/home/xinhe/auto-round/.venv/lib/python3.12/site-packages/torch/autograd/graph.py:869: UserWarning: Flash Attention defaults to a non-deterministic algorithm. To explicitly enable determinism call torch.use_deterministic_algorithms(True, warn_only=False). (Triggered internally at /pytorch/aten/src/ATen/native/transformers/cuda/attention_backward.cu:124.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
AutoScheme [1/2 MXFP4] cumulative batch 1/2  avg_loss=1.890084 layers=112
Generating AutoScheme:  25%|??????????????????????????                                     | 32/128 [00:28<01:09,  1.39it/s]2026-07-01 11:44:37 DEBUG delta_loss.py L1015: AutoScheme [1/2 MXFP4] cumulative batch 1/2  avg_loss=1.890084 layers=112
2026-07-01 11:44:37 DEBUG delta_loss.py L1030: AutoScheme [1/2 MXFP4] cumulative batch 1/2 block summary:       
2026-07-01 11:44:37 DEBUG delta_loss.py L905: AutoScheme [1/2 MXFP4] block loss summary (cumulative):           
2026-07-01 11:44:37 DEBUG delta_loss.py L909: AutoScheme | block | avg_loss |                                   
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.0 | 2.456473 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.1 | 2.578125 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.2 | 2.518973 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.3 | 2.277902 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.4 | 2.202009 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.5 | 2.055246 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.6 | 1.939174 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.7 | 1.851004 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.8 | 1.852121 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.9 | 1.838170 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.10 | 1.765625 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.11 | 1.484933 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.12 | 1.363839 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.13 | 1.316406 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.14 | 1.233817 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.15 | 1.507533 |
2026-07-01 11:44:37 DEBUG delta_loss.py L967: AutoScheme | lm_head | N/A |
2026-07-01 11:44:37 INFO delta_loss.py L980: AutoScheme non_block loss: none
Generating AutoScheme:  38%|??????????????????????????????????????                               | 48/128 [00:32<00:23,  3.37it/s]2026-07-01 11:44:42 INFO device.py L1450: 'peak_ram': 5.63GB, 'peak_vram': 4.0GB
AutoScheme [1/2 MXFP4] cumulative batch 2/2  avg_loss=2.058463 layers=112                                      
Generating AutoScheme:  50%|??????????????????????????????????????????????????                         | 64/128 [00:44<00:43,  1.49it/s]2026-07-01 11:44:53 DEBUG delta_loss.py L1015: AutoScheme [1/2 MXFP4] cumulative batch 2/2  avg_loss=2.058463 layers=112
2026-07-01 11:44:53 DEBUG delta_loss.py L1024: AutoScheme [1/2 MXFP4] cumulative batch 2/2 block summary skipped (same as final)
2026-07-01 11:44:53 DEBUG delta_loss.py L905: AutoScheme [1/2 MXFP4] block loss summary (final):
2026-07-01 11:45:33 DEBUG delta_loss.py L909: AutoScheme | block | avg_loss |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.0 | 0.342250 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.1 | 0.329904 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.2 | 0.316511 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.3 | 0.328020 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.4 | 0.321219 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.5 | 0.297294 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.6 | 0.292655 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.7 | 0.286900 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.8 | 0.298165 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.9 | 0.312988 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.10 | 0.291748 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.11 | 0.257987 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.12 | 0.239083 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.13 | 0.220668 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.14 | 0.205008 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.15 | 0.234026 |
2026-07-01 11:45:33 DEBUG delta_loss.py L967: AutoScheme | lm_head | N/A |
2026-07-01 11:45:33 INFO delta_loss.py L980: AutoScheme non_block loss: none
2026-07-01 11:45:33 INFO device.py L1450: 'peak_ram': 5.63GB, 'peak_vram': 5.44GB
2026-07-01 11:45:33 INFO delta_loss.py L1897: AutoScheme transition: scheme 2/2 scoring finished (total_loss=32.020996)
2026-07-01 11:45:34 INFO device.py L1450: 'peak_ram': 5.63GB, 'peak_vram': 5.44GB
2026-07-01 11:45:34 INFO device.py L1448: AutoScheme complete (low_cpu_mem_usage=disabled) 'peak_ram': 5.63GB, 'peak_vram': 5.44GB
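
The expanded step count in the log above follows the formula printed by the logger: total_steps = scheme_num * block_num * 2(forward+backward) * n_batches. The sketch below reproduces that arithmetic. The function name is hypothetical, and deriving n_batches as ceil(nsamples / batch_size) is an assumption that happens to match both logged runs.

```python
import math


def expanded_low_gpu_steps(scheme_num, block_num, nsamples, batch_size):
    """Progress-bar total for the low_gpu path, per the logged formula."""
    # Assumption: a trailing partial batch would count as one full batch.
    n_batches = math.ceil(nsamples / batch_size)
    # Each (scheme, block, batch) triple contributes a forward and a backward step.
    return scheme_num * block_num * 2 * n_batches


# Values from the Llama-3.2-1B-Instruct log above.
llama_steps = expanded_low_gpu_steps(scheme_num=2, block_num=16, nsamples=16, batch_size=8)
print(llama_steps)  # 128
```

With the values logged for the Qwen3.6-35B-A3B run further down (scheme_num=2, block_num=40, nsamples=64, batch_size=8), the same function gives 1280.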

Qwen/Qwen3.6-35B-A3B

Details
2026-07-01 11:57:31 INFO delta_loss.py L1736: The model appears to be an MoE  model. Using more samples to help generate a better auto-scheme recipe.                                                                                     
2026-07-01 11:57:31 INFO delta_loss.py L1791: AutoScheme steps(total)=160                                                                                                                                                                 
2026-07-01 11:57:31 INFO delta_loss.py L1792: AutoScheme steps variables: scheme_num=2, block_num=40, nsamples=64, batch_size=8                                                                                                           
2026-07-01 11:57:31 INFO delta_loss.py L1799: AutoScheme steps expanded(low_gpu): total_steps = scheme_num * block_num * 2(forward+backward) * n_batches = 2 * 40 * 2 * 8 = 1280                                                          
Generating AutoScheme:   0%|                                                         | 0/1280 [00:00<?, ?it/s]2026-07-01 11:57:31 INFO delta_loss.py L1836: AutoScheme transition: switch to scheme 1/2 (MXFP4)                           
2026-07-01 11:57:50 INFO calib_dataset.py L977: Preprocessing calibration dataset in a subprocess to avoid memory leaks...                                                                                                                
Generating AutoScheme:   3%|????                                            | 40/1280 [06:52<3:11:01,  9.24s/it]2026-07-01 12:04:26 INFO device.py L1450: 'peak_ram': 66.84GB, 'peak_vram': 22.7GB                                          
/home/xinhe/auto-round/.venv/lib/python3.12/site-packages/torch/autograd/graph.py:869: UserWarning: Flash Attention defaults to a non-deterministic algorithm. To explicitly enable determinism call torch.use_deterministic_algorithms(True, warn_only=False). (Triggered internally at /pytorch/aten/src/ATen/native/transformers/cuda/attention_backward.cu:124.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass                                                                                                                                   
AutoScheme [1/2 MXFP4] cumulative batch 1/8  avg_loss=0.005665 layers=31071                                                                                                                                                               
Generating AutoScheme:   6%|??????                                          | 80/1280 [25:30<14:44:42, 44.24s/it]2026-07-01 12:23:02 DEBUG delta_loss.py L1015: AutoScheme [1/2 MXFP4] cumulative batch 1/8  avg_loss=0.005665 layers=31071  
2026-07-01 12:23:03 DEBUG delta_loss.py L1030: AutoScheme [1/2 MXFP4] cumulative batch 1/8 block summary:                                                                                                                                 
2026-07-01 12:23:03 DEBUG delta_loss.py L905: AutoScheme [1/2 MXFP4] block loss summary (cumulative):                                                                                                                                     
2026-07-01 12:23:03 DEBUG delta_loss.py L907: AutoScheme | block | avg_loss | non_exp_avg | exp_avg | inactive_exp | shared_loss |                                                                                                        
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.0 | 0.003526 | 0.203457 | 0.001183 | 0/256 | N/A |                                                                                                                      
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.1 | 0.004285 | 0.213921 | 0.001681 | 1/256 | 0.000028 |                                                                                                                 
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.2 | 0.004917 | 0.250949 | 0.002034 | 0/256 | N/A |                                                                                                                      
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.3 | 0.005059 | 0.262360 | 0.001289 | 2/256 | 0.000002 |                                                                                                                 
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.4 | 0.005247 | 0.224664 | 0.000982 | 6/256 | 0.000011 |                                                                                                                 
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.5 | 0.005625 | 0.241252 | 0.000084 | 9/256 | 0.000001 |                                                                                                                 
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.6 | 0.006103 | 0.270779 | 0.002675 | 7/256 | 0.000002 |                                                                                                                 
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.7 | 0.006279 | 0.304123 | 0.009237 | 7/256 | 0.000000 |                                                                                                                 
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.8 | 0.006329 | 0.263753 | 0.000136 | 11/256 | 0.000002 |                                                                                                                
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.9 | 0.006380 | 0.261746 | 0.000142 | 11/256 | 0.000000 |                                                                                                                
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.10 | 0.006566 | 0.291612 | 0.000112 | 9/256 | 0.000000 |                                                                                                                
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.11 | 0.006700 | 0.329559 | 0.005859 | 14/256 | 0.000001 |                                                                                                               
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.12 | 0.006035 | 0.264174 | 0.000064 | 8/256 | 0.000006 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.13 | 0.005634 | 0.224311 | 0.000314 | 10/256 | 0.000002 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.14 | 0.005807 | 0.248725 | 0.004674 | 8/256 | 0.000002 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.15 | 0.006381 | 0.317917 | 0.000631 | 14/256 | 0.000002 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.16 | 0.005211 | 0.197700 | 0.000519 | 10/256 | 0.000005 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.17 | 0.005135 | 0.196438 | 0.001879 | 9/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.18 | 0.005261 | 0.197456 | 0.001377 | 17/256 | 0.000004 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.19 | 0.005654 | 0.258820 | 0.000047 | 13/256 | 0.000004 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.20 | 0.005496 | 0.233521 | 0.000021 | 12/256 | 0.000002 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.21 | 0.005644 | 0.218262 | 0.000044 | 21/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.22 | 0.005673 | 0.238675 | 0.000970 | 13/256 | 0.000001 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.23 | 0.005732 | 0.286415 | 0.001066 | 12/256 | 0.000008 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.24 | 0.004898 | 0.192342 | 0.000363 | 16/256 | 0.000003 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.25 | 0.004493 | 0.164998 | 0.000129 | 10/256 | 0.000002 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.26 | 0.004950 | 0.202962 | 0.003667 | 12/256 | 0.000005 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.27 | 0.005520 | 0.281952 | 0.000662 | 8/256 | 0.000003 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.28 | 0.004972 | 0.188544 | 0.000846 | 10/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.29 | 0.005491 | 0.219198 | 0.004710 | 10/256 | 0.000003 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.30 | 0.005875 | 0.225966 | 0.000285 | 20/256 | 0.000005 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.31 | 0.006115 | 0.265884 | 0.000001 | 21/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.32 | 0.006555 | 0.247179 | 0.004435 | 21/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.33 | 0.006256 | 0.244602 | 0.000414 | 13/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.34 | 0.007238 | 0.279270 | 0.005646 | 26/256 | 0.000001 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.35 | 0.006032 | 0.285767 | 0.002546 | 11/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.36 | 0.005899 | 0.231432 | 0.006378 | 13/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.37 | 0.006209 | 0.245280 | 0.004445 | 18/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.38 | 0.007648 | 0.327650 | 0.000034 | 28/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.39 | 0.010694 | 0.587891 | 0.000170 | 18/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L965: AutoScheme | lm_head | 3.500000 | N/A | N/A | 0/0 | N/A |              
2026-07-01 12:23:03 DEBUG delta_loss.py L970: AutoScheme table note: non_exp_avg excludes experts; exp_avg excludes inactive experts; shared_loss is used for inactive expert broadcast.
2026-07-01 12:23:03 INFO delta_loss.py L980: AutoScheme non_block loss: none                                         
Generating AutoScheme:   9%|??????????                                        | 120/1280 [34:23<3:56:04, 12.21s/it]2026-07-01 12:31:56 INFO device.py L1450: 'peak_ram': 67.85GB, 'peak_vram': 28.89GB
AutoScheme [1/2 MXFP4] cumulative batch 2/8  avg_loss=0.005918 layers=31071                                          
Generating AutoScheme:  12%|????????????                                       | 160/1280 [48:05<5:40:51, 18.26s/it]2026-07-01 12:45:37 DEBUG delta_loss.py L1015: AutoScheme [1/2 MXFP4] cumulative batch 2/8  avg_loss=0.005918 layers=31071
2026-07-01 12:45:37 DEBUG delta_loss.py L1030: AutoScheme [1/2 MXFP4] cumulative batch 2/8 block summary:            
2026-07-01 12:45:37 DEBUG delta_loss.py L905: AutoScheme [1/2 MXFP4] block loss summary (cumulative):                
2026-07-01 12:45:37 DEBUG delta_loss.py L907: AutoScheme | block | avg_loss | non_exp_avg | exp_avg | inactive_exp | shared_loss |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.0 | 0.003621 | 0.207435 | 0.001233 | 0/256 | N/A |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.1 | 0.004396 | 0.219082 | 0.001880 | 0/256 | N/A |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.2 | 0.004994 | 0.255303 | 0.002060 | 0/256 | N/A |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.3 | 0.005170 | 0.273804 | 0.002372 | 0/256 | N/A |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.4 | 0.005355 | 0.230265 | 0.002720 | 0/256 | N/A |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.5 | 0.005662 | 0.250570 | 0.000272 | 2/256 | 0.000034 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.6 | 0.006166 | 0.279867 | 0.002958 | 0/256 | N/A |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.7 | 0.006434 | 0.324081 | 0.003125 | 0/256 | N/A |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.8 | 0.006356 | 0.273356 | 0.001195 | 1/256 | 0.000005 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.9 | 0.006602 | 0.277398 | 0.000405 | 1/256 | 0.000000 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.10 | 0.006789 | 0.305840 | 0.000068 | 2/256 | 0.000004 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.11 | 0.006699 | 0.346024 | 0.004786 | 2/256 | 0.000010 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.12 | 0.006101 | 0.271478 | 0.002991 | 0/256 | N/A |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.13 | 0.005592 | 0.229506 | 0.000220 | 1/256 | 0.000009 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.14 | 0.005823 | 0.253879 | 0.002645 | 1/256 | 0.000002 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.15 | 0.006273 | 0.329887 | 0.000387 | 2/256 | 0.000004 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.16 | 0.005154 | 0.199816 | 0.000609 | 1/256 | 0.000009 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.17 | 0.005036 | 0.198229 | 0.005024 | 1/256 | 0.000002 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.18 | 0.005073 | 0.200948 | 0.002778 | 0/256 | N/A |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.19 | 0.005667 | 0.268135 | 0.000027 | 4/256 | 0.000010 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.20 | 0.005448 | 0.236782 | 0.000885 | 4/256 | 0.000002 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.21 | 0.005516 | 0.226345 | 0.000148 | 3/256 | 0.000005 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.22 | 0.005726 | 0.245972 | 0.000990 | 4/256 | 0.000014 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.23 | 0.005673 | 0.299305 | 0.002614 | 0/256 | N/A |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.24 | 0.004813 | 0.197693 | 0.000090 | 3/256 | 0.000015 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.25 | 0.004473 | 0.169447 | 0.000381 | 1/256 | 0.000005 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.26 | 0.004963 | 0.210341 | 0.002218 | 1/256 | 0.000004 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.27 | 0.005718 | 0.294151 | 0.000371 | 3/256 | 0.000004 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.28 | 0.004977 | 0.194485 | 0.001242 | 1/256 | 0.000005 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.29 | 0.005563 | 0.228021 | 0.005020 | 2/256 | 0.000005 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.30 | 0.005899 | 0.239705 | 0.000032 | 6/256 | 0.000005 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.31 | 0.006198 | 0.286247 | 0.000076 | 6/256 | 0.000004 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.32 | 0.006578 | 0.258979 | 0.002238 | 7/256 | 0.000002 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.33 | 0.006442 | 0.259562 | 0.000216 | 4/256 | 0.000001 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.34 | 0.007279 | 0.294895 | 0.005880 | 14/256 | 0.000001 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.35 | 0.006235 | 0.308899 | 0.002605 | 2/256 | 0.000000 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.36 | 0.005883 | 0.243435 | 0.006831 | 4/256 | 0.000000 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.37 | 0.006203 | 0.256076 | 0.004021 | 9/256 | 0.000001 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.38 | 0.007438 | 0.345479 | 0.000018 | 7/256 | 0.000000 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.39 | 0.010684 | 0.622803 | 0.000194 | 7/256 | 0.000000 |
2026-07-01 12:45:37 DEBUG delta_loss.py L965: AutoScheme | lm_head | 3.671875 | N/A | N/A | 0/0 | N/A |              
2026-07-01 12:45:37 DEBUG delta_loss.py L970: AutoScheme table note: non_exp_avg excludes experts; exp_avg excludes inactive experts; shared_loss is used for inactive expert broadcast.
2026-07-01 12:45:37 INFO delta_loss.py L980: AutoScheme non_block loss: none 
Generating AutoScheme:  16%|??????????????                                      | 200/1280 [53:27<2:46:38,  9.26s/it]2026-07-01 12:51:02 INFO device.py L1450: 'peak_ram': 68.04GB, 'peak_vram': 28.89GB
AutoScheme [1/2 MXFP4] cumulative batch 3/8  avg_loss=0.005484 layers=31071                                   
Generating AutoScheme:  19%|????????????????                                   | 240/1280 [1:07:58<5:56:46, 20.58s/it]2026-07-01 13:05:30 DEBUG delta_loss.py L1015: AutoScheme [1/2 MXFP4] cumulative batch 3/8  avg_loss=0.005484 layers=31071
2026-07-01 13:05:30 DEBUG delta_loss.py L1030: AutoScheme [1/2 MXFP4] cumulative batch 3/8 block summary:
2026-07-01 13:05:30 DEBUG delta_loss.py L905: AutoScheme [1/2 MXFP4] block loss summary (cumulative):
2026-07-01 13:05:30 DEBUG delta_loss.py L907: AutoScheme | block | avg_loss | non_exp_avg | exp_avg | inactive_exp | shared_loss |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.0 | 0.003268 | 0.188201 | 0.001101 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.1 | 0.003921 | 0.195322 | 0.001678 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.2 | 0.004479 | 0.228805 | 0.001850 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.3 | 0.004593 | 0.243896 | 0.002100 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.4 | 0.004827 | 0.210436 | 0.002418 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.5 | 0.005093 | 0.227394 | 0.002488 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.6 | 0.005610 | 0.255335 | 0.002684 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.7 | 0.005908 | 0.296870 | 0.002877 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.8 | 0.005817 | 0.250986 | 0.002944 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.9 | 0.006011 | 0.255082 | 0.003093 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.10 | 0.006256 | 0.285920 | 0.002978 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.11 | 0.006219 | 0.322774 | 0.002922 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.12 | 0.005701 | 0.255037 | 0.002779 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.13 | 0.005250 | 0.216209 | 0.002778 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.14 | 0.005458 | 0.239850 | 0.002711 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.15 | 0.005915 | 0.315318 | 0.002692 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.16 | 0.004828 | 0.189485 | 0.002664 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.17 | 0.004713 | 0.188703 | 0.002556 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.18 | 0.004741 | 0.190913 | 0.002560 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.19 | 0.005214 | 0.253311 | 0.000081 | 1/256 | 0.000010 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.20 | 0.005043 | 0.219973 | 0.000714 | 3/256 | 0.000007 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.21 | 0.005044 | 0.211277 | 0.002627 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.22 | 0.005264 | 0.231848 | 0.002609 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.23 | 0.005363 | 0.285472 | 0.002445 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.24 | 0.004521 | 0.186964 | 0.000045 | 3/256 | 0.000016 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.25 | 0.004212 | 0.161237 | 0.000312 | 1/256 | 0.000015 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.26 | 0.004659 | 0.199549 | 0.003139 | 1/256 | 0.000017 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.27 | 0.005350 | 0.281036 | 0.000812 | 1/256 | 0.000001 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.28 | 0.004649 | 0.185072 | 0.002535 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.29 | 0.005182 | 0.214966 | 0.005256 | 2/256 | 0.000012 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.30 | 0.005443 | 0.226391 | 0.000101 | 3/256 | 0.000004 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.31 | 0.005714 | 0.269328 | 0.000111 | 2/256 | 0.000005 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.32 | 0.005923 | 0.241776 | 0.004242 | 2/256 | 0.000002 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.33 | 0.005925 | 0.241030 | 0.000234 | 2/256 | 0.000003 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.34 | 0.006431 | 0.276647 | 0.006036 | 4/256 | 0.000002 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.35 | 0.005723 | 0.289591 | 0.002766 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.36 | 0.005359 | 0.226097 | 0.006772 | 2/256 | 0.000001 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.37 | 0.005580 | 0.238383 | 0.003129 | 2/256 | 0.000000 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.38 | 0.006759 | 0.320080 | 0.000014 | 3/256 | 0.000000 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.39 | 0.009778 | 0.580933 | 0.000250 | 3/256 | 0.000000 |
2026-07-01 13:05:30 DEBUG delta_loss.py L965: AutoScheme | lm_head | 3.437500 | N/A | N/A | 0/0 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L970: AutoScheme table note: non_exp_avg excludes experts; exp_avg excludes inactive experts; shared_loss is used for inactive expert broadcast.
2026-07-01 13:05:30 INFO delta_loss.py L980: AutoScheme non_block loss: none
Generating AutoScheme:  22%|????????????????????                                 | 280/1280 [1:12:53<2:17:07,  8.23s/it]2026-07-01 13:10:27 INFO device.py L1450: 'peak_ram': 68.14GB, 'peak_vram': 28.89GB
...
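
For comparing these debug tables across batches or schemes, the rows can be parsed into records. The sketch below assumes the six-column MoE row layout shown in the log above (block, avg_loss, non_exp_avg, exp_avg, inactive_exp, shared_loss); `parse_block_row` is a hypothetical helper and not part of auto_round.

```python
def parse_block_row(line):
    """Parse one six-column 'AutoScheme | ...' table row; return None otherwise."""
    marker = "AutoScheme |"
    start = line.find(marker)
    if start < 0:
        return None  # not a table row (e.g. the 'table note' line)
    body = line[start + len(marker):].strip().strip("|")
    cells = [cell.strip() for cell in body.split("|")]
    if len(cells) != 6 or cells[0] == "block":
        return None  # header row, or the two-column dense-model table

    def to_float(cell):
        return None if cell == "N/A" else float(cell)

    inactive, total = (int(part) for part in cells[4].split("/"))
    return {
        "block": cells[0],
        "avg_loss": to_float(cells[1]),
        "non_exp_avg": to_float(cells[2]),
        "exp_avg": to_float(cells[3]),
        "inactive_experts": inactive,
        "total_experts": total,
        "shared_loss": to_float(cells[5]),
    }


# Rows copied from the first cumulative summary above.
rows = [
    "2026-07-01 12:23:03 DEBUG delta_loss.py L907: AutoScheme | block | avg_loss | non_exp_avg | exp_avg | inactive_exp | shared_loss |",
    "2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.0 | 0.003526 | 0.203457 | 0.001183 | 0/256 | N/A |",
    "2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.1 | 0.004285 | 0.213921 | 0.001681 | 1/256 | 0.000028 |",
]
parsed = [row for row in map(parse_block_row, rows) if row is not None]
```

In the logged rows, shared_loss is N/A exactly when inactive_exp is 0, which fits the table note that shared_loss is only used to fill inactive experts.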

Type of Change

Bug fix

Related Issues

Fixes or relates to #910 #1347

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.
  • The CUDA CI has passed. You can trigger it by commenting /azp run Unit-Test-CUDA-AutoRound.
xin3he added 6 commits June 30, 2026 07:24
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>
…ogging of batch average loss

Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>
@chensuyue

Contributor

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
scheme_tag: Optional[str] = None,
):
scores_dict = {} # Key=name,Val=[quant_total_bits, loss]
block_names = get_block_names(model)[0]

Contributor

A VLM with quant_nontext_module also needs its visual blocks handled here.

enable_torch_compile=enable_torch_compile,
)
set_module(model, name, new_m)
if offload_context is not None:

Contributor

Why was this deleted?
@lvliang-intel, please review this change.

head_name = "lm_head"

# Sort by length to avoid prefix ambiguity and match faster in practice.
block_prefixes = [(name, name + ".") for name in sorted(block_names, key=len, reverse=True)]

Contributor

The file is too long; better to move this function to utils.
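
For context on the snippet under review, here is a small sketch of how a prefix list built this way might be used to map a layer name to its block. `build_block_prefixes` mirrors the line shown in the diff; `find_block` and the module names are hypothetical and only illustrate the matching behaviour.

```python
def build_block_prefixes(block_names):
    """Pair each block name with its dotted prefix, longest names first."""
    return [(name, name + ".") for name in sorted(block_names, key=len, reverse=True)]


def find_block(layer_name, block_prefixes):
    """Return the first block whose dotted prefix matches layer_name, else None."""
    for block, prefix in block_prefixes:
        if layer_name.startswith(prefix):
            return block
    return None


# Hypothetical names: one block is nested inside another.
prefixes = build_block_prefixes(
    ["model.layers.1", "model.layers.10", "model.visual", "model.visual.blocks.0"]
)

# The trailing "." keeps "model.layers.1" from matching layers of "model.layers.10".
assert find_block("model.layers.10.self_attn.q_proj", prefixes) == "model.layers.10"
# Longest-first ordering makes the most specific (nested) block win.
assert find_block("model.visual.blocks.0.attn.qkv", prefixes) == "model.visual.blocks.0"
# Layers outside every block, such as the head, match nothing.
assert find_block("lm_head", prefixes) is None
```

Because the helper depends only on a list of names, it could live in a utils module without pulling in any state from the scoring code.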

-def model_forward_low_gpu(model, dataloader, major_device="cuda", pbar=None):
+def model_forward_low_gpu(model, dataloader, major_device="cuda", pbar=None, scheme_tag=None):
     block_inputs = {}
     total_batches = len(dataloader) if hasattr(dataloader, "__len__") else None

Contributor

Please attach the AutoScheme cost for Qwen3-8B to rule out any regression.



def _fill_inactive_expert_scores(scores_dict: dict[str, list[float]], block_names: list[str]):
"""Fill inactive experts with the min loss of active experts in each block.

Contributor

Please demonstrate the advantage of this choice over avg/max.
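
To make the min/avg/max comparison concrete, here is a toy sketch with a pluggable reducer. It is not the PR's `_fill_inactive_expert_scores`. It assumes scores use the `{name: [quant_total_bits, loss]}` layout, that expert layers can be recognised by a ".experts." substring, and that an inactive expert is one whose loss is exactly 0.0. All names and numbers are hypothetical.

```python
import statistics
from collections import defaultdict


def fill_inactive_experts(scores, block_names, reduce=min):
    """Return a copy of scores with zero-loss experts set to reduce(active losses) per block."""
    filled = {name: list(value) for name, value in scores.items()}
    experts_by_block = defaultdict(list)
    for name in scores:
        if ".experts." not in name:  # assumption: naming convention for expert layers
            continue
        for block in sorted(block_names, key=len, reverse=True):
            if name.startswith(block + "."):
                experts_by_block[block].append(name)
                break
    for experts in experts_by_block.values():
        active = [scores[name][1] for name in experts if scores[name][1] != 0.0]
        if not active:
            continue  # nothing to borrow from; leave the block untouched
        fill_value = reduce(active)
        for name in experts:
            if scores[name][1] == 0.0:  # assumption: zero loss means never routed to
                filled[name][1] = fill_value
    return filled


# Hypothetical scores for one block with two active experts and one inactive expert.
scores = {
    "layers.0.mlp.experts.0.up_proj": [4.0, 0.002],
    "layers.0.mlp.experts.1.up_proj": [4.0, 0.006],
    "layers.0.mlp.experts.2.up_proj": [4.0, 0.0],
    "layers.0.self_attn.q_proj": [4.0, 0.2],
}
inactive = "layers.0.mlp.experts.2.up_proj"
by_min = fill_inactive_experts(scores, ["layers.0"], reduce=min)[inactive][1]
by_max = fill_inactive_experts(scores, ["layers.0"], reduce=max)[inactive][1]
by_avg = fill_inactive_experts(scores, ["layers.0"], reduce=statistics.mean)[inactive][1]
```

Swapping the reducer changes only the value assigned to experts that saw no calibration tokens, so running the three variants on a real model would show how much the resulting recipe depends on that choice.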

@xin3he xin3he added this to the 0.14.0 milestone Jul 1, 2026
@xin3he xin3he requested a review from wenhuach21 July 1, 2026 05:58
xin3he added 2 commits July 1, 2026 14:02
…ate documentation and tests

Signed-off-by: Xin He <xin3.he@intel.com>
… for MoE models

Signed-off-by: Xin He <xin3.he@intel.com>
@chensuyue

Contributor

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Labels

None yet

3 participants