[AutoScheme] fix lm_head no_grad issue and support MOE model with shared mix_score by xin3he · Pull Request #1971 · intel/auto-round

xin3he · 2026-07-01T03:52:52Z

Description

This pull request adds a new test to ensure that the lm_head receives a non-zero mix_score when using the low_gpu_mem_usage path in the quantization process. The test captures and verifies the computed scores for different quantization schemes, improving test coverage for this specific scenario.

Testing improvements:

Added test_lm_head_mix_score_nonzero to verify that the lm_head receives a non-zero mix_score across quantization schemes in the low_gpu_mem_usage path, ensuring correct score computation and increasing test coverage.
Verified with Qwen3.6-35B-A3B.

Debug-mode logging is also enhanced; sample output for two models follows.

meta-llama/Llama-3.2-1B-Instruct

Details

command: AR_LOG_LEVEL=debug auto_round --model_name /models/Llama-3.2-1B-Instruct/ --avg_bits 6 --options "mxfp4,mxfp8" --quant_lm_head

2026-07-01 11:44:08 INFO delta_loss.py L1791: AutoScheme steps(total)=64
2026-07-01 11:44:08 INFO delta_loss.py L1792: AutoScheme steps variables: scheme_num=2, block_num=16, nsamples=16, batch_size=8
2026-07-01 11:44:08 INFO delta_loss.py L1799: AutoScheme steps expanded(low_gpu): total_steps = scheme_num * block_num * 2(forward+backward) * n_batches = 2 * 16 * 2 * 2 = 128
Generating AutoScheme:   0%|                                                           | 0/128 [00:00<?, ?it/s]2026-07-01 11:44:08 INFO delta_loss.py L1836: AutoScheme transition: switch to scheme 1/2 (MXFP4)
2026-07-01 11:44:08 INFO calib_dataset.py L977: Preprocessing calibration dataset in a subprocess to avoid memory leaks...
Generating AutoScheme:  12%|??????????????                                           | 16/128 [00:16<00:36,  3.07it/s]2026-07-01 11:44:25 INFO device.py L1450: 'peak_ram': 5.63GB, 'peak_vram': 3.92GB
/home/xinhe/auto-round/.venv/lib/python3.12/site-packages/torch/autograd/graph.py:869: UserWarning: Flash Attention defaults to a non-deterministic algorithm. To explicitly enable determinism call torch.use_deterministic_algorithms(True, warn_only=False). (Triggered internally at /pytorch/aten/src/ATen/native/transformers/cuda/attention_backward.cu:124.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
AutoScheme [1/2 MXFP4] cumulative batch 1/2  avg_loss=1.890084 layers=112
Generating AutoScheme:  25%|??????????????????????????                                     | 32/128 [00:28<01:09,  1.39it/s]2026-07-01 11:44:37 DEBUG delta_loss.py L1015: AutoScheme [1/2 MXFP4] cumulative batch 1/2  avg_loss=1.890084 layers=112
2026-07-01 11:44:37 DEBUG delta_loss.py L1030: AutoScheme [1/2 MXFP4] cumulative batch 1/2 block summary:
2026-07-01 11:44:37 DEBUG delta_loss.py L905: AutoScheme [1/2 MXFP4] block loss summary (cumulative):
2026-07-01 11:44:37 DEBUG delta_loss.py L909: AutoScheme | block | avg_loss |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.0 | 2.456473 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.1 | 2.578125 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.2 | 2.518973 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.3 | 2.277902 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.4 | 2.202009 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.5 | 2.055246 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.6 | 1.939174 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.7 | 1.851004 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.8 | 1.852121 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.9 | 1.838170 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.10 | 1.765625 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.11 | 1.484933 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.12 | 1.363839 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.13 | 1.316406 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.14 | 1.233817 |
2026-07-01 11:44:37 DEBUG delta_loss.py L953: AutoScheme | layers.15 | 1.507533 |
2026-07-01 11:44:37 DEBUG delta_loss.py L967: AutoScheme | lm_head | N/A |
2026-07-01 11:44:37 INFO delta_loss.py L980: AutoScheme non_block loss: none
Generating AutoScheme:  38%|??????????????????????????????????????                               | 48/128 [00:32<00:23,  3.37it/s]2026-07-01 11:44:42 INFO device.py L1450: 'peak_ram': 5.63GB, 'peak_vram': 4.0GB
AutoScheme [1/2 MXFP4] cumulative batch 2/2  avg_loss=2.058463 layers=112                                      
Generating AutoScheme:  50%|??????????????????????????????????????????????????                         | 64/128 [00:44<00:43,  1.49it/s]2026-07-01 11:44:53 DEBUG delta_loss.py L1015: AutoScheme [1/2 MXFP4] cumulative batch 2/2  avg_loss=2.058463 layers=112
2026-07-01 11:44:53 DEBUG delta_loss.py L1024: AutoScheme [1/2 MXFP4] cumulative batch 2/2 block summary skipped (same as final)
2026-07-01 11:44:53 DEBUG delta_loss.py L905: AutoScheme [1/2 MXFP4] block loss summary (final):
2026-07-01 11:45:33 DEBUG delta_loss.py L909: AutoScheme | block | avg_loss |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.0 | 0.342250 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.1 | 0.329904 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.2 | 0.316511 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.3 | 0.328020 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.4 | 0.321219 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.5 | 0.297294 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.6 | 0.292655 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.7 | 0.286900 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.8 | 0.298165 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.9 | 0.312988 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.10 | 0.291748 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.11 | 0.257987 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.12 | 0.239083 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.13 | 0.220668 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.14 | 0.205008 |
2026-07-01 11:45:33 DEBUG delta_loss.py L953: AutoScheme | layers.15 | 0.234026 |
2026-07-01 11:45:33 DEBUG delta_loss.py L967: AutoScheme | lm_head | N/A |
2026-07-01 11:45:33 INFO delta_loss.py L980: AutoScheme non_block loss: none
2026-07-01 11:45:33 INFO device.py L1450: 'peak_ram': 5.63GB, 'peak_vram': 5.44GB
2026-07-01 11:45:33 INFO delta_loss.py L1897: AutoScheme transition: scheme 2/2 scoring finished (total_loss=32.020996)
2026-07-01 11:45:34 INFO device.py L1450: 'peak_ram': 5.63GB, 'peak_vram': 5.44GB
2026-07-01 11:45:34 INFO device.py L1448: AutoScheme complete (low_cpu_mem_usage=disabled) 'peak_ram': 5.63GB, 'peak_vram': 5.44GB
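
The step counts printed in the log above can be reproduced from the logged variables. The following is a minimal sketch of that arithmetic, not the code in delta_loss.py: the helper name `autoscheme_steps` is hypothetical, and rounding `nsamples / batch_size` up is an assumption (both runs in this PR divide evenly, so the log does not show how a remainder is handled).

```python
import math


def autoscheme_steps(scheme_num, block_num, nsamples, batch_size):
    """Return (base_steps, expanded_steps) as reported in the AutoScheme log.

    Hypothetical helper for illustration only.
    """
    # Assumption: a partial final batch still counts as one batch.
    n_batches = math.ceil(nsamples / batch_size)
    # One forward and one backward pass per block, per scheme.
    base_steps = scheme_num * block_num * 2
    # The low_gpu path repeats that work for every calibration batch.
    expanded_steps = base_steps * n_batches
    return base_steps, expanded_steps


# Llama-3.2-1B-Instruct run: scheme_num=2, block_num=16, nsamples=16, batch_size=8
print(autoscheme_steps(2, 16, 16, 8))  # (64, 128)
# Qwen3.6-35B-A3B run: scheme_num=2, block_num=40, nsamples=64, batch_size=8
print(autoscheme_steps(2, 40, 64, 8))  # (160, 1280)
```

Both pairs match the `steps(total)` and `steps expanded(low_gpu)` lines in the two logs, which is why the progress bar total is 128 for Llama and 1280 for Qwen.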

Qwen/Qwen3.6-35B-A3B

Details

2026-07-01 11:57:31 INFO delta_loss.py L1736: The model appears to be an MoE  model. Using more samples to help generate a better auto-scheme recipe.                                                                                     
2026-07-01 11:57:31 INFO delta_loss.py L1791: AutoScheme steps(total)=160                                                                                                                                                                 
2026-07-01 11:57:31 INFO delta_loss.py L1792: AutoScheme steps variables: scheme_num=2, block_num=40, nsamples=64, batch_size=8                                                                                                           
2026-07-01 11:57:31 INFO delta_loss.py L1799: AutoScheme steps expanded(low_gpu): total_steps = scheme_num * block_num * 2(forward+backward) * n_batches = 2 * 40 * 2 * 8 = 1280                                                          
Generating AutoScheme:   0%|                                                         | 0/1280 [00:00<?, ?it/s]2026-07-01 11:57:31 INFO delta_loss.py L1836: AutoScheme transition: switch to scheme 1/2 (MXFP4)                           
2026-07-01 11:57:50 INFO calib_dataset.py L977: Preprocessing calibration dataset in a subprocess to avoid memory leaks...                                                                                                                
Generating AutoScheme:   3%|????                                            | 40/1280 [06:52<3:11:01,  9.24s/it]2026-07-01 12:04:26 INFO device.py L1450: 'peak_ram': 66.84GB, 'peak_vram': 22.7GB                                          
/home/xinhe/auto-round/.venv/lib/python3.12/site-packages/torch/autograd/graph.py:869: UserWarning: Flash Attention defaults to a non-deterministic algorithm. To explicitly enable determinism call torch.use_deterministic_algorithms(True, warn_only=False). (Triggered internally at /pytorch/aten/src/ATen/native/transformers/cuda/attention_backward.cu:124.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass                                                                                                                                   
AutoScheme [1/2 MXFP4] cumulative batch 1/8  avg_loss=0.005665 layers=31071                                                                                                                                                               
Generating AutoScheme:   6%|??????                                          | 80/1280 [25:30<14:44:42, 44.24s/it]2026-07-01 12:23:02 DEBUG delta_loss.py L1015: AutoScheme [1/2 MXFP4] cumulative batch 1/8  avg_loss=0.005665 layers=31071  
2026-07-01 12:23:03 DEBUG delta_loss.py L1030: AutoScheme [1/2 MXFP4] cumulative batch 1/8 block summary:                                                                                                                                 
2026-07-01 12:23:03 DEBUG delta_loss.py L905: AutoScheme [1/2 MXFP4] block loss summary (cumulative):                                                                                                                                     
2026-07-01 12:23:03 DEBUG delta_loss.py L907: AutoScheme | block | avg_loss | non_exp_avg | exp_avg | inactive_exp | shared_loss |                                                                                                        
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.0 | 0.003526 | 0.203457 | 0.001183 | 0/256 | N/A |                                                                                                                      
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.1 | 0.004285 | 0.213921 | 0.001681 | 1/256 | 0.000028 |                                                                                                                 
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.2 | 0.004917 | 0.250949 | 0.002034 | 0/256 | N/A |                                                                                                                      
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.3 | 0.005059 | 0.262360 | 0.001289 | 2/256 | 0.000002 |                                                                                                                 
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.4 | 0.005247 | 0.224664 | 0.000982 | 6/256 | 0.000011 |                                                                                                                 
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.5 | 0.005625 | 0.241252 | 0.000084 | 9/256 | 0.000001 |                                                                                                                 
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.6 | 0.006103 | 0.270779 | 0.002675 | 7/256 | 0.000002 |                                                                                                                 
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.7 | 0.006279 | 0.304123 | 0.009237 | 7/256 | 0.000000 |                                                                                                                 
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.8 | 0.006329 | 0.263753 | 0.000136 | 11/256 | 0.000002 |                                                                                                                
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.9 | 0.006380 | 0.261746 | 0.000142 | 11/256 | 0.000000 |                                                                                                                
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.10 | 0.006566 | 0.291612 | 0.000112 | 9/256 | 0.000000 |                                                                                                                
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.11 | 0.006700 | 0.329559 | 0.005859 | 14/256 | 0.000001 |                                                                                                               
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.12 | 0.006035 | 0.264174 | 0.000064 | 8/256 | 0.000006 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.13 | 0.005634 | 0.224311 | 0.000314 | 10/256 | 0.000002 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.14 | 0.005807 | 0.248725 | 0.004674 | 8/256 | 0.000002 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.15 | 0.006381 | 0.317917 | 0.000631 | 14/256 | 0.000002 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.16 | 0.005211 | 0.197700 | 0.000519 | 10/256 | 0.000005 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.17 | 0.005135 | 0.196438 | 0.001879 | 9/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.18 | 0.005261 | 0.197456 | 0.001377 | 17/256 | 0.000004 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.19 | 0.005654 | 0.258820 | 0.000047 | 13/256 | 0.000004 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.20 | 0.005496 | 0.233521 | 0.000021 | 12/256 | 0.000002 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.21 | 0.005644 | 0.218262 | 0.000044 | 21/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.22 | 0.005673 | 0.238675 | 0.000970 | 13/256 | 0.000001 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.23 | 0.005732 | 0.286415 | 0.001066 | 12/256 | 0.000008 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.24 | 0.004898 | 0.192342 | 0.000363 | 16/256 | 0.000003 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.25 | 0.004493 | 0.164998 | 0.000129 | 10/256 | 0.000002 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.26 | 0.004950 | 0.202962 | 0.003667 | 12/256 | 0.000005 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.27 | 0.005520 | 0.281952 | 0.000662 | 8/256 | 0.000003 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.28 | 0.004972 | 0.188544 | 0.000846 | 10/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.29 | 0.005491 | 0.219198 | 0.004710 | 10/256 | 0.000003 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.30 | 0.005875 | 0.225966 | 0.000285 | 20/256 | 0.000005 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.31 | 0.006115 | 0.265884 | 0.000001 | 21/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.32 | 0.006555 | 0.247179 | 0.004435 | 21/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.33 | 0.006256 | 0.244602 | 0.000414 | 13/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.34 | 0.007238 | 0.279270 | 0.005646 | 26/256 | 0.000001 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.35 | 0.006032 | 0.285767 | 0.002546 | 11/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.36 | 0.005899 | 0.231432 | 0.006378 | 13/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.37 | 0.006209 | 0.245280 | 0.004445 | 18/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.38 | 0.007648 | 0.327650 | 0.000034 | 28/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L942: AutoScheme | layers.39 | 0.010694 | 0.587891 | 0.000170 | 18/256 | 0.000000 |
2026-07-01 12:23:03 DEBUG delta_loss.py L965: AutoScheme | lm_head | 3.500000 | N/A | N/A | 0/0 | N/A |              
2026-07-01 12:23:03 DEBUG delta_loss.py L970: AutoScheme table note: non_exp_avg excludes experts; exp_avg excludes inactive experts; shared_loss is used for inactive expert broadcast.
2026-07-01 12:23:03 INFO delta_loss.py L980: AutoScheme non_block loss: none                                         
Generating AutoScheme:   9%|??????????                                        | 120/1280 [34:23<3:56:04, 12.21s/it]2026-07-01 12:31:56 INFO device.py L1450: 'peak_ram': 67.85GB, 'peak_vram': 28.89GB
AutoScheme [1/2 MXFP4] cumulative batch 2/8  avg_loss=0.005918 layers=31071                                          
Generating AutoScheme:  12%|????????????                                       | 160/1280 [48:05<5:40:51, 18.26s/it]2026-07-01 12:45:37 DEBUG delta_loss.py L1015: AutoScheme [1/2 MXFP4] cumulative batch 2/8  avg_loss=0.005918 layers=31071
2026-07-01 12:45:37 DEBUG delta_loss.py L1030: AutoScheme [1/2 MXFP4] cumulative batch 2/8 block summary:            
2026-07-01 12:45:37 DEBUG delta_loss.py L905: AutoScheme [1/2 MXFP4] block loss summary (cumulative):                
2026-07-01 12:45:37 DEBUG delta_loss.py L907: AutoScheme | block | avg_loss | non_exp_avg | exp_avg | inactive_exp | shared_loss |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.0 | 0.003621 | 0.207435 | 0.001233 | 0/256 | N/A |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.1 | 0.004396 | 0.219082 | 0.001880 | 0/256 | N/A |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.2 | 0.004994 | 0.255303 | 0.002060 | 0/256 | N/A |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.3 | 0.005170 | 0.273804 | 0.002372 | 0/256 | N/A |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.4 | 0.005355 | 0.230265 | 0.002720 | 0/256 | N/A |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.5 | 0.005662 | 0.250570 | 0.000272 | 2/256 | 0.000034 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.6 | 0.006166 | 0.279867 | 0.002958 | 0/256 | N/A |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.7 | 0.006434 | 0.324081 | 0.003125 | 0/256 | N/A |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.8 | 0.006356 | 0.273356 | 0.001195 | 1/256 | 0.000005 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.9 | 0.006602 | 0.277398 | 0.000405 | 1/256 | 0.000000 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.10 | 0.006789 | 0.305840 | 0.000068 | 2/256 | 0.000004 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.11 | 0.006699 | 0.346024 | 0.004786 | 2/256 | 0.000010 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.12 | 0.006101 | 0.271478 | 0.002991 | 0/256 | N/A |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.13 | 0.005592 | 0.229506 | 0.000220 | 1/256 | 0.000009 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.14 | 0.005823 | 0.253879 | 0.002645 | 1/256 | 0.000002 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.15 | 0.006273 | 0.329887 | 0.000387 | 2/256 | 0.000004 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.16 | 0.005154 | 0.199816 | 0.000609 | 1/256 | 0.000009 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.17 | 0.005036 | 0.198229 | 0.005024 | 1/256 | 0.000002 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.18 | 0.005073 | 0.200948 | 0.002778 | 0/256 | N/A |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.19 | 0.005667 | 0.268135 | 0.000027 | 4/256 | 0.000010 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.20 | 0.005448 | 0.236782 | 0.000885 | 4/256 | 0.000002 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.21 | 0.005516 | 0.226345 | 0.000148 | 3/256 | 0.000005 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.22 | 0.005726 | 0.245972 | 0.000990 | 4/256 | 0.000014 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.23 | 0.005673 | 0.299305 | 0.002614 | 0/256 | N/A |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.24 | 0.004813 | 0.197693 | 0.000090 | 3/256 | 0.000015 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.25 | 0.004473 | 0.169447 | 0.000381 | 1/256 | 0.000005 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.26 | 0.004963 | 0.210341 | 0.002218 | 1/256 | 0.000004 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.27 | 0.005718 | 0.294151 | 0.000371 | 3/256 | 0.000004 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.28 | 0.004977 | 0.194485 | 0.001242 | 1/256 | 0.000005 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.29 | 0.005563 | 0.228021 | 0.005020 | 2/256 | 0.000005 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.30 | 0.005899 | 0.239705 | 0.000032 | 6/256 | 0.000005 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.31 | 0.006198 | 0.286247 | 0.000076 | 6/256 | 0.000004 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.32 | 0.006578 | 0.258979 | 0.002238 | 7/256 | 0.000002 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.33 | 0.006442 | 0.259562 | 0.000216 | 4/256 | 0.000001 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.34 | 0.007279 | 0.294895 | 0.005880 | 14/256 | 0.000001 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.35 | 0.006235 | 0.308899 | 0.002605 | 2/256 | 0.000000 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.36 | 0.005883 | 0.243435 | 0.006831 | 4/256 | 0.000000 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.37 | 0.006203 | 0.256076 | 0.004021 | 9/256 | 0.000001 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.38 | 0.007438 | 0.345479 | 0.000018 | 7/256 | 0.000000 |
2026-07-01 12:45:37 DEBUG delta_loss.py L942: AutoScheme | layers.39 | 0.010684 | 0.622803 | 0.000194 | 7/256 | 0.000000 |
2026-07-01 12:45:37 DEBUG delta_loss.py L965: AutoScheme | lm_head | 3.671875 | N/A | N/A | 0/0 | N/A |              
2026-07-01 12:45:37 DEBUG delta_loss.py L970: AutoScheme table note: non_exp_avg excludes experts; exp_avg excludes inactive experts; shared_loss is used for inactive expert broadcast.
2026-07-01 12:45:37 INFO delta_loss.py L980: AutoScheme non_block loss: none 
Generating AutoScheme:  16%|??????????????                                      | 200/1280 [53:27<2:46:38,  9.26s/it]2026-07-01 12:51:02 INFO device.py L1450: 'peak_ram': 68.04GB, 'peak_vram': 28.89GB
AutoScheme [1/2 MXFP4] cumulative batch 3/8  avg_loss=0.005484 layers=31071                                   
Generating AutoScheme:  19%|????????????????                                   | 240/1280 [1:07:58<5:56:46, 20.58s/it]2026-07-01 13:05:30 DEBUG delta_loss.py L1015: AutoScheme [1/2 MXFP4] cumulative batch 3/8  avg_loss=0.005484 layers=31071
2026-07-01 13:05:30 DEBUG delta_loss.py L1030: AutoScheme [1/2 MXFP4] cumulative batch 3/8 block summary:
2026-07-01 13:05:30 DEBUG delta_loss.py L905: AutoScheme [1/2 MXFP4] block loss summary (cumulative):
2026-07-01 13:05:30 DEBUG delta_loss.py L907: AutoScheme | block | avg_loss | non_exp_avg | exp_avg | inactive_exp | shared_loss |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.0 | 0.003268 | 0.188201 | 0.001101 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.1 | 0.003921 | 0.195322 | 0.001678 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.2 | 0.004479 | 0.228805 | 0.001850 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.3 | 0.004593 | 0.243896 | 0.002100 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.4 | 0.004827 | 0.210436 | 0.002418 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.5 | 0.005093 | 0.227394 | 0.002488 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.6 | 0.005610 | 0.255335 | 0.002684 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.7 | 0.005908 | 0.296870 | 0.002877 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.8 | 0.005817 | 0.250986 | 0.002944 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.9 | 0.006011 | 0.255082 | 0.003093 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.10 | 0.006256 | 0.285920 | 0.002978 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.11 | 0.006219 | 0.322774 | 0.002922 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.12 | 0.005701 | 0.255037 | 0.002779 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.13 | 0.005250 | 0.216209 | 0.002778 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.14 | 0.005458 | 0.239850 | 0.002711 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.15 | 0.005915 | 0.315318 | 0.002692 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.16 | 0.004828 | 0.189485 | 0.002664 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.17 | 0.004713 | 0.188703 | 0.002556 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.18 | 0.004741 | 0.190913 | 0.002560 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.19 | 0.005214 | 0.253311 | 0.000081 | 1/256 | 0.000010 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.20 | 0.005043 | 0.219973 | 0.000714 | 3/256 | 0.000007 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.21 | 0.005044 | 0.211277 | 0.002627 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.22 | 0.005264 | 0.231848 | 0.002609 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.23 | 0.005363 | 0.285472 | 0.002445 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.24 | 0.004521 | 0.186964 | 0.000045 | 3/256 | 0.000016 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.25 | 0.004212 | 0.161237 | 0.000312 | 1/256 | 0.000015 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.26 | 0.004659 | 0.199549 | 0.003139 | 1/256 | 0.000017 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.27 | 0.005350 | 0.281036 | 0.000812 | 1/256 | 0.000001 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.28 | 0.004649 | 0.185072 | 0.002535 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.29 | 0.005182 | 0.214966 | 0.005256 | 2/256 | 0.000012 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.30 | 0.005443 | 0.226391 | 0.000101 | 3/256 | 0.000004 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.31 | 0.005714 | 0.269328 | 0.000111 | 2/256 | 0.000005 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.32 | 0.005923 | 0.241776 | 0.004242 | 2/256 | 0.000002 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.33 | 0.005925 | 0.241030 | 0.000234 | 2/256 | 0.000003 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.34 | 0.006431 | 0.276647 | 0.006036 | 4/256 | 0.000002 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.35 | 0.005723 | 0.289591 | 0.002766 | 0/256 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.36 | 0.005359 | 0.226097 | 0.006772 | 2/256 | 0.000001 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.37 | 0.005580 | 0.238383 | 0.003129 | 2/256 | 0.000000 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.38 | 0.006759 | 0.320080 | 0.000014 | 3/256 | 0.000000 |
2026-07-01 13:05:30 DEBUG delta_loss.py L942: AutoScheme | layers.39 | 0.009778 | 0.580933 | 0.000250 | 3/256 | 0.000000 |
2026-07-01 13:05:30 DEBUG delta_loss.py L965: AutoScheme | lm_head | 3.437500 | N/A | N/A | 0/0 | N/A |
2026-07-01 13:05:30 DEBUG delta_loss.py L970: AutoScheme table note: non_exp_avg excludes experts; exp_avg excludes inactive experts; shared_loss is used for inactive expert broadcast.
2026-07-01 13:05:30 INFO delta_loss.py L980: AutoScheme non_block loss: none
Generating AutoScheme:  22%|????????????????????                                 | 280/1280 [1:12:53<2:17:07,  8.23s/it]2026-07-01 13:10:27 INFO device.py L1450: 'peak_ram': 68.14GB, 'peak_vram': 28.89GB
...
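
The table note in the MoE log defines three of the columns: `non_exp_avg` excludes experts, `exp_avg` excludes inactive experts, and `inactive_exp` counts experts with no recorded loss. The sketch below shows one way such a row could be derived from per-layer losses. It is an illustration under assumptions, not the PR's implementation: the function name `summarize_block`, the `.experts.<id>.` naming pattern, and the rule that an expert is inactive when all of its layers have zero loss are all hypothetical.

```python
from collections import defaultdict


def summarize_block(layer_losses):
    """Summarize one block from a mapping of layer name -> loss.

    Hypothetical helper; assumes expert layers are named
    '<block>....experts.<id>.<layer>'.
    """
    non_expert = []
    per_expert = defaultdict(list)
    for name, loss in layer_losses.items():
        if ".experts." in name:
            expert_id = name.split(".experts.", 1)[1].split(".", 1)[0]
            per_expert[expert_id].append(loss)
        else:
            non_expert.append(loss)

    # Assumption: an expert that never received tokens has zero loss everywhere.
    active_losses = []
    inactive = 0
    for losses in per_expert.values():
        if any(x > 0.0 for x in losses):
            active_losses.extend(losses)
        else:
            inactive += 1

    def avg(values):
        return sum(values) / len(values) if values else None

    return {
        "non_exp_avg": avg(non_expert),    # experts excluded
        "exp_avg": avg(active_losses),     # inactive experts excluded
        "inactive_exp": f"{inactive}/{len(per_expert)}",
    }


row = summarize_block({
    "layers.3.self_attn.q_proj": 0.5,
    "layers.3.self_attn.o_proj": 0.25,
    "layers.3.mlp.experts.0.up_proj": 0.125,
    "layers.3.mlp.experts.0.down_proj": 0.375,
    "layers.3.mlp.experts.1.up_proj": 0.0,   # expert 1 was never routed to
    "layers.3.mlp.experts.1.down_proj": 0.0,
})
print(row)  # {'non_exp_avg': 0.375, 'exp_avg': 0.25, 'inactive_exp': '1/2'}
```

The toy losses are chosen to be exactly representable in binary so the averages print cleanly; they are not taken from the log.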

Type of Change

Bug fix

Related Issues

Fixes or relates to #910 #1347

Checklist Before Submitting

My code has been tested locally.
Documentation has been updated as needed.
New or updated tests are included where applicable.
The CUDA CI has passed. You can trigger it by commenting /azp run Unit-Test-CUDA-AutoRound.

chensuyue · 2026-07-01T04:41:19Z

/azp run Unit-Test-CUDA-AutoRound

azure-pipelines · 2026-07-01T04:41:28Z

Azure Pipelines successfully started running 1 pipeline(s).

wenhuach21 · 2026-07-01T05:32:18Z

+    scheme_tag: Optional[str] = None,
 ):
    scores_dict = {}  # Key=name,Val=[quant_total_bits, loss]
+    block_names = get_block_names(model)[0]


A VLM with quant_nontext_module also needs to handle the visual blocks.

wenhuach21 · 2026-07-01T05:32:41Z

                enable_torch_compile=enable_torch_compile,
            )
            set_module(model, name, new_m)
-    if offload_context is not None:


Why was this deleted?
@lvliang-intel please review this change.

wenhuach21 · 2026-07-01T05:43:40Z

+        head_name = "lm_head"
+
+    # Sort by length to avoid prefix ambiguity and match faster in practice.
+    block_prefixes = [(name, name + ".") for name in sorted(block_names, key=len, reverse=True)]


This file is already too long; it would be better to move this function to utils.
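
The `block_prefixes` line in the hunk above pairs each block name with `name + "."` and orders the names longest first. A minimal sketch of why both details matter when mapping a layer name back to its block follows; `find_block` is a hypothetical helper written for this illustration, not the function under review.

```python
def find_block(layer_name, block_names):
    """Return the block that owns layer_name, or None for non-block layers."""
    # Longest names first, so a nested block wins over its parent.
    block_prefixes = [
        (name, name + ".") for name in sorted(block_names, key=len, reverse=True)
    ]
    for name, prefix in block_prefixes:
        # The trailing "." stops "model.layers.1" from claiming "model.layers.10.*".
        if layer_name.startswith(prefix):
            return name
    return None


blocks = ["model.layers.1", "model.layers.10"]

# A bare startswith() check is ambiguous:
print("model.layers.10.mlp.down_proj".startswith("model.layers.1"))  # True

# With the "." suffix each layer resolves to exactly one block:
print(find_block("model.layers.10.mlp.down_proj", blocks))    # model.layers.10
print(find_block("model.layers.1.self_attn.q_proj", blocks))  # model.layers.1
print(find_block("lm_head", blocks))                          # None
```

Layers that fall outside every block, such as `lm_head`, return `None` here, which is one way a caller could route them to separate handling.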

wenhuach21 · 2026-07-01T05:45:08Z

-def model_forward_low_gpu(model, dataloader, major_device="cuda", pbar=None):
+def model_forward_low_gpu(model, dataloader, major_device="cuda", pbar=None, scheme_tag=None):
    block_inputs = {}
+    total_batches = len(dataloader) if hasattr(dataloader, "__len__") else None


Please attach the AutoScheme cost for Qwen3-8B so we can rule out any regression.
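
The `total_batches` guard in the hunk above exists because not every iterable passed as a dataloader defines `__len__`. The sketch below shows the same guard on plain Python objects; `count_batches` and `batch_label` are hypothetical names, and the label format only loosely imitates the "cumulative batch 1/2" text seen in the logs.

```python
def count_batches(dataloader):
    """Return the number of batches if it is known up front, else None."""
    return len(dataloader) if hasattr(dataloader, "__len__") else None


def batch_label(index, total):
    """Format a progress label, omitting the total when it is unknown."""
    if total is None:
        return f"batch {index}"
    return f"batch {index}/{total}"


sized = [[1, 2], [3, 4]]             # a list knows its length
unsized = (batch for batch in sized)  # a generator does not

print(batch_label(1, count_batches(sized)))    # batch 1/2
print(batch_label(1, count_batches(unsized)))  # batch 1
```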

wenhuach21 · 2026-07-01T05:46:09Z

+
+
+def _fill_inactive_expert_scores(scores_dict: dict[str, list[float]], block_names: list[str]):
+    """Fill inactive experts with the min loss of active experts in each block.


Please demonstrate the advantage of this choice (min) over avg/max.
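
To make the min/avg/max comparison concrete, here is a small sketch of the filling step described in the docstring above, with the reduction left as a parameter. It is written under assumptions and is not the PR's `_fill_inactive_expert_scores`: the name `fill_inactive_expert_losses`, the `.experts.` naming pattern, and treating a loss of exactly 0.0 as "inactive" are all hypothetical. The value layout `[quant_total_bits, loss]` follows the comment on `scores_dict` in the reviewed diff.

```python
from statistics import mean


def fill_inactive_expert_losses(scores, block_names, reduce=min):
    """Return a copy of scores with inactive expert losses filled per block.

    scores maps layer name -> [quant_total_bits, loss].
    Hypothetical helper; assumes loss == 0.0 marks an inactive expert layer.
    """
    filled = {name: list(value) for name, value in scores.items()}
    for block in block_names:
        prefix = block + "."
        experts = [n for n in scores if n.startswith(prefix) and ".experts." in n]
        active = [scores[n][1] for n in experts if scores[n][1] > 0.0]
        if not active:
            continue  # no active expert in this block to borrow a loss from
        shared_loss = reduce(active)
        for n in experts:
            if scores[n][1] == 0.0:
                filled[n][1] = shared_loss
    return filled


scores = {
    "layers.0.mlp.experts.0.up_proj": [4.0, 0.5],
    "layers.0.mlp.experts.1.up_proj": [4.0, 0.25],
    "layers.0.mlp.experts.2.up_proj": [4.0, 0.0],  # never routed to
    "layers.0.self_attn.q_proj": [4.0, 0.0],       # not an expert, left alone
}

for reduce in (min, mean, max):
    filled = fill_inactive_expert_losses(scores, ["layers.0"], reduce)
    print(reduce.__name__, filled["layers.0.mlp.experts.2.up_proj"][1])
# min 0.25
# mean 0.375
# max 0.5
```

Running the three reductions over the same calibration scores, then comparing the resulting recipes, would be one way to answer the question above; the sketch itself makes no claim about which choice is better.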

chensuyue · 2026-07-01T07:11:45Z

/azp run Unit-Test-CUDA-AutoRound

azure-pipelines · 2026-07-01T07:11:55Z

Azure Pipelines successfully started running 1 pipeline(s).

xin3he added 6 commits June 30, 2026 07:24

fix bug of autoscheme

6c564fb

Signed-off-by: Xin He <xin3.he@intel.com>

update log in autoscheme

67df125

Signed-off-by: Xin He <xin3.he@intel.com>

add shared loss for inactived experts and refine log

6b2d5a5

Signed-off-by: Xin He <xin3.he@intel.com>

set lm_head grad_mode=True

115b5e4

Signed-off-by: Xin He <xin3.he@intel.com>

Enhance model_forward_low_gpu to support scheme tagging and improve l…

0a78c64

…ogging of batch average loss Signed-off-by: Xin He <xin3.he@intel.com>

revert nsamples change

7b13fd4

Signed-off-by: Xin He <xin3.he@intel.com>

wenhuach21 reviewed Jul 1, 2026

View reviewed changes

xin3he added this to the 0.14.0 milestone Jul 1, 2026

xin3he requested a review from wenhuach21 July 1, 2026 05:58

xin3he added 2 commits July 1, 2026 14:02

Add environment variables for AutoScheme nsamples and batch size; upd…

f2c6c23

…ate documentation and tests Signed-off-by: Xin He <xin3.he@intel.com>

Refactor _gen_layer_config to simplify nsamples and seqlen assignment…

56316a0

… for MoE models Signed-off-by: Xin He <xin3.he@intel.com>

xin3he wants to merge 8 commits into main from xinhe/6-29