Mention early on that Silvermont doesn't recognize SUB, only XOR.
Peter Cordes

Even worse than that, Silvermont only recognizes xor r32,r32 as dep-breaking, not 64-bit operand-size. Thus even when a REX prefix is still required because you're zeroing r8..r15, use xor r10d,r10d, not xor r10,r10. (It also doesn't recognize sub, only xor, saving transistors in the decoders by checking for fewer patterns.)

XOR-zeroing also sets FLAGS to a known state, so also breaks dependencies through FLAGS. (On paper it leaves AF "undefined", meaning there isn't a documented value, but in practice on modern CPUs it sets AF the same way as SUB, so the CPU can handle both the same way. If you need the same value for AF across all CPUs, use SUB. Semi-related history: Sure, xor’ing a register with itself is the idiom for zeroing it out, but why not sub - Raymond Chen, The Old New Thing.)
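To summarize the recommended forms (NASM syntax; a quick sketch, with the r8..r15 examples assuming 64-bit mode):

xor  eax, eax     ; preferred: 2 bytes, dep-breaking on all CPUs, known FLAGS
xor  r10d, r10d   ; still use 32-bit operand-size even though a REX prefix is needed
xor  r10, r10     ; avoid: Silvermont doesn't recognize the 64-bit form as dep-breaking
sub  eax, eax     ; avoid: Silvermont recognizes only xor, not sub, as dep-breaking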

Finally APX solves the problem of efficiently materializing a boolean integer.

xor sets flags, which means you have to be careful when testing conditions. Since setcc is unfortunately only available with an 8-bit destination (until the APX extension¹), you usually need to take care to avoid partial-register penalties.

Footnote 1: Intel Advanced Performance Extensions (APX) introduces REX2 and EVEX forms of integer instructions, for 32 GPRs and new 3-operand forms of common instructions. And finally a zero-extending ("zero-upper" aka ZU) form of setcc r64. (Total instruction length of 6 bytes, using one of the spare bits in the EVEX prefix to encode legacy vs. zero-upper behaviour for register destinations.)

Correct misinformation about partial-register stuff and movzx dword, byte mov-elim: it is a thing on CPUs after IvB.

xor will tag the register as having the upper parts zeroed, so xor eax, eax / inc al / inc eax avoids the usual partial-register penalty that pre-IvB CPUs have. Even without xor, IvB and later only need a merging uop when the high 8 bits (AH) are modified and then the whole register is read. (Agner incorrectly states that Haswell removes AH merging penalties.)

...
call  some_func
xor     ecx,ecx    ; zero *before* setting FLAGS
cmp     eax, 42
setnz   cl         ; ecx = cl = (some_func() != 42)
add     ebx, ecx   ; no partial-register penalty here

This has optimal performance on all CPUs (no stalls, merging uops, or false dependencies). (If the condition was ebx += (eax != 0), there are tricks like cmp eax, 1; sbb ebx, -1 using the carry flag with adc or sbb to add or subtract it directly, instead of materializing it as a 0/1 integer, as @l4m2 pointed out in comments. It might even be worth it to do sub eax, 42 (or LEA into another reg) / cmp eax,1 / sbb. Especially if it's hard to arrange to xor-zero before setting FLAGS, since cmp/setcc/movzx/add has all 4 operations on the critical path for latency.)
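As a sketch of that sbb trick for the ebx += (eax != 0) case:

cmp  eax, 1        ; CF = 1 only when eax == 0 (unsigned eax < 1)
sbb  ebx, -1       ; ebx = ebx - (-1) - CF = ebx + 1 - CF = ebx + (eax != 0)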

There are no recognized zeroing idioms that don't affect flags, so the best choice depends on the target microarchitecture. On Core2, inserting a merging uop might cause a 2 or 3 cycle stall. It's cheaper on SnB, like 1 cycle at worst, and Haswell and later don't rename partial registers separately from full regs. Using mov reg, 0 / setcc is probably best on recent CPUs, but would have a significant penalty on older Intel CPUs (Nehalem and earlier). On newer CPUs it's close to as good as xor-zeroing, but has worse code-size than movzx.

Using setcc / movzx r32, r8 is probably the best alternative for Intel P6, if you can't xor-zero ahead of the flag-setting instruction. That should be better than repeating the test after an xor-zeroing. (Don't even consider sahf / lahf or pushf / popf). IvB and later (except for Ice Lake) can eliminate movzx r32, r8 (i.e. handle it with register-renaming with no execution unit or latency, like xor-zeroing). AMD Zen family can only eliminate regular mov instructions, so movzx takes an execution unit and has non-zero latency, making test/setcc/movzx worse than xor/test/setcc. Also worse than test/mov r,0/setcc (but much better on older Intel CPUs with partial-register stalls).
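For example, the setcc / movzx version of the earlier sequence, when you can't xor-zero ahead of the cmp:

cmp    eax, 42
setnz  cl          ; cl = (eax != 42); the rest of ecx still holds stale bits
movzx  ecx, cl     ; zero-extend; eliminated (zero latency) on IvB and later, except Ice Lake
add    ebx, ecx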

Using setcc / movzx with no zeroing first is bad on AMD/P4/Silvermont, because they don't track deps separately for sub-registers. There would be a false dep on the old value of the register. Using mov reg, 0/setcc for zeroing / dependency-breaking is probably the best alternative when xor/test/setcc isn't an option. At least for "hot" code where this is part of an important latency chain. Otherwise go for movzx to save a bit of code size.
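And a sketch of the mov reg, 0 version, which works after FLAGS are already set because mov doesn't affect FLAGS:

cmp    eax, 42
mov    ecx, 0      ; mov leaves FLAGS untouched, so it can go after the cmp
setnz  cl          ; writes the low byte of the already-zeroed ecx
add    ebx, ecx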

phrasing to account for scroll bar / width of code block: "everywhere else" was confusingly truncated to "everywhere". Also pick CL to show that it's not just AL where mov reg8, imm8 is a 2-byte instruction.
section separator for the P3 stuff.
include more sub-optimal examples
include more sub-optimal examples
the context for "both" got separated by too much text at some point.
another example of what not to do. Add a table of vector examples and link related Qs about zeroing them.
added 396 characters in body
nickname typo
user784668
P6-family had non-dep-breaking xor until at least P-M. And mention the Silvermont reason for r32 not r64.
replaced http://stackoverflow.com/ with https://stackoverflow.com/
URL Rewriter Bot
vector zeroing, don't focus on the bypass delay as much as the execution ports.
Some of Bruce Dawson's other blog posts are excellent. Be less critical about this one article.
xor-zeroing is probably a dep breaker on P2/P3/PM
re-order sections, and summary table. Rewrite a lot of the setcc section. Take out the speculation about `mov r32, imm32` noticing immediates with only the low byte set, since that's unlikely.
added 2123 characters in body
added 158 characters in body
`mov` doesn't affect flags.
explain why it only matters for Nehalem in the tl;dr
vector regs
summary