Mention early on that Silvermont doesn't recognize SUB, only XOR.
Peter Cordes

Even worse than that, Silvermont only recognizes xor r32,r32 as dep-breaking, not 64-bit operand-size. Thus even when a REX prefix is still required because you're zeroing r8..r15, use xor r10d,r10d, not xor r10,r10. (It also doesn't recognize sub, only xor, saving transistors in the decoders by checking for fewer patterns.)

XOR-zeroing also sets FLAGS to a known state, so also breaks dependencies through FLAGS. (On paper it leaves AF "undefined", meaning there isn't a documented value, but in practice on modern CPUs it sets AF the same way as SUB, so the CPU can handle both the same way. If you need the same value for AF across all CPUs, use SUB. Semi-related history: Sure, xor’ing a register with itself is the idiom for zeroing it out, but why not sub - Raymond Chen, The Old New Thing.)
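To summarize the recommended forms (NASM syntax; a quick sketch, with the r8..r15 examples assuming 64-bit mode):

xor  eax, eax     ; preferred: 2 bytes, dep-breaking on all CPUs, known FLAGS
xor  r10d, r10d   ; still use 32-bit operand-size even though a REX prefix is needed
xor  r10, r10     ; avoid: Silvermont doesn't recognize the 64-bit form as dep-breaking
sub  eax, eax     ; avoid: Silvermont recognizes only xor, not sub, as dep-breaking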

Finally APX solves the problem of efficiently materializing a boolean integer.

xor sets flags, which means you have to be careful when testing conditions. Since setcc is unfortunately only available with an 8-bit destination (until the APX extension¹), you usually need to take care to avoid partial-register penalties.

Footnote 1: Intel Advanced Performance Extensions (APX) introduces REX2 and EVEX forms of integer instructions, for 32 GPRs and new 3-operand forms of common instructions. And finally a zero-extending ("zero-upper" aka ZU) form of setcc r64. (Total instruction length of 6 bytes, using one of the spare bits in the EVEX prefix to encode legacy vs. zero-upper behaviour for register destinations.)

Correct misinformation about partial-register stuff and movzx dword, byte mov-elim: it is a thing on CPUs after IvB.

xor will tag the register as having the upper parts zeroed, so xor eax, eax / inc al / inc eax avoids the usual partial-register penalty that pre-IvB CPUs have. Even without xor, IvB and later only need a merging uop when the high 8 bits (AH) are modified and then the whole register is read. (Agner incorrectly states that Haswell removes AH merging penalties.)

...
call  some_func
xor     ecx,ecx    ; zero *before* setting FLAGS
cmp     eax, 42
setnz   cl         ; ecx = cl = (some_func() != 42)
add     ebx, ecx   ; no partial-register penalty here

This has optimal performance on all CPUs (no stalls, merging uops, or false dependencies). (If the condition was ebx += (eax != 0), there are tricks like cmp eax, 1; sbb ebx, -1 using the carry flag with adc or sbb to add or subtract it directly, instead of materializing it as a 0/1 integer, as @l4m2 pointed out in comments. It might even be worth it to do sub eax, 42 (or LEA into another reg) / cmp eax,1 / sbb. Especially if it's hard to arrange to xor-zero before setting FLAGS, since cmp/setcc/movzx/add has all 4 operations on the critical path for latency.)
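As a sketch of that sbb trick for the ebx += (eax != 0) case:

cmp  eax, 1        ; CF = 1 only when eax == 0 (unsigned eax < 1)
sbb  ebx, -1       ; ebx = ebx - (-1) - CF = ebx + 1 - CF = ebx + (eax != 0)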

There are no recognized zeroing idioms that don't affect flags, so the best choice depends on the target microarchitecture. On Core2, inserting a merging uop might cause a 2 or 3 cycle stall. It's cheaper on SnB, like 1 cycle at worst, and Haswell and later don't rename partial registers separately from full regs. Using mov reg, 0 / setcc is probably best on recent CPUs, but would have a significant penalty on older Intel CPUs (Nehalem and earlier). On newer CPUs it's close to as good as xor-zeroing, but has worse code-size than movzx.

Using setcc / movzx r32, r8 is probably the best alternative for Intel P6, if you can't xor-zero ahead of the flag-setting instruction. That should be better than repeating the test after an xor-zeroing. (Don't even consider sahf / lahf or pushf / popf). IvB and later (except for Ice Lake) can eliminate movzx r32, r8 (i.e. handle it with register-renaming with no execution unit or latency, like xor-zeroing). AMD Zen family can only eliminate regular mov instructions, so movzx takes an execution unit and has non-zero latency, making test/setcc/movzx worse than xor/test/setcc. Also worse than test/mov r,0/setcc (but much better on older Intel CPUs with partial-register stalls).
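For example, the setcc / movzx version of the earlier sequence, when you can't xor-zero ahead of the cmp:

cmp    eax, 42
setnz  cl          ; cl = (eax != 42); the rest of ecx still holds stale bits
movzx  ecx, cl     ; zero-extend; eliminated (zero latency) on IvB and later, except Ice Lake
add    ebx, ecx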

Using setcc / movzx with no zeroing first is bad on AMD/P4/Silvermont, because they don't track deps separately for sub-registers. There would be a false dep on the old value of the register. Using mov reg, 0/setcc for zeroing / dependency-breaking is probably the best alternative when xor/test/setcc isn't an option. At least for "hot" code where this is part of an important latency chain. Otherwise go for movzx to save a bit of code size.
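And a sketch of the mov reg, 0 version, which works after FLAGS are already set because mov doesn't affect FLAGS:

cmp    eax, 42
mov    ecx, 0      ; mov leaves FLAGS untouched, so it can go after the cmp
setnz  cl          ; writes the low byte of the already-zeroed ecx
add    ebx, ecx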

phrasing to account for scroll bar / width of code block: "everywhere else" was confusingly truncated to "everywhere". Also pick CL to show that it's not just AL where mov reg8, imm8 is a 2-byte instruction.
section separator for the P3 stuff.
include more sub-optimal examples
include more sub-optimal examples
the context for "both" got separated by too much text at some point.
another example of what not to do. Add a table of vector examples and link related Qs about zeroing them.
added 396 characters in body
nickname typo
user784668
P6-family had non-dep-breaking xor until at least P-M. And mention the Silvermont reason for r32 not r64.
replaced http://stackoverflow.com/ with https://stackoverflow.com/
URL Rewriter Bot
vector zeroing, don't focus on the bypass delay as much as the execution ports.
Some of Bruce Dawson's other blog posts are excellent. Be less critical about this one article.
xor-zeroing is probably a dep breaker on P2/P3/PM
re-order sections, and summary table. Rewrite a lot of the setcc section. Take out the speculation about `mov r32, imm32` noticing immediates with only the low byte set, since that's unlikely.
added 2123 characters in body
added 158 characters in body
`mov` doesn't affect flags.
explain why it only matters for Nehalem in the tl;dr
vector regs
summary