Skip to main content

You are not logged in. Your edit will be placed in a queue until it is peer reviewed.

We welcome edits that make the post easier to understand and more valuable for readers. Because community members review edits, please try to make the post substantially better than how you found it, for example, by fixing grammar or adding additional resources and hyperlinks.

Required fields*

33
  • 9
    Most arithmetic instructions OP R,S are forced by an out of order CPU to wait for the content of register R to be filled by previous instructions with register R as a target; this is a data dependency. The key point is that Intel/AMD chips have special hardware to break must-wait-for-data-dependencies on register R when XOR R,R is encountered, and does not necessarily do so for other register zeroing instructions. This means the XOR instruction can be scheduled for immediate execution, and this is why Intel/AMD recommend using it. Commented Nov 12, 2015 at 10:41
  • 3
    @IraBaxter: Yup, and just to avoid any confusion (because I have seen this misconception on SO), mov reg, src also breaks dep chains for OO CPUs (regardless of src being imm32, [mem], or another register). This dependency-breaking doesn't get mentioned in optimization manuals because it's not a special case that only happens when src and dest are the same register. It always happens for instructions that don't depend on their dest. (except for Intel's implementation of popcnt/lzcnt/tzcnt having a false dep on the dest.) Commented Nov 12, 2015 at 11:15
  • 2
    @Zboson: The "latency" of an instruction with no dependencies only matters if there was a bubble in the pipeline. It's nice for mov-elimination, but for zeroing instructions the zero-latency benefit only comes into play after something like a branch mispredict or I$ miss, where execution is waiting for the decoded instructions, rather than for data to be ready. But yes, mov-elimination doesn't make mov free, only zero latency. The "not taking an execution port" part usually isn't important. Fused-domain throughput can easily be the bottleneck, esp. with loads or stores in the mix. Commented Nov 12, 2015 at 13:35
  • 2
    According to Agner KNL does not recognize Independence of 64-bit registers. So xor r64, r64 does not just waste a byte. As you say xor r32, r32 is the best choice especially with KNL. See section 15.7 "Special cases of independence" in this micrarch manual if you want to read more. Commented Dec 22, 2016 at 10:22
  • 4
    ah, where's good old MIPS, with its "zero register" when you need it. Commented Dec 29, 2017 at 0:24