9

Most arithmetic instructions OP R,S are forced by an out-of-order CPU to wait until register R has been filled by the previous instruction that has R as its target; this is a data dependency. The key point is that Intel/AMD chips have special hardware that breaks this wait-for-data dependency on register R when XOR R,R is encountered, and they do not necessarily do so for other register-zeroing instructions. This means the XOR instruction can be scheduled for immediate execution, which is why Intel/AMD recommend using it.

– Ira Baxter

2015-11-12 10:41:05 +00:00
Commented Nov 12, 2015 at 10:41
3

@IraBaxter: Yup, and just to avoid any confusion (because I have seen this misconception on SO), mov reg, src also breaks dep chains on out-of-order CPUs (regardless of src being imm32, [mem], or another register). This dependency-breaking doesn't get mentioned in optimization manuals because it's not a special case that only happens when src and dest are the same register. It always happens for instructions that don't depend on their dest (except that Intel's implementations of popcnt/lzcnt/tzcnt have a false dep on the dest).

– Peter Cordes

2015-11-12 11:15:14 +00:00
Commented Nov 12, 2015 at 11:15
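
As an editorial illustration of the two comments above, here is a minimal sketch of a toy dataflow model in Python. It is not how any real CPU is implemented: the function name `result_ready_cycles`, the instruction tuples, and the latencies are all hypothetical, and the model ignores issue width, execution ports, and the fact that some CPUs handle zeroing idioms at register rename. It only shows why an instruction that does not read its destination can start before the previous writer of that register has finished.

```python
def result_ready_cycles(program, recognize_zeroing_idiom=True):
    """Return, per instruction, the cycle at which its result is ready.

    Toy model: an instruction starts as soon as every register it reads
    is ready; there are no limits on issue width or execution ports.
    """
    ready = {}   # register name -> cycle at which its newest value is ready
    cycles = []
    for op, dest, src, latency in program:
        if op == "mov":
            reads = [src]            # mov overwrites dest without reading it
        elif op == "xor" and src == dest and recognize_zeroing_idiom:
            reads = []               # xor r,r is always 0: old value not needed
        else:
            reads = [src, dest]      # typical OP R,S reads both R and S
        start = max((ready.get(reg, 0) for reg in reads), default=0)
        ready[dest] = start + latency
        cycles.append(ready[dest])
    return cycles


# Hypothetical sequence; the latencies are illustrative, not measured.
program = [
    ("imul", "eax", "ecx", 3),   # slow producer that writes eax
    ("xor",  "eax", "eax", 1),   # zeroing idiom
    ("add",  "eax", "edx", 1),   # consumes the zeroed eax
]

with_idiom = result_ready_cycles(program)
without_idiom = result_ready_cycles(program, recognize_zeroing_idiom=False)
print(with_idiom, without_idiom)
```

In this model the `add` result is ready at cycle 2 when the idiom is recognized and at cycle 5 when it is not, because the `xor` no longer waits for the `imul`.
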
2

@Zboson: The "latency" of an instruction with no dependencies only matters if there was a bubble in the pipeline. It's nice for mov-elimination, but for zeroing instructions the zero-latency benefit only comes into play after something like a branch mispredict or I$ miss, where execution is waiting for the decoded instructions, rather than for data to be ready. But yes, mov-elimination doesn't make mov free, only zero latency. The "not taking an execution port" part usually isn't important. Fused-domain throughput can easily be the bottleneck, esp. with loads or stores in the mix.

– Peter Cordes

2015-11-12 13:35:52 +00:00
Commented Nov 12, 2015 at 13:35
2

According to Agner, KNL does not recognize the independence of 64-bit registers, so xor r64, r64 does not just waste a byte. As you say, xor r32, r32 is the best choice, especially on KNL. See section 15.7, "Special cases of independence", in this microarch manual if you want to read more.

– Z boson

2016-12-22 10:22:38 +00:00
Commented Dec 22, 2016 at 10:22
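
As an editorial aside on the "waste a byte" remark above, the following Python sketch simply lists one valid x86-64 encoding for each of three ways to zero a register and compares their lengths. The dictionary names are illustrative only. Writing a 32-bit register zero-extends into the full 64-bit register, so the 2-byte form clears all of rax.

```python
# One valid x86-64 machine-code encoding for each zeroing form.
encodings = {
    "xor eax, eax": bytes([0x31, 0xC0]),                    # no prefix needed
    "xor rax, rax": bytes([0x48, 0x31, 0xC0]),              # REX.W prefix adds a byte
    "mov eax, 0":   bytes([0xB8, 0x00, 0x00, 0x00, 0x00]),  # opcode + imm32
}

sizes = {insn: len(code) for insn, code in encodings.items()}
for insn, size in sizes.items():
    print(f"{insn}: {size} bytes")
```
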
4

Ah, where's good old MIPS, with its "zero register", when you need it?

– hayalci

2017-12-29 00:24:53 +00:00
Commented Dec 29, 2017 at 0:24
