59 events
when what by license comment
Apr 23 at 14:40 history edited Peter Cordes CC BY-SA 4.0
Mention early on that Silvermont doesn't recognize SUB, only XOR.
Aug 21, 2023 at 0:38 history edited Peter Cordes CC BY-SA 4.0
Finally APX solves the problem of efficiently materializing a boolean integer.
Aug 21, 2023 at 0:31 comment added Peter Cordes @ecm: Correct on all counts, and yes there's no way to fix x86-32 after 386 made the mistake(?) of providing setcc r/m8 instead of setcc r/m16/32 (storing a bool to memory is rare and can be easily done with a temp register). setcc r/m16 wouldn't be very valuable, so APX providing setcc.zu full_reg via an EVEX for the same opcodes solves the same problem for the only important case, at the cost of larger code-size than my proposed way. It would've been nice if the REX2 encoding defaulted to zero-extend so a full 4-byte EVEX prefix would only be needed to not do that with r16-r31.
Aug 20, 2023 at 20:16 comment added ecm "It would have been nice if x86-64 repurposed one of the removed opcodes (like AAM) for a 16/32/64 bit setcc r/m, with the predicate encoded in the source-register 3-bit field of the r/m field (the way some other single-operand instructions use them as opcode bits). But they didn't do that, and that wouldn't help for x86-32 anyway." I think the new APX extensions will provide a setcc r64, right?
May 23, 2023 at 14:47 history edited Peter Cordes CC BY-SA 4.0
Correct misinformation about partial-register stuff and movzx dword, byte mov-elim: it is a thing on CPUs after IvB.
May 18, 2023 at 21:22 comment added Peter Cordes @l4m2: True, but I think this answer is already long and dense enough just talking about registers. :/ and dword mem, 0 is also slower, with a false dependency on the old value of memory, so usually only useful for code-size optimization at the expense of speed. Post in Tips for golfing in x86 machine code if it's not there already. (Coding style: I think it makes a lot more sense to put the size override on the memory operand, not the immediate, especially to distinguish from things like and eax, strict dword 0 to request an imm32 encoding.)
May 18, 2023 at 21:18 comment added l4m2 and mem, dword 0 is 3 bytes shorter than mov mem, dword 0 and 1 byte shorter than xor eax, eax/mov mem, eax, which may be useful to fit in a small area
May 18, 2023 at 20:36 comment added Peter Cordes @l4m2: Good point, I should update the example to test for a condition like x == 4 or x & 0x88 that doesn't allow optimization to get the condition in CF for adc or sbb
May 18, 2023 at 15:48 comment added l4m2 cmp eax, 1; sbb ebx, -1 is enough
May 18, 2023 at 14:47 comment added l4m2 To increase ebx by 1 iff eax is not zero, is mov ecx, eax; add ecx, -1; adc ebx, 0 shorter and faster?
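Editor's note on the two sequences l4m2 proposes above: the following is a minimal Python sketch that models 32-bit ADD/ADC and CMP/SBB carry behaviour and checks that both sequences increment ebx exactly when eax is non-zero. It is not real assembly and says nothing about size or speed. The helper names (`add32`, `sub32`, `inc_if_nonzero_adc`, `inc_if_nonzero_sbb`) are made up for illustration.

```python
MASK = 0xFFFFFFFF  # model 32-bit registers


def add32(a, b, carry_in=0):
    """Model x86 ADD/ADC on 32-bit values: return (result, CF)."""
    total = (a & MASK) + (b & MASK) + carry_in
    return total & MASK, total >> 32


def sub32(a, b, borrow_in=0):
    """Model x86 SUB/SBB/CMP on 32-bit values: return (result, CF as borrow)."""
    diff = (a & MASK) - (b & MASK) - borrow_in
    return diff & MASK, 1 if diff < 0 else 0


def inc_if_nonzero_adc(eax, ebx):
    # mov ecx, eax ; add ecx, -1 ; adc ebx, 0
    _, cf = add32(eax, -1)          # CF = 1 iff eax != 0
    ebx, _ = add32(ebx, 0, cf)      # ebx += CF
    return ebx


def inc_if_nonzero_sbb(eax, ebx):
    # cmp eax, 1 ; sbb ebx, -1
    _, cf = sub32(eax, 1)           # CF = 1 iff eax == 0 (unsigned eax < 1)
    ebx, _ = sub32(ebx, -1, cf)     # ebx - (-1) - CF = ebx + 1 - CF
    return ebx


if __name__ == "__main__":
    for eax in (0, 1, 2, 0x80000000, 0xFFFFFFFF):
        print(hex(eax), inc_if_nonzero_adc(eax, 5), inc_if_nonzero_sbb(eax, 5))
```

Both tricks work only because "eax != 0" can be turned into the carry flag by a single ADD or CMP. That is the point of the reply about switching the example to a condition like x == 4 or x & 0x88, where the condition does not land in CF.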
Feb 22, 2021 at 13:37 history edited Peter Cordes CC BY-SA 4.0
phrasing to account for scroll bar / width of code block: "everywhere else" was confusingly truncated to "everywhere". Also pick CL to show that it's not just AL where mov reg8, imm8 is a 2-byte instruction.
May 23, 2020 at 16:06 history edited Peter Cordes CC BY-SA 4.0
section separator for the P3 stuff.
Oct 24, 2019 at 2:53 history edited Peter Cordes CC BY-SA 4.0
include more sub-optimal examples
Oct 24, 2019 at 2:31 history edited Peter Cordes CC BY-SA 4.0
include more sub-optimal examples
Sep 27, 2019 at 20:44 comment added Peter Cordes @ecm: I think the context for that "both" got lost in previous edits; fixed. Thanks for pointing it out.
Sep 27, 2019 at 20:42 history edited Peter Cordes CC BY-SA 4.0
the context for "both" got separated by too much text at some point.
Sep 27, 2019 at 20:37 comment added ecm Thanks! Will you edit your answer to make that part clearer?
Sep 27, 2019 at 20:32 comment added Peter Cordes @ecm: Yes, zero the reg and break the dependency on the old value with mov reg, 0, then zero it again and set the internal EAX = AX = AL tag bit (avoiding partial-register stalls) with xor reg,reg. i.e. the "upper bits zeroed" tag. IDK why xor-zeroing didn't break the dependency; maybe recognizing that in the decoders or issue stage would have taken extra logic vs. handling it only in the boolean logic execution unit? But AFAIK it only worked with xor same,same, not just any xor that happened to produce 0, and the ALU doesn't know if its inputs came from a reg, mem, or immed.
Sep 27, 2019 at 19:34 comment added ecm "so in some cases it was worth using both." I am unclear on what the other one is meant to be here, aside from the xor-zeroing. Is it mov with immediate? And would you do them in the order mov reg, 0 \ xor reg, reg, or the other way around?
Sep 26, 2019 at 7:40 history edited Peter Cordes CC BY-SA 4.0
another example of what not to do. Add a table of vector examples and link related Qs about zeroing them.
Jan 6, 2019 at 3:48 history edited Peter Cordes CC BY-SA 4.0
added 396 characters in body
Dec 16, 2018 at 13:49 history edited user784668 CC BY-SA 4.0
nickname typo
Dec 16, 2018 at 11:29 comment added Peter Cordes @Fanael: Thanks, I should have updated this a while ago. I checked on a Katmai PIII and found it wasn't dep-breaking a while ago, but never finished editing an update. Made one now to fix the two major things I left out.
Dec 16, 2018 at 11:28 history edited Peter Cordes CC BY-SA 4.0
P6-family had non-dep-breaking xor until at least P-M. And mention the Silvermont reason for r32 not r64.
Dec 16, 2018 at 10:48 comment added user784668 …and on those CPUs the results are unambiguous: 50 cycles for each 22 instructions (i.e. one iteration), indicating a clear dependency chain; compare 10 cycles for each 22 instructions on more modern CPUs where xor is dependency breaking. So it's clear that Agner is correct here in that xor is not dependency breaking on Pentium II/III and Pentium M. It may have changed in Yonah, the last generation of Pentium M sold as Core Solo and Core Duo (note, not Core 2), but I don't have that hardware to test.
Dec 16, 2018 at 10:43 comment added user784668 @PeterCordes "See Agner Fog's Example 6.17. in his microarch pdf. He claims this also applies to P2, P3, and even (early?) PM, but I'm sceptical of that. A comment on the linked blog post says it was only PPro that had this oversight. It seems really unlikely that multiple generations of the P6 family existed without recognizing xor-zeroing as a dep breaker." I tested it on my Tualatin Pentium III and Dothan Pentium M by looping on 10× imul eax, eax/xor eax, eax, with the reasoning that if xor is dep-breaking then the loop will be throughput bound and latency bound if it's not…
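Editor's note on the measurement in the two comments above: a rough back-of-the-envelope model, sketched in Python, of why 50 versus 10 cycles per iteration separates the two cases. The numbers fed in are assumptions, not measurements from this thread: imul latency of 4 cycles and xor latency of 1 cycle on the older P6-family cores, and one imul issued per cycle when the chain is broken. The function name `cycles_per_iteration` is hypothetical.

```python
def cycles_per_iteration(pairs, imul_latency, xor_latency,
                         xor_breaks_dep, imuls_per_cycle=1):
    """Estimate cycles for one loop iteration of `pairs` x (imul eax,eax ; xor eax,eax).

    If xor does NOT break the dependency, every instruction waits for the
    previous one, so the iteration costs the sum of all latencies.
    If xor DOES break the dependency, each imul starts a fresh chain and the
    loop is limited by how many imuls can start per cycle (assumed here).
    """
    if xor_breaks_dep:
        return pairs / imuls_per_cycle
    return pairs * (imul_latency + xor_latency)


if __name__ == "__main__":
    # Assumed latencies: imul = 4 cycles, xor = 1 cycle.
    print(cycles_per_iteration(10, 4, 1, xor_breaks_dep=False))
    # Dependency broken: bound by the assumed one-imul-per-cycle throughput.
    print(cycles_per_iteration(10, 4, 1, xor_breaks_dep=True))
```

Under those assumptions the latency-bound case comes to 10 × (4 + 1) = 50 cycles, which matches the reported Pentium III and Pentium M figure. The throughput-bound case comes to 10, which matches the figure reported for CPUs where xor is dependency-breaking. The model ignores the loop overhead instructions and front-end limits.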
Oct 19, 2018 at 0:44 comment added Evan Carroll I totally lose you when you hit on setcc. What does that have to do with zeroing? How does that add to the xor-idiom?
Dec 29, 2017 at 19:11 comment added Peter Cordes @WesTurner: Not in a regular x86 CPU using CMOS logic, like Intel Sandybridge-family. en.wikipedia.org/wiki/CMOS#Power:_switching_and_leakage. Running zeroing-idiom instructions uses about the same amount of power as running NOP instructions on SnB-family, cheaper than mov-immediate or a regular XOR, and probably less than any other instruction other than pause. But still much more than sitting in low-power sleep. Modern digital logic is very far from the information-theoretic limits of energy per computation, and nothing they do internally has negative cost.
Dec 29, 2017 at 5:29 comment added Wes Turner Is there a way to do this with negative entropy (less heat) when the register value is known? "The thermodynamic meaning of negative entropy" arxiv-vanity.com/papers/1009.1630
Dec 29, 2017 at 0:24 comment added hayalci ah, where's good old MIPS, with its "zero register" when you need it.
Jul 26, 2017 at 4:57 comment added Peter Cordes @BeeOnRope: Yeah exactly. I'm not sure how "canned" some of its idioms are, but I imagine it's easier for a compiler to deal with setcc/movzx as a single thing that stays together, vs. having to add stuff to the function's internal representation to express "ok, we need an xor-zeroed register before the flag-setting", and maybe have a fallback in case that's hard to do. (Although in most cases you'd expect it would just end up redoing a test or cmp).
Jul 26, 2017 at 4:47 comment added BeeOnRope Yeah, perhaps there is no feedback mechanism for the register pressure: it uses some "typical" tradeoff while generating code, and then when it gets to the end of a function it doesn't go back and relax the assumption that there might be register pressure.
Jul 26, 2017 at 4:22 comment added Peter Cordes @BeeOnRope: yeah, gcc does silly stuff like that sometimes. I wonder if it comes out of trying to save a register for cases where the cmp can't be deferred. (e.g. if it wanted the result in esi). In other cases of gcc making worse code (like setcc/movzx instead of xor/setcc), it usually looks like an idiom designed to reduce register pressure, used even when there isn't any.
Jul 26, 2017 at 4:14 comment added BeeOnRope Here's an example of even newest gcc somewhat stupidly issuing a mov reg, 0 rather than an xor for a simple function. Sure, it is probably doing that because it needs the flags preserved from the earlier cmp, but it could have just swapped the order! clang does fine, and icc also uses an xor but only gets part marks because it pointlessly includes a mov esi, esi in the critical path.
May 23, 2017 at 12:34 history edited URL Rewriter Bot
replaced http://stackoverflow.com/ with https://stackoverflow.com/
Dec 22, 2016 at 11:35 comment added Z boson Okay, I understand what you mean now.
Dec 22, 2016 at 10:57 comment added Peter Cordes @Zboson: I meant that code I tested/tuned on SKL might fail to avoid performance pitfalls for HSW or SnB, regardless of instruction-set choice. The question isn't how to make different versions for HSW and SKL, it's how to tune the HSW version without testing on pre-SKL hardware.
Dec 22, 2016 at 10:55 comment added Z boson I would look into Function Multiversioning (FMV) with GCC if you want to optimize for SKL as well as other archs. See phoronix.com/forums/forum/software/distributions/…
Dec 22, 2016 at 10:44 comment added Peter Cordes @Zboson: yeah, front-end bubbles should be a lot rarer with the increased uop-cache read bandwidth and increased legacy-decode throughput. (But the 4-uop/clock frontend max is still the same). More instructions running on more ports is potentially very cool. If I want to tune code to be good on recent Intel before SKL, I'm worried that SKL will avoid bottlenecks that HSW/BDW still have :/ I considered buying worse hardware :P We should take this to chat if you have any more to say. (I don't ATM). But I'd have to look how to create a chat when the comment thing doesn't offer the option.
Dec 22, 2016 at 10:40 comment added Z boson I am getting a Skylake system soon as well. Boring! Well not if you are coming from Core2Duo. But now that I have access to KNL. I was highly disappointed in Skylake due to the lack of AVX512. So much that I quit thinking about x86 SIMD for a while. I know that internally Skylake changes the pipeline over Broadwell significantly (so I hear) even if the instruction set does not change, but still.
Dec 22, 2016 at 10:35 comment added Peter Cordes @Zboson: I already started working on a KNL update for this answer after seeing that in Agner's update: need to point out that r32 is important even when it doesn't save a REX prefix (xor r8d,r8d). But then I got side-tracked setting up my new Skylake i7-6700k desktop with 16G of DDR4-2666 RAM :) Bit of an upgrade from 65nm Core2Duo, e.g. 8x faster video encoding with x264 (which has good tuning/optimization for Core2, unlike x265). I'll get back to that edit Real Soon Now, since I still have the text saved.
Dec 22, 2016 at 10:22 comment added Z boson According to Agner, KNL does not recognize independence of 64-bit registers. So xor r64, r64 does not just waste a byte. As you say, xor r32, r32 is the best choice, especially with KNL. See section 15.7 "Special cases of independence" in this microarch manual if you want to read more.
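Editor's note on the "waste a byte" point in the exchange above: a small Python table of the machine-code bytes for a few zeroing choices, so the size differences can be seen directly. The byte sequences are the usual 31 /r encodings of XOR and the B8+rd encoding of MOV. An assembler may pick the equivalent 33 /r form of XOR instead, which has the same length. The names `ENCODINGS` and `size` are just for this sketch.

```python
# Machine-code bytes for common ways of zeroing a register in 64-bit mode.
ENCODINGS = {
    "xor eax, eax": bytes([0x31, 0xC0]),              # no prefix needed
    "xor rax, rax": bytes([0x48, 0x31, 0xC0]),        # REX.W: one byte larger, same result
    "xor r8d, r8d": bytes([0x45, 0x31, 0xC0]),        # REX.R+REX.B needed just to name r8
    "xor r8, r8":   bytes([0x4D, 0x31, 0xC0]),        # REX.W+R+B: same size as r8d
    "mov eax, 0":   bytes([0xB8, 0x00, 0x00, 0x00, 0x00]),  # opcode + imm32
}


def size(mnemonic):
    """Return the encoded length in bytes of one of the listed instructions."""
    return len(ENCODINGS[mnemonic])


if __name__ == "__main__":
    for text, code in ENCODINGS.items():
        print(f"{text:14s} {code.hex(' '):16s} {len(code)} bytes")
```

For r8 through r15 the 32-bit and 64-bit forms are the same length, so code size gives no reason to prefer xor r8d, r8d. That is why the reply stresses that r32 matters even when it doesn't save a REX prefix: the reason there is the dependency-breaking behaviour described for KNL and Silvermont.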
Oct 27, 2016 at 4:43 history edited Peter Cordes CC BY-SA 3.0
vector zeroing, don't focus on the bypass delay as much as the execution ports.
Sep 19, 2016 at 7:06 history edited Peter Cordes CC BY-SA 3.0
Some of Bruce Dawson's other blog posts are excellent. Be less critical about this one article.
Jan 19, 2016 at 5:19 history edited Peter Cordes CC BY-SA 3.0
xor-zeroing is probably a dep breaker on P2/P3/PM
Jan 19, 2016 at 5:02 history edited Peter Cordes CC BY-SA 3.0
re-order sections, and summary table. Rewrite a lot of the setcc section. Take out the speculation about `mov r32, imm32` noticing immediates with only the low byte set, since that's unlikely.
Dec 12, 2015 at 1:22 history edited Peter Cordes CC BY-SA 3.0
added 2123 characters in body
Dec 3, 2015 at 18:16 history edited Peter Cordes CC BY-SA 3.0
added 158 characters in body
Nov 26, 2015 at 8:49 history edited Peter Cordes CC BY-SA 3.0
`mov` doesn't affect flags.
Nov 26, 2015 at 8:07 history edited Peter Cordes CC BY-SA 3.0
explain why it only matters for Nehalem in the tl;dr
Nov 23, 2015 at 10:32 audit First posts
Nov 23, 2015 at 11:02
Nov 12, 2015 at 19:33 vote accept balajimc55
Nov 12, 2015 at 15:41 history edited Peter Cordes CC BY-SA 3.0
vector regs
Nov 12, 2015 at 13:35 comment added Peter Cordes @Zboson: The "latency" of an instruction with no dependencies only matters if there was a bubble in the pipeline. It's nice for mov-elimination, but for zeroing instructions the zero-latency benefit only comes into play after something like a branch mispredict or I$ miss, where execution is waiting for the decoded instructions, rather than for data to be ready. But yes, mov-elimination doesn't make mov free, only zero latency. The "not taking an execution port" part usually isn't important. Fused-domain throughput can easily be the bottleneck, esp. with loads or stores in the mix.
Nov 12, 2015 at 11:15 comment added Peter Cordes @IraBaxter: Yup, and just to avoid any confusion (because I have seen this misconception on SO), mov reg, src also breaks dep chains for OO CPUs (regardless of src being imm32, [mem], or another register). This dependency-breaking doesn't get mentioned in optimization manuals because it's not a special case that only happens when src and dest are the same register. It always happens for instructions that don't depend on their dest. (except for Intel's implementation of popcnt/lzcnt/tzcnt having a false dep on the dest.)
Nov 12, 2015 at 10:41 comment added Ira Baxter Most arithmetic instructions OP R,S are forced by an out of order CPU to wait for the content of register R to be filled by previous instructions with register R as a target; this is a data dependency. The key point is that Intel/AMD chips have special hardware to break must-wait-for-data-dependencies on register R when XOR R,R is encountered, and do not necessarily do so for other register zeroing instructions. This means the XOR instruction can be scheduled for immediate execution, and this is why Intel/AMD recommend using it.
Nov 12, 2015 at 10:12 comment added Z boson Interesting. So it's not really 100% free. I mean even though it does not use a port it still costs a micro-op. That's a subtlety I missed in Agner's manual. Thanks! So it has zero latency but throughput is 4 per clock (or 0.25 cycles reciprocal throughput).
Nov 12, 2015 at 9:42 history edited Peter Cordes CC BY-SA 3.0
summary
Nov 12, 2015 at 9:37 history answered Peter Cordes CC BY-SA 3.0