Timeline for answer to "Fastest way to do horizontal SSE vector sum (or other reduction)" by Peter Cordes
Current License: CC BY-SA 4.0
Post Revisions
44 events
| when | event | action | by | license | comment |
|---|---|---|---|---|---|
| May 12, 2025 at 1:14 | comment | added | Maxim Egorushkin | | hadd instructions use one register, at the price of extra uops and latency. Simulating hadd with shuffle and add instructions costs multiple instructions and one extra scratch xmm/ymm register. Compute-heavy linear-algebra code is often short of xmm registers, so freeing up that extra scratch xmm/ymm register can force stack spills of other xmm registers. The stack spills cost much more than the one-uop, one-cycle advantage of shuffles and adds over hadd. Just my empirical observations. The extra cost of hadd is the price for not spilling an xmm register to the stack. (Both variants are sketched after the table.) |
| Jan 14, 2024 at 0:00 | comment | added | matanox | | Given that the SIMD pipeline only works on four floats at a time, and most answers to this type of question involve a certain length of SIMD operations, what scale of speed-up can you expect from a SIMD vs. a non-SIMD implementation, here and in similar cases? |
| Dec 23, 2023 at 19:19 | history | edited | Peter Cordes | CC BY-SA 4.0 | Annotate with types since the answer discusses multiple |
| Dec 17, 2021 at 5:29 | history | edited | Peter Cordes | CC BY-SA 4.0 | link signed bytes Q&A |
| Sep 27, 2021 at 19:00 | history | edited | Peter Cordes | CC BY-SA 4.0 | make it easier to see / find the 16-bit element link, and be clear about what you do after. |
| May 12, 2021 at 2:50 | history | edited | Peter Cordes | CC BY-SA 4.0 | Link canonicals for dot product of *arrays* |
| Jul 25, 2020 at 20:00 | history | bounty awarded | Sarfaraz Nawaz | | |
| Jul 24, 2020 at 15:04 | comment | added | Peter Cordes | | @Nawaz: It's generally equal, except when it defeats micro-fusion of an indexed addressing mode on Haswell and later (Micro fusion and addressing modes). But you only have AVX1, not AVX2, so it's pre-Haswell, and mulpd with an indexed addressing mode would unlaminate as well. (TL;DR: it depends.) If it's all compiler-generated, no hand-written asm, you at least don't have to worry about AVX / SSE transition stalls (Why is this SSE code 6 times slower without VZEROUPPER on Skylake?); see the vzeroupper sketch after the table. |
| Jul 24, 2020 at 15:00 | comment | added | Sarfaraz Nawaz | | Ah, you're right. I was looking at an older generated .s file. Haha. Now it does generate vmulpd instructions. However, it seems to be slow (or maybe my machine is just too heavily loaded right now). Do you think vmulpd is in general faster than mulpd? Or does it depend, so it can't be said without looking at the code? I'll post a question if I face any specific issue, though. |
| Jul 24, 2020 at 14:50 | comment | added | Peter Cordes | | @Nawaz: then -march=native should be using VEX versions of SSE instructions. GCC and clang both work that way. Post a question (not comments) if it's still happening after you double-check that you're actually passing that compiler option correctly, that you're looking at the correct output file, etc., i.e. that it's not just a problem in your build script. |
| Jul 24, 2020 at 14:48 | comment | added | Sarfaraz Nawaz | | "then your CPU doesn't support AVX" .. sysctl -a \| grep machdep.cpu.features \| rg -i avx lists this: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C ... which includes AVX1.0. Does that mean my CPU supports AVX? |
| Jul 24, 2020 at 14:47 | comment | added | Sarfaraz Nawaz | | Oops, fixed that. |
| Jul 24, 2020 at 14:46 | comment | added | Peter Cordes | | @Nawaz: That's literally what I just said: same instruction. And yes, the VEX encoding just adds a non-destructive destination. Except mulpd of course is *, not +. And yes, Intel's asm manuals are pretty clear. See also uops.info/table.html for performance info, and agner.org/optimize to understand what the numbers mean. (More links in stackoverflow.com/tags/x86/info.) |
| Jul 24, 2020 at 14:45 | comment | added | Sarfaraz Nawaz | | Hmm. Seems like vmulpd is not faster; it's just a different variant: mulpd stores the result in one of the operands itself, while vmulpd stores it in a different register. So mulpd is like a *= b and vmulpd is like c = a * b .. please let me know if my understanding is right? felixcloutier.com/x86/mulpd |
| Jul 24, 2020 at 14:44 | comment | added | Peter Cordes | | @Nawaz: no, it's just the VEX encoding of the same instruction, requiring AVX (see the encoding sketch after the table). If -march=native generates the old SSE encoding, then your CPU doesn't support AVX. Godbolt runs on Skylake-AVX512 servers, so -march=native there is -march=skylake-avx512. |
| Jul 24, 2020 at 14:40 | comment | added | Sarfaraz Nawaz | | @PeterCordes .. Wow, this answer. I'll spend my weekend fully understanding it. I'm trying to compute a dot product, compiling with clang 10 on my Mac (Mojave) with -O3 -march=native, but it does not generate instructions like vmulpd and vaddpd as here: godbolt.org/z/5j4bPq; instead it generates mulpd and addpd. Any idea/recommendation how to generate the former? I assume vmulpd is a faster instruction than mulpd? |
| Jun 20, 2020 at 9:12 | history | edited | CommunityBot | | Commonmark migration |
| Mar 21, 2020 at 5:15 | history | edited | Peter Cordes | CC BY-SA 4.0 | added 172 characters in body |
| Feb 13, 2020 at 4:55 | history | edited | Peter Cordes | CC BY-SA 4.0 | link related Q&As |
| Dec 4, 2017 at 22:54 | comment | added | Peter Cordes | | @jww: No. I wouldn't want to set anything in stone that I couldn't come back and edit if/when I realize my advice wasn't optimal after all. I did get an email once asking me if I wanted to be part of writing an asm book, but I never got back to them >.< Anyway, collecting up links to the more useful SO answers with "recipes" that I and others have written would be a good project, if I ever got around to it. |
| Dec 4, 2017 at 21:41 | comment | added | jww | | @PeterCordes - Out of curiosity, have you written any books on x86, assembly and intrinsics? I've been looking for a good book (with recipes) for several years now. The intrinsics are important because they are cross-platform: they work on Clang, GCC, MSVC, SunCC, etc. We can write them once and they run everywhere (unlike asm for GNU's GAS). |
| Jul 5, 2017 at 11:59 | vote | accept | FeepingCreature | | |
| Jul 1, 2017 at 9:01 | history | bounty awarded | Marcus Müller | | |
| May 23, 2017 at 11:54 | history | edited | URL Rewriter Bot | | replaced http://stackoverflow.com/ with https://stackoverflow.com/ |
| Mar 24, 2017 at 15:20 | comment | added | Peter Cordes | | @Royi: There are already a couple of sections in my answer that discuss the fact that _mm_hadd_ps is slow. |
| Feb 14, 2017 at 2:29 | history | edited | Yuhong Bao | CC BY-SA 3.0 | later K8s do have SSE3 |
| Feb 6, 2017 at 19:56 | comment | added | Royi | | @PeterCordes, how does your SSE3 solution compare to @PaulR's solution - v = _mm_hadd_ps(v, v); v = _mm_hadd_ps(v, v);? Thank you. |
| Dec 5, 2016 at 14:03 | history | edited | Peter Cordes | CC BY-SA 3.0 | dummy args to reduce MOVs. Fix swapped elements in a comment |
| Dec 5, 2016 at 13:21 | comment | added | arrowd | | @PeterCordes Thank you for a great answer. Don't you have a typo in the SSE1 (aka SSE) section, at the line _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); // [ C D \| B A ]? I guess you meant [ C D \| A B ]? (See the SSE1 sketch after the table for the element ordering.) |
| Dec 1, 2016 at 3:22 | comment | added | plasmacel | | Thanks for all the provided info here. It's time to detect the __AVX512F__ macro. :) (See the AVX-512 sketch after the table.) |
| Nov 30, 2016 at 16:00 | comment | added | Peter Cordes | | @plasmacel: fun fact: on Knights Landing (Xeon Phi = modified Silvermont + AVX512), VPERMILPS (3c latency, 1c reciprocal throughput) is more efficient than VSHUFPS (4c latency, 2c reciprocal throughput), which does outweigh the instruction-length difference for that architecture. I assume that's from being a 1-input shuffle vs. a 2-input one. Agner Fog updated his stuff for KNL. :) |
| Nov 29, 2016 at 5:33 | comment | added | Peter Cordes | | @plasmacel: As my answer points out, my SSE3 version compiles optimally with AVX, but clang pessimises it to VPERMILPD: godbolt.org/g/ZH88wH. gcc's version is four 4-byte instructions (not counting the RET); clang's version is 2 bytes longer, and the same speed. What makes you think VPERMILPS is a win over SHUFPS? AFAIK, clang is wrong to favour it for immediate shuffles where the source is already in a register. Agner Fog's tables show no difference. It's useful for load+shuffle and for variable shuffles, and maybe easier for compilers since it's a 1-input instruction, but it's not faster. |
| Nov 29, 2016 at 5:30 | comment | added | Peter Cordes | | @plasmacel: no, unless your vector was in memory to start with, since VPERMILPS can load+shuffle. You get smaller code-size from using the AVX versions of older instructions, because you don't need an immediate, and they only need the 2-byte VEX prefix (C5 .. instead of C4 .. ..). Two-source shuffles like VSHUFPS and VMOVHLPS aren't any slower than one-source shuffles like VPSHUFD or VPERMILPS. If there's a difference in energy consumption, it's probably negligible. |
| Nov 29, 2016 at 5:29 | comment | added | plasmacel | | Is it a win to use vpermilps instead of movsldup, movshdup, movhlps and movlhps when AVX is available? It is a win over shufps, and it looks like clang also tries to emit it instead of the mentioned ones. |
| Nov 29, 2016 at 4:50 | comment | added | Peter Cordes | | If you have a specific microarchitecture and compiler in mind, you can and should make a version that's more optimal for that. This answer tries to be optimal (latency, throughput and code-size) for modern CPUs like Haswell, while sucking as little as possible on old CPUs. i.e. my SSE1 / SSE2 versions don't do anything that's worse on Haswell just to run faster on an old SlowShuffle CPU like Merom. For Merom, PSHUFD might be a win because it and SHUFPS both run in the flt->int domain. |
| Nov 29, 2016 at 4:46 | comment | added | Peter Cordes | | @plasmacel: keep in mind that these functions really need to inline to be useful. And yes, clang pessimizes the shuffles sometimes. That's especially bad for first-gen Core 2 and other slow-shuffle CPUs where SHUFPS is far worse than MOVHLPS :( If you enable SSE3 (godbolt.org/g/1qbNXw), though, it can use MOVSHDUP for Kornel's first shuffle, which is excellent. Anyway, if you are using clang, use whatever happens to coax clang into making nice asm after inlining. You could even write a version which takes a dummy arg to use as a target for movhlps (when AVX isn't available). |
| Nov 29, 2016 at 4:40 | comment | added | Peter Cordes | | @plasmacel: on many CPUs, including Intel SnB-family, there's extra bypass-delay latency to forward the result of an FP instruction to an integer shuffle, and from PSHUFD to ADDPS. It's great if you care about throughput and uop count but not latency. (SHUFPS between integer instructions has no penalty on SnB-family (unlike Nehalem), but the reverse is not true.) |
| Nov 29, 2016 at 4:32 | comment | added | plasmacel | | With SSE2, the remaining movaps before the shufps can also be eliminated if you use pshufd, by changing _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); to _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(v), _MM_SHUFFLE(2, 3, 0, 1)));. However, that may add some latency. godbolt.org/g/0trqRY |
| Jun 3, 2016 at 19:07 | history | edited | Peter Cordes | CC BY-SA 3.0 | link my VCL github repo |
| May 27, 2016 at 15:10 | history | edited | Peter Cordes | CC BY-SA 3.0 | tidy up: Talk about CPUs with slow shuffles all in one place. Fix the broken godbolt link. |
| Feb 10, 2016 at 16:39 | vote | accept | FeepingCreature | | removed Jul 5, 2017 at 11:48 |
| Feb 9, 2016 at 9:19 | history | edited | Peter Cordes | CC BY-SA 3.0 | total code size (disk fetch) matters mostly for compiler autovec. |
| Feb 8, 2016 at 13:06 | history | edited | Peter Cordes | CC BY-SA 3.0 | added 2 characters in body |
| Feb 8, 2016 at 12:46 | history | answered | Peter Cordes | CC BY-SA 3.0 | |
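
The hadd-vs-shuffle trade-off discussed in the May 12, 2025 and Feb/Mar 2017 comments is between the following two horizontal-sum shapes. This is a minimal sketch following the structure the answer and comments describe, not the answer's exact code, and the function names are mine:

```c
#include <immintrin.h>

// SSE3 shuffle+add version: single-uop shuffles, but needs a scratch register.
// Comments list elements low-first: v = [ a b c d ], with a in element 0.
static inline float hsum_ps_sse3(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);   // [ b b d d ]
    __m128 sums = _mm_add_ps(v, shuf);  // [ a+b b+b c+d d+d ]
    shuf = _mm_movehl_ps(shuf, sums);   // low half = high half of sums: [ c+d d+d . . ]
    sums = _mm_add_ss(sums, shuf);      // element 0 = a+b+c+d
    return _mm_cvtss_f32(sums);
}

// Two-hadd version (Paul R's approach referenced in the Feb 2017 comment):
// fewer instructions and no scratch register, but each hadd decodes to
// multiple uops (typically 2 shuffles + 1 add on Intel).
static inline float hsum_ps_hadd(__m128 v) {
    v = _mm_hadd_ps(v, v);  // [ a+b c+d a+b c+d ]
    v = _mm_hadd_ps(v, v);  // [ a+b+c+d . . . ]
    return _mm_cvtss_f32(v);
}
```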
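
The Dec 5, 2016 element-ordering question and the Nov 29, 2016 pshufd suggestion both concern the SSE1/SSE2 version. A sketch with the ordering spelled out low-element-first (the answer's own comments use high-to-low [ high | low ] notation, which is what the typo report was about); the function name is mine, and plasmacel's integer-shuffle variant is shown as a commented-out alternative:

```c
#include <emmintrin.h>  // SSE2; the plain-SSE1 path needs only xmmintrin.h

// v = [ a b c d ], with a in element 0. In high-to-low notation the shuffle
// result below reads [ C D | A B ], the ordering arrowd's comment fixed.
static inline float hsum_ps_sse1(__m128 v) {
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); // [ b a d c ]
    // plasmacel's SSE2 alternative, saving a movaps at the cost of possible
    // FP->int bypass latency (see the Nov 29, 2016 comments):
    // __m128 shuf = _mm_castsi128_ps(
    //     _mm_shuffle_epi32(_mm_castps_si128(v), _MM_SHUFFLE(2, 3, 0, 1)));
    __m128 sums = _mm_add_ps(v, shuf);  // [ a+b a+b c+d c+d ]
    shuf = _mm_movehl_ps(shuf, sums);   // [ c+d c+d . . ]
    sums = _mm_add_ss(sums, shuf);      // element 0 = a+b+c+d
    return _mm_cvtss_f32(sums);
}
```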
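
The Jul 24, 2020 thread is about the legacy-SSE vs. VEX encoding of the same multiply instruction. A minimal sketch; the exact register allocation is up to the compiler, so the asm in the comments is illustrative:

```c
#include <emmintrin.h>

// One intrinsic, two encodings, chosen by compiler flags:
//   without -mavx:  mulpd  xmm0, xmm1        ; destructive:     a *= b
//   with    -mavx:  vmulpd xmm0, xmm0, xmm1  ; non-destructive: c = a * b
// Same execution units, same per-instruction speed; VEX just allows a separate
// destination, avoiding a movaps when both inputs must be preserved.
__m128d mul_pd(__m128d a, __m128d b) {
    return _mm_mul_pd(a, b);
}
```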
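
On the AVX/SSE transition stalls mentioned in the Jul 24, 2020 at 15:04 comment: compilers insert vzeroupper automatically where needed, so the explicit intrinsic below is only a sketch of what that protection looks like, mainly relevant to hand-written asm; the function is a made-up example:

```c
#include <immintrin.h>  // compile with -mavx

void double_floats_avx(float *dst, const float *src) {
    __m256 v = _mm256_loadu_ps(src);
    v = _mm256_add_ps(v, v);
    _mm256_storeu_ps(dst, v);
    // Zero the upper halves of the YMM registers before any legacy-SSE code
    // runs, avoiding the stalls described in "Why is this SSE code 6 times
    // slower without VZEROUPPER on Skylake?".
    _mm256_zeroupper();
}
```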
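
On plasmacel's Dec 1, 2016 remark about detecting __AVX512F__: gcc and clang provide a composite reduction intrinsic for 512-bit vectors, so a minimal detection sketch (assuming one of those compilers) looks like this; the function name is mine:

```c
#include <immintrin.h>

#ifdef __AVX512F__
// _mm512_reduce_add_ps is a composite intrinsic: the compiler emits a
// sequence of extracts/shuffles and adds, not a single instruction.
float hsum_zmm(__m512 v) {
    return _mm512_reduce_add_ps(v);
}
#endif
```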