Timeline for answer to "Fastest way to do horizontal SSE vector sum (or other reduction)" by Peter Cordes
Current License: CC BY-SA 4.0
Post Revisions
44 events
| when | event | action | by | license | comment |
|---|---|---|---|---|---|
| May 12, 2025 at 1:14 | comment | added | Maxim Egorushkin | | hadd instructions use one register, at the price of extra uops and latency. Simulating hadd with shuffle and add instructions costs multiple instructions and one extra scratch xmm/ymm register. Compute-heavy linear-algebra code is often short of xmm registers, so freeing up that extra scratch xmm/ymm register can force stack spills of other xmm registers. The stack spills cost much more than the one-uop, one-cycle advantage of shuffles and adds over hadd. Just my empirical observations. The extra cost of hadd is the price for not spilling an xmm register to the stack. (Both variants are sketched after the table.) |
| Jan 14, 2024 at 0:00 | comment | added | matanox | | Given that the SIMD pipeline only works on four floats at a time, and most answers to this type of question involve a certain length of SIMD operations, what scale of speed-up can you expect from a SIMD vs. a non-SIMD implementation, here and in similar cases? |
| Dec 23, 2023 at 19:19 | history | edited | Peter Cordes | CC BY-SA 4.0 | Annotate with types since the answer discusses multiple |
| Dec 17, 2021 at 5:29 | history | edited | Peter Cordes | CC BY-SA 4.0 | link signed bytes Q&A |
| Sep 27, 2021 at 19:00 | history | edited | Peter Cordes | CC BY-SA 4.0 | make it easier to see / find the 16-bit element link, and be clear about what you do after. |
| May 12, 2021 at 2:50 | history | edited | Peter Cordes | CC BY-SA 4.0 | Link canonicals for dot product of *arrays* |
| Jul 25, 2020 at 20:00 | history | bounty awarded | Sarfaraz Nawaz | | |
| Jul 24, 2020 at 15:04 | comment | added | Peter Cordes | | @Nawaz: It's generally equal, except when it defeats micro-fusion of an indexed addressing mode on Haswell and later (Micro fusion and addressing modes). But you only have AVX1, not AVX2, so it's pre-Haswell, and mulpd with an indexed addressing mode would unlaminate as well. (TL;DR: it depends.) If it's all compiler-generated, no hand-written asm, you at least don't have to worry about AVX / SSE transition stalls (Why is this SSE code 6 times slower without VZEROUPPER on Skylake?); see the vzeroupper sketch after the table. |
| Jul 24, 2020 at 15:00 | comment | added | Sarfaraz Nawaz | | Ah, you're right. I was looking at an older generated .s file. Haha. Now it does generate vmulpd instructions. However, it seems to be slow (or maybe my machine is just too heavily loaded right now). Do you think vmulpd is in general faster than mulpd? Or does it depend, so it can't be said without looking at the code? I'll post a question if I face any specific issue, though. |
| Jul 24, 2020 at 14:50 | comment | added | Peter Cordes | | @Nawaz: then -march=native should be using VEX versions of SSE instructions. GCC and clang both work that way. Post a question (not comments) if it's still happening after you double-check that you're actually passing that compiler option correctly, that you're looking at the correct output file, etc., i.e. that it's not just a problem in your build script. |
| Jul 24, 2020 at 14:48 | comment | added | Sarfaraz Nawaz | | "then your CPU doesn't support AVX" .. sysctl -a \| grep machdep.cpu.features \| rg -i avx lists this: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C ... which includes AVX1.0. Does that mean my CPU supports AVX? |
| Jul 24, 2020 at 14:47 | comment | added | Sarfaraz Nawaz | | Oops, fixed that. |
| Jul 24, 2020 at 14:46 | comment | added | Peter Cordes | | @Nawaz: That's literally what I just said: same instruction. And yes, the VEX encoding just adds a non-destructive destination. Except mulpd of course is *, not +. And yes, Intel's asm manuals are pretty clear. See also uops.info/table.html for performance info, and agner.org/optimize to understand what the numbers mean. (More links in stackoverflow.com/tags/x86/info.) |
| Jul 24, 2020 at 14:45 | comment | added | Sarfaraz Nawaz | | Hmm. Seems like vmulpd is not faster; it's just a different variant: mulpd stores the result in one of the operands itself, while vmulpd stores it in a different register. So mulpd is like a *= b and vmulpd is like c = a * b .. please let me know if my understanding is right? felixcloutier.com/x86/mulpd |
| Jul 24, 2020 at 14:44 | comment | added | Peter Cordes | | @Nawaz: no, it's just the VEX encoding of the same instruction, requiring AVX (see the encoding sketch after the table). If -march=native generates the old SSE encoding, then your CPU doesn't support AVX. Godbolt runs on Skylake-AVX512 servers, so -march=native there is -march=skylake-avx512. |
| Jul 24, 2020 at 14:40 | comment | added | Sarfaraz Nawaz | | @PeterCordes .. Wow, this answer. I'll spend my weekend fully understanding it. I'm trying to compute a dot product, compiling with clang 10 on my Mac (Mojave) with -O3 -march=native, but it does not generate instructions like vmulpd and vaddpd as here: godbolt.org/z/5j4bPq; instead it generates mulpd and addpd. Any idea/recommendation how to generate the former? I assume vmulpd is a faster instruction than mulpd? |
| Jun 20, 2020 at 9:12 | history | edited | CommunityBot | | Commonmark migration |
| Mar 21, 2020 at 5:15 | history | edited | Peter Cordes | CC BY-SA 4.0 | added 172 characters in body |
| Feb 13, 2020 at 4:55 | history | edited | Peter Cordes | CC BY-SA 4.0 | link related Q&As |
| Dec 4, 2017 at 22:54 | comment | added | Peter Cordes | | @jww: No. I wouldn't want to set anything in stone that I couldn't come back and edit if/when I realize my advice wasn't optimal after all. I did get an email once asking me if I wanted to be part of writing an asm book, but I never got back to them >.< Anyway, collecting up links to the more useful SO answers with "recipes" that I and others have written would be a good project, if I ever got around to it. |
| Dec 4, 2017 at 21:41 | comment | added | jww | | @PeterCordes - Out of curiosity, have you written any books on x86, assembly and intrinsics? I've been looking for a good book (with recipes) for several years now. The intrinsics are important because they are cross-platform: they work on Clang, GCC, MSVC, SunCC, etc. We can write them once and they run everywhere (unlike asm for GNU's GAS). |
| Jul 5, 2017 at 11:59 | vote | accept | FeepingCreature | | |
| Jul 1, 2017 at 9:01 | history | bounty awarded | Marcus Müller | | |
| May 23, 2017 at 11:54 | history | edited | URL Rewriter Bot | | replaced http://stackoverflow.com/ with https://stackoverflow.com/ |
| Mar 24, 2017 at 15:20 | comment | added | Peter Cordes | | @Royi: There are already a couple of sections in my answer that discuss the fact that _mm_hadd_ps is slow. |
| Feb 14, 2017 at 2:29 | history | edited | Yuhong Bao | CC BY-SA 3.0 | later K8s do have SSE3 |
| Feb 6, 2017 at 19:56 | comment | added | Royi | | @PeterCordes, how does your SSE3 solution compare to @PaulR's solution - v = _mm_hadd_ps(v, v); v = _mm_hadd_ps(v, v);? Thank you. |
| Dec 5, 2016 at 14:03 | history | edited | Peter Cordes | CC BY-SA 3.0 | dummy args to reduce MOVs. Fix swapped elements in a comment |
| Dec 5, 2016 at 13:21 | comment | added | arrowd | | @PeterCordes Thank you for a great answer. Don't you have a typo in the SSE1 (aka SSE) section, at the line _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); // [ C D \| B A ]? I guess you meant [ C D \| A B ]? (See the SSE1 sketch after the table for the element ordering.) |
| Dec 1, 2016 at 3:22 | comment | added | plasmacel | | Thanks for all the provided info here. It's time to detect the __AVX512F__ macro. :) (See the AVX-512 sketch after the table.) |
| Nov 30, 2016 at 16:00 | comment | added | Peter Cordes | | @plasmacel: fun fact: on Knights Landing (Xeon Phi = modified Silvermont + AVX512), VPERMILPS (3c latency, 1c reciprocal throughput) is more efficient than VSHUFPS (4c latency, 2c reciprocal throughput), which does outweigh the instruction-length difference for that architecture. I assume that's from being a 1-input shuffle vs. a 2-input one. Agner Fog updated his stuff for KNL. :) |
| Nov 29, 2016 at 5:33 | comment | added | Peter Cordes | | @plasmacel: As my answer points out, my SSE3 version compiles optimally with AVX, but clang pessimises it to VPERMILPD: godbolt.org/g/ZH88wH. gcc's version is four 4-byte instructions (not counting the RET); clang's version is 2 bytes longer, and the same speed. What makes you think VPERMILPS is a win over SHUFPS? AFAIK, clang is wrong to favour it for immediate shuffles where the source is already in a register. Agner Fog's tables show no difference. It's useful for load+shuffle and for variable shuffles, and maybe easier for compilers since it's a 1-input instruction, but it's not faster. |
| Nov 29, 2016 at 5:30 | comment | added | Peter Cordes | | @plasmacel: no, unless your vector was in memory to start with, since VPERMILPS can load+shuffle. You get smaller code-size from using the AVX versions of older instructions, because you don't need an immediate, and they only need the 2-byte VEX prefix (C5 .. instead of C4 .. ..). Two-source shuffles like VSHUFPS and VMOVHLPS aren't any slower than one-source shuffles like VPSHUFD or VPERMILPS. If there's a difference in energy consumption, it's probably negligible. |
| Nov 29, 2016 at 5:29 | comment | added | plasmacel | | Is it a win to use vpermilps instead of movsldup, movshdup, movhlps and movlhps when AVX is available? It is a win over shufps, and it looks like clang also tries to emit it instead of the mentioned ones. |
| Nov 29, 2016 at 4:50 | comment | added | Peter Cordes | | If you have a specific microarchitecture and compiler in mind, you can and should make a version that's more optimal for that. This answer tries to be optimal (latency, throughput and code-size) for modern CPUs like Haswell, while sucking as little as possible on old CPUs. i.e. my SSE1 / SSE2 versions don't do anything that's worse on Haswell just to run faster on an old SlowShuffle CPU like Merom. For Merom, PSHUFD might be a win because it and SHUFPS both run in the flt->int domain. |
| Nov 29, 2016 at 4:46 | comment | added | Peter Cordes | | @plasmacel: keep in mind that these functions really need to inline to be useful. And yes, clang pessimizes the shuffles sometimes. That's especially bad for first-gen Core 2 and other slow-shuffle CPUs where SHUFPS is far worse than MOVHLPS :( If you enable SSE3 (godbolt.org/g/1qbNXw), though, it can use MOVSHDUP for Kornel's first shuffle, which is excellent. Anyway, if you are using clang, use whatever happens to coax clang into making nice asm after inlining. You could even write a version which takes a dummy arg to use as a target for movhlps (when AVX isn't available). |
| Nov 29, 2016 at 4:40 | comment | added | Peter Cordes | | @plasmacel: on many CPUs, including Intel SnB-family, there's extra bypass-delay latency to forward the result of an FP instruction to an integer shuffle, and from PSHUFD to ADDPS. It's great if you care about throughput and uop count but not latency. (SHUFPS between integer instructions has no penalty on SnB-family (unlike Nehalem), but the reverse is not true.) |
| Nov 29, 2016 at 4:32 | comment | added | plasmacel | | With SSE2, the remaining movaps before the shufps can also be eliminated if you use pshufd, by changing _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); to _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(v), _MM_SHUFFLE(2, 3, 0, 1)));. However, that may add some latency. godbolt.org/g/0trqRY |
| Jun 3, 2016 at 19:07 | history | edited | Peter Cordes | CC BY-SA 3.0 | link my VCL github repo |
| May 27, 2016 at 15:10 | history | edited | Peter Cordes | CC BY-SA 3.0 | tidy up: Talk about CPUs with slow shuffles all in one place. Fix the broken godbolt link. |
| Feb 10, 2016 at 16:39 | vote | accept | FeepingCreature | | removed Jul 5, 2017 at 11:48 |
| Feb 9, 2016 at 9:19 | history | edited | Peter Cordes | CC BY-SA 3.0 | total code size (disk fetch) matters mostly for compiler autovec. |
| Feb 8, 2016 at 13:06 | history | edited | Peter Cordes | CC BY-SA 3.0 | added 2 characters in body |
| Feb 8, 2016 at 12:46 | history | answered | Peter Cordes | CC BY-SA 3.0 | |
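
The hadd-vs-shuffle trade-off discussed in the May 12, 2025 and Feb/Mar 2017 comments is between the following two horizontal-sum shapes. This is a minimal sketch following the structure the answer and comments describe, not the answer's exact code, and the function names are mine:

```c
#include <immintrin.h>

// SSE3 shuffle+add version: single-uop shuffles, but needs a scratch register.
// Comments list elements low-first: v = [ a b c d ], with a in element 0.
static inline float hsum_ps_sse3(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);   // [ b b d d ]
    __m128 sums = _mm_add_ps(v, shuf);  // [ a+b b+b c+d d+d ]
    shuf = _mm_movehl_ps(shuf, sums);   // low half = high half of sums: [ c+d d+d . . ]
    sums = _mm_add_ss(sums, shuf);      // element 0 = a+b+c+d
    return _mm_cvtss_f32(sums);
}

// Two-hadd version (Paul R's approach referenced in the Feb 2017 comment):
// fewer instructions and no scratch register, but each hadd decodes to
// multiple uops (typically 2 shuffles + 1 add on Intel).
static inline float hsum_ps_hadd(__m128 v) {
    v = _mm_hadd_ps(v, v);  // [ a+b c+d a+b c+d ]
    v = _mm_hadd_ps(v, v);  // [ a+b+c+d . . . ]
    return _mm_cvtss_f32(v);
}
```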
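
The Dec 5, 2016 element-ordering question and the Nov 29, 2016 pshufd suggestion both concern the SSE1/SSE2 version. A sketch with the ordering spelled out low-element-first (the answer's own comments use high-to-low [ high | low ] notation, which is what the typo report was about); the function name is mine, and plasmacel's integer-shuffle variant is shown as a commented-out alternative:

```c
#include <emmintrin.h>  // SSE2; the plain-SSE1 path needs only xmmintrin.h

// v = [ a b c d ], with a in element 0. In high-to-low notation the shuffle
// result below reads [ C D | A B ], the ordering arrowd's comment fixed.
static inline float hsum_ps_sse1(__m128 v) {
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); // [ b a d c ]
    // plasmacel's SSE2 alternative, saving a movaps at the cost of possible
    // FP->int bypass latency (see the Nov 29, 2016 comments):
    // __m128 shuf = _mm_castsi128_ps(
    //     _mm_shuffle_epi32(_mm_castps_si128(v), _MM_SHUFFLE(2, 3, 0, 1)));
    __m128 sums = _mm_add_ps(v, shuf);  // [ a+b a+b c+d c+d ]
    shuf = _mm_movehl_ps(shuf, sums);   // [ c+d c+d . . ]
    sums = _mm_add_ss(sums, shuf);      // element 0 = a+b+c+d
    return _mm_cvtss_f32(sums);
}
```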
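
The Jul 24, 2020 thread is about the legacy-SSE vs. VEX encoding of the same multiply instruction. A minimal sketch; the exact register allocation is up to the compiler, so the asm in the comments is illustrative:

```c
#include <emmintrin.h>

// One intrinsic, two encodings, chosen by compiler flags:
//   without -mavx:  mulpd  xmm0, xmm1        ; destructive:     a *= b
//   with    -mavx:  vmulpd xmm0, xmm0, xmm1  ; non-destructive: c = a * b
// Same execution units, same per-instruction speed; VEX just allows a separate
// destination, avoiding a movaps when both inputs must be preserved.
__m128d mul_pd(__m128d a, __m128d b) {
    return _mm_mul_pd(a, b);
}
```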
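
On the AVX/SSE transition stalls mentioned in the Jul 24, 2020 at 15:04 comment: compilers insert vzeroupper automatically where needed, so the explicit intrinsic below is only a sketch of what that protection looks like, mainly relevant to hand-written asm; the function is a made-up example:

```c
#include <immintrin.h>  // compile with -mavx

void double_floats_avx(float *dst, const float *src) {
    __m256 v = _mm256_loadu_ps(src);
    v = _mm256_add_ps(v, v);
    _mm256_storeu_ps(dst, v);
    // Zero the upper halves of the YMM registers before any legacy-SSE code
    // runs, avoiding the stalls described in "Why is this SSE code 6 times
    // slower without VZEROUPPER on Skylake?".
    _mm256_zeroupper();
}
```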
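
On plasmacel's Dec 1, 2016 remark about detecting __AVX512F__: gcc and clang provide a composite reduction intrinsic for 512-bit vectors, so a minimal detection sketch (assuming one of those compilers) looks like this; the function name is mine:

```c
#include <immintrin.h>

#ifdef __AVX512F__
// _mm512_reduce_add_ps is a composite intrinsic: the compiler emits a
// sequence of extracts/shuffles and adds, not a single instruction.
float hsum_zmm(__m512 v) {
    return _mm512_reduce_add_ps(v);
}
#endif
```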