2
  • 3
    C works this way, too, on x86-64 targets other than Windows. In the x86-64 System V ABI, long double is the 80-bit x87 type (stored in 16 bytes with 16-byte alignment, which is probably more alignment than most cases need). It's really only MSVC, and compilers aiming to be ABI-compatible with it (i.e. targeting Windows), that lose easy access to an 80-bit FP type. GCC has a -mlong-double-64/80/128 option to override the ABI (producing ABI-incompatible code), which you can use to narrow long double on non-Windows targets, or widen it on Windows: gcc.gnu.org/onlinedocs/gcc/x86-Options.html Commented Jun 8, 2023 at 23:42
  • 2
    x87 80-bit loads/stores are slow on current CPUs, but the actual computation in registers is still fairly fast. See my answer for details. fadd and fmul have lower throughput than scalar SSE2 math (only about 1/clock), but similar latency on Intel, and fmul and fadd use different ports so they can run in parallel. Agner Fog's testing (agner.org/optimize) found that Zen 3 has an fadd throughput of only one per 2 clocks, but a fully pipelined fmul. But a fiadd throughput of 1/clock? That doesn't make much sense; it's probably an error from hand-editing his spreadsheet. Commented Jun 8, 2023 at 23:51