Improve successful find speed by 1 cycle on Aarch64 by Nicoshev · Pull Request #2589 · facebook/folly

Nicoshev · 2026-02-22T21:38:16Z

Summary:
The result of SparseMaskIter's next() is often used as an index into an array of 8-byte elements.
In that case, the index must be shifted left by 3 to form the byte offset of the desired element.
The return statement of next() contains i >> 2.
The compiler combines the two shifts into a single lsl #1 and omits the lsr #2.
However, it then ANDs the shifted value with 0xf8 to preserve correctness when variable i is not a
multiple of 4.
We know that variable i is always a multiple of 4.
We add an assume clause so that the compiler no longer emits the AND with 0xf8.

Before the change, the assembly looked like this:

clz x16, x16
lsl x16, x16, #1
and x16, x16, #0xf8
ldr x16, [x14, x16]

After the change, we verified that the AND is omitted:

clz x16, x16
lsl x16, x16, #1
ldr x16, [x14, x16]

Removing this instruction from the dependency chain that feeds the load reduces execution latency by 1 cycle 🤗
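
The idea can be illustrated with a minimal C++ sketch. It is not the code from this PR: the macro SKETCH_ASSUME and the functions slotIndex and loadItem are hypothetical names, and the helper folly actually uses for the hint is not shown on this page. The sketch only assumes that i is a bit position in a mask with 4 bits per slot.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an "assume" hint. If the condition is false at
// run time, behavior is undefined, so it must only state a true invariant.
#if defined(__clang__)
#define SKETCH_ASSUME(cond) __builtin_assume(cond)
#elif defined(__GNUC__)
#define SKETCH_ASSUME(cond) \
  do {                      \
    if (!(cond)) {          \
      __builtin_unreachable(); \
    }                       \
  } while (0)
#else
#define SKETCH_ASSUME(cond) ((void)0)
#endif

// i is a bit position in a mask that spends 4 bits per slot, so it is
// always one of 0, 4, 8, ..., 60.
inline unsigned slotIndex(unsigned i) {
  assert(i % 4 == 0 && i < 64); // checked in debug builds only
  SKETCH_ASSUME((i & 3) == 0);  // tells the optimizer the low 2 bits are zero
  return i >> 2;
}

// Indexing 8-byte elements needs the byte offset (i >> 2) << 3.
// Without the hint this is (i << 1) & ~7; with it, i << 1 may suffice.
inline std::uint64_t loadItem(const std::uint64_t* items, unsigned i) {
  return items[slotIndex(i)];
}
```

Whether a given compiler actually drops the mask depends on its version and flags, so the disassembly should be checked, as was done for this change.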

Differential Revision: D94030304

meta-codesync · 2026-02-22T21:38:24Z

@Nicoshev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D94030304.

Summary:
Instruction CTZ is not available on armv9a CPUs.
This implies that to compute trailing zeroes, RBIT followed by CLZ must be issued.
We are changing the mask's logic to reverse bits once and rely on CLZ instead of CTZ.

The former instruction sequence looked like this:

2c3e48: fmov x15, d1
2c3e4c: rbit x16, x15
2c3e50: clz x16, x16
2c3e54: lsl x16, x16, #1
2c3e58: and x16, x16, #0xf8
2c3e5c: ldr w16, [x14, x16]
2c3e60: cbz w16, 2c3e88 <_ZN30F14Set_equalityRefinement_Test8TestBodyEv+0x35c>
2c3e64: sub x16, x15, #0x1
2c3e68: ands x15, x16, x15
2c3e6c: b.ne 2c3e4c <_ZN30F14Set_equalityRefinement_Test8TestBodyEv+0x320> // b.any

The newer one looks like this:

2c3d90: fmov x15, d1
2c3d94: rbit x15, x15
2c3d98: clz x17, x15
2c3d9c: lsl x18, x17, #1
2c3da0: and x18, x18, #0xf8
2c3da4: ldr w18, [x16, x18]
2c3da8: cbz w18, 2c3dd0 <_ZN30F14Set_equalityRefinement_Test8TestBodyEv+0x364>
2c3dac: lsr x17, x12, x17
2c3db0: ands x15, x17, x15
2c3db4: b.ne 2c3d98 <_ZN30F14Set_equalityRefinement_Test8TestBodyEv+0x32c> // b.any

We can observe three things:
- The final conditional branch jumps back to clz instead of rbit.
- Instruction lsr depends on the result of clz, whereas sub only depends on the mask value that feeds rbit; meaning the old codepath can be executed speculatively earlier.
- Assignment of 0x7FFFFFFFFFFFFFFF has been hoisted.

No improvements or regressions were observed on benchmarks, probably because they do not hit the case where many matches occur within the same tag.
The added instruction that assigns 0x7FFFFFFFFFFFFFFF could potentially delay the tag memory load by 1 cycle, although that is unlikely.

This change allows performance improvements on occupiedIter: D94023144

Reviewed By: yfeldblum

Differential Revision: D94020004
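
The two iteration strategies compared above can be sketched in C++ as follows. This is an illustration under stated assumptions rather than folly's implementation: the function names are hypothetical, the bit helpers are plain portable loops instead of RBIT and CLZ, and the constant 0x7FFFFFFFFFFFFFFF is taken from the commit message.

```cpp
#include <cstdint>
#include <vector>

// Portable bit reversal; on AArch64 this corresponds to RBIT.
inline std::uint64_t reverseBits(std::uint64_t x) {
  std::uint64_t r = 0;
  for (int b = 0; b < 64; ++b) {
    r = (r << 1) | (x & 1);
    x >>= 1;
  }
  return r;
}

// Portable count of leading zeros for a nonzero value; corresponds to CLZ.
inline unsigned countLeadingZeros(std::uint64_t x) {
  unsigned n = 0;
  while ((x & (std::uint64_t{1} << 63)) == 0) {
    x <<= 1;
    ++n;
  }
  return n;
}

// Former scheme: each step computes ctz as clz(rbit(mask)) and clears the
// lowest set bit with mask & (mask - 1).
inline std::vector<unsigned> positionsFormer(std::uint64_t mask) {
  std::vector<unsigned> out;
  while (mask != 0) {
    out.push_back(countLeadingZeros(reverseBits(mask)));
    mask &= mask - 1;
  }
  return out;
}

// Newer scheme: reverse once, then each step uses clz only and clears the
// highest set bit of the reversed mask with a shifted constant.
inline std::vector<unsigned> positionsNewer(std::uint64_t mask) {
  std::vector<unsigned> out;
  std::uint64_t reversed = reverseBits(mask);
  const std::uint64_t kClear = 0x7FFFFFFFFFFFFFFFULL;
  while (reversed != 0) {
    unsigned lz = countLeadingZeros(reversed); // lz is in 0..63
    out.push_back(lz);
    reversed &= kClear >> lz; // zeroes the top lz + 1 bits
  }
  return out;
}
```

Both functions visit the set bits of the original mask from least to most significant. The difference is the dependency noted in the observations: the newer clearing step needs the clz result, while the former one needs only the mask itself.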

meta-codesync · 2026-02-24T17:00:27Z

This pull request has been merged in b26baa2.

Summary:
X-link: facebook/folly#2589

The result of SparseMaskIter's next() is often used as an index into an array of 8-byte elements.
In that case, the index must be shifted left by 3 to form the byte offset of the desired element.
The return statement of next() contains i >> 2.
The compiler combines the two shifts into a single lsl #1 and omits the lsr #2.
However, it then ANDs the shifted value with 0xf8 to preserve correctness when variable i is not a multiple of 4.
We know that variable i is always a multiple of 4.
We add an assume clause so that the compiler no longer emits the AND with 0xf8.

Before the change, the assembly looked like this:

clz x16, x16
lsl x16, x16, #1
and x16, x16, #0xf8
ldr x16, [x14, x16]

After the change, we verified that the AND is omitted:

clz x16, x16
lsl x16, x16, #1
ldr x16, [x14, x16]

Removing this instruction from the dependency chain that feeds the load reduces execution latency by 1 cycle 🤗
It also allows the processor to foresee the following instructions up to 1 cycle earlier.

Reviewed By: yfeldblum

Differential Revision: D94030304

fbshipit-source-id: c40a692345b051634c78c262eef8ee966a804104

meta-cla Bot added the CLA Signed label Feb 22, 2026

meta-codesync Bot added fb-exported meta-exported labels Feb 22, 2026

Nicoshev force-pushed the export-D94030304 branch 2 times, most recently from 8a914be to 4736ea5 February 23, 2026 17:36

Nicoshev force-pushed the export-D94030304 branch 2 times, most recently from ba00a7a to a3712e4 February 23, 2026 19:14

Nicoshev force-pushed the export-D94030304 branch from a3712e4 to 7fc0747 February 23, 2026 19:17

Nicoshev force-pushed the export-D94030304 branch from 7fc0747 to b42ac7c February 23, 2026 20:26

Nicoshev force-pushed the export-D94030304 branch from b42ac7c to 79e8721 February 23, 2026 20:29

Nicoshev force-pushed the export-D94030304 branch from 79e8721 to 5cba0c0 February 23, 2026 22:06

Nicoshev force-pushed the export-D94030304 branch 2 times, most recently from 89f39ec to b8a8f42 February 23, 2026 23:21

Nicoshev force-pushed the export-D94030304 branch from b8a8f42 to 414a31e February 23, 2026 23:21

Nicoshev force-pushed the export-D94030304 branch from 414a31e to 1afe17a February 24, 2026 03:36

Nicoshev added 2 commits February 23, 2026 20:59

Nicoshev force-pushed the export-D94030304 branch from 1afe17a to e8598d0 February 24, 2026 04:59

meta-codesync Bot closed this in b26baa2 Feb 24, 2026

facebook-github-bot added the Merged label Feb 24, 2026

Nicoshev wants to merge 2 commits into facebook:main from Nicoshev:export-D94030304
