Improve successful find speed by 1 cycle on Aarch64 #2589
Closed
Nicoshev wants to merge 2 commits into
8a914be to
4736ea5
Compare
Nicoshev
added a commit
to Nicoshev/folly
that referenced
this pull request
Feb 23, 2026
Summary:
The result of SparseMaskIter's next() is often used as an index into an array of 8-byte elements. In that case the index must be shifted left by 3 to form the byte offset of the desired element.
The return statement of next() contains i >> 2, so the compiler folds the two shifts into a single lsl #1 and omits the lsr #2. It then ANDs the shifted value with 0xf8 to stay correct when i is not a multiple of 4.
We know that i is always a multiple of 4, so we add an assume clause that lets the compiler skip the AND with 0xf8.

Before, the assembly looked like this:
clz x16, x16
lsl x16, x16, #1
and x16, x16, #0xf8
ldr x16, [x14, x16]

After the change, we verified that the AND is omitted:
clz x16, x16
lsl x16, x16, #1
ldr x16, [x14, x16]

Removing the and from the dependent chain between clz and ldr reduces the latency of this code path by 1 cycle 🤗
It also lets the processor reach the instructions that follow up to 1 cycle earlier.
Differential Revision: D94030304
ba00a7a to
a3712e4
Compare
a3712e4 to
7fc0747
Compare
7fc0747 to
b42ac7c
Compare
b42ac7c to
79e8721
Compare
79e8721 to
5cba0c0
Compare
89f39ec to
b8a8f42
Compare
b8a8f42 to
414a31e
Compare
Nicoshev
added a commit
to Nicoshev/folly
that referenced
this pull request
Feb 24, 2026
414a31e to
1afe17a
Compare
Summary:
Instruction CTZ is not available on armv9a CPUs, so computing trailing zeroes requires RBIT followed by CLZ.
We change the mask's logic to reverse the bits once and rely on CLZ instead of CTZ.

The former instruction sequence looked like this:
2c3e48: fmov x15, d1
2c3e4c: rbit x16, x15
2c3e50: clz x16, x16
2c3e54: lsl x16, x16, #1
2c3e58: and x16, x16, #0xf8
2c3e5c: ldr w16, [x14, x16]
2c3e60: cbz w16, 2c3e88 <_ZN30F14Set_equalityRefinement_Test8TestBodyEv+0x35c>
2c3e64: sub x16, x15, #0x1
2c3e68: ands x15, x16, x15
2c3e6c: b.ne 2c3e4c <_ZN30F14Set_equalityRefinement_Test8TestBodyEv+0x320> // b.any

The newer one looks like this:
2c3d90: fmov x15, d1
2c3d94: rbit x15, x15
2c3d98: clz x17, x15
2c3d9c: lsl x18, x17, #1
2c3da0: and x18, x18, #0xf8
2c3da4: ldr w18, [x16, x18]
2c3da8: cbz w18, 2c3dd0 <_ZN30F14Set_equalityRefinement_Test8TestBodyEv+0x364>
2c3dac: lsr x17, x12, x17
2c3db0: ands x15, x17, x15
2c3db4: b.ne 2c3d98 <_ZN30F14Set_equalityRefinement_Test8TestBodyEv+0x32c> // b.any

We can observe three things:
- The final conditional branch jumps back to clz instead of rbit.
- Instruction lsr depends on the result of clz, whereas sub depends on the result of rbit, meaning the old codepath can be executed speculatively earlier.
- The assignment of 0x7FFFFFFFFFFFFFFF has been hoisted.

No improvements or regressions were observed on benchmarks, probably because they do not hit the case where many matches occur within the same tag.
The added instruction that assigns 0x7FFFFFFFFFFFFFFF could potentially delay the tag memory load by 1 cycle, although that is unlikely.

This change allows performance improvements on occupiedIter: D94023144
Reviewed By: yfeldblum
Differential Revision: D94020004
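To make the reverse-once idea concrete, here is a minimal sketch, not the actual folly code. It assumes GCC or Clang builtins (`__builtin_ctzll`, `__builtin_clzll`); the helper names `reverseBits64`, `setBitsViaCtz`, and `setBitsViaReversedClz` are hypothetical. It iterates the set bits of a mask in two ways: the usual count-trailing-zeroes loop, and a loop that reverses the bits once and then uses count-leading-zeroes, clearing the found bit with `0x7FFFFFFFFFFFFFFF >> i` (presumably what the `lsr`/`ands` pair in the newer listing does).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical portable bit reversal; on AArch64 a compiler may or may not
// turn this loop into a single rbit instruction.
inline std::uint64_t reverseBits64(std::uint64_t v) {
  std::uint64_t r = 0;
  for (int b = 0; b < 64; ++b) {
    r = (r << 1) | (v & 1u);
    v >>= 1;
  }
  return r;
}

// Baseline: visit set bits from least significant to most significant
// using count-trailing-zeroes and "clear lowest set bit".
inline std::vector<unsigned> setBitsViaCtz(std::uint64_t mask) {
  std::vector<unsigned> out;
  while (mask != 0) {
    out.push_back(static_cast<unsigned>(__builtin_ctzll(mask)));
    mask &= mask - 1;
  }
  return out;
}

// Sketch of the reversed scheme: reverse once, then the leading-zero count
// of the reversed mask equals the trailing-zero count of the original.
inline std::vector<unsigned> setBitsViaReversedClz(std::uint64_t mask) {
  constexpr std::uint64_t kAllButTop = 0x7FFFFFFFFFFFFFFFULL;
  std::vector<unsigned> out;
  std::uint64_t rev = reverseBits64(mask);
  while (rev != 0) {
    unsigned i = static_cast<unsigned>(__builtin_clzll(rev));
    assert(i < 64);
    out.push_back(i);
    // Keep only the bits strictly below the one just found.
    rev &= kAllButTop >> i;
  }
  return out;
}
```

Both loops yield the same sequence of bit positions; the second needs the reversal only once before the loop, which matches the observation that the branch now jumps back to clz rather than rbit.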
1afe17a to
e8598d0
Compare
This pull request has been merged in b26baa2.
meta-codesync Bot
pushed a commit
to facebook/hhvm
that referenced
this pull request
Feb 24, 2026
X-link: facebook/folly#2589
Reviewed By: yfeldblum
Differential Revision: D94030304
fbshipit-source-id: c40a692345b051634c78c262eef8ee966a804104
bolunfeng
pushed a commit
to bolunfeng/folly
that referenced
this pull request
May 22, 2026
Pull Request resolved: facebook#2589
Reviewed By: yfeldblum
Differential Revision: D94030304
fbshipit-source-id: c40a692345b051634c78c262eef8ee966a804104
Summary:
The result of SparseMaskIter's next() is often used as an index into an array of 8-byte elements.
In that case the index must be shifted left by 3 to form the byte offset of the desired element.
The return statement of next() contains i >> 2.
The compiler folds the two shifts into a single lsl #1 and omits the lsr #2.
However, it then ANDs the shifted value with 0xf8 to stay correct when i is not a
multiple of 4.
We know that i is always a multiple of 4.
We add an assume clause so that the compiler can skip the AND with 0xf8.
Before, the assembly looked like this:
clz x16, x16
lsl x16, x16, #1
and x16, x16, #0xf8
ldr x16, [x14, x16]
After the change, we verified that the AND is omitted:
clz x16, x16
lsl x16, x16, #1
ldr x16, [x14, x16]
Removing the and from the dependent chain between clz and ldr reduces the latency of this code path by 1 cycle 🤗
Differential Revision: D94030304
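A minimal sketch of the assume pattern described above, not the actual folly change. It assumes GCC or Clang builtins; the macro `SKETCH_ASSUME` and the functions `nextIndex` and `collectIndices` are hypothetical names, and the mask layout (one possible set bit per 4-bit group, always at the lowest bit of the group) is an assumption made for illustration. The hint tells the compiler that the low two bits of `i` are zero, so `(i >> 2) << 3` can become `i << 1` without the masking step. Whether the AND actually disappears depends on the compiler and version, so the generated assembly still needs to be checked.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical assume macro; real projects usually wrap this differently.
#if defined(__clang__)
#define SKETCH_ASSUME(cond) __builtin_assume(cond)
#elif defined(__GNUC__)
#define SKETCH_ASSUME(cond) \
  do {                      \
    if (!(cond)) {          \
      __builtin_unreachable(); \
    }                       \
  } while (0)
#else
#define SKETCH_ASSUME(cond) ((void)0)
#endif

// Returns the slot index of the lowest set bit and clears that bit.
// Precondition (assumed, not checked in release builds): mask is nonzero
// and every set bit sits at a position that is a multiple of 4.
inline unsigned nextIndex(std::uint64_t& mask) {
  assert(mask != 0);
  unsigned i = static_cast<unsigned>(__builtin_ctzll(mask));
  // Promise the optimizer that i is a multiple of 4. If this is ever false,
  // the behavior is undefined, so the invariant must really hold.
  SKETCH_ASSUME((i & 3u) == 0);
  mask &= mask - 1;
  return i >> 2;
}

// Drains the mask and collects the slot indices in ascending order.
inline std::vector<unsigned> collectIndices(std::uint64_t mask) {
  std::vector<unsigned> out;
  while (mask != 0) {
    out.push_back(nextIndex(mask));
  }
  return out;
}
```

A caller that indexes an array of 8-byte elements with the result, for example `table[nextIndex(mask)]` on a `std::uint64_t` array, is the case where the compiler can fold the shifts as in the listings above.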