Improve successful find speed by 1 cycle on Aarch64#2589

Closed
Nicoshev wants to merge 2 commits into
facebook:mainfrom
Nicoshev:export-D94030304

@Nicoshev

Contributor

Summary:
The result of SparseMaskIter's next() is often used as an index into an array of 8-byte elements.
In this case, the index needs to be shifted left by 3 to access the desired memory position.
The return statement of the mentioned function contains i >> 2.
The compiler simplifies the two shifts by issuing only an lsl #1 and omitting the lsr #2.
However, it then ANDs the shifted value with 0xf8 to ensure correctness when variable i is not a
multiple of 4.
We know that variable i will always be a multiple of 4.
We add the assume clause so the compiler avoids emitting the & 0xf8.
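
The shift arithmetic described above can be checked with a small sketch. The function names are hypothetical and the bit position is assumed to lie in 0..63 (a clz result on a non-zero 64-bit value); this is not code from the PR.

```cpp
#include <cstdint>

// Byte offset for an 8-byte element, written as the source expresses it:
// divide the bit position by 4, then scale by 8.
constexpr std::uint64_t scaledTwoShifts(std::uint64_t i) {
  return (i >> 2) << 3;
}

// What the compiler emits without extra knowledge: one shift plus a mask.
constexpr std::uint64_t scaledShiftAndMask(std::uint64_t i) {
  return (i << 1) & 0xf8;
}

// Equivalent only when i is a multiple of 4 (the low 3 bits are already 0).
constexpr std::uint64_t scaledShiftOnly(std::uint64_t i) {
  return i << 1;
}

// Exhaustively compares the three forms over the assumed range 0..63.
constexpr bool identitiesHold() {
  for (std::uint64_t i = 0; i < 64; ++i) {
    if (scaledTwoShifts(i) != scaledShiftAndMask(i)) {
      return false;
    }
    if (i % 4 == 0 && scaledTwoShifts(i) != scaledShiftOnly(i)) {
      return false;
    }
  }
  return true;
}

static_assert(identitiesHold(), "shift identities should hold for 0..63");
```

For a position that is not a multiple of 4, such as 6, the mask matters: the two-shift form gives 8, while the shift-only form gives 12.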

Before the assembly looked like this:

clz x16, x16
lsl x16, x16, #1
and x16, x16, #0xf8
ldr x16, [x14, x16]

After the changes, we verified the AND is omitted:

clz x16, x16
lsl x16, x16, #1
ldr x16, [x14, x16]

By removing an instruction from the dependency chain between clz and the load, execution latency is reduced by 1 cycle 🤗
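
A minimal sketch of the idea follows, assuming a mask layout with 4 bits per slot (consistent with i >> 2 above). SKETCH_ASSUME, slotIndex and loadSlot are hypothetical names for illustration, not folly's actual macro or functions, and whether the AND disappears depends on the compiler and optimization level.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an "assume" hint; not folly's actual macro.
#if defined(__clang__)
#define SKETCH_ASSUME(cond) __builtin_assume(cond)
#elif defined(__GNUC__)
#define SKETCH_ASSUME(cond)   \
  do {                        \
    if (!(cond)) {            \
      __builtin_unreachable(); \
    }                         \
  } while (0)
#else
#define SKETCH_ASSUME(cond) ((void)0)
#endif

// Converts a bit position within the mask into a slot index.
// Precondition (assumed, not checked in release builds): bitPos is a
// multiple of 4 and below 64. Violating it is undefined behavior.
inline unsigned slotIndex(unsigned bitPos) {
  assert(bitPos < 64 && bitPos % 4 == 0);  // debug-only check
  SKETCH_ASSUME(bitPos % 4 == 0);
  return bitPos >> 2;
}

// Typical use: index an array of 8-byte elements. With the hint, the
// compiler may fold (bitPos >> 2) << 3 into a single left shift by 1.
inline std::uint64_t loadSlot(const std::uint64_t* items, unsigned bitPos) {
  return items[slotIndex(bitPos)];
}
```

The hint trades safety for speed: if the assumed condition were ever false, the behavior would be undefined, so it is only appropriate where the invariant is guaranteed by construction.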

Differential Revision: D94030304

@meta-cla meta-cla Bot added the CLA Signed label Feb 22, 2026
@meta-codesync

meta-codesync Bot commented Feb 22, 2026

@Nicoshev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D94030304.

@Nicoshev Nicoshev force-pushed the export-D94030304 branch 2 times, most recently from 8a914be to 4736ea5 Compare February 23, 2026 17:36
Nicoshev added a commit to Nicoshev/folly that referenced this pull request Feb 23, 2026
Summary:

The result of SparseMaskIter's next() is often used as an index into an array of 8-byte elements.
In this case, the index needs to be shifted left by 3 to access the desired memory position.
The return statement of the mentioned function contains i >> 2.
The compiler simplifies the two shifts by issuing only an lsl #1 and omitting the lsr #2.
However, it then ANDs the shifted value with 0xf8 to ensure correctness when variable i is not a
multiple of 4.
We know that variable i will always be a multiple of 4.
We add the assume clause so the compiler avoids emitting the & 0xf8.

Before the assembly looked like this:

  clz	x16, x16
  lsl	x16, x16, #1
  and	x16, x16, #0xf8
  ldr	x16, [x14, x16]

After the changes, we verified the AND is omitted:

  clz	x16, x16
  lsl	x16, x16, #1
  ldr	x16, [x14, x16]


By removing an instruction from the dependency chain between clz and the load, execution latency is reduced by 1 cycle 🤗
It also lets the processor see subsequent instructions up to 1 cycle earlier.

Differential Revision: D94030304
@Nicoshev Nicoshev force-pushed the export-D94030304 branch 2 times, most recently from ba00a7a to a3712e4 Compare February 23, 2026 19:14
@Nicoshev Nicoshev force-pushed the export-D94030304 branch 2 times, most recently from 89f39ec to b8a8f42 Compare February 23, 2026 23:21
Nicoshev added a commit to Nicoshev/folly that referenced this pull request Feb 23, 2026
Summary:

The result of SparseMaskIter's next() is often used as an index into an array of 8-byte elements.
In this case, the index needs to be shifted left by 3 to access the desired memory position.
The return statement of the mentioned function contains i >> 2.
The compiler simplifies the two shifts by issuing only an lsl #1 and omitting the lsr #2.
However, it then ANDs the shifted value with 0xf8 to ensure correctness when variable i is not a
multiple of 4.
We know that variable i will always be a multiple of 4.
We add the assume clause so the compiler avoids emitting the & 0xf8.

Before the assembly looked like this:

  clz	x16, x16
  lsl	x16, x16, #1
  and	x16, x16, #0xf8
  ldr	x16, [x14, x16]

After the changes, we verified the AND is omitted:

  clz	x16, x16
  lsl	x16, x16, #1
  ldr	x16, [x14, x16]


By removing an instruction from the dependency chain between clz and the load, execution latency is reduced by 1 cycle 🤗
It also lets the processor see subsequent instructions up to 1 cycle earlier.

Reviewed By: yfeldblum

Differential Revision: D94030304
Nicoshev added a commit to Nicoshev/folly that referenced this pull request Feb 24, 2026
Summary:

The result of SparseMaskIter's next() is often used as an index into an array of 8-byte elements.
In this case, the index needs to be shifted left by 3 to access the desired memory position.
The return statement of the mentioned function contains i >> 2.
The compiler simplifies the two shifts by issuing only an lsl #1 and omitting the lsr #2.
However, it then ANDs the shifted value with 0xf8 to ensure correctness when variable i is not a
multiple of 4.
We know that variable i will always be a multiple of 4.
We add the assume clause so the compiler avoids emitting the & 0xf8.

Before the assembly looked like this:

  clz	x16, x16
  lsl	x16, x16, #1
  and	x16, x16, #0xf8
  ldr	x16, [x14, x16]

After the changes, we verified the AND is omitted:

  clz	x16, x16
  lsl	x16, x16, #1
  ldr	x16, [x14, x16]


By removing an instruction from the dependency chain between clz and the load, execution latency is reduced by 1 cycle 🤗
It also lets the processor see subsequent instructions up to 1 cycle earlier.

Reviewed By: yfeldblum

Differential Revision: D94030304
Summary:
The CTZ instruction is not available on armv9a CPUs.
This implies that, to compute trailing zeros, RBIT followed by CLZ must be issued.

We are changing the mask's logic to reverse the bits once and rely on CLZ instead of CTZ.

The former instruction sequence looked like this:

  2c3e48:	fmov	x15, d1
  2c3e4c:	rbit	x16, x15
  2c3e50:	clz	x16, x16
  2c3e54:	lsl	x16, x16, #1
  2c3e58:	and	x16, x16, #0xf8
  2c3e5c:	ldr	w16, [x14, x16]
  2c3e60:	cbz	w16, 2c3e88 <_ZN30F14Set_equalityRefinement_Test8TestBodyEv+0x35c>
  2c3e64:	sub	x16, x15, #0x1
  2c3e68:	ands	x15, x16, x15
  2c3e6c:	b.ne	2c3e4c <_ZN30F14Set_equalityRefinement_Test8TestBodyEv+0x320>  // b.any

The new sequence looks like this:

  2c3d90:	fmov	x15, d1
  2c3d94:	rbit	x15, x15
  2c3d98:	clz	x17, x15
  2c3d9c:	lsl	x18, x17, #1
  2c3da0:	and	x18, x18, #0xf8
  2c3da4:	ldr	w18, [x16, x18]
  2c3da8:	cbz	w18, 2c3dd0 <_ZN30F14Set_equalityRefinement_Test8TestBodyEv+0x364>
  2c3dac:	lsr	x17, x12, x17
  2c3db0:	ands	x15, x17, x15
  2c3db4:	b.ne	2c3d98 <_ZN30F14Set_equalityRefinement_Test8TestBodyEv+0x32c>  // b.any


We can observe three things:

- The final conditional branch jumps back to clz instead of rbit.
- Instruction lsr depends on the result of clz, whereas sub in the old sequence depends only on x15, the input of rbit; meaning the old codepath can be executed speculatively earlier.
- The assignment of 0x7FFFFFFFFFFFFFFF has been hoisted out of the loop.
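
The two iteration schemes compared above can be sketched as follows. This is an illustrative reconstruction from the assembly, not folly's code: the function names are hypothetical, the bit reversal is a portable loop rather than rbit, and `__builtin_ctzll` / `__builtin_clzll` assume GCC or Clang.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Portable 64-bit bit reversal (stand-in for the rbit instruction).
inline std::uint64_t reverseBits(std::uint64_t v) {
  std::uint64_t r = 0;
  for (int b = 0; b < 64; ++b) {
    r = (r << 1) | (v & 1);
    v >>= 1;
  }
  return r;
}

// Old scheme: count trailing zeros each iteration, then clear the
// lowest set bit with mask & (mask - 1).
inline std::vector<unsigned> iterateCtz(std::uint64_t mask) {
  std::vector<unsigned> out;
  while (mask != 0) {
    out.push_back(static_cast<unsigned>(__builtin_ctzll(mask)));
    mask &= mask - 1;
  }
  return out;
}

// New scheme: reverse the bits once, then count leading zeros each
// iteration and clear the highest set bit with a shifted constant.
inline std::vector<unsigned> iterateClz(std::uint64_t mask) {
  constexpr std::uint64_t kAllButTop = 0x7FFFFFFFFFFFFFFFULL;
  std::uint64_t rev = reverseBits(mask);
  std::vector<unsigned> out;
  while (rev != 0) {
    unsigned i = static_cast<unsigned>(__builtin_clzll(rev));
    out.push_back(i);
    rev &= kAllButTop >> i;  // keeps only the bits below the one just reported
  }
  return out;
}

// Sanity check: both schemes report the same positions in the same order.
inline void checkSchemesAgree(std::uint64_t mask) {
  assert(iterateCtz(mask) == iterateClz(mask));
}
```

In the CLZ variant the update of the mask needs the clz result, which matches the second observation: the loop-carried dependency now runs through clz, whereas clearing the lowest set bit only needs the mask itself.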

No improvements or regressions were observed on benchmarks, probably because they do not hit the case where many matches occur within the same tag.
The added instruction that assigns 0x7FFFFFFFFFFFFFFF could potentially delay the tag memory load by 1 cycle, although that is unlikely.

This change allows performance improvements on occupiedIter: D94023144

Reviewed By: yfeldblum

Differential Revision: D94020004
@meta-codesync

meta-codesync Bot commented Feb 24, 2026

This pull request has been merged in b26baa2.

meta-codesync Bot pushed a commit to facebook/hhvm that referenced this pull request Feb 24, 2026
Summary:
X-link: facebook/folly#2589

The result of SparseMaskIter's next() is often used as an index into an array of 8-byte elements.
In this case, the index needs to be shifted left by 3 to access the desired memory position.
The return statement of the mentioned function contains i >> 2.
The compiler simplifies the two shifts by issuing only an lsl #1 and omitting the lsr #2.
However, it then ANDs the shifted value with 0xf8 to ensure correctness when variable i is not a
multiple of 4.
We know that variable i will always be a multiple of 4.
We add the assume clause so the compiler avoids emitting the & 0xf8.

Before the assembly looked like this:

  clz	x16, x16
  lsl	x16, x16, #1
  and	x16, x16, #0xf8
  ldr	x16, [x14, x16]

After the changes, we verified the AND is omitted:

  clz	x16, x16
  lsl	x16, x16, #1
  ldr	x16, [x14, x16]

By removing an instruction from the dependency chain between clz and the load, execution latency is reduced by 1 cycle 🤗
It also lets the processor see subsequent instructions up to 1 cycle earlier.

Reviewed By: yfeldblum

Differential Revision: D94030304

fbshipit-source-id: c40a692345b051634c78c262eef8ee966a804104
bolunfeng pushed a commit to bolunfeng/folly that referenced this pull request May 22, 2026