user555045

I see basically two categories of solution that aren't bit-by-bit:

  1. A table-based approach.
  2. A word-oriented approach.

I will get to them, but using this as a hook:

I tried to write it in a generic way just in case a follow up question would come up. Is it a problem?

"Problem" is a bit much, but inherently you will tend to lose opportunities for simplification, or to put it another way, you're making the problem more difficult to solve.

For example, let's say you use a table-based approach, with a table of 1024 entries to process 10 bits: 8 bits belonging to the current byte, plus two extra bits that are the last two bits of the previous byte. You would form the index into the table by concatenating those things. There is a somewhat annoying case at the start of the array (or the end, depending on how you arrange things), where we must avoid finding instances of the pattern that start before the array. If we know that the pattern is 101, then that automatically doesn't happen if the "left over bits" are initialized as zero. If the pattern can be 000 or other annoying values, it takes some subtle and possibly bug-prone logic to avoid finding pattern matches that would include one or two phantom bits that aren't part of the array.
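As a concrete illustration, here is a minimal sketch of that table-based approach for the fixed pattern 101, assuming bits are numbered LSB-first within each byte (make_table and count_101 are made-up names for this example):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// 1024-entry table: index = (current_byte << 2) | previous_two_bits,
// value = how many occurrences of the pattern 101 end inside the current
// byte. The previous byte's top two bits sit in the low two index bits
// because, LSB-first, they come earlier in the bit stream.
static std::array<uint8_t, 1024> make_table() {
    std::array<uint8_t, 1024> table{};
    for (unsigned idx = 0; idx < 1024; ++idx) {
        uint8_t count = 0;
        for (unsigned pos = 0; pos + 3 <= 10; ++pos)
            if (((idx >> pos) & 0b111u) == 0b101u)
                ++count;
        table[idx] = count;
    }
    return table;
}

size_t count_101(const std::vector<uint8_t>& data) {
    static const auto table = make_table();
    size_t total = 0;
    unsigned prev = 0;  // "left over bits": zero at the start is safe for 101
    for (uint8_t byte : data) {
        total += table[(byte << 2) | prev];
        prev = byte >> 6;  // carry the last two bits into the next lookup
    }
    return total;
}
```

Note how the safety of `prev = 0` at the start relies on the pattern being 101; for a pattern like 000 the first lookup would also count phantom matches against the padding.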

A word-oriented approach grabs a bigger chunk of the array at once and uses a couple of bitwise operations to turn those bits from the array into a mask indicating where the pattern matches. In the simplest case, if we were looking for the single-bit pattern 1, the data itself would already be a mask of where the pattern matches (and there wouldn't be any complications from patterns that cross a word boundary); all that remains is to std::popcount the data word by word and sum the results. If we were looking for 0, we first invert the data and use std::popcount(~word).

For longer patterns, some shifts and ANDs enter the picture. For example to get the mask indicating where the pattern 101 matches, the "core" of the solution could be:

mask = data & (~data << 1) & (data << 2)

It's possible to support variable patterns, using XOR to do conditional inversion. Of course that adds some additional complexity.
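A sketch of that XOR trick for a 3-bit pattern supplied LSB-first (match_mask3 is a hypothetical helper; a set bit in the result marks a position where the pattern starts):

```cpp
#include <cstdint>

// Build the match mask for an arbitrary 3-bit pattern `pat`.
uint64_t match_mask3(uint64_t word, unsigned pat) {
    uint64_t mask = ~uint64_t{0};
    for (unsigned k = 0; k < 3; ++k) {
        // XOR with all-ones inverts the term (pattern bit is 0);
        // XOR with zero leaves it alone (pattern bit is 1).
        uint64_t inv = ((pat >> k) & 1u) ? uint64_t{0} : ~uint64_t{0};
        mask &= (word >> k) ^ inv;
    }
    // The top two positions would "match" against shifted-in zeroes,
    // which aren't real data, so clear them.
    return mask & (~uint64_t{0} >> 2);
}
```

With pat = 0b101 this reduces to exactly `word & ~(word >> 1) & (word >> 2)`, the right-shift mirror of the expression above.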

You could shift right instead of left. Does that matter? Not really, not at this point.

I've seemingly ignored the "pattern can span across boundaries" problem here, but for the pattern 101 there is an elegant solution (perhaps not the fastest, but it's nice and simple): work 7 bytes at a time (if you're using uint64_t as your "word"), using 2 of the 8 spare bits that leaves in a uint64_t to hold the two extra bits from the adjacent word. That leaves 6 "padding" bits that are in the word but in which no matches must be counted, but here's a nice thing about the pattern 101: that doesn't matter, as there will never be a match involving those 6 zeroes. There could have been if the pattern were allowed to end in a zero, though that is not hard to solve: just add an extra bitwise AND to get rid of the invalid matches.

Another option to handle patterns that cross word boundaries is handling them explicitly, counting them separately from the rest.

If you have access to the kind of shift where you can supply an extra operand of bits to shift into the top/bottom, you can use that to work a full word at a time.

What about SIMD

Most of the "raw processing performance" of a modern high-performance CPU (by which I roughly mean basically anything that isn't a basic microcontroller; even some fancy microcontrollers have ARM NEON) is locked behind its SIMD/vector processing capabilities, so whether we can use that deserves some thought.

A word-oriented approach is probably more amenable to a SIMD implementation, whether manual or, especially, via auto-vectorization (compilers are, so far, not very eager to turn a table lookup into a variable shuffle).

It's not that SIMD cannot do table lookups; depending on the ISA it can: there is for example tbx/tbl in ARM NEON and SVE, and pshufb (SSSE3), vpermb (AVX512), vpermt2b (AVX512) in x86/x64-land. But a small table means there's not a lot of progress made per lookup; for example, a lookup into a 16-entry (4-bit index) table only processes 2 positions if the pattern has a length of 3, and a lookup into a 64-entry (6-bit index) table (eg vpermb in AVX512) processes 4 positions. That's something, but it's not really reasonable to use a 1024-entry table.

The word-oriented approach mostly "just works" in SIMD. Working 7 bytes at a time is a bit awkward but possible (just shuffle the bytes); maybe it works out better to count the word-boundary-crossing patterns separately, or with AVX512 you could use those "concatenate and shift" instructions such as vpshldq.