I think a precomputed lookup table indexed by a 10-bit value (1024 entries) is the way to go. It reduces the work per input byte to a mask, a shift, an OR, and one table read, so it is likely much faster. The table might compete for cache with the rest of the program, though. As always, only benchmarking in the target environment will tell how it compares in practice.
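    unsigned count = 0;
    uint16_t x = 0;  /* 10-bit window: 2 carried bits + 8 new bits */
    for ( size_t i = 0; i < n; ++i ) {
        /* keep the low 2 bits of the previous byte, then append the next byte */
        x = ( ( x & 0x3 ) << 8 ) | buf[ i ];
        count += count_lookup[ x ];
    }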
Note: the code is plain C, not C++. It needs <stdint.h> for uint16_t and <stddef.h> for size_t.
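The loop relies on count_lookup already being filled, which is not shown. Here is a minimal sketch of how such a table could be built, under some assumptions of mine: the thing being counted is a 3-bit pattern (which is what a window of 2 carried bits plus 8 new bits suggests), bits are read most significant first, and overlapping matches count separately. The names build_count_lookup and count_pattern are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW_BITS 10u
#define TABLE_SIZE  (1u << WINDOW_BITS)   /* 1024 entries */

static uint8_t count_lookup[TABLE_SIZE];

/* Hypothetical helper: fill the table for one 3-bit pattern (0..7).
 * Entry x holds the number of matches whose last bit lies in the low
 * 8 bits of x, i.e. in the byte that was just appended. */
static void build_count_lookup(unsigned pattern)
{
    assert(pattern < 8u);
    for (unsigned x = 0; x < TABLE_SIZE; ++x) {
        uint8_t matches = 0;
        /* j is the position of the last bit of the 3-bit window */
        for (unsigned j = 0; j < 8u; ++j) {
            if (((x >> j) & 0x7u) == pattern)
                ++matches;
        }
        count_lookup[x] = matches;   /* at most 8, fits in a byte */
    }
}

/* The loop from the answer, wrapped in a function. */
static unsigned count_pattern(const uint8_t *buf, size_t n)
{
    unsigned count = 0;
    uint16_t x = 0;
    for (size_t i = 0; i < n; ++i) {
        x = (uint16_t)(((x & 0x3u) << 8) | buf[i]);
        count += count_lookup[x];
    }
    return count;
}
```

Call build_count_lookup once (for example with 0x5 for the pattern 101) before the first count_pattern call. One caveat with this sketch: because x starts at 0, a pattern that begins with a 0 bit can produce a spurious match against the two zero "carry" bits at the very start of the buffer, so the first byte may need special handling in that case.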