For a Miller-Rabin primality test I need a fast way to compute a^d mod n.
Here a is one of 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37; larger bases are only needed for larger n, so they are used with decreasing frequency. E.g. for n < 2047 only a = 2 is needed. The tests run in order until one shows that n is composite, and most composites are detected early.
The other things I know about the parameters: n = 2^s * d + 1, n is odd, d is odd, and 2 * d < n.
Below is the code I have to compute a^d mod n for uint32_t values. The % operations turn into __aeabi_uldivmod calls on 32-bit ARM, which isn't exactly fast. Extending this to uint64_t would require __int128, which gcc doesn't provide on 32-bit ARM, so I would have to implement my own sqr(), mul() and mod() functions. Before I do that, I wonder if the code could be improved:
#include <cstdint>
#include <bit>

// Compute a^d mod n by left-to-right binary exponentiation.
// Preconditions: d > 0, n > 1, a < n.
uint32_t pow_mod_n(uint32_t a, uint32_t d, uint32_t n) {
    if (d == 0) __builtin_unreachable();
    // Normalize d so its highest set bit sits at bit 31, then drop that
    // bit: it is already accounted for by starting with t = a.
    unsigned shift = std::countl_zero(d) + 1;
    uint32_t t = a;
    // (d << (shift - 1)) << 1 avoids the undefined shift d << 32
    // that would occur for d == 1.
    uint32_t m = (d << (shift - 1)) << 1;
    for (unsigned i = 32 - shift; i > 0; --i) {
        t = ((uint64_t)t * t) % n;   // square for every remaining bit
        if (m & 0x80000000u)         // multiply when the bit is set
            t = ((uint64_t)t * a) % n;
        m <<= 1;
    }
    return t;
}
Given the above limits on a, d and n I'm also looking for specializations of the code. E.g. when n < 65536 all the operations can be done in uint32_t, which speeds up the code a lot on ARM. Or code optimized for a * n < UINT32_MAX. I figure there is a whole set of limits the code can be specialized for; anything that lets me use a smaller type for some operation would help. Has someone worked out all the thresholds that could be used to choose between different functions?