As an exercise in learning ARM64 assembly (aka AArch64), I wrote this function to convert a 64-bit integer to an ASCII hexadecimal string using SIMD instructions. I'm most interested in feedback on the algorithm, instruction selection, and micro-optimizations for size or speed.
This was inspired by x86 SIMD implementations of this function on Stack Overflow by Peter Cordes.
It's written for armv8-a little-endian without alignment checks, using the GNU assembler. A test driver in C for Unix-like systems is included at the end.
I am testing and benchmarking on a Raspberry Pi 4B with a 1.5 GHz Cortex-A72 CPU. Currently it does about 1.5e8 repetitions per second, or about 10 clock cycles per repetition.
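For reference, the straightforward scalar version of the same contract looks something like this. This is my sketch for comparison, not code from the benchmark:

#include <stdint.h>

// Scalar reference with the same contract as u64tohex: 16 digits,
// leading zeros, most significant nibble first, terminating null.
void u64tohex_scalar(char *buf, uint64_t value) {
    static const char digits[] = "0123456789abcdef";
    for (int i = 0; i < 16; i++)
        buf[i] = digits[(value >> (60 - 4 * i)) & 0xf];
    buf[16] = '\0';
}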
u64tohex.s:
// u64tohex: Convert unsigned 64-bit integer to null-terminated ASCII
// hexadecimal.
//
// C declaration: void u64tohex(char *buf, uint64_t value)
//
// Leading zeros are always included, so output is always 16
// characters plus terminating null.
//
// For AArch64 armv8-a little-endian with alignment check disabled.
.text
.global u64tohex
.balign 16
// Lookup table for use by tbl instruction. Should be all the
// hex digits in order.
// Keep this in the .text section, adjacent to u64tohex, so
// PC-relative literal load can be used to load it.
hex_table:
        .ascii  "0123456789abcdef"
.balign 16
u64tohex:
        // x0 = buf
        // x1 = value.
        // Running example: suppose x1 = 0xdeadbeef12345678.

        // This is for little-endian machines, so we have to do some
        // sort of reversal to be able to print the most-significant
        // nibble first. On the integer side we have the rev
        // instruction to reverse bytes; there doesn't seem to be
        // anything comparable in SIMD. So start with rev.
        rev     x1, x1
        // Move to SIMD register. d0 is the low dword of v0.
        // fmov is faster than mov v0.d[0] or dup v0.2d.
        fmov    d0, x1
        // now v0 = 00 00 00 00 00 00 00 00 de ad be ef 12 34 56 78
        // Repeat each byte twice.
        zip1    v0.16b, v0.16b, v0.16b
        // now v0 = de de ad ad be be ef ef 12 12 34 34 56 56 78 78
        // Even bytes of the output should correspond to the
        // most-significant nibble of each input byte. So each even
        // byte needs to be right-shifted by 4. This is done with
        // ushl and a negative shift count. Odd bytes should stay
        // unchanged for now, so need a shift count of 0.
        // Use a 16-bit immediate move, which duplicates across all
        // elements, to fill even bytes of v1 with -4 (0xfc) and odd
        // bytes with 0.
        movi    v1.8h, #0x00fc
        ushl    v0.16b, v0.16b, v1.16b
        // now v0 = 0d de 0a ad 0b be 0e ef 01 12 03 34 05 56 07 78
        // Mask off all high nibbles. Fill all bytes of v1 with 0xf.
        movi    v1.16b, #0x0f
        and     v0.16b, v0.16b, v1.16b
        // now v0 = 0d 0e 0a 0d 0b 0e 0e 0f 01 02 03 04 05 06 07 08
        // Substitute each byte of v0 with the corresponding byte of
        // hex_table. So 00 -> '0', 0a -> 'a' and so on.
        ldr     q1, hex_table
        tbl     v0.16b, { v1.16b }, v0.16b
        // All done. Store result in buffer. Per tests, st1 is
        // slightly faster than the more obvious `str q0, [x0]`.
        st1     { v0.16b }, [x0]
        // Append the terminating null.
        strb    wzr, [x0, #16]
        ret
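For readers who prefer C, the same algorithm can be written with NEON intrinsics. This is a sketch assuming GCC or Clang with arm_neon.h on AArch64; the function name is mine, and the compiler will not necessarily emit the exact instruction sequence above:

#include <arm_neon.h>
#include <stdint.h>

void u64tohex_intrin(char *buf, uint64_t value) {
    // rev: byte-reverse so the most significant nibble comes out first.
    uint8x8_t bytes = vcreate_u8(__builtin_bswap64(value));
    // zip1: repeat each byte twice.
    uint8x16_t v = vcombine_u8(bytes, bytes);
    v = vzip1q_u8(v, v);
    // ushl: shift even bytes right by 4 (count -4 = 0xfc), leave odd bytes.
    int8x16_t shifts = vreinterpretq_s8_u16(vdupq_n_u16(0x00fc));
    v = vshlq_u8(v, shifts);
    // and: keep only the low nibble of each byte.
    v = vandq_u8(v, vdupq_n_u8(0x0f));
    // tbl: map each nibble to its ASCII hex digit.
    uint8x16_t table = vld1q_u8((const uint8_t *)"0123456789abcdef");
    v = vqtbl1q_u8(table, v);
    vst1q_u8((uint8_t *)buf, v);
    buf[16] = '\0';
}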
Here is a simple C program to exercise the function. Compile and link with gcc -O3 main.c u64tohex.s. I'm not particularly looking for feedback on the C code, though of course feel free to give some if you like.
main.c:
// Test driver for u64tohex
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <inttypes.h>
#include <time.h>
// Randomly chosen increment, so that we can rapidly cycle through
// many different test values.
#define INC 0x1928374656473829UL
extern void u64tohex(char *buf, uint64_t val);
// Compare u64tohex with sprintf, to test for correctness
static void test(uint64_t reps) {
    uint64_t v = 0;
    while (reps--) {
        char hex_out[] = "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz";
        char sprintf_out[] = "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz";
        u64tohex(hex_out, v);
        sprintf(sprintf_out, "%016" PRIx64, v);
        if (strcmp(hex_out, sprintf_out) != 0) {
            printf("Fail: bad %s, good %s\n", hex_out, sprintf_out);
            exit(1);
        }
        v += INC;
    }
    printf("Correctness tests passed\n");
}

static void benchmark(uint64_t reps) {
    printf("%" PRIu64 " reps:", reps);
    fflush(stdout);
    struct timespec start, end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    uint64_t v = 0;
    for (uint64_t i = 0; i < reps; i++) {
        char buf[100] __attribute__((aligned(16)));
        u64tohex(buf, v);
        v += INC;
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    double elapsed = (end.tv_sec - start.tv_sec) + 1.0e-9 * (end.tv_nsec - start.tv_nsec);
    printf(" %f secs (%e reps per second)\n", elapsed, reps / elapsed);
}

int main(int argc, char *argv[]) {
    test(100000);
    benchmark(100000000);
    return 0;
}
Comments:

- There is tbl for general permutation, so that could substitute for the rev / zip1. That opens up some approaches for doing the whole thing in SIMD, which would be desirable if we want to convert a large array of values loaded from memory, rather than a single value from an integer register. In particular we can do the shift and masks on two 64-bit inputs at once, keeping even nibbles and odd nibbles in two different registers, and then use two two-register tbl instructions to sort all the correct bytes into the correct order.

- zip1 vA, vB, vC takes the elements from the low halves of vB and vC, interleaves them, and puts the result in vA. There is a nice picture in C7.2.403 of the Architecture Reference Manual. zip2 does the same thing but taking elements from the high halves.

- There is also rev64, which is the SIMD version of rev. I think that plus zip is better than tbl.
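To make the rev64 suggestion concrete, here is a minimal sketch in NEON intrinsics, assuming GCC or Clang with arm_neon.h; the helper name is mine, not from the post:

#include <arm_neon.h>
#include <stdint.h>

// Hypothetical helper: byte-reverse a 64-bit value on the SIMD side.
// vrev64_u8 reverses the bytes within each 64-bit lane, so it is the
// vector counterpart of the integer rev instruction, and using it
// would let the integer-side rev at the top of u64tohex be dropped.
static inline uint8x8_t load_reversed(uint64_t value) {
    return vrev64_u8(vcreate_u8(value));
}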