
As an exercise in learning ARM64 assembly (aka AArch64), I wrote this function to convert 64-bit binary to hexadecimal with SIMD instructions. I'm most interested in feedback on the algorithm, instruction selection, and micro-optimizations for size or speed.

This was inspired by x86 SIMD implementations of this function on Stack Overflow by Peter Cordes.

It targets armv8-a, little-endian, with alignment checking disabled, and uses GNU assembler syntax. A C test driver for Unix-like systems is included at the end.

I am testing and benchmarking on a Raspberry Pi 4B with a 1.5 GHz Cortex-A72 CPU. Currently it does about 1.5e8 repetitions per second, i.e. about 10 clock cycles per repetition.

u64tohex.s:

// u64tohex: Convert unsigned 64-bit integer to null-terminated ASCII
// hexadecimal.
//
// C declaration: void u64tohex(char *buf, uint64_t value)
//
// Leading zeros are always included, so output is always 16
// characters plus terminating null. 
//
// For AArch64 armv8-a little-endian with alignment check disabled.
        
        .text
        .global u64tohex

        .balign 16
        // Lookup table for use by tbl instruction. Should be all the
        // hex digits in order.

        // Keep this in the .text section, adjacent to u64tohex, so
        // PC-relative literal load can be used to load it.
hex_table:
        .ascii "0123456789abcdef"

        .balign 16
u64tohex:
        // x0 = buf
        // x1 = value.
        // Running example: suppose x1 = 0xdeadbeef12345678.

        // This is for little-endian machines, so we have to do some
        // sort of reversal to be able to print the most-significant
        // nibble first.  On the integer side we have the rev
        // instruction to reverse bytes; there doesn't seem to be
        // anything comparable in SIMD.  So start with rev.
        rev x1, x1

        // Move to SIMD register.  d0 is the low dword of v0.  
        // fmov is faster than mov v0.d[0] or dup v0.2d
        fmov d0, x1

        // now v0 = 00 00 00 00 00 00 00 00 de ad be ef 12 34 56 78
        
        // Repeat each byte twice.
        zip1 v0.16b, v0.16b, v0.16b

        // now v0 = de de ad ad be be ef ef 12 12 34 34 56 56 78 78
        
        // Even bytes of the output should correspond to the
        // most-significant nibble of each input byte.  So each even
        // byte needs to be right-shifted by 4.  This is done with
        // ushl and a negative shift count.  Odd bytes should stay
        // unchanged for now, so need a shift count of 0.

        // Use a 16-bit immediate move, which duplicates across all
        // elements, to fill even bytes of v1 with -4 (0xfc) and odd bytes
        // with 0.
        movi v1.8h, #0x00fc
        ushl v0.16b, v0.16b, v1.16b

        // now v0 = 0d de 0a ad 0b be 0e ef 01 12 03 34 05 56 07 78
        
        // Mask off all high nibbles.  Fill all bytes of v1 with 0xf.
        movi v1.16b, #0x0f
        and v0.16b, v0.16b, v1.16b

        // now v0 = 0d 0e 0a 0d 0b 0e 0e 0f 01 02 03 04 05 06 07 08

        // Substitute each byte of v0 with the corresponding byte of
        // hex_table.  So 00 -> '0', 0a -> 'a' and so on.
        ldr q1, hex_table
        tbl v0.16b, { v1.16b }, v0.16b

        // All done.  Store result in buffer.  Per tests, st1 is
        // slightly faster than the more obvious `str q0, [x0]`.
        st1 { v0.16b }, [x0]

        // Append the terminating null.
        strb wzr, [x0, #16]

        ret
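
For readers less familiar with AArch64 SIMD, the same digit extraction can be sketched as plain scalar C: walk the nibbles from most significant to least and index the same 16-entry digit table. This is a readability reference only, not the benchmarked routine; the function name u64tohex_ref is mine.

```c
#include <stdint.h>

// Scalar reference for u64tohex: extract each nibble from most
// significant to least and look it up in the same digit table the
// SIMD version uses via tbl.
void u64tohex_ref(char *buf, uint64_t v) {
    static const char digits[] = "0123456789abcdef";
    for (int i = 0; i < 16; i++)
        buf[i] = digits[(v >> (60 - 4 * i)) & 0xf];
    buf[16] = '\0';
}
```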

Here is a simple C program to exercise the function. Compile and link with gcc -O3 main.c u64tohex.s. I'm not particularly looking for feedback on the C code, though of course feel free to give some if you like.

main.c:

// Test driver for u64tohex

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <inttypes.h>
#include <time.h>

// Randomly chosen increment, so that we can rapidly cycle through
// many different test values.
#define INC 0x1928374656473829UL

extern void u64tohex(char *buf, uint64_t val);

// Compare u64tohex with sprintf, to test for correctness
static void test(uint64_t reps) {
    uint64_t v = 0;
    while (reps--) {
        char hex_out[] = "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz";
        char sprintf_out[] = "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz";
        u64tohex(hex_out, v);
        sprintf(sprintf_out, "%016" PRIx64, v);
        if (strcmp(hex_out, sprintf_out) != 0) {
            printf("Fail: bad %s, good %s\n", hex_out, sprintf_out);
            exit(1);
        }
        v += INC;
    }
    printf("Correctness tests passed\n");
}

static void benchmark(uint64_t reps) {
    printf("%" PRIu64 " reps:", reps);
    fflush(stdout);

    struct timespec start, end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);

    uint64_t v = 0;
    for (uint64_t i = 0; i < reps; i++) {
        char buf[100] __attribute__((aligned(16)));
        u64tohex(buf, v);
        v += INC;
    }

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    double elapsed = (end.tv_sec - start.tv_sec) + 1.0e-9 * (end.tv_nsec - start.tv_nsec);
    printf(" %f secs (%e reps per second)\n", elapsed, reps/elapsed);
}

int main(int argc, char *argv[]) {
    test(100000);
    benchmark(100000000);
    return 0;
}
  • Have you tried doing low/high nibble separation in GPRs? Something like this: gist.github.com/stepantubanov/1867a3662ba1098a42cf5837870c662c (not tested, may be broken). I am not sure I fully understand what zip1 does, so maybe it's not correct. Commented Mar 7, 2021 at 18:24
  • Thinking some more, I realized that one can use tbl for general permutation, so that could substitute for the rev / zip1. That opens up some approaches for doing the whole thing in SIMD, which would be desirable if we want to convert a large array of values loaded from memory, rather than a single value from an integer register. In particular we can do the shift and masks on two 64-bit inputs at once, keeping even nibbles and odd nibbles in two different registers, and then use two two-register tbl instructions to sort all the correct bytes into the correct order. Commented Mar 7, 2021 at 19:23
  • There's some overhead to load the permutations from static memory, but if we are doing this in a big loop, it can be done just once. I might play with this and post a new version of the question for this setting. (I know to leave this one as it is.) Commented Mar 7, 2021 at 19:24
  • @stepan: zip1 vA, vB, vC takes the elements from the low halves of vB and vC, interleaves them, and puts the result in vA. There is a nice picture in C7.2.403 of the Architecture Reference Manual. zip2 does the same thing but taking elements from the high halves. Commented Mar 7, 2021 at 19:27
  • Oh, I just found rev64, which is the SIMD version of rev. I think that plus zip is better than tbl. Commented Mar 7, 2021 at 22:43
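
The GPR nibble-separation idea from the first comment can be illustrated in C roughly as follows. This is my own reconstruction, not the linked gist, and the names spread_nibbles, hex8, and u64tohex_gpr are hypothetical: each 32-bit half is spread so every nibble lands in its own byte (three shift/mask steps), then all eight bytes are converted to ASCII branchlessly and byte-reversed for the little-endian store.

```c
#include <stdint.h>
#include <string.h>

// Spread the 8 nibbles of x so each occupies its own byte of the
// result, most significant nibble in the top byte.
static uint64_t spread_nibbles(uint32_t x) {
    uint64_t t = x;
    t = ((t & 0xFFFF0000u)          << 16) | (t & 0x0000FFFFu);
    t = ((t & 0x0000FF000000FF00ULL) << 8) | (t & 0x000000FF000000FFULL);
    t = ((t & 0x00F000F000F000F0ULL) << 4) | (t & 0x000F000F000F000FULL);
    return t;
}

// Convert 8 nibbles to 8 hex digits with no per-byte branches.
static void hex8(char *out, uint32_t x) {
    uint64_t t = spread_nibbles(x);
    // Per byte: (d + 6) has bit 4 set exactly when d > 9.
    uint64_t gt9 = ((t + 0x0606060606060606ULL) >> 4) & 0x0101010101010101ULL;
    // Add '0' everywhere, plus 'a' - '0' - 10 = 0x27 where d > 9.
    uint64_t ascii = t + 0x3030303030303030ULL + gt9 * 0x27;
    ascii = __builtin_bswap64(ascii);  // GCC/Clang builtin: MSB digit first in memory
    memcpy(out, &ascii, 8);
}

void u64tohex_gpr(char *buf, uint64_t v) {
    hex8(buf,     (uint32_t)(v >> 32));
    hex8(buf + 8, (uint32_t)v);
    buf[16] = '\0';
}
```

Whether this beats the SIMD version on a Cortex-A72 would need measuring; it trades the vector table lookup for a longer dependency chain of integer shifts and masks.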
