
As an exercise in learning ARM64 assembly (aka AArch64), I wrote this function to convert 64-bit binary to hexadecimal with SIMD instructions. I'm most interested in feedback on the algorithm, instruction selection, and micro-optimizations for size or speed.

This was inspired by x86 SIMD implementations of this function on Stack Overflow by Peter Cordes.

It targets armv8-a, little-endian, with alignment checking disabled, and uses GNU assembler syntax. A C test driver for Unix-like systems is included at the end.

I am testing and benchmarking on a Raspberry Pi 4B with a 1.5 GHz Cortex-A72 CPU. Currently it does about 1.5e8 repetitions per second, i.e. about 10 clock cycles per repetition.

u64tohex.s:

// u64tohex: Convert unsigned 64-bit integer to null-terminated ASCII
// hexadecimal.
//
// C declaration: void u64tohex(char *buf, uint64_t value)
//
// Leading zeros are always included, so output is always 16
// characters plus terminating null. 
//
// For AArch64 armv8-a little-endian with alignment check disabled.
        
        .text
        .global u64tohex

        .balign 16
        // Lookup table for use by tbl instruction. Should be all the
        // hex digits in order.

        // Keep this in the .text section, adjacent to u64tohex, so
        // PC-relative literal load can be used to load it.
hex_table:
        .ascii "0123456789abcdef"

        .balign 16
u64tohex:
        // x0 = buf
        // x1 = value.
        // Running example: suppose x1 = 0xdeadbeef12345678.

        // This is for little-endian machines, so we have to do some
        // sort of reversal to be able to print the most-significant
        // nibble first.  On the integer side we have the rev
        // instruction to reverse bytes; there doesn't seem to be
        // anything comparable in SIMD.  So start with rev.
        rev x1, x1

        // Move to SIMD register.  d0 is the low dword of v0.  
        // fmov is faster than mov v0.d[0] or dup v0.2d
        fmov d0, x1

        // now v0 = 00 00 00 00 00 00 00 00 de ad be ef 12 34 56 78
        
        // Repeat each byte twice.
        zip1 v0.16b, v0.16b, v0.16b

        // now v0 = de de ad ad be be ef ef 12 12 34 34 56 56 78 78
        
        // Even bytes of the output should correspond to the
        // most-significant nibble of each input byte.  So each even
        // byte needs to be right-shifted by 4.  This is done with
        // ushl and a negative shift count.  Odd bytes should stay
        // unchanged for now, so need a shift count of 0.

        // Use a 16-bit immediate move, which duplicates across all
        // elements, to fill even bytes of v1 with -4 (0xfc) and odd bytes
        // with 0.
        movi v1.8h, #0x00fc
        ushl v0.16b, v0.16b, v1.16b

        // now v0 = 0d de 0a ad 0b be 0e ef 01 12 03 34 05 56 07 78
        
        // Mask off all high nibbles.  Fill all bytes of v1 with 0xf.
        movi v1.16b, #0x0f
        and v0.16b, v0.16b, v1.16b

        // now v0 = 0d 0e 0a 0d 0b 0e 0e 0f 01 02 03 04 05 06 07 08

        // Substitute each byte of v0 with the corresponding byte of
        // hex_table.  So 00 -> '0', 0a -> 'a' and so on.
        ldr q1, hex_table
        tbl v0.16b, { v1.16b }, v0.16b

        // All done.  Store result in buffer.  Per tests, st1 is
        // slightly faster than the more obvious `str q0, [x0]`.
        st1 { v0.16b }, [x0]

        // Append the terminating null.
        strb wzr, [x0, #16]

        ret
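
For readers less familiar with AArch64 SIMD, the same digit extraction can be sketched as plain scalar C: walk the nibbles from most significant to least and index the same 16-entry digit table. This is a readability reference only, not the benchmarked routine; the function name u64tohex_ref is mine.

```c
#include <stdint.h>

// Scalar reference for u64tohex: extract each nibble from most
// significant to least and look it up in the same digit table the
// SIMD version uses via tbl.
void u64tohex_ref(char *buf, uint64_t v) {
    static const char digits[] = "0123456789abcdef";
    for (int i = 0; i < 16; i++)
        buf[i] = digits[(v >> (60 - 4 * i)) & 0xf];
    buf[16] = '\0';
}
```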

Here is a simple C program to exercise the function. Compile and link with gcc -O3 main.c u64tohex.s. I'm not particularly looking for feedback on the C code, though of course feel free to give some if you like.

main.c:

// Test driver for u64tohex

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <inttypes.h>
#include <time.h>

// Randomly chosen increment, so that we can rapidly cycle through
// many different test values.
#define INC 0x1928374656473829UL

extern void u64tohex(char *buf, uint64_t val);

// Compare u64tohex with sprintf, to test for correctness
static void test(uint64_t reps) {
    uint64_t v = 0;
    while (reps--) {
        char hex_out[] = "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz";
        char sprintf_out[] = "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz";
        u64tohex(hex_out, v);
        sprintf(sprintf_out, "%016" PRIx64, v);
        if (strcmp(hex_out, sprintf_out) != 0) {
            printf("Fail: bad %s, good %s\n", hex_out, sprintf_out);
            exit(1);
        }
        v += INC;
    }
    printf("Correctness tests passed\n");
}

static void benchmark(uint64_t reps) {
    printf("%" PRIu64 " reps:", reps);
    fflush(stdout);

    struct timespec start, end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);

    uint64_t v = 0;
    for (uint64_t i = 0; i < reps; i++) {
        char buf[100] __attribute__((aligned(16)));
        u64tohex(buf, v);
        v += INC;
    }

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    double elapsed = (end.tv_sec - start.tv_sec) + 1.0e-9 * (end.tv_nsec - start.tv_nsec);
    printf(" %f secs (%e reps per second)\n", elapsed, reps/elapsed);
}

int main(int argc, char *argv[]) {
    test(100000);
    benchmark(100000000);
    return 0;
}
  • Have you tried doing low/high nibble separation in GPRs? Something like this: gist.github.com/stepantubanov/1867a3662ba1098a42cf5837870c662c (not tested, may be broken). I am not sure I fully understand what zip1 does, so maybe it's not correct. Commented Mar 7, 2021 at 18:24
  • Thinking some more, I realized that one can use tbl for general permutation, so that could substitute for the rev / zip1. That opens up some approaches for doing the whole thing in SIMD, which would be desirable if we want to convert a large array of values loaded from memory, rather than a single value from an integer register. In particular we can do the shift and masks on two 64-bit inputs at once, keeping even nibbles and odd nibbles in two different registers, and then use two two-register tbl instructions to sort all the correct bytes into the correct order. Commented Mar 7, 2021 at 19:23
  • There's some overhead to load the permutations from static memory, but if we are doing this in a big loop, it can be done just once. I might play with this and post a new version of the question for this setting. (I know to leave this one as it is.) Commented Mar 7, 2021 at 19:24
  • @stepan: zip1 vA, vB, vC takes the elements from the low halves of vB and vC, interleaves them, and puts the result in vA. There is a nice picture in C7.2.403 of the Architecture Reference Manual. zip2 does the same thing but taking elements from the high halves. Commented Mar 7, 2021 at 19:27
  • Oh, I just found rev64, which is the SIMD version of rev. I think that plus zip is better than tbl. Commented Mar 7, 2021 at 22:43
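
The GPR nibble-separation idea from the first comment can be illustrated in C roughly as follows. This is my own reconstruction, not the linked gist, and the names spread_nibbles, hex8, and u64tohex_gpr are hypothetical: each 32-bit half is spread so every nibble lands in its own byte (three shift/mask steps), then all eight bytes are converted to ASCII branchlessly and byte-reversed for the little-endian store.

```c
#include <stdint.h>
#include <string.h>

// Spread the 8 nibbles of x so each occupies its own byte of the
// result, most significant nibble in the top byte.
static uint64_t spread_nibbles(uint32_t x) {
    uint64_t t = x;
    t = ((t & 0xFFFF0000u)          << 16) | (t & 0x0000FFFFu);
    t = ((t & 0x0000FF000000FF00ULL) << 8) | (t & 0x000000FF000000FFULL);
    t = ((t & 0x00F000F000F000F0ULL) << 4) | (t & 0x000F000F000F000FULL);
    return t;
}

// Convert 8 nibbles to 8 hex digits with no per-byte branches.
static void hex8(char *out, uint32_t x) {
    uint64_t t = spread_nibbles(x);
    // Per byte: (d + 6) has bit 4 set exactly when d > 9.
    uint64_t gt9 = ((t + 0x0606060606060606ULL) >> 4) & 0x0101010101010101ULL;
    // Add '0' everywhere, plus 'a' - '0' - 10 = 0x27 where d > 9.
    uint64_t ascii = t + 0x3030303030303030ULL + gt9 * 0x27;
    ascii = __builtin_bswap64(ascii);  // GCC/Clang builtin: MSB digit first in memory
    memcpy(out, &ascii, 8);
}

void u64tohex_gpr(char *buf, uint64_t v) {
    hex8(buf,     (uint32_t)(v >> 32));
    hex8(buf + 8, (uint32_t)v);
    buf[16] = '\0';
}
```

Whether this beats the SIMD version on a Cortex-A72 would need measuring; it trades the vector table lookup for a longer dependency chain of integer shifts and masks.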
