Consider the code snippet:
#include <stdint.h>

uint16_t ave_uint16_4x4_matrix(int xx, int yy, uint16_t **mat) {
    uint32_t tot = 0U;
    int xi, yi; // signed, to match xx and yy in the loop bounds
    // Sum the 4x4 square whose top-left corner is (xx, yy)
    for (yi = yy; yi < yy + 4; yi++) {
        for (xi = xx; xi < xx + 4; xi++) {
            tot += mat[yi][xi];
        }
    }
    tot >>= 4; // Shift right by 4: quick divide by 16
    return (uint16_t)tot;
}
The intent is to average a 4x4 block of a larger uint16_t matrix and produce a single result. The sum of 16 such values needs at most 20 bits, so after dividing by 16 the average should fit back in a uint16_t (not tested).
AVX2 provides 16 SIMD registers, YMM0-YMM15, so one value could be placed in each. Alternatively, since each register is 256 bits wide, all 16 values (16 x 16 bits) could be packed into a single register.
The best AVX2 approach I can find would take at least 4 whacks at it. Is there a better way to add up all 16 uint16s and store the result in an int?
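For reference, here is a rough sketch of what packing all 16 values into one YMM register could look like. It is only one possible arrangement, not a benchmarked answer. It assumes GCC or Clang on x86 with an AVX2-capable CPU, and the function name ave_uint16_4x4_avx2 is made up for illustration. Because mat is a uint16_t **, the four rows are not contiguous, so each row needs its own 64-bit load. The 16-bit lanes are widened to 32 bits before adding so the sum cannot overflow.

```c
#include <immintrin.h>
#include <stdint.h>

/* Sketch only: average a 4x4 block of uint16_t using AVX2.
   Assumes GCC/Clang (target attribute) and a CPU that supports AVX2. */
__attribute__((target("avx2")))
uint16_t ave_uint16_4x4_avx2(int xx, int yy, uint16_t **mat)
{
    /* Each row of the block is 4 x uint16_t = 64 bits. */
    __m128i r0 = _mm_loadl_epi64((const __m128i *)&mat[yy + 0][xx]);
    __m128i r1 = _mm_loadl_epi64((const __m128i *)&mat[yy + 1][xx]);
    __m128i r2 = _mm_loadl_epi64((const __m128i *)&mat[yy + 2][xx]);
    __m128i r3 = _mm_loadl_epi64((const __m128i *)&mat[yy + 3][xx]);

    /* Pack rows 0,1 and rows 2,3 into 128-bit halves, then into one YMM. */
    __m128i lo = _mm_unpacklo_epi64(r0, r1);
    __m128i hi = _mm_unpacklo_epi64(r2, r3);
    __m256i v  = _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);

    /* Widen: split each 32-bit lane into its low and high 16-bit element,
       then add them as 32-bit integers (8 partial sums, no overflow). */
    __m256i even = _mm256_and_si256(v, _mm256_set1_epi32(0x0000FFFF));
    __m256i odd  = _mm256_srli_epi32(v, 16);
    __m256i s8   = _mm256_add_epi32(even, odd);

    /* Horizontal reduction: 8 -> 4 -> 2 -> 1 partial sums. */
    __m128i s4 = _mm_add_epi32(_mm256_castsi256_si128(s8),
                               _mm256_extracti128_si256(s8, 1));
    __m128i s2 = _mm_add_epi32(s4, _mm_shuffle_epi32(s4, _MM_SHUFFLE(1, 0, 3, 2)));
    __m128i s1 = _mm_add_epi32(s2, _mm_shuffle_epi32(s2, _MM_SHUFFLE(2, 3, 0, 1)));

    uint32_t tot = (uint32_t)_mm_cvtsi128_si32(s1);
    return (uint16_t)(tot >> 4); /* divide by 16, truncating like the scalar code */
}
```

Most of the work here is in gathering the rows and in the horizontal reduction, so whether this actually beats the scalar loop for a single 4x4 block would need measuring.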
VPAVGW (_mm256_avg_epu16) computes the element-wise average of two vectors.
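To make the semantics concrete, here is a small sketch of what that instruction does. It assumes GCC or Clang on x86 with AVX2, and the helper name avg_pairs_u16x16 is hypothetical. Each output lane is (a[i] + b[i] + 1) >> 1, so it averages matching lanes of two registers and rounds up. It does not average the 16 lanes within one register, and chaining several rounded averages can give a different result from the truncating tot >> 4 in the scalar code.

```c
#include <immintrin.h>
#include <stdint.h>

/* Sketch only: element-wise rounded average of two arrays of 16 uint16_t.
   out[i] = (a[i] + b[i] + 1) >> 1, computed without 16-bit overflow.
   Assumes GCC/Clang (target attribute) and a CPU that supports AVX2. */
__attribute__((target("avx2")))
void avg_pairs_u16x16(const uint16_t a[16], const uint16_t b[16],
                      uint16_t out[16])
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_avg_epu16(va, vb));
}
```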
Platform: Skylake i7-6700K, GCC on Linux.