AVX2 method to average 4x4 block of UINT16s

Consider the code snippet:

#include <stdint.h>

/* Average the 4x4 block whose top-left corner is mat[yy][xx]. */
uint16_t ave_uint16_4x4_matrix(int xx, int yy, uint16_t **mat)
{
    uint32_t tot = 0U;

    // Sum 4 by 4 square
    for (int yi = yy; yi < yy + 4; yi++) {
        for (int xi = xx; xi < xx + 4; xi++) {
            tot += mat[yi][xi];
        }
    }
    tot >>= 4;  // bit shift for a quick divide: tot /= 16 (truncates)
    return (uint16_t)tot;
}

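The intent is to average a 4x4 block of a larger uint16_t matrix and return a single value. The sum of sixteen 16-bit values is at most 16 × 65535 = 1,048,560, so it fits in the uint32_t accumulator, and the average always fits in a uint16_t (the code itself is not tested).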

AVX2 provides the 256-bit SIMD registers YMM0-YMM15. One value could be placed in each, but since 16 × 16 bits = 256 bits, all 16 values could instead be packed into a single register.

The best AVX2 approach I can find would take at least 4 whacks at it. Is there a better way to add up all 16 uint16_t values and store the result in a 32-bit integer?
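For concreteness, here is one possible sketch of the "pack all 16 values into one register, then sum" idea. It is not claimed to be the best or fastest approach and has not been benchmarked. The names sum_4x4_avx2 and ave_uint16_4x4_simd are hypothetical, the code assumes GCC or Clang (for the target attribute and __builtin_cpu_supports), and it assumes every row holds at least xx+4 elements. Because mat is an array of row pointers, the four rows still need four separate 64-bit loads.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical helper: AVX2 sum of the 4x4 block at mat[yy][xx]. */
__attribute__((target("avx2")))
static uint32_t sum_4x4_avx2(int xx, int yy, uint16_t **mat)
{
    /* Rows are separate arrays, so load 4 x uint16_t (64 bits) from each. */
    __m128i r0 = _mm_loadl_epi64((const __m128i *)&mat[yy + 0][xx]);
    __m128i r1 = _mm_loadl_epi64((const __m128i *)&mat[yy + 1][xx]);
    __m128i r2 = _mm_loadl_epi64((const __m128i *)&mat[yy + 2][xx]);
    __m128i r3 = _mm_loadl_epi64((const __m128i *)&mat[yy + 3][xx]);

    /* Pack rows 0,1 and rows 2,3 into 128-bit halves, then one YMM register. */
    __m128i r01 = _mm_unpacklo_epi64(r0, r1);
    __m128i r23 = _mm_unpacklo_epi64(r2, r3);
    __m256i v = _mm256_inserti128_si256(_mm256_castsi128_si256(r01), r23, 1);

    /* Widen to 32 bits as unsigned: even lanes by mask, odd lanes by shift. */
    __m256i even = _mm256_and_si256(v, _mm256_set1_epi32(0x0000FFFF));
    __m256i odd  = _mm256_srli_epi32(v, 16);
    __m256i s8   = _mm256_add_epi32(even, odd);  /* 8 partial sums */

    /* Horizontal reduction: 8 -> 4 -> 2 -> 1 partial sums. */
    __m128i s4 = _mm_add_epi32(_mm256_castsi256_si128(s8),
                               _mm256_extracti128_si256(s8, 1));
    s4 = _mm_add_epi32(s4, _mm_shuffle_epi32(s4, _MM_SHUFFLE(1, 0, 3, 2)));
    s4 = _mm_add_epi32(s4, _mm_shuffle_epi32(s4, _MM_SHUFFLE(2, 3, 0, 1)));
    return (uint32_t)_mm_cvtsi128_si32(s4);
}
#endif

/* Hypothetical wrapper: uses AVX2 when the CPU has it, else the scalar loop. */
uint16_t ave_uint16_4x4_simd(int xx, int yy, uint16_t **mat)
{
    uint32_t tot = 0U;
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2"))
        return (uint16_t)(sum_4x4_avx2(xx, yy, mat) >> 4);
#endif
    for (int yi = yy; yi < yy + 4; yi++)
        for (int xi = xx; xi < xx + 4; xi++)
            tot += mat[yi][xi];
    return (uint16_t)(tot >> 4);
}
```

The mask-and-shift widening is used instead of _mm256_madd_epi16 because the latter treats its inputs as signed 16-bit values, which would give wrong sums for elements of 32768 and above. With only 16 inputs, the four loads and the reduction may well cost as much as the scalar loop, so this is worth measuring before adopting.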

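VPAVGW (_mm256_avg_epu16) computes the element-wise rounded average of two vectors, so on its own it does not reduce the values within one register.
Target: Skylake i7-6700K, GCC on Linux.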
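A note on why VPAVGW is not a drop-in replacement for the sum-and-shift: each lane computes (a + b + 1) >> 1, so it rounds up at every step, whereas tot >> 4 truncates once at the end. The scalar model below is only an illustration of that lane arithmetic; avg_epu16_lane, avg_tree_16 and mean_trunc_16 are hypothetical names, and no intrinsics are used.

```c
#include <stdint.h>

/* Scalar model of one VPAVGW lane: rounded average (a + b + 1) >> 1. */
static uint16_t avg_epu16_lane(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a + (uint32_t)b + 1U) >> 1);
}

/* Hypothetical: reduce 16 values with four rounds of pairwise rounded averages. */
uint16_t avg_tree_16(const uint16_t v[16])
{
    uint16_t t[16];
    for (int i = 0; i < 16; i++)
        t[i] = v[i];
    for (int n = 16; n > 1; n /= 2)          /* 16 -> 8 -> 4 -> 2 -> 1 */
        for (int i = 0; i < n / 2; i++)
            t[i] = avg_epu16_lane(t[2 * i], t[2 * i + 1]);
    return t[0];
}

/* Reference: sum in 32 bits, then a single truncating shift by 4. */
uint16_t mean_trunc_16(const uint16_t v[16])
{
    uint32_t tot = 0U;
    for (int i = 0; i < 16; i++)
        tot += v[i];
    return (uint16_t)(tot >> 4);
}
```

For an input of one 1 and fifteen 0s, the averaging tree carries the 1 through every round, while the truncating mean is 0, so the two methods can disagree.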