Often the question of the fastest possible way presupposes a task that needs to be done many times, in a time-critical loop.
In that case the fastest method can be an iterative one that works pairwise, amortizing some of the work between iterations.
inline vec update(vec context, vec data) {
    vec even = get_evens(context, data);
    vec odd  = get_odds(context, data);
    return vertical_operation(even, odd);
}

void my_algo(vec *data, int N, vec_element_type *out) {
    vec context{0,0,0,0};
    // prime the pipeline with the first block
    context = update(context, data[0]);
    for (int i = 0; i < N-1; i++) {
        context = update(context, data[i+1]);
        out[i] = extract_lane(context, 1);
    }
    // flush the pipeline: the contents of the last input do not matter
    context = update(context, anything);
    out[N-1] = extract_lane(context, 1);
}
The wanted sum appears in the second element (index 1) of the accumulator one iteration after the corresponding block was fed in, while the first element (index 0) accumulates the total reduction of all elements so far.
Reduct    = [ -- ][ -- ][ -- ][ -- ]
New input = [ i0 ][ i1 ][ i2 ][ i3 ]
evens     = [ -- ][ -- ][ i0 ][ i2 ]
odds      = [ -- ][ -- ][ i1 ][ i3 ]
------- vertical arithmetic reduction ----
Reduct    = [ -- ][ -- ][ 01 ][ 23 ]
New input = [ 4  ][ 5  ][ 6  ][ 7  ]
evens     = [ -- ][ 01 ][ 4  ][ 6  ]
odds      = [ -- ][ 23 ][ 5  ][ 7  ]
------- vertical arithmetic reduction ----
Reduct    = [ -- ][0123][ 45 ][ 67 ]
New input = [ 8  ][ 9  ][ a  ][ b  ]
evens     = [ -- ][ 45 ][ 8  ][ a  ]
odds      = [0123][ 67 ][ 9  ][ b  ]
------- vertical arithmetic reduction ----
Reduct    = [0123][4567][ 89 ][ ab ]
I have doubts whether this would prove faster for a vector length of 3 or 4 than the method presented by Mr Cordes; however, for 16-bit or 8-bit data this method should prove worthwhile. Of course one then needs to perform 3 or 4 rounds respectively before the result can be acquired.