Often the question of the fastest possible way presupposes a task that needs to be done many times, in a time-critical loop.
In that case the fastest method can be an iterative one that works pairwise, amortizing some of the work between iterations.
inline vec update(vec context, vec data) {
    vec even = get_evens(context, data);
    vec odd  = get_odds(context, data);
    return vertical_operation(even, odd);
}

void my_algo(vec *data, int N, vec_element_type *out) {
    vec context{0,0,0,0};
    // prime the pipeline with the first block
    context = update(context, data[0]);
    for (int i = 0; i < N-1; i++) {
        context = update(context, data[i+1]);
        out[i] = extract_lane(context, 1);
    }
    // flush the pipeline: the contents of the last input do not matter
    context = update(context, anything);
    out[N-1] = extract_lane(context, 1);
}
The wanted sum appears in the second element (index 1) of the accumulator one iteration after the corresponding block was fed in, while the first element (index 0) accumulates the total reduction of all elements so far.
Reduct    = [ -- ][ -- ][ -- ][ -- ]
New input = [ i0 ][ i1 ][ i2 ][ i3 ]
evens     = [ -- ][ -- ][ i0 ][ i2 ]
odds      = [ -- ][ -- ][ i1 ][ i3 ]
------- vertical arithmetic reduction ----
Reduct    = [ -- ][ -- ][ 01 ][ 23 ]
New input = [ 4  ][ 5  ][ 6  ][ 7  ]
evens     = [ -- ][ 01 ][ 4  ][ 6  ]
odds      = [ -- ][ 23 ][ 5  ][ 7  ]
------- vertical arithmetic reduction ----
Reduct    = [ -- ][0123][ 45 ][ 67 ]
New input = [ 8  ][ 9  ][ a  ][ b  ]
evens     = [ -- ][ 45 ][ 8  ][ a  ]
odds      = [0123][ 67 ][ 9  ][ b  ]
------- vertical arithmetic reduction ----
Reduct    = [0123][4567][ 89 ][ ab ]
I have doubts whether this would prove faster for a vector length of 3 or 4 than the method presented by Mr Cordes; however, for 16-bit or 8-bit data this method should prove worthwhile. Of course one then needs to perform 3 or 4 rounds respectively before the result can be acquired.