I wrote code for image convolution; it lives in my Image Convolution GitHub repository.
It covers both the general case of an arbitrary 2D kernel and the case of a separable kernel.
The code is a straightforward implementation using SSE intrinsics for vectorization and OpenMP for multi-threading. It is portable (compiles on both GCC and MSVC) and written in pure C.
I was wondering whether there is more low-hanging fruit for squeezing performance out of the code while staying in pure C (no assembly).
For instance, this is the code for the separable convolution (which always keeps the data it works on contiguous in memory):
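For context, the helper macros the listing relies on look roughly like this (a sketch only; the exact definitions in the repository may differ):

#include <xmmintrin.h> // __m128 and the _mm_*_ps() SSE intrinsics

#define SSE_STRIDE (4) // 4 single precision floats per 128 bit SSE register

// 16 byte aligned array declaration, portable between MSVC and GCC
#if defined(_MSC_VER)
#define DECLARE_ALIGN(dataType, varName, varLength) __declspec(align(16)) dataType varName[varLength]
#else
#define DECLARE_ALIGN(dataType, varName, varLength) dataType varName[varLength] __attribute__((aligned(16)))
#endif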
// ------------------------------- ImageConvolutionSeparableKernel ------------------------------ //
void ImageConvolutionSeparableKernel(float* mO, float* mI, float* mTmp, int numRows, int numCols, float* vRowKernel, int rowKernelLength, float* vColKernel, int colKernelLength)
{
    int ii, jj, kk, pxShift;
    // DECLARE_ALIGN float tmpVal[SSE_STRIDE];
    DECLARE_ALIGN(float, tmpVal, SSE_STRIDE);
    int rowKernelRadius, colKernelRadius, rowSseKernelRadius, colSseKernelRadius;
    __m128 currSum;
    __m128 currPx;
    __m128 kernelWeight;

    rowKernelRadius = rowKernelLength / 2;
    colKernelRadius = colKernelLength / 2;

    // Round the kernel radii up to a multiple of SSE_STRIDE so the edge regions
    // are covered by whole SSE vectors and the main loops start right after them.
    if ((rowKernelRadius % SSE_STRIDE)) {
        rowSseKernelRadius = rowKernelRadius + (SSE_STRIDE - (rowKernelRadius % SSE_STRIDE));
    }
    else {
        rowSseKernelRadius = rowKernelRadius;
    }
    if ((colKernelRadius % SSE_STRIDE)) {
        colSseKernelRadius = colKernelRadius + (SSE_STRIDE - (colKernelRadius % SSE_STRIDE));
    }
    else {
        colSseKernelRadius = colKernelRadius;
    }
    /*--- Start - Filtering Rows --- */
    // Unpacking data in Transpose as Pre Processing for filtering along Columns
    /*--- Left Edge Pixels --- */
    #pragma omp parallel for private(jj, currSum, kk, pxShift, kernelWeight, currPx, tmpVal)
    for (ii = 0; ii < numRows; ii++) {
        for (jj = 0; jj < rowSseKernelRadius; jj += SSE_STRIDE) {
            currSum = _mm_setzero_ps();
            for (kk = 0; kk < rowKernelLength; kk++) {
                pxShift = kk - rowKernelRadius;
                kernelWeight = _mm_set1_ps(vRowKernel[kk]);
                if ((jj + pxShift) < -2) {
                    currPx = _mm_set1_ps(mI[(ii * numCols)]);
                }
                else if ((jj + pxShift) < -1) {
                    // currPx = _mm_set_ps(mI[(ii * numCols)], mI[(ii * numCols)], mI[(ii * numCols)], mI[(ii * numCols) + 1]);
                    currPx = _mm_set_ps(mI[(ii * numCols) + 1], mI[(ii * numCols)], mI[(ii * numCols)], mI[(ii * numCols)]); // Using set data is packed in reverse compared to load!
                }
                else if ((jj + pxShift) < 0) {
                    // currPx = _mm_set_ps(mI[(ii * numCols)], mI[(ii * numCols)], mI[(ii * numCols) + 1], mI[(ii * numCols) + 2]);
                    currPx = _mm_set_ps(mI[(ii * numCols) + 2], mI[(ii * numCols) + 1], mI[(ii * numCols)], mI[(ii * numCols)]);
                }
                else {
                    currPx = _mm_loadu_ps(&mI[(ii * numCols) + jj + pxShift]);
                }
                currSum = _mm_add_ps(currSum, _mm_mul_ps(kernelWeight, currPx));
            }
            _mm_store_ps(tmpVal, currSum);
            // Unpack Data in Transpose
            for (kk = 0; kk < SSE_STRIDE; kk++) {
                mTmp[((jj + kk) * numRows) + ii] = tmpVal[kk];
            }
        }
    }
    /*--- Main Pixels --- */
    #pragma omp parallel for private(jj, currSum, kk, pxShift, kernelWeight, tmpVal)
    for (ii = 0; ii < numRows; ii++) {
        for (jj = rowSseKernelRadius; jj < (numCols - rowSseKernelRadius); jj += SSE_STRIDE) {
            currSum = _mm_setzero_ps();
            for (kk = 0; kk < rowKernelLength; kk++) {
                pxShift = kk - rowKernelRadius;
                kernelWeight = _mm_set1_ps(vRowKernel[kk]);
                //printf("Address %d\n",((int)(&mI[(ii * numCols) + jj + pxShift]) % 16));
                //printf("Address %p\n", &mI[(ii * numCols) + jj + pxShift]);
                currSum = _mm_add_ps(currSum, _mm_mul_ps(kernelWeight, _mm_loadu_ps(&mI[(ii * numCols) + jj + pxShift])));
            }
            _mm_store_ps(tmpVal, currSum);
            // Unpack Data in Transpose
            for (kk = 0; kk < SSE_STRIDE; kk++) {
                mTmp[((jj + kk) * numRows) + ii] = tmpVal[kk];
            }
        }
    }
    /*--- Right Edge Pixels --- */
    #pragma omp parallel for private(jj, currSum, kk, pxShift, kernelWeight, currPx, tmpVal)
    for (ii = 0; ii < numRows; ii++) {
        for (jj = (numCols - rowSseKernelRadius); jj < numCols; jj += SSE_STRIDE) {
            currSum = _mm_setzero_ps();
            for (kk = 0; kk < rowKernelLength; kk++) {
                pxShift = kk - rowKernelRadius;
                kernelWeight = _mm_set1_ps(vRowKernel[kk]);
                if ((jj + pxShift) > (numCols - 2)) {
                    currPx = _mm_set1_ps(mI[(ii * numCols) + numCols - 1]);
                }
                else if ((jj + pxShift) > (numCols - 3)) {
                    // currPx = _mm_set_ps(mI[(ii * numCols) + numCols - 2], mI[(ii * numCols) + numCols - 1], mI[(ii * numCols) + numCols - 1], mI[(ii * numCols) + numCols - 1]);
                    currPx = _mm_set_ps(mI[(ii * numCols) + numCols - 1], mI[(ii * numCols) + numCols - 1], mI[(ii * numCols) + numCols - 1], mI[(ii * numCols) + numCols - 2]);
                }
                else if ((jj + pxShift) > (numCols - 4)) {
                    // currPx = _mm_set_ps(mI[(ii * numCols) + numCols - 3], mI[(ii * numCols) + numCols - 2], mI[(ii * numCols) + numCols - 1], mI[(ii * numCols) + numCols - 1]);
                    currPx = _mm_set_ps(mI[(ii * numCols) + numCols - 1], mI[(ii * numCols) + numCols - 1], mI[(ii * numCols) + numCols - 2], mI[(ii * numCols) + numCols - 3]);
                }
                else {
                    currPx = _mm_loadu_ps(&mI[(ii * numCols) + jj + pxShift]);
                }
                currSum = _mm_add_ps(currSum, _mm_mul_ps(kernelWeight, currPx));
            }
            _mm_store_ps(tmpVal, currSum);
            // Unpack Data in Transpose
            for (kk = 0; kk < SSE_STRIDE; kk++) {
                mTmp[((jj + kk) * numRows) + ii] = tmpVal[kk];
            }
        }
    }
    /*--- Finish - Filtering Rows --- */

    /*--- Start - Filtering Columns --- */
    // Loading data from Transposed array for contiguous data
    /*--- Left Edge Pixels --- */
    #pragma omp parallel for private(jj, currSum, kk, pxShift, kernelWeight, currPx, tmpVal)
    for (ii = 0; ii < numCols; ii++) {
        for (jj = 0; jj < colSseKernelRadius; jj += SSE_STRIDE) {
            currSum = _mm_setzero_ps();
            for (kk = 0; kk < colKernelLength; kk++) {
                pxShift = kk - colKernelRadius;
                kernelWeight = _mm_set1_ps(vColKernel[kk]);
                if ((jj + pxShift) < -2) {
                    currPx = _mm_set1_ps(mTmp[(ii * numRows)]);
                }
                else if ((jj + pxShift) < -1) {
                    // currPx = _mm_set_ps(mTmp[(ii * numRows)], mTmp[(ii * numRows)], mTmp[(ii * numRows)], mTmp[(ii * numRows) + 1]);
                    currPx = _mm_set_ps(mTmp[(ii * numRows) + 1], mTmp[(ii * numRows)], mTmp[(ii * numRows)], mTmp[(ii * numRows)]);
                }
                else if ((jj + pxShift) < 0) {
                    // currPx = _mm_set_ps(mTmp[(ii * numRows)], mTmp[(ii * numRows)], mTmp[(ii * numRows) + 1], mTmp[(ii * numRows) + 2]);
                    currPx = _mm_set_ps(mTmp[(ii * numRows) + 2], mTmp[(ii * numRows) + 1], mTmp[(ii * numRows)], mTmp[(ii * numRows)]);
                }
                else {
                    currPx = _mm_loadu_ps(&mTmp[(ii * numRows) + jj + pxShift]);
                }
                currSum = _mm_add_ps(currSum, _mm_mul_ps(kernelWeight, currPx));
            }
            _mm_store_ps(tmpVal, currSum);
            // Unpack Data in Transpose
            for (kk = 0; kk < SSE_STRIDE; kk++) {
                mO[((jj + kk) * numCols) + ii] = tmpVal[kk];
            }
        }
    }
    /*--- Main Pixels --- */
    #pragma omp parallel for private(jj, currSum, kk, pxShift, kernelWeight, tmpVal)
    for (ii = 0; ii < numCols; ii++) {
        for (jj = colSseKernelRadius; jj < (numRows - colSseKernelRadius); jj += SSE_STRIDE) {
            currSum = _mm_setzero_ps();
            for (kk = 0; kk < colKernelLength; kk++) {
                pxShift = kk - colKernelRadius;
                kernelWeight = _mm_set1_ps(vColKernel[kk]);
                currSum = _mm_add_ps(currSum, _mm_mul_ps(kernelWeight, _mm_loadu_ps(&mTmp[(ii * numRows) + jj + pxShift])));
            }
            _mm_store_ps(tmpVal, currSum);
            // Unpack Data in Transpose
            for (kk = 0; kk < SSE_STRIDE; kk++) {
                mO[((jj + kk) * numCols) + ii] = tmpVal[kk];
            }
        }
    }
    /*--- Right Edge Pixels --- */
    #pragma omp parallel for private(jj, currSum, kk, pxShift, kernelWeight, currPx, tmpVal)
    for (ii = 0; ii < numCols; ii++) {
        for (jj = (numRows - colSseKernelRadius); jj < numRows; jj += SSE_STRIDE) {
            currSum = _mm_setzero_ps();
            for (kk = 0; kk < colKernelLength; kk++) {
                pxShift = kk - colKernelRadius;
                kernelWeight = _mm_set1_ps(vColKernel[kk]);
                if ((jj + pxShift) > (numRows - 2)) {
                    currPx = _mm_set1_ps(mTmp[(ii * numRows) + numRows - 1]);
                }
                else if ((jj + pxShift) > (numRows - 3)) {
                    // currPx = _mm_set_ps(mTmp[(ii * numRows) + numRows - 2], mTmp[(ii * numRows) + numRows - 1], mTmp[(ii * numRows) + numRows - 1], mTmp[(ii * numRows) + numRows - 1]);
                    currPx = _mm_set_ps(mTmp[(ii * numRows) + numRows - 1], mTmp[(ii * numRows) + numRows - 1], mTmp[(ii * numRows) + numRows - 1], mTmp[(ii * numRows) + numRows - 2]);
                }
                else if ((jj + pxShift) > (numRows - 4)) {
                    // currPx = _mm_set_ps(mTmp[(ii * numRows) + numRows - 3], mTmp[(ii * numRows) + numRows - 2], mTmp[(ii * numRows) + numRows - 1], mTmp[(ii * numRows) + numRows - 1]);
                    currPx = _mm_set_ps(mTmp[(ii * numRows) + numRows - 1], mTmp[(ii * numRows) + numRows - 1], mTmp[(ii * numRows) + numRows - 2], mTmp[(ii * numRows) + numRows - 3]);
                }
                else {
                    currPx = _mm_loadu_ps(&mTmp[(ii * numRows) + jj + pxShift]);
                }
                currSum = _mm_add_ps(currSum, _mm_mul_ps(kernelWeight, currPx));
            }
            _mm_store_ps(tmpVal, currSum);
            // Unpack Data in Transpose
            for (kk = 0; kk < SSE_STRIDE; kk++) {
                mO[((jj + kk) * numCols) + ii] = tmpVal[kk];
            }
        }
    }
}
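For reference, this is roughly how I invoke it; the image size, the kernel taps, and the plain malloc() allocation below are only illustrative (the driver code in the repository is different):

#include <stdlib.h>

// Prototype of the function listed above
void ImageConvolutionSeparableKernel(float* mO, float* mI, float* mTmp, int numRows, int numCols, float* vRowKernel, int rowKernelLength, float* vColKernel, int colKernelLength);

int main(void)
{
    int numRows = 480;
    int numCols = 640;
    // 5 tap Gaussian like kernel, used for both rows and columns (illustrative values)
    float vKernel[5] = { 0.0625f, 0.25f, 0.375f, 0.25f, 0.0625f };

    float* mI   = (float*)malloc(numRows * numCols * sizeof(float));
    float* mO   = (float*)malloc(numRows * numCols * sizeof(float));
    float* mTmp = (float*)malloc(numRows * numCols * sizeof(float)); // Holds the transposed intermediate result

    // ... Fill mI with image data ...

    ImageConvolutionSeparableKernel(mO, mI, mTmp, numRows, numCols, vKernel, 5, vKernel, 5);

    free(mTmp);
    free(mO);
    free(mI);

    return 0;
}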

Comments:

Where are DECLARE_ALIGN, SSE_STRIDE and __m128 defined? Not to mention the numerous _mm_*() functions.

The _mm_*() functions are SSE intrinsics (they come with any modern compiler). DECLARE_ALIGN is a macro to define static variables with 16-byte alignment.

#pragma omp parallel starts a parallel section, which means it creates threads. I would have written #pragma omp parallel once at the top, then #pragma omp for to run each of the loops in parallel. Though I would hope that the compiler optimizes out the creation/destruction of threads, creating threads is expensive, so it is worth comparing.
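If I understand the last comment correctly, the idea is to open the parallel region once and let the same thread team run every loop. A toy sketch of that pattern (illustrative only, not the actual convolution code, and not benchmarked):

#include <stdio.h>

#define N 1024

int main(void)
{
    static float vA[N], vB[N];
    int ii;

    // One 'parallel' region: the thread team is created once ...
    #pragma omp parallel
    {
        // ... and reused by each work sharing loop inside it.
        #pragma omp for
        for (ii = 0; ii < N; ii++) {
            vA[ii] = (float)ii;
        }

        // Implicit barrier after the loop above, then the same threads run the next loop.
        #pragma omp for
        for (ii = 0; ii < N; ii++) {
            vB[ii] = 2.0f * vA[ii];
        }
    }

    printf("%f\n", vB[N - 1]);

    return 0;
}

In the convolution function that would mean one #pragma omp parallel wrapping all six loops, each loop carrying its own #pragma omp for, instead of six separate parallel for regions; whether it actually helps (versus the OpenMP runtime reusing its thread pool) is something I would have to measure.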