The following is the current iteration of the Gaussian blur approximation code I am using.
A typical naive convolution is O(N*M), where N is the number of image pixels and M is the number of kernel pixels.
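For reference, the naive version is essentially a quadruple loop along these lines (an illustrative scalar sketch, not my actual implementation; the names and the clamp-to-edge handling are just for the example):

#include <algorithm>

//Illustrative only: every output pixel visits every kernel tap, so the cost is O(N*M).
void NaiveConvolve(const float* src, float* dst, int width, int height,
                   const float* kernel, int kSize)
{
    int half = kSize / 2;
    for (int y = 0; y < height; ++y)
    {
        for (int x = 0; x < width; ++x)
        {
            float sum = 0.0f;
            for (int ky = -half; ky <= half; ++ky)
            {
                for (int kx = -half; kx <= half; ++kx)
                {
                    int sx = std::clamp(x + kx, 0, width - 1);  //clamp reads to the image edge
                    int sy = std::clamp(y + ky, 0, height - 1);
                    sum += src[sy * width + sx] * kernel[(ky + half) * kSize + (kx + half)];
                }
            }
            dst[y * width + x] = sum;
        }
    }
}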
I am using three tricks to avoid this penalty:
- Three iterations of a simple box blur look very similar to a Gaussian blur.
- A box blur is separable: a 1D horizontal pass followed by a 1D vertical pass is equivalent to the full 2D box convolution.
- A box blur kernel consists of equal values, so instead of convolving the entire kernel at every pixel we can keep a running sum and simply add one pixel at the head and subtract one pixel at the tail (sketched below).
This makes each pass an O(N) operation, independent of the kernel size M.
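In scalar form, one horizontal pass of that running-sum box blur boils down to the following (a simplified single-channel sketch assuming an odd kernel size, with edge handling approximated by clamping; not the SSE code itself):

#include <algorithm>

//One 1D box-blur pass via a running sum: each output pixel costs one add and
//one subtract, regardless of the kernel width.
void BoxBlurRow(const float* src, float* dst, int width, int size)
{
    int half = size / 2;               //size is assumed odd: size == 2 * half + 1
    float sum = 0.0f;
    //prime the window for x == 0, clamping reads to the row edges
    for (int k = -half; k <= half; ++k)
        sum += src[std::clamp(k, 0, width - 1)];
    for (int x = 0; x < width; ++x)
    {
        dst[x] = sum / size;
        //slide the window: add the pixel entering on the right,
        //subtract the pixel leaving on the left
        sum += src[std::clamp(x + half + 1, 0, width - 1)];
        sum -= src[std::clamp(x - half, 0, width - 1)];
    }
}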
On top of this, I am employing both SSE intrinsics and parallel execution to achieve my goal.
This results in impressive performance compared to the naive implementation, but in spite of these tricks I still cannot get the speed to the level I'd like. Do you have any more ideas?
For brevity I have included only the horizontal pass. The vertical pass is algorithmically identical.
void FastBlurImageFilter::ProcessHorizontal(const ImagePixelBuffer &source)
{
    int size = this->size; //the size of the kernel.
    //three unnormalized box passes each sum `size` samples, so the final scale factor is size^3
    __m128 numPixels = _mm_set1_ps((float)size);
    __m128 divisor = _mm_mul_ps(numPixels, _mm_mul_ps(numPixels, numPixels));
    int halfSize = size / 2;
    //the edge pixel is replicated halfSize times to seed the sliding window
    __m128 filler = _mm_set1_ps((float)halfSize);
    int w = source.Width() - size;
    int h = source.Height();
    int sW = source.Width();
    //thread-local cache vectors
    Concurrency::combinable<std::vector<__m128>> sCache;
    Concurrency::combinable<std::vector<__m128>> dCache;
    Concurrency::parallel_for(0, h, [&] (int y)
    {
        float* sRowF = reinterpret_cast<float*>(source.Row(y));
        //grab the thread-local vectors by reference so the per-thread storage is actually reused
        std::vector<__m128> &s = sCache.local();
        if (s.size() < static_cast<size_t>(sW)) s.resize(sW);
        std::vector<__m128> &d = dCache.local();
        if (d.size() < static_cast<size_t>(sW)) d.resize(sW);
        __m128* sRow = s.data();
        __m128* dRow = d.data();
        //load the source pixels into a cache vector
        for (int x = 0; x < sW; x++, sRowF += 4)
        {
            sRow[x] = _mm_load_ps(sRowF);
        }
        //3 iterations to approximate a gaussian blur
        for (int i = 0; i < 3; ++i)
        {
            __m128* sPixel = sRow;
            __m128* dPixel = dRow;
            __m128 firstValue = *sPixel;
            __m128 pixel = _mm_mul_ps(firstValue, filler);
            __m128* nextStop = sPixel + halfSize;
            //process the first pixel of the row
            while (sPixel <= nextStop)
            {
                pixel = _mm_add_ps(pixel, *(sPixel++));
            }
            *(dPixel++) = pixel;
            //slide the window: add the incoming pixel, subtract the outgoing one
            #define DOPIXEL(vA,vB) { pixel = _mm_add_ps(pixel, _mm_sub_ps(vB, vA)); *(dPixel++) = pixel; }
            //process the pixels up until half of the kernel size
            nextStop = sPixel + halfSize;
            while (sPixel < nextStop)
            {
                DOPIXEL(firstValue, *(sPixel++));
            }
            //process the middle pixels (usually the biggest part)
            __m128* tailPixel = sPixel - size;
            nextStop = sPixel + w;
            while (sPixel < nextStop)
            {
                DOPIXEL(*(tailPixel++), *(sPixel++));
            }
            //process the last halfSize pixels
            __m128 lastValue = *(sPixel - 1);
            nextStop = dPixel + halfSize;
            while (dPixel < nextStop)
            {
                DOPIXEL(*(tailPixel++), lastValue);
            }
            #undef DOPIXEL
            //swap the caches for the next iteration
            __m128* temp = sRow;
            sRow = dRow;
            dRow = temp;
        }
        //store the pixel values back into the image buffer
        __m128* first = d.data();
        __m128* last = first + sW;
        float* final = reinterpret_cast<float*>(source.Row(y));
        while (first < last)
        {
            _mm_store_ps(final, _mm_div_ps(*(first++), divisor));
            final += 4;
        }
    });
}
This is my first Code Review post, so please let me know if there is anything I should add to my question.