Implementing a 1D Convolution SIMD Friendly in Julia

Question

I want to implement a 1D convolution in Julia using the direct calculation since the conv() function in DSP.jl uses DFT (fft) based methods.

In order to make it fast, I'd like the code to be SIMD friendly (I am OK with using LoopVectorization.jl).

The trivial code is:

using BenchmarkTools;

function _Conv1D!( vO :: Array{T, 1}, vA :: Array{T, 1}, vB :: Array{T, 1} ) :: Array{T, 1} where {T <: Real}

    lenA = length(vA);
    lenB = length(vB);

    fill!(vO, zero(T));
    for idxB in 1:lenB
        @simd for idxA in 1:lenA
            @inbounds vO[idxA + idxB - 1] += vA[idxA] * vB[idxB];
        end
    end

    return vO;

end

function _Conv1D( vA :: Array{T, 1}, vB :: Array{T, 1} ) :: Array{T, 1} where {T <: Real}

    lenA = length(vA);
    lenB = length(vB);

    vO = Array{T, 1}(undef, lenA + lenB - 1);

    return _Conv1D!(vO, vA, vB);

end

numSamplesA = 1000;
numSamplesB = 15;

vA = rand(numSamplesA);
vB = rand(numSamplesB);
vO = rand(numSamplesA + numSamplesB - 1);

@benchmark _Conv1D($vA, $vB)

I get:

BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.522 μs … 558.844 μs  ┊ GC (min … max):  0.00% … 99.28%
 Time  (median):     3.333 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   4.321 μs ±  13.718 μs  ┊ GC (mean ± σ):  10.38% ±  3.67%

     ██▆        ▂    
  ▃▄▃███▆▅▇▆▄▂▃▇██▄▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▃▃▃▃▂▂▁▁▁▁▁▁▁▁ ▂
  2.52 μs         Histogram: frequency by time        8.51 μs <

 Memory estimate: 8.06 KiB, allocs estimate: 1.

I read Chris Elrod's post Orthogonalize Indices. Basically I tried removing the inner loop and got something like:

function Conv1D!( vO :: Array{T, 1}, vA :: Array{T, 1}, vB :: Array{T, 1} ) :: Array{T, 1} where {T <: Real}

    lenA = length(vA);
    lenB = length(vB);
    lenO = length(vO);
    vC   = view(vB, lenB:-1:1);
    @simd for ii in 1:lenO
        # Rolling vB over vA
        startIdxA = max(1, ii - lenB + 1);
        endIdxA   = min(lenA, ii);
        startIdxC = max(lenB - ii + 1, 1);
        endIdxC   = min(lenB, lenO - ii + 1);
        # println("startA = $startIdxA, endA = $endIdxA, startC = $startIdxC, endC = $endIdxC");
        @inbounds vO[ii] = sum(view(vA, startIdxA:endIdxA) .* view(vC, startIdxC:endIdxC));
    end

    return vO;

end

function Conv1D( vA :: Array{T, 1}, vB :: Array{T, 1} ) :: Array{T, 1} where {T <: Real}

    lenA = length(vA);
    lenB = length(vB);

    vO = Array{T, 1}(undef, lenA + lenB - 1);

    return Conv1D!(vO, vA, vB);

end

numSamplesA = 1000;
numSamplesB = 15;

vA = rand(numSamplesA);
vB = rand(numSamplesB);
vO = rand(numSamplesA + numSamplesB - 1);

@benchmark Conv1D($vA, $vB)

Yet I get much much slower results.

Is there anything I can do to improve results any farther? Maybe something to make the code much more SIMD friendly?

A Godbold link for _Conv1D!(): https://godbolt.org/z/e8W7e473h.

Remark: Originally, in Conv1D!(), I used @inbounds vO[ii] = dot(view(vA, startIdxA:endIdxA), view(vC, startIdxC:endIdxC));. It was slower.

You didn't include a godbolt.org link. Can you describe how the generated code for v2 differs from the (faster) v1 code? — J_H
– J_H, Commented Apr 18, 2023 at 16:53
@J_H, I wish I knew how to analyze the lower level code :-). I guess than I'd be able to do the next step and improve it. — Royi
– Royi, Commented Apr 18, 2023 at 18:08
I was suggesting that if you offer godbolt links to the generated code, you'll make it easier for a diverse audience of contributors to compare them and to comment on them. If you accompany that with remarks on any insights you've gleaned, so much the better. — J_H
– J_H, Commented Apr 18, 2023 at 18:12
I see your point. Embraced it: godbolt.org/z/e8W7e473h. This is the link. — Royi
– Royi, Commented Apr 19, 2023 at 6:12
You have only provided the godbolt link for one of your test cases, likewise for your benchmarks. Running your second benchmark with benchmarking makes it obvious why: A: 195.214 μs (0 allocations: 0 bytes) B: 1.700 ms (1999 allocations: 7.90 MiB) You're generating a load of temporary arrays. — DeathIncarnate
– DeathIncarnate, Commented Apr 19, 2023 at 9:53

Royi · Accepted Answer · 2023-07-10 10:25:46Z

Based on code by Chris Elrod I managed to come to this:

function Conv1D!( vO :: Vector{T}, vA :: Vector{T}, vB :: Vector{T} ) where {T <: Real}

    J = length(vA);
    K = length(vB); #<! Assumed to be the Kernel
    
    # Optimized for the case the kernel is in vB (Shorter)
    J < K && return Conv1D!(vO, vB, vA);
    
    I = J + K - 1; #<! Output length
    
    @turbo for ii in 1:(K - 1) #<! Head
        sumVal = zero(T);
        for kk in 1:K
            ib0 = (ii >= kk);
            oa = ib0 ? vA[ii - kk + 1] : zero(T);
            sumVal += vB[kk] * oa;
        end
        vO[ii] = sumVal;
    end
    @turbo inline=true for ii in K:(J - 1) #<! Middle
        sumVal = zero(T);
        for kk in 1:K
            sumVal += vB[kk] * vA[ii - kk + 1];
        end
        vO[ii] = sumVal;
    end
    @turbo for ii in J:I #<! Tail
        sumVal = zero(T);
        for kk in 1:K
            ib0 = (ii < J + kk);
            oa = ib0 ? vA[ii - kk + 1] : zero(T);
            sumVal += vB[kk] * oa;
        end
        vO[ii] = sumVal;
    end
    return vO
end

This code will efficiently use the SIMD capabilities of the CPU (Assuming x64 CPU).

The main idea is to replace the series of if with 3 optimized cases for beginning of the signal, middle and the end.
This makes the code SIMD friendly.

Pay attention that it won't work well with @simd instead of @turbo.

Thanks for posting this code. It's a good idea to summarise which changes you made, and why - a self-answer ought to review the code, just like any other answer. — Toby Speight
– Toby Speight, Commented May 18, 2023 at 14:06

Stack Exchange Network

Implementing a 1D Convolution SIMD Friendly in Julia

1 Answer 1

You must log in to answer this question.

Hot Network Questions

Implementing a 1D Convolution SIMD Friendly in Julia

1 Answer 1

You must log in to answer this question.

Related

Hot Network Questions