Fast, small area and low latency selection/partial sorting algorithm

Question

I'm looking for a fast way to do a partial sort of 81 numbers - Ideally I'm looking to extract the lowest 16 values (its not necessary for the 16 to be in the absolutely correct order).

The target for this is dedicated hardware in an FPGA - so this slightly complicated matters as I want the area of the resultant implementation as small as possible. I looked at and implemented the odd-even merge sort algorithm, but I'm ideally looking for anything that might be more efficient for my needs (trade algorithm implementation size for a partial sort giving lowest 16, not necessarily in order as opposed to a full sort).

Do you need the lowest 16 values, or are you ok if there are some bigger values in there? — Georg Schölly
– Georg Schölly, Commented Nov 9, 2012 at 10:34
I'm ok with some larger values - I might just take the lowest 24 values to compensate for this — trican
– trican, Commented Nov 9, 2012 at 10:40
Since you are sorting for a few number of elements, you should take counting sort into account, which costs you in nearly linear time. — ylc
– ylc, Commented Nov 9, 2012 at 10:56
How big are the numbers? If they are small, a radix sort with a radix of 16 might be good for you. I used such a sort for a list of less than 256 numbers of small size (short ints). It required only 16 buckets, which fit into a SSE register, which almost completely got rid of the initialization overhead. — Gunther Piez
– Gunther Piez, Commented Nov 9, 2012 at 11:21
You title asks for fast but your question asks for minimum area. Do you need to know the index of the min values, or just the min values? Is all the data available at once? — user597225
– user597225, Commented Nov 9, 2012 at 19:16

Martin Thompson · Accepted Answer · 2012-11-12 12:36:54Z

8

This can be done in O(nlog_k) time, where in your case n = 81 and k = 16. This is much efficiently when you dealing with big number of elements than O(nlog_n).

Algorithm is following:

Init max heap data structure. This heap will contain your top 16. Size of this heap will not exceed 16 and it has log(k) for insertion and deletion
Iterate through list of n - numbers. For each n:
- if heap has less than 16 elements, just add it
- if heap has 16 elements, compare n with max from heap (if it is greater than max just skip, if less than max, remove max and add n.)

This way at every iteration you keep tracking smallest k numbers from currently processed part of list.

edited Nov 12, 2012 at 12:36

Martin Thompson

16.9k1 gold badge40 silver badges58 bronze badges

answered Nov 9, 2012 at 10:49

mishadoff

10.9k2 gold badges36 silver badges55 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Thomash Over a year ago

You should use a heap and not a sorted list because the sorted list does not have O(log k) for insertion.

trican Over a year ago

thanks - again this sounds interesting - though I think currently the quicksort would be better for a hardware implementation. I'll need to investigate further

rici Over a year ago

@trican: This algorithm should be pretty simple for a hardware implementation, particularly if you can do a couple of comparisons in parallel. First you do make_heap on the first 16 elements of the array (which is actually eight heap operations). Then you go through the remaining 81-16 elements, ignoring any which are greater than element 0, and otherwise replacing element 0 and reheapifying (which is the same operation as the last one in the make_heap.) This way your heap size is fixed and the individual operations are simplified.

score 2 · Accepted Answer · 2012-11-12 18:52:59Z

This sounds like a signal processing kernel of some sort. It's hard to help answer this without knowing the exact data flow in your design. Any algorithm involving a sort has a address decoding cost since you'll need to be able to write and read a memory of 81-elements. If this data is in a memory this cost has already been paid but if it is in distinct registers then writing to them carries an area cost.

Assuming the data is in a memory, you could use bubble sort and take bottom 16 values. The below code assumes a two-port memory but it could work with a single port by alternating reads and writes on every clock cycle using a temporary register to store the write value and write index. This may not be more area efficient with only 81 elements in the memory. Alternatively the source memory could be implemented as two single-port memories with one having odd indices and the other even indices.

// Not working code 
reg [15:0] inData [80:0]; // Must be two-port
reg [3:0]  iterCount = 0;
reg [6:0]  index = 0;
reg sorting;

always @(posedge clk)
  begin
  // Some condition to control execution
  if(sorting)
    begin

    if(index == 80)
      begin 

      // Stop after 16 iterations
      if(iterCount == 15)
        begin
        sorting <= 0;
        iterCount <= 0;
        end
      else
        begin
        iterCount <= iterCount+1;
        index <= 0;
        end
      end 
    else
      begin
      // If this is smaller than next item
      if(inData[index] < inData[index+1])
        begin
        // Swap with next item
        inData[index+1] <= inData[index];
        inData[index]   <= inData[index+1];
        end
      index <= index + 1;
      end
    end
  end

EDIT: If you're latency constrained, allowed only one clock domain and must pipeline then the problem is limited to selecting a sorting network and mapping it to a pipeline. You can't use resource sharing so area is fixed given your sorting network.

Select a sorting network(Bitonic, Pairwise,etc) for N=81 (Not easy)
Build a directed data flow graph representing the sorting network
Remove all nodes except those required for outputs 66-81
Determine the latency L of one comparison node
Use ALAP scheduling to assign M nodes per clock where M*L < 1/f
If scheduling succeeds code it up in an HDL

Many thanks for this suggestion Adam. Whilst I appreciate this approach probably represents the minimal area and probably has a very high maximum frequency, my concern however is the latency is hundreds (or possibly thousands) of clock cycles - which I cant afford. As you suspected this is for a signal processing application, so I get a new group of 81 values to sort every clock cycle. Thus efficient pipelining (in the sense of multiple groups are being sorted at once) with a short latency is vital.
An even odd merge sort or bitonic sort would have much shorter latency than a bubble sort and also has a very regular data flow for implementation purposes. Of course the trade off is much higher area - its delicate balancing act I suppose

Origin · Accepted Answer · 2012-11-09 10:40:29Z

If you're looking for general algorithms, then you could use a version of Quicksort. Quicksort sorts values around a pivot element. You choose a pivot and sort your array. You then get x values < pivot and (n-x-1) larger ones. Depending on x, you choose either one array to continue processing. If x>16, then you know that the numbers your looking for are all in the x-array and you can continue sorting that. If not, then you know x lowest and can now look for the 16-x others in the other array recursively.

The resulting arrays of quicksort are not fully sorted, you only know that they are smaller or larger than your pivot. Some info at wikipedia and a paper.

This sounds like a very interesting approach - I will definitely read the linked paper and have a think about its hardware implementation - I would need to use a fixed number of iterations to keep it amendable for hardware. But I suspect that wouldn't compromise the result too much

Alberto Bonsanto · Accepted Answer · 2012-11-09 10:44:59Z

First set an array with 16 values. And use something like selection sort:

You think the first value is the minimal in the array.
Compare the value of the first element, second, until you find a number lower than it.
Take that as your new minimal, and compare it with the next numbers until you find a number lower than it.
If you finished the array, the number you have is your minimal, save it in an array and exclude it from the source array, and start over again until you fill the 16 positions of the solution array.

shuckc · Accepted Answer · 2022-12-18 13:42:23Z

If you want to avoid state machines and loop-indexing (with associated large muxes or RAMs) then a hardware friendly approach might be to use a sorting network for 17 items (yuck!). Feed this with 16 items from registers (initially empty) and 1 item from the input bus. For each valid cycle on the input bus, the output of the sorting network is the smallest 16 seen so far with the new element either within the results or at the top (if it is larger). The smallest 16 then latch into the registers for the next cycle and the top output from the sorting network is ignored.

This will require 16*(n-bits per element) registers, and a sorting network that is 5 subtract/compare blocks deep and 8 wide, for 40 subtract/complete blocks total (each with a worst case propagation of n-bits). Can be pipelined between stages trivially if you can back pressure the input bus. The solution is almost constant in time and space: 1 cycle per input, and space is only a function of the number of outputs not inputs.

"sorting network for 17 items (yuck!)" as the result is not required to be ordered, you just need to decide which value to discard: max(17). "constant in time […] 1 cycle per input" linear time.

greybeard · Accepted Answer · 2025-04-25 08:47:08Z

Musing selection of k values of n using combinatoric circuits recently, I'm thinking selection network (of CEs: compare & exchange elements).
As the size of practical such networks grows with nlog²n, one should compare different partitionings:

sorters	sorter CEs	layers	merge CEs	layers	total CEs	layers
40+41	216+230	17	16	1	462	18
3*27	414	14	48+16	5 + 1	478	20
3*20+21	273+97	12	2*48+16	5 + 1	482	18
4*16+17	120+122+71	9	3*48+16	2*5 + 1	473	20
custom 41+40	218+213	17	16	1	447	18

(Below "Knuth diagram" shows 81-16 = 65 elements where the min/lower output is dispensable - every top 16-of-81 network does.)
The cost of the non-custom alternatives is closer than I expected.
The best network from above from a known good 40-sorter, a known fast 41-sorter and a merger not preserving order:

Restituting order requires 32 more CEs in 4 more layers.

The next wist is using B. Dobbelaere's SorterHunter (after climbing some of its learning curve …) to find good top/bottom-16-of-40/41 networks to combine:

bottom-16-of-81 network from custom bottom-16-of-41 network, same less first row, and merge

(So why not let SorterHunter find good top-16-of-81 networks directly?
As of V0.4, it is limited to 64 inputs.)

One more resource: Christian Borgelt on Top-k-Net - Splitter Selection Networks (the TLD isn't lost on me).

Collectives™ on Stack Overflow

Fast, small area and low latency selection/partial sorting algorithm

6 Answers 6

3 Comments

2 Comments

1 Comment

Comments

1 Comment

1 Comment

Linked

Hot Network Questions

Collectives™ on Stack Overflow

6 Answers 6

3 Comments

2 Comments

1 Comment

Comments

1 Comment

1 Comment

Linked

Related