Container that orders its elements based on their frequency

Question

I was inspired by Find the K most-frequent values in a vector, and originally posted an early version of this in an answer.

The aim is to have a container that maintains values ordered by their frequency, so we can easily obtain the most-frequent values at any time.

I've just implemented enough to be able to insert elements and obtain the N most frequent; the addition of member functions to make it a complete container (with iterators, removal, emplacement etc.) would be straightforward and might make an enjoyable exercise for the reader.

The insert() function operates in O(1) time, confirmed by varying the number of iterations of the test program's loop between 100,000 and 100,000,000. When I used ten million insertions, run time was about 1.1 seconds on my system (compiled with g++-12 -O3).

#include <algorithm>
#include <concepts>
#include <functional>
#include <list>
#include <map>
#include <iterator>
#include <ranges>
#include <unordered_map>
#include <utility>
#include <vector>

namespace {
    template<typename Key>
    concept is_hashable = requires { std::hash<Key>{}; };
}

template<typename T>
class frequency_ordered_multiset
{
    template<typename Key, typename Value>
    using fast_map =
        std::conditional_t<is_hashable<Key>,
                           std::unordered_map<Key, Value>,
                           std::map<Key, Value>>;
    using key_type =
        std::conditional_t<sizeof (T) <= sizeof (std::reference_wrapper<const T>),
                           T, std::reference_wrapper<const T>>;

    // We use a list because its iterators stay valid after modifications.
    // The list is always in descending order of count.
    struct item { const T val; std::size_t count; };
    std::list<item> items = {};

    // We use this map to find the list items by their value.
    using items_iter = decltype(items.begin());
    fast_map<key_type, items_iter> by_value = {};

public:
    template<typename V>
    requires std::assignable_from<T&, V&>
    void insert(V&& val)
    {
        auto map_it = by_value.find(val);
        if (map_it == by_value.end()) {
            // first appearance - put it at end of list
            items.emplace_back(std::forward<V>(val), 1);
            auto list_it = items.end();
            by_value.emplace(list_it->val, --list_it);
            return;
        }
        // Update existing item
        auto const list_it = map_it->second;
        auto const new_count = ++list_it->count;
        // Search backwards for the correct position
        auto const other_it =
            std::ranges::lower_bound(std::reverse_iterator{list_it}, items.rend(),
                                     new_count, std::less{}, &item::count).base();
        // Move the item (without invalidating iterators)
        // This may be a no-op
        items.splice(other_it, items, list_it);
    }

    auto most_frequent(std::size_t k)
    {
        return items
            | std::views::take(k)
            | std::views::transform(&item::val);
    }
};

#include <chrono>
#include <random>
#include <iostream>
int main()
{
    // check correctness
    frequency_ordered_multiset<int> fom;
    for (auto i: {1, 1, 3, 3, 2, 2, 2, 3, 3, 3, 3, 4, 4}) {
        fom.insert(i);
        std::ranges::copy(fom.most_frequent(2),
                          std::ostream_iterator<int>(std::cout, ", "));
        std::cout << '\n';
    }

    // test performance
    std::mt19937 r;
    std::uniform_int_distribution<int> dist{0, 100};
    for (auto i = 0;  i < 10'000'000;  ++i) {
        fom.insert(dist(r));
    }
    std::ranges::copy(fom.most_frequent(10),
                      std::ostream_iterator<int>(std::cout, ", "));
    std::cout << '\n';
}

Test program output:

time ./281926
1, 
1, 
1, 3, 
1, 3, 
1, 3, 2, 
1, 3, 2, 
2, 1, 3, 
2, 3, 1, 
3, 2, 1, 
3, 2, 1, 
3, 2, 1, 
3, 2, 1, 
3, 2, 1, 
90, 94, 71, 10, 36, 11, 15, 83, 24, 61, 
1.11user 0.00system 0:01.11elapsed 100%CPU (0avgtext+0avgdata 3568maxresident)k

G. Sliepen · Accepted Answer · 2022-12-15 13:03:43Z

Higher memory usage than necessary if K < N

If you are only interested in the \$K\$ most frequent elements, then you only need to maintain a list of size \$K\$. You can do that if you make sure the key and counter are stored in the map, not in the list, and only use the list to maintain references to map entries sorted by the counter value:

using fast_map = …;

struct item {
    const T val;
    std::size_t count{};
    std::list<fast_map<T, item>::iterator>::iterator it;
};

fast_map<T, item> items;
std::list<fast_map<T, item>::iterator> by_count;

Despite the seemingly circular type definition this compiles. You can use by_count.end() as the value of it for those items that are not part of the list.

You could also avoid the separate std::list entirely if you make item part of an intrusive list, like for example using Boost.Intrusive, although this could end up using more memory if \$N\$ is much larger than \$K\$.

Don't use increment/decrement operators on variables used multiple times in one statement

This is confusing, UB before C++17, and unspecified since C++17 (see order of evaluation):

auto list_it = items.end();
by_value.emplace(list_it->val, --list_it);

Just move the decrement to its own line. Alternatively, you could reverse the order of the list so you can use items.begin() and not have to modify that.

Does not compile if `T` is small and not copyable

You are storing T twice by value if its size is equal to or smaller than a reference. However, it is possible to make a small type that is not copyable. You could check for this and use a std::reference_wrapper for uncopyable types regardless of their size.

Complexity

The insert() function operates in O(1) time, confirmed by varying the number of iterations of the test program's loop between 100,000 and 100,000,000.

That's the average-case complexity, but not the worst-case complexity. Consider that there is a potential non-\$O(1)\$ operation in insert(), and that's the call to std::ranges::lower_bound(). On a std::list, this is \$O(N)\$. That might happen if all elements in the list so far have the same count, and then another insert happens that would increment the count of one of those existing elements.

You might then think that the amortized complexity of finding the \$K\$ most-frequent elements of a vector is still \$O(N)\$, but I think that might not be the case with a worst-case input. Consider that vector having the values \$1 \dots N/2\$, in that order, repeated twice (possibly reversed the second time). That might up ending \$O(N^2)\$.

That was a last-minute change to emplace the list's value rather than the function argument, and I totally missed the modification there. Thanks for spotting that. — Toby Speight
– Toby Speight, Commented Dec 15, 2022 at 12:36

Stack Exchange Network

Container that orders its elements based on their frequency

1 Answer 1

Higher memory usage than necessary if K < N

Don't use increment/decrement operators on variables used multiple times in one statement

Does not compile if `T` is small and not copyable

Complexity

You must log in to answer this question.

Linked

Hot Network Questions

Container that orders its elements based on their frequency

1 Answer 1

Higher memory usage than necessary if K < N

Don't use increment/decrement operators on variables used multiple times in one statement

Does not compile if T is small and not copyable

Complexity

You must log in to answer this question.

Linked

Related

Hot Network Questions

Does not compile if `T` is small and not copyable