Skip to content

Proposal: evaluate a read-optimized concurrent ordered set as alternative to ConcurrentSkipList for read-heavy workloads #2638

Description

@Joey-Forever
Image

Context

I've built a concurrent red-black tree — ConcurrentRBTree — that is API-compatible with folly::ConcurrentSkipList (same Accessor pattern, same EBR-style NodeRecycler, same find / insert / erase / lower_bound / iterator surface). On read-heavy workloads it consistently outperforms ConcurrentSkipList by 1.4–1.55× aggregate throughput on Intel i9-13900K.

I wanted to sound the team out on whether there's interest in upstream evaluation — or, at minimum, in adding it as a benchmark baseline for ConcurrentSkipList. No strong expectation either way; mainly wanted the team aware of the data point.

Benchmark summary

Workload: int32_t, all threads do mixed read/write, write probability swept 10⁻³ → 0.5, writes split 50/50 insert/erase (50/50 existing/non-existing keys). Intel i9-13900K, Linux, clang++ -O2 -DNDEBUG.

Aggregate throughput ratio (RBTree / SkipList):

Threads × init size Ratio
1 × 8M 1.16×
16 × 8M 1.52×
27 × 8M 1.41×
16 × 4M 1.40×
16 × 10⁵ (L2-resident) 0.96×
27 × 10⁵ (high-write, L2-resident) 0.73×

Advantage scales with working-set size; degrades on tiny L2-resident sets under heavy writer contention (see trade-offs below).

Why — perf stat snapshot (10% write, 16T, 8M init)

RBTree SkipList
Data references 8.08 × 10⁹ 15.19 × 10⁹
Instructions 22.0 × 10⁹ 42.0 × 10⁹
Branches 4.41 × 10⁹ 8.74 × 10⁹
L1d miss rate 11.0% 9.2%
LLd miss rate 4.1% 2.4%
Branch mispredict rate 13.1% 9.1%

Per-access rates favor SkipList (smaller nodes, better locality). But the RBTree touches roughly half as many cache lines, instructions, and branches per operation — so absolute L1d misses and absolute branch mispredicts are both lower. Matches the intuition that both are O(log n) but the red-black tree's constant factor is about 1/2 that of a skip list.

Full 12-plot matrix (1/16/27 threads × 10⁵/10⁶/4·10⁶/8·10⁶ entries): x86_result/

Design summary

  • Dual-indexed: red-black tree (relaxed-atomic child pointers, approximate index) + sorted singly-linked list threaded through all data nodes (acquire/release next_, ground truth for ordering).
  • Lock-free reads: descend tree to approximate less-bound, walk list ≤3 steps to the exact target; retry from root if the fixup bound is exceeded (indicates a concurrent rotation disrupted the descent).
  • Two-phase writes: lock-free descent, then a single cache-line-aligned std::atomic_flag spinlock around the structural commit (list splice + tree attach/detach + rebalancing). Readers never block.
  • Visibility gate: per-node std::atomic<bool> accessible_ toggled release-store after full wiring on insert / before detachment on erase.
  • Reclamation: NodeRecycler pattern directly adapted from folly::ConcurrentSkipList::NodeRecyclerAccessor acts as epoch guard; batched delete on last releaseRef.
  • Header-only, ~1000 LoC, C++17.

Honest trade-offs

  • Writer scaling ceiling. The single writer-side spinlock becomes the bottleneck at high write probability on small working sets (see 27 × 10⁵ / 0.73× row above). For workloads with >20% writes, ConcurrentSkipList remains the better choice.
  • Per-node memory. ~30% larger than a ConcurrentSkipList node for 4-byte values; overhead shrinks quickly as value size grows.
  • No custom comparators / no const_iterator yet (both planned).

What I'm asking

Two possibilities, in order of lower-to-higher commitment from the Folly team:

  1. Benchmark baseline inclusion. Would you accept a PR that adds ConcurrentRBTree as an additional data point in Folly's existing ConcurrentSkipList benchmark harness? Purely comparative, no API changes to Folly.
  2. Upstream evaluation. If there's interest, I'd be happy to iterate on API review, coding-style alignment, ASAN/TSAN/UBSan coverage, and integration with Folly's existing concurrent-container fuzzers, with the aim of landing it under a Folly namespace (e.g. something like folly::ReadOptimizedConcurrentSet) as an alternative container for read-heavy use cases.

If neither is of interest, a "not now / not a fit" response is completely fine — the work stands on its own and this is just a heads-up.

Thanks for ConcurrentSkipList and NodeRecycler — both directly inspired the reclamation and Accessor design here.

Links

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions