Proposal: evaluate a read-optimized concurrent ordered set as alternative to ConcurrentSkipList for read-heavy workloads

Context

I've built a concurrent red-black tree — ConcurrentRBTree — that is API-compatible with folly::ConcurrentSkipList (same Accessor pattern, same EBR-style NodeRecycler, same find / insert / erase / lower_bound / iterator surface). On read-heavy workloads it consistently outperforms ConcurrentSkipList by 1.4–1.55× aggregate throughput on Intel i9-13900K.

I wanted to sound the team out on whether there's interest in upstream evaluation — or, at minimum, in adding it as a benchmark baseline for ConcurrentSkipList. No strong expectation either way; mainly wanted the team aware of the data point.

Benchmark summary

Workload: int32_t, all threads do mixed read/write, write probability swept 10⁻³ → 0.5, writes split 50/50 insert/erase (50/50 existing/non-existing keys). Intel i9-13900K, Linux, clang++ -O2 -DNDEBUG.

Aggregate throughput ratio (RBTree / SkipList):

Threads × init size	Ratio
1 × 8M	1.16×
16 × 8M	1.52×
27 × 8M	1.41×
16 × 4M	1.40×
16 × 10⁵ (L2-resident)	0.96×
27 × 10⁵ (high-write, L2-resident)	0.73×

Advantage scales with working-set size; degrades on tiny L2-resident sets under heavy writer contention (see trade-offs below).

Why — `perf stat` snapshot (10% write, 16T, 8M init)

	RBTree	SkipList
Data references	8.08 × 10⁹	15.19 × 10⁹
Instructions	22.0 × 10⁹	42.0 × 10⁹
Branches	4.41 × 10⁹	8.74 × 10⁹
L1d miss rate	11.0%	9.2%
LLd miss rate	4.1%	2.4%
Branch mispredict rate	13.1%	9.1%

Per-access rates favor SkipList (smaller nodes, better locality). But the RBTree touches roughly half as many cache lines, instructions, and branches per operation — so absolute L1d misses and absolute branch mispredicts are both lower. Matches the intuition that both are O(log n) but the red-black tree's constant factor is about 1/2 that of a skip list.

Full 12-plot matrix (1/16/27 threads × 10⁵/10⁶/4·10⁶/8·10⁶ entries): x86_result/

Design summary

Dual-indexed: red-black tree (relaxed-atomic child pointers, approximate index) + sorted singly-linked list threaded through all data nodes (acquire/release next_, ground truth for ordering).
Lock-free reads: descend tree to approximate less-bound, walk list ≤3 steps to the exact target; retry from root if the fixup bound is exceeded (indicates a concurrent rotation disrupted the descent).
Two-phase writes: lock-free descent, then a single cache-line-aligned std::atomic_flag spinlock around the structural commit (list splice + tree attach/detach + rebalancing). Readers never block.
Visibility gate: per-node std::atomic<bool> accessible_ toggled release-store after full wiring on insert / before detachment on erase.
Reclamation: NodeRecycler pattern directly adapted from folly::ConcurrentSkipList::NodeRecycler — Accessor acts as epoch guard; batched delete on last releaseRef.
Header-only, ~1000 LoC, C++17.

Honest trade-offs

Writer scaling ceiling. The single writer-side spinlock becomes the bottleneck at high write probability on small working sets (see 27 × 10⁵ / 0.73× row above). For workloads with >20% writes, ConcurrentSkipList remains the better choice.
Per-node memory. ~30% larger than a ConcurrentSkipList node for 4-byte values; overhead shrinks quickly as value size grows.
No custom comparators / no const_iterator yet (both planned).

What I'm asking

Two possibilities, in order of lower-to-higher commitment from the Folly team:

Benchmark baseline inclusion. Would you accept a PR that adds ConcurrentRBTree as an additional data point in Folly's existing ConcurrentSkipList benchmark harness? Purely comparative, no API changes to Folly.
Upstream evaluation. If there's interest, I'd be happy to iterate on API review, coding-style alignment, ASAN/TSAN/UBSan coverage, and integration with Folly's existing concurrent-container fuzzers, with the aim of landing it under a Folly namespace (e.g. something like folly::ReadOptimizedConcurrentSet) as an alternative container for read-heavy use cases.

If neither is of interest, a "not now / not a fit" response is completely fine — the work stands on its own and this is just a heads-up.

Thanks for ConcurrentSkipList and NodeRecycler — both directly inspired the reclamation and Accessor design here.

Links

Source: https://github.com/Joey-Forever/ConcurrentRBTree
Single-header implementation: src/include/ConcurrentRBTree.h
Benchmark harness: src/test/comparision_test/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Proposal: evaluate a read-optimized concurrent ordered set as alternative to ConcurrentSkipList for read-heavy workloads #2638

Context

Benchmark summary

Why — `perf stat` snapshot (10% write, 16T, 8M init)

Design summary

Honest trade-offs

What I'm asking

Links

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Proposal: evaluate a read-optimized concurrent ordered set as alternative to ConcurrentSkipList for read-heavy workloads #2638

Description

Context

Benchmark summary

Why — perf stat snapshot (10% write, 16T, 8M init)

Design summary

Honest trade-offs

What I'm asking

Links

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Why — `perf stat` snapshot (10% write, 16T, 8M init)