Context
I've built a concurrent red-black tree — ConcurrentRBTree — that is API-compatible with folly::ConcurrentSkipList (same Accessor pattern, same EBR-style NodeRecycler, same find / insert / erase / lower_bound / iterator surface). On read-heavy workloads it consistently outperforms ConcurrentSkipList by 1.4–1.55× aggregate throughput on Intel i9-13900K.
I wanted to sound the team out on whether there's interest in upstream evaluation — or, at minimum, in adding it as a benchmark baseline for ConcurrentSkipList. No strong expectation either way; mainly wanted the team aware of the data point.
Benchmark summary
Workload: int32_t, all threads do mixed read/write, write probability swept 10⁻³ → 0.5, writes split 50/50 insert/erase (50/50 existing/non-existing keys). Intel i9-13900K, Linux, clang++ -O2 -DNDEBUG.
Aggregate throughput ratio (RBTree / SkipList):
| Threads × init size |
Ratio |
| 1 × 8M |
1.16× |
| 16 × 8M |
1.52× |
| 27 × 8M |
1.41× |
| 16 × 4M |
1.40× |
| 16 × 10⁵ (L2-resident) |
0.96× |
| 27 × 10⁵ (high-write, L2-resident) |
0.73× |
Advantage scales with working-set size; degrades on tiny L2-resident sets under heavy writer contention (see trade-offs below).
Why — perf stat snapshot (10% write, 16T, 8M init)
|
RBTree |
SkipList |
| Data references |
8.08 × 10⁹ |
15.19 × 10⁹ |
| Instructions |
22.0 × 10⁹ |
42.0 × 10⁹ |
| Branches |
4.41 × 10⁹ |
8.74 × 10⁹ |
| L1d miss rate |
11.0% |
9.2% |
| LLd miss rate |
4.1% |
2.4% |
| Branch mispredict rate |
13.1% |
9.1% |
Per-access rates favor SkipList (smaller nodes, better locality). But the RBTree touches roughly half as many cache lines, instructions, and branches per operation — so absolute L1d misses and absolute branch mispredicts are both lower. Matches the intuition that both are O(log n) but the red-black tree's constant factor is about 1/2 that of a skip list.
Full 12-plot matrix (1/16/27 threads × 10⁵/10⁶/4·10⁶/8·10⁶ entries): x86_result/
Design summary
- Dual-indexed: red-black tree (relaxed-atomic child pointers, approximate index) + sorted singly-linked list threaded through all data nodes (acquire/release
next_, ground truth for ordering).
- Lock-free reads: descend tree to approximate less-bound, walk list ≤3 steps to the exact target; retry from root if the fixup bound is exceeded (indicates a concurrent rotation disrupted the descent).
- Two-phase writes: lock-free descent, then a single cache-line-aligned
std::atomic_flag spinlock around the structural commit (list splice + tree attach/detach + rebalancing). Readers never block.
- Visibility gate: per-node
std::atomic<bool> accessible_ toggled release-store after full wiring on insert / before detachment on erase.
- Reclamation:
NodeRecycler pattern directly adapted from folly::ConcurrentSkipList::NodeRecycler — Accessor acts as epoch guard; batched delete on last releaseRef.
- Header-only, ~1000 LoC, C++17.
Honest trade-offs
- Writer scaling ceiling. The single writer-side spinlock becomes the bottleneck at high write probability on small working sets (see 27 × 10⁵ / 0.73× row above). For workloads with >20% writes,
ConcurrentSkipList remains the better choice.
- Per-node memory. ~30% larger than a
ConcurrentSkipList node for 4-byte values; overhead shrinks quickly as value size grows.
- No custom comparators / no
const_iterator yet (both planned).
What I'm asking
Two possibilities, in order of lower-to-higher commitment from the Folly team:
- Benchmark baseline inclusion. Would you accept a PR that adds
ConcurrentRBTree as an additional data point in Folly's existing ConcurrentSkipList benchmark harness? Purely comparative, no API changes to Folly.
- Upstream evaluation. If there's interest, I'd be happy to iterate on API review, coding-style alignment, ASAN/TSAN/UBSan coverage, and integration with Folly's existing concurrent-container fuzzers, with the aim of landing it under a Folly namespace (e.g. something like
folly::ReadOptimizedConcurrentSet) as an alternative container for read-heavy use cases.
If neither is of interest, a "not now / not a fit" response is completely fine — the work stands on its own and this is just a heads-up.
Thanks for ConcurrentSkipList and NodeRecycler — both directly inspired the reclamation and Accessor design here.
Links
Context
I've built a concurrent red-black tree — ConcurrentRBTree — that is API-compatible with
folly::ConcurrentSkipList(sameAccessorpattern, same EBR-styleNodeRecycler, samefind / insert / erase / lower_bound / iteratorsurface). On read-heavy workloads it consistently outperformsConcurrentSkipListby 1.4–1.55× aggregate throughput on Intel i9-13900K.I wanted to sound the team out on whether there's interest in upstream evaluation — or, at minimum, in adding it as a benchmark baseline for
ConcurrentSkipList. No strong expectation either way; mainly wanted the team aware of the data point.Benchmark summary
Workload:
int32_t, all threads do mixed read/write, write probability swept 10⁻³ → 0.5, writes split 50/50 insert/erase (50/50 existing/non-existing keys). Intel i9-13900K, Linux,clang++ -O2 -DNDEBUG.Aggregate throughput ratio (RBTree / SkipList):
Advantage scales with working-set size; degrades on tiny L2-resident sets under heavy writer contention (see trade-offs below).
Why —
perf statsnapshot (10% write, 16T, 8M init)Per-access rates favor SkipList (smaller nodes, better locality). But the RBTree touches roughly half as many cache lines, instructions, and branches per operation — so absolute L1d misses and absolute branch mispredicts are both lower. Matches the intuition that both are O(log n) but the red-black tree's constant factor is about 1/2 that of a skip list.
Full 12-plot matrix (1/16/27 threads × 10⁵/10⁶/4·10⁶/8·10⁶ entries): x86_result/
Design summary
next_, ground truth for ordering).std::atomic_flagspinlock around the structural commit (list splice + tree attach/detach + rebalancing). Readers never block.std::atomic<bool> accessible_toggled release-store after full wiring on insert / before detachment on erase.NodeRecyclerpattern directly adapted fromfolly::ConcurrentSkipList::NodeRecycler—Accessoracts as epoch guard; batched delete on lastreleaseRef.Honest trade-offs
ConcurrentSkipListremains the better choice.ConcurrentSkipListnode for 4-byte values; overhead shrinks quickly as value size grows.const_iteratoryet (both planned).What I'm asking
Two possibilities, in order of lower-to-higher commitment from the Folly team:
ConcurrentRBTreeas an additional data point in Folly's existingConcurrentSkipListbenchmark harness? Purely comparative, no API changes to Folly.folly::ReadOptimizedConcurrentSet) as an alternative container for read-heavy use cases.If neither is of interest, a "not now / not a fit" response is completely fine — the work stands on its own and this is just a heads-up.
Thanks for
ConcurrentSkipListandNodeRecycler— both directly inspired the reclamation andAccessordesign here.Links