Fragment memtable range tombstone in the write path#10380
Conversation
|
@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
@cbi42 has updated the pull request. You must reimport the pull request before landing. |
ajkr
left a comment
There was a problem hiding this comment.
Are we still planning to take this approach? When discussing this, I didn't realize this only takes effect when is_range_del_table_empty_ == false. In that case we are only adding shared_ptr access on the read path for DeleteRange() users, which is not ideal but also not obviously a big deal.
If you do rerun the benchmarks, can you borrow @hx235's devbig or another dual socket host (maybe dba allocate with some constraints)?
There was a problem hiding this comment.
It seems safe to use std::memory_order_relaxed here since the new list won't be needed until readers start using seqnos >= this range tombstone, and we already have memory ordering on the seqno. It might not help though since I read atomic shared_ptr is often implemented with global spinlock.
There was a problem hiding this comment.
This also applies to the std::atomic_load, perhaps more importantly.
|
For the benchmarks, perhaps 0, 1, 10, 100, and 1000 range tombstones in memtable would be good to compare. That way we'll know how it'll affect workloads that likely exist already. |
Yes, I still plan to work on this. I want to explore using the SuperVersion APIs, at least for the immutable memtables. Since the cost of installing SuperVersion is cheaper for immutable memtables as it's only done once per table. As for live memtables, the memtable V2 PR (#10308) shows a better worst case perf for writes, so it depends on whether we worry about that. I'll also run the benchmarks on a dual socket host and with the suggest number of tombstones. Thanks for proposing that. |
ajkr
left a comment
There was a problem hiding this comment.
I want to explore using the SuperVersion APIs, at least for the immutable memtables. Since the cost of installing SuperVersion is cheaper for immutable memtables as it's only done once per table.
Got it. Is the idea to build the fragmented range tombstone list on the write path and then install it during memtable switch? It might be tricky to build it only when the memtable becomes immutable because that happens during an arbitrary write (whichever one is last to fill up the memtable, typically), which is not necessarily a DeleteRange():
rocksdb/db/db_impl/db_impl_write.cc
Lines 2187 to 2191 in 87649d3
There was a problem hiding this comment.
This also applies to the std::atomic_load, perhaps more importantly.
There was a problem hiding this comment.
I wonder if we should try ThreadLocalPtr or CoreLocalArray to hold references to the FragmentedRangeTombstoneList. Core-local is usually preferred nowadays. The FragmentedRangeTombstoneList references could be locally refcounted so we only mutate the shared refcount when the list changes.
I found this folly class interesting: https://github.com/facebook/folly/blob/4b930941724f3d2ed44dd9d7e389437dbf26d787/folly/concurrency/CoreCachedSharedPtr.h#L162. It uses shared_ptr for local refcounting (holder) of another shared_ptr. We might be able to use CoreLocalArray to do something similar. (I suspect our interface maybe should not be specific to shared_ptr; more details to come on that...)
|
@cbi42 has updated the pull request. You must reimport the pull request before landing. |
|
@cbi42 has updated the pull request. You must reimport the pull request before landing. |
|
@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Do you see any possibility for an API (like PrepareImmutable() or something) that is called by SwitchMemtable() with lock released? I've also been thinking about madvise() with MADV_DONTNEED the unused memory at the same point.
There was a problem hiding this comment.
Also regarding this particular variable, I wonder if MemTableListVersion::AddRangeTombstoneIterators() should pass the info about immutability down to the MemTable::NewRangeTombstoneIterator().
There was a problem hiding this comment.
PrepareImmutable()
I think an API like this is great once we do multiple things in the prepare phase (like when we add madvise() calls). One question about unused memory, what are the don't-needed memory in the immutable memtable? Will they be accessed by threads holding old SuperVersion that still treat this memtable as a live memtable?
There was a problem hiding this comment.
Also regarding this particular variable, I wonder if
MemTableListVersion::AddRangeTombstoneIterators()should
pass the info about immutability down to theMemTable::NewRangeTombstoneIterator().
Ah yes, good point. I think a similar change in Get/MultiGet calls are also needed and doable. Then we can get rid of this atomic flag.
I changed my mind on this due to realizing the fragmented range tombstone list is already being built every read while the memtable was mutable, so it doesn't seem particularly concerning to do it once in the write path. Though I do still hope we figure out how to build the list in the mutable memtable during the |
|
@cbi42 has updated the pull request. You must reimport the pull request before landing. |
|
@cbi42 has updated the pull request. You must reimport the pull request before landing. |
|
@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Thanks for the review. I've updated the PR with the atomic boolean variable |
#10547 does the fragmentation for mutable memtables, but in the read path (the first reader does the fragmentation). Whether to do it in read or write path can be adjusted relatively easily. At least the idea of using CoreLocalArray to speed up reads is working. |
Summary: Each read from memtable used to read and fragment all the range tombstones into a `FragmentedRangeTombstoneList`. #10380 improved the inefficient here by caching a `FragmentedRangeTombstoneList` with each immutable memtable. This PR extends the caching to mutable memtables. The fragmented range tombstone can be constructed in either read (This PR) or write path (#10584). With both implementation, each `DeleteRange()` will invalidate the cache, and the difference is where the cache is re-constructed.`CoreLocalArray` is used to store the cache with each memtable so that multi-threaded reads can be efficient. More specifically, each core will have a shared_ptr to a shared_ptr pointing to the current cache. Each read thread will only update the reference count in its core-local shared_ptr, and this is only needed when reading from mutable memtables. The choice between write path and read path is not an easy one: they are both improvement compared to no caching in the current implementation, but they favor different operations and could cause regression in the other operation (read vs write). The write path caching in (#10584) leads to a cleaner implementation, but I chose the read path caching here to avoid significant regression in write performance when there is a considerable amount of range tombstones in a single memtable (the number from the benchmark below suggests >1000 with concurrent writers). Note that even though the fragmented range tombstone list is only constructed in `DeleteRange()` operations, it could block other writes from proceeding, and hence affects overall write performance. Pull Request resolved: #10547 Test Plan: - TestGet() in stress test is updated in #10553 to compare Get() result against expected state: `./db_stress_branch --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4` - Perf benchmark: tested read and write performance where a memtable has 0, 1, 10, 100 and 1000 range tombstones. ``` ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=200000 --reads=100000 --disable_auto_compactions --max_num_range_tombstones=1000 ``` Write perf regressed since the cost of constructing fragmented range tombstone list is shifted from every read to a single write. 6cbe5d8 is included in the last column as a reference to see performance impact on multi-thread reads if `CoreLocalArray` is not used. micros/op averaged over 5 runs: first 4 columns are for fillrandom, last 4 columns are for readrandom. | |fillrandom main | write path caching | read path caching |memtable V3 (#10308) | readrandom main | write path caching | read path caching |memtable V3 | |--- |--- |--- |--- |--- | --- | --- | --- | --- | | 0 |6.35 |6.15 |5.82 |6.12 |2.24 |2.26 |2.03 |2.07 | | 1 |5.99 |5.88 |5.77 |6.28 |2.65 |2.27 |2.24 |2.5 | | 10 |6.15 |6.02 |5.92 |5.95 |5.15 |2.61 |2.31 |2.53 | | 100 |5.95 |5.78 |5.88 |6.23 |28.31 |2.34 |2.45 |2.94 | | 100 25 threads |52.01 |45.85 |46.18 |47.52 |35.97 |3.34 |3.34 |3.56 | | 1000 |6.0 |7.07 |5.98 |6.08 |333.18 |2.86 |2.7 |3.6 | | 1000 25 threads |52.6 |148.86 |79.06 |45.52 |473.49 |3.66 |3.48 |4.38 | - Benchmark performance of`readwhilewriting` from #10552, 100 range tombstones are written: `./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=500 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=100000 --reads=500000 --disable_auto_compactions --max_num_range_tombstones=10000 --finish_after_writes` readrandom micros/op: | |main |write path caching |read path caching |memtable V3 | |---|---|---|---|---| | single thread |48.28 |1.55 |1.52 |1.96 | | 25 threads |64.3 |2.55 |2.67 |2.64 | Reviewed By: ajkr Differential Revision: D38895410 Pulled By: cbi42 fbshipit-source-id: 930bfc309dd1b2f4e8e9042f5126785bba577559
Summary:
Test plan:
Commit 99cdf16 is included in benchmark result. It was an earlier attempt where tombstones are fragmented for each write operation. Reader threads share it using a shared_ptr which would slow down multi-thread read performance as seen in benchmark results.
Results are averaged over 5 runs.
Single thread result:
32-thread result: note that "Max # tombstones" is per thread.