Skip to content

Fragment memtable range tombstone in the write path#10380

Closed
cbi42 wants to merge 7 commits into
facebook:mainfrom
cbi42:cache-range-del
Closed

Fragment memtable range tombstone in the write path#10380
cbi42 wants to merge 7 commits into
facebook:mainfrom
cbi42:cache-range-del

Conversation

@cbi42

@cbi42 cbi42 commented Jul 18, 2022

Copy link
Copy Markdown
Contributor

Summary:

  • Right now each read fragments the memtable range tombstones Avoid re-fragmenting memtable range tombstones on each read #4808. This PR explores the idea of fragmenting memtable range tombstones in the write path and reads can just read this cached fragmented tombstone without any fragmenting cost. This PR only does the caching for immutable memtable, and does so right before a memtable is added to an immutable memtable list. The fragmentation is done without holding mutex to minimize its performance impact.
  • db_bench is updated to print out the number of range deletions executed if there is any.

Test plan:

  • CI, added asserts in various places to check whether a fragmented range tombstone list should have been constructed.
  • Benchmark: as this PR only optimizes immutable memtable path, the number of writes in the benchmark is chosen such an immutable memtable is created and range tombstones are in that memtable.
single thread:
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100

multi_thread
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100

Commit 99cdf16 is included in benchmark result. It was an earlier attempt where tombstones are fragmented for each write operation. Reader threads share it using a shared_ptr which would slow down multi-thread read performance as seen in benchmark results.
Results are averaged over 5 runs.

Single thread result:

Max # tombstones main fillrandom micros/op 99cdf16 Post PR main readrandom micros/op 99cdf16 Post PR
0 6.68 6.57 6.72 4.72 4.79 4.54
1 6.67 6.58 6.62 5.41 4.74 4.72
10 6.59 6.5 6.56 7.83 4.69 4.59
100 6.62 6.75 6.58 29.57 5.04 5.09
1000 6.54 6.82 6.61 320.33 5.22 5.21

32-thread result: note that "Max # tombstones" is per thread.

Max # tombstones main fillrandom micros/op 99cdf16 Post PR main readrandom micros/op 99cdf16 Post PR
0 234.52 260.25 239.42 5.06 5.38 5.09
1 236.46 262.0 231.1 19.57 22.14 5.45
10 236.95 263.84 251.49 151.73 21.61 5.73
100 268.16 296.8 280.13 2308.52 22.27 6.57
@facebook-github-bot

Copy link
Copy Markdown
Contributor

@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cbi42 cbi42 marked this pull request as ready for review July 18, 2022 06:31
@facebook-github-bot

Copy link
Copy Markdown
Contributor

@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

Copy link
Copy Markdown
Contributor

@cbi42 has updated the pull request. You must reimport the pull request before landing.

@ajkr ajkr left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we still planning to take this approach? When discussing this, I didn't realize this only takes effect when is_range_del_table_empty_ == false. In that case we are only adding shared_ptr access on the read path for DeleteRange() users, which is not ideal but also not obviously a big deal.

If you do rerun the benchmarks, can you borrow @hx235's devbig or another dual socket host (maybe dba allocate with some constraints)?

Comment thread db/memtable.cc Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems safe to use std::memory_order_relaxed here since the new list won't be needed until readers start using seqnos >= this range tombstone, and we already have memory ordering on the seqno. It might not help though since I read atomic shared_ptr is often implemented with global spinlock.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also applies to the std::atomic_load, perhaps more importantly.

@ajkr

ajkr commented Jul 28, 2022

Copy link
Copy Markdown
Contributor

For the benchmarks, perhaps 0, 1, 10, 100, and 1000 range tombstones in memtable would be good to compare. That way we'll know how it'll affect workloads that likely exist already.

@cbi42

cbi42 commented Jul 28, 2022

Copy link
Copy Markdown
Contributor Author

Are we still planning to take this approach? When discussing this, I didn't realize this only takes effect when is_range_del_table_empty_ == false. In that case we are only adding shared_ptr access on the read path for DeleteRange() users, which is not ideal but also not obviously a big deal.

If you do rerun the benchmarks, can you borrow @hx235's devbig or another dual socket host (maybe dba allocate with some constraints)?

Yes, I still plan to work on this. I want to explore using the SuperVersion APIs, at least for the immutable memtables. Since the cost of installing SuperVersion is cheaper for immutable memtables as it's only done once per table. As for live memtables, the memtable V2 PR (#10308) shows a better worst case perf for writes, so it depends on whether we worry about that. I'll also run the benchmarks on a dual socket host and with the suggest number of tombstones. Thanks for proposing that.

@ajkr ajkr left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to explore using the SuperVersion APIs, at least for the immutable memtables. Since the cost of installing SuperVersion is cheaper for immutable memtables as it's only done once per table.

Got it. Is the idea to build the fragmented range tombstone list on the write path and then install it during memtable switch? It might be tricky to build it only when the memtable becomes immutable because that happens during an arbitrary write (whichever one is last to fill up the memtable, typically), which is not necessarily a DeleteRange():

cfd->imm()->Add(cfd->mem(), &context->memtables_to_free_);
new_mem->Ref();
cfd->SetMemtable(new_mem);
InstallSuperVersionAndScheduleWork(cfd, &context->superversion_context,
mutable_cf_options);

Comment thread db/memtable.cc Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also applies to the std::atomic_load, perhaps more importantly.

Comment thread db/memtable.h Outdated
Comment on lines 610 to 652

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should try ThreadLocalPtr or CoreLocalArray to hold references to the FragmentedRangeTombstoneList. Core-local is usually preferred nowadays. The FragmentedRangeTombstoneList references could be locally refcounted so we only mutate the shared refcount when the list changes.

I found this folly class interesting: https://github.com/facebook/folly/blob/4b930941724f3d2ed44dd9d7e389437dbf26d787/folly/concurrency/CoreCachedSharedPtr.h#L162. It uses shared_ptr for local refcounting (holder) of another shared_ptr. We might be able to use CoreLocalArray to do something similar. (I suspect our interface maybe should not be specific to shared_ptr; more details to come on that...)

@facebook-github-bot

Copy link
Copy Markdown
Contributor

@cbi42 has updated the pull request. You must reimport the pull request before landing.

@cbi42 cbi42 force-pushed the cache-range-del branch from dd08453 to fc87db4 Compare August 4, 2022 04:07
@facebook-github-bot

Copy link
Copy Markdown
Contributor

@cbi42 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

Copy link
Copy Markdown
Contributor

@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ajkr ajkr left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Comment thread db/memtable.h Outdated
Comment on lines 622 to 624

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you see any possibility for an API (like PrepareImmutable() or something) that is called by SwitchMemtable() with lock released? I've also been thinking about madvise() with MADV_DONTNEED the unused memory at the same point.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also regarding this particular variable, I wonder if MemTableListVersion::AddRangeTombstoneIterators() should pass the info about immutability down to the MemTable::NewRangeTombstoneIterator().

@cbi42 cbi42 Aug 4, 2022

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PrepareImmutable()

I think an API like this is great once we do multiple things in the prepare phase (like when we add madvise() calls). One question about unused memory, what are the don't-needed memory in the immutable memtable? Will they be accessed by threads holding old SuperVersion that still treat this memtable as a live memtable?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also regarding this particular variable, I wonder if MemTableListVersion::AddRangeTombstoneIterators() should
pass the info about immutability down to the MemTable::NewRangeTombstoneIterator().

Ah yes, good point. I think a similar change in Get/MultiGet calls are also needed and doable. Then we can get rid of this atomic flag.

@ajkr

ajkr commented Aug 4, 2022

Copy link
Copy Markdown
Contributor

I want to explore using the SuperVersion APIs, at least for the immutable memtables. Since the cost of installing SuperVersion is cheaper for immutable memtables as it's only done once per table.

Got it. Is the idea to build the fragmented range tombstone list on the write path and then install it during memtable switch? It might be tricky to build it only when the memtable becomes immutable because that happens during an arbitrary write (whichever one is last to fill up the memtable, typically), which is not necessarily a DeleteRange():

I changed my mind on this due to realizing the fragmented range tombstone list is already being built every read while the memtable was mutable, so it doesn't seem particularly concerning to do it once in the write path. Though I do still hope we figure out how to build the list in the mutable memtable during the DeleteRange() write in a next step :)

@cbi42 cbi42 force-pushed the cache-range-del branch from fc87db4 to 3d438c1 Compare August 4, 2022 23:33
@facebook-github-bot

Copy link
Copy Markdown
Contributor

@cbi42 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

Copy link
Copy Markdown
Contributor

@cbi42 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

Copy link
Copy Markdown
Contributor

@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cbi42

cbi42 commented Aug 5, 2022

Copy link
Copy Markdown
Contributor Author

LGTM!

Thanks for the review. I've updated the PR with the atomic boolean variable has_range_tombstone_list_ removed. The PR description is also updated with the latest benchmark.

@cbi42

cbi42 commented Aug 21, 2022

Copy link
Copy Markdown
Contributor Author

Though I do still hope we figure out how to build the list in the mutable memtable during the DeleteRange() write in a next step :)

#10547 does the fragmentation for mutable memtables, but in the read path (the first reader does the fragmentation). Whether to do it in read or write path can be adjusted relatively easily. At least the idea of using CoreLocalArray to speed up reads is working.

facebook-github-bot pushed a commit that referenced this pull request Sep 14, 2022
Summary:
Each read from memtable used to read and fragment all the range tombstones into a `FragmentedRangeTombstoneList`. #10380 improved the inefficient here by caching a `FragmentedRangeTombstoneList` with each immutable memtable. This PR extends the caching to mutable memtables. The fragmented range tombstone can be constructed in either read (This PR) or write path (#10584). With both implementation, each `DeleteRange()` will invalidate the cache, and the difference is where the cache is re-constructed.`CoreLocalArray` is used to store the cache with each memtable so that multi-threaded reads can be efficient. More specifically, each core will have a shared_ptr to a shared_ptr pointing to the current cache. Each read thread will only update the reference count in its core-local shared_ptr, and this is only needed when reading from mutable memtables.

The choice between write path and read path is not an easy one: they are both improvement compared to no caching in the current implementation, but they favor different operations and could cause regression in the other operation (read vs write). The write path caching in (#10584) leads to a cleaner implementation, but I chose the read path caching here to avoid significant regression in write performance when there is a considerable amount of range tombstones in a single memtable (the number from the benchmark below suggests >1000 with concurrent writers). Note that even though the fragmented range tombstone list is only constructed in `DeleteRange()` operations, it could block other writes from proceeding, and hence affects overall write performance.

Pull Request resolved: #10547

Test Plan:
- TestGet() in stress test is updated in #10553 to compare Get() result against expected state: `./db_stress_branch --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Perf benchmark: tested read and write performance where a memtable has 0, 1, 10, 100 and 1000 range tombstones.
```
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=200000 --reads=100000 --disable_auto_compactions --max_num_range_tombstones=1000
```
Write perf regressed since the cost of constructing fragmented range tombstone list is shifted from every read to a single write. 6cbe5d8 is included in the last column as a reference to see performance impact on multi-thread reads if `CoreLocalArray` is not used.

micros/op averaged over 5 runs: first 4 columns are for fillrandom, last 4 columns are for readrandom.
|   |fillrandom main           | write path caching          | read path caching          |memtable V3 (#10308)     | readrandom main            | write path caching           | read path caching            |memtable V3      |
|---   |---  |---   |---   |---   | ---   |           ---   |  ---   |  ---   |
| 0                    |6.35                           |6.15                           |5.82                           |6.12                           |2.24                           |2.26                           |2.03                           |2.07                           |
| 1                    |5.99                           |5.88                           |5.77                           |6.28                           |2.65                           |2.27                           |2.24                           |2.5                            |
| 10                   |6.15                           |6.02                           |5.92                           |5.95                           |5.15                           |2.61                           |2.31                           |2.53                           |
| 100                  |5.95                           |5.78                           |5.88                           |6.23                           |28.31                          |2.34                           |2.45                           |2.94                           |
| 100 25 threads       |52.01                          |45.85                          |46.18                          |47.52                          |35.97                          |3.34                           |3.34                           |3.56                           |
| 1000                 |6.0                            |7.07                           |5.98                           |6.08                           |333.18                         |2.86                           |2.7                            |3.6                            |
| 1000 25 threads      |52.6                           |148.86                         |79.06                          |45.52                          |473.49                         |3.66                           |3.48                           |4.38                           |

  - Benchmark performance of`readwhilewriting` from #10552, 100 range tombstones are written: `./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=500 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=100000 --reads=500000 --disable_auto_compactions --max_num_range_tombstones=10000 --finish_after_writes`

readrandom micros/op:
|  |main            |write path caching           |read path caching            |memtable V3      |
|---|---|---|---|---|
| single thread        |48.28                          |1.55                           |1.52                           |1.96                           |
| 25 threads           |64.3                           |2.55                           |2.67                           |2.64                           |

Reviewed By: ajkr

Differential Revision: D38895410

Pulled By: cbi42

fbshipit-source-id: 930bfc309dd1b2f4e8e9042f5126785bba577559
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

3 participants