Fragment memtable range tombstone in the write path by cbi42 · Pull Request #10380 · facebook/rocksdb

cbi42 · 2022-07-18T03:32:24Z

Summary:

Right now each read fragments the memtable range tombstones Avoid re-fragmenting memtable range tombstones on each read #4808. This PR explores the idea of fragmenting memtable range tombstones in the write path and reads can just read this cached fragmented tombstone without any fragmenting cost. This PR only does the caching for immutable memtable, and does so right before a memtable is added to an immutable memtable list. The fragmentation is done without holding mutex to minimize its performance impact.
db_bench is updated to print out the number of range deletions executed if there is any.

Test plan:

CI, added asserts in various places to check whether a fragmented range tombstone list should have been constructed.
Benchmark: as this PR only optimizes immutable memtable path, the number of writes in the benchmark is chosen such an immutable memtable is created and range tombstones are in that memtable.

single thread:
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100

multi_thread
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100

Commit 99cdf16 is included in benchmark result. It was an earlier attempt where tombstones are fragmented for each write operation. Reader threads share it using a shared_ptr which would slow down multi-thread read performance as seen in benchmark results.
Results are averaged over 5 runs.

Single thread result:

Max # tombstones	main fillrandom micros/op	`99cdf16`	Post PR	main readrandom micros/op	`99cdf16`	Post PR
0	6.68	6.57	6.72	4.72	4.79	4.54
1	6.67	6.58	6.62	5.41	4.74	4.72
10	6.59	6.5	6.56	7.83	4.69	4.59
100	6.62	6.75	6.58	29.57	5.04	5.09
1000	6.54	6.82	6.61	320.33	5.22	5.21

32-thread result: note that "Max # tombstones" is per thread.

Max # tombstones	main fillrandom micros/op	`99cdf16`	Post PR	main readrandom micros/op	`99cdf16`	Post PR
0	234.52	260.25	239.42	5.06	5.38	5.09
1	236.46	262.0	231.1	19.57	22.14	5.45
10	236.95	263.84	251.49	151.73	21.61	5.73
100	268.16	296.8	280.13	2308.52	22.27	6.57

facebook-github-bot · 2022-07-18T04:49:43Z

@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2022-07-18T06:34:33Z

@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2022-07-21T18:12:50Z

@cbi42 has updated the pull request. You must reimport the pull request before landing.

ajkr

Are we still planning to take this approach? When discussing this, I didn't realize this only takes effect when is_range_del_table_empty_ == false. In that case we are only adding shared_ptr access on the read path for DeleteRange() users, which is not ideal but also not obviously a big deal.

If you do rerun the benchmarks, can you borrow @hx235's devbig or another dual socket host (maybe dba allocate with some constraints)?

ajkr · 2022-07-27T23:59:45Z

It seems safe to use std::memory_order_relaxed here since the new list won't be needed until readers start using seqnos >= this range tombstone, and we already have memory ordering on the seqno. It might not help though since I read atomic shared_ptr is often implemented with global spinlock.

This also applies to the std::atomic_load, perhaps more importantly.

ajkr · 2022-07-28T00:24:21Z

For the benchmarks, perhaps 0, 1, 10, 100, and 1000 range tombstones in memtable would be good to compare. That way we'll know how it'll affect workloads that likely exist already.

cbi42 · 2022-07-28T15:33:43Z

Are we still planning to take this approach? When discussing this, I didn't realize this only takes effect when is_range_del_table_empty_ == false. In that case we are only adding shared_ptr access on the read path for DeleteRange() users, which is not ideal but also not obviously a big deal.

If you do rerun the benchmarks, can you borrow @hx235's devbig or another dual socket host (maybe dba allocate with some constraints)?

Yes, I still plan to work on this. I want to explore using the SuperVersion APIs, at least for the immutable memtables. Since the cost of installing SuperVersion is cheaper for immutable memtables as it's only done once per table. As for live memtables, the memtable V2 PR (#10308) shows a better worst case perf for writes, so it depends on whether we worry about that. I'll also run the benchmarks on a dual socket host and with the suggest number of tombstones. Thanks for proposing that.

ajkr

I want to explore using the SuperVersion APIs, at least for the immutable memtables. Since the cost of installing SuperVersion is cheaper for immutable memtables as it's only done once per table.

Got it. Is the idea to build the fragmented range tombstone list on the write path and then install it during memtable switch? It might be tricky to build it only when the memtable becomes immutable because that happens during an arbitrary write (whichever one is last to fill up the memtable, typically), which is not necessarily a DeleteRange():

rocksdb/db/db_impl/db_impl_write.cc

Lines 2187 to 2191 in 87649d3

    
           cfd->imm()->Add(cfd->mem(), &context->memtables_to_free_); 
        
           new_mem->Ref(); 
        
           cfd->SetMemtable(new_mem); 
        
           InstallSuperVersionAndScheduleWork(cfd, &context->superversion_context, 
        
                                              mutable_cf_options);

ajkr · 2022-07-28T17:27:57Z

This also applies to the std::atomic_load, perhaps more importantly.

ajkr · 2022-07-29T19:26:59Z

I wonder if we should try ThreadLocalPtr or CoreLocalArray to hold references to the FragmentedRangeTombstoneList. Core-local is usually preferred nowadays. The FragmentedRangeTombstoneList references could be locally refcounted so we only mutate the shared refcount when the list changes.

I found this folly class interesting: https://github.com/facebook/folly/blob/4b930941724f3d2ed44dd9d7e389437dbf26d787/folly/concurrency/CoreCachedSharedPtr.h#L162. It uses shared_ptr for local refcounting (holder) of another shared_ptr. We might be able to use CoreLocalArray to do something similar. (I suspect our interface maybe should not be specific to shared_ptr; more details to come on that...)

facebook-github-bot · 2022-08-03T23:58:46Z

@cbi42 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-08-04T04:07:15Z

@cbi42 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-08-04T04:52:17Z

@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

ajkr

LGTM!

ajkr · 2022-08-04T18:42:20Z

Do you see any possibility for an API (like PrepareImmutable() or something) that is called by SwitchMemtable() with lock released? I've also been thinking about madvise() with MADV_DONTNEED the unused memory at the same point.

Also regarding this particular variable, I wonder if MemTableListVersion::AddRangeTombstoneIterators() should pass the info about immutability down to the MemTable::NewRangeTombstoneIterator().

PrepareImmutable()

I think an API like this is great once we do multiple things in the prepare phase (like when we add madvise() calls). One question about unused memory, what are the don't-needed memory in the immutable memtable? Will they be accessed by threads holding old SuperVersion that still treat this memtable as a live memtable?

Also regarding this particular variable, I wonder if MemTableListVersion::AddRangeTombstoneIterators() should
pass the info about immutability down to the MemTable::NewRangeTombstoneIterator().

Ah yes, good point. I think a similar change in Get/MultiGet calls are also needed and doable. Then we can get rid of this atomic flag.

ajkr · 2022-08-04T19:00:24Z

I want to explore using the SuperVersion APIs, at least for the immutable memtables. Since the cost of installing SuperVersion is cheaper for immutable memtables as it's only done once per table.

Got it. Is the idea to build the fragmented range tombstone list on the write path and then install it during memtable switch? It might be tricky to build it only when the memtable becomes immutable because that happens during an arbitrary write (whichever one is last to fill up the memtable, typically), which is not necessarily a DeleteRange():

I changed my mind on this due to realizing the fragmented range tombstone list is already being built every read while the memtable was mutable, so it doesn't seem particularly concerning to do it once in the write path. Though I do still hope we figure out how to build the list in the mutable memtable during the DeleteRange() write in a next step :)

facebook-github-bot · 2022-08-04T23:33:13Z

@cbi42 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-08-04T23:41:53Z

@cbi42 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-08-05T00:55:18Z

@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

cbi42 · 2022-08-05T00:57:38Z

LGTM!

Thanks for the review. I've updated the PR with the atomic boolean variable has_range_tombstone_list_ removed. The PR description is also updated with the latest benchmark.

cbi42 · 2022-08-21T19:25:28Z

Though I do still hope we figure out how to build the list in the mutable memtable during the DeleteRange() write in a next step :)

#10547 does the fragmentation for mutable memtables, but in the read path (the first reader does the fragmentation). Whether to do it in read or write path can be adjusted relatively easily. At least the idea of using CoreLocalArray to speed up reads is working.

Summary: Each read from memtable used to read and fragment all the range tombstones into a `FragmentedRangeTombstoneList`. #10380 improved the inefficient here by caching a `FragmentedRangeTombstoneList` with each immutable memtable. This PR extends the caching to mutable memtables. The fragmented range tombstone can be constructed in either read (This PR) or write path (#10584). With both implementation, each `DeleteRange()` will invalidate the cache, and the difference is where the cache is re-constructed.`CoreLocalArray` is used to store the cache with each memtable so that multi-threaded reads can be efficient. More specifically, each core will have a shared_ptr to a shared_ptr pointing to the current cache. Each read thread will only update the reference count in its core-local shared_ptr, and this is only needed when reading from mutable memtables. The choice between write path and read path is not an easy one: they are both improvement compared to no caching in the current implementation, but they favor different operations and could cause regression in the other operation (read vs write). The write path caching in (#10584) leads to a cleaner implementation, but I chose the read path caching here to avoid significant regression in write performance when there is a considerable amount of range tombstones in a single memtable (the number from the benchmark below suggests >1000 with concurrent writers). Note that even though the fragmented range tombstone list is only constructed in `DeleteRange()` operations, it could block other writes from proceeding, and hence affects overall write performance. Pull Request resolved: #10547 Test Plan: - TestGet() in stress test is updated in #10553 to compare Get() result against expected state: `./db_stress_branch --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4` - Perf benchmark: tested read and write performance where a memtable has 0, 1, 10, 100 and 1000 range tombstones. ``` ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=200000 --reads=100000 --disable_auto_compactions --max_num_range_tombstones=1000 ``` Write perf regressed since the cost of constructing fragmented range tombstone list is shifted from every read to a single write. 6cbe5d8 is included in the last column as a reference to see performance impact on multi-thread reads if `CoreLocalArray` is not used. micros/op averaged over 5 runs: first 4 columns are for fillrandom, last 4 columns are for readrandom. | |fillrandom main | write path caching | read path caching |memtable V3 (#10308) | readrandom main | write path caching | read path caching |memtable V3 | |--- |--- |--- |--- |--- | --- | --- | --- | --- | | 0 |6.35 |6.15 |5.82 |6.12 |2.24 |2.26 |2.03 |2.07 | | 1 |5.99 |5.88 |5.77 |6.28 |2.65 |2.27 |2.24 |2.5 | | 10 |6.15 |6.02 |5.92 |5.95 |5.15 |2.61 |2.31 |2.53 | | 100 |5.95 |5.78 |5.88 |6.23 |28.31 |2.34 |2.45 |2.94 | | 100 25 threads |52.01 |45.85 |46.18 |47.52 |35.97 |3.34 |3.34 |3.56 | | 1000 |6.0 |7.07 |5.98 |6.08 |333.18 |2.86 |2.7 |3.6 | | 1000 25 threads |52.6 |148.86 |79.06 |45.52 |473.49 |3.66 |3.48 |4.38 | - Benchmark performance of`readwhilewriting` from #10552, 100 range tombstones are written: `./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=500 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=100000 --reads=500000 --disable_auto_compactions --max_num_range_tombstones=10000 --finish_after_writes` readrandom micros/op: | |main |write path caching |read path caching |memtable V3 | |---|---|---|---|---| | single thread |48.28 |1.55 |1.52 |1.96 | | 25 threads |64.3 |2.55 |2.67 |2.64 | Reviewed By: ajkr Differential Revision: D38895410 Pulled By: cbi42 fbshipit-source-id: 930bfc309dd1b2f4e8e9042f5126785bba577559

facebook-github-bot added the CLA Signed label Jul 18, 2022

cbi42 force-pushed the cache-range-del branch from d6ba306 to 1aad1ca Compare July 18, 2022 03:33

cbi42 marked this pull request as ready for review July 18, 2022 06:31

cbi42 requested a review from ajkr July 18, 2022 06:35

cbi42 mentioned this pull request Jul 21, 2022

reduce DeleteRange's negative impact on read by fragmenting range tombstones in the write path #10308

Open

ajkr reviewed Jul 27, 2022

View reviewed changes

ajkr reviewed Jul 28, 2022

View reviewed changes

Comment thread db/memtable.cc Outdated

ajkr Jul 28, 2022

Copy link
Copy Markdown

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also applies to the std::atomic_load, perhaps more importantly.

ajkr reviewed Jul 29, 2022

View reviewed changes

cbi42 force-pushed the cache-range-del branch from dd08453 to fc87db4 Compare August 4, 2022 04:07

ajkr approved these changes Aug 4, 2022

View reviewed changes

cbi42 added 6 commits August 4, 2022 16:28

Fragmenting range tombstone in the write path

1d5bcab

remove readoption from NewRangeTombstoneIteratorInternal

87bd3b6

Fix potential data race

99cdf16

Only cache immutable memtables

de81d8a

Remove unnecessary atomic boolean flag

e1cc241

Updates HISTORY.md

3d438c1

cbi42 force-pushed the cache-range-del branch from fc87db4 to 3d438c1 Compare August 4, 2022 23:33

format

b4c8bb9

facebook-github-bot closed this in 9d77bf8 Aug 5, 2022

cbi42 mentioned this pull request Aug 21, 2022

Cache fragmented range tombstone list for mutable memtables #10547

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fragment memtable range tombstone in the write path#10380

Fragment memtable range tombstone in the write path#10380
cbi42 wants to merge 7 commits into
facebook:mainfrom
cbi42:cache-range-del

cbi42 commented Jul 18, 2022 •

edited

Loading

facebook-github-bot commented Jul 18, 2022

facebook-github-bot commented Jul 18, 2022

facebook-github-bot commented Jul 21, 2022

ajkr left a comment

ajkr Jul 27, 2022

ajkr Jul 28, 2022

ajkr commented Jul 28, 2022

cbi42 commented Jul 28, 2022

ajkr left a comment

ajkr Jul 28, 2022

ajkr Jul 29, 2022

facebook-github-bot commented Aug 3, 2022

facebook-github-bot commented Aug 4, 2022

facebook-github-bot commented Aug 4, 2022

ajkr left a comment

ajkr Aug 4, 2022

ajkr Aug 4, 2022

cbi42 Aug 4, 2022 •

edited

Loading

cbi42 Aug 4, 2022

ajkr commented Aug 4, 2022

facebook-github-bot commented Aug 4, 2022

facebook-github-bot commented Aug 4, 2022

facebook-github-bot commented Aug 5, 2022

cbi42 commented Aug 5, 2022

cbi42 commented Aug 21, 2022

Labels

3 participants

	cfd->imm()->Add(cfd->mem(), &context->memtables_to_free_);
	new_mem->Ref();
	cfd->SetMemtable(new_mem);
	InstallSuperVersionAndScheduleWork(cfd, &context->superversion_context,
	mutable_cf_options);

Uh oh!

Conversation

cbi42 commented Jul 18, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

facebook-github-bot commented Jul 18, 2022

facebook-github-bot commented Jul 18, 2022

facebook-github-bot commented Jul 21, 2022

ajkr left a comment

Choose a reason for hiding this comment

ajkr Jul 27, 2022

Choose a reason for hiding this comment

ajkr Jul 28, 2022

Choose a reason for hiding this comment

ajkr commented Jul 28, 2022

cbi42 commented Jul 28, 2022

ajkr left a comment

Choose a reason for hiding this comment

ajkr Jul 28, 2022

Choose a reason for hiding this comment

ajkr Jul 29, 2022

Choose a reason for hiding this comment

facebook-github-bot commented Aug 3, 2022

facebook-github-bot commented Aug 4, 2022

facebook-github-bot commented Aug 4, 2022

ajkr left a comment

Choose a reason for hiding this comment

ajkr Aug 4, 2022

Choose a reason for hiding this comment

ajkr Aug 4, 2022

Choose a reason for hiding this comment

cbi42 Aug 4, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

cbi42 Aug 4, 2022

Choose a reason for hiding this comment

ajkr commented Aug 4, 2022

facebook-github-bot commented Aug 4, 2022

facebook-github-bot commented Aug 4, 2022

facebook-github-bot commented Aug 5, 2022

cbi42 commented Aug 5, 2022

cbi42 commented Aug 21, 2022

Labels

3 participants

cbi42 commented Jul 18, 2022 •

edited

Loading

cbi42 Aug 4, 2022 •

edited

Loading