Skip to content

Split Block into multiple classes#8690

Open
mrambacher wants to merge 16 commits into
facebook:mainfrom
mrambacher:DataBlock
Open

Split Block into multiple classes#8690
mrambacher wants to merge 16 commits into
facebook:mainfrom
mrambacher:DataBlock

Conversation

@mrambacher

@mrambacher mrambacher commented Aug 22, 2021

Copy link
Copy Markdown
Contributor

Makes the Block class into a base class with different implementations (DataBlock, MetaBlock, IndexBlock, etc).

This change can allow different implementations of these base classes for use for different iterator implementations. For example, it is possible to create different data block and iterator implementations that un-"delta encode" the values, which may yield performance improvements under certain circumstances.

By making different subclasses of Block, different implementations can be provided for different functionality.  For example, it is possible to create different types of DataBlocks where the data is pre-parsed or the iterators behave differently.

@pdillinger pdillinger left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty promising so far 👍

Comment thread table/block_based/block.h Outdated
Comment thread table/block_based/block.h Outdated
Comment on lines +203 to +204
const char* data_; // contents_.data.data()
uint32_t limit_; // contents_.data.size()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have never understood why these are duplicated from contents_, costing 12 bytes per cached block in memory (0.1 - 0.3% ish?)

I see limit_ is set to 0 if size is too small, but why not just use the < sizeof(uin32_t) in more places or throw away the bad BlockContents and replace it with empty?

I'm slightly concerned about CPU regression due to more virtual calls (and many conditionals remaining, not encoded in the virtual call), but this seems like a good "pay as you go" performance opportunity.

Comment thread table/block_based/block_like_traits.h Outdated
Comment thread table/block_based/block.h Outdated
@pdillinger

Copy link
Copy Markdown
Contributor

Still Work In Progress?

@mrambacher mrambacher changed the title WIP: Split Block into multiple classes Sep 28, 2021
@mrambacher

Copy link
Copy Markdown
Contributor Author

Still Work In Progress?

This is "ready for review" now. This PR is phase 1 in a multi-step process to allow different implementations of Block (specifically DataBlock) to test performance. Phase 2 will be allowing different implementations of DataBlockIter and Phase 3 will be testing different implementations to see if we can improve performance.

@pdillinger pdillinger left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor fixes & recommendations. I would not be surprised if there is some subtle regression in here that is not caught by unit tests. Please run crash test locally for some hours before shipping.

options.create_if_missing = true;
options.prefix_extractor.reset(NewFixedPrefixTransform(20));
// Picking a Capped prefix of 3 so that the short key ("foo") fits
options.prefix_extractor.reset(NewCappedPrefixTransform(3));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to fundamentally violate the point of the test "prefix longer than key"

options.create_if_missing = true;
options.prefix_extractor.reset(NewCappedPrefixTransform(5));
// Picking a Capped prefix of 3 so that the short key ("foo") fits
options.prefix_extractor.reset(NewCappedPrefixTransform(3));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the purpose of the original code is but changing it should be unnecessary. See #8529 (comment)

Status decode_s __attribute__((__unused__)) = decoded_value_.DecodeFrom(
&v, have_first_key_,
(value_delta_encoded_ && shared) ? &decoded_value_.handle : nullptr);
(value_delta_encoded_ && is_shared) ? &decoded_value_.handle : nullptr);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have seen a couple of times in stress test violation of the final assertion in DecodeEntry. (Internal ref T100361844.) In looking into that, I noticed that our test suite never reaches this point in the code with global_seqno_state_ != nullptr && !value_delta_encoded_. Since you are refactoring this code, it would be nice to add coverage if we can, potentially find any bugs old and/or new.

if ((p = GetVarint32Ptr(p, limit, value_length)) == nullptr) {
return nullptr;
}
const char* Block::DecodeEntry(const char* p, const char* limit,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend keeping 'inline' (and more below)

@ajkr

ajkr commented Sep 29, 2021

Copy link
Copy Markdown
Contributor

I would not be surprised if there is some subtle regression in here that is not caught by unit tests. Please run crash test locally for some hours before shipping.

What about performance tests?

@mrambacher

Copy link
Copy Markdown
Contributor Author

I would not be surprised if there is some subtle regression in here that is not caught by unit tests. Please run crash test locally for some hours before shipping.

What about performance tests?

I have executed the performance tests from the Wiki page (with also revrange and revrangewhilewriting tests) and so no regressions. I can post the results/comparison in the PR if it helps.

I have not executed stress tests but will do so shortly.

@ajkr

ajkr commented Sep 29, 2021

Copy link
Copy Markdown
Contributor

Here are the commands I'm using:

$ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -target_file_size_base=1048576 -compression_type=none
$ make clean && EXTRA_LDFLAGS="-fprofile-generate" EXTRA_CXXFLAGS="-fprofile-generate" V=1 USE_LTO=1 DEBUG_LEVEL=0 OPTIMIZE_LEVEL=-O3 make -j48 db_bench
$ TEST_TMPDIR=/dev/shm ./db_bench -readonly=1 -use_existing_db=1 -benchmarks=seekrandom -seek_nexts=1000 -target_file_size_base=1048576 -cache_size=0 -duration=30
$ find . -name '*.o' | xargs rm -f
$ EXTRA_LDFLAGS="-fprofile-use -fprofile-correction" EXTRA_CXXFLAGS="-Wno-error=missing-profile -fprofile-use -fprofile-correction" V=1 USE_LTO=1 DEBUG_LEVEL=0 OPTIMIZE_LEVEL=-O3 make -j48 db_bench
$ TEST_TMPDIR=/dev/shm ./db_bench -readonly=1 -use_existing_db=1 -benchmarks=seekrandom -seek_nexts=1000 -target_file_size_base=1048576 -cache_size=0 -duration=30

Here are the results.

Results before 8690: 962.0MB/s, 930.0MB/s, 916.5MB/s
Results after 8690: 838.5MB/s, 850.6MB/s, 822.5MB/s

I will do a second build and run from scratch to make sure no mistakes.

@pdillinger pdillinger left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is among the most performance-critical areas of code and Andrew's test indicates notable regression. More investigation and testing needed.

@ajkr

ajkr commented Sep 29, 2021

Copy link
Copy Markdown
Contributor

Here are the commands I'm using:

$ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -target_file_size_base=1048576 -compression_type=none
$ make clean && EXTRA_LDFLAGS="-fprofile-generate" EXTRA_CXXFLAGS="-fprofile-generate" V=1 USE_LTO=1 DEBUG_LEVEL=0 OPTIMIZE_LEVEL=-O3 make -j48 db_bench
$ TEST_TMPDIR=/dev/shm ./db_bench -readonly=1 -use_existing_db=1 -benchmarks=seekrandom -seek_nexts=1000 -target_file_size_base=1048576 -cache_size=0 -duration=30
$ find . -name '*.o' | xargs rm -f
$ EXTRA_LDFLAGS="-fprofile-use -fprofile-correction" EXTRA_CXXFLAGS="-Wno-error=missing-profile -fprofile-use -fprofile-correction" V=1 USE_LTO=1 DEBUG_LEVEL=0 OPTIMIZE_LEVEL=-O3 make -j48 db_bench
$ TEST_TMPDIR=/dev/shm ./db_bench -readonly=1 -use_existing_db=1 -benchmarks=seekrandom -seek_nexts=1000 -target_file_size_base=1048576 -cache_size=0 -duration=30

Here are the results.

Results before 8690: 962.0MB/s, 930.0MB/s, 916.5MB/s Results after 8690: 838.5MB/s, 850.6MB/s, 822.5MB/s

I will do a second build and run from scratch to make sure no mistakes.

I did a second run with a slight variation (skipping the USE_LTO=1 in the -fprofile-generate command) and then the regression was 2% instead of 10%. Did ten runs each instead of three to make sure the result was stable.

So, will try a third time with the original commands.

@ajkr

ajkr commented Sep 29, 2021

Copy link
Copy Markdown
Contributor

Here are the commands I'm using:

$ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -target_file_size_base=1048576 -compression_type=none
$ make clean && EXTRA_LDFLAGS="-fprofile-generate" EXTRA_CXXFLAGS="-fprofile-generate" V=1 USE_LTO=1 DEBUG_LEVEL=0 OPTIMIZE_LEVEL=-O3 make -j48 db_bench
$ TEST_TMPDIR=/dev/shm ./db_bench -readonly=1 -use_existing_db=1 -benchmarks=seekrandom -seek_nexts=1000 -target_file_size_base=1048576 -cache_size=0 -duration=30
$ find . -name '*.o' | xargs rm -f
$ EXTRA_LDFLAGS="-fprofile-use -fprofile-correction" EXTRA_CXXFLAGS="-Wno-error=missing-profile -fprofile-use -fprofile-correction" V=1 USE_LTO=1 DEBUG_LEVEL=0 OPTIMIZE_LEVEL=-O3 make -j48 db_bench
$ TEST_TMPDIR=/dev/shm ./db_bench -readonly=1 -use_existing_db=1 -benchmarks=seekrandom -seek_nexts=1000 -target_file_size_base=1048576 -cache_size=0 -duration=30

Here are the results.
Results before 8690: 962.0MB/s, 930.0MB/s, 916.5MB/s Results after 8690: 838.5MB/s, 850.6MB/s, 822.5MB/s
I will do a second build and run from scratch to make sure no mistakes.

I did a second run with a slight variation (skipping the USE_LTO=1 in the -fprofile-generate command) and then the regression was 2% instead of 10%. Did ten runs each instead of three to make sure the result was stable.

So, will try a third time with the original commands.

My third rebuild from scratch brought back the 10% regression. That is surprising USE_LTO=1 in the -fprofile-generate command makes a big difference, but beside the point. I'd guess you can repro the issue with less fancy (and faster) build commands though I haven't tried yet actually.

@ajkr

ajkr commented Sep 30, 2021

Copy link
Copy Markdown
Contributor

Here are some commands I found to exaggerate the regression up to 15% while using simpler build options:

$ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -target_file_size_base=1048576 -compression_type=none
$ make clean && V=1 DEBUG_LEVEL=0 OPTIMIZE_LEVEL=-O3 make -j48 db_bench
$ TEST_TMPDIR=/dev/shm ./db_bench -mmap_read=1 -verify_checksum=false -use_existing_db=1 -benchmarks=seekrandom -seek_nexts=1000 -target_file_size_base=1048576 -cache_size=0 -duration=30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

4 participants