Split Block into multiple classes#8690
Conversation
By making different subclasses of Block, different implementations can be provided for different functionality. For example, it is possible to create different types of DataBlocks where the data is pre-parsed or the iterators behave differently.
pdillinger
left a comment
There was a problem hiding this comment.
Looks pretty promising so far 👍
| const char* data_; // contents_.data.data() | ||
| uint32_t limit_; // contents_.data.size() |
There was a problem hiding this comment.
I have never understood why these are duplicated from contents_, costing 12 bytes per cached block in memory (0.1 - 0.3% ish?)
I see limit_ is set to 0 if size is too small, but why not just use the < sizeof(uin32_t) in more places or throw away the bad BlockContents and replace it with empty?
I'm slightly concerned about CPU regression due to more virtual calls (and many conditionals remaining, not encoded in the virtual call), but this seems like a good "pay as you go" performance opportunity.
|
Still Work In Progress? |
This is "ready for review" now. This PR is phase 1 in a multi-step process to allow different implementations of Block (specifically DataBlock) to test performance. Phase 2 will be allowing different implementations of DataBlockIter and Phase 3 will be testing different implementations to see if we can improve performance. |
pdillinger
left a comment
There was a problem hiding this comment.
Some minor fixes & recommendations. I would not be surprised if there is some subtle regression in here that is not caught by unit tests. Please run crash test locally for some hours before shipping.
| options.create_if_missing = true; | ||
| options.prefix_extractor.reset(NewFixedPrefixTransform(20)); | ||
| // Picking a Capped prefix of 3 so that the short key ("foo") fits | ||
| options.prefix_extractor.reset(NewCappedPrefixTransform(3)); |
There was a problem hiding this comment.
This seems to fundamentally violate the point of the test "prefix longer than key"
| options.create_if_missing = true; | ||
| options.prefix_extractor.reset(NewCappedPrefixTransform(5)); | ||
| // Picking a Capped prefix of 3 so that the short key ("foo") fits | ||
| options.prefix_extractor.reset(NewCappedPrefixTransform(3)); |
There was a problem hiding this comment.
I'm not sure what the purpose of the original code is but changing it should be unnecessary. See #8529 (comment)
| Status decode_s __attribute__((__unused__)) = decoded_value_.DecodeFrom( | ||
| &v, have_first_key_, | ||
| (value_delta_encoded_ && shared) ? &decoded_value_.handle : nullptr); | ||
| (value_delta_encoded_ && is_shared) ? &decoded_value_.handle : nullptr); |
There was a problem hiding this comment.
We have seen a couple of times in stress test violation of the final assertion in DecodeEntry. (Internal ref T100361844.) In looking into that, I noticed that our test suite never reaches this point in the code with global_seqno_state_ != nullptr && !value_delta_encoded_. Since you are refactoring this code, it would be nice to add coverage if we can, potentially find any bugs old and/or new.
| if ((p = GetVarint32Ptr(p, limit, value_length)) == nullptr) { | ||
| return nullptr; | ||
| } | ||
| const char* Block::DecodeEntry(const char* p, const char* limit, |
There was a problem hiding this comment.
Recommend keeping 'inline' (and more below)
What about performance tests? |
I have executed the performance tests from the Wiki page (with also revrange and revrangewhilewriting tests) and so no regressions. I can post the results/comparison in the PR if it helps. I have not executed stress tests but will do so shortly. |
|
Here are the commands I'm using: Here are the results. Results before 8690: 962.0MB/s, 930.0MB/s, 916.5MB/s I will do a second build and run from scratch to make sure no mistakes. |
pdillinger
left a comment
There was a problem hiding this comment.
This is among the most performance-critical areas of code and Andrew's test indicates notable regression. More investigation and testing needed.
I did a second run with a slight variation (skipping the USE_LTO=1 in the So, will try a third time with the original commands. |
My third rebuild from scratch brought back the 10% regression. That is surprising USE_LTO=1 in the |
|
Here are some commands I found to exaggerate the regression up to 15% while using simpler build options: |
Makes the Block class into a base class with different implementations (DataBlock, MetaBlock, IndexBlock, etc).
This change can allow different implementations of these base classes for use for different iterator implementations. For example, it is possible to create different data block and iterator implementations that un-"delta encode" the values, which may yield performance improvements under certain circumstances.