Add GetUpdateSince() to crash test#12646
Conversation
|
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
|
||
| last_seqno = res.sequence; | ||
| iter->Next(); | ||
| } |
There was a problem hiding this comment.
Can check status after !iter->Valid()
| iter->Next(); | ||
|
|
||
| while (iter->Valid()) { | ||
| if (!iter->status().ok()) { |
There was a problem hiding this comment.
GetUpdateSince() seems to expect WAL to contain updates with consecutive sequence numbers. It can fail the stress test here when there is a concurrent file ingestion: #10007. There may be other incompatibilities like writes with disable_wal can cause a hole in sequence numbers in WAL too (https://groups.google.com/g/rocksdb/c/W1Axka8CVf8).
There was a problem hiding this comment.
Thanks - need to look into more of this.
There was a problem hiding this comment.
Fixed and launching more tests to check
There was a problem hiding this comment.
Found more incompatibility run - need to fix them
| if dest_params.get("get_update_since_one_in") != 0: | ||
| # Set very high values to avoid WAL cleanup during `GetUpdatesSince()` | ||
| dest_params["WAL_ttl_seconds"] = 0xFFFFFFFFFFFFFFFF | ||
| dest_params["WAL_size_limit_MB"] = 0xFFFFFFFFFFFFFFFF |
There was a problem hiding this comment.
Should we still set some limit to avoid running out of space?
There was a problem hiding this comment.
Tuning down the parameter will make the GetUpdateSinceOneIn fail in more ways. This is more difficult than I anticipate will take a look
There was a problem hiding this comment.
Fixed - it turns out that we don't have to keep such a high value. If WAL was deleted, the returned iterator will just be ok() but not valid.
| "optimize_filters_for_memory": lambda: random.randint(0, 1), | ||
| "partition_filters": lambda: random.randint(0, 1), | ||
| "partition_pinning": lambda: random.randint(0, 3), | ||
| "get_update_since_one_in": lambda: random.choice([0, 0, 0, 1000]), |
There was a problem hiding this comment.
May need to sanity check that disable_wal is 0.
|
~~Might need to temporarily pause on this as the space vs functionality trade-off is harder to solve than expected. ~~ Got some new ideas from @ajkr so can continue now |
|
@hx235 has updated the pull request. You must reimport the pull request before landing. |
|
@hx235 has updated the pull request. You must reimport the pull request before landing. |
|
@hx235 has updated the pull request. You must reimport the pull request before landing. |
|
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
1 similar comment
|
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
@hx235 has updated the pull request. You must reimport the pull request before landing. |
ajkr
left a comment
There was a problem hiding this comment.
Thanks for working on this. I think there are some possible improvements, though
| // `TryAgain()` is allowed as it might happens when new writes happen between | ||
| // calling `GetLatestSequenceNumber()` above and `iter->Next()` |
There was a problem hiding this comment.
Sorry I don't understand. GetLatestSequenceNumber() appears to be the maximum possible value of the lower bound we pass to GetUpdatesSince().
| DEFINE_int32(get_update_since_one_in, 0, | ||
| "If non-zero, then GetUpdateSince() will be called " |
There was a problem hiding this comment.
s/get_update_since/get_updates_since/g
s/GetUpdateSince/GetUpdatesSince/g
| "partition_pinning": lambda: random.randint(0, 3), | ||
| # `get_update_since_one_in` requires no sequence number hole in all WALs | ||
| # To simplify santization, we enable or disable `get_update_since_one_in` consistently across runs | ||
| "get_update_since_one_in": random.choice([0, 1000000]), |
There was a problem hiding this comment.
Since it is a relatively uncommon/simpler function and testing it prevents us from testing common/complicated functions (disableWAL writes, IngestExternalFile()), I think we should enable it much less frequently. Maybe one in ten or even less frequent.
| "partition_filters": lambda: random.randint(0, 1), | ||
| "partition_pinning": lambda: random.randint(0, 3), | ||
| # `get_update_since_one_in` requires no sequence number hole in all WALs | ||
| # To simplify santization, we enable or disable `get_update_since_one_in` consistently across runs |
There was a problem hiding this comment.
Hm that makes db_crashtest.py harder to use, compared to changing it per run.
If we save the GetLatestSequenceNumber() at startup and use that as a lower bound for GetUpdatesSince()'s seq_number, does that allow us to ignore what past runs did?
| if (!iter->status().ok()) { | ||
| fprintf(stderr, | ||
| "Iterator with non-ok status returned by " | ||
| "GetUpdatesSince(): %s \n", | ||
| iter->status().ToString().c_str()); | ||
| thread->shared->SafeTerminate(); | ||
| } else if (!iter->Valid()) { | ||
| return; | ||
| } | ||
|
|
||
| BatchResult res = iter->GetBatch(); | ||
| SequenceNumber last_seqno = res.sequence; | ||
| iter->Next(); |
There was a problem hiding this comment.
This seems like a lot of extra code to avoid a special case for the first iteration of the loop. Can we initialize last_seqno to zero and let the loop treat that as a value that does not need consecutive-ness checks?
|
Won't be able to get to this now since there are some more context for me to learn and stablize the test. Will get to this later. |
| iter->status().ToString().c_str()); | ||
| thread->shared->SafeTerminate(); | ||
| } else if (!iter->Valid()) { | ||
| return; |
There was a problem hiding this comment.
If I am not wrong, then file deletions should be re-enabled here again by calling db_->EnableFileDeletions();.
Otherwise this test would disable file deletions and not turn them back on, so it may influence the behavior of other tests.
There was a problem hiding this comment.
Great catch. I think ideally this special case would be deleted and we just let control flow through the end of the function per https://github.com/facebook/rocksdb/pull/12646/files#r1628634984. Still, thanks for demonstrating how seemingly small/innocent extra code can make things subtly wrong.
There was a problem hiding this comment.
Good catch! Early return needs to be handled carefully with clean up work…
| # WriteCommitted only | ||
| dest_params["use_put_entity_one_in"] = 0 | ||
| if dest_params.get("get_update_since_one_in") != 0: | ||
| # To avoid sequence number hole in WAL |
There was a problem hiding this comment.
I think log file recycling is also not compatible with GetUpdatesSince so maybe we can disable it here too. We should probably mention this in the documentation for GetUpdatesSince.
according to facebook/rocksdb#12646 (comment), GetUpdatesSince() may be incompatible with recycling log files.
Context/Summary: as titled
Test: CI [ongoing]