Skip to content

Add GetUpdateSince() to crash test#12646

Open
hx235 wants to merge 1 commit into
facebook:mainfrom
hx235:get_update_since
Open

Add GetUpdateSince() to crash test#12646
hx235 wants to merge 1 commit into
facebook:mainfrom
hx235:get_update_since

Conversation

@hx235

@hx235 hx235 commented May 11, 2024

Copy link
Copy Markdown
Contributor

Context/Summary: as titled

Test: CI [ongoing]

@facebook-github-bot

Copy link
Copy Markdown
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 requested review from ajkr, cbi42 and jowlyzhang May 13, 2024 19:48

last_seqno = res.sequence;
iter->Next();
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can check status after !iter->Valid()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment thread db_stress_tool/db_stress_test_base.cc Outdated
iter->Next();

while (iter->Valid()) {
if (!iter->status().ok()) {

@cbi42 cbi42 May 14, 2024

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetUpdateSince() seems to expect WAL to contain updates with consecutive sequence numbers. It can fail the stress test here when there is a concurrent file ingestion: #10007. There may be other incompatibilities like writes with disable_wal can cause a hole in sequence numbers in WAL too (https://groups.google.com/g/rocksdb/c/W1Axka8CVf8).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - need to look into more of this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed and launching more tests to check

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found more incompatibility run - need to fix them

Comment thread tools/db_crashtest.py Outdated
if dest_params.get("get_update_since_one_in") != 0:
# Set very high values to avoid WAL cleanup during `GetUpdatesSince()`
dest_params["WAL_ttl_seconds"] = 0xFFFFFFFFFFFFFFFF
dest_params["WAL_size_limit_MB"] = 0xFFFFFFFFFFFFFFFF

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we still set some limit to avoid running out of space?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tuning down the parameter will make the GetUpdateSinceOneIn fail in more ways. This is more difficult than I anticipate will take a look

@hx235 hx235 May 23, 2024

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed - it turns out that we don't have to keep such a high value. If WAL was deleted, the returned iterator will just be ok() but not valid.

Comment thread tools/db_crashtest.py Outdated
"optimize_filters_for_memory": lambda: random.randint(0, 1),
"partition_filters": lambda: random.randint(0, 1),
"partition_pinning": lambda: random.randint(0, 3),
"get_update_since_one_in": lambda: random.choice([0, 0, 0, 1000]),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May need to sanity check that disable_wal is 0.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@hx235

hx235 commented May 17, 2024

Copy link
Copy Markdown
Contributor Author

~~Might need to temporarily pause on this as the space vs functionality trade-off is harder to solve than expected. ~~ Got some new ideas from @ajkr so can continue now

@hx235 hx235 closed this May 17, 2024
@hx235 hx235 reopened this May 17, 2024
@hx235 hx235 force-pushed the get_update_since branch from 4c380dc to 7647a4b Compare May 17, 2024 18:13
@facebook-github-bot

Copy link
Copy Markdown
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 force-pushed the get_update_since branch from 7647a4b to 351ae10 Compare May 23, 2024 19:09
@facebook-github-bot

Copy link
Copy Markdown
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 force-pushed the get_update_since branch from 351ae10 to f663e83 Compare May 23, 2024 19:14
@facebook-github-bot

Copy link
Copy Markdown
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

Copy link
Copy Markdown
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

1 similar comment
@facebook-github-bot

Copy link
Copy Markdown
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 force-pushed the get_update_since branch from f663e83 to 19ac8a0 Compare May 24, 2024 21:59
@facebook-github-bot

Copy link
Copy Markdown
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@ajkr ajkr left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. I think there are some possible improvements, though

Comment on lines +2481 to +2482
// `TryAgain()` is allowed as it might happens when new writes happen between
// calling `GetLatestSequenceNumber()` above and `iter->Next()`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I don't understand. GetLatestSequenceNumber() appears to be the maximum possible value of the lower bound we pass to GetUpdatesSince().

Comment on lines +776 to +777
DEFINE_int32(get_update_since_one_in, 0,
"If non-zero, then GetUpdateSince() will be called "

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/get_update_since/get_updates_since/g
s/GetUpdateSince/GetUpdatesSince/g

Comment thread tools/db_crashtest.py
"partition_pinning": lambda: random.randint(0, 3),
# `get_update_since_one_in` requires no sequence number hole in all WALs
# To simplify santization, we enable or disable `get_update_since_one_in` consistently across runs
"get_update_since_one_in": random.choice([0, 1000000]),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it is a relatively uncommon/simpler function and testing it prevents us from testing common/complicated functions (disableWAL writes, IngestExternalFile()), I think we should enable it much less frequently. Maybe one in ten or even less frequent.

Comment thread tools/db_crashtest.py
"partition_filters": lambda: random.randint(0, 1),
"partition_pinning": lambda: random.randint(0, 3),
# `get_update_since_one_in` requires no sequence number hole in all WALs
# To simplify santization, we enable or disable `get_update_since_one_in` consistently across runs

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm that makes db_crashtest.py harder to use, compared to changing it per run.

If we save the GetLatestSequenceNumber() at startup and use that as a lower bound for GetUpdatesSince()'s seq_number, does that allow us to ignore what past runs did?

Comment on lines +2450 to +2462
if (!iter->status().ok()) {
fprintf(stderr,
"Iterator with non-ok status returned by "
"GetUpdatesSince(): %s \n",
iter->status().ToString().c_str());
thread->shared->SafeTerminate();
} else if (!iter->Valid()) {
return;
}

BatchResult res = iter->GetBatch();
SequenceNumber last_seqno = res.sequence;
iter->Next();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a lot of extra code to avoid a special case for the first iteration of the loop. Can we initialize last_seqno to zero and let the loop treat that as a value that does not need consecutive-ness checks?

@hx235

hx235 commented Jun 6, 2024

Copy link
Copy Markdown
Contributor Author

Won't be able to get to this now since there are some more context for me to learn and stablize the test. Will get to this later.

iter->status().ToString().c_str());
thread->shared->SafeTerminate();
} else if (!iter->Valid()) {
return;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am not wrong, then file deletions should be re-enabled here again by calling db_->EnableFileDeletions();.
Otherwise this test would disable file deletions and not turn them back on, so it may influence the behavior of other tests.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch. I think ideally this special case would be deleted and we just let control flow through the end of the function per https://github.com/facebook/rocksdb/pull/12646/files#r1628634984. Still, thanks for demonstrating how seemingly small/innocent extra code can make things subtly wrong.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Early return needs to be handled carefully with clean up work…

Comment thread tools/db_crashtest.py
# WriteCommitted only
dest_params["use_put_entity_one_in"] = 0
if dest_params.get("get_update_since_one_in") != 0:
# To avoid sequence number hole in WAL

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think log file recycling is also not compatible with GetUpdatesSince so maybe we can disable it here too. We should probably mention this in the documentation for GetUpdatesSince.

jsteemann added a commit to arangodb/arangodb that referenced this pull request Jul 22, 2024
according to facebook/rocksdb#12646 (comment), GetUpdatesSince() may be incompatible with recycling log files.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

6 participants