Skip to content

WBWI supports writing timestamps as you go#14084

Open
nickbrekhus wants to merge 2 commits into
facebook:mainfrom
nickbrekhus:export-D85605627
Open

WBWI supports writing timestamps as you go#14084
nickbrekhus wants to merge 2 commits into
facebook:mainfrom
nickbrekhus:export-D85605627

Conversation

@nickbrekhus

Copy link
Copy Markdown

Summary:
Adds support to write timestamps as-you-go for UDT-enabled CFs within a WBWI (WriteBatchWithIndex). When a WBWI is instantiated with require_timestamps=true, mutations must supply a timestamp iff the CF is UDT-enabled. Callers should not call UpdateTimestamps when required_timestamps=true as this will overwrite the timestamps supplied with mutations.

When require_timestamps = false, the previous behavior is maintained. In other words timestamps must never be supplied with mutations and UpdateTimestamps must be called before writing the batch to DB.

Differential Revision: D85605627

@meta-codesync

meta-codesync Bot commented Oct 27, 2025

Copy link
Copy Markdown

@nickbrekhus has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85605627.

@meta-cla meta-cla Bot added the CLA Signed label Oct 27, 2025
nickbrekhus pushed a commit to nickbrekhus/rocksdb that referenced this pull request Oct 27, 2025
Summary:

Adds support to write timestamps as-you-go for UDT-enabled CFs within a WBWI (WriteBatchWithIndex). When a WBWI is instantiated with require_timestamps=true, mutations *must* supply a timestamp iff the CF is UDT-enabled. Callers *should not* call UpdateTimestamps when `required_timestamps=true` as this will overwrite the timestamps supplied with mutations.

When require_timestamps = false, the previous behavior is maintained. In other words timestamps must never be supplied with mutations and UpdateTimestamps must be called before writing the batch to DB.

Differential Revision: D85605627
@nickbrekhus nickbrekhus force-pushed the export-D85605627 branch 2 times, most recently from 7bf995c to dd77ee3 Compare October 28, 2025 00:36
nickbrekhus pushed a commit to nickbrekhus/rocksdb that referenced this pull request Oct 28, 2025
Summary:

Adds support to write timestamps as-you-go for UDT-enabled CFs within a WBWI (WriteBatchWithIndex). When a WBWI is instantiated with require_timestamps=true, mutations *must* supply a timestamp iff the CF is UDT-enabled. Callers *should not* call UpdateTimestamps when `required_timestamps=true` as this will overwrite the timestamps supplied with mutations.

When require_timestamps = false, the previous behavior is maintained. In other words timestamps must never be supplied with mutations and UpdateTimestamps must be called before writing the batch to DB.

Differential Revision: D85605627
nickbrekhus pushed a commit to nickbrekhus/rocksdb that referenced this pull request Oct 28, 2025
Summary:

Adds support to write timestamps as-you-go for UDT-enabled CFs within a WBWI (WriteBatchWithIndex). When a WBWI is instantiated with require_timestamps=true, mutations *must* supply a timestamp iff the CF is UDT-enabled. Callers *should not* call UpdateTimestamps when `required_timestamps=true` as this will overwrite the timestamps supplied with mutations.

When require_timestamps = false, the previous behavior is maintained. In other words timestamps must never be supplied with mutations and UpdateTimestamps must be called before writing the batch to DB.

Differential Revision: D85605627
@nickbrekhus nickbrekhus force-pushed the export-D85605627 branch 2 times, most recently from 2ac1238 to dd77ee3 Compare October 28, 2025 00:37
nickbrekhus pushed a commit to nickbrekhus/rocksdb that referenced this pull request Oct 28, 2025
Summary:

Adds support to write timestamps as-you-go for UDT-enabled CFs within a WBWI (WriteBatchWithIndex). When a WBWI is instantiated with require_timestamps=true, mutations *must* supply a timestamp iff the CF is UDT-enabled. Callers *should not* call UpdateTimestamps when `required_timestamps=true` as this will overwrite the timestamps supplied with mutations.

When require_timestamps = false, the previous behavior is maintained. In other words timestamps must never be supplied with mutations and UpdateTimestamps must be called before writing the batch to DB.

Differential Revision: D85605627
nickbrekhus pushed a commit to nickbrekhus/rocksdb that referenced this pull request Oct 28, 2025
Summary:

Adds support to write timestamps as-you-go for UDT-enabled CFs within a WBWI (WriteBatchWithIndex). When a WBWI is instantiated with require_timestamps=true, mutations *must* supply a timestamp iff the CF is UDT-enabled. Callers *should not* call UpdateTimestamps when `required_timestamps=true` as this will overwrite the timestamps supplied with mutations.

When require_timestamps = false, the previous behavior is maintained. In other words timestamps must never be supplied with mutations and UpdateTimestamps must be called before writing the batch to DB.

Differential Revision: D85605627
Nick Brekhus added 2 commits October 27, 2025 18:16
Summary:

When WBWI, UDT CF, overwrite and rollback are used together, WBWI index is corrupted and will contain duplicate key entries. The problem, which is fixed by this diff, is that the record reader does not strip the timestamp packed with the key.

This results in corruption of the index during rebuild, since overwrite deduplication uses the slice read from the batch. The indexed key is still correct, as AddOrUpdate re-reads the key from the offset, resulting in the key being stripped.

Differential Revision: D85616849
Summary:

Adds support to write timestamps as-you-go for UDT-enabled CFs within a WBWI (WriteBatchWithIndex). When a WBWI is instantiated with require_timestamps=true, mutations *must* supply a timestamp iff the CF is UDT-enabled. Callers *should not* call UpdateTimestamps when `required_timestamps=true` as this will overwrite the timestamps supplied with mutations.

When require_timestamps = false, the previous behavior is maintained. In other words timestamps must never be supplied with mutations and UpdateTimestamps must be called before writing the batch to DB.

Differential Revision: D85605627
if (compare_result < 0) {
return {Status::OK(), false};
}
return {Status::InvalidArgument("cannot write below max ts for key"),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be "below last ts for key"

Are we introducing the capability to write multiple entries in a single WB with the same key but different timestamps? It might seem obvious that the various timestamped versions of each key are all committed, but not necessarily when they are coalesced for the purposes of WBWI reads. That should be clarified.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in general there can be multiple timestamps within a batch when this feature is enabled, including multiple timestamps for a given key. How do you suggest clarifying that? I tried to make at least the read behavior clear in the arg comment.

// enabled.
//
// Note that regardless of whether require_timestamps is
// true or false, read semantics are RYOW. In other words,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree that RYOW implies ignoring unpersisted/uncommitted timestamps. RYOW means you can see your own writes, not that you get them when you don't ask for them.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a few options here outside of the one I've chosen. I think implementation wise there isn't a major difference in complexity so long as we are OK with some extra cost when doing point reads below max(ts in batch). To avoid the extra cost the index needs to be (key, ts) instead of (key).

  1. the current behavior (described in comment)
  2. having two notions of read_timestamp, one that controls visibility of reads inside the batch and another that controls visibility of reads within the DB.
  3. ReadOptions::read_timestamp controls visibility of both.

I didn't like 2 because it is in my view a bit confusing. Further, my particular use case had no need for this level of control. (In practice the batch read_timestamp would always be max(ts for key, ts in batch)

3 I think is fairly intuitive but changes the snapshot position in the database as well. In principle that could cause some records in the DB to become visible that may not be intended. For our particular consistency model any records above the read snapshot that are actually read would be a bug in the system, so I think that's OK. The other minor gripe I have with this is that again in practice read_timestamp would always be max(ts for key, ts in batch).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't see this sooner. My open GitHub page apparently wasn't auto-refreshing like it's supposed to.

Isn't using max(ts for key, ts in batch) equivalent to no ReadOptions::timestamp (using the UDT assumption of timestamp monotonicity by key)? So wouldn't it suffice to get that working first without committing to a semantic for timestamp with WBWI? @nickbrekhus

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the impression it wouldn't be possible to achieve snapshot isolation if timestamp is unconstrained and concurrent writes are possible, but I wasn't considering using read_options.snapshot. Assuming specifying both read_timestamp and snapshot is permissible (the documentation is unclear on this but I don't see why it wouldn't work) that gives user pretty much full control without introducing any new options.

Definitely an improvement over the approach taken in this diff.

snapshot isolation, ryow: ReadOptions{.snapshot = snapshot, .read_timestamp = nullptr}
read commited, ryow: ReadOptions{.snapshot = nullptr, .read_timestamp = nullptr}

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One flaw in the reasoning in my comment: As far as I can tell, you cannot select the position of a snapshot. It is always as-of the latest write flushed to the database. I don't think that gives users enough flexibility to do useful things with WBWI + UDT although it would be adequate for my current use cases.

For example, if you were implementing multi shard transactions with snapshot isolation (si), you would want to select a basis for the transaction that is the same for all shards participating. This is not likely to be the latest write to the database for at least some shards. So in this scenario specifying a snapshot does not work.

Using solely timestamp to control visibility doesn't work either (that either causes you to lose RYOW or get read committed semantics), so you are kind of stuck and would need to use UpdateTimestamps().

That said, this is not the scenario I want to add this feature for and maybe it's OK for that kind of use case to use UpdateTimestamps(), as I don't see a lot of utility in allowing multiple timestamps when the batches logically represent a single transaction. In our case they represent multiple transactions that we batch to amortize various overheads. We still need read-your-write semantics during this batching because various aspects of the system rely on it during batch application.

On the other hand, if we just continue to ignore timestamp visibility evaluation in write batches the multi-shard si scenario works. The other fix I can see for this is to allow specifying a batch_read_timestamp but I do have some concerns about the ergonomics of such an interface and I don't have any clear understanding of the scenarios that would benefit.

const ReadableWriteBatch* write_batch = delta_iterator_->GetWriteBatch();
assert(write_batch);

Slice raw_key, value_ignore, blob_ignore, xid_ignore;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"raw key" is not recognized, specific terminology. I guess it's a user key with timestamp.

size_t ts_sz = ucmp->timestamp_size();
if (ts_sz > 0) {
ret.key = StripTimestampFromUserKey(ret.key, ts_sz);
Slice rawKey = ret.key;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid "raw key"

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix naming to snake_case too.

bool current_at_base_;
bool equal_keys_;
bool allow_unprepared_value_;
// if set, the iterator will return Entrys with timestamps filled in.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Fill timestamps" is a bad name for how to read something. It implies modification.

Also, "will return Entrys with timestamps filled in" - does that mean only entries that already have timestamps or all entries but each with an associated timestamp?

Also, the data model with UDT includes timestamps. Doing something different should not be considered a default or normalized interpretation of that data model. And 'false' are considered default or normal values for bools. How about "ignore_timestamps"?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'm trying to do is preserve the existing behavior when require_timestamps is not set on the WBWI. Namely, the iterator would return Slice() for the timestamp of batched keys in all CFs previously. The iterator itself cannot tell, so some sort of flag or sentinel is needed. The batch entry itself has a zero-fillled slice for timestamp. But It could be that the zero-filled slice is a valid timestamp.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignore_timestamps sounds good to me.


Slice raw_key, value_ignore, blob_ignore, xid_ignore;
WriteType type_ignore;
Status s = write_batch->GetEntryFromDataOffset(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function looks expensive overall. Why's it not simpler (like key())?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment