WBWI supports writing timestamps as you go by nickbrekhus · Pull Request #14084 · facebook/rocksdb

nickbrekhus · 2025-10-27T21:33:03Z

Summary:
Adds support to write timestamps as-you-go for UDT-enabled CFs within a WBWI (WriteBatchWithIndex). When a WBWI is instantiated with require_timestamps=true, mutations must supply a timestamp iff the CF is UDT-enabled. Callers should not call UpdateTimestamps when required_timestamps=true as this will overwrite the timestamps supplied with mutations.

When require_timestamps = false, the previous behavior is maintained. In other words timestamps must never be supplied with mutations and UpdateTimestamps must be called before writing the batch to DB.

Differential Revision: D85605627

meta-codesync · 2025-10-27T21:33:10Z

@nickbrekhus has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85605627.

Summary: Adds support to write timestamps as-you-go for UDT-enabled CFs within a WBWI (WriteBatchWithIndex). When a WBWI is instantiated with require_timestamps=true, mutations *must* supply a timestamp iff the CF is UDT-enabled. Callers *should not* call UpdateTimestamps when `required_timestamps=true` as this will overwrite the timestamps supplied with mutations. When require_timestamps = false, the previous behavior is maintained. In other words timestamps must never be supplied with mutations and UpdateTimestamps must be called before writing the batch to DB. Differential Revision: D85605627

Summary: When WBWI, UDT CF, overwrite and rollback are used together, WBWI index is corrupted and will contain duplicate key entries. The problem, which is fixed by this diff, is that the record reader does not strip the timestamp packed with the key. This results in corruption of the index during rebuild, since overwrite deduplication uses the slice read from the batch. The indexed key is still correct, as AddOrUpdate re-reads the key from the offset, resulting in the key being stripped. Differential Revision: D85616849

Summary: Adds support to write timestamps as-you-go for UDT-enabled CFs within a WBWI (WriteBatchWithIndex). When a WBWI is instantiated with require_timestamps=true, mutations *must* supply a timestamp iff the CF is UDT-enabled. Callers *should not* call UpdateTimestamps when `required_timestamps=true` as this will overwrite the timestamps supplied with mutations. When require_timestamps = false, the previous behavior is maintained. In other words timestamps must never be supplied with mutations and UpdateTimestamps must be called before writing the batch to DB. Differential Revision: D85605627

pdillinger · 2025-10-28T23:45:30Z

+        if (compare_result < 0) {
+          return {Status::OK(), false};
+        }
+        return {Status::InvalidArgument("cannot write below max ts for key"),


Should be "below last ts for key"

Are we introducing the capability to write multiple entries in a single WB with the same key but different timestamps? It might seem obvious that the various timestamped versions of each key are all committed, but not necessarily when they are coalesced for the purposes of WBWI reads. That should be clarified.

Yes, in general there can be multiple timestamps within a batch when this feature is enabled, including multiple timestamps for a given key. How do you suggest clarifying that? I tried to make at least the read behavior clear in the arg comment.

pdillinger · 2025-10-29T18:31:19Z

+  //                     enabled.
+  //
+  //                     Note that regardless of whether require_timestamps is
+  //                     true or false, read semantics are RYOW. In other words,


I don't agree that RYOW implies ignoring unpersisted/uncommitted timestamps. RYOW means you can see your own writes, not that you get them when you don't ask for them.

I see a few options here outside of the one I've chosen. I think implementation wise there isn't a major difference in complexity so long as we are OK with some extra cost when doing point reads below max(ts in batch). To avoid the extra cost the index needs to be (key, ts) instead of (key).

the current behavior (described in comment)

having two notions of read_timestamp, one that controls visibility of reads inside the batch and another that controls visibility of reads within the DB.

ReadOptions::read_timestamp controls visibility of both.

I didn't like 2 because it is in my view a bit confusing. Further, my particular use case had no need for this level of control. (In practice the batch read_timestamp would always be max(ts for key, ts in batch)

3 I think is fairly intuitive but changes the snapshot position in the database as well. In principle that could cause some records in the DB to become visible that may not be intended. For our particular consistency model any records above the read snapshot that are actually read would be a bug in the system, so I think that's OK. The other minor gripe I have with this is that again in practice read_timestamp would always be max(ts for key, ts in batch).

Sorry I didn't see this sooner. My open GitHub page apparently wasn't auto-refreshing like it's supposed to.

Isn't using max(ts for key, ts in batch) equivalent to no ReadOptions::timestamp (using the UDT assumption of timestamp monotonicity by key)? So wouldn't it suffice to get that working first without committing to a semantic for timestamp with WBWI? @nickbrekhus

I was under the impression it wouldn't be possible to achieve snapshot isolation if timestamp is unconstrained and concurrent writes are possible, but I wasn't considering using read_options.snapshot. Assuming specifying both read_timestamp and snapshot is permissible (the documentation is unclear on this but I don't see why it wouldn't work) that gives user pretty much full control without introducing any new options.

Definitely an improvement over the approach taken in this diff.

snapshot isolation, ryow: ReadOptions{.snapshot = snapshot, .read_timestamp = nullptr}
read commited, ryow: ReadOptions{.snapshot = nullptr, .read_timestamp = nullptr}

One flaw in the reasoning in my comment: As far as I can tell, you cannot select the position of a snapshot. It is always as-of the latest write flushed to the database. I don't think that gives users enough flexibility to do useful things with WBWI + UDT although it would be adequate for my current use cases.

For example, if you were implementing multi shard transactions with snapshot isolation (si), you would want to select a basis for the transaction that is the same for all shards participating. This is not likely to be the latest write to the database for at least some shards. So in this scenario specifying a snapshot does not work.

Using solely timestamp to control visibility doesn't work either (that either causes you to lose RYOW or get read committed semantics), so you are kind of stuck and would need to use UpdateTimestamps().

That said, this is not the scenario I want to add this feature for and maybe it's OK for that kind of use case to use UpdateTimestamps(), as I don't see a lot of utility in allowing multiple timestamps when the batches logically represent a single transaction. In our case they represent multiple transactions that we batch to amortize various overheads. We still need read-your-write semantics during this batching because various aspects of the system rely on it during batch application.

On the other hand, if we just continue to ignore timestamp visibility evaluation in write batches the multi-shard si scenario works. The other fix I can see for this is to allow specifying a batch_read_timestamp but I do have some concerns about the ergonomics of such an interface and I don't have any clear understanding of the scenarios that would benefit.

pdillinger · 2025-10-29T18:48:46Z

+  const ReadableWriteBatch* write_batch = delta_iterator_->GetWriteBatch();
+  assert(write_batch);
+
+  Slice raw_key, value_ignore, blob_ignore, xid_ignore;


"raw key" is not recognized, specific terminology. I guess it's a user key with timestamp.

pdillinger · 2025-10-29T18:49:34Z

  size_t ts_sz = ucmp->timestamp_size();
  if (ts_sz > 0) {
-    ret.key = StripTimestampFromUserKey(ret.key, ts_sz);
+    Slice rawKey = ret.key;


Avoid "raw key"

will fix naming to snake_case too.

pdillinger · 2025-10-29T18:53:43Z

  bool current_at_base_;
  bool equal_keys_;
  bool allow_unprepared_value_;
+  // if set, the iterator will return Entrys with timestamps filled in.


"Fill timestamps" is a bad name for how to read something. It implies modification.

Also, "will return Entrys with timestamps filled in" - does that mean only entries that already have timestamps or all entries but each with an associated timestamp?

Also, the data model with UDT includes timestamps. Doing something different should not be considered a default or normalized interpretation of that data model. And 'false' are considered default or normal values for bools. How about "ignore_timestamps"?

What I'm trying to do is preserve the existing behavior when require_timestamps is not set on the WBWI. Namely, the iterator would return Slice() for the timestamp of batched keys in all CFs previously. The iterator itself cannot tell, so some sort of flag or sentinel is needed. The batch entry itself has a zero-fillled slice for timestamp. But It could be that the zero-filled slice is a valid timestamp.

ignore_timestamps sounds good to me.

pdillinger · 2025-10-29T18:58:05Z

+
+  Slice raw_key, value_ignore, blob_ignore, xid_ignore;
+  WriteType type_ignore;
+  Status s = write_batch->GetEntryFromDataOffset(


This function looks expensive overall. Why's it not simpler (like key())?

meta-cla Bot added the CLA Signed label Oct 27, 2025

meta-codesync Bot added fb-exported meta-exported labels Oct 27, 2025

nickbrekhus force-pushed the export-D85605627 branch 2 times, most recently from 7bf995c to dd77ee3 Compare October 28, 2025 00:36

nickbrekhus force-pushed the export-D85605627 branch 2 times, most recently from 2ac1238 to dd77ee3 Compare October 28, 2025 00:37

nickbrekhus force-pushed the export-D85605627 branch from dd77ee3 to 18088c5 Compare October 28, 2025 00:52

nickbrekhus force-pushed the export-D85605627 branch from 18088c5 to 3b90dcf Compare October 28, 2025 01:10

Nick Brekhus added 2 commits October 27, 2025 18:16

nickbrekhus force-pushed the export-D85605627 branch from 3b90dcf to 89e4e52 Compare October 28, 2025 01:17

pdillinger reviewed Oct 29, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

WBWI supports writing timestamps as you go#14084

WBWI supports writing timestamps as you go#14084
nickbrekhus wants to merge 2 commits into
facebook:mainfrom
nickbrekhus:export-D85605627

nickbrekhus commented Oct 27, 2025

meta-codesync Bot commented Oct 27, 2025

pdillinger Oct 28, 2025

nickbrekhus Oct 29, 2025

pdillinger Oct 29, 2025

nickbrekhus Oct 29, 2025

pdillinger Nov 3, 2025

nickbrekhus Nov 3, 2025

nickbrekhus Nov 6, 2025

pdillinger Oct 29, 2025

pdillinger Oct 29, 2025

nickbrekhus Oct 29, 2025

pdillinger Oct 29, 2025

nickbrekhus Oct 29, 2025

nickbrekhus Oct 29, 2025

pdillinger Oct 29, 2025

Labels

2 participants

Uh oh!

Conversation

nickbrekhus commented Oct 27, 2025

meta-codesync Bot commented Oct 27, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Labels

2 participants