Skip to content

ReadOptions::pin_data no longer pins iterator keys/values for mmap reads on Primary/Secondary instances #14895

Description

@andpred

Summary

The documented guarantee for iterator key pinning is not honored when the DB is opened with allow_mmap_reads = true on a Primary or Secondary instance, even though all documented preconditions are met. Iterator::key() returns a slice that is copied into the iterator's internal saved_key_ buffer and is invalidated by the next Next()/Prev(), and rocksdb.iterator.is-key-pinned reports 0. This worked prior to #3881 (commit fea2b1dfb) and regressed as an unintended side effect of a Get()-only correctness fix.

Expected behavior

Per include/rocksdb/options.h (pin_data):

// Keep the blocks loaded by the iterator pinned in memory as long as the
// iterator is not deleted. If used when reading from tables created with
// BlockBasedTableOptions::use_delta_encoding = false,
// Iterator's property "rocksdb.iterator.is-key-pinned" is guaranteed to
// return 1.
bool pin_data = false;

and include/rocksdb/iterator.h:

// Property "rocksdb.iterator.is-key-pinned":
// If returning "1", this means that the Slice returned by key() is valid
// as long as the iterator is not deleted.
// It is guaranteed to always return "1" if
// - Iterator created with ReadOptions::pin_data = true
// - DB tables were created with
//   BlockBasedTableOptions::use_delta_encoding = false.

With pin_data = true and use_delta_encoding = false, key() should remain valid for the iterator's lifetime and is-key-pinned should be 1.

Actual behavior

On Primary and Secondary instances, with allow_mmap_reads = true, and compression disabled (so blocks are served straight from the mmap), is-key-pinned returns 0 and key() slices are invalidated on the next iterator movement.

Root cause

The pinning decision is gated in table/block_based/block_based_table_reader_impl.h:

const bool block_contents_pinned =
    block.IsCached() ||
    (!block.GetValue()->own_bytes() && rep_->immortal_table);

With mmap reads and no compression, block.IsCached() is false and own_bytes() is false (the slice points into the mmap). immortal_table is only true for Read-Only instances with an unbounded table cache. It is false for both Primary and Secondary instances.

When the block iterator reports IsKeyPinned() == false, DBIter copies the key into its reused saved_key_ buffer (db/db_iter.cc):

saved_key_.SetUserKey(ikey_.user_key,
    !pin_thru_lifetime_ || !iter_.iter()->IsKeyPinned() /* copy */);

so DBIter::key() returns saved_key_ rather than a pointer into the block, and the next step may reallocate that buffer.

This is a regression introduced in #3881. That PR fixed a bug affecting Get() with PinnableSlice on mmap memory (point reads do not hold a SuperVersion for the slice's lifetime). However, the fix was implemented through a shared flag that also disables pinning for iterators, which do hold a SuperVersion and were previously safe. Before #3881, mmap iterator keys/values were pinned, relying on the fact that an iterator pins a SuperVersion which keeps the files (and their mmap regions) alive for the iterator's lifetime. The fea2b1dfb commit message states this explicitly: "For iterator reads, a SuperVersion is pinned ... Blocks are pinned ... This works for ... mmap reads, where the file owns the memory region." Confirmed empirically by running the same probe against a normal mmap DB: the test key's is-key-pinned was 1 at fea2b1dfb^, became 0 at fea2b1dfb.

Steps to reproduce

  1. Build a DB with BlockBasedTableOptions::use_delta_encoding = false and compression disabled.
  2. Open it with Options::allow_mmap_reads = true.
  3. Create an iterator with ReadOptions::pin_data = true.
  4. Seek() to a key, capture it->key(), call Next(), then inspect the captured slice and db->GetProperty(it, "rocksdb.iterator.is-key-pinned", ...).

Observed: is-key-pinned == "0" and the captured slice no longer matches the original key. Expected: is-key-pinned == "1" and the slice remains valid.

Environment

  • RocksDB: 10.4.2 (also reproducible at the Copy Get() result when file reads use mmap #3881 boundary)
  • Datastore built with use_delta_encoding = false, compression disabled
  • Secondary instance opened with allow_mmap_reads = true, pin_data = true
  • Linux x86_64

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions