You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The documented guarantee for iterator key pinning is not honored when the DB is opened with allow_mmap_reads = true on a Primary or Secondary instance, even though all documented preconditions are met. Iterator::key() returns a slice that is copied into the iterator's internal saved_key_ buffer and is invalidated by the next Next()/Prev(), and rocksdb.iterator.is-key-pinned reports 0. This worked prior to #3881 (commit fea2b1dfb) and regressed as an unintended side effect of a Get()-only correctness fix.
// Keep the blocks loaded by the iterator pinned in memory as long as the// iterator is not deleted. If used when reading from tables created with// BlockBasedTableOptions::use_delta_encoding = false,// Iterator's property "rocksdb.iterator.is-key-pinned" is guaranteed to// return 1.bool pin_data = false;
// Property "rocksdb.iterator.is-key-pinned":// If returning "1", this means that the Slice returned by key() is valid// as long as the iterator is not deleted.// It is guaranteed to always return "1" if// - Iterator created with ReadOptions::pin_data = true// - DB tables were created with// BlockBasedTableOptions::use_delta_encoding = false.
With pin_data = true and use_delta_encoding = false, key() should remain valid for the iterator's lifetime and is-key-pinned should be 1.
Actual behavior
On Primary and Secondary instances, with allow_mmap_reads = true, and compression disabled (so blocks are served straight from the mmap), is-key-pinned returns 0 and key() slices are invalidated on the next iterator movement.
With mmap reads and no compression, block.IsCached() is false and own_bytes() is false (the slice points into the mmap). immortal_table is only true for Read-Only instances with an unbounded table cache. It is false for both Primary and Secondary instances.
When the block iterator reports IsKeyPinned() == false, DBIter copies the key into its reused saved_key_ buffer (db/db_iter.cc):
so DBIter::key() returns saved_key_ rather than a pointer into the block, and the next step may reallocate that buffer.
This is a regression introduced in #3881. That PR fixed a bug affecting Get() with PinnableSlice on mmap memory (point reads do not hold a SuperVersion for the slice's lifetime). However, the fix was implemented through a shared flag that also disables pinning for iterators, which do hold a SuperVersion and were previously safe. Before #3881, mmap iterator keys/values were pinned, relying on the fact that an iterator pins a SuperVersion which keeps the files (and their mmap regions) alive for the iterator's lifetime. The fea2b1dfb commit message states this explicitly: "For iterator reads, a SuperVersion is pinned ... Blocks are pinned ... This works for ... mmap reads, where the file owns the memory region." Confirmed empirically by running the same probe against a normal mmap DB: the test key's is-key-pinned was 1 at fea2b1dfb^, became 0 at fea2b1dfb.
Steps to reproduce
Build a DB with BlockBasedTableOptions::use_delta_encoding = false and compression disabled.
Open it with Options::allow_mmap_reads = true.
Create an iterator with ReadOptions::pin_data = true.
Seek() to a key, capture it->key(), call Next(), then inspect the captured slice and db->GetProperty(it, "rocksdb.iterator.is-key-pinned", ...).
Observed: is-key-pinned == "0" and the captured slice no longer matches the original key. Expected: is-key-pinned == "1" and the slice remains valid.
Summary
The documented guarantee for iterator key pinning is not honored when the DB is opened with
allow_mmap_reads = trueon a Primary or Secondary instance, even though all documented preconditions are met.Iterator::key()returns a slice that is copied into the iterator's internalsaved_key_buffer and is invalidated by the nextNext()/Prev(), androcksdb.iterator.is-key-pinnedreports0. This worked prior to #3881 (commitfea2b1dfb) and regressed as an unintended side effect of aGet()-only correctness fix.Expected behavior
Per
include/rocksdb/options.h(pin_data):and
include/rocksdb/iterator.h:With
pin_data = trueanduse_delta_encoding = false,key()should remain valid for the iterator's lifetime andis-key-pinnedshould be1.Actual behavior
On Primary and Secondary instances, with
allow_mmap_reads = true, and compression disabled (so blocks are served straight from the mmap),is-key-pinnedreturns0andkey()slices are invalidated on the next iterator movement.Root cause
The pinning decision is gated in
table/block_based/block_based_table_reader_impl.h:With mmap reads and no compression,
block.IsCached()isfalseandown_bytes()isfalse(the slice points into the mmap).immortal_tableis only true for Read-Only instances with an unbounded table cache. It isfalsefor both Primary and Secondary instances.When the block iterator reports
IsKeyPinned() == false,DBItercopies the key into its reusedsaved_key_buffer (db/db_iter.cc):saved_key_.SetUserKey(ikey_.user_key, !pin_thru_lifetime_ || !iter_.iter()->IsKeyPinned() /* copy */);so
DBIter::key()returnssaved_key_rather than a pointer into the block, and the next step may reallocate that buffer.This is a regression introduced in #3881. That PR fixed a bug affecting
Get()withPinnableSliceon mmap memory (point reads do not hold aSuperVersionfor the slice's lifetime). However, the fix was implemented through a shared flag that also disables pinning for iterators, which do hold aSuperVersionand were previously safe. Before #3881, mmap iterator keys/values were pinned, relying on the fact that an iterator pins aSuperVersionwhich keeps the files (and their mmap regions) alive for the iterator's lifetime. Thefea2b1dfbcommit message states this explicitly: "For iterator reads, aSuperVersionis pinned ... Blocks are pinned ... This works for ... mmap reads, where the file owns the memory region." Confirmed empirically by running the same probe against a normal mmap DB: the test key'sis-key-pinnedwas 1 atfea2b1dfb^, became 0 atfea2b1dfb.Steps to reproduce
BlockBasedTableOptions::use_delta_encoding = falseand compression disabled.Options::allow_mmap_reads = true.ReadOptions::pin_data = true.Seek()to a key, captureit->key(), callNext(), then inspect the captured slice anddb->GetProperty(it, "rocksdb.iterator.is-key-pinned", ...).Observed:
is-key-pinned == "0"and the captured slice no longer matches the original key. Expected:is-key-pinned == "1"and the slice remains valid.Environment
use_delta_encoding = false, compression disabledallow_mmap_reads = true,pin_data = true