External file ingestion, support fast prepare path from DB generated files (#14854)#14854
External file ingestion, support fast prepare path from DB generated files (#14854)#14854joshkang97 wants to merge 1 commit into
Conversation
|
@joshkang97 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108440584. |
✅ clang-tidy: No findings on changed linesCompleted in 397.0s. |
Codex Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit fa09b9b ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
…files (facebook#14854) Summary: External SST ingestion can already skip the open-and-scan metadata path when ingesting files produced by `SstFileWriter`. This extends that fast path to live SST files produced by a DB, so applications that move DB-generated files between DBs can request prepared metadata directly from the source DB and pass it through ingestion. The API is intentionally limited to live table files whose metadata is already available from the DB and cached table reader. Callers remain responsible for keeping the source file alive and unchanged until ingestion completes. Reviewed By: xingbowang Differential Revision: D108440584
fa09b9b to
34e1cdd
Compare
Claude Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit fa09b9b SummaryWell-structured PR that cleanly extends the existing High-severity findings (0): No high-severity findings. Full review (click to expand)Findings🟡 MEDIUMM1.
|
| Context | Does code execute? | Assumptions hold? | Action needed? |
|---|---|---|---|
| WritePreparedTxnDB | Yes (inherits DBImpl) | Yes - read-only metadata extraction | Safe |
| ReadOnly DB | Possible | Would work for metadata extraction | Safe |
| SecondaryInstance | Possible | Files are available, metadata in version | Safe |
| User-defined timestamps | Yes | FileMetaData keys include timestamps; ingestion path handles padding | Safe |
| BlobDB | Yes | Blob references are in values not keys; metadata is correct | Safe |
| FIFO/Universal compaction | Yes | Different compaction but FileMetaData format is the same | Safe |
Assumption stress-test results:
- "Path must exactly match" - Verified. String equality against
TableFileName()output. Symlinks fail by design. - "file_meta pointer valid after mutex release" - Verified. Version Ref prevents deallocation.
- "Seqno overwrite from FileDescriptor is correct" - Verified.
fd.largest_seqno/fd.smallest_seqnotrack actual bounds. - "InternalKeys from FileMetaData are valid for ingestion" - Verified. FileMetaData stores actual keys (not shortened), with real sequence numbers correct for
allow_db_generated_files=true.
Positive Observations
- Clean separation of concerns: metadata extraction is a pure read operation
- Correct Ref/Unref pattern with RAII cleanup via
Defer - Path validation prevents metadata extraction for non-owned files
- Sync point test verifies the fast path is actually taken
- Good error path coverage: resets
file_infoto nullptr on failure - Compatible with existing
IngestExternalFileArg::file_infosinfrastructure
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
…files (facebook#14854) Summary: External SST ingestion can already skip the open-and-scan metadata path when ingesting files produced by `SstFileWriter`. This extends that fast path to live SST files produced by a DB, so applications that move DB-generated files between DBs can request prepared metadata directly from the source DB and pass it through ingestion. The API is intentionally limited to live table files whose metadata is already available from the DB and cached table reader. Callers remain responsible for keeping the source file alive and unchanged until ingestion completes. Reviewed By: xingbowang Differential Revision: D108440584
34e1cdd to
f4b9853
Compare
🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit f4b9853 ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit f4b9853 SummaryClean, well-structured PR that extends the High-severity findings (0): No high-severity findings. Full review (click to expand)Findings🔴 HIGHNone. 🟡 MEDIUMM1.
|
| Context | Does code execute? | Assumptions hold? | Action needed? |
|---|---|---|---|
| WritePreparedTxnDB | YES (inherits DBImpl) | YES -- metadata extraction is read-only | Safe |
| ReadOnly DB | YES (inherits DBImpl) | YES -- method only reads metadata | Safe |
| Secondary DB | YES (inherits DBImpl) | YES -- but secondary may have stale metadata | Document caveat |
| User-defined timestamps | YES | YES -- FileMetaData.smallest/largest include timestamps, and GetIngestedFileInfoFromFileInfo handles UDT padding |
Safe |
| BlobDB | YES | PARTIAL -- SST with blob references could be ingested, but blob files aren't transferred. This is a pre-existing limitation, not introduced by this PR. | Not this PR's scope |
| Concurrent compaction | YES | YES -- Version ref keeps FileMetaData alive. File could be deleted after Version unref, but that's the documented TOCTOU responsibility on the caller. | Safe |
Assumption Stress Test
Claim: "FileDescriptor seqno fields match actual file contents"
- Precondition: The MANIFEST correctly tracks smallest/largest seqno for each file.
- This is a fundamental RocksDB invariant maintained by compaction and flush. If it's violated, many other things break. The override is correct because
TableProperties.key_largest_seqnomay beUINT64_MAXfor old SST files that predate this property, whileFileDescriptoralways has the correct values from the MANIFEST. - Verdict: Sound.
Claim: "Path must exactly match a live table file"
- Precondition: Caller provides the exact path as reported by
GetLiveFilesMetaData(). - The validation compares against
TableFileName(cf_paths, file_number, path_id). This could fail if the caller uses a different but equivalent path (e.g., with symlinks or different casing on case-insensitive filesystems). This is documented behavior and matches the strict-path design intent. - Verdict: Acceptable -- strict matching is the right choice for safety.
Positive Observations
- Clean separation of concerns: The method only extracts metadata -- it doesn't modify any DB state (except potentially populating the table cache, which is a read-side cache).
- Correct refcounting: The
Deferpattern withfile_version->Ref()/Unref()andfile_cfd->Ref()/UnrefAndTryDelete()correctly keeps the metadata alive while accessing it outside the mutex. - Good error validation: The method validates the file name format, checks the file exists in a live version, and validates the path matches -- preventing misuse.
- SyncPoint in test: The test uses
SyncPointto verify the fast path is actually taken, preventing silent fallback to the slow path. - API documentation: The
db.hcomment clearly documents the table cache side effect.
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
|
This pull request has been merged in 5f66923. |
Summary:
External SST ingestion can already skip the open-and-scan metadata path when ingesting files produced by
SstFileWriter. This extends that fast path to live SST files produced by a DB, so applications that move DB-generated files between DBs can request prepared metadata directly from the source DB and pass it through ingestion.The API is intentionally limited to live table files whose metadata is already available from the DB and cached table reader. Callers remain responsible for keeping the source file alive and unchanged until ingestion completes.
Reviewed By: xingbowang
Differential Revision: D108440584