Pass file metadata to IngestExternalFile to improve ingestion latency (#14837)#14837
Pass file metadata to IngestExternalFile to improve ingestion latency (#14837)#14837joshkang97 wants to merge 1 commit into
Conversation
|
@joshkang97 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107721261. |
✅ clang-tidy: No findings on changed linesCompleted in 340.7s. |
Codex Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit 8d5e7ff ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
Claude Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit 8d5e7ff SummaryWell-designed optimization that reduces external file ingestion latency ~34% by reusing SstFileWriter metadata. The code is largely correct, with the two main code paths (fast and scan) producing equivalent results for all common scenarios. A few gaps in test coverage and a minor hot-path performance concern should be addressed. High-severity findings (0): No high-severity findings. Full review (click to expand)Findings🟡 MEDIUMM1. Extra per-key string copy in
|
| Context | Executes? | Correct? | Notes |
|---|---|---|---|
| WritePreparedTxnDB | YES | YES | Seqno assignment in Run() unaffected |
| ReadOnly DB | NO | N/A | IngestExternalFile not available |
| allow_db_generated_files | YES | YES (with caveat L1) | SstFileWriter seqnos are always 0 |
| User-defined timestamps | YES | YES | Padding logic verified correct |
| UDT in Memtables only | YES | YES (needs test M2) | Strip/pad logic consistent |
| write_global_seqno=true | NO | N/A | Correctly rejected in validation |
| Range-del-only files | YES | Likely correct but untested (M3) | MaybeUpdateRange handles it |
| Range-del-only + UDT | YES | BUG | Empty InternalKey::Encode() assertion |
| FIFO/Universal compaction | YES | YES | Affects Run(), not Prepare() |
| atomic_replace_range | YES | YES | Orthogonal to metadata acquisition |
| verify_checksums_before_ingest | YES | YES | Opens file only for verification |
Key verified equivalences:
MaybeUpdateRangetakes min/max of boundaries regardless of call count, so single-synthetic-tombstone fast path equals per-tombstone scan pathInternalKey::DecodeFrom()copies data (rep_.assign()), no lifetime concernsexternal_sst_file_global_seqno_offsetis 0 in in-memory properties, safely handled sincewrite_global_seqnois rejectedGetTableProperties()returns by value, safe afterFinish()- Internal key encoding in SstFileWriter matches what TableReader returns
Positive Observations
- Clean separation of fast/scan paths with shared validation and checksum verification
- Good use of sync points for testing (
GetIngestedFileInfo:ReadPath) - Comprehensive db_stress integration with appropriate exclusions
- Well-designed benchmark in db_bench that isolates the optimization
- Thorough input validation (size mismatch, null pointers, write_global_seqno incompatibility)
- Ingestion timing histograms are a valuable addition independent of the fast path
- The code correctly handles the UDT timestamp lifecycle (strip in writer, pad in ingestion)
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
…facebook#14837) Summary: External file ingestion (`DB::IngestExternalFile[s]`) re-opens every SST file and scans it -- footer, properties, index, filter, and the first/last data blocks -- to recompute the boundary keys, sequence-number bounds, and table properties before committing. For cold, I/O-bound files this scan dominates ingest latency, even when the file is moved/linked rather than copied. This change lets a caller skip that work when it already has the file's metadata. `SstFileWriter::Finish()` now returns an opaque `PreparedFileInfo` containing file size, table properties, and prepared internal `smallest`/`largest` bounds. Point-key and range-deletion bounds are folded into that same bound pair, so ingestion has one prepared boundary shape regardless of whether the file contains point keys, range deletions, or both. A new `IngestExternalFileArg::file_infos` field carries one `PreparedFileInfo` per file into `IngestExternalFiles()`. When set, ingestion reuses that metadata instead of re-opening and scanning each file. The file is still copied/linked, and checksum verification still opens the file when `verify_checksums_before_ingest` is set. Range tombstones and user-defined-timestamp files (including the "UDT in Memtables only" format, whose persisted boundary keys carry no timestamp) are handled. Internally, ingestion-job metadata acquisition was split into `GetIngestedFileInfoFromFileInfo` (reuse caller metadata) and `GetIngestedFileInfoFromFile` (open + scan). Prepared boundary updates use the RocksDB comparators, and the no-persisted-timestamp path pads prepared bounds back to the internal timestamp shape before installing the file. ## Benchmark results `db_bench` `ingestexternalfile` benchmark, Buck opt build (`mode/opt`), on SSD (btrfs). Files are linked (`move_files`) so the measurement isolates the ingest path rather than file-copy throughput. | Config | Prepare SUM / P50 | Run SUM / P50 | Total benchmark time | | --- | ---: | ---: | ---: | | 1M keys/SST, baseline | 205.304 ms / 19.930 ms | 73.880 ms / 5.622 ms | 44.227 s | | 1M keys/SST, `file_info=true` | 73.077 ms / 8.250 ms | 125.509 ms / 11.950 ms | 42.813 s | | 2M keys/SST, baseline | 243.766 ms / 25.930 ms | 104.723 ms / 8.433 ms | 84.616 s | | 2M keys/SST, `file_info=true` | 70.712 ms / 7.733 ms | 240.838 ms / 18.000 ms | 84.946 s | The prepared metadata path cuts `Prepare()` substantially. `Run()` is longer with `file_info=true` because the baseline path opens/scans the table during `Prepare()`, which warms OS/block-cache state before the later table-cache open in `Run()`. A syscall trace of the `file_info=true` path showed the remaining prepare time is mostly expected link/sync durability work (`link()` and `fdatasync()`), not metadata scanning. Command shape: ``` cd /data/users/jkangs/fbsource/fbcode buck run mode/opt //internal_repo_rocksdb/repo:db_bench -- --benchmarks=ingestexternalfile --num=<1000000|2000000> --value_size=100 \ --compression_type=none --ingest_external_file_batch_size=10 \ --ingest_external_file_num_batches=10 --disable_auto_compactions=1 \ --statistics --db=<db_path> --wal_dir=<wal_path> \ --ingest_external_file_use_file_info=<false|true> ``` Differential Revision: D107721261
8d5e7ff to
e634e35
Compare
Codex Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit e634e35 ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
…facebook#14837) Summary: External file ingestion (`DB::IngestExternalFile[s]`) re-opens every SST file and scans it -- footer, properties, index, filter, and the first/last data blocks -- to recompute the boundary keys, sequence-number bounds, and table properties before committing. For cold, I/O-bound files this scan dominates ingest latency, even when the file is moved/linked rather than copied. This change lets a caller skip that work when it already has the file's metadata. `SstFileWriter::Finish()` now returns a `PreparedFileInfo` with the file size, table properties, and prepared internal `smallest`/`largest` bounds, and a new `IngestExternalFileArg::file_infos` field carries one `PreparedFileInfo` per file into `IngestExternalFiles()`. When set, ingestion reuses that metadata instead of re-opening and scanning each file. The file is still copied/linked, and the checksum is still verified when `verify_checksums_before_ingest` is set (the fast path opens the file only for that). Point-key and range-deletion bounds are folded into the same prepared bound pair, and user-defined-timestamp files (including the "UDT in Memtables only" format, whose boundary keys carry no timestamp) are handled. Internally, ingestion-job metadata acquisition was split into `GetIngestedFileInfoFromFileInfo` (reuse caller metadata) and `GetIngestedFileInfoFromFile` (open + scan). Prepared boundary updates use RocksDB comparators, and the timestamp-stripping path pads prepared bounds back to the internal timestamp shape before installing the file. ## Benchmark Results `db_bench ingestexternalfile` benchmark, release build, on SSD (btrfs). Files are linked (`move_files`) so the measurement isolates the ingest path rather than file-copy throughput. Each generated SST was `121,883,007` bytes (`116.24 MiB`). The benchmark used 1M keys/SST and 5 ingest batches per run. | Batch size | Files/run | Config | Prepare P50 | Run P50 | Total ingestion P50 | Total P50 drop | | --- | --- | --- | --- | --- | --- | --- | | 10 | 50 | Baseline | 20.667 ms | 5.500 ms | 26.167 ms | -- | | 10 | 50 | `file_info=true` | 7.838 ms | 9.826 ms | 17.664 ms | 32.5% | | 30 | 150 | Baseline | 59.278 ms | 20.667 ms | 79.945 ms | -- | | 30 | 150 | `file_info=true` | 18.000 ms | 31.167 ms | 49.167 ms | 38.5% | | 50 | 250 | Baseline | 92.500 ms | 37.250 ms | 129.750 ms | -- | | 50 | 250 | `file_info=true` | 29.416 ms | 62.500 ms | 91.916 ms | 29.2% | The prepared metadata path cuts `Prepare()` substantially. `Run()` is longer with `file_info=true` because the baseline path opens/scans the table during `Prepare()`, which warms OS/block-cache state before the later table-cache open in `Run()`. A syscall trace of the `file_info=true` path showed the remaining prepare time is mostly due to link/sync syscalls. Benchmark args: `--benchmarks=ingestexternalfile --num=1000000 --value_size=100 --compression_type=none --ingest_external_file_batch_size=<10|30|50> --ingest_external_file_num_batches=5 --disable_auto_compactions=1 --statistics --ingest_external_file_use_file_info=<false|true>`. Differential Revision: D107721261
e634e35 to
69a4b4e
Compare
Claude Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit e634e35 SummaryWell-designed performance optimization that adds a metadata fast-path to external SST file ingestion, cutting High-severity findings (0): No high-severity findings. Full review (click to expand)Findings🔴 HIGHNone. 🟡 MEDIUMM1. Missing test for range-deletion-only files via fast-path --
|
| Context | Executes? | Safe? | Notes |
|---|---|---|---|
| WritePreparedTxnDB | YES | YES | Seqno reassigned in common path |
| UDT (persisted) | YES | YES | Boundaries carry timestamps |
| UDT (memtable only) | YES | YES | Padding via ExtractValueType is correct |
| FIFO compaction | YES | YES | Level assignment in common path |
| allow_db_generated_files | YES | YES | Handled explicitly |
| verify_checksums | YES | YES | Opens file in common path |
Key claims verified: unique_id IS set (common path), start_ukey/limit_ukey IS set (common path), boundary computation IS equivalent (uses same SerializeKey/SerializeEndKey logic), write_global_seqno rejection IS enforced.
Positive Observations
- Clean dispatcher pattern with common post-processing
- Good validation (null pointers, size mismatches, write_global_seqno)
- Debug assertions verifying range-del timestamps
- Well-integrated stress test with configurable probability
- Test sync point verifying fast-path is taken
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
Codex Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit 69a4b4e ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
Claude Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit 69a4b4e SummaryWell-designed performance optimization that cleanly splits ingestion metadata acquisition into a fast-path (reuse SstFileWriter metadata) and scan-path. The boundary computation refactoring is thorough, with careful handling of UDT (both persisted and stripped), range deletions, and sequence number bounds. Good test coverage with unit tests, stress test integration, db_bench support, and crash test configuration. High-severity findings (1):
Full review (click to expand)Findings🔴 HIGHH1. Unconditional boundary computation in Finish() penalizes existing callers --
|
| Context | Fast path handles correctly? | Verification |
|---|---|---|
| UDT persisted | YES | Range dels re-serialized with max_ts in Finish(); tested in TimestampsPersistedIngestWithFileInfoRangeDeletionBounds |
| UDT not persisted (strip) | YES | Timestamps stripped in Finish(), padded back in GetIngestedFileInfoFromFileInfo; debug assertion in scan path validates max_ts invariant for range dels; tested in TimestampsNotPersistedIngestWithFileInfoRangeDeletionBounds |
| Range deletions | YES | Boundary extension tracked via SerializeKey/SerializeEndKey; tested in IngestWithFileInfoRangeDeletion |
| allow_db_generated_files | YES | Seqno bounds from table properties; rejection tested |
| verify_checksums | YES | Opens file just for verification; tested in IngestWithFileInfoVerifiesChecksum |
| write_global_seqno | REJECTED | Validated in db_impl.cc; tested in IngestWithFileInfoRejectsWriteGlobalSeqno |
| Concurrent ingestions | YES | shared_ptr is safe; stress test exercises concurrent paths |
| File modification between Finish/Ingest | Same risk as existing API | Both paths read the file at ingestion time if checksums enabled |
Positive Observations
- Clean separation of concerns:
GetIngestedFileInfoFromFileInfo(fast) vsGetIngestedFileInfoFromFile(scan) with shared post-processing inGetIngestedFileInfowrapper. The common wrapper handles checksum verification, FD assignment, and start_ukey/limit_ukey computation uniformly. - Thorough input validation in
db_impl.cc: null pointer check, size mismatch check, write_global_seqno incompatibility. - Debug assertions in the scan path (new
#ifndef NDEBUGblock) cross-validate the assumption that range tombstone keys use max timestamp, providing a safety net against future regressions. - Well-integrated stress testing with configurable probability via
ingest_external_file_use_file_info_one_in. - db_bench integration provides a reproducible benchmark with clear methodology (link files, isolate ingest path).
- The
IngestWithFileInfotest uses sync points to verify the scan path is actually skipped (read_path_countcheck). - Range deletion boundary tracking refactored to use
InternalKeythroughout SstFileWriter, eliminating a user-key/internal-key impedance mismatch.
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
…facebook#14837) Summary: External file ingestion (`DB::IngestExternalFile[s]`) re-opens every SST file and scans it -- footer, properties, index, filter, and the first/last data blocks -- to recompute the boundary keys, sequence-number bounds, and table properties before committing. For cold, I/O-bound files this scan dominates ingest latency, even when the file is moved/linked rather than copied. This change lets a caller skip that work when it already has the file's metadata. `SstFileWriter::Finish()` now returns a `PreparedFileInfo` with the file size, table properties, and prepared internal `smallest`/`largest` bounds, and a new `IngestExternalFileArg::file_infos` field carries one `PreparedFileInfo` per file into `IngestExternalFiles()`. When set, ingestion reuses that metadata instead of re-opening and scanning each file. The file is still copied/linked, and the checksum is still verified when `verify_checksums_before_ingest` is set (the fast path opens the file only for that). Point-key and range-deletion bounds are folded into the same prepared bound pair, and user-defined-timestamp files (including the "UDT in Memtables only" format, whose boundary keys carry no timestamp) are handled. Internally, ingestion-job metadata acquisition was split into `GetIngestedFileInfoFromFileInfo` (reuse caller metadata) and `GetIngestedFileInfoFromFile` (open + scan). Prepared boundary updates use RocksDB comparators, and the timestamp-stripping path pads prepared bounds back to the internal timestamp shape before installing the file. The `ingestexternalfile` `db_bench` benchmark now also exposes `--ingest_external_file_fill_cache` so ingestion-read block-cache effects can be controlled independently while comparing the file-info fast path against the normal open-and-scan path. ## Benchmark Results Existing `db_bench ingestexternalfile` benchmark, release build, on SSD (btrfs). Files are linked (`move_files`) so the measurement isolates the ingest path rather than file-copy throughput. These numbers used the benchmark's previous default `fill_cache=true`. Each generated SST was `121,883,007` bytes (`116.24 MiB`). The benchmark used 1M keys/SST and 5 ingest batches per run. | Batch size | Files/run | Config | Prepare P50 | Run P50 | Total ingestion P50 | Total P50 drop | | --- | --- | --- | --- | --- | --- | --- | | 10 | 50 | Baseline | 20.667 ms | 5.500 ms | 26.167 ms | -- | | 10 | 50 | `file_info=true` | 7.838 ms | 9.826 ms | 17.664 ms | 32.5% | | 30 | 150 | Baseline | 59.278 ms | 20.667 ms | 79.945 ms | -- | | 30 | 150 | `file_info=true` | 18.000 ms | 31.167 ms | 49.167 ms | 38.5% | | 50 | 250 | Baseline | 92.500 ms | 37.250 ms | 129.750 ms | -- | | 50 | 250 | `file_info=true` | 29.416 ms | 62.500 ms | 91.916 ms | 29.2% | The prepared metadata path cuts `Prepare()` substantially. `Run()` is longer with `file_info=true` because the baseline path opens/scans the table during `Prepare()`, which warms OS/block-cache state before the later table-cache open in `Run()`. A syscall trace of the `file_info=true` path showed the remaining prepare time is mostly due to link/sync syscalls. Updated fill-cache-isolated benchmark args: `--benchmarks=ingestexternalfile --num=1000000 --value_size=100 --compression_type=none --ingest_external_file_batch_size=<10|30|50> --ingest_external_file_num_batches=5 --disable_auto_compactions=1 --statistics --ingest_external_file_use_file_info=<false|true> --ingest_external_file_fill_cache=false`. Differential Revision: D107721261
69a4b4e to
507675f
Compare
…facebook#14837) Summary: External file ingestion (`DB::IngestExternalFile[s]`) re-opens every SST file and scans it -- footer, properties, index, filter, and the first/last data blocks -- to recompute the boundary keys, sequence-number bounds, and table properties before committing. For cold, I/O-bound files this scan dominates ingest latency, even when the file is moved/linked rather than copied. This change lets a caller skip that work when it already has the file's metadata. `SstFileWriter::Finish()` now returns a `PreparedFileInfo` with the file size, table properties, and prepared internal `smallest`/`largest` bounds, and a new `IngestExternalFileArg::file_infos` field carries one `PreparedFileInfo` per file into `IngestExternalFiles()`. When set, ingestion reuses that metadata instead of re-opening and scanning each file. The file is still copied/linked, and the checksum is still verified when `verify_checksums_before_ingest` is set (the fast path opens the file only for that). Point-key and range-deletion bounds are folded into the same prepared bound pair, and user-defined-timestamp files (including the "UDT in Memtables only" format, whose boundary keys carry no timestamp) are handled. Internally, ingestion-job metadata acquisition was split into `GetIngestedFileInfoFromFileInfo` (reuse caller metadata) and `GetIngestedFileInfoFromFile` (open + scan). Prepared boundary updates use RocksDB comparators, and the timestamp-stripping path pads prepared bounds back to the internal timestamp shape before installing the file. The `ingestexternalfile` `db_bench` benchmark now also exposes `--ingest_external_file_fill_cache` so ingestion-read block-cache effects can be controlled independently while comparing the file-info fast path against the normal open-and-scan path. A future PR will support creating `PreparedFileInfo` from a existing DB-generated file, so this optimization is not just limited to SstFileWriter. ## Benchmark Results Existing `db_bench ingestexternalfile` benchmark, release build, using an XFS flash-backed filesystem. Files are linked (`move_files`) so the measurement isolates the ingest path rather than file-copy throughput. These numbers used direct reads and `fill_cache=false` to avoid warming the block cache while comparing the file-info fast path against the normal open-and-scan path. Each run used 1M keys/SST, one ingest call, and `db_bench` reported a 62.9 MB estimated file size. The comparison varied the number of files in that ingest call across 10, 30, and 50 files. | Files/batch | Config | Prepare P50 | Run P50 | Run P50/file | Total ingestion P50 | Total P50 drop | | --- | --- | --- | --- | --- | --- | --- | | 10 | Baseline | 10.909 ms | 7.593 ms | 0.759 ms/file | 18.502 ms | -- | | 10 | `file_info=true` | 1.127 ms | 7.541 ms | 0.754 ms/file | 8.668 ms | 53.2% | | 30 | Baseline | 29.049 ms | 23.161 ms | 0.772 ms/file | 52.210 ms | -- | | 30 | `file_info=true` | 2.734 ms | 23.384 ms | 0.779 ms/file | 26.118 ms | 50.0% | | 50 | Baseline | 49.036 ms | 39.224 ms | 0.784 ms/file | 88.260 ms | -- | | 50 | `file_info=true` | 4.133 ms | 39.417 ms | 0.788 ms/file | 43.550 ms | 50.7% | With `fill_cache=false` and direct reads, the metadata fast path cuts `Prepare()` by roughly 90-92%. `Run()` is roughly 0.75-0.79 ms per file across these runs. End-to-end `db_bench` micros/op is almost unchanged because this benchmark includes SST generation and compaction work; use `rocksdb.ingest.external.file.prepare.micros` and `rocksdb.ingest.external.file.run.micros` to isolate ingestion. Benchmark args: `--benchmarks=ingestexternalfile --use_existing_db=0 --num=1000000 --compression_type=none --statistics --use_direct_reads=true --ingest_external_file_batch_size=<10|30|50> --ingest_external_file_num_batches=1 --ingest_external_file_use_file_info=<false|true> --ingest_external_file_fill_cache=false`. Reviewed By: xingbowang Differential Revision: D107721261
507675f to
818103b
Compare
Codex Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit 818103b ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
Claude Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit 818103b SummaryWell-motivated optimization that eliminates redundant SST file scanning during ingestion by reusing metadata from High-severity findings (2):
The full review with 2 HIGH, 5 MEDIUM, and 4 LOW findings is written to ℹ️ About this responseGenerated by Claude Code. Limitations:
Commands:
|
…facebook#14837) Summary: External file ingestion (`DB::IngestExternalFile[s]`) re-opens every SST file and scans it -- footer, properties, index, filter, and the first/last data blocks -- to recompute the boundary keys, sequence-number bounds, and table properties before committing. For cold, I/O-bound files this scan dominates ingest latency, even when the file is moved/linked rather than copied. This change lets a caller skip that work when it already has the file's metadata. `SstFileWriter::Finish()` now returns a `PreparedFileInfo` with the file size, table properties, and prepared internal `smallest`/`largest` bounds, and a new `IngestExternalFileArg::file_infos` field carries one `PreparedFileInfo` per file into `IngestExternalFiles()`. When set, ingestion reuses that metadata instead of re-opening and scanning each file. The file is still copied/linked, and the checksum is still verified when `verify_checksums_before_ingest` is set (the fast path opens the file only for that). Point-key and range-deletion bounds are folded into the same prepared bound pair, and user-defined-timestamp files (including the "UDT in Memtables only" format, whose boundary keys carry no timestamp) are handled. Internally, ingestion-job metadata acquisition was split into `GetIngestedFileInfoFromFileInfo` (reuse caller metadata) and `GetIngestedFileInfoFromFile` (open + scan). Prepared boundary updates use RocksDB comparators, and the timestamp-stripping path pads prepared bounds back to the internal timestamp shape before installing the file. The `ingestexternalfile` `db_bench` benchmark now also exposes `--ingest_external_file_fill_cache` so ingestion-read block-cache effects can be controlled independently while comparing the file-info fast path against the normal open-and-scan path. A future PR will support creating `PreparedFileInfo` from a existing DB-generated file, so this optimization is not just limited to SstFileWriter. ## Benchmark Results Existing `db_bench ingestexternalfile` benchmark, release build, using an XFS flash-backed filesystem. Files are linked (`move_files`) so the measurement isolates the ingest path rather than file-copy throughput. These numbers used direct reads and `fill_cache=false` to avoid warming the block cache while comparing the file-info fast path against the normal open-and-scan path. Each run used 1M keys/SST, one ingest call, and `db_bench` reported a 62.9 MB estimated file size. The comparison varied the number of files in that ingest call across 10, 30, and 50 files. | Files/batch | Config | Prepare P50 | Run P50 | Run P50/file | Total ingestion P50 | Total P50 drop | | --- | --- | --- | --- | --- | --- | --- | | 10 | Baseline | 10.909 ms | 7.593 ms | 0.759 ms/file | 18.502 ms | -- | | 10 | `file_info=true` | 1.127 ms | 7.541 ms | 0.754 ms/file | 8.668 ms | 53.2% | | 30 | Baseline | 29.049 ms | 23.161 ms | 0.772 ms/file | 52.210 ms | -- | | 30 | `file_info=true` | 2.734 ms | 23.384 ms | 0.779 ms/file | 26.118 ms | 50.0% | | 50 | Baseline | 49.036 ms | 39.224 ms | 0.784 ms/file | 88.260 ms | -- | | 50 | `file_info=true` | 4.133 ms | 39.417 ms | 0.788 ms/file | 43.550 ms | 50.7% | With `fill_cache=false` and direct reads, the metadata fast path cuts `Prepare()` by roughly 90-92%. `Run()` is roughly 0.75-0.79 ms per file across these runs. End-to-end `db_bench` micros/op is almost unchanged because this benchmark includes SST generation and compaction work; use `rocksdb.ingest.external.file.prepare.micros` and `rocksdb.ingest.external.file.run.micros` to isolate ingestion. Benchmark args: `--benchmarks=ingestexternalfile --use_existing_db=0 --num=1000000 --compression_type=none --statistics --use_direct_reads=true --ingest_external_file_batch_size=<10|30|50> --ingest_external_file_num_batches=1 --ingest_external_file_use_file_info=<false|true> --ingest_external_file_fill_cache=false`. Reviewed By: xingbowang Differential Revision: D107721261
818103b to
b5fcb6f
Compare
|
This pull request has been merged in ad43a5f. |
🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit b5fcb6f ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit b5fcb6f SummaryWell-designed optimization that eliminates redundant SST file scanning during ingestion by reusing metadata from High-severity findings (0): No high-severity findings. Full review (click to expand)Findings🔴 HIGHNo high-severity findings. 🟡 MEDIUMM1. UDT-not-persisted: range deletion start key timestamp padding differs between fast and scan paths --
|
| Context | Affected? | Analysis |
|---|---|---|
| UDT persisted | Yes, tested | Range deletion bounds correctly use max timestamp |
| UDT not persisted | Yes, tested | M1: minor internal key timestamp difference |
| allow_db_generated_files | Partial | Seqno bounds from table_properties work correctly |
| write_global_seqno | Rejected | Correctly validated in PrepareFileIngestion |
| Concurrent ingestion | Safe | PreparedFileInfo is const-qualified |
Positive Observations
- Clean refactoring of
GetIngestedFileInfointo two paths with shared dispatcher SanityCheckTablePropertiessimplification improves testability- Comprehensive test coverage including UDT, checksums, and error cases
- Good stress test and db_bench integration
- Useful debug assertion for range deletion timestamp consistency in the scan path
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
Summary:
External file ingestion (
DB::IngestExternalFile[s]) re-opens every SST file and scans it -- footer, properties, index, filter, and the first/last data blocks -- to recompute the boundary keys, sequence-number bounds, and table properties before committing. For cold, I/O-bound files this scan dominates ingest latency, even when the file is moved/linked rather than copied.This change lets a caller skip that work when it already has the file's metadata.
SstFileWriter::Finish()now returns aPreparedFileInfowith the file size, table properties, and prepared internalsmallest/largestbounds, and a newIngestExternalFileArg::file_infosfield carries onePreparedFileInfoper file intoIngestExternalFiles(). When set, ingestion reuses that metadata instead of re-opening and scanning each file. The file is still copied/linked, and the checksum is still verified whenverify_checksums_before_ingestis set (the fast path opens the file only for that). Point-key and range-deletion bounds are folded into the same prepared bound pair, and user-defined-timestamp files (including the "UDT in Memtables only" format, whose boundary keys carry no timestamp) are handled.Internally, ingestion-job metadata acquisition was split into
GetIngestedFileInfoFromFileInfo(reuse caller metadata) andGetIngestedFileInfoFromFile(open + scan). Prepared boundary updates use RocksDB comparators, and the timestamp-stripping path pads prepared bounds back to the internal timestamp shape before installing the file.The
ingestexternalfiledb_benchbenchmark now also exposes--ingest_external_file_fill_cacheso ingestion-read block-cache effects can be controlled independently while comparing the file-info fast path against the normal open-and-scan path.A future PR will support creating
PreparedFileInfofrom a existing DB-generated file, so this optimization is not just limited to SstFileWriter.Benchmark Results
Existing
db_bench ingestexternalfilebenchmark, release build, using an XFS flash-backed filesystem. Files are linked (move_files) so the measurement isolates the ingest path rather than file-copy throughput. These numbers used direct reads andfill_cache=falseto avoid warming the block cache while comparing the file-info fast path against the normal open-and-scan path.Each run used 1M keys/SST, one ingest call, and
db_benchreported a 62.9 MB estimated file size. The comparison varied the number of files in that ingest call across 10, 30, and 50 files.file_info=truefile_info=truefile_info=trueWith
fill_cache=falseand direct reads, the metadata fast path cutsPrepare()by roughly 90-92%.Run()is roughly 0.75-0.79 ms per file across these runs. End-to-enddb_benchmicros/op is almost unchanged because this benchmark includes SST generation and compaction work; userocksdb.ingest.external.file.prepare.microsandrocksdb.ingest.external.file.run.microsto isolate ingestion.Benchmark args:
--benchmarks=ingestexternalfile --use_existing_db=0 --num=1000000 --compression_type=none --statistics --use_direct_reads=true --ingest_external_file_batch_size=<10|30|50> --ingest_external_file_num_batches=1 --ingest_external_file_use_file_info=<false|true> --ingest_external_file_fill_cache=false.Reviewed By: xingbowang
Differential Revision: D107721261