Corrected answer based on inputs from Databricks support.
So if I want to configure my job such that, once a day, it looks at the “last 24 hours” and processes any files that were missed, would this configuration help?
# Goal: have Auto Loader periodically backfill so missed files are picked up.
# Note: .option() (singular) is the call for a single key/value pair.
df = spark.readStream.format("cloudFiles") \
    .options(**autoloader_config) \
    .option("cloudFiles.backfillInterval", "1 day") \
    .load("s3a://bucket/path/")
My primary confusion: if I set it to “1 day”, will Auto Loader scan all files in “s3a://bucket/path/” to look for missed files, or only the files newer than “1 day”?
My concern is that if Auto Loader scans ALL files under the input path, then over time it will be scanning millions of files, which doesn’t make sense.
Then, one day later, it will scan every file in the source again...
Yes, it will trigger a full directory listing every "1 day". So as the number of files grows, this job might get slower over time.
Customers typically use a lifecycle policy on the bucket, such as "files older than 30 days can be deleted or archived to another path".
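For illustration only (not from the support thread), a rule like that could be set with boto3; the bucket name, prefix, and 30-day window below are placeholders:

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="bucket",  # placeholder: your ingestion bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-already-ingested-files",
            "Filter": {"Prefix": "path/"},   # only objects under the Auto Loader input prefix
            "Status": "Enabled",
            "Expiration": {"Days": 30},      # delete objects 30 days after creation
        }]
    },
)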
We now have a feature in private preview, called CleanSource, that will do this for you.
To summarize, the advice is:
- Use File Notification mode whenever possible (over Directory Listing)
https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/file-detection-modes.html#file-notification-mode
- Then run the code you showed once a day/week (so this part will use directory listing), which is what the note in the doc recommends: it makes sure that no files are missed if there is a hiccup in file notifications on the service side (see the sketch after this list).
Since the number of files can increase over time for this step, you can then:
- Use a bucket lifecycle policy to remove older files, or use the CleanSource preview feature
(Here's a screenshot from the private preview documentation. This won't work until we enroll you for this).