Corrected answer based on inputs from Databricks support.
So if I want to configure my job such that, once a day, it looks at the “last 24 hours” and processes any files that were missed, would this configuration help?
# Goal: have Auto Loader periodically backfill so missed files are picked up.
# Note: .option() (singular) is the call for a single key/value pair.
df = spark.readStream.format("cloudFiles") \
    .options(**autoloader_config) \
    .option("cloudFiles.backfillInterval", "1 day") \
    .load("s3a://bucket/path/")
My primary confusion: if I set it to “1 day”, will Auto Loader scan all files in “s3a://bucket/path/” to look for missed files, or only the files newer than “1 day”?
My concern is that if Auto Loader scans ALL files under the input path, then over time it will be scanning millions of files, which doesn’t make sense.
Then, one day later, it will scan every file in the source again...
Yes, it will trigger a full directory listing every "1 day". So as the number of files grows, this job might get slower over time.
Customers typically use a lifecycle policy on the bucket, such as "files older than 30 days can be deleted or archived to another path".
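For illustration only (not from the support thread), a rule like that could be set with boto3; the bucket name, prefix, and 30-day window below are placeholders:

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="bucket",  # placeholder: your ingestion bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-already-ingested-files",
            "Filter": {"Prefix": "path/"},   # only objects under the Auto Loader input prefix
            "Status": "Enabled",
            "Expiration": {"Days": 30},      # delete objects 30 days after creation
        }]
    },
)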
We now have a feature in private preview, called CleanSource, that will do this for you.
To summarize, the advice is:
- Use File Notification mode whenever possible (over Directory Listing)
https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/file-detection-modes.html#file-notification-mode
- Then run the code you showed once a day/week (so this part will use directory listing), which is what the note in the doc recommends: it makes sure that no files are missed if there is a hiccup in file notifications on the service side (see the sketch after this list).
Since the number of files can increase over time for this step, you can then:
- Use a bucket lifecycle policy to remove older files, or use the CleanSource preview feature
(Here's a screenshot from the private preview documentation. This won't work until we enroll you for this).