
I'm struggling to understand how to control the backfill process built into Auto Loader: https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/production.html#trigger-regular-backfills-using-cloudfilesbackfillinterval

If I set cloudFiles.backfillInterval to '1 day', the Auto Loader stream will scan every file in the source to check whether anything has been missed.

Then 1 day later, it will scan every file in the source again...

As the number of files in the source grows over time, surely this process is going to take longer and longer...

I experimented with cloudFiles.maxFileAge, assuming if I set it to something like '1 year' the backfill process would only re-scan files less than a year old, but alas that does not seem to be the case.

Am I missing something? Is there another way to control the backfill process, or is the way it works out of the box, scanning every file in source, just what I'll have to account for?
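For reference, here is roughly how I'm wiring up the options (the file format, schema location and paths below are simplified placeholders, not my real config):

# Sketch of the setup described above; cloudFiles.backfillInterval and
# cloudFiles.maxFileAge are the options in question, everything else is a placeholder.
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.backfillInterval", "1 day") \
    .option("cloudFiles.maxFileAge", "1 year") \
    .option("cloudFiles.schemaLocation", "s3a://bucket/_schema/") \
    .load("s3a://bucket/path/")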

2 Answers


Based on answers provided by Databricks, backfillInterval checks all files at whatever interval you set. In other words, it is supposed to check all files, just at the specified interval.


References: https://community.databricks.com/t5/data-engineering/what-does-autoloader-s-cloudfiles-backfillinterval-do/td-p/7709

https://community.databricks.com/t5/data-engineering/databricks-auto-loader-cloudfiles-backfillinterval/td-p/37915


1 Comment

Yeah, not great as the dataset continues to grow over time, resulting in a longer backfill process.

Corrected answer based on inputs from Databricks support.


So if I want to configure my job such that once a day it looks at the last 24 hours and processes any files that were missed, would this configuration help?

df = spark.readStream.format("cloudFiles") \
           .options(**autoloader_config) \
           .option("cloudFiles.backfillInterval", "1 day") \
           .load("s3a://bucket/path/")

The primary confusion is: if I set it to “1 day”, will Auto Loader scan all files in “s3a://bucket/path/” to look for missed files, or only the files newer than “1 day”? The concern is that if Auto Loader scans ALL files under the input path, then over time it’ll be scanning millions of files, which doesn’t make sense.


Then 1 day later, it will scan every file in the source again...

Yes, it will trigger a full directory listing every "1 day". So yes, as the number of files grows, this job may get slower over time.

Customers typically use a lifecycle policy on the bucket, such as "files older than 30 days can be deleted or archived to another path".
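
For illustration, a minimal sketch of such a rule with boto3 (the bucket name, prefix and retention window are placeholders; use a Transition action instead of Expiration if you want to archive rather than delete):

import boto3

# Sketch only: expire objects under the ingest prefix 30 days after creation.
# Bucket name and prefix are placeholders.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-ingest-files",
                "Filter": {"Prefix": "path/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)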

We have a feature in private preview that will do this for you now. It's called CleanSource.

To summarize, the advice is to use:

  1. Use File Notification mode whenever possible (over Directory Listing): https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/file-detection-modes.html#file-notification-mode
  2. Then run the code you showed once a day/week (this part will use directory listing; see the sketch after this list). That is what the note in the doc mentions: it ensures that if the file notification service drops an event, no files are missed. Since the number of files this listing covers can grow over time, you can then:
  3. Use a bucket lifecycle policy to remove older files, or use the CleanSource preview feature (described in the private preview documentation; it won't work until we enroll you for it).
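
To make the combination of file notification mode plus a periodic backfill concrete, here is a rough sketch (the file format, paths and interval below are placeholders, not a recommended configuration):

# Rough sketch only: file notification mode for normal file discovery, plus a
# periodic directory-listing backfill as a safety net.
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.useNotifications", "true") \
    .option("cloudFiles.backfillInterval", "1 day") \
    .option("cloudFiles.schemaLocation", "s3a://bucket/_schema/") \
    .load("s3a://bucket/path/")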

4 Comments

Unfortunately not; lastBackfillFinishTimeMs is just used to determine when the next backfill will run, and that backfill will still scan every file in the source.
Thankfully we are using datetime partition folders for our data, so we can load specific folders based on the current date as a workaround.
Curious. You mean if (now - lastBackfillFinishTimeMs) >= 1_day; then scan_all_files()? I'm checking with Databricks support. Not sure how you'll load specific folders using the same checkpoint location.
@AndyMcWilliams, you're right, confirmed by Databricks support.
