We have a daily ETL process that writes ~15 GB of Parquet data stored in Azure Data Lake Storage (ADLS) into a table in Azure SQL Database. The target table is truncated and reloaded each day.
Environment details:
Azure SQL DB Tier: S12 with 3000 DTUs
Databricks Cluster: 10 workers (Standard_D4s_v3), total 160 GB RAM and 40 cores
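For context, source_df is just the daily Parquet read from ADLS; a minimal sketch (the storage account, container, and path below are placeholders, not our real names):
# Hypothetical ADLS path -- the real storage account/container/path differ
source_path = "abfss://raw@<storageaccount>.dfs.core.windows.net/daily_extract/"
source_df = spark.read.parquet(source_path)  # ~15 GB across many part files
The write to Azure SQL DB is currently done with the generic JDBC writer: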
source_df.write \
    .mode("overwrite") \
    .option("truncate", True) \
    .option("tableLock", "false") \
    .option("batchsize", "100000") \
    .option("schemaCheckEnabled", "false") \
    .jdbc(jdbcUrl, target_table, properties=connection_properties)
The issue is that this write operation takes a long time and eventually fails with a "connection timed out" error.
What we’ve tried so far:
Splitting the DataFrame and writing it in chunks (roughly as sketched below)
Partitioning the data (e.g., by month) and inserting each partition sequentially
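The chunked attempt looked roughly like this (the chunk count is arbitrary; the month-partitioned variant replaced randomSplit with a filter on a month column):
# Sketch of the chunked write we tried: truncate-load the first chunk,
# then append the remaining chunks one at a time.
n_chunks = 10  # arbitrary; we experimented with different values
chunks = source_df.randomSplit([1.0] * n_chunks, seed=42)
for i, chunk_df in enumerate(chunks):
    chunk_df.write \
        .mode("overwrite" if i == 0 else "append") \
        .option("truncate", True) \
        .option("batchsize", "100000") \
        .jdbc(jdbcUrl, target_table, properties=connection_properties)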
Despite these attempts, the load still takes too long or fails entirely.
Question: Is there a more efficient way to load large Parquet files into Azure SQL DB from Databricks? We are trying the ADF option in parallel as well, but any inputs on how to get this done in code would be appreciated.
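One direction we are also considering (besides ADF) is the dedicated Apache Spark connector for SQL Server and Azure SQL (com.microsoft.sqlserver.jdbc.spark), since options such as tableLock and schemaCheckEnabled appear to apply to that connector rather than to the generic .jdbc() writer. A rough sketch of what we have in mind, assuming the connector is installed on the cluster and connection_properties holds "user" and "password" keys:
# Sketch only: assumes the Spark connector for SQL Server/Azure SQL is installed.
# tableLock "true" requests a bulk insert with a table lock, which should be
# acceptable here since we truncate and reload the whole table anyway.
source_df.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("overwrite") \
    .option("truncate", "true") \
    .option("url", jdbcUrl) \
    .option("dbtable", target_table) \
    .option("user", connection_properties["user"]) \
    .option("password", connection_properties["password"]) \
    .option("tableLock", "true") \
    .option("batchsize", "100000") \
    .option("schemaCheckEnabled", "false") \
    .save()
Does this look like the right direction for this data volume, or is there a better pattern?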