
We have a daily ETL process where we write Parquet data (~15GB) stored in Azure Data Lake Storage (ADLS) into a table in Azure SQL Database. The target table is truncated and reloaded each day.

Environment details:

Azure SQL DB Tier: S12 with 3000 DTUs

Databricks Cluster: 10 workers (Standard_D4s_v3), total 160 GB RAM and 40 cores

source_df.write \
    .mode("overwrite") \
    .option("truncate", True) \
    .option("tableLock", "false") \
    .option("batchsize", "100000") \
    .option("schemaCheckEnabled", "false") \
    .jdbc(jdbcUrl, target_table, properties=connection_properties)

The issue is that this write operation runs for a long time and eventually fails with a connection timeout error.

What we’ve tried so far:

Splitting the DataFrame and writing in chunks

Partitioning the data (e.g., by month) and inserting sequentially

Despite these attempts, the load still takes too long or fails entirely.

Question: Is there a more efficient way to load large Parquet files into Azure SQL DB from Databricks? We are trying the ADF (Azure Data Factory) option in parallel as well, but any inputs on how to get this done in code would be appreciated.

  • Is your Databricks workspace also in Azure? If you're using Unity Catalog, you can add your database to the catalog and write using SQL syntax. I'd suggest trying that approach.
    – Lev Gelman
    Commented yesterday
  • Yes, our Databricks workspace is also in Azure. We are not using Unity Catalog at the moment.
    – Harish J
    Commented yesterday
  • @LevGelman Could you explain more about the Unity Catalog option? I will try it if possible.
    – Harish J
    Commented yesterday
  • Do you see any error?
    Commented yesterday
  • I can't find the relevant docs page, but you can go to the Catalog and add a connection to your SQL Server; it will reflect the SQL Server objects in the catalog, and then you can use SQL syntax to insert data (see the sketch after this comment thread).
    – Lev Gelman
    Commented yesterday
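A rough sketch of the Unity Catalog route Lev describes, using Lakehouse Federation from a Databricks notebook. Every name here (sqlserver_conn, azuresql_cat, the host, database, secret scope, and table) is a placeholder rather than something from the question, and whether INSERT against a federated table is supported depends on your Databricks runtime, so treat this as a starting point to verify rather than a confirmed recipe:

# Rough sketch only -- every name below is a placeholder.
# 1. Register the Azure SQL server as a Unity Catalog connection.
spark.sql("""
    CREATE CONNECTION IF NOT EXISTS sqlserver_conn
    TYPE sqlserver
    OPTIONS (
        host 'myserver.database.windows.net',
        port '1433',
        user secret('my_scope', 'sql_user'),
        password secret('my_scope', 'sql_password')
    )
""")

# 2. Expose one database from that server as a foreign catalog.
spark.sql("""
    CREATE FOREIGN CATALOG IF NOT EXISTS azuresql_cat
    USING CONNECTION sqlserver_conn
    OPTIONS (database 'my_database')
""")

# 3. SQL Server objects now appear as azuresql_cat.<schema>.<table>,
#    so they can be queried (and, where the runtime allows, written)
#    with plain SQL syntax.
spark.sql("SELECT COUNT(*) AS cnt FROM azuresql_cat.dbo.target_table").show()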

1 Answer


For example, here is how you can write a DataFrame to an Azure SQL table using PySpark with JDBC:

df.write \
    .format("jdbc") \
    .option("driver", jdbc_properties["driver"]) \
    .option("url", jdbc_properties["url"]) \
    .option("dbtable", "L_Table") \
    .option("user", jdbc_properties["user"]) \
    .option("password", jdbc_properties["password"]) \
    .option("truncate", "true") \
    .option("batchsize", 10000) \
    .option("isolationLevel", "NONE") \
    .option("numPartitions", 10) \
    .mode("overwrite") \
    .save()


.option("batchsize", 10000)

Specifies the number of rows written to the database in each batch. Larger batches mean fewer round trips, which generally improves write throughput.

.option("isolationLevel", "NONE")

Sets the transaction isolation level. "NONE" means no specific isolation level is enforced, which can improve performance for bulk operations.

.option("numPartitions", 10)

Caps the number of partitions, and therefore the number of concurrent JDBC connections, used during the write; each partition writes over its own connection, giving parallelism for better throughput.
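One detail worth noting: for JDBC writes, the effective parallelism comes from the DataFrame's own partition count, and numPartitions only caps it (Spark coalesces if there are more partitions than the cap). If your DataFrame has too few partitions to use the available writers, repartition it first; a minimal sketch, with the count of 10 purely illustrative:

# Illustrative only: align the DataFrame's partition count with the number
# of concurrent JDBC writers you want before calling df.write as above.
df = df.repartition(10)
print(df.rdd.getNumPartitions())   # should now report 10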

.mode("overwrite")

Overwrites the data in the target table (L_Table) if it already exists.

Use Case: This code establishes a JDBC connection from PySpark to an Azure SQL database and writes a DataFrame (df) to the SQL table L_Table. In this example, the DataFrame has 100 columns, and the data is written from a data lake into the Azure SQL table efficiently using the specified configuration.
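For completeness, a minimal sketch of how df and jdbc_properties in the snippet above might be set up. The ADLS path, server, database, and secret scope/key names are placeholders invented for illustration, not values from the question:

# All names below are placeholders -- substitute your own.
jdbc_properties = {
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "url": ("jdbc:sqlserver://myserver.database.windows.net:1433;"
            "database=my_database;encrypt=true;loginTimeout=30;"),
    "user": dbutils.secrets.get("my_scope", "sql_user"),
    "password": dbutils.secrets.get("my_scope", "sql_password"),
}

# Read the daily Parquet drop from ADLS.
df = spark.read.parquet(
    "abfss://container@storageaccount.dfs.core.windows.net/path/to/daily/"
)

# ...then run the df.write ... .save() call shown above.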

