Problem statement:
Our data is stored on Azure Data Lake Storage Gen2. We need to load about 25 million records (30 columns) into an Azure SQL Database on a daily basis. We are using Databricks with PySpark to read this data from a Delta table and write it into the SQL database table.
Currently we are tuning our ETL process with a sample of 5M records, which takes 25 minutes on average. We are looking for tips on how to reduce this further, given that in production we will have to process 25M records daily.
Technical setup:
Databricks cluster:
- 4 executors, 4 cores each, 8 GB memory
- driver: 8 GB memory, 4 cores
- Runtime version 12.2 LTS, includes Apache Spark 3.3.2
SQL database:
- Standard S2
- 50 DTU
Code sample:
dataframe = spark.table("hive_metastore.catalog.table")  # 5M records
dataframe_repartitioned = dataframe.repartition(64)

sql_host_name = "sql_server_hostname"
sql_db_name = "sql_server_database"
sql_user_id = "admin"
sql_server_pwd = "***"
jdbc_url = f"jdbc:sqlserver://{sql_host_name}:1433;databaseName={sql_db_name};user={sql_user_id};password={sql_server_pwd};"

dataframe_repartitioned.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("truncate", "true") \
    .option("url", jdbc_url) \
    .option("dbtable", "schema.TABLE") \
    .option("tableLock", "true") \
    .option("batchsize", "10000") \
    .option("numPartitions", 64) \
    .mode("overwrite") \
    .save()
After googling and reading similar questions on Stack Overflow, we have already tried the following:
Applying proper lengths for nvarchar columns in our SQL database table:
- at first, all nvarchar columns were nvarchar(max)
- we changed these to more appropriate lengths given the data, nvarchar(10) in most cases (see the sketch after this list for how the lengths can be derived)
- this reduced our processing time from 35 to 25 minutes
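A minimal sketch of how suitable lengths can be derived from the data (the table name matches our code sample; the rest is illustrative):

from pyspark.sql import functions as F

dataframe = spark.table("hive_metastore.catalog.table")
# Maximum string length per column, as a guide for choosing nvarchar(n) sizes
string_cols = [c for c, t in dataframe.dtypes if t == "string"]
dataframe.select(
    [F.max(F.length(F.col(c))).alias(c) for c in string_cols]
).show()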
Repartitioning our dataframe before writing to the SQL Server database:
- we noticed that after reading the Delta table into a dataframe, it was all in a single partition
- we repartitioned the dataframe into 64 partitions, as shown in the sketch after this list
- this had very little impact on the overall processing time (a reduction of about 45 seconds)
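Roughly how we checked and changed the partitioning (a minimal sketch):

# The Delta table came back as a single partition
print(dataframe.rdd.getNumPartitions())  # 1 in our case

# Spread the rows over 64 partitions before writing
dataframe_repartitioned = dataframe.repartition(64)
# Note: with 4 executors x 4 cores there are only 16 task slots,
# so at most 16 of the 64 partitions are written concurrently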
Using another JDBC connector than the Databricks default:
- some articles/posts suggest using the com.microsoft.sqlserver.jdbc.spark connector
- we changed our code to use this one (see the comparison below), but it did not have any measurable impact
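For comparison, the default Spark JDBC write we switched away from looked roughly like this (a sketch; the options mirror our code sample above):

dataframe_repartitioned.write \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "schema.TABLE") \
    .option("truncate", "true") \
    .option("batchsize", "10000") \
    .option("numPartitions", 64) \
    .mode("overwrite") \
    .save()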
We do see our Azure SQL Database capping at 100% DTU usage during the insert process. When we scaled it from S2 to S3 (doubling the performance from 50 to 100 DTU), processing time dropped by 4 minutes, but it still takes 21 minutes for 5M records.
Are there any ways to make our writes more efficient, or is the only resolution to increase the available DTUs for our Azure SQL Database even further?
Specifying .option("batchsize", "10000") also does not seem to have any impact.
Adding .option("tableLock", "true") did reduce the processing time by 1 minute.