Problem statement:
Our data is stored on Azure Data Lake Storage Gen2. We need to load about 25 million records (30 columns) into an Azure SQL Database on a daily basis. We are using Databricks with PySpark to read this data from a Delta table and write it into the SQL database table.
Currently we are tuning our ETL process with a sample of 5M records, which takes 25 minutes on average. We are looking for tips on how to reduce this further, given that in production we will have to process 25M records daily.
Technical setup:
Databricks cluster:
- 4 executors, 4 cores each, 8 GB memory
- driver: 8 GB memory, 4 cores
- Runtime version 12.2 LTS, includes Apache Spark 3.3.2
SQL database:
- Standard S2
- 50 DTU
Code sample:
dataframe = spark.table("hive_metastore.catalog.table")  # 5M records
dataframe_repartitioned = dataframe.repartition(64)

sql_host_name = "sql_server_hostname"
sql_db_name = "sql_server_database"
sql_user_id = "admin"
sql_server_pwd = "***"
jdbc_url = f"jdbc:sqlserver://{sql_host_name}:1433;databaseName={sql_db_name};user={sql_user_id};password={sql_server_pwd};"

dataframe_repartitioned.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("truncate", "true") \
    .option("url", jdbc_url) \
    .option("dbtable", "schema.TABLE") \
    .option("tableLock", "true") \
    .option("batchsize", "10000") \
    .option("numPartitions", 64) \
    .mode("overwrite") \
    .save()
After googling and reading similar questions on Stack Overflow, we have already tried the following:
Applying proper lengths for nvarchar columns in our SQL database table:
- at first, all nvarchar columns were nvarchar(max)
- we changed these to more appropriate lengths given the data, nvarchar(10) in most cases (see the sketch after this list for how the lengths can be derived)
- this reduced our processing time from 35 to 25 minutes
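A minimal sketch of how suitable lengths can be derived from the data (the table name matches our code sample; the rest is illustrative):

from pyspark.sql import functions as F

dataframe = spark.table("hive_metastore.catalog.table")
# Maximum string length per column, as a guide for choosing nvarchar(n) sizes
string_cols = [c for c, t in dataframe.dtypes if t == "string"]
dataframe.select(
    [F.max(F.length(F.col(c))).alias(c) for c in string_cols]
).show()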
Repartitioning our dataframe before writing to the SQL Server database:
- we noticed that after reading the Delta table into a dataframe, it was all in a single partition
- we repartitioned the dataframe into 64 partitions, as shown in the sketch after this list
- this had very little impact on the overall processing time (a reduction of about 45 seconds)
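Roughly how we checked and changed the partitioning (a minimal sketch):

# The Delta table came back as a single partition
print(dataframe.rdd.getNumPartitions())  # 1 in our case

# Spread the rows over 64 partitions before writing
dataframe_repartitioned = dataframe.repartition(64)
# Note: with 4 executors x 4 cores there are only 16 task slots,
# so at most 16 of the 64 partitions are written concurrently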
Using another JDBC connector than the Databricks default:
- some articles/posts suggest using the com.microsoft.sqlserver.jdbc.spark connector
- we changed our code to use this one (see the comparison below), but it did not have any measurable impact
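For comparison, the default Spark JDBC write we switched away from looked roughly like this (a sketch; the options mirror our code sample above):

dataframe_repartitioned.write \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "schema.TABLE") \
    .option("truncate", "true") \
    .option("batchsize", "10000") \
    .option("numPartitions", 64) \
    .mode("overwrite") \
    .save()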
We do see our Azure SQL Database capping at 100% DTU usage during the insert process. When we scaled it from S2 to S3 (doubling the performance from 50 to 100 DTU), processing time dropped by 4 minutes, but it still takes 21 minutes for 5M records.
Are there any ways to make our writes more efficient, or is the only resolution to increase the available DTUs for our Azure SQL Database even further?
Specifying .option("batchsize", "10000") also does not seem to have any impact.
Adding .option("tableLock", "true") did reduce the processing time by 1 minute.