
Problem statement:

Our data is stored on Azure Data Lake Storage Gen2. We need to process about 25 million records (30 columns) on a daily basis into an Azure SQL Server database. We are using Databricks and PySpark to read this data from a Delta table and load it into the SQL Server database table.

Currently we are experimenting with and tuning our ETL process on a sample of 5M records, which takes 25 minutes on average. We are looking for tips and tricks on how to reduce this further, given that in our production environment we will need to process 25M records on a daily basis.

Technical setup:

Databricks cluster:

  • 4 executors, 4 cores each, 8 GB memory
  • driver has 8 GB memory and 4 cores
  • Runtime version 12.2 LTS, includes Apache Spark 3.3.2

SQL database:

  • Standard S2
  • 50 DTU

Code sample:

    # Read the Delta table (about 5M records) and repartition it for the write
    dataframe = spark.table("hive_metastore.catalog.table")  # 5M records
    dataframe_repartitioned = dataframe.repartition(64)

    sql_host_name = "sql_server_hostname"
    sql_db_name = "sql_server_database"
    sql_user_id = "admin"
    sql_server_pwd = "***"

    jdbc_url = f"jdbc:sqlserver://{sql_host_name}:1433;databaseName={sql_db_name};user={sql_user_id};password={sql_server_pwd};"

    # Write with the Microsoft SQL Server connector for Spark
    dataframe_repartitioned.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .option("truncate", "true") \
        .option("url", jdbc_url) \
        .option("dbtable", "schema.TABLE") \
        .option("tableLock", "true") \
        .option("batchsize", "10000") \
        .option("numPartitions", 64) \
        .mode("overwrite") \
        .save()

After googling and reading similar questions on Stack Overflow, we already tried the following:

Applying proper lengths for nvarchars in our SQL Database table:

  • at first, all nvarchar columns were nvarchar(max)
  • we changed these to more appropriate lengths given the data, nvarchar(10) in most cases (a sketch of the change follows this list)
  • this reduced our processing time from 35 to 25 minutes
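
For reference, the column change is a plain ALTER TABLE statement. A minimal sketch that runs it from the Databricks driver through the SQL Server JDBC driver already on the cluster (this goes via py4j and spark._jvm, a private but commonly used handle, and reuses the jdbc_url from the code sample above; "SomeColumn" is a hypothetical column name):

    # Sketch: tighten an nvarchar(max) column to a realistic length before loading.
    # "SomeColumn" is a placeholder for one of our 30 columns; jdbc_url is the
    # same connection string as in the code sample above.
    conn = spark._jvm.java.sql.DriverManager.getConnection(jdbc_url)
    conn.createStatement().execute(
        "ALTER TABLE schema.TABLE ALTER COLUMN SomeColumn nvarchar(10);"
    )
    conn.close()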

Repartitioning our dataframe before writing to the SQL Server database:

  • we noticed that after reading the Delta table into a dataframe, it was all in a single partition
  • we repartitioned the dataframe into 64 partitions (see the sketch after this list)
  • this had very little impact on the overall processing time (it reduced it by about 45 seconds)
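
For completeness, a minimal sketch of how the partition count can be inspected and changed before the write (same table as in the code sample above):

    # Sketch: check how many partitions the DataFrame has after reading the
    # Delta table, then repartition so the JDBC write can use parallel connections.
    dataframe = spark.table("hive_metastore.catalog.table")
    print(dataframe.rdd.getNumPartitions())    # 1 in our case

    dataframe_repartitioned = dataframe.repartition(64)
    print(dataframe_repartitioned.rdd.getNumPartitions())    # 64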

Using another JDBC connector than the Databricks default:

  • some articles/posts suggest using the com.microsoft.sqlserver.jdbc.spark connector
  • we changed our code to use this one, but it did not have any noticeable impact

We do see that our Azure SQL Database is capping at 100% DTU usage during the insert process. When we increased our Azure SQL Database from S2 to S3 (double the performance, from 50 to 100 DTU), processing time dropped by 4 minutes, but it still takes 21 minutes for 5M records.
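
To see which resource is actually saturating during the load, here is a minimal sketch that reads Azure SQL's sys.dm_db_resource_stats DMV over the same JDBC URL (this assumes the login has VIEW DATABASE STATE permission; the DMV keeps roughly the last hour of usage in 15-second intervals, and for bulk inserts the log write percentage is often the limiting factor):

    # Sketch: read per-database resource usage from Azure SQL to confirm which
    # resource (CPU, data IO, log write) is capping at 100% during the insert.
    usage_query = """
        SELECT TOP (20) end_time,
               avg_cpu_percent,
               avg_data_io_percent,
               avg_log_write_percent
        FROM sys.dm_db_resource_stats
        ORDER BY end_time DESC
    """

    usage = (spark.read
             .format("jdbc")
             .option("url", jdbc_url)
             .option("query", usage_query)
             .load())
    usage.show(truncate=False)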

Are there any ways to make our writes more efficient, or is the only resolution to increase the available DTUs for our Azure SQL Database even more?

Specifying .option("batchsize", "10000") also doesn't seem to have any impact. .option("tableLock", "true") did reduce the processing time by about 1 minute.

Comments:
  • How did you make it work? I'm using Spark 3.5 and got an error that the save method doesn't exist. According to the documentation, the library only supports up to Spark 3.1.x. Commented Jul 29, 2024 at 18:40
  • In the GitHub repo of the connector you can find a JAR that works with Spark 3.4. I haven't tested this with 3.5 yet, but from what I've read online it does seem to work. github.com/microsoft/sql-spark-connector Commented Oct 27, 2024 at 8:01

1 Answer


After some further performance testing, we noticed that additional tuning on the Spark side didn't have much effect, but increasing the service tier of our Azure SQL database had a very substantial impact.

Processing time for 5M records per service tier:

  • S2: 25 minutes
  • S3: 21 minutes
  • S4: 10 minutes
  • S6: 5 minutes

We decided to scale up our Azure SQL database and will look into auto-scaling it during our ETL process.
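
A minimal sketch of what that scale-up/scale-down could look like, run from the Databricks driver with pyodbc against the master database and using the connection variables from the question (this assumes pyodbc and the Microsoft ODBC driver are available on the driver and the login is allowed to run ALTER DATABASE):

    # Sketch: change the service objective before the ETL write and revert it
    # afterwards. ALTER DATABASE must run with autocommit (outside a transaction)
    # and completes asynchronously; open connections may drop briefly when the
    # scale operation finishes.
    import pyodbc

    def set_service_objective(objective):
        conn = pyodbc.connect(
            "DRIVER={ODBC Driver 17 for SQL Server};"
            f"SERVER={sql_host_name},1433;DATABASE=master;"
            f"UID={sql_user_id};PWD={sql_server_pwd}",
            autocommit=True,
        )
        conn.cursor().execute(
            f"ALTER DATABASE [{sql_db_name}] MODIFY (SERVICE_OBJECTIVE = '{objective}');"
        )
        conn.close()

    set_service_objective("S6")    # scale up before the load
    # ... run the write to SQL Server ...
    set_service_objective("S2")    # scale back down afterwards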
