I am trying to load roughly 20 million records from a Delta table in Databricks into an Azure SQL database using the Apache Spark connector for SQL Server & Azure SQL (the version that supports the Python API and Spark 3.0).
Below is the code I am using. Do you think I am missing something here? The same code executes fine if I use the write format "jdbc" (a rough sketch of that JDBC version is included after the snippet for reference).
df.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("overwrite") \
    .option("truncate", "true") \
    .option("url", url) \
    .option("dbtable", Tablenamewithschema) \
    .option("user", user) \
    .option("password", password) \
    .option("reliabilityLevel", "BEST_EFFORT") \
    .option("tableLock", "True") \
    .option("isolationLevel", "True") \
    .option("batchsize", "100000") \
    .option("schemaCheckEnabled", "false") \
    .save()
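For reference, the plain JDBC version that executes fine looks roughly like this (a sketch only; it reuses the same url, user, password and Tablenamewithschema variables, and the explicit driver option may or may not be needed depending on the cluster setup):

# Plain JDBC write path that works for me (sketch; same variables as above)
df.write \
    .format("jdbc") \
    .mode("overwrite") \
    .option("truncate", "true") \
    .option("url", url) \
    .option("dbtable", Tablenamewithschema) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("batchsize", "100000") \
    .save()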
I am getting the below error when using the mentioned connector.
Error while Loading the data into a table -
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 28 in stage 2.0 failed 4 times, most recent failure: Lost task 28.3 in stage 2.0 (TID 54):
com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.
I have created a new cluster with the below configuration, and only one library is installed on that cluster, from the Maven coordinate com.microsoft.azure:spark-mssql-connector_2.12:1.2.0:
- 8 worker nodes with 14 GB memory and 4 cores each.
- Databricks Runtime Version: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12).
Is there something else I can do to improve the performance? For around 100 million records, the old JDBC driver takes around 1 hour.
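To frame the performance part of the question, this is the kind of tuning I have been considering for the connector write once the connection issue is resolved (purely a sketch; the repartition count of 32 is only a guess matching 8 workers x 4 cores, and nothing here is validated):

# Hypothetical tuning sketch: set the write parallelism explicitly before the bulk insert.
# 32 partitions is just a guess based on 8 worker nodes x 4 cores each.
df.repartition(32).write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("overwrite") \
    .option("truncate", "true") \
    .option("url", url) \
    .option("dbtable", Tablenamewithschema) \
    .option("user", user) \
    .option("password", password) \
    .option("reliabilityLevel", "BEST_EFFORT") \
    .option("tableLock", "true") \
    .option("batchsize", "100000") \
    .save()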