
I want to write the data from a PySpark DataFrame to an external database, say an Azure MySQL database. So far, I have managed to do this using .write.jdbc():

spark_df.write.jdbc(
    url=mysql_url,
    table=mysql_table,
    mode="append",
    properties={"user": mysql_user, "password": mysql_password,
                "driver": "com.mysql.cj.jdbc.Driver"},
)

Here, if I am not mistaken, the only options available for mode are append and overwrite. However, I want more control over how the data is written; for example, I want to be able to perform update and delete operations.

How can I do this? Is it possible to, say, write SQL queries to write data to the external database? If so, please give me an example.

1 Answer


First, I suggest you use the dedicated Azure SQL Spark connector: https://learn.microsoft.com/en-us/azure/azure-sql/database/spark-connector.

Then I recommend you use bulk mode, as row-by-row inserts are slow and can incur unexpected charges if you have Log Analytics turned on.
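As a rough sketch only, a bulk append with that connector could look like the following in PySpark. It assumes the spark-mssql-connector package is installed on the cluster, and sql_url, sql_user, sql_password and the table name are placeholders, not values from the question:

# Bulk append via the Apache Spark connector for SQL Server / Azure SQL
(spark_df.write
    .format("com.microsoft.sqlserver.jdbc.spark")
    .mode("append")
    .option("url", sql_url)               # e.g. jdbc:sqlserver://<server>.database.windows.net;databaseName=<db>
    .option("dbtable", "dbo.stg_orders")  # hypothetical staging table
    .option("user", sql_user)
    .option("password", sql_password)
    .option("tableLock", "true")          # bulk-load table lock speeds up the insert
    .option("batchsize", "100000")        # rows per bulk-copy batch
    .save())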

Lastly, for any kind of data transformation, you should use an ELT pattern:

  1. Load raw data into an empty staging table
  2. Run SQL code, or even better a stored procedure, that performs the required logic (for example, merging the staging data into a final table); see the sketch below this list
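Here is a rough sketch of that pattern against the MySQL database from the question. The staging table stg_orders and the stored procedure usp_merge_orders are made-up names, and the JDBC call through the JVM gateway assumes a classic (non-Connect) PySpark session with the MySQL driver on the classpath:

# 1. Load the raw data into an empty staging table
spark_df.write.jdbc(
    url=mysql_url,
    table="stg_orders",          # hypothetical staging table
    mode="overwrite",            # recreate/clear the staging table on every load
    properties={"user": mysql_user, "password": mysql_password,
                "driver": "com.mysql.cj.jdbc.Driver"},
)

# 2. Let the database engine run the merge logic via a plain JDBC statement
conn = spark.sparkContext._jvm.java.sql.DriverManager.getConnection(
    mysql_url, mysql_user, mysql_password)
try:
    stmt = conn.createStatement()
    stmt.execute("CALL usp_merge_orders()")   # hypothetical stored procedure (MySQL syntax)
finally:
    conn.close()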

Comments

I like the idea of using an ELT pattern. The link you have given above shows examples in Scala. Are there any examples of how to do this with PySpark? Running a stored procedure after loading data into the staging tables would mean that I am using the database engine and not Spark, right? Will this defeat the purpose of using Spark, because we will not leverage distributed processing?
You're right, it uses the SQL compute, not the Spark compute. But is your workload so intensive that you need distributed processing? Remember, Spark is actually the new kid on the block; RDBMSs have had many years to optimise their algorithms. For example, SQL Server will use a parallel execution plan if it will be faster. If you are processing anything under tens of millions of rows, you probably don't need distributed processing at all. Spark is busy implementing new features (indexes, delta tables) to catch up to traditional RDBMSs.
That makes a lot of sense. So, I was thinking that another option I might have is to read the existing data (if any) in a given table into another DataFrame, perform the required operations using this and the DataFrame with the new data, and finally write it back to the same table. Which option do you think is more efficient? This one or using SQL compute? I will have a very large amount of data to process, by the way.
There are two things to consider: 1. Is there actually a performance issue compared to your performance requirements? (Probably not.) 2. More importantly, is it maintainable? Are you able to build and support a solution that might require a fair bit of SQL knowledge? The maintainability question is important: you don't want to build something that crashes and that you can't keep running. In other words, if you are more comfortable with PySpark than SQL, and there is no difference in performance, then I suggest you use PySpark and minimise the SQL components.
Keep in mind that because it's difficult to "merge" via the Spark connector, what you propose means reading the full dataset into Spark, reprocessing it, then writing it back again. I have to say I've seen a few Spark solutions that would have been much easier in SQL, but the author only knew Spark so they did it that way.
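To illustrate the read-reprocess-write alternative discussed in these comments, here is a rough sketch. The key column id and the ordering column updated_at are hypothetical, and the merged result is materialised before the overwrite because the JDBC read is lazy:

from pyspark.sql import Window
from pyspark.sql import functions as F

jdbc_props = {"user": mysql_user, "password": mysql_password,
              "driver": "com.mysql.cj.jdbc.Driver"}

# 1. Read the current contents of the target table
existing_df = spark.read.jdbc(url=mysql_url, table=mysql_table, properties=jdbc_props)

# 2. Combine old and new rows, keeping the most recent version of each key
w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
merged_df = (existing_df.unionByName(spark_df)
             .withColumn("rn", F.row_number().over(w))
             .filter("rn = 1")
             .drop("rn"))

# 3. Materialise before overwriting the table that was just read,
#    otherwise the lazy JDBC read could see an already-emptied table
merged_df = merged_df.persist()
merged_df.count()

# 4. Write the merged result back
merged_df.write.jdbc(url=mysql_url, table=mysql_table, mode="overwrite",
                     properties=jdbc_props)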
