
I want to write the data from a PySpark DataFrame to an external database, say an Azure MySQL database. So far, I have managed to do this using .write.jdbc():

spark_df.write.jdbc(
    url=mysql_url,
    table=mysql_table,
    mode="append",
    properties={"user": mysql_user, "password": mysql_password,
                "driver": "com.mysql.cj.jdbc.Driver"},
)

Here, if I am not mistaken, the only options available for mode are append and overwrite. However, I want more control over how the data is written; for example, I want to be able to perform update and delete operations.

How can I do this? Is it possible to, say, write SQL queries to write data to the external database? If so, please give me an example.

1 Answer


First, I suggest you use the dedicated Azure SQL Spark connector: https://learn.microsoft.com/en-us/azure/azure-sql/database/spark-connector.

Then I recommend you use bulk mode, as row-by-row inserts are slow and can incur unexpected charges if you have Log Analytics turned on.
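As a rough sketch only, a bulk append with that connector could look like the following in PySpark. It assumes the spark-mssql-connector package is installed on the cluster, and sql_url, sql_user, sql_password and the table name are placeholders, not values from the question:

# Bulk append via the Apache Spark connector for SQL Server / Azure SQL
(spark_df.write
    .format("com.microsoft.sqlserver.jdbc.spark")
    .mode("append")
    .option("url", sql_url)               # e.g. jdbc:sqlserver://<server>.database.windows.net;databaseName=<db>
    .option("dbtable", "dbo.stg_orders")  # hypothetical staging table
    .option("user", sql_user)
    .option("password", sql_password)
    .option("tableLock", "true")          # bulk-load table lock speeds up the insert
    .option("batchsize", "100000")        # rows per bulk-copy batch
    .save())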

Lastly, for any kind of data transformation, you should use an ELT pattern:

  1. Load raw data into an empty staging table
  2. Run SQL code, or even better a stored procedure, that performs the required logic (for example, merging the staging data into a final table); see the sketch below this list
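Here is a rough sketch of that pattern against the MySQL database from the question. The staging table stg_orders and the stored procedure usp_merge_orders are made-up names, and the JDBC call through the JVM gateway assumes a classic (non-Connect) PySpark session with the MySQL driver on the classpath:

# 1. Load the raw data into an empty staging table
spark_df.write.jdbc(
    url=mysql_url,
    table="stg_orders",          # hypothetical staging table
    mode="overwrite",            # recreate/clear the staging table on every load
    properties={"user": mysql_user, "password": mysql_password,
                "driver": "com.mysql.cj.jdbc.Driver"},
)

# 2. Let the database engine run the merge logic via a plain JDBC statement
conn = spark.sparkContext._jvm.java.sql.DriverManager.getConnection(
    mysql_url, mysql_user, mysql_password)
try:
    stmt = conn.createStatement()
    stmt.execute("CALL usp_merge_orders()")   # hypothetical stored procedure (MySQL syntax)
finally:
    conn.close()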

Comments

I like the idea of using an ELT pattern. The link you have given above shows examples in Scala. Are there any examples of how to do this with PySpark? Running a stored procedure after loading data into the staging tables would mean that I am using the database engine and not Spark, right? Will this defeat the purpose of using Spark, because we will not leverage distributed processing?
You're right, it uses the SQL compute, not the Spark compute. But is your workload so intensive that you need distributed processing? Remember, Spark is actually the new kid on the block; RDBMSs have had many years to optimise their algorithms. For example, SQL Server will use a parallel execution plan if it will be faster. If you are processing anything under tens of millions of rows, you probably don't need distributed processing at all. Spark is busy implementing new features (indexes, delta tables) to catch up to traditional RDBMSs.
That makes a lot of sense. So, I was thinking that another option I might have is to read the existing data (if any) in a given table into another DataFrame, perform the required operations using this and the DataFrame with the new data, and finally write it back to the same table. Which option do you think is more efficient? This one or using SQL compute? I will have a very large amount of data to process, by the way.
There are two things to consider: 1. Is there actually a performance issue compared to your performance requirements? (Probably not.) 2. More importantly, is it maintainable? Are you able to build and support a solution that might require a fair bit of SQL knowledge? The maintainability question is important: you don't want to build something that crashes and that you can't keep running. In other words, if you are more comfortable with PySpark than SQL, and there is no difference in performance, then I suggest you use PySpark and minimise the SQL components.
Keep in mind that because it's difficult to "merge" via the Spark connector, what you propose means reading the full dataset into Spark, reprocessing it, then writing it back again. I have to say I've seen a few Spark solutions that would have been much easier in SQL, but the author only knew Spark so they did it that way.
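To illustrate the read-reprocess-write alternative discussed in these comments, here is a rough sketch. The key column id and the ordering column updated_at are hypothetical, and the merged result is materialised before the overwrite because the JDBC read is lazy:

from pyspark.sql import Window
from pyspark.sql import functions as F

jdbc_props = {"user": mysql_user, "password": mysql_password,
              "driver": "com.mysql.cj.jdbc.Driver"}

# 1. Read the current contents of the target table
existing_df = spark.read.jdbc(url=mysql_url, table=mysql_table, properties=jdbc_props)

# 2. Combine old and new rows, keeping the most recent version of each key
w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
merged_df = (existing_df.unionByName(spark_df)
             .withColumn("rn", F.row_number().over(w))
             .filter("rn = 1")
             .drop("rn"))

# 3. Materialise before overwriting the table that was just read,
#    otherwise the lazy JDBC read could see an already-emptied table
merged_df = merged_df.persist()
merged_df.count()

# 4. Write the merged result back
merged_df.write.jdbc(url=mysql_url, table=mysql_table, mode="overwrite",
                     properties=jdbc_props)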
