I'm new to data engineering, so this might be a basic question, but I haven't been able to figure it out on my own.
--context--
I have a Spark job executed by an Azure Data Factory pipeline every 10 minutes.
On every execution, the Spark job connects to a database (Azure SQL Server) and loads 3 tables. All three are small tables (1.9M, 72K, 72K). But sometimes the connection to the database fails, and then the whole pipeline fails.
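The read itself is roughly the following (the server, credentials, and table names here are placeholders, not my actual ones):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-tables").getOrCreate()

# Placeholder connection details -- not the real ones from my pipeline
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"
connection_props = {
    "user": "<user>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# One read per table; this is the step that intermittently fails to connect
df_large = spark.read.jdbc(jdbc_url, "dbo.big_table", properties=connection_props)
df_small1 = spark.read.jdbc(jdbc_url, "dbo.small_table_1", properties=connection_props)
df_small2 = spark.read.jdbc(jdbc_url, "dbo.small_table_2", properties=connection_props)
```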
What I want to do:
I want to cache the 3 tables (once a day if possible) so that Spark doesn't have to connect to the database on every single execution.
At first I thought df.persist(StorageLevel.DISK_ONLY) could retain the dataframes even after the Spark session is terminated, but I couldn't confirm that. Another option I thought of was saving the tables in Hive, but Hive is usually used for big data, so I'm not sure it's the right choice in this situation.
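To make the two options concrete, this is roughly what I'm weighing (the database/table names are made up, and df_large is one of the dataframes loaded from Azure SQL in the sketch above):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Option 1: persist to disk -- as far as I understand, this only lives for the
# duration of the Spark application, so it would be gone once the session
# started by the ADF run ends
df_large.persist(StorageLevel.DISK_ONLY)

# Option 2: write the tables out as managed (Hive metastore) tables once a day,
# then have the 10-minute job read them back instead of hitting the database
spark.sql("CREATE DATABASE IF NOT EXISTS cache_db")  # hypothetical database name
df_large.write.mode("overwrite").saveAsTable("cache_db.big_table")

# ...and in the frequent job:
df_cached = spark.table("cache_db.big_table")
```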
Any advice is welcome.