I'm new to data engineering, so this might be a basic question, but I haven't been able to figure it out on my own.
--context--
I have a Spark job executed by an Azure Data Factory pipeline every 10 minutes.
On every execution, the Spark job connects to a database (Azure SQL Server) and loads 3 tables. All three are small tables (1.9M, 72K, 72K). But sometimes the connection to the database fails, and then the whole pipeline fails.
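The read itself is roughly the following (the server, credentials, and table names here are placeholders, not my actual ones):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-tables").getOrCreate()

# Placeholder connection details -- not the real ones from my pipeline
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"
connection_props = {
    "user": "<user>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# One read per table; this is the step that intermittently fails to connect
df_large = spark.read.jdbc(jdbc_url, "dbo.big_table", properties=connection_props)
df_small1 = spark.read.jdbc(jdbc_url, "dbo.small_table_1", properties=connection_props)
df_small2 = spark.read.jdbc(jdbc_url, "dbo.small_table_2", properties=connection_props)
```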
What I want to do:
I want to cache the 3 tables (once a day if possible) so that Spark doesn't have to connect to the database on every single execution.
At first I thought df.persist(StorageLevel.DISK_ONLY) could retain the dataframes even after the Spark session is terminated, but I couldn't confirm that. Another option I thought of was saving the tables in Hive, but Hive is usually used for big data, so I'm not sure it's the right choice in this situation.
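To make the two options concrete, this is roughly what I'm weighing (the database/table names are made up, and df_large is one of the dataframes loaded from Azure SQL in the sketch above):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Option 1: persist to disk -- as far as I understand, this only lives for the
# duration of the Spark application, so it would be gone once the session
# started by the ADF run ends
df_large.persist(StorageLevel.DISK_ONLY)

# Option 2: write the tables out as managed (Hive metastore) tables once a day,
# then have the 10-minute job read them back instead of hitting the database
spark.sql("CREATE DATABASE IF NOT EXISTS cache_db")  # hypothetical database name
df_large.write.mode("overwrite").saveAsTable("cache_db.big_table")

# ...and in the frequent job:
df_cached = spark.table("cache_db.big_table")
```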
Any advice is welcome.