We are in the process of upgrading our Databricks platform. A couple of weeks ago we set up Unity Catalog.
Now we are trying to go from Databricks Runtime 13.3 LTS to 15.4 LTS. Two of the notebooks we run (out of 40+) took about 3 times longer (30 min to 1.5 h). We tried DBR 16.0 with the same results.
We then rolled back to Databricks Runtime 14.3 LTS and things got faster.
We used to run everything on Standard_DS3_v2 and Standard_DS4_v2 nodes. We have also switched to Standard_D8ads_v5 (as they support delta/disk caching), which improved speed a little.
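For reference, the disk ("delta") cache is a separate feature from df.cache() and, as far as I understand, only speeds up reads of Parquet/Delta files from cloud storage. A minimal sketch of how we enable and check it on the new node type (spark is the session object that Databricks notebooks predefine):

# Enable the Databricks disk cache ("delta cache") for Parquet/Delta reads
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Verify the current setting
print(spark.conf.get("spark.databricks.io.cache.enabled"))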
- What could lead to such a large performance difference between Databricks Runtimes?
I am wondering whether lazy evaluation works differently between versions. We often count on materializing a DataFrame by doing something like:
df.cache()
rc = df.count()
The idea is that if the DataFrame is referenced multiple times later, the upstream transformations execute only once.
- I am wondering if that trick no longer works, and whether later steps are re-running everything again and again.
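One way I plan to verify this (a minimal sketch; spark.range(...) is just a hypothetical stand-in for our real DataFrame):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the real pipeline output
df = spark.range(1_000_000).withColumnRenamed("id", "key")

df.cache()              # lazily marks the DataFrame for caching
rc = df.count()         # an action is required to actually populate the cache

print(df.storageLevel)  # memory/disk flags set => the DataFrame is marked for caching

The Spark UI's Storage tab then shows whether (and how much of) the DataFrame was actually materialized, which should make it obvious if later steps are recomputing instead of reading from the cache.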
Today, this process only had to write about 5700 rows to a Delta table. One of the things we do is call coalesce(1) before the write (the file is very small anyway).
- How can that step take 10 minutes now (on DBR 14.3)? Is it re-running everything from scratch (ignoring the DataFrame cached before it)?
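For context, the write looks roughly like the first line below (target_path and the write mode are placeholders). As far as I know, coalesce(1) avoids a shuffle but can collapse the whole preceding stage into a single task, while repartition(1) adds a shuffle boundary so the upstream work keeps its parallelism; that is one thing we intend to compare:

# Current approach: no shuffle, but the stage feeding the write may run as a single task
df.coalesce(1).write.format("delta").mode("overwrite").save(target_path)

# Alternative: the shuffle keeps upstream stages parallel; only the final write is single-task
df.repartition(1).write.format("delta").mode("overwrite").save(target_path)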

Why coalesce()? How do you know that coalesce() is taking 10 mins? Run df.explain(extended=True); if you find the data is being read from an InMemoryTableScan, then the cached df is being used.
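For example (a small sketch; df is the cached DataFrame from the question):

# After df.cache() and an action such as df.count(), any query built on df
# should read from the cache rather than recompute from the source:
df.explain(extended=True)
# Look for "InMemoryTableScan" / "InMemoryRelation" in the physical plan;
# a plain FileScan of the source tables means the cache is not being reused.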