
We are in the process of upgrading our Databricks platform. A couple of weeks ago we set up Unity Catalog.

Now we are trying to go from Databricks Runtime 13.3 LTS to 15.4 LTS. Two of the 40+ notebooks we run took three times longer (30 min to 1.5 h). We tried DBR 16.0 with the same results.

We then rolled back to Databricks Runtime 14.3 LTS and things got faster.

We used to run everything on Standard_DS3_v2 and Standard_DS4_v2. We also switched to Standard_D8ads_v5 (as it supports disk caching, formerly delta caching). That also improved speed a little.

  • What could lead to such different performance between Databricks Runtimes?

I am wondering whether lazy evaluation works differently between versions. We often count on materializing a DataFrame by doing something like:

df.cache()        # mark df for caching (lazy -- nothing is computed yet)
rc = df.count()   # action that materializes the cache

The idea is that if the DataFrame is referenced multiple times later, its lineage executes only once.

  • I am wondering if that trick no longer works, and whether later steps are re-running everything again and again (see the check sketched below).
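
A quick way to see whether the DataFrame was ever marked for caching (a minimal sketch; df is the DataFrame cached above):

# Both flip as soon as cache()/persist() is called; to confirm the data
# was actually materialized, check the Storage tab in the Spark UI after
# the count() action.
print(df.is_cached)       # True once the DataFrame is marked for caching
print(df.storageLevel)    # the requested storage level (all-False flags mean "not cached")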

Today, this process wrote only 5700 rows to a Delta table. One of the things we do is call coalesce(1) before writing (the file is very small anyway).

  • How can that step take 10 minutes now (on DBR 14.3)? Is it re-running everything from scratch (ignoring the cached DataFrame before it)? (See the note on coalesce(1) below.)
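
For context, coalesce(1) does not just merge files at the end of the job: because it avoids a shuffle, Spark can push the single-partition constraint up the plan, so the preceding stages may also run on a single task. A hedged sketch of the usual alternative (final_df and output_path are placeholders):

# repartition(1) inserts a shuffle boundary: upstream stages keep their
# parallelism and only the final write runs as a single task, whereas
# coalesce(1) can collapse the upstream stages to one task as well.
final_df.repartition(1).write.format("delta").mode("append").save(output_path)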
  • Is Photon enabled? Can you describe at least a bit of what your notebooks do? What was the actual time taken before and after, in minutes/hours? What's the input size? Why use coalesce()? How do you know that coalesce() is taking 10 mins? Commented Dec 14, 2024 at 0:52
  • Can you check the Spark UI to see if there are any specific stages or tasks that are taking longer? Commented Dec 16, 2024 at 3:26
  • Photon is not enabled. The notebook reads silver data, joins it with dims, and inserts new rows into a fact table. Some dims are very large (e.g. dimCustomers). The incremental run for 5700 rows jumped from 30 min to 90 min. If I do not use coalesce(1), we get ~200 small files written per incremental run (instead of one 1 MB file). I have logging before and after the coalesce, and I can see the time. Commented Dec 16, 2024 at 19:23
  • Check the plan using df.explain(extended=True). If you find the data is being read from an InMemoryTableScan, then the cached df is used. Commented Dec 17, 2024 at 8:46

1 Answer


Enabling Photon can improve the performance of your queries, especially for operations like joins and writes.

Photon is a cluster-level setting: enable it with the "Use Photon Acceleration" option when you create the cluster, or with the following line in the cluster's Spark config (setting it from a notebook at runtime will not take effect):

spark.conf.set("spark.databricks.photon.enabled", "true")

As you mentioned, you are using df.cache() and coalesce(1). You can persist the DataFrame with an explicit storage level instead of caching it:

from pyspark import StorageLevel

transformed_df.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk when memory is tight
transformed_df.count()                                # action that materializes the persisted data
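
When the persisted DataFrame is no longer needed, releasing it frees executor memory for later stages (a small follow-up to the sketch above):

transformed_df.unpersist()  # drop the cached blocks once downstream steps are done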

Then use optimized file compaction: after writing the Delta table, compact its files using the OPTIMIZE command:

spark.sql(f"""
OPTIMIZE delta.`{delta_table_path}`
""")

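As an alternative to coalesce(1), Delta Lake's optimized writes and auto compaction can address the small-files problem without collapsing the write into a single task. A minimal sketch using the Databricks-specific session configs (these can also be set as table properties; final_df is a placeholder for the DataFrame being written):

# Optimized writes bin-pack data before the write; auto compaction
# merges small files afterwards -- no coalesce(1) needed.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

final_df.write.format("delta").mode("append").save(delta_table_path)
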
I agree with @JayashankarGS: you can use df.explain(extended=True) to debug slow queries, to understand the differences in execution plans between Databricks Runtime versions, and to verify that your transformations are optimized (for example, that filters are applied where you expect).
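
A minimal sketch of that check, using the cached df from the question:

# If the physical plan reads from InMemoryTableScan / InMemoryRelation,
# downstream steps are reusing the cache; if it shows the original file
# scan instead, the full lineage is re-running on every action.
df.explain(extended=True)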



2 Comments

What is the benefit of using persist() instead of cache()? (saving it on disk instead of memory)
persist() with a disk-backed storage level can help you avoid OutOfMemory errors: it makes sure the code does not fail because of memory limitations.
