
Environment: Microsoft Fabric, F512 capacity, 40 medium nodes (8 vCores each).

In a PySpark notebook I have a DataFrame containing 379 million rows, which I would like to make available in a Fabric Lakehouse as a Delta table so that it can be accessed through the SQL endpoint of the Lakehouse. I am not using any of the advanced Delta features, such as versioning or ACID transactions. The dataset will be recreated daily.

To achieve this I tried the following.

Create the table using save():

df_pv_tm_join.write.mode("overwrite").format("delta").save("Tables/df_pv_tm_join7")

This runs for 10 minutes and I end up with six Parquet files of about 200 MB each, which is ideal. The result is a Delta table in the "Tables" section of the Lakehouse. During the save I see two warnings about "Time Skew", which I don't know how to address; everything I can find online points to "Data Skew" instead.

For comparison, writing with parquet():

df_pv_tm_join.repartition(400).write.mode("overwrite").parquet("Files/output/df_pv_tm_join3")

If I repartition the dataset and write it out as plain Parquet, execution time drops significantly to only 2 min 30 s. The downside is that I end up with 400 small files and no table.

What are my options (if any) to save the table with max performance?

  • Can you partition the table using a business key or a date? Partitioning improves read and write performance because partition elimination lets a query touch only the relevant files; you want the process to read a few smaller files rather than scan many big ones (see the sketch below). Commented Aug 29, 2024 at 11:41
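A minimal sketch of that suggestion using the v1 writer, assuming a hypothetical date column named run_date exists in the DataFrame:

# Hypothetical sketch: partition the Delta table by a date column so the SQL
# endpoint can prune partitions; "run_date" is a placeholder for a real column.
(
    df_pv_tm_join
    .write
    .mode("overwrite")
    .format("delta")
    .partitionBy("run_date")   # assumed business/date key with moderate cardinality
    .save("Tables/df_pv_tm_join7")
)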

2 Answers


Consider liquid clustering:

The following are examples of scenarios that benefit from clustering:

  • Tables often filtered by high cardinality columns.
  • Tables with significant skew in data distribution.
  • Tables that grow quickly and require maintenance and tuning effort.
  • Tables with access patterns that change over time.
  • Tables where a typical partition column could leave the table with too many or too few partitions.

https://docs.delta.io/latest/delta-clustering.html
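A minimal sketch of what this could look like, assuming a runtime with Delta 3.2+ (where liquid clustering and CTAS with CLUSTER BY are available) and a hypothetical high-cardinality column cust_id:

# Hypothetical sketch: create a liquid-clustered Delta table via CTAS;
# "cust_id" is a placeholder for a column your queries usually filter on.
df_pv_tm_join.createOrReplaceTempView("pv_tm_join_src")

spark.sql("""
    CREATE OR REPLACE TABLE df_pv_tm_join7
    USING DELTA
    CLUSTER BY (cust_id)
    AS SELECT * FROM pv_tm_join_src
""")

# Clustering is applied incrementally when OPTIMIZE runs on the table.
spark.sql("OPTIMIZE df_pv_tm_join7")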

Also consider partitionedBy to partition on write: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.partitionedBy.html
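A minimal sketch of the DataFrameWriterV2 route, again with a hypothetical partition column run_date:

from pyspark.sql.functions import col

# Hypothetical sketch: recreate the table daily, partitioned on write;
# "run_date" is a placeholder, pick a low/medium-cardinality key to avoid small files.
(
    df_pv_tm_join
    .writeTo("df_pv_tm_join7")
    .using("delta")
    .partitionedBy(col("run_date"))
    .createOrReplace()
)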

Clustering is generally the better option here, as it is managed natively by Delta (applied via OPTIMIZE) and avoids the risk of choosing a partition column that produces too many or too few partitions.



You can take a look at the following:

  • Partition the data into smaller partitions, as you mentioned, before writing it as a Delta table, preferably based on a business key that does not skew the data.
  • Optimize the table afterwards for best performance with the OPTIMIZE (with V-Order) and VACUUM commands (a sketch follows the link below).

https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-table-maintenance
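A minimal sketch of that maintenance step, assuming Fabric's Spark runtime and the table name from the question; the session setting for V-Order has changed name across Fabric runtime versions, so verify it and the OPTIMIZE/VACUUM syntax against the linked page:

# Hypothetical sketch: enable V-Order for writes in this session, then compact
# small files and clean up unreferenced ones.
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

# Bin-compact the table and apply V-Order to the rewritten files.
spark.sql("OPTIMIZE df_pv_tm_join7 VORDER")

# Remove files no longer referenced by the Delta log (default retention is 7 days).
spark.sql("VACUUM df_pv_tm_join7 RETAIN 168 HOURS")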

