
Environment: Microsoft Fabric, F512 capacity, 40 medium nodes (8 vCores each).

In a PySpark notebook I have a DataFrame containing 379 million rows, which I would like to make available in a Fabric Lakehouse as a Delta table so that it can be accessed through the SQL endpoint of the Lakehouse. I am not using any of the advanced Delta features, such as versioning or ACID transactions. The dataset will be recreated daily.

To achieve this I tried the following.

Create the table using save():

df_pv_tm_join.write.mode("overwrite").format("delta").save("Tables/df_pv_tm_join7")

This runs for 10 minutes and I end up with six Parquet files of about 200 MB each, which is ideal. The result is a Delta table in the "Tables" section of the Lakehouse. During the save I see two warnings about "Time Skew", which I don't know how to address; everything I can find online points to "Data Skew" instead.

For comparison, writing with parquet():

df_pv_tm_join.repartition(400).write.mode("overwrite").parquet("Files/output/df_pv_tm_join3")

If I repartition the dataset and write it out as plain Parquet, execution time drops significantly to only 2 min 30 s. The downside is that I end up with 400 small files and no table.

What are my options (if any) to save the table with max performance?

  • Can you partition the table using a business key or a date? Partitioning improves read and write performance because partition elimination lets a query touch only the relevant files; you want the process to read a few smaller files rather than scan many big ones (see the sketch below). Commented Aug 29, 2024 at 11:41
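A minimal sketch of that suggestion using the v1 writer, assuming a hypothetical date column named run_date exists in the DataFrame:

# Hypothetical sketch: partition the Delta table by a date column so the SQL
# endpoint can prune partitions; "run_date" is a placeholder for a real column.
(
    df_pv_tm_join
    .write
    .mode("overwrite")
    .format("delta")
    .partitionBy("run_date")   # assumed business/date key with moderate cardinality
    .save("Tables/df_pv_tm_join7")
)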

2 Answers


Consider liquid clustering:

The following are examples of scenarios that benefit from clustering:

  • Tables often filtered by high cardinality columns.
  • Tables with significant skew in data distribution.
  • Tables that grow quickly and require maintenance and tuning effort.
  • Tables with access patterns that change over time.
  • Tables where a typical partition column could leave the table with too many or too few partitions.

https://docs.delta.io/latest/delta-clustering.html
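A minimal sketch of what this could look like, assuming a runtime with Delta 3.2+ (where liquid clustering and CTAS with CLUSTER BY are available) and a hypothetical high-cardinality column cust_id:

# Hypothetical sketch: create a liquid-clustered Delta table via CTAS;
# "cust_id" is a placeholder for a column your queries usually filter on.
df_pv_tm_join.createOrReplaceTempView("pv_tm_join_src")

spark.sql("""
    CREATE OR REPLACE TABLE df_pv_tm_join7
    USING DELTA
    CLUSTER BY (cust_id)
    AS SELECT * FROM pv_tm_join_src
""")

# Clustering is applied incrementally when OPTIMIZE runs on the table.
spark.sql("OPTIMIZE df_pv_tm_join7")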

Also consider partitionedBy to partition on write: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.partitionedBy.html
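A minimal sketch of the DataFrameWriterV2 route, again with a hypothetical partition column run_date:

from pyspark.sql.functions import col

# Hypothetical sketch: recreate the table daily, partitioned on write;
# "run_date" is a placeholder, pick a low/medium-cardinality key to avoid small files.
(
    df_pv_tm_join
    .writeTo("df_pv_tm_join7")
    .using("delta")
    .partitionedBy(col("run_date"))
    .createOrReplace()
)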

Clustering is generally the better option here, as it is managed natively by Delta (applied via OPTIMIZE) and avoids the risk of choosing a partition column that produces too many or too few partitions.



You can take a look at the following:

  • Partition the data into smaller partitions, as you mentioned, before writing it as a Delta table, preferably based on a business key that does not skew the data.
  • Optimize the table afterwards for best performance with the OPTIMIZE (with V-Order) and VACUUM commands (a sketch follows the link below).

https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-table-maintenance
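A minimal sketch of that maintenance step, assuming Fabric's Spark runtime and the table name from the question; the session setting for V-Order has changed name across Fabric runtime versions, so verify it and the OPTIMIZE/VACUUM syntax against the linked page:

# Hypothetical sketch: enable V-Order for writes in this session, then compact
# small files and clean up unreferenced ones.
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

# Bin-compact the table and apply V-Order to the rewritten files.
spark.sql("OPTIMIZE df_pv_tm_join7 VORDER")

# Remove files no longer referenced by the Delta log (default retention is 7 days).
spark.sql("VACUUM df_pv_tm_join7 RETAIN 168 HOURS")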

