Environment: Microsoft Fabric, F512 capacity, 40 medium nodes (8 vCores each).
In a PySpark notebook I have a DataFrame containing 379 million rows that I would like to make available in a Fabric Lakehouse as a Delta table so that it can be accessed through the SQL endpoint of the Lakehouse. I am not using any of the advanced Delta Lake features, such as versioning or ACID. The dataset will be recreated daily.
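For context, the only other way I know of to register a managed Delta table from a notebook is saveAsTable(); the sketch below is that alternative, not something I benchmarked. It assumes the target Lakehouse is attached as the notebook's default lakehouse, and the table name is the same placeholder I use further down.

# Sketch of the saveAsTable() alternative (assumes a default Lakehouse is attached)
(df_pv_tm_join
    .write
    .mode("overwrite")                 # the dataset is recreated daily, so a full overwrite
    .format("delta")
    .saveAsTable("df_pv_tm_join7"))    # placeholder name, same as in the save() call below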
To achieve this I tried the following.
Create the table using save():
df_pv_tm_join.write.mode("overwrite").format("delta").save("Tables/df_pv_tm_join7")
This runs for 10 minutes and I end up with six Parquet files, each about 200 MB in size, which is ideal. The result is a Delta table in the "Tables" section of the Lakehouse. During the save I see two warnings about "Time Skew" which I don't know how to address; everything I can find online points to "Data Skew" instead.
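In case it is relevant, this is a quick check I could run before the write to see whether the rows are unevenly distributed across partitions; it is only a diagnostic sketch, not part of the runs above.

from pyspark.sql import functions as F

# Row count per Spark partition, largest first; a few partitions holding most
# of the 379M rows would suggest data skew rather than time skew.
(df_pv_tm_join
    .withColumn("pid", F.spark_partition_id())
    .groupBy("pid")
    .count()
    .orderBy(F.desc("count"))
    .show(10))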
For comparison, using parquet():
df_pv_tm_join.repartition(400).write.mode("overwrite").parquet("Files/output/df_pv_tm_join3")
If I repartition the dataset and write it out as plain Parquet, execution time drops significantly to only 2 min 30 s. The downside is that I end up with 400 small files and no Delta table.
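For completeness, the combination I have in mind (repartitioning first, then writing Delta) would look like the sketch below; the partition count of 40 is only a guess at roughly one output file per node, not something I have measured.

# Sketch of the combined approach I am considering (untested as written);
# 40 is an assumed partition count, roughly one output file per node.
(df_pv_tm_join
    .repartition(40)
    .write
    .mode("overwrite")
    .format("delta")
    .save("Tables/df_pv_tm_join7"))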
What are my options (if any) for writing this Delta table with maximum performance?