How does Spark read unpartitioned Delta tables?

Question

I observe severe underutilization of CPU in my Databricks job run metrics, on average less than 50% - indicating that I do not parallelize enough tasks in the Spark workflow.

I am especially interested in improving the job's read parallelism. For context, I read multiple tables.

Is my knowledge correct that, concerning the read stage, Spark creates the same amount of tasks as the table-to-read has partitions? Moreover, what does this look like when my table is not partitioned?

I know that by default the configuration option spark.sql.shuffle.partitions = 200. However, the Spark docs detail that this property applies to wide transformations (i.e. joins & aggregates) happening in shuffle stages which come after the initial read stage if I am not mistaken.

Configures the number of partitions to use when shuffling data for joins or aggregations.

Since reading is neither a join nor an aggregation, I wonder if this default value (i.e. 200) will be the level of Spark parallelism (i.e. 200 tasks) to fill up the 6*16 = 96 cores in my cluster to read that unpartitioned data.

On the unpartitioned table, I have messed with spark.conf.set("spark.sql.files.maxPartitionBytes", "64MB") that controls the maximum number of bytes read per partition (default: 128 MB) to increase the number of Spark partitions, which increases the number of tasks and, therefore, parallelism.

However, I have not seen a significant result in performance.

When reading a partitioned table, Spark creates one task per partition. If a table has 100 partitions, Spark will create 100 tasks for reading. — Dileep Raj Narayan Thumula
– Dileep Raj Narayan Thumula, Commented Feb 24 at 15:11
@DileepRajNarayanThumula, thank you for the comment; this confirms my first question. How about unpartitioned tables? — Louis
– Louis, Commented Feb 24 at 15:21

Louis · Accepted Answer · 2025-03-15 17:26:57Z

1

Whether a (Delta) table is partitioned or not or reading a pruned partition, there is a common approach.

spark.sql.shuffle.partitions has no role in reading from data at rest that is relevant for shuffling due JOIN etc.
Does not matter if pyspark or Spark with Scala API.
spark.default.parallelism is only relevant for RDD's so leaving out of the equation.
spark.sql.files.maxPartitionBytes is relevant and also if the source is a parquet file one could argue.

Imagine a Delta table with 5 files: 1 very large file (X size) and 4 smaller files (X/5 size). The App has 2 vCores available, and spark.sql.files.maxPartitionBytes = X/2.

Then:

The number of Tasks/parallelization will be:
- 4 for the smaller files that are of size X/5 that fit into X/2 sizing
- 2 for the large file of size X, that is split on X/2
6 Tasks can run concurrently to read the file.

edited Mar 15 at 17:26

Louis

251 silver badge8 bronze badges

answered Feb 24 at 19:21

Ged

18.5k8 gold badges53 silver badges108 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Louis Feb 25 at 6:48

Thank you for the answer. I suppose that the example you gave is for partitioned tables, where different partitions may vary greatly in size. To my understanding, the file sizes of an unpartitioned table are more or less uniform, and tweaking spark.sql.files.maxPartitionBytes will control the read parallelism in that case.

Collectives™ on Stack Overflow

How does Spark read unpartitioned Delta tables?

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related