I observe severe CPU underutilization in my Databricks job's run metrics, on average less than 50%, which suggests that my Spark workflow does not run enough tasks in parallel.
I am especially interested in improving the job's read parallelism. For context, I read multiple tables.
Is my understanding correct that, for the read stage, Spark creates as many tasks as the table being read has partitions? And what does this look like when the table is not partitioned?
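To make the question concrete, this is how I have been checking the scan parallelism (a minimal sketch; the table name is hypothetical and `spark` is the session a Databricks notebook already provides):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in a Databricks notebook

# The partition count of the DataFrame right after the read is the number of
# tasks Spark launches for the scan stage of a plain read.
df = spark.read.table("my_table")  # hypothetical table name
print(df.rdd.getNumPartitions())
```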
I know that the configuration option spark.sql.shuffle.partitions defaults to 200. However, if I am not mistaken, this property only applies to wide transformations (e.g. joins and aggregations) in the shuffle stages that come after the initial read stage. The Spark docs describe it as:

"Configures the number of partitions to use when shuffling data for joins or aggregations."
Since reading is neither a join nor an aggregation, I wonder whether this default of 200 is also the level of parallelism (i.e. 200 tasks) Spark uses to fill up the 6 * 16 = 96 cores in my cluster when reading that unpartitioned data.
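To illustrate the distinction I am asking about, here is a small sanity check I ran (table and column names are hypothetical, and AQE is left at its Databricks default):

```python
# The shuffle setting versus the cluster's nominal parallelism.
print(spark.conf.get("spark.sql.shuffle.partitions"))  # "200" unless overridden
print(spark.sparkContext.defaultParallelism)           # roughly the total core count, e.g. 96

df = spark.read.table("my_table")                      # hypothetical table name
print(df.rdd.getNumPartitions())                       # scan tasks: driven by file splits, not by the 200

agg = df.groupBy("some_column").count()                # a wide transformation
print(agg.rdd.getNumPartitions())                      # post-shuffle tasks: up to 200, possibly coalesced by AQE
```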
On the unpartitioned table, I have experimented with spark.conf.set("spark.sql.files.maxPartitionBytes", "64MB"), which controls the maximum number of bytes packed into a single partition when reading files (default: 128 MB), hoping to increase the number of Spark partitions and therefore the number of tasks and the parallelism.
However, I have not seen a significant improvement in performance.
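For reference, the experiment looked roughly like this (sketched with a hypothetical table name):

```python
# Read with the default split size, then again with a smaller one, and compare
# how many scan partitions (and therefore scan tasks) each read produces.
spark.conf.set("spark.sql.files.maxPartitionBytes", "128MB")  # default
print(spark.read.table("my_unpartitioned_table").rdd.getNumPartitions())

spark.conf.set("spark.sql.files.maxPartitionBytes", "64MB")   # halve the target split size
print(spark.read.table("my_unpartitioned_table").rdd.getNumPartitions())
```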