I have a PySpark job that ingests data into a Delta table that was originally partitioned by year, month, day, and hour. The job runs daily, ingesting the previous day's full data, and takes about 2 hours to complete. Recently I switched the partitioning to a single field, event_date (a date type). After this change, the job started failing with out-of-memory (OOM) errors.
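For reference, here is a simplified sketch of the write in both layouts (the paths, the `event_ts` column, and `source_df` are placeholders for illustration, not my exact job code):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Previous day's raw data (placeholder path)
source_df = spark.read.parquet("s3://bucket/raw/events/")

# Old layout: four partition columns, one Delta partition per hour
(source_df
    .withColumn("year", F.year("event_ts"))
    .withColumn("month", F.month("event_ts"))
    .withColumn("day", F.dayofmonth("event_ts"))
    .withColumn("hour", F.hour("event_ts"))
    .write.format("delta")
    .mode("append")
    .partitionBy("year", "month", "day", "hour")
    .save("s3://bucket/delta/events"))

# New layout: a single date partition column, one Delta partition per day
(source_df
    .withColumn("event_date", F.to_date("event_ts"))
    .write.format("delta")
    .mode("append")
    .partitionBy("event_date")
    .save("s3://bucket/delta/events"))
```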
My understanding is that the number of Spark tasks is closely tied to the number of partitions. By switching to event_date, does this mean that all of a day's data is now being written by a single task, whereas previously it was distributed across 24 hourly partitions per day, i.e. 24 tasks?