
I have a PySpark job that ingests data into a Delta table that was originally partitioned by year, month, day, and hour. The job takes about 2 hours to complete and runs daily, ingesting the previous day's full data. Recently I switched the partitioning to a single field, event_date (a date type). After this change, the job started failing with out-of-memory (OOM) errors.
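For context, the write is roughly shaped like the sketch below; the timestamp column name, paths, and session setup are illustrative, not the actual job:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative source read; the real job's source and column names may differ.
df = spark.read.parquet("s3://bucket/raw/events/")

# Old layout: four partition columns, so each day lands in 24 hourly partitions.
(df.withColumn("year", F.year("event_ts"))
   .withColumn("month", F.month("event_ts"))
   .withColumn("day", F.dayofmonth("event_ts"))
   .withColumn("hour", F.hour("event_ts"))
   .write.format("delta")
   .mode("append")
   .partitionBy("year", "month", "day", "hour")
   .save("s3://bucket/delta/events/"))

# New layout: a single event_date partition column, so a daily run writes
# everything into one partition directory of the Delta table.
(df.withColumn("event_date", F.to_date("event_ts"))
   .write.format("delta")
   .mode("append")
   .partitionBy("event_date")
   .save("s3://bucket/delta/events/"))
```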

My understanding is that the number of Spark tasks is closely tied to the number of partitions. By switching to event_date, does this mean that all of a day's data is now being written by a single task, whereas previously the data was distributed across 24 hourly partitions per day, i.e. 24 tasks?
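One way to check that premise is sketched below; it reuses the illustrative names from the sketch above and the explicit repartition call is hypothetical:

```python
from pyspark.sql import functions as F

# df is the same illustrative DataFrame as in the sketch above.
# Partition count of the data as read; each partition becomes one write task
# unless something reshuffles the data first.
print(df.rdd.getNumPartitions())

# If the job explicitly repartitions by the Delta partition column, a run that
# contains only one event_date puts every row into a single non-empty
# partition, i.e. one task ends up writing the whole day.
daily = (df.withColumn("event_date", F.to_date("event_ts"))
           .repartition("event_date"))
print(daily.rdd.getNumPartitions())  # typically spark.sql.shuffle.partitions, but only one holds rows
```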

Sort of. It really depends on what kind of operations you have. Spark can simply read and distribute data among its tasks if it is just forwarding the data, so it would be helpful if you mentioned what you are trying to do in Spark. Having said that, if event_date is the partition column of the Delta table, writing to the table definitely becomes a bottleneck because tasks cannot write independently to the destination table. The best way to pin down the issue is to check the Spark UI for tasks and partitions; it will give you a clear picture. Please add the code and Spark UI screenshots, which would help answer this better. Commented Sep 16 at 10:06
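If the Spark UI does show one long-running write task per daily run, one common workaround is to keep the event_date table layout but spread the day's rows across many tasks before the write. A minimal sketch, reusing the illustrative names from above; the partition count is an arbitrary assumption:

```python
from pyspark.sql import functions as F

# Spread a single day's rows across many tasks while keeping the event_date
# layout in the Delta table; 200 is an arbitrary illustrative partition count,
# and df / event_ts follow the earlier sketch.
out = (df.withColumn("event_date", F.to_date("event_ts"))
         .repartition(200))

(out.write
    .format("delta")
    .mode("append")
    .partitionBy("event_date")
    .save("s3://bucket/delta/events/"))
```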
