I’m trying to load JSON data into an Iceberg table. The source files are named with timestamps that contain colons (:), so I have to read them as plain text first. In addition, each file contains pretty-printed (multi-line) JSON, so I read with the wholeText option set to true, which produces one record per file.
This approach works fine for most files. However, I’ve encountered a single large file (~3.5 GB) that consistently causes Spark to fail with an OutOfMemoryError.
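For reference, this is roughly what the read path looks like (a minimal sketch; the paths, app name, and Iceberg catalog/table names are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object LoadPrettyJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pretty-json-to-iceberg") // illustrative name
      .getOrCreate()
    import spark.implicits._

    // Read each file as a single row ("wholetext"), so pretty-printed JSON
    // that spans many lines stays together. This also means the 3.5 GB file
    // ends up as one 3.5 GB string held by a single task.
    val raw = spark.read
      .option("wholetext", "true")
      .text("/path/to/landing/*.json") // hypothetical input path

    // Parse the whole-file strings into structured rows.
    val parsed = spark.read.json(raw.as[String])

    // Append into the Iceberg table (placeholder identifier).
    parsed.writeTo("iceberg_catalog.db.events").append()
  }
}
```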
Here are the configuration details I’ve tried so far:
Executors: 4
Executor cores: 4
Executor memory: 32 GB
Driver memory: 64 GB
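These map to the standard Spark properties below (shown as a SparkSession sketch purely for illustration; in a real cluster deployment the driver and executor memory settings normally have to be supplied at submit time, before the JVMs start):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative mapping of the settings above to Spark property names.
// Note: spark.driver.memory (and, on most cluster managers, the executor
// settings) must be provided before the application JVMs launch, e.g. via
// spark-submit --conf, not inside an already-running session.
val spark = SparkSession.builder()
  .appName("pretty-json-to-iceberg")
  .config("spark.executor.instances", "4")
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "32g")
  .config("spark.driver.memory", "64g")
  .getOrCreate()
```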
Despite these settings, the job still fails due to insufficient memory during processing.
Has anyone faced a similar issue or found a way to efficiently handle large, pretty-printed JSON files with wholeText = true in Spark?
Any suggestions for optimizing memory usage or alternative approaches would be greatly appreciated.