I’m trying to load JSON data into an Iceberg table. The source files are named with timestamps that contain colons (:), so I have to read them as plain text first. In addition, each file contains pretty-printed (multi-line) JSON, so I read with the wholeText option set to true, which produces one record per file.
This approach works fine for most files. However, I’ve encountered a single large file (~3.5 GB) that consistently causes Spark to fail with an OutOfMemoryError.
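For reference, this is roughly what the read path looks like (a minimal sketch; the paths, app name, and Iceberg catalog/table names are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object LoadPrettyJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pretty-json-to-iceberg") // illustrative name
      .getOrCreate()
    import spark.implicits._

    // Read each file as a single row ("wholetext"), so pretty-printed JSON
    // that spans many lines stays together. This also means the 3.5 GB file
    // ends up as one 3.5 GB string held by a single task.
    val raw = spark.read
      .option("wholetext", "true")
      .text("/path/to/landing/*.json") // hypothetical input path

    // Parse the whole-file strings into structured rows.
    val parsed = spark.read.json(raw.as[String])

    // Append into the Iceberg table (placeholder identifier).
    parsed.writeTo("iceberg_catalog.db.events").append()
  }
}
```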
Here are the configuration details I’ve tried so far:
Executors: 4
Executor cores: 4
Executor memory: 32 GB
Driver memory: 64 GB
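These map to the standard Spark properties below (shown as a SparkSession sketch purely for illustration; in a real cluster deployment the driver and executor memory settings normally have to be supplied at submit time, before the JVMs start):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative mapping of the settings above to Spark property names.
// Note: spark.driver.memory (and, on most cluster managers, the executor
// settings) must be provided before the application JVMs launch, e.g. via
// spark-submit --conf, not inside an already-running session.
val spark = SparkSession.builder()
  .appName("pretty-json-to-iceberg")
  .config("spark.executor.instances", "4")
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "32g")
  .config("spark.driver.memory", "64g")
  .getOrCreate()
```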
Despite these settings, the job still fails due to insufficient memory during processing.
Has anyone faced a similar issue or found a way to efficiently handle large, pretty-printed JSON files with wholeText = true in Spark?
Any suggestions for optimizing memory usage or alternative approaches would be greatly appreciated.