I’m trying to load JSON data into an Iceberg table. The source files are named with timestamps that include colons (:), so I need to read them as plain text first. Each file is also pretty-printed JSON, so I have to read it with the wholeText option set to true.

This approach works fine for most files. However, I’ve encountered a single large file (~3.5 GB) that consistently causes Spark to fail with an OutOfMemoryError.

Here are the configuration details I’ve tried so far:

  • Executors: 4
  • Executor cores: 4
  • Executor memory: 32 GB
  • Driver memory: 64 GB

Despite these settings, the job still fails due to insufficient memory during processing.

Has anyone faced a similar issue or found a way to efficiently handle large, pretty-printed JSON files with wholeText = true in Spark?

Any suggestions for optimizing memory usage or alternative approaches would be greatly appreciated.

  • Maybe use Polars or DuckDB instead? Commented Oct 25 at 20:42
  • You could possibly read it in pandas and then convert the pandas df to a Spark df. Commented Oct 27 at 3:54
  • No, I have to use the Spark SQL approach. DataFrames or other libraries are out of scope. But thanks anyway. Commented Oct 27 at 9:15
  • Technically you can undo the pretty printing by reading the file line by line (as strings), trimming each line (discarding any whitespace at the start or the end, though there shouldn't be any at the end), and writing the result back either continuously or with line breaks. But any decent JSON parser should be rather good at ignoring extra whitespace, so this is unlikely to help. The part "The source files are named with timestamps that include colons (:), so I need to read them as plain text first" is not clear to me: what does the format of a filename have to do with reading the contents of the file? Commented Oct 28 at 9:50
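
The line-by-line trimming described in the last comment could be sketched roughly as below. This is only an illustration in plain Python, not something taken from the thread: the function name `compact_pretty_json` is hypothetical, and it assumes the input is valid pretty-printed JSON (so no string value contains a raw line break). It streams the input and never holds more than one line in memory, but as the comment notes, the output is still a single large JSON document, so by itself it may not cure the OutOfMemoryError.

```python
import io
import json


def compact_pretty_json(src, dst):
    """Copy text from src to dst with the pretty printing removed.

    src and dst are text-mode file-like objects. Each line is stripped
    of leading and trailing whitespace and written back without line
    breaks, so only one line is held in memory at a time.
    Assumption: the input is valid JSON, so no string literal spans
    several lines and stripping cannot change any value.
    """
    for line in src:
        stripped = line.strip()
        if stripped:  # skip lines that were empty or whitespace only
            dst.write(stripped)
    dst.write("\n")  # terminate the single output line


# Small in-memory demonstration; real files would be opened with open().
pretty = io.StringIO('{\n  "id": 1,\n  "tags": [\n    "a",\n    "b"\n  ]\n}\n')
compact = io.StringIO()
compact_pretty_json(pretty, compact)
parsed = json.loads(compact.getvalue())  # still parses as the same document
```

For real files, the same function could be called with two handles from `open(..., encoding="utf-8")`, one for reading and one for writing.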
