
We have source files in JSON format with roughly 500 rows each, but a file expands to about 750 million records once the JSON is fully flattened.

My Databricks notebook reads the source file into a single partition no matter what I do. For example, I set the shuffle partitions, disabled AQE, and set the partition-related options, but the file still gets loaded into a single partition.

df = spark.read.load(Filename, format='json', multiline=True, encoding='UTF-8', schema=schema)
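
For reference, this is roughly what I mean by those settings (the exact values here are just examples, not the ones that matter):

# Settings I had already tried before finding the fix below
spark.conf.set("spark.sql.shuffle.partitions", "200")                          # shuffle partition count
spark.conf.set("spark.sql.adaptive.enabled", "false")                          # disable AQE
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))    # max bytes per file split

None of these changed the number of partitions the file was read into.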

We use a parse_json function that flattens this JSON into 750 million records. Because the DataFrame has only one partition, the flattening runs as a single task, which takes a very long time and also causes OOM errors.
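
parse_json is our own helper; a minimal sketch of the kind of flattening it does (field names taken from the schema shown below, exact logic simplified) would be:

from pyspark.sql.functions import col, explode

flat = (
    df.select(explode("abc").alias("abc"), "bce")
      .select(col("abc.a").alias("a"),
              explode("abc.b").alias("b"),
              col("abc.c").alias("c_arr"),
              "bce")
      .select("a",
              col("b.ba").alias("ba"), col("b.bb").alias("bb"),
              explode("c_arr").alias("c"),
              "bce")
      .select("a", "ba", "bb", col("c.ca").alias("ca"), col("c.cb").alias("cb"), "bce")
)

Exploding the nested b and c arrays produces a cross product per element of abc, which is how 500 rows blow up to 750 million records.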

JSON schema (sample):

{ "abc" : [ { "a" : "", "b": [ { "ba":"", "bb":"" } ], "c" :[ { "ca":"", "cb":"" } ] } ], "bce" : "" }

1 Answer

After reviewing this and doing a lot of research, I found that the only way to solve it is to repartition the dataset immediately after reading the file.

df = spark.read.load(Filename, format='json', multiline=True, encoding='UTF-8', schema=schema).repartition(num_partitions)  # num_partitions = desired number of partitions

This solved the problem: the data is now processed much faster, and the write to Parquet is much faster as well.

Update: if you read a single file, or the source folder contains only one file, you get one partition by default. With multiline=True the JSON file is not splittable, so Spark cannot parallelize the read and the whole file lands in a single partition.
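
You can confirm the partition count before and after the repartition like this (200 is just an example value; pick what fits your cluster and data size):

print(df.rdd.getNumPartitions())                     # 1 for a single multiline JSON file
print(df.repartition(200).rdd.getNumPartitions())    # 200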
