
We have source files in JSON format with roughly 500 rows each, but a file expands to about 750 million records once the JSON is fully flattened.

My Databricks notebook reads the source file into a single partition no matter what I do. For example, I set the shuffle partitions, disabled AQE, and set the partition-related options, but the file still gets loaded into a single partition.

df = spark.read.load(Filename, format='json', multiline=True, encoding='UTF-8', schema=schema)
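
For reference, this is roughly what I mean by those settings (the exact values here are just examples, not the ones that matter):

# Settings I had already tried before finding the fix below
spark.conf.set("spark.sql.shuffle.partitions", "200")                          # shuffle partition count
spark.conf.set("spark.sql.adaptive.enabled", "false")                          # disable AQE
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))    # max bytes per file split

None of these changed the number of partitions the file was read into.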

We use a parse_json function that flattens this JSON into 750 million records. Because the DataFrame has only one partition, the flattening runs as a single task, which takes a very long time and also causes OOM errors.
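
parse_json is our own helper; a minimal sketch of the kind of flattening it does (field names taken from the schema shown below, exact logic simplified) would be:

from pyspark.sql.functions import col, explode

flat = (
    df.select(explode("abc").alias("abc"), "bce")
      .select(col("abc.a").alias("a"),
              explode("abc.b").alias("b"),
              col("abc.c").alias("c_arr"),
              "bce")
      .select("a",
              col("b.ba").alias("ba"), col("b.bb").alias("bb"),
              explode("c_arr").alias("c"),
              "bce")
      .select("a", "ba", "bb", col("c.ca").alias("ca"), col("c.cb").alias("cb"), "bce")
)

Exploding the nested b and c arrays produces a cross product per element of abc, which is how 500 rows blow up to 750 million records.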

JSON schema (sample):

{ "abc" : [ { "a" : "", "b": [ { "ba":"", "bb":"" } ], "c" :[ { "ca":"", "cb":"" } ] } ], "bce" : "" }

1 Answer

After reviewing this and doing a lot of research, I found that the only way to solve it is to repartition the dataset immediately after reading the file.

df = spark.read.load(Filename, format='json', multiline=True, encoding='UTF-8', schema=schema).repartition(num_partitions)  # num_partitions = desired number of partitions

This solved the problem: the data is now processed much faster, and the write to Parquet is much faster as well.

Update: if you read a single file, or the source folder contains only one file, you get one partition by default. With multiline=True the JSON file is not splittable, so Spark cannot parallelize the read and the whole file lands in a single partition.
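
You can confirm the partition count before and after the repartition like this (200 is just an example value; pick what fits your cluster and data size):

print(df.rdd.getNumPartitions())                     # 1 for a single multiline JSON file
print(df.repartition(200).rdd.getNumPartitions())    # 200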
