We have source files in JSON format with roughly 500 rows each, but a file expands to about 750 million records once the JSON is fully flattened.
My Databricks notebook reads the source file into a single partition no matter what I do: I set the shuffle partitions, disabled AQE, and set the partition count, but the file still gets loaded into a single partition.
df = spark.read.load(Filename, format='json', multiLine=True, encoding='UTF-8', schema=schema)
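For reference, these are roughly the settings I changed before the read, and how I check the partition count afterwards (standard Spark confs; the exact values here are illustrative, not a working fix):

spark.conf.set("spark.sql.shuffle.partitions", "200")             # shuffle partitions
spark.conf.set("spark.sql.adaptive.enabled", "false")             # disable AQE
spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728")  # input split size (128 MB)

print(df.rdd.getNumPartitions())  # always prints 1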
We use a parse_json function which flattens these JSON files into the 750 million records, and because the DataFrame sits in one partition, the flattening runs as a single task, taking a very long time and also causing OOM errors. (A sketch of what the flattening effectively does follows the schema below.)
JSON schema:

{
  "abc": [
    {
      "a": "",
      "b": [ { "ba": "", "bb": "" } ],
      "c": [ { "ca": "", "cb": "" } ]
    }
  ],
  "bce": ""
}