I've got a multiline CSV file which is about 150GB, and I've been trying to load it using the usual code, e.g.
df = spark.read.format('csv').option('header', True).option('multiLine', True).load('path/to/csv/file')
This works fine for small files, but for a large file like this one it takes several hours. From what I understand, this is because Spark uses only a single executor/task to read and parse it: it can't split the file (and thus partition the work out to other nodes) since it has to 'find' all the lines, embedded newlines, quotes etc. correctly...
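For what it's worth, a quick way to see this is to check the partition count straight after the read (same read as above, just checking how Spark split it):

df = spark.read.format('csv').option('header', True).option('multiLine', True).load('path/to/csv/file')
# With multiLine=True the CSV source treats each file as non-splittable,
# so this typically prints 1 for a single input file:
print(df.rdd.getNumPartitions())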
I figured someone out there must have already hit this same problem; however (surprisingly), I've not been able to find anything where someone has had the same issue. Is nobody out there loading huge multiline CSV files in the PySpark world???
I've tried hacking together some bits of code, which has been very hit and miss, like loading the whole file in one go and trying to parse it with custom code, but with not much luck...
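For reference, one shape this kind of attempt can take (not exactly what I tried, just to show the idea, and why it falls over at this size: the whole 150GB still ends up in a single task):

import csv, io

# Read each file as one giant string (wholetext=True), then split the
# records manually with the csv module, which handles quoted newlines.
raw = spark.read.text('path/to/csv/file', wholetext=True)

def parse_whole_file(row):
    # csv.reader copes with embedded newlines/quotes, but only inside one task
    return list(csv.reader(io.StringIO(row.value)))

records = raw.rdd.flatMap(parse_whole_file)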
So, I'm throwing it out there now in desperation to ask if anyone out there has had to deal with the same problem, and if so, what did you do? Ideally a nice little PySpark function would be nice ;-)
EDIT
Sorry, forgot to mention (after reading some of the first comments): this is part of a wider ETL processing system running on Azure, using Azure Synapse Spark and ADLS Gen2, and it processes thousands of similar CSV files, some smaller ones but quite a few large ones like this example...
COMMENTS

multiLine set to True limits the parallelization of your Spark job.

If you can only use Spark, and you're using the Databricks runtime environment, did you try spark.databricks.sql.csv.edgeParserSplittable=true? community.databricks.com/t5/data-engineering/…

Do you use inferschema=true when reading the file, or do you provide your own schema? stackoverflow.com/questions/56927329/…
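A minimal sketch of what those two suggestions look like in code, assuming the Databricks-specific flag is available (it may be a no-op on Azure Synapse Spark) and using placeholder column names for the schema:

from pyspark.sql.types import StructType, StructField, StringType

# Databricks-only setting from the comment above; may have no effect on Azure Synapse Spark.
spark.conf.set('spark.databricks.sql.csv.edgeParserSplittable', 'true')

# Supplying an explicit schema avoids the extra pass over the file that inferSchema needs.
# Column names here are placeholders.
schema = StructType([
    StructField('id', StringType(), True),
    StructField('payload', StringType(), True),
])

df = spark.read.format('csv').option('header', True).option('multiLine', True).schema(schema).load('path/to/csv/file')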