
I've got a multiline CSV file which is about 150 GB, and I've been trying to load it using the usual code, e.g.

df = spark.read.format('csv').option('header', True).option('multiLine', True).load('path/to/csv/file')

This works fine for small files, but for a large file like this one it takes several hours. From what I understand, that's because Spark ends up using a single executor to read and parse the whole thing: with multiLine=True it can't split the file (and so can't partition the work out to other nodes), since it has to track newlines, quotes etc. correctly across records to find where each row actually ends...
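To illustrate what I mean (placeholder path, not my real file), comparing the partition counts with and without the option on a smaller test copy shows the effect:

# Placeholder path, not my real data. With multiLine=True a single file
# typically lands in one input partition (Spark can't split it); without
# the option the same file is split into many partitions (~128 MB each).
df_multi = (spark.read.format('csv')
            .option('header', True)
            .option('multiLine', True)
            .load('path/to/csv/file'))
df_plain = (spark.read.format('csv')
            .option('header', True)
            .load('path/to/csv/file'))
print(df_multi.rdd.getNumPartitions())  # typically 1 for a single file
print(df_plain.rdd.getNumPartitions())  # typically many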

I figured someone out there must have already hit this problem, but (surprisingly) I've not been able to find anything where someone is having the same issue. Is nobody loading huge multiline CSV files out there in PySpark world???

I've tried hacking together some bits of code, like loading the whole file in at once and parsing it with custom code, but it's been very hit and miss and without much luck...
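To give an idea of the kind of thing I mean, here's a sketch of the 'load it all and parse it myself' approach (not my actual code, placeholder path), which also shows why it struggles:

import csv
import io

# Sketch of the DIY approach: wholeTextFiles yields (path, entire file contents)
# pairs, and Python's csv module then deals with the quoted embedded newlines.
# The catch is that the whole 150 GB file has to sit on one executor as a
# single string, which is exactly why this has been so hit and miss.
rdd = spark.sparkContext.wholeTextFiles('path/to/csv/file')
rows = rdd.flatMap(lambda kv: csv.reader(io.StringIO(kv[1])))  # RDD of lists of fields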

So, I'm throwing it out there now in desperation to ask if anyone out there has had to deal with the same problem, and if so, what did you do? Ideally a nice little PySpark function would be nice ;-)

EDIT

Sorry, I forgot to mention (after reading some of the first comments): this is part of a wider ETL processing system running on Azure, using Azure Synapse Spark and ADLS Gen2, and it processes thousands of similar CSV files, some smaller ones but quite a few large ones like this example...

  • Python is not the best tool for that. Have you considered writing your own C++ or OCaml program to do it? Perhaps transforming that huge CSV into a SQLite database? And what operating system and computer (RAM size, CPU, number of cores) do you use? Commented Oct 30 at 10:42
  • I think setting multiLine to True limits the parallelization of your Spark job. If you can only use Spark and you're on the Databricks runtime environment, did you try spark.databricks.sql.csv.edgeParserSplittable=true (a sketch of setting it appears after these comments)? community.databricks.com/t5/data-engineering/… Commented Oct 30 at 10:52
  • I would suggest using a tool like Polars (Rust-based) or DuckDB to convert it to Parquet files; then Spark can do parallel processing. But if you are already using Polars, why bother with Spark? Commented Oct 30 at 14:45
  • Do you set inferSchema=true when reading the file, or provide your own schema? stackoverflow.com/questions/56927329/… Commented Oct 30 at 20:36
  • No, sorry, the above code was just a short example to illustrate using the multiLine option. Using inferSchema=true does slow it down a little, but even without it, it's still hours to read... Commented Oct 31 at 20:40
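For reference, the two suggestions above would look roughly like this (sketch only: per the linked thread the edgeParserSplittable flag is a Databricks runtime setting, so it may not exist on Synapse Spark, and the schema columns below are made up, not my real ones):

from pyspark.sql.types import StructType, StructField, StringType

# Databricks-only flag mentioned in the comments; may not be available on
# Azure Synapse Spark.
spark.conf.set('spark.databricks.sql.csv.edgeParserSplittable', 'true')

# Illustrative schema (my real files have different columns); supplying it
# avoids the extra pass over the data that inferSchema=true would trigger.
schema = StructType([
    StructField('id', StringType(), True),
    StructField('payload', StringType(), True),
])

df = (spark.read.format('csv')
      .option('header', True)
      .option('multiLine', True)
      .schema(schema)
      .load('path/to/csv/file'))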
