
I've got a multiline CSV file which is about 150 GB, and I've been trying to load it using the usual code, e.g.

df = spark.read.format('csv').option('header', True).option('multiLine', True).load('path/to/csv/file')

This works fine for small files, but for a large file like this one it takes several hours. From what I understand, that's because Spark ends up using a single executor to read and parse the whole thing: with multiLine=True it can't split the file (and so can't partition the work out to other nodes), since it has to track newlines, quotes etc. correctly across records to find where each row actually ends...
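To illustrate what I mean (placeholder path, not my real file), comparing the partition counts with and without the option on a smaller test copy shows the effect:

# Placeholder path, not my real data. With multiLine=True a single file
# typically lands in one input partition (Spark can't split it); without
# the option the same file is split into many partitions (~128 MB each).
df_multi = (spark.read.format('csv')
            .option('header', True)
            .option('multiLine', True)
            .load('path/to/csv/file'))
df_plain = (spark.read.format('csv')
            .option('header', True)
            .load('path/to/csv/file'))
print(df_multi.rdd.getNumPartitions())  # typically 1 for a single file
print(df_plain.rdd.getNumPartitions())  # typically many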

I figured someone out there must have already hit this problem, but (surprisingly) I've not been able to find anything where someone is having the same issue. Is nobody loading huge multiline CSV files out there in PySpark world???

I've tried hacking together some bits of code, like loading the whole file in at once and parsing it with custom code, but it's been very hit and miss and without much luck...
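To give an idea of the kind of thing I mean, here's a sketch of the 'load it all and parse it myself' approach (not my actual code, placeholder path), which also shows why it struggles:

import csv
import io

# Sketch of the DIY approach: wholeTextFiles yields (path, entire file contents)
# pairs, and Python's csv module then deals with the quoted embedded newlines.
# The catch is that the whole 150 GB file has to sit on one executor as a
# single string, which is exactly why this has been so hit and miss.
rdd = spark.sparkContext.wholeTextFiles('path/to/csv/file')
rows = rdd.flatMap(lambda kv: csv.reader(io.StringIO(kv[1])))  # RDD of lists of fields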

So, I'm throwing it out there now in desperation to ask if anyone out there has had to deal with the same problem, and if so, what did you do? Ideally a nice little PySpark function would be nice ;-)

EDIT

Sorry, I forgot to mention (after reading some of the first comments): this is part of a wider ETL processing system running on Azure, using Azure Synapse Spark and ADLS Gen2, and it processes thousands of similar CSV files, some smaller ones but quite a few large ones like this example...

  • Python is not the best tool for that. Have you considered writing your own C++ or OCaml program to do it? Perhaps transforming that huge CSV into a SQLite database? And what operating system and computer (RAM size, CPU, number of cores) do you use? Commented Oct 30 at 10:42
  • I think setting multiLine to True limits the parallelization of your Spark job. If you can only use Spark and you're on the Databricks runtime environment, did you try spark.databricks.sql.csv.edgeParserSplittable=true (a sketch of setting it appears after these comments)? community.databricks.com/t5/data-engineering/… Commented Oct 30 at 10:52
  • I would suggest using a tool like Polars (Rust-based) or DuckDB to convert it to Parquet files; then Spark can do parallel processing. But if you are already using Polars, why bother with Spark? Commented Oct 30 at 14:45
  • Do you set inferSchema=true when reading the file, or provide your own schema? stackoverflow.com/questions/56927329/… Commented Oct 30 at 20:36
  • No, sorry, the above code was just a short example to illustrate using the multiLine option. Using inferSchema=true does slow it down a little, but even without it, it's still hours to read... Commented Oct 31 at 20:40
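For reference, the two suggestions above would look roughly like this (sketch only: per the linked thread the edgeParserSplittable flag is a Databricks runtime setting, so it may not exist on Synapse Spark, and the schema columns below are made up, not my real ones):

from pyspark.sql.types import StructType, StructField, StringType

# Databricks-only flag mentioned in the comments; may not be available on
# Azure Synapse Spark.
spark.conf.set('spark.databricks.sql.csv.edgeParserSplittable', 'true')

# Illustrative schema (my real files have different columns); supplying it
# avoids the extra pass over the data that inferSchema=true would trigger.
schema = StructType([
    StructField('id', StringType(), True),
    StructField('payload', StringType(), True),
])

df = (spark.read.format('csv')
      .option('header', True)
      .option('multiLine', True)
      .schema(schema)
      .load('path/to/csv/file'))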
