Python: dask vs pandas, error in read_csv

Question

I get an error when reading a file with dask, although the same file reads fine with pandas:

import dask.dataframe as dd
import pandas as pd
pdf = pd.read_csv("./tous_les_docs.csv")
pdf.shape
(20140796, 7)

whereas dask gives me an error:

df = dd.read_csv("./tous_les_docs.csv")
df.describe().compute()
ParserError: Error tokenizing data. C error: EOF inside string starting at line 192999

Answer: adding blocksize=None makes it work:

df = dd.read_csv("./tous_les_docs.csv", blocksize=None)

You have to show your CSV file. Was it exported by some non-RFC-compliant program (for example Excel)? Some CSV libraries are stricter than others, or don't handle faulty CSV files by default.
– Lee, Commented Apr 29, 2019 at 12:06
I understood dask was supposed to behave the same way as pandas :-/ ?
– Romain Jouin, Commented Apr 29, 2019 at 12:07

mdurant · Accepted Answer · 2019-04-29 12:49:55Z


The documentation says that this can happen:

It should also be noted that this function may fail if a CSV file includes quoted strings that contain the line terminator. To get around this you can specify blocksize=None to not split files into multiple partitions, at the cost of reduced parallelism.

http://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv
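To check whether a file is affected by the case the documentation describes, one option is to count the records that span more than one physical line. The sketch below uses only the standard-library csv module, not dask or pandas; the helper name count_multiline_records and the sample data are made up for illustration.

```python
import csv
import io


def count_multiline_records(text_stream):
    """Count CSV records that span more than one physical line.

    Hypothetical helper for illustration; it is not part of dask or pandas.
    """
    reader = csv.reader(text_stream)
    multiline = 0
    previous_line = 0
    for _ in reader:
        # reader.line_num is the number of physical lines read so far,
        # so a jump of more than 1 means the record contained a newline.
        if reader.line_num - previous_line > 1:
            multiline += 1
        previous_line = reader.line_num
    return multiline


# Made-up sample: the second record has a newline inside a quoted field.
sample = 'id,text\n1,"first line\nsecond line"\n2,plain\n'
multiline_count = count_multiline_records(io.StringIO(sample, newline=""))
print(multiline_count)
```

For a real file, the same function could be called on open(path, newline=""). A non-zero count suggests the file contains quoted line terminators, which is the situation where blocksize=None is recommended.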

It seems that Dask splits the file into chunks at line terminators without scanning the whole file from the start, so it cannot tell whether a given line terminator falls inside a quoted string.
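The effect of such a split can be reproduced in miniature. This is only a sketch of the idea, not Dask's actual code: it uses the standard-library csv module as a stand-in for the pandas parser, made-up sample data, and a cut point chosen by hand to land on the newline inside the quoted field.

```python
import csv
import io

# Made-up sample: the second record has a newline inside a quoted field.
data = 'id,text\n1,"first line\nsecond line"\n2,plain\n'


def parse(chunk):
    # strict=True makes the csv module raise on an unterminated quoted field.
    return list(csv.reader(io.StringIO(chunk, newline=""), strict=True))


# Parsing the whole text at once works: the quoted newline is kept
# inside the field.
whole_rows = parse(data)

# Assumption for illustration: a block boundary lands on the newline
# that sits inside the quoted field, and each block is parsed alone.
cut = data.index("first line\n") + len("first line\n")
first_chunk, second_chunk = data[:cut], data[cut:]

try:
    parse(first_chunk)
    first_chunk_error = None
except csv.Error as exc:
    # The first block ends in the middle of a quoted string.
    first_chunk_error = exc

print(len(whole_rows), "rows when parsed whole")
print("first chunk error:", first_chunk_error)
```

Parsing the full text yields three rows, while the first chunk on its own ends inside an open quote and fails, which is the same kind of failure as the "EOF inside string" error in the question. Reading the file as a single partition avoids the split altogether.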

edited Apr 29, 2019 at 12:49 by mdurant

answered Apr 29, 2019 at 12:13 by Lee
