I get an error when reading a file with dask, although the same file reads fine with pandas:

import dask.dataframe as dd
import pandas as pd
pdf = pd.read_csv("./tous_les_docs.csv")
pdf.shape
(20140796, 7)

With dask, the same file raises an error:

df = dd.read_csv("./tous_les_docs.csv")
df.describe().compute()
ParserError: Error tokenizing data. C error: EOF inside string starting at line 192999

Answer: adding blocksize=None makes it work:

df = dd.read_csv("./tous_les_docs.csv", blocksize=None)
  • You have to show your CSV file. Was it exported by some non-RFC-compliant program (for example Excel)? Some CSV libraries are stricter than others, or don't handle faulty CSV files by default. Commented Apr 29, 2019 at 12:06
  • I understood dask was supposed to behave the same way as pandas :-/ ? Commented Apr 29, 2019 at 12:07
  • The CSV was produced by pandas with df.to_csv(path). Commented Apr 29, 2019 at 12:07

1 Answer

The documentation warns that this can happen:

It should also be noted that this function may fail if a CSV file includes quoted strings that contain the line terminator. To get around this you can specify blocksize=None to not split files into multiple partitions, at the cost of reduced parallelism.

http://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv

It seems Dask splits the file into chunks at line terminators without scanning the whole file from the start, so it cannot tell whether a given line terminator sits inside a quoted string.
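The effect can be reproduced without Dask. The sketch below is only an illustration using Python's standard csv module, not Dask's actual splitting code; the sample data and the helper name parse_strict are made up for the example. It cuts a small CSV at a line terminator that falls inside a quoted field and shows that the first block, parsed on its own, ends in the middle of a string.

```python
import csv
import io

# Hypothetical sample data: the record with id 2 has a quoted field
# that contains a line terminator.
raw = 'id,text\n1,"hello"\n2,"line one\nline two"\n3,"bye"\n'

# A quote-aware parser reading from the start sees a header plus 3 records.
rows = list(csv.reader(io.StringIO(raw)))

# Splitting on line terminators alone sees 5 physical lines.
physical_lines = raw.splitlines()

# Simulate a block boundary that lands on the newline inside the quoted field.
cut = raw.index("line one\n") + len("line one\n")
first_block, second_block = raw[:cut], raw[cut:]


def parse_strict(text):
    """Parse CSV text, raising csv.Error on malformed input such as an unclosed quote."""
    return list(csv.reader(io.StringIO(text), strict=True))


try:
    parse_strict(first_block)
    first_block_error = None
except csv.Error as exc:
    # The block ends while the parser is still inside a quoted string.
    first_block_error = str(exc)

print(len(rows), len(physical_lines), first_block_error)
```

Reading the file as a single block, which is what blocksize=None asks for, avoids the problem because the parser always starts at the beginning of the file and never loses track of open quotes.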
