I am getting the same error as this question, but the recommended solution of setting blocksize=None isn't solving the issue for me. I'm trying to convert the NYC taxi data from CSV to Parquet and this is the code I'm running:
import dask.dataframe as dd

ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2010-*.csv",
    parse_dates=["pickup_datetime", "dropoff_datetime"],
    blocksize=None,
    dtype={
        "tolls_amount": "float64",
        "store_and_fwd_flag": "object",
    },
)

ddf.to_parquet(
    "s3://coiled-datasets/nyc-tlc/2010",
    engine="pyarrow",
    compression="snappy",
    write_metadata_file=False,
)
Here's the error I'm getting:
"ParserError: Error tokenizing data. C error: Expected 18 fields in line 2958, saw 19".
Setting blocksize=None sometimes fixes this error (see here for an example), but I'm not sure why it isn't working in my case.
Any suggestions on how to get past this issue?
This code works for the 2011 taxi data, so there must be something odd in the 2010 files that's causing this issue.
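If it turns out to be just a handful of genuinely malformed rows, I could probably live with dropping them. Here's a sketch of that fallback, assuming dd.read_csv forwards pandas' on_bad_lines keyword (pandas >= 1.3) to the underlying parser:

import dask.dataframe as dd

ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2010-*.csv",
    parse_dates=["pickup_datetime", "dropoff_datetime"],
    blocksize=None,
    dtype={
        "tolls_amount": "float64",
        "store_and_fwd_flag": "object",
    },
    on_bad_lines="skip",  # assumption: silently drops rows with too many fields
)

I'd rather understand the root cause than silently drop data, though.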