I am getting the same error as this question, but the recommended solution of setting blocksize=None isn't solving the issue for me. I'm trying to convert the NYC taxi data from CSV to Parquet and this is the code I'm running:
import dask.dataframe as dd

ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2010-*.csv",
    parse_dates=["pickup_datetime", "dropoff_datetime"],
    blocksize=None,
    dtype={
        "tolls_amount": "float64",
        "store_and_fwd_flag": "object",
    },
)

ddf.to_parquet(
    "s3://coiled-datasets/nyc-tlc/2010",
    engine="pyarrow",
    compression="snappy",
    write_metadata_file=False,
)
Here's the error I'm getting:
"ParserError: Error tokenizing data. C error: Expected 18 fields in line 2958, saw 19".
Setting blocksize=None sometimes fixes this error (see here for an example), but I'm not sure why it isn't working in my case.
Any suggestions on how to get past this issue?
This code works for the 2011 taxi data, so there must be something odd in the 2010 files that's causing this issue.
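If it turns out to be just a handful of genuinely malformed rows, I could probably live with dropping them. Here's a sketch of that fallback, assuming dd.read_csv forwards pandas' on_bad_lines keyword (pandas >= 1.3) to the underlying parser:

import dask.dataframe as dd

ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2010-*.csv",
    parse_dates=["pickup_datetime", "dropoff_datetime"],
    blocksize=None,
    dtype={
        "tolls_amount": "float64",
        "store_and_fwd_flag": "object",
    },
    on_bad_lines="skip",  # assumption: silently drops rows with too many fields
)

I'd rather understand the root cause than silently drop data, though.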