
I have a dataset of 2.5 GB which contains tens of millions of rows.

I'm trying to load data like

 %%time
 import pandas as pd
 data=pd.read_csv('C:\\Users\\mahes_000\\Desktop\\yellow.csv',iterator=True,
                  chunksize=50000)

This gives me an iterator over parts of `chunksize` rows each, and I'm trying to do some operations like

 %%time
 data.get_chunk().head(5)
 data.get_chunk().shape
 data.get_chunk().drop(['Rate_Code'],axis=1)

Each operation picks up just one chunk and works on that alone. What about the remaining parts? How can I do operations on the complete data without a memory error?

  • You need to loop through the iterator (`for chunk in data`) and perform the operation on each chunk. Commented Nov 28, 2018 at 7:25

1 Answer


From the documentation on the parameter chunksize:

Return TextFileReader object for iteration

Thus, by placing the object in a loop, you will iteratively read the data in chunks of the size specified by `chunksize`:

chunksize = 50000  # must be an integer, not a float like 5e4
for chunk in pd.read_csv(filename, chunksize=chunksize):
    #print(chunk.head(5))
    #print(chunk.shape)  # shape is an attribute, not a method
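To act on every chunk and combine the results, do the work inside the loop and accumulate only what you need. A minimal sketch (the tiny CSV written here just makes it self-contained; `Rate_Code` is the column from the question, and `yellow_demo.csv` is a stand-in for the real 2.5 GB file):

```python
import pandas as pd

# Write a small demo file; with the real data you would pass its path instead.
pd.DataFrame({'Rate_Code': [1, 2, 3, 4],
              'fare': [10.0, 20.0, 30.0, 40.0]}).to_csv('yellow_demo.csv', index=False)

chunksize = 2  # use e.g. 50000 for the real file
processed = []
for chunk in pd.read_csv('yellow_demo.csv', chunksize=chunksize):
    # apply the operation from the question to every chunk, not just one
    processed.append(chunk.drop(['Rate_Code'], axis=1))

# Concatenate only if the reduced result still fits in memory; otherwise
# aggregate per chunk (sums, counts, ...) or append each chunk to an output file.
result = pd.concat(processed, ignore_index=True)
print(result.shape)  # (4, 1)
```

If even the reduced data is too large to hold at once, write each processed chunk out with `to_csv(..., mode='a')` instead of collecting them in a list.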

2 Comments

Can you add some processing on chunk so that I have a reference?
Well @Mahesh, chunk is a dataframe, so you can perform any process you have in mind directly on it.
