
I have a dataset of 2.5 GB which contains tens of millions of rows.

I'm trying to load data like

 %%time
 import pandas as pd
 data=pd.read_csv('C:\\Users\\mahes_000\\Desktop\\yellow.csv',iterator=True,
                  chunksize=50000)

This gives me an iterator over parts of `chunksize` rows each, and I'm trying to do some operations like

 %%time
 data.get_chunk().head(5)
 data.get_chunk().shape
 data.get_chunk().drop(['Rate_Code'],axis=1)

Each operation picks up just one chunk and works on that alone. What about the remaining parts? How can I do operations on the complete data without a memory error?

  • You need to loop through the iterator (`for chunk in data`) and perform the operation on each chunk. Commented Nov 28, 2018 at 7:25

1 Answer


From the documentation on the parameter chunksize:

Return TextFileReader object for iteration

Thus, by placing the object in a loop, you will iteratively read the data in chunks of the size specified by `chunksize`:

chunksize = 50000  # must be an integer, not a float like 5e4
for chunk in pd.read_csv(filename, chunksize=chunksize):
    #print(chunk.head(5))
    #print(chunk.shape)  # shape is an attribute, not a method
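To act on every chunk and combine the results, do the work inside the loop and accumulate only what you need. A minimal sketch (the tiny CSV written here just makes it self-contained; `Rate_Code` is the column from the question, and `yellow_demo.csv` is a stand-in for the real 2.5 GB file):

```python
import pandas as pd

# Write a small demo file; with the real data you would pass its path instead.
pd.DataFrame({'Rate_Code': [1, 2, 3, 4],
              'fare': [10.0, 20.0, 30.0, 40.0]}).to_csv('yellow_demo.csv', index=False)

chunksize = 2  # use e.g. 50000 for the real file
processed = []
for chunk in pd.read_csv('yellow_demo.csv', chunksize=chunksize):
    # apply the operation from the question to every chunk, not just one
    processed.append(chunk.drop(['Rate_Code'], axis=1))

# Concatenate only if the reduced result still fits in memory; otherwise
# aggregate per chunk (sums, counts, ...) or append each chunk to an output file.
result = pd.concat(processed, ignore_index=True)
print(result.shape)  # (4, 1)
```

If even the reduced data is too large to hold at once, write each processed chunk out with `to_csv(..., mode='a')` instead of collecting them in a list.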

2 Comments

Can you add some processing on chunk so that I have a reference?
Well @Mahesh, chunk is a dataframe, so you can perform any process you have in mind directly on it.
