TypeError when using chunksize argument to pandas method pd.read_csv()

Question

I have a csv file like this:

   1  1.1  0      0.1  13.1494  32.7957  2.27266  0.2  3  5.4   ...     \
0  2    2  0  8.17680  4.76726  25.6957  1.13633    0  3  4.8   ...      
1  3    0  0  8.22718  2.35340  15.2934  1.13633    0  3  4.8   ...

I read the file using pandas.read_csv:

data_raw = pd.read_csv(filename, chunksize=chunksize)

Now, I want to make a dataframe:

df = pd.DataFrame(data_raw, columns=['id', 'colNam1', 'colNam2', 'colNam3',...])

But I get this error:

  File "test.py", line 143, in <module>
    data = load_frame(csvfile)
  File "test.py", line 53, in load_frame
    'id', 'colNam1', 'colNam2', 'colNam3',...])
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 325, in __init__
    raise TypeError("data argument can't be an iterator")
TypeError: data argument can't be an iterator

I don't know why.

EdChum · Accepted Answer · 2017-01-25 09:21:34Z


This is because passing chunksize as a param to read_csv returns an iterator over chunks rather than a df, and the DataFrame constructor refuses to take an iterator as its data argument.

To demonstrate:

In [67]:
import io
import pandas as pd
t="""a         b
0 -0.278303 -1.625377
1 -1.954218  0.843397
2  1.213572 -0.098594"""
df = pd.read_csv(io.StringIO(t), chunksize=1)
df

Out[67]:
<pandas.io.parsers.TextFileReader at 0x7e9e8d0>

You can see that df here is not a DataFrame but a TextFileReader object.
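A minimal sketch of pulling a single chunk out of that reader with next(), to show that each item it yields is an ordinary DataFrame. The comma-separated sample data and the variable names are assumptions made for illustration, not taken from the question's file.

```python
import io
import pandas as pd

# Hypothetical comma-separated sample data (not the question's file).
csv_text = "a,b\n1,2\n3,4\n5,6\n"

# With chunksize set, read_csv returns a TextFileReader, not a DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
is_frame = isinstance(reader, pd.DataFrame)

# The reader is an iterator, so next() returns the first chunk,
# which is a regular DataFrame holding up to `chunksize` rows.
first_chunk = next(reader)
print(type(first_chunk).__name__)
print(len(first_chunk))
```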

It's unclear to me what you're really trying to achieve but if you want to read a specific number of rows you can pass nrows instead:

In [69]:
t="""a         b
0 -0.278303 -1.625377
1 -1.954218  0.843397
2  1.213572 -0.098594"""
df = pd.read_csv(io.StringIO(t), nrows=1)
df

Out[69]:
             a         b
0  0 -0.278303 -1.625377

For your original problem, the idea is that you need to iterate over the TextFileReader (the df created with chunksize above, not the nrows result) in order to get the chunks:

In [73]:
for r in df:
    print(r)

             a         b
0  0 -0.278303 -1.625377
             a         b
1  1 -1.954218  0.843397
             a         b
2  2  1.213572 -0.098594

If you want to generate a df from the chunks you need to append to a list and then call concat:

In [77]:
df_list=[]
for r in df:
    df_list.append(r)
pd.concat(df_list)

Out[77]:
             a         b
0  0 -0.278303 -1.625377
1  1 -1.954218  0.843397
2  2  1.213572 -0.098594
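Applied to the question, one possible approach is to give the column names to read_csv itself and then concatenate the chunks, instead of passing the reader to pd.DataFrame. This is only a sketch: the sample data, the three column names, and the assumption that the file is comma-separated with no header row are all made up for illustration.

```python
import io
import pandas as pd

# Hypothetical comma-separated data with no header row,
# standing in for the question's file.
csv_text = "1,1.1,0\n2,2.0,0\n3,0.0,0\n4,5.5,1\n"

# Hypothetical column names, shortened from the question's list.
columns = ["id", "colNam1", "colNam2"]

# names= labels the columns at read time; header=None treats line 1 as data.
reader = pd.read_csv(
    io.StringIO(csv_text), header=None, names=columns, chunksize=2
)

# Collect the chunk DataFrames and stitch them into a single DataFrame.
df = pd.concat(list(reader), ignore_index=True)
print(df)
```

Note that the concatenated result still has to fit in memory, so this only helps when the whole file can be held in RAM.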

edited Jan 25, 2017 at 9:21

answered Jan 25, 2017 at 9:15

EdChum


4 Comments

Long Ye Over a year ago

Thanks. But what can I do if I want to put the data into a DataFrame? The computer has only 3GB of RAM.

Long Ye Over a year ago

Thank you very much. I want to generate a df from the chunks, and I have used your df_list code, but the process is killed because the RAM fills up. My computer has only 3GB of RAM, and the CSV file is more than 3GB. Do you have any suggestions?

EdChum Over a year ago

Well, you need more RAM or you need to consider how to process a csv off-line. You can't break the laws of physics here: if you don't have enough RAM then you need to process the file in chunks or consider what you really need to load into RAM.

Long Ye Over a year ago

Could you give some specific examples for my current situation? How do I generate a df from the chunks, or process a csv off-line? Thanks.
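As a rough illustration of "process the file in chunks or consider what you really need to load", the sketch below keeps only a running total and a filtered subset while iterating, so the full file never sits in memory at once. The data, the column names, the threshold, and the dtypes are all hypothetical. usecols and dtype are existing read_csv parameters, used here to load fewer columns in narrower types.

```python
import io
import pandas as pd

# Hypothetical data standing in for a CSV that is too large for RAM.
csv_text = (
    "id,value,note\n"
    "1,10.0,x\n2,20.0,y\n3,30.0,z\n4,40.0,x\n5,50.0,y\n"
)

total = 0.0      # running sum of the 'value' column
row_count = 0    # running number of rows seen
kept = []        # only the rows that pass the filter are retained

reader = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "value"],                    # skip columns that are not needed
    dtype={"id": "int32", "value": "float32"},  # narrower types use less memory
    chunksize=2,
)
for chunk in reader:
    # Pattern 1: reduce each chunk to a few numbers and discard the chunk.
    total += float(chunk["value"].sum())
    row_count += len(chunk)
    # Pattern 2: keep only the rows of interest (arbitrary threshold).
    kept.append(chunk[chunk["value"] > 15])

mean_value = total / row_count
small_df = pd.concat(kept, ignore_index=True)
print(mean_value)
print(small_df)
```

The second pattern only helps if the filtered rows are much smaller than the file. If every row is needed at once, the data still will not fit in 3GB.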
