I have a csv file like this:

   1  1.1  0      0.1  13.1494  32.7957  2.27266  0.2  3  5.4   ...     \
0  2    2  0  8.17680  4.76726  25.6957  1.13633    0  3  4.8   ...      
1  3    0  0  8.22718  2.35340  15.2934  1.13633    0  3  4.8   ...

I read the file using pandas.read_csv:

data_raw = pd.read_csv(filename, chunksize=chunksize)

Now, I want to make a dataframe:

df = pd.DataFrame(data_raw, columns=['id', 'colNam1', 'colNam2', 'colNam3',...])

But I get this error:

  File "test.py", line 143, in <module>
    data = load_frame(csvfile)
  File "test.py", line 53, in load_frame
    'id', 'colNam1', 'colNam2', 'colNam3',...])
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 325, in __init__
    raise TypeError("data argument can't be an iterator")
TypeError: data argument can't be an iterator

I don't know why.

1 Answer

This happens because passing chunksize to read_csv returns an iterator over chunks rather than a DataFrame, and the DataFrame constructor refuses to take an iterator as its data argument.

To demonstrate:

In [67]:
import io
import pandas as pd
t="""a         b
0 -0.278303 -1.625377
1 -1.954218  0.843397
2  1.213572 -0.098594"""
df = pd.read_csv(io.StringIO(t), chunksize=1)
df

Out[67]:
<pandas.io.parsers.TextFileReader at 0x7e9e8d0>

You can see that df is not a DataFrame here but a TextFileReader object.

It's unclear to me what you're really trying to achieve but if you want to read a specific number of rows you can pass nrows instead:

In [69]:
t="""a         b
0 -0.278303 -1.625377
1 -1.954218  0.843397
2  1.213572 -0.098594"""
df = pd.read_csv(io.StringIO(t), nrows=1)
df

Out[69]:
             a         b
0  0 -0.278303 -1.625377

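For your original problem, the point is that you need to iterate over the reader to get the chunks, each of which is a DataFrame:

In [73]:
df = pd.read_csv(io.StringIO(t), chunksize=1)  # recreate the reader; In [69] rebound df to a DataFrame
for r in df:
    print(r)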

             a         b
0  0 -0.278303 -1.625377
             a         b
1  1 -1.954218  0.843397
             a         b
2  2  1.213572 -0.098594

If you want to generate a df from the chunks you need to append to a list and then call concat:

In [77]:
df = pd.read_csv(io.StringIO(t), chunksize=1)  # fresh reader; a reader can only be iterated once
df_list = []
for r in df:
    df_list.append(r)
pd.concat(df_list)

Out[77]:
             a         b
0  0 -0.278303 -1.625377
1  1 -1.954218  0.843397
2  2  1.213572 -0.098594
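
To tie this back to the question, the column names can be given to read_csv itself through names= instead of being passed to the DataFrame constructor afterwards. The following is only a minimal sketch: it assumes a comma-separated file with no header row, and the sample data and the three column names are made up for illustration (the real column list in the question is elided).

```python
import io
import pandas as pd

# Hypothetical sample data standing in for the real CSV file (no header row).
csv_text = "1,1.1,0\n2,2,0\n3,0,0\n"

# Hypothetical column names; substitute the real list from your own code.
columns = ["id", "colNam1", "colNam2"]

# With chunksize set, read_csv returns a reader that yields DataFrames.
reader = pd.read_csv(io.StringIO(csv_text), names=columns, chunksize=2)

# Collect the chunks and stitch them into a single DataFrame.
df = pd.concat(list(reader), ignore_index=True)
print(df)
```

Note that this still holds the whole file in memory once concat finishes, so it only helps when the data fits in RAM.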

4 Comments

Thanks. But what should I do if I want to put the data into a DataFrame? The computer has only 3GB of RAM.
Thank you very much. I want to generate a df from the chunks, and I have used your df_list code. But the process is killed because the RAM is full. My computer has only 3GB of RAM, and the CSV file is larger than 3GB. Do you have any suggestions?
Well, you need more RAM or you need to consider how to process a csv off-line. You can't break the laws of physics here: if you don't have enough RAM, then you need to process the file in chunks or consider what you really need to load into RAM.
Could you give some specific examples for my current situation? How do I generate a df from the chunks, or process a csv off-line? Thanks.
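
As a rough illustration of what "process the file in chunks" could look like, the sketch below reduces each chunk to a small running result and then discards it, so only one chunk is in memory at a time. The sample data, the column name value, and the choice of computing a mean are all assumptions for the example; usecols is used to load only the column that is actually needed.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a CSV file too large for RAM.
csv_text = "id,value\n1,10.0\n2,20.0\n3,30.0\n4,40.0\n"

total = 0.0
count = 0

# Read two rows at a time and load only the 'value' column.
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["value"], chunksize=2):
    # Keep only the running aggregates; the chunk itself is dropped each iteration.
    total += chunk["value"].sum()
    count += len(chunk)

mean_value = total / count
print(mean_value)
```

The same pattern applies to filtering: keep only the rows of each chunk that you need and concatenate those, provided the filtered result fits in memory.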
