46

I've created a tuple generator that extracts information from a file, filtering only the records of interest and converting them to tuples that the generator returns. I'm using Python 3.3 / pandas 0.12.

I've tried to create a DataFrame from it:

import pandas as pd
df = pd.DataFrame.from_records(tuple_generator, columns=tuple_fields_name_list)

but it throws an error:

... 
C:\Anaconda\envs\py33\lib\site-packages\pandas\core\frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
   1046                 values.append(row)
   1047                 i += 1
-> 1048                 if i >= nrows:
   1049                     break
   1050 

TypeError: unorderable types: int() >= NoneType()

I managed to make it work by consuming the generator into a list, but that uses twice the memory:

df = pd.DataFrame.from_records(list(tuple_generator), columns=tuple_fields_name_list)

The files I want to load are big, and memory consumption matters. On my last try, my computer spent two hours trying to grow virtual memory :(

The question: does anyone know a method to create a DataFrame directly from a record generator, without converting it to a list first?

Note: I'm using python 3.3 and pandas 0.12 with Anaconda on Windows.

Update:

It's not a problem of reading the file; my tuple generator does that well. It scans a compressed text file of intermixed records line by line, converts only the wanted data to the correct types, then yields the fields as a generator of tuples. Some numbers: it scans 2,111,412 records in a 130 MB gzip file (about 6.5 GB uncompressed) in about a minute, using little memory.

Pandas 0.12 does not allow generators; the dev version allows them, but it puts the whole generator into a list and then converts that to a frame. It's not efficient, but it's something pandas has to deal with internally. Meanwhile, I must think about buying some more memory.
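Until then, one way to bound peak memory is to consume the generator in slices and build the frame chunk by chunk, so the full list of tuples and the full frame never coexist. A sketch (the function name and chunk size are my own choices, not pandas API):

```python
import itertools
import pandas as pd

def frame_from_records(records, columns, chunksize=100000):
    """Build a DataFrame from a record generator, chunksize rows at a time.

    Only one chunk of raw tuples is held in memory alongside the
    already-packed frames, instead of the entire list at once.
    """
    frames = []
    while True:
        chunk = list(itertools.islice(records, chunksize))
        if not chunk:
            break
        frames.append(pd.DataFrame.from_records(chunk, columns=columns))
    if not frames:  # empty generator: return an empty frame with the columns
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)
```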

4
  • The problem must be in tuple_generator, since it does not occur for simple generator expressions like tuple_generator = (item for item in [[1,2,3],[2,3,4,5]]). Commented Sep 20, 2013 at 11:58
  • @unutbu Not on pandas 0.12. On the development version it works correctly. Commented Sep 20, 2013 at 12:02
  • 1
    It sounds like you might be experiencing thrashing, in which case you should consider adding more memory to your machine. Commented Sep 20, 2013 at 12:24
  • These were pretty old versions of pandas. I have done this sort of thing before on GB-sized files; make sure your generator uses chunking (e.g. read in a ~64 KB chunk of input, yield each item in the result, and iterate until the input is used up). Don't try to process the entire input in one go. Commented Mar 21, 2024 at 2:00

5 Answers

32

You certainly can construct a pandas.DataFrame() from a generator of tuples, as of version 0.19 (and probably earlier). Don't use .from_records(); just use the constructor, for example:

import pandas as pd
someGenerator = ( (x, chr(x)) for x in range(48,127) )
someDf = pd.DataFrame(someGenerator)

Produces:

type(someDf) #pandas.core.frame.DataFrame

someDf.dtypes
#0     int64
#1    object
#dtype: object

someDf.tail(10)
#      0  1
#69  117  u
#70  118  v
#71  119  w
#72  120  x
#73  121  y
#74  122  z
#75  123  {
#76  124  |
#77  125  }
#78  126  ~

4 Comments

This question is from when pandas didn't allow generators at all (pre 0.13).
Using .from_records() is the correct choice for the question's use case, since it takes a generator of records. With the plain constructor it isn't clear how the generator will be interpreted: as a generator of records or as a generator of columns (series).
It took a little creativity to work with CSV lines, but in case anyone else comes across the same issue: within my generator I used for line in lines: yield next(csv.reader([line])). This was useful for me because I needed to perform some cleansing on each line and had other conditional logic to worry about within the CSV.
Dear @c8h10n4o2, please explain why one should choose to use a generator in this case.
20

You cannot create a DataFrame from a generator with the 0.12 version of pandas. You can either update to the development version (get it from GitHub and compile it, which is a little bit painful on Windows, but this is the option I would prefer).

Or, since you said you are filtering the lines, you can filter them first, write them to a file, and then load that file using read_csv or something else...
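A minimal sketch of that filter-then-load approach (the function name, file paths, and predicate are placeholders):

```python
import tempfile
import pandas as pd

def filter_to_csv(src_path, dst_path, keep):
    """Copy only the lines matching `keep` to dst_path, then load the result."""
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:
            if keep(line):
                dst.write(line)
    return pd.read_csv(dst_path)

# small demo with a throwaway file; the predicate must also keep the header line
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    f.write('col1,col2\nfoo,bar\nskip,me\nfoo,baz\n')
df = filter_to_csv(f.name, f.name + '.filtered', lambda line: 'skip' not in line)
```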

If you want to get super complicated, you can create a file-like object that will return the lines:

def gen():
    lines = [
        'col1,col2\n',
        'foo,bar\n',
        'foo,baz\n',
        'bar,baz\n'
    ]
    for line in lines:
        yield line

class Reader(object):
    """Minimal file-like object: each read() call returns the next line."""
    def __init__(self, g):
        self.g = g
    def read(self, n=0):
        # read_csv calls read() repeatedly; return '' to signal EOF
        try:
            return next(self.g)
        except StopIteration:
            return ''

And then use the read_csv:

>>> pd.read_csv(Reader(gen()))
  col1 col2
0  foo  bar
1  foo  baz
2  bar  baz

5 Comments

You are right, pandas 0.12 does not support generators. I've installed the dev version, and the DataFrame constructor allows generators, but DataFrame.from_records() does not. I've made a patch for it.
@ViktorKerkez: Quick question: if my generator function had lists of lists in lines, but not consistently (say some objects could be lists of lists, and some could simply be lists), how would I gracefully change the read method, or should I handle it when I iterate over lines in gen()?
@ViktorKerkez: A very basic question, but here's what I mean. If I define lines = [ ['col1,col2\n'], ['foo,bar\n'], ['foo,baz\n'], ['bar,baz\n'] ], then, keeping the rest the same, I see that the Python shell restarts. I also tried instantiating the Reader object as r = Reader(gen()); df = pd.read_csv(r). This suggests to me that there's something very basic about the class(Object) notation that I don't understand. My assumption is that I should be allowed to create lists inside a DataFrame column if I wanted to, not get a shell restart.
@ekta The read_csv function can only parse "pure" CSV files, which can't contain lists. If you want lists in your data frame columns, you'll have to use something else: either parse JSON or do it manually.
@ViktorKerkez How does your Reader() solution affect performance?
7

To make it memory efficient, read in chunks. Something like this, using Viktor's Reader class from above:

df = pd.concat(list(pd.read_csv(Reader(gen()), chunksize=10000)))
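Building on the same idea, filtering each chunk before keeping it caps peak memory at roughly one chunk plus the surviving rows. A sketch with in-memory CSV data and a placeholder condition:

```python
import io
import pandas as pd

# stand-in for a big file: 100 rows of "i,i*i"
csv_data = io.StringIO('col1,col2\n' + '\n'.join(f'{i},{i * i}' for i in range(100)))

chunks = []
for chunk in pd.read_csv(csv_data, chunksize=25):
    # keep only the rows of interest before accumulating the chunk
    chunks.append(chunk[chunk['col2'] % 2 == 0])
df = pd.concat(chunks, ignore_index=True)
```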

1 Comment

@Jeff - the Reader() causes read_csv and from_csv() to crash Python. Is this solution still valid?
3

If your generator yields DataFrames, you just need to create a new DataFrame by concatenating the elements of that list:

result = pd.concat(list_of_dfs)

Recently I've faced the same problem.
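In recent pandas versions pd.concat accepts any iterable of frames, including the generator itself, so materializing a list first is optional. A small sketch (df_gen is a hypothetical generator of frames):

```python
import pandas as pd

def df_gen():
    # stand-in for a generator that yields one small DataFrame at a time
    for i in range(3):
        yield pd.DataFrame({'x': [i, i + 1]})

# concat consumes the generator lazily, one frame at a time
result = pd.concat(df_gen(), ignore_index=True)
```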

Comments

2

You can also use something like this (tested in Python 2.7.5):

from itertools import izip

def dataframe_from_row_iterator(row_iterator, colnames):
    col_iterator = izip(*row_iterator)
    return pd.DataFrame({cn: cv for (cn, cv) in izip(colnames, col_iterator)})

You can also adapt this to append rows to a DataFrame.

-- Edit, Dec 4th: s/row/rows in last line
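In Python 3, itertools.izip is gone, but the built-in zip is already lazy, so the equivalent would be something like:

```python
import pandas as pd

def dataframe_from_row_iterator(row_iterator, colnames):
    # zip(*rows) transposes rows into columns; note that the * unpacking
    # still consumes the whole iterator before the DataFrame is built
    col_iterator = zip(*row_iterator)
    return pd.DataFrame({cn: list(cv) for cn, cv in zip(colnames, col_iterator)})
```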

2 Comments

This has the same problem as presented in the question: it is infeasible to materialize all of the data as anything other than a DataFrame, a NumPy array, or some other packed form. Here you materialize it as a dict.
Agreed, it does materialize the data as a dict. However, you don't have to materialize all of it at once; just consume part of the generator, then append the data to a DataFrame in chunks. Use itertools.islice to get the chunks from the generator/row_iterator.
