46

I've created a tuple generator that extracts information from a file, filtering only the records of interest and converting them to tuples that the generator returns. I'm using Python 3.3 / pandas 0.12.

I've tried to create a DataFrame from it:

import pandas as pd
df = pd.DataFrame.from_records(tuple_generator, columns=tuple_fields_name_list)

but it throws an error:

... 
C:\Anaconda\envs\py33\lib\site-packages\pandas\core\frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
   1046                 values.append(row)
   1047                 i += 1
-> 1048                 if i >= nrows:
   1049                     break
   1050 

TypeError: unorderable types: int() >= NoneType()

I managed to make it work by consuming the generator into a list, but that uses twice the memory:

df = pd.DataFrame.from_records(list(tuple_generator), columns=tuple_fields_name_list)

The files I want to load are big, and memory consumption matters. On my last try, my computer spent two hours trying to grow virtual memory :(

The question: does anyone know a method to create a DataFrame directly from a record generator, without converting it to a list first?

Note: I'm using python 3.3 and pandas 0.12 with Anaconda on Windows.

Update:

It's not a problem of reading the file; my tuple generator does that well. It scans a compressed text file of intermixed records line by line, converts only the wanted data to the correct types, then yields the fields as a generator of tuples. Some numbers: it scans 2,111,412 records in a 130 MB gzip file (about 6.5 GB uncompressed) in about a minute, using little memory.

Pandas 0.12 does not allow generators; the dev version allows them, but it puts the whole generator into a list and then converts that to a frame. It's not efficient, but it's something pandas has to deal with internally. Meanwhile, I must think about buying some more memory.
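Until then, one way to bound peak memory is to consume the generator in slices and build the frame chunk by chunk, so the full list of tuples and the full frame never coexist. A sketch (the function name and chunk size are my own choices, not pandas API):

```python
import itertools
import pandas as pd

def frame_from_records(records, columns, chunksize=100000):
    """Build a DataFrame from a record generator, chunksize rows at a time.

    Only one chunk of raw tuples is held in memory alongside the
    already-packed frames, instead of the entire list at once.
    """
    frames = []
    while True:
        chunk = list(itertools.islice(records, chunksize))
        if not chunk:
            break
        frames.append(pd.DataFrame.from_records(chunk, columns=columns))
    if not frames:  # empty generator: return an empty frame with the columns
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)
```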

4
  • The problem must be in tuple_generator, since it does not occur for simple generator expressions like tuple_generator = (item for item in [[1,2,3],[2,3,4,5]]). Commented Sep 20, 2013 at 11:58
  • @unutbu Not on pandas 0.12. On the development version it works correctly. Commented Sep 20, 2013 at 12:02
  • 1
    It sounds like you might be experiencing thrashing, in which case you should consider adding more memory to your machine. Commented Sep 20, 2013 at 12:24
  • These were pretty old versions of pandas. I have done this sort of thing before on GB-sized files; make sure your generator uses chunking (e.g. read in a ~64 KB chunk of input, yield each item in the result, and iterate until the input is used up). Don't try to process the entire input in one go. Commented Mar 21, 2024 at 2:00

5 Answers

32

You certainly can construct a pandas.DataFrame() from a generator of tuples, as of version 0.19 (and probably earlier). Don't use .from_records(); just use the constructor, for example:

import pandas as pd
someGenerator = ( (x, chr(x)) for x in range(48,127) )
someDf = pd.DataFrame(someGenerator)

Produces:

type(someDf) #pandas.core.frame.DataFrame

someDf.dtypes
#0     int64
#1    object
#dtype: object

someDf.tail(10)
#      0  1
#69  117  u
#70  118  v
#71  119  w
#72  120  x
#73  121  y
#74  122  z
#75  123  {
#76  124  |
#77  125  }
#78  126  ~

4 Comments

This question is from when pandas didn't allow generators at all (pre 0.13).
Using .from_records() is the correct choice for the question's use case, since it takes a generator of records. With the plain constructor it isn't clear how the generator will be interpreted: as a generator of records or as a generator of columns (series).
It took a little creativity to work with CSV lines, but in case anyone else comes across the same issue: within my generator I used for line in lines: yield next(csv.reader([line])). This was useful for me because I needed to perform some cleansing on each line and had other conditional logic to worry about within the CSV.
Dear @c8h10n4o2, please explain why one should choose to use a generator in this case.
20

You cannot create a DataFrame from a generator with the 0.12 version of pandas. You can either update to the development version (get it from GitHub and compile it, which is a little bit painful on Windows, but this is the option I would prefer).

Or, since you said you are filtering the lines, you can filter them first, write them to a file, and then load that file using read_csv or something else...
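A minimal sketch of that filter-then-load approach (the function name, file paths, and predicate are placeholders):

```python
import tempfile
import pandas as pd

def filter_to_csv(src_path, dst_path, keep):
    """Copy only the lines matching `keep` to dst_path, then load the result."""
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:
            if keep(line):
                dst.write(line)
    return pd.read_csv(dst_path)

# small demo with a throwaway file; the predicate must also keep the header line
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    f.write('col1,col2\nfoo,bar\nskip,me\nfoo,baz\n')
df = filter_to_csv(f.name, f.name + '.filtered', lambda line: 'skip' not in line)
```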

If you want to get super complicated, you can create a file-like object that will return the lines:

def gen():
    lines = [
        'col1,col2\n',
        'foo,bar\n',
        'foo,baz\n',
        'bar,baz\n'
    ]
    for line in lines:
        yield line

class Reader(object):
    """Minimal file-like object: each read() call returns the next line."""
    def __init__(self, g):
        self.g = g
    def read(self, n=0):
        # read_csv calls read() repeatedly; return '' to signal EOF
        try:
            return next(self.g)
        except StopIteration:
            return ''

And then use the read_csv:

>>> pd.read_csv(Reader(gen()))
  col1 col2
0  foo  bar
1  foo  baz
2  bar  baz

5 Comments

You are right, pandas 0.12 does not support generators. I've installed the dev version, and the DataFrame constructor allows generators, but DataFrame.from_records() does not. I've made a patch for it.
@ViktorKerkez: Quick question: if my generator function had lists of lists in lines, but not consistently (say some objects could be lists of lists, and some could simply be lists), how would I gracefully change the read method, or should I handle it when I iterate over lines in gen()?
@ViktorKerkez: A very basic question, but here's what I mean. If I define lines = [ ['col1,col2\n'], ['foo,bar\n'], ['foo,baz\n'], ['bar,baz\n'] ], then, keeping the rest the same, I see that the Python shell restarts. I also tried instantiating the Reader object as r = Reader(gen()); df = pd.read_csv(r). This suggests to me that there's something very basic about the class(Object) notation that I don't understand. My assumption is that I should be allowed to create lists inside a DataFrame column if I wanted to, not get a shell restart.
@ekta The read_csv function can only parse "pure" CSV files, which can't contain lists. If you want lists in your data frame columns, you'll have to use something else: either parse JSON or do it manually.
@ViktorKerkez How does your Reader() solution affect performance?
7

To make it memory efficient, read in chunks. Something like this, using Viktor's Reader class from above:

df = pd.concat(list(pd.read_csv(Reader(gen()), chunksize=10000)))
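Building on the same idea, filtering each chunk before keeping it caps peak memory at roughly one chunk plus the surviving rows. A sketch with in-memory CSV data and a placeholder condition:

```python
import io
import pandas as pd

# stand-in for a big file: 100 rows of "i,i*i"
csv_data = io.StringIO('col1,col2\n' + '\n'.join(f'{i},{i * i}' for i in range(100)))

chunks = []
for chunk in pd.read_csv(csv_data, chunksize=25):
    # keep only the rows of interest before accumulating the chunk
    chunks.append(chunk[chunk['col2'] % 2 == 0])
df = pd.concat(chunks, ignore_index=True)
```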

1 Comment

@Jeff - the Reader() causes read_csv and from_csv() to crash Python. Is this solution still valid?
3

If your generator yields DataFrames, you just need to create a new DataFrame by concatenating the elements of that list:

result = pd.concat(list_of_dfs)

Recently I've faced the same problem.
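In recent pandas versions pd.concat accepts any iterable of frames, including the generator itself, so materializing a list first is optional. A small sketch (df_gen is a hypothetical generator of frames):

```python
import pandas as pd

def df_gen():
    # stand-in for a generator that yields one small DataFrame at a time
    for i in range(3):
        yield pd.DataFrame({'x': [i, i + 1]})

# concat consumes the generator lazily, one frame at a time
result = pd.concat(df_gen(), ignore_index=True)
```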

Comments

2

You can also use something like this (tested in Python 2.7.5):

from itertools import izip

def dataframe_from_row_iterator(row_iterator, colnames):
    col_iterator = izip(*row_iterator)
    return pd.DataFrame({cn: cv for (cn, cv) in izip(colnames, col_iterator)})

You can also adapt this to append rows to a DataFrame.

-- Edit, Dec 4th: s/row/rows in last line
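In Python 3, itertools.izip is gone, but the built-in zip is already lazy, so the equivalent would be something like:

```python
import pandas as pd

def dataframe_from_row_iterator(row_iterator, colnames):
    # zip(*rows) transposes rows into columns; note that the * unpacking
    # still consumes the whole iterator before the DataFrame is built
    col_iterator = zip(*row_iterator)
    return pd.DataFrame({cn: list(cv) for cn, cv in zip(colnames, col_iterator)})
```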

2 Comments

This has the same problem as presented in the question: it is infeasible to materialize all of the data as anything other than a DataFrame, a NumPy array, or some other packed form. Here you materialize it as a dict.
Agreed, it does materialize the data as a dict. However, you don't have to materialize all of it at once; just consume part of the generator, then append the data to a DataFrame in chunks. Use itertools.islice to get the chunks from the generator/row_iterator.
