
I have a proprietary cursor object (arcpy.da.SearchCursor) that I need to load into a pandas dataframe.

It implements next() and reset(), as you would expect from a generator object in Python.

Using another post on Stack Exchange, which is brilliant, I created a class that makes the generator act like a file-like object. This works for the default case, where chunksize is not set, but when I set the chunk size for each dataframe, it crashes Python.

My guess is that read(n) needs to be implemented so that n rows are returned when n is non-zero, but my attempts at this so far have been wrong.

What is the proper way to implement my class so I can use generators to load a dataframe? I need to use chunksize because my datasets are huge.

So the pseudo code would be:

customfileobject = Reader(cursor)
dfs = pd.read_csv(customfileobject, columns=cursor.fields,
                  chunksize=10000)

I am using Pandas version 0.16.1 and Python 2.7.10.

Class below:

class Reader(object):

    """allows a cursor object to be read like a filebuffer"""
    def __init__(self, fc=None, columns="*", cursor=None):
        if cursor or fc:
            if fc:
                self.g = arcpy.da.SearchCursor(fc, columns)
            else:
                self.g = cursor
        else:
            raise ValueError("You must provide a da.SearchCursor or table path and column names")
    def read(self, n=0):
        try:
            vals = []
            if n == 0:
                return next(self.g)
            else:
                # return multiple rows?
                for x in range(n):
                    try:
                        vals.append(self.g.next())
                    except StopIteration:
                        return ''
        except StopIteration:
            return ''
    def reset(self):
        self.g.reset()
  • Would it work if you implement read(self) to read only one entry at a time? Commented Jul 27, 2016 at 14:12
  • I assume you mean pd.read_csv: pd.from_csv does not admit a chunksize argument. Commented Jul 27, 2016 at 17:00
  • @ptrj - that causes python.exe to crash. Commented Jul 27, 2016 at 17:41
  • @AlbertoGarcia-Raboso yes, I fixed that mistake, thanks. Commented Jul 27, 2016 at 17:42
  • And what happens if you define read literally as in the post you linked? Commented Jul 27, 2016 at 19:38

1 Answer


Try the following read function:

def read(self, n=0):
    if n == 0:
        # default case: return a single row per call
        try:
            return next(self.g)
        except StopIteration:
            # an empty string tells pandas the stream is exhausted
            return ''
    else:
        vals = []
        try:
            for x in range(n):
                vals.append(next(self.g))
        except StopIteration:
            # cursor ran out mid-chunk; keep whatever rows were collected
            pass
        finally:
            # runs in both the complete and the exhausted case,
            # so partial chunks are still returned
            return ''.join(vals)
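
One thing to keep in mind when adapting this: arcpy.da.SearchCursor yields each row as a tuple of field values, while a file-like read() is expected to hand pandas text, so ''.join(vals) only works once the rows have been serialized. A minimal sketch of that step, assuming comma-separated output and str() as the serializer (row_to_line is a hypothetical helper, not part of the answer above):

def row_to_line(row):
    # Hypothetical helper: turn one cursor row (a tuple of field values)
    # into one line of comma-separated text that pd.read_csv can parse.
    return ','.join('' if v is None else str(v) for v in row) + '\n'

# In the n > 0 branch above, the append would then become:
#     vals.append(row_to_line(next(self.g)))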

You should tell pd.read_csv the column names using the names argument (not columns), and that you have no header row (header=None).
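
For completeness, the call from the question would then look roughly like the sketch below. It is only an illustration of the wiring, under the assumption that rows are serialized to text as above; fc and process() are placeholders for your table path and your per-chunk processing.

import arcpy
import pandas as pd

cursor = arcpy.da.SearchCursor(fc, "*")    # fc is a placeholder for your table/feature class path
customfileobject = Reader(cursor=cursor)   # pass via the cursor keyword, not positionally as fc

dfs = pd.read_csv(customfileobject,
                  names=cursor.fields,     # column names, since the stream has no header row
                  header=None,
                  chunksize=10000)

for df in dfs:
    # each df is a DataFrame of at most 10000 rows
    process(df)                            # placeholder for per-chunk work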
