
So I have a table with 146 columns and approximately 8 million rows of sparse data stored locally in a PostgreSQL database.

My goal is to select the whole dataset at once, store it in a pandas DataFrame and perform some calculations on it.

So far I have read about server-side cursors in many threads, but I guess I'm doing something wrong as I don't see any improvement in memory usage. The documentation is also quite limited.

My code so far is the following:

import pandas as pd

# conn is an already-open psycopg2 connection to the local database
cur = conn.cursor('testCursor')  # named (server-side) cursor
cur.itersize = 100000
cur.execute("select * from events")

df = cur.fetchall()  # this still pulls every row into client memory at once

df = pd.DataFrame(df)
conn.commit()
conn.close()

I also tried using fetchmany() or fetchone() instead of fetchall(), but I don't know how to scroll through the results. I guess I could use something like this for fetchone(), but I don't know how to handle fetchmany():

row = cur.fetchone()
while row:
    # process the row here
    row = cur.fetchone()

Lastly, in the case of fetchone() and fetchmany(), how can I concatenate the results into a single DataFrame without consuming all of my memory? Just to note that I have 16 GB of RAM available.
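For fetchmany() I imagine the loop would look roughly like the sketch below (the batch size and the final pd.concat are just my guesses; I haven't verified that building the list of chunks like this actually keeps memory under control):

chunks = []
rows = cur.fetchmany(100000)
while rows:
    chunks.append(pd.DataFrame(rows))
    rows = cur.fetchmany(100000)

df = pd.concat(chunks, ignore_index=True)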

  • An approach to your problem could be to copy your whole data table into a columnar DB (e.g. MonetDB) and perform the analysis in Python by embedding your code within the query. MonetDB permits you to embed Python code in queries; this is a built-in feature. Here is an example: monetdb.org/blog/voter-classification-using-monetdbpython. Hope this is useful for you. Commented May 12, 2017 at 4:58
  • Sure, thank you! If I don't find any solution with Postgres I'll give it a try. Commented May 12, 2017 at 5:57

1 Answer


8 million rows x 146 columns, assuming each column stores at least one byte, already gives you at least 1 GB (8,000,000 x 146 ≈ 1.1 GB). Since your columns almost certainly store more than one byte each, even if the first step of what you are trying to do succeeded, you would hit RAM constraints (i.e. the end result would not fit in RAM).

The usual strategy for processing large datasets is to process them in small batches and then (if needed) combine the results. Have a look at PySpark, for example.
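With your current psycopg2 setup, batching could look roughly like the sketch below. This is only an illustration, not a drop-in solution: it assumes conn is your open connection and that your calculation can be done per batch and combined afterwards (the "amount" column and the sum are made-up placeholders for whatever you actually compute):

import pandas as pd

cur = conn.cursor('batch_cursor')  # named cursor, so rows stay on the server
cur.execute("select * from events")

colnames = None
partial_results = []
while True:
    rows = cur.fetchmany(50000)  # pull the next 50k rows; an empty list means we're done
    if not rows:
        break
    if colnames is None:
        # for a named cursor the column description is available once rows have arrived
        colnames = [desc[0] for desc in cur.description]
    batch = pd.DataFrame(rows, columns=colnames)
    # per-batch part of the calculation; "amount" is a hypothetical column
    partial_results.append(batch['amount'].sum())

total = sum(partial_results)  # combine the per-batch results
cur.close()
conn.close()

Each batch DataFrame is discarded once its contribution has been recorded, so peak memory stays around the size of one batch rather than the full table.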


1 Comment

Yes, the data are about 3.5 GB. Assuming I want to process the data in small batches using fetchmany() multiple times, how can I scroll through the results?
