
So I have a table with 146 columns and approximately 8 million rows of sparse data stored locally in a PostgreSQL database.

My goal is to select the whole dataset at once, store it in a pandas DataFrame and perform some calculations on it.

So far I have read about server-side cursors in many threads, but I guess I'm doing something wrong as I don't see any improvement in memory usage. The documentation is also quite limited.

My code so far is the following:

import pandas as pd

# conn is an already-open psycopg2 connection to the local database
cur = conn.cursor('testCursor')  # named (server-side) cursor
cur.itersize = 100000
cur.execute("select * from events")

df = cur.fetchall()  # this still pulls every row into client memory at once

df = pd.DataFrame(df)
conn.commit()
conn.close()

I also tried using fetchmany() or fetchone() instead of fetchall(), but I don't know how to scroll through the results. I guess I could use something like this for fetchone(), but I don't know how to handle fetchmany():

row = cur.fetchone()
while row:
    # process the row here
    row = cur.fetchone()

Lastly, in the case of fetchone() and fetchmany(), how can I concatenate the results into a single DataFrame without consuming all of my memory? Just to note that I have 16 GB of RAM available.
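For fetchmany() I imagine the loop would look roughly like the sketch below (the batch size and the final pd.concat are just my guesses; I haven't verified that building the list of chunks like this actually keeps memory under control):

chunks = []
rows = cur.fetchmany(100000)
while rows:
    chunks.append(pd.DataFrame(rows))
    rows = cur.fetchmany(100000)

df = pd.concat(chunks, ignore_index=True)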

  • An approach to your problem could be to copy your whole data table into a columnar DB (e.g. MonetDB) and perform the analysis in Python by embedding your code within the query. MonetDB permits you to embed Python code in queries; this is a built-in feature. Here is an example: monetdb.org/blog/voter-classification-using-monetdbpython. Hope this is useful for you. Commented May 12, 2017 at 4:58
  • Sure, thank you! If I don't find any solution with Postgres I'll give it a try. Commented May 12, 2017 at 5:57

1 Answer


8 million rows x 146 columns, assuming each column stores at least one byte, already gives you at least 1 GB (8,000,000 x 146 ≈ 1.1 GB). Since your columns almost certainly store more than one byte each, even if the first step of what you are trying to do succeeded, you would hit RAM constraints (i.e. the end result would not fit in RAM).

The usual strategy for processing large datasets is to process them in small batches and then (if needed) combine the results. Have a look at PySpark, for example.
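With your current psycopg2 setup, batching could look roughly like the sketch below. This is only an illustration, not a drop-in solution: it assumes conn is your open connection and that your calculation can be done per batch and combined afterwards (the "amount" column and the sum are made-up placeholders for whatever you actually compute):

import pandas as pd

cur = conn.cursor('batch_cursor')  # named cursor, so rows stay on the server
cur.execute("select * from events")

colnames = None
partial_results = []
while True:
    rows = cur.fetchmany(50000)  # pull the next 50k rows; an empty list means we're done
    if not rows:
        break
    if colnames is None:
        # for a named cursor the column description is available once rows have arrived
        colnames = [desc[0] for desc in cur.description]
    batch = pd.DataFrame(rows, columns=colnames)
    # per-batch part of the calculation; "amount" is a hypothetical column
    partial_results.append(batch['amount'].sum())

total = sum(partial_results)  # combine the per-batch results
cur.close()
conn.close()

Each batch DataFrame is discarded once its contribution has been recorded, so peak memory stays around the size of one batch rather than the full table.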


1 Comment

Yes, the data are about 3.5 GB. Assuming I want to process the data in small batches using fetchmany() multiple times, how can I scroll through the results?
