I am using psycopg2 and pandas to extract data from Postgres.
pandas.read_sql_query supports the Python generator pattern via the chunksize argument, but that is not much help with large datasets: the whole result set is first retrieved from the DB into client-side memory and only afterwards chunked into separate frames of chunksize rows. Large datasets easily run into out-of-memory problems with this approach.
Postgres/psycopg2 address this problem with server-side cursors, but pandas does not seem to support them.
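(For reference, this is roughly how a bare psycopg2 server-side cursor streams rows without materializing everything in client memory; the DSN, cursor name and table are placeholders:)

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
    # Giving the cursor a name makes psycopg2 declare it server-side,
    # so rows are pulled from the server in batches of itersize
    # instead of being loaded into client memory all at once.
    with conn.cursor(name='stream_cur') as curs:
        curs.itersize = 10000
        curs.execute("SELECT * FROM big_table")  # placeholder table
        for row in curs:
            ...  # process one row at a time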
Instead of doing:
    from pandas.io import sql  # pandas' SQL helpers

    it = sql.read_sql_query(query,  # 'query' holds the SQL string
                            conn,
                            index_col='col1',
                            chunksize=chunksize)
I tried reimplementing it like this:
    from pandas.io.sql import SQLiteDatabase

    curs = conn.cursor(name='cur_name')  # named cursor => server-side cursor
    curs.itersize = chunksize
    pandas_sql = SQLiteDatabase(curs, is_cursor=True)
    it = pandas_sql.read_query(query,
                               index_col='col1',
                               chunksize=chunksize)
but it fails because pandas tries to access cursor.description, which is None with server-side cursors for some reason (any idea why?).
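The only workaround I can think of is to bypass pandas' SQL layer entirely and build the frames myself from fetchmany() batches (a rough sketch, assuming cursor.description does get populated once the first batch has been fetched; 'chunk_cur' is an arbitrary cursor name):

    import pandas as pd

    def read_query_chunks(query, conn, chunksize, index_col=None):
        # Naming the cursor makes psycopg2 create it server-side.
        with conn.cursor(name='chunk_cur') as curs:
            curs.itersize = chunksize
            curs.execute(query)
            while True:
                rows = curs.fetchmany(chunksize)
                if not rows:
                    break
                # description seems to be available only after a fetch
                cols = [d[0] for d in curs.description]
                frame = pd.DataFrame(rows, columns=cols)
                if index_col is not None:
                    frame = frame.set_index(index_col)
                yield frame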
What's the best approach to proceed? Thanks!
P.S.
- SQLiteDatabase is used with Postgres here because it is pandas' fallback when SQLAlchemy is not available
- Related feature request on pandas: https://github.com/pandas-dev/pandas/issues/35689