
I am trying to run a query against a table that has about 10 million rows. Basically I run `Select * from events` and write the result to a CSV file.

Here is the code:

    import csv
    import json
    import os
    import sys

    import psycopg2


    def create_server_connection():
        # Connection parameters come from a JSON blob in the environment.
        params = json.loads(os.environ["DB_REPLICA_CONNECTION"])
        try:
            conn = psycopg2.connect(
                database=params["PGDATABASE"],
                user=params["PGUSER"],
                password=params["PGPASSWORD"],
                host=params["PGHOST"],
                port=params["PGPORT"],
            )
        except psycopg2.OperationalError as e:
            print('Unable to connect!\n{0}'.format(e))
            sys.exit(1)
        return conn


    with create_server_connection() as connection:
        cursor = connection.cursor()
        cursor.itersize = 20000
        query = open(sql_file_path, mode='r').read()
        print(query)
        cursor.execute(query)
        with open(file_name, 'w', newline='') as fp:
            writer = csv.writer(fp)
            for row in cursor:
                writer.writerow(row)

However, for some reason this whole process uses a lot of memory. I am running it as an AWS Batch job, and the job exits with the error `OutOfMemoryError: Container killed due to memory usage`.

Is there a way to reduce memory usage?

  • Did you try pandas read_sql? You can read it in chunks and write it out: pandas.pydata.org/docs/reference/api/pandas.read_sql.html Commented Jun 27, 2022 at 11:31
  • Why not use COPY to export to CSV directly? There's no faster way than having the database itself export the data. Commented Jun 27, 2022 at 11:34
  • @TomRon pandas will be far worse. If the process fails when the data is cached once, it will fail far faster if the data has to be cached into a dataframe before exporting starts. Commented Jun 27, 2022 at 11:36
  • @PanagiotisKanavos What if I want to run a complex query rather than a simple Select * statement? Commented Jun 27, 2022 at 12:17
  • COPY doesn't care about the query's complexity, provided it's actually a query rather than a multi-statement script. Client cursors cache the data on the client anyway; that's why they're called client cursors: the data is on the client. You'll have to use a server-side cursor (just pass a name to cursor()). The default batch size is, I think, 2K rows, but it can be changed with the itersize attribute (see the COPY sketch after these comments). Commented Jun 27, 2022 at 12:21
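As suggested in the comments, the least memory-hungry option is to let PostgreSQL produce the CSV itself with COPY. Below is a minimal sketch, not the poster's code, using psycopg2's copy_expert and reusing create_server_connection, sql_file_path and file_name from the question; it assumes the .sql file contains a single SELECT statement:

    with create_server_connection() as connection:
        with connection.cursor() as cursor:
            # Read the SELECT from the .sql file and drop any trailing semicolon
            # so it can be wrapped in COPY (...).
            query = open(sql_file_path, mode='r').read().rstrip().rstrip(';')
            # COPY (...) TO STDOUT streams the rows straight into the file object,
            # so the client never holds the whole result set in memory.
            copy_sql = "COPY ({0}) TO STDOUT WITH CSV HEADER".format(query)
            with open(file_name, 'w', newline='') as fp:
                cursor.copy_expert(copy_sql, fp)

This pushes the CSV formatting onto the server and avoids any Python-level row handling.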

1 Answer


From the psycopg2 docs:

When a database query is executed, the Psycopg cursor usually fetches all the records returned by the backend, transferring them to the client process. If the query returned a huge amount of data, a proportionally large amount of memory will be allocated by the client.

If the dataset is too large to be practically handled on the client side, it is possible to create a server side cursor. Using this kind of cursor it is possible to transfer to the client only a controlled amount of data, so that a large dataset can be examined without keeping it entirely in memory.
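In the code from the question the cursor is an ordinary client-side cursor, so setting itersize has no effect and the whole result set is fetched into memory at once. Passing a name to connection.cursor() makes it a server-side cursor. Below is a minimal sketch of the same export, reusing create_server_connection, sql_file_path and file_name from the question; the cursor name is arbitrary:

    with create_server_connection() as connection:
        # A named cursor is a server-side cursor: rows stay on the server and are
        # fetched in batches of `itersize` as the loop iterates.
        with connection.cursor(name='events_export') as cursor:
            cursor.itersize = 20000
            query = open(sql_file_path, mode='r').read()
            cursor.execute(query)
            with open(file_name, 'w', newline='') as fp:
                writer = csv.writer(fp)
                for row in cursor:
                    writer.writerow(row)

With this change the client holds at most one batch of rows at a time, so memory use should scale with itersize rather than with the size of the table.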
