
I am trying to run a query against a table that has about 10 million rows. Basically I run `Select * from events` and write the result to a CSV file.

Here is the code:

    import csv
    import json
    import os
    import sys

    import psycopg2


    def create_server_connection():
        # Connection parameters come from a JSON blob in the environment.
        params = json.loads(os.environ["DB_REPLICA_CONNECTION"])
        try:
            conn = psycopg2.connect(
                database=params["PGDATABASE"],
                user=params["PGUSER"],
                password=params["PGPASSWORD"],
                host=params["PGHOST"],
                port=params["PGPORT"],
            )
        except psycopg2.OperationalError as e:
            print('Unable to connect!\n{0}'.format(e))
            sys.exit(1)
        return conn


    with create_server_connection() as connection:
        cursor = connection.cursor()
        cursor.itersize = 20000
        query = open(sql_file_path, mode='r').read()
        print(query)
        cursor.execute(query)
        with open(file_name, 'w', newline='') as fp:
            writer = csv.writer(fp)
            for row in cursor:
                writer.writerow(row)

However, for some reason this whole process uses a lot of memory. I am running it as an AWS Batch job, and the job exits with the error `OutOfMemoryError: Container killed due to memory usage`.

Is there a way to reduce memory usage?

  • Did you try pandas read_sql? You can read it in chunks and write it out: pandas.pydata.org/docs/reference/api/pandas.read_sql.html Commented Jun 27, 2022 at 11:31
  • Why not use COPY to export to CSV directly? There's no faster way than having the database itself export the data. Commented Jun 27, 2022 at 11:34
  • @TomRon pandas will be far worse. If the process fails when the data is cached once, it will fail far faster if the data has to be cached into a dataframe before exporting starts. Commented Jun 27, 2022 at 11:36
  • @PanagiotisKanavos What if I want to run a complex query rather than a simple Select * statement? Commented Jun 27, 2022 at 12:17
  • COPY doesn't care about the query's complexity, provided it's actually a query rather than a multi-statement script. Client cursors cache the data on the client anyway; that's why they're called client cursors: the data is on the client. You'll have to use a server-side cursor (just pass a name to cursor()). The default batch size is, I think, 2K rows, but it can be changed with the itersize attribute (see the COPY sketch after these comments). Commented Jun 27, 2022 at 12:21
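As suggested in the comments, the least memory-hungry option is to let PostgreSQL produce the CSV itself with COPY. Below is a minimal sketch, not the poster's code, using psycopg2's copy_expert and reusing create_server_connection, sql_file_path and file_name from the question; it assumes the .sql file contains a single SELECT statement:

    with create_server_connection() as connection:
        with connection.cursor() as cursor:
            # Read the SELECT from the .sql file and drop any trailing semicolon
            # so it can be wrapped in COPY (...).
            query = open(sql_file_path, mode='r').read().rstrip().rstrip(';')
            # COPY (...) TO STDOUT streams the rows straight into the file object,
            # so the client never holds the whole result set in memory.
            copy_sql = "COPY ({0}) TO STDOUT WITH CSV HEADER".format(query)
            with open(file_name, 'w', newline='') as fp:
                cursor.copy_expert(copy_sql, fp)

This pushes the CSV formatting onto the server and avoids any Python-level row handling.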

1 Answer


From the psycopg2 docs:

When a database query is executed, the Psycopg cursor usually fetches all the records returned by the backend, transferring them to the client process. If the query returned a huge amount of data, a proportionally large amount of memory will be allocated by the client.

If the dataset is too large to be practically handled on the client side, it is possible to create a server side cursor. Using this kind of cursor it is possible to transfer to the client only a controlled amount of data, so that a large dataset can be examined without keeping it entirely in memory.
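In the code from the question the cursor is an ordinary client-side cursor, so setting itersize has no effect and the whole result set is fetched into memory at once. Passing a name to connection.cursor() makes it a server-side cursor. Below is a minimal sketch of the same export, reusing create_server_connection, sql_file_path and file_name from the question; the cursor name is arbitrary:

    with create_server_connection() as connection:
        # A named cursor is a server-side cursor: rows stay on the server and are
        # fetched in batches of `itersize` as the loop iterates.
        with connection.cursor(name='events_export') as cursor:
            cursor.itersize = 20000
            query = open(sql_file_path, mode='r').read()
            cursor.execute(query)
            with open(file_name, 'w', newline='') as fp:
                writer = csv.writer(fp)
                for row in cursor:
                    writer.writerow(row)

With this change the client holds at most one batch of rows at a time, so memory use should scale with itersize rather than with the size of the table.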
