End Objective

  • Download some data from the internet (about 5 GB in size)

  • Possibly convert some strings/datetimes

  • Upload to Postgres database


I have written some code which uploads some data to a Postgres database. "Upload" here meaning "replace all existing data in the table and insert new data". Of course, there is no SQL "upload" statement, so what this really does is run a sequence of INSERT statements.

I have tried several variations in search of better performance. The data is read from a CSV file which is about 5 GB in size.

The original version of my code took about 90 minutes to complete. That is 5120 MB in 90 minutes, or about 57 MB/minute (roughly 1 MB/s). The data is on the same host as the database, so there is no real network between source and sink; everything goes over localhost through the Linux kernel.

The first version uses the following code.

df.to_sql(
    name='table',
    schema='schema',
    con=postgres_engine,
    if_exists='replace',
    index=False,
)

I tried some variations, including adding

chunksize=1000,

and

method='multi',
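
For reference, the combined call looked roughly like this (the parameter values shown are simply the ones I tried):

df.to_sql(
    name='table',
    schema='schema',
    con=postgres_engine,
    if_exists='replace',
    index=False,
    chunksize=1000,
    method='multi',
)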

I was using psycopg2 and psycopg2-binary.

When I enabled method='multi' the process never completed: it had still not finished after 5 hours, so I killed it.

I did not notice any particular improvement from chunksize; I tried values of 1000 and 100000.

psycopg2 does offer something closer to a true bulk "upload" in the form of copy_expert. I found this took about 15 minutes to complete.

connection = psycopg2.connect(
    host=postgres_host,
    user=postgres_user,
    password=postgres_password,
    dbname=postgres_database,
    port=5432,
)

buffer = io.StringIO()
df.to_csv(buffer, index=False, header=False)
buffer.seek(0)

cursor = connection.cursor()
cursor.copy_expert(
    'COPY schema.table FROM STDIN WITH (FORMAT csv)',
    buffer,
)
connection.commit()

A significant improvement, but it is still only about 5 MB/s, which seems very slow.

The sequence of operations may seem a bit odd. First a file is read with pandas.read_csv, and the data is then written back out to a StringIO object with DataFrame.to_csv. The reason, although outside the scope of this question, is that I need some way of translating a datetime string from one (weird) format into an ISO format which Postgres/Pandas understands.

The data is sourced from a call to requests. In other words, downloaded from some remote server.


Aside: Datetime format

If you want more detail: the source provides dates in the following format, which is not ISO and which I therefore assume Postgres cannot read directly.

format='%Y-%m-%d %H:%M'
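
As a sketch, the conversion can be done in pandas before writing the buffer; the column name 'timestamp' below is only illustrative:

df['timestamp'] = pandas.to_datetime(df['timestamp'], format='%Y-%m-%d %H:%M')
# DataFrame.to_csv then writes the parsed values in an ISO-style format,
# which Postgres accepts.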

My conclusion from trying a few different methods, with variations on their parameters, is that I have so far not been able to achieve better performance than about 5 MB/s.

There seem to be a large number of possible methods once you account for all the variations in arguments that can be supplied to each call.

At the end of the day, all I really need to do is download some data from the internet, and upload it to Postgres, possibly with some string manipulation to account for differences in recognized datetime formats.

What method should I use?

  • psycopg (psycopg3) or psycopg2?

  • sqlalchemy or not?

  • What packages need to be installed with pip3?

  • Should I use pandas.to_sql or some other method such as copy_expert or something else?


Solution

I found the following psycopg3 solution works well and gives good performance.

pip3 install psycopg[binary]

And yes, the above is correct: the package named psycopg is psycopg3, while psycopg2 remains psycopg2, somewhat confusingly.

Example code is provided below. Here I assume that some_data is a bytes object, returned by something like response = requests.get(...); some_data = response.content.

If you are sourcing data from some other location, such as a local file, change this accordingly.
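
For completeness, a minimal sketch of the download step (the URL is a placeholder):

import requests

response = requests.get('https://example.com/data.csv')  # placeholder URL
response.raise_for_status()
some_data = response.content  # raw bytes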

Note also that you may be able to obtain better performance by not using pandas at all. In my particular case I need to read the data, modify it, and then write it back to a StringIO object. However, you may be able to read your data directly into a StringIO object, bypassing at least two expensive operations (read_csv and to_csv); a sketch of that variant appears after the example below.

import pandas
import psycopg
import io

# libpq connection strings are space-separated key=value pairs
postgres_connection_string = (
    f'user={postgres_user} password={postgres_password} '
    f'host={postgres_host} dbname={postgres_database} port=5432'
)

df = pandas.read_csv(io.BytesIO(some_data), header=None)
buffer = io.StringIO()
df.to_csv(buffer, index=False, header=False)
buffer.seek(0)
columns = '(column_1, column_2)'
with psycopg.connect(postgres_connection_string) as connection:
    with connection.cursor() as cursor:
        with cursor.copy(f'COPY some_schema.some_table {columns} FROM STDIN WITH (FORMAT csv)') as copy:
            copy.write(buffer.read())
    connection.commit()

I see performance figures of about 51 MB/s for this.
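
As noted above, if you do not need to modify the data at all, a variant that skips pandas entirely and streams the downloaded bytes straight into COPY might look like this (a sketch, assuming some_data is raw CSV bytes with no header row):

import psycopg

with psycopg.connect(postgres_connection_string) as connection:
    with connection.cursor() as cursor:
        with cursor.copy('COPY some_schema.some_table (column_1, column_2) FROM STDIN WITH (FORMAT csv)') as copy:
            copy.write(some_data)  # psycopg accepts bytes or str here
    connection.commit()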

Replies

Why aren't you using the PostgreSQL built-in COPY function? I would expect it to be an order of magnitude faster than what you are doing.

The best practice is to use COPY, which you did. You should investigate the slowness, which looks abnormal. Do you have 57 GIN indexes or a slow trigger on the table?

There is only one index on the table, which is the primary key. There is nothing else.


If you need to work on the CSV data, I would take Pandas out of the equation and use the Python csv module directly. It speeds things up a great deal. For a quick and dirty example, see How to convert the 50000 txt file into csv. Another suggestion would be to use DuckDB or Polars; both move data quickly, especially for I/O, compared to Pandas. See https://aklaver.org/wordpress/2024/03/08/using-polars-duckdb-with-postgres/ for examples. FYI, both tools allow you to convert to and from Pandas dataframes if that is necessary.
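
A rough sketch of that suggestion, using only the csv module to rewrite the datetime column (the column index and format here are assumptions taken from the question):

import csv
import io
from datetime import datetime

def rewrite_rows(raw_text):
    # Parse the incoming CSV, reformat the (assumed) first column as an
    # ISO datetime, and return an in-memory buffer ready for COPY FROM STDIN.
    out = io.StringIO()
    writer = csv.writer(out)
    for row in csv.reader(io.StringIO(raw_text)):
        row[0] = datetime.strptime(row[0], '%Y-%m-%d %H:%M').isoformat(sep=' ')
        writer.writerow(row)
    out.seek(0)
    return out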
