End Objective
Download some data from the internet (about 5 GB in size)
Possibly convert some strings/datetimes
Upload to Postgres database
I have written some code which uploads some data to a Postgres database. "Upload" here meaning "replace all existing data in the table and insert new data". Of course, there is no SQL "upload" statement, so what this really does is run a sequence of INSERT statements.
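For concreteness, a rough sketch of that pattern in plain SQL via psycopg2 is shown below. The table, columns, and data here are hypothetical, and this is not the pandas-based code I actually used; it just illustrates what "replace then insert" amounts to.

import psycopg2

rows = [('2024-01-31 09:30', 42), ('2024-02-01 17:05', 7)]  # hypothetical data

connection = psycopg2.connect(host='localhost', dbname='mydb', user='me', password='secret')
with connection, connection.cursor() as cursor:
    # Clear out whatever is in the table, then insert the new rows
    cursor.execute('TRUNCATE some_schema.some_table')
    cursor.executemany(
        'INSERT INTO some_schema.some_table (timestamp_column, value_column) VALUES (%s, %s)',
        rows,
    )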
I have tried several variations, searching for improved performance. The data is read from a csv file. The file size is 5 GB.
The original version of my code took about 90 minutes to complete. That is 5120 MB in 90 minutes, or about 57 MB/minute, roughly 1 MB/s. The data is on the same host as the database. There is no network connection between the data source and sink, except for the Linux kernel. (Everything is via localhost.)
The first version uses the following code.
df.to_sql(
    name='table',
    schema='schema',
    con=postgres_engine,
    if_exists='replace',
    index=False,
)
I tried some variations, including adding chunksize=1000 and method='multi'.
I was using psycopg2 and psycopg2-binary.
When I enabled method='multi' the process never completed. It took longer than 5 hours, then I killed it.
I did not notice any particular improvement from chunksize. I tried the values 1000 and 100000.
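For reference, the varied call looked roughly like this (a sketch using the same engine and table names as above):

df.to_sql(
    name='table',
    schema='schema',
    con=postgres_engine,
    if_exists='replace',
    index=False,
    chunksize=1000,   # also tried 100000
    method='multi',   # this variant never completed
)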
psycopg2 does offer something closer to an "upload" statement in the form of copy_expert. I found this took about 15 minutes to complete.
import io

import psycopg2

connection = psycopg2.connect(
    host=postgres_host,
    user=postgres_user,
    password=postgres_password,
    dbname=postgres_database,
    port=5432,
)

# Serialize the DataFrame to CSV in memory, then stream it into Postgres with COPY
buffer = io.StringIO()
df.to_csv(buffer, index=False, header=False)
buffer.seek(0)

cursor = connection.cursor()
cursor.copy_expert(
    'COPY schema.table FROM STDIN WITH (FORMAT csv)',
    buffer,
)
connection.commit()
A significant improvement but it is only about 5 MB/s. That still seems very slow.
The sequence of operations may seem a bit weird. First a file is read using pandas.read_csv, then the data is written to a StringIO object using DataFrame.to_csv. It is outside the scope of this question, but I need some way of translating a datetime string from one (weird) format into a form which Postgres/Pandas understands.
The data is sourced from a call to requests. In other words, downloaded from some remote server.
Aside: Datetime format
If you want more detail, the source provides dates in the following format, which is not ISO and which I therefore assume is not readable by Postgres.
format='%Y-%m-%d %H:%M'
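A minimal sketch of the conversion I do with pandas is shown below; the column name timestamp_column and the sample values are hypothetical.

import io

import pandas

# Hypothetical single-column DataFrame; in practice this is the data read from the source
df = pandas.DataFrame({'timestamp_column': ['2024-01-31 09:30', '2024-02-01 17:05']})

# Parse the source format explicitly, so to_csv later emits an unambiguous ISO-style string
df['timestamp_column'] = pandas.to_datetime(df['timestamp_column'], format='%Y-%m-%d %H:%M')

buffer = io.StringIO()
df.to_csv(buffer, index=False, header=False)  # datetimes serialize as e.g. 2024-01-31 09:30:00
buffer.seek(0)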
My conclusion from trying a few different methods, with variations on their parameters, is that I have so far not been able to achieve better performance than about 5 MB/s.
It seems that there are a large number of possible methods, accounting for all the variations in arguments which can be supplied to each function call.
At the end of the day, all I really need to do is download some data from the internet, and upload it to Postgres, possibly with some string manipulation to account for differences in recognized datetime formats.
What method should I use?
psycopg (psycopg3) or psycopg2?
sqlalchemy or not?
What packages need to be installed with pip3?
Should I use pandas.to_sql or some other method such as copy_expert or something else?
Solution
I found the following psycopg3 solution works well and gives good performance.
pip3 install psycopg[binary]
And yes, the above is correct: the package named psycopg is psycopg3, while psycopg2 is psycopg2, somewhat confusingly.
Example code is provided below. Here I assume that some_data is a bytes object, returned by something like response = requests.get(url) followed by some_data = response.content.
If you are sourcing data from some other location, such as a local file, change this accordingly.
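For example, something along these lines (the URL is a placeholder):

import requests

# Placeholder URL; substitute the real data source
response = requests.get('https://example.com/data.csv')
response.raise_for_status()
some_data = response.content  # bytes, as assumed by the code below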
Note also that you may be able to obtain improved performance by not using pandas at all. In my particular case I need to read the data, modify it, and then write it back to a StringIO object, but you may be able to stream your data directly into COPY, bypassing at least two expensive operations (read_csv and to_csv). A sketch of that shortcut follows the main example below.
import io

import pandas
import psycopg

# libpq-style connection string: space-separated key=value pairs
postgres_connection_string = (
    f'user={postgres_user} password={postgres_password} '
    f'host={postgres_host} dbname={postgres_database} port=5432'
)

df = pandas.read_csv(io.BytesIO(some_data), header=None)

# Serialize the (possibly modified) DataFrame back to CSV in memory
buffer = io.StringIO()
df.to_csv(buffer, index=False, header=False)
buffer.seek(0)

columns = '(column_1, column_2)'

with psycopg.connect(postgres_connection_string) as connection:
    with connection.cursor() as cursor:
        with cursor.copy(f'COPY some_schema.some_table {columns} FROM STDIN WITH (FORMAT csv)') as copy:
            copy.write(buffer.read())
    connection.commit()
I see performance figures of about 51 MB/s for this.
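Finally, here is the sketch mentioned above for skipping pandas entirely. If the downloaded CSV already matches the table's column order and uses a datetime format Postgres accepts, the response body can be streamed straight into COPY. The URL, table, and column names are placeholders.

import psycopg
import requests

# Placeholder URL; this assumes the downloaded file needs no reformatting at all
response = requests.get('https://example.com/data.csv')
response.raise_for_status()

with psycopg.connect(postgres_connection_string) as connection:
    with connection.cursor() as cursor:
        with cursor.copy('COPY some_schema.some_table (column_1, column_2) FROM STDIN WITH (FORMAT csv)') as copy:
            copy.write(response.content)  # Copy.write accepts bytes as well as str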