
Is ≈105 seconds per 1 million rows slow or fast for inserting into a local PostgreSQL database, into a table with 2 indexes and 4 columns?

Python Code:

import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed
from sqlalchemy import create_engine

# host, tbl_name and data_path are assumed to be defined elsewhere
num = 32473068    # total rows in the CSV
batch = 1000000   # rows read from the CSV per iteration

def main(data):
    engine = create_engine('postgresql://***:****@' + host + ':5432/kaggle')
    data.to_sql(con=engine, name=tbl_name, if_exists='append', index=False)

for i in range(0, num, batch):
    data = pd.read_csv(data_path + 'app_events.csv', skiprows=i, nrows=batch)
    data.columns = ['event_id', 'app_id', 'is_installed', 'is_active']
    data = data.reset_index(drop=True)
    batchSize = 10000
    batchList = [data.iloc[x:x + batchSize].reset_index(drop=True)
                 for x in range(0, len(data), batchSize)]
    with ThreadPoolExecutor(max_workers=30) as executor:
        futures = [executor.submit(main, d) for d in batchList]
        for future in as_completed(futures):
            future.result()  # propagate any exception raised in a worker
  • Postgres has a special command for importing CSV files, COPY; nothing should be faster (a sketch follows these comments). Commented Jul 14, 2016 at 9:05
  • pg_bulkload is actually faster. Commented Jul 14, 2016 at 9:27
  • In addition to @user2189731's excellent point about enabling use_batch_mode=True when instantiating your SQLAlchemy engine, I would suggest that you skip multi-threading for this task. It is unlikely to provide any beneficial concurrency and probably loses far more to lock contention than you would gain. Commented Jan 2, 2018 at 2:12
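For reference, a minimal sketch of what the COPY approach from the first comment could look like with psycopg2. The connection string is a placeholder and the table is assumed to already exist with the four columns from the question, so adjust both to your setup.

import psycopg2

# placeholders: adjust connection details and table name to your setup
conn = psycopg2.connect('postgresql://user:password@localhost:5432/kaggle')
with conn, conn.cursor() as cur, open('app_events.csv') as f:
    # stream the whole CSV into the table in one server-side COPY
    cur.copy_expert(
        "COPY app_events (event_id, app_id, is_installed, is_active) "
        "FROM STDIN WITH (FORMAT csv, HEADER true)",
        f,
    )
conn.close()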

1 Answer


It depends on your hardware too. As a reference, my old i5 laptop with an HDD takes ~300 s to insert 0.1M rows (roughly 200-300 megabytes).

I learned from other similar questions that splitting large inserts into bulks when using the insert() command can speed things up. Since you're using pandas, I assume it already has some optimization, but I suggest you run a quick test (see the sketch below) to see whether it helps too.
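A minimal sketch of such a test, using placeholder connection details and the table name from the question; chunksize and method='multi' are standard pandas to_sql parameters (method requires a reasonably recent pandas version).

import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string and table name
engine = create_engine('postgresql://user:password@localhost:5432/kaggle')

df = pd.read_csv('app_events.csv', nrows=100000)
df.to_sql('app_events', con=engine, if_exists='append', index=False,
          chunksize=10000,   # send rows in bulks of 10,000
          method='multi')    # one multi-row INSERT per chunk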

  • Pandas actually uses a non-optimized INSERT command by default. See (to_sql + sqlalchemy + copy from + postgresql engine?). So bulk insert or other methods should be used to improve performance.

  • SQLAlchemy 1.2 uses bulk insert when you initialize your engine with the "use_batch_mode=True" parameter (see the sketch below). I saw a 100x speedup on my i5+HDD laptop: with 0.1M records it originally took me 300 s, and now it takes 3 s. If your computer is better than mine, I bet you will see this tremendous speedup with your 1M records.
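A minimal sketch of that engine setup, assuming the same placeholder connection details and table name as above; use_batch_mode is the SQLAlchemy 1.2 psycopg2-dialect flag the answer refers to.

import pandas as pd
from sqlalchemy import create_engine

# use_batch_mode=True makes the psycopg2 dialect use execute_batch()
# for executemany() calls (SQLAlchemy 1.2)
engine = create_engine(
    'postgresql+psycopg2://user:password@localhost:5432/kaggle',
    use_batch_mode=True,
)

df = pd.read_csv('app_events.csv', nrows=100000)
df.to_sql('app_events', con=engine, if_exists='append', index=False,
          chunksize=10000)

Note that on newer SQLAlchemy releases this flag was superseded by the executemany_mode parameter.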
