
Is ≈105 seconds per 1 million rows slow or fast for inserting into a local PostgreSQL database, into a table with 2 indexes and 4 columns?

Python Code:

import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed
from sqlalchemy import create_engine

# host, tbl_name and data_path are assumed to be defined elsewhere
num = 32473068    # total rows in the CSV
batch = 1000000   # rows read from the CSV per iteration

def main(data):
    engine = create_engine('postgresql://***:****@' + host + ':5432/kaggle')
    data.to_sql(con=engine, name=tbl_name, if_exists='append', index=False)

for i in range(0, num, batch):
    data = pd.read_csv(data_path + 'app_events.csv', skiprows=i, nrows=batch)
    data.columns = ['event_id', 'app_id', 'is_installed', 'is_active']
    data = data.reset_index(drop=True)
    batchSize = 10000
    batchList = [data.iloc[x:x + batchSize].reset_index(drop=True)
                 for x in range(0, len(data), batchSize)]
    with ThreadPoolExecutor(max_workers=30) as executor:
        futures = [executor.submit(main, d) for d in batchList]
        for future in as_completed(futures):
            future.result()  # propagate any exception raised in a worker
  • Postgres has a special command for importing CSV files, COPY; nothing should be faster (a sketch follows these comments). Commented Jul 14, 2016 at 9:05
  • pg_bulkload is actually faster. Commented Jul 14, 2016 at 9:27
  • In addition to @user2189731's excellent point about enabling use_batch_mode=True when instantiating your SQLAlchemy engine, I would suggest that you skip multi-threading for this task. It is unlikely to provide any beneficial concurrency and probably loses far more to lock contention than you would gain. Commented Jan 2, 2018 at 2:12
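For reference, a minimal sketch of what the COPY approach from the first comment could look like with psycopg2. The connection string is a placeholder and the table is assumed to already exist with the four columns from the question, so adjust both to your setup.

import psycopg2

# placeholders: adjust connection details and table name to your setup
conn = psycopg2.connect('postgresql://user:password@localhost:5432/kaggle')
with conn, conn.cursor() as cur, open('app_events.csv') as f:
    # stream the whole CSV into the table in one server-side COPY
    cur.copy_expert(
        "COPY app_events (event_id, app_id, is_installed, is_active) "
        "FROM STDIN WITH (FORMAT csv, HEADER true)",
        f,
    )
conn.close()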

1 Answer


It depends on your hardware too. As a reference, my old i5 laptop with an HDD takes ~300 s to insert 0.1M rows (roughly 200-300 megabytes).

I learned from other similar questions that splitting large inserts into bulks when using the insert() command can speed things up. Since you're using pandas, I assume it already has some optimization, but I suggest you run a quick test (see the sketch below) to see whether it helps too.
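A minimal sketch of such a test, using placeholder connection details and the table name from the question; chunksize and method='multi' are standard pandas to_sql parameters (method requires a reasonably recent pandas version).

import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string and table name
engine = create_engine('postgresql://user:password@localhost:5432/kaggle')

df = pd.read_csv('app_events.csv', nrows=100000)
df.to_sql('app_events', con=engine, if_exists='append', index=False,
          chunksize=10000,   # send rows in bulks of 10,000
          method='multi')    # one multi-row INSERT per chunk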

  • Pandas actually uses a non-optimized INSERT command by default. See (to_sql + sqlalchemy + copy from + postgresql engine?). So bulk insert or other methods should be used to improve performance.

  • SQLAlchemy 1.2 uses bulk insert when you initialize your engine with the "use_batch_mode=True" parameter (see the sketch below). I saw a 100x speedup on my i5+HDD laptop: with 0.1M records it originally took me 300 s, and now it takes 3 s. If your computer is better than mine, I bet you will see this tremendous speedup with your 1M records.
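A minimal sketch of that engine setup, assuming the same placeholder connection details and table name as above; use_batch_mode is the SQLAlchemy 1.2 psycopg2-dialect flag the answer refers to.

import pandas as pd
from sqlalchemy import create_engine

# use_batch_mode=True makes the psycopg2 dialect use execute_batch()
# for executemany() calls (SQLAlchemy 1.2)
engine = create_engine(
    'postgresql+psycopg2://user:password@localhost:5432/kaggle',
    use_batch_mode=True,
)

df = pd.read_csv('app_events.csv', nrows=100000)
df.to_sql('app_events', con=engine, if_exists='append', index=False,
          chunksize=10000)

Note that on newer SQLAlchemy releases this flag was superseded by the executemany_mode parameter.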
