
I come to you because I cannot fix an issue with the pandas.DataFrame.to_sql() method.

I've established the connection between my script and my database and I can send queries, but it's too slow for me.

I would like to find a way to improve the performance of my script here. Maybe someone can suggest a solution?

Here is my code:

  engine = sqlalchemy.create_engine(con['sql']['connexion_string'])
  conn = engine.connect()
  metadata = sqlalchemy.MetaData()
  try:
      if con['sql']['strategy'] == 'NEW':
          # empty the target table, then reload it
          query = sqlalchemy.Table(con['sql']['table'], metadata).delete()
          conn.execute(query)
          Sql_to_deploy.to_sql(con['sql']['table'], engine, if_exists='append',
                               index=False, chunksize=1000, method='multi')
      elif con['sql']['strategy'] == 'APPEND':
          # keep the existing rows and append the new ones
          Sql_to_deploy.to_sql(con['sql']['table'], engine, if_exists='append',
                               index=False, chunksize=1000, method='multi')
      else:
          pass
  except Exception as e:
      print(type(e))

It works but is too slow when I remove the chunksize and method parameters (almost 3 minutes for 30 thousand rows). When I put these parameters back, I get a sqlalchemy.exc.ProgrammingError...

Thanks for your help!

  • Which database back-end are you using? Commented Jul 30, 2020 at 12:05
  • Are you using mssql / pyodbc? Commented Jul 30, 2020 at 12:52
  • You might want to post this on codereview.stackexchange.com; it's more focused on this kind of request. Commented Jul 30, 2020 at 14:57
  • I'm using SQL Server, with mssql+pyodbc and Windows authentication. Commented Jul 30, 2020 at 15:48
  • Are you using "ODBC Driver 17 for SQL Server"? Commented Jul 30, 2020 at 16:35

2 Answers


For mssql+pyodbc you will get the best performance from to_sql if you

  1. use Microsoft's ODBC Driver for SQL Server, and
  2. enable fast_executemany=True in your create_engine call.

For example, this code runs in just over 3 seconds on my network:

from time import perf_counter
import pandas as pd
import sqlalchemy as sa

ngn_local = sa.create_engine("mssql+pyodbc://mssqlLocal64")
ngn_remote = sa.create_engine(
    (
        "mssql+pyodbc://sa:[email protected]/mydb"
        "?driver=ODBC+Driver+17+for+SQL+Server"
    ),
    fast_executemany=True,
)

df = pd.read_sql_query(
    "SELECT * FROM MillionRows WHERE ID <= 30000", ngn_local
)

t0 = perf_counter()
df.to_sql("pd_test", ngn_remote, index=False, if_exists="replace")
print(f"{perf_counter() - t0} seconds")

whereas with fast_executemany=False (which is the default) the same process takes 143 seconds (2.4 minutes).
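
Applied to the code in the question, the change would look something like the sketch below (hypothetical, reusing the question's con dict and Sql_to_deploy dataframe). Note that chunksize=1000 and method='multi' are dropped: fast_executemany already batches the inserts on the pyodbc side, and multi-row INSERTs are a likely source of the ProgrammingError, since SQL Server caps a single statement at 2100 parameters.

import sqlalchemy

# Sketch only: con and Sql_to_deploy come from the question; the connection string
# is assumed to point at "ODBC Driver 17 for SQL Server".
engine = sqlalchemy.create_engine(
    con['sql']['connexion_string'],
    fast_executemany=True,   # batch the INSERTs on the pyodbc side
)

Sql_to_deploy.to_sql(
    con['sql']['table'],
    engine,
    if_exists='append',
    index=False,             # no chunksize/method='multi' needed here
)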


3 Comments

Thank you so much, I had typed fastexecutemany instead of fast_executemany...
And indeed, it went from 300 seconds to 14 for 50k rows, which is clearly better! Thanks :D
I created a tiny Python package to speed up loading pandas to SQL Server, internally the package uses .Net's SqlBulkCopy to load data very quickly. The package is even faster than fast_executemany. See github.com/RusselWebber/arrowsqlbcpy

I synthesised a dataframe with 36k rows. It always inserts in < 1.5s. A dumb SELECT with an expensive WHERE clause and a GROUP BY gets marginally slower as the table grows, but always stays < 0.5s.

  1. no indexes, so inserts are fast
  2. no indexes to help selects
  3. running on MariaDB inside a Docker container on my laptop, so not at all optimised
  4. all defaults, which perform well

More information

  1. what indexes do you have on your table? Rule of thumb: fewer is better; more indexes mean slower inserts (see the index-check sketch after the output below)
  2. what timings do you see from this synthesised case?
import numpy as np
import pandas as pd
import random, time
import sqlalchemy

# build a 36,000-row dataframe: 3 years x 12 months x 1000 stocks, plus a random "Sharpe" value
a = np.array(np.meshgrid([2018, 2019, 2020], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                         [f"Stock {i+1}" for i in range(1000)],
                         )).reshape(3, -1)
a = [a[0], a[1], a[2], [round(random.uniform(-1, 2.5), 1) for e in a[0]]]
df1 = pd.DataFrame({"Year": a[0], "Month": a[1], "Stock": a[2], "Sharpe": a[3]})


temptable = "tempx"

engine = sqlalchemy.create_engine('mysql+pymysql://sniffer:[email protected]/sniffer')
conn = engine.connect()
try:
    # conn.execute(f"drop table {temptable}")
    pass
except sqlalchemy.exc.OperationalError:
    pass  # ignore drop error if table does not exist

# time the insert of the whole dataframe
start = time.time()
df1.to_sql(name=temptable, con=engine, index=False, if_exists='append')
curr = conn.execute(f"select count(*) as c from {temptable}")
res = [{curr.keys()[i]: v for i, v in enumerate(t)} for t in curr.fetchall()]
print(f"Time: {time.time()-start:.2f}s database count:{res[0]['c']}, dataframe count:{len(df1)}")
curr.close()

# time a select with an expensive where clause and a group by
start = time.time()
curr = conn.execute(f"""select Year, count(*) as c
                        from {temptable}
                        where Month=1
                        and Sharpe between 1 and 2
                        and stock like '%%2%%'
                        group by Year""")
res = [{curr.keys()[i]: v for i, v in enumerate(t)} for t in curr.fetchall()]
print(f"Time: {time.time()-start:.2f}s database result:{res} {curr.keys()}")
curr.close()
conn.close()

output

Time: 1.23s database count:360000, dataframe count:36000
Time: 0.27s database result:[{'Year': '2018', 'c': 839}, {'Year': '2019', 'c': 853}, {'Year': '2020', 'c': 882}] ['Year', 'c']
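
If you want to answer point 1 above for your own table, a quick way to list its indexes is something like the sketch below (MariaDB/MySQL syntax, reusing the connection string and table name from the example; on SQL Server you would use sp_helpindex instead).

import sqlalchemy

# Sketch: list the indexes on the target table; fewer secondary indexes means faster inserts.
engine = sqlalchemy.create_engine('mysql+pymysql://sniffer:[email protected]/sniffer')
with engine.connect() as conn:
    for row in conn.execute("SHOW INDEX FROM tempx"):
        print(row)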

4 Comments

Actually I don't have an index in my SQL because I set the index parameter to False, and I don't have an index to check on my database either. I just push my data straight into the table. But my data is integers and strings, and I have a Datetime column in my database, maybe it's that? It works for small amounts of data; I don't know why it doesn't work with large ones...
the example I've done uses strings and floats. I added a datetime column to it as well and saw no change in timings (as expected). What DBMS are you running, and is it far away on the WAN?
What do you mean by WAN?
WAN = wide area network, LAN = local area network. How separated is your Python/pandas process from your DBMS? ODBC interfaces are typically chatty, so it's important to design your infrastructure so that middle-tier data management functions are "close" to your DBMS. That's the whole point of multi-tier application design.
