
I'm new to Python, so I'm reaching out for help. I have a CSV file in an S3 bucket that I would like to import into a SQL Server table using Python and pyodbc. The file is 50 MB (400k records). My code is below. As it shows, the CSV data ends up in a dataframe; how can I use BULK INSERT to load that dataframe into the SQL Server table? If my approach won't work, please advise a different one.

import boto3
import pandas as pd
import pyodbc

# Connection to S3
s3 = boto3.client(
    service_name = 's3',
    region_name = 'us-gov-west-1',
    aws_access_key_id = 'ZZZZZZZZZZZZZZZZZZ',
    aws_secret_access_key = 'AAAAAAAAAAAAAAAAA')

# Connection to SQL Server
server = 'myserver.amazonaws.com'
path = 'folder1/folder2/folder3/myCSVFile.csv'

cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=' + server + ';DATABASE=DB-staging;UID=User132;PWD=XXXXXX')
cursor = cnxn.cursor()
    
obj_sum = s3.get_object(Bucket = 'my_bucket', Key = path)
csv_data = pd.read_csv(obj_sum['Body'])
df = pd.DataFrame(csv_data, columns = ['SYSTEM_NAME', 'BUCKET_NAME', 'LOCATION', 'FILE_NAME', 'LAST_MOD_DATE', 'FILE_SIZE'])
#print(df.head(n=15).to_string(index=False))

# Insert DataFrame to table
cursor.execute("""truncate table dbo.table1""")
cursor.execute("""BULK INSERT dbo.table1 FROM """ + .....# what do I put here since data is in dataframe??)

I tried looping through the dataframe, and it took 20 minutes to insert 5k records (code below). Looping through each record is an option, but a poor one, which is why I'm moving towards a bulk insert if possible. A batched alternative is sketched after the loop.

for i in df.itertuples(index = False):
    if i.FILE_SIZE != 0:
       cursor.execute("""insert into dbo.table1 (SYSTEM_NAME, BUCKET_NAME, X_LOCATION, FILE_NAME, LAST_MOD_DATE, FILE_SIZE) 
                  values (?,?,?,?,?,?)""", i.SYSTEM_NAME, i.BUCKET_NAME, i.LOCATION, i.FILE_NAME, i.LAST_MOD_DATE, i.FILE_SIZE)
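
(For reference, pyodbc itself can batch a parameterized insert when fast_executemany is enabled on the cursor. The following is a rough, untested sketch that assumes the same table, columns, and connection as above.)

# Sketch: batch the same parameterized insert with pyodbc's fast_executemany
cursor.fast_executemany = True
cols = ['SYSTEM_NAME', 'BUCKET_NAME', 'LOCATION', 'FILE_NAME', 'LAST_MOD_DATE', 'FILE_SIZE']
# Plain tuples for the rows we want to keep (FILE_SIZE != 0)
rows = list(df.loc[df['FILE_SIZE'] != 0, cols].itertuples(index = False, name = None))
cursor.executemany("""insert into dbo.table1 (SYSTEM_NAME, BUCKET_NAME, X_LOCATION, FILE_NAME, LAST_MOD_DATE, FILE_SIZE)
               values (?,?,?,?,?,?)""", rows)
cnxn.commit()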

Lastly, a bonus question: I would like to check whether the "FILE_SIZE" column in my dataframe equals 0, and if it does, skip that record and move on to the next one.
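
(For the bonus question, a vectorized pandas filter is probably simpler than checking inside a loop; a one-line sketch, assuming FILE_SIZE is numeric:)

# Drop rows whose FILE_SIZE is 0 before inserting
df = df[df['FILE_SIZE'] != 0]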

Thank you in advance.

5 Comments
  • Do I need to use turbodbc and pyarrow? Create an Arrow table instead of a dataframe? Commented Oct 5, 2020 at 4:42
  • Why not to_sql? It uses a multiple-row insert, which should be pretty fast (a sketch follows these comments). Did you benchmark it and find it too slow? Commented Oct 5, 2020 at 4:50
  • Yes. I tried it, and my job is still running 9 hours later for a 50 MB file. Commented Oct 5, 2020 at 14:03
  • Related answer here. Commented Oct 5, 2020 at 14:20
  • … and for filtering rows see this answer. Commented Oct 5, 2020 at 14:34
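
(For the to_sql route suggested in the comments, here is a hedged, untested sketch reusing the server, database, and table names from the question; the ODBC driver name is an assumption. method='multi' is what produces batched multi-row INSERT statements, and chunksize keeps each statement under SQL Server's roughly 2100-parameter limit.)

import sqlalchemy as sal

# Assumed driver name; 6 columns * 300 rows = 1800 parameters per statement
engine = sal.create_engine("mssql+pyodbc://User132:XXXXXX@" + server + "/DB-staging?driver=ODBC+Driver+17+for+SQL+Server")
df.to_sql('table1', con = engine, schema = 'dbo', index = False, if_exists = 'append', method = 'multi', chunksize = 300)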

1 Answer


Thanks for the help.

Using fast_executemany=True did the job for me:

import sqlalchemy as sal

engine = sal.create_engine("mssql+pyodbc://username:password@" + server + ":1433/db-name?driver=ODBC+Driver+17+for+SQL+Server&Trusted_Connection=yes",
                           fast_executemany = True)
conn = engine.connect()

I had to change my code around to use sqlalchemy, but it's working great now.

The call to upload the dataframe to SQL Server is below:

df.to_sql('table1', con = engine, index = False, if_exists = 'replace')
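
(A fuller, untested sketch of how the pieces fit together, reusing the names from the question; note that if_exists='replace' drops and recreates the table, so use 'append' if the existing dbo.table1 definition should be kept.)

import sqlalchemy as sal

engine = sal.create_engine(
    "mssql+pyodbc://username:password@" + server + ":1433/db-name?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany = True)

# Skip zero-byte files, then push the dataframe in one bulk call
df_nonzero = df[df['FILE_SIZE'] != 0]
df_nonzero.to_sql('table1', con = engine, schema = 'dbo', index = False, if_exists = 'append')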

3 Comments

What exactly is sal referring to?
Table "Table name" Already exists
@AkshayGupta probably import sqlalchemy as sal
