
I'm new to Python, so I'm reaching out for help. I have a CSV file in an S3 bucket that I would like to import into a SQL Server table using Python and pyodbc. The file is 50 MB (400k records). My code is below. As it shows, the CSV data ends up in a dataframe; how can I use BULK INSERT to load that dataframe into the SQL Server table? If my approach won't work, please advise a different one.

import boto3
import pandas as pd
import pyodbc

# Connection to S3
s3 = boto3.client(
    service_name = 's3',
    region_name = 'us-gov-west-1',
    aws_access_key_id = 'ZZZZZZZZZZZZZZZZZZ',
    aws_secret_access_key = 'AAAAAAAAAAAAAAAAA')

# Connection to SQL Server
server = 'myserver.amazonaws.com'
path = 'folder1/folder2/folder3/myCSVFile.csv'

cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=' + server + ';DATABASE=DB-staging;UID=User132;PWD=XXXXXX')
cursor = cnxn.cursor()
    
obj_sum = s3.get_object(Bucket = 'my_bucket', Key = path)
csv_data = pd.read_csv(obj_sum['Body'])
df = pd.DataFrame(csv_data, columns = ['SYSTEM_NAME', 'BUCKET_NAME', 'LOCATION', 'FILE_NAME', 'LAST_MOD_DATE', 'FILE_SIZE'])
#print(df.head(n=15).to_string(index=False))

# Insert DataFrame to table
cursor.execute("""truncate table dbo.table1""")
cursor.execute("""BULK INSERT dbo.table1 FROM """ + .....# what do I put here since data is in dataframe??)

I tried looping through the dataframe, and it took 20 minutes to insert 5k records (code below). Looping through each record is an option, but a poor one, which is why I'm moving towards a bulk insert if possible. A batched alternative is sketched after the loop.

for i in df.itertuples(index = False):
    if i.FILE_SIZE != 0:
       cursor.execute("""insert into dbo.table1 (SYSTEM_NAME, BUCKET_NAME, X_LOCATION, FILE_NAME, LAST_MOD_DATE, FILE_SIZE) 
                  values (?,?,?,?,?,?)""", i.SYSTEM_NAME, i.BUCKET_NAME, i.LOCATION, i.FILE_NAME, i.LAST_MOD_DATE, i.FILE_SIZE)
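
(For reference, pyodbc itself can batch a parameterized insert when fast_executemany is enabled on the cursor. The following is a rough, untested sketch that assumes the same table, columns, and connection as above.)

# Sketch: batch the same parameterized insert with pyodbc's fast_executemany
cursor.fast_executemany = True
cols = ['SYSTEM_NAME', 'BUCKET_NAME', 'LOCATION', 'FILE_NAME', 'LAST_MOD_DATE', 'FILE_SIZE']
# Plain tuples for the rows we want to keep (FILE_SIZE != 0)
rows = list(df.loc[df['FILE_SIZE'] != 0, cols].itertuples(index = False, name = None))
cursor.executemany("""insert into dbo.table1 (SYSTEM_NAME, BUCKET_NAME, X_LOCATION, FILE_NAME, LAST_MOD_DATE, FILE_SIZE)
               values (?,?,?,?,?,?)""", rows)
cnxn.commit()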

Lastly, a bonus question: I would like to check whether the "FILE_SIZE" column in my dataframe equals 0, and if it does, skip that record and move on to the next one.
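
(For the bonus question, a vectorized pandas filter is probably simpler than checking inside a loop; a one-line sketch, assuming FILE_SIZE is numeric:)

# Drop rows whose FILE_SIZE is 0 before inserting
df = df[df['FILE_SIZE'] != 0]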

Thank you in advance.

5 Comments
  • Do I need to use turbodbc and pyarrow? Create an Arrow table instead of a dataframe? Commented Oct 5, 2020 at 4:42
  • Why not to_sql? It uses a multiple-row insert, which should be pretty fast (a sketch follows these comments). Did you benchmark it and find it too slow? Commented Oct 5, 2020 at 4:50
  • Yes. I tried it, and my job is still running 9 hours later for a 50 MB file. Commented Oct 5, 2020 at 14:03
  • Related answer here. Commented Oct 5, 2020 at 14:20
  • … and for filtering rows see this answer. Commented Oct 5, 2020 at 14:34
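
(For the to_sql route suggested in the comments, here is a hedged, untested sketch reusing the server, database, and table names from the question; the ODBC driver name is an assumption. method='multi' is what produces batched multi-row INSERT statements, and chunksize keeps each statement under SQL Server's roughly 2100-parameter limit.)

import sqlalchemy as sal

# Assumed driver name; 6 columns * 300 rows = 1800 parameters per statement
engine = sal.create_engine("mssql+pyodbc://User132:XXXXXX@" + server + "/DB-staging?driver=ODBC+Driver+17+for+SQL+Server")
df.to_sql('table1', con = engine, schema = 'dbo', index = False, if_exists = 'append', method = 'multi', chunksize = 300)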

1 Answer


Thanks for the help.

Using fast_executemany=True did the job for me:

import sqlalchemy as sal

engine = sal.create_engine("mssql+pyodbc://username:password@" + server + ":1433/db-name?driver=ODBC+Driver+17+for+SQL+Server&Trusted_Connection=yes",
                           fast_executemany = True)
conn = engine.connect()

I had to change my code around to use sqlalchemy, but it's working great now.

The call to upload the dataframe to SQL Server is below:

df.to_sql('table1', con = engine, index = False, if_exists = 'replace')
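
(A fuller, untested sketch of how the pieces fit together, reusing the names from the question; note that if_exists='replace' drops and recreates the table, so use 'append' if the existing dbo.table1 definition should be kept.)

import sqlalchemy as sal

engine = sal.create_engine(
    "mssql+pyodbc://username:password@" + server + ":1433/db-name?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany = True)

# Skip zero-byte files, then push the dataframe in one bulk call
df_nonzero = df[df['FILE_SIZE'] != 0]
df_nonzero.to_sql('table1', con = engine, schema = 'dbo', index = False, if_exists = 'append')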

3 Comments

What exactly is sal referring to?
Table "Table name" Already exists
@AkshayGupta probably import sqlalchemy as sal
