
I am trying to bulk insert pandas DataFrame data into PostgreSQL. The DataFrame has 35 columns and the PostgreSQL table has 45 columns. I am selecting 12 matching columns from the DataFrame and inserting them into the PostgreSQL table. For this I am using the following code snippet:

df = pd.read_excel(raw_file_path, sheet_name='Sheet1', usecols=col_names)  # col_names = list of desired columns (12 columns)
cols = ','.join(list(df.columns))
tuples = [tuple(x) for x in df.to_numpy()]
query = "INSERT INTO {0}.{1} ({2}) VALUES (%%s,%%s,%%s,%%s,%%s,%%s,%%s,%%s,%%s,%%s,%%s,%%s);".format(schema_name,table_name,cols)
curr = conn.cursor()
try:
    curr.executemany(query,tuples)
    conn.commit()
    curr.close()
except (Exception, psycopg2.DatabaseError) as error:
    print("Error: %s" % error)
    conn.rollback()
    curr.close()
    return 1
finally:
    if conn is not None:
        conn.close()
        print('Database connection closed.')

When I run this, I get the following error:

SyntaxError: syntax error at or near "%"
LINE 1: ...it,purchase_group,indenter_name,wbs_code) VALUES (%s,%s,%s,%...

Even if I use ? in place of %%s, I still get this error.

Can anybody throw some light on this?

P.S. I am using PostgreSQL version 10.

Comments
  • %s would take a string value from a Python variable. What do you want to do with those %s things? You already put variables in the string with {0} etc. Do you want to pass %s on, or put some value there? Commented Sep 13, 2020 at 3:29
  • @antont: I want to pass %s on, i.e. row-wise values as tuples, into the db. The objective is to bulk insert. Even if I use a query string like "INSERT INTO {0}.{1} ({2})".format(schema_name,table_name,cols) + "VALUES(?,?,...?)", I get the same error. Commented Sep 13, 2020 at 3:32
  • Well, put {3}, {4}, etc. if you want more parameters; I think it's better not to mix two syntaxes on one line. Commented Sep 13, 2020 at 3:54
  • Why are you doubling the %? Commented Sep 13, 2020 at 3:58
  • @parafit: I saw %%s used somewhere, hence I'm using it. Commented Sep 13, 2020 at 4:04

1 Answer


What you're doing now is actually inserting the dataframe one row at a time. Even if it worked, it would be an extremely slow operation. At the same time, interpolating the schema, table, and column names into the query string with .format() leaves you open to SQL injection if any of those values ever come from user input.
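For completeness, the immediate cause of the syntax error is the doubled placeholder: psycopg2 treats %% as an escaped literal %, so the query that reaches the server contains a bare %s, which PostgreSQL cannot parse. The placeholder should be a single %s; it's psycopg2 syntax, not Python string formatting, so .format() leaves it alone. (? is the sqlite3 paramstyle and won't work here at all.) Here is a minimal sketch of a corrected bulk insert, reusing the variable names from the question and using execute_values from psycopg2.extras to batch the rows, which is my own substitution rather than anything the question used:

from psycopg2.extras import execute_values

# A single %s placeholder: execute_values expands it into batched
# multi-row VALUES lists, so all tuples go in a handful of statements.
query = "INSERT INTO {0}.{1} ({2}) VALUES %s".format(schema_name, table_name, cols)

with conn.cursor() as curr:
    execute_values(curr, query, tuples)
conn.commit()

That said, the approach below avoids hand-writing the query entirely.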

I wouldn't reinvent the wheel. Pandas has a to_sql method that takes a dataframe and writes it to the database for you, and its if_exists parameter lets you specify what to do when the table already exists.

It works with SQLAlchemy, which has excellent support for PostgreSQL. And even though it might be a new package to explore and install, you're not required to use it anywhere else to make this work.

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://localhost:5432/mydatabase')

pd.read_excel(
    raw_file_path,
    sheet_name='Sheet1',
    usecols=col_names  # col_names = list of desired columns (12 columns)
).to_sql(
    schema=schema_name,
    name=table_name,
    con=engine,
    if_exists='append',  # the table already exists, so append rather than fail
    index=False,         # don't write the DataFrame index as an extra column
    method='multi'       # batch rows into multi-row INSERT statements
)
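Two parameters here matter for this use case: if_exists='append' is needed because the target table already exists (to_sql's default, 'fail', raises an error in that case), and index=False keeps the DataFrame's index from being written as an extra column. For very large frames you can also pass chunksize to cap how many rows go into each batched INSERT.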

Comments

  • This is good. However, I was looking for a more traditional and generic approach.
  • Can you explain what you mean by "traditional"? I think this is about as generic as it gets.
  • Where is the actual insertion happening?
  • Inside the function; it's handled by pandas. See the documentation I linked for an example: they call the function, then SELECT and see that the rows have been inserted.
  • Thanks Ruben! This worked. Just a bit curious: how do I close a connection when using create_engine? (see the note after these comments)
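On that last question: to_sql borrows connections from the engine's pool and returns them itself, so there is nothing to close per call. When you are completely done with the engine, SQLAlchemy's dispose() closes the pooled connections. A minimal sketch, assuming the engine from the answer above:

engine.dispose()  # close all connections held by the engine's pool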
