
I have a pandas DataFrame in Python and want to write it directly to a Netezza database.

I would like to use the pandas.to_sql() method described here, but it seems that this method requires SQLAlchemy to connect to the database. The problem: SQLAlchemy does not support Netezza. What I am using at the moment to connect to the database is pyodbc, but that, on the other hand, is not understood by pandas.to_sql(). Or am I wrong about this?

My workaround is to write the DataFrame to a CSV file via pandas.to_csv() and then send it to the Netezza database via pyodbc.

Since I am working with large amounts of data, writing the CSV first is a performance issue. I actually do not care whether I use SQLAlchemy, pyodbc, or something else, but I cannot change the fact that I have a Netezza database.

I am aware of deontologician's project, but as the author himself states, it "is far from complete, has a lot of bugs". I got the package to work (see my solution below), but if someone knows a better solution, please let me know!

I figured it out. For my solution, see the accepted answer.

  • Did you try using the dialect to see if it works in your particular case? When I took over the sqlalchemy-access dialect, the first thing I did was to get pandas to_sql working. Commented Sep 17, 2019 at 15:19
  • Check the list returned by pyodbc.drivers() to see what drivers are available to your Python app. Do you see the Netezza driver in that list? Commented Sep 18, 2019 at 11:56
  • Yes. "NetezzaSQL" is in the list. I got rid of the pyodbc.InterfaceError. The error gave me the hint to add a data source in the the Windows ODBC Data source administrator tool and use that instead of trying to "manually" type in the server address. Then it gets the driver. I don't know if this is normal behavior, since in pyodbc it was not neccessary to to that. So instead of engine = create_engine(netezza://usr:pass@address:port/database_name) I use engine = create_engine(netezza://ODBCDataSourceName) It still does not work yet. I will update my question soon. Commented Sep 19, 2019 at 9:00
  • Thanks for the effort, Gord! I found a solution. Edited my question to include it. Commented Sep 19, 2019 at 14:02
  • Thanks for taking the time to follow up. It's fairly common for people here to post their solution as an answer to their own question. That makes it easier for others to benefit, and if you accept your own answer it serves to mark the issue as "resolved". Commented Sep 19, 2019 at 16:36

2 Answers


Solution

I found a solution that I want to share with everyone who has the same problem. I tried the netezza dialect from deontologician, but it does not work with Python 3, so I made a fork and corrected some encoding issues. I uploaded it to GitHub and it is available here. Be aware that I only made some small changes; it is mostly the work of deontologician and nobody is maintaining it.

Having the netezza dialect, I got pandas.to_sql() to work directly with the Netezza database:

import netezza_dialect
from sqlalchemy import create_engine

engine = create_engine("netezza://ODBCDataSourceName")

df.to_sql("YourTable",            # name of the target table (not the database)
          engine,
          if_exists='append',     # append if the table already exists
          index=False,            # do not write the DataFrame index as a column
          dtype=your_dtypes,      # optional dict mapping columns to SQL types
          chunksize=1600,         # rows per chunk; see explanation below
          method='multi')         # one multi-row INSERT per chunk

A little explanation of the to_sql() parameters:

It is essential to use the method='multi' parameter if you do not want pandas to take forever writing to the database, because without it pandas sends one INSERT query per row. You can use 'multi' or define your own insertion method. Be aware that you need at least pandas v0.24.0 to use it. See the docs for more info.
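For illustration, here is a hedged sketch of such a custom insertion method, following the callable signature (table, conn, keys, data_iter) documented by pandas; the function name and the details are my own and not part of any library:

def insert_executemany(table, conn, keys, data_iter):
    """Insert one chunk of rows with a single executemany() call."""
    dbapi_conn = conn.connection  # underlying DBAPI (here: pyodbc) connection
    cursor = dbapi_conn.cursor()
    columns = ", ".join(keys)
    placeholders = ", ".join("?" * len(keys))  # qmark paramstyle, as pyodbc uses
    sql = f"INSERT INTO {table.name} ({columns}) VALUES ({placeholders})"
    cursor.executemany(sql, list(data_iter))
    cursor.close()

# usage: df.to_sql("YourTable", engine, index=False, method=insert_executemany)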

When using method='multi' it can happen (it happened to me, at least) that you exceed the driver's parameter limit. In my case it was 1600, so I had to add chunksize=1600 to avoid this.
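If you do not want to hard-code the chunk size, here is a rough sketch that derives it from the number of columns, assuming the limit counts bind parameters per statement (I have not verified this against the driver documentation, and the 1600 value may differ for your setup):

# With method='multi', each row contributes len(df.columns) bind parameters
# to a single INSERT, so cap the rows per chunk accordingly.
PARAM_LIMIT = 1600  # assumption: check your ODBC driver's actual limit
safe_chunksize = max(1, PARAM_LIMIT // len(df.columns))

df.to_sql("YourTable", engine,
          if_exists='append', index=False,
          chunksize=safe_chunksize, method='multi')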

Note

If you get a warning or error like the following:

C:\Users\USER\anaconda3\envs\myenv\lib\site-packages\sqlalchemy\connectors\pyodbc.py:79: SAWarning: No driver name specified; this is expected by PyODBC when using DSN-less connections
  "No driver name specified; "
pyodbc.InterfaceError: ('IM002', '[IM002] [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified (0) (SQLDriverConnect)')

Then you probably tried to connect to the database via

engine = create_engine("netezza://usr:pass@address:port/database_name")

You have to set up the data source in the Windows ODBC Data Source Administrator tool and then use the name you defined there:

engine = create_engine("netezza://ODBCDataSourceName")

Then it should have no problem finding the driver.



I know you already answered the question yourself (thanks for sharing the solution).

One general comment about large data writes to Netezza: I'd always choose to write the data to a file and then use the external table/ODBC interface to insert it (see the sketch at the end of this answer). Instead of inserting 1600 rows at a time, you can probably insert millions of rows in the same timeframe.

We use UTF-8 data in a flat CSV file, unless you want to load binary data, which will probably require fixed-width files.

I'm not Python savvy, but I hope you can follow me ...

If you need a documentation link, you can start here: https://www.ibm.com/support/knowledgecenter/en/SSULQD_7.2.1/com.ibm.nz.load.doc/c_load_create_external_tbl_syntax.html
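In Python terms, the idea is roughly the following. This is a sketch only: the table name, file path, and DSN are placeholders, and you should verify the USING options against the documentation linked above.

import pyodbc

# Write the DataFrame to a UTF-8 flat file first (df as in the question).
df.to_csv("staging.csv", index=False, header=False, encoding="utf-8")

# Load it through a transient external table. REMOTESOURCE 'ODBC' tells
# Netezza the file lives on the client machine; an absolute file path may
# be required in practice.
conn = pyodbc.connect("DSN=NetezzaSQL")
conn.cursor().execute("""
    INSERT INTO YOUR_TABLE
    SELECT * FROM EXTERNAL 'staging.csv'
    USING (
        DELIMITER ','
        REMOTESOURCE 'ODBC'
        MAXERRORS 0
    )
""")
conn.commit()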

Comments

Thanks for the information. You are right. The 1600 parameter limit was a surprise for me and is kind of a drawback. I am relatively new to databases and maybe my thinking is too naive. But let's say I have data with millions of rows in my code and in my system memory. My naive thinking was that I want to write that directly to Netezza instead of writing a CSV file first, which needs conversion to ASCII and time for writing the file to the HDD, just to open it again and upload it to Netezza. And you are right: because of the big amount of data I would prefer saving binary instead of CSV.
You will unfortunately need to convert it to a format the database supports; there is no 'binary' insert option. I was referring to columns of data containing PDF files or the like.
UTF8-CSV is probably the best. Study the 'escaping' capabilities of Netezza external tables so you can handle your four needed 'special' characters: row separator, column separator, escape character, and string quotation mark. I usually go with (in the same order) chr(13), chr(11), chr(92), chr(34), or as they go by in 'civil life': carriage return, tab, backslash, double quote. (A sketch of matching to_csv settings follows these comments.)
I understand your hesitation, file I/O is much slower than CPU/RAM speed... but you have a TCP/IP network in the mix and an ODBC driver with sloooow code. Don't expect more than 40MB/sec under those conditions. Furthermore, the processing of external tables on Netezza is pretty good. Look at file pipes if you are on Linux/Unix at the client (pandas) end. That way all I/O is kept in memory and you STILL get to bypass the ODBC type-conversion code that's the real bottleneck when you try to avoid files. In other words: trust me, files are faster, intuitive or not :)
Like @LarsGOlsen was describing, write the data into a file and use a transient external table to load it. That means you'll create an INSERT statement on top of an external file at a given location to load the data. The documentation should have more info on how to do this.
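A hedged sketch of writing the flat file from pandas with the separators suggested in the comments above (note that chr(11) is the ASCII vertical tab; a plain tab would be chr(9)); adjust the values to match whatever your external table's USING options declare:

# Illustrative only: separators per the comment above; line_terminator
# was renamed to lineterminator in pandas 1.5.
df.to_csv("staging.csv",
          sep="\x0b",            # chr(11): column separator
          line_terminator="\r",  # chr(13): row separator
          escapechar="\\",       # chr(92): escape character
          quotechar='"',         # chr(34): string quotation mark
          index=False,
          header=False,
          encoding="utf-8")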
