
I have to download data from a PostgreSQL server (that we don't have any control over) to CSV for some non-critical analysis (basically we're looking for tables containing a specific string in any row or column), so I decided to use Pandas read_sql_table to do this. On some tables, though, I keep getting UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 7: invalid start byte. After researching other SO questions, I changed the client encoding to UTF8, but the error still happens. The server encoding is SQL_ASCII.

An oversimplified version of my script would look like this:

import pandas
from sqlalchemy import create_engine

ENCODING = 'utf8'

conn_str = f"postgresql+psycopg2://{config['DBUSER']}:{config['DBPASS']}@{config['DBHOST']}/{config['DBNAME']}"

engine = create_engine(conn_str, client_encoding=ENCODING, pool_recycle=36000)
conn = engine.connect()

# Check what encodings the session is actually using
server = conn.execute("SHOW SERVER_ENCODING").fetchone()
print("Server Encoding ", server.server_encoding)
client = conn.execute("SHOW CLIENT_ENCODING").fetchone()
print("Client Encoding ", client.client_encoding)

# Fails with UnicodeDecodeError on some tables
df = pandas.read_sql_table(VIEWNAME, conn, SCHEMA)

Outputs:

Server Encoding  SQL_ASCII
Client Encoding  UNICODE
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 7: invalid start byte

As far as I can tell this issue is with the underlying SQLAlchemy connection, so I would like to solve it at the connection level. If that's not possible, I could get away with downloading only the non-problematic rows in a table, but there doesn't seem to be any support for doing something like that.

  • No, the problem is Server Encoding SQL_ASCII. Per Localization, "Thus, this setting is not so much a declaration that a specific encoding is in use, as a declaration of ignorance about the encoding." This means you need to find out what client encoding was used to put the data into the table. Best bet is one of the Windows code pages. Commented Jun 7, 2022 at 14:26
  • @AdrianKlaver Thanks for pointing me in the right direction. I got the job done by wrapping the script in a loop that attempts to load the data with different character sets (sketched below). Would you mind posting an answer? I would love to give you an upvote and an accepted answer if you decide to do it. Commented Jun 7, 2022 at 19:08
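For reference, a minimal sketch of that retry loop, reusing the config dict and connection settings from the question. The list of candidate code pages and the load_table helper name are illustrative guesses, not known properties of this data:

import pandas
from sqlalchemy import create_engine

# Candidate client encodings, most likely first (guesses; adjust to where the data came from)
CANDIDATE_ENCODINGS = ["utf8", "win1252", "latin1"]

def load_table(viewname, schema, config):
    conn_str = (
        f"postgresql+psycopg2://{config['DBUSER']}:{config['DBPASS']}"
        f"@{config['DBHOST']}/{config['DBNAME']}"
    )
    last_error = None
    for encoding in CANDIDATE_ENCODINGS:
        engine = create_engine(conn_str, client_encoding=encoding, pool_recycle=36000)
        try:
            with engine.connect() as conn:
                # Succeeds only if every byte in the table decodes under this client encoding
                return pandas.read_sql_table(viewname, conn, schema)
        except UnicodeDecodeError as exc:
            last_error = exc
        finally:
            engine.dispose()
    raise last_error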

1 Answer


Per the Postgres docs, Localization/Character Set Support:

The SQL_ASCII setting behaves considerably differently from the other settings. When the server character set is SQL_ASCII, the server interprets byte values 0–127 according to the ASCII standard, while byte values 128–255 are taken as uninterpreted characters. No encoding conversion will be done when the setting is SQL_ASCII. Thus, this setting is not so much a declaration that a specific encoding is in use, as a declaration of ignorance about the encoding. In most cases, if you are working with any non-ASCII data, it is unwise to use the SQL_ASCII setting because PostgreSQL will be unable to help you by converting or validating non-ASCII characters.

This means the data can be in a variety of encodings once you get past ASCII 127. Generally it turns out the data was entered as one of the Windows code pages. If you know where the data originally came from, you may be able to narrow down the choices. Then you can try setting the client_encoding to the appropriate code page to see if the export succeeds.
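For example, a minimal sketch reusing the connection string and variables from the question, and assuming the data turns out to be Windows-1252 (the code page here is only a placeholder to swap out once you know the real source):

import pandas
from sqlalchemy import create_engine

# WIN1252 is just an example code page; use whichever encoding the data was actually entered with
engine = create_engine(conn_str, client_encoding="win1252", pool_recycle=36000)

with engine.connect() as conn:
    df = pandas.read_sql_table(VIEWNAME, conn, SCHEMA)
    df.to_csv(f"{VIEWNAME}.csv", index=False)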

Going forward it is a good idea to move the server_encoding off SQL_ASCII, to make life easier.
