
I have to download data from a PostgreSQL server (that we don't have any control over) to CSV for some non-critical analysis (basically we're looking for tables containing a specific string in any row or column), so I decided to use Pandas read_sql_table to do this. On some tables, though, I keep getting UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 7: invalid start byte. After researching other SO questions, I changed the client encoding to UTF8, but the error still happens. The server encoding is SQL_ASCII.

An oversimplified version of my script would look like this:

import pandas
from sqlalchemy import create_engine

ENCODING = 'utf8'

conn_str = f"postgresql+psycopg2://{config['DBUSER']}:{config['DBPASS']}@{config['DBHOST']}/{config['DBNAME']}"

engine = create_engine(conn_str, client_encoding=ENCODING, pool_recycle=36000)
conn = engine.connect()

# Check what encodings the session is actually using
server = conn.execute("SHOW SERVER_ENCODING").fetchone()
print("Server Encoding ", server.server_encoding)
client = conn.execute("SHOW CLIENT_ENCODING").fetchone()
print("Client Encoding ", client.client_encoding)

# Fails with UnicodeDecodeError on some tables
df = pandas.read_sql_table(VIEWNAME, conn, SCHEMA)

Outputs:

Server Encoding  SQL_ASCII
Client Encoding  UNICODE
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 7: invalid start byte

As far as I can tell this issue is with the underlying SQLAlchemy connection, so I would like to solve it at the connection level. If that's not possible, I could get away with downloading only the non-problematic rows in a table, but there doesn't seem to be any support for doing something like that.

  • No, the problem is Server Encoding SQL_ASCII. Per Localization, "Thus, this setting is not so much a declaration that a specific encoding is in use, as a declaration of ignorance about the encoding." This means you need to find out what client encoding was used to put the data into the table. Best bet is one of the Windows code pages. Commented Jun 7, 2022 at 14:26
  • @AdrianKlaver Thanks for pointing me in the right direction. I got the job done by wrapping the script in a loop that attempts to load the data with different character sets (sketched below). Would you mind posting an answer? I would love to give you an upvote and an accepted answer if you decide to do it. Commented Jun 7, 2022 at 19:08
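For reference, a minimal sketch of that retry loop, reusing the config dict and connection settings from the question. The list of candidate code pages and the load_table helper name are illustrative guesses, not known properties of this data:

import pandas
from sqlalchemy import create_engine

# Candidate client encodings, most likely first (guesses; adjust to where the data came from)
CANDIDATE_ENCODINGS = ["utf8", "win1252", "latin1"]

def load_table(viewname, schema, config):
    conn_str = (
        f"postgresql+psycopg2://{config['DBUSER']}:{config['DBPASS']}"
        f"@{config['DBHOST']}/{config['DBNAME']}"
    )
    last_error = None
    for encoding in CANDIDATE_ENCODINGS:
        engine = create_engine(conn_str, client_encoding=encoding, pool_recycle=36000)
        try:
            with engine.connect() as conn:
                # Succeeds only if every byte in the table decodes under this client encoding
                return pandas.read_sql_table(viewname, conn, schema)
        except UnicodeDecodeError as exc:
            last_error = exc
        finally:
            engine.dispose()
    raise last_error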

1 Answer


Per the Postgres docs, Localization/Character Set Support:

The SQL_ASCII setting behaves considerably differently from the other settings. When the server character set is SQL_ASCII, the server interprets byte values 0–127 according to the ASCII standard, while byte values 128–255 are taken as uninterpreted characters. No encoding conversion will be done when the setting is SQL_ASCII. Thus, this setting is not so much a declaration that a specific encoding is in use, as a declaration of ignorance about the encoding. In most cases, if you are working with any non-ASCII data, it is unwise to use the SQL_ASCII setting because PostgreSQL will be unable to help you by converting or validating non-ASCII characters.

This means the data can be in a variety of encodings once you get past ASCII 127. Generally it turns out the data was entered as one of the Windows code pages. If you know where the data originally came from, you may be able to narrow down the choices. Then you can try setting the client_encoding to the appropriate code page to see if the export succeeds.
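For example, a minimal sketch reusing the connection string and variables from the question, and assuming the data turns out to be Windows-1252 (the code page here is only a placeholder to swap out once you know the real source):

import pandas
from sqlalchemy import create_engine

# WIN1252 is just an example code page; use whichever encoding the data was actually entered with
engine = create_engine(conn_str, client_encoding="win1252", pool_recycle=36000)

with engine.connect() as conn:
    df = pandas.read_sql_table(VIEWNAME, conn, SCHEMA)
    df.to_csv(f"{VIEWNAME}.csv", index=False)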

Going forward it is a good idea to move the server_encoding off SQL_ASCII, to make life easier.
