
On our application server, the DevOps team is using SQL_ASCII encoding in the Postgres (9.4) DB.

A 3rd-party application is inserting surnames with accented characters (e.g. Núñez) into the Employee table.

My Java (8) application is a Spring (4.3.15) WebApp using MyBatis (3.2.4).

When my application reads such Surnames out of the SQL_ASCII db, I get:

org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0xe3 0xa1 0x54
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2182)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1911)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:173)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:616)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:466)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:459)
    at org.apache.tomcat.dbcp.dbcp2.DelegatingPreparedStatement.execute(DelegatingPreparedStatement.java:93)
    at org.apache.tomcat.dbcp.dbcp2.DelegatingPreparedStatement.execute(DelegatingPreparedStatement.java:93)
    at jdk.internal.reflect.GeneratedMethodAccessor78.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.apache.ibatis.logging.jdbc.PreparedStatementLogger.invoke(PreparedStatementLogger.java:55)
    at com.sun.proxy.$Proxy98.execute(Unknown Source)

If I try changing the client_encoding via:

SET client_encoding = 'SQL_ASCII';

Then I get this error:

org.postgresql.util.PSQLException: The server's client_encoding parameter was changed to LATIN1. The JDBC driver requires client_encoding to be UTF8 for correct operation.
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1950)

How can I "safely" read these characters from the DB?

2 Answers


You are lost. An SQL_ASCII database is not encoding-aware; it treats all bytes (except the zero byte) equally. No encoding conversion takes place in the database.

So unless the data happens to be encoded as UTF-8, which it isn't (according to the error message), you cannot use it with the JDBC driver.

You will have to dump the database and restore it (using the appropriate -E option) into a different database (v13) with a proper encoding. Any encoding inconsistencies will have to be fixed manually during this process.
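As a sketch of that dump-and-restore, assuming the stored bytes are really LATIN1 (as the error bytes suggest) and using placeholder database names `olddb` and `newdb`, it could look like this:

```shell
# Dump the SQL_ASCII database. -E labels the dump as LATIN1; the server
# performs no conversion from SQL_ASCII, so the bytes pass through as-is.
pg_dump -E LATIN1 olddb > olddb.sql

# Create the target database with a proper encoding.
createdb -E UTF8 -T template0 newdb

# Restore; the LATIN1-labelled dump is converted to UTF-8 on the way in.
# ON_ERROR_STOP surfaces any rows whose bytes are not valid LATIN1.
psql --set ON_ERROR_STOP=on -d newdb -f olddb.sql
```

If the third-party application actually mixed several encodings into the table, the restore will stop at the offending rows, which then have to be fixed by hand.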

This question will provide additional insight.


3 Comments

Thanks, but I'm a bit bewildered. Any thoughts on why Postgres considers it acceptable that the driver cannot read any DB with an encoding other than UTF-8?
But the PostgresJDBC driver insists the client encoding is UTF-8. What if the DB was LATIN1 - could we read from it via the PG JDBC driver?
Yes, that would work just fine. That's why I recommended dumping and restoring the data into a database with a proper encoding.

allowEncodingChanges=true

Can you try setting allowEncodingChanges=true (and also characterEncoding) in the JDBC connection URL?
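A minimal sketch of what such a URL could look like; the host, port, and database name are placeholders, and whether this workaround is actually safe for your data is a separate question:

```java
public class PgUrlDemo {

    /** Builds a pgjdbc URL with the allowEncodingChanges workaround enabled. */
    static String buildUrl(String host, int port, String db) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + db
                + "?allowEncodingChanges=true";
    }

    public static void main(String[] args) {
        // Placeholder connection details; pass this URL to DriverManager.getConnection(...)
        System.out.println(buildUrl("localhost", 5432, "employees"));
        // -> jdbc:postgresql://localhost:5432/employees?allowEncodingChanges=true
    }
}
```

With this parameter set, the driver no longer aborts the connection when client_encoding changes, so a subsequent `SET client_encoding = ...` is tolerated rather than rejected.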

allowEncodingChanges = boolean

When using the V3 protocol the driver monitors changes in certain server configuration parameters that should not be touched by end users. The client_encoding setting is set by the driver and should not be altered. If the driver detects a change it will abort the connection. There is one legitimate exception to this behaviour though, using the COPY command on a file residing on the server's filesystem. The only means of specifying the encoding of this file is by altering the client_encoding setting. The JDBC team considers this a failing of the COPY command and hopes to provide an alternate means of specifying the encoding in the future, but for now there is this URL parameter. Enable this only if you need to override the client encoding when doing a copy.

Ref: Chapter 3. Initializing the Driver

The bytes in the error message are 1110 0011, 1010 0001, 0101 0100 (0xE3, 0xA1, 0x54).

If the stored data was encoded in ISO-8859-1, these decode to ã, ¡, T.

When this byte stream is read as UTF-8, the leading bits 1110 in the first byte indicate a 3-byte UTF-8 character (counting itself).

So the next 2 bytes should start with the bits 10, but the third byte starts with 01.

By default, the JDBC driver decodes this stream as UTF-8 and fails on the invalid byte sequence.

Had the third byte started with the bits 10, the code would have run without error, but it could incorrectly map all 3 bytes onto a single Unicode code point (assuming the original encoding was not UTF-8 and the corresponding code point has a valid character representation).
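The analysis above can be reproduced in plain Java. The byte values are taken from the error message; everything else is standard library:

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        // The three bytes from the error message: 0xE3 0xA1 0x54
        byte[] raw = { (byte) 0xE3, (byte) 0xA1, 0x54 };

        // In ISO-8859-1 every byte maps to exactly one character,
        // so decoding can never fail.
        String latin1 = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(latin1); // ã¡T

        // In UTF-8, 0xE3 (1110 0011) announces a 3-byte sequence, so the next
        // two bytes must be continuation bytes (10xx xxxx). 0xA1 is one, but
        // 0x54 (0101 0100) is not, so the lenient String constructor substitutes
        // the Unicode replacement character U+FFFD for the malformed sequence.
        String utf8 = new String(raw, StandardCharsets.UTF_8);
        System.out.println(utf8.contains("\uFFFD")); // true
    }
}
```

The driver uses a strict decoder rather than this lenient constructor, which is why it throws instead of silently inserting replacement characters.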

2 Comments

This might work, but as per the text you've quoted it's not the intended purpose of the parameter and would definitely qualify as a 'Kludge'. Isn't there an endorsed way to read these characters from an SQL_ASCII DB in Java?
On a different note, any byte with the MSB set to 1 is left to interpretation in ASCII. When the server character set is SQL_ASCII, the server interprets byte values 0-127 according to the ASCII standard, while byte values 128-255 are taken as uninterpreted characters.
