Weird string encoding issue in Java

Asked 1 year, 2 months ago

Modified 1 year, 2 months ago

Viewed 110 times

We have some Java code that seems to handle character encoding incorrectly when it runs on a client's machine. Unfortunately we are unable to reproduce the issue locally. Here's the code:

    import java.nio.charset.StandardCharsets;

    // Renders each byte as two lowercase hex digits.
    private static String bytesToHex(byte[] in) {
        final StringBuilder builder = new StringBuilder();
        for (byte b : in) {
            builder.append(String.format("%02x", b));
        }
        return builder.toString();
    }

    private static String normalize(String str) {
        // Log the string next to the hex dump of its UTF-8 bytes.
        System.out.println("Normalize " + str + " / hex " + bytesToHex(str.getBytes(StandardCharsets.UTF_8)));
        // ...
    }

The hex part of the printed line is wrong in the client's logs for characters with diacritics, e.g. if str = "ë" we get:

Local output: "Normalize ë / hex c3ab" (as expected)
Client output: "Normalize ë / hex c383c2ab" !!!

In the client output it looks like the UTF-8 bytes for "ë" (c3ab) have been decoded with another encoding such as ISO-8859-1, so the string became "Ã«", which in UTF-8 is c383c2ab.
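To make that hypothesis concrete, here is a minimal sketch that reproduces the double encoding in isolation. The class name DoubleEncodingDemo and the helper misdecode are made up for illustration and are not part of the code above; the string is written with Unicode escapes so the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {

    // Renders each byte as two lowercase hex digits (same idea as bytesToHex in the question).
    public static String bytesToHex(byte[] in) {
        StringBuilder builder = new StringBuilder();
        for (byte b : in) {
            builder.append(String.format("%02x", b));
        }
        return builder.toString();
    }

    // Hypothetical helper: simulates text whose UTF-8 bytes were wrongly decoded as ISO-8859-1.
    public static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00eb";           // ë
        String garbled = misdecode(original); // Ã« (U+00C3 followed by U+00AB)

        System.out.println(bytesToHex(original.getBytes(StandardCharsets.UTF_8))); // c3ab
        System.out.println(bytesToHex(garbled.getBytes(StandardCharsets.UTF_8)));  // c383c2ab
    }
}
```

If this matches what the client logs show, the string was most likely already garbled before normalize was called, since getBytes(StandardCharsets.UTF_8) itself is deterministic.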

Any idea how this could happen?

Edit: alright, solved. In fact str did contain "Ã«", but the client's log was written in ISO-8859-1 and I was reading it as UTF-8, which is why I ended up seeing "ë". As for why str contained this: it comes from a REST call, and apparently the line new InputStreamReader(conn.getInputStream()) was using UTF-8 on my system and ISO-8859-1 on others, because that constructor uses the platform default charset. So the fix is to specify UTF-8 in the constructor. A bunch of weird issues adding up.
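For anyone landing here later, a rough sketch of the fix, assuming the REST response body is UTF-8. A ByteArrayInputStream stands in for conn.getInputStream() so the example runs on its own; the class name ExplicitCharsetDemo and the helper readAll are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {

    // Hypothetical helper: reads the whole stream as text using the given charset.
    public static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        // Passing the charset explicitly avoids depending on the platform default.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the REST response body: "ë" encoded as UTF-8 (bytes c3 ab).
        byte[] body = "\u00eb".getBytes(StandardCharsets.UTF_8);

        String correct = readAll(new ByteArrayInputStream(body), StandardCharsets.UTF_8);
        String garbled = readAll(new ByteArrayInputStream(body), StandardCharsets.ISO_8859_1);

        System.out.println(correct.equals("\u00eb"));       // true: decoded as ë
        System.out.println(garbled.equals("\u00c3\u00ab")); // true: decoded as Ã«
    }
}
```

The second call mimics what happens on a machine whose default charset is ISO-8859-1 when no charset is passed to the constructor.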

edited Sep 9, 2024 at 14:39

asked Sep 9, 2024 at 9:46

ExecutionSommaire

The client output you posted does not contain the text hex. Is this accurate to the actual client output? Does that text somehow not show up in the output?

– user2357112

2024-09-09 09:49:27 +00:00
Commented Sep 9, 2024 at 9:49
Sorry, that's a mistake. Fixed.

– ExecutionSommaire

2024-09-09 09:53:25 +00:00
Commented Sep 9, 2024 at 9:53
It sounds like you need to validate your assumptions. The code you've posted is completely self-contained and deterministic. str.getBytes(StandardCharsets.UTF_8) will not produce different bytes on different JVMs.

– aioobe

2024-09-09 10:08:12 +00:00
Commented Sep 9, 2024 at 10:08
1

If your client mistook UTF-8 bytes for ISO-8859-1, it must have done the reverse as well when interpreting the output, as otherwise, you wouldn’t see an ë again in the printed line.

– Holger

2024-09-09 10:18:31 +00:00
Commented Sep 9, 2024 at 10:18
4

By the way, your bytesToHex method is very inefficient as it creates a new Formatter under the hood for every byte. You can use a single Formatter in the first place: Formatter f = new Formatter(); for(byte b: in) f.format("%02x", b); return f.toString(); or, starting with Java 17, just use return HexFormat.of().formatHex(in);

– Holger

2024-09-09 10:24:46 +00:00
Commented Sep 9, 2024 at 10:24
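
Spelled out as a compilable sketch, the single-Formatter variant suggested in the comment above might look like this. The class name HexDemo is made up for illustration; the HexFormat one-liner is left out so the example also compiles on Java versions before 17.

```java
import java.util.Formatter;

public class HexDemo {

    // Reuses one Formatter for all bytes instead of creating one per String.format call.
    public static String bytesToHex(byte[] in) {
        Formatter f = new Formatter(); // appends to an internal StringBuilder
        for (byte b : in) {
            f.format("%02x", b);
        }
        return f.toString();
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xc3, (byte) 0xab};
        System.out.println(bytesToHex(sample)); // c3ab
    }
}
```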
