
We have some Java code that, when run on the client's machine, seems to handle character encoding incorrectly. Unfortunately, we're unable to reproduce the issue locally. Here's the code:

    import java.nio.charset.StandardCharsets;

    // Renders each byte as two lowercase hex digits.
    private static String bytesToHex(byte[] in) {
        final StringBuilder builder = new StringBuilder();
        for (byte b : in) {
            builder.append(String.format("%02x", b));
        }
        return builder.toString();
    }

    private static String normalize(String str) {
        System.out.println("Normalize " + str + " / hex " + bytesToHex(str.getBytes(StandardCharsets.UTF_8)));
        // ...
    }

The hex part of the printed line is incorrect in the client's logs for characters with diacritics, e.g. for str = "ë" we get:

  • Local output: "Normalize ë / hex c3ab" (as expected)
  • Client output: "Normalize ë / hex c383c2ab" !!!

In the client output it looks like the UTF-8 bytes for "ë" (c3ab) have been interpreted in another encoding such as ISO-8859-1, so the string became "Ã«", which in UTF-8 is c383c2ab.
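To make that hypothesis concrete, here is a minimal sketch that reproduces the suspected double encoding. The class name DoubleEncodingDemo and the helper misdecode are made up for illustration; they are not part of our code base.

```java
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {
    // Same hex helper as in the question.
    static String bytesToHex(byte[] in) {
        StringBuilder builder = new StringBuilder();
        for (byte b : in) {
            builder.append(String.format("%02x", b));
        }
        return builder.toString();
    }

    // Hypothetical helper: decodes the UTF-8 bytes of s as ISO-8859-1,
    // which is the mistake suspected above.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00eb";            // ë
        String garbled = misdecode(original);  // Ã« (U+00C3 U+00AB)
        System.out.println(bytesToHex(original.getBytes(StandardCharsets.UTF_8))); // c3ab
        System.out.println(bytesToHex(garbled.getBytes(StandardCharsets.UTF_8)));  // c383c2ab
    }
}
```

The second hex string matches the client's log, which is consistent with str already being garbled before normalize is called.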

Any idea how this could happen?

Edit: alright, solved. In fact str did contain "Ã«", but the client's log was written in ISO-8859-1 and I was reading it in UTF-8 mode, which is why I saw "ë". As for why str contained this: it comes from a REST call, and apparently the line new InputStreamReader(conn.getInputStream()) used UTF-8 on my system and ISO-8859-1 on others, because that constructor falls back to the platform default charset. So the fix is to specify UTF-8 in the constructor. A bunch of weird issues adding up.
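For anyone hitting the same thing, here is a rough sketch of the fix. It reads from a ByteArrayInputStream standing in for conn.getInputStream(), so it runs without a network call; the class name ExplicitCharsetDemo and the helper readAll are illustrative only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // Hypothetical helper: reads the whole stream using an explicit charset
    // instead of relying on the platform default.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The UTF-8 bytes of "ë", as a REST response body might contain them.
        byte[] body = {(byte) 0xc3, (byte) 0xab};

        String good = readAll(new ByteArrayInputStream(body), StandardCharsets.UTF_8);
        String bad = readAll(new ByteArrayInputStream(body), StandardCharsets.ISO_8859_1);

        System.out.println(good.length()); // 1 character: ë
        System.out.println(bad.length());  // 2 characters: Ã«
    }
}
```

In the real code the change amounts to passing StandardCharsets.UTF_8 as the second constructor argument, assuming the server actually sends UTF-8.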

  • The client output you posted does not contain the text hex. Is this accurate to the actual client output? Does that text somehow not show up in the output? Commented Sep 9, 2024 at 9:49
  • Sorry, that's a mistake. Fixed. Commented Sep 9, 2024 at 9:53
  • It sounds like you need to validate your assumptions. The code you've posted is completely self-contained and deterministic. str.getBytes(StandardCharsets.UTF_8) will not produce different bytes on different JVMs. Commented Sep 9, 2024 at 10:08
  • If your client mistook UTF-8 bytes for ISO-8859-1, it must have done the reverse as well when interpreting the output, as otherwise, you wouldn’t see an ë again in the printed line. Commented Sep 9, 2024 at 10:18
  • By the way, your bytesToHex method is very inefficient as it creates a new Formatter under the hood for every byte. You can use a single Formatter in the first place: Formatter f = new Formatter(); for(byte b: in) f.format("%02x", b); return f.toString(); or, starting with Java 17, just use return HexFormat.of().formatHex(in); Commented Sep 9, 2024 at 10:24
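
The single-Formatter suggestion from the last comment, written out as a small sketch. The class name HexDemo is made up for illustration; on Java 17 and later, HexFormat.of().formatHex(in) should do the same job in one call.

```java
import java.nio.charset.StandardCharsets;
import java.util.Formatter;

public class HexDemo {
    // Reuses one Formatter for all bytes instead of calling
    // String.format once per byte.
    static String bytesToHex(byte[] in) {
        Formatter f = new Formatter(); // writes into an internal StringBuilder
        for (byte b : in) {
            f.format("%02x", b);
        }
        return f.toString();
    }

    public static void main(String[] args) {
        System.out.println(bytesToHex("\u00eb".getBytes(StandardCharsets.UTF_8))); // c3ab
    }
}
```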
