Given that I have the following function:

static void fun(String str) {
    System.out.println(String.format(
            "%s | length in String: %d | length in bytes: %d | bytes: %s",
            str, str.length(), str.getBytes().length, Arrays.toString(str.getBytes())));
}

On invoking fun("ó"), its output is

ó | length in String: 1 | length in bytes: 2 | bytes: [-61, -77]

So the character ó needs 2 bytes to represent, and according to the Character class documentation the default in Java is UTF-16. With that in mind, when I do the following:

System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_16));// output=쎳
System.out.println(new String("ó".getBytes(), StandardCharsets.ISO_8859_1));// output=ó
System.out.println(new String("ó".getBytes(), StandardCharsets.US_ASCII));// output=��
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_8));// output=ó
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_16BE));// output=쎳
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_16LE));// output=돃

Why is none of UTF_16, UTF_16BE, or UTF_16LE able to decode the bytes properly, given that the bytes represent a 16-bit character? And how is UTF-8 able to decode it properly, given that UTF-8 considers each character to be only 8 bits long? It should have printed 2 chars (1 char for each byte), like ISO_8859_1 did.

  • What you do here does not make much sense. Building a string from a wrong encoding can only fail. Commented Nov 3, 2020 at 11:09

1 Answer


The no-argument getBytes() always returns the bytes encoded with the platform's default charset, which is probably UTF-8 for you. From its Javadoc:

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

So you are essentially trying to decode a bunch of UTF-8 bytes with charsets that are not UTF-8; no wonder you don't get the expected results. UTF-16BE, for example, reads your two UTF-8 bytes 0xC3 0xB3 as the single 16-bit code unit U+C3B3, which is why a Hangul syllable comes out instead of ó.
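If you want to check what that default actually is on your machine, here is a minimal sketch (the class name is chosen just for this example):

    import java.nio.charset.Charset;

    public class DefaultCharsetCheck {
        public static void main(String[] args) {
            // The charset used by the no-argument getBytes() and new String(byte[])
            System.out.println(Charset.defaultCharset());
        }
    }

(Since Java 18 the default charset is UTF-8 on every platform; before that it depends on the OS and locale.)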

Though kind of pointless, you can get what you want by passing the desired charset to getBytes, so that encoding and decoding use the same charset:

    System.out.println(new String("ó".getBytes(StandardCharsets.UTF_16), StandardCharsets.UTF_16));
    System.out.println(new String("ó".getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1));
    System.out.println(new String("ó".getBytes(StandardCharsets.US_ASCII), StandardCharsets.US_ASCII));
    System.out.println(new String("ó".getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8));
    System.out.println(new String("ó".getBytes(StandardCharsets.UTF_16BE), StandardCharsets.UTF_16BE));
    System.out.println(new String("ó".getBytes(StandardCharsets.UTF_16LE), StandardCharsets.UTF_16LE));

You also seem to have some misunderstanding about encodings. It's not just about the number of bytes a character takes: two encodings using the same number of bytes for a character does not make them compatible with each other. Also, UTF-8 is not one byte per character; it is a variable-length encoding that uses 1 to 4 bytes per code point.
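To make that concrete, here is a minimal sketch (the class name is chosen just for illustration) that prints the bytes ó produces under UTF-8 and UTF-16BE, plus UTF-8 byte counts for a few characters:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class EncodingBytesDemo {
        public static void main(String[] args) {
            String s = "ó"; // U+00F3

            // Same character, same byte count (2), but different bytes:
            // the two encodings are not interchangeable.
            System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));    // [-61, -77] = 0xC3 0xB3
            System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16BE))); // [0, -13]   = 0x00 0xF3

            // UTF-8 is variable-length: 1 to 4 bytes per code point.
            System.out.println("a".getBytes(StandardCharsets.UTF_8).length); // 1
            System.out.println("ó".getBytes(StandardCharsets.UTF_8).length); // 2
            System.out.println("€".getBytes(StandardCharsets.UTF_8).length); // 3
        }
    }

Decoding 0xC3 0xB3 as UTF-16BE therefore produces the unrelated code unit U+C3B3 (the 쎳 in your output), while decoding it as UTF-8 reconstructs U+00F3.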
