Given that I have the following function:

static void fun(String str) {
    System.out.println(String.format(
            "%s | length in String: %d | length in bytes: %d | bytes: %s",
            str, str.length(), str.getBytes().length, Arrays.toString(str.getBytes())));
}

On invoking fun("ó"), its output is

ó | length in String: 1 | length in bytes: 2 | bytes: [-61, -77]

So the character ó needs 2 bytes to represent, and according to the Character class documentation the default in Java is UTF-16. With that in mind, when I do the following:

System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_16));// output=쎳
System.out.println(new String("ó".getBytes(), StandardCharsets.ISO_8859_1));// output=ó
System.out.println(new String("ó".getBytes(), StandardCharsets.US_ASCII));// output=��
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_8));// output=ó
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_16BE));// output=쎳
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_16LE));// output=돃

Why is none of UTF_16, UTF_16BE, or UTF_16LE able to decode the bytes properly, given that the bytes represent a 16-bit character? And how is UTF-8 able to decode it properly, given that UTF-8 considers each character to be only 8 bits long? It should have printed 2 chars (1 char for each byte), like ISO_8859_1 did.

  • What you do here does not make much sense. Building a string from a wrong encoding can only fail. Commented Nov 3, 2020 at 11:09

1 Answer


The no-argument getBytes() always returns the bytes encoded with the platform's default charset, which is probably UTF-8 for you. From its Javadoc:

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

So you are essentially trying to decode a bunch of UTF-8 bytes with charsets that are not UTF-8; no wonder you don't get the expected results. UTF-16BE, for example, reads your two UTF-8 bytes 0xC3 0xB3 as the single 16-bit code unit U+C3B3, which is why a Hangul syllable comes out instead of ó.
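If you want to check what that default actually is on your machine, here is a minimal sketch (the class name is chosen just for this example):

    import java.nio.charset.Charset;

    public class DefaultCharsetCheck {
        public static void main(String[] args) {
            // The charset used by the no-argument getBytes() and new String(byte[])
            System.out.println(Charset.defaultCharset());
        }
    }

(Since Java 18 the default charset is UTF-8 on every platform; before that it depends on the OS and locale.)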

Though kind of pointless, you can get what you want by passing the desired charset to getBytes, so that encoding and decoding use the same charset:

    System.out.println(new String("ó".getBytes(StandardCharsets.UTF_16), StandardCharsets.UTF_16));
    System.out.println(new String("ó".getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1));
    System.out.println(new String("ó".getBytes(StandardCharsets.US_ASCII), StandardCharsets.US_ASCII));
    System.out.println(new String("ó".getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8));
    System.out.println(new String("ó".getBytes(StandardCharsets.UTF_16BE), StandardCharsets.UTF_16BE));
    System.out.println(new String("ó".getBytes(StandardCharsets.UTF_16LE), StandardCharsets.UTF_16LE));

You also seem to have some misunderstanding about encodings. It's not just about the number of bytes a character takes: two encodings using the same number of bytes for a character does not make them compatible with each other. Also, UTF-8 is not one byte per character; it is a variable-length encoding that uses 1 to 4 bytes per code point.
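To make that concrete, here is a minimal sketch (the class name is chosen just for illustration) that prints the bytes ó produces under UTF-8 and UTF-16BE, plus UTF-8 byte counts for a few characters:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class EncodingBytesDemo {
        public static void main(String[] args) {
            String s = "ó"; // U+00F3

            // Same character, same byte count (2), but different bytes:
            // the two encodings are not interchangeable.
            System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));    // [-61, -77] = 0xC3 0xB3
            System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16BE))); // [0, -13]   = 0x00 0xF3

            // UTF-8 is variable-length: 1 to 4 bytes per code point.
            System.out.println("a".getBytes(StandardCharsets.UTF_8).length); // 1
            System.out.println("ó".getBytes(StandardCharsets.UTF_8).length); // 2
            System.out.println("€".getBytes(StandardCharsets.UTF_8).length); // 3
        }
    }

Decoding 0xC3 0xB3 as UTF-16BE therefore produces the unrelated code unit U+C3B3 (the 쎳 in your output), while decoding it as UTF-8 reconstructs U+00F3.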
