
The following test fails.

import static org.junit.Assert.assertArrayEquals;
import java.nio.charset.Charset;
import org.junit.Test;
import com.google.common.base.Charsets; // assuming Guava's Charsets; StandardCharsets.UTF_8 works equally well

@Test
public void testConversions() {
    final Charset charset = Charsets.UTF_8;
    final byte[] inputBytes = {37, 80, 68, 70, 45, 49, 46, 52, 13, 10, 37, -11, -28, -10, -4, 13, 10};
    final String string = new String(inputBytes, charset);
    final byte[] outputBytes = string.getBytes(charset);
    assertArrayEquals(inputBytes, outputBytes);
}

If ISO_8859_1 is used instead of UTF-8, the test passes, even with a much bigger inputBytes array. Do the input and output differ because of the 'variable-width' property of UTF-8?

Bonus question: Is it a true presumption that the conversions byte[] → String → byte[] will always have the same input and output byte arrays, if ISO_8859_1 is used?


2 Answers


Do the input and output differ because of 'variable-width' property of UTF-8?

They differ because, due to UTF-8's variable-width encoding, not every sequence of bytes can occur in a valid UTF-8 encoded string.

You can see this in the table on the Wikipedia article about UTF-8:

  • 1 byte: 0xxxxxxx
  • 2 bytes: 110xxxxx 10xxxxxx
  • 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
  • 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The xs mark bits that can be either 0 or 1; the explicit digits mark bits that must have exactly that value in a valid encoding.

Thus, you will never find e.g. 11000000 11000000 in a valid UTF-8 string. If you attempt to build a string from such bytes, the character encoding will do... something. Specifically:

[new String(byte[], Charset)] always replaces malformed-input and unmappable-character sequences with this charset's default replacement string

So the string you build won't necessarily map back to the original input bytes.
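For example (a minimal sketch; the 0xC0 0xC0 pair is just one concrete instance of an invalid sequence):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplacementDemo {
    public static void main(String[] args) {
        // 0xC0 can never appear in valid UTF-8, so this two-byte sequence is malformed
        byte[] invalid = { (byte) 0xC0, (byte) 0xC0 };
        // Malformed input is replaced with the default replacement character U+FFFD
        String decoded = new String(invalid, StandardCharsets.UTF_8);
        // U+FFFD re-encodes as EF BF BD, so the round trip does not restore the input
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(invalid));   // [-64, -64]
        System.out.println(Arrays.toString(reencoded)); // e.g. [-17, -65, -67, -17, -65, -67]
    }
}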

Bonus question

Yes, because ISO-8859-1 is a fixed-width, single-byte encoding in which every possible byte value has exactly one corresponding character.
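A quick way to convince yourself (a small sketch, not part of the original test):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    public static void main(String[] args) {
        // Every byte value 0x00..0xFF decodes to exactly one character in ISO-8859-1,
        // and that character encodes back to the same byte.
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String s = new String(all, StandardCharsets.ISO_8859_1);
        byte[] back = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(all, back)); // true
    }
}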


There is no good reason to convert a byte[] directly to a String unless you know it is a valid encoding of a string you want to recover, and you know the charset that was used to encode it (or you suspect it is a string and want to attempt to recover its contents).

If you want to transmit a byte[] over some channel that requires you to send strings, use something like base64 encoding.
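For example, with the JDK's built-in java.util.Base64 (a sketch, using the byte array from the question):

import java.util.Arrays;
import java.util.Base64;

public class Base64Transport {
    public static void main(String[] args) {
        byte[] original = {37, 80, 68, 70, 45, 49, 46, 52, 13, 10, 37, -11, -28, -10, -4, 13, 10};
        // Encode the raw bytes into a plain ASCII string that is safe to send as text
        String transmitted = Base64.getEncoder().encodeToString(original);
        // Decode on the receiving side; the round trip is lossless for any byte[]
        byte[] received = Base64.getDecoder().decode(transmitted);
        System.out.println(Arrays.equals(original, received)); // true
    }
}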


2 Comments

Got it! Thank you very much for such an elaborate explanation! :) The advice about Base64 will be very useful. Thanks again!
There is an alternative to "always replaces malformed-input and unmappable-character sequences": exceptions! Expecting one encoding and receiving bytes that don't match it is certainly exceptional: either the receiving code is wrong or the data is corrupt (almost certainly corrupted by the sending code rather than by a transmission or storage problem). If you use the charset decoding classes in a mode where they throw exceptions, you'll find bugs sooner. If you still want to pass corrupted text along, on the theory that some is better than none, you can do that in an exception handler.
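A minimal sketch of that strict-decoding approach, using CharsetDecoder with CodingErrorAction.REPORT (the input bytes are the ones from the question; class and variable names are illustrative only):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    public static void main(String[] args) {
        byte[] input = {37, 80, 68, 70, 45, 49, 46, 52, 13, 10, 37, -11, -28, -10, -4, 13, 10};
        // REPORT makes the decoder throw instead of substituting U+FFFD
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            String s = decoder.decode(ByteBuffer.wrap(input)).toString();
            System.out.println("Decoded: " + s);
        } catch (CharacterCodingException e) {
            // The 0xF5 byte (-11) can never appear in valid UTF-8, so we end up here
            System.out.println("Not valid UTF-8: " + e);
        }
    }
}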

Bonus question: Is it a true presumption that the conversions byte[] → String → byte[] will always have the same input and output byte arrays, if ISO_8859_1 is used?

Yes. Any single-byte charset that maps a unique character to each byte will preserve all the byte values in a round-trip conversion. And as of 1987, ISO 8859-1 does have a unique mapping for every single byte value.

Whereas CP1252 (Windows Latin-1), a common default charset on Windows, has 5 byte values to which no character is mapped. So if you used CP1252 for that round-trip conversion, you would on average lose 5 out of every 256 bytes, or about 2% of your data.
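A small sketch that counts, for a given charset, how many single-byte values survive the byte[] → String → byte[] round trip (the exact number reported for windows-1252 depends on the JRE's mapping tables, and that charset may be absent from minimal JREs):

import java.nio.charset.Charset;

public class RoundTripCount {
    // Counts how many of the 256 byte values come back unchanged after decode + encode
    static int survivors(Charset charset) {
        int count = 0;
        for (int i = 0; i < 256; i++) {
            byte[] in = {(byte) i};
            byte[] out = new String(in, charset).getBytes(charset);
            if (out.length == 1 && out[0] == in[0]) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(survivors(Charset.forName("ISO-8859-1")));   // 256
        System.out.println(survivors(Charset.forName("windows-1252"))); // implementation-dependent
    }
}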

