
The following test fails.

import static org.junit.Assert.assertArrayEquals;
import java.nio.charset.Charset;
import org.junit.Test;
import com.google.common.base.Charsets; // assuming Guava's Charsets; StandardCharsets.UTF_8 works equally well

@Test
public void testConversions() {
    final Charset charset = Charsets.UTF_8;
    final byte[] inputBytes = {37, 80, 68, 70, 45, 49, 46, 52, 13, 10, 37, -11, -28, -10, -4, 13, 10};
    final String string = new String(inputBytes, charset);
    final byte[] outputBytes = string.getBytes(charset);
    assertArrayEquals(inputBytes, outputBytes);
}

If ISO_8859_1 is used instead of UTF-8, the test passes, even with a much bigger inputBytes array. Do the input and output differ because of the 'variable-width' property of UTF-8?

Bonus question: Is it a true presumption that the conversions byte[] → String → byte[] will always have the same input and output byte arrays, if ISO_8859_1 is used?


2 Answers


Do the input and output differ because of 'variable-width' property of UTF-8?

They differ because, due to UTF-8's variable-width encoding, not every sequence of bytes can occur in a valid UTF-8 encoded string.

You can see this in the table on the Wikipedia article about UTF-8:

  • 1 byte: 0xxxxxxx
  • 2 bytes: 110xxxxx 10xxxxxx
  • 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
  • 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The xs mark bits that can be either 0 or 1; the explicit digits mark bits that must have exactly that value in a valid encoding.

Thus, you will never find e.g. 11000000 11000000 in a valid UTF-8 string. If you attempt to build a string from such bytes, the character encoding will do... something. Specifically:

[new String(byte[], Charset)] always replaces malformed-input and unmappable-character sequences with this charset's default replacement string

So the string you build won't necessarily map back to the original input bytes.
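For example (a minimal sketch; the 0xC0 0xC0 pair is just one concrete instance of an invalid sequence):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplacementDemo {
    public static void main(String[] args) {
        // 0xC0 can never appear in valid UTF-8, so this two-byte sequence is malformed
        byte[] invalid = { (byte) 0xC0, (byte) 0xC0 };
        // Malformed input is replaced with the default replacement character U+FFFD
        String decoded = new String(invalid, StandardCharsets.UTF_8);
        // U+FFFD re-encodes as EF BF BD, so the round trip does not restore the input
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(invalid));   // [-64, -64]
        System.out.println(Arrays.toString(reencoded)); // e.g. [-17, -65, -67, -17, -65, -67]
    }
}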

Bonus question

Yes, because ISO-8859-1 is a fixed-width, single-byte encoding in which every possible byte value has exactly one corresponding character.
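A quick way to convince yourself (a small sketch, not part of the original test):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    public static void main(String[] args) {
        // Every byte value 0x00..0xFF decodes to exactly one character in ISO-8859-1,
        // and that character encodes back to the same byte.
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String s = new String(all, StandardCharsets.ISO_8859_1);
        byte[] back = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(all, back)); // true
    }
}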


There is no good reason to convert a byte[] directly to a String unless you know it is a valid encoding of a string you want to recover, and you know the charset that was used to encode it (or you suspect it is a string and want to attempt to recover its contents).

If you want to transmit a byte[] over some channel that requires you to send strings, use something like base64 encoding.
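For example, with the JDK's built-in java.util.Base64 (a sketch, using the byte array from the question):

import java.util.Arrays;
import java.util.Base64;

public class Base64Transport {
    public static void main(String[] args) {
        byte[] original = {37, 80, 68, 70, 45, 49, 46, 52, 13, 10, 37, -11, -28, -10, -4, 13, 10};
        // Encode the raw bytes into a plain ASCII string that is safe to send as text
        String transmitted = Base64.getEncoder().encodeToString(original);
        // Decode on the receiving side; the round trip is lossless for any byte[]
        byte[] received = Base64.getDecoder().decode(transmitted);
        System.out.println(Arrays.equals(original, received)); // true
    }
}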


2 Comments

Got it! Thank you very much for such an elaborate explanation! :) The advice about Base64 will be very useful. Thanks again!
There is an alternative to "always replaces malformed-input and unmappable-character sequences": exceptions! Expecting one encoding and receiving bytes that don't match it is certainly exceptional: either the receiving code is wrong or the data is corrupt (almost certainly corrupted by the sending code rather than by a transmission or storage problem). If you use the charset decoding classes in a mode where they throw exceptions, you'll find bugs sooner. If you still want to pass corrupted text along, on the theory that some is better than none, you can do that in an exception handler.
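A minimal sketch of that strict-decoding approach, using CharsetDecoder with CodingErrorAction.REPORT (the input bytes are the ones from the question; class and variable names are illustrative only):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    public static void main(String[] args) {
        byte[] input = {37, 80, 68, 70, 45, 49, 46, 52, 13, 10, 37, -11, -28, -10, -4, 13, 10};
        // REPORT makes the decoder throw instead of substituting U+FFFD
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            String s = decoder.decode(ByteBuffer.wrap(input)).toString();
            System.out.println("Decoded: " + s);
        } catch (CharacterCodingException e) {
            // The 0xF5 byte (-11) can never appear in valid UTF-8, so we end up here
            System.out.println("Not valid UTF-8: " + e);
        }
    }
}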

Bonus question: Is it a true presumption that the conversions byte[] → String → byte[] will always have the same input and output byte arrays, if ISO_8859_1 is used?

Yes. Any single-byte charset that maps a unique character to each byte will preserve all the byte values in a round-trip conversion. And as of 1987, ISO 8859-1 does have a unique mapping for every single byte value.

Whereas CP1252 (Windows Latin-1), a common default charset on Windows, has 5 byte values to which no character is mapped. So if you used CP1252 for that round-trip conversion, you would on average lose 5 out of every 256 bytes, or about 2% of your data.
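A small sketch that counts, for a given charset, how many single-byte values survive the byte[] → String → byte[] round trip (the exact number reported for windows-1252 depends on the JRE's mapping tables, and that charset may be absent from minimal JREs):

import java.nio.charset.Charset;

public class RoundTripCount {
    // Counts how many of the 256 byte values come back unchanged after decode + encode
    static int survivors(Charset charset) {
        int count = 0;
        for (int i = 0; i < 256; i++) {
            byte[] in = {(byte) i};
            byte[] out = new String(in, charset).getBytes(charset);
            if (out.length == 1 && out[0] == in[0]) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(survivors(Charset.forName("ISO-8859-1")));   // 256
        System.out.println(survivors(Charset.forName("windows-1252"))); // implementation-dependent
    }
}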

