Understanding String encoding/decoding in Java

Question

I have a program that I run with mvn exec:java (my main file is encoded in UTF-8 and the default charset of my system is windows-1252):

System.out.println(Charset.defaultCharset()); // prints windows-1252
String s = "éàè";
System.out.println(new String(s.getBytes(Charset.forName("UTF-8")))); // OK: prints éàè
System.out.println(new String(s.getBytes(Charset.forName("windows-1252")))); // Not OK: prints ▒▒▒

I don't understand why the first print works. According to the documentation, getBytes encodes the String into a sequence of bytes using the given charset, and the String constructor constructs a new String by decoding the specified array of bytes using the platform's default charset.

So the first print encodes in UTF-8 and then decodes with the platform's default charset, which is windows-1252. How could this work? It should not be able to correctly decode the UTF-8 encoded byte array using the platform charset windows-1252.

The second print is wrong, and I don't understand why. As my file is encoded in UTF-8 and the platform charset is windows-1252, my intention is to encode the String with the windows-1252 charset, so I call s.getBytes(Charset.forName("windows-1252")) and then create a String from the result, but it doesn't work.

Try PrintStream out = new PrintStream(System.out, true, "windows-1252"); out.println(s);
– bradimus, Commented Mar 3, 2016 at 17:00
As a side note, the MS-DOS default charset is not 1252, see here: docs.oracle.com/javase/7/docs/technotes/guides/intl/…
– Arnaud, Commented Mar 3, 2016 at 17:01
PrintStream out = new PrintStream(System.out, true, "windows-1252"); doesn't work, but PrintStream out = new PrintStream(System.out, true, "utf-8"); does.
– Olivier Boissé, Commented Mar 3, 2016 at 17:04
@Berger you are right, I am using MinGW to execute my program; with MS-DOS the program works correctly.
– Olivier Boissé, Commented Mar 3, 2016 at 17:13

Remy Lebeau · Accepted Answer · 2016-03-04 03:16:41Z

The String value éàè is encoded in UTF-8 as byte octets 0xC3 0xA9 0xC3 0xA0 0xC3 0xA8. Those same byte octets interpreted as Windows-1252 are the String value Ã©Ã<nbsp>Ã¨ (where <nbsp> is a non-breaking space character, Unicode codepoint U+00A0).
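To check the byte values quoted above, a minimal sketch along these lines can help. The class name Utf8BytesDemo and the helper toHex are made up for illustration, and the literal is written with Unicode escapes so that the result does not depend on how the source file itself is encoded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8BytesDemo {
    // Unicode escapes for "éàè"; independent of the source file's encoding.
    public static final String ORIGINAL = "\u00e9\u00e0\u00e8";

    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = ORIGINAL.getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(utf8)); // C3 A9 C3 A0 C3 A8

        // Reading the same bytes as Windows-1252 yields six characters instead of three.
        String misread = new String(utf8, Charset.forName("windows-1252"));
        System.out.println(misread.length()); // 6
        System.out.println(misread.equals("\u00c3\u00a9\u00c3\u00a0\u00c3\u00a8")); // true
    }
}
```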

In the first example, you are converting a String to the above UTF-8 bytes, and then you are converting the bytes back to a String using Windows-1252 instead of UTF-8. So you should be getting a new String value of Ã©Ã<nbsp>Ã¨, not éàè. You are then writing that String to the console, so it gets encoded using Windows-1252 back to byte octets 0xC3 0xA9 0xC3 0xA0 0xC3 0xA8, which should be displayed as Ã©Ã<nbsp>Ã¨ (or something similar to it) if the console is displaying the bytes as-is. On the other hand, if the console is configured for UTF-8 instead, those bytes would display as éàè when interpreted as UTF-8.
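The round trip in the first example can be simulated without relying on the machine's actual default charset. This is only a sketch: FirstPrintDemo and bytesSentToConsole are hypothetical names, and Windows-1252 is passed explicitly to stand in for the platform default that new String(bytes) and System.out would use on the asker's system.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FirstPrintDemo {
    // Stand-in for the platform default charset described in the question.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Mimics the first println: UTF-8 bytes are decoded as Windows-1252,
    // then the resulting String is encoded as Windows-1252 for the console.
    public static byte[] bytesSentToConsole(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, CP1252); // what new String(bytes) would do
        return misread.getBytes(CP1252);           // what System.out would emit
    }

    public static void main(String[] args) {
        String original = "\u00e9\u00e0\u00e8"; // éàè
        byte[] out = bytesSentToConsole(original);

        // For these particular bytes the two wrong steps cancel out.
        System.out.println(Arrays.equals(out, original.getBytes(StandardCharsets.UTF_8))); // true

        // A console that interprets the output as UTF-8 therefore shows the original text.
        System.out.println(new String(out, StandardCharsets.UTF_8).equals(original)); // true
    }
}
```

The cancellation holds here because every byte involved maps to a Windows-1252 character and back; it should not be assumed for arbitrary input.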

In the second example, since you are using Windows-1252 for both encoding and decoding, and the particular characters in question are supported by Windows-1252, you should end up with the original String value éàè before writing it to the console. If that String then gets encoded to bytes using Windows-1252 while the console is configured for UTF-8, that would explain why you don't see éàè displayed. The String value éàè is encoded in Windows-1252 as byte octets 0xE9 0xE0 0xE8, which is not a valid UTF-8 byte octet sequence.
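The claim that 0xE9 0xE0 0xE8 is not valid UTF-8 can be checked with a strict decoder. The following is a sketch under the same assumptions as before; SecondPrintDemo and isValidUtf8 are illustrative names, not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class SecondPrintDemo {
    // Hypothetical helper: true only if the bytes decode as UTF-8 without errors.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String original = "\u00e9\u00e0\u00e8"; // éàè

        // Encoding and decoding with the same charset restores the String.
        byte[] bytes = original.getBytes(cp1252);
        System.out.println(new String(bytes, cp1252).equals(original)); // true
        System.out.println(bytes.length); // 3

        // Those three bytes are rejected by a strict UTF-8 decoder,
        // while the UTF-8 encoding of the same text is accepted.
        System.out.println(isValidUtf8(bytes)); // false
        System.out.println(isValidUtf8(original.getBytes(StandardCharsets.UTF_8))); // true
    }
}
```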

In short, the behavior you are seeing would happen when your console is configured to interpret outgoing bytes as UTF-8, but you are not giving it proper UTF-8 encoded bytes as output.
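One way to act on this, in line with the PrintStream suggestion from the comments, is to choose the output encoding explicitly so that it matches the console. The sketch below assumes the console expects UTF-8, as the MinGW terminal in the question appears to; ConsoleEncodingDemo, withEncoding and printedBytes are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;

public class ConsoleEncodingDemo {
    // Wraps a stream in an auto-flushing PrintStream that encodes text with the named charset.
    public static PrintStream withEncoding(OutputStream target, String encoding) {
        try {
            return new PrintStream(target, true, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the bytes a PrintStream with the given encoding emits for s.
    public static byte[] printedBytes(String s, String encoding) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = withEncoding(sink, encoding);
        out.print(s);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String s = "\u00e9\u00e0\u00e8"; // éàè

        System.out.println(printedBytes(s, "UTF-8").length);        // 6
        System.out.println(printedBytes(s, "windows-1252").length); // 3

        // Assumption: the console interprets incoming bytes as UTF-8.
        PrintStream utf8Out = withEncoding(System.out, "UTF-8");
        utf8Out.println(s);
    }
}
```

If the console uses a different encoding, the same helper can be called with that charset name instead.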

edited Mar 4, 2016 at 3:16

answered Mar 4, 2016 at 3:05

Remy Lebeau


4 Comments

erickson Over a year ago

This is a restatement of the question, not an answer. If you are asserting that his console is set to UTF-8, you should lead off with that.

Remy Lebeau Over a year ago

I did not just restate the question. The question is asking to understand how the encodings are behaving differently than expected. I think I answered that.

erickson Over a year ago

The lede was buried. It makes more sense now with the concluding paragraph.

Olivier Boissé Over a year ago

Exactly, @RemyLebeau. I forgot to take the encoding of the console into consideration. Indeed, the result differs depending on whether I use the MinGW console or MS-DOS.
