Java String encoding (UTF-8)

Question

I have come across this line of legacy code, which I am trying to figure out:

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8"));

As far as I can understand, it is encoding and decoding using the same charset.

How is this different from the following?

String newString = oldString;

Is there any scenario in which the two lines will have different outputs?

P.S. Just to clarify: yes, I am aware of the excellent article on encoding by Joel Spolsky!

Well of course, one difference is that with String newString = oldString;, you still only have one copy of the string (you're just pointing to it from two variables). The decode/encode makes a copy of the string. Not that it matters much, since Strings are immutable. This probably isn't why that old code is that way, though; String has a much more direct way to clone itself (String(String)). I can't think of a good reason why you'd do the encoding/decoding, other than testing the String class's encoding/decoding methods.
– T.J. Crowder, Commented Jan 13, 2012 at 16:48
Does the context give any hint why string conversion may be, or may have been, necessary?
– Thorbjørn Ravn Andersen, Commented Jan 13, 2012 at 16:52
@T.J.Crowder: +1, of course! I did not mean the difference in the actual object referred to. Thanks for pointing that out.
– OceanBlue, Commented Jan 13, 2012 at 18:14
There is one more major difference: one of them does not compile ;-)
– Draško Kokić, Commented Jul 3, 2013 at 12:59

Peter Lawrey · Accepted Answer · 2012-01-13 17:09:37Z

22

This could be a complicated way of doing

String newString = new String(oldString);

This shortens the String if the underlying char[] it uses is much longer than the string itself.

More specifically, however, it checks that every character can be UTF-8 encoded.

There are some "characters" you can have in a String which cannot be encoded, and these would be turned into '?'.

Any unpaired surrogate, i.e. a char between \uD800 and \uDFFF that is not part of a valid surrogate pair, cannot be encoded and will be turned into '?'.

String oldString = "\uD800";
String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");
System.out.println(newString.equals(oldString));

prints

false
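As a minimal sketch of where the round trip does and does not change the string, the following compares an unpaired surrogate with a valid surrogate pair and with ordinary text. The class name RoundTripDemo and the helper roundTrip are hypothetical names introduced for illustration; StandardCharsets (Java 7+) is used instead of the "UTF-8" string so that no checked exception has to be handled.

```java
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    // Hypothetical helper: encodes to UTF-8 and decodes again,
    // which is what the legacy line in the question does.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String lone = "\uD800";        // unpaired high surrogate, not valid UTF-16
        String pair = "\uD83D\uDE00";  // valid surrogate pair (code point U+1F600)
        String plain = "h\u00E9llo";   // ordinary text with a non-ASCII letter

        System.out.println(roundTrip(lone).equals(lone));   // false
        System.out.println(roundTrip(lone).equals("?"));    // true
        System.out.println(roundTrip(pair).equals(pair));   // true
        System.out.println(roundTrip(plain).equals(plain)); // true
    }
}
```

Only the unpaired surrogate is altered; valid surrogate pairs and ordinary characters survive the round trip unchanged.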

answered Jan 13, 2012 at 17:09

Peter Lawrey


1 Comment

Cagatay Over a year ago

The only reason oldString fails to encode properly is because it is not a valid UTF-16 (native representation of strings in Java) string to begin with. UTF-8 is fully capable of encoding any and all Unicode code points itself. In this case, there would be a difference only when oldString contains an invalid sequence of UTF-16 bytes.

afrischke · Accepted Answer · 2012-01-13 17:45:31Z

4

How is this different from the following?

This line of code here:

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");

constructs a new String object (i.e. a copy of oldString), while this line of code:

String newString = oldString;

declares a new variable of type java.lang.String and initializes it to refer to the same String object as the variable oldString.

Is there any scenario in which the two lines will have different outputs?

Absolutely:

String newString = oldString;
boolean isSameInstance = newString == oldString; // isSameInstance == true

vs.

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");
boolean isSameInstance = newString == oldString; // isSameInstance == false (in most cases)

a_horse_with_no_name (see comment) is right of course. The equivalent of

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");

is

String newString = new String(oldString);

minus the subtle difference with respect to the encoding that Peter Lawrey explains in his answer.
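To make the identity-versus-equality point concrete, here is a small sketch that compares the three ways of obtaining newString. The class name CopyVsAlias and the helper roundTrip are hypothetical, and StandardCharsets is used in place of the "UTF-8" string to avoid the checked UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;

public class CopyVsAlias {
    // Hypothetical helper mirroring the legacy encode/decode line.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String oldString = "hello";

        String alias = oldString;               // second reference to the same object
        String copy = new String(oldString);    // new object, same characters
        String viaBytes = roundTrip(oldString); // new object, built from UTF-8 bytes

        // Reference comparison: only the alias is the same instance.
        System.out.println(alias == oldString);    // true
        System.out.println(copy == oldString);     // false
        System.out.println(viaBytes == oldString); // false

        // Content comparison: all three are equal for this input.
        System.out.println(alias.equals(oldString));    // true
        System.out.println(copy.equals(oldString));     // true
        System.out.println(viaBytes.equals(oldString)); // true
    }
}
```

For well-formed input such as "hello", the copy constructor and the round trip behave alike; they differ only for strings containing unpaired surrogates.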

edited Jan 13, 2012 at 17:45

answered Jan 13, 2012 at 16:55

afrischke


1 Comment

user330315 Over a year ago

String newString = new String(oldString) would be equivalent to the "original" line I guess
