18

I have come across this line of legacy code, which I am trying to figure out:

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8"));

As far as I can understand, it is encoding and decoding using the same charset.

How is this different from the following?

String newString = oldString;

Is there any scenario in which the two lines will have different outputs?

P.S.: Just to clarify, yes, I am aware of the excellent article on encoding by Joel Spolsky!

  • Well of course, one difference is that with String newString = oldString;, you still only have one copy of the string (you're just pointing to it from two variables). The decode/encode makes a copy of the string. Not that it matters much, since Strings are immutable. This probably isn't why that old code is that way, though; String has a much more direct way to clone itself (String(String)). I can't think of a good reason why you'd do the encoding/decoding, other than testing the String class's encoding/decoding methods. Commented Jan 13, 2012 at 16:48
  • Does the context give any hint why string conversion may be, or may once have been, necessary? Commented Jan 13, 2012 at 16:52
  • @T.J.Crowder: +1, of course! I did not mean the difference in the actual object referred to. Thanks for pointing that out. Commented Jan 13, 2012 at 18:14
  • There is one more major difference: one of them does not compile ;-) Commented Jul 3, 2013 at 12:59

2 Answers

22

This could be a complicated way of doing

String newString = new String(oldString);

This shortens the String if the underlying char[] used is much longer.

More specifically, however, it checks that every character can be UTF-8 encoded.

There are some "characters" you can have in a String which cannot be encoded, and these would be turned into '?'.

Any unpaired surrogate (a char between \uD800 and \uDFFF that is not part of a valid surrogate pair) cannot be encoded and will be turned into '?'.

String oldString = "\uD800";
String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");
System.out.println(newString.equals(oldString));

prints

false
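
To see which strings are affected, here is a minimal sketch. The class name Utf8RoundTrip and the helper survivesRoundTrip are made up for illustration, and it uses StandardCharsets.UTF_8 (Java 7 and later) rather than the "UTF-8" name so that no checked UnsupportedEncodingException has to be handled. It suggests that only lone surrogates are changed by the round trip, while a valid surrogate pair comes through intact.

```java
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    // Hypothetical helper: true if s is unchanged by a UTF-8 encode/decode round trip.
    static boolean survivesRoundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8).equals(s);
    }

    public static void main(String[] args) {
        System.out.println(survivesRoundTrip("plain ASCII"));  // true
        System.out.println(survivesRoundTrip("\uD83D\uDE00")); // true: valid surrogate pair
        System.out.println(survivesRoundTrip("\uD800"));       // false: lone high surrogate
        System.out.println(survivesRoundTrip("a\uDC00b"));     // false: lone low surrogate
    }
}
```

So the legacy line could, in principle, be used as a crude check for malformed UTF-16 input, although whether that was the original intent is only a guess.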

1 Comment

The only reason oldString fails to encode properly is that it is not a valid UTF-16 string (UTF-16 being the native representation of strings in Java) to begin with. UTF-8 itself is fully capable of encoding any and all Unicode code points. In this case, there would be a difference only when oldString contains an invalid sequence of UTF-16 code units.
4

How is this different from the following?

This line of code here:

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");

constructs a new String object (i.e. a copy of oldString), while this line of code:

String newString = oldString;

declares a new variable of type java.lang.String and initializes it to refer to the same String object as the variable oldString.

Is there any scenario in which the two lines will have different outputs?

Absolutely:

String newString = oldString;
boolean isSameInstance = newString == oldString; // isSameInstance == true

vs.

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");
boolean isSameInstance = newString == oldString; // isSameInstance == false

a_horse_with_no_name (see comment) is right of course. The equivalent of

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");

is

String newString = new String(oldString);

minus the subtle difference with respect to the encoding that Peter Lawrey explains in his answer.
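
A small sketch of the identity-versus-equality point, assuming a well-formed input string. The class name CopyVersusAlias and the helper roundTrip are hypothetical, and StandardCharsets.UTF_8 (Java 7 and later) stands in for the "UTF-8" name to avoid the checked exception.

```java
import java.nio.charset.StandardCharsets;

class CopyVersusAlias {
    // Hypothetical helper: encode to UTF-8 bytes and decode back into a new String.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String oldString = "hello";
        String alias = oldString;               // second reference to the same object
        String copy = new String(oldString);    // distinct object, same characters
        String decoded = roundTrip(oldString);  // distinct object, same characters

        System.out.println(alias == oldString);        // true
        System.out.println(copy == oldString);         // false
        System.out.println(decoded == oldString);      // false
        System.out.println(copy.equals(oldString));    // true
        System.out.println(decoded.equals(oldString)); // true
    }
}
```

For well-formed strings, then, the two copies differ from the plain assignment only in object identity, which code comparing with equals() will not notice.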

1 Comment

String newString = new String(oldString) would be equivalent to the "original" line I guess
