I'm facing a character encoding issue on Linux. I'm retrieving content from Amazon S3 that was saved using UTF-8 encoding. The content is in Chinese, and it displays correctly in the browser.

I'm using the Amazon SDK to retrieve the content and update it. Here's the code I'm using:


StringBuilder builder = new StringBuilder();
S3Object object = client.getObject(new GetObjectRequest(bucketName, key));
BufferedReader reader = new BufferedReader(
        new InputStreamReader(object.getObjectContent(), "utf-8"));
while (true) {
    String line = reader.readLine();
    if (line == null) 
        break;
    builder.append(line);
}

This code works fine on Windows: I was able to update the content and save it back without corrupting any of the Chinese characters in it.

But it behaves differently on Linux. The code is unable to translate the characters properly, and the Chinese characters are rendered as ???

I'm not sure what's going wrong here. Any pointers will be appreciated.

-Thanks

  • When you say the characters are rendered as ???, where are you seeing these rendered? Perhaps the data is fine but you're trying to display them in an environment that doesn't support Unicode or in a font that doesn't have the proper glyphs. Commented May 13, 2011 at 0:29
  • That code looks fine. It's probably your terminal that needs to be in UTF-8 mode to display the characters, or you're outputting the wrong encoding, probably using the platform default encoding which might not be UTF-8. Show us the code you use to output the characters, and tell us what terminal you're using. Commented May 13, 2011 at 0:29
  • When you say the characters are not showing up properly, are you outputting them to a console? If so, what type of console? Commented May 13, 2011 at 0:30
  • It's not about the display. I'm adding some text to the content and then saving it back to S3. The Chinese characters look fine if I do the processing on Windows and look up the updated data in S3. But if it gets processed on Linux, the characters just turn to ???. I'm viewing it in the browser using the S3 link. Commented May 13, 2011 at 0:38
  • Maybe I should be a bit more precise. After I retrieve the content, I add a few more Chinese characters to it and save it back to S3. The new characters I added look fine. The existing ones are the ones getting messed up. I'm clueless about this weird behaviour. Commented May 13, 2011 at 0:43

1 Answer

The default charset is different on the two OSes you're using.

To start off, you can confirm the difference by printing out the default charset.

Charset.defaultCharset().name()
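As a runnable version of that check, here is a minimal sketch. The class name `DefaultCharsetCheck` and the helper method are made up for illustration; `Charset.defaultCharset()` and the `file.encoding` system property are standard Java. Run it on both machines and compare the output.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    // Returns the canonical name of the JVM's default charset,
    // for example "UTF-8" or "US-ASCII".
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultCharsetName());
        // On older JVMs the default charset is derived from this property,
        // which in turn usually comes from the OS locale (e.g. LANG on Linux).
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

Note that on Java 18 and later the default is UTF-8 unless it is overridden, so a difference between platforms is more likely to show up on older runtimes.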

Somewhere in your code, I think this default charset is being used for some String conversion. The correct procedure should be to track that down, and specify UTF-8.
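To illustrate the kind of conversion to look for, here is a minimal sketch. It assumes the culprit is a call such as `String.getBytes()` or `new String(bytes)` with no charset argument, which I can't confirm without seeing your upload code. The class and method names are hypothetical; encoding with `US_ASCII` explicitly stands in for what the no-argument call does when the platform default is US-ASCII.

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // Encodes text with an explicit charset, independent of the platform default.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Stand-in for text.getBytes() on a JVM whose default charset is US-ASCII:
    // characters that cannot be mapped are replaced with '?'.
    static byte[] encodeAscii(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    // Decodes bytes with an explicit charset, independent of the platform default.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as Unicode escapes so the example
        // does not depend on the encoding of the source file.
        String original = "\u4e2d\u6587";

        String viaUtf8 = decodeUtf8(encodeUtf8(original));
        String viaAscii = decodeUtf8(encodeAscii(original));

        System.out.println("UTF-8 round trip intact: " + original.equals(viaUtf8));
        System.out.println("ASCII round trip result: " + viaAscii);
    }
}
```

If your save path looks like the ASCII branch, passing `StandardCharsets.UTF_8` (or `"UTF-8"`) wherever the string is turned into bytes or a stream should make the result identical on both platforms.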

Without seeing that code, I can only suggest the 'cheating' way to do it: set the default charset explicitly, near the beginning of your code, or at Java startup. For how to change the default charset, see the question "Setting the default Java character encoding?"

HTH

2 Comments

Thanks for your input. Charset.defaultCharset().name() shows US-ASCII. If I update .bashrc and add LANG=en_US.UTF-8, it works fine. But I want to do this programmatically instead of setting it in the bash profile. I'm not sure why encoding to UTF-8 doesn't solve the issue; I even tried encoding the strings to UTF-8. Is there a way to override the default character set in Java?
Hi Shamik, you said you found a way to solve this issue. I'm currently facing exactly the same problem. Could you please explain how you solved it?
