Java file.encoding on reading UTF-8 file and handling UTF-8 string

Question

I am trying to read a UTF-8 encoded XML file and pass the resulting UTF-8 string to native code (a C++ DLL).

My problem is best explained with a sample program:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;

public class UniCodeTest {

    private static void testByteConversion(String input) throws UnsupportedEncodingException  {    

        byte[] utf_8 = input.getBytes("UTF-8");  // convert unicode string to UTF-8
        String test = new String(utf_8);         // Build String with UTF-8 equivalent chars (decoded with the default charset)
        byte[] utf_8_converted = test.getBytes();// Get the bytes: in effect this will be called in JNI wrapper on C++ side to read it in char*

        // simple workaround to print hex values
        String utfString = "";
        for (int i = 0; i < utf_8.length; i++) {
            utfString += " " + Integer.toHexString(utf_8[i]);
        }          

        String convertedUtfString = "";
        for (int i = 0; i < utf_8_converted.length; i++) {
            convertedUtfString += " " + Integer.toHexString(utf_8_converted[i]);
        }
        if (utfString.equals(convertedUtfString))   {
            System.out.println("Success" ); 
        }
        else {
            System.out.println("Failure" ); 
        }
    }

    public static void main(String[] args) {
        try {
              File inFile = new File("c:/test.txt");
              BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(inFile), "UTF8"));
              String str;
              while ((str = in.readLine()) != null) {
                  testByteConversion(str);
              }
              in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

The test file is stored in UTF-8 format and contains Tamil text:

#just test
 நனமை
 நன்மை

I ran the following experiments:

With file.encoding set to 'UTF-8', I get 'Success' for both inputs.
With file.encoding set to 'CP-1252', I get 'Success' for the first input and 'Failure' for the second.

Here is what I got for the failure case:

utf_8           :  e0 ae a8 e0 ae a9 e0 af 8d e0 ae ae e0 af 88
utf_8_converted :  e0 ae a8 e0 ae a9 e0 af 3f e0 ae ae e0 af 88

I do not understand why 8d is converted to 3f when file.encoding is set to CP-1252. Can anyone please explain?

I am missing the link between file.encoding and string manipulation.

Thanks in advance :)

I am not sure, but looking at the code page layouts of both encodings, I can see that 8d (141 in decimal) is unassigned in CP-1252, while it has a value in UTF-8. Maybe that's your problem. See en.wikipedia.org/wiki/Windows-1252 and en.wikipedia.org/wiki/UTF-8 to check it.
– Eypros, Commented Jul 16, 2014 at 6:12

nablex · Accepted Answer · 2014-07-16 06:08:55Z

2

I have only skimmed your post, but this step is odd:

byte[] utf_8 = input.getBytes("UTF-8");  // convert unicode string to UTF-8
String test = new String(utf_8);

You take a Java String (a sequence of Unicode characters, independent of any byte encoding) and transform it to bytes with a given encoding (UTF-8), but then you construct a new String without specifying the encoding. In effect, test now contains the UTF-8 bytes decoded with the system default encoding, which may or may not give a valid result, depending on what is in the string and which default encoding you have.

In the next step you get the bytes back from the horrific entity that is "test", again in the default encoding. Assuming it even works (that is, the bytes of the original UTF-8 string form a valid byte sequence in your default encoding), this step merely undoes the previous one, because it uses the same default encoding that was used to construct test. When it does not work, as with the byte 8d, which has no mapping in CP-1252, the decoder substitutes a replacement character, and encoding that character back to CP-1252 yields '?' (3f). The call in question:

byte[] utf_8_converted = test.getBytes();
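
The round trip described above can be reproduced without touching file.encoding by passing the charset explicitly. The following is a minimal sketch, not the original program: the class name LossyRoundTrip and the helpers hex and roundTrip are made up for illustration, and it assumes the JVM provides a charset named windows-1252 (the usual name for CP-1252).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LossyRoundTrip {

    // Format bytes as unsigned two-digit hex values separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Decode and re-encode with the same charset. This mimics
    // new String(bytes) followed by getBytes() when that charset is the default.
    static byte[] roundTrip(byte[] bytes, Charset charset) {
        return new String(bytes, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        // The second test input from the question, written with escapes
        // so the source file encoding does not matter.
        String input = "\u0BA8\u0BA9\u0BCD\u0BAE\u0BC8";
        byte[] utf8 = input.getBytes(StandardCharsets.UTF_8);

        // Assumption: "windows-1252" is available in this JVM.
        Charset cp1252 = Charset.forName("windows-1252");

        System.out.println("utf_8            : " + hex(utf8));
        System.out.println("via windows-1252 : " + hex(roundTrip(utf8, cp1252)));
        System.out.println("via UTF-8        : " + hex(roundTrip(utf8, StandardCharsets.UTF_8)));
    }
}
```

If the charset behaves as in the question, the windows-1252 line should show 3f where the original has 8d, while the UTF-8 line should match the original bytes.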


1 Comment

user1919511 Over a year ago

The test program above is a trimmed version of code written about 10 years ago, so usability is not the question here. The idea behind the code is to construct a UTF-8 encoded string so that native code can access it to support UTF-8. BTW, what you call Unicode code points is in fact UTF-16 encoding. I am more interested in understanding how and why file.encoding affects the behavior. Any pointer in that direction would be helpful. Thanks.

Ankur Shanbhag · Accepted Answer · 2014-07-16 06:12:18Z

1

I think this statement is causing the issue:
byte[] utf_8_converted = test.getBytes();

From the documentation of String.getBytes() API:

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

The behavior of this method when this string cannot be encoded in the default charset is unspecified. The CharsetEncoder class should be used when more control over the encoding process is required.
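
As an illustration of the "more control" the documentation refers to, here is a small sketch that uses CharsetDecoder and CharsetEncoder with CodingErrorAction.REPORT, so that bad input raises an exception instead of being replaced silently. The class name StrictCoding and the helpers decodeStrict and encodeStrict are hypothetical, and the sketch assumes a charset named windows-1252 is available in the JVM.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictCoding {

    // Decode bytes, throwing on malformed or unmappable input
    // instead of substituting a replacement character.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Encode text, throwing when a character cannot be represented
    // in the target charset instead of emitting '?'.
    static byte[] encodeStrict(String text, Charset charset) throws CharacterCodingException {
        ByteBuffer buf = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(text));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        String input = "\u0BA8\u0BA9\u0BCD\u0BAE\u0BC8"; // second test input from the question
        byte[] utf8 = input.getBytes(StandardCharsets.UTF_8);
        Charset cp1252 = Charset.forName("windows-1252"); // assumed to be available

        try {
            decodeStrict(utf8, cp1252);
            System.out.println("decoded as windows-1252 without errors");
        } catch (CharacterCodingException e) {
            System.out.println("cannot decode as windows-1252: " + e);
        }

        try {
            encodeStrict(input, cp1252);
            System.out.println("encoded as windows-1252 without errors");
        } catch (CharacterCodingException e) {
            System.out.println("cannot encode as windows-1252: " + e);
        }
    }
}
```

Failing loudly like this makes a charset mismatch visible at the point where it happens, rather than later as a stray 3f in the output.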

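Point to note: the default charset used for this conversion is not necessarily UTF-8; in the failing case it is CP-1252.

Try specifying the charset explicitly (and do the same when constructing the String, i.e. new String(utf_8, "UTF-8")):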

byte[] utf_8_converted = test.getBytes("UTF-8");
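
A minimal sketch of the same idea applied to both directions, so that neither the String constructor nor getBytes depends on the default charset. The class name ExplicitCharset and the helpers toUtf8 and fromUtf8 are made up for illustration; StandardCharsets requires Java 7 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ExplicitCharset {

    // Encode text as UTF-8 regardless of the platform default charset.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes regardless of the platform default charset.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String input = "\u0BA8\u0BA9\u0BCD\u0BAE\u0BC8"; // second test input from the question
        byte[] utf8 = toUtf8(input);
        String decoded = fromUtf8(utf8);
        byte[] again = toUtf8(decoded);

        // Both the text and the bytes should survive the round trip.
        boolean same = decoded.equals(input) && Arrays.equals(utf8, again);
        System.out.println(same ? "Success" : "Failure");
    }
}
```

If the goal is only to hand UTF-8 bytes to native code, one option is to pass the byte[] from toUtf8 directly, which avoids building an intermediate String out of already-encoded bytes.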


1 Comment

user1919511 Over a year ago

I am trying to understand how file.encoding affects the platform character set. Any pointer in that direction is much appreciated.
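
One way to observe the link is to print the property next to Charset.defaultCharset(), which is the charset that new String(byte[]) and getBytes() fall back to. The sketch below uses a hypothetical class name, DefaultCharsetInfo. On JVMs from the time of this question the default charset is generally derived from file.encoding at startup; from JDK 18 onward the default is UTF-8 unless overridden, so the result depends on the Java version and platform.

```java
import java.nio.charset.Charset;

public class DefaultCharsetInfo {

    // Report the file.encoding property together with the charset
    // the JVM actually uses as its default.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once as `java -Dfile.encoding=UTF-8 DefaultCharsetInfo` and once as `java -Dfile.encoding=Cp1252 DefaultCharsetInfo` should show whether the property changes the default charset on a given JVM.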
