I am trying to read a UTF-8 encoded XML file and pass a UTF-8 string to native code (a C++ DLL).
My problem is best explained with a sample program:
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.UnsupportedEncodingException;

    public class UniCodeTest {
        private static void testByteConversion(String input) throws UnsupportedEncodingException {
            byte[] utf_8 = input.getBytes("UTF-8"); // convert unicode string to UTF-8
            String test = new String(utf_8); // Build String with UTF-8 equivalent chars
            byte[] utf_8_converted = test.getBytes(); // Get the bytes: in effect this will be called in JNI wrapper on C++ side to read it in char*
            // simple workaround to print hex values (mask with 0xff so negative bytes are not sign-extended)
            String utfString = "";
            for (int i = 0; i < utf_8.length; i++) {
                utfString += " " + Integer.toHexString(utf_8[i] & 0xff);
            }
            String convertedUtfString = "";
            for (int i = 0; i < utf_8_converted.length; i++) {
                convertedUtfString += " " + Integer.toHexString(utf_8_converted[i] & 0xff);
            }
            if (utfString.equals(convertedUtfString)) {
                System.out.println("Success");
            } else {
                System.out.println("Failure");
            }
        }

        public static void main(String[] args) {
            try {
                File inFile = new File("c:/test.txt");
                BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(inFile), "UTF8"));
                String str;
                while ((str = in.readLine()) != null) {
                    testByteConversion(str);
                }
                in.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
The test file is stored in UTF-8 format (Tamil locale) and contains:
#just test
நனமை
நன்மை
I ran the following experiments:
With the file.encoding property set to 'UTF-8', I get 'Success' for both inputs.
With file.encoding set to 'CP-1252', I get 'Success' for the first input and 'Failure' for the second.
Here is what I got for the failure case:
utf_8 : e0 ae a8 e0 ae a9 e0 af 8d e0 ae ae e0 af 88
utf_8_converted : e0 ae a8 e0 ae a9 e0 af 3f e0 ae ae e0 af 88
I do not understand why 8d is converted to 3f when file.encoding is set to CP-1252. Can anyone please explain?
I am missing the link between file.encoding and string manipulation.
Thanks in advance :)
0x8d (141 in decimal) is unassigned in CP-1252, while it is a valid byte in UTF-8. Maybe that's your problem. See en.wikipedia.org/wiki/Windows-1252 and en.wikipedia.org/wiki/UTF-8 to check it.
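To make the comment above concrete, here is a minimal sketch that reproduces the effect without touching file.encoding. It assumes that new String(bytes) and getBytes() in the question fall back to the platform default charset (which file.encoding controlled in that setup), and that the JRE provides the "windows-1252" charset. The class name CharsetRoundTrip and the helpers roundTrip and toHex are made up for illustration; the byte values are copied from the hex dump in the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    // Decode the bytes with the given charset, then encode the resulting
    // String back to bytes with the same charset.
    public static byte[] roundTrip(byte[] bytes, Charset cs) {
        return new String(bytes, cs).getBytes(cs);
    }

    // Format bytes as two-digit hex; "& 0xFF" avoids sign extension.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // UTF-8 bytes of the failing input, taken from the question's dump.
        byte[] utf8 = {
            (byte) 0xe0, (byte) 0xae, (byte) 0xa8,
            (byte) 0xe0, (byte) 0xae, (byte) 0xa9,
            (byte) 0xe0, (byte) 0xaf, (byte) 0x8d,
            (byte) 0xe0, (byte) 0xae, (byte) 0xae,
            (byte) 0xe0, (byte) 0xaf, (byte) 0x88
        };
        Charset cp1252 = Charset.forName("windows-1252");

        System.out.println("original    : " + toHex(utf8));
        // 0x8d has no character in CP-1252, so decoding yields U+FFFD,
        // and encoding U+FFFD back to CP-1252 yields '?' (0x3f).
        System.out.println("via CP-1252 : " + toHex(roundTrip(utf8, cp1252)));
        // Valid UTF-8 survives a UTF-8 round trip unchanged.
        System.out.println("via UTF-8   : " + toHex(roundTrip(utf8, StandardCharsets.UTF_8)));
    }
}
```

The CP-1252 line should show 3f where the original has 8d, matching the failure case in the question. The first input happens to contain only bytes that CP-1252 assigns to some character, which would explain why it round-trips. If the goal is to hand UTF-8 bytes to native code, one option is to keep the array returned by getBytes("UTF-8") and pass that directly, rather than rebuilding a String from it with the default charset.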