3

I am using MySQL 5.6.27, Java Connector 5.1.36 on Linux, and I have problem with some of the Serbian/Croatian/Slovenian characters.

Database is started with

./bin/mysqld_safe --user=mysql --bind_address=localhost --character-set-server=utf8 &

Database is created with

-- CREATE USER 'my_test'@'localhost' IDENTIFIED BY 'my_test';
-- CREATE DATABASE my_test DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
-- GRANT ALL PRIVILEGES ON my_test.* TO 'my_test'@'localhost' IDENTIFIED BY 'my_test';
USE my_test;

CREATE TABLE proba
(
    content TEXT NOT NULL
) CHARACTER SET 'utf8' COLLATE 'utf8_general_ci';

INSERT INTO proba(content)
VALUES ('markovič marko SURČIN');

INSERT INTO proba(content)
VALUES ('Nikolić Nikola Ćićevac');   

INSERT INTO proba(content)
VALUES ('petroviš đura Đeram');

INSERT INTO proba(content)
VALUES ('Milošević Miloš Šabac');

INSERT INTO proba(content)
VALUES ('jovanović žarko Žarkovo');

This dump is imported into MySQL with

/usr/local/mysql-5.6.27/bin/mysql --user=my_test --password < schema.sql

Java client fetches data with

public class Serbian
{
    public static void main(String[] args) throws ClassNotFoundException, SQLException
    {
        Connection conn = null;
        try
        {
            Class.forName("com.mysql.jdbc.Driver");
            conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/my_test?user=my_test&password=my_test&useUnicode=true&characterEncoding=UTF-8&collation=utf8_unicode_ci");

            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT content FROM proba");
            while(rs.next())
            {
                String s = rs.getString("content");
                System.out.println(s);
            }
            rs.close();
        }
        catch (SQLException exc)
        {
            exc.printStackTrace();
        }
        finally
        {
            conn.close();
        }
    }
}

The result is

markovič marko SUR??IN
Nikoli?? Nikola ??i??evac
petroviš ??ura Đeram
Miloševi?? Miloš Šabac
jovanovi?? žarko Žarkovo

(question marks are actually invalid characters reported by editor). In other words, letters Č, Ć, ć, đ are invalid while Ž, ž, č, Š, š, Đ are valid. It's weird that UTF8 partially works. Is there something that I should give a try or this seems to be a MySQL/Java Connector problem?

16
  • This is the console output, but could you try writing this in a file instead ? Commented Dec 18, 2015 at 13:35
  • Actually, it's written into a file, not the console. Commented Dec 18, 2015 at 13:37
  • 1
    @Jan: Here is the hex dump: 6d 61 72 6b 6f 76 69 c4 8d 20 6d 61 72 6b 6f 20 53 55 52 c4 3f 49 4e 0a 4e 69 6b 6f 6c 69 c4 3f 20 4e 69 6b 6f 6c 61 20 c4 3f 69 c4 3f 65 76 61 63 0a 70 65 74 72 6f 76 69 c5 a1 20 c4 3f 75 72 61 20 c4 90 65 72 61 6d 0a 4d 69 6c 6f c5 a1 65 76 69 c4 3f 20 4d 69 6c 6f c5 a1 20 c5 a0 61 62 61 63 0a 6a 6f 76 61 6e 6f 76 69 c4 3f 20 c5 be 61 72 6b 6f 20 c5 bd 61 72 6b 6f 76 6f 0a @duffymo: I believe that I've set everything (see the post), the weird part is that it works partially. If it wasn't set, I guess none of the non-latin characters would work. Commented Dec 18, 2015 at 15:06
  • 2
    Your letter Č shows up as c4 3f which is 쐿 in utf-8. There has to be something broken before you retrieve that string. How did you get the bytes? Commented Dec 18, 2015 at 15:28
  • 1
    Looking at the middle of that hex: 53 55 52 c4 3f 49 4e -- Assuming it is latin1, it says SURÄ?IN. In utf8 the hex should be 53 55 52 C48C 49 4E. Because of the ?, I suspect you have made attempts to fix it. But in reality just made things worse. Commented Dec 19, 2015 at 20:35

2 Answers 2

1

Comment field too short. Not yet a "final" answer.

Where your letter Č should be hexdump shows up as c4 3f which is not valid utf-8.

Whereas

for(byte b : "Č".getBytes("UTF-8")) {
   System.out.println("-> " + Integer.toHexString(b));
}

Results in

-> ffffffc4
-> ffffff8c

And \uc48c is correctly printed as Č again.

You could check if moving from getString() to getNString() changes result.

Wait a second

There's 'c4 3f' everywhere some of your "missing" characters should be. The hex is already broken.

Sign up to request clarification or add additional context in comments.

9 Comments

I tried also getNString() and works in same way. If hex is already broken, then it's a problem at Connector or MySQL side?
It might still be the output into file / system.out try writing bytes of string you read from db
In mysql you can see names as they're supposed to be?
Well I suppose you'll have to do that onion-wise: layer by layer. Check bytes in mysql. See what bytes are received in getString () and so on. Have you tried inserting the strings from java?
I looks like you might be mixing up UTF-8 encodings and Unicode code points. C4 3F is actually not a valid UTF-8 encoding because the second byte starts with the bits 00, not 10. And Č is indeed UTF-8 encoded as C4 8C but that's not the same as the Unicode code point \uC48C which is and UTF-8 encoded as EC 92 8C.
|
0

Here is the solution. The SQL file remains same, although adding

SET NAMES 'utf8' COLLATE 'utf8_general_ci';

doesn't hurt. Importing dump has to be changed by adding the switch --default-character-set=utf8:

/usr/local/mysql-5.6.27/bin/mysql --user=my_test --password --default-character-set=utf8 < schema.sql

Without this parameter, hex string in the query (SELECT hex(content) FROM proba;) shows corrupted hex string with the non-existing UTF8 characters. Finally, Java code has to deal with UTF8, so in the while loop the following line fetches the UTF8 string:

System.out.println(new String(rs.getString("content").getBytes(Charset.forName("UTF-8"))));

Execution of the program and redirecting to file (java -cp ".:./mysql-connector-java-5.1.36-bin.jar" Serbian > java.log) does not make trouble, so everything looks fine:

markovič marko SURČIN
Nikolić Nikola Ćićevac
petroviš đura Đeram
Milošević Miloš Šabac
jovanović žarko Žarkovo

@Jan, @GordThompson: thanks for help in better understanding the problem.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.