0

I'm working on a small C++ app that does a bit of string handling. Currently, I want to extract the character at a particular character index. My naive solution of using string's at() method works fine for ASCII, but it breaks for non-ASCII strings. For example:

string test = "ヘ(^_^ヘ)(ノ^_^)ノ";
cout << test.at(0) << endl;

Produces a pound sign as output for me under gcc 4.2. I don't think it's a problem with my terminal either, because I can print out the entire string just fine. Is there a library or something I could use to get the desired effect?

2 Answers

2

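string stores chars, which are only 8 bits each on common platforms, so a multi-byte character does not fit in a single element. You need to use wstring if you want wider characters. Note that wchar_t is 16 bits on Windows but usually 32 bits with gcc on Linux and Mac OS X.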

1

Your string is probably UTF-8, where "characters" and "bytes" are not the same thing. The std::string class assumes "characters" are one byte each, so the results are wrong.
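To make the byte/character difference concrete, here is a minimal sketch that counts characters (code points) in a UTF-8 string by skipping continuation bytes. The function name utf8_length is made up for illustration, and the code assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by counting every byte
// that is NOT a continuation byte (continuation bytes look like 10xxxxxx).
// Assumes `s` is valid UTF-8; nothing is validated here.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII: start of a new character
        }
    }
    return count;
}
```

For the string in the question, size() reports 22 bytes while this function reports 14 characters, because each of the four katakana takes three bytes. at(0) returns only the first of those three bytes, which is why the output looks garbled.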

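Your options are to convert the string to UTF-16 and use a wstring instead, where you can (generally) assume that characters are all two bytes (a wchar_t or short) each, or you can use a library like ICU or UTF8-CPP to operate on UTF-8 strings directly, doing things like "get the 3rd character" rather than "get the 3rd byte". Two caveats on the first option: wchar_t is only 16 bits on Windows (gcc on Linux and Mac OS X uses 32 bits), and characters outside the Basic Multilingual Plane take two 16-bit units in UTF-16.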
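As a rough sketch of the "convert to a fixed-width encoding" option, the function below decodes UTF-8 by hand into a std::u32string. Note that this swaps in UTF-32 (C++11's char32_t) rather than the UTF-16/wstring route mentioned above, since every code point then fits in exactly one element. The name utf8_to_utf32 is hypothetical, and the validation is deliberately light: overlong encodings and surrogate values are not rejected. A library such as ICU or UTF8-CPP would be the safer choice for real input.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 into one char32_t per code point.
// Checks lead and continuation byte patterns only.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + extra >= in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

After conversion, indexing works per character: element 0 of the decoded question string is U+30D8 (ヘ) rather than a stray byte. Printing the result still requires encoding it back to whatever the terminal expects.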

Or, if you want to go minimalist, you could just code up a (relatively) simple function to get the byte offset and length of a particular character by reusing the internals of one of the UTF-8 string-length functions from one of the libraries listed above or from a web search. Basically you have to inspect the first byte of each character and, depending on which bits are set, skip the 1-3 continuation bytes that follow it (none for plain ASCII) to get to the start of the next character.

Here's one that could be easily translated from PHP:

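$count = 0;                              // number of characters seen so far
for($i = 0; $i < strlen($str); $i++) {
    $value = ord($str[$i]);              // first byte of the current character
    if($value > 127) {
        if($value >= 192 && $value <= 223)
            $i++;                        // 2-byte character: skip 1 extra byte
        elseif($value >= 224 && $value <= 239)
            $i = $i + 2;                 // 3-byte character: skip 2 extra bytes
        elseif($value >= 240 && $value <= 247)
            $i = $i + 3;                 // 4-byte character: skip 3 extra bytes
        else
            die('Not a UTF-8 compatible string');
    }
    $count++;
}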

http://www.php.net/manual/en/function.strlen.php#25715
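
A possible C++ translation of that loop, adapted to return the character at a given index instead of the total count, might look like the sketch below. The name utf8_char_at is made up for illustration. Like the PHP version, it only looks at lead bytes and does not verify that the following bytes are real continuation bytes.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the bytes of the character at `index`,
// where `index` counts characters (0-based), not bytes.
std::string utf8_char_at(const std::string& str, std::size_t index) {
    std::size_t count = 0;  // characters seen so far
    std::size_t i = 0;      // current byte offset
    while (i < str.size()) {
        unsigned char value = static_cast<unsigned char>(str[i]);
        std::size_t length;  // total bytes in this character
        if (value <= 127)                       length = 1;
        else if (value >= 192 && value <= 223)  length = 2;
        else if (value >= 224 && value <= 239)  length = 3;
        else if (value >= 240 && value <= 247)  length = 4;
        else throw std::invalid_argument("Not a UTF-8 compatible string");

        if (count == index) {
            // i is the byte offset, length the byte length of the character
            return str.substr(i, length);
        }
        i += length;
        ++count;
    }
    throw std::out_of_range("character index out of range");
}
```

With the string from the question, utf8_char_at(test, 0) returns the three bytes of ヘ, so writing the result to cout prints the whole character on a UTF-8 terminal. Each call walks the string from the start, so it is linear in the index; for repeated random access, converting once to a fixed-width form is likely a better fit.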