Most texts on the C++ standard library mention wstring as being the equivalent of string, except parameterized on wchar_t instead of char, and then proceed to demonstrate string only.

Well, sometimes there are specific quirks, and here is one: I can't seem to assign a wstring from a null-terminated array of 16-bit characters. The problem is that the assignment happily treats the null character and whatever garbage follows as actual characters. Here is a very small reduction:

typedef unsigned short PA_Unichar;
PA_Unichar arr[256];
fill(arr); // sets to 52 00 4b 00 44 00 61 00 74 00 61 00 00 00 7a 00 7a 00 7a 00
// now arr contains "RKData\0zzz" in its 10 first values
wstring ws;
ws.assign((const wchar_t *)arr);
int l = ws.length();

At this point l is not the expected 6 (the number of chars in "RKData"), but much larger. In my test run, it is 29. Why 29? No idea. A memory dump doesn't show any specific value for the 29th character.

So the question: is this a bug in my standard C++ library (Mac OS X Snow Leopard), or a bug in my code? How am I supposed to assign a null-terminated array of 16-bit chars to a wstring?

Thanks

3
  • Just a shot in the dark, try a double null terminator Commented Aug 27, 2009 at 11:56
  • @obelix, a null character is the same both big- and little-endian. Commented Aug 27, 2009 at 11:57
  • @Nick - yep. i saw binary and thought it might be endianness Commented Aug 27, 2009 at 11:59

3 Answers

9

Under most Unixes (Mac OS X included), wchar_t represents a single UTF-32 code point, not a 16-bit UTF-16 code unit as on Windows.

So you need to:

  1. Either:

    ws.assign(arr, arr + length_of_string);
    

    That uses arr as an iterator and copies each unsigned short to a wchar_t. But this works only if your characters lie in the BMP, or if the data is UCS-2 (the legacy 16-bit encoding).

  2. Or work with UTF-16 correctly by converting UTF-16 to UTF-32: you need to find surrogate pairs and merge each pair into a single code point.
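
Both options can be sketched roughly as follows. This is only an illustration, not tested on Snow Leopard: the helper names unichar_length, widen_bmp and utf16_to_wstring are made up for this example, and the second function assumes wchar_t is wide enough to hold a full code point (32 bits, as on Mac OS X and Linux). Unpaired surrogates are copied through unchanged rather than reported as errors.

```cpp
#include <cstddef>
#include <string>

typedef unsigned short PA_Unichar; // 16-bit code unit, as in the question

// Hypothetical helper: count 16-bit units up to, not including, the
// terminating zero. wcslen cannot be used because it reads wchar_t units.
std::size_t unichar_length(const PA_Unichar* s) {
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    return n;
}

// Option 1: widen each 16-bit unit to one wchar_t via the iterator-range
// constructor. Correct only for text inside the BMP (UCS-2).
std::wstring widen_bmp(const PA_Unichar* s) {
    return std::wstring(s, s + unichar_length(s));
}

// Option 2: decode UTF-16, merging each surrogate pair into one code point.
// Assumption: wchar_t can hold values above 0xFFFF (i.e. it is 32 bits).
std::wstring utf16_to_wstring(const PA_Unichar* s) {
    std::wstring out;
    for (std::size_t i = 0; s[i] != 0; ++i) {
        unsigned long cp = s[i];
        // A high surrogate followed by a low surrogate forms one code point.
        // s[i + 1] is safe to read: at worst it is the terminating zero.
        if (cp >= 0xD800 && cp <= 0xDBFF &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            ++i; // skip the low surrogate that was just consumed
        }
        out.push_back(static_cast<wchar_t>(cp));
    }
    return out;
}
```

With the array from the question, ws = widen_bmp(arr) or ws = utf16_to_wstring(arr) should both stop at the first 16-bit zero, so ws.length() would be 6.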


3

Just do it. You didn't in your code, you assigned an array of unsigned shorts to a wstring and you used a cast to shut the compiler up. wchar_t != unsigned short. You certainly can't assume they have the same size.
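
To make the size point concrete, here is a small sketch of a check you could run on your own platform. The function name cast_is_size_compatible is hypothetical, and the typedef is copied from the question. Where wchar_t is 4 bytes, the cast makes assign read two 16-bit units per character, so the 16-bit terminator alone no longer looks like a wide null character.

```cpp
#include <cstddef>

typedef unsigned short PA_Unichar; // 16-bit unit from the question

// Hypothetical check: the pointer cast in the question can only line up
// with the data if both types occupy the same number of bytes.
bool cast_is_size_compatible() {
    return sizeof(wchar_t) == sizeof(PA_Unichar);
}

// Number of 16-bit units consumed for each wchar_t read through the cast.
std::size_t units_per_wchar() {
    return sizeof(wchar_t) / sizeof(PA_Unichar);
}
```

If cast_is_size_compatible() returns false, copy the elements one by one (for example with an iterator range) instead of casting the pointer.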

0

I'd think your code would work, just by inspection. But you could always work around the trouble:

ws.assign(reinterpret_cast<const wchar_t*>(arr), wcslen(reinterpret_cast<const wchar_t*>(arr)));

1 Comment

If ws.assign can't find the proper terminating point of the string by picking out the null character, why would wcslen? I think Artyom hit the nail on the head -- wchar_t != unsigned short.
