I would like a clarification about the Shift-JIS character set. Is ASCII a subset of Shift-JIS, in the same way that it is a subset of UTF-8? If a file contains a mix of Shift-JIS and ASCII text, how can we read it using Qt codecs?
2 Answers
Is ASCII a subset of the Shift-JIS character set, as it is of UTF-8?
Not strictly: in Shift-JIS the byte 0x5C, which is the backslash in ASCII, is a yen currency symbol (¥) instead.
If a file has a mix of Shift-JIS and ASCII, how can we read it using Qt codecs?
Use QTextCodec to decode each piece properly; however, detecting how each part is encoded is up to you.
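To make the "detecting" part more concrete, here is a minimal sketch of how a Shift-JIS byte stream splits into single-byte and two-byte units. It uses only the standard library, not Qt, and it does not convert anything to Unicode. The names `SjisUnit`, `isSjisLeadByte`, `isSjisTrailByte` and `scanSjis` are made up for this illustration, and the lead-byte range includes 0xF0-0xFC on the assumption that vendor extensions such as code page 932 should be accepted.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Kinds of unit found while scanning a Shift-JIS byte stream.
// (Hypothetical names, for illustration only.)
enum class SjisUnit { SingleByteLow, HalfWidthKatakana, DoubleByte, Invalid };

// True if b can start a two-byte Shift-JIS character.
// Assumption: 0xF0-0xFC is accepted because vendor extensions use it.
bool isSjisLeadByte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// True if b can be the second byte of a two-byte character.
// Note that this range overlaps the ASCII range, including 0x5C.
bool isSjisTrailByte(unsigned char b) {
    return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
}

// Split a byte string into Shift-JIS units.
std::vector<SjisUnit> scanSjis(const std::string& bytes) {
    std::vector<SjisUnit> units;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (b <= 0x7F) {
            // Same byte values as ASCII (modulo the 0x5C / 0x7E glyphs).
            units.push_back(SjisUnit::SingleByteLow);
        } else if (b >= 0xA1 && b <= 0xDF) {
            units.push_back(SjisUnit::HalfWidthKatakana);
        } else if (isSjisLeadByte(b) && i + 1 < bytes.size() &&
                   isSjisTrailByte(static_cast<unsigned char>(bytes[i + 1]))) {
            units.push_back(SjisUnit::DoubleByte);
            ++i;  // consume the trail byte as part of this character
        } else {
            units.push_back(SjisUnit::Invalid);
        }
    }
    return units;
}
```

Two things follow from this. Plain ASCII-range text inside a Shift-JIS file is scanned as ordinary single bytes, so it does not need a separate decoder as long as the 0x5C and 0x7E interpretation is acceptable. And a byte equal to 0x5C is not always a backslash or yen sign: in the two-byte sequence 0x83 0x5C (the katakana ソ) it is a trail byte, so searching the raw bytes for 0x5C is unreliable.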
1 Comment
At least according to Wikipedia, there are multiple variants of Shift-JIS.
The original Shift-JIS was based on JIS X 0201, which is almost, but not quite, an extension of ASCII. Two codes differ: 0x5C is a backslash in ASCII but a yen sign in the original Shift-JIS, and 0x7E is a tilde in ASCII but an overline in Shift-JIS.
However, "code page 932", the Windows variant of Shift-JIS, maps the ASCII range to the ASCII Unicode code points. The HTML5 spec follows the same procedure as Windows. In turn, many Japanese fonts will have the yen and overline at positions 0x5C and 0x7E.
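The difference between the two variants described above comes down to two code points. As a rough sketch (the enum and function names here are invented for illustration and are not part of Qt or any other library), the single-byte range could be modelled like this:

```cpp
// Which interpretation of the 0x00-0x7F range to apply.
// (Hypothetical names, for illustration only.)
enum class SjisVariant { JisX0201, CodePage932 };

// Map a byte in 0x00-0x7F to a Unicode code point under the chosen variant.
// Returns U+FFFD (replacement character) for bytes outside that range,
// since those need the full Shift-JIS tables, which this sketch omits.
char32_t decodeLowByte(unsigned char b, SjisVariant variant) {
    if (b > 0x7F) {
        return 0xFFFD;
    }
    if (variant == SjisVariant::JisX0201) {
        if (b == 0x5C) return 0x00A5;  // YEN SIGN instead of backslash
        if (b == 0x7E) return 0x203E;  // OVERLINE instead of tilde
    }
    return b;  // every other value is identical to ASCII
}
```

With `SjisVariant::CodePage932` the byte 0x5C comes out as U+005C, so a Windows path such as `C:\temp` round-trips unchanged even though a Japanese font may still draw the separator as ¥.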
You may need to experiment to find the particular behaviour of whatever encoding/decoding library you use and to decide if that behaviour is appropriate for your application.