Strange utf8 decoding error in windows notepad

Question

If you type the following string into a text file encoded as UTF-8 (without BOM) and open it with notepad.exe, you will get some weird characters on screen. But Notepad decodes the string correctly if you leave off the last 'a'. Very strange behavior. I am using Windows 10 1809.

[19, 16, 12, 14, 15, 15, 12, 17, 18, 15, 14, 15, 19, 13, 20, 18, 16, 19, 14, 16, 20, 16, 18, 12, 13, 14, 15, 20, 19, 17, 14, 17, 18, 16, 13, 12, 17, 14, 16, 13, 13, 12, 15, 20, 19, 15, 19, 13, 18, 19, 17, 14, 17, 18, 12, 15, 18, 12, 19, 15, 12, 19, 18, 12, 17, 20, 14, 16, 17, 18, 15, 12, 13, 19, 18, 17, 18, 14, 19, 18, 16, 15, 18, 17, 15, 15, 19, 16, 15, 14, 19, 13, 19, 15, 17, 16, 12, 12, 18, 12, 14, 12, 16, 19, 12, 19, 12, 17, 19, 20, 19, 17, 19, 20, 16, 19, 16, 19, 16, 12, 12, 18, 19, 17, 18, 16, 12, 17, 13, 18, 20, 19, 18, 20, 14, 16, 13, 12, 12, 14, 13, 19, 17, 20, 18, 15, 12, 15, 20, 14, 16, 15, 16, 19, 20, 20, 12, 17, 13, 20, 16, 20, 13a

I wonder whether this is a Windows bug, or whether there is something I can do to solve it.

It seems like Notepad is interpreting the entire string as fixed 2-byte characters, that is, as UCS-2. The prefix '[19, 16, 12, 1' maps to ㅛ ⰹ ㄠ ⰶ ㄠ ⰲ ㄠ, so the first character is actually '[1', the second is '9,', the third is ' 1', etc. When you remove the last 'a', the byte count becomes odd, so the final byte cannot be paired into a 2-byte character. I'm sorry if the above is confusing. I only understand bits and pieces. Still trying to figure it all out.
– rmutalik, Commented Apr 29, 2019 at 2:56
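
The pairing described in this comment can be reproduced outside Notepad. The following is a minimal sketch in Python, not Notepad's own code: it takes the first 14 characters of the string from the question, encodes them as ASCII, and then deliberately misreads the bytes as UTF-16 LE. The variable names are illustrative only.

```python
# First 14 ASCII characters of the string from the question.
prefix = "[19, 16, 12, 1"
raw = prefix.encode("ascii")

# Reinterpret every pair of bytes as a single UTF-16 LE code unit.
misread = raw.decode("utf-16-le")
print(misread)  # ㅛⰹㄠⰶㄠⰲㄠ

# Show which two ASCII characters collapse into each wrong character.
for i in range(0, len(raw), 2):
    pair = raw[i:i + 2]
    print(repr(pair.decode("ascii")), "->", pair.decode("utf-16-le"))

# An even byte count pairs up completely; dropping one byte leaves a stray byte.
even_ok = len(raw) % 2 == 0
odd_leftover = len(raw[:-1]) % 2 == 1
```

The full string in the question follows the same arithmetic: each number and each ', ' separator is two bytes, so the text has an even number of bytes with the trailing 'a' and an odd number without it.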

rmutalik · Accepted Answer · 2019-04-29 16:27:00Z

Did more research; figured it out.

Seems like a variation of the classic case of "Bush hid the facts". https://en.wikipedia.org/wiki/Bush_hid_the_facts

It looks like Notepad has a different character encoding default for saving a file than it does for opening a file. Yes, this does seem like a bug.

But there is an actual explanation for what is occurring:

Notepad checks for a BOM byte sequence. If it does not find one, it has 2 options: the encoding is either UTF-16 Little Endian (without BOM) or plain ASCII. It checks for UTF-16 LE first using a function called IsTextUnicode.
IsTextUnicode runs a series of tests to guess whether the given text is Unicode or not. One of these tests is IS_TEXT_UNICODE_STATISTICS, which uses statistical analysis. If the test is true, then the given text is probably Unicode, but absolute certainty is not guaranteed.
https://learn.microsoft.com/en-us/windows/desktop/api/winbase/nf-winbase-istextunicode
If IsTextUnicode returns true, Notepad decodes the file as UTF-16 LE, producing the strange output you saw. We can confirm this with the character ㄠ. It comes from the two ASCII characters ' 1' (space, one), whose byte values are 0x20 for the space and 0x31 for the one. Because the byte order is little endian, the second byte is the high byte, so the pair is read as '1 ', giving the code unit 0x3120, that is U+3120, which you can confirm by looking up that code point.
https://unicode-table.com/en/3120/
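
The U+3120 arithmetic above can be checked with a few lines of Python. This is only a sketch of the byte-order reasoning, using the standard library and nothing specific to Notepad.

```python
import unicodedata

# The two ASCII characters ' 1' as raw bytes: 0x20 (space), 0x31 (digit one).
pair = b" 1"

# Little endian: the first byte is the low byte, the second is the high byte.
code_unit = int.from_bytes(pair, "little")
print(hex(code_unit))  # 0x3120

# Decoding the same two bytes as UTF-16 LE yields the character from the answer.
char = pair.decode("utf-16-le")
print(char)  # ㄠ
print(unicodedata.name(char))
```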

If you want to solve the issue, you need to break the pattern which helps IsTextUnicode determine if the given text is Unicode. You can insert a newline before the text to break the pattern.
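
If you want to experiment with the detection yourself, the sketch below calls IsTextUnicode from Python through ctypes. It assumes a Windows machine; on any other platform the hypothetical helper `looks_like_utf16` simply returns None. The `sample` bytes are a short stand-in, not the full string from the question, so paste in the real text to compare it with and without a leading newline. No particular result is claimed here, since the function is a heuristic.

```python
import ctypes
import sys


def looks_like_utf16(data: bytes):
    """Return IsTextUnicode's verdict for data, or None when not on Windows."""
    if sys.platform != "win32":
        return None
    is_text_unicode = ctypes.windll.advapi32.IsTextUnicode
    is_text_unicode.argtypes = [
        ctypes.c_char_p,
        ctypes.c_int,
        ctypes.POINTER(ctypes.c_int),
    ]
    is_text_unicode.restype = ctypes.c_int
    # Passing NULL as the third argument asks the function to run all its tests.
    return bool(is_text_unicode(data, len(data), None))


# Hypothetical short stand-in; replace it with the full text from the question.
sample = b"[19, 16, 12, 14a"

for label, data in (("as typed", sample), ("newline first", b"\n" + sample)):
    print(label, "->", looks_like_utf16(data))
```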

Hope that helped!
