
Let's say I have a Python 3 source file in cp1251 encoding with the following content:

# эюяьъ (some Russian comment)
print('Hehehey')

If I run the file, I'll get this:

SyntaxError: Non-UTF-8 code starting with '\xfd' in file ... on line 1 but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

That's clear and expected: I understand that, in general, a cp1251 byte sequence can't be decoded as UTF-8, which is the default source encoding in Python 3.

But if I edit the file as follows:

# coding: utf-8
# эюяьъ (some Russian comment)
print('Hehehey')  

everything will work fine.

And that is pretty confusing.
In the 2nd example the source still contains the same cp1251 byte sequence, which is not valid UTF-8, so I would expect the compiler to use the same encoding (UTF-8) to preprocess the file and terminate with the same error.
I have read PEP 263 but still don't see why that doesn't happen.

So why does my code work in the 2nd case but terminate with an error in the 1st?
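Both cases can be reproduced without any text editor involved by writing the exact bytes to files and running them. This is only a sketch: it assumes the interpreter running the script is the one being tested, and whether the second file succeeds may depend on the interpreter version (the behavior below is what the question reports).

```python
import os
import subprocess
import sys
import tempfile

# The comment bytes encoded with cp1251, plus an ASCII body.
comment = "# эюяъь (some Russian comment)\n".encode("cp1251")
body = b"print('Hehehey')\n"

def run(source_bytes):
    """Write the bytes to a temporary .py file and run it."""
    with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as f:
        f.write(source_bytes)
        path = f.name
    try:
        return subprocess.run([sys.executable, path],
                              capture_output=True, text=True)
    finally:
        os.unlink(path)

# 1st case: no declaration, so the file fails with the PEP 263 SyntaxError.
no_decl = run(comment + body)
print(no_decl.stderr.splitlines()[-1])

# 2nd case: same bytes, plus a coding declaration. On the interpreter
# from the question this runs fine and prints Hehehey.
with_decl = run(b"# coding: utf-8\n" + comment + body)
print(with_decl.returncode, with_decl.stdout)
```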


UPD.

In order to check whether my text editor is smart enough to change the file's encoding because of the line # coding: utf-8, let's look at the actual bytes:

(1st example)

23 20 fd fe ff fa fc ...

(2nd example)

23 20 63 6f 64 69 6e 67 3a 20 75 74 66 2d 38 0a
23 20 fd fe ff fa fc ...

These 0xF* bytes encode Cyrillic letters in cp1251, and none of them is valid in UTF-8.

Furthermore, if I edit the source this way:

# coding: utf-8
# эюяъь (some Russian comment)
print('Hehehey')
print('эюяъь')

I'll face the error:

SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xfd ...

So, unfortunately my text editor isn't so smart.
Thus, in the above examples the source file is not converted from cp1251 to UTF-8.
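This last result can also be reproduced without an editor, by writing the exact bytes and running the file. A sketch, assuming the current interpreter: with the declaration present, the cp1251 bytes in the comment pass, but the same bytes inside a string literal still produce a SyntaxError, because string contents must actually be decoded.

```python
import os
import subprocess
import sys
import tempfile

# Declaration, cp1251 comment, then the same cp1251 bytes in a string.
src = (b"# coding: utf-8\n"
       b"# \xfd\xfe\xff\xfa\xfc\n"
       b"print('Hehehey')\n"
       b"print('\xfd\xfe\xff\xfa\xfc')\n")

with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as f:
    f.write(src)
    path = f.name
try:
    result = subprocess.run([sys.executable, path],
                            capture_output=True, text=True)
finally:
    os.unlink(path)

print(result.returncode != 0)          # True: the string literal fails
print("SyntaxError" in result.stderr)  # True
```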

  • Which text editor or development environment do you use? Commented Oct 16, 2017 at 22:25
  • I use Sublime Text 3 Commented Oct 16, 2017 at 22:28
  • It's likely that your text editor silently does the right thing and encodes your file as UTF-8, because it's clever enough to figure out what # coding: utf-8 means in a Python file. Take a hex editor and look at the actual bytes in the file. Commented Oct 16, 2017 at 22:30
  • Hmm, it doesn't look like that. The source file is definitely in cp1251 in both examples. With or without the comment # coding: utf-8 there are non-UTF-8 bytes. For example, in both cases I have the '\xfd' byte, which is cp1251 'э' and which causes the error in the 1st example. So there must be a different explanation. Commented Oct 16, 2017 at 23:08
  • @Tomalak: Doesn't look like that's the case. I checked by making a file of the form described, then using iconv to convert it from UTF-8 to cp1251, so no editor was involved. The behavior was exactly as the OP describes: a coding: declaration, even one that just declares the implicit UTF-8 decoding explicitly, silenced the error, even when the file contained non-UTF-8 bytes, while failing to provide a coding: declaration triggered the error. This is real, not an editor artifact. Commented Oct 17, 2017 at 1:56

1 Answer


This seems to be a quirk of how the strict behavior for the default encoding is enforced. In the tokenizer function decoding_gets, if no explicit encoding declaration has been found yet (tok->encoding is still NULL), it does a character-by-character check of the line for invalid UTF-8 and raises the SyntaxError you're seeing, the one that references PEP 263.

But if an encoding has been specified, check_coding_spec will have set tok->encoding, and that strict default-encoding test is bypassed completely; it isn't replaced with a test against the declared encoding.
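The cookie-matching half of this logic is mirrored at the Python level by tokenize.detect_encoding, which can be used to see what encoding the tokenizer settles on for the bytes from the question (a sketch; the C tokenizer's internals are not exposed, so this only shows the declaration detection, not the bypass itself):

```python
import io
import tokenize

# The declaration line plus the raw cp1251 comment bytes from the question.
src = b"# coding: utf-8\n# \xfd\xfe\xff\xfa\xfc\nprint('Hehehey')\n"

# detect_encoding reads at most two lines looking for a PEP 263 cookie.
encoding, lines_read = tokenize.detect_encoding(io.BytesIO(src).readline)
print(encoding)    # utf-8
print(lines_read)  # only the declaration line was consumed
```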

Normally this would cause problems when the code is actually parsed, but comments are handled in a stripped-down way: as soon as the comment character, #, is recognized, the tokenizer just grabs and discards characters until it sees a newline or EOF. It doesn't try to do anything with them at all (which makes sense; parsing comment contents would waste time that could be spent on code that actually runs).

Thus the behavior you observe: an encoding declaration disables the strict, file-wide, character-by-character check for valid UTF-8 that applies when no encoding is declared explicitly, and comments are special-cased so that their contents are ignored, allowing garbage bytes inside comments to escape detection.


