On 64-bit Debian Linux 6:

Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxint
9223372036854775807
>>> sys.maxunicode
1114111

On 64-bit Windows 7:

Python 2.7.1 (r271:86832, Nov 27 2010, 17:19:03) [MSC v.1500 64 bit (AMD64)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxint
2147483647
>>> sys.maxunicode
65535

Both operating systems are 64-bit, and both have sys.maxunicode. According to Wikipedia, there are 1,114,112 code points in Unicode. Is sys.maxunicode on Windows wrong?

And why do they have different sys.maxint?

  • On my 32-bit machines: when using Linux, both Python 2 and Python 3 return 1114111 for sys.maxunicode; using Windows I get 65535 for sys.maxunicode. Commented Nov 17, 2011 at 9:34
  • Also, sys.maxint has disappeared from Python 3... Commented Nov 17, 2011 at 9:37
  • "Why" questions are not really well suited for StackOverflow, but perhaps @RaymondHettinger can shed some light on this. Commented Nov 17, 2011 at 10:32

2 Answers

I don't know what your question is, but sys.maxunicode is not wrong on Windows.

See the docs:

sys.maxunicode

An integer giving the largest supported code point for a Unicode character. The value of this depends on the configuration option that specifies whether Unicode characters are stored as UCS-2 or UCS-4.

Python on Windows uses UCS-2, so the largest code point is 65,535 (supplementary-plane characters are encoded as two 16-bit code units, a "surrogate pair").
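The surrogate-pair arithmetic can be sketched in a few lines. This is a minimal illustration, not the interpreter's internals; on a wide build (or any Python 3) sys.maxunicode is always 1114111, but the same pairing is visible when encoding to UTF-16:

```python
# Surrogate-pair arithmetic as defined by the Unicode standard,
# using U+10120 (a supplementary-plane character) as an example.
cp = 0x10120

offset = cp - 0x10000            # 20-bit offset into the supplementary planes
high = 0xD800 + (offset >> 10)   # high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)  # low (trail) surrogate
print(hex(high), hex(low))       # -> 0xd800 0xdd20

# The same pair appears when encoding to UTF-16:
# one character becomes two 16-bit code units (4 bytes).
units = '\U00010120'.encode('utf-16-be')
print(len(units))                # -> 4
```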

About sys.maxint: it marks the point at which Python 2 switches from plain integers (123) to long integers (12345678987654321L). Python for Windows evidently uses a 32-bit C long, while Python for Linux uses a 64-bit one. In Python 3 this has become irrelevant, because the plain and long integer types were merged into a single int type; accordingly, sys.maxint is gone from Python 3.
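The two boundaries can be checked directly. A hedged sketch: in Python 3 the closest analogue is sys.maxsize (the largest container index), and int itself has no upper bound:

```python
import sys

# The sys.maxint values shown above are the largest signed C long:
# 2**31 - 1 with a 32-bit long, 2**63 - 1 with a 64-bit long.
print(2**31 - 1)   # -> 2147483647          (the Windows build above)
print(2**63 - 1)   # -> 9223372036854775807 (the Linux build above)

# Python 3 ints are unbounded: arithmetic silently exceeds any word size.
big = 2**100
print(big > sys.maxsize)   # -> True
```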


8 Comments

I would also add that sys.maxunicode has no relation whatsoever to sys.maxint.
As I understand it, "surrogate pairs" apply only to UTF-16; UCS-2 is simply incapable of representing characters past 65535.
@TimPietzcker: I would like to add a pointer to the documentation about supplementary character planes: "Any Unicode character can be encoded [with \Uxxxxxxxx], but characters outside the Basic Multilingual Plane (BMP) will be encoded using a surrogate pair if Python is compiled to use 16-bit code units (the default). Individual code units which form parts of a surrogate pair can be encoded using this escape sequence." (docs.python.org/reference/lexical_analysis.html#string-literals).
@KeithThompson: it looks like Python can encode characters outside of the Basic Multilingual Plane (BMP) even when it has sys.maxunicode==65535: print repr(u"\U00010120") correctly returns the original input string representation. So it looks like Python is using UCS-2 internally, with a convention that allows it to represent characters outside of the BMP. In fact, if you look at the internal representation with u"\U00010120".encode('unicode_internal').encode('hex'), you see that Python uses the special code unit 0xd800, which lies in the surrogate range (U+D800–U+DFFF) and is guaranteed never to be assigned to a character.
Is UCS-2 "with a convention that allows it to represent characters outside the BMP" just a way to describe UTF-16, or does Python's convention differ from UTF-16?

Regarding the difference in sys.maxint, see What is the bit size of long on 64-bit Windows?. Python 2.x uses the C long type internally to store a small integer.
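You can inspect the platform's C long width with ctypes. A small sketch: the result is 4 bytes on 64-bit Windows (LLP64) and typically 8 on 64-bit Linux (LP64), which is exactly why Python 2's sys.maxint differed between them:

```python
from ctypes import c_long, sizeof

# Width of the platform's C long, which backed Python 2's plain int type.
bits = sizeof(c_long) * 8 - 1   # one bit is reserved for the sign
print(2**bits - 1)              # matches sys.maxint on that platform's Python 2
```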

