unicode error with codecs when reading a pdf file in python

Question

I am trying to read a PDF file with the following content:

%PDF-1.4\n%âãÏÓ

If I read it with open, it works, but if I use codecs.open(filename, encoding="utf8", mode="rb") to get a unicode string, I get the following exception:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Do you know a way to get a unicode string from the content of this file?

PS: I am using python 2.7

user149341 · Accepted Answer · 2013-06-18 05:53:30Z

9

PDFs are made of binary data, not text. They cannot be meaningfully represented as Unicode strings.

For what it's worth, you can get a Unicode string containing those particular characters by treating the PDF as ISO8859-1 text:

f = codecs.open(filename, encoding="ISO8859-1", mode="rb")

But at that point, you're better off just using normal open and reading bytes. Unicode is for text, not data.
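As a rough sketch of the "just read bytes" advice, the snippet below writes a small sample header to a temporary file, reads it back in binary mode, and then shows that an ISO 8859-1 (Latin-1) decode is only a reversible byte-to-code-point mapping. SAMPLE_HEADER, write_sample and read_header are hypothetical names made up for this illustration, and io.open is used on the assumption that the code should run on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile

# Hypothetical sample: the version line from the question, followed by the
# comment line of high-bit bytes that marks the file as binary.
SAMPLE_HEADER = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"


def write_sample(data=SAMPLE_HEADER):
    """Write `data` to a temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp(suffix=".pdf")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


def read_header(path, size=15):
    """Return the first `size` bytes of the file, without decoding them."""
    with io.open(path, "rb") as f:
        return f.read(size)


path = write_sample()
try:
    raw = read_header(path)
finally:
    os.remove(path)

# Latin-1 maps every byte 0x00-0xFF to the code point with the same number,
# so this decode cannot fail and can always be reversed.
as_text = raw.decode("latin-1")
round_trip = as_text.encode("latin-1")
```

The Latin-1 string carries no more information than the bytes it came from, which is why working with raw directly is usually the simpler choice.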

answered Jun 18, 2013 at 5:53

user149341


3 Comments

trez Over a year ago

What I don't understand is why you can open it with open (i.e. as an ASCII string) but not as a unicode one, while u"\xe2" is still a valid string.

user149341 Over a year ago

You were trying to read the file in as a UTF-8 string. Only certain sequences of bytes are valid UTF-8 data; the one in the header of the PDF is not valid.
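A small sketch of why that header is rejected: in UTF-8, 0xe2 announces a three-byte sequence, so the bytes that follow it must be continuation bytes of the form 10xxxxxx (0x80 to 0xBF). The next byte here is 0xe3, which does not fit. The data value is the header from the question written out by hand, and is_continuation is a hypothetical helper used only for this illustration.

```python
# The header bytes from the question, written out by hand.
data = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3"

try:
    data.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc  # exc.start is the offset of the offending byte


def is_continuation(byte_value):
    """True if the byte has the bit pattern 10xxxxxx."""
    return byte_value & 0xC0 == 0x80


# bytearray yields integers on both Python 2.7 and Python 3.
following_byte = bytearray(data)[11]

# For contrast: 0xe2 followed by two continuation bytes is valid UTF-8
# (this particular sequence encodes U+2026, the horizontal ellipsis).
valid = b"\xe2\x80\xa6"
decoded = valid.decode("utf-8")
```

Here error.start points at the 0xe2 byte at offset 10, matching the position reported in the question's traceback.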

Cairnarvon Over a year ago

@trez The \xe2 in '\xe2' (string) and u"\xe2" (unicode string) don't mean the same thing. In the former, it's a literal byte. In the latter, it's the Unicode code point. It just happens to be the case that for â, the Latin-1 representation is the byte \xe2, and the Unicode codepoint U+00E2. It's (for practical purposes) just a coincidence.
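To make the byte versus code point distinction concrete, here is a minimal sketch that encodes the single character â with two codecs. The variable names are only illustrative.

```python
char = u"\u00e2"  # the character â, code point U+00E2

# Latin-1 stores each code point below 256 as the byte with the same value.
latin1_bytes = char.encode("latin-1")

# UTF-8 needs two bytes for any code point from U+0080 to U+07FF.
utf8_bytes = char.encode("utf-8")

code_point = ord(char)
```

The code point is 0xE2 in both cases, but only the Latin-1 encoding produces the single byte \xe2. The UTF-8 encoding of the same character is the two bytes \xc3\xa2.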

Cairnarvon · Accepted Answer · 2013-06-18 05:55:55Z

1

The issue of trying to interpret arbitrary binary data as text aside, 0xe2 is â in Latin-1, not UTF-8. You're using the wrong codec.

answered Jun 18, 2013 at 5:55

Cairnarvon


2 Comments

trez Over a year ago

Not really. My editor thinks it's Latin-1 and displays it as â, but in fact that's only the byte value 0xe2.

mkl Over a year ago

Editors do best guesses based on normally found text file contents. PDF is not a text file. Thus, your editor's guess is meaningless.
