3

I am trying to read a PDF file with the following content:

%PDF-1.4\n%âãÏÓ

If I read it with open, it works, but if I use codecs.open(filename, encoding="utf8", mode="rb") to get a unicode string, I get the following exception:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Do you know a way to get a unicode string from the content of this file?

PS: I am using Python 2.7.

2 Answers

9

PDFs are made of binary data, not text. They cannot be meaningfully represented as Unicode strings.

For what it's worth, you can get a Unicode string containing those particular characters by treating the PDF as ISO8859-1 text:

f = codecs.open(filename, encoding="ISO8859-1", mode="rb")

But at that point, you're better off just using normal open and reading bytes. Unicode is for text, not data.
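As a minimal sketch of that advice, the snippet below writes the header bytes from the question to a temporary file, reads them back with a plain binary open, and checks the PDF signature on the raw bytes. The names SAMPLE_HEADER, write_sample, and read_pdf_bytes are hypothetical helpers made up for this illustration. It also shows that a Latin-1 decode, if you really want a Unicode string, is lossless because every byte maps to exactly one code point.

```python
import os
import tempfile

# Header bytes from the question: "%PDF-1.4\n%" followed by four high bytes.
SAMPLE_HEADER = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"


def write_sample(data):
    """Write data to a temporary file and return its path (illustration only)."""
    fd, path = tempfile.mkstemp(suffix=".pdf")
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path


def read_pdf_bytes(path):
    """Return the raw file contents as a byte string, with no decoding."""
    with open(path, "rb") as handle:
        return handle.read()


sample_path = write_sample(SAMPLE_HEADER)
try:
    raw = read_pdf_bytes(sample_path)
finally:
    os.remove(sample_path)

# Work on bytes directly: the PDF signature check needs no text decoding.
is_pdf = raw.startswith(b"%PDF-")

# Latin-1 maps each byte 0x00-0xFF to the code point with the same number,
# so decoding and re-encoding returns the original bytes unchanged.
as_text = raw.decode("latin-1")
round_trip = as_text.encode("latin-1")
```

The Latin-1 string is only a container for the bytes here; it carries no more meaning than the byte string itself.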


3 Comments

What I don't understand is why you can open it with open (i.e. as an ASCII string) but not as a unicode one, while u"\xe2" is still a valid string.
You were trying to read the file in as a UTF-8 string. Only certain sequences of bytes are valid UTF-8 data; the one in the header of the PDF is not valid.
@trez The \xe2 in '\xe2' (string) and u"\xe2" (unicode string) don't mean the same thing. In the former, it's a literal byte. In the latter, it's the Unicode code point. For â, the Latin-1 representation is the byte \xe2 and the Unicode code point is U+00E2. The numbers match only because the first 256 Unicode code points were defined to line up with Latin-1; in other encodings, such as UTF-8, the bytes differ from the code point.
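To make the byte versus code point distinction concrete, here is a small sketch (variable names are chosen for this illustration) that encodes the single character u"\xe2" in two encodings. It also encodes the euro sign as one example of a character whose UTF-8 form starts with the byte 0xE2, which is why a lone 0xE2 followed by an unrelated byte is not valid UTF-8.

```python
# Works on Python 2.7 and Python 3.
text = u"\xe2"                          # one code point, U+00E2 (â)
code_point = ord(text)                  # 226, i.e. 0xE2

latin1_bytes = text.encode("latin-1")   # one byte: 0xE2
utf8_bytes = text.encode("utf-8")       # two bytes: 0xC3 0xA2

# In UTF-8, 0xE2 is the first byte of a three-byte sequence and must be
# followed by two continuation bytes, as in the euro sign U+20AC.
euro_utf8 = u"\u20ac".encode("utf-8")   # three bytes: 0xE2 0x82 0xAC
```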
1

The issue of trying to interpret arbitrary binary data as text aside, 0xe2 is â in Latin-1, not UTF-8. You're using the wrong codec.
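The difference between the two codecs can be checked on the header bytes from the question without touching a file. This is only a sketch, and the variable names are made up for the illustration: the UTF-8 decode fails at the byte 0xE2, while the Latin-1 decode succeeds because it accepts every byte value.

```python
# Header bytes from the question, as a byte string.
header = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3"

# UTF-8 rejects the sequence: 0xE2 announces a multi-byte character,
# but the next byte (0xE3) is not a valid continuation byte.
try:
    header.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# Latin-1 never fails: each byte becomes the code point with the same value.
latin1_text = header.decode("latin-1")
```

The start attribute of the caught exception holds the offset of the offending byte, which corresponds to the position reported in the traceback from the question.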

2 Comments

Not really: my editor thinks it's Latin-1 and displays it as â, but in fact it's only the value 0xe2.
Editors make best guesses based on the contents normally found in text files. A PDF is not a text file, so your editor's guess is meaningless.
