How to handle encodings using Python Requests library

Question

I've struggled with encodings for far too long, and today I want to break the mental block wide open.

Right now, I'm using Requests to scrape a bunch of websites, and from what I can tell it uses the HTTP headers to figure out which encoding each page uses, falling back to chardet when the site's headers are missing. From there, it decodes the bytes it downloads, and then helpfully hands me a unicode object in r.text.

All good.

But where I'm confused is that from there I do some work on the text and then print it out to stdout, providing an encoding when I print:

 print foo.encode('utf-8')

The problem is that when I do that, the thing that's printed is messed up. In the following, I expect to get an emdash between the word 'judgments' and 'Standard':

 Declaratory judgmentsStandard of review.

Instead, I get the boxy thing with the four tiny numbers in it. It doesn't seem to show up here, of course, but I think the numbers are 0097, which corresponds to what I get if I do:

 repr(foo)
 u'Declaratory judgments\x97Standard of review.'

So that kind of makes sense, but where's my emdash?

The process boils down to:

1. Requests downloads a page and intelligently decodes the text to a unicode object.
2. I work with it.
3. I encode it to utf-8 and print it out.

Where's the problem? This sounds like the mythical unicode sandwich to me, but clearly I'm missing something.

Ned Batchelder · Accepted Answer · 2012-07-21 00:40:55Z

You are doing something odd. \x97 is an emdash in the cp1252 encoding. In a Unicode string, it's U+0097 END OF GUARDED AREA. Somehow, you are reading cp1252 bytes as Unicode. Show more of the code that got you to this state, and we can dig deeper.
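A minimal sketch of the diagnosis above, written for Python 3 (where `str` plays the role of the Python 2 `unicode` object). The sample bytes are taken from the question; `repair_mojibake` is a hypothetical helper name, and it assumes the text contains only characters that came from a latin-1 decode.

```python
# The raw bytes as the server presumably sent them (0x97 is the suspect byte).
RAW = b"Declaratory judgments\x97Standard of review."

# latin-1 maps every byte to the code point with the same number,
# so 0x97 becomes the control character U+0097.
as_latin1 = RAW.decode("latin-1")

# cp1252 maps 0x97 to U+2014 EM DASH, which is what the author intended.
as_cp1252 = RAW.decode("cp1252")


def repair_mojibake(text):
    """Undo a latin-1 decode of bytes that were really cp1252.

    Hypothetical helper: raises UnicodeEncodeError if the text contains
    characters above U+00FF, and UnicodeDecodeError for the few bytes
    that cp1252 leaves undefined.
    """
    return text.encode("latin-1").decode("cp1252")


print(repr(as_latin1))
print(repr(repair_mojibake(as_latin1)))
```

Re-encoding to latin-1 recovers the original bytes exactly, which is why the round trip can repair text that has already been decoded the wrong way.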

PS: the Unicode sandwich is hardly mythical, it is an ideal to strive for! :)
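For reference, the sandwich can be sketched as follows: bytes are decoded once at the input edge, all work happens on text, and encoding happens once at the output edge. This is an illustrative Python 3 sketch; `process_page` and the whitespace cleanup in the middle are made-up stand-ins for whatever work the scraper actually does.

```python
def process_page(raw_bytes, source_encoding):
    """Decode at the edge, work on text, encode at the edge."""
    # Bottom slice of bread: bytes -> text, using the *correct* encoding.
    text = raw_bytes.decode(source_encoding)

    # Filling: work only with text (here, collapsing runs of whitespace).
    text = " ".join(text.split())

    # Top slice of bread: text -> bytes, right before output.
    return text.encode("utf-8")


page = b"Declaratory  judgments\x97Standard of   review."
print(process_page(page, "cp1252"))
```

The sandwich only works if the first decode uses the right encoding; with `"latin-1"` instead of `"cp1252"` the same code runs without error but emits the control character from the question.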

edited Jul 21, 2012 at 0:40

answered Jul 21, 2012 at 0:27

Comments

mlissner Over a year ago

Yup! You nailed it. The page is the problem. It doesn't declare an encoding, so chardet recognizes it as 'ISO-8859-1', and Requests decodes it as such. Then, when I encode it as utf-8, of course that fails too. How did you know this off the top of your head? I want to avoid this in the future.

mlissner Over a year ago

Oh, and another question...Firefox and Chrome detect this page as iso-8859-1 too...yet they display the emdashes perfectly! What's their trick?

Ned Batchelder Over a year ago

Long experience tells me that a character like an emdash encoded at \x9X is probably cp1252. Looking it up on Wikipedia confirmed that cp1252 maps the byte you showed to the character you expected. cp1252 is effectively a superset of iso8859-1: it puts printable characters in the \x80-\x9F range, where iso8859-1 has only control characters. So when browsers say they are using 8859-1, they actually use cp1252 because why not, it just makes more characters printable.
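The relationship between the two encodings can be checked byte by byte. The sketch below (Python 3; `compare_latin1_cp1252` is an illustrative name) decodes each of the 256 possible byte values both ways. Note one caveat to the "superset" description: Python's cp1252 codec leaves a handful of bytes undefined, so decoding them raises an error rather than producing a character.

```python
def compare_latin1_cp1252():
    """Classify every byte value by how the two codecs decode it."""
    same, differ, undefined = [], [], []
    for value in range(256):
        raw = bytes([value])
        latin1_char = raw.decode("latin-1")  # never fails: all 256 bytes map
        try:
            cp1252_char = raw.decode("cp1252")
        except UnicodeDecodeError:
            undefined.append(value)  # byte has no cp1252 mapping in Python
            continue
        if cp1252_char == latin1_char:
            same.append(value)
        else:
            differ.append(value)
    return same, differ, undefined


same, differ, undefined = compare_latin1_cp1252()
print("identical:", len(same))
print("different:", [hex(b) for b in differ])
print("undefined in cp1252:", [hex(b) for b in undefined])
```

Every byte that differs or is undefined falls in the 0x80 to 0x9F range, which is why a stray \x9X character in decoded text is a strong hint that the source was really cp1252.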

mlissner Over a year ago

Makes sense. So this sounds like a bug in either chardet or Requests. Seems like they could just have a catch that says whenever they see iso-8859-1, they should assume it's actually cp1252?
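The catch suggested here can be applied on the caller's side without patching either library. The following is a sketch under stated assumptions: it relies only on a Requests response exposing a writable `encoding` attribute and a `text` property that decodes with it. `text_with_cp1252_fallback` and `StubResponse` are hypothetical names; the stub exists only so the example runs without a network connection.

```python
# Labels that will be reinterpreted as cp1252 (an assumed, non-exhaustive set).
LATIN1_LABELS = {"iso-8859-1", "iso8859-1", "latin-1", "latin1"}


def text_with_cp1252_fallback(response):
    """Return response.text, treating ISO-8859-1 as cp1252 like browsers do."""
    if (response.encoding or "").lower() in LATIN1_LABELS:
        # Must be set before .text is read, since .text decodes on access.
        response.encoding = "cp1252"
    return response.text


class StubResponse:
    """Hypothetical stand-in with the two attributes used above."""

    def __init__(self, content, encoding):
        self.content = content
        self.encoding = encoding

    @property
    def text(self):
        # Replace undecodable bytes instead of raising.
        return self.content.decode(self.encoding or "utf-8", errors="replace")


stub = StubResponse(b"Declaratory judgments\x97Standard of review.", "ISO-8859-1")
print(repr(text_with_cp1252_fallback(stub)))
```

With a real response the call would look like `text_with_cp1252_fallback(requests.get(url))`. The trade-off is that a page which genuinely uses the C1 control characters of ISO-8859-1 would be misread, but such pages are rare in practice.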

Ned Batchelder Over a year ago

I have no experience with chardet. I would expect that if there's an \x97 byte in the input, chardet would give you cp1252 as the result. You are right that the page seems to have no encoding declaration at all, though it says it's served by IIS, so that's a clue, though obscure!
