
I need to fetch a page's source (HTML) and convert it to UTF-8, because I want to search the page for some text (for example, if 'my_same_text' in page_source: then ...). The page contains Russian text (Cyrillic characters) and this tag:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">

I use Flask and the Python requests library. I send the request with source = requests.get('url/') and then check:

if 'сyrillic symbols' in source.text: ...

but I can't find my text, apparently because of the encoding. How can I convert the text to UTF-8? I tried .encode() and .decode(), but that did not help.

2 Answers


Let's create a page with a windows-1251 charset declared in the meta tag and some Russian nonsense text. To be sure, I saved it in Sublime Text as a windows-1251 file.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
 <head>
  <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
 </head>
 <body>
  <p>Привет, мир!</p>
 </body>
</html>

You can use a little trick in the requests library:

If you change the encoding, Requests will use the new value of r.encoding whenever you call r.text.

So it goes like this:

In [1]: import requests

In [2]: result = requests.get('http://127.0.0.1:1234/1251.html')

In [3]: result.encoding = 'windows-1251'

In [4]: u'Привет' in result.text
Out[4]: True

Voila!
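
To see why overriding the encoding helps, here is a minimal sketch that needs no network access. It assumes the server sends the page as windows-1251 bytes and that the client first decodes them as ISO-8859-1; the variable names are made up for illustration.

```python
# Simulate the raw body a server would send: the page text encoded as windows-1251.
raw_body = '<p>Привет, мир!</p>'.encode('windows-1251')

# Decoding with the wrong codec does not fail, it silently produces mojibake.
wrong_text = raw_body.decode('iso-8859-1')

# Decoding with the codec the page was actually saved in recovers the text.
right_text = raw_body.decode('windows-1251')

found_in_wrong = 'Привет' in wrong_text
found_in_right = 'Привет' in right_text
print(found_in_wrong, found_in_right)  # False True
```

Setting result.encoding before reading result.text corresponds to choosing the second decode instead of the first.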

If it doesn't work for you, there's a slightly uglier approach.

You should check which encoding requests has picked for the response.

It may be ISO-8859-1 (close to cp1252, but not identical to it) or something else entirely, neither UTF-8 nor cp1251. requests falls back to ISO-8859-1 when the Content-Type header says text/* but declares no charset, so the result depends on the web server!

In [1]: import requests

In [2]: result = requests.get('http://127.0.0.1:1234/1251.html')

In [3]: result.encoding
Out[3]: 'ISO-8859-1'

So we should recode it accordingly.

In [4]: u'Привет'.encode('cp1251').decode('iso-8859-1') in result.text
Out[4]: True

But that just looks ugly to me (also, I suck at encodings, and it's not really the best solution at all). I'd go with resetting the encoding through requests itself.
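
If you don't want to hard-code the charset, one possible approach is to read it from the page's own meta tag and decode the raw bytes yourself. The sketch below is only an illustration under that assumption: decode_html and META_CHARSET are hypothetical names, not part of requests, and the regular expression is a rough match rather than a full HTML parser.

```python
import re

# Rough match for charset=... inside a <meta> tag. It runs on raw bytes,
# so the page does not have to be decoded before the charset is known.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def decode_html(raw, fallback='utf-8'):
    """Decode raw HTML bytes using the charset declared in a <meta> tag, if any."""
    match = META_CHARSET.search(raw[:2048])  # the declaration is expected near the top
    encoding = match.group(1).decode('ascii') if match else fallback
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or undecodable bytes: use the fallback, replacing bad bytes.
        return raw.decode(fallback, errors='replace')


# Stand-in for the bytes a server would return for the test page.
page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=windows-1251"></head>'
        '<body><p>Привет, мир!</p></body></html>').encode('windows-1251')

text = decode_html(page)
print('Привет' in text)  # True
```

With requests, the raw bytes are available as result.content, so the call would presumably look like decode_html(result.content).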



As documented, requests automatically decodes response.text to Unicode, so you must either search for a Unicode string:

if u'cyrillic symbols' in source.text:
    # ...

or encode response.text to a byte string in the appropriate encoding (this works in Python 2, where a plain string literal is a byte string):

# -*- coding: utf-8 -*-
# (....)
if 'cyrillic symbols' in source.text.encode("utf-8"):
    # ...

The first solution is much simpler and lighter.
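
For readers on Python 3, here is a small sketch of the same two options. It assumes the text has already been decoded correctly, and the sample string simply stands in for response.text. In Python 3 every string literal is already Unicode, and mixing str with bytes in a membership test raises an error instead of returning False.

```python
text = 'Привет, мир!'  # stands in for response.text, an already decoded str

# Option 1: compare str with str.
str_match = 'Привет' in text

# Option 2: compare bytes with bytes, encoding both sides the same way.
encoded = text.encode('utf-8')
bytes_match = 'Привет'.encode('utf-8') in encoded

# Mixing the two types is an error in Python 3.
try:
    'Привет' in encoded
    mixed_error = None
except TypeError as exc:
    mixed_error = type(exc).__name__

print(str_match, bytes_match, mixed_error)  # True True TypeError
```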
