I have a crawler that parses the HTML of a given site and prints parts of the source code. Here is my script:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import urllib.request
import re

class Crawler:

    headers = {'User-Agent' : 'Mozilla/5.0'}
    keyword = 'arroz'

    def extra(self):
        url = "http://buscando.extra.com.br/search?w=" + self.keyword
        r = requests.head(url, allow_redirects=True)    
        print(r.url)
        html = urllib.request.urlopen(urllib.request.Request(url, None, self.headers)).read()
        soup = BeautifulSoup(html, 'html.parser')
        return soup.encode('utf-8')

    def __init__(self):
        extra = self.extra()
        print(extra)

Crawler()

My code works fine, but it prints the source with an annoying b' before the value. I already tried to use decode('utf-8') but it didn't work. Any ideas?

UPDATE

If I don't use the encode('utf-8') I have the following error:

Traceback (most recent call last):
  File "crawler.py", line 25, in <module>
    Crawler()
  File "crawler.py", line 23, in __init__
    print(extra)
  File "c:\Python34\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2030' in position
13345: character maps to <undefined>
  • So why are you using encode here? Try just return soup. Commented Nov 1, 2015 at 3:05
  • Without this it returns the following error: Traceback (most recent call last): File "crawler.py", line 25, in <module> Crawler() File "crawler.py", line 23, in __init__ print(extra) File "c:\Python34\lib\encodings\cp850.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_map)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u2030' in position 13345: character maps to <undefined> Commented Nov 1, 2015 at 3:07
  • bytes doesn't have an encode method in Python 3, so you are starting with a string and converting it to a byte string. That is why print shows the b' prefix. Commented Nov 1, 2015 at 3:07
  • @chucksmash Using encode('utf-8') I've stopped the error above. But I've created a new one: UnicodeEncodeError: 'charmap' codec can't encode character '\u2030' in position 13345: character maps to <undefined>. Commented Nov 1, 2015 at 3:09

1 Answer

When I run your code as given except replacing return soup.encode('utf-8') with return soup, it works fine. My environment:

  • OS: Ubuntu 15.10
  • Python: 3.4.3
  • python3 dist-packages bs4 version: beautifulsoup4==4.3.2

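This leads me to suspect that the problem lies with your environment, not your code. Your stack trace mentions cp850.py and your source is hitting a .com.br site, which makes me think that the default encoding of your shell can't handle some of the Unicode characters on the page. cp850 (Code Page 850, see its Wikipedia page) is a legacy DOS code page for Western European languages, and it has no mapping for '\u2030' (the per mille sign), the character named in your traceback.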

You can check the default encoding your terminal is using with:

>>> import sys
>>> sys.stdout.encoding

My terminal responds with:

'UTF-8'

I'm assuming yours won't and that this is the root of the issue you are running into.

EDIT:

In fact, I can exactly replicate your error with:

>>> print("\u2030".encode("cp850"))

So that's the issue: because of your computer's locale settings, print is implicitly encoding the text to your console's default encoding (cp850) and raising the UnicodeEncodeError.
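If you only need the script to stop crashing, one possible workaround is to replace the characters your console can't represent before printing. This is a minimal sketch, not a proper fix: the helper name printable is my own invention, and it loses information because every unsupported character becomes a question mark.

```python
import sys

def printable(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'."""
    # Hypothetical helper: default to the console encoding, or UTF-8 if
    # stdout does not report one.
    encoding = encoding or sys.stdout.encoding or "utf-8"
    # errors="replace" substitutes '?' instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="replace").decode(encoding)

# cp850 is passed explicitly here so the result is the same on any machine.
sample = printable("10\u2030", "cp850")
print(sample)
```

In your script that would be something like print(printable(str(extra))), with extra() returning soup instead of soup.encode('utf-8').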

Updating Windows to display Unicode characters from the command prompt is a bit outside my wheelhouse, so I can't offer any advice other than to direct you to a relevant question/answer.
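Independent of the console settings, another option you could try is to write UTF-8 bytes yourself instead of letting print pick the console encoding. The sketch below is an assumption on my part rather than something I have tested on Windows: utf8_writer is a made-up helper name, and I use an in-memory io.BytesIO so the example runs anywhere. In a real script you would pass sys.stdout.buffer instead.

```python
import io

def utf8_writer(binary_stream):
    """Wrap a binary stream in a text writer that always encodes as UTF-8."""
    # Hypothetical helper; for console output pass sys.stdout.buffer.
    return io.TextIOWrapper(binary_stream, encoding="utf-8", errors="replace")

# Stand-in for sys.stdout.buffer so the example has no side effects.
buffer = io.BytesIO()
out = utf8_writer(buffer)
print("10\u2030", file=out)
out.flush()

raw = buffer.getvalue()  # the UTF-8 bytes that would reach the terminal or file
```

This avoids the UnicodeEncodeError, but a cp850 console will still render those bytes as the wrong glyphs, so it is mostly useful when you redirect the output to a file (python crawler.py > out.html). Setting the PYTHONIOENCODING environment variable to utf-8 before starting Python should have a similar effect without code changes.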

2 Comments

Yeah! It returned cp850. What should I do?
I've tried the link solution, without success... thanks tho.
